SightX: Architecture Assembly — Freezing Layers and the 65% Paradox
Day 10 & 11
They say a neural network is only as good as its architecture. That is technically true, but when you are doing transfer learning, the architecture is already mostly built; you are just rewiring the final layer and deciding which parts are allowed to adapt. This phase was about loading ResNet50, surgically freezing most of its layers, and replacing the classification head with one that outputs 5 DR grades instead of 1000 ImageNet classes.
And then spending twenty minutes debugging an error that turned out to be a stale Jupyter kernel. Time well spent.
Why ResNet50
ResNet50 is not the newest model. It is not the most accurate on benchmarks. But it is proven, well-documented, and already trained on 1.2 million images. It knows how to detect edges, textures, shapes, and high-level visual patterns. For a project with 35,000 retinal scans, starting from these pre-trained weights is vastly more efficient than training from scratch.
The "50" refers to depth: 50 weighted layers (49 convolutional plus one fully connected) stacked with residual connections. Those connections are what make deep networks trainable. Without them, gradients tend to vanish as depth grows and learning stalls. With them, each block only has to learn a small correction on top of what the previous block already figured out.
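To make the "small corrections" idea concrete, here is a minimal sketch of a residual block in PyTorch. It is a simplified toy, not ResNet50's actual bottleneck block (which uses 1x1 and 3x3 convolutions and applies ReLU after the addition), and the class name ResidualBlock is made up for illustration.

```python
import torch
from torch import nn


class ResidualBlock(nn.Module):
    """Toy residual block: output = x + F(x). Simplified for illustration."""

    def __init__(self, channels: int):
        super().__init__()
        # F(x): the "correction" the block learns on top of its input.
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The skip connection adds the input back, so gradients have a
        # direct path around the convolutions.
        return x + self.body(x)
```

If the last convolution's weights are all zero, the block reduces to the identity, which is why stacking many such blocks does not make the signal harder to pass through.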
The Freezing Strategy: Selective Training
Not all layers needed to be retrained. Early layers in ResNet50 detect universal features: edges, corners, textures. These apply to any visual task, including retinal scans. Freezing these layers preserves that knowledge and prevents wasting time re-learning basic feature detection.
The final convolutional block, layer4, was unfrozen. This is where ResNet50 learns high-level, task-specific patterns. For ImageNet, that might be fur texture or wheel shapes. For retinal scans, those features need to become detectors for microaneurysms and neovascularization. Unfreezing layer4 lets the model adapt its deepest features to the new domain.
The result: layers 1-3 stay locked, layer4 adapts, and training is faster with less overfitting risk.
Replacing the Classification Head
ResNet50's original purpose was classifying 1000 ImageNet categories. The final layer transforms 2048 feature dimensions into 1000 outputs. For SightX, that needed to become 5 outputs, one per DR severity grade.
Instead of a direct transformation from 2048 to 5, I added an intermediate bottleneck: compress to 512 dimensions, apply activation, add dropout for regularization, then output 5 logits. The intermediate layer gives the model more representational capacity. Dropout forces it to learn redundant features, which prevents over-reliance on any single pattern.
This classification head is trained from scratch. Everything else either stays frozen or fine-tunes from pre-trained weights.
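The head described above could look roughly like this. The post does not name the activation or the dropout rate, so ReLU and p=0.5 are assumptions, and make_head is a hypothetical helper name.

```python
import torch
from torch import nn


def make_head(in_features: int = 2048, hidden: int = 512,
              num_classes: int = 5, dropout: float = 0.5) -> nn.Sequential:
    """Bottleneck head: 2048 -> 512 -> 5 logits (activation and p are assumed)."""
    return nn.Sequential(
        nn.Linear(in_features, hidden),   # compress to the bottleneck
        nn.ReLU(),                        # assumed activation
        nn.Dropout(p=dropout),            # assumed dropout rate
        nn.Linear(hidden, num_classes),   # one logit per DR grade
    )


head = make_head()
features = torch.randn(4, 2048)   # stand-in for pooled backbone features
logits = head(features)
print(logits.shape)
```

With a torchvision ResNet50, attaching it would be a single assignment, model.fc = make_head(), since fc is the attribute that holds the original 1000-class layer.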
The Architecture Test: Proving It Works
Before writing any training code, I verified the architecture by passing a random tensor through the model. Input shape: 1 batch, 3 color channels, 224×224 pixels. Expected output: 1 batch, 5 logits. Actual output: exactly that.
The forward pass executed without errors. The model accepts the preprocessed input format and produces the correct output shape. This 30-second sanity check catches architectural bugs that would otherwise surface hours into training.
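The check itself is only a few lines. Here is a sketch of it as a reusable function; check_output_shape is a hypothetical name, and the tiny stand_in network below is only there so the snippet runs without downloading ResNet50.

```python
import torch
from torch import nn


def check_output_shape(model: nn.Module, input_shape=(1, 3, 224, 224),
                       expected=(1, 5)) -> torch.Size:
    """Push one random batch through the model and verify the output shape."""
    model.eval()  # leaves the model in eval mode; call model.train() before training
    with torch.no_grad():
        out = model(torch.randn(*input_shape))
    assert tuple(out.shape) == tuple(expected), (
        f"got {tuple(out.shape)}, expected {tuple(expected)}"
    )
    return out.shape


# Stand-in model with the same input/output contract as the real one.
stand_in = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 5),
)
shape = check_output_shape(stand_in)
```

Passing the real model in place of stand_in gives the same test described above.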
The 65% Trainable Parameters: Not a Mistake
The parameter audit showed something unexpected:
Total parameters: 24.5 million
Trainable parameters: 16 million
Percent trainable: 65.21%
Most transfer learning guides recommend freezing the majority of the network and training only a small head. A typical trainable percentage is 5-10%. Here, it is 65%. That sounds wrong.
It is not.
Unfreezing layer4 accounts for a massive number of parameters. ResNet50's final residual block contains multiple convolutional layers with large filter banks. All of those weights are now trainable. Add the custom classification head, and the trainable count hits 16 million.
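The numbers can be reproduced with plain arithmetic. The sketch below assumes torchvision's ResNet50 layout (bias-free convolutions, BatchNorm with a weight and a bias per channel, 25,557,032 parameters in the stock 1000-class model) and a head made of Linear(2048, 512) and Linear(512, 5) with biases. Under those assumptions it lands on the same 65.21% as the audit.

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    return c_in * c_out * k * k          # ResNet convolutions have no bias


def bn_params(channels: int) -> int:
    return 2 * channels                  # learnable weight and bias


def bottleneck_params(c_in: int, width: int, downsample: bool) -> int:
    c_out = 4 * width
    n = (conv_params(c_in, width, 1) + bn_params(width)      # 1x1 reduce
         + conv_params(width, width, 3) + bn_params(width)   # 3x3
         + conv_params(width, c_out, 1) + bn_params(c_out))  # 1x1 expand
    if downsample:                       # projection on the skip path
        n += conv_params(c_in, c_out, 1) + bn_params(c_out)
    return n


# layer4: three bottleneck blocks, the first with a projection shortcut.
layer4 = bottleneck_params(1024, 512, True) + 2 * bottleneck_params(2048, 512, False)

head = (2048 * 512 + 512) + (512 * 5 + 5)   # assumed head: two Linear layers
RESNET50_TOTAL = 25_557_032                 # stock torchvision ResNet50
ORIGINAL_FC = 2048 * 1000 + 1000            # the 1000-class layer being replaced

total = RESNET50_TOTAL - ORIGINAL_FC + head
trainable = layer4 + head
percent = 100 * trainable / total

print(f"layer4:    {layer4:,}")
print(f"head:      {head:,}")
print(f"total:     {total:,}")
print(f"trainable: {trainable:,} ({percent:.2f}%)")
```

Almost all of the trainable count comes from layer4; the head contributes only about one million parameters.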
The decision was deliberate. The dataset is large enough to support fine-tuning this many parameters without catastrophic overfitting, especially with augmentation and dropout already in place. Freezing layer4 would work, but it would limit the model's ability to learn domain-specific retinal patterns.
The 65% is high, but justified.
The Stale Kernel Incident
At one point, the model threw an error claiming the forward method was missing. It was not missing. I checked the file. I restarted the kernel. I questioned my entire understanding of Python classes.
The problem: Jupyter had cached an old version of the model class from before I had fully implemented the forward pass. Even after updating the file, the notebook was still referencing the stale version in memory. The fix was simple: restart the kernel and re-import.
Lesson learned: when working with external modules in Jupyter, always restart the kernel after making changes. Otherwise, you debug ghosts.
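The stale-import behavior is easy to reproduce outside a notebook. This sketch writes a throwaway module (sightx_model_demo is a made-up name), imports it, edits the file, and shows that the in-memory class does not change until the module is reloaded. It uses importlib.reload as a lighter alternative to a full kernel restart, not as a description of what was done here.

```python
import importlib
import pathlib
import sys
import tempfile

# Write a throwaway module whose class has no forward() yet.
workdir = pathlib.Path(tempfile.mkdtemp())
module_path = workdir / "sightx_model_demo.py"
module_path.write_text("class Model:\n    pass\n")

sys.path.insert(0, str(workdir))
importlib.invalidate_caches()
import sightx_model_demo

had_forward_before = hasattr(sightx_model_demo.Model, "forward")

# Edit the file on disk, as you would in an editor.
module_path.write_text(
    "class Model:\n"
    "    def forward(self, x):\n"
    "        return x\n"
)

# The already-imported module still holds the old class.
still_stale = not hasattr(sightx_model_demo.Model, "forward")

# Re-executing the module picks up the new source.
importlib.reload(sightx_model_demo)
has_forward_after = hasattr(sightx_model_demo.Model, "forward")

print(had_forward_before, still_stale, has_forward_after)
```

Objects created before the reload keep pointing at the old class, so they need to be rebuilt as well. In a notebook, IPython's %load_ext autoreload followed by %autoreload 2 automates the reload, though a kernel restart remains the most reliable reset.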
What This Delivered
The model architecture is complete and verified. ResNet50 backbone with pre-trained weights. Layers 1-3 frozen. Layer4 unfrozen for domain adaptation. Custom classification head with bottleneck, activation, dropout, and 5-class output. Architecture test passed. Parameter count confirmed intentional.
Next phase
Building the training loop: loss function, optimizer, learning rate schedule, and the epoch-by-epoch logic that will update those 16 million trainable parameters. The architecture is built. Now it needs to learn.