SightX: Preprocessing the Pipeline - ImageNet Math and Augmentation Strategy

 Day 8 & 9

They say data is worthless until it is clean. That might be dramatic, but the reality is that 35,000 high-resolution retinal scans sitting in a folder do not mean anything to a neural network until they are resized, normalized, and shaped into the exact format the model expects. This phase was about building the preprocessing pipeline. The bridge between raw JPEG files and tensors ResNet50 can actually train on.


Why Preprocessing Is Not Optional

The temptation to skip this step is real. The data is downloaded, the model architecture is known, and the urge to just start training is strong. But feeding raw images directly into a pre-trained ResNet50 would go wrong from the first batch, not because the model is broken, but because the inputs would not match what the pre-trained weights expect.

ResNet50 was originally trained on ImageNet: roughly 1.2 million images, each normalized with specific per-channel mean and standard deviation values. If you feed it images with a different distribution during fine-tuning, the pre-trained weights will interpret the pixel values incorrectly. The model will still run, but it will start from a far worse position than it should, and much of the advantage of transfer learning evaporates.

This phase was about two things: understanding why ImageNet normalization matters and implementing separate pipelines for training (with augmentation) and inference (deterministic only).


The ImageNet Statistics: Why These Numbers Are Not Arbitrary

Every preprocessing pipeline in this project revolves around two sets of three values, one value per color channel:

IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD  = [0.229, 0.224, 0.225]

These numbers represent the mean and standard deviation of pixel values across the entire ImageNet dataset, calculated separately for the Red, Green, and Blue color channels. When ResNet50 was originally trained, every single input image was normalized using these exact statistics. The network's internal weights learned to interpret pixel values within this specific distribution.

If you feed ResNet50 an image that has been normalized differently or, worse, not normalized at all, the values flowing through the network will be far from what the weights expect. The model will still produce predictions, but those predictions will be unreliable because the input space is misaligned with the learned feature space.

The fix is simple: apply the same normalization that was used during pre-training. This is not a hyperparameter to tune. It is a compatibility requirement.
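To make the arithmetic concrete, here is a minimal pure-Python sketch of what scaling plus ImageNet normalization does to a single pixel. The helper name normalize_pixel is my own illustration, not part of torchvision; the real pipeline does the same per-channel math on whole tensors.

```python
# Illustrative sketch: the per-channel arithmetic applied to one 8-bit pixel.
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb):
    """Map one (R, G, B) pixel in 0..255 to ImageNet-normalized values."""
    normalized = []
    for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD):
        scaled = value / 255.0                     # 0..255 -> 0.0..1.0
        normalized.append((scaled - mean) / std)   # center, then rescale
    return normalized


black = normalize_pixel((0, 0, 0))
white = normalize_pixel((255, 255, 255))
print([round(v, 3) for v in black])
print([round(v, 3) for v in white])
```

After normalization, values land roughly between -2.1 and 2.6 instead of 0 to 255, which is the scale the pre-trained weights saw during ImageNet training.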


Training Transforms: Augmentation as a Defense Against Overfitting

The training pipeline applies six sequential transformations to every image before it reaches the model:

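from torchvision import transforms

train_transforms = transforms.Compose([
    transforms.Resize((224, 224)),                          # fixed input size
    transforms.RandomHorizontalFlip(),                      # flip with p=0.5
    transforms.RandomRotation(10),                          # rotate within +/-10 degrees
    transforms.ColorJitter(brightness=0.2, contrast=0.2),   # lighting variation
    transforms.ToTensor(),                                  # PIL image -> float tensor in [0, 1]
    transforms.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
])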


The first operation, Resize((224, 224)), is mechanical. The pre-trained ResNet50 weights were trained on 224×224 crops, and batching requires every image to have the same shape, so every image gets resized to match.

The next three lines are data augmentation: RandomHorizontalFlip(), RandomRotation(10), and ColorJitter(). These operations artificially expand the dataset by creating variations of the original images, sampled fresh each time an image is loaded. A retinal scan flipped horizontally is still medically plausible, because left and right eyes are approximate mirror images of each other. A rotation of up to 10 degrees in either direction simulates the minor variations in how a camera might be positioned during capture. Brightness and contrast adjustments of up to 20% account for different lighting conditions across clinical environments.

The goal is to force the model to learn the actual features that distinguish Grade 0 from Grade 4 (microaneurysms, vessel structure, and neovascularisation) rather than memorizing the exact pixel layout of the 25,000 training images. Without augmentation, the model would be far more likely to overfit: it would perform well on the training set and collapse on new data.
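As a rough illustration of why augmentation means the model rarely sees the exact same pixels twice, here is a pure-Python sketch of a random horizontal flip on a tiny nested-list "image". The function random_hflip and the toy data are hypothetical stand-ins of mine; torchvision's RandomHorizontalFlip applies the same idea to real images, with a default flip probability of 0.5.

```python
import random


def random_hflip(image, p=0.5, rng=random):
    """Mirror a row-major image (list of pixel rows) left-to-right with probability p."""
    if rng.random() < p:
        return [row[::-1] for row in image]
    return image


# Toy 2x3 "image"; the numbers stand in for pixel values.
tiny = [[1, 2, 3],
        [4, 5, 6]]

rng = random.Random(0)  # seeded only so the sketch is reproducible
views = [random_hflip(tiny, rng=rng) for _ in range(4)]
for view in views:
    print(view)
```

Each pass over the dataset draws a new random decision per image, so across epochs the network is shown both orientations of the same scan rather than one fixed layout.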

The final two operations, ToTensor() and Normalize(), convert the image into a PyTorch tensor and apply the ImageNet statistics. The raw pixel values, which start in the range 0 – 255, are first scaled to 0.0 – 1.0 by ToTensor(), then shifted and rescaled per channel by Normalize() to match the ImageNet distribution. This is the format ResNet50 was trained on, and this is the format it expects.


Inference Transforms: No Randomness, Only Determinism

The inference pipeline is deliberately different. When the model is deployed and processing real retinal scans in production, it cannot randomly flip or rotate the input. The prediction needs to be based on the actual image, not a randomly augmented version of it.

inference_transforms = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD)
])


This pipeline removes all augmentation. What remains are the strict mechanical steps required for ResNet50 to function: resize the image, crop the center 224×224 region, convert to tensor, normalize using ImageNet statistics.

One detail here: the inference pipeline uses Resize(256) followed by CenterCrop(224), rather than directly resizing to 224×224. This is the resize-then-crop approach described in the official PyTorch ResNet50 documentation. The logic is that resizing the shorter edge to 256 pixels and then cropping the center preserves the aspect ratio, whereas a direct resize to a square squashes non-square images. For training, I chose the direct resize for simplicity, but for inference, I followed the documented practice.
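The geometry is easy to check by hand. Below is a small sketch that computes the intermediate size and crop box for a shorter-edge resize followed by a center crop. The function name and the 3000×2000 example image are assumptions for illustration (not dimensions from this dataset), and the rounding is my approximation of how torchvision behaves rather than a copy of its code.

```python
def resize_then_center_crop(width, height, resize_size=256, crop_size=224):
    """Return ((new_w, new_h), crop_box) for shorter-edge resize + center crop."""
    short, long = min(width, height), max(width, height)
    # The shorter edge becomes resize_size; the longer edge scales by the same factor.
    new_short, new_long = resize_size, int(resize_size * long / short)
    if width <= height:
        new_w, new_h = new_short, new_long
    else:
        new_w, new_h = new_long, new_short
    # Crop box as (left, top, right, bottom), centered in the resized image.
    left = int(round((new_w - crop_size) / 2.0))
    top = int(round((new_h - crop_size) / 2.0))
    return (new_w, new_h), (left, top, left + crop_size, top + crop_size)


# Hypothetical 3:2 landscape image.
size, box = resize_then_center_crop(3000, 2000)
print(size, box)

# A direct Resize((224, 224)) would instead scale the two axes differently.
stretch = (224 / 2000) / (224 / 3000)
print(f"direct resize distorts the aspect ratio by a factor of {stretch:.2f}")
```

The trade-off is visible in the numbers: the crop keeps shapes undistorted but discards the left and right margins of a wide image, while the direct resize keeps every pixel's content but squashes it.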


The Preprocessing Function: Bridging File Paths to Tensors

The final piece is a utility function that wraps the inference pipeline and adds error handling:

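import logging

import torch
from PIL import Image

def preprocess_image(image_path: str) -> torch.Tensor:
    """Load and preprocess a single image for inference."""
    try:
        # Force 3-channel RGB so Normalize always sees three channels.
        img = Image.open(image_path).convert('RGB')
    except Exception as e:
        logging.error("Error loading image %s: %s", image_path, e)
        raise  # re-raise with the original traceback intact

    tensor = inference_transforms(img)  # shape: [3, 224, 224]
    return tensor.unsqueeze(0)          # add batch dimension: [1, 3, 224, 224]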


This function takes a file path, loads the image using PIL, forces it into RGB format (some retinal scans might be grayscale or have an alpha channel), applies the inference transforms, and returns a tensor.

The final operation, unsqueeze(0), is critical. Models always expect a batch of images, even when predicting a single one. The shape ResNet50 requires is [batch_size, channels, height, width]. A single image has the shape [3, 224, 224]. Adding the batch dimension with unsqueeze(0) transforms it into [1, 3, 224, 224], which the model can process.


What I Learned Digging Into the Docs

The PyTorch documentation for ResNet50 explicitly describes the expected preprocessing pipeline: resize the shorter edge, center crop to 224×224, scale to [0.0, 1.0], and normalize using the ImageNet mean and standard deviation. The resize value depends on which weights are loaded: the docs list 256 for the original IMAGENET1K_V1 weights and 232 for the newer IMAGENET1K_V2 weights, so the Resize(256) used here corresponds to the V1 recipe. This is not a suggestion. It is the preprocessing pipeline that the pre-trained weights were validated against.

I spent time cross-referencing this with what I had implemented. The training pipeline uses a direct resize for speed and simplicity. The inference pipeline follows the documented resize-then-crop approach for maximum compatibility. Both apply the same normalization, which is the non-negotiable requirement.

The deeper insight here is that transfer learning is not just about loading pre-trained weights. It is about maintaining compatibility with the exact data distribution those weights were trained on. The preprocessing pipeline is where that compatibility is enforced.


What This Phase Delivered

Two pipelines are now implemented and ready:

  1. Training transforms: with augmentation to prevent overfitting

  2. Inference transforms: deterministic, following the documented ResNet50 preprocessing

Both apply ImageNet normalization. Both produce 224 × 224 inputs. Both convert to tensors. The difference is whether augmentation is applied, which depends entirely on context: training benefits from variation, inference demands consistency.

The preprocessing.py file is complete. Phase 4 starts next: defining the ResNet50 architecture, replacing the final classification layer with a 5-class output, and verifying that the model can accept the preprocessed tensors without throwing shape errors. The pipeline is built. The model is next.

