SightX: Research & Model training V2
Day 15 & 16
V1 was chaos. Beautiful, productive chaos. I typed python train.py, watched accuracy hit 67.99% at Epoch 7, then watched it overfit into the ground for thirteen more epochs while I ate noodles and questioned my life choices. It worked, but it was messy.
V2 was different. V2 was what happens when you finally read the manual. After V1 finished, I did something I probably should have done before writing any code: I studied the people who actually solved this problem. The 2015 EyePACS Kaggle competition had 661 teams competing for $100K in prizes. The winners didn't just build better models; they built better ways of looking at the data. And that changed everything.
The Kaggle Winner Who Changed My Mind
Ben Graham won first place. His secret wasn't a fancy neural network. It was preprocessing.
His insight: raw retinal photographs are a mess. Different cameras. Different lighting. Different zoom levels. Some images look reddish, some yellowish. Some retinas fill the frame, others are tiny circles floating in black pixels. If you feed this chaos directly into a model, it wastes energy learning to ignore noise instead of detecting disease.
His fix was elegant: rescale every retina to the same size, subtract the local average color to neutralize lighting differences, and clip boundary artifacts. Suddenly every image looks structurally consistent regardless of which camera took it.
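To make the "subtract the local average color" step concrete, here is a minimal sketch in plain Python. It is not Ben Graham's code or the SightX script: it works on a tiny single-channel grid, uses a box average as a stand-in for the Gaussian blur a real image library would provide, and skips the rescaling and boundary clipping entirely. The function names, the radius, and the gain and offset values (4 and 128) are all assumptions chosen for illustration.

```python
def local_mean(img, radius):
    """Box-filter average around each pixel, with the window clipped at the edges."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            total = sum(img[j][i] for j in ys for i in xs)
            out[y][x] = total / (len(ys) * len(xs))
    return out


def subtract_local_average(img, radius=2, gain=4.0, offset=128.0):
    """Map each pixel to gain * (pixel - local mean) + offset, clipped to 0..255.

    gain and offset are assumed values, not taken from the post.
    """
    mean = local_mean(img, radius)
    return [
        [min(255.0, max(0.0, gain * (p - m) + offset)) for p, m in zip(row, mrow)]
        for row, mrow in zip(img, mean)
    ]


# Two synthetic "photos" of the same texture: one dim, one 40 levels brighter,
# standing in for two cameras with different lighting.
dim = [[100 + (3 * x + 5 * y) % 7 for x in range(8)] for y in range(8)]
bright = [[p + 40 for p in row] for row in dim]

normalized_dim = subtract_local_average(dim)
normalized_bright = subtract_local_average(bright)

# The largest difference between the two normalized images is effectively zero.
max_diff = max(
    abs(a - b)
    for row_a, row_b in zip(normalized_dim, normalized_bright)
    for a, b in zip(row_a, row_b)
)
print("largest difference after normalization:", max_diff)
```

The point of the toy: a uniform lighting offset disappears after normalization, so the model only sees local structure. A flat image with no structure at all comes out as uniform mid-gray.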
I read that and realized: V1 was feeding raw, unprocessed chaos into ResNet50 and hoping it would figure things out. V2 needed to clean the data first.
Why Accuracy Is Lying to You
Here is the secret about this dataset: 73.5% of images are Grade 0 (healthy retinas). A model that does literally nothing, one that just predicts "healthy" for every single image, gets 73.5% accuracy. Higher than V1. Higher than V2. Completely useless clinically.
This is why the Kaggle competition used something called quadratic weighted kappa instead of raw accuracy. Kappa measures agreement while accounting for chance and severity. Predicting Grade 0 for a Grade 4 (proliferative, sight-threatening) retina is penalized way harder than being off by one grade. Kappa cares about how wrong you are, not just whether you are wrong.
V1 only tracked accuracy. V2 tracks both. And kappa tells the real story.
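Here is a rough sketch of why the two metrics disagree, written in plain Python so nothing is hidden. It is an illustration, not the SightX evaluation code. The 1,000-image sample is hypothetical: only the 73.5% Grade 0 share comes from the numbers above, and the split of the other four grades is made up. The function names are mine.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true grade exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def quadratic_weighted_kappa(y_true, y_pred, num_classes=5):
    """1 - (weighted observed disagreement) / (weighted chance disagreement)."""
    n = len(y_true)
    observed = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    true_totals = [sum(row) for row in observed]
    pred_totals = [sum(observed[i][j] for i in range(num_classes))
                   for j in range(num_classes)]
    max_dist = (num_classes - 1) ** 2
    observed_cost = 0.0
    chance_cost = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            weight = (i - j) ** 2 / max_dist  # off by 1 costs 1/16, off by 4 costs 1
            observed_cost += weight * observed[i][j]
            chance_cost += weight * (true_totals[i] * pred_totals[j] / n)
    if chance_cost == 0.0:
        return 1.0  # convention chosen here: only one grade appears anywhere
    return 1.0 - observed_cost / chance_cost


# Hypothetical sample: 73.5% Grade 0 as in the post, the rest invented.
y_true = [0] * 735 + [1] * 70 + [2] * 150 + [3] * 25 + [4] * 20

# The do-nothing model: predict "healthy" for everything.
always_healthy = [0] * len(y_true)

# Two models that are wrong on exactly the same 20 Grade 4 images:
# one calls them Grade 3 (near miss), the other calls them Grade 0 (far miss).
near_miss = [3 if t == 4 else t for t in y_true]
far_miss = [0 if t == 4 else t for t in y_true]

for name, preds in [("always healthy", always_healthy),
                    ("near miss", near_miss),
                    ("far miss", far_miss)]:
    print(name, round(accuracy(y_true, preds), 4),
          round(quadratic_weighted_kappa(y_true, preds), 4))
```

The do-nothing model keeps its 73.5% accuracy but scores a kappa of zero, because it agrees with the labels no better than chance. The near-miss and far-miss models have identical accuracy, yet the far-miss one gets a clearly lower kappa. If scikit-learn is available, cohen_kappa_score with weights="quadratic" computes the same metric.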
What Actually Changed
Every change in V2 came from research. Here is what mattered:
Preprocessing the images: I wrote a script that processes all 35,126 images before training. It crops out the black borders, rescales every retina to the same size, applies Ben Graham's local color normalization to strip camera artifacts, and enhances contrast in the green channel (where blood vessels show up best). This took several hours to run but only needed to happen once.
Aggressive augmentation: V1 used gentle augmentation: ±10° rotation, horizontal flip. V2 went nuclear: full 360° rotation (retinas can be photographed at any angle), vertical and horizontal flips, aggressive color jitter, random cropping, Gaussian blur, random erasing. The model couldn't memorize anymore; it was forced to learn actual patterns.
Higher resolution: V1 trained on 224×224 images. V2 uses 384×384. The competition's third-place team showed that doubling resolution improved their kappa significantly. More pixels = more detail for the model to learn from.
Gradual unfreezing: Instead of unfreezing layer4 immediately, V2 gradually unlocks layers 2, 3, and 4 over the first few epochs. This gives the model time to adapt without destroying the pre-trained features.
Early stopping: V2 watches validation kappa and stops training automatically when it plateaus for 5 epochs. V1 just ran all 20 epochs whether it was learning or not.
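Two of the changes above, gradual unfreezing and early stopping, are mostly bookkeeping, so they can be sketched without any framework. This is a guess at the shape of the logic, not the actual train.py. The unfreeze epochs, the deepest-first order, the always-trainable "fc" head, and every name below are assumptions. The only details taken from the post are the three layers involved and the patience of 5 epochs on validation kappa.

```python
# Assumed schedule: (first epoch at which the block trains, ResNet block name).
# The post only says layers 2, 3, and 4 unlock over the first few epochs.
UNFREEZE_SCHEDULE = ((1, "layer4"), (2, "layer3"), (3, "layer2"))


def trainable_blocks(epoch, schedule=UNFREEZE_SCHEDULE, always=("fc",)):
    """Return the top-level block names that should be trainable at `epoch`."""
    blocks = set(always)
    for start_epoch, block in schedule:
        if epoch >= start_epoch:
            blocks.add(block)
    return blocks


def apply_unfreezing(model, epoch):
    """Freeze or unfreeze parameters by the first component of their name.

    Works with any object exposing named_parameters(), for example a
    torchvision ResNet, whose names look like "layer4.0.conv1.weight".
    """
    blocks = trainable_blocks(epoch)
    for name, param in model.named_parameters():
        param.requires_grad = name.split(".")[0] in blocks


class EarlyStopping:
    """Stop once the monitored score has not improved for `patience` epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, score):
        """Record one epoch's score; return True when training should stop."""
        if score > self.best + self.min_delta:
            self.best = score
            self.best_epoch = epoch
            self.bad_epochs = 0  # this is also where a checkpoint would be saved
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def epochs_until_stop(val_kappas, patience=5):
    """Replay a list of per-epoch validation kappas.

    Returns (epoch at which training stops or None, epoch of the best score).
    """
    stopper = EarlyStopping(patience=patience)
    for epoch, kappa in enumerate(val_kappas):
        if stopper.step(epoch, kappa):
            return epoch, stopper.best_epoch
    return None, stopper.best_epoch


# Made-up kappa history: improves until epoch 2, then plateaus.
history = [0.40, 0.50, 0.55, 0.54, 0.55, 0.53, 0.52, 0.51, 0.60]
print(epochs_until_stop(history))
```

In a real loop, apply_unfreezing would run at the start of each epoch and the stopper's step at the end of it, after validation. Note that the late 0.60 in the made-up history is never reached: with a patience of 5, training has already stopped by then, which is the trade-off early stopping makes.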
The Results, or Why 1.7% Feels Like a Victory
V2 finished after 11.5 hours (way longer than V1's 4 hours, thanks to higher resolution). Final numbers:
Validation accuracy: 69.68% (V1 was 67.99%)
Quadratic weighted kappa: 0.6454 (V1 was ~0.50)
Epochs trained without overfitting: 20 (V1 peaked at Epoch 7, then overfit)
The accuracy only went up 1.7 percentage points. If you only looked at that number, you might think V2 barely improved. But kappa went up by about 0.15, and that is the real win. The model isn't just getting more images right. It is getting the severity ordering right. When it is wrong, it tends to be wrong by one grade, not three. That is medically meaningful.
The Overfitting That Never Came
V1's story was a tragedy: steady improvement to Epoch 7, then thirteen epochs of slow collapse as training accuracy climbed and validation accuracy flatlined. The model memorized the training set and forgot how to generalize.
V2 never had that second act. The gap between training accuracy (53%) and validation accuracy (70%) stayed reasonable the whole time, and it ran in the healthy direction: training accuracy was the lower of the two, presumably because it is measured on heavily augmented images. The aggressive augmentation worked. The model couldn't memorize because every epoch showed it slightly different variations of the same images. It was forced to learn patterns, not pixels.
What This Actually Means
Here is where SightX sits now:
Clinical-grade systems (FDA-approved, 90%+ accuracy): Can diagnose autonomously
Clinical assist tools (80-90% accuracy): Doctors use as second opinion
Screening tools (70% accuracy, κ≈0.65): Flags at-risk patients for referral ← This is SightX
Research demos (<65% accuracy): Portfolio projects, proofs of concept
V2 is a screening tool. Not a diagnostic. A triage system. In a rural clinic where the alternative is no diabetic retinopathy screening at all, a system that correctly flags most at-risk patients for an ophthalmologist referral is genuinely useful.
And here is the thing: the model is one swappable component. Someone with more GPU access could fork this repo, train a bigger model at higher resolution, drop in a better best_model.pt, and the entire pipeline would just work. The model is the engine. SightX is the car. V2 proves the engine runs.
The Real Lesson
If V2 taught me anything, it is this: data quality beats model architecture. Every percentage point of improvement came from how we processed, augmented, and sampled the data.
Ben Graham won Kaggle with better preprocessing, not a bigger model. The third-place team got there with higher resolution and smarter augmentation. The winners understood something I didn't in V1: the model can only learn patterns that exist in the data you feed it. If you feed it noise, it learns noise (garbage in, garbage out). If you feed it clean, consistent, well-augmented medical images, it learns.
V1 was about proving the pipeline worked. V2 was about making it work well. The accuracy jump is modest. The kappa improvement is real. The overfitting problem is solved. The preprocessing is reusable. The training is stable.
V2 is done. The model sits in checkpoints/best_model.pt - 100MB, 24.6 million parameters, κ=0.6454, trained entirely on a MacBook Air M4 that never once sounded like it was preparing for takeoff.
Next phase
Building the FastAPI server that turns this trained model into a callable inference API. The model learned to see disease. Now it needs to serve predictions.
V3 is being researched. But first, we ship V2.