SightX: Trained My First AI Model on a Laptop

Day 14

7:55 PM to 11:55 PM

They say you never forget your first. Your first car. Your first apartment. Your first time training a neural network on 35,000 medical images while sitting at your desk eating noodles and questioning every life decision that led to this moment. Tonight was that night.

It was 7:55 PM on a Thursday. I had spent two weeks building the pipeline, downloading the EyePACS dataset, writing the model architecture, debugging preprocessing transforms, and triple-checking my training loop for bugs that would only reveal themselves three hours into training. Everything was ready. The data was clean. The code was tested. The conda environment was activated.

This time I remembered to activate it.

I typed the command:

python inference-engine/train.py

And then the terminal did something beautiful and slightly unhinged. It printed the same line nine times:

Training on device: mps

Training on device: mps

Training on device: mps

...


Nine times. One for each DataLoader worker process, plus the main process, all screaming in unison: THE M4 CHIP IS ENGAGED. (On macOS each worker is spawned as a fresh Python process that re-imports the script, so a print at module level runs once per worker.) Metal Performance Shaders activated. Apple Silicon GPU cores online. No NVIDIA GPU. No cloud instance. No hourly billing. Just a fanless MacBook Air sitting on my desk, about to process thousands of retinal photographs through a 25-million-parameter ResNet50.

I sat back. This was going to take four hours. I had nothing to do but watch progress bars fill, loss values drop, and wonder if 67% validation accuracy was good or if I had just wasted an entire evening.

Spoiler: it was good. But the journey to get there? That was the real story.


What Was Actually Happening Under the Hood

Before I get to the numbers, here is what this training run actually is, because watching loss values scroll past without context is like watching a foreign film without subtitles: technically you are observing events, but you have no idea what is going on.

The mission: build a model that can look at a retinal photograph and grade diabetic retinopathy severity on a 0–4 scale. Grade 0 means healthy. Grade 4 means proliferative disease, new abnormal blood vessels growing in response to oxygen deprivation, which is the stage right before permanent vision loss. This is not an academic exercise. Diabetes runs in my family. I have watched relatives navigate the constant anxiety about complications that arrive silently and irreversibly. Every line of code in this project carries that weight.

The architecture: ResNet50, pre-trained on ImageNet. A neural network that already knows how to recognize edges, textures, and shapes from 1.2 million everyday images. We are not teaching it how to see. We are teaching it what to look for in retinal scans. Layers 1-3 stay frozen to preserve universal feature detectors. Layer 4 gets unfrozen to adapt high-level features to medical images. The final classification head is brand new and trained from scratch to output 5 disease severity scores.

The data: 35,126 retinal photographs from the EyePACS dataset, each labeled by trained clinicians. Massive class imbalance: 73% of images are Grade 0 (no disease). If we trained naively, the model would just predict "healthy" for everything and call it 73% accurate. Completely useless. To fight this, we used weighted cross-entropy loss, so the model gets penalized more for missing rare, severe cases than for missing common healthy ones.
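
One common way to derive those loss weights is inverse class frequency. The sketch below is an assumption about how the weights could be computed, not the project's actual code: the per-grade counts are illustrative values chosen to total 35,126 with about 73% in Grade 0 (the post only states the Grade 0 share), and balanced_class_weights is a hypothetical helper name.

```python
# Illustrative per-grade image counts (assumed, not taken from the post).
# They sum to 35,126 and put roughly 73% of images in Grade 0.
GRADE_COUNTS = {0: 25810, 1: 2443, 2: 5292, 3: 873, 4: 708}


def balanced_class_weights(counts):
    """Return weight_c = N / (K * n_c), so rarer classes get larger weights."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {grade: total / (num_classes * n) for grade, n in counts.items()}


weights = balanced_class_weights(GRADE_COUNTS)
for grade, weight in sorted(weights.items()):
    print(f"Grade {grade}: weight {weight:.2f}")
```

In PyTorch, weights like these would typically be passed as the weight tensor of torch.nn.CrossEntropyLoss, so a mistake on a Grade 4 image costs far more than a mistake on a Grade 0 image.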

The training config: 20 epochs, batch size 32, learning rate 1e-4, Adam optimizer, StepLR scheduler that drops the learning rate by 10× every 7 epochs. Data augmentation with random flips, rotations, and color jitter to prevent overfitting. Train/validation split: 80/20, stratified to maintain class distribution.
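
To see what that scheduler does over 20 epochs, here is a minimal sketch of the step schedule in plain Python. It assumes the scheduler is stepped once at the end of every epoch and that epochs are counted from 0, which is an assumption about the training loop. The closed form is the one torch.optim.lr_scheduler.StepLR follows with step_size=7 and gamma=0.1; step_lr is a hypothetical helper name.

```python
BASE_LR = 1e-4   # learning rate from the training config
STEP_SIZE = 7    # drop every 7 epochs
GAMMA = 0.1      # each drop divides the learning rate by 10
NUM_EPOCHS = 20


def step_lr(epoch_index, base_lr=BASE_LR, step_size=STEP_SIZE, gamma=GAMMA):
    """Learning rate in effect during the epoch with this 0-based index."""
    return base_lr * gamma ** (epoch_index // step_size)


schedule = [step_lr(e) for e in range(NUM_EPOCHS)]
for e, lr in enumerate(schedule):
    print(f"epoch index {e:2d}: lr = {lr:.0e}")
```

Under these assumptions the rate is 1e-4 for indices 0-6, 1e-5 for indices 7-13, and 1e-6 from index 14 onward, so the last six epochs run at one hundredth of the starting rate.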

And then I hit enter and waited.


The Four-Hour Watch

Training started at 7:55 PM. Each epoch took approximately 12 minutes: roughly 1,100 batches of 32 images each, counting both the training and validation passes. Forward pass through ResNet50, loss computation, backpropagation, weight updates, repeat. I could not look away. This is the machine learning equivalent of watching bread rise. Objectively boring. Subjectively hypnotic.
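
The batch count follows from the dataset size, the 80/20 split, and the batch size. This is a back-of-the-envelope sketch, assuming the last partial batch is kept and that a stratified split lands within a few images of an exact 80/20 cut:

```python
import math

TOTAL_IMAGES = 35_126
TRAIN_FRACTION = 0.8
BATCH_SIZE = 32
MINUTES_PER_EPOCH = 12

train_images = int(TOTAL_IMAGES * TRAIN_FRACTION)
val_images = TOTAL_IMAGES - train_images

# Round up because a final partial batch still counts as one batch.
train_batches = math.ceil(train_images / BATCH_SIZE)
val_batches = math.ceil(val_images / BATCH_SIZE)
total_batches = train_batches + val_batches

# Coarse average: validation batches skip backpropagation and run faster.
seconds_per_batch = MINUTES_PER_EPOCH * 60 / total_batches

print(f"train: {train_images} images, {train_batches} batches")
print(f"val:   {val_images} images, {val_batches} batches")
print(f"total: {total_batches} batches, ~{seconds_per_batch:.2f} s per batch")
```

The training pass alone comes to fewer than 900 batches; a figure near 1,100 only appears when the validation batches are counted as well.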

Epochs 1-3: The Confused Student Phase

The model started blind. It had ImageNet weights, so it knew what edges and textures looked like in everyday photos, but it had absolutely no idea what a diseased retina looked like. Validation accuracy bounced around 45-52%. Loss hovered above 1.3. It was learning, but it was clumsy. Like a medical student on day one, flipping through a textbook and guessing based on vibes.

I watched the numbers scroll. Train accuracy: 44.96%. Validation accuracy: 52.85%. Better than random guessing (20%), but not inspiring confidence. Epoch 2 dropped validation accuracy to 45.89%. I questioned everything. Was my preprocessing wrong? Did I mess up the class weights? Should I have used a different learning rate?

Then Epoch 3 stabilized at 46.98% validation. The model was finding its footing. Slowly.

Epochs 4-6: The Grind

Epoch 4 gave us 58.21% validation accuracy. Something clicked. The model was starting to distinguish between healthy retinas and diseased ones. Not perfectly, but consistently. I leaned forward. This was progress.

Epochs 5 and 6 were the grind. Validation accuracy wobbled between 52% and 53%. Training accuracy kept climbing, but validation was stuck. This is normal. This is fine. The model is consolidating what it learned. I told myself this repeatedly while refreshing my coffee for the third time.

Epoch 7: THE MOMENT

Then Epoch 7 happened.

The learning rate scheduler kicked in. The learning rate dropped from 0.0001 to 0.00001. This is like telling a student: "Stop flipping pages so fast. Look carefully at the details." The model listened.

Validation accuracy: 67.99%.

I sat up. I checked the terminal twice to make sure I read it correctly. The loss dropped to 1.0041. Training accuracy was 51.65%, but validation jumped roughly 15 percentage points in a single epoch, nearly 10 above the previous best from Epoch 4. This was the breakthrough. The model had learned something real. Not memorization. Actual pattern recognition.

For context: random guessing on 5 classes gets you 20%. A model that always predicts "healthy" gets 73% by exploiting class imbalance but is clinically worthless. A model that gets 68% while correctly identifying diseased retinas? That is learning.
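
The "always healthy" baseline is easy to put numbers on. The sketch below uses illustrative per-grade counts (an assumption; only the roughly 73% Grade 0 share comes from the post) and a hypothetical helper, always_healthy_metrics, to show why plain accuracy flatters that baseline while per-grade recall exposes it:

```python
# Illustrative per-grade counts (assumed; only the ~73% Grade 0 share is from the post).
GRADE_COUNTS = {0: 25810, 1: 2443, 2: 5292, 3: 873, 4: 708}


def always_healthy_metrics(counts, predicted_grade=0):
    """Accuracy and per-grade recall for a model that predicts one grade for every image."""
    total = sum(counts.values())
    accuracy = counts[predicted_grade] / total
    # Recall is 1.0 for the grade always predicted and 0.0 for every other grade.
    recall = {g: (1.0 if g == predicted_grade else 0.0) for g in counts}
    balanced_accuracy = sum(recall.values()) / len(recall)
    return accuracy, recall, balanced_accuracy


accuracy, recall, balanced_accuracy = always_healthy_metrics(GRADE_COUNTS)
print(f"accuracy:          {accuracy:.1%}")
print(f"balanced accuracy: {balanced_accuracy:.1%}")
print(f"recall on Grade 4: {recall[4]:.0%}")
```

Balanced accuracy (the mean of per-grade recall) puts this baseline at the same level as random guessing. By the same yardstick, a 68% model only counts as progress if its recall on Grades 1-4 is well above zero, which a per-grade breakdown on the validation set would confirm.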

Epochs 8-20: The Overfitting Wall

And then reality set in.

Epoch 8: validation accuracy dropped to 56.55%. Epoch 9: 52.41%. Epoch 10: 52.22%. Training accuracy kept climbing (54%, 55%, 56%), but validation was stuck. The gap between train and validation accuracy was widening. This is the textbook signature of overfitting. The model was no longer learning general patterns. It was memorizing the training set.
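
The gap itself can be read off the reported numbers. This is a small sketch, assuming the rounded train figures 54, 55, and 56 belong to Epochs 8, 9, and 10 (the post lists them in order but does not pin them to epochs); generalization_gaps is a hypothetical helper name.

```python
# (epoch, train accuracy %, validation accuracy %) as reported in the post.
# Assumption: the rounded train figures 54, 55, 56 belong to epochs 8, 9, 10.
REPORTED = [
    (1, 44.96, 52.85),
    (7, 51.65, 67.99),
    (8, 54.0, 56.55),
    (9, 55.0, 52.41),
    (10, 56.0, 52.22),
]


def generalization_gaps(rows):
    """Return (epoch, train minus validation accuracy) in percentage points."""
    return [(epoch, round(train - val, 2)) for epoch, train, val in rows]


gaps = generalization_gaps(REPORTED)
for epoch, gap in gaps:
    print(f"epoch {epoch:2d}: train - val = {gap:+.2f} points")
```

On these numbers the gap is negative through Epoch 8 and turns positive from Epoch 9, growing each epoch after that. Augmentation and dropout are active only during training, so they may hold train accuracy down and make the raw gap look smaller than the underlying overfitting.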

The second learning rate drop happened at Epoch 14. The learning rate fell to 0.000001, essentially zero. The model could barely update its weights anymore. It flatlined. Epochs 14-20 were a slow crawl toward marginal improvements that never materialized. Validation accuracy hovered between 52% and 58%, teasing 58.34% at Epoch 19, but never beating the Epoch 7 peak.

At 11:55 PM, the terminal printed its final line:

Training complete. Best val accuracy: 0.6799

Four hours. 20 epochs. One MacBook Air. And a model that peaked at 67.99% validation accuracy before overfitting dragged it back down. The best weights were saved at Epoch 7. Everything after that was watching the model slowly forget how to generalize.


The M4 Chip Deserves Its Own Paragraph

Can we talk about the hardware for a second? Because what just happened is kind of absurd.

Four years ago, training a ResNet50 on 35,000 images meant renting a cloud GPU, an NVIDIA A100 on AWS at $3/hour, or a Colab session that would crash after 2 hours and lose all your progress. Tonight, I ran the entire thing on a fanless ultrabook that weighs 3 pounds, while it played lo-fi beats through its speakers and never once sounded like it was trying to achieve liftoff.

The M4's unified memory architecture means the GPU and CPU share the same memory pool. No bottleneck transferring data across a PCIe bus. PyTorch's MPS backend translates neural network operations into Metal compute shaders. The result? 12 minutes per epoch on 35,000 images. Zero crashes. Zero thermal throttling over four straight hours. The laptop stayed cool enough.

Is it as fast as an NVIDIA A100? No. But it is a consumer laptop that costs $1,200. The fact that it can train a medical image classifier at all is an engineering achievement. Apple Silicon did not just make laptops faster; it democratized machine learning. Tonight, I trained a model that could screen for blindness on the same device I use to watch YouTube. That is wild.


What 67.99% Actually Means And Why I Am Not Disappointed

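Let me be clear: 67.99% is not going to win any competitions. Google's diabetic retinopathy model reported sensitivity and specificity above 90%. The FDA-approved IDx-DR system reported roughly 87% sensitivity and 90% specificity. The EyePACS Kaggle winner reached a quadratic weighted kappa of about 0.85. None of those figures is plain five-class accuracy, so the comparison is loose, but the gap is real.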

But here is what 67.99% does mean:

  1. The entire pipeline works. Data loading, preprocessing, augmentation, model architecture, training loop, checkpointing, validation, all functional. End to end.

  2. Transfer learning works. A model trained on cats and cars can learn to detect retinal disease. Epoch 7 proved it.

  3. We know exactly what the bottleneck is. Overfitting. The model learned the training set too well and failed to generalize. That is a solvable problem.

  4. This was attempt number one. No hyperparameter tuning. No architecture search. No ensemble methods. Just a clean, well-documented baseline.

This is V1, trained in four hours on a laptop, by someone who three weeks ago did not know what a learning rate scheduler was.

I will take 68%. It is a win for me.


Where Do We Go From Here?

Overfitting killed us after Epoch 7. Here is the fix:

  • Stronger augmentation -> more aggressive rotations, elastic transforms, Gaussian blur. Force the model to generalize, not memorize.

  • Increase dropout -> from 0.5 to 0.6, maybe add a second dropout layer. Directly combats overfitting.

  • Unfreeze more layers -> let layers 3 and 4 fine-tune, not just layer 4. Give the model more capacity to adapt.

  • Early stopping -> auto-stop training when validation accuracy plateaus for 5 epochs. Saves time and compute.

  • Image preprocessing -> crop black borders from retinal photos, normalize lighting. Remove noise that confuses the model.
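
Of the fixes above, early stopping is the simplest to sketch. The class below is a minimal illustration, not the project's code: EarlyStopping and its step method are hypothetical names, and the validation list mixes values reported in this post with clearly marked placeholders for epochs whose exact numbers were not given.

```python
class EarlyStopping:
    """Stop when validation accuracy has not improved for `patience` epochs in a row."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_score = float("-inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, score):
        """Record one epoch's validation score; return True when training should stop."""
        if score > self.best_score + self.min_delta:
            self.best_score = score
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Validation accuracy (%) by epoch. Epochs 1-4 and 7-10 are the values reported
# in the post; epochs 5, 6, 11 and 12 are placeholders inside the ranges the
# post describes (assumed, not real logs).
VAL_ACC = [52.85, 45.89, 46.98, 58.21, 52.5, 52.5, 67.99, 56.55, 52.41, 52.22, 55.0, 55.0]

stopper = EarlyStopping(patience=5)
stopped_at = None
for epoch, acc in enumerate(VAL_ACC, start=1):
    if stopper.step(epoch, acc):
        stopped_at = epoch
        break

print(f"best epoch: {stopper.best_epoch} ({stopper.best_score:.2f}%)")
print(f"stopped at epoch: {stopped_at}")
```

With a patience of 5 and these values, training would halt five epochs after the Epoch 7 peak instead of running all 20. At about 12 minutes per epoch, that is roughly an hour and a half saved.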

Target for V2: 75-80% validation accuracy. That would be a genuinely strong result for the project, and more importantly, it would move the model closer to being useful as an initial screening aid in real clinical settings.


The Closing Thought

At 11:55 PM, I closed my laptop. Four hours. Twenty epochs. 35,000 retinal images. One MacBook Air that never once complained. And a model that can, more often than not, tell if someone's retina shows signs of disease.

It is not perfect. It is V1. But somewhere between Epoch 1's confused guessing and Epoch 7's breakthrough, something real happened. That is what this project has always been about. Not chasing leaderboard scores. Not optimizing for benchmarks. Building something that matters.

The model sits in checkpoints/best_model.pt, a 100MB file that will never touch GitHub but represents two weeks of work and four hours of training. V2 is coming. And when it does, that 68% is going up.

