SightX: Foundational Concepts for Inference Engine
Day 2 & 3
They say you cannot build a skyscraper without understanding why concrete holds. The same logic applies here. Before a single line of the Inference Engine gets written, I needed to genuinely understand what I was building.
Why Bother With Theory First?
I will be honest: my first instinct was to jump straight into the code. But I have done enough "copy-paste and pray" tutorials. So I slowed down, took notes on my iPad, and made sure I could explain each concept in plain English before moving on. Here's what I learned.
What is a CNN?
A Convolutional Neural Network (CNN) is the engine behind most modern image recognition systems, and it is the foundation of the Inference Engine. Instead of treating an image as a flat list of numbers, a CNN slides small "filters" across it, and successive layers of filters detect progressively more complex patterns.
Let us think of it as a pipeline of detectors:
- Early layers catch the simple stuff: horizontal edges, vertical edges, colour blobs.
- Middle layers combine those into shapes: curves, circles, grid-like patterns.
- Late layers identify actual objects: blood vessels, the optic disc, lesions.
For SightX specifically, this matters because microaneurysms, the lesions that signal early diabetic retinopathy, are tiny, subtle, and easy to miss. A CNN trained on thousands of labelled retinal images can spot these patterns far more consistently than a fatigued human grader.
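To make the "filter" idea from this section concrete, here is a minimal sketch in plain Python. The toy image and the hand-written Sobel-style kernel are my own illustrative assumptions; in a real CNN the kernel weights are learned during training, and a library like PyTorch does the sliding far more efficiently.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1).

    Like deep learning frameworks, this does not flip the kernel,
    so strictly speaking it is a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for top in range(out_h):
        row = []
        for left in range(out_w):
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Toy 5x5 grayscale image: dark on the left (0), bright on the right (1).
image = [[0, 0, 1, 1, 1] for _ in range(5)]

# Hand-written vertical-edge filter (illustrative, not a learned one).
vertical_edge = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]

response = convolve2d(image, vertical_edge)
for row in response:
    print(row)
```

The response is large where the window straddles the dark-to-bright boundary and zero where the window covers a uniform region, which is exactly the "early layers catch edges" behaviour described above.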
What is Transfer Learning?
Training a CNN from scratch on medical images would take enormous time, data, and computation. Transfer learning sidesteps this by starting with a model that already knows how to "see", then adapting it to the specific task.
Here is how that maps to SightX:
| Stage | What's Happening |
|---|---|
| Pre-training | ResNet50 was trained on ImageNet: about 1.2 million everyday images (dogs, cars, food, etc.) |
| What it learned | Universal visual features like edges, textures and shapes that are useful for any image task |
| Fine-tuning | We replace the final layer and re-train on more than 88,000 labelled retinal images |
| Result | A model that detects retinal lesions with ~70% less training time |
The specific type of transfer learning being used here is Inductive Transfer Learning. In simple terms: the model was originally taught using one set of photos (everyday ImageNet images), and now we're teaching it a completely different skill (reading retinal scans). The two tasks are different, but both come with labels telling the model what's correct.
The key insight is that the knowledge does not go to waste. The model already "knows how to see": it understands edges, textures, and shapes from its first training.
What is ResNet50?
ResNet50, a 50-layer CNN developed by Microsoft Research, is the backbone of the Inference Engine. The most interesting part is its "residual connection": rather than each block of layers learning a completely new transformation, it learns a small correction that is added on top of what the previous layers already figured out. This greatly eased a major problem in training deep networks, the vanishing gradient, where the training signal shrinks toward zero as it travels back through many layers.
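A toy calculation helped me see why the residual connection matters. The sketch below treats each layer as a single number with a small slope, which is a heavy simplification of my own and not how ResNet50 is actually structured (real layers are matrices, and the skip connections wrap blocks of layers). It only shows the chain-rule effect: multiplying many small slopes collapses the gradient, while adding the identity path keeps it alive.

```python
def gradient_through_stack(local_slope: float, depth: int, residual: bool) -> float:
    """Gradient of the output with respect to the input of a scalar layer stack.

    Plain layer:    y = f(x)      so dy/dx = f'(x)
    Residual layer: y = x + f(x)  so dy/dx = 1 + f'(x)
    Every layer is assumed to have the same slope f'(x) = local_slope.
    """
    per_layer = (1.0 + local_slope) if residual else local_slope
    return per_layer ** depth


# Illustrative numbers: a weak slope of 0.01 across 50 layers.
plain = gradient_through_stack(local_slope=0.01, depth=50, residual=False)
with_skip = gradient_through_stack(local_slope=0.01, depth=50, residual=True)

print(f"plain stack:    {plain:.3e}")
print(f"residual stack: {with_skip:.3f}")
```

The plain stack's gradient is vanishingly small, so the earliest layers would barely learn, while the residual stack's gradient stays on the order of 1.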
The 5-Grade DR Scale
The model's job is to classify a retinal image into one of five severity grades. Understanding what those grades actually mean clinically kept this from feeling like abstract label prediction:
| Grade | Clinical Meaning |
|---|---|
| Grade 0 — No DR | Healthy retina, no signs of disease |
| Grade 1 — Mild | A few microaneurysms (tiny balloon-like bulges in blood vessels) |
| Grade 2 — Moderate | More microaneurysms, early vessel damage |
| Grade 3 — Severe | Many blocked vessels, significant risk of vision loss |
| Grade 4 — Proliferative | New abnormal blood vessels growing — most severe, requires urgent treatment |
One thing that stood out: this isn't just a classification problem where all wrong answers are equal. Grade 3 is worse than Grade 2; it is not just different from it. That ordering matters clinically, and it turns out it has implications for how the model is trained too (something I'll revisit properly when the engine is further along).
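Here is a small sketch of what "not all wrong answers are equal" means in numbers. The two scoring functions are illustrative assumptions of mine, not the loss SightX will end up using: one treats every mistake the same, the other scales the penalty by how many grades the prediction is off.

```python
GRADES = ["No DR", "Mild", "Moderate", "Severe", "Proliferative"]


def zero_one_error(true_grade: int, predicted_grade: int) -> int:
    """Plain classification view: any wrong answer costs 1."""
    return 0 if true_grade == predicted_grade else 1


def ordinal_error(true_grade: int, predicted_grade: int) -> int:
    """Ordinal view: cost grows with the distance between grades."""
    return abs(true_grade - predicted_grade)


# Two mistakes on a Proliferative (Grade 4) eye:
# calling it Severe versus calling it healthy.
for predicted in (3, 0):
    print(
        f"true={GRADES[4]}, predicted={GRADES[predicted]}: "
        f"zero-one={zero_one_error(4, predicted)}, "
        f"ordinal={ordinal_error(4, predicted)}"
    )
```

Under the zero-one view both mistakes look identical, while the ordinal view flags the "healthy" prediction as four times worse, which matches the clinical intuition.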
What is Next?
Next up: setting up the Python environment, verifying hardware acceleration on the M4 via PyTorch's MPS backend (MPS runs on the Apple GPU through Metal rather than on the Neural Engine), and getting the folder structure in place before any model code is written.
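For the acceleration check, something along these lines should do. It is a minimal sketch: `describe_device` is a hypothetical helper name of mine, and it deliberately handles the case where PyTorch is not installed yet, since the environment is not set up at this point. `torch.backends.mps.is_available()` reports whether PyTorch can use the Apple GPU through Metal.

```python
import importlib.util


def describe_device() -> str:
    """Report which PyTorch device this machine would use (hypothetical helper)."""
    if importlib.util.find_spec("torch") is None:
        return "torch-not-installed"

    import torch

    # Older PyTorch builds have no `mps` backend module, hence the getattr.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


print(describe_device())
```

On an M-series Mac with a recent PyTorch build I would expect this to print `mps`; anything else means the setup needs another look before training starts.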