SightX: Data Acquisition & Exploration - 88GB of Reality, Data Acquisition and the 73% Problem

Day 6 & 7

They say you cannot build an inference engine without data. That is technically true, but what they don't tell you is that getting the data, understanding the data, and actually seeing what your model will be trained on is a journey of its own. And by journey, I mean downloading 88GB, realizing the unzipped dataset is now 100GB, spending an embarrassing amount of time wondering why Jupyter could not find my libraries, and then finally understanding what conda activate sightx actually meant.

Why Data Exploration Comes Before Everything Else

The urge to skip straight to model training is real. The environment is set up, PyTorch confirmed the M4 MPS backend is active, and ResNet50 is waiting. But training a model on data you have not actually looked at is like building a house without checking if the land is stable. You will get something, but it probably won't be what you wanted.

This phase was about three things: acquiring the EyePACS dataset from Kaggle, understanding the class distribution that will define the training strategy, and visually confirming what a Grade 0 retina looks like versus a Grade 4. The last part matters more than it sounds.

Downloading 88GB: The Reality Check

The EyePACS Diabetic Retinopathy dataset lives on Kaggle. Getting it requires the Kaggle CLI, an API token from your account, and a stable internet connection that can handle what turned out to be 88GB compressed and 100GB unzipped. The documentation said 35GB. The documentation lied.

The process itself was straightforward once the Kaggle API token was correctly placed in ~/.kaggle/kaggle.json. The command ran:

kaggle competitions download -c diabetic-retinopathy-detection

Then came the unzip, into the data/ directory, exactly where the folder structure already anticipated it would go. The result was approximately 35,000 training images and 53,000 test images, each a high-resolution retinal scan, with lighting conditions, camera angles, and image quality varying from one image to the next.
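In hindsight, a small preflight check would have saved some anxiety before committing to the download. The sketch below is not something from the original setup; it only checks that the API token file exists and that the target drive has enough free space. The preflight function and the 190GB figure (archive plus unzipped data sitting side by side) are my own assumptions for illustration.

```python
import shutil
from pathlib import Path


def preflight(token_path: Path, target_dir: Path, required_gb: float) -> list[str]:
    """Return a list of problems; an empty list means it looks safe to proceed."""
    problems = []

    # The Kaggle CLI reads its credentials from this file.
    if not token_path.is_file():
        problems.append(f"Kaggle API token not found at {token_path}")

    # shutil.disk_usage reports total, used and free bytes for the drive
    # that holds target_dir (the directory must already exist).
    free_gb = shutil.disk_usage(target_dir).free / 1e9
    if free_gb < required_gb:
        problems.append(
            f"only {free_gb:.0f} GB free in {target_dir}, "
            f"need about {required_gb:.0f} GB"
        )
    return problems


if __name__ == "__main__":
    # Assumption: roughly 88GB of archives plus 100GB unzipped, kept together.
    issues = preflight(Path.home() / ".kaggle" / "kaggle.json", Path("."), 190)
    print("\n".join(issues) if issues else "Looks fine, go ahead and download.")
```

If the archives are deleted as soon as they are extracted, the space requirement drops accordingly, so treat the number as a knob rather than a rule.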

The Conda Environment Incident

Here is the part that cost more time than it should have. I opened Jupyter, tried to import pandas, and got a ModuleNotFoundError. I checked the installation, and pandas was definitely installed. I reinstalled it. Still nothing. I questioned my entire setup.

The problem was simple and obvious in hindsight: I was not running Jupyter from the sightx Conda environment. Every library I had carefully installed into that isolated environment was sitting there, unused, while I ran Jupyter from the system Python installation that knew nothing about them. The fix was one line:

conda activate sightx

And suddenly, every import worked. This is what environment isolation actually means and why forgetting to activate the right one turns into a debugging session that should not exist.
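A quick way to catch this earlier is to ask Python which interpreter it is actually running. The cell below is a minimal sketch, assuming the environment was created by name (so its folder is called sightx); the in_expected_env helper is hypothetical and just compares the last folder of the interpreter prefix with the expected name.

```python
import os
import sys
from pathlib import Path


def in_expected_env(env_name: str, prefix: str = sys.prefix) -> bool:
    """True if the interpreter prefix ends in the given environment name.

    Assumption: named Conda environments usually live in <conda root>/envs/<name>,
    so the last path component matches the name. Environments created with
    an explicit --prefix may not follow this pattern.
    """
    return Path(prefix).name == env_name


# Where is this Python actually running from?
print("interpreter:", sys.executable)
# Set by `conda activate` in the shell that launched the process, if any.
print("CONDA_DEFAULT_ENV:", os.environ.get("CONDA_DEFAULT_ENV"))

if not in_expected_env("sightx"):
    print("Warning: this does not look like the sightx environment.")
```

Running this as the first cell of a notebook makes the mismatch visible before the first ModuleNotFoundError does.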

Understanding the Class Distribution: The 73% Problem

The first exploration notebook loaded trainLabels.csv and counted how many images belonged to each severity grade. The result was not subtle.

The dataset is heavily imbalanced. Approximately 25,000 images are Grade 0 (no diabetic retinopathy): healthy retinas. Grade 2 sits around 5,000 images. Grades 1, 3, and 4 are all under 3,000 each. This is not a bug; it is reality. Most people screened for diabetic retinopathy do not have it.
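The counting step itself is small. Here is a sketch using only the standard library, assuming trainLabels.csv has an image column and a level column holding the grade (check the header of your copy). The rows in the demo are made up to show the output format; they are not the real distribution.

```python
import csv
import io
from collections import Counter


def grade_counts(csv_file) -> Counter:
    """Count rows per severity grade from an open CSV file.

    Assumption: the grade is stored in a column named "level".
    """
    reader = csv.DictReader(csv_file)
    return Counter(int(row["level"]) for row in reader)


def print_distribution(counts: Counter) -> None:
    """Print each grade with its image count and share of the total."""
    total = sum(counts.values())
    for grade in sorted(counts):
        share = counts[grade] / total
        print(f"Grade {grade}: {counts[grade]:>6} images ({share:.1%})")


# Made-up rows, for illustration only.
sample = io.StringIO(
    "image,level\n10_left,0\n10_right,0\n13_left,0\n15_left,1\n16_left,4\n"
)
print_distribution(grade_counts(sample))

# With the real file (path is an assumption about the folder layout):
# with open("data/trainLabels.csv", newline="") as f:
#     print_distribution(grade_counts(f))
```

In a notebook, the pandas equivalent is a one-liner with value_counts() on the same column; the version above just avoids depending on anything outside the standard library.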

But for a neural network, this is a trap. If the model learns to predict Grade 0 for every single input, it achieves roughly 73% accuracy without learning anything useful. The model is not detecting disease; it is exploiting the class distribution. The training loop will need class weighting or oversampling to force the model to actually learn the rarer, more severe grades. This discovery shapes every training decision that follows.
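To make the trap concrete, the sketch below computes two things from a table of per-grade counts: the accuracy of a model that always predicts the most common grade, and one common choice of class weights (inverse frequency, total / (number of classes × class count)). This is one possible weighting scheme, not necessarily the one the training loop will end up using, and the counts are toy numbers rather than the real dataset.

```python
def majority_baseline(counts: dict[int, int]) -> float:
    """Accuracy of always predicting the most frequent class."""
    return max(counts.values()) / sum(counts.values())


def inverse_frequency_weights(counts: dict[int, int]) -> dict[int, float]:
    """Weight each class by total / (num_classes * class_count).

    Rare classes get weights above 1, common classes below 1.
    """
    total = sum(counts.values())
    num_classes = len(counts)
    return {grade: total / (num_classes * n) for grade, n in counts.items()}


# Toy counts, for illustration only.
toy_counts = {0: 6, 1: 2, 2: 2}
print("majority baseline:", majority_baseline(toy_counts))
print("class weights:", inverse_frequency_weights(toy_counts))
```

Weights like these can be converted to a tensor and passed as the weight argument of PyTorch's CrossEntropyLoss, so mistakes on rare grades cost more than mistakes on Grade 0. Oversampling the rare grades is the alternative mentioned above, and the two can be compared once training starts.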

Seeing the Data: What Each Grade Actually Looks Like

Numbers in a CSV are abstract. Retinal images are not. The second exploration step visualised two sample images from each grade to see what the model will actually be trained to distinguish.

The differences between grades are subtle, especially between Grade 0 and Grade 1. A healthy retina has clear blood vessels radiating from the optic disc, consistent colour, and no visible lesions. Grade 1 introduces a few microaneurysms: tiny red dots that are easy to miss if you are not looking for them. Grade 2 shows more microaneurysms and early haemorrhaging. Grade 3 is where the damage becomes visually obvious: blocked vessels, significant haemorrhaging, darkened regions. Grade 4 introduces neovascularization: abnormal new blood vessels growing in response to oxygen deprivation, which show up as chaotic, irregular patterns overlaying the normal vascular structure.
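For anyone wanting to reproduce the grid, here is roughly how the two-per-grade view can be put together. It is a sketch, not the exact notebook code: pick_samples and show_grid are hypothetical helpers, and the data/train folder and .jpeg extension are assumptions about how the images were unzipped. The plotting part needs matplotlib (with Pillow installed for JPEG support), which is imported only when the grid is actually drawn.

```python
import random
from collections import defaultdict
from pathlib import Path


def pick_samples(labels, per_grade=2, seed=0):
    """Pick up to per_grade image names for each grade.

    labels: iterable of (image_name, grade) pairs.
    Returns {grade: [image_name, ...]} with grades in ascending order.
    """
    by_grade = defaultdict(list)
    for name, grade in labels:
        by_grade[grade].append(name)
    rng = random.Random(seed)  # fixed seed so the same images come back each run
    return {
        grade: rng.sample(names, min(per_grade, len(names)))
        for grade, names in sorted(by_grade.items())
    }


def show_grid(samples, image_dir="data/train", ext=".jpeg"):
    """Draw one row per grade, one column per sampled image."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    ncols = max(len(names) for names in samples.values())
    fig, axes = plt.subplots(
        len(samples), ncols, figsize=(4 * ncols, 4 * len(samples)), squeeze=False
    )
    for row, (grade, names) in enumerate(samples.items()):
        for col in range(ncols):
            ax = axes[row][col]
            ax.axis("off")
            if col < len(names):
                path = Path(image_dir) / f"{names[col]}{ext}"
                ax.imshow(plt.imread(str(path)))
                ax.set_title(f"Grade {grade}: {names[col]}")
    plt.tight_layout()
    plt.show()
```

Calling pick_samples on the (image, grade) pairs from the labels file and passing the result to show_grid gives a five-row grid, one row per grade. Changing the seed is a cheap way to look at a different pair of images per grade.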

What stood out immediately was the variability within each grade. Lighting conditions differ. Some images are overexposed, some underexposed. Some cameras captured the full circular field, others cut off the edges. This is messy, real-world data. The preprocessing pipeline will have to handle all of it.

What did we learn?

Three critical insights came out of this exploration:

Class imbalance is severe: Training without addressing it will produce a model that predicts Grade 0 for everything and calls it 73% accurate.
Visual differences between adjacent grades are subtle: Grade 0 versus Grade 1 is not obvious, which means the model will need significant depth and careful training to distinguish them reliably.
Image quality varies wildly: Preprocessing will need to normalise exposure, crop to consistent dimensions, and potentially filter out the worst-quality images entirely.

MPS acceleration is confirmed on the M4. The dataset is downloaded, unzipped, and explored. The Conda environment is finally being used correctly.

What is next? Building the preprocessing pipeline that transforms these 100GB of raw retinal scans into something ResNet50 can actually train on.
