Histopathologic cancer detection as image classification using PyTorch
A few weeks ago another Kaggle (playground) competition ended. Despite the lack of any reward (neither money nor Kaggle medals), a decent number of people participated. Alex Donchuk and I got 15th place, which is quite high given that some participants found the ground-truth labels and submitted them, so I want to share some ideas about this competition.
The idea of the competition was to classify small pathology images, in other words to label images that contain tumor tissue as “1” and images without tumor as “0”. I guess this can be an easy task for an expert, but for me it was impossible to distinguish normal and cancer images.
As for the data, participants were provided with small (96x96 pixel) patches extracted from whole slides, which made the data quite easy to handle: whole-slide images are huge, usually with several levels of resolution, and they can be quite tricky to work with. Fortunately, here we had a lot of small images, so we could focus on different tricks and models instead of spending time on complicated pipelines for data loading and preprocessing. As the authors of this dataset put it,
[PCam] packs the clinically-relevant task of metastasis detection into a straight-forward binary image classification task, akin to CIFAR-10 and MNIST. Models can easily be trained on a single GPU in a couple hours, and achieve competitive scores in the Camelyon16 tasks of tumor detection and whole-slide image diagnosis. Furthermore, the balance between task-difficulty and tractability makes it a prime suspect for fundamental machine learning research on topics as active learning, model uncertainty, and explainability.
Classes were almost perfectly balanced (nothing like a 10x difference), so we could expect plain binary cross-entropy to work, though focal loss and class weights could also be tested.
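As a quick illustration of the options mentioned above, here is a minimal sketch of the loss setup (the focal-loss variant is just one possible formulation, not necessarily what was used in the final models):

```python
import torch
import torch.nn as nn

# Plain binary cross-entropy on logits; with nearly balanced classes
# pos_weight can simply be left out.
criterion = nn.BCEWithLogitsLoss()

# With a larger imbalance, class weights would be a one-liner:
# criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([neg_count / pos_count]))

def focal_loss(logits, targets, gamma=2.0):
    """A simple focal-loss variant for binary labels (one of the alternatives above)."""
    bce = nn.functional.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                     # probability assigned to the true class
    return ((1 - p_t) ** gamma * bce).mean()  # down-weight easy, well-classified examples
```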
ROC-AUC was used as the metric, so the output was the probability of cancer presence. Here I can’t help but notice that this metric would be of little use to a clinician and is really only good for competitions. In a real setting we would also need to think about false negatives, since missing a tumor is extremely costly. But anyway, ROC-AUC was used here.
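For reference, the competition metric is easy to compute locally with scikit-learn (`val_labels` and `val_probs` are placeholders for the ground-truth labels and predicted probabilities):

```python
from sklearn.metrics import roc_auc_score

# val_labels: ground-truth 0/1 labels, val_probs: predicted probabilities of cancer
auc = roc_auc_score(val_labels, val_probs)
print(f"Validation ROC-AUC: {auc:.4f}")
```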
Since we had a lot of data, I made a dataset loader which reads files from disk, performs augmentations and sends them to the network.
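Roughly, the loader looked like the sketch below: a minimal version assuming the Kaggle layout of .tif patches plus a CSV with `id` and `label` columns; `train_df`, `IMG_DIR` and the `augment` callable are placeholders:

```python
import os
import numpy as np
from PIL import Image
import torch
from torch.utils.data import Dataset, DataLoader

class CancerPatchDataset(Dataset):
    """Reads patches from disk, applies augmentations and returns tensors."""

    def __init__(self, df, img_dir, augment=None):
        self.df = df.reset_index(drop=True)
        self.img_dir = img_dir
        self.augment = augment

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img = np.array(Image.open(os.path.join(self.img_dir, row["id"] + ".tif")))
        if self.augment is not None:
            img = self.augment(image=img)["image"]   # albumentations-style call, as an example
        img = torch.from_numpy(img).permute(2, 0, 1).float() / 255.0
        label = torch.tensor(row["label"], dtype=torch.float32)
        return img, label

# train_df / IMG_DIR are placeholders for the competition files:
# train_loader = DataLoader(CancerPatchDataset(train_df, IMG_DIR, augment),
#                           batch_size=64, shuffle=True, num_workers=4)
```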
I was interested in how different models would perform here, so I trained several of them with the same parameters (number of epochs, learning rate and learning rate scheduler). This may not be the best way to compare them (a model’s performance can change drastically with optimal hyperparameters), but at least a clear trend emerged. For this I took ImageNet-pretrained models, split the data into 10 folds and used Adam with a learning rate of 3e-4 and otherwise default parameters. After 5 epochs without improvement of the validation loss, the learning rate was halved.
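In PyTorch this training setup boils down to a few lines (a sketch; `model` and `val_loss` are placeholders):

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)   # otherwise default betas/eps
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=5)          # halve LR after 5 epochs w/o improvement

# inside the training loop, after each validation pass:
# scheduler.step(val_loss)
```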
Several things are clear from here. First, resnet152, resnet101, densenet169 and densenet201 quickly overfit, which could mean either that we don’t have enough data to train them or that a better set of hyperparameters and a better training scheme is needed, such as warm-up, one-cycle or cyclic learning rates, or restarts. Anyway, seresnet34, densenet121 and resnet34 (not shown here) performed best, so I focused mostly on them.
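For the torchvision models, adapting an ImageNet-pretrained network to this binary task just means swapping the classification head; a minimal sketch for resnet34 (densenet121 is analogous, while seresnet34 needs an external implementation):

```python
import torch.nn as nn
from torchvision import models

model = models.resnet34(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 1)   # single logit for the binary task

# densenet121 analogue:
# model = models.densenet121(pretrained=True)
# model.classifier = nn.Linear(model.classifier.in_features, 1)
```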
Here I have to note that this test was done before the whole-slide IDs were released, so images from the same slide ended up in both train and validation. This is a leak and led to overly optimistic validation, so the gap between my public LB score and my validation score was huge. After that I split the data using the whole-slide IDs as groups, which gave me correct validation, and the gap decreased.
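With the slide IDs available, the grouped split takes a few lines with scikit-learn; a sketch assuming `train_df` has a `wsi` column holding the slide ID of each patch (the column name is hypothetical):

```python
from sklearn.model_selection import GroupKFold

gkf = GroupKFold(n_splits=10)
for fold, (train_idx, val_idx) in enumerate(
        gkf.split(train_df, train_df["label"], groups=train_df["wsi"])):
    # all patches from one whole slide end up on the same side of the split
    train_fold, val_fold = train_df.iloc[train_idx], train_df.iloc[val_idx]
```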
Because of this leak, which led to overfitting, the correlation between local validation and the public score wasn’t that high. This just shows that even if performance on your local validation is good, you need to be very careful: you may have a leak in your data, and then even 10-fold cross-validation won’t help.
Overall, since I had trained several models, the idea was simply to blend them together. Ideally you want uncorrelated models for that, so I looked at the correlations between model predictions, where for the same model I had several versions of predictions, including TTA (test-time augmentation), retrained models (models I trained again with a lower LR) and models trained on the original image size (96x96 instead of resized to 224x224). Here we see that in general the predictions are highly correlated, but that’s life.
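The blending itself was nothing fancy; conceptually it boils down to something like the sketch below (the column names and `*_probs` arrays are placeholders for whatever predictions were collected):

```python
import pandas as pd

# each column holds one model's (or TTA/retrained variant's) predicted probabilities;
# the *_probs arrays are placeholders
preds = pd.DataFrame({
    "resnet34": resnet34_probs,
    "densenet121": densenet121_probs,
    "seresnet34": seresnet34_probs,
})

print(preds.corr())          # correlation matrix of the predictions

blend = preds.mean(axis=1)   # simple average blend; rank averaging is another option
```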
As for TTA, I used 8 additional images during inference, the dihedral group D4 plus 4 random augmentations (like here), which added quite a lot to the score.
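One way to implement the dihedral part of such a TTA is with tensor flips and 90° rotations; a rough sketch that averages sigmoid outputs over the 8 D4 transforms of a batch (the random augmentations mentioned above are omitted):

```python
import torch

def d4_tta_predict(model, images):
    """Average sigmoid predictions over the 8 dihedral (D4) transforms of a batch."""
    model.eval()
    preds = []
    with torch.no_grad():
        for flip in (False, True):
            x = torch.flip(images, dims=[3]) if flip else images   # horizontal flip
            for k in range(4):                                      # 0/90/180/270 degree rotations
                out = model(torch.rot90(x, k, dims=[2, 3]))
                preds.append(torch.sigmoid(out))
    return torch.stack(preds).mean(dim=0)
```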
All in all, this competition was quite useful, since it showed the importance of a correct validation scheme. There are many things which could still be done with this dataset, such as:
- Correct comparison of the models: using whole-slide IDs for the fold split
- Finding optimal hyperparameters for the models and testing different learning rate schemes, such as one-cycle vs. cyclic, etc.
- Using additional information from the images for splits/training, such as stratification by staining type/brightness
- Finding an optimal set of augmentations which helps the models generalize
- Metric learning, which is now widely used for classification: can it be successfully used here?
- Explainability of predictions: why were images assigned to their class?
- Pseudo-labeling: can we use it here to get even more training data?
All of this can still be tested here, so the dataset is really useful for exploring topics like these.
I have also made a repo with a minimal example: correct training of resnet34 with several tricks and a correct validation split.