Data Science Bowl 2017 3rd Place Winning Model

Summary

We used parts of our own pipeline code, which builds the models with TensorFlow. The pipeline has four steps:

  • The first step takes the DICOM scan and rescales everything to a normalized resolution and orientation. We used two resolutions: 2.5x0.512x0.512 for the detection and 1.25x0.5x0.5 for the final model.
  • The second step uses a fully convolutional ResNet to detect, for each pixel, whether it lies at the center of a nodule. We trained it on the LIDC/IDRI dataset. We trained two of these models: one for normal-sized nodules and one for masses. We annotated the masses in the Kaggle training data and trained the mass network on masses from both LIDC/IDRI and Kaggle.
  • The third step takes the logit output of that network for the whole volume and thresholds it to determine candidates. It also masks out nodules outside the lung.
  • The fourth step takes the candidates and trains a multi-task model on some attributes of the LIDC dataset (malignancy, etc.) together with the cancer label for the Kaggle scans.

It takes about 3-5 days to train the models on decent hardware.
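The first and third steps can be illustrated with a minimal sketch. This is not the pipeline code from this repository: the function names, the (z, y, x) axis order, the trilinear interpolation, the logit threshold of 0 and the use of connected components are all assumptions made for illustration.

```python
import numpy as np
from scipy import ndimage


def resample_volume(volume, spacing, target_spacing=(2.5, 0.512, 0.512)):
    """Resample a (z, y, x) volume from `spacing` to `target_spacing`.

    Both spacings are voxel sizes per axis. The default target is the
    detection resolution quoted above; the axis order is an assumption.
    """
    zoom_factors = (np.asarray(spacing, dtype=float)
                    / np.asarray(target_spacing, dtype=float))
    # order=1 selects linear interpolation (an assumption, not the
    # interpolation used by the original pipeline).
    return ndimage.zoom(volume, zoom_factors, order=1)


def find_candidates(logits, lung_mask, threshold=0.0):
    """Return (z, y, x) centers of above-threshold regions inside the lung.

    `logits` is the per-voxel network output and `lung_mask` is a boolean
    array of the same shape. A logit of 0 corresponds to a sigmoid
    probability of 0.5; the threshold actually used is not documented here.
    """
    above = (logits > threshold) & lung_mask
    labels, count = ndimage.label(above)
    if count == 0:
        return []
    centers = ndimage.center_of_mass(
        above.astype(float), labels, np.arange(1, count + 1))
    return [tuple(float(c) for c in center) for center in centers]
```

Each returned center can then be used to crop a patch around the candidate for the fourth step.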

Features

Because there was so little labeled data, we decided to first detect lung nodules to reduce the input, and then to use the malignancy and other attributes of the LIDC/IDRI nodules to improve our model.

We found the most important nodule features to be the nodule radius, the height of the nodule in the lung, malignancy, texture, calcification and spiculation. We applied linear augmentations to all of our model inputs.
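As an illustration of what a linear augmentation of a 3D input patch might look like, here is a minimal sketch. It assumes that "linear" means a random rotation, scaling and mirroring about the patch center. The function name, the parameter ranges and the choice of an in-plane rotation are hypothetical and not taken from the original code.

```python
import numpy as np
from scipy import ndimage


def random_linear_augment(patch, rng, max_scale=0.1, max_angle_deg=15.0):
    """Randomly rotate, scale and mirror a (z, y, x) patch about its center."""
    angle = np.deg2rad(rng.uniform(-max_angle_deg, max_angle_deg))
    scale = 1.0 + rng.uniform(-max_scale, max_scale)
    cos, sin = np.cos(angle), np.sin(angle)
    # Linear map in (z, y, x) order: rotation within the (y, x) plane
    # combined with an isotropic scale.
    matrix = scale * np.array([[1.0, 0.0, 0.0],
                               [0.0, cos, -sin],
                               [0.0, sin, cos]])
    if rng.random() < 0.5:
        matrix[:, 2] *= -1.0  # mirror along the x axis
    # affine_transform maps output coordinates to input coordinates, so the
    # offset is chosen to keep the patch center fixed.
    center = (np.asarray(patch.shape) - 1) / 2.0
    offset = center - matrix @ center
    return ndimage.affine_transform(
        patch, matrix, offset=offset, order=1, mode="nearest")
```

A call such as `random_linear_augment(patch, np.random.default_rng(0))` returns a new array with the same shape as `patch`; passing a seeded generator keeps the augmentation reproducible.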

Training

We used TensorFlow to build our networks. The final model is an ensemble: the final probabilities of the individual models are simply averaged per scan.
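The per-scan averaging can be sketched in a few lines. The function name and the input layout (one `(scan_id, probability)` pair per model and scan) are assumptions for illustration, not the format used by the original code.

```python
from collections import defaultdict


def average_per_scan(predictions):
    """Average the predicted cancer probability of each scan over all models.

    `predictions` is an iterable of (scan_id, probability) pairs, with one
    pair per ensemble member and scan.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for scan_id, probability in predictions:
        totals[scan_id] += probability
        counts[scan_id] += 1
    return {scan_id: totals[scan_id] / counts[scan_id] for scan_id in totals}


ensemble_output = average_per_scan([
    ("scan_a", 0.25),   # model 1
    ("scan_a", 0.75),   # model 2
    ("scan_b", 0.125),  # only one model scored this scan
])
print(ensemble_output)  # {'scan_a': 0.5, 'scan_b': 0.125}
```

An unweighted mean treats every ensemble member equally, which matches the "simply averaged" description above.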

Dependencies

We ran everything on Ubuntu Linux. We used the following software:

  • python 3.4+ (!)
  • tensorflow 1.1 (!)
  • opencv 3.1+
  • scipy 0.17.0 (!)
  • numpy (latest)
  • yaml (latest)
  • scikit-learn (latest)
  • pydicom (latest)
  • simpleitk (latest)
  • pandas (latest)
  • pycuda (latest)

(!) = version matters, + = or higher

We used the following hardware:

  • 4 K80 GPUs for everything but the final model
  • 8 GPUs for the final model

It takes about 3-5 days to run everything (infer+train) on a decent machine with 8 GPUs.

Reproduction

Please read run.txt.

Copyright (C) 2017 Aidence B.V.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA

References

https://wiki.cancerimagingarchive.net/display/Public/LIDC-IDRI

https://luna16.grand-challenge.org/data/

http://aidence.com/