
The Deep-Learning Framework

Cifar-10 Example

Cifar-10 Dataset

The Cifar-10 dataset consists of 60000 color images with a size of 32x32 pixels. It is a perfect dataset for trying out different training approaches, since due to its small size a training run can be completed within a few minutes.

The 60000 images are equally distributed over 10 image categories, which means every category consists of 6000 images. The categories are "airplane", "automobile", "bird", "cat", "deer", "dog", "frog", "horse", "ship" and "truck". The goal of a cifar-10 network is to predict the correct category for a given input image.

50000 images are used for training the network and the remaining 10000 images are used for validation.

Model description

This example uses a neural network with 13 layers: 6 convolutional layers, 3 pooling layers and 4 fully connected (dense) layers. The following listing shows the code of this network:

    with Seq.addLayerName("Conv"):
      Layer = Seq.add(dl.layer.Conv2D_BN_ReLU(3, 128, Name="C1"))
      Layer = Seq.add(dl.layer.Conv2D_BN_ReLU(3, 128, Name="C2"))
      Layer = Seq.add(dl.layer.MaxPooling(3, 2))

    with Seq.addLayerName("Conv"):
      Layer = Seq.add(dl.layer.Conv2D_BN_ReLU(3, 128, Name="C1"))
      Layer = Seq.add(dl.layer.Conv2D_BN_ReLU(3, 128, Name="C2"))
      Layer = Seq.add(dl.layer.MaxPooling(3, 2))

    with Seq.addLayerName("Conv"):
      Layer = Seq.add(dl.layer.Conv2D_BN_ReLU(3, 128, Name="C1"))
      Layer = Seq.add(dl.layer.Conv2D_BN_ReLU(3, 128, Name="C2"))
      Layer = Seq.add(dl.layer.MaxPooling(3, 2))

    with Seq.addLayerName("Dense"):
      Layer = Seq.add(dl.layer.Dense_BN_ReLU(1024))
      Layer = Seq.add(dl.layer.Dropout(0.5))

    with Seq.addLayerName("Dense"):
      Layer = Seq.add(dl.layer.Dense_BN_ReLU(256))

    with Seq.addLayerName("Dense"):
      Layer = Seq.add(dl.layer.Dense_BN_ReLU(64))

    with Seq.addLayerName("Dense"):
      Layer = Seq.add(dl.layer.Dense(OutputNodes))

The first 3 layer groups each use two convolutional layers with a 3x3 kernel and 128 filters, combined with batch-normalization and a ReLU activation function. After the two convolutional layers, a max-pooling layer with a 3x3 window and stride 2 follows. This scheme is repeated 3 times.

Afterwards a fully connected layer with 1024 nodes is used, again combined with batch-normalization and a ReLU activation function. Furthermore a dropout layer with a keep-ratio of 0.5 is added to the network.

Two further fully connected layers with 256 and 64 nodes follow. These layers are also combined with batch-normalization and a ReLU activation function.

The last layer is a fully connected layer with 10 nodes and without any activation function. The activation function of the output layer is contained in the loss function (softmax combined with cross-entropy loss).

The Loss and Error Functions

The loss function of the cifar-10 model applies a softmax to the 10 output values (one probability per category) and combines it with a cross-entropy loss. The weight decay of all weights and biases is added to this value to act as a regularization term.

    SampleCrossEntropy = tf.nn.softmax_cross_entropy_with_logits(
        labels=OneHotLabels, logits=Output, name="SoftmaxLoss")

    Loss = tf.reduce_mean(SampleCrossEntropy) + WeightDecayTerm * Lambda
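
Written as a formula (with N the batch size, y_{i,c} the one-hot labels and z_i the output logits, and assuming that WeightDecayTerm is the sum of the squared parameters w_j, which the listing above does not show), this amounts to:

    \mathrm{Loss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{10} y_{i,c} \log\big(\mathrm{softmax}(z_i)_c\big) + \lambda \sum_j w_j^2

Note that train.cfg below sets WeightDecay to 0.000, so the regularization term effectively vanishes in this example.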

The error function is 0 for the right category and 1 for a wrong category. Thus an error value of 0.2 means that 80% of the tested images were categorized correctly and 20% of the images were categorized wrongly.
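
The wiki does not show the implementation of this error measurement; a minimal TensorFlow 1.x sketch of such a 0/1 classification error, assuming Output holds the logits of the last layer and Labels the integer category indices, could look like this:

    # Sketch only: Output (Batch, 10) are logits, Labels (Batch,) are
    # integer category indices.
    OutputClass = tf.argmax(Output, axis=1)       # predicted category per sample
    IsWrong = tf.not_equal(OutputClass, tf.cast(Labels, tf.int64))
    SampleError = tf.cast(IsWrong, tf.float32)    # 0 = correct, 1 = wrong
    Error = tf.reduce_mean(SampleError)           # e.g. 0.2 -> 20% misclassified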

Setup Cifar-10 Data

  • First download the cifar-10 binary version from the cifar-10 homepage.

  • Extract the files from the archive and copy them to your hard disk. The files data_batch_*.bin should be copied to a training directory and the file test_batch.bin should be copied to a validation directory (see the sketch after this list for the record layout of these files).

  • The directory python/scripts/cifar of the repository contains all script files for the cifar-10 model. You need to adapt the files train.cfg and eval.cfg so that they point to the correct paths of your training and validation data.
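
For reference, each record in these binary files consists of 1 label byte followed by 3072 pixel bytes (1024 red, 1024 green and 1024 blue values in row-major order), as documented on the cifar-10 homepage. The framework reads these files itself, but a small NumPy sketch (the helper name read_cifar10_batch is made up for illustration) shows how one batch file decodes:

    import numpy as np

    def read_cifar10_batch(Filename):
      # Each record is 3073 bytes: 1 label byte + 3072 pixel bytes.
      Raw = np.fromfile(Filename, dtype=np.uint8).reshape(-1, 3073)
      Labels = Raw[:, 0]
      # Pixels are stored channel-wise; reorder to (N, 32, 32, 3).
      Images = Raw[:, 1:].reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)
      return Images, Labels  # e.g. (10000, 32, 32, 3) and (10000,)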

train.cfg:

{
  "Data": {
    "BatchSize": 64,
    "ImageHeight": 32,
    "ImageWidth": 32,
    "TrainingPath": "<path-to-training-files>",
    "ValidatingPath": "<path-to-validation-files>"
  },
  "Optimizer": {
    "EpochsPerDecay": 30,
    "LearnRateDecay": 0.5,
    "StartingLearningRate": 0.003,
    "WeightDecay": 0.000,
    "Momentum": 0.9
  },
  "Trainer": {
    "CheckpointEpochs": 2,
    "CheckpointPath": "Checkpoint",
    "EpochSize": 50000,
    "NumberOfEpochs": 120,
    "SummaryPath": "Summary"
  },
  "Validation": {
    "Samples": 10000
  },
  "PreProcessing": {
    "MeanFile": "image-mean.tfrecord"
  }
}
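
The Optimizer settings are not explained further here, but the field names suggest a stepped exponential schedule in which the learning rate is multiplied by LearnRateDecay every EpochsPerDecay epochs. Under that (unconfirmed) assumption, the schedule of this configuration would be:

    # Assumed interpretation of the Optimizer settings in train.cfg:
    # multiply the learning rate by LearnRateDecay every EpochsPerDecay epochs.
    def learning_rate(Epoch, Start=0.003, Decay=0.5, EpochsPerDecay=30):
      return Start * Decay ** (Epoch // EpochsPerDecay)

    # Epochs 0-29: 0.003, epochs 30-59: 0.0015, epochs 60-89: 0.00075, ...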

eval.cfg:

{
  "Data": {
    "BatchSize": 64,
    "ImageHeight": 32,
    "ImageWidth": 32,
    "ValidatingPath": "<path-to-validation-files>"
  },
  "Evaluator": {
    "CheckpointPath": "Checkpoint",
    "EpochSize": 1000,
    "NumberOfEpochs": 10
  },
  "PreProcessing": {
    "MeanFile": "image-mean.tfrecord"
  }
}
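
With an EpochSize of 1000 samples and NumberOfEpochs set to 10, the evaluation processes 10 x 1000 = 10000 images, which corresponds to the complete validation set.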

Training of Cifar-10

To train the cifar-10 model, simply start the train.py script inside the python/scripts/cifar path of the repository. Depending on your hardware, this training takes from a few minutes to several hours. On a GTX 1080 Ti the training can be finished within around 60 minutes.

You can decrease the training time by reducing the number of training epochs. This can be done by changing the value NumberOfEpochs inside the train.cfg file. You should train for at least 30 epochs to obtain a sufficiently trained model.
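
With an EpochSize of 50000 samples and a BatchSize of 64, one training epoch corresponds to 50000 / 64 = 781.25, rounded up to 782 iterations, which matches the "782 iterations per epoch" line in the log below.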

cd <repository-path>/python/scripts/cifar

python train.py

The output on the console should look like this:

***** VIDEOINPUT LIBRARY - 0.1995 - TFW07 *****

Build File-Reader Graph:
* Training is enabled: True
Create Data-Reader for Training-Data:
* Create File Queue with 5 files.
* Perform data-augmentation
Read mean-image with shape (32, 32, 3)
* Perform per-pixel standardization
* Generate Input Batches...
* With Batch-Size: 64
* And Queue-Size: 1140
* Shuffle Data for Batching...
* Prepare Input Batch with Shape [64, 28, 28, 3]
* Prepare Input Batch with Shape [64]
Create Data-Reader for Validation-Data:
* Create File Queue with 1 files.
Read mean-image with shape (32, 32, 3)
* Perform per-pixel standardization
* Generate Input Batches...
* With Batch-Size: 64
* And Queue-Size: 1140
* Shuffle Data for Batching...
* Prepare Input Batch with Shape [64, 28, 28, 3]
* Prepare Input Batch with Shape [64]
* Input-Image has shape (64, 28, 28, 3)
* Input-Label has shape (64,)
* Enable Data Preprocessing
Create Network for State 0
* Do not store Histograms
* Store Output as Text
* Store Feature Maps
* Store the sparsity of parameters
Creating network Graph...
* network Input-Shape: (64, 28, 28, 3)
* Apply sequence of 28 layers with name "Network":
  *** Layer: Conv_1 ***
    * Apply sequence of 4 layers with name "C1":
      * Apply layer "Conv2D"
        * Kernel 3x3
        * Stride 1x1
        * Padding SAME
        * Kernel-Initializer: XavierInitializerConv()
        * with Output-Shape (64, 28, 28, 128) without Bias
      * Batch-Normalization
      * ReLU Activation function
      * Log Featute Map in summary
    * Apply sequence of 4 layers with name "C2":
      * Apply layer "Conv2D"
        * Kernel 3x3
        * Stride 1x1
        * Padding SAME
        * Kernel-Initializer: XavierInitializerConv()
        * with Output-Shape (64, 28, 28, 128) without Bias
      * Batch-Normalization
      * ReLU Activation function
      * Log Featute Map in summary
    * Apply layer "Pooling"
      * Pooling-Type: MAX
      * Pooling-Window: 3
      * Stride: 2
      * Padding: SAME
      * Output-Shape: (64, 14, 14, 128)
  *** Layer: Conv_2 ***
    * Apply sequence of 4 layers with name "C1":
      * Apply layer "Conv2D"
        * Kernel 3x3
        * Stride 1x1
        * Padding SAME
        * Kernel-Initializer: XavierInitializerConv()
        * with Output-Shape (64, 14, 14, 128) without Bias
      * Batch-Normalization
      * ReLU Activation function
      * Log Featute Map in summary
    * Apply sequence of 4 layers with name "C2":
      * Apply layer "Conv2D"
        * Kernel 3x3
        * Stride 1x1
        * Padding SAME
        * Kernel-Initializer: XavierInitializerConv()
        * with Output-Shape (64, 14, 14, 128) without Bias
      * Batch-Normalization
      * ReLU Activation function
      * Log Featute Map in summary
    * Apply layer "Pooling"
      * Pooling-Type: MAX
      * Pooling-Window: 3
      * Stride: 2
      * Padding: SAME
      * Output-Shape: (64, 7, 7, 128)
  *** Layer: Conv_3 ***
    * Apply sequence of 4 layers with name "C1":
      * Apply layer "Conv2D"
        * Kernel 3x3
        * Stride 1x1
        * Padding SAME
        * Kernel-Initializer: XavierInitializerConv()
        * with Output-Shape (64, 7, 7, 128) without Bias
      * Batch-Normalization
      * ReLU Activation function
      * Log Featute Map in summary
    * Apply sequence of 4 layers with name "C2":
      * Apply layer "Conv2D"
        * Kernel 3x3
        * Stride 1x1
        * Padding SAME
        * Kernel-Initializer: XavierInitializerConv()
        * with Output-Shape (64, 7, 7, 128) without Bias
      * Batch-Normalization
      * ReLU Activation function
      * Log Featute Map in summary
    * Apply layer "Pooling"
      * Pooling-Type: MAX
      * Pooling-Window: 3
      * Stride: 2
      * Padding: SAME
      * Output-Shape: (64, 4, 4, 128)
  *** Layer: Dense_4 ***
    * Apply sequence of 3 layers with name "Dense_BN_ReLU":
      * Apply layer "Dense"
        * Reshape layer input (64, 4, 4, 128) to vector with 2048 elements.
        * with 1024 Output-Nodes without Bias
        * Weight-Initializer: XavierInitializer()
        * Output-Shape: (64, 1024)
      * Batch-Normalization
      * ReLU Activation function
    * Dropout with keep ratio 0.5
  *** Layer: Dense_5 ***
    * Apply sequence of 3 layers with name "Dense_BN_ReLU":
      * Apply layer "Dense"
        * with 256 Output-Nodes without Bias
        * Weight-Initializer: XavierInitializer()
        * Output-Shape: (64, 256)
      * Batch-Normalization
      * ReLU Activation function
  *** Layer: Dense_6 ***
    * Apply sequence of 3 layers with name "Dense_BN_ReLU":
      * Apply layer "Dense"
        * with 64 Output-Nodes without Bias
        * Weight-Initializer: XavierInitializer()
        * Output-Shape: (64, 64)
      * Batch-Normalization
      * ReLU Activation function
  *** Layer: Dense_7 ***
    * Apply layer "Dense"
      * with 10 Output-Nodes
      * Weight-Initializer: XavierInitializer()
      * Bias-Decay: 0.0
      * Bias-Initializer: ConstantInitializer(0.0)
      * Output-Shape: (64, 10)
* network Output-Shape: (64, 10)
Finished to build network with 3125514 trainable variables in 47 tensors.
Create Cross-Entropy Loss Function...
* Label Shape: (64,)
* OneHot Label Shape: (64, 10)
* Output Shape: (64, 10)
* Sample Loss Shape: (64,)
Create Error-Measurement Function...
 * Output-Class Shape: (64,)
 * Sample Classification Error Shape: (64,)
2017-08-16 10:40:37.205752: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc] The TensorFlow library wasn't compiled to use SSE instructions, but these are available on your machine and could speed up CPU computations.
2017-08-16 10:40:37.205822: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc] The TensorFlow library wasn't compiled to use SSE2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-16 10:40:37.206832: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-16 10:40:37.207093: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-16 10:40:37.208154: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-16 10:40:37.208434: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-08-16 10:40:37.208811: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-16 10:40:37.209018: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-08-16 10:40:37.584027: I c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\common_runtime\gpu\gpu_device.cc] Found device 0 with properties:
name: GeForce GTX 1080 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.721
pciBusID 0000:01:00.0
Total memory: 11.00GiB
Free memory: 9.12GiB
2017-08-16 10:40:37.585346: I c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\common_runtime\gpu\gpu_device.cc] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0)
Apply individual learning rate scales...
Current Model has 9368097.0 parameters in 108 trainable tensors.
Init network variables by random-values...
Store tensorboard summary at directory Summary\run_1
Store training settings in file Checkpoint\State_0\train.cfg
Run training for 120 epochs beginning with epoch 0 and 782 iterations per epoch.
       0: [Training]   Progress Epoch   0/120 - Loss: 2.62843 Error: 87.50% (2.070 s/Epoch, 30.918 Samples/s)
       0: [Validation] Progress Epoch   0/120 - Loss: 2.30256 Error: 89.97%
     782: [Training]   Progress Epoch   1/120 - Loss: 1.06218 Error: 42.19% (21.799 s/Epoch, 2295.879 Samples/s)
     782: [Validation] Progress Epoch   1/120 - Loss: 19.87696 Error: 90.06%
    1564: [Training]   Progress Epoch   2/120 - Loss: 1.17233 Error: 35.94% (27.395 s/Epoch, 1826.903 Samples/s)
    1564: [Validation] Progress Epoch   2/120 - Loss: 6.99007 Error: 89.82%
Store current model as checkpoint: Checkpoint\State_0\model_2.ckpt
WARNING:tensorflow:Error encountered when serializing __tensorboard_plugin_asset__tensorboard_text.
Type is unsupported, or the types of the items don't match field type in CollectionDef.
'TextSummaryPluginAsset' object has no attribute 'name'

...

   93058: [Training]   Progress Epoch 119/120 - Loss: 0.00036 Error: 0.00% (31.670 s/Epoch, 1580.301 Samples/s)
   93058: [Validation] Progress Epoch 119/120 - Loss: 0.47695 Error: 8.79%
   93840: [Training]   Progress Epoch 120/120 - Loss: 0.01947 Error: 1.56% (27.974 s/Epoch, 1789.102 Samples/s)
   93840: [Validation] Progress Epoch 120/120 - Loss: 0.46359 Error: 8.70%
Store current model as checkpoint: Checkpoint\State_0\model_120.ckpt
WARNING:tensorflow:Error encountered when serializing __tensorboard_plugin_asset__tensorboard_text.
Type is unsupported, or the types of the items don't match field type in CollectionDef.
'TextSummaryPluginAsset' object has no attribute 'name'
Training took 3483.801367521286s (0:58:3.8013675212860107)
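
Since the trainer stores a TensorBoard summary below the Summary directory (the SummaryPath value in train.cfg, see the "Store tensorboard summary at directory Summary\run_1" line above), the loss and error curves can be inspected during or after training with the standard TensorBoard command:

tensorboard --logdir Summary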

Evaluation

After successful training, you can start an evaluation run, which tests the trained model against the validation data. This is in general very fast.

python eval.py

The output should look like this:

***** VIDEOINPUT LIBRARY - 0.1995 - TFW07 *****

...

Init network variables by random-values...
Restore from Checkpoint Checkpoint\State_0\model_120.ckpt with Epoch-Number 120
Do not store any summary
Run evaluation for 10 epochs beginning with epoch 0 and 16 iterations per epoch.
      16: [Evaluation] Progress Epoch   1/ 10 - Loss: 0.27316 Error: 7.81% (0.949 s/Epoch, 1079.518 Samples/s)
      32: [Evaluation] Progress Epoch   2/ 10 - Loss: 0.96386 Error: 14.06% (0.421 s/Epoch, 2432.615 Samples/s)
      48: [Evaluation] Progress Epoch   3/ 10 - Loss: 0.33955 Error: 6.25% (0.433 s/Epoch, 2366.371 Samples/s)
      64: [Evaluation] Progress Epoch   4/ 10 - Loss: 0.81803 Error: 9.38% (0.439 s/Epoch, 2331.215 Samples/s)
      80: [Evaluation] Progress Epoch   5/ 10 - Loss: 0.23112 Error: 3.12% (0.436 s/Epoch, 2350.815 Samples/s)
      96: [Evaluation] Progress Epoch   6/ 10 - Loss: 0.36111 Error: 6.25% (0.436 s/Epoch, 2348.372 Samples/s)
     112: [Evaluation] Progress Epoch   7/ 10 - Loss: 0.47963 Error: 7.81% (0.441 s/Epoch, 2324.401 Samples/s)
     128: [Evaluation] Progress Epoch   8/ 10 - Loss: 0.42561 Error: 7.81% (0.447 s/Epoch, 2290.387 Samples/s)
     144: [Evaluation] Progress Epoch   9/ 10 - Loss: 0.12401 Error: 4.69% (0.422 s/Epoch, 2428.228 Samples/s)
     160: [Evaluation] Progress Epoch  10/ 10 - Loss: 0.20070 Error: 4.69% (0.435 s/Epoch, 2355.841 Samples/s)
Mean Absolute Error: 0.08
Full Summary:
 *  Error: 8.49%
Store results at file result.txt
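
The overall error of 8.49% on the validation data is consistent with the final validation error reported at the end of the training run (8.70% after epoch 120).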
