This repository contains the MATLAB source code for our MM'15 paper "Deep Multimodal Speaker Naming".

Project page:

How to use

Note: please set MATLAB's working folder to the base folder that contains this README. All the code mentioned below is under the folder applications/face-audio/; you will need to add that folder to the MATLAB path (right-click > Add to Path) before running the code.

Prepare face data:

  • Prepare the train/test file lists, e.g. train-file-list.txt and test-file-list.txt. Each row of the file has the following format: full-path-of-img label.
  • Run gen_face_data.m. This generates several train_%d and test .mat files, each of which contains sample (H x W x 3 x N) and tag (numClass x N).
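
The generated face files can be sanity-checked as follows. This is a minimal sketch assuming the variable names sample and tag described above, and a hypothetical file name train_1.mat produced by gen_face_data.m:

```matlab
% Load one generated face .mat file and verify its layout.
s = load('train_1.mat');   % hypothetical name of one generated sub-mat
size(s.sample)             % H x W x 3 x N image tensor
size(s.tag)                % numClass x N one-hot label matrix
% The number of images must match the number of label columns.
assert(size(s.sample, 4) == size(s.tag, 2));
```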

Prepare audio data:

  • Merge all audio clips per character across all videos, for both train and test: merge_audio_file.m.
  • Run gen_audio_data.m for both train and test. This generates audio_samples.mat for each, which contains sample (75 x N) and tag (numClass x N).
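
A quick check of the generated audio data, again assuming the variable names sample and tag given above:

```matlab
% Verify the 75-dimensional audio features and matching labels.
a = load('audio_samples.mat');
assert(size(a.sample, 1) == 75);              % 75-dim feature per sample
assert(size(a.sample, 2) == size(a.tag, 2));  % one label column per sample
[~, labels] = max(a.tag, [], 1);              % recover integer class labels
```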

Prepare face-audio test data:

  • Run gen_face_audio_data.m.

Train/test face-alone model:

  • Train: train_face_model.m.
  • Test: test_face_model.m.

Train/test face-audio model:

  • Train: train_face_audio_model.m.
  • Test: test_face_audio_model.m.
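
Assuming the scripts above take no arguments and read the prepared .mat files from the working folder, the face-alone and face-audio experiments can be run in order as:

```matlab
% Sketch of the run order; script names are from this README.
addpath('applications/face-audio');
train_face_model;        % train the face-alone model
test_face_model;         % evaluate it
train_face_audio_model;  % train the joint face-audio model
test_face_audio_model;   % evaluate it
```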

Train/test face-audio-audio/svm model:

  • Train
    • Merge all face training sub-mats into one: merge_face_submat_into_one.m.
    • Prepare face-audio-audio train/test data: gen_svm_face_audio_audio_train_data.m and gen_svm_face_audio_audio_test_data.m.
    • Train: train_svm_face_audio_audio.m.
  • Test
    • Prepare test data: gen_simulate_data.m or gen_simulate_data_voting_segment.m.
    • Test: test_face_audio_audio_model.m or test_face_audio_audio_model_2_models.m.
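
The steps above can be sketched as one workflow. This assumes the scripts take no arguments and that the face/audio data from the earlier sections has already been prepared:

```matlab
% Face-audio-audio/SVM workflow; all script names are from this README.
addpath('applications/face-audio');
merge_face_submat_into_one;           % merge face training sub-mats
gen_svm_face_audio_audio_train_data;  % build SVM training data
gen_svm_face_audio_audio_test_data;   % build SVM test data
train_svm_face_audio_audio;           % train the SVM model
gen_simulate_data;                    % or gen_simulate_data_voting_segment
test_face_audio_audio_model;          % or test_face_audio_audio_model_2_models
```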

Hardware/software requirements

  1. MATLAB R2014b or later, CUDA 6.0 or later (currently tested on Windows 7).
  2. An NVIDIA GPU with at least 2 GB of GPU memory.
  3. Third-party library: MIRtoolbox v1.5 (for audio processing).

Terms of use

The source code is provided for research purposes only. Any commercial use is prohibited. When using the code in your research work, please cite the following paper:

"Deep Multimodal Speaker Naming."
Yongtao Hu, Jimmy SJ. Ren, Jingwen Dai, Chang Yuan, Li Xu, and Wenping Wang.
ACMMM 2015.

@inproceedings{hu2015deep,
  title={{Deep Multimodal Speaker Naming}},
  author={Hu, Yongtao and Ren, Jimmy SJ. and Dai, Jingwen and Yuan, Chang and Xu, Li and Wang, Wenping},
  booktitle={Proceedings of the 23rd Annual ACM International Conference on Multimedia},
  year={2015}
}


If you find any bug or have any question about the code, please report it on the Issues page or email Yongtao Hu (