Indic Romanizer and CMUDICT Pronunciation Generator

Author: Alok Parlikar <aup@cs.cmu.edu>

Introduction

This repository contains scripts for processing UTF-8 encoded Indic text. They can romanize Indic text into Latin script, or generate CMUDict pronunciations for Indic words.

Requirements

  • Python 3

Supported Indic Character Sets

  • Devanagari (code points 2305 to 2431, i.e. U+0901 to U+097F)
  • Telugu (code points 3072 to 3199, i.e. U+0C00 to U+0C7F) (Mapping by Aasish Pappu)
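
The ranges listed above are decimal Unicode code points. As a minimal sketch of how a character could be checked against them, the following may help; the names `SUPPORTED_RANGES`, `script_of`, and `scripts_in` are hypothetical and are not taken from this repository's scripts, which may handle this differently.

```python
# Hypothetical helper, not part of this repository: classify characters
# by the decimal code point ranges listed in this README.
SUPPORTED_RANGES = {
    "Devanagari": (2305, 2431),  # U+0901 to U+097F
    "Telugu": (3072, 3199),      # U+0C00 to U+0C7F
}


def script_of(char):
    """Return the supported script name for one character, or None."""
    code_point = ord(char)
    for name, (first, last) in SUPPORTED_RANGES.items():
        if first <= code_point <= last:
            return name
    return None


def scripts_in(text):
    """Return the set of supported scripts that occur in text."""
    found = set()
    for char in text:
        name = script_of(char)
        if name is not None:
            found.add(name)
    return found
```

A check like this could be used to confirm that an input file contains a supported script before running the tools on it.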

Usage

Romanize a File

To romanize a UTF-8 encoded Indic text file:

./romanize_indic_text.py < indic.input > romanized.output

Pronunciation For Indian Names

To generate a pronunciation lexicon of Indian names, download the parallel English-Hindi first-names list and pass it to the generator script:

wget "https://bitbucket.org/happyalu/corpus_indian_names/raw/master/english_hindi_parallel_first_names"

./generate_pronunciation_for_names.py \
< english_hindi_parallel_first_names \
> hindi_firstnames.lex