# Commits

committed b18f506

Add initial raw notes of nlp-class

# index.rst

`    ai-class/index`
`    ml-class/index`
`    db-class/index`
`+   nlp-class/index`
`+`
` `
` `
` `

# nlp-class/edit_distance.rst

`+`
`+=============`
`+Edit Distance`
`+=============`
`+`
`+`
`+`
`+How to compute edit distance?`
`+=============================`
`+`
`+Dynamic Programming`
`+    A tabular computation of :math:`D(n, m)``
`+`
`+Compute :math:`D(i, j)` for all :math:`i` (:math:`0 \le i \le n`) and :math:`j` (:math:`0 \le j \le m`).`
`+`
`+Minimum Edit Distance (Levenshtein)`
`+===================================`
`+`
`+`
`+`
`+`
`+`
`+`
`+`
`+`
`+`
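The tabular computation of D(n, m) described in these notes can be sketched in Python as follows. This is a minimal sketch, not the course's reference implementation: the function name `min_edit_distance` and the `sub_cost` parameter are hypothetical, and the costs (1 for insertion and deletion, 1 for substitution by default) are an assumption, since the notes do not state them.

```python
def min_edit_distance(source, target, sub_cost=1):
    """Return D(n, m), the minimum edit distance between source and target.

    Assumption: insertions and deletions cost 1; substitutions cost
    sub_cost (1 is the classic Levenshtein distance).
    """
    n, m = len(source), len(target)
    # D[i][j] holds the distance between source[:i] and target[:j].
    D = [[0] * (m + 1) for _ in range(n + 1)]
    # Base cases: turning a prefix into the empty string (and vice versa).
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    # Fill the table row by row from the three neighbouring cells.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            D[i][j] = min(
                D[i - 1][j] + 1,                              # deletion
                D[i][j - 1] + 1,                              # insertion
                D[i - 1][j - 1] + (0 if same else sub_cost),  # substitution
            )
    return D[n][m]
```

For example, `min_edit_distance("kitten", "sitting")` returns 3, and `min_edit_distance("intention", "execution", sub_cost=2)` returns 8.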

# nlp-class/index.rst

`+`
`+===========================`
`+Natural Language Processing`
`+===========================`
`+`
`+*Incomplete* notes from nlp-class_.`
`+`
`+.. _nlp-class: https://www.coursera.org/course/nlp`
`+`
`+.. toctree::`
`+   :maxdepth: 1 `
`+`
`+   textproc`
`+   edit_distance`
`+`

# nlp-class/textproc.rst

`+`
`+=====================`
`+Basic Text Processing`
`+=====================`
`+`
`+`
`+Regular Expressions`
`+===================`
`+`
`+    Regular expressions consist of constants and operators that denote sets of`
`+    strings and operations over these sets, respectively.`
`+`
`+    -- From Wikipedia [#regexwiki]_`
`+`
`+Regular expressions are a formal language for specifying text strings.`
`+In general, regular expressions provide a flexible means to *match* strings of`
`+text. They are commonly abbreviated as **regex** or **regexp**.`
`+`
`+============= =================================================================`
`+Metacharacter Description`
`+============= =================================================================`
`+``.``         Match any character.`
`+``+``         Match the preceding pattern element **one or more times**.`
`+``?``         Match the preceding pattern element **zero or one time**.`
`+``*``         Match the preceding pattern element **zero or more times**.`
`+``{M,N}``     Denotes the minimum *M* and the maximum *N* match count.`
`+``[...]``     Denotes a set of possible character matches.`
`+``|``         Separates alternate possibilities.`
`+``^``         Start of line.`
`+``$``         End of line.`
`+============= =================================================================`
`+`
`+..`
`+`
`+=============== ===============================================================`
`+regex           matches`
`+=============== ===============================================================`
`+*[Aa]mor*       amor, Amor`
`+*[1234567890]*  Any digit; equivalent to *[0-9]*`
`+=============== ===============================================================`
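To see the metacharacters from the tables in action, here is a small sketch using Python's standard `re` module. The sample strings and variable names are made up for illustration and are not from the class.

```python
import re

text = "Amor and amor, 42 times"

# [Aa]mor: a character set followed by literal characters.
amor_matches = re.findall(r"[Aa]mor", text)

# [0-9]+: one or more digits.
digit_matches = re.findall(r"[0-9]+", text)

# colou?r: the "u" may appear zero or one time.
optional_matches = re.findall(r"colou?r", "color colour")

# o{2,3}: between two and three consecutive "o" characters.
counted_matches = re.findall(r"o{2,3}", "o oo ooo")

# cat|dog: either alternative.
alternation_matches = re.findall(r"cat|dog", "cat dog bird")

# ^ and $ anchor the pattern to the start and end of the line.
starts_with_amor = re.search(r"^Amor", text) is not None
ends_with_amor = re.search(r"amor$", text) is not None

print(amor_matches, digit_matches, optional_matches)
print(counted_matches, alternation_matches)
print(starts_with_amor, ends_with_amor)
```

Here `amor_matches` is `['Amor', 'amor']`, while `ends_with_amor` is `False` because the line ends in "times".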
`+`
`+Books`
`+-----`
`+Regular Expression Pocket Reference, 2nd Edition`
`+    http://shop.oreilly.com/product/9780596514273.do`
`+`
`+.. TODO: Add book of formal language theory `
`+`
`+Word Tokenization`
`+=================`
`+Every NLP task needs to do text normalization:`
`+`
`+1. Segmenting/tokenizing words in running text.`
`+2. Normalizing word formats.`
`+3. Segmenting sentences in running text.`
`+`
`+`
`+Concepts`
`+--------`
`+`
`+Type`
`+    An element of the vocabulary. The vocabulary (the set of types) is`
`+    represented by :math:`V` and its size by :math:`|V|`.`
`+`
`+Token`
`+    An instance of a type in running text. The number of tokens is`
`+    represented by :math:`N`.`
`+`
`+Corpora`
`+    Data sets of text.`
`+`
`+* In general, in a sentence #tokens >= #types.`
`+* Church and Gale (1990): :math:`|V| \ge O(N^{1/2})``
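The difference between tokens and types can be checked on a toy sentence. This is only a sketch: the sentence is invented, and the tokenizer (lowercase, then keep runs of ASCII letters) is an assumption chosen for simplicity.

```python
import re

sentence = "the cat sat on the mat because the mat was warm"

# Tokens: every word occurrence in running text.
tokens = re.findall(r"[a-z]+", sentence.lower())
# Types: the distinct words, i.e. the vocabulary of this sentence.
types = set(tokens)

num_tokens = len(tokens)   # N
vocab_size = len(types)    # |V|

print(num_tokens, vocab_size)
```

Here `num_tokens` is 11 and `vocab_size` is 8, because "the" occurs three times and "mat" twice.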
`+`
`+Tokenizing, *first steps*`
`+--------------------------`
`+Using Unix tools :-). We use *big.txt*, which is not exactly the same file used in the class [#file]_::`
`+`
`+  $ curl -O http://norvig.com/big.txt`
`+`
`+Replace every run of non-alphabetic characters with a single newline (``\n``)`
`+and display only the first 10 lines (*head*)::`
`+`
`+  $ tr -sc 'A-Za-z' '\n' < big.txt | head`
`+`
`+Sort the output::`
`+`
`+  $ tr -sc 'A-Za-z' '\n' < big.txt | sort | head`
`+`
`+Merge upper and lower case, then count each distinct word::`
`+`
`+  $ tr 'A-Z' 'a-z' < big.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c`
`+`
`+Sort by count, most frequent first::`
`+`
`+  $ tr 'A-Z' 'a-z' < big.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r`
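The same counting pipeline can be approximated in Python. This is a rough equivalent and not part of the class material: the helper name `word_counts` is hypothetical, and it mirrors the `tr` commands only for ASCII letters.

```python
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text, split on non-alphabetic runs, and count words."""
    # Equivalent in spirit to: tr 'A-Z' 'a-z' | tr -sc 'A-Za-z' '\n'
    words = re.findall(r"[a-z]+", text.lower())
    # Equivalent in spirit to: sort | uniq -c
    return Counter(words)


sample = "The cat and the hat. The END, the end!"
counts = word_counts(sample)

# Equivalent in spirit to: sort -n -r | head -2
for word, count in counts.most_common(2):
    print(count, word)
```

To run it over the downloaded file, read the file's contents and pass them to `word_counts` instead of `sample`.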
`+`
`+Issues in tokenization`
`+----------------------`
`+`
`+Apostrophe`
`+  | *Finland's capital* -> Finland, Finlands or Finland's?`
`+  | *I'm* -> I am`
`+`
`+.. TODO: Add more examples`
`+`
`+Language issues`
`+~~~~~~~~~~~~~~~`
`+French`
`+  | *L'ensemble* -> one token or two?`
`+  | *L*?, *L'*?, *Le*?`
`+  | We want *l'ensemble* to match *un ensemble*.`
`+`
`+`
`+Word Normalization and Stemming`
`+===============================`
`+`
`+Normalization`
`+-------------`
`+`
`+Sentence Segmentation`
`+=====================`
`+`
`+`
`+`
`+.. References {{{`
`+.. [#regexwiki] http://en.wikipedia.org/wiki/Regular_expression#Formal_definition`
`+.. [#file] You can get *shakes.txt* from Project Gutenberg: http://www.gutenberg.org/ebooks/100`
`+`
`+`
`+.. }}}`
`+`
`+.. vim:ft=rst:tw=80:`