Commits

Mario Rodas committed b18f506

Add initial raw notes of nlp-class

Files changed (4)

    ai-class/index
    ml-class/index
    db-class/index
+   nlp-class/index
+
 
 
 

nlp-class/edit_distance.rst

+
+=============
+Edit Distance
+=============
+
+
+
+How to compute edit distance?
+=============================
+
+Dynamic Programming
+    A tabular computation of :math:`D(n, m)`.
+
+Compute :math:`D(i, j)` for all :math:`i` (:math:`0 \le i \le n`) and
+:math:`j` (:math:`0 \le j \le m`).
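
The tabular computation can be sketched in Python as follows. This is only a
minimal sketch: the notes do not fix the operation costs, so insertion, deletion
and substitution are all assumed to cost 1, and the name ``edit_distance`` is
chosen here purely for illustration.

```python
def edit_distance(source, target):
    """Return the edit distance between two strings (every cost assumed to be 1)."""
    n, m = len(source), len(target)
    # D[i][j] holds the distance between source[:i] and target[:j].
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i  # delete i characters to reach the empty string
    for j in range(1, m + 1):
        D[0][j] = j  # insert j characters starting from the empty string
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Substituting a character by itself is free.
            substitution = 0 if source[i - 1] == target[j - 1] else 1
            D[i][j] = min(
                D[i - 1][j] + 1,                 # deletion
                D[i][j - 1] + 1,                 # insertion
                D[i - 1][j - 1] + substitution,  # substitution or match
            )
    return D[n][m]
```

Each cell depends only on its left, upper and upper-left neighbours, so filling
the table row by row reaches :math:`D(n, m)` in :math:`O(nm)` steps.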
+
+Minimum Edit Distance (Levenshtein)
+===================================
+
+
+
+
+
+
+
+
+

nlp-class/index.rst

+
+===========================
+Natural Language Processing
+===========================
+
+*Incomplete* notes from nlp-class_.
+
+.. _nlp-class: https://www.coursera.org/course/nlp
+
+.. toctree::
+   :maxdepth: 1 
+
+   textproc
+   edit_distance
+

nlp-class/textproc.rst

+
+=====================
+Basic Text Processing
+=====================
+
+
+Regular Expressions
+===================
+
+    Regular expressions consist of constants and operators that denote sets of
+    strings and operations over these sets, respectively.
+
+    -- Wikipedia [#regexwiki]_
+
+Regular expressions are a formal language for specifying text strings.
+In general, regular expressions provide a flexible means to *match* strings of
+text. The term is commonly abbreviated as **regex** or **regexp**.
+
+============= =================================================================
+Metacharacter Description
+============= =================================================================
+``.``         Match any single character.
+``+``         Match the preceding pattern element **one or more times**.
+``?``         Match the preceding pattern element **zero or one time**.
+``*``         Match the preceding pattern element **zero or more times**.
+``{M,N}``     Denotes the minimum *M* and the maximum *N* match count.
+``[...]``     Denotes a set of possible character matches.
+``|``         Separates alternate possibilities.
+``^``         Start of line.
+``$``         End of line.
+============= =================================================================
+
+..
+
+================ ==============================================================
+Regex            Matches
+================ ==============================================================
+``[Aa]mor``      amor, Amor
+``[1234567890]`` Any digit, or simply ``[0-9]``
+================ ==============================================================
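
A few of these metacharacters can be tried out with Python's standard ``re``
module. The sample strings and variable names below are made up for
illustration and are not taken from the class.

```python
import re

words = ["amor", "Amor", "humor"]
# [Aa]mor accepts either case for the first letter; fullmatch requires the
# whole string to match, so "humor" is rejected.
matches = [w for w in words if re.fullmatch(r"[Aa]mor", w)]

# [0-9]+ matches one or more digits; findall returns every non-overlapping match.
numbers = re.findall(r"[0-9]+", "room 101, floor 3")

# ^ and $ anchor the pattern; {4} is a repetition count of exactly four.
is_year = re.search(r"^[0-9]{4}$", "1990") is not None
```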
+
+Books
+-----
+Regular Expression Pocket Reference, 2nd Edition
+    http://shop.oreilly.com/product/9780596514273.do
+
+.. TODO: Add book of formal language theory 
+
+Word Tokenization
+=================
+Every NLP task needs to do text normalization:
+
+1. Segmenting/tokenizing words in running text.
+2. Normalizing word formats.
+3. Segmenting sentences in running text.
+
+
+Concepts
+--------
+
+Type
+    An element of the vocabulary :math:`V`. The size of the vocabulary is
+    represented by :math:`|V|`.
+
+Token
+    An instance of a type in running text. The number of tokens is represented
+    by :math:`N`.
+
+Corpora
+    Data sets of text.
+
+* In general, for any sentence, #tokens >= #types.
+* Church and Gale (1990): :math:`|V| \geq O(N^{1/2})`
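
The difference between tokens and types can be checked on a toy sentence. This
is a small sketch with an invented sentence; splitting on runs of lowercase
letters is a simplifying assumption, not the tokenizer used in the class.

```python
import re

sentence = "the cat sat on the mat with the other cat"
tokens = re.findall(r"[a-z]+", sentence)  # every word occurrence
types = set(tokens)                       # distinct words (the vocabulary)

n_tokens = len(tokens)
vocabulary_size = len(types)
```

Here ``the`` and ``cat`` are repeated, so the number of tokens is larger than
the number of types.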
+
+Tokenizing, *first steps*
+--------------------------
+Using Unix tools :-). We use *big.txt*, which is not exactly the same file as the
+one used in the class [#file]_::
+
+  $ curl -O http://norvig.com/big.txt
+
+Replace every run of non-alphabetic characters with a newline (``\n``) and display
+only the first 10 lines (``head``)::
+
+  $ tr -sc 'A-Za-z' '\n' < big.txt | head
+
+Sort the output::
+
+  $ tr -sc 'A-Za-z' '\n' < big.txt | sort | head
+
+Merging upper and lower case::
+
+  $ tr 'A-Z' 'a-z' < big.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c
+
+Sorting the counts::
+
+  $ tr 'A-Z' 'a-z' < big.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r
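
For comparison, roughly the same pipeline can be written in Python. This is a
sketch, not part of the class material: ``word_counts`` is a name made up here,
and it assumes ASCII letters only, like the ``tr`` commands.

```python
import re
from collections import Counter

def word_counts(text):
    """Lowercase the text, split on non-alphabetic runs, and count each word."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)

counts = word_counts("The cat and the hat. THE END")
# most_common sorts by descending count, like `sort -n -r` on the uniq -c output.
top = counts.most_common(1)
```

To run it on the downloaded file, pass the file contents instead of the sample
string, for example ``word_counts(open("big.txt").read())``, assuming
*big.txt* is in the working directory.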
+
+Issues in tokenization
+----------------------
+
+Apostrophe
+  | *Finland's capital* -> Finland, Finlands, Finland's?
+  | *I'm* -> I am
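
The apostrophe choice shows up as soon as a tokenizer pattern is written down.
The two patterns below are illustrative assumptions, not the rules used in the
class; they only show how the decision changes the resulting tokens.

```python
import re

text = "Finland's capital"

# Treat the apostrophe as a separator: the clitic "s" becomes its own token.
split_tokens = re.findall(r"[A-Za-z]+", text)

# Allow one internal apostrophe so "Finland's" stays a single token.
joined_tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
```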
+
+.. TODO: Add more examples
+
+Language issues
+~~~~~~~~~~~~~~~
+French
+  | **L'ensemble** -> one token or two?
+  | *L*?, *L’*?, *Le*?
+  | Want *l’ensemble* to match with *un ensemble*.
+
+
+Word Normalization and Stemming
+===============================
+
+Normalization
+-------------
+
+Sentence Segmentation
+=====================
+
+
+
+.. References {{{
+.. [#regexwiki] http://en.wikipedia.org/wiki/Regular_expression#Formal_definition
+.. [#file] You can get *shakes.txt* from Project Gutenberg: http://www.gutenberg.org/ebooks/100
+
+
+.. }}}
+
+.. vim:ft=rst:tw=80: