yanchuan sim committed 43ff4b2

updated documentation

Files changed (9)

docs/html/_modules/ycutils/corpus.html

     <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">__vocab</span><span class="p">[</span><span class="n">token</span><span class="p">]</span>
   <span class="c">#end def</span>
 </div>
-  <span class="k">def</span> <span class="nf">__iter__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">__vocab</span><span class="o">.</span><span class="n">__iter__</span><span class="p">()</span>
-
-  <span class="k">def</span> <span class="nf">__contains__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">item</span><span class="p">):</span> <span class="k">return</span> <span class="n">item</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">__vocab</span>
-  
+<div class="viewcode-block" id="CorpusVocabulary.__iter__"><a class="viewcode-back" href="../../corpus.html#ycutils.corpus.CorpusVocabulary.__iter__">[docs]</a>  <span class="k">def</span> <span class="nf">__iter__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">__vocab</span><span class="o">.</span><span class="n">__iter__</span><span class="p">()</span>
+</div>
+<div class="viewcode-block" id="CorpusVocabulary.__contains__"><a class="viewcode-back" href="../../corpus.html#ycutils.corpus.CorpusVocabulary.__contains__">[docs]</a>  <span class="k">def</span> <span class="nf">__contains__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">item</span><span class="p">):</span> <span class="k">return</span> <span class="n">item</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">__vocab</span>
+  </div>
 <div class="viewcode-block" id="CorpusVocabulary.__len__"><a class="viewcode-back" href="../../corpus.html#ycutils.corpus.CorpusVocabulary.__len__">[docs]</a>  <span class="k">def</span> <span class="nf">__len__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> 
     <span class="sd">&quot;&quot;&quot;Returns the number of terms in the vocabulary.&quot;&quot;&quot;</span>
     <span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">__vocab</span><span class="p">)</span>

docs/html/_sources/corpus.txt

     """
 
 .. automodule:: ycutils.corpus
+   :members: DEFAULT_UNKNOWN_TOKEN
 
 *Corpus* class
 ================
 ========================
 .. autoclass:: ycutils.corpus.CorpusVocabulary
   :members:
-  :special-members: __init__
+  :special-members: __init__, __getitem__, __iter__, __contains__, __len__

docs/html/bagofwords.html

 </tbody>
 </table>
 <dl class="method">
+<dt id="ycutils.bagofwords.BOW.__iadd__">
+<tt class="descname">__iadd__</tt><big>(</big><em>other</em><big>)</big><a class="headerlink" href="#ycutils.bagofwords.BOW.__iadd__" title="Permalink to this definition">¶</a></dt>
+<dd><p>Adds two <a class="reference internal" href="#ycutils.bagofwords.BOW" title="ycutils.bagofwords.BOW"><tt class="xref py py-class docutils literal"><span class="pre">BOW</span></tt></a> in place.</p>
+<table class="docutils field-list" frame="void" rules="none">
+<col class="field-name" />
+<col class="field-body" />
+<tbody valign="top">
+<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>other</strong> &#8211; the other BOW object to add.</td>
+</tr>
+</tbody>
+</table>
+</dd></dl>
+
+<dl class="method">
 <dt id="ycutils.bagofwords.BOW.__mul__">
 <tt class="descname">__mul__</tt><big>(</big><em>other</em><big>)</big><a class="reference internal" href="_modules/ycutils/bagofwords.html#BOW.__mul__"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#ycutils.bagofwords.BOW.__mul__" title="Permalink to this definition">¶</a></dt>
 <dd><p>Multiplies two <a class="reference internal" href="#ycutils.bagofwords.BOW" title="ycutils.bagofwords.BOW"><tt class="xref py py-class docutils literal"><span class="pre">BOW</span></tt></a>.</p>
 </dd></dl>
 
 <dl class="method">
+<dt id="ycutils.bagofwords.BOW.normalize">
+<tt class="descname">normalize</tt><big>(</big><em>sum_to=1.0</em><big>)</big><a class="headerlink" href="#ycutils.bagofwords.BOW.normalize" title="Permalink to this definition">¶</a></dt>
+<dd><p>Normalizes the counts of words, such that they sum up to <tt class="xref py py-attr docutils literal"><span class="pre">sum_to</span></tt>.</p>
+<table class="docutils field-list" frame="void" rules="none">
+<col class="field-name" />
+<col class="field-body" />
+<tbody valign="top">
+<tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>sum_to</strong> &#8211; total count of words after normalizing.</td>
+</tr>
+</tbody>
+</table>
+</dd></dl>
+
+<dl class="method">
 <dt id="ycutils.bagofwords.BOW.to_wc_string">
 <tt class="descname">to_wc_string</tt><big>(</big><big>)</big><a class="reference internal" href="_modules/ycutils/bagofwords.html#BOW.to_wc_string"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#ycutils.bagofwords.BOW.to_wc_string" title="Permalink to this definition">¶</a></dt>
 <dd><p>Format the <a class="reference internal" href="#ycutils.bagofwords.BOW" title="ycutils.bagofwords.BOW"><tt class="xref py py-class docutils literal"><span class="pre">BOW</span></tt></a> object in a <tt class="docutils literal"><span class="pre">word:count</span></tt> formatted string which looks like <tt class="docutils literal"><span class="pre">word1:count1</span> <span class="pre">word2:count2</span> <span class="pre">...</span></tt>.</p>
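
The two methods newly documented in this file, `__iadd__` and `normalize(sum_to=1.0)`, can be illustrated with a small sketch. The class below is a hypothetical stand-in named `SketchBOW`, not the real `ycutils.bagofwords.BOW`; its `counts` attribute and its handling of an empty bag are assumptions, since the diff only shows the generated descriptions ("Adds two BOW in place" and "Normalizes the counts of words, such that they sum up to sum_to").

```python
from collections import Counter


class SketchBOW(object):
    """Hypothetical stand-in for ycutils.bagofwords.BOW (not the real class)."""

    def __init__(self, counts=None):
        # Assumption: word counts are kept in a Counter keyed by word.
        self.counts = Counter(counts or {})

    def __iadd__(self, other):
        """Adds the counts of another bag into this one, in place."""
        for word, count in other.counts.items():
            self.counts[word] += count
        return self  # in-place operators must return the updated object

    def normalize(self, sum_to=1.0):
        """Rescales the counts so that they sum to sum_to."""
        total = float(sum(self.counts.values()))
        if total == 0.0:
            return  # assumption: an empty bag is left untouched
        for word in self.counts:
            self.counts[word] = self.counts[word] * sum_to / total


a = SketchBOW({"hello": 3, "world": 1})
b = SketchBOW({"world": 1})
a += b         # counts are now hello: 3, world: 2
a.normalize()  # counts are rescaled so that they sum to 1.0
```

Returning `self` from `__iadd__` is what lets `a += b` keep `a` bound to the same object rather than to `None`.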

docs/html/corpus.html

 <span class="sd">&quot;&quot;&quot;</span>
 </pre></div>
 </div>
-<span class="target" id="module-ycutils.corpus"></span><div class="section" id="corpus-class">
+<span class="target" id="module-ycutils.corpus"></span><dl class="data">
+<dt id="ycutils.corpus.DEFAULT_UNKNOWN_TOKEN">
+<tt class="descclassname">ycutils.corpus.</tt><tt class="descname">DEFAULT_UNKNOWN_TOKEN</tt><em class="property"> = u'__UNK__'</em><a class="headerlink" href="#ycutils.corpus.DEFAULT_UNKNOWN_TOKEN" title="Permalink to this definition">¶</a></dt>
+<dd><p>The type for an unknown token, which is <tt class="docutils literal"><span class="pre">__UNK__</span></tt> by default.</p>
+</dd></dl>
+
+<div class="section" id="corpus-class">
 <h2><em>Corpus</em> class<a class="headerlink" href="#corpus-class" title="Permalink to this headline">¶</a></h2>
 <dl class="class">
 <dt id="ycutils.corpus.Corpus">
 <h2><em>CorpusVocabulary</em> class<a class="headerlink" href="#corpusvocabulary-class" title="Permalink to this headline">¶</a></h2>
 <dl class="class">
 <dt id="ycutils.corpus.CorpusVocabulary">
-<em class="property">class </em><tt class="descclassname">ycutils.corpus.</tt><tt class="descname">CorpusVocabulary</tt><big>(</big><em>corpus=None</em>, <em>from_filename=None</em>, <em>unknown_token='__UNK__'</em><big>)</big><a class="reference internal" href="_modules/ycutils/corpus.html#CorpusVocabulary"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#ycutils.corpus.CorpusVocabulary" title="Permalink to this definition">¶</a></dt>
+<em class="property">class </em><tt class="descclassname">ycutils.corpus.</tt><tt class="descname">CorpusVocabulary</tt><big>(</big><em>corpus=None</em>, <em>from_filename=None</em>, <em>unknown_token=u'__UNK__'</em><big>)</big><a class="reference internal" href="_modules/ycutils/corpus.html#CorpusVocabulary"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#ycutils.corpus.CorpusVocabulary" title="Permalink to this definition">¶</a></dt>
 <dd><p>A class that handles vocabulary information related to a corpus, like term frequencies, idf, etc.</p>
 <table class="docutils field-list" frame="void" rules="none">
 <col class="field-name" />
 </tbody>
 </table>
 <dl class="method">
+<dt id="ycutils.corpus.CorpusVocabulary.__contains__">
+<tt class="descname">__contains__</tt><big>(</big><em>w</em><big>)</big><a class="reference internal" href="_modules/ycutils/corpus.html#CorpusVocabulary.__contains__"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#ycutils.corpus.CorpusVocabulary.__contains__" title="Permalink to this definition">¶</a></dt>
+<dd><table class="docutils field-list" frame="void" rules="none">
+<col class="field-name" />
+<col class="field-body" />
+<tbody valign="top">
+<tr class="field-odd field"><th class="field-name">Returns:</th><td class="field-body"><tt class="docutils literal"><span class="pre">True</span></tt> if <tt class="xref py py-attr docutils literal"><span class="pre">w</span></tt> is in the vocabulary.</td>
+</tr>
+</tbody>
+</table>
+</dd></dl>
+
+<dl class="method">
 <dt id="ycutils.corpus.CorpusVocabulary.__getitem__">
 <tt class="descname">__getitem__</tt><big>(</big><em>token</em><big>)</big><a class="reference internal" href="_modules/ycutils/corpus.html#CorpusVocabulary.__getitem__"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#ycutils.corpus.CorpusVocabulary.__getitem__" title="Permalink to this definition">¶</a></dt>
 <dd><p>Returns statistics about the given token in the corpus.</p>
 </dd></dl>
 
 <dl class="method">
+<dt id="ycutils.corpus.CorpusVocabulary.__iter__">
+<tt class="descname">__iter__</tt><big>(</big><big>)</big><a class="reference internal" href="_modules/ycutils/corpus.html#CorpusVocabulary.__iter__"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#ycutils.corpus.CorpusVocabulary.__iter__" title="Permalink to this definition">¶</a></dt>
+<dd><p>Returns an iterator for going through each word type in the vocabulary.</p>
+</dd></dl>
+
+<dl class="method">
 <dt id="ycutils.corpus.CorpusVocabulary.__len__">
 <tt class="descname">__len__</tt><big>(</big><big>)</big><a class="reference internal" href="_modules/ycutils/corpus.html#CorpusVocabulary.__len__"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#ycutils.corpus.CorpusVocabulary.__len__" title="Permalink to this definition">¶</a></dt>
 <dd><p>Returns the number of terms in the vocabulary.</p>
 <tbody valign="top">
 <tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><strong>token</strong> &#8211; token to get statistics for.</td>
 </tr>
-<tr class="field-even field"><th class="field-name">Returns:</th><td class="field-body">a tuple <cite>(frequency, document frequency, inverse document frequency)</cite>. If token is not found, statistics for <a class="reference internal" href="#ycutils.corpus.CorpusVocabulary.unknown_token" title="ycutils.corpus.CorpusVocabulary.unknown_token"><tt class="xref py py-attr docutils literal"><span class="pre">unknown_token</span></tt></a> is returned.</td>
+<tr class="field-even field"><th class="field-name">Returns:</th><td class="field-body">a tuple <cite>(frequency, document frequency, inverse document frequency)</cite>. If token is not found, statistics for <tt class="xref py py-attr docutils literal"><span class="pre">unknown_token</span></tt> is returned.</td>
 </tr>
 </tbody>
 </table>
 </dd></dl>
 
 <dl class="method">
+<dt id="ycutils.corpus.CorpusVocabulary.iteritems">
+<tt class="descname">iteritems</tt><big>(</big><big>)</big><a class="headerlink" href="#ycutils.corpus.CorpusVocabulary.iteritems" title="Permalink to this definition">¶</a></dt>
+<dd><p>Returns an iterator for a tuple of <cite>(word, (frequency, document frequency, inverse document frequency))</cite>.</p>
+</dd></dl>
+
+<dl class="method">
 <dt id="ycutils.corpus.CorpusVocabulary.to_bow">
 <tt class="descname">to_bow</tt><big>(</big><em>use_counts='freq'</em><big>)</big><a class="reference internal" href="_modules/ycutils/corpus.html#CorpusVocabulary.to_bow"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#ycutils.corpus.CorpusVocabulary.to_bow" title="Permalink to this definition">¶</a></dt>
 <dd><p>Creates a <tt class="xref py py-class docutils literal"><span class="pre">bagofwords.BOW</span></tt> object from this vocabulary.</p>
 <tr class="field-odd field"><th class="field-name">Parameters:</th><td class="field-body"><ul class="first last simple">
 <li><strong>f</strong> &#8211; file description to write to.</li>
 <li><strong>sort_key</strong> &#8211; sort order to use for vocabulary. Possible options are <cite>idf</cite> and <cite>freq</cite>. A <cite>+</cite> at the back denotes largest first (descending order) and <cite>-</cite> for ascending order. Defaults to descending order.</li>
-<li><strong>save_unknown</strong> &#8211; saves the default statistics for <a class="reference internal" href="#ycutils.corpus.CorpusVocabulary.unknown_token" title="ycutils.corpus.CorpusVocabulary.unknown_token"><tt class="xref py py-attr docutils literal"><span class="pre">unknown_token</span></tt></a> to file.</li>
+<li><strong>save_unknown</strong> &#8211; saves the default statistics for <tt class="xref py py-attr docutils literal"><span class="pre">unknown_token</span></tt> to file.</li>
 </ul>
 </td>
 </tr>
 </table>
 </dd></dl>
 
-<dl class="attribute">
-<dt id="ycutils.corpus.CorpusVocabulary.unknown_token">
-<tt class="descname">unknown_token</tt><em class="property"> = '__UNK__'</em><a class="headerlink" href="#ycutils.corpus.CorpusVocabulary.unknown_token" title="Permalink to this definition">¶</a></dt>
-<dd><p>The type for an unknown token, which is <tt class="docutils literal"><span class="pre">__UNK__</span></tt> by default.</p>
-</dd></dl>
-
 </dd></dl>
 
 </div>
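
The special members that this commit adds to the `CorpusVocabulary` documentation (`__contains__`, `__getitem__`, `__iter__`, `__len__`) together with `iteritems` make the class behave like a read-only mapping. The sketch below shows that protocol on a hypothetical `SketchVocabulary` class, not the real implementation. The statistics tuples are made-up illustrative values, and storing the unknown-token statistics separately (so they are not counted by `len` or visited by iteration) is an assumption the diff does not confirm.

```python
DEFAULT_UNKNOWN_TOKEN = u'__UNK__'


class SketchVocabulary(object):
    """Hypothetical stand-in for ycutils.corpus.CorpusVocabulary."""

    def __init__(self, stats, unknown_token=DEFAULT_UNKNOWN_TOKEN,
                 unknown_stats=(0, 0, 0.0)):
        # stats maps each word to a tuple of
        # (frequency, document frequency, inverse document frequency).
        self._vocab = dict(stats)
        self.unknown_token = unknown_token
        # Assumption: unknown-token statistics live outside the vocabulary.
        self._unknown_stats = unknown_stats

    def __getitem__(self, token):
        """Returns statistics for token, or the unknown-token statistics."""
        return self._vocab.get(token, self._unknown_stats)

    def __contains__(self, w):
        """Returns True if w is in the vocabulary."""
        return w in self._vocab

    def __iter__(self):
        """Returns an iterator over each word type in the vocabulary."""
        return iter(self._vocab)

    def __len__(self):
        """Returns the number of terms in the vocabulary."""
        return len(self._vocab)

    def iteritems(self):
        """Returns an iterator over (word, statistics) pairs."""
        return iter(self._vocab.items())


vocab = SketchVocabulary({u"hello": (3, 2, 0.5), u"world": (1, 1, 1.2)})
known = u"hello" in vocab      # membership test uses __contains__
size = len(vocab)              # number of word types, via __len__
fallback = vocab[u"missing"]   # falls back to the unknown-token statistics
words = sorted(vocab)          # iteration uses __iter__
```

Because `__getitem__` falls back instead of raising `KeyError`, callers that need to tell known words from unknown ones should test membership with `in` first.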

docs/html/genindex.html

  | <a href="#K"><strong>K</strong></a>
  | <a href="#L"><strong>L</strong></a>
  | <a href="#M"><strong>M</strong></a>
+ | <a href="#N"><strong>N</strong></a>
  | <a href="#P"><strong>P</strong></a>
  | <a href="#R"><strong>R</strong></a>
  | <a href="#S"><strong>S</strong></a>
 <table style="width: 100%" class="indextable genindextable"><tr>
   <td style="width: 33%" valign="top"><dl>
       
+  <dt><a href="corpus.html#ycutils.corpus.CorpusVocabulary.__contains__">__contains__() (ycutils.corpus.CorpusVocabulary method)</a>
+  </dt>
+
+      
   <dt><a href="corpus.html#ycutils.corpus.CorpusVocabulary.__getitem__">__getitem__() (ycutils.corpus.CorpusVocabulary method)</a>
   </dt>
 
       
-  <dt><a href="corpus.html#ycutils.corpus.CorpusVocabulary.__len__">__len__() (ycutils.corpus.CorpusVocabulary method)</a>
+  <dt><a href="bagofwords.html#ycutils.bagofwords.BOW.__iadd__">__iadd__() (ycutils.bagofwords.BOW method)</a>
+  </dt>
+
+      
+  <dt><a href="corpus.html#ycutils.corpus.CorpusVocabulary.__iter__">__iter__() (ycutils.corpus.CorpusVocabulary method)</a>
   </dt>
 
   </dl></td>
   <td style="width: 33%" valign="top"><dl>
       
+  <dt><a href="corpus.html#ycutils.corpus.CorpusVocabulary.__len__">__len__() (ycutils.corpus.CorpusVocabulary method)</a>
+  </dt>
+
+      
   <dt><a href="bagofwords.html#ycutils.bagofwords.BOW.__mul__">__mul__() (ycutils.bagofwords.BOW method)</a>
   </dt>
 
 <table style="width: 100%" class="indextable genindextable"><tr>
   <td style="width: 33%" valign="top"><dl>
       
+  <dt><a href="corpus.html#ycutils.corpus.DEFAULT_UNKNOWN_TOKEN">DEFAULT_UNKNOWN_TOKEN (in module ycutils.corpus)</a>
+  </dt>
+
+      
   <dt><a href="bigvocab.html#ycutils.bigvocab.BigBOW.delete_token">delete_token() (ycutils.bigvocab.BigBOW method)</a>
   </dt>
 
   <dt><a href="bigvocab.html#ycutils.bigvocab.BigBOW.inc_token_count">inc_token_count() (ycutils.bigvocab.BigBOW method)</a>
   </dt>
 
+      
+  <dt><a href="corpus.html#ycutils.corpus.Corpus.inverse_document_frequency">inverse_document_frequency() (ycutils.corpus.Corpus method)</a>
+  </dt>
+
   </dl></td>
   <td style="width: 33%" valign="top"><dl>
       
-  <dt><a href="corpus.html#ycutils.corpus.Corpus.inverse_document_frequency">inverse_document_frequency() (ycutils.corpus.Corpus method)</a>
+  <dt><a href="corpus.html#ycutils.corpus.CorpusVocabulary.iteritems">iteritems() (ycutils.corpus.CorpusVocabulary method)</a>
   </dt>
 
       
   </dl></td>
 </tr></table>
 
+<h2 id="N">N</h2>
+<table style="width: 100%" class="indextable genindextable"><tr>
+  <td style="width: 33%" valign="top"><dl>
+      
+  <dt><a href="bagofwords.html#ycutils.bagofwords.BOW.normalize">normalize() (ycutils.bagofwords.BOW method)</a>
+  </dt>
+
+  </dl></td>
+</tr></table>
+
 <h2 id="P">P</h2>
 <table style="width: 100%" class="indextable genindextable"><tr>
   <td style="width: 33%" valign="top"><dl>
   <dt><a href="bigvocab.html#ycutils.bigvocab.VocabularyMap.unknown_token">unknown_token (ycutils.bigvocab.VocabularyMap attribute)</a>
   </dt>
 
-      <dd><dl>
-        
-  <dt><a href="corpus.html#ycutils.corpus.CorpusVocabulary.unknown_token">(ycutils.corpus.CorpusVocabulary attribute)</a>
-  </dt>
-
-      </dl></dd>
       
   <dt><a href="tfidf.html#ycutils.tfidf.TFIDF.untransform">untransform() (ycutils.tfidf.TFIDF method)</a>
   </dt>

docs/html/objects.inv

Binary file modified.

docs/html/searchindex.js

-Search.setIndex({objects:{ycutils:{bleu:[0,0,1,""],bagofwords:[2,0,1,""],tsvio:[4,0,1,""],tfidf:[5,0,1,""],bigvocab:[6,0,1,""],urls:[9,0,1,""],tokenize:[10,0,1,""],corpus:[12,0,1,""]},"ycutils.bagofwords":{Document:[2,3,1,""],cosine_similarity:[2,4,1,""],random_title:[2,4,1,""],BOW:[2,3,1,""]},"ycutils.urls.youtube":{download:[11,4,1,""],get_video_id:[11,4,1,""]},"ycutils.bagofwords.BOW":{dot_product:[2,1,1,""],to_wc_string:[2,1,1,""],"__str__":[2,1,1,""],l2_norm:[2,1,1,""],add_tokens:[2,1,1,""],"__mul__":[2,1,1,""],l1_norm:[2,1,1,""],add_wc_string:[2,1,1,""]},"ycutils.corpus":{CorpusVocabulary:[12,3,1,""],Corpus:[12,3,1,""]},"ycutils.corpus.Corpus":{from_file:[12,1,1,""],unique_title:[12,1,1,""],vocabulary:[12,1,1,""],add_bow:[12,1,1,""],inverse_document_frequency:[12,1,1,""],add_document:[12,1,1,""],document_frequency:[12,1,1,""],IDF_LAPLACE_SMOOTHING:[12,6,1,""],to_file:[12,1,1,""]},"ycutils.bigvocab":{BigBOW:[6,3,1,""],VocabularyMap:[6,3,1,""]},"ycutils.corpus.CorpusVocabulary":{from_file:[12,1,1,""],"__getitem__":[12,1,1,""],from_corpus:[12,1,1,""],find_token:[12,1,1,""],to_bow:[12,1,1,""],filter:[12,1,1,""],unknown_token:[12,6,1,""],to_file:[12,1,1,""],"__len__":[12,1,1,""]},"ycutils.tfidf":{TFIDF:[5,3,1,""]},"ycutils.bigvocab.BigBOW":{itertokens:[6,1,1,""],TokenIter:[6,3,1,""],add_bow:[6,1,1,""],add_tokens:[6,1,1,""],delete_token:[6,1,1,""],tokens:[6,1,1,""],inc_token_count:[6,1,1,""],get_token_count:[6,1,1,""],to_wc_string:[6,1,1,""],set_token_count:[6,1,1,""],add_wc_string:[6,1,1,""]},"ycutils.urls.webpages":{download:[3,4,1,""],MAX_TRIES:[3,2,1,""],WGET_PATH:[3,2,1,""],USER_AGENT:[3,2,1,""]},"ycutils.urls.printable.PrintableURL":{use_rules:[7,5,1,""],find:[7,1,1,""]},"ycutils.bagofwords.Document":{to_wc_string:[2,1,1,""],title:[2,6,1,""],"__str__":[2,1,1,""],add_wc_string:[2,1,1,""],bow:[2,1,1,""]},"ycutils.urls.printable":{PrintableURL:[7,3,1,""]},"ycutils.urls.googlebooks":{download:[8,4,1,""],book_id:[8,4,1,""]},"ycutils.tsvio.TSVFile":{readline:[4,1,1
,""],column_headers:[4,6,1,""],readlines:[4,1,1,""],writeline:[4,1,1,""],parseline:[4,1,1,""]},"ycutils.tsvio":{TSVFile:[4,3,1,""]},"ycutils.tokenize":{words_in_sentences:[10,4,1,""],tag_tokens:[10,4,1,""],TAG_PHONE:[10,2,1,""],TAG_URL:[10,2,1,""],TAG_PUNCT:[10,2,1,""],TAG_TIME:[10,2,1,""],to_ascii:[10,4,1,""],words:[10,4,1,""],sentences:[10,4,1,""],TAG_EMPTY:[10,2,1,""],TAG_EMAIL:[10,2,1,""],TAG_NUM:[10,2,1,""],TAG_WORD:[10,2,1,""]},"ycutils.bleu":{count_ngrams:[0,4,1,""],score:[0,4,1,""]},"ycutils.urls.wikipedia":{download:[13,4,1,""],domain_and_title:[13,4,1,""]},"ycutils.tfidf.TFIDF":{untransform:[5,1,1,""],transform:[5,1,1,""]},"ycutils.urls":{follow_url:[9,4,1,""],sort_key:[9,4,1,""],youtube:[11,0,1,""],wikipedia:[13,0,1,""],googlebooks:[8,0,1,""],webpages:[3,0,1,""]},"ycutils.bigvocab.VocabularyMap":{from_file:[6,1,1,""],get_indexes:[6,1,1,""],add_bow:[6,1,1,""],add_tokens:[6,1,1,""],keys_to_tokens:[6,1,1,""],get_index:[6,1,1,""],create_token:[6,6,1,""],unknown_token:[6,6,1,""],get_token:[6,1,1,""],keys_to_indexes:[6,1,1,""],get_tokens:[6,1,1,""],to_file:[6,1,1,""],size:[6,1,1,""]}},terms:{corpu:[5,1,12,6],represent:2,all:[0,6,7,10],code:[7,3],"__email__":10,get_token:6,follow:[3,9,10],row:4,content:[7,3],tag_url:10,dot_product:2,corpusvocabulari:[5,12,6],show:10,readabl:3,send:3,articl:13,those:10,under:[1,6],norm:2,merchant:1,sourc:[0,13,2,3,4,5,6,7,8,9,10,11,12],everi:10,string:[2,4,6,3,10,12],fals:[0,4,13,6,3,10,12],unicodedata:10,mime_typ:3,fall:10,veri:6,retriev:[2,6,11,12],to_ascii:10,tri:[3,10],administr:7,did:10,list:[0,2,4,6,7,9,10],iter:[2,6],l2_norm:2,vector:[5,2],cosin:2,book_id:8,"__time__":10,small:12,wednesdai:10,domain_and_titl:13,pontchartrain:10,deberri:10,impli:1,zlib:3,natur:1,naiv:10,sign:10,past:10,zero:12,video:11,pass:[5,3,10],download:[8,9,13,11,3,7],"__num__":10,even:1,index:[6,9,10],what:10,abc:7,sub:9,neg:10,sum:6,lakeshor:10,abl:6,uniform:12,access:6,delet:6,version:[1,7],consecut:10,"new":[6,7],method:[0,13,2,3,8,5,6,7,9,10,11,1
2],itertoken:6,full:9,deriv:6,guardian:7,gener:[1,2,4],never:[6,10],here:10,behaviour:6,punct:10,address:[3,9,10],path:[12,7,3,9],along:1,modifi:[5,1,7],valu:[5,4,6,12],convert:[2,6,10],max_tri:3,pirogu:10,bow:[5,2,6,12],precis:0,amount:10,bunker:10,implement:7,firefox:3,appli:[7,10],modul:[0,1,2,3,4,5,6,7,8,9,10,11,12,13],foundat:1,href:9,wget:3,touchdown:10,vovabulari:6,txt:4,all_smal:0,regex:10,keys_to_token:6,from:[0,2,4,6,7,8,9,10,11,12],describ:1,would:10,follow_url:9,two:[0,2],todai:10,next:4,websit:[7,3],predict:0,call:[13,10],type:[4,6,12,3,10],until:4,more:[0,1,12],sort:[12,9],fulltext:10,peopl:10,splitta:10,relat:12,keys_to_index:6,deflat:3,enhanc:4,warn:6,phone:10,"__iter__":6,vocabularymap:6,particular:1,count1:[2,6],count2:[2,6],fly:6,none:[8,2,5,6,7,3,11,12,4],word:[2,5,6,7,10,12],hour:10,hous:10,work:7,uniqu:[2,12],dev:10,descriptor:[4,6,12],tag:[7,10],del:[6,10],can:[1,12,6,10],learn:7,cab:10,purpos:1,want:4,give:9,process:[1,4,10],weightag:12,agent:3,minimum:12,tab:[12,2,6,4],bagofword:[5,1,2,6,12],cours:1,end:[4,10],divid:12,reuter:7,"0x13fab206d2da1b9f":12,classifi:7,i686:3,write:[4,6,12],how:10,pure:10,instead:[4,6],simpl:10,unique_titl:12,updat:3,map:6,product:2,likewis:10,after:10,befor:6,mai:9,data:[11,2,7,4,9],leve:10,attempt:3,counter:[5,2,6,12],correspond:[6,10],think:7,caus:[3,10],inform:[0,8,13,11,3,10,12],writelin:4,order:12,wind:10,over:[12,6,10],move:4,through:[2,5,6,7,3,10],not_punctu:10,still:10,paramet:[0,13,2,3,4,5,6,7,8,9,10,11,12],some:[],fit:1,binari:3,polici:7,timespicayun:10,tag_tim:10,whether:[4,7,12],dailymail:7,tsv:4,bye:[2,6,12],might:7,descend:12,them:[4,6,10],"return":[8,3,2,13,6,7,9,10,11,12,4],thei:[6,10],handl:[12,2,6,4],sentenc:[0,10],count_ngram:0,handi:1,initi:[5,2,6],nation:[7,10],bigvocab:[1,6],save_unknown:12,now:10,model_fil:7,term:[1,12,6],document:[5,1,2,6,12],name:4,revers:[5,9],separ:[4,6,12,10],token:[0,1,2,6,10,12],each:[4,12,3,10],found:[12,6,7],unicod:10,polit:7,set_token_count:6,mean:[7,10],u2019:10,d
omain:[13,9],replac:[12,10],individu:10,hard:10,realli:2,redistribut:1,"static":7,connect:10,year:10,our:[2,4,6,7,10,12],happen:10,bay:10,extract:[13,8],special:10,out:[4,6,12,9,10],"try":[9,10],network:10,space:7,publish:1,research:1,categori:10,rel:9,print:[2,4,5,6,7,9,12],occurr:0,pred_len:0,math:2,integr:13,idf_laplace_smooth:12,million:6,differ:10,free:[1,7],standard:6,base:[2,10],mime:3,york:7,dictionari:[4,6,12],put:6,org:[1,9],freq:[12,6],spill:10,frequenc:[0,12,6],could:[3,10],british:7,keep:10,filter:12,tag_punct:10,length:[0,10],place:[5,6],isn:10,regress:7,return_tag:10,onto:10,assign:10,frequent:1,first:[4,12,10],origin:[7,9],softwar:1,rang:12,"__unk__":[12,6],feel:7,strip_unicod:10,independ:7,number:[0,4,6,3,10,12],yourself:3,date:3,wrapper:[5,6],wasn:10,skip_empti:4,tsvfile:[4,12],open:4,size:6,given:[0,13,4,3,8,5,6,7,9,10,11,12],"long":10,espn:[7,10],script:7,unknown:[12,6],licens:1,system:3,messag:3,yesterdai:10,citi:10,add_token:[2,6],"final":9,store:2,document_frequ:12,especi:10,tool:3,copi:1,specifi:[4,10],user_ag:3,"__url__":10,part:[9,10],logist:7,aeronaut:7,sum_unknown:6,kind:10,peril:10,googlebook:[8,9],whenev:12,provid:2,remov:[12,10],get_index:6,second:10,structur:2,charact:[2,4,10],matter:6,were:10,posit:6,randomli:2,sai:10,comput:[0,5],angel:7,ani:[1,2,6,10],dash:10,printable_url:7,increment:6,need:5,seen:6,skyland:9,option:[1,12,10],probabl:7,accuraci:7,note:[5,6,11,3,10,7,12],also:[0,12,6,10],without:[1,2],take:[4,6,10],which:[12,2,6,4,9],forb:7,get_token_count:6,subject:10,singl:[13,10],scikit:7,distribut:1,though:10,multipli:2,previou:6,reach:10,most:7,regular:10,contain:[0,13,2,3,4,5,6,7,8,9,10,11,12],"class":[2,4,5,6,7,9,12],known:[],broadcast:7,format:[12,2,6,4],vocab_s:12,url:[1,8,9,13,7,3,10,11],doc:[12,10],clear:7,later:1,drive:10,l1_norm:2,declar:3,determin:[13,7],dot:2,shot:10,parselin:4,add_bow:[12,6],api:11,text:[13,12,11,10],"__str__":2,random:12,create_token:6,word1:[2,6],word2:[2,6],economist:7,find:[2,6,7,3],"730st":10,c
urrent:9,onli:10,categor:10,configur:10,bust:10,should:[1,2],pyutil:[1,9],dict:[13,4,11,8,3],to_fil:[12,6],folder:7,ysim:3,hope:1,surviv:10,get:[2,11,12,9],express:10,watch:10,stopword:10,sqrt:2,report:10,youtub:[11,9],him:10,requir:10,enabl:4,"public":1,tfidf:[5,1],bag:[5,2,6,12],gram:0,add_docu:12,where:12,seamlessli:6,summari:10,bleu:[0,1],set:[0,4,6,12,3],fair:10,column_head:4,see:[0,1,2,8,13,6,7,3,10,11,12],result:7,tokenit:6,fail:3,sport:10,concern:10,inconveni:10,hopefulli:3,wikipedia:[13,9],iter_obj:6,figur:9,score:0,between:[2,6,12],"import":[2,4,5,6,10,12],experi:7,email:10,attribut:2,accord:[12,10],kei:[4,6,3,9],start:9,drew:10,highlight:10,distinguish:10,refman:10,"2pm":10,valid:[13,8,7],popul:6,water:10,tag_email:10,howev:10,video_id:11,foreign:7,tag_phon:10,etc:[2,6,12,10],instanc:7,start_url:9,mani:10,ycutil:[0,1,2,3,4,5,6,7,8,9,10,11,12,13],com:10,comment:4,pre:7,prespecifi:10,tokensto:[2,6],hyphen:10,from_fil:[12,6],except:10,header:4,linux:3,assum:6,duplic:12,bigbow:6,creat:[12,6],coupl:10,invers:12,empti:[4,10],compon:9,besid:10,treat:[4,10],basic:[2,3,9],"__len__":12,get_video_id:11,imag:10,search:7,argument:10,tiger:10,togeth:[2,6],sort_kei:[12,9],repetit:10,present:10,statist:[5,12],"case":10,look:[2,6],gnu:1,webpag:[8,9,13,11,3,7],corpora:[12,6],untransform:5,"__phone__":10,defin:[12,6,7,10],calcul:[0,5,2],abov:10,error:[12,6,3,10],huffington:7,anchor:7,bin:3,printabl:[7,9],have:[1,6,10],stdout:[12,6],papineni:0,metric:0,non:10,words_in_sent:10,recal:0,from_corpu:12,ascii:10,use_rul:7,"__getitem__":[12,6],perform:[5,6],gecko:3,make:4,headlight:10,cross:[7,10],same:[2,7,3,9,10],python:[1,11,3,9,10],html:[9,10],split:[6,10],largest:12,from_filenam:[12,6],jerri:10,tag_word:10,tree:7,http:[1,3,9,10],again:10,rais:[12,6,10],user:[3,10],bloomberg:7,improv:7,new_word:12,column_count:4,googl:[8,11],lower:10,add_wc_str:[2,6],nola:10,bow1:2,bow2:2,exampl:[5,2,12,9,10],thi:[0,1,2,3,4,5,6,7,8,9,10,11,12,13],gzip:3,clitic:10,model:[7,10],ref_ngram:0,identi
fi:[7,3,10],execut:3,boot:10,human:3,mysql:10,languag:[13,1],web:[7,3],field:4,tag_empti:10,lake:10,param:12,add:[2,6,7,12],book:8,els:[6,10],save:[12,6],tag_num:10,build:12,real:7,transpar:6,read:[4,6,12],prefer:9,load:[12,10],piec:10,assert:2,punctuat:10,traffic:10,know:10,world:[5,2,6,4,12],like:[12,2,6,4,10],success:3,whitespac:10,corpor:7,negat:10,integ:6,server:3,collect:[2,6,12],"boolean":13,necessari:7,either:[1,4],yorker:7,popular:[],output:2,page:7,encount:[12,6],www:1,deal:[3,9,10],twitter:10,ascend:12,back:[12,3,10],sampl:10,instal:3,home:3,bore:10,librari:[11,3],pinpoint:10,lead:[12,7],channel:10,avoid:12,normal:10,fox:7,achiev:7,per:6,retri:3,larg:12,filter_stopword:10,refer:[0,3],machin:0,object:[8,2,13,6,11,3,12,4],run:7,compress:3,ngram:0,hexadecim:[2,12],nutshel:7,x11:3,ref_len:0,inverse_document_frequ:12,post:7,about:[0,8,13,11,3,10,12],column:[4,6,12],to_wc_str:[2,6],doc_count:12,"__mul__":2,idf:[5,12],tag_list:10,disabl:4,block:10,client:11,own:10,vocabulari:[12,6],within:[6,10],automat:0,three:10,warranti:1,unknown_token:[12,6],weather:10,tag_token:10,announc:3,lebron:10,your:[1,3],tsvio:[1,4,12],few:10,transform:[5,6],textual:12,delete_token:6,avail:[4,9,10],jdeberri:10,reli:7,interfac:10,includ:0,"function":[13,4,6,9,10],head:10,form:[2,6,12],tupl:[13,12,10],keyerror:6,use_count:12,link:7,translat:0,corpus_vocabulari:[5,6],line:[4,6],"true":[0,4,13,6,7,3,10,12],sent:10,count:[0,2,6,12],jarvi:10,possibl:12,"default":[13,6,7,3,10,12],wish:7,maximum:[0,12,3],record:10,below:10,limit:4,otherwis:[13,6,10],readlin:4,similar:2,classif:7,featur:7,constant:12,evalu:0,find_token:12,to_bow:12,repres:6,twist:10,inc_token_count:6,exist:[12,6],rule:[7,10],file:[4,6,12,3],doe:[12,10],wget_path:3,check:[13,12,10],inc:6,denot:12,quot:10,titl:[13,2,12],when:[2,6,12,9,10],detail:[1,6,7],rubber:10,flood:10,other:[2,10],test:4,you:[1,7,10],nice:4,node:10,mozilla:3,knew:10,comment_char:4,consid:[0,10],cosine_similar:2,ago:10,printableurl:7,bitbucket:9,receiv:1,eof
:4,directori:7,descript:[8,11,12,3],random_titl:2,wc_string:[2,6,12],train:7,ignor:[4,6,12,10],pred_ngram:0,time:[7,3,10],far:10,escap:10,hello:[5,2,6,4,12]},objtypes:{"0":"py:module","1":"py:method","2":"py:data","3":"py:class","4":"py:function","5":"py:staticmethod","6":"py:attribute"},titles:["<em>bleu</em> module","Documentation for ycutils","<em>bagofwords</em> module","<em>webpages</em> module","<em>tsvio</em> module","<em>tfidf</em> module","<em>bigvocab</em> module","<em>printable</em> module","<em>googlebooks</em> module","<em>urls</em> module","<em>tokenize</em> module","<em>youtube</em> module","<em>corpus</em> module","<em>wikipedia</em> module"],objnames:{"0":["py","module","Python module"],"1":["py","method","Python method"],"2":["py","data","Python data"],"3":["py","class","Python class"],"4":["py","function","Python function"],"5":["py","staticmethod","Python static method"],"6":["py","attribute","Python attribute"]},filenames:["bleu","index","bagofwords","urls/webpages","tsvio","tfidf","bigvocab","urls/printable","urls/googlebooks","urls","tokenize","urls/youtube","corpus","urls/wikipedia"]})
+Search.setIndex({objects:{ycutils:{bleu:[0,0,1,""],bagofwords:[2,0,1,""],tsvio:[4,0,1,""],tfidf:[5,0,1,""],bigvocab:[6,0,1,""],urls:[7,0,1,""],tokenize:[8,0,1,""],corpus:[9,0,1,""]},"ycutils.bagofwords":{Document:[2,3,1,""],cosine_similarity:[2,4,1,""],random_title:[2,4,1,""],BOW:[2,3,1,""]},"ycutils.bigvocab.BigBOW":{itertokens:[6,1,1,""],TokenIter:[6,3,1,""],add_bow:[6,1,1,""],add_tokens:[6,1,1,""],delete_token:[6,1,1,""],tokens:[6,1,1,""],inc_token_count:[6,1,1,""],get_token_count:[6,1,1,""],to_wc_string:[6,1,1,""],set_token_count:[6,1,1,""],add_wc_string:[6,1,1,""]},"ycutils.tfidf.TFIDF":{untransform:[5,1,1,""],transform:[5,1,1,""]},"ycutils.urls.webpages":{download:[3,4,1,""],MAX_TRIES:[3,2,1,""],WGET_PATH:[3,2,1,""],USER_AGENT:[3,2,1,""]},"ycutils.bleu":{count_ngrams:[0,4,1,""],score:[0,4,1,""]},"ycutils.bagofwords.BOW":{normalize:[2,1,1,""],dot_product:[2,1,1,""],to_wc_string:[2,1,1,""],"__str__":[2,1,1,""],"__iadd__":[2,1,1,""],l2_norm:[2,1,1,""],add_tokens:[2,1,1,""],"__mul__":[2,1,1,""],l1_norm:[2,1,1,""],add_wc_string:[2,1,1,""]},"ycutils.tokenize":{words_in_sentences:[8,4,1,""],tag_tokens:[8,4,1,""],TAG_PHONE:[8,2,1,""],TAG_URL:[8,2,1,""],TAG_PUNCT:[8,2,1,""],TAG_TIME:[8,2,1,""],to_ascii:[8,4,1,""],words:[8,4,1,""],sentences:[8,4,1,""],TAG_EMPTY:[8,2,1,""],TAG_EMAIL:[8,2,1,""],TAG_NUM:[8,2,1,""],TAG_WORD:[8,2,1,""]},"ycutils.corpus":{CorpusVocabulary:[9,3,1,""],Corpus:[9,3,1,""],DEFAULT_UNKNOWN_TOKEN:[9,2,1,""]},"ycutils.corpus.Corpus":{from_file:[9,1,1,""],unique_title:[9,1,1,""],vocabulary:[9,1,1,""],add_bow:[9,1,1,""],inverse_document_frequency:[9,1,1,""],add_document:[9,1,1,""],document_frequency:[9,1,1,""],IDF_LAPLACE_SMOOTHING:[9,5,1,""],to_file:[9,1,1,""]},"ycutils.bigvocab":{BigBOW:[6,3,1,""],VocabularyMap:[6,3,1,""]},"ycutils.bagofwords.Document":{to_wc_string:[2,1,1,""],title:[2,5,1,""],"__str__":[2,1,1,""],add_wc_string:[2,1,1,""],bow:[2,1,1,""]},"ycutils.tfidf":{TFIDF:[5,3,1,""]},"ycutils.tsvio.TSVFile":{readline:[4,1,1,""],column_headers:[4
,5,1,""],readlines:[4,1,1,""],writeline:[4,1,1,""],parseline:[4,1,1,""]},"ycutils.corpus.CorpusVocabulary":{from_file:[9,1,1,""],"__getitem__":[9,1,1,""],"__contains__":[9,1,1,""],from_corpus:[9,1,1,""],find_token:[9,1,1,""],to_bow:[9,1,1,""],filter:[9,1,1,""],"__iter__":[9,1,1,""],iteritems:[9,1,1,""],to_file:[9,1,1,""],"__len__":[9,1,1,""]},"ycutils.urls":{sort_key:[7,4,1,""],webpages:[3,0,1,""],follow_url:[7,4,1,""]},"ycutils.bigvocab.VocabularyMap":{from_file:[6,1,1,""],get_indexes:[6,1,1,""],add_bow:[6,1,1,""],add_tokens:[6,1,1,""],keys_to_tokens:[6,1,1,""],get_index:[6,1,1,""],create_token:[6,5,1,""],unknown_token:[6,5,1,""],get_token:[6,1,1,""],keys_to_indexes:[6,1,1,""],get_tokens:[6,1,1,""],to_file:[6,1,1,""],size:[6,1,1,""]},"ycutils.tsvio":{TSVFile:[4,3,1,""]}},terms:{corpu:[5,1,9,6],represent:2,all:[0,6,8],code:3,"__email__":8,get_token:6,follow:[3,7,8],row:4,categori:8,tag_url:8,dot_product:2,corpusvocabulari:[5,9,6],show:8,readabl:3,send:3,cosine_similar:2,inverse_document_frequ:9,those:8,under:[1,6],norm:2,merchant:1,sourc:[0,2,3,4,5,6,7,8,9],everi:8,string:[2,4,6,3,8,9],fals:[0,4,6,3,8,9],unicodedata:8,mime_typ:3,fall:8,veri:6,retriev:[2,6,9],to_ascii:8,tri:[3,8],did:8,list:[0,2,4,6,7,8],iter:[2,6,9],l2_norm:2,vector:[5,2],cosin:2,"__time__":8,small:9,wednesdai:8,pontchartrain:8,deberri:8,impli:1,corpora:[9,6],natur:1,naiv:8,sign:8,past:8,zero:9,pass:[5,3,8],download:[3,7],"__num__":8,even:1,index:[6,7,8],what:8,sub:7,currenc:8,neg:8,sum:[2,6],idf_laplace_smooth:9,abl:6,uniform:9,access:6,delet:[6,8],version:1,consecut:8,"new":6,method:[0,2,7,5,6,3,8,9],itertoken:6,full:7,deriv:6,iteritem:9,gener:[1,2,4],never:[6,8],here:8,behaviour:6,punct:8,address:[3,7,8],path:[9,3,7],along:1,modifi:[5,1],valu:[5,4,6,9],either:[1,4],convert:[2,6,8],max_tri:3,step:8,bow:[5,2,6,9],precis:0,amount:8,bunker:8,firefox:3,appli:8,modul:[0,1,2,3,4,5,6,7,8,9],foundat:1,href:7,touchdown:8,vovabulari:6,total:2,all_smal:0,regex:8,keys_to_token:6,from:[0,2,4,6,7,8,9],describ:1
,would:8,follow_url:7,two:[0,2,8],todai:8,next:4,websit:3,predict:0,call:8,type:[4,6,9,3,8],until:4,more:[0,1,9],sort:[9,7],fulltext:8,python:[1,3,7,8],peopl:8,splitta:8,relat:9,keys_to_index:6,about:[0,9,3,8],enhanc:4,warn:6,phone:8,"__iter__":[9,6],vocabularymap:6,particular:1,count1:[2,6],count2:[2,6],fly:6,none:[2,4,5,6,3,8,9],word:[5,2,6,9,8],hour:8,hous:8,uniqu:[2,9],dev:8,descriptor:[4,6,9],tag:8,del:[6,8],can:[1,9,6,8],cab:8,purpos:1,want:4,give:7,process:[1,4,8],weightag:9,agent:3,minimum:9,tab:[9,2,6,4],bagofword:[5,1,2,6,9],cours:1,end:[4,8],divid:9,"0x13fab206d2da1b9f":9,i686:3,write:[4,6,9],how:8,pure:8,instead:[4,6],simpl:8,unique_titl:9,updat:3,map:6,product:2,likewis:8,after:[2,3,8],befor:[6,8],date:3,data:[2,4,7],leve:8,attempt:3,counter:[5,2,6,9],correspond:[6,8],caus:[3,8],inform:[0,9,3,8],writelin:4,order:9,wind:8,over:[9,6,8],move:4,through:[2,5,6,3,8,9],not_punctu:8,still:8,paramet:[0,2,3,4,5,6,7,8,9],fit:1,timespicayun:8,tag_tim:8,whether:[4,9],tsv:4,bye:[2,6,9],"__money__":8,descend:9,them:[4,6,8],"return":[2,3,4,6,7,8,9],thei:[2,6,8],handl:[9,2,6,4],sentenc:[0,8],count_ngram:0,handi:1,initi:[5,2,6],nation:8,bigvocab:[1,6],save_unknown:9,now:8,term:[1,9,6],document:[5,1,2,6,9],name:4,revers:[5,7],separ:[4,6,9,8],token:[0,1,2,6,8,9],each:[4,9,3,8],found:[9,6],unicod:8,set_token_count:6,mean:8,u2019:8,domain:7,tag_punct:8,individu:8,hard:8,realli:2,redistribut:1,connect:8,year:8,our:[9,2,6,4,8],happen:8,bay:8,special:8,out:[4,6,9,7,8],"try":[7,8],network:8,publish:1,research:1,content:3,rel:7,print:[2,4,5,6,7,9],new_token:8,occurr:0,pred_len:0,math:2,gram:0,lakeshor:8,million:6,differ:8,free:1,standard:6,base:[2,8],mime:3,dictionari:[4,6,9],put:6,org:[1,7],freq:[9,6],spill:8,frequenc:[0,9,6],could:[3,8],keep:8,filter:[9,8],length:[0,8],place:[5,2,6],isn:8,return_tag:8,onto:8,assign:8,frequent:1,first:[4,9,8],origin:7,softwar:1,rang:9,"__unk__":[9,6],strip_unicod:8,number:[0,4,6,3,8,9],yourself:3,mai:7,unlik:8,wrapper:[5,6],wasn:8,skip_empti:4,t
svfile:[4,9],open:4,size:6,given:[0,4,7,5,6,3,8,9],"long":8,espn:8,start:7,unknown:[9,6],licens:1,system:3,messag:3,yesterdai:8,citi:8,add_token:[2,6],process_token:8,"final":7,store:2,document_frequ:9,especi:8,"public":1,copi:1,specifi:[4,8],user_ag:3,"__url__":8,part:[7,8],sum_to:2,kind:8,peril:8,googlebook:7,whenev:9,provid:2,remov:[9,3,8],get_index:6,second:8,structur:2,charact:[2,4,8],matter:6,were:8,posit:6,randomli:2,sai:8,comput:[0,5],ani:[1,2,6,8],dash:8,increment:6,need:5,seen:6,skyland:7,option:[1,9,8],callback:8,note:[5,9,6,3,8],also:[0,9,6,8],exampl:[5,2,9,7,8],take:[4,6,8],which:[9,2,6,4,7],get_token_count:6,subject:8,tool:3,channel:8,distribut:1,normal:[2,8],multipli:2,previou:6,reach:8,regular:8,contain:[0,2,3,4,5,6,7,8,9],new_tag:8,"class":[2,4,5,6,7,9],format:[9,2,6,4],vocab_s:9,url:[1,3,7,8],doc:[9,8],later:1,drive:8,l1_norm:2,declar:3,dot:2,shot:8,parselin:4,add_bow:[9,6],text:[9,8],"__str__":2,random:9,create_token:6,word1:[2,6],word2:[2,6],find:[2,6,3],"730st":8,current:7,onli:8,locat:3,categor:8,configur:8,bust:8,should:[1,2,8],pyutil:[1,7],dict:[4,3],to_fil:[9,6],ysim:3,hope:1,surviv:8,get:[2,9,7],express:8,watch:8,stopword:8,sqrt:2,report:8,youtub:7,him:8,requir:8,enabl:4,default_unknown_token:9,tfidf:[5,1],bag:[5,2,6,9],statist:[5,9],add_docu:9,where:9,seamlessli:6,summari:8,bleu:[0,1],set:[0,4,6,9,3],fair:8,column_head:4,see:[0,1,2,6,3,8,9],tokenit:6,fail:3,sport:8,concern:8,inconveni:8,hopefulli:3,wikipedia:7,iter_obj:6,figur:7,score:0,between:[2,6,9],"import":[2,4,5,6,8,9],email:8,attribut:2,accord:[9,8],kei:[4,6,3,7],drew:8,highlight:8,distinguish:8,refman:8,"2pm":8,popul:6,water:8,tag_email:8,howev:8,tag_phon:8,etc:[2,6,9,8],start_url:7,mani:8,ycutil:[0,1,2,3,4,5,6,7,8,9],com:8,comment:4,prespecifi:8,tokensto:[2,6],hyphen:8,from_fil:[9,6],except:8,header:4,linux:3,assum:6,duplic:9,bigbow:6,creat:[9,6],coupl:8,invers:9,empti:[4,8],compon:7,besid:8,treat:[4,8],basic:[2,3,7],"__len__":9,imag:8,argument:8,assert:2,togeth:[2,6],sort_kei:[9,
7],repetit:8,present:8,"case":8,look:[2,6],gnu:1,webpag:[3,7],zlib:3,untransform:5,"__phone__":8,"__iadd__":2,defin:[9,6,8],calcul:[0,5,2],abov:8,error:[9,6,3,8],printabl:7,have:[1,6,8],stdout:[9,6],papineni:0,metric:0,non:8,words_in_sent:8,recal:0,from_corpu:9,ascii:8,"__getitem__":[9,6],perform:[5,6],wget_path:3,make:4,headlight:8,cross:8,same:[2,3,7,8],binari:3,html:[7,8],split:[6,8],largest:9,from_filenam:[9,6],jerri:8,tag_word:8,http:[1,3,7,8],again:8,rais:[9,6,8],temporari:3,user:[3,8],new_word:9,column_count:4,lower:8,add_wc_str:[2,6],nola:8,bow1:2,bow2:2,without:[1,2],thi:[0,1,2,3,4,5,6,7,8,9],gzip:3,clitic:8,model:8,ref_ngram:0,identifi:[3,8],execut:3,boot:8,human:3,mysql:8,ngram:0,monei:8,languag:1,web:3,field:4,tag_empti:8,lake:8,param:9,add:[2,6,9],els:[6,8],save:[9,6,3],tag_num:8,build:9,bin:3,transpar:6,read:[4,6,9],prefer:7,load:[9,8],piec:8,punctuat:8,traffic:8,know:8,world:[5,2,6,4,9],save_to:3,like:[9,2,6,4,8],success:3,whitespac:8,negat:8,integ:6,server:3,collect:[2,6,9],singl:8,output:2,encount:[9,6],www:1,right:8,deal:[3,7,8],twitter:8,ascend:9,back:[9,3,8],sampl:8,instal:3,home:3,bore:8,librari:3,txt:4,pinpoint:8,lead:9,"__contains__":9,avoid:9,though:8,per:6,retri:3,larg:9,filter_stopword:8,refer:[0,3],machin:0,object:[9,2,6,4,3],compress:3,pirogu:8,hexadecim:[2,9],x11:3,ref_len:0,wget:3,deflat:3,column:[4,6,9],to_wc_str:[2,6],doc_count:9,"__mul__":2,idf:[5,9],tag_list:8,disabl:4,block:8,own:8,vocabulari:[9,6],within:[6,8],automat:0,three:8,warranti:1,unknown_token:[9,6],weather:8,tag_token:8,announc:3,lebron:8,your:[1,3],tsvio:[1,4,9],few:8,transform:[5,6],textual:9,delete_token:6,avail:[4,7,8],jdeberri:8,interfac:8,includ:0,replac:[9,8],"function":[4,6,7,8],head:8,form:[2,6,9],tupl:[9,8],keyerror:6,use_count:9,translat:0,corpus_vocabulari:[5,6],line:[4,6],"true":[0,4,6,3,8,9],sent:8,count:[0,2,6,9],jarvi:8,possibl:9,"default":[9,6,3,8],maximum:[0,9,3],record:8,below:8,limit:4,otherwis:[6,8],readlin:4,similar:2,sum_unknown:6,constant:9,evalu:
0,find_token:9,to_bow:9,repres:6,twist:8,inc_token_count:6,exist:[9,6],rule:8,file:[4,6,9,3],doe:[9,8],gecko:3,check:[9,8],inc:6,denot:9,quot:8,titl:[2,9],when:[2,6,9,7,8],detail:[1,6],rubber:8,flood:8,other:[2,8],test:4,you:[1,8],nice:4,node:8,mozilla:3,knew:8,comment_char:4,consid:[0,8],ago:8,tiger:8,bitbucket:7,receiv:1,eof:4,directori:3,descript:[9,3],random_titl:2,wc_string:[2,6,9],ignor:[4,6,9,8],pred_ngram:0,time:[3,8],far:8,escap:8,hello:[5,2,6,4,9]},objtypes:{"0":"py:module","1":"py:method","2":"py:data","3":"py:class","4":"py:function","5":"py:attribute"},titles:["<em>bleu</em> module","Documentation for ycutils","<em>bagofwords</em> module","<em>webpages</em> module","<em>tsvio</em> module","<em>tfidf</em> module","<em>bigvocab</em> module","<em>urls</em> module","<em>tokenize</em> module","<em>corpus</em> module"],objnames:{"0":["py","module","Python module"],"1":["py","method","Python method"],"2":["py","data","Python data"],"3":["py","class","Python class"],"4":["py","function","Python function"],"5":["py","attribute","Python attribute"]},filenames:["bleu","index","bagofwords","urls/webpages","tsvio","tfidf","bigvocab","urls","tokenize","corpus"]})

docs/html/tokenize.html

 
 <dl class="function">
 <dt id="ycutils.tokenize.words">
-<tt class="descclassname">ycutils.tokenize.</tt><tt class="descname">words</tt><big>(</big><em>text, strip_unicode=False, normalize=['case', 'phone', 'time', 'url', 'email', 'number', 'punct-del', 'hyphen-split', 'clitics-del', 'neg-clitics-keep'], tag_list=['phone', 'time', 'url', 'email', 'number'], filter_stopwords=False, not_punctuations='', return_tags=False</em><big>)</big><a class="reference internal" href="_modules/ycutils/tokenize.html#words"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#ycutils.tokenize.words" title="Permalink to this definition">¶</a></dt>
+<tt class="descclassname">ycutils.tokenize.</tt><tt class="descname">words</tt><big>(</big><em>text, strip_unicode=False, normalize=['case', 'consecutive', 'phone', 'time', 'url', 'email', 'number', 'money', 'punct-del', 'hyphen-split', 'clitics-del', 'neg-clitics-keep'], tag_list=['phone', 'time', 'url', 'email', 'number', 'money'], filter_stopwords=False, not_punctuations='', return_tags=False, process_token=None</em><big>)</big><a class="reference internal" href="_modules/ycutils/tokenize.html#words"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#ycutils.tokenize.words" title="Permalink to this definition">¶</a></dt>
 <dd><p>Tokenize a given string into individual words. This tokenizer is based on a couple of regexes. It first splits a sentence by whitespace into tokens, and tries to identify (tag) each token with a category (through <a class="reference internal" href="#ycutils.tokenize.tag_tokens" title="ycutils.tokenize.tag_tokens"><tt class="xref py py-meth docutils literal"><span class="pre">tag_tokens()</span></tt></a>), for example url, email, number, etc. One can configure the types of tags to check for by specifying them in the <cite>tag_list</cite> argument, and the categories to normalize in the <cite>normalize</cite> argument.</p>
 <p>Besides normalizing for categories, <tt class="xref py py-attr docutils literal"><span class="pre">normalize</span></tt> contains options that specify how we deal with punctuation, clitics, negation clitics and case.</p>
 <p id="normalization-options">Below is a summary of available normalization options:</p>
 <li><tt class="docutils literal"><span class="pre">url</span></tt>: Replace URLs with <tt class="docutils literal"><span class="pre">__URL__</span></tt>.</li>
 <li><tt class="docutils literal"><span class="pre">email</span></tt>: Replace email addresses with <tt class="docutils literal"><span class="pre">__EMAIL__</span></tt>.</li>
 <li><tt class="docutils literal"><span class="pre">number</span></tt>: Replace numbers with <tt class="docutils literal"><span class="pre">__NUM__</span></tt>.</li>
+<li><tt class="docutils literal"><span class="pre">money</span></tt>: Replace currency with <tt class="docutils literal"><span class="pre">__MONEY__</span></tt>.</li>
 <li><tt class="docutils literal"><span class="pre">punct-split</span></tt>: Separate punctuations into their own tokens. However, consecutive punctuations will be considered as a single token.</li>
 <li><tt class="docutils literal"><span class="pre">punct-del</span></tt>: Remove all punctuations except hyphens, dashes, single quotes and those specified in <tt class="xref py py-attr docutils literal"><span class="pre">not_punctuations</span></tt>.</li>
 <li><tt class="docutils literal"><span class="pre">hyphen-split</span></tt>: Separate hyphenated tokens into individual tokens. Hyphens are removed in the process.</li>
 <li><strong>tag_list</strong> &#8211; specifies the type of category to tag. See <a class="reference internal" href="#ycutils.tokenize.tag_tokens" title="ycutils.tokenize.tag_tokens"><tt class="xref py py-meth docutils literal"><span class="pre">tag_tokens()</span></tt></a>.</li>
 <li><strong>not_punctuations</strong> &#8211; string of punctuations to ignore and not consider as punctuations. By default, all punctuations (as categorized by Python <cite>unicodedata</cite>) will be considered.</li>
 <li><strong>filter_stopwords</strong> &#8211; if <tt class="docutils literal"><span class="pre">True</span></tt>, the default stopword list will be used. Otherwise, stopwords are removed according to the given list. The default stopword list is taken from MySQL <tt class="xref py py-attr docutils literal"><span class="pre">STOPWORDS</span></tt> (<a class="reference external" href="http://dev.mysql.com/doc/refman/5.5/en/fulltext-stopwords.html">http://dev.mysql.com/doc/refman/5.5/en/fulltext-stopwords.html</a>).</li>
+<li><strong>process_token</strong> &#8211; a callback function with two arguments <tt class="xref py py-attr docutils literal"><span class="pre">tag</span></tt> and <tt class="xref py py-attr docutils literal"><span class="pre">token</span></tt>. It will be called on every token in the list right after our tokenization steps and before filtering consecutive tags and stopwords. The callback should return a tuple of <tt class="docutils literal"><span class="pre">(new_token,</span> <span class="pre">new_tag)</span></tt>, or <tt class="docutils literal"><span class="pre">None</span></tt> to delete the token.</li>
 </ul>
 </td>
 </tr>
 <p class="first admonition-title">Note</p>
 <p class="last">For a special token to be normalized (i.e URL, phone, email, etc), it has to be in the tag list, otherwise it is treated as a word and tokenized according to the specified punctuations/hyphenation rules. Likewise, if it is in <tt class="xref py py-attr docutils literal"><span class="pre">tag_list</span></tt> but not in the <tt class="xref py py-attr docutils literal"><span class="pre">normalize</span></tt> list, it would not be subject to clitics/punctuations/hyphenation rules.</p>
 </div>
+<div class="admonition note">
+<p class="first admonition-title">Note</p>
+<p class="last">Unicode normalization does not ASCII the text, unlike the <tt class="xref py py-attr docutils literal"><span class="pre">strip_unicode</span></tt> option.</p>
+</div>
 </dd></dl>
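The `process_token` contract described above (called with a tag and a token, returning `(new_token, new_tag)` or `None` to delete the token) can be illustrated with a small standalone sketch. It does not call ycutils: `apply_process_token`, the tag strings `"word"`, `"url"` and `"number"`, the `__LINK__` placeholder, and the `(tag, token)` argument order are all assumptions made for illustration.

```python
def apply_process_token(tagged_tokens, process_token):
    """Apply a callback to each (token, tag) pair.

    Pairs for which the callback returns None are dropped; otherwise the
    returned (new_token, new_tag) tuple replaces the original pair.
    """
    result = []
    for token, tag in tagged_tokens:
        outcome = process_token(tag, token)  # assumed argument order: tag, then token
        if outcome is None:
            continue  # None means "delete this token"
        new_token, new_tag = outcome
        result.append((new_token, new_tag))
    return result


def drop_numbers_mask_urls(tag, token):
    """Hypothetical callback: delete numbers, mask URLs, keep everything else."""
    if tag == "number":
        return None
    if tag == "url":
        return ("__LINK__", tag)
    return (token, tag)


tagged = [("visit", "word"), ("http://example.com", "url"), ("42", "number")]
processed = apply_process_token(tagged, drop_numbers_mask_urls)
print(processed)
```

Returning `None` rather than an empty string keeps deletion distinct from a token that is merely rewritten to something short.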
 
 <dl class="function">
 
 <dl class="function">
 <dt id="ycutils.tokenize.words_in_sentences">
-<tt class="descclassname">ycutils.tokenize.</tt><tt class="descname">words_in_sentences</tt><big>(</big><em>sents, strip_unicode=False, normalize=['case', 'phone', 'time', 'url', 'email', 'number', 'punct-del', 'hyphen-split', 'clitics-del', 'neg-clitics-keep'], tag_list=['phone', 'time', 'url', 'email', 'number'], filter_stopwords=False, not_punctuations='', return_tags=False</em><big>)</big><a class="reference internal" href="_modules/ycutils/tokenize.html#words_in_sentences"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#ycutils.tokenize.words_in_sentences" title="Permalink to this definition">¶</a></dt>
+<tt class="descclassname">ycutils.tokenize.</tt><tt class="descname">words_in_sentences</tt><big>(</big><em>sents, strip_unicode=False, normalize=['case', 'consecutive', 'phone', 'time', 'url', 'email', 'number', 'money', 'punct-del', 'hyphen-split', 'clitics-del', 'neg-clitics-keep'], tag_list=['phone', 'time', 'url', 'email', 'number', 'money'], filter_stopwords=False, not_punctuations='', return_tags=False, process_token=None</em><big>)</big><a class="reference internal" href="_modules/ycutils/tokenize.html#words_in_sentences"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#ycutils.tokenize.words_in_sentences" title="Permalink to this definition">¶</a></dt>
 <dd><p>Tokenize by words a list of sentences. Empty sentences will be removed.</p>
 <table class="docutils field-list" frame="void" rules="none">
 <col class="field-name" />
 
 <dl class="function">
 <dt id="ycutils.tokenize.tag_tokens">
-<tt class="descclassname">ycutils.tokenize.</tt><tt class="descname">tag_tokens</tt><big>(</big><em>tokens, tag_list=['phone', 'time', 'url', 'email', 'number']</em><big>)</big><a class="reference internal" href="_modules/ycutils/tokenize.html#tag_tokens"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#ycutils.tokenize.tag_tokens" title="Permalink to this definition">¶</a></dt>
+<tt class="descclassname">ycutils.tokenize.</tt><tt class="descname">tag_tokens</tt><big>(</big><em>tokens, tag_list=['phone', 'time', 'url', 'email', 'number', 'money']</em><big>)</big><a class="reference internal" href="_modules/ycutils/tokenize.html#tag_tokens"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#ycutils.tokenize.tag_tokens" title="Permalink to this definition">¶</a></dt>
 <dd><p>Given a list of tokens, try to assign a category tag to each of them using prespecified regular expressions.</p>
 <table class="docutils field-list" frame="void" rules="none">
 <col class="field-name" />

docs/html/urls/webpages.html

 <p>This module contains basic methods that deal with URL addresses.</p>
 <span class="target" id="module-ycutils.urls.webpages"></span><dl class="function">
 <dt id="ycutils.urls.webpages.download">
-<tt class="descclassname">ycutils.urls.webpages.</tt><tt class="descname">download</tt><big>(</big><em>url</em>, <em>referer=None</em>, <em>wget_path='/home/ysim/tools/wget-1.13.4/bin/wget'</em>, <em>user_agent='Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11'</em>, <em>max_tries=3</em><big>)</big><a class="reference internal" href="../_modules/ycutils/urls/webpages.html#download"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#ycutils.urls.webpages.download" title="Permalink to this definition">¶</a></dt>
+<tt class="descclassname">ycutils.urls.webpages.</tt><tt class="descname">download</tt><big>(</big><em>url</em>, <em>referer=None</em>, <em>wget_path='/home/ysim/tools/wget-1.13.4/bin/wget'</em>, <em>user_agent='Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11'</em>, <em>max_tries=3</em>, <em>save_to=None</em><big>)</big><a class="reference internal" href="../_modules/ycutils/urls/webpages.html#download"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#ycutils.urls.webpages.download" title="Permalink to this definition">¶</a></dt>
 <dd><p>Downloads the given URL by passing it through <a class="reference external" href="http://www.gnu.org/software/wget/">Wget</a>.</p>
 <table class="docutils field-list" frame="void" rules="none">
 <col class="field-name" />
 <li><strong>wget_path</strong> &#8211; path to Wget executable. Defaults to Wget on your system path (see <a class="reference internal" href="#ycutils.urls.webpages.WGET_PATH" title="ycutils.urls.webpages.WGET_PATH"><tt class="xref py py-attr docutils literal"><span class="pre">WGET_PATH</span></tt></a>).</li>
 <li><strong>user_agent</strong> &#8211; user agent string to identify yourself as. Defaults to <a class="reference internal" href="#ycutils.urls.webpages.USER_AGENT" title="ycutils.urls.webpages.USER_AGENT"><tt class="xref py py-attr docutils literal"><span class="pre">USER_AGENT</span></tt></a>.</li>
 <li><strong>max_tries</strong> &#8211; number of times for Wget to retry. Defaults to <a class="reference internal" href="#ycutils.urls.webpages.MAX_TRIES" title="ycutils.urls.webpages.MAX_TRIES"><tt class="xref py py-attr docutils literal"><span class="pre">MAX_TRIES</span></tt></a> setting.</li>
+<li><strong>save_to</strong> &#8211; location of the file to save to. Defaults to a file in the temporary directory, which will be removed after downloading.</li>
 </ul>
 </td>
 </tr>
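The `save_to` behaviour described above can be sketched as a helper that only builds a Wget command line. This is not the ycutils implementation: `build_wget_command` and its `cleanup` flag are hypothetical, and the sketch assumes a `wget` executable on the system path. The flags used (`--quiet`, `--tries`, `--output-document`, `--user-agent`, `--referer`) are standard GNU Wget options.

```python
import os
import tempfile


def build_wget_command(url, wget_path="wget", user_agent=None, referer=None,
                       max_tries=3, save_to=None):
    """Build a Wget argument list.

    Returns (command, output_path, cleanup). cleanup is True when no save_to
    was given and a temporary file was created, which the caller should remove
    after reading the download.
    """
    cleanup = save_to is None
    if cleanup:
        fd, save_to = tempfile.mkstemp(prefix="download-")
        os.close(fd)  # Wget writes to the path itself; only the name is needed
    command = [wget_path, "--quiet", "--tries=%d" % max_tries,
               "--output-document=" + save_to]
    if user_agent:
        command.append("--user-agent=" + user_agent)
    if referer:
        command.append("--referer=" + referer)
    command.append(url)
    return command, save_to, cleanup


command, path, cleanup = build_wget_command(
    "http://example.com/", save_to="/tmp/page.html")
print(command, cleanup)
```

The returned list could be passed to `subprocess.call`; when `cleanup` is true, the caller would delete `output_path` once the content has been read.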