tmln-google • Bitbucket

Text Mining & Langage Naturel: TP1 ~ Google

How To Use

  • Compile & run against a folder's input, with dataset extracted above this repo:
    • make clean all && ./tmln ‹path to one or more folders›⁺
      • Or make test
    • make demo for concurrent | distributed awesomeness
      • This will index every folder under *train/ concurrently,
      • then make some concurrent requests
      • and finally demonstrate a concurrent map-reduce query, using tmln:elastic_search/2
    • Incremental indexing is trivially implemented given the current one-folder-per-index mapping
  • Run benchmarks, e.g. for “build”, with
    • make clean bench.build
  • Step-by-step main function (when only one arg is passed to ./tmln, it actually calls main1):
main1 ([Path]) ->
    Docs = fetch(Path),                  %% Recursively browse Path & load text files in memory
    Proc = fun normalizer/1,             %% The Normalizer text processor
    Tokenizeds = analyze(Docs, [Proc]),  %% Turn text into lists of words using tokenize/3, then apply Proc
    ok = build(Tokenizeds),              %% Store {Word,Urls} pairs into a hashmap
    ok = save("../tmln.save"),
%   io:format("~p\n", [built()]),        %% Display contents of said hashmap
    R1 = search(Q1= <<"erlang">>),       %% Pages where “erlang” appears
    io:format("search(~s) = ~p\n", [Q1,R1]),
    Q2 = [<<"breathe">>, <<"tracks">>, <<"datum">>],
    R2 = search(Q2),                     %% Pages where all those words appear
    io:format("search(~p) = ~p\n", [Q2,R2]),
    io:format("done\n").
  • main, when multiple elements are passed to it:
main (Paths) ->
    MainPid = self(),
    io:format("Indexing using workers:\n"),
    Generations =                        %% Creates concurrent main1/1 instances
        [ begin
              Pid = spawn_link(fun () -> %% Could be run on another machine
                                       main1([Path]),
                                       MainPid ! 'im_ready!',
                                       serve(Path)
                               end),
              io:format("\tproc ~p  will be indexing  ~p\n", [Pid, Path]),
              Pid
          end || Path <- Paths],
    io:format("   --\n"),
    WorkersCount = length(Generations),
    wait_for_workers(WorkersCount),      %% Wait until all workers have built their resp. index
    io:format("All ~p workers ready.\n", [WorkersCount]),
    elastic_search(Generations, [<<"erlang">>]),   %% Query some words
    elastic_search(Generations, [<<"algeria">>]),  %%   in parallel™
    io:format("Done\n"),
    Generations.
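
The build & search steps of main1 can be sketched in a few lines. This is only a minimal illustration, not the code in src/tmln.erl: the module name mini_index, the {word, W} key shape, the regexp-based tokenizer and the doc/1, doc/2 names are all assumptions made for the example.

```erlang
-module(mini_index).
-export([tokenize/1, build/1, search/1, demo/0]).

%% Hypothetical normalizer: lowercase ASCII letters only.
lower(Bin) -> << <<(to_lower(C))>> || <<C>> <= Bin >>.

to_lower(C) when C >= $A, C =< $Z -> C + 32;
to_lower(C) -> C.

%% Turn a binary into a list of lowercase alphanumeric words.
tokenize(Text) when is_binary(Text) ->
    Parts = re:split(lower(Text), "[^a-z0-9]+", [{return, binary}]),
    [W || W <- Parts, W =/= <<>>].

%% Store Word -> Urls in the calling process's dictionary.
%% Docs is a list of {Url, Text} pairs (an assumed input shape).
build(Docs) ->
    lists:foreach(
      fun({Url, Text}) ->
              [add(Word, Url) || Word <- lists:usort(tokenize(Text))]
      end, Docs),
    ok.

add(Word, Url) ->
    Key = {word, Word},
    case get(Key) of
        undefined -> put(Key, [Url]);
        Urls      -> put(Key, ordsets:add_element(Url, Urls))
    end.

lookup(Word) ->
    case get({word, lower(Word)}) of
        undefined -> [];
        Urls      -> Urls
    end.

%% One word: pages where it appears. Several words: pages where all appear.
search(Word) when is_binary(Word) ->
    lookup(Word);
search([]) ->
    [];
search([First | Rest]) ->
    lists:foldl(fun(W, Acc) -> ordsets:intersection(Acc, lookup(W)) end,
                lookup(First), Rest).

demo() ->
    ok = build([{"doc/1", <<"Erlang tracks every datum">>},
                {"doc/2", <<"Breathe. Erlang tracks each datum!">>}]),
    {search(<<"erlang">>),
     search([<<"breathe">>, <<"tracks">>, <<"datum">>])}.
```

Calling mini_index:demo() from the shell indexes both documents in the shell process and returns the single-word result next to the all-words result. Since the index lives in the process dictionary, it is only visible to the process that called build/1.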

How to specify what to search & among which set?

Here is the REPL equivalent of make demo (input reduced for convenience):

$ make debug
⋮
Eshell V6.0  (abort with ^G)
1> G = tmln:main( filelib:wildcard("../20news-bydate/20news-bydate-train/*") ).
Indexing using workers:
        proc <0.30.0>  will be indexing  "../20news-bydate/20news-bydate-train/alt.atheism"
        proc <0.31.0>  will be indexing  "../20news-bydate/20news-bydate-train/comp.graphics"
        ⋮
   --
…snip…
All 20 workers ready.
…snip…
2> tmln:elastic_search(G, [<<"pink">>, <<"floyd">>]).
⋮
["../20news-bydate/20news-bydate-train/sci.space/59905"]
^G q
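
In the session above, the query is the second argument and the set to search is the list of worker pids passed first, so narrowing the set means passing fewer pids. The scatter/gather idea can be sketched as below. This is an assumption-laden illustration, not tmln:elastic_search/2 itself: the module mini_scatter, its message shapes, the single-word query and the rec.music/7, sci.space/1 names are hypothetical.

```erlang
-module(mini_scatter).
-export([start_worker/1, query/2, demo/0]).

%% Spawn a worker owning a tiny index: a list of {Word, Urls} pairs.
start_worker(Index) ->
    spawn_link(fun() -> serve(Index) end).

%% Answer search requests until told to stop.
serve(Index) ->
    receive
        {From, Ref, search, Word} ->
            From ! {Ref, results, proplists:get_value(Word, Index, [])},
            serve(Index);
        stop ->
            ok
    end.

%% Scatter the query to every pid, then gather one reply per pid.
query(Pids, Word) ->
    Ref = make_ref(),
    [Pid ! {self(), Ref, search, Word} || Pid <- Pids],
    gather(length(Pids), Ref, []).

gather(0, _Ref, Acc) ->
    lists:sort(lists:append(Acc));
gather(N, Ref, Acc) ->
    receive
        {Ref, results, Part} -> gather(N - 1, Ref, [Part | Acc])
    after 5000 ->
        error(timeout)
    end.

demo() ->
    Space = start_worker([{<<"floyd">>, ["sci.space/1"]},
                          {<<"orbit">>, ["sci.space/2"]}]),
    Music = start_worker([{<<"floyd">>, ["rec.music/7"]}]),
    All = query([Space, Music], <<"floyd">>),   %% search both sets
    OnlyMusic = query([Music], <<"floyd">>),    %% search one set only
    [Pid ! stop || Pid <- [Space, Music]],
    {All, OnlyMusic}.
```

mini_scatter:demo() returns the results over both workers next to the results over the Music worker alone, which is the same selection mechanism as passing a sublist of G in the REPL.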

Requirements

  • make, curl
  • brew install erlang
  • The dataset
    • mkdir ../20news-bydate && cd ../20news-bydate
    • curl 'http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz' | tar xz

Notes

  • All the interesting code is in src/tmln.erl
  • While testing on 20news-bydate/20news-bydate-train/comp.os.ms-windows.misc, with 4 cores & 2GB RAM available:
    • The plists module implements concurrent lists utilities
      • I was able to gain a 10x speedup on analyze just using plists' map
      • (from ~4s to ~700ms)
    • The hashmap is the process's dictionary: atomic, mutating updates
      • 100x speedup on build just going from ETS (the de facto Erlang mutable RAM storage) to the process dictionary (mutable, process-local) #HolyShit!
      • (from ~20s to ~500ms)
  • URLs are hashed and an immediate is used in their place
    • Bench shows a small (CPU & mem) perf decrease, but keeping the code (small, may be useful when distributed)
    • CPU perf decreased slightly due to the mean number of hashmap accesses doubling
  • Compiling with HiPE gives code that is actually slower here…
  • Activated +o3 optimisation option because why not.
    • Maybe a slight gain here, but most of the heavy lifting is done using outside-module functions, which means inlining and such won't be of much help. If only there was unboxing of the stdlib functions…
  • File reading is performed using prim_file's read_file/2 instead of file's, so as not to go through a gen_server.
    • Speedup below an order of magnitude
  • Applying this patch made make bench.search2 slightly slower on average
    • The intent was to build a flat Acc directly: Part++Acc is O(length(Part)) and copies Part
    • This patch's relatively bad perf can be…
      • due to lists:append/1 relying on ++, which is a BIF written in C
      • due to [Part|Acc] being quicker than Part++Acc
        • thus taking less time for gather_results to process its message queue
          • thus the message queue grows more slowly, which is good.
diff --git a/src/tmln.erl b/src/tmln.erl
index c7a9c08..630cb0f 100644
--- a/src/tmln.erl
+++ b/src/tmln.erl
@@ -287,12 +287,11 @@ search (Pids, Query) ->
                   end, Pids),
     gather_results(length(Pids), WorkUnit, []).

-gather_results (0, _, Acc) ->
-    lists:append(Acc);
+gather_results (0,   _, Acc) -> Acc;
 gather_results (N, Ref, Acc) ->
     receive
         {Ref, results, Part} ->
-            gather_results(N -1, Ref, [Part|Acc]);
+            gather_results(N -1, Ref, Part++Acc);
         ImpromptuMessage     ->
             io:format("Yo! Got this weird msg: ~p\n", [ImpromptuMessage]),
             gather_results(N, Ref, Acc)
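
The two accumulation strategies in the diff above can be compared outside of any message passing. The sketch below is hypothetical (module mini_gather, arbitrary input sizes) and only checks that both strategies yield the same list while timing them with timer:tc/1. It does not reproduce the message-queue effect discussed above, so its timings may not match make bench.search2.

```erlang
-module(mini_gather).
-export([nested/1, flat/1, demo/0]).

%% Before the patch: cons each part onto Acc, flatten once at the end.
nested(Parts) ->
    lists:append(lists:foldl(fun(Part, Acc) -> [Part | Acc] end, [], Parts)).

%% After the patch: keep Acc flat by prepending each part with ++.
flat(Parts) ->
    lists:foldl(fun(Part, Acc) -> Part ++ Acc end, [], Parts).

demo() ->
    %% 1000 parts of 100 integers each; sizes chosen arbitrarily.
    Parts = [lists:seq(I * 100, I * 100 + 99) || I <- lists:seq(1, 1000)],
    {T1, R1} = timer:tc(fun() -> nested(Parts) end),
    {T2, R2} = timer:tc(fun() -> flat(Parts) end),
    true = (R1 =:= R2),                  %% both strategies agree
    io:format("nested: ~p us, flat: ~p us~n", [T1, T2]),
    ok.
```

Both versions copy each part about once overall. The difference is when the copying happens: flat/1 pays for it on every step (inside the receive loop in the real code), while nested/1 defers it to a single pass at the end.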

More

  • Compile & debug in the Erlang shell with make debug
  • End | Close shell with ^G q or ^C^C^C