1. Spandana Gella
  2. location_recognizer

Overview

HTTPS SSH
AUTHOR:
------

Name  :: Spandana Gella
email :: spandanagella@gmail.com

EXECUTION:
---------

python src/locationRecognizer.py -i <input_file_path> -n <ngram_character_length> -l <location_file_path> -o <output_file_path>

or (to apply default parameters)

python src/locationRecognizer.py 

Default parameters:
	for all of the required parameters values are set by default
	-n option to set the length of the ngram character string (See Location Pre-Processing section for details), this value is set 3 by default
	-i option to set the input file path (default file: data/Sample_Input.txt)
	-o option to set the output file path (default location: data/output.txt)
	-l option to set the locations file (default location: data/Locations.txt)
	-k location search count (default set to : 50, Irrespectieve of having large N as locations only top K probable locations are searched for alignments)



Locations Pre-Processing:
------------------------

Ngrams for each location are generated and locations are stored based on ngram patterns 
( O(N) operation where N is number of locations )

Example: Assume we are looking at trigrams (by default trigrams are considered based on observations of: pur, bad, war, gar)

'pur': [jaipur, vijayapura, bijapur, etc]
'oor': ['adoor', 'aroor', 'chittoor', 'mavoor', 'mussoorie', etc]


ALGORITHM: (Explained with an example input:: <Royal Challengers Bangalore IPL match in chenmai>)
----------

1) Top 5000 frequently observed words in English are removed from each input line 
	- Input is transformed to: <challengers bangalore ipl chenmai> 

2) Ngrams are generated for the transformed input line

3) All the locations that share same ngrams are filtered

4) Top K locations which has the input line Ngrams are considered (By default K is set to 50) 
	- This step reduces the computation process. That is instead of verifying all the given locations 
	  we reduce the search space to constant K. 
	- if no ngrams are found all N locations are set for looking alignments (which is very less probable)

5) Smith Waterman alignment algorithm is applied to check the parts of input string which could be a location
	- For the input string it predicts (bangalore, bangalore) and (chinnai, chennai)

6) All the observed alignments are scored based on Levenshtein edit distance and min weighted edit 
   distance (considering the probability of c being relaced with 'v' is higher than 'b' as they 
   are beside each other on key board)
 
7) For the considered input top location suggestions sorted on above scores are : (By default top 20 are considered)
    Items in the list are in the format of (predicted_correct_location, observed_possible_location)

   - [(bangalore, bangalore), (chennai_chenmai, champhai_chenmai), (champa, chenmai), (jalore, bangalore), 
      (bengal, bangalore), (nangal, bangalore), (bangla, bangalore), (chamba, chenmai), (bhagalpur, bangalore), 
      (bhinmal, chenmai), (nellore, bangalore), (balod, bangalore), (mangaldoi, bangalore) ... ]

8) Preposition check:
	- All the locations that occur after preposition are noted (In case of multiple locations in input strings, locations after prep are given higher priority)
	- In the input string only chenmai passes this test where as bangalore fails

9) If preposition check satisfied : go to step 10, else step 12

10) If only one location is found in preposition check it is replaced with the top most prediction in the list.
	- In the input chenmai is replaed with chennai -- DONE (EXIT)
    If multiple locations are found in preposition check, the top most occurrence in the prediction list is replaced
	- Assume in input string both chennai and bangalore qualify as bangalore is on the top prediction list it is tagged as location

11) If only one observed_possible_location is found it is replaced right away with the predicted_correct_location which is not the case here
    -- DONE
	
12) Check for the most frequent observed possible location out of top 20
	- For the above example bangalore is more frequent than chenmai, its top suggestion is tagged as location
	-- DONE


PARAMETER SUGGESTIONS:
---------------------

1) Default parameters are set based on observed patterns
2) bi-grams are suggested if there are more number of locations with length <=4
3) Parameters should be set accordingly, unfortunately they act as trade-off between computation time and efficiency

IMPROVEMENTS:
------------

1) This method could be well improved if some patterns in the input lines are observed
	- As per my observation no standard patterns are observed except locations followed by prepositions

NOTE:
----

This program is implemented entirely based on string matching, spelling correction and few patterns.
Applicable for on the fly prediction of locations on non-grammatical sentences as well and could be easily extended to other languages.


levenshtein.py and smithmn.py which are respective implementations of Levenshtein and Smith Waterman 
algorithms are taken from web and modified as required for the program.

Implemented the weighted edit distance part in levenshtein