This is my Master's thesis project, submitted in partial fulfillment of the requirements for the degree of Master of Science in Communication and Information Sciences, Master Track Human Aspects of Information Technology, at the Faculty of Humanities of Tilburg University.
It is a source code component retrieval application: given an English query, it retrieves Java method signatures from the Java Standard Library.
It works in a fairly unorthodox way, retrieving methods through bag-of-words translation. The translation model is a Ridge Regression model trained on the term-document matrices of two parallel document collections: Java method signatures and their English descriptions.
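The idea of bag-of-words translation can be sketched in a few lines of scikit-learn. This is only an illustration under assumptions, not the code in learn.py: the four description/signature pairs are made-up toy data, the function name `search` and the `alpha` value are hypothetical, and ranking by cosine similarity is one plausible way to match the predicted vector against the signature collection.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.metrics.pairwise import cosine_similarity

# Toy parallel corpus (hypothetical examples, not the thesis data).
descriptions = [
    "returns the length of this string",
    "appends the element to the end of the list",
    "returns the absolute value of a number",
    "removes all elements from the list",
]
signatures = [
    "int length String",
    "boolean add List Object",
    "double abs Math double",
    "void clear List",
]

# One tf*idf space per collection: English on the input side,
# signature tokens on the output side.
desc_vectorizer = TfidfVectorizer()
sig_vectorizer = TfidfVectorizer(lowercase=False)
X = desc_vectorizer.fit_transform(descriptions).toarray()
Y = sig_vectorizer.fit_transform(signatures).toarray()

# Multi-output ridge regression maps description vectors to signature vectors.
# alpha=1.0 is an arbitrary choice for this sketch.
model = Ridge(alpha=1.0)
model.fit(X, Y)

def search(query, top_k=2):
    """Translate an English query into signature space and rank signatures."""
    query_vec = desc_vectorizer.transform([query]).toarray()
    predicted = model.predict(query_vec)
    scores = cosine_similarity(predicted, Y)[0]
    ranked = np.argsort(scores)[::-1][:top_k]
    return [(signatures[i], float(scores[i])) for i in ranked]

for signature, score in search("length of a string"):
    print(signature, round(score, 3))
```

The query is never matched against the descriptions directly: it is first "translated" into a vector over signature terms, and retrieval then happens entirely in signature space.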
- Required packages: Gensim, scikit-learn, argparse and climate
- First run learn.py to create a model
- Then start the search engine by running GUI.py
- vectorspace.py can be used to create tf*idf vectors from texts
- search.py contains methods to use Gensim's search interface
- learn.py trains the regression model
- The thesis is based on the work of Huijing Deng and Grzegorz Chrupała. 2014. Semantic approaches to software components retrieval using English queries. LREC 2014.
- The data is available in the official repository of the publication: https://bitbucket.org/gchrupala/codeine
- The pre-processed version of the data is also available in this repo, in the "sets" folder
- For some of the experiments I used a neural network from: https://bitbucket.org/gchrupala/neuralnet/
For more information, please contact: email@example.com