Natural language text analysis
This small project implements some basic natural language analysis tools on texts retrieved from the Gutenberg project collections.
The main analysis tool is implemented in the Gutenberg.py script. The script accepts three parameters (name, surname, language), searches the Gutenberg collection and downloads all ebooks by the given author that are written in the given language.
Once the texts are downloaded locally, the following analyses are applied to each text:
- Number of tokens
- Number of distinct tokens
- Lexical richness as a percentage (the ratio of distinct tokens to the total number of tokens)
- List of the most common words that are longer than 7 characters and occur more than 10 times
- Token frequencies
For each ebook found, the program returns, in a human-readable format, the full text as a .txt file, the text statistics (the first three items above) and the token frequencies.
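The statistics listed above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the code of Gutenberg.py: the function name `analyse_text`, its parameters and the regular-expression tokenisation are assumptions made for the example, and the real script may tokenise differently.

```python
import re
from collections import Counter


def analyse_text(text, longer_than=7, more_than=10):
    """Compute the listed statistics for one text.

    Tokenisation is a plain regular expression (an assumption made for
    this sketch); the real script may split the text differently.
    """
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    frequencies = Counter(tokens)
    total = len(tokens)
    distinct = len(frequencies)
    # Lexical richness: distinct tokens as a percentage of all tokens.
    richness = 100.0 * distinct / total if total else 0.0
    # Words longer than 7 characters that occur more than 10 times.
    common_long = sorted(
        word for word, count in frequencies.items()
        if len(word) > longer_than and count > more_than
    )
    return {
        "tokens": total,
        "distinct_tokens": distinct,
        "lexical_richness": richness,
        "common_long_words": common_long,
        "frequencies": frequencies,
    }
```

Calling `analyse_text(open("some_book.txt").read())` on a downloaded text (the file name is a placeholder) returns a dictionary with one entry per statistic; `frequencies.most_common(20)` then gives the twenty most frequent tokens.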
The naming convention of the output files is as follows:
<book_title>.txt, <book_title>_stat.txt, <book_title>_freq.txt
For example, for Jane Austen's novel 'Pride and Prejudice' the program produces three human-readable output files: Pride_and_Prejudice.txt, Pride_and_Prejudice_stat.txt and Pride_and_Prejudice_freq.txt.
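As a small illustration of this naming convention, the helper below derives the three file names from a book title. The function name `output_file_names` is hypothetical, and replacing spaces with underscores is an assumption inferred from the 'Pride and Prejudice' example; the script may treat punctuation or other characters differently.

```python
def output_file_names(book_title):
    """Return the three output file names for a book title."""
    # Assumption: spaces become underscores, as in the example file names.
    base = book_title.strip().replace(" ", "_")
    return [base + ".txt", base + "_stat.txt", base + "_freq.txt"]


print(output_file_names("Pride and Prejudice"))
# ['Pride_and_Prejudice.txt', 'Pride_and_Prejudice_stat.txt', 'Pride_and_Prejudice_freq.txt']
```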
When the program is started for the first time, a folder called gutenberg is created. All downloaded ebooks and analysis results are kept in this folder. For each searched author a subfolder is created, with a further subfolder for each language. Each language folder holds the downloaded ebooks (named <book_id>.txt), and each analysis run is stored in its own folder following the naming convention analysis_hhmmss. The structure looks as follows:
.<home folder> (i.e. textAnalysis)
+-- gutenberg
|   +-- Author1
|   |   +-- Language1
|   |   |   +-- book_id1.txt
|   |   |   +-- book_id2.txt
|   |   |   +-- book_idN.txt
|   |   |   +-- analysis1
|   |   |   +-- analysis1
|   |   |   |   +-- book_name.txt
|   |   |   |   +-- book_name_stat.txt
|   |   |   |   +-- book_name_freq.txt
|   |   +-- Language2
|   |   |   +-- book_id4.txt
|   |   |   +-- analysis3
|   +-- Author2
|   |   +-- Language1
|   |   |   +-- book_id5.txt
|   |   |   +-- book_id6.txt
|___________
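The layout can also be expressed as path-building code. The sketch below is only an illustration: the helper names `book_path` and `analysis_dir` are hypothetical, and it assumes that hhmmss in analysis_hhmmss stands for the local time (hours, minutes, seconds) at which the analysis starts.

```python
import os
import time


def book_path(root, author, language, book_id):
    """Path of a downloaded ebook, stored as <book_id>.txt."""
    return os.path.join(root, "gutenberg", author, language,
                        "%s.txt" % book_id)


def analysis_dir(root, author, language, when=None):
    """Path of the folder that holds the results of one analysis run."""
    # Assumption: hhmmss is the local time of the run.
    if when is None:
        when = time.localtime()
    stamp = time.strftime("%H%M%S", when)
    return os.path.join(root, "gutenberg", author, language,
                        "analysis_" + stamp)
```

For instance, `analysis_dir("textAnalysis", "Jane_Austen", "english")` yields a path ending in a folder such as analysis_130509, depending on the time of the call; the author and language folder names used here are placeholders.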
Requirements
- Python 2.7+
- virtualenv (optional)
If you want to run the program in a Python virtual environment, virtualenv is required as well. It can be installed as follows:
pip install virtualenv
Basic usage (local run)
Download the source code and navigate to its folder
Create and activate a new Python virtual environment (optional)
pip install -r requirements.txt
Once we have the requirements installed we can start analysing the Gutenberg collections.
python Gutenberg.py name surname
For example, if we want to analyse all the works of Jane Austen written in English we can do it by running the following:
python Gutenberg.py Jane Austen --language english
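A command line of this shape could be parsed with `argparse`, as in the sketch below. This is not taken from Gutenberg.py: the function name `parse_args` and the help texts are assumptions, and the default of `english` simply mirrors the documented behaviour that English is used when no language is given.

```python
import argparse


def parse_args(argv=None):
    """Parse the author's name, surname and an optional language."""
    parser = argparse.ArgumentParser(
        description="Analyse Project Gutenberg texts by one author.")
    parser.add_argument("name", help="author's first name")
    parser.add_argument("surname", help="author's surname")
    # Assumption: English is the default when --language is omitted.
    parser.add_argument("--language", default="english",
                        help="language of the ebooks (default: english)")
    return parser.parse_args(argv)


args = parse_args(["Jane", "Austen", "--language", "english"])
print(args.name, args.surname, args.language)
```

Passing `argv=None` makes `argparse` read the real command line, so the same function works both in the script and in tests.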
Upon completion, a new folder named Name_Surname is created, containing all the output files. The output files can be opened with a simple text editor (e.g. gedit or vim on Linux, or Notepad on Windows).
Run on EGI FedCloud via DARIAH Science Gateway
To run the text analysis on FedCloud resources via the DARIAH Science Gateway you do not need a FedCloud account; authorization is handled by the DARIAH Science Gateway on your behalf. However, only registered users of the DARIAH Science Gateway can use the Simple Cloud Access service. Don't worry: if your local identity provider is a member of eduGAIN (almost all academic and research organisations in the world are members), the registration process is very simple.
In the top-right corner of the welcome page of the DARIAH Science Gateway, click the blue "Sign in" button.
In the drop-down menu, find your identity provider, click "Select" and provide the username and password as requested by your identity provider.
Now you are signed in and ready to start exploring the Simple Cloud Access service! If you have any problems with the sign-in process, leave feedback.
Before you start running the Gutenberg analysis via Cloud Access, download the gutenberg analysis script from the Bitbucket project repository to your local machine. Once the file is open, right-click it and choose "Save as" to store it on your machine.
On the welcome screen, click Cloud Access in the top menu bar, which will bring you to the Simple Cloud Access service form.
Click Drop files here and browse to the location where you downloaded the gutenberg analysis script, or simply drag and drop the script, then click Next.
Press Next two more times (to pass steps Drop your static inputs and Drop your parametric inputs) until you reach Specify command line arguments.
In the command line argument text box, enter the author's name, surname and language. If the language is omitted, English is used for the analysis. For example, we can analyse the collection of Jane Austen's books published in English. Press Next.
In this step you can define the resource provider (currently two EGI FedCloud resource providers offer cloud resources to users of the DARIAH Science Gateway) and the instance type. You can leave both unchanged. For more complex and compute-intensive jobs, consider choosing the Medium or Large instance type. Press Next.
In the last step you define the outputs of the analysis that will be returned to you. The gutenberg analysis script packs all the outputs into the gutenberg.tar file (a tar archive of the /gutenberg folder with all its contents). Press Add new and enter gutenberg.tar. Press Start execution to submit your analysis to the EGI FedCloud.
You should now see the progress of your analysis on the right side of the screen.
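Once the execution has finished and gutenberg.tar has been returned to you, its contents can be inspected with Python's standard `tarfile` module. The sketch below is an illustration only: the function name `list_statistics_files` is hypothetical, and it assumes the archive keeps the *_stat.txt naming described for the output files.

```python
import tarfile


def list_statistics_files(archive_path):
    """Return the names of the *_stat.txt files stored in the archive."""
    with tarfile.open(archive_path) as archive:
        return sorted(
            member.name for member in archive.getmembers()
            if member.isfile() and member.name.endswith("_stat.txt")
        )
```

Calling `list_statistics_files("gutenberg.tar")` in the folder where the archive was saved lists one statistics file per analysed book, without unpacking anything to disk.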