r-text-tools / Home

Welcome to Carl Stahmer's r-text-tools repository. This wiki documents the tools included in the repository; the documentation for each tool can be found using the links below. The tools are deliberately modular: each performs one small, limited task, and complex operations are accomplished by chaining tools together.

Individual Tool Documentation

  • getHtmlLinks - returns all the URLs that appear on a supplied webpage.
  • getHtmlText - returns the text, stripped of header, scripts, etc., that appears on a webpage.
  • getEccoTcpFromCSVList - extracts TCP IDs from a designated column of a spreadsheet.
  • getTcpText - takes a TCP ID as input, gets the TEI from TCP, and strips it down to text.
  • getWordFrequency - takes text input and calculates word frequency.
  • plotWordFrequency - takes text input and visualizes word frequency on a bar chart.
  • getWordAssoc - takes text and a "seed" and finds all words that appear next to the seed in the text.
  • drawFrequencyWordcloud - takes text input, calculates word frequency, and creates a wordcloud.
  • getAllEccoTcpTexts - chains other tools together to extract all TCP IDs from a spreadsheet, get the full text of each, and return it as a single running text blob.
  • plotEccoTcpTexts - chains other tools together to extract all TCP IDs from a spreadsheet, get the full text of each, calculate word frequency for the entire corpus, and visualize it on a bar chart.
  • makeEccoTcpWordle - chains other tools together to extract all TCP IDs from a spreadsheet, get the full text of each, calculate word frequency for the entire corpus, and visualize it in a wordcloud.
  • getEccoTcpWordFrequency - chains other tools together to extract all TCP IDs from a spreadsheet, get the full text of each, and calculate word frequency for the entire corpus.
  • getEccoTcpWordAssociation - chains other tools together to extract all TCP IDs from a spreadsheet, get the full text of each, and find all words that appear next to a "seed" in the entire corpus.
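
To give a feel for the "small tools chained together" approach, here is a minimal sketch in base R of two of the ideas described above: counting word frequency and finding the words that sit next to a "seed". The function names (tokenize_words, word_frequency, word_neighbors) and the tokenizing rule are assumptions made up for this illustration; they are not the repository's functions and may well differ from how the actual tools work.

```r
# Hypothetical stand-ins written in base R. These are NOT the repository's
# functions; names, signatures, and tokenizing rules are assumptions.

# Split text into lowercase word tokens, treating anything other than
# a-z and the apostrophe as a separator.
tokenize_words <- function(text) {
  tokens <- unlist(strsplit(tolower(text), "[^a-z']+"))
  tokens[nzchar(tokens)]  # drop empty strings left by leading separators
}

# Count how often each word occurs, most frequent first.
word_frequency <- function(text) {
  sort(table(tokenize_words(text)), decreasing = TRUE)
}

# Collect the distinct words that appear immediately before or after the seed.
word_neighbors <- function(text, seed) {
  tokens <- tokenize_words(text)
  hits <- which(tokens == tolower(seed))
  positions <- c(hits - 1, hits + 1)
  # Discard positions that fall off either end of the token vector.
  positions <- positions[positions >= 1 & positions <= length(tokens)]
  sort(unique(tokens[positions]))
}

# Both functions reuse the same tokenizer, so each stays small and they
# can be applied to the same text one after the other.
sample_text <- "The cat sat on the mat. The dog sat on the cat."
freq <- word_frequency(sample_text)
neighbors <- word_neighbors(sample_text, "sat")

print(freq)
print(neighbors)
```

Because each function does one thing, the same text blob can be passed to either of them, which is the same pattern the chained tools in the list follow on a whole corpus.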

Setting Up Environment

These tools require a variety of libraries to function properly. These include:

  • library(RCurl)
  • library(XML)
  • library(tm)

(all are available through CRAN)
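
One way to get these packages installed and loaded is sketched below. The helper names (missing_packages, install_and_load) are hypothetical and exist only for this example; install.packages(), requireNamespace(), and library() are standard R functions. The sketch assumes a CRAN mirror is already configured for the session.

```r
# The three packages named in the list above.
required <- c("RCurl", "XML", "tm")

# Return the subset of pkgs that cannot be found in the current library paths.
# (Hypothetical helper for this example.)
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Install whatever is missing from CRAN, then attach every package.
# Assumes a CRAN mirror is set (for example via options(repos = ...)).
install_and_load <- function(pkgs = required) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  for (pkg in pkgs) {
    library(pkg, character.only = TRUE)
  }
  invisible(pkgs)
}
```

Calling install_and_load() once at the start of a session should leave all three libraries attached; the block above only defines the helpers and installs nothing until that call is made.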

Have fun!
