README

Scripts to extract text or metadata for a graphical 'diff'.
The scripts are based on Apache Tika.

These scripts were written to assist my wife in managing document changes
using Mercurial, where the documents are generally Word docx files. They were
inspired by the oodiff and odt2txt programs.

Warning

These scripts have been tested on OS X 10.10 (Yosemite), but have had only limited
testing on Linux.

Scripts in the repository

  • tikadiff: script to extract text or metadata from files and run graphical diff on results.
  • tika: convenience script to run Tika in various configurations.
  • tikserve: convenience script to start Tika server in various configurations.
  • Support scripts
    • localhostports
      Uses netstat to determine currently used ports on localhost.
    • freeport
      In association with localhostports, finds the next free localhost port.
  • Version: 1.0

How do I get set up?

  • Download the repository contents to a convenient location and unzip the file.
  • From the extracted directory, copy the scripts tika, tikserve, tikadiff,
    localhostports and freeport, and the text files meta-formats and text-formats,
    into a directory on your PATH.
  • cd to the directory containing the scripts.
    • Run sh tika --links.
      • This creates the links, and sets tika and other necessary scripts to
        be executable.
        You must have write permission in the directory, or use sudo.
    • Run sh tikserve --links.
      • This creates the links, and sets tikserve and other necessary scripts to
        be executable.
  • You may delete the download zip and extracted directory.
  • Dependencies:
    • Apache Tika tika-app<-n.m>.jar
      • Download the current version of Tika
      • Either:
        • In .profile, define TIKA_APP to be the path to the jar file
      • Or:
        • In .profile, define JARS to be the directory containing the current tika-app jar.
          Using JARS, the scripts will find the latest version of tika-app-n.m.jar, or
          tika-app.jar, in that directory.
    • Kdiff3
      • Download and install the compiled application in the usual way for your
        operating system.
    • An alternative to kdiff3 is p4merge,
      although I do not recommend this on OS X,
      because p4merge, like the native FileMerge (opendiff), does not terminate when
      the diff window is closed by means of the Close button.
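
As an illustration of the two configuration options above, the lookup might be
sketched as follows. This is not the code used by the scripts: the function
name find_tika_jar is hypothetical, and the rule that a versioned jar is
preferred over a plain tika-app.jar is an assumption.

```shell
# Hypothetical helper (not taken from the repository): print the path of the
# Tika jar, using TIKA_APP if it is set, otherwise searching the JARS directory.
find_tika_jar() {
    # An explicit TIKA_APP setting wins.
    if [ -n "${TIKA_APP:-}" ]; then
        printf '%s\n' "$TIKA_APP"
        return 0
    fi
    [ -n "${JARS:-}" ] || return 1
    # Emit "major minor patch path" for each versioned jar, sort numerically
    # on the version fields, and keep the path from the last (highest) line.
    latest=$(
        for jar in "$JARS"/tika-app-*.jar; do
            [ -f "$jar" ] || continue
            version=${jar##*/tika-app-}
            version=${version%.jar}
            old_ifs=$IFS; IFS=.
            set -- $version          # split the version on dots
            IFS=$old_ifs
            printf '%s %s %s %s\n' "${1:-0}" "${2:-0}" "${3:-0}" "$jar"
        done | sort -k1,1n -k2,2n -k3,3n | tail -n 1 | cut -d' ' -f4-
    )
    if [ -n "$latest" ]; then
        printf '%s\n' "$latest"
    elif [ -f "$JARS/tika-app.jar" ]; then
        # Assumed fallback: an unversioned jar in the same directory.
        printf '%s\n' "$JARS/tika-app.jar"
    else
        return 1
    fi
}
```

With JARS pointing at a directory that holds tika-app-1.9.jar and
tika-app-1.10.jar, this sketch prints the path of the 1.10 jar, because the
version fields are compared numerically and not as text.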

Executable scripts

  • tikadiff file1 file2
    • Graphically show differences between file1 and file2.
      Text or metadata will be compared depending on the mimetype of the file.
  • tikadiff dir1 dir2
    • For every file in dir1, attempt to graphically compare it with a file of the same
      name in dir2.
  • tika --links
    • set up links to tika
  • tika [args...]
    • pass all arguments to Apache Tika
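
The directory form of tikadiff pairs files by name. A minimal sketch of that
pairing step is shown here. The function name pair_files and its output
format are hypothetical. It only reports which files would be compared. The
actual script would go on to extract text or metadata from each pair and
open the graphical diff.

```shell
# Hypothetical helper (not taken from the repository): list the file pairs a
# directory comparison would visit, matching files in dir2 to dir1 by name.
pair_files() {
    dir1=$1
    dir2=$2
    for left in "$dir1"/*; do
        [ -f "$left" ] || continue      # skip subdirectories and empty globs
        name=${left##*/}                # file name without the directory part
        right=$dir2/$name
        if [ -f "$right" ]; then
            printf 'compare %s %s\n' "$left" "$right"
        else
            printf 'skip %s (no counterpart in %s)\n' "$name" "$dir2"
        fi
    done
}
```

Files that exist only in dir2 are never visited, which matches the
description above: the loop is driven by the contents of dir1.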

Links to tika

  • tikatype filename
    • print the mimetype of the file named in the argument
  • tikatext filename
    • extract and print the text of the named file
  • tikameta filename
    • extract and print the metadata of the named file
  • tikaxml filename
    • extract and print xml text from the named file
  • tikahtml filename
    • extract and print html text from the named file
  • tikserve --links
    • set up links to tikserve

Links to tikserve

  • tikstype [port]
    • set up server to return mimetype of file passed to it
  • tikstext [port]
    • set up server to extract plain text from file passed to it
  • tiksmeta [port]
    • set up server to extract metadata from file passed to it
  • tiksxml [port]
    • set up server to extract xml text from file passed to it
  • tikshtml [port]
    • set up server to extract html text from file passed to it

NOTE: While tika, which is the target for a number of links, can be usefully executed in its own right, tikserve can only be used directly with the --links argument.
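
One common way to make a single script behave differently under several link
names is to inspect the name it was invoked by. The sketch below assumes that
approach; it is not taken from the tika script itself. The function name
tika_mode_for is hypothetical, while the options it prints (--detect, --text,
--metadata, --xml, --html) are standard Tika command-line options.

```shell
# Hypothetical helper (not taken from the repository): map the name a script
# was invoked under to the Tika command-line option for that output type.
tika_mode_for() {
    case "${1##*/}" in                  # strip any leading directory
        tikatype) printf '%s\n' '--detect' ;;
        tikatext) printf '%s\n' '--text' ;;
        tikameta) printf '%s\n' '--metadata' ;;
        tikaxml)  printf '%s\n' '--xml' ;;
        tikahtml) printf '%s\n' '--html' ;;
        *)        return 1 ;;           # not one of the link names
    esac
}

# Inside a script run through a link, "$0" holds the link's path, so a
# dispatcher could look like this (TIKA_APP as defined in .profile):
#   mode=$(tika_mode_for "$0") && exec java -jar "$TIKA_APP" "$mode" "$@"
```

A name that is not one of the links, such as tika itself, makes the function
return a non-zero status, leaving the caller free to pass its arguments
through unchanged.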

Supporting scripts

  • localhostports
    • prints the ports netstat associates with localhost.
  • freeport [first-port-number [last-port-number]]
    • prints the next available port on localhost
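
The relationship between the two supporting scripts can be sketched as a
filter: one command lists the ports in use, the other scans a range for the
first port not in that list. The function below is an illustration only. The
name next_free_port, the default range, and the one-port-per-line input
format are all assumptions, not details of freeport or localhostports.

```shell
# Hypothetical helper (not taken from the repository): read used port numbers,
# one per line, on standard input and print the first port in the range
# first..last that is not among them.
next_free_port() {
    first=${1:-9998}                    # arbitrary default for illustration
    last=${2:-65535}
    used=$(cat)                         # slurp the list of used ports
    port=$first
    while [ "$port" -le "$last" ]; do
        # grep -x matches whole lines, so 999 does not match 9998.
        if ! printf '%s\n' "$used" | grep -qx "$port"; then
            printf '%s\n' "$port"
            return 0
        fi
        port=$((port + 1))
    done
    return 1                            # every port in the range is in use
}
```

If localhostports prints one port number per line, which is an assumption
here, the two could be combined as: localhostports | next_free_port 9000 9100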

Who do I talk to?

pbw -> pbw id au