# MT File Utils

## Overview

MT File Utils provides a simple API for common I/O tasks on text files.

• Tail: Behaves like the Unix 'tail' command.
• Reverse Seek: Starting at the end of a file, returns every line back to the line where a specific target is found.
• Reader: High-performance API for reading lines from a text file. Most useful when you need fast concurrent access to very large data sets that don't fit in memory. Supports multithreaded access and both random and sequential reads.
• Thin: Makes a smaller version of a large file by sampling every nth line.

## Compatibility

MT File Utils has been tested on CPython (2.7) and Jython (2.5).

## Usage

For each example below, assume a file named 'data.txt' with one number per line:

1
2
3
...
9999
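
To follow along, a file like this could be generated with a few lines of Python. This is only a convenience sketch: the helper name `write_number_file` is hypothetical and is not part of MT File Utils.

```python
def write_number_file(path, count=9999):
    """Write the integers 1..count to path, one per line."""
    with open(path, 'w') as f:
        for n in range(1, count + 1):
            f.write('%d\n' % n)
```

Calling `write_number_file('data.txt')` would produce the file assumed by the examples below.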


### tail

Functions like the standard Unix tail command (the "follow" or "-f" option is not supported). Pass in the file name and the maximum number of lines you want returned:

>>> from mtfileutil.reverse import tail
>>> tail('data.txt', 5)
['9995', '9996', '9997', '9998', '9999']
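
For readers curious how a tail-style read can avoid scanning the whole file, here is a minimal sketch that reads fixed-size blocks backward from the end. It is not the library's implementation; the name `tail_sketch`, the `block_size` parameter, and the UTF-8 decoding are assumptions made for illustration, and it is written for Python 3.

```python
import os


def tail_sketch(path, max_lines, block_size=4096):
    """Return up to max_lines lines from the end of path, in file order."""
    if max_lines <= 0:
        return []
    with open(path, 'rb') as f:
        f.seek(0, os.SEEK_END)
        position = f.tell()
        data = b''
        # Read fixed-size blocks backward until the buffer holds more
        # newlines than requested lines, or the start of the file is reached.
        while position > 0 and data.count(b'\n') <= max_lines:
            step = min(block_size, position)
            position -= step
            f.seek(position)
            data = f.read(step) + data
    # Any partial first line in the buffer falls outside the final slice.
    return [line.decode('utf-8') for line in data.splitlines()[-max_lines:]]
```

Because only the last few blocks are read, the cost depends on the number of lines requested rather than on the size of the file.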


### reverse_seek

Searches backward from the end of a file for target text. Returns a list of the lines from the target through the end of the file, in file order:

>>> from mtfileutil.reverse import reverseSeek
>>> reverseSeek('data.txt', '994')
'994' found after searching back 6 lines.
['9994', '9995', '9996', '9997', '9998', '9999']


Use the max argument to limit how far back to seek. The default limit is 3000 lines. If the target is not found within the limit, an empty list is returned:

>>> from mtfileutil.reverse import reverseSeek
>>> reverseSeek('data.txt', 'loot')
>>> reverseSeek('data.txt', '9993', max=5)
[]
>>> reverseSeek('data.txt', '9993', max=10)
'9993' found after searching back 7 lines.
['9993', '9994', '9995', '9996', '9997', '9998', '9999']
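
The backward search can be sketched as a generator that yields lines from last to first, plus a loop that stops at the first match or at the line limit. This is an illustration under stated assumptions, not the library's code: the names `lines_reversed` and `reverse_seek_sketch` are hypothetical, matching is assumed to be a substring test (as the '994' example suggests), lines are assumed to end in '\n', and the sketch is written for Python 3.

```python
import os


def lines_reversed(path, block_size=4096):
    """Yield the lines of path from last to first, without newlines."""
    with open(path, 'rb') as f:
        f.seek(0, os.SEEK_END)
        position = f.tell()
        if position == 0:
            return                      # empty file: nothing to yield
        carry = b''
        at_end = True
        while position > 0:
            step = min(block_size, position)
            position -= step
            f.seek(position)
            pieces = (f.read(step) + carry).split(b'\n')
            if at_end and pieces[-1] == b'':
                pieces.pop()            # ignore the empty piece after a final newline
            at_end = False
            # The first piece may be an incomplete line; carry it into the next block.
            carry = pieces[0]
            for piece in reversed(pieces[1:]):
                yield piece.decode('utf-8')
        yield carry.decode('utf-8')


def reverse_seek_sketch(path, target, max_lines=3000):
    """Return the lines from the first match (searching backward) to the end."""
    found = []
    for count, line in enumerate(lines_reversed(path), start=1):
        if count > max_lines:
            break                       # limit reached without a match
        found.append(line)
        if target in line:
            found.reverse()             # restore file order
            return found
    return []
```

Returning an empty list when the limit is reached mirrors the behaviour shown in the example above.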


### reader: random reads

The random reader selects a random line from a given text file every time it is invoked.

CAUTION: The random nature of these reads typically defeats the page caching strategies used by your OS, so for large text files that don't fit entirely in memory, it's easy to saturate your disk I/O capacity with this reader.

The typical pattern is:

• Start a reader by assigning it a text file to read and a queue name to use
• Use the reader as desired
• Stop the reader when you are done

Example:

>>> from mtfileutil import reader
Initializing random line queues[rand_queue](100)
Populating random queue
'7276'
'8452'
'640'
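
To illustrate why random reads are hard on the disk, here is a rough sketch of one way a random line queue could work: a background thread seeks to random byte offsets and pushes the line that follows into a bounded queue. All names here (`random_line`, `fill_random_queue`, `start_random_reader`) are hypothetical and do not reflect the library's API. The sketch assumes a non-empty file, is written for Python 3, and slightly favours lines that follow long lines.

```python
import os
import queue
import random
import threading


def random_line(f, size):
    """Return one line from binary file object f, chosen via a random byte offset."""
    f.seek(random.randrange(size))
    f.readline()                 # discard the (probably partial) line we landed in
    line = f.readline()
    if not line:                 # landed in the last line: wrap around to the first
        f.seek(0)
        line = f.readline()
    return line.decode('utf-8').rstrip('\n')


def fill_random_queue(path, lines, stop_event):
    """Keep the bounded queue topped up with random lines until stop_event is set."""
    size = os.path.getsize(path)
    with open(path, 'rb') as f:
        while not stop_event.is_set():
            line = random_line(f, size)
            while not stop_event.is_set():
                try:
                    lines.put(line, timeout=0.1)
                    break
                except queue.Full:
                    continue     # queue is full; re-check the stop flag and retry


def start_random_reader(path, maxsize=100):
    """Start the background filler; return (queue, stop_event, thread)."""
    lines = queue.Queue(maxsize=maxsize)
    stop_event = threading.Event()
    worker = threading.Thread(target=fill_random_queue,
                              args=(path, lines, stop_event))
    worker.daemon = True
    worker.start()
    return lines, stop_event, worker
```

A caller would take lines with `lines.get()` and finish with `stop_event.set()` followed by `worker.join()`. Every queued line costs one seek to an unrelated part of the file, which is the access pattern the caution above warns about.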


### reader: sequential reads

The sequential reader returns lines in file order. For large files that don't fit in memory, it is much friendlier on your disk I/O than the random reader because the data can be read and used in entire blocks. It makes a "best effort" to remember (bookmark) your location between reads; the bookmark will typically be off by up to the maximum queue size. For text files with many millions of lines this is probably not a big deal, but it may be a consideration when using the reader repeatedly with smaller files.

As with the random reader, the typical pattern is to start a reader, use the reader, then stop the reader. The bookmark is not saved until you stop the reader, so this step is required if you want the reader to bookmark its location between runs:

>>> from mtfileutil import reader
Initializing sequential line queue[seq_queue](100)
Reading data file './data.txt' from 0
'1'
'2'
'3'
Writing seek position 312 to temp file ./data.txt.seek


Subsequent invocations will remember the approximate location of the last line you read. A subsequent Python shell session might look like this:

>>> from mtfileutil import reader
Initializing sequential line queue[seq_queue](100)
>>> Reading data file './data.txt' from 312
'106'
'107'
'108'
Waiting for sequential line reader 'seq_queue' to end
Writing seek position 732 to temp file ./data.txt.seek


As you can see, the bookmarked location is about 100 lines ahead of the last line read. The amount of variance depends on the size of the read-ahead queue, which is sized at 100 by default.
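
The gap between the bookmark and the last line read follows from read-ahead: the saved offset marks where buffering stopped, not where the caller stopped consuming. The single-threaded sketch below shows the effect. It is not the library's implementation: the class name `SequentialReaderSketch` and the parameters `queue_size` and `use_bookmark` are hypothetical, and only the '.seek' suffix for the bookmark file is taken from the log output above.

```python
import os
from collections import deque


class SequentialReaderSketch(object):
    """Read lines in order, buffering ahead and bookmarking the file offset."""

    def __init__(self, path, queue_size=100, use_bookmark=True):
        self.bookmark_path = path + '.seek'
        self.queue_size = queue_size
        self.buffer = deque()
        self.f = open(path, 'rb')
        if use_bookmark and os.path.exists(self.bookmark_path):
            with open(self.bookmark_path) as b:
                self.f.seek(int(b.read().strip() or 0))

    def _fill(self):
        # Read ahead until the buffer holds queue_size lines or the file ends.
        while len(self.buffer) < self.queue_size:
            line = self.f.readline()
            if not line:
                break
            self.buffer.append(line.decode('utf-8').rstrip('\n'))

    def next_line(self):
        """Return the next line, or None at end of file."""
        self._fill()
        return self.buffer.popleft() if self.buffer else None

    def stop(self):
        # The saved offset is where read-ahead stopped, not where the caller
        # stopped consuming, so buffered lines are skipped on the next run.
        with open(self.bookmark_path, 'w') as b:
            b.write(str(self.f.tell()))
        self.f.close()
```

In this sketch a smaller `queue_size` narrows the gap at the cost of less read-ahead, and passing `use_bookmark=False` always starts from the beginning of the file.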

If you want to start reading the file from the beginning each time, the bookmark feature can be disabled:

>>> from mtfileutil import reader
Initializing sequential line queue[seq_queue](100)
Reading data file './data.txt' from 0
'1'


### thin

Takes a text file and writes every nth line to a new file. The default interval is 25 lines:

>>> from mtfileutil.reduce import thin
>>> thin('data.txt', 'data.reduced.txt')
Writing every 25 lines of data.txt to data.reduced.txt
>>> thin('data.txt', 'data.reduced.txt', interval=100)
Writing every 100 lines of data.txt to data.reduced.txt
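
The sampling itself can be sketched in a few lines. This is an illustration, not the library's code: the name `thin_sketch` is hypothetical, and the choice to keep the first line and then every interval-th line after it is an assumption, since the README does not say which lines are kept.

```python
def thin_sketch(source_path, target_path, interval=25):
    """Copy every interval-th line of source_path to target_path.

    Returns the number of lines written.
    """
    if interval < 1:
        raise ValueError('interval must be at least 1')
    written = 0
    with open(source_path) as source, open(target_path, 'w') as target:
        for index, line in enumerate(source):
            if index % interval == 0:   # keeps lines 1, 1 + interval, ...
                target.write(line)
                written += 1
    return written
```

The file is streamed line by line, so memory use stays small regardless of the size of the input.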


This utility does not handle the various I/O exceptions that may be caused by nonexistent files, insufficient permissions, etc.
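
Since I/O errors propagate to the caller, one option is to wrap calls in a small guard. The helper below is a hypothetical sketch, not part of MT File Utils. It assumes the underlying failure surfaces as IOError or OSError, which is what Python's built-in file operations raise.

```python
def call_with_io_guard(func, *args, **kwargs):
    """Call func and return (result, None), or (None, error) on an I/O failure."""
    try:
        return func(*args, **kwargs), None
    except (IOError, OSError) as error:
        return None, error
```

For example, `lines, error = call_with_io_guard(tail, 'data.txt', 5)` would leave `error` set instead of raising if the file could not be opened.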

## Examples

Example scripts are included in the package: