ARFF

This library allows you to read and write WEKA ARFF (Attribute-Relation File Format) files. The implementation is based on the developer version of the ARFF specification. However, that specification isn't very detailed (it seems to be reverse engineered from the Java source), so this Python implementation might diverge from WEKA's Java implementation. I expect this to happen only in corner cases. If you find a problem, please let me know. Also, at least some versions of the Java implementation themselves seem to violate the specification in some places (e.g. non-unique attribute names, or attribute names starting with non-alphabetic characters), for which a quirks mode was implemented in the Python library.

Because the Java ARFF implementation uses java.text.SimpleDateFormat for date format strings, the ARFF specification just refers to the Java documentation at this point. Because I was bored (even though I had enough other work to do) I started a pure Python implementation of this class and called it DateTimeFormat (see below), because, well, it isn't that simple and it formats not just dates but also times. You do not need DateTimeFormat to use the ARFF library, but without it you won't be able to use date attributes with format strings. Date attributes without format strings are specified to use the ISO 8601 date time format. For these I use the iso8601 Python module. If that module is not available, my DateTimeFormat is used instead. However, java.text.SimpleDateFormat has no way to specify optional parts, and so neither does DateTimeFormat. Since the ISO 8601 date time format does allow optional parts, only fully specified ISO 8601 date time strings are supported without the iso8601 module.

If neither the iso8601 module nor the DateTimeFormat module is installed, date attributes are not supported (but the rest still works).
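
The fallback order described above can be sketched roughly as follows. This is not the library's actual code: the names parse_iso8601, parse_strict and STRICT_FORMAT are hypothetical, the sketch is written for Python 3, and the strict fallback uses datetime.strptime in place of DateTimeFormat, purely to show why optional parts stop working once the iso8601 module is missing.

```python
from datetime import datetime

try:
    import iso8601  # optional third-party module
except ImportError:
    iso8601 = None

# Hypothetical strict format: every component must be present.
STRICT_FORMAT = "%Y-%m-%dT%H:%M:%S"


def parse_strict(text):
    """Parse only fully specified date times (stand-in for the fallback)."""
    return datetime.strptime(text, STRICT_FORMAT)


def parse_iso8601(text):
    """Prefer the iso8601 module, fall back to the strict parser."""
    if iso8601 is not None:
        return iso8601.parse_date(text)
    return parse_strict(text)
```

With the strict parser, a shortened string such as "2010-05-05" raises ValueError, whereas the iso8601 module accepts it.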

Example Usage

>>> from arff import *
>>> arff = '''
@relation foo

@attribute class {foo,bar,baz}
@attribute document string
@attribute timestamp date "yyyy.MM.dd HH:mm:ss z"
@attribute val1 numeric
@attribute val2 numeric

@data
{0 foo, 1 'foo bar.txt', 2 '2010.05.05 15:45:00 +0100', 3 0.65436}
{0 bar, 1 'egg spam.txt', 2 '2010.05.05 15:50:00 +0100', 4 0.99764}
'''
>>> with Reader(arff) as r:
	print 'relation:',r.relation()
	print
	for row in r.data():
		print '%r, %r, %r, %f, %f' % (row['class'], row['document'],
			 row['timestamp'].isoformat(), row['val1'], row['val2'])

		
relation: foo

'foo', 'foo bar.txt', '2010-05-05T15:45:00+01:00', 0.654360, 0.000000
'bar', 'egg spam.txt', '2010-05-05T15:50:00+01:00', 0.000000, 0.997640
>>> import sys
>>> from datetime import datetime
>>> with Writer(sys.stdout) as w:
	w.relation('foo')
	w.attribute('className',NOMINAL('foo','bar','baz'))
	w.attribute('document',STRING)
	w.attribute('timestamp',DATE('yyyy.MM.dd HH:mm:ss z'))
	w.attribute('val1',NUMERIC)
	w.attribute('val2',NUMERIC)
	with w.attribute('struct',RELATIONAL) as rel:
		rel.attribute('x',NUMERIC)
		rel.attribute('y',NUMERIC)

	w.data()
	w.sparse(className='foo',document='foo bar.txt',timestamp=datetime(2015,5,4,15,45),val1=0.56432)
	w.sparse(className='bar',document='egg spam.txt',struct=dict(x=10))

	

@relation foo

@attribute className {foo,bar,baz}
@attribute document string
@attribute timestamp date "yyyy.MM.dd HH:mm:ss z"
@attribute val1 numeric
@attribute val2 numeric
@attribute struct relational
    @attribute x numeric
    @attribute y numeric
@end struct

@data
{0 foo, 1 'foo bar.txt', 2 "2015.05.04 15:45:00 GMT+01:00", 3 0.56432000000000004}
{0 bar, 1 'egg spam.txt', 5 '10, 0'}

The Reader parses the file lazily, only as far as it needs to. For example, the .data() method returns a generator that parses the file during iteration. The Reader does not store the parsed rows, so API users have to store them themselves.
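
To illustrate what lazy parsing means in practice, here is a minimal sketch using a stand-in generator. The function iter_rows is hypothetical and has nothing to do with the Reader's real parser; it only shows that a generator reads nothing until it is iterated, and that rows are gone once consumed unless you keep them yourself. The sketch assumes Python 3.

```python
import io


def iter_rows(stream):
    """Hypothetical stand-in: yield one parsed row per non-empty line."""
    for line in stream:
        line = line.strip()
        if line:
            yield line.split(",")


stream = io.StringIO("1,2\n3,4\n")
rows = iter_rows(stream)       # creating the generator reads nothing
pos_before = stream.tell()     # still at the start of the stream
first = next(rows)             # parses only the first line
remaining = list(rows)         # store the rest if you need them later
leftover = list(rows)          # the generator is exhausted by now
```

With the Reader, the equivalent would presumably be something like rows = list(r.data()) inside the with block.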

The create_writer and create_reader functions can be used to open files by specifying a path. These functions can also read and write compressed .arff.gz files and, provided the bz2 module is available, .arff.bz2 files.
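
Choosing a decompressor from the file extension could look roughly like the sketch below. The helpers open_by_extension and roundtrip are hypothetical and are not the library's create_reader or create_writer; the sketch assumes Python 3 and only demonstrates the general idea with the standard gzip and bz2 modules.

```python
import gzip
import io
import os
import tempfile

try:
    import bz2  # may be missing in some Python builds
except ImportError:
    bz2 = None


def open_by_extension(path, mode="rt"):
    """Hypothetical helper: open a text file, decompressing by extension."""
    if path.endswith(".gz"):
        return gzip.open(path, mode, encoding="utf-8")
    if path.endswith(".bz2"):
        if bz2 is None:
            raise IOError("bz2 module is not available")
        return bz2.open(path, mode, encoding="utf-8")
    return io.open(path, mode, encoding="utf-8")


def roundtrip(suffix, text):
    """Write text to a temporary file with the given suffix and read it back."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)
    try:
        with open_by_extension(path, "wt") as f:
            f.write(text)
        with open_by_extension(path, "rt") as f:
            return f.read()
    finally:
        os.remove(path)
```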

DateTimeFormat

This module is a functional clone of Java's java.text.SimpleDateFormat. Keep in mind that its behaviour diverges in some corner cases, for instance in some cases where a date time is not fully specified. Also, the "week in year" and "week in month" values are buggy. If anyone wants to send me a patch for them, please do!

java.text.SimpleDateFormat supports the use of different locales and a wide variety of named time zones. DateTimeFormat implements this using the babel and pytz libraries. However, these modules are not strictly needed. For locales you just have to pass an object that looks like a babel.Locale instance. You don't even need to implement all attributes, only the ones DateTimeFormat uses. To see which ones those are, look at the getlocale() function (in DateTimeFormat.py), which creates such a mock-up object when the babel module is not found.

If the pytz module is not found, named time zones are not supported. Time zones in the "GMT+01:00" or "+0100" formats are still supported.
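
Fixed-offset time zones of this kind can be handled without pytz. The following is a minimal sketch, assuming Python 3 and the standard datetime.timezone class; parse_offset and its regular expression are hypothetical and are not how DateTimeFormat does it (its own FixedOffset class appears in the usage example further down).

```python
import re
from datetime import timedelta, timezone

# Accepts "GMT+01:00", "GMT+0100", "+01:00" and "+0100" (and "-" variants).
_OFFSET = re.compile(r"^(?:GMT)?([+-])(\d{2}):?(\d{2})$")


def parse_offset(text):
    """Hypothetical helper: turn an offset string into a fixed-offset tzinfo."""
    match = _OFFSET.match(text)
    if match is None:
        raise ValueError("unsupported time zone: %r" % text)
    sign, hours, minutes = match.groups()
    delta = timedelta(hours=int(hours), minutes=int(minutes))
    if sign == "-":
        delta = -delta
    return timezone(delta, text)
```

A named zone such as "Europe/Berlin" does not match the pattern and raises ValueError, which mirrors the restriction described above.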

For a description of the format strings that can be used, see the documentation of java.text.SimpleDateFormat. In addition, the DateTimeFormat class supports some extension features. All white space strings in a format are interpreted as "here might be white space" (so each is equivalent to the regular expression '\\s*'). When formatting, the white space string supplied in the format string is written out unchanged.

In extension mode, a syntax for regular expressions is supported. You write regular expressions between '/' (slash) characters. If you want to use a slash inside your regular expression, you escape it by doubling it (yes, not with a backslash). When formatting a date time, the regular expression itself is written to the output. This is pretty asymmetric (when using regular expressions, you most likely cannot parse a date time with the same format you formatted it with) and it collides with the use of slashes in American dates, so I'm not sure I will leave it in.
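
The slash-delimiting rule, including the doubled-slash escape, can be sketched as a small tokenizer. The function split_segments is hypothetical and is not taken from DateTimeFormat; it ignores SimpleDateFormat's single-quote literals and other details, and only shows how the escaping rule described above can be applied.

```python
def split_segments(fmt):
    """Split a format into (is_regex, text) pairs.

    Text between slashes is a regular expression; inside one,
    a doubled slash stands for a single literal slash.
    """
    segments = []
    buf = []
    in_regex = False
    i = 0
    while i < len(fmt):
        ch = fmt[i]
        if ch == "/":
            if in_regex and fmt[i + 1:i + 2] == "/":
                buf.append("/")      # escaped slash inside a regex
                i += 2
                continue
            if buf:
                segments.append((in_regex, "".join(buf)))
                buf = []
            in_regex = not in_regex  # a single slash toggles regex mode
        else:
            buf.append(ch)
        i += 1
    if in_regex:
        raise ValueError("unterminated regular expression")
    if buf:
        segments.append((False, "".join(buf)))
    return segments
```

For example, split_segments("dd/a//b/MM") yields the pattern part "dd", the regular expression "a/b", and the pattern part "MM".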

Also a few more patterns are supported:

Letter  Date or Time Component  Presentation        Examples
i       Time zone               ISO 8601 time zone  Z; +01; -02:30
f       Second in minute        Decimal number      1.50; -0.2; 3
u       Microsecond             Number              2
x       Year of week            ISO year of week    1995
e       Weekday                 ISO weekday         1; 2; 3; 7
o       Week in year            ISO week in year    52
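
For orientation, the x, e and o letters appear to correspond to the ISO 8601 week date, which Python exposes through date.isocalendar(). The sketch below assumes that correspondence, based only on the "ISO" wording in the table; iso_fields is a hypothetical helper, not part of DateTimeFormat.

```python
from datetime import date


def iso_fields(d):
    """Map a date to the assumed meaning of the x, o and e letters."""
    year, week, weekday = d.isocalendar()
    return {"x": year, "o": week, "e": weekday}


# Sunday 31 December 1995 falls in ISO week 52 of 1995.
end_of_1995 = iso_fields(date(1995, 12, 31))

# Sunday 3 January 2010 still belongs to ISO week 53 of 2009,
# so the "year of week" differs from the calendar year.
start_of_2010 = iso_fields(date(2010, 1, 3))
```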

Example Matches:

Date and Time Pattern  Matches
"HH:mm:ff i"           15:30:02.000050 +01:00
"HH/[:-]/mm/[:-]/ss"   15:30:02; 15-30-02
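
As a rough point of comparison, the second pattern behaves much like the plain Python regular expression below. This is an approximation written for Python 3: TIME_PATTERN and parse_time are hypothetical, and the sketch assumes exactly two digits per field, which may not match how DateTimeFormat actually parses HH, mm and ss.

```python
import re
from datetime import time

# Two-digit hour, minute and second, separated by ":" or "-".
TIME_PATTERN = re.compile(r"^(\d{2})[:-](\d{2})[:-](\d{2})$")


def parse_time(text):
    """Hypothetical helper: parse times such as 15:30:02 or 15-30-02."""
    match = TIME_PATTERN.match(text)
    if match is None:
        raise ValueError("no match: %r" % text)
    hour, minute, second = (int(group) for group in match.groups())
    return time(hour, minute, second)
```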

Example Usage

>>> from DateTimeFormat import DateTimeFormat
>>> from datetime import datetime
>>> f = DateTimeFormat("yyyy.MM.dd HH:mm:ss z")
>>> f.format(datetime.now())
'2010.05.06 20:08:08 GMT+01:00'
>>> f.parse('2010.05.06 20:08:08 GMT+01:00')
datetime.datetime(2010, 5, 6, 20, 8, 8, tzinfo=DateTimeFormat.FixedOffset(datetime.timedelta(0, 3600),'GMT+01:00'))
