The XARF Data File Format

The XARF (eXtended Attribute Relation File Format) format is a simple human-readable file format for the declaration of data tables to be used by realKD. As the name suggests, the format is an extension of Weka's arff (attribute relation file format) that allows the specification of elements of the realKD data model that are not part of regular arff. An example file is available here.

NOTE: the file format is still under development, and features may be altered, added, or removed. This page reflects the format as used in realKD 0.7.0.

Basics

XARF files are essentially comma-separated value (CSV) files preceded by a metadata section, i.e., the information required to process the actual data in a correct and semantically meaningful way. This is the same basic layout as in arff files, and all arff files are supposed to be valid xarf files as well (although the current implementation does not cover all arff features, such as sparse data declarations, relational attributes, and weights). Compared to arff, xarf offers additional metadata options, allows a header row before the data part (as is common in CSV files), and, more importantly, lets the user provide only partial metadata (see below).

In order to rigorously specify the xarf format we need the following basic building blocks.

<IDENTIFIER> ::= [A-Za-z_$]+[A-Za-z_$0-9]*
<COMMENT> ::= %[<UTF-8>^<LB>]*<LB>
<VALUE> ::= <STRING> | <SET> | <SEQ>
<CHAR> ::= <UTF-8>^<QUOTES_OR_LB>
<STRING> ::= [<CHAR>^<WHITESPACE>]* | "<CHAR>*"
<SET> ::= {} | {<STRING> [, <STRING>]*}
<SEQ> ::= '[]' | '['<STRING> [, <STRING>]*']'

<XARF> ::=
   <COMMENT:table_descr>
   <METADATA>
   <CSV>

<METADATA> ::=
   <RELATION_DECLARATION><LB>?
   <ATTRIBUTE_DECLARATION><LB>*
   <GROUP_DECLARATION><LB>*
   <DATA_DECLARATION><LB>?

<CSV> ::=
   <CSV_ROW: header>?
   <CSV_ROW>*

<CSV_ROW> ::= <STRING> [<delimiter> <STRING>]*

The metadata consist of:

  1. comments, introduced by the comment symbol %, after which the rest of the line is ignored (except for leading comments, which are parsed as the data table's general human-readable description)
  2. a relation tag @relation <table_id>, which declares the start of the metadata block and the id of the input table; it accepts the optional parameter caption=<table_caption>, which sets the table caption (if absent, the caption equals the id). The whole tag is optional (if absent, the table id is "datatable").
  3. attribute declarations, which declare for each data column in the csv part how it is supposed to be parsed (see below)
  4. group declarations, which declare semantic relationships between different attributes (see below)
  5. a data tag @data, which declares the start of the csv part of the file (starting from the next line); this tag is also optional

A xarf file therefore has the following general layout:

<table_description_comment>
<...>
<table_description_comment>
@relation <table_id> caption=<table_caption>

<ATTRIBUTE_DECLARATION_1>
<ATTRIBUTE_DECLARATION_2>
<...>
<ATTRIBUTE_DECLARATION_n>

<GROUP_DECLARATION>
<GROUP_DECLARATION>
<...>
<GROUP_DECLARATION>

@data
<value_1_1>, <value_1_2>, <...> , <value_1_n>
<value_2_1>, <value_2_2>, <...> , <value_2_n>
<...>
<value_m_1>, <value_m_2>, <...> , <value_m_n>
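The layout above can be separated into its three parts with a few lines of code. The following Python sketch is not the realKD parser; the function name `split_xarf` is hypothetical, and it assumes that tags are written in lower case and that, when @data is absent, the csv part starts at the first line that is neither a tag nor a comment.

```python
def split_xarf(text):
    """Split xarf text into (description, metadata_lines, csv_lines).

    Assumptions of this sketch: tags are lower case, and without an
    explicit @data tag the csv part begins at the first line that is
    neither a tag (@...) nor a comment (%...).
    """
    lines = text.splitlines()
    description = []
    i = 0
    # Leading comments form the human-readable table description.
    while i < len(lines) and lines[i].lstrip().startswith("%"):
        description.append(lines[i].lstrip()[1:].strip())
        i += 1

    metadata, csv_lines = [], []
    in_csv = False
    for line in lines[i:]:
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        if in_csv:
            csv_lines.append(stripped)
        elif stripped.startswith("@data"):
            in_csv = True  # csv part starts on the next line
        elif stripped.startswith("@"):
            metadata.append(stripped)
        elif stripped.startswith("%"):
            continue  # non-leading comments are ignored
        else:
            in_csv = True  # no @data tag: csv starts here
            csv_lines.append(stripped)
    return " ".join(description), metadata, csv_lines
```

The returned metadata lines still contain the raw declarations; interpreting them is a separate step.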

Attribute Declarations

Attribute declarations determine how the data columns are to be interpreted and how they are referred to. Most importantly, they specify the level of measurement of the corresponding values, i.e., whether they are categoric, ordinal, or metric. All attribute declarations share the following basic form:

<ATTRIBUTE_DECLARATION> ::= @attribute <IDENTIFIER:attr_id> <DOMAIN_SPECIFIER> <PARAMS>?
<PARAMS> ::= <IDENTIFIER>=<VALUE> | <IDENTIFIER>=<VALUE> <PARAMS>

which declares an attribute with id attr_id. A caption (default: attr_id) and a description (default: "") for the attribute can be set with the following two optional parameters:

caption=<STRING:attr_caption>
description=<STRING:attr_descr>

The mandatory domain specifier determines what type of attribute is declared (corresponding to the statistical level of measurement).

<DOMAIN_SPECIFIER> ::= string | categoric | integer | numeric | real | <SET> | <SEQ>

The following subsections explain the available choices.

Unordered discrete domains

These domain specifiers all result in categoric attributes. They come in two flavors.

  1. finite domains which are specified by a list of values enclosed in curly braces: {<val_1>, <...>, <val_c>} (as in arff)
  2. infinite domains which are specified by either of the keywords string (as in arff) or categoric

Generally, explicit finite domains are preferred because their values can be mapped to integer keys (although realKD does not currently perform this mapping).

Examples

@attribute title {"Mr.", "Mrs.", "Miss", "Sir"}
@attribute city_of_birth categoric
@attribute user_id string

Ordered discrete domains

These domain specifiers result in ordinal attributes, which can optionally also be categoric (that is, realKD will represent distinct values in a category map and count category frequencies). For these attributes, order-based quantities such as percentiles, and in particular the median, are defined. On the other hand, they are not metric (e.g., we do not want to interpolate between values). Again, there are two flavors:

  1. finite domains, which are specified by an ordered list of values enclosed in brackets: [<val_1>, <val_2>, <...> , <val_c>]
  2. the integer domain, which is specified by the keyword integer; this domain specifier also enables the further optional parameter categorical=<is_categorical>, which determines whether the values of the ordinal attribute should also be maintained in a category map

Examples

@attribute age_category ["very young", "young", "middle-aged", "old", "very old"]
@attribute post_likes integer description="total number of likes post received within one week"
@attribute week integer categorical=true %data only spans a few weeks so it makes sense to maintain category frequencies

The continuous (real) domain

The real numbers can be specified as domain by the keywords numeric or real (both as in arff). This domain causes the attribute to be metric, i.e., interpolations of values are allowed such that, e.g., the arithmetic mean of the values is a meaningful quantity.
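To summarize the domain specifiers of this section, the following Python sketch parses a single attribute declaration and derives its level of measurement. It is an illustration and not the realKD implementation: the names `parse_attribute` and `strip_comment` are hypothetical, values containing a closing brace or bracket inside quotes are not handled, and set- or sequence-valued parameters are ignored.

```python
import re

IDENTIFIER = r"[A-Za-z_$][A-Za-z_$0-9]*"
DECLARATION = re.compile(
    r"@attribute\s+(?P<id>" + IDENTIFIER + r")\s+"
    r"(?P<domain>\{.*?\}|\[.*?\]|(?:string|categoric|integer|numeric|real)\b)"
    r"(?P<rest>.*)$"
)
PARAM = re.compile(r"(" + IDENTIFIER + r')=("[^"]*"|\S+)')


def strip_comment(line):
    """Cut the line at the first % that is not inside double quotes."""
    in_quotes = False
    for i, ch in enumerate(line):
        if ch == '"':
            in_quotes = not in_quotes
        elif ch == "%" and not in_quotes:
            return line[:i]
    return line


def parse_attribute(line):
    """Parse one @attribute declaration into a dictionary."""
    match = DECLARATION.match(strip_comment(line).strip())
    if match is None:
        raise ValueError("not an attribute declaration: " + line)
    domain = match.group("domain")
    params = {k: v.strip('"') for k, v in PARAM.findall(match.group("rest"))}
    if domain.startswith("{") or domain in ("string", "categoric"):
        level = "categoric"  # unordered discrete domain
    elif domain.startswith("[") or domain == "integer":
        level = "ordinal"    # ordered discrete domain
    else:
        level = "metric"     # numeric or real
    attr_id = match.group("id")
    return {
        "id": attr_id,
        "domain": domain,
        "level": level,
        "caption": params.pop("caption", attr_id),   # default: attr_id
        "description": params.pop("description", ""),  # default: ""
        "params": params,  # remaining parameters, e.g. categorical
    }
```

For instance, `parse_attribute('@attribute x real')["level"]` evaluates to `"metric"`.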

Group Declarations

Attribute group declarations specify relations among attributes. Their basic form is

@group <group_name> <GROUP_TYPE> <GROUP_MEMBER_DECLARATION>

where the group member declaration is a list of attribute ids (of previously specified attributes) which is either enclosed in curly braces (for unordered attribute relations, i.e., {<attribute_id_1>, <attribute_id_2>, <...> , <attribute_id_r>}) or brackets (for ordered attribute relations, i.e., [<attribute_id_1>, <attribute_id_2>, <...> , <attribute_id_r>]).

The following group types are currently known.

Functional groups

These are unordered relations of attributes that determine each other through a functional dependency. They are specified by the group type functional_group. Currently, it is not further specified which attributes (jointly) determine which other attributes, or whether the dependency is bidirectional. The library can use this information to avoid detecting trivial patterns.

Examples

@group "Health indicators" functional_group {height, weight, bmi}
@group "Temperature" functional_group {temp_kelvin, temp_celcius, temp_fahrenheit}
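Group declarations such as the ones above can be read with a single regular expression. The Python sketch below is only an illustration; the function `parse_group` and its optional `known_ids` argument are hypothetical names, and the check against `known_ids` mirrors the requirement that group members refer to previously specified attributes.

```python
import re

GROUP = re.compile(
    r'@group\s+(?P<name>"[^"]*"|\S+)\s+(?P<type>[A-Za-z_]+)\s+'
    r"(?P<members>\{[^}]*\}|\[[^\]]*\])"
)


def parse_group(line, known_ids=None):
    """Parse one @group declaration; text after the member list is ignored."""
    match = GROUP.match(line.strip())
    if match is None:
        raise ValueError("not a group declaration: " + line)
    members = match.group("members")
    ids = [token.strip() for token in members[1:-1].split(",") if token.strip()]
    if known_ids is not None:
        unknown = [attr for attr in ids if attr not in known_ids]
        if unknown:
            raise ValueError("undeclared attributes: " + ", ".join(unknown))
    return {
        "name": match.group("name").strip('"'),
        "type": match.group("type"),
        "ordered": members.startswith("["),  # brackets mean ordered
        "members": ids,
    }
```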

Distributions

These are ordered or unordered collections of attributes which jointly represent a distribution. The unordered case is specified by the group type distribution and the ordered case by ordered_distribution. In the first case, the library can compute the mode as a derived attribute; in the second, it can also compute quantities like the median.

Examples

@group "Topic distribution" distribution {topic_1, topic_2, topic_3} %number of posts user posted on certain topic
@group "Age distribution" ordered_distribution [young_pop, middle_aged_pop, old_pop]
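How realKD defines these derived attributes is not stated on this page. As one plausible reading, the mode of a row is the member attribute with the largest value, and the median is the first member (in the declared order) at which the cumulative sum reaches half of the row total. The Python sketch below follows that reading; both function names are hypothetical.

```python
def distribution_mode(row, members):
    """Return the member attribute with the largest value in this row.

    Ties are resolved in favor of the member listed first.
    """
    return max(members, key=lambda attr: row[attr])


def distribution_median(row, ordered_members):
    """Return the first member at which the cumulative sum reaches half
    of the row total. Only meaningful for ordered distributions."""
    total = sum(row[attr] for attr in ordered_members)
    if total <= 0:
        raise ValueError("row does not describe a distribution")
    cumulative = 0
    for attr in ordered_members:
        cumulative += row[attr]
        if 2 * cumulative >= total:
            return attr
```

For a row with young_pop=20, middle_aged_pop=50, and old_pop=30, both functions return middle_aged_pop.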

User input and partial metadata

The user can provide either a single file containing both data and metadata, or two separate files. In either case, the metadata do not have to be complete; e.g., attribute declarations need not be given for every attribute in the data. Instead, the user can provide only the information essential for their purpose, and the rest is handled by realKD.

For example, the CSV file

<header_1>, <header_2>, <...> , <header_n>
<value_1_1>, <value_1_2>, <...> , <value_1_n>
<value_2_1>, <value_2_2>, <...> , <value_2_n>
<...>
<value_m_1>, <value_m_2>, <...> , <value_m_n>
can be annotated with the following metadata (in the same file or in a separate file)
@attribute <header_n> categoric
@attribute <header_2> categoric
which states that the attributes with IDs header_n and header_2 are categoric. The following subsections describe various aspects of the parsing process.

Autodetection of CSV header presence

If the data declaration's option header=auto is set (or by default, if the option or the whole tag is absent), the presence of a header row in the csv portion is assumed if all of the following conditions hold:

  1. no token in the first row can be parsed as a number (integer or real) or as a missing value
  2. all tokens in the first row are unique (after being mapped to identifiers, e.g., column 1 -> column_1)
  3. in each column, all tokens from the second row onward differ from the token in the first row (after leading and trailing whitespace is stripped)
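The three conditions can be checked as in the following Python sketch. It is an approximation of the rule described above, not the realKD code: the names `has_header` and `to_identifier` are hypothetical, the missing-value symbols (`?` as in arff, and the empty token) are an assumption, and the identifier mapping simply replaces every character that is not allowed in an identifier by an underscore.

```python
import re

NUMBER = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?")
MISSING = {"", "?"}  # assumed missing-value symbols


def to_identifier(token):
    """Map a header token to an identifier, e.g. 'column 1' -> 'column_1'."""
    return re.sub(r"[^A-Za-z_$0-9]", "_", token.strip())


def has_header(rows):
    """rows: list of rows, each a list of string tokens."""
    if not rows:
        return False
    first = [token.strip() for token in rows[0]]
    # 1. no token in the first row is a number or a missing value
    if any(NUMBER.fullmatch(t) or t in MISSING for t in first):
        return False
    # 2. tokens are unique after being mapped to identifiers
    ids = [to_identifier(t) for t in first]
    if len(set(ids)) != len(ids):
        return False
    # 3. no later token in a column equals the first-row token
    for row in rows[1:]:
        for head, token in zip(first, row):
            if token.strip() == head:
                return False
    return True
```

Condition 3 is what prevents a purely categoric table without a header from losing its first data row: as soon as a value of the first row reappears further down in the same column, the first row is treated as data.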

Partial attribute declarations

If the number of attribute declarations does not equal the number of columns in the data, the attributes are created as follows. First, the attribute IDs from the declarations are mapped to the header of the data, where present (ID matching). The remaining attributes receive default IDs, and their type is determined by sniffing (default matching). The same happens when the header is absent or when there are no attribute declarations.

For example, let us assume the following attribute declarations

@attribute header_n categoric
@attribute header_2 categoric
and the existence of the header
header_1, header_2, ... , header_n
The second and the n-th attributes receive the IDs 'header_2' and 'header_n' and the type categoric. The remaining n-2 attributes are handled by default matching.
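The matching procedure can be sketched in Python as follows. The function `match_attributes` is hypothetical, and so is the default ID scheme `attribute_<position>`, since this page does not state how default IDs are formed. A domain of `None` marks a column whose type is still to be determined by sniffing.

```python
def match_attributes(n_columns, header, declared,
                     default_id=lambda i: "attribute_{}".format(i + 1)):
    """Return one (id, domain) pair per column.

    header:   list of column ids taken from the csv header, or None
    declared: dict mapping declared attribute ids to domain specifiers
    """
    columns = []
    for i in range(n_columns):
        name = header[i] if header is not None else None
        if name is not None and name in declared:
            # ID matching: the declaration applies to this column
            columns.append((name, declared[name]))
        else:
            # default matching: default id, type left to sniffing
            columns.append((default_id(i), None))
    return columns
```

With the header header_1, header_2, header_3 and declarations for header_3 and header_2, the first column falls back to default matching, while the other two keep their declared IDs and the type categoric.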

Sniffing for attribute type

For default attribute matching, the type of an attribute is determined as follows. First, all values are checked for being integer values. If they all are, the type is integer. Otherwise, all values are checked for being real values. If that check succeeds, the type is real; if it fails, the attribute is given the type categoric.
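This cascade translates directly into code. The Python sketch below uses a hypothetical function `sniff_type`. It relies on Python's built-in `int` and `float` parsers, which accept a few spellings (such as `1_000` or `nan`) that realKD may treat differently, and it does not treat missing values specially, as their handling during sniffing is not described here.

```python
def sniff_type(values):
    """Return 'integer', 'real', or 'categoric' for a column of strings."""

    def all_parse(parser):
        # True if every value is accepted by the given parser
        try:
            for value in values:
                parser(value.strip())
        except ValueError:
            return False
        return True

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "real"
    return "categoric"
```

For example, the column 1, 2.5 is sniffed as real, and the column 1, x as categoric.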
