The XARF Data File Format

XARF (eXtended Attribute Relation File Format) is a simple human-readable file format for declaring data tables to be used by realKD. As the name suggests, the format extends Weka's arff (attribute relation file format) so that elements of the realKD data model that are not part of regular arff can be specified. An example file is available here.

NOTE: the file format is still under development, and features may be altered, added, or removed. This page currently reflects the status as used in realKD 0.5.3.

Basics

XARF files are essentially comma-separated value (CSV) files with an additional header section that specifies the metadata of the table, i.e., the information required to process the actual data in a correct and semantically meaningful way. This is the same basic layout as in arff files, and every arff file is meant to be a valid xarf file as well (although the current implementation does not cover all arff features, such as sparse data declarations, relational attributes, and weights). An XARF file consists of the following elements:

  1. comments specified by the comment symbol %, after which the rest of the line is ignored (except leading comments, which are parsed as the data table's general human-readable description)
  2. a relation tag @relation which declares the start of the metadata block and accepts the further optional parameter declaration caption=<table_caption>
  3. a data tag @data which declares the start of the csv part of the file (starting from next line)
  4. attribute declarations which declare for each data column in the csv part how it is supposed to be parsed (see below)
  5. group declarations which declare semantic relationships between different attributes (see below)

In summary, the overall structure of an XARF file is as follows:

<table_description_comment>
<...>
<table_description_comment>
@relation <table_id> caption=<table_caption>

<ATTRIBUTE_DECLARATION_1>
<ATTRIBUTE_DECLARATION_2>
<...>
<ATTRIBUTE_DECLARATION_n>

<GROUP_DECLARATION>
<GROUP_DECLARATION>
<...>
<GROUP_DECLARATION>

@data
<value_1_1>, <value_1_2>, <...> , <value_1_n>
<value_2_1>, <value_2_2>, <...> , <value_2_n>
<...>
<value_m_1>, <value_m_2>, <...> , <value_m_n>
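
To make the layout concrete, here is a minimal Python sketch that splits XARF text into the leading description, the header lines, and the data rows. It is not realKD's parser (realKD has its own implementation); the function name `split_xarf` and the sample table are made up for illustration, and the use of Python's `csv` module for quoted values is an assumption about how the CSV part is meant to be read.

```python
import csv

# Made-up sample table that follows the structure shown above.
SAMPLE = '''\
% Toy table of forum users.
% All values are made up for illustration.
@relation users caption="Forum users"

@attribute title {"Mr.", "Mrs.", "Miss", "Sir"}
@attribute post_likes integer
@attribute height real

@data
"Mr.", 12, 1.82
"Miss", 40, 1.65
'''


def split_xarf(text):
    """Split XARF text into (description, header lines, data rows)."""
    description, header, rows = [], [], []
    seen_relation = False
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("%"):
            # Leading comments form the table description;
            # later full-line comments are ignored.
            if not seen_relation:
                description.append(line.lstrip("%").strip())
            continue
        if in_data:
            # Values are kept as strings; typing them is the job of
            # the attribute declarations.
            fields = next(csv.reader([line], skipinitialspace=True))
            rows.append([f.strip() for f in fields])
        elif line.startswith("@data"):
            in_data = True
        else:
            if line.startswith("@relation"):
                seen_relation = True
            header.append(line)
    return description, header, rows


description, header, rows = split_xarf(SAMPLE)
```

The sketch does not strip trailing comments from header lines and does not interpret the declarations; it only separates the three parts of the file.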

Attribute Declarations

Attribute declarations determine how the data columns ought to be interpreted and how they should be referred to. Most importantly, they specify the level of measurement of the corresponding values, i.e., whether they are categoric, ordinal, or metric. All attribute declarations share the following basic form:

@attribute <attribute_id> <DOMAIN_SPECIFIER> caption=<attribute_caption> description=<attribute_description>
where the attribute caption and description are both optional parameters. The domain specifier varies with the level of measurement. There are the following options.

Unordered discrete domains

These domain specifiers all result in categoric attributes. They come in two flavors:

  1. finite domains which are specified by a list of values enclosed in curly braces: {<val_1>, <...>, <val_c>} (as in arff)
  2. infinite domains which are specified by either of the keywords string (as in arff) or categoric

Generally, explicit finite domains are preferred because realKD can map their values to integer keys (although it does not currently do so).

Examples

@attribute title {"Mr.", "Mrs.", "Miss", "Sir"}
@attribute city_of_birth categoric
@attribute user_id string

Ordered discrete domains

These domain specifiers result in ordinal attributes, which can optionally also be categoric (that is, realKD will represent distinct values in a category map and count category frequencies). For these attributes, order-based quantities like percentiles, and in particular the median, are defined. On the other hand, they are not metric (e.g., we do not want to interpolate between values). Again there are two flavors:

  1. finite domains which are specified by an ordered list of values enclosed in brackets: [<val_1>, <val_2>, <...> , <val_c>]
  2. integer domain, which is specified by the keyword integer; this domain specifier also enables a further optional parameter for the attribute declaration, categorical=<is_categorical>, which determines whether the ordinal attribute should also be maintained in a category map

Examples

@attribute age_category ["very young", "young", "middle-aged", "old", "very old"]
@attribute post_likes integer description="total number of likes post received within one week"
@attribute week integer categorical=true %data only spans a few weeks so it makes sense to maintain category frequencies

The continuous (real) domain

The real numbers can be specified as domain by the keywords numeric or real (both as in arff). This domain causes the attribute to be metric, i.e., interpolations of values are allowed such that, e.g., the arithmetic mean of the values is a meaningful quantity.
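
As a rough summary of the attribute section, the Python sketch below parses a single attribute declaration and maps its domain specifier to a level of measurement. All names (`parse_attribute`, `level_of_measurement`, `strip_comment`) are hypothetical, the regular expressions only cover the declaration forms shown on this page, and the sketch does not reproduce how realKD itself parses declarations.

```python
import re

# Patterns for the forms shown on this page only (an assumption of this sketch).
DECL = re.compile(
    r'@attribute\s+(?P<id>"[^"]*"|\S+)\s+'
    r'(?P<domain>\{[^}]*\}|\[[^\]]*\]|\w+)(?P<rest>.*)'
)
PARAM = re.compile(r'(\w+)=("[^"]*"|\S+)')


def strip_comment(line):
    """Cut the line at the first % that is outside double quotes."""
    in_quotes = False
    for i, ch in enumerate(line):
        if ch == '"':
            in_quotes = not in_quotes
        elif ch == "%" and not in_quotes:
            return line[:i]
    return line


def level_of_measurement(domain):
    """Map a domain specifier to categoric, ordinal, or metric."""
    if domain.startswith("{") or domain in ("string", "categoric"):
        return "categoric"
    if domain.startswith("[") or domain == "integer":
        return "ordinal"
    if domain in ("numeric", "real"):
        return "metric"
    raise ValueError(f"unknown domain specifier: {domain!r}")


def parse_attribute(line):
    """Parse one @attribute line into id, domain, level, and parameters."""
    match = DECL.match(strip_comment(line).strip())
    if match is None:
        raise ValueError(f"not an attribute declaration: {line!r}")
    domain = match.group("domain")
    params = {k: v.strip('"') for k, v in PARAM.findall(match.group("rest"))}
    return {
        "id": match.group("id").strip('"'),
        "domain": domain,
        "level": level_of_measurement(domain),
        "params": params,
    }


week = parse_attribute(
    "@attribute week integer categorical=true %data only spans a few weeks"
)
```

In this example `week["level"]` is "ordinal" and `week["params"]` holds the `categorical` flag as the string "true"; the trailing comment is dropped before matching.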

Group Declarations

Attribute group declarations specify relations among attributes. Their basic form is

@group <group_name> <GROUP_TYPE> <GROUP_MEMBER_DECLARATION>
where the group member declaration is a list of attribute ids (of previously specified attributes) which is either enclosed in curly braces (for unordered attribute relations, i.e., {<attribute_id_1>, <attribute_id_2>, <...> , <attribute_id_r>}) or brackets (for ordered attribute relations, i.e., [<attribute_id_1>, <attribute_id_2>, <...> , <attribute_id_r>]).
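
A group declaration of this form can be taken apart with a small Python sketch like the one below. The function name `parse_group` and the regular expression are assumptions made for illustration and cover only the form described here; whether the members were declared as attributes beforehand is not checked.

```python
import re

# Name (quoted or bare), group type, then a braced or bracketed member list.
GROUP = re.compile(
    r'@group\s+(?P<name>"[^"]*"|\S+)\s+(?P<type>\w+)\s+'
    r'(?P<members>\{[^}]*\}|\[[^\]]*\])'
)


def parse_group(line):
    """Parse one @group line into name, type, ordering flag, and members."""
    match = GROUP.match(line.strip())
    if match is None:
        raise ValueError(f"not a group declaration: {line!r}")
    members = match.group("members")
    ids = [part.strip() for part in members[1:-1].split(",") if part.strip()]
    return {
        "name": match.group("name").strip('"'),
        "type": match.group("type"),
        # Brackets declare an ordered relation, curly braces an unordered one.
        "ordered": members.startswith("["),
        "members": ids,
    }


health = parse_group('@group "Health indicators" functional_group {height, weight, bmi}')
```

Anything after the member list, such as a trailing comment, is ignored by this sketch because the pattern is only matched from the start of the line.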

The currently known group types are the following.

Functional groups

These are unordered relations of attributes that determine each other through a functional dependency. They are specified by the group type functional_group. Currently, it is not further specified which attributes (jointly) determine which other attributes or whether the dependency is bi-directional. The library can use this information to avoid detecting trivial patterns.

Examples

@group "Health indicators" functional_group {height, weight, bmi}
@group "Temperature" functional_group {temp_kelvin, temp_celcius, temp_fahrenheit}

Distributions

These are ordered or unordered collections of attributes that jointly represent a distribution. The unordered case is specified by the group type distribution and the ordered case by ordered_distribution. In the first case, the library can compute the derived attribute mode; in the second case, it can also compute quantities like the median.

Examples

@group "Topic distribution" distribution {topic_1, topic_2, topic_3} %number of posts user posted on certain topic
@group "Age distribution" ordered_distribution [young_pop, middle_aged_pop, old_pop]
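
The derived quantities mentioned for distribution groups can be sketched in Python as follows. This is an interpretation, not realKD's implementation: it assumes that each member attribute holds a non-negative weight (for example a count) per data row, that the mode is the member with the largest weight, and that the median of an ordered distribution is the first member at which the cumulative weight reaches half of the total. The function names and the sample row are made up.

```python
def distribution_mode(row, members):
    """Return the member attribute with the largest weight in this row."""
    return max(members, key=lambda member: row[member])


def distribution_median(row, ordered_members):
    """Return the first member where the cumulative weight reaches half the total."""
    total = sum(row[member] for member in ordered_members)
    if total <= 0:
        raise ValueError("distribution has no mass")
    running = 0
    for member in ordered_members:
        running += row[member]
        if 2 * running >= total:
            return member


# Made-up row for the "Age distribution" group above.
row = {"young_pop": 30, "middle_aged_pop": 50, "old_pop": 20}
members = ["young_pop", "middle_aged_pop", "old_pop"]

mode = distribution_mode(row, members)
median = distribution_median(row, members)
```

For this row both `mode` and `median` are "middle_aged_pop". The median depends on the member order, which is why it is only meaningful for an ordered distribution, whereas the mode is not affected by reordering the members.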
