Data input

This page describes how RealKD handles the data-related files provided by the user.

Data and Metadata

RealKD considers data to be comma-separated values with an optional header, i.e.,

<header_1>, <header_2>, <...> , <header_n>
<value_1_1>, <value_1_2>, <...> , <value_1_n>
<value_2_1>, <value_2_2>, <...> , <value_2_n>
<...>
<value_m_1>, <value_m_2>, <...> , <value_m_n>
and metadata to be XARF annotated declarations.

User input

The user can provide either a single file containing the data and optional metadata, or two files, where the first contains the data and the second the metadata. In both cases, the metadata do not have to be complete, i.e., they need not include a relation tag, attribute declarations for every attribute in the data, etc. Instead, the user can provide only the information essential for their purpose, and the rest will be handled by RealKD (see next section).

For example, the data file

<header_1>, <header_2>, <...> , <header_n>
<value_1_1>, <value_1_2>, <...> , <value_1_n>
<value_2_1>, <value_2_2>, <...> , <value_2_n>
<...>
<value_m_1>, <value_m_2>, <...> , <value_m_n>
can be accompanied by the following metadata file (or both can be placed in a single file)
@attribute header_n categoric
@attribute header_2 categoric
which states that the attributes with IDs header_n and header_2 are categoric.

User attribute declarations matching

The (partial) attribute declarations provided by the user are mapped to the actual data using the following three rules, applied in order.

  1. ID matching: The attribute declarations are mapped to the corresponding data attributes by matching the attribute ID of each declaration to the corresponding ID in the header of the data.
  2. Implicit matching: If the number of attribute declarations equals the number of attributes in the data, the order of the declarations determines the matching with the attributes in the data.
  3. Default matching: Attributes in the data are given default IDs, and their types are determined by sniffing.
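
The three rules can be sketched roughly as follows. This is an illustration in Python, not RealKD's actual code. The function name `match_declarations`, the `attribute_<k>` default IDs, and the choice to keep a header ID for undeclared columns are all assumptions made for this example. A type of `None` means "determine by sniffing".

```python
def match_declarations(declarations, header, n_columns):
    """Map (attribute_id, type) declarations to column positions.

    declarations: list of (attribute_id, type) pairs given by the user
    header:       list of column IDs from the data, or None if no header
    Returns one (attribute_id, type_or_None) pair per column.
    """
    declared = dict(declarations)
    ids = list(header) if header is not None else [None] * n_columns

    # Rule 1: ID matching against the header of the data.
    matched = {}
    for i, column_id in enumerate(ids):
        if column_id in declared:
            matched[i] = (column_id, declared[column_id])

    # Rule 2: implicit matching by order (assumed here to apply only when
    # no ID matched and the counts agree).
    if not matched and len(declarations) == n_columns:
        matched = {i: pair for i, pair in enumerate(declarations)}

    # Rule 3: default matching. "attribute_<k>" is a hypothetical default ID.
    result = []
    for i in range(n_columns):
        default_id = ids[i] or "attribute_{}".format(i + 1)
        result.append(matched.get(i, (default_id, None)))
    return result


partial = [("header_n", "categoric"), ("header_2", "categoric")]
print(match_declarations(partial, ["header_1", "header_2", "header_n"], 3))
```

In this sketch, a partial declaration such as the one in the earlier example fixes the types of only the named columns; every other column is left for the type sniffer.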

Sniffing for a CSV header

Header sniffing works as follows. The first two lines of the data are parsed into tokens. For every token position, the types of the tokens in the first and second lines are compared; if the types are the same at all positions, the sniffer concludes that there is no header. In addition, the presence of a missing value token, i.e., ?, in the first line indicates that no header is present. If the second line has a missing value at some position, the type check for that position is skipped.

The sniffer can fail in two ways. First, if a header is present and all attributes are categoric, the sniffer will conclude that there is no header. Second, sniffing will fail if the only positions that could reveal the presence of a header are those where the second line has missing values.
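
The header check and its two failure modes can be reproduced with a short Python sketch. It is a simplified illustration, not RealKD's implementation. The names `token_type` and `has_header` are hypothetical, and the sketch assumes that token types are the same three used for attributes (integer, real, categoric) and that tokens are split on plain commas.

```python
MISSING = "?"  # missing value token, as described above


def token_type(token):
    """Classify one token as integer, real, or categoric (assumed types)."""
    for name, parse in (("integer", int), ("real", float)):
        try:
            parse(token)
            return name
        except ValueError:
            pass
    return "categoric"


def has_header(first_line, second_line):
    """Return True if the first line looks like a header."""
    first = [t.strip() for t in first_line.split(",")]
    second = [t.strip() for t in second_line.split(",")]

    # A missing value in the first line means there is no header.
    if MISSING in first:
        return False

    for a, b in zip(first, second):
        if b == MISSING:
            continue  # skip positions where the second line is missing
        if token_type(a) != token_type(b):
            return True  # types differ, so the first line is a header
    return False  # same types everywhere: no header detected


print(has_header("age, name", "31, alice"))      # header detected
print(has_header("name, city", "alice, bonn"))   # failure mode 1
print(has_header("age, name", "?, alice"))       # failure mode 2
```

The last two calls return `False` even though a header is present: in the second call every column is categoric, and in the third the only informative position holds a missing value.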

Sniffing for attribute type

For default attribute matching, the type of the attribute is determined as follows. First, all values are checked to see whether they are integers. If they are, the type is integer. Otherwise, all values are checked to see whether they are real numbers. If they are, the type is real. Otherwise, the attribute is given the type categoric.
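
The integer, real, categoric cascade can be sketched in Python as follows. This is a sketch under stated assumptions rather than RealKD's own code. The function name `sniff_attribute_type` is hypothetical, and skipping missing values (`?`) before checking is an assumption, since the description above does not say how they are treated. Python's `int` and `float` parsers also accept a few forms (such as `1_000` or `nan`) that RealKD may treat differently.

```python
MISSING = "?"  # assumed to be ignored during type sniffing


def _parses_as(parse, token):
    """Return True if parse(token) succeeds."""
    try:
        parse(token)
        return True
    except ValueError:
        return False


def sniff_attribute_type(values):
    """Return 'integer', 'real', or 'categoric' for a column of strings."""
    present = [v.strip() for v in values if v.strip() != MISSING]

    # Step 1: every value is an integer.
    if all(_parses_as(int, v) for v in present):
        return "integer"
    # Step 2: every value is a real number.
    if all(_parses_as(float, v) for v in present):
        return "real"
    # Step 3: fall back to categoric.
    return "categoric"


print(sniff_attribute_type(["1", "2", "3"]))
print(sniff_attribute_type(["1", "2.5", "?"]))
print(sniff_attribute_type(["1", "red", "3"]))
```

Because the integer check runs first, a column is typed as real only if at least one value fails to parse as an integer, and a single non-numeric value is enough to make the whole column categoric.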
