iMonDB Collector configuration

The iMonDB Collector uses a configuration file in which the required settings for connecting to the database and analyzing the raw data files are specified.

The configuration file is a YAML file that must be named config.yaml and must be located in the same directory as the iMonDB Collector executable. If no configuration file is present in this directory, an incomplete default configuration file containing some standard settings will be generated. This configuration has to be completed with the required settings before the iMonDB Collector can be executed successfully.

The configuration file can be modified through the user interface of the iMonDB Collector, or by editing the config.yaml file directly. However, using the iMonDB Collector is recommended, as this provides some immediate feedback on the validity of the configuration settings. The mandatory configuration settings are indicated in bold in the iMonDB Collector.

The configuration file consists of various pieces of information, which can all individually be completed using the iMonDB Collector.

Configuration creation

Database configuration

The information to connect to the iMonDB MySQL database should be provided:

  • Host: The MySQL host.
  • Port: The MySQL port.
  • Database: The iMonDB database name.
  • User name: The MySQL user name.
  • Password: The password for the given user (if applicable).

General configuration

The information on which raw data files to process and how they should be processed should be provided.

  • Directory: The base directory where the processing will start. All applicable files found in the base directory and in all subdirectories recursively will be processed.
  • File name regex: A regular expression that is matched (case-insensitively) against the file name of each raw data file to decide whether the file needs to be processed. The default expression matches only files that have '.raw' as file name extension, but the expression can also include specific patterns in the file name, e.g. to select experiments of a specific type as indicated in the file name. The regular expression can be evaluated by clicking on the magnifying glass to see whether it matches the expected file names.
  • Starting date: The date of the most recent run in the raw files. Only files with a modification date after this date will be processed. This can be used to limit the analysis to raw files that were created after the previous time the analysis was performed. To analyze all files, set this to a sufficiently early date or leave it blank. The starting date will be automatically set to the most recent file's creation date when the iMonDB Collector finishes its analysis.
  • Number of threads: The number of worker threads used for collecting the raw files and processing them. Using multiple threads will result in several raw files being processed simultaneously, reducing the analysis time (on sufficiently powerful computers). Take into account that when processing multiple big raw files simultaneously on a high number of threads, the size of the available memory might have to be increased (-Xmx argument) when starting the iMonDB Collector.
  • Enforce unique run names: Traditionally, the name of a Run in the iMonDB is the same as the name of the raw file. However, run names are required to be unique, while different files with the same name can exist in separate directories. Therefore, this setting makes the run names explicitly unique by appending the MD5 checksum of the raw file to the run name.
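
The file selection and run naming rules above can be sketched in Python as follows. This is only an illustration of the described behaviour, not the iMonDB Collector's actual implementation: the function names, the use of a partial (search) match, the underscore separator and the removal of the file extension from the run name are all assumptions, and the Collector's own regular expression engine may differ in details from Python's.

```python
import hashlib
import re
from datetime import datetime
from pathlib import Path
from typing import Iterator, Optional


def select_raw_files(base_dir: str, name_regex: str,
                     starting_date: Optional[datetime] = None) -> Iterator[Path]:
    """Yield the files below base_dir (recursively) that should be processed."""
    # The file name regex is applied case-insensitively.
    pattern = re.compile(name_regex, re.IGNORECASE)
    for path in sorted(Path(base_dir).rglob("*")):
        if not path.is_file():
            continue
        # Assumption: a match anywhere in the file name is sufficient.
        if not pattern.search(path.name):
            continue
        # Only files modified after the starting date are kept;
        # a blank starting date (None) keeps all files.
        modified = datetime.fromtimestamp(path.stat().st_mtime)
        if starting_date is not None and modified <= starting_date:
            continue
        yield path


def unique_run_name(path: Path) -> str:
    """Append the MD5 checksum of the file to the run name.

    Assumptions: the run name is the file name without extension, and
    the checksum is joined with an underscore.
    """
    digest = hashlib.md5(path.read_bytes()).hexdigest()
    return f"{path.stem}_{digest}"
```

For example, `select_raw_files("/data/raw", r"\.raw$")` would yield every file below the hypothetical directory /data/raw whose name ends in '.raw', regardless of case.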

Instruments configuration

Configure an instrument

Every instrument for which the instrument parameters will be extracted from raw data files needs to be configured, so that raw files can be linked to a specific instrument. An instrument definition contains the following information.

  • Instrument name: The unique instrument name.
  • Instrument model: The specific instrument model, as unambiguously defined by an accession number in the PSI-MS controlled vocabulary.
  • Regex source: The source on which to apply the regular expression used to identify the instrument for a specific raw file. Possible options are name, indicating the base file name, and path, indicating the file path (excluding the file name).
  • Regular expression: The regular expression used to link raw files to the defined instrument.

When defining multiple instruments, each represented by a different regular expression, please make sure that a raw file cannot match more than one instrument. Take particular care when mixing regular expressions applied to the file name with regular expressions applied to the file path. If multiple instruments match a single file, this file will not be processed. Likewise, if no instrument can be matched to a certain raw file, the file will not be processed either. Regular expressions can be verified by clicking on the magnifying glass. For a given input string, the matching instrument(s) will be listed. For each raw file that should be analyzed, exactly one instrument should match.
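
The "exactly one instrument" rule can be illustrated with a small Python sketch. The class and function names are hypothetical, paths are assumed to use forward slashes, and the sketch assumes a partial, case-sensitive match; the iMonDB Collector itself may apply the expressions differently.

```python
import re
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import List, Optional


@dataclass
class InstrumentRule:
    name: str    # unique instrument name
    source: str  # "name" (base file name) or "path" (path without the file name)
    regex: str   # regular expression linking raw files to this instrument


def matching_instruments(raw_file: str, rules: List[InstrumentRule]) -> List[str]:
    """Return the names of all instruments whose rule matches the raw file."""
    path = PurePosixPath(raw_file)
    matches = []
    for rule in rules:
        # Apply the regex to the file name or to the directory part only.
        target = path.name if rule.source == "name" else str(path.parent)
        if re.search(rule.regex, target):
            matches.append(rule.name)
    return matches


def instrument_for(raw_file: str, rules: List[InstrumentRule]) -> Optional[str]:
    """Return the single matching instrument, or None if the file is skipped."""
    matches = matching_instruments(raw_file, rules)
    # Zero matches or several matches both mean the file is not processed.
    return matches[0] if len(matches) == 1 else None
```

With one rule on the file name and another on the path, a file such as /data/qe_b/OrbiA_run2.raw could match both (a name rule `^OrbiA_` and a path rule `/qe_b(/|$)`, both made up for this example) and would therefore be skipped, which is the situation the warning above describes.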

Verify the instruments regex configuration

Metadata configuration

Configure metadata

Each Run can have some associated metadata information in a key-value format. Metadata can be configured in a similar fashion to the instruments configuration.

  • Metadata key
  • Metadata value
  • Regex source
  • Regular expression

Unlike instruments, a raw file is allowed to match to multiple metadata regular expressions. The regular expressions can again be evaluated by clicking on the magnifying glass.
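
As a contrast with the instrument rule, the following Python sketch collects every metadata entry whose regular expression matches. The names are hypothetical, paths are assumed to use forward slashes, and a partial match is assumed; how the iMonDB Collector handles two matching entries with the same key is not specified here, so the sketch simply returns all matching key-value pairs.

```python
import re
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import List, Tuple


@dataclass
class MetadataRule:
    key: str     # metadata key
    value: str   # metadata value
    source: str  # "name" (base file name) or "path" (path without the file name)
    regex: str   # regular expression selecting the raw files


def metadata_for(raw_file: str, rules: List[MetadataRule]) -> List[Tuple[str, str]]:
    """Return the key-value pairs of all metadata rules matching the raw file."""
    path = PurePosixPath(raw_file)
    pairs = []
    for rule in rules:
        target = path.name if rule.source == "name" else str(path.parent)
        # Unlike instruments, any number of rules may match the same file.
        if re.search(rule.regex, target):
            pairs.append((rule.key, rule.value))
    return pairs
```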

Configuration validation

The execution tab shows an overview of whether the different categories are configured correctly. By hovering over the status symbol, specific status information can be consulted in the tooltip text. Specifically, the following information is checked:

  • Database configuration: The database configuration can be verified by connecting to the database using the specified settings. This verification is not performed automatically; the user has to initiate it. The status is valid if the database settings were verified successfully, invalid if the database settings could not be verified or some mandatory settings are missing, and unknown if all mandatory settings are completed but have not been verified.
  • General configuration: The status of the general configuration is invalid if the mandatory directory or the file name regex are not provided, otherwise it is valid.
  • Instrument configuration: If there are no instruments defined, the instrument configuration is invalid. Furthermore, if the database configuration has been verified successfully, the instrument definitions are verified against the database. If all instruments are present in the database and have the same information (instrument name and model), the status is valid. Otherwise, if one of the instruments has conflicting information in the database (i.e. a different instrument model), the status is invalid and the offending instrument configuration should be fixed. Finally, if there are instrument definitions which are not present in the database, the status is unknown.
  • Metadata configuration: The status of the metadata configuration is invalid if there is no metadata configured, otherwise it is valid.

If at least one of the categories has an invalid configuration status, it will be impossible to execute the iMonDB Collector. When the offending configuration item has been amended, and all statuses are either valid or unknown, the iMonDB Collector can be executed. However, the configuration validation can't check for all possible problems, so even if all categories indicate a valid status, the configuration should be carefully verified so that the iMonDB Collector executes as desired.
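
The instrument status rules and the execution condition can be summarized in a short Python sketch. This is an interpretation of the description above rather than the Collector's code: the names are hypothetical, instruments are reduced to a mapping from instrument name to model accession, and returning unknown when the database has not been verified is an assumption, since that case is not described explicitly.

```python
from enum import Enum
from typing import Dict, Iterable, Optional


class Status(Enum):
    VALID = "valid"
    INVALID = "invalid"
    UNKNOWN = "unknown"


def instrument_status(configured: Dict[str, str],
                      in_database: Optional[Dict[str, str]]) -> Status:
    """Both mappings go from instrument name to model accession.

    in_database is None when the database settings were not verified.
    """
    if not configured:
        return Status.INVALID  # no instruments defined
    if in_database is None:
        return Status.UNKNOWN  # assumption: nothing to compare against
    for name, model in configured.items():
        # A known instrument with a different model is a conflict.
        if name in in_database and in_database[name] != model:
            return Status.INVALID
    if any(name not in in_database for name in configured):
        return Status.UNKNOWN  # some instruments are not in the database yet
    return Status.VALID


def can_execute(statuses: Iterable[Status]) -> bool:
    """Execution is possible as long as no category is invalid."""
    return all(status is not Status.INVALID for status in statuses)
```

For instance, `can_execute([Status.VALID, Status.UNKNOWN, Status.VALID, Status.VALID])` returns `True`, while a single `Status.INVALID` in the list blocks execution.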
