Iota

Iota is a lightweight tracing tool for diagnosing poorly performing I/O operations on parallel file systems, especially Lustre. It collects complete traces of POSIX I/O calls with minimal overhead, and has been tested to scale to 110,052 MPI tasks and 1.5TB of data.

Building

The included Makefile has been tested on Linux. To compile a static version of both the serial and MPI Iota libraries, run make. This also builds a small test program and generates a file, wraps.txt, containing the linker options you should add to the link line of the program you want to instrument with Iota.

To enable support for Lustre, add LUSTRE=/path to the make command. To compile a shared library instead of a static library, add SHARED=1.

Implementation Details

Iota supports both runtime and linktime interposition of POSIX I/O functions in the GNU C library. For runtime interposition of dynamically linked executables, Iota redefines each function and calls the dlsym function with RTLD_NEXT to locate the next (e.g. system) symbol for that function name. For linktime interposition of dynamically or statically linked executables, Iota uses the GNU linker's --wrap feature and defines a __wrap_* variant for each function.
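The runtime case can be illustrated with a minimal sketch for a single function, write. This is not Iota's actual source: the sketch_* counters are hypothetical stand-ins for real trace records, and the RTLD_NEXT fallback value assumes glibc.

```c
#define _GNU_SOURCE /* exposes RTLD_NEXT in <dlfcn.h> on glibc */
#include <dlfcn.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallback in case another header was included before _GNU_SOURCE was
 * defined; the value matches glibc's own definition (assumption). */
#ifndef RTLD_NEXT
#define RTLD_NEXT ((void *) -1l)
#endif

/* Hypothetical counters standing in for real trace records. */
static unsigned long sketch_write_calls = 0;
static unsigned long long sketch_write_bytes = 0;

typedef ssize_t (*sketch_write_fn)(int, const void *, size_t);

/* Same name and signature as the libc function, so the dynamic linker
 * resolves the program's calls to write() here first. */
ssize_t write(int fd, const void *buf, size_t count)
{
    static sketch_write_fn real_write = NULL;

    if (real_write == NULL) {
        /* RTLD_NEXT: find the next definition of "write" after this
         * one in library search order, normally the one in libc. */
        real_write = (sketch_write_fn)dlsym(RTLD_NEXT, "write");
        if (real_write == NULL) {
            return -1; /* no underlying symbol found */
        }
    }

    ssize_t ret = real_write(fd, buf, count);

    /* No stdio in here: printf could call write() and recurse. */
    sketch_write_calls++;
    if (ret > 0) {
        sketch_write_bytes += (unsigned long long)ret;
    }
    return ret;
}
```

On older glibc versions this needs -ldl on the link line. For linktime interposition, the same body would instead be named __wrap_write and call __real_write, with -Wl,--wrap=write passed to the linker.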

Iota supports both MPI and non-MPI executables. In MPI mode, initialization and finalization are accomplished by redefining the MPI_Init and MPI_Finalize functions and calling into the standard MPI profiling interface (PMPI_Init and PMPI_Finalize). In non-MPI mode, the GCC constructor and destructor function attributes provide similar hooks. In both modes, Iota measures the elapsed time of POSIX I/O calls with the microsecond-resolution gettimeofday timer.
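A minimal sketch of the non-MPI hooks and the timing arithmetic follows. All sketch_* names are hypothetical and the hook bodies are placeholders, not Iota's real setup code. The MPI variant is omitted because it needs an MPI installation to compile.

```c
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical state standing in for real setup and teardown. */
static int sketch_initialized = 0;
static struct timeval sketch_start_time;

/* Runs before main() in a non-MPI program (GCC function attribute). */
__attribute__((constructor))
static void sketch_init(void)
{
    gettimeofday(&sketch_start_time, NULL);
    sketch_initialized = 1;
}

/* Runs after main() returns or exit() is called; a real tool would
 * write out its remaining trace data here. */
__attribute__((destructor))
static void sketch_fini(void)
{
    sketch_initialized = 0;
}

/* Microseconds between two gettimeofday() samples. */
static long long sketch_elapsed_usec(const struct timeval *t0,
                                     const struct timeval *t1)
{
    return (long long)(t1->tv_sec - t0->tv_sec) * 1000000LL
         + (long long)(t1->tv_usec - t0->tv_usec);
}
```

An interposed call would sample gettimeofday immediately before and after invoking the underlying function and record the difference as the elapsed time.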

Iota traces are buffered in memory to avoid many small writes to the trace file, an important design consideration for scalability on modern parallel file systems. Iota offers two alternative methods for limiting the memory footprint during tracing, while still collecting complete traces:

  • subsetting, which restricts tracing to a subset of files specified by a wildcard pattern in an environment variable;
  • flushing, which flushes the trace buffer to file at 1MB intervals, but requires writing one such file per MPI task. On Lustre file systems, we open the trace file with stripe count 1 and stripe size 1MB. Flushing has the added benefit that some tracing information may be available even if the program aborts in the middle of a run.
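Both methods can be sketched in a few lines. This is an illustration under stated assumptions rather than Iota's implementation: the environment variable name IOTA_SKETCH_FILES is hypothetical (the real name is not given here), fnmatch is assumed as the wildcard matcher, and the sketch_* names are invented.

```c
#include <fnmatch.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_BUF_SIZE (1024 * 1024) /* 1MB flush interval from the text */

/* Hypothetical variable name, used only for this sketch. */
#define SKETCH_PATTERN_ENV "IOTA_SKETCH_FILES"

struct sketch_buffer {
    char data[SKETCH_BUF_SIZE];
    size_t used;
    unsigned long flushes;
    FILE *out; /* per-task trace file; may be NULL in this sketch */
};

/* Subsetting: trace a path only if it matches the wildcard pattern.
 * With no pattern set, every path is traced. */
static int sketch_should_trace(const char *path)
{
    const char *pattern = getenv(SKETCH_PATTERN_ENV);
    if (pattern == NULL || pattern[0] == '\0') {
        return 1;
    }
    return fnmatch(pattern, path, 0) == 0;
}

/* Flushing: write the buffered records out and start over. A real
 * interposer would call the underlying write directly so that trace
 * output is not itself traced. */
static void sketch_flush(struct sketch_buffer *b)
{
    if (b->out != NULL && b->used > 0) {
        fwrite(b->data, 1, b->used, b->out);
        fflush(b->out);
    }
    b->used = 0;
    b->flushes++;
}

/* Append one record, flushing first if it would not fit.
 * Returns 0 on success, -1 if the record is larger than the buffer. */
static int sketch_append(struct sketch_buffer *b, const char *rec, size_t len)
{
    if (len > SKETCH_BUF_SIZE) {
        return -1;
    }
    if (b->used + len > SKETCH_BUF_SIZE) {
        sketch_flush(b);
    }
    memcpy(b->data + b->used, rec, len);
    b->used += len;
    return 0;
}
```

Because the buffer is 1MB, it should be allocated statically or on the heap rather than on the stack.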

Trace Format

open/create operations:

task    start   elapse  offset  op  ret path    {lustre-string}

where offset has no significance but serves as a placeholder to keep the number of fields the same as in the fd operations case, and lustre-string is:

stripe-size,stripe-count,stripe-offset,ost1{,ost2,...}
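A minimal sketch of splitting a lustre-string into its parts, assuming every field is a decimal integer. The struct, the function names, and the cap of 64 OSTs are choices made for this illustration and do not come from Iota.

```c
#include <stdlib.h>

#define SKETCH_MAX_OSTS 64 /* arbitrary cap for this sketch */

struct sketch_layout {
    long long stripe_size;
    long long stripe_count;
    long long stripe_offset;
    long long osts[SKETCH_MAX_OSTS];
    int n_osts;
};

/* Read one decimal integer at *p and advance *p past it.
 * Returns 0 on success, -1 if no digits were found. */
static int sketch_next_field(const char **p, long long *value)
{
    char *end;
    long long v = strtoll(*p, &end, 10);
    if (end == *p) {
        return -1;
    }
    *value = v;
    *p = end;
    return 0;
}

/* Parse "stripe-size,stripe-count,stripe-offset,ost1{,ost2,...}".
 * Returns 0 on success, -1 on malformed input. */
static int sketch_parse_layout(const char *s, struct sketch_layout *out)
{
    long long head[3];
    const char *p = s;

    /* Three fixed fields, each followed by a comma. */
    for (int i = 0; i < 3; i++) {
        if (sketch_next_field(&p, &head[i]) != 0 || *p != ',') {
            return -1;
        }
        p++; /* skip the comma */
    }
    out->stripe_size = head[0];
    out->stripe_count = head[1];
    out->stripe_offset = head[2];

    /* One or more OST indices, comma-separated, up to end of string. */
    out->n_osts = 0;
    for (;;) {
        long long ost;
        if (out->n_osts == SKETCH_MAX_OSTS ||
            sketch_next_field(&p, &ost) != 0) {
            return -1;
        }
        out->osts[out->n_osts++] = ost;
        if (*p == '\0') {
            return 0;
        }
        if (*p != ',') {
            return -1;
        }
        p++;
    }
}
```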

fd operations (write, read, seek, flush, sync):

task    start   elapse  offset  op  ret fd

where offset generally only has significance for the read and write operations (read, fread, pread, write, fwrite, pwrite, etc.).

path operations (truncate, unlink):

task    start   elapse  offset  op  ret path

where offset has no significance but serves as a placeholder to keep the number of fields the same as in the fd operations case.
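Since all three record types share the first six fields, a single reader can handle them. The sketch below assumes whitespace-separated fields, paths without embedded whitespace, an integer task, floating-point start and elapse, and integer offset and ret. None of these types are stated above, and the struct and function names are hypothetical.

```c
#include <stdio.h>

struct sketch_record {
    int task;
    double start;
    double elapse;
    long long offset;
    char op[32];
    long long ret;
    char target[1024]; /* path, or the fd as text */
    char lustre[256];  /* open/create records only; otherwise empty */
};

/* Parse one trace line. Returns 0 on success, -1 if the line does not
 * have 7 fields (fd and path operations) or 8 fields (open/create
 * with a lustre-string). */
static int sketch_parse_record(const char *line, struct sketch_record *r)
{
    r->lustre[0] = '\0';
    int n = sscanf(line, "%d %lf %lf %lld %31s %lld %1023s %255s",
                   &r->task, &r->start, &r->elapse, &r->offset,
                   r->op, &r->ret, r->target, r->lustre);
    return (n == 7 || n == 8) ? 0 : -1;
}
```

The target field is kept as text so that the same struct serves both fd and path records; a caller can convert it to an integer when op names an fd operation.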

License

Iota is available for noncommercial use under an open-source license. See LICENSE for details.