Question seekg on File

Issue #24 open
Tristan VOIRON created an issue

Hi Tobias,

Tristan from the Intempora team here; our tests with your library are going great.

For our work we need to access objects at different timestamps, not only read them one after the other: for example, skip two objects and read the third, or go back to the previous object. Is there a way to move around within a File object, and if so, do you have an example?

I saw that the CompressedFile and UncompressedFile objects have a seekg function, so maybe there is a way using that?

Thanks in advance and best regards,

Tristan

Comments (8)

  1. Tobias Lorenz repo owner

    Hi Tristan,

    Seeking forward is possible, as the objects just need to be skipped. Seeking backwards is a problem, as it requires knowing at which file position and/or in which log container the previous object is stored.

    I propose keeping a queue in the form of a std::deque<ObjectHeaderBase>, which can be iterated in both directions. A thread can pump objects from Vector::BLF onto the end of the queue: queue.push_back(object); You can then implement a custom aging strategy that drops old data that is unlikely to be read again: queue.pop_front();
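A minimal sketch of such a queue with a simple size-based aging strategy. The Record struct and the RecordWindow class are hypothetical stand-ins, not part of Vector::BLF, and the locking that a separate reader thread would need is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical stand-in for a decoded BLF object; the real library type differs.
struct Record {
    std::uint64_t timestampNs;
    std::uint32_t objectType;
};

// Sliding window over the most recently read records.
class RecordWindow {
public:
    explicit RecordWindow(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Called by the reader: append the newest record, then age out the oldest
    // one if the window has grown beyond its capacity.
    void push(const Record& record) {
        queue_.push_back(record);
        if (queue_.size() > capacity_) {
            queue_.pop_front();
        }
    }

    std::size_t size() const { return queue_.size(); }

    // Index 0 is the oldest record still held, size() - 1 the newest.
    const Record& at(std::size_t i) const { return queue_.at(i); }

private:
    std::size_t capacity_;
    std::deque<Record> queue_;
};
```

Stepping back to the previous object is then just a smaller index into the window, as long as that object has not been aged out yet.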

    I’m not sure whether this should be implemented in the library; probably just as an example. The reason is that it goes beyond the functionality necessary to read/write Vector::BLF files, and such functionality is too tightly coupled to the application program, especially if that is implemented in other languages, Java, …

    Is this ok?

    Bye
    Tobias

  2. Tristan VOIRON reporter

    Hi Tobias,

    Let me be more precise about my question; it was not clear enough on my side.
    What we are trying to do is record data in BLF format and replay it later. For that we need random access to the file: a way to store positions at recording time, and a function to go to a stored position at playback time.
    So we would need:

    • tellg to store an index (record)
    • seekg to go to that index (replay)
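The record/replay pattern described here can be sketched with the standard iostream tellp/seekg calls on a plain, uncompressed stream. This is only an illustration of the idea, not the Vector::BLF API: the length-prefixed record layout and the helper names writeRecord, writeAll and readRecordAt are assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Record phase: append a length-prefixed payload and return where it starts.
std::streampos writeRecord(std::ostream& out, const std::string& payload) {
    const std::streampos start = out.tellp();
    const std::uint32_t length = static_cast<std::uint32_t>(payload.size());
    out.write(reinterpret_cast<const char*>(&length), sizeof(length));
    out.write(payload.data(), static_cast<std::streamsize>(payload.size()));
    return start;
}

// Record phase: write all payloads and build the position index on the way.
std::vector<std::streampos> writeAll(std::ostream& out,
                                     const std::vector<std::string>& payloads) {
    std::vector<std::streampos> index;
    for (const std::string& payload : payloads) {
        index.push_back(writeRecord(out, payload));
    }
    return index;
}

// Replay phase: jump to a stored position and read the record found there.
std::string readRecordAt(std::istream& in, std::streampos position) {
    in.clear();  // drop a stale eof/fail state before seeking
    in.seekg(position);
    assert(in.good() && "seek failed");
    std::uint32_t length = 0;
    in.read(reinterpret_cast<char*>(&length), sizeof(length));
    std::string payload(length, '\0');
    in.read(&payload[0], static_cast<std::streamsize>(length));
    return payload;
}
```

With the index returned by writeAll, records can be read back in any order, forwards or backwards. In a BLF file the stream is additionally compressed into log containers, so a stored position alone is not enough there.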

    Best regards and thanks for your time,
    Tristan

    EDIT: specified "random access"

  3. Tobias Lorenz repo owner

    Hi Tristan,

    so far so clear.

    Two data structures are required to accomplish record-index-based access:

    1. Each record can have a different length, which is stored in its header. When reading a record, I know at which uncompressed file position it starts. This needs to be remembered in a std::map from record index to uncompressed file position.
    2. The uncompressed stream gets compressed into log containers, which results in the real compressed file stream. So we also need to remember where each log container starts, i.e. a list of log container positions, each with its compressed position and uncompressed position.
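A rough sketch of how these two structures could work together, assuming both have already been filled while reading. The SeekIndex and SeekTarget types and the locate function are hypothetical names for this example and do not exist in the library.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

struct SeekIndex {
    // Structure 1: record index -> start position in the uncompressed stream.
    std::map<std::uint64_t, std::uint64_t> recordToUncompressed;
    // Structure 2: uncompressed start of each log container
    //              -> position of that container in the compressed file.
    std::map<std::uint64_t, std::uint64_t> containerStarts;
};

struct SeekTarget {
    std::uint64_t containerCompressedPos;  // where to start decompressing
    std::uint64_t offsetInContainer;       // bytes to skip after decompressing
};

// Translate a record index into "which container, and how far into it".
// Throws std::out_of_range (from map::at) if the record index is unknown.
SeekTarget locate(const SeekIndex& index, std::uint64_t recordIndex) {
    const std::uint64_t uncompressedPos =
        index.recordToUncompressed.at(recordIndex);
    // upper_bound yields the first container starting after the position;
    // the container just before it is the one holding the record start.
    auto it = index.containerStarts.upper_bound(uncompressedPos);
    assert(it != index.containerStarts.begin());
    --it;
    return SeekTarget{it->second, uncompressedPos - it->first};
}
```

The result only describes where a record starts; a record that continues into the next log container still needs that container to be decompressed as well.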

    Then there is another problem to solve: the current version of the library runs multi-threaded to improve read/write performance for sequential access. One thread takes records and puts them into the uncompressed file stream, and a second thread compresses the stream into the actual file on disk. For a seekg that works with absolute file positions to seek to a specific record, the current multi-threaded implementation is not an ideal starting point. Version 1, the unthreaded version, might be a better starting point for this.

    I have to think about this…

    1. Maybe we can take advantage of the fact that a log container is usually 0x20000 bytes large. This is not guaranteed, but as long as all containers read so far have this size, it’s possible to jump back without much hassle.
    2. Another option is to store the aforementioned maps (see above) in one or more temporary index files, so that they do not occupy too much memory.
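The arithmetic behind the first idea might look like the sketch below, under the assumption that every container really holds 0x20000 uncompressed bytes. The names kContainerSize, ContainerOffset and splitPosition are made up for this example.

```cpp
#include <cstdint>

// Usual uncompressed size of a log container according to the thread;
// not guaranteed, so this only holds while every container has this size.
constexpr std::uint64_t kContainerSize = 0x20000;

struct ContainerOffset {
    std::uint64_t containerNumber;    // 0-based number of the log container
    std::uint64_t offsetInContainer;  // byte offset inside that container
};

// Split an uncompressed stream position into container number and offset.
constexpr ContainerOffset splitPosition(std::uint64_t uncompressedPos) {
    return ContainerOffset{uncompressedPos / kContainerSize,
                           uncompressedPos % kContainerSize};
}
```

This only yields the container number. Since the compressed size of each container varies, finding that container in the file on disk would still need its compressed position from somewhere.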

    Still, I think that an intermediate solution could just read all records into a std::deque. With this, record-based access is possible, albeit with much higher memory consumption.

    Bye
    Tobias

  4. Tristan VOIRON reporter

    Hi Tobias,

    Thank you for your input. Random access to the BLF file while replaying is a good thing for everyone I think.
    I am pretty sure others don't necessarily play a record from beginning to end. As a matter of fact, BLF records for us can contain up to 8 hours of recording for just one session, so storing every header is really not possible in terms of memory footprint and would take too long as well.
    Performing something like file.JumpToTime(3600s) is mandatory for us. The jump could be forwards or backwards, and quite far from the current point. That's why I call it random access. I see two ways of doing this:

    1. The library handles the jump by storing index files. We could call Reset() and SetTime() for example.
    2. The user (me) stores the indexes needed. The library provides two things:
       • a way to get a pointer to the current index (tellg)
       • a way to set an index to the file (seekg)
       I could store indexes during the record phase. Later on, during the replay phase, I could recall those indexes to go to the desired time. I could store an index every second, for example, to build my own time table; I have no problem with that.

    About the two threads, when in read mode (replay), I suggest we wait for all operations to be finished (join or equivalent). Waiting for a second so that every resource is ready is fine and by far faster than iterating inside the whole BLF File! 😉

    Hope this helps.

    Bye,

    Tristan

  5. Tobias Lorenz repo owner

    Hi Tristan,

    so I think this request can be established in two steps:

    1. The library needs to be changed to support seekg to any position.

      1. There needs to be a way to query the file position from which an object was read, either carried in ObjectHeaderBase or via a tellg that works (this just needs to be checked).
      2. seekg support. This requires significant changes to the multi-threaded read-ahead mechanism.
    2. On top of the library, there needs to be an object index.

      1. The object index is a std::vector that stores object type, timestamp, and file position.
      2. Using this object index, a seekg to a particular timestamp can be implemented. So it makes sense to implement this as a class derived from File.
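A minimal sketch of such an object index and a timestamp lookup on top of it. The IndexEntry struct and the findPosition function are hypothetical, and the sketch assumes entries are appended in timestamp order; it stops at returning the file position, since the seekg it would be passed to does not exist in the library yet.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One entry of the object index: object type, timestamp, file position.
struct IndexEntry {
    std::uint32_t objectType;
    std::uint64_t timestampNs;
    std::uint64_t filePosition;
};

// Find the file position of the first object at or after timestampNs.
// Returns false if every indexed object is older than the timestamp.
bool findPosition(const std::vector<IndexEntry>& index,
                  std::uint64_t timestampNs,
                  std::uint64_t& positionOut) {
    // Binary search is only valid on an index sorted by timestamp.
    assert(std::is_sorted(index.begin(), index.end(),
                          [](const IndexEntry& a, const IndexEntry& b) {
                              return a.timestampNs < b.timestampNs;
                          }));
    auto it = std::lower_bound(
        index.begin(), index.end(), timestampNs,
        [](const IndexEntry& entry, std::uint64_t t) {
            return entry.timestampNs < t;
        });
    if (it == index.end()) {
        return false;
    }
    positionOut = it->filePosition;
    return true;
}
```

To keep memory use low for long recordings, the index could hold only one entry per second rather than one per object, at the cost of reading forward a little after each jump.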

    I set the ticket from new to open, but as this is probably 40 hours of my spare time, I cannot promise when I can implement it.

    Bye Tobias
