SAX and DOM approach to data

When we define the structure of a binary file, we can consider it to be very similar to an XML document.

Once XML comes to mind, it is only a short step to the SAX (Simple API for XML) and DOM (Document Object Model) technologies.

DOM

In classical XML, DOM simply loads the whole XML document into memory, where the content can be easily accessed via dot notation.

<body>
     <p>
      some text here
     </p>
     ...
</body>

For instance, the content of p can be accessed like document.body.p.getValue().
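As a point of comparison, here is a minimal sketch of the same access using Python's standard `xml.dom.minidom` module. It is only an illustration: the standard DOM API uses method calls such as `getElementsByTagName` rather than the dot notation shown above, and the variable names `document`, `p` and `text` are chosen here for the example.

```python
from xml.dom.minidom import parseString

# The whole document is parsed and held in memory at once.
document = parseString("<body><p>some text here</p></body>")

# Find the first <p> element and read the text node it contains.
p = document.getElementsByTagName("p")[0]
text = p.firstChild.data
print(text)
```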

With binary files we can use the same approach. However, if the binary files are really large, say several gigabytes, loading them would take too much time, interpreting them would be slow, and common computers would not have enough memory. Therefore we load only part of the binary file into memory.

The element that allows us to load only part of the file into memory is Array.

 struct MyFile: 
     dword numberOfRecords
     array LargeRecordTable:
        size: numberOfRecords
        dword RecOffset
        struct LargeRecord:
           offset: RecOffset
           string Name:
               size: 20
           ... #many members... large binary structure (50 kB to 5 MB)

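MyFile is a simple description of a possibly very large file. The array is not loaded as a whole; loadIndex selects the element to load, after which its members can be accessed.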

 MyFile.LargeRecordTable.loadIndex(5)  // load only element 5 into memory
 print MyFile.LargeRecordTable.LargeRecord.Name  // prints the name of element 5
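The idea of loading one array element at a time can be sketched in Python as follows. This is not the data2l API. The class `LazyRecordTable`, the helper `build_sample`, and the byte layout are assumptions made for illustration: a dword is taken to be a 4-byte little-endian unsigned integer, the RecOffset entries are assumed to follow numberOfRecords back to back, and each offset is assumed to be absolute from the start of the file.

```python
import io
import struct

NAME_SIZE = 20  # size of the Name string in the description above


class LazyRecordTable:
    """Hypothetical stand-in for MyFile.LargeRecordTable."""

    def __init__(self, stream):
        self.stream = stream
        stream.seek(0)
        # Assumption: dword = 4-byte little-endian unsigned integer.
        (self.number_of_records,) = struct.unpack("<I", stream.read(4))
        self.name = None  # Name of the currently loaded record, if any

    def load_index(self, index):
        """Read only the record at the given index; the rest stays on disk."""
        if not 0 <= index < self.number_of_records:
            raise IndexError(index)
        # Assumption: the RecOffset entries directly follow numberOfRecords.
        self.stream.seek(4 + 4 * index)
        (rec_offset,) = struct.unpack("<I", self.stream.read(4))
        # Assumption: RecOffset is counted from the start of the file.
        self.stream.seek(rec_offset)
        raw = self.stream.read(NAME_SIZE)
        self.name = raw.rstrip(b"\0").decode("ascii")


def build_sample(names):
    """Build a tiny in-memory file in the assumed layout, for the demo."""
    header_size = 4 + 4 * len(names)
    offsets = [header_size + i * NAME_SIZE for i in range(len(names))]
    data = struct.pack("<I", len(names))
    data += b"".join(struct.pack("<I", off) for off in offsets)
    data += b"".join(n.encode("ascii").ljust(NAME_SIZE, b"\0") for n in names)
    return io.BytesIO(data)


table = LazyRecordTable(build_sample(["rec%d" % i for i in range(8)]))
table.load_index(5)
print(table.name)  # prints "rec5"
```

Only the header dword, one offset and one record are read per `load_index` call, so memory use does not depend on the total file size.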

SAX

SAX is an event-driven interface. The binary file is read according to its definition in a top-down (depth-first) manner, and the content is reported through callbacks. We'll use the same example as in the DOM section: a data file defined by the following description.

 struct MyFile: 
     dword numberOfRecords
     array LargeRecordTable:
        size: numberOfRecords
        dword RecOffset
        struct LargeRecord:
            offset: RecOffset
            string Name:
                size: 20
            ... #many members... large binary structure (50 kB to 5 MB)

If an instance of MyFile has numberOfRecords equal to 2, the following sequence of callbacks occurs during SAX parsing.

beginElement(MyFile)
beginElement(LargeRecordTable)
beginElement(LargeRecord)
endElement  (LargeRecord)
beginElement(LargeRecord)
endElement  (LargeRecord)
endElement  (LargeRecordTable)
endElement  (MyFile)
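The callback sequence above can be reproduced with a small Python sketch. It is not the data2l implementation. The function `parse_events` and the byte layout are assumptions for illustration: dwords are 4-byte little-endian unsigned integers, the RecOffset entries follow numberOfRecords directly, and offsets are absolute. As in the listing above, only structs and arrays are reported, not the individual dword members.

```python
import io
import struct


def parse_events(stream, begin_element, end_element):
    """Walk the assumed MyFile layout depth first and report events."""
    begin_element("MyFile")
    stream.seek(0)
    (number_of_records,) = struct.unpack("<I", stream.read(4))
    begin_element("LargeRecordTable")
    for index in range(number_of_records):
        # Read this element's RecOffset, then jump to the record itself.
        stream.seek(4 + 4 * index)
        (rec_offset,) = struct.unpack("<I", stream.read(4))
        stream.seek(rec_offset)
        begin_element("LargeRecord")
        # A fuller parser would read and report the record members here.
        end_element("LargeRecord")
    end_element("LargeRecordTable")
    end_element("MyFile")


# Sample file: numberOfRecords = 2, two offsets, two 20-byte records.
sample = io.BytesIO(struct.pack("<III", 2, 12, 32) + bytes(40))

events = []
parse_events(
    sample,
    lambda name: events.append("beginElement(%s)" % name),
    lambda name: events.append("endElement(%s)" % name),
)
print("\n".join(events))
```

Because each record is reported and then dropped, the parser never needs to hold more than one record in memory.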
