Issue #9 resolved

"printable" regex in hachoir

bra
created an issue

I'm trying to use hachoir as a binary file splitter for any kind of file, in order to deduplicate the resulting chunks. The method would be the following: use hachoir to mark the start (and possibly the end) of structures in the binary files, checksum them, and store them. With the checksums it is possible to deduplicate the files on binary structure boundaries rather than block boundaries (hopefully more effectively). It is far from perfect when, for example, parsing Debian ISOs and then the same content that the ISO holds, but file by file. Upon investigating, it seems it would be good to distinguish plain text files in the binary stream. Currently a plain text file inside an ISO gets saved with some binary junk at its start (and/or end), whereas when I parse the same file directly it gets saved 1:1, so the two copies can't be deduplicated against each other.
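The splitting idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the boundary offsets are supplied by hand rather than produced by hachoir, and `split_at_boundaries`, `store_chunks` and the in-memory `store` dict are hypothetical names, not part of any hachoir API.

```python
import hashlib

def split_at_boundaries(data, boundaries):
    """Cut data at the given offsets; offsets 0 and len(data) are implied."""
    inner = [b for b in boundaries if 0 < b < len(data)]
    cuts = sorted(set([0, len(data)] + inner))
    return [data[start:end] for start, end in zip(cuts, cuts[1:])]

def store_chunks(data, boundaries, store):
    """Store each chunk under its SHA-256 digest and return the digest list."""
    digests = []
    for chunk in split_at_boundaries(data, boundaries):
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # an identical chunk is stored only once
        digests.append(digest)
    return digests

# A text file on its own, and the same bytes embedded in a made-up container.
text = b"hello, this is a plain text file\n"
container = b"\x00\x01HEADER" + text + b"\xffTRAILER"

store = {}
standalone_recipe = store_chunks(text, [], store)
start = container.index(text)  # stand-in for a boundary a parser would report
container_recipe = store_chunks(container, [start, start + len(text)], store)
```

When the boundaries line up exactly with the embedded file, the standalone copy and the embedded copy hash to the same digest and share one stored chunk. If the reported boundary is off by even a few bytes of binary junk, the digests differ, which is the problem described above.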

So I'm trying to write a simple plain-text parser, which would start with the following magic_regex:

    MAGIC = '[%s]{16,128}' % re.escape(string.printable)

My problem is that hachoir dies with the following traceback:

    Traceback (most recent call last):
      File "/usr/local/bin/hachoir-subfile", line 105, in <module>
        main()
      File "/usr/local/bin/hachoir-subfile", line 101, in main
        ok = runSearch(subfile, values)
      File "/usr/local/bin/hachoir-subfile", line 77, in runSearch
        subfile.loadParsers(categories=categories, parser_ids=parsers)
      File "/usr/local/lib/python2.6/site-packages/hachoir_subfile/search.py", line 69, in loadParsers
        self.patterns = PatternMatching(categories, parser_ids)
      File "/usr/local/lib/python2.6/site-packages/hachoir_subfile/pattern.py", line 23, in __init__
        self.addRegex(regex, (offset, parser))
      File "/usr/local/lib/python2.6/site-packages/hachoir_regex/pattern.py", line 117, in addRegex
        item = RegexPattern(regex, user)
      File "/usr/local/lib/python2.6/site-packages/hachoir_regex/pattern.py", line 31, in __init__
        self.regex = parse(regex)
      File "/usr/local/lib/python2.6/site-packages/hachoir_regex/parser.py", line 181, in parse
        regex, index = _parse(text)
      File "/usr/local/lib/python2.6/site-packages/hachoir_regex/parser.py", line 150, in _parse
        raise SyntaxError("Operator '\%s' is not supported" % char)
    SyntaxError: Operator '_' is not supported

Any ideas on that?

Thanks,

Comments (5)

  1. bra reporter

    I just did a hg clone and this is what I got:

    hachoir-subfile --debug text.py 
    [0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ\!\"\#\$\%\&\'\(\)\*\+\,\-\.\/\:\;\<\=\>\?\@\[\\\]\^\_\`\{\|\}\~\ \	\
    \
     \
      ]{4,128}
    Regex compilation: 61.5 ms
    Use regex: (Rar!\x1a\a\0|F(AT(32   |1(2   |6   ))|LV\1(\5\0\0\0\t|\1\0\0\0\t)|WS[\1-\t]|ORM.{4}AIF[CF])|A(VI LIST|CONanih)|WAVEfmt |C(DDAfmt |WS[\1-\t])|M(SCF|Thd|ARC|M\0\*|Z.[\0\1].{4}[^\0-\3])|8BPS\0\1|\xd4\xc3\xb2\xa1|ustar  \0|\xebR\x90NTFS    |\1(CD001|fcp|\0\t\0\0\3)|\x1aE\xdf\xa3|L\0\0\0\1\x14\2\0\0\0\0\0\xc0\0\0\0\0\0\0F|\.(snd|ra\xfd|RMF\0\0\0\x12\0\1)|gimp xcf (file\0|v002\0)|\xd7\xcd\xc6\x9a\0\0| EMF\0\0|\0\0(\t\0\0\3|[\1\2]\0[\1-\x14].(\x10\x10|  |00|@@)[\0\x10]\0[\0\1\4][\0\b\x18 ]\0)|\xed\xab\xee\xdb|Extended Module: |\xff\xd8\xff[\xe0\xe1\xee]|\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1|0&\xb2u\x8ef\xcf\x11\xa6\xd9\0\xaa\0b\xcel|G(nomeKeyring\n\r\0\n|IF8(7a|9a))|bplist00|!<arch>\n|I(TSF\3\0\0\0|I\*\0)|m(hbd|oov)|%PDF-|7z\xbc\xaf'\x1c|d8:announce|S(\xef(\1\0|\2\0|\4\0)|WAP(-SPACE|SPACE2)|1SUSPEND\0)|\x7fELF|B(Zh|M.{4}.{8}[\x0c(l]\0{3})|fLaC\0|\x89PNG\r\n\x1a\n|OggS|PK\3\4|[\t-\r -~]{4,128}|\x1f\x8b\b.{5}[\0\2\4\6][\0-\r]|\0{16}.{24}_FVH)
    [+] Start search on 2369 bytes (2369 bytes)
    
    

    I'm not sure that distilling the long regexp (printed first in the output above) into [\t-\r -~]{4,128} is correct, but it's possible I'm overlooking something (I've just started playing with these great tools).

    BTW, I no longer get the previous error, so that's fixed; I'm just not sure the generated regexp is correct.

    Thanks!

  2. bra reporter

    Any ideas on that? It doesn't look like the generated and the source regexps are on par. Original:

    [0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ\!\"\#\$\%\&\'\(\)\*\+\,\-\.\/\:\;\<\=\>\?\@\[\\\]\^\_\`\{\|\}\~\ \	\
    \
     \
      ]{4,128}
    
    

    Generated:

    [\t-\r -~]{4,128}
    
  3. Victor Stinner repo owner
    >>> import string
    >>> x=map(ord, string.printable)
    >>> x.sort()
    >>> x
    [9, 10, 11, 12, 13, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 
    45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62,
     63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80,
     81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 
    99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 
    113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126]
    

    So string.printable contains characters 9..13 and 32..126, which are '\t'..'\r' and ' '..'~'. So [\t-\r -~]{4,128} is correct.

    I didn't really understand your idea of matching file start/end in an ISO image, with binary and text files.

    At least, it looks like the hachoir-regex bug is fixed.
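
The equivalence discussed in this thread can also be checked mechanically. The sketch below uses Python 3 and the standard-library `re` module, not hachoir-regex, so it only confirms that the two character classes accept the same characters. It says nothing about how hachoir-regex arrives at the shorter form. The helper name `matched_codes` is made up for this example.

```python
import re
import string

# Character class built the way the report builds MAGIC (length bounds dropped).
escaped_class = re.compile('[%s]' % re.escape(string.printable))
# Range form printed by hachoir-subfile --debug.
range_class = re.compile(r'[\t-\r -~]')

def matched_codes(pattern):
    """Return the sorted code points in 0..255 that the class accepts."""
    return [code for code in range(256) if pattern.fullmatch(chr(code))]

escaped_codes = matched_codes(escaped_class)
range_codes = matched_codes(range_class)
# The two ranges named in the thread: 9..13 and 32..126.
expected_codes = list(range(9, 14)) + list(range(32, 127))

same_set = escaped_codes == range_codes == expected_codes
print(same_set)
```

Comparing the accepted code points directly avoids reasoning about escaping rules by hand: if both classes accept exactly 9..13 and 32..126, the script prints True.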
