Commits

Jakub Wilk committed a09e730

engines.tesseract: don't let fix_html touch CDATA sections that were likely generated by ocrodjvu itself.

Files changed (2)

 ocrodjvu (0.7.12) UNRELEASED; urgency=low
 
-  * 
+  * Don't let “-X fix-html=1” break HTML snippets ocrodjvu generates itself
+    for the “-t chars” Tesseract support.
+    Thanks to Janusz S. Bień for the test case.
 
- -- Jakub Wilk <jwilk@jwilk.net>  Mon, 04 Jun 2012 00:33:59 +0200
+ -- Jakub Wilk <jwilk@jwilk.net>  Wed, 01 Aug 2012 21:27:46 +0200
 
 ocrodjvu (0.7.11) unstable; urgency=low
 

lib/engines/tesseract.py

     regex = re.compile(
         r'''
         ( <[!/]?[a-z]+(?:\s+[^<>]*)?>
+        | <!--.*?-->
+        | (?<= // ) <!\[CDATA\[
+        | (?<= //]] ) >
         | &[a-z]+;
         | &[#][0-9]+;
         | &[#]x[0-9a-f]+;
         | [^<>&]+
         )
-        ''', re.IGNORECASE | re.VERBOSE
+        ''', re.IGNORECASE | re.VERBOSE | re.MULTILINE
     )
     return ''.join(
         chunk if n & 1 else cgi.escape(chunk)
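
Note on the hunk above: the pattern is a single capturing group, so splitting on it yields a list in which odd-indexed items are recognised markup (tags, comments, entities, plain text, and now the CDATA markers) and even-indexed items are stray characters that need escaping. The following is a minimal, self-contained sketch of that technique, not the actual ocrodjvu code. Assumptions: the function name `fix_html_sketch` is hypothetical, the chunks are taken to come from `regex.split()` (the diff excerpt does not show that part), `html.escape(..., quote=False)` stands in for `cgi.escape`, which is no longer available in current Python 3, and the sample input is invented for illustration.

```python
import html
import re

# Same pattern as in the patched hunk. With re.VERBOSE the spaces inside
# the lookbehinds are ignored, so they assert a preceding "//" and "//]]".
_chunk_re = re.compile(
    r'''
    ( <[!/]?[a-z]+(?:\s+[^<>]*)?>
    | <!--.*?-->
    | (?<= // ) <!\[CDATA\[
    | (?<= //]] ) >
    | &[a-z]+;
    | &[#][0-9]+;
    | &[#]x[0-9a-f]+;
    | [^<>&]+
    )
    ''', re.IGNORECASE | re.VERBOSE | re.MULTILINE
)

def fix_html_sketch(text):
    # Hypothetical stand-in for fix_html.
    # re.split() with a capturing group alternates between unmatched text
    # (even indices) and matched chunks (odd indices). Only the unmatched
    # leftovers, such as a lone "<", ">" or "&", are escaped.
    return ''.join(
        chunk if n & 1 else html.escape(chunk, quote=False)
        for n, chunk in enumerate(_chunk_re.split(text))
    )

# Invented sample: a CDATA section wrapped in "//" comment markers.
sample = '<script>//<![CDATA[\nx\n//]]></script>'
print(fix_html_sketch(sample) == sample)
print(fix_html_sketch('a < b & c'))
```

In this sketch the CDATA opener and its closing `>` are matched only when preceded by `//` and `//]]` respectively, so they pass through untouched, while a bare `<` or `&` elsewhere is still escaped.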