Commits

David Larlet committed 4f98b62

Handle encoding issues when the charset is set only in the HTML's meta tag, fixes #1

  • Parent commits 10bf72b

Files changed (1)

File src/browser.py

 from whoosh.index import create_in, open_dir, EmptyIndexError
 
 strip_tags_re = re.compile(r'</?\S([^=]*=(\s*"[^"]*"|\s*\'[^\']*\'|\S*)|[^>])*?>', re.IGNORECASE)
+meta_encoding_re = re.compile(r'<meta.*?charset=([^"\']+)', re.IGNORECASE)
 
 
 def strip_tags(content):
 
         # Retrieves the resource and turns it into a Readability doc
         response = requests.get(url)
+        if response.encoding == 'ISO-8859-1':
+            # By default, the fallback of the content-type text/html
+            # is ISO-8859-1, so in that case we double check that the
+            # encoding is not set in HTML's dedicated meta, see
+            # http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.7.1
+            # Warning: response.text MUST be reevaluated
+            encoding = re.findall(meta_encoding_re, response.text)
+            if encoding:
+                response.encoding = encoding[0] or response.encoding
         document = BrowserDocument(response.text)
 
         # Explicitly parse the HTML to be able to rewrite links
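
The pattern added in this commit expects the charset value to follow `charset=` directly, so it matches the older `content="text/html; charset=..."` form but not the quoted HTML5 short form `<meta charset="utf-8">`. The following is a minimal sketch of a stricter variant, not part of the commit: the names `META_CHARSET_RE` and `sniff_meta_charset` are hypothetical, and the pattern is an assumption about which `<meta>` spellings are worth supporting.

```python
import re

# Hypothetical variant of the commit's meta_encoding_re (not in the commit).
# It allows an optional quote after "charset=" and stops the captured value
# at quotes, whitespace, ';', '/' or '>'.
META_CHARSET_RE = re.compile(
    r'<meta[^>]*?charset\s*=\s*["\']?\s*([^"\'\s;/>]+)',
    re.IGNORECASE,
)


def sniff_meta_charset(html, default=None):
    """Return the charset declared in the first matching <meta> tag.

    Falls back to `default` when no declaration is found.
    """
    match = META_CHARSET_RE.search(html)
    return match.group(1) if match else default


if __name__ == "__main__":
    # HTML5 short form, quoted value.
    print(sniff_meta_charset('<meta charset="utf-8">'))
    # Older http-equiv form, value embedded in the content attribute.
    print(sniff_meta_charset(
        '<meta http-equiv="Content-Type" '
        'content="text/html; charset=ISO-8859-15">'))
    # No declaration at all: the supplied default is returned.
    print(sniff_meta_charset('<p>no meta here</p>', default='ISO-8859-1'))
```

Under these assumptions, the `re.findall` call in the diff could be swapped for `sniff_meta_charset(response.text, response.encoding)`, with the result assigned back to `response.encoding` before `response.text` is read again.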