Encoding problems with DLibScraper

Issue #25 resolved
Robert Jäschke created an issue

When scraping from D-Lib Magazine, the scraper should decode HTML entities. E.g., the author field should contain Mönnich, Michael and not the entity-encoded form of the name (presumably M&ouml;nnich, Michael).

Please implement the decoding using StringEscapeUtils.unescapeHtml(). To see how other scrapers do it, open the call hierarchy for that method.
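
To illustrate the effect the decoding should have on the author field, here is a minimal, dependency-free sketch. It is only a stand-in: the actual fix should call StringEscapeUtils.unescapeHtml() as requested above. The class name EntityDecodeSketch, the method decode, and the small entity table are hypothetical, and the entity form &ouml; for the example name is an assumption.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the library call; not the scraper's real code.
class EntityDecodeSketch {

    // Tiny illustrative subset of named entities (assumption: enough for this example).
    private static final Map<String, String> NAMED = Map.of(
            "ouml", "\u00f6",
            "auml", "\u00e4",
            "uuml", "\u00fc",
            "szlig", "\u00df",
            "amp", "&");

    // Matches hex (&#xF6;), decimal (&#246;) and named (&ouml;) entities.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[xX][0-9a-fA-F]{1,6}|#\\d{1,7}|[A-Za-z][A-Za-z0-9]*);");

    static String decode(String s) {
        Matcher m = ENTITY.matcher(s);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String body = m.group(1);
            String replacement;
            if (body.startsWith("#x") || body.startsWith("#X")) {
                replacement = fromCodePoint(Integer.parseInt(body.substring(2), 16), m.group());
            } else if (body.startsWith("#")) {
                replacement = fromCodePoint(Integer.parseInt(body.substring(1)), m.group());
            } else {
                // Unknown names are left exactly as they were.
                replacement = NAMED.getOrDefault(body, m.group());
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Returns the character for a valid code point, otherwise the original entity text.
    private static String fromCodePoint(int codePoint, String original) {
        return Character.isValidCodePoint(codePoint)
                ? new String(Character.toChars(codePoint))
                : original;
    }

    public static void main(String[] args) {
        // Assumed raw author value as it might appear in the page source.
        System.out.println(decode("M&ouml;nnich, Michael"));
    }
}
```

In the scraper itself, the same step would presumably be a single call on the extracted author string, e.g. StringEscapeUtils.unescapeHtml(author), where author is a hypothetical variable name.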

Also add a JUnit test for the URL http://www.dlib.org/dlib/may08/monnich/05monnich.html.
