BibTeXScraper should only return a result if valid BibTeX could actually be found and extracted

Issue #2102 resolved
Robert Jäschke created an issue

Currently, the BibTeX scraper also returns results when a binary file, e.g., a PDF, is scraped (test this URL). This needs to be changed: the scraper should check, e.g., the MIME type of the returned document and only try to extract information from text, HTML, and similar files. Another option would be to check the extracted BibTeX for invalid characters.
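
A minimal sketch of the MIME type check suggested above, assuming the scraper can read the Content-Type header of the response. The class name ContentTypeCheck, the method isTextual, and the allow-list of types are hypothetical and are not taken from the BibTeXScraper code.

```java
import java.util.Locale;
import java.util.Set;

final class ContentTypeCheck {
    // Hypothetical allow-list of types outside "text/*" that may still carry BibTeX.
    private static final Set<String> TEXTUAL_TYPES = Set.of(
            "application/xhtml+xml", "application/xml", "application/x-bibtex");

    /** Returns true if the Content-Type value names a textual document. */
    static boolean isTextual(String contentType) {
        if (contentType == null) {
            return false; // no header: do not attempt extraction
        }
        // Strip parameters such as "; charset=UTF-8" before comparing.
        int semicolon = contentType.indexOf(';');
        String mime = (semicolon >= 0 ? contentType.substring(0, semicolon) : contentType)
                .trim()
                .toLowerCase(Locale.ROOT);
        return mime.startsWith("text/") || TEXTUAL_TYPES.contains(mime);
    }
}
```

With such a guard, a response served as application/pdf would be skipped before any extraction is attempted. Servers can report wrong types, so this check would complement a character check rather than replace it.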

Comments (4)

  1. Robert Jäschke reporter
    • changed status to open

    Please find a way to avoid such cases. A simple and clean solution is probably to check the resulting BibTeX for (in)valid characters.

  2. Former user Account Deleted

    I added a regular expression to detect invalid characters:

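    // Matches any run of characters that are not letters, decimal digits, ASCII punctuation, or whitespace.
    private static final Pattern invalidChar = Pattern.compile("[^\\p{L}\\p{Nd}\\p{Punct}\\p{Space}]+");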

    If the BibTeX contains an invalid character, the scraper returns null.

  3. Robert Jäschke reporter

    Thanks Haile! Please look carefully at your code and repair it: 1. The BibTeX extraction should only be done when the matcher does not find anything. Thus, parseBibTeX should be called after you have checked for the pattern (and only if nothing was found). 2. The variable hasInvalidChar is not really needed: you can use the if(m.find()) check directly to nest the BibTeX extraction and the remaining code.
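
The control flow requested in this comment could look roughly like the sketch below. It is an illustration only: the class BibTexScraperSketch, the method scrape, and the body of parseBibTeX are assumptions, since the real parseBibTeX and the surrounding scraper code are not shown in this issue. Only the pattern is taken from the thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BibTexScraperSketch {
    // Pattern from the comment above: anything that is not a letter,
    // decimal digit, ASCII punctuation, or whitespace counts as invalid.
    private static final Pattern INVALID_CHAR =
            Pattern.compile("[^\\p{L}\\p{Nd}\\p{Punct}\\p{Space}]+");

    // Hypothetical stand-in for the real parseBibTeX: returns everything
    // from the first '@' onwards, or null if there is no '@'.
    static String parseBibTeX(String text) {
        int start = text.indexOf('@');
        return start >= 0 ? text.substring(start).trim() : null;
    }

    /** Returns the extracted BibTeX, or null if the content looks binary. */
    static String scrape(String pageContent) {
        if (pageContent == null) {
            return null;
        }
        Matcher m = INVALID_CHAR.matcher(pageContent);
        // Extract only when the matcher finds no invalid character;
        // no separate hasInvalidChar flag is needed.
        if (!m.find()) {
            return parseBibTeX(pageContent);
        }
        return null;
    }
}
```

Checking the pattern first means the extraction never runs on binary content, which is the ordering the comment asks for.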
