- changed status to open
BibTeXScraper should only return result if really valid BibTeX could be found and extracted
Currently, the BibTeX scraper also returns results when a binary file, e.g., a PDF, is scraped (test this URL). This needs to be changed: the scraper should check, e.g., the MIME type of the returned document and only try to extract information from textual documents (plain text, HTML, etc.). Another option would be to check the extracted BibTeX for valid characters.
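A minimal sketch of the MIME type check suggested above, assuming the scraper has access to the Content-Type header of the response. The class name ContentTypeGuard, the method isTextual, and the list of accepted types are illustrative assumptions, not part of the existing scraper code.

```java
import java.util.Locale;

// Hypothetical helper: the names and the accepted MIME types are assumptions.
final class ContentTypeGuard {

    private ContentTypeGuard() {
    }

    /**
     * Returns true if the given Content-Type header value denotes a textual
     * document that is worth scanning for BibTeX.
     */
    static boolean isTextual(final String contentType) {
        if (contentType == null) {
            return false; // unknown type: be conservative and skip the document
        }
        // Drop parameters such as "; charset=UTF-8" and normalize the case.
        final String mime = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mime.startsWith("text/")
                || mime.equals("application/xhtml+xml")
                || mime.equals("application/xml")
                || mime.equals("application/x-bibtex");
    }
}
```

With such a check, a response of type application/pdf would be skipped before any extraction is attempted. Servers sometimes send wrong or missing headers, so this would probably complement the character check rather than replace it.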
Comments (4)
-
reporter -
Account Deleted I added a regular expression to handle invalid characters
private final static Pattern invalidChar = Pattern.compile("[^\\p{L}\\p{Nd}\\p{Punct}\\p{Space}]+");
If the BibTeX contains an invalid character, the scraper returns null.
-
Account Deleted - changed status to resolved
-
reporter Thanks Haile! Please look carefully at your code and repair it:
1. The BibTeX extraction should only be done when the matcher does not find anything. Thus, parseBibTeX should be called after you have checked for the pattern (and only if nothing was found).
2. The variable hasInvalidChar is not really needed: you can use the result of m.find() directly in the if statement to nest the BibTeX extraction and the remaining code.
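The two review points could be sketched roughly as follows. This is only an illustration under assumptions: the class BibtexCharGuard and the method extract are hypothetical, and parseBibTeX is a placeholder standing in for the scraper's real method, whose signature is not shown in this issue.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper class; only the regular expression is taken from the issue.
final class BibtexCharGuard {

    // Matches any run of characters that is neither a letter, a decimal digit,
    // ASCII punctuation, nor whitespace.
    private static final Pattern INVALID_CHAR =
            Pattern.compile("[^\\p{L}\\p{Nd}\\p{Punct}\\p{Space}]+");

    private BibtexCharGuard() {
    }

    /** Returns the parsed BibTeX, or null if the input contains invalid characters. */
    static String extract(final String candidate) {
        if (candidate == null) {
            return null;
        }
        final Matcher m = INVALID_CHAR.matcher(candidate);
        if (m.find()) {
            return null; // invalid character found: do not parse at all
        }
        // Parsing happens only after the check, and only if nothing was found.
        return parseBibTeX(candidate);
    }

    // Placeholder for the real parser (assumption): here it only trims the input.
    static String parseBibTeX(final String text) {
        return text.trim();
    }
}
```

No separate hasInvalidChar flag is needed because the result of m.find() decides directly whether parsing takes place.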
Please find a way to avoid such cases. A simple and clean solution is probably to check the resulting BibTeX for (in)valid characters.