Why is the DublinCore Scraper not working for SCIRP?

Issue #1895 resolved
Robert Jäschke created an issue

Try

http://www.scirp.org/journal/PaperInformation.aspx?PaperID=37807

Although the web page contains Dublin Core Metadata, the scraper is not working. Which fields are missing or are not extracted?

Comments (6)

  1. Former user Account Deleted

    The problem was in the DublinCoreToBibtexConverter class: the regular expression for the DC prefix matched only when the prefix was written in capital letters.

    Pattern.compile("(?im)<\\s*meta(?=[^>]*lang=\"([^\"]*)\")?(?=[^>]*content=\"([^\"]*)\")[^>]*name=\"(?-i)DC(?i).([^\"]*)\"[^>]*>");
    

    I changed it to:

    Pattern.compile("(?im)<\\s*meta(?=[^>]*lang=\"([^\"]*)\")?(?=[^>]*content=\"([^\"]*)\")[^>]*name=\"(?-i)[D|d][C|c](?i).([^\"]*)\"[^>]*>");
    
  2. Robert Jäschke reporter

    Suggestion: simplify and use

    Pattern.compile("(?im)<\\s*meta(?=[^>]*lang=\"([^\"]*)\")?(?=[^>]*content=\"([^\"]*)\")[^>]*name=\"(DC|dc).([^\"]*)\"[^>]*>");
    
  3. Former user Account Deleted

    I modified it a bit because the above expression did not work.

    "(?im)<\\s*meta(?=[^>]*lang=\"([^\"]*)\")?(?=[^>]*content=\"([^\"]*)\")[^>]*name=\"[D|d][C|c].([^\"]*)\"[^>]*>"
    
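
The thread does not say why the `(DC|dc)` variant failed. One plausible reason, assuming the converter reads the field name from group 3 (not confirmed here), is that `(DC|dc)` is itself a capturing group, so the field name moves to group 4. The sketch below illustrates this with a hypothetical class `DcMetaPatternDemo` and a hypothetical `extract` helper; it is not the BibSonomy code. It also shows two side notes: with `(?i)` active, a plain `dc` already matches any case, and inside a character class such as `[D|d]` the `|` is a literal character, not an alternation.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the actual scraper.
final class DcMetaPatternDemo {

    // Same group layout as the original pattern:
    // 1 = lang (optional), 2 = content, 3 = field name after the "dc." prefix.
    // (?i) makes "dc" case-insensitive; the dot is escaped to match a literal ".".
    static final Pattern DC_META = Pattern.compile(
        "(?im)<\\s*meta(?=[^>]*lang=\"([^\"]*)\")?(?=[^>]*content=\"([^\"]*)\")"
        + "[^>]*name=\"dc\\.([^\"]*)\"[^>]*>");

    // The variant suggested in the thread: (DC|dc) adds a capturing group,
    // so the field name ends up in group 4 instead of group 3.
    static final Pattern SUGGESTED = Pattern.compile(
        "(?im)<\\s*meta(?=[^>]*lang=\"([^\"]*)\")?(?=[^>]*content=\"([^\"]*)\")"
        + "[^>]*name=\"(DC|dc).([^\"]*)\"[^>]*>");

    // Collects field name -> content for every DC meta tag found in the HTML.
    static Map<String, String> extract(String html) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = DC_META.matcher(html);
        while (m.find()) {
            fields.put(m.group(3).toLowerCase(Locale.ROOT), m.group(2));
        }
        return fields;
    }

    public static void main(String[] args) {
        String html = "<meta name=\"dc.title\" content=\"Some Title\" />\n"
                    + "<meta name=\"DC.creator\" content=\"Jane Doe\">";
        System.out.println(extract(html));

        // Group counts differ between the two patterns.
        System.out.println(DC_META.matcher("").groupCount());
        System.out.println(SUGGESTED.matcher("").groupCount());

        // "|" inside a character class is literal, so "|c" is accepted here.
        System.out.println(Pattern.matches("[D|d][C|c]", "|c"));
    }
}
```

If the calling code must keep reading group 3, a non-capturing group `(?:DC|dc)` or simply relying on `(?i)` keeps the numbering unchanged.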