#1895 - Why is the DublinCore Scraper not working for SCIRP?

Issue #1895 resolved

Robert Jäschke created an issue 2013-10-21

Try

http://www.scirp.org/journal/PaperInformation.aspx?PaperID=37807

Although the web page contains Dublin Core Metadata, the scraper is not working. Which fields are missing or are not extracted?

Comments (6)

Daniel Zoller
- assigned issue to Haile Misgna
- changed component to scraper
- changed version to 2.0.43
- edited description

2014-02-21T17:06:43+00:00

Daniel Zoller
- changed status to open

2014-02-21T17:06:47+00:00

Former user Account Deleted

The problem was in the DublinCoreToBibtexConverter class: the regular expression that recognizes the DC prefix matched it only when it was written in capital letters.

Pattern.compile("(?im)<\\s*meta(?=[^>]*lang=\"([^\"]*)\")?(?=[^>]*content=\"([^\"]*)\")[^>]*name=\"(?-i)DC(?i).([^\"]*)\"[^>]*>");

I changed it to

Pattern.compile("(?im)<\\s*meta(?=[^>]*lang=\"([^\"]*)\")?(?=[^>]*content=\"([^\"]*)\")[^>]*name=\"(?-i)[D|d][C|c](?i).([^\"]*)\"[^>]*>");

2014-03-17T13:47:26+00:00
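
To illustrate the difference between the two expressions above, here is a minimal sketch. The class name `DcPrefixDemo`, the helper `key`, and the sample meta tags are made up for illustration; they are not taken from the BibSonomy code or from the SCIRP page. It also shows a side effect of writing `[D|d]`: inside a character class `|` is a literal character, so the modified pattern accepts a `|` in place of the letter as well.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DcPrefixDemo {
    // Original expression from the comment: "DC" is matched case-sensitively via (?-i).
    static final Pattern ORIGINAL = Pattern.compile(
        "(?im)<\\s*meta(?=[^>]*lang=\"([^\"]*)\")?(?=[^>]*content=\"([^\"]*)\")[^>]*name=\"(?-i)DC(?i).([^\"]*)\"[^>]*>");

    // Modified expression from the comment: character classes for each letter.
    static final Pattern MODIFIED = Pattern.compile(
        "(?im)<\\s*meta(?=[^>]*lang=\"([^\"]*)\")?(?=[^>]*content=\"([^\"]*)\")[^>]*name=\"(?-i)[D|d][C|c](?i).([^\"]*)\"[^>]*>");

    // Hypothetical helper: returns group 3 (the part after the prefix) or null if nothing matches.
    static String key(Pattern p, String html) {
        Matcher m = p.matcher(html);
        return m.find() ? m.group(3) : null;
    }

    public static void main(String[] args) {
        // Made-up sample tags, not copied from the SCIRP page.
        String upper = "<meta name=\"DC.title\" content=\"Example Title\">";
        String lower = "<meta name=\"dc.title\" content=\"Example Title\">";
        String pipe  = "<meta name=\"|c.title\" content=\"Example Title\">";

        System.out.println(key(ORIGINAL, upper)); // title
        System.out.println(key(ORIGINAL, lower)); // null
        System.out.println(key(MODIFIED, lower)); // title
        System.out.println(key(MODIFIED, pipe));  // title (the "|" in [D|d] is literal)
    }
}
```

If the stray `|` matters, `[Dd][Cc]` would express the same intent without it.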

Robert Jäschke reporter

Suggestion: simplify and use

Pattern.compile("(?im)<\\s*meta(?=[^>]*lang=\"([^\"]*)\")?(?=[^>]*content=\"([^\"]*)\")[^>]*name=\"(DC|dc).([^\"]*)\"[^>]*>");

2014-03-17T14:41:59+00:00

Former user Account Deleted

I modified it a bit because the above expression did not work.

"(?im)<\\s*meta(?=[^>]*lang=\"([^\"]*)\")?(?=[^>]*content=\"([^\"]*)\")[^>]*name=\"[D|d][C|c].([^\"]*)\"[^>]*>"

2014-03-17T18:23:45+00:00
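
The thread does not say why the `(DC|dc)` suggestion failed. One possible cause, which is only an assumption here, is that the parentheses add a capturing group, so the part after the prefix moves from group 3 to group 4. If the converter reads group 3, it would then see the prefix instead. The sketch below shows the shift and a non-capturing variant `(?:DC|dc)` that keeps the numbering; the class name `GroupShiftDemo`, the helper `group`, and the sample tag are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupShiftDemo {
    // Suggested expression: (DC|dc) is a capturing group and becomes group 3.
    static final Pattern CAPTURING = Pattern.compile(
        "(?im)<\\s*meta(?=[^>]*lang=\"([^\"]*)\")?(?=[^>]*content=\"([^\"]*)\")[^>]*name=\"(DC|dc).([^\"]*)\"[^>]*>");

    // Variant with a non-capturing group: the group numbering stays as in the original.
    static final Pattern NON_CAPTURING = Pattern.compile(
        "(?im)<\\s*meta(?=[^>]*lang=\"([^\"]*)\")?(?=[^>]*content=\"([^\"]*)\")[^>]*name=\"(?:DC|dc).([^\"]*)\"[^>]*>");

    // Hypothetical helper: returns the requested group of the first match, or null.
    static String group(Pattern p, String html, int index) {
        Matcher m = p.matcher(html);
        return m.find() ? m.group(index) : null;
    }

    public static void main(String[] args) {
        // Made-up sample tag, not copied from the SCIRP page.
        String tag = "<meta name=\"dc.title\" content=\"Example Title\">";

        System.out.println(group(CAPTURING, tag, 3));     // dc
        System.out.println(group(CAPTURING, tag, 4));     // title
        System.out.println(group(NON_CAPTURING, tag, 3)); // title
    }
}
```

Since all of these expressions start with `(?im)`, the match is already case-insensitive unless `(?-i)` switches that off, so a plain `DC` without `(?-i)` would also accept `dc`.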

Former user Account Deleted
- changed status to resolved

It is resolved.

2014-03-17T18:25:03+00:00

Assignee: –

Type: enhancement

Priority: major

Status: resolved

Component: scraper

Milestone: –

Version: 2.0.43

Votes: 0

Watchers: 1
