#2472 - scraping issues for computer.org

Issue #2472 resolved

Robert Jäschke created an issue 2015-04-13

Scraping the URL http://www.computer.org/csdl/mags/co/2001/02/r2026.pdf results in a weird error message that shows the content of the page instead of the URL.

Please try to find out why/how this happens.

Comments (6)

Robert Jäschke reporter
- changed status to open
- 2015-04-13T10:12:59+00:00
Mohammed Abed
I tried to reproduce this in a JUnit test. It seems that either the website doesn't export valid BibTeX or our scraper must be improved. Error: scraped BibTeX not valid
- 2015-04-13T15:53:43+00:00
Robert Jäschke reporter
There's the IEEEComputerSocietyScraper that should already handle this page. Why does it not work?
- 2015-04-14T05:59:46+00:00
Mohammed Abed
The getDownloadURL method in the scraper replaces -.* with -reference.bib. That causes a problem when we want to scrape data from the URL http://www.computer.org/csdl/mags/co/2001/02/r2026.pdf, because that URL contains no - suffix, so nothing is replaced. We need to extend getDownloadURL to handle URLs that have the suffix .pdf.

https://bitbucket.org/bibsonomy/bibsonomy/commits/1b4ca543fce5db205e3bd868d51c087d05bb075c?at=bibsonomy-scraper#chg-bibsonomy-scraper/src/main/java/org/bibsonomy/scraper/url/kde/ieee/IEEEComputerSocietyScraper.java
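
For illustration, a minimal sketch of the behavior described above. This is not the actual IEEEComputerSocietyScraper code: the class name DownloadUrlSketch, both method names, and the "-abs.html" example URL are hypothetical. It is also an assumption that the BibTeX export for a .pdf URL lives at the same path with .pdf swapped for -reference.bib.

```java
final class DownloadUrlSketch {

    // Suffix of the BibTeX export, as named in the comment above.
    private static final String BIBTEX_SUFFIX = "-reference.bib";
    private static final String PDF_SUFFIX = ".pdf";

    /**
     * Behavior as described in the issue: replace "-.*" with "-reference.bib".
     * A URL without any "-" is returned unchanged, so the scraper would
     * fetch the original page instead of a BibTeX file.
     */
    static String oldDownloadUrl(String url) {
        return url.replaceFirst("-.*", BIBTEX_SUFFIX);
    }

    /**
     * Hypothetical extension: handle URLs ending in ".pdf" first, then fall
     * back to the old replacement. The target URL pattern for the .pdf case
     * is an assumption, not confirmed by the issue.
     */
    static String newDownloadUrl(String url) {
        if (url.endsWith(PDF_SUFFIX)) {
            String base = url.substring(0, url.length() - PDF_SUFFIX.length());
            return base + BIBTEX_SUFFIX;
        }
        return oldDownloadUrl(url);
    }
}
```

With the URL from this issue, oldDownloadUrl returns the input unchanged because it contains no "-", while newDownloadUrl yields a URL ending in r2026-reference.bib.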
- 2015-04-20T11:17:05+00:00
Robert Jäschke reporter
Thanks. Please see my comments at the corresponding commit.
- 2015-04-21T06:15:19+00:00
Mohammed Abed
- changed status to resolved
Resolved with new test data for the URL with suffix .pdf
- 2015-05-01T15:45:56+00:00

Assignee: Mohammed Abed

Type: bug

Priority: minor

Status: resolved

Component: scraper

Milestone: –

Version: 3.2

Votes: 0

Watchers: 1
