Commits

Author Commit Message Labels Comments Date
Frederic De Groef
fix links extraction from sudpresse frontpage
Branches
v0.4-maintenance
Frederic De Groef
[DHNet] some error handling for when the website is broken
Tags
v0.4.2
Frederic De Groef
Added tag v0.4.1 for changeset 21aae99327f3
Frederic De Groef
JSON MODULE, Y U NO SERIALIZE SETS
Tags
v0.4.1
Frederic De Groef
Added tag v0.4.0 for changeset c457cad767c9
Frederic De Groef
[SudPresse] using a utility func
Tags
v0.4.0
Frederic De Groef
added rtlinfo as provider for the crawler. All providers now return two lists (news items and possible blogposts)
Frederic De Groef
[DHNet] clean up formatting for in-text links
Frederic De Groef
[RTLInfo] added some actual documentation info
Frederic De Groef
[RTLInfo] but what is a usable link?
Frederic De Groef
[RTLInfo] extract intro, embedded videos. Detects when a frontpage url redirects to some internal blogpost, and discards it.
Frederic De Groef
Removed tag before-big-changes
Frederic De Groef
[RTLInfo] detect and extract plaintext urls. Tags all embedded links.
Frederic De Groef
[RTLInfo] extract and cleanup text content
Frederic De Groef
[RTLInfo] extract associated links
Frederic De Groef
[RTLInfo] extract external links
Frederic De Groef
added a util func to cleanup a whole collection of html fragments
Frederic De Groef
[RTLInfo] detect and extract 'video headline' on frontpage
Frederic De Groef
[RTLInfo] extract title
Frederic De Groef
[RTLInfo] extract category and date
Frederic De Groef
more stuff to ignore
Frederic De Groef
[RTLInfo] extract headlines in "modules". Separate blogposts and actual news items
Frederic De Groef
[RTLInfo] fetching the (easy) frontpage stories
Frederic De Groef
[Sud Presse] Better sanitization of link titles
Frederic De Groef
added missing formatting tags
Frederic De Groef
fixed the cropped yticks in figures
Frederic De Groef
Added tag v0.3.0 for changeset 0e58dd7ea6b2
Frederic De Groef
[Sud Presse] remove formatting on in-text links
Tags
v0.3.0
Frederic De Groef
[Sud Presse] Tagging ghost links
Frederic De Groef
[DHNet] tag lists are tag sets