Author Commit Message Labels Comments Date
Frederic De Groef
mass replace: docstrings stay triple quoted
Frederic De Groef
mass replace: single quoted strings everywhere
Frederic De Groef
merge
Frederic De Groef
[Le Soir] minor cleanups
Frederic De Groef
[Le Soir] don't save the json file, show report instead
Frederic De Groef
updated hgignore
Frederic De Groef
[Le Soir] reorganized the ArticleData class, made a distinction between publication date extracted from the html and download date.
Frederic De Groef
[Le Soir] clear distinction between internal/external urls, tags
Frederic De Groef
[Le Soir] improved sanitization, should probably be unified across all parsers though. Stores paragraphs as a list of strings instead of joining everything
Frederic De Groef
[Le Soir] added first attempt at json serialization
Frederic De Groef
first draft for the crawler
Frederic De Groef
parsers as a python package
Frederic De Groef
[DHNet] cosmetic change
Frederic De Groef
[7sur7] first draft
Frederic De Groef
[Le Soir] rss fetcher only returns the titles, for crosschecking
Frederic De Groef
[Le Soir] fetching article data is its own function. Don't try to clean up Tags that aren't text formatting tags (e.g. <object> w/ embedded youtube movie)
Frederic De Groef
locale settings aren't the same on every platform. hell yeah unix
Frederic De Groef
[7sur7] started
Frederic De Groef
[DHNet] filled up a bunch of docstrings, for superior collaboration
Frederic De Groef
[DHNet] extract & cleanup author name. Removed useless printouts
Frederic De Groef
[DHNet] text content cleanup (handle nice paragraph list as well as pure html garbage directly under the content node). Handle updated pubdate
Frederic De Groef
[Le Soir] code layout
Frederic De Groef
[La Libre] added sample page with more complicated markup to clean up
Frederic De Groef
[DHNet] added sample page with no paragraphs. Trying a unified way to extract paragraphs even without the <p> tag
Frederic De Groef
[DHNet] keep the list of paragraphs
Frederic De Groef
[DHNet] parsing mostly functional, text not always extracted because that stupid cms is broken
Frederic De Groef
[DHNet] fetch list of frontpage stories
Frederic De Groef
refactored make_soup(), optional html entities conversion
Frederic De Groef
updated todolist
Frederic De Groef
[le soir] separate blogposts from actual articles. Keep the (title, url) for blogposts