csxj-crawler/csxj/datasources/dhnet.py

Author              Commit Message
Juliette De Maeyer  deleted useless comment
Juliette De Maeyer  Merge
Juliette De Maeyer  fixed first error from reprocessing: url in embedded video (kewego) player
Frederic De Groef   [dhnet] enhanced embedded media detection (esp. for scripts)
Frederic De Groef   [dhnet] be more defensive for embedded media detection. handles twitter widgets
Frederic De Groef   better tags
Frederic De Groef   fixed text cleanup in dhnet, so we keep paragraphs
Frederic De Groef   moar unicode required
Frederic De Groef   detect links with no target, classify and tag them
Frederic De Groef   detect if an article has an introduction. Use the unified html cleanup fund.
Frederic De Groef   [dhnet] detect embedded content
Frederic De Groef   remove formatting in links
Frederic De Groef   using the constants
Frederic De Groef   more tree reorganization (moved url tagging functions to csxj.common.tagging)
Frederic De Groef   added module-level constants for source names and titles
Frederic De Groef   new top level module name