Frederic De Groef / csxj-crawler

Commits

Author Message
Juliette De Maeyer
[7sur7] when making ArticleData, avoid confusion between "source" (a currently unused function that extracts the source mentioned in an article) and "source" (the input we feed to the function that extracts ArticleData).
Juliette De Maeyer
[lalibre] [tests] added a test for embedded tweets
Juliette De Maeyer
[lalibre] [tests] added test for plaintext links
Juliette De Maeyer
[dhnet] plaintext links are now also tagged 'in text' (that is, when they are in the text…) [tests] added a test for plaintext links
Juliette De Maeyer
[tests] [tagging] updated test to reflect recent changes ('internal' added to links that are tagged 'internal site')
Juliette De Maeyer
[tagging] in the main function that classifies and tags urls: 'internal sites' are now also tagged 'internal'
Frederic De Groef
[sudinfo] don't classify in-text clickable links as plaintext links. This only happened when the link text was the same as the link target.
Frederic De Groef
[tests] text cleanup with link removal
Frederic De Groef
improved formatting when presenting mismatching link lists
Frederic De Groef
[utils] added a utility func to remove links when removing markup from BeautifulSoup blobs
Frederic De Groef
[sudinfo] don't add meaningless spaces when rejoining fragments. Added a test for content extraction.
Frederic De Groef
[sudinfo] removed sample_data files that are now used inside a test
Frederic De Groef
Merge
Juliette De Maeyer
[sudpresse] [tests] test in text link extraction and tagging
Juliette De Maeyer
[rossel_utils] added Sudpresse sites to the same owner list
Juliette De Maeyer
[sudpresse] enhanced plaintext / in text detection and tagging. Even added a test about that.
Juliette De Maeyer
[lesoir_new] typo
Juliette De Maeyer
[sudpresse] fake plaintext and true plaintext examples
Juliette De Maeyer
[lavenir] [tests] added a test for bottom box link extraction and tagging
Juliette De Maeyer
[lavenir] added a function that extracts bottom links ("lire aussi")
Frederic De Groef
[sudinfo] iframes in the text are no longer processed by the plaintext url extractor. However, they are processed separately.
Frederic De Groef
[tests] using nose for all the test suites.
Frederic De Groef
configurable pretty printing
Frederic De Groef
Merge
Juliette De Maeyer
[tests] [sudinfo] added test for in text /sidebar links that are also to 'same owner' websites
Frederic De Groef
Frederic De Groef
Merge
Juliette De Maeyer
[tests] [sudinfo] added a 'no links' test
Juliette De Maeyer
[tests] [sudinfo] added test for in text link extraction
Juliette De Maeyer
[sudinfo] trying to figure out if the "Medias" box contains nothing but images
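
Several of the commits above deal with telling in-text clickable links apart from plaintext links, including the [sudinfo] fix for links whose text is the same as their target. The following is a minimal sketch of that distinction, not the crawler's actual code: it uses only Python's standard library rather than BeautifulSoup, and the names `separate_links`, `URL_PATTERN` and `_LinkSeparator` are hypothetical. The assumption is that a URL counts as plaintext only when it appears outside any `<a>` element.

```python
import re
from html.parser import HTMLParser

# Hypothetical, deliberately simple URL pattern (an assumption for this sketch).
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


class _LinkSeparator(HTMLParser):
    """Collects hrefs of <a> tags and the text found outside of them."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0      # > 0 while we are inside an <a> element
        self.in_text_links = []    # hrefs of clickable links
        self.loose_text = []       # text fragments outside any <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1
            href = dict(attrs).get("href")
            if href:
                self.in_text_links.append(href)

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        # Text inside a clickable link is skipped, so a link whose text
        # equals its target is never counted as a plaintext URL.
        if self.anchor_depth == 0:
            self.loose_text.append(data)


def separate_links(html_fragment):
    """Return (in_text_links, plaintext_links) for an HTML fragment."""
    parser = _LinkSeparator()
    parser.feed(html_fragment)
    parser.close()
    plaintext_links = URL_PATTERN.findall(" ".join(parser.loose_text))
    return parser.in_text_links, plaintext_links
```

A possible usage, with made-up example URLs:

```python
fragment = (
    '<p>See <a href="http://example.com/a">http://example.com/a</a> '
    "and also http://example.org/b for more</p>"
)
in_text, plaintext = separate_links(fragment)
```

Here the first URL would be reported only as an in-text link, and the second only as a plaintext link.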