1. romain flechner
  2. ScrapySharp

Welcome

ScrapySharp includes a web client that can simulate a real web browser (it handles the referrer, cookies, etc.).

HTML parsing should be as natural as possible, so I like to use CSS selectors and LINQ.

This framework wraps HtmlAgilityPack.

Basic examples of CssSelect usage

using System.Linq;
using HtmlAgilityPack;
using ScrapySharp.Extensions;

class Example
{
	// html is the root node of a parsed document (for example HtmlDocument.DocumentNode)
	public void Main(HtmlNode html)
	{
		var divs = html.CssSelect("div"); // all div elements
		var contentDivs = html.CssSelect("div.content"); // all div elements with the css class 'content'
		var monthListDivs = html.CssSelect("div.widget.monthlist"); // all div elements with both css classes
		var pagingNodes = html.CssSelect("#postPaging"); // all HTML elements with the id postPaging
		var pagingDivs = html.CssSelect("div#postPaging.testClass"); // all div elements with the id postPaging and the css class testClass

		var paragraphs = html.CssSelect("div.content > p.para"); // p elements that are direct children of div elements with the css class 'content'

		var loginInputs = html.CssSelect("input[type=text].login"); // text inputs with the css class login
	}
}
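
The examples above assume that an html node already exists. As a minimal sketch, one way to obtain such a node is to parse a string with HtmlAgilityPack and then combine CssSelect with LINQ. The class name, the method name and the sample markup below are hypothetical and only serve as illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;
using ScrapySharp.Extensions;

static class CssSelectSketch
{
	// Hypothetical helper: parses markup and returns the text of every node matching the selector.
	public static List<string> SelectTexts(string markup, string selector)
	{
		var document = new HtmlDocument();
		document.LoadHtml(markup);

		// DocumentNode is the root node that CssSelect is called on.
		HtmlNode html = document.DocumentNode;

		return html.CssSelect(selector)
			.Select(node => node.InnerText.Trim())
			.ToList();
	}
}
```

For instance, CssSelectSketch.SelectTexts(markup, "div.content > p.para") would collect the text of each matching paragraph in markup.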

ScrapySharp can also simulate a web browser

ScrapingBrowser browser = new ScrapingBrowser();

// Set UseDefaultCookiesParser to false if a website returns cookies in an invalid format
//browser.UseDefaultCookiesParser = false;

WebPage homePage = browser.NavigateToPage(new Uri("http://www.bing.com/"));

PageWebForm form = homePage.FindFormById("sb_form");
form["q"] = "scrapysharp";
form.Method = HttpVerb.Get;
WebPage resultsPage = form.Submit();

HtmlNode[] resultsLinks = resultsPage.Html.CssSelect("div.sb_tlst h3 a").ToArray();

WebPage blogPage = resultsPage.FindLinks(By.Text("romcyber blog | Just another WordPress site")).Single().Click();
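
Once links have been selected, their target addresses are often what is needed next. The following is a sketch under assumptions, not part of ScrapySharp itself: LinkSketch and ExtractHrefs are hypothetical names, and the code relies only on CssSelect and on HtmlAgilityPack's GetAttributeValue.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;
using ScrapySharp.Extensions;

static class LinkSketch
{
	// Hypothetical helper: returns the non-empty href values of the nodes matching the selector.
	public static List<string> ExtractHrefs(HtmlNode root, string selector)
	{
		return root.CssSelect(selector)
			// GetAttributeValue returns the fallback (an empty string here) when href is missing.
			.Select(link => link.GetAttributeValue("href", string.Empty))
			.Where(href => href.Length > 0)
			.ToList();
	}
}
```

With the browser example above, this could be called as LinkSketch.ExtractHrefs(resultsPage.Html, "div.sb_tlst h3 a").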

Install ScrapySharp in your project

It's easy to use ScrapySharp in your project.

A NuGet package is available (https://www.nuget.org/packages/ScrapySharp).

To install ScrapySharp, run the following command in the Package Manager Console:

PM> Install-Package ScrapySharp

Have fun!
