



DotnetSpider is a .NET Standard web crawling library similar to WebMagic and Scrapy. It is a lightweight, efficient, and fast high-level web crawling and scraping framework for .NET.





  • Store data in MySQL. Download MySQL

      grant all on *.* to 'root'@'localhost' IDENTIFIED BY '' with grant option;
      flush privileges;
  • Run a distributed crawler. Download Redis for Windows

  • SQL Server

  • PostgreSQL

  • MongoDB

  • Cassandra



Please see the DotnetSpider.Sample project in the solution.


Basic usage code

ADDITIONAL USAGE: Configurable Entity Spider

View complete code

public class JdSkuSampleSpider : EntitySpider
{
	public JdSkuSampleSpider() : base("JdSkuSample", new Site())
	{
	}

	protected override void MyInit(params string[] arguments)
	{
		Identity = Identity ?? "JD SKU SAMPLE";
		// Store data in MySQL. The MySQL entity pipeline is the default, so this line can be commented out. Do not omit SslMode.
		AddPipeline(new MySqlEntityPipeline("Database='mysql';Data Source=localhost;User ID=root;Password=;Port=3306;SslMode=None;"));
		// The extra values are exposed to the entity as environment values; "手机" means "mobile phone".
		AddStartUrl(",653,655&page=2&JL=6_0_0&ms=5#J_main", new Dictionary<string, object> { { "name", "手机" }, { "cat3", "655" } });
	}
}

[EntityTable("test", "jd_sku", EntityTable.Monday, Indexs = new[] { "Category" }, Uniques = new[] { "Category,Sku", "Sku" })]
[EntitySelector(Expression = "//li[@class='gl-item']/div[contains(@class,'j-sku-item')]")]
[TargetUrlsSelector(XPaths = new[] { "//span[@class=\"p-num\"]" }, Patterns = new[] { @"&page=[0-9]+&" })]
public class Product : SpiderEntity
{
	[PropertyDefine(Expression = "./@data-sku", Length = 100)]
	public string Sku { get; set; }

	// Taken from the "name" value passed to AddStartUrl.
	[PropertyDefine(Expression = "name", Type = SelectorType.Enviroment, Length = 100)]
	public string Category { get; set; }

	// Taken from the "cat3" value passed to AddStartUrl.
	[PropertyDefine(Expression = "cat3", Type = SelectorType.Enviroment)]
	public int CategoryId { get; set; }

	[PropertyDefine(Expression = "./div[1]/a/@href")]
	public string Url { get; set; }

	[PropertyDefine(Expression = "./div[5]/strong/a")]
	public long CommentsCount { get; set; }

	[PropertyDefine(Expression = ".//div[@class='p-shop']/@data-shop_name", Length = 100)]
	public string ShopName { get; set; }

	[PropertyDefine(Expression = "0", Type = SelectorType.Enviroment)]
	public int ShopId { get; set; }

	[PropertyDefine(Expression = ".//div[@class='p-name']/a/em", Length = 100)]
	public string Name { get; set; }

	[PropertyDefine(Expression = "./@venderid", Length = 100)]
	public string VenderId { get; set; }

	[PropertyDefine(Expression = "./@jdzy_shop_id", Length = 100)]
	public string JdzyShopId { get; set; }

	[PropertyDefine(Expression = "Monday", Type = SelectorType.Enviroment)]
	public DateTime RunId { get; set; }
}

public static void Main()
{
	Startup.Run(new string[] { "-s:JdSkuSampleSpider", "-tid:JdSkuSampleSpider", "-i:guid" });
}

Run via Startup

Command: -s:[spider type name | TaskName attribute] -i:[identity] -a:[arg1,arg2...] -tid:[taskId] -n:[name] -c:[configuration file path]
  1. -s: Type name of the spider or its TaskName attribute, for example: DotnetSpider.Sample.BaiduSearchSpider
  2. -i: Set the identity.
  3. -a: Pass arguments to the spider's Run method.
  4. -tid: Set the task id.
  5. -n: Set the name.
  6. -c: Set the configuration file path, for example to run with a customized configuration.
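
Each option in the command line above has the form -key:value. The following is only a minimal sketch of how such tokens break down into keys and values. OptionSketch and its Parse method are hypothetical names made up for this illustration. They are not part of DotnetSpider and do not claim to reproduce how Startup parses its arguments.

```csharp
using System;
using System.Collections.Generic;

public static class OptionSketch
{
    // Hypothetical helper, NOT DotnetSpider's own parser:
    // splits each "-key:value" token at its first colon.
    public static Dictionary<string, string> Parse(string[] args)
    {
        var options = new Dictionary<string, string>();
        foreach (var arg in args)
        {
            // Skip tokens that do not start with a dash.
            if (string.IsNullOrEmpty(arg) || arg[0] != '-') continue;

            // Skip tokens with no colon or with an empty key.
            int colon = arg.IndexOf(':');
            if (colon < 2) continue;

            string key = arg.Substring(1, colon - 1);
            // Everything after the first colon is the value,
            // so a path such as C:\app.config stays intact.
            string value = arg.Substring(colon + 1);
            options[key] = value;
        }
        return options;
    }

    public static void Main()
    {
        var options = Parse(new[]
        {
            "-s:JdSkuSampleSpider", "-tid:JdSkuSampleSpider", "-i:guid", "-a:arg1,arg2"
        });
        foreach (var pair in options)
        {
            Console.WriteLine($"{pair.Key} = {pair.Value}");
        }
    }
}
```

With DotnetSpider itself, the same tokens are passed as a string array to Startup.Run, as in the Main method of the sample above.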

WebDriver Support

To crawl a page whose content is loaded by JavaScript, you only need to set the downloader to WebDriverDownloader.

Downloader = new WebDriverDownloader(Browser.Chrome);

See a complete sample


  1. Make sure there is a ChromeDriver.exe in the bin folder when you use Chrome. You can add it to your project via the NuGet package manager: Chromium.ChromeDriver
  2. Make sure you have already added a *.webdriver Firefox profile when you use Firefox:
  3. Make sure there is a PhantomJS.exe in the bin folder when you use PhantomJS. You can add it to your project via the NuGet package manager: PhantomJS

Store logs and status in a database

  1. Set SystemConnection in app.config.
  2. Update nlog.config accordingly.

Web Manager

  1. Depends on a CI platform; for example, I currently use gitlab-ci.
  2. Depends on Scheduler.NET.
  3. More documentation to come...



When you use the Redis scheduler, please update your Redis config:

timeout 0 
tcp-keepalive 60


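  • Table and column names defined by EntitySpider are all lowercased, to allow for conversion between different databases or for switching MySQL between Windows and Linux.
  • A spider can now be run without adding a Pipeline.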

Buy me a coffee


QQ Group: 477731655 Email: [email protected]