Installing HtmlAgilityPack
Add the HtmlAgilityPack package from NuGet, for example by running dotnet add package HtmlAgilityPack in the project directory. It has no external dependencies.
Loading a Page
Use HtmlWeb.Load() to download and parse a page, or HtmlDocument.LoadHtml() to parse an existing HTML string.
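The second option, parsing a string you already hold, might look like the sketch below. The HTML fragment is a made-up placeholder, not content from any real site, and the example assumes a project with the HtmlAgilityPack package installed.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical HTML standing in for markup you already have in memory.
var html = "<html><body><h1>Sample catalogue</h1><p>Two books listed.</p></body></html>";

var doc = new HtmlDocument();
doc.LoadHtml(html); // parses the string; no network request is made

// SelectSingleNode returns null when nothing matches, hence the ?. operator.
var heading = doc.DocumentNode.SelectSingleNode("//h1");
Console.WriteLine(heading?.InnerText ?? "(no heading found)");
```

Parsing from a string is also convenient for testing XPath expressions against a saved copy of a page instead of requesting the live site each time.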
Load page
Scrape.cs
using HtmlAgilityPack;

var web = new HtmlWeb();
var doc = web.Load("https://books.toscrape.com/");

// Select all book titles using XPath
var titles = doc.DocumentNode
    .SelectNodes("//article[@class='product_pod']//h3/a");
if (titles is not null)
    foreach (var node in titles)
        Console.WriteLine(node.GetAttributeValue("title", ""));
Extracting Data
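XPath lets you drill into the HTML tree. Use InnerText to read text content and GetAttributeValue to read attributes. SelectNodes returns null rather than an empty collection when nothing matches, so check the result before iterating.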
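To see the difference between the two accessors side by side, here is a small sketch that runs against an inline fragment. The markup is hypothetical, shaped loosely like a single product entry, and the real page's structure may differ.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical fragment; titles, paths and prices are invented for illustration.
var html = @"
<article class='product_pod'>
  <h3><a href='catalogue/sample-book_1/index.html' title='A Sample Book With a Long Title'>A Sample Book ...</a></h3>
  <p class='price_color'>£12.34</p>
</article>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

var link = doc.DocumentNode.SelectSingleNode("//article[@class='product_pod']//h3/a");
var price = doc.DocumentNode.SelectSingleNode("//p[@class='price_color']");

if (link is not null && price is not null)
{
    // GetAttributeValue reads an attribute; the second argument is the fallback.
    var title = link.GetAttributeValue("title", "");
    var href = link.GetAttributeValue("href", "");

    // InnerText reads the text between the tags (the shortened link text here).
    var shown = link.InnerText.Trim();
    var priceText = price.InnerText.Trim();

    Console.WriteLine($"{title} | {href} | {shown} | {priceText}");
}
```

In this fragment the full title lives only in the title attribute, while InnerText yields the shortened link text, which is why the earlier example reads the attribute rather than the text.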
Extract data
ExtractData.cs
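using HtmlAgilityPack;

// Load the page first so this file runs on its own
var web = new HtmlWeb();
var doc = web.Load("https://books.toscrape.com/");

var prices = doc.DocumentNode
    .SelectNodes("//p[@class='price_color']");
if (prices is not null)
    foreach (var p in prices)
        Console.WriteLine(p.InnerText.Trim());
Polite Scraping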
Always add a delay between requests, respect robots.txt, and check the site's terms of service before scraping.
Polite scraping
PoliteScrape.cs
using HtmlAgilityPack;

var web = new HtmlWeb();
for (int page = 1; page <= 5; page++)
{
    var doc = web.Load($"https://books.toscrape.com/catalogue/page-{page}.html");
    // process...
    await Task.Delay(1000); // 1 second pause
    Console.WriteLine($"Page {page} done.");
}
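The robots.txt advice can be approximated with a small check before each request. The sketch below is deliberately naive: it only reads Disallow lines in the group for all crawlers (User-agent: *) and ignores Allow rules, wildcards and crawl delays, so treat it as an illustration and not a complete implementation of the Robots Exclusion Protocol. The robots.txt text and the helper names DisallowedPrefixes and IsAllowed are hypothetical. In practice you would download the file from the site root.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical robots.txt content, inlined so the sketch runs without a network call.
var robotsTxt = @"User-agent: *
Disallow: /private/
Disallow: /checkout

User-agent: SomeOtherBot
Disallow: /";

// Collect Disallow prefixes from the group that applies to all crawlers ("*").
static List<string> DisallowedPrefixes(string robots)
{
    var prefixes = new List<string>();
    var inWildcardGroup = false;
    foreach (var rawLine in robots.Split('\n'))
    {
        var line = rawLine.Split('#')[0].Trim(); // drop comments and whitespace
        var colon = line.IndexOf(':');
        if (colon < 0) continue;                 // blank or malformed line

        var field = line[..colon].Trim().ToLowerInvariant();
        var value = line[(colon + 1)..].Trim();

        if (field == "user-agent")
            inWildcardGroup = value == "*";
        else if (field == "disallow" && inWildcardGroup && value.Length > 0)
            prefixes.Add(value);
    }
    return prefixes;
}

// A path is treated as allowed unless it starts with a disallowed prefix.
bool IsAllowed(string path) =>
    !DisallowedPrefixes(robotsTxt).Exists(p => path.StartsWith(p, StringComparison.Ordinal));

Console.WriteLine(IsAllowed("/catalogue/page-1.html")); // True
Console.WriteLine(IsAllowed("/private/data.html"));     // False
```

A loop like the one above could call IsAllowed on each page path and skip any URL for which it returns false, while keeping the delay between the requests it does make.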