Lesson 32 of 35 · Advanced

Web Scraping with HtmlAgilityPack

Web scraping extracts data from web pages programmatically. The HtmlAgilityPack library parses HTML into a document object model (DOM) tree that you can query with XPath or LINQ.

Installing HtmlAgilityPack

Add the HtmlAgilityPack package from NuGet, for example by running dotnet add package HtmlAgilityPack in the project directory. It has no external dependencies.

Loading a Page

Use HtmlWeb.Load() to download and parse a page, or HtmlDocument.LoadHtml() to parse an existing HTML string.

Load page Scrape.cs
using HtmlAgilityPack;

var web  = new HtmlWeb();
var doc  = web.Load("https://books.toscrape.com/");

// Select all book titles using XPath
var titles = doc.DocumentNode
    .SelectNodes("//article[@class='product_pod']//h3/a");

// SelectNodes returns null (not an empty collection) when nothing matches
if (titles is not null)
    foreach (var node in titles)
        Console.WriteLine(node.GetAttributeValue("title", ""));
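
When the HTML is already in memory (a saved file, a test fixture, a response fetched some other way), LoadHtml parses it without any network request. The sketch below assumes a small, made-up snippet shaped like one product card; the title and href values are hypothetical.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical markup, shaped like a single product card on the demo site.
var html = @"
<article class='product_pod'>
  <h3><a href='book-1.html' title='Example Book'>Example ...</a></h3>
</article>";

var doc = new HtmlDocument();
doc.LoadHtml(html); // parses the string; nothing is downloaded

// SelectSingleNode returns the first match, or null when nothing matches.
var link = doc.DocumentNode
    .SelectSingleNode("//article[@class='product_pod']//h3/a");

if (link is not null)
{
    Console.WriteLine(link.GetAttributeValue("title", "")); // Example Book
    Console.WriteLine(link.GetAttributeValue("href", ""));  // book-1.html
}
```

Parsing from a string like this is also a convenient way to try out an XPath expression before pointing it at a live site.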

Extracting Data

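XPath lets you drill into the HTML tree. Use InnerText to read a node's text content and GetAttributeValue to read an attribute, passing a default value to return when the attribute is missing. InnerText does not decode HTML entities such as &amp;, so pass the result through HtmlEntity.DeEntitize when you need the decoded characters.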

Extract data ExtractData.cs
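using HtmlAgilityPack;

var web = new HtmlWeb();
var doc = web.Load("https://books.toscrape.com/");

// Select every price element on the page
var prices = doc.DocumentNode
    .SelectNodes("//p[@class='price_color']");

if (prices is not null)
    foreach (var p in prices)
        Console.WriteLine(p.InnerText.Trim());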
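
The same data can be pulled with LINQ instead of XPath, which is handy when you want each title paired with its price. This is a minimal sketch on made-up markup (the two book entries are hypothetical), using Descendants to walk the tree and HtmlEntity.DeEntitize to decode the price text.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical markup with two product cards; the real page has more fields.
var html = @"
<article class='product_pod'>
  <h3><a href='book-1.html' title='Example Book One'>Example ...</a></h3>
  <p class='price_color'>&pound;10.00</p>
</article>
<article class='product_pod'>
  <h3><a href='book-2.html' title='Example Book Two'>Example ...</a></h3>
  <p class='price_color'>&pound;12.50</p>
</article>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// One anonymous object per product card, built with LINQ over the node tree.
var books = doc.DocumentNode
    .Descendants("article")
    .Where(a => a.GetAttributeValue("class", "") == "product_pod")
    .Select(a => new
    {
        Title = a.Descendants("a").First().GetAttributeValue("title", ""),
        // InnerText leaves entities undecoded, so decode them explicitly.
        Price = HtmlEntity.DeEntitize(
            a.Descendants("p")
             .First(p => p.GetAttributeValue("class", "") == "price_color")
             .InnerText).Trim()
    })
    .ToList();

foreach (var book in books)
    Console.WriteLine($"{book.Title}: {book.Price}");
```

The exact class comparison works here because each element carries a single class; for elements with several classes a contains-style check would be needed.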

Polite Scraping

Always add a delay between requests, respect robots.txt, and check the site's terms of service before scraping.

Polite scraping PoliteScrape.cs
using HtmlAgilityPack;

var web = new HtmlWeb();

for (int page = 1; page <= 5; page++)
{
    var doc = web.Load($"https://books.toscrape.com/catalogue/page-{page}.html");
    // process...
    Console.WriteLine($"Page {page} done.");
    await Task.Delay(1000); // pause 1 second before the next request
}
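
Respecting robots.txt can be approximated with a small path check before each request. The sketch below is deliberately simplified and is not a full parser: it only reads Disallow rules in the User-agent: * group, treats them as plain path prefixes, and ignores Allow, wildcards, and Crawl-delay. The helper names ParseDisallowed and IsAllowed and the sample rules are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical robots.txt content; in practice, download it from the site's /robots.txt.
var robotsTxt = @"
User-agent: *
Disallow: /admin/
Disallow: /private/   # comment after a rule
";

var disallowed = ParseDisallowed(robotsTxt);

Console.WriteLine(IsAllowed("/catalogue/page-1.html", disallowed)); // True
Console.WriteLine(IsAllowed("/admin/login", disallowed));           // False

// Collects the Disallow prefixes that apply to every crawler ("*").
static List<string> ParseDisallowed(string robots)
{
    var rules = new List<string>();
    var appliesToAll = false;

    foreach (var rawLine in robots.Split('\n'))
    {
        var line = rawLine.Split('#')[0].Trim(); // drop comments and whitespace
        var colon = line.IndexOf(':');
        if (colon < 0) continue;                 // blank or malformed line

        var field = line.Substring(0, colon).Trim().ToLowerInvariant();
        var value = line.Substring(colon + 1).Trim();

        if (field == "user-agent")
            appliesToAll = value == "*";
        else if (field == "disallow" && appliesToAll && value.Length > 0)
            rules.Add(value);
    }
    return rules;
}

// A path is allowed when it starts with none of the disallowed prefixes.
static bool IsAllowed(string path, List<string> disallowed) =>
    !disallowed.Any(rule => path.StartsWith(rule, StringComparison.Ordinal));
```

In a scraping loop, a check like IsAllowed would run on each URL's path before the request is made, with disallowed paths skipped.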