XPath for Web Scraping: Reliable Extraction Patterns
Copy-ready XPath selectors and code snippets for Python (lxml, Scrapy) and JavaScript. Extract links, headlines, prices, and attributes while skipping ads and noise.
Common scraping recipes
Extract all links
//a/@href
Selecting the attribute node returns URL strings directly; filter with starts-with(@href, 'http') to keep only absolute links.
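For example, a minimal lxml sketch of the filtered query (example.com is a placeholder; adapt the URL and filter to your target page):
import requests
from lxml import html

resp = requests.get("https://example.com")  # placeholder URL
doc = html.fromstring(resp.text)
# starts-with(@href, 'http') drops relative paths, fragment anchors, and mailto: links
absolute_links = doc.xpath("//a[starts-with(@href, 'http')]/@href")
print(absolute_links)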
Python + lxml
from lxml import html
import requests

resp = requests.get("https://example.com/articles")
doc = html.fromstring(resp.text)
# text() yields headline strings, @href yields URL strings
titles = doc.xpath("//article//h2/text()")
links = doc.xpath("//article//a/@href")
print(titles, links)
Python + Scrapy
def parse(self, response):
    # Relative paths (.//) keep each query scoped to the current <article>
    for article in response.xpath("//article"):
        yield {
            "title": article.xpath(".//h2/text()").get(),
            "url": article.xpath(".//a/@href").get(),
            "tags": article.xpath(".//a[@class='tag']/text()").getall(),
        }
JavaScript (browser/Node)
import { JSDOM } from "jsdom";
const dom = await JSDOM.fromURL("https://example.com");
const doc = dom.window.document;
// In Node, the XPathResult constants live on the jsdom window, not the global scope
const result = doc.evaluate("//a/@href", doc, null, dom.window.XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
const links = [];
for (let i = 0; i < result.snapshotLength; i++) {
  // textContent of an attribute node is the attribute's value
  links.push(result.snapshotItem(i)?.textContent);
}
console.log(links);
Scraping tips
- Scope selectors to the main content container to avoid nav/footer noise.
- Combine predicates to exclude ads or placeholders: //div[not(contains(@class,'ad'))] (see the sketch after this list).
- Export attributes directly (href, src, data-*) when building datasets.
- Pair XPath with request caching and polite crawl delays; this guide focuses purely on selector quality.
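A sketch combining the first three tips in lxml, assuming the page wraps its content in a <main> element and marks ad blocks with an 'ad' class (both are assumptions; adjust to the real markup):
import requests
from lxml import html

resp = requests.get("https://example.com/articles")  # placeholder URL
doc = html.fromstring(resp.text)
# Scope to <main>, then drop any div whose class mentions 'ad'
blocks = doc.xpath("//main//div[not(contains(@class, 'ad'))]")
# Export text and link attributes straight into a dataset-friendly structure
rows = [{"text": b.text_content().strip(), "links": b.xpath(".//a/@href")} for b in blocks]
print(rows[:3])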
Next, compare XPath with CSS selectors for your stack on the selector comparison page, or jump to the examples library for more scraping-specific patterns.