XPath for Web Scraping: Reliable Extraction Patterns
Copy-ready XPath selectors and code snippets for Python (lxml, Scrapy) and JavaScript. Extract links, headlines, prices, and attributes while skipping ads and noise.
Common scraping recipes
Extract all links
//a/@href
Selecting the attribute node returns URL strings directly; filter with starts-with(@href, 'http') to keep only absolute links.
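For example, a minimal lxml sketch of the filtered query (example.com is a placeholder; adapt the URL and filter to your target page):
import requests
from lxml import html

resp = requests.get("https://example.com")  # placeholder URL
doc = html.fromstring(resp.text)
# starts-with(@href, 'http') drops relative paths, fragment anchors, and mailto: links
absolute_links = doc.xpath("//a[starts-with(@href, 'http')]/@href")
print(absolute_links)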
Python + lxml
from lxml import html
import requests

resp = requests.get("https://example.com/articles")
doc = html.fromstring(resp.text)
# text() yields headline strings, @href yields URL strings
titles = doc.xpath("//article//h2/text()")
links = doc.xpath("//article//a/@href")
print(titles, links)
Python + Scrapy
def parse(self, response):
    # Relative paths (.//) keep each query scoped to the current <article>
    for article in response.xpath("//article"):
        yield {
            "title": article.xpath(".//h2/text()").get(),
            "url": article.xpath(".//a/@href").get(),
            "tags": article.xpath(".//a[@class='tag']/text()").getall(),
        }
JavaScript (browser/Node)
import { JSDOM } from "jsdom";
const dom = await JSDOM.fromURL("https://example.com");
const doc = dom.window.document;
// In Node, the XPathResult constants live on the jsdom window, not the global scope
const result = doc.evaluate("//a/@href", doc, null, dom.window.XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
const links = [];
for (let i = 0; i < result.snapshotLength; i++) {
  // textContent of an attribute node is the attribute's value
  links.push(result.snapshotItem(i)?.textContent);
}
console.log(links);
Scraping tips
- Scope selectors to the main content container to avoid nav/footer noise.
- Combine predicates to exclude ads or placeholders: //div[not(contains(@class,'ad'))] (see the sketch after this list).
- Export attributes directly (href, src, data-*) when building datasets.
- Pair XPath with request caching and polite crawl delays; this guide focuses purely on selector quality.
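A sketch combining the first three tips in lxml, assuming the page wraps its content in a <main> element and marks ad blocks with an 'ad' class (both are assumptions; adjust to the real markup):
import requests
from lxml import html

resp = requests.get("https://example.com/articles")  # placeholder URL
doc = html.fromstring(resp.text)
# Scope to <main>, then drop any div whose class mentions 'ad'
blocks = doc.xpath("//main//div[not(contains(@class, 'ad'))]")
# Export text and link attributes straight into a dataset-friendly structure
rows = [{"text": b.text_content().strip(), "links": b.xpath(".//a/@href")} for b in blocks]
print(rows[:3])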
Next, compare XPath with CSS selectors for your stack on the selector comparison page, or jump to the examples library for more scraping-specific patterns.