Ever had a web scraper break just because a website tweaked its code? It's a common headache, but the XPath contains() function is the secret to building scrapers that don't snap so easily. Think of it less like a strict command and more like a flexible search tool that finds what you need, even when the details are fuzzy.
Why XPath Contains Is Your Scraping Superpower
Imagine you're hunting for a book in a library, but you only remember a few words from the title. You wouldn't search for an exact match; you'd look for any title that contains those words. That's exactly how contains() works in web scraping. It gives you the power to pinpoint HTML elements using only partial matches.
This is a complete game-changer when you're up against modern websites built with JavaScript frameworks like React or Angular. These sites often generate dynamic class names or IDs that change every time the page loads, like class="product-card-_aB3xZ". An exact match would fail in a heartbeat, but contains() lets you lock onto a stable part of the attribute, like 'product-card', making your scraper far more resilient.
A Quick Comparison: Equals vs. Contains
The real magic of contains() becomes clear when you compare it to a direct equals (=) match. They both have their place, but knowing when to use each one is key.
| Matcher | What It Does | Best For | Example Use Case |
| --- | --- | --- | --- |
| = (Equals) | Finds elements where the attribute value is an exact match. | Static attributes that never change, like a stable id. | Locating a login button with id="login-button". |
| contains() | Finds elements where the attribute value includes a specific substring. | Dynamic attributes with changing parts, like auto-generated class names. | Finding a product div where the class contains product-item. |
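To make the comparison concrete, here is a minimal sketch using Python's lxml library. The markup is invented for illustration: the login-button id and the product-item class fragment come from the table, while the random-looking suffixes are assumptions standing in for whatever a real framework might generate.

```python
from lxml import html

# Hypothetical markup: one stable id, two auto-generated class suffixes.
page = html.fromstring("""
<div>
  <button id="login-button">Log in</button>
  <div class="product-item-_aB3xZ">Widget</div>
  <div class="product-item-_Qr7tK">Gadget</div>
</div>
""")

# Exact match works well for the stable id.
login = page.xpath("//button[@id='login-button']")

# Exact match on the class finds nothing: no class equals 'product-item'.
exact = page.xpath("//div[@class='product-item']")

# Partial match locks onto the stable part of the class instead.
partial = page.xpath("//div[contains(@class, 'product-item')]")

print(len(login), len(exact), len(partial))
```

The exact class query comes back empty while the contains() query returns both product elements, which is the trade-off the table describes.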
Using contains() is your go-to strategy for handling the unpredictability of modern web design, while the equals operator is perfect for those rare, stable elements.
Build Resilient Scrapers for Dynamic Websites
The core problem contains() solves is locator fragility. When a scraper depends on a precise, rigid path to an element, the tiniest front-end update can bring it crashing down. By using partial matching, you create selectors that adapt instead of break.
- Handles Dynamic Attributes: Easily finds elements where parts of an attribute (like class or id) are auto-generated.
- Improves Readability: An expression like //div[contains(@class, 'main-content')] is often much clearer than a long, convoluted CSS selector.
- Increases Reliability: Your scrapers keep chugging along even after minor site redesigns or code refactors.
The image below shows how contains() acts like a magnifying glass, helping you zero in on the right element even within a messy HTML structure.
Instead of needing the full, exact class name, contains() lets you find your target using just a consistent piece of it. In the fast-moving world of web scraping (a market set to hit USD 10.2 billion), this kind of flexibility is a must.
In fact, data shows that Python-based scrapers using contains() see a reliability boost of up to 40% on JavaScript-heavy pages. To start building your own powerful scrapers with these techniques, check out our practical guide on how to web scrape with Python.
Understanding The Core Syntax And Logic
At its core, the contains() function is a simple but incredibly handy tool for matching parts of a string. Think of it like a "find" feature in a document. You don't need the whole sentence to find what you're looking for; a unique word or phrase is enough. XPath's contains() works on the same principle.
The most common pattern you'll see out in the wild is:
//tag[contains(@attribute, 'value')]
This might look a bit cryptic at first glance, but it breaks down into a few logical pieces. Let's pull it apart so you can see exactly how it tells the XPath engine what to grab.
Breaking Down The Syntax
Getting a feel for what each part of the contains() expression does is the first step to using it effectively. Every symbol has a specific job.
- //tag: This is your starting point. The // tells XPath to search the entire page for any element that matches the tag you provide (like div, a, button, or * to match any tag).
- [...]: Think of the square brackets as a filter. They let you add a specific condition to narrow down the elements you've found.
- contains(): This is the function doing the heavy lifting. It checks if one string includes another and returns true if it finds a match.
- @attribute: The @ symbol is shorthand for an attribute. This tells the function you want to look inside an attribute like @class, @id, or @href.
- 'value': This is the piece of text you're searching for inside the attribute: the part you know is consistent and reliable.
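The pieces above can be tried out in a few lines. This is a small sketch with lxml, not a definitive recipe; the buttons, the link, and their class names are made up for the demonstration.

```python
from lxml import html

# Made-up markup for the demonstration.
doc = html.fromstring("""
<div>
  <button class="btn primary-action large">Save</button>
  <button class="btn secondary">Cancel</button>
  <a class="primary-action-link" href="/help">Help</a>
</div>
""")

# //tag + [filter] + contains(@attribute, 'value')
buttons = doc.xpath("//button[contains(@class, 'primary-action')]")

# Swapping the tag for * matches any element type, including the <a>.
anything = doc.xpath("//*[contains(@class, 'primary-action')]")

print([b.text for b in buttons])
print([el.tag for el in anything])
```

Note how the wildcard version also picks up the link, because contains() is a plain substring test on the attribute value.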
So, when you put it all together, an expression like //button[contains(@class, 'primary-action')] is just a straightforward instruction: "Find every <button> element anywhere on this page that has a class attribute containing the text 'primary-action'."
Practical Examples In Action
Theory is great, but seeing contains() in action is where it really clicks. Let's say you're scraping a site and come across this chunk of HTML.
Product A
Product B
About Us
If you need to grab only the product links, an exact class match won't work. The classes are "link product-link main" and "link product-link featured", so they're different. But they do share something in common.
Here's how you could use contains() to reliably pick out just the product links:
- To find all product links by class: //a[contains(@class, 'product-link')]
  This expression grabs both "Product A" and "Product B" because the class attribute for both includes the 'product-link' string.
- To find all product links by URL structure: //a[contains(@href, '/products/details')]
  This does the same job but from a different angle. It looks for the consistent /products/details pattern in the href attribute, neatly filtering out the "About Us" link.
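Here is how those two expressions might behave in practice, sketched with lxml. The markup is a guess at what such a page could look like: only the class values and the /products/details path come from the description above, and the exact URLs are placeholders. The third query is an extra, stricter variant (not part of the original walkthrough) that matches product-link only as a whole class token, which helps when a plain substring match would also catch names like product-link-disabled.

```python
from lxml import html

# Assumed markup; the hrefs are placeholders.
links = html.fromstring("""
<div>
  <a class="link product-link main" href="/products/details/a">Product A</a>
  <a class="link product-link featured" href="/products/details/b">Product B</a>
  <a class="link" href="/about">About Us</a>
</div>
""")

# Match on a stable fragment of the class attribute.
by_class = links.xpath("//a[contains(@class, 'product-link')]/text()")

# Match on a stable fragment of the URL.
by_href = links.xpath("//a[contains(@href, '/products/details')]/text()")

# Stricter: pad the class list with spaces and match the whole token.
by_token = links.xpath(
    "//a[contains(concat(' ', normalize-space(@class), ' '), ' product-link ')]/text()"
)

print(by_class, by_href, by_token)
```

All three queries return the two product links and skip "About Us".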
Targeting Dynamic Text Content With Precision
Attributes are great, but what about the actual text you see on the page? The contains() function is a game-changer for finding elements based on the words they hold. It's perfect for when you need to grab specific content, like scraping product reviews that mention "battery life" or finding all news articles that reference a particular company.
The syntax for matching text is pretty similar to what we saw with attributes, but with one key difference. Instead of pointing to an attribute like @class, you'll use the text() node test.
For instance, //p[contains(text(), 'limited time offer')] will zero in on any paragraph that has the phrase "limited time offer" somewhere in its direct text.
The Dot Versus text(): A Crucial Distinction
A common trip-up for many developers is the difference between contains(text(), 'value') and contains(., 'value'). They look almost identical, but how they behave is completely different. Picking the right one is critical for building a scraper that doesn't break.
- text(): This node test is very literal. It only looks at text that sits directly inside the node you've selected, ignoring any text nested within child elements.
- . (the dot): This little guy represents the string value of the current node. That means it grabs the text from the node itself plus all its descendants, like children and grandchildren.
Let's make this crystal clear with an example. Imagine you’re looking at this HTML for a product listing.
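As a rough sketch, suppose the listing looks like the snippet below. The markup is an assumption, pieced together from the sentence fragments discussed in this section, and lxml is used to run both expressions side by side.

```python
from lxml import html

# Assumed markup: the key phrase sits inside a child <strong> element.
listing = html.fromstring(
    "<div>This is a <strong>great product</strong> for all users.</div>"
)

# text() only sees the div's own text nodes; in XPath 1.0, contains()
# tests the first of them ("This is a "), so the phrase is not found.
direct = listing.xpath("//div[contains(text(), 'great product')]")

# The dot uses the full string value of the div, children included.
full = listing.xpath("//div[contains(., 'great product')]")

print(len(direct), len(full))
print(full[0].text_content())
```
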
If you try //div[contains(text(), 'great product')], your scraper will come up empty. Why? Because the <div> itself only contains "This is a " and " for all users." as its direct text, and in XPath 1.0 contains() only checks the first of those text nodes. The phrase "great product" is tucked away inside a child <strong> tag.
Now, if you use //div[contains(., 'great product')], it's a direct hit. The dot . pulls all the text from the div and its children into one string: "This is a great product for all users.", which definitely contains your phrase.
This flexibility is what makes text-based scraping so incredibly useful. It's a huge reason why developers love XPath's contains() function, which helps drive a web scraping software market projected to hit USD 3.323 billion by 2025. In fact, industry data shows contains() can boost your match success by 28% on average. That is a massive improvement, especially since so many scrapers rely on proxies and APIs like Scrappey's to get around site defenses. You can read more about these trends in web scraping software growth.
Extracting Data From Attributes Like A Pro
While contains() is great for text, its real power in web scraping shines when you're up against messy HTML attributes. Modern websites, especially those built with fancy frameworks, love to auto-generate attribute values. You'll often see things like class="card-item-XyZ123 featured-product".
Trying to write a scraper that matches that class name exactly is a recipe for disaster. The moment a front-end developer pushes a minor update, your scraper breaks. This is where contains() becomes your best friend, letting you cut right through the noise. Instead of matching the entire messy string, you just zero in on a stable, predictable part.
Advanced Attribute Targeting Strategies
Let's look beyond simple class names. The contains() function can target any attribute, which opens up some seriously powerful ways to filter and pull data. This is a game-changer for building things like e-commerce price trackers or inventory scrapers.
Think about these common situations:
- Finding images from a specific folder: You can snag all product images stored in a certain directory by targeting the src attribute. An expression like //img[contains(@src, '/products/high-res/')] will find all images whose source URL has that folder path.
- Isolating links to a subdomain: To grab every link pointing to a company's blog, you could use //a[contains(@href, 'blog.ecommercesite.com')]. This neatly filters out all the other internal and external links on the page.
- Targeting elements by partial ID: For elements with dynamic IDs like id="session-user-9f8a7b", you can lock onto the static part with //div[contains(@id, 'session-user-')].
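The three situations above can be sketched in one short lxml script. Treat it as an illustration under stated assumptions: the page is a placeholder, and every file path, URL, and element beyond the fragments quoted in the list is invented.

```python
from lxml import html

# Placeholder page; file names and full URLs are illustrative only.
page = html.fromstring("""
<div>
  <img src="/assets/products/high-res/lamp.jpg" alt="Lamp">
  <img src="/assets/icons/cart.png" alt="Cart">
  <a href="https://blog.ecommercesite.com/new-arrivals">Blog post</a>
  <a href="https://www.ecommercesite.com/cart">Cart</a>
  <div id="session-user-9f8a7b">Signed in</div>
</div>
""")

# Images stored under a specific folder.
image_urls = page.xpath("//img[contains(@src, '/products/high-res/')]/@src")

# Links that point at the blog subdomain.
blog_links = page.xpath("//a[contains(@href, 'blog.ecommercesite.com')]/@href")

# Elements whose id carries a stable prefix plus a dynamic suffix.
session_divs = page.xpath("//div[contains(@id, 'session-user-')]")

print(image_urls, blog_links, [d.text for d in session_divs])
```

Appending /@src or /@href to the expression returns the attribute values directly, which saves a second lookup when the URL itself is the data you want.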
A Practical Example with Python and lxml
Let's say you're building a price tracker. Here’s how you could use this technique with Python's lxml library to grab all elements marked as "featured products," even with those unpredictable class names.
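from lxml import html
# Sample HTML with dynamic class names (illustrative markup with placeholder products)
html_content = """
<div class="catalog">
  <div class="card-item-xyz">Product A - 19.99</div>
  <div class="card-item-pqr featured-product">Product B - 39.99</div>
</div>
"""
tree = html.fromstring(html_content)
# Use contains() to find all featured products
featured_products = tree.xpath("//div[contains(@class, 'featured-product')]")
for product in featured_products:
    print(product.text_content().strip())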
This code snippet reliably finds and prints "Product B - 39.99". It completely ignores the random xyz and pqr junk at the end of the class names. This approach makes sure your scraper keeps on working, even if the front-end team decides to change how they generate those dynamic class names.
Putting It All Together In A Real-World Project
Okay, enough with the theory. Let's get our hands dirty and see how the contains() function performs in a real-world scenario. This is where you really start to see its power.
We're going to tackle a common problem that trips up a lot of scrapers: scraping an e-commerce site with intentionally messy and dynamic class names. We'll build a Python script using the fantastic lxml library to parse the HTML and pull out product names and prices. The secret sauce? A well-crafted XPath expression using contains() that grabs the data we need, no matter how much the class names try to throw us off.
Simulating a Dynamic E-Commerce Site
Imagine you're trying to scrape a product grid like this one. Take a look at the HTML below. Notice how each product div starts with a base class of product-card but then tacks on a random, auto-generated string like -ax89c or -fg45r.
If you tried to write a selector with an exact match, like //div[@class='product-card-ax89c'], your scraper would break the instant that suffix changed. This is exactly the kind of mess the contains() function was made to clean up.
Building the Python Scraper
First things first, you'll need lxml. If it's not already on your machine, a quick pip install lxml will get you set up.
Now, let's write the code. We'll parse the HTML and reliably extract our data. For more in-depth code walkthroughs, check out our collection of Python scraping examples.
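from lxml import html
# Our sample HTML from the e-commerce site (illustrative markup; names and prices are placeholders)
html_content = """
<div class="product-grid">
  <div class="product-card-ax89c">
    <h2 class="product-name">Product A</h2>
    <span class="price-main-text">19.99</span>
  </div>
  <div class="product-card-fg45r">
    <h2 class="product-name">Product B</h2>
    <span class="price-main-text">39.99</span>
  </div>
</div>
"""
# Parse the HTML content
tree = html.fromstring(html_content)
# Find all product containers using contains()
product_cards = tree.xpath("//div[contains(@class, 'product-card-')]")
# Loop through each card and extract data
for card in product_cards:
    name = card.xpath(".//h2[@class='product-name']/text()")[0]
    price = card.xpath(".//span[contains(@class, 'price-main-text')]/text()")[0]
    print(f"Product: {name}, Price: {price}")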
This script hunts down every div whose class attribute contains the string 'product-card-'. It then dives into each one to pull out the product name and price. It's clean, resilient, and totally unfazed by those dynamic suffixes.
The process is visualized perfectly in the flow chart below. We target the attribute, find the constant part of the string, and extract the data we need.
This simple, three-step pattern is what makes contains() so powerful. It turns a frustrating matching problem into a reliable, repeatable solution.
Common Pitfalls And How To Avoid Them
Even the most useful tools have their quirks, and the contains() function is no exception. While it's fantastic for building flexible scrapers, a few common pitfalls can trip you up, leading to frustrating bugs and sluggish performance.
Getting ahead of these issues will save you hours of debugging down the road. Let's walk through the most frequent headaches developers run into and how to solve them.
Pitfall 1: The Case-Sensitivity Trap
One of the most common "gotchas" with contains() is that it's case-sensitive. A search for contains(text(), 'product') is going to completely miss text like "Product" or "PRODUCT". This tiny detail can cause your scraper to miss huge chunks of data you thought you were targeting.
Problem: Your XPath expression isn't finding text because of simple capitalization differences.
Solution: The trick is to use the translate() function. It lets you force the element's text into the same case as your search string, usually lowercase. This little hack instantly makes your search case-insensitive.
//div[contains(translate(text(), 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'product')]
This expression converts all uppercase letters in the element's text to lowercase before it checks for a match. Now, "Product", "product", and "PRODUCT" will all be found without a problem.
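A quick way to see the difference is to run both versions against the same snippet. The sketch below uses lxml with a made-up list; the UPPER and LOWER constants are just helper names introduced here to keep the expression readable.

```python
from lxml import html

# Helper constants (names chosen for this example) for translate().
UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = "abcdefghijklmnopqrstuvwxyz"

# Made-up list with mixed capitalization.
doc = html.fromstring("""
<ul>
  <li>Product one</li>
  <li>PRODUCT two</li>
  <li>product three</li>
  <li>Accessory</li>
</ul>
""")

# Case-sensitive: only the all-lowercase item matches.
sensitive = doc.xpath("//li[contains(text(), 'product')]")

# Case-insensitive: lowercase the text first, then search for a lowercase term.
insensitive = doc.xpath(
    f"//li[contains(translate(text(), '{UPPER}', '{LOWER}'), 'product')]"
)

print(len(sensitive), len(insensitive))
```

Keep the search term itself in lowercase; translate() only normalizes the element's text, so an uppercase term would never match.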
Pitfall 2: The Performance Killer
The contains() function is incredibly flexible, but that flexibility can come at a steep performance cost if you're not careful. Firing off a very broad expression can dramatically slow down your scraper, especially on large, complex web pages.
Problem: Your scraper is running slow or even timing out because of an inefficient XPath.
Solution: Get as specific as you can, as early as you can. Avoid using the wildcard //* unless you have absolutely no other choice. Tying your contains() filter to a specific tag, and if possible another attribute, cuts down the number of nodes the engine has to test. For example, prefer //a[contains(@class, 'product-link')] over //*[contains(@class, 'product-link')].
Frequently Asked Questions
As you start weaving contains() into your web scraping projects, a few questions tend to crop up. Let's get you some clear, straightforward answers to the ones we hear most often.
Is XPath Contains() Case-Sensitive?
Yes, the standard XPath 1.0 contains() function is strictly case-sensitive. A search for 'Product' will not match 'product', which can be a real headache.
To work around this, you can normalize the case by using the translate() function. This nifty trick converts all the text to lowercase before matching, which makes your search case-insensitive.
For example:
//div[contains(translate(text(), 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'product')]
Can I Use Contains() With Multiple Conditions?
Absolutely. You can easily chain multiple contains() checks together using the logical operators 'and' and 'or'. This is how you build incredibly specific and powerful locators.
Let's say you need to find a div element where the class contains 'product' and its text also contains 'Sale'. Your expression would look like this:
//div[contains(@class, 'product') and contains(text(), 'Sale')]
When Should I Use starts-with() Instead of contains()?
You'll want to reach for starts-with() when you know the beginning of an attribute's value is stable, but the end is dynamic. Think of something like id="session-xyz123", which //div[starts-with(@id, 'session-')] would match.
In contrast, contains() is your best bet when a keyword could pop up anywhere in the string. While a function like ends-with() does exist in XPath 2.0, it isn't supported by most browser developer tools or by lxml, which implement XPath 1.0, making contains() a far more reliable and universally compatible choice.
Tired of fragile selectors and constant scraper maintenance? Scrappey handles dynamic websites with ease, combining rotating proxies and headless browsers so you can focus on data, not debugging. Get your structured data reliably by visiting https://scrappey.com.
