GetOData

Xpath for Web Scraping: Complete 2024 Guide

What is Xpath and how is it useful in Web Scraping

XPath (XML Path Language) is a query language for locating and navigating nodes in XML and HTML documents.

With an XPath expression you can pick specific elements out of a large HTML page and then run operations such as data extraction or web automation on them.

It's more powerful than CSS selectors because you can move both up and down the DOM, and it offers more features for selecting an element precisely.

HTML Web Page Structure

Consider the HTML page structure below. Let's look at the best ways to locate its elements using XPath:

<html>
    <head>
        <title>GetOData Xpath Tutorial</title>
    </head>
    <body>
        <h2>Xpath is the perfect way to locate elements</h2>
        <div>
            <p class='toolInfo'>GetOData is the best tool for Web scraping.</p>
            <p id='tool_link_text'>Check it out here 
                <a href='https://www.getodata.com'>GetOData</a>
            </p>
        </div>
        <ul id='tool_features'>
            <li class='features_list' feature_identifier="proxy">Proxy rotation</li>
            <li class='features_list'>Unblocking API</li>
            <li class='features_list'>AI Based Parsing</li>
            <li class='features_list'>Javascript Execution</li>    
        </ul>    
    </body>
</html>

To select the p elements from the above HTML Structure, the syntax is:

//p

This returns all the p elements present on the page.

The same syntax works for any other tag name, for example:

//div
//span

The drawback is that these expressions return every element with that tag on the page rather than one specific element.

To get specific elements, we can use their attributes, such as class and ID, as shown in the next section.

XPath through Attributes like Class & ID

To select an element with a specific class or ID, you can use the following syntax:
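To try the tag-based expressions above, here is a minimal sketch using Python's standard-library `xml.etree.ElementTree`. That choice is an assumption made for illustration: ElementTree supports only a subset of XPath, and it works here because the sample page happens to be well-formed XML. The `SAMPLE` string is a trimmed copy of the page above.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample page. It is well-formed, so an XML parser can read it.
SAMPLE = """<html>
  <body>
    <h2>Xpath is the perfect way to locate elements</h2>
    <div>
      <p class='toolInfo'>GetOData is the best tool for Web scraping.</p>
      <p id='tool_link_text'>Check it out here <a href='https://www.getodata.com'>GetOData</a></p>
    </div>
  </body>
</html>"""

root = ET.fromstring(SAMPLE)

# ElementTree paths are relative to the element findall() is called on,
# so the document-wide //p is written as .//p here.
paragraphs = root.findall(".//p")
spans = root.findall(".//span")

print(len(paragraphs))  # 2
print(len(spans))       # 0, the sample page has no span elements
```

An expression that matches nothing simply returns an empty list, which is worth checking for before reading from the result.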

//element[@attributeName='value']

Example to get elements with specific Class:

//p[@class='toolInfo']

The above example will find all the p elements whose class attribute is exactly "toolInfo".

Example to get elements with specific ID:

//p[@id='tool_link_text']
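The two attribute expressions can be checked with a short sketch, again assuming Python's standard-library `xml.etree.ElementTree` as the XPath engine (it supports the `[@attribute='value']` predicate). The `SAMPLE` string is a trimmed copy of the page above.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample page, limited to the two paragraphs.
SAMPLE = """<body>
  <div>
    <p class='toolInfo'>GetOData is the best tool for Web scraping.</p>
    <p id='tool_link_text'>Check it out here <a href='https://www.getodata.com'>GetOData</a></p>
  </div>
</body>"""

root = ET.fromstring(SAMPLE)

# findall() returns every match; find() returns the first match or None.
by_class = root.findall(".//p[@class='toolInfo']")
by_id = root.find(".//p[@id='tool_link_text']")

print(by_class[0].text)    # GetOData is the best tool for Web scraping.
print(by_id.text.strip())  # Check it out here
```

Note that `.text` holds only the text before the first child element, which is why the link text is not part of the second result.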

Xpath contains

You can also select elements based on part of an attribute's value or part of their text, using the contains() function.

For Example:

If you want to select a p element whose class value contains "tool":

//p[contains(@class,'tool')]

The above XPath finds all p elements whose class value has "tool" in it. The match is case-sensitive.

Similarly for ID:

//p[contains(@id,'tool')]

For comparing text value:

//p[contains(text(),'GetOData')]

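The above XPath finds the p elements whose text contains "GetOData". Note that in XPath 1.0, contains(text(), ...) checks only the first text node of each element, so in the sample page it matches the first p but not the second, where "GetOData" sits inside the nested a element. To match against the full text, including nested elements, use //p[contains(., 'GetOData')]. Matching on text is very useful for selecting specific elements across different websites.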

Xpath Based on Position

Sometimes an XPath matches multiple elements on a page, but you only want one of them, such as the second, third or fourth.

In such cases, you can pinpoint the element by its position, like below:

//li[@class='features_list'][2]

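The above XPath gets the second li element with class features_list. Positions start at 1, not 0, and they are counted within each parent element. To pick the second match across the whole page instead, wrap the expression in parentheses: (//li[@class='features_list'])[2]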
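The position predicate can be tried with the sketch below, which assumes Python's standard-library `xml.etree.ElementTree` as the engine. One caveat: ElementTree counts positions among same-tag siblings, which gives the same answer as standard XPath here because every li in the list carries the same class. The `SAMPLE` string is a trimmed copy of the page above.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample page, limited to the features list.
SAMPLE = """<ul id='tool_features'>
  <li class='features_list' feature_identifier='proxy'>Proxy rotation</li>
  <li class='features_list'>Unblocking API</li>
  <li class='features_list'>AI Based Parsing</li>
  <li class='features_list'>Javascript Execution</li>
</ul>"""

root = ET.fromstring(SAMPLE)

# XPath positions are 1-based: [1] is the first item, [2] the second.
first = root.findall(".//li[@class='features_list'][1]")
second = root.findall(".//li[@class='features_list'][2]")

print(first[0].text)   # Proxy rotation
print(second[0].text)  # Unblocking API
```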

Xpath Locating Parent Elements

There are many situations where you need to go up the HTML structure and locate parent elements. You can do this with XPath axes such as parent:: and ancestor::, combined with the node() node test, which matches any kind of node:

To get the parent element:

//li[@class='features_list']/parent::node()

To get all ancestors:

//li[@class='features_list']/ancestor::node()

To get preceding elements:

//li[@class='features_list']/preceding::node()

The above XPath gets every node that appears before the selected node in the document, excluding its ancestors.

To get only the preceding siblings, meaning the earlier nodes that share the same parent:

//li[@class='features_list']/preceding-sibling::node()
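The parent step can be tried with the sketch below. It assumes Python's standard-library `xml.etree.ElementTree`, which does not support named axes such as `parent::`, `ancestor::` or `preceding::`, so the example uses `..`, the standard XPath abbreviation for `parent::node()`. Testing the other axes needs a full XPath engine. The `SAMPLE` string is a trimmed copy of the page above.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample page: the features list inside body.
SAMPLE = """<body>
  <ul id='tool_features'>
    <li class='features_list' feature_identifier='proxy'>Proxy rotation</li>
    <li class='features_list'>Unblocking API</li>
    <li class='features_list'>AI Based Parsing</li>
    <li class='features_list'>Javascript Execution</li>
  </ul>
</body>"""

root = ET.fromstring(SAMPLE)

# ".." is the abbreviated form of parent::node().
# All four li elements share one parent, and it is returned once.
parents = root.findall(".//li[@class='features_list']/..")

print([p.tag for p in parents])  # ['ul']
print(parents[0].get("id"))      # tool_features
```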

Xpath Locating Child Elements

Just as you can locate parent elements, you can locate child and following elements with the XPath forms below. Note that node() matches text nodes as well as elements; replace it with * to get elements only.

//li[@class='features_list']/child::node() <!--All direct child nodes-->

//li[@class='features_list']/following::node() <!--All nodes after it in the document, whatever their parent-->

//li[@class='features_list']/following-sibling::node() <!--All later nodes that share the same parent-->

//li[@class='features_list']/descendant::node() <!--All nested nodes at any depth-->
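The child and descendant steps can be sketched with Python's standard-library `xml.etree.ElementTree`, under the assumption that abbreviated syntax is acceptable: ElementTree has no named axes, so `/*` stands in for `child::*` and `//*` for `descendant::*` (elements only, no text nodes). The `following::` axes are not covered here. The `SAMPLE` string is a trimmed copy of the page above.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample page body.
SAMPLE = """<body>
  <div>
    <p class='toolInfo'>GetOData is the best tool for Web scraping.</p>
    <p id='tool_link_text'>Check it out here <a href='https://www.getodata.com'>GetOData</a></p>
  </div>
  <ul id='tool_features'>
    <li class='features_list' feature_identifier='proxy'>Proxy rotation</li>
    <li class='features_list'>Unblocking API</li>
    <li class='features_list'>AI Based Parsing</li>
    <li class='features_list'>Javascript Execution</li>
  </ul>
</body>"""

root = ET.fromstring(SAMPLE)

# "/*" selects direct child elements (child::*).
children = root.findall(".//ul[@id='tool_features']/*")

# "//*" selects nested elements at any depth (descendant::*).
descendants = root.findall(".//div//*")

print([c.text for c in children])
# ['Proxy rotation', 'Unblocking API', 'AI Based Parsing', 'Javascript Execution']
print([d.tag for d in descendants])  # ['p', 'p', 'a']
```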

Free Xpath Finder Tool

There are so many ways of writing XPath expressions that entire books have been written on the subject, but this guide should get you started quickly with creating an XPath for any element.

If you struggle to create an XPath by yourself, you can check out the Free Xpath Finder Chrome Extension below, which uses AI to find accurate XPaths:

Free Xpath Finder Chrome Extension by GetOData

It's accurate and will always stay free!