GetOData

Web Scraping Data from Realtor using Selenium (Easy and Fast Solution)


Scraping data from a real estate website is not easy; it's a complicated project. These sites block us with their antibot mechanisms, and we have to find ways to bypass them to get the complete data.

Also, this article is not a tutorial per se, so I am not going to delve into too much detail about how each line works, but hopefully it gives you an idea of how to tackle complex data extraction projects like this one.

So without wasting any further time, let's get started.

The first step is exploring the Realtor website.

We have to get the following data: the price, the number of beds, the number of baths, the sqft area, the address, and the URL of each home listing.

While exploring, we see that if we disable JavaScript on the website, the data on the page is no longer visible.

This means we have to use a library like Selenium, which can behave like a real browser and get data from websites that require JavaScript.

Also, we see that the website loads more content as we scroll down, so to get the complete data we will need to add a scrolling step to our code. Luckily, Selenium can do that as well.

Now that the exploring is done, let's start creating the project.

You can install Selenium and webdriver-manager with:

pip install selenium webdriver-manager

We use webdriver-manager because it downloads and manages the matching ChromeDriver automatically, so we do not need to hard-code its path ourselves.

Now, let's start writing the code

We create a Python script, import Selenium, write the initial code that opens the Realtor website, and run it to see if it fetches the page correctly.
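
Here is a minimal sketch of that initial script (webdriver-manager downloads the matching ChromeDriver for us; the URL is the New York search page we explored above):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Start Chrome with a driver that webdriver-manager downloads and manages for us
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

# Open the first page of the New York search results
driver.get("https://www.realtor.com/realestateandhomes-search/New-York/pg-1")

# Print the start of the rendered HTML to check whether we actually got the page
print(driver.page_source[:1000])

driver.quit()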

And as expected...

We got blocked by the Antibot Mechanism of the website.

You didn't think it would be that easy to scrape Realtor, did you?

Now... let's get to the real business.

There are two ways we can go about this.

Let's choose the easy route first:

We can use a service called GetOdata.com, which handles the antibot mechanism for us.

1. We create an account, go to the dashboard, and add the URL of the page we want to get data from, in this case:

https://www.realtor.com/realestateandhomes-search/New-York/pg-%s

2. We choose Antibot level 2.

3. We have the option to get the data as raw HTML here, or to add selectors and get the data back in JSON format. Let's get it in JSON directly by passing the selectors.

4. We want to get the price of each house, the number of beds, the number of bathrooms, the sqft area, the address, and the URL of the home listing. We create XPath selectors for each of them to locate the elements and add them in the Selectors option through the API Playground.

Here are the selectors we will pass:

     [ 
     { "key": "Price", "value": "//div[@data-testid='card-content']/div[@class='price-wrapper']/div/text()", "listing_type": "multiple" }, 
     { "key": "Number of Beds", "value": "//li[@data-testid='property-meta-beds']/span/text()", "listing_type": "multiple"}, 
     { "key": "Number of Baths", "value": "//li[@data-testid='property-meta-baths']/span/text()", "listing_type": "multiple" },
     { "key": "Sqft. Area", "value": "//li[@data-testid='property-meta-sqft']/span/text()", "listing_type": "multiple" },
     { "key": "URL", "value": "//div[@data-testid='card-content']/a/@href", "listing_type": "multiple" }
     ]
    

    This will help GetOdata parse the HTML content and return it in JSON format.

By the way, if you use a service like this, make sure to mention the service charges to the client in addition to your own charges for completing the project.

The client is the one who should pay for it, as it is a project requirement.

Now, we send the request and wait to see if it returns our data.

....

And Bam we got the data 🥳🔥.

Now we can copy the code from the API Playground, paste it into our program, and iterate through all the pages by changing just the page parameter to get the data.

import requests

# Loop over the result pages, changing only the page number in the URL
for page in range(1, 20):
    params = {
        'apiKey': 'Your_API_Key',
        'url': 'https://www.realtor.com/realestateandhomes-search/New-York/pg-%s' % page,
        'location': 'Auto',
        'js_enabled': 'Yes',
        'antibot_level': '2',
        'screenshot': 'True',
        # XPath selectors so the API returns parsed JSON instead of raw HTML
        'data_selectors': """[
  { "key": "Price", "value": "//div[@data-testid='card-content']/div[@class='price-wrapper']/div/text()", "listing_type": "multiple" },
  { "key": "Number of Beds", "value": "//li[@data-testid='property-meta-beds']/span/text()", "listing_type": "multiple"},
  { "key": "Number of Baths", "value": "//li[@data-testid='property-meta-baths']/span/text()", "listing_type": "multiple" },
  { "key": "Sqft. Area", "value": "//li[@data-testid='property-meta-sqft']/span/text()", "listing_type": "multiple" },
  { "key": "URL", "value": "//div[@data-testid='card-content']/a/@href", "listing_type": "multiple" }
  ]""",
        # Tell the API to scroll to the bottom so every listing on the page loads
        'actions': """[
  {"action": "scroll_till_end"}
  ]"""
    }

    url = 'https://api.getodata.com/'

    response = requests.get(url, params=params)
    responseData = response.json()
    print(responseData)

Now, to send multiple requests in parallel, you can combine the above code with Scrapy. Scrapy also makes it easy to output the data to a CSV or JSON file.

Here's how you can do it.

Create a Scrapy project. Create the spider file. Add the spider code (you can get it again from the GetOData dashboard).

Run the crawl command and you will get the data in a CSV file.
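
Here is a rough sketch of such a spider (the class and spider names are just examples, and for brevity it omits the data_selectors and actions parameters shown above; add them back to params in the same way):

import json
from urllib.parse import urlencode

import scrapy


class RealtorOdataSpider(scrapy.Spider):
    name = "realtor_odata"

    def start_requests(self):
        # One GetOData API request per results page; Scrapy sends them concurrently
        for page in range(1, 20):
            params = {
                'apiKey': 'Your_API_Key',
                'url': 'https://www.realtor.com/realestateandhomes-search/New-York/pg-%s' % page,
                'js_enabled': 'Yes',
                'antibot_level': '2',
            }
            yield scrapy.Request('https://api.getodata.com/?' + urlencode(params))

    def parse(self, response):
        # The API responds with JSON, so parse it and yield it as an item
        # (assuming the response is a single JSON object)
        yield json.loads(response.text)

Running scrapy crawl realtor_odata -o listings.csv from inside the project directory writes the yielded items to a CSV file.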

Using Selenium-Wire for Extraction

Now, getting to the second way of scraping. This one is more complex... so roll up your sleeves and let's start.

To get data from Realtor, plain Selenium is not enough. We need its more powerful extension, Selenium Wire, which lets us intercept and modify the browser's requests.

We install selenium-wire first

pip install selenium-wire

and let's get to writing the code. Let's do it inside a Scrapy project itself, since Scrapy makes outputting the data easy.

We create the initial code that goes to the page and tries getting the HTML content, but you will see that it still gets blocked.

And this is because Realtor inspects the request headers in our request and figures out that it is coming from a bot. We need request headers that look exactly like those of a real browser, which you can check by going to httpbin.org/headers.
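
For comparison, here's a quick way to see what a bare Python script sends to that same endpoint; the gap between this and what your real browser shows is exactly what gives the bot away:

import requests

# httpbin echoes back whatever headers it received from us
resp = requests.get("https://httpbin.org/headers")
print(resp.json()["headers"])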

So we have to change the request headers in our code.

For this, we create an interceptor function that deletes the default headers from the bot request and adds our own real-browser headers, copied from httpbin.org/headers:

def interceptor(request):
    # Remove the headers the automated browser would normally send
    del request.headers["User-Agent"]
    del request.headers["Sec-Ch-Ua"]
    del request.headers["Sec-Fetch-Site"]
    del request.headers["Accept-Encoding"]

    # Replace them with values copied from a real Chrome browser (httpbin.org/headers)
    request.headers["Accept-Language"] = "en-US,en;q=0.9"
    request.headers["Referer"] = "https://www.google.com/"
    request.headers["User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36"
    request.headers["Sec-Ch-Ua"] = "\"Chromium\";v=\"122\", \"Not(A:Brand\";v=\"24\", \"Google Chrome\";v=\"122\""
    request.headers["Sec-Fetch-Site"] = "cross-site"
    request.headers["Accept-Encoding"] = "gzip, deflate, br, zstd"

Then, after we have initialized our webdriver, we attach the interceptor to the driver. Here is the complete code:

from seleniumwire import webdriver
import json
import time
import requests
import scrapy
from scrapy import Selector


class RealtorSpider(scrapy.Spider):
    name = "realtor"
    allowed_domains = ["google.com"]
    start_urls = ["https://www.google.com"]

    def parse(self, response):

        def mimicHeaders(request):
            del request.headers["User-Agent"]
            del request.headers["Sec-Ch-Ua"]
            del request.headers["Sec-Fetch-Site"]
            del request.headers["Accept-Encoding"]
            request.headers["Accept-Language"] = "en-US,en;q=0.9"
            request.headers["Referer"] = "https://www.google.com/"
            request.headers["User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36"
            request.headers["Sec-Ch-Ua"] = "\"Chromium\";v=\"122\", \"Not(A:Brand\";v=\"24\", \"Google Chrome\";v=\"122\""
            request.headers["Sec-Fetch-Site"] = "cross-site"
            request.headers["Accept-Encoding"] = "gzip, deflate, br, zstd"

        chrome_options = webdriver.ChromeOptions()

        driver = webdriver.Chrome(options=chrome_options)
        driver.request_interceptor = mimicHeaders
        driver.maximize_window() 
        driver.get("https://www.realtor.com/realestateandhomes-search/New-York/pg-%s"%page)
        driver.close()

And then we check whether it is working.

After going through a few pages, we see that, oh wow, it's working... But then... Realtor blocks our IP 🥲.

And this is because Realtor sees that we are sending a large number of requests in a short period of time, so it blocks our IP address outright.

Now, most websites don't go to this length, but if they do, like Realtor does... we need to use a proxy rotation service, which rotates our IP address while we send requests.

Now, instead of purchasing a proxy rotation service, I would always recommend using a web scraping API like GetOdata, as I mentioned in the first approach, since it saves a lot of time. But for demonstration purposes, we can use a proxy rotation service like Webshare as well.

So we get the Webshare IP rotation service, add the proxy to our program, and check whether it works by running the script again.

Great... We are now not getting blocked.

from seleniumwire import webdriver
import json
import time
import requests
import scrapy
from scrapy import Selector


class RealtorSpider(scrapy.Spider):
    name = "realtor"
    allowed_domains = ["google.com"]
    start_urls = ["https://www.google.com"]

    def parse(self, response):

        proxy_username = 'proxy_username'
        proxy_password = 'proxy_pass'

        def mimicHeaders(request):
            del request.headers["User-Agent"]
            del request.headers["Sec-Ch-Ua"]
            del request.headers["Sec-Fetch-Site"]
            del request.headers["Accept-Encoding"]
            request.headers["Accept-Language"] = "en-US,en;q=0.9"
            request.headers["Referer"] = "https://www.google.com/"
            request.headers["User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36"
            request.headers["Sec-Ch-Ua"] = "\"Chromium\";v=\"122\", \"Not(A:Brand\";v=\"24\", \"Google Chrome\";v=\"122\""
            request.headers["Sec-Fetch-Site"] = "cross-site"
            request.headers["Accept-Encoding"] = "gzip, deflate, br, zstd"

        chrome_options = webdriver.ChromeOptions()

        driver = webdriver.Chrome(options=chrome_options, seleniumwire_options={
            'proxy': {
                # Route both http and https traffic through the rotating proxy
                'http': f'http://{proxy_username}:{proxy_password}@p.webshare.io:80',
                'https': f'http://{proxy_username}:{proxy_password}@p.webshare.io:80',
                'verify_ssl': False,
            }
        })

        driver.request_interceptor = mimicHeaders
        driver.maximize_window() 
        for page in range(1,5):
            driver.get("https://www.realtor.com/realestateandhomes-search/New-York/pg-%s"%page)
            time.sleep(3)
        driver.close()

Now, to extract the data from the HTML content, we select the block that contains each home listing, iterate through all the blocks, add the data points we want to scrape with their XPath selectors, and output the data using yield.

Now, to get the data across multiple pages, we add a page iteration loop. You can also add a headless mode option so the whole thing runs in the background without you having to keep an eye on the browser. We also add XPaths to locate the elements and get the data.
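
Headless mode is just one extra argument on the ChromeOptions object we already create in the code below (the --headless=new flag is the newer Chrome syntax; older Chrome versions use plain --headless):

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless=new")  # run Chrome without opening a visible window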

Here is the complete code:

from seleniumwire import webdriver
import json
import time
import requests
import scrapy
from scrapy import Selector


class RealtorSpider(scrapy.Spider):
    name = "realtor"
    allowed_domains = ["google.com"]
    start_urls = ["https://www.google.com"]

    def parse(self, response):

        proxy_username = 'proxy_username'
        proxy_password = 'proxy_pass'

        def mimicHeaders(request):
            del request.headers["User-Agent"]
            del request.headers["Sec-Ch-Ua"]
            del request.headers["Sec-Fetch-Site"]
            del request.headers["Accept-Encoding"]
            request.headers["Accept-Language"] = "en-US,en;q=0.9"
            request.headers["Referer"] = "https://www.google.com/"
            request.headers["User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36"
            request.headers["Sec-Ch-Ua"] = "\"Chromium\";v=\"122\", \"Not(A:Brand\";v=\"24\", \"Google Chrome\";v=\"122\""
            request.headers["Sec-Fetch-Site"] = "cross-site"
            request.headers["Accept-Encoding"] = "gzip, deflate, br, zstd"

        chrome_options = webdriver.ChromeOptions()

        driver = webdriver.Chrome(options=chrome_options, seleniumwire_options={
            'proxy': {
                # Route both http and https traffic through the rotating proxy
                'http': f'http://{proxy_username}:{proxy_password}@p.webshare.io:80',
                'https': f'http://{proxy_username}:{proxy_password}@p.webshare.io:80',
                'verify_ssl': False,
            }
        })

        driver.request_interceptor = mimicHeaders
        driver.maximize_window()



        for page in range(1,5):

            driver.get("https://www.realtor.com/realestateandhomes-search/New-York/pg-%s"%page)
            # time.sleep(2)
            driver.execute_script("""window.scrollTo(0, document.body.scrollHeight * 0.25);""")
            time.sleep(3)
            driver.execute_script("""window.scrollTo(0, document.body.scrollHeight * 0.50);""")
            time.sleep(3)
            driver.execute_script("""window.scrollTo(0, document.body.scrollHeight);""")
            time.sleep(3)

            html = (driver.page_source)
            resp = Selector(text=html)
            blocks = resp.xpath("//div[@data-testid='card-content']")

            for x in blocks:
                price = x.xpath(".//div[@class='price-wrapper']/div/text()").get()
                no_of_beds = x.xpath(".//li[@data-testid='property-meta-beds']/span/text()").get()
                no_of_bathrooms = x.xpath(".//li[@data-testid='property-meta-baths']/span/text()").get()
                sqft_area = x.xpath(".//li[@data-testid='property-meta-sqft']/span[2]/text()").get()
                address = x.xpath(".//div[@data-testid='card-address']//text()").getall()
                home_url = x.xpath(".//a/@href").get()

                yield{
                    "Price":price,
                    "no_of_beds":no_of_beds,
                    "no_of_bathrooms":no_of_bathrooms,
                    "sqft_area":sqft_area,
                    "address":address,
                    "home_url":home_url
                }


        driver.close()
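
To run it and export the results, use Scrapy's standard output option from inside the project directory, for example scrapy crawl realtor -o realtor_listings.csv (the file name is just an example); the -o flag writes every yielded item to the file.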

And we are done...

Hopefully, you got value out of this article and got an idea of how you can approach a complex web scraping project. Thanks, and see you in the next article!