How to Scrape Websites Without Getting Blocked
One thing that worries even the best data extraction experts is getting blocked.
But no worries. Having scraped hundreds of websites myself, I am going to share the ways I know to unblock yourself when scraping data from almost any website.
So let's start:
To avoid getting blocked by simple websites, make sure your scraping program does the following:
Add a real User-Agent to your program. You can find yours by searching Google for "My User Agent".
Scrape slowly. Some websites will block you immediately if they see you making lots of requests in a short amount of time. Once they block you, you cannot visit the website for a certain period of time (usually anywhere from a few minutes to an hour).
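The two basics above can be sketched in a few lines of Python. This is only a minimal sketch using the standard library: the User-Agent string, the 2 to 6 second delay range, and the helper names (build_request, random_delay, scrape_slowly, fetch_html) are my own assumptions for illustration, so tune them for the website you are scraping.

```python
import random
import time
import urllib.request

# Assumption: an example string only. Replace it with the exact value
# your own browser reports when you search for "My User Agent".
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)


def build_request(url):
    """Create a request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def random_delay(min_seconds=2.0, max_seconds=6.0):
    """Pick a random pause so requests do not arrive at a fixed rhythm."""
    return random.uniform(min_seconds, max_seconds)


def fetch_html(request):
    """Download one page and return its body as text."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_slowly(urls, fetch=fetch_html, sleep=time.sleep):
    """Fetch each URL in turn, pausing between consecutive requests."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(random_delay())  # no pause is needed before the first request
        pages.append(fetch(build_request(url)))
    return pages
```

Calling scrape_slowly(["https://example.com/page1", "https://example.com/page2"]) would fetch both pages with a random pause in between. The fetch and sleep parameters are there so you can swap in your own downloader or test the loop without touching the network.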
More Complex Antibot Bypass Methods
The next steps are trial and error.
You can implement the steps below one after another and test which one works.
Add real request headers to your program (all of them).
The best way to see which request headers the website needs is to open Inspect Element, go to the Network tab, and check the request that returns the data.
It has the request headers that you need to mimic:
You can use a tool like Postman to mimic the request and check if adding request headers can solve the issue.
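Once you know which headers the browser sends, you can replay them from your program. Here is a rough sketch with Python's standard library. Every header value below is a placeholder I made up for illustration, and build_browser_like_request is a hypothetical helper name, so copy the real values from the Network tab.

```python
import urllib.request

# Assumption: placeholder values. Copy the real ones from the request
# shown in your browser's Network tab.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://example.com/",
    "Connection": "keep-alive",
    # Accept-Encoding is left out on purpose: urllib does not decompress
    # gzip responses automatically, so sending it needs extra handling.
}


def build_browser_like_request(url, extra_headers=None):
    """Build a request carrying the browser headers plus any extras."""
    headers = dict(BROWSER_HEADERS)  # copy so the shared defaults stay untouched
    if extra_headers:
        headers.update(extra_headers)
    return urllib.request.Request(url, headers=headers)
```

For example, build_browser_like_request("https://example.com/data", {"Cookie": "session=abc"}) adds a cookie on top of the defaults. You can then pass the request to urllib.request.urlopen and compare the response with what you see in the browser.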
Switch to a different tech: You can switch from Selenium to Puppeteer or to Splash and see which one is able to get the data correctly.
Some websites have stronger antibot mechanisms that detect whether you are using Selenium or just the Requests library. To work around that, you can try other tech, such as switching from Selenium to Puppeteer or using Splash with Scrapy.
This is also why it's generally recommended to first try fetching the HTML content of a few different pages and check that the data comes back correctly before building the entire scraper. That way you can switch to another tech early, without spending much time on one that gets blocked later.
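A quick check like this can be automated before you commit to a tool. The sketch below is only one possible approach: looks_like_real_page and smoke_test are hypothetical helper names, and the marker strings are assumptions that you should replace with text you know appears on the real pages. The block markers are a crude heuristic, so expect the occasional false alarm.

```python
def looks_like_real_page(html, required_markers,
                         block_markers=("captcha", "access denied")):
    """Return True if html has every expected marker and no block sign."""
    lowered = html.lower()
    if any(marker in lowered for marker in block_markers):
        return False
    return all(marker.lower() in lowered for marker in required_markers)


def smoke_test(urls, fetch, required_markers):
    """Fetch a few sample pages and report which ones returned usable HTML.

    fetch is any callable that takes a URL and returns the page HTML,
    for example a thin wrapper around Selenium, Puppeteer, or Requests.
    """
    return {
        url: looks_like_real_page(fetch(url), required_markers)
        for url in urls
    }
```

If the report shows False for most of your sample pages, that is a hint to try a different tech before writing the rest of the scraper.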
Using Proxy rotation services
If you visit a website a large number of times, it may start blocking your IP address.
To see if your IP is blocked, you can get a free VPN, change your IP address, and check whether you can open the webpage in your Chrome browser.
If you can, that means you need a proxy rotation service to scrape large amounts of data from the website.
A proxy rotation service will automatically rotate your IP address for every request.
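To make the idea concrete, here is a minimal sketch of rotating through a proxy list yourself with Python's standard library. The proxy addresses are placeholders from a documentation-only IP range, and the helper names (opener_for, open_via_proxy, fetch_with_rotation) are hypothetical. A real rotation service usually hands you a single endpoint and does the rotating for you.

```python
import itertools
import urllib.request

# Assumption: placeholder addresses. Use the ones your provider gives you.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]


def opener_for(proxy_url):
    """Build a URL opener that sends HTTP and HTTPS traffic via one proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def open_via_proxy(url, proxy_url):
    """Download one page through the given proxy and return the raw bytes."""
    with opener_for(proxy_url).open(url, timeout=30) as response:
        return response.read()


def fetch_with_rotation(urls, proxies, open_url=open_via_proxy):
    """Fetch each URL through the next proxy in the list, wrapping around."""
    rotation = itertools.cycle(proxies)
    results = []
    for url in urls:
        proxy = next(rotation)  # a different proxy for every request
        results.append((url, proxy, open_url(url, proxy)))
    return results
```

With three proxies and four URLs, the fourth request goes back through the first proxy. The open_url parameter lets you plug in another HTTP client or a stub for testing.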
There are lots of Proxy rotation services in the market that you can use. Here are some of my favorites:
Captchas are considered one of the biggest hurdles in web scraping.
But no worries, let's see how we can deal with them.
There are two types of captchas:
Soft Captchas: These show up only if the website detects that you are a bot or a program.
These can usually be avoided by going back to the bypass methods described above.
Hard Captchas: These do not care whether you are a human or a bot. They are displayed every time you try to access the content.
The best way to get past Hard Captchas is to use a service that relies on real human workers or an AI solver to solve the captcha for us, usually at a very low cost.
Here are the best two services that I usually use for bypassing the captchas:
Sometimes, even after doing everything you can, you may still get blocked by the website.
This can happen because websites are able to collect thousands of data points about you, such as:
The location you are accessing the data from
The time you are accessing it
Your mouse movements and typing speed
and so much more...
If they find even one issue in them, they can block your request.
In this case, the best way to still get the data you need is to use an API that handles the antibot mechanisms for us.
Here are the most powerful ones in the market:
Hope you find this article useful in your scraping journey!
Feel free to ask me any questions here or on Twitter: https://twitter.com/SwapBuilds
and I will get back to you ASAP. Thanks for reading!