Scrape a webpage and parse it to markdown. Packed with features that ensure a high success rate and low cost. Includes two modes of operation so that you can optimize for either cost (as cheap as possible) or yield (as many successful results as possible).
This Apify actor scrapes a single webpage and parses it to markdown. It includes browser-based scraping, smart retrying, anti-scraping-block (e.g. Cloudflare) circumvention, and smart proxy support to ensure a high success rate.
It also includes 2 modes of operation so that you can optimize for either cost (as cheap as possible) or yield (as many successful results as possible).
Whenever you want to reliably get a webpage's content and parse it into markdown.
(I personally mostly use it for feeding data into ChatGPT for freelance cold outreach personalization & automation tasks, which I cover in our $200k Freelancer course.)
If you want to have ChatGPT interpret a webpage, it can be surprisingly difficult with current tooling.
That's why we made this Actor...
😍 This actor allows you to simply plop in a big ole list of domain names, and get a huge spreadsheet of markdown content back, to do whatever you want with.
(e.g. upload to Google Sheets and have ChatGPT iterate through it via a Make automation)
If you're a $200k Freelancer course student, be sure to check the course training area for guidance on the below use cases and more.
Add `=DETECTLANGUAGE(E2)` (assuming E is the markdown column) to a new column, then filter for only English-language websites (e.g. to find out what kinds of products a company sells, who their audience avatar is, etc.)
Regardless of which mode you use it in, if you're exporting to a spreadsheet, be sure to choose MS Excel format, not CSV. (Markdown will often mess up the CSV file)
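To see why markdown content tends to break CSV exports, here's a minimal standard-library Python sketch (the URL and markdown cell are made-up examples): the commas and newlines inside a markdown cell split one logical cell across multiple fields and rows unless every cell is carefully quoted.

```python
import csv
import io

# A markdown cell full of characters that break naive CSV handling.
md = '# Title\n\n- item, with comma\n- "quoted" text'

# Naively comma-joining the row corrupts it: the embedded commas and
# newlines make one cell look like several cells across several lines.
naive = ",".join(["https://example.com", md])
print(naive.count("\n"))  # the "row" now spans multiple lines

# Python's csv writer quotes correctly, but many spreadsheet importers
# still mishandle multi-line cells, which is why the Excel export
# format is the safer choice here.
buf = io.StringIO()
csv.writer(buf).writerow(["https://example.com", md])
print(buf.getvalue())
```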
The following settings are efficient and the cheapest path to data, but won't work for a lot of websites:
The following settings have very high reliability, but are more expensive:
Results | Valid Results | Cost | Cost Per Result (CPL) | Yield | Time | Memory | Proxy | Using Browser Build |
---|---|---|---|---|---|---|---|---|
2462 | 2071 | $0.612 | $0.0002486 | 84.12% | 36min | 1 GB | Residential | No |
2463 | 2078 | $0.914 | $0.0003711 | 84.37% | 19min | 4 GB | Residential | No |
2463 | 2257 | $2.99 | $0.0012140 | 91.64% | 96min | 4 GB | Datacenter | Yes |
2463 | 2300 | $15-46 | As high as $0.02 | 93.38% | 120min | 4 GB | Residential | Yes |
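As a sanity check, the Yield and CPL columns follow directly from the first three columns; note that CPL here is cost divided by total results, not valid results. A quick Python check against the first three rows of the table:

```python
# Benchmark rows from the table above: (results, valid_results, cost_usd).
runs = [
    (2462, 2071, 0.612),
    (2463, 2078, 0.914),
    (2463, 2257, 2.99),
]
for results, valid, cost in runs:
    yield_pct = 100 * valid / results  # "Yield" column
    cpl = cost / results               # "Cost Per Result (CPL)" column
    print(f"yield={yield_pct:.2f}%  CPL=${cpl:.7f}")
# → yield=84.12%  CPL=$0.0002486
# → yield=84.37%  CPL=$0.0003711
# → yield=91.64%  CPL=$0.0012140
```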
Depending on your priorities, there are a couple ways to use this scraper. What's your priority?
Priority: "As many successful results as possible" ("...And I don't care if it costs more.")
👉 Run it with the settings from "All The Damned Fruit" Mode from the "Modes of Operation" instructions right from the start.
Just be aware that at 4 GB of RAM + residential proxies, you may pay up to 100x more than if you tried "Low-Hanging Fruit" Mode first.
Priority: "As cheap as possible" ("...And I don't care if it means there are a couple extra steps for me.")
👉 You'll do two separate runs — first you'll get all the cheap Low-Hanging Fruit results you can, then you'll re-run all the failures in the "All The Damned Fruit" Mode.
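The two-run flow can be sketched in Python; `run_cheap` and `run_expensive` are hypothetical stand-ins for launching the actor with each mode's settings (in practice you'd trigger the runs through the Apify console or API):

```python
def two_pass(urls, run_cheap, run_expensive):
    """First pass with cheap settings, then re-run only the failures
    with the expensive settings."""
    # Pass 1: "Low-Hanging Fruit" Mode on everything.
    results = {url: run_cheap(url) for url in urls}
    # Collect URLs that failed (no markdown came back).
    failed = [url for url, md in results.items() if md is None]
    # Pass 2: "All The Damned Fruit" Mode on the failures only.
    for url in failed:
        results[url] = run_expensive(url)
    return results

# Demo with stub scrapers: the cheap pass only handles some sites.
cheap = lambda url: "# ok" if "easy" in url else None
expensive = lambda url: "# ok (browser)"
out = two_pass(["easy.com", "hard.com"], cheap, expensive)
print(out)  # both URLs end up with markdown content
```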
Instructions:
1. Run everything with the "Low-Hanging Fruit" Mode settings (you can find them in the Modes of Operation section at the top of this page).
2. Re-run all the failures with the "All The Damned Fruit" Mode settings (you can find them in the Modes of Operation section at the top of this page).

Yes, if you're scraping publicly available data for personal or internal use. Always review the website's Terms of Service before large-scale use or redistribution.
No. This is a no-code tool: just enter the URL(s) you want scraped and run the scraper directly from your dashboard or the Apify actor page.
It extracts each page's content parsed to markdown. You can export all of it to Excel or JSON.
Yes, you can scrape as many pages as you like by passing in a list of URLs or domain names in the input settings.
You can use the Try Now button on this page to go to the scraper. You'll be guided to input your URL(s) and get structured results back. No setup needed!