GetOData

Articles Extractor – Automatically Extract Valuable Data from News Sites


Intro:

The Articles Extractor is a web scraping tool that extracts structured data from news articles, blog posts, and other online content. Whether you're a researcher, marketer, or developer, it handles the data collection step so you can spend your time on analysis rather than on writing and maintaining scrapers.

🔍 What Is Articles Extractor?

The Articles Extractor is an enterprise-grade API for web content scraping and analysis. Given a list of article URLs from news sites, blogs, and similar publications, it returns each article's data as a structured record. That makes it useful for researchers, content creators, and data analysts who need consistent, organized information from many different sources.

✨ Features

  • Comprehensive Content Extraction: Extract full article text, titles, authors, and publication dates.
  • Content Analytics: Automatically calculate reading times and gather important content metrics.
  • SEO Metadata Mining: Gain access to valuable SEO data from HTML meta tags.
  • Content Cleaning: Remove distractions like ads and navigation elements to deliver clean data.
  • High-Volume Processing: Scrape thousands of articles simultaneously with parallel processing capabilities.
  • Multiple Export Formats: Export data in JSON, CSV, XML, RSS, or HTML formats.
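
The article does not say how the reading time in the Content Analytics feature is calculated. A common convention is to divide the word count by an average reading speed, and the sketch below illustrates that idea only. The `estimate_reading_time` function and the 200 words-per-minute figure are assumptions for illustration, not the tool's documented method.

```python
import math
import re

# Assumed average reading speed; the tool's own figure is not stated in the article.
WORDS_PER_MINUTE = 200


def estimate_reading_time(text: str, wpm: int = WORDS_PER_MINUTE) -> int:
    """Return an estimated reading time in whole minutes.

    Empty text yields 0; any non-empty text yields at least 1 minute.
    """
    words = re.findall(r"\w+", text)
    if not words:
        return 0
    # Round up so a 2.25-minute article is reported as 3 minutes.
    return max(1, math.ceil(len(words) / wpm))
```

If you need a value comparable to the `ttr` output field, check the actor's own documentation first, since its units and rounding may differ from this sketch.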

🛠️ How to Use It

Step-by-step tutorial:

  1. Go to the tool’s page: Articles Extractor
  2. Click “Try for free” or “Run actor”
  3. Fill in the required input fields:
    • startUrls: The list of URLs you wish to scrape.
    • saveArticleHtml: Set to true if you want to save the HTML content of the article.
    • saveFullPageHtml: Set to false to avoid saving the entire page HTML.
    • Many other optional parameters for customization.
  4. Click “Run” and wait for the results to be processed.
  5. Download your results or send them to a webhook for further integration.
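
The same steps can also be scripted instead of clicked through. The sketch below uses only the Python standard library and assumes Apify's REST endpoint for running an actor synchronously and returning its dataset items (`/v2/acts/{actorId}/run-sync-get-dataset-items`). The actor ID is not given in this article, so the `actor_id` argument is a placeholder you would need to look up on the tool's page, and the function names are hypothetical.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id: str, token: str, run_input: dict) -> urllib.request.Request:
    """Build the POST request that starts a run and asks for its dataset items.

    actor_id is a placeholder in "username~actor-name" form; the real ID
    is not stated in the article.
    """
    url = f"{API_BASE}/acts/{actor_id}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def run_actor(actor_id: str, token: str, run_input: dict, timeout: float = 300.0) -> list:
    """Send the request and return the scraped items as a list of dicts."""
    request = build_run_request(actor_id, token, run_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Separating request construction from the network call keeps the input easy to inspect before anything is sent, which helps when each run is billed.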

🧪 Sample Input (JSON)

```json
{
  "saveArticleHtml": true,
  "saveFullPageHtml": false,
  "startUrls": [
    {
      "url": "https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html"
    }
  ],
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  },
  "customHeaders": {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
  }
}
```
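
When you have many URLs, it is easier to generate this input than to write it by hand. The following is a minimal sketch that builds the same structure as the sample above from a plain list of URLs. The `build_run_input` helper and its parameter names are hypothetical. Only the JSON keys (`startUrls`, `saveArticleHtml`, `saveFullPageHtml`, `proxyConfiguration`) come from the sample.

```python
import json


def build_run_input(
    urls,
    save_article_html: bool = True,
    save_full_page_html: bool = False,
    use_residential_proxy: bool = True,
) -> dict:
    """Build an input dict shaped like the sample JSON from a list of URLs."""
    seen = set()
    start_urls = []
    for url in urls:
        url = url.strip()
        if not url.startswith(("http://", "https://")):
            raise ValueError(f"Not an absolute HTTP(S) URL: {url!r}")
        # Skip duplicates so the same article is not scraped twice.
        if url not in seen:
            seen.add(url)
            start_urls.append({"url": url})

    run_input = {
        "saveArticleHtml": save_article_html,
        "saveFullPageHtml": save_full_page_html,
        "startUrls": start_urls,
    }
    if use_residential_proxy:
        # Mirrors the proxy settings shown in the sample input.
        run_input["proxyConfiguration"] = {
            "useApifyProxy": True,
            "apifyProxyGroups": ["RESIDENTIAL"],
        }
    return run_input


if __name__ == "__main__":
    example = build_run_input(["https://example.com/news/article-1"])
    print(json.dumps(example, indent=2))
```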

📤 Output Data (Fields)

  • url: The URL of the article scraped.
  • title: The title of the article.
  • description: A brief description or summary of the article.
  • links: Any relevant links found within the article.
  • image: URL of the featured image.
  • content: The main content of the article, with preserved formatting.
  • author: Author of the article.
  • source: Source of the publication.
  • published: Publication date and time.
  • ttr: Estimated reading time.
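
Once you have the results as JSON, the fields above map naturally onto table columns. The sketch below flattens a list of result items into CSV text using the standard library. The field names come from the list above, but their types are assumptions: `links` is treated as a list that is joined into one cell, and missing fields are written as empty strings. The `items_to_csv` helper is hypothetical.

```python
import csv
import io

# Column order follows the output field list in this article.
FIELDS = [
    "url", "title", "description", "links", "image",
    "content", "author", "source", "published", "ttr",
]


def items_to_csv(items) -> str:
    """Convert result items (dicts) to CSV text with one row per article."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    for item in items:
        # Keep only the known fields; anything missing becomes an empty cell.
        row = {field: item.get(field, "") for field in FIELDS}
        links = row["links"]
        if isinstance(links, (list, tuple)):
            # Assumption: links is a list; join it so it fits in one cell.
            row["links"] = " ".join(str(link) for link in links)
        writer.writerow(row)
    return buffer.getvalue()
```

For quick inspection this is often enough. For larger runs, the built-in CSV export mentioned under Features avoids the extra step.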

💰 Pricing

This actor is priced at $0.20 per run. Apify also offers a free tier for new users to test the features.

👨‍💻 Built By

The Articles Extractor is developed by Apify, innovative leaders in web automation and data scraping.

Final Thoughts

The Articles Extractor is a fantastic tool for anyone looking to gather structured content from various online publications efficiently. Its capacity to handle high volumes makes it ideal for market researchers, content marketers, and developers alike.

🔗 Try the Actor Now

👉 Articles Extractor