Docling

Docling Document Parser & Converter – Convert documents into structured data without complexity. This Actor leverages the powerful Docling library to parse and transform various document formats into clean, structured outputs ready for analysis or integration.

vancura

Docling Actor on Apify

This Actor (specification v1) wraps the Docling project to provide serverless document processing in the cloud. It can process complex documents (PDF, DOCX, images) and convert them into structured formats (Markdown, JSON, HTML, Text, or DocTags) with optional OCR support.

What are Actors?

Actors are serverless microservices running on the Apify Platform. They are based on the Actor SDK and can be found in the Apify Store. Learn more about Actors in the Apify Whitepaper.

Table of Contents

Features
Usage
Input Parameters
Output
Performance and Resources
Troubleshooting
Local Development
Architecture
License
Acknowledgments
Security Considerations

Features

Leverages the official docling-serve-cpu Docker image for efficient document processing
Processes multiple document formats:
- PDF documents (scanned or digital)
- Microsoft Office files (DOCX, XLSX, PPTX)
- Images (PNG, JPG, TIFF)
- Other text-based formats
Provides OCR capabilities for scanned documents
Exports to multiple formats:
- Markdown
- JSON
- HTML
- Plain Text
- DocTags (structured format)
No local setup needed—just provide input via a simple JSON config

Usage

Using Apify Console

Go to the Apify Actor page.
Click "Run".
In the input form, fill in:
- The URL of the document.
- Output format (md, json, html, text, or doctags).
- OCR boolean toggle.
The Actor will run and produce its outputs in the default key-value store under the key OUTPUT.

Using Apify API

curl --request POST \
  --url "https://api.apify.com/v2/acts/vancura~docling/runs" \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer YOUR_API_TOKEN' \
  --data '{
  "options": {
    "to_formats": ["md", "json", "html", "text", "doctags"]
  },
  "http_sources": [
    {"url": "https://vancura.dev/assets/actor-test/facial-hairstyles-and-filtering-facepiece-respirators.pdf"},
    {"url": "https://arxiv.org/pdf/2408.09869"}
  ]
}'
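The same call can be prepared from Python. The following is a minimal sketch using only the standard library; it builds the request without sending it. The helper name `build_run_request` is hypothetical, and the `/runs` path is taken from the Apify API reference for starting an Actor run rather than from this Actor's own code.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id: str, token: str, actor_input: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request that starts an Actor run."""
    url = f"{API_BASE}/acts/{actor_id}/runs"
    body = json.dumps(actor_input).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


actor_input = {
    "options": {"to_formats": ["md", "json", "html", "text", "doctags"]},
    "http_sources": [{"url": "https://arxiv.org/pdf/2408.09869"}],
}
request = build_run_request("vancura~docling", "YOUR_API_TOKEN", actor_input)

# To start the run for real, pass the request to urllib.request.urlopen(request).
```

Keeping request construction separate from sending makes it easy to inspect the URL, headers, and body before spending any compute on a run.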

Using Apify CLI

apify call vancura/docling --input='{
  "options": {
    "to_formats": ["md", "json", "html", "text", "doctags"]
  },
  "http_sources": [
    {"url": "https://vancura.dev/assets/actor-test/facial-hairstyles-and-filtering-facepiece-respirators.pdf"},
    {"url": "https://arxiv.org/pdf/2408.09869"}
  ]
}'

Input Parameters

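The Actor accepts JSON input that conforms to the schema in .actor/input_schema.json. Below is a summary of the fields:

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `http_sources` | object | Yes | None | Document sources, one `{"url": "..."}` entry per document. See https://github.com/DS4SD/docling-serve?tab=readme-ov-file#url-endpoint |
| `options` | object | No | None | Conversion options such as `to_formats` and `do_ocr`. See https://github.com/DS4SD/docling-serve?tab=readme-ov-file#common-parameters |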

Example Input

{
  "options": {
    "to_formats": ["md", "json", "html", "text", "doctags"]
  },
  "http_sources": [
    {"url": "https://vancura.dev/assets/actor-test/facial-hairstyles-and-filtering-facepiece-respirators.pdf"},
    {"url": "https://arxiv.org/pdf/2408.09869"}
  ]
}
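Before starting a run, it can help to check that an input has the expected shape. The sketch below is an illustration only: `validate_input` is a hypothetical helper, not part of the Actor, and the set of allowed formats is assumed from the formats listed in this README rather than read from input_schema.json.

```python
# Formats named in this README; the authoritative list lives in the schema.
ALLOWED_FORMATS = {"md", "json", "html", "text", "doctags"}


def validate_input(actor_input: dict) -> list[str]:
    """Return a list of problems; an empty list means the input looks usable."""
    problems = []

    sources = actor_input.get("http_sources")
    if not isinstance(sources, list) or not sources:
        problems.append("http_sources must be a non-empty list")
    else:
        for index, source in enumerate(sources):
            if not isinstance(source, dict) or not isinstance(source.get("url"), str):
                problems.append(f"http_sources[{index}] needs a string 'url'")

    options = actor_input.get("options", {})
    if not isinstance(options, dict):
        problems.append("options must be an object")
    else:
        unknown = [fmt for fmt in options.get("to_formats", []) if fmt not in ALLOWED_FORMATS]
        if unknown:
            problems.append(f"unsupported to_formats: {unknown}")

    return problems


example = {
    "options": {"to_formats": ["md", "json"]},
    "http_sources": [{"url": "https://arxiv.org/pdf/2408.09869"}],
}
print(validate_input(example))
```

Returning a list of problems instead of raising on the first one lets a caller report every issue in a single pass.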

Output

The Actor provides three types of outputs:

Processed Documents in a ZIP - The Actor will provide the direct URL to your result in the run log, looking like:

You can find your results at: 'https://api.apify.com/v2/key-value-stores/[YOUR_STORE_ID]/records/OUTPUT'

Processing Log - Available in the key-value store as DOCLING_LOG
Dataset Record - Contains processing metadata with:
- Direct link to the processed output zip file
- Processing status

You can access the results in several ways:

Direct URL (shown in Actor run logs):

https://api.apify.com/v2/key-value-stores/[STORE_ID]/records/OUTPUT

Programmatically via Apify CLI:

apify key-value-stores get-value [STORE_ID] OUTPUT

Dataset - Check the "Dataset" tab in the Actor run details to see processing metadata
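The direct URL and the ZIP result can also be handled from a script. This is a minimal sketch under stated assumptions: `record_url` and `list_zip_entries` are hypothetical helper names, and the names of the files inside the archive are not documented here, so the code only lists whatever entries it finds.

```python
import io
import zipfile


def record_url(store_id: str, key: str = "OUTPUT") -> str:
    """Build the key-value store record URL in the form shown in the run log."""
    return f"https://api.apify.com/v2/key-value-stores/{store_id}/records/{key}"


def list_zip_entries(zip_bytes: bytes) -> list[str]:
    """Return the sorted entry names of a ZIP archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return sorted(archive.namelist())


# Downloading is left to the caller, for example:
#   import urllib.request
#   data = urllib.request.urlopen(record_url("YOUR_STORE_ID")).read()
#   print(list_zip_entries(data))
```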

Example Outputs

Markdown (md)

# Document Title

## Section 1
Content of section 1...

## Section 2
Content of section 2...

JSON

{
    "title": "Document Title",
    "sections": [
        {
            "level": 1,
            "title": "Section 1",
            "content": "Content of section 1..."
        }
    ]
}

HTML

<h1>Document Title</h1>
<h2>Section 1</h2>
<p>Content of section 1...</p>

Processing Logs (`DOCLING_LOG`)

The Actor maintains detailed processing logs including:

API request and response details
Processing steps and timing
Error messages and stack traces
Input validation results

Access logs via:

apify key-value-stores get-value [STORE_ID] DOCLING_LOG

Performance and Resources

Docker Image Size: ~4GB
Memory Requirements:
- Minimum: 2 GB RAM
- Recommended: 4 GB RAM for large or complex documents
Processing Time:
- Simple documents: 15-30 seconds
- Complex PDFs with OCR: 1-3 minutes
- Large documents (100+ pages): 3-10 minutes

Troubleshooting

Common issues and solutions:

Document URL Not Accessible
- Ensure the URL is publicly accessible
- Check if the document requires authentication
- Verify the URL leads directly to the document
OCR Processing Fails
- Verify the document is not password-protected
- Check if the image quality is sufficient
- Try processing with OCR disabled
API Response Issues
- Check the logs for detailed error messages
- Ensure the document format is supported
- Verify the URL is correctly formatted
Output Format Issues
- Verify the output format is supported
- Check if the document structure is compatible
- Review the DOCLING_LOG for specific errors
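Some of the URL problems above can be caught before a run is started. The following sketch checks only the form of the URL and performs no network access, so it cannot tell whether a document is actually reachable. `check_document_url` is a hypothetical helper, not something the Actor provides.

```python
from urllib.parse import urlparse


def check_document_url(url: str) -> list[str]:
    """Return formatting problems found in a document URL (empty list if none)."""
    problems = []
    parsed = urlparse(url.strip())

    if parsed.scheme not in ("http", "https"):
        problems.append("URL must start with http:// or https://")
    if not parsed.netloc:
        problems.append("URL has no host name")
    if parsed.username or parsed.password:
        problems.append("URL embeds credentials; the document may require authentication")

    return problems


print(check_document_url("https://arxiv.org/pdf/2408.09869"))
print(check_document_url("example.com/document.pdf"))
```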

Error Handling

The Actor implements comprehensive error handling:

Detailed error messages in DOCLING_LOG
Proper exit codes for different failure scenarios
Automatic cleanup on failure
Dataset records with processing status
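The pattern behind these points can be sketched as follows. This is an illustration, not the Actor's actual code: the function name, the status values `SUCCEEDED` and `FAILED`, the record fields, and the exit codes 0 and 1 are all assumptions chosen for the example.

```python
import shutil
import tempfile


def run_with_cleanup(process, push_record) -> int:
    """Run process(workdir), always clean up, and always push a status record.

    process     -- callable taking a working directory and returning an output URL
    push_record -- callable that stores the status record (e.g. in a dataset)
    Returns a hypothetical exit code: 0 on success, 1 on failure.
    """
    workdir = tempfile.mkdtemp(prefix="docling-")
    record = {"status": "FAILED"}
    try:
        record["output_url"] = process(workdir)
        record["status"] = "SUCCEEDED"
        return 0
    except Exception as error:
        record["error"] = str(error)
        return 1
    finally:
        # Runs on both paths: remove temporary files, then report the outcome.
        shutil.rmtree(workdir, ignore_errors=True)
        push_record(record)
```

Placing cleanup and reporting in `finally` means a failed conversion still leaves behind a status record and no temporary files.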

Local Development

If you wish to develop or modify this Actor locally:

Clone the repository.
Ensure Docker is installed.
The Actor files are located in the .actor directory:
- Dockerfile - Defines the container environment
- actor.json - Actor configuration and metadata
- actor.sh - Main execution script that starts the docling-serve API and orchestrates document processing
- input_schema.json - Input parameter definitions
- dataset_schema.json - Dataset output format definition
- docling_processor.py - Python script for API communication
- CHANGELOG.md - Change log documenting all notable changes
- README.md - This documentation
Run the Actor locally using:
```
apify run
```

Actor Structure

.actor/
├── Dockerfile           # Container definition
├── actor.json           # Actor metadata
├── actor.sh             # Execution script (also starts docling-serve API)
├── input_schema.json    # Input parameters
├── dataset_schema.json  # Dataset output format definition
├── docling_processor.py # Python script for API communication
├── CHANGELOG.md         # Version history and changes
└── README.md            # This documentation

Architecture

This Actor uses a lightweight architecture based on the official quay.io/ds4sd/docling-serve-cpu Docker image:

Base Image: quay.io/ds4sd/docling-serve-cpu:latest (~4GB)
Multi-Stage Build: Uses a multi-stage Docker build to include only necessary tools
API Communication: Uses the RESTful API provided by docling-serve

Request Flow:

The actor script starts the docling-serve API on port 5001
Performs health checks to ensure the API is running
Processes the input parameters

Creates a JSON payload for the docling-serve API with proper format:

{
  "options": {
    "to_formats": ["md"],
    "do_ocr": true
  },
  "http_sources": [{"url": "https://example.com/document.pdf"}]
}

Makes a POST request to the /v1alpha/convert/source endpoint
Processes the response and stores it in the key-value store

Dependencies:
- Node.js for Apify CLI
- Essential tools (curl, jq, etc.) copied from build stage
Security: Runs as a non-root user for enhanced security
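The request flow described in this section can be sketched in Python. This is a hedged illustration, not the contents of actor.sh or docling_processor.py: the function names are hypothetical, and the `/health` route is an assumption about docling-serve. The port (5001), the endpoint (/v1alpha/convert/source), and the payload shape come from the text above.

```python
import json
import time
import urllib.error
import urllib.request

BASE_URL = "http://localhost:5001"        # port named in the request flow
CONVERT_PATH = "/v1alpha/convert/source"  # endpoint named in the request flow
HEALTH_PATH = "/health"                   # assumption: health-check route


def build_payload(urls, to_formats=("md",), do_ocr=True) -> dict:
    """Create the JSON payload in the format shown in the request flow."""
    return {
        "options": {"to_formats": list(to_formats), "do_ocr": do_ocr},
        "http_sources": [{"url": url} for url in urls],
    }


def wait_until_healthy(probe, attempts=30, delay=1.0, sleep=time.sleep) -> bool:
    """Call probe() until it returns True or the attempts are used up."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


def http_probe() -> bool:
    """Return True when the local API answers the health route with HTTP 200."""
    try:
        with urllib.request.urlopen(BASE_URL + HEALTH_PATH, timeout=5) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def convert(payload: dict) -> bytes:
    """POST the payload to the convert endpoint and return the raw response body."""
    request = urllib.request.Request(
        BASE_URL + CONVERT_PATH,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

A typical sequence would be `wait_until_healthy(http_probe)` followed by `convert(build_payload([...]))`. Passing the probe and the sleep function as parameters keeps the retry loop testable without a running server.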

License

This wrapper project is under the MIT License, matching the original Docling license. See ../LICENSE for details.

Acknowledgments

Docling and docling-serve-cpu by IBM
Apify for the serverless actor environment

Security Considerations

Actor runs under a non-root user for enhanced security
Input URLs are validated before processing
Temporary files are securely managed and cleaned up
Process isolation through Docker containerization
Secure handling of processing artifacts
