Docling Document Parser & Converter – Convert documents into structured data without complexity. This Actor leverages the powerful Docling library to parse and transform various document formats into clean, structured outputs ready for analysis or integration.
This Actor (specification v1) wraps the Docling project to provide serverless document processing in the cloud. It can process complex documents (PDF, DOCX, images) and convert them into structured formats (Markdown, JSON, HTML, Text, or DocTags) with optional OCR support.
Actors are serverless microservices running on the Apify Platform. They are based on the Actor SDK and can be found in the Apify Store. Learn more about Actors in the Apify Whitepaper.
Supported output formats are md, json, html, text, and doctags. Results are stored under the OUTPUT key of the run's key-value store.
    curl --request POST \
      --url "https://api.apify.com/v2/acts/vancura~docling/runs" \
      --header 'Content-Type: application/json' \
      --header 'Authorization: Bearer YOUR_API_TOKEN' \
      --data '{
        "options": {
          "to_formats": ["md", "json", "html", "text", "doctags"]
        },
        "http_sources": [
          {"url": "https://vancura.dev/assets/actor-test/facial-hairstyles-and-filtering-facepiece-respirators.pdf"},
          {"url": "https://arxiv.org/pdf/2408.09869"}
        ]
      }'
    apify call vancura/docling --input='{
      "options": {
        "to_formats": ["md", "json", "html", "text", "doctags"]
      },
      "http_sources": [
        {"url": "https://vancura.dev/assets/actor-test/facial-hairstyles-and-filtering-facepiece-respirators.pdf"},
        {"url": "https://arxiv.org/pdf/2408.09869"}
      ]
    }'
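The same request can be issued from Python. The following is a minimal sketch using only the standard library. It assumes the Apify API starts runs at `/v2/acts/{actorId}/runs`, and `build_run_request` is a hypothetical helper name, not part of any Apify SDK. The request is only constructed here; nothing is sent until `urllib.request.urlopen` is called.

```python
import json
import urllib.request

ACTOR_ID = "vancura~docling"  # Actor ID as used in the curl example above


def build_run_request(token: str, actor_input: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request that starts an Actor run.

    Assumption: the Apify API starts runs at /v2/acts/{actorId}/runs.
    """
    url = f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs"
    return urllib.request.Request(
        url,
        data=json.dumps(actor_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_run_request(
    "YOUR_API_TOKEN",
    {
        "options": {"to_formats": ["md", "json"]},
        "http_sources": [{"url": "https://arxiv.org/pdf/2408.09869"}],
    },
)
# To actually start the run (requires a valid token and network access):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Keeping request construction separate from sending makes it easy to inspect the URL, headers, and body before spending any compute on a run.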
The Actor accepts JSON input matching the schema in .actor/input_schema.json. Below is a summary of the fields:
Field | Type | Required | Default | Description |
---|---|---|---|---|
http_sources | array of objects | Yes | None | Documents to convert, each given as {"url": "..."}. See https://github.com/DS4SD/docling-serve?tab=readme-ov-file#url-endpoint |
options | object | No | None | Conversion options such as to_formats. See https://github.com/DS4SD/docling-serve?tab=readme-ov-file#common-parameters |
    {
      "options": {
        "to_formats": ["md", "json", "html", "text", "doctags"]
      },
      "http_sources": [
        {"url": "https://vancura.dev/assets/actor-test/facial-hairstyles-and-filtering-facepiece-respirators.pdf"},
        {"url": "https://arxiv.org/pdf/2408.09869"}
      ]
    }
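When the input is assembled programmatically, it can help to validate it before starting a run. The sketch below builds an input object in the shape shown above and rejects output formats this README does not list. `build_input` and `ALLOWED_FORMATS` are hypothetical names introduced for illustration; the authoritative schema remains .actor/input_schema.json.

```python
import json

# Output formats listed in this README.
ALLOWED_FORMATS = ("md", "json", "html", "text", "doctags")


def build_input(urls, to_formats=("md",)):
    """Return an Actor input dict, rejecting formats the README does not list."""
    unknown = [fmt for fmt in to_formats if fmt not in ALLOWED_FORMATS]
    if unknown:
        raise ValueError(f"Unsupported output format(s): {unknown}")
    if not urls:
        raise ValueError("http_sources is required and must not be empty")
    return {
        "options": {"to_formats": list(to_formats)},
        "http_sources": [{"url": url} for url in urls],
    }


actor_input = build_input(["https://arxiv.org/pdf/2408.09869"], ["md", "json"])
print(json.dumps(actor_input, indent=2))
```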
The Actor provides three types of outputs:
1. Processed Documents in a ZIP - The Actor prints the direct URL to your result in the run log, for example:
   You can find your results at: 'https://api.apify.com/v2/key-value-stores/[YOUR_STORE_ID]/records/OUTPUT'
2. Processing Log - Available in the key-value store as DOCLING_LOG.
3. Dataset Record - Contains processing metadata.
You can access the results in several ways:
- Direct URL: https://api.apify.com/v2/key-value-stores/[STORE_ID]/records/OUTPUT
- Apify CLI: apify key-value-stores get-value OUTPUT
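As a rough illustration of working with the result, the sketch below builds the record URL in the pattern shown above and lists the contents of a ZIP archive held in memory. Both helper names are hypothetical, and the archive in the demonstration is created locally with made-up member names, since the actual layout of the Actor's ZIP is not described here.

```python
import io
import zipfile
from urllib.parse import quote


def output_record_url(store_id: str, key: str = "OUTPUT") -> str:
    """Build the key-value store record URL in the pattern shown above."""
    return (
        "https://api.apify.com/v2/key-value-stores/"
        f"{quote(store_id, safe='')}/records/{quote(key, safe='')}"
    )


def list_zip_members(zip_bytes: bytes) -> list[str]:
    """Return the file names inside a ZIP archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return archive.namelist()


# Demonstration on an in-memory archive; the member names are made up.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("document.md", "# Document Title\n")
    archive.writestr("document.json", "{}")

print(output_record_url("YOUR_STORE_ID"))
print(list_zip_members(buffer.getvalue()))
```

In real use, the bytes passed to `list_zip_members` would come from downloading the URL returned by `output_record_url`.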
Markdown:

    # Document Title

    ## Section 1
    Content of section 1...

    ## Section 2
    Content of section 2...

JSON:

    {
      "title": "Document Title",
      "sections": [
        {
          "level": 1,
          "title": "Section 1",
          "content": "Content of section 1..."
        }
      ]
    }

HTML:

    <h1>Document Title</h1>
    <h2>Section 1</h2>
    <p>Content of section 1...</p>
The Actor maintains detailed processing logs in DOCLING_LOG. Access logs via:

    apify key-value-stores get-record DOCLING_LOG
Common issues and solutions:
- Document URL Not Accessible
- OCR Processing Fails
- API Response Issues
- Output Format Issues

For any of these, check DOCLING_LOG for specific errors.
The Actor implements comprehensive error handling; see DOCLING_LOG for details.
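For the "Document URL Not Accessible" case, a cheap pre-flight check can catch malformed URLs before a run is started. The sketch below is only a syntactic check, under the assumption that sources must be absolute http(s) URLs. `looks_fetchable` is a hypothetical helper and does not confirm that the server will actually return the document.

```python
from urllib.parse import urlsplit


def looks_fetchable(url: str) -> bool:
    """Return True if the URL is absolute and uses http or https.

    This does not contact the server; it only rules out obvious mistakes
    such as a missing scheme or a local file path.
    """
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


candidates = [
    "https://arxiv.org/pdf/2408.09869",
    "arxiv.org/pdf/2408.09869",  # missing scheme
    "file:///tmp/document.pdf",  # local path, not reachable by the Actor
]
for candidate in candidates:
    print(candidate, looks_fetchable(candidate))
```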
If you wish to develop or modify this Actor locally:
Clone the repository.
Ensure Docker is installed.
The Actor files are located in the .actor directory:
- Dockerfile - Defines the container environment
- actor.json - Actor configuration and metadata
- actor.sh - Main execution script that starts the docling-serve API and orchestrates document processing
- input_schema.json - Input parameter definitions
- dataset_schema.json - Dataset output format definition
- CHANGELOG.md - Change log documenting all notable changes
- README.md - This documentation

Run the Actor locally using:

    apify run
    .actor/
    ├── Dockerfile              # Container definition
    ├── actor.json              # Actor metadata
    ├── actor.sh                # Execution script (also starts docling-serve API)
    ├── input_schema.json       # Input parameters
    ├── dataset_schema.json     # Dataset output format definition
    ├── docling_processor.py    # Python script for API communication
    ├── CHANGELOG.md            # Version history and changes
    └── README.md               # This documentation
This Actor uses a lightweight architecture based on the official quay.io/ds4sd/docling-serve-cpu Docker image (quay.io/ds4sd/docling-serve-cpu:latest, ~4GB). The Actor script:
1. Starts the docling-serve API on port 5001
2. Performs health checks to ensure the API is running
3. Processes the input parameters
4. Creates a JSON payload for the docling-serve API in the following format:

       {
         "options": {
           "to_formats": ["md"],
           "do_ocr": true
         },
         "http_sources": [{"url": "https://example.com/document.pdf"}]
       }

5. Makes a POST request to the /v1alpha/convert/source endpoint
6. Processes the response and stores it in the key-value store
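The steps above can be sketched in Python as follows. This is not the Actor's actual actor.sh or docling_processor.py, only an illustration under stated assumptions: the API is reachable at `http://localhost:5001`, the health check path `/health` is assumed rather than taken from this README, and all function names are hypothetical. The port, payload shape, and `/v1alpha/convert/source` endpoint come from the description above.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:5001"  # port taken from the description above


def build_payload(urls, to_formats=("md",), do_ocr=True):
    """Create the JSON payload in the format shown above."""
    return {
        "options": {"to_formats": list(to_formats), "do_ocr": do_ocr},
        "http_sources": [{"url": url} for url in urls],
    }


def http_probe(url=f"{BASE_URL}/health"):
    """Return True if the URL answers with HTTP 200 (the /health path is an assumption)."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except OSError:  # covers URLError, HTTPError, refused connections, timeouts
        return False


def wait_until_healthy(probe=http_probe, attempts=30, delay=2.0):
    """Call the probe until it succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


def convert(payload):
    """POST the payload to the convert endpoint and return the raw response body."""
    request = urllib.request.Request(
        f"{BASE_URL}/v1alpha/convert/source",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


payload = build_payload(["https://example.com/document.pdf"])
print(json.dumps(payload, indent=2))
# With a docling-serve instance running locally:
# if wait_until_healthy():
#     result = convert(payload)
```

Passing the probe in as a parameter keeps the retry loop independent of the network, so it can be exercised without a running server.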
This wrapper project is under the MIT License, matching the original Docling license. See ../LICENSE for details.