Documents Extract

This endpoint extracts text from an uploaded document, optionally using OCR.

This endpoint expects a multipart form containing a file.

file (file, required)

The document file to upload.

embedImages (boolean, required)

Whether to embed images from the document.

outputFormat (string, required)

The output format for the content of the document.

chunkDocument (boolean, required)

Whether to separate the document into chunks.

chunkSize (integer, required)

The size of each chunk.

enableOCR (boolean, required)

Whether to enable OCR for document parsing.
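The parameters above are sent as form fields, so typed values have to be turned into strings first. The following is a minimal Python sketch of that step, not an official client. The function name `encode_form_fields` is hypothetical, the lowercase "true"/"false" encoding is assumed from the curl example on this page, and the positive `chunk_size` check is a client-side assumption rather than documented API behavior.

```python
def encode_form_fields(embed_images: bool, output_format: str,
                       chunk_document: bool, chunk_size: int,
                       enable_ocr: bool) -> dict:
    """Return the non-file form fields as strings, keyed by API field name."""
    # Assumption: a chunk size below 1 is never meaningful, so reject it early.
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")

    def as_flag(value: bool) -> str:
        # Assumption: booleans are sent as lowercase "true" / "false".
        return "true" if value else "false"

    return {
        "embedImages": as_flag(embed_images),
        "outputFormat": output_format,
        "chunkDocument": as_flag(chunk_document),
        "chunkSize": str(chunk_size),
        "enableOCR": as_flag(enable_ocr),
    }


# Same values as the curl example on this page.
fields = encode_form_fields(False, "markdown", True, 1000, True)
print(fields)
```

The file itself is not included here because it is attached as a separate multipart part rather than as a string field.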

Successful response.

title (string or null)

The parsed document title.

contents (string or null)

The parsed document contents.

count (integer or null)

The word count for the document.
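Because each response field may be null, client code should not assume a value is present. Here is one way this could be handled in Python, as a sketch only: the `ExtractResult` class and `parse_extract_response` function are hypothetical names, and the sample body is an illustrative string rather than real API output.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExtractResult:
    """Mirrors the three documented response fields; each may be None."""
    title: Optional[str]
    contents: Optional[str]
    count: Optional[int]


def parse_extract_response(body: str) -> ExtractResult:
    data = json.loads(body)
    # JSON null becomes None; .get() also yields None if a key is absent.
    return ExtractResult(
        title=data.get("title"),
        contents=data.get("contents"),
        count=data.get("count"),
    )


# Illustrative body with a null "contents" field.
sample = '{"title": "sample.pdf", "contents": null, "count": 3041}'
result = parse_extract_response(sample)

# Fall back to explicit defaults instead of using None directly.
text = result.contents if result.contents is not None else ""
word_count = result.count if result.count is not None else 0
print(result.title, len(text), word_count)
```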

curl -X POST https://api.predictionguard.com/documents/extract \
  -H "Toxicity: true" \
  -H "Pii: replace" \
  -H "Replace-Method: category" \
  -H "Injection: true" \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: multipart/form-data" \
  -F file=@sample.pdf \
  -F embedImages='false' \
  -F outputFormat="markdown" \
  -F chunkDocument='true' \
  -F chunkSize='1000' \
  -F enableOCR='true'

{
  "title": "sample.pdf",
  "contents": "## Sample PDF\n\n## This is a simple PDF file. Fun fun fun.\n\nLorem ipsum dolor sit amet, consectetuer adipiscing elit. Phasellus facilisis odio sed mi. Curabitur suscipit...",
  "count": 3041
}
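As a rough Python counterpart to the curl request on this page, the sketch below assembles the multipart body by hand using only the standard library. It is an illustration under stated assumptions, not an official client: `build_extract_request` is a hypothetical helper, the file bytes are a placeholder, the `application/octet-stream` part type is an assumption, and the optional `Toxicity`, `Pii`, `Replace-Method`, and `Injection` headers from the curl example are left out for brevity. The request is built but not sent, so the sketch runs offline.

```python
import uuid
import urllib.request

# Endpoint URL taken from the curl example on this page.
API_URL = "https://api.predictionguard.com/documents/extract"


def build_extract_request(token, filename, file_bytes, fields):
    """Build (but do not send) a multipart/form-data POST request."""
    boundary = uuid.uuid4().hex  # random separator between form parts
    chunks = []
    # One part per plain string field.
    for name, value in fields.items():
        head = (f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n')
        chunks.append(head.encode("utf-8") + value.encode("utf-8") + b"\r\n")
    # The file part; the generic content type is an assumption.
    file_head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    )
    chunks.append(file_head.encode("utf-8") + file_bytes + b"\r\n")
    # Closing boundary marks the end of the form.
    chunks.append(f"--{boundary}--\r\n".encode("utf-8"))
    return urllib.request.Request(
        API_URL,
        data=b"".join(chunks),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


request = build_extract_request(
    token="<token>",                          # placeholder, as in the curl example
    filename="sample.pdf",
    file_bytes=b"%PDF-1.4 placeholder bytes",  # stand-in for real file contents
    fields={
        "embedImages": "false",
        "outputFormat": "markdown",
        "chunkDocument": "true",
        "chunkSize": "1000",
        "enableOCR": "true",
    },
)
print(request.get_method(), request.full_url, len(request.data))
```

Passing `request` to `urllib.request.urlopen` would send it. A higher-level HTTP library would normally generate the boundary and body automatically; the manual version is shown only to make the wire format visible.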