Extracting data from documents

Data extraction is performed by calling the following endpoint:

https://konbert.com/api/v1/extract

You can specify input data via three sources:

  • File upload: Upload local files directly to the API. Ideal for large documents or binary files, compatible with various formats.
  • URL: Provide input data via a URL. Processes documents hosted online, allowing the API to fetch and extract data directly from web servers without local file storage.
  • File ID: Use a previously uploaded file's ID for extraction. This is useful when you've already uploaded a file and want to perform extraction on it.

Extraction Options

When extracting data from documents, you have several options to customize the process. Here are some key parameters you can use:

Input Options

  • input[format_id]: Specify the format of the input document (e.g., pdf, docx, image)
  • input[file]: Used for file uploads
  • input[url]: Used when providing a URL for the input document
  • input[file_id]: Used when referencing a previously uploaded file
  • input[options]: Additional options specific to the input format.

Available formats

The following formats (format_id) are available for unstructured data extraction:

  • Invoice (invoice)
  • Receipt (receipt)
  • Table (table)
  • W-9 Form (w9-form)
  • W-2 Form (w2-form)
  • 1099 Form (1099-form)
  • Legal Contract (legal-contract)
  • Medical Record (medical-record)
  • HR Policy Document (hr-policy)
  • Performance Review (performance-review)
  • Resume (resume)
  • Screenshot (screenshot)

General Options

  • sync: Set to "true" for synchronous extraction (wait for the result)

Response

The API will return a JSON response containing the extracted data. If the extraction is asynchronous, you'll receive a job ID that you can use to check the status and retrieve the results later.

Example

curl -X POST https://konbert.com/api/v1/extract \
-H "Authorization: Bearer YOUR_API_KEY" \
-F "input[format_id]=invoice" \
-F "input[file]=@/path/to/your/document.pdf" \
-F "sync=true"