OCR and AI pipeline

The OCR and AI pipeline converts files into structured document data. eDocify treats this as a governed pipeline, not a single black-box request.

Pipeline stages

flowchart TD
A["File received"] --> B["PDF text layer extraction"]
B --> C["OCR engine"]
C --> D["AI / rules structuring"]
D --> E["Field processing"]
E --> F["Validation checks"]
F --> G["Confidence and review reasons"]
G --> H["Verification queue"]
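
A minimal sketch of how the stages above could be chained so that each one leaves a trace entry, which is one way to keep the pipeline inspectable. The `PipelineDocument` type, the stage functions, and the canned text are hypothetical placeholders, not eDocify's implementation, and only three of the stages are shown.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class PipelineDocument:
    """Hypothetical container handed from stage to stage."""
    file_name: str
    text: str = ""
    fields: dict = field(default_factory=dict)
    review_reasons: List[str] = field(default_factory=list)
    trace: List[str] = field(default_factory=list)  # names of stages that ran


Stage = Callable[[PipelineDocument], PipelineDocument]


def extract_text(doc: PipelineDocument) -> PipelineDocument:
    # Placeholder for text layer extraction or OCR; uses canned text here.
    doc.text = "Invoice INV-1 Total 120.00"
    return doc


def structure(doc: PipelineDocument) -> PipelineDocument:
    # Placeholder for AI / rules structuring: takes the token after "Total".
    tokens = doc.text.split()
    if "Total" in tokens[:-1]:
        doc.fields["total"] = tokens[tokens.index("Total") + 1]
    return doc


def validate(doc: PipelineDocument) -> PipelineDocument:
    # Assumed critical fields; a real list would come from configuration.
    for name in ("invoice_number", "total"):
        if name not in doc.fields:
            doc.review_reasons.append(f"missing critical field: {name}")
    return doc


def run_pipeline(doc: PipelineDocument,
                 stages: List[Tuple[str, Stage]]) -> PipelineDocument:
    # Run stages in order and record each stage name after it completes.
    for name, stage in stages:
        doc = stage(doc)
        doc.trace.append(name)
    return doc


STAGES = [
    ("text_extraction", extract_text),
    ("structuring", structure),
    ("validation", validate),
]
```

Calling `run_pipeline(PipelineDocument("invoice.pdf"), STAGES)` returns a document whose `trace` lists the stages that ran and whose `review_reasons` explain why it should reach the verification queue.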

Provider types

Azure Document Intelligence

Best for invoice structure extraction and strong baseline recognition. It can return both OCR text and structured invoice fields.

OpenAI / Azure OpenAI

Used for structured JSON extraction, verification, reasoning, field suggestions, command-based filtering, and assistant actions.

Mistral OCR / Mistral AI

Used for OCR and LLM-assisted structuring. Useful as a second provider in benchmark and fallback strategies.

Local Tesseract

Cost-controlled OCR option. Works well when documents are clear and rules can structure the text. Useful for region OCR and low-cost background processing.

Local PaddleOCR

Local OCR option for image-heavy or scanned documents. Can be combined with the same eDocify Rules structuring layer.

eDocify Rules

Deterministic parser for invoices and known patterns. It is especially useful for supplier-specific rules and for extracting totals, VAT, dates, IBANs, and line item candidates.

Provider routing

Provider routing can be configured by:

  • tenant;
  • client group;
  • company;
  • document type;
  • supplier;
  • confidence requirement;
  • cost limit;
  • data residency requirement;
  • fallback policy;
  • region OCR policy.

Example:

| Scenario | Recommended route |
| --- | --- |
| New client pilot | Azure Document Intelligence + LLM verifier. |
| Known supplier template | Local OCR + eDocify Rules. |
| Low-value high-volume invoices | Tesseract or PaddleOCR + rules, fallback only on low confidence. |
| Critical invoices | Premium provider + second-pass AI verification. |
| Customer BYOK | Customer provider first, eDocify fallback if allowed. |
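
One way to picture route selection is as a list of rules where the most specific matching rule wins. The sketch below assumes that policy and covers only three of the routing dimensions listed above. The `RouteRule` type, the `select_route` function, and the provider keys are illustrative names, not eDocify configuration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

MATCH_KEYS = ("tenant", "document_type", "supplier")


@dataclass(frozen=True)
class RouteRule:
    """Hypothetical routing rule; a None criterion matches anything."""
    route: Tuple[str, ...]  # ordered provider keys, first is tried first
    tenant: Optional[str] = None
    document_type: Optional[str] = None
    supplier: Optional[str] = None

    def matches(self, context: dict) -> bool:
        for key in MATCH_KEYS:
            wanted = getattr(self, key)
            if wanted is not None and context.get(key) != wanted:
                return False
        return True

    def specificity(self) -> int:
        # Number of criteria the rule actually constrains.
        return sum(getattr(self, key) is not None for key in MATCH_KEYS)


def select_route(rules: Sequence[RouteRule], context: dict,
                 default: Tuple[str, ...] = ("azure_document_intelligence",)
                 ) -> Tuple[str, ...]:
    # Assumed policy: the most specific matching rule wins; ties go to
    # the rule listed first.
    matching = [rule for rule in rules if rule.matches(context)]
    if not matching:
        return default
    return max(matching, key=lambda rule: rule.specificity()).route


RULES = [
    RouteRule(route=("azure_document_intelligence", "llm_verifier"),
              tenant="acme"),
    RouteRule(route=("tesseract", "edocify_rules"),
              tenant="acme", supplier="known-supplier"),
]
```

With these rules, an invoice from `known-supplier` for tenant `acme` goes to the local OCR route, while any other `acme` invoice uses the broader tenant-level route.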

JSON contract

LLM providers should return JSON that conforms to a strict contract. The prompt should request:

  • header fields;
  • line items;
  • confidence per field;
  • source evidence;
  • page or region where possible;
  • validation warnings;
  • duplicate or anomaly reasons;
  • empty value handling.

The application should reject unparseable or schema-incompatible responses and record them as provider errors.
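
The rejection step can be sketched with the standard library alone. The contract shape used here (`header`, `line_items`, and `warnings` keys, with a `value` and a 0-100 `confidence` per header field) is an assumption made for illustration; the real contract may differ.

```python
import json


class ProviderError(Exception):
    """Raised when a provider response cannot be accepted."""


# Assumed top-level keys and their expected JSON types.
REQUIRED_TOP_LEVEL = {"header": dict, "line_items": list, "warnings": list}


def parse_provider_response(raw: str) -> dict:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ProviderError(f"unparseable response: {exc.msg}") from exc

    if not isinstance(data, dict):
        raise ProviderError("top-level value must be a JSON object")

    for key, expected in REQUIRED_TOP_LEVEL.items():
        if not isinstance(data.get(key), expected):
            raise ProviderError(
                f"'{key}' is missing or is not a {expected.__name__}")

    for name, entry in data["header"].items():
        if (not isinstance(entry, dict)
                or "value" not in entry
                or "confidence" not in entry):
            raise ProviderError(
                f"header field '{name}' needs 'value' and 'confidence'")
        confidence = entry["confidence"]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if (isinstance(confidence, bool)
                or not isinstance(confidence, (int, float))
                or not 0 <= confidence <= 100):
            raise ProviderError(
                f"header field '{name}' has an invalid confidence")
    return data
```

A caller would catch `ProviderError`, store the message with the run, and then decide whether the fallback policy allows another provider. An empty value is represented here as a present `value` key holding `null`, which keeps "field not found" distinct from "field omitted by the model".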

Confidence rules

Confidence is not the same as truth. It should be calibrated against golden datasets and human corrections.
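
As a rough illustration of calibration, reported confidence can be compared with how often the value turned out to be correct in a golden dataset or after human correction. The sketch below groups samples into 10-point bands and computes the observed accuracy per band. The function name and the sample format are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def observed_accuracy_by_band(samples: Iterable[Tuple[float, bool]],
                              band_width: int = 10) -> Dict[int, float]:
    """Map each band start (0, 10, ..., 90) to the share of correct values.

    Each sample is (reported confidence 0-100, whether the value was
    correct according to the golden dataset or a human correction).
    """
    counts = defaultdict(lambda: [0, 0])  # band start -> [correct, total]
    for confidence, was_correct in samples:
        band = int(confidence) // band_width * band_width
        band = min(band, 100 - band_width)  # put 100 into the top band
        counts[band][0] += int(was_correct)
        counts[band][1] += 1
    return {band: correct / total
            for band, (correct, total) in sorted(counts.items())}
```

If fields reported in the 90-100 band are correct only 75% of the time, the provider is overconfident there and the review thresholds for that provider should be tightened.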

Recommended confidence bands:

  • 95-100: safe candidate;
  • 85-94: review if the field is critical;
  • 70-84: likely needs verifier attention;
  • below 70: require manual review or a second provider;
  • missing critical field: always review.
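
The bands translate into a small decision function. This is a sketch only: the action names are invented for illustration, and accepting a non-critical field in the 85-94 band without review is an assumption, since the list above only says what happens when the field is critical.

```python
def review_action(confidence: float, is_critical: bool,
                  value_present: bool = True) -> str:
    """Return a hypothetical action name for one extracted field."""
    if is_critical and not value_present:
        return "review"  # a missing critical field is always reviewed
    if confidence >= 95:
        return "safe_candidate"
    if confidence >= 85:
        # Assumption: non-critical fields in this band pass without review.
        return "review" if is_critical else "accept"
    if confidence >= 70:
        return "verifier_attention"
    return "manual_review_or_second_provider"
```

Using lower bounds rather than closed integer ranges means a fractional score such as 94.5 still falls into a band (here 85-94) instead of slipping between two of them.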

OCR snapshots

For every run, store:

  • provider key and model;
  • prompt/rule version;
  • OCR raw text;
  • structured response;
  • fields and line items;
  • processing duration;
  • cost estimate;
  • error message;
  • confidence summary;
  • document id and tenant id.

Snapshots are needed for audit, AI learning, regression testing, and customer quality explanations.
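
A snapshot record covering the items above might look like the following sketch. The field names, types, and the JSON-lines serialization are assumptions chosen for illustration, not the eDocify storage schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class OcrSnapshot:
    """Hypothetical immutable record of one OCR / AI run."""
    document_id: str
    tenant_id: str
    provider_key: str
    model: str
    prompt_or_rule_version: str
    raw_text: str
    structured_response: dict
    fields: dict
    line_items: list
    duration_ms: int
    cost_estimate: float
    confidence_summary: dict
    error_message: Optional[str] = None  # set only when the run failed


def to_json_line(snapshot: OcrSnapshot) -> str:
    # Sorted keys keep the output stable, which helps when diffing runs
    # during regression testing.
    return json.dumps(asdict(snapshot), sort_keys=True)
```

Storing the prompt or rule version next to the raw text and the structured response is what makes a later regression comparison possible: the same document can be rerun under a new version and diffed against the stored result.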

Enterprise release governance

A new provider, prompt, or rules version should pass:

  • golden dataset benchmark;
  • critical field regression check;
  • line item regression check;
  • cost comparison;
  • latency comparison;
  • data residency review;
  • rollback plan.
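
The checklist can be expressed as an automated gate that compares a candidate against the current baseline. In this sketch the metric names, the 10% cost tolerance, and the 20% latency tolerance are placeholder assumptions rather than recommended values.

```python
from typing import List


def release_gate(baseline: dict, candidate: dict,
                 max_cost_increase: float = 0.10,
                 max_latency_increase: float = 0.20) -> List[str]:
    """Return the list of failed checks; an empty list means pass."""
    failures = []
    # Accuracy figures are assumed to come from the golden dataset run.
    if candidate["critical_field_accuracy"] < baseline["critical_field_accuracy"]:
        failures.append("critical field regression")
    if candidate["line_item_accuracy"] < baseline["line_item_accuracy"]:
        failures.append("line item regression")
    if candidate["cost_per_document"] > baseline["cost_per_document"] * (1 + max_cost_increase):
        failures.append("cost increase above tolerance")
    if candidate["p95_latency_ms"] > baseline["p95_latency_ms"] * (1 + max_latency_increase):
        failures.append("latency increase above tolerance")
    # Manual sign-offs are recorded as flags on the candidate.
    if not candidate.get("data_residency_reviewed", False):
        failures.append("data residency review missing")
    if not candidate.get("rollback_plan"):
        failures.append("rollback plan missing")
    return failures
```

Returning every failed check, rather than stopping at the first, gives the release owner the full picture in one run.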