OCR and AI pipeline

The OCR and AI pipeline converts files into structured document data. eDocify treats this as a governed pipeline, not a single black-box request.

Pipeline stages

flowchart TD
  A["File received"] --> B["PDF text layer extraction"]
  B --> C["OCR engine"]
  C --> D["AI / rules structuring"]
  D --> E["Field processing"]
  E --> F["Validation checks"]
  F --> G["Confidence and review reasons"]
  G --> H["Verification queue"]
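
The staged flow above can be sketched as an ordered list of functions that each take and return a document context. This is a minimal illustration, not eDocify's implementation: the stage functions, the `run_pipeline` helper, the dictionary keys, and the rule that OCR is skipped when a text layer exists are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stage signature: each stage receives and returns a context dict.
Stage = Callable[[Dict], Dict]


def extract_text_layer(doc: Dict) -> Dict:
    # Placeholder: a real stage would read the embedded PDF text layer.
    return {**doc, "text": doc.get("embedded_text", "")}


def run_ocr(doc: Dict) -> Dict:
    # Assumption: OCR runs only when the text layer produced nothing.
    if doc["text"].strip():
        return {**doc, "ocr_used": False}
    return {**doc, "text": doc.get("ocr_text", ""), "ocr_used": True}


STAGES: List[Tuple[str, Stage]] = [
    ("pdf_text_layer", extract_text_layer),
    ("ocr_engine", run_ocr),
    # Structuring, field processing, validation, and confidence stages
    # would follow in the same shape.
]


def run_pipeline(doc: Dict, stages: List[Tuple[str, Stage]]) -> Tuple[Dict, List[str]]:
    """Apply each stage in order and return the result plus a trace of stage names."""
    trace: List[str] = []
    for name, stage in stages:
        doc = stage(doc)
        trace.append(name)
    return doc, trace
```

Keeping a per-stage trace makes it possible to see which stage produced a given result, which fits the idea of a governed pipeline rather than a single opaque request.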

Provider types

Azure Document Intelligence

Best suited to extracting invoice structure and to providing a strong recognition baseline. It can return both OCR text and structured invoice fields.

OpenAI / Azure OpenAI

Used for structured JSON extraction, verification, reasoning, field suggestions, command-based filtering, and assistant actions.

Mistral OCR / Mistral AI

Used for OCR and LLM-assisted structuring. Useful as a second provider in benchmark and fallback strategies.

Local Tesseract

Cost-controlled OCR option. Works well when documents are clear and rules can structure the text. Useful for region OCR and low-cost background processing.

Local PaddleOCR

Local OCR option for image-heavy or scanned documents. Can be combined with the same eDocify Rules structuring layer.

eDocify Rules

Deterministic parser for invoices and known patterns. It is especially useful for supplier-specific rules, totals, VAT, dates, IBAN, and line extraction candidates.
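
As an illustration of a deterministic rule, an IBAN candidate found in OCR text can be checked with the standard mod-97 checksum before it is accepted as a field value. The function name `is_valid_iban` is hypothetical, and the sketch does not check country-specific lengths; it only shows the kind of check a rules layer could run without any AI provider.

```python
def is_valid_iban(candidate: str) -> bool:
    """Check an IBAN candidate with the mod-97 checksum (no per-country length check)."""
    iban = candidate.replace(" ", "").upper()
    # IBANs are ASCII alphanumeric and between 15 and 34 characters long.
    if not (iban.isascii() and iban.isalnum() and 15 <= len(iban) <= 34):
        return False
    # Two-letter country code followed by two check digits.
    if not (iban[:2].isalpha() and iban[2:4].isdigit()):
        return False
    # Move the first four characters to the end, then map A..Z to 10..35.
    rearranged = iban[4:] + iban[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1
```

A candidate that fails the checksum is most likely an OCR misread and can be sent to review instead of being stored as a confirmed value.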

Provider routing

Provider routing can be configured by:

tenant;
client group;
company;
document type;
supplier;
confidence requirement;
cost limit;
data residency requirement;
fallback policy;
region OCR policy.

Example:

Scenario	Recommended route
New client pilot	Azure Document Intelligence + LLM verifier.
Known supplier template	Local OCR + eDocify Rules.
Low-value high-volume invoices	Tesseract or PaddleOCR + rules, fallback only on low confidence.
Critical invoices	Premium provider + second-pass AI verification.
Customer BYOK	Customer provider first, eDocify fallback if allowed.
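
One way to express such routes is as an ordered rule list where the first matching rule wins. The sketch below is an assumption-laden example: the `Route` and `RoutingRule` types, the attribute keys, and the provider identifiers are hypothetical names, not eDocify configuration keys.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Route:
    steps: Tuple[str, ...]          # providers applied in order
    fallback: Optional[str] = None  # used only if the policy allows it


@dataclass
class RoutingRule:
    match: Dict[str, str]  # attribute values that must all be equal
    route: Route


def resolve_route(attrs: Dict[str, str], rules: List[RoutingRule], default: Route) -> Route:
    """Return the route of the first rule whose criteria all match."""
    for rule in rules:
        if all(attrs.get(key) == value for key, value in rule.match.items()):
            return rule.route
    return default


# Hypothetical rules mirroring two rows of the table; most specific first.
RULES = [
    RoutingRule({"supplier_template": "known"}, Route(("local_ocr", "edocify_rules"))),
    RoutingRule({"priority": "critical"}, Route(("premium_provider", "llm_verifier"))),
]
DEFAULT = Route(("azure_document_intelligence", "llm_verifier"))
```

Because evaluation stops at the first match, the order of the rules defines precedence between criteria such as supplier and priority.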

JSON contract

LLM providers should return a strict JSON contract. The prompt should request:

header fields;
line items;
confidence per field;
source evidence;
page or region where possible;
validation warnings;
duplicate or anomaly reasons;
empty value handling.

The application should reject unparseable or schema-incompatible responses and record them as provider errors.
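
A minimal sketch of that rejection step follows. The key names (`header_fields`, `line_items`, `validation_warnings`, `value`, `confidence`, `evidence`) and the `ProviderError` class are assumptions chosen for the example; the actual contract may use different names and a full schema validator.

```python
import json
from typing import Any, Dict

# Hypothetical contract keys; adjust to the real schema.
REQUIRED_TOP_LEVEL = ("header_fields", "line_items", "validation_warnings")
REQUIRED_FIELD_KEYS = ("value", "confidence", "evidence")


class ProviderError(Exception):
    """Raised when a provider response cannot be used."""


def parse_provider_response(raw: str) -> Dict[str, Any]:
    """Parse an LLM response and reject anything outside the contract."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ProviderError(f"unparseable JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise ProviderError("top-level value must be a JSON object")
    missing = [key for key in REQUIRED_TOP_LEVEL if key not in payload]
    if missing:
        raise ProviderError(f"missing keys: {missing}")
    header = payload["header_fields"]
    if not isinstance(header, dict):
        raise ProviderError("header_fields must be an object")
    for name, field in header.items():
        # Empty values are allowed, but as an explicit null, not an omitted key.
        if not isinstance(field, dict) or any(k not in field for k in REQUIRED_FIELD_KEYS):
            raise ProviderError(f"field {name!r} lacks one of {REQUIRED_FIELD_KEYS}")
    return payload
```

Raising a dedicated exception lets the caller record the failure as a provider error and decide separately whether to retry or fall back.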

Confidence rules

Confidence is not the same as truth. It should be calibrated against golden datasets and human corrections.

Recommended confidence bands:

95-100: safe candidate;
85-94: review if the field is critical;
70-84: likely needs verifier attention;
below 70: require manual review or a second provider;
missing critical field: always review.
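
The bands above can be written as a small decision function. This is a sketch under assumptions: the function name and the returned labels are hypothetical, scores are treated as numbers on a 0-100 scale with each band starting at its lower bound, and non-critical fields in the 85-94 band are treated as safe candidates, which the list implies but does not state.

```python
def classify_field(confidence: float, is_critical: bool, is_missing: bool = False) -> str:
    """Map a field's confidence score (0-100) to a review action."""
    # A missing critical field is always reviewed, whatever the score.
    if is_missing and is_critical:
        return "review"
    if confidence >= 95:
        return "safe_candidate"
    if confidence >= 85:
        # Assumption: only critical fields are reviewed in this band.
        return "review" if is_critical else "safe_candidate"
    if confidence >= 70:
        return "verifier_attention"
    return "manual_or_second_provider"
```

The thresholds are the recommended defaults from the list; calibration against golden datasets and human corrections may move them per field or per provider.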

OCR snapshots

For every run, store:

provider key and model;
prompt/rule version;
OCR raw text;
structured response;
fields and line items;
processing duration;
cost estimate;
error message;
confidence summary;
document id and tenant id.

Snapshots are needed for audit, AI learning, regression testing, and customer quality explanations.
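
A snapshot record covering the items listed above could look like the following. The class name, field names, and units (milliseconds, a plain float for cost) are assumptions for illustration, not the eDocify storage schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, List, Optional


@dataclass
class OcrSnapshot:
    document_id: str
    tenant_id: str
    provider_key: str
    model: str
    prompt_or_rule_version: str
    raw_text: str
    structured_response: Dict[str, Any]
    fields: Dict[str, Any]
    line_items: List[Dict[str, Any]]
    duration_ms: int
    cost_estimate: float
    confidence_summary: Dict[str, float]
    error_message: Optional[str] = None  # None when the run succeeded


def snapshot_to_json(snapshot: OcrSnapshot) -> str:
    """Serialize a snapshot deterministically so stored runs can be diffed."""
    return json.dumps(asdict(snapshot), sort_keys=True)
```

Sorting keys on serialization keeps two snapshots of the same document comparable line by line, which helps with regression testing.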

Enterprise release governance

A new provider, prompt, or rules version should pass:

golden dataset benchmark;
critical field regression check;
line item regression check;
cost comparison;
latency comparison;
data residency review;
rollback plan.
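
These checks can be combined into an automated gate that compares a candidate against the current baseline. The sketch below assumes hypothetical metric names and a 10% tolerance for cost and latency; real thresholds would come from the release policy.

```python
from typing import Dict, List


def release_gate(baseline: Dict, candidate: Dict,
                 max_cost_increase: float = 0.10,
                 max_latency_increase: float = 0.10) -> List[str]:
    """Return the list of failed checks; an empty list means the gate passes."""
    failures: List[str] = []
    # Accuracy on the golden dataset must not regress.
    for metric in ("critical_field_accuracy", "line_item_accuracy"):
        if candidate[metric] < baseline[metric]:
            failures.append(f"{metric} regressed")
    # Cost and latency may grow only within the stated tolerance.
    if candidate["cost_per_document"] > baseline["cost_per_document"] * (1 + max_cost_increase):
        failures.append("cost increase exceeds tolerance")
    if candidate["p95_latency_ms"] > baseline["p95_latency_ms"] * (1 + max_latency_increase):
        failures.append("latency increase exceeds tolerance")
    # Manual sign-offs are recorded as explicit flags.
    for flag in ("data_residency_reviewed", "rollback_plan_documented"):
        if not candidate.get(flag, False):
            failures.append(f"{flag} is not confirmed")
    return failures
```

Returning every failed check, rather than stopping at the first, gives reviewers the full picture in one benchmark run.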
