OCR and AI pipeline
The OCR and AI pipeline converts files into structured document data. eDocify treats this as a governed pipeline, not a single black-box request.
Pipeline stages
flowchart TD
A["File received"] --> B["PDF text layer extraction"]
B --> C["OCR engine"]
C --> D["AI / rules structuring"]
D --> E["Field processing"]
E --> F["Validation checks"]
F --> G["Confidence and review reasons"]
G --> H["Verification queue"]
Provider types
Azure Document Intelligence
Best for invoice structure extraction and strong baseline recognition. It can return both OCR text and structured invoice fields.
OpenAI / Azure OpenAI
Used for structured JSON extraction, verification, reasoning, field suggestions, command-based filtering, and assistant actions.
Mistral OCR / Mistral AI
Used for OCR and LLM-assisted structuring. Useful as a second provider in benchmark and fallback strategies.
Local Tesseract
Cost-controlled OCR option. Works well when documents are clear and rules can structure the text. Useful for region OCR and low-cost background processing.
Local PaddleOCR
Local OCR option for image-heavy or scanned documents. Can be combined with the same eDocify Rules structuring layer.
eDocify Rules
Deterministic parser for invoices and known patterns. It is especially useful for supplier-specific rules, totals, VAT, dates, IBAN, and line extraction candidates.
Provider routing
Provider routing can be configured by:
- tenant;
- client group;
- company;
- document type;
- supplier;
- confidence requirement;
- cost limit;
- data residency requirement;
- fallback policy;
- region OCR policy.
Example:
| Scenario | Recommended route |
|---|---|
| New client pilot | Azure Document Intelligence + LLM verifier. |
| Known supplier template | Local OCR + eDocify Rules. |
| Low-value high-volume invoices | Tesseract or PaddleOCR + rules, fallback only on low confidence. |
| Critical invoices | Premium provider + second-pass AI verification. |
| Customer BYOK | Customer provider first, eDocify fallback if allowed. |
JSON contract
LLM providers should return a strict JSON contract. The prompt should request:
- header fields;
- line items;
- confidence per field;
- source evidence;
- page or region where possible;
- validation warnings;
- duplicate or anomaly reasons;
- empty value handling.
The application should reject unparseable or schema-incompatible responses and record them as provider errors.
Confidence rules
Confidence is not the same as truth. It should be calibrated against golden datasets and human corrections.
Recommended confidence bands:
- 95-100: safe candidate;
- 85-94: review when critical field;
- 70-84: likely needs verifier attention;
- below 70: require manual review or second provider;
- missing critical field: always review.
OCR snapshots
For every run, store:
- provider key and model;
- prompt/rule version;
- OCR raw text;
- structured response;
- fields and line items;
- processing duration;
- cost estimate;
- error message;
- confidence summary;
- document id and tenant id.
Snapshots are needed for audit, AI learning, regression testing, and customer quality explanations.
Enterprise release governance
A new provider, prompt, or rules version should pass:
- golden dataset benchmark;
- critical field regression check;
- line item regression check;
- cost comparison;
- latency comparison;
- data residency review;
- rollback plan.