Intake sources and document preparation
The intake layer receives files before OCR. It is responsible for source identity, routing, preparation, security checks, duplicate detection, and retry control.
Supported intake channels
| Channel | Typical use | Tracked state and configuration |
|---|---|---|
| Portal upload | Manual upload by accountant, client, verifier, or admin | Immediate batch queue. |
| Client inbox | Per-client email alias such as invoices+client@domain | Delivery log, attachment extraction, bounce handling. |
| IMAP / Graph Mail | Customer mailbox integration | Sync state, retry, token health. |
| Google Drive | Shared folder import | Folder route and incremental sync. |
| Microsoft Graph Drive | OneDrive / Teams document import | Tenant app credentials and folder route. |
| SharePoint | Enterprise document library import | Site, library, folder, and permissions. |
| API push | System-to-system upload | API key, idempotency key, rate limit. |
| Manual import | Admin or operator action | Direct queue insertion. |
| Planned SFTP / scanner input | Enterprise batch scanning | Schedule, folder policy, file naming rules. |
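As an illustration of the client inbox channel, the sketch below resolves a client tag from a plus-addressed alias such as invoices+client@domain. The function name `client_from_alias` and the default mailbox name `invoices` are assumptions made for this example, not part of eDocify.

```python
from typing import Optional


def client_from_alias(address: str, mailbox: str = "invoices") -> Optional[str]:
    """Return the client tag from 'mailbox+client@domain', or None if absent."""
    # Split off the domain; rpartition yields an empty separator if "@" is missing.
    local, at, _domain = address.strip().lower().rpartition("@")
    if not at:
        return None
    # Split the local part into the mailbox name and the tag after "+".
    base, plus, tag = local.partition("+")
    if base != mailbox or not plus or not tag:
        return None
    return tag
```

Addresses without a tag return `None`, so the caller can send them to a manual routing queue instead of guessing a client.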
Intake item lifecycle
stateDiagram-v2
[*] --> Received
Received --> Preparing
Preparing --> QueuedForOcr
Preparing --> NeedsManualPreparation
QueuedForOcr --> Processing
Processing --> NeedsVerification
Processing --> Failed
Failed --> Retry
Retry --> QueuedForOcr
NeedsManualPreparation --> QueuedForOcr
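The lifecycle above can be enforced in code with a transition table. The following is a minimal sketch: the state names and edges are copied from the diagram, while `IntakeState`, `ALLOWED_TRANSITIONS`, and `transition` are hypothetical names chosen for this example.

```python
from enum import Enum


class IntakeState(Enum):
    RECEIVED = "Received"
    PREPARING = "Preparing"
    QUEUED_FOR_OCR = "QueuedForOcr"
    NEEDS_MANUAL_PREPARATION = "NeedsManualPreparation"
    PROCESSING = "Processing"
    NEEDS_VERIFICATION = "NeedsVerification"
    FAILED = "Failed"
    RETRY = "Retry"


# One entry per state; the edges mirror the diagram exactly.
ALLOWED_TRANSITIONS = {
    IntakeState.RECEIVED: {IntakeState.PREPARING},
    IntakeState.PREPARING: {
        IntakeState.QUEUED_FOR_OCR,
        IntakeState.NEEDS_MANUAL_PREPARATION,
    },
    IntakeState.QUEUED_FOR_OCR: {IntakeState.PROCESSING},
    IntakeState.PROCESSING: {IntakeState.NEEDS_VERIFICATION, IntakeState.FAILED},
    IntakeState.FAILED: {IntakeState.RETRY},
    IntakeState.RETRY: {IntakeState.QUEUED_FOR_OCR},
    IntakeState.NEEDS_MANUAL_PREPARATION: {IntakeState.QUEUED_FOR_OCR},
    # The diagram shows no outgoing edge from NeedsVerification.
    IntakeState.NEEDS_VERIFICATION: set(),
}


def transition(current: IntakeState, target: IntakeState) -> IntakeState:
    """Return the target state, or raise if the diagram has no such edge."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Rejecting unknown edges at a single choke point keeps connectors and workers from moving an item into a state the diagram does not allow, for example from Failed straight back to Processing.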
Document preparation
Preparation happens before OCR when the file needs cleanup or structural decisions:
- split one PDF into multiple documents;
- merge related files into one document;
- rotate pages;
- detect attachments;
- extract embedded email attachments;
- identify duplicate files by hash;
- classify document type;
- preserve both the original file and the normalized processing file.
For ABBYY-style batch processing, eDocify should make these preparation decisions automatically when confidence is high and ask for human confirmation when it is not.
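Two of the preparation steps lend themselves to a short sketch: duplicate detection by content hash, and the choice between automatic handling and human confirmation. The code assumes SHA-256 as the hash, per-tenant duplicate scope, an in-memory index, and a confidence threshold of 0.9. None of these are specified in this document, and `DuplicateIndex` and `preparation_outcome` are hypothetical names.

```python
import hashlib
from typing import Dict, Optional, Tuple


def content_hash(data: bytes) -> str:
    """Hex SHA-256 digest of the raw file bytes."""
    return hashlib.sha256(data).hexdigest()


class DuplicateIndex:
    """In-memory index of seen files; a real system would persist this."""

    def __init__(self) -> None:
        # (tenant_id, digest) -> id of the first intake item with that content
        self._seen: Dict[Tuple[str, str], str] = {}

    def check_and_register(
        self, tenant_id: str, item_id: str, data: bytes
    ) -> Optional[str]:
        """Return the id of the earlier duplicate, or None if the file is new."""
        key = (tenant_id, content_hash(data))
        existing = self._seen.get(key)
        if existing is not None:
            return existing
        self._seen[key] = item_id
        return None


def preparation_outcome(confidence: float, threshold: float = 0.9) -> str:
    """Pick the next lifecycle state for an automatic preparation decision."""
    # The threshold is an assumed value; tune it per document profile.
    if confidence >= threshold:
        return "QueuedForOcr"
    return "NeedsManualPreparation"
```

Hashing the raw bytes catches exact re-uploads only. A rescanned copy of the same invoice produces different bytes and needs a separate content-level check.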
Routing decisions
Routing can use:
- tenant;
- client group;
- company;
- source;
- inbox alias;
- file name pattern;
- sender email;
- document type;
- extracted supplier;
- amount threshold;
- language;
- selected product profile.
Routing determines the OCR/AI profile, verification queue, approval policy, archive retention, and ERP export profile.
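One way to express such routing is an ordered rule list where the first matching rule wins. The sketch below covers only four of the inputs listed above (source, inbox alias, file name pattern, amount threshold) and two of the outputs. `RoutingRule`, `route`, the field names, and first-match semantics are assumptions made for illustration.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import List, Optional


@dataclass
class RoutingRule:
    name: str
    verification_queue: str
    ocr_profile: str
    # Conditions; None means "do not check this attribute".
    source: Optional[str] = None
    inbox_alias: Optional[str] = None
    file_name_pattern: Optional[str] = None  # shell-style glob
    min_amount: Optional[float] = None

    def matches(self, item: dict) -> bool:
        if self.source is not None and item.get("source") != self.source:
            return False
        if self.inbox_alias is not None and item.get("inbox_alias") != self.inbox_alias:
            return False
        if self.file_name_pattern is not None:
            # Compare case-insensitively on every platform.
            name = item.get("file_name", "").lower()
            if not fnmatchcase(name, self.file_name_pattern.lower()):
                return False
        # A missing amount is treated as 0, so threshold rules do not match.
        if self.min_amount is not None and item.get("amount", 0) < self.min_amount:
            return False
        return True


def route(item: dict, rules: List[RoutingRule], default: RoutingRule) -> RoutingRule:
    """Return the first rule that matches, or the default rule."""
    for rule in rules:
        if rule.matches(item):
            return rule
    return default
```

Rule order carries meaning here: placing the most specific rules first avoids a broad rule (for example, one that matches only on source) shadowing a high-value threshold rule.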
Connector health center
Enterprise customers need visibility into connector reliability. A connector health dashboard should show:
- last successful sync;
- last error and incident timeline;
- affected client or company;
- source latency;
- imported document count;
- retry count;
- token expiration;
- SLA status;
- connector version;
- fallback or simulated mode warning.
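A few of these dashboard signals can be derived from a small health record. The sketch below is illustrative only: `ConnectorHealth`, `health_warnings`, the one-hour staleness limit, and the seven-day token warning window are all assumed values, not product settings.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class ConnectorHealth:
    connector: str
    last_successful_sync: Optional[datetime]
    token_expires_at: Optional[datetime]
    retry_count: int = 0
    simulated: bool = False


def health_warnings(
    health: ConnectorHealth,
    now: datetime,
    max_sync_age: timedelta = timedelta(hours=1),
    token_warning_window: timedelta = timedelta(days=7),
    production: bool = True,
) -> List[str]:
    """Return human-readable warnings for one connector."""
    warnings: List[str] = []
    if health.last_successful_sync is None:
        warnings.append("never synced")
    elif now - health.last_successful_sync > max_sync_age:
        warnings.append("sync stale")
    if health.token_expires_at is not None:
        if health.token_expires_at <= now:
            warnings.append("token expired")
        elif health.token_expires_at - now <= token_warning_window:
            warnings.append("token expiring soon")
    # Simulated adapters must never go unnoticed in production.
    if health.simulated and production:
        warnings.append("simulated mode in production")
    return warnings
```

Passing `now` explicitly, rather than reading the clock inside the function, keeps the check deterministic and easy to test. All datetimes should be either timezone-aware or naive, never mixed.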
Security checks
Recommended intake security pipeline:
- MIME validation;
- file extension validation;
- maximum size and page count limits;
- malware scan;
- encrypted PDF detection;
- PII classification;
- retention classification;
- audit event for every received file;
- quarantine state for suspicious files.
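The first few checks in this pipeline can be sketched without external libraries. The example validates the extension against an allow-list, compares the leading bytes with well-known file signatures, enforces a size limit, and flags PDFs that contain an `/Encrypt` entry. The allow-list, the 25 MiB limit, the state name `Quarantined`, and the rule that any finding quarantines the file are assumptions. The `/Encrypt` byte search is a rough heuristic and does not replace a PDF parser, and malware scanning and PII classification are out of scope here.

```python
import os
from typing import List

# Well-known leading byte signatures per allowed extension.
MAGIC_BYTES = {
    ".pdf": (b"%PDF-",),
    ".png": (b"\x89PNG\r\n\x1a\n",),
    ".jpg": (b"\xff\xd8\xff",),
    ".jpeg": (b"\xff\xd8\xff",),
}
MAX_SIZE_BYTES = 25 * 1024 * 1024  # assumed limit


def intake_findings(file_name: str, data: bytes) -> List[str]:
    """Return a list of security findings; an empty list means no issue found."""
    findings: List[str] = []
    ext = os.path.splitext(file_name)[1].lower()
    signatures = MAGIC_BYTES.get(ext)
    if signatures is None:
        findings.append("extension not allowed")
    elif not data.startswith(signatures):
        findings.append("content does not match extension")
    if len(data) > MAX_SIZE_BYTES:
        findings.append("file too large")
    # Heuristic: encrypted PDFs reference an /Encrypt dictionary.
    if ext == ".pdf" and b"/Encrypt" in data:
        findings.append("encrypted PDF")
    return findings


def initial_state(findings: List[str]) -> str:
    """Simplified policy: any finding sends the file to quarantine."""
    return "Quarantined" if findings else "Received"
```

In practice an encrypted PDF might be better sent to manual preparation than to quarantine. The single-policy function above is kept deliberately simple.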
Operational recommendations
- Keep demo connectors visibly separated from production connectors.
- Do not silently fall back to simulated adapters in production.
- Use per-tenant aliases and routing rules for client inboxes.
- Require idempotency keys for API upload.
- Preserve raw file, normalized file, OCR text, and processing state separately.
- Expose failed intake items with retry, ignore, and reroute actions.
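The idempotency-key recommendation can be illustrated with a small registry: a repeated upload with the same key and the same payload returns the original intake item, while the same key with a different payload is rejected. This is a sketch under assumptions. `UploadRegistry`, `IdempotencyConflict`, per-tenant key scope, and the in-memory store are hypothetical, and a production version would need persistent storage and key expiry.

```python
import hashlib
import uuid
from typing import Dict, Tuple


class IdempotencyConflict(Exception):
    """Same idempotency key was reused with a different payload."""


class UploadRegistry:
    def __init__(self) -> None:
        # (tenant_id, idempotency_key) -> (payload digest, intake item id)
        self._by_key: Dict[Tuple[str, str], Tuple[str, str]] = {}

    def accept(
        self, tenant_id: str, idempotency_key: str, payload: bytes
    ) -> Tuple[str, bool]:
        """Return (item_id, created). created is False for a replayed request."""
        if not idempotency_key:
            raise ValueError("idempotency key is required")
        digest = hashlib.sha256(payload).hexdigest()
        scoped_key = (tenant_id, idempotency_key)
        previous = self._by_key.get(scoped_key)
        if previous is not None:
            previous_digest, item_id = previous
            if previous_digest != digest:
                raise IdempotencyConflict(
                    f"key {idempotency_key!r} already used with different content"
                )
            return item_id, False
        item_id = str(uuid.uuid4())
        self._by_key[scoped_key] = (digest, item_id)
        return item_id, True
```

A client that times out and retries with the same key receives the same item id, so the retry does not create a second document in the OCR queue.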