Intake sources and document preparation

The intake layer receives files before OCR. It is responsible for source identity, routing, preparation, security checks, duplicate detection, and retry control.

Supported intake channels

| Channel | Typical use | Status model |
| --- | --- | --- |
| Portal upload | Manual upload by accountant, client, verifier, or admin | Immediate batch queue. |
| Client inbox | Per-client email alias such as invoices+client@domain | Delivery log, attachment extraction, bounce handling. |
| IMAP / Graph Mail | Customer mailbox integration | Sync state, retry, token health. |
| Google Drive | Shared folder import | Folder route and incremental sync. |
| Microsoft Graph Drive | OneDrive / Teams document import | Tenant app credentials and folder route. |
| SharePoint | Enterprise document library import | Site, library, folder, and permissions. |
| API push | System-to-system upload | API key, idempotency key, rate limit. |
| Manual import | Admin or operator action | Direct queue insertion. |
| Planned SFTP / scanner input | Enterprise batch scanning | Schedule, folder policy, file naming rules. |
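
As an illustration of the client inbox channel, the sketch below extracts the client tag from an alias of the form invoices+client@domain. The function name `client_from_alias` and the default base name `invoices` are assumptions made for this example, not part of eDocify.

```python
from typing import Optional


def client_from_alias(address: str, base_local: str = "invoices") -> Optional[str]:
    """Return the client tag from 'invoices+client@domain', or None if absent.

    Hypothetical helper: the alias format follows the example in the table,
    everything else is an assumption.
    """
    local, sep, _domain = address.strip().lower().rpartition("@")
    if not sep or not local:
        return None  # not an email address
    base, plus, tag = local.partition("+")
    if base != base_local or not plus or not tag:
        return None  # wrong base mailbox or no "+client" part
    return tag


print(client_from_alias("invoices+acme@example.com"))
```

Addresses without a tag return None, so the caller can send them to a default or manual route instead of guessing a client.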

Intake item lifecycle

stateDiagram-v2
    [*] --> Received
    Received --> Preparing
    Preparing --> QueuedForOcr
    Preparing --> NeedsManualPreparation
    QueuedForOcr --> Processing
    Processing --> NeedsVerification
    Processing --> Failed
    Failed --> Retry
    Retry --> QueuedForOcr
    NeedsManualPreparation --> QueuedForOcr
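
The lifecycle above can be enforced in code with a transition table. The following is a minimal sketch: the state names come from the diagram, while `State`, `TRANSITIONS`, and `advance` are hypothetical names chosen for illustration.

```python
from enum import Enum


class State(Enum):
    RECEIVED = "Received"
    PREPARING = "Preparing"
    QUEUED_FOR_OCR = "QueuedForOcr"
    NEEDS_MANUAL_PREPARATION = "NeedsManualPreparation"
    PROCESSING = "Processing"
    NEEDS_VERIFICATION = "NeedsVerification"
    FAILED = "Failed"
    RETRY = "Retry"


# Allowed transitions, copied one-to-one from the state diagram.
TRANSITIONS = {
    State.RECEIVED: {State.PREPARING},
    State.PREPARING: {State.QUEUED_FOR_OCR, State.NEEDS_MANUAL_PREPARATION},
    State.QUEUED_FOR_OCR: {State.PROCESSING},
    State.PROCESSING: {State.NEEDS_VERIFICATION, State.FAILED},
    State.FAILED: {State.RETRY},
    State.RETRY: {State.QUEUED_FOR_OCR},
    State.NEEDS_MANUAL_PREPARATION: {State.QUEUED_FOR_OCR},
    State.NEEDS_VERIFICATION: set(),  # no outgoing edge in the diagram
}


def advance(current: State, target: State) -> State:
    """Return target if the diagram allows current -> target, else raise."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


state = advance(State.RECEIVED, State.PREPARING)
state = advance(state, State.QUEUED_FOR_OCR)
```

Rejecting any transition that is not in the table keeps retry control explicit: a failed item can only re-enter the OCR queue through Retry.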

Document preparation

Preparation happens before OCR when the file needs cleanup or structural decisions:

  • split one PDF into multiple documents;
  • merge related files into one document;
  • rotate pages;
  • detect attachments;
  • extract embedded email attachments;
  • identify duplicate files by hash;
  • classify document type;
  • preserve original file and normalized processing file.
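
Duplicate detection by hash, one of the steps listed above, can be sketched as follows. SHA-256 and the in-memory `DuplicateIndex` class are assumptions for illustration. The source does not name a hash algorithm or a storage layer, and a real index would likely be persistent and scoped per tenant.

```python
import hashlib
from typing import Dict, Optional


def content_hash(data: bytes) -> str:
    """Hex SHA-256 digest of the raw file bytes (algorithm is an assumption)."""
    return hashlib.sha256(data).hexdigest()


class DuplicateIndex:
    """Hypothetical in-memory index mapping content hash -> first item id."""

    def __init__(self) -> None:
        self._seen: Dict[str, str] = {}

    def register(self, item_id: str, data: bytes) -> Optional[str]:
        """Record the file; return the id of an earlier identical file, if any."""
        digest = content_hash(data)
        original = self._seen.get(digest)
        if original is None:
            self._seen[digest] = item_id
        return original


index = DuplicateIndex()
first = index.register("item-1", b"%PDF-1.7 invoice 1001")
second = index.register("item-2", b"%PDF-1.7 invoice 1001")  # same bytes
```

Hashing the raw bytes only catches byte-identical files. A rescanned copy of the same invoice produces a different hash and would need a content-level comparison.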

For ABBYY-style batch processing, eDocify should support automatic preparation decisions, with human confirmation whenever confidence in a decision is not high enough.
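
One way to combine automatic decisions with human confirmation is a confidence threshold. In the sketch below the `PreparationDecision` type, the 0.85 default, and the mapping onto the QueuedForOcr and NeedsManualPreparation states are all assumptions, since the source does not define how confidence is measured or compared.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PreparationDecision:
    action: str        # e.g. "split", "rotate", "classify" (illustrative labels)
    confidence: float  # assumed to be a score in [0.0, 1.0]


def needs_confirmation(
    decisions: List[PreparationDecision], threshold: float = 0.85
) -> List[PreparationDecision]:
    """Return the decisions whose confidence falls below the threshold."""
    return [d for d in decisions if d.confidence < threshold]


def next_state(decisions: List[PreparationDecision], threshold: float = 0.85) -> str:
    """Queue for OCR only when every automatic decision is confident enough."""
    if needs_confirmation(decisions, threshold):
        return "NeedsManualPreparation"
    return "QueuedForOcr"


proposed = [
    PreparationDecision("rotate", 0.99),
    PreparationDecision("split", 0.62),
]
print(next_state(proposed))
```

Returning the low-confidence decisions, not just a flag, lets a reviewer see exactly which step needs confirmation.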

Routing decisions

Routing can use:

  • tenant;
  • client group;
  • company;
  • source;
  • inbox alias;
  • file name pattern;
  • sender email;
  • document type;
  • extracted supplier;
  • amount threshold;
  • language;
  • selected product profile.

Routing determines OCR/AI profile, verification queue, approval policy, archive retention, and ERP export profile.
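
A routing decision of this kind can be modelled as an ordered rule list where the first match wins. This is a sketch under assumptions: the `Rule` and `Route` types, the field names, and the first-match policy are illustrative and not taken from eDocify. Only three of the outputs named above are modelled.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class Route:
    ocr_profile: str
    verification_queue: str
    erp_export_profile: str


@dataclass
class Rule:
    route: Route
    equals: Dict[str, str] = field(default_factory=dict)  # exact-match inputs
    file_name_pattern: Optional[str] = None                # glob, case-sensitive
    min_amount: Optional[float] = None                     # amount threshold


def matches(rule: Rule, item: Dict[str, Any]) -> bool:
    """True when every condition set on the rule holds for the item."""
    for key, expected in rule.equals.items():
        if item.get(key) != expected:
            return False
    if rule.file_name_pattern is not None:
        if not fnmatchcase(item.get("file_name", ""), rule.file_name_pattern):
            return False
    if rule.min_amount is not None and item.get("amount", 0.0) < rule.min_amount:
        return False
    return True


def route_item(rules: List[Rule], item: Dict[str, Any], default: Route) -> Route:
    """Return the route of the first matching rule, or the default route."""
    for rule in rules:
        if matches(rule, item):
            return rule.route
    return default


DEFAULT = Route("standard", "general", "default-erp")
HIGH_VALUE = Route("invoice-precise", "senior-verifiers", "default-erp")
rules = [Rule(HIGH_VALUE, equals={"document_type": "invoice"}, min_amount=10000.0)]
item = {"tenant": "t1", "document_type": "invoice", "amount": 12500.0}
chosen = route_item(rules, item, DEFAULT)
```

Rule order matters in a first-match design, so more specific rules belong before general ones.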

Connector health center

Enterprise customers need visibility into connector reliability. A connector health dashboard should show:

  • last successful sync;
  • last error and incident timeline;
  • affected client or company;
  • source latency;
  • imported document count;
  • retry count;
  • token expiration;
  • SLA status;
  • connector version;
  • fallback or simulated mode warning.
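
Several of these indicators can be derived from a small per-connector record. The sketch below is illustrative only: the `ConnectorHealth` fields, the one-hour staleness limit, and the seven-day token warning window are assumed values, not documented eDocify settings.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class ConnectorHealth:
    name: str
    last_successful_sync: Optional[datetime]
    token_expires_at: Optional[datetime]
    retry_count: int = 0
    simulated: bool = False  # True when running a fallback/simulated adapter


def health_warnings(
    h: ConnectorHealth,
    now: datetime,
    max_sync_age: timedelta = timedelta(hours=1),   # assumed limit
    token_warning: timedelta = timedelta(days=7),   # assumed warning window
) -> List[str]:
    """Return warning codes for the dashboard; an empty list means healthy."""
    out: List[str] = []
    if h.simulated:
        out.append("simulated_mode")
    if h.last_successful_sync is None or now - h.last_successful_sync > max_sync_age:
        out.append("sync_stale")
    if h.token_expires_at is not None:
        if h.token_expires_at <= now:
            out.append("token_expired")
        elif h.token_expires_at - now <= token_warning:
            out.append("token_expiring_soon")
    return out


now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
drive = ConnectorHealth(
    name="google-drive",
    last_successful_sync=now - timedelta(minutes=30),
    token_expires_at=now + timedelta(days=30),
)
print(health_warnings(drive, now))
```

Passing `now` in explicitly keeps the check deterministic and easy to test.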

Security checks

Recommended intake security pipeline:

  • MIME validation;
  • file extension validation;
  • maximum size and page count limits;
  • malware scan;
  • encrypted PDF detection;
  • PII classification;
  • retention classification;
  • audit event for every received file;
  • quarantine state for suspicious files.
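
A few of these checks can be sketched with the standard library alone. The example covers MIME validation by magic bytes, extension validation, a size limit, and a rough encrypted-PDF heuristic. The accepted types, the 25 MiB limit, and the finding codes are assumptions. Malware scanning, page counting, and PII classification need dedicated tools and are left out.

```python
import os
from typing import List, Optional

# Accepted types: magic-byte prefix and allowed extensions (illustrative set).
MAGIC = {
    "application/pdf": (b"%PDF-", {".pdf"}),
    "image/png": (b"\x89PNG\r\n\x1a\n", {".png"}),
    "image/jpeg": (b"\xff\xd8\xff", {".jpg", ".jpeg"}),
}
MAX_BYTES = 25 * 1024 * 1024  # assumed limit, not a documented value


def sniff_mime(data: bytes) -> Optional[str]:
    """Detect the type from the leading bytes, ignoring the declared name."""
    for mime, (prefix, _extensions) in MAGIC.items():
        if data.startswith(prefix):
            return mime
    return None


def check_file(file_name: str, data: bytes) -> List[str]:
    """Return finding codes; a non-empty list suggests quarantine or review."""
    findings: List[str] = []
    if len(data) > MAX_BYTES:
        findings.append("too_large")
    mime = sniff_mime(data)
    if mime is None:
        findings.append("unsupported_type")
        return findings
    extension = os.path.splitext(file_name)[1].lower()
    if extension not in MAGIC[mime][1]:
        findings.append("extension_mismatch")
    # Heuristic only: encrypted PDFs carry an /Encrypt entry in the trailer,
    # but a plain byte search can produce false positives.
    if mime == "application/pdf" and b"/Encrypt" in data:
        findings.append("encrypted_pdf")
    return findings


print(check_file("invoice.pdf", b"%PDF-1.7\n1 0 obj\n"))
```

Comparing the sniffed type with the extension catches renamed files, such as an executable uploaded with a .pdf suffix.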

Operational recommendations

  • Keep demo connectors visibly separated from production connectors.
  • Do not silently fall back to simulated adapters in production.
  • Use per-tenant aliases and routing rules for client inboxes.
  • Require idempotency keys for API upload.
  • Preserve raw file, normalized file, OCR text, and processing state separately.
  • Expose failed intake items with retry, ignore, and reroute actions.
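
The idempotency-key recommendation can be illustrated with a small sketch. The `UploadStore` class, its in-memory dictionary, and the choice to reject a reused key that arrives with a different payload are assumptions for this example. The source only states that keys should be required.

```python
import hashlib
from typing import Dict, Tuple


class UploadStore:
    """Hypothetical in-memory store keyed by (tenant, idempotency key)."""

    def __init__(self) -> None:
        self._by_key: Dict[Tuple[str, str], Tuple[str, str]] = {}
        self._counter = 0

    def accept(self, tenant: str, idempotency_key: str, data: bytes) -> Tuple[str, bool]:
        """Return (item_id, created). A repeated key returns the original item."""
        if not idempotency_key:
            raise ValueError("idempotency key is required")
        digest = hashlib.sha256(data).hexdigest()
        key = (tenant, idempotency_key)
        existing = self._by_key.get(key)
        if existing is not None:
            item_id, stored_digest = existing
            if stored_digest != digest:
                # Same key, different payload: treat as a client error.
                raise ValueError("idempotency key reused with a different payload")
            return item_id, False
        self._counter += 1
        item_id = f"item-{self._counter}"
        self._by_key[key] = (item_id, digest)
        return item_id, True


store = UploadStore()
first = store.accept("tenant-a", "req-001", b"invoice bytes")
retry = store.accept("tenant-a", "req-001", b"invoice bytes")  # client retry
```

Scoping the key by tenant means two tenants can use the same key string without colliding, and a network retry never creates a second intake item.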