5 Tips for Improving OCR Accuracy When Scanning Invoices
Practical guidance for accountants and businesses on how to maximize OCR extraction accuracy when processing invoices through VirtuAc's AI pipeline.
Optical character recognition has advanced dramatically over the past several years, but it is not infallible. A modern AI-powered OCR engine can extract structured data from the vast majority of invoices with very high accuracy, provided the input document meets certain basic quality requirements. When the document quality falls below those requirements, accuracy drops, and that drop translates directly into flagged records, manual corrections, and time spent by your accounting team reviewing exceptions.
VirtuAc uses a multi-engine pipeline to minimize the impact of document quality issues. The primary AI engine handles extraction, and when any critical field falls below the confidence threshold, the document is automatically re-processed by a secondary engine. The higher-confidence result is used for the final record. If a critical field scores below the minimum threshold on both engines, the document enters a human review queue rather than being processed automatically.
Understanding how this system works, and how the quality of incoming documents affects it, gives you practical tools to reduce your review queue and increase the proportion of invoices that process fully automatically.
Tip 1: Use PDF Format Where Possible
When a supplier can provide an invoice as a PDF, request that format rather than a printed copy that you then scan or a photo sent via WhatsApp. A native PDF (one generated by accounting software or a word processor, as opposed to a scan converted to PDF) contains embedded text, vector graphics, and precise layout information. VirtuAc’s AI engine can read the embedded text directly, bypassing the image analysis step entirely for the text content, and the result is close to 100% extraction accuracy on well-structured fields.
Even a scanned PDF is preferable to a JPEG photograph in most cases. When a document is scanned at adequate resolution and saved as PDF, the file preserves the page geometry more faithfully than a handheld photograph, and the layout analysis stage performs better as a result.
Suppliers who use invoicing software (a significant proportion of Israeli businesses use Hashavshevet, Priority, or Green Invoice) can almost always export their invoices as PDFs without any additional work. If you have suppliers who still provide only paper invoices, encourage them to use their software’s PDF export function. Where that is not possible, the remaining tips apply.
Tip 2: Minimum 200 DPI for Scanned Documents
DPI (dots per inch) measures the resolution of a scanned document. At 200 DPI, most printed text is legible to both human readers and OCR engines. At 300 DPI, accuracy improves further, particularly for small text (below 8pt) and characters with fine details, such as the letters in Hebrew script or numbers printed close together on a crowded invoice.
Below 150 DPI, OCR accuracy degrades noticeably, particularly for numbers. Since invoice amounts, dates, and tax identification numbers are all numeric, low-resolution scans frequently introduce digit substitution errors (3 misread as 8, 5 as 6, and so on) that are difficult to catch without careful manual review.
Most modern flatbed scanners default to 200 DPI or 300 DPI. If your office scanner is producing low-resolution output, check its settings. The default may have been changed, or the scanner may be operating in a “fast draft” mode intended for quick copies rather than high-accuracy scanning. Set the scanner to 300 DPI for invoices. The file size increase is modest (typically 2 to 4 times larger than a 100 DPI scan for a typical A4 invoice), and the accuracy benefit is substantial.
When using a multifunction printer/scanner, check both the scanner settings and any companion software. Some software overrides the hardware scanner settings and applies additional compression or downsampling during file creation.
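To make the DPI figures above concrete, the short calculation below estimates how many pixels an A4 page and a line of 8pt text occupy at different scan resolutions. It is rough arithmetic only: the A4 dimensions and the 72 points per inch conversion are standard, the function names are illustrative, and nothing here reflects how VirtuAc measures resolution internally.

```python
# Rough arithmetic only; nothing here reflects VirtuAc internals.
A4_INCHES = (8.27, 11.69)   # A4 width and height in inches
POINTS_PER_INCH = 72        # typographic points per inch


def page_pixels(dpi):
    """Pixel dimensions of an A4 page scanned at the given DPI."""
    width, height = A4_INCHES
    return round(width * dpi), round(height * dpi)


def text_height_pixels(point_size, dpi):
    """Approximate pixel height of the type body for a given point size."""
    return point_size / POINTS_PER_INCH * dpi


for dpi in (100, 150, 200, 300):
    w, h = page_pixels(dpi)
    px = text_height_pixels(8, dpi)
    print(f"{dpi} DPI: {w} x {h} px, 8pt text is about {px:.0f} px tall")
```

An 8pt line that spans roughly 22 pixels at 200 DPI shrinks to about 11 pixels at 100 DPI, and individual digits are smaller than the nominal point size, which leaves few pixels to separate a 3 from an 8. Note also that the raw pixel count grows with the square of the DPI, so compression is what keeps the file size increase manageable.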
Tip 3: Photography Tips for Smartphone Users
Where a scanner is not available and invoices are photographed using a smartphone, the following practices significantly improve OCR accuracy:
Use even, diffuse lighting. Natural light from a window, positioned to the side of the document rather than directly overhead, works well. Direct overhead light from a single strong source creates shadows in the centre of the document where the paper curves. A fluorescent ceiling light in a typical office often creates a glare spot in the middle of the photograph that obscures the text beneath it.
Lay the document on a flat, dark-coloured surface. A flat surface eliminates distortion caused by paper curvature. A dark background (a desk, a folder, a clipboard) creates strong contrast between the background and the white invoice, which helps the document detection algorithm identify the page edges precisely and apply the correct perspective correction.
Hold the phone directly above the document, parallel to its surface. Even a 10 to 15 degree angle introduces perspective distortion that OCR engines must correct. While modern models handle mild distortion well, severe angles cause character misalignment that reduces confidence scores.
Photograph in landscape orientation if the invoice is wider than it is tall. This maximizes the resolution allocated to the document content.
Avoid flash when possible. Camera flash creates a bright reflection spot on any slightly glossy invoice stock, which obscures text in that region. If the lighting environment is too dark for a flash-free photograph, improve the ambient lighting rather than using flash.
VirtuAc’s mobile-friendly upload interface includes a live camera view with an overlay guide that helps users frame the invoice correctly. Sharing the VirtuAc upload link with clients who submit via web upload gives them access to this guided capture tool.
Tip 4: Understand the Confidence Threshold System
VirtuAc’s extraction pipeline assigns a confidence score to each extracted field. These scores reflect how certain the AI engine is that the extracted value is correct.
The thresholds work as follows:
A field confidence score above 80% means the extraction is considered reliable and the value is accepted automatically without any flag.
A field confidence score between 60% and 80% triggers a secondary extraction pass. If the secondary engine scores the same field above 80%, that result is used. If both passes return scores in the 60% to 80% range, the record is accepted but flagged for a light-touch review, meaning the accountant sees the record with the low-confidence fields highlighted in amber.
A field confidence score below 60% on both engines (or a critical field that could not be detected at all) causes the entire record to enter a “Manual entry required” status, where a human must supply the correct values before the record can be exported.
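The three bands above can be sketched as a small routing function. This is a minimal illustration, not VirtuAc's actual implementation: the names route_field, route_record, and Status are hypothetical, scores are expressed as fractions between 0 and 1, and two points the description leaves open are filled in by assumption. Scores of exactly 60% or 80% are treated as amber, and the higher of the two engine scores decides the outcome. The sketch also applies the manual-entry rule to every field, whereas the undetected-field rule described above applies to critical fields.

```python
from enum import Enum

# Thresholds taken from the text. Treating exactly 0.80 and 0.60 as amber
# is an assumption; the text does not say which side they fall on.
ACCEPT_ABOVE = 0.80
MANUAL_BELOW = 0.60


class Status(Enum):
    ACCEPTED = "accepted"
    AMBER = "flagged for light-touch review"
    MANUAL = "manual entry required"


def route_field(primary, secondary=None):
    """Classify one field from its confidence scores (0.0 to 1.0).

    None means the engine did not detect the field or was not run.
    Assumption: the higher of the two scores decides the outcome.
    """
    scores = [s for s in (primary, secondary) if s is not None]
    if not scores:
        return Status.MANUAL
    best = max(scores)
    if best > ACCEPT_ABOVE:
        return Status.ACCEPTED
    if best < MANUAL_BELOW:
        return Status.MANUAL
    return Status.AMBER


def route_record(fields):
    """Classify a record from a dict of field name -> (primary, secondary)."""
    statuses = [route_field(p, s) for p, s in fields.values()]
    if Status.MANUAL in statuses:
        return Status.MANUAL
    if Status.AMBER in statuses:
        return Status.AMBER
    return Status.ACCEPTED


invoice = {
    "total_amount": (0.95, None),    # accepted on the first pass
    "invoice_date": (0.72, 0.88),    # rescued by the secondary engine
    "vendor_tax_id": (0.66, 0.71),   # both passes in the amber band
}
print(route_record(invoice).value)
```

In this sample the record is accepted but flagged for light-touch review, because the tax ID stays in the amber band on both passes while no field drops below 60%.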
Understanding this system helps you prioritize where to invest effort in improving document quality. If your review queue shows many amber-flagged records for the vendor tax ID field specifically, for example, that suggests the tax ID area on those invoices has a quality issue: perhaps it is printed in a very small font, partially obscured by a stamp, or positioned inconsistently on the invoices from that supplier. Addressing the document quality at the source, or flagging those suppliers for manual review as a workflow rule, is more efficient than reviewing each invoice individually.
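One way to spot such patterns is to tally amber flags by supplier and field. The sketch below assumes you can list the flagged fields from your review queue as (supplier, field) pairs; that list format, the sample data, and the flag_hotspots helper are hypothetical and do not describe a VirtuAc export.

```python
from collections import Counter

# Hypothetical sample of amber-flagged fields as (supplier, field) pairs.
amber_flags = [
    ("Supplier A", "vendor_tax_id"),
    ("Supplier A", "vendor_tax_id"),
    ("Supplier B", "invoice_date"),
    ("Supplier A", "vendor_tax_id"),
    ("Supplier C", "total_amount"),
    ("Supplier B", "invoice_date"),
]


def flag_hotspots(flags, min_count=3):
    """Return (supplier, field, count) for pairs flagged at least min_count times."""
    counts = Counter(flags)
    return [
        (supplier, field, n)
        for (supplier, field), n in counts.most_common()
        if n >= min_count
    ]


for supplier, field, n in flag_hotspots(amber_flags):
    print(f"{supplier}: {field} flagged {n} times")
```

With this sample, only the tax ID field on Supplier A's invoices crosses the cutoff, which makes it the first candidate for a conversation with the supplier or a manual-review workflow rule.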
Tip 5: Use the Correction Interface to Improve Baseline Accuracy
Every time an accountant corrects an extracted field in VirtuAc, that correction is recorded and feeds back into the system’s confidence calibration for similar documents. VirtuAc maintains a supplier-specific extraction profile that tracks which fields tend to require correction for invoices from a given supplier.
Over time, this profile allows VirtuAc to apply supplier-specific post-processing rules. For example, if invoices from a particular supplier consistently have the allocation number in an unusual location (say, embedded mid-page in a notes field rather than printed at the top), the system learns to check that location specifically when processing documents from that supplier.
The practical implication is that the more consistently you use the correction interface rather than bypassing it (for example, by deleting and re-entering a record manually), the faster the per-supplier accuracy improves. The correction interface is accessible directly from the invoice detail view: click any extracted field to edit it, and VirtuAc records the original extracted value alongside the corrected value.
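As a rough mental model of what a supplier-specific profile might track, the sketch below keeps each original value next to its correction and reports how often a field needed fixing. The SupplierProfile class, its method names, and the sample values are all hypothetical; VirtuAc's internal data model is not described in this article, and the sketch assumes at most one correction per field per invoice.

```python
from dataclasses import dataclass, field


@dataclass
class SupplierProfile:
    """Hypothetical per-supplier record of field corrections."""
    invoices_processed: int = 0
    # Maps field name -> list of (original, corrected) pairs.
    corrections: dict = field(default_factory=dict)

    def record_invoice(self):
        self.invoices_processed += 1

    def record_correction(self, field_name, original, corrected):
        # Keep the original extraction next to the corrected value.
        self.corrections.setdefault(field_name, []).append((original, corrected))

    def correction_rate(self, field_name):
        """Share of processed invoices on which this field was corrected."""
        if self.invoices_processed == 0:
            return 0.0
        return len(self.corrections.get(field_name, [])) / self.invoices_processed


profile = SupplierProfile()
for _ in range(4):
    profile.record_invoice()
# Placeholder values for illustration only.
profile.record_correction("allocation_number", "", "123456789")
profile.record_correction("allocation_number", "12345678", "123456789")
print(profile.correction_rate("allocation_number"))
```

A record that is deleted and re-entered by hand would never reach record_correction in a model like this, which is why corrections made through the interface are the ones that can inform later processing.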
Bonus: Document Types That Always Require Manual Review
Certain document types should be routed to manual review from the outset, regardless of apparent image quality:
Handwritten invoices. AI OCR engines handle printed text with much higher accuracy than handwriting. Handwritten Hebrew invoice data, particularly numbers and dates written in non-standard formats, has significantly lower extraction accuracy. If you receive handwritten invoices from any supplier, flag those suppliers in VirtuAc so their documents always enter the manual review queue.
Faded thermal receipts. Many fuel receipts, parking receipts, and small purchase receipts in Israel are printed on thermal paper that fades over time. A receipt that looked legible when first received may be partially illegible by the time it is scanned. If possible, photograph thermal receipts as soon as you receive them, before the print begins to fade.
Non-standard invoice layouts. Most Israeli businesses use accounting software that produces invoices in one of a handful of standard formats. However, invoices received from freelancers, overseas suppliers, or very small businesses may use entirely custom layouts that do not align well with key-value pair extraction models. These invoices frequently require field-by-field manual review. You can use the free-form extraction view in VirtuAc, which presents the full extracted text alongside the invoice image, to copy the correct values into the structured fields manually.
Invoices with stamps or handwritten annotations over printed text. A common practice in some Israeli businesses is to apply an approval stamp directly over the invoice text, or to write corrections or annotations on the document surface. If the stamp or annotation overlaps a critical field, extraction accuracy for that field is typically very low.
The goal of VirtuAc’s confidence scoring system is to make these edge cases visible rather than silently inaccurate. A document that a human eye would recognize as difficult to read will score below the confidence threshold and enter the review queue rather than propagating incorrect data into your accounting system.
To see how VirtuAc handles your current invoice mix in practice, start a free trial. You can process a representative sample of real documents during the trial period to assess extraction accuracy before committing. The features page has more detail on the full OCR pipeline and all supported ingestion channels.