In-place PDF rebuild — why this is the hardest job in DTP
The hardest DTP job is the one where there is no source file. A French tax return, a German CE safety leaflet, a clinical patient information leaflet (PIL), a patent filing: these are all PDFs the client received from a regulator or an agency, with no editable InDesign or Word original behind them. Most translation agencies handle this by extracting the text into a Word document, translating it, and sending the translation back as a fresh document that does not match the regulator's template. That is unacceptable for filings, because the regulator expects the original form back, in the target language, looking exactly like the original.
The technical problem
PDFs do not store text as text. They store it as a sequence of glyph-drawing operations in a page content stream: in effect, "select font Helvetica, move to coordinate (123, 456), show character code 67". The character code maps to a glyph through the font's encoding, and nothing in the stream is required to say which words, lines, or paragraphs the glyphs belong to. Most agencies use a PDF-to-Word converter that discards the coordinates and the font references; the result is editable but no longer looks like the source. The next level of agency uses Adobe Acrobat's "Export to InDesign", which preserves more of the layout but breaks on three categories of content:
- Outlined glyphs. Many regulatory forms (impots.gouv.fr, German Steuererklärung, AEMPS) convert their text labels to vector outlines at publication. Acrobat's extractor reads outlined text as a graphic, not a string. The translator never sees the field label and the regulator never gets its template back.
- Form fields. AcroForm and XFA fields carry their own labels, tooltips, and font references, stored separately from the page text. Extracting the surrounding text but not the field labels produces a half-translated document.
- Scanned hybrids. A PDF that is part vector, part raster (typical for regulator-supplied documents with a scanned signature block) needs OCR on the raster portion and coordinate-preserving extraction on the vector portion. A general-purpose converter typically does one or the other, not both.
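Both the glyph-operation model and the outlined-glyph failure are visible on a toy content stream. The sketch below is a minimal Python illustration, not our engine. It assumes hand-written, uncompressed fragments, literal strings without escapes or nesting, and only the `Tf`, `Td`, and `Tj` text operators. The names `text_ops`, `LIVE_TEXT`, and `OUTLINED` are hypothetical.

```python
import re

# Hand-written fragments for illustration. Real content streams are usually
# compressed and use many more operators than the three handled here.
LIVE_TEXT = b"BT /F1 12 Tf 123 456 Td (Total) Tj 0 -14 Td (Date) Tj ET"
OUTLINED = b"123 456 m 130 470 l 137 456 l f"  # a filled path, no text operators

TOKEN = re.compile(
    rb"\((?P<str>[^()\\]*)\)"      # literal string without escapes or nesting
    rb"|(?P<name>/[^\s()/]+)"      # name object such as /F1
    rb"|(?P<num>-?\d+(?:\.\d+)?)"  # integer or real operand
    rb"|(?P<op>[A-Za-z*]+)"        # operator keyword
)

def text_ops(stream: bytes):
    """Yield (font, size, x, y, raw_bytes) for each Tj operator.

    x and y track the start of the current text line as moved by Td.
    raw_bytes are character codes, not necessarily readable text.
    """
    operands = []
    font, size = None, None
    x = y = 0.0
    for m in TOKEN.finditer(stream):
        kind = m.lastgroup
        if kind == "str":
            operands.append(m.group("str"))
        elif kind == "name":
            operands.append(m.group("name").decode("ascii"))
        elif kind == "num":
            operands.append(float(m.group("num")))
        else:
            op = m.group("op")
            if op == b"BT":        # begin text object: reset position
                x = y = 0.0
            elif op == b"Tf":      # select font and size
                font, size = operands[-2], operands[-1]
            elif op == b"Td":      # move line start by (tx, ty)
                x += operands[-2]
                y += operands[-1]
            elif op == b"Tj":      # show a string at the current position
                yield (font, size, x, y, operands[-1])
            operands.clear()       # operands belong to one operator only

live = list(text_ops(LIVE_TEXT))
outlined = list(text_ops(OUTLINED))
```

Here `live` holds two entries, `b"Total"` at (123.0, 456.0) and `b"Date"` at (123.0, 442.0), both in `/F1` at size 12.0, while `outlined` is empty: the path fragment draws a shape but contains nothing a text extractor can read, which is why an outlined label has to be recovered from its geometry instead.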
Our engine
We built our PDF engine specifically for this category. It enumerates every text-drawing operation in the PDF, including outlined glyphs (by detecting common label fonts and reverse-mapping the path back to characters), reads form field labels and tooltips, runs OCR over raster regions, and produces a per-span translation manifest. The translator works in the manifest; the engine writes the translation back at the source coordinates. Form fields are preserved; the regulator sees the same form, in their target language, with the right field labels.
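A per-span manifest can be pictured as a list of records, each tying one piece of source text to the place it came from. The Python sketch below is illustrative only: the `Span` fields, the `origin` labels, and the function names are assumptions made for this example and do not describe the engine's actual manifest format.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class Span:
    span_id: str
    page: int
    bbox: tuple        # (x0, y0, x1, y1), copied from the source PDF
    origin: str        # hypothetical labels: "text", "outline", "field", "ocr"
    source: str
    target: str = ""   # filled in by the translator

def to_manifest(spans):
    """Serialise spans to JSON for the translator to work in."""
    return json.dumps([asdict(s) for s in spans], ensure_ascii=False, indent=2)

def apply_translations(spans, translations):
    """Fill targets by span_id; return the ids that are still untranslated."""
    missing = []
    for s in spans:
        s.target = translations.get(s.span_id, "")
        if not s.target:
            missing.append(s.span_id)
    return missing

def placements(spans):
    """Write-back instructions: translated text at the source coordinates."""
    return [(s.page, s.bbox, s.target) for s in spans if s.target]

spans = [
    Span("p1-s1", 1, (72, 700, 300, 714), "outline", "Nom de naissance"),
    Span("p1-s2", 1, (72, 680, 300, 694), "field", "Date de naissance"),
]
missing = apply_translations(spans, {"p1-s1": "Birth name"})
```

After this run, `missing` is `["p1-s2"]` and `placements(spans)` contains only the first span, still at its original bounding box. Keeping the coordinates and the origin on every record is what lets the write-back step put each translation where the source text was, and a non-empty `missing` list is a simple guard against returning a half-translated form.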
What this delivers
A French tax return that looks like a French tax return. A German CE leaflet that matches the German regulator's template. A patient information leaflet that the EMA accepts on first review. We were built for this; most agencies were not.