PDFs generated by Firefox are missing a character when parsed by text-extraction tools, or when their text is copypasted from within any PDF viewer
Categories
(Firefox :: PDF Viewer, defect, P3)
Tracking
()
People
(Reporter: haik, Assigned: calixte)
References
Details
Attachments
(4 files)
When opening the attached input.pdf file and printing it from Firefox, the resulting PDF is missing an i character (the first i in the word confidential) when parsed by PDF text-extraction tools. The i does appear in the viewed PDF, but there appears to be something missing from the encoding such that, when parsed by some tools, the character is missing.
Examining the generated PDF in Firefox using the Developer Tools Inspector reveals that the i is treated differently. There is a missing <span> element where the i is displayed.
Using xpdf, here's the differing output from running pdftotext on the provided input.pdf and on the PDFs produced by printing it from Firefox and Chrome.
Reproduced on Windows and macOS.
$ pdftotext input.pdf -
Sdsd D Sdsds confidential123
Sd S D D Sd Sd S
$ pdftotext firefox.pdf -
Sdsd D Sdsds confdential123
Sd S D D Sd Sd S
$ pdftotext chrome.pdf -
Sdsd D Sdsds confidential123
Sd S D D Sd Sd S
$ diff firefox.txt chrome.txt
1c1
< Sdsd D Sdsds confdential123
---
> Sdsd D Sdsds confidential123
$ pdftotext -v
pdftotext version 4.05 [www.xpdfreader.com]
Copyright 1996-2024 Glyph & Cog, LLC
Thanks to Broadcom for reporting the bug.
Reporter
Comment 1•2 months ago
Reporter
Comment 2•2 months ago
Reporter
Updated•2 months ago
Reporter
Comment 3•2 months ago
I attempted to track this down with mozregression, but I found that the text from PDFs generated with earlier versions of Firefox wasn't printed by the xpdf pdftotext tool, so I couldn't find a version that generated a PDF that included the correct text. If I treated those cases as "good", I ended up with this pushlog.
Comment 4•2 months ago
I'm guessing this has to do with the PDF's included fonts -- e.g. maybe the "i" character gets rendered using a font that we mistakenly remove, and it ends up getting turned into a (vector) graphical representation as part of us dropping that font, perhaps?
Looks like the original PDF has two embedded fonts, which appear to be two distinct subsets of the "Aptos" font with different encodings, though they have the same prefix on their names (not sure if that naming pattern is fine or if it might be causing trouble):
$ pdffonts /tmp/input.pdf
name type encoding emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
BCDEEE+Aptos CID TrueType Identity-H yes yes yes 5 0
BCDFEE+Aptos TrueType WinAnsi yes yes no 12 0
The file that Chrome generates looks almost identical (I think only the object ID values are different):
name type encoding emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
BCDEEE+Aptos CID TrueType Identity-H yes yes yes 8 0
BCDFEE+Aptos TrueType WinAnsi yes yes no 16 0
The file that Firefox generates (locally on my system) only has one embedded font:
$ pdffonts /tmp/firefox.pdf
name type encoding emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
NMVQEC+Aptos TrueType WinAnsi yes yes yes 11 0
For completeness/interest, I also used pdftocairo -pdf to convert this PDF into a cairo-generated PDF. In the resulting PDF, the text is correctly preserved, and it ends up with three fonts that all have different names:
$ pdftocairo -pdf /tmp/input.pdf /tmp/cairo.pdf
$ pdftotext /tmp/cairo.pdf - | grep conf
Sdsds confidential123
$ pdffonts /tmp/cairo.pdf
name type encoding emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
JCVEWA+Aptos TrueType WinAnsi yes yes yes 7 0
KIDSKU+Aptos TrueType WinAnsi yes yes yes 8 0
IRSLTT+Aptos CID TrueType Identity-H yes yes yes 9 0
Anyway. I'm a bit out of my depth, but I suspect jfkthame has insights into what's going wrong here -- jfkthame, would you mind taking a look when you have cycles?
Reporter
Comment 5•2 months ago
Here's an updated mozregression pushlog.
This pushlog was found by testing, on macOS, where the PDFs generated by Firefox from the attached input.pdf changed with respect to whether their text was detected by xpdf pdftotext at all. All of the generated PDFs whose text was readable by pdftotext still had the missing i.
mozregression couldn't go further due to CRITICAL: Last build 720d5125a9b4 is missing, but mozregression can't find a build after - so it is excluded, but it could contain the regression!
That pushlog contains a pdf.js update: Bug 1779408 - Update pdf.js to version 2.15.259.
Comment 6•2 months ago
I got a similar pushlog on linux:
First good revision: 720d5125a9b4aa6750806ed9b51fbd0811da10c4 (2022-07-14)
Last bad revision: 0f67ddd33ffffed755ca5cf2df495e9bb57a649d (2022-07-13)
Pushlog:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=0f67ddd33ffffed755ca5cf2df495e9bb57a649d&tochange=720d5125a9b4aa6750806ed9b51fbd0811da10c4
Before that range ("bad"), Firefox's Save-to-PDF print target gives me an OK-looking PDF, but if I select and copypaste the text that looks like "Sdsds confidential123" in that file, I get Unicode tofu glyphs.
After that range ("good"), if I do the same process, copypaste gives me "Sdsds confdential123".
The most relevant change in that pushlog is bug 1779408 "Update pdf.js to version 2.15.259" which includes this change in its first comment:
#15157 Add unicode mapping in the font cmap to have correct chars when printing in pdf (bug 1778484)
That sounds very likely related to what we're seeing here -- this bug is probably some still-unfixed followup work associated with that change.
Comment 7•2 months ago
--> Reclassifying to PDF.js to match bug 1778484 since it seems likely that this is a bug on that side, given the pushlog that got us to our current behavior.
Updated•2 months ago
Reporter
Updated•2 months ago
Comment 8•2 months ago
[adjusting summary to clarify that this isn't specific to standalone text-extraction tools; this affects copypaste in PDF viewers like Firefox-with-PDF.js-itself, too, when viewing the 'bad' output file]
Assignee
Comment 9•2 months ago
In the original PDF, the fi is just one char rendered with one glyph (the fi ligature). We probably do something wrong (i.e. map the fi onto an f, or something like that) when adding the unicode mapping in the font.
Assignee
Comment 10•2 months ago
The culprit:
https://github.com/mozilla/pdf.js/blob/90a5c37cb0131e8645bc036ec8e2baa1a1f8bca3/src/core/fonts.js#L517
We get the string "fi" (2 chars) and we just take the first one...
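For illustration only (this is not the actual pdf.js code, and the glyph code below is invented), here is roughly how collapsing a multi-character ToUnicode entry with codePointAt(0) loses the second character:
// Minimal sketch with a hypothetical toUnicode map; glyph code 0x0003 is made up.
const toUnicode = new Map([[0x0003, "fi"]]);

let unicode = toUnicode.get(0x0003); // -> the two-character string "fi"
if (typeof unicode === "string") {
  unicode = unicode.codePointAt(0); // -> 0x66 ("f"); the "i" is silently dropped
}
console.log(String.fromCodePoint(unicode)); // "f"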
Assignee
Updated•2 months ago
Updated•2 months ago
Comment 11•2 months ago
Comment 12•2 months ago
Set release status flags based on info from the regressing bug 1778484
Comment 13•2 months ago
[Edit: when I went to submit this, I see I've mid-aired with Calixte's investigation, which reached much the same conclusion; but posting this anyhow for the record.]
I would guess this is most likely related to the font providing an "fi" ligature glyph, but the ToUnicode mapping generated doesn't map this back to two characters ['f', 'i'] when extracting text.
Poking around a bit, I found this code in core/fonts.js:
let unicode = toUnicode.get(originalCharCode);
if (typeof unicode === "string") {
  unicode = unicode.codePointAt(0);
}
Sure enough, when this encounters the "fi" ligature glyph (originally named uniFB01 in the Aptos font, though I don't think we ultimately preserve that name), the call to toUnicode.get(originalCharCode) returns the two-character string "fi", which is correct as this single glyph represents two characters. But this code just uses the first character "f" from it, and the end result is that text-extraction tools see the single character "f" as the text to be extracted for the "fi" ligature.
(This is especially sad given that in Aptos, the fi ligature looks identical to the two-glyph sequence f, i. So they really wouldn't need to have such a ligature rule at all.)
One (somewhat crude) way to mitigate this would be to check for multi-character strings returned by toUnicode.get() here, and map them back to the Unicode codepoints for the corresponding ligature, where such codepoints exist. E.g. doing
let unicode = toUnicode.get(originalCharCode);
if (typeof unicode === "string") {
  if (unicode === "ff") {
    unicode = 0xfb00;
  } else if (unicode === "fi") {
    unicode = 0xfb01;
  } else if (unicode === "fl") {
    unicode = 0xfb02;
  } else if (unicode === "ffi") {
    unicode = 0xfb03;
  } else if (unicode === "ffl") {
    unicode = 0xfb04;
  } else {
    unicode = unicode.codePointAt(0);
  }
}
handles the common f-ligatures found in many Latin fonts. I tried something like this locally (on macOS), and confirmed that after a Save as PDF operation, the word "confidential" can be successfully found/extracted from the resulting new PDF.
This could obviously be extended to handle all ligatures for which Unicode codepoints exist. However, it seems less than ideal, as fonts may include arbitrary ligature glyphs representing character sequences for which no single Unicode codepoint is defined.
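One sketch of a slightly more general lookup (untested against pdf.js internals, and still limited to ligatures that have a Unicode compatibility character) would be to scan the f-ligature range for a code point whose NFKD decomposition matches the string:
// Sketch: map a multi-character ligature string back to a single compatibility
// code point, if one exists in the Alphabetic Presentation Forms f-ligature
// range; otherwise fall back to the first character, as the current code does.
function ligatureCodePoint(str) {
  for (let cp = 0xfb00; cp <= 0xfb06; cp++) {
    if (String.fromCodePoint(cp).normalize("NFKD") === str) {
      return cp;
    }
  }
  return str.codePointAt(0);
}

console.log(ligatureCodePoint("fi").toString(16)); // "fb01"
console.log(ligatureCodePoint("st").toString(16)); // "fb06"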
Another thing that might work -- and might solve the problem for arbitrary ligatures, not only for those where Unicode provides a compatibility character -- would be to name the glyphs appropriately. When I look at the Aptos font embedded in the Firefox-generated PDF, I see that we've ended up with the glyph for the fi ligature being named simply "f":
<TTGlyph name="f" xMin="32" yMin="0" xMax="966" yMax="1351">
  <component glyphName="glyph00007" x="0" y="0" flags="0x404"/>
  <component glyphName="i" x="617" y="0" flags="0x4"/>
</TTGlyph>
while the actual f glyph ended up named "glyph00007" (I suspect it "lost" its name because no standalone "f" occurs in the document, and so the actual character "f" didn't need to be included in the font subset at all; the glyph only ends up there because of its use as a component in the FB01 ligature).
If we generated the glyph name "f_i" for the ligature glyph, it's possible that text-extraction tools would be able to use this to map back to the original character sequence; and if that works, it should work for arbitrary ligature glyphs. But I don't know my way around pdf.js well enough to figure out whether this is something we could readily implement.
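For what it's worth, a text extractor following the Adobe Glyph List convention would reverse a name like "f_i" roughly as sketched here (the tiny name-to-character table is invented for this example; a real implementation would use the full AGL):
// Sketch of AGL-style reverse mapping: ligature glyph names join their
// component glyph names with underscores, so "f_i" maps back to "fi".
const glyphNameToChar = { f: "f", i: "i", l: "l" }; // truncated for the example

function charsFromGlyphName(name) {
  return name
    .split("_")
    .map(part => glyphNameToChar[part] ?? "")
    .join("");
}

console.log(charsFromGlyphName("f_i")); // "fi"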
Updated•1 month ago
Comment 14•1 month ago
@marco hi! You changed this from "depends-on" to "regressed-by" bug 1778484, I suspect because you saw a pushlog regression range.
However, this isn't a regression; things were worse before bug 1778484 landed, per comment 6. That pushlog is where we went from completely broken to mostly working, and it helped us identify bug 1778484 as the bug that landed the code which got us to mostly working and now just needs a bit of further improvement.
Hence: restoring to depends-on instead of regressed-by. (Though feel free to revert if we've got differing understandings of what constitutes a regression.)
Updated•1 month ago
Comment 15•1 month ago
The patch landed in nightly and beta is affected.
:calixte, is this bug important enough to require an uplift?
- If yes, please nominate the patch for beta approval.
- If no, please set status-firefox136 to wontfix.
For more information, please visit BugBot documentation.
Assignee
Comment 16•1 month ago
:jfkthame, for now we just write a basic post table (format 3):
https://github.com/mozilla/pdf.js/blob/33c97570f5b9a42411abc85ca2b900f1ac46adf7/src/core/fonts.js#L883-L896
so it'd require writing one in format 2:
https://developer.apple.com/fonts/TrueType-Reference-Manual/RM06/Chap6post.html
Anyway, interesting idea; I'll keep it in mind.
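For reference, a rough sketch of what serializing a format 2 post table could look like, going by the layout in the Apple spec linked above (this is not pdf.js code; the glyph names and the all-custom-names simplification are just for illustration):
// Sketch: build a 'post' table, version 2.0. Every glyph name is stored as a
// custom (index >= 258) Pascal string; reusing the 258 standard Macintosh
// names is skipped here for brevity, as is any padding/checksum handling.
function buildPostFormat2(glyphNames) {
  const fixed = new DataView(new ArrayBuffer(34 + 2 * glyphNames.length));
  fixed.setUint32(0, 0x00020000); // version 2.0
  // italicAngle, underlinePosition, underlineThickness, isFixedPitch and the
  // min/max memory fields (offsets 4..31) are left at 0 in this sketch.
  fixed.setUint16(32, glyphNames.length); // numberOfGlyphs
  const nameBytes = [];
  glyphNames.forEach((name, i) => {
    fixed.setUint16(34 + 2 * i, 258 + i); // glyphNameIndex: custom name i
    nameBytes.push(name.length, ...Array.from(name, c => c.charCodeAt(0)));
  });
  return new Uint8Array([...new Uint8Array(fixed.buffer), ...nameBytes]);
}

// e.g. naming the ligature glyph "f_i" so extraction tools can map it back:
const post = buildPostFormat2([".notdef", "glyph00007", "i", "f_i"]);
console.log(post.length);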
Updated•1 month ago