Text selected from printed PDF is wrong
Categories
(Firefox :: PDF Viewer, defect, P1)
People
(Reporter: marco, Assigned: calixte)
Details
(Whiteboard: [pdfjs-printing])
Attachments
(1 file)
Steps to reproduce:
- Open https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf
- Open the print dialog (Ctrl+P) and choose "Save as PDF"
- Open the saved file
- Select and copy the text ("Dummy PDF file") from the saved PDF
- Paste it somewhere
Result: Unrecognized characters
Expected: "Dummy PDF file" text
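The "unrecognized characters" are Private Use Area codepoints, which is what pdf.js's glyph remapping emits in place of the real characters. A minimal sketch (not from the bug, helper name is mine) of how one could check whether copied text was affected:

```javascript
// Sketch: detect whether text copied out of the saved PDF consists of
// Private Use Area codepoints instead of the expected characters.
function hasPrivateUseChars(text) {
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    if ((cp >= 0xe000 && cp <= 0xf8ff) ||      // BMP Private Use Area
        (cp >= 0xf0000 && cp <= 0xffffd) ||    // Plane 15 PUA
        (cp >= 0x100000 && cp <= 0x10fffd)) {  // Plane 16 PUA
      return true;
    }
  }
  return false;
}

console.log(hasPrivateUseChars("Dummy PDF file")); // false: expected text
console.log(hasPrivateUseChars("\ue000\ue001"));   // true: remapped glyphs
```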
Reporter
Comment 1•3 years ago
It's actually on the PDF.js side, likely because of https://github.com/mozilla/pdf.js/pull/9340.
Reporter
Comment 2•3 years ago
Andrei, could somebody from QA help find a regression range?
Reporter
Comment 3•3 years ago
Oh, actually that won't be feasible, since we only recently started supporting non-rasterized output.
Strictly speaking this is not a regression, precisely because non-rasterized output support is so new; but if non-rasterized output had landed before https://github.com/mozilla/pdf.js/pull/9340, we would consider that PR the regressor.
Reporter
Comment 4•3 years ago
A potential solution would be to implement the "drawGlyphID" API that Brendan mentioned in https://github.com/mozilla/pdf.js/pull/9340.
Another option would be to revert that PR and fix the bugs it fixed some other way.
Yet another option would be to implement heuristics that avoid moving glyphs into the Private Use Area.
Comment 5•3 years ago
(In reply to Marco Castelluccio [:marco] from comment #4)
> A potential solution would be to implement the potential "drawGlyphID" API Brendan mentioned in https://github.com/mozilla/pdf.js/pull/9340.

Given that we have proper /ToUnicode data available in PDF.js, would it perhaps be feasible to (somehow) pass that information along to the Firefox "Save as PDF" functionality, such that it can include usable /ToUnicode data in the fonts?

> Yet another way could be to implement some heuristics to avoid to move the glyphs in the private area.

That's what we used to do prior to PR 9340, but it led to a never-ending series of rendering bugs, and fixing those meant repeatedly extending a list of problematic char-ranges; so can we pretty please not consider doing this :-)
Assignee
Comment 6•3 years ago
:jfkthame, what do you think about Jonas' idea?
Comment 7•3 years ago
I don't know of any mechanism to pass a different ToUnicode mapping through to the PDF output; AFAIK the pdf-generating backend (whether it's the cairo_quartz backend we use on macOS, or the cairo_pdf_surface we use elsewhere) will simply create a ToUnicode mapping based on the 'cmap' in the font resource, and that 'cmap' is the PUA-based one we're creating in pdf.js.
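For reference, a /ToUnicode entry in the generated PDF is a CMap stream along these lines (abbreviated; the codes shown are illustrative, assuming the PUA-based 'cmap' described above). A backend deriving this from the PUA 'cmap' has no way to recover the real characters:

```
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
2 beginbfchar
<E000> <0044>   % character code E000 -> U+0044 'D'
<E001> <0075>   % character code E001 -> U+0075 'u'
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end
```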
So the "obvious" way to get correct text extraction (and Find functionality, etc) in the pdf output would be to have the "real" Unicode 'cmap' in the font ... but as noted above, that had its own problems. What can we do to resolve this, then? A couple of thoughts:
(1) One thing we could perhaps do to mitigate this for a lot of common content would be to partially revert https://github.com/mozilla/pdf.js/pull/9340, and basically whitelist blocks of Unicode where we know that no complex-script rendering behavior is involved. This would include major scripts such as Latin & Cyrillic, as well as CJK ideographs.
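A rough sketch of what the whitelist in (1) could look like; the block boundaries and the helper name are illustrative assumptions, not pdf.js's actual tables:

```javascript
// Sketch of idea (1): only remap a codepoint to the PUA when it falls
// outside Unicode blocks known to need no complex shaping behavior.
// Ranges are illustrative, not an exhaustive or vetted list.
const SIMPLE_BLOCKS = [
  [0x0020, 0x024f],  // Basic Latin .. Latin Extended-B
  [0x0400, 0x04ff],  // Cyrillic
  [0x4e00, 0x9fff],  // CJK Unified Ideographs
];

function needsPuaRemap(codePoint) {
  return !SIMPLE_BLOCKS.some(([lo, hi]) => codePoint >= lo && codePoint <= hi);
}

console.log(needsPuaRemap(0x0041)); // false: 'A' keeps its codepoint
console.log(needsPuaRemap(0x0abc)); // true: Gujarati sign gets remapped
```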
(2) Another idea we could try -- as an alternative, or to supplement (1) for text that we still re-map to PUA codepoints -- would be to double-encode the glyphs in the 'cmap' that we generate. So given a character U+0ABC in the original data, which we remap to a PUA codepoint U+F1234 so that we can render it through canvas.fillText without any unwanted shaping behavior kicking in, we'd actually generate a 'cmap' that maps both the original codepoint U+0ABC and the PUA replacement U+F1234 to the same glyph. Then we can use the PUA codepoint in canvas.fillText, but my (untested) guess is that when the pdf backend generates a ToUnicode mapping for this glyph, it'll pick the original codepoint.
One other observation: after doing Save-to-PDF for the dummy.pdf file linked in comment 0, at least on macOS, what I'm actually seeing in saved version appears to be arbitrary ASCII characters, not PUA codepoints: it looks like the glyphs used ("Dummy...") have simply been assigned sequential codes starting at U+0021, U+0022, U+0023, etc. Is this something coming from pdf.js, or is it the quartz backend doing some remapping of its own?
Comment 8•3 years ago
(In reply to Jonathan Kew (:jfkthame) from comment #7)
> One other observation: after doing Save-to-PDF for the dummy.pdf file linked in comment 0, at least on macOS, what I'm actually seeing in saved version appears to be arbitrary ASCII characters, not PUA codepoints: it looks like the glyphs used ("Dummy...") have simply been assigned sequential codes starting at U+0021, U+0022, U+0023, etc. Is this something coming from pdf.js, or is it the quartz backend doing some remapping of its own?
To follow up on this, I tried it on Windows, and the saved PDF output does have the expected PUA codepoints U+E000, U+E001, etc. So I guess the macOS quartz pdf backend is remapping the font again when generating its PDF output.
Assignee
Comment 9•3 years ago
I just tried, on Windows, adding the unicode => glyphId entries to the font's 'cmap' and it works!
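The double-mapping can be sketched as below; the function name and record shape are hypothetical, for illustration only, not actual pdf.js code. Each glyph keeps its PUA entry for canvas.fillText rendering and additionally gets an entry for its original codepoint, so that a PDF backend can derive a usable ToUnicode mapping:

```javascript
// Sketch of the fix from this comment: build a 'cmap' where both the
// original Unicode codepoint and its PUA replacement map to the same glyph.
function buildDoubleMappedCmap(glyphs) {
  // glyphs: [{ glyphId, originalCodePoint, puaCodePoint }, ...] (hypothetical shape)
  const cmap = new Map();
  for (const { glyphId, originalCodePoint, puaCodePoint } of glyphs) {
    cmap.set(puaCodePoint, glyphId);      // used when drawing via fillText
    cmap.set(originalCodePoint, glyphId); // lets the backend recover the text
  }
  return cmap;
}

const cmap = buildDoubleMappedCmap([
  { glyphId: 5, originalCodePoint: 0x0044 /* 'D' */, puaCodePoint: 0xe000 },
]);
console.log(cmap.get(0xe000) === cmap.get(0x0044)); // true: same glyph
```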
Comment 10•3 years ago