Text selected from printed PDF is wrong
Categories
(Firefox :: PDF Viewer, defect, P1)
People
(Reporter: marco, Assigned: calixte)
Details
(Whiteboard: [pdfjs-printing])
Attachments
(1 file)
Steps to reproduce:
- Open https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf
- Open the print dialog (Ctrl+P) and choose "Save as PDF"
- Open the saved file
- Select and copy the text ("Dummy PDF file") from the saved PDF
- Paste it somewhere
Result: Unrecognized characters
Expected: "Dummy PDF file" text
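The "unrecognized characters" are Private Use Area codepoints, which is what pdf.js's glyph remapping emits in place of the real characters. A minimal sketch (not from the bug, helper name is mine) of how one could check whether copied text was affected:

```javascript
// Sketch: detect whether text copied out of the saved PDF consists of
// Private Use Area codepoints instead of the expected characters.
function hasPrivateUseChars(text) {
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    if ((cp >= 0xe000 && cp <= 0xf8ff) ||      // BMP Private Use Area
        (cp >= 0xf0000 && cp <= 0xffffd) ||    // Plane 15 PUA
        (cp >= 0x100000 && cp <= 0x10fffd)) {  // Plane 16 PUA
      return true;
    }
  }
  return false;
}

console.log(hasPrivateUseChars("Dummy PDF file")); // false: expected text
console.log(hasPrivateUseChars("\ue000\ue001"));   // true: remapped glyphs
```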
Reporter
Comment 1•3 years ago
It's actually on the PDF.js side, likely because of https://github.com/mozilla/pdf.js/pull/9340.
Reporter
Comment 2•3 years ago
Andrei, could somebody from QA help find a regression range?
Reporter
Comment 3•3 years ago
Oh, actually that won't be feasible, since we only recently started supporting non-rasterized output.
Strictly speaking this is not a regression, precisely because non-rasterized output support is so new; but if non-rasterized output had landed before https://github.com/mozilla/pdf.js/pull/9340, we would consider that PR the regressor.
Reporter
Comment 4•3 years ago
A potential solution would be to implement the "drawGlyphID" API that Brendan mentioned in https://github.com/mozilla/pdf.js/pull/9340.
Another option would be to revert that PR and fix the bugs it fixed some other way.
Yet another option would be to implement heuristics that avoid moving glyphs into the Private Use Area.
Comment 5•3 years ago
(In reply to Marco Castelluccio [:marco] from comment #4)
> A potential solution would be to implement the potential "drawGlyphID" API Brendan mentioned in https://github.com/mozilla/pdf.js/pull/9340.

Given that we have proper /ToUnicode data available in PDF.js, would it perhaps be feasible to (somehow) pass that information along to the Firefox "Save as PDF" functionality, such that it can include usable /ToUnicode data in the fonts?

> Yet another way could be to implement some heuristics to avoid to move the glyphs in the private area.

That's what we used to do prior to PR 9340, but it led to a never-ending series of rendering bugs, and fixing those meant repeatedly extending a list of problematic char-ranges; so can we pretty please not consider doing this :-)
Assignee
Comment 6•3 years ago
:jfkthame, what do you think about Jonas' idea?
Comment 7•3 years ago
I don't know of any mechanism to pass a different ToUnicode mapping through to the PDF output; AFAIK the pdf-generating backend (whether it's the cairo_quartz backend we use on macOS, or the cairo_pdf_surface we use elsewhere) will simply create a ToUnicode mapping based on the 'cmap' in the font resource, and that 'cmap' is the PUA-based one we're creating in pdf.js.
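For reference, a /ToUnicode entry in the generated PDF is a CMap stream along these lines (abbreviated; the codes shown are illustrative, assuming the PUA-based 'cmap' described above). A backend deriving this from the PUA 'cmap' has no way to recover the real characters:

```
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
2 beginbfchar
<E000> <0044>   % character code E000 -> U+0044 'D'
<E001> <0075>   % character code E001 -> U+0075 'u'
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end
```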
So the "obvious" way to get correct text extraction (and Find functionality, etc) in the pdf output would be to have the "real" Unicode 'cmap' in the font ... but as noted above, that had its own problems. What can we do to resolve this, then? A couple of thoughts:
(1) One thing we could perhaps do to mitigate this for a lot of common content would be to partially revert https://github.com/mozilla/pdf.js/pull/9340, and basically whitelist blocks of Unicode where we know that no complex-script rendering behavior is involved. This would include major scripts such as Latin & Cyrillic, as well as CJK ideographs.
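A rough sketch of what the whitelist in (1) could look like; the block boundaries and the helper name are illustrative assumptions, not pdf.js's actual tables:

```javascript
// Sketch of idea (1): only remap a codepoint to the PUA when it falls
// outside Unicode blocks known to need no complex shaping behavior.
// Ranges are illustrative, not an exhaustive or vetted list.
const SIMPLE_BLOCKS = [
  [0x0020, 0x024f],  // Basic Latin .. Latin Extended-B
  [0x0400, 0x04ff],  // Cyrillic
  [0x4e00, 0x9fff],  // CJK Unified Ideographs
];

function needsPuaRemap(codePoint) {
  return !SIMPLE_BLOCKS.some(([lo, hi]) => codePoint >= lo && codePoint <= hi);
}

console.log(needsPuaRemap(0x0041)); // false: 'A' keeps its codepoint
console.log(needsPuaRemap(0x0abc)); // true: Gujarati sign gets remapped
```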
(2) Another idea we could try -- as an alternative, or to supplement (1) for text that we still re-map to PUA codepoints -- would be to double-encode the glyphs in the 'cmap' that we generate. So given a character U+0ABC in the original data, which we remap to a PUA codepoint U+F1234 so that we can render it through canvas.fillText without any unwanted shaping behavior kicking in, we'd actually generate a 'cmap' that maps both the original codepoint U+0ABC and the PUA replacement U+F1234 to the same glyph. Then we can use the PUA codepoint in canvas.fillText, but my (untested) guess is that when the pdf backend generates a ToUnicode mapping for this glyph, it'll pick the original codepoint.
One other observation: after doing Save-to-PDF for the dummy.pdf file linked in comment 0, at least on macOS, what I'm actually seeing in saved version appears to be arbitrary ASCII characters, not PUA codepoints: it looks like the glyphs used ("Dummy...") have simply been assigned sequential codes starting at U+0021, U+0022, U+0023, etc. Is this something coming from pdf.js, or is it the quartz backend doing some remapping of its own?
Comment 8•3 years ago
(In reply to Jonathan Kew (:jfkthame) from comment #7)
> One other observation: after doing Save-to-PDF for the dummy.pdf file linked in comment 0, at least on macOS, what I'm actually seeing in saved version appears to be arbitrary ASCII characters, not PUA codepoints: it looks like the glyphs used ("Dummy...") have simply been assigned sequential codes starting at U+0021, U+0022, U+0023, etc. Is this something coming from pdf.js, or is it the quartz backend doing some remapping of its own?
To follow up on this, I tried it on Windows, and the saved PDF output does have the expected PUA codepoints U+E000, U+E001, etc. So I guess the macOS quartz pdf backend is remapping the font again when generating its PDF output.
Assignee
Comment 9•3 years ago
I just tried, on Windows, adding the unicode => glyphId entries to the font's 'cmap' and it works!
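The double-mapping can be sketched as below; the function name and record shape are hypothetical, for illustration only, not actual pdf.js code. Each glyph keeps its PUA entry for canvas.fillText rendering and additionally gets an entry for its original codepoint, so that a PDF backend can derive a usable ToUnicode mapping:

```javascript
// Sketch of the fix from this comment: build a 'cmap' where both the
// original Unicode codepoint and its PUA replacement map to the same glyph.
function buildDoubleMappedCmap(glyphs) {
  // glyphs: [{ glyphId, originalCodePoint, puaCodePoint }, ...] (hypothetical shape)
  const cmap = new Map();
  for (const { glyphId, originalCodePoint, puaCodePoint } of glyphs) {
    cmap.set(puaCodePoint, glyphId);      // used when drawing via fillText
    cmap.set(originalCodePoint, glyphId); // lets the backend recover the text
  }
  return cmap;
}

const cmap = buildDoubleMappedCmap([
  { glyphId: 5, originalCodePoint: 0x0044 /* 'D' */, puaCodePoint: 0xe000 },
]);
console.log(cmap.get(0xe000) === cmap.get(0x0044)); // true: same glyph
```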
Comment 10•3 years ago