Unicode characters outside the BMP (aka SMP, astral or supplementary characters) misrendered

RESOLVED FIXED in Firefox 42

Status

P2
normal
RESOLVED FIXED
5 years ago
3 years ago

People

(Reporter: smontagu, Unassigned)

Tracking

({intl})

unspecified
Firefox 42
x86
Mac OS X
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [pdfjs-c-rendering][pdfjs-d-font-conversion][pdfjs-f-fixed-upstream] https://github.com/mozilla/pdf.js/pull/6171)

Description

5 years ago
Unicode characters above U+FFFF in PDF files appear incorrectly, with the high bits stripped.

E.g. in http://www.unicode.org/charts/PDF/Unicode-7.0/U70-10600.pdf the characters in the chart should be the new Linear A characters in the range U+10600-1077F, but what is displayed are Arabic characters in the range U+0600-077F. The same happens with http://www.unicode.org/charts/PDF/U20000.pdf (but not with all the charts of SMP characters at http://www.unicode.org/charts/, which are perhaps encoded differently).
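
The symptom described above is consistent with code points being reduced to their low 16 bits somewhere in the pipeline. A minimal sketch of that arithmetic, assuming a plain 16-bit mask (the report does not say where or how the truncation actually happens, and `strip_high_bits` is a hypothetical name used only for illustration):

```python
def strip_high_bits(code_point: int) -> int:
    """Keep only the low 16 bits, as a 16-bit character code would."""
    return code_point & 0xFFFF


# Assumption: the Linear A block U+10600-1077F is masked to 16 bits,
# which would land it on U+0600-077F (Arabic and neighbouring blocks).
linear_a_first = 0x10600
linear_a_last = 0x1077F
truncated_first = strip_high_bits(linear_a_first)
truncated_last = strip_high_bits(linear_a_last)

print(f"U+{linear_a_first:04X} -> U+{truncated_first:04X}")
print(f"U+{linear_a_last:04X} -> U+{truncated_last:04X}")
```

Under that assumption the whole block shifts down by exactly 0x10000, which matches the ranges quoted in the report.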
Works for me on Windows. Specific to Mac?
In part, at least, it's specific to Mac (and mobile, in theory, but I don't think we run pdf.js there).

The embedded font from the PDF is apparently being encoded (in the Linear A case) at codepoints that correspond to Arabic letters; but as it lacks the necessary layout tables for Arabic rendering, Gecko on OS X refuses to use it for those characters, and falls back to another Arabic font.

On Windows, we don't (currently -- see bug 1006122) check whether the font really "supports" the characters properly, and so we go ahead and draw them.
(In reply to Jonathan Kew (:jfkthame) from comment #2)

> The embedded font from the PDF is apparently being encoded (in the Linear A
> case) at codepoints that correspond to Arabic letters; but as it lacks the
> necessary layout tables for Arabic rendering, Gecko on OS X refuses to use
> it for those characters, and falls back to another Arabic font.

What would be the minimal set of layout tables to add so that Gecko on OS X accepts the characters and starts drawing them as glyphs?
 
> 
> On Windows, we don't (currently -- see bug 1006122) check whether the font
> really "supports" the characters properly, and so we go ahead and draw them.

Is there a way to add this mode to the canvas, so it would not bother to check and just draw the characters?
Flags: needinfo?(jfkthame)
(In reply to Yury Delendik (:yury) from comment #3)
> (In reply to Jonathan Kew (:jfkthame) from comment #2)
> 
> > The embedded font from the PDF is apparently being encoded (in the Linear A
> > case) at codepoints that correspond to Arabic letters; but as it lacks the
> > necessary layout tables for Arabic rendering, Gecko on OS X refuses to use
> > it for those characters, and falls back to another Arabic font.
> 
> What would be the minimal set of layout tables to add so that Gecko on OS X
> accepts the characters and starts drawing them as glyphs?

Don't try to do that. Synthesizing GSUB or similar tables for scripts like Arabic (or Indic scripts - this applies there too) is tricky; and it'd have to be valid, to get past the sanitizer; and then you'll pay the performance cost of going through a complex-script shaper.

How is pdf.js determining what character codes it's using to draw the text to canvas? Unless/until we have a canvas API that lets you address *glyphs* directly, I think you should encode the font so as to use PUA character codes to address the glyphs you want.

>  
> > 
> > On Windows, we don't (currently -- see bug 1006122) check whether the font
> > really "supports" the characters properly, and so we go ahead and draw them.
> 
> Is there a way to add this mode to the canvas, so it would not bother to
> check and just draw the characters?

Not currently. That would require some kind of added API, which I think we'd be hesitant to do. If we're going to create new API to help pdf.js be more efficient, I think we should do a "drawGlyphs" API instead; that's the functionality you really want.
Flags: needinfo?(jfkthame)
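
The PUA suggestion in the reply above can be sketched roughly as follows. This is only an illustration of the idea, not pdf.js code: the function names, the choice of the BMP Private Use Area (U+E000-F8FF) and the sequential assignment are all assumptions made for the example.

```python
from typing import Dict, Iterable

# BMP Private Use Area: U+E000 through U+F8FF inclusive (6400 code points).
BMP_PUA_START = 0xE000
BMP_PUA_END = 0xF8FF


def pua_code_for_slot(slot: int) -> int:
    """Return the PUA code point for a zero-based slot (hypothetical helper)."""
    code = BMP_PUA_START + slot
    if slot < 0 or code > BMP_PUA_END:
        raise ValueError(f"slot {slot} does not fit in the BMP Private Use Area")
    return code


def build_pua_cmap(glyph_ids: Iterable[int]) -> Dict[int, int]:
    """Assign each glyph ID the next free PUA code point, in input order."""
    return {gid: pua_code_for_slot(i) for i, gid in enumerate(glyph_ids)}


# Usage: three glyphs from an embedded font, addressed through PUA codes
# instead of codes that collide with Arabic letters.
cmap = build_pua_cmap([3, 17, 42])
text_to_draw = "".join(chr(cmap[gid]) for gid in [17, 3])
```

Because PUA code points carry no script semantics, text drawn with them should not trigger complex-script shaping or a script-support check of the kind described for OS X.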
We could stop truncating the high bits of the character code and add proper support for non-BMP characters for this case.
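
For reference, carrying a non-BMP code point through a UTF-16 string means splitting it into a surrogate pair instead of masking it to 16 bits. A minimal sketch of the standard UTF-16 encoding step (`to_utf16_code_units` is a hypothetical name; this is not taken from the pdf.js fix):

```python
from typing import List


def to_utf16_code_units(code_point: int) -> List[int]:
    """Encode one Unicode scalar value as one or two UTF-16 code units."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not scalar values")
    if code_point < 0x10000:
        # BMP characters fit in a single code unit.
        return [code_point]
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return [high, low]


units = to_utf16_code_units(0x10600)
print([f"{u:04X}" for u in units])
```

For U+10600 this yields the pair D801 DE00, so no information is lost, whereas a 16-bit mask would leave only 0600.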

Updated

5 years ago
Priority: -- → P2
Whiteboard: [pdfjs-c-rendering][pdfjs-d-font-conversion]

Updated

3 years ago
Whiteboard: [pdfjs-c-rendering][pdfjs-d-font-conversion] → [pdfjs-c-rendering][pdfjs-d-font-conversion][pdfjs-f-fixed-upstream] https://github.com/mozilla/pdf.js/pull/6171
Depends on: 1182228
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 42