Open Bug 1810914 Opened 2 years ago Updated 1 year ago

Printing to a PDF file generates ligature characters for ff, fi, fl, ffi, ffl in the ToUnicode CMap

Status:

NEW

People

(Reporter: vincent-moz, Unassigned)

Attachments

(1 file)

testcase containing "ffx fix flx ffix fflx" (2 years ago, Vincent Lefevre, 28 bytes, text/html)

Vincent Lefevre

Reporter

Description

•

2 years ago

Attached file: testcase containing "ffx fix flx ffix fflx"

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:108.0) Gecko/20100101 Firefox/108.0

Steps to reproduce:

Open a web page containing "<html>ffx fix flx ffix fflx".
Print the page to a PDF file.

Actual results:

One gets a PDF file with the following in the ToUnicode CMap:

begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS)
/Supplement 0

def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <ffff>
endcodespacerange
5 beginbfchar
<0001> <fb00>
<0002> <fb01>
<0003> <fb02>
<0004> <fb03>
<0005> <fb04>
endbfchar
endcmap

i.e. with ligature characters:
U+FB00 LATIN SMALL LIGATURE FF
U+FB01 LATIN SMALL LIGATURE FI
U+FB02 LATIN SMALL LIGATURE FL
U+FB03 LATIN SMALL LIGATURE FFI
U+FB04 LATIN SMALL LIGATURE FFL

Expected results:

Ligature glyphs are fine for rendering, but the ligature characters exist purely for typographic reasons and thus should not be used in the ToUnicode CMap (some PDF consumers do not handle them correctly). Either map each glyph to the individual ASCII characters in the CMap, as pdflatex does, which generates

<1B> <00660066>
<1C> <00660069>
<1D> <0066006C>
<1E> <006600660069>
<1F> <00660066006C>

or do not put these glyphs in the ToUnicode CMap (the PDF consumers know how to handle them in practice).
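
The pdflatex-style destination values above can be derived mechanically. The following is a minimal Python sketch, assuming that Unicode compatibility normalization (NFKC) is an acceptable way to expand the five ligature characters and that bfchar destinations are written as UTF-16BE hex. The helper name `tounicode_hex` is hypothetical and is not part of Gecko, cairo, or pdflatex.

```python
import unicodedata


def tounicode_hex(ch: str) -> str:
    """Return the UTF-16BE hex string for the compatibility expansion of ch.

    Hypothetical helper: NFKC is assumed to be a suitable expansion rule
    for the Latin ligature characters U+FB00..U+FB04.
    """
    expanded = unicodedata.normalize("NFKC", ch)
    return expanded.encode("utf-16-be").hex().upper()


# Print the expansion for each of the five ligature characters in the report.
for cp in range(0xFB00, 0xFB05):
    print(f"U+{cp:04X} -> <{tounicode_hex(chr(cp))}>")
```

For U+FB01 this yields `<00660069>`, matching the value pdflatex writes for the "fi" glyph.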

Vincent Lefevre

Reporter

Comment 1

•

2 years ago

Bugzilla broke my text (there should really be a preview for the bug report!). The following should be read above:

begincmap
/CIDSystemInfo
<< /Registry (Adobe)
   /Ordering (UCS)
   /Supplement 0
>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <ffff>
endcodespacerange
5 beginbfchar
<0001> <fb00>
<0002> <fb01>
<0003> <fb02>
<0004> <fb03>
<0005> <fb04>
endbfchar
endcmap

Jonathan Kew [:jfkthame]

Comment 2

•

2 years ago

I'm not seeing this result in a quick test locally; but I suspect it's probably dependent on details of the font involved. What font are you using?

Flags: needinfo?(vincent-moz)

Vincent Lefevre

Reporter

Comment 3

•

2 years ago

For

<p style="font-family: Bitstream Vera Sans">ffx fix flx ffix fflx (Bitstream Vera Sans).</p>
<p style="font-family: Bitstream Vera Serif">ffx fix flx ffix fflx (Bitstream Vera Serif).</p>
<p style="font-family: DejaVu Sans">ffx fix flx ffix fflx (DejaVu Sans).</p>
<p style="font-family: DejaVu Serif">ffx fix flx ffix fflx (DejaVu Serif).</p>
<p style="font-family: Noto Sans">ffx fix flx ffix fflx (Noto Sans).</p>
<p style="font-family: Noto Serif">ffx fix flx ffix fflx (Noto Serif).</p>

I get

ffx fix flx ffix fflx (Bitstream Vera Sans).
ffx fix flx ffix fflx (Bitstream Vera Serif).
ﬀx ﬁx ﬂx ﬃx ﬄx (DejaVu Sans).
ﬀx ﬁx ﬂx ﬀix ﬄx (DejaVu Serif).
ﬀx ﬁx ﬂx ﬃx ﬄx (Noto Sans).
ﬀx ﬁx ﬂx ﬃx ﬄx (Noto Serif).

i.e. no ligatures for Bitstream Vera, and ligatures for the other ones, except that for DejaVu Serif, "ffi" is transformed to the ligature "ﬀ" + "i". Note that the text layer (i.e. what pdftotext outputs) corresponds exactly to the glyphs, so to reproduce the bug, one needs a font that contains the ligatures as glyphs.

Flags: needinfo?(vincent-moz)

Jonathan Kew [:jfkthame]

Comment 4

•

2 years ago

OK, thanks; I can confirm this reproduces with those fonts, for example.

I believe the ToUnicode mapping is automatically generated by cairo based on the 'cmap' of the font, and not directly under our control. If the font maps a sequence like <letter f, letter i> to a ligature glyph "ﬁ", but also maps the single codepoint U+FB01 to that same glyph, it's not surprising that when generating a subset font that includes the "ﬁ" glyph, the natural mapping for cairo to provide is the one from the original font's 'cmap'; i.e. U+FB01.

It's not immediately clear to me what would be the best way to address this, but it would be nice to improve it if we can; generating the presentation-form ligature characters is generally undesirable (and especially so if the original content didn't use them). But at the point where cairo is generating the PDF, it doesn't know what the original text was, it just has the glyphs to work with.

Severity: -- → S4

Status: UNCONFIRMED → NEW

Ever confirmed: true

ajohnson@redneon.com

Comment 5

•

1 year ago

Cairo has the cairo_show_text_glyphs() function to specify the mapping from glyphs back to Unicode when drawing glyphs. This will be used to build the ToUnicode CMap in the generated PDF, as well as using /ActualText when there is not a 1:n mapping between glyphs and Unicode. Pango has been using this API since 2008 and it works well. If you are only using cairo_show_glyphs() then cairo will try to guess the mapping from the font cmap, but this will never work as well as using cairo_show_text_glyphs().

ajohnson@redneon.com

Comment 6

•

1 year ago

Looks like Chrome uses ActualText for each ligature to produce correct text extraction.

/Span<</ActualText (fi) >> BDC
4.1600037 0 Td <07AF> Tj
EMC

A little less efficient than using a ToUnicode map but better than nothing.
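
To illustrate the shape of such a span, here is a small Python sketch that builds the same kind of marked-content sequence for one glyph. It assumes ASCII-only replacement text (non-ASCII text would need a different string encoding) and a two-byte glyph code; the function names are hypothetical and do not correspond to Chrome's or cairo's code. The `Td` positioning operator from the example above is omitted.

```python
def pdf_literal_string(text: str) -> str:
    """Escape backslashes and parentheses for a PDF literal string.

    Assumption: text is plain ASCII, as in the "fi" example.
    """
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return f"({escaped})"


def actual_text_span(text: str, glyph_code: int) -> str:
    """Wrap a single two-byte glyph code in a /Span carrying /ActualText."""
    return (
        f"/Span<</ActualText {pdf_literal_string(text)} >> BDC\n"
        f"<{glyph_code:04X}> Tj\n"
        "EMC"
    )


# The glyph code 0x07AF is taken from the content stream quoted above.
print(actual_text_span("fi", 0x07AF))
```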

Jonathan Kew [:jfkthame]

Comment 7

•

1 year ago

Yeah, the problem for Gecko is that by the time we get to DrawTargetCairo::FillGlyphs, where we call into cairo, all we have is a GlyphBuffer that contains glyph IDs and positions, but no record of the actual text.

So to be able to use cairo_show_text_glyphs() here, we'd need a bunch of additional plumbing to pass the original text through the rendering stack alongside the glyphs, or some kind of reference back to the DOM with enough precision to be able to find the correct range of text again.

Vincent Lefevre

Reporter

Comment 8

•

1 year ago

I'm wondering whether, as a temporary fix, the Printing component could modify the ToUnicode CMap after its generation, to change <fb00> to <00660066> and so on.
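
A rough Python sketch of that post-processing idea, operating on the CMap as plain text. It assumes the CMap stream has already been decoded (PDF streams are often compressed) and that the ligatures appear only as bfchar entries, not inside bfrange sections. The names `LIGATURE_EXPANSIONS` and `expand_ligatures` are hypothetical. A real fix would also have to update the stream length after rewriting.

```python
import re

# Ligature code points mapped to the UTF-16BE hex of their letter sequences.
LIGATURE_EXPANSIONS = {
    "fb00": "00660066",      # ff
    "fb01": "00660069",      # fi
    "fb02": "0066006C",      # fl
    "fb03": "006600660069",  # ffi
    "fb04": "00660066006C",  # ffl
}

# Only touch entries between beginbfchar and endbfchar, so that the
# codespacerange line is never rewritten.
BFCHAR_SECTION = re.compile(r"(beginbfchar)(.*?)(endbfchar)", re.DOTALL)
BFCHAR_ENTRY = re.compile(r"<([0-9A-Fa-f]+)>[ \t]+<([0-9A-Fa-f]+)>")


def expand_ligatures(cmap_text: str) -> str:
    """Replace ligature destinations in bfchar sections with letter sequences."""

    def fix_entry(match: re.Match) -> str:
        dest = match.group(2)
        dest = LIGATURE_EXPANSIONS.get(dest.lower(), dest)
        return f"<{match.group(1)}> <{dest}>"

    def fix_section(match: re.Match) -> str:
        body = BFCHAR_ENTRY.sub(fix_entry, match.group(2))
        return match.group(1) + body + match.group(3)

    return BFCHAR_SECTION.sub(fix_section, cmap_text)


sample = (
    "1 begincodespacerange\n<0000> <ffff>\nendcodespacerange\n"
    "2 beginbfchar\n<0001> <fb00>\n<0002> <0041>\nendbfchar\n"
)
print(expand_ligatures(sample))
```

Entries whose destination is not one of the five ligatures, such as `<0002> <0041>` in the sample, are left as they are.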


Categories

(Core :: Printing: Output, defect)
