Open Bug 1810914 Opened 9 months ago Updated 9 months ago

Printing to a PDF file generates ligature characters for ff, fi, fl, ffi, ffl in the ToUnicode CMap

Categories

(Core :: Printing: Output, defect)

Firefox 108
defect

People

(Reporter: vincent-moz, Unassigned)

Details

Attachments

(1 file)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:108.0) Gecko/20100101 Firefox/108.0

Steps to reproduce:

  1. Open a web page containing "<html>ffx fix flx ffix fflx".
  2. Print the page to a PDF file.

Actual results:

One gets a PDF file with the following in the ToUnicode CMap:

begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS)
/Supplement 0

def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <ffff>
endcodespacerange
5 beginbfchar
<0001> <fb00>
<0002> <fb01>
<0003> <fb02>
<0004> <fb03>
<0005> <fb04>
endbfchar
endcmap

i.e. with ligature characters:
U+FB00 LATIN SMALL LIGATURE FF
U+FB01 LATIN SMALL LIGATURE FI
U+FB02 LATIN SMALL LIGATURE FL
U+FB03 LATIN SMALL LIGATURE FFI
U+FB04 LATIN SMALL LIGATURE FFL

Expected results:

While ligature characters are fine for the glyphs, they exist purely for typographic reasons and thus should not be used in the ToUnicode CMap (some PDF consumers do not handle them correctly). Either use the individual ASCII characters in the CMap, as pdflatex does, which generates

<1B> <00660066>
<1C> <00660069>
<1D> <0066006C>
<1E> <006600660069>
<1F> <00660066006C>

or do not put these glyphs in the ToUnicode CMap (the PDF consumers know how to handle them in practice).
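
For illustration, the first option can be sketched in a few lines of Python. This is only a sketch, not what pdflatex or cairo actually does: the helper name `bfchar_target` is made up, and it assumes that NFKC normalization is an acceptable way to expand the five ligature code points into their letters.

```python
import unicodedata

def bfchar_target(codepoint: int) -> str:
    """Return the ToUnicode destination (UTF-16BE, hex) for a code point.

    Hypothetical helper: the Latin ligatures U+FB00..U+FB04 are expanded
    to their component letters via NFKC; everything else is kept as-is.
    """
    ch = chr(codepoint)
    if 0xFB00 <= codepoint <= 0xFB04:
        ch = unicodedata.normalize("NFKC", ch)  # e.g. U+FB01 -> "fi"
    return ch.encode("utf-16-be").hex().upper()

for cp in range(0xFB00, 0xFB05):
    print(f"U+{cp:04X} -> <{bfchar_target(cp)}>")
```

The destination strings produced for U+FB00..U+FB04 are the same hex sequences as in the pdflatex entries quoted above.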

Bugzilla broke my text (there should really be a preview for the bug report!). The following should be read above:

begincmap
/CIDSystemInfo
<< /Registry (Adobe)
   /Ordering (UCS)
   /Supplement 0
>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <ffff>
endcodespacerange
5 beginbfchar
<0001> <fb00>
<0002> <fb01>
<0003> <fb02>
<0004> <fb03>
<0005> <fb04>
endbfchar
endcmap
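
To check a generated file for this problem, something like the following could be used. It is a rough sketch with several assumptions: the ToUnicode stream has already been extracted and decompressed into text, only `bfchar` sections are inspected (not `bfrange`), and the function name `find_ligature_mappings` is made up for this example.

```python
import re

LATIN_LIGATURES = range(0xFB00, 0xFB05)  # U+FB00..U+FB04

def find_ligature_mappings(cmap_text):
    """Return (source code, destination text) pairs from bfchar sections
    whose destination contains a Latin presentation-form ligature."""
    hits = []
    # Only look between beginbfchar/endbfchar, so the codespace range is skipped.
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            text = bytes.fromhex(dst).decode("utf-16-be")
            if any(ord(ch) in LATIN_LIGATURES for ch in text):
                hits.append((src, text))
    return hits

SAMPLE = """\
1 begincodespacerange
<0000> <ffff>
endcodespacerange
5 beginbfchar
<0001> <fb00>
<0002> <fb01>
<0003> <fb02>
<0004> <fb03>
<0005> <fb04>
endbfchar
"""

for src, text in find_ligature_mappings(SAMPLE):
    print(src, " ".join(f"U+{ord(ch):04X}" for ch in text))
```

On the CMap above, all five entries are reported.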

I'm not seeing this result in a quick local test, but I suspect it depends on details of the font involved. What font are you using?

Flags: needinfo?(vincent-moz)

For

<p style="font-family: Bitstream Vera Sans">ffx fix flx ffix fflx (Bitstream Vera Sans).</p>
<p style="font-family: Bitstream Vera Serif">ffx fix flx ffix fflx (Bitstream Vera Serif).</p>
<p style="font-family: DejaVu Sans">ffx fix flx ffix fflx (DejaVu Sans).</p>
<p style="font-family: DejaVu Serif">ffx fix flx ffix fflx (DejaVu Serif).</p>
<p style="font-family: Noto Sans">ffx fix flx ffix fflx (Noto Sans).</p>
<p style="font-family: Noto Serif">ffx fix flx ffix fflx (Noto Serif).</p>

I get

ffx fix flx ffix fflx (Bitstream Vera Sans).
ffx fix flx ffix fflx (Bitstream Vera Serif).
ﬀx ﬁx ﬂx ﬃx ﬄx (DejaVu Sans).
ﬀx ﬁx ﬂx ﬀix ﬄx (DejaVu Serif).
ﬀx ﬁx ﬂx ﬃx ﬄx (Noto Sans).
ﬀx ﬁx ﬂx ﬃx ﬄx (Noto Serif).

i.e. no ligatures for Bitstream Vera, and ligatures for the others, except that with DejaVu Serif, "ffi" becomes the ligature "ff" followed by "i". Note that the text layer (i.e. what pdftotext outputs) corresponds exactly to the glyphs, so to reproduce the bug, one needs a font that contains the ligatures as glyphs.

Flags: needinfo?(vincent-moz)

OK, thanks; I can confirm this reproduces with those fonts, for example.

I believe the ToUnicode mapping is automatically generated by cairo based on the 'cmap' of the font, and not directly under our control. If the font maps a sequence like <letter f, letter i> to a ligature glyph "fi", but also maps the single codepoint U+FB01 to that same glyph, it's not surprising that when generating a subset font that includes the "fi" glyph, the natural mapping for cairo to provide is the one from the original font's 'cmap'; i.e. U+FB01.

It's not immediately clear to me what the best way to address this would be, but it would be nice to improve it if we can; generating the presentation-form ligature characters is generally undesirable (and especially so if the original content didn't use them). But at the point where cairo is generating the PDF, it doesn't know what the original text was; it just has the glyphs to work with.
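
The mechanism described above can be illustrated with a toy model. This is not cairo's code: the font tables, glyph names, and functions below are all made up, and real shaping (GSUB lookups) is far more involved. It only shows why a reverse 'cmap' lookup on the ligature glyph yields U+FB01 once the original text is no longer available.

```python
# Toy font (hypothetical): 'cmap' maps code points to glyph names.
FONT_CMAP = {0x66: "f", 0x69: "i", 0x78: "x", 0xFB01: "fi"}
# Toy stand-in for a ligature substitution rule: glyph pair -> ligature glyph.
FONT_LIGATURES = {("f", "i"): "fi"}

def shape(text):
    """Map text to glyphs, replacing known glyph pairs with ligature glyphs."""
    glyphs = [FONT_CMAP[ord(ch)] for ch in text]
    out, i = [], 0
    while i < len(glyphs):
        pair = tuple(glyphs[i:i + 2])
        if pair in FONT_LIGATURES:
            out.append(FONT_LIGATURES[pair])
            i += 2
        else:
            out.append(glyphs[i])
            i += 1
    return out

def to_unicode_from_cmap(glyphs):
    """Build a glyph -> code point table using only the font's 'cmap',
    as a PDF backend that no longer has the original text might."""
    reverse = {glyph: cp for cp, glyph in FONT_CMAP.items()}
    return {glyph: reverse[glyph] for glyph in glyphs}

glyphs = shape("fix")
for glyph, cp in to_unicode_from_cmap(glyphs).items():
    print(f"{glyph!r} -> U+{cp:04X}")
```

In this model the input "fix" never contained U+FB01, yet the "fi" glyph maps back to it, because the letters f and i are lost after shaping.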

Severity: -- → S4
Status: UNCONFIRMED → NEW
Ever confirmed: true