Printing to a PDF file generates ligature characters for ff, fi, fl, ffi, ffl in the ToUnicode CMap
Categories
(Core :: Printing: Output, defect)
People
(Reporter: vincent-moz, Unassigned)
Attachments
(1 file: 28 bytes, text/html)
User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:108.0) Gecko/20100101 Firefox/108.0
Steps to reproduce:
- Open a web page containing "<html>ffx fix flx ffix fflx".
- Print the page to a PDF file.
Actual results:
One gets a PDF file with the following in the ToUnicode CMap:
begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS)
/Supplement 0
def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <ffff>
endcodespacerange
5 beginbfchar
<0001> <fb00>
<0002> <fb01>
<0003> <fb02>
<0004> <fb03>
<0005> <fb04>
endbfchar
endcmap
i.e. with ligature characters:
U+FB00 LATIN SMALL LIGATURE FF
U+FB01 LATIN SMALL LIGATURE FI
U+FB02 LATIN SMALL LIGATURE FL
U+FB03 LATIN SMALL LIGATURE FFI
U+FB04 LATIN SMALL LIGATURE FFL
Expected results:
While ligature characters are fine for the glyphs, they exist purely for typographic reasons, thus should not be used in the ToUnicode CMap (they are not handled correctly by some PDF consumers). Either use individual ASCII characters in the CMap as done by pdflatex, which generates
<1B> <00660066>
<1C> <00660069>
<1D> <0066006C>
<1E> <006600660069>
<1F> <00660066006C>
or do not put these glyphs in the ToUnicode CMap (the PDF consumers know how to handle them in practice).
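As a minimal sketch of how the pdflatex-style targets above can be derived: each ligature code point has a Unicode compatibility decomposition into plain letters, and a bfchar target is that letter sequence written as UTF-16BE hex. The helper name bfchar_target is hypothetical, and the source codes 1B to 1F are simply copied from the pdflatex excerpt. NFKC is applied here only to the five ligatures; applying it to arbitrary text would change other characters too.

```python
import unicodedata

def bfchar_target(ch: str) -> str:
    """Return the UTF-16BE hex of the compatibility decomposition of ch."""
    decomposed = unicodedata.normalize("NFKC", ch)  # e.g. U+FB01 -> "fi"
    return decomposed.encode("utf-16-be").hex().upper()

# U+FB00..U+FB04: the ff, fi, fl, ffi, ffl ligatures from this report.
LIGATURES = "\ufb00\ufb01\ufb02\ufb03\ufb04"

# Source codes 0x1B..0x1F are taken from the pdflatex excerpt (illustration only).
for code, ch in enumerate(LIGATURES, start=0x1B):
    print(f"<{code:02X}> <{bfchar_target(ch)}>")
```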
Comment 1 (Reporter) • 2 years ago
Bugzilla broke my text (there should really be a preview for the bug report!). The following should be read above:
begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <ffff>
endcodespacerange
5 beginbfchar
<0001> <fb00>
<0002> <fb01>
<0003> <fb02>
<0004> <fb03>
<0005> <fb04>
endbfchar
endcmap
Comment 2 • 2 years ago
I'm not seeing this result in a quick test locally; but I suspect it's probably dependent on details of the font involved. What font are you using?
Comment 3 (Reporter) • 2 years ago
For
<p style="font-family: Bitstream Vera Sans">ffx fix flx ffix fflx (Bitstream Vera Sans).</p>
<p style="font-family: Bitstream Vera Serif">ffx fix flx ffix fflx (Bitstream Vera Serif).</p>
<p style="font-family: DejaVu Sans">ffx fix flx ffix fflx (DejaVu Sans).</p>
<p style="font-family: DejaVu Serif">ffx fix flx ffix fflx (DejaVu Serif).</p>
<p style="font-family: Noto Sans">ffx fix flx ffix fflx (Noto Sans).</p>
<p style="font-family: Noto Serif">ffx fix flx ffix fflx (Noto Serif).</p>
I get
ffx fix flx ffix fflx (Bitstream Vera Sans).
ffx fix flx ffix fflx (Bitstream Vera Serif).
ﬀx ﬁx ﬂx ﬃx ﬄx (DejaVu Sans).
ﬀx ﬁx ﬂx ﬀix ﬄx (DejaVu Serif).
ﬀx ﬁx ﬂx ﬃx ﬄx (Noto Sans).
ﬀx ﬁx ﬂx ﬃx ﬄx (Noto Serif).
i.e. no ligatures for Bitstream Vera, and ligatures for the other fonts, except that with DejaVu Serif, "ffi" becomes the ligature "ff" followed by "i". Note that the extracted text (i.e. what pdftotext outputs) corresponds exactly to the glyphs, so to be able to reproduce the bug, one needs a font that contains the ligatures as glyphs.
Comment 4 • 2 years ago
OK, thanks; I can confirm this reproduces with those fonts, for example.
I believe the ToUnicode mapping is automatically generated by cairo based on the 'cmap' of the font, and not directly under our control. If the font maps a sequence like <letter f, letter i> to a ligature glyph "fi", but also maps the single codepoint U+FB01 to that same glyph, it's not surprising that when generating a subset font that includes the "fi" glyph, the natural mapping for cairo to provide is the one from the original font's 'cmap'; i.e. U+FB01.
It's not immediately clear to me what would be the best way to address this, but it would be nice to improve it if we can; generating the presentation-form ligature characters is generally undesirable (and especially so if the original content didn't use them). But at the point where cairo is generating the PDF, it doesn't know what the original text was, it just has the glyphs to work with.
Comment 5 • 10 months ago
Cairo has the cairo_show_text_glyphs() function to specify the mapping from glyphs back to Unicode when drawing glyphs. This will be used to build the ToUnicode CMap in the generated PDF, as well as using /ActualText when there is not a 1:n mapping between glyphs and Unicode. Pango has been using this API since 2008 and it works well. If you are only using cairo_show_glyphs() then cairo will try to guess the mapping from the font cmap, but this will never work as well as using cairo_show_text_glyphs().
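To make the cluster idea concrete, here is a small pure-Python model, not the cairo API itself. It assumes clusters shaped like cairo's cairo_text_cluster_t (a num_bytes and a num_glyphs count per cluster); the Cluster class, the glyph_to_text helper and the glyph IDs are hypothetical and only illustrate how a caller-supplied mapping recovers "fi" for a ligature glyph without consulting the font's cmap.

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    num_bytes: int   # UTF-8 bytes of the source text covered by this cluster
    num_glyphs: int  # number of glyphs that render those bytes

def glyph_to_text(utf8: bytes, glyph_ids: list, clusters: list) -> dict:
    """Walk the clusters in order and map each single-glyph cluster to its text."""
    mapping = {}
    byte_pos = glyph_pos = 0
    for c in clusters:
        text = utf8[byte_pos:byte_pos + c.num_bytes].decode("utf-8")
        if c.num_glyphs == 1:
            # One glyph standing for one or more characters: a ToUnicode-style entry.
            mapping[glyph_ids[glyph_pos]] = text
        byte_pos += c.num_bytes
        glyph_pos += c.num_glyphs
    return mapping

# "fix" shaped as an "fi" ligature glyph plus an "x" glyph (made-up glyph IDs).
result = glyph_to_text("fix".encode("utf-8"), [301, 91], [Cluster(2, 1), Cluster(1, 1)])
print(result)
```

Clusters with more than one glyph are skipped in this model; per the comment above, those are the cases where a producer would fall back to /ActualText.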
Comment 6 • 10 months ago
Looks like Chrome uses ActualText for each ligature to produce correct text extraction.
/Span<</ActualText (fi) >> BDC
4.1600037 0 Td <07AF> Tj
EMC
A little less efficient than using a ToUnicode map but better than nothing.
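For illustration, a marked-content span of this shape could be assembled as follows. This is a sketch only: the function names are hypothetical, the text is assumed to be ASCII (so a PDF literal string needs only backslash and parenthesis escaping), and the positioning operator from the excerpt above is left out.

```python
def pdf_literal_string(text: str) -> str:
    """Escape backslashes and parentheses for a PDF literal string (ASCII assumed)."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return f"({escaped})"

def actual_text_span(text: str, glyph_hex: str) -> str:
    """Wrap one glyph-showing operator in a /Span carrying /ActualText."""
    return "\n".join([
        f"/Span<</ActualText {pdf_literal_string(text)} >> BDC",
        f"<{glyph_hex}> Tj",
        "EMC",
    ])

print(actual_text_span("fi", "07AF"))
```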
Comment 7 • 10 months ago
Yeah, the problem for Gecko is that by the time we get to DrawTargetCairo::FillGlyphs, where we call into cairo, all we have is a GlyphBuffer that contains glyph IDs and positions, but no record of the actual text.
So to be able to use cairo_show_text_glyphs() here, we'd need a bunch of additional plumbing to pass the original text through the rendering stack alongside the glyphs, or some kind of reference back to the DOM with enough precision to be able to find the correct range of text again.
Comment 8 (Reporter) • 10 months ago
I'm wondering whether, as a temporary fix, the Printing component could modify the ToUnicode CMap after its generation, to change <fb00> to <00660066> and so on.
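A rough sketch of what such a post-processing step might look like on the decoded CMap text. This is not how Gecko or cairo do anything today. It assumes the CMap is available as plain text and only rewrites bfchar lines whose target is one of U+FB00 to U+FB04; the function name expand_ligatures is hypothetical. A real fix would also have to handle compressed streams, bfrange entries, and the stream /Length and xref offsets, none of which is covered here.

```python
import re
import unicodedata

# A bfchar line "<source> <FB0x>" whose target is one of the five Latin ligatures.
BFCHAR_LIGATURE = re.compile(
    r"^(<[0-9A-Fa-f]+>)[ \t]+<(FB0[0-4])>[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)

def expand_ligatures(cmap_text: str) -> str:
    """Replace ligature targets with their letter sequences in UTF-16BE hex."""
    def repl(match):
        ligature = chr(int(match.group(2), 16))
        letters = unicodedata.normalize("NFKC", ligature)  # e.g. U+FB00 -> "ff"
        return f"{match.group(1)} <{letters.encode('utf-16-be').hex()}>"
    return BFCHAR_LIGATURE.sub(repl, cmap_text)

sample = "5 beginbfchar\n<0001> <fb00>\n<0002> <fb01>\nendbfchar\n"
print(expand_ligatures(sample))
```

Only the target (second) hex string is matched, so a source code that happens to read fb01 is left alone.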