Closed Bug 989557 Opened 10 years ago Closed 10 years ago

Support fallback for CJK Compatibility Ideographs Standardized Variants

Categories

(Core :: Layout: Text and Fonts, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla31

People

(Reporter: emk, Assigned: emk)

References

(Depends on 1 open bug)

Details

Attachments

(2 files, 1 obsolete file)

Attached patch patch (obsolete) — Splinter Review
Unicode 6.3 added 1002 standardized variation sequences which will correspond to CJK compatibility ideographs.
These SVSes will be useful to avoid a consequence of the Unicode normalization. CJK compatibility ideographs will be decomposed to corresponding unified ideographs by the normalization. Sometimes it will lose the important difference. (Why they were canonical equivalent in the first place if the distinction is important? who knows.) Especially Japanese users were suffered from the loss.
See Ken Lunde's blog entry for more details:
https://blogs.adobe.com/CCJKType/2012/12/standardized-variants.html
https://blogs.adobe.com/CCJKType/2012/12/standardized-variants-2.html

Unlike registered IVSes, we don't have to wait until fonts catch up the SVSes. We know that the SVSes will correspond to CJK compatibility ideographs, so we can fallback to a glyph from the compatibility ideograph if the font does not support the SVS explicitly.
This patch will implement the fallback.
Attachment #8398840 - Flags: review?(jfkthame)
Comment on attachment 8398840 [details] [diff] [review]
patch

Review of attachment 8398840 [details] [diff] [review]:
-----------------------------------------------------------------

This is neat, but I'd like to see some comments added, explaining what sCJKCompatSVSTable is. It's fine for the comments to refer to the OpenType spec for the details of the format 14 cmap subtable structure, but you should clarify what it's being used for here (a mapping in Unicode character space, not directly to glyph IDs because this is font-independent), and also document the trick that you're using to fit the output characters into the 16-bit glyphID field.

Actually, I wonder if that could be made a bit clearer. The target character codes are all in either the U+Fxxx or U+2Fxxx ranges, AFAICS. So how about simply using:

  #define GLYPH(v) U16((v) >= 0x2F000 ? (v) - 0x2F000 : (v))

and a corresponding change in gfxFontUtils.h? That seems less cryptic to me.

The other thing that concerns me a bit is that we're adding 5K of global data here, which seems quite a lot for a feature that I think will only very rarely be used. I'm not saying it is unimportant - for the cases where it matters, it will be a significant improvement - but I wonder if we could reduce the footprint of the feature somehow.

One possibility: rather than defining the table as an array in source, could we load it as a binary resource that is stored compressed (e.g. in omni.jar) and only loaded and expanded to the 5K table on first use? How large would that table be when compressed?
Hmmm. OK, so I experimented a bit by making your mkcjkcisvs.py tool write out the table as a raw binary file, and then tried compressing this with gzip. Disappointingly, it only gets about 17% compression. :(  (lzma does somewhat better, but I'm not sure we have built-in support for that.)

So I guess for now let's take this as-is, and accept that it requires this data; if we can come up with a more compact approach, though, I'd be all for it. But in any case, please do add comments to the patch to document what's being done here. Thanks.
Attached patch patch v1.1Splinter Review
- Added more comments.
- Renamed mkcjkcisvs.py to gencjkcisvs.py for consistency with other generating scripts.

(In reply to Jonathan Kew (:jfkthame) from comment #2)
> Actually, I wonder if that could be made a bit clearer. The target character
> codes are all in either the U+Fxxx or U+2Fxxx ranges, AFAICS. So how about
> simply using:
> 
>   #define GLYPH(v) U16((v) >= 0x2F000 ? (v) - 0x2F000 : (v))
> 
> and a corresponding change in gfxFontUtils.h? That seems less cryptic to me.

I was trying to preserve the numerical order.

  return aCh && (aCh < 0xF900) ? aCh + 0x2F800 : aCh;

looks more magical to me (also a bit more inefficient).
Attachment #8398840 - Attachment is obsolete: true
Attachment #8398840 - Flags: review?(jfkthame)
Attachment #8399431 - Flags: review?(jfkthame)
Comment on attachment 8399431 [details] [diff] [review]
patch v1.1

Review of attachment 8399431 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/thebes/gencjkcisvs.py
@@ +36,5 @@
> +    offsets.append(length)
> +    length += 4 + 5 * len(mappings)
> +
> +f = open(sys.argv[2] if len(sys.argv) > 2 else 'CJKCompatSVS.cpp', 'wb')
> +f.write("""// Generated by mkcjkcisvs.py. Do not edit.

s/mk/gen/

::: gfx/thebes/gfxFontUtils.h
@@ +786,5 @@
>      static uint16_t
>      MapUVSToGlyphFormat14(const uint8_t *aBuf, uint32_t aCh, uint32_t aVS);
>  
> +    static MOZ_ALWAYS_INLINE uint32_t
> +    GetUVSFallback(uint32_t aCh, uint32_t aVS) {

Please also include a comment here, just mentioning that sCJKCompatSVSTable is a 'cmap' format 14 subtable that maps <char + var-selector> pairs to the corresponding Unicode compatibility ideograph codepoints.

(The idea of using MapUVSToGlyph... and assigning the result back to aCh looks a bit weird without this clarification.)
Attachment #8399431 - Flags: review?(jfkthame) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/9d08e98cc18c
Assignee: nobody → VYV03354
Status: NEW → ASSIGNED
https://hg.mozilla.org/mozilla-central/rev/9d08e98cc18c
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla31
Depends on: 1434127
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: