Closed Bug 443976 Opened 16 years ago Closed 15 years ago

Support unicode-range: descriptor in @font-face parser

Categories

(Core :: CSS Parsing and Computation, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla1.9.3a1

People

(Reporter: zwol, Assigned: zwol)

References

Details

Attachments

(1 file, 4 obsolete files)

css3-fonts specifies a unicode-range: descriptor for @font-face rules, which can be used to avoid downloading fonts that don't contain any glyphs that are relevant to the document being rendered.  As this feature is not critical for initial @font-face support and requires monkeying with nsCSSScanner, I am splitting it off from bug 441469.

To elaborate a little on "requires monkeying with nsCSSScanner": A typical use of the unicode-range: descriptor might be

  unicode-range: U+0374-03FF, U+1D26-1DBF, U+1F00-1FFE; /* Greek */

The parser wants to see each hexadecimal number as a separate token and the "U+" introducers as single entities, like this (spaces separate tokens):

  unicode-range : U+ 0374 - 03FF , U+ 1D26 - 1DBF , U+ 1F00 - 1FFE ;

but the normal CSS tokenization rules have no concept of hex numbers and will split these entirely differently, something like

  unicode-range : U +0374 -03FF , U +1D 26 -1DBF , U + 1F 00 -1FFE ;

(or possibly even worse, I'm not sure exactly how DIMENSION works) To avoid having to re-tokenize the entire descriptor in the parser, we will need to give nsCSSScanner a special mode for these descriptors.
Flags: wanted1.9.1?
throwing the code I've got so far in here so it doesn't get lost.
Attached patch (rev 1) working patch (obsolete) — Splinter Review
Here's a working patch.  As mentioned above, I had to introduce a new mode into the CSS scanner.  I believe this did not affect anything else (and I ran the layout mochitests with no failures) but that part probably deserves close scrutiny anyway.

A unicode-range: descriptor turns into an array of alternating min, max values.  There is no attempt to sort them or combine adjacent/overlapping ranges.  Perhaps there should be.

This patch was developed on top of the patches in bug 452275 and bug 452278; I can redo it without them if necessary.
Attachment #328442 - Attachment is obsolete: true
Attachment #335623 - Flags: superreview?(dbaron)
Attachment #335623 - Flags: review?(dbaron)
Depends on: 452275, 452278
Comment on attachment 335623 [details] [diff] [review]
(rev 1) working patch

I've thought of a less invasive way to do the scanner changes; will post a revised patch later today.  (Will still be on top of the other parser/scanner cleanup bugs.)
Attachment #335623 - Flags: superreview?(dbaron)
Attachment #335623 - Flags: review?(dbaron)
Depends on: 452518
Attached patch (rev 2) less invasive to scanner (obsolete) — Splinter Review
This revised patch avoids making any changes to the normal paths through the scanner; instead there is a new function nsCSSScanner::NextURange() used exclusively by CSSParserImpl::ParseURange().

Revision is on top of bug 452518 in addition to what it used to be.  (The changes in that bug were in rev 1 of this patch but I split them out because they're unrelated.)
Attachment #335623 - Attachment is obsolete: true
Comment on attachment 335800 [details] [diff] [review]
(rev 2) less invasive to scanner

flagging for review
Attachment #335800 - Flags: superreview?(dbaron)
Attachment #335800 - Flags: review?(dbaron)
Flags: wanted1.9.1? → wanted1.9.1+
So in REC-CSS2, unicode-range values were part of the tokenization:
http://www.w3.org/TR/1998/REC-CSS2-19980512/syndata.html#tokenization

That suggests changing the scanner more and the parser less, and that the scanner doesn't even need to be modal.

And I can't think of anything that would be broken by that approach.

That said, maybe this approach is ok too; we used to use it for things like ~= and |=, but once we implemented namespaces it was easier to tokenize |= separately so we could distinguish [namespace|attribute=value] from [attribute|=value] without too much lookahead.

I'll keep looking at it...
Comment on attachment 335800 [details] [diff] [review]
(rev 2) less invasive to scanner

I'm ok with the approach of treating the unicode-range values as
separate tokens, although I'd also be happy with the REC-CSS2-19980512
approach of treating the whole mess as a single token, codewise.  (You'd
just need to add a new token type an a second integer field to
nsCSSToken.)  With the current approach, it might make sense to move
ParseURange to the scanner, though.

I think you should limit the number of digits or question marks to 6
rather than 8, per REC-CSS2-19980512, and I think css3-fonts should say
that that's what the syntax is.

I think you should also update the comments above ParseURange and
NextURange (in nsCSSScanner.h and .cpp) to reflect the understanding
that the code you're writing is parsing what the spec describes as a
single token.

The comments about NextURange (both before and inside the function) that
describe what is consumed do not match the code.  I think the code is
right and the comments are wrong.  (In particular, symbols and
whitespace are consumed, as I think they should be to make the parsing
code easier.)

One difficulty of getting the parsing code really correct is to skip
comments in the right places.  You need to skip comments anywhere that
whitespace is allowed, but not inside of the U+XXXX-XXXX.  I think your
current code is actually correct, but it sort of looks like it's correct
by accident (since you're parsing the "U" and the commas using normal
tokenization).  Some comments would help.

I think you probably should add a distinct token type for NextURange to
return when it parses hex digits (rather than using eCSSToken_Number),
and document that that token type sets both mIdent and mInteger.  I
don't think you need to bother with mIntegerValid.  (Normal number
tokens set mNumber and mIntegerValid and maybe mInteger.)

I tend to think that a unicode range where the min is greater than the
max shouldn't be a parse error; it just shouldn't add anything to the
set.  And I think U+0100-0100 should also be valid; you're treating that
as a parse error too.  (Or is there something in the spec that says it
should be a parse error?  If there isn't, it shouldn't be.)
By the spec, I mean any of:
http://www.w3.org/TR/1998/REC-CSS2-19980512/fonts.html#dataqual
http://www.w3.org/TR/1998/REC-CSS2-19980512/syndata.html#tokenization
http://dev.w3.org/csswg/css3-fonts/#character

What looks like a real problem with your current approach, though, is at
the end of ParseURange, you don't UngetToken().  I think you need to;
otherwise the U+NNNN ranges will be broken because the parsing code will
eat the next character.  The problem is that I'm not even sure that
UngetToken is sufficient, since your parsing code is different.  That
said, I think it should be, since anything (like '(' or '{') that
requires particularly interesting handling will be parsed the same by
Next or NextURange.  I think that's yet another reason for not breaking
the invariants of what goes in what type of token, though.

Then again, it looks like you have tests that test this.  Or are those
tests not doing what I think they do?  Or do some of them fail with this
version of the patch?  If not, why aren't they failing?  What did I
misunderstand?

in nsCSSRules.cpp, you're doing nonstandard indentation of the for loop.
Attachment #335800 - Flags: superreview?(dbaron)
Attachment #335800 - Flags: superreview-
Attachment #335800 - Flags: review?(dbaron)
Attachment #335800 - Flags: review-
(And when you're fixing these issues, add tests for them too. :-)
Just a few quick notes.  I didn't realize css2 had anything to say about this stuff; I was going entirely by css3-fonts, which doesn't give me tokenization rules at all, so I had to make something up.  In general I like dumb scanners and smart parsers, but I'll have to look at the code some more (and read the old spec) to see what makes more sense here.  It's possible that moving what is now ParseURange into the scanner would avoid the too-clever interaction between ParseURange and NextURange that seems to have confused you (see below) and that might be enough reason to do it by itself.

NextURange does *not* consume either whitespace or symbols other than +, -, ?, and it wouldn't work if it did.  The comment is correct and matches the code.  You may have mistaken the initial Peek() for a Read().  I'm pretty certain that the whitespace handling is correct as is (albeit almost entirely implicit in the division of labor between ParseURange and ParseFontRanges) but I probably could use more tests for it, especially involving comments.

The tests in the patch all pass, and your concern about not calling UngetToken() at the end of ParseURange() is *supposed* to be addressed by part of the long comment above ParseURange() -- basically, NextURange doesn't always consume the token it returns because that makes ParseURange not have to call UngetToken and therefore not worry about whether what it pushes back should have been tokenized differently.  It may be too clever, but it works.  I do need to add test cases for a good regular descriptor after both a bad and a good unicode-range descriptor, though.

If I did change to using CSS2's core tokenization rule for this stuff, what should I do about nonsense tokens, e.g.

    U+?020
    U+00??-1100

which are perfectly acceptable according to the stated rule?

I'm probably not going to cycle back to this bug for a few days.
Ah, right, I see.

In any case, NextURange's behavior of setting the token whether or not it consumed something is really quite confusing.

I don't think it matters whether those nonsense tokens produce unicode-range tokens as long as they lead to parse errors.

(The comment isn't quite correct, though, since '/' is also consumed, but that could be fixed by one Pushback call.)


Maybe this approach is ok (modulo the 6 vs 8 digit issue and a few of the other comments), although it does seem a little too clever.  There could also be some other difference between a single token and multiple tokens that I'm not thinking of (like we ran into with |=).
I'm dropping this off the 1.9.1 wanted list because there are more important things to get done, the changes required are more invasive than ideal at this stage, and there are standards ambiguities that have yet to be resolved.
Flags: wanted1.9.1+
Zack, I'm wondering if you think there are still issues with the unicode-range definition in the CSS3 Fonts spec:

http://www.w3.org/TR/css3-fonts/#character-range-the-unicode-range-descri

I biggest change was to confine unicode-range to 0 - 10FFFF, which makes it compatible with the unicode-range grammar given in CSS 2.1 (although I realize that's incomplete).

How much work needs to be done to complete unicode-range parsing?  Do you have time to finish off the patch?  If not, could you briefly outline what you think needs to happen and I'll take a stab at it (or maybe there's an intern hunting around for things to do?).
I think there are no further semantic problems with the spec, but I would like to suggest some editorial changes.  Most important, suggest using the term "ignored" instead of "omitted" for consistency with CSS2.1, and consistently using either "syntax error" or "parse error", not both.  [CSS2.1 seems to use "parsing error".]

# "U+??????" is not a syntax error, even though "U+0??????" would be.

could use clarification as to the meaning of "U+"??????", e.g.

# "U+??????" is not a syntax error, even though "U+0??????" would be;
# it is interpreted as U+0-FFFFFF.

The sentence

# Ranges can overlap but interval ranges that descend (e.g. U+400-32f)
# are invalid and omitted rather than treated as parse errors; they have 
# no effect on other ranges in a list of ranges.

says too many things at once -- I'd like to reorganize that entire paragraph, actually.  Maybe not in a bug comment.

> How much work needs to be done to complete unicode-range parsing?  Do you have
> time to finish off the patch?

I should be able to get to that this week.  It's almost done.
This version of the patch treats unicode-range exactly as css2.1 specifies -- the whole thing is one token at the scanner level.  This does involve adding a special extra condition to Next(), but them's the breaks.

This also incorporates the latest revisions to css3-fonts as far as handling of syntactically valid but meaningless unicode-range tokens, weird combinations, etc. and updates the tests to match.
Attachment #335800 - Attachment is obsolete: true
Attachment #395208 - Flags: review?(dbaron)
Comment on attachment 395208 [details] [diff] [review]
(rev 3) scan unicode-range tokens exactly as in css2.1

>+    { rule: _("unicode-range: U+A5, U+0?F"), d: {} },
>+    { rule: _("unicode-range: U+A5, U+0F?-E00"), d: {} }

One comment before I forget about it:  you can leave the trailing comma after the last entry; it's ok in JavaScript.
And shouldn't the "Mixtures where nothing survives" tests just be under the heading "Incorrect unicode-range:"?  Isn't the whole thing dropped because it's syntactically incorrect (mixing ? with ranges, or having digits after ?)?
Comment on attachment 395208 [details] [diff] [review]
(rev 3) scan unicode-range tokens exactly as in css2.1

Continued from comment 15 and comment 16.  (I started looking thinking
I'd review it tomorrow, and then realized I may as well just review it
now.)

+  if (!srcVals)
+    return PR_FALSE;

mScanner.SetLowLevelError(NS_ERROR_OUT_OF_MEMORY);
before the return

+  PRUint32 i;
+  for (i = 0; i < ranges.Length(); i++)
+    srcVals->Item(i).SetIntValue(ranges[i], eCSSUnit_Integer);

declare the i inside the for()

nsCSSRules.cpp:

+  for (PRUint32 i = 0; i < sources.Count(); i += 2)
+    {
+      if (!(i+1 < sources.Count())) {

The bracing and indentation of this for() should match local style (like
the if, which is correct).

+      if (!(i+1 < sources.Count())) {
+        NS_NOTREACHED("odd number of entries in a unicode-range: array");
+        break;
+      }
+      if (sources[i].GetUnit() != eCSSUnit_Integer ||
+          sources[i+1].GetUnit() != eCSSUnit_Integer) {
+        NS_NOTREACHED("non-integer entries in a unicode-range: array");
+        continue;
+      }

You know what data structures you created in the parser; both of these
checks should just be NS_ABORT_IF_FALSE with no runtime checking.

nsCSSScanner.cpp:

+static inline PRUint32
+DigitValue(PRInt32 ch) {

Brace on its own line.

And maybe call it DecimalDigitValue?

+static inline PRUint32
+HexDigitValue(PRInt32 ch) {

Brace on its own line.


It looks like you set mIdent in ParseURange, but never use it, and
nsCSSScanner.h doesn't document that it's set.  Why bother setting it?

+  PRBool valid = PR_TRUE;
+  PRUint32 low = 0;
+  PRUint32 high = 0;
+  int nques = 0;
+  int i;

Move all these declarations to right before the loop.

int nques can actually just be PRBool haveQues, the way you use it.

For incorrect unicode range tests, you should probably add some more
tests for rejecting the various 7-digit cases.


However, I think we should do unicode range as modal tokenization (like
nsCSSScanner::NextURL), though, so that we don't end up needing to add
special case code for things like 
  font-family: u+7;
which I think still ought to be treated the same as:
  font-family: "u+7";
(and which I think the spec still says ought to be, since it (CSS 2.1)
doesn't describe the handling in terms of tokens).

In other words, ParseFontRanges should call a GetURange function instead
of GetToken, and GetURange should assert there's no pushback and then
call a scanner function like NextURL.

Though... it might be that font-family is actually the only case that
matters (although I'm worried there might be other cases that I'm not
thinking of), in which case you could just use that mToken.mIdent that
you're setting but that I told you not to above.  But you might have to
convince me that ParseOneFamily is the only caller we need to worry
about breaking compatibility in.  That said, I think it might be less
messy to do modal tokenization than fix ParseOneFamily.

(I realize this disagrees with what I said in comment 6, sort of.  I
guess I sort of disagree with the practicality of the spec here.  Of
course, technically, ParseOneFamily disagrees with the spec right now,
since it should accept more odd characters than it does.)


review- only because of the modal tokenization issue, which I think
ought to be pretty straightforward to fix; otherwise it looks good.
Attachment #395208 - Flags: review?(dbaron) → review-
And, to be clear, I think your existing ParseURange could become NextURange with only slight modification at the beginning to check for the U+ and just push things back and call Next() if it finds something else.
Comment on attachment 395208 [details] [diff] [review]
(rev 3) scan unicode-range tokens exactly as in css2.1

Actually, I changed my mind again.  This patch is good with the comments addressed.

However, we should write a separate patch to make ParseOneFamily use modal tokenization according to the rules in http://www.w3.org/TR/CSS21/fonts.html#font-family-prop , and land it at the same time as this patch, to avoid regressing font-family parsing.  Or maybe it's even ok if we regress that.
Attachment #395208 - Flags: review- → review+
(In reply to comment #19)
> However, we should write a separate patch to make ParseOneFamily use modal
> tokenization according to the rules in
> http://www.w3.org/TR/CSS21/fonts.html#font-family-prop , and land it at the
> same time as this patch, to avoid regressing font-family parsing.  Or maybe
> it's even ok if we regress that.

Actually, there's nothing to regress; + isn't an ident character.

However, we should figure out if we're compatible with other browsers as far as font-family parsing goes and either fix our own behavior or get the spec changed.  Or maybe just go back to pretending we didn't notice that part of the spec, since I doubt anybody cares anyway...
To be extra clear:  I'm saying to ignore comment 17 from "However, I think we should do unicode range as modal tokenization" to the end of the comment.
This may now have a textual dependency on the patches for bug 511333.

(In reply to comment #15)
> you can leave the trailing comma after the last entry; it's ok in JavaScript.

Ok, will remember that in future.

(In reply to comment #16)
> And shouldn't the "Mixtures where nothing survives" tests just be under the
> heading "Incorrect unicode-range:"?

I rearranged things and made the comments clearer.  I was trying distinguish cases where one range is ignored from cases where the entire multi-range descriptor is ignored, and both of them from cases where there's only one range to ignore.

(In reply to comment #17)
> mScanner.SetLowLevelError(NS_ERROR_OUT_OF_MEMORY);
> before the return
done

> declare the i inside the for()
done

> +  for (PRUint32 i = 0; i < sources.Count(); i += 2)
> +    {
> 
> The bracing and indentation of this for() should match local style (like
> the if, which is correct).

I'm not sure how I managed to do that.  Fixed.

> You know what data structures you created in the parser; both of these
> checks should just be NS_ABORT_IF_FALSE with no runtime checking.

I pulled the length check out of the loop and made it an NS_ABORT_IF_FALSE.  The type check is redundant with an NS_ASSERTION in GetIntValue() so I just took it out altogether.

> Brace on its own line.

Fixed
 
> And maybe call it DecimalDigitValue?

The idea with the naming was to match the IsDight/IsHexDigit functions above, but it is clearer with DecimalDigitValue, so changed.

> It looks like you set mIdent in ParseURange, but never use it, and
> nsCSSScanner.h doesn't document that it's set.  Why bother setting it?

It's used in nsCSSToken::AppendToString; the idea is to preserve the exact textual form of the token for error messages.  Only valid unicode-range tokens can be reconstructed from the numbers in mInteger and mInteger2, and we don't know whether the question-mark form was used when we do that (e.g. in nsCSSRules.cpp).

The invalid tokens never make it past ParseFontRanges(), so the ident is indeed never used outside the scanner.  I have, however, documented it in nsCSSScanner.h.

I suppose I could produce an error token with a different type for invalid unicode-range, which would simplify the parser a tiny bit, but then we wouldn't be able to reconstruct the exact text when a valid unicode-range token appears in the wrong place, which might be really confusing.  Don't care much either way though.

> +  PRBool valid = PR_TRUE;
> +  PRUint32 low = 0;
> +  PRUint32 high = 0;
> +  int nques = 0;
> +  int i;
> 
> Move all these declarations to right before the loop.
> 
> int nques can actually just be PRBool haveQues, the way you use it.

Done.

> For incorrect unicode range tests, you should probably add some more
> tests for rejecting the various 7-digit cases.

Done; also added a few more miscellaneous syntax error tests.  It's still a bit ad hoc, and it's hard to tell exactly how things are getting tokenized since anything but <unicode-range> tokens and commas in a unicode-range descriptor invalidates the whole descriptor.  I kinda want a unit test program that just tokenizes.
Attachment #395208 - Attachment is obsolete: true
Attachment #395444 - Flags: review?(dbaron)
This is actually causing a bunch of TEST-UNEXPECTED-PASS in test_descriptor_storage.html, but they were trivial to fix.  However, that led me to discover I'd made a few mistakes in descriptor_database.js... except one of my invalid unicode-range values is actually being accepted by your parsing code.

I've modified the patch by adding changes to test_descriptor_storage.html, test_descriptor_syntax_errors.html, and descriptor_database.js ... and I can land it that way, although it might be good to figure out why we're accepting the value "U+0-7,A-C" (in which I'd inadvertently forgotten the second U+, along with one other value).

My modified patch is here:
http://hg.mozilla.org/users/dbaron_mozilla.com/patches/raw-file/d45501bb6873/zwol-443976
Actually, I think the fix is trivial.  This case:

+    if (mToken.mType != eCSSToken_URange)
+      break;

needs an UngetToken(), since it didn't consume that token.  I'll double-check that it works, and if so, just land that way.
Yes, it works, so now the patch I'm planning to land is:
http://hg.mozilla.org/users/dbaron_mozilla.com/patches/raw-file/d3aa3bf485a6/zwol-443976

The code change is near the top of the very first change in the patch.  It looks ok to you, right?
Yes, that is clearly the right fix.  And I'm sorry about not noticing the other mochitests that were affected.  I should have at least done a try server run.

It occurs to me that there's one more thing that should be tested:

  @font-face {
     unicode-range: U+0-9,;
     src: local(Mouse);
  }

Without the UngetToken(), I think the src: descriptor would have been eaten by error recovery.  In test_font_face_parser.html's table syntax,

  { rule: _("unicode-range: U+0-9,; src:local(Mouse);"),
    d: { "src" : "local(Mouse)" }, noncanonical: true },
http://hg.mozilla.org/mozilla-central/rev/feddc329bd56

I didn't add the test from comment 27, though.  But we probably should... care to toss up a patch?
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla1.9.3a1
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: