443976 - Support unicode-range: descriptor in @font-face parser

Assignee

Description

•

17 years ago

css3-fonts specifies a unicode-range: descriptor for @font-face rules, which can be used to avoid downloading fonts that don't contain any glyphs that are relevant to the document being rendered. As this feature is not critical for initial @font-face support and requires monkeying with nsCSSScanner, I am splitting it off from bug 441469. To elaborate a little on "requires monkeying with nsCSSScanner": A typical use of the unicode-range: descriptor might be unicode-range: U+0374-03FF, U+1D26-1DBF, U+1F00-1FFE; /* Greek */ The parser wants to see each hexadecimal number as a separate token and the "U+" introducers as single entities, like this (spaces separate tokens): unicode-range : U+ 0374 - 03FF , U+ 1D26 - 1DBF , U+ 1F00 - 1FFE ; but the normal CSS tokenization rules have no concept of hex numbers and will split these entirely differently, something like unicode-range : U +0374 -03FF , U +1D 26 -1DBF , U + 1F 00 -1FFE ; (or possibly even worse, I'm not sure exactly how DIMENSION works) To avoid having to re-tokenize the entire descriptor in the parser, we will need to give nsCSSScanner a special mode for these descriptors.

Flags: wanted1.9.1?

Zack Weinberg (:zwol)

Assignee

Comment 1

•

17 years ago

Attached patch (rev 0) code to date, doesn't work (obsolete) — Details — Splinter Review

throwing the code I've got so far in here so it doesn't get lost.

Zack Weinberg (:zwol)

Assignee

Comment 2

•

17 years ago

Attached patch (rev 1) working patch (obsolete) — Details — Splinter Review

Here's a working patch. As mentioned above, I had to introduce a new mode into the CSS scanner. I believe this did not affect anything else (and I ran the layout mochitests with no failures) but that part probably deserves close scrutiny anyway. A unicode-range: descriptor turns into an array of alternating min, max values. There is no attempt to sort them or combine adjacent/overlapping ranges. Perhaps there should be. This patch was developed on top of the patches in bug 452275 and bug 452278; I can redo it without them if necessary.

Attachment #328442 - Attachment is obsolete: true

Attachment #335623 - Flags: superreview?(dbaron)

Attachment #335623 - Flags: review?(dbaron)

Zack Weinberg (:zwol)

Assignee

Updated

•

17 years ago

Depends on: 452275, 452278

Zack Weinberg (:zwol)

Assignee

Comment 3

•

17 years ago

Comment on attachment 335623 [details] [diff] [review] (rev 1) working patch I've thought of a less invasive way to do the scanner changes; will post a revised patch later today. (Will still be on top of the other parser/scanner cleanup bugs.)

Attachment #335623 - Flags: superreview?(dbaron)

Attachment #335623 - Flags: review?(dbaron)

Zack Weinberg (:zwol)

Assignee

Updated

•

17 years ago

Depends on: 452518

Zack Weinberg (:zwol)

Assignee

Comment 4

•

17 years ago

Attached patch (rev 2) less invasive to scanner (obsolete) — Details — Splinter Review

This revised patch avoids making any changes to the normal paths through the scanner; instead there is a new function nsCSSScanner::NextURange() used exclusively by CSSParserImpl::ParseURange(). Revision is on top of bug 452518 in addition to what it used to be. (The changes in that bug were in rev 1 of this patch but I split them out because they're unrelated.)

Attachment #335623 - Attachment is obsolete: true

Zack Weinberg (:zwol)

Assignee

Comment 5

•

17 years ago

Comment on attachment 335800 [details] [diff] [review] (rev 2) less invasive to scanner flagging for review

Attachment #335800 - Flags: superreview?(dbaron)

Attachment #335800 - Flags: review?(dbaron)

Robert O'Callahan (:roc) (email my personal email if necessary)

Updated

•

17 years ago

Flags: wanted1.9.1? → wanted1.9.1+

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Comment 6

•

17 years ago

So in REC-CSS2, unicode-range values were part of the tokenization: http://www.w3.org/TR/1998/REC-CSS2-19980512/syndata.html#tokenization That suggests changing the scanner more and the parser less, and that the scanner doesn't even need to be modal. And I can't think of anything that would be broken by that approach. That said, maybe this approach is ok too; we used to use it for things like ~= and |=, but once we implemented namespaces it was easier to tokenize |= separately so we could distinguish [namespace|attribute=value] from [attribute|=value] without too much lookahead. I'll keep looking at it...

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Comment 7

•

17 years ago

Comment on attachment 335800 [details] [diff] [review] (rev 2) less invasive to scanner I'm ok with the approach of treating the unicode-range values as separate tokens, although I'd also be happy with the REC-CSS2-19980512 approach of treating the whole mess as a single token, codewise. (You'd just need to add a new token type an a second integer field to nsCSSToken.) With the current approach, it might make sense to move ParseURange to the scanner, though. I think you should limit the number of digits or question marks to 6 rather than 8, per REC-CSS2-19980512, and I think css3-fonts should say that that's what the syntax is. I think you should also update the comments above ParseURange and NextURange (in nsCSSScanner.h and .cpp) to reflect the understanding that the code you're writing is parsing what the spec describes as a single token. The comments about NextURange (both before and inside the function) that describe what is consumed do not match the code. I think the code is right and the comments are wrong. (In particular, symbols and whitespace are consumed, as I think they should be to make the parsing code easier.) One difficulty of getting the parsing code really correct is to skip comments in the right places. You need to skip comments anywhere that whitespace is allowed, but not inside of the U+XXXX-XXXX. I think your current code is actually correct, but it sort of looks like it's correct by accident (since you're parsing the "U" and the commas using normal tokenization). Some comments would help. I think you probably should add a distinct token type for NextURange to return when it parses hex digits (rather than using eCSSToken_Number), and document that that token type sets both mIdent and mInteger. I don't think you need to bother with mIntegerValid. (Normal number tokens set mNumber and mIntegerValid and maybe mInteger.) I tend to think that a unicode range where the min is greater than the max shouldn't be a parse error; it just shouldn't add anything to the set. And I think U+0100-0100 should also be valid; you're treating that as a parse error too. (Or is there something in the spec that says it should be a parse error? If there isn't, it shouldn't be.) By the spec, I mean any of: http://www.w3.org/TR/1998/REC-CSS2-19980512/fonts.html#dataqual http://www.w3.org/TR/1998/REC-CSS2-19980512/syndata.html#tokenization http://dev.w3.org/csswg/css3-fonts/#character What looks like a real problem with your current approach, though, is at the end of ParseURange, you don't UngetToken(). I think you need to; otherwise the U+NNNN ranges will be broken because the parsing code will eat the next character. The problem is that I'm not even sure that UngetToken is sufficient, since your parsing code is different. That said, I think it should be, since anything (like '(' or '{') that requires particularly interesting handling will be parsed the same by Next or NextURange. I think that's yet another reason for not breaking the invariants of what goes in what type of token, though. Then again, it looks like you have tests that test this. Or are those tests not doing what I think they do? Or do some of them fail with this version of the patch? If not, why aren't they failing? What did I misunderstand? in nsCSSRules.cpp, you're doing nonstandard indentation of the for loop.

Attachment #335800 - Flags: superreview?(dbaron)

Attachment #335800 - Flags: superreview-

Attachment #335800 - Flags: review?(dbaron)

Attachment #335800 - Flags: review-

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Comment 8

•

17 years ago

(And when you're fixing these issues, add tests for them too. :-)

Zack Weinberg (:zwol)

Assignee

Comment 9

•

17 years ago

Just a few quick notes. I didn't realize css2 had anything to say about this stuff; I was going entirely by css3-fonts, which doesn't give me tokenization rules at all, so I had to make something up. In general I like dumb scanners and smart parsers, but I'll have to look at the code some more (and read the old spec) to see what makes more sense here. It's possible that moving what is now ParseURange into the scanner would avoid the too-clever interaction between ParseURange and NextURange that seems to have confused you (see below) and that might be enough reason to do it by itself. NextURange does *not* consume either whitespace or symbols other than +, -, ?, and it wouldn't work if it did. The comment is correct and matches the code. You may have mistaken the initial Peek() for a Read(). I'm pretty certain that the whitespace handling is correct as is (albeit almost entirely implicit in the division of labor between ParseURange and ParseFontRanges) but I probably could use more tests for it, especially involving comments. The tests in the patch all pass, and your concern about not calling UngetToken() at the end of ParseURange() is *supposed* to be addressed by part of the long comment above ParseURange() -- basically, NextURange doesn't always consume the token it returns because that makes ParseURange not have to call UngetToken and therefore not worry about whether what it pushes back should have been tokenized differently. It may be too clever, but it works. I do need to add test cases for a good regular descriptor after both a bad and a good unicode-range descriptor, though. If I did change to using CSS2's core tokenization rule for this stuff, what should I do about nonsense tokens, e.g. U+?020 U+00??-1100 which are perfectly acceptable according to the stated rule? I'm probably not going to cycle back to this bug for a few days.

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Comment 10

•

17 years ago

Ah, right, I see. In any case, NextURange's behavior of setting the token whether or not it consumed something is really quite confusing. I don't think it matters whether those nonsense tokens produce unicode-range tokens as long as they lead to parse errors. (The comment isn't quite correct, though, since '/' is also consumed, but that could be fixed by one Pushback call.) Maybe this approach is ok (modulo the 6 vs 8 digit issue and a few of the other comments), although it does seem a little too clever. There could also be some other difference between a single token and multiple tokens that I'm not thinking of (like we ran into with |=).

Zack Weinberg (:zwol)

Assignee

Comment 11

•

17 years ago

I'm dropping this off the 1.9.1 wanted list because there are more important things to get done, the changes required are more invasive than ideal at this stage, and there are standards ambiguities that have yet to be resolved.

Flags: wanted1.9.1+

John Daggett (:jtd)

Updated

•

16 years ago

Blocks: unicode-range

John Daggett (:jtd)

Comment 12

•

16 years ago

Zack, I'm wondering if you think there are still issues with the unicode-range definition in the CSS3 Fonts spec: http://www.w3.org/TR/css3-fonts/#character-range-the-unicode-range-descri I biggest change was to confine unicode-range to 0 - 10FFFF, which makes it compatible with the unicode-range grammar given in CSS 2.1 (although I realize that's incomplete). How much work needs to be done to complete unicode-range parsing? Do you have time to finish off the patch? If not, could you briefly outline what you think needs to happen and I'll take a stab at it (or maybe there's an intern hunting around for things to do?).

Zack Weinberg (:zwol)

Assignee

Comment 13

•

16 years ago

I think there are no further semantic problems with the spec, but I would like to suggest some editorial changes. Most important, suggest using the term "ignored" instead of "omitted" for consistency with CSS2.1, and consistently using either "syntax error" or "parse error", not both. [CSS2.1 seems to use "parsing error".] # "U+??????" is not a syntax error, even though "U+0??????" would be. could use clarification as to the meaning of "U+"??????", e.g. # "U+??????" is not a syntax error, even though "U+0??????" would be; # it is interpreted as U+0-FFFFFF. The sentence # Ranges can overlap but interval ranges that descend (e.g. U+400-32f) # are invalid and omitted rather than treated as parse errors; they have # no effect on other ranges in a list of ranges. says too many things at once -- I'd like to reorganize that entire paragraph, actually. Maybe not in a bug comment. > How much work needs to be done to complete unicode-range parsing? Do you have > time to finish off the patch? I should be able to get to that this week. It's almost done.

Zack Weinberg (:zwol)

Assignee

Comment 14

•

16 years ago

Attached patch (rev 3) scan unicode-range tokens exactly as in css2.1 (obsolete) — Details — Splinter Review

This version of the patch treats unicode-range exactly as css2.1 specifies -- the whole thing is one token at the scanner level. This does involve adding a special extra condition to Next(), but them's the breaks. This also incorporates the latest revisions to css3-fonts as far as handling of syntactically valid but meaningless unicode-range tokens, weird combinations, etc. and updates the tests to match.

Attachment #335800 - Attachment is obsolete: true

Attachment #395208 - Flags: review?(dbaron)

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Comment 15

•

16 years ago

Comment on attachment 395208 [details] [diff] [review] (rev 3) scan unicode-range tokens exactly as in css2.1 >+ { rule: _("unicode-range: U+A5, U+0?F"), d: {} }, >+ { rule: _("unicode-range: U+A5, U+0F?-E00"), d: {} } One comment before I forget about it: you can leave the trailing comma after the last entry; it's ok in JavaScript.

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Comment 16

•

16 years ago

And shouldn't the "Mixtures where nothing survives" tests just be under the heading "Incorrect unicode-range:"? Isn't the whole thing dropped because it's syntactically incorrect (mixing ? with ranges, or having digits after ?)?

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Comment 17

•

16 years ago

Comment on attachment 395208 [details] [diff] [review] (rev 3) scan unicode-range tokens exactly as in css2.1 Continued from comment 15 and comment 16. (I started looking thinking I'd review it tomorrow, and then realized I may as well just review it now.) + if (!srcVals) + return PR_FALSE; mScanner.SetLowLevelError(NS_ERROR_OUT_OF_MEMORY); before the return + PRUint32 i; + for (i = 0; i < ranges.Length(); i++) + srcVals->Item(i).SetIntValue(ranges[i], eCSSUnit_Integer); declare the i inside the for() nsCSSRules.cpp: + for (PRUint32 i = 0; i < sources.Count(); i += 2) + { + if (!(i+1 < sources.Count())) { The bracing and indentation of this for() should match local style (like the if, which is correct). + if (!(i+1 < sources.Count())) { + NS_NOTREACHED("odd number of entries in a unicode-range: array"); + break; + } + if (sources[i].GetUnit() != eCSSUnit_Integer || + sources[i+1].GetUnit() != eCSSUnit_Integer) { + NS_NOTREACHED("non-integer entries in a unicode-range: array"); + continue; + } You know what data structures you created in the parser; both of these checks should just be NS_ABORT_IF_FALSE with no runtime checking. nsCSSScanner.cpp: +static inline PRUint32 +DigitValue(PRInt32 ch) { Brace on its own line. And maybe call it DecimalDigitValue? +static inline PRUint32 +HexDigitValue(PRInt32 ch) { Brace on its own line. It looks like you set mIdent in ParseURange, but never use it, and nsCSSScanner.h doesn't document that it's set. Why bother setting it? + PRBool valid = PR_TRUE; + PRUint32 low = 0; + PRUint32 high = 0; + int nques = 0; + int i; Move all these declarations to right before the loop. int nques can actually just be PRBool haveQues, the way you use it. For incorrect unicode range tests, you should probably add some more tests for rejecting the various 7-digit cases. However, I think we should do unicode range as modal tokenization (like nsCSSScanner::NextURL), though, so that we don't end up needing to add special case code for things like font-family: u+7; which I think still ought to be treated the same as: font-family: "u+7"; (and which I think the spec still says ought to be, since it (CSS 2.1) doesn't describe the handling in terms of tokens). In other words, ParseFontRanges should call a GetURange function instead of GetToken, and GetURange should assert there's no pushback and then call a scanner function like NextURL. Though... it might be that font-family is actually the only case that matters (although I'm worried there might be other cases that I'm not thinking of), in which case you could just use that mToken.mIdent that you're setting but that I told you not to above. But you might have to convince me that ParseOneFamily is the only caller we need to worry about breaking compatibility in. That said, I think it might be less messy to do modal tokenization than fix ParseOneFamily. (I realize this disagrees with what I said in comment 6, sort of. I guess I sort of disagree with the practicality of the spec here. Of course, technically, ParseOneFamily disagrees with the spec right now, since it should accept more odd characters than it does.) review- only because of the modal tokenization issue, which I think ought to be pretty straightforward to fix; otherwise it looks good.

Attachment #395208 - Flags: review?(dbaron) → review-

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Comment 18

•

16 years ago

And, to be clear, I think your existing ParseURange could become NextURange with only slight modification at the beginning to check for the U+ and just push things back and call Next() if it finds something else.

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Comment 19

•

16 years ago

Comment on attachment 395208 [details] [diff] [review] (rev 3) scan unicode-range tokens exactly as in css2.1 Actually, I changed my mind again. This patch is good with the comments addressed. However, we should write a separate patch to make ParseOneFamily use modal tokenization according to the rules in http://www.w3.org/TR/CSS21/fonts.html#font-family-prop , and land it at the same time as this patch, to avoid regressing font-family parsing. Or maybe it's even ok if we regress that.

Attachment #395208 - Flags: review- → review+

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Comment 20

•

16 years ago

(In reply to comment #19) > However, we should write a separate patch to make ParseOneFamily use modal > tokenization according to the rules in > http://www.w3.org/TR/CSS21/fonts.html#font-family-prop , and land it at the > same time as this patch, to avoid regressing font-family parsing. Or maybe > it's even ok if we regress that. Actually, there's nothing to regress; + isn't an ident character. However, we should figure out if we're compatible with other browsers as far as font-family parsing goes and either fix our own behavior or get the spec changed. Or maybe just go back to pretending we didn't notice that part of the spec, since I doubt anybody cares anyway...

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Comment 21

•

16 years ago

To be extra clear: I'm saying to ignore comment 17 from "However, I think we should do unicode range as modal tokenization" to the end of the comment.

Zack Weinberg (:zwol)

Assignee

Comment 22

•

16 years ago

Attached patch (rev 4) addressing review comments — Details — Splinter Review

This may now have a textual dependency on the patches for bug 511333. (In reply to comment #15) > you can leave the trailing comma after the last entry; it's ok in JavaScript. Ok, will remember that in future. (In reply to comment #16) > And shouldn't the "Mixtures where nothing survives" tests just be under the > heading "Incorrect unicode-range:"? I rearranged things and made the comments clearer. I was trying distinguish cases where one range is ignored from cases where the entire multi-range descriptor is ignored, and both of them from cases where there's only one range to ignore. (In reply to comment #17) > mScanner.SetLowLevelError(NS_ERROR_OUT_OF_MEMORY); > before the return done > declare the i inside the for() done > + for (PRUint32 i = 0; i < sources.Count(); i += 2) > + { > > The bracing and indentation of this for() should match local style (like > the if, which is correct). I'm not sure how I managed to do that. Fixed. > You know what data structures you created in the parser; both of these > checks should just be NS_ABORT_IF_FALSE with no runtime checking. I pulled the length check out of the loop and made it an NS_ABORT_IF_FALSE. The type check is redundant with an NS_ASSERTION in GetIntValue() so I just took it out altogether. > Brace on its own line. Fixed > And maybe call it DecimalDigitValue? The idea with the naming was to match the IsDight/IsHexDigit functions above, but it is clearer with DecimalDigitValue, so changed. > It looks like you set mIdent in ParseURange, but never use it, and > nsCSSScanner.h doesn't document that it's set. Why bother setting it? It's used in nsCSSToken::AppendToString; the idea is to preserve the exact textual form of the token for error messages. Only valid unicode-range tokens can be reconstructed from the numbers in mInteger and mInteger2, and we don't know whether the question-mark form was used when we do that (e.g. in nsCSSRules.cpp). The invalid tokens never make it past ParseFontRanges(), so the ident is indeed never used outside the scanner. I have, however, documented it in nsCSSScanner.h. I suppose I could produce an error token with a different type for invalid unicode-range, which would simplify the parser a tiny bit, but then we wouldn't be able to reconstruct the exact text when a valid unicode-range token appears in the wrong place, which might be really confusing. Don't care much either way though. > + PRBool valid = PR_TRUE; > + PRUint32 low = 0; > + PRUint32 high = 0; > + int nques = 0; > + int i; > > Move all these declarations to right before the loop. > > int nques can actually just be PRBool haveQues, the way you use it. Done. > For incorrect unicode range tests, you should probably add some more > tests for rejecting the various 7-digit cases. Done; also added a few more miscellaneous syntax error tests. It's still a bit ad hoc, and it's hard to tell exactly how things are getting tokenized since anything but <unicode-range> tokens and commas in a unicode-range descriptor invalidates the whole descriptor. I kinda want a unit test program that just tokenizes.

Attachment #395208 - Attachment is obsolete: true

Attachment #395444 - Flags: review?(dbaron)

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Updated

•

16 years ago

Attachment #395444 - Flags: review?(dbaron) → review+

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Comment 23

•

16 years ago

I'll land this later today.

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Comment 24

•

16 years ago

This is actually causing a bunch of TEST-UNEXPECTED-PASS in test_descriptor_storage.html, but they were trivial to fix. However, that led me to discover I'd made a few mistakes in descriptor_database.js... except one of my invalid unicode-range values is actually being accepted by your parsing code. I've modified the patch by adding changes to test_descriptor_storage.html, test_descriptor_syntax_errors.html, and descriptor_database.js ... and I can land it that way, although it might be good to figure out why we're accepting the value "U+0-7,A-C" (in which I'd inadvertently forgotten the second U+, along with one other value). My modified patch is here: http://hg.mozilla.org/users/dbaron_mozilla.com/patches/raw-file/d45501bb6873/zwol-443976

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Comment 25

•

16 years ago

Actually, I think the fix is trivial. This case: + if (mToken.mType != eCSSToken_URange) + break; needs an UngetToken(), since it didn't consume that token. I'll double-check that it works, and if so, just land that way.

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Comment 26

•

16 years ago

Yes, it works, so now the patch I'm planning to land is: http://hg.mozilla.org/users/dbaron_mozilla.com/patches/raw-file/d3aa3bf485a6/zwol-443976 The code change is near the top of the very first change in the patch. It looks ok to you, right?

Zack Weinberg (:zwol)

Assignee

Comment 27

•

16 years ago

Yes, that is clearly the right fix. And I'm sorry about not noticing the other mochitests that were affected. I should have at least done a try server run. It occurs to me that there's one more thing that should be tested: @font-face { unicode-range: U+0-9,; src: local(Mouse); } Without the UngetToken(), I think the src: descriptor would have been eaten by error recovery. In test_font_face_parser.html's table syntax, { rule: _("unicode-range: U+0-9,; src:local(Mouse);"), d: { "src" : "local(Mouse)" }, noncanonical: true },

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Comment 28

•

16 years ago

http://hg.mozilla.org/mozilla-central/rev/feddc329bd56 I didn't add the test from comment 27, though. But we probably should... care to toss up a patch?

Status: ASSIGNED → RESOLVED

Closed: 16 years ago

Resolution: --- → FIXED

Target Milestone: --- → mozilla1.9.3a1

(rev 0) code to date, doesn't work 17 years ago Zack Weinberg (:zwol) 5.66 KB, patch		Details \| Diff \| Splinter Review
(rev 1) working patch 17 years ago Zack Weinberg (:zwol) 20.14 KB, patch		Details \| Diff \| Splinter Review
(rev 2) less invasive to scanner 17 years ago Zack Weinberg (:zwol) 13.79 KB, patch	dbaron : review- dbaron : superreview-	Details \| Diff \| Splinter Review
(rev 3) scan unicode-range tokens exactly as in css2.1 16 years ago Zack Weinberg (:zwol) 15.91 KB, patch	dbaron : review+	Details \| Diff \| Splinter Review
(rev 4) addressing review comments 16 years ago Zack Weinberg (:zwol) 21.20 KB, patch	dbaron : review+	Details \| Diff \| Splinter Review