Last Comment Bug 543151 - Eliminate CSS scanner's pushback buffer
: Eliminate CSS scanner's pushback buffer
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: CSS Parsing and Computation (show other bugs)
: Trunk
: x86 Linux
: -- normal with 1 vote (vote)
: mozilla21
Assigned To: Zack Weinberg (:zwol)
:
:
Mentors:
Depends on: 541496
Blocks: 664818 659963
  Show dependency treegraph
 
Reported: 2010-01-29 17:20 PST by Zack Weinberg (:zwol)
Modified: 2013-05-27 23:33 PDT (History)
13 users (show)
ryanvm: in‑testsuite+
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
patch v1 (44.50 KB, patch)
2010-01-29 17:20 PST, Zack Weinberg (:zwol)
no flags Details | Diff | Splinter Review
v2 (59.84 KB, patch)
2011-05-05 11:59 PDT, Zack Weinberg (:zwol)
no flags Details | Diff | Splinter Review
v3 part 1 (9.68 KB, patch)
2011-05-21 10:29 PDT, Zack Weinberg (:zwol)
dbaron: review+
Details | Diff | Splinter Review
v3 part 2 (10.24 KB, patch)
2011-05-22 20:21 PDT, Zack Weinberg (:zwol)
dbaron: review+
Details | Diff | Splinter Review
v3 part 3 (3.25 KB, patch)
2011-05-22 20:26 PDT, Zack Weinberg (:zwol)
dbaron: review+
bzbarsky: review+
Details | Diff | Splinter Review
v3 part 4 (7.12 KB, patch)
2011-05-22 20:28 PDT, Zack Weinberg (:zwol)
bzbarsky: review+
dbaron: review-
Details | Diff | Splinter Review
v3 part 5 (12.84 KB, patch)
2011-05-22 20:30 PDT, Zack Weinberg (:zwol)
no flags Details | Diff | Splinter Review
v3 part 6 (18.76 KB, patch)
2011-05-22 20:31 PDT, Zack Weinberg (:zwol)
no flags Details | Diff | Splinter Review
v3 part 7 (14.52 KB, patch)
2011-05-22 20:32 PDT, Zack Weinberg (:zwol)
no flags Details | Diff | Splinter Review
v3 part 8 (8.37 KB, patch)
2011-05-22 20:32 PDT, Zack Weinberg (:zwol)
no flags Details | Diff | Splinter Review
v3 part 9 (8.92 KB, patch)
2011-05-22 20:33 PDT, Zack Weinberg (:zwol)
no flags Details | Diff | Splinter Review
v3 part 10 (9.47 KB, patch)
2011-05-22 20:35 PDT, Zack Weinberg (:zwol)
no flags Details | Diff | Splinter Review
v3 part 11 (24.38 KB, patch)
2011-05-29 20:59 PDT, Zack Weinberg (:zwol)
no flags Details | Diff | Splinter Review
v3 part 1 final (9.39 KB, patch)
2011-10-19 18:55 PDT, Zack Weinberg (:zwol)
zackw: review+
zackw: checkin+
Details | Diff | Splinter Review
v3 part 2 final (9.18 KB, patch)
2011-10-19 18:56 PDT, Zack Weinberg (:zwol)
zackw: review+
zackw: checkin+
Details | Diff | Splinter Review
part A1 (42.21 KB, patch)
2013-01-17 11:06 PST, Zack Weinberg (:zwol)
cam: review+
Details | Diff | Splinter Review
part A2 (18.74 KB, patch)
2013-01-17 11:24 PST, Zack Weinberg (:zwol)
no flags Details | Diff | Splinter Review
part A3 (35.27 KB, patch)
2013-01-17 11:25 PST, Zack Weinberg (:zwol)
cam: review+
Details | Diff | Splinter Review
part A4 (16.31 KB, patch)
2013-01-17 11:27 PST, Zack Weinberg (:zwol)
no flags Details | Diff | Splinter Review
part A2 (18.62 KB, patch)
2013-02-01 15:18 PST, Zack Weinberg (:zwol)
cam: review+
Details | Diff | Splinter Review
part A4 (14.90 KB, patch)
2013-02-01 15:19 PST, Zack Weinberg (:zwol)
cam: review+
Details | Diff | Splinter Review
A1 final (45.32 KB, patch)
2013-02-16 15:08 PST, Zack Weinberg (:zwol)
zackw: review+
Details | Diff | Splinter Review
A2 final (19.80 KB, patch)
2013-02-16 15:09 PST, Zack Weinberg (:zwol)
zackw: review+
Details | Diff | Splinter Review
A3 final (41.60 KB, patch)
2013-02-16 15:11 PST, Zack Weinberg (:zwol)
zackw: review+
Details | Diff | Splinter Review
A4 final (17.00 KB, patch)
2013-02-16 15:13 PST, Zack Weinberg (:zwol)
zackw: review+
Details | Diff | Splinter Review

Description Zack Weinberg (:zwol) 2010-01-29 17:20:45 PST
Created attachment 424353 [details] [diff] [review]
patch v1

This patch eliminates the CSS scanner's pushback buffer.  All internal uses were converted from "read past the end, then pushback" behavior to "peek ahead, consume only if certain" behavior.  There was also one place in the CSS parser where we pushed back part of a token in order to retokenize it differently; instead it now processes the token fragment itself.

The advantage here is that the Peek() and Advance() functions can be considerably simpler; one expects that this will improve performance.  (I'll not be able to produce numbers till Monday, though.  Sorry.)  Also, we don't need the ad-hoc expandable pushback buffer anymore.  And it facilitates the next patch in this series, which eliminates the TrackPosition and AdvanceColumn calls into the error reporter, for hopefully even more performance win.

This has a hard dependency on the change in bug 541496 (removal of stream processing) as it only works if the scanner has access at all times to the entire text to be scanned.  It has a weak dependency on the changes in bugs 516091 and 455389, since I developed this patch on top of them; that could be undone with some effort.

There is a tangential included change, which is to use PR_strtod to do the actual text to floating-point conversion in Scanner::ParseNumber.  I did this primarily to make the rewrite of ParseNumber easier on myself; it doesn't need to scan the number and convert it simultaneously anymore.  It should also be more precise (nice for -moz-transform, I'd think) and possibly even faster (we were doing a lot of expensive floating point ops).
Comment 1 Boris Zbarsky [:bz] (still a bit busy) 2010-01-29 18:34:45 PST
> and possibly even faster (we were doing a lot of expensive floating point ops)

Which are cheaper than the string ops that are being done now, possibly (though with an nsAutoString, hard to say).  Oh, and PR_strtod is really slow from past profiling and measuring.  In fact, the changes to ParseNumber seem to be mostly undoing bug 498559.  Have you run the microbenchmark in that bug with and without this patch?  How does it look?
Comment 2 Zack Weinberg (:zwol) 2010-01-29 19:27:45 PST
I didn't know about that microbenchmark.  It is slower with my patch, but not by a whole lot.  Also, bad news: the "CSS parser interface cleanup" patch in bug 523496 (not the deCOM patch but its sequel) introduces a shocking amount of overhead.  Subsequent patches whittle it back down a bit, but ...

unpatched minefield:
    12345px: 112 ms
12.345678px: 115 ms

with interface-cleanup patch:
    12345px: 330 ms
12.345678px: 339 ms

with all the patches that this one depends on, but not this one:
    12345px: 282 ms
12.345678px: 289 ms

with this patch:
    12345px: 309 ms
12.345678px: 327 ms

with the as-yet-unposted follow-up to this patch:
    12345px: 303 ms
12.345678px: 314 ms

On Monday I'll see what putting back the hand calculation does, and I'll also try to track down the cause of that overhead.
Comment 3 Boris Zbarsky [:bz] (still a bit busy) 2010-01-29 19:42:12 PST
Huh.  That's weird; I'd love to see where that overhead is coming from...
Comment 4 Zack Weinberg (:zwol) 2010-01-29 20:08:34 PST
This is only a hunch based on assembly dumps, but I think it might be the nsCSSExpandedDataBlock constructor, which looks significantly more expensive than the Clear() method.  Maybe that's the overhead that parser-object recycling was avoiding.
Comment 5 Boris Zbarsky [:bz] (still a bit busy) 2010-01-29 20:23:54 PST
Hmm.  The nsCSSExpandedDataBlock ctor initializes a whole slew of nsCSSValues and the like?

I think originally the parser recycling was supposed to avoid the allocation overhead on the parser; it definitely predates the expanded data block...
Comment 6 Zack Weinberg (:zwol) 2010-01-29 21:49:29 PST
(In reply to comment #5)
> Hmm.  The nsCSSExpandedDataBlock ctor initializes a whole slew of nsCSSValues
> and the like?

Yah.

> I think originally the parser recycling was supposed to avoid the allocation
> overhead on the parser; it definitely predates the expanded data block...

I don't think malloc overhead could account for a 200% slowdown on your microbenchmark, though.
Comment 7 Zack Weinberg (:zwol) 2010-02-01 12:47:15 PST
More analysis in bug 523496.
Comment 8 Boris Zbarsky [:bz] (still a bit busy) 2010-02-24 10:59:27 PST
Zack, are you still looking for review on the diff attached here?  Or do you have an updated version with different ParseNumber code?
Comment 9 Zack Weinberg (:zwol) 2010-02-24 11:07:45 PST
Comment on attachment 424353 [details] [diff] [review]
patch v1

Canceling review; this is going to get revised at least once before it's worth looking at.
Comment 10 Zack Weinberg (:zwol) 2010-02-24 11:08:52 PST
I don't have a revised version yet, to be clear.
Comment 11 Zack Weinberg (:zwol) 2011-05-05 11:59:19 PDT
Created attachment 530389 [details] [diff] [review]
v2

Hard dependency on 541496, obviously.  I think these changes are correct, but testing blows up spectacularly due to the sync sheet load problems I mentioned there, so I can't be sure.
Comment 12 Zack Weinberg (:zwol) 2011-05-21 10:29:12 PDT
Created attachment 534238 [details] [diff] [review]
v3 part 1

The v2 patch turns out to have bugs in it, and it's also a big blob with unrelated changes that I really should have deferred to other bugs.  Unfortunately, I'm about halfway through redoing it in more digestible, fully tested chunks and I'm not going to have time to finish before the FF6 deadline.  (I won't be around *at all* on Monday or Tuesday because of Oakland [http://www.ieee-security.org/TC/SP2011/].)

But I think this piece of it, which just removes vestiges of support for stream processing from nsCSSScanner, should go into FF6.
Comment 13 David Baron :dbaron: ⌚️UTC-7 2011-05-21 22:21:40 PDT
Comment on attachment 534238 [details] [diff] [review]
v3 part 1

r=dbaron
Comment 14 Zack Weinberg (:zwol) 2011-05-22 20:21:10 PDT
Created attachment 534348 [details] [diff] [review]
v3 part 2

I got more hacking time this weekend than I expected, so I'm able to post the full patch series now; however, I'm not going to be able to address review comments or land anything until after FF6 closes.  If accepted as is or with nits before Tuesday, I'd appreciate someone taking over and getting it landed.

This piece removes special handling of tab characters in the scanner, which is inconsistent with the error console, and tidies up some other junk (#if 0, unused macros, etc).
Comment 15 Zack Weinberg (:zwol) 2011-05-22 20:26:24 PDT
Created attachment 534349 [details] [diff] [review]
v3 part 3

This piece starts pulling newline handling out of Read().  There are only three places where we can actually consume a newline (EatWhiteSpace, ParseAndAppendEscape, and SkipCComment) so it doesn't make sense to have the overhead of checking for newlines on every call to Read.  I'll be doing more of this after everything is converted to peek/advance.
Comment 16 Zack Weinberg (:zwol) 2011-05-22 20:28:23 PDT
Created attachment 534352 [details] [diff] [review]
v3 part 4

Introduce a Peek() that can look arbitrarily far ahead, and Advance() that just moves the read pointer forward without returning anything.  Replace Read with Advance where the return value of Read was being discarded, and rewrite some of the really simple helper functions in terms of Peek and Advance.
Comment 17 Zack Weinberg (:zwol) 2011-05-22 20:30:33 PDT
Created attachment 534353 [details] [diff] [review]
v3 part 5

The next several patches convert scanner subroutines to the peek/advance model.  I erred on the side of breaking things up too small, so there's impedance-matching calls to Pushback() introduced only to disappear again later.

This one does ParseURange, ParseNumber, and SkipCComment, which have the least interdependencies with other subroutines.
Comment 18 Zack Weinberg (:zwol) 2011-05-22 20:31:02 PDT
Created attachment 534354 [details] [diff] [review]
v3 part 6
Comment 19 Zack Weinberg (:zwol) 2011-05-22 20:31:37 PDT
Comment on attachment 534354 [details] [diff] [review]
v3 part 6

^^ That was really part 6.

ParseString, ParseAndAppendEscape, and NextURL.
Comment 20 Zack Weinberg (:zwol) 2011-05-22 20:32:16 PDT
Created attachment 534355 [details] [diff] [review]
v3 part 7

Everything to do with identifiers.
Comment 21 Zack Weinberg (:zwol) 2011-05-22 20:32:59 PDT
Created attachment 534356 [details] [diff] [review]
v3 part 8

The main entry point, Next().
Comment 22 Zack Weinberg (:zwol) 2011-05-22 20:33:33 PDT
Created attachment 534357 [details] [diff] [review]
v3 part 9

It is now possible to remove the pushback buffer.
Comment 23 Zack Weinberg (:zwol) 2011-05-22 20:35:15 PDT
Created attachment 534358 [details] [diff] [review]
v3 part 10

And finally, finish moving newline handling out to the subroutines that can actually consume newlines.  This also changes the way we track column position so we don't have to do bookkeeping work on every character.

http://tbpl.mozilla.org/?tree=Try&rev=9a5121889bb8
Comment 24 Zack Weinberg (:zwol) 2011-05-22 20:38:52 PDT
The microbenchmark in bug 498559 says:

unpatched minefield: 110, 121
full patch series: 99, 110
Comment 25 Zack Weinberg (:zwol) 2011-05-29 20:59:08 PDT
Created attachment 535993 [details] [diff] [review]
v3 part 11

While working on some other stuff down the road I noticed some cleanups that logically belong with this patch series.  This rationalizes the order of functions in nsCSSScanner.cpp a little, adds 'const' to some methods, removes a vestigial declaration of Read(), and imposes a consistent naming convention on the scanner methods.
Comment 26 Zack Weinberg (:zwol) 2011-05-29 21:00:33 PDT
Oh, and also switches all code in nsCSSScanner from PRBool to bool, except the methods that are going to be moved to their own file in bug 516091 (for an update to which, please stay tuned).
Comment 27 Boris Zbarsky [:bz] (still a bit busy) 2011-06-09 22:51:42 PDT
Comment on attachment 534349 [details] [diff] [review]
v3 part 3

r=me
Comment 28 Boris Zbarsky [:bz] (still a bit busy) 2011-06-09 22:59:12 PDT
Comment on attachment 534349 [details] [diff] [review]
v3 part 3

Actually...  Can an escaped newline end up in an ident or something?  Or are we guaranteed that all the callers who now get \r or \f instead of \n don't end up sticking the char anywhere permanent?
Comment 29 Boris Zbarsky [:bz] (still a bit busy) 2011-06-09 23:02:44 PDT
Comment on attachment 534352 [details] [diff] [review]
v3 part 4

This makes it so \r\n returns \r instead on \n.  Is that ok?

If so, r=me.
Comment 30 Boris Zbarsky [:bz] (still a bit busy) 2011-06-09 23:13:43 PDT
Comment on attachment 534353 [details] [diff] [review]
v3 part 5

This has typo fixes (in Peek, Advance, EatWhitespace) that should have been part of the previous patch.

I'd like to understand the benefits of this patch (e.g. in ParseNumber it's sort of weird to have 'c' always be something we haven't consumed yet.....).  What's the goal here?
Comment 31 Zack Weinberg (:zwol) 2011-06-10 12:31:17 PDT
(In reply to comment #28)
> Comment on attachment 534349 [details] [diff] [review] [review]
> v3 part 3
> 
> Actually...  Can an escaped newline end up in an ident or something?  Or are
> we guaranteed that all the callers who now get \r or \f instead of \n don't
> end up sticking the char anywhere permanent?

We do not preserve the full text of comments or whitespace sequences; \-newline is not an escape sequence outside a string; an unescaped newline ends a string token without becoming part of that token; and an escaped newline inside a string is completely removed.  This was one of the key insights that motivated this patch -- it's not necessary to check at every character for \r, \f, or \r\n and normalize them, because there are only a small number of places that can consume a newline and next to nowhere that can put a newline inside a token.

There is one place I missed, though: if we're outside a string and we encounter \r or \f as the first character of a whitespace sequence, we put that character in .mSymbol for the whitespace token.  This is harmless, because the parser ignores all value fields of a whitespace token, but I should probably fix it anyway (perhaps just not do that, since the parser doesn't care).

(In reply to comment #29)
> Comment on attachment 534352 [details] [diff] [review] [review]
> v3 part 4
> 
> This makes it so \r\n returns \r instead on \n.  Is that ok?

Yes, because the previous patch changed all places that look for \n specifically to instead check for \n, \r, or \f.

(In reply to comment #30)
> Comment on attachment 534353 [details] [diff] [review] [review]
> v3 part 5
> 
> This has typo fixes (in Peek, Advance, EatWhitespace) that should have been
> part of the previous patch.

Will fix.

> I'd like to understand the benefits of this patch (e.g. in ParseNumber it's
> sort of weird to have 'c' always be something we haven't consumed yet.....).
> What's the goal here?

Ultimately, to never consume a character -- anywhere in the scanner -- unless it is definitely going to become part of the current token.  This is what I mean by "peek/advance instead of read/pushback".  'c' winds up being that way throughout the code.  It should do slightly less work, but it also makes the pushback buffer unnecessary by construction.  If I'd tried to get rid of the pushback buffer without changing to this pattern, I would have had to demonstrate that there was nowhere the scanner pushed back a character that wasn't the same as the character it had just read.  (We ultimately wind up with a Backup() method for the parser to use when handling :nth-child(2n-1), but that's only one place so it's easy to demonstrate that backing up rather than pushing back is correct in that context.)
Comment 32 David Baron :dbaron: ⌚️UTC-7 2011-07-13 16:02:49 PDT
Comment on attachment 534348 [details] [diff] [review]
v3 part 2

r=dbaron

(did you want both Boris and me to review this, or just one of us?)
Comment 33 David Baron :dbaron: ⌚️UTC-7 2011-07-13 16:11:24 PDT
Comment on attachment 534349 [details] [diff] [review]
v3 part 3

r=dbaron
Comment 34 Zack Weinberg (:zwol) 2011-07-13 17:26:12 PDT
(In reply to comment #32)
> (did you want both Boris and me to review this, or just one of us?)

Just one would be fine by me if it's fine with both of you.  IIRC Boris offered to review this stuff on account of you were too busy, but said I shouldn't take it off your review queue.  And then he got too busy, too.
Comment 35 David Baron :dbaron: ⌚️UTC-7 2011-07-14 12:29:59 PDT
Comment on attachment 534352 [details] [diff] [review]
v3 part 4

So this Peek() function doesn't feel like it's valid given that the CSS scanner still has stream methods -- though I'm a little surprised those didn't go away in bug 541496.  Are they now unused?  If they are, shouldn't they be removed in a patch in the series before this one?
Comment 36 Zack Weinberg (:zwol) 2011-07-14 13:17:23 PDT
(In reply to comment #35)
> So this Peek() function doesn't feel like it's valid given that the CSS
> scanner still has stream methods -- though I'm a little surprised those
> didn't go away in bug 541496.  Are they now unused?

They are unused after bug 541496; I just forgot to remove them in that series.

> If they are, shouldn't they be removed in a patch in the series
> before this one?

That is precisely what part 1 of this series does, unless I missed one.
Comment 37 David Baron :dbaron: ⌚️UTC-7 2011-07-14 14:41:39 PDT
(In reply to comment #36)
> That is precisely what part 1 of this series does, unless I missed one.

Oh, right.  Sorry.
Comment 38 David Baron :dbaron: ⌚️UTC-7 2011-07-14 16:16:21 PDT
Comment on attachment 534352 [details] [diff] [review]
v3 part 4

>+// Returns -1 on eof
>+PRInt32
>+nsCSSScanner::Peek(PRInt32 n)
>+{
>+  if (n < mPushbackCount) {
>+    return PRInt32(mPushback[mPushBackCount - (n+1)]);
>+  }
>+  n -= mPushbackCount;
>+  if (mOffset + n < mCount) {
>+    return PRInt32(mReadPointer[mOffset + n]);
>+  }
>+  return -1;
>+}

This implementation of Peek() doesn't do \r\n unification, which it needs to do to keep its indices in sync with Advance()'s semantics.

Either it needs to do that unification or it needs to assert that it's never called across newlines.

>+void
>+Advance(PRInt32 n)
>+{
>+  while (n--) {
>+    if (mPushbackCount > 0) {
>+      mPushbackCount--;
>+    } else {
>+      if (mOffset >= mCount) {
>+        break;

Before this |break| you should assert that mOffset is actually equal to mCount.  (Having the check be >= is probably a good idea, but really == is the only case that should be possible.)


I'm afraid I think I need to know which strategy you want for Peek() before I continue reviewing.
Comment 39 Zack Weinberg (:zwol) 2011-07-14 19:03:41 PDT
(In reply to comment #38)
> Comment on attachment 534352 [details] [diff] [review] [review]]
> 
> This implementation of Peek() doesn't do \r\n unification

That's intentional.  See comment 31 -- one of the reasons I think this should be a perf win is in not doing newline unification except in the small handful of places that can consume one.

> Either it needs to do that unification or it needs to assert that it's never
> called across newlines.

I'm not sure I understand why this is a useful assertion to have.  At all times in this patch series, there should be no callsite to which it matters whether Peek collapses \r\n or not.  (In part 10, I make Advance refuse to cross newlines, requiring use of a different method, AdvanceLine.  That makes it more explicit what's going on.)

> >+      if (mOffset >= mCount) {
> >+        break;
> 
> Before this |break| you should assert that mOffset is actually equal to
> mCount.  (Having the check be >= is probably a good idea, but really == is
> the only case that should be possible.)

Ok.
Comment 40 David Baron :dbaron: ⌚️UTC-7 2011-07-15 09:16:30 PDT
(In reply to comment #39)
> (In reply to comment #38)
> > Comment on attachment 534352 [details] [diff] [review] [review] [review]]
> > 
> > This implementation of Peek() doesn't do \r\n unification
> 
> That's intentional.  See comment 31 -- one of the reasons I think this
> should be a perf win is in not doing newline unification except in the small
> handful of places that can consume one.
> 
> > Either it needs to do that unification or it needs to assert that it's never
> > called across newlines.
> 
> I'm not sure I understand why this is a useful assertion to have.  At all
> times in this patch series, there should be no callsite to which it matters
> whether Peek collapses \r\n or not.  (In part 10, I make Advance refuse to
> cross newlines, requiring use of a different method, AdvanceLine.  That
> makes it more explicit what's going on.)

It would be a useful assertion to have because it's a nearly guaranteed sign of a mistake in the code if said assertion fires.  It sounds like you're saying you've structured the code so it would never fire, though.  That's fine, but if so, I think it should be explicitly documented that calling Advance() across newlines is unsupported, and Advance() should assert that it isn't done.
Comment 41 David Baron :dbaron: ⌚️UTC-7 2011-07-15 09:17:44 PDT
That said, what you do in part 10 looks like it addresses my comment.
Comment 42 Zack Weinberg (:zwol) 2011-07-15 10:58:50 PDT
(In reply to comment #40)
> (In reply to comment #39)
> > I'm not sure I understand why this is a useful assertion to have.  At all
> > times in this patch series, there should be no callsite to which it matters
> > whether Peek collapses \r\n or not.  (In part 10, I make Advance refuse to
> > cross newlines, requiring use of a different method, AdvanceLine.  That
> > makes it more explicit what's going on.)
> 
> It would be a useful assertion to have because it's a nearly guaranteed sign
> of a mistake in the code if said assertion fires.  It sounds like you're
> saying you've structured the code so it would never fire, though.  That's
> fine, but if so, I think it should be explicitly documented that calling
> Advance() across newlines is unsupported, and Advance() should assert that
> it isn't done.

I mean to make an even stronger claim: at the end of the patch series, all of the callsites are indifferent to whether it fired or not.
Comment 43 Zack Weinberg (:zwol) 2011-07-15 11:00:01 PDT
Although, having said that, I am not one hundred percent certain it's true.
Comment 44 David Baron :dbaron: ⌚️UTC-7 2011-07-15 11:15:46 PDT
I still think you need an assertion Peek() similar to the one that patch 10 introduces in Advance(), except one that allows the character returned to be a newline (i.e., just checks that none of the intermediates are).  I realize nothing in the current parser would trigger such an assertion -- but the point is that it's a nonobvious limitation of the approach you're taking, and it should be documented in case somebody does write parsing code that wants to peek across a newline.
Comment 45 David Baron :dbaron: ⌚️UTC-7 2011-07-15 11:37:42 PDT
Comment on attachment 534352 [details] [diff] [review]
v3 part 4

> void
> nsCSSScanner::EatWhiteSpace()
> {
>-  for (;;) {
>-    PRInt32 ch = Read();
>-    if (ch < 0) {
>-      break;
>-    }
>-    if (!IsWhitespace(ch)) {
>-      Pushback(ch);
>-      break;
>-    }
>-  }
>+  while (IsWhitespace(Peek())) {
>+    Advance();
> }

In addition to comment 44, you also need to fix this, since Peek() can return -1, and IsWhitespace() doesn't handle -1 reasonably.

With those two things fixed, I'll be ok with this, but I'd like to look at the revised patch.
Comment 46 Zack Weinberg (:zwol) 2011-07-15 11:54:33 PDT
(In reply to comment #44)
> I still think you need an assertion Peek() similar to the one that patch 10
> introduces in Advance(), except one that allows the character returned to be
> a newline (i.e., just checks that none of the intermediates are).  I realize
> nothing in the current parser would trigger such an assertion -- but the
> point is that it's a nonobvious limitation of the approach you're taking,
> and it should be documented in case somebody does write parsing code that
> wants to peek across a newline.

Ok, I will add such an assertion.

(In reply to comment #45)

> >+  while (IsWhitespace(Peek())) {
> >+    Advance();
> > }
> 
> In addition to comment 44, you also need to fix this, since Peek() can
> return -1, and IsWhitespace() doesn't handle -1 reasonably.

Good catch, will fix.

> With those two things fixed, I'll be ok with this, but I'd like to look at
> the revised patch.

Ok, but I may not have time to revise the patch in the near future.
Comment 47 Zack Weinberg (:zwol) 2011-10-17 16:32:30 PDT
Working through this stuff now, and noticed a small correction to what I said back in July...

> > In addition to comment 44, you also need to fix this, since Peek() can
> > return -1, and IsWhitespace() doesn't handle -1 reasonably.
> 
> Good catch, will fix.

Actually, no, IsWhitespace *does* handle -1 reasonably - it returns false, which will terminate the loop in EatWhitespace.  All of the IsX() functions in this file have been coded to return false when passed any negative number.
Comment 48 Zack Weinberg (:zwol) 2011-10-19 10:36:42 PDT
I have a new patch series more or less ready to go, but I want to discuss one point first.

I added the assertion requested in comment 44.  It found no bugs.  It made the layout/style mochitests dramatically slower in a debug build.  And in several places it fired inappropriately, because of StartsIdent.  StartsIdent takes two arguments which are supposed to be adjacent characters in the input stream; it's very often called like this:

  if (StartsIdent(Peek(), Peek(1))) { ... process ident ... }

StartsIdent does not examine its second argument unless its first argument is '-', but the call to Peek(1) is evaluated anyway (because that's how C++ function calls work).  Therefore, if the character at peek position 0 happens to be \n, \r, or \f, the call to Peek(1) will trip the assert, even though this is harmless in context.  I was able to restructure the code to avoid this problem, but it is less natural that way.  You have to guard all such calls to StartsIdent with a check that the character at peek position 0 is not whitespace, which is extra work and/or introduces subtle ordering dependencies.

So I think that the assertion is a net lose, and I would like to take it back out.
Comment 49 Zack Weinberg (:zwol) 2011-10-19 18:55:45 PDT
Created attachment 568272 [details] [diff] [review]
v3 part 1 final

I dusted off the first two pieces of this patch (which have been r+ for some time) and pushed them to -inbound along with bug 659963.  A full revision of the remainder of the series will follow in the next few days.

Sheriff: this bug should remain open after the patches hit -central.

https://hg.mozilla.org/integration/mozilla-inbound/rev/f83e3947a7cc
Comment 52 David Baron :dbaron: ⌚️UTC-7 2011-10-28 15:24:35 PDT
(In reply to Zack Weinberg (:zwol) from comment #48)
> I have a new patch series more or less ready to go, but I want to discuss
> one point first.
> 
> I added the assertion requested in comment 44.  It found no bugs.  It made
> the layout/style mochitests dramatically slower in a debug build.

Why?  It should be trivial in cost.

>  And in
> several places it fired inappropriately, because of StartsIdent. 
> StartsIdent takes two arguments which are supposed to be adjacent characters
> in the input stream; it's very often called like this:
> 
>   if (StartsIdent(Peek(), Peek(1))) { ... process ident ... }
> 
> StartsIdent does not examine its second argument unless its first argument
> is '-', but the call to Peek(1) is evaluated anyway (because that's how C++
> function calls work).  Therefore, if the character at peek position 0
> happens to be \n, \r, or \f, the call to Peek(1) will trip the assert, even
> though this is harmless in context.  I was able to restructure the code to
> avoid this problem, but it is less natural that way.  You have to guard all
> such calls to StartsIdent with a check that the character at peek position 0
> is not whitespace, which is extra work and/or introduces subtle ordering
> dependencies.

Can you just make a separate function for whether the current read position starts an identifier?
Comment 53 Zack Weinberg (:zwol) 2013-01-17 11:06:15 PST
Created attachment 703420 [details] [diff] [review]
part A1

Here we go finally with a new patch series which will finish up this bug.  It's been completely redone and I hope will be easier to review now.

This is a preliminary interface-tidying patch which: sorts the functions in nsCSSScanner.cpp into a logical order and renames them with a consistent naming scheme; sorts the token type enumerators into a logical order and corrects some misnomers; pushes the whitespace-skipping loop down into nsCSSScanner::Next; inlines the nsCSSToken constructor; and folds GetURLInParens into ParseMozDocumentRule.
Comment 54 Zack Weinberg (:zwol) 2013-01-17 11:24:20 PST
Created attachment 703431 [details] [diff] [review]
part A2

This patch removes the scanner's pushback buffer; introduces the new scanner functions Peek, Advance, AdvanceLine, and Backup; reimplements the old Read and Pushback functions on top of them; and converts SkipWhitespace and SkipComment to the new API.  There are also some en-passant conversions from old NS_ assertion macros to MOZ_ASSERT, and a new distinction in the gLexTable between horizontal and vertical whitespace.

This is the meat of the change for this bug.  The idea is (has always been) that you do not Advance() over a character unless it is definitely going to become part of the current token, but you can use Peek(n) to look ahead as far as necessary to make that decision.  The benefit of that change is not needing a pushback buffer, and having mTokenOffset be more accurate at all times (you'll notice that the column numbers in the test for bug 413958 change in this patch).

(The reimplemented Pushback() is cheating - it happens to work well enough that this patch passes mochitests in isolation, but it will not work on arbitrary input - it'll go away in the next patch.)

I did *not* add the assertion requested in comment 44, for the reasons stated in comment 48.  The performance impact is trivial in the completed patchset, but it turns out not to be just StartsIdent that gets false positives: it is desirable to be able to write things like

    int32_t c1 = Peek();
    int32_t c2 = Peek(1);
    int32_t c3 = Peek(2);
    if ((c1 == 'u' || c2 == 'U') && c2 == '+' && (IsHexDigit(c3) || c3 == '?'))
      // valid unicode-range token

If the character at peek position 0 or 1 happens to be a vertical space character, this would throw an assertion, even though there is no problem - the conditional will correctly classify any vertical space character as not what it wants here.  I have provided IsVertSpace so that there is no risk of people forgetting to check for all three vertical space characters when it matters, and I enforce the use of AdvanceLine rather than Advance to *consume* a vertical space; I believe that no other special treatment of vertical space is necessary.  As evidence for this belief I observe that, after the next patch, there are only three functions that call AdvanceLine (SkipWhitespace, SkipComment, and GatherEscape).
Comment 55 Zack Weinberg (:zwol) 2013-01-17 11:25:43 PST
Created attachment 703432 [details] [diff] [review]
part A3

Convert the bulk of the scanner from read/pushback to peek/advance.  Large but straightforward.
Comment 56 Zack Weinberg (:zwol) 2013-01-17 11:27:27 PST
Created attachment 703433 [details] [diff] [review]
part A4

Finally, I noticed in the process of developing part A3 that GatherIdent, ScanString, and NextURL had almost exactly the same loop in them.  By adding another character class to gLexTable I was able to fold all three loops together.
Comment 57 Mozilla RelEng Bot 2013-01-17 15:30:45 PST
Try run for f46c2c48d319 is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=f46c2c48d319
Results (out of 33 total builds):
    success: 31
    warnings: 2
Builds (or logs if builds failed) available at:
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/zackw@panix.com-f46c2c48d319
Comment 58 David Baron :dbaron: ⌚️UTC-7 2013-01-21 08:56:44 PST
Comment on attachment 703420 [details] [diff] [review]
part A1

I think it would have been nice to split this patch up much more finely.

I think I might see if heycam feels up to reviewing the whole series, actually.
Comment 59 David Baron :dbaron: ⌚️UTC-7 2013-02-01 11:38:39 PST
Comment on attachment 703420 [details] [diff] [review]
part A1

Transferring these review requests to Cameron, though I don't expect him to get to them until after linux.conf.au (this week) or the SVG WG meeting (next week).

I think these reviews should be relatively straightforward given that these patches aren't (as I understand it) intending to change behavior, so the main thing to review is (a) at some level of detail, whether that's the case and (b) whether the code is what we want for long-term maintainability.

(If they are intending to change behavior, it would be good to call that out.)
Comment 60 Zack Weinberg (:zwol) 2013-02-01 13:06:03 PST
Indeed, the output token sequence should not be changed by these patches.  The only thing I can think of that might change at all is the fine location of errors reported by the parser (because the read pointer is no longer regularly one or two characters past the end of the just-returned token); however, that should always be an improvement, and we're not all that precise about that anyway.

The patches need some minor updates because of bug 836530.  I'll post those later today.
Comment 61 Zack Weinberg (:zwol) 2013-02-01 15:18:29 PST
Created attachment 709263 [details] [diff] [review]
part A2

Part A1 and A3 do not need refreshing for bug 836530.  Here is the refreshed A2.
Comment 62 Zack Weinberg (:zwol) 2013-02-01 15:19:07 PST
Created attachment 709265 [details] [diff] [review]
part A4
Comment 63 Cameron McCormack (:heycam) 2013-02-11 22:47:55 PST
Comment on attachment 703420 [details] [diff] [review]
part A1

Review of attachment 703420 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with an answer of "yes, unused" for my eCSSToken_Whitespace question and if you can find a way to document the unobvious uses of nsCSSToken members somewhere.

::: layout/style/nsCSSScanner.cpp
@@ +920,5 @@
>  }
> +
> +bool
> +nsCSSScanner::NextURL(nsCSSToken& aToken)
> +{

No need to do it now, but the reordering of functions and their modifications really ought to have been in separate patches.  It otherwise makes the review harder (although I can (and did) reorder the functions back again to get a better diff locally).

@@ +1070,5 @@
> +    if (IsWhitespace(ch)) {
> +      SkipWhitespace();
> +      if (!aSkipWS) {
> +        aToken.mType = eCSSToken_Whitespace;
> +        return true;

This used to assign the initial white space character it encountered to mIdent.  I assume this functionality was unused?

::: layout/style/nsCSSScanner.h
@@ +32,5 @@
> +  // Number-like tokens
> +  eCSSToken_Number,         // mNumber mHasSign mInteger mIntegerValid
> +  eCSSToken_Percentage,     // mNumber mHasSign
> +  eCSSToken_Dimension,      // mNumber mHasSign mInteger mIntegerValid mIdent
> +  eCSSToken_URange,         // mInteger mInteger2 mIntegerValid mIdent

Although I like that this enum block looks a lot cleaner now, it has lost some information that was previously in the comments; for example, that mIdent for eCSSToken_URange records the original token string for exact reserialization, or what mSymbol is used for in eCSSToken_URL.  Is there a way you could retain that information without cluttering the code?

@@ +114,5 @@
>    nsDependentSubstring GetCurrentLine() const;
>  
>    // Get the next token. Return false on EOF. aTokenResult
> +  // is filled in with the data for the token.  If aSkipWS
> +  // is true, skip over eCSSToken_Whitespace tokens rather

Nit: add an additional space after the two existing sentences in this comment block or remove one from before the new sentence.
Comment 64 Cameron McCormack (:heycam) 2013-02-11 23:31:25 PST
Comment on attachment 709263 [details] [diff] [review]
part A2

Review of attachment 709263 [details] [diff] [review]:
-----------------------------------------------------------------

This all seems fine to me.  r=me with the two nsCSSScanner.cpp comments done, and an assurance that it's really OK for the column number in error reports changing.

::: layout/style/nsCSSScanner.cpp
@@ +82,5 @@
> +  return uint32_t(ch) < 128 && (gLexTable[ch] & IS_SPACE) != 0;
> +}
> +
> +static inline bool
> +IsIdent(int32_t ch) {

Can you call this "IsIdentChar" to be consistent with IsURLChar and IS_IDCHAR?

@@ +404,5 @@
>  {
> +  int32_t rv = Peek();
> +
> +  // There are four types of newlines in CSS: "\r", "\n", "\r\n", and "\f".
> +  // To simplify dealing with newlines, they are all normalized to "\n" here

Nit: "." at end.

::: layout/style/test/test_bug413958.html
@@ +40,5 @@
>    },
>  ];
>  var results = [
>    [ { errorMessage: /Unknown property 'nosuchprop'/,
> +      lineNumber: 1, columnNumber: 14,

Do the developer tools rely on column numbers being stable?  (That's the only thing I can think of that would.  And hopefully they have tests if they do.)

I find it a bit strange that the columnNumber points to the last character of the token -- to me, either pointing one after the token or pointing at the beginning of the token makes more sense.  I suppose it might not be worth the work to record the start offset of the token so that you could point to the beginning of the token.
Comment 65 Cameron McCormack (:heycam) 2013-02-11 23:33:28 PST
(In reply to David Baron [:dbaron] from comment #59)
> and (b) whether the code is what we want for long-term maintainability.

I do think it's an improvement just from a code complexity point of view to get rid of the pushback buffer, btw.
Comment 66 Zack Weinberg (:zwol) 2013-02-12 06:13:30 PST
(In reply to Cameron McCormack (:heycam) from comment #63)
> Comment on attachment 703420 [details] [diff] [review]
> part A1
...
> > +    if (IsWhitespace(ch)) {
> > +      SkipWhitespace();
> > +      if (!aSkipWS) {
> > +        aToken.mType = eCSSToken_Whitespace;
> > +        return true;
> 
> This used to assign the initial white space character it encountered to
> mIdent.  I assume this functionality was unused?

That is correct.

> ::: layout/style/nsCSSScanner.h
> @@ +32,5 @@
> > +  // Number-like tokens
> > +  eCSSToken_Number,         // mNumber mHasSign mInteger mIntegerValid
> > +  eCSSToken_Percentage,     // mNumber mHasSign
> > +  eCSSToken_Dimension,      // mNumber mHasSign mInteger mIntegerValid mIdent
> > +  eCSSToken_URange,         // mInteger mInteger2 mIntegerValid mIdent
> 
> Although I like that this enum block looks a lot cleaner now, it has lost
> some information that was previously in the comments; for example, that
> mIdent for eCSSToken_URange records the original token string for exact
> reserialization, or what mSymbol is used for in eCSSToken_URL.  Is there a
> way you could retain that information without cluttering the code?

I'll see what I can do.

> >    // Get the next token. Return false on EOF. aTokenResult
> > +  // is filled in with the data for the token.  If aSkipWS
> > +  // is true, skip over eCSSToken_Whitespace tokens rather
> 
> Nit: add an additional space after the two existing sentences in this
> comment block or remove one from before the new sentence.

Doh!

(In reply to Cameron McCormack (:heycam) from comment #64)
> Comment on attachment 709263 [details] [diff] [review]
> part A2
...
> > +static inline bool
> > +IsIdent(int32_t ch) {
> 
> Can you call this "IsIdentChar" to be consistent with IsURLChar and
> IS_IDCHAR?

It had the inconsistent name before I started changing things, but I'm happy to make it consistent.

> > +  // There are four types of newlines in CSS: "\r", "\n", "\r\n", and "\f".
> > +  // To simplify dealing with newlines, they are all normalized to "\n" here
> 
> Nit: "." at end.

OK.  (I think this comment goes away in a subsequent patch though.)

> ::: layout/style/test/test_bug413958.html
> @@ +40,5 @@
> >    },
> >  ];
> >  var results = [
> >    [ { errorMessage: /Unknown property 'nosuchprop'/,
> > +      lineNumber: 1, columnNumber: 14,
> 
> Do the developer tools rely on column numbers being stable?  (That's the
> only thing I can think of that would.  And hopefully they have tests if they
> do.)

Nothing relies on stable column numbers in CSS other than this test, because bug 413958 landed only quite recently, and before then we weren't reporting column number information at all.

> I find it a bit strange that the columnNumber points to the last character
> of the token -- to me, either pointing one after the token or pointing at
> the beginning of the token makes more sense.  I suppose it might not be
> worth the work to record the start offset of the token so that you could
> point to the beginning of the token.

The scanner does report the column number of the first character of each token.  What's going on here is that the *parser* has only one 'current token' which is the location of all diagnostics, and it doesn't issue this error message until after it has advanced the 'current token' to be the colon after 'nosuchprop'.  (This is arguably itself a bug, but it is a separate bug.)  So the expected column number here is the column number of the colon.  If it's off by one in either direction, that's a problem for me to fix.  I'll check.
Comment 67 Zack Weinberg (:zwol) 2013-02-12 06:14:26 PST
Note for the record that I will probably not have time to revise patches until the weekend, and that it doesn't make sense to land A1 and A2 without A3 and A4.
Comment 68 Cameron McCormack (:heycam) 2013-02-12 17:25:40 PST
Comment on attachment 703432 [details] [diff] [review]
part A3

Review of attachment 703432 [details] [diff] [review]:
-----------------------------------------------------------------

I really like how Peek(n) makes it easier and clearer to do lookahead than reading and pushing back, especially for cases like the HTML comment tokens.

r=me with these comments addressed.

::: layout/style/nsCSSScanner.cpp
@@ +532,3 @@
>   *
>   * Returns failure when the character sequence does not form an ident at
>   * all, in which case the caller is responsible for pushing back or

Update the "pushing back" wording to "backing up".  Can you also document where mCurrent should be pointing to?  It's not clear from the comment whether the aChar that you pass in should have been consumed already.

@@ +549,5 @@
> +    uint32_t n = mOffset;
> +    while (n < mCount && IsIdent(mBuffer[n])) {
> +      n++;
> +    }
> +    // Add to the token what we have so far

Nit: append ".".

@@ +571,5 @@
>    return true;
>  }
>  
>  bool
>  nsCSSScanner::ScanIdent(int32_t aChar, nsCSSToken& aToken)

I find it a bit confusing that ScanIdent scans an ident, function or url token, and that ScanURange scans a urange or ident.  (I realise this is the case with the code before these patches.)  I can understand that having ScanURange recognise that what follows the "u" makes the token an ident instead of a urange is easier than putting that detection logic somewhere else.  (And I don't mind the fact that ScanURange and others can scan a symbol in their fallback cases.)

I think it is that the name ScanIdent doesn't tell me "I'm going to scan an ident, function or url".  OTOH I don't think a longer name like ScanIdentFunctionOrURL would be helpful.  Maybe you can add to the comments just above each case in nsCSSScanner::Next to state which set of tokens it is scanning for?  And then a comment on the individual ScanXXX functions too.  I think that would make it clearer.

@@ +612,1 @@
>  {

Like you have an assertion at the top of ScanNumber, can you have one here to ensure ScanHash is only called when Peek() == '#'?  Similarly for the other ScanXXX() functions that assume you are already pointing to a particular character(s).

@@ +641,5 @@
> +      return true;
> +    }
> +  }
> +
> +  // OK, it's definitely a number token (or relative).

I don't understand what "relative" means here.  The token can be one of number, percentage or dimension at this point, so maybe you could say that.

@@ +731,5 @@
>    aToken.mIntegerValid = false;
>  
>    // Time to reassemble our number.
> +  // Do all the math in double precision so it's truncated only once.
> +  double value = double(sign * (intPart + fracPart));

Is this double cast still needed?  (intPart + fracPart) is already a double so the int sign should be promoted to double.

@@ +736,2 @@
>    if (gotE) {
>      // pow(), not powf(), because at least wince doesn't have the latter.

Is this comment line obsolete now that we are doing all the math in doubles?

@@ +951,4 @@
>          break;
>        }
> +      if (!GatherEscape(aToken.mIdent, false)) {
> +        break; // bad escape sequence terminates URL

Can you have the comment mention that due to the "\" remaining at the current position, this will cause the token to be returned as a badurl.
Comment 69 Cameron McCormack (:heycam) 2013-02-12 17:30:15 PST
(In reply to Zack Weinberg (:zwol) from comment #66)
> I'll see what I can do.

I notice that sometimes the use of the nsCSSToken members is documented in the comment for the individual ScanXXX functions.  If you could do that uniformly, and just reference that information from the comment on the enum declaration itself, that'll be fine.

> > ::: layout/style/test/test_bug413958.html
> > @@ +40,5 @@
> > >    },
> > >  ];
> > >  var results = [
> > >    [ { errorMessage: /Unknown property 'nosuchprop'/,
> > > +      lineNumber: 1, columnNumber: 14,
> > 
> > Do the developer tools rely on column numbers being stable?  (That's the
> > only thing I can think of that would.  And hopefully they have tests if they
> > do.)
> 
> Nothing relies on stable column numbers in CSS other than this test, because
> bug 413958 landed only quite recently, and before then we weren't reporting
> column number information at all.

OK.

> The scanner does report the column number of the first character of each
> token.  What's going on here is that the *parser* has only one 'current
> token' which is the location of all diagnostics, and it doesn't issue this
> error message until after it has advanced the 'current token' to be the
> colon after 'nosuchprop'.  (This is arguably itself a bug, but it is a
> separate bug.)  So the expected column number here is the column number of
> the colon.  If it's off by one in either direction, that's a problem for me
> to fix.  I'll check.

Are the column numbers zero- or one-based?  I was interpreting them as one-based, which made me think they were pointing to the final character of the token.  If they are zero-based and it is pointing to just after the token, that is OK too.  (Ideally we'd report the range of column characters, which I imagine would be more useful for devtools.)
Comment 70 Cameron McCormack (:heycam) 2013-02-12 17:45:27 PST
Comment on attachment 709265 [details] [diff] [review]
part A4

Review of attachment 709265 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with the assertion.

::: layout/style/nsCSSScanner.cpp
@@ +559,1 @@
>  {

Please assert that aClass has a valid value.
Comment 71 Cameron McCormack (:heycam) 2013-02-12 17:47:30 PST
(Sorry for the plain text reviews not giving much context in some hunks.  I'm not sure how I can make Splinter include more context.)
Comment 72 Zack Weinberg (:zwol) 2013-02-16 08:12:55 PST
(In reply to Cameron McCormack (:heycam) from comment #68)
> >   * Returns failure when the character sequence does not form an ident at
> >   * all, in which case the caller is responsible for pushing back or
> 
> Update the "pushing back" wording to "backing up".  Can you also document
> where mCurrent should be pointing to?  It's not clear from the comment
> whether the aChar that you pass in should have been consumed already.

Will fix.

> > +    // Add to the token what we have so far
> 
> Nit: append ".".

Ok.

> I find it a bit confusing that ScanIdent scans an ident, function or url
> token, and that ScanURange scans a urange or ident.  (I realise this is the
> case with the code before these patches.)  I can understand that having
> ScanURange recognise that what follows the "u" makes the token an ident
> instead of a urange is easier than putting that detection logic somewhere
> else.  (And I don't mind the fact that ScanURange and others can scan a
> symbol in their fallback cases.)
> 
> I think it is that the name ScanIdent doesn't tell me "I'm going to scan an
> ident, function or url".  OTOH I don't think a longer name like
> ScanIdentFunctionOrURL would be helpful.  Maybe you can add to the comments
> just above each case in nsCSSScanner::Next to state which set of tokens it
> is scanning for?  And then a comment on the individual ScanXXX functions
> too.  I think that would make it clearer.

OK.  I'll also take another look at the logic and see if it would be better to move some of it around.  The idea was that Next() should for the most part be able to dispatch to a subroutine after looking at the very next character, but that didn't exactly work out and I think we wound up with some unfortunate order dependencies.

> Like you have an assertion at the top of ScanNumber, can you have one here
> to ensure ScanHash is only called when Peek() == '#'?  Similarly for the
> other ScanXXX() functions that assume you are already pointing to a
> particular character(s).

OK.

> @@ +641,5 @@
> > +      return true;
> > +    }
> > +  }
> > +
> > +  // OK, it's definitely a number token (or relative).
> 
> I don't understand what "relative" means here.  The token can be one of
> number, percentage or dimension at this point, so maybe you could say that.

Relative of a number token, i.e. a percentage or dimension token.  I will change this to read

  // At this point we know we have a number token (or a percentage or dimension).

> @@ +731,5 @@
> >    aToken.mIntegerValid = false;
> >  
> >    // Time to reassemble our number.
> > +  // Do all the math in double precision so it's truncated only once.
> > +  double value = double(sign * (intPart + fracPart));
> 
> Is this double cast still needed?  (intPart + fracPart) is already a double
> so the int sign should be promoted to double.

Yeah, that's unnecessary, I'll remove it.
 
> @@ +736,2 @@
> >    if (gotE) {
> >      // pow(), not powf(), because at least wince doesn't have the latter.
> 
> Is this comment line obsolete now that we are doing all the math in doubles?

Yup.

> @@ +951,4 @@
> >          break;
> >        }
> > +      if (!GatherEscape(aToken.mIdent, false)) {
> > +        break; // bad escape sequence terminates URL
> 
> Can you have the comment mention that due to the "\" remaining at the
> current position, this will cause the token to be returned as a badurl.

OK.

(In reply to Cameron McCormack (:heycam) from comment #69)
> 
> I notice that sometimes the use of the nsCSSToken members is documented in
> the comment for the individual ScanXXX functions.  If you could do that
> uniformly, and just reference that information from the comment on the enum
> declaration itself, that'll be fine.

I decided it was good to have a detailed explanation of which token types use which nsCSSToken members in the header file itself, but I will also improve the comments for the ScanXXX functions.

> Are the column numbers zero- or one-based?  I was interpreting them as
> one-based, which made me think they were pointing to the final character of
> the token.  If they are zero-based and it is pointing to just after the
> token, that is OK too.  (Ideally we'd report the range of column characters,
> which I imagine would be more useful for devtools.)

The column numbers are zero-based.  Well, technically they are not column numbers at all; they are offsets from the beginning of the 'error line' string provided to nsIScriptError::InitWithWindowID, in UTF-16 code units.  (For instance, the error console does not treat a hard tab the way a tty does.)  There *may* be bugs in this area -- I'm not 100% sure, but the error console *appears* to be treating offset 0 as "no intra-line position information available", which makes it impossible to point an error to the first character on a line.  But if that's right, the fix requires coordinated changes across the error console, the other error console, the CSS scanner, and the JS scanner, so it should be its own bug.

(In reply to Cameron McCormack (:heycam) from comment #70)
> 
> Please assert that aClass has a valid value.

That's a little fuzzy ... I guess I can just list the currently-valid choices.
Comment 73 Zack Weinberg (:zwol) 2013-02-16 15:08:22 PST
Created attachment 714839 [details] [diff] [review]
A1 final

Here is the revised patch series.  I am going to go ahead and land them, but if anyone has further comments (particularly on A2 and A3, which changed more than the others) I'm happy to polish in follow-up patches/bugs.
Comment 74 Zack Weinberg (:zwol) 2013-02-16 15:09:20 PST
Created attachment 714840 [details] [diff] [review]
A2 final
Comment 75 Zack Weinberg (:zwol) 2013-02-16 15:11:23 PST
Created attachment 714841 [details] [diff] [review]
A3 final

Lots more comments as requested, but also I moved a bunch of "is this _really_ token type X?" logic back into Next().
Comment 76 Zack Weinberg (:zwol) 2013-02-16 15:13:46 PST
Created attachment 714843 [details] [diff] [review]
A4 final
Comment 77 Cameron McCormack (:heycam) 2013-02-16 15:19:51 PST
(In reply to Zack Weinberg (:zwol) from comment #75)
> Created attachment 714841 [details] [diff] [review]
> A3 final
> 
> Lots more comments as requested, but also I moved a bunch of "is this
> _really_ token type X?" logic back into Next().

I think that's more understandable now.
Comment 79 Cameron McCormack (:heycam) 2013-02-16 15:29:37 PST
I think the comments you have now on nsCSSTokenType/nsCSSToken are fine, as is the assertion in GatherText.
Comment 81 David Baron :dbaron: ⌚️UTC-7 2013-05-27 20:12:29 PDT
Comment on attachment 714840 [details] [diff] [review]
A2 final

>@@ -3796,19 +3796,17 @@ CSSParserImpl::ParsePseudoClassWithNthPa
>     // minus on back onto the scanner's pushback buffer.
>     uint32_t truncAt = 0;
>     if (StringBeginsWith(mToken.mIdent, NS_LITERAL_STRING("n-"))) {
>       truncAt = 1;
>     } else if (StringBeginsWith(mToken.mIdent, NS_LITERAL_STRING("-n-"))) {
>       truncAt = 2;
>     }
>     if (truncAt != 0) {
>-      for (uint32_t i = mToken.mIdent.Length() - 1; i >= truncAt; --i) {
>-        mScanner->Pushback(mToken.mIdent[i]);
>-      }
>+      mScanner->Backup(mToken.mIdent.Length() - truncAt);
>       mToken.mIdent.Truncate(truncAt);
>     }
>   }

I think this change may have actually changed behavior, in the case where the identifier contained escapes.  I think neither old nor new behavior is correct (old is incorrect because it will reparse the escaped characters in non-identifier form, new because it pushes back the wrong number of characters).
Comment 82 David Baron :dbaron: ⌚️UTC-7 2013-05-27 23:33:14 PDT
I filed bug 876585 on comment 81.

Note You need to log in before you can comment on or make changes to this bug.