Closed Bug 1322825 Opened 3 years ago Closed 3 years ago

Implement incremental tokenizer

Categories

(Core :: XPCOM, defect)


Tracking


RESOLVED FIXED
mozilla54
Tracking Status
firefox-esr45 --- wontfix
firefox52 --- fixed
firefox-esr52 --- fixed
firefox53 --- fixed
firefox54 --- fixed

People

(Reporter: mayhemer, Assigned: mayhemer)

References

Details

Attachments

(2 files, 15 obsolete files)

48.25 KB, patch
mayhemer: review+
48.25 KB, patch
We have mozilla::Tokenizer that can work on a final buffer.  But there are cases when the data comes in chunks, especially from the network.  Then we need the opposite.

The tokenizer's input will be a push of data (feeding) plus a notification of EOF.  The output will be a call to a consumer callback for each token found complete in the input.
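A minimal sketch of the intended shape (the class and method names follow the patches attached below: IncrementalTokenizer, FeedInput, FinishInput; the exact Consumer signature and header path are assumptions):

#include "mozilla/IncrementalTokenizer.h"

using namespace mozilla;

IncrementalTokenizer i([](Tokenizer::Token const& aToken, IncrementalTokenizer& aTokenizer) -> nsresult {
  // Called once for every token found complete in the data fed so far.
  return NS_OK;
});

i.FeedInput(NS_LITERAL_CSTRING("chunk one, "));  // tokens are delivered as they complete
i.FeedInput(NS_LITERAL_CSTRING("chunk two"));
i.FinishInput();  // the EOF notification; the last token is delivered now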
Attached patch v1 (obsolete) — Splinter Review
- splits Tokenizer into TokenizerBase and Tokenizer
- TokenizerBase keeps only the Parse method and leaves input buffer management to derived classes
- TokenizerBase is adjusted to work with an incomplete input
- added functionality for "custom" tokens (= fixed strings that can then be recognized as separate tokens)
- added a new mode, "custom-only parsing", in which we only look for the custom tokens; data in between is returned as a "raw" token referring to a fragment of the input buffer, useful for a boundary-token search in binary input (see the sketch after this list)
- IncrementalTokenizer (deriving from TokenizerBase) takes a lambda as a callback that receives each token separately
- it's fed the data incrementally (from a string or from a stream)
- when all data has been fed, FinishInput() marks the input as complete and causes the last token to be delivered
- test added
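For the custom-only ("raw") mode, a hypothetical boundary search could look like this (a sketch against the API of this patch version; Token::Equals/Type/Fragment are assumed to behave as in the tests below):

Tokenizer::Token boundary;
IncrementalTokenizer i([&](Tokenizer::Token const& aToken, IncrementalTokenizer&) -> nsresult {
  if (aToken.Equals(boundary)) {
    // a complete "--boundary" string was found in the stream
  } else if (aToken.Type() == Tokenizer::TOKEN_RAW) {
    // aToken.Fragment() refers to the binary payload in between
  }
  return NS_OK;
});
boundary = i.AddCustomToken("--boundary", Tokenizer::CASE_SENSITIVE);
// then switch the parser to custom-only parsing; the exact method name
// changed during review (SetTokenizingMode in the landed version)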
Attachment #8817815 - Flags: review?(nfroyd)
Attached patch v1.1 (obsolete) — Splinter Review
A slightly better patch (a few small improvements; the original architecture remains)
Attachment #8817815 - Attachment is obsolete: true
Attachment #8817815 - Flags: review?(nfroyd)
Attachment #8818021 - Flags: review?(nfroyd)
Blocks: 1321612
Attached patch v1.2 (obsolete) — Splinter Review
One more update.  Nathan, sorry to potentially change it under your hands.
Attachment #8818021 - Attachment is obsolete: true
Attachment #8818021 - Flags: review?(nfroyd)
Attachment #8818128 - Flags: review?(nfroyd)
- added Next(Token&) that can be used to synchronously get another token from within the Consumer callback
- the rollback argument was moved to be a method of IncrementalTokenizer
- comments, polishing.
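For illustration, the synchronous pull inside the callback could look roughly like this (assumed signature, not verbatim from the patch):

IncrementalTokenizer i([](Tokenizer::Token const& aToken, IncrementalTokenizer& aTokenizer) -> nsresult {
  Tokenizer::Token next;
  if (aTokenizer.Next(next)) {
    // the following token was already complete in the buffered input,
    // so it can be consumed right here instead of in a separate callback
  }
  return NS_OK;
});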

sorry to change this again under your hands, Nathan.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=b3e4cc34642eb3cd0fcf280057f88be5544836ca
Attachment #8818128 - Attachment is obsolete: true
Attachment #8818128 - Flags: review?(nfroyd)
Attachment #8818634 - Flags: review?(nfroyd)
Attached patch v1.4 (obsolete) — Splinter Review
Apologies again, Nathan...

ASAN (bless it!) caught some issues with the custom token API, hence I rewrote it a bit.  It might be a bit less efficient, but it is way safer to use.

Preliminary run: https://treeherder.mozilla.org/#/jobs?repo=try&revision=7d4ea343cafa2e969eb461b7ce8d65b7775e3978
Attachment #8818634 - Attachment is obsolete: true
Attachment #8818634 - Flags: review?(nfroyd)
Attachment #8818658 - Flags: review?(nfroyd)
Comment on attachment 8818658 [details] [diff] [review]
v1.4

OK, apparently it's too late for me...  Dropping r, one more build issue.  Will finish tomorrow.
Attachment #8818658 - Flags: review?(nfroyd)
I looked at this briefly today and it might be helpful for reviewing if the Tokenizer/TokenizerBase changes were made separate from the IncrementalTokenizer changes.
(In reply to Nathan Froyd [:froydnj] from comment #7)
> I looked at this briefly today and it might be helpful for reviewing if the
> Tokenizer/TokenizerBase changes were made separate from the
> IncrementalTokenizer changes.

Last try run is green, so I'm ready to give you final version this time for review.

But I'm not sure how exactly you want me to separate Base from Incremental.  Do you mean to split the patch simply by files into two pieces?  Or rather have two stages, where I 1) split Tokenizer into TokenizerBase and Tokenizer and 2) add the features necessary for IncrementalTokenizer to TokenizerBase + add the new derivatives and tests?

I can do even more if you want: split out the Base class, add support for incremental (incomplete) input to it, then add support for custom tokens, and finally implement the IncrementalTokenizer derivative + its tests.
Flags: needinfo?(nfroyd)
Not asking for r until we agree on how to split this for easier review.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=c1cc780a13726026f9786ce4b2d841487020a8b0
Attachment #8818658 - Attachment is obsolete: true
(In reply to Honza Bambas (:mayhemer) from comment #8)
> (In reply to Nathan Froyd [:froydnj] from comment #7)
> > I looked at this briefly today and it might be helpful for reviewing if the
> > Tokenizer/TokenizerBase changes were made separate from the
> > IncrementalTokenizer changes.
> 
> But I'm not sure how exactly you want me to separate Base from Incremental.
> Do you mean to split the patch simply by files into two pieces?  Or rather
> have two stages, where I 1) split Tokenizer into TokenizerBase and Tokenizer
> and 2) add the features necessary for IncrementalTokenizer to TokenizerBase
> + add the new derivatives and tests?

I had in mind the #2 possibility here but...

> I can do even more if you want: split out the Base class, add support for
> incremental (incomplete) input to it, then add support for custom tokens,
> and finally implement the IncrementalTokenizer derivative + its tests.

...this sounds potentially easier to review.  That being said, I know it's difficult/time-consuming to split one large patch into fine-grained pieces like this.  So I'll ask for #2 above (splitting Tokenizer, then adding the necessary features for IncrementalTokenizer), as that sounds reasonably straightforward (maybe?), and leave it to your judgement whether splitting further than that is helpful/worth your time: I'm happy to review either way.
Flags: needinfo?(nfroyd)
Attachment #8818881 - Attachment is obsolete: true
Attachment #8830841 - Flags: review?(nfroyd)
Attachment #8830842 - Flags: review?(nfroyd)
Attached patch v1.6 (part 4/4 - test updates) (obsolete) — Splinter Review
Attachment #8830845 - Flags: review?(nfroyd)
Comment on attachment 8830841 [details] [diff] [review]
v1.6 (part 1/4 - split to a base class)

Review of attachment 8830841 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/ds/Tokenizer.cpp
@@ +275,5 @@
> +  , mHasFailed(false)
> +  , mWhitespaces(aWhitespaces ? aWhitespaces : sWhitespaces)
> +  , mAdditionalWordChars(aAdditionalWordChars)
> +  , mCursor(nullptr)
> +  , mEnd(nullptr)

We're initializing mCursor and mEnd in the base class when we weren't initializing them formerly because...the conditions for their initialization in an upcoming subclass are different?  Or just because it's good programming practice to initialize all the members in the constructor?

Not objecting to the change, just trying to understand why it's here.
Attachment #8830841 - Flags: review?(nfroyd) → review+
Comment on attachment 8830842 [details] [diff] [review]
v1.6 (part 2/4 - ability to handle incomplete input)

Review of attachment 8830842 [details] [diff] [review]:
-----------------------------------------------------------------

I think this is mostly OK.  I'd like to discuss the API comment concerning Init and a few other things below; r-'ing so I can see another round of these patches.

::: xpcom/ds/IncrementalTokenizer.cpp
@@ +43,5 @@
> +  NS_ENSURE_TRUE(mConsumer, NS_ERROR_NOT_INITIALIZED);
> +  MOZ_ASSERT(mIncomplete);
> +
> +  mInput.Cut(0, mInputCursor);
> +  mInputCursor = 0;

This (and other similar places) can't be Truncate() because there might be unread text beyond mInputCursor, correct?

@@ +61,5 @@
> +  mInputCursor = 0;
> +
> +  nsresult rv = NS_OK;
> +  while (NS_SUCCEEDED(rv) && aCount) {
> +    nsCString::index_type reminder = mInput.Length();

Nit: did you mean to spell this "remainder"?

@@ +63,5 @@
> +  nsresult rv = NS_OK;
> +  while (NS_SUCCEEDED(rv) && aCount) {
> +    nsCString::index_type reminder = mInput.Length();
> +
> +    if (!mInput.SetLength(reminder + aCount, fallible_t())) {

Nit: I think the typical style is to use mozilla::fallible (or just "fallible" if the mozilla namespace is visible, as it is here).

I think you might want to add in some overflow checking here for the argument to SetLength...

@@ +72,5 @@
> +
> +    uint32_t read;
> +    rv = aInput->Read(buffer, aCount, &read);
> +    if (NS_SUCCEEDED(rv)) {
> +      mInput.SetLength(reminder + read);

...and here, although perhaps we can assume this won't overflow if the previous calculation didn't, as |read <= aCount|.
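Something like this, using CheckedInt from MFBT, is the shape I have in mind (a sketch only, not the landed code):

// from "mozilla/CheckedInt.h"
CheckedUint32 newLength(mInput.Length());
newLength += aCount;
if (!newLength.isValid() ||
    !mInput.SetLength(newLength.value(), fallible)) {
  return NS_ERROR_OUT_OF_MEMORY;
}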

::: xpcom/ds/IncrementalTokenizer.h
@@ +46,5 @@
> +
> +  /**
> +   * Assigns the consumer callback and allows the tokenizer to work.
> +   */
> +  void Init(Consumer aConsumer);

Could we provide the Consumer in the constructor, rather than in a separate Init() method?  I see that the current Init() implementation doesn't permit re-initialization, which is good, but I'm unsure that we gain a lot from separating construction from Init() here, especially when Init() is infallible.

@@ +54,5 @@
> +   * on every found token.  Result of the Consumer callback is returned here.
> +   *
> +   * The tokenizer must be initialized with a valid consumer prior call to these
> +   * methods.  It's not allowed to call Feed/FinishInput from inside the Consumer
> +   * callback.

The asserts around initialization and being inside the consuming callback are much appreciated, thank you!

::: xpcom/ds/Tokenizer.h
@@ +115,5 @@
>    bool mPastEof;
>    // true iff the last Check*() call has returned false, reverts to true on Rollback() call
>    bool mHasFailed;
> +  // true if we still expect more data to be added to the input string
> +  bool mIncomplete;

I wonder if uses of this member wouldn't be easier to read if it were named `mInputComplete` or similar.  WDYT?  Variables that go from false -> true when something is done seem more natural to me than variables going from true -> false.
Attachment #8830842 - Flags: review?(nfroyd) → review+
Comment on attachment 8830844 [details] [diff] [review]
v1.6 (part 3/4 - custom token strings + 'binary' or 'raw' parsing mode)

Review of attachment 8830844 [details] [diff] [review]:
-----------------------------------------------------------------

I can definitely see cases where this would be useful.  But do we need to extend Tokenizer such that it can flip back and forth between modes?  Couldn't we have a CustomTokensTokenizer (name open to debate) that stored its own custom tokens, so we could keep things largely separated?  Tokenizer has a reasonably large API, and I'm hesitant to add a whole 'nother axis of stuff to think about.  Enabling/disabling custom tokens also seems kind of complex for not a lot of gain; it seems like you could almost have multiple tokenizers to deal with that case instead.

Could you please elaborate more on your use cases for this?  r-'ing with a few comments while we get those details hashed out.

::: xpcom/ds/Tokenizer.cpp
@@ +293,5 @@
> +
> +  UniquePtr<Token>& t = *mCustomTokens.AppendElement();
> +  t = MakeUnique<Token>();
> +
> +  t->mType = static_cast<TokenType>(++mNextCustomTokenID);

Ewwww...Can we make mType a Variant<TokenType, uint32_t> (i.e. you're either a regular typed token or a custom token with a unique ID) or similar instead?  I'm sure that would make the code slightly more complicated in some places, but it would at least have the benefit that we wouldn't be performing this dodgy casting.

::: xpcom/ds/Tokenizer.h
@@ +108,5 @@
> +  void RemoveCustomToken(Token& aToken);
> +  /**
> +   * Only applies to a custom type of a Token (see AddCustomToken below.)
> +   * This turns on and off token recognition.  After Disable() was called
> +   * on a custom token, it's ignored (as never added as a custom token).

Where is this Disable() method that this comment is referring to?

@@ +117,5 @@
> +  /**
> +   * Switch the parser mode to only see the custom added tokens.  Anything in between
> +   * will be seen only as raw data (TOKEN_RAW type of token).  This can be perceived as
> +   * "a binary mode".
> +   * This is applyable only when there are any custom tokens defined.

Nit: I think this should be "applicable", not "applyable".

@@ +124,5 @@
> +  /**
> +   * Switch the parser back to the default full mode where all types of tokens are recognized.
> +   * This is the default.
> +   */
> +  void FullTokenizing() { mCustomOnlyTokenizing = false; }

Maybe you could just have an enum and a single method:

enum TokenizingMode {
  Full,
  CustomTokensOnly,
};

void SetTokenizingMode(TokenizingMode aMode);

WDYT?
Attachment #8830844 - Flags: review?(nfroyd) → review-
Comment on attachment 8830845 [details] [diff] [review]
v1.6 (part 4/4 - test updates)

Review of attachment 8830845 [details] [diff] [review]:
-----------------------------------------------------------------

I scanned through these, and they look mostly reasonable, but I'm not going through them carefully until we get some API details nailed down.

::: xpcom/tests/gtest/TestTokenizer.cpp
@@ +848,5 @@
> +  NS_NAMED_LITERAL_CSTRING(input, "test1,test2,,,test3");
> +  auto cur = input.BeginReading();
> +  auto end = input.EndReading();
> +  for (; cur < end; ++cur) {
> +    i.FeedInput(nsDependentCSubstring(cur, 1));

I was going to ask for a test like this, thank you for writing one!

@@ +1107,5 @@
> +  {
> +    switch (++test) {
> +      case 1: EXPECT_TRUE(t.Fragment().EqualsLiteral("01")); break;
> +      case 2: EXPECT_TRUE(t.Fragment().EqualsLiteral("234567")); break;
> +      case 3: EXPECT_TRUE(t.Fragment().EqualsLiteral("89")); break;

I don't have a good intuition about why these strings are the fragments being observed at each step.  I think they must be related to the min raw length passed to the tokenizer, but then I'd expect the first fragment to be "012", with possibly some other changes.  Can you help me refine my intuition here?
Attachment #8830845 - Flags: review?(nfroyd)
Comment on attachment 8830842 [details] [diff] [review]
v1.6 (part 2/4 - ability to handle incomplete input)

(you wanted to r- this)
Attachment #8830842 - Flags: review+ → review-
(In reply to Nathan Froyd [:froydnj] from comment #17)
> Comment on attachment 8830844 [details] [diff] [review]
> v1.6 (part 3/4 - custom token strings + 'binary' or 'raw' parsing mode)
> 
> Review of attachment 8830844 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I can definitely see cases where this would be useful.  But do we need to
> extend Tokenizer such that it can flip back and forth between modes?

Please see the patch in bug 1321612 for how this class is used.  There are cases where parts are considered binary with a binary separator and parts that need regular tokenization in-between.
Flags: needinfo?(nfroyd)
(In reply to Honza Bambas (:mayhemer) from comment #20)
> Please see the patch in bug 1321612 for how this class is used.

Please CC me so that I can see the bug. :)
Flags: needinfo?(nfroyd)
(In reply to Nathan Froyd [:froydnj] from comment #21)
> (In reply to Honza Bambas (:mayhemer) from comment #20)
> > Please see the patch in bug 1321612 for how this class is used.
> 
> Please CC me so that I can see the bug. :)

oops, done :)
(In reply to Nathan Froyd [:froydnj] from comment #15)
> Comment on attachment 8830841 [details] [diff] [review]
> v1.6 (part 1/4 - split to a base class)
> 
> Review of attachment 8830841 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: xpcom/ds/Tokenizer.cpp
> @@ +275,5 @@
> > +  , mHasFailed(false)
> > +  , mWhitespaces(aWhitespaces ? aWhitespaces : sWhitespaces)
> > +  , mAdditionalWordChars(aAdditionalWordChars)
> > +  , mCursor(nullptr)
> > +  , mEnd(nullptr)
> 
> We're initializing mCursor and mEnd in the base class when we weren't
> initializing them formerly because...the conditions for their initialization
> > in an upcoming subclass are different?  Or just because it's good
> programming practice to initialize all the members in the constructor?

Because each subclass uses them differently.  mCursor and mEnd are constantly changing in the incremental tokenizer and are initially null.  The final-input "standard" tokenizer, OTOH, keeps them valid and const throughout its lifetime.

So, the difference is the reason here and it's also a good practice ;)

> 
> Not objecting to the change, just trying to understand why it's here.
(In reply to Nathan Froyd [:froydnj] from comment #16)
> Comment on attachment 8830842 [details] [diff] [review]
> v1.6 (part 2/4 - ability to handle incomplete input)
> 
> Review of attachment 8830842 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I think this is mostly OK.  I'd like to discuss the API comment concerning
> Init and a few other things below; r-'ing so I can see another round of
> these patches.
> 
> ::: xpcom/ds/IncrementalTokenizer.cpp
> @@ +43,5 @@
> > +  NS_ENSURE_TRUE(mConsumer, NS_ERROR_NOT_INITIALIZED);
> > +  MOZ_ASSERT(mIncomplete);
> > +
> > +  mInput.Cut(0, mInputCursor);
> > +  mInputCursor = 0;
> 
> This (and other similar places) can't be Truncate() because there might be
> unread text beyond mInputCursor, correct?

Exactly, I only move what's left to read.  The parser may (obviously) be in a state where there is leftover data while the Feed* API can be called to add more input.

> Nit: did you mean to spell this "remainder"?

Yes:)

> ::: xpcom/ds/IncrementalTokenizer.h
> @@ +46,5 @@
> > +
> > +  /**
> > +   * Assigns the consumer callback and allows the tokenizer to work.
> > +   */
> > +  void Init(Consumer aConsumer);
> 
> Could we provide the Consumer in the constructor, rather than in a separate
> Init() method?  

I had it that way originally, but then you must provide the closure in the constructor code (in the member initializer list) of the class that is using the tokenizer as a member.  I was facing some obstacle with the API designed that way, but I can't remember what exactly the problem was.  So I turned it into an Init method instead.

> I see that the current Init() implementation doesn't permit
> re-initialization, which is good, but I'm unsure that we gain a lot from
> separating construction from Init() here, especially when Init() is
> infallible.

I'll try to rewrite this (remove Init and pass the closure in the ctor) just to see how well it works in the dependent bug.  The closure became pretty simple, so I believe it might work well now.
(In reply to Nathan Froyd [:froydnj] from comment #17)
> > +  t->mType = static_cast<TokenType>(++mNextCustomTokenID);
> 
> Ewwww...Can we make mType a Variant<TokenType, uint32_t> (i.e. you're either
> a regular typed token or a custom token with a unique ID) or similar
> instead?  I'm sure that would make the code slightly more complicated in
> some places, but it would at least have the benefit that we wouldn't be
> performing this dodgy casting.

So... this was really a lot of work to rewrite and it ended up being a bad idea.  The trick I exploit here is that mType is simply a number, so I can easily compare the mType of two tokens regardless of whether they are built-in or custom.  With Variant I really can't do that at all.

One static_cast here is better than messing with Variants, sorry.
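For the record, the comparison trick being defended is just this (a sketch; mType as in the patch):

// Custom token IDs are allocated above the built-in TokenType values
// (the static_cast above), so one integer comparison covers both kinds:
bool same = aLeft.mType == aRight.mType;  // built-in or custom alike
// With Variant<TokenType, uint32_t>, this single comparison would first
// need to match on both alternatives.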
- consumer in ctor
- s/mIncomplete/mInputFinished/ + inverted the logic
- the input-stream variant of IncrementalTokenizer::FeedInput gained an overflow check (good catch, btw!)
Attachment #8833008 - Flags: review?(nfroyd)
- TokenizerBase::SetTokenizingMode
- comments cleanup

I abandoned the Variant change, since I found it impossible to go with.  If you insist, then please note this blocks a sec-high bug and these patches are good as are.  We can try to use Variant here in a followup, but I personally would oppose it.
Attachment #8833012 - Flags: review?(nfroyd)
- API changes reflected
Attachment #8833013 - Flags: review?(nfroyd)
Comment on attachment 8833008 [details] [diff] [review]
v1.6-1.7 IDIFF (part 2/4 - ability to handle incomplete input)

Review of attachment 8833008 [details] [diff] [review]:
-----------------------------------------------------------------

This looks OK, but I am confused by the aRawMinBuffered parameter omission.

::: xpcom/ds/IncrementalTokenizer.cpp
@@ +15,5 @@
>  namespace mozilla {
>  
> +IncrementalTokenizer::IncrementalTokenizer(Consumer aConsumer,
> +                                           const char * aWhitespaces,
> +                                           const char * aAdditionalWordChars)

This needs an aRawMinBuffered parameter to match the declaration in the header file, no?  And mMinRawDelivery will need to be updated?

::: xpcom/ds/Tokenizer.cpp
@@ +17,5 @@
>                       const char* aWhitespaces,
>                       const char* aAdditionalWordChars)
>    : TokenizerBase(aWhitespaces, aAdditionalWordChars)
>  {
> +  mInputFinished = true;

I guess technically this gets set in TokenizerBase, so you don't need to set it here?
Attachment #8833008 - Flags: review?(nfroyd) → review+
Comment on attachment 8833012 [details] [diff] [review]
v1.6-1.7 IDIFF (part 3/4 - custom token strings + 'binary' or 'raw' parsing mode)

Review of attachment 8833012 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/ds/IncrementalTokenizer.cpp
@@ +16,5 @@
>  
>  IncrementalTokenizer::IncrementalTokenizer(Consumer aConsumer,
>                                             const char * aWhitespaces,
> +                                           const char * aAdditionalWordChars,
> +                                           uint32_t aRawMinBuffered)

Ah, here is the missing change from the first part! :)
Attachment #8833012 - Flags: review?(nfroyd) → review+
Comment on attachment 8833013 [details] [diff] [review]
v1.6-1.7 IDIFF (part 4/4 - test updates)

Review of attachment 8833013 [details] [diff] [review]:
-----------------------------------------------------------------

r=me on the interdiff.  Still need to look at the tests.
Attachment #8833013 - Flags: review?(nfroyd) → review+
Still need an answer to my review on the tests in comment 16, please.
Flags: needinfo?(honzab.moz)
Thanks for the quick reviews!

Comment 18 ;)

> > +  {
> > +    switch (++test) {
> > +      case 1: EXPECT_TRUE(t.Fragment().EqualsLiteral("01")); break;
> > +      case 2: EXPECT_TRUE(t.Fragment().EqualsLiteral("234567")); break;
> > +      case 3: EXPECT_TRUE(t.Fragment().EqualsLiteral("89")); break;
> 
> I don't have a good intuition about why these strings are the fragments
> being observed at each step.  I think they must be related to the min raw
> length passed to the tokenizer, but then I'd expect the first fragment to be
> "012", with possibly some other changes.  Can you help me refine my
> intuition here?

The logic is not exactly precise (for simplicity), so what you can expect is a raw length of ~3.  When the conditions to return a 'raw' token are met (enough data), I return a pointer to the 'end' of the 'raw' token as |mEnd - longestCustom + 1;|.  IIRC, this simply returns everything we can throw away (cannot be any of the custom tokens) except the part that may potentially fit the longer custom token - 1.

Diagram:

XY and ABCD are two custom tokens.  Min raw delivery is 3.

"0123ABC" is the input.  As you may see the second custom token is _probably_ emerging, but there is one more char missing.  Until this char is fed, we cannot decide.  But, the 0123 part can already be delivered.

And now the condition:

  if (longestCustom < available && available > mMinRawDelivery) {

longestCustom < available: TRUE, we know that all the custom tokens would fit this input and one of them would have been found if completely present.  But none of them has been found at this point.

available > mMinRawDelivery: TRUE, we have enough data to return something.  This condition is probably the source of the confusion here, since what we _actually_ return is something between the min raw delivery and (available - longest token).

  return mEnd - longestCustom + 1: points just past the '3' in the input, which is all we can safely return to the consumer and cut from the input.
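To put numbers on the diagram: with input "0123ABC", available = 7 and longestCustom = 4 (the length of "ABCD"); both conditions hold (4 < 7 and 7 > 3), so mEnd - 4 + 1 points just past the '3', "0123" is delivered, and "ABC" stays buffered in case a 'D' arrives to complete the custom token.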



Hope this helps.  We may improve this logic, but for a first version I think it is sufficient.

Nathan, please keep in mind this blocks a sec-high bug that needs to land on all branches.  Hence I'd like to get this done ASAP.  Any polishing can be done in followups.
Flags: needinfo?(honzab.moz)
(In reply to Nathan Froyd [:froydnj] from comment #29)
> Comment on attachment 8833008 [details] [diff] [review]
> v1.6-1.7 IDIFF (part 2/4 - ability to handle incomplete input)
> 
> Review of attachment 8833008 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> This looks OK, but I am confused by the aRawMinBuffered parameter omission.
> 
> ::: xpcom/ds/IncrementalTokenizer.cpp
> @@ +15,5 @@
> >  namespace mozilla {
> >  
> > +IncrementalTokenizer::IncrementalTokenizer(Consumer aConsumer,
> > +                                           const char * aWhitespaces,
> > +                                           const char * aAdditionalWordChars)
> 
> This needs an aRawMinBuffered parameter to match the declaration in the
> header file, no?  And mMinRawDelivery will need to be updated?
> 
> ::: xpcom/ds/Tokenizer.cpp
> @@ +17,5 @@
> >                       const char* aWhitespaces,
> >                       const char* aAdditionalWordChars)
> >    : TokenizerBase(aWhitespaces, aAdditionalWordChars)
> >  {
> > +  mInputFinished = true;
> 
> I guess technically this gets set in TokenizerBase, so you don't need to set
> it here?

Split out just to make the partial patches build, no other reason.
Attached patch v1.7 (complete patch) (obsolete) — Splinter Review
Still waiting for Nathan's r on the tests.
Attachment #8830841 - Attachment is obsolete: true
Attachment #8830842 - Attachment is obsolete: true
Attachment #8830844 - Attachment is obsolete: true
Attachment #8830845 - Attachment is obsolete: true
Attachment #8833008 - Attachment is obsolete: true
Attachment #8833012 - Attachment is obsolete: true
Attachment #8833013 - Attachment is obsolete: true
(In reply to Honza Bambas (:mayhemer) from comment #33)
> (cannot be any of the custom tokens) except the part that may potentially
> fit the longer custom token - 1.  

fit the *longest* custom token - 1.
Comment on attachment 8834003 [details] [diff] [review]
v1.7 (complete patch)

Review of attachment 8834003 [details] [diff] [review]:
-----------------------------------------------------------------

It would help if, when waiting for r+ on a patch, you set the r? flag on the patch. :)

r=me, I guess.  Not happy about the terrible custom token matching loop, though.

::: xpcom/ds/Tokenizer.cpp
@@ +390,5 @@
> +    // We have to do a brute-force search for all of the enabled custom
> +    // tokens.
> +    while (next < mEnd) {
> +      ++next;
> +      for (UniquePtr<Token> const& custom : mCustomTokens) {

This loop is horrible performance-wise, with none of the speedups that something like an intelligent implementation of strstr or memmem might bring.  And case-insensitive tokens are doubly terrible, because we don't have a fast implementation of case-insensitive equals.

I realize that the dependent patch also had this same sort of pattern, if not quite so elegantly expressed, so using this code instead of what was there previously is not a step backwards.  But having bad algorithms in one place like that is much different than putting bad algorithms in what is supposed to be general-purpose code.
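The kind of followup speedup I mean might look like this (a sketch only: memmem is a GNU extension that would need a portable fallback, and the names other than mCursor/mEnd are made up):

// Let an optimized substring search (needs <string.h>) find the next
// candidate instead of advancing one byte at a time; case-sensitive only:
const void* hit = memmem(mCursor, mEnd - mCursor,
                         customString.BeginReading(), customString.Length());
const char* next = hit ? static_cast<const char*>(hit) : mEnd;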

@@ +407,5 @@
> +
> +    if (longestCustom < available && available > mMinRawDelivery) {
> +      // We can return some data w/o waiting for either a custom token
> +      // or call to FinishData() when we leave the tail where all the
> +      // custom tokens potentially fit, so we can't loose only partially

Typo: "lose", not "loose".

::: xpcom/tests/gtest/TestTokenizer.cpp
@@ +1068,5 @@
> +    return NS_OK;
> +  }, nullptr, nullptr, 3);
> +
> +  custom = i.AddCustomToken("aaa", Tokenizer::CASE_SENSITIVE);
> +  Unused << i.AddCustomToken("bb", Tokenizer::CASE_SENSITIVE);

You don't use the "bb" token here, but neither do you use `custom`, either.  Why bother?

@@ +1108,5 @@
> +    return NS_OK;
> +  }, nullptr, nullptr, 3);
> +
> +  custom = i.AddCustomToken("aaa", Tokenizer::CASE_SENSITIVE);
> +  Unused << i.AddCustomToken("bbbbb", Tokenizer::CASE_SENSITIVE);

Same question here for `custom` and the `Unused` output.

@@ +1114,5 @@
> +
> +  i.FeedInput(NS_LITERAL_CSTRING("01234"));
> +  i.FeedInput(NS_LITERAL_CSTRING("5"));
> +  i.FeedInput(NS_LITERAL_CSTRING("6789aa"));
> +  i.FeedInput(NS_LITERAL_CSTRING("aqwert"));

You should add checks between these FeedInput calls (and after ones that complete custom tokens)--not just here, but in other tests as well--to ensure that tokens are output when you think they are going to be, not just that a sufficient number of tokens are output at the end.  Otherwise, a sufficient implementation of IncrementalTokenizer would just be:

1. Buffer all input read.
2. At EOF, run the regular tokenizer.

But we would like something a little more sophisticated than that, and our tests should confirm that we do something a little more sophisticated than that.
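For illustration, the interleaved checks might look like this (hypothetical; the expected counts follow the min-raw-delivery arithmetic from comment 33, with "bbbbb", length 5, as the longest custom token and `test` being the counter bumped in the consumer):

i.FeedInput(NS_LITERAL_CSTRING("01234"));
EXPECT_EQ(test, 0);  // 5 chars buffered, the longest token (5) still fits: nothing out yet
i.FeedInput(NS_LITERAL_CSTRING("5"));
EXPECT_EQ(test, 1);  // 6 buffered: "01" (6 - 5 + 1 = 2 chars) is delivered
i.FeedInput(NS_LITERAL_CSTRING("6789aa"));
EXPECT_EQ(test, 2);  // 10 buffered: "234567" is delivered the same way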
Attachment #8834003 - Flags: review+
(In reply to Nathan Froyd [:froydnj] from comment #39)
> Comment on attachment 8834003 [details] [diff] [review]
> v1.7 (complete patch)
> 
> Review of attachment 8834003 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> It would help if, when waiting for r+ on a patch, you set the r? flag on the
> patch. :)
> 
> r=me, I guess.  Not happy about the terrible custom token matching loop,
> though.
> 
> ::: xpcom/ds/Tokenizer.cpp
> @@ +390,5 @@
> > +    // We have to do a brute-force search for all of the enabled custom
> > +    // tokens.
> > +    while (next < mEnd) {
> > +      ++next;
> > +      for (UniquePtr<Token> const& custom : mCustomTokens) {
> 
> This loop is horrible performance-wise, with none of the speedups that
> something like an intelligent implementation of strstr or memmem might
> bring.  And case-insensitive tokens are doubly terrible, because we don't
> have a fast implementation of case-insensitive equals.

I'm already thinking of some ways to make this much better (followups).  If you have suggestions, feel free to present them.  I don't like this loop either, trust me...

> 
> I realize that the dependent patch also had this same sort of pattern, if
> not quite so elegantly expressed, so using this code instead of what was
> there previously is not a step backwards.  But having bad algorithms in one
> place like that is much different than putting bad algorithms in what is
> supposed to be general-purpose code.

Fully agree.

> 
> @@ +407,5 @@
> > +
> > +    if (longestCustom < available && available > mMinRawDelivery) {
> > +      // We can return some data w/o waiting for either a custom token
> > +      // or call to FinishData() when we leave the tail where all the
> > +      // custom tokens potentially fit, so we can't loose only partially
> 
> Typo: "lose", not "loose".
> 
> ::: xpcom/tests/gtest/TestTokenizer.cpp
> @@ +1068,5 @@
> > +    return NS_OK;
> > +  }, nullptr, nullptr, 3);
> > +
> > +  custom = i.AddCustomToken("aaa", Tokenizer::CASE_SENSITIVE);
> > +  Unused << i.AddCustomToken("bb", Tokenizer::CASE_SENSITIVE);
> 
> You don't use the "bb" token here, but neither do you use `custom`, either. 
> Why bother?

It's a test that "bb" doesn't influence the algorithm.  It's simply a check when there are more than one token.

> 
> @@ +1108,5 @@
> > +    return NS_OK;
> > +  }, nullptr, nullptr, 3);
> > +
> > +  custom = i.AddCustomToken("aaa", Tokenizer::CASE_SENSITIVE);
> > +  Unused << i.AddCustomToken("bbbbb", Tokenizer::CASE_SENSITIVE);
> 
> Same question here for `custom` and the `Unused` output.

Because the "bbbbb" token is, again, there only to have more than one token.  I can add comments to the test explaining why I'm bothering.

> 
> @@ +1114,5 @@
> > +
> > +  i.FeedInput(NS_LITERAL_CSTRING("01234"));
> > +  i.FeedInput(NS_LITERAL_CSTRING("5"));
> > +  i.FeedInput(NS_LITERAL_CSTRING("6789aa"));
> > +  i.FeedInput(NS_LITERAL_CSTRING("aqwert"));
> 
> You should add checks between these FeedInput calls (and after ones that
> complete custom tokens)--not just here, but in other tests as well--to
> ensure that tokens are output when you think they are going to be, not just
> that a sufficient number of tokens are output at the end.  Otherwise, a
> sufficient implementation of IncrementalTokenizer would just be:
> 
> 1. Buffer all input read.
> 2. At EOF, run the regular tokenizer.

Right!

> 
> But we would like something a little more sophisticated than that, and our
> tests should confirm that we do something a little more sophisticated than
> that.

Followup? :)
(In reply to Nathan Froyd [:froydnj] from comment #39)
> Comment on attachment 8834003 [details] [diff] [review]
> v1.7 (complete patch)
> 
> Review of attachment 8834003 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> It would help if, when waiting for r+ on a patch, you set the r? flag on the
> patch. :)
> 

And... sorry, I really thought I did that :D
Attached patch v1.8Splinter Review
Thanks once again for all the reviews, Nathan!
Attachment #8834003 - Attachment is obsolete: true
Attachment #8834077 - Flags: review+
https://hg.mozilla.org/mozilla-central/rev/6025961af011
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
Approval Request Comment
[Feature/Bug causing the regression]: none
[User impact if declined]: none
[Is this code covered by automated tests?]: yes
[Has the fix been verified in Nightly?]: yes (no regressions)
[Needs manual test from QE? If yes, steps to reproduce]: no
[List of other uplifts needed for the feature/fix]: none
[Is the change risky?]: no
[Why is the change risky/not risky?]: this just adds a functionality to a core component (that will be used by a sec-high bug that needs an uplift too)
[String changes made/needed]: none
Attachment #8834882 - Flags: approval-mozilla-beta?
Attachment #8834882 - Flags: approval-mozilla-aurora?
Hi Honza,
For uplift requests, we only allow bug fixes for existing features or regressions.  However, there is no regression described in your request and there is also no user impact.  I was wondering why we need to uplift these patches.  Can we let this fix ride the trains and mark it wontfix on the other branches?
Flags: needinfo?(honzab.moz)
(In reply to Gerry Chang [:gchang] from comment #48)
> Hi Honza,
> For uplift requests, we only allow bug fixing for existing feature or
> regression. However, there is no description in your adjustment and there is
> also no user impact. I was wondering why we need to uplift these patches.
> Can we let this fix ride the train and won't fix on other branches?

This bug blocks bug 1321612 for uplift to esr45.  Uplifting this bug is pending resolution of what to do with bug 1321612 and esr45, though.
Flags: needinfo?(honzab.moz)
Attachment #8835001 - Attachment description: v1.8 for esr45 → v1.8 for esr45 [DOESN'T BUILD]
Comment on attachment 8835001 [details] [diff] [review]
v1.8 for esr45 [DOESN'T BUILD]

Probably not needed anymore.
Attachment #8835001 - Attachment is obsolete: true
Attachment #8835001 - Flags: approval-mozilla-esr45?
Comment on attachment 8834882 [details] [diff] [review]
v1.8 for m-a, m-b

Let's get this into aurora53 and 52.0b6 so the other fix can land too.
Attachment #8834882 - Flags: approval-mozilla-beta?
Attachment #8834882 - Flags: approval-mozilla-beta+
Attachment #8834882 - Flags: approval-mozilla-aurora?
Attachment #8834882 - Flags: approval-mozilla-aurora+