Closed Bug 1188983 Opened 4 years ago Closed 4 years ago

mozilla::Tokenizer improvements

Categories

(Core :: XPCOM, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla43
Tracking Status
firefox42 --- affected
firefox43 --- fixed

People

(Reporter: mayhemer, Assigned: mayhemer)

References

(Blocks 1 open bug)

Details

Attachments

(1 file, 7 obsolete files)

Starting to use it shows we may need few additions to the API.  I will outline them here.
Blocks: 1189349
Attached patch v1 (obsolete) — Splinter Review
Few updates I've made after working more with the class.

- saves some memory copying when Token is of the WORD type
- adds some useful shortcut methods to more directly read numbers/words where expected
- and of course, tests are added
Attachment #8641790 - Flags: review?(nfroyd)
Comment on attachment 8641790 [details] [diff] [review]
v1

Review of attachment 8641790 [details] [diff] [review]:
-----------------------------------------------------------------

Apologies for the review lag here.  A couple comments below.

::: xpcom/ds/Tokenizer.cpp
@@ +14,5 @@
>  static const char sWhitespaces[] = " \t";
>  
> +Tokenizer::Tokenizer(const nsACString& aSource,
> +                     const char* aWhitespaces,
> +                     const char* aAdditionalWordChars)

I'd suggest two things for these constructors:

1) implement a private constructor:

Tokenizer(const char* aWhitespaces, const char* aAdditionalWordChars)

that handles the details of setting the variables in the current constructor's initialization list.  That way, the semantics of aWhitespaces are handled in a single place.  We can then call:

Tokenizer::Tokenizer(const nsACString&, const char*, const char*)
  : Tokenizer(aWhitespaces, aAdditionalWordChars)

and similarly below.

2) The initialization for mRecord and similar should be shared.

@@ +35,5 @@
> +  , mAdditionalWordChars(aAdditionalWordChars)
> +{
> +  nsDependentCString source(aSource);
> +  source.BeginReading(mCursor);
> +  mRecord = mRollback = mCursor;

Is this really safe, that we're holding iterators for the string even after the string goes away?  I guess those iterators are just pointers into the original string, so it's safe...but still seems dodgy.

Maybe we should just hold an nsDependentCString to the input string as a member variable regardless?

::: xpcom/ds/Tokenizer.h
@@ +71,3 @@
>      int64_t AsInteger() const;
> +
> +  private:

This seems superfluous; please remove it.

@@ +76,5 @@
>  public:
> +  /**
> +   * @param aSource
> +   *    The string to parse.
> +   *    IMPORTANT NOTE: Tokenizer doesn't refer the string's buffer to safe memory

Nit: "to save"

I'm not sure this is really what you want to say; Tokenizer *does* refer directly to the string because it's saving memory by not copying...

@@ +84,5 @@
> +   *    If non-null Tokenizer will use this custom set of whitespaces for CheckWhite()
> +   *    and SkipWhites() calls.
> +   * @param aAdditionalWordChars
> +   *    If non-null it will be added to the list of characters that consist a word.
> +   *    This is useful when you want to accept e.g. '-' in HTTP headers.

I agree that this is useful, but the lack of symmetry between how the whitespace characters are specified and how the word characters are specified seems unfortunate.

We should also add documentation for what the default characters for a "word" are...

@@ +87,5 @@
> +   *    If non-null it will be added to the list of characters that consist a word.
> +   *    This is useful when you want to accept e.g. '-' in HTTP headers.
> +   *
> +   * If there is an overlap between aWhitespaces and aAdditionalWordChars check for
> +   * word characters is made first.

Nit: "...aAdditionalWordChars, the check for..."

@@ +209,5 @@
> +    if (!Check(TOKEN_INTEGER, t)) {
> +      return false;
> +    }
> +
> +    *aValue = static_cast<T>(t.AsInteger());

This should check that t.AsInteger() actually fits into a value of type T.  Tests for this are mandatory.

@@ +238,5 @@
>  
> +  /**
> +   * Return the internal read cursor, this is an address to the read buffer.
> +   */
> +  const char* Cursor() const { return mCursor; }

What is this useful for?  It's not clear to me that we actually want to expose this.

@@ +258,5 @@
>    bool IsNumber(const char aInput) const;
>  
>  private:
>    Tokenizer() = delete;
> +  Tokenizer(const Tokenizer&) = delete;

You should delete operator=(const Tokenizer&), too.  And the move variants, while we're at it.

::: xpcom/tests/gtest/TestTokenizer.cpp
@@ +367,5 @@
> +  EXPECT_TRUE(p3.CheckWord("custom"));
> +  EXPECT_TRUE(p3.CheckWhite());
> +  EXPECT_FALSE(p3.CheckWhite());
> +  EXPECT_TRUE(p3.CheckChar(' '));
> +  EXPECT_TRUE(p3.CheckWord("whites-and-word-chars"));

Seems kind of nitpicky, but it'd be good to test the case where we have multiple additional word characters, not just a single one.
Attachment #8641790 - Flags: review?(nfroyd) → feedback+
(In reply to Nathan Froyd [:froydnj][:nfroyd] from comment #2)
> > +  nsDependentCString source(aSource);
> > +  source.BeginReading(mCursor);
> > +  mRecord = mRollback = mCursor;
> 
> Is this really safe, that we're holding iterators for the string even after
> the string goes away?  I guess those iterators are just pointers into the
> original string, so it's safe...but still seems dodgy.

It's not perfect but there will be places working with dependent zstrings anyway.  We cannot from inside Tokenizer guarantee the buffer won't go away, it's simply up to the caller.

> 
> Maybe we should just hold an nsDependentCString to the input string as a
> member variable regardless?

Doesn't change anything about it.

> I'm not sure this is really what you want to say; Tokenizer *does* refer
> directly to the string because it's saving memory by not copying...

Comment updated.  We do point into the buffer, but we don't ensure its lifetime.

> 
> @@ +84,5 @@
> > +   *    If non-null Tokenizer will use this custom set of whitespaces for CheckWhite()
> > +   *    and SkipWhites() calls.
> > +   * @param aAdditionalWordChars
> > +   *    If non-null it will be added to the list of characters that consist a word.
> > +   *    This is useful when you want to accept e.g. '-' in HTTP headers.
> 
> I agree that this is useful, but the lack of symmetry between how the
> whitespace characters are specified and how the word characters are
> specified seems unfortunate.

I think it makes a perfect sense.  Word characters are something you don't want to list again, just add some new to that list.  OTOH whitespaces should be overridable IMHO, it can be useful.  You simply may not want a space and a tab be whites.

> @@ +238,5 @@
> >  
> > +  /**
> > +   * Return the internal read cursor, this is an address to the read buffer.
> > +   */
> > +  const char* Cursor() const { return mCursor; }
> 
> What is this useful for?  It's not clear to me that we actually want to
> expose this.

There is one function I'm rewriting with Tokenizer that wants to return a pointer to the original buffer as a result where a substring (with some additional conditions) has been found.

I'm not happy too with this.  I may think more how to rewrite that function if at all eventually.  I'll for now remove this method from this patch and add it (if no other way) in patch for that method overwrite.
Attached patch v1 (obsolete) — Splinter Review
- review should be addressed
- Cursor() removed
- construction code sharing is made a bit different way than suggested (cannot have ctor with (const char*, const char*) as args since there is an ambiguous overload)
- added option to SkipWhites to also include new lines + a test for it
Attachment #8641790 - Attachment is obsolete: true
Attachment #8644365 - Flags: review?(nfroyd)
Comment on attachment 8644365 [details] [diff] [review]
v1

Review of attachment 8644365 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry for the review lag here.

::: xpcom/ds/Tokenizer.h
@@ +74,5 @@
>  public:
> +  /**
> +   * @param aSource
> +   *    The string to parse.
> +   *    IMPORTANT NOTE: Tokenizer doesn't doesn't ensure the input string buffer lifetime.

Nit: double "doesn't"

@@ +75,5 @@
> +  /**
> +   * @param aSource
> +   *    The string to parse.
> +   *    IMPORTANT NOTE: Tokenizer doesn't doesn't ensure the input string buffer lifetime.
> +   *    It's up to the consumer to make sure the string's buffer overlives the Tokenizer!

Nit: I think the usual term is "outlives"

@@ +144,5 @@
>    MOZ_WARN_UNUSED_RESULT
>    bool HasFailed() const;
>  
>    /**
> +   * SkipWhites method (bellow) may also skip new line characters automatically.

Nit: "below"

@@ +145,5 @@
>    bool HasFailed() const;
>  
>    /**
> +   * SkipWhites method (bellow) may also skip new line characters automatically.
> +   * This should be optinal and this is the argument to SkipWhites.

I think this sentence can be removed.

@@ +227,5 @@
> +      return false;
> +    }
> +
> +    static_assert(sizeof(T) <= sizeof(int64_t), "The target type is larger than int64_t");
> +    *aValue = static_cast<T>(t.AsInteger());

This is not a useful assert, since it's always true.

What I meant in the previous review comment was when you have something like:

  Tokenizer t("345762");
  uint16_t val;
  MOZ_ASSERT(t.ReadInteger(&val));

That assert should fail, because the value we tried to read was too large for a uint16_t.  Something along the lines of:

  mozilla::CheckedInt<T> actualValue(t.AsInteger());
  if (!t.isValid()) {
    return false;
  }
  *aValue = actualValue.value();
  return true;

And there should be tests that test ReadInteger for a variety of different sizes and signedness, both valid and invalid cases.
Attachment #8644365 - Flags: review?(nfroyd) → feedback+
Attached patch v2 (obsolete) — Splinter Review
- ReadInteger does a run-time check now (btw, good suggestion)
- tests for it added
- fixed (actually added) usage of N template arg in bool CheckWord
Attachment #8644365 - Attachment is obsolete: true
Attachment #8645740 - Flags: review?(nfroyd)
Comment on attachment 8645740 [details] [diff] [review]
v2

Review of attachment 8645740 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks!  Just a few small comments, and r=me with those changes.

::: xpcom/ds/Tokenizer.h
@@ +217,5 @@
> +  bool ReadChar(char* aValue);
> +  bool ReadWord(nsACString& aValue);
> +  bool ReadWord(nsDependentCSubstring& aValue);
> +  template <typename T>
> +  bool ReadInteger(T* aValue)

Nit: worth a comment saying that if the read value doesn't fit into the provided integer, we return false?  (I think it's better to be explicit in comments.)

::: xpcom/tests/gtest/TestTokenizer.cpp
@@ +418,5 @@
> +#define INT_6_BITS                 64U
> +#define INT_30_BITS                1073741824UL
> +#define INT_32_BITS                4294967295UL
> +#define INT_50_BITS                1125899906842624ULL
> +#define STR_INT_MORE_THEN_64_BITS "922337203685477580899"

Nit: "THAN", not "THEN"

@@ +428,5 @@
> +    Tokenizer p(_S(INT_6_BITS));
> +    uint8_t u8;
> +    uint16_t u16;
> +    uint32_t u32;
> +    uint64_t u64;

I think it is worth checking signed integers here, too.

@@ +449,5 @@
> +    Tokenizer p(_S(INT_30_BITS));
> +    uint8_t u8;
> +    uint16_t u16;
> +    uint32_t u32;
> +    uint64_t u64;

And here.

@@ +477,5 @@
> +    Tokenizer p(_S(INT_50_BITS));
> +    uint8_t u8;
> +    uint16_t u16;
> +    uint32_t u32;
> +    uint64_t u64;

And here.

@@ +490,5 @@
> +
> +  {
> +    Tokenizer p(STR_INT_MORE_THEN_64_BITS);
> +    int64_t i;
> +    EXPECT_FALSE(p.ReadInteger(&i));

And uint64_t here just for good measure.
Attachment #8645740 - Flags: review?(nfroyd) → review+
Attached patch v2 -> v2.1 idiff (obsolete) — Splinter Review
I've made few more changes:

- since the integer parsing doesn't play with negative numbers, the token value changed to uint64_t (was int64_t)
- enum ClaimInclusion moved closer to its use
- CheckWord[N] fixed now for real (not sure how the previous version could slip in..)
- plus the review requirements should be fixed
Attachment #8647627 - Flags: review?(nfroyd)
Attached patch v2.1 (obsolete) — Splinter Review
carrying, but still want the idiff double-checked
Attachment #8645740 - Attachment is obsolete: true
Attachment #8647630 - Flags: review+
Attached patch v2.1 -w (for ref) (obsolete) — Splinter Review
Blocks: 1165214
Attached patch v2.1 (+linux build fix) (obsolete) — Splinter Review
Attachment #8647630 - Attachment is obsolete: true
Attachment #8647665 - Flags: review+
Comment on attachment 8647627 [details] [diff] [review]
v2 -> v2.1 idiff

Review of attachment 8647627 [details] [diff] [review]:
-----------------------------------------------------------------

Looks great, thank you!  And thanks for the interdiff!

::: xpcom/ds/Tokenizer.h
@@ +207,5 @@
> +   * cursor when any of the following happens:
> +   *  - the token at the read cursor is not an integer
> +   *  - the final number doesn't fit the T type
> +   * Otherwise true is returned, aValue is filled with the integral number 
> +   * and the cursor is shifted

Nit: trailing space and a period at the end of this sentence.  Might be a little more word-smithy to say "...and the cursor is moved forward".
Attachment #8647627 - Flags: review?(nfroyd) → review+
Attached patch v2.2Splinter Review
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b2b0edd5eb4e
Attachment #8647627 - Attachment is obsolete: true
Attachment #8647631 - Attachment is obsolete: true
Attachment #8647665 - Attachment is obsolete: true
Attachment #8648052 - Flags: review+
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/1399f33b6a4f
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla43
Comment on attachment 8648052 [details] [diff] [review]
v2.2

>+  nsAutoCString test1;
>+  nsDependentCString test2;
>+  char comma;
>+  uint32_t integer;
>+
>+  EXPECT_TRUE(p.ReadWord(test1));
>+  EXPECT_TRUE(test1 == "test1");
>+  p.SkipWhites();
>+  EXPECT_TRUE(p.ReadWord(test2));
test2 never depends on anything. Did you mean nsDependentCSubstring?
Depends on: 1219726
You need to log in before you can comment on or make changes to this bug.