
Design and implement lexical analyzer to be used in any parser code

RESOLVED FIXED in Firefox 42

Status

Product: Core
Component: XPCOM
Priority: --
Severity: enhancement
Status: RESOLVED FIXED
Reported: 4 years ago
Last modified: 3 years ago

People

(Reporter: mayhemer, Assigned: mayhemer)

Tracking

(Blocks: 1 bug)

Version: Trunk
Target Milestone: mozilla42
Points: ---
Bug Flags: in-testsuite +

Firefox Tracking Flags

(firefox42 fixed)

Details

Attachments

(1 attachment, 7 obsolete attachments)

(Assignee)

Description

4 years ago
I keep seeing manual walks through strings whenever anything from HTTP headers to URLs is being parsed.  That is unsafe, hard to maintain, easy to break, and hard to review.

I'd like to have a helper class that encapsulates a good approach to this.  We could then (if the syntax is more complicated) build simple recursive syntax analyzers on top of it.

I have good experience with lexical analyzers.  They are easy to implement and easy to use.

The other option would be a C++ regex implementation with result capture, but that is much more complicated to implement correctly and not always as easy to use.
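For illustration, this is the kind of hand-rolled cursor walking being described; a hypothetical snippet, not code from the tree:

#include <cstring>

// Hypothetical example of manual string walking over a header line;
// every ++p is an opportunity for an off-by-one or a read past the buffer.
static void ParseByHand(const char* aHeader)
{
  const char* p = strchr(aHeader, ':');
  if (!p) {
    return;
  }
  ++p;                                   // step over the colon
  while (*p == ' ' || *p == '\t') {
    ++p;                                 // skip whitespace by hand
  }
  const char* valueStart = p;
  while (*p && *p != '\r' && *p != '\n') {
    ++p;                                 // scan to the end of the line
  }
  // the value is [valueStart, p) -- and none of the above is easy to review
  (void)valueStart;
}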
(Assignee)

Updated

4 years ago
Blocks: 906714
(Assignee)

Comment 1

4 years ago
As Valentin pointed out in a private mail, to make this work with our string classes, the better place for it is XPCOM.

Going to take this.
Assignee: nobody → honzab.moz
Status: NEW → ASSIGNED
Component: MFBT → XPCOM
(Assignee)

Comment 2

4 years ago
Created attachment 8519356 [details] [diff] [review]
draft1 - working :)

So, a few hours of coding and the first simple working lexical analyzer is alive!  Oh, how I love this job from time to time :)

Comment 3

4 years ago
I'm not sure what this is for, or what it's designed to handle, but you might be interested in these Unicode documents:

UAX 29: Unicode Text Segmentation
http://www.unicode.org/reports/tr29/

UAX 44: Unicode Character Database
http://www.unicode.org/reports/tr44/

And other reports that might be of use to you:
http://www.unicode.org/reports/
(Assignee)

Comment 4

4 years ago
Created attachment 8526908 [details] [diff] [review]
draft1.1

Some API improvements, a real-life example of a non-recursive parse in the test, and a lot of added comments.
Attachment #8519356 - Attachment is obsolete: true
(Assignee)

Comment 5

3 years ago
Created attachment 8614806 [details] [diff] [review]
v1

I still keep getting new patches in my hands that use strstr() and strchr().  I really want us to do better.  This was intended to be the base of our new URL parser, hence it was blocked on Unicode capabilities.  But I would like to start using it for at least ASCII inputs for now and extend it in a followup.

I think the header has good comments, and the included test shows how to use the parser.
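A minimal usage sketch, based on the method names and the HTTP-response test quoted in the review below; the input string here is illustrative, and exact signatures and enum names may differ from the attached patch:

// Sketch only: parse the start of an HTTP status line with the proposed class.
Lexan l(NS_LITERAL_CSTRING("HTTP/1.1 200 OK\r\n"));
Lexan::Token t;

bool ok = l.CheckWord("HTTP") &&   // fixed word
          l.CheckChar('/') &&      // single expected character
          l.Next(t);               // next token should be the major version number
if (ok && t.Type() == Lexan::TOKEN_NUMBER) {
  int64_t major = t.AsInteger();   // 1
  (void)major;
}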
Attachment #8526908 - Attachment is obsolete: true
Attachment #8614806 - Flags: review?(dougt)
(Assignee)

Comment 6

3 years ago
If we removed the AString dependency, we could potentially move this to MFBT?

Updated

3 years ago
Attachment #8614806 - Flags: review?(dougt) → review?(nfroyd)
(Assignee)

Comment 7

3 years ago
Review ping.

Nathan, this is not that complicated, and I know of a few places/patches that would like to use it.  Thanks.
Flags: needinfo?(nfroyd)

Comment 8

3 years ago
Comment on attachment 8614806 [details] [diff] [review]
v1

Review of attachment 8614806 [details] [diff] [review]:
-----------------------------------------------------------------

Apologies for the delay.

Where do you see this being used outside of netwerk/?  grepping for strchr, for instance, doesn't yield very many possible use cases outside of netwerk/.

Giving this f+ so I can see a revised patch, but there's nothing major in the comments below.  Should be a quick r+ once comments are addressed.

::: xpcom/string/Lexan.h
@@ +10,5 @@
> +#include "nsString.h"
> +
> +/**
> + * This is a simple implementation of a lexical analyzer or maybe better
> + * called a tokenizer.

This is bikeshed territory, but I think it would be better called Tokenizer.  (It took me a while to work out that Lexan is a LEXical ANalyzer.)

@@ +24,5 @@
> +   */
> +  enum TokenType {
> +    TOKEN_UNKNOWN,
> +    TOKEN_ERROR,
> +    TOKEN_NUMBER,

Might be a good idea to rename this to TOKEN_INTEGER (and similar for all the *Number things throughout the patch) so people don't start thinking this is a number in, say, the JavaScript sense.

@@ +47,5 @@
> +  public:
> +    Token() : mType(TOKEN_UNKNOWN), mChar(0), mNumber(0) {}
> +
> +    // Static constructors of tokens by type and value
> +    static Token Word(nsACString const &aWord);

Gecko style would be |const nsACString& aWord|.  Please fix this instance and so on throughout the patch.

@@ +89,5 @@
> +   * and value to aToken result and shift the cursor past this just parsed token.  Each call
> +   * to Next() reads another token from the input and shifts the cursor.
> +   * Returns false if we have passed the end of the input.
> +   */
> +  bool Next(Token &aToken);

I think all of these bool-returning methods should be MOZ_WARN_UNUSED_RESULT.
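For example, the declaration would then read (a sketch; MOZ_WARN_UNUSED_RESULT comes from mozilla/Attributes.h):

MOZ_WARN_UNUSED_RESULT bool Next(Token& aToken);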

@@ +121,5 @@
> +  /**
> +   * Check whitespace character is present, either a specified one by aChar argument
> +   * or any from the mWhitespaces list.
> +   */
> +  bool CheckWhite(char const aChar = 0) { return Check(Token::Whitespace(aChar)); }

Having a default argument here seems redundant with |CheckChar|.  Why not make this function take no arguments and therefore always check its argument against mWhitespaces?  Callers can use CheckChar if they want.
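A sketch of the shape being suggested here; the no-argument Token::Whitespace() factory is assumed, not taken from the patch:

// Sketch: CheckWhite() always tests against the mWhitespaces list;
// callers that want one specific character use CheckChar() instead.
bool CheckWhite() { return Check(Token::Whitespace()); }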

@@ +128,5 @@
> +   * cursor position and return true.  Otherwise false.
> +   */
> +  bool CheckChar(char const aChar) { return Check(Token::Char(aChar)); }
> +  /**
> +   * This is a custumizable version of CheckChar.  aClassifier is a function called with

Nit: "customizable"

@@ +153,5 @@
> +
> +  /**
> +   * Returns the read cursor position back as it was before the last call of any parsing
> +   * method of Lexan (Next, Check*, Skip*) so that the last operation can be repeated.
> +   * Works only once.

This last sentence really means "works only once per read token", right?  i.e. something like:

l.CheckChar(...)
Rollback()
l.CheckChar(...)
Rollback()

is allowed, but not:

l.CheckChar(...)
Rollback()
Rollback()

correct?  Please update the documentation to reflect that.  If you could make Rollback() assert in the incorrect usage case, that would be even better.
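A sketch of what that assertion could look like; mCanRollback is an illustrative member name, not from the patch:

// Sketch only: assert that Rollback() is not called twice for the same
// parsing operation.
void Lexan::Rollback()
{
  MOZ_ASSERT(mCanRollback, "Rollback() may only be called once per parsing operation");
  mCanRollback = false;   // re-armed by the next Next()/Check*()/Skip*() call
  mCursor = mRollback;
}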

@@ +161,5 @@
> +  /**
> +   * Start the process of recording.  Based on aInclude value the begining of the recorded
> +   * sub-string is at the current position (EXCLUDE_LAST) or at the position of the last
> +   * parsed token (INCLUDE_LAST) so as the read position was before the last parsing
> +   * operation call.

What does "so as the read position was before the last parsing operation call" mean?

@@ +174,5 @@
> +
> +protected:
> +  // true if we have already read the EOF token.
> +  bool HasInput() const;
> +  // Main prasing function, it doesn't shift the read cursor, just returns the next

Nit: "parsing"

@@ +193,5 @@
> +
> +  // true iff we have already read the EOF token
> +  bool mPastEof : 1;
> +  // true iff the last Check*() call has returned false, reverts to true on Rollback() call
> +  bool mHasFailed : 1;

I doubt we'll (a) be allocating a lot of these and (b) be allocating them anywhere but the stack.  So we might as well make these non-bitfields to make the accesses more efficient.

::: xpcom/tests/TestLexan.cpp
@@ +1,1 @@
> +/* -*- Mode: C++; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 4 -*- */

Please fix the modeline (tab-width: 8, c-basic-offset: 2).

@@ +26,5 @@
> +  if (xpcom.failed())
> +    return 1;
> +
> +#define CHECK(a) \
> +  MOZ_ASSERT(a); \

New tests going in to xpcom/ have typically been gtests, not straight cpptests.  You can either:

1) Make this MOZ_RELEASE_ASSERT, so opt and debug builds get the same behavior; or
2) Convert everything to a gtest (doing a search-and-replace will get you 90% of the way there).

@@ +42,5 @@
> +    "\r\n"
> +    "This is the body"));
> +
> +  r = l.CheckWord("HTTP");
> +  CHECK(r);

Whichever method you choose to modify CHECK, above, it might reduce a bit of visual clutter to convert things to:

CHECK(l.CheckWord("HTTP"));

(would be particularly clear with gtests: ASSERT_TRUE(l.CheckWord("HTTP")) )

@@ +112,5 @@
> +
> +
> +  // Synthetic code-specific test
> +
> +  Lexan l1(NS_LITERAL_CSTRING("test123 ,15  \t*\r\n%xx,-15\r\r"));

gtests would also enable separating out individual Lexan tests into functions...
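A sketch of what one converted test could look like, using the v1 names under review; the header path and the input string are illustrative:

#include "gtest/gtest.h"
#include "nsString.h"
#include "Lexan.h"

// Sketch of a gtest conversion; the real test covers a full HTTP response
// rather than just the status line.
TEST(Lexan, HTTPStatusLine)
{
  Lexan l(NS_LITERAL_CSTRING("HTTP/1.0 200 OK\r\n"));

  ASSERT_TRUE(l.CheckWord("HTTP"));
  ASSERT_TRUE(l.CheckChar('/'));
}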
Attachment #8614806 - Flags: review?(nfroyd) → feedback+
Flags: needinfo?(nfroyd)
(Assignee)

Comment 9

3 years ago
(In reply to Nathan Froyd [:froydnj] [away 10 July through 19 July] from comment #8)
> Comment on attachment 8614806 [details] [diff] [review]
> v1
> 
> Review of attachment 8614806 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Apologies for the delay.
> 
> Where do you see this being used outside of netwerk/?  

I want to expose this to any consumer.  It can be used for parsing user input or file content; it's not limited to networking.

> grepping for strchr,
> for instance, doesn't yield very many possible use cases outside of netwerk/.

If you insist on moving this to netwerk/, let me know.  It's true that the primary target is that module.

> 
> Giving this f+ so I can see a revised patch, but there's nothing major in
> the comments below.  Should be a quick r+ once comments are addressed.
> 
> ::: xpcom/string/Lexan.h
> @@ +10,5 @@
> > +#include "nsString.h"
> > +
> > +/**
> > + * This is a simple implementation of a lexical analyzer or maybe better
> > + * called a tokenizer.
> 
> This is bikeshed territory, but I think it would be better called Tokenizer.
> (It took me a while to work out that Lexan is a LEXical ANalyzer.)

Lexan is shorter, although cryptic.  And this class is just the useful and universal subset of what lexical analyzers do.  Tokenizer it is, then.

> @@ +89,5 @@
> > +   * and value to aToken result and shift the cursor past this just parsed token.  Each call
> > +   * to Next() reads another token from the input and shifts the cursor.
> > +   * Returns false if we have passed the end of the input.
> > +   */
> > +  bool Next(Token &aToken);
> 
> I think all of these bool-returning methods should be MOZ_WARN_UNUSED_RESULT.

It's debatable.  But consumers can always use 'unused <<'.
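A sketch of that pattern with the annotation in place, assuming the mozilla::unused helper of that era:

#include "mozilla/unused.h"

// Sketch: a caller that deliberately ignores an annotated return value.
void SkipOneToken(Lexan& aLexan)
{
  Lexan::Token t;
  mozilla::unused << aLexan.Next(t);  // advance past one token, result unused
}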

> 
> @@ +121,5 @@
> > +  /**
> > +   * Check whitespace character is present, either a specified one by aChar argument
> > +   * or any from the mWhitespaces list.
> > +   */
> > +  bool CheckWhite(char const aChar = 0) { return Check(Token::Whitespace(aChar)); }
> 
> Having a default argument here seems redundant with |CheckChar|.  Why not
> make this function take no arguments and therefore always check its argument
> against mWhitespaces?  Callers can use CheckChar if they want.

I'll have to remember why exactly I made it this way.  It looks like a leftover from how the API evolved.

> @@ +153,5 @@
> > +
> > +  /**
> > +   * Returns the read cursor position back as it was before the last call of any parsing
> > +   * method of Lexan (Next, Check*, Skip*) so that the last operation can be repeated.
> > +   * Works only once.
> 
> This last sentence really means "works only once per read token", right? 
> i.e. something like:
> 
> l.CheckChar(...)
> Rollback()
> l.CheckChar(...)
> Rollback()
> 
> is allowed, but not:
> 
> l.CheckChar(...)
> Rollback()
> Rollback()
> 
> correct?  Please update the documentation to reflect that.  

I thought the sentence "Works only once" made it clear.  But I'll think of a better description.

> If you could
> make Rollback() assert in the incorrect usage case, that would be even
> better.

Yup (should be easy).

> 
> @@ +161,5 @@
> > +  /**
> > +   * Start the process of recording.  Based on aInclude value the begining of the recorded
> > +   * sub-string is at the current position (EXCLUDE_LAST) or at the position of the last
> > +   * parsed token (INCLUDE_LAST) so as the read position was before the last parsing
> > +   * operation call.
> 
> What does "so as the read position was before the last parsing operation
> call" mean?

Probably my imperfect English.  I'll rephrase; probably just s/so as/as if/?

> New tests going in to xpcom/ have typically been gtests, not straight
> cpptests.  You can either:
> 
> 1) Make this MOZ_RELEASE_ASSERT, so opt and debug builds get the same
> behavior; or
> 2) Convert everything to a gtest (doing a search-and-replace will get you
> 90% of the way there).

The test doesn't compile as-is anyway.  I'll try to convert it to a gtest.  Thanks for the tip.

> CHECK(l.CheckWord("HTTP"));
> 
> (would be particularly clear with gtests: ASSERT_TRUE(l.CheckWord("HTTP")) )

Hmm... having a local var is IMO a bit better; it helps with setting conditional breakpoints.  But this class is relatively simple, so your approach is probably clearer.
(Assignee)

Comment 10

3 years ago
Created attachment 8634051 [details] [diff] [review]
v2

See comment 9
Attachment #8614806 - Attachment is obsolete: true
Attachment #8634051 - Flags: review?(nfroyd)

Comment 11

3 years ago
Comment on attachment 8634051 [details] [diff] [review]
v2

Review of attachment 8634051 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with the changes below.

::: xpcom/string/Tokenizer.cpp
@@ +8,5 @@
> +
> +#include "nsUnicharUtils.h"
> +#include "mozilla/CheckedInt.h"
> +
> +static const char* sWhitespaces = " \t";

Please make this |static const char sWhitespaces[] = " \t";|, as that makes it take up slightly less space in binary form.

@@ +20,5 @@
> +  mRecord = mRollback = mCursor;
> +  aSource.EndReading(mEnd);
> +}
> +
> +bool Tokenizer::Next(Token &aToken)

Two style nits: this should be formatted as:

bool
Tokenizer::Next(Token& aToken)

and so on throughout the patch.  The header file also looks like it has some |Token &aToken| style nits; please fix those up as well.

::: xpcom/string/Tokenizer.h
@@ +15,5 @@
> + *
> + * It is limited only to ASCII input for now. UTF-8 or any other input
> + * encoding must yet be implemented.
> + */
> +class Tokenizer {

Please wrap all this in a |namespace mozilla|.

@@ +62,5 @@
> +
> +    TokenType Type() const { return mType; }
> +    char AsChar() const { return mChar; }
> +    nsCString AsString() const { return mWord; }
> +    int64_t AsInteger() const { return mInteger; }

These As* methods should include asserts that the token is of the correct type.
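A sketch of the requested assertions; TOKEN_CHAR, TOKEN_WORD, and TOKEN_INTEGER are assumed enumerator names (only TOKEN_UNKNOWN/TOKEN_ERROR/TOKEN_NUMBER are visible in the earlier quotes):

// Sketch: assert the token is of the matching type before returning the value.
char AsChar() const
{
  MOZ_ASSERT(mType == TOKEN_CHAR);
  return mChar;
}
nsCString AsString() const
{
  MOZ_ASSERT(mType == TOKEN_WORD);
  return mWord;
}
int64_t AsInteger() const
{
  MOZ_ASSERT(mType == TOKEN_INTEGER);
  return mInteger;
}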

@@ +70,5 @@
> +  Tokenizer(const nsACString& aSource);
> +
> +  /**
> +   * Some methods are collecting the input as it is being parsed to obtain
> +   * a substring between particular syntax bounderies defined by descent.

What does "defined by descent" refer to here?

::: xpcom/string/moz.build
@@ +35,5 @@
>      'nsXPIDLString.h',
>      'string-template-def-char.h',
>      'string-template-def-unichar.h',
>      'string-template-undef.h',
> +    'Tokenizer.h',

Apologies for not noticing this previously: can you put this someplace other than xpcom/string/?  xpcom/ds/ looks like it has some tokenizer-related stuff already, so putting it there would be fine.

Also, this should go in EXPORTS.mozilla.

::: xpcom/tests/gtest/TestTokenizer.cpp
@@ +14,5 @@
> +         c == '_' ||
> +         c == '-';
> +}
> +
> +TEST(Tokenizer, HTTPResponse)

Thanks for separating these out into individual tests!
Attachment #8634051 - Flags: review?(nfroyd) → review+
(Assignee)

Comment 12

3 years ago
Created attachment 8637850 [details] [diff] [review]
v2.1

Nathan, thanks for the reviews!
Attachment #8634051 - Attachment is obsolete: true
Attachment #8637850 - Flags: review+
(Assignee)

Updated

3 years ago
Keywords: checkin-needed

Comment 13

3 years ago
Please provide a Try link :)
Keywords: checkin-needed
(Assignee)

Comment 14

3 years ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=cd54f6b2b134

There is just one test added (a gtest under xpcom/), and this new class is not integrated anywhere in the code yet, but I don't know how to run only gtests on try.
Keywords: checkin-needed

Comment 15

3 years ago
(In reply to Honza Bambas (:mayhemer) from comment #14)
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=cd54f6b2b134
> 
> There is just one test added (gtest under xpcom), and this new class is not
> integrated in the code anywhere but I don't know how to run only gtests on
> try.

The try run shows that this change causes a bustage, so I can not check this in.
Bustage:
 /builds/slave/try-lx-d-000000000000000000000/build/src/xpcom/tests/gtest/TestTokenizer.cpp:14:27: error: suggest parentheses around '&&' within '||' [-Werror=parentheses]
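For context, -Werror=parentheses is complaining about a classifier expression that mixes && and || without explicit grouping. A sketch of the kind of fix; the function name and the full expression are assumed, since only the tail of it is quoted earlier in this bug:

// Sketch of the fix: parenthesize each && range so the || grouping is explicit.
static bool HttpHeaderCharacter(char const c)
{
  return (c >= 'a' && c <= 'z') ||
         (c >= 'A' && c <= 'Z') ||
         (c >= '0' && c <= '9') ||
         c == '_' ||
         c == '-';
}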
Flags: needinfo?(honzab.moz)
Keywords: checkin-needed
(Assignee)

Comment 16

3 years ago
Created attachment 8638627 [details] [diff] [review]
v2.2

Linux build fixed.  Try is closed at the moment; I will ask for checkin after the try run is done.
Attachment #8637850 - Attachment is obsolete: true
Flags: needinfo?(honzab.moz)
Attachment #8638627 - Flags: review+
(Assignee)

Comment 18

3 years ago
Created attachment 8638642 [details] [diff] [review]
v2.2 [correct reviewer]
Attachment #8638627 - Attachment is obsolete: true
Attachment #8638642 - Flags: review+
(Assignee)

Comment 19

3 years ago
Created attachment 8639251 [details] [diff] [review]
v2.3

- fixed implicit constructor

https://treeherder.mozilla.org/#/jobs?repo=try&revision=fadd4b6e20a2
Attachment #8638642 - Attachment is obsolete: true
Attachment #8639251 - Flags: review+
(Assignee)

Comment 20

3 years ago
Finally green!
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/397fd842ab54
Status: ASSIGNED → RESOLVED
Last Resolved: 3 years ago
status-firefox42: --- → fixed
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla42
(Assignee)

Updated

3 years ago
Blocks: 1188983

Updated

3 years ago
Depends on: 1190980
No longer depends on: 1190980