Closed Bug 75102 Opened 23 years ago Closed 23 years ago

New XPath parser and lexer

Categories

(Core :: XSLT, defect, P2)

defect

Tracking

()

VERIFIED FIXED
mozilla0.9.3

People

(Reporter: sicking, Assigned: peterv)

References

Details

(Whiteboard: [nsBranch+][PDT+])

Attachments

(19 files)

126.31 KB, patch
Details | Diff | Splinter Review
2.04 KB, text/plain
Details
1.19 KB, text/plain
Details
5.73 KB, text/plain
Details
3.08 KB, text/plain
Details
137.08 KB, patch
Details | Diff | Splinter Review
6.55 KB, text/plain
Details
3.91 KB, text/plain
Details
1.41 KB, text/plain
Details
133.45 KB, patch
Details | Diff | Splinter Review
10.99 KB, text/plain
Details
152.95 KB, patch
Details | Diff | Splinter Review
149.60 KB, patch
Details | Diff | Splinter Review
10.98 KB, text/plain
Details
1.41 KB, text/plain
Details
153.55 KB, patch
Details | Diff | Splinter Review
149.92 KB, patch
Details | Diff | Splinter Review
26.40 KB, application/zip
Details
153.52 KB, patch
Details | Diff | Splinter Review
This is a tracker bug for the new XPath parser (done by me) and the new XPath
lexer (Pike is working on it)

The parser changes are dependent on the lexer changes, either some small fixes
I've made or Pikes rewrite.

Pike will attach pathes
Status: NEW → ASSIGNED
so, finally I got the lexer revamped. I attached code by sicking@bigfoot.com for
the parser, and all changes necessary for removing Pattern.
The lexer rewrite is pretty brutal, and just in case you wonder, there is no
big difference in a diff -w, though I changed indention to 2.
I didn't implement error reporting to XSLT yet, that's a different bug, imho.
I moved most of the lexer to be iterator friendly, and the Token handling
could need some additional love. I'd like to have that separate though.
(Tokens into a list, instead of linked list, and Tokens just storing indexes
(iterators) to the lex'd string instead of newly allocated substrings)

Note: scc-string-fu for doing comparisons of substrings is supposed to appear
somewhere.

I changed ProcessorState to not add pure Named Templates to the match templs.

hrm. More? probably, but it won't pop up in my mind.

Axel
Keywords: review
Blocks: 70866
This kicks ass! Comments below (some of them are nitpicks bug this is my only 
chance to get back at nickpicks from you and peterv ;-) )

* Do we really need ExprLexer::lookAhead? IMHO it could be removed. Same thing
  for countAllTokens and countRemainingTokens

* Why not use TX_CHAR_RANGE macro in isDigit

* IMHO it dosn't look that nice to use greaterthen and lessthen ops on
  token->type in nextIsOperatorToken... hmm... it's a long list.. Maybe if you
  just did the comparison for all ops (ie not AXIS_IDENTIFIER) and added a
  comment it would be more selfexplaining

* Need to add |token->type == Token::COMMA| comparison in nextIsOperatorToken.
  Also, several small if's instead of one big

* Endcomment in nextIsOperatorToken is wrong

* defType is set to Token::CNAME both in the declaration and as the first thing
  in the while loop

* |iter++ < size| should be |++iter < size| on a lot of places in ::parse
  same for |iter++ == size| on line 326

* check isLetter on first char after colon rather then isNCNameChar

* You need to handle "Qnames" like mathml:* somehow

* A default is needed on the last switch which does some errorreporting.
  Currently the code goes into an neverending loop on an unknown char

* Coulnd't ExprLexer::TOKENS and ExprLexer::MULTIPLY be removed?

* Would you mind moving the |ch=pattern.charAt(iter)| out from the if for the
  PERIOD code?

* You don't really have to give all tokentypes a value (for example
  PARENT_NODE dosn't need to have the value ".."). The only tokentypes that need
  values are CNAME, AXIS_IDENTIFIER, VAR_REFERENCE, LITERAL and FUNCTION_NAME

* If an error occurs before any token is added bad things happen

* *I* would like to have whitespace between ) and { in |if (...){|

Also since, between the two of us, we're rewriting both lexer and parser. Why 
don't we rename CNAME to QNAME, sorry about not doing so myself...
ugh
s/Also, several small if's/Also, why not use several small if's/
Just noticed that my fix for bug 75307 has snuck into your patch as well as a 
fix for PathExpr::getDefaultPriority... Not that I don't want it in... :)
Attached file added license plate
Incorporated most of sickings comments.
Basically all but the one for nextIsOperator, instead I changed the order of
the tokens to have just a single range (plus NULL_TOKEN), and commented that
in ExprLexer.h.
I also added NS_ASSERTIONS(0,message) in Expr.cpp, as we shouldn't have code
paths calling these.

Now I'm ready for more comments.

Axel
Only picky stuff left to comment on...

* ExprLexer::MULTIPLY could still be removed ;)

* could prevToken be local in parse (or removed even?) and check last token in
  nextIsOperatorToken? If you did that I think we could remove NULL_TOKEN
  compleatly!

* You still give all tokentypes a value which IMHO is unneccesary. You could
  even remove the |Token(UNICODE_CHAR uniChar, short type)| constructor

* I'd like a stronger comment the tokentype list, something like "these have to
  be in an unbreaked list" (my english suck sometimes :( )
bullet 1: ugh. Will remove ExprLexer::MULTIPLY.
bullet 2: Hrm. Not convinced, I need to change the previous token type for ::
(axis identifier) and ( (making previous function or nodetype). I like the way
that is done now, as it handles whitespace transparently. So for removing the
previousToken, I had to get the previous Token at three places in the code.
Could I convince you in keeping it?
bullet 3: This is a step towards iterators, awkward as it may seem. I want to
have iterators as members of tokens in the end, and even trivial tokens should
know their position for error reporting. So I assigned a substring (in most 
cases, yeah), to have the needed iterators at the right places.
bullet 4: I will do a better comment.

Axel
Bullet 2: No biggie, it's ok with me as it is now...

Bullet 3: Seems like an excellent plan!
Just realized another bug. Variable names are not allowed to have the form 
moz:*, they must be valid QNames. Not a big problem really, evaluation will 
fail since no variable could ever be named moz:* ...
nice catch, I fixed that in my tree. Code starting at 304 goes now as

          else if (pattern.charAt(iter)=='*')
            if (defType == Token::VAR_REFERENCE) {
              // Error, VariableReference must not be $foo:*
              errorPos = iter;
              errorCode = ERROR_UNRESOLVED_VAR_REFERENCE;
              if (firstItem)
                firstItem->token->type=Token::ERROR;
              else
                addToken(new Token('\0',Token::ERROR));
              iter=size; // bail
            } else
              ++iter; /* eat wildcard */

Axel
adding some depenencies... these bugs should be fixed when this code is checked in
Blocks: 45831, 67004, 67307, 69478
Adding a current patch. Synched with the tip, removed 
PathExpr::getDefaultPriority (is now in 84677), did some cosmetic things to 
lexer for peterv.
Changed the way lexer handles the char tests, these are now in a separate file
ExprLexerChars.cpp, and NCNameChar.h and Letter.h are gone.
Expr.cpp (attachment 30666 [details]) and UnaryExpr.cpp (attachment 30268 [details]) are still
current.
ExprLexerChars.cpp coming up, too.

Axel
Attached file ExprLexerChars.cpp
I introduction to what the XPath lexer is supposed to do is posted on
http://www.mozilla.org/projects/xslt/coding/HowToLexer.html

Axel
reassigning to peterv for review
Assignee: sicking → peterv
Status: ASSIGNED → NEW
taking back...
Assignee: peterv → sicking
r=peterv. Let's try to get this into 0.9.2, it's a big improvement and fixes a
lot of bugs.
Peter changed the diff to -u -4 (who hates macCVS, raise your hands),
included the new files ExprLexerChars.cpp and UnaryExpr.cpp, but not Expr.cpp,
into the patch, and this patch has TX_EXE instead of
MOZ_XSL. The old patch should still work, for those not in favor of mac line
endings.
And finally I have realized that I used a NPL license plate for 
ExprLexerChars.cpp :-(, I sure wanted MPL. new one coming up.
Expr.cpp needs TX_EXE, or 77830.

Axel
adjusting keywords.
the xpath docs are on
http://www.mozilla.org/projects/xslt/coding/xpath-parser.html, I should move
the lexer stuff to http://www.mozilla.org/projects/xslt/coding/xpath-lexer.html
Axel
Severity: normal → major
Priority: -- → P2
Target Milestone: --- → mozilla0.9.2
Hmm, it gets a bit messy. Well, I'm attaching the patch super-reviewers should
look at. I'll try to get this in for 0.9.2.
Assignee: sicking → peterv
r=sicking on the stuff by Pike
Status: NEW → ASSIGNED
Whiteboard: Fix in hand. Have r=. Need sr=, a=.
I just realized that we miss a "what's this" in here.

Basically, the lexer and the parser don't really make any sense as patch. We
rigorously rewired stuff, it's really more a rewrite, following *some* of the
lines of the old code. So if you don't like something, cvs-blame will be almost
all ours anyway. Therefor the pack contains the files for the lexer and parser
as whole files, not as patches. It contains the extra files needed, to, plus a
stripped down patch.
Just to give you a number, the patch is 50% now. Most of it is cleanup of
Expr.h, which used to have the same comment for methods in all inheriting
objects over and over. 
We removed most of the Pattern notation in xpath, this is XSLT special, and is
not supposed to be in xpath.
Plus, BasicNodeExpr can now hold a name, which is for processing-instruction(),
(see xpath spec 2.3).

I think that's it.

Axel
Keywords: nsBranch
Target Milestone: mozilla0.9.2 → mozilla0.9.3
Addressing jst's comments: fixup the TX_CHAR_RANGE and friend macros to take
the character as argument. Don't assume to work on ch.
Fixup whitespace in conditions, or at least make that consistent. I made 
function definitions consistent per file as well, moving to new style where it
was inconsistent. diff -u -N coming up

Axel
sr=jst
Checked in successfully (so far). Keeping open for peterv to try to get this on 
the branch
Whiteboard: Fix in hand. Have r=. Need sr=, a=.
adding vtrunk keyword to indicate the bug has been fixed on the trunk.  Bug left
open for tracking checkin to branch (nsbranch) when appropriate.

Once bug has been fixed on the branch also, pls remove vtrunk keyword.
Keywords: vtrunk
nsBranch ok, limited to XSLT only.
This is the first introduction of our XSLT implementation into the market. 
These  changes do not affect the rest of the Mozilla codebase, only the XSLT
part.  They make our XSLT implementation much more standards conformant by
fixing close to 200  hundred bugs on the Apache XSLT test suite.

Marking nsBranch+ for PDT consideration.
Whiteboard: [nsBranch+]
Steve granted this PDT+ status in today's PDT meeting.
Whiteboard: [nsBranch+] → [nsBranch+][nsPDT+]
Whiteboard: [nsBranch+][nsPDT+] → [nsBranch+][PDT+]
Went in on the branch (with a bit of bumps and bruises).
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Keywords: nsBranch, review, vtrunk
sicking@bigfoot.com (Jonas Sicking) or kvisco@ziplink.net - can you help verify
this bug on the branch 092 builds?  If not, can your direct me to a test suite
(URL) which I can run to verify?  Thanks.
Hi, tested this on the branch, marking verified.

Axel
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: