Closed Bug 588648 Opened 14 years ago Closed 13 years ago

Don't copy chars when scanning

Categories

(Core :: JavaScript Engine, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: n.nethercote, Assigned: n.nethercote)

References

Details

(Whiteboard: fixed-in-tracemonkey)

Attachments

(1 file, 5 obsolete files)

Here's how scanning works:

- The common mode of operation is that we already have a jschar buffer
  entirely in memory that we're scanning from, called 'userbuf'.

- Each char read from 'userbuf' is copied twice in the most common case:

  (1) In getCharFillLinebuf(), it's copied from 'userbuf' to 'linebuf', and
      if it's an EOL char, it's normalized to '\n' in the process.

  (2) In getTokenInternal(), it's probably appended to a JSCharBuffer, which
      is a js::Vector.  This is a bit slow because for every char copied the
      jsvector has to check if it has enough space.

- There's a less common mode of operation where we are reading from file
  instead of directly from a buffer.  In that case, we'll read from file
  into 'userbuf' in fillUserbuf();  the file gets read in chunks of size
  1024.

- The copying of chars from 'userbuf' to 'linebuf' doesn't seem necessary,
  though the EOL normalization does simplify things a bit.

- The copying of chars into the JSCharBuffer is necessary only in the second
  mode of operation -- because 'userbuf' can be chunked when reading from
  file, it's possible a single token is split across two (or more, for long
  strings) chunks.  If we only ever scanned directly from 'userbuf' we
  wouldn't need to copy to a JSCharBuffer.  

  AFAICT, we read from file in the following places:
  - Some API functions (js_CompileFile, JS_CompileFileHandle,
    JS_CompileFileHandleForPrincipals)
  - These are used in the JS shell and xpcshell, which aren't important. 
  - JS_CompileFileHandleForPrincipals is used in
    xpconnect/loader/mozJSComponentLoader.cpp only as a fallback if
    HAVE_PR_MEMMAP isn't available, which seems like an unusual case.

  I asked Brendan, he thinks in these case that reading the entire file into
  memory before parsing is reasonable.

SS wanted win: 3ms?  Hard to estimate, that's a guess based on some data in
bug 578216.  string-tagcloud looks to benefit the most;  string-unpack-code
and regexp-dna also some.
Yeah, this is definitely a sweet win (and reduction in complexity). I didn't realize it was kosher to read whole files into memory!
I've hit a snag in avoiding the userbuf-to-JSCharBuffer copy.  The problem is multi-char characters.  Eg. if we have '\' followed by 'n' in a string we need to convert that into a '\n' before we atomize it.  There's a similar issue with \uXXXX sequences.

Maybe the atomizer could do the conversions?  Hmm.
(In reply to comment #2)
> Maybe the atomizer could do the conversions?  Hmm.

Sounds right! Thoughts on this:

- The atomizer has to scan the contents of the string to perform a hash calculation.
- The |JSString *| that winds up representing the atom needs to own its buffer, so we'll at least need to copy from the userbuf to heap space once.
- The length of the string in userbuf isn't known until you scan for the endquot.
- Are escapes uncommon enough in interned strings that we could slow path them?

If the atomizer used a heaped vector to interpret escapes into, hashing the interpreted chars along the way, and made a |JSString *| with that as its backing store, you only have to do one (copying) pass on the string data in userbuf.
We don't want to lex while interning strings passed from C++, so any such path would be lexer-specific, not jsatom-generic.

Also, \uXXXX can occur in other places than string literals. See the ES5 lexical grammar.

/be
Work-in-progress patch.  Removes the scan-from-file option, always reads in the whole file before scanning.  Scanning from 'stdin' isn't yet handled.  Removes linebuf and getbuf, scans directly from userbuf.  Doesn't do anything about userbuf-to-JSCharBuffer copying, I'm a bit leery about futzing with js_AtomizeString().
In this patch:

- Input is now always read fully into 'userbuf';  'linebuf' and 'ungetbuf'
  were removed as a result.  This simplifies scanning quite a bit.

- Improves scanning of strings.  Previously if a string included an EOL,
  we'd detect it in getChar() and then detect it again in
  getTokenInternal().  Now we have getCharIgnoreEOL() which avoids the first
  of these two checks.

- Adds some big comments to help break up getTokenInternal() -- those
  multi-page if-statements have been really bugging me.  I'm happy to
  fiddle with the exact formatting but I'd like to keep something like that
  in there.

- Avoids creating a Flagger for a lot of tokens by introducing a flagless
  version of getToken().

- Things that don't yet work or I'm unsure about are marked with 'njn'
  comments.

Instruction counts on 64-bit linux (32-bit linux results were not quite as
good):

---------------------------------------------------------------
| millions of instructions executed                           |
| total                        | on-trace (may overestimate)  |
---------------------------------------------------------------
|    98.702    98.643 (1.001x) |    21.727    21.727 (------) | 3d-cube
|    80.989    80.957 (------) |    19.256    19.256 (------) | 3d-morph
|   135.688   135.315 (1.003x) |    23.025    23.025 (------) | 3d-raytrace
|    66.276    66.238 (1.001x) |    17.993    17.993 (------) | access-binary-
|   155.527   155.493 (------) |    94.894    94.894 (------) | access-fannkuc
|    33.923    33.829 (1.003x) |    18.627    18.627 (------) | access-nbody
|    49.819    49.788 (1.001x) |    27.665    27.665 (------) | access-nsieve
|     8.227     8.216 (1.001x) |     3.486     3.486 (------) | bitops-3bit-bi
|    40.012    40.005 (------) |    35.101    35.101 (------) | bitops-bits-in
|    15.880    15.868 (1.001x) |    11.404    11.404 (------) | bitops-bitwise
|    62.546    62.529 (------) |    38.601    38.601 (------) | bitops-nsieve-
|    38.958    38.955 (------) |    31.397    31.397 (------) | controlflow-re
|    39.471    39.237 (1.006x) |     5.570     5.570 (------) | crypto-md5
|    23.112    22.975 (1.006x) |     7.414     7.414 (------) | crypto-sha1
|    99.551    99.326 (1.002x) |    12.201    12.201 (------) | date-format-to
|    75.279    75.010 (1.004x) |    10.156    10.156 (------) | date-format-xp
|    58.876    58.837 (1.001x) |    29.129    29.129 (------) | math-cordic
|    24.813    24.793 (1.001x) |     5.894     5.894 (------) | math-partial-s
|    20.761    20.735 (1.001x) |    10.955    10.955 (------) | math-spectral-
|    49.951    48.217 (1.036x) |    34.501    34.501 (------) | regexp-dna
|    31.759    31.673 (1.003x) |    10.515    10.515 (------) | string-base64
|   105.522   104.888 (1.006x) |    23.351    23.351 (------) | string-fasta
|   159.016   153.664 (1.035x) |     9.832     9.832 (------) | string-tagclou
|   164.104   162.481 (1.010x) |     9.326     9.326 (------) | string-unpack-
|    37.458    37.405 (1.001x) |     8.211     8.211 (------) | string-validat
-------
|  1676.234  1665.085 (1.007x) |   520.243   520.243 (------) | all


Time improvements may be 2 or 3 ms, but it's hard to tell through the usual
noise.  It'll also help slightly with the 1/2 million or so chars of JS that
get read at browser startup.
Attachment #467653 - Attachment is obsolete: true
Attachment #467700 - Flags: feedback?(brendan)
I thought of a way to avoid the (userbuf.ptr < userbuf.limit) test in getChar().  It would involve having a jschar that represents end-of-file;  this would work because all the callers of getChar() check for end-of-file.  The normal value of EOF is inappropriate because it's a 32-bit -1, which is outside the range of a jschar.

So the question is:  is there a jschar value that is unused and so suitable for this use?  I've been reading about UCS-2 and I think any UTF-16 surrogate value such as 0xD800 would be suitable.  From http://en.wikipedia.org/wiki/UTF-16/UCS-2:

"Unicode and ISO/IEC 10646 do not, and will never, assign characters to any of the code points in the U+D800–U+DFFF range, so an individual code unit from a surrogate pair does not ever represent a character."

Does this sound plausible?  I fully admit to being a Unicode neophyte.  If it is valid, getChar()'s return type could change from int32 to jschar.
One tricky thing with this idea:  TSF_EOF could no longer be set in getChar().  Looks like it could instead be set in the error case at the end of getTokenInternal() when the token kind is set to TOK_ERROR, if c==EOF.
Re: comment 7 -- JS does not combine surrogates, and it's easy on the web to put them in script content. Better to use int for the return value, as stdio getc did ages ago.

Re: comment 8 -- yes, folding error and eof cases together is plausible. See the existing tt <= TOK_EOF tests.

Nick, want feedback still? 2-3ms is not nothing on SS. Maybe rebase and go for a reviewable patch?

/be
(In reply to comment #9)
> 
> Nick, want feedback still? 2-3ms is not nothing on SS. Maybe rebase and go for
> a reviewable patch?

My impression from what Sayre said recently is that parsing time is not counted in browser runs of SunSpider and V8, which would make this lower priority.  I certainly won't get to this patch until platform work week starts, at the earliest.
Rebased patch.  

- Reading from stdin now works.  Indeed, previously, on Linux at least, I
  had to hit Ctrl-d three times to exit from |load('-')| in the shell.  Now
  it works immediately.

- The patch fixes bug 352970, too.

- Adds ungetCharIgnoreEOL() which was needed to fix a jstest failure.
Attachment #467700 - Attachment is obsolete: true
Attachment #490794 - Flags: review?(brendan)
Attachment #467700 - Flags: feedback?(brendan)
Attachment #490794 - Attachment description: patch (against TM 57130:bc000c1509ac) → patch v3 (against TM 57130:bc000c1509ac)
Reinstate the Flagger-less version of getToken(), which was lost somehow.
Attachment #490794 - Attachment is obsolete: true
Attachment #490797 - Flags: review?(brendan)
Attachment #490794 - Flags: review?(brendan)
I'm at a TC39 meeting, so a bit slow. Would value cdleary's review too if time allows. Nick, one question: is <sys/stat.h> cross-platform?

/be
This fixes a couple of jsapi-test failures.  (I didn't even realize we had jsapi tests as I don't use the 'make check' target.  Woo!)  Problem was that I was returning a NULL JSScript pointer for an empty file.  Now it reads in the file and does nothing, which matches the old behaviour.

I also ran a try server run, it built fine on all platforms, so sys/stat.h looks to be portable.
Attachment #490797 - Attachment is obsolete: true
Attachment #491744 - Flags: review?(brendan)
Attachment #491744 - Flags: feedback?(cdleary)
Attachment #490797 - Flags: review?(brendan)
This patch removes some debugging guff that was accidentally included in v5.
Attachment #491744 - Attachment is obsolete: true
Attachment #491745 - Flags: review?(brendan)
Attachment #491745 - Flags: feedback?(cdleary)
Attachment #491744 - Flags: review?(brendan)
Attachment #491744 - Flags: feedback?(cdleary)
Comment on attachment 491745 [details] [diff] [review]
patch, v6 (against TM 57572:eea1f443ea15)

>+static JSScript *
>+compileFileHelper(JSContext *cx, JSObject *obj, JSPrincipals *principals, uint32 tcflags,

Static helpers have StudlyCaps not camelCaps names.

>+                  const char* filename, FILE *fp)
>+{
>+    struct stat st;
>+    int ok = fstat(fileno(fp), &st);
>+    if (ok != 0)
>+        return NULL;

In the old days this would not do for Windows. Progress!

>+
>+    jschar *buf = NULL;
>+    size_t len = st.st_size;
>+    JSScript *script;

script is odd in that it is the only decl in this group not initialized. It's also the return expression. For these reasons make it come last in the group of four.

>+    size_t i = 0;

Blank line here please.

>+    /* Read in the whole file, then compile it. */
>+    if (fp == stdin) {
>+        JS_ASSERT(len == 0);
>+        len = 8;  /* start with a small buffer, expand as necessary */
>+        int c;
>+        bool hitEOF = false;
>+        while (!hitEOF) {
>+            len *= 2;
>+            jschar* tmpbuf = (jschar *) cx->realloc(buf, len * sizeof(jschar));
>+            if (!tmpbuf) {
>+                cx->free(buf);
>+                return NULL;
>+            }
>+            buf = tmpbuf;
>+
>+            while (true) {
>+                if (i == len)
>+                    break;

This loop should start

              while (i < len)

(both to avoid wasting columns and lines and reader brainwaves on true and if/break, and to use < instead of !=).

>+        int c;
>+        while ((c = fast_getc(fp)) != EOF)
>+            buf[i++] = (jschar) (unsigned char) c;
>+    }
>+    JS_ASSERT(i <= len);
>+    len = i;
>+    script = Compiler::compileScript(cx, obj, NULL, principals, tcflags, buf, len, filename, 1);
>+    cx->free(buf);
>+
>+    return script;

Move the blank line up to come right after the right brace.

>+    int32 c;
>+    if (JS_LIKELY(userbuf.ptr < userbuf.limit)) {
>+        c = *userbuf.ptr++;
>         /*

Blank line before comment taking one or more lines, unless preceding line ends in left brace.

>+        if (JS_UNLIKELY(maybeEOL[c & 0xff])) {

These JS_UNLIKELY uses helped, per valgrind?

> /*
>- * This gets the next char, normalizing all EOL sequences to '\n' as it goes.
>+ * This gets the next char.  It does nothing special with EOL sequences, not
>+ * even updating the line counters.

http://twitter.com/#!/jonniker/status/10534301980430336

Plus, prevailing style. Last such whine from me on this topic.

+    index = (tp->begin.lineno == tp->end.lineno) 
+          ? tp->begin.index         /* the column number of the start of the bad token */
+          : 0;

Uber-nit: ? and : underhang the ( not the = or whatever is two chars to the left of the (.

> #if JS_HAS_XML_SUPPORT
>+    /*
>+     *-----------------------------------------------------------------------

These hyphen-rules don't seem beneficial. Prevailing style does use major comment style even with one-line comment, for emphasis. That ought to be enough.

..                if (c == qc) {
..                    break;
..                } else if (c == '\\') {

Pre-existing else after break: unbrace and lose the else.


>+                } else if (c == '\n' || c == '\r' || c == LINE_SEPARATOR || c == PARA_SEPARATOR ||
>+                           c == EOF)
>+                {
>+                    ungetCharIgnoreEOL(c);
>                     ReportCompileErrorNumber(cx, this, NULL, JSREPORT_ERROR,
>                                              JSMSG_UNTERMINATED_STRING);
>                     goto error;
>                 }

Does it make sense to move this case up to come before the if (c == '\\') clause, since most string chars won't be special? Would help to lose the else (which would now come after goto) before if (c == '\\') and unindent the backslash interpretation code by one level. PGO will handle any further reordering.

Looks great otherwise. r=me with nits picked. Thanks for doing this,

/be
Attachment #491745 - Flags: review?(brendan) → review+
(In reply to comment #16)
> 
> > #if JS_HAS_XML_SUPPORT
> >+    /*
> >+     *-----------------------------------------------------------------------
> 
> These hyphen-rules don't seem beneficial. Prevailing style does use major
> comment style even with one-line comment, for emphasis. That ought to be
> enough.

I've done as you asked, but reluctantly.  I humbly posit that a 1,014 line function -- that's not a typo, it's really 1,014 lines -- needs more formatting/sectioning than a single style of comment can provide.


> >+                } else if (c == '\n' || c == '\r' || c == LINE_SEPARATOR || c == PARA_SEPARATOR ||
> >+                           c == EOF)
> >+                {
> >+                    ungetCharIgnoreEOL(c);
> >                     ReportCompileErrorNumber(cx, this, NULL, JSREPORT_ERROR,
> >                                              JSMSG_UNTERMINATED_STRING);
> >                     goto error;
> >                 }
> 
> Does it make sense to move this case up to come before the if (c == '\\')
> clause, since most string chars won't be special? Would help to lose the else
> (which would now come after goto) before if (c == '\\') and unindent the
> backslash interpretation code by one level. PGO will handle any further
> reordering.

Backslashes are common in long strings, EOLs are very uncommon, so the ordering helps speed.  Or are you saying that PGO would realize that and change the ordering of the branches?  I'm not comfortable relying on that... do we even have PGO on all platforms?
(In reply to comment #17)
> I humbly posit that a 1,014 line
> function -- that's not a typo, it's really 1,014 lines -- needs more
> formatting/sectioning than a single style of comment can provide.

That delimiter should be easily understandable functions with JS_ALWAYS_INLINE. That'll be a nice cleanup patch somewhere down the road.
(Btw, sorry I didn't get around to giving feedback!)
(In reply to comment #19)
> 
> That delimiter should be easily understandable functions with JS_ALWAYS_INLINE.
> That'll be a nice cleanup patch somewhere down the road.

I considered that but getTokenInternal() is riddled with gotos and I couldn't see how to do it in a way that didn't hurt performance.
http://hg.mozilla.org/mozilla-central/rev/2b9d805d77a1
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Blocks: 618572
No longer blocks: 618572
Depends on: 618572
Depends on: 653890
No longer depends on: 653890
Depends on: 653890
You need to log in before you can comment on or make changes to this bug.