Closed Bug 673039 Opened 13 years ago Closed 13 years ago

JSON parse failure with non ascii inputs

Categories

(Tamarin Graveyard :: Library, defect, P2)

defect

Tracking

(Not tracked)

RESOLVED FIXED
Q3 11 - Serrano

People

(Reporter: pnkfelix, Assigned: pnkfelix)

References

Details

(Whiteboard: WE:2924175, fixed-in-tr, fixed-in-serrano)

Attachments

(3 files, 2 obsolete files)

I didn't believe it until I saw it.

Somehow I managed to miss testing simple non-ascii inputs.

> (function(s) { print(s); print(JSON.parse(s)); })('"\u017F"')
"ſ"
SyntaxError: Error #1132: Invalid JSON parse input.
	at JSON$/parseCore()
	at JSON$/parse()
	at Function/anonymous()[:1]
	at global$init()[:1]
Whiteboard: WE:2924175
The heart of this bug is that I was sloppy in how I handled iterating over the two views of the string.  The original String object is held in the m_text member of JSONParser, and a StUTF8String is in the m_textUTF8 member, but the same cursor value, m_i, is used to index both into m_text and into m_textUTF8.  That is bogus.

Not sure yet how easy it will be to fix.
(In reply to comment #1)
> Not sure yet how easy it will be to fix.

Actually, I might have an easy fix.  The only reason I was keeping the m_text around was to use it with various m_text->substring calls, which I was assuming would be super-efficient and avoid copying based on my cursor knowledge of the string class.

But I don't *have* to do substring that way (and in fact, if the only cursor knowledge I'm keeping is the cursor into the m_textUTF8, then I *can't* do the substring that way).  I could instead use the m_textUTF8 to construct a fresh string that I then use as the string->append(..) call argument.

I just don't know whether it will have an unacceptable performance impact.  I was trying to avoid unnecessary intermediate allocations.

I see the comment above StringObject::appendLatin1 clearly states that it is not for UTF8 strings.  But why don't we *have* a StringObject::appendUF8 that would let me shove the data in (that's assuming there is available space in the string's buffer for me to write into, which should be the case for JSON-constructed strings, again based on my cursory knowledge of the String API.)
Attached patch A1: dumb (inefficient) fix (obsolete) — Splinter Review
Here's the naive fix I mentioned in my previous comment.

The code is ugly and I wish there were a more direct way to say these things.  I'm not sure if I'll get a chance tomorrow to ping Steven about this...
Assignee: nobody → fklockii
Target Milestone: --- → Q3 11 - Serrano
Hmm, I'm not too thrilled with how my dumb fix performs:

Before my dumb fix:
Metric: iterations/second 
Dir: asmicro/
  json-parse-kraken-0                                  37.8     37.7  0.0% 

After my dumb fix:
  json-parse-kraken-0                                  29.4     29.3  0.0% 

That's quite a drop.  Probably due to now inefficient string-allocations and subsequent GC/RC/mem mgmt effort.
Attachment #547319 - Attachment description: dumb (inefficient) fix → A1: dumb (inefficient) fix
(I just thought of a hack that should buy back nearly all the performance for input strings with just ASCII characters, assuming I can easily add a conditional test as we walk through the string and set a flag we we encounter any non-ASCII input.  Don't think I have time to implement it today, but it would be the easy way out if I don't think of something better before ZBB.)
Depends on: 673170
Status: NEW → ASSIGNED
Flags: flashplayer-qrb+
Flags: flashplayer-injection-
Priority: -- → P2
Slight revision to fix another comment in JSONParser class definition.  But this version remains dumb.
Attachment #547319 - Attachment is obsolete: true
This patch (patchB) is meant to be applied atop patchA.
Expanded set of tests to include non-ascii inputs to illustrate the patches.

All three use the same basic data for JSON parse, taken from the Kraken benchmark suite.

The first uses the kraken input unchanged.  The second replaces one of the characters near the beginning of the text with a non-ascii character.  The third replaces a character near the end of the test with a non-ascii character.  (So the second and third tests are expected to illustrate this bug by throwing an exception, as well as illustrate a detail of how performance will vary with the hack from patchB.)

runtests.py results, five iterations, Before applying A or B:

test                                                 best     avg     95%_conf

Metric: iterations/second 
Dir: asmicro/
  json-parse-kraken-0                                  31.7     30.8  4.7% 
asmicro/json-parse-kraken-1.as                     Avm1 Error: Test Exited with exit code: 1
asmicro/json-parse-kraken-2.as                     Avm1 Error: Test Exited with exit code: 1


runtests.py results, five iterations, After applying A but Before applying B:

test                                                 best     avg     95%_conf

Metric: iterations/second 
Dir: asmicro/
  json-parse-kraken-0                                  24.9     24.8  0.0% 
  json-parse-kraken-1                                  23.5     22.8  4.4% 
  json-parse-kraken-2                                  23.5     23.3  2.1% 


runtests.py results, five iterations, After applying both A and B:

test                                                 best     avg     95%_conf

Metric: iterations/second 
Dir: asmicro/
  json-parse-kraken-0                                  31.0     30.3  4.2% 
  json-parse-kraken-1                                  23.6     23.2  4.3% 
  json-parse-kraken-2                                  29.3     28.9  1.7% 



(I suspect I can do better if necessary, by properly tracking the index of both m_text and m_textUTF8 simultaneously.  But that is also a riskier fix...)
Attachment #547438 - Attachment is obsolete: true
Attachment #549866 - Flags: review?(stejohns)
Comment on attachment 549867 [details] [diff] [review]
patchB v1: smarten slightly to bring back performance (but for ascii only)

(I'm almost certain to fold this into patchA before landing, assuming both pass review.)
Attachment #549867 - Flags: review?(stejohns)
Comment on attachment 549866 [details] [diff] [review]
patchA v2: dumb (inefficient fix).

Review of attachment 549866 [details] [diff] [review]:
-----------------------------------------------------------------

No obvious flaws.

::: core/JSONClass.cpp
@@ +839,5 @@
>                  m_i++;
>              adv_digits();
>          }
>  
> +        m_value = m_toplevel->core()->newStringUTF8(&m_textUTF8.c_str()[start], m_i - start);

FYI: substring() is [potentially] substantially more efficient than newStringUTF8; the former can usually just reference the existing string. Of course, that doesn't matter if the answer is wrong, so...
Attachment #549866 - Flags: review?(stejohns) → review+
Comment on attachment 549867 [details] [diff] [review]
patchB v1: smarten slightly to bring back performance (but for ascii only)

Review of attachment 549867 [details] [diff] [review]:
-----------------------------------------------------------------

::: core/JSONClass.cpp
@@ +841,5 @@
>                  m_i++;
>              adv_digits();
>          }
>  
> +        if (m_indexValidForText)

Nice, this is likely to be much more performance for the common Western-encoding usecases
Attachment #549867 - Flags: review?(stejohns) → review+
Attachment #549873 - Flags: review?(brbaker)
changeset: 6494:3e750904a663
user:      Felix S Klock II <fklockii@adobe.com>
summary:   Bug 673039: construct tokens via substring method only if UTF-8 byte count == text codepoint count (r=stejohns).

http://hg.mozilla.org/tamarin-redux/rev/3e750904a663
Also pushed to TR serrano.

changeset:   6341:0b988039b665
user:        Felix S Klock II <fklockii>
date:        Wed Jul 20 21:19:00 2011 -0400
summary:     Bug 673039: construct tokens via substring method only if UTF-8 byte count == text codepoint count (r=stejohns).


mea culpa for the domain-less email address above.  I forgot to put my hg [hooks] section into my tr-serrano repo.
Depends on: 675769
(follow up work has been filed as the two blocking bugs listed in this bug's "Depends on" field.)
Whiteboard: WE:2924175 → WE:2924175, fixed-in-tr, fixed-in-serrano
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Comment on attachment 549873 [details] [diff] [review]
testsM v2: three tests of JSON parse via kraken

Review of attachment 549873 [details] [diff] [review]:
-----------------------------------------------------------------

Just change:
Copyright (C) 2010 -> Copyright (C) 2011
Attachment #549873 - Flags: review?(brbaker) → review+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: