Closed Bug 673039 Opened 13 years ago Closed 13 years ago

JSON parse failure with non ascii inputs

Tracking

(Not tracked)

Status:

RESOLVED FIXED

Milestone:

Q3 11 - Serrano

People

(Reporter: pnkfelix, Assigned: pnkfelix)

References

Details

(Whiteboard: WE:2924175, fixed-in-tr, fixed-in-serrano)

Attachments

(3 files, 2 obsolete files)

A1: dumb (inefficient) fix 13 years ago Felix S. Klock II [:pnkfelix, :fklock] 3.79 KB, patch		Details \| Diff \| Splinter Review
M1 : asmicro parse benchmark adapted from kraken 13 years ago Felix S. Klock II [:pnkfelix, :fklock] 11.11 KB, patch		Details \| Diff \| Splinter Review
patchA v2: dumb (inefficient fix). 13 years ago Felix S. Klock II [:pnkfelix, :fklock] 4.37 KB, patch	stejohns : review+	Details \| Diff \| Splinter Review
patchB v1: smarten slightly to bring back performance (but for ascii only) 13 years ago Felix S. Klock II [:pnkfelix, :fklock] 4.41 KB, patch	stejohns : review+	Details \| Diff \| Splinter Review
testsM v2: three tests of JSON parse via kraken 13 years ago Felix S. Klock II [:pnkfelix, :fklock] 16.65 KB, patch	brbaker : review+	Details \| Diff \| Splinter Review

Felix S. Klock II [:pnkfelix, :fklock]

Assignee

Description

•

13 years ago

I didn't believe it until I saw it.

Somehow I managed to miss testing simple non-ascii inputs.

> (function(s) { print(s); print(JSON.parse(s)); })('"\u017F"')
"ſ"
SyntaxError: Error #1132: Invalid JSON parse input.
	at JSON$/parseCore()
	at JSON$/parse()
	at Function/anonymous()[:1]
	at global$init()[:1]

Felix S. Klock II [:pnkfelix, :fklock]

Assignee

Updated

•

13 years ago

Whiteboard: WE:2924175

Felix S. Klock II [:pnkfelix, :fklock]

Assignee

Comment 1

•

13 years ago

The heart of this bug is that I was sloppy in how I handled iterating over the two views of the string.  The original String object is held in the m_text member of JSONParser, and a StUTF8String is in the m_textUTF8 member, but the same cursor value, m_i, is used to index both into m_text and into m_textUTF8.  That is bogus.

Not sure yet how easy it will be to fix.

Felix S. Klock II [:pnkfelix, :fklock]

Assignee

Comment 2

•

13 years ago

(In reply to comment #1)
> Not sure yet how easy it will be to fix.

Actually, I might have an easy fix.  The only reason I was keeping the m_text around was to use it with various m_text->substring calls, which I was assuming would be super-efficient and avoid copying based on my cursor knowledge of the string class.

But I don't *have* to do substring that way (and in fact, if the only cursor knowledge I'm keeping is the cursor into the m_textUTF8, then I *can't* do the substring that way).  I could instead use the m_textUTF8 to construct a fresh string that I then use as the string->append(..) call argument.

I just don't know whether it will have an unacceptable performance impact.  I was trying to avoid unnecessary intermediate allocations.

I see the comment above StringObject::appendLatin1 clearly states that it is not for UTF8 strings.  But why don't we *have* a StringObject::appendUF8 that would let me shove the data in (that's assuming there is available space in the string's buffer for me to write into, which should be the case for JSON-constructed strings, again based on my cursory knowledge of the String API.)

Felix S. Klock II [:pnkfelix, :fklock]

Assignee

Comment 3

•

13 years ago

Attached patch A1: dumb (inefficient) fix (obsolete) — Details — Splinter Review

Here's the naive fix I mentioned in my previous comment.

The code is ugly and I wish there were a more direct way to say these things.  I'm not sure if I'll get a chance tomorrow to ping Steven about this...

Felix S. Klock II [:pnkfelix, :fklock]

Assignee

Updated

•

13 years ago

Assignee: nobody → fklockii

Felix S. Klock II [:pnkfelix, :fklock]

Assignee

Updated

•

13 years ago

Target Milestone: --- → Q3 11 - Serrano

Felix S. Klock II [:pnkfelix, :fklock]

Assignee

Comment 4

•

13 years ago

Attached patch M1 : asmicro parse benchmark adapted from kraken (obsolete) — Details — Splinter Review

Hmm, I'm not too thrilled with how my dumb fix performs:

Before my dumb fix:
Metric: iterations/second 
Dir: asmicro/
  json-parse-kraken-0                                  37.8     37.7  0.0% 

After my dumb fix:
  json-parse-kraken-0                                  29.4     29.3  0.0% 

That's quite a drop.  Probably due to now inefficient string-allocations and subsequent GC/RC/mem mgmt effort.

Felix S. Klock II [:pnkfelix, :fklock]

Assignee

Updated

•

13 years ago

Attachment #547319 - Attachment description: dumb (inefficient) fix → A1: dumb (inefficient) fix

Felix S. Klock II [:pnkfelix, :fklock]

Assignee

Comment 5

•

13 years ago

(I just thought of a hack that should buy back nearly all the performance for input strings with just ASCII characters, assuming I can easily add a conditional test as we walk through the string and set a flag we we encounter any non-ASCII input.  Don't think I have time to implement it today, but it would be the easy way out if I don't think of something better before ZBB.)

Felix S. Klock II [:pnkfelix, :fklock]

Assignee

Updated

•

13 years ago

Depends on: 673170

Dan Smith

Updated

•

13 years ago

Status: NEW → ASSIGNED

Flags: flashplayer-qrb+

Flags: flashplayer-injection-

Priority: -- → P2

Felix S. Klock II [:pnkfelix, :fklock]

Assignee

Comment 6

•

13 years ago

Attached patch patchA v2: dumb (inefficient fix). — Details — Splinter Review

Slight revision to fix another comment in JSONParser class definition.  But this version remains dumb.

Attachment #547319 - Attachment is obsolete: true

Felix S. Klock II [:pnkfelix, :fklock]

Assignee

Comment 7

•

13 years ago

Attached patch patchB v1: smarten slightly to bring back performance (but for ascii only) — Details — Splinter Review

This patch (patchB) is meant to be applied atop patchA.

Felix S. Klock II [:pnkfelix, :fklock]

Assignee

Comment 8

•

13 years ago

Attached patch testsM v2: three tests of JSON parse via kraken — Details — Splinter Review

Expanded set of tests to include non-ascii inputs to illustrate the patches.

All three use the same basic data for JSON parse, taken from the Kraken benchmark suite.

The first uses the kraken input unchanged.  The second replaces one of the characters near the beginning of the text with a non-ascii character.  The third replaces a character near the end of the test with a non-ascii character.  (So the second and third tests are expected to illustrate this bug by throwing an exception, as well as illustrate a detail of how performance will vary with the hack from patchB.)

runtests.py results, five iterations, Before applying A or B:

test                                                 best     avg     95%_conf

Metric: iterations/second 
Dir: asmicro/
  json-parse-kraken-0                                  31.7     30.8  4.7% 
asmicro/json-parse-kraken-1.as                     Avm1 Error: Test Exited with exit code: 1
asmicro/json-parse-kraken-2.as                     Avm1 Error: Test Exited with exit code: 1


runtests.py results, five iterations, After applying A but Before applying B:

test                                                 best     avg     95%_conf

Metric: iterations/second 
Dir: asmicro/
  json-parse-kraken-0                                  24.9     24.8  0.0% 
  json-parse-kraken-1                                  23.5     22.8  4.4% 
  json-parse-kraken-2                                  23.5     23.3  2.1% 


runtests.py results, five iterations, After applying both A and B:

test                                                 best     avg     95%_conf

Metric: iterations/second 
Dir: asmicro/
  json-parse-kraken-0                                  31.0     30.3  4.2% 
  json-parse-kraken-1                                  23.6     23.2  4.3% 
  json-parse-kraken-2                                  29.3     28.9  1.7% 



(I suspect I can do better if necessary, by properly tracking the index of both m_text and m_textUTF8 simultaneously.  But that is also a riskier fix...)

Attachment #547438 - Attachment is obsolete: true

Felix S. Klock II [:pnkfelix, :fklock]

Assignee

Updated

•

13 years ago

Attachment #549866 - Flags: review?(stejohns)

Felix S. Klock II [:pnkfelix, :fklock]

Assignee

Comment 9

•

13 years ago

Comment on attachment 549867 [details] [diff] [review]
patchB v1: smarten slightly to bring back performance (but for ascii only)

(I'm almost certain to fold this into patchA before landing, assuming both pass review.)

Attachment #549867 - Flags: review?(stejohns)

Steven Johnson

Comment 10

•

13 years ago

Comment on attachment 549866 [details] [diff] [review]
patchA v2: dumb (inefficient fix).

Review of attachment 549866 [details] [diff] [review]:
-----------------------------------------------------------------

No obvious flaws.

::: core/JSONClass.cpp
@@ +839,5 @@
>                  m_i++;
>              adv_digits();
>          }
>  
> +        m_value = m_toplevel->core()->newStringUTF8(&m_textUTF8.c_str()[start], m_i - start);

FYI: substring() is [potentially] substantially more efficient than newStringUTF8; the former can usually just reference the existing string. Of course, that doesn't matter if the answer is wrong, so...

Attachment #549866 - Flags: review?(stejohns) → review+

Steven Johnson

Comment 11

•

13 years ago

Comment on attachment 549867 [details] [diff] [review]
patchB v1: smarten slightly to bring back performance (but for ascii only)

Review of attachment 549867 [details] [diff] [review]:
-----------------------------------------------------------------

::: core/JSONClass.cpp
@@ +841,5 @@
>                  m_i++;
>              adv_digits();
>          }
>  
> +        if (m_indexValidForText)

Nice, this is likely to be much more performance for the common Western-encoding usecases

Attachment #549867 - Flags: review?(stejohns) → review+

Felix S. Klock II [:pnkfelix, :fklock]

Assignee

Updated

•

13 years ago

Attachment #549873 - Flags: review?(brbaker)

Tamarin Bot

Comment 12

•

13 years ago

changeset: 6494:3e750904a663
user:      Felix S Klock II <fklockii@adobe.com>
summary:   Bug 673039: construct tokens via substring method only if UTF-8 byte count == text codepoint count (r=stejohns).

http://hg.mozilla.org/tamarin-redux/rev/3e750904a663

Felix S. Klock II [:pnkfelix, :fklock]

Assignee

Comment 13

•

13 years ago

Also pushed to TR serrano.

changeset:   6341:0b988039b665
user:        Felix S Klock II <fklockii>
date:        Wed Jul 20 21:19:00 2011 -0400
summary:     Bug 673039: construct tokens via substring method only if UTF-8 byte count == text codepoint count (r=stejohns).


mea culpa for the domain-less email address above.  I forgot to put my hg [hooks] section into my tr-serrano repo.

Felix S. Klock II [:pnkfelix, :fklock]

Assignee

Updated

•

13 years ago

Depends on: 675769

Felix S. Klock II [:pnkfelix, :fklock]

Assignee

Comment 14

•

13 years ago

(follow up work has been filed as the two blocking bugs listed in this bug's "Depends on" field.)

Felix S. Klock II [:pnkfelix, :fklock]

Assignee

Updated

•

13 years ago

Whiteboard: WE:2924175 → WE:2924175, fixed-in-tr, fixed-in-serrano

Felix S. Klock II [:pnkfelix, :fklock]

Assignee

Updated

•

13 years ago

Status: ASSIGNED → RESOLVED

Closed: 13 years ago

Resolution: --- → FIXED

Brent Baker

Comment 15

•

13 years ago

Comment on attachment 549873 [details] [diff] [review]
testsM v2: three tests of JSON parse via kraken

Review of attachment 549873 [details] [diff] [review]:
-----------------------------------------------------------------

Just change:
Copyright (C) 2010 -> Copyright (C) 2011

Attachment #549873 - Flags: review?(brbaker) → review+

You need to log in before you can comment on or make changes to this bug.