Closed Bug 621360 Opened 14 years ago Closed 14 years ago

ValueError: Invalid control character at: line 1 column 147127 (char 147127)

Categories

(addons.mozilla.org Graveyard :: Developer Pages, defect)

defect
Not set
normal

Tracking

(Not tracked)

VERIFIED FIXED
5.12.6

People

(Reporter: krupa.mozbugs, Assigned: basta)

Details

(Whiteboard: [Step 2])

Attachments

(2 files)

363.77 KB, application/x-xpinstall
Details
1.05 KB, application/rdf+xml
Details
Attached file test file
steps to reproduce:
1. Upload the test file

observed behavior:
Validation fails with "Unexpected server error while validating"

traceback details:
http://pastebin.mozilla.org/899120

Also:
Upload of the test-file works fine in remora
Attached file install.rdf
install.rdf (copied from Delicious Bookmarks)
Assignee: nobody → kumar.mcmillan
Matt, it looks like the validator needs to strip out control characters before parsing with json.  Another alternative is to dump JSON with strict=False but that seems dangerous in case the control characters could be used to trigger an exploit in spidermonkey.

Control characters are anything in the range of 0 - 31 so perhaps something like this could work:

for line in input:
    for c in line:
        if ord(c) >= 0 and ord(c) <= 31:
            if c in ("\n", "\r"):
                # These will be common and safe
                continue
            # Skip control char or strip it replace it with an empty string?

I wrote a quick script to see what control characters are in the delicious bookmark (attached) and I mostly see tabs.  However, Mac OS X does a funny thing with zip files where it includes metadata in a __MACOSX dir.  I'm not sure if that is in the actual Addon but if so then the js files in the metadata dir will have control characters like \x00, \x16, \x07, \x02, etc.
Assignee: kumar.mcmillan → mbasta
The only thing we're going to have to watch out for is non-standard encodings. I've come across some files encoded in Chinese formats (namely GB2312). Fortunately, they don't have any control characters in that format, but I can't speak to the tens of other encodings that we are (or should be) supporting.

Anyway, here's a commit that fixes the issue. Unit tests and all.

https://github.com/mattbasta/amo-validator/commit/283a8338ab0f910a955f30d6fa03234bfc242b41

One thing to note, I only throw one warning for the first control character. If there are, say, a hundred, that's going to make for some really nasty output.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
After this code revision I still see the same traceback when validating the attached Addon.

$ python ./addon-validator --determined -o json ~/Documents/Mozilla/addon-packages/add-on201012241037.xpi
...
ValueError: Invalid control character at: line 1 column 146932 (char 146932)

Full traceback: http://pastebin.mozilla.org/902626
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Perhaps there is a UnicodeDecodeError getting swallowed?
What you're seeing now is a different problem:

The problem that's occurring here is that unicode/UTF-8 characters are escaped in the JS source code (i.e.: \u2013) and are passed back to the validator as unicode characters. Unfortunately, the codec for these new characters isn't the same one that was used to encode the JSON string (since the characters were escaped), causing this problem.

My fix is going to be to strip out unicode characters from Spidermonkey's output that are out of the range of extended ASCII before the output is decoded into JSON. This will preserve any existing functionality (there are no functions or objects that contain non-ASCII characters).

Any objections? If not, I'll push out an update tonight.
I think that's fine.  Add a decent comment in the code so we know what is going on if this comes up in the future.
I don't follow how Unicode is involved here.

I took a closer look at why the validator was tripping up on deliciousAwesomebarSearch.js with this invalid control character error.  I was mainly confused because your new code was stripping out control characters so I wanted know where they were coming from.  From what I can tell, the control characters are introduced by spidermonkey after calling Reflect.  wtf.  So maybe you just need to strip out control characters again after getting the js shell output?  Or maybe it is safe enough to load the JSON with strict=False (allowing control chars) since they have already been stripped from user code?
Yeah, basically developers can do things like this:

x = "\u1234";

Spidermonkey will unescape that unicode character, so the output from SM contains characters that weren't part of the encoding going in. It's truly effed.

Anyway, I tried strict=False and that works. Since the only control characters are the ones that Spidermonkey outputs, any vulnerabilities in SM as a result of the control characters would also be exposed on the front end (since it's the parser's output and not the interpreter), so that's out of our hands.

Here's my fix. The code to strip the characters in the output is there but commented out. If need be, we can turn it back on later.

https://github.com/mattbasta/amo-validator/commit/2e730746748b4df4120c14d70456ede5131edc75
Status: REOPENED → RESOLVED
Closed: 14 years ago14 years ago
Resolution: --- → FIXED
Ah, I see.  Well, the validation is working :)  There seems to be a lot of code branches in that file now that I'm not sure we need but I suppose we'll find out through more testing.  Thanks.
I'm going to try to push a round of improvements to the JS traverser today, and I'm planning on moving a lot of code out of the validator.testcases.scripting module. That should help clean things up. Keep an eye open.
This has landed in Zamboni and can be verified
verified fixed
Status: RESOLVED → VERIFIED
Product: addons.mozilla.org → addons.mozilla.org Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: