ValueError: Invalid control character at: line 1 column 147127 (char 147127)


Status Graveyard
Developer Pages
7 years ago
2 years ago


(Reporter: krupa, Assigned: basta)



(Whiteboard: [Step 2])


(2 attachments)

363.77 KB, application/x-xpinstall
1.05 KB, application/rdf+xml


7 years ago
Created attachment 499696 [details]
test file

steps to reproduce:
1. Upload the test file

observed behavior:
Validation fails with "Unexpected server error while validating"

traceback details:

Upload of the test-file works fine in remora

Comment 1

7 years ago
Created attachment 499697 [details]

install.rdf (copied from Delicious Bookmarks)
Assignee: nobody → kumar.mcmillan
Matt, it looks like the validator needs to strip out control characters before parsing with json.  Another alternative is to dump JSON with strict=False but that seems dangerous in case the control characters could be used to trigger an exploit in spidermonkey.

Control characters are anything in the range of 0 - 31 so perhaps something like this could work:

for line in input:
    for c in line:
        if ord(c) >= 0 and ord(c) <= 31:
            if c in ("\n", "\r"):
                # These will be common and safe
            # Skip control char or strip it replace it with an empty string?

I wrote a quick script to see what control characters are in the delicious bookmark (attached) and I mostly see tabs.  However, Mac OS X does a funny thing with zip files where it includes metadata in a __MACOSX dir.  I'm not sure if that is in the actual Addon but if so then the js files in the metadata dir will have control characters like \x00, \x16, \x07, \x02, etc.
Assignee: kumar.mcmillan → mbasta

Comment 3

7 years ago
The only thing we're going to have to watch out for is non-standard encodings. I've come across some files encoded in Chinese formats (namely GB2312). Fortunately, they don't have any control characters in that format, but I can't speak to the tens of other encodings that we are (or should be) supporting.

Anyway, here's a commit that fixes the issue. Unit tests and all.

One thing to note, I only throw one warning for the first control character. If there are, say, a hundred, that's going to make for some really nasty output.
Last Resolved: 7 years ago
Resolution: --- → FIXED
After this code revision I still see the same traceback when validating the attached Addon.

$ python ./addon-validator --determined -o json ~/Documents/Mozilla/addon-packages/add-on201012241037.xpi
ValueError: Invalid control character at: line 1 column 146932 (char 146932)

Full traceback:
Resolution: FIXED → ---
Perhaps there is a UnicodeDecodeError getting swallowed?

Comment 6

7 years ago
What you're seeing now is a different problem:

The problem that's occurring here is that unicode/UTF-8 characters are escaped in the JS source code (i.e.: \u2013) and are passed back to the validator as unicode characters. Unfortunately, the codec for these new characters isn't the same one that was used to encode the JSON string (since the characters were escaped), causing this problem.

My fix is going to be to strip out unicode characters from Spidermonkey's output that are out of the range of extended ASCII before the output is decoded into JSON. This will preserve any existing functionality (there are no functions or objects that contain non-ASCII characters).

Any objections? If not, I'll push out an update tonight.
I think that's fine.  Add a decent comment in the code so we know what is going on if this comes up in the future.
I don't follow how Unicode is involved here.

I took a closer look at why the validator was tripping up on deliciousAwesomebarSearch.js with this invalid control character error.  I was mainly confused because your new code was stripping out control characters so I wanted know where they were coming from.  From what I can tell, the control characters are introduced by spidermonkey after calling Reflect.  wtf.  So maybe you just need to strip out control characters again after getting the js shell output?  Or maybe it is safe enough to load the JSON with strict=False (allowing control chars) since they have already been stripped from user code?

Comment 9

7 years ago
Yeah, basically developers can do things like this:

x = "\u1234";

Spidermonkey will unescape that unicode character, so the output from SM contains characters that weren't part of the encoding going in. It's truly effed.

Anyway, I tried strict=False and that works. Since the only control characters are the ones that Spidermonkey outputs, any vulnerabilities in SM as a result of the control characters would also be exposed on the front end (since it's the parser's output and not the interpreter), so that's out of our hands.

Here's my fix. The code to strip the characters in the output is there but commented out. If need be, we can turn it back on later.
Last Resolved: 7 years ago7 years ago
Resolution: --- → FIXED
Ah, I see.  Well, the validation is working :)  There seems to be a lot of code branches in that file now that I'm not sure we need but I suppose we'll find out through more testing.  Thanks.

Comment 11

7 years ago
I'm going to try to push a round of improvements to the JS traverser today, and I'm planning on moving a lot of code out of the validator.testcases.scripting module. That should help clean things up. Keep an eye open.
This has landed in Zamboni and can be verified

Comment 13

7 years ago
verified fixed
Product: → Graveyard
You need to log in before you can comment on or make changes to this bug.