Closed Bug 568228 Opened 15 years ago Closed 15 years ago

[HTML5] javascript character decoding issue?

Categories

(Core :: DOM: HTML Parser, defect)

defect
Not set
normal

RESOLVED DUPLICATE of bug 566280
Tracking Status
blocking2.0 --- final+

People

(Reporter: jack, Assigned: hsivonen)

Details

(Whiteboard: Will be fixed by bug 566280.)

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; WOW64; en-US; rv:1.9.3a5pre) Gecko/20100526 Minefield/3.7a5pre (.NET CLR 3.5.30729)
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 6.1; WOW64; en-US; rv:1.9.3a5pre) Gecko/20100526 Minefield/3.7a5pre (.NET CLR 3.5.30729)

The company I work at uses a BMC helpdesk system, and their web front end uses a lot of javascript. I have found that it works in the firefox 3.6 release, but our change management form fails to load in minefield. This error is not on a server I can provide access to.

Error: missing ) after argument list
Source file: https://redacted/arsys/forms/arserver-redacted/CHG%3AInfrastructure+Change/Default+User+View/?cacheid=redacted
Line: 47, Column: 95
Source code:
function ARAL5(){if ((NE(F(302224400).G().part("������������E(F(1000000066).G(), new CharType("GROUP%"))) && (EQ(F(1000000076).G(), new CharType("Select")))) && (EQ(F(1000000140).G(), new IntegerType(0)))){highlight=true;;F(1000000076).S((Null));highlight=false;;return ARACTMessage(2,1440111,"You are not related to any Support Groups. Please have your group administrator create a Support Group Relationship.");}}

Reproducible: Always

Steps to Reproduce:
1. Try to parse this javascript function declaration

Actual Results:
Get error

Expected Results:
function should be defined and runnable in page.

This function looks different when I "View Source" than it does when I see it reported in the error console, which makes me think it's trying a different character decoding in one place from the other. I have attempted to reproduce the function definition with its funky characters. If I hadn't seen it work in firefox 3.6, I wouldn't have bothered reporting, because it looks iffy.
function ARAL5(){if ((NE(F(302224400).G().part("����",0), F(302225000).G().part("����",0))) &&
(NE(F(302225000).G().part("����",0), Null))){highlight=true;;F(302224400).S((F(302225000).G()));highlight=false;;}}

0000000: 6675 6e63 7469 6f6e 2041 5241 4c35 2829  function ARAL5()
0000010: 7b69 6620 2828 4e45 2846 2833 3032 3232  {if ((NE(F(30222
0000020: 3434 3030 292e 4728 292e 7061 7274 2822  4400).G().part("
0000030: efbf bdef bfbd efbf bdef bfbd 222c 3029  ............",0)
0000040: 2c20 4628 3330 3232 3235 3030 3029 2e47  , F(302225000).G
0000050: 2829 2e70 6172 7428 22ef bfbd efbf bdef  ().part(".......
0000060: bfbd efbf bd22 2c30 2929 2920 2626 0a28  .....",0))) &&.(
0000070: 4e45 2846 2833 3032 3232 3530 3030 292e  NE(F(302225000).
0000080: 4728 292e 7061 7274 2822 efbf bdef bfbd  G().part("......
0000090: efbf bdef bfbd 222c 3029 2c20 4e75 6c6c  ......",0), Null
00000a0: 2929 297b 6869 6768 6c69 6768 743d 7472  ))){highlight=tr
00000b0: 7565 3b3b 4628 3330 3232 3234 3430 3029  ue;;F(302224400)
00000c0: 2e53 2828 4628 3330 3232 3235 3030 3029  .S((F(302225000)
00000d0: 2e47 2829 2929 3b68 6967 686c 6967 6874  .G()));highlight
00000e0: 3d66 616c 7365 3b3b 7d7d 0a              =false;;}}.

Render Mode says "Quirks mode"; The script is defined in an HTML document and the page info's "Meta" box shows 1 tag "Content-Type" "text/html; charset=UTF=8"

I captured the HTTP headers during a reload, in case it mattered:

HTTP/1.1 200 OK
Date: Wed, 26 May 2010 15:33:48 GMT
Server: Apache/2.0.52 (Red Hat)
Cache-Control: public,max-age=1
Expires: Wed, 26 May 2010 15:33:49 GMT
Last-Modified: Wed, 19 May 2010 18:12:49 GMT
Content-Encoding: gzip
Content-Length: 149613
Connection: close
Content-Type: text/html;charset=UTF-8
So this is an inline <script> in the page? Or an external script? Do you think you can hunt down the first nightly that shows the problem?
It's an inline <script>. I'll see if I can narrow it down for you some, but I don't have a guess yet, since the server using this script is fairly new to me.
OK. Gecko 1.9.2 branched in August 2009, so a binary search on nightlies shouldn't take more than 9 nightlies or so. I'd offer to do it, but don't have access to the page, of course. :(
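For reference, the "9 nightlies or so" estimate is just the base-2 logarithm of the number of candidate builds. The sketch below reproduces that arithmetic; the exact branch date (1 August 2009) and the one-nightly-per-day assumption are illustrative guesses, since the comment only says "August 2009".

```python
import math
from datetime import date


def bisect_steps(known_good: date, known_bad: date) -> int:
    """Worst-case number of builds to test when bisecting one nightly per day."""
    candidates = (known_bad - known_good).days
    return math.ceil(math.log2(candidates))


# Assumed endpoints: the branch point is taken as 1 August 2009 (hypothetical,
# the thread gives only the month) and the bad build is the reporter's 20100526.
steps = bisect_steps(date(2009, 8, 1), date(2010, 5, 26))
print(steps)
```

Any start date in August 2009 gives the same answer, because the range stays between 256 and 512 nightlies.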
The script works in 20100503, and fails in 20100504. Thanks for looking into this!
Jack, what are the build ids of those builds (from about:buildconfig)?
Mozilla/5.0 (Windows; U; Windows NT 6.1; WOW64; en-US; rv:1.9.3a5pre) Gecko/20100503 Minefield/3.7a5pre (.NET CLR 3.5.30729)

That is the one that works - I got the other from the same place the same way, but I will need about an hour before I can reinstall and be absolutely positive. I'm pretty sure it's pretty much the same thing with 20100504 instead of 20100503.
Jack, I'm looking for a string like "Built from http://hg.mozilla.org/mozilla-central/rev/7066468619b5" (again, from about:buildconfig).
That said, the html5 parser landed in that general date range. Henri, could that be causing issues here?
blocking2.0: --- → ?
Sorry about that. This one works: Built from http://hg.mozilla.org/mozilla-central/rev/83c887dff0da This one does not: Built from http://hg.mozilla.org/mozilla-central/rev/d6bb0f9e9519
Status: UNCONFIRMED → NEW
Ever confirmed: true
In that case, does setting html5.enable to false in about:config in the latest minefield make it work, Jack?
Yes, it does.
Assignee: general → nobody
Component: JavaScript Engine → HTML: Parser
OS: Windows 7 → All
QA Contact: general → parser
Hardware: x86 → All
Version: unspecified → Trunk
(In reply to comment #0)
> This function looks different when I "View Source" than it does when I see it
> reported in the error console, which makes me think it's trying a different
> character decoding in one place from the other.

Indeed. View Source still uses the old HTML parser.

> 0000030: efbf bdef bfbd efbf bdef bfbd 222c 3029 ............",0)

Those bytes are indeed bogus UTF-8, so once the parser has decided to use UTF-8, turning those bytes into REPLACEMENT CHARACTERs is correct.

> Render Mode says "Quirks mode"; The script is defined in an HTML document and
> the page info's "Meta" box shows 1 tag "Content-Type" "text/html;
> charset=UTF=8"

I think the key is the equals sign in place of the hyphen in UTF=8.

> I captured the HTTP headers during a reload, in case it mattered:
> Content-Type: text/html;charset=UTF-8

Per HTML5, the HTTP header takes precedence over the meta tag. Thus, the HTML5 parser (correctly) decodes the page as UTF-8. However, the page is actually relying on the meta taking precedence and UTF=8 being unrecognized so that the page gets decoded as Windows-1252.

This is still not quite a satisfactory explanation. Even the old parser should give precedence to the HTTP headers over the meta. Given the information available to me in this bug report, I am unable to explain why the old parser would ignore the HTTP-level charset parameter.

I'm inclined to resolve this as WONTFIX, but I'm leaving this open for more information.
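The precedence rule described above can be sketched roughly as follows. This is a simplified illustration, not Gecko's code: the function name, the tiny label table, and the Windows-1252 fallback are assumptions made for the example, and real encoding detection has more steps (BOM sniffing, user overrides, and so on).

```python
# Hypothetical, deliberately tiny table of recognized charset labels.
KNOWN_LABELS = {
    "utf-8": "UTF-8",
    "windows-1252": "windows-1252",
    "iso-8859-1": "windows-1252",
}


def pick_encoding(http_charset, meta_charset, fallback="windows-1252"):
    """Return the first recognized label, checking the HTTP header before <meta>."""
    for label in (http_charset, meta_charset):
        if label is None:
            continue
        name = KNOWN_LABELS.get(label.strip().lower())
        if name is not None:
            return name
    # Neither source named a known encoding, so use the assumed default.
    return fallback


# HTTP header says UTF-8, so the (mistyped) meta label never matters.
print(pick_encoding("UTF-8", "UTF=8"))
# Without the header, the unrecognized "UTF=8" falls through to the fallback.
print(pick_encoding(None, "UTF=8"))
```

With the header present the result is UTF-8 regardless of what the meta says, which is why the meta typo alone could not explain the difference between the two parsers.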
Oh ... darn. The UTF=8 thing is a transcription error on my part. I hadn't noticed that I typoed it nor that I could have just copy & pasted it. The meta has a proper UTF-8 in it.
Jack, can you check what's listed as "Encoding" in the General pane of View Page Info both with the html5.enable pref set to true and with the pref set to false? I'd like to establish whether the old parser is decoding the page as non-UTF-8 or whether both parsers decode as UTF-8 but handle bad byte sequences differently.
Sure! Both ways, it says UTF-8.
Blocking final 1.9.3+
blocking2.0: ? → final+
Note that I'm not convinced that we actually need to change anything here. But we should figure out why the old parser worked here and see if we need to adjust the spec or if there's a bug in the new impl or some such.
(In reply to comment #13)
> > 0000030: efbf bdef bfbd efbf bdef bfbd 222c 3029 ............",0)
>
> Those bytes are indeed bogus UTF-8, so once the parser has decided to use
> UTF-8, turning those bytes into REPLACEMENT CHARACTERs is correct.

Oops. That's not bogus UTF-8. It's valid UTF-8 that actually encodes REPLACEMENT CHARACTERs.

(In reply to comment #16)
> Sure! Both ways, it says UTF-8.

In that case, it seems that there might be a bug in the code that drives the encoding conversion in the HTML5 parser. The interesting question is why the bug hasn't shown up all over the place already. Another possibility is a bug in the character data flushing when tokenizing script content.

Jack, can you paste or attach the entire affected inline script element with its content here, redacting what you need to redact but without redacting <!--, -->, <script>, </script> or line breaks appearing inside the script?
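The correction about the byte sequence is easy to check independently. A minimal sketch, using only Python's built-in strict UTF-8 decoder: the twelve bytes from the hexdump decode cleanly to four U+FFFD characters, whereas a genuinely malformed sequence (the `ff fe` pair here is just an arbitrary example, not something from the page) is rejected.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes decode under strict UTF-8 rules."""
    try:
        data.decode("utf-8")  # errors="strict" is the default
        return True
    except UnicodeDecodeError:
        return False


# The twelve bytes between the quotation marks in the reporter's hexdump.
dumped = bytes.fromhex("efbfbdefbfbdefbfbdefbfbd")
text = dumped.decode("utf-8")

print(is_valid_utf8(dumped))             # well-formed UTF-8
print([hex(ord(ch)) for ch in text])     # four code points, each U+FFFD
print(is_valid_utf8(b"\xff\xfe"))        # arbitrary malformed example
```

So bytes `ef bf bd` in a dump mean that a replacement character was already present in the text that was copied, not that the decoder is about to produce one.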
I have not been able to fulfill that request exactly as asked, because this file is ridiculous, and my own attempts to reproduce the error from what I would potentially send you are failing.

But the good news (or bad, depending on your perspective) is that I have found that I missed something important. The original bytes transmitted to firefox, between the quotation marks, were 0000 0000, not efbf bdef bfbd efbf bdef bfbd. I think I originally copied and pasted out of the "view source" window and out of the javascript error log, into this window and also to a hex converter in a terminal, but if I'd just saved the file, I would have seen that it started out as four 0x00s. I verified this with wireshark.

More positive progress, though: I got a much, much smaller file to reproduce what I believe to be a problem in the same family.

http://mortal.peril.org/jack/mozilla/broken.htm

With html5 turned off, it shows no text and no errors. With html5 left on, it gives me this:

Error: unterminated string literal
Source file: http://mortal.peril.org/jack/mozilla/broken.htm
Line: 13, Column: 32
Source code:
function ARAL5(){ return myfunc("����

I now think that the 0xefbfbd didn't come from firefox at all; I pasted to xxd running on a different linux server's command line and it changed them to 0x2e (.) instead of the UTF-8 replacement character. I'm very sorry I didn't realize I was doing it wrong. Thank you for your patience and your labors.
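Part of the confusion is that the text column of a hexdump cannot tell these cases apart. The sketch below mimics the usual xxd convention of printing a dot for every byte outside printable ASCII; the helper name and the sample strings are made up for illustration.

```python
def ascii_column(data: bytes) -> str:
    """Render bytes the way a hexdump's text column does: '.' for non-printables."""
    return "".join(chr(b) if 0x20 <= b <= 0x7E else "." for b in data)


# Hypothetical samples: four NUL bytes versus one encoded U+FFFD, both quoted.
with_nuls = b'"\x00\x00\x00\x00"'
with_fffd = '"\ufffd"'.encode("utf-8")

print(with_nuls.hex(), ascii_column(with_nuls))
print(with_fffd.hex(), ascii_column(with_fffd))
```

Both text columns show only dots between the quotes, so the hex column (or a packet capture, as used above) is the only reliable way to see whether the bytes were `00` or `ef bf bd`.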
Thank you for looking into this more. Here's what's happening:

First, a new feature of the HTML5 parsing algorithm kicks in: U+0000 is mapped to U+FFFD. Then bug 566280 takes over and causes characters *after* the U+0000, including the close quote, to be dropped by mistake.

The next step is to see if the helpdesk system works with the fix for bug 566280 applied. (The fix should land soonish. If not, I can make a tryserver build.)

In any case it seems weird that an app would put zero bytes in a JS string literal. So far, I am aware of two compat bugs caused by the mapping of U+0000 to U+FFFD (as opposed to silently discarding U+0000): bug 528045 (WONTFIXed on the spec level) and bug 563526 (which I intend to get fixed both in implementation and on the spec level).
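To illustrate the first step only, here is a rough model of the two NUL-handling policies mentioned above. It is a simplification under stated assumptions: the function names and the sample script text are hypothetical, a real tokenizer applies the rule per state rather than with a string replace, and the character-dropping behavior of bug 566280 is deliberately not modeled.

```python
REPLACEMENT = "\ufffd"


def old_parser_text(source: str) -> str:
    """Old behavior as described in the thread: U+0000 is silently discarded."""
    return source.replace("\x00", "")


def html5_parser_text(source: str) -> str:
    """HTML5 behavior as described in the thread: U+0000 becomes U+FFFD."""
    return source.replace("\x00", REPLACEMENT)


# Hypothetical script text with four NULs inside a string literal.
script = 'return myfunc("\x00\x00\x00\x00");'

print(repr(old_parser_text(script)))
print(repr(html5_parser_text(script)))
```

Under either policy the closing quote survives, so the mapping by itself leaves the string literal terminated; the "unterminated string literal" error needs the second step, where the characters after the U+0000 are lost.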
Assignee: nobody → hsivonen
Depends on: 566280
Summary: javascript character decoding issue? → [HTML5] javascript character decoding issue?
I just tried our helpdesk system after installing that build, Henri, and it works just great. Thank you very much.
(In reply to comment #23)
> I just tried our helpdesk system after installing that build, Henri, and it
> works just great. Thank you very much.

Great! Thank you and sorry about the bug. I'll leave this open until the patch for bug 566280 has landed.
Whiteboard: Will be fixed by bug 566280.
Actually, marking as duplicate of bug 566280 per comment #23.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → DUPLICATE