Closed Bug 599320 Opened 10 years ago Closed 7 years ago

Don't inherit non-ASCII superset encoding as the tentative encoding of subframes

Categories

(Core :: DOM: HTML Parser, defect, P3, major)

defect

Tracking

()

RESOLVED FIXED
mozilla19

People

(Reporter: moga_marius22, Assigned: hsivonen)

References

()

Details

Attachments

(2 files, 3 obsolete files)

User-Agent:       Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)
Build Identifier: Firefox 3.6.10

Wrong display of some pages from www.manastireazemes.ro. Any other browser like Internet explorer, Opera, or Google Chrome works fine and the Multimedia page or Contact page are displayed corectly.

Reproducible: Always

Steps to Reproduce:
1.Start mozilla firefox 3.6.10
2.Browse to http://www.manastireazemes.ro/
3.Select Multimedia and Contact menus.
Actual Results:  
Instead of the Multimedia and Contact pages, you will see a lot of chinese characters.

Expected Results:  
Those pages should be displayed corectly on your browser, like any other browser 

Works fine with Firefox 3.6.8
The reason is that the originating page uses the character encoding: text/html (BOM UTF-16, litte-endian) given as http header.

The iframe ( http://www.manastireazemes.ro/HTMLRO/Contact.html ) is using iso8859-1 and that carset info is only present in the html header as <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

I can confirm this with SM trunk but it works for me with FF3.6.10
Status: UNCONFIRMED → NEW
Ever confirmed: true
Product: Firefox → Core
QA Contact: general → general
Version: unspecified → Trunk
The <meta> tag should override the charset inherited from parent.  So I have no idea why you're seeing this issue.  If you _are_ seeing it, care to debug?
Yes, I am interested in finding a solution. I have access to modify the html code for this website.
To be clear, the tail end of comment 2 was directed at Matti...
Ok, sorry for misunderstanding, I have attached a jpg for this issue. 
Kindly thank you.
Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.11pre) Gecko/20100924 Namoroka/3.6.11pre
Able to reproduce. When we directly click the Contact menu, we see the map. To reproduce click on the Menus one by one and wait for the page to load and at the last click on the Contact menu, we can see the lots of chinese characters instead of displaying the map. Able to reproduce in 3.6.8 version also, but not in 3.6.7
The change between 1.9.2 and trunk is the html5 parser. Disabling the html5 parser makes it working again on the trunk. 
I tested that with FF4.0b6 on windows.
I can't reproduce the issue with 1.9.2 on my system.

@bz: sure, i can try to debug it if you have suggestions how to do it.
Well, a good start is stepping through nsHTMLDocument::StartDocumentLoad and seeing where we end up picking up our charset from.

Though given the HTML5 parser thing, perhaps Henri just needs to look at this.  But I thought we had disabled the HTML5 parser on branch, so that wouldn't be what Marius is seeing....
Oh, you mean real debugging. I'm sorry but that is beyond of my possibilities :-(
What I'm seeing on my system seems to be different from that what Marius is seeing.
I've made some changes on http://www.manastireazemes.ro/HTMLRO/Multimedia.html code and that page is working now. I think that I will make those changes on every web page. There are other Multimedia pages that right now are displaing those chineze characters. 
These are the changes for Multimedia.html:
1. The meta info is now before javascript code
2. On <style type="text/css"> I have deleted the spaces and new lines from navigation_menu's properties
3. I have changed the summary parameter name, but I think this is not important.
If it's alright with you I will make the changes.
Regards
Actually, it would be good if there were a page that shows the problem, so Henri can actually debug and fix it....

Was it just moving the <meta> to before the <script> that did it?  But that shouldn't matter on 1.9.2, without the HTML5 parser.
MultimediaFotoViata.html, MultimediaFotoPeisaje.html, MultimediaFotoIstorie.html, MultimediaFotoConstructie.html are similar with Multimedia.html but I have not modified them yet. You can access them from 
the iframe menu (on the top of the iframe, go to "Foto" and than select another submeniu than "Asezamant")
Ok, I'm not going to modify MultimediaFotoPeisaje.html (on the top of the iframe, go to "Foto" and than select "Peisaje" submeniu) and next I will try to find out if the position of <meta> was the real problem.
Yes, this is the problem. When I move the <meta> before the <script> it is working.
On my PC every page is working after I delete browsers history, restart an than load the website again, but after I restart the browser again (without deleting history) and than reload a webpage (that I haven't modified yet) it appears the chinese problem.
I don't see the described problem on trunk on Mac or Linux. Have all the pages on the site now been modified not to show the problem? Is there still a test case somewhere?

Using UTF-16 (any flavor) on the Web is just asking for trouble.

Maybe we shouldn't inherit non-ASCII superset encodings to subframes, because if the meta prescan fails, a later meta fail to correct the situation if the tentative encoding wasn't an ASCII superset.
Component: General → HTML: Parser
QA Contact: general → parser
@hsivonen: Yes there is now only one page intentionally left for you to debug, please go to "Multimedia" menu item and than on the top of the iframe, go to "Foto" menu and than select "Peisaje" submeniu item. I'm writing about the "http://www.manastireazemes.ro/HTMLRO/MultimediaFotoPeisaje.html" page.  I couldn't let the website uncovered with all the pages damaged like this one. Thank you for understanding.
(In reply to comment #17)
> @hsivonen: Yes there is now only one page intentionally left for you to debug,
> please go to "Multimedia" menu item and than on the top of the iframe, go to
> "Foto" menu and than select "Peisaje" submeniu item. I'm writing about the
> "http://www.manastireazemes.ro/HTMLRO/MultimediaFotoPeisaje.html" page.  I
> couldn't let the website uncovered with all the pages damaged like this one.
> Thank you for understanding.

Thanks. Since the <meta> isn't within the 1 KB, the meta prescan fails to recover from the bad inherited encoding (UTF-16) and since the bytes don't decode to something that parses as a <meta> tag, the late <meta> reload mechanism can't correct the situation.

As mentioned in comment 16, this can be fixed by not inhering non-ASCII superset encodings as subframe defaults.
Summary: please browse to Multimedia and than to Contact menu → Don't inherit non-ASCII superset encoding as the tentative encoding of subframes
Priority: -- → P3
In addition to parent encoding, this patch also refuses hints from the previous doc and the cache entry the same way.
Assignee: nobody → hsivonen
Status: NEW → ASSIGNED
The binary patch is like the -ref version except:
1) It’s in UTF-16.
2) It references the non-“-ref” child frame.
3) It has class="reftest-wait".

#3 is required because the parser in the child frame restarts, which would confuse the reftest framework without reftest-wait.
Attachment #666928 - Attachment is obsolete: true
Attachment #667352 - Flags: review?(bugs)
This patch may break some old Japanese pages whose encoding is ISO-2022-JP. Some Japanese user complained when IE8 dropped the auto-detect for ISO-2022-JP.
Do other browsers also ignore parent frame, previous doc, and cache entry for ISO-2022-JP?
Comment on attachment 667352 [details] [diff] [review]
Refuse encodings that are not rough ASCII supersets as hints, v2

These days, the spec allows encodings that may encode non-Basic Latin to ASCII bytes as long as Basic Latin encodes to ASCII bytes. That is, the legacy Japanese encodings should not be blacklisted here.
Attachment #667352 - Attachment is obsolete: true
Attachment #667352 - Flags: review?(bugs)
(In reply to Masatoshi Kimura [:emk] from comment #21)
> This patch may break some old Japanese pages whose encoding is ISO-2022-JP.
> Some Japanese user complained when IE8 dropped the auto-detect for
> ISO-2022-JP.
> Do other browsers also ignore parent frame, previous doc, and cache entry
> for ISO-2022-JP?

This patch didn’t affect ISO-2022-JP in any way. The Japanese encodings I had blacklisted were not supported by Gecko anyway, they were cruft carried over from the blacklist I had for Java. Removing those. Also removing x-user-defined, because it is ASCII-compatible by code inspection.
Attachment #667431 - Flags: review?(bugs)
Comment on attachment 667431 [details] [diff] [review]
Blacklist UTF-16 from being inherited from parent


>-    else if (kCharsetFromCache <= parentSource) {
>-      // Make sure that's OK
>-      if (!aParentDocument || !CheckSameOrigin(this, aParentDocument)) {
>-        return false;
>-      }
>-
>-      source = kCharsetFromParentFrame;
>+
>+    // if parent is posted doc, set this prevent autodections
I know you just moved this stuff, but 'autodections' ?

> nsHTMLDocument::UseWeakDocTypeDefault(int32_t& aCharsetSource,
>-                                      nsACString& aCharset)
>+                                           nsACString& aCharset)
Odd indentation


>-  // fallback value in case docshell return error
>-  aCharset.AssignLiteral("ISO-8859-1");
>+    return;
> 
>   const nsAdoptingCString& defCharset =
>     Preferences::GetLocalizedCString("intl.charset.default");
> 
>-  if (!defCharset.IsEmpty()) {
>+  // Don't let the user break things by setting intl.charset.default to
>+  // not a rough ASCII superset
>+  if (!defCharset.IsEmpty() && IsAsciiCompatible(defCharset)) {
>     aCharset = defCharset;
>-    aCharsetSource = kCharsetFromWeakDocTypeDefault;
>+  } else {
>+    aCharset.AssignLiteral("windows-1252");
So I still don't understand the reason to change default to windows-1252
Anyway it shouldn't be about this bug, so could to just not do it.
Attachment #667431 - Flags: review?(bugs) → review+
(In reply to Olli Pettay [:smaug] from comment #24)
> Comment on attachment 667431 [details] [diff] [review]
> Blacklist UTF-16 from being inherited from parent
> 
> 
> >-    else if (kCharsetFromCache <= parentSource) {
> >-      // Make sure that's OK
> >-      if (!aParentDocument || !CheckSameOrigin(this, aParentDocument)) {
> >-        return false;
> >-      }
> >-
> >-      source = kCharsetFromParentFrame;
> >+
> >+    // if parent is posted doc, set this prevent autodections
> I know you just moved this stuff, but 'autodections' ?

Fixed.

> > nsHTMLDocument::UseWeakDocTypeDefault(int32_t& aCharsetSource,
> >-                                      nsACString& aCharset)
> >+                                           nsACString& aCharset)
> Odd indentation

Fixed. (The monospace font in my editor had different widths for regular and bold. Sigh.)

> >-  // fallback value in case docshell return error
> >-  aCharset.AssignLiteral("ISO-8859-1");
> >+    return;
> > 
> >   const nsAdoptingCString& defCharset =
> >     Preferences::GetLocalizedCString("intl.charset.default");
> > 
> >-  if (!defCharset.IsEmpty()) {
> >+  // Don't let the user break things by setting intl.charset.default to
> >+  // not a rough ASCII superset
> >+  if (!defCharset.IsEmpty() && IsAsciiCompatible(defCharset)) {
> >     aCharset = defCharset;
> >-    aCharsetSource = kCharsetFromWeakDocTypeDefault;
> >+  } else {
> >+    aCharset.AssignLiteral("windows-1252");
> So I still don't understand the reason to change default to windows-1252
> Anyway it shouldn't be about this bug, so could to just not do it.

Changed back to ISO-8859-1. Filed bug 802030.

Thanks for the r+. Landed:
https://hg.mozilla.org/integration/mozilla-inbound/rev/6db369a15ea7
I forgot that there was a GUI for the fallback encoding. Filed bug 802033.
(In reply to Ed Morley [:edmorley UTC+1] from comment #27)
> Backed out for intermittent (but frequent) Win7 debug reftest failures in
> bug599320-1.html, eg:

The log says:
REFTEST INFO | Saved log: START file:///c:/talos-slave/test/build/reftest/tests/parser/htmlparser/tests/reftest/bug599320-1.html
REFTEST INFO | Saved log: [CONTENT] OnDocumentLoad triggering WaitForTestEnd
REFTEST INFO | Saved log: [CONTENT] WaitForTestEnd: Adding listeners
REFTEST INFO | Saved log: Initializing canvas snapshot
REFTEST INFO | Saved log: DoDrawWindow 0,0,800,1000
REFTEST INFO | Saved log: [CONTENT] MakeProgress: STATE_WAITING_TO_FIRE_INVALIDATE_EVENT
REFTEST INFO | Saved log: [CONTENT] MakeProgress: waiting for MozAfterPaint
REFTEST INFO | Saved log: [CONTENT] AfterPaintListener in file:///c:/talos-slave/test/build/reftest/tests/parser/htmlparser/tests/reftest/bug599320-1.html
REFTEST INFO | Saved log: [CONTENT] SendUpdateCanvasForEvent with 1 rects
REFTEST INFO | Saved log: [CONTENT] Rect: 10 154 310 555
REFTEST INFO | Saved log: Updating canvas for invalidation
REFTEST INFO | Saved log: DoDrawWindow 10,154,300,401
REFTEST INFO | Saved log: [CONTENT] WaitForTestEnd: Adding listeners
REFTEST INFO | Saved log: [CONTENT] MakeProgress: STATE_WAITING_TO_FIRE_INVALIDATE_EVENT
REFTEST INFO | Saved log: [CONTENT] MakeProgress: waiting for MozAfterPaint
REFTEST INFO | Saved log: [CONTENT] AfterPaintListener in file:///c:/talos-slave/test/build/reftest/tests/parser/htmlparser/tests/reftest/bug599320-1.html
REFTEST INFO | Saved log: [CONTENT] SendUpdateCanvasForEvent with 1 rects
REFTEST INFO | Saved log: [CONTENT] Rect: 10 154 310 555
REFTEST INFO | Saved log: [CONTENT] AfterPaintListener in file:///c:/talos-slave/test/build/reftest/tests/parser/htmlparser/tests/reftest/frame599320-1.html
REFTEST INFO | Saved log: [CONTENT] AttrModifiedListener fired
REFTEST INFO | Saved log: [CONTENT] MakeProgress: STATE_WAITING_TO_FIRE_INVALIDATE_EVENT
REFTEST INFO | Saved log: [CONTENT] MakeProgress: waiting for MozAfterPaint
REFTEST INFO | Saved log: [CONTENT] AfterPaintListener in file:///c:/talos-slave/test/build/reftest/tests/parser/htmlparser/tests/reftest/bug599320-1.html
REFTEST INFO | Saved log: [CONTENT] SendUpdateCanvasForEvent with 0 rects
REFTEST INFO | Saved log: Updating canvas for invalidation
REFTEST INFO | Saved log: [CONTENT] AttrModifiedListener fired
REFTEST INFO | Saved log: [CONTENT] MakeProgress: STATE_WAITING_TO_FIRE_INVALIDATE_EVENT
REFTEST INFO | Saved log: [CONTENT] MakeProgress: waiting for MozAfterPaint
REFTEST INFO | Saved log: [CONTENT] AfterPaintListener in file:///c:/talos-slave/test/build/reftest/tests/parser/htmlparser/tests/reftest/bug599320-1.html
REFTEST INFO | Saved log: [CONTENT] SendUpdateCanvasForEvent with 0 rects
REFTEST INFO | Saved log: [CONTENT] AfterPaintListener in file:///c:/talos-slave/test/build/reftest/tests/parser/htmlparser/tests/reftest/frame599320-1.html
REFTEST INFO | Saved log: [CONTENT] MakeProgress: STATE_WAITING_TO_FIRE_INVALIDATE_EVENT
REFTEST INFO | Saved log: [CONTENT] MakeProgress: dispatching MozReftestInvalidate
REFTEST INFO | Saved log: [CONTENT] MakeProgress: STATE_WAITING_FOR_REFTEST_WAIT_REMOVAL
REFTEST INFO | Saved log: [CONTENT] MakeProgress: STATE_WAITING_TO_FINISH
REFTEST INFO | Saved log: [CONTENT] MakeProgress: Completed
REFTEST INFO | Saved log: [CONTENT] MakeProgress: STATE_COMPLETED

The iframe has a restarting parser. The parser restart confuses the reftest framework on all platforms, but the reftest-wait stuff is supposed to take care of it.

Any ideas why the snapshotting would misfire on Windows despite reftest-wait?

Should I just mark the test as intermittent on Windows and not think about it too much? The code being tested is cross-platform after all, so there’s a good reason to believe the actual code change is doing OK.
This patch is the same as the one landed before except the removal of reftest-wait tries harder to defer removal of the class. This was green on Windows debug on try, but proper landing is needed to see if it is still random.
Attachment #667431 - Attachment is obsolete: true
Attachment #672733 - Flags: review?(bugs)
Attachment #672733 - Flags: review?(bugs) → review+
https://hg.mozilla.org/mozilla-central/rev/89ff8e556381
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla19
You need to log in before you can comment on or make changes to this bug.