Open Bug 61363 (latemeta) Opened 24 years ago Updated 7 months ago

Make sure that chardetng-triggered encoding reload is read from the cache

Categories

(Core :: DOM: HTML Parser, enhancement, P5)

Tracking

Future

People

(Reporter: pollmann, Unassigned)

References

Details

(Keywords: helpwanted, testcase, Whiteboard: Please read comment 115.)

Attachments

(1 file)

This is a follow-on to bug 27006.  We need to come up with a "real" fix for this
problem.  Since I'm hoping that we can somehow force the reload to come from the
cache instead of the server, I'm starting out with this assigned to Gagan.  I'll
send out an email trying to get a meeting set up so we can work out more details.
Adding keywords from 27006
->to cache
Assignee: gagan → neeti
Component: Networking → Networking: Cache
QA Contact: tever → gordon
Target Milestone: --- → M1
Target Milestone: M1 → mozilla0.9
Cache bugs to Gordon
Assignee: neeti → gordon
Eric, what do you need from the cache and/or http?
Target Milestone: mozilla0.9 → mozilla0.9.1
Darin, this looks like a dup of the other <meta> charset bug you're working on.
Assignee: gordon → darin
actually, i'm going to use this bug to track this problem.  we need:

1) support for overlapped i/o in the disk cache.
2) ability to background the first load and let it finish on its own.
Blocks: 71668
If we had 1), then I wonder if, for blocked layout, uninterrupted streaming of
data to the cache would be a better solution than our filling up the pipes and
subsequently taking the socket off the select list. This way our network
requests would never pause for layout/other blocking events -- that means we'd be
fast. Maybe we need a PushToBackground (with a better name) on nsIRequest. The
implementation of PushToBackground would simply take all "end" listeners off and
continue to stream data to the cache. So consumers that are currently killing
our first channel would just push it to the background and make new requests for
the same URL. What say? 

agreed.. we do need some sort of communication from the parser to http to tell
it to keep going "in the background"... there are two options i see...

1) parser could just eat all the data; then, http would not even need to be made
aware of what's going on.

2) parser could return a special error code from OnDataAvailable that would 
instruct HTTP to not call OnDataAvailable anymore, but to just continue
streaming the data into the cache... this error code could perhaps be
NS_BASE_STREAM_CLOSED.

I'm not sure that option 2 would be that much more efficient than option 1...
option 1 would be a lot easier to implement, but option 2 could be used by
any client of http.

gagan: i'm not sure we can completely background the download... as this would
require pushing it to another thread, which would be difficult.
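To make the control flow of option 2 concrete, here is a toy Python model of it
(all names here are invented for illustration; the real interfaces are Gecko's
C++ nsIStreamListener/nsIRequest, and NS_BASE_STREAM_CLOSED is only a candidate
sentinel as discussed above):

DETACH = object()  # stands in for a special rv like NS_BASE_STREAM_CLOSED

def pump(source_chunks, listener_on_data, cache):
    """Deliver chunks to the listener until it detaches; always fill the cache."""
    listening = True
    for chunk in source_chunks:
        cache.append(chunk)                # the cache always gets the bytes
        if listening:
            if listener_on_data(chunk) is DETACH:
                listening = False          # stop calling OnDataAvailable,
                                           # but keep streaming to the cache

cache = []
seen = []
def parser_on_data(chunk):
    seen.append(chunk)
    # Parser hits a late <meta charset>: detach and re-request (from cache).
    return DETACH if b"<meta" in chunk else None

pump([b"<html>", b'<meta charset="koi8-r">', b"\xf0\xd2\xc9"], parser_on_data, cache)
assert seen == [b"<html>", b'<meta charset="koi8-r">']   # listener detached
assert len(cache) == 3                                   # cache got everything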
either way we need overlapped io in the cache. Setting that as the first target
and giving to gordon. This is a serious bug (ibench and double-post)
Assignee: darin → gordon
Keywords: topperf
*** Bug 78018 has been marked as a duplicate of this bug. ***
*** Bug 78494 has been marked as a duplicate of this bug. ***
Whiteboard: want for mozilla 0.9.1
In order to implement over-lapped I/O in the cache, I'll need to finish the Disk 
Cache Level 2, which includes the necessary stream-wrappers.
Depends on: 72507
Priority: P3 → P2
Depends on: 81724
No longer depends on: 81724
per PDT triage to 0.9.2
Target Milestone: mozilla0.9.1 → mozilla0.9.2
Depends on: 81724
Whiteboard: want for mozilla 0.9.1
Keywords: nsenterprise
can't make it to 0.9.2. pushing over...
Target Milestone: mozilla0.9.2 → mozilla0.9.3
moving out milestone. 
Target Milestone: mozilla0.9.3 → mozilla1.0
Removing nsenterprise nomination. Adding nsBranch.
Keywords: nsenterprise → nsBranch
Blocks: 99142
Gordon/Gagan - This looks like a good one to take. How close are you to resolving
this one? If it can't be finished this week, pls mark it as nsbranch- for this
round.
We are not close on this.  It's very doubtful this will be ready to land in the
next month.
Keywords: nsbranch → nsbranch-
Blocks: 105709
Blocks: 107067
Keywords: nsbranch-
Keywords: mozilla1.0
any chance this will make the MachV train?
No longer blocks: 107067
Keywords: nsbeta1
Darin and I are backing off of supporting overlapped I/O in the cache (which was
the reason I was given this bug).  We need to review the severity and potential
fixes, since necko has changed quite a bit since this bug was originally
reported.   I'll meet with him and update the bug with our current thoughts.
cc'ing shaver, since i know he has comments on this one.
If you submit a POST form, and the returned HTML has a charset -- as is the case
with a number of e-commerce sites in Canada, where we have accents and things --
then you get the scary "resubmit your data?" dialog, sometimes twice.  That
dialog is doubly scary when you're slinging around your credit card with
non-refundable tickets, so I've had to spin up IE for some purchases to keep my
blood pressure down.

I don't understand why we have to go to the network or the cache for this.  When
we hit a <meta charset> tag, we just need to go back and fix up attribute values
to match the new character set, and then make sure that future content is
charset-parsed appropriately.  I don't think it's ever possible for the charset
to change the structure of the document, because in that case we might not
really have seen <meta> and the whole thing collapses on itself.

"Overlapping I/O" sounds like a win other things (multiple copies of an image on
a page, where the second one is requested while the first one is still coming
in?), to be honest, but I don't think the right fix here involves any I/O driven
by a <meta>, just attribute fixup.  And since overlapped I/O seems to be rocket
science, why not let a DOM/parser guy take a swing at it?
agreed... falling back on the cache/necko is just a hack solution at best.

-> parser

shaver: btw, imagelib already gates image requests to avoid multiple hits on the
disk cache / network for an image that appears more than once on a page.
Assignee: gordon → harishd
Component: Networking: Cache → Parser
QA Contact: gordon → moied
Do we understand the situations where the 'meta charset sniffer' is failing --
thus forcing us to reload the document?

our current sniffing code looks at the first buffer of data for a
meta-charset... so, i'm assuming that in situations where we reload, the server
has sent us a 'small' first packet of data...

is this *really* the case, or has our sniffer broken?

-- rick
 
Can't answer the question about server reloads of POST documents, but for 
the GET case of a document, the sniffer is working (i.e., we don't do 
the double GET as long as the meta tag is within 2K bytes of the beginning of
the document; otherwise, we do the double GET).
rpotts: what jrgm said... otherwise, we'd have seen a huge regression on ibench
times.
And duplicate stylesheets and script transclusions from frames in framesets? 
Not to hijack this bug with netwerk talk, now that we've punted it back
(correctly, IMO) to the parser guys -- hi, Harish! -- but it seems like this is
a correctness issue for more than just <meta> tags.  I don't see another bug in
which to beat this dying horse, but I'll be more than happy to take the
discussion to one that someone finds or files.
shaver: duplicate css and js loads are serialized.  hopefully, this is not too
costly in practice.
>Do we understand the situations where the 'meta charset sniffer' is failing --
>thus forcing us to reload the document?
>
>our current sniffing code looks at the first buffer of data for a
>meta-charset... so, i'm assuming that in situations where we reload, the server
>has sent us a 'small' first packet of data...
>
>is this *really* the case, or has our sniffer broken?
First of all, the sniffing code was originally designed as an "imperfect"
performance tuning for most of the cases, not a perfect general solution for
all cases. You are right, it only looks at the first block. And it is possible
in theory that the meta tag can appear thousands of bytes in (I have seen large
js blocks in front of it before).
Second, even if the meta sniffing works correctly, we still need the reload
mechanism to work correctly for the charset-detector reload (which examines
bytes and does frequency analysis). Turn "character set: auto-detect" from
"(Off)" to "All" and visit some non-Latin-1 text file and you will see the
reload kick in.


add shanjian
Why do we need to reload for the charset sniffer?  Can't it just look at text
runs and attribute values to do frequency analysis, and then perform the in-place
switchover described above?  The document structure had better not change due to
a charset shift, or there's nothing we can do without an explicit and correct
charset value in the headers.
I expect a reload hack will always be "easier" than fixing up a bunch of
strings in content node members.  Cc'ing jst.  But easier isn't always better
(nor is worse, always).  If we can avoid reloads, let's do it.

ftang, is it possible shaver's canadian e-commerce website POST data reloads are
due to universalchardet and not a sniffer failure?

/be
A great test case is this: go to the URL I just added, and click:
  [Click Here to Generate Currency Table]


You'll be treated to _four_ POST data alerts, two of each type.

(For bonus marks, just _try_ to use the back button or alt-left to go back in
history.)
Correct me if I'm wrong:

There is code in nsObserverBase::NotifyWebShell() to prevent reload for POST data.
If meta charset sniffing fails then we fall back on the tag-observer mechanism
to |reload| the document with a new charset. However, the code in nsObserverBase
would make sure that we don't reload POST data. Therefore, we should never (I
think) encounter the double-submit problem. The drawback, however, is that the
document wouldn't get the requested charset.
I don't think it's even possible to always correctly do the fixup in the content
nodes after we realize what the charset really should be. The bad conversion
that already happened could've actually lost information if there were
characters in the stream that were not convertible to whatever charset we
converted to, right?
there has to be some way to do this without going back to the cache/network for
the data.  remember: the cache isn't guaranteed to be present.  we need a
solution for this bug that doesn't involve going back to netlib.
jst: Is there a way to throw away the content nodes that got generated before
encountering a META tag with charset, without reloading the document?
jst: aren't we storing text and attributes as UCS2 -- unless they were
all-ASCII, in which case we can trivially reinflate.  From either of those
conditions, I think we should be able to reconstruct the original (on-wire) text
runs, if we haven't thrown away the original charset info, and then re-decode
with the new character set.

I thought, and that paragraph is very much predicated on this belief, that we
only converted to "native" charsets at the borders of the application: file
names, font/glyph work for rendering, etc.  If that's not the case, I will just
go throw myself into traffic now.
If the conversion from the input stream to unicode (using the default charset)
and back to what we had in the input stream is reliably doable, then yes, we
could convert things back and re-convert once we know the correct charset. But
I'm not sure that's doable with our i18n converter code... ftang, thoughts?
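A minimal sketch of the proposed round trip, in Python purely for illustration
(the real fix would have to live in Gecko's converter code). It assumes the
initial decode used ISO-8859-1, which, as Python implements it, maps every byte
value, so the wire bytes survive the detour through Unicode:

# Bytes on the wire: "Привет" encoded as KOI8-R, but no charset declared.
wire_bytes = "Привет".encode("koi8-r")

# The parser's default guess was ISO-8859-1; every byte maps, so the
# result is mojibake but no information is lost yet.
mojibake = wire_bytes.decode("iso-8859-1")          # 'ðÒÉ×ÅÔ'

# A late <meta> announces koi8-r: re-encode with the old charset to
# recover the wire bytes, then decode with the new one.
fixed = mojibake.encode("iso-8859-1").decode("koi8-r")
assert fixed == "Привет"

The caveat jst raises is exactly the case where this assumption fails; the
discussion of unassigned code points below spells it out.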
Harish, yes, we can reset the document and start over if we have the data to
start over from.
Strange... I don't see any reposts on the www.xe.net URL when I click on
the [Click Here to Generate Currency Table] button.

I'm using Mozilla0.9.8 on Linux.

But when I choose to View-source on the generated table it is blank.

No longer depends on: 81724
I'm underwater with 0.9.9 reviews and approvals, but I wanted to toss this up
for discussion.  If people agree that it's a viable path, there are a bunch of
improvements that can be made as well: text nodes already know if they're
all-ASCII, for example, though I don't know how to ask them.

Big issues that I need assurance on:
 - all parts of the DOM can handle having their cdata/text/attribute string 
   values set, including <script>, and will DTRT.  (I fear re-running scripts!)

 - the entire concept of re-encoding in the old charset and then decoding with
   the new one is viable.  (Ignore, for now, the XXX comment about the new 
   buffer having more characters.)

Be gentle, but be thorough!
Keywords: mozilla1.0+
Keywords: mozilla1.0
I can't say I had a close look or anything, but I really like this approach; this
would be light years ahead of what we have now (assuming it actually works, that
is :-)
We need Ftang's input too.
The proposal doesn't cover encoding the unparsed data with the correct charset (
new charset ). 

Btw, if the approach works, can we remove the META tag sniffing code?
Yes, you're right: I forgot to mention that we need to update the parser's
notion of current charset.

smontagu has made me nervous about roundtripping illegal values.  I'm hoping
he'll say more here.

Mike
*** Bug 129074 has been marked as a duplicate of this bug. ***
Shaver is working on this. Mike, should I assign this bug to you?
I wish I had paid attention to this bug earlier. I suggested the same approach 
when ftang first explained mozilla's doc charset handling. However, I have to say 
that the final patch might be more complicated than shaver's patch. 

Is it possible to convert text back from unicode to the current character 
encoding and reconvert to unicode with the new encoding? I want to share some of 
my understanding. Theoretically, the answer is NO. It is true (or practically 
true) that the unicode charset covers almost all the native charsets we can 
encounter today. But not all code points in a non-unicode encoding are valid. 
For example, in iso-8859-1, code point 0x81 is not defined. If the incoming data 
stream is encoded in win1251, 0x81 is a valid code point. Suppose somehow we use 
iso-8859-1 to interpret the text data; code point 0x81 will be converted to 
unicode u+fffd. When we later try to convert this code point back, there is no 
way to figure out where it came from.  I believe this is the only scenario we 
need to worry about. (It is possible that for some encodings, more than one code 
point maps to a single unicode code point. If that is the case, it is a bug and 
we can always fix it in the unicode conversion module.)

I could not figure out a perfect solution to this problem at this time, but I 
would like to suggest 2 approaches for further discussion.
1) Could we always buffer the current page? Probably inside the parser?
2) We can use a series of unassigned code points in unicode for unassigned code 
points and change our charset mapping table. The aim is to make charset 
conversion round-trip for any character. For single-byte encodings, we have 
at most 256 code points, and most of them should be assigned. For multi-byte 
encodings, we can interpret illegal byte sequences byte by byte. This practice 
must be kept internal as much as possible. This should make mike's approach 
feasible.
(We can't ignore the existence of illegal code points on many websites. In many 
cases, an illegal code point usually suggests a wrong encoding. To interrupt the 
process when meeting an invalid code point does not seem like a good idea.)
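A Python illustration of the failure mode described above, using windows-1252
(whose byte 0x81 is unassigned, like the iso-8859-1 converter behaviour
described here):

# 0x81 is a valid code point in windows-1251 (the letter Ѓ)...
wire_byte = b"\x81"
assert wire_byte.decode("cp1251") == "\u0403"        # Ѓ

# ...but it is unassigned in windows-1252, so decoding with the
# wrong guess replaces it with U+FFFD:
garbled = wire_byte.decode("cp1252", errors="replace")
assert garbled == "\ufffd"

# The round trip back is now impossible; the original byte is gone.
try:
    garbled.encode("cp1252")
except UnicodeEncodeError:
    print("0x81 cannot be recovered")

Approach 2 above amounts to mapping each unassigned byte to a reserved code
point instead of U+FFFD, so the mapping stays reversible.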


Blocks: 116217
Blocks: 105636
This happens not only with Russian. Set any autodetection and you'll see it.
ADT2 per ADT triage.
Whiteboard: [ADT2]
*** Bug 116217 has been marked as a duplicate of this bug. ***
updating summary
Summary: <meta> with charset should reload from cache, not server → <meta> with charset should NOT cause reload
*** Bug 131524 has been marked as a duplicate of this bug. ***
*** Bug 131966 has been marked as a duplicate of this bug. ***
Shaver: What's the status on this? Can this be done in the 1.0 time frame? If
not, let's move it to a more realistic milestone.
Looks like this is definitely not going to make it to the m1.0 train ( Shaver? ).
Giving bug to Shaver so that he can target it to a more realistic milestone. 
Assignee: harishd → shaver
*** Bug 135852 has been marked as a duplicate of this bug. ***
*** Bug 129196 has been marked as a duplicate of this bug. ***
Attempt to reduce dupes by adding "twice" and "two"
Summary: <meta> with charset should NOT cause reload → <meta> with charset should NOT cause reload (loads twice/two times)
*** Bug 117647 has been marked as a duplicate of this bug. ***
*** Bug 139659 has been marked as a duplicate of this bug. ***
*** Bug 102407 has been marked as a duplicate of this bug. ***
Adding topembed to this one since Bug 102407 was marked a duplicate of this. Many
sites from the evangelism effort demonstrate the POSTDATA popup problem. See
more info in Bug 102407. 

Adding jaimejr and roger.
Keywords: topembed
topembed+.  carrying topembed+ over from Bug 102407. 
Keywords: topembed → topembed+
Seems like a few customers are interested in this one getting fixed soon. What
are the chances we could have a fix in the next week?
Take note of bug 81253. It definitely wasn't a complete fix for this bug, but it
dealt with the 90% case. Specifically, we do not reload if the META tag is in
the first buffer delivered by the server. Can someone confirm that the new bugs
are cases where the META is not in the first buffer? Or did the code change from
81253 just rot?
> Or did the code change from 81253 just rot?

It ain't rotten. 
As vidur notes, for the most common case, if the document returned by GET or 
POST has a "<meta http-equiv='Content-Type' content='text/html; charset=...'>"
within the first ~2k returned (and not beyond that point), then we do not 
re-request the document.

The other bugs that have been marked recent dupe's involve charset 
auto-detection and/or more elaborate form submission scenarios.
We may have to choose not to fix this in the 1.0 time frame because of the
complexity and risk. But we have to fix it sooner or later. It is just
unacceptable for websites that lack a meta charset but involve form submission. 
Yeah, the roundtripping of illegal values makes this turn into something like
rocket science.  I haven't had good success getting i18n brains on this one, no
doubt because they're swamped with 1.0/nsbeta issues as well.

Let's reconvene for 1.1alpha.
Status: NEW → ASSIGNED
Target Milestone: mozilla1.0 → mozilla1.1alpha
It's important to remember that the patch to
http://bugzilla.mozilla.org/show_bug.cgi?id=81253 looks for the META-charset in
the first buffer of data.  This is at most ~2-4k, however, it is whatever the
network hands out to the parser...  It 'could' be much less...

Perhaps, some of the remaining problems are due to servers which return a much
smaller block of data in the first response buffer...
-- rick
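For illustration, here is a toy Python version of such a first-buffer sniffer
(hypothetical code, much simpler than the real parser's): if the server's first
packet ends before the <meta>, the sniff misses, and a later <meta> triggers
exactly the reload this bug is about.

import re

# Hypothetical, simplified stand-in for the parser's first-buffer sniffer.
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)

def sniff_charset(first_buffer, window=2048):
    """Return the charset named by a <meta> within the window, else None."""
    match = META_CHARSET.search(first_buffer[:window])
    return match.group(1).decode("ascii") if match else None

# Found: the <meta> arrives inside the first buffer.
assert sniff_charset(b'<head><meta charset="koi8-r">') == "koi8-r"

# Missed: a small first packet ends before the <meta>, so the parser
# has to guess, and the late <meta> later forces a reload.
assert sniff_charset(b"<head><title>hi</title>") is None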
rick: yup that would also cause problems, but i think a large part of the
problem has to do with charset detection.  say there is no meta tag... if we
don't know the charset of the document, and we try to sniff out the charset,
then there'll always be a charset reload.  that seems like the killer here IMO.
 it seems like we hit this problem *a lot* when "auto-detect" is enabled.
Sorry for the spam, but any chance of fixing it? 

It's very annoying when using character set autodetection, and Russians etc. must
use this feature. I heard many questions about this problem in Moz 1.0 PRx and
Netscape 7.0 PR1...
Whiteboard: [ADT2] → [ADT2 RTM]
A short term solution that we're considering is:
1) The default charset for a page should be set to the charset of the referrer
(both in the link and the form submission case). This is dealt with by bug 143579.
2) Auto-detection should not happen when there's POST data associated with a page. 

Some pages may not be rendered correctly, but this solution should deal with the
common case. Reassigning this bug to Shanjian.
Assignee: shaver → shanjian
Blocks: 141008
Status: ASSIGNED → NEW
I am going to handle this problem in bug 102407 using the above approach, and
leave this bug open for the future. 
No longer blocks: 102407, 141008
Jaime, you might want to remove some keywords in this bug. 
thanks shanjian! 

removing nsbeta1+/[adt2 RTM], and strongly suggest drivers remove mozilla1.0+,
and EDT remove topembed+, as the short-term solution (safer, saner) will be
addressed in bug 102407, relegating this issue to WFM, or just edge cases.
Keywords: nsbeta1+
Whiteboard: [ADT2 RTM]
Just porting my comment from bug 102407:

Why can't you keep loading the document to the end even
though the meta charset says it's in another charset, and, after the document has
finished, reload the document from the cache in the same way as viewing source
(finally!) works? The performance could degrade, but at least
Mozilla would be doing the right thing -- and for big files, reloading the full
thing from the cache would be faster than loading from the server anyway.
Asynchronous loading to the cache would be cool, but it's needed for a feature
that isn't used that much. Performance can be increased later if *really* seen
as important, but how often does the charset change between page loads anyway?
9 times out of 10, I've seen this bug because automatic charset detection has
detected the charset incorrectly and reloads the document even though it should
be doing nothing.

I put up a little test at http://www.cc.jyu.fi/~mira/moz/moztest.php which uses
cookies to save the 7 last page loading times and changes charset every now and
then. And sends the meta charset after 2K. Automatic reloading can be seen as
subsecond reload times and flashing in the browser view.
No longer blocks: 105709
Removing topembed+.  As per comment #78
Keywords: topembed+
We should look at fixing this one for the next release, because it is a
performance issue.
Keywords: nsbeta1
Whiteboard: [adt2]
Target Milestone: mozilla1.1alpha → mozilla1.2beta
accepting. 
Status: NEW → ASSIGNED
*** Bug 158331 has been marked as a duplicate of this bug. ***
By the definitions on <http://bugzilla.mozilla.org/bug_status.html#severity> and
<http://bugzilla.mozilla.org/enter_bug.cgi?format=guided>, crashing and dataloss
bugs are of critical or possibly higher severity.  Only changing open bugs to
minimize unnecessary spam.  Keywords to trigger this would be crash, topcrash,
topcrash+, zt4newcrash, dataloss.
Severity: major → critical
No longer blocks: 105636
*** Bug 88701 has been marked as a duplicate of this bug. ***
adt: nsbeta1-
Keywords: nsbeta1 → nsbeta1-
Summary: <meta> with charset should NOT cause reload (loads twice/two times) → <meta> with charset and autodetection should NOT cause reload (loads twice/two times)
*** Bug 171425 has been marked as a duplicate of this bug. ***
*** Bug 77702 has been marked as a duplicate of this bug. ***
*** Bug 137936 has been marked as a duplicate of this bug. ***
Could we update the target milestone for this 3 year old bug?  I think we missed
1.2  ;-)
Assignee: shanjian → parser
Status: ASSIGNED → NEW
Priority: P2 → P1
Target Milestone: mozilla1.2beta → Future
*** Bug 235160 has been marked as a duplicate of this bug. ***
*** Bug 248610 has been marked as a duplicate of this bug. ***
Blocks: 236858
*** Bug 287569 has been marked as a duplicate of this bug. ***
Status report? This bug has been marked "Severity: Critical", "Priority: P1" and
has the keyword "dataloss", and still there hasn't been even a status update in
the last 3+ years?

Can somebody comment about how hard it would be to implement my suggestion in
comment #79? I haven't hacked on Mozilla's C++ source, so I have no idea.
Here's the suggested algorithm again, reworded:

1. In case the meta charset (or any other heuristic) tells Mozilla that it's
using an incorrect charset, raise a flag that the document is displayed with an
incorrect character set.
2. Regardless of this problem, keep going until the document has been fully
transferred so that Mozilla has a full copy of it in the cache.
3. Reload the page from the cache with the correct charset. (I'm hoping that the
cache has a *binary* copy of the transferred data, not something that has gone
through the parser and is therefore hosed anyway.) If the View Source feature
can work without reloading the POSTed page, then it should be possible to reload
from the cache, too.
thank you for volunteering
Assignee: parser → mira
Severity: critical → major
Keywords: helpwanted
Priority: P1 → P3
Assignee: mira → mrbkap
Priority: P3 → --
QA Contact: moied → parser
Target Milestone: Future → ---
Priority: -- → P3
Target Milestone: --- → mozilla1.9alpha
I can't seem to reproduce this on the site in the URL. Can someone please update the URL with a testcase that shows this?
Target Milestone: mozilla1.9alpha → Future
(In reply to comment #96)
> I can't seem to reproduce this on the site in the URL. Can someone please
> update the URL with a testcase that shows this?

I was about to change the url from http://www.xe.net/ict/ to one I mentioned in comment #79 (http://www.cc.jyu.fi/~mira/moz/moztest.php) but I wasn't allowed to. The test case changes between iso-8859-1 and iso-8859-15 every second. Hit "Reload page via GET" link a couple of times (wait a few seconds between tries) to see the problem. The test case uses cookies for timing the requests from a single browser. You should be able to see the euro sign when the page text says "iso-8859-15" and there should be a generic currency sign when page text says "iso-8859-1". With GET this is true (the page loads twice if there's a problem) whereas with POST you get incorrect rendering. I have View - Character Encoding - Auto-Detect - (Off) set in case that matters.

I still see the problem with Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9a1) Gecko/20051108 Firefox/1.6a1.

A workaround for this bug is to include the meta charset declaration in the first 2048 bytes of a file.
Blocks: 339459
No longer blocks: 339459
hi,

i can confirm the problem with the new testcase, testing with

Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3 ID:2007030919

and

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9a4pre) Gecko/20070417 Minefield/3.0a2pre ID:2007041704 [cairo]
I'm still seeing this problem on Firefox 2.0.0.14 ... 

I use a page with 
<META HTTP-EQUIV="Content-Language" CONTENT="ro">
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-2">

and all pages that contain this are requested twice from the server the first time they are requested. If they have already been requested during the current session, the server receives only one request. 

This error is a big problem for me, since on generated pages, data is processed twice and I get incorrect data in the database because of this... 

The only workaround so far was to 
bug 463175 is reporting the same reload-twice problem. Always reproducible. 
This bug still exists in 3.1.10.
This bug is serious, causes dataloss, and makes modalDialog stop working. 
It tells me the default behaviour does not support "extended" charsets, which it really should. 
If the page is somehow faulty, raise a flag and inform the user. This reload behaviour seems like a fix with good intentions, but it is not a good idea for any page that is dynamic, submits data back, or is used in an RIA.
Flags: blocking1.9.2?
This bug needs a test case. The test case should be on the web. That way we can see how other browsers interact with the same web page. The test case should clearly show how data loss occurs. It should also show how the modal dialog stops working. Any other defects should be clearly shown. There should be "Expected Behavior" and "Actual Behavior." The test case should be entered into the URL field of this bug report.
Check bug 463175. I made a test page where you can see the load-twice behaviour. The test page is in the URL field for that bug. That's where I thought it should go; I am new to this forum, still trying to get my head around how it works...
Here is that page:
http://beta.backbone.se/tools/test/loadtwice/loadtwice.html 
Simple, reproducible, happens every time, no swapping of META tags needed as suggested above. I do not know why the URL for this bug points to xe.com.

Here is how showModalDialog stops working: When the modal dialog is first opened, the arguments are fine. But since the double-load behaviour then loads the page once more, the arguments are lost (returns null). This is actually how we first found this bug. It took a long time to backtrack that this was actually happening to ALL non-cached pages in FF without a charset tag. Difficult to believe, but there it is. 
Data loss can occur in all sorts of ways. See the thread above. 

Since I am a bit taken aback by this, I would like to emphasize three side effects that might speed things up on your end (I guess if the right people read it). I take the chance of barking up the wrong tree, and if I do, please accept my apologies. 
Three serious side effects I can think of:
1. Developers need to include the correct charset tag in ALL pages.
This is actually something we should have done in the first place. But for a big site or system, this is months of hard work for any developer. 
Ironically this includes all(?) pages in this forum ;-)
2. Performance. New pages load twice; dynamic pages using ids in urls load twice.
3. Browser statistics are wrong. A large chunk of the FF penetration figures should be taken out. This might be the most serious.

Since I like irony, I will make a little experiment by writing the letter ä in this comment. Voila, this page is hit twice the first time it is requested by FF! Same as the page with bug 463175. 

For the developer who cannot wait for this bug to be fixed, here is what we had to do (fixes side effect 1 above):
1. Convert all pages to UTF-8 
2. Pray that the developer tool or HTML editor you use has support for UTF-8
3. Place the correct charset meta tag in the header of all pages
4. If your webserver is IIS, all parameters must be URI-encoded or you lose all extended characters
5. Rewrite your cookie routines to support extended characters as well
This took us four months, and still not everything is in place :-(
Hope this helps.
Good luck!
I suggest renaming this bug to:
<meta> with charset and autodetection OR charset missing, should NOT cause reload (loads twice/two times)
Testcase: 

1. Set Character Encoding to Auto Detect. 
2. Go to URL: http://beta.backbone.se/tools/test/loadtwice/loadtwice.html

Expected Results: Page loads once

Actual Results: Page loads twice
Flags: wanted1.9.2?
Keywords: testcase
Summary: <meta> with charset and autodetection should NOT cause reload (loads twice/two times) → <meta> tag with charset and autodetection OR charset missing, should NOT cause reload (loads twice/two times)
Unfortunately I don't think we can fix this for 1.9.2 as this is far from a trivial problem to fix, and we don't have anyone right now with the time to spend on this.

However, if people feel there's value in making the effects of this when it comes to showModalDialog() go away (i.e. if we preserve dialog arguments across reloads), I think we could do *that* for 1.9.2.

I'd like to hear what people think about doing the showModalDialog() part of this only for 1.9.2. I know it sucks to not fix the underlying problem here now, but as I said, it's not very easy to fix in our code, and I'd rather see us fix this for the HTML5 parser than worrying about it in the current parser code. Leaving this nominated for now until I hear some thoughts here.
Assignee: mrbkap → nobody
The HTML5 spec prescribes reparsing when the <meta> is so far from the start of the file that the prescan doesn't find it.

As for chardet, I've made the HTML5 parser only run chardet (if enabled) over the prescan buffer, so chardet-related reparses should be eliminated. However, the HTML5 parser needs more testing in CJK and Cyrillic locales to assess whether the setup is good enough.
Flags: wanted1.9.2?
Flags: wanted1.9.2-
Flags: blocking1.9.2?
Flags: blocking1.9.2-
Information for Web authors seeing this problem and finding this report here in Bugzilla:

This problem can be 100% avoided by the Web page author by using HTML correctly as required by the HTML specification. There are three different solutions any one of which can be used:

 1) Configure your server to declare the character encoding in the Content-Type HTTP header. For example, if your HTML document is encoded as UTF-8 (the preferred encoding for Web pages), make your servers send the HTTP header
Content-Type: text/html; charset=utf-8
instead of
Content-Type: text/html

This solution works with any character encoding supported by Firefox. (A minimal server sketch illustrating this follows the list of solutions.)

OR

 2) Make sure that you declare the character encoding of your HTML document using a "meta" element within the first 1024 bytes of your document. That is, if you are using UTF-8 (as you should considering that UTF-8 is the preferred encoding for Web pages), start your document with
<!DOCTYPE html>
<html>
  <head>
    <meta charset=utf-8>
    <title>…
and don't put comments, scripts or other stuff before <meta charset=utf-8>.

This solution works with any character encoding supported by Firefox except UTF-16 encodings, but UTF-16 should not be used for interchange anyway.

OR

 3) Start your document with a BOM (byte order mark). If you're using UTF-8, make the first three bytes of your file be 0xEF, 0xBB, 0xBF. You probably should not use this method unless you're sure that the software you are using won't accidentally delete these three bytes.

This solution works only with UTF-8 and UTF-16, but UTF-16 should not be used for interchange anyway, which is why I did not give the magic bytes for UTF-16.
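
As a concrete illustration of solution 1 above, here is a minimal Python server
(a hypothetical example, not Mozilla code) that declares the encoding in the
Content-Type header:

from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = "<!DOCTYPE html><html><head><title>Ж</title></head></html>"

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = PAGE.encode("utf-8")
        self.send_response(200)
        # The charset parameter is what makes any reload unnecessary:
        # Firefox never has to guess, so it never has to re-decode.
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), Handler).serve_forever()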

- -

As for fixing this:

This bug is WONTFIX for practical purposes, since fixing it would take a substantial amount of work for very little gain. Anyone capable of fixing this will probably always have higher-priority things to work on.

But if this was to be fixed, the first step would be figuring out what WebKit and IE do. Without actually figuring that out, here are a couple of ideas how this could be fixed: 

1) In the case of a late meta, if we want to continue to honor late metas (which isn't a given), we should keep the bytes that the HTML parser has already consumed and keep consuming the network stream into that buffer, while causing the docshell to do a renavigation without hitting the network again, instead restarting the parser with the buffer mentioned earlier in this sentence.

2) In the case of chardet, it might be theoretically possible to replace chardet with a multi-encoding decoder with an internal buffer. The decoder would work like this: As long as the incoming bytes are ASCII-only, the decoder would immediately emit the corresponding Basic Latin characters. Upon seeing a non-ASCII byte, the decoder would accumulate bytes into its internal buffer until it can commit to a guess about their encoding. Upon committing to the guess, the decoder would emit its internal buffer decoded according to the guessed encoding. Thereafter, the decoder would act just like a normal decoder for that encoding. 

But it would be a bad idea to pursue these ideas without first carefully finding out what WebKit and IE do. I hear that WebKit gets away with much less complexity in this area compared to what Gecko implements.
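
A toy Python sketch of idea 2, assuming a deliberately naive guesser (the first
candidate encoding that decodes the buffer cleanly) where a real implementation
would use statistical detection, and ignoring multi-byte sequences split across
chunks:

class GuessingDecoder:
    """Emit ASCII immediately; buffer after the first non-ASCII byte
    until enough bytes have accumulated to commit to an encoding guess."""

    def __init__(self, candidates=("windows-1251", "windows-1252"), commit_at=64):
        self.candidates = candidates
        self.commit_at = commit_at
        self.buffer = bytearray()
        self.encoding = None          # set once we commit to a guess

    def _guess(self):
        # Naive placeholder: first candidate that decodes cleanly wins.
        for enc in self.candidates:
            try:
                bytes(self.buffer).decode(enc)
                return enc
            except UnicodeDecodeError:
                continue
        return self.candidates[-1]

    def feed(self, chunk):
        if self.encoding is not None:             # committed: act normally
            return chunk.decode(self.encoding, "replace")
        out = []
        for i, b in enumerate(chunk):
            if not self.buffer and b < 0x80:
                out.append(chr(b))                # ASCII passes straight through
            else:
                self.buffer += chunk[i:]          # start (or keep) buffering
                break
        if len(self.buffer) >= self.commit_at:    # enough evidence: commit
            self.encoding = self._guess()
            out.append(bytes(self.buffer).decode(self.encoding, "replace"))
            self.buffer.clear()
        return "".join(out)

d = GuessingDecoder(commit_at=4)
assert d.feed(b"abc") == "abc"                  # ASCII emitted immediately
assert d.feed(b"\xcf\xf0\xe8\xe2") == "Прив"    # buffered, guessed windows-1251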
Alias: latemeta
Severity: major → enhancement
Priority: P3 → --
Summary: <meta> tag with charset and autodetection OR charset missing, should NOT cause reload (loads twice/two times) → Late charset <meta> or autodetection (chardet) should NOT cause reload (loads twice/two times)
Whiteboard: [adt2] → Please read comment 110.
> I can't seem to reproduce this on the site in the URL. Can someone please
> update the URL with a testcase that shows this?

> There is code in nsObserverBase::NotifyWebShell() to prevent reload for POST data.

A great test case is this: go to the URL I just added, and click to get the currency table:
  https://www.timehubzone.com/currencies
Flags: needinfo?(datehubzone)
> There is code in nsObserverBase::NotifyWebShell() to prevent reload for a POST data.

It should still be possible to reproduce this in the case of GET requests.

> A great test case is this: go to the URL I just added, and click to get currency table
> https://www.timehubzone.com/currencies

That site declares the encoding both on the HTTP layer and in early <meta>, so it shouldn't be possible to see this reload case there.

In general, I don't expect us to add complexity to cater for this long-tail legacy issue. If we want to never reload, we should revert bug 620106 and then stop honoring late <meta>, too.

Moving open bugs with topperf keyword to triage queue so they can be reassessed for performance priority.

Performance Impact: --- → ?
Keywords: topperf

The late meta aspect was fixed in bug 1701828.

The page can still be reloaded in the case where it doesn't declare an encoding and the detector guess at the end of the stream differs from the guess made at </head>. The telemetry for how often this happened expired and I've been too busy on other things to reinstate telemetry in this area.

In any case:

  1. Any page can avoid this perf problem by declaring the encoding, and pages that people browse the most declare their encoding.
  2. Even before bug 1701828, which extended the number of bytes that are considered for the initial guess, the detector-triggered reload case affected less than 1.5% of unlabeled page loads globally.

I think it's not useful to try to eliminate the remaining reload case, since it's better for pages to be readable than performantly unreadable.

I'm leaving this bug open for checking that the reload comes from the cache, though.

Severity: normal → S4
Flags: needinfo?(datehubzone)
Keywords: dataloss
Priority: -- → P5
Summary: Late charset <meta> or autodetection (chardet) should NOT cause reload (loads twice/two times) → Make sure that chardetng-triggered encoding reload is read from the cache
Whiteboard: Please read comment 110. → Please read comment 115.
Performance Impact: ? → ---
Restrict Comments: true
Duplicate of this bug: 322181
No longer blocks: 116217, 236858, 338176
Depends on: 27006
See Also: → 1701828
No longer blocks: 134029
Duplicate of this bug: 134029
No longer duplicate of this bug: 322181