Closed Bug 112793 Opened 23 years ago Closed 23 years ago

fine tune frame charset mechanism

Categories

(Core :: Internationalization, defect)

defect
Not set
major

Tracking

()

VERIFIED FIXED
mozilla0.9.8

People

(Reporter: kleist, Assigned: shanjian)

References

()

Details

(Keywords: intl)

Attachments

(1 file, 5 obsolete files)

Build ID: 2001 11 29 03. Windows 2000, locale: English (United States).
View => Character Encoding: Western (ISO-8859-1).

Visit the given page and note that many characters are incorrectly rendered
as a question mark ("?"). This seems to happend to all non-ASCII letters
such as the Swedish åäöÅÄÖ ( å ä ö Å Ä &Oring ).
These question marks are also visible when doing File => View Frame Source.

Both Opera 5.1 and IE 5.5 works fine.
->Il8n
Assignee: harishd → yokoyama
Status: UNCONFIRMED → NEW
Component: Parser → Internationalization
Ever confirmed: true
QA Contact: moied → teruko
Happens on Linux too.

The server sends:

Content-Type: text/html;charset=646

So we show the page using the US-ASCII charset.  Then since it's a framed page
we can't switch the character coding on the subframes...
OS: Windows 2000 → All
Hardware: PC → All
Keywords: intl
Keywords: nsbeta1
>Then since it's a framed page
>we can't switch the character coding on the subframes...
Shouldn't the forced charset propagate to child frames (bug 43529)?

Teruko: can you check 43529?

assiging to nhotta for frame charset issue
and cc shanjian
Assignee: yokoyama → nhotta
Override character coding menu works in frame.- bug 43529.

This case is http charset with frame.  

I think this is also a charset override problem.
The header charset, 'charset=646' is wrong, and override does not work for the
frame (same problem as bug 43529). It does not matter how the page's charset got
wrong (wrong META, wrong HTTP,...) but the problem is that override does not work.
Shanjian, please take a look at this.
Assignee: nhotta → shanjian
I am working on a partial fix right now. We need to implement some fine tune 
mechanism to play with charset. 
Status: NEW → ASSIGNED
Attached patch patch (obsolete) — Splinter Review
This patch implement a mechanism to not only get charset from parent, but also 

find out how parent get its charset from. This is needed for several other 
frame charset related problem. 
Charset resolution in "nsHTMLDocument::StartDocumentLoad" was also rewritten
with 
several newly defined function. The purpose is to make code more readable and
runs
faster. We don't really need to try all sources once a prior source return  a 
charset result.
Target Milestone: --- → mozilla0.9.8
Attached patch update patch (obsolete) — Splinter Review
Attachment #61209 - Attachment is obsolete: true
Roy,
Do you have time to review my patch in near future? This bug lays the foundation 
for a bunch of other frame related charset issues. Since there are several file 
involved, I would like it to check it in ASAP, otherwise I have to spend more 
time to sync with tree.
Summary: Non-ASCII characters become "?", both as rendered and in source → fine tune frame charset mechanism
Comment on attachment 61280 [details] [diff] [review]
update patch

/r=yokoyama.
I always wanted to clean up the charset detection in
nsHTMLDocument::StartDocumentLoad
Attachment #61280 - Flags: review+
chris, could you sr?
1. Why don't you get rid of the nsCharsetSource enum. The casting
   between PRInt32 and nsCharsetSource scares me, especially since it is
   possible that we'll end up with a narrower type for enums on some
   platforms.

2. The ``goto'' stuff isn't really necessary is it? Why not just do a
   cascading if/else if/else if...; e.g.,

   if (...) {
     // Use the user-specified override super-duper magic charset
   }
   else if (...) {
     // Use the default guesstimated overdrive charset
   }
   else if (...) {
     // This charset fell from space, let's use it!
   }

3. I didn't examine the code carefully to make sure you've preserved the
   logic; I assume that you did. Right?

Also cc'ing jst: any comments on this approach?
chris,
>>1. Why don't you get rid of the nsCharsetSource enum. The casting
>>   between PRInt32 and nsCharsetSource scares me, especially since it is
>>   possible that we'll end up with a narrower type for enums on some
>>   platforms.
I have to change additional 12 files to get rid of enum, but I could not see 
any gain doing that. It should be always safe to convert from enum to PRUint32. 
In my patch, I never need to convert/cast PRUint32 back to enum. All I need to 
do is a comparison. So nothing scares me here. 

>>2. The ``goto'' stuff isn't really necessary is it? Why not just do a
>>   cascading if/else if/else if...; e.g.,
There seems to be no easy if/else solution here. Try it yourself and you will see 
what I mean. 

>>3. I didn't examine the code carefully to make sure you've preserved the
>>   logic; I assume that you did. Right? 
Yeh.
> I have to change additional 12 files to get rid of enum, but I could not see
> any gain doing that. It should be always safe to convert from enum to PRUint32.

But it's not always safe convert from a |char *| to a |PRUint32 *|, which is
what you might be doing. Some architectures do not allow unaligned 32-bit access
to word data; some compilers do not use a full word to store an enum.

> There seems to be no easy if/else solution here. Try it yourself and you
> will see what I mean. 

How about:

  if (! TryUserForcedCharset(muCV, dcInfo, charsetSource, &charset)) {
    TryHintCharset(muCV, charsetSource, &charset);
    TryParentCharset(dcInfo, charsetSource, &charset);
  
    if (TryHttpHeaderCharset(httpChannel, charsetSource, &charset)) {
      // Use the header's charset.
    }
    else if (scheme && nsCRT::strcasecmp("about", scheme) &&
             TryBookmarkCharset(&urlSpec, charsetSource, &charset)) {
      // Use the bookmark's charset.
    }
    else if (cacheDescriptor && urlSpec &&
             TryCacheCharset(cacheDescriptor, charsetSource, &charset)) {
      // Use the cache's charset.
    }
    else if (TryWeakDocTypeDefault(charsetSource, &charset)) {
      // Use the weak doc type default charset
    }
    else {
      // Use the user's default charset.
      TryUserDefaultCharset(muCV, charsetSource, &charset);
    }
  }

  if(kCharsetFromAutoDetection > charsetSource)
    StartAutodetection(docShell, &charset, aCommand);

or:

  if (! TryUserForcedCharset(muCV, dcInfo, charsetSource, &charset)) {
    TryHintCharset(muCV, charsetSource, &charset);
    TryParentCharset(dcInfo, charsetSource, &charset);
  
    if (TryHttpHeaderCharset(httpChannel, charsetSource, &charset) ||
        (scheme && nsCRT::strcasecmp("about", scheme) &&
         TryBookmarkCharset(&urlSpec, charsetSource, &charset)) ||
        (cacheDescriptor && urlSpec &&
         TryCacheCharset(cacheDescriptor, charsetSource, &charset)) ||
        TryWeakDocTypeDefault(charsetSource, &charset) ||
        TryUserDefaultCharset(muCV, charsetSource, &charset)) {
      // Found a charset.
    }
  }

  if(kCharsetFromAutoDetection > charsetSource)
    StartAutodetection(docShell, &charset, aCommand);

Attached patch update patch (obsolete) — Splinter Review
Chris,
1, "SetDocumentCharacterSetSource(charsetSource);" is the only place where we 
  cast enum to PRUint32. (No, it is not enum* to PRUint32*). It should be safe.
2, You convinced me that there is a solution. (I always has a fuzzy memory that 
  some complilers will complain about empty block. After I check some C grammar
  book this morning, I find that I might be wrong.) Though I still believe 
  my original approach is much readable and maintenancable, I updated my patch 
  as you wish. 
Attachment #61280 - Attachment is obsolete: true
Attachment #62375 - Flags: review+
Comment on attachment 62375 [details] [diff] [review]
update patch

This is the code that I'm worried about

>+// The following Try*Charset will return PR_FALSE only if the charset source 
>+// should be considered (ie. aCharsetSource < thisCharsetSource) but we failed 
>+// to get the charset from this source. 
>+
>+PRBool 
>+nsHTMLDocument::TryHintCharset( nsIMarkupDocumentViewer* aMarkupDV,
>+                            nsCharsetSource& aCharsetSource, 
>+                            nsAutoString* aCharset)
>+{
>+  nsCharsetSource requestCharsetSource;
>+  nsresult rv;
>+
>+  if (aMarkupDV) {
>+    rv = aMarkupDV->GetHintCharacterSetSource((PRInt32*)(&requestCharsetSource));

Because GetHintCharacterSetSource is implemented as:

  NS_IMETHODIMP DocumentViewerImpl::GetHintCharacterSetSource(PRInt32
*aHintCharacterSetSource)
  {
    NS_ENSURE_ARG_POINTER(aHintCharacterSetSource);
    *aHintCharacterSetSource = mHintCharsetSource;
    ...

(see
<http://lxr.mozilla.org/seamonkey/source/content/base/src/nsDocumentViewer.cpp#
5462>.)

It is possible that the caller of TryHintCharset could pass a reference to
aCharsetSource that is not 32-bit aligned if sizeof(nsCharsetSource) !=
sizeof(PRInt32). I've seen this cause problems with PRPackedBools
before: you're just laying a trap that somebody else may eventually
stumble into. (And if it's only eight files you have to touch, then
just create another code cleanup bug and mark this bug dependent on
that one.)
In fact, sfraser was talking in another bug of enabling the code warrior option
to pack enums into the smallest suitable integral type, so the trap that this
patch would lay already has a victim in sight!

/be
Attached patch another update (obsolete) — Splinter Review
Attachment #62375 - Attachment is obsolete: true
enum has been replaced by PRInt32 as you wish. Please sr. 
Attachment #62469 - Flags: review+
Quick nit about the patch:

+nsHTMLDocument::TryHintCharset( nsIMarkupDocumentViewer* aMarkupDV,
+                            PRInt32& aCharsetSource, 
+                            nsAutoString* aCharset)

Loose the space before the first argument type, and indent the next-line
arguments so that they line up with the first argument. Do this for all your new
methods.
I guess I was thinking you could just |typedef PRInt32 nsCharsetSource|, but I
guess what you did is fine.

Now for the nits :-(.

Don't use |nsAutoString *|s as arguments; best would be |nsAString &|, then
|nsString &|, and at a minimum, |nsString *|; e.g., one place of many:

+PRBool 
+nsHTMLDocument::TryHintCharset( nsIMarkupDocumentViewer* aMarkupDV,
+                            PRInt32& aCharsetSource, 
+                            nsAutoString* aCharset)

(It's bizarre to me that you're passing an |nsAutoString *|, but a |PRInt32 &|.)

This cast should no longer be necessary:

+    rv = aMarkupDV->GetHintCharacterSetSource((PRInt32*)(&requestCharsetSource));

Why the funky indents here? (No hard tabs!)

+    *aCharset = forceCharsetFromWebShell;
+    Recycle(forceCharsetFromWebShell);
+
	//TODO: we should define appropriate constant for force charset
+
	aCharsetSource = kCharsetFromUserForced;  

And here:

+nsHTMLDocument::TryUserDefaultCharset( nsIMarkupDocumentViewer* aMarkupDV,
+                            PRInt32& aCharsetSource, 
+                            nsAutoString* aCharset)
+{
+
if(kCharsetFromUserDefault <= aCharsetSource) 
+    return PR_TRUE;
+
+
PRUnichar* defaultCharsetFromWebShell = NULL;
+  if (aMarkupDV) {

Does this comment still apply? Or ought it be moved closer to TryBookmarksCharset()?

+  // don't try to access bookmarks if we are loading about:blank...it's not going
+  // to give us anything useful and it causes Bug #44397. At the same time, I'm
loath to do something
+  // like this because I think it's really bogus that layout is depending on
bookmarks. This is very evil.

I think that this code could use a bit more commenting, as it is fairly magical
(why is the order what it is?), but I won't hold up a checkin for that. For
example, here:

+  if (! TryUserForcedCharset(muCV, dcInfo, charsetSource, &charset)) {
+    TryHintCharset(muCV, charsetSource, &charset);
+    TryParentCharset(dcInfo, charsetSource, &charset);
+    if (TryHttpHeaderCharset(httpChannel, charsetSource, &charset)) {

why don't we check the results of TryHintCharset and TryParentCharset? There's a
fair amount of state that passes from routine to routine in a squirrely way AFAICT.

Should we file another bug to further consolidate this code and move it outside
of nsHTMLDocument?

Finally, in the Try*Charset functions, comparison operators are annoyingly
inconsistent: sometimes the constant term is on the LHS, other times it's on the
RHS. Some uniformity here might be helpful to the untrained eye.
Attached patch one more update (obsolete) — Splinter Review
Attachment #62469 - Attachment is obsolete: true
Attachment #62497 - Attachment is obsolete: true
>> I think that this code could use a bit more commenting, as it is fairly magical
>> (why is the order what it is?), 
>> why don't we check the results of TryHintCharset and TryParentCharset? There's a
>> fair amount of state that passes from routine to routine in a squirrely way AFAICT.
I put some more comments there. That's one of the reason the I prefer my original 
goto version. The if/else version is less readable.

>>Should we file another bug to further consolidate this code and move it outside
>>of nsHTMLDocument?
Do you have anything in mind? 
Thanks for your patience with me, shanjian. I don't have any immediate
suggestions with respect to hoisting this stuff out of nsHTMLDocument, and if
jst is okay with it (which he said he was on IRC last night), I'm fine with it too.

Please make one more spin through your patch before checking in to remove any
hard tabs: I'm still seeing a few; e.g.,

+{
+
if(kCharsetFromUserDefault <= aCharsetSource) 
+    return PR_TRUE;
+
+
PRUnichar* defaultCharsetFromWebShell = NULL;


Do that, and sr=waterson
tab removed. 
fix checked in. 
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
QA Contact: teruko → ylong
Look at bug 112112 and the URL in this bug does not exists anymore.
Since the URL attached in this bug is not existing, and I can see the non-ascii
characters are displayed fine in the pages and sources in bug 112112. 
Mark as verified.
Status: RESOLVED → VERIFIED
*** Bug 103952 has been marked as a duplicate of this bug. ***
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: