Closed Bug 112793 Opened 23 years ago Closed 23 years ago

fine tune frame charset mechanism

Categories

(Core :: Internationalization, defect)

defect
Not set
major

Tracking

()

VERIFIED FIXED
mozilla0.9.8

People

(Reporter: kleist, Assigned: shanjian)

References

()

Details

(Keywords: intl)

Attachments

(1 file, 5 obsolete files)

Build ID: 2001 11 29 03. Windows 2000, locale: English (United States). View => Character Encoding: Western (ISO-8859-1). Visit the given page and note that many characters are incorrectly rendered as a question mark ("?"). This seems to happend to all non-ASCII letters such as the Swedish åäöÅÄÖ ( å ä ö Å Ä &Oring ). These question marks are also visible when doing File => View Frame Source. Both Opera 5.1 and IE 5.5 works fine.
->Il8n
Assignee: harishd → yokoyama
Status: UNCONFIRMED → NEW
Component: Parser → Internationalization
Ever confirmed: true
QA Contact: moied → teruko
Happens on Linux too. The server sends: Content-Type: text/html;charset=646 So we show the page using the US-ASCII charset. Then since it's a framed page we can't switch the character coding on the subframes...
OS: Windows 2000 → All
Hardware: PC → All
Keywords: intl
Keywords: nsbeta1
>Then since it's a framed page >we can't switch the character coding on the subframes... Shouldn't the forced charset propagate to child frames (bug 43529)? Teruko: can you check 43529? assiging to nhotta for frame charset issue and cc shanjian
Assignee: yokoyama → nhotta
Override character coding menu works in frame.- bug 43529. This case is http charset with frame.
I think this is also a charset override problem. The header charset, 'charset=646' is wrong, and override does not work for the frame (same problem as bug 43529). It does not matter how the page's charset got wrong (wrong META, wrong HTTP,...) but the problem is that override does not work. Shanjian, please take a look at this.
Assignee: nhotta → shanjian
I am working on a partial fix right now. We need to implement some fine tune mechanism to play with charset.
Status: NEW → ASSIGNED
Attached patch patch (obsolete) — Splinter Review
This patch implement a mechanism to not only get charset from parent, but also find out how parent get its charset from. This is needed for several other frame charset related problem. Charset resolution in "nsHTMLDocument::StartDocumentLoad" was also rewritten with several newly defined function. The purpose is to make code more readable and runs faster. We don't really need to try all sources once a prior source return a charset result.
Target Milestone: --- → mozilla0.9.8
Attached patch update patch (obsolete) — Splinter Review
Attachment #61209 - Attachment is obsolete: true
Roy, Do you have time to review my patch in near future? This bug lays the foundation for a bunch of other frame related charset issues. Since there are several file involved, I would like it to check it in ASAP, otherwise I have to spend more time to sync with tree.
Summary: Non-ASCII characters become "?", both as rendered and in source → fine tune frame charset mechanism
Comment on attachment 61280 [details] [diff] [review] update patch /r=yokoyama. I always wanted to clean up the charset detection in nsHTMLDocument::StartDocumentLoad
Attachment #61280 - Flags: review+
chris, could you sr?
1. Why don't you get rid of the nsCharsetSource enum. The casting between PRInt32 and nsCharsetSource scares me, especially since it is possible that we'll end up with a narrower type for enums on some platforms. 2. The ``goto'' stuff isn't really necessary is it? Why not just do a cascading if/else if/else if...; e.g., if (...) { // Use the user-specified override super-duper magic charset } else if (...) { // Use the default guesstimated overdrive charset } else if (...) { // This charset fell from space, let's use it! } 3. I didn't examine the code carefully to make sure you've preserved the logic; I assume that you did. Right? Also cc'ing jst: any comments on this approach?
chris, >>1. Why don't you get rid of the nsCharsetSource enum. The casting >> between PRInt32 and nsCharsetSource scares me, especially since it is >> possible that we'll end up with a narrower type for enums on some >> platforms. I have to change additional 12 files to get rid of enum, but I could not see any gain doing that. It should be always safe to convert from enum to PRUint32. In my patch, I never need to convert/cast PRUint32 back to enum. All I need to do is a comparison. So nothing scares me here. >>2. The ``goto'' stuff isn't really necessary is it? Why not just do a >> cascading if/else if/else if...; e.g., There seems to be no easy if/else solution here. Try it yourself and you will see what I mean. >>3. I didn't examine the code carefully to make sure you've preserved the >> logic; I assume that you did. Right? Yeh.
> I have to change additional 12 files to get rid of enum, but I could not see > any gain doing that. It should be always safe to convert from enum to PRUint32. But it's not always safe convert from a |char *| to a |PRUint32 *|, which is what you might be doing. Some architectures do not allow unaligned 32-bit access to word data; some compilers do not use a full word to store an enum. > There seems to be no easy if/else solution here. Try it yourself and you > will see what I mean. How about: if (! TryUserForcedCharset(muCV, dcInfo, charsetSource, &charset)) { TryHintCharset(muCV, charsetSource, &charset); TryParentCharset(dcInfo, charsetSource, &charset); if (TryHttpHeaderCharset(httpChannel, charsetSource, &charset)) { // Use the header's charset. } else if (scheme && nsCRT::strcasecmp("about", scheme) && TryBookmarkCharset(&urlSpec, charsetSource, &charset)) { // Use the bookmark's charset. } else if (cacheDescriptor && urlSpec && TryCacheCharset(cacheDescriptor, charsetSource, &charset)) { // Use the cache's charset. } else if (TryWeakDocTypeDefault(charsetSource, &charset)) { // Use the weak doc type default charset } else { // Use the user's default charset. TryUserDefaultCharset(muCV, charsetSource, &charset); } } if(kCharsetFromAutoDetection > charsetSource) StartAutodetection(docShell, &charset, aCommand); or: if (! TryUserForcedCharset(muCV, dcInfo, charsetSource, &charset)) { TryHintCharset(muCV, charsetSource, &charset); TryParentCharset(dcInfo, charsetSource, &charset); if (TryHttpHeaderCharset(httpChannel, charsetSource, &charset) || (scheme && nsCRT::strcasecmp("about", scheme) && TryBookmarkCharset(&urlSpec, charsetSource, &charset)) || (cacheDescriptor && urlSpec && TryCacheCharset(cacheDescriptor, charsetSource, &charset)) || TryWeakDocTypeDefault(charsetSource, &charset) || TryUserDefaultCharset(muCV, charsetSource, &charset)) { // Found a charset. } } if(kCharsetFromAutoDetection > charsetSource) StartAutodetection(docShell, &charset, aCommand);
Attached patch update patch (obsolete) — Splinter Review
Chris, 1, "SetDocumentCharacterSetSource(charsetSource);" is the only place where we cast enum to PRUint32. (No, it is not enum* to PRUint32*). It should be safe. 2, You convinced me that there is a solution. (I always has a fuzzy memory that some complilers will complain about empty block. After I check some C grammar book this morning, I find that I might be wrong.) Though I still believe my original approach is much readable and maintenancable, I updated my patch as you wish.
Attachment #61280 - Attachment is obsolete: true
Attachment #62375 - Flags: review+
Comment on attachment 62375 [details] [diff] [review] update patch This is the code that I'm worried about >+// The following Try*Charset will return PR_FALSE only if the charset source >+// should be considered (ie. aCharsetSource < thisCharsetSource) but we failed >+// to get the charset from this source. >+ >+PRBool >+nsHTMLDocument::TryHintCharset( nsIMarkupDocumentViewer* aMarkupDV, >+ nsCharsetSource& aCharsetSource, >+ nsAutoString* aCharset) >+{ >+ nsCharsetSource requestCharsetSource; >+ nsresult rv; >+ >+ if (aMarkupDV) { >+ rv = aMarkupDV->GetHintCharacterSetSource((PRInt32*)(&requestCharsetSource)); Because GetHintCharacterSetSource is implemented as: NS_IMETHODIMP DocumentViewerImpl::GetHintCharacterSetSource(PRInt32 *aHintCharacterSetSource) { NS_ENSURE_ARG_POINTER(aHintCharacterSetSource); *aHintCharacterSetSource = mHintCharsetSource; ... (see <http://lxr.mozilla.org/seamonkey/source/content/base/src/nsDocumentViewer.cpp# 5462>.) It is possible that the caller of TryHintCharset could pass a reference to aCharsetSource that is not 32-bit aligned if sizeof(nsCharsetSource) != sizeof(PRInt32). I've seen this cause problems with PRPackedBools before: you're just laying a trap that somebody else may eventually stumble into. (And if it's only eight files you have to touch, then just create another code cleanup bug and mark this bug dependent on that one.)
In fact, sfraser was talking in another bug of enabling the code warrior option to pack enums into the smallest suitable integral type, so the trap that this patch would lay already has a victim in sight! /be
Attached patch another update (obsolete) — Splinter Review
Attachment #62375 - Attachment is obsolete: true
enum has been replaced by PRInt32 as you wish. Please sr.
Attachment #62469 - Flags: review+
Quick nit about the patch: +nsHTMLDocument::TryHintCharset( nsIMarkupDocumentViewer* aMarkupDV, + PRInt32& aCharsetSource, + nsAutoString* aCharset) Loose the space before the first argument type, and indent the next-line arguments so that they line up with the first argument. Do this for all your new methods.
I guess I was thinking you could just |typedef PRInt32 nsCharsetSource|, but I guess what you did is fine. Now for the nits :-(. Don't use |nsAutoString *|s as arguments; best would be |nsAString &|, then |nsString &|, and at a minimum, |nsString *|; e.g., one place of many: +PRBool +nsHTMLDocument::TryHintCharset( nsIMarkupDocumentViewer* aMarkupDV, + PRInt32& aCharsetSource, + nsAutoString* aCharset) (It's bizarre to me that you're passing an |nsAutoString *|, but a |PRInt32 &|.) This cast should no longer be necessary: + rv = aMarkupDV->GetHintCharacterSetSource((PRInt32*)(&requestCharsetSource)); Why the funky indents here? (No hard tabs!) + *aCharset = forceCharsetFromWebShell; + Recycle(forceCharsetFromWebShell); + //TODO: we should define appropriate constant for force charset + aCharsetSource = kCharsetFromUserForced; And here: +nsHTMLDocument::TryUserDefaultCharset( nsIMarkupDocumentViewer* aMarkupDV, + PRInt32& aCharsetSource, + nsAutoString* aCharset) +{ + if(kCharsetFromUserDefault <= aCharsetSource) + return PR_TRUE; + + PRUnichar* defaultCharsetFromWebShell = NULL; + if (aMarkupDV) { Does this comment still apply? Or ought it be moved closer to TryBookmarksCharset()? + // don't try to access bookmarks if we are loading about:blank...it's not going + // to give us anything useful and it causes Bug #44397. At the same time, I'm loath to do something + // like this because I think it's really bogus that layout is depending on bookmarks. This is very evil. I think that this code could use a bit more commenting, as it is fairly magical (why is the order what it is?), but I won't hold up a checkin for that. For example, here: + if (! TryUserForcedCharset(muCV, dcInfo, charsetSource, &charset)) { + TryHintCharset(muCV, charsetSource, &charset); + TryParentCharset(dcInfo, charsetSource, &charset); + if (TryHttpHeaderCharset(httpChannel, charsetSource, &charset)) { why don't we check the results of TryHintCharset and TryParentCharset? There's a fair amount of state that passes from routine to routine in a squirrely way AFAICT. Should we file another bug to further consolidate this code and move it outside of nsHTMLDocument? Finally, in the Try*Charset functions, comparison operators are annoyingly inconsistent: sometimes the constant term is on the LHS, other times it's on the RHS. Some uniformity here might be helpful to the untrained eye.
Attached patch one more update (obsolete) — Splinter Review
Attachment #62469 - Attachment is obsolete: true
Attachment #62497 - Attachment is obsolete: true
>> I think that this code could use a bit more commenting, as it is fairly magical >> (why is the order what it is?), >> why don't we check the results of TryHintCharset and TryParentCharset? There's a >> fair amount of state that passes from routine to routine in a squirrely way AFAICT. I put some more comments there. That's one of the reason the I prefer my original goto version. The if/else version is less readable. >>Should we file another bug to further consolidate this code and move it outside >>of nsHTMLDocument? Do you have anything in mind?
Thanks for your patience with me, shanjian. I don't have any immediate suggestions with respect to hoisting this stuff out of nsHTMLDocument, and if jst is okay with it (which he said he was on IRC last night), I'm fine with it too. Please make one more spin through your patch before checking in to remove any hard tabs: I'm still seeing a few; e.g., +{ + if(kCharsetFromUserDefault <= aCharsetSource) + return PR_TRUE; + + PRUnichar* defaultCharsetFromWebShell = NULL; Do that, and sr=waterson
tab removed. fix checked in.
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
QA Contact: teruko → ylong
Look at bug 112112 and the URL in this bug does not exists anymore.
Since the URL attached in this bug is not existing, and I can see the non-ascii characters are displayed fine in the pages and sources in bug 112112. Mark as verified.
Status: RESOLVED → VERIFIED
*** Bug 103952 has been marked as a duplicate of this bug. ***
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: