Closed Bug 345016 Opened 19 years ago Closed 16 years ago

After reloading microsummary with non-UTF-8 charset, its summary becomes messed up

Categories

(Firefox Graveyard :: Microsummaries, defect, P3)

2.0 Branch
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bobchao, Assigned: myk)

References

()

Details

(Keywords: fixed1.8.1, intl)

Attachments

(3 files, 2 obsolete files)

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-TW; rv:1.8.1b1) Gecko/20060710 Firefox/2.0b1
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-TW; rv:1.8.1b1) Gecko/20060710 Firefox/2.0b1

When I first install my generator for Yahoo!TW funds, the microsummary works normally, but after the browser restarts it doesn't - the label turns into garbage.

Reproducible: Always

Steps to Reproduce:
1. Install the generator from my web site (see URL above, first link).
2. Bookmark http://tw.money.yahoo.com/fund/c/2450.html, choosing the microsummary as the label.
3. Restart Firefox.

Actual Results:
The bookmark label shows something like "b:�ͨ��饻:24,578(-551)".

Expected Results:
It should show the correct characters.
Flags: blocking-firefox2?
--> myk for investigation
Assignee: nobody → myk
Flags: blocking-firefox2? → blocking-firefox2+
Keywords: intl
Target Milestone: --- → Firefox 2
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1b2) Gecko/20060901 BonEcho/2.0b2

Confirming. The problem occurs when the microsummary is reloaded.
Status: UNCONFIRMED → NEW
Ever confirmed: true
OS: Windows XP → All
Hardware: PC → All
Summary: Can't recognize characters other than UTF-8 (without visit the page first) → After reloading microsummary with non-UTF-8 charset, its summary becomes messed up
Version: unspecified → 2.0 Branch
The basic problem appears to be that XMLHttpRequest can't determine the correct charset. The server is sending an HTTP reply with "Content-Type: text/html", but no character encoding is specified in the header. There is a <META http-equiv="Content-Type"> tag in the HTML specifying Big5 -- but since XMLHttpRequest treats non-XML content as opaque data, it never sees the tag. XMLHttpRequest expects UTF-8 data by default, so it's presumably generating garbage when it tries to decode the "UTF-8" byte sequences (which are really Big5, and are not valid UTF-8 sequences) into internal UTF-16/UCS-2 characters. I don't recognize any patterns in the garbage, so this probably isn't reversible.

So I'm not sure what the best approach is to fix this. Some ideas:

* You can force a MIME type (and thus charset) with XMLHttpRequest, but you have to do so before sending the request. In theory you could make one request, grab the charset from the parsed IFRAME, and then make a second request with the correct charset forced. Making two requests would suck, though. We could refine this by reusing (caching) the charset from the previous request: if the charset of the document changes, we would sync up after the next update (assuming the charset isn't changing all the time, which seems unlikely). This is a hack, but it might be simple enough to make it into a FF2 release.

* Since we're forced to use an IFRAME to parse the document data anyway, perhaps we can stop using XMLHttpRequest and just make the IFRAME load the page by, uhh, doing whatever the browser normally does to load a page into an IFRAME. [Might it be as simple as setting the SRC attribute, or would it still require mucking about with channels and streams?]

* If web developers are also doing this sort of thing (fetching HTML via XMLHttpRequest), perhaps we should enhance XMLHttpRequest to be a little more intelligent about HTML content. [Although I suppose that would require parsing the data, and a fix to avoid the IFRAME hack.]
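The decoding failure described above can be reproduced outside the browser. This is a minimal sketch (assuming a JS runtime with the WHATWG TextDecoder/TextEncoder, e.g. Node.js): Big5 bytes fed through a UTF-8 decoder come out as U+FFFD replacement characters, and once that happens the original bytes are gone for good.

```javascript
// Big5 bytes for one CJK character (0xA4 0xE9) followed by ASCII "24".
// A byte like 0xA4 is never a valid UTF-8 lead byte, so a UTF-8 decoder
// replaces the broken sequences with U+FFFD and the text is unrecoverable.
const big5Bytes = new Uint8Array([0xa4, 0xe9, 0x32, 0x34]);

const asUtf8 = new TextDecoder("utf-8").decode(big5Bytes);
console.log(asUtf8); // garbage: replacement characters, then "24"

// Once decoded with the wrong charset, the damage can't be undone:
// re-encoding the mangled string does not restore the original bytes.
const reEncoded = new TextEncoder().encode(asUtf8);
console.log(reEncoded.length !== big5Bytes.length);
```

This is why the comment above notes the garbage "probably isn't reversible" - the wrong-charset decode is lossy, so the fix has to happen before decoding.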
*** Bug 344600 has been marked as a duplicate of this bug. ***
> * You can force a mime type (and thus charset) with XMLHttpRequest, but you
> have to do so before sending the request... We could refine this by reusing
> (caching) the charset from the previous request... If the charset of the
> document changes, we would sync up after the next update (assuming the charset
> isn't changing all the time, which seems unlikely). This is a hack, but it
> might be simple enough to make it into a FF2 release.

It looks like the bookmarks service records the page's charset in the bookmarks datastore every time a user visits a bookmarked page. So instead of caching the charset ourselves, we could use this value, when available, to specify the charset for XMLHttpRequest.

As you surmise, pages are unlikely to change charsets, but when one does, this technique will probably break its microsummary until the user revisits the page. Of course, users are likely to revisit pages whose microsummaries break if they care about the microsummary at all, so such bustage would be temporary.

bz and biesi pointed out to me in IRC that we can set XMLHttpRequest.channel.contentCharset instead of calling XMLHttpRequest.overrideMimeType() to specify the charset without also overriding the MIME type. cc:ing the two of them for their general thoughts on this issue.
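For reference, a sketch of the difference between the two approaches. This is privileged Gecko code, not runnable in a plain JS environment, and it is an illustration of the suggestion above rather than the actual patch:

```javascript
// Sketch only: an XPCOM-created XMLHttpRequest, as the microsummary
// service uses (Cc/Ci are the usual Components.classes/interfaces).
var request = Cc["@mozilla.org/xmlextras/xmlhttprequest;1"]
              .createInstance(Ci.nsIXMLHttpRequest);
request.open("GET", pageURI.spec, true);

// Option 1: overrideMimeType replaces the whole Content-Type,
// charset included -- heavier-handed than needed here.
// request.overrideMimeType("text/html; charset=Big5");

// Option 2 (per bz and biesi): set only the charset on the underlying
// channel, leaving the server's MIME type alone.
request.channel.contentCharset = "Big5";

request.send(null);
```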
Here's an implementation of the approach outlined in comment 5. This fixes the testcase in the description of this bug and doesn't break a number of other microsummaries I have tested.
Attachment #238055 - Flags: superreview?(bzbarsky)
Attachment #238055 - Flags: review?(mconnor)
Another option would be to expose the parser's meta-tag sniffing to content code and use it in XMLHttpRequest for text/html.... Though I'm really sad that we're using XMLHttpRequest for random non-XML content. There's gotta be a better way. :(
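The parser's sniffing amounts to scanning the raw bytes for a charset declaration before committing to a decoder. A minimal, hypothetical sketch of the idea (`sniffMetaCharset` is an invented name, and this is far simpler than what the real parser does):

```javascript
// Hypothetical sketch of <meta http-equiv> charset sniffing; NOT the
// actual Gecko parser code. Treats each byte as one code point
// (Latin-1 semantics) so no multi-byte decoding happens before sniffing.
function sniffMetaCharset(bytes) {
  // In practice only the head of the document needs scanning.
  const head = String.fromCharCode(...bytes.slice(0, 1024));
  const m = /<meta[^>]*charset\s*=\s*["']?([\w-]+)/i.exec(head);
  return m ? m[1].toLowerCase() : null;
}

const html =
  '<html><head><meta http-equiv="Content-Type" ' +
  'content="text/html; charset=Big5"></head></html>';
const bytes = new TextEncoder().encode(html);
console.log(sniffMetaCharset(bytes)); // "big5"
```

With something like this exposed to XMLHttpRequest for text/html, the request could pick the right decoder without a second fetch.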
Comment on attachment 238055 [details] [diff] [review]
patch v1: retrieves last charset from bookmarks service

>Index: browser/components/microsummaries/src/nsMicrosummaryService.js.in
>+#ifndef MOZ_PLACES

So there's a followup bug to make this work with places?

>+    var rdf = Cc["@mozilla.org/rdf/rdf-service;1"]
>+              .getService(Ci.nsIRDFService);
>+    var bmds = rdf.GetDataSource("rdf:bookmarks");
>+    var resolver = bmds.QueryInterface(Ci.nsICharsetResolver);

How about doing something more like what
http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/content/html/document/src/nsHTMLDocument.cpp&rev=3.696#563
does?

>+      var charset = resolver.requestCharset(null, request.channel, {}, {});

Strictly speaking, this is abuse of the API if the channel hasn't sent onStartRequest yet... at least as the API was initially created. You might want to file another bug on clarifying that situation; I have no problem with allowing not-opened channels in this API, in general...

sr=bzbarsky with those nits.
Attachment #238055 - Flags: superreview?(bzbarsky) → superreview+
Btw, an advantage of the getService approach is that you can probably remove the PLACES ifdef.
(In reply to comment #5)
> As you surmise, pages are unlikely to change charsets, but when one does, this
> technique will probably break its microsummary until the user revisits the
> page. Of course, users are likely to revisit pages whose microsummaries break
> if they care about the microsummary at all, so such bustage would be temporary.

Since we're usually retrieving the page in order to generate the microsummary, we should be able to peek at the IFRAME and update the bookmark's charset if it changed. Then the next microsummary update wouldn't be corrupted (except in the pathological case where the charset changes constantly). We could even trigger an immediate "whoops" refresh to hopefully never display garbage.
(In reply to comment #7)
> Another option would be to expose the parser's meta-tag sniffing to content
> code and use it in XMLHttpRequest for text/html....

Indeed. But that sounds fairly involved (correct me if I'm wrong), and I'm currently looking for a fix low-risk enough to make it into Firefox 2.

> So there's a followup bug to make this work with places?

There is bug 317472, although it seems broader in scope.

> How about doing something more like what
> http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/content/html/document/src/nsHTMLDocument.cpp&rev=3.696#563
> does?

Looks like that would work just fine. Is it preferable?

>>+      var charset = resolver.requestCharset(null, request.channel, {}, {});
>
> Strictly speaking, this is abuse of the API, if the channel hasn't sent
> onStartRequest yet... At least as the API was initially created. You might
> want to file another bug on clarifying that situation; I have no problem with
> allowing not-opened channels in this API, in general...

I could also call the bookmarks service's GetLastCharset(), which doesn't rely on a channel at all (but throws an error if the URL isn't bookmarked). It merely takes a URL and returns its stored charset, if any. Would that be preferable to either technique using nsICharsetResolver?

> Since we're usually retrieving the page in order to generate the microsummary,
> we should be able to peek at the IFRAME and update the bookmark's charset if
> it changed. Then the next microsummary update wouldn't be corrupted (except
> in the pathological case where the charset changes constantly). We could even
> trigger an immediate "whoops" refresh to hopefully never display garbage.

Good idea. I'm a bit worried about triggering an immediate refresh, since that could overload a server in the pathological case, but we should be able to at least fix the charset so the microsummary comes out clean the next round.
I think using nsICharsetResolver with the unopened channel is fine for the time being.
> So there's a followup bug to make this work with places?

This is bug 317472.

>+    var rdf = Cc["@mozilla.org/rdf/rdf-service;1"]
>+              .getService(Ci.nsIRDFService);
>+    var bmds = rdf.GetDataSource("rdf:bookmarks");
>+    var resolver = bmds.QueryInterface(Ci.nsICharsetResolver);
> How about doing something more like what
> http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/content/html/document/src/nsHTMLDocument.cpp&rev=3.696#563
> does?

Ok, this patch switches to nsHTMLDocument.cpp's approach.

>+      var charset = resolver.requestCharset(null, request.channel, {}, {});
>
> Strictly speaking, this is abuse of the API, if the channel hasn't sent
> onStartRequest yet... At least as the API was initially created. You might
> want to file another bug on clarifying that situation; I have no problem with
> allowing not-opened channels in this API, in general...

Filed as bug 352426.
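Roughly, the nsHTMLDocument.cpp-style approach means getting the charset resolver from the bookmarks service directly rather than going through the RDF datasource. A hedged sketch reconstructed from the review comments (Gecko-privileged code, not the actual patch):

```javascript
// Sketch based on the review discussion, not the checked-in patch.
// Ask the bookmarks service, as an nsICharsetResolver, for the page's
// last recorded charset and apply it before the request is sent.
try {
  var resolver = Cc["@mozilla.org/browser/bookmarks-service;1"]
                 .getService(Ci.nsICharsetResolver);
  var charset = resolver.requestCharset(null, request.channel, {}, {});
  if (charset)
    request.channel.contentCharset = charset;
} catch (ex) {
  // No stored charset for this URI; the request falls back to the
  // default (UTF-8) behavior.
}
```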
Attachment #238055 - Attachment is obsolete: true
Attachment #238106 - Flags: superreview?(mconnor)
Attachment #238106 - Flags: review?(bzbarsky)
Attachment #238055 - Flags: review?(mconnor)
Err, the previous patch contained an unnecessary hunk. Here's the patch with that hunk removed.
Attachment #238106 - Attachment is obsolete: true
Attachment #238109 - Flags: superreview?(mconnor)
Attachment #238109 - Flags: review?(bzbarsky)
Attachment #238106 - Flags: superreview?(mconnor)
Attachment #238106 - Flags: review?(bzbarsky)
Attachment #238109 - Flags: review?(bzbarsky) → review+
Comment on attachment 238109 [details] [diff] [review]
patch v3: like v2, but minus unnecessary hunk

looks good.

if (nodeType == kNC_Bookmark || nodeType == kNC_Bookmark) scares me; who knows what evil lurks in the heart of bookmarks!
Attachment #238109 - Flags: superreview?(mconnor) → superreview+
Comment on attachment 238109 [details] [diff] [review]
patch v3: like v2, but minus unnecessary hunk

Notes for drivers considering this approval request:

This patch makes the microsummary service correctly generate microsummaries for pages in non-UTF-8 charsets where the charset is specified by a meta tag or some other mechanism not detectable by XMLHttpRequest (i.e. something other than the HTTP response headers). It does this by using the page's last charset as recorded by the bookmarks service.

The fix is simple, the patch is small, and testing has identified no problems on either UTF-8 or non-UTF-8 pages, but the risk is still not entirely clear. Since the code uses the page's last charset even for UTF-8 pages, it will break on pages that change their charset (although only until the user next visits the bookmark). And there may be other bugs lurking in the not-well-exercised bookmarks code responsible for tracking pages' charsets.

Nevertheless, I suspect that there are many more pages using non-UTF-8 charsets specified in HTML than there are pages that change their charset, and fixing this is a much bigger win than any regression it may cause. This is the lowest-risk fix for the problem.

The patch was checked into the trunk Tuesday, September 12, where baking it will do little good, since Places doesn't record charsets.
Attachment #238109 - Flags: approval1.8.1?
Depends on: 317472
Whiteboard: [has patch][needs approval]
> if (nodeType == kNC_Bookmark || nodeType == kNC_Bookmark) scares me, who knows
> what evil lurks in the heart of bookmarks!

FWIW, this was a recent typo. The second comparison was always supposed to check for equality with kNC_MicsumBookmark.
Whiteboard: [has patch][needs approval] → [has patch][needs approval] DRIVERS: see comment 16
FWIW, here's a work in progress of a more robust patch. This seems like more risk than we'd be willing to take on the branch, though.
Comment on attachment 238109 [details] [diff] [review] patch v3: like v2, but minus unnecessary hunk a=schrep for 181drivers for uSum UTF-8 fix.
Attachment #238109 - Flags: approval1.8.1? → approval1.8.1+
Fix checked into the branch. Leaving the bug open until we get a real trunk fix (which depends on the trunk saving charsets for bookmarked pages, i.e. the dependent bug).
Keywords: fixed1.8.1
Whiteboard: [has patch][needs approval] DRIVERS: see comment 16
Priority: -- → P3
Target Milestone: Firefox 2 → ---
Dependent bug has been fixed, marking as RESOLVED FIXED.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Depends on: 507386
Product: Firefox → Firefox Graveyard