Closed
Bug 345016
Opened 19 years ago
Closed 16 years ago
After reloading microsummary with non-UTF-8 charset, its summary becomes messed up
Categories
(Firefox Graveyard :: Microsummaries, defect, P3)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: bobchao, Assigned: myk)
References
()
Details
(Keywords: fixed1.8.1, intl)
Attachments
(3 files, 2 obsolete files)
2.55 KB,
patch
|
bzbarsky
:
review+
mconnor
:
superreview+
mtschrep
:
approval1.8.1+
|
Details | Diff | Splinter Review |
2.14 KB,
patch
|
Details | Diff | Splinter Review | |
3.40 KB,
patch
|
Details | Diff | Splinter Review |
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-TW; rv:1.8.1b1) Gecko/20060710 Firefox/2.0b1
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-TW; rv:1.8.1b1) Gecko/20060710 Firefox/2.0b1
When I first install my generator for Yahoo!TW fund, microsummary works ordinary, but it doesn't after the browser restart - the label turns into garbage.
Reproducible: Always
Steps to Reproduce:
1. Install the generator on my web site (see url above, first link)
2. Add http://tw.money.yahoo.com/fund/c/2450.html to bookmark, choose microsummary for label.
3. restart Firefox
Actual Results:
it shows something like "b:�ͨ��饻:24,578(-551)" on bookmark label
Expected Results:
should be correct characters
Reporter | ||
Updated•18 years ago
|
Flags: blocking-firefox2?
Comment 1•18 years ago
|
||
--> myk for investigation
Assignee: nobody → myk
Flags: blocking-firefox2? → blocking-firefox2+
Keywords: intl
Target Milestone: --- → Firefox 2
Comment 2•18 years ago
|
||
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1b2) Gecko/20060901 BonEcho/2.0b2
Confirming. The problem occurs when the microsummary is reloaded.
Status: UNCONFIRMED → NEW
Ever confirmed: true
OS: Windows XP → All
Hardware: PC → All
Summary: Can't recognize characters other than UTF-8 (without visit the page first) → After reloading microsummary with non-UTF-8 charset, its summary becomes messed up
Version: unspecified → 2.0 Branch
Comment 3•18 years ago
|
||
The basic problem appears to be that XMLHttpRequest can't determine the correct charset...
The server is sending an HTTP reply with "Content-Type: text/html", but no character encoding is specified in the header. There is a <META http-equiv="Content-Type"> tag in the HTML specifying Big5 -- but since XMLHttpRequest treats non-XML content as opaque data, it never sees the tag.
XMLHttpRequest expects UTF8 data by default, and so it's presumably generating garbage when it tries to decode the "UTF8" byte sequences (which are really Big5, and are not valid UTF8 sequences) into internal UTF16/UCS2 characters. I don't recognize any patterns in the garbage, so this probably isn't reversible.
So... I'm not sure what the best approach is to fix this. Some ideas:
* You can force a mime type (and thus charset) with XMLHttpRequest, but you have to do so before sending the request. In theory you could make one request, grab the charset from the parsed IFRAME, and then make a second request with the correct charset forced. Making 2 requests would suck, though. We could refine this by reusing (caching) the charset from the previous request... If the charset of the document changes, we would sync up after the next update (assuming the charset isn't changing all the time, which seems unlikely). This is a hack, but it might be simple enough to make it into a FF2 release.
* Since we're forced to use an IFRAME to parse the document data anyway, perhaps we can cease usage of XMLHttpRequest and just make the IFRAME load by, uhh, doing whatever the browser normally does to load a page into an IFRAME. [Might it be as simple as just setting the SRC attribute, or would it still require mucking about with channels and streams?]
* If web developers are also doing this sort of thing (HTML via XMLHttpRequest), perhaps we should enhance XMLHttpRequest to be a little more intelligent about HTML content. [Although I suppose that would require parsing the data and a fix to avoid the IFRAME hack.]
Comment 4•18 years ago
|
||
*** Bug 344600 has been marked as a duplicate of this bug. ***
Assignee | ||
Comment 5•18 years ago
|
||
> * You can force a mime type (and thus charset) with XMLHttpRequest, but you
> have to do so before sending the request... We could refine this by reusing
> (caching) the charset from the previous request... If the charset of the
> document changes, we would sync up after the next update (assuming the charset
> isn't changing all the time, which seems unlikely). This is a hack, but it
> might be simple enough to make it into a FF2 release.
It looks like the bookmarks service records the page's charset in the bookmarks datastore every time a user visits a bookmarked page. So instead of caching the charset ourselves, we could use this value, when available, to specify the charset for XMLHttpRequest.
As you surmise, pages are unlikely to change charsets, but when one does, this technique will probably break its microsummary until the user revisits the page. Of course, users are likely to revisit pages whose microsummaries break if they care about the microsummary at all, so such bustage would be temporary.
bz and biesi pointed out to me in IRC that we can set XMLHttpRequest.channel.contentCharset instead of calling XMLHttpRequest.overrideMimeType() to specify the charset without also overriding the MIME type.
cc:ing the two of them for their general thoughts on this issue.
Assignee | ||
Comment 6•18 years ago
|
||
Here's an implementation of the approach outlined in comment 5. This fixes the testcase in the description of this bug and doesn't break a number of other microsummaries I have tested.
Attachment #238055 -
Flags: superreview?(bzbarsky)
Attachment #238055 -
Flags: review?(mconnor)
Comment 7•18 years ago
|
||
Another option would be to expose the parser's meta-tag sniffing to content code and use it in XMLHttpRequest for text/html....
Though I'm really sad that we're using XMLHttpRequest for random non-XML content. There's gotta be a better way. :(
Comment 8•18 years ago
|
||
Comment on attachment 238055 [details] [diff] [review]
patch v1: retrieves last charset from bookmarks service
>Index: browser/components/microsummaries/src/nsMicrosummaryService.js.in
>+#ifndef MOZ_PLACES
So there's a followup bug to make this work with places?
>+ var rdf = Cc["@mozilla.org/rdf/rdf-service;1"]
>+ .getService(Ci.nsIRDFService);
>+ var bmds = rdf.GetDataSource("rdf:bookmarks");
>+ var resolver = bmds.QueryInterface(Ci.nsICharsetResolver);
How about doing something more like what http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/content/html/document/src/nsHTMLDocument.cpp&rev=3.696#563 does?
>+ var charset = resolver.requestCharset(null, request.channel, {}, {});
Strictly speaking, this is abuse of the API, if the channel hasn't sent onStartRequest yet... At least as the API was initially created. You might want to file another bug on clarifying that sitation; I have no problem with allowing not-opened channels in this API, in general...
sr=bzbarsky with those nits.
Attachment #238055 -
Flags: superreview?(bzbarsky) → superreview+
Comment 9•18 years ago
|
||
Btw, an advantage of the getService approach is that you can probably remove the PLACES ifdef.
Comment 10•18 years ago
|
||
(In reply to comment #5)
> As you surmise, pages are unlikely to change charsets, but when one does, this
> technique will probably break its microsummary until the user revisits the
> page. Of course, users are likely to revisit pages whose microsummaries break
> if they care about the microsummary at all, so such bustage would be temporary.
Since we're usually retrieving the page in order to generate the microsummary, we should be able to peek at the IFRAME and update the bookmark's charset if it changed. Then the next microsummary update wouldn't be corrupted (except in the pathological case where the charset changes constantly). We could even trigger an immediate "whoops" refresh to hopefully never display garbage.
Assignee | ||
Comment 11•18 years ago
|
||
(In reply to comment #7)
> Another option would be to expose the parser's meta-tag sniffing to content
> code and use it in XMLHttpRequest for text/html....
Indeed. But that sounds fairly involved (correct me if I'm wrong), and I'm currently looking for a fix low-risk enough to make it into Firefox 2.
> So there's a followup bug to make this work with places?
There is bug 317472, although it seems broader in scope.
> How about doing something more like what
> http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/content/html/document/src/nsHTMLDocument.cpp&rev=3.696#563
> does?
Looks like that would work just fine. Is it preferable?
> >+ var charset = resolver.requestCharset(null, request.channel, {}, {});
>
> Strictly speaking, this is abuse of the API, if the channel hasn't sent
> onStartRequest yet... At least as the API was initially created. You might
> want to file another bug on clarifying that sitation; I have no problem with
> allowing not-opened channels in this API, in general...
I could also call the bookmarks service's GetLastCharset(), which doesn't rely on a channel at all (but throws an error if the URL isn't bookmarked). It merely takes a URL and returns its stored charset, if any. Would that be preferable to either technique using nsICharsetResolver?
> Since we're usually retrieving the page in order to generate the microsummary,
> we should be able to peek at the IFRAME and update the bookmark's charset if
> it changed. Then the next microsummary update wouldn't be corrupted (except
> in the pathological case where the charset changes constantly). We could even
> trigger an immediate "whoops" refresh to hopefully never display garbage.
Good idea. I'm a bit worried about triggering an immediate refresh, since that could overload a server in the pathological case, but we should be able to at least fix the charset so the microsummary cycles clean the next round.
Comment 12•18 years ago
|
||
I think using nsICharsetResolver with the unopened channel is fine for the time being.
Assignee | ||
Comment 13•18 years ago
|
||
> So there's a followup bug to make this work with places?
This is bug 317472.
>+ var rdf = Cc["@mozilla.org/rdf/rdf-service;1"]
>+ .getService(Ci.nsIRDFService);
>+ var bmds = rdf.GetDataSource("rdf:bookmarks");
>+ var resolver = bmds.QueryInterface(Ci.nsICharsetResolver);
> How about doing something more like what
> http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/content/html/document/src/nsHTMLDocument.cpp&rev=3.696#563
> does?
Ok, this patch switches to nsHTMLDocument.cpp's approach.
>+ var charset = resolver.requestCharset(null, request.channel, {}, {});
>
> Strictly speaking, this is abuse of the API, if the channel hasn't sent
> onStartRequest yet... At least as the API was initially created. You might
> want to file another bug on clarifying that sitation; I have no problem with
> allowing not-opened channels in this API, in general...
Filed as bug 352426.
Attachment #238055 -
Attachment is obsolete: true
Attachment #238106 -
Flags: superreview?(mconnor)
Attachment #238106 -
Flags: review?(bzbarsky)
Attachment #238055 -
Flags: review?(mconnor)
Assignee | ||
Comment 14•18 years ago
|
||
Err, the previous patch contained an unnecessary hunk. Here's the patch with that hunk removed.
Attachment #238106 -
Attachment is obsolete: true
Attachment #238109 -
Flags: superreview?(mconnor)
Attachment #238109 -
Flags: review?(bzbarsky)
Attachment #238106 -
Flags: superreview?(mconnor)
Attachment #238106 -
Flags: review?(bzbarsky)
Updated•18 years ago
|
Attachment #238109 -
Flags: review?(bzbarsky) → review+
Comment 15•18 years ago
|
||
Comment on attachment 238109 [details] [diff] [review]
patch v3: like v2, but minus unnecessary hunk
looks good.
if (nodeType == kNC_Bookmark || nodeType == kNC_Bookmark) scares me, who knows what evil lurks in the heart of bookmarks!
Attachment #238109 -
Flags: superreview?(mconnor) → superreview+
Assignee | ||
Comment 16•18 years ago
|
||
Comment on attachment 238109 [details] [diff] [review]
patch v3: like v2, but minus unnecessary hunk
Notes for drivers considering this approval request:
This patch makes the microsummary service correctly generate microsummaries for pages in non-UTF-8 charsets where the charset is specified by a meta tag or some other mechanism not detectable by XMLHttpRequest (i.e. something other than in the HTTP response headers). It does this by using the page's last charset as recorded by the bookmarks service.
The fix is simple, the patch is small, and testing has identified no problems on both UTF-8 and non-UTF-8 pages, but the risk is still not entirely clear. Since the code uses the page's last charset even for UTF-8 pages, it will break on pages that change their charset (although only until the user next visits the bookmark). And there may be other bugs lurking in the not-well-exercised bookmarks code responsible for tracking pages' charsets.
Nevertheless, I suspect that there are many more pages using non UTF-8 charsets specified in HTML than there are pages that change their charset, and fixing this is a much bigger win than any regression it may cause. This is the lowest-risk fix for the problem.
The patch was checked into the trunk Tuesday, September 12, where baking it will do little good, since Places doesn't record charsets.
Attachment #238109 -
Flags: approval1.8.1?
Assignee | ||
Comment 17•18 years ago
|
||
Updated•18 years ago
|
Whiteboard: [has patch][needs approval]
Assignee | ||
Comment 18•18 years ago
|
||
> if (nodeType == kNC_Bookmark || nodeType == kNC_Bookmark) scares me, who knows
> what evil lurks in the heart of bookmarks!
FWIW, this was a recent typo. The second conditional was always supposed to check for equivalence to KNC_MicsumBookmark.
Assignee | ||
Updated•18 years ago
|
Whiteboard: [has patch][needs approval] → [has patch][needs approval] DRIVERS: see comment 16
Assignee | ||
Comment 19•18 years ago
|
||
FWIW, here's a work in progress of a more robust patch. This seems like more risk than we'd be willing to take on the branch, though.
Comment 20•18 years ago
|
||
Comment on attachment 238109 [details] [diff] [review]
patch v3: like v2, but minus unnecessary hunk
a=schrep for 181drivers for uSum UTF-8 fix.
Attachment #238109 -
Flags: approval1.8.1? → approval1.8.1+
Assignee | ||
Comment 21•18 years ago
|
||
Fix checked into the branch. Leaving the bug open until we get a real trunk fix (which depends on the trunk saving charsets for bookmarked pages, i.e. the dependent bug).
Keywords: fixed1.8.1
Whiteboard: [has patch][needs approval] DRIVERS: see comment 16
Assignee | ||
Updated•17 years ago
|
Priority: -- → P3
Target Milestone: Firefox 2 → ---
Comment 22•16 years ago
|
||
Dependent bug has been fixed, marking as RESOLVED FIXED.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Updated•9 years ago
|
Product: Firefox → Firefox Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•