Closed Bug 345016 Opened 19 years ago Closed 16 years ago

After reloading microsummary with non-UTF-8 charset, its summary becomes messed up

Categories

(Firefox Graveyard :: Microsummaries, defect, P3)

2.0 Branch
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bobchao, Assigned: myk)

References

()

Details

(Keywords: fixed1.8.1, intl)

Attachments

(3 files, 2 obsolete files)

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-TW; rv:1.8.1b1) Gecko/20060710 Firefox/2.0b1
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-TW; rv:1.8.1b1) Gecko/20060710 Firefox/2.0b1

When I first install my generator for Yahoo!TW funds, the microsummary works normally, but after the browser restarts it doesn't - the label turns into garbage.

Reproducible: Always

Steps to Reproduce:
1. Install the generator from my web site (see URL above, first link).
2. Bookmark http://tw.money.yahoo.com/fund/c/2450.html, choosing the microsummary as the label.
3. Restart Firefox.

Actual Results:
The bookmark label shows something like "b:�ͨ��饻:24,578(-551)".

Expected Results:
It should show the correct characters.
Flags: blocking-firefox2?
--> myk for investigation
Assignee: nobody → myk
Flags: blocking-firefox2? → blocking-firefox2+
Keywords: intl
Target Milestone: --- → Firefox 2
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1b2) Gecko/20060901 BonEcho/2.0b2

Confirming. The problem occurs when the microsummary is reloaded.
Status: UNCONFIRMED → NEW
Ever confirmed: true
OS: Windows XP → All
Hardware: PC → All
Summary: Can't recognize characters other than UTF-8 (without visit the page first) → After reloading microsummary with non-UTF-8 charset, its summary becomes messed up
Version: unspecified → 2.0 Branch
The basic problem appears to be that XMLHttpRequest can't determine the correct charset. The server is sending an HTTP reply with "Content-Type: text/html", but no character encoding is specified in the header. There is a <META http-equiv="Content-Type"> tag in the HTML specifying Big5 -- but since XMLHttpRequest treats non-XML content as opaque data, it never sees the tag. XMLHttpRequest expects UTF-8 data by default, so it's presumably generating garbage when it tries to decode the "UTF-8" byte sequences (which are really Big5, and are not valid UTF-8 sequences) into internal UTF-16/UCS-2 characters. I don't recognize any patterns in the garbage, so this probably isn't reversible.

So I'm not sure what the best approach is to fix this. Some ideas:

* You can force a MIME type (and thus charset) with XMLHttpRequest, but you have to do so before sending the request. In theory you could make one request, grab the charset from the parsed IFRAME, and then make a second request with the correct charset forced. Making two requests would suck, though. We could refine this by reusing (caching) the charset from the previous request: if the charset of the document changes, we would sync up after the next update (assuming the charset isn't changing all the time, which seems unlikely). This is a hack, but it might be simple enough to make it into a FF2 release.

* Since we're forced to use an IFRAME to parse the document data anyway, perhaps we can stop using XMLHttpRequest and just make the IFRAME load the page by, uhh, doing whatever the browser normally does to load a page into an IFRAME. [Might it be as simple as setting the SRC attribute, or would it still require mucking about with channels and streams?]

* If web developers are also doing this sort of thing (fetching HTML via XMLHttpRequest), perhaps we should enhance XMLHttpRequest to be a little more intelligent about HTML content. [Although I suppose that would require parsing the data, and a fix to avoid the IFRAME hack.]
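The decoding failure described above can be reproduced outside the browser. This is a minimal sketch (assuming a JS runtime with the WHATWG TextDecoder/TextEncoder, e.g. Node.js): Big5 bytes fed through a UTF-8 decoder come out as U+FFFD replacement characters, and once that happens the original bytes are gone for good.

```javascript
// Big5 bytes for one CJK character (0xA4 0xE9) followed by ASCII "24".
// A byte like 0xA4 is never a valid UTF-8 lead byte, so a UTF-8 decoder
// replaces the broken sequences with U+FFFD and the text is unrecoverable.
const big5Bytes = new Uint8Array([0xa4, 0xe9, 0x32, 0x34]);

const asUtf8 = new TextDecoder("utf-8").decode(big5Bytes);
console.log(asUtf8); // garbage: replacement characters, then "24"

// Once decoded with the wrong charset, the damage can't be undone:
// re-encoding the mangled string does not restore the original bytes.
const reEncoded = new TextEncoder().encode(asUtf8);
console.log(reEncoded.length !== big5Bytes.length);
```

This is why the comment above notes the garbage "probably isn't reversible" - the wrong-charset decode is lossy, so the fix has to happen before decoding.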
*** Bug 344600 has been marked as a duplicate of this bug. ***
> * You can force a mime type (and thus charset) with XMLHttpRequest, but you
> have to do so before sending the request... We could refine this by reusing
> (caching) the charset from the previous request... If the charset of the
> document changes, we would sync up after the next update (assuming the charset
> isn't changing all the time, which seems unlikely). This is a hack, but it
> might be simple enough to make it into a FF2 release.

It looks like the bookmarks service records the page's charset in the bookmarks datastore every time a user visits a bookmarked page. So instead of caching the charset ourselves, we could use this value, when available, to specify the charset for XMLHttpRequest.

As you surmise, pages are unlikely to change charsets, but when one does, this technique will probably break its microsummary until the user revisits the page. Of course, users are likely to revisit pages whose microsummaries break if they care about the microsummary at all, so such bustage would be temporary.

bz and biesi pointed out to me in IRC that we can set XMLHttpRequest.channel.contentCharset instead of calling XMLHttpRequest.overrideMimeType() to specify the charset without also overriding the MIME type. cc:ing the two of them for their general thoughts on this issue.
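For reference, a sketch of the difference between the two approaches. This is privileged Gecko code, not runnable in a plain JS environment, and it is an illustration of the suggestion above rather than the actual patch:

```javascript
// Sketch only: an XPCOM-created XMLHttpRequest, as the microsummary
// service uses (Cc/Ci are the usual Components.classes/interfaces).
var request = Cc["@mozilla.org/xmlextras/xmlhttprequest;1"]
              .createInstance(Ci.nsIXMLHttpRequest);
request.open("GET", pageURI.spec, true);

// Option 1: overrideMimeType replaces the whole Content-Type,
// charset included -- heavier-handed than needed here.
// request.overrideMimeType("text/html; charset=Big5");

// Option 2 (per bz and biesi): set only the charset on the underlying
// channel, leaving the server's MIME type alone.
request.channel.contentCharset = "Big5";

request.send(null);
```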
Here's an implementation of the approach outlined in comment 5. This fixes the testcase in the description of this bug and doesn't break a number of other microsummaries I have tested.
Attachment #238055 - Flags: superreview?(bzbarsky)
Attachment #238055 - Flags: review?(mconnor)
Another option would be to expose the parser's meta-tag sniffing to content code and use it in XMLHttpRequest for text/html.... Though I'm really sad that we're using XMLHttpRequest for random non-XML content. There's gotta be a better way. :(
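The parser's sniffing amounts to scanning the raw bytes for a charset declaration before committing to a decoder. A minimal, hypothetical sketch of the idea (`sniffMetaCharset` is an invented name, and this is far simpler than what the real parser does):

```javascript
// Hypothetical sketch of <meta http-equiv> charset sniffing; NOT the
// actual Gecko parser code. Treats each byte as one code point
// (Latin-1 semantics) so no multi-byte decoding happens before sniffing.
function sniffMetaCharset(bytes) {
  // In practice only the head of the document needs scanning.
  const head = String.fromCharCode(...bytes.slice(0, 1024));
  const m = /<meta[^>]*charset\s*=\s*["']?([\w-]+)/i.exec(head);
  return m ? m[1].toLowerCase() : null;
}

const html =
  '<html><head><meta http-equiv="Content-Type" ' +
  'content="text/html; charset=Big5"></head></html>';
const bytes = new TextEncoder().encode(html);
console.log(sniffMetaCharset(bytes)); // "big5"
```

With something like this exposed to XMLHttpRequest for text/html, the request could pick the right decoder without a second fetch.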
Comment on attachment 238055 [details] [diff] [review]
patch v1: retrieves last charset from bookmarks service

>Index: browser/components/microsummaries/src/nsMicrosummaryService.js.in
>+#ifndef MOZ_PLACES

So there's a followup bug to make this work with places?

>+    var rdf = Cc["@mozilla.org/rdf/rdf-service;1"]
>+              .getService(Ci.nsIRDFService);
>+    var bmds = rdf.GetDataSource("rdf:bookmarks");
>+    var resolver = bmds.QueryInterface(Ci.nsICharsetResolver);

How about doing something more like what
http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/content/html/document/src/nsHTMLDocument.cpp&rev=3.696#563
does?

>+      var charset = resolver.requestCharset(null, request.channel, {}, {});

Strictly speaking, this is abuse of the API if the channel hasn't sent onStartRequest yet... at least as the API was initially created. You might want to file another bug on clarifying that situation; I have no problem with allowing not-opened channels in this API, in general...

sr=bzbarsky with those nits.
Attachment #238055 - Flags: superreview?(bzbarsky) → superreview+
Btw, an advantage of the getService approach is that you can probably remove the PLACES ifdef.
(In reply to comment #5)
> As you surmise, pages are unlikely to change charsets, but when one does, this
> technique will probably break its microsummary until the user revisits the
> page. Of course, users are likely to revisit pages whose microsummaries break
> if they care about the microsummary at all, so such bustage would be temporary.

Since we're usually retrieving the page in order to generate the microsummary, we should be able to peek at the IFRAME and update the bookmark's charset if it changed. Then the next microsummary update wouldn't be corrupted (except in the pathological case where the charset changes constantly). We could even trigger an immediate "whoops" refresh to hopefully never display garbage.
(In reply to comment #7)
> Another option would be to expose the parser's meta-tag sniffing to content
> code and use it in XMLHttpRequest for text/html....

Indeed. But that sounds fairly involved (correct me if I'm wrong), and I'm currently looking for a fix low-risk enough to make it into Firefox 2.

> So there's a followup bug to make this work with places?

There is bug 317472, although it seems broader in scope.

> How about doing something more like what
> http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/content/html/document/src/nsHTMLDocument.cpp&rev=3.696#563
> does?

Looks like that would work just fine. Is it preferable?

>>+      var charset = resolver.requestCharset(null, request.channel, {}, {});
>
> Strictly speaking, this is abuse of the API, if the channel hasn't sent
> onStartRequest yet... At least as the API was initially created. You might
> want to file another bug on clarifying that situation; I have no problem with
> allowing not-opened channels in this API, in general...

I could also call the bookmarks service's GetLastCharset(), which doesn't rely on a channel at all (but throws an error if the URL isn't bookmarked). It merely takes a URL and returns its stored charset, if any. Would that be preferable to either technique using nsICharsetResolver?

> Since we're usually retrieving the page in order to generate the microsummary,
> we should be able to peek at the IFRAME and update the bookmark's charset if
> it changed. Then the next microsummary update wouldn't be corrupted (except
> in the pathological case where the charset changes constantly). We could even
> trigger an immediate "whoops" refresh to hopefully never display garbage.

Good idea. I'm a bit worried about triggering an immediate refresh, since that could overload a server in the pathological case, but we should be able to at least fix the charset so the microsummary comes out clean the next round.
I think using nsICharsetResolver with the unopened channel is fine for the time being.
> So there's a followup bug to make this work with places?

This is bug 317472.

>+    var rdf = Cc["@mozilla.org/rdf/rdf-service;1"]
>+              .getService(Ci.nsIRDFService);
>+    var bmds = rdf.GetDataSource("rdf:bookmarks");
>+    var resolver = bmds.QueryInterface(Ci.nsICharsetResolver);
> How about doing something more like what
> http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/content/html/document/src/nsHTMLDocument.cpp&rev=3.696#563
> does?

Ok, this patch switches to nsHTMLDocument.cpp's approach.

>+      var charset = resolver.requestCharset(null, request.channel, {}, {});
>
> Strictly speaking, this is abuse of the API, if the channel hasn't sent
> onStartRequest yet... At least as the API was initially created. You might
> want to file another bug on clarifying that situation; I have no problem with
> allowing not-opened channels in this API, in general...

Filed as bug 352426.
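Roughly, the nsHTMLDocument.cpp-style approach means getting the charset resolver from the bookmarks service directly rather than going through the RDF datasource. A hedged sketch reconstructed from the review comments (Gecko-privileged code, not the actual patch):

```javascript
// Sketch based on the review discussion, not the checked-in patch.
// Ask the bookmarks service, as an nsICharsetResolver, for the page's
// last recorded charset and apply it before the request is sent.
try {
  var resolver = Cc["@mozilla.org/browser/bookmarks-service;1"]
                 .getService(Ci.nsICharsetResolver);
  var charset = resolver.requestCharset(null, request.channel, {}, {});
  if (charset)
    request.channel.contentCharset = charset;
} catch (ex) {
  // No stored charset for this URI; the request falls back to the
  // default (UTF-8) behavior.
}
```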
Attachment #238055 - Attachment is obsolete: true
Attachment #238106 - Flags: superreview?(mconnor)
Attachment #238106 - Flags: review?(bzbarsky)
Attachment #238055 - Flags: review?(mconnor)
Err, the previous patch contained an unnecessary hunk. Here's the patch with that hunk removed.
Attachment #238106 - Attachment is obsolete: true
Attachment #238109 - Flags: superreview?(mconnor)
Attachment #238109 - Flags: review?(bzbarsky)
Attachment #238106 - Flags: superreview?(mconnor)
Attachment #238106 - Flags: review?(bzbarsky)
Attachment #238109 - Flags: review?(bzbarsky) → review+
Comment on attachment 238109 [details] [diff] [review]
patch v3: like v2, but minus unnecessary hunk

looks good.

if (nodeType == kNC_Bookmark || nodeType == kNC_Bookmark) scares me; who knows what evil lurks in the heart of bookmarks!
Attachment #238109 - Flags: superreview?(mconnor) → superreview+
Comment on attachment 238109 [details] [diff] [review]
patch v3: like v2, but minus unnecessary hunk

Notes for drivers considering this approval request:

This patch makes the microsummary service correctly generate microsummaries for pages in non-UTF-8 charsets where the charset is specified by a meta tag or some other mechanism not detectable by XMLHttpRequest (i.e. something other than the HTTP response headers). It does this by using the page's last charset as recorded by the bookmarks service.

The fix is simple, the patch is small, and testing has identified no problems on either UTF-8 or non-UTF-8 pages, but the risk is still not entirely clear. Since the code uses the page's last charset even for UTF-8 pages, it will break on pages that change their charset (although only until the user next visits the bookmark). And there may be other bugs lurking in the not-well-exercised bookmarks code responsible for tracking pages' charsets.

Nevertheless, I suspect that there are many more pages using non-UTF-8 charsets specified in HTML than there are pages that change their charset, and fixing this is a much bigger win than any regression it may cause. This is the lowest-risk fix for the problem.

The patch was checked into the trunk Tuesday, September 12, where baking it will do little good, since Places doesn't record charsets.
Attachment #238109 - Flags: approval1.8.1?
Depends on: 317472
Whiteboard: [has patch][needs approval]
> if (nodeType == kNC_Bookmark || nodeType == kNC_Bookmark) scares me, who knows
> what evil lurks in the heart of bookmarks!

FWIW, this was a recent typo. The second comparison was always supposed to check for equality with kNC_MicsumBookmark.
Whiteboard: [has patch][needs approval] → [has patch][needs approval] DRIVERS: see comment 16
FWIW, here's a work in progress of a more robust patch. This seems like more risk than we'd be willing to take on the branch, though.
Comment on attachment 238109 [details] [diff] [review] patch v3: like v2, but minus unnecessary hunk a=schrep for 181drivers for uSum UTF-8 fix.
Attachment #238109 - Flags: approval1.8.1? → approval1.8.1+
Fix checked into the branch. Leaving the bug open until we get a real trunk fix (which depends on the trunk saving charsets for bookmarked pages, i.e. the dependent bug).
Keywords: fixed1.8.1
Whiteboard: [has patch][needs approval] DRIVERS: see comment 16
Priority: -- → P3
Target Milestone: Firefox 2 → ---
Dependent bug has been fixed, marking as RESOLVED FIXED.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Depends on: 507386
Product: Firefox → Firefox Graveyard