Closed Bug 272812 Opened 20 years ago Closed 19 years ago

RSS item Subject has mis-transcoded international characters

Categories

(MailNews Core :: Feed Reader, defect)

x86
Windows 2000
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: leech_joe, Assigned: mscott)

Attachments

(3 files)

User-Agent:       Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0
Build Identifier: Thunderbird 1.0 RC1 and earlier

See feed [1], for example. It works fine with Firefox, but with Thunderbird I get Ã?3
instead of Ö3.

[1] http://rss.orf.at/oesterreich.xml

Reproducible: Always
Steps to Reproduce:
1. subscribe to [1]
2. look at the subject of the messages
Actual Results:  
The umlauts are not correctly displayed

Expected Results:  
Display them correctly :-).
Seems to work sometimes and sometimes not. Here is one that doesn't work:

http://rss.orf.at/oe3.xml

Subject:  Ä?3 Freundeskreis

linked to
http://oe3.orf.at/oe3.orf?read=detail&id=224269&channel=4
The ones that don't work list the wrong charset, so we guess incorrectly.
Typically these feeds declare a specific charset even though the feed is actually
UTF-8, and we use the charset they declare. (I've seen the opposite too, where the
feed says it is UTF-8 but the strings are in another charset.) I'm not sure what we
can do for these...
Ok, so the problem is caused by the feed and not by thunderbird.

Hmm, Firefox guesses correctly. So why not simply use the same algorithm when
Thunderbird has to guess?

btw. It seems also Bugzilla doesn't like umlauts (see my first posting) :-).
Here's why I think your first example feed is invalid:

It says the charset is: iso-8859-15
but it looks like the characters in the actual feed are UTF-8.

So we end up converting the characters from iso-8859-15 to unicode and they look
incorrect....
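
A minimal sketch of the failure described above, written for Node.js. It uses `latin1` (ISO-8859-1) as a stand-in for ISO-8859-15, which is an assumption made for portability; the two charsets agree on the two bytes involved here (0xC3 and 0x96). The variable names are illustrative only.

```javascript
// "Ö3", written with an escape so the example does not depend on file encoding.
const subject = "\u00d63";

// The bytes actually present in the feed: UTF-8 encodes U+00D6 as 0xC3 0x96.
const utf8Bytes = Buffer.from(subject, "utf8");

// Decoding those bytes with a single-byte charset (as the feed declares)
// maps every byte to its own character.
const misdecoded = utf8Bytes.toString("latin1");

// Decoding the same bytes as UTF-8 recovers the original text.
const correct = utf8Bytes.toString("utf8");

console.log(utf8Bytes.toString("hex"));
console.log(JSON.stringify(misdecoded), JSON.stringify(correct));
```

Here 0xC3 becomes Ã and 0x96 becomes an unprintable control character, which is consistent with the `Ã?3` seen in the report.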
Ok I contacted the customer service to correct the bug. Thanks anyway.
Hmm, they say that the feed is valid. I checked this myself with [1] and [2] and
no problems occurred. So this seems to be a bug in Thunderbird (or in the
validator implementation -> but I do not believe this).

Kind regards, Joe

[1] http://www.w3.org/RDF/Validator/
[2] http://www.feedvalidator.org/

PS: Works fine with konqueror and centericq.
The characters are fine if you print the feed to the console just before it hits the RDF parser in 
parseAsRSS1(), and a print statement in nsXMLHttpRequest showed that the charset is being correctly 
detected as ISO-8859-15.

http://bonsai.mozilla.org/cvsview2.cgi?diff_mode=context&whitespace_mode=show&file=nsIRDFXMLParser.idl&branch=&root=/cvsroot&subdir=mozilla/rdf/base/idl&command=DIFF_FRAMESET&rev1=1.3&rev2=1.4

It seems the method was changed to take an nsAUTF8String as input... shouldn't it be a double-byte 
string to talk to JavaScript? The change says it went in to correct this exact problem, but the only place 
it's used in script is in Feed.js, according to lxr.
*** Bug 285391 has been marked as a duplicate of this bug. ***
Note that the feed shown at the duplicate is served as UTF-8, but still has 
problems with the Subject line:
  http://japan-in-nutshell.blogspot.com/atom.xml
Status: UNCONFIRMED → NEW
Ever confirmed: true
OS: Linux → Windows 2000
Summary: RSS subject index doesn't display umlauts correct → RSS item Subject has mis-transliterated international characters
Duh, that's not "transliteration" we're talking about.
Summary: RSS item Subject has mis-transliterated international characters → RSS item Subject has mis-transcoded international characters
I just figured this one out, along with bug 293279 and bug 272875.
XMLHttpRequest's responseText attribute is misencoded for non-UTF8 strings (IE
has the responseBody attribute, which we would use if Moz had it). I have a
patch for this, but I need my patch for bug288110 to go in first. 
request.responseText returns UTF-8 by default. This is causing problems when
RSS1 documents are not returned in a compatible charset, perhaps because of the
way we override the MIME type in the request (perhaps this also catches the
charset parameter?). 

In any case, the XML parser seems to catch this (perhaps by sniffing the XML
declaration) and the DOM is always correct. As a short term solution, I decided
to serialize the responseXML DOM and feed that to the RDF parser, instead of
the responseText. I suppose there's a performance penalty in there, but it was
imperceptible to me, and it's the only solution I could think of that would
keep the changes minimal during release mode.

On the plus side, the patch fixes every encoding bug I could find.
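
The workaround described above can be sketched roughly as follows. This is not the checked-in patch: `textForRdfParser` is a hypothetical helper, the fallback to `responseText` is an assumption, and the request and serializer are passed in as parameters so the sketch does not rely on any particular global scope. `responseXML`, `responseText`, and `XMLSerializer.serializeToString()` are the standard names.

```javascript
// Hypothetical helper: pick the string to hand to the RDF parser.
// `request` is expected to behave like an XMLHttpRequest and
// `serializer` like an XMLSerializer.
function textForRdfParser(request, serializer) {
  if (request.responseXML) {
    // Per the comment above, the XML parser builds a correct DOM,
    // so the DOM is serialized back to text instead of using responseText.
    return serializer.serializeToString(request.responseXML);
  }
  // Assumption: with no DOM available, fall back to the raw text.
  return request.responseText;
}
```

In a scope where `XMLSerializer` is defined, this would be called as `textForRdfParser(request, new XMLSerializer())`.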
(In reply to comment #9)
> Note that the feed shown at the duplicate is served as UTF-8, but still has 
> problems with the Subject line:
>   http://japan-in-nutshell.blogspot.com/atom.xml

This blog appears to be in Hungarian, and seems to work with the patch (I missed
this one in my testing).
Attachment #185984 - Flags: review?(mscott)
Comment on attachment 185984
character encoding fixes

thanks a lot Robert.
Attachment #185984 - Flags: review?(mscott) → review+
Status: NEW → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
Target Milestone: --- → Thunderbird1.1
I'm testing 2005-06-15-05-trunk.

Some of the feeds imported from an OPML file are not loaded.
e.g. http://weblogs.mozillazine.org/hyatt/blogger_rss.xml

JavaScript Console says:
Error: XMLSerializer is not defined
Source File: chrome://messenger-newsblog/content/feed-parser.js    Line: 67

The same feeds load correctly when I add them myself. Hmm...
(In reply to comment #16)
This problem has happened since attachment 185984 was checked in.
No problem with 2005-06-12-05-trunk.
I see the bug here too, now that I've added Hyatt's feed. It has to do with
declaring that XMLSerializer. The problem seems to go away if the one declared
at the top of the file is used.
Attachment #186405 - Flags: review?(mscott)
There's a little glitch with OPML-imported feeds.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attachment #186405 - Flags: review?(mscott) → review+
Attachment #186405 - Attachment description: reuse the serializer instance at the top of the file → [patch checked in] reuse the serializer instance at the top of the file
(In reply to comment #17)
> (In reply to comment #16)
> This problem has happened since attachment 185984 was checked in.
> No problem with 2005-06-12-05-trunk.

Kohei, could you verify this is fixed for you with the 2005-06-16 trunk? I think
it's patched up.
I just tested 2005-06-16-06-trunk.
Looks good. No error. Thanks!
Status: REOPENED → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
*** Bug 276350 has been marked as a duplicate of this bug. ***
*** Bug 293279 has been marked as a duplicate of this bug. ***
Note that the analysis here was wrong.  The encoding of responseText is always UTF-16, and any time there is an actual responseXML DOM around, the responseText is correct. What you guys _actually_ ran into was bug 230275 -- the RDF parser is broken.  I suggest backing out this hackaround once that issue is fixed...
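
For context on the UTF-16 point: JavaScript strings are sequences of UTF-16 code units, so a script reading `responseText` never sees raw bytes in the feed's charset. A small illustration for Node.js or a browser, using the standard `TextEncoder` (the variable names are illustrative only):

```javascript
// "Ö3", written with an escape so the example does not depend on file encoding.
const text = "\u00d63";

// Collect the UTF-16 code units that make up the string.
const codeUnits = [];
for (let i = 0; i < text.length; i++) {
  codeUnits.push(text.charCodeAt(i));
}

// Encode the same string to UTF-8 bytes for comparison.
const utf8Encoded = Array.from(new TextEncoder().encode(text));

console.log(codeUnits.map((u) => u.toString(16)));
console.log(utf8Encoded.map((b) => b.toString(16)));
```

The string holds two code units (0xD6 and 0x33), while its UTF-8 form is three bytes (0xC3, 0x96, 0x33). A component that takes UTF-8 input therefore needs a conversion step between the two forms.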
Depends on: 230275
Component: RSS → Feed Reader
Product: Thunderbird → MailNews Core
Target Milestone: Thunderbird1.1 → ---