Closed Bug 1174580 Opened 9 years ago Closed 9 years ago

Doesn't display GB2312 encoded texts correctly for Chinese Characters

Categories

(MailNews Core :: Internationalization, defect)

Type: defect
Priority: Not set
Severity: major

Tracking

(thunderbird39 fixed, thunderbird40 fixed, thunderbird41 fixed, thunderbird_esr38 39+ fixed)

RESOLVED FIXED
Thunderbird 41.0
Tracking Status
thunderbird39 --- fixed
thunderbird40 --- fixed
thunderbird41 --- fixed
thunderbird_esr38 39+ fixed

People

(Reporter: wenbins, Assigned: mkmelin)

References

Details

(Keywords: regression)

Attachments

(5 files, 4 obsolete files)

Attached image Image 2.png
User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0
Build ID: 20150525141253

Steps to reproduce:

My Thunderbird is on the release channel. Just upgraded to 38.0.1 this morning. Now some emails are not displayed correctly.


Actual results:

Thunderbird is set to use UTF-8 as default encoding. Windows 7 system.

When a received email is encoded in GB2312, Thunderbird detects the GB2312 encoding and selects Chinese Simplified to decode it. However, the wrong characters are displayed.

The previous version 31.7 has no such issue.


Expected results:

Correct characters should be used.
I can confirm this bug both on Linux and Windows.
Changing Preferences -> Fonts & Encodings -> Character Encodings -> Incoming Mail to Chinese Simplified (GBK) did not help.
But if "Apply encoding to all messages in the folder ..." is checked under Folder Properties, all the gb2312-encoded mails display correctly, although mails in other encodings (utf-8 etc.) then do not.
Two reports now.
Severity: normal → major
Status: UNCONFIRMED → NEW
Component: Untriaged → Internationalization
Ever confirmed: true
Product: Thunderbird → MailNews Core
Did an experiment, in the eml file, find the part below:

Content-Type: text/html; charset="gb2312"
Content-Transfer-Encoding: quoted-printable

After replacing "gb2312" with "gb18030", the display is correct. So it seems related to gb2312 only.

P.S. IIRC there used to be "Chinese" in the character encoding auto-detect menu, but now there are only Japanese, Russian and Ukrainian.
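
The relabeling experiment above can be approximated outside Thunderbird. This is only a minimal Python sketch, not Thunderbird code: `decode_body`, `SAMPLE_TEXT` and `SAMPLE_EML` are hypothetical names, and the sample message is made up for illustration. It uses the standard-library `email` and `quopri` modules to decode the same quoted-printable body once with the declared charset and once with the label overridden to gb18030.

```python
import quopri
from email import message_from_bytes
from typing import Optional


def decode_body(raw: bytes, override_charset: Optional[str] = None) -> str:
    """Decode a single-part message body, optionally ignoring the declared charset."""
    msg = message_from_bytes(raw)
    payload = msg.get_payload(decode=True)  # undoes the quoted-printable encoding
    charset = override_charset or msg.get_content_charset() or "utf-8"
    return payload.decode(charset)


# Hypothetical sample: plain GB2312 text sent as quoted-printable.
SAMPLE_TEXT = "中文测试"
SAMPLE_EML = (
    b'Content-Type: text/html; charset="gb2312"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    + quopri.encodestring(SAMPLE_TEXT.encode("gb2312"))
)

declared = decode_body(SAMPLE_EML)              # uses charset="gb2312"
relabeled = decode_body(SAMPLE_EML, "gb18030")  # same bytes, different label
```

Both calls should give the same text for pure GB2312 content, which matches the observation that the bytes in the mail are fine and only the handling of the "gb2312" label is at fault.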
Depends on: 964225
Depends on a "fixed" bug? So bug 964225 should be reopened?
(In reply to Lu Wei from comment #5)
> Depends on a "fixed" bug? So bug 964225 should be reopened?

You don't reopen it.  Although this may be regression by bug 964225, it is Gecko's bug.
If TB still needs this encoding, we should add fallback of it to c-c.
This is not a Gecko bug. Gecko on m-c is internally consistent: GB2312 is no longer a Gecko-canonical name but a label for gbk. gbk and gb18030 are Gecko-canonical names.

c-c has its own list of label overrides in https://mxr.mozilla.org/comm-central/source/mailnews/intl/charsetalias.properties . This list still contains the mapping gb2312=GB2312, which overrides the right mapping from https://mxr.mozilla.org/comm-central/source/mozilla/dom/encoding/labelsencodings.properties#183 . But now GB2312 is no longer a Gecko-canonical name, so treating it as one leads to failure.

The least invasive fix is making sure that charsetalias.properties has no mappings with GB2312 on the right-hand side of the equals sign. (While at it, it's a good idea to review the file for other mappings that no longer work.)

Compared to old Thunderbird, the main resulting change is that replies will say charset=gbk instead of saying charset=GB2312.

Note that the old GB2312 decoder had actually the same behavior as the old gbk decoder. The gb18030 decoder is a subset of the old gbk decoder, which is why (when charsetalias.properties doesn't interfere) Gecko *decodes* content labeled as GB2312, gbk and gb18030 *exactly* the same way (as gb18030).
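
The failure mode described above can be illustrated with a toy model. This is a sketch under stated assumptions, not Gecko or comm-central code: `CANONICAL`, `LABELS`, `STALE_OVERRIDES`, `resolve` and `stale_entries` are hypothetical names, and the tables hold only a handful of entries taken from this discussion. It shows how an override whose right-hand side is no longer a canonical name shadows the working label mapping, and how such entries could be found in an audit.

```python
from typing import Dict, Optional

# Hypothetical subset of the canonical encoding names that still have decoders.
CANONICAL = {"gbk", "gb18030", "UTF-8"}

# Hypothetical subset of the label table: label -> canonical name.
LABELS = {"gb2312": "gbk", "gbk": "gbk", "gb18030": "gb18030", "utf-8": "UTF-8"}

# The stale override discussed above: its right-hand side is no longer canonical.
STALE_OVERRIDES = {"gb2312": "GB2312"}


def resolve(label: str, overrides: Dict[str, str]) -> Optional[str]:
    """Return the canonical name for a label, or None if no decoder exists."""
    key = label.strip().lower()
    name = overrides.get(key) or LABELS.get(key)
    return name if name in CANONICAL else None


def stale_entries(overrides: Dict[str, str]) -> Dict[str, str]:
    """List override entries whose right-hand side is not a canonical name."""
    return {k: v for k, v in overrides.items() if v not in CANONICAL}


broken = resolve("GB2312", STALE_OVERRIDES)  # override wins, lookup fails
fixed = resolve("GB2312", {})                # falls through to the label table
```

With the override in place the lookup fails; with it removed, the label resolves to gbk, which is the behavior the least invasive fix aims for.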

I expect jcranmer and mkmelin to have opinions.
Flags: needinfo?(mkmelin+mozilla)
Flags: needinfo?(Pidgeot18)
> The gb18030 decoder is a subset of the old gbk decoder

Oops. s/subset/superset/
It would be useful if someone could post here a valid gb2312 .eml file that displays correctly in TB 31 but not in TB 38, along with a screenshot of what the correct display looks like. That would make it easier for a non-Chinese developer to try to fix this.

Of course it would be great if a Chinese reader wants to try to fix it.
Keywords: regression
petercpg is checking for someone to fix
I find no support requests about this problem except http://forums.mozillazine.org/viewtopic.php?f=31&t=2941613
(In reply to Kent James (:rkent) from comment #10)
> It would be useful if someone could post here a valid gb2312 .eml file that
> display in correctly in TB 31 but not in TB 38, along with a screenshot of
> what the correct display looks like. That would make it easier for a
> non-Chinese developer to try to fix this.
> 
I uploaded an eml file and 2 corresponding screenshots when reporting Bug 1174634. Please check.

P.S. From this incident I think TB lacks Simplified Chinese testers. How can I help? Switch to the beta channel and report bugs in the beta version?
(In reply to Magnus Melin from comment #13)
> Looks like bug 1174634 has .eml test case + wrong / right pictures.
> attachment 8622327 [details]:
> https://bugzilla.mozilla.org/attachment.cgi?id=8622327 
> attachment 8622328 [details]:
> https://bugzilla.mozilla.org/attachment.cgi?id=8622328
> attachment 8622329 [details]:
> https://bugzilla.mozilla.org/attachment.cgi?id=8622329

Oh, you have pasted the link. Sorry for my quick comment. To display Chinese characters correctly, you may need a Chinese font file too. Do I need to upload one?
Attached patch gbk.patch (obsolete) — Splinter Review
magnus, if it helps, this patch wfm.

(long ago it was required to learn both fantizi and jiantizi, so it's not so greek ;)
(In reply to Lu Wei from comment #15)
> (In reply to Kent James (:rkent) from comment #10)
> > It would be useful if someone could post here a valid gb2312 .eml file that
> > display in correctly in TB 31 but not in TB 38, along with a screenshot of
> > what the correct display looks like. That would make it easier for a
> > non-Chinese developer to try to fix this.
> > 
> I uploaded an eml file and corresponding 2 sreenshots when reporting Bug
> 1174634. Please check.
> 
> P.S. From this accident I think TB is lack of simplified chinese testers.
> How can I help? Switch to beta channel and report bugs of beta version?

Yes, we need both automated tests, and more manual testers.  If you know other people who can test please email me.

As for version 38, you can run either the beta http://download.cdn.mozilla.net/pub/mozilla.org/thunderbird/releases/38.0b6/ or the released 38.0.1 https://www.mozilla.org/en-US/thunderbird/
(In reply to Wayne Mery (:wsmwk, use Needinfo for questions) from comment #18)
> > 
> > P.S. From this accident I think TB is lack of simplified chinese testers.
> > How can I help? Switch to beta channel and report bugs of beta version?
> 
> Yes, we need both automated tests, and more manual testers.  If you know
> other people who can test please email me.
> 
> As for version 38, you can run either the beta
> http://download.cdn.mozilla.net/pub/mozilla.org/thunderbird/releases/38.0b6/
> or the released 38.0.1 https://www.mozilla.org/en-US/thunderbird/

All right, I'll use beta version as long as my extensions work fine.
Summary: Not display GB2312 encoded texts correctly → Not display GB2312 encoded texts correctly for Chinese Characters
Attached patch gbk.patch (obsolete) — Splinter Review
Attachment #8623450 - Attachment is obsolete: true
Thx alta88!
Summary: Not display GB2312 encoded texts correctly for Chinese Characters → Doesn't display GB2312 encoded texts correctly for Chinese Characters
Attached patch bug1174580_gb3212.patch (obsolete) — Splinter Review
I came to the same result as alta88, with the addition that hz-gb-2312 is also dead (mapped hz-gb-2312=replacement in labelsencodings.properties) + removal of related cruft.

For 38 we'd only land the charsetalias.properties changes.
Attachment #8623746 - Attachment is obsolete: true
Attachment #8623846 - Flags: review?(Pidgeot18)
And I guess it can be discussed if zh_cn.euc is really needed for anything...
here's a reference:
http://www.yale.edu/chinesemac/pages/character_sets.html

zh_cn.euc is solaris; since it maps to a valid decoder (gbk), may as well keep it.
hz-gb-2312 is 7bit (now) gbk.  if there isn't a decoder remaining, well nothing to do.  i put it back to err on the side of caution assuming this would need to go to 38 and that one could be addressed better later. ie, as henri suggests, auditing charsetalias to only contain valid overrides and not dupe existing entries in labelsencodings.
The decoder for hz-gb-2312 appears not to exist anymore, but the 7-bit charsets in general are icky and should be obsoleted if possible. Preferably, the alias code should use the Gecko encoding lookup and only fallback to a small hard-coded list, but that requires some more auditing that we don't have right now.

Looking around for zh_CN.euc, the evidence appears to be that it was added for Solaris, but a quick google search tends to suggest that this is used primarily as a locale/internal format switch that was more or less accidentally(?) exposed to the real world. I don't think it's necessary to keep these days.
Flags: needinfo?(Pidgeot18)
(In reply to Joshua Cranmer [:jcranmer] from comment #27)
> Looking around for zh_CN.euc, the evidence appears to be that it was added
> for Solaris, but a quick google search tends to suggest that this is used
> primarily as a locale/internal format switch that was more or less
> accidentally(?) exposed to the real world. I don't think it's necessary to
> keep these days.

This indeed seems like the more likely explanation than "zh_CN.euc" being a necessary label to support for email compat.
So when can we get a patch? I cannot read email sent from Outlook (GB2312-encoded email) with this buggy TB and need a patch to resolve the issue. I have to switch the folder encoding to read emails, which is a terrible experience.
Attached patch bug1174580_gb3212.patch (obsolete) — Splinter Review
Remove zh_cn.euc too
Attachment #8623846 - Attachment is obsolete: true
Attachment #8623846 - Flags: review?(Pidgeot18)
Attachment #8624696 - Flags: review?(Pidgeot18)
(In reply to Tony Yan from comment #29)
> So when can we get a patch? 

We expect to include this fix in the next point release and nightlies as soon as it's been reviewed.
Forgot to qrefresh.
Attachment #8624696 - Attachment is obsolete: true
Attachment #8624696 - Flags: review?(Pidgeot18)
Attachment #8624697 - Flags: review?(Pidgeot18)
Comment on attachment 8624697 [details] [diff] [review]
bug1174580_gb3212.patch

Review of attachment 8624697 [details] [diff] [review]:
-----------------------------------------------------------------

An automated test would have been nice, but it's not strictly necessary for this change. I'll probably consider it later when we work on redoing the mailnews/intl stuff (I rather think we could get rid of both the properties files with some cleanup).
Attachment #8624697 - Flags: review?(Pidgeot18) → review+
https://hg.mozilla.org/comm-central/rev/7170634c1998 -> FIXED
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → Thunderbird 41.0
Comment on attachment 8624697 [details] [diff] [review]
bug1174580_gb3212.patch

[Approval Request Comment]
Regression caused by (bug #): bug 964225
User impact if declined: some chinese mails garbled

Should uplift this after some days trunk baking
Attachment #8624697 - Flags: approval-comm-esr38?
Attachment #8624697 - Flags: approval-comm-beta?
Attachment #8624697 - Flags: approval-comm-aurora?
So when can I upgrade to get the fix?

Tony
Comment on attachment 8624697 [details] [diff] [review]
bug1174580_gb3212.patch

Ah, should do a branch patch for this, with only the mailnews/intl/charsetalias.properties changes
Attachment #8624697 - Flags: approval-comm-esr38?
(In reply to Tony Yan from comment #36)
> So when can I upgrade to get the fix?
> 
> Tony

Nightly builds should be available in ~6-7h. Upgrade of 38 would be the next point release - unclear yet exactly when that will happen.
Magnus, can you do the esr38 branch patch?
In the future, I would appreciate it if you separated out the suite pieces from these patches. suite has a different approval process, but even worse, its tree is perma-closed, so I have to add CLOSED TREE to any checkins.

I do not understand the point of the SM perma-closed, but that seems to be what they want. But because of that, I cannot rely on the tree being closed for real reasons, like infrastructure problems, so I am forced to go through all sorts of extra steps to checkin patches that have suite pieces.

I continue to implore the SM people to get rid of this permanent CLOSED TREE status. If you are going to allow checkins in any case, what is the point? It certainly complicates the life of patches like this that need changes in mailnews, mail, and suite to be coordinated.
Attachment #8625322 - Flags: approval-comm-esr38?
And yes, keeping the seamonkey tree closed for well over a year is completely unreasonable. All it does is force people to spend time fixing the commit messages.
Although the suite piece was not checked in, all it does is remove a line from the localization file. I don't believe it is necessary to uplift that deletion.
Comment on attachment 8624697 [details] [diff] [review]
bug1174580_gb3212.patch

modified patch was used sans suite
Attachment #8624697 - Flags: approval-comm-beta?
Attachment #8624697 - Flags: approval-comm-beta-
Attachment #8624697 - Flags: approval-comm-aurora?
Attachment #8624697 - Flags: approval-comm-aurora-
I tried the latest nightly build from https://ftp.mozilla.org/pub/mozilla.org/thunderbird/nightly/latest-earlybird-l10n/ , but the problem still exists. May I know which nightly build fixes the issue? I am running Windows 8.1.

Thanks and Regards,
Tony
Should be in earlybird nightly builds from 2015-06-23
Could you please teach me how to fix this issue ?
(In reply to Henry Fung from comment #49)
> Could you please teach me how to fix this issue ?

Hold down Alt key; press key v, c, s one by one; release Alt key.
Can anyone send me a link for a fixed temp build? It is just crazy to manually do this every time to open a mail in SC.
Nightlies are around here: https://ftp.mozilla.org/pub/mozilla.org/thunderbird/nightly/latest-comm-central/ (earlybird in a sibling dir if you want that)
me too! So when can i get the patch ?
"The patch" is attached to this bug. If you want a running build, see previous comment, comment 52.

BTW, could people confirm it fixes the issue for them?
To Magnus, I'm using the daily build as in comment 52. It works.
The search pane can't display the correct Chinese characters in gb2312 while the main message pane can, and also the message body can't be indexed correctly.
I'm using the Daily 41.0a1 (2015-06-25) version; the bug is still not fixed, as I mentioned before.
With the 38ESR, in the reading panel, the characters in the message body cannot be displayed; title is OK. But if I search using keywords from the title, in the search result pages, there is a short preview, which displays the characters correctly.

With 41.0a1 (2015-06-25) Daily, at my end it seems all right. Win 7 EN, Thunderbird EN, Unicode as default.
I guess it's possible gloda (the search index) has stored incorrect data for those mails during the time you used the version not supporting gb2312 properly. If so, and you need it fixed, you have to rebuild the database. https://support.mozilla.org/kb/rebuilding-global-database
yes!
I indexed the mails with version 38.
After rebuilding the index, everything seems right in Daily build version 41.0a1 (2015-06-26).
The latest build wfm.  And setting GB18030 as the outgoing default in Display-Formatting-Advanced makes replies also encode properly and roundtrip fine.  However, using the GB2312 menuitem causes encoding in UTF8; is there any reason that option is still kept, given 1) it doesn't do what it advertises, 2) it's a subset of 18030 and officially superseded by it anyway. And for incoming, the label is GBK, which means 18030 but doesn't say it and isn't consistent with outgoing, thus confusing.

To be sure, the advice of the mozilla zh-cn localizer should be sought.
I split that out to bug 1177830. 
Note that we silently use UTF-8 if what was written didn't fit in the selected charset.
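
The silent fallback mentioned above can be sketched as follows. This is an illustration in Python only, assuming a simple "try the selected charset, else UTF-8" rule; `encode_outgoing` is a hypothetical name and this is not the actual Thunderbird compose code.

```python
from typing import Tuple


def encode_outgoing(text: str, preferred: str) -> Tuple[bytes, str]:
    """Encode text in the preferred charset, falling back to UTF-8 if it does not fit."""
    try:
        return text.encode(preferred), preferred
    except UnicodeEncodeError:
        # The text contains characters outside the preferred charset.
        return text.encode("utf-8"), "utf-8"


# "镕" is not part of GB2312, so the second call falls back to UTF-8.
_, plain_charset = encode_outgoing("中文", "gb2312")
_, fallback_charset = encode_outgoing("朱镕基", "gb2312")
_, gb18030_charset = encode_outgoing("朱镕基", "gb18030")
```

Under this rule a message written with the GB2312 menu item can still go out as UTF-8 whenever it contains a character that GB2312 cannot represent, while gb18030 covers the same text without a fallback.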
Comment on attachment 8625322 [details] [diff] [review]
bug1174580_GB2312_branch.patch

http://hg.mozilla.org/releases/comm-esr38/rev/788fc052d220
Attachment #8625322 - Flags: approval-comm-esr38? → approval-comm-esr38+
I have upgraded my TB from 38.0.1 to 38.1.0, but Bug 1174634 doesn't appear to be fixed.

User Agent:
    Mozilla/5.0 (Windows NT 6.1; WOW64; rv:38.0) Gecko/20100101 Thunderbird/38.1.0
Steps to reproduce:
    After upgrading to TB 38.1.0, all mails of one account display wrongly.
Actual results:
   The Chinese characters using GB2312 display wrongly, as if the encoding is not recognized. If I select the character encoding manually from menu>view>character encoding>chinese simplified (the fallback character encoding in folder properties is set to Unicode), it displays OK, but only temporarily; after switching to another message and back, it displays wrongly again.

   Remarks:
     1) If the fallback character encoding in folder properties is set to Chinese Simplified, mail using GB2312 is displayed correctly, but mail using UTF-8 is displayed incorrectly.
     2) In menu>view>character encoding>Auto-Detect, there are only Japanese, Russian and Ukrainian; I can't find Chinese Simplified, although I am using a Chinese Simplified Win7 OS.

Expected results:
  It should display correctly as before.
Sorry, my mistake.
In TB 38.1.0 the wrong display was caused by a checkbox in the folder properties being selected. The checkbox is "Apply encoding to all messages in the folder(individual message...)".
