Closed Bug 177505 Opened 22 years ago Closed 19 years ago

Autodetect=Universal misidentifies some text as GB18030, and leaves Encoding menu in wrong state

Categories

(MailNews Core :: Internationalization, defect)

Hardware: x86
OS: All
Type: defect
Priority: Not set
Severity: minor

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: u32858, Assigned: jgmyers)

Attachments

(3 files, 2 obsolete files)

User-Agent:       Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2b) Gecko/20021029
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2b) Gecko/20021029

Emails whose charset is displayed as iso-8859-1 are not displayed correctly.

Reproducible: Always

Steps to Reproduce:
1. Get an HTML email.

Actual Results:  
The £ (pound) signs appear as ?? and other letters have kanji scattered occasionally
throughout them.

Expected Results:  
Should display in iso-8859-1 correctly.
Clicking View->Charset->iso-8859-1 (selecting it again) fixes this problem,
but as "iso-8859-1" is already highlighted this should not happen in the first place.

bug submitted in UTF-8
reporter: what locale are you on? is your auto-detect turned on or off? the
email messages that are not displayed correctly are not mime encoded? could you
please attach a problematic mail to this bug report? thanks.
> 
> ------- Additional Comments From marina@netscape.com  2002-10-30 09:23 -------
> reporter: what locale are you on? is your auto-detect turned on or off? the
> email messages that are not displayed correctly are not mime encoded? could you
> please attach a problematic mail to this bug report? thanks.

Hi Marina,

autodetect was on "universal", it selected iso-8859-1 as highlighted, if i turn
it off it uses the default iso-8859-1 i set for all message display in the prefs

perhaps this is an autodetect bug.

the email was MIME encoded,

regards

JG

$ locale
LANG=en_GB.UTF-8
LC_CTYPE=ja_JP.UTF-8
LC_NUMERIC=en_GB.UTF-8
LC_TIME=en_GB.UTF-8
LC_COLLATE=en_GB.UTF-8
LC_MONETARY=en_GB.UTF-8
LC_MESSAGES=en_GB.UTF-8
LC_PAPER="en_GB.UTF-8"
LC_NAME="en_GB.UTF-8"
LC_ADDRESS="en_GB.UTF-8"
LC_TELEPHONE="en_GB.UTF-8"
LC_MEASUREMENT="en_GB.UTF-8"
LC_IDENTIFICATION="en_GB.UTF-8"
LC_ALL=



Received: from y01.blackstar.co.uk ([212.250.176.31]) by mail1.tay.ac.uk with
SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2656.59)
	id VTJMJC8K; Tue, 29 Oct 2002 19:33:38 -0000
Received: (qmail 4331 invoked by uid 1008); 29 Oct 2002 12:37:46 -0000
Date: 29 Oct 2002 12:37:46 -0000
Message-ID: <20021029123746.4330.qmail@y01.blackstar.co.uk>
Content-Transfer-Encoding: 7bit
Content-Type: multipart/alternative; boundary="_----------=_10358950661610159692"
MIME-Version: 1.0
X-Mailer: MIME::Lite 1.135  (B2.12; Q2.03)
From: update@blackstar.co.uk
To: 0013499@tay.ac.uk
Reply-To: service@blackstar.co.uk
Subject: Who Killed Laura Palmer?

This is a multi-part message in MIME format.

--_----------=_10358950661610159692
Content-Disposition: inline
Content-Transfer-Encoding: 7bit
Content-Type: text/plain
example MIME email attached as text/plain
shanjian, ftang said this might be related to your recent work on charsets
"Anything that can go wrong will go wrong."
This problem is not caused by my recent change; it is a problem in the universal
detector.

Before doing multibyte detection, I remove all ASCII characters that are not
adjacent to a high 8-bit byte. The aim was to improve performance. At that time, gb18030
had not been added yet.

In gb18030, the byte sequence 0x81~0xfe, 0x30~0x39, 0x81~0xfe, 0x30~0x39 is a four-byte character.
Because such a sequence does not appear in "almost" any other encoding, I report
gb18030 immediately. But in this testcase the filtering itself helps create such a
sequence, namely 0xa3,0x34,0xa3,0x31. That leads the universal detector to mislabel
the text as gb18030.

This problem can probably be fixed easily by not reporting immediately for such a
sequence. Keeping an additional character after each high 8-bit byte will also help eliminate
gb18030 from consideration.
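
To make the failure concrete, here is a minimal sketch in Python (not the detector's actual C++ code). It assumes the filter keeps each high-bit byte plus a fixed number of ASCII bytes after it; the function names and the sample price text are made up for illustration.

```python
def keep_near_high_bytes(data: bytes, trailing: int) -> bytes:
    """Keep every high-bit byte plus up to `trailing` ASCII bytes after it.

    Hypothetical stand-in for the detector's pre-filter, not the real code.
    """
    out = bytearray()
    budget = 0  # how many more ASCII bytes we may still keep
    for b in data:
        if b >= 0x80:
            out.append(b)
            budget = trailing
        elif budget > 0:
            out.append(b)
            budget -= 1
    return bytes(out)


def has_gb18030_four_byte_pattern(seq: bytes) -> bool:
    """True if any 4-byte window matches 0x81-0xfe, 0x30-0x39, 0x81-0xfe, 0x30-0x39."""
    for i in range(len(seq) - 3):
        b1, b2, b3, b4 = seq[i:i + 4]
        if (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
                and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39):
            return True
    return False


# Made-up sample: two prices with a pound sign (0xa3 in ISO-8859-1).
text = "Was £4.99, now £1.50".encode("iso-8859-1")

one = keep_near_high_bytes(text, trailing=1)
two = keep_near_high_bytes(text, trailing=2)

print(one.hex(), has_gb18030_four_byte_pattern(one))  # a334a331 True
print(two.hex(), has_gb18030_four_byte_pattern(two))  # a3342ea3312e False
```

The unfiltered text contains no such four-byte run; it only appears once the ASCII between the two pound signs is stripped, and it disappears again when one more trailing byte is kept.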



Assignee: nhotta → shanjian
This bug was not present in version 1.6 of Mozilla.
It appears in 1.7C1. I do not know whether it was present in versions before
1.6.
This mail is probably wrongly formatted, but no error message is displayed.
Probably a problem with "\n", the end-of-line characters.
See the mail content and compare with the result in Mozilla.
Attachment #104637 - Attachment mime type: text/plain → message/rfc822
Attachment #147045 - Attachment mime type: text/plain → message/rfc822
In attachment 104637 [details], the charset is not specified.  As noted in comment 5, the 
actual character set that gets used, if Autodetect=Universal, is GB18030.  
However, the Encoding menu indicates that ISO-8859-1 has been selected, unlike 
successful cases of detection (e.g. Big5) -- but see bug 163272.

Bug 181344 is about the same problem, in the browser.


In attachment 147045 [details] (from Franck Depierre), the charset is *illegally* 
specified, as:
    Content-Type: text/plain; charset=iso.8859.1
The misdisplay of this message is a different problem, and I've opened
bug 251634 for this.  This message does, however, also show the problem of the 
Encoding menu being incorrectly updated.


Due to bug 129443, viewing a message/rfc822 file in the browser is very little 
help in getting to the root of the encoding problem: the display is still 
broken, but differently than in Mail/News.
Status: UNCONFIRMED → NEW
Ever confirmed: true
OS: Linux → All
Summary: email displayed as charset are iso-8859-1 are not displayed correctly → Autodetect=Universal misidentifies some text as GB18030, and leaves Encoding menu in wrong state
shanjian: are you still around, and able to produce a patch for this bug? It
causes regular problems for UK users of Mozilla, such as myself :-)

I can provide many more example URLs if you need them.

Gerv
Bug 253849 opened for the failure of the Mail/News View|Encoding menu to update, 
which is a common symptom to some otherwise distinct Mail/News charset bugs.

Recommend duping this bug to the generic Auto-Detect bug 181344.
Blocks: 264871
Product: MailNews → Core
shanjian has not been working on mozilla for 2 years and these bugs are still
here. Marking them WONTFIX. If you want to reopen it, find a good owner first. 
Status: NEW → RESOLVED
Closed: 19 years ago
Resolution: --- → WONTFIX
Mass Reassign Please excuse the spam
Assignee: shanjian → nobody
Mass re-opening bugs Frank Tang closed on Wednesday March 02 for no reason. All
the spam is his fault; feel free to tar and feather him.
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Reassigning Frank's old bugs to Jungshik Shin for triage. Sorry for the spam.
Assignee: nobody → jshin1987
Status: REOPENED → NEW
Attached patch Proposed fix (obsolete)
Assignee: jshin1987 → jgmyers
Status: NEW → ASSIGNED
Attachment #200801 - Flags: review?(smontagu)
jgmyers: you are da man! This bug bites me daily, whenever I hit a site with a UK currency pound sign.

Gerv
*** Bug 181344 has been marked as a duplicate of this bug. ***
Attached patch Corrected proposed fix (obsolete)
Attachment #200801 - Attachment is obsolete: true
Attachment #200808 - Flags: review?(smontagu)
Attachment #200801 - Flags: review?(smontagu)
Comment on attachment 200808 [details] [diff] [review]
Corrected proposed fix

Can you add some explanation of how this fixes the bug? Is it the approach suggested in comment 5?
Comment 5 has two suggestions.  The patch chooses to do only the second: after a high-bit octet I feed the next two non-high-bit octets to the lower detectors.  The previous code would only feed the next one non-high-bit octet.

I also got rid of the malloc and made the algorithm for what gets removed be unaffected by the input buffer block boundaries.
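
The two points above can be sketched roughly as follows. This is Python for illustration only, not the patch itself; the class name `HighByteFilter` and the sample text are hypothetical. The idea is that the count of ASCII bytes still to be kept is carried across calls, so the filtered output does not depend on where the input happens to be split into blocks.

```python
class HighByteFilter:
    """Keeps high-bit bytes plus up to `trailing` ASCII bytes after each one.

    Illustrative sketch: state is kept between feed() calls so that block
    boundaries in the input do not change what is passed on.
    """

    def __init__(self, trailing: int = 2) -> None:
        self.trailing = trailing
        self._budget = 0  # ASCII bytes still allowed, carried across blocks

    def feed(self, chunk: bytes) -> bytes:
        out = bytearray()
        for b in chunk:
            if b >= 0x80:
                out.append(b)
                self._budget = self.trailing
            elif self._budget > 0:
                out.append(b)
                self._budget -= 1
        return bytes(out)


data = "Was £4.99, now £1.50".encode("iso-8859-1")  # made-up sample
whole = HighByteFilter().feed(data)

# Splitting the same input at every possible position gives the same result.
for cut in range(len(data) + 1):
    f = HighByteFilter()
    assert f.feed(data[:cut]) + f.feed(data[cut:]) == whole

print(whole.hex())  # a3342ea3312e
```

With two trailing bytes kept, the pound-sign example no longer collapses into a run that looks like a gb18030 four-byte character.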
I'm not seeing any difference in the testcases in here or the duplicate bugs.
They're detecting as windows-1252 for me.  Perhaps you're somehow loading an old version of a shared library?
That could be, because I didn't build at top level. I'll test again, but I'm afraid it won't be before Sunday.
Comment on attachment 200808 [details] [diff] [review]
Corrected proposed fix

This isn't feeding 8bit data to the MBCS probers when the 8bit data isn't followed by non-8bit data.
Attachment #200808 - Attachment is obsolete: true
Attachment #200808 - Flags: review?(smontagu)
Attached patch Corrected fix
Corrects an off-by-one in the buffer length when passing data to the lower probers.  Sends data to lower prober when last char of buffer is 8bit.
Adds my employer to Contributors list as this is work-for-hire.
Includes some #ifdef DEBUG_jgmyers code from my work tree.  The rest of that debugging code will come in a different patch to a different bug.
Attachment #201042 - Flags: review?(smontagu)
Comment on attachment 201042 [details] [diff] [review]
Corrected fix

r=smontagu
Attachment #201042 - Flags: review?(smontagu) → review+
Attachment #201042 - Flags: superreview?(roc)
Comment on attachment 201042 [details] [diff] [review]
Corrected fix

okay, but how hard would it be to merge the duplicate code into a single helper function?
Attachment #201042 - Flags: superreview?(roc) → superreview+
Fixed on trunk
Status: ASSIGNED → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
Verified using Amazon. jgmyers: British geeks thank you :-)

I wonder how one goes about nominating this for checkin on the Firefox 2.0 track?

Gerv
Status: RESOLVED → VERIFIED
*** Bug 328456 has been marked as a duplicate of this bug. ***
Product: Core → MailNews Core