Closed Bug 100377 Opened 24 years ago Closed 24 years ago

Auto-detect All detects yahoo-japan page as a wrong charset that save by Composer

Categories

(Core :: Internationalization, defect)

x86
Windows 2000
defect
Not set
normal

RESOLVED FIXED
mozilla0.9.6

People

(Reporter: amyy, Assigned: shanjian)

Details

(Keywords: intl)

Attachments

(2 files)

Builds that reproduce the problem: 09-17 branch build and N6.1 RTM build on Windows 2000 Simplified Chinese.

Steps:
1. Launch the browser and go to http://www.yahoo.co.jp
2. File | Save As, and save the page as a new file A.
3. File | Edit Page to bring the page into Composer. The page displays fine. Then File | Save As, and save it as a new file B.
4. Open files A and B in the browser, go to View | Character Coding | Auto-Detect, and set the auto-detector to "All".

Results: File A (saved in step 2) displays fine and its charset is marked as EUC-JP, which is the correct behavior. File B (saved in step 3) displays garbled and its charset is marked as ISO-8859-2.

I cannot reproduce this with the 09-17 branch build on WinXP-Ja, Mac9.1-Ja, or Linux6.2-Ja using the same steps. There the pages display fine with auto-detect All, whether the files were saved in Browser or in Composer. If I change auto-detect to Japanese, though, the charset is detected as EUC-JP.
Keywords: intl
QA Contact: andreasb → ylong
Summary: Auto-detect All detects yahoo-japan as a wrong charset that save by Composer → Auto-detect All detects yahoo-japan page as a wrong charset that save by Composer
assigning to shanjian
Assignee: yokoyama → shanjian
Yuying, can you send me file B, the one you say has a problem with auto-detection? I could not reproduce the problem following your steps. Thanks.
This bug exposes a problem in my charset auto-detector. For scripts that also use Latin letters, I preserve all Latin letters outside tags. Unfortunately, sometimes too many English words are left. That eventually leads to high confidence from the negative approach I use for detecting single-byte charsets. Since there is no easy way to tell which text is English and which is not, I have to remove all pure Latin words. Since all the scripts we try to identify use code points outside ASCII, I don't see an immediate limitation to this approach.
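To illustrate the idea of dropping pure Latin words before the single-byte probers see the text, here is a minimal sketch. It is not the code from the attached patch: the names IsAsciiLetter and FilterPureLatinWords are hypothetical, and the definition of a "word" (a maximal run of ASCII letters and bytes >= 0x80, with every other byte acting as a delimiter) is an assumption made for the example.

```cpp
#include <string>

// Hypothetical helper, not taken from the patch.
// Returns true for the ASCII letters A-Z and a-z.
static bool IsAsciiLetter(unsigned char c) {
  return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

// Hypothetical filter: keeps only the words that contain at least one
// byte >= 0x80 and drops words made purely of ASCII letters.
// Assumption: a word is a maximal run of ASCII letters and high bytes;
// any other byte (space, digit, punctuation) ends the current word.
// Kept words are joined with a single space.
std::string FilterPureLatinWords(const std::string& input) {
  std::string output;
  std::string word;
  bool hasHighByte = false;

  // Emit the current word if it qualifies, then reset the word state.
  auto flush = [&]() {
    if (hasHighByte) {
      if (!output.empty()) output += ' ';
      output += word;
    }
    word.clear();
    hasHighByte = false;
  };

  for (unsigned char c : input) {
    if (c >= 0x80) {
      word += static_cast<char>(c);
      hasHighByte = true;
    } else if (IsAsciiLetter(c)) {
      word += static_cast<char>(c);
    } else {
      flush();  // delimiter: close the current word
    }
  }
  flush();  // close the last word, if any
  return output;
}
```

With this sketch, a buffer holding mostly English navigation text plus a few high-byte words would be reduced to just the high-byte words, so the English text no longer adds to a single-byte prober's confidence.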
Status: NEW → ASSIGNED
Bugscape 9786 was filed for Detect All. This bug will take care of the universal charset detector.
Attached patch: proposed patch
roy, could you review my fix?
Shanjian: I see nsSBCSGroupProber::FilterWithEnglishLetters() being ifdef'ed by NO_ENGLISH_CONTAMINATION. Is this NO_ENGLISH_CONTAMINATION related to this bug?
No. I don't want to remove that part of the code because I believe the algorithm is theoretically sound and it might be used in the future for detecting languages. So I just commented it out.
Comment on attachment 50580: proposed patch /r=yokoyama
Attachment #50580 - Flags: review+
brendan, can you sr my patch?
Comment on attachment 50580: proposed patch sr=brendan@mozilla.org so long as mTotalChars can't be 0 when GetConfidence is called, or else its callers can handle a NaN. /be
Attachment #50580 - Flags: superreview+
Thanks brendan. To address your concern, mTotalChars won't be zero if mTotalSeqs is more than zero. A sequence contains 2 chars, and at least one char is counted in mTotalChars.
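The concern about dividing by zero can be sketched as follows. This is only an illustration, not the GetConfidence from the patch: the struct ProberCounters, the function GetConfidenceSketch, the counters mPositiveSeqs and mFreqChars, and the ratio formula are all assumptions; only the names mTotalSeqs and mTotalChars come from the comments above.

```cpp
#include <cstdint>

// Hypothetical counter set. Only mTotalSeqs and mTotalChars are named in
// the thread; mPositiveSeqs and mFreqChars are assumed for illustration.
struct ProberCounters {
  uint32_t mPositiveSeqs = 0;  // assumed: sequences judged likely
  uint32_t mTotalSeqs = 0;     // two-char sequences seen so far
  uint32_t mFreqChars = 0;     // assumed: chars in the frequent set
  uint32_t mTotalChars = 0;    // chars counted so far
};

// Illustrative confidence: product of two ratios. The formula itself is an
// assumption. The guard shows why no NaN can reach the caller: 0/0 is never
// evaluated because both divisors are checked first.
float GetConfidenceSketch(const ProberCounters& c) {
  if (c.mTotalSeqs == 0 || c.mTotalChars == 0) {
    return 0.0f;  // nothing seen yet: report no confidence
  }
  float seqRatio = static_cast<float>(c.mPositiveSeqs) / c.mTotalSeqs;
  float charRatio = static_cast<float>(c.mFreqChars) / c.mTotalChars;
  return seqRatio * charRatio;
}
```

If, as stated above, a nonzero mTotalSeqs implies a nonzero mTotalChars, then checking mTotalSeqs alone would be enough; the second check in the sketch is purely defensive.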
fix checked in.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla0.9.6