Closed
Bug 100377
Opened 24 years ago
Closed 24 years ago
Auto-detect All detects yahoo-japan page as a wrong charset that save by Composer
Categories
(Core :: Internationalization, defect)
Tracking
()
RESOLVED
FIXED
mozilla0.9.6
People
(Reporter: amyy, Assigned: shanjian)
Details
(Keywords: intl)
Attachments
(2 files)
49.26 KB, text/html
2.86 KB, patch (tetsuroy: review+, brendan: superreview+)
Reproduced on: 09-17 branch build and N6.1 RTM build on Windows 2000 Simplified
Chinese.
Steps:
1. Launch the browser and go to http://www.yahoo.co.jp
2. File | Save As, and save the page as a new file A.
3. File | Edit Page to bring the page into Composer; the page displays fine. Then go
to File | Save As and save it as a new file B.
4. Open files A and B in the browser, then go to View | Character Coding |
Auto-Detect and set the auto-detector to "All".
Results:
File A, saved in step 2, displays fine and its charset is marked as EUC-JP,
which is the correct behavior. However, file B, saved in step 3, displays
garbled and its charset is marked as ISO-8859-2.
I cannot reproduce this on the 09-17 branch build on WinXP-Ja, Mac9.1-Ja, or
Linux6.2-Ja with the same steps; there the files display fine under auto-detect
All whether they were saved in the browser or in Composer.
If I change auto-detect to Japanese, the page is detected as EUC-JP, though.
| Reporter | ||
Updated•24 years ago
|
Keywords: intl
QA Contact: andreasb → ylong
Summary: Auto-detect All detects yahoo-japan as a wrong charset that save by Composer → Auto-detect All detects yahoo-japan page as a wrong charset that save by Composer
| Assignee | ||
Comment 2•24 years ago
|
||
Yuying, can you send me file B, the one that you say has the problem with
autodetection? I could not reproduce the problem following your steps.
Thanks.
| Reporter | ||
Comment 3•24 years ago
|
||
| Assignee | ||
Comment 4•24 years ago
|
||
This bug exposes a problem in my charset auto-detector. For scripts that also
use Latin letters, I preserve all Latin letters outside tags. Unfortunately,
sometimes too many English words are left over. That eventually leads to high
confidence in my negative approach to detecting single-byte charsets. Since there
is no easy way to identify which text is English and which is not, I have to
remove all pure-Latin words. Since all the scripts we try to identify use code
points outside ASCII, I don't see an immediate limitation to this approach.
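As an illustration of the approach described in this comment, here is a minimal sketch of dropping pure-Latin words before the text reaches a single-byte prober. It is not the checked-in patch: the names dropPureLatinWords and isAsciiLetter are hypothetical, and the sketch assumes that a "word" is a run of ASCII letters and high-bit bytes, that every other ASCII byte (digits, punctuation, whitespace) is a delimiter, and that kept words are rejoined with a single space.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: true for the ASCII letters A-Z and a-z.
static bool isAsciiLetter(unsigned char c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

// Hypothetical sketch, not the Mozilla implementation.
// Keeps only words that contain at least one byte >= 0x80 and
// discards words made purely of ASCII letters.
std::string dropPureLatinWords(const std::string& in) {
    std::string out;
    std::string word;
    bool hasHighByte = false;

    // Emit the current word if it contained a high-bit byte, then reset.
    auto flush = [&]() {
        if (hasHighByte) {
            if (!out.empty()) out += ' ';
            out += word;
        }
        word.clear();
        hasHighByte = false;
    };

    for (char ch : in) {
        unsigned char c = static_cast<unsigned char>(ch);
        if (c >= 0x80) {
            word += ch;
            hasHighByte = true;
        } else if (isAsciiLetter(c)) {
            word += ch;
        } else {
            flush();  // any other ASCII byte ends the current word
        }
    }
    flush();

    // Filtering only removes bytes or replaces delimiters, so it never grows the text.
    assert(out.size() <= in.size());
    return out;
}
```

With this filter, a page that is mostly English navigation text plus some EUC-JP content would contribute only the words carrying high-bit bytes to the single-byte probers, which is the effect the comment is after.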
Status: NEW → ASSIGNED
| Assignee | ||
Comment 5•24 years ago
|
||
Bugscape 9786 was filed for Detect All. This bug will take care of the universal
charset detector.
| Assignee | ||
Comment 6•24 years ago
|
||
| Assignee | ||
Comment 7•24 years ago
|
||
roy, could you review my fix?
Comment 8•24 years ago
|
||
Shanjian: I see nsSBCSGroupProber::FilterWithEnglishLetters()
being ifdef'ed by NO_ENGLISH_CONTAMINATION.
Is this NO_ENGLISH_CONTAMINATION related to this bug?
| Assignee | ||
Comment 9•24 years ago
|
||
No. I don't want to remove that part of the code because I believe the algorithm is
theoretically sound and it might be used in the future for detecting languages. So I
just commented it out.
Comment 10•24 years ago
|
||
Comment on attachment 50580 [details] [diff] [review]
proposed patch
/r=yokoyama
Attachment #50580 -
Flags: review+
| Assignee | ||
Comment 11•24 years ago
|
||
brendan, can you sr my patch?
Comment 12•24 years ago
|
||
Comment on attachment 50580 [details] [diff] [review]
proposed patch
sr=brendan@mozilla.org so long as mTotalChars can't be 0 when GetConfidence is called, or else its callers can handle a NaN.
/be
Attachment #50580 -
Flags: superreview+
| Assignee | ||
Comment 13•24 years ago
|
||
Thanks brendan. To address your concern, mTotalChars won't be zero if mTotalSeqs
is more than zero. A sequence contains 2 chars, and at least one char is counted
in mTotalChars.
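To make the division-by-zero argument concrete, here is a small sketch of a confidence computation guarded on the sequence count. It is an assumption-laden illustration, not the code in attachment 50580: the struct ProberCounts, the function confidenceSketch, and the formula itself (positive-sequence ratio times frequent-character ratio) are hypothetical. Only the invariant comes from this thread: if the sequence count is nonzero, the character count is nonzero too.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical counters, loosely named after the members mentioned in this bug.
struct ProberCounts {
    std::uint32_t positiveSeqs;  // sequences judged likely for the charset
    std::uint32_t totalSeqs;     // all two-character sequences seen
    std::uint32_t freqChars;     // characters that are frequent in the language
    std::uint32_t totalChars;    // all counted characters
};

// Hypothetical sketch: the formula is assumed for illustration only.
float confidenceSketch(const ProberCounts& c) {
    // With no sequences there is nothing to judge, so avoid dividing at all.
    if (c.totalSeqs == 0) {
        return 0.0f;
    }
    // Invariant from the discussion: a sequence has 2 chars and at least one
    // of them is counted, so totalSeqs > 0 implies totalChars > 0.
    assert(c.totalChars > 0);

    float r = static_cast<float>(c.positiveSeqs) / c.totalSeqs;
    r = r * c.freqChars / c.totalChars;
    return r;
}
```

Guarding on the sequence count rather than the character count mirrors the reasoning in the comment above; a caller never sees a NaN because the only divisions happen after both denominators are known to be positive.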
| Assignee | ||
Comment 14•24 years ago
|
||
fix checked in.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla0.9.6