Closed Bug 115114 Opened 23 years ago Closed 23 years ago

autodetect universal detects french as Central European (ISO-8859-2)

Component: Core :: Internationalization
Type: defect
Hardware: x86
OS: Windows 2000
Priority: Not set
Severity: normal
Status: VERIFIED FIXED
Reporter: jmdesp
Assigned: shanjian
Keywords: intl
Whiteboard: [adt2]
Attachments: 6 files

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:0.9.6+) Gecko/20011212
BuildID:    2001121203

Auto-detect Universal is nearly unusable for French.

It almost always misidentifies French pages encoded in ISO-8859-1 as
ISO-8859-2.

I reported this as a browser bug, but the same problem occurs in mail and news
whenever the content headers are missing.
In fact, it is most often encountered in mail/news, since most web pages carry
the ISO-8859-1 header.

"Auto-detect - all" in Netscape 6.2 has the same problem.

Reproducible: Always
Steps to Reproduce:
1. Select Universal in View/Character Coding/Auto-Detect
2. Load the sample URL http://www.aminautes.org/forums/configurer/oe/proe5.html

Actual Results:  
French users can tell immediately that some of the characters in the page are wrong.
If you don't know French, go to View/Character Coding and see that
Central European (ISO-8859-2) is shown as selected.
The content is actually in ISO-8859-1.

This happens almost always with French text.

I tried some German text, and it was also identified as Central European
(ISO-8859-2), but it just happens that all the German characters used in the
page (ö, ü, ä) have the same mapping in ISO-8859-1 and in ISO-8859-2.
Still, the "normal" encoding for German is ISO-8859-1, and if the data is detected
as ISO-8859-2 in mail/news, the reply will be sent in ISO-8859-2, which might be
annoying.
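
The difference between the German and French cases can be checked with a small sketch. It assumes that Python's built-in "latin-1" and "iso8859_2" codecs behave like the browser's ISO-8859-1 and ISO-8859-2 decoders; the helper name `misdecode` and the sample strings are made up for illustration.

```python
# Assumption: Python's built-in "latin-1" and "iso8859_2" codecs stand in for
# the browser's ISO-8859-1 and ISO-8859-2 decoders.

def misdecode(text: str) -> str:
    """Encode text as ISO-8859-1, then decode the same bytes as ISO-8859-2."""
    return text.encode("latin-1").decode("iso8859_2")

german = "äöüÄÖÜß"   # sample German letters (hypothetical test data)
french = "àèêù"      # sample French letters (hypothetical test data)

print(misdecode(german))  # German umlauts and ß come back unchanged
print(misdecode(french))  # these French letters become Central European letters
```

Not every French letter is affected: é (0xE9) and ç (0xE7), for example, sit at the same code point in both charsets, so a page can look only partly wrong.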

shanjian@netscape.com is the owner of the universal autodetector.
Status: UNCONFIRMED → NEW
Ever confirmed: true
The German umlaut characters are in fact common to ISO-8859-2 and ISO-8859-1,
and as a result the display is correct.
But the normal charset for German is ISO-8859-1, and it will be annoying for the
user that the data is identified as ISO-8859-2 instead.
As the bug is completely reproducible in Netscape 6.2, whose implementation of
"autodetect-all" is independent of the implementation of
"autodetect-universal" in Mozilla, I wonder whether the problem really is in the
autodetection module, or whether something is wrong in the interpretation of
its result that would map a result of ISO-8859-1 to ISO-8859-2.
I know what causes the problem. The current algorithm for latinX detection needs
to be fine-tuned. We do not try to detect the latin1 charset and only default to
it when all other detectors give us low confidence. Because the latin1 and latin2
charsets share many code points, it is necessary to add a latin1 detector and let
it compete with latin2. I think most of the problem can be resolved this way.
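
The competition described here can be sketched roughly as follows. This is a toy model, not the Mozilla detector: the letter sets, the scoring rule, the `confidence` and `detect` functions, and the 0.5 threshold are all assumptions chosen for illustration.

```python
# Toy "probers": each one decodes the high bytes (>= 0x80) with its charset and
# reports the fraction that land on letters typical for its languages.
# The letter sets and the scoring rule are illustrative assumptions only.
LATIN1_LETTERS = set("àâçèéêëîïôùûüäöß")              # French / German
LATIN2_LETTERS = set("áäčďéěíňóöřšťúůüýžąćęłńśźżőű")  # Czech, Hungarian, Polish

def confidence(data: bytes, charset: str, typical: set) -> float:
    """Fraction of high bytes that decode to a letter in `typical`."""
    high = [b for b in data if b >= 0x80]
    if not high:
        return 0.0
    hits = sum(1 for b in high
               if bytes([b]).decode(charset, errors="replace") in typical)
    return hits / len(high)

def detect(data: bytes, probers: dict,
           default: str = "iso-8859-1", threshold: float = 0.5) -> str:
    """Return the charset of the most confident prober, or the default."""
    scores = {name: confidence(data, name, letters)
              for name, letters in probers.items()}
    best = max(scores, key=scores.get, default=None)
    if best is None or scores[best] < threshold:
        return default
    return best

# Hypothetical French sample, encoded as ISO-8859-1.
sample = "Le système a été configuré à côté".encode("latin-1")

# Only a latin2 prober: shared code points such as 0xE9 give it enough
# confidence to win, so the default is never reached.
print(detect(sample, {"iso-8859-2": LATIN2_LETTERS}))

# With a competing latin1 prober, the higher score decides.
print(detect(sample, {"iso-8859-1": LATIN1_LETTERS,
                      "iso-8859-2": LATIN2_LETTERS}))
```

In this sketch the latin2 prober alone claims the French sample, while adding a latin1 prober lets the better match win, which is the behaviour the comment above is aiming for.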
Status: NEW → ASSIGNED
Keywords: intl
QA Contact: teruko → ylong
This seems tangentially related to the bug I filed, #117758

http://bugzilla.mozilla.org/show_bug.cgi?id=117758

For some reason character E9 (hexadecimal) renders correctly with the
sample attached to this bug (115114), but on the front page of www.salon.com
it renders as a diamond with a question mark.

Any ideas why?

Default language: English [en]; coding: Western (iso-8859-1)
*** Bug 124734 has been marked as a duplicate of this bug. ***
-> nsbeta1, since more similar bugs keep coming. 
Keywords: nsbeta1
nsbeta1- since NS is not shipping with this.
Keywords: nsbeta1 → nsbeta1-
But there is a very similar bug in the "Auto-detect - all" feature of NS 
releases.
*** Bug 129246 has been marked as a duplicate of this bug. ***
Attached patch: patch
Charsets latin1 and latin2 share many similarities, and it is pretty hard to
discriminate between them. In the current universal charset implementation,
there is no latin1 verifier; we only use latin1 as the default. That is to say,
if no other charset verifier declares confidence, we use latin1. That worked
pretty well until I added the latin2 charset verifier. It seems we have to
remove this latin2 verifier until a latin1 verifier is available. Otherwise the
possible side effect is simply too serious.
roy, could you r= this patch?

Since the change is very simple and the impact is big, I would suggest
reconsidering this one for mozilla 1.0.
Keywords: nsbeta1- → nsbeta1
*** Bug 131401 has been marked as a duplicate of this bug. ***
shanjian: 
I think we should ask Frank for review.
I am happy to review it for you if frank is too busy.

frank?
nsbeta1+ per ftang

roy, please go ahead and review it. I talked to frank and he had no objection. 
Keywords: nsbeta1 → nsbeta1+
shanjian:
Why do we need to remove the following line?
+//mProbers[11] = new nsSingleByteCharSetProber(&Win1250HungarianModel);
Did we have two latin2 charset verifiers?


win1250 should have the same problem as latin2. It may share some similarity
with win1252. Besides, we don't want to have only one charset for Hungarian,
especially when it is not a popular one.
win1250 does have the same problem as latin2.
In one of the duplicates, the ISO-8859-1 page was detected as win1250.
I didn't report specific cases, but I think I saw the same problem too.

A slightly better change would be a preference in prefs.js that would let
Bulgarian users re-enable win1250/latin2 autodetection if needed.
But it might be too late for that in 1.0, and it would not be that useful
compared to a patch that truly solves the problem.
Comment on attachment 78373
patch

r=ftang. Removing a sub-module should not cause problems.
Attachment #78373 - Flags: review+
brendan, could you sr?
[adt2]: this impacts European users. We also want to use the Universal detector
for the Netscape beta, because its quality is better once we fix this problem.
Whiteboard: [adt2]
Comment on attachment 78373
patch

sr=scc
Attachment #78373 - Flags: superreview+
fix checked into trunk.
Verified on 04-19 on 3 platforms: WinME, Mac10.1.3 and linux RH7.2. The original
report page:
http://www.aminautes.org/forums/configurer/oe/proe5.html
is detected as iso-8859-1 with auto-detect Universal.
But under Mac OS X 10.1.4 the original report page:
http://www.aminautes.org/forums/configurer/oe/proe5.html
is detected as iso-8859-2 with auto-detect Universal.
As I reported in Bug 131401, this issue doesn't occur only with pages using
foreign characters. The problem seems deeper. Even this purely American page is
detected as 8859-2 when Auto-detect is set to Universal:

http://maccentral.macworld.com/news/0204/19.sjaug.php

Many other American pages give the same result.
If this problem is purely a Mac OS X problem, we should probably reopen Bug 131401...
Please disregard my Comments #28 and #29: fortunately, I was using an older
trunk build at the time. Sorry for that... Using the latest trunk (2002041903)
gives excellent results with the test page, with MacCentral pages and with all
the other pages where the bug was detected. Great work.
Using the 2002041903 build I discovered that when Auto Detect is set to
Universal, a Japanese encoding can be chosen when it should be Western 8859-1 (I
already reported this issue before: see bug 131401 comment #26). The screenshot
was taken from the following page:
http://www.versiontracker.com/moreinfo.fcgi?id=13513&db=macosx
bruno lapeyre,
You might want to file a separate bug for that. Generally speaking, charset
autodetection will never be 100% accurate. That page contains some byte
combinations that can also be interpreted as valid Japanese characters. So I am
not sure that is solvable in the near future.
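
How Western bytes can pass as Japanese can be shown with a short sketch. It assumes a made-up sample word ("créé"), not bytes taken from the reported page, and uses Python's "euc_jp" codec as a stand-in for one Japanese decoder.

```python
# Assumption: the word "créé" is a made-up sample, not taken from the page in
# the report; Python's "euc_jp" codec stands in for a Japanese decoder.
latin1_bytes = "créé".encode("latin-1")      # b'cr\xe9\xe9'

as_western = latin1_bytes.decode("latin-1")
# In EUC-JP, two consecutive bytes in the range 0xA1-0xFE form one
# double-byte character, so 0xE9 0xE9 decodes without any error.
as_japanese = latin1_bytes.decode("euc_jp")

print(as_western)        # the original four-letter word
print(len(as_japanese))  # 3: 'c', 'r' and a single kanji
```

Because both decodings succeed, a detector can only weigh which reading is more plausible, and short runs of such bytes give it little evidence either way.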
Asking for adt1.0.0; QA has verified this on the trunk.
The risk is low since we are removing code inside that particular detector,
which is not turned on by default for English users.
Keywords: adt1.0.0
Some other bugs depend on this:
http://bugscape.mcom.com/show_bug.cgi?id=11459
http://bugscape.mcom.com/show_bug.cgi?id=11250
Also, Vantive 279691 depends on this.
adding adt1.0.0+.  Please check this into the branch as soon as possible and add
the fixed1.0.0 keyword.
Keywords: adt1.0.0 → adt1.0.0+
fix checked into trunk.
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Keywords: fixed1.0.0
Resolution: --- → FIXED
Verified: works fine on the 04-25 branch build.
Status: RESOLVED → VERIFIED