Closed Bug 115114 Opened 23 years ago Closed 22 years ago

autodetect universal detects french as Central European (ISO-8859-2)

Tracking

Status:

VERIFIED FIXED

People

(Reporter: jmdesp, Assigned: shanjian)

Details

(Keywords: intl, Whiteboard: [adt2])

Attachments

(6 files)

a sample short french text that has the same problem (23 years ago, Jean-Marc Desperrier, 695 bytes, text/html)
Another sample short french text that has the same problem (23 years ago, Jean-Marc Desperrier, 694 bytes, text/html)
Another sample short french text that has the same problem (23 years ago, Jean-Marc Desperrier, 1.61 KB, text/html)
a sample short german text that is audetected as central european (the display happens to be correct) (23 years ago, Jean-Marc Desperrier, 882 bytes, text/html)
patch (22 years ago, Shanjian Li, 2.26 KB, patch; ftang: review+, scc: superreview+, jesup: approval+)
Japanese encoding used when it shouldn't (22 years ago, bruno lapeyre, 93.76 KB, image/png)

Jean-Marc Desperrier

Reporter

Description

•

23 years ago

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:0.9.6+)
Gecko/20011212
BuildID:    2001121203

Universal autodetection is nearly unusable for French.

It almost systematically misidentifies French pages encoded in ISO-8859-1 as
ISO-8859-2.

I reported this as a browser bug, but the same problem occurs in mail and news
whenever the content headers are missing.
In fact, it is most often encountered in mail/news, since most web pages carry
the ISO-8859-1 header.

"Auto-detect - all" in Netscape 6.2 has the same problem.

Reproducible: Always
Steps to Reproduce:
1. Select Universal in View/Character Coding/Auto-Detect
2. Load the sample URL http://www.aminautes.org/forums/configurer/oe/proe5.html

Actual Results:  
French users can tell immediately that some of the characters in the page are wrong.
But if you don't know French, you can go to View/Character Coding and see that
Central European (ISO-8859-2) is shown as selected.
The content is actually in ISO-8859-1.

This happens almost systematically with French text.

I tried with some German text, and it was also identified as Central European
(ISO-8859-2), but it just happens that all the German characters used in the
page (ö, ü, ä) have the same mapping in ISO-8859-1 and in ISO-8859-2.
Still, the "normal" encoding for German is ISO-8859-1, and if the data is detected
as ISO-8859-2 in mail/news, the reply will be sent in ISO-8859-2, which might be
annoying.
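
The difference between the German and French cases can be checked directly. The following is a minimal Python sketch (the helper name `misrendered` and the sample character lists are illustrative assumptions, not taken from the attachments) that encodes characters as ISO-8859-1, decodes the same bytes as ISO-8859-2, and reports only the characters that change.

```python
def misrendered(chars, real="iso-8859-1", guessed="iso-8859-2"):
    """Map each character to what it becomes when its bytes in the `real`
    charset are decoded as `guessed`; unchanged characters are omitted."""
    changed = {}
    for ch in chars:
        shown = ch.encode(real).decode(guessed)
        if shown != ch:
            changed[ch] = shown
    return changed


# Illustrative samples, not an exhaustive list for either language.
german = "öüäÖÜÄß"
french = "éèêàçùûôî"

print("German:", misrendered(german))
print("French:", misrendered(french))
```

With these samples the German set comes back empty, while French letters such as è, ê and à turn into Central European letters (è becomes č, for example), which matches the symptoms described above.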

shanjian@netscape.com is the owner of the universal autodetector.

Boris Zbarsky [:bzbarsky]

Updated

•

23 years ago

Status: UNCONFIRMED → NEW

Ever confirmed: true

Jean-Marc Desperrier

Reporter

Comment 1

•

23 years ago

Attached file a sample short french text that has the same problem — Details

Jean-Marc Desperrier

Reporter

Comment 2

•

23 years ago

Attached file Another sample short french text that has the same problem — Details

Jean-Marc Desperrier

Reporter

Comment 3

•

23 years ago

Attached file Another sample short french text that has the same problem — Details

Jean-Marc Desperrier

Reporter

Comment 4

•

23 years ago

Attached file a sample short german text that is audetected as central european (the display happens to be correct) — Details

The German umlaut characters are in fact common to ISO-8859-2 and ISO-8859-1,
and as a result the display is correct.
But the normal charset for German is ISO-8859-1, and it will be annoying for the
user that the data is identified as ISO-8859-2 instead.

Jean-Marc Desperrier

Reporter

Comment 5

•

23 years ago

As the bug is completely reproducible in Netscape 6.2, whose implementation of
"autodetect-all" is independent of the implementation of
"autodetect-universal" in Mozilla, I wonder whether the problem really is in the
autodetection module, or whether something is wrong in the interpretation of
its result that would map a result of ISO-8859-1 to ISO-8859-2.

Shanjian Li

Assignee

Comment 6

•

23 years ago

I know what causes the problem. The current algorithm for latinX detection needs to
be fine-tuned. We do not try to detect the latin1 charset and only default to it
when all other detectors give us low confidence. Because the latin1 and latin2
charsets share many code points, it is necessary to add a latin1 detector and let
it compete with latin2. I think most of the problem can be resolved this way.

Status: NEW → ASSIGNED

Teruko Kobayashi

Updated

•

23 years ago

Keywords: intl

QA Contact: teruko → ylong

Chris Kuklewicz

Comment 7

•

23 years ago

This seems tangentially related to the bug I filed, #117758

http://bugzilla.mozilla.org/show_bug.cgi?id=117758

For some reason character E9 (hexadecimal) renders correctly with the
sample attached to this bug (115114), but on the front page of www.salon.com
it renders as a diamond with a question mark.

Any ideas why?

Default language: English [en]; coding: Western (iso-8859-1)

Hugh Kennedy

Comment 8

•

23 years ago

*** Bug 124734 has been marked as a duplicate of this bug. ***

Yuying Long

Comment 9

•

23 years ago

-> nsbeta1, since more similar bugs keep coming.

Keywords: nsbeta1

Frank Tang

Comment 10

•

23 years ago

nsbeta1- since NS is not shipping with this.

Keywords: nsbeta1 → nsbeta1-

Jean-Marc Desperrier

Reporter

Comment 11

•

23 years ago

But there is a very similar bug in the "Auto-detect - all" feature of NS 
releases.

Boris Zbarsky [:bzbarsky]

Comment 12

•

22 years ago

*** Bug 129246 has been marked as a duplicate of this bug. ***

Shanjian Li

Assignee

Comment 13

•

22 years ago

Attached patch patch — Details — Splinter Review

Shanjian Li

Assignee

Comment 14

•

22 years ago

The latin1 and latin2 charsets share many similarities, and it is pretty hard to
discriminate between them. In the current universal charset implementation,
there is no latin1 verifier; we only use latin1 as the default. That is to say, if no
other charset verifier declares confidence, we use latin1. That worked pretty
well until I added the latin2 charset verifier. It seems we have to remove
this latin2 verifier until a latin1 verifier is available. Otherwise the possible
side effect is simply too serious.
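
The default-only behaviour described here can be sketched roughly as follows. This is a simplified illustration in Python, not the detector's actual code: the function name `pick_charset`, the threshold, and all confidence values are made-up assumptions chosen only to show why a lone latin2 verifier wins against a default that never competes.

```python
def pick_charset(confidences, threshold=0.2, default="ISO-8859-1"):
    """Return the charset whose verifier reports the highest confidence,
    or `default` when no verifier reaches `threshold`."""
    if not confidences:
        return default
    best = max(confidences, key=confidences.get)
    return best if confidences[best] >= threshold else default


# Hypothetical confidence values for a French ISO-8859-1 page.
with_latin2_only = {"ISO-8859-2": 0.6}                   # latin2 verifier present, no latin1 verifier
after_removal = {}                                       # latin2 verifier removed
with_both = {"ISO-8859-2": 0.6, "ISO-8859-1": 0.9}       # latin1 verifier competing

print(pick_charset(with_latin2_only))  # ISO-8859-2
print(pick_charset(after_removal))     # ISO-8859-1
print(pick_charset(with_both))         # ISO-8859-1
```

Under these assumptions, removing the latin2 verifier restores the old default, and a future latin1 verifier would only help if it reports a higher confidence than latin2 on Western text.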

Shanjian Li

Assignee

Comment 15

•

22 years ago

roy, could you r= this patch?

Since the change is very simple and the impact is big, I would suggest
reconsidering this one for mozilla 1.0.

Keywords: nsbeta1- → nsbeta1

Shanjian Li

Assignee

Comment 16

•

22 years ago

*** Bug 131401 has been marked as a duplicate of this bug. ***

Roy Yokoyama

Comment 17

•

22 years ago

shanjian: 
I think we should ask Frank for review.
I am happy to review it for you if frank is too busy.

frank?

Shanjian Li

Assignee

Comment 18

•

22 years ago

nsbeta1+ per ftang

roy, please go ahead and review it. I talked to frank and he had no objection.

Keywords: nsbeta1 → nsbeta1+

Roy Yokoyama

Comment 19

•

22 years ago

shanjian:
why do we need to remove
+//mProbers[11] = new nsSingleByteCharSetProber(&Win1250HungarianModel);
Did we have two latin2 charset verifiers?

Shanjian Li

Assignee

Comment 20

•

22 years ago

win1250 should have the same problem as latin2. It may share some similarity
with win1252. Besides, we don't want to have only one charset for Hungarian,
especially when it is not a popular one.
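
How much win1250 and win1252 overlap can be estimated with a short Python sketch like the one below. The helper name `shared_high_bytes` is an illustrative assumption; the code simply compares how Python's built-in codecs decode each byte in the upper half, skipping bytes that either codec leaves undefined.

```python
def shared_high_bytes(enc_a, enc_b):
    """Return the bytes in 0x80..0xFF that decode to the same character in
    both encodings; bytes undefined in either encoding are skipped."""
    shared = []
    for value in range(0x80, 0x100):
        raw = bytes([value])
        try:
            if raw.decode(enc_a) == raw.decode(enc_b):
                shared.append(value)
        except UnicodeDecodeError:
            continue
    return shared


for a, b in [("cp1250", "cp1252"), ("iso-8859-2", "iso-8859-1")]:
    common = shared_high_bytes(a, b)
    print(f"{a} vs {b}: {len(common)} of 128 high bytes decode identically")
```

Byte 0xE9 (é) is in the shared set for both pairs, while 0xE8 is not (it is è in the Western charsets and č in the Central European ones), so a verifier that only sees common letters has little to go on.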

Jean-Marc Desperrier

Reporter

Comment 21

•

22 years ago

win1250 does have the same problem as latin2.
In one of the duplicates, the ISO-8859-1 page was detected as win1250.
I didn't report specific cases, but I think I saw the same problem too.

A slightly better change would be to have a preference in prefs.js that would
allow Bulgarian users to re-enable win1250/latin2 autodetection if needed.
But it might be too late for that for 1.0, and not that useful compared to a
patch that would truly solve the problem.

Frank Tang

Comment 22

•

22 years ago

Comment on attachment 78373 [details] [diff] [review]
patch

r=ftang. Removing a submodule should not cause problems.

Attachment #78373 - Flags: review+

Shanjian Li

Assignee

Comment 23

•

22 years ago

brendan, could you sr?

Frank Tang

Comment 24

•

22 years ago

[adt2], this impacts European users. We also want to use the Universal detector for
Netscape beta, because the quality is better once we fix this problem.

Whiteboard: [adt2]

Scott Collins

Comment 25

•

22 years ago

Comment on attachment 78373 [details] [diff] [review]
patch

sr=scc

Attachment #78373 - Flags: superreview+

Shanjian Li

Assignee

Comment 26

•

22 years ago

fix checked into trunk.

Yuying Long

Comment 27

•

22 years ago

Verified on 04-19 on 3 platforms: WinME, Mac10.1.3 and linux RH7.2; the original
report page:
http://www.aminautes.org/forums/configurer/oe/proe5.html
is detected as iso-8859-1 with auto-detect Universal.

bruno lapeyre

Comment 28

•

22 years ago

But under Mac OS X 10.1.4 the original report page:
http://www.aminautes.org/forums/configurer/oe/proe5.html
is detected as iso-8859-2 with auto-detect Universal.

bruno lapeyre

Comment 29

•

22 years ago

As I reported in Bug 131401, this issue doesn't occur only with pages using
foreign characters. The problem seems deeper. Even this purely American page is
detected as 8859-2 when Auto-detect is set to Universal:

http://maccentral.macworld.com/news/0204/19.sjaug.php

Many other American pages give the same result.
If this is purely a Mac OS X problem, we should probably reopen Bug 131401...

bruno lapeyre

Comment 30

•

22 years ago

Please disregard my Comments #28 and #29: it turns out I was using an older
trunk build at the time. Sorry for that... Using the latest trunk (2002041903) gives
excellent results with the test page, with MacCentral pages and with all the other
pages where the bug was detected. Great work.

bruno lapeyre

Comment 31

•

22 years ago

Attached image Japanese encoding used when it shouldn't — Details

Using the 2002041903 build, I discovered that when Auto Detect is set to
Universal, a Japanese encoding can be used when it should be Western 8859-1 (I
already reported this issue before: see bug 131401 comment #26). The screenshot
was taken from the following page:
http://www.versiontracker.com/moreinfo.fcgi?id=13513&db=macosx

Shanjian Li

Assignee

Comment 32

•

22 years ago

bruno lapeyre,
You might want to file a separate bug for that. Generally speaking, charset
autodetection will never be 100% accurate. That page contains some byte
combinations that can also be interpreted as valid Japanese characters. So I am
not sure whether that is solvable in the near future.
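
The kind of ambiguity meant here can be illustrated with a small Python sketch. The word used is an arbitrary example, not text from the page in question, and EUC-JP is just one of the Japanese encodings a detector might consider: two adjacent accented letters in ISO-8859-1 form a byte pair that is also a well-formed two-byte EUC-JP character.

```python
# "créée" in ISO-8859-1 contains two adjacent high bytes: 0xE9 0xE9.
raw = "créée".encode("iso-8859-1")

as_western = raw.decode("iso-8859-1")
as_japanese = raw.decode("euc_jp")  # the same bytes, read as EUC-JP

print(repr(as_western), len(as_western))
print(repr(as_japanese), len(as_japanese))
```

Both decodes succeed without error, so byte validity alone cannot rule out the Japanese reading; the detector has to rely on statistical likelihood, which can be misled on short or unusual input.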

Frank Tang

Comment 33

•

22 years ago

ask for adt1.0.0; QA has verified this on the trunk.
The risk is low since we are removing code inside that particular detector,
which is not turned on by default for English users.

Keywords: adt1.0.0

Frank Tang

Comment 34

•

22 years ago

Some other bugs depend on this:
http://bugscape.mcom.com/show_bug.cgi?id=11459
http://bugscape.mcom.com/show_bug.cgi?id=11250
Also, vantive 279691 depends on this.

scottputterman

Comment 35

•

22 years ago

adding adt1.0.0+.  Please check this into the branch as soon as possible and add
the fixed1.0.0 keyword.

Keywords: adt1.0.0 → adt1.0.0+

Randell Jesup [:jesup] (needinfo me)

Comment 36

•

22 years ago

Comment on attachment 78373 [details] [diff] [review]
patch

a=rjesup@wgate.com for branch checkin

Attachment #78373 - Flags: approval+

Shanjian Li

Assignee

Comment 37

•

22 years ago

fix checked into trunk.

Status: ASSIGNED → RESOLVED

Closed: 22 years ago

Keywords: fixed1.0.0

Resolution: --- → FIXED

Yuying Long

Comment 38

•

22 years ago

Verified: works fine on the 04-25 branch build.

Status: RESOLVED → VERIFIED

Keywords: fixed1.0.0 → verified1.0.0

Michael Dunn

Comment 39

•

22 years ago

Internal reference:
http://bugscape.netscape.com/show_bug.cgi?id=13722
