"Central European (Windows-1250)" autodetected as "Western (ISO-8859-1)"

RESOLVED WONTFIX

Status

Core
Internationalization
11 years ago
2 years ago

People

(Reporter: Alexander Strange, Assigned: smontagu)

(Reporter)

Description

11 years ago
User-Agent:       Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en) AppleWebKit/419 (KHTML, like Gecko) Safari/419.3
Build Identifier: Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3

The Universal character set auto-detector can't detect Windows Latin-2/CP1250 text.
This file is an example; it's detected as Latin-1 despite having characters which are illegal in that charset.

For instance:
281
00:20:20,780 --> 00:20:22,740
ať se nezhroutím. "

The ť is only valid in Latin-2.
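
The point about illegal characters can be checked mechanically. The following is only a sketch, not code from the detector: the helper name `has_cp1252_undefined_byte` is made up for illustration. It relies on windows-1252 leaving the bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D unassigned, while windows-1250 encodes ť as 0x9D.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if the buffer contains a byte that
   windows-1252 leaves unassigned (0x81, 0x8D, 0x8F, 0x90, 0x9D). */
static bool has_cp1252_undefined_byte(const unsigned char *data, size_t length)
{
    for (size_t i = 0; i < length; i++) {
        switch (data[i]) {
        case 0x81: case 0x8D: case 0x8F: case 0x90: case 0x9D:
            return true;
        default:
            break;
        }
    }
    return false;
}
```

A hit only rules out windows-1252. It does not by itself show that the text is windows-1250.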

Reproducible: Always

Steps to Reproduce:
1. Enable Character Encoding > Auto-detect > Universal
2. Load example page
3. Check detected charset
Actual Results:  
Western (ISO-8859-1)

Expected Results:  
Central European (Windows-1250)

 SBCS Group Prober --------begin status
  SBCS: 0.140 [windows-1251]
  SBCS: 0.000 [KOI8-R]
  SBCS: 0.000 [ISO-8859-5]
  SBCS: 0.108 [x-mac-cyrillic]
  SBCS: 0.000 [IBM866]
  SBCS: 0.000 [IBM855]
  SBCS: 0.006 [ISO-8859-7]
  SBCS: 0.006 [windows-1253]
  SBCS: 0.000 [ISO-8859-5]
  SBCS: 0.074 [windows-1251]
  HEB: 144 - 0 [Logical-Visual score]
  inactive: [windows-1255] (i.e. confidence is too low).
  inactive: [windows-1255] (i.e. confidence is too low).
 SBCS Group found best match [windows-1251] confidence 0.139692.
 Latin1Prober: 0.010 [windows-1252]

The Latin-2 probers are commented out in the source, which is obviously the problem. I haven't enabled them to see whether they actually work.
(Reporter)

Comment 1

11 years ago
Another example:
http://astrange.ithinksw.net/deadgirl.srt

This is detected as Windows-1252 at ~50%.

  MBCS inactive: [UTF8] (confidence is too low).
  MBCS inactive: [SJIS] (confidence is too low).
  MBCS inactive: [EUCJP] (confidence is too low).
  MBCS inactive: [GB18030] (confidence is too low).
  MBCS inactive: [EUCKR] (confidence is too low).
  MBCS inactive: [Big5] (confidence is too low).
  MBCS inactive: [EUCTW] (confidence is too low).
 SBCS Group Prober --------begin status 
  SBCS: 0.051 [windows-1251]
  SBCS: 0.000 [KOI8-R]
  SBCS: 0.000 [ISO-8859-5]
  SBCS: 0.054 [x-mac-cyrillic]
  SBCS: 0.010 [IBM866]
  SBCS: 0.010 [IBM855]
  SBCS: 0.002 [ISO-8859-7]
  SBCS: 0.002 [windows-1253]
  SBCS: 0.000 [ISO-8859-5]
  SBCS: 0.000 [windows-1251]
  HEB: 0 - 0 [Logical-Visual score]
  inactive: [windows-1255] (i.e. confidence is too low).
  inactive: [windows-1255] (i.e. confidence is too low).
 SBCS Group found best match [x-mac-cyrillic] confidence 0.054220.
 Latin1Prober: 0.500 [windows-1252]

This one looks quite difficult.
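
To make the numbers in this dump concrete, here is a minimal sketch that assumes the detector simply reports whichever prober has the highest confidence. The real selection logic may differ. The `prober_result` struct and the `best_prober` function are hypothetical names, and the table only restates the non-zero rows of the dump.

```c
#include <stddef.h>

/* Hypothetical record for one line of the prober status dump. */
struct prober_result {
    const char *charset;
    double confidence;
};

/* Returns the highest-confidence entry, or NULL for an empty list. */
static const struct prober_result *
best_prober(const struct prober_result *results, size_t count)
{
    const struct prober_result *best = NULL;
    for (size_t i = 0; i < count; i++) {
        if (best == NULL || results[i].confidence > best->confidence)
            best = &results[i];
    }
    return best;
}

/* Values copied from the deadgirl.srt dump (zero-confidence rows omitted). */
static const struct prober_result deadgirl_results[] = {
    { "windows-1251",   0.051 },
    { "x-mac-cyrillic", 0.054 },
    { "IBM866",         0.010 },
    { "IBM855",         0.010 },
    { "ISO-8859-7",     0.002 },
    { "windows-1253",   0.002 },
    { "windows-1252",   0.500 },  /* Latin1Prober */
};
```

Under that assumption, no Latin-2 candidate is in the list at all, so the Latin1Prober's 0.500 wins by default.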
(Assignee)

Comment 2

11 years ago
Latin2 probers were removed in bug 115114. See bug 115114 comment 14 for the reasoning.
(Reporter)

Comment 3

11 years ago
I enabled the Latin2 probers in my copy of the charset detector, then generated this lame latin1/2 disambiguator to fix the resulting huge false positive rate on Latin1.

#include <stdbool.h>

static bool DifferentiateLatin12(const unsigned char *data, int length)
{
	// Per-byte weights generated from French/German (Latin-1) and Hungarian/Slovak (Latin-2) text.
	
	static const short frequencies[256] = {
		0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
		0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
		0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
		0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
		513, 1000, -196, -338, 497, -1420, -1356, -850, 452, -1961, -1513, 726, -2247, -367, 1490, -1300, 
		-158, -2306, -1420, 16, 352, 226, -330, -1495, 0, 959, 1308, 0, 0, 0, 0, 0, 
		0, 1845, 1743, 2658, 234, -4533, -1098, -1782, -1138, 2185, 3159, 4390, -1125, 2217, -2643, 647, 
		297, -4997, -3176, -4854, -505, -1176, 744, -1243, -2163, 3706, 763, 0, 0, 0, 0, 0, 
		0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 628, 0, 0, 0, 363, 0, 
		0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3989, 0, -1279, 513, 4714, 0, 
		0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2405, 0, 0, 0, 0, 0, 
		0, 0, 0, 0, 0, 0, 1777, 0, 0, 0, 6035, 0, 0, 0, 0, 0, 
		-1107, 0, 0, 811, -639, 0, 0, -1107, 725, -745, 0, 0, 0, 0, 1986, 0, 
		0, 0, 0, 0, 0, 0, 0, 0, 888, 0, 0, 0, 0, 0, 725, -1567, 
		-3675, 6477, 2190, 10702, -1107, 0, 0, -2306, -824, -2951, -3262, 0, 5665, 6755, 2178, 1622, 
		0, 0, 811, 1088, -1430, 0, -1567, 0, 4352, 1048, 725, -904, -1107, 3343, 4616, 0};
	
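	int frcount = 0;
	
	// Sum the weight of every byte; the guard also stops on a negative length.
	while (length-- > 0) {
		frcount += frequencies[*data++];
	}
	
	// A total at or below zero means the negative weights dominated.
	return frcount <= 0;
}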
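
A short worked example may help show how the sum behaves. This is a sketch under assumptions: `sample_weight` and `sample_score` are made-up names, only five weights are copied from the table above, and every other byte is treated as 0, which the full table does not do. Reading positive totals as "looks like Latin-2" is an interpretation of the weights, not something stated in the comment.

```c
#include <stddef.h>

/* Illustrative subset of the weights in the table above. */
static int sample_weight(unsigned char byte)
{
    switch (byte) {
    case 0x9A: return 3989;   /* š in windows-1250 */
    case 0x9E: return 4714;   /* ž in windows-1250 */
    case 0xE0: return -3675;  /* à in windows-1252 */
    case 0xE7: return -2306;  /* ç in windows-1252 */
    case 0xE9: return -2951;  /* é in windows-1252 */
    default:   return 0;
    }
}

/* Sums the per-byte weights, the same accumulation as the function above. */
static int sample_score(const unsigned char *data, size_t length)
{
    int total = 0;
    for (size_t i = 0; i < length; i++)
        total += sample_weight(data[i]);
    return total;
}
```

With these weights, the bytes 0x9E 0x9A sum to 8703, and the bytes 0xE9 0xE7 0xE0 sum to -8932, so the first input would fall on the Latin-2 side and the second on the Latin-1 side.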


I guess the real problem is that the Latin1 prober should be better.

The detector is gone.
Status: UNCONFIRMED → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → WONTFIX