Bug 1602816 Comment 34 Edit History

Note: The actual edited comment in the bug view page will always show the original commenter’s name and original timestamp.

Original comment by Henri Sivonen (:hsivonen) on 2019-12-17 02:05:42 PST

(In reply to Ben Campbell from comment #30)
> (hmm. I wonder why it didn't keep going and default to KOI8-RU, which substitutes one more box character to also cover Belarusian?)

The encoding that the Web calls "KOI8-U" is actually KOI8-RU. This is somewhat like how ISO-8859-1 is an alias of windows-1252, and how EUC-KR really means EUC-KR with the windows-949 extensions included.

(In reply to Jorg K (GMT+1) (PTO to 5th Jan 2020, sporadically reading bugmail) from comment #32)
> Henri, how is it possible to reliably detect any ANSI encoding like windows-1252.

The reliability increases as the length of input increases. Indeed, a single letter ä without any word context for it is not a great test case.

> How do you know it's not Greek

Greek words are completely non-ASCII. windows-1252 words tend to mix ASCII and non-ASCII, and four or more consecutive non-ASCII is rare.

> or Polish for example?

Less well, but it's pretty good after a couple of sentences.

> Or Turkish (windows-1254)

Even less well, but it's pretty OK after a couple of sentences. windows-1252 has two language models: One for Icelandic and Faroese and another that merges everything else. The Turkish letters that overlap some Icelandic/Faroese-only letters which are penalized in the "everything else" windows-1252 model.

Revision 1 by Henri Sivonen (:hsivonen) on 2019-12-17 02:08:48 PST

(In reply to Ben Campbell from comment #30)
> (hmm. I wonder why it didn't keep going and default to KOI8-RU, which substitutes one more box character to also cover Belarusian?)

The encoding that the Web calls "KOI8-U" is actually KOI8-RU. This is somewhat like how ISO-8859-1 is an alias of windows-1252, and how EUC-KR really means EUC-KR with the windows-949 extensions included.

(In reply to Jorg K (GMT+1) (PTO to 5th Jan 2020, sporadically reading bugmail) from comment #32)
> Henri, how is it possible to reliably detect any ANSI encoding like windows-1252.

The reliability increases as the length of input increases. Indeed, a single letter ä without any word context for it is not a great test case.

> How do you know it's not Greek

Greek words are completely non-ASCII. windows-1252 words tend to mix ASCII and non-ASCII, and four or more consecutive non-ASCII is rare.
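
To illustrate the byte-level difference described above, here is a minimal Python sketch. It is not the detector's actual scoring, only a demonstration of the signal; the helper names `longest_non_ascii_run` and `is_entirely_non_ascii` and the two sample words are assumptions made up for this example.

```python
def longest_non_ascii_run(word: bytes) -> int:
    """Return the length of the longest run of bytes >= 0x80 in a word."""
    longest = current = 0
    for byte in word:
        if byte >= 0x80:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def is_entirely_non_ascii(word: bytes) -> bool:
    """True if the word is non-empty and every byte is >= 0x80."""
    return len(word) > 0 and all(byte >= 0x80 for byte in word)


# Hypothetical sample words, encoded in the legacy encodings under discussion.
greek = "καλημέρα".encode("cp1253")   # Greek word in windows-1253
german = "Mädchen".encode("cp1252")   # German word in windows-1252

for label, word in (("greek", greek), ("german", german)):
    print(label, longest_non_ascii_run(word), is_entirely_non_ascii(word))
```

The Greek word is one unbroken run of non-ASCII bytes, while the German word contains a single non-ASCII byte surrounded by ASCII. A real detector would combine many such observations over longer input, which is why reliability grows with length.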

> or Polish for example?

Less well, but it's pretty good after a couple of sentences.

> Or Turkish (windows-1254)

Even less well, but it's pretty OK after a couple of sentences. windows-1252 has two language models: One for Icelandic and Faroese and another that merges everything else. There are Turkish letters that overlap some Icelandic/Faroese-only letters which are penalized in the "everything else" windows-1252 model.
