Bug 1602816 Comment 34 Edit History
Note: The actual edited comment in the bug view page will always show the original commenter’s name and original timestamp.
(In reply to Ben Campbell from comment #30)
> (hmm. I wonder why it didn't keep going and default to KOI8-RU, which substitutes one more box character to also cover Belarusian?)

The encoding that the Web calls "KOI8-U" is actually KOI8-RU. This is somewhat like how ISO-8859-1 is an alias of windows-1252, and how EUC-KR really means EUC-KR with the windows-949 extensions included.

(In reply to Jorg K (GMT+1) (PTO to 5th Jan 2020, sporadically reading bugmail) from comment #32)
> Henri, how is it possible to reliably detect any ANSI encoding like windows-1252.

The reliability increases as the length of input increases. Indeed, a single letter ä without any word context for it is not a great test case.

> How do you know it's not Greek

Greek words are completely non-ASCII. windows-1252 words tend to mix ASCII and non-ASCII, and four or more consecutive non-ASCII characters are rare.

> or Polish for example?

Less well, but it's pretty good after a couple of sentences.

> Or Turkish (windows-1254)

Even less well, but it's pretty OK after a couple of sentences. windows-1252 has two language models: one for Icelandic and Faroese and another that merges everything else. Some Turkish letters overlap with Icelandic/Faroese-only letters, which are penalized in the "everything else" windows-1252 model.
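
To illustrate the Greek vs. windows-1252 point, here is a minimal Python sketch of the "run of consecutive non-ASCII bytes" signal. It is not the detector's actual code, which scores input with per-language models. The function names and the `max_run=3` threshold are assumptions made up for this example, with the threshold derived only from the "four or more is rare" remark.

```python
def longest_non_ascii_run(data: bytes) -> int:
    """Return the length of the longest run of consecutive bytes >= 0x80."""
    longest = 0
    current = 0
    for byte in data:
        if byte >= 0x80:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def looks_like_mixed_latin(data: bytes, max_run: int = 3) -> bool:
    """Hypothetical heuristic: the input has some non-ASCII bytes, but never
    more than max_run in a row, which is typical of windows-1252 text."""
    run = longest_non_ascii_run(data)
    return 0 < run <= max_run


# German text in windows-1252: non-ASCII letters are isolated inside ASCII words.
german = "Käse für später".encode("cp1252")
# Greek text in windows-1253: every letter of a word is non-ASCII.
greek = "καλημέρα κόσμε".encode("cp1253")

print(longest_non_ascii_run(german), looks_like_mixed_latin(german))
print(longest_non_ascii_run(greek), looks_like_mixed_latin(greek))
```

On its own this signal only separates "mostly ASCII with accents" from "entirely non-ASCII words"; it says nothing about Polish or Turkish, which is why more input is needed for those.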