Bug 1551276 Comment 2 Edit History

A quick dump of some ideas:

In the Google detector, a single-byte encoding either takes 3 * 256 bytes of tables or 3 * 256 + 1024 bytes of data tables plus a few words. The 3 * 256 bytes are never reused across encodings. The 1024-byte table is reused between similar windows and ISO encodings (like windows-1252 and ISO-8859-1) and between KOI8-R and KOI8-U.

There doesn't seem to be much logic to whether the 1024-byte table is present for some single-byte encodings. In particular, there are Cyrillic encodings with and without the 1024-byte table.

The use of the 1024-byte table doesn't seem optimal: a 256-byte table indexed by the 4 and 4 high bits of a byte pair decides whether to index the 1024-byte table by the 5 and 5 low bits. The approach taken by our current Cyrillic detectors, which use a 256-byte table to map bytes to 5-bit classes, uses as much space but is much more versatile in terms of how much significant information can be put in the 1024-byte table. Also, mapping to 5-bit classes means that all Cyrillic encodings, both Greek encodings, both Arabic encodings, etc. can share the 1024-byte table.

If the hit rate to the bigram table is good enough, it's not clear that a unigram table is needed at all. Google's second unigram table becomes unnecessary when single-byte scoring is decoupled from legacy CJK scoring.

Thus, a 256-byte class mapping table plus a 1024-byte class bigram score table (shared among same-script encodings) should be space-competitive with the Google detector and should make better use of the 1024-byte table. (Not including browser-irrelevant encodings will obviously be space-competitive with the Google detector.)

5-bit classes should be enough at least for non-Latin encodings (with Cyrillic and Greek case-collapsed) as well as case-collapsed windows-1254.

One of the spare bits (the 3 bits left over in the byte after the 5-bit class) should be used to flag impossible bytes: bytes that are either unmapped or mapped to the C1 controls.
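
To make this concrete, here is a minimal Rust sketch of how the 256-byte class table (with the impossible-byte flag in a spare bit) and the shared 1024-byte bigram score table could fit together. The names (CLASS_FOR_BYTE, BIGRAM_SCORE, score_pair) and the placeholder table contents are made up for illustration and are not from any existing detector.

```rust
/// Top bit of a class-table entry flags an impossible byte
/// (unmapped or mapped to the C1 controls).
const IMPOSSIBLE: u8 = 0x80;

/// 256-byte class-mapping table for one single-byte encoding
/// (placeholder zeros; a real table would be generated from training data).
static CLASS_FOR_BYTE: [u8; 256] = [0; 256];

/// 1024-byte bigram score table shared among same-script encodings,
/// indexed by (previous class, current class).
static BIGRAM_SCORE: [u8; 32 * 32] = [0; 32 * 32];

fn score_pair(prev_byte: u8, cur_byte: u8) -> Option<u8> {
    let prev = CLASS_FOR_BYTE[prev_byte as usize];
    let cur = CLASS_FOR_BYTE[cur_byte as usize];
    // An impossible byte disqualifies this encoding outright.
    if (prev | cur) & IMPOSSIBLE != 0 {
        return None;
    }
    let index = (((prev & 0x1F) as usize) << 5) | ((cur & 0x1F) as usize);
    Some(BIGRAM_SCORE[index])
}
```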

Non-Turkish Latin encodings could potentially benefit from a 5-bit times 5-bit times 1-bit table that represents bigraphs where one half is ASCII and the other half is non-ASCII and the extra bit tells which one comes first.
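
If such a table were added, the index into it might be computed along these lines (a sketch under the 5 + 5 + 1 bit assumption above; the function name is hypothetical):

```rust
// Index into a hypothetical 2048-entry table of Latin bigraphs where one half
// is ASCII and the other half is non-ASCII; the extra bit records which comes first.
fn latin_pair_index(ascii_class: u8, non_ascii_class: u8, ascii_first: bool) -> usize {
    debug_assert!(ascii_class < 32 && non_ascii_class < 32);
    ((ascii_first as usize) << 10)
        | ((ascii_class as usize) << 5)
        | (non_ascii_class as usize)
}
```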

ISO-2022-JP can be distinguished from everything else by seeing if a shift sequence occurs before any non-ASCII.
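
A rough sketch of that check, assuming a scan over the bytes seen so far (the function name and the exact set of escape sequences checked are illustrative):

```rust
// Returns Some(true) if an ESC-initiated shift sequence is seen before any
// non-ASCII byte, Some(false) if non-ASCII is seen first, None if undecided.
fn iso_2022_jp_before_non_ascii(bytes: &[u8]) -> Option<bool> {
    let mut iter = bytes.iter().peekable();
    while let Some(&b) = iter.next() {
        if b == 0x1B {
            // ESC followed by '$' or '(' begins an ISO-2022-JP escape sequence.
            if let Some(&&next) = iter.peek() {
                if next == b'$' || next == b'(' {
                    return Some(true);
                }
            }
        } else if b >= 0x80 {
            return Some(false);
        }
    }
    None
}
```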

The other legacy CJK encodings can be distinguished from single-byte encodings by seeing if, after a couple hundred non-ASCII bytes have been seen, at least one legacy CJK encoding has not experienced an error condition.

Non-Latin single-byte encodings can be distinguished from Latin single-byte encodings by seeing whether bigraphs where both bytes are non-ASCII exceed bigraphs where one byte is ASCII and the other is not.
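
As a sketch of that comparison (illustrative only; a real detector would update counters incrementally rather than rescanning):

```rust
// Tally byte pairs where both bytes are non-ASCII versus pairs where exactly
// one byte is ASCII; if the first tally dominates, lean towards non-Latin.
fn non_ascii_pair_tallies(bytes: &[u8]) -> (u64, u64) {
    let mut both_non_ascii = 0u64;
    let mut mixed = 0u64;
    for pair in bytes.windows(2) {
        match (pair[0] >= 0x80, pair[1] >= 0x80) {
            (true, true) => both_non_ascii += 1,
            (true, false) | (false, true) => mixed += 1,
            (false, false) => {}
        }
    }
    (both_non_ascii, mixed)
}
```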

Shift_JIS can be distinguished from EUC-style encodings by seeing which encounters an error first.

Of the EUC-style encodings, Japanese can be distinguished from Korean and Chinese by the kana range. Korean can be distinguished from Chinese by seeing if the two-byte characters in the original EUC range stay almost exclusively in the KS X 1001 Hangul range.
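
For illustration, the relevant lead-byte ranges might be checked like this (helper names are made up; the ranges assume EUC-JP kana under lead bytes 0xA4/0xA5 and the KS X 1001 Hangul syllable rows under lead bytes 0xB0 through 0xC8):

```rust
// EUC-JP puts hiragana and katakana under lead bytes 0xA4 and 0xA5.
fn euc_jp_kana_lead(lead: u8) -> bool {
    lead == 0xA4 || lead == 0xA5
}

// In EUC-KR, the KS X 1001 Hangul syllable rows use lead bytes 0xB0..=0xC8.
fn ksx1001_hangul_lead(lead: u8) -> bool {
    (0xB0..=0xC8).contains(&lead)
}
```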

Big5 can be distinguished from GBK and EUC-KR by seeing if a lot of characters are outside the original EUC square.
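
A sketch of the square check (the original EUC square here means both the lead and the trail byte in 0xA1 through 0xFE; Big5 additionally uses trail bytes in the 0x40 through 0x7E range):

```rust
// A two-byte pair whose bytes fall outside the original EUC square points
// towards Big5 rather than GBK or EUC-KR when such pairs are frequent.
fn outside_original_euc_square(lead: u8, trail: u8) -> bool {
    !((0xA1..=0xFE).contains(&lead) && (0xA1..=0xFE).contains(&trail))
}
```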

Non-letter characters in single-byte encodings should be mapped to the same equivalence class as whitespace.

For multi-language encodings, since the training data can be of different length for different languages, the frequency table should be computed first on a per-language basis and then the languages merged by taking the maximum of each table slot from the different languages. (For example, the French and German usage of windows-1252 is basically disjoint for the non-ASCII letters, but either case should give the full frequency score relevant to the language in question, hence merging by max rather than average or something like that.)
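
A sketch of the merge step under the table layout assumed earlier (the function name and the fixed 1024-slot size are placeholders):

```rust
// Merge per-language frequency tables by taking the element-wise maximum, so
// that letters used heavily in only one language keep their full score.
fn merge_language_tables(per_language: &[[u8; 1024]]) -> [u8; 1024] {
    let mut merged = [0u8; 1024];
    for table in per_language {
        for (slot, &value) in merged.iter_mut().zip(table.iter()) {
            *slot = (*slot).max(value);
        }
    }
    merged
}
```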

Synthesize training data from Wikipedia.

Train windows-1254 with Azeri in addition to Turkish.

Apply Estonian training to windows-1252 in addition to windows-1257. (Non-loan words in Estonian are the same bytes in both windows-1252 and windows-1257.)

Special-case visual Hebrew scoring by reversing each bigram and using the logical Hebrew tables.
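
In terms of the hypothetical score_pair sketched above, this could be as simple as swapping the byte order before scoring:

```rust
// Score a visually ordered Hebrew bigram by reversing it and reusing the
// logical-Hebrew tables via the hypothetical score_pair above.
fn score_visual_hebrew_pair(prev_byte: u8, cur_byte: u8) -> Option<u8> {
    score_pair(cur_byte, prev_byte)
}
```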

Train windows-1258 both with NFD data and with data where the first diacritic is combined with the base character if representable that way.

Open question: How to optimally allocate 5-bit classes? (I.e. something better than 1 class for space, 30 classes for the 30 most common letters and 1 class for the remaining letters combined.)

Open question: Can the frequency score be a linear 256-value number or does the scale need to be non-linear? (I.e. should the 8-bit score be an index into a table of non-linear 16-bit scores?)

Open question: What to do about vocalized Hebrew training?

Open question: What single-byte encodings to omit? Likely Mac encodings and two-digit ISO encodings. (ISO-8859-11 is the same as windows-874; ISO-8859-15 is in use but was never a fallback or in the IE menu, so it practically has to be labeled; and the other two-digit ISO encodings are approximately unused and just pose misdetection risk.)