Closed Bug 1615836 Opened 1 year ago Closed 1 year ago

wrong encoding, unable to set default

Categories

(Core :: Internationalization, defect, P2)

73 Branch
defect

Tracking

()

VERIFIED FIXED
mozilla75
Tracking Status
firefox-esr68 --- unaffected
firefox73 --- wontfix
firefox74 --- fixed
firefox75 --- verified

People

(Reporter: ersatzemail, Assigned: hsivonen)

References

(Regression)

Details

(Keywords: regression)

Attachments

(5 files)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0

Steps to reproduce:

Opened a page encoded (without declaration) in Windows-1252.

Actual results:

Page was wrongly rendered in (apparently) ISO/IEC 8859-2, e.g. Ň instead of Ò.

Expected results:

Since it's impossible to always determine the correct encoding automatically, there should be an option to set a default encoding for pages that don't declare. One may be working with a large number of similar pages, where it is hardly practical to manually correct every page every time.
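
To illustrate the symptom: the byte 0xD2 is Ò in windows-1252 but Ň in ISO-8859-2, so the same undeclared file renders differently depending on which encoding is guessed. A minimal Python sketch follows; the codec names are Python's own and have nothing to do with Firefox internals.

```python
# One byte, two readings: why an undeclared page can show Ň instead of Ò.
raw = "Ò".encode("cp1252")            # how a windows-1252 file stores Ò

as_1252 = raw.decode("cp1252")        # the intended reading
as_8859_2 = raw.decode("iso8859_2")   # the misdetected reading

print(raw.hex(), as_1252, as_8859_2)
```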

(In reply to ersatzemail from comment #0)

Opened a page encoded (without declaration) in Windows-1252.

What is the link to the page?

there should be an option to set a default encoding

It was removed in bug 1551276.

Component: Untriaged → Internationalization
Product: Firefox → Core

It was actually a local file; in fact it occurs in many similar files, which used to work until one of the latest updates. The files contain English text but include names with diacritics as far as available within Windows-1252.

(In reply to ersatzemail from comment #2)

It was actually a local file;

Could you please attach the file? Feel free to edit it beforehand, as long as there's enough content left to reproduce the problem.

in fact it occurs in many similar files

I used Notepad to save the attached testcase as ANSI, but I got different results than comment 0:

  • IE and Edge display it correctly; console.log(document.characterSet); reports Windows-1252 as expected.
  • Firefox uses Windows-1258 encoding, according to the Page Info window.
  • Vivaldi mangles it in a different way; console.log(document.characterSet); reports Windows-874.

which used to work until one of the latest updates.

Can you find the exact regression range? If not, do you remember the most recent version that definitely worked?
https://mozilla.github.io/mozregression/quickstart.html

Edit: You may have had a text encoding for legacy content set. As I mentioned at comment 1, that was removed in bug 1551276 (Firefox 73).

Flags: needinfo?(ersatzemail)
Attachment #9126953 - Attachment description: Testcase for comment 3 → Testcase for comment 3 - save and open locally
Attached file test.htm

Here's a short example containing some Catalan names, and the page is detected as Baltic.

Flags: needinfo?(ersatzemail)

I don't think I had any special setting before, it just worked. It was almost certainly the latest update which broke it.

It turns out that my testcase also used to display correctly in the previous version. I tested both files and the regression range is the same:

https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=e41f354e1e867d8b62d922b504e553a41d56d56f&tochange=96b8dc8fd886ba29f08294f9fcb27fa46d63072e

Status: UNCONFIRMED → NEW
Has Regression Range: --- → yes
Has STR: --- → yes
Ever confirmed: true
Flags: needinfo?(hsivonen)
Keywords: regression
Regressed by: 1551276

Thank you for the report.

The files contain English text but include names with diacritics as far as available within Windows-1252.

Unfortunately, English with a couple of isolated non-English non-ASCII bytes is impossible to get right in all cases, since there isn't enough context to reliably guess which non-English language the isolated words are from.

One option would be to force the guess to windows-1252 if at the end of the file the total number of non-ASCII bytes relative to file length is below some threshold.
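
A rough Python sketch of that threshold idea, for illustration only. The function names and the 1% threshold are hypothetical and are not taken from chardetng.

```python
def non_ascii_ratio(data: bytes) -> float:
    """Fraction of bytes in `data` that are outside the ASCII range."""
    if not data:
        return 0.0
    return sum(b >= 0x80 for b in data) / len(data)


def pick_encoding(data: bytes, detected: str, threshold: float = 0.01) -> str:
    """Force windows-1252 when non-ASCII bytes are too rare to trust the guess.

    `threshold` (1%) is a made-up illustrative value, not one used by chardetng.
    """
    if non_ascii_ratio(data) < threshold:
        return "windows-1252"
    return detected


# Mostly-English text with a single non-ASCII byte at the end.
sample = ("Plain English text with one accented name. " * 20 + "Ò").encode("cp1252")
print(pick_encoding(sample, detected="windows-1257"))
```

The ratio is computed over the whole input, matching the "at the end of the file" wording: the override can only be decided once all bytes have been seen.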

Here's a short example containing some Catalan names, and the page is detected as Baltic.

Do you see problems with fully-Catalan pages? The detector is trained with Catalan, but in Firefox 73, there was a bug that ignored windows-1252 training for bytes that were next to space or punctuation. I have already fixed this locally, and the fix addresses the case you attached. Making this bug depend on bug 1613861.

For detection purposes, Lithuanian and Latvian are difficult to deal with in both directions: getting them right and not misdetecting anything else as them.

(In reply to Gingerbread Man from comment #3)

I used Notepad to save the attached testcase as ANSI, but I got different results than comment 0:

A test case with a single non-ASCII byte is not meaningful. It could be anything.
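
As a small illustration of why one byte carries no signal, the Python sketch below decodes a single byte under several legacy encodings. The byte 0xE8 is an arbitrary example and is not taken from the attached testcase.

```python
# A single non-ASCII byte is valid in many legacy encodings at once.
byte = b"\xe8"  # arbitrary example byte (assumption, not the testcase's byte)

candidates = ["cp1252", "iso8859_2", "cp1251", "cp1253", "cp1257"]
readings = {name: byte.decode(name) for name in candidates}

for name, char in readings.items():
    print(f"{name}: {char}")
```

Every decode succeeds and yields a plausible letter, so nothing in the byte itself favors one encoding over another.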

there should be an option to set a default encoding for pages that don't declare

That kind of thing might work for power users in specific circumstances, but 1) it won't be discoverable enough to work for users in general and 2) most users who have changed such a pref won't attribute later failures with other encodings to the pref they set earlier. Therefore, I'm more interested in adjusting the detector than providing UI for the pref. For the time being, setting intl.charset.detector.ng.enabled to false (and if you didn't use a windows-1252-affiliated localization, intl.charset.fallback.override to windows-1252) does what you are asking for, but these prefs will go away in the foreseeable future.

Depends on: 1613861
Flags: needinfo?(hsivonen)

(In reply to Henri Sivonen (:hsivonen) from comment #7)

For the time being, setting intl.charset.detector.ng.enabled to false (and if you didn't use a windows-1252-affiliated localization, intl.charset.fallback.override to windows-1252) does what you are asking for, but these prefs will go away in the foreseeable future.

Actually, that doesn't work for local files.

Wouldn't it be possible for the detector to assign a confidence value to its determination and then provide a default option only for cases where this value falls below a certain point?

(In reply to ersatzemail from comment #9)

Wouldn't it be possible for the detector to assign a confidence value to its determination and then provide a default option only for cases where this value falls below a certain point?

Then different users would get different results again. Before overdesigning a solution around the detector, let's try fixing the detector first. As noted, the root cause here is a bug that I had already found and fixed locally.

(In reply to Henri Sivonen (:hsivonen) from comment #7)

A test case with a single non-ASCII byte is not meaningful. It could be anything.

Yet it displays correctly in Firefox 72, IE and Edge. Here's an alternate with several sentences and even <html lang="it"> — Firefox 73/Nightly still garble it as Windows-1250. Vivaldi does fine with this one.

(In reply to Gingerbread Man from comment #11)

Created attachment 9127126 [details]
Testcase for comment 11 - save and open locally

(In reply to Henri Sivonen (:hsivonen) from comment #7)

A test case with a single non-ASCII byte is not meaningful. It could be anything.

Yet it displays correctly in Firefox 72, IE and Edge.

Those either don't run this kind of detection at all (Firefox 72 and Spartan Edge) or don't run it by default (IE).

Here's an alternate with several sentences and even <html lang="it">

The detector doesn't parse HTML.

— Firefox 73/Nightly still garble it as Windows-1250. Vivaldi does fine with this one.

This works with the fix that I have locally. (I'm trying to improve Thai accuracy while at it before posting the patch.)

Duplicate of this bug: 1613861
Assignee: nobody → hsivonen
Status: NEW → ASSIGNED
  • Properly take into account non-ASCII bytes at word boundaries for windows-1252. (Especially relevant for Italian, Catalan, Icelandic, and Faroese.)
  • Move Estonian from the Baltic model to the Western model. This improves overall Estonian detection but causes š and ž encoded as windows-1257, ISO-8859-13, or ISO-8859-4 to get misdecoded. (It would be possible to add a post-processing step to adjust for š and ž, but this would cause reloads given the way chardetng is integrated with Firefox.)
  • Improve Thai accuracy a lot.
  • Improve Vietnamese, Lithuanian, and Latvian accuracy a bit.
  • Improve accuracy for most Central European languages a bit.
  • Regress accuracy for some Central European languages a bit (as a side effect of fixing Italian and Catalan).
  • Properly classify letters that ISO-8859-4 has but windows-1257 doesn't have in order to avoid misdetecting non-ISO-8859-4 input as ISO-8859-4.
  • Improve character classification of windows-1254.
  • Avoid classifying byte 0xA1 or above as space-like to avoid misdetection.
  • Reduce binary size.
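
To make the first item in the list above concrete: in Italian and Catalan, accented letters often end a word (città, perché), so their bytes sit next to a space or punctuation. The Python sketch below measures that property of a text sample. It is only an illustration of the input characteristic, not of how chardetng scores bytes, and `boundary_share` is a hypothetical helper name.

```python
def boundary_share(text: str, encoding: str = "cp1252") -> float:
    """Share of non-ASCII bytes that touch a space, punctuation, or the text edge."""
    data = text.encode(encoding)
    non_ascii = at_boundary = 0
    for i, b in enumerate(data):
        if b < 0x80:
            continue
        non_ascii += 1
        # Treat the start and end of the text as if they were spaces.
        prev_b = data[i - 1] if i > 0 else 0x20
        next_b = data[i + 1] if i + 1 < len(data) else 0x20
        # An ASCII neighbour that is not a letter counts as a word boundary.
        if any(n < 0x80 and not chr(n).isalpha() for n in (prev_b, next_b)):
            at_boundary += 1
    return at_boundary / non_ascii if non_ascii else 0.0


italian = "La città è più bella, perché c'è libertà."
print(boundary_share(italian))
print(boundary_share("Müller"))
```

A detector that discounts bytes next to spaces or punctuation would see very little usable evidence in text like the Italian sample.
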
Created web-platform-tests PR https://github.com/web-platform-tests/wpt/pull/21875 for changes under testing/web-platform/tests
Upstream web-platform-tests status checks passed, PR will merge once commit reaches central.
Priority: -- → P3
Priority: P3 → P2
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla75

Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0
20200219095352

All attached testcases correctly display as Windows-1252 now.

Status: RESOLVED → VERIFIED
Upstream PR merged by moz-wptsync-bot

I ran a test on legacy-encoded input synthesized from Wikipedia articles whose wikitext length exceeds 6000 UTF-8 bytes (to exclude stub articles), taken from Wikipedias for languages that are not near-ASCII-only, that have a Web-relevant legacy encoding, and whose Wikipedia has more than 10000 articles. For Chinese, I used MediaWiki's own capabilities to synthesize both Simplified Chinese and Traditional Chinese content. I excluded Chinese articles that contained kana.

When rounded to integer percentages, most languages tested got 100% accuracy. Languages that got worse than 98% accuracy include:

  • Greek as ISO-8859-7: 97% (100% for windows-1253.)
  • Kurdish as windows-1254: 96%
  • Faroese and Italian as windows-1252: 95%. (Non-ASCII in Italian is relatively infrequent, so there is relatively little to work with.)
  • Lithuanian 95% as windows-1257 and 49% as ISO-8859-4. (Lithuanian detection is hard to improve without messing up something else.)
  • Hungarian 92%, both windows-1250 and ISO-8859-2. (Hungarian detection is hard to improve without messing up something else.)
  • Estonian encoded as non-windows-1252 gets the Estonian-native non-ASCII vowels right but garbles infrequent loan-only non-ASCII consonants. The consonants involved are too infrequent to deal with in the generic way. Fixing this would need Estonian-specific code and would introduce more page reloading.
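
The Estonian item above can be reproduced with a few lines of Python, assuming the text is stored as windows-1257 and then decoded as windows-1252 (which is what a Western guess amounts to). `misdecode` is a hypothetical helper name.

```python
# Estonian text stored as windows-1257 but decoded as windows-1252.
def misdecode(text: str) -> str:
    return text.encode("cp1257").decode("cp1252")


print(misdecode("õ ä ö ü"))  # native vowels: same byte values in both encodings
print(misdecode("š ž"))      # loan-only consonants: different byte values
```

The vowels survive because both encodings place them at the same byte values, while š and ž are at bytes that windows-1252 assigns to other letters.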

In all cases, Wikipedia includes more foreign words mixed into a given language than texts in that language generally do, so one should expect normal input to do better. Obviously, very short pages will do worse.

Comment on attachment 9127293 [details]
Bug 1615836 - Update chardetng to 0.1.6.

Beta/Release Uplift Approval Request

  • User impact if declined: For various Latin-script languages (particularly non-windows-1252 ones as well as windows-1252 ones whose non-ASCII most often occurs at word boundaries as in Italian and Catalan) and Thai, unlabeled legacy-encoded local files and pages on generic domains could appear worse than in Firefox 72, which used localization-based guessing instead of content-based guessing.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The change is to safe-only Rust code, so the change can't introduce memory bugs. As for the user-perceivable behavior, the result has been tested with content synthesized from Wikipedia, and the results with page-length input look very good.
  • String changes made/needed: None
Attachment #9127293 - Flags: approval-mozilla-beta?

Comment on attachment 9127293 [details]
Bug 1615836 - Update chardetng to 0.1.6.

Improvement on bug 1551276 landed in 73, low risk, uplift approved for 74.0b8, thanks.

Attachment #9127293 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

For reference, here are the results of the full-page-length test that I ran.
