Closed Bug 1615836 Opened 4 years ago Closed 4 years ago

wrong encoding, unable to set default

Status: VERIFIED FIXED

Milestone: mozilla75

Tracking Flags:

               Tracking  Status
firefox-esr68  ---       unaffected
firefox73      ---       wontfix
firefox74      ---       fixed
firefox75      ---       verified

People: (Reporter: ersatzemail, Assigned: hsivonen)

References: (Regression)

Details: (Keywords: regression)

Attachments: (5 files)

Testcase for comment 3 - save and open locally (4 years ago, Gingerbread Man, 165 bytes, text/html)
test.htm (4 years ago, ersatzemail, 580 bytes, text/html)
Testcase for comment 11 - save and open locally (4 years ago, Gingerbread Man, 435 bytes, text/html)
Bug 1615836 - Update chardetng to 0.1.6. (4 years ago, Henri Sivonen (:hsivonen), 47 bytes, text/x-phabricator-request; pascalc: approval-mozilla-beta+)
Results for the full-page test (4 years ago, Henri Sivonen (:hsivonen), 65.25 KB, text/html)

ersatzemail

Reporter

Description

•

4 years ago

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0

Steps to reproduce:

Opened a page encoded (without declaration) in Windows-1252.

Actual results:

Page was wrongly rendered in (apparently) ISO/IEC 8859-2, e.g. Ň instead of Ò.
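
The symptom can be reproduced outside the browser. A minimal Python sketch, assuming the affected byte is 0xD2 (the windows-1252 encoding of Ò); the variable names are illustrative only:

```python
# The same byte maps to different letters depending on the assumed encoding.
raw = "Ò".encode("cp1252")            # single byte 0xD2
as_1252 = raw.decode("cp1252")        # intended reading
as_8859_2 = raw.decode("iso8859_2")   # reading under ISO-8859-2
print(raw, as_1252, as_8859_2)
```

Under these assumptions the second decode yields Ň, which matches the garbling described above.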

Expected results:

Since it's impossible to always determine the correct encoding automatically, there should be an option to set a default encoding for pages that don't declare. One may be working with a large number of similar pages, where it is hardly practical to manually correct every page every time.

Gingerbread Man

Comment 1

•

4 years ago

(In reply to ersatzemail from comment #0)

> Opened a page encoded (without declaration) in Windows-1252.

What is the link to the page?

> there should be an option to set a default encoding

It was removed in bug 1551276.

Component: Untriaged → Internationalization

Product: Firefox → Core

ersatzemail

Reporter

Comment 2

•

4 years ago

It was actually a local file; in fact it occurs in many similar files, which used to work until one of the latest updates. The files contain English text but include names with diacritics as far as available within Windows-1252.

Gingerbread Man

Comment 3

•

4 years ago

•

Edited

Attached file: Testcase for comment 3 - save and open locally

(In reply to ersatzemail from comment #2)

> It was actually a local file;

Could you please attach the file? Feel free to edit it beforehand, as long as there's enough content left to reproduce the problem.

> in fact it occurs in many similar files

I used Notepad to save the attached testcase as ANSI, but I got different results than comment 0:

IE and Edge display it correctly; console.log(document.characterSet); reports Windows-1252 as expected.
Firefox uses Windows-1258 encoding, according to the Page Info window.
Vivaldi mangles it in a different way; console.log(document.characterSet); reports Windows-874.

> which used to work until one of the latest updates.

Can you find the exact regression range? If not, do you remember the most recent version that definitely worked?
https://mozilla.github.io/mozregression/quickstart.html

Edit: You may have had a text encoding for legacy content set. As I mentioned at comment 1, that was removed in bug 1551276 (Firefox 73).

Flags: needinfo?(ersatzemail)

Gingerbread Man

Updated

•

4 years ago

Attachment #9126953 - Attachment description: Testcase for comment 3 → Testcase for comment 3 - save and open locally

ersatzemail

Reporter

Comment 4

•

4 years ago

Attached file: test.htm

Here's a short example containing some Catalan names, and the page is detected as Baltic.

Flags: needinfo?(ersatzemail)

ersatzemail

Reporter

Comment 5

•

4 years ago

I don't think I had any special setting before, it just worked. It was almost certainly the latest update which broke it.

Gingerbread Man

Comment 6

•

4 years ago

regression-window

It turns out that my testcase also used to display correctly in the previous version. I tested both files and the regression range is the same:

https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=e41f354e1e867d8b62d922b504e553a41d56d56f&tochange=96b8dc8fd886ba29f08294f9fcb27fa46d63072e

Status: UNCONFIRMED → NEW

Has Regression Range: --- → yes

Has STR: --- → yes

Ever confirmed: true

Flags: needinfo?(hsivonen)

Keywords: regression

Regressed by: 1551276

BugBot [:suhaib / :marco/ :calixte]

Updated

•

4 years ago

Type: enhancement → defect

Henri Sivonen (:hsivonen)

Assignee

Comment 7

•

4 years ago

Thank you for the report.

> The files contain English text but include names with diacritics as far as available within Windows-1252.

Unfortunately, English with a couple of isolated non-English non-ASCII bytes is impossible to get right in all cases, since there isn't enough context to reliably guess which non-English language the isolated words are from.

One option would be to force the guess to windows-1252 if at the end of the file the total number of non-ASCII bytes relative to file length is below some threshold.
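
A minimal Python sketch of that threshold idea. The function name, the `detector_guess` parameter, and the 1% default threshold are hypothetical choices for illustration and are not taken from chardetng:

```python
def guess_with_ascii_bias(data: bytes, detector_guess: str, threshold: float = 0.01) -> str:
    """Return windows-1252 when non-ASCII bytes are rare; otherwise keep the detector's guess."""
    if not data:
        return detector_guess
    # Count bytes outside the ASCII range over the whole input.
    non_ascii = sum(1 for b in data if b >= 0x80)
    if non_ascii / len(data) < threshold:
        return "windows-1252"
    return detector_guess
```

With this sketch, a long English file containing one stray byte falls back to windows-1252, while a file dense in non-ASCII bytes keeps whatever the detector guessed.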

> Here's a short example containing some Catalan names, and the page is detected as Baltic.

Do you see problems with fully-Catalan pages? The detector is trained with Catalan, but in Firefox 73, there was a bug that ignored windows-1252 training for bytes that were next to space or punctuation. I have already fixed this locally, and the fix addresses the case you attached. Making this bug depend on bug 1613861.

For detection purposes, Lithuanian and Latvian are difficult to deal with in both directions: getting them right and not misdetecting anything else as them.

(In reply to Gingerbread Man from comment #3)

> I used Notepad to save the attached testcase as ANSI, but I got different results than comment 0:

A test case with a single non-ASCII byte is not meaningful. It could be anything.
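
The ambiguity of a lone byte can be shown with a short Python sketch. The byte 0xE0 is an assumption chosen for illustration; the byte used in the attached testcase is not stated here:

```python
# One byte, several plausible single-byte legacy encodings.
byte = b"\xe0"
candidates = ["cp1250", "cp1251", "cp1252", "cp1253", "cp1254", "cp1257", "cp1258", "cp874"]
decoded = {name: byte.decode(name) for name in candidates}
for name, ch in decoded.items():
    print(f"{name}: U+{ord(ch):04X} {ch}")
```

Each candidate decodes the byte to a valid letter, so the byte alone carries no evidence for preferring one encoding over another.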

> there should be an option to set a default encoding for pages that don't declare

That kind of thing might work for power users in specific circumstances, but 1) won't be discoverable enough to work for users in general and 2) most users once having changed such a pref won't attribute future failures with other encodings to having set the pref earlier. Therefore, I'm more interested in adjusting the detector than providing UI for the pref. For the time being, setting intl.charset.detector.ng.enabled to false (and if you didn't use a windows-1252-affiliated localization, intl.charset.fallback.override to windows-1252) does what you are asking for, but these prefs will go away in the foreseeable future.

Depends on: 1613861

Flags: needinfo?(hsivonen)

Henri Sivonen (:hsivonen)

Assignee

Comment 8

•

4 years ago

(In reply to Henri Sivonen (:hsivonen) from comment #7)

> For the time being, setting intl.charset.detector.ng.enabled to false (and if you didn't use a windows-1252-affiliated localization, intl.charset.fallback.override to windows-1252) does what you are asking for, but these prefs will go away in the foreseeable future.

Actually, that doesn't work for local files.

ersatzemail

Reporter

Comment 9

•

4 years ago

Wouldn't it be possible for the detector to assign a confidence value to its determination and then provide a default option only for cases where this value falls below a certain point?

Henri Sivonen (:hsivonen)

Assignee

Comment 10

•

4 years ago

(In reply to ersatzemail from comment #9)

> Wouldn't it be possible for the detector to assign a confidence value to its determination and then provide a default option only for cases where this value falls below a certain point?

Then different users would get different results again. Before overdesigning a solution around the detector, let's try fixing the detector first. As noted, the root cause here is a bug that I had already found and fixed locally.

Gingerbread Man

Comment 11

•

4 years ago

Attached file: Testcase for comment 11 - save and open locally

(In reply to Henri Sivonen (:hsivonen) from comment #7)

> A test case with a single non-ASCII byte is not meaningful. It could be anything.

Yet it displays correctly in Firefox 72, IE and Edge. Here's an alternate with several sentences and even <html lang="it"> — Firefox 73/Nightly still garble it as Windows-1250. Vivaldi does fine with this one.

Henri Sivonen (:hsivonen)

Assignee

Comment 12

•

4 years ago

(In reply to Gingerbread Man from comment #11)

> Created attachment 9127126
> Testcase for comment 11 - save and open locally
>
> (In reply to Henri Sivonen (:hsivonen) from comment #7)
>
> > A test case with a single non-ASCII byte is not meaningful. It could be anything.
>
> Yet it displays correctly in Firefox 72, IE and Edge.

Those either don't run this kind of detection at all (Firefox 72 and Spartan Edge) or don't run it by default (IE).

> Here's an alternate with several sentences and even <html lang="it">

The detector doesn't parse HTML.

> ... Firefox 73/Nightly still garble it as Windows-1250. Vivaldi does fine with this one.

This works with the fix that I have locally. (I'm trying to improve Thai accuracy while at it before posting the patch.)

Henri Sivonen (:hsivonen)

Assignee

Updated

•

4 years ago

Assignee: nobody → hsivonen

Status: NEW → ASSIGNED

Henri Sivonen (:hsivonen)

Assignee

Comment 14

•

4 years ago

Attached file: Bug 1615836 - Update chardetng to 0.1.6.

Properly take into account non-ASCII bytes at word boundaries for windows-1252. (Especially relevant for Italian, Catalan, Icelandic, and Faroese.)
Move Estonian from the Baltic model to the Western model. This improves overall Estonian detection but causes š and ž encoded as windows-1257, ISO-8859-13, or ISO-8859-4 to get misdecoded. (It would be possible to add a post-processing step to adjust for š and ž, but this would cause reloads given the way chardetng is integrated with Firefox.)
Improve Thai accuracy a lot.
Improve Vietnamese, Lithuanian, and Latvian accuracy a bit.
Improve accuracy for most Central European languages a bit.
Regress accuracy for some Central European languages a bit (as side effect of fixing Italian and Catalan).
Properly classify letters that ISO-8859-4 has but windows-1257 doesn't have in order to avoid misdetecting non-ISO-8859-4 input as ISO-8859-4.
Improve character classification of windows-1254.
Avoid classifying byte 0xA1 or above as space-like to avoid misdetection.
Reduce binary size.
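
To illustrate why the word-boundary item matters for Italian, here is a minimal Python sketch that measures how many non-ASCII characters sit next to whitespace, punctuation, or the edge of the text. The helper `boundary_share` and the sample sentence are hypothetical; this is not chardetng's scoring code:

```python
import string

def boundary_share(text: str) -> float:
    """Fraction of non-ASCII characters adjacent to whitespace, punctuation, or the text edge."""
    boundary = set(string.whitespace + string.punctuation)
    total = at_boundary = 0
    for i, ch in enumerate(text):
        if ord(ch) < 0x80:
            continue
        total += 1
        # Treat the start and end of the text as if they were spaces.
        prev_ch = text[i - 1] if i > 0 else " "
        next_ch = text[i + 1] if i + 1 < len(text) else " "
        if prev_ch in boundary or next_ch in boundary:
            at_boundary += 1
    return at_boundary / total if total else 0.0

italian = "La città è più bella perché c'è libertà."
print(boundary_share(italian))
```

In the sample sentence every accented letter is word-final or stands alone, so a detector that ignores bytes next to space or punctuation would discard all of the evidence for windows-1252.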

Henri Sivonen (:hsivonen)

Assignee

Comment 15

•

4 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=fa104ffeff252575edbe46456e6d523c6eab71dc

Pulsebot

Comment 16

•

4 years ago

Pushed by hsivonen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e2bdbdac0d33
Update chardetng to 0.1.6. r=emk

Web Platform Test Sync Bot (Matrix: #interop:mozilla.org)

Comment 17

•

4 years ago

Created web-platform-tests PR https://github.com/web-platform-tests/wpt/pull/21875 for changes under testing/web-platform/tests

Web Platform Test Sync Bot (Matrix: #interop:mozilla.org)

Comment 18

•

4 years ago

Upstream web-platform-tests status checks passed, PR will merge once commit reaches central.

Makoto Kato [:m_kato]

Updated

•

4 years ago

Priority: -- → P3

Makoto Kato [:m_kato]

Updated

•

4 years ago

Priority: P3 → P2

Cosmin Sabou [:CosminS]

Comment 19

•

4 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/e2bdbdac0d33

Status: ASSIGNED → RESOLVED

Closed: 4 years ago

status-firefox75: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla75

Gingerbread Man

Comment 20

•

4 years ago

Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0
20200219095352

All attached testcases correctly display as Windows-1252 now.

Status: RESOLVED → VERIFIED

status-firefox75: fixed → verified

Web Platform Test Sync Bot (Matrix: #interop:mozilla.org)

Comment 21

•

4 years ago

Upstream PR merged by moz-wptsync-bot

Ryan VanderMeulen [:RyanVM]

Updated

•

4 years ago

status-firefox73: --- → wontfix

status-firefox74: --- → affected

status-firefox-esr68: --- → unaffected

Flags: in-testsuite+

Henri Sivonen (:hsivonen)

Assignee

Comment 22

•

4 years ago

I ran a test on legacy-encoded input synthesized from Wikipedia articles whose wikitext length exceeds 6000 UTF-8 bytes (to exclude stub articles), in Wikipedias for languages that are not near-ASCII-only, that have a Web-relevant legacy encoding, and whose Wikipedia has more than 10000 articles. For Chinese, I used MediaWiki's own capabilities to synthesize both Simplified Chinese and Traditional Chinese content. I excluded Chinese articles that contained kana.

When rounded to integer percentages, most languages tested got 100% accuracy. Languages that got worse than 98% accuracy include:

Greek as ISO-8859-7: 97% (100% for windows-1253.)
Kurdish as windows-1254: 96%
Faroese and Italian as windows-1252: 95%. (Non-ASCII in Italian is relatively infrequent, so there is relatively little to work with.)
Lithuanian 95% as windows-1257 and 49% as ISO-8859-4. (Lithuanian detection is hard to improve without messing up something else.)
Hungarian 92%, both windows-1250 and ISO-8859-2. (Hungarian detection is hard to improve without messing up something else.)
Estonian encoded as non-windows-1252 gets the Estonian-native non-ASCII vowels right but garbles infrequent loan-only non-ASCII consonants. The consonants involved are too infrequent to deal with in the generic way. Fixing this would need Estonian-specific code and would introduce more page reloading.

In all cases, Wikipedia includes more foreign words mixed into a given language than texts in that language generally do, so one should expect normal input to do better. Obviously, very short pages will do worse.
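
A test of this general shape could be sketched in Python as follows. The function `detection_accuracy`, the `detect` callable, and the choice to score by comparing decoded text are assumptions made for illustration; the actual harness used for these results is not shown in this bug:

```python
from typing import Callable, Iterable

def detection_accuracy(texts: Iterable[str], encoding: str,
                       detect: Callable[[bytes], str]) -> float:
    """Encode each text into `encoding`, decode with the detector's guess, return the share decoded correctly."""
    total = correct = 0
    for text in texts:
        try:
            data = text.encode(encoding)
        except UnicodeEncodeError:
            continue  # skip texts the legacy encoding cannot represent
        total += 1
        guess = detect(data)
        # Score by decoded output, since a different label can still decode the text correctly.
        if data.decode(guess, errors="replace") == text:
            correct += 1
    return correct / total if total else 0.0
```

Scoring on decoded output rather than on the encoding label means a guess of windows-1250 for the windows-1252 bytes of "Müller" counts as correct, while the same guess for "città" does not.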

Henri Sivonen (:hsivonen)

Assignee

Comment 23

•

4 years ago

Comment on attachment 9127293
Bug 1615836 - Update chardetng to 0.1.6.

Beta/Release Uplift Approval Request

User impact if declined: For various Latin-script languages (particularly non-windows-1252 ones as well as windows-1252 ones whose non-ASCII most often occurs at word boundaries as in Italian and Catalan) and Thai, unlabeled legacy-encoded local files and pages on generic domains could appear worse than in Firefox 72, which used localization-based guessing instead of content-based guessing.
Is this code covered by automated tests?: Yes
Has the fix been verified in Nightly?: Yes
Needs manual test from QE?: No
If yes, steps to reproduce:
List of other uplifts needed: None
Risk to taking this patch: Low
Why is the change risky/not risky? (and alternatives if risky): The change is to safe-only Rust code, so the change can't introduce memory bugs. As for the user-perceivable behavior, the result has been tested with content synthesized from Wikipedia, and the results with page-length input look very good.
String changes made/needed: None

Attachment #9127293 - Flags: approval-mozilla-beta?

Pascal Chevrel:pascalc (PTO until April 26)

Comment 24

•

4 years ago

Comment on attachment 9127293
Bug 1615836 - Update chardetng to 0.1.6.

Improvement on bug 1551276 landed in 73, low risk, uplift approved for 74.0b8, thanks.

Attachment #9127293 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Razvan Maries

Comment 25

•

4 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-beta/rev/4d1de71d2e11

status-firefox74: affected → fixed

Henri Sivonen (:hsivonen)

Assignee

Comment 26

•

4 years ago

Attached file: Results for the full-page test

For reference, here are the results of the full-page-length test that I ran.
