Closed Bug 1551276 Opened 1 year ago Closed 7 months ago

Autodetect legacy encoding on unlabeled pages

Categories

(Core :: Internationalization, enhancement, P3)


Tracking


RESOLVED FIXED
mozilla73
Webcompat Priority ?
Tracking Status
relnote-firefox --- 73+
firefox73 --- fixed

People

(Reporter: hsivonen, Assigned: hsivonen)

References

Details

Attachments

(3 files, 2 obsolete files)

Previously, I thought that we should aim to minimize encoding detection from content. However, with Chromium's adoption of a detector and with Edge moving from the no-detector camp to the detector camp, I'm worried that we might lose users over worse long-tail Web experience to Chromium-based browsers. Especially on iOS, Safari has a captive audience (no one is going to quit using iOS due to not being able to read a 1990s legacy page, but someone might open Chrome or Edge on Windows if a 1990s long-tail page doesn't work right away in Firefox), so we can't really reason about what Safari does.

We don't really have proper signals of whether long-tail 1990s Web UX is a real competitive issue or not.

I evaluated the prospect of just adopting Google's detector, and I'm very uneasy about it.

However, now that one can easily get a text corpus in various languages from Wikipedia, training a detector should be a lot easier than it was in the Netscape days when the not-really-universal "universal" detector was created. (I still think removing it was the right call.)

I think it's now probably worthwhile to just go ahead and write a legacy encoding detector designed for browser-relevant coverage (unlike Google's) and trained with a wide range of languages for multi-language single-byte encodings (unlike Netscape's) and use it to solve the issue that .com/.net/.org don't have clear one-encoding-fits-all fallbacks for the legacy long tail.

annevk, emk, what do you think?

(A significant part of our menu usage is overriding a declared encoding, which is a case for which Chromium provides no recourse to the user. Even if we had a detector, we could still have a single menu item for manually triggering detection of labeled content.)

For previously stated reasons, I'm not suggesting detecting unlabeled and BOMless UTF-8 on the Web without a menu action.

A quick dump of some ideas:

In the Google detector, a single-byte encoding either takes 3 * 256 bytes of tables or 3 * 256 + 1024 bytes of data tables plus a few words. The 3 * 256 bytes are never reused across encodings. The 1024-byte table is reused between similar windows and ISO encodings (like windows-1252 and ISO-8859-1) and between KOI8-R and KOI8-U.

There doesn't seem to be much logic to whether the 1024-byte table is present for some single-byte encodings. In particular, there are Cyrillic encodings with and without the 1024-byte table.

The way they use the 1024-byte table doesn't seem optimal, since a 256-byte table indexed by the 4 high bits of each byte in the pair decides whether to index the 1024-byte table by the 5 low bits of each byte. The approach taken by our current Cyrillic detectors of using a 256-byte table to map bytes to 5-bit classes uses as much space but is much more versatile in terms of how much significant information can be put in the 1024-byte table. Also, the approach of mapping to 5-bit classes means that all Cyrillic encodings, both Greek encodings, both Arabic encodings, etc. can share the 1024-byte table.

If the hit rate to the bigram table is good enough, it's not clear that a unigram table is needed at all. Google's second unigram table becomes unnecessary when single-byte scoring is decoupled from legacy CJK scoring.

Thus, a 256-byte class mapping table plus a 1024-byte class bigram score table (shared among same-script encodings) should be space-competitive with the Google detector and should make better use of the 1024-byte table. (Not including browser-irrelevant encodings will obviously be space-competitive with the Google detector.)
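The class-mapping scheme above can be sketched as follows. All table contents here are made-up toy values, not real trained data; a real detector would have a fully populated 256-entry class map and a trained 1024-entry score table per script.

```python
# Toy sketch of the 256-byte class map + 1024-byte bigram score table.
# CLASS_OF and BIGRAM_SCORE hold placeholder data, not trained tables.

# Map each byte to a 5-bit equivalence class (0..31). Here everything
# defaults to class 0 ("space/other"); two hypothetical letter bytes
# get their own classes.
CLASS_OF = [0] * 256
CLASS_OF[0xE0] = 1  # hypothetical common lowercase letter
CLASS_OF[0xE1] = 2  # hypothetical letter that often follows it

# 32 x 32 = 1024-entry table of bigram scores, shareable among
# same-script encodings. Indexed by (first_class << 5) | second_class.
BIGRAM_SCORE = [0] * 1024
BIGRAM_SCORE[(1 << 5) | 2] = 3  # hypothetical: class 1 then class 2 is common

def score(data: bytes) -> int:
    """Sum bigram scores over consecutive byte pairs."""
    total = 0
    for a, b in zip(data, data[1:]):
        total += BIGRAM_SCORE[(CLASS_OF[a] << 5) | CLASS_OF[b]]
    return total
```

The point of the indirection is that the per-encoding cost is only the 256-byte class map; the 1024-byte score table is shared by every encoding of the same script.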

5-bit classes should be enough at least for non-Latin encodings (with Cyrillic and Greek case-collapsed) as well as case-collapsed windows-1254.

One of the spare bits (the 3 bits in the byte after the 5-bit class) should be used to flag impossible bytes: bytes that are either unmapped or mapped to the C1 controls.

Non-Turkish Latin encodings could potentially benefit from a 5-bit times 5-bit times 1-bit table that represents bigraphs where one half is ASCII and the other half is non-ASCII and the extra bit tells which one comes first.

ISO-2022-JP can be distinguished from everything else by seeing if a shift sequence occurs before any non-ASCII.
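That check fits in a few lines. This is a simplification: ISO-2022-JP has several escape sequences, and this sketch only looks for an ESC byte before the first non-ASCII byte rather than validating the full sequence.

```python
def looks_like_iso_2022_jp(data: bytes) -> bool:
    """Return True if an ESC byte (0x1B), which starts ISO-2022-JP shift
    sequences, appears before any non-ASCII byte. Simplified sketch; a real
    detector would validate the whole escape sequence."""
    for byte in data:
        if byte == 0x1B:
            return True
        if byte >= 0x80:
            return False
    return False
```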

The other legacy CJK encodings can be distinguished from single-byte encodings by seeing if after a couple hundred non-ASCII bytes seen at least one legacy CJK encoding has not experienced an error condition.

Non-Latin single-byte encodings can be distinguished from Latin single-byte encodings by seeing whether bigraphs where both bytes are non-ASCII outnumber bigraphs where one byte is ASCII and the other is not.

Shift_JIS can be distinguished from EUC-style encodings by seeing which encounters an error first.

Of the EUC-style encodings, Japanese can be distinguished from Korean and Chinese by the kana range. Korean can be distinguished from Chinese by seeing if the two-byte original EUC-range characters stay almost exclusively in the KS X 1001 Hangul range.

Big5 can be distinguished from GBK and EUC-KR by seeing if a lot of characters are outside the original EUC square.
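The elimination idea running through the last few paragraphs can be sketched with Python's codecs standing in for per-encoding validity state machines. This is only a sketch: Python's decoders are stricter or looser than a browser's in places, and a real detector scores the survivors rather than merely eliminating candidates.

```python
import codecs

def surviving_encodings(data, candidates=("shift_jis", "euc_jp", "euc_kr", "gbk", "big5")):
    """Return the candidate legacy CJK encodings that decode `data` without
    an error. An encoding drops out of the running as soon as the byte
    stream is invalid according to it."""
    survivors = []
    for name in candidates:
        try:
            codecs.decode(data, name)
            survivors.append(name)
        except UnicodeDecodeError:
            pass
    return survivors
```

For example, the Shift_JIS bytes for a hiragana character start with a lead byte that is invalid in EUC-JP, so EUC-JP is eliminated while Shift_JIS survives.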

Non-letter characters in single-byte encodings should be mapped to the same equivalence class as whitespace.

For multi-language encodings, since the training data can be of different length for different languages, the frequency table should be computed first on a per-language basis and then the languages merged by taking the maximum of each table slot from the different languages. (For example, the French and German usage of windows-1252 is basically disjoint for the non-ASCII letters, but either case should give the full frequency score relevant to the language in question, hence merging by max rather than average or something like that.)
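The merging rule can be written compactly. In reality the inputs would be the per-language 1024-entry bigram tables, normalized per language first; here they are just equal-length lists.

```python
def merge_language_tables(per_language_tables):
    """Merge per-language frequency tables into one per-encoding table by
    taking the element-wise maximum, so that each language's characteristic
    bigrams keep their full score (rather than being averaged down by
    languages that don't use them)."""
    return [max(column) for column in zip(*per_language_tables)]
```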

Synthesize training data from Wikipedia.

Train windows-1254 with Azeri in addition to Turkish.

Apply Estonian training to windows-1252 in addition to windows-1257. (Non-loan words in Estonian are the same bytes in both windows-1252 and windows-1257.)

Special-case visual Hebrew scoring by reversing each bigram and using the logical Hebrew tables.

Train windows-1258 both with NFD data and with data where the first diacritic is combined with the base character if representable that way.

Open question: How to optimally allocate 5-bit classes? (I.e. something better than 1 class for space, 30 classes for the 30 most common letters and 1 class for the remaining letters combined.)

Open question: Can the frequency score be a linear 256-value number or does the scale need to be non-linear? (I.e. should the 8-bit score be an index into a table of non-linear 16-bit scores?)

Open question: What to do about vocalized Hebrew training?

Open question: What single-byte encodings to omit? Likely Mac encodings and two-digit ISO encodings. (ISO-8859-11 is the same as windows-874; ISO-8859-15 is in use but was never a fallback or in the IE menu, so it practically has to be labeled; and the others are approximately unused and just pose misdetection risk.)

Flags: needinfo?(annevk)
Flags: needinfo?(VYV03354)

(In reply to Henri Sivonen (:hsivonen) from comment #0)

Previously, I thought that we should aim to minimize encoding detection from content. However, with Chromium's adoption of a detector and with Edge moving from the no-detector camp to the detector camp, I'm worried that we might lose users over worse long-tail Web experience to Chromium-based browsers.

I totally agree.

Firefox UX is already inferior to Chrome for me because Chrome perfectly detects encodings of web pages which I reported in 1543077. It is simply wrong that legacy long tail is distributed over only .com/.net/.org. Also, I have to manually select Chinese to read Chinese text using Firefox while Chrome detects Chinese automatically.

Flags: needinfo?(VYV03354)

(In reply to Masatoshi Kimura [:emk] from comment #3)

It is simply wrong that legacy long tail is distributed over only .com/.net/.org.

It is not only that. However, beta telemetry indicates that .com/.net/.org combined is three times as bad as the other domains combined. Hard to say how much of that is non-locale-specificity and how much is just .com being very, very popular.

(In reply to Henri Sivonen (:hsivonen) from comment #2)

What single-byte encodings to omit? Likely Mac encodings and two-digit ISO encodings. (11 is the same as windows-874, 15 is in use but was never a fallback or in IE menu, so practically has to be labeled, and the others are approximately unused and just pose misdetection risk.)

We should probably also omit ISO-8859-3, since Web devs have never been able to rely on it as a fallback, so it's like ISO-8859-15 in that regard, but even more niche.

Unclear if -4, -5, and -6 have substantial usage, but at least they could share the bigram table with other encodings, so those three wouldn't add much to the binary footprint if supported. (Chrome and IE support detection of -5 and -6. Didn't test -4.)

I'm supportive of this in principle as Google's unilateral move certainly created a lot of risk for all involved, but it would be really good to have an idea to what extent Apple and Google are eventually willing to align on things. As ideally this is something we don't have to revisit too many times after this.

Flags: needinfo?(annevk)
Priority: -- → P3

Maybe the keyboard dictionaries from fxos can help with training. They're including word frequencies, and are probably somewhat of an intermediate processing step from using wikipedia corpora (they were built from web crawlers at the time):

https://github.com/mozilla-b2g/gaia/tree/master/apps/keyboard/js/imes/latin/dictionaries

Also, when you talk about training, you mean: take corpus, encode it into X encodings, and then train on that data, right?

(In reply to Axel Hecht [:Pike] from comment #6)

Maybe the keyboard dictionaries from fxos can help with training. They're including word frequencies, and are probably somewhat of an intermediate processing step from using wikipedia corpora (they were built from web crawlers at the time):

https://github.com/mozilla-b2g/gaia/tree/master/apps/keyboard/js/imes/latin/dictionaries

Thanks. I'd expect actual text (as opposed to a word list) to better represent the overall letter pair frequencies, though.

Also, when you talk about training, you mean: take corpus, encode it into X encodings, and then train on that data, right?

Roughly, yes.

Webcompat Priority: --- → ?

The encoding menu is rarely used. Looking at the numbers for Firefox 66, in major windows-1252-legacy regions / locales, such as Germany and Spain, the character encoding menu is used at least once in 0.00026% of Firefox subsessions. A subsession starts at Firefox launch and ends when Firefox is quit or crashes; if Firefox runs past midnight or the telemetry environment changes, midnight or the environment change ends the previous subsession and starts a new one. The important point isn't the precise definition of subsession but that what is being counted is at-least-once per subsession: if a user needs to use the menu many times in succession within a single subsession, it's counted the same way as using it only once.

Compared to those regions / locales, there's an 85-fold difference to the region where the menu is used the most: In Macau the menu is used at least once in 0.023% of subsessions. This exceeds the usage for the Traditional Chinese localization generally (0.017%), but the regions where Traditional Chinese is the primary script use the menu the most (except for a couple of regions whose appearance near the top of the list is best explained by a smaller number of data points, in which case a little usage affects the percentage a lot).

If the explanation for this result is that users in regions where Traditional Chinese is the primary script also read unlabeled legacy-encoded Simplified Chinese content, adding autodetection for unlabeled pages would improve the user experience.

The menu is the next-most used in Japan, where there are multiple legacy encodings, at 0.012% of subsessions. The proportion of subsessions where the menu was used at least once went up by 36% in Japan and Japanese localizations when Japanese encoding autodetection was limited to .jp, .com, .org, and .net top-level domains. It's unclear how much of this is noise considering that during the same period, without Firefox behavior changes affecting Korean, Thai, or the TLDs for South Korea and Thailand, usage in South Korea went up by 8.5% and in Thailand down by 9.0%, and this is a comparison between Firefox 65 and 66, so there's no view to how much usage in Japan usually fluctuates. Still, a detector that could generically be applied to all TLDs by default would likely improve the user experience.

Of regions that primarily use Simplified Chinese, menu usage in China is a bit less frequent than in Japan (0.010%) but on a similar level. Menu usage in Singapore, however, is considerably lower (0.0018%).

After China, the menu is the next-most used in Bulgaria, South Korea, and Thailand followed mostly by other non-Latin-script regions. (There are many non-Latin-script regions where the menu is used more than in Singapore.)

Of Latin-script regions, Azerbaijan is unusually high. Because there wasn't an Azeri Firefox localization at the time when Firefox got associations from top-level domains to encodings, .az didn't get one. Moreover, .cy lacks an encoding association and the menu is used considerably more in Cyprus than in Greece. In general, it's probably worthwhile to redo the TLD associations as part of this bug as an input signal to detection.

The menu usage in the Baltic (windows-1257) countries is higher (0.0016%) than in the Central European (windows-1250) countries (0.00066%). I'm not yet sure how this should inform the detector design.

Detector success can't outright be defined as bringing menu usage everywhere to the German/Spanish level or less, since German or Spanish with mojibake is readable unlike non-Latin mojibake. I should have mentioned that menu usage in Greece is on the Baltic level. Bringing menu usage in all regions where the primary script is non-Latin to the same level as in Greece would already be a success of some kind, and should be achievable. Whether that level is enough to remove the menu is less clear.

Other note to self:
Estonian and Albanian are a bit unusual in the sense that windows-1252 would work for them, but they've had different Windows defaults (windows-1257 and windows-1250, respectively) as a matter of geographic grouping. Need to make sure that Estonian and Albanian training doesn't cause the detector to bounce between windows-1252 and windows-1257 or windows-1252 and windows-1250 in a way that would cause unnecessary reloads for windows-1252-legacy languages.

Whether that level is enough to remove the menu is less clear.

My current thinking is to replace the whole submenu with a single menu item, "Override Text Encoding", which would be disabled in the same situations as the submenu is now, and which, when enabled and invoked, would reload the page such that

  • HTTP charset and meta charset are ignored.
  • Non-content hints (TLD) are ignored.
  • UTF-8 is allowed as a detection outcome.

This would give the user a recourse for the usual issues of mislabeling (e.g. server config having a blanket UTF-8 label despite the server carrying legacy-encoded content, other than ISO-2022-JP), the issue of default detection intentionally not detecting UTF-8 (see bug 1585935 comment 6 why), and the less likely issue of TLD hints misfiring.

It would not address the issue of ISO-2022-JP mislabeled as UTF-8, but for security reasons, I think it's prudent not to give the user a way to override that case.

Attached file Status as of 2019-10-22 (obsolete) —

Here's a spreadsheet of a test that synthesizes test inputs from Wikipedia article titles and discards a title (i.e. it is not counted towards the total) if the synthetic title is 100% ASCII. Detection is considered successful if the string round-trips using the detected encoding. So e.g. an Estonian windows-1257 string usually also round-trips as windows-1252.

A Web page has more content than just a title, but title-length inputs give an idea of performance under difficult conditions, and of how well a guess made from the first 1024 bytes, which may not contain more natural language than the title, might perform.
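The round-trip success criterion can be sketched like this. It's a sketch of the evaluation harness, not of the detector itself; the detected encoding is simply passed in.

```python
def round_trips(original_text: str, true_encoding: str, detected_encoding: str) -> bool:
    """A detection counts as successful if decoding the encoded bytes with
    the detected encoding reproduces the original text, even when the
    detected encoding differs from the true one."""
    encoded = original_text.encode(true_encoding)
    try:
        return encoded.decode(detected_encoding) == original_text
    except UnicodeDecodeError:
        return False
```

For instance, Estonian text encoded as windows-1257 counts as correctly detected even when the detector says windows-1252, because the Estonian letters occupy the same byte positions in both; Lithuanian text with š does not, because 0xF0 decodes differently.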

The spreadsheet is sorted from worse to better according to the percentage point difference between the Chrome detector ("ced") and the detector I'm writing ("ng").

Right now, the detector guesses legacy CJK encodings only for longer inputs, so in that sense, it's having an easier time here, since e.g. misdetection of Russian as Korean isn't a possible source of error.

Anyway, the aggregate accuracy in this test is now slightly better than Chrome's detector.

Notably, Lithuanian windows-1257 detection is really, really bad. The reason appears to be that merging Icelandic, French, and Danish models under one windows-1252 model results in a model that matches Lithuanian windows-1257. I intend to try to split the Icelandic model away from the rest of windows-1252 to see if that helps.

The point of this update is mainly to say that this isn't vaporware and it's already in the ced ballpark. (It's also rather easy to see why Chrome felt they needed not to use ICU.)

Some issues that I will need to remember to address:

  • In general, the new detector prefers to do negative matching. That is, it prefers to eliminate encodings from the set of possibilities when it sees that the input is invalid according to some encoding. Should ISO-2022-JP be an exception to this rule? The current Japanese detector commits to ISO-2022-JP once it sees one ISO-2022-JP escape sequence (preceded only by ASCII). Should the new detector do the same and not let later ISO-2022-JP errors cause a page to be treated as non-ISO-2022-JP? Probably yes.
  • The current Japanese detector typically decides between Shift_JIS and EUC-JP from the first two non-ASCII bytes. Running a more general detector on .jp would regress performance for the Shift_JIS case (by keeping the detector running for longer than now) and would defer the reload for the EUC-JP case (from the first non-ASCII seen to the end of the stream). Should the current Japanese detector behavior be retained for .jp? This would effectively special-case one TLD and not detect non-Japanese encodings there in general. Probably yes?
  • Should ISO-8859-6 be a possible detection outcome? AFAIK, ISO-8859-6 has never been the fallback for any locale or TLD for any major browser, but it is in the menu in IE and Firefox. Chrome detects it. It seems that IE does not detect it (detects as windows-1256, which overlaps). Probably it should be a possible outcome, since it's a possible outcome in Chrome, and I'm currently not aware of ISO-8859-6 contributing to misdetection in a way that detecting windows-1256 wouldn't.
  • Should ISO-8859-4 be a possible detection outcome? AFAIK, ISO-8859-4 has never been the fallback for any locale or TLD for any major browser, but it is in the menu in IE and Firefox. I haven't properly investigated what Chrome and IE detectors do about this. ISO-8859-4 seems to contribute to some misdetection, but is itself easier to detect than windows-1257, and the model is going to be there for windows-1257 anyway.

(Currently, the possible set of outcomes is what's in the menu in Firefox, excluding UTF-8, which is handled as a separate flag so as to not guess it without user interaction to avoid a situation where Web developers would rely on UTF-8 getting autodetected.)

(In reply to Henri Sivonen (:hsivonen) from comment #11)

Notably, Lithuanian windows-1257 detection is really, really bad.

Chrome solves this by looking at trigraphs. (Aside: Chrome appears to support detecting ISO-8859-13, which was introduced really late and is basically windows-1257 with quotes and apostrophes moved. Supporting ISO-8859-13 would be fairly cheap and would be unlikely to cause misdetections that windows-1257 does not cause.)

Attached file Status as of 2019-10-31 (obsolete) —

I spotted a pretty embarrassing bug. Fixing it made the numbers better, but still didn't solve the issue that matching ced on Lithuanian, Latvian, and Central European languages is likely impossible without introducing a trigram layer. Imitating ced's trigram layer isn't conceptually difficult, but ced's trigram layer takes 8 KB of data. Still, I'm going to try splitting out Icelandic first (which involves a similar data footprint).

Attachment #9103232 - Attachment is obsolete: true

On x86_64, the Chrome detector, without debugging symbols, adds 284 KB to an application's binary size. The detector I'm writing currently adds 64 KB to the size of an application that already contains a copy of encoding_rs.

In that light, adding 8 KB of data to improve accuracy between Turkish, Icelandic, and Lithuanian on very short inputs is a large percentage-wise addition.

  • This patch turns the new detector on, except for the .jp TLD, which uses the detector that's already in the tree (in order not to regress early EUC-JP reload).
  • The new detector can be turned off for the Web (but not for file:) by flipping intl.charset.detector.ng.enabled to false. (The file: logic is messy enough that I don't want to support multiple configurations. I'm including the pref just in case this needs to be unshipped from beta due to an unforeseen disaster, but I don't actually expect the pref to be flipped to false.)
  • The new detector can be turned on for the .jp TLD by flipping intl.charset.detector.ng.jp.enabled to true. (I don't intend to turn the new detector on for .jp as-is. I'm thinking of doing a follow-up that would make .jp use a hybrid approach that could reload twice in the worst case, but would reload early in the EUC-JP case.)
  • To minimize the binary size impact, I'm removing the old Cyrillic detector in the same patch. If the new detector is turned off but the old Cyrillic detector should run, the new detector is run in a mode that approximates the old detector.
  • When the new detector is enabled, the UI for controlling the .com fallback encoding and the old Cyrillic detector is obsolete and, therefore, hidden.
  • I didn't include the extra 8 KB of data to make the detector accurate for Icelandic/Turkish/Baltic for very short inputs. Instead, the detector needs to see a couple of sentences to decide between those. After all, the whole page is there eventually, and misdetection between Latin encodings isn't disastrous.
  • I did include less than 1 KB of extra data to make the detector more accurate between EUC-JP, EUC-KR, and GBK on short inputs. (But I still didn't include so much data as to make it accurate for GBK with fewer than 6 hanzi.) I'm not happy about the performance implications of how this data is searched linearly, but let's land something instead of tweaking everything endlessly before landing.
  • Thai detection is pretty bad with title-length inputs, but with a sentence or so, things seem OK.
  • Test cases are marked as tentative, since there's no spec. However, the test against detecting UTF-8 on the Web is intentionally not tentative.

Contrary to comment 12, I made the detector able to change its mind about ISO-2022-JP. There are potential user-generated content-based attacks either way.

Try run:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=71a0df8c1d1607c9636b38214895012a90f5c9e4

I didn't include the extra 8 KB of data to make the detector accurate for Icelandic/Turkish/Baltic for very short inputs. Instead, the detector needs to see a couple of sentences to decide between those.

Notably, the patch includes test cases for Lithuanian and Latvian, and, as can be seen, the test cases don't need to be particularly long, even though they are longer than the non-Latin test cases included in the patch.

On TLDs that have associated encodings, the detector currently just discards confusable encodings. Ideally, this should be tuned later such that it becomes possible to detect a full-page input in a confusable encoding on a TLD where it's unexpected.

The "combined" item doesn't mean much, since it depends on how the languages end up getting weighted relative to each other. Now each title that's not discarded from the test in each encoding has the same weight, but some languages have multiple encodings, so the same titles count more than once.

Attachment #9105538 - Attachment is obsolete: true

Weird. It looks like we have a bunch of tests that are encoded in UTF-8 but are not declared as UTF-8, and this patch makes them fail by not decoding them as UTF-8. How did those pass before?

Also, it looks like the WPT framework is intolerant of tests renavigating to themselves when the tests are run in bulk. The framework seems OK with it when the tests are run locally one by one. I guess I'll just turn off the reload-expecting tests that I wrote today. :-(

Ah, it looks like I accidentally made the detector run for same-origin iframes and I need to make those inherit instead.

Binary patches aren't reviewable on Phabricator, so here's a tarball of the tests.

Late last night, I had a sudden moment of clarity and worry about these lines in Chrome:
https://github.com/google/compact_enc_det/blob/3e65e9dc136bac03e60457fdb213f13cf169905f/compact_enc_det/compact_enc_det.cc#L5645-L5658

I had previously noticed that ced detects a bunch of legacy Tamil font encodings, but I had dismissed this as just a matter of someone working on the Google search engine having been particularly interested in making Tamil search work long ago.

New realizations:

  • In addition to the numerous Tamil font encodings, ced also detects 3 Devanagari font encodings.
  • Considering how Brahmic scripts are logically isomorphic with each other, there's a chance that there have existed non-Devanagari fonts with logically similar glyph assignments, so the Devanagari font hack detectors might actually catch even more font hacks.
  • Tamil is visually more suitable for this kind of font hack than the other scripts of India, so ced's seemingly excessive focus on Tamil may not in fact be a sign of someone at Google giving disproportionate attention to Tamil but may actually reflect frequency of legacy Web authoring practice. Also, a couple of the Tamil encodings appear to be Tamil Nadu state government-endorsed from 1999.
  • Since Chrome maps these detection results to windows-1252, Chrome ends up not breaking these hacks even when they don't have a meta tag declaring x-user-defined in a way that https://searchfox.org/mozilla-central/rev/62a130ba0ac80f75175e4b65536290b52391f116/parser/html/nsHtml5StreamParser.cpp#1512 would catch.

I've updated the patch to disable the new detector on the .in TLD by default (controllable by pref). This may interfere with the usage of .in as a generic TLD (the Google frequency data published in the ced sources indicates that there exists enough e.g. Japanese content on .in for it to make the list of top encodings for .in), but at least making no change avoids a regression while researching further. (My understanding is that this is a thing specific to India, but I did the same for .lk just in case to get more time for research.)

The research question is: Do people still browse legacy long-tail sites that depend on these font hacks without declaring windows-1252 (either as windows-1252 or via x-user-defined meta)? The newspaper sites that used to rely on these have migrated to UTF-8. I expect that the difficulty of installing fonts on mobile devices has forced all actively-maintained sites to switch over to UTF-8 by now.

Since Chrome doesn't detect Armenian font hacks, I didn't change anything for the Armenian TLD.

Pushed by hsivonen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/0810ad586986
Autodetect legacy encodings on unlabeled pages. r=emk
Created web-platform-tests PR https://github.com/web-platform-tests/wpt/pull/20743 for changes under testing/web-platform/tests
Upstream web-platform-tests status checks passed, PR will merge once commit reaches central.
Upstream PR was closed without merging

Oops. I guess I accidentally used tooling that corrupts non-UTF-8 patches when I uploaded the revised patch earlier today. Trying with other tooling.

Flags: needinfo?(hsivonen)
Pushed by hsivonen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/96b8dc8fd886
Autodetect legacy encodings on unlabeled pages. r=emk

Re-landed the same changeset but via different tooling.

Upstream web-platform-tests status checks passed, PR will merge once commit reaches central.
Status: ASSIGNED → RESOLVED
Closed: 7 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla73
See Also: → 1603982
Depends on: 1604158
Regressions: 1604166
Upstream PR merged by moz-wptsync-bot

This feels like something we should call out in the Beta73 relnotes. EMK, since Henri is away, can you please set the relnote-firefox? flag for this bug and fill out the form so we can get that process moving forward? Thanks!

Flags: needinfo?(VYV03354)

Release Note Request (optional, but appreciated)
[Why is this notable]: This change will improve compatibility with web pages that are encoded in legacy text encodings.
[Affects Firefox for Android]: Yes
[Suggested wording]: Firefox will now auto-detect legacy text encodings on old web pages that do not declare the text encoding explicitly.
[Links (documentation, blog post, etc)]:
https://lists.mozilla.org/pipermail/dev-platform/2019-December/025023.html
https://lists.mozilla.org/pipermail/dev-platform/2019-December/025078.html

relnote-firefox: --- → ?
Flags: needinfo?(VYV03354)

Added to the Beta73 relnotes.

Regressions: 1612343
Regressions: 1615836
Regressions: 1615853

This might not be related to this bug, but there's an old webpage that:

  1. doesn't have a DOCTYPE
  2. has windows-1257 encoding meta-info

Firefox v73 displays the webpage as Unicode. If you save the HTML locally, then Firefox shows the encoding correctly.

It has a wrong Content-Type field in the HTTP response header:

Content-Type: text/html; charset=UTF-8

It has nothing to do with this bug at all.

Could you, please, point to the right bug for this?
Also, I couldn't find any mention of Unicode on that webpage. I found Windows-1257:

<meta http-equiv="Content-Type" content="text/html; charset=windows-1257">
Flags: needinfo?(VYV03354)

(In reply to soshial from comment #48)

Could you, please, point to the right bug for this?
Also, I couldn't find any mention of Unicode on that webpage. I found Windows-1257:

<meta http-equiv="Content-Type" content="text/html; charset=windows-1257">

UTF-8 is specified in the HTTP headers, which take precedence over the meta. When you make a local copy, you lose the HTTP headers, so the meta takes effect. In both cases, the code added in this bug does not participate.

Flags: needinfo?(VYV03354)
Regressions: 1627671
Regressions: 1631983