1551276 - (chardetng) Autodetect legacy encoding on unlabeled pages

Description

6 years ago

Previously, I thought that we should aim to minimize encoding detection from content. However, with Chromium's adoption of a detector and with Edge moving from the no-detector camp to the detector camp, I'm worried that we might lose users to Chromium-based browsers over a worse long-tail Web experience. We can't really draw conclusions from what Safari does, because Safari has a captive audience, especially on iOS: no one is going to quit using iOS due to not being able to read a 1990s legacy page, but someone might open Chrome or Edge on Windows if a 1990s long-tail page doesn't work right away in Firefox.

We don't really have proper signals of whether long-tail 1990s Web UX is a real competitive issue or not.

I evaluated the prospect of just adopting Google's detector, and I'm very uneasy about it.

However, now that one can easily get a text corpus in various languages from Wikipedia, training a detector should be a lot easier than it was in the Netscape days when the not-really-universal "universal" detector was created. (I still think removing it was the right call.)

I think it's now probably worthwhile to just go ahead and write a legacy encoding detector designed for browser-relevant coverage (unlike Google's) and trained with a wide range of languages for multi-language single-byte encodings (unlike Netscape's) and use it to solve the issue that .com/.net/.org don't have clear one-encoding-fits-all fallbacks for the legacy long tail.
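
To make the "train on a text corpus" idea concrete, here is a toy sketch in Python. It is not the proposed detector's design: the three candidate encodings, the one-sentence "corpora", the letter-pair scoring, the `FLOOR` constant, and the names `pairs`, `train`, `score`, and `guess` are all assumptions made up for illustration. A real detector would train on far more text and cover many more encodings.

```python
import math
from collections import Counter

# Hypothetical miniature training corpus, keyed by candidate encoding.
# Real training data would be large (e.g. Wikipedia text per language).
TRAINING = {
    "windows-1251": [
        "Википедия свободная энциклопедия, которую может редактировать каждый",
    ],
    "windows-1252": [
        "Wikipédia est une encyclopédie libre que chacun peut améliorer",
        "Wikipedia ist eine freie Enzyklopädie für alle über Wissen",
    ],
    "windows-1253": [
        "Η Βικιπαίδεια είναι μια ελεύθερη εγκυκλοπαίδεια, που γράφεται συλλογικά",
    ],
}

# Probability assigned to letter pairs never seen in training (assumed value).
FLOOR = 1e-6


def pairs(text):
    """Yield adjacent letter pairs that contain at least one non-ASCII letter.

    Pure-ASCII pairs decode identically in all three candidate encodings,
    so they carry no signal and are skipped.
    """
    text = text.lower()
    for a, b in zip(text, text[1:]):
        if a.isalpha() and b.isalpha() and (a > "\x7f" or b > "\x7f"):
            yield a + b


def train(samples):
    """Turn sample texts into a table of relative letter-pair frequencies."""
    counts = Counter(p for text in samples for p in pairs(text))
    total = sum(counts.values())
    return {p: n / total for p, n in counts.items()}


MODELS = {enc: train(samples) for enc, samples in TRAINING.items()}


def score(data, encoding):
    """Mean log-probability of the pairs seen when decoding data as encoding."""
    text = data.decode(encoding, errors="replace")
    found = list(pairs(text))
    if not found:
        return math.log(FLOOR)
    model = MODELS[encoding]
    return sum(math.log(model.get(p, FLOOR)) for p in found) / len(found)


def guess(data):
    """Return the candidate encoding whose model best explains the bytes."""
    return max(MODELS, key=lambda enc: score(data, enc))


print(guess("свободная энциклопедия".encode("windows-1251")))
```

With such tiny training data the sketch only works on text that resembles the samples, which is exactly why a broad multi-language corpus matters for the single-byte encodings.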

annevk, emk, what do you think?

(A significant part of our menu usage is overriding a declared encoding, which is a case for which Chromium provides no recourse to the user. Even if we had a detector, we could still have a single menu item for manually triggering detection of labeled content.)

Status as of 2019-10-22, 6 years ago, Henri Sivonen (:hsivonen), 72.25 KB, text/html
Status as of 2019-10-31, 6 years ago, Henri Sivonen (:hsivonen), 43.34 KB, text/html
Bug 1551276 - Autodetect legacy encodings on unlabeled pages., 6 years ago, Henri Sivonen (:hsivonen), 47 bytes, text/x-phabricator-request
Title-length status 2019-12-09, 6 years ago, Henri Sivonen (:hsivonen), 59.91 KB, text/html
Test as a gzipped tarball, 6 years ago, Henri Sivonen (:hsivonen), 7.93 KB, application/gzip