Closed Bug 910211 Opened 11 years ago Closed 11 years ago

Guess the fallback encoding from the TLD the content is served from before guessing from browser UI locale

Categories

(Core :: DOM: HTML Parser, defect)

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla30

People

(Reporter: hsivonen, Assigned: hsivonen)

Attachments

(1 file, 7 obsolete files)

Currently, we guess the fallback encoding for HTML and plain text from the UI localization of the browser. This means that if you use a browser build with the Traditional Chinese UI localization, the guessed fallback encoding is big5, so if you go read legacy content from mainland China, you need to manually override the encoding to gbk, which would be the fallback encoding for Simplified Chinese builds. Similar problems with reading content from neighboring locales exist for other locale pairs as well, Finnish and Russian, for example.

To alleviate the problem, we should try guessing the fallback encoding from the TLD of the site if the site's domain is under a TLD that has a strong affiliation with a particular legacy encoding. For example, we'd guess gbk for .cn, big5 for .tw, windows-1252 for .fi and windows-1251 for .ru regardless of the UI localization. For TLDs that don't have a strong affiliation with a particular legacy encoding, such as .com, .org or .net, we'd guess from the browser UI locale as before.
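To make this concrete, here's a minimal standalone sketch of the proposed lookup (illustrative only: the function name and the exact mapping are placeholders, not the eventual patch):

    #include <map>
    #include <optional>
    #include <string>

    // Map the last DNS label of the host to a fallback encoding; return
    // nothing for generic TLDs so the caller keeps the locale-based guess.
    std::optional<std::string> GuessFallbackFromTLD(const std::string& host) {
      // Only the example pairs mentioned above; the real mapping is larger.
      static const std::map<std::string, std::string> kTldFallbacks = {
          {"cn", "gbk"},           // Simplified Chinese
          {"tw", "Big5"},          // Traditional Chinese
          {"fi", "windows-1252"},  // Finnish
          {"ru", "windows-1251"},  // Russian
      };
      size_t lastDot = host.rfind('.');
      std::string tld =
          (lastDot == std::string::npos) ? host : host.substr(lastDot + 1);
      auto it = kTldFallbacks.find(tld);
      if (it == kTldFallbacks.end()) {
        return std::nullopt;  // e.g. .com, .org, .net: guess from UI locale
      }
      return it->second;
    }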
I thought we did statistical analysis on the content for charset detection, rather than simply guessing based on UI locale? That's what the Universal setting is in the charset autodetect menu...

Gerv
(In reply to Gervase Markham [:gerv] from comment #1)
> I thought we did statistical analysis on the content for charset detection,
> rather than simply guessing based on UI locale?

For Japanese UI Firefox builds, by default, we do an analysis to guess between various Japanese encodings. Likewise, for Russian UI Firefox builds, by default, we do an analysis to guess between various Cyrillic encodings with the assumption that the language of the text is Russian. For Ukrainian UI Firefox builds, by default, we do an analysis to guess between various Cyrillic encodings with the assumption that the language of the text is Ukrainian.

For other locales, by default, we do no guessing based on content. (And we shouldn't start now; see below.)

> That's what the Universal
> setting is in the charset autodetect menu...

1) The "Universal" detector is not actually universal. See bug 844115. It's very frustrating that it's being oversold as universal. The "Universal" detector is rather arbitrary in what it tries to detect. For example, it tries to detect Hebrew, Hungarian and Thai, but it doesn't try to detect Arabic, Czech or Vietnamese (and the Hungarian detection apparently doesn't actually work right). As far as I can tell, what's detected depends on the interests of the people who worked on the detector in the Netscape days.

2) I see no indication that it would be reasonable to expect the "Universal" detector to actually grow universal encoding and language coverage in a reasonable timeframe. The code hasn't seen much active development since the Netscape days.

3) I think it's reasonable to assume that even if the "Universal" detector gained coverage for more languages and encodings, reliably choosing between various single-byte encodings would be a gamble. For example, if we enabled a detector in builds that we ship to Western Europe, Africa, South Asia and the Americas, I expect it would be virtually certain that the result would be worse than just having the fallback encoding always be windows-1252, because we'd introduce e.g. windows-1250 misguesses to locales where windows-1252 is consistently the best guess.

4) Basing the detection on the payload of the HTTP response is bad for incremental parsing, and stopping mid-parse and reloading the page with a different encoding is bad for the user experience. Due to this (and the above point), I think it doesn't make sense to try to improve the detector. Instead, we should try to get rid of the detector and then maybe add a guess, such as the one proposed in this bug, that doesn't depend on the payload but is still more closely connected to the content than the properties of the browser configuration.
Attached file WIP (obsolete) —
Attached patch WIP patch (obsolete) — Splinter Review
Assignee: nobody → hsivonen
Attachment #827263 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Attached patch Guess from TLD (obsolete) — Splinter Review
I thought about doing A/B testing for this using telemetry, but for proper results, testing needs to happen on the release channel, so we might as well treat the last release without this patch as A and the release with this patch as B.

I put this stuff behind a pref so that it can be easily turned off out-of-cycle if this turns out to be disastrously wrong. I didn't add UI for the pref, though, because if this stuff works, users probably shouldn't be meddling with it.
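Roughly, the guessing code would consult the pref before doing anything TLD-based; a sketch along these lines, where the pref name is an assumption since it isn't spelled out in this comment:

    #include "mozilla/Preferences.h"

    // Sketch only; the pref name below is an assumption, not from this bug.
    static bool TLDGuessingEnabled() {
      // Defaults to enabled; flipping the pref turns the feature off
      // out-of-cycle without any UI.
      return mozilla::Preferences::GetBool("intl.charset.fallback.tld", true);
    }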

Anne, does the list look OK to you?
Attachment #8336745 - Attachment is obsolete: true
Attachment #8340398 - Flags: feedback?(annevk)
Comment on attachment 8340398 [details] [diff] [review]
Guess from TLD

Oops. Uploaded an obsolete patch.
Attachment #8340398 - Attachment is obsolete: true
Attachment #8340398 - Flags: feedback?(annevk)
Attached patch Updated patch (obsolete) — Splinter Review
Attachment #8340401 - Flags: feedback?(annevk)
I wonder when we would end up with ".." in a host. That seems like it should not be possible.

I think we should post about this plan to the WHATWG list and maybe www-international@w3.org. Seems good to inform the broader community.
Attached patch Patch that actually works (obsolete) — Splinter Review
The list is the same but now the code actually works. Sorry about the spam.
Attachment #8340401 - Attachment is obsolete: true
Attachment #8340401 - Flags: feedback?(annevk)
Attachment #8340413 - Flags: feedback?(annevk)
Comment on attachment 8340413 [details] [diff] [review]
Patch that actually works

Review of attachment 8340413 [details] [diff] [review]:
-----------------------------------------------------------------

Aside from the comment above about the double dot, this approach looks good to me. Very curious to see what the feedback in the wild will be about this approach. domainsfallbacks.properties could maybe do with more clarity about locales, in particular for the IDN entries. Maybe group them by locale if order is not relevant?
Attachment #8340413 - Flags: feedback?(annevk) → feedback+
Attached patch Make the mapping more readable (obsolete) — Splinter Review
(In reply to Anne (:annevk) from comment #8)
> I wonder when we would end up with ".." in a host. That seems like it should
> not be possible.

I sure hope it's not possible. That's why I think it's safe to give up in that case. Would you prefer to MOZ_ASSERT, too?
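Something like this sketch (not the patch's literal code):

    #include "mozilla/Assertions.h"
    #include <string>

    // Refuse to guess from the TLD if the host somehow contains "..",
    // asserting in debug builds since this shouldn't be reachable.
    static bool HostUsableForTLDGuess(const std::string& host) {
      if (host.find("..") != std::string::npos) {
        MOZ_ASSERT(false, "Host contains consecutive dots?");
        return false;  // give up; the locale-based fallback applies
      }
      return true;
    }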

> I think we should post about this plan to the WHATWG list and maybe
> www-international@w3.org. Seems good to inform the broader community.

I'd like to see what smontagu and emk think first.

(In reply to Anne (:annevk) from comment #10)
> Aside from the comment above about the double dot, this approach looks good
> to me.

Thanks.

> domainsfallbacks.properties could do with more clarity maybe about
> locales, in particular for the IDN entries. Maybe group them by locale if
> order is not relevant?

Done.
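For illustration, a hypothetical excerpt of the grouped layout (the encodings come from examples discussed in this bug; the actual file differs):

    # Finnish
    fi=windows-1252

    # Russian
    ru=windows-1251

    # Simplified Chinese
    cn=gbk

    # Traditional Chinese
    tw=Big5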

- -

smontagu, emk: Does this general approach look OK to you? Do the details of the mapping make sense?

# Goals

 * Reduce the effect of browser configuration (localization) on how the Web renders.

 * Make it easier for people to read legacy content on the Web across locales without having to use the Character Encoding menu.

 * Address the potential use cases for the combined Chinese, CJK and Universal detectors without the downsides of heuristic detection. (UI for activating these detectors that aren't on-by-default in any locale went away as a side effect of bug 805374.)

 * Avoid introducing new fallback encoding guesses that don't already result from guessing based on the browser localization.

# Why is this better than our current defaults?

 * We are trying to guess a characteristic of legacy content. The TLD is more tightly attached to the content than the browser localization.

# Why is this better than analyzing the content itself?

 * Analyzing the content itself has to interfere with one of the following:
   - Incremental parsing/rendering of HTML
   - Script side effects / visual appearance of the page load
   - Inheritance of character encoding into CSS/JS

 * When documents contain a lot of code before human-readable text, a lot of bytes need to be consumed before making a decision. The TLD is available before seeing any content bytes at all.

 * Especially for single-byte encodings, content analysis is not exact and could introduce mis-detections to locales that currently get by fine with their fallback encoding.

 * A static mapping from TLDs to encodings is more understandable/debuggable than having guesses potentially change when content changes.

 * This is simpler to implement than creating a truly universal detector (our current Universal detector is not) or vetting, potentially adjusting and integrating the ICU detector.

# Why is this harmless?

 * This only applies to non-conforming content. Any site that this applies to always has a way to opt out by becoming conforming (by declaring its encoding).

 * All the possible TLD-based guesses are pre-existing browser locale-based guesses, so the situation faced by any site as a result of this feature is a situation it already faces with some Firefox localization.

 * UTF-8 is never guessed, so this feature doesn't give anyone who uses UTF-8 a reason not to declare it. In that sense, this feature doesn't interfere with the authoring methodology sites should, ideally, be adhering to.

# How could this be harmful?

 * This could emphasize pre-existing brokenness (failure to declare the encoding) of sites targeted at language minorities when the legacy encoding for the minority language doesn't match the language of the country and 100% of the target audience of the site uses a browser localization that matches the language of the site. For example, it's imaginable that there exists a Russian-language windows-1251-encoded (but not declared) site under .ee that's currently always browsed with the Russian Firefox localization. More realistically, minority-language sites whose encoding doesn't match the dominant encoding of the country are probably more aware than most about encoding issues and already declare their encoding, so I'm not particularly worried about this scenario being a serious problem. And sites can always fix things by declaring their encoding.

 * This could cause some breakage when unlabeled non-windows-1252 sites are hosted under a foreign TLD, because the TLD looks cool (e.g. .io). However, this is a relatively new phenomenon, so one might hope that there's less content authored according to legacy practices involved.

 * This probably lowers the incentive to declare the legacy encoding a little.
Attachment #8340413 - Attachment is obsolete: true
Attachment #8341644 - Flags: feedback?(smontagu)
Attachment #8341644 - Flags: feedback?(VYV03354)
Note to self: Add the new charset source to the telemetry code in nsDocShell.
Comment on attachment 8341644 [details] [diff] [review]
Make the mapping more readable

Review of attachment 8341644 [details] [diff] [review]:
-----------------------------------------------------------------

Overall it sounds like a nice idea, assuming that the mapping will eventually be spec'ed.
Please use the effective TLD service instead of parsing domains on your own.

::: parser/nsCharsetSource.h
@@ +7,5 @@
>  
>  // note: the value order defines the priority; higher numbers take priority
>  #define kCharsetUninitialized           0
>  #define kCharsetFromFallback            1
> +#define kCharsetFromTopLevelDomain      2

Consider converting these #defines to an enum instead of maintaining the numeric values manually.
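A sketch of that conversion, keeping only the sources visible in this diff (the real header has more values):

    // The enum keeps the "higher value takes priority" ordering implicitly
    // instead of via hand-maintained numeric #defines.
    enum nsCharsetSource {
      kCharsetUninitialized,       // = 0
      kCharsetFromFallback,        // = 1
      kCharsetFromTopLevelDomain,  // = 2
      // ... remaining sources continue in priority order
    };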
Attachment #8341644 - Flags: feedback?(VYV03354) → feedback+
(In reply to Masatoshi Kimura [:emk] from comment #13)
> Overall it sounds a nice idea assuming that the mapping will be eventually spec'ed.

If this feature is successful and the list stabilizes, it would make sense to spec it, yes.

> Please use the effective TLD service instead of parsing domains on your own.

That stuff is Public Suffix List-based and has undesirable behaviors for the use case here. AFAICT, that service doesn't have a method for asking for the TLD only, since the whole point of that service is to take into account stuff like .ne.jp and .or.jp instead of ever returning just "jp". But simply "jp" is what's needed for this feature.
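In other words, what this feature needs is just the last label; a sketch of the difference, with LastLabel as a hypothetical helper:

    #include <string>

    // Returns the bare last label, not the Public Suffix List answer that
    // nsIEffectiveTLDService::GetPublicSuffix would give.
    static std::string LastLabel(const std::string& host) {
      size_t lastDot = host.rfind('.');
      return (lastDot == std::string::npos) ? host : host.substr(lastDot + 1);
    }
    // LastLabel("www.example.ne.jp") == "jp", whereas the eTLD service
    // would report the public suffix "ne.jp".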

> Consider converting these #defines to enum instead of maintaining numeric values manually.

Maybe in another bug. Those constants are never quite painful enough to justify the effort of modifying every place where they are passed around as int32_t...

Thanks.
Comment on attachment 8341644 [details] [diff] [review]
Make the mapping more readable

Review of attachment 8341644 [details] [diff] [review]:
-----------------------------------------------------------------

> > I think we should post about this plan to the WHATWG list and maybe
> > www-international@w3.org. Seems good to inform the broader community.
> 
> I'd like to see what smontagu and emk think first.

Personally I'm sceptical whether the potential rewards justify the investment, but I think it is worth bringing it up on the WHATWG list and getting feedback on the concept.

I'm also less sanguine than you about this issue:

> minority-language sites whose encoding doesn't match the
> dominant encoding of the country are probably more aware than most about
> encoding issues and already declare their encoding, so I'm not particularly
> worried about this scenario being a serious problem.

Practically you may well be right, but it still feels wrong to me ideologically. These communities are also especially sensitive about their own identity as minority languages, and I don't want to do something that might be picked up (however unreasonably) as giving the impression that we lack sensitivity about that.
Attachment #8341644 - Flags: feedback?(smontagu)
(In reply to Simon Montagu :smontagu from comment #15)
> Personally I'm sceptical whether the potential rewards justify the
> investment,

The investment is pretty small.

> but I think it is worth bringing it up on the WHATWG list and
> getting feedback on the concept.

OK. I'll do that.
 
> I'm also less sanguine than you about this issue:
> 
> > minority-language sites whose encoding doesn't match the
> > dominant encoding of the country are probably more aware than most about
> > encoding issues and already declare their encoding, so I'm not particularly
> > worried about this scenario being a serious problem.
> 
> Practically you may well be right, but it still feels wrong to me
> ideologically.

That assumes an ideology that supports a language community being so closed that it can assume such uniformity of browser localization usage as to be able to rely on members of the community not using other browser localizations (e.g. the localization for another language of the country) for browsing the community's sites.

To the extent this feature is ideologically-driven, it's driven by non-isolation: That users should be able to read out-of-locale content--even old content--without the UI badness of the encoding menu.

> These communities are also especially sensitive about their
> own identity as minority languages and I don't want to do something that
> might be picked up (however unreasonably) as giving the impression that we
> lack sensitivity about that

Is this a generic concern, or do you have particular cases in mind? After all, this is a non-issue for countries where the different languages share the same legacy encoding, or where languages don't have a legacy encoding that any browser localization would use as the fallback encoding, or haven't had localizations that could have established a sufficiently reliable fallback and, hence, can't have been relying on fallbacks without this patch, either.

Even though I put it on the list as windows-1255, I was wondering about .il. Should .il be expected to have enough unlabeled windows-1256 content that it's a bad idea to guess windows-1255 for .il? That is, should .il join .ba, .cy, .my, .com, .org and .net as domains that don't participate in this feature?
(In reply to Henri Sivonen (:hsivonen) from comment #16)
> > but I think it is worth bringing it up on the WHATWG list and
> > getting feedback on the concept.
> 
> OK. I'll do that.

http://lists.w3.org/Archives/Public/public-whatwg-archive/2013Dec/0142.html
http://lists.w3.org/Archives/Public/www-international/2013OctDec/0134.html
(In reply to Henri Sivonen (not reading bugmail until 2014-01-02) (:hsivonen) from comment #16)
> Even though I put it on the list as windows-1255, I was wondering about .il.
> Should .il be expected to have enough unlabeled windows-1256 content that
> it's a bad idea to guess windows-1255 for .il? That is, should .il join .ba,
> .cy, .my, .com, .org and .net as domains that don't participate in this
> feature?

I can only reply with anecdotal evidence, but for what that's worth I haven't come across any unlabeled windows-1256 content in .il. In fact I haven't come across any labeled windows-1256 content either -- any Arabic language content I have seen recently in .il has been UTF-8.
There was no feedback on the WHATWG list. Feedback on www-international was either cautiously positive, neutral (questioning whether this is worthwhile), or urging research. There was no public feedback definitely against. (There was one instance of private feedback against, but the reasoning didn't really make sense when comparing the situation after the patch to the situation before the patch.)

In order to avoid lack of progress due to doubts that might need more research, I have erred on the side of making TLDs that might warrant more research not participate in this feature at this time, even if my doubts are very slight. Not participating in this feature makes a TLD no worse off than it is now. However, landing *something* allows seeing how this works for the motivating TLDs (.tw & .hk vs. .cn). If this feature turns out not to work for the TLDs for which it was designed, research about other TLDs would be wasted. However, if the feature works for the domains it was designed for, we can research the situation for the domains that are left out and add them later.

https://tbpl.mozilla.org/?tree=Try&rev=d4657a4bf818
Attachment #8341644 - Attachment is obsolete: true
(Also, I want code in the tree that splits out the TLD as preparation for bug 845791, so that telemetry can focus on .ru and .ua and not count the detectors potentially misfiring elsewhere.)
(In reply to Henri Sivonen (:hsivonen) from comment #21)
> https://tbpl.mozilla.org/?tree=Try&rev=4a38a1da8fa4

Orange Cpp. Almost certainly unrelated, but let's try with a new base revision:
https://tbpl.mozilla.org/?tree=Try&rev=25e29cab3bb5
Comment on attachment 8366642 [details] [diff] [review]
Patch with more domains not participating, with test actually hg added

Let's try to get this in the next train. (The current one already has enough changes that should affect charset menu telemetry; it'll be easier to attribute changes in telemetry to their causes if this one goes into a separate train.)
Attachment #8366642 - Flags: review?(VYV03354)
For future reference: current: 29, next: 30
Comment on attachment 8366642 [details] [diff] [review]
Patch with more domains not participating, with test actually hg added

Review of attachment 8366642 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/encoding/domainsfallbacks.properties
@@ +53,5 @@
> +xn--wgbh1c=windows-1256
> +
> +gr=ISO-8859-7
> +
> +hk=Big5

Please use Big5-HKSCS until bug 912470 is fixed.
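That is, the quoted entry would become:

    hk=Big5-HKSCS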
Attachment #8366642 - Flags: review?(VYV03354) → review+
https://hg.mozilla.org/mozilla-central/rev/a4e9e8bead92
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla30
> # Malaysia has an Arabic-script TLD, official script is latin, possibly Chinese-script publications
> my=???

The last remaining newspaper written in Jawi (a modified Arabic script) has closed due to declining readership as the older generation dies off. I wouldn't worry about Arabic here. Chinese, yes (Simplified). We also have newspapers and magazines in Tamil (a southern Indian language/script).
Depends on: 977573