Closed Bug 708995 Opened 13 years ago Closed 6 years ago

Find out what fallback charset users choose for each localization

Categories

(Toolkit :: Telemetry, defect)

Priority: Not set
Severity: normal

Tracking

RESOLVED WONTFIX

People

(Reporter: hsivonen, Unassigned)

Details

(Keywords: privacy-review-needed)

(Unsure what component this should be under.)

The character encoding that the HTML parser falls back on when the page has not declared its encoding is a preference whose default depends on the localization. Long ago, someone made the mistake of making this default locale-dependent, which led to locale-siloed content that depends on the default of the page author's browser.

Judging by the values our minority-language localizations set for this preference, the defaults we ship are likely not optimal. In particular, when a minority language was never supported by any legacy encoding but is used in a region whose majority-language sites do depend on a legacy encoding, defaulting to UTF-8 probably leads to a worse user experience than defaulting to the legacy encoding of the regional majority language, because native users of the minority language end up reading majority-language content. (This assumes that pages in the minority language itself declare their encoding, so that they work in Firefox localized for the majority language, and that the bulk of unlabeled pages the minority-language users encounter are therefore majority-language pages from their region, or perhaps unlabeled globally-visited English-language pages.)

Since Firefox allows the user to change the fallback encoding from a menu, we could measure how often users change the fallback on a per-localization basis, i.e. how often users end up manually correcting the default. If users of some localization end up correcting the default a lot, we should probably change the default for that localization.

To this end, we should gather telemetry data for the default encoding setting per Firefox localization. (Since we don't have per-user persistent ids, it's unclear to me how often we should sample the setting. Perhaps per page load, to weight the numbers by browser usage rather than by browser launch patterns?)

I believe this information is far enough removed from personally identifiable to make it OK to use telemetry (as opposed to Test Pilot) for this, but marking privacy-review-needed.

(It's probably not useful to gather this data for the en-US locale, since it's a catch-all global version of Firefox in practice. When users have to configure the fallback for use outside the U.S., having to configure it is a reasonably expected step rather than a bug.)
> To this end, we should gather telemetry data for the default encoding setting per Firefox localization.

s/default/fallback/
> (Since we don't have per-user persistent ids, it's unclear to me how often we should sample the setting. Per page load maybe to weigh the numbers by browser usage instead of browser launch patterns?)

I thought about this more. It makes sense to record the setting only when the setting is taking effect, because page loads where the setting doesn't take effect tell nothing about the success of the fallback.

So I think we should record the value of the intl.charset.detector pref, the value of mCharset, and the localization of Firefox when nsHtml5StreamParser::OnStartRequest runs, the request isn't for an about:, res: or chrome: URL, and mCharsetSource equals kCharsetFromUserDefault. Recording other data would introduce noise.
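The gating condition proposed above could be sketched roughly as follows. This is an illustrative model, not Gecko code: the enum values and the ShouldRecordFallbackSample helper are hypothetical stand-ins for nsCharsetSource and the check that would run in nsHtml5StreamParser::OnStartRequest.

```cpp
#include <string>

// Illustrative stand-in for Gecko's nsCharsetSource ordering; the real
// constants live in the Gecko tree and may differ in names and values.
enum CharsetSource {
  kCharsetUninitialized = 0,
  kCharsetFromUserDefault = 1,
  kCharsetFromMetaPrescan = 2,
  kCharsetFromChannel = 3,
};

// Hypothetical helper: record a sample only when the user-settable
// fallback encoding actually took effect, and never for internal URLs.
bool ShouldRecordFallbackSample(const std::string& url, CharsetSource source) {
  auto hasScheme = [&](const char* scheme) {
    return url.rfind(scheme, 0) == 0;  // true iff url starts with scheme
  };
  if (hasScheme("about:") || hasScheme("res:") || hasScheme("chrome:")) {
    return false;  // internal pages tell nothing about web content
  }
  return source == kCharsetFromUserDefault;
}
```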
(In reply to Henri Sivonen (:hsivonen) from comment #2)
> So I think we should record the value the intl.charset.detector pref, the
> value of mCharset and the localization of Firefox when
> nsHtml5StreamParser::OnStartRequest runs, the request isn't for an about:,
> res: or chrome: URL and mCharsetSource equals kCharsetFromUserDefault.
> Recording other data would introduce noise.

...and the values recorded there should be forwarded to telemetry only if
1) chardet ends up switching the encoding
or
2) mCharsetSource is lower than kCharsetFromMetaPrescan at EOF.
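The forwarding rule in this comment could be modeled like this. Again a sketch with made-up enum names; it assumes an integer ordering of charset sources where a larger value means a more authoritative source, as in Gecko's nsCharsetSource.

```cpp
// Illustrative ordering: a larger value means a more authoritative
// encoding source; names are stand-ins for Gecko's nsCharsetSource.
enum CharsetSource {
  kCharsetUninitialized = 0,
  kCharsetFromUserDefault = 1,
  kCharsetFromMetaPrescan = 2,
  kCharsetFromChannel = 3,
};

// Forward a recorded sample to telemetry only when the fallback
// demonstrably mattered: either chardet overrode it, or no in-content
// declaration (meta prescan or better) was found by EOF.
bool ShouldForwardToTelemetry(bool chardetSwitchedEncoding,
                              CharsetSource sourceAtEof) {
  return chardetSwitchedEncoding || sourceAtEof < kCharsetFromMetaPrescan;
}
```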
This is perhaps a subject for another bug: it could be interesting to take some users of a locale that today defaults to WINDOWS-1252 and, without their knowledge, give them a version that defaults to UTF-8, just to record if/when they change the encoding. Such a test could be run regularly to track how, over time, the general switch to UTF-8 affects users. The goal of such a test should be to see if, or when, it would be possible to switch the default encoding to UTF-8.
Can we get a reasonable sample of non-en-US locale data from Test Pilot? If we can, we should use that.
The perf team doesn't have bandwidth to implement probes. This is generally the domain of each specific team. The perf team can review patches.
Would this still be relevant? If so, I think a keyed string scalar might be a good choice. (And this bug would need to be moved to the appropriate component for implementation)
Flags: needinfo?(hsivonen)
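The keyed-scalar suggestion could be approximated, purely as a sketch of the data shape (not the real Telemetry API), by a per-key counter such as one sample count per fallback-encoding key:

```cpp
#include <map>
#include <string>

// Sketch of how a keyed probe might aggregate samples client-side:
// one counter per key (e.g. the fallback encoding in effect). This
// models only the shape of a keyed scalar, not Firefox Telemetry.
class KeyedUintScalar {
 public:
  void Add(const std::string& key) { ++mCounts[key]; }
  unsigned Get(const std::string& key) const {
    auto it = mCounts.find(key);
    return it == mCounts.end() ? 0u : it->second;
  }

 private:
  std::map<std::string, unsigned> mCounts;
};
```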
While this is still partially relevant, it's probably not worthwhile to put effort into the part that remains relevant.
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(hsivonen)
Resolution: --- → WONTFIX