Closed Bug 844087 Opened 11 years ago Closed 11 years ago

Welsh localization should use the same fallback encoding as the rest of Western European locales

Categories

(Mozilla Localizations :: cy / Welsh, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: hsivonen, Assigned: hsivonen)

Details

(Whiteboard: [fixed by bug 910192])

Attachments

(1 file)

The Welsh localization currently uses UTF-8 as the fallback encoding for unlabeled legacy content. It seems incredible that users of the Welsh UI wouldn't encounter mostly the same legacy content as other users in the UK. Or more to the point, it seems implausible that they'd encounter unlabeled UTF-8 content more often than unlabeled Windows-1252 content.

Unless there's a really good explanation why the users of the Welsh localization are more likely to encounter unlabeled UTF-8 content than unlabeled Windows-1252 content, we should change the fallback to Windows-1252, which, for historical reasons, we currently call ISO-8859-1 when it is used as an encoding default.
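To illustrate why the naming matters, here is a minimal sketch of where Windows-1252 and strict ISO-8859-1 differ. It uses Python's `cp1252` and `latin-1` codecs as stand-ins for the two encodings; the sample bytes are made up for the illustration and do not come from any real page.

```python
# Bytes 0x80-0x9F are printable punctuation in windows-1252 but C1 control
# characters in strict ISO-8859-1. The sample bytes are hypothetical.
sample = b"\x93Cymru\x94 \x96 \xa3"

as_cp1252 = sample.decode("cp1252")   # curly quotes, en dash, pound sign
as_latin1 = sample.decode("latin-1")  # control characters instead of the punctuation

print(repr(as_cp1252))
print(repr(as_latin1))
```

Outside the 0x80-0x9F range the two decodings agree, which is why the pound sign comes out the same either way.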
Attachment #717099 - Flags: review?(bugzilla)
I'm guessing it's set this way because not all Welsh characters are in Windows-1252 or ISO-8859-1.  For example:

U+0174 Ŵ
U+0175 ŵ
U+0176 Ŷ
U+0177 ŷ

These *are* in ISO-8859-14, however.
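A quick way to check which encodings can represent these four letters is to try encoding them. This is only a sketch using Python's built-in codecs; the helper name `encodable` is made up for the example.

```python
WELSH_LETTERS = "\u0174\u0175\u0176\u0177"  # Ŵ ŵ Ŷ ŷ


def encodable(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


for enc in ("cp1252", "latin-1", "iso8859_14", "utf-8"):
    print(enc, encodable(WELSH_LETTERS, enc))
```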

I don't understand all the implications of this change, but I'm guessing it would break any unlabeled ISO-8859-14 pages?  If the default is set to UTF-8, is a generic charset detector run on an unlabeled 8-bit page?  Has anyone tested to see if that works on ISO 8859-14?
(In reply to Kevin P. Scannell from comment #2)
> I'm guessing it's set this way because not all Welsh characters are in
> Windows-1252 or ISO-8859-1.

That isn't really relevant. For example, Welsh-language pages that are encoded in UTF-8 and that are labeled as UTF-8 would be unaffected by the change.

The relevant question is: Considering *unlabeled* (i.e. legacy and per spec non-conforming) pages *only*, what's the encoding that users of the Welsh locale most commonly encounter?

Are users of the Welsh locale more likely to encounter *unlabeled* ISO-8859-14 pages than *unlabeled* Windows-1252 pages when browsing the Web, including Web pages that are not in Welsh (there seems to be plenty of reason to browse English pages even if one uses the Welsh UI)?

> If the default is set to
> UTF-8, is a generic charset detector run on an unlabeled 8-bit page?

No. (Enabling detectors for locales that aren't already stuck with them is a bad idea. And we don't have a generic detector in the codebase. The "universal" detector is considerably less universal than the name suggests.)
(In reply to Kevin P. Scannell from comment #2)
> I don't understand all the implications of this change, but I'm guessing it
> would break any unlabeled ISO-8859-14 pages?

Unlabeled ISO-8859-14 pages are already broken with UTF-8 as the default and would continue to be broken if the default changed to Windows-1252 (called ISO-8859-1 for the time being for historical reasons).
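As a rough illustration of that point, the sketch below decodes ISO-8859-14 bytes with each of the two candidate fallbacks. The sample phrase is invented for the example, and Python's codecs only approximate what a browser does with invalid input.

```python
original = "d\u0175r a t\u0177"          # "dŵr a tŷ", a hypothetical page body
page = original.encode("iso8859_14")     # what an unlabeled ISO-8859-14 server sends

as_utf8 = page.decode("utf-8", errors="replace")  # invalid bytes become U+FFFD
as_cp1252 = page.decode("cp1252")                 # decodes, but to the wrong letters

print(as_utf8)
print(as_cp1252)
```

Neither result matches the original text, so the choice between the two fallbacks makes no difference for such pages.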
(In reply to Henri Sivonen (:hsivonen) from comment #3) 
> The relevant question is: Considering *unlabeled* (i.e. legacy and per spec
> non-conforming) pages *only*, what's the encoding that users of the Welsh
> locale most commonly encounter?
> 

I understand.  It would be nice to have someone from the Welsh team weigh in on this point before making the change.   If the default were set to ISO-8859-14 (or, strictly speaking, the analogue of cp1252 with the stuff between 0x80 and 0x9f if that's possible), it would have the effect of fixing unlabeled 8-bit Welsh pages. I did big crawls of the Welsh language web a few years ago and there were plenty of such pages.

It's true that Welsh users visit non-Welsh pages, but those would often be English, in which case the small number of differences between ISO-8859-14 and ISO-8859-1 wouldn't matter.

Thanks for your clarifications on charset detection, that's good to know.
What do other browsers default to?
(In reply to Anne van Kesteren from comment #6)
> What do other browsers default to?

I installed the Welsh language pack for Windows 7 and the Welsh language pack for IE9 over en-US. As a result, the fallback encoding in IE changed to koi8-r! Really. QA FTW.
I think I may have originally put intl.charset.default=UTF-8 because it used to determine the Accept-Charset header, but I believe Firefox no longer sends this.

On consideration, I would recommend we keep UTF-8 as the default, but that we be open to changing it in future if further evidence so suggests. See below for details.

I've asked a few different groups of people about this. Everyone agrees unlabelled ISO-8859-1/Windows-1252 is far more likely than unlabelled ISO-8859-14, even without considering English content. We're not aware of any new Welsh content being published in ISO-8859-14, labelled or unlabelled. As Kevin says, it was more common a few years ago (before Unicode-aware software was more widespread).

Unlabelled UTF-8 content is another matter. Henri asks whether Welsh users would be more likely to encounter unlabelled UTF-8 than unlabelled Windows-1252. I'll try and answer this.

(1). As Kevin notes, not all Welsh characters are in Windows-1252, and therefore, unlabelled Welsh text containing ŵ and ŷ is very likely to be in UTF-8. If this is misinterpreted as Windows-1252, the accents in the words will be corrupted, which impairs the meaning of the text considerably.

(2). As Henri notes, unlabelled English text is likely to be Windows-1252. (Actually, it's likely to fit in the plain ASCII subset of Windows-1252 and UTF-8, in which case there's no problem at all; but let's just consider the case where it doesn't). English has very few accented letters, so the most likely corrupted characters are £ and smart quotes, which will impair the meaning of the text but perhaps not as badly as corrupted accents.

(3). Occasionally, Welsh writers may avoid accents altogether, especially in informal text, because they know the complications of character sets. In this case, the text may well be encoded as Windows-1252, and the situation is similar to (2) above, except that corruption of the apostrophe U+2019 impairs the meaning rather more in Welsh.
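Cases (1) and (2) can be reproduced in a few lines. This is a sketch with invented sample strings, using Python's codecs as an approximation of browser decoding.

```python
# Case (1): unlabelled UTF-8 Welsh text read with a windows-1252 fallback.
welsh_utf8 = "d\u0175r".encode("utf-8")             # "dŵr"
case1_wrong = welsh_utf8.decode("cp1252")

# Case (2): unlabelled windows-1252 English text read with a UTF-8 fallback.
english_cp1252 = "It\u2019s \u00a35".encode("cp1252")  # "It’s £5"
case2_wrong = english_cp1252.decode("utf-8", errors="replace")

print(case1_wrong)  # the accented letter turns into two unrelated characters
print(case2_wrong)  # the apostrophe and pound sign turn into U+FFFD
```

In case (1) the accented letter is replaced by plausible-looking but wrong characters, while in case (2) the punctuation is replaced by the replacement character.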

The key question is the relative frequencies of cases (1) and (2) for the "average Welsh Firefox user". We just don't have the information to answer this definitively. However, I would tentatively guess (1) is the more important case, for two reasons: 

(A) accents are probably more common and more important than smart quotes, and pound signs are generally quite clear from context; and

(B) most Welsh Firefox users have specifically chosen to use the Welsh version as it rarely comes pre-installed (or at the very least, there is a choice between English and Welsh). Therefore they have indicated a preference for Welsh-language content.

Another important factor is that Welsh is a minority language. As such, it's more likely that content is hosted using software configured for English. Consequently, the server administrator is less likely to be aware of character set configuration problems, and less likely to have adapted the configuration to support Welsh accents.

For the reasons above, I would be in favour of keeping intl.charset.default=UTF-8 for now. But if evidence about usage patterns becomes available in future, then we might need to revise this.
Note that keeping the UTF-8 fallback makes it more likely that more unlabeled UTF-8 content gets created, and that content then doesn't work out of the box for users who have an English browser UI but read Welsh content. This is bad, since English-localized browser versions are probably common for reading Welsh content.
(In reply to Henri Sivonen (:hsivonen) from comment #9)

Henri: I agree this is a problem. But I think the problem's still there if we use a default of ISO-8859-1.

If the author is forced to create unlabelled ISO-8859-1, then there's no way to encode some Welsh words at all. There are lots of minimal pairs dy/dŷ, gwr/gŵr, dwr/dŵr, ym/ŷm, gwyn/gŵyn etc where the missing accent makes a completely different word.  This will be true with (Broken Server + English Browser) no matter what we choose here.

On the other hand if the author creates unlabelled UTF-8, then it can display correctly on any browser that defaults to UTF-8 or uses heuristics to detect it, and on all other browsers after clicking a menu option.

The only case that is worse is when the author *could* label the text as UTF-8, but forgets to do so because of the Welsh Firefox default. I think this scenario is unlikely, because (a) as you say other browsers have a different default; (b) Welsh authors are accustomed to checking their accents display correctly; and (c) because of the minority language situation, it will often be the case that no-one who cares about Welsh can control the labelling (e.g. in some broken proprietary CMS which is mostly used for English).
How could the author be forced to create unlabeled iso-8859-1? If the server does not forcibly apply a label on the HTTP layer, the author can use an in-band label (<meta charset=utf-8> or the UTF-8 BOM).
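For illustration only, a much-simplified sketch of how in-band labels take priority over the locale fallback might look like the following. The function name `pick_encoding`, the regex, the 1024-byte scan window as written here, and the order of the checks are all assumptions made for the example; this is not Firefox's code or the HTML specification's prescan algorithm.

```python
import re

UTF8_BOM = b"\xef\xbb\xbf"

# Crude pattern for <meta charset=...> and <meta http-equiv ... charset=...>.
# \x22 and \x27 are the double and single quote characters.
META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\x22\x27]?([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def pick_encoding(body, http_charset=None, fallback="windows-1252"):
    """Return the encoding label to use for body (hypothetical helper)."""
    if body.startswith(UTF8_BOM):              # in-band label: UTF-8 BOM
        return "utf-8"
    if http_charset:                           # label on the HTTP layer
        return http_charset.lower()
    match = META_CHARSET.search(body[:1024])   # in-band label: <meta>
    if match:
        return match.group(1).decode("ascii").lower()
    return fallback                            # only unlabeled pages get here


print(pick_encoding(b"<meta charset=utf-8><p>d\xc5\xb5r</p>"))
print(pick_encoding(b"<p>dwr</p>"))
```

The fallback is consulted only when none of the labels is present, which is the only situation this bug is about.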
Henri: If authors are writing static HTML pages then they can indeed use an in-band label such as the <meta> tag. However, if authors are entering text into a broken content management system then they may not be able to do this.
intl.charset.default is no more.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Whiteboard: [fixed by bug 910192]
Attachment #717099 - Flags: review?(bugzilla)