Closed Bug 154570 Opened 23 years ago Closed 12 years ago

www.mozilla.org charset inconsistency/problems

Categories

(www.mozilla.org :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: 3.14, Unassigned)

Attachments

(4 files)

Look at URL, the validator complains that there is no charset information. For the bugzilla server (Apache) this would be fixed by adding the following line to .htaccess: AddDefaultCharset On You might want to check if this also works for CGIs like attachments to bugs. Maybe the other bugs the validator finds can also be fixed;-) pi
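For reference, the validator's complaint comes down to whether the Content-Type response header carries a charset parameter. A minimal Python sketch of that check, assuming the header value has already been fetched somehow; the function name charset_of is made up for illustration and is not part of any tool mentioned in this bug:

```python
from email.message import Message
from typing import Optional


def charset_of(content_type: str) -> Optional[str]:
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


if __name__ == "__main__":
    # A header without a charset parameter is what the validator objects to.
    print(charset_of("text/html"))
    print(charset_of("text/html; charset=ISO-8859-1"))
```

Parsing the header with the standard library avoids hand-rolled string splitting, which tends to trip over quoting and letter case.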
I am wondering why there is no reaction. This is a trivial thing to fix which has no side effects. pi
Changing summary; Bugzillas in general seem to be covered by bug 126266.
Summary: bugzilla.mozilla.org and www.mozilla.org don't send charset information → www.mozilla.org doesn't send charset information
You are right on one hand. On the other hand this bug here asks about actually doing it. Probably a simple thing in .htaccess will do the job. pi
There are 10377 *.html files at www.mozilla.org. Of these the vast majority (9927 files) is plain ASCII. That leaves 450 non-ASCII files. The majority of those documents are ISO-8859-1 coded, which is good. The problematic documents include:
* about 79 documents that are Windows-1252 encoded
* i18n test cases whose path matches */intl/*
* The Polish and Czech evangelism letters
* http://www.mozilla.org/projects/bugzilla/download.html which is UTF-8 coded but outside any */intl/* directory
Probably the best way to deal with the i18n test cases is not to change the configuration of the intl directories. What to do with the evang letters and the Bugzilla download page depends on the configurability of the server. The Windows-1252 encoded could be converted to ISO-8859-1 with numeric character references with a Perl script. The downside would be that some old docs would have their modification dates touched even though the content would still be old in substance.
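A survey like the one above can be approximated with a byte-level heuristic. The following Python sketch is an assumption about how such a count might be done, not the method actually used for the numbers in this comment; the names classify and survey are hypothetical. It cannot tell ISO-8859-1 apart from other 8-bit charsets such as ISO-8859-2, so the Polish and Czech letters would still need a manual look.

```python
from collections import Counter
from pathlib import Path


def classify(data: bytes) -> str:
    """Guess a coarse encoding label for the raw bytes of one file."""
    if all(b < 0x80 for b in data):
        return "ascii"
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # ISO-8859-1 maps 0x80-0x9F to C1 control characters, which real pages
    # almost never contain; Windows-1252 puts curly quotes and dashes there.
    if any(0x80 <= b <= 0x9F for b in data):
        return "windows-1252"
    # Heuristic fallback: any other 8-bit charset also ends up here.
    return "iso-8859-1"


def survey(root: str) -> Counter:
    """Count the encoding guesses for every *.html file under root."""
    return Counter(classify(p.read_bytes()) for p in Path(root).rglob("*.html"))
```

Calling survey on a checkout of the web tree would give per-label counts comparable to the figures quoted above.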
I see basically two options. Both would set the default charset to ISO-8859-1 (AddDefaultCharset On). 1) You can set individual files or directories to different charsets as required. 2) You can -- as you suggested -- transform those files which are not ISO-8859-1 with entities or numerical reference to Unicode characters. This method is not that nice for files with many of those characters (say pages in Russian), here 1) is the better way. Polish and Czech should be OK here. pi
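Option 2 can be sketched in a few lines. This is a Python illustration rather than the Perl script suggested earlier, and the function name is hypothetical. It assumes the input really is Windows-1252; Python's cp1252 codec raises UnicodeDecodeError on the five byte values that Windows-1252 leaves undefined.

```python
def cp1252_to_latin1_with_ncrs(data: bytes) -> bytes:
    """Re-encode Windows-1252 bytes as ISO-8859-1.

    Characters that ISO-8859-1 cannot represent (curly quotes, dashes,
    the euro sign, ...) become numeric character references.
    """
    text = data.decode("cp1252")
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


if __name__ == "__main__":
    # 0x93 and 0x94 are curly double quotes in Windows-1252.
    print(cp1252_to_latin1_with_ncrs(b"\x93caf\xe9\x94"))
```

Characters shared by both encodings, such as the e-acute above, pass through as single bytes, so only the 0x80-0x9F range grows in size.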
Blocks: validate
Blocks: 89885
i'm attaching an XHTML/1.0 transitional version of the home page which uses an xml prolog to declare its charset. i don't know how to pull from the mozilla.org cvs tree, but someone who can should be able to generate a diff from this.
shoot! i posted that on the wrong bug. i'll cross-reference it from bug 89885. sorry for the spam.
What's the problem with the Czech evangelism letter? I didn't even know something like that was here...
To comment 5. Unicode is not widely used in Poland. Most websites are encoded using ISO-8859-2 (Latin-2), which is an official national standard - also known as "Polska Norma PN-93 T-42118". Unicode also has some problems in the ancient Netscape 4.x (at least on UNIX).
The problem with the Czech evangelism letter is that it isn't ISO-8859-1-encoded, so slapping an ISO-8859-1 default on all pages including the evangelism letter would cause the character encoding of the evangelism letter to be misdeclared. Can Apache's default charset feature (with ISO-8859-1 as the value) be applied to everything except */intl/*? Then the evangelism letters could be dealt with in a .htaccess file.
Yes, we can do it by directory (and even by file if we need to). Apache also has content negotiation, meaning you can do index.html.utf8 and someone requesting index.html will get it, and it'll feed utf8 as the charset. I believe this requires enabling "MultiViews" though, which has other side-effects as well (requesting /foo will return the same thing requesting /foo.html would, because the mime-type is considered to be negotiated as well).
I'd like to go ahead and do this pretty soon, now that we have the power to do it very easily. However, we probably need to advertise it loudly ahead of time if it's going to break existing content when we put it in, so that page authors know to fix their directories. (Or we could just do it and let people file bugs on the ones that break). MultiViews makes it really easy for the content authors to declare their own charset by just sticking it on the end of the filename. However, it also opens up possible issues with similarly named files getting confused with each other. In my opinion, MultiViews isn't something we'd want to use unless we had started that way from the very beginning and everyone knew what to watch out for in their filenaming conventions to avoid choking it. What I think we should do is go ahead and set a site-wide default charset, then override it on an as-needed basis in .htaccess files. The standards-advocate in me says we should set the global default to utf-8 :) But that's probably a pipe dream and we really need to use iso-8859-1 for the sitewide default. The way you override this in an .htaccess file is like so, using the /projects/bugzilla/download.html file as an example:
    <FilesMatch ^download\.html$>
        AddDefaultCharset utf-8
    </FilesMatch>
Placing the above snippet at /projects/bugzilla/.htaccess would be all that's needed. Placing the AddDefaultCharset directive all by itself in an .htaccess file would cause it to apply to the entire directory.
If only ISO-8859-2 were widely used in Czech... Most servers are preconfigured to send Windows-1250 REGARDLESS of the actual content. And where there is no header you can bet 90% of text will be Windows-1250 too... As for me I would go with UTF-8. Recoded and corrected old Czech evangelism letter added as attachment. Looking into the English version I see a lot of changes, probably a new translation would be better than hacking old text. I'll take a look into this later.
I converted Polish tech letter from ISO-8859-1 to UTF-8. This is still HTML 4, maybe we should convert the letters to XHTML?
QA Contact: imajes → stolenclover
Per comment 12, I'm ready to go ahead and do this. I'd like to set a site-wide default to utf-8. I'm not a publicity person though, and I don't know where this needs to be advertised in advance. If someone knows where said advertisements should be made, let me know. Otherwise I'm likely to just do this in about a week and we'll let people start filing bugs on the pages that get messed up. :)
-> dave
don't break http://www.mozilla.org/quality/intl/testprojects/mozcntopsitetest-zh.html
> I don't know where this needs to be advertised in advance.
try the l10n and i18n newsgroups.
Assignee: endico → justdave
Somebody please remove the stupid <meta charset="ISO-8859-1"> from wrapper. It's breaking all UTF-8 pages www.mozilla.org/quality/intl/testprojects/mozcntopsitetest-zh.html www.mozilla.org/community/intl/ja.html
Severity: normal → blocker
Whiteboard: careful, site is burning
WTF?!?? I thought the style guidelines had a strict moratorium on <META name="Content-Type">. CVS Blame says it showed up in the initial commit of the new website on the beta branch.
I don't have access to fix this. Just tried and got rejected on commit.
I think we should add it in the wrapping script unless the page explicitly specifies something else so authors don't add pages that work using their defaults but not other people's defaults. I'll try to write a patch for this.
I fixed the wrapper so it only adds a meta with charset if the page doesn't already have a meta for Content-Type. If we want to move to UTF-8 (which seems reasonable), the first step would probably be to change the header that the wrapper adds to pages that don't specify from ISO-8859-1 to UTF-8. This shouldn't break any content that existed before the new site landed, but could break some content added since. Then, once all pages are converted to UTF-8 and METAs (in the unwrapped HTML) are removed, we could switch to using HTTP headers (probably in addition to METAs, so they work locally but so the METAs don't cause reloading).
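The wrapper behaviour described above (add a charset META only when the page has none) could look roughly like the following. This is a Python sketch for illustration only; the real wrapper is the Perl script tools/wrap.pl, whose code is not shown in this bug, and the names META_CONTENT_TYPE and add_default_meta as well as the regex-based detection are assumptions.

```python
import re

# Matches a <meta ... http-equiv="Content-Type" ...> tag, case-insensitively.
META_CONTENT_TYPE = re.compile(
    r'<meta\b[^>]*\bhttp-equiv\s*=\s*["\']?content-type\b', re.IGNORECASE
)
HEAD_OPEN = re.compile(r"<head\b[^>]*>", re.IGNORECASE)


def add_default_meta(html: str, default_charset: str = "ISO-8859-1") -> str:
    """Insert a Content-Type META after <head> unless the page already has one."""
    if META_CONTENT_TYPE.search(html):
        return html  # the author declared a charset explicitly; leave it alone
    meta = (
        '<meta http-equiv="Content-Type" '
        f'content="text/html; charset={default_charset}">'
    )
    # Pages without a <head> element are returned unchanged.
    return HEAD_OPEN.sub(lambda m: m.group(0) + "\n" + meta, html, count=1)
```

Moving the default to UTF-8 as proposed would then be a change to the default_charset value only, which is why pages that already carry their own META would be unaffected.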
Hmm, I was leaning really heavily towards flipping the switch to have UTF-8 as a server header, and letting people file bugs on the pages that break so we can do per-file overrides :)
The advantage of METAs is that it allows testing in local trees. That said, I do prefer the idea of server headers. There isn't a way to send the appropriate header based on the META and (optionally) strip out the META, is there?
It could be done if the munge script were to maintain a list of exceptions to the default charset in the .htaccess file in each directory... Seems like it would be a pain in the butt, but it could be done... :)
    <FilesMatch ^(pipe-separated-list-of-regexp-meta-escaped-filenames)$>
        AddDefaultCharset <charset>
    </FilesMatch>
Or something to that effect. By regexp-meta-escaped, I mean things like periods are escaped with \, etc. Call quotemeta() on it in perl before adding it to the string :) You'd need a block like the above for each charset being used in the directory.
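Generating such a block is mostly string assembly. A Python sketch of the idea follows, using re.escape as a stand-in for Perl's quotemeta(); the function name files_match_block is hypothetical, and treating Python's escaping rules as compatible with Apache's regex engine is an assumption that holds for ordinary filenames but has not been checked for unusual ones.

```python
import re
from typing import Iterable


def files_match_block(filenames: Iterable[str], charset: str) -> str:
    """Build a FilesMatch override for the given files in one directory."""
    # Escape regex metacharacters (e.g. ".") so each name matches literally.
    pattern = "|".join(re.escape(name) for name in sorted(filenames))
    return (
        f'<FilesMatch "^({pattern})$">\n'
        f"    AddDefaultCharset {charset}\n"
        "</FilesMatch>\n"
    )


if __name__ == "__main__":
    print(files_match_block(["download.html"], "utf-8"))
```

One call per charset in use in a directory gives the set of blocks described above.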
this *will* crash some browsers.
How can something which only happens on the web server crash a browser? pi
The way we do it *now* crashes some browsers (notably NS4 - well, causes problems, not exactly a crash). The idea is to get away from that.
I checked the validator link (as mentioned in the URL link) http://validator.w3.org/check?uri=http%3A%2F%2Fwww.mozilla.org%2F&charset=%28detect+automatically%29&doctype=Inline The document located at <http://www.mozilla.org/> was checked and found to be valid HTML 4.01 Strict. This means that the resource in question identified itself as "HTML 4.01 Strict" and that we successfully performed a formal validation using an SGML or XML Parser (depending on the markup language used).
The main mozilla.org page is ISO-8859-1 (as specified by Meta tag). This breaks the RSS feed of the Mozilla Weblogs headlines, which are in UTF-8. This bug appears to have evolved from its original subject. Perhaps someone with privileges should change it to "www.mozilla.org charset inconsistency/problems" or something similar. This bug should really be addressed ASAP, especially with the new site launch. It looks very bad to see that hosed encoding on the front page.
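The hosed encoding on the front page is the usual effect of UTF-8 bytes being interpreted as ISO-8859-1. A tiny Python illustration, with a made-up function name and sample string:

```python
def misread_utf8_as_latin1(text: str) -> str:
    """Show how UTF-8 encoded text looks when decoded as ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


if __name__ == "__main__":
    # Each non-ASCII character turns into two or more wrong characters.
    print(misread_utf8_as_latin1("café"))  # cafÃ©
```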
I would vote for converting every single page to UTF-8. If help is needed, please give me some directions.
Summary: www.mozilla.org doesn't send charset information → www.mozilla.org charset inconsistency/problems
Most files I edited on mozilla.org are already UTF-8 encoded. Quite a bit from ISO-8859-1 is encoded the same as in UTF-8. (Actually, most files seem to be US-ASCII compliant.) Although people could have a bit more trouble editing files locally, removing the META element isn't a bad idea. Doing this using HTTP is a much better solution. Note that when a switch is made, doctor.mozilla.org and other editing places should send the correct HTTP header as well, otherwise it would/could become a mess.
Blocks: 222580
Depends on: 261258
Blocks: 262394
Blocks: 270453
No longer blocks: 270453
Whiteboard: careful, site is burning
Mozilla Japan team is also interested in converting to UTF-8. We have 1,500+ translated documents and current charset is EUC-JP.
reassigning this back to default for now. When we get to a point where I can flip the switch and send UTF-8 headers server-side across the board, assign it back to me.
Assignee: justdave → mozilla.webmaster
Assignee: www-mozilla-org → nobody
QA Contact: danielwang → www-mozilla-org
(In reply to comment #33)
> reassigning this back to default for now. When we get to a point where I can
> flip the switch and send UTF-8 headers server-side across the board, assign it
> back to me.
I think you have the power to do this now. Reassigning to you. :)
Assignee: nobody → justdave
No, we most definitely don't, unless a lot of work auditing existing content has happened in the past month or so without me looking.
Assignee: justdave → nobody
The following needs to happen before we can send UTF-8 as the header. Somebody needs to pull the *entire web tree*, and then:
Phase one:
* fix all instances of "Warning: assuming non-ASCII characters are ISO-8859-1." output by tools/wrap.pl
* change tools/wrap.pl to default to UTF-8 and check in (and wait for feedback about problems, although they're unlikely)
Phase two:
* change tools/wrap.pl to warn about any explicit meta headers that are not UTF-8.
* fix all *those* warnings
Phase three:
* search all the pages that aren't wrapped but are affected by AddDefaultCharset for high characters. Re-encode all of them in UTF-8. (Or if they're testcases, use whatever the opposite of AddDefaultCharset is.)
Or something like that. I don't promise that list is accurate, but it's a rough idea.
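The "search for high characters and re-encode" step of phase three might be sketched like this in Python. The function names are hypothetical, and the assumption that every non-UTF-8 page is ISO-8859-1 is exactly the guess the wrap.pl warning refers to, so each hit would still need a human check before being committed.

```python
def needs_reencoding(data: bytes) -> bool:
    """True if the file has bytes >= 0x80 and is not already valid UTF-8."""
    if all(b < 0x80 for b in data):
        return False  # pure ASCII is identical in ISO-8859-1 and UTF-8
    try:
        data.decode("utf-8")
        return False
    except UnicodeDecodeError:
        return True


def reencode_to_utf8(data: bytes, source_charset: str = "iso-8859-1") -> bytes:
    """Convert bytes from an assumed source charset to UTF-8."""
    return data.decode(source_charset).encode("utf-8")
```

For a tree with other source charsets (the Japanese pages mentioned in this bug are EUC-JP), source_charset would have to be set per file or per directory.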
After fixing a couple of Perl warnings and adding a default interpreter to the top of the script, running the script returns nothing:
    reed@aramis:~/mozilla/www/mozilla-org$ tools/wrap.pl
    reed@aramis:~/mozilla/www/mozilla-org$
Am I missing something or is this correct? If so, I'll modify it to default to UTF-8 instead of ISO-8859-1.
It's run by the makefile; you don't use it directly.
It's not a matter of having the power, it's whether it's appropriate to do yet. (see ensuing comments from dbaron). Also, please assign to server-ops@mozilla-org.bugs when it's ready, there's more than me here, now, and I'm on vacation this week anyway.
Severity change: Wang wrote on 2004-02:
> Somebody please remove the stupid <meta charset="ISO-8859-1"> from wrapper.
and marked it blocker. This has been fixed, reverting to normal.
Severity: blocker → normal
Product: mozilla.org → Websites
This bug is ancient. It seems that many of the files listed in attachment 210564 [details] (from comment 40) no longer exist, either because they were archived or because they were deleted or moved. Do we still have to worry about these charset inconsistencies?
Component: www.mozilla.org → General
Product: Websites → www.mozilla.org
Going through old bugs and marking this resolved. For more info on our bug triage process and if you want to help: https://blog.mozilla.org/websites/2012/11/15/mozilla-org-bug-triage-process/
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WONTFIX