Closed
Bug 154570
Opened 23 years ago
Closed 12 years ago
www.mozilla.org charset inconsistency/problems
Categories
(www.mozilla.org :: General, defect)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: 3.14, Unassigned)
Attachments
(4 files)
Look at URL, the validator complains that there is no charset information. For
the bugzilla server (Apache) this would be fixed by adding the following line to
.htaccess:
AddDefaultCharset On
You might want to check if this also works for CGIs like attachments to bugs.
Maybe the other bugs the validator finds can also be fixed;-)
pi
Reporter
Comment 1•23 years ago
I am wondering why there has been no reaction. This is a trivial thing to fix
and it has no side effects.
pi
Comment 2•22 years ago
Changing summary; Bugzillas in general seem to be covered by bug
126266.
Summary: bugzilla.mozilla.org and www.mozilla.org don't send charset information → www.mozilla.org doesn't send charset information
Reporter
Comment 3•22 years ago
You are right on one hand. On the other hand this bug here asks about actually
doing it. Probably a simple thing in .htaccess will do the job.
pi
Comment 4•22 years ago
There are 10377 *.html files at www.mozilla.org. Of these the vast
majority (9927 files) is plain ASCII. That leaves 450 non-ASCII files.
The majority of those documents are ISO-8859-1 coded, which is good.
The problematic documents include:
* about 79 documents that are Windows-1252 encoded
* i18n test cases whose path matches */intl/*
* The Polish and Czech evangelism letters
* http://www.mozilla.org/projects/bugzilla/download.html which is UTF-8
coded but outside any */intl/* directory
Probably the best way to deal with the i18n test cases is not to change
the configuration of the intl directories.
What to do with the evang letters and the Bugzilla download page
depends on the configurability of the server.
The Windows-1252 encoded documents could be converted to ISO-8859-1 with
numeric character references using a Perl script. The downside would be that
some old docs would have their modification dates touched even though
the content would still be old in substance.
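The conversion described above can be sketched roughly as follows. This is a minimal illustration in Python rather than the Perl script the comment has in mind, and the function name cp1252_to_latin1_ncr is hypothetical. It assumes the input really is Windows-1252; bytes that Windows-1252 leaves undefined will raise an error instead of being guessed at.

```python
def cp1252_to_latin1_ncr(raw: bytes) -> bytes:
    """Re-encode Windows-1252 bytes as ISO-8859-1.

    Characters that ISO-8859-1 cannot represent (curly quotes, dashes,
    the euro sign, ...) become decimal numeric character references.
    """
    text = raw.decode("cp1252")
    # xmlcharrefreplace is a standard codec error handler that emits
    # &#NNNN; for every character the target encoding lacks.
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


if __name__ == "__main__":
    sample = b"\x93quoted\x94 caf\xe9"  # curly quotes plus e-acute
    print(cp1252_to_latin1_ncr(sample))
```

The sketch does nothing about the modification-date concern raised above; preserving timestamps would have to be handled separately when writing files back.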
Reporter
Comment 5•22 years ago
I see basically two options. Both would set the default charset to ISO-8859-1
(AddDefaultCharset On).
1) You can set individual files or directories to different charsets as required.
2) You can, as you suggested, transform those files which are not ISO-8859-1
by using entities or numeric character references for the non-Latin-1
characters. This method is not that nice for files with many such characters
(say, pages in Russian); for those, 1) is the better way. Polish and Czech
should be OK here.
pi
Comment 6•22 years ago
i'm attaching an XHTML/1.0 transitional version of the home page which uses an
xml prolog to declare its charset. i don't know how to pull from the
mozilla.org cvs tree, but someone who can should be able to generate a diff
from this.
Comment 7•22 years ago
shoot! i posted that on the wrong bug. i'll cross-reference it from bug 89885.
sorry for the spam.
Comment 8•22 years ago
What's the problem with the Czech evangelism letter? I didn't even know
something like that was here...
Comment 9•21 years ago
To comment 5.
Unicode is not widely used in Poland. Most websites are encoded using ISO-8859-2
(Latin-2), which is an official national standard - also known as "Polska Norma
PN-93 T-42118".
Unicode also has some problems in the ancient Netscape 4.x (at least on UNIX).
Comment 10•21 years ago
The problem with the Czech evangelism letter is that it isn't
ISO-8859-1-encoded, so slapping an ISO-8859-1 default on all pages including
the evangelism letter would cause the character encoding of the evangelism
letter to be misdeclared.
Can Apache's default charset feature (with ISO-8859-1 as the value) be applied
to everything except */intl/*? Then the evangelism letters could be dealt with
in a .htaccess file.
Comment 11•21 years ago
Yes, we can do it by directory (and even by file if we need to).
Apache also has content negotiation, meaning you can do index.html.utf8 and
someone requesting index.html will get it, and it'll feed utf8 as the charset.
I believe this requires enabling "MultiViews" though, which has other
side-effects as well (requesting /foo will return the same thing requesting
/foo.html would, because the mime-type is considered to be negotiated as well).
Comment 12•21 years ago
I'd like to go ahead and do this pretty soon, now that we have the power to do
it very easily. However, we probably need to advertise it loudly ahead of time
if it's going to break existing content when we put it in, so that page authors
know to fix their directories. (Or we could just do it and let people file bugs
on the ones that break).
MultiViews makes it really easy for the content authors to declare their own
charset by just sticking it on the end of the filename. However, it also opens
up possible issues with similarly named files getting confused with each other.
In my opinion, MultiViews isn't something we'd want to use unless we had
started that way from the very beginning and everyone knew what to watch out
for in their filenaming conventions to avoid choking it.
What I think we should do is go ahead and set a site-wide default charset, then
override it on an as-needed basis in .htaccess files.
The standards-advocate in me says we should set the global default to utf-8 :)
But that's probably a pipe dream and we really need to use iso-8859-1 for the
sitewide default.
The way you override this in an .htaccess file is like so, using the
/projects/bugzilla/download.html file as an example:
<FilesMatch "^download\.html$">
AddDefaultCharset utf-8
</FilesMatch>
Placing the above snippet at /projects/bugzilla/.htaccess would be all that's
needed.
Placing the AddDefaultCharset directive all by itself in an .htaccess file would
cause it to apply to the entire directory.
Comment 13•21 years ago
If only ISO-8859-2 were widely used in the Czech Republic... Most servers are
preconfigured to send Windows-1250 REGARDLESS of the actual content. And where
there is no header, you can bet 90% of the text will be Windows-1250 too...
As for me, I would go with UTF-8. I have attached a recoded and corrected
version of the old Czech evangelism letter. Looking at the English version I
see a lot of changes, so a new translation would probably be better than
hacking the old text. I'll take a look into this later.
Comment 14•21 years ago
I converted the Polish tech letter from ISO-8859-1 to UTF-8.
This is still HTML 4, maybe we should convert the letters to XHTML?
Updated•21 years ago
QA Contact: imajes → stolenclover
Comment 15•21 years ago
Per comment 12, I'm ready to go ahead and do this. I'd like to set a site-wide
default to utf-8.
I'm not a publicity person though, and I don't know where this needs to be
advertised in advance. If someone knows where said advertisements should be
made, let me know. Otherwise I'm likely to just do this in about a week and
we'll let people start filing bugs on the pages that get messed up. :)
Comment 16•21 years ago
-> dave
don't break
http://www.mozilla.org/quality/intl/testprojects/mozcntopsitetest-zh.html
> I don't know where this needs to be advertised in advance.
try the l10n and i18n newsgroups.
Assignee: endico → justdave
Comment 17•21 years ago
Somebody please remove the stupid <meta charset="ISO-8859-1"> from wrapper. It's
breaking all UTF-8 pages
www.mozilla.org/quality/intl/testprojects/mozcntopsitetest-zh.html
www.mozilla.org/community/intl/ja.html
Severity: normal → blocker
Whiteboard: careful, site is burning
Comment 18•21 years ago
WTF?!?? I thought the style guidelines had a strict moratorium on <META
name="Content-Type">.
CVS Blame says it showed up in the initial commit of the new website on the beta
branch.
Comment 19•21 years ago
I don't have access to fix this. Just tried and got rejected on commit.
I think we should add it in the wrapping script unless the page explicitly
specifies something else so authors don't add pages that work using their
defaults but not other people's defaults. I'll try to write a patch for this.
I fixed the wrapper so it only adds a meta with charset if the page doesn't
already have a meta for Content-Type.
If we want to move to UTF-8 (which seems reasonable), the first step would
probably be to change the header that the wrapper adds to pages that don't
specify from ISO-8859-1 to UTF-8. This shouldn't break any content that existed
before the new site landed, but could break some content added since.
Then, once all pages are converted to UTF-8 and METAs (in the unwrapped HTML)
are removed, we could switch to using HTTP headers (probably in addition to
METAs, so they work locally but so the METAs don't cause reloading).
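The wrapper behaviour described above (add a charset META only when the page has no Content-Type META of its own) might look roughly like this. It is a Python sketch and not the actual tools/wrap.pl code; the function name add_default_charset_meta and the regular expressions are assumptions made for illustration.

```python
import re

# Matches an existing <meta http-equiv="Content-Type" ...>, case-insensitively.
META_CONTENT_TYPE = re.compile(
    r"<meta\b[^>]*http-equiv\s*=\s*[\"']?content-type\b", re.IGNORECASE
)
# Matches the opening <head> tag (but not <header>).
HEAD_OPEN = re.compile(r"<head\b[^>]*>", re.IGNORECASE)


def add_default_charset_meta(html: str, charset: str = "ISO-8859-1") -> str:
    """Insert a Content-Type META unless the page already declares one."""
    if META_CONTENT_TYPE.search(html):
        return html  # the author's explicit declaration wins
    match = HEAD_OPEN.search(html)
    if match is None:
        return html  # no <head> to insert into; leave the page untouched
    meta = f'<meta http-equiv="Content-Type" content="text/html; charset={charset}">'
    return html[: match.end()] + "\n" + meta + html[match.end():]
```

Under this sketch, the first step of the proposed UTF-8 migration would amount to changing the default value of the charset parameter.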
Comment 22•21 years ago
Hmm, I was leaning really heavily towards flipping the switch to have UTF-8 as a
server header, and letting people file bugs on the pages that break so we can do
per-file overrides :)
The advantage of METAs is that it allows testing in local trees. That said, I
do prefer the idea of server headers. There isn't a way to send the appropriate
header based on the META and (optionally) strip out the META, is there?
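For reference, reading the declared charset out of a page's META is simple enough in principle. The following Python sketch shows one way; the function name declared_charset and the pattern are hypothetical, and it only handles the common content="text/html; charset=..." form.

```python
import re

# Captures the charset value from <meta ... content="text/html; charset=X">.
META_CHARSET = re.compile(
    r"<meta\b[^>]*content\s*=\s*[\"'][^\"'>]*charset=([A-Za-z0-9._-]+)",
    re.IGNORECASE,
)


def declared_charset(html: str):
    """Return the lowercased charset declared in a META, or None."""
    match = META_CHARSET.search(html)
    return match.group(1).lower() if match else None
```

Whether the server can then act on that value per file is a separate question.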
Comment 24•21 years ago
It could be done if the munge script were to maintain a list of exceptions to
the default charset in the .htaccess file in each directory... Seems like it
would be a pain in the butt, but it could be done... :)
<FilesMatch ^(pipe-separated-list-of-regexp-meta-escaped-filenames)$>
AddDefaultCharset <charset>
</FilesMatch>
Or something to that effect. By regexp-meta-escaped, I mean things like periods
are escaped with \, etc. Call quotemeta() on it in perl before adding it to the
string :)
You'd need a block like the above for each charset being used in the directory.
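As a rough illustration of what the munge script would have to emit, here is a Python sketch (the comment above suggests Perl and quotemeta(); re.escape plays the same role here). The function name files_match_blocks and its input format are assumptions, not part of any existing script.

```python
import re
from collections import defaultdict


def files_match_blocks(charset_by_file):
    """Build one <FilesMatch> block per charset used in a directory.

    charset_by_file maps a filename in the directory to its charset.
    """
    names_by_charset = defaultdict(list)
    for name, charset in charset_by_file.items():
        names_by_charset[charset].append(name)

    blocks = []
    for charset in sorted(names_by_charset):
        # Escape regexp metacharacters such as "." in each filename.
        pattern = "|".join(re.escape(n) for n in sorted(names_by_charset[charset]))
        blocks.append(
            f'<FilesMatch "^({pattern})$">\n'
            f"AddDefaultCharset {charset}\n"
            "</FilesMatch>"
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    print(files_match_blocks({"download.html": "utf-8", "pl.html": "iso-8859-2"}))
```

Sorting the charsets and filenames keeps the generated .htaccess stable between runs, which should make diffs in CVS easier to read.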
Comment 25•21 years ago
this *will* crash some browsers.
Reporter
Comment 26•21 years ago
How can something which only happens on the web server crash a browser?
pi
Comment 27•21 years ago
The way we do it *now* crashes some browsers (notably NS4 - well, causes
problems, not exactly a crash). The idea is to get away from that.
Comment 28•21 years ago
I checked the validator link (as mentioned in the URL link)
http://validator.w3.org/check?uri=http%3A%2F%2Fwww.mozilla.org%2F&charset=%28detect+automatically%29&doctype=Inline
The document located at <http://www.mozilla.org/> was checked and found to be
valid HTML 4.01 Strict. This means that the resource in question identified
itself as "HTML 4.01 Strict" and that we successfully performed a formal
validation using an SGML or XML Parser (depending on the markup language used).
Comment 29•21 years ago
The main mozilla.org page is ISO-8859-1 (as specified by Meta tag). This breaks
the RSS feed of the Mozilla Weblogs headlines, which are in UTF-8.
This bug appears to have evolved from its original subject. Perhaps someone with
privileges should change it to "www.mozilla.org charset inconsistency/problems"
or something similar.
This bug should really be addressed ASAP, especially with the new site launch.
It looks very bad to see that hosed encoding on the front page.
Comment 30•21 years ago
I would vote for converting every single page to UTF-8. If help is needed,
please give me some directions.
Summary: www.mozilla.org doesn't send charset information → www.mozilla.org charset inconsistency/problems
Comment 31•21 years ago
Most files I edited on mozilla.org are already UTF-8 encoded. Much ISO-8859-1
content is byte-for-byte identical in UTF-8, since the two share the US-ASCII
range. (Actually, most files seem to be US-ASCII compliant.)
Although people could have a bit more trouble editing files locally, removing
the META element isn't a bad idea. Doing this using HTTP is a much better solution.
Note that when a switch is made, doctor.mozilla.org and other editing places
should send the correct HTTP header as well, otherwise it would/could become a mess.
Updated•21 years ago
Updated•20 years ago
Whiteboard: careful, site is burning
Comment 32•20 years ago
Mozilla Japan team is also interested in converting to UTF-8.
We have 1,500+ translated documents and current charset is EUC-JP.
Comment 33•20 years ago
reassigning this back to default for now. When we get to a point where I can
flip the switch and send UTF-8 headers server-side across the board, assign it
back to me.
Assignee: justdave → mozilla.webmaster
Updated•19 years ago
Assignee: www-mozilla-org → nobody
QA Contact: danielwang → www-mozilla-org
Comment 34•19 years ago
(In reply to comment #33)
> reassigning this back to default for now. When we get to a point where I can
> flip the switch and send UTF-8 headers server-side across the board, assign it
> back to me.
I think you have the power to do this now. Reassigning to you. :)
Assignee: nobody → justdave
No, we most definitely don't, unless a lot of work auditing existing content has happened in the past month or so without me looking.
Assignee: justdave → nobody
The following needs to happen before we can send UTF-8 as the header. Somebody needs to pull the *entire web tree*, and then:
Phase one:
* fix all instances of "Warning: assuming non-ASCII characters are ISO-8859-1." output by tools/wrap.pl
* change tools/wrap.pl to default to UTF-8 and check in (and wait for feedback about problems, although they're unlikely)
Phase two:
* change tools/wrap.pl to warn about any explicit meta headers that are not UTF-8.
* fix all *those* warnings
Phase three:
* search all the pages that aren't wrapped but are affected by AddDefaultCharset for high characters. Re-encode all of them in UTF-8. (Or if they're testcases, use whatever the opposite of AddDefaultCharset is.)
Or something like that. I don't promise that list is accurate, but it's a rough idea.
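The "search for high characters" step in phase three could be approximated with something like the Python sketch below. The names classify_encoding and audit_tree are hypothetical, and the check is deliberately coarse: it can tell plain ASCII and valid UTF-8 apart from everything else, but it cannot distinguish ISO-8859-1 from Windows-1252 or EUC-JP.

```python
import os


def classify_encoding(raw: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'other' for a file's raw bytes."""
    if raw.isascii():
        return "ascii"  # safe under any ASCII-compatible default
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return "other"  # has high bytes that are not valid UTF-8
    return "utf-8"


def audit_tree(root):
    """Yield (path, kind) for every non-ASCII *.html file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".html"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                kind = classify_encoding(handle.read())
            if kind != "ascii":
                yield path, kind
```

Files reported as "other" would be the ones needing re-encoding (or an explicit per-file override) before a UTF-8 header could be sent safely.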
Comment 37•19 years ago
After fixing a couple of Perl warnings and adding a default interpreter to the top of the script, running the script returns nothing:
reed@aramis:~/mozilla/www/mozilla-org$ tools/wrap.pl
reed@aramis:~/mozilla/www/mozilla-org$
Am I missing something or is this correct? If so, I'll modify it to default to UTF-8 instead of ISO-8859-1.
It's run by the makefile; you don't use it directly.
Comment 39•19 years ago
It's not a matter of having the power, it's whether it's appropriate to do yet. (see ensuing comments from dbaron).
Also, please assign to server-ops@mozilla-org.bugs when it's ready, there's more than me here, now, and I'm on vacation this week anyway.
Comment 40•19 years ago
Comment 41•17 years ago
Severity change:
Wang wrote on 2004-02:
> Somebody please remove the stupid <meta charset="ISO-8859-1"> from wrapper.
and marked it blocker. This has been fixed, reverting to normal.
Severity: blocker → normal
Assignee
Updated•17 years ago
Product: mozilla.org → Websites
Comment 42•14 years ago
This bug is ancient. It seems that many of the files listed in attachment 210564 (from comment 40) no longer exist, either because they were archived or because they were deleted or moved.
Do we still have to worry about these charset inconsistencies?
Assignee
Updated•13 years ago
Component: www.mozilla.org → General
Product: Websites → www.mozilla.org
Comment 43•12 years ago
Going through old bugs and marking this resolved.
For more info on our bug triage process and if you want to help:
https://blog.mozilla.org/websites/2012/11/15/mozilla-org-bug-triage-process/
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WONTFIX