Closed Bug 452718 Opened 16 years ago Closed 12 years ago

hgweb pages sometimes don't grok non-ascii characters (output uses incorrect character encoding/charset in some cases)

Categories

(mozilla.org Graveyard :: Server Operations: Projects, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: Dolske, Unassigned)

Details

This seems to have broken recently (today's upgrade, maybe?)... The noticeable side effect is that non-ascii characters in commit messages get displayed (sent?) as '?'. This commonly happens when cut'n'pasting bugzilla titles (the emdash in "Bug 12345 -- foo").

Example: http://hg.mozilla.org/mozilla-central/rev/ce557eb9ef4a says "Bug 451479 ? storage ..."

bz said it appears that the server is sending the encoding as ANSI_X3.4-1968, which is ascii.
This probably has to do with the move from CGIs to mod_wsgi. hgweb takes its encoding from the environment (LC_CTYPE). If you want to override that, you can use the HGENCODING environment variable. I think you can use SetEnv in the apache vhost config to fix this.
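
To illustrate the lookup order described above, here is a minimal Python sketch. It is not Mercurial's actual code: `pick_encoding` is a hypothetical helper, and the fallback to `locale.getpreferredencoding` is an assumption about how a process would consult its locale. The second half only checks that Python's codec registry treats the reported charset name as an alias for ASCII.

```python
import codecs
import locale


def pick_encoding(environ):
    """Hypothetical helper: prefer HGENCODING, else fall back to the locale."""
    override = environ.get("HGENCODING")
    if override:
        return override
    # No override: use whatever encoding the process locale reports.
    return locale.getpreferredencoding(False)


# With the override set (for example via SetEnv), the locale is not consulted.
with_override = pick_encoding({"HGENCODING": "UTF-8"})

# The charset name the server was reported to send is an alias for ASCII
# in Python's codec registry.
reported_codec = codecs.lookup("ANSI_X3.4-1968").name

print(with_override, reported_codec)
```

If the web server process starts without any locale variables set, the fallback branch would typically report an ASCII-family encoding, which would be consistent with the header bz saw.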
Assignee: server-ops → aravind
Done - thanks to djc.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Hmm. Oddly this seems to have fixed the changeset page I linked in comment 0, but the shortlog and pushlog are still showing "?", even though the page encoding is UTF-8.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
The changeset I mentioned is currently on page 3 of the pushlog, and shows a "?" after the bug number. So do 2 of the 7 changesets below it.
Seems to be working now.  If not, could you please link to the page with the question marks (I can't seem to find them)?
Status: REOPENED → RESOLVED
Closed: 16 years ago → 16 years ago
Resolution: --- → FIXED
The changeset I've been talking about is currently towards the bottom of:

http://hg.mozilla.org/mozilla-central/pushloghtml/3 ("storage-mozStorage should use COUNT in countLogins")

I also see some oddness in another recent changeset: 9d40cd95d9c9 ("Deprecate the timed textbox binding"). Dão Gottwald's name has a non-ascii character (the a+tilde in 'Dao', lest Bugzilla munge it here).

At the top of http://hg.mozilla.org/mozilla-central/pushloghtml/1 (currently) it's shown fine. On the changeset page at http://hg.mozilla.org/mozilla-central/rev/9d40cd95d9c9 and the commit log at http://hg.mozilla.org/mozilla-central/ it's shown as "D?o".
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I think we should get a few things separated.

First, if you're linking to pushlog or pushloghtml, mind linking to date-based queries? Dao's push shows on http://hg.mozilla.org/mozilla-central/pushloghtml?startdate=Aug%2029%202008&enddate=Aug%2031%202008, for example.

Regarding Dao, can you confirm that you have your name encoded in utf-8 in .hgrc locally? We might hit some encoding mismatches between non-ascii chars locally and server side.

I'm not sure if the pushlog db handles unicode fine, either.

Dirkjan, it'd be great if you could share some insight on unicode handling in user names in hg, too.
(In reply to comment #9)

> First, if you're linking to pushlog or pushloghtml, mind linking to date-based
> queries?

Ooo, I didn't know that was possible. Is there discoverable UI for this (which I can't seem to discover :), or is it just URL hacking?
(In reply to comment #9)
> Regarding Dao, can you confirm that you have your name encoded in utf-8 in
> .hgrc locally?

Yes, confirmed.
(Updating summary, since the page encoding wasn't the whole cause of whatever's happening)
Summary: hgweb pages are in us-ascii encoding, not UTF8 → hgweb pages sometimes don't grok non-ascii characters
(In reply to comment #8)
> http://hg.mozilla.org/mozilla-central/pushloghtml/3 ("storage-mozStorage should
> use COUNT in countLogins")

I don't see anything wrong with that - could you maybe explain what you are seeing?
 
> I also see some oddness in another recent changeset: 9d40cd95d9c9 ("Deprecate
> the timed textbox binding"). Dão Gottwald's name has a non-ascii character (the
> a+tilde in 'Dao', lest Bugzilla munge it here).
> 
> At the top of http://hg.mozilla.org/mozilla-central/pushloghtml/1 (currently)
> it's shown fine. On the changeset page at
> http://hg.mozilla.org/mozilla-central/rev/9d40cd95d9c9 and the commit log at
> http://hg.mozilla.org/mozilla-central/ it's shown as "D?o".

I see the tildes just fine, in the changelog, in the summary page and in the specific changeset page.  Maybe it's a problem with your browser/OS settings?
Hmm. I had http://hg.mozilla.org/mozilla-central/ loaded (showing "D?o"), clicked reload, and now it's showing fine. Also shows fine after restarting the browser (current nightly and my build from yesterday).

Did something happen on the server?
(In reply to comment #14)
> Hmm. I had http://hg.mozilla.org/mozilla-central/ loaded (showing "D?o"),
> clicked reload, and now it's showing fine.

I've seen the same thing some hours ago, now it shows D?o all the time.
(In reply to comment #15)
> now it shows D?o all the time.

i.e. on summary and shortlog.
I got Dão on changelog, and after that also on summary and shortlog.
I noticed that bug 453085 says there are 2 pooled Hg machines... http://dm-hg01.mozilla.org/mozilla-central/ is showing "Dão", whereas dm-hg02 is showing "D?o". No change on reload.
Axel, in hg, usernames at commit time are converted from the local encoding (detected using the LC_CTYPE environment variable, overridable using the HGENCODING environment variable) to UTF-8. At output time, they are converted back to the local encoding; this *should* be UTF-8 for hgweb after the fix Aravind did last week.
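
The output-time conversion described above can be sketched roughly as follows. This is an illustration under assumptions, not hg's implementation: `to_local` is a hypothetical helper, and the use of `errors="replace"` is assumed because it matches the "?" symptom reported in this bug.

```python
def to_local(stored_utf8: bytes, local_encoding: str) -> bytes:
    """Hypothetical helper: re-encode stored UTF-8 bytes for output."""
    text = stored_utf8.decode("utf-8")
    # Characters the target encoding cannot represent become "?".
    return text.encode(local_encoding, errors="replace")


author = "Dão Gottwald".encode("utf-8")

ascii_out = to_local(author, "ascii")   # what an ASCII locale would produce
utf8_out = to_local(author, "utf-8")    # what a UTF-8 locale would produce

print(ascii_out)
print(utf8_out == author)
```

Under these assumptions, a process whose local encoding resolves to ASCII emits "D?o Gottwald", while one that resolves to UTF-8 returns the stored bytes unchanged.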
Hmm, people are seeing some problems with hg.m.o/mozilla-central/atom-log. The problem seems to be an incorrect XML prolog, which has encoding=""UTF-8"". Aravind, could it be that you've added some stray quotes somewhere?
Seems like it, in the http header, it's 

Content-Type: application/atom+xml; charset="UTF-8"

too, where the docs I found just say charset=UTF-8
> seeing some problems with hg.m.o/mozilla-central/atom-log

Oh, that's why I thought nobody had pushed anything for days. Since that makes it not-well-formed XML, that's not "some problems" it's "completely useless unless you do your XML processing with regexes instead of an XML parser."
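
As a quick check of the well-formedness point, the sketch below feeds two prologs to Python's standard XML parser. The `<feed/>` body is a made-up stand-in for the real Atom document, and `is_well_formed` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Prolog as it should look, and as it was reported with the stray quotes.
good = b'<?xml version="1.0" encoding="UTF-8"?><feed/>'
bad = b'<?xml version="1.0" encoding=""UTF-8""?><feed/>'


def is_well_formed(document: bytes) -> bool:
    """Return True if the standard-library parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(good), is_well_formed(bad))
```

A conforming parser has to reject the second document outright, which is why feed readers showed nothing at all rather than a partially broken feed.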
Status: REOPENED → NEW
Confirmed the feeds for mozilla-central and comm-central don't work for me in either Firefox 3.1b1pre or Shredder 3.0b1pre any longer. Started around September 1.
I definitely fixed the headers, so it now says charset=UTF-8 without the quotes.

The logs also seem to be fixed now (though, I am not quite sure how that happened).
Status: NEW → RESOLVED
Closed: 16 years ago → 16 years ago
Resolution: --- → FIXED
nevermind, apparently they are messed up again.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Guys, what is left to try on this?  Aside from hgweb code changes I have tweaked almost every single apache setting that could affect it.  I am open to ideas that folks may have..
Given the output, it seems more likely to be some sort of hgweb configuration issue than an apache issue.  If I compare:
http://dm-hg01.mozilla.org/mozilla-central/rev/4020e95b0e4e
http://dm-hg02.mozilla.org/mozilla-central/rev/4020e95b0e4e
In both cases, the name of the author is actually 100% character escaped.  However, the copy from hg01 has D?o ... ("D?o...") while the copy from hg02 has Dão ... ("Dão...").  So the charset headers certainly aren't relevant here.  The question is what hgweb is using to do that entity-escaping.

I think this escaping is actually done in order to obfuscate email addresses from bots.  It's represented in the templates as "#author|obfuscate#".  The definition of obfuscate is in templatefilters.py, which uses util._encoding.

If you look at the code in util.py that sets _encoding, it seems to depend on:
 * the HGENCODING environment variable
 * the locale information (the various environment variables that control the output of the "locale" command, most likely)

Do these differ on the two machines?
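
For readers unfamiliar with this kind of escaping, here is a minimal sketch of writing every character as a numeric character reference. It is not the code from templatefilters.py: `entity_escape` is a hypothetical name, and the decimal `&#NNN;` form is an assumption.

```python
def entity_escape(text: str) -> str:
    """Sketch: write every character as a decimal character reference."""
    return "".join("&#%d;" % ord(ch) for ch in text)


# The same three-character prefix, once already degraded to "?" and once intact.
degraded = entity_escape("D?o")
intact = entity_escape("Dão")

print(degraded)
print(intact)
```

Because the escaped output is pure ASCII either way, the page's charset header cannot change how it renders; the "?" must already be in the string before the escaping step runs.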
Actually, I'd previously been told this was a machine-to-machine difference, but I think I'm seeing different results within machines on the above URLs.  So maybe it's not.

And it looks like much of the advice in the previous comment was a repetition of what was above, so probably not relevant...
The configs on both the servers are identical (w.r.t apache anyway).  I just compared the md5 sums and those match as well.  So, I am fairly certain that this isn't a result of server side configuration difference.
Meh, this is exceedingly weird. Anyway, I just wanted to point out that on my server (with a similar configuration), it works fine, but "Simon Bünzli" still appears. So someone might want to tell him to fix his client.

Aravind, there's not really any way for me to dig around, right? Maybe we can do another debug session at some point?
(In reply to comment #29)
> Anyway, I just wanted to point out that on my
> server (with a similar configuration), it works fine, but "Simon Bünzli" still
> appears. So someone might want to tell him to fix his client.

Simon doesn't push, but you're right, the string itself is broken in this case. (I think Ehsan is to blame.)
Assignee: aravind → nobody
Component: Server Operations → Server Operations: Projects
I think testing dm-hg01 and dm-hg02 directly isn't particularly useful, since I think they just proxy http traffic to the dm-vcview* boxes somewhat randomly. Testing dm-vcview01/dm-vcview02 directly, on the other hand, seems to provide interesting results:

http://dm-vcview01.mozilla.org/mozilla-central/rev/9d40cd95d9c9 seems to always show "D?o"

http://dm-vcview02.mozilla.org/mozilla-central/rev/9d40cd95d9c9 seems to sometimes show "D?o" and sometimes show "Dão"

A few other people confirmed that they see the same behavior on IRC.
Summary: hgweb pages sometimes don't grok non-ascii characters → hgweb pages sometimes don't grok non-ascii characters (output uses incorrect character encoding/charset in some cases)
I just saw both results on dm-vcview01.

I got the right result the first time and then kept getting the wrong result ("?") on reload.  (Do others see that pattern?  Does it mean something?)
Hmm, actually, that is consistent with my results too (I thought perhaps I had confused the tabs for the first attempt, so wrote it off as a mistake). Clearing my cache doesn't seem to have any effect, but trying it in a new browser I see the same thing (first load of dm-vcview01 works, second and subsequent loads fail, all loads of dm-vcview02 are random).
I was wondering if it was related more to "has the server served any other requests in the last N seconds" than "have I loaded this page in my browser before".
I think the first-load-in-a-browser thing is probably not useful to focus on (though I can't explain why it seems to be so consistent and reproducible in different browsers).

"curl http://dm-vcview01.mozilla.org/mozilla-central/rev/9d40cd95d9c9 2>/dev/null | md5" alternates pretty randomly between 9274fc36f95872c17d2dad16a91cd5dd and e96c52a1ddfef20d93eafd876a26bae5 for both vcview01 and vcview02, and the difference between the two outputs is the same as described in comment 26.

I guess this really just needs some server-side debugging.
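
The curl-and-md5 comparison above could be scripted along these lines to see how often each variant is served. This is a sketch only: `fetch` and `tally_digests` are hypothetical names, and the `fetcher` parameter exists so the counting logic can be exercised without network access.

```python
import hashlib
from collections import Counter
from urllib.request import urlopen


def fetch(url: str) -> bytes:
    """Download the response body for a URL."""
    with urlopen(url) as response:
        return response.read()


def tally_digests(url, attempts=20, fetcher=fetch):
    """Request the same URL repeatedly and count each distinct MD5 digest."""
    counts = Counter()
    for _ in range(attempts):
        body = fetcher(url)
        counts[hashlib.md5(body).hexdigest()] += 1
    return counts
```

Calling `tally_digests` with one of the changeset URLs quoted in this bug (assuming the host is reachable) and printing the result would show whether the two variants are served in roughly equal proportion or whether one dominates.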
Aravind did prod around with me in IRC, but I think at one point we just gave up. I'd be happy to try it myself if I'm granted access.
Blocks: 573144
Hi guys, is there still a reproducible bug here?
I haven't seen this for some time.
Status: REOPENED → RESOLVED
Closed: 16 years ago → 12 years ago
Resolution: --- → FIXED
Resolution: FIXED → WORKSFORME
Product: mozilla.org → mozilla.org Graveyard