Closed
Bug 452718
Opened 16 years ago
Closed 12 years ago
hgweb pages sometimes don't grok non-ascii characters (output uses incorrect character encoding/charset in some cases)
Categories
(mozilla.org Graveyard :: Server Operations: Projects, task)
Tracking
(Not tracked)
RESOLVED
WORKSFORME
People
(Reporter: Dolske, Unassigned)
Details
This seems to have broken recently (today's upgrade, maybe?)... The noticeable side effect is that non-ascii characters in commit messages get displayed (sent?) as '?'. This commonly happens when cut'n'pasting bugzilla titles (the emdash in "Bug 12345 -- foo"). Example: http://hg.mozilla.org/mozilla-central/rev/ce557eb9ef4a says "Bug 451479 ? storage ...". bz said it appears that the server is sending the encoding as ANSI_X3.4-1968, which is ascii.
Comment 2•16 years ago
This probably has to do with the move from CGIs to mod_wsgi. hgweb takes its encoding from the environment (LC_CTYPE). If you want to override that, you can use the HGENCODING environment variable. I think you can use SetEnv in the apache vhost config to fix this.
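As a rough illustration of that lookup order (a sketch only, not Mercurial's actual code; `pick_encoding` is a hypothetical helper), an explicit HGENCODING wins and the locale's charset is only a fallback. The last line also checks that Python treats ANSI_X3.4-1968 as another name for ASCII:

```python
import codecs


def pick_encoding(environ, locale_encoding):
    """Hypothetical helper: HGENCODING wins, then the locale, then ASCII."""
    return environ.get("HGENCODING") or locale_encoding or "ascii"


# With no override, the process falls back to whatever the locale reports.
print(pick_encoding({}, "ANSI_X3.4-1968"))
# With HGENCODING present in the environment that hgweb actually sees.
print(pick_encoding({"HGENCODING": "UTF-8"}, "ANSI_X3.4-1968"))
# The C/POSIX locale charset name is an alias for plain ASCII.
print(codecs.lookup("ANSI_X3.4-1968").name)
```

Whether a given Apache directive really reaches the hgweb process environment depends on the deployment, so it is worth printing the effective value from inside the running application.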
Updated•16 years ago
Assignee: server-ops → aravind
Comment 3•16 years ago
Done - thanks to djc.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Reporter
Comment 4•16 years ago
Hmm. Oddly this seems to have fixed the changeset page I linked in comment 0, but the shortlog and pushlog are still showing "?", even though the page encoding is UTF-8.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 5•16 years ago
http://hg.mozilla.org/mozilla-central/pushloghtml seems fine to me? But yeah, http://hg.mozilla.org/mozilla-central/shortlog/18512 still has question marks...
Reporter
Comment 6•16 years ago
The changeset I mentioned is currently on page 3 of the pushlog, and shows a "?" after the bug number. So do 2 of the 7 changesets below it.
Comment 7•16 years ago
Seems to be working now. If not, could you please link to the page with the question marks (I can't seem to find them)?
Status: REOPENED → RESOLVED
Closed: 16 years ago → 16 years ago
Resolution: --- → FIXED
Reporter
Comment 8•16 years ago
The changeset I've been talking about is currently towards the bottom of: http://hg.mozilla.org/mozilla-central/pushloghtml/3 ("storage-mozStorage should use COUNT in countLogins")
I also see some oddness in another recent changeset: 9d40cd95d9c9 ("Deprecate the timed textbox binding"). Dão Gottwald's name has a non-ascii character (the a+tilde in 'Dao', lest Bugzilla munge it here).
At the top of http://hg.mozilla.org/mozilla-central/pushloghtml/1 (currently) it's shown fine. On the changeset page at http://hg.mozilla.org/mozilla-central/rev/9d40cd95d9c9 and the commit log at http://hg.mozilla.org/mozilla-central/ it's shown as "D?o".
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 9•16 years ago
I think we should get a few things separated. First, if you're linking to pushlog or pushloghtml, mind linking to date-based queries? Dao's push shows on http://hg.mozilla.org/mozilla-central/pushloghtml?startdate=Aug%2029%202008&enddate=Aug%2031%202008, for example. Regarding Dao, can you confirm that you have your name encoded in utf-8 in .hgrc locally? We might hit some encoding mismatches between non-ascii chars locally and server side. I'm not sure if the pushlog db handles unicode fine, either. Dirkjan, it'd be great if you could share some insight on unicode handling in user names in hg, too.
Reporter
Comment 10•16 years ago
(In reply to comment #9)
> First, if you're linking to pushlog or pushloghtml, mind linking to date-based
> queries?
Ooo, I didn't know that was possible. Is there discoverable UI for this (which I can't seem to discover :), or is it just URL hacking?
Comment 11•16 years ago
(In reply to comment #9)
> Regarding Dao, can you confirm that you have your name encoded in utf-8 in
> .hgrc locally?
Yes, confirmed.
Reporter
Comment 12•16 years ago
(Updating summary, since the page encoding wasn't the whole cause of whatever's happening)
Summary: hgweb pages are in us-ascii encoding, not UTF8 → hgweb pages sometimes don't grok non-ascii characters
Comment 13•16 years ago
(In reply to comment #8)
> http://hg.mozilla.org/mozilla-central/pushloghtml/3 ("storage-mozStorage should
> use COUNT in countLogins")
I don't see anything wrong with that - could you maybe explain what you are seeing?
> I also see some oddness in another recent changeset: 9d40cd95d9c9 ("Deprecate
> the timed textbox binding"). Dão Gottwald's name has a non-ascii character (the
> a+tilde in 'Dao', lest Bugzilla munge it here).
>
> At the top of http://hg.mozilla.org/mozilla-central/pushloghtml/1 (currently)
> it's shown fine. On the changeset page at
> http://hg.mozilla.org/mozilla-central/rev/9d40cd95d9c9 and the commit log at
> http://hg.mozilla.org/mozilla-central/ it's shown as "D?o".
I see the tildes just fine, in the changelog, in the summary page and in the specific changeset page. Maybe it's a problem with your browser/OS settings?
Reporter
Comment 14•16 years ago
Hmm. I had http://hg.mozilla.org/mozilla-central/ loaded (showing "D?o"), clicked reload, and now it's showing fine. Also shows fine after restarting the browser (current nightly and my build from yesterday). Did something happen on the server?
Comment 15•16 years ago
(In reply to comment #14)
> Hmm. I had http://hg.mozilla.org/mozilla-central/ loaded (showing "D?o"),
> clicked reload, and now it's showing fine.
I've seen the same thing some hours ago, now it shows D?o all the time.
Comment 16•16 years ago
(In reply to comment #15)
> now it shows D?o all the time.
i.e. on summary and shortlog. I got Dão on changelog, and after that also on summary and shortlog.
Reporter
Comment 17•16 years ago
I noticed that bug 453085 says there are 2 pooled Hg machines... http://dm-hg01.mozilla.org/mozilla-central/ is showing "Dão", whereas dm-hg02 is showing "D?o". No change on reload.
Comment 18•16 years ago
Axel, in hg, usernames at commit time are converted from the local encoding (detected using the LC_CTYPE environment variable, overridable using the HGENCODING environment variable) to UTF-8. At output time, they are converted back to the local encoding; this *should* be utf-8 for hgweb after the fix Aravind did last week.
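The round trip described above can be sketched like this. It is an illustration under assumptions, not Mercurial's code: `to_storage` and `to_display` are hypothetical names, and the use of the "replace" error handler on output is assumed because it matches the "?" seen on the pages.

```python
def to_storage(raw: bytes, local_encoding: str) -> bytes:
    """Commit time: bytes in the committer's local encoding -> UTF-8 bytes."""
    return raw.decode(local_encoding).encode("utf-8")


def to_display(stored: bytes, local_encoding: str) -> bytes:
    """Output time: UTF-8 bytes -> local bytes; unencodable chars become '?'."""
    return stored.decode("utf-8").encode(local_encoding, "replace")


# "D\u00e3o" is "Dao" with a tilde on the a, typed on a Latin-1 client.
stored = to_storage("D\u00e3o".encode("latin-1"), "latin-1")

# A server whose local encoding is UTF-8 gets the name back intact.
print(to_display(stored, "utf-8"))
# A server that fell back to ASCII loses the character.
print(to_display(stored, "ascii"))
```

The stored bytes are identical in both cases; only the output-side encoding differs, which is why the same changeset can look right on one server process and wrong on another.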
Comment 19•16 years ago
Hmm, people are seeing some problems with hg.m.o/mozilla-central/atom-log. The problem seems to be an incorrect XML prolog, which has encoding=""UTF-8"". Aravind, could it be that you've added some stray quotes somewhere?
Comment 20•16 years ago
Seems like it, in the http header, it's Content-Type: application/atom+xml; charset="UTF-8" too, where the docs I found just say charset=UTF-8
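For comparison, HTTP allows a media-type parameter value to be written either as a bare token or as a quoted string, so a header parser would normally read both forms the same way. A quick check with Python's standard-library header parser (used here only as a stand-in for whatever a feed reader does; `charset_of` is a hypothetical helper):

```python
from email.message import Message


def charset_of(content_type: str):
    """Parse a Content-Type value and return its charset parameter."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


print(charset_of('application/atom+xml; charset="UTF-8"'))
print(charset_of("application/atom+xml; charset=UTF-8"))
```

If that holds for the clients involved, the quoted header on its own would be harmless, and the doubled quotes in the XML prolog would be the more likely cause of the feed breakage.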
Updated•16 years ago
Comment 21•16 years ago
> seeing some problems with hg.m.o/mozilla-central/atom-log
Oh, that's why I thought nobody had pushed anything for days. Since that makes it not-well-formed XML, that's not "some problems" it's "completely useless unless you do your XML processing with regexes instead of an XML parser."
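A small check of that point, using Python's standard XML parser. The two documents are made-up minimal feeds, not the real atom-log output, and `is_well_formed` is a hypothetical helper:

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: bytes) -> bool:
    """Return True if an XML parser accepts the document."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


FEED = b'<feed xmlns="http://www.w3.org/2005/Atom"/>'
good = b'<?xml version="1.0" encoding="UTF-8"?>' + FEED
bad = b'<?xml version="1.0" encoding=""UTF-8""?>' + FEED  # doubled quotes

print(is_well_formed(good))
print(is_well_formed(bad))
```

A conforming parser rejects the second document at the declaration, before it reads any entry, which fits the feeds appearing empty rather than partially broken.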
Status: REOPENED → NEW
Comment 22•16 years ago
Confirmed the feeds for mozilla-central and comm-central don't work for me in either Firefox 3.1b1pre or Shredder 3.0b1pre any longer. Started around September 1.
Comment 23•16 years ago
I definitely fixed the headers, so it now says charset=UTF-8 without the quotes. The logs also seem to be fixed now (though, I am not quite sure how that happened).
Status: NEW → RESOLVED
Closed: 16 years ago → 16 years ago
Resolution: --- → FIXED
Comment 24•16 years ago
Never mind, apparently they are messed up again.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 25•16 years ago
Guys, what is left to try on this? Aside from hgweb code changes, I have tweaked almost every single apache setting that could affect it. I am open to ideas that folks may have.
Comment 26•16 years ago
Given the output, it seems more likely to be some sort of hgweb configuration issue than an apache issue. If I compare:
http://dm-hg01.mozilla.org/mozilla-central/rev/4020e95b0e4e
http://dm-hg02.mozilla.org/mozilla-central/rev/4020e95b0e4e
In both cases, the name of the author is actually 100% character escaped. However, the copy from hg01 has D?o ... ("D?o...") while the copy from hg02 has Dão ... ("Dão..."). So the charset headers certainly aren't relevant here. The question is what hgweb is using to do that entity-escaping.
I think this escaping is actually done in order to obfuscate email addresses from bots. It's represented in the templates as "#author|obfuscate#". The definition of obfuscate is in templatefilters.py, which uses util._encoding. If you look at the code in util.py that sets _encoding, it seems to depend on:
* the HGENCODING environment variable
* the locale information (the various environment variables that control the output of the "locale" command, most likely)
Do these differ on the two machines?
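The entity-escaping described above can be sketched as follows. This is modeled on the description in this comment rather than copied from templatefilters.py, and the explicit `encoding` parameter stands in for the module-level setting the real filter reads:

```python
def obfuscate(text: bytes, encoding: str) -> str:
    """Sketch: write every character as a numeric HTML entity."""
    decoded = text.decode(encoding, "replace")
    return "".join("&#%d;" % ord(ch) for ch in decoded)


# Author name already downgraded to ASCII before reaching the filter.
print(obfuscate(b"D?o", "ascii"))
# Author name kept as UTF-8 all the way through.
print(obfuscate("D\u00e3o".encode("utf-8"), "utf-8"))
```

Because the output is pure ASCII entities either way, the charset declared in the HTTP header cannot change what the browser shows; the "?" (entity 63) is already baked into the markup by the time it leaves the server.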
Comment 27•16 years ago
Actually, I'd previously been told this was a machine-to-machine difference, but I think I'm seeing different results within machines on the above URLs. So maybe it's not. And it looks like much of the advice in the previous comment was a repetition of what was above, so probably not relevant...
Comment 28•16 years ago
The configs on both the servers are identical (w.r.t apache anyway). I just compared the md5 sums and those match as well. So, I am fairly certain that this isn't a result of server side configuration difference.
Comment 29•16 years ago
Meh, this is exceedingly weird. Anyway, I just wanted to point out that on my server (with a similar configuration), it works fine, but "Simon Bünzli" still appears. So someone might want to tell him to fix his client. Aravind, there's not really any way for me to dig around, right? Maybe we can do another debug session at some point?
Comment 30•16 years ago
(In reply to comment #29)
> Anyway, I just wanted to point out that on my
> server (with a similar configuration), it works fine, but "Simon Bünzli" still
> appears. So someone might want to tell him to fix his client.
Simon doesn't push, but you're right, the string itself is broken in this case. (I think Ehsan is to blame.)
Updated•16 years ago
Assignee: aravind → nobody
Component: Server Operations → Server Operations: Projects
Comment 33•14 years ago
I think testing dm-hg01 and dm-hg02 directly isn't particularly useful, since I think they just proxy http traffic to the dm-vcview* boxes somewhat randomly. Testing dm-vcview01/dm-vcview02 directly, on the other hand, seems to provide interesting results: http://dm-vcview01.mozilla.org/mozilla-central/rev/9d40cd95d9c9 seems to always show "D?o" http://dm-vcview02.mozilla.org/mozilla-central/rev/9d40cd95d9c9 seems to sometimes show "D?o" and sometimes show "Dão" A few other people confirmed that they see the same behavior on IRC.
Summary: hgweb pages sometimes don't grok non-ascii characters → hgweb pages sometimes don't grok non-ascii characters (output uses incorrect character encoding/charset in some cases)
Comment 34•14 years ago
I just saw both results on dm-vcview01. I got the right result the first time and then kept getting the wrong result ("?") on reload. (Do others see that pattern? Does it mean something?)
Comment 35•14 years ago
Hmm, actually, that is consistent with my results too (I thought perhaps I had confused the tabs for the first attempt, so wrote it off as a mistake). Clearing my cache doesn't seem to have any effect, but trying it in a new browser I see the same thing (first load of dm-vcview01 works, second and subsequent loads fail, all loads of dm-vcview02 are random).
Comment 36•14 years ago
I was wondering if it was related more to "has the server served any other requests in the last N seconds" than "have I loaded this page in my browser before".
Comment 37•14 years ago
I think the first-load-in-a-browser thing is probably not useful to focus on (though I can't explain why it seems to be so consistent and reproducible in different browsers). "curl http://dm-vcview01.mozilla.org/mozilla-central/rev/9d40cd95d9c9 2>/dev/null | md5" alternates pretty randomly between 9274fc36f95872c17d2dad16a91cd5dd and e96c52a1ddfef20d93eafd876a26bae5 for both vcview01 and vcview02, and the difference between the two outputs is the same as described in comment 26. I guess this really just needs some server-side debugging.
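The repeated curl-and-hash check above can be automated along these lines. This is a sketch: `tally_digests` and `make_fake_fetch` are hypothetical names, and the fetcher below is a stand-in that alternates between two fake bodies so the example runs without touching the network.

```python
import hashlib
from collections import Counter
from typing import Callable


def tally_digests(fetch: Callable[[], bytes], attempts: int) -> Counter:
    """Call fetch() repeatedly and count how often each MD5 digest appears."""
    return Counter(hashlib.md5(fetch()).hexdigest() for _ in range(attempts))


def make_fake_fetch() -> Callable[[], bytes]:
    """Stand-in for an HTTP GET: alternates between two page variants."""
    bodies = [b"D?o", "D\u00e3o".encode("utf-8")]
    state = {"calls": 0}

    def fetch() -> bytes:
        body = bodies[state["calls"] % len(bodies)]
        state["calls"] += 1
        return body

    return fetch


counts = tally_digests(make_fake_fetch(), 10)
for digest, seen in counts.items():
    print(digest, seen)
```

Against a live server, `fetch` could wrap something like `urllib.request.urlopen(url).read()`. More than one digest for the same URL points at per-process state on the server rather than anything in the browser.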
Comment 38•14 years ago
Aravind did prod around with me in IRC, but I think at one point we just gave up. I'd be happy to try it myself if I'm granted access.
Comment 39•13 years ago
Hi guys, is there still a reproducible bug here?
Comment 40•13 years ago
I haven't seen this for some time.
Updated•12 years ago
Status: REOPENED → RESOLVED
Closed: 16 years ago → 12 years ago
Resolution: --- → FIXED
Updated•12 years ago
Resolution: FIXED → WORKSFORME
Assignee
Updated•9 years ago
Product: mozilla.org → mozilla.org Graveyard