Bug 856393
Opened 11 years ago
Closed 11 years ago
Issues loading various hg pages via http(s)
Categories
(Developer Services :: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: Callek, Assigned: fox2mike)
References
Details
http://hg.mozilla.org/releases/comm-release/pushloghtml is blank, unlike what I would expect, while http://hg.mozilla.org/releases/comm-release/pushlog does show entries. I don't know what is wrong, but I worry this might affect other less frequently used trees as well.
Updated•11 years ago (Assignee)
Severity: normal → major
Summary: comm-release pushlog broken → Issues loading various pushloghtml requests
Updated•11 years ago (Assignee)
Assignee: server-ops-devservices → shyam
Comment 2•11 years ago (Assignee)
So I took an hg webhead out of rotation, downgraded mod_wsgi (we've had issues elsewhere [aka other webapps, not with hg so far] with the new version), and then ran this locally, bypassing the load balancer. I tracked down the process and traced it to see the following:

[pid 7463] open("/repo/hg/mozilla/releases/comm-release/.hg/store/00changelog.d", O_RDONLY) = 17
[pid 7463] fstat(17, {st_mode=S_IFREG|0664, st_size=3559928, ...}) = 0
[pid 7463] fstat(17, {st_mode=S_IFREG|0664, st_size=3559928, ...}) = 0
[pid 7463] mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f1c59c33000
[pid 7463] lseek(17, 1900544, SEEK_SET) = 1900544
PANIC: attached pid 7462 exited with 255
PANIC: handle_group_exit: 7462 leader 7463
PANIC: attached pid 7461 exited with 255
PANIC: handle_group_exit: 7461 leader 7463
PANIC: attached pid 7463 exited with 255
PANIC: attached pid 7460 exited with 255
PANIC: handle_group_exit: 7460 leader 7463

I'm not sure what this means exactly at this point. It appears to be completely reproducible and happens right around 35 seconds into the request, every time. Other similar requests (all for pushloghtml) run into the same fate. Apache's logs show nothing and return a 200.

CC'ing some storage folks (wondering about the lseek and whether they're seeing any errors on the NetApp), ted (for pushloghtml), and digi (our resident strace guru).
Comment 3•11 years ago (Assignee)
CC'ing Hal too, for his information.
Comment 4•11 years ago
That 00changelog.d file is an internal Mercurial data file, FWIW.
Comment 5•11 years ago
Yeah, 00changelog.d basically contains the data for the changeset metadata. 00changelog.i has an index into the data file, such that quick seeks can be done into the data file. So I'm guessing that might explain the lseek() call, at least? Is it possible there's some kind of filesystem corruption there?
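The .i/.d split described above can be made concrete. Below is a hedged sketch assuming the standard RevlogNG (v1) on-disk index layout: fixed 64-byte entries with struct format `>Qiiiiii20s12x`, where the offset into the data file lives in the high 48 bits of the first field. `parse_entry` is an illustrative helper, not Mercurial's actual code:

```python
import struct

# RevlogNG (v1) index entry, 64 bytes:
#   offset+flags (8) | compressed len (4) | uncompressed len (4)
#   | base rev (4) | link rev (4) | p1 (4) | p2 (4) | node (20) | 12 pad
ENTRY = struct.Struct(">Qiiiiii20s12x")

def parse_entry(raw, rev):
    """Decode one 64-byte index entry from 00changelog.i."""
    (offset_flags, comp_len, uncomp_len,
     base, link, p1, p2, node) = ENTRY.unpack(raw)
    if rev == 0:
        # In the first entry the offset field doubles as the version header,
        # and revision 0's data starts at offset 0 anyway.
        offset = 0
    else:
        offset = offset_flags >> 16   # low 16 bits are per-revision flags
    return {"offset": offset, "comp_len": comp_len, "node": node.hex()}
```

Given an offset recovered this way, a reader would seek straight into 00changelog.d, which is consistent with the `lseek(17, 1900544, SEEK_SET)` visible in the strace in comment 2.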
Comment 6•11 years ago
Not much information on my end, but a couple of questions on scope of impact. Two big questions:

1. How widespread is this? (other repos?)
2. Which use cases/user communities is this going to impact?

To the 2nd, I believe we provide pushlog in (at least) 3 formats:
- atom (working)
- html (not working)
- json (working - http://hg.mozilla.org/releases/comm-release/json-pushes)

What are the results of 'hg verify' on this webhead/repository combination?
Comment 7•11 years ago
(In reply to Hal Wine [:hwine] from comment #6)
> To the 2nd, I believe we provide pushlog in (at least) 3 formats:
> - atom (working)
> - html (not working)
> - json (working - http://hg.mozilla.org/releases/comm-release/json-pushes)

The json-pushes webcommand doesn't load information from the hg repo by default, which is why this works. The other formats default to loading changeset info.
Comment 8•11 years ago
(In reply to Dirkjan Ochtman (:djc) from comment #5)
> Is it possible there's some kind of filesystem corruption there?

The file is stored on NFS. There's been no indication on the filer that something's amiss on the filesystem. The long form of the strace shows tens of thousands of seeks against the file (and nothing else happening) before it blows up.

Tried to see (the hard way) if there was a ticklish spot within the file, but nothing showed there. Basic methodology was a loop over `dd if=/repo/hg/mozilla/releases/comm-release/.hg/store/00changelog.d of=/dev/null bs=1 count=1 skip=$i` (took 12 hours; I don't recommend that).

The repeatability smells like a timeout to me, but I'm solution-fitting more than working from concrete data.
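The `bs=1` dd loop above spawns one process per byte, hence the 12 hours. A much faster equivalent readability check can be sketched as follows; this is a hypothetical helper, not what was actually run on the webhead:

```python
def scan_readable(path, block=1 << 20):
    """Read `path` start to finish in `block`-byte chunks.

    Returns the byte offset at which the first read error occurred,
    or None if the whole file read cleanly.
    """
    with open(path, "rb", buffering=0) as f:
        offset = 0
        while True:
            try:
                chunk = f.read(block)
            except OSError:
                return offset          # first unreadable region
            if not chunk:
                return None            # clean EOF: everything readable
            offset += len(chunk)
```

For a ~3.5 MB file like this 00changelog.d, a sequential chunked read finishes in well under a second, so a "ticklish spot" test need not take hours.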
Updated•11 years ago (Assignee)
Summary: Issues loading various pushloghtml requests → Issues loading various hg pages via http(s)
Comment 10•11 years ago (Assignee)
(In reply to Greg Cox [:gcox] from comment #8)
> The repeatability smells like a timeout to me, but I'm solution-fitting more than working from concrete data.

Correct. What threw me off initially was my assumption that we hadn't changed anything with respect to Apache configs on our end; it turned out I was wrong :)

I tracked this down to a config change made on March 21st to the Apache config, intended to make sure we don't have too many mod_wsgi processes hanging around doing nothing. The setting is called the inactivity_timeout and was set to 30 seconds. For more information about this timeout, see http://code.google.com/p/modwsgi/wiki/ConfigurationDirectives#WSGIDaemonProcess

I've bumped this timeout up to 300 seconds. All tested URLs work fine now. When puppet runs complete across the webheads, things will be back to normal. Let me know if you run into any issues. Apologies for the trouble this caused you all :)

gcox++ for help reading stuff and tracking down the change.
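For reference, this timeout is an option of mod_wsgi's WSGIDaemonProcess directive (spelled `inactivity-timeout` there); when it expires with no request activity, mod_wsgi restarts the daemon process, which would kill a long-running pushloghtml render mid-request. A minimal sketch of the fix as described in this bug; the daemon group name and the process/thread counts are hypothetical, only the 300-second value comes from this bug:

```apache
# Hypothetical hgweb daemon group; only inactivity-timeout=300 is from this bug.
WSGIDaemonProcess hgweb processes=8 threads=1 inactivity-timeout=300
WSGIProcessGroup hgweb
```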
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Updated•10 years ago
Component: Server Operations: Developer Services → General
Product: mozilla.org → Developer Services