Closed Bug 856393 Opened 11 years ago Closed 11 years ago

Issues loading various hg pages via http(s)

Categories

(Developer Services :: General, task)

x86_64
Windows 7
task
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Callek, Assigned: fox2mike)

References

Details

http://hg.mozilla.org/releases/comm-release/pushloghtml is blank, unlike what I would expect

While http://hg.mozilla.org/releases/comm-release/pushlog does show entries

I don't know what's wrong, but I worry this might affect other, less frequently pushed-to trees as well.
Severity: normal → major
Summary: comm-release pushlog broken → Issues loading various pushloghtml requests
Assignee: server-ops-devservices → shyam
So I took an hg webhead out of rotation, downgraded mod_wsgi (we've had issues with the new version elsewhere, i.e. other webapps, not with hg so far), and then ran this locally, bypassing the load balancer. I tracked down the process and straced it, and saw the following:

[pid  7463] open("/repo/hg/mozilla/releases/comm-release/.hg/store/00changelog.d", O_RDONLY) = 17
[pid  7463] fstat(17, {st_mode=S_IFREG|0664, st_size=3559928, ...}) = 0
[pid  7463] fstat(17, {st_mode=S_IFREG|0664, st_size=3559928, ...}) = 0
[pid  7463] mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f1c59c33000
[pid  7463] lseek(17, 1900544, SEEK_SET) = 1900544
PANIC: attached pid 7462 exited with 255
PANIC: handle_group_exit: 7462 leader 7463
PANIC: attached pid 7461 exited with 255
PANIC: handle_group_exit: 7461 leader 7463
PANIC: attached pid 7463 exited with 255
PANIC: attached pid 7460 exited with 255
PANIC: handle_group_exit: 7460 leader 7463

I'm not sure exactly what this means at this point. It seems completely reproducible and happens right around 35 seconds into the request, every time. Other similar requests (all for pushloghtml) meet the same fate. Apache's logs show nothing unusual, and the requests return a 200.

CC'ing some storage folks (wondering about the lseek and if they're seeing any errors on the netapp), ted (for pushloghtml) and digi (our resident strace guru).
CC'ing Hal too, for his information.
That 00changelog.d file is an internal Mercurial data file, FWIW.
Yeah, 00changelog.d basically contains the data for the changeset metadata. 00changelog.i holds an index into that data file, so that quick seeks can be done into it. So I'm guessing that might explain the lseek() call, at least? Is it possible there's some kind of filesystem corruption there?
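To illustrate, here's a minimal sketch of how an index entry maps to a seek offset into the data file, assuming the revlog v1 ("RevlogNG") on-disk layout of 64-byte big-endian entries. `parse_entry` is a hypothetical helper written for this comment, not Mercurial's actual API:

```python
import struct

# Revlog v1 index entry: offset+flags (8 bytes), compressed length,
# uncompressed length, base rev, link rev, parent1, parent2 (4 bytes each),
# 20-byte node hash, 12 bytes of padding = 64 bytes total.
INDEX_ENTRY = struct.Struct(">Qiiiiii20s12x")

def parse_entry(entry, is_first=False):
    """Return (offset, compressed_length) for one 00changelog.i entry.

    The offset is where the corresponding chunk starts in 00changelog.d,
    i.e. roughly what you'd expect to see passed to lseek().
    """
    offset_flags, comp_len, raw_len, base, link, p1, p2, node = \
        INDEX_ENTRY.unpack(entry)
    if is_first:
        # The first 4 bytes of the index double as the version header;
        # revision 0's data offset is 0 by definition.
        return 0, comp_len
    return offset_flags >> 16, comp_len
```

For example, an entry whose offset field holds 1900544 would explain the `lseek(17, 1900544, SEEK_SET)` seen in the trace above.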
Not much information on my end, but a couple of questions on scope of impact.

Two big questions:
 1. how widespread is this? (other repos?)
 2. which use cases/user communities is this going to impact?

To the 2nd, I believe we provide pushlog in (at least) 3 formats:
 - atom (working)
 - html (not working)
 - json (working - http://hg.mozilla.org/releases/comm-release/json-pushes)

What are the results of 'hg verify' on this webhead/repository combination?
(In reply to Hal Wine [:hwine] from comment #6)
> To the 2nd, I believe we provide pushlog in (at least) 3 formats:
>  - atom (working)
>  - html (not working)
>  - json (working - http://hg.mozilla.org/releases/comm-release/json-pushes)

The json-pushes webcommand doesn't load information from the hg repo by default, which is why this works. The other formats default to loading changeset info.
(In reply to Dirkjan Ochtman (:djc) from comment #5)
> Is it possible there's some kind of filesystem corruption there?

The file is stored on NFS.  There's been no indication on the filer that something's amiss on the filesystem.  The long form of the strace shows tens of thousands of seeks against the file (and nothing else happening) before it blows up.  Tried to see (the hard way) if there was a ticklish spot within the file, but nothing showed there.  
Basic methodology was a loop over `dd if=/repo/hg/mozilla/releases/comm-release/.hg/store/00changelog.d of=/dev/null bs=1 count=1 skip=$i` (took 12 hours, I don't recommend that).

The repeatability smells like a timeout to me, but I'm solution-fitting more than working from concrete data.
Summary: Issues loading various pushloghtml requests → Issues loading various hg pages via http(s)
(In reply to Greg Cox [:gcox] from comment #8)

> The repeatability smells like a timeout to me, but I'm solution-fitting more
> than working from concrete data.

Correct. What threw me off initially was my assumption that we hadn't changed anything in the Apache configs on our end; it turned out I was wrong :)

I tracked this down to a config change that was made on March 21st to the Apache config to make sure we don't have too many mod_wsgi processes hanging around doing nothing. The setting is the inactivity-timeout option to WSGIDaemonProcess, and it was set to 30 seconds. For more information about this timeout, see http://code.google.com/p/modwsgi/wiki/ConfigurationDirectives#WSGIDaemonProcess
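For reference, a sketch of what the fixed directive might look like; the process group name, counts, and paths here are illustrative, not our production values:

```apache
# Illustrative mod_wsgi config: reap idle daemon processes only after
# 300s of inactivity, so long-running pushloghtml requests can finish.
WSGIDaemonProcess hgweb processes=8 threads=1 inactivity-timeout=300
WSGIProcessGroup hgweb
WSGIScriptAlias / /var/hg/hgweb.wsgi
```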

I've bumped up this timeout to 300 seconds. All tested URLs work fine now. When puppet runs complete across the webheads, things will be back to normal. 

Let me know if you run into any issues. Apologies for the issues this caused you all :)

gcox++ for help reading stuff and tracking down the change.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Component: Server Operations: Developer Services → General
Product: mozilla.org → Developer Services