Closed Bug 856393 Opened 11 years ago Closed 11 years ago

Issues loading various hg pages via http(s)

Categories

(Developer Services :: General, task)

x86_64
Windows 7
task
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Callek, Assigned: fox2mike)

References

Details

http://hg.mozilla.org/releases/comm-release/pushloghtml is blank, unlike what I would expect

While http://hg.mozilla.org/releases/comm-release/pushlog does show entries

I don't know what's wrong, but I worry this might affect other, less frequently pushed-to trees as well.
Severity: normal → major
Summary: comm-release pushlog broken → Issues loading various pushloghtml requests
Assignee: server-ops-devservices → shyam
So I took an hg webhead out of rotation, downgraded mod_wsgi (we've had issues with the new version elsewhere, i.e. other webapps, not with hg so far), and then ran this locally, bypassing the load balancer. I tracked down the process and straced it, and saw the following:

[pid  7463] open("/repo/hg/mozilla/releases/comm-release/.hg/store/00changelog.d", O_RDONLY) = 17
[pid  7463] fstat(17, {st_mode=S_IFREG|0664, st_size=3559928, ...}) = 0
[pid  7463] fstat(17, {st_mode=S_IFREG|0664, st_size=3559928, ...}) = 0
[pid  7463] mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f1c59c33000
[pid  7463] lseek(17, 1900544, SEEK_SET) = 1900544
PANIC: attached pid 7462 exited with 255
PANIC: handle_group_exit: 7462 leader 7463
PANIC: attached pid 7461 exited with 255
PANIC: handle_group_exit: 7461 leader 7463
PANIC: attached pid 7463 exited with 255
PANIC: attached pid 7460 exited with 255
PANIC: handle_group_exit: 7460 leader 7463

I'm not sure exactly what this means at this point. It seems completely reproducible and happens right around 35 seconds into the request, every time. Other similar requests (all for pushloghtml) meet the same fate. Apache's logs show nothing unusual, and the requests return a 200.

CC'ing some storage folks (wondering about the lseek and if they're seeing any errors on the netapp), ted (for pushloghtml) and digi (our resident strace guru).
CC'ing Hal too, for his information.
That 00changelog.d file is an internal Mercurial data file, FWIW.
Yeah, 00changelog.d basically contains the data for the changeset metadata. 00changelog.i holds an index into that data file, so that quick seeks can be done into it. So I'm guessing that might explain the lseek() call, at least? Is it possible there's some kind of filesystem corruption there?
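To illustrate, here's a minimal sketch of how an index entry maps to a seek offset into the data file, assuming the revlog v1 ("RevlogNG") on-disk layout of 64-byte big-endian entries. `parse_entry` is a hypothetical helper written for this comment, not Mercurial's actual API:

```python
import struct

# Revlog v1 index entry: offset+flags (8 bytes), compressed length,
# uncompressed length, base rev, link rev, parent1, parent2 (4 bytes each),
# 20-byte node hash, 12 bytes of padding = 64 bytes total.
INDEX_ENTRY = struct.Struct(">Qiiiiii20s12x")

def parse_entry(entry, is_first=False):
    """Return (offset, compressed_length) for one 00changelog.i entry.

    The offset is where the corresponding chunk starts in 00changelog.d,
    i.e. roughly what you'd expect to see passed to lseek().
    """
    offset_flags, comp_len, raw_len, base, link, p1, p2, node = \
        INDEX_ENTRY.unpack(entry)
    if is_first:
        # The first 4 bytes of the index double as the version header;
        # revision 0's data offset is 0 by definition.
        return 0, comp_len
    return offset_flags >> 16, comp_len
```

For example, an entry whose offset field holds 1900544 would explain the `lseek(17, 1900544, SEEK_SET)` seen in the trace above.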
Not much information on my end, but a couple of questions on scope of impact.

Two big questions:
 1. how widespread is this? (other repos?)
 2. which use cases/user communities is this going to impact?

To the 2nd, I believe we provide pushlog in (at least) 3 formats:
 - atom (working)
 - html (not working)
 - json (working - http://hg.mozilla.org/releases/comm-release/json-pushes)

What are the results of 'hg verify' on this webhead/repository combination?
(In reply to Hal Wine [:hwine] from comment #6)
> To the 2nd, I believe we provide pushlog in (at least) 3 formats:
>  - atom (working)
>  - html (not working)
>  - json (working - http://hg.mozilla.org/releases/comm-release/json-pushes)

The json-pushes webcommand doesn't load information from the hg repo by default, which is why this works. The other formats default to loading changeset info.
(In reply to Dirkjan Ochtman (:djc) from comment #5)
> Is it possible there's some kind of filesystem corruption there?

The file is stored on NFS.  There's been no indication on the filer that something's amiss on the filesystem.  The long form of the strace shows tens of thousands of seeks against the file (and nothing else happening) before it blows up.  Tried to see (the hard way) if there was a ticklish spot within the file, but nothing showed there.  
Basic methodology was a loop over `dd if=/repo/hg/mozilla/releases/comm-release/.hg/store/00changelog.d of=/dev/null bs=1 count=1 skip=$i` (took 12 hours, I don't recommend that).

The repeatability smells like a timeout to me, but I'm solution-fitting more than working from concrete data.
Summary: Issues loading various pushloghtml requests → Issues loading various hg pages via http(s)
(In reply to Greg Cox [:gcox] from comment #8)

> The repeatability smells like a timeout to me, but I'm solution-fitting more
> than working from concrete data.

Correct. What threw me off initially was my assumption that we hadn't changed anything in the Apache configs on our end; it turned out I was wrong :)

I tracked this down to a config change that was made on March 21st to the Apache config to make sure we don't have too many mod_wsgi processes hanging around doing nothing. The setting is the inactivity-timeout option to WSGIDaemonProcess, and it was set to 30 seconds. For more information about this timeout, see http://code.google.com/p/modwsgi/wiki/ConfigurationDirectives#WSGIDaemonProcess
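For reference, a sketch of what the fixed directive might look like; the process group name, counts, and paths here are illustrative, not our production values:

```apache
# Illustrative mod_wsgi config: reap idle daemon processes only after
# 300s of inactivity, so long-running pushloghtml requests can finish.
WSGIDaemonProcess hgweb processes=8 threads=1 inactivity-timeout=300
WSGIProcessGroup hgweb
WSGIScriptAlias / /var/hg/hgweb.wsgi
```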

I've bumped up this timeout to 300 seconds. All tested URLs work fine now. When puppet runs complete across the webheads, things will be back to normal. 

Let me know if you run into any issues. Apologies for the issues this caused you all :)

gcox++ for help reading stuff and tracking down the change.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Component: Server Operations: Developer Services → General
Product: mozilla.org → Developer Services