Bug 485928
Opened 15 years ago
Closed 15 years ago
many GraphServerPost steps are timing out
Categories
(mozilla.org Graveyard :: Server Operations, task)
Tracking
(Not tracked)
RESOLVED
INCOMPLETE
People
(Reporter: bhearsum, Assigned: aravind)
Since about 5am this morning we've seen 12 attempts to post to the graph server time out after 30 seconds. They're attempting to post to graphs.mozilla.org. Has this machine been less responsive lately? Is there any way we can see how long it takes to respond to POSTs? This is causing intermittent burning of Firefox/Firefox3.5, so I'm setting this as critical.
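One way to answer the "how long does it take to respond to POSTs" question from the client side is to time a request directly. The following is a minimal sketch, not the GraphServerPost implementation: the function name `time_post`, the injectable `opener` parameter, and the payload are assumptions made for illustration, and the real payload format expected by the graph server is not described in this bug. Only the 30-second timeout comes from the report.

```python
import socket
import time
import urllib.error
import urllib.request


def time_post(url, payload, timeout=30.0, opener=urllib.request.urlopen):
    """Send one POST and return (elapsed_seconds, status).

    status is the HTTP status code, or None if the request timed out
    or the connection failed. `opener` is injectable so the function
    can be exercised without touching the network (an assumption of
    this sketch, not something the bug describes).
    """
    request = urllib.request.Request(url, data=payload, method="POST")
    start = time.monotonic()
    try:
        with opener(request, timeout=timeout) as response:
            response.read()
            status = response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status.
        status = err.code
    except (urllib.error.URLError, socket.timeout, OSError):
        # Timeout, refused connection, DNS failure, and similar.
        status = None
    return time.monotonic() - start, status
```

Called in a loop from the machine that runs the step, with the real URL and a valid payload, this would give a rough picture of response times and of how often the 30-second limit is hit.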
Updated•15 years ago
Assignee: server-ops → reed
Assignee
Comment 1•15 years ago
Is this still happening? Reed has a theory that it's a security scanner that's causing it to time out at 5:00 AM. If you don't see this issue anymore, that's most likely the case.
Assignee: reed → aravind
Reporter
Comment 2•15 years ago
The last 5 failures were at these times: 9:45am, 7:49am, 7:34am, 7:37am, and 7:28am
Assignee
Comment 3•15 years ago
For 9:45, the only requests I see are:
10.2.73.155 - - [30/Mar/2009:09:44:42 -0700] "POST /server/bulk.cgi HTTP/1.0" 200 123
10.2.73.155 - - [30/Mar/2009:09:44:42 -0700] "POST /server/bulk.cgi HTTP/1.0" 200 121
10.2.73.155 - - [30/Mar/2009:09:44:43 -0700] "POST /server/bulk.cgi HTTP/1.0" 200 127
10.2.71.90 - - [30/Mar/2009:09:46:01 -0700] "POST /server/collect.cgi HTTP/1.0" 200 68
10.2.71.90 - - [30/Mar/2009:09:46:12 -0700] "POST /server/collect.cgi HTTP/1.0" 200 69
And I didn't find any requests in the logs for POST requests that failed.
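The check described in this comment (look for POST requests that did not return 200) can be sketched as a small log filter. This assumes the access log uses the common log format seen in the lines quoted above and that month names are in English. The names `LOG_LINE`, `parse_line`, and `failed_posts` are illustrative and not part of any existing tool. A request that times out on the client may never be logged as a failure, so an empty result does not rule out a server-side problem.

```python
import re
from datetime import datetime

# Matches the access-log layout quoted in comment 3:
# client, two unused fields, [timestamp], "METHOD path protocol", status, size.
LOG_LINE = re.compile(
    r'(?P<client>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)

# Two of the lines quoted in comment 3.
SAMPLE = [
    '10.2.73.155 - - [30/Mar/2009:09:44:42 -0700] "POST /server/bulk.cgi HTTP/1.0" 200 123',
    '10.2.71.90 - - [30/Mar/2009:09:46:01 -0700] "POST /server/collect.cgi HTTP/1.0" 200 68',
]


def parse_line(line):
    """Return a dict for one access-log line, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["when"] = datetime.strptime(entry["when"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    return entry


def failed_posts(lines):
    """Return parsed entries for POST requests with a 4xx or 5xx status."""
    entries = (parse_line(line) for line in lines)
    return [e for e in entries if e and e["method"] == "POST" and e["status"] >= 400]
```

For the sample lines, `failed_posts(SAMPLE)` returns an empty list, which matches what the comment reports.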
Assignee
Comment 4•15 years ago
What servers are these failing requests coming from?
Reporter
Comment 5•15 years ago
production-master.build.mozilla.org
Assignee
Comment 6•15 years ago
I don't see that server hitting the graphs server at all, are you sure about the hostname?
Comment 7•15 years ago
(In reply to comment #5)
> production-master.build.mozilla.org
(In reply to comment #6)
> I don't see that server hitting the graphs server at all, are you sure about
> the hostname?
Errr... actually, I don't think it's production-master.
Instead, I believe the Talos slaves post results directly to graphs.m.o. Can you check for any of the qm-*... slaves?
Reporter
Comment 8•15 years ago
(In reply to comment #7)
> (In reply to comment #5)
> > production-master.build.mozilla.org
>
> (In reply to comment #6)
> > I don't see that server hitting the graphs server at all, are you sure about
> > the hostname?
>
> Errr... actually, I dont think its production-master.
>
> Instead, I believe the Talos slaves post results directly to graphs.m.o. Can
> you check for any of the qm-*... slaves?
Yes, the Talos machines themselves post to the graph server. The codesighs and leak test builders also post to the graph server, though, and for those the GraphServerPost step is a master-side step, which gets executed on production-master. AFAIK we've only seen failures on the leak test machines.
Comment 9•15 years ago
So it was pretty much dead again this morning (and paging me for load), and this time the cause was someone running a script on people that was hitting getdata.cgi on the old graph server really hard. I blocked people in iptables and restarted Apache, and everything cleared up.
Comment 10•15 years ago
A quick re-check of the logs shows the same getdata.cgi hit pattern coming from people at each of the times mentioned in comment #2
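The re-check described here amounts to counting getdata.cgi requests per client around each failure time. Below is a rough sketch of that kind of tally, assuming the same access-log layout as the lines quoted in comment 3. The names `REQUEST` and `hits_per_minute` are illustrative, and the example lines are made up (they use documentation-range IP addresses and an assumed URL path), since the real getdata.cgi entries are not quoted in this bug.

```python
import re
from collections import Counter

# Captures the client, the timestamp truncated to the minute
# (e.g. "30/Mar/2009:07:28"), and the request path without its query string.
REQUEST = re.compile(
    r'(?P<client>\S+) \S+ \S+ \[(?P<minute>[^\]]{17})[^\]]*\] "\S+ (?P<path>[^ "?]+)'
)

# Made-up lines for illustration only; not taken from the real logs.
EXAMPLE = [
    '192.0.2.10 - - [30/Mar/2009:07:28:01 -0700] "GET /getdata.cgi?id=1 HTTP/1.0" 200 512',
    '192.0.2.10 - - [30/Mar/2009:07:28:02 -0700] "GET /getdata.cgi?id=2 HTTP/1.0" 200 498',
    '192.0.2.20 - - [30/Mar/2009:07:28:05 -0700] "POST /server/collect.cgi HTTP/1.0" 200 68',
]


def hits_per_minute(lines, script="getdata.cgi"):
    """Count requests for `script`, keyed by (client, minute)."""
    counts = Counter()
    for line in lines:
        match = REQUEST.match(line)
        if match and match["path"].endswith("/" + script):
            counts[(match["client"], match["minute"])] += 1
    return counts
```

Sorting the result with `most_common()` and comparing the busiest minutes against the failure times listed in comment 2 would show whether the two line up.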
Assignee
Comment 11•15 years ago
I looked up the wrong IP earlier (when I said I don't see any hits from production-master). I see multiple hits from that box now, but every single hit still has a return code of 200, so I'm still going nowhere with this. You mentioned you are seeing this only on the leak-test machines? Could this be a problem on those machines themselves and nothing to do with the graph server?
Reporter
Comment 12•15 years ago
(In reply to comment #11)
> I looked up the wrong ip earlier (when I said I don't see any hits from
> production-master). I see multiple hits from that box now, but still every
> singe hit has a return code of 200, so still going no where with this.
>
> You mentioned you are seeing this only on the leak-test machines? Could this
> be a problem on those machines themselves and nothing to do with the graph
> server?
Very unlikely. The BuildStep which does the GraphServerPost doesn't interact with the slave at all - it's 100% run on production-master.
Comment 13•15 years ago
Do we already use a mirrored database setup (with a read-only slave) for graphserver?
Comment 14•15 years ago
There is a slave, but I don't know if the app is using it.
Assignee
Comment 15•15 years ago
If this is not happening anymore, can we close this bug out? There are other bugs filed for db access for Jonathan.
Comment 16•15 years ago
I'm not sure if this is related, but the Firefox3.5 Linux build tinderbox just went red with the message "Error: failed graph server post" http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.5/1239397188.1239399485.29459.gz This was the only open bug I could find that looked vaguely related -- let me know if I should file a different bug on this issue.
Assignee
Comment 17•15 years ago
I can't find any specific details or patterns I can debug. Please re-open with more information if this continues to be a problem.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → INCOMPLETE
Updated•9 years ago
Product: mozilla.org → mozilla.org Graveyard