Closed Bug 485928 Opened 15 years ago Closed 15 years ago

many GraphServerPost steps are timing out

Categories

(mozilla.org Graveyard :: Server Operations, task)

x86
macOS
task
Not set
critical

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: bhearsum, Assigned: aravind)

References

Details

Since about 5am this morning we've seen 12 attempts to post to the graph server time out after 30 seconds. They're attempting to post to graphs.mozilla.org. Has this machine been less responsive lately? Is there any way we can see how long it takes to respond to POSTs? This is causing intermittent burning of Firefox/Firefox3.5, so I'm setting this as critical.
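
If it helps, something along these lines could be used to time a single POST from the build network and compare it against the step's 30-second cutoff. This is only a sketch: the payload below is a placeholder, not what the step actually sends.

# Time one POST to the graph server and compare it with the 30-second
# cutoff used by the GraphServerPost step.
# NOTE: the payload is a placeholder, not the real step's data.
import time
import urllib2

GRAPH_URL = "http://graphs.mozilla.org/server/collect.cgi"  # endpoint seen in the logs
payload = "filename=placeholder&value=0"                    # placeholder body

start = time.time()
response = urllib2.urlopen(GRAPH_URL, payload, 30)  # 30s timeout (Python 2.6+)
response.read()
elapsed = time.time() - start
print "POST completed in %.1fs" % elapsed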
Assignee: server-ops → reed
Is this still happening?

Reed has a theory that it's a security scanner that's causing it to time out at 5:00 AM.  If you don't see this issue anymore, that's most likely the case.
Assignee: reed → aravind
The last 5 failures were at these times: 9:45am, 7:49am, 7:34am, 7:37am, and 7:28am
For 9:45, the only requests I see are

10.2.73.155 - - [30/Mar/2009:09:44:42 -0700] "POST /server/bulk.cgi HTTP/1.0" 200 123
10.2.73.155 - - [30/Mar/2009:09:44:42 -0700] "POST /server/bulk.cgi HTTP/1.0" 200 121
10.2.73.155 - - [30/Mar/2009:09:44:43 -0700] "POST /server/bulk.cgi HTTP/1.0" 200 127
10.2.71.90 - - [30/Mar/2009:09:46:01 -0700] "POST /server/collect.cgi HTTP/1.0" 200 68
10.2.71.90 - - [30/Mar/2009:09:46:12 -0700] "POST /server/collect.cgi HTTP/1.0" 200 69


And I didn't find any failed POST requests in the logs.
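
If anyone wants to repeat that check, a scan along these lines should do it. Just a sketch; the log path is an assumption about this vhost's config.

# Look for POSTs to the graph server CGIs that did NOT return 200.
# /var/log/httpd/access_log is an assumed path; adjust to the real vhost log.
import re

LOG = "/var/log/httpd/access_log"
pattern = re.compile(r'"POST /server/(?:bulk|collect)\.cgi [^"]*" (\d{3})')

for line in open(LOG):
    m = pattern.search(line)
    if m and m.group(1) != "200":
        print line.rstrip()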
What servers are these failing requests coming from?
production-master.build.mozilla.org
I don't see that server hitting the graphs server at all, are you sure about the hostname?
(In reply to comment #5)
> production-master.build.mozilla.org

(In reply to comment #6)
> I don't see that server hitting the graphs server at all, are you sure about
> the hostname?

Errr... actually, I don't think it's production-master.

Instead, I believe the Talos slaves post results directly to graphs.m.o. Can you check for any of the qm-*... slaves?
(In reply to comment #7)
> (In reply to comment #5)
> > production-master.build.mozilla.org
> 
> (In reply to comment #6)
> > I don't see that server hitting the graphs server at all, are you sure about
> > the hostname?
> 
> Errr... actually, I don't think it's production-master.
> 
> Instead, I believe the Talos slaves post results directly to graphs.m.o. Can
> you check for any of the qm-*... slaves?

Yes, the Talos machines themselves post to the graph server. The codesighs and leak test builders also post to the graph server, but for them the GraphServerPost step is a master-side step, which gets executed on production-master. AFAIK we've only seen failures on the leak test machines.
So it was pretty much dead again this morning (and paging me for load), and this time it turned out to be someone running a script on people that was hitting getdata.cgi on the old graph server really hard.  I blocked people in iptables and restarted apache, and everything cleared up.
A quick re-check of the logs shows the same getdata.cgi hit pattern coming from people at each of the times mentioned in comment #2.
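
For anyone re-checking later, a per-IP tally of the getdata.cgi hits is enough to spot the offender. Rough sketch only; the log path is an assumption.

# Count getdata.cgi hits per client IP to find whoever is hammering the
# old graph server.  The log path is an assumption.
LOG = "/var/log/httpd/access_log"

counts = {}
for line in open(LOG):
    if "getdata.cgi" not in line:
        continue
    ip = line.split()[0]
    counts[ip] = counts.get(ip, 0) + 1

for ip, hits in sorted(counts.items(), key=lambda item: -item[1])[:10]:
    print "%6d  %s" % (hits, ip)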
I looked up the wrong IP earlier (when I said I don't see any hits from production-master).  I see multiple hits from that box now, but every single hit has a return code of 200, so I'm still getting nowhere with this.

You mentioned you are seeing this only on the leak-test machines?  Could this be a problem on those machines themselves and nothing to do with the graph server?
(In reply to comment #11)
> I looked up the wrong IP earlier (when I said I don't see any hits from
> production-master).  I see multiple hits from that box now, but every single
> hit has a return code of 200, so I'm still getting nowhere with this.
> 
> You mentioned you are seeing this only on the leak-test machines?  Could this
> be a problem on those machines themselves and nothing to do with the graph
> server?

Very unlikely. The BuildStep which does the GraphServerPost doesn't interact with the slave at all - it's 100% run on production-master.
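
In other words, the whole POST happens from production-master, so a slow graph server shows up as this step timing out regardless of what the slave is doing. A very rough sketch of that shape, not the actual buildbotcustom code; the function name, endpoint, and payload format are made up:

# Rough sketch of a master-side graph server post.  This is NOT the real
# GraphServerPost step; names, payload, and endpoint are illustrative only.
import socket
import urllib2

def post_results(server, payload, timeout=30):
    # Runs entirely on production-master; the slave is never contacted.
    socket.setdefaulttimeout(timeout)
    url = "http://%s/server/collect.cgi" % server   # assumed endpoint
    return urllib2.urlopen(url, payload).read()

# e.g. post_results("graphs.mozilla.org", "branch=Firefox&value=1234")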
Depends on: 486662
Do we already use a mirrored database setup (with a read-only slave) for graphserver?
There is a slave, I don't know if the app is using it.
If this is not happening anymore, can we close this bug out?  There are other bugs filed for db access for Jonathan.
I'm not sure if this is related, but the Firefox3.5 Linux build tinderbox just went red with the message "Error: failed graph server post"

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.5/1239397188.1239399485.29459.gz

This was the only open bug I could find that looked vaguely related -- let me know if I should file a different bug on this issue.
I can't find any specific details or patterns I can debug.  Please re-open with more information if this continues to be a problem.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → INCOMPLETE
Product: mozilla.org → mozilla.org Graveyard