According to https://secure.pub.build.mozilla.org/buildapi/self-serve/jobs self-serve hasn't done anything since sometime between 14:46 and 16:13. Not quite a tree closer, but since most trees are closed for bug 893511 anyway, doesn't matter either way.
It was waiting on a network socket, so this will be fallout from the network work in the downtime. Restarting the service processed 64 requests.
Assignee: nobody → nthomas
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Priority: -- → P1
Resolution: --- → FIXED
<aki> retriggers from tbpl aren't retriggering, or aren't showing up

And the self-serve log hasn't recorded anything since 2013-07-14 03:22:35, shortly after I restarted it and did a test rebuild.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I've restarted it again, lots of requests got processed. Something stopped working between the last log message at 03:22 and the next request at 04:58 PT.
Created attachment 775402 [details] [diff] [review]
[buildapi] Use a reconecting class for receiving requests

I suspect that bug 891375 or bug 891383 have changed the session timeout for connections from buildbot-master36.srv.releng.scl3 to rabbit2.build.scl1.

This patch is a stab in the dark, based on the JobRequestConsumer creation at http://hg.mozilla.org/build/buildapi/file/default/buildapi/scripts/selfserve-agent.py#l635 not using one of the reconnecting classes in buildapi/lib/mq.py. It's the only use of JobRequestConsumer().

This bit of code in mq.py is a little worrying though, inheriting from object instead of Consumer like the comment says:

    class ReconnectingConsumer(object):
        """Wrapper class around carrot.messaging.Consumer that handles disconnects
        from the broker in the wait() loop."""

and there may be errors when trying to do consumer.register_callback(self.receive)
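For illustration only, here is a rough sketch of the general "rebuild the consumer when wait() fails" idea the patch is reaching for. It is not the code in buildapi/lib/mq.py: consume_with_reconnect, make_consumer, handle, max_retries and delay_secs are hypothetical names, and the consumer is assumed to be any object offering register_callback() and wait(). Note that this only helps when the dropped connection surfaces as an exception; a session silently dropped while idle can leave wait() blocked without raising anything.

```python
import time


def consume_with_reconnect(make_consumer, handle, max_retries=5,
                           delay_secs=0.0, errors=(IOError, OSError)):
    """Run consumer.wait(), building a fresh consumer after each failure.

    make_consumer: hypothetical zero-argument factory returning an object
                   with register_callback() and wait() methods.
    handle:        callback invoked for each received message.
    Returns the number of failures seen before wait() returned normally.
    """
    failures = 0
    while True:
        # Build a new consumer (and so a new connection) on every attempt.
        consumer = make_consumer()
        consumer.register_callback(handle)
        try:
            consumer.wait()
            return failures
        except errors:
            failures += 1
            if failures > max_retries:
                # Give up and let the caller see the last error.
                raise
            time.sleep(delay_secs)
```

The factory argument keeps the sketch independent of any particular messaging library, since the old connection object is discarded rather than repaired.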
(In reply to Nick Thomas [:nthomas] from comment #4)
> I suspect that bug 891375 or bug 891383 have changed the session timeout for
> connections from buildbot-master36.srv.releng.scl3 to rabbit2.build.scl1.

<casey> could be but i can't look at the moment. can check in later.

needinfo to follow up on that.
Time to stop getting messages is between 5 and 33 minutes (assuming it's repeatable, not much data yet).
If the TCP session goes idle for more than 30 minutes, the firewall will drop the session and the server will no longer be able to send data to the client over that socket. I've added a longer timeout for amqp and applied a policy to use that timeout for rabbitmq in scl1. Any current sessions will need to be re-established to pick up that timer adjustment. Alternatively, if there's any way you can enable a keepalive in the application or TCP keepalives in the kernel, you'll avoid the session timeouts. That is a better fix than the band-aid applied above.
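As a rough illustration of the kernel TCP keepalive option mentioned above, a socket can be configured as in the sketch below. enable_tcp_keepalive and its default values are hypothetical, and whether the messaging library used by self-serve exposes its underlying socket for this is an assumption, not something confirmed in this bug. The idle time just needs to be shorter than the firewall's idle timeout.

```python
import socket


def enable_tcp_keepalive(sock, idle_secs=600, interval_secs=60, probes=5):
    """Turn on kernel TCP keepalives for an existing socket.

    The defaults are illustrative: start probing after 10 idle minutes,
    well under a 30 minute firewall idle timeout.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket tuning options are platform-specific (these are the
    # Linux names), so set each one only where the constant is available.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_secs)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_secs)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock
```

With keepalives on, the kernel sends probes while the connection is otherwise idle, so the firewall keeps seeing traffic and a dead peer is eventually reported as a socket error instead of a silent hang.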
I've restarted self-serve to pick up the longer timeout, thanks for that. How long is the timeout now? Unfortunately the idle time is variable, as traffic comes from developers making requests to start/cancel builds. So it's particularly a problem on the weekend. We will look at making the app more resilient to this (made a start with the patch above, but catlee is the expert here).
36 hours, which is the longest I can make it before specifying 'never'.
That should be comfortably longer than the few hours between self-serve activity. Trees are clear to reopen at this point, lowering severity.
Severity: critical → normal
Priority: P1 → P2
Comment on attachment 775402 [details] [diff] [review]
[buildapi] Use a reconecting class for receiving requests

Pretty sure this won't work, and I've filed bug 894124 for us to make the app manage the connection to the message server more robustly.
Resolving fixed as self-serve is operational, see bug 894124 for the followup.
Status: REOPENED → RESOLVED
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Component: Tools → General