Status

Release Engineering
General
P2
normal
RESOLVED FIXED
5 years ago
8 months ago

People

(Reporter: philor, Assigned: nthomas)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 obsolete attachment)

(Reporter)

Description

5 years ago
According to https://secure.pub.build.mozilla.org/buildapi/self-serve/jobs self-serve hasn't done anything since sometime between 14:46 and 16:13. Not quite a tree closer, but since most trees are closed for bug 893511 anyway, doesn't matter either way.
(Reporter)

Updated

5 years ago
Blocks: 889996
(Assignee)

Comment 1

5 years ago
It was waiting on a network socket, so this will be fallout from the network work in the downtime. Restarting the service processed 64 requests.
Assignee: nobody → nthomas
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Priority: -- → P1
Resolution: --- → FIXED
(Assignee)

Comment 2

5 years ago
<aki> retriggers from tbpl aren't retriggering, or aren't showing up

And the self-serve log hasn't anything since 2013-07-14 03:22:35, shortly after I restarted it and did a test rebuild.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Assignee)

Comment 3

5 years ago
I've restarted it again, lots of requests got processed. Something stopped working between the last log message at 03:22 and the next request at 04:58 PT.
(Assignee)

Comment 4

5 years ago
Created attachment 775402 [details] [diff] [review]
[buildapi] Use a reconnecting class for receiving requests

I suspect that bug 891375 or bug 891383 have changed the session timeout for connections from buildbot-master36.srv.releng.scl3 to rabbit2.build.scl1. 

This patch is a stab in the dark, based on the JobRequestConsumer creation at http://hg.mozilla.org/build/buildapi/file/default/buildapi/scripts/selfserve-agent.py#l635 not using one of the reconnecting classes in buildapi/lib/mq.py. It's the only use of JobRequestConsumer(). 

This bit of code in mq.py is a little worrying though, inheriting from object instead of Consumer like the comment says:
class ReconnectingConsumer(object):
    """Wrapper class around carrot.messaging.Consumer that handles disconnects
    from the broker in the wait() loop."""
and there may be errors when trying to do
  consumer.register_callback(self.receive)
Attachment #775402 - Flags: review?(catlee)
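For illustration only, a minimal sketch of the reconnect-on-error idea discussed above. It is not the code in buildapi/lib/mq.py and does not use the real broker library: `make_consumer` is a hypothetical factory that returns any object with `register_callback()` and `wait()`, and the retry delay and error types are assumed defaults.

```python
import time


class ReconnectingConsumer(object):
    """Hypothetical wrapper: rebuilds the consumer whenever wait() fails.

    `make_consumer` is an assumed zero-argument factory returning an object
    with register_callback(fn) and wait(); it stands in for whatever creates
    the real broker consumer.
    """

    def __init__(self, make_consumer, callback, retry_delay=5.0,
                 errors=(IOError, OSError), sleep=time.sleep):
        self.make_consumer = make_consumer
        self.callback = callback
        self.retry_delay = retry_delay
        self.errors = errors
        self.sleep = sleep

    def run(self, max_reconnects=None):
        """Consume until wait() returns normally; return the reconnect count."""
        reconnects = 0
        while True:
            consumer = self.make_consumer()
            # Callbacks have to be registered again on every fresh consumer.
            consumer.register_callback(self.callback)
            try:
                consumer.wait()
                return reconnects
            except self.errors:
                if max_reconnects is not None and reconnects >= max_reconnects:
                    raise
                reconnects += 1
                self.sleep(self.retry_delay)
```

Note that this only helps when the dropped connection surfaces as an exception. If a middlebox silently discards an idle session, wait() may simply block forever without raising, so a wrapper like this would not be enough on its own.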
(Assignee)

Comment 5

5 years ago
(In reply to Nick Thomas [:nthomas] from comment #4)
> I suspect that bug 891375 or bug 891383 have changed the session timeout for
> connections from buildbot-master36.srv.releng.scl3 to rabbit2.build.scl1. 

<casey>	could be but i can't look at the moment. can check in later.

needinfo to follow up on that.
Flags: needinfo?(cransom)
(Assignee)

Comment 6

5 years ago
Time to stop getting messages is between 5 and 33 minutes (assuming it's repeatable, not much data yet).

Comment 7

5 years ago
If the TCP session goes idle for more than 30 minutes, the firewall will drop the session and the server will no longer be able to send data to the client over that socket. I've added a longer timeout for amqp and applied a policy to use that timeout for rabbitmq in scl1.  Any current sessions will need to be re-established to pick up that timer adjustment.

Alternatively, if there's any way you can enable a keepalive in the application or TCP keepalives in the kernel, you'll avoid the session timeouts. That is a better fix than the band-aid applied above.
Flags: needinfo?(cransom)
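As a rough sketch of the kernel keepalive option mentioned above, assuming the application can get at the underlying socket object (how to do that depends on the AMQP client library and is not shown here). The function name and the timing values are illustrative assumptions, chosen only so that the first probe goes out well inside a 30 minute idle timeout.

```python
import socket


def enable_tcp_keepalive(sock, idle=600, interval=60, probes=5):
    """Turn on kernel TCP keepalives for an existing socket.

    idle:     seconds of silence before the first probe (assumed value)
    interval: seconds between probes (assumed value)
    probes:   unanswered probes before the kernel gives up (assumed value)
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The tuning options are platform-specific (available on Linux), so
    # only set the ones this Python build actually exposes.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", probes)):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock
```

With keepalives on, the periodic probes count as traffic for the idle timer, and a dead peer eventually shows up as a socket error instead of a consumer that waits forever.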
(Assignee)

Comment 8

5 years ago
I've restarted self-serve to pick up the longer timeout, thanks for that. How long is the timeout now?

Unfortunately the idle time is variable, as traffic comes from developers making requests to start/cancel builds. So it's particularly a problem on the weekend. We will look at making the app level more resilient to this (made a start with the patch above, but catlee is the expert here).

Comment 9

5 years ago
36 hours, which is the longest I can make it before specifying 'never'.
(Assignee)

Comment 10

5 years ago
That should be plenty longer than the few hours between self-serve activity. Trees are clear to reopen at this point, lowering severity.
Severity: critical → normal
Priority: P1 → P2
(Assignee)

Comment 11

5 years ago
Comment on attachment 775402 [details] [diff] [review]
[buildapi] Use a reconnecting class for receiving requests

Pretty sure this won't work, and I've filed bug 894124 for us to make the app manage the connection to the message server more robustly.
Attachment #775402 - Attachment is obsolete: true
Attachment #775402 - Flags: review?(catlee)
(Assignee)

Comment 12

5 years ago
Resolving fixed as self-serve is operational, see bug 894124 for the followup.
Status: REOPENED → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Component: Tools → General
Product: Release Engineering → Release Engineering