Closed Bug 967705 Opened 8 years ago Closed 8 years ago

selfserve agent drops requests

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: dustin)

References

Details

Attachments

(2 files)

A lot of requests to rebuild are not getting completed.  It appears that the requested rebuilds do not occur, either.

Based on the /jobs endpoint, requests for bid's 36446213 and 36446262 fall into this category.  Unfortunately, that endpoint does not list the jobrequestid.
Probable instance: https://secure.pub.build.mozilla.org/buildapi/self-serve/jobs/628840 (completed jobs are taking a couple seconds right now, and that's been 5 minutes, so I doubt it's going to suddenly get completed).
And then https://secure.pub.build.mozilla.org/buildapi/self-serve/jobs/628841 which was another rebuild_build for the same job, which was completed in seconds.
2014-02-09 14:01:38,591 Received {u'action': u'rebuild_build', u'body': {u'priority': 0, u'bid': 36793003, u'when': 1391983299, u'request_id': 628841}, u'who': u'philringnalda@gmail.com'}
2014-02-09 14:01:38,592 Loading masters from https://hg.mozilla.org/build/tools/raw-file/default/buildfarm/maintenance/production-masters.json
2014-02-09 14:01:38,796 Loading branches from https://hg.mozilla.org/build/tools/raw-file/default/buildfarm/maintenance/production-branches.json
2014-02-09 14:01:38,905 rebuilding build by philringnalda@gmail.com of 36793003
2014-02-09 14:01:38,933 INSERT INTO buildsets
        (`external_idstring`, `reason`, `sourcestampid`, `submitted_at`, `complete`, `complete_at`, `results`)
        VALUES
        (:idstring, :reason, :sourcestampid, :submitted_at, 0, NULL, NULL)
2014-02-09 14:01:38,938 Created buildset 8645846
2014-02-09 14:01:38,938 INSERT INTO buildset_properties
                (`buildsetid`, `property_name`, `property_value`)
                SELECT :buildsetid, `property_name`, `property_value` FROM
                    buildset_properties WHERE
                    buildsetid = :oldbsid
2014-02-09 14:01:38,942 Created 1 properties
2014-02-09 14:01:38,943 INSERT INTO buildrequests
                (`buildsetid`, `buildername`, `submitted_at`, `priority`, `claimed_at`, `claimed_by_name`, `claimed_by_incarnation`, `complete`, `results`, `complete_at`)
                VALUES
                (:buildsetid, :buildername, :submitted_at, 0, 0, NULL, NULL, 0, NULL, NULL)
2014-02-09 14:01:38,948 Created buildrequest 36268881
2014-02-09 14:01:38,953 Start from server, version: 8.0, properties: {u'information': 'Licensed under the MPL.  See http://www.rabbitmq.com/', u'product': 'RabbitMQ', u'copyright': 'Copyright (C) 2007-2013 GoPivotal, Inc.', u'capabilities': {}, u'platform': 'Erlang/OTP', u'version': '3.2.2'}, mechanisms: ['PLAIN', 'AMQPLAIN'], locales: ['en_US']
2014-02-09 14:01:38,997 Open OK! known_hosts []
2014-02-09 14:01:38,997 using channel_id: 1
2014-02-09 14:01:38,999 Channel open
2014-02-09 14:01:39,043 Closed channel #1
Traceback (most recent call last):
  File "/builds/selfserve-agent/bin/selfserve-agent", line 9, in <module>
    load_entry_point('buildapi==0.1.2', 'console_scripts', 'selfserve-agent')()
  File "/builds/selfserve-agent/lib/python2.7/site-packages/buildapi/scripts/selfserve_agent.py", line 645, in main
    consumer.wait()
  File "/builds/selfserve-agent/lib/python2.7/site-packages/carrot/messaging.py", line 446, in wait
    it.next()
  File "/builds/selfserve-agent/lib/python2.7/site-packages/carrot/backends/pyamqplib.py", line 300, in consume
    self.channel.wait()
  File "/builds/selfserve-agent/lib/python2.7/site-packages/amqplib/client_0_8/abstract_channel.py", line 89, in wait
    self.channel_id, allowed_methods)
  File "/builds/selfserve-agent/lib/python2.7/site-packages/amqplib/client_0_8/connection.py", line 198, in _wait_method
    self.method_reader.read_method()
  File "/builds/selfserve-agent/lib/python2.7/site-packages/amqplib/client_0_8/method_framing.py", line 215, in read_method
    raise m
IOError: Socket closed
It looks like the rebuild buildrequest was created, but the reply message about the selfserve job wasn't sent.  Does that match what you're seeing, philor?
No, I'm seeing that the rebuild does not happen.
Duplicate of this bug: 967459
Flags: needinfo?(catlee)
catlee and I have decided to deploy the changes I made earlier when refactoring buildapi to use kombu:
  http://hg.mozilla.org/build/buildapi/diff/6ee57dbd0b1e/buildapi/scripts/selfserve-agent.py

to see if that helps.
Attached patch bug967705.patchSplinter Review
Rail, as the original author of this module - does this look sufficient?
Assignee: nobody → dustin
Attachment #8375695 - Flags: review?(rail)
Attachment #8375695 - Flags: review?(rail) → review+
This doesn't appear to have worked.  I still see lots of jobs with no completion on record.
I set up an extra, temporary queue in RabbitMQ to capture all messages on this exchange, hoping to gather more data on where, exactly, things are getting stuck.
https://secure.pub.build.mozilla.org/buildapi/self-serve/jobs/637966 never completed.

on web2:
2014-02-14 09:53:45,375 INFO  [buildapi.lib.helpers] [MainThread] Fetching branches list from http://hg.mozilla.org/build/tools/raw-file/default/buildfarm/maintenance/production-branches.json
2014-02-14 09:53:45,439 INFO  [buildapi.access] [MainThread] emanuel.hoogeveen@gmail.com accessing POST /self-serve/try/build
2014-02-14 09:53:45,461 INFO  [buildapi.access] [MainThread] emanuel.hoogeveen@gmail.com rebuild_build try 37062601 0
2014-02-14 09:53:45,471 WARNI [buildapi.lib.mq] [MainThread] HERE
2014-02-14 09:53:45,480 INFO  [buildapi.lib.mq] [MainThread] allocated request id 637966
2014-02-14 09:53:45,480 INFO  [buildapi.lib.mq] [MainThread] Sending {'action': 'rebuild_build', 'body': {'priority': 0, 'bid': 37062601, 'when': 1392400425L, 'request_id': 637966L}, 'who': 'emanuel.hoogeveen@gmail.com'}
2014-02-14 09:53:45,481 INFO  [buildapi.lib.mq] [MainThread] sent request id 637966

but no such message appeared in the sampling queue.

Simple test script:

from buildapi.lib.mq import JobRequestPublisher

pub = JobRequestPublisher({
    'mq.kombu_url' : 'amqp://buildapi:pass@knuth//buildapi',
    'mq.heartbeat_interval': 5,
    'mq.exchange': 'buildapi.control',
    'mq.queue.web': 'buildapi-test-done-q',
    'mq.queue.agent': 'buildapi-test-q',
})

i = 1
while True:
    print "attempt", i
    print pub.reprioritizeRequest('dmitchell@mozilla.com (testing)', -1, i)
    print "done", i
    print "waiting"
    raw_input()
    i += 1


This works fine (no lost messages) on my dev hosts, using iptables -j DROP to cut off communication.

On web1, however, waiting >580s leads to what looks like success from the Python code, but no message appears in the queue.
I have a hunch that this comes from a conflict between the Zeus timeout, which appears to be 300s, and the rabbitmq heartbeat timeout, which defaults to 580s.  Setting the script to send every 310s results in lost messages.  Rabbitmq waits for two heartbeat times before severing a connection, so in comment 13 when I waited ">580s", that was probably less than 1160s.  So this can probably be fixed by tuning the heartbeat timeout to <(300/2)s.

Still, it's disconcerting to think that py-amqp is tossing messages into the bitbucket.
py-amqp and kombu are using basic.publish, which has no response in the AMQP protocol.

The network traffic looks something like

> 11:26:24.385933 IP web1.releng.webapp.scl3.mozilla.com.34281 > releng-rabbitmq-zlb.webapp.scl3.mozilla.com.amqp: Flags [P.], seq 1556930082:1556930123, ack 2319846932, win 123, options [nop,nop,TS val 1199434907 ecr 1268905164], length 41
> 11:26:24.386762 IP web1.releng.webapp.scl3.mozilla.com.34281 > releng-rabbitmq-zlb.webapp.scl3.mozilla.com.amqp: Flags [P.], seq 41:92, ack 1, win 123, options [nop,nop,TS val 1199434908 ecr 1268905164], length 51
> 11:26:24.386927 IP web1.releng.webapp.scl3.mozilla.com.34281 > releng-rabbitmq-zlb.webapp.scl3.mozilla.com.amqp: Flags [P.], seq 92:205, ack 1, win 123, options [nop,nop,TS val 1199434908 ecr 1268905164], length 113
> 11:26:24.387161 IP releng-rabbitmq-zlb.webapp.scl3.mozilla.com.amqp > web1.releng.webapp.scl3.mozilla.com.34281: Flags [.], ack 205, win 16, options [nop,nop,TS val 1268942286 ecr 1199434907], length 0

Zeus half-closes the connection:

> 11:31:26.429247 IP releng-rabbitmq-zlb.webapp.scl3.mozilla.com.amqp > web1.releng.webapp.scl3.mozilla.com.34281: Flags [F.], seq 1, ack 205, win 16, options [nop,nop,TS val 1269244328 ecr 1199434907], length 0
> 11:31:26.468421 IP web1.releng.webapp.scl3.mozilla.com.34281 > releng-rabbitmq-zlb.webapp.scl3.mozilla.com.amqp: Flags [.], ack 2, win 123, options [nop,nop,TS val 1199736990 ecr 1269244328], length 0

the client fires a basic.publish frame, and gets no error back immediately

> 11:38:28.844963 IP web1.releng.webapp.scl3.mozilla.com.34281 > releng-rabbitmq-zlb.webapp.scl3.mozilla.com.amqp: Flags [P.], seq 205:246, ack 2, win 123, options [nop,nop,TS val 1200159366 ecr 1269244328], length 41

although the kernel processes the RST and will provide an error on the next socket operation:

> 11:38:28.845273 IP releng-rabbitmq-zlb.webapp.scl3.mozilla.com.amqp > web1.releng.webapp.scl3.mozilla.com.34281: Flags [R], seq 2319846933, win 0, length 0

This is easy to replicate with a little Python:

# "server"
>>> from socket import *
>>> s = socket(AF_INET, SOCK_STREAM)
>>> s.bind(('', 9999))
>>> s.listen(5)
>>> ss = s.accept()[0]
>>> ss.read()
>>> ss.recv(100)
'abcd\n'
>>> ss.close()
>>>

# "client"
>>> from socket import *
>>> s = socket(AF_INET, SOCK_STREAM)
>>> s.connect(('', 9999))
>>> s.send('abcd\n')
5
# server runs 'ss.close()' here
>>> s.send('abcd\n')
5
# no error on first send!
>>> s.send('abcd\n')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
socket.error: [Errno 32] Broken pipe
>>>
Attached file bug967705-p2.patch
This fixed the 310s issue, anyway.

I will also dial the heartbeat down to 140s in the config.
Attachment #8376457 - Flags: review?(catlee)
Comment on attachment 8376457 [details]
bug967705-p2.patch

r=sure
Attachment #8376457 - Flags: review?(catlee) → review+
Thanks - deployed.  Let's see what happens.
That seems a remarkable improvement - I haven't seen any forgotten requests since this landed.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Flags: needinfo?(catlee)
You need to log in before you can comment on or make changes to this bug.