Closed Bug 1287404 Opened 8 years ago Closed 8 years ago

publish-job-action tasks failing with 'IOError: Socket closed' connecting to pulse.mozilla.org

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jmaher, Assigned: emorley)

Details

In a daily role of a perf sheriff, it is very common to: * backfill a job * add new test jobs * retrigger all talos jobs In fact, all of these options don't seem to be working. They were working last week. One quirk is we now have the autoland branch, we should ensure this isn't causing us troubles.
I landed a fixed for a regression in bug 1286839. I will investigate if I caused another one.
Summary: sheriff actions on treeherder don't seem to be working → pulse_actions - sheriff actions on treeherder don't seem to be working
Heroku dynos cycle every 24 hours. Somehow, after Friday's cycling the listeners are not consuming any new messages (there's 60+ pending messages). Jul 15 16:00:09 pulse-actions app/worker2.1: root INFO: Created Treeherder 'Sch' job. Jul 15 16:00:09 pulse-actions app/worker2.1: root INFO: #### End of request ####. Jul 16 09:54:37 pulse-actions heroku/worker2.1: Cycling Jul 16 09:54:37 pulse-actions heroku/worker2.1: State changed from up to starting Jul 16 09:54:41 pulse-actions heroku/worker2.1: Stopping all processes with SIGTERM Jul 16 09:54:42 pulse-actions heroku/worker2.1: Starting process with command `python pulse_actions/worker.py --config-file configs/worker2.json --memory-saving` Jul 16 09:54:42 pulse-actions heroku/worker2.1: State changed from starting to up Jul 16 09:54:43 pulse-actions heroku/worker2.1: Process exited with status 143 Jul 16 09:54:45 pulse-actions app/worker2.1: root INFO: Console output logs INFO level messages. Jul 16 09:54:45 pulse-actions app/worker2.1: replay.replay INFO: Listening to (exchange/treeherder/v1/resultset-runnable-job-actions, #) Jul 16 09:54:45 pulse-actions app/worker2.1: replay.replay INFO: Listening to (exchange/treeherder/v1/resultset-actions, #.#) Jul 16 09:54:45 pulse-actions app/worker2.1: replay.replay INFO: Listening to (exchange/treeherder/v1/job-actions, buildbot.#.backfill) Jul 17 10:12:43 pulse-actions heroku/worker2.1: Cycling Jul 17 10:12:43 pulse-actions heroku/worker2.1: State changed from up to starting Jul 17 10:12:48 pulse-actions heroku/worker2.1: Starting process with command `python pulse_actions/worker.py --config-file configs/worker2.json --memory-saving` Jul 17 10:12:48 pulse-actions heroku/worker2.1: Stopping all processes with SIGTERM Jul 17 10:12:49 pulse-actions heroku/worker2.1: Process exited with status 143 Jul 17 10:12:49 pulse-actions heroku/worker2.1: State changed from starting to up Jul 17 10:12:51 pulse-actions app/worker2.1: root INFO: Console output logs INFO level messages. Jul 17 10:12:51 pulse-actions app/worker2.1: replay.replay INFO: Listening to (exchange/treeherder/v1/resultset-actions, #.#) Jul 17 10:12:51 pulse-actions app/worker2.1: replay.replay INFO: Listening to (exchange/treeherder/v1/resultset-runnable-job-actions, #) Jul 17 10:12:51 pulse-actions app/worker2.1: replay.replay INFO: Listening to (exchange/treeherder/v1/job-actions, buildbot.#.backfill)
I can't receive any messages from Treeherder. There's a large number of errors in new relic in the last 3 days. Jul 15, '16 10:54 am Jul 18, '16 9:45 am log-parser celery.exceptions:Retry Retry in 10s: timeout('timed out',) 7,556 Jul 15, '16 10:29 pm Jul 18, '16 9:43 am publish-job-action exceptions:IOError Socket closed 3,563
Summary: pulse_actions - sheriff actions on treeherder don't seem to be working → sheriff actions on treeherder don't seem to be working
Moving to Treeherder.
Component: General → Treeherder
Product: Testing → Tree Management
Version: 49 Branch → ---
Assignee: nobody → emorley
In the publish-job-action task, pulse_publisher.py:126 calls into Kombu which uses py-amqp, which is then giving: ... File "/data/www/treeherder.mozilla.org/venv/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 116, in establish_connection conn = self.Connection(**opts) File "/data/www/treeherder.mozilla.org/venv/lib/python2.7/site-packages/amqp/connection.py", line 180, in __init__ (10, 30), # tune File "/data/www/treeherder.mozilla.org/venv/lib/python2.7/site-packages/amqp/abstract_channel.py", line 67, in wait self.channel_id, allowed_methods, timeout) File "/data/www/treeherder.mozilla.org/venv/lib/python2.7/site-packages/amqp/connection.py", line 241, in _wait_method channel, method_sig, args, content = read_timeout(timeout) File "/data/www/treeherder.mozilla.org/venv/lib/python2.7/site-packages/amqp/connection.py", line 330, in read_timeout return self.method_reader.read_method() File "/data/www/treeherder.mozilla.org/venv/lib/python2.7/site-packages/amqp/method_framing.py", line 189, in read_method raise m IOError: Socket closed These errors started around 0300 Saturday UTC+1 (1900 PT). Nothing has changed code-wise on Treeherder prod, in the timeframe where errors have been occurring. The connection to pulse is fine: [emorley@treeherder-rabbitmq2.private.scl3 ~]$ nc -vz pulse.mozilla.org 5671 Connection to pulse.mozilla.org 5671 port [tcp/amqps] succeeded! Publish to pulse is using the user `treeherder` on pulse.m.o; Cameron/Mark did the credentials for this change at all recently?
Flags: needinfo?(mcote)
Flags: needinfo?(cdawson)
The timeline is shortly after bug 1286614 comment 4. Did the 'treeherder' user get edit after that?
Edited even, typing fail
Hm I did nothing except change the owner. I can reset the password, although so can camd.
Flags: needinfo?(mcote)
Component: Treeherder → Treeherder: Infrastructure
Priority: -- → P1
Summary: sheriff actions on treeherder don't seem to be working → publish-job-action tasks failing with 'IOError: Socket closed' connecting to pulse.mozilla.org
On https://pulseguardian.mozilla.org/all_pulse_users I only see the following Treeherder users: treeherder-prod cdawson[at]mozilla.com treeherder-stage cdawson[at]mozilla.com treeherder-staging cdawson[at]mozilla.com treeherder-test cdawson[at]mozilla.com I'm guessing Cameron delete the 'treeherder' user, not realising it was still used. Since the user was deleted rather than the password rotated, I was able to create a user with the same name under my own Pulse Guardian account, and have set the password to what was defined in the Treeherder prod PULSE_URI env variable. Restarting the rabbitmq celery workers and then retriggering a job shows it's now working: [emorley@treeherder-rabbitmq2.private.scl3 ~]$ grep 'publish-job-action' /var/log/celery/celery_worker_hp.log | grep 'succeeded' [2016-07-19 01:52:31,412: INFO/MainProcess] Task publish-job-action[cddf9c39-91b4-44ef-b24c-c3d329674cca] succeeded in 0.108628049493s: None Cam - let's definitely clean up/combine the Pulse users at some point (likely after moving to Heroku), but let's file a bug and document changes so any breakage is easier to debug :-)
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
This is now the third time we've been trying to debug apparent connectivity issues, when it's really an authentication issue (see also bug 1281810 bug 1261208). Kombu/py-aqmp should really differentiate this case. I've filed: https://github.com/celery/kombu/issues/611
I can verify this as fixed. Thanks Ed!
Dang, I apologize for doing that. I couldn't see that it was in use, but clearly didn't do all the due diligence I should have. It would be great if PulseGuardian would also show exchanges for a user. Sigh... did I just sign up for more work on PulseGuardian? :)
Flags: needinfo?(cdawson)
If you load queues [1] You can see each user you have, which queues they have and what exchanges/topics it consumes. [1] https://pulseguardian.mozilla.org/queues
The problem is that the 'treeherder' user doesn't have any queues, it just publishes to an exchange. However Pulse guardian doesn't list exchanges. I've found bug 1079523 which is about adding such a feature.
Good point!
(In reply to Ed Morley [:emorley] from comment #12) > This is now the third time we've been trying to debug apparent connectivity > issues, when it's really an authentication issue (see also bug 1281810 bug > 1261208). > > Kombu/py-aqmp should really differentiate this case. I've filed: > https://github.com/celery/kombu/issues/611 It turns out that the AMQP spec says to close the socket immediately. However rabbitmq-server extends the AMQP protocol slightly (http://www.rabbitmq.com/auth-notification.html) - and the real issue is that py-amqp (plus librabbitmq fwiw) doesn't yet support the `authentication_failure_close` capability required to make use of it. Follow along in: https://github.com/celery/py-amqp/issues/106 https://github.com/celery/librabbitmq/issues/89 https://github.com/celery/kombu/issues/611
(In reply to Ed Morley [:emorley] from comment #18) > It turns out that the AMQP spec says to close the socket immediately. > However rabbitmq-server extends the AMQP protocol slightly > (http://www.rabbitmq.com/auth-notification.html) - and the real issue is > that py-amqp (plus librabbitmq fwiw) doesn't yet support the > `authentication_failure_close` capability required to make use of it. > > Follow along in: > https://github.com/celery/py-amqp/issues/106 > https://github.com/celery/librabbitmq/issues/89 > https://github.com/celery/kombu/issues/611 This is now fixed in py-amqp 2.1.1.
You need to log in before you can comment on or make changes to this bug.