publish-job-action tasks failing with 'IOError: Socket closed' connecting to pulse.mozilla.org

Status: RESOLVED FIXED
Priority: P1
Severity: normal
Opened: 2 years ago
Last updated: 2 years ago
(Reporter: jmaher, Assigned: emorley)

(Reporter)

Description

2 years ago
As part of the daily perf sheriff role, it is very common to:
* backfill a job
* add new test jobs
* retrigger all talos jobs

None of these options seems to be working at the moment, although they were working last week. One quirk is that we now have the autoland branch; we should make sure that isn't causing us trouble.

Comment 1

2 years ago
I landed a fix for a regression in bug 1286839.

I will investigate if I caused another one.
Summary: sheriff actions on treeherder don't seem to be working → pulse_actions - sheriff actions on treeherder don't seem to be working

Comment 2

2 years ago
Heroku dynos cycle every 24 hours.
Somehow, after Friday's cycling, the listeners are not consuming any new messages (there are 60+ pending messages).

Jul 15 16:00:09 pulse-actions app/worker2.1: root	 INFO:	 Created Treeherder 'Sch' job. 
Jul 15 16:00:09 pulse-actions app/worker2.1: root	 INFO:	 #### End of request ####. 
Jul 16 09:54:37 pulse-actions heroku/worker2.1: Cycling 
Jul 16 09:54:37 pulse-actions heroku/worker2.1: State changed from up to starting 
Jul 16 09:54:41 pulse-actions heroku/worker2.1: Stopping all processes with SIGTERM 
Jul 16 09:54:42 pulse-actions heroku/worker2.1: Starting process with command `python pulse_actions/worker.py --config-file configs/worker2.json --memory-saving` 
Jul 16 09:54:42 pulse-actions heroku/worker2.1: State changed from starting to up 
Jul 16 09:54:43 pulse-actions heroku/worker2.1: Process exited with status 143 
Jul 16 09:54:45 pulse-actions app/worker2.1: root	 INFO:	 Console output logs INFO level messages. 
Jul 16 09:54:45 pulse-actions app/worker2.1: replay.replay	 INFO:	 Listening to (exchange/treeherder/v1/resultset-runnable-job-actions, #) 
Jul 16 09:54:45 pulse-actions app/worker2.1: replay.replay	 INFO:	 Listening to (exchange/treeherder/v1/resultset-actions, #.#) 
Jul 16 09:54:45 pulse-actions app/worker2.1: replay.replay	 INFO:	 Listening to (exchange/treeherder/v1/job-actions, buildbot.#.backfill) 
Jul 17 10:12:43 pulse-actions heroku/worker2.1: Cycling 
Jul 17 10:12:43 pulse-actions heroku/worker2.1: State changed from up to starting 
Jul 17 10:12:48 pulse-actions heroku/worker2.1: Starting process with command `python pulse_actions/worker.py --config-file configs/worker2.json --memory-saving` 
Jul 17 10:12:48 pulse-actions heroku/worker2.1: Stopping all processes with SIGTERM 
Jul 17 10:12:49 pulse-actions heroku/worker2.1: Process exited with status 143 
Jul 17 10:12:49 pulse-actions heroku/worker2.1: State changed from starting to up 
Jul 17 10:12:51 pulse-actions app/worker2.1: root	 INFO:	 Console output logs INFO level messages. 
Jul 17 10:12:51 pulse-actions app/worker2.1: replay.replay	 INFO:	 Listening to (exchange/treeherder/v1/resultset-actions, #.#) 
Jul 17 10:12:51 pulse-actions app/worker2.1: replay.replay	 INFO:	 Listening to (exchange/treeherder/v1/resultset-runnable-job-actions, #) 
Jul 17 10:12:51 pulse-actions app/worker2.1: replay.replay	 INFO:	 Listening to (exchange/treeherder/v1/job-actions, buildbot.#.backfill)
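For reference, a minimal sketch of what those "Listening to (exchange, topic)" lines correspond to on the Kombu side. This is not the actual pulse_actions/replay code; the queue name and credentials are illustrative.

from kombu import Connection, Exchange, Queue

# Hedged sketch (assumed shape, not the real pulse_actions/replay code):
# bind a durable queue to a Pulse exchange with a topic routing key and
# consume messages from it.
exchange = Exchange('exchange/treeherder/v1/job-actions',
                    type='topic', passive=True)
queue = Queue('queue/<pulse-user>/job-actions',   # illustrative queue name
              exchange=exchange,
              routing_key='buildbot.#.backfill',
              durable=True)

def on_message(body, message):
    print('received', body)
    message.ack()

with Connection('amqp://<pulse-user>:<password>@pulse.mozilla.org:5671//',
                ssl=True) as conn:
    with conn.Consumer(queues=[queue], callbacks=[on_message]):
        while True:
            conn.drain_events()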

Comment 3

2 years ago
I can't receive any messages from Treeherder.
There are a large number of errors in New Relic over the last 3 days.

First seen             Last seen             Task / exception                         Message                               Count
Jul 15, '16 10:54 am   Jul 18, '16 9:45 am   log-parser  celery.exceptions:Retry     Retry in 10s: timeout('timed out',)   7,556
Jul 15, '16 10:29 pm   Jul 18, '16 9:43 am   publish-job-action  exceptions:IOError  Socket closed                         3,563

Updated

2 years ago
Summary: pulse_actions - sheriff actions on treeherder don't seem to be working → sheriff actions on treeherder don't seem to be working

Comment 4

2 years ago
Moving to Treeherder.
Component: General → Treeherder
Product: Testing → Tree Management
Version: 49 Branch → ---

Updated

2 years ago
Assignee: nobody → emorley
(Assignee)

Comment 5

2 years ago
In the publish-job-action task, pulse_publisher.py:126 calls into Kombu (which uses py-amqp), which then raises:

  ...
  File "/data/www/treeherder.mozilla.org/venv/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 116, in establish_connection
    conn = self.Connection(**opts)
  File "/data/www/treeherder.mozilla.org/venv/lib/python2.7/site-packages/amqp/connection.py", line 180, in __init__
    (10, 30),  # tune
  File "/data/www/treeherder.mozilla.org/venv/lib/python2.7/site-packages/amqp/abstract_channel.py", line 67, in wait
    self.channel_id, allowed_methods, timeout)
  File "/data/www/treeherder.mozilla.org/venv/lib/python2.7/site-packages/amqp/connection.py", line 241, in _wait_method
    channel, method_sig, args, content = read_timeout(timeout)
  File "/data/www/treeherder.mozilla.org/venv/lib/python2.7/site-packages/amqp/connection.py", line 330, in read_timeout
    return self.method_reader.read_method()
  File "/data/www/treeherder.mozilla.org/venv/lib/python2.7/site-packages/amqp/method_framing.py", line 189, in read_method
    raise m
IOError: Socket closed
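For context, the failing call is roughly equivalent to establishing a Kombu connection and publishing a message. A hedged sketch, not the actual pulse_publisher.py code; the exchange name, payload and routing key are illustrative:

from kombu import Connection, Exchange, Producer

# Hedged sketch of the publish path (assumed shape, not the real
# pulse_publisher.py): establishing the connection is where the
# IOError above is raised.
with Connection('amqp://treeherder:<password>@pulse.mozilla.org:5671//',
                ssl=True) as conn:
    producer = Producer(conn.channel(),
                        exchange=Exchange('exchange/treeherder/v1/job-actions',
                                          type='topic'),
                        serializer='json')
    producer.publish({'action': 'retrigger', 'job_id': 12345},
                     routing_key='buildbot.autoland.retrigger')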

These errors started around 0300 Saturday UTC+1 (1900 PT).

Nothing has changed code-wise on Treeherder prod in the timeframe where the errors have been occurring.

The connection to pulse is fine:

[emorley@treeherder-rabbitmq2.private.scl3 ~]$ nc -vz pulse.mozilla.org 5671
Connection to pulse.mozilla.org 5671 port [tcp/amqps] succeeded!
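(nc only proves the TCP port is reachable. A check that also exercises the AMQP login - the part that, in hindsight, was actually failing - would be something like the hedged sketch below, with illustrative credentials.)

from kombu import Connection

# Hedged sketch: attempt a full AMQP handshake rather than just a TCP
# connect. At the time, a bad login surfaced as 'IOError: Socket closed'
# rather than an explicit authentication error.
conn = Connection('amqp://treeherder:<password>@pulse.mozilla.org:5671//',
                  ssl=True)
try:
    conn.ensure_connection(max_retries=1)
    print('AMQP login OK')
except Exception as exc:
    print('AMQP login failed:', exc)
finally:
    conn.release()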

Publishing to Pulse uses the user `treeherder` on pulse.m.o. Cameron/Mark: did the credentials for this user change at all recently?
Flags: needinfo?(mcote)
Flags: needinfo?(cdawson)
(Assignee)

Comment 6

2 years ago
The timeline is shortly after bug 1286614 comment 4. Did the 'treeherder' user get edit after that?
(Assignee)

Comment 7

2 years ago
Edited even, typing fail

Comment 8

2 years ago
Hm I did nothing except change the owner.  I can reset the password, although so can camd.
Flags: needinfo?(mcote)
Comment hidden (offtopic)
Comment hidden (offtopic)
(Assignee)

Updated

2 years ago
Component: Treeherder → Treeherder: Infrastructure
Priority: -- → P1
Summary: sheriff actions on treeherder don't seem to be working → publish-job-action tasks failing with 'IOError: Socket closed' connecting to pulse.mozilla.org
(Assignee)

Comment 11

2 years ago
On https://pulseguardian.mozilla.org/all_pulse_users I only see the following Treeherder users:

treeherder-prod 	cdawson[at]mozilla.com
treeherder-stage 	cdawson[at]mozilla.com
treeherder-staging 	cdawson[at]mozilla.com
treeherder-test 	cdawson[at]mozilla.com

I'm guessing Cameron deleted the 'treeherder' user, not realising it was still in use.

Since the user was deleted rather than the password rotated, I was able to create a user with the same name under my own Pulse Guardian account, and have set the password to what was defined in the Treeherder prod PULSE_URI env variable. Restarting the rabbitmq celery workers and then retriggering a job shows it's now working:

[emorley@treeherder-rabbitmq2.private.scl3 ~]$ grep 'publish-job-action' /var/log/celery/celery_worker_hp.log  | grep 'succeeded'
[2016-07-19 01:52:31,412: INFO/MainProcess] Task publish-job-action[cddf9c39-91b4-44ef-b24c-c3d329674cca] succeeded in 0.108628049493s: None
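(For reference, a hedged sketch of how a publisher typically picks up its broker credentials from a PULSE_URI-style environment variable; the actual Treeherder settings wiring may differ. Recreating the user with the same name and password means this existing configuration works again without a deploy.)

import os
from kombu import Connection

# Hedged sketch (assumed shape, not the actual Treeherder settings code):
# the publisher connection is built from the PULSE_URI env variable, so a
# user recreated with the same name/password transparently restores it.
pulse_uri = os.environ['PULSE_URI']  # e.g. amqp://treeherder:<password>@pulse.mozilla.org:5671//
connection = Connection(pulse_uri, ssl=True)
connection.ensure_connection(max_retries=3)
connection.release()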

Cam - let's definitely clean up/combine the Pulse users at some point (likely after moving to Heroku), but let's file a bug and document changes so any breakage is easier to debug :-)
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED
(Assignee)

Comment 12

2 years ago
This is now the third time we've had to debug an apparent connectivity issue that was really an authentication issue (see also bug 1281810 and bug 1261208).

Kombu/py-amqp should really differentiate this case. I've filed:
https://github.com/celery/kombu/issues/611

Comment 13

2 years ago
I can verify this as fixed.

Thanks Ed!
Dang, I apologize for doing that.  I couldn't see that it was in use, but clearly didn't do all the due diligence I should have.  It would be great if PulseGuardian would also show exchanges for a user.  Sigh... did I just sign up for more work on PulseGuardian?  :)
Flags: needinfo?(cdawson)

Comment 15

2 years ago
If you load the queues page [1], you can see each user you have, which queues they have, and which exchanges/topics each queue consumes from.

[1] https://pulseguardian.mozilla.org/queues
(Assignee)

Comment 16

2 years ago
The problem is that the 'treeherder' user doesn't have any queues; it just publishes to an exchange.

However, PulseGuardian doesn't list exchanges.
I've found bug 1079523, which is about adding such a feature.

Comment 17

2 years ago
Good point!
(Assignee)

Comment 18

2 years ago
(In reply to Ed Morley [:emorley] from comment #12)
> This is now the third time we've been trying to debug apparent connectivity
> issues, when it's really an authentication issue (see also bug 1281810 bug
> 1261208).
> 
> Kombu/py-amqp should really differentiate this case. I've filed:
> https://github.com/celery/kombu/issues/611

It turns out that the AMQP spec says to close the socket immediately on an authentication failure. However, rabbitmq-server extends the AMQP protocol slightly (http://www.rabbitmq.com/auth-notification.html) - and the real issue is that py-amqp (plus librabbitmq, fwiw) doesn't yet support the `authentication_failure_close` capability required to make use of it.

Follow along in:
https://github.com/celery/py-amqp/issues/106
https://github.com/celery/librabbitmq/issues/89
https://github.com/celery/kombu/issues/611
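As an illustration of what the mechanism looks like from the client side (a hedged sketch only - the actual fix lives in the py-amqp issue above, and py-amqp 2.1.1+ advertises the capability itself): the client lists 'authentication_failure_close' in its client_properties capabilities, and the broker then replies with an explicit connection.close (403 ACCESS_REFUSED) instead of silently dropping the socket.

import amqp

# Hedged sketch: advertise the RabbitMQ 'authentication_failure_close'
# capability so a bad login yields an explicit AccessRefused error rather
# than 'IOError: Socket closed'. Credentials are illustrative.
conn = amqp.Connection(
    host='pulse.mozilla.org:5671',
    userid='treeherder',
    password='<wrong-password>',
    ssl=True,
    client_properties={'capabilities': {'authentication_failure_close': True}},
)
try:
    conn.connect()
except amqp.exceptions.AccessRefused as exc:
    print('Authentication failed:', exc)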
(Assignee)

Comment 19

2 years ago
(In reply to Ed Morley [:emorley] from comment #18)
> It turns out that the AMQP spec says to close the socket immediately.
> However rabbitmq-server extends the AMQP protocol slightly
> (http://www.rabbitmq.com/auth-notification.html) - and the real issue is
> that py-amqp (plus librabbitmq fwiw) doesn't yet support the
> `authentication_failure_close` capability required to make use of it.
> 
> Follow along in:
> https://github.com/celery/py-amqp/issues/106
> https://github.com/celery/librabbitmq/issues/89
> https://github.com/celery/kombu/issues/611

This is now fixed in py-amqp 2.1.1.