publish-job-action tasks failing with 'IOError: Socket closed' connecting to pulse.mozilla.org

Status: RESOLVED FIXED
Priority: P1
Severity: normal
Opened: 2 years ago
Last updated: 2 years ago
(Reporter: jmaher, Assigned: emorley)

(Reporter)

Description

2 years ago
As part of the daily perf sheriff role, it is very common to:
* backfill a job
* add new test jobs
* retrigger all talos jobs

None of these options seems to be working at the moment, although they were working last week. One quirk is that we now have the autoland branch; we should make sure that isn't causing us trouble.

Comment 1

2 years ago
I landed a fix for a regression in bug 1286839.

I will investigate if I caused another one.
Summary: sheriff actions on treeherder don't seem to be working → pulse_actions - sheriff actions on treeherder don't seem to be working

Comment 2

2 years ago
Heroku dynos cycle every 24 hours.
Somehow, after Friday's cycling, the listeners are not consuming any new messages (there are 60+ pending messages).

Jul 15 16:00:09 pulse-actions app/worker2.1: root	 INFO:	 Created Treeherder 'Sch' job. 
Jul 15 16:00:09 pulse-actions app/worker2.1: root	 INFO:	 #### End of request ####. 
Jul 16 09:54:37 pulse-actions heroku/worker2.1: Cycling 
Jul 16 09:54:37 pulse-actions heroku/worker2.1: State changed from up to starting 
Jul 16 09:54:41 pulse-actions heroku/worker2.1: Stopping all processes with SIGTERM 
Jul 16 09:54:42 pulse-actions heroku/worker2.1: Starting process with command `python pulse_actions/worker.py --config-file configs/worker2.json --memory-saving` 
Jul 16 09:54:42 pulse-actions heroku/worker2.1: State changed from starting to up 
Jul 16 09:54:43 pulse-actions heroku/worker2.1: Process exited with status 143 
Jul 16 09:54:45 pulse-actions app/worker2.1: root	 INFO:	 Console output logs INFO level messages. 
Jul 16 09:54:45 pulse-actions app/worker2.1: replay.replay	 INFO:	 Listening to (exchange/treeherder/v1/resultset-runnable-job-actions, #) 
Jul 16 09:54:45 pulse-actions app/worker2.1: replay.replay	 INFO:	 Listening to (exchange/treeherder/v1/resultset-actions, #.#) 
Jul 16 09:54:45 pulse-actions app/worker2.1: replay.replay	 INFO:	 Listening to (exchange/treeherder/v1/job-actions, buildbot.#.backfill) 
Jul 17 10:12:43 pulse-actions heroku/worker2.1: Cycling 
Jul 17 10:12:43 pulse-actions heroku/worker2.1: State changed from up to starting 
Jul 17 10:12:48 pulse-actions heroku/worker2.1: Starting process with command `python pulse_actions/worker.py --config-file configs/worker2.json --memory-saving` 
Jul 17 10:12:48 pulse-actions heroku/worker2.1: Stopping all processes with SIGTERM 
Jul 17 10:12:49 pulse-actions heroku/worker2.1: Process exited with status 143 
Jul 17 10:12:49 pulse-actions heroku/worker2.1: State changed from starting to up 
Jul 17 10:12:51 pulse-actions app/worker2.1: root	 INFO:	 Console output logs INFO level messages. 
Jul 17 10:12:51 pulse-actions app/worker2.1: replay.replay	 INFO:	 Listening to (exchange/treeherder/v1/resultset-actions, #.#) 
Jul 17 10:12:51 pulse-actions app/worker2.1: replay.replay	 INFO:	 Listening to (exchange/treeherder/v1/resultset-runnable-job-actions, #) 
Jul 17 10:12:51 pulse-actions app/worker2.1: replay.replay	 INFO:	 Listening to (exchange/treeherder/v1/job-actions, buildbot.#.backfill)
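For reference, a minimal sketch of what those "Listening to (exchange, topic)" lines correspond to on the Kombu side. This is not the actual pulse_actions/replay code; the queue name and credentials are illustrative.

from kombu import Connection, Exchange, Queue

# Hedged sketch (assumed shape, not the real pulse_actions/replay code):
# bind a durable queue to a Pulse exchange with a topic routing key and
# consume messages from it.
exchange = Exchange('exchange/treeherder/v1/job-actions',
                    type='topic', passive=True)
queue = Queue('queue/<pulse-user>/job-actions',   # illustrative queue name
              exchange=exchange,
              routing_key='buildbot.#.backfill',
              durable=True)

def on_message(body, message):
    print('received', body)
    message.ack()

with Connection('amqp://<pulse-user>:<password>@pulse.mozilla.org:5671//',
                ssl=True) as conn:
    with conn.Consumer(queues=[queue], callbacks=[on_message]):
        while True:
            conn.drain_events()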

Comment 3

2 years ago
I can't receive any messages from Treeherder.
There are a large number of errors in New Relic over the last 3 days.

First seen             Last seen             Task / exception                         Message                               Count
Jul 15, '16 10:54 am   Jul 18, '16 9:45 am   log-parser  celery.exceptions:Retry     Retry in 10s: timeout('timed out',)   7,556
Jul 15, '16 10:29 pm   Jul 18, '16 9:43 am   publish-job-action  exceptions:IOError  Socket closed                         3,563

Updated

2 years ago
Summary: pulse_actions - sheriff actions on treeherder don't seem to be working → sheriff actions on treeherder don't seem to be working

Comment 4

2 years ago
Moving to Treeherder.
Component: General → Treeherder
Product: Testing → Tree Management
Version: 49 Branch → ---

Updated

2 years ago
Assignee: nobody → emorley
(Assignee)

Comment 5

2 years ago
In the publish-job-action task, pulse_publisher.py:126 calls into Kombu (which uses py-amqp), which then raises:

  ...
  File "/data/www/treeherder.mozilla.org/venv/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 116, in establish_connection
    conn = self.Connection(**opts)
  File "/data/www/treeherder.mozilla.org/venv/lib/python2.7/site-packages/amqp/connection.py", line 180, in __init__
    (10, 30),  # tune
  File "/data/www/treeherder.mozilla.org/venv/lib/python2.7/site-packages/amqp/abstract_channel.py", line 67, in wait
    self.channel_id, allowed_methods, timeout)
  File "/data/www/treeherder.mozilla.org/venv/lib/python2.7/site-packages/amqp/connection.py", line 241, in _wait_method
    channel, method_sig, args, content = read_timeout(timeout)
  File "/data/www/treeherder.mozilla.org/venv/lib/python2.7/site-packages/amqp/connection.py", line 330, in read_timeout
    return self.method_reader.read_method()
  File "/data/www/treeherder.mozilla.org/venv/lib/python2.7/site-packages/amqp/method_framing.py", line 189, in read_method
    raise m
IOError: Socket closed
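For context, the failing call is roughly equivalent to establishing a Kombu connection and publishing a message. A hedged sketch, not the actual pulse_publisher.py code; the exchange name, payload and routing key are illustrative:

from kombu import Connection, Exchange, Producer

# Hedged sketch of the publish path (assumed shape, not the real
# pulse_publisher.py): establishing the connection is where the
# IOError above is raised.
with Connection('amqp://treeherder:<password>@pulse.mozilla.org:5671//',
                ssl=True) as conn:
    producer = Producer(conn.channel(),
                        exchange=Exchange('exchange/treeherder/v1/job-actions',
                                          type='topic'),
                        serializer='json')
    producer.publish({'action': 'retrigger', 'job_id': 12345},
                     routing_key='buildbot.autoland.retrigger')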

These errors started around 0300 Saturday UTC+1 (1900 PT).

Nothing has changed code-wise on Treeherder prod in the timeframe where the errors have been occurring.

The connection to pulse is fine:

[emorley@treeherder-rabbitmq2.private.scl3 ~]$ nc -vz pulse.mozilla.org 5671
Connection to pulse.mozilla.org 5671 port [tcp/amqps] succeeded!
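(nc only proves the TCP port is reachable. A check that also exercises the AMQP login - the part that, in hindsight, was actually failing - would be something like the hedged sketch below, with illustrative credentials.)

from kombu import Connection

# Hedged sketch: attempt a full AMQP handshake rather than just a TCP
# connect. At the time, a bad login surfaced as 'IOError: Socket closed'
# rather than an explicit authentication error.
conn = Connection('amqp://treeherder:<password>@pulse.mozilla.org:5671//',
                  ssl=True)
try:
    conn.ensure_connection(max_retries=1)
    print('AMQP login OK')
except Exception as exc:
    print('AMQP login failed:', exc)
finally:
    conn.release()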

Publishing to Pulse uses the user `treeherder` on pulse.m.o. Cameron/Mark: did the credentials for this user change at all recently?
Flags: needinfo?(mcote)
Flags: needinfo?(cdawson)
(Assignee)

Comment 6

2 years ago
The timeline is shortly after bug 1286614 comment 4. Did the 'treeherder' user get edit after that?
(Assignee)

Comment 7

2 years ago
Edited even, typing fail

Comment 8

2 years ago
Hm I did nothing except change the owner.  I can reset the password, although so can camd.
Flags: needinfo?(mcote)
Comment hidden (offtopic)
Comment hidden (offtopic)
(Assignee)

Updated

2 years ago
Component: Treeherder → Treeherder: Infrastructure
Priority: -- → P1
Summary: sheriff actions on treeherder don't seem to be working → publish-job-action tasks failing with 'IOError: Socket closed' connecting to pulse.mozilla.org
(Assignee)

Comment 11

2 years ago
On https://pulseguardian.mozilla.org/all_pulse_users I only see the following Treeherder users:

treeherder-prod 	cdawson[at]mozilla.com
treeherder-stage 	cdawson[at]mozilla.com
treeherder-staging 	cdawson[at]mozilla.com
treeherder-test 	cdawson[at]mozilla.com

I'm guessing Cameron deleted the 'treeherder' user, not realising it was still in use.

Since the user was deleted rather than the password rotated, I was able to create a user with the same name under my own Pulse Guardian account, and have set the password to what was defined in the Treeherder prod PULSE_URI env variable. Restarting the rabbitmq celery workers and then retriggering a job shows it's now working:

[emorley@treeherder-rabbitmq2.private.scl3 ~]$ grep 'publish-job-action' /var/log/celery/celery_worker_hp.log  | grep 'succeeded'
[2016-07-19 01:52:31,412: INFO/MainProcess] Task publish-job-action[cddf9c39-91b4-44ef-b24c-c3d329674cca] succeeded in 0.108628049493s: None
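(For reference, a hedged sketch of how a publisher typically picks up its broker credentials from a PULSE_URI-style environment variable; the actual Treeherder settings wiring may differ. Recreating the user with the same name and password means this existing configuration works again without a deploy.)

import os
from kombu import Connection

# Hedged sketch (assumed shape, not the actual Treeherder settings code):
# the publisher connection is built from the PULSE_URI env variable, so a
# user recreated with the same name/password transparently restores it.
pulse_uri = os.environ['PULSE_URI']  # e.g. amqp://treeherder:<password>@pulse.mozilla.org:5671//
connection = Connection(pulse_uri, ssl=True)
connection.ensure_connection(max_retries=3)
connection.release()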

Cam - let's definitely clean up/combine the Pulse users at some point (likely after moving to Heroku), but let's file a bug and document changes so any breakage is easier to debug :-)
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED
(Assignee)

Comment 12

2 years ago
This is now the third time we've had to debug an apparent connectivity issue that was really an authentication issue (see also bug 1281810 and bug 1261208).

Kombu/py-amqp should really differentiate this case. I've filed:
https://github.com/celery/kombu/issues/611

Comment 13

2 years ago
I can verify this as fixed.

Thanks Ed!
Dang, I apologize for doing that.  I couldn't see that it was in use, but clearly didn't do all the due diligence I should have.  It would be great if PulseGuardian would also show exchanges for a user.  Sigh... did I just sign up for more work on PulseGuardian?  :)
Flags: needinfo?(cdawson)

Comment 15

2 years ago
If you load the queues page [1], you can see each user you have, which queues they have, and which exchanges/topics each queue consumes from.

[1] https://pulseguardian.mozilla.org/queues
(Assignee)

Comment 16

2 years ago
The problem is that the 'treeherder' user doesn't have any queues; it just publishes to an exchange.

However, PulseGuardian doesn't list exchanges.
I've found bug 1079523, which is about adding such a feature.

Comment 17

2 years ago
Good point!
(Assignee)

Comment 18

2 years ago
(In reply to Ed Morley [:emorley] from comment #12)
> This is now the third time we've been trying to debug apparent connectivity
> issues, when it's really an authentication issue (see also bug 1281810 bug
> 1261208).
> 
> Kombu/py-amqp should really differentiate this case. I've filed:
> https://github.com/celery/kombu/issues/611

It turns out that the AMQP spec says to close the socket immediately on an authentication failure. However, rabbitmq-server extends the AMQP protocol slightly (http://www.rabbitmq.com/auth-notification.html) - and the real issue is that py-amqp (plus librabbitmq, fwiw) doesn't yet support the `authentication_failure_close` capability required to make use of it.

Follow along in:
https://github.com/celery/py-amqp/issues/106
https://github.com/celery/librabbitmq/issues/89
https://github.com/celery/kombu/issues/611
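As an illustration of what the mechanism looks like from the client side (a hedged sketch only - the actual fix lives in the py-amqp issue above, and py-amqp 2.1.1+ advertises the capability itself): the client lists 'authentication_failure_close' in its client_properties capabilities, and the broker then replies with an explicit connection.close (403 ACCESS_REFUSED) instead of silently dropping the socket.

import amqp

# Hedged sketch: advertise the RabbitMQ 'authentication_failure_close'
# capability so a bad login yields an explicit AccessRefused error rather
# than 'IOError: Socket closed'. Credentials are illustrative.
conn = amqp.Connection(
    host='pulse.mozilla.org:5671',
    userid='treeherder',
    password='<wrong-password>',
    ssl=True,
    client_properties={'capabilities': {'authentication_failure_close': True}},
)
try:
    conn.connect()
except amqp.exceptions.AccessRefused as exc:
    print('Authentication failed:', exc)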
(Assignee)

Comment 19

2 years ago
(In reply to Ed Morley [:emorley] from comment #18)
> It turns out that the AMQP spec says to close the socket immediately.
> However rabbitmq-server extends the AMQP protocol slightly
> (http://www.rabbitmq.com/auth-notification.html) - and the real issue is
> that py-amqp (plus librabbitmq fwiw) doesn't yet support the
> `authentication_failure_close` capability required to make use of it.
> 
> Follow along in:
> https://github.com/celery/py-amqp/issues/106
> https://github.com/celery/librabbitmq/issues/89
> https://github.com/celery/kombu/issues/611

This is now fixed in py-amqp 2.1.1.