Closed
Bug 1287404
Opened 8 years ago
Closed 8 years ago
publish-job-action tasks failing with 'IOError: Socket closed' connecting to pulse.mozilla.org
Categories
(Tree Management :: Treeherder: Infrastructure, defect, P1)
Tree Management
Treeherder: Infrastructure
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: jmaher, Assigned: emorley)
Details
In a daily role of a perf sheriff, it is very common to:
* backfill a job
* add new test jobs
* retrigger all talos jobs
In fact, all of these options don't seem to be working. They were working last week. One quirk is we now have the autoland branch, we should ensure this isn't causing us troubles.
Comment 1•8 years ago
|
||
I landed a fixed for a regression in bug 1286839.
I will investigate if I caused another one.
Summary: sheriff actions on treeherder don't seem to be working → pulse_actions - sheriff actions on treeherder don't seem to be working
Comment 2•8 years ago
|
||
Heroku dynos cycle every 24 hours.
Somehow, after Friday's cycling the listeners are not consuming any new messages (there's 60+ pending messages).
Jul 15 16:00:09 pulse-actions app/worker2.1: root INFO: Created Treeherder 'Sch' job.
Jul 15 16:00:09 pulse-actions app/worker2.1: root INFO: #### End of request ####.
Jul 16 09:54:37 pulse-actions heroku/worker2.1: Cycling
Jul 16 09:54:37 pulse-actions heroku/worker2.1: State changed from up to starting
Jul 16 09:54:41 pulse-actions heroku/worker2.1: Stopping all processes with SIGTERM
Jul 16 09:54:42 pulse-actions heroku/worker2.1: Starting process with command `python pulse_actions/worker.py --config-file configs/worker2.json --memory-saving`
Jul 16 09:54:42 pulse-actions heroku/worker2.1: State changed from starting to up
Jul 16 09:54:43 pulse-actions heroku/worker2.1: Process exited with status 143
Jul 16 09:54:45 pulse-actions app/worker2.1: root INFO: Console output logs INFO level messages.
Jul 16 09:54:45 pulse-actions app/worker2.1: replay.replay INFO: Listening to (exchange/treeherder/v1/resultset-runnable-job-actions, #)
Jul 16 09:54:45 pulse-actions app/worker2.1: replay.replay INFO: Listening to (exchange/treeherder/v1/resultset-actions, #.#)
Jul 16 09:54:45 pulse-actions app/worker2.1: replay.replay INFO: Listening to (exchange/treeherder/v1/job-actions, buildbot.#.backfill)
Jul 17 10:12:43 pulse-actions heroku/worker2.1: Cycling
Jul 17 10:12:43 pulse-actions heroku/worker2.1: State changed from up to starting
Jul 17 10:12:48 pulse-actions heroku/worker2.1: Starting process with command `python pulse_actions/worker.py --config-file configs/worker2.json --memory-saving`
Jul 17 10:12:48 pulse-actions heroku/worker2.1: Stopping all processes with SIGTERM
Jul 17 10:12:49 pulse-actions heroku/worker2.1: Process exited with status 143
Jul 17 10:12:49 pulse-actions heroku/worker2.1: State changed from starting to up
Jul 17 10:12:51 pulse-actions app/worker2.1: root INFO: Console output logs INFO level messages.
Jul 17 10:12:51 pulse-actions app/worker2.1: replay.replay INFO: Listening to (exchange/treeherder/v1/resultset-actions, #.#)
Jul 17 10:12:51 pulse-actions app/worker2.1: replay.replay INFO: Listening to (exchange/treeherder/v1/resultset-runnable-job-actions, #)
Jul 17 10:12:51 pulse-actions app/worker2.1: replay.replay INFO: Listening to (exchange/treeherder/v1/job-actions, buildbot.#.backfill)
Comment 3•8 years ago
|
||
I can't receive any messages from Treeherder.
There's a large number of errors in new relic in the last 3 days.
Jul 15, '16 10:54 am Jul 18, '16 9:45 am log-parser celery.exceptions:Retry Retry in 10s: timeout('timed out',) 7,556
Jul 15, '16 10:29 pm Jul 18, '16 9:43 am publish-job-action exceptions:IOError Socket closed 3,563
Updated•8 years ago
|
Summary: pulse_actions - sheriff actions on treeherder don't seem to be working → sheriff actions on treeherder don't seem to be working
Comment 4•8 years ago
|
||
Moving to Treeherder.
Component: General → Treeherder
Product: Testing → Tree Management
Version: 49 Branch → ---
Updated•8 years ago
|
Assignee: nobody → emorley
Assignee | ||
Comment 5•8 years ago
|
||
In the publish-job-action task, pulse_publisher.py:126 calls into Kombu which uses py-amqp, which is then giving:
...
File "/data/www/treeherder.mozilla.org/venv/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 116, in establish_connection
conn = self.Connection(**opts)
File "/data/www/treeherder.mozilla.org/venv/lib/python2.7/site-packages/amqp/connection.py", line 180, in __init__
(10, 30), # tune
File "/data/www/treeherder.mozilla.org/venv/lib/python2.7/site-packages/amqp/abstract_channel.py", line 67, in wait
self.channel_id, allowed_methods, timeout)
File "/data/www/treeherder.mozilla.org/venv/lib/python2.7/site-packages/amqp/connection.py", line 241, in _wait_method
channel, method_sig, args, content = read_timeout(timeout)
File "/data/www/treeherder.mozilla.org/venv/lib/python2.7/site-packages/amqp/connection.py", line 330, in read_timeout
return self.method_reader.read_method()
File "/data/www/treeherder.mozilla.org/venv/lib/python2.7/site-packages/amqp/method_framing.py", line 189, in read_method
raise m
IOError: Socket closed
These errors started around 0300 Saturday UTC+1 (1900 PT).
Nothing has changed code-wise on Treeherder prod, in the timeframe where errors have been occurring.
The connection to pulse is fine:
[emorley@treeherder-rabbitmq2.private.scl3 ~]$ nc -vz pulse.mozilla.org 5671
Connection to pulse.mozilla.org 5671 port [tcp/amqps] succeeded!
Publish to pulse is using the user `treeherder` on pulse.m.o; Cameron/Mark did the credentials for this change at all recently?
Flags: needinfo?(mcote)
Flags: needinfo?(cdawson)
Assignee | ||
Comment 6•8 years ago
|
||
The timeline is shortly after bug 1286614 comment 4. Did the 'treeherder' user get edit after that?
Assignee | ||
Comment 7•8 years ago
|
||
Edited even, typing fail
Comment 8•8 years ago
|
||
Hm I did nothing except change the owner. I can reset the password, although so can camd.
Flags: needinfo?(mcote)
Comment hidden (offtopic) |
Comment hidden (offtopic) |
Assignee | ||
Updated•8 years ago
|
Component: Treeherder → Treeherder: Infrastructure
Priority: -- → P1
Summary: sheriff actions on treeherder don't seem to be working → publish-job-action tasks failing with 'IOError: Socket closed' connecting to pulse.mozilla.org
Assignee | ||
Comment 11•8 years ago
|
||
On https://pulseguardian.mozilla.org/all_pulse_users I only see the following Treeherder users:
treeherder-prod cdawson[at]mozilla.com
treeherder-stage cdawson[at]mozilla.com
treeherder-staging cdawson[at]mozilla.com
treeherder-test cdawson[at]mozilla.com
I'm guessing Cameron delete the 'treeherder' user, not realising it was still used.
Since the user was deleted rather than the password rotated, I was able to create a user with the same name under my own Pulse Guardian account, and have set the password to what was defined in the Treeherder prod PULSE_URI env variable. Restarting the rabbitmq celery workers and then retriggering a job shows it's now working:
[emorley@treeherder-rabbitmq2.private.scl3 ~]$ grep 'publish-job-action' /var/log/celery/celery_worker_hp.log | grep 'succeeded'
[2016-07-19 01:52:31,412: INFO/MainProcess] Task publish-job-action[cddf9c39-91b4-44ef-b24c-c3d329674cca] succeeded in 0.108628049493s: None
Cam - let's definitely clean up/combine the Pulse users at some point (likely after moving to Heroku), but let's file a bug and document changes so any breakage is easier to debug :-)
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Assignee | ||
Comment 12•8 years ago
|
||
This is now the third time we've been trying to debug apparent connectivity issues, when it's really an authentication issue (see also bug 1281810 bug 1261208).
Kombu/py-aqmp should really differentiate this case. I've filed:
https://github.com/celery/kombu/issues/611
Comment 13•8 years ago
|
||
I can verify this as fixed.
Thanks Ed!
Comment 14•8 years ago
|
||
Dang, I apologize for doing that. I couldn't see that it was in use, but clearly didn't do all the due diligence I should have. It would be great if PulseGuardian would also show exchanges for a user. Sigh... did I just sign up for more work on PulseGuardian? :)
Flags: needinfo?(cdawson)
Comment 15•8 years ago
|
||
If you load queues [1]
You can see each user you have, which queues they have and what exchanges/topics it consumes.
[1] https://pulseguardian.mozilla.org/queues
Assignee | ||
Comment 16•8 years ago
|
||
The problem is that the 'treeherder' user doesn't have any queues, it just publishes to an exchange.
However Pulse guardian doesn't list exchanges.
I've found bug 1079523 which is about adding such a feature.
Comment 17•8 years ago
|
||
Good point!
Assignee | ||
Comment 18•8 years ago
|
||
(In reply to Ed Morley [:emorley] from comment #12)
> This is now the third time we've been trying to debug apparent connectivity
> issues, when it's really an authentication issue (see also bug 1281810 bug
> 1261208).
>
> Kombu/py-aqmp should really differentiate this case. I've filed:
> https://github.com/celery/kombu/issues/611
It turns out that the AMQP spec says to close the socket immediately. However rabbitmq-server extends the AMQP protocol slightly (http://www.rabbitmq.com/auth-notification.html) - and the real issue is that py-amqp (plus librabbitmq fwiw) doesn't yet support the `authentication_failure_close` capability required to make use of it.
Follow along in:
https://github.com/celery/py-amqp/issues/106
https://github.com/celery/librabbitmq/issues/89
https://github.com/celery/kombu/issues/611
Assignee | ||
Comment 19•8 years ago
|
||
(In reply to Ed Morley [:emorley] from comment #18)
> It turns out that the AMQP spec says to close the socket immediately.
> However rabbitmq-server extends the AMQP protocol slightly
> (http://www.rabbitmq.com/auth-notification.html) - and the real issue is
> that py-amqp (plus librabbitmq fwiw) doesn't yet support the
> `authentication_failure_close` capability required to make use of it.
>
> Follow along in:
> https://github.com/celery/py-amqp/issues/106
> https://github.com/celery/librabbitmq/issues/89
> https://github.com/celery/kombu/issues/611
This is now fixed in py-amqp 2.1.1.
You need to log in
before you can comment on or make changes to this bug.
Description
•