Timeout on AMQP for SUMO's stage servers.



Infrastructure & Operations
WebOps: Other
4 years ago


(Reporter: mythmon, Assigned: cyliang)



(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/15] )



4 years ago
I've gotten about 200 of these tracebacks in my email over the weekend. The earliest I can find is from 6:20pm PST on Friday. This appears to affect only SUMO's stage environment.

Traceback (most recent call last):
  File "manage.py", line 22, in <module>
  File "/data/support-stage/src/support.allizom.org/kitsune/virtualenv/lib/python2.6/site-packages/django/core/management/__init__.py", line 399, in execute_from_command_line
  File "/data/support-stage/src/support.allizom.org/kitsune/virtualenv/lib/python2.6/site-packages/django/core/management/__init__.py", line 392, in execute
  File "/data/support-stage/src/support.allizom.org/kitsune/virtualenv/lib/python2.6/site-packages/django/core/management/base.py", line 242, in run_from_argv
    self.execute(*args, **options.__dict__)
  File "/data/support-stage/src/support.allizom.org/kitsune/virtualenv/lib/python2.6/site-packages/django/core/management/base.py", line 285, in execute
    output = self.handle(*args, **options)
  File "/data/support-stage/src/support.allizom.org/kitsune/virtualenv/lib/python2.6/site-packages/cronjobs/management/commands/cron.py", line 64, in handle
  File "/data/support-stage/www/support.allizom.org/kitsune/kitsune/wiki/cron.py", line 38, in generate_missing_share_links
  File "/data/support-stage/src/support.allizom.org/kitsune/virtualenv/lib/python2.6/site-packages/celery/app/task.py", line 357, in delay
    return self.apply_async(args, kwargs)
  File "/data/support-stage/src/support.allizom.org/kitsune/virtualenv/lib/python2.6/site-packages/celery/app/task.py", line 474, in apply_async
  File "/data/support-stage/src/support.allizom.org/kitsune/virtualenv/lib/python2.6/site-packages/celery/app/amqp.py", line 250, in publish_task
  File "/data/support-stage/src/support.allizom.org/kitsune/virtualenv/lib/python2.6/site-packages/kombu/messaging.py", line 164, in publish
    routing_key, mandatory, immediate, exchange, declare)
  File "/data/support-stage/src/support.allizom.org/kitsune/virtualenv/lib/python2.6/site-packages/kombu/connection.py", line 470, in _ensured
  File "/data/support-stage/src/support.allizom.org/kitsune/virtualenv/lib/python2.6/site-packages/kombu/connection.py", line 396, in ensure_connection
    interval_start, interval_step, interval_max, callback)
  File "/data/support-stage/src/support.allizom.org/kitsune/virtualenv/lib/python2.6/site-packages/kombu/utils/__init__.py", line 217, in retry_over_time
    return fun(*args, **kwargs)
  File "/data/support-stage/src/support.allizom.org/kitsune/virtualenv/lib/python2.6/site-packages/kombu/connection.py", line 246, in connect
    return self.connection
  File "/data/support-stage/src/support.allizom.org/kitsune/virtualenv/lib/python2.6/site-packages/kombu/connection.py", line 761, in connection
    self._connection = self._establish_connection()
  File "/data/support-stage/src/support.allizom.org/kitsune/virtualenv/lib/python2.6/site-packages/kombu/connection.py", line 720, in _establish_connection
    conn = self.transport.establish_connection()
  File "/data/support-stage/src/support.allizom.org/kitsune/virtualenv/lib/python2.6/site-packages/kombu/transport/pyamqp.py", line 115, in establish_connection
    conn = self.Connection(**opts)
  File "/data/support-stage/src/support.allizom.org/kitsune/virtualenv/lib/python2.6/site-packages/amqp/connection.py", line 165, in __init__
    self.transport = create_transport(host, connect_timeout, ssl)
  File "/data/support-stage/src/support.allizom.org/kitsune/virtualenv/lib/python2.6/site-packages/amqp/transport.py", line 275, in create_transport
    return TCPTransport(host, connect_timeout)
  File "/data/support-stage/src/support.allizom.org/kitsune/virtualenv/lib/python2.6/site-packages/amqp/transport.py", line 89, in __init__
    raise socket.error(last_err)
socket.error: timed out
Possibly related: bug 1097118



Comment 2

4 years ago
mythmon: Based on the first part of the traceback, I'm assuming these errors come from some sort of task that can be invoked from the CLI. Could you point me at those tasks so I can try to reproduce the errors deliberately?

Looking through the load balancer logs, I can't find any corresponding errors for attempts to reach either of the new SUMO rabbit nodes. =\

Comment 3

(In reply to C. Liang [:cyliang] from comment #2)

These are tasks that run on cron from the SUMO admin node. From what I can tell, the webheads are able to talk to rabbit. Maybe it's just a network issue between the admin node and rabbit?
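
One way to check this hypothesis is a plain TCP connection attempt to the broker, which is the same step that fails at the bottom of the traceback (create_transport raising "socket.error: timed out"). The sketch below is only illustrative: the helper name can_reach is made up for this example, and port 5672 is the standard AMQP port, assumed here because the thread does not state the actual host or port.

```python
import socket


def can_reach(host, port=5672, timeout=5.0):
    """Attempt a plain TCP connect to host:port.

    Returns (True, None) on success, or (False, error) on failure.
    5672 is the default AMQP port; the broker's real port is an assumption.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except socket.error as exc:  # also covers socket.timeout and DNS errors
        return False, exc
    sock.close()
    return True, None
```

Calling can_reach() with the broker's hostname from both a webhead and the admin node would show whether only the admin node is cut off. A timeout, as opposed to "connection refused", is typical of packets being silently dropped somewhere on the path, for example by a firewall rule.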


4 years ago
Depends on: 1121752

Comment 4

4 years ago
Indeed! Bug filed for an ACL to the new rabbit cluster.


4 years ago
Assignee: server-ops-webops → cliang

Comment 5

4 years ago
Have there been any additional AMQP timeouts to the new SUMO RabbitMQ cluster in staging since the ACL went into effect? (There should have been no timeouts since this past Friday.)

Comment 6

(In reply to C. Liang [:cyliang] from comment #5)
> Have there been any additional AMQP timeouts to the new SUMO rabbitMQ
> cluster in staging since the ACL went into effect?  (There should be no
> timeouts since this past Friday.)

I haven't seen any in the past few days.

Comment 7

4 years ago
I haven't gotten any of these for 5 days. I think we're done here. Thank you!
Last Resolved: 4 years ago
Resolution: --- → FIXED