Connection reset by peer and timeouts when AMO talks to RabbitMQ


Status: Graveyard
Component: Server Operations
Opened: 7 years ago
Updated: 3 years ago


(Reporter: kumar, Assigned: cshields)



(Whiteboard: [Add-on upload fails])

We're seeing a lot of 'Connection reset by peer' exceptions in production right now. The traceback points to where AMO talks to RabbitMQ to queue up a Celery task. Any ideas? The graph of the Celery queues looks normal, however.

Users are seeing this on the addon upload page (bug 688793) but the logs say that it affects other pages that try to queue up tasks.
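A publish-side retry can paper over transient resets while the root cause is investigated. A minimal sketch, with hypothetical names (this is not AMO's actual code, and `publish` stands in for whatever callable enqueues the task):

```python
import errno
import socket
import time

def publish_with_retry(publish, retries=3, delay=0.5):
    """Call `publish` again when the broker resets the connection.

    `publish` is any zero-argument callable that enqueues the task --
    e.g. a hypothetical ``lambda: tasks.validate.delay(upload.pk)``.
    Only ECONNRESET (errno 104) is retried; anything else re-raises.
    """
    for attempt in range(retries):
        try:
            return publish()
        except socket.error as exc:
            if exc.errno != errno.ECONNRESET or attempt == retries - 1:
                raise
            time.sleep(delay * (attempt + 1))  # simple linear backoff
```

Note this only masks the symptom: if the broker nodes themselves are unhealthy (as turned out to be the case here), the resets keep coming.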

Traceback (most recent call last):

 File "/data/www/", line 111, in get_response
   response = callback(request, *callback_args, **callback_kwargs)

 File "/data/www/", line 102, in wrapper
   return f(*args, **kw)

 File "/data/www/", line 94, in wrapper
   return f(*args, **kw)

 File "/data/www/", line 49, in wrapper
   return f(request, *args, **kw)

 File "/data/www/", line 23, in wrapper
   return f(request, addon, *args, **kw)

 File "/data/www/", line 28, in wrapper
   return func(request, *args, **kw)

 File "/data/www/", line 32, in wrapper
   return fun()

 File "/data/www/", line 24, in <lambda>

 File "/data/www/", line 703, in upload_for_addon
   return upload(request, addon_slug=addon.slug)

 File "/data/www/", line 28, in wrapper
   return func(request, *args, **kw)

 File "/data/www/", line 49, in wrapper
   return f(request, *args, **kw)

 File "/data/www/", line 656, in upload

 File "/data/www/", line 338, in delay
   return self.apply_async(args, kwargs)

 File "/data/www/", line 22, in apply_async
   return super(Task, self).apply_async(args, kwargs, **options)

 File "/data/www/", line 448, in apply_async

 File "/data/www/", line 301, in get_publisher

 File "/data/www/", line 328, in TaskPublisher
    return TaskPublisher(*args, **kwargs)

 File "/data/www/", line 156, in __init__
   super(TaskPublisher, self).__init__(*args, **kwargs)

 File "/data/www/", line 79, in __init__
   self.backend =

 File "/data/www/", line 124, in channel
   chan = self.transport.create_channel(self.connection)

 File "/data/www/", line 439, in connection
   self._connection = self._establish_connection()

 File "/data/www/", line 405, in _establish_connection
   conn = self.transport.establish_connection()

 File "/data/www/", line 245, in establish_connection

 File "/data/www/", line 51, in __init__
   super(Connection, self).__init__(*args, **kwargs)

 File "/data/www/", line 131, in __init__
   (10, 10), # start

 File "/data/www/", line 89, in wait
   self.channel_id, allowed_methods)

 File "/data/www/", line 198, in _wait_method

 File "/data/www/", line 215, in read_method
   raise m

error: [Errno 104] Connection reset by peer
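For reference, errno 104 is ECONNRESET on Linux: the remote end (here the RabbitMQ broker) sent a TCP RST while the client was still waiting on the AMQP `connection.start` method, the `(10, 10)` frame visible in the traceback. The mapping is easy to check:

```python
import errno
import os

# The last line of the traceback maps to ECONNRESET (104 on Linux);
# strerror gives back the exact message seen in the logs.
print(errno.ECONNRESET, os.strerror(errno.ECONNRESET))
```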
Severity: normal → blocker


7 years ago
Assignee: server-ops → cshields


7 years ago
Depends on: 688853

Comment 1

7 years ago
(didn't realize that my previous post mid-aired and sat there)

celery1 and celery2 were both out of memory, and the OOM killer had done a number on the nodes.

Reset both; they are up now. Filed bug 688853 to investigate why we were not notified of these issues automatically.
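The follow-up bug 688853 is about detecting this automatically. One approach a monitor could take is scanning the kernel log for OOM-killer activity; a hypothetical sketch (the marker strings match typical Linux kernel messages, and the sample lines are invented for illustration):

```python
def find_oom_kills(log_lines):
    """Return kernel log lines that indicate OOM-killer activity."""
    markers = ("invoked oom-killer", "Out of memory: Kill process",
               "Killed process")
    return [line for line in log_lines if any(m in line for m in markers)]

# Invented sample of `dmesg`-style output for demonstration:
sample = [
    "[1234.5] celeryd invoked oom-killer: gfp_mask=0x201da, order=0",
    "[1234.6] Out of memory: Kill process 4242 (celeryd) score 901",
    "[1300.0] eth0: link up",
]
print(len(find_oom_kills(sample)))  # 2
```

A real monitor would tail the log (or poll `dmesg`) and page when this returns anything.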
Last Resolved: 7 years ago
Resolution: --- → FIXED


7 years ago
Whiteboard: [Add-on upload fails]
Product: → Graveyard