Closed Bug 963720 Opened 11 years ago Closed 11 years ago

Self-serve is down

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Type: task
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: RyanVM, Unassigned)

Attachments

(1 file)

I've been getting "The service is temporarily unavailable. Please try again later." errors when attempting to retrigger any jobs, since at least 11am MVT.
from /builds/selfserve-agent/agent.log on buildbot-master61.srv.releng.use1.mozilla.com
Currently being handled on irc, hostname = releng-rabbitmq-zlb.webapp.scl3.mozilla.com

From rabbitmq logs (these are harmless as it's just showing the LB testing the port):

=INFO REPORT==== 24-Jan-2014::12:07:01 ===
accepting AMQP connection <0.2614.37> (10.22.81.209:33541 -> 10.22.81.92:5672)

=WARNING REPORT==== 24-Jan-2014::12:07:03 ===
closing AMQP connection <0.2494.37> (10.22.81.209:65125 -> 10.22.81.92:5672):
connection_closed_abruptly

# before restarting rabbitmq on
Status of node rabbit@rabbit1 ...
[{pid,22592},
 {running_applications,
     [{rabbitmq_management,"RabbitMQ Management Console","3.2.2"},
      {rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.2.2"},
      {webmachine,"webmachine","1.10.3-rmq3.2.2-gite9359c7"},
      {mochiweb,"MochiMedia Web Server","2.7.0-rmq3.2.2-git680dba8"},
      {rabbitmq_management_agent,"RabbitMQ Management Agent","3.2.2"},
      {rabbit,"RabbitMQ","3.2.2"},
      {os_mon,"CPO CXC 138 46","2.2.6"},
      {inets,"INETS CXC 138 49","5.6"},
      {xmerl,"XML parser","1.2.9"},
      {mnesia,"MNESIA CXC 138 12","4.4.19"},
      {amqp_client,"RabbitMQ AMQP Client","3.2.2"},
      {sasl,"SASL CXC 138 11","2.1.9.4"},
      {stdlib,"ERTS CXC 138 10","1.17.4"},
      {kernel,"ERTS CXC 138 10","2.14.4"}]},
 {os,{unix,linux}},
 {erlang_version,
     "Erlang R14B03 (erts-5.8.4) [source] [64-bit] [rq:1] [async-threads:30] [kernel-poll:true]\n"},
 {memory,
     [{total,73805360},
      {connection_procs,195264},
      {queue_procs,126824},
      {plugins,430296},
      {other_proc,9389760},
      {mnesia,91208},
      {mgmt_db,311664},
      {msg_index,11156496},
      {other_ets,26216208},
      {binary,4601272},
      {code,17620690},
      {atom,2308337},
      {other_system,1357341}]},
 {vm_memory_high_watermark,0.4},
 {vm_memory_limit,3302108364},
 {disk_free_limit,50000000},
 {disk_free,34744860672},
 {file_descriptors,
     [{total_limit,924},{total_used,11},{sockets_limit,829},{sockets_used,5}]},
 {processes,[{limit,1048576},{used,240}]},
 {run_queue,0},
 {uptime,3708774}]
...done.

Status of node rabbit@rabbit2 ...
[{pid,16808},
 {running_applications,
     [{rabbitmq_management,"RabbitMQ Management Console","3.2.2"},
      {rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.2.2"},
      {webmachine,"webmachine","1.10.3-rmq3.2.2-gite9359c7"},
      {mochiweb,"MochiMedia Web Server","2.7.0-rmq3.2.2-git680dba8"},
      {rabbitmq_management_agent,"RabbitMQ Management Agent","3.2.2"},
      {rabbit,"RabbitMQ","3.2.2"},
      {os_mon,"CPO CXC 138 46","2.2.6"},
      {inets,"INETS CXC 138 49","5.6"},
      {xmerl,"XML parser","1.2.9"},
      {mnesia,"MNESIA CXC 138 12","4.4.19"},
      {amqp_client,"RabbitMQ AMQP Client","3.2.2"},
      {sasl,"SASL CXC 138 11","2.1.9.4"},
      {stdlib,"ERTS CXC 138 10","1.17.4"},
      {kernel,"ERTS CXC 138 10","2.14.4"}]},
 {os,{unix,linux}},
 {erlang_version,
     "Erlang R14B03 (erts-5.8.4) [source] [64-bit] [rq:1] [async-threads:30] [kernel-poll:true]\n"},
 {memory,
     [{total,72138496},
      {connection_procs,313096},
      {queue_procs,68496},
      {plugins,431296},
      {other_proc,9646632},
      {mnesia,91272},
      {mgmt_db,144160},
      {msg_index,5748656},
      {other_ets,26407576},
      {binary,7992488},
      {code,17620690},
      {atom,2313993},
      {other_system,1360141}]},
 {vm_memory_high_watermark,0.4},
 {vm_memory_limit,3302108364},
 {disk_free_limit,50000000},
 {disk_free,34773819392},
 {file_descriptors,
     [{total_limit,924},{total_used,11},{sockets_limit,829},{sockets_used,7}]},
 {processes,[{limit,1048576},{used,257}]},
 {run_queue,0},
 {uptime,3708800}]
...done.

Status of node rabbit@rabbit2 ...
[{pid,18194},
 {running_applications,
     [{rabbitmq_management,"RabbitMQ Management Console","3.2.2"},
      {rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.2.2"},
      {webmachine,"webmachine","1.10.3-rmq3.2.2-gite9359c7"},
      {mochiweb,"MochiMedia Web Server","2.7.0-rmq3.2.2-git680dba8"},
      {rabbitmq_management_agent,"RabbitMQ Management Agent","3.2.2"},
      {rabbit,"RabbitMQ","3.2.2"},
      {os_mon,"CPO CXC 138 46","2.2.6"},
      {inets,"INETS CXC 138 49","5.6"},
      {xmerl,"XML parser","1.2.9"},
      {mnesia,"MNESIA CXC 138 12","4.4.19"},
      {amqp_client,"RabbitMQ AMQP Client","3.2.2"},
      {sasl,"SASL CXC 138 11","2.1.9.4"},
      {stdlib,"ERTS CXC 138 10","1.17.4"},
      {kernel,"ERTS CXC 138 10","2.14.4"}]},
 {os,{unix,linux}},
 {erlang_version,
     "Erlang R14B03 (erts-5.8.4) [source] [64-bit] [rq:1] [async-threads:30] [kernel-poll:true]\n"},
 {memory,
     [{total,31385592},
      {connection_procs,5296},
      {queue_procs,56768},
      {plugins,262608},
      {other_proc,9764600},
      {mnesia,91104},
      {mgmt_db,17632},
      {msg_index,24648},
      {other_ets,1066840},
      {binary,13984},
      {code,17185874},
      {atom,1551257},
      {other_system,1344981}]},
 {vm_memory_high_watermark,0.4},
 {vm_memory_limit,3302108364},
 {disk_free_limit,50000000},
 {disk_free,34789359616},
 {file_descriptors,
     [{total_limit,924},{total_used,4},{sockets_limit,829},{sockets_used,1}]},
 {processes,[{limit,1048576},{used,196}]},
 {run_queue,0},
 {uptime,16}]
...done.

releng-rabbitmq-zlb.webapp.scl3.mozilla.com does not appear to be the culprit. Currently investigating with jhopkins, zeller and RyanVM.
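For anyone wondering why the "connection_closed_abruptly" warnings above are considered harmless: a port-level health check typically opens a TCP connection and drops it without speaking AMQP, and RabbitMQ logs that as an abrupt close. Below is a minimal sketch of such a probe in Python. It is an illustration only; how the load balancer here actually performs its check is an assumption, and the function name `tcp_port_check` is hypothetical.

```python
import socket


def tcp_port_check(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened.

    The socket is closed right away without sending any AMQP bytes.
    This mimics a bare port probe (assumed behaviour of a load balancer
    health check), which a broker may report as an abrupt close.
    """
    try:
        # create_connection resolves the host and connects with a timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and resolution failures.
        return False
```

Calling something like `tcp_port_check("10.22.81.92", 5672)` from the load balancer host would be expected to produce an "accepting AMQP connection" line followed by a "closing AMQP connection ... connection_closed_abruptly" line, with no effect on real clients.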
I've killed all of the selfserve-agent processes and let supervisord start them back up. Let's see if that helps.
dustin restarted buildapi and the backlog of queue items got consumed right away. We were unable to tell the root cause of buildapi's problems due to bug 806777. Things are running fine now. Thanks for everyone's help on this!
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Reported in #it at 12:57 PT that this was happening again. No progress resolving so far.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
The rabbit/selfserve-agent was a blind alley. The selfserve agents get hung up on when idle, and they log that when they reconnect. Ugly, but normal.

Even with logging, I couldn't tell what was wrong with buildapi. It had >100 threads, all of which were stuck in sem_wait from acquiring a Python lock of some sort, but that could be anything really. I ran gdb against buildapi, and after detaching the poor thing died. When I started it back up, things looked OK again.

Noting that submitting new jobs seemed quick, but loading pages full of info didn't, I guessed that redis might be to blame and restarted redis.

Finally, on the hypothesis that the new, longer-timeout check_http invocations were causing the problem, I disabled active checks of "builddata.pub.build.mozilla.org" / "http file age - /buildjson/builds-4hr.js.gz".

We haven't seen more problems in the last 45m or so, so maybe one of those things worked, or maybe we've been lucky.
In bug 863268 I just switched the buildapi HTTP interface to the releng cluster. That should help or, at the least, make this easier to debug.
I just made two requests. One timed out; the other succeeded but took quite a while:
16:05:03.585 GET https://secure.pub.build.mozilla.org/buildapi/self-serve/b2g-inbound/rev/151feb1e7b8b [HTTP/1.1 200 OK 38972ms]
I saw a couple of timeouts this morning, too.
This seems much worse today. I did a lot of retriggers and none came through.
I believe those were during the DC downtime this morning. There's not much I can do to respond to "I saw", unfortunately. If you have more data, either a more precise timestamp or, better, a time and HTTP status of some sort, that would help me investigate.
The selfserve agents were down - bug 967459. Please don't re-open this bug, as it's not at all specific. If you do see further problems with self-serve, open a new bug with specific information as to what went wrong: HTTP status codes, error messages from the browser, timestamps, and observed behavior.
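As a rough illustration of the kind of detail requested above (timestamp, HTTP status, elapsed time), here is a small Python sketch that times a single GET and records the outcome. It is not part of buildapi or self-serve; the names `probe` and `format_report` are hypothetical, and the `opener` parameter exists only so the function can be exercised without a network.

```python
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone


def probe(url, timeout=60.0, opener=urllib.request.urlopen):
    """Issue one GET and return timestamp, status, elapsed time, and error."""
    started = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
    t0 = time.monotonic()
    status, error = None, None
    try:
        with opener(url, timeout=timeout) as resp:
            status = resp.status
            resp.read()  # include body transfer in the timing
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 503.
        status = exc.code
        error = str(exc.reason)
    except OSError as exc:
        # Timeouts, refused connections, DNS failures (URLError is an OSError).
        error = repr(exc)
    elapsed_ms = round((time.monotonic() - t0) * 1000)
    return {"timestamp": started, "url": url, "status": status,
            "elapsed_ms": elapsed_ms, "error": error}


def format_report(result):
    """Render a probe result as one line suitable for pasting into a bug."""
    return ("{timestamp} GET {url} status={status} "
            "elapsed_ms={elapsed_ms} error={error}").format(**result)
```

Running `print(format_report(probe(url)))` against the failing self-serve URL would give a single line with the UTC time, the HTTP status (or `None` on a timeout), and how long the request took, which is easier to correlate with server logs than "I saw a timeout".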
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard