Closed
Bug 963720
Opened 11 years ago
Closed 11 years ago
Self-serve is down
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: RyanVM, Unassigned)
Attachments
(1 file)
1.51 KB,
text/plain
I'm getting "The service is temporarily unavailable. Please try again later." errors when attempting to retrigger any jobs since at least 11am MVT.
Comment 1•11 years ago
from /builds/selfserve-agent/agent.log on buildbot-master61.srv.releng.use1.mozilla.com
Comment 2•11 years ago
Currently being handled on irc, hostname = releng-rabbitmq-zlb.webapp.scl3.mozilla.com
From the rabbitmq logs (these are harmless, as they just show the LB testing the port):
=INFO REPORT==== 24-Jan-2014::12:07:01 ===
accepting AMQP connection <0.2614.37> (10.22.81.209:33541 -> 10.22.81.92:5672)
=WARNING REPORT==== 24-Jan-2014::12:07:03 ===
closing AMQP connection <0.2494.37> (10.22.81.209:65125 -> 10.22.81.92:5672):
connection_closed_abruptly
# before restarting rabbitmq on
Status of node rabbit@rabbit1 ...
[{pid,22592},
{running_applications,
[{rabbitmq_management,"RabbitMQ Management Console","3.2.2"},
{rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.2.2"},
{webmachine,"webmachine","1.10.3-rmq3.2.2-gite9359c7"},
{mochiweb,"MochiMedia Web Server","2.7.0-rmq3.2.2-git680dba8"},
{rabbitmq_management_agent,"RabbitMQ Management Agent","3.2.2"},
{rabbit,"RabbitMQ","3.2.2"},
{os_mon,"CPO CXC 138 46","2.2.6"},
{inets,"INETS CXC 138 49","5.6"},
{xmerl,"XML parser","1.2.9"},
{mnesia,"MNESIA CXC 138 12","4.4.19"},
{amqp_client,"RabbitMQ AMQP Client","3.2.2"},
{sasl,"SASL CXC 138 11","2.1.9.4"},
{stdlib,"ERTS CXC 138 10","1.17.4"},
{kernel,"ERTS CXC 138 10","2.14.4"}]},
{os,{unix,linux}},
{erlang_version,
"Erlang R14B03 (erts-5.8.4) [source] [64-bit] [rq:1] [async-threads:30] [kernel-poll:true]\n"},
{memory,
[{total,73805360},
{connection_procs,195264},
{queue_procs,126824},
{plugins,430296},
{other_proc,9389760},
{mnesia,91208},
{mgmt_db,311664},
{msg_index,11156496},
{other_ets,26216208},
{binary,4601272},
{code,17620690},
{atom,2308337},
{other_system,1357341}]},
{vm_memory_high_watermark,0.4},
{vm_memory_limit,3302108364},
{disk_free_limit,50000000},
{disk_free,34744860672},
{file_descriptors,
[{total_limit,924},{total_used,11},{sockets_limit,829},{sockets_used,5}]},
{processes,[{limit,1048576},{used,240}]},
{run_queue,0},
{uptime,3708774}]
...done.
Status of node rabbit@rabbit2 ...
[{pid,16808},
{running_applications,
[{rabbitmq_management,"RabbitMQ Management Console","3.2.2"},
{rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.2.2"},
{webmachine,"webmachine","1.10.3-rmq3.2.2-gite9359c7"},
{mochiweb,"MochiMedia Web Server","2.7.0-rmq3.2.2-git680dba8"},
{rabbitmq_management_agent,"RabbitMQ Management Agent","3.2.2"},
{rabbit,"RabbitMQ","3.2.2"},
{os_mon,"CPO CXC 138 46","2.2.6"},
{inets,"INETS CXC 138 49","5.6"},
{xmerl,"XML parser","1.2.9"},
{mnesia,"MNESIA CXC 138 12","4.4.19"},
{amqp_client,"RabbitMQ AMQP Client","3.2.2"},
{sasl,"SASL CXC 138 11","2.1.9.4"},
{stdlib,"ERTS CXC 138 10","1.17.4"},
{kernel,"ERTS CXC 138 10","2.14.4"}]},
{os,{unix,linux}},
{erlang_version,
"Erlang R14B03 (erts-5.8.4) [source] [64-bit] [rq:1] [async-threads:30] [kernel-poll:true]\n"},
{memory,
[{total,72138496},
{connection_procs,313096},
{queue_procs,68496},
{plugins,431296},
{other_proc,9646632},
{mnesia,91272},
{mgmt_db,144160},
{msg_index,5748656},
{other_ets,26407576},
{binary,7992488},
{code,17620690},
{atom,2313993},
{other_system,1360141}]},
{vm_memory_high_watermark,0.4},
{vm_memory_limit,3302108364},
{disk_free_limit,50000000},
{disk_free,34773819392},
{file_descriptors,
[{total_limit,924},{total_used,11},{sockets_limit,829},{sockets_used,7}]},
{processes,[{limit,1048576},{used,257}]},
{run_queue,0},
{uptime,3708800}]
...done.
Status of node rabbit@rabbit2 ...
[{pid,18194},
{running_applications,
[{rabbitmq_management,"RabbitMQ Management Console","3.2.2"},
{rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.2.2"},
{webmachine,"webmachine","1.10.3-rmq3.2.2-gite9359c7"},
{mochiweb,"MochiMedia Web Server","2.7.0-rmq3.2.2-git680dba8"},
{rabbitmq_management_agent,"RabbitMQ Management Agent","3.2.2"},
{rabbit,"RabbitMQ","3.2.2"},
{os_mon,"CPO CXC 138 46","2.2.6"},
{inets,"INETS CXC 138 49","5.6"},
{xmerl,"XML parser","1.2.9"},
{mnesia,"MNESIA CXC 138 12","4.4.19"},
{amqp_client,"RabbitMQ AMQP Client","3.2.2"},
{sasl,"SASL CXC 138 11","2.1.9.4"},
{stdlib,"ERTS CXC 138 10","1.17.4"},
{kernel,"ERTS CXC 138 10","2.14.4"}]},
{os,{unix,linux}},
{erlang_version,
"Erlang R14B03 (erts-5.8.4) [source] [64-bit] [rq:1] [async-threads:30] [kernel-poll:true]\n"},
{memory,
[{total,31385592},
{connection_procs,5296},
{queue_procs,56768},
{plugins,262608},
{other_proc,9764600},
{mnesia,91104},
{mgmt_db,17632},
{msg_index,24648},
{other_ets,1066840},
{binary,13984},
{code,17185874},
{atom,1551257},
{other_system,1344981}]},
{vm_memory_high_watermark,0.4},
{vm_memory_limit,3302108364},
{disk_free_limit,50000000},
{disk_free,34789359616},
{file_descriptors,
[{total_limit,924},{total_used,4},{sockets_limit,829},{sockets_used,1}]},
{processes,[{limit,1048576},{used,196}]},
{run_queue,0},
{uptime,16}]
...done.
releng-rabbitmq-zlb.webapp.scl3.mozilla.com does not appear to be the culprit.
Currently investigating with jhopkins, zeller, and RyanVM.
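As a rough sanity check on the status dumps above, the memory and socket figures can be compared with their limits. This is only a sketch: the numbers are copied from the dumps, the node labels and the `headroom` helper are made up for illustration, and the relation between vm_memory_limit and vm_memory_high_watermark is assumed to be the usual RabbitMQ one (watermark times detected RAM).

```python
# Figures copied from the three status dumps above.
VM_MEMORY_LIMIT = 3302108364  # bytes, identical in all three dumps
HIGH_WATERMARK = 0.4          # vm_memory_high_watermark

# Labels are illustrative; they only distinguish the dumps by uptime.
nodes = {
    "rabbit1 (uptime 3708774)": {"memory": 73805360, "sockets_used": 5, "sockets_limit": 829},
    "rabbit2 (uptime 3708800)": {"memory": 72138496, "sockets_used": 7, "sockets_limit": 829},
    "rabbit2 (uptime 16)": {"memory": 31385592, "sockets_used": 1, "sockets_limit": 829},
}


def headroom(stats, limit=VM_MEMORY_LIMIT):
    """Return (memory used / memory limit, sockets used / sockets limit)."""
    return stats["memory"] / limit, stats["sockets_used"] / stats["sockets_limit"]


for name, stats in nodes.items():
    mem, sock = headroom(stats)
    print(f"{name}: memory {mem:.1%} of limit, sockets {sock:.1%} of limit")

# Assumption: the limit is the watermark fraction of the RAM the broker detected.
print(f"implied RAM: {VM_MEMORY_LIMIT / HIGH_WATERMARK / 2**30:.2f} GiB")
```

On these figures every node sits at a small fraction of both its memory limit and its socket limit, which is consistent with rabbitmq not being the culprit.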
Comment 3•11 years ago
I've killed all of the selfserve-agent processes and let supervisord start them back up. Let's see if that helps.
Comment 4•11 years ago
dustin restarted buildapi and the backlog of queue items got consumed right away. We were unable to determine the root cause of buildapi's problems due to bug 806777.
Things are running fine now. Thanks for everyone's help on this!
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Reporter | ||
Comment 5•11 years ago
Reported in #it at 12:57 PT that this was happening again. No progress resolving so far.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 6•11 years ago
The rabbit/selfserve-agent angle was a blind alley. The selfserve agents get hung up on when idle, and they log that when they reconnect. Ugly, but normal.
Even with logging, I couldn't tell what was wrong with buildapi. It had >100 threads, all of which were stuck in sem_wait while acquiring a Python lock of some sort, but that could be anything really.
I ran gdb against buildapi, and after detaching the poor thing died. When I started it back up, things looked OK again.
Noting that submitting new jobs seemed quick, but loading pages full of info didn't, I guessed that redis might be to blame and restarted redis.
Finally, on the hypothesis that the new, longer-timeout check_http invocations were causing the problem, I disabled active checks of "builddata.pub.build.mozilla.org" / "http file age - /buildjson/builds-4hr.js.gz".
We haven't seen more problems in the last 45m or so, so maybe one of those things worked, or maybe we've been lucky.
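For threads stuck on a Python lock, a Python-level stack dump can be more telling than the sem_wait frames gdb shows. The following is a minimal sketch of that idea for a modern Python 3 process, not what buildapi had or what was run here; `dump_thread_stacks` is a hypothetical helper and would have to be wired into the service ahead of time (for example behind a signal handler or a debug endpoint).

```python
import sys
import threading
import traceback


def dump_thread_stacks(out=sys.stderr):
    """Write the current Python stack of every live thread to `out`."""
    # Map thread idents to names so the dump is readable.
    names = {t.ident: t.name for t in threading.enumerate()}
    # sys._current_frames() returns {thread ident: topmost Python frame}.
    for ident, frame in sys._current_frames().items():
        out.write(f"--- thread {names.get(ident, '?')} ({ident}) ---\n")
        out.write("".join(traceback.format_stack(frame)))
```

Each thread's last few frames show which function is holding or waiting on the lock. On Unix, the standard library's faulthandler.register() offers a similar dump triggered by a signal, without attaching a debugger.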
Comment 9•11 years ago
In bug 863268 I just switched the buildapi HTTP interface to the releng cluster. That should help or, at the least, make this easier to debug.
Comment 10•11 years ago
Just made two requests. One timed out; the other succeeded but took quite a while:
16:05:03.585 GET https://secure.pub.build.mozilla.org/buildapi/self-serve/b2g-inbound/rev/151feb1e7b8b [HTTP/1.1 200 OK 38972ms]
Comment 11•11 years ago
I saw a couple of timeouts this morning, too.
Comment 12•11 years ago
This seems much worse today: I did a lot of retriggers and none came through.
Comment 13•11 years ago
I believe those were during the DC downtime this morning.
There's not much I can do to respond to "I saw", unfortunately. If you have more data, either a more precise timestamp or, better, a time and HTTP status of some sort, that would help me investigate.
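To capture that kind of data, a small probe that records a timestamp, the HTTP status, and the elapsed time for one request is enough. This is a sketch using only the Python standard library; `probe` and its `opener` parameter are hypothetical names, and the output format loosely imitates the browser-console line quoted in comment 10.

```python
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone


def probe(url, timeout=60.0, opener=urllib.request.urlopen):
    """Fetch `url` once and return one line: UTC time, status, elapsed ms."""
    started = datetime.now(timezone.utc).isoformat(timespec="seconds")
    t0 = time.monotonic()
    try:
        with opener(url, timeout=timeout) as resp:
            resp.read()  # include body transfer in the timing
            status = str(resp.status)
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (e.g. 503).
        status = str(exc.code)
    except OSError as exc:
        # URLError, socket timeouts, and connection resets all land here.
        status = f"error: {exc}"
    elapsed_ms = int((time.monotonic() - t0) * 1000)
    return f"{started} GET {url} [{status} {elapsed_ms}ms]"
```

Calling `print(probe(url))` with the failing self-serve URL gives a line that can be pasted into a bug as-is. The `opener` parameter exists only so the function can be exercised without network access.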
Comment 14•11 years ago
The selfserve agents were down - bug 967459.
Please don't re-open this bug, as it's not at all specific. If you do see further problems with self-serve, open a new bug with specific information as to what went wrong: HTTP status codes, error messages from the browser, timestamps, and observed behavior.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Updated•7 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard