Closed Bug 1497117 Opened 6 years ago Closed 6 years ago

Balrog admin prod is unresponsive

Categories

(Cloud Services :: Operations: Miscellaneous, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: nthomas, Unassigned)

Details

Attachments

(1 file)

All (uncached) requests for the Balrog admin UI are getting 502 or 504 responses, e.g. https://aus4-admin.mozilla.org/rules?product=Firefox&channel=release. This will prevent any emergency work on rules/releases, and possibly the next Nightly from being added to Balrog. In Datadog I see a lot of RDS read IOPS but no CPU or throughput increase. It's possible I triggered this by using the 'Change logs' UI while working on bug 1495714 (filed bug 1497108 to add some limits). Both the /__lbheartbeat__ and /__heartbeat__ endpoints are also unresponsive; are those being monitored?
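(For reference, a minimal external check of those two endpoints could look like the Python sketch below. It assumes the requests library; the 10 second timeout and the exit-code convention are just illustrative, not what the deployed monitoring uses.)

# Minimal sketch: poll the two heartbeat endpoints mentioned above and
# report any that fail or return a non-200 status.
import sys
import requests

BASE = "https://aus4-admin.mozilla.org"
ENDPOINTS = ["/__lbheartbeat__", "/__heartbeat__"]

def check():
    failures = []
    for path in ENDPOINTS:
        try:
            resp = requests.get(BASE + path, timeout=10)  # 10s is an arbitrary choice here
            if resp.status_code != 200:
                failures.append(f"{path}: HTTP {resp.status_code}")
        except requests.RequestException as exc:
            failures.append(f"{path}: {exc}")
    return failures

if __name__ == "__main__":
    problems = check()
    for p in problems:
        print(p)
    sys.exit(1 if problems else 0)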
The 'client' side (Firefox installs querying for updates) seems OK. The busiest time of the week is about now, so max CPU is rising, but new instances are being spun up.
The RDS read IOPS went back to normal values ~0810 UTC, and the admin UI is responsive again. Can we tell what the problem was?
Severity: critical → normal
There are a ton of these in the app logs:

Mon Oct 8 05:18:43 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
Mon Oct 8 05:18:44 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
Mon Oct 8 05:18:45 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
Mon Oct 8 05:18:46 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
Mon Oct 8 05:18:47 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
Mon Oct 8 05:18:48 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
Mon Oct 8 05:18:49 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
Mon Oct 8 05:18:50 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
Mon Oct 8 05:18:51 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
Mon Oct 8 05:18:52 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
Mon Oct 8 05:18:53 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
Mon Oct 8 05:18:54 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
Mon Oct 8 05:18:55 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
Mon Oct 8 05:18:56 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
Mon Oct 8 05:18:57 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***

So it looks like something was hanging the workers. Still haven't tracked down the root cause. I do see the database was slammed with reads during this time.
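(A rough way to size the outage window from those logs is sketched below; the log path and the timestamp format are assumptions based on the excerpt above, not the actual deployment layout.)

# Count the "listen queue full" messages and report the first/last timestamp seen.
import re
from datetime import datetime

PATTERN = re.compile(r"^(\w{3} \w{3} +\d+ [\d:]+ \d{4}) - \*\*\* uWSGI listen queue .* full")
LOG_PATH = "uwsgi.log"  # hypothetical path

first = last = None
count = 0
with open(LOG_PATH) as fh:
    for line in fh:
        m = PATTERN.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%a %b %d %H:%M:%S %Y")
            first = first or ts
            last = ts
            count += 1

print(f"{count} 'listen queue full' messages between {first} and {last}")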
These were the requests right before this all happened; not sure if the last one is related:

[pid: 9|app: 0|req: 1579575/1581370] 127.0.0.1 () {44 vars in 651 bytes} [Mon Oct 8 03:15:12 2018] GET /api/scheduled_changes/rules => generated 3979 bytes in 55 msecs (HTTP/1.1 200) 6 headers in 244 bytes (1 switches on core 0)
[pid: 9|app: 0|req: 1579576/1581371] 127.0.0.1 () {44 vars in 660 bytes} [Mon Oct 8 03:15:12 2018] GET /api/scheduled_changes/releases => generated 35 bytes in 8 msecs (HTTP/1.1 200) 6 headers in 242 bytes (1 switches on core 0)
[pid: 9|app: 0|req: 1579577/1581372] 127.0.0.1 () {44 vars in 669 bytes} [Mon Oct 8 03:15:12 2018] GET /api/scheduled_changes/permissions => generated 35 bytes in 3 msecs (HTTP/1.1 200) 6 headers in 242 bytes (1 switches on core 0)
[pid: 9|app: 0|req: 1579578/1581373] 127.0.0.1 () {44 vars in 687 bytes} [Mon Oct 8 03:15:12 2018] GET /api/scheduled_changes/emergency_shutoff => generated 35 bytes in 3 msecs (HTTP/1.1 200) 6 headers in 242 bytes (1 switches on core 0)
[pid: 9|app: 0|req: 1579579/1581374] 127.0.0.1 () {44 vars in 711 bytes} [Mon Oct 8 03:15:12 2018] GET /api/scheduled_changes/required_signoffs/product => generated 35 bytes in 3 msecs (HTTP/1.1 200) 6 headers in 242 bytes (1 switches on core 0)
[pid: 9|app: 0|req: 1579580/1581375] 127.0.0.1 () {44 vars in 723 bytes} [Mon Oct 8 03:15:12 2018] GET /api/scheduled_changes/required_signoffs/permissions => generated 35 bytes in 3 msecs (HTTP/1.1 200) 6 headers in 242 bytes (1 switches on core 0)
[pid: 9|app: 0|req: 1579581/1581376] 127.0.0.1 () {50 vars in 2302 bytes} [Mon Oct 8 03:15:27 2018] GET /api/releases/history?page=2&changed_by=jcristau@mozilla.com&product=Firefox => generated 687824 bytes in 5726779 msecs (HTTP/1.1 200) 8 headers in 293 bytes (1 switches on core 0)
Mon Oct 8 05:18:43 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
and then:

Mon Oct 8 06:28:01 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (101/100) ***
[pid: 9|app: 0|req: 1579582/1581377] 127.0.0.1 () {50 vars in 2302 bytes} [Mon Oct 8 04:50:54 2018] GET /api/releases/history?page=3&changed_by=jcristau@mozilla.com&product=Firefox => generated 136 bytes in 5827247 msecs (HTTP/1.1 200) 6 headers in 243 bytes (1 switches on core 0)

So those requests were taking about 1.6 hours each to complete.
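(Quick sanity check on that figure, using the two msecs values from the log entries above:)

# Convert the uWSGI request durations from milliseconds to hours.
for msecs in (5726779, 5827247):
    print(f"{msecs} ms = {msecs / 1000 / 3600:.2f} hours")
# 5726779 ms = 1.59 hours
# 5827247 ms = 1.62 hours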
I don't see those history requests anywhere else in the logs. Does that endpoint generate expensive queries or large result sets?
Wow, very slow responses. Those history requests for jcristau were definitely me. There are supposed to be limits on how much data is selected, but I need to trace through all the functions to make sure that's working properly. I'm also going to look into setting a query timeout to match the ~120s that nginx is using. In parallel, could you double-check that the history cleanup is working as expected? Maybe generate some data from the releases_history table showing name, count(change_id), min(timestamp).
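(A sketch of both pieces, assuming SQLAlchemy with a MySQL driver, which matches Balrog's stack; the connection string is hypothetical, the table/column names are the ones quoted above, and SET SESSION max_execution_time needs MySQL 5.7+. This is illustrative only, not how the app currently configures its queries.)

from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://user:pass@host/balrog")  # hypothetical DSN

# 1) The per-release summary asked for above: name, change count, oldest change.
summary_sql = text("""
    SELECT name, COUNT(change_id) AS changes, MIN(timestamp) AS oldest
    FROM releases_history
    GROUP BY name
    ORDER BY changes DESC
""")

# 2) A per-session statement timeout roughly matching the ~120s nginx timeout,
#    so a runaway history query fails fast instead of tying up a uWSGI worker
#    for hours.
with engine.connect() as conn:
    conn.execute(text("SET SESSION max_execution_time = 120000"))  # milliseconds
    for name, changes, oldest in conn.execute(summary_sql):
        print(name, changes, oldest)

(If the session-level SET is awkward to wire in, MySQL 5.7+ also accepts a per-SELECT MAX_EXECUTION_TIME optimizer hint.)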
mysql> select count(*) from releases_history;
+----------+
| count(*) |
+----------+
|   819528 |
+----------+
1 row in set (0.17 sec)
That all seems sensible, e.g. the dated blobs like Firefox-mozilla-central-nightly-20181006220113 go back only 14 nightlies; the largest change count is 14406 for Firefox-mozilla-central-nightly-latest, but ~500 updates/nightly and 14 nightlies is 14000; and there are ~500 changes for each new release. So either something pathological is going on with bug 1495714, or the history table is just growing as we add more releases (and we chose not to clean them up). I'm inclined towards the latter, so let's close this and I'll investigate there.
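(For what it's worth, a hypothetical cleanup of the dated nightly history could look like the sketch below. The 14-day retention, the LIKE pattern, and the assumption that the timestamp column is epoch milliseconds are all illustrative; this is not a statement about how Balrog's actual cleanup works.)

import time
from sqlalchemy import create_engine, text

RETENTION_DAYS = 14  # illustrative; matches the "14 nightlies" observation above
cutoff_ms = int((time.time() - RETENTION_DAYS * 86400) * 1000)

engine = create_engine("mysql+pymysql://user:pass@host/balrog")  # hypothetical DSN

# Delete history rows for dated nightly blobs older than the retention window.
delete_sql = text("""
    DELETE FROM releases_history
    WHERE name LIKE :pattern
      AND timestamp < :cutoff
""")

with engine.begin() as conn:  # begin() commits on success
    result = conn.execute(
        delete_sql,
        {"pattern": "Firefox-mozilla-central-nightly-2%", "cutoff": cutoff_ms},
    )
    print(f"deleted {result.rowcount} rows older than {RETENTION_DAYS} days")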
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME
(In reply to Nick Thomas [:nthomas] (UTC+13) from comment #10)
> So either something pathological is going on with bug 1495714, or the
> history table is just growing as we add more releases (and we chose not to
> clean them up). I'm inclined towards the latter, so let's close this and
> I'll investigate there.

I meant 'former' there.