Bug 1497117 - Balrog admin prod is unresponsive
Opened 6 years ago • Closed 6 years ago
Categories: Cloud Services :: Operations: Miscellaneous (task)
Status: RESOLVED WORKSFORME
Reporter: nthomas • Assignee: Unassigned
Attachments: 1 file
All (uncached) requests for the Balrog admin UI are getting 502 or 504 responses, eg https://aus4-admin.mozilla.org/rules?product=Firefox&channel=release
This will prevent any emergency work around rules/releases, and possibly the next Nightly from being added to Balrog.
In Datadog I see a lot of RDS read IOPS but no CPU or throughput increase. It's possible I triggered this by using the 'Change logs' UI while working on bug 1495714 (filed bug 1497108 to add some limits).
Both the /__lbheartbeat__ and /__heartbeat__ endpoints are also unresponsive; are those being monitored?
Comment 1 • 6 years ago (Reporter)
The 'client' side for Firefox installs querying updates seems OK. The busiest time of the week is about now, so max CPU is rising, but new instances are being spun up.
Comment 2 • 6 years ago (Reporter)
The RDS read IOPS went back to normal values around 08:10 UTC, and the admin UI is responsive again. Can we tell what the problem was?
Severity: critical → normal
Comment 3 • 6 years ago
There are a ton of these in the app logs:
Mon Oct 8 05:18:43 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
Mon Oct 8 05:18:44 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
Mon Oct 8 05:18:45 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
Mon Oct 8 05:18:46 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
Mon Oct 8 05:18:47 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
Mon Oct 8 05:18:48 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
Mon Oct 8 05:18:49 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
Mon Oct 8 05:18:50 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
Mon Oct 8 05:18:51 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
Mon Oct 8 05:18:52 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
Mon Oct 8 05:18:53 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
Mon Oct 8 05:18:54 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
Mon Oct 8 05:18:55 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
Mon Oct 8 05:18:56 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
Mon Oct 8 05:18:57 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
So it looks like something was hanging the workers. I still haven't tracked down the root cause, but I do see that the database was slammed with reads during this time.
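The "(100/100)" in those messages is the default uWSGI listen backlog overflowing because every worker was stuck. One standard mitigation is to cap per-request time and raise the backlog in the uWSGI config itself. This is a hedged sketch, not Balrog's actual configuration; the option names are standard uWSGI, but the values here are illustrative:

```ini
[uwsgi]
# Kill any worker whose request runs longer than 120 seconds,
# roughly matching the nginx timeout mentioned later in this bug.
harakiri = 120

# Raise the listen backlog beyond the default 100 seen in the
# "listen queue full (100/100)" messages; the kernel's
# net.core.somaxconn must be raised to match for this to take effect.
listen = 1024
```

Note that harakiri only limits the damage from a slow request; it doesn't fix whatever makes the request slow in the first place.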
Comment 4 • 6 years ago
These were the requests right before this all happened; not sure if the last one is related:
[pid: 9|app: 0|req: 1579575/1581370] 127.0.0.1 () {44 vars in 651 bytes} [Mon Oct 8 03:15:12 2018] GET /api/scheduled_changes/rules => generated 3979 bytes in 55 msecs (HTTP/1.1 200) 6 headers in 244 bytes (1 switches on core 0)
[pid: 9|app: 0|req: 1579576/1581371] 127.0.0.1 () {44 vars in 660 bytes} [Mon Oct 8 03:15:12 2018] GET /api/scheduled_changes/releases => generated 35 bytes in 8 msecs (HTTP/1.1 200) 6 headers in 242 bytes (1 switches on core 0)
[pid: 9|app: 0|req: 1579577/1581372] 127.0.0.1 () {44 vars in 669 bytes} [Mon Oct 8 03:15:12 2018] GET /api/scheduled_changes/permissions => generated 35 bytes in 3 msecs (HTTP/1.1 200) 6 headers in 242 bytes (1 switches on core 0)
[pid: 9|app: 0|req: 1579578/1581373] 127.0.0.1 () {44 vars in 687 bytes} [Mon Oct 8 03:15:12 2018] GET /api/scheduled_changes/emergency_shutoff => generated 35 bytes in 3 msecs (HTTP/1.1 200) 6 headers in 242 bytes (1 switches on core 0)
[pid: 9|app: 0|req: 1579579/1581374] 127.0.0.1 () {44 vars in 711 bytes} [Mon Oct 8 03:15:12 2018] GET /api/scheduled_changes/required_signoffs/product => generated 35 bytes in 3 msecs (HTTP/1.1 200) 6 headers in 242 bytes (1 switches on core 0)
[pid: 9|app: 0|req: 1579580/1581375] 127.0.0.1 () {44 vars in 723 bytes} [Mon Oct 8 03:15:12 2018] GET /api/scheduled_changes/required_signoffs/permissions => generated 35 bytes in 3 msecs (HTTP/1.1 200) 6 headers in 242 bytes (1 switches on core 0)
[pid: 9|app: 0|req: 1579581/1581376] 127.0.0.1 () {50 vars in 2302 bytes} [Mon Oct 8 03:15:27 2018] GET /api/releases/history?page=2&changed_by=jcristau@mozilla.com&product=Firefox => generated 687824 bytes in 5726779 msecs (HTTP/1.1 200) 8 headers in 293 bytes (1 switches on core 0)
Mon Oct 8 05:18:43 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
Comment 5 • 6 years ago
and then:
Mon Oct 8 06:28:01 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (101/100) ***
[pid: 9|app: 0|req: 1579582/1581377] 127.0.0.1 () {50 vars in 2302 bytes} [Mon Oct 8 04:50:54 2018] GET /api/releases/history?page=3&changed_by=jcristau@mozilla.com&product=Firefox => generated 136 bytes in 5827247 msecs (HTTP/1.1 200) 6 headers in 243 bytes (1 switches on core 0)
So those requests were taking about 1.6 hours each to complete.
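The "about 1.6 hours" figure comes straight from the msecs fields in the two slow /api/releases/history log lines above:

```python
# Convert the uWSGI "generated ... in N msecs" values from the two
# slow /api/releases/history requests into hours.
durations_ms = [5726779, 5827247]  # taken from the log lines above

for ms in durations_ms:
    hours = ms / 1000 / 3600
    print(f"{ms} ms = {hours:.2f} hours")
# → 5726779 ms = 1.59 hours
# → 5827247 ms = 1.62 hours
```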
Comment 6 • 6 years ago
I don't see those history requests anywhere else in the logs. Does that endpoint generate expensive queries or large result sets?
Comment 7 • 6 years ago (Reporter)
Wow, very slow responses. Those history requests for jcristau were definitely me.
There are supposed to be limits on how much data is selected, but I need to trace through all the functions to make sure that's working properly. I'm also going to look into setting a query timeout to match the 120s or so nginx is using.
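If the RDS backend is MySQL 5.7.8 or later, one way to get such a timeout is the server-side max_execution_time limit. This is a sketch of the idea, not what was actually deployed; the 120000 ms value just mirrors the nginx timeout mentioned above, and the WHERE clause is illustrative:

```sql
-- Cap SELECT execution time for this session at 120 seconds
-- (MySQL 5.7.8+; the limit applies only to read-only SELECTs).
SET SESSION max_execution_time = 120000;

-- Or per-statement, via an optimizer hint:
SELECT /*+ MAX_EXECUTION_TIME(120000) */ *
FROM releases_history
WHERE changed_by = 'jcristau@mozilla.com';
```

A statement that hits the limit is aborted with an error rather than left holding a worker for hours.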
In parallel, could you double-check that the history cleanup is working as expected? Maybe generate some data from the releases_history table which shows name, count(change_id), min(timestamp).
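A query along the lines Nick is asking for might look like this (a sketch against releases_history; the column names are taken from the comment above, not verified against the actual schema):

```sql
-- Per-release history volume and oldest surviving row, to check
-- whether the history cleanup is keeping the table bounded.
SELECT name,
       COUNT(change_id) AS changes,
       MIN(timestamp)   AS oldest
FROM releases_history
GROUP BY name
ORDER BY changes DESC
LIMIT 20;
```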
Comment 8 • 6 years ago
mysql> select count(*) from releases_history;
+----------+
| count(*) |
+----------+
| 819528 |
+----------+
1 row in set (0.17 sec)
Comment 9 • 6 years ago
Comment 10 • 6 years ago (Reporter)
That all seems sensible, e.g. the dated blobs like Firefox-mozilla-central-nightly-20181006220113 go back only 14 nightlies; the largest change count is 14406, for Firefox-mozilla-central-nightly-latest, but ~500 updates/nightly and 14 nightlies is 14000; and there are ~500 changes for each new release.
So either something pathological is going on with bug 1495714, or the history table is just growing as we add more releases (and we chose not to clean them up). I'm inclined towards the latter, so let's close this and I'll investigate there.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME
Comment 11 • 6 years ago (Reporter)
(In reply to Nick Thomas [:nthomas] (UTC+13) from comment #10)
> So either something pathological is going on with bug 1495714, or the
> history table is just growing as we add more releases in (and chose to not
> clean them up). I'm inclined towards the latter so lets close this and I'll
> investigate there.
I meant 'former' there.