Closed Bug 1497117 Opened 6 years ago Closed 6 years ago

Balrog admin prod is unresponsive

Categories

(Cloud Services :: Operations: Miscellaneous, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: nthomas, Unassigned)

Details

Attachments

(1 file)

All (uncached) requests for the Balrog admin UI are getting 502 or 504 responses, e.g. https://aus4-admin.mozilla.org/rules?product=Firefox&channel=release. This will prevent any emergency work on rules/releases, and possibly the next Nightly from being added to Balrog. In Datadog I see a lot of RDS read IOPS but no CPU or throughput increase. It's possible I triggered this by using the 'Change logs' UI while working on bug 1495714 (filed bug 1497108 to add some limits). Both the /__lbheartbeat__ and /__heartbeat__ endpoints are also unresponsive; are those being monitored?
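(For reference, a minimal external check of those two endpoints could look like the Python sketch below. It assumes the requests library; the 10 second timeout and the exit-code convention are just illustrative, not what the deployed monitoring uses.)

# Minimal sketch: poll the two heartbeat endpoints mentioned above and
# report any that fail or return a non-200 status.
import sys
import requests

BASE = "https://aus4-admin.mozilla.org"
ENDPOINTS = ["/__lbheartbeat__", "/__heartbeat__"]

def check():
    failures = []
    for path in ENDPOINTS:
        try:
            resp = requests.get(BASE + path, timeout=10)  # 10s is an arbitrary choice here
            if resp.status_code != 200:
                failures.append(f"{path}: HTTP {resp.status_code}")
        except requests.RequestException as exc:
            failures.append(f"{path}: {exc}")
    return failures

if __name__ == "__main__":
    problems = check()
    for p in problems:
        print(p)
    sys.exit(1 if problems else 0)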
The 'client' side (Firefox installs querying for updates) seems OK. The busiest time of the week is about now, so max CPU is rising, but new instances are being spun up.
The RDS read IOPS went back to normal values ~0810 UTC, and the admin UI is responsive again. Can we tell what the problem was?
Severity: critical → normal
There are a ton of these in the app logs:

Mon Oct 8 05:18:43 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
Mon Oct 8 05:18:44 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
Mon Oct 8 05:18:45 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
Mon Oct 8 05:18:46 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
Mon Oct 8 05:18:47 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
Mon Oct 8 05:18:48 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
Mon Oct 8 05:18:49 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
Mon Oct 8 05:18:50 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
Mon Oct 8 05:18:51 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
Mon Oct 8 05:18:52 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
Mon Oct 8 05:18:53 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
Mon Oct 8 05:18:54 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
Mon Oct 8 05:18:55 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
Mon Oct 8 05:18:56 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
Mon Oct 8 05:18:57 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***

So it looks like something was hanging the workers. Still haven't tracked down the root cause. I do see the database was slammed with reads during this time.
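(A rough way to size the outage window from those logs is sketched below; the log path and the timestamp format are assumptions based on the excerpt above, not the actual deployment layout.)

# Count the "listen queue full" messages and report the first/last timestamp seen.
import re
from datetime import datetime

PATTERN = re.compile(r"^(\w{3} \w{3} +\d+ [\d:]+ \d{4}) - \*\*\* uWSGI listen queue .* full")
LOG_PATH = "uwsgi.log"  # hypothetical path

first = last = None
count = 0
with open(LOG_PATH) as fh:
    for line in fh:
        m = PATTERN.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%a %b %d %H:%M:%S %Y")
            first = first or ts
            last = ts
            count += 1

print(f"{count} 'listen queue full' messages between {first} and {last}")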
These were the requests right before this all happened; not sure if the last one is related:

[pid: 9|app: 0|req: 1579575/1581370] 127.0.0.1 () {44 vars in 651 bytes} [Mon Oct 8 03:15:12 2018] GET /api/scheduled_changes/rules => generated 3979 bytes in 55 msecs (HTTP/1.1 200) 6 headers in 244 bytes (1 switches on core 0)
[pid: 9|app: 0|req: 1579576/1581371] 127.0.0.1 () {44 vars in 660 bytes} [Mon Oct 8 03:15:12 2018] GET /api/scheduled_changes/releases => generated 35 bytes in 8 msecs (HTTP/1.1 200) 6 headers in 242 bytes (1 switches on core 0)
[pid: 9|app: 0|req: 1579577/1581372] 127.0.0.1 () {44 vars in 669 bytes} [Mon Oct 8 03:15:12 2018] GET /api/scheduled_changes/permissions => generated 35 bytes in 3 msecs (HTTP/1.1 200) 6 headers in 242 bytes (1 switches on core 0)
[pid: 9|app: 0|req: 1579578/1581373] 127.0.0.1 () {44 vars in 687 bytes} [Mon Oct 8 03:15:12 2018] GET /api/scheduled_changes/emergency_shutoff => generated 35 bytes in 3 msecs (HTTP/1.1 200) 6 headers in 242 bytes (1 switches on core 0)
[pid: 9|app: 0|req: 1579579/1581374] 127.0.0.1 () {44 vars in 711 bytes} [Mon Oct 8 03:15:12 2018] GET /api/scheduled_changes/required_signoffs/product => generated 35 bytes in 3 msecs (HTTP/1.1 200) 6 headers in 242 bytes (1 switches on core 0)
[pid: 9|app: 0|req: 1579580/1581375] 127.0.0.1 () {44 vars in 723 bytes} [Mon Oct 8 03:15:12 2018] GET /api/scheduled_changes/required_signoffs/permissions => generated 35 bytes in 3 msecs (HTTP/1.1 200) 6 headers in 242 bytes (1 switches on core 0)
[pid: 9|app: 0|req: 1579581/1581376] 127.0.0.1 () {50 vars in 2302 bytes} [Mon Oct 8 03:15:27 2018] GET /api/releases/history?page=2&changed_by=jcristau@mozilla.com&product=Firefox => generated 687824 bytes in 5726779 msecs (HTTP/1.1 200) 8 headers in 293 bytes (1 switches on core 0)
Mon Oct 8 05:18:43 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (100/100) ***
and then:

Mon Oct 8 06:28:01 2018 - *** uWSGI listen queue of socket "127.0.0.1:40735" (fd: 3) full !!! (101/100) ***
[pid: 9|app: 0|req: 1579582/1581377] 127.0.0.1 () {50 vars in 2302 bytes} [Mon Oct 8 04:50:54 2018] GET /api/releases/history?page=3&changed_by=jcristau@mozilla.com&product=Firefox => generated 136 bytes in 5827247 msecs (HTTP/1.1 200) 6 headers in 243 bytes (1 switches on core 0)

So those requests were taking about 1.6 hours each to complete.
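(Quick sanity check on that figure, using the two msecs values from the log entries above:)

# Convert the uWSGI request durations from milliseconds to hours.
for msecs in (5726779, 5827247):
    print(f"{msecs} ms = {msecs / 1000 / 3600:.2f} hours")
# 5726779 ms = 1.59 hours
# 5827247 ms = 1.62 hours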
I don't see those history requests anywhere else in the logs. Does that endpoint generate expensive queries or large result sets?
Wow, very slow responses. Those history requests for jcristau were definitely me. There are supposed to be limits on how much data is selected, but I need to trace through all the functions to make sure that's working properly. I'm also going to look into setting a query timeout to match the ~120s that nginx is using. In parallel, could you double-check that the history cleanup is working as expected? Maybe generate some data from the releases_history table showing name, count(change_id), min(timestamp).
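(A sketch of both pieces, assuming SQLAlchemy with a MySQL driver, which matches Balrog's stack; the connection string is hypothetical, the table/column names are the ones quoted above, and SET SESSION max_execution_time needs MySQL 5.7+. This is illustrative only, not how the app currently configures its queries.)

from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://user:pass@host/balrog")  # hypothetical DSN

# 1) The per-release summary asked for above: name, change count, oldest change.
summary_sql = text("""
    SELECT name, COUNT(change_id) AS changes, MIN(timestamp) AS oldest
    FROM releases_history
    GROUP BY name
    ORDER BY changes DESC
""")

# 2) A per-session statement timeout roughly matching the ~120s nginx timeout,
#    so a runaway history query fails fast instead of tying up a uWSGI worker
#    for hours.
with engine.connect() as conn:
    conn.execute(text("SET SESSION max_execution_time = 120000"))  # milliseconds
    for name, changes, oldest in conn.execute(summary_sql):
        print(name, changes, oldest)

(If the session-level SET is awkward to wire in, MySQL 5.7+ also accepts a per-SELECT MAX_EXECUTION_TIME optimizer hint.)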
mysql> select count(*) from releases_history;
+----------+
| count(*) |
+----------+
|   819528 |
+----------+
1 row in set (0.17 sec)
That all seems sensible, e.g. the dated blobs like Firefox-mozilla-central-nightly-20181006220113 go back only 14 nightlies; the largest change count is 14406 for Firefox-mozilla-central-nightly-latest, but ~500 updates/nightly and 14 nightlies is 14000; and there are ~500 changes for each new release. So either something pathological is going on with bug 1495714, or the history table is just growing as we add more releases (and we chose not to clean them up). I'm inclined towards the latter, so let's close this and I'll investigate there.
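(For what it's worth, a hypothetical cleanup of the dated nightly history could look like the sketch below. The 14-day retention, the LIKE pattern, and the assumption that the timestamp column is epoch milliseconds are all illustrative; this is not a statement about how Balrog's actual cleanup works.)

import time
from sqlalchemy import create_engine, text

RETENTION_DAYS = 14  # illustrative; matches the "14 nightlies" observation above
cutoff_ms = int((time.time() - RETENTION_DAYS * 86400) * 1000)

engine = create_engine("mysql+pymysql://user:pass@host/balrog")  # hypothetical DSN

# Delete history rows for dated nightly blobs older than the retention window.
delete_sql = text("""
    DELETE FROM releases_history
    WHERE name LIKE :pattern
      AND timestamp < :cutoff
""")

with engine.begin() as conn:  # begin() commits on success
    result = conn.execute(
        delete_sql,
        {"pattern": "Firefox-mozilla-central-nightly-2%", "cutoff": cutoff_ms},
    )
    print(f"deleted {result.rowcount} rows older than {RETENTION_DAYS} days")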
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME
(In reply to Nick Thomas [:nthomas] (UTC+13) from comment #10)
> So either something pathological is going on with bug 1495714, or the
> history table is just growing as we add more releases (and we chose not to
> clean them up). I'm inclined towards the latter, so let's close this and
> I'll investigate there.

I meant 'former' there.