Closed Bug 1295450 Opened 8 years ago Closed 8 years ago

new cleanup script slows down balrog db enough to cause a snowball effect when enough writes come in

Categories

(Release Engineering Graveyard :: Applications: Balrog (backend), defect)

Type: defect
Priority: Not set
Severity: critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Unassigned)

Details

It's taken many more hours than it normally would to submit a release into Balrog. Here's an example which did succeed on the 5th attempt:
https://tools.taskcluster.net/task-inspector/#LXtZfi1kQWK9N0qMdQLGdg/

The admin UI was also pretty slow for loading releases, checking permissions, etc. - requests which depend on DB operations.

It's not clear where the fault is because:
* we deployed today, but the list of changes was minimal (cf68278 --> 6ae366e I think)
* it didn't improve when I asked mostlygeek to abort the cleanup of releases_history (~1830 Pacific). There was a drop in RDS CPU usage at 2100 Pacific, aligning with the 1500-2100 run period of the script
* https://github.com/mozilla/balrog/commit/18047698e064bff5af1459323fe943daf6c5175a is supposed to prevent this problem, but it does introduce some more db operations when trying to merge (roughly the pattern sketched below)
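
For context on that last point, the merge-on-mismatch approach is (as far as I understand it) roughly the pattern below. This is an illustration only: the function names, db interface, and merge logic are made up for the sketch rather than taken from Balrog's actual code, but it shows why a conflicting write now costs extra reads.

# Illustrative sketch of a generic "merge on old_data_version mismatch" pattern.
# The db interface and names here are hypothetical, not Balrog's real code.

class OutdatedDataError(Exception):
    """Raised when concurrent changes overlap and can't be merged automatically."""

def update_release(db, name, new_data, old_data_version):
    current = db.get_release(name)                    # read: current row
    if current["data_version"] == old_data_version:
        # No conflict: plain optimistic-concurrency write.
        db.write_release(name, new_data, old_data_version)
        return

    # Someone else updated the row since we read it. Instead of failing,
    # fetch the common ancestor and attempt a three-way merge; this is
    # where the extra db operations come from.
    ancestor = db.get_release_history(name, old_data_version)   # read: history row
    merged = three_way_merge(ancestor["data"], current["data"], new_data)
    if merged is None:
        raise OutdatedDataError("changes overlap, cannot merge automatically")
    db.write_release(name, merged, current["data_version"])

def three_way_merge(base, theirs, ours):
    """Merge two change sets against a common base; return None if they conflict."""
    merged = dict(theirs)
    for key, value in ours.items():
        if base.get(key) == theirs.get(key):
            merged[key] = value           # only our side touched this key
        elif value != theirs.get(key):
            return None                   # both sides changed it differently
    return merged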

We're getting successes now, so the snowball effect is probably melting away. Combined load of cleanup and storm of submissions may have buried us, even with the backoff in the submission retries.
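
For reference, the backoff in those submission retries is, generically, something along these lines (a sketch, not the actual submitter code; names and delays are illustrative). Even with backoff like this, many tasks retrying concurrently against a slow db still add real write load.

import random
import time

# Generic retry-with-exponential-backoff sketch; not the real submission code.
def submit_with_backoff(submit, max_attempts=5, base_delay=60):
    """Call submit(); on failure, sleep base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return submit()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))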

We've had ~1700 changes on Firefox-49.0b4-build1 so far, vs 1367 for the previous beta, which raises the question of whether we're seeing repeat submissions or incorrect/mishandled error codes.
TBH, after reviewing the datadog graphs again, it really looks like the load came from the cleanup script running between 3 and 9pm Pacific, and it didn't actually stop when mostlygeek asked it to.
There's also a flow-on effect on update verification jobs, which should be downstream of the submissions but won't be until bug 1276506.
It seems pretty clear that the cleanup script caused the sluggishness. mostlygeek and I were talking the other day about perf in the scl3 db vs. RDS, and IIRC we agreed that the scl3 db was much beefier. Now that we're running more expensive queries, that seems to be taking a toll.

A couple of thoughts/ideas:
* To avoid causing issues for regularly scheduled releases, we could change the schedule of the cleanup script to only run on weekends (see the example crontab after this list). However, this might make things even worse if we get a chemspill release over a weekend, so I'm not sure this is a good option.
* With the cleanup script causing both issues with releases and sluggishness with the UI, it seems like we're pushing up against the limits of the current db. Can we bump up the instance to something beefier? Once we catch up on the backlogged releases_history, we could consider bumping down again. E.g. if cleaning up a single day's history only takes 15min on the current db, it wouldn't have the same snowball effect.
* In the medium term, not storing nightly history at all (https://bugzilla.mozilla.org/show_bug.cgi?id=1294493) might be the best route. We still need to decide if we're okay not having _any_ nightly history, though, and I'm not likely to have the headspace for that for a bit.
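
For the weekend-only idea, the change would just be to the cron schedule, along these lines (the path and time here are hypothetical, not the real deployment values):

# Hypothetical crontab entry: run the releases_history cleanup only on
# Sunday (0) and Saturday (6), at 08:00 UTC.
0 8 * * 0,6  /path/to/cleanup-releases-history.sh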

I'm raising the importance of this bug because I think it's important to avoid impacting the next scheduled release.
Severity: normal → critical
We're still failing a bunch of funsize jobs (ones that I restarted this morning) submitting to balrog.

At first glance it seems ~50% of the jobs I retriggered are coming back as failed.

Two Examples:
https://tools.taskcluster.net/task-inspector/#29OITfr-QwCvysjA89tv5A/7
https://tools.taskcluster.net/task-inspector/#DEFI6SQJS-yp-JpXjr96Mg/7
(In reply to Justin Wood (:Callek) from comment #4)
> We're still failing a bunch of funsize jobs (ones that I restarted this
> morning) submitting to balrog.
> 
> At first glance it seems ~50% of the jobs I retriggered are coming back as
> failed.
> 
> Two Examples:
> https://tools.taskcluster.net/task-inspector/#29OITfr-QwCvysjA89tv5A/7
> https://tools.taskcluster.net/task-inspector/#DEFI6SQJS-yp-JpXjr96Mg/7

The cleanup script shouldn't be running at this point, but nightly builds look to be in the process of submitting. That's a factor, but it would be strange if that were causing release issues - nightly and release submissions happening at the same time is not abnormal AFAIK.

Write latency doesn't look much higher than usual (about 3-4ms at the moment).
CPU usage on RDS is elevated though, even compared to 5, 6, and 7 days ago.
RDS throughput is also up, but that's likely a snowball effect due to the retries.
Updating summary to reflect investigation.
Summary: A lot of retries for "Failed to update row, old_data_version doesn't match current data_version" in Firefox 49.0b4 → new cleanup script slows down balrog db enough to cause a snowball effect when enough writes come in
Benson and I just chatted about this. He said that we cleared out around 420,000 rows yesterday, which looks like the vast majority of the backlogged nightly history. Based on how quickly that went, we expect the rest of the backlog to clear out in today's window. If my math is correct, we're able to clean up around 1,000 rows/minute. It looks like we generate between 3,000 and 4,000 new rows of nightly history every night (based on summing the data versions from one set of nightly builds: https://aus4-admin.mozilla.org/releases#20160813), which means we _should_ be able to clean up each day's history in less than 5 minutes once the backlog is clear.
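
A quick sanity check of that math, using the rough figures above (approximate numbers from this comment, not new measurements):

# Sanity check of the cleanup-rate math; numbers are the approximate figures
# quoted above, not new measurements.
cleanup_rate = 1000            # rows deleted per minute
backlog_cleared = 420000       # rows cleared yesterday
nightly_rows_per_day = 4000    # upper end of new nightly history per day

print(backlog_cleared / cleanup_rate / 60)   # 7.0 -> ~7 hours of cleanup yesterday
print(nightly_rows_per_day / cleanup_rate)   # 4.0 -> ~4 minutes/day once caught up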

Since there's no release scheduled today I'd like to let the cleanup query run again today to clear the backlog, and then we can see how long it takes for non-backlogged history in the coming days.
actually we are waiting for a dot release GTB today...
(In reply to Rail Aliiev [:rail] from comment #9)
> actually we are waiting for a dot release GTB today...

Per IRC, no cleanup will happen today because of this. Benson says that the cronjob is not enabled, and that he's been running the script manually. Maybe we'll try again tomorrow if there are no releases, or maybe try to finish catching up over the weekend.
No issues with submitting yesterday's release to Balrog AFAICT.
(In reply to Ben Hearsum (:bhearsum) from comment #11)
> No issues with submitting yesterday's release to Balrog AFAICT.

Correct (there was an unrelated-to-this-bug issue, but it too self-corrected).
This isn't a problem anymore. Most of the backlog cleanup is done, and the day-to-day cleanup only takes a few minutes.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: Release Engineering → Release Engineering Graveyard