Closed Bug 1295450 Opened 8 years ago Closed 8 years ago

new cleanup script slows down balrog db enough to cause a snowball effect when enough writes come in

Categories

(Release Engineering Graveyard :: Applications: Balrog (backend), defect)

Type: defect
Priority: Not set
Severity: critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Unassigned)

Details

It's taken many more hours than it normally would to submit a release into Balrog. Here's an example which did succeed on the 5th attempt:
https://tools.taskcluster.net/task-inspector/#LXtZfi1kQWK9N0qMdQLGdg/

The admin UI was also pretty slow for loading releases, checking permissions, etc. - requests which depend on DB operations.

It's not clear where the fault is because:
* we deployed today, but the list of changes was minimal (cf68278 --> 6ae366e I think)
* it didn't improve when I asked mostlygeek to abort the cleanup of releases_history (~1830 Pacific). There was a drop in RDS CPU usage at 2100 Pacific, aligning with the 1500-2100 run period of the script
* https://github.com/mozilla/balrog/commit/18047698e064bff5af1459323fe943daf6c5175a is supposed to prevent this problem, but it does introduce some more db operations when trying to merge (roughly the pattern sketched below)
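
For context on that last point, the merge-on-mismatch approach is (as far as I understand it) roughly the pattern below. This is an illustration only: the function names, db interface, and merge logic are made up for the sketch rather than taken from Balrog's actual code, but it shows why a conflicting write now costs extra reads.

# Illustrative sketch of a generic "merge on old_data_version mismatch" pattern.
# The db interface and names here are hypothetical, not Balrog's real code.

class OutdatedDataError(Exception):
    """Raised when concurrent changes overlap and can't be merged automatically."""

def update_release(db, name, new_data, old_data_version):
    current = db.get_release(name)                    # read: current row
    if current["data_version"] == old_data_version:
        # No conflict: plain optimistic-concurrency write.
        db.write_release(name, new_data, old_data_version)
        return

    # Someone else updated the row since we read it. Instead of failing,
    # fetch the common ancestor and attempt a three-way merge; this is
    # where the extra db operations come from.
    ancestor = db.get_release_history(name, old_data_version)   # read: history row
    merged = three_way_merge(ancestor["data"], current["data"], new_data)
    if merged is None:
        raise OutdatedDataError("changes overlap, cannot merge automatically")
    db.write_release(name, merged, current["data_version"])

def three_way_merge(base, theirs, ours):
    """Merge two change sets against a common base; return None if they conflict."""
    merged = dict(theirs)
    for key, value in ours.items():
        if base.get(key) == theirs.get(key):
            merged[key] = value           # only our side touched this key
        elif value != theirs.get(key):
            return None                   # both sides changed it differently
    return merged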

We're getting successes now, so the snowball effect is probably melting away. Combined load of cleanup and storm of submissions may have buried us, even with the backoff in the submission retries.
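
For reference, the backoff in those submission retries is, generically, something along these lines (a sketch, not the actual submitter code; names and delays are illustrative). Even with backoff like this, many tasks retrying concurrently against a slow db still add real write load.

import random
import time

# Generic retry-with-exponential-backoff sketch; not the real submission code.
def submit_with_backoff(submit, max_attempts=5, base_delay=60):
    """Call submit(); on failure, sleep base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return submit()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))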

We've had ~1700 changes on Firefox-49.0b4-build1 so far, vs 1367 for the previous beta, which raises the question of whether we're seeing repeat submissions or incorrect/mishandled error codes.
TBH, after reviewing the datadog graphs again, it really looks like the load came from the cleanup script running between 3 and 9pm Pacific, and it didn't actually stop when mostlygeek asked it to.
There's also a flow-on effect on update verification jobs, which should be downstream of the submissions but won't be until bug 1276506.
It seems pretty clear that the cleanup script caused the sluggishness. mostlygeek and I were talking the other day about perf in the scl3 db vs. RDS, and IIRC we agreed that the scl3 db was much beefier. Now that we're running more expensive queries, that seems to be taking a toll.

A couple of thoughts/ideas:
* To avoid causing issues for regularly scheduled releases, we could change the schedule of the cleanup script to only run on weekends (see the example crontab after this list). However, this might make things even worse if we get a chemspill release over a weekend, so I'm not sure this is a good option.
* With the cleanup script causing both issues with releases and sluggishness with the UI, it seems like we're pushing up against the limits of the current db. Can we bump up the instance to something beefier? Once we catch up on the backlogged releases_history, we could consider bumping down again. E.g. if cleaning up a single day's history only takes 15min on the current db, it wouldn't have the same snowball effect.
* In the medium term, not storing nightly history at all (https://bugzilla.mozilla.org/show_bug.cgi?id=1294493) might be the best route. We still need to decide if we're okay not having _any_ nightly history, though, and I'm not likely to have the headspace for that for a bit.
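
For the weekend-only idea, the change would just be to the cron schedule, along these lines (the path and time here are hypothetical, not the real deployment values):

# Hypothetical crontab entry: run the releases_history cleanup only on
# Sunday (0) and Saturday (6), at 08:00 UTC.
0 8 * * 0,6  /path/to/cleanup-releases-history.sh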

I'm raising the importance of this bug because I think it's important to avoid impacting the next scheduled release.
Severity: normal → critical
We're still failing a bunch of funsize jobs (ones that I restarted this morning) submitting to balrog.

At first glance it seems ~50% of the jobs I retriggered are coming back as failed.

Two Examples:
https://tools.taskcluster.net/task-inspector/#29OITfr-QwCvysjA89tv5A/7
https://tools.taskcluster.net/task-inspector/#DEFI6SQJS-yp-JpXjr96Mg/7
(In reply to Justin Wood (:Callek) from comment #4)
> We're still failing a bunch of funsize jobs (ones that I restarted this
> morning) submitting to balrog.
> 
> At first glance it seems ~50% of the jobs I retriggered are coming back as
> failed.
> 
> Two Examples:
> https://tools.taskcluster.net/task-inspector/#29OITfr-QwCvysjA89tv5A/7
> https://tools.taskcluster.net/task-inspector/#DEFI6SQJS-yp-JpXjr96Mg/7

The cleanup script shouldn't be running at this point, but nightly builds look to be in the process of submitting. That's a factor, but it would be strange if that were causing release issues - nightly and release submissions happening at the same time is not abnormal AFAIK.

Write latency doesn't look much higher than usual (about 3-4ms at the moment).
CPU usage on RDS is elevated though, even compared to 5, 6, and 7 days ago.
RDS throughput is also up, but that's likely a snowball effect due to the retries.
Updating summary to reflect investigation.
Summary: A lot of retries for "Failed to update row, old_data_version doesn't match current data_version" in Firefox 49.0b4 → new cleanup script slows down balrog db enough to cause a snowball effect when enough writes come in
Benson and I just chatted about this. He said that we cleared out around 420,000 rows yesterday, which looks like the vast majority of the backlogged nightly history. Based on how quickly that went, we expect the rest of the backlog to clear out in today's window. If my math is correct, we're able to clean up around 1,000 rows/minute. It looks like we generate between 3,000 and 4,000 new rows of nightly history every night (based on summing the data versions from one set of nightly builds: https://aus4-admin.mozilla.org/releases#20160813), which means we _should_ be able to clean up each day's history in less than 5 minutes once the backlog is clear.
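
A quick sanity check of that math, using the rough figures above (approximate numbers from this comment, not new measurements):

# Sanity check of the cleanup-rate math; numbers are the approximate figures
# quoted above, not new measurements.
cleanup_rate = 1000            # rows deleted per minute
backlog_cleared = 420000       # rows cleared yesterday
nightly_rows_per_day = 4000    # upper end of new nightly history per day

print(backlog_cleared / cleanup_rate / 60)   # 7.0 -> ~7 hours of cleanup yesterday
print(nightly_rows_per_day / cleanup_rate)   # 4.0 -> ~4 minutes/day once caught up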

Since there's no release scheduled today I'd like to let the cleanup query run again today to clear the backlog, and then we can see how long it takes for non-backlogged history in the coming days.
actually we are waiting for a dot release GTB today...
(In reply to Rail Aliiev [:rail] from comment #9)
> actually we are waiting for a dot release GTB today...

Per IRC, no cleanup will happen today because of this. Benson says that the cronjob is not enabled, and that he's been running the script manually. Maybe we'll try again tomorrow if there are no releases, or maybe try to finish catching up over the weekend.
No issues with submitting yesterday's release to Balrog AFAICT.
(In reply to Ben Hearsum (:bhearsum) from comment #11)
> No issues with submitting yesterday's release to Balrog AFAICT.

Correct (there was an unrelated-to-this-bug issue, but it too self-corrected).
This isn't a problem anymore. Most of the backlog cleanup is done, and the day-to-day cleanup only takes a few minutes.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: Release Engineering → Release Engineering Graveyard