Closed Bug 1470337 Opened 7 years ago Closed 7 years ago

IntegrityError seen when running ./manage.py calculate_stats

Categories

(Webtools Graveyard :: Pontoon, enhancement, P3)

enhancement

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mathjazz, Unassigned)

Details

Attachments

(2 files)

Attached file stacktrace.txt
As part of bug 1377969 we introduced the calculate_stats management command, which re-calculates stats for all TranslatedResources and related objects. It took around one hour to finish on production server and we've seen some IntegrityErrors while it was running. See stack trace in the attachment. I also noticed "unreviewed_strings" count was off for many Teams and Projects, so I ran the following commands: > for locale in Locale.objects.all(): > locale.aggregate_stats() > > for project in Project.objects.all(): > project.aggregate_stats() > > for pl in ProjectLocale.objects.all(): > pl.aggregate_stats() Checking a few hours later, no IntegrityErrors were thrown since and the counts also look OK. What caused the IntegrityErrors? While the command was running, translations were submitted/rejected/approved, which also affects stats. That resulted in TranslateResource.calculate_stats() returning negative values for unreviewed_strings_diff and eventually raising IntegrityError in adjust_stats(). A possible fix is to de-optimize the management command or rewrite it completely.
I believe the only good and easy solution to this is to put the website under maintenance while we run the command. Do we have a maintenance mode on Pontoon?
We do: https://devcenter.heroku.com/articles/maintenance-mode It's also known as the Tetris mode. The alternative is to run the non-optimized version of the management command, which should be less frustrating for users: https://github.com/adngdb/pontoon/blob/3f8c6ce5069b5f8c6ff1a0ecebb57d9e9f1b6ec7/pontoon/base/management/commands/calculate_stats.py I can check how much slower it is on stage.
The non-optimized command was around 10-15% slower on stage. I'll open a PR shortly.
Moving back to the non-optimized version reduces the risk of creating inconsistencies in the database, but it doesn't remove it entirely. Because `calculate_stats` is not atomic, it is possible that a user will update a string right after the counts have been computed in Python and before all parts (project, locale, project_locale) have been updated. It is a race issue, making the code run faster doesn't remove the race, just makes it more in favor of our cause. The only solution is to prevent the race from starting. For that, we need to remove the possibility for changes to happen while our update is running. I can see two ways to do that, one being to make the entire process a database thing with SQL or PG/SQL, and making the operation atomic, so that changes can only happen in between operations but not during them. However with my limited knowledge of psql, I can't say for sure if it is even possible. The other solution I see (and in my opinion, the only viable one) is to disable updates during the update, using some form of maintenance mode.
That's true. However: The current (optimized) code ran for about 50 minutes and any change affecting the stats made in this time frame resulted in an IntegrityError. There were 45 of them. We might have prevented more by running the script in Comment #0, but app. 1 stats affecting change per minute is in line with the average Pontoon activity ATM. While the de-optimized code in the patch will run for about 60 minutes, the time window for raising an error will no longer be the entire execution time of the script. To trigger an IntegrityError, you need to make the change to the file while calculate_stats() is being executed on its TranslatedResource, which takes 0.035 seconds on average (60 minutes divided by 100,000+ TranslatedResources). That's extremelly hard to hit, even intentionally. Also note that we run TranslatedResource.calculate_stats() on all changed resources during sync every 20 minutes. I'd like to avoid maintenance mode if possible, and I think chances of hitting an error with the updated script are very close to 0%.
Aha, this is the point in time when I quote Murphy's law: "Anything that can go wrong will go wrong." :D I understand the trade-off, and I don't mind taking it. However, I would like that we make sure our future us, or anyone working on Pontoon when this command is next ran, don't have to do the work we did here again. So let's add clear and discoverable documentation about the potential side-effects of the command, and how to solve those problems. Mind doing that?
Sure. I'll update the help text with a quick warning and a link to this bug.
Commit pushed to master at https://github.com/mozilla/pontoon https://github.com/mozilla/pontoon/commit/0423fae69d60e0deafd1c62180baaf508344bfd9 Fix bug 1470337: De-optimize calculate_stats (#986) To prevent IntegrityError. And add a note to the help text.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Product: Webtools → Webtools Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: