Closed Bug 1419483 Opened 7 years ago Closed 7 years ago

Manually run cycle_data to increase free space on production

Categories

(Tree Management :: Treeherder: Infrastructure, enhancement, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: emorley)

References

(Blocks 1 open bug)

Details

The cycle_data task is failing on stage/prod (bug 1346567). As a stop-gap prior to that being fixed, we can manually run cycle_data on a bigger dyno size to work around the out of memory errors/timeouts, similar to what was done 4 months ago in bug 1346567 comment 8.
I've run:
`thp run:detached --size=performance-l -- ./manage.py cycle_data --debug --sleep-time 0 --chunk-size 2000`
(and the same for stage, except using a chunk size of 20000 to compare)

Ongoing logs:
https://papertrailapp.com/systems/treeherder-prod/events?q=program%3A%2Frun.3132
https://papertrailapp.com/systems/treeherder-stage/events?q=program%3A%2Frun.1212
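(For reference: the --chunk-size and --sleep-time flags exist because deleting months of expired rows in one statement holds locks for a long time and can exceed statement time limits, so rows are removed a chunk at a time. Below is a minimal sketch of that general pattern, not the actual cycle_data code; the table/column names and the MySQLdb driver are assumptions.)

    import time
    import MySQLdb  # assumption: the mysqlclient driver

    def delete_expired(conn, cutoff, chunk_size=500, sleep_time=0):
        """Delete expired rows a chunk at a time, so no single DELETE runs too long."""
        cursor = conn.cursor()
        while True:
            cursor.execute(
                # hypothetical table/column, purely for illustration
                "DELETE FROM job WHERE submit_time < %s LIMIT %s",
                (cutoff, chunk_size),
            )
            conn.commit()
            if cursor.rowcount == 0:  # nothing left to delete
                break
            time.sleep(sleep_time)  # let replication and other queries catch up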
I got `django.db.utils.OperationalError: (3024, 'Query execution was interrupted, maximum statement execution time exceeded')` in `failure_lines_to_delete.delete()`, so I'm trying again with a chunk size of 500:
https://papertrailapp.com/systems/treeherder-prod/events?q=program%3A%2Frun.6570
https://papertrailapp.com/systems/treeherder-stage/events?q=program%3A%2Frun.8159
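(That error is MySQL's max_execution_time cutting off one huge DELETE. One hedged way to keep each statement short is to delete by small primary-key batches instead; a sketch only, assuming Treeherder's Django models are importable from a `./manage.py shell`, and not the actual fix that landed.)

    # Sketch: delete a large queryset in small primary-key batches so each
    # DELETE stays well under MySQL's max_execution_time.
    from treeherder.model.models import FailureLine  # assumed import path

    def delete_in_pk_batches(queryset, batch_size=500):
        while True:
            pks = list(queryset.values_list("id", flat=True)[:batch_size])
            if not pks:
                break
            FailureLine.objects.filter(id__in=pks).delete()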
The prod cycle_data task is still running now - looks like there is a lot to delete...
Wow, this is still running 16 hours later!
The one-off dyno has been running all the time since, and has just been killed by the Heroku 1-day timeout. I'll re-run it again, so it can continue from where it left off.

Nov 21 18:46:05 treeherder-prod heroku/run.6570: Starting process with command `./manage.py cycle_data --debug --sleep-time 0 --chunk-size 500`
Nov 21 18:46:06 treeherder-prod heroku/run.6570: State changed from starting to up
Nov 21 18:46:06 treeherder-prod app/run.6570: /tmp/memcachier-stunnel.conf
Nov 21 18:46:08 treeherder-prod app/run.6570: cycle interval... 120 days, 0:00:00
Nov 21 18:46:08 treeherder-prod app/run.6570: Cycling repository: mozilla-central
Nov 21 18:46:09 treeherder-prod app/run.6570: Deleted 219 jobs from mozilla-central
Nov 21 18:46:09 treeherder-prod app/run.6570: Cycling repository: mozilla-inbound
Nov 21 18:46:47 treeherder-prod app/run.6570: Deleted 7151 jobs from mozilla-inbound
Nov 21 18:46:47 treeherder-prod app/run.6570: Cycling repository: b2g-inbound
Nov 21 18:46:47 treeherder-prod app/run.6570: Deleted 0 jobs from b2g-inbound
Nov 21 18:46:47 treeherder-prod app/run.6570: Cycling repository: try
...
Nov 22 11:31:38 treeherder-prod app/run.6570: /app/.heroku/python/lib/python2.7/site-packages/django/db/backends/mysql/base.py:101: Warning: (3170L, u"Memory capacity of 8388608 bytes for 'range_optimizer_max_mem_size' exceeded. Range optimization was not done for this query.")
Nov 22 11:31:38 treeherder-prod app/run.6570: return self.cursor.execute(query, args)
...
Nov 22 12:13:52 treeherder-prod app/run.6570: Deleted 7990366 jobs from try
...
Nov 22 19:08:39 treeherder-prod heroku/run.6570: Cycling
Nov 22 19:08:39 treeherder-prod heroku/run.6570: State changed from up to complete
Nov 22 19:08:39 treeherder-prod heroku/run.6570: Stopping all processes with SIGTERM
Nov 22 19:08:39 treeherder-prod heroku/run.6570: Process exited with status 143
This time I used a performance-m (since the L might have been overkill) and increased the chunk-size slightly:
`thp run:detached --size=performance-m -- ./manage.py cycle_data --debug --sleep-time 0 --chunk-size 1000`
https://papertrailapp.com/systems/treeherder-prod/events?q=program%3A%2Frun.7820
At the moment it's getting stuck on the performance_datum deletes (bug 1346567 comment 10), so as a temporary workaround I'm running some manual deletes on that table (sticking to try for now, since it's one of the repositories that does expire data):

DELETE FROM `performance_datum` WHERE (`repository_id` = 4 AND `push_timestamp` < '2017-07-25 21:27:32.388466') LIMIT 20000
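(A LIMITed DELETE like that only removes one batch, so it has to be re-issued until it affects zero rows. A small hedged loop to do that automatically; the connection details are placeholders, and beyond the statement itself this is not what was actually run here.)

    import time
    import MySQLdb  # assumption: the mysqlclient driver

    conn = MySQLdb.connect(host="HOST", user="USER", passwd="PASS", db="treeherder")
    cursor = conn.cursor()
    while True:
        cursor.execute(
            "DELETE FROM performance_datum "
            "WHERE repository_id = 4 AND push_timestamp < '2017-07-25 21:27:32.388466' "
            "LIMIT 20000"
        )
        conn.commit()
        if cursor.rowcount == 0:  # all expired try rows gone
            break
        time.sleep(1)  # brief pause between batches to ease lock pressure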
Trying on stage now that the disabling of perf data cycling has landed in bug 1346567.
$ ths run:detached --size=performance-m -- ./manage.py cycle_data --debug --sleep-time 0 --chunk-size 1000
https://papertrailapp.com/systems/treeherder-stage/events?q=program%3A%2Frun.6558
Trying with chunk size 500 again:
$ th{d,s,p} run:detached --size=performance-m -- ./manage.py cycle_data --debug --sleep-time 0 --chunk-size 500
https://papertrailapp.com/systems/treeherder-prototype/events?q=program%3A%2Frun.6393
https://papertrailapp.com/systems/treeherder-stage/events?q=program%3A%2Frun.9926
https://papertrailapp.com/systems/treeherder-prod/events?q=program%3A%2Frun.2630

The prod one timed out at:
`Machine.objects.exclude(id__in=used_machine_ids).delete()`
(since it must be doing an unnecessary intermediate SELECT)
...so I've manually run the delete using:
`DELETE FROM machine WHERE machine.id NOT IN (SELECT machine_id FROM job);`
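(If the orphaned-machine cleanup ever needs running from inside the app rather than a MySQL console, Django's raw cursor runs the same single server-side statement without the ORM's intermediate SELECT. A sketch only, not what was done here, which was a manual query.)

    from django.db import connection, transaction

    def delete_orphaned_machines():
        """Delete machine rows that no job references, in one server-side
        statement, instead of letting the ORM fetch the id list first."""
        with transaction.atomic(), connection.cursor() as cursor:
            cursor.execute(
                "DELETE FROM machine "
                "WHERE machine.id NOT IN (SELECT machine_id FROM job)"
            )
            return cursor.rowcount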
After prod finished, I ran:
* an `OPTIMIZE TABLE job_log` (which freed up 20GB),
* then an `OPTIMIZE TABLE job` (which would have freed up 10-20GB, but failed due to "Duplicate entry '153230551' for key 'PRIMARY'" - which I think is due to a race condition that can occur during ALTER TABLE while writes are still continuing),
* then `OPTIMIZE TABLE failure_line` (which freed up 171GB!)
(In reply to Ed Morley [:emorley] from comment #12)
> After prod finished, I ran:

* `OPTIMIZE TABLE job_detail` - which freed another 121 GB

Free space is now at ~500GB of 1TB on prod (dev/stage still need the same treatment to get to that point).
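(OPTIMIZE TABLE rebuilds the table, so it is slow and briefly blocks writes; it only pays off on tables with a lot of reclaimable space. A hedged way to pick candidates first, using information_schema's data_free estimate; the 10GB threshold here is arbitrary.)

    from django.db import connection

    def tables_worth_optimizing(min_free_gb=10):
        """List tables whose estimated reclaimable space (data_free) exceeds
        the threshold, largest first."""
        with connection.cursor() as cursor:
            cursor.execute(
                "SELECT table_name, ROUND(data_free / 1024 / 1024 / 1024, 1) AS free_gb "
                "FROM information_schema.tables "
                "WHERE table_schema = DATABASE() AND data_free > %s "
                "ORDER BY data_free DESC",
                [min_free_gb * 1024 ** 3],
            )
            return cursor.fetchall()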
dev/stage were sorted out too; now all three instances have at least 500GB out of 1TB free \o/
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED