Closed Bug 1419483 Opened 7 years ago Closed 7 years ago

Manually run cycle_data to increase free space on production

Categories

(Tree Management :: Treeherder: Infrastructure, enhancement, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: emorley)

References

(Blocks 1 open bug)

Details

The cycle_data task is failing on stage/prod (bug 1346567). As a stop-gap prior to that being fixed, we can manually run cycle_data on a bigger dyno size to work around the out of memory errors/timeouts, similar to what was done 4 months ago in bug 1346567 comment 8.
I've run:
`thp run:detached --size=performance-l -- ./manage.py cycle_data --debug --sleep-time 0 --chunk-size 2000`
(and the same for stage, except using a chunk size of 20000 to compare)

Ongoing logs:
https://papertrailapp.com/systems/treeherder-prod/events?q=program%3A%2Frun.3132
https://papertrailapp.com/systems/treeherder-stage/events?q=program%3A%2Frun.1212
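(For reference: the --chunk-size and --sleep-time flags exist because deleting months of expired rows in one statement holds locks for a long time and can exceed statement time limits, so rows are removed a chunk at a time. Below is a minimal sketch of that general pattern, not the actual cycle_data code; the table/column names and the MySQLdb driver are assumptions.)

    import time
    import MySQLdb  # assumption: the mysqlclient driver

    def delete_expired(conn, cutoff, chunk_size=500, sleep_time=0):
        """Delete expired rows a chunk at a time, so no single DELETE runs too long."""
        cursor = conn.cursor()
        while True:
            cursor.execute(
                # hypothetical table/column, purely for illustration
                "DELETE FROM job WHERE submit_time < %s LIMIT %s",
                (cutoff, chunk_size),
            )
            conn.commit()
            if cursor.rowcount == 0:  # nothing left to delete
                break
            time.sleep(sleep_time)  # let replication and other queries catch up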
I got `django.db.utils.OperationalError: (3024, 'Query execution was interrupted, maximum statement execution time exceeded')` in `failure_lines_to_delete.delete()`, so I'm trying again with a chunk size of 500:
https://papertrailapp.com/systems/treeherder-prod/events?q=program%3A%2Frun.6570
https://papertrailapp.com/systems/treeherder-stage/events?q=program%3A%2Frun.8159
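(That error is MySQL's max_execution_time cutting off one huge DELETE. One hedged way to keep each statement short is to delete by small primary-key batches instead; a sketch only, assuming Treeherder's Django models are importable from a `./manage.py shell`, and not the actual fix that landed.)

    # Sketch: delete a large queryset in small primary-key batches so each
    # DELETE stays well under MySQL's max_execution_time.
    from treeherder.model.models import FailureLine  # assumed import path

    def delete_in_pk_batches(queryset, batch_size=500):
        while True:
            pks = list(queryset.values_list("id", flat=True)[:batch_size])
            if not pks:
                break
            FailureLine.objects.filter(id__in=pks).delete()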
The prod cycle_data task is still running now - looks like there is a lot to delete...
Wow, this is still running 16 hours later!
The one-off dyno has been running all the time since, and has just been killed by the Heroku 1-day timeout. I'll re-run it again, so it can continue from where it left off.

Nov 21 18:46:05 treeherder-prod heroku/run.6570: Starting process with command `./manage.py cycle_data --debug --sleep-time 0 --chunk-size 500`
Nov 21 18:46:06 treeherder-prod heroku/run.6570: State changed from starting to up
Nov 21 18:46:06 treeherder-prod app/run.6570: /tmp/memcachier-stunnel.conf
Nov 21 18:46:08 treeherder-prod app/run.6570: cycle interval... 120 days, 0:00:00
Nov 21 18:46:08 treeherder-prod app/run.6570: Cycling repository: mozilla-central
Nov 21 18:46:09 treeherder-prod app/run.6570: Deleted 219 jobs from mozilla-central
Nov 21 18:46:09 treeherder-prod app/run.6570: Cycling repository: mozilla-inbound
Nov 21 18:46:47 treeherder-prod app/run.6570: Deleted 7151 jobs from mozilla-inbound
Nov 21 18:46:47 treeherder-prod app/run.6570: Cycling repository: b2g-inbound
Nov 21 18:46:47 treeherder-prod app/run.6570: Deleted 0 jobs from b2g-inbound
Nov 21 18:46:47 treeherder-prod app/run.6570: Cycling repository: try
...
Nov 22 11:31:38 treeherder-prod app/run.6570: /app/.heroku/python/lib/python2.7/site-packages/django/db/backends/mysql/base.py:101: Warning: (3170L, u"Memory capacity of 8388608 bytes for 'range_optimizer_max_mem_size' exceeded. Range optimization was not done for this query.")
Nov 22 11:31:38 treeherder-prod app/run.6570: return self.cursor.execute(query, args)
...
Nov 22 12:13:52 treeherder-prod app/run.6570: Deleted 7990366 jobs from try
...
Nov 22 19:08:39 treeherder-prod heroku/run.6570: Cycling
Nov 22 19:08:39 treeherder-prod heroku/run.6570: State changed from up to complete
Nov 22 19:08:39 treeherder-prod heroku/run.6570: Stopping all processes with SIGTERM
Nov 22 19:08:39 treeherder-prod heroku/run.6570: Process exited with status 143
This time I used a performance-m (since the L might have been overkill) and increased the chunk-size slightly:
`thp run:detached --size=performance-m -- ./manage.py cycle_data --debug --sleep-time 0 --chunk-size 1000`
https://papertrailapp.com/systems/treeherder-prod/events?q=program%3A%2Frun.7820
At the moment it's getting stuck on the performance_datum deletes (bug 1346567 comment 10), so as a temporary workaround I'm running some manual deletes on that table (sticking to try for now, since it's one of the repositories that does expire data):

DELETE FROM `performance_datum` WHERE (`repository_id` = 4 AND `push_timestamp` < '2017-07-25 21:27:32.388466') LIMIT 20000
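(A LIMITed DELETE like that only removes one batch, so it has to be re-issued until it affects zero rows. A small hedged loop to do that automatically; the connection details are placeholders, and beyond the statement itself this is not what was actually run here.)

    import time
    import MySQLdb  # assumption: the mysqlclient driver

    conn = MySQLdb.connect(host="HOST", user="USER", passwd="PASS", db="treeherder")
    cursor = conn.cursor()
    while True:
        cursor.execute(
            "DELETE FROM performance_datum "
            "WHERE repository_id = 4 AND push_timestamp < '2017-07-25 21:27:32.388466' "
            "LIMIT 20000"
        )
        conn.commit()
        if cursor.rowcount == 0:  # all expired try rows gone
            break
        time.sleep(1)  # brief pause between batches to ease lock pressure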
Trying on stage now that the disabling of perf data cycling has landed in bug 1346567.
$ ths run:detached --size=performance-m -- ./manage.py cycle_data --debug --sleep-time 0 --chunk-size 1000
https://papertrailapp.com/systems/treeherder-stage/events?q=program%3A%2Frun.6558
Trying with chunk size 500 again:
$ th{d,s,p} run:detached --size=performance-m -- ./manage.py cycle_data --debug --sleep-time 0 --chunk-size 500
https://papertrailapp.com/systems/treeherder-prototype/events?q=program%3A%2Frun.6393
https://papertrailapp.com/systems/treeherder-stage/events?q=program%3A%2Frun.9926
https://papertrailapp.com/systems/treeherder-prod/events?q=program%3A%2Frun.2630

The prod one timed out at:
`Machine.objects.exclude(id__in=used_machine_ids).delete()`
(since it must be doing an unnecessary intermediate SELECT)
...so I've manually run the delete using:
`DELETE FROM machine WHERE machine.id NOT IN (SELECT machine_id FROM job);`
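(If the orphaned-machine cleanup ever needs running from inside the app rather than a MySQL console, Django's raw cursor runs the same single server-side statement without the ORM's intermediate SELECT. A sketch only, not what was done here, which was a manual query.)

    from django.db import connection, transaction

    def delete_orphaned_machines():
        """Delete machine rows that no job references, in one server-side
        statement, instead of letting the ORM fetch the id list first."""
        with transaction.atomic(), connection.cursor() as cursor:
            cursor.execute(
                "DELETE FROM machine "
                "WHERE machine.id NOT IN (SELECT machine_id FROM job)"
            )
            return cursor.rowcount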
After prod finished, I ran:
* an `OPTIMIZE TABLE job_log` (which freed up 20GB),
* then an `OPTIMIZE TABLE job` (which would have freed up 10-20GB, but failed due to "Duplicate entry '153230551' for key 'PRIMARY'" - which I think is due to a race condition that can occur during ALTER TABLE while writes are still continuing),
* then `OPTIMIZE TABLE failure_line` (which freed up 171GB!)
(In reply to Ed Morley [:emorley] from comment #12)
> After prod finished, I ran:

* `OPTIMIZE TABLE job_detail` - which freed another 121 GB

Free space is now at ~500GB of 1TB on prod (dev/stage still need the same treatment to get to that point).
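(OPTIMIZE TABLE rebuilds the table, so it is slow and briefly blocks writes; it only pays off on tables with a lot of reclaimable space. A hedged way to pick candidates first, using information_schema's data_free estimate; the 10GB threshold here is arbitrary.)

    from django.db import connection

    def tables_worth_optimizing(min_free_gb=10):
        """List tables whose estimated reclaimable space (data_free) exceeds
        the threshold, largest first."""
        with connection.cursor() as cursor:
            cursor.execute(
                "SELECT table_name, ROUND(data_free / 1024 / 1024 / 1024, 1) AS free_gb "
                "FROM information_schema.tables "
                "WHERE table_schema = DATABASE() AND data_free > %s "
                "ORDER BY data_free DESC",
                [min_free_gb * 1024 ** 3],
            )
            return cursor.fetchall()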
dev/stage were sorted out too; now all three instances have at least 500GB out of 1TB free \o/
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED