Closed Bug 1284432 Opened 8 years ago Closed 7 years ago

cycle-data is permanently failing on stage/prod

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P1)


Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: emorley, Unassigned)

References

(Blocks 2 open bugs)

Details

Attachments

(5 files)

The failure line select is from:
https://github.com/mozilla/treeherder/blob/c183a7abebb14936ba669cf2c71a1309d907f139/treeherder/model/derived/jobs.py#L635-L636

However, even if this completed, the results wouldn't be used on SCL3 stage/prod at the moment: there is no Elasticsearch instance there, so the es_delete() would be a no-op.
Flags: needinfo?(james)
Attachment #8771906 - Flags: review?(emorley)
Assignee: nobody → james
Attachment #8771906 - Flags: review?(emorley) → review+
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/e54757eb13c6ff3b7eede0f406c848e9cc5bd89f
Bug 1284432 - Delete failure lines from es in chunks to prevent slow queries. (#1699)
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Attachment #8773280 - Flags: review?(emorley)
Attachment #8773280 - Flags: review?(emorley) → review+
This is still happening - James, could you take a look urgently? It's now been several weeks since we last expired any data.

A quick win would be making the change mentioned in the refactor PR (that is, not performing the lookups that are only used for the ES deletes if we don't actually have an ELASTICSEARCH_URL set). However, we'll still need to fix this properly before turning on ES later.
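
Roughly what that quick win could look like - a sketch only, with illustrative names rather than the actual jobs.py code (es_delete_ids is a hypothetical helper):

    import os

    from treeherder.model.models import FailureLine

    ELASTICSEARCH_URL = os.environ.get("ELASTICSEARCH_URL", "")

    def cycle_failure_lines(job_guids):
        failure_lines = FailureLine.objects.filter(job_guid__in=job_guids)
        if ELASTICSEARCH_URL:
            # Only collect the ids needed for the ES delete when ES is in use.
            es_ids = list(failure_lines.values_list("id", flat=True))
            es_delete_ids(es_ids)  # hypothetical helper wrapping the ES client
        # The MySQL rows are deleted either way.
        failure_lines.delete()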
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to Ed Morley [:emorley] from comment #5)
> This is still happening - James could you take a look urgently? It's now
> been several weeks since we've expired any data.

A month, in fact. (Which means, e.g., our SCL3 stage DB is now double the size it was previously, and prod is 25% larger.)
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/4d2b2c8681b44db569f52c66a08f239ba2a91e16
Bug 1284432 - Refactor the code for deleting chunks of failure lines from elasticsearch to make the intent clearer (#1715)
    cycle interval... 45 days, 0:00:00
    Cycling Database: addon-sdk
    Deleted 0 jobs from addon-sdk
    Cycling Database: alder
    Deleted 0 jobs from alder
    Cycling Database: ash
    Deleted 0 jobs from ash
    Cycling Database: autoland
    Traceback (most recent call last):
      File "./manage.py", line 11, in <module>
        execute_from_command_line(sys.argv)
      File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/core/management/__init__.py", line 354, in execute_from_command_line
        utility.execute()
      File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/core/management/__init__.py", line 346, in execute
        self.fetch_command(subcommand).run_from_argv(self.argv)
      File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/core/management/base.py", line 394, in run_from_argv
        self.execute(*args, **cmd_options)
      File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/core/management/base.py", line 445, in execute
        output = self.handle(*args, **options)
      File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/model/management/commands/cycle_data.py", line 68, in handle
        options['sleep_time'])
      File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/model/derived/jobs.py", line 645, in cycle_data
        chunk_size, sleep_time)
      File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/model/utils.py", line 44, in orm_delete
        id__in=delete_ids[lower_bound:lower_bound+chunk_size]).delete()
      File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/db/models/query.py", line 536, in delete
        collector.collect(del_query)
      File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/db/models/deletion.py", line 229, in collect
        field.rel.on_delete(self, field, sub_objs, self.using)
      File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/db/models/deletion.py", line 18, in CASCADE
        source_attr=field.name, nullable=field.null)
      File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/db/models/deletion.py", line 228, in collect
        elif sub_objs:
      File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/db/models/query.py", line 170, in __nonzero__
        return type(self).__bool__(self)
      File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/db/models/query.py", line 166, in __bool__
        self._fetch_all()
      File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/db/models/query.py", line 965, in _fetch_all
        self._result_cache = list(self.iterator())
      File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/db/models/query.py", line 238, in iterator
        results = compiler.execute_sql()
      File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 840, in execute_sql
        cursor.execute(sql, params)
      File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/db/backends/utils.py", line 64, in execute
        return self.cursor.execute(sql, params)
      File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/db/utils.py", line 98, in __exit__
        six.reraise(dj_exc_type, dj_exc_value, traceback)
      File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/db/backends/utils.py", line 64, in execute
        return self.cursor.execute(sql, params)
      File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/db/backends/mysql/base.py", line 124, in execute
        return self.cursor.execute(query, args)
      File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/MySQLdb/cursors.py", line 226, in execute
        self.errorhandler(self, exc, value)
      File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
        raise errorvalue
    django.db.utils.OperationalError: (2013, 'Lost connection to MySQL server during query') 

I think this is now failing when trying to delete from the new jobs table.

I have a suspicion that the problem here is that Django's attempt to "emulate" ON DELETE CASCADE in Python is slow. It *might* be possible to fix this using on_delete=models.DO_NOTHING and relying on the database to cascade across relationships.
If we do that, though, it seems necessary to set up the ON DELETE behaviour in the database manually (i.e. using raw SQL), since Django doesn't appear to support specifying this.
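
A minimal sketch of that approach, assuming hypothetical model, table, and constraint names rather than Treeherder's actual schema: tell Django to skip its Python-side cascade, then rebuild the foreign key with ON DELETE CASCADE via a raw-SQL migration.

    # models.py (sketch): Django performs no cascade work for this relation.
    from django.db import models

    class JobDetail(models.Model):
        job = models.ForeignKey("Job", on_delete=models.DO_NOTHING)

    # migrations/000X_cascade.py (sketch): Django won't emit ON DELETE CASCADE
    # itself, so rebuild the constraint with raw SQL.
    from django.db import migrations

    class Migration(migrations.Migration):
        dependencies = [("model", "000X_previous")]
        operations = [
            migrations.RunSQL(
                "ALTER TABLE job_detail "
                "DROP FOREIGN KEY job_detail_job_id_fk, "
                "ADD CONSTRAINT job_detail_job_id_fk FOREIGN KEY (job_id) "
                "REFERENCES job (id) ON DELETE CASCADE"
            ),
        ]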
Attachment #8774916 - Flags: review?(emorley)
Comment on attachment 8774916 [details] [review]
[treeherder] mozilla:cycle_data_delete_cascade > mozilla:master

I've left a comment :-)
Attachment #8774916 - Flags: review?(emorley)
I'll see if I can dig into this now that we're off datasource.
Assignee: james → emorley
So it turns out cycle-data hasn't run successfully since 15th January, because it exceeds the max memory usage and the dyno gets terminated :-(
(In reply to Ed Morley [:emorley] from comment #13)
> So it turns out cycle-data hasn't run successfully since 15th January,
> because it exceeds the max memory usage and the dyno gets terminated :-(

I had a quick look at this and didn't see any smoking guns.

We have a lot more relations to cross over when doing the cascade behaviour than we did a few months ago (e.g. we now set the job id of a performance datum to null on delete); I wonder if that could be responsible.

Maybe try reducing the chunk size here and see if it helps?

https://github.com/mozilla/treeherder/blob/master/treeherder/model/management/commands/cycle_data.py#L36
Ah I believe cycle_data is spending ages on queries like:

SELECT `job`.`guid` FROM `job` WHERE (`job`.`submit_time` < '2016-10-19 20:09:49.630866' AND `job`.`repository_id` = 2)

Since there is no index on `submit_time` :-(
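
For reference, a composite index covering that WHERE clause would look something like this on the model (a sketch with field names inferred from the query above, not necessarily the exact change that should land):

    from django.db import models

    class Job(models.Model):
        repository = models.ForeignKey("Repository")
        submit_time = models.DateTimeField()

        class Meta:
            db_table = "job"
            # Lets MySQL satisfy "repository_id = ? AND submit_time < ?"
            # without a full table scan.
            index_together = [("repository", "submit_time")]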

Also I noticed that we currently leave orphaned FailureLine rows behind in the case of non-clean termination (which is every time currently), since there is no foreign key on FailureLine to job, and instead of deleting the FailureLine objects first, they are deleted after the corresponding Job objects (so we'll never go back on next run to clean up).
Summary: cycle-data is permanently failing on stage/prod with "OperationalError: (2006, 'MySQL server has gone away')" → cycle-data is permanently failing on stage/prod
(In reply to Ed Morley [:emorley] from comment #15)
> Ah I believe cycle_data is spending ages on queries like:
> 
> SELECT `job`.`guid` FROM `job` WHERE (`job`.`submit_time` < '2016-10-19
> 20:09:49.630866' AND `job`.`repository_id` = 2)
> 
> Since there is no index on `submit_time` :-(

We should probably fix this, but I don't think it would account for the OOM errors.

> Also I noticed that we currently leave orphaned FailureLine rows behind in
> the case of non-clean termination (which is every time currently), since
> there is no foreign key on FailureLine to job, and instead of deleting the
> FailureLine objects first, they are deleted after the corresponding Job
> objects (so we'll never go back on next run to clean up).

Until we fix this properly by adding a foreign key relation, we could do some kind of search for orphaned failure lines at the end of cycle data (much like we do already for some of the job metadata).
Yeah I agree, it's more that there's just so much wrong/crufty with the current implementation it just needs rewriting now we're not using datasource. I imagine the OOM is coming from casting the queryset to a list (undoing the lazy loading) prior to slicing into the chunk size.

A better implementation might be to calculate a threshold datetime early, then loop through a "delete LIMIT N" until no rows deleted, which avoids the upfront loading of hundreds of thousands of jobs.
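
As a sketch of that idea (raw SQL through Django's connection; it assumes the child rows are handled by database-level ON DELETE CASCADE or separately, and it is not the code that eventually landed):

    import time

    from django.db import connection

    def cycle_jobs_raw(repository_id, cutoff, chunk_size=1000, sleep_time=0):
        """Delete expired jobs chunk_size rows at a time until none remain."""
        while True:
            with connection.cursor() as cursor:
                cursor.execute(
                    "DELETE FROM job "
                    "WHERE repository_id = %s AND submit_time < %s "
                    "LIMIT %s",
                    [repository_id, cutoff, chunk_size],
                )
                deleted = cursor.rowcount
            if deleted == 0:
                break
            if sleep_time:
                time.sleep(sleep_time)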
(In reply to Ed Morley [:emorley] from comment #17)
> Yeah I agree, it's more that there's just so much wrong/crufty with the
> current implementation it just needs rewriting now we're not using
> datasource. I imagine the OOM is coming from casting the queryset to a list
> (undoing the lazy loading) prior to slicing into the chunk size.
> 
> A better implementation might be to calculate a threshold datetime early,
> then loop through a "delete LIMIT N" until no rows deleted, which avoids the
> upfront loading of hundreds of thousands of jobs.

You can limit the number of records returned by adding something like `[:chunk_size]` to the query set. 

https://docs.djangoproject.com/en/dev/topics/db/queries/#limiting-querysets

I think by default Django returns everything in a query set at the point you actually ask for the data to be returned. The only way to get around this is to use iterator() (but that limits you to handling one record at a time, so it would be too slow here).
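
Combining the two suggestions, the loop could fetch and delete one slice of ids at a time instead of materialising every guid up front - a rough sketch, not the final implementation (Job as in treeherder.model.models):

    import time

    from treeherder.model.models import Job

    def cycle_jobs(repository, cutoff, chunk_size=100, sleep_time=0):
        while True:
            # Only this slice of ids is loaded into memory.
            chunk = list(
                Job.objects.filter(repository=repository,
                                   submit_time__lt=cutoff)
                           .values_list("id", flat=True)[:chunk_size]
            )
            if not chunk:
                break
            Job.objects.filter(id__in=chunk).delete()
            if sleep_time:
                time.sleep(sleep_time)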
It's FailureLine's lack of foreign key and the ES data that makes everything unnecessarily complicated.

James, any chance of making FailureLine use foreign keys to Job, and for ES data to use another form of data expiry (eg its own date based deletion)?
Flags: needinfo?(james)
Totally happy for failure_line to have a foreign key to job; that didn't work when it was first implemented. Note that it's probably quite a slow change to run on prod, but maybe it would be possible to roll it up into an update which removed the classification related items once the new UI for autoclassify has landed.

I think that ES can be taught to expire its own documents using TTL, but a better approach is apparently to structure the data into date-based indices so that you can delete an entire index at a time. I haven't investigated how much work that would be for us.
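
For illustration only (the index naming scheme here is an assumption, not something Treeherder uses), date-based indices make expiry a single metadata operation instead of per-document deletes:

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])

    # Write each failure line into an index named after its month...
    es.index(index="failure-lines-2016.07", doc_type="failure_line",
             body={"job_guid": "abc123/1", "line": 1, "action": "test_result"})

    # ...then expiring a whole month of data is one call, no document scans.
    es.indices.delete(index="failure-lines-2016.07")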
Flags: needinfo?(james)
(In reply to Ed Morley [:emorley] from comment #19)
> It's FailureLine's lack of foreign key and the ES data that makes everything
> unnecessarily complicated.
> 
> James, any chance of making FailureLine use foreign keys to Job, and for ES
> data to use another form of data expiry (eg its own date based deletion)?

I agree that this would be nice for a lot of reasons (it would also be much more space efficient) but I don't know if it's necessary to fix this bug. I think in the short term we should do the following:

* Add a composite index on submit_time / repository to speed up getting the guids
* Query the guid list in chunks (instead of getting everything and then chunking it)
* Make the chunk size configurable via an environment variable, and experiment with lowering it from its current value of 5000 (as I outlined in comment 14).
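
The last point could be as simple as reading defaults from the environment in the management command - a sketch only; the environment variable names are made up:

    import os

    from django.core.management.base import BaseCommand

    class Command(BaseCommand):
        help = "Cycle (expire) old job data"

        def add_arguments(self, parser):
            parser.add_argument(
                "--chunk-size", type=int, dest="chunk_size",
                default=int(os.environ.get("CYCLE_DATA_CHUNK_SIZE", 5000)),
            )
            parser.add_argument(
                "--sleep-time", type=int, dest="sleep_time",
                default=int(os.environ.get("CYCLE_DATA_SLEEP_TIME", 0)),
            )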
We also need to rearrange the order and delete the FailureLine/ES entries prior to deleting the jobs, to ensure orphaned rows aren't left behind if the task doesn't complete successfully.

I'm not saying this bug cannot be fixed without these other changes; I'm just growing tired of adding hacks around existing problems.
And I'm also concerned about the ES expiration parts, since they are currently not running on stage/prod (there is no ES instance being used there yet) - so they may add to our problems once active.

If tackling that will require restructuring the ES data into date-based indices (comment 20), then it seems worth checking whether we should do that before bug 1321798 lands.

I've filed bug 1340505 for ES expiration, and bug 1340503 for adding the job foreign key to failure_line.
I'm going to look into this a bit.
Assignee: emorley → wlachance
Comment on attachment 8843076 [details] [review]
[treeherder] wlach:1284432 > mozilla:master

I know this doesn't solve all the problems, but it should be ok for a stopgap.
Attachment #8843076 - Flags: review?(emorley)
Comment on attachment 8843076 [details] [review]
[treeherder] wlach:1284432 > mozilla:master

Clearing review until OOM issues resolved :-)
Attachment #8843076 - Flags: review?(emorley)
Comment on attachment 8843076 [details] [review]
[treeherder] wlach:1284432 > mozilla:master

Updated for new chunk size / sleep time intervals based on testing on stage + rebased the migrations.
Attachment #8843076 - Flags: review?(emorley)
Attachment #8843076 - Flags: review?(emorley) → review+
Commits pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/f0e483ce02ad0488010d29075e08541b7dd99c90
Bug 1284432 - Add an index to speed up finding jobs to cycle

https://github.com/mozilla/treeherder/commit/abbacd99079d35d73d9489f287284d0b4a56c4b6
Bug 1284432 - Make cycling data more efficient

Don't get all job guids per project at once, then chunk, then delete the
chunks. Get a chunk, delete that chunk, then get another, delete that,
etc. This should be more memory and time efficient.

https://github.com/mozilla/treeherder/commit/1b795317ca80a880819f0abf3da6f20d821797ed
Bug 1284432 - Cycle jobs after failure line data

For historical reasons, the failure line data is currently related to the job
via the guid (not a foreign key relation). To avoid the case where we have
orphan data where we die halfway through deleting the job data, make sure
we cycle the failure line data *before* any jobs.

https://github.com/mozilla/treeherder/commit/8dd0e3c4f26cdd620181606123943e5ab70cd598
Bug 1284432 - Make chunk size / sleep time for cycle data command configurable

https://github.com/mozilla/treeherder/commit/f84a9b8c566598a388e10900115f92e758150ad8
Bug 1284432 - New defaults for chunk size / sleep time for cycle data

Turns out that a small chunk size (100) and no sleep time works best
for our current setup.
Comment on attachment 8844516 [details] [review]
[treeherder] wlach:1284432-followup > mozilla:master

The current timeout for cycle data is set to 15 minutes, which is probably too short (especially if there's a backlog of data to be cycled, as there is currently). Let's set it to a higher value, say 3 hours.
Attachment #8844516 - Flags: review?(emorley)
Attachment #8844516 - Flags: review?(emorley) → review+
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/e50a675502f0ad66cbe2984e11f3bb87dccb3ab7
Bug 1284432 - Set timeout of cycle data command to 3 hours (#2232)

The default time limit of a celery task is 15 minutes, which is probably
too short for this one (especially if there is a backlog of data to be
cycled).
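
For reference, the task-side equivalent looks roughly like this (task name and module are assumptions; only the 3-hour hard limit is what the commit above changes):

    from celery import shared_task
    from django.core.management import call_command

    @shared_task(name="cycle-data",
                 soft_time_limit=3 * 60 * 60,
                 time_limit=3 * 60 * 60 + 60)
    def cycle_data():
        # Delegates to the cycle_data management command.
        call_command("cycle_data")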
cycle-data is still timing out on prod, but that's because the "add an index" commit isn't on the production branch, so it's doing table scans for the job SELECT.
I started a one-off dyno earlier (detached, since otherwise it times out after 60 minutes of no console activity):
`thp run:detached --size=performance-l -- ./manage.py cycle_data --debug --sleep-time 0 --chunk-size 1000` 

It's been going for about 10 hours now, but slowly getting there:

Mar 10 15:03:21 treeherder-prod app/run.4457: cycle interval... 120 days, 0:00:00 
Mar 10 15:03:21 treeherder-prod app/run.4457: Cycling repository: mozilla-central 
Mar 10 15:03:23 treeherder-prod app/run.4457: Deleted 0 jobs from mozilla-central 
Mar 10 15:03:23 treeherder-prod app/run.4457: Cycling repository: mozilla-inbound 
Mar 10 15:03:42 treeherder-prod app/run.4457: Deleted 288 jobs from mozilla-inbound 
Mar 10 15:03:42 treeherder-prod app/run.4457: Cycling repository: b2g-inbound 
Mar 10 15:03:42 treeherder-prod app/run.4457: Deleted 0 jobs from b2g-inbound 
Mar 10 15:03:42 treeherder-prod app/run.4457: Cycling repository: try 
Mar 10 21:44:46 treeherder-prod app/run.4457: Deleted 2490962 jobs from try 
Mar 10 21:44:46 treeherder-prod app/run.4457: Cycling repository: mozilla-aurora 
Mar 10 22:55:49 treeherder-prod app/run.4457: Deleted 567259 jobs from mozilla-aurora 
Mar 10 22:55:49 treeherder-prod app/run.4457: Cycling repository: mozilla-beta 
Mar 10 23:40:17 treeherder-prod app/run.4457: Deleted 308836 jobs from mozilla-beta 
Mar 10 23:40:17 treeherder-prod app/run.4457: Cycling repository: mozilla-release 
Mar 10 23:52:29 treeherder-prod app/run.4457: Deleted 84975 jobs from mozilla-release 
...
Mar 10 23:52:29 treeherder-prod app/run.4457: Cycling repository: build-system 
Mar 10 23:52:29 treeherder-prod app/run.4457: Deleted 0 jobs from build-system 
Mar 10 23:52:29 treeherder-prod app/run.4457: Cycling repository: fx-team 


Interestingly, even with a chunk size of 1000, the memory usage hasn't exceeded 200MB:
https://papertrailapp.com/systems/treeherder-prod/events?q=program%3A%2Frun.4457
We seem to have finally caught up with the main deletions; however, the reference-data cleanup that follows them is failing:

Deleted 0 jobs from webrender 
Traceback (most recent call last): 
  File "./manage.py", line 11, in <module> 
    execute_from_command_line(sys.argv) 
  File "/app/.heroku/python/lib/python2.7/site-packages/django/core/management/__init__.py", line 367, in execute_from_command_line 
    utility.execute() 
  File "/app/.heroku/python/lib/python2.7/site-packages/django/core/management/__init__.py", line 359, in execute 
    self.fetch_command(subcommand).run_from_argv(self.argv) 
  File "/app/.heroku/python/lib/python2.7/site-packages/django/core/management/base.py", line 294, in run_from_argv 
    self.execute(*args, **cmd_options) 
  File "/app/.heroku/python/lib/python2.7/site-packages/django/core/management/base.py", line 345, in execute 
    output = self.handle(*args, **options) 
  File "/app/treeherder/model/management/commands/cycle_data.py", line 66, in handle 
    self.cycle_non_job_data(options['chunk_size'], options['sleep_time']) 
  File "/app/treeherder/model/management/commands/cycle_data.py", line 81, in cycle_non_job_data 
    Machine.objects.exclude(id__in=used_machine_ids).delete() 
  File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/query.py", line 608, in delete 
    collector.collect(del_query) 
  File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/deletion.py", line 191, in collect 
    reverse_dependency=reverse_dependency) 
  File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/deletion.py", line 89, in add 
    if not objs: 
  File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/query.py", line 264, in __nonzero__ 
    return type(self).__bool__(self) 
  File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/query.py", line 260, in __bool__ 
    self._fetch_all() 
  File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/query.py", line 1087, in _fetch_all 
    self._result_cache = list(self.iterator()) 
  File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/query.py", line 54, in __iter__ 
    results = compiler.execute_sql() 
  File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 835, in execute_sql 
    cursor.execute(sql, params) 
  File "/app/.heroku/python/lib/python2.7/site-packages/django/db/backends/utils.py", line 64, in execute 
    return self.cursor.execute(sql, params) 
  File "/app/.heroku/python/lib/python2.7/site-packages/django/db/utils.py", line 94, in __exit__ 
    six.reraise(dj_exc_type, dj_exc_value, traceback) 
  File "/app/.heroku/python/lib/python2.7/site-packages/django/db/backends/utils.py", line 64, in execute 
    return self.cursor.execute(sql, params) 
  File "/app/.heroku/python/lib/python2.7/site-packages/django/db/backends/mysql/base.py", line 110, in execute 
    return self.cursor.execute(query, args) 
  File "/app/.heroku/python/lib/python2.7/site-packages/MySQLdb/cursors.py", line 250, in execute 
    self.errorhandler(self, exc, value) 
  File "/app/.heroku/python/lib/python2.7/site-packages/MySQLdb/connections.py", line 42, in defaulterrorhandler 
    raise errorvalue 
django.db.utils.OperationalError: (2006, 'MySQL server has gone away')
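
The delete has to ship an enormous NOT IN id list to MySQL (and Django's cascade collector then issues further queries against it), which is presumably what the server drops the connection on. One possible shape for a fix - just a sketch, since the follow-up work moved to bug 1346567 - is to walk the machine table in id ranges so each query stays small:

    from treeherder.model.models import Job, Machine

    def cycle_machines(chunk_size=1000):
        used_machine_ids = set(
            Job.objects.values_list("machine_id", flat=True).distinct()
        )
        last_id = 0
        while True:
            # One small range query per iteration instead of one huge NOT IN.
            ids = list(
                Machine.objects.filter(id__gt=last_id)
                               .order_by("id")
                               .values_list("id", flat=True)[:chunk_size]
            )
            if not ids:
                break
            stale = [i for i in ids if i not in used_machine_ids]
            if stale:
                Machine.objects.filter(id__in=stale).delete()
            last_id = ids[-1]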
Depends on: 1346567
Assignee: wlachance → nobody
Closing this in favour of bug 1346567.
Status: REOPENED → RESOLVED
Closed: 8 years ago → 7 years ago
Resolution: --- → INCOMPLETE