Bug 1284432 - cycle-data is permanently failing on stage/prod
Status: CLOSED, RESOLVED INCOMPLETE (opened 8 years ago, closed 7 years ago)
Category: Tree Management :: Treeherder: Infrastructure, defect, P1
Tracking: not tracked
People: Reporter: emorley; Unassigned
References: blocks 2 open bugs
Attachments: 5 files
1. GitHub pull request (47 bytes, text/x-github-pull-request) | emorley: review+
2. GitHub pull request (47 bytes, text/x-github-pull-request) | emorley: review+
3. GitHub pull request (47 bytes, text/x-github-pull-request)
4. GitHub pull request (47 bytes, text/x-github-pull-request) | emorley: review+
5. GitHub pull request (47 bytes, text/x-github-pull-request) | emorley: review+
Description • 8 years ago (Reporter)

Has failed every one of the last 7 days of daily runs:
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors?search%5Bname%5D=cycle-data&search%5Bsort_by%5D=count&search%5Btype%5D=all&tw%5Bend%5D=1467710357&tw%5Bstart%5D=1467105557&utf8=%E2%9C%93
(and likely failing for some time before that)

Example:
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors/cb37e0-d0184c14-427e-11e6-bd66-b82a72d22a14

Slow transaction trace:
https://rpm.newrelic.com/accounts/677903/applications/4180461/transactions?show_browser=false&tw%5Bend%5D=1467710296&tw%5Bstart%5D=1467105496&type=other#id=5b224f746865725472616e73616374696f6e2f43656c6572792f6379636c652d64617461222c22225d&app_trace_id=cb37e0-c932585d-41b5-11e6-bd66-b82a72d22a14

95% of the profile is spent in 'MySQL failure_line select' -> seems like a regression from bug 1268484 or friends.
Flags: needinfo?(james)
Reporter | Comment 1 • 8 years ago
The failure line select comes from:
https://github.com/mozilla/treeherder/blob/c183a7abebb14936ba669cf2c71a1309d907f139/treeherder/model/derived/jobs.py#L635-L636

However, even if this select completed, the results wouldn't be used on SCL3 stage/prod at the moment: there is no elasticsearch instance there, so the es_delete() will be a no-op.
Comment 2 • 8 years ago
Updated • 8 years ago
Flags: needinfo?(james)
Attachment #8771906 - Flags: review?(emorley)
Reporter | Updated • 8 years ago
Assignee: nobody → james
Reporter | Updated • 8 years ago
Attachment #8771906 - Flags: review?(emorley) → review+
Comment 3 • 8 years ago
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/e54757eb13c6ff3b7eede0f406c848e9cc5bd89f
Bug 1284432 - Delete failure lines from es in chunks to prevent slow queries. (#1699)
Updated • 8 years ago
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Comment 4 • 8 years ago
Updated • 8 years ago
Attachment #8773280 - Flags: review?(emorley)
Reporter | Updated • 8 years ago
Attachment #8773280 - Flags: review?(emorley) → review+
Reporter | Comment 5 • 8 years ago
This is still happening - James, could you take a look urgently? It's now been several weeks since we've expired any data.

A quick win would be making the change mentioned in the refactor PR: not performing the lookups that are only used for ES deletes if we don't actually have an ELASTICSEARCH_URL set. However, we'll still need to fix this properly before turning on ES later.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
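A minimal sketch of the "quick win" from comment 5, assuming a Django setting named ELASTICSEARCH_URL as the comment describes; the function shape, the FailureLine import path, and the es_delete() helper are illustrative rather than the actual Treeherder code:

from django.conf import settings

from treeherder.model.models import FailureLine  # import path is an assumption


def delete_failure_lines_from_es(job_guids):
    # On SCL3 stage/prod there is no ES instance configured, so es_delete()
    # would be a no-op anyway; skip the expensive failure_line selects that
    # exist only to feed it.
    if not settings.ELASTICSEARCH_URL:
        return
    failure_line_ids = FailureLine.objects.filter(
        job_guid__in=job_guids).values_list('id', flat=True)
    es_delete(failure_line_ids)  # hypothetical ES deletion helper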
Reporter | Comment 6 • 8 years ago
(In reply to Ed Morley [:emorley] from comment #5)
> This is still happening - James could you take a look urgently? It's now
> been several weeks since we've expired any data.

A month in fact. (Which means e.g. our SCL3 stage DB is now double the size it was previously, and prod 25% larger.)
Comment 7 • 8 years ago
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/4d2b2c8681b44db569f52c66a08f239ba2a91e16
Bug 1284432 - Refactor the code for deleting chunks of failure lines from elasticsearch to make the intent clearer (#1715)
Comment 8 • 8 years ago
cycle interval... 45 days, 0:00:00
Cycling Database: addon-sdk
Deleted 0 jobs from addon-sdk
Cycling Database: alder
Deleted 0 jobs from alder
Cycling Database: ash
Deleted 0 jobs from ash
Cycling Database: autoland
Traceback (most recent call last):
  File "./manage.py", line 11, in <module>
    execute_from_command_line(sys.argv)
  File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/core/management/__init__.py", line 354, in execute_from_command_line
    utility.execute()
  File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/core/management/__init__.py", line 346, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/core/management/base.py", line 394, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/core/management/base.py", line 445, in execute
    output = self.handle(*args, **options)
  File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/model/management/commands/cycle_data.py", line 68, in handle
    options['sleep_time'])
  File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/model/derived/jobs.py", line 645, in cycle_data
    chunk_size, sleep_time)
  File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/model/utils.py", line 44, in orm_delete
    id__in=delete_ids[lower_bound:lower_bound+chunk_size]).delete()
  File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/db/models/query.py", line 536, in delete
    collector.collect(del_query)
  File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/db/models/deletion.py", line 229, in collect
    field.rel.on_delete(self, field, sub_objs, self.using)
  File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/db/models/deletion.py", line 18, in CASCADE
    source_attr=field.name, nullable=field.null)
  File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/db/models/deletion.py", line 228, in collect
    elif sub_objs:
  File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/db/models/query.py", line 170, in __nonzero__
    return type(self).__bool__(self)
  File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/db/models/query.py", line 166, in __bool__
    self._fetch_all()
  File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/db/models/query.py", line 965, in _fetch_all
    self._result_cache = list(self.iterator())
  File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/db/models/query.py", line 238, in iterator
    results = compiler.execute_sql()
  File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 840, in execute_sql
    cursor.execute(sql, params)
  File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
  File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/db/utils.py", line 98, in __exit__
    six.reraise(dj_exc_type, dj_exc_value, traceback)
  File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
  File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/db/backends/mysql/base.py", line 124, in execute
    return self.cursor.execute(query, args)
  File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/MySQLdb/cursors.py", line 226, in execute
    self.errorhandler(self, exc, value)
  File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
    raise errorvalue
django.db.utils.OperationalError: (2013, 'Lost connection to MySQL server during query')

I think this is now failing trying to delete from the new jobs table. I have a suspicion that the problem here is that django's attempt to "emulate" ON DELETE CASCADE is slow. It *might* be possible to fix this using on_delete="DO_NOTHING" and relying on the database to cascade across relationships.
Comment 9 • 8 years ago
Although if we do that it seems to be necessary to manually (i.e. using raw sql) set up the ON DELETE behaviour in the database since django doesn't seem to actually support specifying this.
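A hedged sketch of what comments 8 and 9 describe, with illustrative model/table names rather than the actual Treeherder schema: Django is told to skip its Python-side cascade emulation, and the ON DELETE CASCADE constraint is created with raw SQL so the database does the cascading.

from django.db import migrations, models


class FailureLine(models.Model):
    # DO_NOTHING stops Django's deletion collector from fetching related
    # rows and cascading in Python; the database FK below handles it instead.
    job = models.ForeignKey('Job', on_delete=models.DO_NOTHING)


class Migration(migrations.Migration):
    # Django won't emit ON DELETE CASCADE itself, so add it with raw SQL.
    dependencies = [('model', '0001_initial')]  # hypothetical dependency

    operations = [
        migrations.RunSQL(
            "ALTER TABLE failure_line "
            "ADD CONSTRAINT failure_line_job_fk "
            "FOREIGN KEY (job_id) REFERENCES job (id) ON DELETE CASCADE;"
        ),
    ]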
Comment 10 • 8 years ago
Updated • 8 years ago
Attachment #8774916 - Flags: review?(emorley)
Reporter | Comment 11 • 8 years ago
Comment on attachment 8774916 [details] [review]
[treeherder] mozilla:cycle_data_delete_cascade > mozilla:master

I've left a comment :-)
Attachment #8774916 - Flags: review?(emorley)
Reporter | Comment 12 • 7 years ago
I'll see if I can dig into this now that we're off datasource.
Assignee: james → emorley
Reporter | Comment 13 • 7 years ago
So it turns out cycle-data hasn't successfully run since 15th January, because it exceeds the max memory usage and the dyno gets terminated :-(
Comment 14 • 7 years ago
(In reply to Ed Morley [:emorley] from comment #13)
> So it turns out cycle-data hasn't successfully run since 15th January since
> it exceeds the max memory usage and the dyno gets terminated :-(

I had a quick look at this and didn't see any smoking guns. We have a lot more relations to cross over when doing the cascade behaviour than we did a few months ago (e.g. we now set the job id of a performance datum to null on delete); I wonder if that could be responsible.

Maybe try reducing the chunk size here and see if it helps?
https://github.com/mozilla/treeherder/blob/master/treeherder/model/management/commands/cycle_data.py#L36
Reporter | Comment 15 • 7 years ago
Ah, I believe cycle_data is spending ages on queries like:

SELECT `job`.`guid` FROM `job` WHERE (`job`.`submit_time` < '2016-10-19 20:09:49.630866' AND `job`.`repository_id` = 2)

since there is no index on `submit_time` :-(

Also I noticed that we currently leave orphaned FailureLine rows behind in the case of non-clean termination (which is every time currently): there is no foreign key from FailureLine to job, and instead of deleting the FailureLine objects first, they are deleted after the corresponding Job objects (so we'll never go back on a later run to clean them up).
Summary: cycle-data is permanently failing on stage/prod with "OperationalError: (2006, 'MySQL server has gone away')" → cycle-data is permanently failing on stage/prod
Comment 16 • 7 years ago
(In reply to Ed Morley [:emorley] from comment #15)
> Ah I believe cycle_data is spending ages on queries like:
>
> SELECT `job`.`guid` FROM `job` WHERE (`job`.`submit_time` < '2016-10-19
> 20:09:49.630866' AND `job`.`repository_id` = 2)
>
> Since there is no index on `submit_time` :-(

We should probably fix this, but I don't think it would account for the OOM errors.

> Also I noticed that we currently leave orphaned FailureLine rows behind in
> the case of non-clean termination (which is every time currently), since
> there is no foreign key on FailureLine to job, and instead of deleting the
> FailureLine objects first, they are deleted after the corresponding Job
> objects (so we'll never go back on next run to clean up).

Until we fix this properly by adding a foreign key relation, we could do some kind of search for orphaned failure lines at the end of cycle data (much like we do already for some of the job metadata).
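A hedged sketch of the orphan search suggested above; the field names follow comment 15's description of the guid-based relation, not the real schema, and the import path is an assumption.

from treeherder.model.models import FailureLine, Job  # import path assumed


def delete_orphaned_failure_lines():
    # FailureLine is related to Job only via job_guid, so any row whose guid
    # no longer matches an existing job is an orphan left by a non-clean run.
    FailureLine.objects.exclude(
        job_guid__in=Job.objects.values_list('guid', flat=True)
    ).delete()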
Reporter | Comment 17 • 7 years ago
Yeah, I agree - it's more that there's so much wrong/crufty with the current implementation that it just needs rewriting, now that we're not using datasource. I imagine the OOM is coming from casting the queryset to a list (undoing the lazy loading) prior to slicing it into chunks.

A better implementation might be to calculate a threshold datetime early, then loop through a "delete LIMIT N" until no rows are deleted, which avoids the upfront loading of hundreds of thousands of jobs.
Comment 18 • 7 years ago
(In reply to Ed Morley [:emorley] from comment #17)
> A better implementation might be to calculate a threshold datetime early,
> then loop through a "delete LIMIT N" until no rows deleted, which avoids
> the upfront loading of hundreds of thousands of jobs.

You can limit the number of records returned by adding something like `[:chunk_size]` to the queryset:
https://docs.djangoproject.com/en/dev/topics/db/queries/#limiting-querysets

I think by default django returns everything in a queryset at the point that you actually ask for data to be returned. The only way to get around this is to use iterator() (but that limits you to handling one record at a time, so would be too slow here).
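A minimal sketch combining comments 17 and 18 (model and field names are assumptions, not the actual Treeherder code): compute the cutoff once, then repeatedly select one LIMITed chunk of ids and delete it, so the full id list is never materialised in memory.

from datetime import timedelta

from django.utils import timezone

from treeherder.model.models import Job  # import path assumed


def cycle_jobs(repository, cycle_interval=timedelta(days=120), chunk_size=100):
    cutoff = timezone.now() - cycle_interval
    deleted = 0
    while True:
        # Slicing keeps this a single LIMITed SELECT; .delete() can't run on
        # a sliced queryset, so re-filter by the fetched ids.
        ids = list(Job.objects.filter(repository=repository,
                                      submit_time__lt=cutoff)
                   .values_list('id', flat=True)[:chunk_size])
        if not ids:
            break
        Job.objects.filter(id__in=ids).delete()
        deleted += len(ids)
    return deleted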
Reporter | Comment 19 • 7 years ago
It's FailureLine's lack of a foreign key, plus the ES data, that makes everything unnecessarily complicated.

James, any chance of making FailureLine use a foreign key to Job, and of having the ES data use another form of data expiry (e.g. its own date-based deletion)?
Flags: needinfo?(james)
Comment 20 • 7 years ago
Totally happy for failure_line to have a foreign key to job; that didn't work when it was first implemented. Note that it's probably quite a slow change to run on prod, but maybe it would be possible to roll it up into an update which removes the classification-related items once the new UI for autoclassify has landed.

I think that ES can be taught to expire its own documents using TTL, but a better approach is apparently to structure the data into date-based indices so that you can delete an entire index at a time. I haven't investigated how much work that would be for us.
Flags: needinfo?(james)
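A hedged sketch of the date-based-index idea from comment 20, using the elasticsearch-py client; the index naming scheme, document shape, and URL are illustrative, not Treeherder's actual setup. Expiry then becomes a cheap whole-index delete instead of per-document deletes.

from datetime import datetime, timedelta

from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')  # URL is illustrative


def index_for(day):
    return 'failure-lines-%s' % day.strftime('%Y.%m.%d')


# Documents are written into the index for their day:
es.index(index=index_for(datetime.utcnow()), doc_type='failure_line',
         body={'job_guid': 'abc123', 'line': 'example failure line'})

# Expiry: drop every whole index older than the cycle interval.
cutoff = datetime.utcnow() - timedelta(days=120)
for name in es.indices.get_alias(index='failure-lines-*'):
    day = datetime.strptime(name.replace('failure-lines-', ''), '%Y.%m.%d')
    if day < cutoff:
        es.indices.delete(index=name)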
Comment 21 • 7 years ago
(In reply to Ed Morley [:emorley] from comment #19)
> It's FailureLine's lack of foreign key and the ES data that makes everything
> unnecessarily complicated.
>
> James, any chance of making FailureLine use foreign keys to Job, and for ES
> data to use another form of data expiry (eg its own date based deletion)?

I agree that this would be nice for a lot of reasons (it would also be much more space efficient) but I don't know if it's necessary to fix this bug. I think in the short term we should do the following:

* Add a composite index on submit_time / repository to speed up getting the guids
* Query the guid list in chunks (instead of getting everything and then chunking it)
* Make the chunk size configurable via an environment variable, and experiment with lowering it from its current value of 5000 (as I outlined in comment 14).
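A minimal sketch of the first bullet above as a Django migration (the app label and dependency are hypothetical); index_together was the usual way to declare a composite index in the Django versions of this era.

from django.db import migrations


class Migration(migrations.Migration):
    dependencies = [('model', '0045_previous')]  # hypothetical dependency

    operations = [
        # Composite index so the cycle query's
        # "WHERE repository_id = ? AND submit_time < ?" filter stops doing
        # full table scans.
        migrations.AlterIndexTogether(
            name='job',
            index_together={('repository', 'submit_time')},
        ),
    ]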
Reporter | Comment 22 • 7 years ago
We also need to rearrange the order so that the FailureLine/ES entries are deleted prior to the jobs, to ensure orphaned rows aren't left behind if the task doesn't complete successfully.

I'm not saying this bug cannot be fixed without these other changes; I'm just growing tired of adding hacks around existing problems.
Reporter | Comment 23 • 7 years ago
And I'm also concerned about the ES expiration parts, since they are currently not running on stage/prod (there is no ES instance being used there yet), so they may add to our problems once active. If tackling that will require restructuring the ES data into date-based indices (comment 20), then it seems like we should check whether it's worth doing that before bug 1321798 lands.

I've filed bug 1340505 for ES expiration, and bug 1340503 for adding the job foreign key to failure_line.
Comment 25 • 7 years ago
Comment 26 • 7 years ago
Comment on attachment 8843076 [details] [review]
[treeherder] wlach:1284432 > mozilla:master

I know this doesn't solve all the problems, but it should be ok for a stopgap.
Attachment #8843076 - Flags: review?(emorley)
Reporter | Comment 27 • 7 years ago
Comment on attachment 8843076 [details] [review]
[treeherder] wlach:1284432 > mozilla:master

Clearing review until OOM issues resolved :-)
Attachment #8843076 - Flags: review?(emorley)
Comment 28 • 7 years ago
Comment on attachment 8843076 [details] [review]
[treeherder] wlach:1284432 > mozilla:master

Updated for new chunk size / sleep time intervals based on testing on stage, plus rebased the migrations.
Attachment #8843076 - Flags: review?(emorley)
Reporter | Updated • 7 years ago
Attachment #8843076 - Flags: review?(emorley) → review+
Comment 29 • 7 years ago
Commits pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/f0e483ce02ad0488010d29075e08541b7dd99c90
Bug 1284432 - Add an index to speed up finding jobs to cycle

https://github.com/mozilla/treeherder/commit/abbacd99079d35d73d9489f287284d0b4a56c4b6
Bug 1284432 - Make cycling data more efficient

Don't get all job guids per project at once, then chunk, then delete the chunks. Get a chunk, delete that chunk, then get another, delete that, etc. This should be more memory and time efficient.

https://github.com/mozilla/treeherder/commit/1b795317ca80a880819f0abf3da6f20d821797ed
Bug 1284432 - Cycle jobs after failure line data

For historical reasons, the failure line data is currently related to the job via the guid (not a foreign key relation). To avoid the case where we have orphan data where we die halfway through deleting the job data, make sure we cycle the failure line data *before* any jobs.

https://github.com/mozilla/treeherder/commit/8dd0e3c4f26cdd620181606123943e5ab70cd598
Bug 1284432 - Make chunk size / sleep time for cycle data command configurable

https://github.com/mozilla/treeherder/commit/f84a9b8c566598a388e10900115f92e758150ad8
Bug 1284432 - New defaults for chunk size / sleep time for cycle data

Turns out that a small chunk size (100) and no sleep time works best for our current setup.
Comment 30 • 7 years ago
Comment 31 • 7 years ago
Comment on attachment 8844516 [details] [review]
[treeherder] wlach:1284432-followup > mozilla:master

The current timeout for cycle data is set to 15 minutes, which is probably too short (especially if there's a backlog of data to be cycled, as there is currently). Let's set it to a higher value, say 3 hours.
Attachment #8844516 - Flags: review?(emorley)
Reporter | Updated • 7 years ago
Attachment #8844516 - Flags: review?(emorley) → review+
Comment 32 • 7 years ago
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/e50a675502f0ad66cbe2984e11f3bb87dccb3ab7
Bug 1284432 - Set timeout of cycle data command to 3 hours (#2232)

The default time limit of a celery task is 15 minutes, which is probably too short for this one (especially if there is a backlog of data to be cycled).
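For reference, raising a Celery task's time limit looks roughly like this; the task name, app name, and body are assumptions, not the actual Treeherder code.

from celery import Celery

app = Celery('treeherder')


# soft_time_limit raises SoftTimeLimitExceeded inside the task at 3 hours,
# instead of the task being killed at the previously configured 15 minutes.
@app.task(name='cycle-data', soft_time_limit=3 * 60 * 60)
def cycle_data():
    from django.core.management import call_command
    call_command('cycle_data')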
Reporter | Comment 33 • 7 years ago
cycle-data is still timing out on prod, but that's because the "add an index" commit isn't on the production branch, so it's doing table scans for the job SELECT.
Reporter | Comment 34 • 7 years ago
I started a one-off dyno earlier (detached, since otherwise it times out after 60 mins of no console activity):

`thp run:detached --size=performance-l -- ./manage.py cycle_data --debug --sleep-time 0 --chunk-size 1000`

It's been going for about 10 hours now, but slowly getting there:

Mar 10 15:03:21 treeherder-prod app/run.4457: cycle interval... 120 days, 0:00:00
Mar 10 15:03:21 treeherder-prod app/run.4457: Cycling repository: mozilla-central
Mar 10 15:03:23 treeherder-prod app/run.4457: Deleted 0 jobs from mozilla-central
Mar 10 15:03:23 treeherder-prod app/run.4457: Cycling repository: mozilla-inbound
Mar 10 15:03:42 treeherder-prod app/run.4457: Deleted 288 jobs from mozilla-inbound
Mar 10 15:03:42 treeherder-prod app/run.4457: Cycling repository: b2g-inbound
Mar 10 15:03:42 treeherder-prod app/run.4457: Deleted 0 jobs from b2g-inbound
Mar 10 15:03:42 treeherder-prod app/run.4457: Cycling repository: try
Mar 10 21:44:46 treeherder-prod app/run.4457: Deleted 2490962 jobs from try
Mar 10 21:44:46 treeherder-prod app/run.4457: Cycling repository: mozilla-aurora
Mar 10 22:55:49 treeherder-prod app/run.4457: Deleted 567259 jobs from mozilla-aurora
Mar 10 22:55:49 treeherder-prod app/run.4457: Cycling repository: mozilla-beta
Mar 10 23:40:17 treeherder-prod app/run.4457: Deleted 308836 jobs from mozilla-beta
Mar 10 23:40:17 treeherder-prod app/run.4457: Cycling repository: mozilla-release
Mar 10 23:52:29 treeherder-prod app/run.4457: Deleted 84975 jobs from mozilla-release
...
Mar 10 23:52:29 treeherder-prod app/run.4457: Cycling repository: build-system
Mar 10 23:52:29 treeherder-prod app/run.4457: Deleted 0 jobs from build-system
Mar 10 23:52:29 treeherder-prod app/run.4457: Cycling repository: fx-team

Interestingly, even with a chunk size of 1000, the memory usage hasn't exceeded 200MB:
https://papertrailapp.com/systems/treeherder-prod/events?q=program%3A%2Frun.4457
Reporter | Comment 35 • 7 years ago
We seem to have finally caught up with the main deletions; however, the reference data cleanup that follows them is failing:

Deleted 0 jobs from webrender
Traceback (most recent call last):
  File "./manage.py", line 11, in <module>
    execute_from_command_line(sys.argv)
  File "/app/.heroku/python/lib/python2.7/site-packages/django/core/management/__init__.py", line 367, in execute_from_command_line
    utility.execute()
  File "/app/.heroku/python/lib/python2.7/site-packages/django/core/management/__init__.py", line 359, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/app/.heroku/python/lib/python2.7/site-packages/django/core/management/base.py", line 294, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/app/.heroku/python/lib/python2.7/site-packages/django/core/management/base.py", line 345, in execute
    output = self.handle(*args, **options)
  File "/app/treeherder/model/management/commands/cycle_data.py", line 66, in handle
    self.cycle_non_job_data(options['chunk_size'], options['sleep_time'])
  File "/app/treeherder/model/management/commands/cycle_data.py", line 81, in cycle_non_job_data
    Machine.objects.exclude(id__in=used_machine_ids).delete()
  File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/query.py", line 608, in delete
    collector.collect(del_query)
  File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/deletion.py", line 191, in collect
    reverse_dependency=reverse_dependency)
  File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/deletion.py", line 89, in add
    if not objs:
  File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/query.py", line 264, in __nonzero__
    return type(self).__bool__(self)
  File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/query.py", line 260, in __bool__
    self._fetch_all()
  File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/query.py", line 1087, in _fetch_all
    self._result_cache = list(self.iterator())
  File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/query.py", line 54, in __iter__
    results = compiler.execute_sql()
  File "/app/.heroku/python/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 835, in execute_sql
    cursor.execute(sql, params)
  File "/app/.heroku/python/lib/python2.7/site-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
  File "/app/.heroku/python/lib/python2.7/site-packages/django/db/utils.py", line 94, in __exit__
    six.reraise(dj_exc_type, dj_exc_value, traceback)
  File "/app/.heroku/python/lib/python2.7/site-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
  File "/app/.heroku/python/lib/python2.7/site-packages/django/db/backends/mysql/base.py", line 110, in execute
    return self.cursor.execute(query, args)
  File "/app/.heroku/python/lib/python2.7/site-packages/MySQLdb/cursors.py", line 250, in execute
    self.errorhandler(self, exc, value)
  File "/app/.heroku/python/lib/python2.7/site-packages/MySQLdb/connections.py", line 42, in defaulterrorhandler
    raise errorvalue
django.db.utils.OperationalError: (2006, 'MySQL server has gone away')
Updated • 7 years ago
Assignee: wlachance → nobody
Reporter | Comment 36 • 7 years ago
Closing this in favour of bug 1346567.
Status: REOPENED → RESOLVED
Closed: 8 years ago → 7 years ago
Resolution: --- → INCOMPLETE