Closed Bug 1239185 · Opened 9 years ago · Closed 9 years ago

Stop using the reference data model for job ingestion

Categories

(Tree Management :: Treeherder, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: wlach, Assigned: wlach)

References

Details

Attachments

(3 files)

The reference data model is overengineered compared to just using the Django ORM to look up (or create) the objects we need for job ingestion. I took a stab at starting to remove it, going to post my work to date. I think I'm mostly done.
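For illustration, a minimal sketch of the direction being described here: looking up (or creating) a job's reference objects directly with the Django ORM's get_or_create rather than going through the reference data model. The model and field names below are assumptions for the sake of the example, not the actual patch.

```
# Hedged sketch only -- model/field names are assumed, not taken from the patch.
from django.db import transaction


def get_or_create_job_metadata(job_blob):
    """Look up (or create) the reference rows a job needs, using the ORM directly."""
    from treeherder.model.models import JobGroup, JobType, Machine  # assumed import path

    with transaction.atomic():
        job_group, _ = JobGroup.objects.get_or_create(
            symbol=job_blob["group_symbol"],
            defaults={"name": job_blob.get("group_name", "unknown")},
        )
        job_type, _ = JobType.objects.get_or_create(
            symbol=job_blob["job_symbol"],
            defaults={"name": job_blob.get("job_name", "unknown"),
                      "job_group": job_group},
        )
        machine, _ = Machine.objects.get_or_create(name=job_blob["machine"])
    return job_group, job_type, machine
```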
Blocks: 1231361
Comment on attachment 8707255 [details] [review]
[treeherder] wlach:1239185 > mozilla:master

So this is mostly working, except `test_job_list_filter_fields` with a parameter of `job_group_id` (tests/webapp/api/test_jobs_api.py) doesn't work, because it looks like a stray JobGroup object from `test_ingest_job_with_updated_job_group` (tests/model/derived/test_jobs_model.py) is sticking around, which throws off the id of the ingested job group by 1. At least on my local machine (still waiting for Travis to finish).

Mauro, do you know what's going on? I thought we were supposed to be flushing the db between tests? I tried some techniques from http://pytest-django.readthedocs.org/en/latest/database.html and they didn't seem to help. :(

Aside from that, I think this is pretty close to being able to land. Some nice code deletion action -- after this lands, we'll be pretty close to being able to remove the reference data model altogether.
Attachment #8707255 - Flags: feedback?(mdoglio)
I confirm that the reference data tables should be truncated at the end of each test with a series of TRUNCATE TABLE statements. I spent a couple of hours on this but couldn't work out what the problem is; I'll need to spend more time to track it down.
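For context, a minimal sketch of the two standard pytest-django mechanisms being discussed here (this is not Treeherder's actual conftest.py): the plain django_db marker rolls each test back in a transaction, while transaction=True flushes the tables afterwards, which is what's needed when the code under test commits its own transactions.

```
# Hedged sketch of pytest-django's standard database markers, for reference.
import pytest


@pytest.mark.django_db
def test_rollback_cleanup():
    # runs inside a transaction that pytest-django rolls back afterwards
    ...


@pytest.mark.django_db(transaction=True)
def test_flush_cleanup():
    # TransactionTestCase-style behaviour: tables are flushed (truncated) after
    # the test, so data committed by the code under test is cleaned up
    ...
```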
Comment on attachment 8707255 [details] [review]
[treeherder] wlach:1239185 > mozilla:master

Oops, I forgot to clear the f? flag yesterday.
Attachment #8707255 - Flags: feedback?(mdoglio)
Comment on attachment 8707255 [details] [review]
[treeherder] wlach:1239185 > mozilla:master

Ok, I figured it out! Well, partly. We were incorrectly creating the second job group in the unit test in question (test_ingest_job_with_updated_job_group), which somehow got persisted, so fixing that makes everything pass. I added a check to the unit test to make sure the second job group doesn't get created, so our testing is at least a little more comprehensive now. There's still the question of why things didn't get cleaned up, but that's almost certainly not a new issue.

I've done some basic checking locally to make sure that ingestion still seems to work reasonably -- it does. I suspect the thing to do is to experiment with landing this on stage for a couple of hours tomorrow and see how things go. Does that sound ok?
Attachment #8707255 - Flags: review?(mdoglio)
This has been running fine on stage since last Thursday; there appear to be no noticeable performance differences from doing things this way.
Comment on attachment 8707255 [details] [review]
[treeherder] wlach:1239185 > mozilla:master

Fix it, then ship it!
Attachment #8707255 - Flags: review?(mdoglio) → review+
Unfortunately there are some new errors on stage that seem related to this:
https://rpm.newrelic.com/accounts/677903/applications/5585473/filterable_errors?tw%5Bend%5D=1458044844&tw%5Bstart%5D=1457958444#/heatmap?top_facet=transactionUiName&primary_facet=error.message&barchart=barchart&_k=mwmdpc

e.g.:

/treeherder.webapp.api.jobs:JobsViewSet.create
django.db.utils:DataError: (1265, "Data truncated for column 'review_status' at row 1")
https://rpm.newrelic.com/accounts/677903/applications/5585473/traced_errors/553eb4-6a433091-eaa9-11e5-a1e3-b82a72d22a14

/treeherder.webapp.api.jobs:JobsViewSet.create
treeherder.model.models:MultipleObjectsReturned: get() returned more than one Machine -- it returned 2!
https://rpm.newrelic.com/accounts/677903/applications/5585473/traced_errors/553eb4-62fd7783-eaa9-11e5-a1e3-b82a72d22a14

/treeherder.webapp.api.artifact:ArtifactViewSet.create
jsonschema.exceptions:ValidationError: 'name' is a required property
Failed validating 'required' in schema['properties']['suites']['items']: ...
https://rpm.newrelic.com/accounts/677903/applications/5585473/traced_errors/553eb4-228fd653-ea8a-11e5-a1e3-b82a72d22a14

(though this last one, whilst previously unseen, may be unrelated to this bug)
Thanks for following up on this, Ed!

(In reply to Ed Morley [:emorley] from comment #10)
> /treeherder.webapp.api.jobs:JobsViewSet.create
> django.db.utils:DataError: (1265, "Data truncated for column 'review_status' at row 1")
> https://rpm.newrelic.com/accounts/677903/applications/5585473/traced_errors/553eb4-6a433091-eaa9-11e5-a1e3-b82a72d22a14

This looks like a problem with the database (probably caused by lax enforcement of the schema when we were using datasource). I'd be tempted to solve the problem by just removing the review_status and review_timestamp fields, since they are currently unused. Thoughts?

> /treeherder.webapp.api.jobs:JobsViewSet.create
> treeherder.model.models:MultipleObjectsReturned: get() returned more than one Machine -- it returned 2!
> https://rpm.newrelic.com/accounts/677903/applications/5585473/traced_errors/553eb4-62fd7783-eaa9-11e5-a1e3-b82a72d22a14

Hmm, likewise this looks more like an existing problem that the patch exposed. Should we really have two machines with the same name in the db? I'm tempted to remove the duplicate entry and add a unique constraint to the schema.

> /treeherder.webapp.api.artifact:ArtifactViewSet.create
> jsonschema.exceptions:ValidationError: 'name' is a required property
> Failed validating 'required' in schema['properties']['suites']['items']: ...
> https://rpm.newrelic.com/accounts/677903/applications/5585473/traced_errors/553eb4-228fd653-ea8a-11e5-a1e3-b82a72d22a14
> (though this last one, whilst previously unseen, may be unrelated to this bug)

Yes, that's bug 1256408. :)
> This looks like a problem with the database (probably caused by lax enforcement of the schema when we were using datasource). I'd be tempted to solve the problem by just removing the review_status and review_timestamp fields, since they are currently unused. Thoughts?

Sounds good to me :-)
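For illustration, a rough sketch of what a followup migration along these lines could look like (app label, dependency, and max_length are assumptions; the real change is in the followup attachment below):

```
# Hedged sketch only -- not the migration that actually landed.
from django.db import migrations, models


class Migration(migrations.Migration):
    dependencies = [("model", "0001_initial")]  # placeholder dependency

    operations = [
        # drop the unused review fields
        migrations.RemoveField(model_name="referencedatasignatures",
                               name="review_status"),
        migrations.RemoveField(model_name="referencedatasignatures",
                               name="review_timestamp"),
        # enforce machine-name uniqueness at the schema level
        migrations.AlterField(model_name="machine", name="name",
                              field=models.CharField(max_length=50, unique=True)),
    ]
```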
Comment on attachment 8730877 [details] [review]
[treeherder] wlach:1239185-followup > mozilla:master

The machine uniqueness problem will need a manual fix on stage before this lands, which I'll take care of separately.
Attachment #8730877 - Flags: review?(emorley)
Comment on attachment 8730877 [details] [review]
[treeherder] wlach:1239185-followup > mozilla:master

Looks good, thank you :-)
Attachment #8730877 - Flags: review?(emorley) → review+
I think I fixed all the issues with duplicate machines on stage (there were no dupes on prod), and the migrations above should ensure this doesn't happen again. I'm going to call this fixed - please let me know if you encounter further issues.
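The exact cleanup commands used on stage aren't recorded in this bug; as an illustration only, one way the duplicate Machine rows could be collapsed before the unique constraint is applied (assumed import path, and assuming any foreign keys pointing at the duplicates have already been repointed):

```
# Hedged sketch of a possible dedup pass -- not the commands actually run on stage.
from django.db.models import Count, Min

from treeherder.model.models import Machine  # assumed import path

dupes = (Machine.objects.values("name")
         .annotate(n=Count("id"), keep_id=Min("id"))
         .filter(n__gt=1))
for dupe in dupes:
    # keep the oldest row for each name, delete the rest
    Machine.objects.filter(name=dupe["name"]).exclude(id=dupe["keep_id"]).delete()
```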
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
I received a high error-rate alert email from New Relic this morning (stage is at a 20% exception rate). There are two new exceptions:

1) treeherder.webapp.api.jobs:JobsViewSet.create
django.db.utils:OperationalError (1054, "Unknown column 'reference_data_signatures.review_timestamp' in 'field list'")
https://rpm.newrelic.com/accounts/677903/applications/5585473/traced_errors/553eb4-e27bfe8d-eb63-11e5-a249-b82a72d22a14

2) treeherder.webapp.api.jobs:JobsViewSet.create
django.db.utils:DataError (1406, "Data too long for column 'platform' at row 1")
https://rpm.newrelic.com/accounts/677903/applications/5585473/traced_errors/553eb7-e7ecfc4c-eb62-11e5-a249-b82a72d22a14

#1 is due to the migration from the followup fix above (comment 17) having been run when Cameron tested the (rebased on master) revision hash feature branch on stage. Following that, stage was reverted back to the 'stage' branch, which is behind master and so doesn't include the code changes that need to go along with the new schema. Fixing this should be as simple as pushing current master to the stage branch so the comment 17 commits get re-deployed (which I'll do now). When rolling back changes on stage or testing feature branches, we need to keep an eye out for migrations and adjust branch points accordingly.

#2 seems new and possibly related to this bug - Will, would you mind taking a look?

Thank you for battling through this! :-)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Thanks to the recent improvements to the custom request attributes reported to New Relic, we can tell that the #2 case is being submitted by Taskcluster, with an example payload of:

{u'project': u'try', u'job': {u'build_platform': {u'platform': u'android-4-0-armv7-api15-partner1', u'os_name': u'-', u'architecture': u'-'}, u'submit_timestamp': 1458123450, u'build_system_type': u'taskcluster', u'name': u'[TC] Android armv7 API 15+ part ...

(Custom attributes are truncated to 256 chars by the New Relic Python agent for perf reasons.)

Perhaps we were just silently truncating 'android-4-0-armv7-api15-partner1' before?
(In reply to Ed Morley [:emorley] from comment #20)
> Thanks to the recent improvements to the custom request attributes reported to New Relic, we can tell that the #2 case is being submitted by Taskcluster, with an example payload of:
>
> {u'project': u'try', u'job': {u'build_platform': {u'platform': u'android-4-0-armv7-api15-partner1', u'os_name': u'-', u'architecture': u'-'}, u'submit_timestamp': 1458123450, u'build_system_type': u'taskcluster', u'name': u'[TC] Android armv7 API 15+ part ...
>
> (Custom attributes are truncated to 256 chars by the New Relic Python agent for perf reasons.)
>
> Perhaps we were just silently truncating 'android-4-0-armv7-api15-partner1' before?

Quite possibly... I'll submit a PR to bump the platform length to 50 (it is currently 25, and 'android-4-0-armv7-api15-partner1' is 32 characters).
Comment on attachment 8731288 [details] [review]
[treeherder] wlach:1239185-machinelength > mozilla:master

I wound up setting it to 100, which sounds a bit excessive, but is the same as in the reference data signature model.
Attachment #8731288 - Flags: review?(emorley)
Comment on attachment 8731288 [details] [review]
[treeherder] wlach:1239185-machinelength > mozilla:master

r+ with the migration script change added :-)
Attachment #8731288 - Flags: review?(emorley) → review+
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/56621187bc2329b00d69c303945c9be85d8c71a3
Bug 1239185 - Bump maximum platform length to 100 characters in all cases

It was 100 in the reference data signature, but only 25 in the build and machine platform models.
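For reference, the shape of such a change as a Django migration sketch (model names and the dependency are assumptions based on the commit message; the actual migration is in the commit above):

```
# Hedged sketch only -- the real migration lives in the commit linked above.
from django.db import migrations, models


class Migration(migrations.Migration):
    dependencies = [("model", "0002_previous_migration")]  # placeholder

    operations = [
        migrations.AlterField(model_name="buildplatform", name="platform",
                              field=models.CharField(max_length=100)),
        migrations.AlterField(model_name="machineplatform", name="platform",
                              field=models.CharField(max_length=100)),
    ]
```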
Is this new? (Not looked at the code, eating dinner at the moment.)
https://rpm.newrelic.com/accounts/677903/applications/5585473/filterable_errors#/show/553eb4-0414c2ca-ebb4-11e5-a249-b82a72d22a14/stack_trace?top_facet=transactionUiName&primary_facet=error.class&barchart=barchart&_k=ecuett

treeherder.webapp.api.jobs:JobsViewSet.create

...
File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/webapp/api/utils.py", line 105, in use_jobs_model
File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/webapp/api/jobs.py", line 230, in create
File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/model/derived/jobs.py", line 1006, in store_job_data
File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/model/derived/jobs.py", line 1419, in _set_data_ids
exceptions:KeyError: u'3724a051484fba3b5a4bc5826a3286270b0ccb9e'
(In reply to Ed Morley [:emorley] from comment #27)
> Is this new? (not looked at the code, eating dinner at the moment)
> https://rpm.newrelic.com/accounts/677903/applications/5585473/filterable_errors#/show/553eb4-0414c2ca-ebb4-11e5-a249-b82a72d22a14/stack_trace?top_facet=transactionUiName&primary_facet=error.class&barchart=barchart&_k=ecuett
>
> treeherder.webapp.api.jobs:JobsViewSet.create
>
> ...
> File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/webapp/api/utils.py", line 105, in use_jobs_model
> File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/webapp/api/jobs.py", line 230, in create
> File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/model/derived/jobs.py", line 1006, in store_job_data
> File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/model/derived/jobs.py", line 1419, in _set_data_ids
> exceptions:KeyError: u'3724a051484fba3b5a4bc5826a3286270b0ccb9e'

Hmm, the line in question is related to `revision_hash` and `result sets`:
https://github.com/mozilla/treeherder/blob/326b918/treeherder/model/derived/jobs.py#L1419

I'd suspect camd's recent testing of his revision_hash work is the more likely culprit here.
I'm pretty sure this is done now; things have been running smoothly on stage for a while.
Status: REOPENED → RESOLVED
Closed: 9 years ago → 9 years ago
Resolution: --- → FIXED