1215587 - Remove job_group_id from job_type table to resolve incorrect group names

Assignee

Description

•

10 years ago

This constraint means that a symbol can belong to one and only one group. But sometimes the symbols belong to different ones. This isn't a useful constraint and causes data ingestion errors because jobs get put into the wrong groups.

Cameron Dawson [:camd]

Assignee

Updated

•

10 years ago

Assignee: nobody → cdawson

Cameron Dawson [:camd]

Assignee

Comment 1

•

10 years ago

this bug also means that if a new job type is ingested once in the wrong group, it is forever ingested in that group until someone fixes the DB.

Henrik Skupin [:whimboo][⌚️UTC+2]

Comment 2

•

10 years ago

Not only the symbols are affected here but also the job name. If you have to change it, following submissions will still end up in a wrong group. We had this yesterday with update results of the firefox-ui-tests repository, whereby functional (Fu) jobs ended up in update (Fu).

Mauro Doglio [:mdoglio]

Comment 4

•

10 years ago

:camd what you mentioned in comment 1 is probably bug 1193372

Mauro Doglio [:mdoglio]

Comment 5

•

10 years ago

I'm struggling to understand why we need the same job name/symbol associated to more than one job group :( I mean, other than the 'fix a typo' case which we need to fix anyway; which use cases are we trying to cover here? :whimboo in your example in comment 2 does the 'Fu' symbol mean something different based on the group?

Henrik Skupin [:whimboo][⌚️UTC+2]

Comment 6

•

10 years ago

So with our tests we run localized builds of Firefox which locale id is used as job symbol. We have different types of tests and they end up in different groups. As of now we changed the job symbols for our update tests so the following Treeherder results will not show what has been initially be seen. https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&revision=f6ecc28fba3d&filter-job_group_symbol=Fffilter-job_group_symbol=Fu&exclusion_profile=false&fromchange=1624a69d46d7

Armen [:armenzg]

Comment 7

•

10 years ago

For instance, we run tests as with e10s enabled or not. You can see in here [1] a lot of jobs are run for both groups: I'm doing some work of running the same set of jobs that run on Buildbot. If I get the group name wrong from the beginning I will have to ask for the cached group to be fixed. I don't mind making requests to be fixed. This will happen once in a while when a new platform is brought up. [1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=52bcfd8df0a1

Cameron Dawson [:camd]

Assignee

Comment 8

•

9 years ago

FWIW: If you have a unique job name (as opposed to the symbol) then it can show up in a different group. The string of the name is part of what we currently use to determine uniqueness.

William Lachance (:wlach)

Comment 12

•

9 years ago

The fix for this isn't that difficult (make the job group another job property, instead of a property of the job type), though we will need to migrate older jobs to the new format. It might be best done as part of migrating the jobs table out of the per-project db's, which I hope to do early next quarter.

Cameron Dawson [:camd]

Assignee

Updated

•

9 years ago

Status: NEW → ASSIGNED

Cameron Dawson [:camd]

Assignee

Comment 14

•

9 years ago

Hey Will-- I was going to work on this, but it definitely makes sense for this to be coordinated with moving jobs out of the per-repo databases. Is this a bug you'd want to tackle at the same time you do that work? Or would it help if I did the change prior to that? I could certainly add these fields to the existing new job model and have the data ingestion start writing to that. I could even switch it to start reading from there instead of from the old job table. I could even run queries to populate all the jobs already in the job model with the values from the per-repo databases. I just don't want to muddy the work you'll be doing. Please let me know how you'd like to proceed.

Flags: needinfo?(wlachance)

William Lachance (:wlach)

Comment 15

•

9 years ago

In practice it will probably be a month or so before I can get around to moving the jobs to the main database (it will be the last step of the datasource migration, after the artifacts have been transitioned -- we need to do things in that order because the artifacts are referenced per job). I don't think you doing this *now* will make that more difficult, on the other hand it might make more sense to just do this as part of the ORM refactoring (sort of like what we did with bug 1270629).

Flags: needinfo?(wlachance)

Ed Morley [:emorley]

Updated

•

9 years ago

Summary: remove job_group_id from job_type table → Remove job_group_id from job_type table to resolve incorrect group names

Ed Morley [:emorley]

Comment 16

•

9 years ago

This came up on IRC today.

Priority: -- → P2

Ed Morley [:emorley]

Comment 17

•

9 years ago

And today: <ttaubert> so um I don't remember, what did we do last time treeherder ignored the .groupSymbol property given by a task? was it cached somewhere? <emorley> ah so that's bug 1215587 - camd_pto_back_2_16 normally resolves those via a manual DB edit - I'm not 100% sure what has to be changed without breaking things

William Lachance (:wlach)

Comment 18

•

9 years ago

This should be (much) easier to do now that the datasource removal is complete.

Tim Taubert [:ttaubert] (inactive)

Updated

•

8 years ago

Depends on: 1354044

Cameron Dawson [:camd]

Assignee

Comment 19

•

8 years ago

I propose these steps for this: 1. Create field PR a. Create ``job_group`` field on the ``job`` table. b. Write the ``job_group`` to both ``job.job_group`` and ``job_type.job_group``. c. Leave the API reading the ``job_type`` from ``job_type.job_group`` for now d. Make management command to back-fill from ``job_type.job_group`` to ``job.job_group`` 2. Run the management command to back-fill ``job_group`` 3. Transition PR a. Remove ``job_type.job_group`` field b. Transition API to read ``job_group`` from ``job`` table

Cameron Dawson [:camd]

Assignee

Comment 20

•

8 years ago

Talked to Ed today. I did some searching online and it doesn't look like it's faster to add a foreign key field with null=True vs. null=False, so I may as well make it True with a default to 2 (which maps to not-grouped). This way I don't have to do 2 costly table-copy migrations. The default is kind of irrelevant since we will always automatically assign a value to that field during ingestion

GitHub Autolander Bot

Comment 21

•

8 years ago

Attached file [treeherder] mozilla:job-group-independent > mozilla:master — Details

Cameron Dawson [:camd]

Assignee

Comment 22

•

8 years ago

Comment on attachment 8882270 [details] [review] [treeherder] mozilla:job-group-independent > mozilla:master Hi Ed-- Here's the first of the PRs for changing adding job_group to the ``job`` table.

Attachment #8882270 - Flags: review?(emorley)

GitHub Autolander Bot

Comment 23

•

8 years ago

Attached file [treeherder] mozilla:job-group-independent2 > mozilla:job-group-independent — Details

Ed Morley [:emorley]

Comment 24

•

8 years ago

Comment on attachment 8882270 [details] [review] [treeherder] mozilla:job-group-independent > mozilla:master Thank you for doing this! I'm still feeling pretty jetlagged so may have missed something -- I'd love it if we could pay special attention to the migration for this when it lands on prototype/stage to spot for issues. Perhaps we should test it against prototype before merging to master?

Attachment #8882270 - Flags: review?(emorley) → review+

Treeherder GitHub Bugbot

Comment 25

•

8 years ago

Commit pushed to master at https://github.com/mozilla/treeherder https://github.com/mozilla/treeherder/commit/e3ed84757863f1978dac7607342a9f1c7c8ba79e Bug 1215587 - Add job_group field to jobs table (#2601) This is to allow any job type to be able to belong to any job group. This will also mean that if someone accidentally picked the wrong group for a job type, we don't need to fix it in the DB for all new jobs. They can fix their task definition, and all new jobs will go to the new job group. This includes a management command to migrate the old data from job_type.job_group to the new field of job.job_group. A follow-up PR will remove the old field and set the API to read from the job.job_group field.

Cameron Dawson [:camd]

Assignee

Comment 26

•

8 years ago

When deploying this to Stage and Prototype, this caused more of an interruption than I saw in my own testing. For deploying to production, I will follow these steps: 1. Announce a Treeherder outage (which probably means a Tree closure) 2. Stop data ingestion and let the celery queues clear 3. Restart RDS to be sure all memory has been cleared (this step was required on Stage, but done as a precaution) 4. Deploy branch and allow migration to take place. There are several SQL steps to this. The first takes about 1.6 hours. The second step to add the foreign key takes about 2.2 hours. (It didn't seem to take this long when I'd tested it on my own RDS instance, but I must have lost track of time. Or it was faster with the copy of Prod. Or both.) 5. Turn ingestion back on and allow to catch up, at least somewhat. Pump the dynos to 10 or 12 for store_pulse_jobs (temporarily as needed) 6. End outage, notify Sheriffs the trees can be reopened

Ed Morley [:emorley]

Comment 27

•

8 years ago

(In reply to Cameron Dawson [:camd] from comment #26) > 3. Restart RDS to be sure all memory has been cleared (this step was > required on Stage, but done as a precaution) I would prefer we didn't do this unless necessary (ie if it gives the out of memory error right at the start of the migration). Prod is using an m4.2xlarge that has double the RAM of dev/prod's m4.xlarge, so it's much less likely we'll need to. > (It didn't seem to take this long when I'd tested it on my own RDS > instance, but I must have lost track of time. Or it was faster with the > copy of Prod. Or both.) It was likely due to the lower read/write load locally. On the plus side, prod's m4.2xlarge has both double CPU and RAM so should be quicker than stage.

Cameron Dawson [:camd]

Assignee

Comment 28

•

8 years ago

(In reply to Ed Morley [:emorley] from comment #27) > (In reply to Cameron Dawson [:camd] from comment #26) > > 3. Restart RDS to be sure all memory has been cleared (this step was > > required on Stage, but done as a precaution) > > I would prefer we didn't do this unless necessary (ie if it gives the out of > memory error right at the start of the migration). Prod is using an > m4.2xlarge that has double the RAM of dev/prod's m4.xlarge, so it's much > less likely we'll need to. > > > (It didn't seem to take this long when I'd tested it on my own RDS > > instance, but I must have lost track of time. Or it was faster with the > > copy of Prod. Or both.) > > It was likely due to the lower read/write load locally. On the plus side, > prod's m4.2xlarge has both double CPU and RAM so should be quicker than > stage. OK, sure. I can hold off restarting the DB instance unless I hit a memory problem.

Cameron Dawson [:camd]

Assignee

Comment 29

•

8 years ago

This initial migration will require a Tree closure. At this point, the closure is scheduled for this Saturday, 8/26/17 at 7:00am PT. Maintenance - DB Migration on Treeherder - Aug 26, 07:00-11:00 PDT Service and Impact: Treeherder will be in maintenance mode and unavailable for that time period. The following trees will be closed for this: mozilla-central, mozilla-inbound, autoland, mozilla-beta, mozilla-release, mozilla-esr52 Duration: 4.00 hours Bugzilla: https://bugzilla.mozilla.org/show_bug.cgi?id=1215587 If you would like to follow along with the progress, please visit #treeherder in IRC and https://mozilla-releng.net/treestatus. Thank you, Mozilla Operations Center

Salvador Espinoza [:sal]

Updated

•

8 years ago

Blocks: 1389289

Cameron Dawson [:camd]

Assignee

Comment 30

•

8 years ago

Comment on attachment 8883076 [details] [review] [treeherder] mozilla:job-group-independent2 > mozilla:job-group-independent Apologies to have this assigned to you so much later than the original. I know it requires a lot of paging back in of the info. Sorry about that.

Attachment #8883076 - Flags: review?(emorley)

Cameron Dawson [:camd]

Assignee

Comment 31

•

8 years ago

Before starting this work, check in with oncall in #moc to initiate communications on internal and public pages.

GitHub Autolander Bot

Comment 32

•

8 years ago

Attached file [treeherder] mozilla:job-group-independent3 > mozilla:job-group-independent2 — Details

Cameron Dawson [:camd]

Assignee

Updated

•

8 years ago

Attachment #8900904 - Flags: review?(emorley)

Ed Morley [:emorley]

Comment 33

•

8 years ago

My daily papertrail alert email contained 3000 entries today (10x increase on normal), most of which were high RAM usage warnings for worker_buildapi_pending dynos on prototype/stage. To debug I looked at the Heroku stage app metrics (https://dashboard.heroku.com/apps/treeherder-stage/metrics/worker_buildapi_pending?starting=72-hours-ago) and can see the memory usage spiked when ingestion was re-enabled after it was paused on stage. I believe this was due to the worker_buildapi_* dynos being stopped, rather than the worker_beat dyno. When done that way around, the celerybeat triggered tasks still build up, such that the queues now contain hundreds of fetch_buildapi_* tasks that all try to execute at once when those dynos are resumed. Generally I'd recommend stopping the celerybeat dyno instead, otherwise one has to remember to clear out the excess queued beat jobs. For the prod deploy, I'd recommend instead stopping these dynos: * worker_beat * either worker_read_pulse_jobs or worker_store_pulse_jobs (depending on whether you'd rather than queue build up on pulse.m.o or our own rabbitmq instance) * either worker_read_pulse_pushes or worker_store_pulse_resultsets (ditto)

Ed Morley [:emorley]

Updated

•

8 years ago

Attachment #8883076 - Flags: review?(emorley) → review+

Ed Morley [:emorley]

Comment 34

•

8 years ago

Comment on attachment 8900904 [details] [review] [treeherder] mozilla:job-group-independent3 > mozilla:job-group-independent2 Left a comment :-)

Attachment #8900904 - Flags: review?(emorley) → review-

Cameron Dawson [:camd]

Assignee

Comment 35

•

8 years ago

I now have the power to close and open the trees myself and kwierso showed me how to do so with TreeStatus. So I will just handle that part myself.

Cameron Dawson [:camd]

Assignee

Comment 36

•

8 years ago

Post Partum for PR1 =================== The deploy of PR 1 with the slow migration to add the job_group field to the jobs table is complete. Unfortunately, there was some data loss of about 227 jobs from 11:38-11:42am PT [1]. These jobs were being submitted via the API for autophone, hasal-dev and awfy. The API shut off during maintenance mode. However, the deploy of the prod code failed while TH was in maint mode. The pre-deploy does a curl call to get the old revision. But it couldn't because we were in maint mode. So it died in the pre-deploy before actually deploying the new code, though the migrations HAD completed. curl: (22) The requested URL returned error: 503 Service Unavailable So I turned off maintenance mode, but left the ingestion dynos disabled. Then I did another push of the production branch. This time it was successful. But in the time between ending maint mode and when the new code was pushed, the API submissions were failing against the old code with new DB. The new job_group field is required (not null) and only has a default value in the django model, not directly in the db. As a result, when the old code tried to write a record, it would fail because it wasn't setting the job_group field value. Unfortunately, turning off the web dynos before ending maint mode would have given the same error. The curl would have failed. I've opened Bug 1394128 for fixing the deploy script to work during maint mode. [1] https://rpm.newrelic.com/accounts/677903/applications/14179757/traced_errors/48fe9f2f-8a8e-11e7-b507-0242ac110012_4240_12567/similar_errors?original_error_id=48fe9f2f-8a8e-11e7-b507-0242ac110012_4240_12567

Cameron Dawson [:camd]

Assignee

Comment 37

•

8 years ago

Also: I opened Bug 1394126 for BC for the lost autophone jobs to ask if he can resubmit.

Treeherder GitHub Bugbot

Comment 38

•

8 years ago

Commit pushed to master at https://github.com/mozilla/treeherder https://github.com/mozilla/treeherder/commit/5cb083ee132088ea70048efb8fa34a01e28b098c Bug 1215587 - Follow-up to transition over to new job.job_group (#2607)

Cameron Dawson [:camd]

Assignee

Comment 39

•

8 years ago

Comment on attachment 8900904 [details] [review] [treeherder] mozilla:job-group-independent3 > mozilla:job-group-independent2 I think this addresses your concerns. :)

Attachment #8900904 - Flags: review- → review?(emorley)

Ed Morley [:emorley]

Comment 40

•

8 years ago

Comment on attachment 8900904 [details] [review] [treeherder] mozilla:job-group-independent3 > mozilla:job-group-independent2 Looks great - thank you :-)

Attachment #8900904 - Flags: review?(emorley) → review+

Ed Morley [:emorley]

Updated

•

8 years ago

Depends on: 1394555

Ed Morley [:emorley]

Comment 41

•

8 years ago

(In reply to Cameron Dawson [:camd] from comment #36) > Unfortunately, there was some data loss of about 227 jobs from 11:38-11:42am > PT [1]. These jobs were being submitted via the API for autophone, > hasal-dev and awfy. The API shut off during maintenance mode. However, the > deploy of the prod code failed while TH was in maint mode. The pre-deploy > does a curl call to get the old revision. But it couldn't because we were > in maint mode. So it died in the pre-deploy before actually deploying the > new code, though the migrations HAD completed. Some thoughts on this: * People who submit to us using the REST API need to handle 5xx HTTP errors and retry (ideally with exponential backoff). If they don't, data loss is not really our fault. Though it's probably not worth them fixing, they should just switch to using Pulse, which already has more safeguards during our ingestion process (eg not acking until the job is successfully ingested). * Even had the deploy succeeded, there still would have been an amount of time between the DB changes being made and the new Python code being run in production. As such, the new code ideally needed to be forwards-backwards compatible. However this is hard with Django migrations, since Django sets defaults at the ORM layer not the DB layer, so unless we used manual RunSQL commands there's not much we can do. (Ordinarily without an ORM one would add a new column with a default set at the DB layer, so the old Python code would still work with it. And then only after in another PR would the app layer start depending on the new column etc).

Ed Morley [:emorley]

Comment 42

•

8 years ago

(In reply to Ed Morley [:emorley] from comment #41) > However this is hard with Django migrations, since Django sets > defaults at the ORM layer not the DB layer, so unless we used manual RunSQL > commands there's not much we can do. Actually I guess we could have used `null=True` to make it a nullable column (instead of using a numeric default), since I think that is set at the DB level.

Cameron Dawson [:camd]

Assignee

Comment 43

•

8 years ago

(In reply to Ed Morley [:emorley] from comment #42) > (In reply to Ed Morley [:emorley] from comment #41) > > However this is hard with Django migrations, since Django sets > > defaults at the ORM layer not the DB layer, so unless we used manual RunSQL > > commands there's not much we can do. > > Actually I guess we could have used `null=True` to make it a nullable column > (instead of using a numeric default), since I think that is set at the DB > level. Yeah, it's true. Though then we would have needed ANOTHER long migration to remove the default value. So this is probably best. And I totally agree about moving people away from the API and to pulse anyway. For Autophone: Bug 1395326.

Treeherder GitHub Bugbot

Comment 44

•

8 years ago

Commit pushed to master at https://github.com/mozilla/treeherder https://github.com/mozilla/treeherder/commit/edba0fb429d0a96d2f76cad9be1c65c83dfdb4f6 Bug 1215587 - Remove job_group from job_type table (#2733)

Cameron Dawson [:camd]

Assignee

Comment 45

•

8 years ago

This has been deployed to stage and prototype and seems to be working well.

Status: ASSIGNED → RESOLVED

Closed: 8 years ago

Resolution: --- → FIXED

GitHub Autolander Bot

Comment 46

•

8 years ago

Attached file [treeherder] mozilla:graphql-job-group > mozilla:master — Details

Cameron Dawson [:camd]

Assignee

Comment 47

•

8 years ago

Comment on attachment 8903254 [details] [review] [treeherder] mozilla:graphql-job-group > mozilla:master I accidentally broke the graphql queries. Here's the fix with a new test to catch anything like this in the future.

Attachment #8903254 - Flags: review?(emorley)

Ed Morley [:emorley]

Updated

•

8 years ago

Attachment #8903254 - Flags: review?(emorley) → review+

Treeherder GitHub Bugbot

Comment 48

•

8 years ago

Commit pushed to master at https://github.com/mozilla/treeherder https://github.com/mozilla/treeherder/commit/2a82166526a0d8f387c05a909f85539b70215027 Bug 1215587 - Fix graphql query with job_group change (#2744)

Ed Morley [:emorley]

Comment 49

•

8 years ago

Has the SQL [1] to populate the new column now been run on all three environments? It's just I don't see any record of what was run when in this bug (we really must do this in the future), so it's hard to know whether I can safely deploy master to prod. [1] https://github.com/mozilla/treeherder/pull/2733#issuecomment-325039772

Flags: needinfo?(cdawson)

Ed Morley [:emorley]

Comment 50

•

8 years ago

I've cherrypicked bug 1330412 to the production branch since it was needed urgently, since it wasn't clear if I could deploy master. I also had to cherrypick bug 1394593 since the buildpack settings for prod have been switched back already, since I was expecting the next deploy to be from master.

Cameron Dawson [:camd]

Assignee

Comment 51

•

8 years ago

(In reply to Ed Morley [:emorley] from comment #49) > Has the SQL [1] to populate the new column now been run on all three > environments? > > It's just I don't see any record of what was run when in this bug (we really > must do this in the future), so it's hard to know whether I can safely > deploy master to prod. > > > [1] https://github.com/mozilla/treeherder/pull/2733#issuecomment-325039772 Yes, the sql to update the new column has been run on all 3 environments. I had thought I'd commented here that I'd done that, but clearly I hadn't. My apologies. I totally agree that should have been put here as soon as I'd completed it. bstack tested it on stage and the triggers are working. I was just pinging you in IRC to see if you could try a trigger, too. I don't have the scopes yet to do so, so I need to look into getting those.

Flags: needinfo?(cdawson)

Ed Morley [:emorley]

Comment 52

•

8 years ago

(In reply to Cameron Dawson [:camd] from comment #51) > Yes, the sql to update the new column has been run on all 3 environments. Great - looks like this bug needn't block deploy then :-) > bstack tested it on stage and the triggers are working. I was just pinging > you in IRC to see if you could try a trigger, too. I don't have the scopes > yet to do so, so I need to look into getting those. This is a different bug (bug 1391321) rather than this one. I'll continue the discussion there.

Treeherder GitHub Bugbot

Comment 53

•

8 years ago

Commit pushed to master at https://github.com/mozilla-rpweb/treeherder-manifest https://github.com/mozilla-rpweb/treeherder-manifest/commit/a7b23eb6e0e588a52329e73edf438dbdd08d52d3 Bug 1215587 - Fix graphql query with job_group change

[treeherder] mozilla:job-group-independent > mozilla:master 8 years ago GitHub Autolander Bot 47 bytes, text/x-github-pull-request	emorley : review+	Details \| Review
[treeherder] mozilla:job-group-independent2 > mozilla:job-group-independent 8 years ago GitHub Autolander Bot 47 bytes, text/x-github-pull-request	emorley : review+	Details \| Review
[treeherder] mozilla:job-group-independent3 > mozilla:job-group-independent2 8 years ago GitHub Autolander Bot 47 bytes, text/x-github-pull-request	emorley : review+	Details \| Review
[treeherder] mozilla:graphql-job-group > mozilla:master 8 years ago GitHub Autolander Bot 47 bytes, text/x-github-pull-request	emorley : review+	Details \| Review