Remove job_group_id from job_type table to resolve incorrect group names

RESOLVED FIXED

Status

P2
normal
RESOLVED FIXED
3 years ago
a month ago

People

(Reporter: camd, Assigned: camd)

Tracking

Details

Attachments

(4 attachments)

(Assignee)

Description

3 years ago
This constraint means that a symbol can belong to one and only one group.  But sometimes the symbols belong to different ones.  This isn't a useful constraint and causes data ingestion errors because jobs get put into the wrong groups.
(Assignee)

Updated

3 years ago
Assignee: nobody → cdawson
(Assignee)

Comment 1

3 years ago
this bug also means that if a new job type is ingested once in the wrong group, it is forever ingested in that group until someone fixes the DB.
Not only the symbols are affected here but also the job name. If you have to change it, following submissions will still end up in a wrong group. We had this yesterday with update results of the firefox-ui-tests repository, whereby functional (Fu) jobs ended up in update (Fu).

Updated

3 years ago
Duplicate of this bug: 1229394
:camd what you mentioned in comment 1 is probably bug 1193372
I'm struggling to understand why we need the same job name/symbol associated to more than one job group :(
I mean, other than the 'fix a typo' case which we need to fix anyway; which use cases are we trying to cover here?
:whimboo in your example in comment 2 does the 'Fu' symbol mean something different based on the group?
So with our tests we run localized builds of Firefox which locale id is used as job symbol. We have different types of tests and they end up in different groups. As of now we changed the job symbols for our update tests so the following Treeherder results will not show what has been initially be seen.

https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&revision=f6ecc28fba3d&filter-job_group_symbol=Fffilter-job_group_symbol=Fu&exclusion_profile=false&fromchange=1624a69d46d7

Comment 7

3 years ago
For instance, we run tests as with e10s enabled or not.
You can see in here [1] a lot of jobs are run for both groups:

I'm doing some work of running the same set of jobs that run on Buildbot.
If I get the group name wrong from the beginning I will have to ask for the cached group to be fixed.

I don't mind making requests to be fixed. This will happen once in a while when a new platform is brought up.

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=52bcfd8df0a1
(Assignee)

Comment 8

2 years ago
FWIW: If you have a unique job name (as opposed to the symbol) then it can show up in a different group.  The string of the name is part of what we currently use to determine uniqueness.
Duplicate of this bug: 1269090
(Assignee)

Updated

2 years ago
Duplicate of this bug: 1276504

Updated

2 years ago
Duplicate of this bug: 1278598
The fix for this isn't that difficult (make the job group another job property, instead of a property of the job type), though we will need to migrate older jobs to the new format. It might be best done as part of migrating the jobs table out of the per-project db's, which I hope to do early next quarter.
(Assignee)

Updated

2 years ago
Duplicate of this bug: 1193372
(Assignee)

Updated

2 years ago
Status: NEW → ASSIGNED
(Assignee)

Comment 14

2 years ago
Hey Will-- I was going to work on this, but it definitely makes sense for this to be coordinated with moving jobs out of the per-repo databases.  Is this a bug you'd want to tackle at the same time you do that work?  

Or would it help if I did the change prior to that?  I could certainly add these fields to the existing new job model and have the data ingestion start writing to that.  I could even switch it to start reading from there instead of from the old job table.  I could even run queries to populate all the jobs already in the job model with the values from the per-repo databases.

I just don't want to muddy the work you'll be doing.  Please let me know how you'd like to proceed.
Flags: needinfo?(wlachance)
In practice it will probably be a month or so before I can get around to moving the jobs to the main database (it will be the last step of the datasource migration, after the artifacts have been transitioned -- we need to do things in that order because the artifacts are referenced per job). 

I don't think you doing this *now* will make that more difficult, on the other hand it might make more sense to just do this as part of the ORM refactoring (sort of like what we did with bug 1270629).
Flags: needinfo?(wlachance)

Updated

2 years ago
Summary: remove job_group_id from job_type table → Remove job_group_id from job_type table to resolve incorrect group names
This came up on IRC today.
Priority: -- → P2
And today:

<ttaubert> so um I don't remember, what did we do last time treeherder ignored the .groupSymbol property given by a task? was it cached somewhere?
<emorley> ah so that's bug 1215587 - camd_pto_back_2_16 normally resolves those via a manual DB edit - I'm not 100% sure what has to be changed without breaking things
This should be (much) easier to do now that the datasource removal is complete.
(Assignee)

Comment 19

a year ago
I propose these steps for this:

1. Create field PR
  a. Create ``job_group`` field on the ``job`` table.  
  b. Write the ``job_group`` to both ``job.job_group`` and ``job_type.job_group``.  
  c. Leave the API reading the ``job_type`` from ``job_type.job_group`` for now
  d. Make management command to back-fill from ``job_type.job_group`` to ``job.job_group``

2. Run the management command to back-fill ``job_group``

3. Transition PR 
  a. Remove ``job_type.job_group`` field
  b. Transition API to read ``job_group`` from ``job`` table
(Assignee)

Comment 20

a year ago
Talked to Ed today.  I did some searching online and it doesn't look like it's faster to add a foreign key field with null=True vs. null=False, so I may as well make it True with a default to 2 (which maps to not-grouped).  This way I don't have to do 2 costly table-copy migrations.  The default is kind of irrelevant since we will always automatically assign a value to that field during ingestion
Created attachment 8882270 [details] [review]
[treeherder] mozilla:job-group-independent > mozilla:master
(Assignee)

Comment 22

a year ago
Comment on attachment 8882270 [details] [review]
[treeherder] mozilla:job-group-independent > mozilla:master

Hi Ed--  Here's the first of the PRs for changing adding job_group to the ``job`` table.
Attachment #8882270 - Flags: review?(emorley)
Created attachment 8883076 [details] [review]
[treeherder] mozilla:job-group-independent2 > mozilla:job-group-independent
Comment on attachment 8882270 [details] [review]
[treeherder] mozilla:job-group-independent > mozilla:master

Thank you for doing this!

I'm still feeling pretty jetlagged so may have missed something -- I'd love it if we could pay special attention to the migration for this when it lands on prototype/stage to spot for issues. Perhaps we should test it against prototype before merging to master?
Attachment #8882270 - Flags: review?(emorley) → review+

Comment 25

a year ago
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/e3ed84757863f1978dac7607342a9f1c7c8ba79e
Bug 1215587 - Add job_group field to jobs table (#2601)

This is to allow any job type to be able to belong to any
job group.  This will also mean that if someone accidentally
picked the wrong group for a job type, we don't need to
fix it in the DB for all new jobs.  They can fix their task
definition, and all new jobs will go to the new job group.

This includes a management command to migrate the old
data from job_type.job_group to the new field of job.job_group.

A follow-up PR will remove the old field and set the API to
read from the job.job_group field.
(Assignee)

Comment 26

a year ago
When deploying this to Stage and Prototype, this caused more of an interruption than I saw in my own testing.  For deploying to production, I will follow these steps:

1. Announce a Treeherder outage (which probably means a Tree closure)
2. Stop data ingestion and let the celery queues clear
3. Restart RDS to be sure all memory has been cleared (this step was required on Stage, but done as a precaution)
4. Deploy branch and allow migration to take place.  There are several SQL steps to this.  The first takes about 1.6 hours.  The second step to add the foreign key takes about 2.2 hours.  
   (It didn't seem to take this long when I'd tested it on my own RDS instance, but I must have lost track of time.  Or it was faster with the copy of Prod.  Or both.)
5. Turn ingestion back on and allow to catch up, at least somewhat.  Pump the dynos to 10 or 12 for store_pulse_jobs (temporarily as needed)
6. End outage, notify Sheriffs the trees can be reopened
(In reply to Cameron Dawson [:camd] from comment #26)
> 3. Restart RDS to be sure all memory has been cleared (this step was
> required on Stage, but done as a precaution)

I would prefer we didn't do this unless necessary (ie if it gives the out of memory error right at the start of the migration). Prod is using an m4.2xlarge that has double the RAM of dev/prod's m4.xlarge, so it's much less likely we'll need to.

>    (It didn't seem to take this long when I'd tested it on my own RDS
> instance, but I must have lost track of time.  Or it was faster with the
> copy of Prod.  Or both.)

It was likely due to the lower read/write load locally. On the plus side, prod's m4.2xlarge has both double CPU and RAM so should be quicker than stage.
(Assignee)

Comment 28

a year ago
(In reply to Ed Morley [:emorley] from comment #27)
> (In reply to Cameron Dawson [:camd] from comment #26)
> > 3. Restart RDS to be sure all memory has been cleared (this step was
> > required on Stage, but done as a precaution)
> 
> I would prefer we didn't do this unless necessary (ie if it gives the out of
> memory error right at the start of the migration). Prod is using an
> m4.2xlarge that has double the RAM of dev/prod's m4.xlarge, so it's much
> less likely we'll need to.
> 
> >    (It didn't seem to take this long when I'd tested it on my own RDS
> > instance, but I must have lost track of time.  Or it was faster with the
> > copy of Prod.  Or both.)
> 
> It was likely due to the lower read/write load locally. On the plus side,
> prod's m4.2xlarge has both double CPU and RAM so should be quicker than
> stage.

OK, sure.  I can hold off restarting the DB instance unless I hit a memory problem.
(Assignee)

Comment 29

a year ago
This initial migration will require a Tree closure.  At this point, the closure is scheduled for this Saturday, 8/26/17 at 7:00am PT.

Maintenance - DB Migration on Treeherder - Aug 26, 07:00-11:00 PDT
Service and Impact:
Treeherder will be in maintenance mode and unavailable for that time period.
The following trees will be closed for this: mozilla-central, mozilla-inbound, autoland, mozilla-beta, mozilla-release, mozilla-esr52
Duration: 4.00 hours
Bugzilla: https://bugzilla.mozilla.org/show_bug.cgi?id=1215587

If you would like to follow along with the progress, please visit #treeherder in IRC and https://mozilla-releng.net/treestatus.

Thank you,
Mozilla Operations Center
Blocks: 1389289
(Assignee)

Comment 30

a year ago
Comment on attachment 8883076 [details] [review]
[treeherder] mozilla:job-group-independent2 > mozilla:job-group-independent

Apologies to have this assigned to you so much later than the original.  I know it requires a lot of paging back in of the info.  Sorry about that.
Attachment #8883076 - Flags: review?(emorley)
(Assignee)

Comment 31

a year ago
Before starting this work, check in with oncall in #moc to initiate communications on internal and public pages.
Created attachment 8900904 [details] [review]
[treeherder] mozilla:job-group-independent3 > mozilla:job-group-independent2
(Assignee)

Updated

a year ago
Attachment #8900904 - Flags: review?(emorley)
My daily papertrail alert email contained 3000 entries today (10x increase on normal), most of which were high RAM usage warnings for worker_buildapi_pending dynos on prototype/stage.

To debug I looked at the Heroku stage app metrics (https://dashboard.heroku.com/apps/treeherder-stage/metrics/worker_buildapi_pending?starting=72-hours-ago) and can see the memory usage spiked when ingestion was re-enabled after it was paused on stage.

I believe this was due to the worker_buildapi_* dynos being stopped, rather than the worker_beat dyno. When done that way around, the celerybeat triggered tasks still build up, such that the queues now contain hundreds of fetch_buildapi_* tasks that all try to execute at once when those dynos are resumed. Generally I'd recommend stopping the celerybeat dyno instead, otherwise one has to remember to clear out the excess queued beat jobs.

For the prod deploy, I'd recommend instead stopping these dynos:
* worker_beat
* either worker_read_pulse_jobs or worker_store_pulse_jobs (depending on whether you'd rather than queue build up on pulse.m.o or our own rabbitmq instance)
* either worker_read_pulse_pushes or worker_store_pulse_resultsets (ditto)

Updated

a year ago
Attachment #8883076 - Flags: review?(emorley) → review+
Comment on attachment 8900904 [details] [review]
[treeherder] mozilla:job-group-independent3 > mozilla:job-group-independent2

Left a comment :-)
Attachment #8900904 - Flags: review?(emorley) → review-
(Assignee)

Comment 35

a year ago
I now have the power to close and open the trees myself and kwierso showed me how to do so with TreeStatus.  So I will just handle that part myself.
(Assignee)

Comment 36

a year ago
Post Partum for PR1
===================

The deploy of PR 1 with the slow migration to add the job_group field to the jobs table is complete.

Unfortunately, there was some data loss of about 227 jobs from 11:38-11:42am PT [1].  These jobs were being submitted via the API for autophone, hasal-dev and awfy.  The API shut off during maintenance mode.  However, the deploy of the prod code failed while TH was in maint mode.  The pre-deploy does a curl call to get the old revision.  But it couldn't because we were in maint mode.  So it died in the pre-deploy before actually deploying the new code, though the migrations HAD completed.

curl: (22) The requested URL returned error: 503 Service Unavailable

So I turned off maintenance mode, but left the ingestion dynos disabled.  Then I did another push of the production branch.  This time it was successful.  But in the time between ending maint mode and when the new code was pushed, the API submissions were failing against the old code with new DB.  

The new job_group field is required (not null) and only has a default value in the django model, not directly in the db.  As a result, when the old code tried to write a record, it would fail because it wasn't setting the job_group field value.

Unfortunately, turning off the web dynos before ending maint mode would have given the same error.  The curl would have failed.

I've opened Bug 1394128 for fixing the deploy script to work during maint mode.

[1] https://rpm.newrelic.com/accounts/677903/applications/14179757/traced_errors/48fe9f2f-8a8e-11e7-b507-0242ac110012_4240_12567/similar_errors?original_error_id=48fe9f2f-8a8e-11e7-b507-0242ac110012_4240_12567
(Assignee)

Comment 37

a year ago
Also: I opened Bug 1394126 for BC for the lost autophone jobs to ask if he can resubmit.

Comment 38

a year ago
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/5cb083ee132088ea70048efb8fa34a01e28b098c
Bug 1215587 - Follow-up to transition over to new job.job_group (#2607)
(Assignee)

Comment 39

a year ago
Comment on attachment 8900904 [details] [review]
[treeherder] mozilla:job-group-independent3 > mozilla:job-group-independent2

I think this addresses your concerns.  :)
Attachment #8900904 - Flags: review- → review?(emorley)
Comment on attachment 8900904 [details] [review]
[treeherder] mozilla:job-group-independent3 > mozilla:job-group-independent2

Looks great - thank you :-)
Attachment #8900904 - Flags: review?(emorley) → review+

Updated

a year ago
Depends on: 1394555
(In reply to Cameron Dawson [:camd] from comment #36)
> Unfortunately, there was some data loss of about 227 jobs from 11:38-11:42am
> PT [1].  These jobs were being submitted via the API for autophone,
> hasal-dev and awfy.  The API shut off during maintenance mode.  However, the
> deploy of the prod code failed while TH was in maint mode.  The pre-deploy
> does a curl call to get the old revision.  But it couldn't because we were
> in maint mode.  So it died in the pre-deploy before actually deploying the
> new code, though the migrations HAD completed.

Some thoughts on this:
* People who submit to us using the REST API need to handle 5xx HTTP errors and retry (ideally with exponential backoff). If they don't, data loss is not really our fault. Though it's probably not worth them fixing, they should just switch to using Pulse, which already has more safeguards during our ingestion process (eg not acking until the job is successfully ingested).
* Even had the deploy succeeded, there still would have been an amount of time between the DB changes being made and the new Python code being run in production. As such, the new code ideally needed to be forwards-backwards compatible. However this is hard with Django migrations, since Django sets defaults at the ORM layer not the DB layer, so unless we used manual RunSQL commands there's not much we can do. (Ordinarily without an ORM one would add a new column with a default set at the DB layer, so the old Python code would still work with it. And then only after in another PR would the app layer start depending on the new column etc).
(In reply to Ed Morley [:emorley] from comment #41)
> However this is hard with Django migrations, since Django sets
> defaults at the ORM layer not the DB layer, so unless we used manual RunSQL
> commands there's not much we can do.

Actually I guess we could have used `null=True` to make it a nullable column (instead of using a numeric default), since I think that is set at the DB level.
(Assignee)

Comment 43

a year ago
(In reply to Ed Morley [:emorley] from comment #42)
> (In reply to Ed Morley [:emorley] from comment #41)
> > However this is hard with Django migrations, since Django sets
> > defaults at the ORM layer not the DB layer, so unless we used manual RunSQL
> > commands there's not much we can do.
> 
> Actually I guess we could have used `null=True` to make it a nullable column
> (instead of using a numeric default), since I think that is set at the DB
> level.

Yeah, it's true.  Though then we would have needed ANOTHER long migration to remove the default value.  So this is probably best.  And I totally agree about moving people away from the API and to pulse anyway.  For Autophone: Bug 1395326.
(Assignee)

Comment 45

a year ago
This has been deployed to stage and prototype and seems to be working well.
Status: ASSIGNED → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED
Created attachment 8903254 [details] [review]
[treeherder] mozilla:graphql-job-group > mozilla:master
(Assignee)

Comment 47

a year ago
Comment on attachment 8903254 [details] [review]
[treeherder] mozilla:graphql-job-group > mozilla:master

I accidentally broke the graphql queries.  Here's the fix with a new test to catch anything like this in the future.
Attachment #8903254 - Flags: review?(emorley)

Updated

a year ago
Attachment #8903254 - Flags: review?(emorley) → review+
Has the SQL [1] to populate the new column now been run on all three environments?

It's just I don't see any record of what was run when in this bug (we really must do this in the future), so it's hard to know whether I can safely deploy master to prod.


[1] https://github.com/mozilla/treeherder/pull/2733#issuecomment-325039772
Flags: needinfo?(cdawson)
I've cherrypicked bug 1330412 to the production branch since it was needed urgently, since it wasn't clear if I could deploy master. I also had to cherrypick bug 1394593 since the buildpack settings for prod have been switched back already, since I was expecting the next deploy to be from master.
(Assignee)

Comment 51

a year ago
(In reply to Ed Morley [:emorley] from comment #49)
> Has the SQL [1] to populate the new column now been run on all three
> environments?
> 
> It's just I don't see any record of what was run when in this bug (we really
> must do this in the future), so it's hard to know whether I can safely
> deploy master to prod.
> 
> 
> [1] https://github.com/mozilla/treeherder/pull/2733#issuecomment-325039772

Yes, the sql to update the new column has been run on all 3 environments.  I had thought I'd commented here that I'd done that, but clearly I hadn't.  My apologies.  I totally agree that should have been put here as soon as I'd completed it.

bstack tested it on stage and the triggers are working.  I was just pinging you in IRC to see if you could try a trigger, too.  I don't have the scopes yet to do so, so I need to look into getting those.
Flags: needinfo?(cdawson)
(In reply to Cameron Dawson [:camd] from comment #51)
> Yes, the sql to update the new column has been run on all 3 environments.

Great - looks like this bug needn't block deploy then :-)

> bstack tested it on stage and the triggers are working.  I was just pinging
> you in IRC to see if you could try a trigger, too.  I don't have the scopes
> yet to do so, so I need to look into getting those.

This is a different bug (bug 1391321) rather than this one. I'll continue the discussion there.

Updated

a month ago
Duplicate of this bug: 1174973
You need to log in before you can comment on or make changes to this bug.