Add missing indexes to the jobdetail table

Status: RESOLVED FIXED
Product: Tree Management :: Treeherder: Data Ingestion
Opened: a year ago
Last modified: a year ago

People

(Reporter: camd, Assigned: camd)

Attachments

(5 attachments, 4 obsolete attachments)

(Assignee)

Description

a year ago
Here's an example of the error on new relic:

https://rpm.newrelic.com/accounts/677903/applications/5585473/traced_errors/553a8f-83e1d9fc-2c4a-11e6-b947-b82a72d22a14

treeherder.model.models:MultipleObjectsReturned: get() returned more than one JobDetail -- it returned 2!
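The `get()` inside `get_or_create` assumes at most one matching row; once duplicate rows exist, it fails exactly as the traceback shows. A minimal sqlite3 sketch of the same semantics (the table name mirrors `job_detail`, but this is an illustration, not Treeherder's actual code):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job_detail "
             "(id INTEGER PRIMARY KEY, job_id INTEGER, title TEXT, value TEXT)")
# Without a unique constraint, nothing stops two identical detail rows:
conn.execute("INSERT INTO job_detail (job_id, title, value) "
             "VALUES (1, 'artifact uploaded', 'log.txt')")
conn.execute("INSERT INTO job_detail (job_id, title, value) "
             "VALUES (1, 'artifact uploaded', 'log.txt')")

def get(job_id, title, value):
    """Mimic Django QuerySet.get(): exactly one matching row, or an error."""
    rows = conn.execute(
        "SELECT id FROM job_detail WHERE job_id = ? AND title = ? AND value = ?",
        (job_id, title, value)).fetchall()
    if len(rows) > 1:
        raise LookupError(
            "get() returned more than one JobDetail -- it returned %d!" % len(rows))
    return rows[0]

err = None
try:
    get(1, "artifact uploaded", "log.txt")
except LookupError as e:
    err = str(e)
```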
(Assignee)

Updated

a year ago
Assignee: nobody → cdawson

Comment 1

a year ago
Created attachment 8760982 [details] [review]
[treeherder] mozilla:jobdetail-unique-together > mozilla:master
(Assignee)

Updated

a year ago
Attachment #8760982 - Flags: review?(wlachance)
Comment on attachment 8760982 [details] [review]
[treeherder] mozilla:jobdetail-unique-together > mozilla:master

Should have added this way in the first place, sorry! Thanks for fixing it.
Attachment #8760982 - Flags: review?(wlachance) → review+

Comment 3

a year ago
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/4e377daf5aa8357ccee9ea6d2d841dfe41ee70e3
Bug 1278711 - Add unique together index to JobDetail model/table (#1573)

During job ingestion, we can sometimes get an error of
MultipleObjectsReturned when creating a JobDetail record.  Adding
an index for uniqueness will prevent us from getting that while
using the get_or_create model function.
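The commit message above describes the mechanism; a minimal sqlite3 sketch (illustrative names, not Treeherder's actual code) of why a unique index closes the race in the SELECT-then-INSERT pattern that `get_or_create` uses — the losing worker's INSERT raises a catchable integrity error instead of silently creating a duplicate:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE job_detail (
    id INTEGER PRIMARY KEY,
    job_id INTEGER, title TEXT, value TEXT,
    UNIQUE (job_id, title, value))""")

def get_or_create(job_id, title, value):
    """Roughly the get_or_create shape: SELECT first, INSERT on a miss.
    Two racing workers can both see a miss; the unique index makes the
    loser's INSERT fail instead of creating a duplicate row."""
    row = conn.execute(
        "SELECT id FROM job_detail WHERE job_id = ? AND title = ? AND value = ?",
        (job_id, title, value)).fetchone()
    if row:
        return row[0], False
    try:
        cur = conn.execute(
            "INSERT INTO job_detail (job_id, title, value) VALUES (?, ?, ?)",
            (job_id, title, value))
        return cur.lastrowid, True
    except sqlite3.IntegrityError:
        # Lost the race: fetch the row the other worker just created.
        row = conn.execute(
            "SELECT id FROM job_detail WHERE job_id = ? AND title = ? AND value = ?",
            (job_id, title, value)).fetchone()
        return row[0], False

_, created1 = get_or_create(1, "artifact uploaded", "log.txt")
_, created2 = get_or_create(1, "artifact uploaded", "log.txt")
total = conn.execute("SELECT COUNT(*) FROM job_detail").fetchone()[0]
```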
(Assignee)

Updated

a year ago
Status: NEW → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED

Comment 4

a year ago
Created attachment 8761321 [details] [review]
[treeherder] mozilla:revert-jobdetail-index > mozilla:master

Comment 5

a year ago
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/12567801ce236844cf267337869a6a724b66121a
Revert "Bug 1278711 - Add unique together index to JobDetail model/table (#1573)" (#1577)

This reverts commit 4e377daf5aa8357ccee9ea6d2d841dfe41ee70e3.

Comment 6

a year ago
Created attachment 8761357 [details] [review]
[treeherder] mozilla:add-jobdetail-unique-together-idx > mozilla:master
(Assignee)

Comment 7

a year ago
Comment on attachment 8761357 [details] [review]
[treeherder] mozilla:add-jobdetail-unique-together-idx > mozilla:master

Hey Will--  What do you think of this approach?  :)
Attachment #8761357 - Flags: review?(wlachance)
Comment on attachment 8761357 [details] [review]
[treeherder] mozilla:add-jobdetail-unique-together-idx > mozilla:master

As discussed on irc, I'd like to see:

1. Update the get_or_create statement to put "url" in a defaults dict (lots of examples of this elsewhere in treeherder)
2. Update the comment to make it clear that the behaviour is what we want, rather than just a workaround.
Attachment #8761357 - Flags: review?(wlachance) → review-
(Assignee)

Comment 9

a year ago
I think there is likely an additional problem to what we're seeing.  We probably already have duplicate rows in the table.  That will cause the creation of this index to fail.  Currently running a query to find duplicates on stage now, but it's taking a loooooong time. Keeps timing out... sigh...

Sooo... after googling, I think our solution MAY be to accept the migration as I've got it written.  But instead of executing it as-is, we need to run the sql by hand.  I am SURE it'll timeout anyway....

But the SQL for the migration will be:

ALTER TABLE `job_detail` MODIFY `title` varchar(100) NULL;
ALTER TABLE `job_detail` ADD CONSTRAINT `job_detail_job_id_f2203261c91c576_uniq` UNIQUE (`job_id`, `title`, `value`);

However, we can still make use of "ALTER IGNORE TABLE" which will delete duplicates for us as it creates the index.  This function is apparently available in mysql 5.6, which we're on.  It's being taken out of 5.7.4 because it causes Hepatitis.  Actually... not sure why...

Will: did you have any better ideas?
Status: RESOLVED → REOPENED
Flags: needinfo?(wlachance)
Resolution: FIXED → ---
(Assignee)

Comment 10

a year ago
My last comment may have been unclear.  We should replace this line:

ALTER TABLE `job_detail` ADD CONSTRAINT `job_detail_job_id_f2203261c91c576_uniq` UNIQUE (`job_id`, `title`, `value`);

with this:

ALTER IGNORE TABLE `job_detail` ADD CONSTRAINT `job_detail_job_id_f2203261c91c576_uniq` UNIQUE (`job_id`, `title`, `value`);
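`ALTER IGNORE TABLE` is MySQL-specific (and was indeed removed in 5.7.4). Its effect — dropping duplicate rows while the unique key is built — can be sketched portably by copying rows through a conflict-ignoring insert; a sqlite3 illustration with made-up data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE job_detail (id INTEGER PRIMARY KEY, job_id INTEGER, title TEXT, value TEXT);
INSERT INTO job_detail (job_id, title, value) VALUES
    (1, 'artifact uploaded', 'log.txt'),
    (1, 'artifact uploaded', 'log.txt'),   -- duplicate
    (2, 'CPU idle', '37.5%');
-- The ALTER IGNORE effect: rebuild the table under the unique key,
-- silently dropping rows that collide with an already-copied row.
CREATE TABLE job_detail_new (
    id INTEGER PRIMARY KEY, job_id INTEGER, title TEXT, value TEXT,
    UNIQUE (job_id, title, value));
INSERT OR IGNORE INTO job_detail_new SELECT * FROM job_detail;
""")
remaining = conn.execute("SELECT COUNT(*) FROM job_detail_new").fetchone()[0]
```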
(In reply to Cameron Dawson [:camd] from comment #9)
> I think there is likely an additional problem to what we're seeing.  We
> probably already have duplicate rows in the table.  That will cause the
> creation of this index to fail.  Currently running a query to find
> duplicates on stage now, but it's taking a loooooong time. Keeps timing
> out... sigh...

It might be faster to create an index on job_id, value, title first before running a query to get the duplicates?

Are we not able to add the constraint without eliminating the duplicates first? Does mysql verify the table matches the constraint when you add it?

> Sooo... after googling, I think our solution MAY be to accept the migration
> as I've got it written.  But instead of executing it as-is, we need to run
> the sql by hand.  I am SURE it'll timeout anyway....

Yeah, definitely this is stuff that should be done outside of a deployment. :)
Flags: needinfo?(wlachance)
If making manual changes, please also update the new stage RDS instance :-)
(I'd recommend adding SCL3 x 2 + RDS x 3 to your MySQLWorkbench configs - makes it pretty quick to connect to each to make changes :-))

Updated

a year ago
Blocks: 1266229
(Assignee)

Comment 13

a year ago
Dang, I'm a bit stuck here so far.  Need to talk with a DBA.  So the ``value`` field is a varchar(512) and I can't create an index on that field even by itself.  I still get "Specified key was too long; max key length is 767 bytes"

The table is utf8 and as I understand each char in a varchar can be 1 to 6 bytes.  The index is probably calculating size based on 6.  We'd have to truncate the field to 125 or so to make it fit in an index.  And that's just the one field. 
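For context, MySQL's `utf8` charset (the pre-utf8mb4 subset) stores at most 3 bytes per character, and InnoDB's default key prefix limit is 767 bytes, rising to 3072 with `innodb_large_prefix` on a Barracuda-format table. The worst-case arithmetic, sketched in Python (the 125-character figure above corresponds to the more pessimistic 6-bytes-per-char assumption):

```python
# Worst-case index key sizes under MySQL's utf8 charset (3 bytes/char max).
LIMIT = 767                  # InnoDB key prefix limit without innodb_large_prefix
LARGE_LIMIT = 3072           # limit with innodb_large_prefix + Barracuda row formats
BYTES_PER_CHAR = 3           # MySQL "utf8" stores at most 3 bytes per character

value_key = 512 * BYTES_PER_CHAR     # varchar(512) -> 1536 bytes: too big on its own
title_key = 100 * BYTES_PER_CHAR     # varchar(100) -> 300 bytes: fits
max_chars = LIMIT // BYTES_PER_CHAR  # largest indexable varchar without large_prefix

assert value_key > LIMIT             # explains "max key length is 767 bytes"
assert value_key < LARGE_LIMIT       # why innodb_large_prefix rescues varchar(512)
```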

We CAN avoid the exception just with a ``try:`` block.  But that won't prevent us creating duplicate records.
(Assignee)

Comment 14

a year ago
After more googling, it looks like there's a setting for the innoDB engine that may help us:

http://dev.mysql.com/doc/refman/5.5/en/innodb-parameters.html#sysvar_innodb_large_prefix
(Assignee)

Comment 15

a year ago
While this error *does* happen in connection with data ingestion with pulse jobs, it can happen with ingestion via API as well.  So it's orthogonal to the pulse ingestion work.  While important to fix, I'm removing it as a blocker from that bug.
No longer blocks: 1266229
(Assignee)

Comment 16

a year ago
Hey Sheeri--  Sorry to pull you in on this, but I'm a bit out of my league here.  The TL;DR is that we want to create a unique index on 3 columns in our DB that are varchar(100), varchar(512) and int.  However, I'm hitting the error in comment 13.  I'm wondering if the use of innodb_large_prefix is viable as a solution to this?

I'm working on this db, fwiw: treeherder-stage-rw-vip.db.scl3.mozilla.com
Table: treeherder_stage.job_detail
Flags: needinfo?(scabral)
(Assignee)

Comment 17

a year ago
Sheeri: specifically, would you recommend adding these to our my.cnf?

innodb_large_prefix=on
innodb_file_format=barracuda
innodb_file_per_table=true
innodb_file_per_table is standard.

We can put those other 2 in the my.cnf. They *should* work. The first 2 can also be put in as global values while the server is running.

The file format is for new tables only. I'm not sure if an ALTER TABLE will do the right thing in that case, but that's what stage is for, right?

However, job_detail is 11G. Let me set up some tests on stage.
Flags: needinfo?(scabral)
mysql>  show variables like 'innodb_large%';
+---------------------+-------+
| Variable_name       | Value |
+---------------------+-------+
| innodb_large_prefix | OFF   |
+---------------------+-------+
1 row in set (0.00 sec)

mysql>  show variables like 'innodb_file%';
+--------------------------+----------+
| Variable_name            | Value    |
+--------------------------+----------+
| innodb_file_format       | Antelope |
| innodb_file_format_check | ON       |
| innodb_file_format_max   | Antelope |
| innodb_file_per_table    | ON       |
+--------------------------+----------+
4 rows in set (0.00 sec)

mysql> set global innodb_large_prefix=on;  
Query OK, 0 rows affected (0.00 sec)

mysql>  set global innodb_file_format=barracuda;
Query OK, 0 rows affected (0.00 sec)

mysql> select @@hostname;
+---------------------------------------+
| @@hostname                            |
+---------------------------------------+
| treeherder1.stage.db.scl3.mozilla.com |
+---------------------------------------+
1 row in set (0.00 sec)

mysql>  show variables like 'innodb_large%'; show variables like 'innodb_file%';
+---------------------+-------+
| Variable_name       | Value |
+---------------------+-------+
| innodb_large_prefix | ON    |
+---------------------+-------+
1 row in set (0.00 sec)

+--------------------------+-----------+
| Variable_name            | Value     |
+--------------------------+-----------+
| innodb_file_format       | Barracuda |
| innodb_file_format_check | ON        |
| innodb_file_format_max   | Antelope  |
| innodb_file_per_table    | ON        |
+--------------------------+-----------+
4 rows in set (0.00 sec)

mysql> use treeherder_stage;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> create table jd like job_detail; insert into jd select * from job_detail limit 100000;
Query OK, 0 rows affected (0.17 sec)

Query OK, 100000 rows affected (2.63 sec)
Records: 100000  Duplicates: 0  Warnings: 0

---- meanwhile, on the shell:
[root@treeherder1.stage.db.scl3 ~]# ls -lh /var/lib/mysql/treeherder_stage/jd.ibd
-rw-rw---- 1 mysql mysql 40M Jun 15 12:33 /var/lib/mysql/treeherder_stage/jd.ibd

---- back in the db:

mysql> ALTER IGNORE TABLE jd ADD CONSTRAINT UNIQUE (`job_id`, `title`, `value`);
ERROR 1709 (HY000): Index column size too large. The maximum column size is 767 bytes.

Hrm, not sure how to force this table to be barracuda. Will look into it.
restarted treeherder1.stage with those parameters.

mysql> create table jd  (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `title` varchar(100) COLLATE utf8_bin DEFAULT NULL,
  `value` varchar(512) COLLATE utf8_bin NOT NULL,
  `url` varchar(512) COLLATE utf8_bin DEFAULT NULL,
  `job_id` bigint(20) NOT NULL,  PRIMARY KEY (`id`),
  KEY `job_detail_d697ea38` (`job_id`),
  KEY `job_detail_title` (`title`),
  FOREIGN KEY (`job_id`) REFERENCES `job` (`id`),
 UNIQUE KEY (`job_id`, `title`, `value`)
 ) engine=innodb default charset=utf8 collate=utf8_bin row_format=Compressed; 

That worked - putting row_format=Compressed in forced Barracuda:

mysql>  show variables like 'innodb_file_format%';
+--------------------------+-----------+
| Variable_name            | Value     |
+--------------------------+-----------+
| innodb_file_format       | Barracuda |
| innodb_file_format_check | ON        |
| innodb_file_format_max   | Barracuda |
+--------------------------+-----------+
3 rows in set (0.00 sec)

innodb_file_format_max is a high water mark, so it went from Antelope to Barracuda, because the table "jd" has it now.

Let me test more...
tl;dr - turn the global params on, and both adds and alters that use row_format=compressed do the trick:

mysql> set global innodb_large_prefix=on; set global innodb_file_format=barracuda;
Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

mysql> use treeherder_stage;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql>  create table jd  (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `title` varchar(100) COLLATE utf8_bin DEFAULT NULL,
  `value` varchar(512) COLLATE utf8_bin NOT NULL,
  `url` varchar(512) COLLATE utf8_bin DEFAULT NULL,
  `job_id` bigint(20) NOT NULL,  PRIMARY KEY (`id`),
  KEY `job_detail_d697ea38` (`job_id`),
  KEY `job_detail_title` (`title`),
  FOREIGN KEY (`job_id`) REFERENCES `job` (`id`),
 UNIQUE KEY (`job_id`, `title`, `value`)
 ) engine=innodb default charset=utf8 collate=utf8_bin row_format=Compressed;
Query OK, 0 rows affected (0.19 sec)

let's see if ALTER TABLE works:

mysql> drop table jd; create table jd like job_detail; alter table jd add UNIQUE KEY (`job_id`, `title`, `value`), row_format=compressed;
Query OK, 0 rows affected (0.01 sec)

Query OK, 0 rows affected (0.02 sec)

Query OK, 0 rows affected (0.05 sec)
Records: 0  Duplicates: 0  Warnings: 0

table jd dropped. 

The proof of concept is complete.

Comment 22

a year ago
Created attachment 8762951 [details] [review]
[treeherder] mozilla:innodb-large-prefix > mozilla:master
(Assignee)

Comment 23

a year ago
Comment on attachment 8762951 [details] [review]
[treeherder] mozilla:innodb-large-prefix > mozilla:master

Ed-- Do you know if this is the only place I need to make this change?  How do we set this on Heroku/RDS?

Looking at the RDS dashboard, perhaps I need to create a custom "DB Parameter Group"?  I fiddled with it, but I don't apparently have access to do that, atm...  :)
Attachment #8762951 - Flags: review?(emorley)
Comment on attachment 8762951 [details] [review]
[treeherder] mozilla:innodb-large-prefix > mozilla:master

The MySQL config file in the repo currently only affects the Vagrant environment. We also need to modify for:
* Travis (see .travis.yml for James' current modification)
* SCL3-{stage,prod}
* RDS {prototype,stage,prod}

For Travis/Vagrant IMO we should use a shared file (to replace the manual edits in .travis.yml) - we should create a new mostly empty file to do this (the current in-repo file is crufty, dating from the MySQL 5.5 and us-using-MyISAM era).

For RDS yeah we'll need to use a parameter group, and Kendall would need to set this up for you in Terraform etc for all three RDS instances.

Before we do this - do we definitely need such a large index? Can we reduce the size of some of the fields (if needs be by truncating edge-case existing data) and/or do we need the index on the 'value' attribute?

ie: I would posit that there should only be one instance per job_id-title combination, and the value shouldn't come into it? In which case the index shouldn't need to cover value, right? (sorry kinda lost track with all the discussion so far, I may be missing something)
Attachment #8762951 - Flags: review?(emorley)
(In reply to Ed Morley [:emorley] from comment #24)
> ie: I would posit that there should only be one instance per job_id-title
> combination, and the value shouldn't come into it? In which case the index
> shouldn't need to cover value, right? (sorry kinda lost track with all the
> discussion so far, I may be missing something)

Sadly we can have multiple job details with the same title. The most common example I've seen is "artifact uploaded". 

In lieu of modifying the database, we could consider capping the length of "value" at 155 characters (so title + value would only be 255 characters long). Kind of an arbitrary limitation, but probably legit.
(In reply to Cameron Dawson [:camd] from comment #15)
> While this error *does* happen in connection with data ingestion with pulse
> jobs, it can happen with ingestion via API as well.  So it's orthogonal to
> the pulse ingestion work.  While important to fix, I'm removing it as a
> blocker from that bug.

I don't see this occurring at all on prod -- are you sure this isn't only tickled when pulse data ingestion is active?

Some recent stage occurrences:
https://rpm.newrelic.com/accounts/677903/applications/5585473/traced_errors/553eb4-2d42fd6f-41f0-11e6-bd66-b82a72d22a14
https://rpm.newrelic.com/accounts/677903/applications/5585473/traced_errors/553a8f-c32d44b2-41ee-11e6-bd66-b82a72d22a14

Updated

a year ago
Flags: needinfo?(cdawson)
Instances where this occurs:
* SCL3-stage
* Heroku-stage

Instances where Pulse data ingestion is enabled:
* SCL3-stage
* Heroku-stage

So I think for now we need to treat this as blocking bug 1266229.
Blocks: 1266229
(Assignee)

Comment 28

a year ago
I had thought that perhaps we could just make a unique index on job_id, title, value.  And, indeed, after trimming the size of ``value`` to 125, I was able to create an index with all 3.  And having that index allowed me to query for duplicates with those 3 fields.

My hope was that any records duplicated in those 3 fields were complete duplicates (identical in url as well).  But they're not.  Well, these might be anomalies:

https://treeherder.allizom.org/#/jobs?repo=mozilla-inbound&revision=411f7ae1704d12648fab77210488692d23f42b74&filter-searchStr=e42fd356fb0d7e405385e12f4c4ca5c528ad67de&filter-tier=1&filter-tier=2&filter-tier=3

If you look in the Job Details tab, there are lots of dupes.  I'm still hunting for another example that doesn't look like it was a fluke.  :)
Flags: needinfo?(cdawson)
That seems like a bug. That job is from autophone.

Bob, are the duplicate entries on the job details tab here (eg "artifact uploaded", "Config: s1s2-twitter-remote.ini") expected?
https://treeherder.allizom.org/#/jobs?repo=mozilla-inbound&revision=411f7ae1704d12648fab77210488692d23f42b74&filter-searchStr=e42fd356fb0d7e405385e12f4c4ca5c528ad67de&filter-tier=1&filter-tier=2&filter-tier=3&selectedJob=23564721
Flags: needinfo?(bob)
Cameron, I'm seeing these exceptions on SCL3 stage:

 django.db.utils:DataError: (1406, "Data too long for column 'value' at row 1") 

in the log-parser task, during artifacts.py store_job_details():
https://rpm.newrelic.com/accounts/677903/applications/5585473/traced_errors/556195-b6d98887-4361-11e6-bd66-b82a72d22a14

Is this due to the change just made?

Comment 31

a year ago
Ed, no those duplicates aren't expected. I looked at several other jobs during that time frame and several of the cancelled jobs had duplicate entries like the example you gave. I didn't see any examples with jobs which were not cancelled. My first impression would be there is something wrong with the way that I cancel jobs that results in multiple job details being submitted.
Flags: needinfo?(bob)
(Assignee)

Comment 32

a year ago
Talked to Ed and Will in our TH meeting and decided this doesn't block transition to Pulse ingestion.

These errors happen when we get the same job from BOTH pulse and api.  The API submission on stage was accidentally switched on, which caused these errors.  

Nevertheless, we need to fix this issue.  So the plan is to :

1. Make some fields shorter so they can fit into a unique index (title, value, job_id)
2. delete job_detail entries that are in violation of the uniqueness in item 1
3. create that unique index from item 1
4. If we get a dup with a different url, we will just update the url value for the title/value/job_id entry
5. If users want multiple urls, they need to ensure to submit them with a unique title/value combination
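Plan items 4 and 5 amount to an upsert keyed on (job_id, title, value); a sqlite3 sketch of the intended semantics (requires SQLite 3.24+ for ON CONFLICT ... DO UPDATE; the names and URLs are illustrative, not real Treeherder data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE job_detail (
    id INTEGER PRIMARY KEY, job_id INTEGER, title TEXT, value TEXT, url TEXT,
    UNIQUE (job_id, title, value))""")

def store_detail(job_id, title, value, url):
    # Plan item 4: a duplicate key with a different url just updates the url.
    conn.execute("""INSERT INTO job_detail (job_id, title, value, url)
                    VALUES (?, ?, ?, ?)
                    ON CONFLICT (job_id, title, value)
                    DO UPDATE SET url = excluded.url""",
                 (job_id, title, value, url))

store_detail(1, "artifact uploaded", "log.txt", "http://example.com/v1/log.txt")
store_detail(1, "artifact uploaded", "log.txt", "http://example.com/v2/log.txt")
rows = conn.execute("SELECT url FROM job_detail").fetchall()
```

This matches the direction of the later change from `get_or_create` to `update_or_create`: resubmitting the same title/value for a job replaces the url rather than failing or duplicating.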

To do this, I'll need to
(Assignee)

Updated

a year ago
No longer blocks: 1266229

Updated

a year ago
Attachment #8761357 - Attachment is obsolete: true

Updated

a year ago
Attachment #8761321 - Attachment is obsolete: true

Updated

a year ago
Attachment #8760982 - Attachment is obsolete: true

Updated

a year ago
Attachment #8761357 - Attachment is obsolete: false

Updated

a year ago
Attachment #8762951 - Attachment is obsolete: true
(Assignee)

Comment 33

a year ago
Oops, I never finished my thought from comment 32...  :)

My plan is to have a few PRs in stages.

PR 1
====

1. Change field lengths so they can fit into the desired index.
2. Change the get_or_create to an update_or_create

SQL
===
delete duplicate records
create a unique index on title, value, job_id (coming in PR 2)
fake out the migration

PR 2
====

1. Create the migration for Unique index on title, value and job_id

Comment 34

a year ago
Created attachment 8769313 [details] [review]
[treeherder] mozilla:job-info-fields-for-pulse > mozilla:master
(Assignee)

Updated

a year ago
Attachment #8769313 - Flags: review?(garndt)

Updated

a year ago
Attachment #8769313 - Flags: review?(garndt) → review+

Comment 35

a year ago
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/4cc00cf784b7ff4deb8cf37acf544f605cebb81e
Bug 1278711 - Set pulse schema jobdetails field lengths higher (#1665)

This sets the field lengths to what they will be in a later PR for
the job_details model.  But these are still within the constraints
of the current field lengths for that table.

Comment 36

a year ago
Created attachment 8769377 [details] [review]
[treeherder] mozilla:jobdetail-uniqueness-fix > mozilla:master
(Assignee)

Updated

a year ago
Attachment #8769377 - Flags: review?(wlachance)
(Assignee)

Comment 37

a year ago
I will likely have to shuttle this migration through prod (maybe stage, too) by hand, since the job_detail table is quite large there.
(Assignee)

Comment 38

a year ago
I will remove duplicates on the job_detail table with this sql query I'll run by hand prior to creating the unique index:

delete
from job_detail using job_detail,
    job_detail jt1
where job_detail.id < jt1.id
    and job_detail.title = jt1.title
    and job_detail.value = jt1.value
    and job_detail.job_id = jt1.job_id
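A sanity check of that delete's logic: the self-join removes every row that has a duplicate with a larger id, i.e. it keeps MAX(id) per (job_id, title, value) group. An equivalent formulation, verified here with sqlite3 on made-up data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE job_detail (id INTEGER PRIMARY KEY, job_id INTEGER, title TEXT, value TEXT);
INSERT INTO job_detail VALUES
    (1, 1, 'artifact uploaded', 'log.txt'),
    (2, 1, 'artifact uploaded', 'log.txt'),  -- duplicate of id 1
    (3, 2, 'CPU idle', '37.5%');
""")
# Equivalent of the MySQL self-join delete: keep only the highest id
# in each (job_id, title, value) group.
conn.execute("""DELETE FROM job_detail
                WHERE id NOT IN (SELECT MAX(id) FROM job_detail
                                 GROUP BY job_id, title, value)""")
kept = [r[0] for r in conn.execute("SELECT id FROM job_detail ORDER BY id")]
```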

Comment 39

a year ago
Created attachment 8769393 [details] [review]
[treeherder] mozilla:job_detail_unique_together_idx > mozilla:master
(Assignee)

Comment 40

a year ago
Comment on attachment 8769393 [details] [review]
[treeherder] mozilla:job_detail_unique_together_idx > mozilla:master

This is a follow-on to the other PR in this bug.  So please review that one first.  This one branches off that one, so please only compare the last commit on this PR.
Attachment #8769393 - Flags: review?(wlachance)
Comment on attachment 8769377 [details] [review]
[treeherder] mozilla:jobdetail-uniqueness-fix > mozilla:master

I'd like us to consider using Django's _meta API here. Other than that this looks good.
Attachment #8769377 - Flags: review?(wlachance) → review-
Comment on attachment 8769393 [details] [review]
[treeherder] mozilla:job_detail_unique_together_idx > mozilla:master

One nit, but this basically looks fine.
Attachment #8769393 - Flags: review?(wlachance) → review+
(Assignee)

Comment 43

a year ago
Comment on attachment 8769377 [details] [review]
[treeherder] mozilla:jobdetail-uniqueness-fix > mozilla:master

Yep, that's a great improvement.  Done.  I'll have to keep ``_meta`` in mind for the future.
Attachment #8769377 - Flags: review- → review?(wlachance)
(In reply to Cameron Dawson [:camd] from comment #38)
> I will remove duplicates on the job_detail table with this sql query I'll
> run by hand prior to creating the unique index:
> 
> delete
> from job_detail using job_detail,
>     job_detail jt1
> where job_detail.id < jt1.id
>     and job_detail.title = jt1.title
>     and job_detail.value = jt1.value
>     and job_detail.job_id = jt1.job_id

I'm worried that this is going to be super slow unless you have an index (since my guess is that this is a O(n^2) operation). You may want to create a temporary non-unique index on job id, title and value before you do this.
Attachment #8769377 - Flags: review?(wlachance) → review+
Pushes like

https://treeherder.mozilla.org/#/jobs?repo=try&revision=4244fed90395&selectedJob=24994251

are exhibiting this behaviour.  The Task Group shows good results:

https://tools.taskcluster.net/task-group-inspector/#PjaFt0T3TwS2ERtQoYzkPQ//
(Assignee)

Comment 46

a year ago
Nick, it turns out this is a similar, but not-quite-the-same issue as the one I believe you were seeing.  I created bug 1291882 for that one.  I have fixed the DB and I think it's resolved now.  Please let me know if you see it again.

Comment 47

a year ago
Created attachment 8778474 [details] [review]
[treeherder] mozilla:tmp-prod-jobdetail-migration > mozilla:master
(Assignee)

Updated

a year ago
Attachment #8778474 - Flags: review?(emorley)

Updated

a year ago
Attachment #8778474 - Flags: review?(emorley) → review+

Comment 48

a year ago
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/0f2fd75019b1cdfacf40a27be9d2db5e7c9a8826
Bug 1278711 - Lower field lengths in job_detail table (#1666)

This is required in order to create a unique index on title,
value and job_id to prevent duplicates.  The index will be
created in a later PR.

This also uses update_or_create instead of get_or_create as
this will be the mechanism going forward to prevent duplicates.

Comment 49

a year ago
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/13943f6e2113924c83aa1944c7c6fd336fc177b9
Bug 1278711 - Add unique_together index for job_detail (#1667)
(Assignee)

Updated

a year ago
Depends on: 1293500
(Assignee)

Comment 50

a year ago
The field lengths have been changed and the commit for that (0f2fd75019b1cdfacf40a27be9d2db5e7c9a8826) has been pushed to production now.  The migration for adding the unique index has already been faked in the migrations table.  Pythian is going to begin the command to create that index now.

I will confirm it's been created tomorrow, then we can close this bug.
Clarifying summary / adding an explanation, since the tree closure message is pointing at this bug and there seems to be some confusion from people as to what to expect.

In order to fix occasional race conditions during data ingestion, an index had to be added to one of the larger database tables used by Treeherder. This was tested on stage, but took a while to complete, so for production Pythian suggested using their schema-modification tool, which allows adding an index without locking the table.

However this index addition (which is occurring in the dep bug 1293500) appears to have still caused a perf impact (judging from #treeherder IRC scrollback, there aren't logs for #data and I can't see any bug comments), meaning job ingestion started backing up at 01:30 UTC today.

As such, Cameron pushed a branch that temporarily disabled Taskcluster job ingestion at, and all log parsing - so that the DB index creation could complete faster.

Looking at the CPU graphs I don't think the index creation has finished yet:
https://rpm.newrelic.com/accounts/677903/servers/6106888

Once it has, ingestion can be re-enabled, albeit there is a 64,000 taskcluster job ingestion backlog, and a 20,000 buildbot log parsing backlog.
Summary: Getting JobDetail errors during job ingestion MultipleObjectsReturned → Add missing indexes to the jobdetail table
(In reply to Ed Morley [:emorley] from comment #51)
> pushed a branch that temporarily disabled Taskcluster job ingestion at,

At 04:00 UTC
The index creation has now completed; job ingestion re-enabled at 09:45 UTC.

In addition, I've temporarily blocked the ActiveData-ETL user agent from making requests to the API, since it was responsible for 75% of total web requests, and included ones made to the table being altered (which will be under heavy use now that data ingestion has resumed).

The DB CPU usage is maxed out at 100%, but is otherwise handling the backlog fine.

In the 40 mins since data ingestion resumed, the queues have changed as follows:
* Taskcluster job ingestion queue: 64,000 -> 0
* Log parsing queues: 22,000 -> 33,000

Note that whilst the log parsing queue has risen, that's because the 64K Taskcluster jobs would have each put at least one log parsing task in the other queue. 

The trees can now be reopened, since the log parsing backlog is for successful jobs only (the parsing of logs for failed jobs takes priority), so doesn't affect sheriffing (and will catch up shortly anyway).

Updated

a year ago
Attachment #8761357 - Attachment is obsolete: true
Log parser queue for successful jobs' logs is now also down to zero.
(In reply to Ed Morley [:emorley] from comment #53)
> In addition, I've temporarily blocked the ActiveData-ETL user agent from
> making requests to the API, since it was responsible for 75% of total web
> requests, and included ones made to the table being altered (which will now
> be under heavy use now that data ingestion has resumed).

Not the DB load has stabilised, I've pushed to prod again reversing the block.
s/Not/Now/
Created attachment 8779745 [details]
newrelic-transactions-graph.jpg
(Assignee)

Comment 58

a year ago
We can finally call this one fixed now.  Thanks Ed, Will and everyone for your help and efforts on this issue.
Status: REOPENED → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED
Blocks: 1299954

Updated

a year ago
Depends on: 1304339