Add missing indexes to the jobdetail table

Status: RESOLVED FIXED
Product: Tree Management :: Treeherder: Data Ingestion
Opened: a year ago
Last modified: a year ago

People

(Reporter: camd, Assigned: camd)

Attachments

(5 attachments, 4 obsolete attachments)

(Assignee)

Description

a year ago
Here's an example of the error on new relic:

https://rpm.newrelic.com/accounts/677903/applications/5585473/traced_errors/553a8f-83e1d9fc-2c4a-11e6-b947-b82a72d22a14

treeherder.model.models:MultipleObjectsReturned: get() returned more than one JobDetail -- it returned 2!
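The `get()` inside `get_or_create` assumes at most one matching row; once duplicate rows exist, it fails exactly as the traceback shows. A minimal sqlite3 sketch of the same semantics (the table name mirrors `job_detail`, but this is an illustration, not Treeherder's actual code):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job_detail "
             "(id INTEGER PRIMARY KEY, job_id INTEGER, title TEXT, value TEXT)")
# Without a unique constraint, nothing stops two identical detail rows:
conn.execute("INSERT INTO job_detail (job_id, title, value) "
             "VALUES (1, 'artifact uploaded', 'log.txt')")
conn.execute("INSERT INTO job_detail (job_id, title, value) "
             "VALUES (1, 'artifact uploaded', 'log.txt')")

def get(job_id, title, value):
    """Mimic Django QuerySet.get(): exactly one matching row, or an error."""
    rows = conn.execute(
        "SELECT id FROM job_detail WHERE job_id = ? AND title = ? AND value = ?",
        (job_id, title, value)).fetchall()
    if len(rows) > 1:
        raise LookupError(
            "get() returned more than one JobDetail -- it returned %d!" % len(rows))
    return rows[0]

err = None
try:
    get(1, "artifact uploaded", "log.txt")
except LookupError as e:
    err = str(e)
```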
(Assignee)

Updated

a year ago
Assignee: nobody → cdawson

Comment 1

a year ago
Created attachment 8760982 [details] [review]
[treeherder] mozilla:jobdetail-unique-together > mozilla:master
(Assignee)

Updated

a year ago
Attachment #8760982 - Flags: review?(wlachance)
Comment on attachment 8760982 [details] [review]
[treeherder] mozilla:jobdetail-unique-together > mozilla:master

Should have added this way in the first place, sorry! Thanks for fixing it.
Attachment #8760982 - Flags: review?(wlachance) → review+

Comment 3

a year ago
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/4e377daf5aa8357ccee9ea6d2d841dfe41ee70e3
Bug 1278711 - Add unique together index to JobDetail model/table (#1573)

During job ingestion, we can sometimes get an error of
MultipleObjectsReturned when creating a JobDetail record.  Adding
an index for uniqueness will prevent us from getting that while
using the get_or_create model function.
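The commit message above describes the mechanism; a minimal sqlite3 sketch (illustrative names, not Treeherder's actual code) of why a unique index closes the race in the SELECT-then-INSERT pattern that `get_or_create` uses — the losing worker's INSERT raises a catchable integrity error instead of silently creating a duplicate:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE job_detail (
    id INTEGER PRIMARY KEY,
    job_id INTEGER, title TEXT, value TEXT,
    UNIQUE (job_id, title, value))""")

def get_or_create(job_id, title, value):
    """Roughly the get_or_create shape: SELECT first, INSERT on a miss.
    Two racing workers can both see a miss; the unique index makes the
    loser's INSERT fail instead of creating a duplicate row."""
    row = conn.execute(
        "SELECT id FROM job_detail WHERE job_id = ? AND title = ? AND value = ?",
        (job_id, title, value)).fetchone()
    if row:
        return row[0], False
    try:
        cur = conn.execute(
            "INSERT INTO job_detail (job_id, title, value) VALUES (?, ?, ?)",
            (job_id, title, value))
        return cur.lastrowid, True
    except sqlite3.IntegrityError:
        # Lost the race: fetch the row the other worker just created.
        row = conn.execute(
            "SELECT id FROM job_detail WHERE job_id = ? AND title = ? AND value = ?",
            (job_id, title, value)).fetchone()
        return row[0], False

_, created1 = get_or_create(1, "artifact uploaded", "log.txt")
_, created2 = get_or_create(1, "artifact uploaded", "log.txt")
total = conn.execute("SELECT COUNT(*) FROM job_detail").fetchone()[0]
```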
(Assignee)

Updated

a year ago
Status: NEW → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED

Comment 4

a year ago
Created attachment 8761321 [details] [review]
[treeherder] mozilla:revert-jobdetail-index > mozilla:master

Comment 5

a year ago
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/12567801ce236844cf267337869a6a724b66121a
Revert "Bug 1278711 - Add unique together index to JobDetail model/table (#1573)" (#1577)

This reverts commit 4e377daf5aa8357ccee9ea6d2d841dfe41ee70e3.

Comment 6

a year ago
Created attachment 8761357 [details] [review]
[treeherder] mozilla:add-jobdetail-unique-together-idx > mozilla:master
(Assignee)

Comment 7

a year ago
Comment on attachment 8761357 [details] [review]
[treeherder] mozilla:add-jobdetail-unique-together-idx > mozilla:master

Hey Will--  What do you think of this approach?  :)
Attachment #8761357 - Flags: review?(wlachance)
Comment on attachment 8761357 [details] [review]
[treeherder] mozilla:add-jobdetail-unique-together-idx > mozilla:master

As discussed on irc, I'd like to see:

1. Update the get_or_create statement to put "url" in a defaults dict (lots of examples of this elsewhere in treeherder)
2. Update the comment to make it clear that the behaviour is what we want, rather than just a workaround.
Attachment #8761357 - Flags: review?(wlachance) → review-
(Assignee)

Comment 9

a year ago
I think there is likely an additional problem to what we're seeing.  We probably already have duplicate rows in the table.  That will cause the creation of this index to fail.  Currently running a query to find duplicates on stage now, but it's taking a loooooong time. Keeps timing out... sigh...

Sooo... after googling, I think our solution MAY be to accept the migration as I've got it written.  But instead of executing it as-is, we need to run the sql by hand.  I am SURE it'll timeout anyway....

But the SQL for the migration will be:

ALTER TABLE `job_detail` MODIFY `title` varchar(100) NULL;
ALTER TABLE `job_detail` ADD CONSTRAINT `job_detail_job_id_f2203261c91c576_uniq` UNIQUE (`job_id`, `title`, `value`);

However, we can still make use of "ALTER IGNORE TABLE" which will delete duplicates for us as it creates the index.  This function is apparently available in mysql 5.6, which we're on.  It's being taken out of 5.7.4 because it causes Hepatitis.  Actually... not sure why...

Will: did you have any better ideas?
Status: RESOLVED → REOPENED
Flags: needinfo?(wlachance)
Resolution: FIXED → ---
(Assignee)

Comment 10

a year ago
My last comment may have been unclear.  We should replace this line:

ALTER TABLE `job_detail` ADD CONSTRAINT `job_detail_job_id_f2203261c91c576_uniq` UNIQUE (`job_id`, `title`, `value`);

with this:

ALTER IGNORE TABLE `job_detail` ADD CONSTRAINT `job_detail_job_id_f2203261c91c576_uniq` UNIQUE (`job_id`, `title`, `value`);
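`ALTER IGNORE TABLE` is MySQL-specific (and was indeed removed in 5.7.4). Its effect — dropping duplicate rows while the unique key is built — can be sketched portably by copying rows through a conflict-ignoring insert; a sqlite3 illustration with made-up data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE job_detail (id INTEGER PRIMARY KEY, job_id INTEGER, title TEXT, value TEXT);
INSERT INTO job_detail (job_id, title, value) VALUES
    (1, 'artifact uploaded', 'log.txt'),
    (1, 'artifact uploaded', 'log.txt'),   -- duplicate
    (2, 'CPU idle', '37.5%');
-- The ALTER IGNORE effect: rebuild the table under the unique key,
-- silently dropping rows that collide with an already-copied row.
CREATE TABLE job_detail_new (
    id INTEGER PRIMARY KEY, job_id INTEGER, title TEXT, value TEXT,
    UNIQUE (job_id, title, value));
INSERT OR IGNORE INTO job_detail_new SELECT * FROM job_detail;
""")
remaining = conn.execute("SELECT COUNT(*) FROM job_detail_new").fetchone()[0]
```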
(In reply to Cameron Dawson [:camd] from comment #9)
> I think there is likely an additional problem to what we're seeing.  We
> probably already have duplicate rows in the table.  That will cause the
> creation of this index to fail.  Currently running a query to find
> duplicates on stage now, but it's taking a loooooong time. Keeps timing
> out... sigh...

It might be faster to create an index on job_id, value, title first before running a query to get the duplicates?

Are we not able to add the constraint without eliminating the duplicates first? Does mysql verify the table matches the constraint when you add it?

> Sooo... after googling, I think our solution MAY be to accept the migration
> as I've got it written.  But instead of executing it as-is, we need to run
> the sql by hand.  I am SURE it'll timeout anyway....

Yeah, definitely this is stuff that should be done outside of a deployment. :)
Flags: needinfo?(wlachance)
If making manual changes, please also update the new stage RDS instance :-)
(I'd recommend adding SCL3 x 2 + RDS x 3 to your MySQLWorkbench configs - makes it pretty quick to connect to each to make changes :-))

Updated

a year ago
Blocks: 1266229
(Assignee)

Comment 13

a year ago
Dang, I'm a bit stuck here so far.  Need to talk with a DBA.  So the ``value`` field is a varchar(512) and I can't create an index on that field even by itself.  I still get "Specified key was too long; max key length is 767 bytes"

The table is utf8 and as I understand each char in a varchar can be 1 to 6 bytes.  The index is probably calculating size based on 6.  We'd have to truncate the field to 125 or so to make it fit in an index.  And that's just the one field. 
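For context, MySQL's `utf8` charset (the pre-utf8mb4 subset) stores at most 3 bytes per character, and InnoDB's default key prefix limit is 767 bytes, rising to 3072 with `innodb_large_prefix` on a Barracuda-format table. The worst-case arithmetic, sketched in Python (the 125-character figure above corresponds to the more pessimistic 6-bytes-per-char assumption):

```python
# Worst-case index key sizes under MySQL's utf8 charset (3 bytes/char max).
LIMIT = 767                  # InnoDB key prefix limit without innodb_large_prefix
LARGE_LIMIT = 3072           # limit with innodb_large_prefix + Barracuda row formats
BYTES_PER_CHAR = 3           # MySQL "utf8" stores at most 3 bytes per character

value_key = 512 * BYTES_PER_CHAR     # varchar(512) -> 1536 bytes: too big on its own
title_key = 100 * BYTES_PER_CHAR     # varchar(100) -> 300 bytes: fits
max_chars = LIMIT // BYTES_PER_CHAR  # largest indexable varchar without large_prefix

assert value_key > LIMIT             # explains "max key length is 767 bytes"
assert value_key < LARGE_LIMIT       # why innodb_large_prefix rescues varchar(512)
```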

We CAN avoid the exception just with a ``try:`` block.  But that won't prevent us creating duplicate records.
(Assignee)

Comment 14

a year ago
After more googling, it looks like there's a setting for the innoDB engine that may help us:

http://dev.mysql.com/doc/refman/5.5/en/innodb-parameters.html#sysvar_innodb_large_prefix
(Assignee)

Comment 15

a year ago
While this error *does* happen in connection with data ingestion with pulse jobs, it can happen with ingestion via API as well.  So it's orthogonal to the pulse ingestion work.  While important to fix, I'm removing it as a blocker from that bug.
No longer blocks: 1266229
(Assignee)

Comment 16

a year ago
Hey Sheeri--  Sorry to pull you in on this, but I'm a bit out of my league here.  The TL;DR is that we want to create a unique index on 3 columns in our DB that are varchar(100), varchar(512) and int.  However, I'm hitting the error in comment 13.  I'm wondering if the use of innodb_large_prefix is viable as a solution to this?

I'm working on this db, fwiw: treeherder-stage-rw-vip.db.scl3.mozilla.com
Table: treeherder_stage.job_detail
Flags: needinfo?(scabral)
(Assignee)

Comment 17

a year ago
Sheeri: specifically, would you recommend adding these to our my.cnf?

innodb_large_prefix=on
innodb_file_format=barracuda
innodb_file_per_table=true
innodb_file_per_table is standard.

We can put those other 2 in the my.cnf. They *should* work. The first 2 can also be put in as global values while the server is running.

The file format is for new tables only. I'm not sure if an ALTER TABLE will do the right thing in that case, but that's what stage is for, right?

However, job_detail is 11G. Let me set up some tests on stage.
Flags: needinfo?(scabral)
mysql>  show variables like 'innodb_large%';
+---------------------+-------+
| Variable_name       | Value |
+---------------------+-------+
| innodb_large_prefix | OFF   |
+---------------------+-------+
1 row in set (0.00 sec)

mysql>  show variables like 'innodb_file%';
+--------------------------+----------+
| Variable_name            | Value    |
+--------------------------+----------+
| innodb_file_format       | Antelope |
| innodb_file_format_check | ON       |
| innodb_file_format_max   | Antelope |
| innodb_file_per_table    | ON       |
+--------------------------+----------+
4 rows in set (0.00 sec)

mysql> set global innodb_large_prefix=on;  
Query OK, 0 rows affected (0.00 sec)

mysql>  set global innodb_file_format=barracuda;
Query OK, 0 rows affected (0.00 sec)

mysql> select @@hostname;
+---------------------------------------+
| @@hostname                            |
+---------------------------------------+
| treeherder1.stage.db.scl3.mozilla.com |
+---------------------------------------+
1 row in set (0.00 sec)

mysql>  show variables like 'innodb_large%'; show variables like 'innodb_file%';
+---------------------+-------+
| Variable_name       | Value |
+---------------------+-------+
| innodb_large_prefix | ON    |
+---------------------+-------+
1 row in set (0.00 sec)

+--------------------------+-----------+
| Variable_name            | Value     |
+--------------------------+-----------+
| innodb_file_format       | Barracuda |
| innodb_file_format_check | ON        |
| innodb_file_format_max   | Antelope  |
| innodb_file_per_table    | ON        |
+--------------------------+-----------+
4 rows in set (0.00 sec)

mysql> use treeherder_stage;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> create table jd like job_detail; insert into jd select * from job_detail limit 100000;
Query OK, 0 rows affected (0.17 sec)

Query OK, 100000 rows affected (2.63 sec)
Records: 100000  Duplicates: 0  Warnings: 0

---- meanwhile, on the shell:
[root@treeherder1.stage.db.scl3 ~]# ls -lh /var/lib/mysql/treeherder_stage/jd.ibd
-rw-rw---- 1 mysql mysql 40M Jun 15 12:33 /var/lib/mysql/treeherder_stage/jd.ibd

---- back in the db:

mysql> ALTER IGNORE TABLE jd ADD CONSTRAINT UNIQUE (`job_id`, `title`, `value`);
ERROR 1709 (HY000): Index column size too large. The maximum column size is 767 bytes.

Hrm, not sure how to force this table to be barracuda. Will look into it.
restarted treeherder1.stage with those parameters.

mysql> create table jd  (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `title` varchar(100) COLLATE utf8_bin DEFAULT NULL,
  `value` varchar(512) COLLATE utf8_bin NOT NULL,
  `url` varchar(512) COLLATE utf8_bin DEFAULT NULL,
  `job_id` bigint(20) NOT NULL,  PRIMARY KEY (`id`),
  KEY `job_detail_d697ea38` (`job_id`),
  KEY `job_detail_title` (`title`),
  FOREIGN KEY (`job_id`) REFERENCES `job` (`id`),
 UNIQUE KEY (`job_id`, `title`, `value`)
 ) engine=innodb default charset=utf8 collate=utf8_bin row_format=Compressed; 

That worked - putting row_format=Compressed in forced Barracuda:

mysql>  show variables like 'innodb_file_format%';
+--------------------------+-----------+
| Variable_name            | Value     |
+--------------------------+-----------+
| innodb_file_format       | Barracuda |
| innodb_file_format_check | ON        |
| innodb_file_format_max   | Barracuda |
+--------------------------+-----------+
3 rows in set (0.00 sec)

innodb_file_format_max is a high water mark, so it went from Antelope to Barracuda, because the table "jd" has it now.

Let me test more...
tl;dr - turn the global params on, and both adds and alters that use row_format=compressed do the trick:

mysql> set global innodb_large_prefix=on; set global innodb_file_format=barracuda;
Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

mysql> use treeherder_stage;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql>  create table jd  (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `title` varchar(100) COLLATE utf8_bin DEFAULT NULL,
  `value` varchar(512) COLLATE utf8_bin NOT NULL,
  `url` varchar(512) COLLATE utf8_bin DEFAULT NULL,
  `job_id` bigint(20) NOT NULL,  PRIMARY KEY (`id`),
  KEY `job_detail_d697ea38` (`job_id`),
  KEY `job_detail_title` (`title`),
  FOREIGN KEY (`job_id`) REFERENCES `job` (`id`),
 UNIQUE KEY (`job_id`, `title`, `value`)
 ) engine=innodb default charset=utf8 collate=utf8_bin row_format=Compressed;
Query OK, 0 rows affected (0.19 sec)

let's see if ALTER TABLE works:

mysql> drop table jd; create table jd like job_detail; alter table jd add UNIQUE KEY (`job_id`, `title`, `value`), row_format=compressed;
Query OK, 0 rows affected (0.01 sec)

Query OK, 0 rows affected (0.02 sec)

Query OK, 0 rows affected (0.05 sec)
Records: 0  Duplicates: 0  Warnings: 0

table jd dropped. 

The proof of concept is complete.

Comment 22

a year ago
Created attachment 8762951 [details] [review]
[treeherder] mozilla:innodb-large-prefix > mozilla:master
(Assignee)

Comment 23

a year ago
Comment on attachment 8762951 [details] [review]
[treeherder] mozilla:innodb-large-prefix > mozilla:master

Ed-- Do you know if this is the only place I need to make this change?  How do we set this on Heroku/RDS?

Looking at the RDS dashboard, perhaps I need to create a custom "DB Parameter Group"?  I fiddled with it, but I don't apparently have access to do that, atm...  :)
Attachment #8762951 - Flags: review?(emorley)
Comment on attachment 8762951 [details] [review]
[treeherder] mozilla:innodb-large-prefix > mozilla:master

The MySQL config file in the repo currently only affects the Vagrant environment. We also need to modify for:
* Travis (see .travis.yml for James' current modification)
* SCL3-{stage,prod}
* RDS {prototype,stage,prod}

For Travis/Vagrant IMO we should use a shared file (to replace the manual edits in .travis.yml) - we should create a new mostly empty file to do this (the current in-repo file is crufty, dating from the MySQL 5.5 and us-using-MyISAM era).

For RDS yeah we'll need to use a parameter group, and Kendall would need to set this up for you in Terraform etc for all three RDS instances.

Before we do this - do we definitely need such a large index? Can we reduce the size of some of the fields (if needs be by truncating edge-case existing data) and/or do we need the index on the 'value' attribute?

ie: I would posit that there should only be one instance per job_id-title combination, and the value shouldn't come into it? In which case the index shouldn't need to cover value, right? (sorry kinda lost track with all the discussion so far, I may be missing something)
Attachment #8762951 - Flags: review?(emorley)
(In reply to Ed Morley [:emorley] from comment #24)
> ie: I would posit that there should only be one instance per job_id-title
> combination, and the value shouldn't come into it? In which case the index
> shouldn't need to cover value, right? (sorry kinda lost track with all the
> discussion so far, I may be missing something)

Sadly we can have multiple job details with the same title. The most common example I've seen is "artifact uploaded". 

In lieu of modifying the database, we could consider capping the length of "value" at 155 characters (so title + value would only be 255 characters long). Kind of an arbitrary limitation, but probably legit.
(In reply to Cameron Dawson [:camd] from comment #15)
> While this error *does* happen in connection with data ingestion with pulse
> jobs, it can happen with ingestion via API as well.  So it's orthogonal to
> the pulse ingestion work.  While important to fix, I'm removing it as a
> blocker from that bug.

I don't see this occurring at all on prod -- are you sure this isn't only tickled when pulse data ingestion is active?

Some recent stage occurrences:
https://rpm.newrelic.com/accounts/677903/applications/5585473/traced_errors/553eb4-2d42fd6f-41f0-11e6-bd66-b82a72d22a14
https://rpm.newrelic.com/accounts/677903/applications/5585473/traced_errors/553a8f-c32d44b2-41ee-11e6-bd66-b82a72d22a14

Updated

a year ago
Flags: needinfo?(cdawson)
Instances where this occurs:
* SCL3-stage
* Heroku-stage

Instances where Pulse data ingestion is enabled:
* SCL3-stage
* Heroku-stage

So I think for now we need to treat this as blocking bug 1266229.
Blocks: 1266229
(Assignee)

Comment 28

a year ago
I had thought that perhaps we could just make a unique index on job_id, title, value.  And, indeed, after trimming the size of ``value`` to 125, I was able to create an index with all 3.  And having that index allowed me to query for duplicates with those 3 fields.

My hope was that any records duplicated in those 3 fields were complete duplicates (identical in url as well).  But they're not.  Well, these might be anomalies:

https://treeherder.allizom.org/#/jobs?repo=mozilla-inbound&revision=411f7ae1704d12648fab77210488692d23f42b74&filter-searchStr=e42fd356fb0d7e405385e12f4c4ca5c528ad67de&filter-tier=1&filter-tier=2&filter-tier=3

If you look in the Job Details tab, there are lots of dupes.  I'm still hunting for another example that doesn't look like it was a fluke.  :)
Flags: needinfo?(cdawson)
That seems like a bug. That job is from autophone.

Bob, are the duplicate entries on the job details tab here (eg "artifact uploaded", "Config: s1s2-twitter-remote.ini") expected?
https://treeherder.allizom.org/#/jobs?repo=mozilla-inbound&revision=411f7ae1704d12648fab77210488692d23f42b74&filter-searchStr=e42fd356fb0d7e405385e12f4c4ca5c528ad67de&filter-tier=1&filter-tier=2&filter-tier=3&selectedJob=23564721
Flags: needinfo?(bob)
Cameron, I'm seeing these exceptions on SCL3 stage:

 django.db.utils:DataError: (1406, "Data too long for column 'value' at row 1") 

in the log-parser task, during artifacts.py store_job_details():
https://rpm.newrelic.com/accounts/677903/applications/5585473/traced_errors/556195-b6d98887-4361-11e6-bd66-b82a72d22a14

Is this due to the change just made?

Comment 31

a year ago
Ed, no those duplicates aren't expected. I looked at several other jobs during that time frame and several of the cancelled jobs had duplicate entries like the example you gave. I didn't see any examples with jobs which were not cancelled. My first impression would be there is something wrong with the way that I cancel jobs that results in multiple job details being submitted.
Flags: needinfo?(bob)
(Assignee)

Comment 32

a year ago
Talked to Ed and Will in our TH meeting and decided this doesn't block transition to Pulse ingestion.

These errors happen when we get the same job from BOTH pulse and api.  The API submission on stage was accidentally switched on, which caused these errors.  

Nevertheless, we need to fix this issue.  So the plan is to :

1. Make some fields shorter so they can fit into a unique index (title, value, job_id)
2. delete job_detail entries that are in violation of the uniqueness in item 1
3. create that unique index from item 1
4. If we get a dup with a different url, we will just update the url value for the title/value/job_id entry
5. If users want multiple urls, they need to ensure to submit them with a unique title/value combination
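Plan items 4 and 5 amount to an upsert keyed on (job_id, title, value); a sqlite3 sketch of the intended semantics (requires SQLite 3.24+ for ON CONFLICT ... DO UPDATE; the names and URLs are illustrative, not real Treeherder data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE job_detail (
    id INTEGER PRIMARY KEY, job_id INTEGER, title TEXT, value TEXT, url TEXT,
    UNIQUE (job_id, title, value))""")

def store_detail(job_id, title, value, url):
    # Plan item 4: a duplicate key with a different url just updates the url.
    conn.execute("""INSERT INTO job_detail (job_id, title, value, url)
                    VALUES (?, ?, ?, ?)
                    ON CONFLICT (job_id, title, value)
                    DO UPDATE SET url = excluded.url""",
                 (job_id, title, value, url))

store_detail(1, "artifact uploaded", "log.txt", "http://example.com/v1/log.txt")
store_detail(1, "artifact uploaded", "log.txt", "http://example.com/v2/log.txt")
rows = conn.execute("SELECT url FROM job_detail").fetchall()
```

This matches the direction of the later change from `get_or_create` to `update_or_create`: resubmitting the same title/value for a job replaces the url rather than failing or duplicating.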

To do this, I'll need to
(Assignee)

Updated

a year ago
No longer blocks: 1266229

Updated

a year ago
Attachment #8761357 - Attachment is obsolete: true

Updated

a year ago
Attachment #8761321 - Attachment is obsolete: true

Updated

a year ago
Attachment #8760982 - Attachment is obsolete: true

Updated

a year ago
Attachment #8761357 - Attachment is obsolete: false

Updated

a year ago
Attachment #8762951 - Attachment is obsolete: true
(Assignee)

Comment 33

a year ago
Oops, I never finished my thought from comment 32...  :)

My plan is to have a few PRs in stages.

PR 1
====

1. Change field lengths so they can fit into the desired index.
2. Change the get_or_create to an update_or_create

SQL
===
delete duplicate records
create a unique index on title, value, job_id (coming in PR 2)
fake out the migration

PR 2
====

1. Create the migration for Unique index on title, value and job_id

Comment 34

a year ago
Created attachment 8769313 [details] [review]
[treeherder] mozilla:job-info-fields-for-pulse > mozilla:master
(Assignee)

Updated

a year ago
Attachment #8769313 - Flags: review?(garndt)

Updated

a year ago
Attachment #8769313 - Flags: review?(garndt) → review+

Comment 35

a year ago
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/4cc00cf784b7ff4deb8cf37acf544f605cebb81e
Bug 1278711 - Set pulse schema jobdetails field lengths higher (#1665)

This sets the field lengths to what they will be in a later PR for
the job_details model.  But these are still within the constraints
of the current field lengths for that table.

Comment 36

a year ago
Created attachment 8769377 [details] [review]
[treeherder] mozilla:jobdetail-uniqueness-fix > mozilla:master
(Assignee)

Updated

a year ago
Attachment #8769377 - Flags: review?(wlachance)
(Assignee)

Comment 37

a year ago
I will likely have to shuttle this migration through prod (maybe stage, too) by hand, since the job_detail table is quite large there.
(Assignee)

Comment 38

a year ago
I will remove duplicates on the job_detail table with this sql query I'll run by hand prior to creating the unique index:

delete
from job_detail using job_detail,
    job_detail jt1
where job_detail.id < jt1.id
    and job_detail.title = jt1.title
    and job_detail.value = jt1.value
    and job_detail.job_id = jt1.job_id
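A sanity check of that delete's logic: the self-join removes every row that has a duplicate with a larger id, i.e. it keeps MAX(id) per (job_id, title, value) group. An equivalent formulation, verified here with sqlite3 on made-up data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE job_detail (id INTEGER PRIMARY KEY, job_id INTEGER, title TEXT, value TEXT);
INSERT INTO job_detail VALUES
    (1, 1, 'artifact uploaded', 'log.txt'),
    (2, 1, 'artifact uploaded', 'log.txt'),  -- duplicate of id 1
    (3, 2, 'CPU idle', '37.5%');
""")
# Equivalent of the MySQL self-join delete: keep only the highest id
# in each (job_id, title, value) group.
conn.execute("""DELETE FROM job_detail
                WHERE id NOT IN (SELECT MAX(id) FROM job_detail
                                 GROUP BY job_id, title, value)""")
kept = [r[0] for r in conn.execute("SELECT id FROM job_detail ORDER BY id")]
```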

Comment 39

a year ago
Created attachment 8769393 [details] [review]
[treeherder] mozilla:job_detail_unique_together_idx > mozilla:master
(Assignee)

Comment 40

a year ago
Comment on attachment 8769393 [details] [review]
[treeherder] mozilla:job_detail_unique_together_idx > mozilla:master

This is a follow-on to the other PR in this bug.  So please review that one first.  This one branches off that one, so please only compare the last commit on this PR.
Attachment #8769393 - Flags: review?(wlachance)
Comment on attachment 8769377 [details] [review]
[treeherder] mozilla:jobdetail-uniqueness-fix > mozilla:master

I'd like us to consider using Django's _meta API here. Other than that this looks good.
Attachment #8769377 - Flags: review?(wlachance) → review-
Comment on attachment 8769393 [details] [review]
[treeherder] mozilla:job_detail_unique_together_idx > mozilla:master

One nit, but this basically looks fine.
Attachment #8769393 - Flags: review?(wlachance) → review+
(Assignee)

Comment 43

a year ago
Comment on attachment 8769377 [details] [review]
[treeherder] mozilla:jobdetail-uniqueness-fix > mozilla:master

Yep, that's a great improvement.  Done.  I'll have to keep ``_meta`` in mind for the future.
Attachment #8769377 - Flags: review- → review?(wlachance)
(In reply to Cameron Dawson [:camd] from comment #38)
> I will remove duplicates on the job_detail table with this sql query I'll
> run by hand prior to creating the unique index:
> 
> delete
> from job_detail using job_detail,
>     job_detail jt1
> where job_detail.id < jt1.id
>     and job_detail.title = jt1.title
>     and job_detail.value = jt1.value
>     and job_detail.job_id = jt1.job_id

I'm worried that this is going to be super slow unless you have an index (since my guess is that this is a O(n^2) operation). You may want to create a temporary non-unique index on job id, title and value before you do this.
Attachment #8769377 - Flags: review?(wlachance) → review+
Pushes like

https://treeherder.mozilla.org/#/jobs?repo=try&revision=4244fed90395&selectedJob=24994251

are exhibiting this behaviour.  The Task Group shows good results:

https://tools.taskcluster.net/task-group-inspector/#PjaFt0T3TwS2ERtQoYzkPQ//
(Assignee)

Comment 46

a year ago
Nick, it turns out this is a similar, but not-quite-the-same issue as the one I believe you were seeing.  I created bug 1291882 for that one.  I have fixed the DB and I think it's resolved now.  Please let me know if you see it again.

Comment 47

a year ago
Created attachment 8778474 [details] [review]
[treeherder] mozilla:tmp-prod-jobdetail-migration > mozilla:master
(Assignee)

Updated

a year ago
Attachment #8778474 - Flags: review?(emorley)

Updated

a year ago
Attachment #8778474 - Flags: review?(emorley) → review+

Comment 48

a year ago
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/0f2fd75019b1cdfacf40a27be9d2db5e7c9a8826
Bug 1278711 - Lower field lengths in job_detail table (#1666)

This is required in order to create a unique index on title,
value and job_id to prevent duplicates.  The index will be
created in a later PR.

This also uses update_or_create instead of get_or_create as
this will be the mechanism going forward to prevent duplicates.

Comment 49

a year ago
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/13943f6e2113924c83aa1944c7c6fd336fc177b9
Bug 1278711 - Add unique_together index for job_detail (#1667)
(Assignee)

Updated

a year ago
Depends on: 1293500
(Assignee)

Comment 50

a year ago
The field lengths have been changed and the commit for that (0f2fd75019b1cdfacf40a27be9d2db5e7c9a8826) has been pushed to production now.  The migration for adding the unique index has already been faked in the migrations table.  Pythian is going to begin the command to create that index now.

I will confirm it's been created tomorrow, then we can close this bug.
Clarifying summary / adding an explanation, since the tree closure message is pointing at this bug and there seems to be some confusion from people as to what to expect.

In order to fix occasional race conditions during data ingestion, an index had to be added to one of the larger database tables used by Treeherder. This was tested on stage, but took a while to complete, so for production Pythian suggested using their schema-modification tool, which allows adding an index without locking the table.

However this index addition (which is occurring in the dep bug 1293500) appears to have still caused a perf impact (judging from #treeherder IRC scrollback, there aren't logs for #data and I can't see any bug comments), meaning job ingestion started backing up at 01:30 UTC today.

As such, Cameron pushed a branch that temporarily disabled Taskcluster job ingestion at, and all log parsing - so that the DB index creation could complete faster.

Looking at the CPU graphs I don't think the index creation has finished yet:
https://rpm.newrelic.com/accounts/677903/servers/6106888

Once it has, ingestion can be re-enabled, albeit there is a 64,000 taskcluster job ingestion backlog, and a 20,000 buildbot log parsing backlog.
Summary: Getting JobDetail errors during job ingestion MultipleObjectsReturned → Add missing indexes to the jobdetail table
(In reply to Ed Morley [:emorley] from comment #51)
> pushed a branch that temporarily disabled Taskcluster job ingestion at,

At 04:00 UTC
The index creation has now completed; job ingestion re-enabled at 09:45 UTC.

In addition, I've temporarily blocked the ActiveData-ETL user agent from making requests to the API, since it was responsible for 75% of total web requests, and included ones made to the table being altered (which will be under heavy use now that data ingestion has resumed).

The DB CPU usage is maxed out at 100%, but is otherwise handling the backlog fine.

In the 40 mins since data ingestion resumed, the queues have changed as follows:
* Taskcluster job ingestion queue: 64,000 -> 0
* Log parsing queues: 22,000 -> 33,000

Note that whilst the log parsing queue has risen, that's because the 64K Taskcluster jobs would have each put at least one log parsing task in the other queue. 

The trees can now be reopened, since the log parsing backlog is for successful jobs only (the parsing of logs for failed jobs takes priority), so doesn't affect sheriffing (and will catch up shortly anyway).

Updated

a year ago
Attachment #8761357 - Attachment is obsolete: true
Log parser queue for successful jobs' logs is now also down to zero.
(In reply to Ed Morley [:emorley] from comment #53)
> In addition, I've temporarily blocked the ActiveData-ETL user agent from
> making requests to the API, since it was responsible for 75% of total web
> requests, and included ones made to the table being altered (which will now
> be under heavy use now that data ingestion has resumed).

Not the DB load has stabilised, I've pushed to prod again reversing the block.
s/Not/Now/
Created attachment 8779745 [details]
newrelic-transactions-graph.jpg
(Assignee)

Comment 58

a year ago
We can finally call this one fixed now.  Thanks Ed, Will and everyone for your help and efforts on this issue.
Status: REOPENED → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED
Blocks: 1299954

Updated

a year ago
Depends on: 1304339