Bug 1308773
Opened 9 years ago
Closed 9 years ago
Taskcluster Pulse job ingestion backlogs due to slow exclusion_profile SELECT
Categories
(Tree Management :: Treeherder: Data Ingestion, defect, P1)
Tracking
(Not tracked)
RESOLVED FIXED
People
(Reporter: aryx, Unassigned)
Attachments
(1 file)
Take a look at https://treeherder.mozilla.org/#/jobs?repo=try&tochange=cfd8b0cc4887608b55fc7a958509ed1b98e7afe1
gecko-decision-op, eslint and Windows 2012 still have running jobs after 3 hours; the first few are usually done after a few minutes. There are more still-running TC jobs on the previous pushes, while the later ones have none at all (because gecko-decision didn't run).
This might be a data ingestion issue with Treeherder, because the already-running jobs don't finish according to TH.
Comment 1•9 years ago
I believe this might be an issue with ingesting jobs on TH production.
staging: https://treeherder.allizom.org/#/jobs?repo=try&tochange=cfd8b0cc4887608b55fc7a958509ed1b98e7afe1
Here is the decision task that ran: https://tools.taskcluster.net/task-inspector/#ahfakzHcQgai431EGnS2Og/
compared to prod, which does not show any TC tasks: https://treeherder.mozilla.org/#/jobs?repo=try&tochange=cfd8b0cc4887608b55fc7a958509ed1b98e7afe1
Looking at our taskcluster-treeherder service that publishes messages for TH production, I can see it published 3 messages for the decision task.
Times are in CT
Oct 09 06:24:51 taskcluster-treeherder app/handlers.1: Sun, 09 Oct 2016 11:24:51 GMT taskcluster-treeherder:handler message received for task ahfakzHcQgai431EGnS2Og with route index.gecko.v2.try.latest.firefox.decision,tc-treeherder.v2.try.cfd8b0cc4887608b55fc7a958509ed1b98e7afe1.148803,tc-treeherder-stage.v2.try.cfd8b0cc4887608b55fc7a958509ed1b98e7afe1.148803
Oct 09 06:24:51 taskcluster-treeherder app/handlers.1: Publishing message for try with task ID ahfakzHcQgai431EGnS2Og
Oct 09 06:24:51 taskcluster-treeherder app/handlers.1: Published message for try with task ID ahfakzHcQgai431EGnS2Og
Oct 09 06:24:51 taskcluster-treeherder app/handlers.1: Sun, 09 Oct 2016 11:24:51 GMT taskcluster-treeherder:handler message received for task ahfakzHcQgai431EGnS2Og with route index.gecko.v2.try.latest.firefox.decision,tc-treeherder.v2.try.cfd8b0cc4887608b55fc7a958509ed1b98e7afe1.148803,tc-treeherder-stage.v2.try.cfd8b0cc4887608b55fc7a958509ed1b98e7afe1.148803
Oct 09 06:24:52 taskcluster-treeherder app/handlers.1: Publishing message for try with task ID ahfakzHcQgai431EGnS2Og
Oct 09 06:24:52 taskcluster-treeherder app/handlers.1: Published message for try with task ID ahfakzHcQgai431EGnS2Og
Oct 09 06:25:47 taskcluster-treeherder app/handlers.1: Sun, 09 Oct 2016 11:25:47 GMT taskcluster-treeherder:handler message received for task ahfakzHcQgai431EGnS2Og with route index.gecko.v2.try.latest.firefox.decision,tc-treeherder.v2.try.cfd8b0cc4887608b55fc7a958509ed1b98e7afe1.148803,tc-treeherder-stage.v2.try.cfd8b0cc4887608b55fc7a958509ed1b98e7afe1.148803
Oct 09 06:25:47 taskcluster-treeherder app/handlers.1: Publishing message for try with task ID ahfakzHcQgai431EGnS2Og
Oct 09 06:25:47 taskcluster-treeherder app/handlers.1: Published message for try with task ID ahfakzHcQgai431EGnS2Og
Comment 2•9 years ago
Late Friday UK time we were seeing slight backlogs ingesting Taskcluster jobs over Pulse, due to a slow SELECT on Treeherder's exclusion_profile table. This appears to be a continuation of that.
I've bumped the `worker_store_pulse_jobs` dyno count up to 10 now, to reduce the backlog (currently at 1700 jobs and falling).
Previous context (from email to treeherder-internal@):
"""
On both prod and stage, pulse job ingestion slowed slightly, not 100% sure why. There were some very occasional super slow DB queries (like a 50s SELECT on exclusion_profile, even though it only has 4 rows, albeit with a 200KB blob in the flat_exclusion field of the random row I looked at?!):
https://rpm.newrelic.com/accounts/677903/applications/14179757/transactions?show_browser=false&tw%5Bend%5D=1475854176&tw%5Bstart%5D=1475843376&type=other#id=5b224f746865725472616e73616374696f6e2f43656c6572792f73746f72652d70756c73652d6a6f6273222c22225d
...but not really enough to explain it.
It may just be a higher volume of Pulse jobs being ingested.
For now, I've raised the number of workers (previously 4) on stage + prod:
$ ths scale worker_store_pulse_jobs=5
Scaling dynos... done, now running worker_store_pulse_jobs at 5:Standard-2X
$ thp scale worker_store_pulse_jobs=5
Scaling dynos... done, now running worker_store_pulse_jobs at 5:Standard-2X
"""
Will's reply:
"""
I believe this is caused by repeated calls to get_lower_tier_signatures in the jobs model. I have a patch cooking to memoize the calls to this method, which could speed things up a bit (or even a lot). I think our problem is that we're basically calling this operation every time we instantiate the jobs model -- which is probably much more frequent than before since we're ingesting one-job-at-a-time via pulse.
I actually want to serialize the job ingestion for the buildbot case as well (Bug 1308528), so it will become even more important that this is performant. Anyway, will test on stage on Monday and let y'all know how it goes.
""
Component: General → Treeherder: Data Ingestion
Priority: -- → P1
Product: Taskcluster → Tree Management
Summary: Taskcluster stopped scheduling jobs and running jobs frozen or TC data ingestion in treeherder broke → Taskcluster Pulse job ingestion backlogs due to slow exclusion_profile SELECT
Version: unspecified → ---
Comment 3•9 years ago
(In reply to Ed Morley [:emorley] from comment #2)
> I've bumped the `worker_store_pulse_jobs` dyno count up to 10 now, to reduce
> the backlog (currently at 1700 jobs and falling).
Backlog at zero:
https://rpm.newrelic.com/accounts/677903/dashboard/16676634/page/2?tw%5Bend%5D=1476020251&tw%5Bstart%5D=1475998651
Comment 4•9 years ago
Comment 5•9 years ago
(In reply to Ed Morley [:emorley] from comment #2)
> ...
>
> Will's reply:
> """
> I believe this is caused by repeated calls to get_lower_tier_signatures in
> the jobs model. I have a patch cooking to memoize the calls to this method,
> which could speed things up a bit (or even a lot). I think our problem is
> that we're basically calling this operation every time we instantiate the
> jobs model -- which is probably much more frequent than before since we're
> ingesting one-job-at-a-time via pulse.
>
> I actually want to serialize the job ingestion for the buildbot case as well
> (Bug 1308528), so it will become even more important that this is
> performant. Anyway, will test on stage on Monday and let y'all know how it
> goes.
> ""
FYI, I've attached a pull request implementing the functionality I described. It's currently untested in a real environment, and depends on (unlanded/unreviewed) changes in bug 1308528.
Reporter
Comment 6•9 years ago
Here we go again: trees are closed, and according to nthomas the gecko decision task stopped showing up on Try ~2 hours ago.
Flags: needinfo?(emorley)
Comment 7•9 years ago
Bumped to 15 dynos, queues now at zero, trees reopened.
Flags: needinfo?(emorley)
Comment 8•9 years ago
Cameron/Will, I really think we need to stop setting tier as part of ingestion on the jobs table. Every single job ingested has to query the 200KB JSON blob in the exclusion_profile table, rather than doing so just once per UI load, etc.
I think we must have hit some tipping point in the number of job type signatures that's now making reads from this table expensive. I'm puzzled why InnoDB's cache isn't handling them all, though; there are only 4 rows in that table.
Pulse ingestion on Heroku stage was happily only needing 4 dynos for the last few months, and now we're up to 15 on prod!
Comment 9•9 years ago
3 rows, since the only thing in the "test" profile was an exclusion which doesn't match any builds that still exist, an all-too-common feature in our exclusions.
Beyond removing unused exclusions, and ones that should have been removed two releases ago, and ones which only exclude jobs that don't exist, we seem to have exclusions which are setting tier-2 or tier-3 on Taskcluster jobs, which is just crazy since doing it in-tree is simple and more sane for when a tier change is going to ride the trains.
Comment 10•9 years ago
> SELECT * FROM mysql.slow_log ORDER BY query_time DESC LIMIT 10
******************** 1. row *********************
start_time: 2016-10-09 18:35:30
user_host: th_admin[th_admin] @ SNIP.compute-1.amazonaws.com [SNIP]
query_time: 00:01:59
lock_time: 00:00:00
rows_sent: 1
rows_examined: 1
db: treeherder
last_insert_id: 0
insert_id: 0
server_id: SNIP
sql_text: SELECT `exclusion_profile`.`id`, `exclusion_profile`.`name`, `exclusion_profile`.`is_default`, `exclusion_profile`.`flat_exclusion`, `exclusion_profile`.`author_id`, `exclusion_profile`.`modified` FROM `exclusion_profile` WHERE `exclusion_profile`.`name` = 'Tier-2'
thread_id: 17622233
******************** 2. row *********************
start_time: 2016-10-10 05:33:44
user_host: th_admin[th_admin] @ SNIP.compute-1.amazonaws.com [SNIP]
query_time: 00:01:56
lock_time: 00:00:00
rows_sent: 1
rows_examined: 1
db: treeherder
last_insert_id: 0
insert_id: 0
server_id: SNIP
sql_text: SELECT `exclusion_profile`.`id`, `exclusion_profile`.`name`, `exclusion_profile`.`is_default`, `exclusion_profile`.`flat_exclusion`, `exclusion_profile`.`author_id`, `exclusion_profile`.`modified` FROM `exclusion_profile` WHERE `exclusion_profile`.`name` = 'Tier-3'
thread_id: 18263848
Two minutes for a query on a 4-row table?!
Comment 11•9 years ago
(In reply to Ed Morley [:emorley] from comment #8)
> Cameron/Will, I really think we need to stop setting tier as part of
> ingestion on the jobs table. Every single job ingested has to query the
> 200kb json blob in the exclusion_profile table, rather than just once per UI
> load etc.
"Setting" is not exactly technically accurate -- we're really just getting the information that's stored in the database, which we need to do to assign the right tier to the job. My PR changes the confusing nomenclature.
I think we should start caching this information in memcache (maybe even removing the flat_exclusion column entirely). That said, it really doesn't make sense that the query is (or at least can be) so slow. It should be super fast.
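For illustration, a minimal sketch of the memcache idea using Django's cache API (the key name, TTL and model import path are assumptions, not Treeherder's actual code):

from django.core.cache import cache

EXCLUSION_CACHE_TTL = 15 * 60  # seconds; assumed value

def get_flat_exclusion(profile_name):
    # Hypothetical helper: read the flat_exclusion blob from memcache if
    # present, otherwise fall back to one DB query and prime the cache.
    key = "exclusion-profile-flat-exclusion-{}".format(profile_name)
    flat_exclusion = cache.get(key)
    if flat_exclusion is None:
        # Assumed model name/path for the exclusion_profile table.
        from treeherder.model.models import ExclusionProfile
        flat_exclusion = ExclusionProfile.objects.get(name=profile_name).flat_exclusion
        cache.set(key, flat_exclusion, EXCLUSION_CACHE_TTL)
    return flat_exclusion

Entries would also need to be invalidated whenever a sheriff edits a profile, otherwise jobs could be assigned stale tiers for up to the TTL.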
Comment 12•9 years ago
That's not what I mean.
We currently have a rather confusing mixture of tier handling:
* we accept tier as part of the submitted job information
* we also derive tier from the profiles the sheriffs have derived
* we then store this representation in a nested JSON blob in this 4-row table (?!)
* plus we then store the final calculated tier per job, as a field on the jobs table called `tier`, rather than having global state per-repo
The first two bullets should be an either/or, not both. (I know some of this was unavoidable due to buildbot vs taskcluster, but we should try and clean it up now/in the future)
And for the third/fourth, I don't think we should do it like this.
However, I agree that both as an interim stop-gap and in any future solution, we should consider using memcached to speed up retrieving what is going to be fairly hot data.
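For illustration, a minimal sketch of the tier-resolution flow described above (names and precedence are assumptions, not Treeherder's actual logic):

def resolve_tier(submitted_tier, signature_hash, lower_tier_signatures):
    # Hypothetical sketch: lower_tier_signatures maps a tier number (2 or 3)
    # to the set of job signature hashes from that exclusion profile. A job
    # keeps its submitted tier unless an exclusion profile demotes it further.
    tier = submitted_tier or 1
    for profile_tier, signatures in sorted(lower_tier_signatures.items()):
        if signature_hash in signatures:
            tier = max(tier, profile_tier)
    return tier

# Example: a job submitted as tier 1 whose signature appears in the Tier-2
# profile ends up stored as tier 2 on the jobs table.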
Comment 13•9 years ago
s/sheriffs have derived/sheriffs have created/
Comment 14•9 years ago
Some other things I've spotted:
* Looking at cycle data, I think we're not expiring old signatures from some tables
* The largest row in the exclusion_profile table, the "default" row, is redundant IMO, since anything that isn't tier-2/tier-3 is tier-1. We could presumably skip writing/reading it.
* We have the user_exclusion_profile table which isn't used for anything at the moment
For reference, the flat_exclusion field contains:
* "default": 6000 job signatures
* "Tier-2": 1100 job signatures
* "Tier-3": 3400 job signatures
Comment 15•9 years ago
The oddly named "default" row is the "hidden" row, not the tier-1 row.
Comment 16•9 years ago
Oh indeed :-)
Comment 17•9 years ago
I tried a few things today to see if they would help:
* OPTIMIZE TABLE treeherder.exclusion_profile; (though I'd done this the other day too, to no apparent effect)
* restarting the RDS instance (with fail-over)
* Rotating the slow query logs (AWS uses TABLE mode by default, which apparently can cause issues if the size grows considerably - I'm also going to file a bug to switch to FILE mode)
* restarting all dynos
The slow queries, plus the strange `OperationalError (2026, 'SSL connection error: protocol version mismatch')` exceptions, seem to have stopped for now. Not sure if that's coincidence, or one of the above, or due to philor pruning the list of job exclusions.
¯\_(ツ)_/¯
Comment 18•9 years ago
Comment on attachment 8799248 [details] [review]
[treeherder] wlach:1308773 > mozilla:master
I have been testing this on stage a bit. So far it hasn't been enormously faster than calling MySQL (maybe 10 ms) with the current load, but it should have much better worst-case performance.
You can see it's working by looking at the fall in calls to the exclusion_profile SELECT amongst the various transactions.
Attachment #8799248 - Flags: review?(emorley)
Comment 19•9 years ago
Comment on attachment 8799248 [details] [review]
[treeherder] wlach:1308773 > mozilla:master
Deferring to Cameron since he's more familiar with the signature hashes/tier logic, and I'm out today :-)
Attachment #8799248 - Flags: review?(emorley) → review?(cdawson)
Updated•9 years ago
Attachment #8799248 - Flags: review?(cdawson) → review+
Comment 20•9 years ago
Commit pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/f9178c24c053d342ff53f372361bf13fa956fdf3
Bug 1308773 - Memoize calls to get per-project exclusion profiles (#1913)
Updated•9 years ago
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED