loading 'Similar jobs' is slow/can take very long

RESOLVED FIXED

Status

P2
normal
RESOLVED FIXED
3 years ago
2 years ago

People

(Reporter: aryx, Assigned: camd)

Tracking

Details

Attachments

(2 attachments)

After 'Similar jobs' has been clicked, it takes in general very long until the job list has been retrieved, e.g. tried it for a job: 1.89 KB got sent after 14 seconds.

Is this an indexing issue and can this be fixed?
Same platform
Same options
Same machine
allow 7 different non-empty combinations.
(Assignee)

Updated

3 years ago
Priority: -- → P2
(Assignee)

Updated

3 years ago
Assignee: nobody → cdawson

Comment 1

3 years ago
Any news on this?

I also find it very frustrating when looking at oranges and, at least in theory, you'd have thought that returning this information ought to be pretty fast.

Thanks.
Flags: needinfo?(cdawson)
(Assignee)

Comment 2

3 years ago
I have not been able to address this yet.  Other work has crept up higher in the priority list.  But I will take a look next week and see what I can do.
Flags: needinfo?(cdawson)
(In reply to Cameron Dawson [:camd] from comment #2)
> I have not been able to address this yet.  Other work has crept up higher in
> the priority list.  But I will take a look next week and see what I can do.

I had a quick look at this. I tried running this query manually:

SELECT j.id, j.`job_guid`, j.`signature`, j.`job_coalesced_to_guid`, j.`build_platform_id`, j.`option_collection_hash`, j.failure_classification_id, m.`name` AS machine_name, mp.`platform` AS platform, mp.`os_name` AS machine_platform_os, mp.`architecture` AS machine_platform_architecture, bp.`platform` AS build_platform, bp.`os_name` AS build_os, bp.`architecture` AS build_architecture, j.`job_type_id` AS job_type_id, jt.`name` AS job_type_name, jt.`symbol` AS job_type_symbol, jt.`description` AS job_type_description, jg.`name` AS job_group_name, jt.`job_group_id` AS job_group_id, jg.`symbol` AS job_group_symbol, jg.`description` AS job_group_description, j.`who`, j.`result_set_id`, j.`result`, j.`state`, j.`reason`, j.`start_timestamp`, j.`end_timestamp`, j.`submit_timestamp`, j.`running_eta`, j.`last_modified`, j.`tier`, rds.`name` AS ref_data_name, rds.`build_system_type` AS build_system_type FROM `job` AS j LEFT JOIN `treeherder`.`machine` AS m ON j.`machine_id` = m.id LEFT JOIN `treeherder`.`machine_platform` AS mp ON j.`machine_platform_id` = mp.id LEFT JOIN `treeherder`.`build_platform` AS bp ON j.`build_platform_id` = bp.id LEFT JOIN `treeherder`.`job_type` AS jt ON j.`job_type_id` = jt.id LEFT JOIN `treeherder`.`job_group` AS jg ON jt.`job_group_id` = jg.id LEFT JOIN `treeherder`.reference_data_signatures rds ON j.signature = rds.signature LEFT JOIN result_set rs ON rs.id = j.result_set_id WHERE 1 AND j.option_collection_hash = '32faaecac742100f7753f0c1d0aa0add01b4046b' AND j.build_platform_id = 106 AND j.id <> 28033815 AND j.job_type_id = 4542  GROUP BY j.id ORDER BY rs.push_timestamp DESC  LIMIT 21 OFFSET 0

It's pretty slow (5-10 secs) to run. However, if we change `ORDER BY rs.push_timestamp` to `ORDER BY j.result_set_id` it gets a lot faster. Do we really need to be that precise about which result sets we're getting jobs from? It stands to reason to me that result_set_id is normally incrementing, so should be good enough for a view like this.

Comment 4

3 years ago
(In reply to William Lachance (:wlach) from comment #3)
> (In reply to Cameron Dawson [:camd] from comment #2)
> > I have not been able to address this yet.  Other work has crept up higher in
> > the priority list.  But I will take a look next week and see what I can do.
> 
> I had a quick look at this. I tried running this query manually:

I used to do a lot of database development in a previous life and I'm intrigued by the huge query.
I assume we're actually using all those files, there are a lot more fields than I see in the Similar Jobs UI.

Do we need that GROUP BY j.id?

Also, out of interest, what database is this?
Flags: needinfo?(wlachance)
(In reply to Bob Owen (:bobowen) from comment #4)
> (In reply to William Lachance (:wlach) from comment #3)
> > (In reply to Cameron Dawson [:camd] from comment #2)
> > > I have not been able to address this yet.  Other work has crept up higher in
> > > the priority list.  But I will take a look next week and see what I can do.
> > 
> > I had a quick look at this. I tried running this query manually:
> 
> I used to do a lot of database development in a previous life and I'm
> intrigued by the huge query.
> I assume we're actually using all those files, there are a lot more fields
> than I see in the Similar Jobs UI.

I don't think we really do, but I don't think that part is really worth optimizing at this point (it's not the bottleneck, and we're going to be completely rewriting this code shortly)

> Do we need that GROUP BY j.id?

Almost certainly not. :)

> Also, out of interest, what database is this?

It's the treeherder database.
Flags: needinfo?(wlachance)
(In reply to William Lachance (:wlach) from comment #3)
> (In reply to Cameron Dawson [:camd] from comment #2)
> > I have not been able to address this yet.  Other work has crept up higher in
> > the priority list.  But I will take a look next week and see what I can do.
> 
> I had a quick look at this. I tried running this query manually:
> 
> ...
> 
> It's pretty slow (5-10 secs) to run. However, if we change `ORDER BY
> rs.push_timestamp` to `ORDER BY j.result_set_id` it gets a lot faster. Do we
> really need to be that precise about which result sets we're getting jobs
> from? It stands to reason to me that result_set_id is normally incrementing,
> so should be good enough for a view like this.

So after some testing on stage, I realized that things were still rather slow. It turns out that what we really need is a composite index on all the columns we're indexing on.

The combination of (1) the suggested change to use result_set_id and (2) bobowen's suggestion of not bothering with group by and (3) adding this index to the mozilla inbound jobs table on stage: `create index wlach_test_idx_jobs_all_the_things on job (result_set_id, option_collection_hash, build_platform_id, job_type_id)` seems to make things perform quite nicely. I'll post a patch!
Created attachment 8753496 [details] [review]
[treeherder] wlach:1246196 > mozilla:master
Comment on attachment 8753496 [details] [review]
[treeherder] wlach:1246196 > mozilla:master

I *think* this should be enough. We'll have to manually create the indexes ourselves, of course.
Attachment #8753496 - Flags: review?(cdawson)
(Assignee)

Comment 9

3 years ago
Comment on attachment 8753496 [details] [review]
[treeherder] wlach:1246196 > mozilla:master

lgtm.  :)
Attachment #8753496 - Flags: review?(cdawson) → review+
So I forgot to mention this, but in further testing I wasn't 100% convinced I had actually fixed the problem.

I am testing another approach made possible by some data model changes we've made, hopefully this will finally fix the problem.
Created attachment 8816538 [details] [review]
[treeherder] wlach:similar_jobs_speedup > mozilla:master
Attachment #8816538 - Flags: review?(cdawson)
(Assignee)

Comment 12

2 years ago
Comment on attachment 8816538 [details] [review]
[treeherder] wlach:similar_jobs_speedup > mozilla:master

I haven't tested this myself, but the reasoning seems sound.  :)
Attachment #8816538 - Flags: review?(cdawson) → review+

Comment 13

2 years ago
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/d3827f2e8c31e440c83c09a97b34994a51b4eb25
Bug 1246196 - Speed up loading similar jobs (#2015)

* Instead of ordering by push time (which is a foreign key), order by job
start time.
* Add composite indexes to speed up the query
Deployed change to stage and loading similar jobs is now pretty snappy there. Production should see this on the next deploy (I'll probably announce/blog when that happens)
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.