Closed
Bug 1196764
Opened 10 years ago
Closed 9 years ago
Rewrite handling of job_eta since it's responsible for 80% of the job creation time
Categories
(Tree Management :: Treeherder: Data Ingestion, defect, P2)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: emorley, Assigned: emorley)
References
(Blocks 1 open bug)
Details
Attachments
(2 files)
eg:
https://rpm.newrelic.com/accounts/677903/applications/4180461/transactions?type=app#id=5b225765625472616e73616374696f6e2f46756e6374696f6e2f747265656865726465722e7765626170702e6170692e6a6f62733a4a6f6273566965775365742e637265617465222c22225d
MySQL job_eta select is 80% of treeherder.webapp.api.jobs:JobsViewSet.create
This is because:
* we add a new row to the job_eta table every time the calculate eta task runs (rather than updating existing rows)
* we don't expire data from the job_eta tables - eg mozilla-inbound's job_eta has 2.8 million rows (bug 1126886)
* there is no index on the signature column, and we use that as part of our SELECT in get_last_eta_by_signatures (bug 1178038)
The current job_eta implementation:
* was designed to allow for tracking of etas over time (hence having multiple entries per signature), which we don't really need.
* also calculates pending eta, but this is not accurate (being removed by bug 1126886).
* also calculates median/min/max eta for each signature, which we currently don't use.
* copies a snapshot of the average eta into the jobs table, for each job. I believe this is an unnecessary over-complication (once a job has finished we don't really care if the eta for jobs this week was more or less than last week).
I propose for the new design:
* we create a new table, so we don't have to try to sort out the old ones (deleting from tables with 2.8 million rows will be slow)
* this new table should be in the main treeherder DB (ie a small part towards the single-database work, bug 1178641) rather than per project
* we forget about the unused min/max/median for now
* we only create one row per signature combination, and update that
* so that we can expire data, we either store a last_modified for each row, or rely on the signature not being in the reference_data_signatures table (we don't expire from that table either, but once support is added there, we could just cascade delete to the new job_eta table too)
We can then land this in stages:
1) create the new table and start the calculate_eta task populating it
2) switch jobs ingestion over from querying the old job_eta table to querying the new one, when populating the running_eta field on the jobs table
3) drop the old tables
In the future (not part of this bug), I'd like to stop storing running_eta on the jobs table entirely - and instead either:
(a) join the jobs table with job_eta when generating the response for the /jobs endpoint, or
(b) have the UI fetch job eta reference data only once - and combine it client side with the jobs being displayed.
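Under either future option, only the average duration needs to be stored server-side; the ETA itself can be derived on the client from the job's start time. A minimal sketch of that derivation (the helper name is hypothetical, not Treeherder code):

```python
import time


def remaining_eta_seconds(start_timestamp, avg_duration_sec, now=None):
    """Derive a running job's remaining ETA from its start time and the
    average duration for its signature, clamped at zero once overdue.
    Illustrative only - the real UI does this client-side in JavaScript.
    """
    if now is None:
        now = time.time()
    elapsed = now - start_timestamp
    return max(0, avg_duration_sec - elapsed)
```

This is why snapshotting an ETA per job is unnecessary: start time plus the current average duration is enough to recompute it at display time.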
Comment 3•10 years ago
Current schema (after bug 1181572):
CREATE TABLE `job_eta` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`signature` varchar(50) DEFAULT NULL,
`state` varchar(25) COLLATE utf8_bin NOT NULL,
`avg_sec` int(10) unsigned NOT NULL,
`median_sec` int(10) unsigned NOT NULL,
`min_sec` int(10) unsigned NOT NULL,
`max_sec` int(10) unsigned NOT NULL,
`std` int(10) unsigned NOT NULL,
`sample_count` int(10) unsigned NOT NULL,
`submit_timestamp` int(10) unsigned NOT NULL,
PRIMARY KEY (`id`),
KEY `idx_state` (`state`),
KEY `idx_avg_sec` (`avg_sec`),
KEY `idx_median_sec` (`median_sec`),
KEY `idx_sample_count` (`sample_count`),
KEY `idx_submit_timestamp` (`submit_timestamp`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
Proposed schema:
CREATE TABLE `job_eta` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`repository` varchar(50) NOT NULL,
`signature` varchar(50) NOT NULL,
`avg_sec` int(10) unsigned NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `uni_repository_signature` (`signature`, `repository`)
) ...;
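The key behavioural change in the proposed schema is the unique key: one row per (repository, signature) pair, updated in place rather than appended to. A self-contained sketch of that update-or-insert pattern, using SQLite in place of MySQL for illustration (table and column names follow the proposed schema; the helper is hypothetical):

```python
import sqlite3

# In-memory stand-in for the proposed job_duration table; the real
# table would live in the main treeherder MySQL database.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE job_duration (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        repository VARCHAR(50) NOT NULL,
        signature VARCHAR(50) NOT NULL,
        avg_sec INTEGER NOT NULL,
        UNIQUE (repository, signature)
    )
""")


def store_duration(repository, signature, avg_sec):
    # Update the existing row for this signature if one exists,
    # otherwise insert a new one - never accumulate duplicate rows.
    cur = conn.execute(
        "UPDATE job_duration SET avg_sec = ? "
        "WHERE repository = ? AND signature = ?",
        (avg_sec, repository, signature),
    )
    if cur.rowcount == 0:
        conn.execute(
            "INSERT INTO job_duration (repository, signature, avg_sec) "
            "VALUES (?, ?, ?)",
            (repository, signature, avg_sec),
        )


store_duration("mozilla-inbound", "abc123", 900)
store_duration("mozilla-inbound", "abc123", 950)  # updates in place
rows = conn.execute(
    "SELECT COUNT(*), MAX(avg_sec) FROM job_duration"
).fetchone()
```

Unlike the old design, repeated runs of the calculate task leave the row count bounded by the number of distinct signatures instead of growing to millions of rows.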
At which point, I'm wondering if we should instead just add an `average_runtime` field to the existing `reference_data_signatures` table, have the calculate eta task populate that, and have jobs ingestion query from there?
Thoughts?
Flags: needinfo?(mdoglio)
Comment 4•10 years ago
That sounds good to me; iirc we are no longer using the pending eta in the UI, so there's practically no loss of information.
Flags: needinfo?(mdoglio)
Comment 6•10 years ago
I haven't worked on this since comment 3, as I've been focusing on my deliverable.
Would be good to get this fixed soon though.
Flags: needinfo?(emorley)
Comment 7•9 years ago
WIP at:
https://github.com/mozilla/treeherder/compare/refactor-job-eta
(not all changes pushed yet & needs commit message tidy-up etc. before it's worth opening as a PR)
Comment 8•9 years ago
Attachment #8692941 - Flags: review?(mdoglio)
Updated•9 years ago
Attachment #8692941 - Flags: review?(mdoglio) → review+
Comment 9•9 years ago
Commits pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/f5093377882bdae86a7e4940c7578faa482c36ce
Bug 1196764 - Stop calculating/storing unused job duration statistics
The calculate_eta task calculates average/median/min/max & standard
deviation of recent job durations, when only the average is used.
Whilst the job_eta table will be replaced entirely in later commits, the
query that fetches the stats to populate it will still be used, so needs
cleaning up regardless.
https://github.com/mozilla/treeherder/commit/006caebe3609836acaf4f8edfad1a5b6e158c051
Bug 1196764 - Use start not submit timestamp to retrieve job durations
Previously the time a job spent in the pending state *and* also the time
spent in the running state were calculated and stored. However, now that
only the latter is required, the time window that the query searches
over needs to only cover the start_timestamp onwards, not the earlier
submit_timestamp.
https://github.com/mozilla/treeherder/commit/585ddc5189c79fdef8a0cb767d786d294d9a234d
Bug 1196764 - Clean up the fetching of recent job average durations
Switches the datasource query to return the raw tuple from
cursor.fetchall() rather than the dict return_type, which repacks everything
unnecessarily.
Also renames the query and variables since what's being calculated is no
longer a group (it's just the average duration and not pending time too)
and 'duration' is more accurate than 'eta' (the latter is a time not an
int).
https://github.com/mozilla/treeherder/commit/fb7f83154e24928bb66bad833e4a827f55777fd8
Bug 1196764 - Return early if max_start_timestamp was not found
Which can occur when no jobs exist for that repository.
https://github.com/mozilla/treeherder/commit/d075e41705930647a2e61a6b7403da9cb0302c09
Bug 1196764 - Flatten the return value of get_job_eta_times()
We only ever use the 'running' subkey for each signature (since
bug 1181572), so there's no point in including both 'running' and
'pending' subkeys in the dict returned from get_job_eta_times().
https://github.com/mozilla/treeherder/commit/ab20e09a2cb68c6dde52b76fe110d010653ed0d2
Bug 1196764 - Rename job_eta_times to average_job_durations
Since ETA means a time, but we're actually dealing with durations.
https://github.com/mozilla/treeherder/commit/620228cbf3b55384bb51d44bfba233f88f1363ae
Bug 1196764 - Rename calculate_eta to calculate_durations
Since we're not calculating ETAs (the UI does that once it knows the
start time and expected duration); we're calculating recent average
durations instead.
https://github.com/mozilla/treeherder/commit/c89fc00502277ffdc2c778126da9695d1c6a2737
Bug 1196764 - Add the test_repository fixture to tests that store jobs
At the moment tests that store jobs can get away with not having the
test project listed in the repository table. However with later commits,
the new job_duration table will have a foreign key on the repository
table, which will break those tests unless they use the test_repository
fixture, which creates the repository record for them.
https://github.com/mozilla/treeherder/commit/6e4b6f2dea7be1b959e3c56f08517f71975833a2
Bug 1196764 - Add a JobDuration model
Which will replace the per-project job_eta tables.
I've not added a `modified` field, since I think it's best we wait until
we actually need it. One use case would be data expiration, but I think
that's best done by cross-referencing signatures on the
reference_data_signatures table; however, (a) that table doesn't yet
have data expiration either, and (b) it's possible that once that table
is refactored we may even want to combine it with the job_duration
table.
https://github.com/mozilla/treeherder/commit/cd64c64a659f830e66ba919b19a00f055865700b
Bug 1196764 - Store job durations in the new unified table using the ORM
Instead of storing the calculated average job durations in the
per-project job_eta tables, they are now stored in the new job_duration
table in the main treeherder database. This means we can use Django's
ORM instead of datasource.
In addition, if a job signature already exists in the table, we update
it rather than creating duplicate rows - and most importantly unlike the
previous table, the correct columns are indexed to avoid performance
problems.
This commit just changes where we store new durations, a later commit
will switch the retrieval of them from the old table to the new.
https://github.com/mozilla/treeherder/commit/c8dba75a2971497296734f07a2a2f41df1735fd6
Bug 1196764 - Fetch job durations from the new unified table
The old per-project job_eta table is no longer being updated. This makes
job ingestion retrieve expected durations from the new table instead.
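Fetching from the new table amounts to building a signature → avg_sec map once per ingestion batch and using it to populate each job's running_eta. A sketch of that lookup (function and field names are illustrative, not Treeherder's actual ingestion code):

```python
def attach_running_eta(jobs, durations):
    """Populate each job's running_eta from a signature -> avg_sec map
    fetched once from the job_duration table. Jobs whose signature has
    no stored average get None.
    """
    for job in jobs:
        job["running_eta"] = durations.get(job["signature"])
    return jobs


batch = [{"signature": "abc123"}, {"signature": "unknown"}]
batch = attach_running_eta(batch, {"abc123": 900})
```

Because the lookup hits an indexed unique key and the map is built once per batch, the per-job cost that dominated JobsViewSet.create disappears.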
https://github.com/mozilla/treeherder/commit/978a3c207ff9ad441774667c2d822400d9c3d607
Bug 1196764 - Remove the unused per-project job_eta table
Once we're happy we don't need to rollback, it can be dropped from
stage/prod/heroku.
https://github.com/mozilla/treeherder/commit/8384eefc60c59fe5a30a33acb17d36f27a13a571
Bug 1196764 - Add a test for calculating and using average job durations
This test checks that the `calculate_duration` command correctly
populates the `job_duration` table both on an initial run and when new
jobs cause an existing average duration to change.
It also checks that once the task has been run, any subsequent ingestion
of jobs with matching signatures causes those jobs to have the correct
`running_eta` value when inserted into the `job` table.
Comment 10•9 years ago
I deployed this to stage and then forced a manual durations calculation using:
../venv/bin/python ./manage.py shell
from treeherder.model.tasks import calculate_durations
calculate_durations.apply_async(routing_key="calculate_durations")
(Done that way rather than via the management command directly, so that it ran under the New Relic wrapper.)
...and it seems to be working well.
I'll follow up with more stats once this is on prod.
Comment 11•9 years ago
Graph showing before/after.
Submitting jobs via the jobs endpoint is now 5 times faster (1500 ms -> 300 ms per transaction).