Closed Bug 1196764 Opened 9 years ago Closed 9 years ago

Rewrite handling of job_eta since it's responsible for 80% of the job creation time

Categories

(Tree Management :: Treeherder: Data Ingestion, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: emorley)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

eg: https://rpm.newrelic.com/accounts/677903/applications/4180461/transactions?type=app#id=5b225765625472616e73616374696f6e2f46756e6374696f6e2f747265656865726465722e7765626170702e6170692e6a6f62733a4a6f6273566965775365742e637265617465222c22225d

The MySQL job_eta select is 80% of treeherder.webapp.api.jobs:JobsViewSet.create. This is because:
* we add a new row to the job_eta table every time the calculate eta task runs (rather than updating existing rows)
* we don't expire data from the job_eta tables - eg mozilla-inbound's job_eta has 2.8 million rows (bug 1126886)
* there is no index on the signature column, and we use that as part of our SELECT in get_last_eta_by_signatures (bug 1178038)

The current job_eta implementation:
* was designed to allow for tracking of etas over time (hence having multiple entries per signature), which we don't really need.
* also calculates pending eta, but this is not accurate (being removed by bug 1126886).
* also calculates median/min/max eta for each signature, which we currently don't use.
* copies a snapshot of the average eta into the jobs table, for each job. I believe this is an unnecessary over-complication (once a job has finished we don't really care if the eta for jobs this week was more or less than last week).
I propose for the new design:
* we create a new table, so we don't have to try and sort out the old ones (deletes from tables with 2.8 million rows will be slow)
* this new table should be in the main treeherder DB (ie a small step towards the single-database work, bug 1178641) rather than per project
* we forget about the unused min/max/median for now
* we only create one row per signature combination, and update that
* so that we can expire data, we either store a last_modified for each row, or rely on the signature not being in the reference_data_signatures table (we don't expire from that table either, but once support is added there, we could just cascade delete to the new job_eta table too)

We can then land this in stages:
1) create the new table and start the calculate_eta task populating it
2) switch jobs ingestion over from querying the old job_eta table to querying the new one, when populating the running_eta field on the jobs table
3) drop the old tables

In the future (not part of this bug), I'd like to stop storing running_eta on the jobs table entirely, and instead either:
(a) join the jobs table with job_eta when generating the response for the /jobs endpoint, or
(b) have the UI fetch job eta reference data only once, and combine it client side with the jobs being displayed.
Depends on: 1181572
No longer depends on: 1126886
Current schema (after bug 1181572):

CREATE TABLE `job_eta` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `signature` varchar(50) DEFAULT NULL,
  `state` varchar(25) COLLATE utf8_bin NOT NULL,
  `avg_sec` int(10) unsigned NOT NULL,
  `median_sec` int(10) unsigned NOT NULL,
  `min_sec` int(10) unsigned NOT NULL,
  `max_sec` int(10) unsigned NOT NULL,
  `std` int(10) unsigned NOT NULL,
  `sample_count` int(10) unsigned NOT NULL,
  `submit_timestamp` int(10) unsigned NOT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_state` (`state`),
  KEY `idx_avg_sec` (`avg_sec`),
  KEY `idx_median_sec` (`median_sec`),
  KEY `idx_sample_count` (`sample_count`),
  KEY `idx_submit_timestamp` (`submit_timestamp`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

Proposed schema:

CREATE TABLE `job_eta` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `repository` varchar(50) NOT NULL,
  `signature` varchar(50) NOT NULL,
  `avg_sec` int(10) unsigned NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `uni_repository_signature` (`signature`, `repository`)
) ...;

At which point, I'm wondering if we should just add an `average_runtime` field to the existing `reference_data_signatures` table, and have the calculate eta task populate that + have jobs ingestion query from there, instead? Thoughts?
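A minimal sketch of the upsert behaviour the proposed unique key enables, using Python's stdlib sqlite3 as a stand-in for MySQL (the `store_average` helper and sample values are hypothetical; MySQL would use INSERT ... ON DUPLICATE KEY UPDATE rather than SQLite's ON CONFLICT syntax):

```python
import sqlite3

# Stand-in for the proposed table: one row per (repository, signature),
# enforced by a unique key, so recalculation updates instead of appending.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE job_eta (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        repository TEXT NOT NULL,
        signature TEXT NOT NULL,
        avg_sec INTEGER NOT NULL,
        UNIQUE (signature, repository)
    )
""")

def store_average(repository, signature, avg_sec):
    # SQLite's upsert syntax; the MySQL equivalent is
    # INSERT ... ON DUPLICATE KEY UPDATE avg_sec = VALUES(avg_sec).
    conn.execute(
        """INSERT INTO job_eta (repository, signature, avg_sec)
           VALUES (?, ?, ?)
           ON CONFLICT (signature, repository)
           DO UPDATE SET avg_sec = excluded.avg_sec""",
        (repository, signature, avg_sec),
    )

store_average("mozilla-inbound", "sig-a", 120)
store_average("mozilla-inbound", "sig-a", 150)  # updates, does not duplicate
rows = conn.execute(
    "SELECT repository, signature, avg_sec FROM job_eta"
).fetchall()
```

Unlike the old table, repeated runs of the calculation task leave exactly one row per signature, so the ingestion-time SELECT stays small and hits the unique index.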
Flags: needinfo?(mdoglio)
That sounds good to me; iirc we are not using the pending eta in the UI anymore, so there's practically no loss of information.
Flags: needinfo?(mdoglio)
:emorley what's the status of this bug?
Flags: needinfo?(emorley)
I haven't worked on it since comment 3, as I've been focusing on my deliverable. Would be good to get this fixed soon though.
Flags: needinfo?(emorley)
WIP at: https://github.com/mozilla/treeherder/compare/refactor-job-eta (not all changes pushed yet & needs commit message tidy-up etc before it's worth opening as a PR)
Attachment #8692941 - Flags: review?(mdoglio)
Attachment #8692941 - Flags: review?(mdoglio) → review+
Commits pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/f5093377882bdae86a7e4940c7578faa482c36ce
Bug 1196764 - Stop calculating/storing unused job duration statistics
The calculate_eta task calculates average/median/min/max & standard deviation of recent job durations, when only the average is used. Whilst the job_eta table will be replaced entirely in later commits, the query that fetches the stats to populate it will still be used, so needs cleaning up regardless.

https://github.com/mozilla/treeherder/commit/006caebe3609836acaf4f8edfad1a5b6e158c051
Bug 1196764 - Use start not submit timestamp to retrieve job durations
Previously the time a job spent in the pending state *and* the time spent in the running state were calculated and stored. However, now that only the latter is required, the time window that the query searches over needs to cover only the start_timestamp onwards, not the earlier submit_timestamp.

https://github.com/mozilla/treeherder/commit/585ddc5189c79fdef8a0cb767d786d294d9a234d
Bug 1196764 - Clean up the fetching of recent job average durations
Switches the datasource query to return the raw tuple from cursor.fetchall() rather than the dict return_type, which repacks everything unnecessarily. Also renames the query and variables, since what's being calculated is no longer a group (it's just the average duration and not pending time too) and 'duration' is more accurate than 'eta' (the latter is a time, not an int).

https://github.com/mozilla/treeherder/commit/fb7f83154e24928bb66bad833e4a827f55777fd8
Bug 1196764 - Return early if max_start_timestamp was not found
Which can occur when no jobs exist for that repository.
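The duration-vs-eta distinction behind the renames above: the task stores a duration (an int of seconds), and an eta (a point in time) can be derived from it client side. A hypothetical sketch (function name and timestamps are illustrative, not Treeherder's actual code):

```python
# Hypothetical sketch: deriving an eta (a time) from a stored average
# duration (seconds), given the job's start timestamp and the current time.

def seconds_remaining(start_timestamp, average_duration_sec, now):
    """Seconds until the job is expected to finish; negative means the job
    has already run longer than its recent average."""
    return (start_timestamp + average_duration_sec) - now

# A job that started at t=1000 with an average duration of 600s,
# observed at t=1300, is expected to finish in 300s:
remaining = seconds_remaining(1000, 600, 1300)
```

This is why storing only the average duration per signature is enough: the UI can recompute the remaining time on every render without any extra server round trips.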
https://github.com/mozilla/treeherder/commit/d075e41705930647a2e61a6b7403da9cb0302c09
Bug 1196764 - Flatten the return value of get_job_eta_times()
We only ever use the 'running' subkey for each signature (since bug 1181572), so there's no point in including both 'running' and 'pending' subkeys in the dict returned from get_job_eta_times().

https://github.com/mozilla/treeherder/commit/ab20e09a2cb68c6dde52b76fe110d010653ed0d2
Bug 1196764 - Rename job_eta_times to average_job_durations
Since ETA means a time, but we're actually dealing with durations.

https://github.com/mozilla/treeherder/commit/620228cbf3b55384bb51d44bfba233f88f1363ae
Bug 1196764 - Rename calculate_eta to calculate_durations
Since we're not calculating ETAs (the UI does that once it knows the start time and expected duration); we're calculating recent average durations instead.

https://github.com/mozilla/treeherder/commit/c89fc00502277ffdc2c778126da9695d1c6a2737
Bug 1196764 - Add the test_repository fixture to tests that store jobs
At the moment tests that store jobs can get away with not having the test project listed in the repository table. However, with later commits the new job_duration table will have a foreign key on the repository table, which will break those tests unless they use the test_repository fixture, which creates the repository record for them.

https://github.com/mozilla/treeherder/commit/6e4b6f2dea7be1b959e3c56f08517f71975833a2
Bug 1196764 - Add a JobDuration model
Which will replace the per-project job_eta tables. I've not added a `modified` field, since I think it's best we wait until we actually need it. One use-case would be data expiration, but I think that's best done by cross-referencing signatures on the reference_data_signatures table; however, (a) that table doesn't yet have data expiration either, and (b) it's possible that once that table is refactored we may even want to combine it with the job_durations table too.
https://github.com/mozilla/treeherder/commit/cd64c64a659f830e66ba919b19a00f055865700b
Bug 1196764 - Store job durations in the new unified table using the ORM
Instead of storing the calculated average job durations in the per-project job_eta tables, they are now stored in the new job_duration table in the main treeherder database. This means we can use Django's ORM instead of datasource. In addition, if a job signature already exists in the table, we update it rather than creating duplicate rows - and most importantly, unlike the previous table, the correct columns are indexed to avoid performance problems. This commit just changes where we store new durations; a later commit will switch the retrieval of them from the old table to the new.

https://github.com/mozilla/treeherder/commit/c8dba75a2971497296734f07a2a2f41df1735fd6
Bug 1196764 - Fetch job durations from the new unified table
The old per-project job_eta table is no longer being updated. This makes job ingestion retrieve expected durations from the new table instead.

https://github.com/mozilla/treeherder/commit/978a3c207ff9ad441774667c2d822400d9c3d607
Bug 1196764 - Remove the unused per-project job_eta table
Once we're happy we don't need to roll back, it can be dropped from stage/prod/heroku.

https://github.com/mozilla/treeherder/commit/8384eefc60c59fe5a30a33acb17d36f27a13a571
Bug 1196764 - Add a test for calculating and using average job durations
This test checks that the `calculate_duration` command correctly populates the `job_duration` table both on an initial run and when new jobs cause an existing average duration to change. It also checks that once the task has been run, any subsequent ingestion of jobs with matching signatures causes those jobs to have the correct `running_eta` value when inserted into the `job` table.
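The core of what a calculate-durations pass computes can be sketched in plain Python (field names and the tuple layout are assumptions for illustration, not Treeherder's actual query): group recently finished jobs by signature and average their running time.

```python
# Hypothetical sketch of averaging recent job running times per signature.
# Each job is (signature, start_timestamp, end_timestamp) in seconds.

def average_durations(jobs):
    totals = {}
    for signature, start, end in jobs:
        seconds, count = totals.get(signature, (0, 0))
        totals[signature] = (seconds + (end - start), count + 1)
    # Integer average, matching the unsigned-int avg_sec column.
    return {sig: total // count for sig, (total, count) in totals.items()}

jobs = [
    ("sig-a", 0, 100),    # ran 100s
    ("sig-a", 50, 350),   # ran 300s
    ("sig-b", 0, 60),     # ran 60s
]
averages = average_durations(jobs)
```

In the real task each resulting (signature, average) pair would then be upserted into the unified job_duration table, so ingestion can look up running_eta with a single indexed query.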
I deployed this to stage and then forced a manual calculate-durations run using:

../venv/bin/python ./manage.py shell

from treeherder.model.tasks import calculate_durations
calculate_durations.apply_async(routing_key="calculate_durations")

(Done like that vs using the command, so that it ran using the newrelic wrapper.)

...and it seems to be working well. I'll follow up with more stats once this is on prod.
Blocks: 1229066
Blocks: 1178227
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Graph showing before/after. Submitting jobs via the jobs endpoint is now 5 times faster (1500ms -> 300ms per transaction).