[meta] Cleanup or eliminate job_detail table
Categories
(Tree Management :: Treeherder, defect, P1)
Tracking
(Not tracked)
People
(Reporter: emorley, Assigned: sclements)
References
(Depends on 1 open bug, Blocks 1 open bug)
Details
Attachments
(4 files, 1 obsolete file)
As part of thinking about ingesting manifests for the test-centric UI, some issues with the JobDetails table (which is where these manifests currently end up) have become apparent: a) The job details table is one of the largest tables in the database (140GB, 550 million rows). b) Anyone can add new types of data to job_details (using TinderboxPrint lines) without any naming coordination or consideration as to whether it's useful to store. c) There are many legacy TinderboxPrint lines whose usefulness is questionable (and in many cases still contain HTML markup!). d) The current schema is suboptimal both in terms of size on disk, and also in that it doesn't encourage consistency in naming, making consuming it hard (eg all uploads called "artifact uploaded" regardless of what they are). e) There's an unclear overlap between the job_details and job_log table (eg some upload URLs are duplicated across both, others are just in job_details when really they should be in job_log). f) A row in the job_details table could have been ingested via any one of 6 codepaths, which seems a bit excessive and is asking for inconsistency (by code inspection I've seen one bug already) and makes debugging/refactoring harder. My thoughts: * Whilst free-form metadata ingestion (from TinderboxPrints) reduces the barrier to entry, IMO it encourages the addition of cruft that doesn't add value for a significant proportion of users, so a whitelist or more structured set of preset types should be used instead. This could be combined with switching to a more appropriate notation than "TinderboxPrint". * Stats based metadata (eg "CPU idle: 33.2 (3.5%)") should be submitted to Perfherder or not at all. If it doesn't need to be graphed, the people who care about it should just find it in the log themselves. * URLs such as links to test suite documentation for interpreting failures should just be read from the log and doesn't warrant being a JobDetail. * We might want to differentiate between metadata that should appear in the UI vs that which is used by tooling (eg the buildbot request_id is a JobDetail and so appears in the UI but is just used for requests to buildapi). * HTML markup should either be stripped out or outright rejected, and ideally the harness adjusted to stop generating it. * Upload URLs should go in a separate table. That table would be quite similar to the job_log table (aside from the parsing status field), so one alternative would be to combine them. * The JobDetails model (or a new one split out from it) should only have two fields `name` and `value`, and neither should be allowed to be NULL. * I'm not sure what we should do with URLs that are non-upload links (eg to other dashboards). Links to things like task inspector should presumably be replaced by special taskcluster handling in the UI (using the newly stored task_id), but there are other links at present (eg for hasal-dashboard). -- Additional info... JobDetails model definition: id = models.BigAutoField(primary_key=True) job = models.ForeignKey(Job) title = models.CharField(max_length=70, null=True) value = models.CharField(max_length=125) url = models.URLField(null=True, max_length=512) unique_together = ("title", "value", "job") JobLog model definition: job = models.ForeignKey(Job) name = models.CharField(max_length=50) url = models.URLField(max_length=255) status = models.IntegerField(choices=STATUSES, default=PENDING) unique_together = ('job', 'name', 'url') The job_details table can be populated via any of: 1) The log parser (from TinderboxPrint lines) [1] 2) builds-{pending,running,4hr} ingestion which adds a `buildapi` artifact containing the buildername and request_id [2a][2b] 3) The `jobInfo` property in Taskcluster Pulse job submissions [3a][3b] 4) The `extra.artifacts` property in Taskcluster Pulse job submissions [4a][4b] 5) REST API via the /jobs/ endpoint (jobs can include a nested `artifacts` blob) [5] 6) REST API via the /artifacts/ endpoint [6] A small selection of the questionable job_detail records from production: title: Summary value: NULL url: NULL title: NULL value: <a href=https://hg.mozilla.org/mozilla-central/rev/052d4d77cbca title='Built from mozilla-central revision 052d4d77cbca'>moz: url: NULL title: Result Summary value: <br/>========================================<br/>Total 27 tests; 25 succeeded, 2 failed.<br/><br/>Failure summary:<br/> - ht url: NULL title: Source code value: GitHub url: https://www.github.com/servo/servo title: File bug value: bug template url: https://bugzilla.mozilla.org/enter_bug.cgi?assigned_to=nobody%40mozilla.org&cc=armenzg%40mozilla.com&comment=Provide%20link.&component=General&form_name=enter_bug&product=Testing&short_desc=pulse_actions%20-%20Brief%20description%20of%20failure (for a longer list see later attachments) [1] https://github.com/mozilla/treeherder/blob/495fc898fd29c7fe760aef93fe9151735433e70d/treeherder/log_parser/parsers.py#L253-L345 [2a] https://github.com/mozilla/treeherder/blob/495fc898fd29c7fe760aef93fe9151735433e70d/treeherder/etl/buildapi.py#L225-L235 [2b] https://github.com/mozilla/treeherder/blob/495fc898fd29c7fe760aef93fe9151735433e70d/treeherder/etl/buildapi.py#L360-L370 [3a] https://github.com/mozilla/treeherder/blob/495fc898fd29c7fe760aef93fe9151735433e70d/schemas/pulse-job.yml#L276-L314 [3b] https://github.com/mozilla/treeherder/blob/495fc898fd29c7fe760aef93fe9151735433e70d/treeherder/etl/job_loader.py#L189-L210 [4a] https://github.com/mozilla/treeherder/blob/495fc898fd29c7fe760aef93fe9151735433e70d/schemas/pulse-job.yml#L393-L397 [4b] https://github.com/mozilla/treeherder/blob/495fc898fd29c7fe760aef93fe9151735433e70d/treeherder/etl/job_loader.py#L249-L257 [5] https://github.com/mozilla/treeherder/blob/495fc898fd29c7fe760aef93fe9151735433e70d/treeherder/webapp/api/jobs.py#L610-L622 [6] https://github.com/mozilla/treeherder/blob/495fc898fd29c7fe760aef93fe9151735433e70d/treeherder/webapp/api/artifact.py#L24-L29
Reporter | ||
Updated•7 years ago
|
Reporter | ||
Comment 1•7 years ago
|
||
Reporter | ||
Comment 2•7 years ago
|
||
Comment 3•7 years ago
|
||
I agree with most of this analysis. I'm not exactly sure what the full solution looks like, though I think a separate jobs upload table would be very beneficial for a lot of reasons. It probably doesn't hurt for us to file bugs / provide patches for cases where people are obviously submitting redundant/silly data. I just filed one PR for servo-perf to remove the link to the servo repository (which is already implicit in the revision data): https://github.com/shinglyu/servo-perf/pull/27
Reporter | ||
Updated•7 years ago
|
Reporter | ||
Comment 4•7 years ago
|
||
I believe the size of this table is a contributing factor towards bug 1384485.
Comment 5•7 years ago
|
||
Now that most (all?) builds have been migrated to taskcluster, it might be worth rethinking the translation code that converts all uploads for tc jobs into job details: https://github.com/taskcluster/taskcluster-treeherder/blob/700a76715aefd1b581f35d149d96977db590fc57/src/transform/artifact_links.js#L30 If we really need/want to display this information in treeherder, we could probably fetch it easily enough with taskcluster api's. I'm not sure what % of the job details are taskcluster uploads but I'd wager it's a large number.
Reporter | ||
Updated•7 years ago
|
Reporter | ||
Updated•7 years ago
|
Reporter | ||
Comment 6•7 years ago
|
||
(In reply to William Lachance (:wlach) (use needinfo!) from comment #5) > If we really need/want to display this information in treeherder, we could > probably fetch it easily enough with taskcluster api's. Good idea :-)
Reporter | ||
Updated•7 years ago
|
Reporter | ||
Comment 7•7 years ago
|
||
Updated list. I've had to exclude some titles that include too much variety, such as: 'Built from autoland revision 002f87c4339207515da014fe01440da99d8f63cb' 'Size of firefox-45.9.1.en-US.linux-i686.tar.bz2' 'Verification of .../1395926-vertical-upright-gpos-1.html'
Reporter | ||
Comment 8•7 years ago
|
||
Reporter | ||
Comment 9•7 years ago
|
||
Thoughts (particularly given the recent attachments): * the links to uploaded files are a significant proportion of all job details, so it confirms that they are likely worth of being handled separately to the rest. - However before we do this, we need to decide on whether we just point at the taskcluster API instead * CPU/resource stats are the next biggest contender (particularly once the variations are summed), for which I've filed bug 1423573 * the indexes for this table are as large as the data itself, which is due to trying to fit different data types into the {title, value, url} schema, plus the need for indexes in all columns due to the way that Django's update_or_create works (we could improve this by insisting picking a different set of properties that must be unique)
Reporter | ||
Updated•6 years ago
|
Reporter | ||
Updated•5 years ago
|
Comment 10•4 years ago
|
||
This is too old. We have newer bugs covering database-too-big issues
Assignee | ||
Comment 11•4 years ago
|
||
Much of the analysis and recommendations are still relevant. Reopening as a meta.
An update for how the table is being populated:
- The log parser (from TinderboxPrint lines)
- The
jobInfo
property in Taskcluster Pulse job submissions - The
extra.artifacts
property in Taskcluster Pulse job submissions
I need to look into 2 and 3 a bit more to find out what type of data is being processed and what it looks like. But looking in the table, the first 10,000 rows contains a combination of uploaded artifacts - to deprecate in bug 1603249 - and parsed text from TinderboxPrint log lines (see bug 1625033).
Not applicable anymore:
4) builds-{pending,running,4hr} ingestion which adds a buildapi
artifact containing the buildername and request_id <-- looks to be deprecated
5) REST API via the /jobs/ endpoint (jobs can include a nested artifacts
blob) <-- deprecated in bug 1349182
6) REST API via the /artifacts/ endpoint <-- deprecated in bug 1357970
Assignee | ||
Updated•4 years ago
|
Assignee | ||
Updated•4 years ago
|
Assignee | ||
Updated•4 years ago
|
Comment 12•4 years ago
|
||
Assignee | ||
Comment 13•4 years ago
|
||
I removed the table and API since they're no longer needed due to changes made in bug 1603249 (per comment 11, we're not actually ingesting any artifacts with pulse jobs): https://github.com/mozilla/treeherder/commit/8f0c3a771f0b538bc5d2e2ef7c0451d867583e09
Comment 14•4 years ago
|
||
Victory! Wohoo! \o/
Description
•