The job_detail table on prod is 180GB / 610 million rows, and much of the content is of questionable value

Status: NEW
Product/Component: Tree Management :: Treeherder
Priority: P2
Severity: normal
Reported: 11 months ago
Last updated: 2 months ago
Reporter: emorley
Assignee: emorley
Depends on: 3 bugs
Blocks: 1 bug
Attachments: 3 (1 obsolete)

Description (Assignee, 11 months ago)
As part of thinking about ingesting manifests for the test-centric UI, some issues with the JobDetails table (which is where these manifests currently end up) have become apparent:

a) The job details table is one of the largest tables in the database (140GB, 550 million rows).
b) Anyone can add new types of data to job_details (using TinderboxPrint lines) without any naming coordination or consideration as to whether it's useful to store.
c) There are many legacy TinderboxPrint lines whose usefulness is questionable (and in many cases still contain HTML markup!).
d) The current schema is suboptimal both in terms of size on disk and in that it doesn't encourage consistency in naming, which makes consuming it hard (eg all uploads are called "artifact uploaded" regardless of what they are).
e) There's an unclear overlap between the job_details and job_log tables (eg some upload URLs are duplicated across both, others are just in job_details when really they should be in job_log).
f) A row in the job_details table could have been ingested via any one of 6 codepaths, which seems a bit excessive and is asking for inconsistency (by code inspection I've seen one bug already) and makes debugging/refactoring harder.

My thoughts:
* Whilst free-form metadata ingestion (from TinderboxPrints) reduces the barrier to entry, IMO it encourages the addition of cruft that doesn't add value for a significant proportion of users, so a whitelist or more structured set of preset types should be used instead. This could be combined with switching to a more appropriate notation than "TinderboxPrint".
* Stats based metadata (eg "CPU idle: 33.2 (3.5%)") should be submitted to Perfherder or not at all. If it doesn't need to be graphed, the people who care about it should just find it in the log themselves.
* URLs such as links to test suite documentation for interpreting failures should just be read from the log and don't warrant being JobDetails.
* We might want to differentiate between metadata that should appear in the UI vs that which is used by tooling (eg the buildbot request_id is a JobDetail and so appears in the UI but is just used for requests to buildapi).
* HTML markup should either be stripped out or outright rejected, and ideally the harness adjusted to stop generating it.
* Upload URLs should go in a separate table. That table would be quite similar to the job_log table (aside from the parsing status field), so one alternative would be to combine them.
* The JobDetails model (or a new one split out from it) should only have two fields, `name` and `value`, and neither should be allowed to be NULL (a rough sketch of this split follows the list below).
* I'm not sure what we should do with URLs that are non-upload links (eg to other dashboards). Links to things like task inspector should presumably be replaced by special taskcluster handling in the UI (using the newly stored task_id), but there are other links at present (eg for hasal-dashboard).
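
As a rough sketch of the upload-table and name/value points above (the model and field names here are placeholders for illustration only, not a settled design, and `Job` is assumed to be the existing model):

    from django.db import models


    class JobDetail(models.Model):
        """Structured name/value metadata only; no URLs, no NULLs."""
        id = models.BigAutoField(primary_key=True)
        job = models.ForeignKey('Job', on_delete=models.CASCADE)
        name = models.CharField(max_length=70)   # ideally restricted to a whitelist of preset types
        value = models.CharField(max_length=125)

        class Meta:
            unique_together = ('job', 'name', 'value')


    class JobUpload(models.Model):
        """Upload URLs split out of job_details; close to job_log minus the parse status."""
        id = models.BigAutoField(primary_key=True)
        job = models.ForeignKey('Job', on_delete=models.CASCADE)
        name = models.CharField(max_length=50)
        url = models.URLField(max_length=512)

        class Meta:
            unique_together = ('job', 'name', 'url')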

--

Additional info...

JobDetails model definition:

    id = models.BigAutoField(primary_key=True)
    job = models.ForeignKey(Job)
    title = models.CharField(max_length=70, null=True)
    value = models.CharField(max_length=125)
    url = models.URLField(null=True, max_length=512)

    class Meta:
        unique_together = ("title", "value", "job")

JobLog model definition:

    job = models.ForeignKey(Job)
    name = models.CharField(max_length=50)
    url = models.URLField(max_length=255)
    status = models.IntegerField(choices=STATUSES, default=PENDING)

    class Meta:
        unique_together = ('job', 'name', 'url')

The job_details table can be populated via any of:

1) The log parser (from TinderboxPrint lines) [1]
2) builds-{pending,running,4hr} ingestion which adds a `buildapi` artifact containing the buildername and request_id [2a][2b]
3) The `jobInfo` property in Taskcluster Pulse job submissions [3a][3b]
4) The `extra.artifacts` property in Taskcluster Pulse job submissions [4a][4b]
5) REST API via the /jobs/ endpoint (jobs can include a nested `artifacts` blob) [5]
6) REST API via the /artifacts/ endpoint [6]

A small selection of the questionable job_detail records from production:

    title: Summary
    value: NULL
      url: NULL

    title: NULL
    value: <a href=https://hg.mozilla.org/mozilla-central/rev/052d4d77cbca title='Built from mozilla-central revision 052d4d77cbca'>moz:
      url: NULL

    title: Result Summary
    value: <br/>========================================<br/>Total 27 tests; 25 succeeded, 2 failed.<br/><br/>Failure summary:<br/> - ht
      url: NULL

    title: Source code
    value: GitHub
      url: https://www.github.com/servo/servo

    title: File bug
    value: bug template
      url: https://bugzilla.mozilla.org/enter_bug.cgi?assigned_to=nobody%40mozilla.org&cc=armenzg%40mozilla.com&comment=Provide%20link.&component=General&form_name=enter_bug&product=Testing&short_desc=pulse_actions%20-%20Brief%20description%20of%20failure

(for a longer list see later attachments)
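
(As an aside on the HTML markup point above: if we strip rather than reject such submissions outright, a minimal sketch could use Django's built-in strip_tags. This is just an illustration of the idea, not existing ingestion code:)

    # Sketch only: remove embedded HTML markup from a submitted job_detail value.
    # Rejecting the submission (and fixing the harness) may well be preferable.
    from django.utils.html import strip_tags


    def clean_job_detail_value(value):
        """Return the value with any HTML tags removed and whitespace collapsed."""
        if value is None:
            return value
        return ' '.join(strip_tags(value).split())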


[1] https://github.com/mozilla/treeherder/blob/495fc898fd29c7fe760aef93fe9151735433e70d/treeherder/log_parser/parsers.py#L253-L345
[2a] https://github.com/mozilla/treeherder/blob/495fc898fd29c7fe760aef93fe9151735433e70d/treeherder/etl/buildapi.py#L225-L235
[2b] https://github.com/mozilla/treeherder/blob/495fc898fd29c7fe760aef93fe9151735433e70d/treeherder/etl/buildapi.py#L360-L370
[3a] https://github.com/mozilla/treeherder/blob/495fc898fd29c7fe760aef93fe9151735433e70d/schemas/pulse-job.yml#L276-L314
[3b] https://github.com/mozilla/treeherder/blob/495fc898fd29c7fe760aef93fe9151735433e70d/treeherder/etl/job_loader.py#L189-L210
[4a] https://github.com/mozilla/treeherder/blob/495fc898fd29c7fe760aef93fe9151735433e70d/schemas/pulse-job.yml#L393-L397
[4b] https://github.com/mozilla/treeherder/blob/495fc898fd29c7fe760aef93fe9151735433e70d/treeherder/etl/job_loader.py#L249-L257
[5] https://github.com/mozilla/treeherder/blob/495fc898fd29c7fe760aef93fe9151735433e70d/treeherder/webapp/api/jobs.py#L610-L622
[6] https://github.com/mozilla/treeherder/blob/495fc898fd29c7fe760aef93fe9151735433e70d/treeherder/webapp/api/artifact.py#L24-L29

Updated (11 months ago)
Summary: The job_detail table on prod is 140GB / 550 million rows, and much of the content is of questionable value → The job_detail table on prod is 140GB / 535 million rows, and much of the content is of questionable value

Comment 1 (Assignee, 11 months ago)
Created attachment 8840702 [details]
Examples of job_detail records

Comment 2 (Assignee, 11 months ago)
Created attachment 8840703 [details]
List of unique job_detail.title

Comment 3 (11 months ago)
I agree with most of this analysis. I'm not exactly sure what the full solution looks like, though I think a separate job uploads table would be very beneficial for a lot of reasons.

It probably doesn't hurt for us to file bugs / provide patches for cases where people are obviously submitting redundant/silly data. I just filed one PR for servo-perf to remove the link to the servo repository (which is already implicit in the revision data):

https://github.com/shinglyu/servo-perf/pull/27

Updated (11 months ago)
Assignee: nobody → emorley

Comment 4 (Assignee, 6 months ago)
I believe the size of this table is a contributing factor towards bug 1384485.
Blocks: 1384485

Comment 5 (:wlach)
Now that most (all?) builds have been migrated to taskcluster, it might be worth rethinking the translation code that converts all uploads for tc jobs into job details:

https://github.com/taskcluster/taskcluster-treeherder/blob/700a76715aefd1b581f35d149d96977db590fc57/src/transform/artifact_links.js#L30

If we really need/want to display this information in treeherder, we could probably fetch it easily enough with taskcluster api's.

I'm not sure what % of the job details are taskcluster uploads but I'd wager it's a large number.
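
To make this concrete, fetching on demand might look something like the sketch below, assuming the Queue's public listLatestArtifacts endpoint (the host, paths and response shape here are assumptions of this sketch rather than anything verified in this bug):

    # Sketch only: look up a task's uploads from the Taskcluster Queue on demand,
    # instead of persisting one job_detail row per uploaded file.
    # Assumes the queue's listLatestArtifacts endpoint; pagination via
    # continuationToken is ignored for brevity.
    import requests

    QUEUE_URL = 'https://queue.taskcluster.net/v1'


    def get_artifact_links(task_id):
        """Return (name, url) pairs for the latest run's artifacts of a task."""
        response = requests.get('{}/task/{}/artifacts'.format(QUEUE_URL, task_id))
        response.raise_for_status()
        return [
            (artifact['name'],
             '{}/task/{}/artifacts/{}'.format(QUEUE_URL, task_id, artifact['name']))
            for artifact in response.json().get('artifacts', [])
        ]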

Updated (6 months ago)
Blocks: 1386331

Updated (6 months ago)
Priority: P3 → P2

Updated (6 months ago)
Summary: The job_detail table on prod is 140GB / 535 million rows, and much of the content is of questionable value → The job_detail table on prod is 170GB / 560 million rows, and much of the content is of questionable value

Comment 6 (Assignee, 6 months ago)
(In reply to William Lachance (:wlach) (use needinfo!) from comment #5)
> If we really need/want to display this information in treeherder, we could
> probably fetch it easily enough with taskcluster api's.

Good idea :-)

Updated (4 months ago)
Summary: The job_detail table on prod is 170GB / 560 million rows, and much of the content is of questionable value → The job_detail table on prod is 180GB / 610 million rows, and much of the content is of questionable value

Updated (3 months ago)
See Also: → bug 1410737

Updated (2 months ago)
Depends on: 1420983

Updated (2 months ago)
Blocks: 1078392

Comment 7 (Assignee, 2 months ago)
Created attachment 8934959 [details]
List of unique job_detail.title (updated)

Updated list. I've had to exclude some titles that include too much variety, such as:

'Built from autoland revision 002f87c4339207515da014fe01440da99d8f63cb'
'Size of firefox-45.9.1.en-US.linux-i686.tar.bz2'
'Verification of .../1395926-vertical-upright-gpos-1.html'
Attachment #8840703 - Attachment is obsolete: true

Updated (2 months ago)
Depends on: 1423573

Updated (2 months ago)
Depends on: 1423576

Comment 8 (Assignee, 2 months ago)
Created attachment 8934972 [details]
Counts of titles seen in latest 500,000 JobDetails

Comment 9 (Assignee, 2 months ago)
Thoughts (particularly given the recent attachments):
* the links to uploaded files are a significant proportion of all job details, which confirms they are likely worth handling separately from the rest.
  - However, before we do this, we need to decide whether we should just point at the taskcluster API instead
* CPU/resource stats are the next biggest contender (particularly once the variations are summed), for which I've filed bug 1423573
* the indexes for this table are as large as the data itself, which is due to trying to fit different data types into the {title, value, url} schema, plus the need for indexes on all columns due to the way Django's update_or_create works (we could improve this by picking a different set of properties that must be unique; see the sketch below)
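
To illustrate the update_or_create point, the ingestion pattern is roughly the following (a sketch using the current field names; whether the real call sites look exactly like this is an assumption):

    # Django's update_or_create() first SELECTs on the lookup kwargs and then
    # INSERTs or UPDATEs, so job, title and value all need index coverage for
    # that lookup to avoid scans -- hence indexes nearly as large as the data.
    JobDetail.objects.update_or_create(
        job=job,
        title=title,
        value=value,
        defaults={'url': url},
    )
    # Narrowing what has to be unique (eg job + a whitelisted title) would shrink
    # both the lookup and the indexes needed to support it.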