Closed Bug 1342296 Opened 7 years ago Closed 4 years ago

[meta] Cleanup or eliminate job_detail table

Categories

(Tree Management :: Treeherder, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: sclements)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

Attachments

(4 files, 1 obsolete file)

As part of thinking about ingesting manifests for the test-centric UI, some issues with the JobDetails table (which is where these manifests currently end up) have become apparent:

a) The job details table is one of the largest tables in the database (140GB, 550 million rows).
b) Anyone can add new types of data to job_details (using TinderboxPrint lines) without any naming coordination or consideration as to whether it's useful to store.
c) There are many legacy TinderboxPrint lines whose usefulness is questionable (and in many cases still contain HTML markup!).
d) The current schema is suboptimal both in size on disk and in that it doesn't encourage naming consistency, which makes the data hard to consume (eg all uploads are called "artifact uploaded" regardless of what they actually are).
e) There's an unclear overlap between the job_details and job_log table (eg some upload URLs are duplicated across both, others are just in job_details when really they should be in job_log).
f) A row in the job_details table could have been ingested via any one of 6 codepaths, which is excessive, invites inconsistency (code inspection has already turned up one bug), and makes debugging/refactoring harder.

My thoughts:
* Whilst free-form metadata ingestion (from TinderboxPrints) reduces the barrier to entry, IMO it encourages the addition of cruft that doesn't add value for a significant proportion of users, so a whitelist or more structured set of preset types should be used instead. This could be combined with switching to a more appropriate notation than "TinderboxPrint".
* Stats based metadata (eg "CPU idle: 33.2 (3.5%)") should be submitted to Perfherder or not at all. If it doesn't need to be graphed, the people who care about it should just find it in the log themselves.
* URLs such as links to test suite documentation for interpreting failures should just be read from the log and don't warrant being JobDetails.
* We might want to differentiate between metadata that should appear in the UI vs that which is used by tooling (eg the buildbot request_id is a JobDetail and so appears in the UI but is just used for requests to buildapi).
* HTML markup should either be stripped out or outright rejected, and ideally the harness adjusted to stop generating it.
* Upload URLs should go in a separate table. That table would be quite similar to the job_log table (aside from the parsing status field), so one alternative would be to combine them.
* The JobDetails model (or a new one split out from it) should only have two fields `name` and `value`, and neither should be allowed to be NULL.
* I'm not sure what we should do with URLs that are non-upload links (eg to other dashboards). Links to things like task inspector should presumably be replaced by special taskcluster handling in the UI (using the newly stored task_id), but there are other links at present (eg for hasal-dashboard).
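To make the whitelist idea in the first bullet concrete, here is a minimal sketch of what an ingestion-time filter could look like. The names in `ALLOWED_DETAIL_NAMES` are illustrative assumptions, not an agreed list, and `filter_job_details` is a hypothetical helper rather than existing Treeherder code:

```python
# Sketch only: enforce a preset, coordinated set of job-detail names at
# ingestion time instead of accepting arbitrary TinderboxPrint metadata.
# The names below are placeholders for whatever whitelist gets agreed on.
ALLOWED_DETAIL_NAMES = {
    "artifact uploaded",
    "buildbot_request_id",
    "One-click loaner",
}

def filter_job_details(details):
    """Drop any submitted detail whose name is not in the preset whitelist."""
    return [d for d in details if d.get("name") in ALLOWED_DETAIL_NAMES]
```

Anything not on the list would be rejected (or logged) rather than stored, which addresses points (b) and (c) above.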

--

Additional info...

JobDetails model definition:

    id = models.BigAutoField(primary_key=True)
    job = models.ForeignKey(Job)
    title = models.CharField(max_length=70, null=True)
    value = models.CharField(max_length=125)
    url = models.URLField(null=True, max_length=512)

    class Meta:
        unique_together = ("title", "value", "job")

JobLog model definition:

    job = models.ForeignKey(Job)
    name = models.CharField(max_length=50)
    url = models.URLField(max_length=255)
    status = models.IntegerField(choices=STATUSES, default=PENDING)

    class Meta:
        unique_together = ('job', 'name', 'url')

The job_details table can be populated via any of:

1) The log parser (from TinderboxPrint lines) [1]
2) builds-{pending,running,4hr} ingestion which adds a `buildapi` artifact containing the buildername and request_id [2a][2b]
3) The `jobInfo` property in Taskcluster Pulse job submissions [3a][3b]
4) The `extra.artifacts` property in Taskcluster Pulse job submissions [4a][4b]
5) REST API via the /jobs/ endpoint (jobs can include a nested `artifacts` blob) [5]
6) REST API via the /artifacts/ endpoint [6]
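For codepath 1, a heavily simplified sketch of the TinderboxPrint handling is below. The real parser (see [1]) handles more variants; the "title<br/>value" and bare-value shapes assumed here are illustrative, and `parse_tinderbox_line` is a hypothetical name:

```python
import re

# Illustrative only: a simplified version of what the log parser does with
# TinderboxPrint lines. The real implementation handles links, uploads, and
# other shapes; this sketch covers "title<br/>value" and bare values.
TINDERBOXPRINT = re.compile(r"TinderboxPrint: (?P<line>.*)$")

def parse_tinderbox_line(logline):
    """Turn one TinderboxPrint log line into a (title, value) pair, or None."""
    match = TINDERBOXPRINT.search(logline)
    if not match:
        return None
    content = match.group("line")
    if "<br/>" in content:
        title, _, value = content.partition("<br/>")
        return (title, value)
    return (None, content)
```

The free-form nature of this input is exactly why items like "Result Summary" below end up stored with embedded HTML.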

A small selection of the questionable job_detail records from production:

    title: Summary
    value: NULL
      url: NULL

    title: NULL
    value: <a href=https://hg.mozilla.org/mozilla-central/rev/052d4d77cbca title='Built from mozilla-central revision 052d4d77cbca'>moz:
      url: NULL

    title: Result Summary
    value: <br/>========================================<br/>Total 27 tests; 25 succeeded, 2 failed.<br/><br/>Failure summary:<br/> - ht
      url: NULL

    title: Source code
    value: GitHub
      url: https://www.github.com/servo/servo

    title: File bug
    value: bug template
      url: https://bugzilla.mozilla.org/enter_bug.cgi?assigned_to=nobody%40mozilla.org&cc=armenzg%40mozilla.com&comment=Provide%20link.&component=General&form_name=enter_bug&product=Testing&short_desc=pulse_actions%20-%20Brief%20description%20of%20failure

(for a longer list see later attachments)
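As a sketch of the "strip or reject HTML markup" suggestion above, the standard library's `html.parser` is enough to extract the text content of values like the "Result Summary" record. This is an assumption about one possible approach, not what Treeherder actually does:

```python
from html.parser import HTMLParser

# Sketch only: strip tags from a job_detail value using just the stdlib,
# keeping the text content. Rejecting markup outright is the alternative.
class _TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Collect only the text between tags; tags themselves are dropped.
        self.parts.append(data)

def strip_markup(value):
    """Return the text content of a job_detail value, with tags removed."""
    extractor = _TextExtractor()
    extractor.feed(value)
    return "".join(extractor.parts)
```

Ideally the harness would stop emitting markup in the first place, making this unnecessary.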


[1] https://github.com/mozilla/treeherder/blob/495fc898fd29c7fe760aef93fe9151735433e70d/treeherder/log_parser/parsers.py#L253-L345
[2a] https://github.com/mozilla/treeherder/blob/495fc898fd29c7fe760aef93fe9151735433e70d/treeherder/etl/buildapi.py#L225-L235
[2b] https://github.com/mozilla/treeherder/blob/495fc898fd29c7fe760aef93fe9151735433e70d/treeherder/etl/buildapi.py#L360-L370
[3a] https://github.com/mozilla/treeherder/blob/495fc898fd29c7fe760aef93fe9151735433e70d/schemas/pulse-job.yml#L276-L314
[3b] https://github.com/mozilla/treeherder/blob/495fc898fd29c7fe760aef93fe9151735433e70d/treeherder/etl/job_loader.py#L189-L210
[4a] https://github.com/mozilla/treeherder/blob/495fc898fd29c7fe760aef93fe9151735433e70d/schemas/pulse-job.yml#L393-L397
[4b] https://github.com/mozilla/treeherder/blob/495fc898fd29c7fe760aef93fe9151735433e70d/treeherder/etl/job_loader.py#L249-L257
[5] https://github.com/mozilla/treeherder/blob/495fc898fd29c7fe760aef93fe9151735433e70d/treeherder/webapp/api/jobs.py#L610-L622
[6] https://github.com/mozilla/treeherder/blob/495fc898fd29c7fe760aef93fe9151735433e70d/treeherder/webapp/api/artifact.py#L24-L29
Summary: The job_details table on prod is 140GB / 550 million rows, and much of the content is of questionable value → The job_detail table on prod is 140GB / 535 million rows, and much of the content is of questionable value
Attached file List of unique job_detail.title (obsolete) —
I agree with most of this analysis. I'm not exactly sure what the full solution looks like, though I think a separate jobs upload table would be very beneficial for a lot of reasons.

It probably doesn't hurt for us to file bugs / provide patches for cases where people are obviously submitting redundant/silly data. I just filed one PR for servo-perf to remove the link to the servo repository (which is already implicit in the revision data):

https://github.com/shinglyu/servo-perf/pull/27
Assignee: nobody → emorley
I believe the size of this table is a contributing factor towards bug 1384485.
Blocks: 1384485
Now that most (all?) builds have been migrated to taskcluster, it might be worth rethinking the translation code that converts all uploads for tc jobs into job details:

https://github.com/taskcluster/taskcluster-treeherder/blob/700a76715aefd1b581f35d149d96977db590fc57/src/transform/artifact_links.js#L30

If we really need/want to display this information in treeherder, we could probably fetch it easily enough with the Taskcluster APIs.

I'm not sure what % of the job details are taskcluster uploads but I'd wager it's a large number.
Depends on: 1385868
Blocks: 1386331
Priority: P3 → P2
Summary: The job_detail table on prod is 140GB / 535 million rows, and much of the content is of questionable value → The job_detail table on prod is 170GB / 560 million rows, and much of the content is of questionable value
(In reply to William Lachance (:wlach) (use needinfo!) from comment #5)
> If we really need/want to display this information in treeherder, we could
> probably fetch it easily enough with taskcluster api's.

Good idea :-)
Summary: The job_detail table on prod is 170GB / 560 million rows, and much of the content is of questionable value → The job_detail table on prod is 180GB / 610 million rows, and much of the content is of questionable value
See Also: → 1410737
Depends on: 1420983
Blocks: 1078392
Updated list. I've had to exclude some titles that include too much variety, such as:

'Built from autoland revision 002f87c4339207515da014fe01440da99d8f63cb'
'Size of firefox-45.9.1.en-US.linux-i686.tar.bz2'
'Verification of .../1395926-vertical-upright-gpos-1.html'
Attachment #8840703 - Attachment is obsolete: true
Depends on: 1423573
Depends on: 1423576
Thoughts (particularly given the recent attachments):
* the links to uploaded files are a significant proportion of all job details, which confirms they are likely worth handling separately from the rest.
  - However before we do this, we need to decide on whether we just point at the taskcluster API instead
* CPU/resource stats are the next biggest contender (particularly once the variations are summed), for which I've filed bug 1423573
* the indexes for this table are as large as the data itself. This is due to trying to fit different data types into the {title, value, url} schema, plus needing indexes on all columns because of the way Django's update_or_create works (we could improve this by picking a smaller set of properties that must be unique)
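One way to shrink those indexes (an assumption on my part, not a plan from this bug) would be to replace the composite unique index over (job, title, value, url) with a single indexed digest column computed at ingestion time, so uniqueness checks only need one short index. A minimal sketch, with `detail_digest` as a hypothetical helper:

```python
import hashlib

# Sketch only: derive a single stable key from the columns that currently
# make up the composite unique constraint. Storing and indexing just this
# digest replaces the wide multi-column indexes described above.
def detail_digest(job_id, title, value, url):
    """Stable digest identifying one job_detail row for uniqueness checks."""
    # \x1e (record separator) avoids collisions between adjacent fields.
    key = "\x1e".join([str(job_id), title or "", value or "", url or ""])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()
```

update_or_create could then look up rows by the digest alone instead of by every column.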
Assignee: emorley → nobody
Priority: P2 → P3

This is too old. We have newer bugs covering database-too-big issues

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → INVALID
See Also: → 1603249

Much of the analysis and recommendations are still relevant. Reopening as a meta.

An update for how the table is being populated:

  1. The log parser (from TinderboxPrint lines)
  2. The jobInfo property in Taskcluster Pulse job submissions
  3. The extra.artifacts property in Taskcluster Pulse job submissions

I need to look into 2 and 3 a bit more to find out what type of data is being processed and what it looks like. But looking in the table, the first 10,000 rows contain a combination of uploaded artifacts - to deprecate in bug 1603249 - and parsed text from TinderboxPrint log lines (see bug 1625033).

Not applicable anymore:
4) builds-{pending,running,4hr} ingestion which adds a buildapi artifact containing the buildername and request_id <-- looks to be deprecated
5) REST API via the /jobs/ endpoint (jobs can include a nested artifacts blob) <-- deprecated in bug 1349182
6) REST API via the /artifacts/ endpoint <-- deprecated in bug 1357970

Status: RESOLVED → REOPENED
Resolution: INVALID → ---
Summary: The job_detail table on prod is 180GB / 610 million rows, and much of the content is of questionable value → [meta] Cleanup or eliminate job_detail table
Depends on: 1603249, 1625033
See Also: 1603249
Assignee: nobody → sclements
Priority: P3 → P1
No longer depends on: 1625033

I removed the table and API since they're no longer needed due to changes made in bug 1603249 (per comment 11, we're not actually ingesting any artifacts with pulse jobs): https://github.com/mozilla/treeherder/commit/8f0c3a771f0b538bc5d2e2ef7c0451d867583e09

Status: REOPENED → RESOLVED
Closed: 4 years ago4 years ago
Resolution: --- → FIXED

Victory! Wohoo! \o/
