Closed Bug 1603249 Opened 5 years ago Closed 5 years ago

Do not store "artifact uploaded" into JobDetail table

Categories

(Tree Management :: Treeherder: Data Ingestion, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Assigned: sclements)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

This is one of the pieces of data ingestion that makes the Treeherder database the most write intensive. Every single task that we ingest has multiple artifacts uploaded and this adds up quite quickly.

This is where the code is first added into the pipeline:
https://github.com/mozilla/treeherder/blob/610fc5082615cebb9c15d19c838560b77cff732d/treeherder/etl/taskcluster_pulse/handler.py#L346

We can change the API to fetch the artifacts for the task from Taskcluster and return them as part of the API response. In a sense, we want the change to look like a no-op for consumers of the API. I'm aware that mozscreenshots queries this API.

This is a sample TC API:
https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/aNQOqM0HQwC9eK612X7nQQ/runs/0/artifacts

Sample entry:

    {
        "job_id": 280744315,
        "job_guid": "14be8c02-4387-402f-a59c-ae17f3d4d1ee/0",
        "title": "artifact uploaded",
        "value": "live_backing.log",
        "url": "https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/FL6MAkOHQC-lnK4X89TR7g/runs/0/artifacts/public/logs/live_backing.log"
    }
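
A rough sketch of the "no-op for consumers" idea: list the artifacts from the Taskcluster queue endpoint shown above and reshape them into entries like the sample. This is not Treeherder's actual code. All function names are made up for illustration, and the job_guid derivation (task ID decoded as a UUID, plus the run number) is an assumption inferred from the sample entry only. The sketch also ignores any pagination the endpoint may apply to long artifact lists.

```python
import base64
import json
import uuid
from urllib.parse import quote
from urllib.request import urlopen


def artifacts_url(root_url, task_id, run_id):
    """URL of the Taskcluster queue endpoint listing a run's artifacts."""
    return f"{root_url.rstrip('/')}/api/queue/v1/task/{task_id}/runs/{run_id}/artifacts"


def guid_for(task_id, run_id):
    """Assumption: job_guid is the task ID decoded as a UUID, plus '/<run>'."""
    raw = base64.urlsafe_b64decode(task_id + "==")
    return f"{uuid.UUID(bytes=raw)}/{run_id}"


def to_job_details(payload, root_url, task_id, run_id, job_id=None):
    """Reshape a list-artifacts response into JobDetail-style entries."""
    base = artifacts_url(root_url, task_id, run_id)
    return [
        {
            "job_id": job_id,
            "job_guid": guid_for(task_id, run_id),
            "title": "artifact uploaded",
            # The sample entry uses the file name, not the full artifact path.
            "value": artifact["name"].rsplit("/", 1)[-1],
            "url": f"{base}/{quote(artifact['name'])}",
        }
        for artifact in payload.get("artifacts", [])
    ]


def fetch_job_details(root_url, task_id, run_id, job_id=None):
    """Fetch the artifact list over HTTP and convert it (network call)."""
    with urlopen(artifacts_url(root_url, task_id, run_id)) as response:
        payload = json.load(response)
    return to_job_details(payload, root_url, task_id, run_id, job_id)
```

With something like this in the API layer, nothing needs to be written to JobDetail for uploaded artifacts; the entries are built on demand per request.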
Depends on: 1603261

~65% of the last 10M rows is TC artifacts: https://sql.telemetry.mozilla.org/queries/67114#169999

The writes can be decreased with a transaction in the log parser. Armen, could you point me to the code? I will file a bug.

Flags: needinfo?(armenzg)

Hi Kyle, I'm not entirely sure what you mean about the log parser.

I mentioned this code in the first comment. Is that what you're looking for?

Flags: needinfo?(armenzg)
Priority: -- → P3
Assignee: nobody → sclements
Status: NEW → ASSIGNED
Priority: P3 → P1

Per Armen's initial comment, mozscreenshots does indeed seem to be the primary external consumer of the JobDetail endpoint: https://papertrailapp.com/systems/treeherder-prod/events?q=%22Get%20%2Fapi%2Fjobdetail%22%20-%22https%3A%2F%2Ftreeherder.mozilla.org%2F%22

I did some more research into this and we have everything we need in the UI to fetch the artifacts directly at the point they're needed, rather than going through JobDetails (UI -> JobDetails -> Taskcluster API -> return results). I spoke with Matt Noorenberghe and he's willing to refactor mozscreenshots to query Taskcluster directly rather than JobDetails. The Jobs API returns the retry_id (runId) and task_id, and we will have access to the root_url via repository.
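
To illustrate, the three fields mentioned above are enough to build the Taskcluster artifact URLs on the client side. The sketch below is in Python for readability even though the UI is not; the Job and Repository classes and their attribute names are hypothetical stand-ins that just mirror the field names in this comment.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical stand-in for a Jobs API record."""
    task_id: str
    retry_id: int  # corresponds to the Taskcluster runId


@dataclass
class Repository:
    """Hypothetical stand-in for a repository record."""
    root_url: str


def artifact_list_url(job, repository):
    """URL that lists all artifacts for the job's task run."""
    root = repository.root_url.rstrip("/")
    return f"{root}/api/queue/v1/task/{job.task_id}/runs/{job.retry_id}/artifacts"


def artifact_url(job, repository, name):
    """URL of a single artifact, e.g. name='public/logs/live_backing.log'."""
    return f"{artifact_list_url(job, repository)}/{name}"
```

A consumer such as mozscreenshots could follow the same pattern: read task_id and retry_id from the Jobs API, then talk to Taskcluster directly.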

So my thinking is to do this in stages. I can switch the UI to retrieve artifacts directly, then prep the backend code for when we stop ingesting artifacts and remove the JobDetails endpoint (and meanwhile notify other users of its deprecation in x weeks). Then merge those backend changes once Matt has made his changes in early to mid April. Does anyone have an objection to this idea?

This sounds great! :) Though we still want to process the logs for our own uses. So some of the "artifact uploaded" are still needed. Or at least we need to parse them one way or another.

Specific to Push Health is we need the "*_errorsummary.log" because it's parsed to create FailureLine objects, which Push Health relies upon.
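
One way to picture this: when we stop storing every uploaded artifact, a small allowlist could still pick out the ones we need to parse. A minimal sketch, assuming the only pattern to keep is the "*_errorsummary.log" one named above; the function name and the example artifact names in the usage note are made up.

```python
from fnmatch import fnmatchcase

# Patterns for artifacts that still need processing (assumed list).
PATTERNS_TO_KEEP = ("*_errorsummary.log",)


def split_artifacts(names, patterns=PATTERNS_TO_KEEP):
    """Split artifact names into (still needed, safe to skip)."""
    keep, skip = [], []
    for name in names:
        # Match on the file name only, not the full artifact path.
        basename = name.rsplit("/", 1)[-1]
        if any(fnmatchcase(basename, pattern) for pattern in patterns):
            keep.append(name)
        else:
            skip.append(name)
    return keep, skip
```

For example, given "public/logs/live_backing.log" and a hypothetical "public/test_info/wpt_errorsummary.log", only the second would be kept for parsing.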

But I like Sarah's plan. Modify UI first. Then wean us off ingesting the links for the artifacts. We might even consider switching some of them to be saved to a different or a few different tables, as the case may be. But I'll leave the minutiae up to you. :)

Thanks for reaching out to Matt!

This is a fine plan.

Depends on: 1625033
See Also: → 1342296

After chatting with Tom.Prince the other day, I've come into new information about what we store in the JobDetail table. In a nutshell, in addition to storing uploaded artifacts in the table we're also parsing log lines with the name "TinderboxPrint" and storing them in the table. Much of that content seems to be of questionable value and could be found by someone looking at the logs. I also found an old bug filed by Ed with a useful, if somewhat outdated, analysis.

So before I can proceed with the removal of the jobdetails endpoint completely we need to figure out if we should keep supporting that log parsing/storing of data. I'll reopen bug 1342296 as a meta (it was closed as invalid) and add an update since some of the information is no longer applicable.

I think the approach to take is to continue with the idea of deprecating the use of /jobdetail/ endpoint for uploaded artifact retrieval both in our UI and for mozscreenshots.

For the job details panel in TH, we'll need to still retrieve the job details that are not uploaded artifacts (from TinderboxPrint and anything else) until we decide what, if anything, we should store in JobDetail. But this will at least cut down on a large chunk of the writes to the table in the meantime.

Blocks: 1342296
See Also: 1342296

So before I can proceed with the removal of the jobdetails endpoint completely

I was not asking to remove the jobdetails endpoint or to stop storing anything related to TinderboxPrints.
My original request was not to store artifacts in the DB since it has a high write impact and we can instead list in the UI the artifacts from TC APIs or put a link to the TC UI that shows artifacts for a task.

I think your approach in the last two paragraphs makes sense.

Thanks Sarah!

(In reply to Armen [:armenzg] from comment #10)

I was not asking to remove the jobdetails endpoint or to stop storing anything related to TinderboxPrints.
My original request was not to store artifacts in the DB since it has a high write impact and we can instead list in the UI the artifacts from TC APIs or put a link to the TC UI that shows artifacts for a task.

Yes, I know. But instead of only focusing on one aspect - the storage of uploaded artifacts - I think it's worth evaluating whether or not we should be storing the TinderboxPrint data in the table and whether we need to have the /jobdetails/ endpoint at all (I was a little premature in thinking we could do this right away, but now that I've done more research I still think it's worth considering). But like I said, that'll be a next step.

Depends on: 1605426
No longer depends on: 1625033

(In reply to Sarah Clements [:sclements] from comment #11)

(In reply to Armen [:armenzg] from comment #10)

I was not asking to remove the jobdetails endpoint or to stop storing anything related to TinderboxPrints.
My original request was not to store artifacts in the DB since it has a high write impact and we can instead list in the UI the artifacts from TC APIs or put a link to the TC UI that shows artifacts for a task.

Yes, I know. But instead of only focusing on one aspect - the storage of uploaded artifacts - I think it's worth evaluating whether or not we should be storing the TinderboxPrint data in the table and whether we need to have the /jobdetails/ endpoint at all (I was a little premature in thinking we could do this right away, but now that I've done more research I still think it's worth considering). But like I said, that'll be a next step.

Yeah, I agree with you here, Sarah. wrt those TinderboxPrint lines, I wonder how we can determine if they give any value to folks. It's possible nobody cares about them. I remember some conversations we had with Ed way back when (as you mentioned). We could add a little "deprecated" badge in the UI next to the print lines and see if someone files a bug asking to keep them. Or remove them and see if anybody screams. :D

Can you hide them behind a + button and keep metrics on whenever the + button is expanded to show the data? I would say 45 days of collecting data would be enough to make an informed decision.

(In reply to Cameron Dawson [:camd] from comment #13)

Yeah, I agree with you here, Sarah. wrt those TinderboxPrint lines, I wonder how we can determine if they give any value to folks. It's possible nobody cares about them. I remember some conversations we had with Ed way back when (as you mentioned). We could add a little "deprecated" badge in the UI next to the print lines and see if someone files a bug asking to keep them. Or remove them and see if anybody screams. :D

I was planning to send an email out on Monday to dev-platform mailing list about the plan to stop ingesting uploaded artifacts at (roughly) end of month. So, I could also mention that we are thinking of not processing those TinderboxPrint lines at a later date and solicit feedback on it. They'd still be in the logs for people to look at, but we wouldn't be parsing it to add to the JobDetail table. Maybe we'd get some feedback that way.

I know that Tom.Prince finds value in the "Built by... " urls parsed from TinderboxPrint and he suggested he could create a structured artifact - see bug 1625033. (I'm actually realizing I'm not clear on what that means, but could it then be retrieved from the taskcluster public artifacts API instead?)

Can you hide them behind a + button and keep metrics on whenever the + button is expanded to show the data? I would say 45 days of collecting data would be enough to make an informed decision.

I don't think how they're displayed is the issue, just a question of if we're ingesting them and no one really looks at that data in the job details or log viewer panels.

(In reply to Sarah Clements [:sclements] from comment #15)

Can you hide them behind a + button and keep metrics on whenever the + button is expanded to show the data? I would say 45 days of collecting data would be enough to make an informed decision.

I don't think how they're displayed is the issue, just a question of if we're ingesting them and no one really looks at that data in the job details or log viewer panels.

My point wasn't to change our display, my point was to track when people intentionally view the data. If the data is displayed by default we have no way to track the usage. If the data was behind an API that users had to click to access, then we could easily track usage of the API.

Ah, I see. That sounds like a good idea and would be a trivial change to make to the UI. But I think we only have logs going back 3 days in Papertrail (anything older is archived) so I'd probably have to create a script to process the logs if we wanted to track usage over several weeks. And then determine a threshold for how many queries per day determines whether it's useful enough to keep around.
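
Such a script could be fairly small. A minimal sketch, assuming each archived log line starts with an ISO 8601 timestamp and contains the request path; that line format is a guess and would need adjusting to the real Papertrail archive layout. The threshold is a placeholder, not a value anyone has agreed on.

```python
import re
from collections import Counter

# Assumed line shape: "<YYYY-MM-DD>T<time> ... /api/jobdetail ..."
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})T\S+ .*?/api/jobdetail", re.IGNORECASE)


def requests_per_day(lines):
    """Count jobdetail requests per calendar day."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def days_below_threshold(counts, threshold):
    """Days whose request count falls under the chosen threshold."""
    return sorted(day for day, total in counts.items() if total < threshold)
```

Running requests_per_day over several weeks of archives and then checking days_below_threshold would give a first indication of whether the endpoint is still used enough to keep around.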

That starts to get complicated. Another approach is to determine which jobs have "tinderbox print" statements and ask the owners. Depending on which jobs have the info and what is in it, it could be a small set of people to confirm with.

On second thought, New Relic might be able to provide some insight since it tracks throughput/requests per minute for a specific API over a period of time. Right now though, mozscreenshots is the primary external consumer so once it stops using that endpoint, it'll be easier to measure.

That starts to get complicated. Another approach is to determine which jobs have "tinderbox print" statements and ask the owners. Depending on which jobs have the info and what is in it, it could be a small set of people to confirm with.

Yes, and I have yet to look into that. I can, however, see on bug 1342296 that Ed had filed two bugs a while back about the TinderboxPrint lines. So perhaps it's time to follow up with the people involved to see if they are still using them.

See Also: → 1533002
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED