Closed Bug 1602893 Opened 4 years ago Closed 3 years ago

Retriggered browsertime vismet tasks only report results from the first task

Categories

(Testing :: Raptor, defect, P2)

Version 3
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: sparky, Unassigned)

References

(Blocks 2 open bugs)

Details

(Whiteboard: [perf:workflow])

We currently have visual metrics running on browsertime, but when we retrigger those browsertime tests, no new vismet tasks are created for them. Here is a sample "add-new" task that retriggers a browsertime test but doesn't add a new vismet task for it: https://firefox-ci-tc.services.mozilla.com/tasks/P6v8YHTOTzO6do85ER3RXg/runs/0/logs/https%3A%2F%2Ffirefox-ci-tc.services.mozilla.com%2Fapi%2Fqueue%2Fv1%2Ftask%2FP6v8YHTOTzO6do85ER3RXg%2Fruns%2F0%2Fartifacts%2Fpublic%2Flogs%2Flive.log

It came from this push: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&tier=1%2C2%2C3&revision=5ba6228ac0ee383b783734c386631856a339f0f2&searchStr=add-new&selectedJob=280540364

:ahal, I've added you as a CC to this bug in case you have any thoughts/ideas about this issue. For context, the vismet tasks are currently created dynamically with this transform: https://dxr.mozilla.org/mozilla-central/source/taskcluster/taskgraph/transforms/visual_metrics_dep.py#21

A run-visual-metrics attribute dictates which tasks should have one created for them: https://dxr.mozilla.org/mozilla-central/source/taskcluster/ci/test/raptor.yml#1023
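To make the mechanism concrete, here is a simplified sketch of an attribute-driven transform. It is not the code in visual_metrics_dep.py: the task dictionaries, the `dependencies` layout, and the `add_vismet_tasks` name are assumptions for illustration. Only the `run-visual-metrics` attribute and the `-vismet` suffix come from this bug.

```python
def add_vismet_tasks(tasks):
    """Yield each task, followed by a vismet task when the task asks for one.

    Hypothetical stand-in for the real transform: one vismet task is
    generated per test task carrying the run-visual-metrics attribute.
    """
    for task in tasks:
        yield task
        if task.get("attributes", {}).get("run-visual-metrics"):
            yield {
                "label": task["label"] + "-vismet",
                # Assumed shape: the vismet task depends on the one test
                # task it was generated for.
                "dependencies": {"browsertime": task["label"]},
                "attributes": {},
            }


if __name__ == "__main__":
    graph = [
        {"label": "browsertime-a", "attributes": {"run-visual-metrics": True}},
        {"label": "browsertime-b", "attributes": {}},
    ]
    for task in add_vismet_tasks(graph):
        print(task["label"])
```

In this sketch the vismet task exists only because the transform ran over the test task when the graph was generated.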

This problem is worse than I originally thought. I can schedule multiple vismet tasks, but each of them only processes the results from the first btime task that was created for them. (We can't retrigger, and we can't schedule multiple runs.) So to analyze the vismet data, I would have to make one push per trial, which is a bit much. Here's a task where I tried it with 50 retriggers: https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=20528cc4f915039c970db9adacbb784491e2c01a

We have a partial solution here (thanks to :aki for the help):

  1. Select a browsertime test task (not a vismet task) that you want to retrigger.
  2. In the pop-up menu, select the ... and click on Custom Action.
  3. Pick the retrigger action if it wasn't already selected.
  4. Set downstream to true.
  5. Enter the number of times it should be retriggered in times.
  6. Retrigger!
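The inputs from the steps above can be written out as a small payload. This is only a sketch: the `downstream` and `times` field names come from the steps, while the `build_retrigger_input` helper and its validation are assumptions for illustration and not part of any Taskcluster or Treeherder API.

```python
import json


def build_retrigger_input(times, downstream=True):
    """Build the input for the retrigger custom action (hypothetical helper).

    Only the `downstream` and `times` fields are taken from the steps above.
    """
    if isinstance(times, bool) or not isinstance(times, int) or times < 1:
        raise ValueError("times must be a positive integer")
    return {"downstream": bool(downstream), "times": times}


if __name__ == "__main__":
    # Input for 5 retriggers that also retrigger the downstream vismet task.
    print(json.dumps(build_retrigger_input(5), indent=2))
```

Setting `downstream` to true is what step 4 asks for; leaving it false would presumably retrigger only the selected browsertime task.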

This is the only method we have to retrigger these tasks. The --rebuild option doesn't work, but it would be cool if we could eventually get it to work with these tasks. Also note that the drop-down arrow on the push that provides the Custom Push Action option doesn't work for this either.

I've added that information to the wiki as well: https://wiki.mozilla.org/TestEngineering/Performance/Raptor/Browsertime

I don't know if this workflow works anymore because you can't do retriggers on jobs that succeed.

Last I checked (last week), it still works for the browsertime tests. Are you referring to raptor-browsertime or mozperftest?

Using ./mach try fuzzy --retry N and getting stale data is a big footgun still.

Whiteboard: [perf:workflow]
Depends on: 1677559

Can we retitle this bug so it shows up in searches? Maybe "retriggered browsertime tasks only report results from the first task" or "retriggered browsertime tasks have invalid data in perfherder" or something? (I didn't know the significance of "visual metrics", so did not realize this bug was describing my problem.)

I forget about this, and run into it all the time. And I'm still not sure what scenarios run into this problem, so I'm never sure whether to trust e.g. regression bugs when they're complaining about massive numbers of browsertime changes; Perfherder will report the number of tasks run, so everything appears to be fine.

Blocks: mach-busted
Flags: needinfo?(gmierz2)

Sure thing, I've updated the title.

The only time you would hit this issue is when you retrigger the *-vismet tasks in Treeherder - the correct way of doing this is described here: https://wiki.mozilla.org/TestEngineering/Performance/Raptor/Browsertime/VisualMetrics

The perf sheriffs should know how to retrigger the vismet tasks correctly, so you shouldn't worry about this happening in regression/improvement bugs. When you look at the Perfherder Compare view, if you see a metric with multiple runs that all report a single value, then you know that it wasn't retriggered correctly.
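That check can be expressed as a small helper. This is a minimal sketch, assuming you already have the per-run values for one metric as a list; the `looks_stale` name and the sample numbers are made up for illustration and nothing here calls Perfherder.

```python
def looks_stale(replicates):
    """Return True when several runs all report exactly the same value.

    Identical values across more than one run suggest that every vismet
    task processed the results of the same (first) browsertime task.
    """
    return len(replicates) > 1 and len(set(replicates)) == 1


if __name__ == "__main__":
    # Illustrative values only, not taken from a real push.
    print(looks_stale([512.0] * 7))            # same value from 7 runs
    print(looks_stale([498.0, 512.0, 505.0]))  # values differ between runs
```

A single run is never flagged, since one value on its own says nothing about how the task was retriggered.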

We're looking into getting visual-metrics running in the test tasks themselves to get around this issue, but we need to get FFMPEG and ImageMagick installed on the machines first.

Flags: needinfo?(gmierz2)
Summary: Can't run visual metrics on retriggered browsertime tasks → Retriggered browsertime vismet tasks only report results from the first task

(In reply to Greg Mierzwinski [:sparky] from comment #9)

The only time you would hit this issue is when you retrigger the *-vismet tasks in Treeherder - the correct way of doing this is described here: https://wiki.mozilla.org/TestEngineering/Performance/Raptor/Browsertime/VisualMetrics

Right, but that's the common case if you're using mach try fuzzy --rebuild N.

The perf sheriffs should know how to retrigger the vismet tasks correctly, so you shouldn't worry about this happening in regression/improvement bugs. When you look at the Perfherder Compare view, if you see a metric with multiple runs that all report a single value, then you know that it wasn't retriggered correctly.

Right, so in my example I have to know what " ± 0" means, or notice that confidence is either blank (?) or "Infinity", despite "Total Runs" being > 1.
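A short worked example shows why those symptoms go together. It is a sketch under a stated assumption: that the confidence column behaves like a t statistic (difference of means divided by a standard error). The exact formula Perfherder uses is not given in this bug, and the `t_statistic` helper below is hypothetical.

```python
import math
from statistics import mean, variance


def t_statistic(base, new):
    """Welch-style t statistic for two sets of runs (illustrative only).

    Returns None when the value is undefined (0/0) and math.inf when the
    means differ but neither side has any spread.
    """
    if len(base) < 2 or len(new) < 2:
        return None
    diff = abs(mean(new) - mean(base))
    stderr = math.sqrt(variance(base) / len(base) + variance(new) / len(new))
    if stderr == 0:
        return None if diff == 0 else math.inf
    return diff / stderr


if __name__ == "__main__":
    # Seven "runs" that all repeat one value: the spread is 0 on both sides.
    print(t_statistic([500.0] * 7, [520.0] * 7))  # means differ, no spread
    print(t_statistic([500.0] * 7, [500.0] * 7))  # means equal, no spread
```

With repeated stale values the standard deviation is 0, which matches " ± 0"; the statistic is then either infinite or undefined, which would match "Infinity" or a blank cell.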

And as a result of looking at this, I'm noticing that my --rebuild 7 pushes are reporting some useful data -- specifically, the non-vismet metrics. Which apparently I can pattern-match because the vismet ones have spelled-out names like FirstVisualChange and the non-vismet ones don't. Or, if looking at the results from individual tasks, the non-vismet ones have abbreviated names like "fcp". Ah, I see now that those show up in the comparison view as "subtests". It would be nice if the vismet ones were grouped the same way, or if neither was grouped. (Or at least, with my limited understanding I think it would be nice. I'm not really understanding the overall picture, so I could be way off base.)

We're looking into getting visual-metrics running in the test tasks themselves to get around this issue, but we need to get FFMPEG and ImageMagick installed on the machines first.

Right, I saw that in the dependent bug, though it seems a little unfortunate to need that. It seems like the taskgraph should have the necessary smarts added for this. Still, whatever works. Expediency is good.

(In reply to Steve Fink [:sfink] [:s:] from comment #10)

(In reply to Greg Mierzwinski [:sparky] from comment #9)

The only time you would hit this issue is when you retrigger the *-vismet tasks in Treeherder - the correct way of doing this is described here: https://wiki.mozilla.org/TestEngineering/Performance/Raptor/Browsertime/VisualMetrics

Right, but that's the common case if you're using mach try fuzzy --rebuild N.

The perf sheriffs should know how to retrigger the vismet tasks correctly, so you shouldn't worry about this happening in regression/improvement bugs. When you look at the Perfherder Compare view, if you see a metric with multiple runs that all report a single value, then you know that it wasn't retriggered correctly.

Right, so in my example I have to know what " ± 0" means, or notice that confidence is either blank (?) or "Infinity", despite "Total Runs" being > 1.

And as a result of looking at this, I'm noticing that my --rebuild 7 pushes are reporting some useful data -- specifically, the non-vismet metrics. Which apparently I can pattern-match because the vismet ones have spelled-out names like FirstVisualChange and the non-vismet ones don't. Or, if looking at the results from individual tasks, the non-vismet ones have abbreviated names like "fcp". Ah, I see now that those show up in the comparison view as "subtests". It would be nice if the vismet ones were grouped the same way, or if neither was grouped. (Or at least, with my limited understanding I think it would be nice. I'm not really understanding the overall picture, so I could be way off base.)

Oh right, sorry, --rebuild doesn't work either. Essentially, no standard method works except the one described in the wiki. I think what you're describing regarding the subtests vs not-subtests is something we were looking into. :davehunt, can you provide more information?

Right, I saw that in the dependent bug, though it seems a little unfortunate to need that. It seems like the taskgraph should have the necessary smarts added for this. Still, whatever works. Expediency is good.

I fully agree. I'm getting quite annoyed by this taskcluster limitation, so I'm going to try to fix this another way - posting a patch soon.

Flags: needinfo?(dave.hunt)
Depends on: 1686118

I'll post the patch in bug 1686118 since it solves the majority of the issue. It won't fix the issues with --rebuild, though, since that would involve modifying taskcluster code and I'm not sure what the best way to change this would be: https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/create.py#89-92

(In reply to Greg Mierzwinski [:sparky] from comment #11)

I think what you're describing regarding the subtests vs not-subtests is something we were looking into, :davehunt can you provide more information?

Indeed. In bug 1672794 we will stop summarising results as a geometric mean, which means they will report directly in the compare view and not under the "subtests" view.

Flags: needinfo?(dave.hunt)

I think this bug, as titled, "Retriggered browsertime vismet tasks only report results from the first task", can now be closed. (?)

Well, if the retriggering is done via Taskcluster UI, then this can be closed because I was doing that today and it worked fine.

But it would be really handy to have the --rebuild option, because it can be extremely time consuming to add the jobs if you want to test, for instance, 10 sites on 3 platforms.

And maybe Bug 1684946 can be updated to explicitly be about the --rebuild case?

(In reply to Andrew Creskey [:acreskey] [he/him] from comment #15)

I think this bug, as titled, "Retriggered browsertime vismet tasks only report results from the first task", can now be closed. (?)

Well, if the retriggering is done via Taskcluster UI, then this can be closed because I was doing that today and it worked fine.

True. I guess I have a bad habit of using "retrigger" to mean passing the --rebuild flag.

But it would be really handy to have the --rebuild option, because it can be extremely time consuming to add the jobs if you want to test, for instance, 10 sites on 3 platforms.

And maybe Bug 1684946 can be updated to explicitly be about the --rebuild case?

Yeah, I don't know what the best way to arrange the bugs is. Retriggering was fixed in bug 1686118. We could either use this bug or bug 1684946 for --rebuild. This bug has the best description of what happens when things go wrong.

I guess I can just add the additional info to bug 1684946 and use it.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED