Improve user experience of triggering performance tests via try perf selector
Categories
(Testing :: Performance, task, P2)
Tracking
(firefox108 fixed)
People
(Reporter: sparky, Assigned: sparky)
References
(Blocks 2 open bugs)
Details
(Whiteboard: [fxp])
Attachments
(9 files, 5 obsolete files)
Identifying suitable performance tests and running them in our CI is currently confusing, which means this capability is underutilised. We see frustration from engineers attempting to run tests in CI, and there are likely others who gave up early or were put off by past experience. This is harmful to promoting a culture of performance at Mozilla.
One suggestion is that instead of listing all tasks in try chooser, we could have logical collections (page load, benchmark, essential, etc.).
I recently used try to get page load perf results for essentially the first time – since my perspective is fairly fresh, feel free to ask me for feedback on the design. From my history as a product developer, I think the experience should be simple enough that I come with a question like, "How did my change impact cold start up on fenix?" and I'll be able to get the results without asking for help.
At a high level, there are two major roadblocks I hit:
- figuring out which tests to select is hard. "logical collections (page load, benchmark, essential, etc.)" seems like it'd get most of the way there, at least for page load
- I don't know how to interpret the results, e.g. what's the difference between `SpeedIndex`, `ContentfulSpeedIndex`, and `PerceptualSpeedIndex`? Do I need to retrigger to get confident results? etc. I provided this feedback to the perfherder compare google survey so I won't go into details here.
Comment 2•3 years ago
(In reply to Michael Comella (:mcomella) [needinfo or I won't see it] from comment #1)
- I don't know how to interpret the results, e.g. what's the difference between `SpeedIndex`, `ContentfulSpeedIndex`, and `PerceptualSpeedIndex`? Do I need to retrigger to get confident results? etc. I provided this feedback to the perfherder compare google survey so I won't go into details here.
Thanks for your input :mcomella!
Regarding your second point, I think we can target a good portion of that with some new perftest guidelines we are thinking of making. There are no bugs or documentation available about this yet.
Comment 3•2 years ago
This patch adds a new try selector specifically for performance tests. We make use of the compare selector, as well as the fuzzy and again selectors (used within the compare one), to do this.
The code starts off by building the variant combinations (e.g. fission+live-sites+profiling), then uses these to build up the rest of the categories. The variants are hidden by default and only available with something like `--variants profiling`. The same is true for apps, where chrome tests are hidden by default, and for platforms, where android is hidden by default. There are flags developers can use to enable these if needed. All of this is done to prevent excessive try runs, since the categories make it much easier to select large numbers of tasks.
The fuzzy selector was also modified to handle displaying the categories, and then parsing those categories for the tasks requested.
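A rough sketch of that variant-combination step (the variant names and the helper are hypothetical illustrations, not the actual selector code):

```python
from itertools import combinations

# Hypothetical variant list; the real selector reads these from its config.
VARIANTS = ["fission", "live-sites", "profiling"]

def build_variant_combinations(variants):
    """Return every non-empty combination, joined like 'fission+live-sites'."""
    combos = []
    for size in range(1, len(variants) + 1):
        for combo in combinations(variants, size):
            combos.append("+".join(combo))
    return combos

# 3 variants yield 2^3 - 1 = 7 non-empty combinations.
print(build_variant_combinations(VARIANTS))
```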
Comment 4•2 years ago
This patch adds a `capture_log` flag that will allow a user to capture the logs produced when running the `push_to_try` methods. Currently, we use `subprocess.check_call`, which causes hg to hang when we attempt to redirect and gather stdout. Using `subprocess.run` is better as it lets us capture the logs, but the log output is very slow and can easily lead people to believe that hg is hanging when it's not, which results in corrupted repos when the push is interrupted. Using `Popen`, the logs are output slower than with `check_call`, but faster than with `run`, so you know something is happening. The speed at which these logs are printed is also why I have this log capturing behind a flag.
This functionality will be used in the `./mach try perf` selector to capture the child revisions produced for the try task config file changes, and allow us to produce a PerfCompare link to provide to the user.
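A minimal sketch of the `Popen` approach described above (a generic helper under assumed requirements, not the actual mach code):

```python
import subprocess

def stream_and_capture(cmd):
    """Run cmd, echoing each output line as it arrives while keeping a copy.

    Streaming line by line gives the user immediate feedback (so hg doesn't
    look hung), while the full log stays available for later parsing.
    """
    captured = []
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave stderr so nothing is lost
        text=True,
    )
    for line in proc.stdout:
        print(line, end="")  # show progress as it happens
        captured.append(line)
    proc.wait()
    return proc.returncode, "".join(captured)
```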
Comment 5•2 years ago
Depends on D155980
Comment 6•2 years ago
This patch modifies the Compare selector to gather the logs when `perf_categories` are supplied. This argument will hold the categories to display in the Fuzzy selector, and is currently the only thing that dictates whether logs should be collected. It also simplifies the process of using `--show-all` in the perf selector, since we still want to capture logs but we don't want to display any categories. A log processor was added to capture the logs and parse the revisions. Furthermore, the run method was moved into the CompareParser to make its usage clearer in the perf selector.
Depends on D155981
Comment 7•2 years ago
This patch changes the fuzzy selector to show perf-specific categories when they are available (provided from the perf selector). When these are provided, only they will be shown, and the actual tasks will be hidden. After the user selects some categories, we run a set of queries specified by that category to select the tasks to run.
Depends on D154515
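The category-to-query flow might look something like this sketch. The category names, task labels, and the substring matcher are all illustrative; the real selector hands the queries to fzf for matching:

```python
# Hypothetical mapping from a displayed category to the queries it expands to.
CATEGORIES = {
    "Pageload desktop": ["'browsertime 'tp6"],
    "Benchmarks desktop": ["'browsertime 'benchmark"],
}

TASKS = [
    "test-linux1804-64/opt-browsertime-tp6-firefox-amazon",
    "test-windows10-64/opt-browsertime-tp6-firefox-amazon",
    "test-linux1804-64/opt-browsertime-benchmark-firefox-speedometer",
]

def run_query(query, tasks):
    """Toy stand-in for fzf filtering: every term must appear as a substring."""
    terms = [term.lstrip("'") for term in query.split()]
    return {task for task in tasks if all(term in task for term in terms)}

def select(category):
    """Union the results of every query behind the chosen category."""
    selected = set()
    for query in CATEGORIES[category]:
        selected |= run_query(query, TASKS)
    return selected
```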
Comment 8•2 years ago
:ahal, regarding this patch (https://bugzilla.mozilla.org/attachment.cgi?id=9292284), the suggestion of using `iter` to get the logs didn't work either. I've been trying a few other things and combinations, but I can't get it to work any better/faster than what I already have here.
I spent some time looking at the mercurial source code to see what is going on there when the `ui.*` methods are called for logging, and I noticed that the logs become "protected". I think this is why I'm having a hard time capturing them: https://www.mercurial-scm.org/repo/hg-stable/file/tip/mercurial/utils/procutil.py#l426
The code suggests that I can get this output from stderr, but there's nothing being output there (or nothing that I can catch).
What do you think about this?
Comment 9•2 years ago
Sorry, I have to admit.. this is going a bit over my head. I guess your original patch is good enough then :)
Comment 10•2 years ago
No problem, thanks! I'll reduce the duplication as you requested.
Comment 11•2 years ago
This patch moves the fzf utility methods to a separate file so that they can be used by other try choosers, such as the perf chooser. At the same time, helper methods for two aspects of the `mach try fuzzy` run method (setting up the tasks, and building the base command) are added, but the code in the run method itself is left untouched.
Comment 12•2 years ago
This patch moves the fzf utility methods to a separate file so that they can be used by other try choosers, such as the perf chooser. At the same time, helper methods for two aspects of the `mach try fuzzy` run method (setting up the tasks, and building the base command) are added, but the code in the run method itself is left untouched.
Comment 13•2 years ago
Comment 14•2 years ago
bugherder
Comment 15•2 years ago
This patch adds the basics for the perf selector: file, categorizations, selector configuration, and CLI arguments.
Depends on D155980
Comment 16•2 years ago
This patch adds a method for expanding the categories. This method takes all the configuration details from the PerfParser (variants, apps, platforms, etc.) and combines them to create the categories that people will be seeing and selecting from. It creates the queries, and checks for any restrictions among them to provide selections that work. See the method for more details on all of this.
Depends on D160413
Comment 17•2 years ago
This patch adds two methods for getting/selecting tasks. The first one (get_tasks) is used to either query fzf directly, or to use interactively for user selections. The second (get_perf_tasks) makes use of the first to query for categories (through the user) and then query for the selected tasks. It also performs all the necessary intersections/unions for the queries.
These two are separated because we need a simple way to allow the user to look at and select the raw tasks instead of categories if needed. This way the user can still get a comparison link at the end instead of going to `mach try fuzzy`. See the next patches.
Depends on D160414
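One plausible shape for the intersection/union logic (hypothetical; the real get_perf_tasks drives fzf rather than this toy matcher): queries within one selected category are ANDed together, and the results across categories are ORed.

```python
def substring_matcher(query, tasks):
    """Toy matcher: a task matches if the query is a substring of its label."""
    return {task for task in tasks if query in task}

def combine_queries(per_category_queries, all_tasks, matcher=substring_matcher):
    """Intersect queries within a category, then union across categories."""
    selected = set()
    for queries in per_category_queries:  # one entry per chosen category
        result = set(all_tasks)
        for query in queries:             # AND within a category
            result &= matcher(query, all_tasks)
        selected |= result                # OR across categories
    return selected
```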
Comment 18•2 years ago
Depends on D160415
Comment 19•2 years ago
This patch adds a method that allows us to push to try. It performs two pushes similar to the compare selector, except in this case we're also capturing the logs in all cases so that we can provide a Perfherder URL afterwards. At the same time, this patch adds a LogProcessor that can capture these logs and parse them for the revisions.
Depends on D160416
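The revision parsing could look roughly like this; the log line format shown is an assumption based on typical try push output, and the class is a sketch, not the landed LogProcessor:

```python
import re

# Assumed format: the push log contains a Treeherder link such as
# "remote: ... https://treeherder.mozilla.org/jobs?repo=try&revision=<hash>"
REVISION_RE = re.compile(r"revision=([0-9a-f]{12,40})")

class LogProcessor:
    """Accumulates log lines and extracts any try revisions seen so far."""

    def __init__(self):
        self.lines = []

    def ingest(self, line):
        self.lines.append(line)

    @property
    def revisions(self):
        found = []
        for line in self.lines:
            match = REVISION_RE.search(line)
            if match:
                found.append(match.group(1))
        return found
```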
Comment 20•2 years ago
This patch adds the run methods for the perf selector as well as the entry point for the mach command. It also produces the Perfherder URL at the end.
Depends on D160417
Comment 21•2 years ago
This patch adds two parametrized tests (totalling 18 test cases) for the perf selector. The first set of 15 tests covers the category expansions under different configurations. The last 3 tests cover the run methods in the three configurations that we offer: standard, dry runs, and uncategorized test selection (or `--show-all`).
Depends on D160418
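For illustration, a parametrized test of that shape might look like the sketch below; `expand_categories` is a hypothetical stub standing in for the real category expansion, and the names differ from the landed tests:

```python
import pytest

def expand_categories(variants=()):
    """Hypothetical stub: the real code expands categories from PerfParser config."""
    categories = ["Pageload desktop", "Benchmarks desktop"]
    if variants:
        # Add a variant-qualified row for each base category.
        categories += [f"{c} {'+'.join(variants)}" for c in categories]
    return categories

@pytest.mark.parametrize(
    "variants,expected_substring",
    [
        ((), "Pageload"),               # default configuration
        (("profiling",), "profiling"),  # e.g. --variants profiling
    ],
)
def test_category_expansion(variants, expected_substring):
    assert any(expected_substring in c for c in expand_categories(variants))
```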
Comment 22•2 years ago
Comment 23•2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/df816f6bad3f
https://hg.mozilla.org/mozilla-central/rev/ff5e60523368
https://hg.mozilla.org/mozilla-central/rev/6e94007a5fe8
https://hg.mozilla.org/mozilla-central/rev/3c18a3f66f89
https://hg.mozilla.org/mozilla-central/rev/d57d61c31656
https://hg.mozilla.org/mozilla-central/rev/cac9a015e794
https://hg.mozilla.org/mozilla-central/rev/461ee4c95c53
https://hg.mozilla.org/mozilla-central/rev/0a6ce361ed41