Closed Bug 1389395 Opened 7 years ago Closed 7 years ago

Request: Regression alert for Hasal data

Categories

(Tree Management :: Perfherder, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: bobby.chien+bugzilla, Unassigned)

Details

Attachments

(1 file)

The idea is to set up regression alerting under the following conditions:
- The Hasal test results have 6 data points per test per day.
- The algorithm (based on Joel's suggestion): a Student's t-test which takes a minimum of 12 historical and 12 future data points to generate an alert (see the sketch below).

Target:
- Hasal posts data based on testing mozilla-central code.
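A minimal sketch of that windowed t-test, assuming a plain two-sample Student's t-test over the back/fore windows (the threshold value here is illustrative, not Perfherder's actual setting):

from scipy import stats

def detect_changes(series, back_window=12, fore_window=12, threshold=7.0):
    # For each candidate point, compare the 12 points before it against
    # the 12 points after it and flag a change when the t statistic is
    # large. threshold=7.0 is an assumed value for illustration only.
    alerts = []
    for i in range(back_window, len(series) - fore_window + 1):
        back = series[i - back_window:i]
        fore = series[i:i + fore_window]
        t, _ = stats.ttest_ind(back, fore)
        if abs(t) >= threshold:
            alerts.append(i)
    return alerts
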
When alerts are generated, will you be looking at them yourself, or will you expect the performance sheriffs to look at them?

If a perf sheriff looks at the alerts, then I want to make sure we have a bug template on file and a wiki page documenting the tests and owners. We will also need a way to retrigger the tests and push to try, and likewise the ability to backfill on autoland and try. My guess is that Hasal is not designed to run like this, which leads me to believe these will be self-managed alerts.

Given our existing algorithm, I assume we could have results out for you in 2 days' time. I am not sure how simple it is to modify the alert algorithm for a single framework/test, but we could look into that. I did look at some of the data and it is very noisy, so I suspect alerts will not show up often.


Please do confirm.
Yes, the Hasal team will be monitoring the numbers until the end of November (Firefox 57 through release). The team will gain the experience first and potentially transfer the work to the performance sheriffs in the future, but we don't have a plan for that yet. Please send the alerts to hasal-dev@mozilla.com. Thanks.

The team will take responsibility for providing the info for the bug template and the wiki. Please stay tuned.
Currently alerts are posted to the alert dashboard:
https://treeherder.mozilla.org/perf.html#/alerts?status=0&framework=9&page=1

We do not have the ability to email alerts. If your team is going to look at the alerts, then not much is needed. Are you only posting data to mozilla-central? Right now we generate alerts only for mozilla-inbound, autoland, and mozilla-beta. We understand that this will need to change to support Hasal, but I want to make sure you think about what happens after September 20th, when Firefox 57 is on mozilla-beta. Will you have data there? Will you continue to have data on mozilla-central? We don't run any perf tests on mozilla-release or esr, only on branches that have active development.
Whiteboard: [PI:August]
Joel, sorry for the confusion regarding Firefox 57. Hasal only performs the tests on mozilla-central; that is why we post numbers to mozilla-central only, as daily testing. When I mentioned Firefox 57, I was talking about the timeframe: the Hasal team will monitor the alerts until the end of November. If the Hasal testing needs to continue after November, we will transfer the knowledge (that's the plan in my mind) to the performance sheriffs.
yes, but 57 will be on beta September 20th, this means that Hasal data on mozilla-central will not be very useful in 5 weeks.
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #5)
> yes, but 57 will be on beta September 20th, this means that Hasal data on
> mozilla-central will not be very useful in 5 weeks.

I see your point. I will work with my team to think about this approach. However, we will continue testing on mozilla-central from a regression perspective.
Hi Bobby,

Joel and I had a look at some of the hasal data, it is very noisy. For example, on this comparison [1], for the tests run on the base build alone, the results range from 221 to 712. The standard deviations are very high, i.e. around 30%. With data this noisy we can't have dependable performance alerts. Also in the same comparison, if you look at the third item on the list, (youtube_all_select..) the base and new data doesn't look valid, i.e. most values are 33.33 or 44.44.
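As a rough way to quantify "too noisy", a sketch that checks the coefficient of variation (stddev/mean) of a set of replicates, using the ~30% figure above as the cutoff (the cutoff itself is a judgment call, not a Perfherder setting):

import statistics

def too_noisy(replicates, max_cv=0.30):
    # Relative standard deviation; around 0.30 in the hasal comparison
    # above, which is too high for dependable t-test alerting.
    cv = statistics.stdev(replicates) / statistics.mean(replicates)
    return cv > max_cv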

Can you point us to any data in perfherder that demonstrates otherwise i.e. less-noisy data, and an example of non-noisy data that demonstrates a 100% valid perf regression?

Doing the work to enable hasal perf alerting on mozilla central won't be valuable if the data is too noisy. We don't want to risk generating a bunch of false perf alert bugs that would take sheriffs and ultimately developers time, possibly looking into items that aren't valid.

I would suggest that we close this bug out, and the first step be an attempt to make the test results themselves more consistent, if possible. If that is a success and the results are less noisy, then at that point please send a request to pi-request to enable alerting for hasal on central and we could move forward at that time. Thanks!

[1] https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&originalRevision=a6a1f5c1d971dbee67ba6eec7ead7902351ddca2&newProject=mozilla-central&newRevision=4f4487cc2d30d988742109868dcf21c4113f12f5&framework=9&showOnlyImportant=0&showOnlyConfident=1
Flags: needinfo?(bchien)
(In reply to Robert Wood [:rwood] from comment #7)
> Hi Bobby,
> 
> Joel and I had a look at some of the hasal data, it is very noisy. For
> example, on this comparison [1], for the tests run on the base build alone,
> the results range from 221 to 712. The standard deviations are very high,
> i.e. around 30%. With data this noisy we can't have dependable performance
> alerts. Also in the same comparison, if you look at the third item on the
> list, (youtube_all_select..) the base and new data doesn't look valid, i.e.
> most values are 33.33 or 44.44.

The data is noisy, but I think the p25 and p75 values are quite stable. Can we have a higher threshold for the alerts for the time being, until we can reduce the standard deviations?

> Can you point us to any data in perfherder that demonstrates otherwise i.e.
> less-noisy data, and an example of non-noisy data that demonstrates a 100%
> valid perf regression?

You can see the case facebook_ail_scroll_home_1_txt: it's clear that the case regressed around Jul 29.

https://treeherder.mozilla.org/perf.html#/graphs?timerange=5184000&series=mozilla-central,1511022,1,9&selected=%5Bmozilla-central,39af63dbc39efdd7b2eb9a0ec623626139f8b993,232826,303559433,9%5D

I haven't found which bug caused the jump, but it looks valid.

> Doing the work to enable hasal perf alerting on mozilla central won't be
> valuable if the data is too noisy. We don't want to risk generating a bunch
> of false perf alert bugs that would take sheriffs and ultimately developers
> time, possibly looking into items that aren't valid.
> 
> I would suggest that we close this bug out, and the first step be an attempt
> to make the test results themselves more consistent, if possible. If that is
> a success and the results are less noisy, then at that point please send a
> request to pi-request to enable alerting for hasal on central and we could
> move forward at that time. Thanks!
> 
> [1]
> https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-
> central&originalRevision=a6a1f5c1d971dbee67ba6eec7ead7902351ddca2&newProject=
> mozilla-
> central&newRevision=4f4487cc2d30d988742109868dcf21c4113f12f5&framework=9&show
> OnlyImportant=0&showOnlyConfident=1
I am not clear what p25 and p75 is?  Can you link to them in the graph server?

It is good to see there is a visible regression- this goes to the other point- what are we going to do with the data and who is going to look at the regression?  Keep in mind that we would get little to no alerts with a higher threshold- I would rather wait until you can get more consistent data.

We are not eager to add alerts which are not providing value for Mozilla (translating to bugs and fixes)- possibly it could be clearer who is going to look at the alerts and document how the data is run (for example, why are there some runs with 20 iterations and others with 40+?)
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #9)
> I am not clear what p25 and p75 is?  Can you link to them in the graph
> server?

I mean that if you only look at the data between the first quartile (p25) and the third quartile (p75), the maximum and minimum are stable. In the case of a regression, the entire distribution moves higher.
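To illustrate the idea (this is an assumed reading of the p25/p75 suggestion, not code from Hasal):

import numpy as np

def iqr_trim(values):
    # Keep only the points between the first (p25) and third (p75)
    # quartiles; the spread of what remains is much more stable.
    arr = np.asarray(values, dtype=float)
    p25, p75 = np.percentile(arr, [25, 75])
    return arr[(arr >= p25) & (arr <= p75)]

# In a real regression the whole trimmed range shifts, e.g. compare
# iqr_trim(base_replicates).mean() against iqr_trim(new_replicates).mean().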

> It is good to see there is a visible regression- this goes to the other
> point- what are we going to do with the data and who is going to look at the
> regression?  Keep in mind that we would get little to no alerts with a
> higher threshold- I would rather wait until you can get more consistent data.

For now we will look at the regressions, file bugs, find owners, etc. A higher threshold might get us fewer alerts, but if we do get an alert we know it's real.
 
> We are not eager to add alerts which are not providing value for Mozilla
> (translating to bugs and fixes)- possibly it could be clearer who is going
> to look at the alerts and document how the data is run (for example, why are
> there some runs with 20 iterations and others with 40+?)

It's extremely valuable to catch regressions because input latency is a release criterion. Without alerts we'd have to check it manually. As said in comment 2, we are still at an early stage of documenting the process, so all of this is a learning process.

We can experiment with the alert algorithm locally, but I think it would be great to have all the data in one place once we have verified it :)
Flags: needinfo?(bchien)
Attached file herder.py
(In reply to Kan-Ru Chen [:kanru] (UTC+8) from comment #10)
...
> We can experiment with the alert algorithm locally but I think it will be
> great to have all the data in one place once we have verified it :)

Attaching a script that :jmaher made based on our current alerting algorithm, that you can use to further analyze hasal data. It does seem to pick up some legitimate alerts, although several also seem to be invalid because of noise/outliers.

If you like, try out the script locally ('python herder.py') and have a look at the alerts listed. You can also tweak the 'back_window', 'fore_window', and 'threshold' parameters in the 'detectChanges' function to see if you can find values that work better with the hasal data. If so, we could possibly use those values as custom alerting parameters for hasal (a hypothetical sweep is sketched below).
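For example, a hypothetical parameter sweep (the detectChanges signature here is assumed from the parameter names above; check herder.py for the real one):

from itertools import product
from herder import detectChanges  # the attached script

data = []  # fill in with hasal data points exported from Perfherder

for back, fore, threshold in product((12, 24), (12, 24), (5, 7, 10)):
    alerts = detectChanges(data, back_window=back, fore_window=fore,
                           threshold=threshold)
    print(back, fore, threshold, len(alerts))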
Whiteboard: [PI:August] → [PI:September]
Whiteboard: [PI:September]
Closing as incomplete for now, since I can't see any work for us here at the moment. I'd recommend opening a bug in a Hasal component to track making the results more stable, after which we can reopen this one. (Or if you'd prefer to move this bug there and reopen instead, that's fine too :-))
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INCOMPLETE