[Tracking] Create Datazilla Alerting Mechanism for performance regression and data ingestion disruption

RESOLVED FIXED

Status

Testing
Talos
P1
normal
RESOLVED FIXED
5 years ago
4 years ago

People

(Reporter: cmtalbert, Assigned: jmaher)

Tracking

({perf})

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [c=automation p= s=2014.05.09.t u=])

Attachments

(1 attachment)

(Reporter)

Description

5 years ago
Created attachment 810786 [details]
Summary data structures

This might be a dupe but I didn't see it anywhere.  

We need a tracking bug for the datazilla alerting work.

Here is the current plan of record:
= Sept 12 =
* short term (to be delivered within September)
** [jeads] get emails going out
** [kyle] object count table
** [jeads] test_run_id table
** [jmaher] - look at existing gmo alerting system and see if it could use datazilla data
*** http://hg.mozilla.org/graphs/file/tip/server/analysis/analyze_talos.py
*** runs on graphs.m.o and queries the database directly; not a blocker, but not a good sign
*** I will experiment with writing a datazilla extractor and reusing most code
** summary page skeleton [stretch]
* [jmaher] - one solution is to hack the regression alert system to pull data from datazilla instead of graphs.m.o; not ideal at all, but it could move us closer to datazilla. (How much work would it take to have it mark up tdad (test_data_all_dimensions)?)
** we wouldn't have all dimensions; this is very limited, but we could hit parity with what we have now. Again, not ideal.
* alerting system - architecture - current system does 40-day blocks
** we get just under 1 object/second
** do we search for alerts on ingestion, or out of band
** table/queue of test_ids to process
* revision-level alerts - delayed so it has more info
** initial alert if we find a regression
** wait X minutes/hours until we get most of the data and can send a *final* summary
* summary interface (linked to from the emails)
** one graph that shows the push chain
** top level view of tests and notifications (jolly green giant type of grid)
** view previous alerts to determine if this is a new alert or possibly existing
** how could we flag new data in the UI?
** how to determine the number of tests (actually objects) we run
*** median # of objects/push/branch
*** timerange could be 7 days, make this configurable
** how to mark user interface for known bad bugs
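
The object-count idea in the plan above (median number of objects per push per branch, over a configurable time range) could be sketched roughly as follows. This is only an illustration: the record shape `(branch, push_id, received_at)`, the function names, and the `tolerance` threshold are assumptions made for the example, not part of datazilla or the plan of record.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import median


def median_objects_per_push(records, now, window_days=7):
    """Return {branch: median object count per push} over the trailing window.

    `records` is an iterable of (branch, push_id, received_at) tuples, one per
    ingested object. This shape is hypothetical, not datazilla's schema.
    """
    cutoff = now - timedelta(days=window_days)
    counts = defaultdict(lambda: defaultdict(int))
    for branch, push_id, received_at in records:
        # Only objects received inside the configurable time range count.
        if received_at >= cutoff:
            counts[branch][push_id] += 1
    return {
        branch: median(per_push.values())
        for branch, per_push in counts.items()
    }


def is_ingestion_disrupted(observed, expected_median, tolerance=0.5):
    """Flag a push whose object count is well below the branch median.

    `tolerance` is an illustrative threshold, not a value from the plan.
    """
    return observed < expected_median * tolerance


if __name__ == "__main__":
    now = datetime(2013, 9, 30)
    sample = (
        [("mozilla-central", "push-a", now - timedelta(days=1))] * 4
        + [("mozilla-central", "push-b", now - timedelta(days=2))] * 6
    )
    baseline = median_objects_per_push(sample, now)
    print(baseline, is_ingestion_disrupted(2, baseline["mozilla-central"]))
```

Using the median rather than the mean keeps one unusually large or small push from skewing the baseline, which is presumably why the plan names it.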

Further information can be found on the Signal From Noise etherpad: https://etherpad.mozilla.org/SignalFromNoise

A trial version of the summary data for the alarm notification is attached (also copied from the etherpad).
(Assignee)

Updated

5 years ago
Component: General → Talos

Updated

5 years ago
Depends on: 949190

Updated

5 years ago
Keywords: perf
Whiteboard: [c=automation p= s= u=]
Priority: -- → P2

Comment 1

4 years ago
Hi Joel,

Assigning this to you since you're managing this effort. The fxos-perf-alerts@mozilla.com mailing list has been created so you can update this tool's configuration to make use of it. Ben Kelly has said that most of what this bug describes has been implemented. If that's true, please resolve this.

Thanks,
Mike
Flags: needinfo?(jmaher)
Assigning to :jmaher so we can track this in our sprint. Joel, if you think somebody else is better suited to be the assignee, feel free to reassign.
Assignee: nobody → jmaher
(Assignee)

Comment 3

4 years ago
We have this working for ingestion alerts and for regressions, so the general flow is good.

I assume we can mark this as done?
Flags: needinfo?(jmaher)

Comment 4

4 years ago
Joel,

Has everything been satisfied for this bug's dependency, bug 949190? It looks like that's still out for sec-review.

Thanks,
Mike
Status: NEW → ASSIGNED
Flags: needinfo?(jmaher)
(Assignee)

Comment 5

4 years ago
It depends on how far we want to take this. Right now we have alerts generated via automation for all the tests going to datazilla. I agree this should be on a more static server, but I would like to close this bug when we get it deployed there. Tweaks to the detection algorithm and adjustments to tests or alert text don't fall under the scope of creating these alerts.
Flags: needinfo?(jmaher)

Updated

4 years ago
Status: ASSIGNED → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Whiteboard: [c=automation p= s= u=] → [c=automation p= s=2014.05.09.t u=]