Closed Bug 1503345 Opened 6 years ago Closed 5 years ago

Create monitors for incoming data pings for Snippets

Categories

(Firefox :: Messaging System, enhancement, P2)

RESOLVED FIXED
Iteration:
65.3 - Nov 30

People

(Reporter: giorgos, Assigned: nanj)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

Hi AS Team,

When we switch to ASR, ownership of the data pipeline will be transferred from the MEAO team to you.

The MEAO team has two monitors that check that we receive metrics from snippets (e.g. impressions and clicks).

 - Monitor one is an anomaly detection monitor, which goes off if the trend in the number of received metrics changes rapidly. This monitor notifies us if something in data collection or snippet delivery goes wrong, while avoiding false alarms on weekends and during holidays, when Fx usage decreases.

 - Monitor two is a simple threshold monitor, which triggers when the number of received metrics drops below a low threshold. This monitor is actually WIP right now [0]: the recent AS bug 1503047, which prevented snippet display [1], combined with the gradual roll-out of Fx, caused the anomaly detection monitor to adapt to the dropping metric numbers without going off.
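
To illustrate why both monitors are useful, here is a minimal Python sketch. It is not the Datadog implementation: the function names, the window size, and the 50% ratio are assumptions chosen for the example, and the trend check is a crude stand-in for a real anomaly detection algorithm.

```python
from statistics import mean


def below_threshold(counts, threshold):
    """Threshold monitor: fire when the latest count is under a fixed floor."""
    return counts[-1] < threshold


def rapid_change(counts, window=3, max_ratio=0.5):
    """Rough trend monitor (assumed logic, not Datadog's algorithm).

    Fire when the latest count deviates from the mean of the previous
    `window` counts by more than `max_ratio`.
    """
    baseline = mean(counts[-window - 1:-1])
    if baseline == 0:
        return counts[-1] != 0
    return abs(counts[-1] - baseline) / baseline > max_ratio


# A sudden outage trips the trend check.
sudden_drop = [1000, 1000, 1000, 100]

# A gradual decline (20% per step) never trips the trend check,
# but the fixed floor still catches it.
gradual_drop = [1000, 800, 640, 512, 410]
```

With these made-up numbers, `rapid_change(sudden_drop)` fires, while `rapid_change(gradual_drop)` stays quiet and only `below_threshold(gradual_drop, 500)` fires, which is the gap the second monitor is meant to close.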

Both monitors are implemented in Datadog. Happy to provide more technical info upon request. Thanks!

[0] https://github.com/mozmeao/snippets-service/issues/816
[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1503047
Severity: normal → enhancement
Iteration: --- → 65.2 (Nov 16)
Priority: -- → P2
Nan,

There was some discussion about how we can monitor snippets on the product side to make sure nothing is out of order, and Tim mentioned something about needing to file a bug for jbuck and the ops team to implement this.

Anyway, I was told to needinfo you. I will include Tim so he can provide more context.
Flags: needinfo?(najiang)
Flags: needinfo?(tspurway)
Yep, we can definitely set up something for the Snippets health monitoring on Datadog.

No need to file another bug, let's just use this one to track the work.
Flags: needinfo?(tspurway)
Flags: needinfo?(najiang)
Assignee: nobody → najiang
Depends on: 1502971
Iteration: 65.2 (Nov 16) → 65.3 (Nov 30)
Hi :giorgos,

I am currently collecting the requirements for those monitors and would like to understand more details about them. Could you take a look at the following questions?

* What are the required metrics? I assume that we will need the total number of telemetry records generated from Snippets. Do we also need counters for individual events, such as IMPRESSIONS and CLICKS? If necessary, we can also expose the total number of unique users for each individual metric mentioned above. Let me know if I missed anything here.

* How often should we send those metrics to Datadog? For your reference, most Activity Stream monitors are reported on a daily basis, although we can customize this frequency based on the requirements of Snippets.

* Do we want to set up monitors for different release channels separately, i.e. release, beta, and nightly? Or do we just want to treat them as a whole?
Flags: needinfo?(giorgos)
Hi Nan,

The idea behind those monitors is to catch failures related to (a) snippets delivery, (b) snippets display, or (c) data collection. Metrics complete the snippets lifecycle, so monitoring collection ensures that the system works.

To your questions:

 1. In our setup we triggered a graphite/statsd ping when we received a ping from a browser (any ping, impression, click, etc). I think that's enough for a monitor, to ensure that the system as a whole works. 

 2. Given the importance of the snippets service, I'd continue doing what we do today, i.e. send them instantly. We need to identify failures in (a), (b) or (c) as soon as possible.

 3. That's a good idea. Since AS is being actively developed and changes there can break snippets, I suggest that we monitor each channel separately. This way we will identify problems similar to the one we had recently, before they hit release.
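
As a rough sketch of points 1 to 3, the Python below emits one statsd counter increment per received browser ping, tagged by channel. The metric name `snippets.ping.received.<channel>`, the function names, and the host are hypothetical; only the `name:value|c` counter line format and the default UDP port 8125 come from statsd itself.

```python
import socket


def statsd_counter(metric: str, value: int = 1) -> bytes:
    """Build a statsd counter datagram, e.g. b'some.metric:1|c'."""
    return f"{metric}:{value}|c".encode("ascii")


def record_ping(channel: str, host: str = "127.0.0.1", port: int = 8125) -> bytes:
    """Send one counter increment for a received ping (any type).

    The metric name is a hypothetical example with one counter per
    release channel, so each channel can be monitored separately.
    """
    payload = statsd_counter(f"snippets.ping.received.{channel}")
    # statsd listens on UDP, so sending is fire-and-forget.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return payload
```

Calling `record_ping("beta")` from the ping handler would increment the beta counter immediately, so a monitor can react without waiting for a batch job.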

Can myself and my team get access to those monitors once created?
Flags: needinfo?(giorgos)
(In reply to Giorgos Logiotatidis [:giorgos] from comment #4)
> 
> Can myself and my team get access to those monitors once created?

Absolutely. In fact, we will be continuously generating those statsd metrics and making them available to Datadog. You can build your own monitors with those metrics.
(In reply to Nan Jiang [:nanj] from comment #5)
> (In reply to Giorgos Logiotatidis [:giorgos] from comment #4)
> > 
> > Can myself and my team get access to those monitors once created?
> 
> Absolutely. In fact, we will be continuously generating those statsd
> metrics and making them available to Datadog. You can build your own monitors
> with those metrics.

But you'll still have your own monitors, right?
Flags: needinfo?(najiang)
(In reply to Giorgos Logiotatidis [:giorgos] from comment #6)
> But you'll still have your own monitors, right?

Yes, we also have various monitors to track all the mission-critical metrics of AS.

See https://app.datadoghq.com/dash/47775/tiles
Flags: needinfo?(najiang)
Hey Giorgos, just a quick update that we've made the following metrics available on Datadog:

"tiles.redshift.activity_stream.snippets.nightly.impression.total": total number of impressions observed as of now
"tiles.redshift.activity_stream.snippets.nightly.click_button.total": total number of clicks observed as of now
"tiles.redshift.activity_stream.snippets.nightly.messages.total": total number of unique messages observed as of now

You can also get those numbers for "beta" and "release" by replacing "nightly" in the metric string. You can build your own monitors on Datadog based on those metrics. Feel free to let me know if there is anything missing here.
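
For convenience, the naming scheme above can be written as a small Python helper. The helper names are made up for this example; the prefix, channels, and event names are the ones listed above.

```python
PREFIX = "tiles.redshift.activity_stream.snippets"
CHANNELS = ("nightly", "beta", "release")
EVENTS = ("impression", "click_button", "messages")


def metric_name(channel: str, event: str) -> str:
    """Return the Datadog metric string for one channel and event."""
    if channel not in CHANNELS:
        raise ValueError(f"unknown channel: {channel}")
    if event not in EVENTS:
        raise ValueError(f"unknown event: {event}")
    return f"{PREFIX}.{channel}.{event}.total"


def all_metric_names() -> list[str]:
    """List every channel/event combination (3 channels x 3 events)."""
    return [metric_name(c, e) for c in CHANNELS for e in EVENTS]
```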

Note that all those counters are currently scheduled to be reported once per hour. Let me know if this doesn't meet your needs.
Blocks: 1513279
I'm getting readings on beta using `tiles.redshift.activity_stream.snippets.beta.impression.total`,  we've made it! 

Nice work :nanj!
\o/

Closing this now. Let's file new bugs for other monitors if needed.
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Attached image image.png
Hey Nan,

The metrics from Beta look like they get reset every day (attached screenshot). Can you please take a look?
Flags: needinfo?(najiang)
(In reply to Giorgos Logiotatidis [:giorgos] from comment #11)

> The metrics from Beta look like they get reset every day (attached
> screenshot). Can you please take a look?

Yep, all those metrics monitor the "current" state of Snippets, such as "how many impressions/clicks have been observed so far today?" and "how many different Snippets messages have been served so far today?".

Are you looking for some other monitors with different time windows?
Flags: needinfo?(najiang) → needinfo?(giorgos)
Thanks for clarifying. 

What's graphed there makes sense, but I wonder how monitors will work. Does this mean that if we receive no metrics, we need to wait up to 24 hours for the day to change before the monitor checks whether the previous day has enough reported metrics?
Flags: needinfo?(giorgos) → needinfo?(najiang)
We don't need to wait for 24 hours; all the metrics can be monitored in real time.

Off the top of my head, I believe we can set up two types (in Datadog's terminology) of alerts for Snippets.

* "Threshold alert" on total number of messages (i.e. "tiles.redshift.activity_stream.snippets.{channel}.messages.total"), which triggers a warning if this metric drops below a predefined threshold.

* "Change alert" on total number of impressions / click_buttons. Since all those metrics get updated on an hourly basis, there should be a significant delta between every two consecutive observations if the system is behaving properly.

It'll be a bit tricky to set those alerts around UTC midnight, i.e. when those metrics get "reset". Let's see if there is any way we can overcome that on Datadog.
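
One possible way to handle the midnight reset, sketched in Python outside of Datadog: treat any reading that is lower than the previous one as a fresh start from zero, and compare hourly deltas against a minimum. The function names and the `min_delta` value are assumptions for illustration, not an existing monitor.

```python
def hourly_deltas(readings):
    """Convert cumulative hourly readings into per-hour deltas.

    A reading lower than its predecessor is treated as a daily reset,
    so the delta for that hour is the new reading itself.
    """
    deltas = []
    prev = None
    for value in readings:
        if prev is not None:
            deltas.append(value if value < prev else value - prev)
        prev = value
    return deltas


def stalled_hours(readings, min_delta):
    """Return the indexes of readings whose hourly delta is below min_delta."""
    return [i + 1 for i, d in enumerate(hourly_deltas(readings)) if d < min_delta]


# Cumulative impressions: the counter resets between the 3rd and 4th
# reading, and nothing arrives during the last hour.
readings = [900, 1000, 1100, 40, 150, 150]
```

Here `hourly_deltas(readings)` yields `[100, 100, 40, 110, 0]`, so `stalled_hours(readings, 10)` flags only the last reading and the reset itself does not raise a false alarm. A limitation of this approach is that a reset followed by enough traffic to exceed the previous reading within the same hour would go unnoticed.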
Flags: needinfo?(najiang)
Component: Activity Streams: Newtab → Activity Streams: Application Servers
Hi :nanj and happy new year!

Did you have the chance to take a look at the monitors you suggest? Those make sense to me. 

Also, can you please list what kinds of alerts you already have set up? Do you have those documented somewhere?

Thanks!
Flags: needinfo?(najiang)
(In reply to Giorgos Logiotatidis [:giorgos] from comment #15)
> Hi :nanj and happy new year!

Happy new year to you, too!

> Did you have the chance to take a look at the monitors you suggest? Those
> make sense to me.

Not yet, I haven't set up any alerts for Snippets. I'd like to add some once we kick off the Snippets rollout in release this week, so stay tuned.
Flags: needinfo?(najiang)
Component: Activity Streams: Application Servers → Messaging System