Create Lua filters that will watch the statsd data as it flows through, and generate notifiers when the triggers happen.
What data do we want to filter/notify on?
Andy: Have specs for the first few stats you'd like to notify on?
I like the standard deviation script that you showed in the secret channel for these values:
stats.addons.response.200
stats.addons.response.301
stats.addons.response.400
stats.addons.response.401
stats.addons.response.403
stats.addons.response.500
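For readers without access to the script mentioned above, here is a minimal sketch of the general idea of standard-deviation-based detection: keep a rolling window of recent samples per stat and flag a new sample that lies more than a chosen number of standard deviations from the window mean. This is not the script from the channel and not Heka's anomaly detection code. The class name, the window size, and the threshold are all assumptions made for illustration.

```python
from collections import deque
from statistics import mean, pstdev

# Stat names taken from the request above.
STATS = (
    "stats.addons.response.200",
    "stats.addons.response.301",
    "stats.addons.response.400",
    "stats.addons.response.401",
    "stats.addons.response.403",
    "stats.addons.response.500",
)


class StdDevDetector:
    """Hypothetical rolling-window detector (illustrative, not Heka's)."""

    def __init__(self, window=60, threshold=3.0):
        # Only the most recent `window` samples are kept.
        self.samples = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value):
        """Return True if `value` is anomalous relative to the samples
        seen so far, then record it in the window."""
        anomalous = False
        if len(self.samples) >= 2:
            mu = mean(self.samples)
            sigma = pstdev(self.samples)
            # With zero spread (e.g. a run of identical values) the
            # deviation is undefined, so this sketch never flags then.
            if sigma > 0 and abs(value - mu) > self.threshold * sigma:
                anomalous = True
        self.samples.append(value)
        return anomalous


# One independent detector per stat.
detectors = {name: StdDevDetector() for name in STATS}
```

A caller would feed each interval's count into `detectors[name].observe(count)` and act on a `True` result. Note that a plain standard deviation test like this tends to behave poorly on low, intermittent traffic, where a single request can look like a large deviation.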
Great, I'll start with these. Will try to have something to experiment with by the end of this week.
Aaaaannnd I'm not going to have anything ready by the end of this week, I've been busy w/ other things. I've finally been able to start on it today, though, and will definitely have some stuff to try out next week.
I've made a ton of progress on this and am at a point where I'm ready to start testing out my code on the running dev/stage Heka instance. To simplify the testing and iteration process, I've had firstname.lastname@example.org update the Heka config to allow me to dynamically add and remove sandbox filters w/o a restart or any further config changes. My first attempt at uploading my script failed, though, b/c I was depending on recent changes to the Lua environment, so Heka itself needs to be updated first. I've built an RPM from a recent Heka nightly and have given it to Jason for deployment, attached here for posterity.
Okay, well, the RPM was slightly too large to upload, so here's a link: https://people.mozilla.org/~rmiller/heka/heka-0_5_0-20140205-linux-amd64.rpm
We have lift-off:

* http://logstash1.addons.phx1.mozilla.com:4352/#plugins/filters/dev_sbxmgr-http_monitor

Apologies for the still buggy dashboard UI, but this page might need a refresh before you see everything. Ultimately, however, there should be a list of output links on the right side of the page. Among these should be 4 of type "CBUF", which will take you to graphs for the response codes for each listed host.

For now there are no notifiers going out. Instead, when an anomaly is detected, an annotation is placed on the graph. This is to avoid a flood of false positives, since we have a lot of work to get the anomaly detection dialed in to not fire except when we really want it to. This is especially true when we're on dev and stage, since the traffic is low and intermittent.

I need to switch gears for a bit to handle some other Heka stuff, but I'll come back to this in a week or so to help push this along further. The code that's generating those graphs was dynamically loaded into Heka, so for now I'm going to attach that code and the related config to this bug so we have it for posterity.
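To make the "annotate instead of notify" approach concrete, here is a rough sketch of a fixed-size buffer of per-interval rows with one column per response code, plus a list of annotations attached to rows. It does not use Heka's circular buffer API, and it is not the filter attached to this bug. The class, method names, and row sizes are hypothetical.

```python
from collections import deque

# Response codes from the stats requested earlier in this bug.
STATUS_CODES = ("200", "301", "400", "401", "403", "500")


class ResponseGraph:
    """Hypothetical ring of per-interval rows, one column per code."""

    def __init__(self, rows=60, seconds_per_row=60):
        self.seconds_per_row = seconds_per_row
        # Each entry is (row_start_timestamp, {code: count}); the
        # oldest row is dropped once `rows` entries exist.
        self.rows = deque(maxlen=rows)
        # Each entry is (row_start_timestamp, code, text).
        self.annotations = []

    def _row_start(self, ts):
        return ts - ts % self.seconds_per_row

    def add(self, ts, code, count):
        """Add `count` to the column for `code` in the row covering `ts`."""
        if code not in STATUS_CODES:
            return
        start = self._row_start(ts)
        if self.rows and start < self.rows[-1][0]:
            return  # late sample: this sketch simply drops it
        if not self.rows or self.rows[-1][0] != start:
            self.rows.append((start, dict.fromkeys(STATUS_CODES, 0)))
        self.rows[-1][1][code] += count

    def annotate(self, ts, code, text):
        """Record an annotation on the graph instead of sending a notifier."""
        self.annotations.append((self._row_start(ts), code, text))


graph = ResponseGraph(rows=3, seconds_per_row=60)
graph.add(0, "200", 5)
graph.add(30, "200", 2)   # same row as the sample at ts=0
graph.add(60, "500", 1)   # starts a new row
graph.annotate(60, "500", "anomaly detected")
```

Keeping anomalies as annotations on the data means false positives only clutter the graph rather than paging anyone, which suits the tuning phase described above.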
Component: Server Operations: AMO Operations → Operations: Marketplace
Product: mozilla.org → Mozilla Services
Version: other → unspecified
Rob, is https://github.com/mozilla-services/heka/blob/dev/sandbox/lua/filters/http_status.lua equivalent?
It's close. The filter you linked to is set up to process the output of Heka's HTTP server log (nginx, Apache, HAProxy) parsing instead of the same info coming in via statsd. I actually did deploy a filter to graph and analyse the statsd data. The anomaly detection stuff is only doing graph annotations, not actually sending notifiers, b/c at the time our anomaly detection code was a bit more rough and had too many false positives when the data was sparse. We've since added some algorithms that will handle this better. Also, I've recently added a filter that can handle statsd data in the general case: https://github.com/mozilla-services/heka/blob/dev/sandbox/lua/filters/stat_graph.lua which is a better choice than the one-off that I originally wrote for marketplace. I just realized I need to add support for anomaly detection there, though. I'll work on that today or tomorrow and will update this ticket when it's ready.
Okay, as of https://github.com/mozilla-services/heka/pull/980 the stat_graph filter (for general purpose graphing and monitoring of specific statsd data points) supports anomaly detection and has been merged to dev. So the custom filter code that I deployed can be replaced, either by http_status.lua (if you want to go the 'parsing the log files' route) or by stat_graph.lua (if you want to continue to use statsd as the source of the monitored data).
Added dev/stage HTTPStatus report here: http://kibana1.stage.addons.phx1.mozilla.com:4352/#sandboxes/HTTPStatus/outputs/HTTPStatus.HTTPStatus.cbuf
Status: NEW → RESOLVED
Resolution: --- → INCOMPLETE