Closed Bug 1146512 Opened 5 years ago Closed 4 years ago
Telemetry data missing on telemetry.mozilla.org after 2015-03-20
The Telemetry aggregation code appears to be "stuck", and the dashboards are not showing data after 2015-03-20. Previously when this has happened, we have been able to detect it by checking the timestamp of the following file in S3: https://s3-us-west-2.amazonaws.com/telemetry-dashboard/v7/versions.json This time, that file is being updated; we're just not seeing new data in the dashboards.
I did a bit of initial poking around on the ec2 instance that coordinates the aggregator, and it appears to be stuck at:
### Updating: beta/31 - downloaded 20150323050149-31-beta - merged rows 1285
Checking Beta 31 on telemetry.m.o shows a build date from 1911, which may be the culprit: http://mzl.la/1EKa2Jo I'm not sure how to proceed, hoping :jonasfj can shed some light.
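The timestamp check mentioned in this comment could be automated along the lines below. This is a minimal sketch, not the monitoring actually in use: the function names and the 36-hour threshold are assumptions made for illustration. As this bug shows, a fresh timestamp on versions.json does not prove the dashboards have new data, so the check is only a first signal.

```python
import urllib.request
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

VERSIONS_URL = "https://s3-us-west-2.amazonaws.com/telemetry-dashboard/v7/versions.json"


def age_of(last_modified, now):
    """Return how long ago an HTTP Last-Modified header value was."""
    return now - parsedate_to_datetime(last_modified)


def is_stale(last_modified, now, max_age=timedelta(hours=36)):
    """True if the file was not updated within max_age (assumed threshold)."""
    return age_of(last_modified, now) > max_age


def fetch_last_modified(url=VERSIONS_URL):
    """Send a HEAD request and return the Last-Modified header (None if absent)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.headers.get("Last-Modified")
```

To use it, pass the value from `fetch_last_modified()` together with `datetime.now(timezone.utc)` to `is_stale()`.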
Never mind the part about Beta 31 - the aggregator got past that part after a while.
This could be normal delay. We have data from the 21st: http://mzl.la/1N7Jy6r And yesterday's data is scheduled for processing today at 13:31 UTC, meaning yesterday (the 22nd) will be done sometime today, or maybe later depending on processing time. And yesterday's submissions probably contain data from sessions that ran on the 21st. Granted, I might be wrong there. Anyway, my point is we do have a lot of delay - let's wait until we're certain something is wrong :) @mreid, maybe take a look at the SQS queues very early tomorrow morning, before 13:00 UTC. They should be empty then; if not, that is a potential issue. And maybe we'll need to move off the micro node for merging data. Also, the next step in debugging this further is looking at FILES_PROCESSED, just to see what submission dates we have handled. This should be encoded in the file names. Also compare to the list of files that have failed, and maybe even the list of files that exist. If this doesn't help, we have to look at what happens on the actual workers. -- If you want to keep using this aggregation code in your next incarnation of telemetry, we should run it under docker on taskcluster; I can help with that. But I suspect your next incarnation will feature its own powerful analysis framework.
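The comparison suggested above (which submission dates show up in FILES_PROCESSED, and how that lines up with the failed files) might look like the sketch below. The file-name layout is an assumption: the code only supposes that each name contains the submission date as a standalone eight-digit YYYYMMDD run, which needs to be checked against the real files. The function names are hypothetical.

```python
import re
from collections import Counter

# Assumption: the submission date appears in each file name as YYYYMMDD,
# with no other digit directly before or after it.
DATE_RE = re.compile(r"(?<!\d)(20\d{6})(?!\d)")


def dates_in(names):
    """Count file names per submission date; names without a date are skipped."""
    counts = Counter()
    for name in names:
        match = DATE_RE.search(name)
        if match:
            counts[match.group(1)] += 1
    return counts


def summarize(processed, failed):
    """Return {date: (processed_count, failed_count)} for every date seen."""
    done, bad = dates_in(processed), dates_in(failed)
    return {d: (done[d], bad[d]) for d in sorted(set(done) | set(bad))}
```

Feeding it the lines of FILES_PROCESSED and of the failed-files list gives a per-date table; a date with many failures or no processed files at all is where to look next.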
Looking at it today... I see data from the 22nd and a bit of data from the 23rd, so it's updating :) @mreid, if the SQS queues never become empty we need to move to a large node. If they don't empty out we're most likely losing data, which is bad...
Is there anything in the logs that might give us a clue if the queue is getting emptied out?
@mreid, nope... We have to look at SQS and cloudwatch...
well, technically you can scroll up the screen session and see if there is any "got 0 messages" :)
Scrollback doesn't go back very far. All but one say "Fetched 30 messages", and the other one says "Fetched 21 messages". If it fetched fewer than 30, does that mean it emptied out?
Sorry, SQS provides no such promise... But it's a good sign...
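Since a short batch says nothing definite about queue depth, one way to check more directly is to ask SQS for the queue's approximate counts. A minimal sketch, assuming a boto3 SQS client (`boto3.client("sqs")`) is passed in and the queue URL is known. The function names are hypothetical, and the counts SQS reports are themselves only approximate.

```python
def queue_backlog(sqs, queue_url):
    """Return (visible, in_flight) approximate message counts for a queue.

    `sqs` is assumed to be a boto3 SQS client; only its
    get_queue_attributes call is used.
    """
    response = sqs.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=[
            "ApproximateNumberOfMessages",
            "ApproximateNumberOfMessagesNotVisible",
        ],
    )
    attrs = response.get("Attributes", {})
    # SQS returns attribute values as strings.
    visible = int(attrs.get("ApproximateNumberOfMessages", 0))
    in_flight = int(attrs.get("ApproximateNumberOfMessagesNotVisible", 0))
    return visible, in_flight


def looks_drained(sqs, queue_url):
    """True when no messages are waiting or in flight (approximately)."""
    return queue_backlog(sqs, queue_url) == (0, 0)
```

Polling this shortly before the next scheduled run would show whether the queues drain each day; the same numbers are available as CloudWatch metrics.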
Maybe we should migrate to a larger instance anyways? We could wrap it up in a CloudFormation config too, so it stops being a unique snowflake.
Jonas, do you mind taking this bug and making a CloudFormation config for the aggregator-coordinator node? Then we could all sleep a little better :)
It's close to trivial to add the server to the cloudformation config: https://github.com/mozilla/telemetry-server/blob/master/analysis/analysis-worker-stack.yaml The hard part is building the AMI, which is a bit more than non-trivial. I don't think I'll get around to it anytime soon. I'm not even sure how we rebuild the current analysis-worker AMI: https://github.com/mozilla/telemetry-server/blob/master/analysis/analysis-worker-stack.yaml#L49-L50 @mreid, do you know how to rebuild that AMI? If so, we can add the other AMI to the same place/hook. The readme and scripts on the coordinator pretty much reveal what needs to be added to the AMI. And I'll happily answer questions... --- But if we want to set all of this up right, we're probably better off using docker than custom AMIs. Possibly with tutum.co for deployment... Test cycles are much faster with docker than with AMIs. But it also depends on how much we're willing to invest in this. I don't want to make any significant investment here if we're going to replace the analysis framework sometime soon. @mreid, vladan: what does the life-cycle for this aggregation pipeline look like?
I don't know offhand where/how that AMI was built. For the purposes of getting this running with minimum effort, we could just add a UserData script to configure the coordinator node. I'll leave it to Vladan to address the lifecycle question.
(In reply to Jonas Finnemann Jensen (:jonasfj) from comment #11)
> @mreid, vladan: what does the life-cycle for this aggregation pipeline look like?
We'll probably need something new for aggregation in the new pipeline. I expect the old Telemetry pipeline will be around for another quarter, but I wouldn't want to make big investments in it if they could not be easily ported to the new pipeline.
The V2 aggregation pipeline was retired in bug 1179751.
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → INVALID