Closed Bug 1146512 Opened 5 years ago Closed 4 years ago

Telemetry data missing on telemetry.mozilla.org after 2015-03-20

Categories

(Data Platform and Tools :: Telemetry Dashboards (TMO), defect)


Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: mreid, Unassigned)

Details

The Telemetry aggregation code appears to be "stuck", and the dashboards are not showing data after 2015-03-20.

Previously when this has happened, we have been able to detect it by checking the timestamp of the following file in S3:
https://s3-us-west-2.amazonaws.com/telemetry-dashboard/v7/versions.json

This time, that file is being updated, we're just not seeing new data in the dashboards.
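Under the assumption that the bucket allows anonymous reads (it is served over a public URL above) and that boto3 is available, a staleness check on that file could be sketched like this; note that, as observed here, a fresh timestamp alone does not prove the dashboards are updating:

```python
from datetime import datetime, timedelta, timezone

def is_stale(last_modified, max_age_hours=24):
    """True if the object's LastModified is older than max_age_hours."""
    return datetime.now(timezone.utc) - last_modified > timedelta(hours=max_age_hours)

def versions_json_is_stale():
    # Assumption: anonymous read access to the telemetry-dashboard bucket.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config
    s3 = boto3.client("s3", region_name="us-west-2",
                      config=Config(signature_version=UNSIGNED))
    head = s3.head_object(Bucket="telemetry-dashboard", Key="v7/versions.json")
    return is_stale(head["LastModified"])
```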

I did a bit of initial poking around on the ec2 instance that coordinates the aggregator, and it appears to be stuck at
### Updating: beta/31
 - downloaded 20150323050149-31-beta
 - merged rows 1285

Checking Beta 31 on telemetry.m.o shows a build date from 1911, which may be the culprit: http://mzl.la/1EKa2Jo

I'm not sure how to proceed, hoping :jonasfj can shed some light.
Flags: needinfo?(jopsen)
Never mind the part about Beta 31 - the aggregator got past that part after a while.
This could be normal delay.
We have data from the 21st: http://mzl.la/1N7Jy6r

Aggregation of yesterday's data is scheduled for 13.31 UTC today, meaning yesterday (the 22nd)
will be done sometime today, or later depending on processing time. And yesterday's submissions
probably contain data from sessions that ran on the 21st (granted, I might be wrong there).
Anyway, my point is we do have a lot of delay - let's wait until we're certain something is wrong :)

@mreid, maybe take a look at the SQS queues early tomorrow morning, before 13.00 UTC.
They should be empty by then; if not, that is a potential issue, and maybe we'll need to
move off the micro node for merging data.
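A minimal sketch of that queue check with boto3 (the region and the queue URL you pass in are assumptions; substitute the aggregator's real queue):

```python
def queue_depth(queue_url, sqs=None):
    """Return (visible, in-flight) message counts for an SQS queue.

    Both counts should drop to zero between daily aggregation runs;
    a queue that never drains suggests the coordinator can't keep up.
    """
    if sqs is None:
        import boto3  # assumption: boto3 installed and AWS credentials configured
        sqs = boto3.client("sqs", region_name="us-west-2")
    attrs = sqs.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateNumberOfMessages",
                        "ApproximateNumberOfMessagesNotVisible"],
    )["Attributes"]
    return (int(attrs["ApproximateNumberOfMessages"]),
            int(attrs["ApproximateNumberOfMessagesNotVisible"]))
```

Note that these SQS counts are approximate by design, so a single zero reading is weaker evidence than several in a row.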

The next step in debugging this further is looking at FILES_PROCESSED, just to see which
submission dates we have handled; these should be encoded in the file names. Also compare
against the list of files that have failed, and maybe even the list of files that exist.
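Assuming the file names follow the pattern seen in the coordinator log above (e.g. 20150323050149-31-beta: a 14-digit timestamp, then version and channel), extracting the handled submission dates could look like:

```python
import re
from datetime import datetime

def processed_dates(filenames):
    """Collect distinct submission dates from names like '20150323050149-31-beta'."""
    dates = set()
    for name in filenames:
        # Assumed format: YYYYMMDDhhmmss-<version>-<channel>
        m = re.match(r"(\d{8})\d{6}-", name)
        if m:
            dates.add(datetime.strptime(m.group(1), "%Y%m%d").date())
    return sorted(dates)
```

Diffing this against the failed and existing file lists should show which submission dates, if any, were skipped.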

If this doesn't help, we have to look at what happens on the actual workers.

--
If you want to keep using this aggregation code in your next incarnation of telemetry, we should
run it under docker on taskcluster; I can help with that. But I suspect your next incarnation will
feature its own powerful analysis framework.
Looking at it today... I see data from the 22nd and a bit of data from the 23rd, so it's updating :)

@mreid,
If the SQS queues don't go empty, we need to move to a larger node; otherwise we're most
likely losing data, which is bad...
Flags: needinfo?(jopsen)
Is there anything in the logs that might give us a clue if the queue is getting emptied out?
@mreid, nope... We have to look at SQS and cloudwatch...
well, technically you can scroll up the screen session and see if there is any "got 0 messages" :)
Scrollback doesn't go back very far; all but one say "Fetched 30 messages", and the other one says "Fetched 21 messages". If it fetched fewer than 30, does that mean it emptied out?
Sorry, SQS provides no such promise...

But it's a good sign...
Maybe we should migrate to a larger instance anyway? We could wrap it up in a CloudFormation config too, so it stops being a unique snowflake.
Jonas, do you mind taking this bug and making a CloudFormation config for the aggregator-coordinator node? Then we could all sleep a little better :)
Flags: needinfo?(jopsen)
It's close to trivial to add the server to the cloudformation config:
https://github.com/mozilla/telemetry-server/blob/master/analysis/analysis-worker-stack.yaml

The hard part is building the AMI, which is more than non-trivial; I don't think I'll get around
to it anytime soon. I'm not even sure how we rebuild the current analysis-worker AMI:
https://github.com/mozilla/telemetry-server/blob/master/analysis/analysis-worker-stack.yaml#L49-L50
@mreid, do you know how to rebuild that AMI? If so, we can hook the other AMI into the same place.

The readme and scripts on the coordinator pretty much reveal what needs to be added to the AMI.
And I'll happily answer questions...

---
But if we want to set all of this up right, we're probably better off using docker than custom AMIs.
Possibly with tutum.co for deployment... Test cycles are much faster with docker than with AMIs.

But it also depends on how much we're willing to invest in this. I don't want to make any significant
investment here if we're going to replace the analysis framework sometime soon.

@mreid, vladan: what does the life-cycle for this aggregation pipeline look like?
Flags: needinfo?(jopsen)
I don't know offhand where/how that AMI was built. For the purposes of getting this running with minimum effort, we could just add a UserData script to configure the coordinator node.

I'll leave it to Vladan to address the lifecycle question.
Flags: needinfo?(vdjeric)
(In reply to Jonas Finnemann Jensen (:jonasfj) from comment #11)
> @mreid, vladan: what does the life-cycle for this aggregation pipeline look
> like?

We'll probably need something new for aggregation in the new pipeline. I expect the old Telemetry pipeline will be around for another quarter, but I wouldn't want to make big investments in it if they could not be easily ported to the new pipeline
Flags: needinfo?(vdjeric)
The V2 aggregation pipeline was retired in bug 1179751.
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → INVALID
Product: Webtools → Data Platform and Tools