Closed Bug 1146512 Opened 5 years ago Closed 4 years ago

Telemetry data missing on telemetry.mozilla.org after 2015-03-20

Categories

(Data Platform and Tools :: Telemetry Dashboards (TMO), defect)


Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: mreid, Unassigned)

Details

The Telemetry aggregation code appears to be "stuck", and the dashboards are not showing data after 2015-03-20.

Previously when this has happened, we have been able to detect it by checking the timestamp of the following file in S3:
https://s3-us-west-2.amazonaws.com/telemetry-dashboard/v7/versions.json

This time, that file is being updated, we're just not seeing new data in the dashboards.
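Under the assumption that the bucket allows anonymous reads (it is served over a public URL above) and that boto3 is available, a staleness check on that file could be sketched like this; note that, as observed here, a fresh timestamp alone does not prove the dashboards are updating:

```python
from datetime import datetime, timedelta, timezone

def is_stale(last_modified, max_age_hours=24):
    """True if the object's LastModified is older than max_age_hours."""
    return datetime.now(timezone.utc) - last_modified > timedelta(hours=max_age_hours)

def versions_json_is_stale():
    # Assumption: anonymous read access to the telemetry-dashboard bucket.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config
    s3 = boto3.client("s3", region_name="us-west-2",
                      config=Config(signature_version=UNSIGNED))
    head = s3.head_object(Bucket="telemetry-dashboard", Key="v7/versions.json")
    return is_stale(head["LastModified"])
```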

I did a bit of initial poking around on the ec2 instance that coordinates the aggregator, and it appears to be stuck at
### Updating: beta/31
 - downloaded 20150323050149-31-beta
 - merged rows 1285

Checking Beta 31 on telemetry.m.o shows a build date from 1911, which may be the culprit: http://mzl.la/1EKa2Jo

I'm not sure how to proceed, hoping :jonasfj can shed some light.
Flags: needinfo?(jopsen)
Never mind the part about Beta 31 - the aggregator got past that part after a while.
This could be normal delay.
We have data from the 21st: http://mzl.la/1N7Jy6r

Aggregation of yesterday's data is scheduled for 13.31 UTC today, meaning yesterday (the 22nd)
will be done sometime today, or later depending on processing time. And yesterday's submissions
probably contain data from sessions that ran on the 21st (granted, I might be wrong there).
Anyway, my point is we do have a lot of delay - let's wait until we're certain something is wrong :)

@mreid, maybe take a look at the SQS queues early tomorrow morning, before 13.00 UTC.
They should be empty by then; if not, that is a potential issue, and maybe we'll need to
move off the micro node for merging data.
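A minimal sketch of that queue check with boto3 (the region and the queue URL you pass in are assumptions; substitute the aggregator's real queue):

```python
def queue_depth(queue_url, sqs=None):
    """Return (visible, in-flight) message counts for an SQS queue.

    Both counts should drop to zero between daily aggregation runs;
    a queue that never drains suggests the coordinator can't keep up.
    """
    if sqs is None:
        import boto3  # assumption: boto3 installed and AWS credentials configured
        sqs = boto3.client("sqs", region_name="us-west-2")
    attrs = sqs.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateNumberOfMessages",
                        "ApproximateNumberOfMessagesNotVisible"],
    )["Attributes"]
    return (int(attrs["ApproximateNumberOfMessages"]),
            int(attrs["ApproximateNumberOfMessagesNotVisible"]))
```

Note that these SQS counts are approximate by design, so a single zero reading is weaker evidence than several in a row.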

The next step in debugging this further is looking at FILES_PROCESSED, just to see which
submission dates we have handled; these should be encoded in the file names. Also compare
against the list of files that have failed, and maybe even the list of files that exist.
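Assuming the file names follow the pattern seen in the coordinator log above (e.g. 20150323050149-31-beta: a 14-digit timestamp, then version and channel), extracting the handled submission dates could look like:

```python
import re
from datetime import datetime

def processed_dates(filenames):
    """Collect distinct submission dates from names like '20150323050149-31-beta'."""
    dates = set()
    for name in filenames:
        # Assumed format: YYYYMMDDhhmmss-<version>-<channel>
        m = re.match(r"(\d{8})\d{6}-", name)
        if m:
            dates.add(datetime.strptime(m.group(1), "%Y%m%d").date())
    return sorted(dates)
```

Diffing this against the failed and existing file lists should show which submission dates, if any, were skipped.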

If this doesn't help, we have to look at what happens on the actual workers.

--
If you want to keep using this aggregation code in your next incarnation of telemetry, we should
run it under docker on taskcluster; I can help with that. But I suspect your next incarnation will
feature its own powerful analysis framework.
Looking at it today... I see data from the 22nd and a bit of data from the 23rd, so it's updating :)

@mreid,
If the SQS queues don't go empty, we need to move to a larger node; otherwise we're most
likely losing data, which is bad...
Flags: needinfo?(jopsen)
Is there anything in the logs that might give us a clue if the queue is getting emptied out?
@mreid, nope... We have to look at SQS and cloudwatch...
well, technically you can scroll up the screen session and see if there is any "got 0 messages" :)
Scrollback doesn't go back very far; all but one say "Fetched 30 messages", and the other one says "Fetched 21 messages". If it fetched fewer than 30, does that mean it emptied out?
Sorry, SQS provides no such promise...

But it's a good sign...
Maybe we should migrate to a larger instance anyway? We could wrap it up in a CloudFormation config too, so it stops being a unique snowflake.
Jonas, do you mind taking this bug and making a CloudFormation config for the aggregator-coordinator node? Then we could all sleep a little better :)
Flags: needinfo?(jopsen)
It's close to trivial to add the server to the cloudformation config:
https://github.com/mozilla/telemetry-server/blob/master/analysis/analysis-worker-stack.yaml

The hard part is building the AMI, which is more than non-trivial; I don't think I'll get around
to it anytime soon. I'm not even sure how we rebuild the current analysis-worker AMI:
https://github.com/mozilla/telemetry-server/blob/master/analysis/analysis-worker-stack.yaml#L49-L50
@mreid, do you know how to rebuild that AMI? If so, we can hook the other AMI into the same place.

The readme and scripts on the coordinator pretty much reveal what needs to be added to the AMI.
And I'll happily answer questions...

---
But if we want to set all of this up right, we're probably better off using docker than custom AMIs.
Possibly with tutum.co for deployment... Test cycles are much faster with docker than with AMIs.

But it also depends on how much we're willing to invest in this. I don't want to make any significant
investment here if we're going to replace the analysis framework sometime soon.

@mreid, vladan: what does the life-cycle for this aggregation pipeline look like?
Flags: needinfo?(jopsen)
I don't know offhand where/how that AMI was built. For the purposes of getting this running with minimum effort, we could just add a UserData script to configure the coordinator node.

I'll leave it to Vladan to address the lifecycle question.
Flags: needinfo?(vdjeric)
(In reply to Jonas Finnemann Jensen (:jonasfj) from comment #11)
> @mreid, vladan: what does the life-cycle for this aggregation pipeline look
> like?

We'll probably need something new for aggregation in the new pipeline. I expect the old Telemetry pipeline will be around for another quarter, but I wouldn't want to make big investments in it if they could not be easily ported to the new pipeline
Flags: needinfo?(vdjeric)
The V2 aggregation pipeline was retired in bug 1179751.
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → INVALID
Product: Webtools → Data Platform and Tools