Closed Bug 1018603 Opened 10 years ago Closed 10 years ago

telemetry hasn't received submissions since 5/26

Categories

(Toolkit :: Telemetry, defect)

29 Branch
x86
macOS
defect
Not set
normal

RESOLVED FIXED

People

(Reporter: mmc, Unassigned)

Details

I see no submissions for anything since 5/26, by build date or by submission date.
cc'ing rvitillo since he fixed bug 962153.
Submissions are still coming in OK. The telemetry.m.o dashboard analysis code is stuck.

We have monitoring in place, but due to a strange quirk of the server + email-sending API, the monitoring failed.

For future reference: the server that runs the monitor had some clock skew. If a request's timestamp is more than 5 minutes out of line with Amazon's servers, Amazon SES refuses to send the email and returns the following error:
<ErrorResponse xmlns="http://ses.amazonaws.com/doc/2010-12-01/">
  <Error>
    <Type>Sender</Type>
    <Code>RequestExpired</Code>
    <Message>Request timestamp: Mon, 02 Jun 2014 12:41:00 GMT expired.  It must be within 300 secs/ of server time.</Message>
  </Error>
  <RequestId>6e17ad3a-...</RequestId>
</ErrorResponse>
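
A rough sketch of a guard against this failure mode: compare the local clock with a trusted reference time before relying on email alerts. The 300-second limit comes from the error message above; the function names, and the idea of taking the reference time from something like an HTTP `Date` header, are assumptions for illustration and are not part of the actual monitoring code.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime

# 300 seconds is the tolerance quoted in the error message above.
MAX_SKEW = timedelta(seconds=300)


def clock_skew(local_now, reference_now):
    """Return the absolute difference between two timezone-aware datetimes."""
    return abs(local_now - reference_now)


def skew_too_large(local_now, reference_now, max_skew=MAX_SKEW):
    """True if the local clock is further from the reference than max_skew."""
    return clock_skew(local_now, reference_now) > max_skew


# Example: a reference time in HTTP date format (hypothetical source),
# and a local clock that is running 7 minutes fast.
reference = parsedate_to_datetime("Mon, 02 Jun 2014 12:41:00 GMT")
local = reference + timedelta(minutes=7)
print(skew_too_large(local, reference))
```

A check like this would have to raise its alarm through some channel other than the email API it is protecting, otherwise it fails in the same way.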

Now I'll look at what's actually wrong with the dashboard analysis code.
It should be updating again now - new data should appear once all the pending data has been processed.
mreid and I traced it to bucket listing, where we didn't have any retry logic. I've added retry logic in:
https://github.com/mozilla/telemetry-aggregator/pull/1

But testing is needed... The easy way to do this is just to deploy it and see what happens :)
The process is checkpointed, so with reasonable luck this shouldn't cause any harm.
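
For readers without the pull request open, retry logic of the kind described above might look roughly like the following. This is a generic sketch with exponential backoff, not the code from the PR; `with_retries` and `flaky_list_bucket` are hypothetical names, and the attempt count and delays are arbitrary.

```python
import time


def with_retries(operation, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call operation(); on an exception, wait and retry up to `attempts` times."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...


# Hypothetical stand-in for a bucket listing that fails twice, then succeeds.
calls = {"count": 0}


def flaky_list_bucket():
    calls["count"] += 1
    if calls["count"] < 3:
        raise IOError("transient listing failure")
    return ["key-a", "key-b"]


# The sleep is replaced with a no-op so the example runs instantly.
keys = with_retries(flaky_list_bucket, sleep=lambda seconds: None)
print(keys, calls["count"])
```

Re-raising after the final attempt matters here: a listing that never succeeds should stop the job visibly rather than leave it stuck.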
The aggregator code has caught up and the data is now appearing on the dashboard as expected.

I filed bug 1019597 to follow up on the bucket-listing problem.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Thanks, Mark! I notice that submissions by calendar date (not build date) are still 2 days behind. Is this expected?
@mmc,
> I notice that submissions by calendar date (not build date) are still 2 days behind. Is this expected?
We recently delayed our analysis some more to reduce the errors that caused missing data.
We suspect it's okay, and will only really be worried if the 2-day lag grows to 3 days or more,
or if there is a hole in the data.

Processing time makes it hard to say exactly when yesterday's data (UTC) is processed, but it's probably not done until relatively late tomorrow (UTC).
So that sounds like a ~2-day delay, and if we have a few days in the queue it may be a little slow when we restart the deadlocked process.

Let's check again in a day or two; if it's still ~2 days behind, then that's probably fine.
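
The rule of thumb in this comment (a ~2-day lag is fine; a growing lag or a hole in the data is not) could be written down as a small check. This is only a sketch: the function names, the `expected_lag` default, and the idea of feeding it a list of processed submission dates are assumptions, not something the dashboard code is known to do.

```python
from datetime import date, timedelta


def lag_days(latest_processed, today):
    """Number of whole days between the newest processed day and today."""
    return (today - latest_processed).days


def missing_days(processed_days):
    """Calendar days absent between the earliest and latest processed day."""
    present = set(processed_days)
    if not present:
        return []
    start, end = min(present), max(present)
    all_days = (start + timedelta(days=i) for i in range((end - start).days + 1))
    return [day for day in all_days if day not in present]


def needs_attention(processed_days, today, expected_lag=2):
    """True if there is no data, the lag exceeds expected_lag, or a day is missing."""
    if not processed_days:
        return True
    too_far_behind = lag_days(max(processed_days), today) > expected_lag
    return too_far_behind or bool(missing_days(processed_days))


processed = [date(2014, 6, 1), date(2014, 6, 2), date(2014, 6, 3)]
print(needs_attention(processed, today=date(2014, 6, 5)))
print(needs_attention(processed, today=date(2014, 6, 6)))
```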