Mysterious dip in Universal Search data

Status: RESOLVED INCOMPLETE

Product: Cloud Services
Component: Metrics: Pipeline
Priority: P2
Severity: normal
Reported: a year ago
Last updated: 6 months ago

People

(Reporter: RaFromBRC, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment)

Description (Reporter, a year ago)
We've seen an unexplained dip in the Universal Search telemetry data on the dashboard that is being built out. The data volume has recovered, but this isn't the first such dip we've seen, and we're not sure why they're happening. Ilana and Rob Rayborn have been looking into it, but could use support from the pipeline team, since we don't know where in the chain the issue is originating.

The live dashboard is at https://sql.telemetry.mozilla.org/dashboard/-in-progress-universal-search-executive-summary

There has been an email thread about the issue; I'll paste the contents of the thread here. The thread forked, so the chronology is a bit wonky, but all the info is in here anyway.

----------------------------------
From: Ilana Segall <isegall@mozilla.com>
Subject: Re: Data problems with Universal Search
To: Wil Clouser <wclouser@mozilla.com>
Cc: John Gruen <jgruen@mozilla.com>, Robert Miller <rmiller@mozilla.com>, 
Nick Chapman <nchapman@mozilla.com>, Javaun Moradi <jmoradi@mozilla.com>, 
Chuck Harmston <chuck@mozilla.com>, Robert Rayborn <rrayborn@mozilla.com>, 
Rebecca Weiss <rweiss@mozilla.com>

We didn't see comparable drops in the other tests, which was odd.

On Aug 17, 2016, at 6:39 PM, Wil Clouser <wclouser@mozilla.com> wrote:

Thanks for investigating.  Our scheduled pushes of Test Pilot were Jul 5,
Jul 19, Aug 2.  It's hard to tell if any of those line up with the dates on
the graph, but it's an idea.  Although I'd expect the ping drop to be
reflected in Tab Center also if we messed something up in the Test Pilot
add-on.

Wil

On Wed, Aug 17, 2016 at 5:58 PM, Ilana Segall <isegall@mozilla.com> wrote:

> Today, Rob and I investigated an odd data dip that we observed in
> Universal Search for most of July:
>
> <Screen Shot 2016-08-17 at 4.16.17 PM.png>
>
> We did a few quality checks to make sure this loss wasn't happening during
> the spark phase:
>
> - There is no other experiment named universal-search (universal-search1,
> for instance) that pings might have leaked to
> - Examining both the submission date and the creation date show the same
> level of dropoff, so it's not a time issue
> - There is no change in the ping format that would have caused the ping
> not to be processed correctly
>
> The data is coming back up, so it appears that the problem may have
> resolved itself, but we're still very concerned that these pings never made
> it to aws and can't be recovered. Is there a way to examine log files and
> see if anything happened pre-aws to cause the loss?
>
>
> Ilana
>
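[Editor's note: the submission-date vs. creation-date check described above can be sketched roughly as follows. The field names and records here are illustrative only, not the real Telemetry ping schema; the point is the logic of the check: if counts keyed on both dates show the same dip, the loss is real rather than a clock or submission-latency artifact.]

```python
from collections import Counter
from datetime import date

# Hypothetical ping records; real Telemetry pings carry many more fields.
pings = [
    {"submission_date": date(2016, 7, 6), "creation_date": date(2016, 7, 6)},
    {"submission_date": date(2016, 7, 8), "creation_date": date(2016, 7, 7)},
    {"submission_date": date(2016, 7, 8), "creation_date": date(2016, 7, 8)},
]

def daily_counts(pings, field):
    """Count pings per day, keyed on the given date field."""
    return Counter(p[field] for p in pings)

by_submission = daily_counts(pings, "submission_date")
by_creation = daily_counts(pings, "creation_date")

# If both series show the same level of dropoff over the dip window,
# the missing pings were genuinely never received, not merely delayed.
```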

---------------------------

From: Robert Rayborn <rrayborn@mozilla.com>
Date: Wed, 17 Aug 2016 23:10:47 -0700
Subject: Re: Data problems with Universal Search
To: Chuck Harmston <chuck@mozilla.com>
Cc: Ilana Segall <isegall@mozilla.com>, Robert Miller <rmiller@mozilla.com>, 
Nick Chapman <nchapman@mozilla.com>, Wil Clouser <wclouser@mozilla.com>, 
John Gruen <jgruen@mozilla.com>, Javaun Moradi <jmoradi@mozilla.com>, 
Rebecca Weiss <rweiss@mozilla.com>

Here <https://sql.telemetry.mozilla.org/queries/973/source#1756> is a link
to the actual query (for more detailed hovering etc).  It looks to be July
7th to August 2nd, July 5th still looks fine (perhaps that's deployment
delays).

I don't see any unique attributes in the limited subset of data that I
graph in Presto, but Ilana would be able to pull more fields directly from
Telemetry.

Thanks

On Wed, Aug 17, 2016 at 6:47 PM, Chuck Harmston <chuck@mozilla.com> wrote:

> Thanks for looking into this, Ilana and Rob!
>
> On the code side, there are two places where this could have been a
> problem: in Universal Search or in Test Pilot.
>
> Universal Search unfortunately had very little change over that time. We
> made documentation-only changes on May 25th, then the next commit was one
> small change
> <https://github.com/mozilla/universal-search/commit/cf224a7708d373c346f98e143dd96b1188fc848b> on
> July 15th to fix an issue with the resultType in our ping. Unfortunately,
> the issue seems to have started in between those two things.
>
> Test Pilot did have three deployments over that time: July 5th, July 19th,
> and August 2nd. It's hard to tell if any of those line up with the drops or
> increases in the chart, but because of the nature of how all pings pass
> through, it seems likely that other experiments (notably Tab Center) would
> see similar drops in the same time periods. Is that something observable?
>
> One moonshot question: it looks like a very small number of pings made it
> through between July 10th and 24th. Do those pings have any
> unique/identifying characteristics?
>
>
> On August 17, 2016 at 6:58:29 PM, Ilana Segall (isegall@mozilla.com)
> wrote:
>
> Today, Rob and I investigated an odd data dip that we observed in
> Universal Search for most of July:
>
> [image: Inline image 1]
>
> We did a few quality checks to make sure this loss wasn't happening during
> the spark phase:
>
> - There is no other experiment named universal-search (universal-search1,
> for instance) that pings might have leaked to
> - Examining both the submission date and the creation date show the same
> level of dropoff, so it's not a time issue
> - There is no change in the ping format that would have caused the ping
> not to be processed correctly
>
> The data is coming back up, so it appears that the problem may have
> resolved itself, but we're still very concerned that these pings never made
> it to aws and can't be recovered. Is there a way to examine log files and
> see if anything happened pre-aws to cause the loss?
>
>
> Ilana
Assignee: gfritzsche → nobody

Comment 1

Are you looking for input from me?

Comment 2

rrayborn tells us (via email) that there is a similar dip in pings from Tab Center during the same period.

Since the period roughly aligns with our pushes, I think an investigation into the add-on itself is warranted.

Comment 3

Chuck is investigating the add-on.

Comment 4

a year ago
Additional info: The drop in pings aligns with when Universal Search began, but whereas the Universal Search data appears to have resolved itself, we still don't see a recovery from vtabs.

https://sql.telemetry.mozilla.org/queries/744#1250
https://sql.telemetry.mozilla.org/queries/1043#1816
Comment 5

Created attachment 8783574 [details]
Screen Shot 2016-08-22 at 8.59.11 AM.png

I've checked a variety of different configurations of old Test Pilot and extension versions, and was never able to produce a configuration where experiment telemetry pings weren't at least making it as far as about:telemetry.

I also made an attempt to align Test Pilot release dates with the drop in data, and wasn't able to successfully do so:

- We made the June 6 release as scheduled.
- We made an unscheduled release on June 9 to patch a critical bug.
- We skipped the scheduled June 20 release due to the London workweek. This seems like it would have been the most likely to affect experiment pings when aligning to the graph.
- We made the July 5 release as scheduled. At this point, the drop in pings was already well underway.
- A release was made on July 15th, in the middle of the problem.
- The next release was not made until July 29, at which point the data seems to have recovered.
- Additional releases were made on August 11 and August 12.

In this attachment, I've annotated a Universal Search ping chart with these dates. It seems unlikely that Test Pilot releases are the issue here.
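[Editor's note: the alignment check described above can be restated mechanically. Using the dip window Rob estimated earlier in the thread (July 7 to August 2) and the release dates listed in this comment, one can check which releases fall inside the window; the dates are from the thread, but the window boundaries are an estimate, not an exact measurement.]

```python
from datetime import date

# Dip window as estimated earlier in the thread (July 7 – August 2, 2016).
dip_start, dip_end = date(2016, 7, 7), date(2016, 8, 2)

# Test Pilot release dates listed in the comment above.
releases = [
    date(2016, 6, 6), date(2016, 6, 9), date(2016, 7, 5),
    date(2016, 7, 15), date(2016, 7, 29),
    date(2016, 8, 11), date(2016, 8, 12),
]

in_dip = [d for d in releases if dip_start <= d <= dip_end]

# Only the July 15 and July 29 releases fall inside the window; no release
# coincides with the dip's onset, consistent with the conclusion that
# Test Pilot releases are unlikely to be the cause.
```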
Comment 6

Was the test-pilot data affected by the partial outage from bug 1286220 & bug 1285621?
This started to hit on July 4 and could at least overlap with issues here.
I don't see a bug for test-pilot getting backfilled as part of bug 1286226.

Updated

a year ago
Priority: -- → P2

Comment 7

6 months ago
Water under the bridge at this point; reopen if we do anything to pursue this.
Status: NEW → RESOLVED
Last Resolved: 6 months ago
Resolution: --- → INCOMPLETE