Closed Bug 1182031 Opened 9 years ago Closed 9 years ago

browser_mozLoop_telemetry.js is going to permafail when the Gecko version number is bumped to 43

Categories

(Hello (Loop) :: Client, defect, P2)

Points:
1

Tracking

(firefox40 unaffected, firefox41 unaffected, firefox42 fixed)

RESOLVED FIXED
mozilla42
Iteration:
42.3 - Aug 10

People

(Reporter: RyanVM, Assigned: standard8)

Attachments

(1 file)

https://treeherder.mozilla.org/logviewer.html#?job_id=9206626&repo=try

51 INFO TEST-UNEXPECTED-FAIL | browser/components/loop/test/mochitest/browser_mozLoop_telemetry.js | TWO_WAY_MEDIA_CONN_LENGTH.BETWEEN_30S_AND_5M - Got 7, expected 3
52 INFO TEST-UNEXPECTED-FAIL | browser/components/loop/test/mochitest/browser_mozLoop_telemetry.js | TWO_WAY_MEDIA_CONN_LENGTH.MORE_THAN_5M - Got undefined, expected 4
57 INFO TEST-UNEXPECTED-FAIL | browser/components/loop/test/mochitest/browser_mozLoop_telemetry.js | SHARING_STATE_CHANGE.BROWSER_ENABLED - 7 === 3 - JS frame :: chrome://mochitests/content/browser/browser/components/loop/test/mochitest/browser_mozLoop_telemetry.js :: test_mozLoop_telemetryAdd_sharing_buckets :: line 76
58 INFO TEST-UNEXPECTED-FAIL | browser/components/loop/test/mochitest/browser_mozLoop_telemetry.js | SHARING_STATE_CHANGE.BROWSER_DISABLED - "undefined" === 4 - JS frame :: chrome://mochitests/content/browser/browser/components/loop/test/mochitest/browser_mozLoop_telemetry.js :: test_mozLoop_telemetryAdd_sharing_buckets :: line 77
Flags: needinfo?(paolo.mozmail)
Yup, a lot of these Loop Telemetry histograms are expiring in 43. The solution is to remove these tests + the histograms and the accumulation code before Firefox 43.
Alternately, we can discuss whether it's ok to postpone the expiry date of these histograms.
Maybe Mike knows whether we still need these histograms, or can relay that to the Loop team?
Flags: needinfo?(paolo.mozmail) → needinfo?(mdeboer)
Well, all true, but it's up to our PM, Romain, to decide whether he wants the Histograms extended to later Fx versions.
Flags: needinfo?(mdeboer) → needinfo?(rtestard)
Iteration: --- → 42.3 - Aug 10
Rank: 21
Flags: qe-verify-
Flags: firefox-backlog+
Priority: -- → P2
Whiteboard: [needed prior to 2015-08-10 for gecko bump]
(In reply to Mike de Boer [:mikedeboer] from comment #3)
> Well, all true, but it's up to our PM, Romain, to decide whether he wants
> the Histograms extended to later Fx versions.

The analysts have not had a chance to set up the environment for data analysis yet; this is being actively worked on in bug 1177137.
These were released with Firefox 39 and we only have 3 weeks' worth of data. Could we postpone the expiry dates to give us more time to collect and analyze the data?
Flags: needinfo?(rtestard)
(In reply to Romain Testard [:RT] from comment #4)
> Could we postpone the expiry dates to give us more time to collect and
> analyze the data?

Vladan, I'm forwarding this question to you.
Flags: needinfo?(vdjeric)
I don't have a problem with bumping the expiry date of this crop of histograms, but it seems like there are a few different issues here:

1. The collected data can already be spot-checked via the dashboard (despite issues described in https://bugzilla.mozilla.org/show_bug.cgi?id=1177137#c5) before doing a Spark analysis

2. These histograms have barely collected any samples during 39/40/41/42 nightly/aurora/beta/release cycles, e.g. 90 pings during two weeks on Beta 40 http://bit.ly/1MkGrfB 
Are these histograms actually able to get you the data you want? 
Is there a bug in the histogram recording code? Or do you think the counts reported in the dashes are incorrect?

3. What's the purpose of keeping these histograms around longer? Are you planning to use the histograms to evaluate fixes for issues found from analyzing the data?
 
http://mzl.la/1MkG8kJ
http://bit.ly/1MkGkk1
Flags: needinfo?(vdjeric)
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #6)
> 2. These histograms have barely collected any samples during 39/40/41/42
> nightly/aurora/beta/release cycles, e.g. 90 pings during two weeks on Beta
> 40 http://bit.ly/1MkGrfB 
> Are these histograms actually able to get you the data you want? 
> Is there a bug in the histogram recording code? Or do you think the counts
> reported in the dashes are incorrect?

Interesting! I'd like Dan & Adam to take a look at this and possibly file a bug to fix this, if necessary!
Flags: needinfo?(dmose)
Flags: needinfo?(adam)
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #6)
> 3. What's the purpose of keeping these histograms around longer? Are you
> planning to use the histograms to evaluate fixes for issues found from
> analyzing the data?

Heh, another question for Le Product Manager!
Flags: needinfo?(rtestard)
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #6)
 > 2. These histograms have barely collected any samples during 39/40/41/42
 > nightly/aurora/beta/release cycles, e.g. 90 pings during two weeks on Beta
 > 40 http://bit.ly/1MkGrfB 

So this bitly URL doesn't work at all for me.  Clicking it in Firefox beta 40 and Chrome each yields different errors in the JS console, making it hard to investigate.  Am I missing something?  (needinfo :vladan)

> > Are these histograms actually able to get you the data you want? 
> > Is there a bug in the histogram recording code? Or do you think the counts
> > reported in the dashes are incorrect?

If we really are seeing only 90 pings for 'LOOP_TWO_WAY_MEDIA_CONN_LENGTH' in beta, that sounds wrong and very much worthy of investigation.  Once I can get functional access to that data, I'll try and debug further.

Just so that we can contextualize that, it'd be helpful to know what the "believed correct" numbers for two-way media connection length are on the Firefox 40 beta channel.  Erik, is that number easy to get? (needinfo :Erik) 

(In reply to Mike de Boer [:mikedeboer] from comment #8)
> (In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #6)
> > 3. What's the purpose of keeping these histograms around longer? Are you
> > planning to use the histograms to evaluate fixes for issues found from
> > analyzing the data?
> 
> Heh, another question for Le Product Manager!

The reasoning for the length of the data collection is almost certainly documented in the bug where they landed; RT probably has that.  However, what is equally important, if I'm understanding things correctly:

As far as I can see, every one of the expiring histograms is opt-out.  The expiration dates were all chosen assuming that opt-out data collection would actually be turned on and working in either 39 or 40.  However, Thomas Huelbert sent out a mail on Friday saying:

"As a result we are not turning on the opt out telemetry probes for the release population in 40. If you have an opt out telemetry probe that you were expecting for the release population in 40, please be aware it won't be available until 41."

Meaning that we almost certainly want to push all of these numbers out anyway.
Flags: needinfo?(vdjeric)
Flags: needinfo?(erik)
Flags: needinfo?(dmose)
> Just so that we can contextualize that, it'd be helpful to know what the
> "believed correct" numbers for two-way media connection length are on the
> Firefox 40 beta channel.  Erik, is that number easy to get? (needinfo :Erik) 

I'm not yet pulling the length telemetry--in fact, I wasn't aware of it until now--but I can give you total entries into the bidirectional state, which should be a reasonable surrogate for your number-of-pings metric. I see 119 sendrecv states for FF40 in the past 24 hours but 1496 for FF39. (You can see the same by pasting "path:rooms AND state:sendrecv AND user_agent_browser.raw:Firefox AND user_agent_version:40" into https://kibana.shared.us-west-2.prod.mozaws.net/index.html#/dashboard/elasticsearch/Loop%20Room%20Creation%20Timeline and adjusting the timespan.)

Those numbers (differing by a factor of 12.5) are about half what I'd expect, since overall hits from FF40 and FF39 differ by a factor of somewhere between 20.1 (/rooms) and 23.8 (/rooms/*).
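As a quick check of the arithmetic above, here is a minimal sketch using only the counts quoted in this comment. The variable names are illustrative, not taken from any Loop or Telemetry code.

```javascript
// sendrecv state counts over the past 24 hours, as quoted above.
const sendrecvFx39 = 1496;
const sendrecvFx40 = 119;

// Overall FF39-to-FF40 hit ratios quoted above for /rooms and /rooms/*.
const overallRatios = [20.1, 23.8];

// Ratio of FF39 to FF40 bidirectional-state entries (roughly 12.6).
const sendrecvRatio = sendrecvFx39 / sendrecvFx40;

// How the sendrecv ratio compares with each overall hit ratio.
const relativeToExpected = overallRatios.map(r => sendrecvRatio / r);

console.log(sendrecvRatio.toFixed(1));
console.log(relativeToExpected.map(x => x.toFixed(2)));
```

The sendrecv ratio comes out at somewhat over half of either overall ratio, which matches the "about half" estimate.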
Flags: needinfo?(erik)
(In reply to Dan Mosedale (:dmose) - needinfo? me for response from comment #9)
> So this bitly URL doesn't work at all for me.  Clicking it in Firefox beta
> 40 and Chrome each yields different errors in the JS console, making it hard
> to investigate.  Am I missing something?  (needinfo :vladan)

Nuts. Our intern is actively working on these dashes and he landed a regression yesterday. We'll get the dash fixed tomorrow and the bit.ly URLs will work again. The mzl.la link works: http://mzl.la/1MkG8kJ

> As far as I can see, every one of the expiring histograms is opt-out.  The
> expiration dates were all chosen assuming that opt-out data collection would
> actually be turned on and working in either 39 or 40.

These histograms expire in 43, but I guess you need opt-out data from more than 2 versions of Release (41 and 42)? That's fine; as I said, there's no problem with bumping the expiry dates if it makes sense to do so.
Flags: needinfo?(vdjeric)
(In reply to Erik Rose [:erik][:erikrose] from comment #10)
> Those numbers (differing by a factor of 12.5) are about half what I'd
> expect, since overall hits from FF40 and FF39 differ by a factor of
> somewhere between 20.1 (/rooms) and 23.8 (/rooms/*).

Is this data from source Loop servers? Or Telemetry?
Flags: needinfo?(erik)
> Is this data from source Loop servers? Or Telemetry?

Loop servers
Flags: needinfo?(erik)
The bitly URLs are working again, but you need to scroll down to Advanced Settings and select "Don't Sanitize". The dash thinks the data is unreliable because there are too few submissions heh

e.g. 27 samples in LOOP_TWO_WAY_MEDIA_CONN_LENGTH reported from Beta 40 on Monday July 20th http://bit.ly/1CVzEpy

How does that # of conversations compare to the server-side numbers for the same day for Beta 40?
(In reply to Mike de Boer [:mikedeboer] from comment #8)
> (In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #6)
> > 3. What's the purpose of keeping these histograms around longer? Are you
> > planning to use the histograms to evaluate fixes for issues found from
> > analyzing the data?
> 
> Heh, another question for Le Product Manager!

We want to understand how users use our product and the impact that new features have on, for instance call duration. So this is not about identification of issues but more about understanding how users use our product and the impact that our changes have on that.

For that reason, looking at Beta users is not good for us (low usage, but also non-typical GA users with different usage patterns). My understanding is that Telemetry is opt-in for GA in 39 and 40, so does that mean we'll only start seeing useful input from Telemetry with 41?
Flags: needinfo?(rtestard)
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #14)
> The bitly URLs are working again, but you need to scroll down to Advanced
> Settings and select "Don't Sanitize". The dash thinks the data is unreliable
> because there are too few submissions heh
> 
> e.g. 27 samples in LOOP_TWO_WAY_MEDIA_CONN_LENGTH reported from Beta 40 on
> Monday July 20th http://bit.ly/1CVzEpy
> 
> How does that # of conversations compare to the server-side numbers for the
> same day for Beta 40?

The server-side numbers show 90 for that day.  Erik and I looked further at other Loop telemetry probes and found that telemetry numbers on beta, release, and nightly opt-out are 1/3 to 2/3 of similar server-side stats.  Our hypothesis is that this can be largely explained by some combination of the tiny sample size being more vulnerable to edge-case skew and non-release cohorts of users being more aware of and interested in opting out of data collection.  I'd be curious to hear your thoughts on that.
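For the single day discussed here (27 Telemetry samples against 90 server-side entries), the comparison can be sketched as follows. The `coverage` helper is hypothetical and exists only for this illustration.

```javascript
// Hypothetical helper: fraction of server-side events that also
// show up in Telemetry for the same period.
function coverage(telemetrySamples, serverSideCount) {
  return telemetrySamples / serverSideCount;
}

// Figures quoted in this thread for Beta 40 on Monday July 20th.
const fraction = coverage(27, 90);

// True when the fraction falls inside the 1/3 to 2/3 band seen on other probes.
const withinBand = fraction >= 1 / 3 && fraction <= 2 / 3;

console.log(fraction.toFixed(2), withinBand);
```

For that one day the fraction sits slightly below the 1/3 lower bound. With a sample this small, that is a weak signal either way.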

(In reply to Mike de Boer [:mikedeboer] from comment #7)
> (In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #6)
> > 2. These histograms have barely collected any samples during 39/40/41/42
> > nightly/aurora/beta/release cycles, e.g. 90 pings during two weeks on Beta
> > 40 http://bit.ly/1MkGrfB 
> > Are these histograms actually able to get you the data you want? 
> > Is there a bug in the histogram recording code? Or do you think the counts
> > reported in the dashes are incorrect?
> 
> Interesting! I'd like Dan & Adam to take a look at this and possibly file a
> bug to fix this, if necessary!

Good idea.  I'm going to spin that out to a separate bug.
Bug 1187561 filed on further validating telemetry probe data.
Depends on: 1141852
The uplift is Monday and nothing has landed here. Should I be concerned? Especially given the lack of guarantees of anything that lands on inbound over the weekend making it to m-c in time for the uplift.
Flags: needinfo?(vdjeric)
This increases the expiry to FF 45, which will stop the tests failing. As discussed, we don't yet have decent information for these, nor do we have confidence in the numbers.

Bug 1187561 is going to attempt to figure out the discrepancy in the numbers before we expire again. Getting some more numbers over the next couple of cycles may also help.

See comment 15 for the usefulness of these numbers for us.
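
To illustrate why moving the expiry from 43 to 45 stops the failures at the version bump, here is a minimal sketch. It assumes expiry is decided by comparing the major version in a histogram's `expires_in_version` field against the current Gecko version. The `isExpired` helper is hypothetical and is not Gecko's actual implementation.

```javascript
// Hypothetical helper, not Gecko's real expiry check.
// Assumption: "never" means the histogram does not expire, and only
// the major version number matters for the comparison.
function isExpired(expiresInVersion, currentVersion) {
  if (expiresInVersion === "never") {
    return false;
  }
  const expiryMajor = parseInt(expiresInVersion, 10);
  const currentMajor = parseInt(currentVersion, 10);
  return currentMajor >= expiryMajor;
}

// Old expiry of 43: live on 42, expired once the tree is bumped to 43.
console.log(isExpired("43", "42.0a1"), isExpired("43", "43.0a1"));

// New expiry of 45: still live after the bump to 43.
console.log(isExpired("45", "43.0a1"));
```

Under these assumptions the histograms keep recording through 43 and 44, so the test's expected bucket counts hold until the next expiry.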
Attachment #8645657 - Flags: review?(vdjeric)
At this point, the fix here is going to have to land directly on mozilla-central and is currently blocking the uplifts from happening.
Attachment #8645657 - Flags: review?(vdjeric) → review+
Assignee: nobody → standard8
Points: --- → 1
Flags: needinfo?(vdjeric)
Flags: needinfo?(adam)
Thanks, Mark!
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Whiteboard: [needed prior to 2015-08-10 for gecko bump]
Target Milestone: --- → mozilla42