Closed Bug 786788 Opened 12 years ago Closed 11 years ago

Distribution Prediction, Build Migration and Early/New Adopters

Categories

(Toolkit :: Telemetry, defect)

normal

Tracking


RESOLVED WONTFIX

People

(Reporter: joy, Assigned: froydnj)

References

Details

(Whiteboard: [leave open])

Attachments

(5 files, 8 obsolete files)

2.25 KB, patch
vladan
: review+
Details | Diff | Splinter Review
2.97 KB, patch
vladan
: review+
Details | Diff | Splinter Review
3.96 KB, patch
Details | Diff | Splinter Review
4.87 KB, patch
Details | Diff | Splinter Review
8.49 KB, patch
Details | Diff | Splinter Review
A bug (see ) requested that:

(1) We be able to 'predict the distribution' of measurements based on a certain
number of days of data. (See, related to (1) of
https://metrics.mozilla.com/projects/browse/METRICS-995, though to be honest I
don't understand this comment; Taras can chime in here.)

(2) A comment hypothesizes that early adopters submit different measurements
compared to later adopters. (See (2) of
https://metrics.mozilla.com/projects/browse/METRICS-995.)

(3) We understand the dynamics of migration from buildid to buildid.
(See https://bugzilla.mozilla.org/show_bug.cgi?id=765010.)

To answer these questions, we did some rough analysis.

Data collected:
- all Nightly builds between 20120702 and 20120801

- For a given buildid, 99% of the submissions come within 21 days of the release of the next
  buildid. Within the first 14 days this is about 96%.

We then collected 15 days of data for 91 buildids, i.e. for each buildid we collected 15 days of
data starting from the day of the buildid. Thus every buildid has 15 days of data.

- The percentage picked up depends on when the build went out: if the build goes out on a Saturday,
  a larger percentage is picked up within 3 days than for a build released on a Friday. This is
  expected, since Friday builds hit the weekend. As a result we should collect 7 days of data;
  depending on the day of release, 7 days picks up about 90% of the data.
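The cumulative percentages above (96% within 14 days, 99% within 21 days, about 90% within 7 days) can be computed with a helper like the following. This is an illustrative sketch, not the actual analysis code; the function name and the sample offsets are made up.

```javascript
// Given submission offsets (days elapsed since the buildid's release),
// return the percentage of submissions that arrived within `days` days.
// Illustrative sketch; the offsets below are made-up sample data.
function percentWithinDays(offsets, days) {
  if (offsets.length === 0) {
    return 0;
  }
  let within = offsets.filter(d => d <= days).length;
  return 100 * within / offsets.length;
}

// Most submissions arrive quickly; a long tail trickles in.
let offsets = [0, 1, 1, 2, 3, 3, 5, 6, 9, 20];
console.log(percentWithinDays(offsets, 7)); // 80
```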


* How does a measurement change based on the number of days of data collected?

We looked at EVENTLOOP_UI_LAG_EXP_MS, comparing its distribution based on the first 3 days, days
4-7, and days 8-11. For this measurement there is *no difference* between the first 3 days, days
4-7, etc.

Keep in mind, data returned in the first 3 days could possibly be from 'early adopters'; by the 4th
day they might have moved on to new builds, hence the data in days 4-7 would be from slower
adopters. However, this need not be the case.

Since the distribution for EVENTLOOP_UI_LAG_EXP_MS does not change, the distribution based on 7
days is the predicted distribution! However, there is no guarantee that this is the case for other
measures (especially ones that depend on the dynamics of the internet).

For example, for DISK_CACHE_CORRUPT_DETAILS, a categorical variable, the distribution for days 4-7
*is* different from the distribution for the first 3 days.

A priori, it is difficult to say whether a variable's distribution will change with the number of days.

** Summary for (1)
So to answer (1), 'predict the distribution': it is not always possible, and ideally we should wait
for 7 days of data (for a given buildid), since that captures about 90% of the data submitted.
Waiting another 14 days for the remaining 10% is probably not worthwhile if we assume that the
extra 10% will not drastically change the distribution.

** Summary for (2) and (3)


In some sense, partitioning the data into the first 3 days, days 4-7, and days 8-11 delineates the
early adopters (the first 3 days) from later adopters (days 4-7). However, people can submit data
in all three periods, or just the first one, or even only the second one (they could have moved
from an old build, in which case they are late adopters). The best inference we can make is: if the
submission date is far removed from the buildid, then the packet is from a late adopter.

However, this doesn't provide any help towards migration: how many users moved from one build to
another, and in how many days. For this I recommend adding a "LastBuildID" field to the packet.
This field is populated if the last buildid is different from json$info$appBuildID; otherwise it is null.

This gives us an idea of how many packets have migrated from one buildid to another.

This still doesn't tell us anything about unique users, so we can add one more field,
"LastSubmissionDate" or "DaysSinceLastSubmission". This tells us how many unique users are on a
given build.

Using these two we can provide migration curves and early/new adopter analysis.
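A minimal sketch, in JavaScript, of how these two fields might be populated on a ping payload. The helper and the shape of `prev` are hypothetical (not the landed patch); the rule is the one stated above: LastBuildID is set only when the previous ping's build differs, otherwise it is absent.

```javascript
// Hypothetical sketch of populating the proposed fields on a ping payload.
// `prev` summarizes the previous ping; null for the very first ping.
function annotatePayload(payload, prev) {
  if (prev) {
    // YYYYMMDD date of the previous submission.
    payload.LastSubmissionDate = prev.submissionDate;
    // Only set when the build actually changed.
    if (prev.appBuildID !== payload.appBuildID) {
      payload.LastBuildID = prev.appBuildID;
    }
  }
  return payload;
}

let ping = annotatePayload(
  { appBuildID: "20120801" },
  { appBuildID: "20120725", submissionDate: "20120730" }
);
console.log(ping.LastBuildID); // "20120725": a build migration
```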
> another in how many days. For this I recommend: add "LastBuildID" field to
> the packet.

> This still doesn't tell us anything about unique users. So we can add one
> more field "LastSubmissionDate" or "DaysSinceLastSubmission". This tells us
> how many unique users on a given build.

Both of these are privacy-preserving. Adding them cannot infringe privacy and can only benefit analyses.

Cheers
Assignee: nobody → nfroyd
OS: Mac OS X → All
Hardware: x86 → All
Just to clarify, you want LastBuildID and LastSubmissionDate only in the first ping from a given session, correct?  So if we land this feature, the first several pings from an updated client would look like:

session ID X, ping 1: no LastBuildID, no LastSubmissionDate
session ID X, ping 2: no LastBuildID, no LastSubmissionDate
...
session ID Y, ping 1: LastBuildID present, LastSubmissionDate present
session ID Y, ping 2: no LastBuildID, no LastSubmissionDate
...
session ID Z, ping 1: LastBuildID present, LastSubmissionDate present
...
Thanks for spending time on this. My comments:

> session ID X, ping 1: no LastBuildID, no LastSubmissionDate

Correct. The feature has landed and this is the first ping subsequent to that. We have no idea of LastSubmissionDate, so this is empty.
Similarly, we have no idea of LastBuildID, so this too is empty.



> session ID X, ping 2: no LastBuildID, no LastSubmissionDate

Now, ping 2 can retrieve information for LastSubmissionDate (i.e. the date of submission (YYYYMMDD) of ping 1), so LastSubmissionDate := date of submission of ping 1.
LastBuildID is only filled if appBuildID != buildID of the previous ping; since this is the same session X, I am assuming appBuildID of ping 2 == appBuildID of ping 1.


...
> session ID Y, ping 1: LastBuildID present, LastSubmissionDate present
Yes 

> session ID Y, ping 2: no LastBuildID, no LastSubmissionDate
Same logic as (session ID X, ping 2): I guess appBuildID cannot change for the same session, but LastSubmissionDate := date of submission of session ID Y, ping 1.


...
> session ID Z, ping 1: LastBuildID present, LastSubmissionDate present

Yes.

Does that sound right?
Hello,

May I know the status of this?
I haven't touched this bug due to constraints elsewhere.  I don't think it's difficult; I've just had other priorities the last two weeks.
I am going to want privacy folks to look at this, though.  Sharing data across sessions as described in comment 3 is enough for me to want an expert to look at this.

What is this giving us that Firefox Health Report does not?
1. FHR does not carry any telemetry data.

2. The only things shared are the last buildid and last submission date. This does not necessarily link packet A to packet B unless there is exactly one packet with the indicated last buildid and last submission date. Even then, one only links the last two packets, and one already has the telemetry data anyway.
:geekboy, this looks fine to me, but maybe I'm missing something. Could you weigh in please?
Flags: needinfo?(sstamm)
Looks fine to me.  Double-checking with Tom.
Flags: needinfo?(sstamm) → needinfo?(tom)
I think that this change is consistent with the privacy statements that we make regarding Telemetry. I do not think that this behavior would violate the privacy expectations of a user who has Telemetry turned on.

However, I'd like to triple check with... just kidding: I think we're good here.
Flags: needinfo?(tom)
Sid and Tom have both signed off, removing p-r-n.
One question about sending LastSubmittedDate timestamps: we're going to run into a situation like:

session X, ping N: sent at time T
<session X shuts down, saves ping N+1 with LastSubmittedDate of T>
<saves LastSubmittedDate of T somewhere else as well>
session Y, ping 1: sent with LastSubmittedDate of T
session Y, sending saved pings: sends session X, ping N+1 with LastSubmittedDate of T

so we're winding up with two pings that both have the same LastSubmittedDate.  Is that going to cause problems for whatever analyses are being run?

The easiest way out of this is to send the first ping from session Y with whatever time session X's N+1 ping was saved at, but that's not quite right for the purposes of analysis either.
Flags: needinfo?(sguha)
This is a cleanup so that it's easier to tell when we're sending current
session pings.  And it's just a nice cleanup in general.
Attachment #690500 - Flags: review?(vdjeric)
Just what it says on the tin.
Attachment #690501 - Flags: review?(vdjeric)
We're going to write out the values for lastSubmittedDate and lastSubmittedBuildID
as a JSON object; we might as well have a generic function that takes care of
all the grotty details.
Attachment #690503 - Flags: review?(vdjeric)
...and finally, what we've all been waiting for.
Attachment #690504 - Flags: review?(vdjeric)
Of course the object destructuring syntax doesn't work when assigning to this.FOO.
Attachment #690503 - Attachment is obsolete: true
Attachment #690503 - Flags: review?(vdjeric)
Attachment #690554 - Flags: review?(vdjeric)
Attachment #690504 - Attachment is obsolete: true
Attachment #690504 - Flags: review?(vdjeric)
Attachment #690555 - Flags: review?(vdjeric)
(In reply to Nathan Froyd (:froydnj) from comment #12)
> One question about sending LastSubmittedDate timestamps: we're going to run
> into a situation like:
> 
> session X, ping N: sent at time T
> <session X shuts down, saves ping N+1 with LastSubmittedDate of T>
> <saves LastSubmittedDate of T somewhere else as well>
> session Y, ping 1: sent with LastSubmittedDate of T
> session Y, sending saved pings: sends session X, ping N+1 with
> LastSubmittedDate of T
> 
> so we're winding up with two pings that both have the same
> LastSubmittedDate.  Is that going to cause problems for whatever analyses
> are being run?
> 
> The easiest way out of this is to send the first ping from session Y with
> whatever time session X's N+1 ping was saved at, but that's not quite right
> for the purposes of analysis either.

Yes, I had realized this earlier but hadn't gotten around to commenting on it.

The uses of this data are:
1) Typical inter-arrival time of the session pings: how often is the browser being used.
2) Typical time since the last build version: the dynamics of the shift from build to build (there have been concerns about 'slow adopters').

So,

LastSubmissionDate := date of the last idle-daily ping submission
LastBuild :=  The last buildid (present if not equal to current build id)

So for 
Caveats:
I should point out two oversights in comment [1]:

1. LastSubmissionDate uses the same theory as Days Since Last Ping (see [1] and [2]). However, though the theory was good, there have been issues counting the "unique number of users" (see [3]), so getting the correct unique pings might or might not happen.
2. LastBuild can be used to tag sessions as coming from fast migrators or not.



One last request: do you think it's possible to include in every ping

TotalNumberOfSubmittedSessionsOnThisBuild

This count is inclusive of the current ping.

Why: if the last build is old, we might think this session comes from an installation with infrequent use. TotalNumberOfSubmittedSessionsOnThisBuild indicates activity on this build. This corresponds to totalPingCount of [4].

Use:
a) Only to segment/profile sessions (based on histograms/info vars) according to whether they come from actively used installations or not.
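A minimal sketch of the proposed per-build counter, under the semantics described above (reset when the build changes, inclusive of the current ping). The function name and state layout are illustrative, not from any patch.

```javascript
// Hypothetical sketch of TotalNumberOfSubmittedSessionsOnThisBuild:
// a per-build counter that resets when the build changes and is
// inclusive of the current ping. The state layout is illustrative.
function nextSessionCount(state, appBuildID) {
  if (state.buildID !== appBuildID) {
    state.buildID = appBuildID;
    state.count = 0;
  }
  state.count += 1; // inclusive of the current ping
  return state.count;
}

let counter = { buildID: null, count: 0 };
console.log(nextSessionCount(counter, "20120801")); // 1
console.log(nextSessionCount(counter, "20120801")); // 2
console.log(nextSessionCount(counter, "20120802")); // 1 (new build)
```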

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=616835
[2] https://blog.mozilla.org/metrics/2011/04/13/using-the-new-days-last-ping-metric-to-look-at-firefox-4-downloads/
[3] https://bugzilla.mozilla.org/show_bug.cgi?id=677617
[4] https://bugzilla.mozilla.org/show_bug.cgi?id=620837
Flags: needinfo?(sguha)
(In reply to Saptarshi Guha from comment #20)
> (In reply to Nathan Froyd (:froydnj) from comment #12)
> > so we're winding up with two pings that both have the same
> > LastSubmittedDate.  Is that going to cause problems for whatever analyses
> > are being run?
> 
> Yes I had realized this earlier but hadn't gotten around to commenting on it.

OK, I think your comment suggests it's OK to have two pings with the same LastSubmissionDate.

Though I'm not sure about:

> The uses of this data is to
> 1) Typical inter-arrival time of the session pings: how often is the browser
> being used
> 2) Typical time since last build version - dynamics of shift from build to
> build (there have been concerns about 'slow adopters')
> 
> So,
> 
> LastSubmissionDate := date of the last idle-daily ping submission
> LastBuild :=  The last buildid (present if not equal to current build id)
> 
> So for 
> Caveats:

It looks like this bit got cut off, so I'm not sure...

> One last request, do you think it's possible to include in every ping
> 
> TotalNumberOfSubmittedSessionsOnThisBuild
> 
> this count is inclusive of the current ping.

That's pretty easy to add.  I'll add that as another patch.
> OK, I think your comment suggests it's OK to have two pings with the same
> LastSubmissionDate.
> 

Yes; basically a day has at most one idle-daily, and it is tagged with characteristics of the last idle-daily.




> Though I'm not sure about:
> 
> > The uses of this data is to
> > 1) Typical inter-arrival time of the session pings: how often is the browser
> > being used
> > 2) Typical time since last build version - dynamics of shift from build to
> > build (there have been concerns about 'slow adopters')
> > 
> > So,

I just wanted to clarify what we can and cannot do with these two metrics.

lastBuild
----------

If we do a date subtraction, i.e. PingSubmissionDate - lastBuildConvertedtoDate, then the histograms can be profiled by a rough indicator of how fast an installation moves from build to build. If this difference is large, then the submission could have come from a slow adopter. Now, this is not entirely true, because the large difference might be an outlier in the "slow adopter"'s history...
(FHR contains this history)
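As a sketch of the date subtraction described here (assuming Nightly buildids begin with the build date as YYYYMMDD; the helper name is hypothetical):

```javascript
// Sketch of the date subtraction above: take the leading YYYYMMDD of a
// buildid (Nightly buildids start with the build date) and of a
// submission date, and compute the gap in whole days.
function daysBetween(submissionYYYYMMDD, buildid) {
  function toUTC(s) {
    return Date.UTC(+s.slice(0, 4), +s.slice(4, 6) - 1, +s.slice(6, 8));
  }
  const MS_PER_DAY = 24 * 60 * 60 * 1000;
  return (toUTC(submissionYYYYMMDD) - toUTC(buildid)) / MS_PER_DAY;
}

// A submission on 2012-07-15 against build 20120702030542: 13 days.
console.log(daysBetween("20120715", "20120702030542")); // 13
```

A large result would suggest (roughly) a slower adopter, with the outlier caveat noted above.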

lastSubmissionDate
------------------
lastSubmissionDate can be used to give a snapshot of the activity of the installation that sent this session ping. If the last submission date was a day ago, the installation was used yesterday.... Theoretically it can be used to count the unique number of session pings, but my references above discuss some unexplainable glitches in the engineering.
 

Hope that helps, and thanks again for your time on this and on TotalNumberOfSubmittedSessionsOnThisBuild.
By the way, just wanted to clarify:

TotalNumberOfSubmittedSessionsOnThisBuild is the total number of sessions submitted (:= saved_sessions + idle_daily) for that build.
Comment on attachment 690500 [details] [diff] [review]
part 1 - let the uuid always be the slug and store the reason separately

What was the original motivation for the slug being set to the ping reason in test pings? To prevent test pings from accidentally getting submitted to Telemetry?
Attachment #690500 - Flags: review?(vdjeric) → review+
Attachment #690554 - Flags: review?(vdjeric) → review+
Comment on attachment 690555 [details] [diff] [review]
part 4 - persist lastSubmittedDate and lastSubmittedBuildID to files

>+  loadLastSubmittedValues: function loadLastSubmittedValues() {
>+    let file = this.lastSubmittedValuesFile();
>+    let channel = NetUtil.newChannel(file);
>+    channel.contentType = "application/json";
>+
>+    NetUtil.asyncFetch(channel, function(stream, result) {

Couldn't we just read these two values from the most recent savedPing?

>+        let v = JSON.parse(string);

Nit: use more descriptive variable name
Attachment #690555 - Flags: review?(vdjeric)
Comment on attachment 690501 [details] [diff] [review]
part 2 - send lastSubmittedDate and lastAppBuildID in telemetry pings

>+  // The appBuildID with which we successfully sent our last ping.
>+  _lastAppBuildID: null,
..
>     function onSuccess() {
>+      if (data.slug == this._uuid) {
>+        this._lastPingDate = new Date();
>+        this._lastAppBuildID = Services.appinfo.appBuildID;
>+      }
>       this.sendPingsFromIterator(server, reason, i);
>     }
..
>+    if (this._lastAppBuildID) {
>+      payloadObj.simpleMeasurements.lastSubmittedBuildID = this._lastAppBuildID;
>+    }
>+

Wouldn't this patch result in the lastSubmittedBuildId field being set on the 2nd ping from the same session? I thought Saptarshi only wanted that field set when the previous ping was from a different build?
Attachment #690501 - Flags: review?(vdjeric)
Comment on attachment 690502 [details] [diff] [review]
part 2.5 - add tests for lastSubmittedDate and lastSubmittedBuildID

Are we implementing the "TotalNumberOfSubmittedSessionsOnThisBuild" field in this bug? 

Also, isn't it potentially de-anonymizing to include that field, since there is now a counter linking a user's sessions?

e.g. If a user restarts an old build of Firefox an unusually high number of times (let's say 200), it's now possible to link his browsing sessions to each other and figure out their order. His sessions will have the same build ID + this new field incrementing sequentially
Attachment #690502 - Flags: review?(vdjeric)
> Wouldn't this patch result in the lastSubmittedBuildId field being set on
> the 2nd ping from the same session? I thought Saptarshi only wanted that
> field set when the previous ping was from a different build?

It doesn't matter if it is present. I only suggested that to save space, but if you're okay with setting it, I don't mind.

If lastSubmittedBuildID == appBuildID, then all we know is that this session and the last session are from the same build.
Keep in mind telemetry submissions can already be fingerprinted using addon combinations, OS, arch, and graphics adapters. So it is possible to use addon combination, OS, arch, graphics adapters, and submission dates together to chain these. Our intern Bing Han gave an intern presentation using this method for Crash Reports (without using the stack info).

Also, the blocklist ping has the total pings sent on that version, so the argument carries over to the blocklist ping too: one could possibly chain those pings.

So yes, this can be used to devise algorithms to create a fingerprint, but algorithms to do that already exist with the data we have. Do keep in mind, we are tracking a collection of metrics (OS, build, arch, ...) and not a person. There is no data to tie this back to a person or even a device.

Also, none of these algorithms (even with this count) is perfect; very, very far from it. We can't create a reliable fingerprinting scheme (i.e. every algorithm will cause sessions from different installations to be identified as the same installation, and the other way around too).


> Also, isn't it potentially de-anonymizing to include that field, since there
> is now a counter linking a user's sessions?
> 
> e.g. If a user restarts an old build of Firefox an unusually high number of
> times (let's say 200), it's now possible to link his browsing sessions to
> each other and figure out their order. His sessions will have the same build
> ID + this new field incrementing sequentially
(In reply to Vladan Djeric (:vladan) from comment #24)
> What was the original motivation for the slug being set to the ping reason
> in test pings? To prevent test pings from accidentally getting submitted to
> Telemetry?

I think that's part of it.  It also makes the implementation<->test interface slightly simpler if you're always sending to a fixed URL.  (Otherwise the test has to dig out whatever the UUID is going to be--or maybe it's possible to register a catch-all HTTP handler...dunno.)

(In reply to Vladan Djeric (:vladan) from comment #26)
> >+  // The appBuildID with which we successfully sent our last ping.
> >+  _lastAppBuildID: null,
> ..
> >     function onSuccess() {
> >+      if (data.slug == this._uuid) {
> >+        this._lastPingDate = new Date();
> >+        this._lastAppBuildID = Services.appinfo.appBuildID;
> >+      }
> >       this.sendPingsFromIterator(server, reason, i);
> >     }
> ..
> >+    if (this._lastAppBuildID) {
> >+      payloadObj.simpleMeasurements.lastSubmittedBuildID = this._lastAppBuildID;
> >+    }
> >+
> 
> Wouldn't this patch result in the lastSubmittedBuildId field being set on
> the 2nd ping from the same session? I thought Saptarshi only wanted that
> field set when the previous ping was from a different build?

Mmm.  I see.  Yes, this logic needs to be twiddled a bit.

(In reply to Vladan Djeric (:vladan) from comment #25)
> Comment on attachment 690555 [details] [diff] [review]
> part 4 - persist lastSubmittedDate and lastSubmittedBuildID to files
> 
> >+  loadLastSubmittedValues: function loadLastSubmittedValues() {
> >+    let file = this.lastSubmittedValuesFile();
> >+    let channel = NetUtil.newChannel(file);
> >+    channel.contentType = "application/json";
> >+
> >+    NetUtil.asyncFetch(channel, function(stream, result) {
> 
> Couldn't we just read these two values from the most recent savedPing?

We don't know which saved ping that is: there could be multiple pings that we read in.  We could use timestamps of the ping files to determine what's the most recent, but that gets messy, IMHO (and possibly incorrect).

> >+        let v = JSON.parse(string);
> 
> Nit: use more descriptive variable name

OK.

(In reply to Vladan Djeric (:vladan) from comment #27)
> Are we implementing the "TotalNumberOfSubmittedSessionsOnThisBuild" field in
> this bug? 

Sure.  I thought that could be an easy follow-on patch, but if you want me to roll it into the current patch series, I can do that.

> Also, isn't it potentially de-anonymizing to include that field, since there
> is now a counter linking a user's sessions?

I think Saptarshi's rebuttal of this question is correct.  Also, if we wanted a monotonically increasing ID for sessions, we can already get that from the server logs.  So I don't think there's any harm in including it.  But, in the spirit of this bug, let's double-check with Tom. ;)
Flags: needinfo?(tom)
I think I got the logic correct now.
Attachment #690501 - Attachment is obsolete: true
Attachment #691953 - Flags: review?(vdjeric)
Tests needed to be updated, since we're not sending lastSubmittedBuildID anymore.
Attachment #690502 - Attachment is obsolete: true
Attachment #691954 - Flags: review?(vdjeric)
Variable renamed.  Everything else stays the same because I think there are
good reasons for keeping things as a separate file, as mentioned already.
Attachment #690555 - Attachment is obsolete: true
Attachment #691958 - Flags: review?(vdjeric)
I think that Telemetry is plenty fingerprintable as it is, so "TotalNumberOfSubmittedSessionsOnThisBuild" (or "TNOSSOTB" as I like to call it, for short) doesn't make a material difference. We could, in principle, reliably assess whether two Telemetry payloads are from the same source with or without TNOSSOTB, and our reliability only gets a *little* bit better with it. Besides, we're not trying to re-identify particular users through their Telemetry, and we're the only ones with this data, so that risk is pretty darn low.

tl;dr: little marginal risk, consistent with the spirit of the feature, doesn't violate any user's expectations. Go for it.
Flags: needinfo?(tom)
Comment on attachment 691953 [details] [diff] [review]
part 2 - send lastSubmittedDate and lastSubmittedBuildID in telemetry pings

>+  // The last Date on which we successfully sent a ping.
>+  _lastPingDate: null,
>+  // The appBuildID with which we successfully sent our last ping.
>+  _lastSubmittedBuildID: null,
.. 
>+    if (this._lastPingDate) {
>+      payloadObj.simpleMeasurements.lastSubmittedDate = this._lastPingDate.getTime();

getTime returns epoch time in milliseconds instead of a calendar date. This clashes with the "lastSubmittedDate" field name, and it causes us to include unnecessary precision. I'm aware I'm being a stickler ;)

My concerns about the counter-like behavior of the "TNOSSOTB" field arose under the assumption that we didn't already keep server logs that record submission times for individual pings.
Attachment #691953 - Flags: review?(vdjeric) → review-
Comment on attachment 691954 [details] [diff] [review]
part 2.5 - add tests for lastSubmittedDate and lastSubmittedBuildID

>--- a/toolkit/components/telemetry/tests/unit/test_TelemetryPing.js
>+++ b/toolkit/components/telemetry/tests/unit/test_TelemetryPing.js
>@@ -22,16 +22,17 @@ const PATH = "/submit/telemetry/test-ping";
> const SERVER = "http://localhost:4444";
> const IGNORE_HISTOGRAM = "test::ignore_me";
> const IGNORE_HISTOGRAM_TO_CLONE = "MEMORY_HEAP_ALLOCATED";
> const IGNORE_CLONED_HISTOGRAM = "test::ignore_me_also";
> const ADDON_NAME = "Telemetry test addon";
> const ADDON_HISTOGRAM = "addon-histogram";
> const FLASH_VERSION = "1.1.1.1";
> const SHUTDOWN_TIME = 10000;
>+const APP_BUILD_ID = "2007010102";

What's the purpose of this global var now?

By the way, we could also add a test that confirms the lastSubmittedBuildId field is not present.
Attachment #691954 - Flags: review?(vdjeric)
Comment on attachment 691958 [details] [diff] [review]
part 4 - persist lastSubmittedDate and lastSubmittedBuildID to files

>+  loadLastSubmittedValues: function loadLastSubmittedValues() {
>+    let file = this.lastSubmittedValuesFile();
>+    let channel = NetUtil.newChannel(file);
>+    channel.contentType = "application/json";
>+
>+    NetUtil.asyncFetch(channel, function(stream, result) {

Nit: rename to asyncLoadLastSubmittedValues
Attachment #691958 - Flags: review?(vdjeric) → review+
By the way, if you add tests that check values read from TelemetryLastSubmittedValues.txt (which I think would be a good idea), you will have to change your interfaces to take a callback function (see bug 815709).. Otherwise you risk creating an intermittent orange from tests checking the payload for fields that haven't yet been asynchronously fetched from disk.

Similarly, this patch also introduces the (very unlikely) possibility of "incorrect" field values in about:telemetry, but I don't think that's a big deal.
I would like to confirm my understanding of LastSubmissionDate.

Day 0: Installation had exactly one session, lasting 3 hrs. Idle-daily sent.

Day 1:
- One idle-daily is sent; no saved sessions are sent because none exist (from Day 0). (lastsubmissiondate == Day 0)
- Installation had 5 sessions on this day.

Day 2:
- One idle-daily sent (lastsubmissiondate == Day 1).
- 5 saved sessions sent (lastsubmissiondate == Day 0).

See how the last submission date for the 5 saved sessions sent on Day 2 is Day 0: that's because the saved sessions being sent on Day 2 actually occurred on Day 1.

This sounds reasonable, right? Is this how it's implemented?
That sounds reasonable and I believe this is how the code is implemented. Nathan can confirm
Flags: needinfo?(nfroyd)
I am unclear why there are no saved sessions sent on Day 1, unless you meant that the Day 0 session carried over into Day 1.  I'm also not clear on how the timing of the sessions on Day 1 and Day 2 works.  Nevertheless, here's an answer to the question that I think you're trying to ask:

The lastsubmissiondate for the 5 saved sessions on Day1 depends on when they were saved relative to the idle daily ping that you say occurred on Day1.  So if you had something like this on Day1:

- session with idle-daily
- saved-session 1
- saved-session 2
- saved-session 3
- saved-session 4
- saved-session 5

then those saved-session pings would reflect that the lastsubmissiondate was on Day1.  However, if you had this pattern instead:

- saved-session 1
- saved-session 2
- saved-session 3
- session with idle-daily
- saved-session 4
- saved-session 5

then the first three saved-session pings would have a lastsubmissiondate of Day0, while the last two saved-session pings would have a lastsubmissiondate of Day1.

Does that answer your question?
Flags: needinfo?(nfroyd)
(In reply to Nathan Froyd (:froydnj) from comment #41)
> I am unclear why there's no saved sessions sent on Day1, unless you meant

My mistake. Day 1 indeed has a saved session corresponding to the 3 hr session of Day 0.


As for the rest, this implementation does match comment 20. However, I don't think I was precise. The date of the last idle-daily for saved-sessions is the date of the last idle-daily not on the same day as the current saved-session.

So for case 1:
date of last submission  = Day 0

And for case 2:
date of last submission = Day 0


The reason for this is:

lastsubmissiondate gives a rough idea (and a very rough one) of the activeness of the installation that generated a session. For case 1, sending Day 1 would not tell us whether the last day of use was 3 days ago or 1 day ago.

Moreover, on the server, case 1 and case 2 look the same; the submission dates are at the DDMMYYYY level, so any ordering looks the same.

Cheers
Sapsi
(In reply to Saptarshi Guha from comment #42)
> As for the rest, this implementation does match comment 20. However i dont
> think i was precise. The date of the last idle-daily for saved-sessions is
> the date of the last idle-daily not on the same day as the current
> saved-session.
> 
> So for case 1:
> date of last submission  = Day 0
> 
> And for case 2:
> date of last submission = day 0
> 
> The reason for this is:
> 
> the lastsubmissiondate gives a rough idea (and very rough) of the activeness
> of the installation that generated a session. For case 1, sending the Day 1
> would not tell us anything if the last day of use was 3 days ago or 1 day
> ago.

Sorry, I'm trying to fit all the discussion we were having back into my head today.  So for:

Day 0 (the first session after an updated binary with this bug fixed has been installed)

session A: idle-daily ping
session B: no ping, saved-session
session C: no ping, saved-session

Day 1:

session D: no ping, saved-session
session E: no ping, saved-session

Day 2:

session F: no ping, saved-session
session G: idle-daily ping, saved-session

Day 3:

session H: no ping, saved session

Day 4:

session I: idle-daily, carries over to the next day

Day 5:

session I (continued): idle-daily, saved session
session J: no ping, saved-session

Day 6:

session K: no ping, saved-session

Apologies for the length of the example; I wanted to try to get all the bases covered.  Let's enumerate all the values that lastsubmissiondate (hereafter LSD) will take on:

Day 0:

session A: no LSD
session B: no LSD (saved-session of the same day as idle-daily)
session C: no LSD (likewise)

Day 1:

session D: LSD = Day 0 (from session A)
session E: LSD = Day 0 (likewise)

Day 2:

session F: LSD = Day 0 (likewise)
session G: LSD = Day 0 (likewise)

Day 3:

session H: LSD = Day 2 (from session G)

Day 4:

session I: LSD = Day 2 (likewise)

Day 5:

session I (continued): LSD = Day 2 (likewise)
session J: LSD = Day 4 (from the start of session I)

Day 6:

session K: LSD = Day 5 (from the end of session I)

Are all these correct?
Flags: needinfo?(sguha)
Hello,

Thanks for this detailed example. I guess I assumed that idle-daily would be sent every day an installation was active; however, that is not the case. As can be seen from Day 1, no idle-daily was sent despite the installation being active. Hence on Day 2, LSD is Day 0, though in some sense we would like it to be Day 1.

*Objective*: every submitted session has a DDMMYYYY (hereon called a date) that identifies the **day** of last use of that installation (last active date = LAD).

Implementation: the same for both SS and ID. Every session, S, has a start time, say t1 (DDMMYYYY), and an end time, t2.
We assume no two sessions for a profile can overlap.

Case 1.
Assume t1 and t2 both have the same DDMMYYYY.
Then the LAD is the last day a session occurred (it need not have been submitted to the server). That is, find the most recent date with a session, excluding dates equal to this S's DDMMYYYY. Thus the LAD has to be strictly less than the DDMMYYYY of S's t1.

e.g. if on Monday there were two sessions and no ping was sent, and on Tuesday there were 4 sessions, one of which was an idle-daily that was sent, then the LADs for all of Tuesday's 4 sessions would correspond to Monday.

Case 2:
Assume the DDMMYYYY of t1 is day 1 and the DDMMYYYY of t2 is the following day.
Since the session started on day 1, the LAD for this session will be strictly less than day 1. Thus the last active date is defined with respect to the start time of a session.

So for the above example


Day 0:
No LAD for any of A, B, or C, because we haven't saved any last active dates yet; these are the first sessions with the fix in place.

session A: no LAD
session B: no LAD
session C: no LAD (likewise)

Day 1:

session D: LAD = Day 0 (because there was at least one session on Day 0)
session E: LAD = Day 0 (likewise)

Day 2:

session F: LAD = Day 1 (likewise)
session G: LAD = Day 1 (likewise)

Day 3:

session H: LAD = Day 2 

Day 4:

session I: LAD = Day 3 

Day 5:

session I (continued): LAD = Day 3 (I assume session I was not sent on Day 4, but was sent on Day 5?)
session J: LAD = Day 4 (because there was a session with a start date on Day 4)

Day 6:

session K: LAD = Day 5
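The LAD rule in the two cases above can be sketched as follows. This is a hypothetical illustration only, not the actual patch code; the session records and the dayOf() helper are invented for the example:

```javascript
// Hypothetical sketch of the LAD rule discussed above -- not the
// actual TelemetryPing code. Sessions are sorted by start time and
// assumed not to overlap.
function dayOf(ms) {
  // Collapse a millisecond timestamp to a whole-day index (UTC).
  return Math.floor(ms / 86400000);
}

// sessions: array of { start: msTimestamp }, sorted by start time.
// Returns, for each session, the start day of the most recent prior
// session that did not start on the same day (null if none exists).
function lastActiveDates(sessions) {
  let lads = [];
  for (let i = 0; i < sessions.length; i++) {
    let currentDay = dayOf(sessions[i].start);
    let lad = null;
    for (let j = i - 1; j >= 0; j--) {
      if (dayOf(sessions[j].start) < currentDay) {
        lad = dayOf(sessions[j].start);
        break;
      }
    }
    lads.push(lad);
  }
  return lads;
}
```

Running this on the session start days from the example (A through K) reproduces the LAD values listed above.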

Sorry for the confusion. My misunderstanding caused this needless mess. I hope this definition is simpler to understand and implement, and that I've been precise.


Cheers
Saptarshi
Flags: needinfo?(sguha)
(In reply to Saptarshi Guha from comment #44)
> *Objective*: every submitted session has a DDMMYYYY (hereon called date)
> that identifies the **day** of last use of that installation (last active
> date=LAD)
> 
> Implementation: same for both  SS and ID. Every session,S, has a start time,
> say t1 (DDMMYYYY) and end end time t2.
> We assume no two sessions for a profile can overlap.
> 
> Case 1. 
> Let's assume t1 and t2 both have the same DDMMYYYY
> Then the LAD is the last day a session occurred (need not have been
> submitted to server). That is find the most recent date (excluding the
> current one that is equal to this S's DDMMYYYY) with a session. Thus LAD has
> to be strictly less than DDMMYYYY of S's t1.
> 
> e.g. on monday, there were two sessions and no ping was sent, and on tuesday
> 4 sessions and one was idle-daily which was sent, all the LADs for tuesdays
> 4 sessions would correspond to Monday.
> 
> Case 2:
> Let's assume DDMMYYYY of t1 is on day 1 and DDMMYYYY of t2 is the following
> day.
> Since the session started on day 1, the LAD for this session will be
> strictly less than day 1. Thus the last active date is wrt to the start time
> of a session.

I think this is easier to implement, thanks for the clarifications.  If I'm not mistaken, these cases can be summed up as:

"The LAD for a session is the start date of the latest session prior to the current session that did not start on the same day as the current session."

Do you agree?  If we're in agreement, then applying the same rules to the build id, e.g.:

"The LastBuildId for a session is the build id of the latest session prior to the current session that did not start on the same day as the current session."

is OK?  I'm a little unclear on the rules for the build id because before this we were talking about the last submitted build id, but the above description says nothing about telemetry submissions for LAD and suggests that we apply the same rules to LBID ("Implementation: same for both  SS and ID...")

Just one little nit to point out:

> Day 4:
> 
> session I: LAD = Day 3 
> 
> Day 5:
> 
> session I (continued): LAD = Day 3 ( I assume session I was not sent on
> Day4, but was sent on Day 5?) 

You could have a ping sent in on day 4 and a ping sent in on day 5 from a single session.  Even so, I believe the answer you gave above is correct.
Flags: needinfo?(sguha)
> I think this is easier to implement, thanks for the clarifications.  If I'm
> not mistaken, these cases can be summed up as:
> 
> "The LAD for a session is the start date of the latest session prior to the
> current session that did not start on the same day as the current session."
> 
> Do you agree?

Yes. Let's change 'latest session' to 'most recent session'. Same thing either way; I think it's easier to parse, but that's just me, and I'm not fixated on the language.


>  If we're in agreement, then applying the same rules to the
> build id, e.g.:
> 
> "The LastBuildId for a session is the build id of the latest session prior
> to the current session that did not start on the same day as the current
> session."
> 
> is OK?  I'm a little unclear on the rules for the build id because before
> this we were talking about the last submitted build id, but the above
> description says nothing about telemetry submissions for LAD and suggests
> that we apply the same rules to LBID ("Implementation: same for both  SS and
> ID...")
> 

True, we hadn't discussed BuildID, but your definition makes sense.


By ("Implementation: same for both  SS and ID...") I was speaking in reference to the LAD implementation for saved-session (SS) and idle-daily (ID).



> Just one little nit to point out:
> 
> > Day 4:
> > 
> > session I: LAD = Day 3 
> > 
> > Day 5:
> > 
> > session I (continued): LAD = Day 3 ( I assume session I was not sent on
> > Day4, but was sent on Day 5?) 
> 
> You could have a ping sent in on day 4 and a ping sent in on day 5 from a
> single session.  Even so, I believe the answer you gave above is correct.

You mean a single session can send two pings? Will it have the same session id? Then it will overwrite itself in HBase.
Flags: needinfo?(sguha)
Hi,

Suppose that after this lands, ten weeks later when every running nightly Firefox has this fix, what will be the value of last buildid for a completely fresh nightly install? We can assume that telemetry is turned on.

Will it be missing? Will it be missing if and only if the installation is a fresh new download of nightly with no preexisting profile?
Reworked patch, which I believe implements the algorithms we have discussed.
Attachment #691953 - Attachment is obsolete: true
Attachment #706341 - Flags: review?(vdjeric)
Attachment #691958 - Attachment is obsolete: true
Attachment #706342 - Flags: review?(vdjeric)
Attachment #691954 - Attachment is obsolete: true
Attachment #706343 - Flags: review?(vdjeric)
Comment on attachment 706342 [details] [diff] [review]
part 4 - persist lastActive{SessionDate,BuildID} to files

>+  // Whether we have written the LastActiveValues file this session.
>+  _haveWrittenLastActiveValuesFile: false,

Nit: The name is long + the "active values file" term isn't very clear. How about "_isSessionInfoPersisted" and "loadLastSessionInfo()"?

>     function onSuccess() {
>+      if (data.slug == this._uuid &&
>+          !this._haveWrittenLastActiveValuesFile) {
>+        this.saveLastActiveValues();
>+      }
>       this.sendPingsFromIterator(server, reason, i);
>     }

So this would write out the session info only on idle-daily? Why not save session info when saving the current session's ping on shutdown?

>+    NetUtil.asyncFetch(channel, (function(stream, result) {
>+      if (!Components.isSuccessCode(result)) {
>+        return;
>+      }
>+      try {
>+        let string = NetUtil.readInputStreamToString(stream,
>+                                                     stream.available(),
>+                                                     { charset: "UTF-8" });
>+        stream.close();
>+        let obj = JSON.parse(string);
>+        this._lastActiveSessionDate = new Date(obj._lastActiveSessionDate);
>+        this._lastActiveBuildID = obj._lastActiveBuildID;
>+      } catch (e) {
>+        stream.close();
>+        file.remove(true);
>+      }
>+    }).bind(this));

IIUC, this would do main-thread I/O :(

>+    this.saveObjectToFile(obj, file, /*sync=*/true, /*overwrite=*/true,
>+                          (function(success, ostream) {
>+                            this._haveWrittenLastActiveValuesFile = true;
>+                            ostream.close();
>+                          }).bind(this));
>+  },

Where is saveObjectToFile defined?
Attachment #706342 - Flags: review?(vdjeric)
Comment on attachment 706343 [details] [diff] [review]
part 5 - add tests for lastActive{SessionDate,BuildID}

- It would be nice to have tests that check that lastActiveSessionDate and lastActiveBuildID ignore any builds & sessions from "today"
- Related to my earlier comment, could we trigger the writing of the session info file by calling nsITelemetryPing's saveHistograms()?
Attachment #706343 - Flags: review?(vdjeric)
Comment on attachment 706341 [details] [diff] [review]
part 2 - send lastActive{SessionDate,BuildID} in telemetry pings

LGTM, but see next comment
Attachment #706341 - Flags: review?(vdjeric)
I re-read the comments on this bug before doing the code review, and I realized I'm a little bit fuzzy on the motivations for this bug. I apologize for digressing and possibly causing us to re-hash earlier concerns, but I want to make sure our implementation will be able to answer the questions we want answered.

Back in comment 1, Saptarshi spoke of 3 things:

1) Taras(?) wanted the dash to predict the final distribution in a histogram for a given build ID. Saptarshi answered this in the same comment: "it is not always possible and ideally, we should wait for 7 days of data". This makes sense to me.

2) There was a hypothesis that early adopters report different Telemetry numbers than later adopters and Saptarshi showed that this is sometimes the case for some histograms.

Question #2a: Are we still trying to verify this hypothesis?
Question #2b: Are we trying to track Telemetry from early adopters separately from later adopters? If so, why? 

3) Comment 1 also talked about tracking adoption & desertion trends. If I understand correctly, FHR will provide us with this data as well. 

Question #3: Why is the pairing of adoption data with Telemetry data more valuable than just the FHR adoption data alone?

---------

Two other concerns:

In comment 22, Saptarshi said:
> if we do a date subtraction i.e. PingSubmissionDate - lastBuildConvertedtoDate
> then the histograms can be profiled by some rough indicator of rough fast
> installation moves from build to build. If this differences is large, then the
> submission could have come from a slow adopter.

Question #4: Assuming we want to distinguish early adopters from late adopters in Telemetry reports, we can write a patch that provides this info directly. i.e. we can have a field "isEarlyAdopter" that is true only if the current build was installed within 3 days of its release. Would this be preferable?

Also in comment 22, Saptarshi said:
> lastSubmissionDate can be used to give a snapshot the activity of the
> installation (that sent this session ping). If last submission date was a day 
> ago, the installation was used yesterday.

Question #5: I'm certain FHR provides this information already. Why combine this info with Telemetry pings?

Thank you for your patience :)
Flags: needinfo?(sguha)
(In reply to Vladan Djeric (:vladan) from comment #56)
> I re-read the comments on this bug before doing the code review, and I
> realized I'm a little bit fuzzy on the motivations for this bug. I apologize
> for digressing and possibly causing us to re-hash earlier concerns, but I
> want to make sure our implementation will be able to answer the questions we
> want answered.
> 
> Back in comment 1, Saptarshi spoke of 3 things:
> 
> 1) Taras(?) wanted the dash to predict the final distribution in a histogram
> for a given build ID. Saptarshi answered this in the same comment: "it is
> not always possible and ideally, we should wait for 7 days of data". This
> makes sense to me.
> 
> 2) There was a hypothesis that early adopters report different Telemetry
> numbers than later adopters and Saptarshi showed that this is sometimes the
> case for some histograms.

My proxy for early adopters was looking at all sessions running with
older buildids. The way it was done was fixing a day, looking at
sessions submitted on that day, and then partitioning by Diff = current day - buildid of the session packet.
The ones with a larger Diff were considered slow adopters. This is not a fixed definition;
it just means adoption is slower with a larger Diff.
However, the problem with this definition is that an entry in cell (current day, Diff)
could well be in a different cell on another day.

With lastdistinctbuildid, if an installation sends a session on two
different days and yet does not update, then we can still see the
/last/ build id for that session (not the current one). Based on the
difference between the last build id and the current build id, we can
define the slowness of adoption. This does not mean the installation
is a slow adopter in general (that is a broad label and needs long-term
study); just that this session came from an installation that had or
had not moved rapidly from one build to the current one.



>  Question #2a: Are we still trying to verify this hypothesis?

Yes, somewhat. I wouldn't say it is high priority, but if someone
wanted to look at the distribution of X controlling for installations
that rapidly moved to their current build, I would look at X
controlling for (current build - last build, in days).


> Question #2b: Are we trying to track Telemetry from early adopters
> separately from later adopters? If so, why?  

See above. I've had questions suggesting that we should not look
within 4 days but rather 7 days, etc. My analysis of 7 days is on the
safe side, because from the release of a given buildid, about 85%
(approx.) of all submissions on that build id arrive within 7 days.
If we were to always wait at least 7 days, then adding this is not
required. Nevertheless, the initial question was never properly answered.


> 3) Comment 1 also talked about tracking adoption & desertion trends. If I
> understand correctly, FHR will provide us with this data as well.

But FHR cannot be connected to telemetry in any way at all.


> Question #3: Why is the pairing of adoption data with Telemetry data more
> valuable than just the FHR adoption data alone?

Because they are different populations. I've had numerous questions about the
Telemetry adoption rate, or even how many unique active installations are on
telemetry.


> ---------
> 
> Two other concerns:
> 
> In comment 22, Saptarshi said:
> > if we do a date subtraction i.e. PingSubmissionDate - lastBuildConvertedtoDate
> > then the histograms can be profiled by some rough indicator of rough fast
> > installation moves from build to build. If this differences is large, then the
> > submission could have come from a slow adopter.
> 
> Question #4: Assuming we want to distinguish early adopters from late adopters
> in Telemetry reports, we can write a patch that provides this info directly.
> i.e. we can have a field "isEarlyAdopter" that is true only if the current
> build was installed within 3 days of its release. Would this be preferable?

I would hesitate to convert a variable that can take positive values into just a discrete 0/1 value, since the potential discriminating power of this difference is as yet unknown.
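For what it's worth, keeping the difference continuous is cheap: since nightly build ids begin with the build date, the gap in days falls out of a simple date subtraction. A hypothetical sketch (the function names are invented for illustration):

```javascript
// Hypothetical sketch: measure build-to-build adoption as a count of
// days rather than a 0/1 "isEarlyAdopter" flag. Nightly build ids
// begin with YYYYMMDDHHMMSS.
function buildIdToUTCDay(buildId) {
  let y = Number(buildId.slice(0, 4));
  let m = Number(buildId.slice(4, 6));
  let d = Number(buildId.slice(6, 8));
  return Date.UTC(y, m - 1, d); // milliseconds at midnight UTC
}

// Days between the current build and the previously seen build.
function buildGapDays(currentBuildId, lastBuildId) {
  return (buildIdToUTCDay(currentBuildId) - buildIdToUTCDay(lastBuildId)) /
         86400000;
}
```

Analysis can then bucket or threshold this gap later, rather than baking a 3-day cutoff into the client.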


> Also in comment 22, Saptarshi said:
> > lastSubmissionDate can be used to give a snapshot the activity of the
> > installation (that sent this session ping). If last submission date was a day
> > ago, the installation was used yesterday.
> 
> Question #5: I'm certain FHR provides this information already. Why combine
> this info with Telemetry pings?

FHR is a loooong way into the future. We can't study telemetry
measurements controlling for activity when the activity data is in FHR.

Also, for idle-daily sessions (once a day):

DaysSinceLastPing := current date - lastsubmissiondate

This can be used (though the current version in the blocklist ping appears
borked) to compute unique installations. See https://bugzilla.mozilla.org/show_bug.cgi?id=616835
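As a sketch of that use (purely illustrative, not the blocklist-ping code): with at most one idle-daily ping per installation per day, bucketing a day's pings by DaysSinceLastPing gives an activity snapshot for that day:

```javascript
// Hypothetical sketch (not actual server-side code): each installation
// sends at most one idle-daily ping per day, so the pings received on a
// given day can be bucketed by DaysSinceLastPing to profile activity.
function activitySnapshot(daysSinceLastPing) {
  let snapshot = {
    uniqueActive: daysSinceLastPing.length, // one ping per installation
    usedYesterday: 0,     // DaysSinceLastPing == 1
    returningAfterGap: 0, // gap of two or more days
    firstPing: 0,         // no prior submission recorded (value missing)
  };
  for (let d of daysSinceLastPing) {
    if (d === null) {
      snapshot.firstPing++;
    } else if (d === 1) {
      snapshot.usedYesterday++;
    } else {
      snapshot.returningAfterGap++;
    }
  }
  return snapshot;
}
```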
Flags: needinfo?(sguha)
> > Question #2b: Are we trying to track Telemetry from early adopters
> > separately from later adopters? If so, why?  
> 
> [..] My analysis of 7 days is on the safe
> side, because from the release of a given buildid, about 85% (approx)
> of all submission on that build id arrive within 7 days.  If we were
> to always wait at least 7 days, then adding this is not required. 

It doesn't seem too terrible to have to wait 7 days for a build's distribution to stabilize.

> Question #3:
> > Why is the pairing of adoption data with Telemetry data more
> > valuable than just the FHR adoption data alone? 
> 
> Because they are of different populations. I've had numerous questions about
> Telemetry adoption rate
> or even how many unique active installations on telemetry. 

Here's my thinking:

1) I think the Telemetry population will be a subset of the FHR population, because FHR is opt-out (vast majority of people will have it on by default) and Telemetry is opt-in (~1% of release-channel users will participate). I don't think there will be many Telemetry users who are not FHR users. We could verify this claim with a Telemetry field "isFHRenabled" if we like.

2) Since Telemetry population is a subset of the FHR population, we can get information about the Telemetry population from FHR data simply by adding an "isTelemetryUser" flag to FHR pings. I filed bug 837292 for this. Then we would know things such as the Telemetry opt-in rate, unique active installations, the build adoption/desertion curves, etc.

3) I don't think there's great value in segregating Telemetry users into early-adopter vs late-adopter buckets. I think it would be much more meaningful to group Telemetry users by OS version, hardware profile, storage device type (SSD vs magnetic disk) and other environment details which we know affect performance measurements.

> > Question #5: I'm certain FHR
> > provides this information already. Why combine this info with
> > Telemetry pings? 
> 
> FHR is a loooong way into the future. 

FHR is already in the current Nightly & submitting reports. Do you mean it's a long way away from being on the release channel?

> We can't study telemetry
> measurements controlling for activity when activity is in FHR. 

Which Telemetry measurements are you thinking of? I can't think of any performance measurements that would benefit from controlling for installation activity.
We are still in the process of validating the incoming numbers. Until we are confident that the numbers make sense,
we will not trust the results derived from FHR.

I'm alright with none of this (i.e. no lastbuildid, nor lastsubmissiondate). That said, let's not again ask the questions that led to this bug.
Ok, I'll mark it wontfix then
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WONTFIX