Closed Bug 849947 Opened 7 years ago Closed 6 years ago

FHR submission counts vs Blocklist ping

Categories

(Firefox Health Report Graveyard :: Data Collection, defect, P1)


Tracking

(firefox20 unaffected, firefox21+ wontfix, firefox22+ fixed)

RESOLVED FIXED

People

(Reporter: joy, Assigned: gps)

References

Details

(Keywords: privacy-review-needed, Whiteboard: [leave-open])

Attachments

(3 files, 2 obsolete files)

Hello,

I've been comparing the FHR activity numbers from Aurora 21 to
blocklist pings.

1) To arrive at an FHR-derived activity number, I use the following logic:

- subset all documents that pass this:
    grepl('^(21)',p$data$last$org.mozilla.appInfo.appinfo[['version']]) && "aurora"==p$data$last$org.mozilla.appInfo.appinfo[['updateChannel']]

- then for this subset, compute the total number of documents that
  were active on day 'd', where d is in the set {"2013-03-03","2013-03-04",...,"2013-03-10"}
- the result is the following table

        date  count
1 2013-03-03 155238
2 2013-03-04 175737
3 2013-03-05 174974
4 2013-03-06 170992
5 2013-03-07 165304
6 2013-03-08 155300
7 2013-03-09 137933
8 2013-03-10 125158

(If I redo this count tomorrow or sometime in the future, the FHR numbers will increase toward an asymptote.)
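For concreteness, the counting step could be sketched in Python as follows. This is a sketch only: the payload shape and field names are assumed from the R filter above, not from any published FHR schema.

```python
from collections import Counter

def aurora21_active_counts(documents, days):
    """Count FHR documents active on each day, restricted to Aurora 21.

    Assumes each document is a parsed FHR payload shaped like the R
    snippet above: doc["data"]["last"] holds the appinfo measurement and
    doc["data"]["days"] maps "YYYY-MM-DD" strings to per-day data.
    """
    counts = Counter()
    for doc in documents:
        appinfo = doc["data"]["last"]["org.mozilla.appInfo.appinfo"]
        # Keep only Aurora 21 documents, mirroring the grepl() filter.
        if not appinfo["version"].startswith("21"):
            continue
        if appinfo["updateChannel"] != "aurora":
            continue
        active_days = doc["data"].get("days", {})
        for day in days:
            if day in active_days:
                counts[day] += 1
    return counts
```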

2) To compare, I get the blocklist counts for Aurora 21 with build IDs on or after 2013-03-03. This is the table:
        date  count
1 2013-03-03  36026
2 2013-03-04  66144
3 2013-03-05  87137
4 2013-03-06  98032
5 2013-03-07 104084
6 2013-03-08 104930
7 2013-03-09  98662
8 2013-03-10 100825


3) As you can see, the ratio is greater than 1:

        date    fhr    adi    ratio
1 2013-03-03 155238  36026 4.309055
2 2013-03-04 175737  66144 2.656885
3 2013-03-05 174974  87137 2.008033
4 2013-03-06 170992  98032 1.744247
5 2013-03-07 165304 104084 1.588179
6 2013-03-08 155300 104930 1.480034
7 2013-03-09 137933  98662 1.398036
8 2013-03-10 125158 100825 1.241339
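The ratio table above is just a join of the two per-day count tables; a minimal sketch of that computation:

```python
def discrepancy_ratios(fhr_counts, adi_counts):
    """Join two per-day count tables ("YYYY-MM-DD" -> int) and compute
    the fhr/adi ratio for days present in both tables."""
    return {
        day: (fhr_counts[day], adi_counts[day],
              fhr_counts[day] / adi_counts[day])
        for day in sorted(fhr_counts.keys() & adi_counts.keys())
    }
```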


4) There is an issue with comparing these numbers: not everybody using
   Aurora 21 has moved to a build >= 2013-03-03. The BL numbers
   for 2013-03-05 are BL submissions from installations on a build >=
   2013-03-03, whereas the FHR activity for 2013-03-05 counts all
   documents that were active on 2013-03-05, even if they were not yet
   on a build >= 2013-03-03 on that day.
   Hence the two numbers are not quite comparable.
   
It is entirely possible that BL under-reports. But by how much? Should
we accept these numbers without further checks?

As a temporary check, the suggestion is to include:

- was the blocklist ping successfully sent?
- the build ID of the successful blocklist ping stored in the days field
  (i.e. per active day, what build ID the installation was on)


cheers
Saptarshi
How exactly would we measure whether the blocklist ping was sent? I'm not sure if we're looking at modifying blocklist code or piggybacking off existing preferences:

extensions.blocklist.pingCountTotal
extensions.blocklist.pingCountVersion
extensions.blocklist.enabled

Presumably pingCountTotal is a counter that increments with each successful ping? If we put that in FHR's per-day payload, that would tell us if the blocklist was sent, right?
As a cheap method to investigate this discrepancy, I propose we force a preferences flush to disk after FHR submits its payload. Because this will incur blocking I/O on the main thread, I propose we only perform this action on the Nightly and Aurora channels and that the experiment run only as long as needed to yield results (maybe a week or two). If we find that preference writes are getting lost, we should put bug 846133 on the fast track and uplift it as needed.
I'm in favour of this experiment, especially if we call the flush on an idle timer to minimize user impact.  As the pref service only flushes to disk if there are uncommitted changes (savePrefFile is an effective no-op if there are no changes), it would be good to record how many times we still flush at shutdown after an FHR-driven flush.  Can we add that as a Telemetry probe?

All that said, I think we need to stop using prefs for timers in general, if this is a big problem.  Is there a bug on building a shared service to manage long-running timers across sessions?
(In reply to Mike Connor [:mconnor] from comment #3)
> I'm in favour of this experiment, especially if we call the flush on an idle
> timer to minimize user impact.

I say we just do it without the idle timer. Otherwise, we won't know the full impact of our experiment: we'll still question whether the idle timer fires and whether all pref writes are being persisted. Yes, we may introduce jank. But it will occur at most once per day and only on the Nightly and Aurora channels and only for a limited time. These kinds of experiments are allowed to live on those channels, IMO.

> As the pref service only flushes to disk if
> there are uncommitted changes (savePrefFile is an effective no-op if there
> are no changes), it would be good to record how many times we still flush at
> shutdown after an FHR-driven flush.  Can we add that as a Telemetry probe?

Bug 848461. Even longer term, if prefs turn out to be not reliable, we should investigate making them more so. Many features rely on prefs and having writes lost for a non-trivial amount of users would likely result in many sub-par product experiences.

> All that said, I think we need to stop using prefs for timers in general, if
> this is a big problem.  Is there a bug on building a shared service to
> manage long-running timers across sessions?

The main problem is not the timer: it's the last document ID. See the initial comment in bug 846133 for an explanation of the theorized problem.
We now have at least two features doing the "if nightly or aurora then" dance. So, let's create a unified variable for it. It's easier to grep for, etc.

IMO this beats our current solution of "revert changes to prefs once a tree is uplifted." But that is a bigger problem and we shouldn't try tackling it in this bug.
Assignee: nobody → gps
Status: NEW → ASSIGNED
Attachment #723635 - Flags: review?(mh+mozilla)
rnewman for code. vladan for perf sign-off since this may cause jank.
Attachment #723636 - Flags: review?(rnewman)
Attachment #723636 - Flags: feedback?(vdjeric)
Gavin pointed me to RELEASE_BUILD, which is defined at https://mxr.mozilla.org/mozilla-central/source/config/rules.mk#1280. He was trying to convince me on IRC that "'a' in $GRE_MILESTONE" might make more sense. I'm not totally familiar with the most appropriate method here...
Comment on attachment 723636 [details] [diff] [review]
Part 2: Flush prefs after receiving server response, v1

Review of attachment 723636 [details] [diff] [review]:
-----------------------------------------------------------------

I think this does what you intend it to.

::: services/healthreport/healthreporter.jsm
@@ +1156,5 @@
>  
> +#ifdef MOZ_PRERELEASE_CHANNEL
> +    // Intended to be temporary until we a) assess the impact b) bug 846133
> +    // deploys more robust storage for state.
> +    Services.prefs.savePrefFile(null);

I suggest you wrap this in a try..catch. That method call can throw in all manner of ways.

There's nothing we can do about it, but note that we're assuming that the prefs service has its prefs file -- if it doesn't have one for some reason, this quietly returns OK.
Attachment #723636 - Flags: review?(rnewman) → review+
Whiteboard: [leave-open]
Privacy feedback wanted:

Can we add "did blocklist ping occur" to FHR payload?

The preferred design is to record the blocklist ping's counter value for each day we submit an FHR payload. If the blocklist ping is enabled and working successfully, the counter increments by 1 every day Firefox is used. Metrics will see this from FHR payloads and will know if the FHR's over-reporting is due to blocklist pings being disabled or not working. We can derive additional user benefit from this data. For example, about:healthreport could recommend the user re-enable a disabled blocklist ping to get protection from bad add-ons. It can also help us better assess the impact of the blocklist ping.

Here is what the data would look like:

{
  "2013-03-10": {
    "blocklist": {count: 314},
  },
  "2013-03-11": {
    "blocklist": {count: 315},
  },
}

This would tell us 315 blocklist pings have been sent and that the blocklist ping was enabled for at least 2013-03-11.
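Inferring "was the blocklist ping sent on day D" from these counter snapshots is then a matter of comparing consecutive days. A sketch, assuming the per-day payload shape from the example above (with string keys):

```python
def blocklist_ping_sent(days):
    """Given a per-day payload ("YYYY-MM-DD" -> {"blocklist": {"count": n}}),
    infer for each day after the first whether the blocklist counter
    advanced, i.e. whether a ping was sent that day."""
    sent = {}
    prev = None
    for day in sorted(days):
        count = days[day].get("blocklist", {}).get("count")
        if count is None:
            continue  # blocklist measurement missing for this day
        if prev is not None:
            sent[day] = count > prev
        prev = count
    return sent
```

This also surfaces the "weird" cases mentioned later in the thread: a counter that jumps by more than 1 (multiple pings in a day) or stays flat (no ping).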
Flags: needinfo?(afowler)
Comment on attachment 723635 [details] [diff] [review]
Part 1: Add MOZ_PRERELEASE_CHANNEL variable to build system, v1

Review of attachment 723635 [details] [diff] [review]:
-----------------------------------------------------------------

::: configure.in
@@ +4362,5 @@
> +if test "$MOZ_UPDATE_CHANNEL" = "nightly" -o "$MOZ_UPDATE_CHANNEL" = "aurora"; then
> +    MOZ_PRERELEASE_CHANNEL=1
> +    AC_DEFINE(MOZ_PRERELEASE_CHANNEL)
> +    AC_SUBST(MOZ_PRERELEASE_CHANNEL)
> +fi

So, things are a bit confusing, and if you want to add a variable, you might as well make things *very* clear: what does it apply to? Here, nightly+aurora, but what if you want it enabled on project branches? What about local builds?

In the case of telemetry/FHR, we may well just want them on nightly+aurora and *not* on local builds nor project branches. But what about other use cases for this variable? Are there actually more use cases for enabling something on nightly+aurora but *not* local builds or project branches?
Attachment #723635 - Flags: review?(mh+mozilla)
(In reply to Gregory Szorc [:gps] from comment #9)
> Privacy feedback wanted:
> 
> Can we add "did blocklist ping occur" to FHR payload?
> 
> The preferred design is to record the blocklist ping's counter value for
> each day we submit an FHR payload. If the blocklist ping is enabled and
> working successfully, the counter increments by 1 every day Firefox is used.
> Metrics will see this from FHR payloads and will know if the FHR's
> over-reporting is due to blocklist pings being disabled or not working. We
> can derive additional user benefit from this data. For example,
> about:healthreport could recommend the user re-enable a disabled blocklist
> ping to get protection from bad add-ons. It can also help us better assess
> the impact of the blocklist ping.
> 
> Here is what the data would look like:
> 
> {
>   "2013-03-10": {
>     "blocklist": {count: 314},
>   },
>   "2013-03-11": {
>     "blocklist": {count: 315},
>   },
> }
> 
> This would tell us 315 blocklist pings have been sent and that the blocklist
> ping was enabled for at least 2013-03-11.

Do you think it might be better to instead set
"blocklist": {sent: 1}
or
"blocklist": {sent: 0}

Then inspection of this format tells us whether the blocklist ping was sent. With the format described in comment 9, we'd have to check whether the count for day N is greater than the count for day N-1 to infer whether the blocklist ping was sent.

Also, another thing: we don't know what build ID the installation had on the day the blocklist ping was sent. In the above example, we would have no idea of the build ID on 03/10 and 03/11. FHR only records the latest build ID (which makes sense for release). The blocklist query returns a count of the number of pings on, e.g., Aurora and builds for *that day*. So could the build ID (first 8 digits, YYYYMMDD) be included at the day level? This recommendation is temporary and only a sanity check.
The main division is between release-like and pre-release-like builds. This is roughly differentiating between {Release, Beta} and {Aurora, Nightly}, respectively.

Making things more complicated are local and project branch builds, which may not use --enable-update-channel.

There are some features (like Telemetry) that we purposefully do not want enabled on local builds (well, at least we don't want the data submission part enabled, but Telemetry doesn't have a separate flag/pref for that AFAICT). There are other features we want enabled on all pre-release-like builds. These include features incubating on Aurora and Nightly for an extended period of time.

But why draw the line there? Surely there are features we want enabled on all builds up to Beta!

While I think we need to solve this problem generically, there is urgency in this bug. So, I propose a quick one-off for FHR only. We can establish a unified build system convention in another bug.
(In reply to Saptarshi Guha from comment #11)
> Do you think it might be better to see set
> "blocklist": {sent: 1}
> or
> "blocklist": {sent: 0}
> 
> Then inspection of this format tells us if blocklist was sent. With the
> format described in comment(9), we'd have to see if the count for day N is
> greater than count for day (N-1) to infer if the blocklist ping was sent.

I'm fine with either. A boolean flag requires us to maintain last-known blocklist state in FHR, which is slightly more complicated, but manageable.

Having the raw counts also lets us know if blocklist is doing something weird, such as firing multiple times in a day or not firing at all.

> Also another thing: we dont know what the build id the installation had on
> the day the blocklist ping was sent. In the above example, we would have no
> idea of the build id on 03/10 and 03/11. FHR only records the latest build
> id (which makes sense for release). The blocklist query returns a count for
> number of pings on e.g. Aurora and builds for *that day*. So possibly the
> buildid (first 8 digits, YYYYMMDD) included on the day level? This
> recommendation is temporary and only as a sanity check.

So what you are saying is you want a per-day build ID? Since we already have this data in the payload as a "last" value and we already track when app version upgrades occur in daily history, we should just be able to add this.
(In reply to Gregory Szorc [:gps] from comment #12)
> While I think we need to solve this problem generically, there is urgency in
> this bug. So, I propose a quick one-off for FHR only. We can establish a
> unified build system convention in another bug.

You don't need to add an AC_SUBST for a one-off.
We use MOZILLA_OFFICIAL to differentiate between "builds done on tinderbox" and "builds done by local developers", FYI. This is what we use to enable crash reporting at runtime.
Implementing a one-off in the preprocessor.
Attachment #723635 - Attachment is obsolete: true
Attachment #723636 - Attachment is obsolete: true
Attachment #723636 - Flags: feedback?(vdjeric)
Attachment #724032 - Flags: review?(rnewman)
Attachment #724032 - Flags: review?(mh+mozilla)
Attachment #724032 - Flags: feedback?(vdjeric)
Comment on attachment 724032 [details] [diff] [review]
Part 2: Flush prefs after receiving server response, v2

Review of attachment 724032 [details] [diff] [review]:
-----------------------------------------------------------------

::: services/healthreport/Makefile.in
@@ +43,5 @@
>  PP_TARGETS += MAIN_JS_MODULE
>  
>  MODULES := $(modules)
>  MODULES_PATH = $(FINAL_TARGET)/modules/services/healthreport
> +MODULES_FLAGS := $(extra_pp_flags)

This is okay in itself, but ... we are effectively installing all modules twice here, aren't we? (once all merged in HealthReport.jsm, and once individually in modules/services/healthreport)
Attachment #724032 - Flags: review?(mh+mozilla) → review+
(In reply to Mike Hommey [:glandium] from comment #17)
> This is okay in itself, but ... we are effectively installing all modules
> twice here, aren't we? (once all merged in HealthReport.jsm, and once
> individually in modules/services/healthreport)

Yes... for hysterical raisins. One will hopefully go away as part of bug 836285.
Attachment #724032 - Flags: review?(rnewman) → review+
I don't believe that just reporting the blocklist counter value is good enough here.

One of the discoveries we made when analyzing the counter data is what we called the "Sunday effect". On Mondays, we would see more pings with a "days since last ping" value of 1 (meaning they had last pinged on Sunday) than the total pings we actually saw for Sunday.

The hypothesis we came up with is that the blocklist code runs, it makes the request, but the request was never properly received and logged on our side.  We don't have any good data on why this might happen, but one could assume one possible reason is lack of an internet connection on an idle computer (such as one left on in an office where internet access is controlled / limited via a proxy or scheduled firewall).

If we want this measure to be more reliable, I think we'd need to evaluate putting some minor code into blocklist itself to provide more accurate reporting (such as counting the number of success responses to the GET request).
> I'm fine with either. A boolean flag requires us to maintain last-known
> blocklist state in FHR, which is slightly more complicated, but manageable.
Oh yes, your approach works nicely (it covers a case I hadn't thought of), i.e. the one outlined in comment 9. Let's stick with what you describe. Also see Daniel's comments.


> 
> Having the raw counts also lets us know if blocklist is doing something
> weird, such as firing multiple times in a day or not firing at all.
> 
> 
> So what you are saying is you want a per-day build ID? Since we already have
> this data in the payload as a "last" value and we already track when app
> version upgrades occur in daily history, we should just be able to add this.

yes.
> > So what you are saying is you want a per-day build ID? Since we already have
> > this data in the payload as a "last" value and we already track when app
> > version upgrades occur in daily history, we should just be able to add this.
> 
> yes.

One more thing: can we store the build ID that was sent in the blocklist ping? That is, FHR might store the build ID after an upgrade while the blocklist ping was sent with a build ID from before the upgrade.
This would make for a correct comparison, assuming, as per Daniel's comment 19, that the blocklist ping was indeed sent.
Depends on: 850450
I poked around the blocklist code today. I discovered that the blocklist ping is using nsIUpdateTimerManager (https://mxr.mozilla.org/mozilla-central/source/toolkit/mozapps/update/nsUpdateTimerManager.js) to schedule blocklist pings. That generic timer manager (which should probably be decoupled from the update service because it's /really/ generic - but I digress) is storing last fired time in prefs - just like FHR. I could find no reference in the timer manager or blocklist code that forces a prefs save.

What this means is that FHR and blocklist should be impacted by prefs loss in roughly the same way in terms of scheduling. Now, scheduling does not necessarily translate into FHR payload counts. Prefs loss is bad for FHR because it could cause a document on the server to be orphaned: the client lost its ID and doesn't know it needs to delete it. If we want a fair comparison before the patch in this bug lands, we could compare raw HTTP request counts between FHR and the blocklist ping. However, obtaining that data on a per-channel basis might be difficult or impossible. We should still proceed with this patch. If it results in a noticeable change, then we have a real problem of pref loss occurring in the wild.
Depends on: 850483
(In reply to Saptarshi Guha from comment #0)
>    The FHR activity for 2013-03-05 are all documents that
>    were active on 2013-03-05 though they need not have been on a build
>    >= 2013-03-05 on the day 2013-03-05. 
>    Hence the two numbers are not quite comparable.

Why can't we apply the same criteria when generating FHR & Blacklist numbers?

1) Restrict both FHR & Blocklist counts to builds after March 3rd
2) Report FHR activity for 2013-03-05 as all documents that were active on 2013-03-05 and build >= 2013-03-05
Flags: needinfo?(sguha)
But the build ID in FHR documents is the latest build ID, not the build ID on 2013-03-05. The blocklist ping captures the build ID on that day.

The point here is: do documents active on day D (as per FHR) send a blocklist ping? If this can be confirmed and FHR is indeed higher, that implies the blocklist ping underestimates, and we know by how much.
Flags: needinfo?(sguha)
Priority: -- → P1
Comment on attachment 724032 [details] [diff] [review]
Part 2: Flush prefs after receiving server response, v2

I don't want to hold up the investigation, so I'm OK with giving feedback+ for a short-running experiment on nightly/aurora that does a main-thread prefs flush once per session. However, I need to talk to Joy to understand exactly what the document count numbers represent. I'm unclear on a few points, e.g. why is the "discrepancy ratio" trending downward?
Attachment #724032 - Flags: feedback?(vdjeric)
Vladan,

There was no mention of ratios trending downwards. What we have right now is that the blocklist count is much lower than the FHR count for a given day.

We need to know why: is it because the blocklist under-reports (and we can enumerate cases why it would)?

We need to know whether FHR documents being submitted are indeed sending blocklist pings, i.e. is the set of FHR submissions a superset of blocklist pings?

Also, consider the following pseudo-SQL query against blocklist:

select buildid, sum(pings) from blocklist
where version = 22 and channel = 'aurora'
  and date = '2013-03-03' and product = 'firefox'
group by buildid

The result is the number of BL pings sent by active installations on 2013-03-03, grouped by the build IDs active on that date. This result stays the same if we run the above query on 2013-03-04 or 2013-03-05 or any day >= 2013-03-04.

Doing the same query for FHR changes for any day >= 2013-03-03, since FHR only reports the build ID that the installation is currently on, not the build ID that the installation was on on 2013-03-03.

So ultimately, the inferences we make from this are:
a) for every installation using FHR, is Firefox attempting to send the blocklist ping? ([1*] currently, we cannot confirm whether the ping was indeed received by our servers)
b) compute the count of FHR active installations on a given build for a given day and compare that number with the blocklist ping count.

With the caveat of [1*], we will then gain some insight into why the FHR actives for 2013-03-03 (or any date) are much higher than the blocklist pings for that date.
https://hg.mozilla.org/mozilla-central/rev/7eeb6eb9b802
Target Milestone: --- → mozilla22
Comment on attachment 724032 [details] [diff] [review]
Part 2: Flush prefs after receiving server response, v2

[Approval Request Comment]
Bug caused by (feature/regressing bug #): FHR and Prefs durability interaction.
User impact if declined: Clients may lose FHR state, causing excessive uploading and causing Mozilla to not have an accurate picture from which to derive useful recommendations.
Testing completed (on m-c, etc.): It's been on m-c for a few days. Patch is trivial and does nothing radical.
Risk to taking this patch (and alternatives if risky): Might cause jank.
String or UUID changes made by this patch: None

This patch only affects the aurora and nightly channels due to preprocessor magic. The intent is for this patch to be a temporary mechanism to help identify discrepancies between FHR and blocklist ping counts. This discrepancy is probably the single biggest question mark hovering over FHR and answering it will make a lot of people very happy.

This bug is marked as a P1 which means it's an FHR launch blocker. We need this patch to land on Aurora to assess the impact that non-durable prefs storage is having on FHR.
Attachment #724032 - Flags: approval-mozilla-aurora?
Comment on attachment 724032 [details] [diff] [review]
Part 2: Flush prefs after receiving server response, v2

Trivial patch affects aurora & nightly channels only.

Please keep an eye on any performance impacts this may have due to the outlined risk here
Attachment #724032 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Alex's Privacy review is on other bugs.
Flags: needinfo?(afowler)
I was wrong. We're keeping this bug open to add some form of "did blocklist occur" to FHR. Privacy review is requested in comment #9.
Flags: needinfo?(afowler)
Observation: At least 0.195% of FHR documents reference dates in the future. Due to bug 853198, blocklist ping may not fire if it has recorded a time in the future.

Given the low percentage of FHR payloads with future dates, I do not think bug 853198 is significantly contributing to the discrepancy outlined in this bug. That being said, 0.195% is the minimal percentage of FHR documents referencing future dates. I believe the actual percentage is a little higher. However, I don't believe it is high enough to explain the double digit discrepancy we're currently seeing.
Depends on: 854018
Here is some interesting data:

Build ID        Total   NoLast  % NoLast
20130302042014  11223   6307    56.20
20130303042011  11691   3682    31.49
20130304042010  12008   2666    22.20
20130305042015  11295   2033    18.00
20130306042010  13085   1951    14.91
20130307042016  11131   1440    12.94
20130308042013  11890   1675    14.09
20130309042012  11069   1358    12.27
20130310042012  12759   1502    11.77
20130311042011  12816   1486    11.59
20130312042013  13515   1396    10.33
20130313042013  12214   1218    9.97
20130314042011  54642   4248    7.77
20130315042012  8456    725     8.57
20130316042013  6957    659     9.47
20130317042012  7691    601     7.81
20130318042013  9696    747     7.70
20130319042012  8041    504     6.27
20130320042012  10311   736     7.14
20130321042014  18510   1036    5.60
20130322042013  31154   1247    4.00
20130323042013  20262   677     3.34
20130324042013  40038   836     2.09
20130325042013  4660    148     3.18

This is a breakdown of the *current* Aurora payloads on the server with build IDs having more than 500 current payloads. The 2nd column is the total number of payloads with 8 or more days of data in the payload. The 3rd column is the number of those payloads that *don't* have a last ping date.

Essentially what I'm trying to do is isolate clients that have been running FHR for a few days but are saying they've never submitted a payload. This should be uncommon because FHR should attempt an upload every day. If we've collected multiple days of data, at least one of those collections should have triggered an upload. For a client to not declare a payload for over a week would mean one of the following:

a) The privacy policy has not been accepted
b) There was an error uploading the payload (firewall, server outage, bad connectivity, etc)
c) The client thinks it never uploaded before (e.g. it lost the pref write)

So, I could look at this graph and see that the percentage is going down. That seems good. However, what I'd like to see is a nice dip for the 2013-03-21 Aurora - the one where we started flushing preferences - to conclude that flushing prefs had an impact (and that pref writes are being lost by a significant percentage of clients). If you plot the curve of the percentages, there does seem to be a dip. However, it's not pronounced enough to be conclusive. I think more data is needed. I'd love to look at these numbers again at the end of the week.
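As a sanity check on the table above, the percentage column and the before/after comparison can be reproduced with a small script. This is a sketch; the cutoff build ID 20130321042014 (the first build that flushes prefs) is taken from the comment above.

```python
def no_last_percentages(rows):
    """rows: (build_id, total, no_last) tuples as in the table above.
    Returns (build_id, pct_no_last) pairs."""
    return [(b, round(100.0 * n / t, 2)) for b, t, n in rows]

def mean_pct(rows, predicate):
    """Unweighted mean of pct_no_last over rows whose build_id matches
    the predicate; compare pre-flush vs post-flush builds, e.g.
    mean_pct(rows, lambda b: b < "20130321042014") versus >=."""
    pcts = [p for b, p in no_last_percentages(rows) if predicate(b)]
    return sum(pcts) / len(pcts)
```

A pronounced gap between the two means would be the "nice dip" the comment above is looking for; a small gap means more data is needed.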
(In reply to Gregory Szorc [:gps] from comment #9)

Hi Gregory -

Based on bug https://bugzilla.mozilla.org/show_bug.cgi?id=855710, this sounds OK to me. We should deprecate it once it's no longer needed.

Thanks
Jishnu

> Privacy feedback wanted:
> 
> Can we add "did blocklist ping occur" to FHR payload?
> 
> The preferred design is to record the blocklist ping's counter value for
> each day we submit an FHR payload. If the blocklist ping is enabled and
> working successfully, the counter increments by 1 every day Firefox is used.
> Metrics will see this from FHR payloads and will know if the FHR's
> over-reporting is due to blocklist pings being disabled or not working. We
> can derive additional user benefit from this data. For example,
> about:healthreport could recommend the user re-enable a disabled blocklist
> ping to get protection from bad add-ons. It can also help us better assess
> the impact of the blocklist ping.
> 
> Here is what the data would look like:
> 
> {
>   "2013-03-10": {
>     "blocklist": {count: 314},
>   },
>   "2013-03-11": {
>     "blocklist": {count: 315},
>   },
> }
> 
> This would tell us 315 blocklist pings have been sent and that the blocklist
> ping was enabled for at least 2013-03-11.
Component: Metrics and Firefox Health Report → Client: Desktop
Flags: approval-mozilla-aurora+
Product: Mozilla Services → Firefox Health Report
Target Milestone: mozilla22 → ---
Component: Client: Desktop → Data Collection
OS: Linux → All
Hardware: x86_64 → All
[Approval Request Comment]
Bug caused by (feature/regressing bug #): FHR
User impact if declined: See comment below.
Testing completed (on m-c, etc.): The equivalent has been on Aurora and Nightly for a while.
Risk to taking this patch (and alternatives if risky): Should be very low risk.
String or IDL/UUID changes made by this patch: None

I was talking with Brendan and Saptarshi of Metrics. We are still seeing an abundance of what I'm calling "orphaned" FHR records. These are records that are uploaded and don't get updated for a long period of time, if ever. As described in earlier comments in this bug, a possible cause of orphaned records is that pref writes are being lost.

We would like to perform an A/B comparison of sorts on Beta to isolate the impact of lost pref writes. (We don't believe a comparison on the Aurora and Nightly branches is as effective because of the churn on those channels.)

What I'm proposing is that we enable pref flushing after upload on the Beta channel for a few weeks. Then, before beta goes to release, we back out the change. Or we can always change this patch such that it's conditional on the beta or below branch and no backout is needed. Whatever is preferred.

I believe Metrics would like this experiment conducted ASAP. One reason is it will likely take a few weeks to have sufficient data to make confident conclusions. However, if we're not comfortable doing it in 21, we can probably wait until Beta 22. I will defer to Metrics to champion 21 if the approval isn't a rubber stamp.

On a technical side, this should be very low risk. Taking this patch will force a fsync() that should eventually happen anyway. Backing out this patch will restore current behavior, which is known to not be problematic aside from the possible orphaned records issue.

If the A/B comparison reveals that pref writes are being lost, we will likely request the forced prefs flushing be permanent until we move FHR state to a file instead. This would hopefully occur in time for consideration of uplift to Beta 22. If there is a point release of 21 release, we may consider requesting pref flushes go into that. But let's not get too ahead of ourselves. We need data first. And this patch will start to collect it for us.
Attachment #739708 - Flags: approval-mozilla-beta?
I would appreciate sign-off from Performance on conducting this experiment. Taras?

Keep in mind the prefs fsync (assuming there is one) after FHR server response likely comes a few hundred milliseconds after the SQLite WAL fsync. Given the expected close proximity, the cost of the prefs fsync should be marginal.
Flags: needinfo?(taras.mozilla)
(In reply to Gregory Szorc [:gps] from comment #37)
> We would like perform an A/B comparison of sorts on Beta to isolate the
> impact of lost pref writes. (We don't believe the comparison on Aurora and
> Nightly branches is as effective because of the churn on those channels.)

Why does churn invalidate the data gathered with Nightly/Aurora?
(In reply to Vladan Djeric (:vladan) from comment #39)
> (In reply to Gregory Szorc [:gps] from comment #37)
> > We would like perform an A/B comparison of sorts on Beta to isolate the
> > impact of lost pref writes. (We don't believe the comparison on Aurora and
> > Nightly branches is as effective because of the churn on those channels.)
> 
> Why does churn invalidate the data gathered with Nightly/Aurora?

I second this question. I'm also not sure of the usefulness of this experiment if you don't do a proper comparison (e.g. have some profiles flush and some don't, and indicate that in the FHR payload).

Also, can you quantify 'abundance'? 

This is an additional fsync, not one that would have happened anyway. I think the later fsyncs only happen if something else wants to write to prefs, and that write would still need its own fsync.
Flags: needinfo?(taras.mozilla)
How would one determine if the patch has decreased orphaned records?
And what is the exact definition of orphaned? A record that isn't updated?
That could also be because the user has stopped using Firefox. Do we have an idea of how many are orphaned?
We would compare the orphan rate of builds with this patch against the orphan rate of builds without it. We can tell them apart by the build ID in FHR's existing payload: no new data would need to be introduced. We simply know that builds in some date range have one behavior and the rest another. Hopefully the adoption and usage patterns of Firefox are similar across those builds, or the experiment is invalid. A long-running experiment helps ensure the patterns are similar.

An orphaned record is semantically equivalent to a stale/old record: it is a record on the server that is N or more days old. If every client used Firefox every day, we should have 0 orphaned records.

If you graphed the age in days of all records on the server, you would expect to see a reverse exponential curve that plateaus somewhere. If we are actively losing track of uploaded document IDs because of pref non-persistence, this patch should result in a curve that plateaus at a lower number of records per day.
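The comparison described above could be sketched roughly as follows. This is a hypothetical illustration, not actual FHR tooling: the build IDs, dates, and the 30-day orphan threshold are all made up for the example.

```python
from datetime import date

TODAY = date(2013, 4, 10)
ORPHAN_AGE_DAYS = 30              # a record this many or more days old counts as orphaned
PATCH_BUILD = "20130401000000"    # hypothetical build ID at which the patch landed

# Hypothetical server-side records: (build_id, date_of_last_upload)
records = [
    ("20130405030931", date(2013, 4, 9)),
    ("20130405030931", date(2013, 3, 1)),
    ("20130310030931", date(2013, 3, 2)),
    ("20130310030931", date(2013, 4, 8)),
    ("20130310030931", date(2013, 2, 20)),
]

def orphan_rate(recs):
    """Fraction of records whose last upload is ORPHAN_AGE_DAYS or more days old."""
    orphans = sum((TODAY - last).days >= ORPHAN_AGE_DAYS for _, last in recs)
    return orphans / len(recs)

# Split cohorts by build ID: builds on/after the patch landed vs. earlier builds.
with_patch = [r for r in records if r[0] >= PATCH_BUILD]
without_patch = [r for r in records if r[0] < PATCH_BUILD]
```

If the pref-loss hypothesis is right, `orphan_rate(with_patch)` should come out measurably lower than `orphan_rate(without_patch)` once enough data accumulates.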
I'm on PTO today. Ask someone in Metrics for the ECDF of

X := (today - thispingdate) in days

That is the current view of orphans.
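The ECDF of record age referred to here could be computed like this. The ping dates are invented for illustration; the real computation would run over all records on the server.

```python
from datetime import date

TODAY = date(2013, 4, 15)

# Hypothetical thispingdate values for a handful of FHR records
ping_dates = [date(2013, 4, 14), date(2013, 4, 10), date(2013, 3, 25),
              date(2013, 3, 20), date(2013, 2, 1)]

# X := (today - thispingdate) in days, per the comment above
ages = sorted((TODAY - d).days for d in ping_dates)

def ecdf(x, sample):
    """Empirical CDF: fraction of records whose age is <= x days."""
    return sum(1 for a in sample if a <= x) / len(sample)

# e.g. what fraction of records pinged within the last 25 days?
frac_recent = ecdf(25, ages)
```

Comparing this curve for patched vs. unpatched builds is the proposed test: if pref writes are being lost, the patched cohort's ECDF should rise faster (records go stale less often).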

What happens when users move from the patched build to the subsequent build? Will the subsequent build still have the patch?


So if we put this in build b, we can look at the days between activity before the user moves to build b and compare it to after. However, this might be useless if this patch affects lastpingdate and thispingdate.
(In reply to Gregory Szorc [:gps] from comment #37)
> Created attachment 739708 [details] [diff] [review]
> Always flush prefs after server response, v1
> 
> [Approval Request Comment]
> Bug caused by (feature/regressing bug #): FHR
> User impact if declined: See comment below.
> Testing completed (on m-c, etc.): The equivalent has been on Aurora and
> Nightly for a while.
> Risk to taking this patch (and alternatives if risky): Should be very low
> risk.
> String or IDL/UUID changes made by this patch: None
> 
> I was talking with Brendan and Saptarshi of Metrics. We are still seeing an
> abundance of what I'm calling "orphaned" FHR records. These are records that
> are uploaded and don't get updated for a long period of time, if ever. As
> described in earlier comments in this bug, a possible cause of orphaned
> records is that pref writes are being lost.
> 
> We would like to perform an A/B comparison of sorts on Beta to isolate the
> impact of lost pref writes. (We don't believe the comparison on Aurora and
> Nightly branches is as effective because of the churn on those channels.)
> 
> What I'm proposing is that we enable pref flushing after upload on the Beta
> channel for a few weeks. Then, before beta goes to release, we back out the
> change. Or we can always change this patch such that it's conditional on the
> beta or below branch and no backout is needed. Whatever is preferred.
> 
> I believe Metrics would like this experiment conducted ASAP. One reason is
> it will likely take a few weeks to have sufficient data to make confident
> conclusions. However, if we're not comfortable doing it in 21, we can
> probably wait until Beta 22. I will defer to Metrics to champion 21 if the
> approval isn't a rubber stamp.

Looks like we still need Alex's review and Metrics input on this bug. We will be going to build for our Beta 4 tomorrow at noon PT, and this may not get into Fx21 if we do not have the needed responses by then.
> 
> On a technical side, this should be very low risk. Taking this patch will
> force a fsync() that should eventually happen anyway. Backing out this patch
> will restore current behavior, which is known to not be problematic aside
> from the possible orphaned records issue.
> 
> If the A/B comparison reveals that pref writes are being lost, we will
> likely request the forced prefs flushing be permanent until we move FHR
> state to a file instead. This would hopefully occur in time for
> consideration of uplift to Beta 22. If there is a point release of 21
> release, we may consider requesting pref flushes go into that. But let's not
> get too ahead of ourselves. We need data first. And this patch will start to
> collect it for us.
Flags: needinfo?(sguha)
gps, can you please help address comment #40? We're still trying to understand whether we need this on beta, given that.

Also from the metrics side, joy will have the needed data by tomorrow morning if that helps.
I hope this sheds some of the light needed for answering comment 43.

Gps: your hypothesis is that, if the patch is applied to nightly and works, then, for example, the 90th percentile (currently 25 days) will decrease?
Assuming everything else stays constant.
Flags: needinfo?(sguha)
(In reply to Saptarshi Guha from comment #47)
> Gps: your hypothesis is that, if the patch is applied to nightly and works,
> then, for example, the 90th percentile (currently 25 days) will decrease?
> Assuming everything else stays constant.

That is roughly my hypothesis, yes. However, we'd likely have to compare historical percentages of specific build IDs in specific channels (mostly Beta).
I spoke offline with :gps, and he is considering conducting this experiment on Fx22 Beta instead, so marking a- here and wontfix for Fx21.
Attachment #739708 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
Flags: needinfo?(afowler)
I'm going to call this fixed and we'll follow up on other data issues in other bugs.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Product: Firefox Health Report → Firefox Health Report Graveyard