Closed Bug 849947 Opened 7 years ago Closed 6 years ago

FHR submission counts vs Blocklist ping

Categories

(Firefox Health Report Graveyard :: Data Collection, defect, P1)


Tracking

(firefox20 unaffected, firefox21+ wontfix, firefox22+ fixed)

RESOLVED FIXED

People

(Reporter: joy, Assigned: gps)

References

Details

(Keywords: privacy-review-needed, Whiteboard: [leave-open])

Attachments

(3 files, 2 obsolete files)

Hello,

I've been comparing the FHR activity numbers from Aurora 21 to
blocklist pings.

1) To arrive at an FHR-derived activity number, I use the following logic:

- subset all documents that pass this:
    grepl('^(21)',p$data$last$org.mozilla.appInfo.appinfo[['version']]) && "aurora"==p$data$last$org.mozilla.appInfo.appinfo[['updateChannel']]

- then for this subset, compute the total number of documents that
  were active on day 'd', where d is in the set {"2013-03-03","2013-03-04",...,"2013-03-10"}
- the result is the following table

        date  count
1 2013-03-03 155238
2 2013-03-04 175737
3 2013-03-05 174974
4 2013-03-06 170992
5 2013-03-07 165304
6 2013-03-08 155300
7 2013-03-09 137933
8 2013-03-10 125158

(If I redo this count tomorrow or sometime in the future, the FHR numbers will increase toward an asymptote.)
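For concreteness, the counting step could be sketched in Python as follows. This is a sketch only: the payload shape and field names are assumed from the R filter above, not from any published FHR schema.

```python
from collections import Counter

def aurora21_active_counts(documents, days):
    """Count FHR documents active on each day, restricted to Aurora 21.

    Assumes each document is a parsed FHR payload shaped like the R
    snippet above: doc["data"]["last"] holds the appinfo measurement and
    doc["data"]["days"] maps "YYYY-MM-DD" strings to per-day data.
    """
    counts = Counter()
    for doc in documents:
        appinfo = doc["data"]["last"]["org.mozilla.appInfo.appinfo"]
        # Keep only Aurora 21 documents, mirroring the grepl() filter.
        if not appinfo["version"].startswith("21"):
            continue
        if appinfo["updateChannel"] != "aurora":
            continue
        active_days = doc["data"].get("days", {})
        for day in days:
            if day in active_days:
                counts[day] += 1
    return counts
```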

2) To compare, I get the blocklist counts for Aurora 21 with build IDs on or after 2013-03-03. This is the table:
        date  count
1 2013-03-03  36026
2 2013-03-04  66144
3 2013-03-05  87137
4 2013-03-06  98032
5 2013-03-07 104084
6 2013-03-08 104930
7 2013-03-09  98662
8 2013-03-10 100825


3) As you can see, the ratio is greater than 1:

        date    fhr    adi    ratio
1 2013-03-03 155238  36026 4.309055
2 2013-03-04 175737  66144 2.656885
3 2013-03-05 174974  87137 2.008033
4 2013-03-06 170992  98032 1.744247
5 2013-03-07 165304 104084 1.588179
6 2013-03-08 155300 104930 1.480034
7 2013-03-09 137933  98662 1.398036
8 2013-03-10 125158 100825 1.241339
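The ratio table above is just a join of the two per-day count tables; a minimal sketch of that computation:

```python
def discrepancy_ratios(fhr_counts, adi_counts):
    """Join two per-day count tables ("YYYY-MM-DD" -> int) and compute
    the fhr/adi ratio for days present in both tables."""
    return {
        day: (fhr_counts[day], adi_counts[day],
              fhr_counts[day] / adi_counts[day])
        for day in sorted(fhr_counts.keys() & adi_counts.keys())
    }
```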


4) There is an issue with comparing these numbers: not everybody using
   Aurora 21 has moved to a build >= 2013-03-03. The BL numbers
   for 2013-03-05 are BL submissions from installations on a build >=
   2013-03-03, whereas the FHR activity for 2013-03-05 counts all
   documents that were active on 2013-03-05, even if they were not yet
   on a build >= 2013-03-03 on that day.
   Hence the two numbers are not quite comparable.
   
It is entirely possible that BL under-reports. But by how much? Should
we accept these numbers without further checks?

As a temporary check, the suggestion is to include:

- was the blocklist ping successfully sent?
- the build ID of the successful blocklist ping stored in the days field
  (i.e. per active day, what build ID the installation was on)


cheers
Saptarshi
How exactly would we measure whether the blocklist ping was sent? I'm not sure if we're looking at modifying blocklist code or piggybacking off existing preferences:

extensions.blocklist.pingCountTotal
extensions.blocklist.pingCountVersion
extensions.blocklist.enabled

Presumably pingCountTotal is a counter that increments with each successful ping? If we put that in FHR's per-day payload, that would tell us if the blocklist was sent, right?
As a cheap method to investigate this discrepancy, I propose we force a preferences flush to disk after FHR submits its payload. Because this will incur blocking I/O on the main thread, I propose we only perform this action on the Nightly and Aurora channels and that the experiment run only as long as needed to yield results (maybe a week or two). If we find that preference writes are getting lost, we should put bug 846133 on the fast track and uplift it as needed.
I'm in favour of this experiment, especially if we call the flush on an idle timer to minimize user impact.  As the pref service only flushes to disk if there are uncommitted changes (savePrefFile is an effective no-op if there are no changes), it would be good to record how many times we still flush at shutdown after an FHR-driven flush.  Can we add that as a Telemetry probe?

All that said, I think we need to stop using prefs for timers in general, if this is a big problem.  Is there a bug on building a shared service to manage long-running timers across sessions?
(In reply to Mike Connor [:mconnor] from comment #3)
> I'm in favour of this experiment, especially if we call the flush on an idle
> timer to minimize user impact.

I say we just do it without the idle timer. Otherwise, we won't know the full impact of our experiment: we'll still question whether the idle timer fires and whether all pref writes are being persisted. Yes, we may introduce jank. But it will occur at most once per day and only on the Nightly and Aurora channels and only for a limited time. These kinds of experiments are allowed to live on those channels, IMO.

> As the pref service only flushes to disk if
> there are uncommitted changes (savePrefFile is an effective no-op if there
> are no changes), it would be good to record how many times we still flush at
> shutdown after an FHR-driven flush.  Can we add that as a Telemetry probe?

Bug 848461. Even longer term, if prefs turn out to be not reliable, we should investigate making them more so. Many features rely on prefs and having writes lost for a non-trivial amount of users would likely result in many sub-par product experiences.

> All that said, I think we need to stop using prefs for timers in general, if
> this is a big problem.  Is there a bug on building a shared service to
> manage long-running timers across sessions?

The main problem is not the timer: it's the last document ID. See the initial comment in bug 846133 for an explanation of the theorized problem.
We now have at least two features doing the "if nightly or aurora then" dance. So, let's create a unified variable for it. It's easier to grep for, etc.

IMO this beats our current solution of "revert changes to prefs once a tree is uplifted." But that is a bigger problem and we shouldn't try tackling it in this bug.
Assignee: nobody → gps
Status: NEW → ASSIGNED
Attachment #723635 - Flags: review?(mh+mozilla)
rnewman for code. vladan for perf sign-off since this may cause jank.
Attachment #723636 - Flags: review?(rnewman)
Attachment #723636 - Flags: feedback?(vdjeric)
Gavin pointed me to RELEASE_BUILD, which is defined at https://mxr.mozilla.org/mozilla-central/source/config/rules.mk#1280. He was trying to convince me on IRC that "'a' in $GRE_MILESTONE" might make more sense. I'm not totally familiar with the most appropriate method here...
Comment on attachment 723636 [details] [diff] [review]
Part 2: Flush prefs after receiving server response, v1

Review of attachment 723636 [details] [diff] [review]:
-----------------------------------------------------------------

I think this does what you intend it to.

::: services/healthreport/healthreporter.jsm
@@ +1156,5 @@
>  
> +#ifdef MOZ_PRERELEASE_CHANNEL
> +    // Intended to be temporary until we a) assess the impact b) bug 846133
> +    // deploys more robust storage for state.
> +    Services.prefs.savePrefFile(null);

I suggest you wrap this in a try..catch. That method call can throw in all manner of ways.

There's nothing we can do about it, but note that we're assuming that the prefs service has its prefs file -- if it doesn't have one for some reason, this quietly returns OK.
Attachment #723636 - Flags: review?(rnewman) → review+
Whiteboard: [leave-open]
Privacy feedback wanted:

Can we add "did blocklist ping occur" to FHR payload?

The preferred design is to record the blocklist ping's counter value for each day we submit an FHR payload. If the blocklist ping is enabled and working successfully, the counter increments by 1 every day Firefox is used. Metrics will see this from FHR payloads and will know if the FHR's over-reporting is due to blocklist pings being disabled or not working. We can derive additional user benefit from this data. For example, about:healthreport could recommend the user re-enable a disabled blocklist ping to get protection from bad add-ons. It can also help us better assess the impact of the blocklist ping.

Here is what the data would look like:

{
  "2013-03-10": {
    "blocklist": {count: 314},
  },
  "2013-03-11": {
    "blocklist": {count: 315},
  },
}

This would tell us 315 blocklist pings have been sent and that the blocklist ping was enabled for at least 2013-03-11.
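Inferring "was the blocklist ping sent on day D" from these counter snapshots is then a matter of comparing consecutive days. A sketch, assuming the per-day payload shape from the example above (with string keys):

```python
def blocklist_ping_sent(days):
    """Given a per-day payload ("YYYY-MM-DD" -> {"blocklist": {"count": n}}),
    infer for each day after the first whether the blocklist counter
    advanced, i.e. whether a ping was sent that day."""
    sent = {}
    prev = None
    for day in sorted(days):
        count = days[day].get("blocklist", {}).get("count")
        if count is None:
            continue  # blocklist measurement missing for this day
        if prev is not None:
            sent[day] = count > prev
        prev = count
    return sent
```

This also surfaces the "weird" cases mentioned later in the thread: a counter that jumps by more than 1 (multiple pings in a day) or stays flat (no ping).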
Flags: needinfo?(afowler)
Comment on attachment 723635 [details] [diff] [review]
Part 1: Add MOZ_PRERELEASE_CHANNEL variable to build system, v1

Review of attachment 723635 [details] [diff] [review]:
-----------------------------------------------------------------

::: configure.in
@@ +4362,5 @@
> +if test "$MOZ_UPDATE_CHANNEL" = "nightly" -o "$MOZ_UPDATE_CHANNEL" = "aurora"; then
> +    MOZ_PRERELEASE_CHANNEL=1
> +    AC_DEFINE(MOZ_PRERELEASE_CHANNEL)
> +    AC_SUBST(MOZ_PRERELEASE_CHANNEL)
> +fi

So, things are a bit confusing, and if you want to add a variable, you might as well make things *very* clear: what does it apply to? Here, nightly+aurora, but what if you want it enabled on project branches? What about local builds?

In the case of telemetry/FHR, we may well just want them on nightly+aurora and *not* on local builds nor project branches. But what about other use cases for this variable? Are there actually more use cases for enabling something on nightly+aurora but *not* local builds or project branches?
Attachment #723635 - Flags: review?(mh+mozilla)
(In reply to Gregory Szorc [:gps] from comment #9)
> Privacy feedback wanted:
> 
> Can we add "did blocklist ping occur" to FHR payload?
> 
> The preferred design is to record the blocklist ping's counter value for
> each day we submit an FHR payload. If the blocklist ping is enabled and
> working successfully, the counter increments by 1 every day Firefox is used.
> Metrics will see this from FHR payloads and will know if the FHR's
> over-reporting is due to blocklist pings being disabled or not working. We
> can derive additional user benefit from this data. For example,
> about:healthreport could recommend the user re-enable a disabled blocklist
> ping to get protection from bad add-ons. It can also help us better assess
> the impact of the blocklist ping.
> 
> Here is what the data would look like:
> 
> {
>   "2013-03-10": {
>     "blocklist": {count: 314},
>   },
>   "2013-03-11": {
>     "blocklist": {count: 315},
>   },
> }
> 
> This would tell us 315 blocklist pings have been sent and that the blocklist
> ping was enabled for at least 2013-03-11.

Do you think it might be better to instead set
"blocklist": {sent: 1}
or
"blocklist": {sent: 0}

Then inspection of this format tells us whether the blocklist ping was sent. With the format described in comment 9, we'd have to check whether the count for day N is greater than the count for day N-1 to infer whether the blocklist ping was sent.

Also, another thing: we don't know what build ID the installation had on the day the blocklist ping was sent. In the above example, we would have no idea of the build ID on 03/10 and 03/11. FHR only records the latest build ID (which makes sense for release). The blocklist query returns a count of the number of pings on, e.g., Aurora and builds for *that day*. So could the build ID (first 8 digits, YYYYMMDD) be included at the day level? This recommendation is temporary and only a sanity check.
The main division is between release-like and pre-release-like builds. This is roughly differentiating between {Release, Beta} and {Aurora, Nightly}, respectively.

Making things more complicated are local and project branch builds, which may not use --enable-update-channel.

There are some features (like Telemetry) that we purposefully do not want enabled on local builds (well, at least we don't want the data submission part enabled, but Telemetry doesn't have a separate flag/pref for that AFAICT). There are other features we want enabled on all pre-release-like builds. These include features incubating on Aurora and Nightly for an extended period of time.

But why draw the line there? Surely there are features we want enabled on all builds up to Beta!

While I think we need to solve this problem generically, there is urgency in this bug. So, I propose a quick one-off for FHR only. We can establish a unified build system convention in another bug.
(In reply to Saptarshi Guha from comment #11)
> Do you think it might be better to see set
> "blocklist": {sent: 1}
> or
> "blocklist": {sent: 0}
> 
> Then inspection of this format tells us if blocklist was sent. With the
> format described in comment(9), we'd have to see if the count for day N is
> greater than count for day (N-1) to infer if the blocklist ping was sent.

I'm fine with either. A boolean flag requires us to maintain last-known blocklist state in FHR, which is slightly more complicated, but manageable.

Having the raw counts also lets us know if blocklist is doing something weird, such as firing multiple times in a day or not firing at all.

> Also another thing: we dont know what the build id the installation had on
> the day the blocklist ping was sent. In the above example, we would have no
> idea of the build id on 03/10 and 03/11. FHR only records the latest build
> id (which makes sense for release). The blocklist query returns a count for
> number of pings on e.g. Aurora and builds for *that day*. So possibly the
> buildid (first 8 digits, YYYYMMDD) included on the day level? This
> recommendation is temporary and only as a sanity check.

So what you are saying is you want a per-day build ID? Since we already have this data in the payload as a "last" value and we already track when app version upgrades occur in daily history, we should just be able to add this.
(In reply to Gregory Szorc [:gps] from comment #12)
> While I think we need to solve this problem generically, there is urgency in
> this bug. So, I propose a quick one-off for FHR only. We can establish a
> unified build system convention in another bug.

You don't need to add an AC_SUBST for a one-off.
We use MOZILLA_OFFICIAL to differentiate between "builds done on tinderbox" and "builds done by local developers", FYI. This is what we use to enable crash reporting at runtime.
Implementing a one-off in the preprocessor.
Attachment #723635 - Attachment is obsolete: true
Attachment #723636 - Attachment is obsolete: true
Attachment #723636 - Flags: feedback?(vdjeric)
Attachment #724032 - Flags: review?(rnewman)
Attachment #724032 - Flags: review?(mh+mozilla)
Attachment #724032 - Flags: feedback?(vdjeric)
Comment on attachment 724032 [details] [diff] [review]
Part 2: Flush prefs after receiving server response, v2

Review of attachment 724032 [details] [diff] [review]:
-----------------------------------------------------------------

::: services/healthreport/Makefile.in
@@ +43,5 @@
>  PP_TARGETS += MAIN_JS_MODULE
>  
>  MODULES := $(modules)
>  MODULES_PATH = $(FINAL_TARGET)/modules/services/healthreport
> +MODULES_FLAGS := $(extra_pp_flags)

This is okay in itself, but ... we are effectively installing all modules twice here, aren't we? (once all merged in HealthReport.jsm, and once individually in modules/services/healthreport)
Attachment #724032 - Flags: review?(mh+mozilla) → review+
(In reply to Mike Hommey [:glandium] from comment #17)
> This is okay in itself, but ... we are effectively installing all modules
> twice here, aren't we? (once all merged in HealthReport.jsm, and once
> individually in modules/services/healthreport)

Yes... for hysterical raisins. One will hopefully go away as part of bug 836285.
Attachment #724032 - Flags: review?(rnewman) → review+
I don't believe that just reporting the blocklist counter value is good enough here.

One of the discoveries we made when analyzing the counter data is what we called the "Sunday effect". On Mondays, we would see more pings with a "days since last ping" value of 1 (meaning they had last pinged on Sunday) than the total pings we actually saw for Sunday.

The hypothesis we came up with is that the blocklist code runs, it makes the request, but the request was never properly received and logged on our side.  We don't have any good data on why this might happen, but one could assume one possible reason is lack of an internet connection on an idle computer (such as one left on in an office where internet access is controlled / limited via a proxy or scheduled firewall).

If we want this measure to be more reliable, I think we'd need to evaluate putting some minor code into blocklist itself to provide more accurate reporting (such as counting the number of success responses to the GET request).
> I'm fine with either. A boolean flag requires us to maintain last-known
> blocklist state in FHR, which is slightly more complicated, but manageable.
Oh yes, your approach works nicely (it covers a case I hadn't thought of), i.e. the one outlined in comment 9. Let's stick with what you describe. Also see Daniel's comments.


> 
> Having the raw counts also lets us know if blocklist is doing something
> weird, such as firing multiple times in a day or not firing at all.
> 
> 
> So what you are saying is you want a per-day build ID? Since we already have
> this data in the payload as a "last" value and we already track when app
> version upgrades occur in daily history, we should just be able to add this.

yes.
> > So what you are saying is you want a per-day build ID? Since we already have
> > this data in the payload as a "last" value and we already track when app
> > version upgrades occur in daily history, we should just be able to add this.
> 
> yes.

One more thing: can we store the build ID that was sent in the blocklist ping? That is, FHR might store the build ID after an upgrade while the blocklist ping was sent with a build ID from before the upgrade.
This would make for a correct comparison, assuming, as per Daniel's comment 19, that the blocklist ping was indeed sent.
Depends on: 850450
I poked around the blocklist code today. I discovered that the blocklist ping is using nsIUpdateTimerManager (https://mxr.mozilla.org/mozilla-central/source/toolkit/mozapps/update/nsUpdateTimerManager.js) to schedule blocklist pings. That generic timer manager (which should probably be decoupled from the update service because it's /really/ generic - but I digress) is storing last fired time in prefs - just like FHR. I could find no reference in the timer manager or blocklist code that forces a prefs save.

What this means is that FHR and blocklist should be impacted by prefs loss in roughly the same way in terms of scheduling. Now, scheduling does not necessarily translate into FHR payload counts. Prefs loss is bad for FHR because it could cause a document on the server to be orphaned: the client lost its ID and doesn't know it needs to delete it. If we want a fair comparison before the patch in this bug lands, we could compare raw HTTP request counts between FHR and the blocklist ping. However, obtaining that data on a per-channel basis might be difficult or impossible. We should still proceed with this patch. If it results in a noticeable change, then we have a real problem of pref loss occurring in the wild.
Depends on: 850483
(In reply to Saptarshi Guha from comment #0)
>    The FHR activity for 2013-03-05 are all documents that
>    were active on 2013-03-05 though they need not have been on a build
>    >= 2013-03-05 on the day 2013-03-05. 
>    Hence the two numbers are not quite comparable.

Why can't we apply the same criteria when generating FHR & Blacklist numbers?

1) Restrict both FHR & Blocklist counts to builds after March 3rd
2) Report FHR activity for 2013-03-05 as all documents that were active on 2013-03-05 and build >= 2013-03-05
Flags: needinfo?(sguha)
But the build ID in FHR documents is the latest build ID, not the build ID on 2013-03-05. The blocklist ping captures the build ID on that day.

The point here is: do documents active on day D (as per FHR) send a blocklist ping? If this can be confirmed and FHR is indeed higher, that implies the blocklist ping underestimates, and we know by how much.
Flags: needinfo?(sguha)
Priority: -- → P1
Comment on attachment 724032 [details] [diff] [review]
Part 2: Flush prefs after receiving server response, v2

I don't want to hold up the investigation, so I'm OK with giving feedback+ for a short-running experiment on nightly/aurora that does a main-thread prefs flush once per session. However, I need to talk to Joy to understand exactly what the document count numbers represent. I'm unclear on a few points, e.g. why is the "discrepancy ratio" trending downward?
Attachment #724032 - Flags: feedback?(vdjeric)
Vladan,

There was no mention of ratios trending downwards. What we have right now is that the blocklist count is much lower than the FHR count for a given day.

We need to know why: is it because the blocklist under-reports (and we can enumerate cases why it would)?

We need to know whether FHR documents being submitted are indeed sending blocklist pings, i.e. is the set of FHR submissions a superset of blocklist pings?

Also, consider the following pseudo-SQL query against blocklist:

select buildid, sum(pings) from blocklist
where version = 22 and channel = 'aurora'
  and date = '2013-03-03' and product = 'firefox'
group by buildid

The result is the number of BL pings sent by active installations on 2013-03-03, grouped by the build IDs active on that date. This result stays the same if we run the above query on 2013-03-04 or 2013-03-05 or any day >= 2013-03-04.

Doing the same query for FHR changes for any day >= 2013-03-03, since FHR only reports the build ID that the installation is currently on, not the build ID that the installation was on on 2013-03-03.

So ultimately, the inferences we make from this are:
a) for every installation using FHR, is Firefox attempting to send the blocklist ping? ([1*] currently, we cannot confirm whether the ping was indeed received by our servers)
b) compute the count of FHR active installations on a given build for a given day and compare that number with the blocklist ping count.

With the caveat of [1*], we will then gain some insight into why the FHR actives for 2013-03-03 (or any date) are much higher than the blocklist pings for that date.
https://hg.mozilla.org/mozilla-central/rev/7eeb6eb9b802
Target Milestone: --- → mozilla22
Comment on attachment 724032 [details] [diff] [review]
Part 2: Flush prefs after receiving server response, v2

[Approval Request Comment]
Bug caused by (feature/regressing bug #): FHR and Prefs durability interaction.
User impact if declined: Clients may lose FHR state, causing excessive uploading and causing Mozilla to not have an accurate picture from which to derive useful recommendations.
Testing completed (on m-c, etc.): It's been on m-c for a few days. Patch is trivial and does nothing radical.
Risk to taking this patch (and alternatives if risky): Might cause jank.
String or UUID changes made by this patch: None

This patch only affects the aurora and nightly channels due to preprocessor magic. The intent is for this patch to be a temporary mechanism to help identify discrepancies between FHR and blocklist ping counts. This discrepancy is probably the single biggest question mark hovering over FHR and answering it will make a lot of people very happy.

This bug is marked as a P1 which means it's an FHR launch blocker. We need this patch to land on Aurora to assess the impact that non-durable prefs storage is having on FHR.
Attachment #724032 - Flags: approval-mozilla-aurora?
Comment on attachment 724032 [details] [diff] [review]
Part 2: Flush prefs after receiving server response, v2

Trivial patch affects aurora & nightly channels only.

Please keep an eye on any performance impacts this may have due to the outlined risk here
Attachment #724032 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Alex's Privacy review is on other bugs.
Flags: needinfo?(afowler)
I was wrong. We're keeping this bug open to add some form of "did blocklist occur" to FHR. Privacy review is requested in comment #9.
Flags: needinfo?(afowler)
Observation: At least 0.195% of FHR documents reference dates in the future. Due to bug 853198, blocklist ping may not fire if it has recorded a time in the future.

Given the low percentage of FHR payloads with future dates, I do not think bug 853198 is significantly contributing to the discrepancy outlined in this bug. That being said, 0.195% is the minimal percentage of FHR documents referencing future dates. I believe the actual percentage is a little higher. However, I don't believe it is high enough to explain the double digit discrepancy we're currently seeing.
Depends on: 854018
Here is some interesting data:

Build ID        Total   NoLast  % NoLast
20130302042014  11223   6307    56.20
20130303042011  11691   3682    31.49
20130304042010  12008   2666    22.20
20130305042015  11295   2033    18.00
20130306042010  13085   1951    14.91
20130307042016  11131   1440    12.94
20130308042013  11890   1675    14.09
20130309042012  11069   1358    12.27
20130310042012  12759   1502    11.77
20130311042011  12816   1486    11.59
20130312042013  13515   1396    10.33
20130313042013  12214   1218    9.97
20130314042011  54642   4248    7.77
20130315042012  8456    725     8.57
20130316042013  6957    659     9.47
20130317042012  7691    601     7.81
20130318042013  9696    747     7.70
20130319042012  8041    504     6.27
20130320042012  10311   736     7.14
20130321042014  18510   1036    5.60
20130322042013  31154   1247    4.00
20130323042013  20262   677     3.34
20130324042013  40038   836     2.09
20130325042013  4660    148     3.18

This is a breakdown of the *current* Aurora payloads on the server with build IDs having more than 500 current payloads. The 2nd column is the total number of payloads with 8 or more days of data in the payload. The 3rd column is the number of those payloads that *don't* have a last ping date.

Essentially what I'm trying to do is isolate clients that have been running FHR for a few days but are saying they've never submitted a payload. This should be uncommon because FHR should attempt an upload every day. If we've collected multiple days of data, at least one of those collections should have triggered an upload. For a client to not declare a payload for over a week would mean one of the following:

a) The privacy policy has not been accepted
b) There was an error uploading the payload (firewall, server outage, bad connectivity, etc)
c) The client thinks it never uploaded before (e.g. it lost the pref write)

So, I could look at this graph and see that the percentage is going down. That seems good. However, what I'd like to see is a nice dip for the 2013-03-21 Aurora - the one where we started flushing preferences - to conclude that flushing prefs had an impact (and that pref writes are being lost by a significant percentage of clients). If you plot the curve of the percentages, there does seem to be a dip. However, it's not pronounced enough to be conclusive. I think more data is needed. I'd love to look at these numbers again at the end of the week.
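As a sanity check on the table above, the percentage column and the before/after comparison can be reproduced with a small script. This is a sketch; the cutoff build ID 20130321042014 (the first build that flushes prefs) is taken from the comment above.

```python
def no_last_percentages(rows):
    """rows: (build_id, total, no_last) tuples as in the table above.
    Returns (build_id, pct_no_last) pairs."""
    return [(b, round(100.0 * n / t, 2)) for b, t, n in rows]

def mean_pct(rows, predicate):
    """Unweighted mean of pct_no_last over rows whose build_id matches
    the predicate; compare pre-flush vs post-flush builds, e.g.
    mean_pct(rows, lambda b: b < "20130321042014") versus >=."""
    pcts = [p for b, p in no_last_percentages(rows) if predicate(b)]
    return sum(pcts) / len(pcts)
```

A pronounced gap between the two means would be the "nice dip" the comment above is looking for; a small gap means more data is needed.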
(In reply to Gregory Szorc [:gps] from comment #9)

Hi Gregory -

Based on bug https://bugzilla.mozilla.org/show_bug.cgi?id=855710, this sounds OK to me. We should deprecate it once it's no longer needed.

Thanks
Jishnu

> Privacy feedback wanted:
> 
> Can we add "did blocklist ping occur" to FHR payload?
> 
> The preferred design is to record the blocklist ping's counter value for
> each day we submit an FHR payload. If the blocklist ping is enabled and
> working successfully, the counter increments by 1 every day Firefox is used.
> Metrics will see this from FHR payloads and will know if the FHR's
> over-reporting is due to blocklist pings being disabled or not working. We
> can derive additional user benefit from this data. For example,
> about:healthreport could recommend the user re-enable a disabled blocklist
> ping to get protection from bad add-ons. It can also help us better assess
> the impact of the blocklist ping.
> 
> Here is what the data would look like:
> 
> {
>   "2013-03-10": {
>     "blocklist": {count: 314},
>   },
>   "2013-03-11": {
>     "blocklist": {count: 315},
>   },
> }
> 
> This would tell us 315 blocklist pings have been sent and that the blocklist
> ping was enabled for at least 2013-03-11.
Component: Metrics and Firefox Health Report → Client: Desktop
Flags: approval-mozilla-aurora+
Product: Mozilla Services → Firefox Health Report
Target Milestone: mozilla22 → ---
Component: Client: Desktop → Data Collection
OS: Linux → All
Hardware: x86_64 → All
[Approval Request Comment]
Bug caused by (feature/regressing bug #): FHR
User impact if declined: See comment below.
Testing completed (on m-c, etc.): The equivalent has been on Aurora and Nightly for a while.
Risk to taking this patch (and alternatives if risky): Should be very low risk.
String or IDL/UUID changes made by this patch: None

I was talking with Brendan and Saptarshi of Metrics. We are still seeing an abundance of what I'm calling "orphaned" FHR records. These are records that are uploaded and don't get updated for a long period of time, if ever. As described in earlier comments in this bug, a possible cause of orphaned records is that pref writes are being lost.

We would like to perform an A/B comparison of sorts on Beta to isolate the impact of lost pref writes. (We don't believe a comparison on the Aurora and Nightly branches is as effective because of the churn on those channels.)

What I'm proposing is that we enable pref flushing after upload on the Beta channel for a few weeks. Then, before beta goes to release, we back out the change. Or we can always change this patch such that it's conditional on the beta or below branch and no backout is needed. Whatever is preferred.

I believe Metrics would like this experiment conducted ASAP. One reason is it will likely take a few weeks to have sufficient data to make confident conclusions. However, if we're not comfortable doing it in 21, we can probably wait until Beta 22. I will defer to Metrics to champion 21 if the approval isn't a rubber stamp.

On a technical side, this should be very low risk. Taking this patch will force a fsync() that should eventually happen anyway. Backing out this patch will restore current behavior, which is known to not be problematic aside from the possible orphaned records issue.

If the A/B comparison reveals that pref writes are being lost, we will likely request the forced prefs flushing be permanent until we move FHR state to a file instead. This would hopefully occur in time for consideration of uplift to Beta 22. If there is a point release of 21 release, we may consider requesting pref flushes go into that. But let's not get too ahead of ourselves. We need data first. And this patch will start to collect it for us.
Attachment #739708 - Flags: approval-mozilla-beta?
I would appreciate sign-off from Performance on conducting this experiment. Taras?

Keep in mind the prefs fsync (assuming there is one) after FHR server response likely comes a few hundred milliseconds after the SQLite WAL fsync. Given the expected close proximity, the cost of the prefs fsync should be marginal.
Flags: needinfo?(taras.mozilla)
(In reply to Gregory Szorc [:gps] from comment #37)
> We would like perform an A/B comparison of sorts on Beta to isolate the
> impact of lost pref writes. (We don't believe the comparison on Aurora and
> Nightly branches is as effective because of the churn on those channels.)

Why does churn invalidate the data gathered with Nightly/Aurora?
(In reply to Vladan Djeric (:vladan) from comment #39)
> (In reply to Gregory Szorc [:gps] from comment #37)
> > We would like perform an A/B comparison of sorts on Beta to isolate the
> > impact of lost pref writes. (We don't believe the comparison on Aurora and
> > Nightly branches is as effective because of the churn on those channels.)
> 
> Why does churn invalidate the data gathered with Nightly/Aurora?

I second this question. I'm also not sure of the usefulness of this experiment if you don't do a proper comparison (e.g. have some profiles flush and some don't, and indicate that in the FHR payload).

Also, can you quantify 'abundance'? 

This is an additional fsync, not one that would have happened anyway. I think the later fsyncs only happen if something else wants to write to prefs, and that write would still need its own fsync.
Flags: needinfo?(taras.mozilla)
How would one determine if the patch has decreased orphaned records?
And what is the exact definition of orphaned? A record that isn't updated?
That could also be because the user has stopped using Firefox. Do we have an idea of how many are orphaned?
We would compare the orphan rate of builds with this patch against the orphan rate of builds without it. We can tell them apart by the build ID in FHR's existing payload: no new data would need to be introduced. We simply know that builds in some date range have one behavior and the rest another. Hopefully the adoption and usage patterns of Firefox are similar across those builds, or the experiment is invalid. A long-running experiment helps ensure the patterns are similar.

An orphaned record is semantically equivalent to a stale/old record: it is a record on the server that is N or more days old. If every client used Firefox every day, we should have 0 orphaned records.

If you graphed the age in days of all records on the server, you would expect to see a reverse exponential curve that plateaus somewhere. If we are actively losing track of uploaded document IDs because of pref non-persistence, this patch should result in a curve that plateaus at a lower number of records per day.
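The comparison described above could be sketched roughly as follows. This is a hypothetical illustration, not actual FHR tooling: the build IDs, dates, and the 30-day orphan threshold are all made up for the example.

```python
from datetime import date

TODAY = date(2013, 4, 10)
ORPHAN_AGE_DAYS = 30              # a record this many or more days old counts as orphaned
PATCH_BUILD = "20130401000000"    # hypothetical build ID at which the patch landed

# Hypothetical server-side records: (build_id, date_of_last_upload)
records = [
    ("20130405030931", date(2013, 4, 9)),
    ("20130405030931", date(2013, 3, 1)),
    ("20130310030931", date(2013, 3, 2)),
    ("20130310030931", date(2013, 4, 8)),
    ("20130310030931", date(2013, 2, 20)),
]

def orphan_rate(recs):
    """Fraction of records whose last upload is ORPHAN_AGE_DAYS or more days old."""
    orphans = sum((TODAY - last).days >= ORPHAN_AGE_DAYS for _, last in recs)
    return orphans / len(recs)

# Split cohorts by build ID: builds on/after the patch landed vs. earlier builds.
with_patch = [r for r in records if r[0] >= PATCH_BUILD]
without_patch = [r for r in records if r[0] < PATCH_BUILD]
```

If the pref-loss hypothesis is right, `orphan_rate(with_patch)` should come out measurably lower than `orphan_rate(without_patch)` once enough data accumulates.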
I'm on PTO today. Ask someone in Metrics for the ECDF of

X := (today - thispingdate) in days

That is the current view of orphans.
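The ECDF of record age referred to here could be computed like this. The ping dates are invented for illustration; the real computation would run over all records on the server.

```python
from datetime import date

TODAY = date(2013, 4, 15)

# Hypothetical thispingdate values for a handful of FHR records
ping_dates = [date(2013, 4, 14), date(2013, 4, 10), date(2013, 3, 25),
              date(2013, 3, 20), date(2013, 2, 1)]

# X := (today - thispingdate) in days, per the comment above
ages = sorted((TODAY - d).days for d in ping_dates)

def ecdf(x, sample):
    """Empirical CDF: fraction of records whose age is <= x days."""
    return sum(1 for a in sample if a <= x) / len(sample)

# e.g. what fraction of records pinged within the last 25 days?
frac_recent = ecdf(25, ages)
```

Comparing this curve for patched vs. unpatched builds is the proposed test: if pref writes are being lost, the patched cohort's ECDF should rise faster (records go stale less often).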

What happens when users move from the patched build to the subsequent build? Will the subsequent build still have the patch?


So if we put this in build b, we can look at the days between activity before the user moves to build b and compare it to after. However, this might be useless if this patch affects lastpingdate and thispingdate.
(In reply to Gregory Szorc [:gps] from comment #37)
> Created attachment 739708 [details] [diff] [review]
> Always flush prefs after server response, v1
> 
> [Approval Request Comment]
> Bug caused by (feature/regressing bug #): FHR
> User impact if declined: See comment below.
> Testing completed (on m-c, etc.): The equivalent has been on Aurora and
> Nightly for a while.
> Risk to taking this patch (and alternatives if risky): Should be very low
> risk.
> String or IDL/UUID changes made by this patch: None
> 
> I was talking with Brendan and Saptarshi of Metrics. We are still seeing an
> abundance of what I'm calling "orphaned" FHR records. These are records that
> are uploaded and don't get updated for a long period of time, if ever. As
> described in earlier comments in this bug, a possible cause of orphaned
> records is that pref writes are being lost.
> 
> We would like to perform an A/B comparison of sorts on Beta to isolate the
> impact of lost pref writes. (We don't believe the comparison on Aurora and
> Nightly branches is as effective because of the churn on those channels.)
> 
> What I'm proposing is that we enable pref flushing after upload on the Beta
> channel for a few weeks. Then, before beta goes to release, we back out the
> change. Or we can always change this patch such that it's conditional on the
> beta or below branch and no backout is needed. Whatever is preferred.
> 
> I believe Metrics would like this experiment conducted ASAP. One reason is
> it will likely take a few weeks to have sufficient data to make confident
> conclusions. However, if we're not comfortable doing it in 21, we can
> probably wait until Beta 22. I will defer to Metrics to champion 21 if the
> approval isn't a rubber stamp.

Looks like we still need Alex's review and Metrics input on this bug. We will be going to build for our Beta 4 tomorrow at noon PT, and this may not get into Fx21 if we do not have the needed responses by then.
> 
> On a technical side, this should be very low risk. Taking this patch will
> force a fsync() that should eventually happen anyway. Backing out this patch
> will restore current behavior, which is known to not be problematic aside
> from the possible orphaned records issue.
> 
> If the A/B comparison reveals that pref writes are being lost, we will
> likely request the forced prefs flushing be permanent until we move FHR
> state to a file instead. This would hopefully occur in time for
> consideration of uplift to Beta 22. If there is a point release of 21
> release, we may consider requesting pref flushes go into that. But let's not
> get too ahead of ourselves. We need data first. And this patch will start to
> collect it for us.
Flags: needinfo?(sguha)
gps, can you please help address comment #40? We're still trying to understand whether we need this on beta, given that.

Also from the metrics side, joy will have the needed data by tomorrow morning if that helps.
I hope this sheds some of the light needed for answering comment 43.

Gps: your hypothesis is that, if the patch is applied to nightly and works, then, for example, the 90th percentile (currently 25 days) will decrease?
Assuming everything else stays constant.
Flags: needinfo?(sguha)
(In reply to Saptarshi Guha from comment #47)
> Gps: your hypothesis is that, if the patch is applied to nightly and works,
> then, for example, the 90th percentile (currently 25 days) will decrease?
> Assuming everything else stays constant.

That is roughly my hypothesis, yes. However, we'd likely have to compare historical percentages of specific build IDs in specific channels (mostly Beta).
I spoke offline with :gps, and he is considering conducting this experiment on Fx22 Beta instead, so marking a- here and wontfix for Fx21.
Attachment #739708 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
Flags: needinfo?(afowler)
I'm going to call this fixed and we'll follow up on other data issues in other bugs.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Product: Firefox Health Report → Firefox Health Report Graveyard