Closed Bug 855024 Opened 8 years ago Closed 8 years ago

High-level collection APIs aren't safe when run in parallel

Categories

(Firefox Health Report Graveyard :: Client: Desktop, defect)

defect
Not set
normal

Tracking

(firefox21 fixed, firefox22 fixed, firefox23 fixed)

RESOLVED FIXED
Firefox 23
Tracking Status
firefox21 --- fixed
firefox22 --- fixed
firefox23 --- fixed

People

(Reporter: gps, Assigned: gps)

References

Details

(Whiteboard: [qa-])

Attachments

(1 file)

As mconnor discovered when testing about:healthreport, some of the high-level APIs on the health reporter don't react well when called in parallel. e.g. multiple outstanding collectMeasurements() will result in badness.

The underlying problem is likely pull-only providers. These functions currently ensure they are registered, do work, then unregister those providers. If the first call unregisters pull-only providers while a 2nd call is ongoing, there is badness.

I see two options:

1) Change pull-only providers registration to record an active request count. Every time an operation begins that requires pull-only providers we increment the counter. If we go from 0 to 1 we initialize pull-only providers. If we go from 1 to 0, we de-initialize them.

2) Coalesce promises. If a call to collectMeasurements() etc is performed when another is in flight, we return the promise created during the first call.

I'm leaning towards #2. But, we should do #1 anyway because it is more robust (it was not done originally because of time pressure).
We may need to bump this up to a P1. I haven't verified this, but if someone refreshes about:healthreport rapidly, this would likely put the core service into a bad state. It would recover on next app restart, sure. But, it's a really annoying bug.
Component: Metrics and Firefox Health Report → Client: Desktop
Product: Mozilla Services → Firefox Health Report
This should address the immediate concerns over race conditions in pull-only provider registration. With this patch, the provider manager counts how many consumers of pull-only providers there are. If unregistration is requested while another entity still has a "reference" to a pull-only provider "session" then the request will be ignored. It's kinda like garbage collection.

The underlying API around pull-only providers is still rather crap. I would like for healthreporter to be agnostic about pull-only providers and for pull-only providers to be an internal implementation detail inside the provider manager. However, this requires exposing more collection APIs on provider manager. I can do that as a part 2 or as a follow-up bug. Reviewer gets to choose.
Assignee: nobody → gps
Status: NEW → ASSIGNED
Attachment #737697 - Flags: review?(rnewman)
Comment on attachment 737697 [details] [diff] [review]
Better pull-only provider handling, v1

Review of attachment 737697 [details] [diff] [review]:
-----------------------------------------------------------------

Follow-up bug, in answer to your inquiry.
Attachment #737697 - Flags: review?(rnewman) → review+
https://hg.mozilla.org/mozilla-central/rev/866528335dfc
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Comment on attachment 737697 [details] [diff] [review]
Better pull-only provider handling, v1

[Approval Request Comment]
Bug caused by (feature/regressing bug #): FHR
User impact if declined: Rapidly reloading about:healthreport might result in the FHR core service getting in a weird state and displaying incomplete information.
Testing completed (on m-c, etc.): It just landed yesterday. It needs a few more days to see if any fallout is identified.
Risk to taking this patch (and alternatives if risky): The risk is we broke pull-only providers in FHR. This would mean not all data would be collected. A few days of data analysis on Nightly should reveal whether we regressed anything.
String or IDL/UUID changes made by this patch: None
Attachment #737697 - Flags: approval-mozilla-beta?
Attachment #737697 - Flags: approval-mozilla-aurora?
(In reply to Gregory Szorc [:gps] from comment #6)
> Comment on attachment 737697 [details] [diff] [review]
> Better pull-only provider handling, v1
> 
> [Approval Request Comment]
> Bug caused by (feature/regressing bug #): FHR
> User impact if declined: Rapidly reloading about:healthreport might result
> in the FHR core service getting in a weird state and displaying incomplete
> information.
> Testing completed (on m-c, etc.): It just landed yesterday. It needs a few
> more days to see if any fallout is identified.
> Risk to taking this patch (and alternatives if risky): The risk is we broke
> pull-only providers in FHR. This would mean not all data would be collected.
> A few days of data analysis on Nightly should reveal whether we regressed
> anything.

I am approving this on aurora as this is self-contained to fhr and will have a checkpoint to verify with :gps by Tuesday to see how the data on aurora/nightly looks before we land this on beta.

:gps, if you have any information on data looking as expected by Tuesday, please comment in the bug.

> String or IDL/UUID changes made by this patch: None
Attachment #737697 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Blocks: 861455
needsinfo on gps, so that we have the answer to comment #7 by tonight/tomorrow before we approve/deny beta uplift here.
Flags: needinfo?(gps)
I checked offline with :gps and he has confirmed no regressions were found based on his extensive testing conducted to test this patch.

I am going to approve this on beta given that. Please make sure to watch the data closely once Fx21 beta 4 is out end of week would be a good time to make sure there were no unexpected fallouts on beta.
Flags: needinfo?(gps)
Comment on attachment 737697 [details] [diff] [review]
Better pull-only provider handling, v1

Please land asap to get this into Fx21 beta 4.
Attachment #737697 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
I tried to reproduce this issue on Firefox 21 beta 3: at the first load, about:healthreport shows Vital Stats, This Month, Add-ons, Plugins, and Startup Time by Day. After several reloads, Startup Time by Day disappears.

Is this the incomplete information you were referring to in comment 6?



When trying to verify this fix on Fx21 beta 4 with the same scenario, I got the following results:
Win 7, Ubuntu 12.10 - All the data is displayed even after performing multiple reloads.

Mac OSX 10.8.2 - The Startup Time by Day section is missing from the first load of the page.
Flags: needinfo?(gps)
If you look at the raw data, the following items may disappear after rapid reloading:

* data['last']['org.mozilla.appInfo.appinfo']
* data['last']['org.mozilla.appInfo.versions']
* data['last']['org.mozilla.sysinfo.sysinfo']
* data['days'][X]['org.mozilla.appSessions.previous']
* data['days'][X]['org.mozilla.crashes.crashes']

org.mozilla.appSessions.previous is the "Startup Time by Day" data.

It's interesting that the "Startup Time by Day" section was missing on first load on OS X. Not something I've ever seen. Perhaps it was a first profile or the iframe is not displaying data if the number of data points is small? I recommend looking at the raw JSON: that way we don't have anything potentially filtering data and causing false positives.
Flags: needinfo?(gps)
Thanks Gregory.

Ioana, can you retest this given the information in comment 14?
Keywords: verifyme
(In reply to Gregory Szorc [:gps] from comment #14)
> If you look at the raw data, the following items may disappear after rapid
> reloading:
> 
> * data['last']['org.mozilla.appInfo.appinfo']
> * data['last']['org.mozilla.appInfo.versions']
> * data['last']['org.mozilla.sysinfo.sysinfo']
> * data['days'][X]['org.mozilla.appSessions.previous']
> * data['days'][X]['org.mozilla.crashes.crashes']
I tried to reproduce this on Firefox 21 beta 1 (20130401192816), but it didn't happen no mater how many rapid reloads I did. I'm guessing this issue is only visible intermittently, in which case it can't be verified reliably.

I also tried to reproduce it on Firefox 21 beta 4, just in case, but it worked well there too.

> org.mozilla.appSessions.previous is the "Startup Time by Day" data.
> 
> It's interesting that the "Startup Time by Day" section was missing on first
> load on OS X. Not something I've ever seen. Perhaps it was a first profile
> or the iframe is not displaying data if the number of data points is small?
> I recommend looking at the raw JSON: that way we don't have anything
> potentially filtering data and causing false positives.

I think you were right with the small data points part. After using the same profile a little longer and adding more data, the "Start Time by Day" section got displayed.
Gregory, do you have any theories on the failure to reproduce this in comment 16?
Flags: needinfo?(gps)
It would be an intermittent failure and it may be much harder to reproduce on fast hardware. I was able to reproduce this locally just once. It took a number of tries before I confirmed it.

I'm pretty confident the patch didn't regress anything. Perhaps shrugging and marking qa- wouldn't be the worst outcome here.
Flags: needinfo?(gps)
Thanks Gregory. Given that I'm marking this [qa-].
Keywords: verifyme
Whiteboard: [qa-]
Product: Firefox Health Report → Firefox Health Report Graveyard
You need to log in before you can comment on or make changes to this bug.