Race condition setting experiment branch while a write is pending

RESOLVED FIXED in Firefox 33

Status

Firefox Health Report
Client: Desktop
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: Benjamin Smedberg, Assigned: smacleod)

Tracking

31 Branch
Firefox 34
x86_64
Linux
Points:
5
Dependency tree / graph
Bug Flags:
firefox-backlog +
qe-verify -

Firefox Tracking Flags

(firefox33 fixed, firefox34 fixed)

Details

Attachments

(1 attachment)

(Reporter)

Description

3 years ago
We're seeing people change experiment branches in bug 1037595. Right now we're not sure what's happening, but my current theory is that there's a race condition as described in that bug:

I can imagine this happening in this scenario: we are starting the experiment at about the same time we're saving the experiment cache to disk here: http://hg.mozilla.org/mozilla-central/annotate/75f66f8cb99f/browser/experiments/Experiments.jsm#l954

We may end up interleaving the call which sets the dirty flag in the middle of that: http://hg.mozilla.org/mozilla-central/annotate/75f66f8cb99f/browser/experiments/Experiments.jsm#l645 which would leave us not knowing that the cache was dirty, and never re-saving it. That race is actually fairly likely, given that we will have just activated the experiment.

It may be sufficient to just move the "this._dirty" set from line 955 to 953, but in that case we probably want to catch any exceptions writing the file and reset the state to dirty.
Flags: firefox-backlog+
(Reporter)

Updated

3 years ago
Blocks: 1037595
Are there any issues with waiting on the main task (which triggers the save to cache), i.e. "yield this._mainTask" from setExperimentBranch()?
I'd rather not have different functions on each others implementation details if possible.
(Reporter)

Comment 2

3 years ago
That introduces potentially significant delay in a function that is only bound by disk load time. It would block on any HTTP reload of the manifest.

I don't think the dirty flag is an implementation detail: it's a marker that we need to save the cache to disk, and in this case we're racing on it.
(Assignee)

Updated

3 years ago
Assignee: nobody → smacleod
Status: NEW → ASSIGNED

Updated

3 years ago
Iteration: --- → 34.1
QA Whiteboard: [qa?]
Hi Steven, can you mark this bug as [qa+] or [qa-] for verification.
Flags: needinfo?(smacleod)
(Assignee)

Updated

3 years ago
QA Whiteboard: [qa?] → [qa+]
Flags: needinfo?(smacleod)
(Assignee)

Comment 4

3 years ago
Created attachment 8461931 [details] [diff] [review]
Patch - Prevent a race condition when writing the experiments cache

With this patch we now clear _dirty before writing, and reset it if we catch any errors.
Attachment #8461931 - Flags: review?(georg.fritzsche)
Attachment #8461931 - Flags: review?(georg.fritzsche) → review+
(Assignee)

Comment 5

3 years ago
https://hg.mozilla.org/integration/fx-team/rev/6b4ab5ef92fa
Target Milestone: --- → Firefox 34
https://hg.mozilla.org/mozilla-central/rev/6b4ab5ef92fa
Status: ASSIGNED → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
QA Contact: kamiljoz
Hi Liz, I believe Kamil is on PTO.  Can a new QA contact be assigned to verify this bug.
Flags: needinfo?(lhenry)
QA Contact: kamiljoz
Hi Marco,

Most of QA will be in the MTV office from 07/28/2014 - 08/01/2014. I'll try finishing this up over there whenever we have some "working time" available unless Liz knows someone from SV that can handle this one.
Thannks Kamil.  I've added you back as the QA contact.

(In reply to Kamil Jozwiak [:kjozwiak] from comment #8)
> Hi Marco,
> 
> Most of QA will be in the MTV office from 07/28/2014 - 08/01/2014. I'll try
> finishing this up over there whenever we have some "working time" available
> unless Liz knows someone from SV that can handle this one.
Flags: needinfo?(lhenry)
QA Contact: kamiljoz
Hi Steven, Florin has asked for details on how to verify this bug.  Can you provide him with that.
Flags: needinfo?(smacleod)
(Assignee)

Comment 11

3 years ago
Hmm, thinking more about this it's probably going to be very difficult to verify. It would require being able to reproduce the race condition, causing the branch switches, with a high frequency. Then using a fixed build you'd attempt reproduction and see no more switches.

This involves some very specific interleaving of execution, so it's probably going to be hard to reproduce well, and in the end we'd only be able to say "It looks like it's fixed, but we're not 100% sure". Based on the new code I'm confident the race shouldn't be possible, so we can probably skip verification.

I must have made a snap decision when marking [qa+] and didn't consider it enough, sorry!
QA Whiteboard: [qa+] → [qa-]
Flags: needinfo?(smacleod)
Thanks for taking a look Steven!
(Reporter)

Comment 13

3 years ago
I expect the best way to verify this would be to run a dummy experiment on the nightly channel, with the branch-setting code that we were using for the search experiment. Then we can use FHR to examine whether people are still switching experiment branches unexpected. Should I file a separate bug for this kind of verification?
Flags: needinfo?(smacleod)
(Assignee)

Comment 14

3 years ago
Yeah lets take care of that in a new bug.
Flags: needinfo?(smacleod)
(Reporter)

Updated

3 years ago
Depends on: 1052714
saveToCache might be closed while in the yield-in-try, forcing a return. Better to assert dirty on entry and clear dirty after the yield (still in the try).

/be
(Reporter)

Updated

3 years ago
Depends on: 1052545
Comment on attachment 8461931 [details] [diff] [review]
Patch - Prevent a race condition when writing the experiments cache

Approval Request Comment
[Feature/regressing bug #]: pre-existing bug in telemetry experiments
[User impact if declined]: Slow shutdowns for some set of users. This blocks bug 1052545!
[Describe test coverage new/current, TBPL]: Via bug 1052545: landed, manual verification done. Full verification via an experiment will take a few weeks to complete (bug 1052714).
[Risks and why]: this is not risk-free, but all the risk is localized to the experiment system; it cannot affect release users and is unlikely to affect users who aren't in an experiment
[String/UUID change made/needed]: none
Attachment #8461931 - Flags: approval-mozilla-aurora?
status-firefox33: --- → affected
status-firefox34: --- → fixed
Attachment #8461931 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
https://hg.mozilla.org/releases/mozilla-aurora/rev/16c940d83269
status-firefox33: affected → fixed
Testing for the race condition issues in telemetry experiments can be found in Bug # 1052545
QA Whiteboard: [qa-]
Flags: qe-verify-
You need to log in before you can comment on or make changes to this bug.