We're seeing people change experiment branches in bug 1037595. Right now we're not sure what's happening, but my current theory is that there's a race condition as described in that bug: I can imagine this happening in this scenario: we are starting the experiment at about the same time we're saving the experiment cache to disk here: http://hg.mozilla.org/mozilla-central/annotate/75f66f8cb99f/browser/experiments/Experiments.jsm#l954 We may end up interleaving the call which sets the dirty flag in the middle of that: http://hg.mozilla.org/mozilla-central/annotate/75f66f8cb99f/browser/experiments/Experiments.jsm#l645 which would leave us not knowing that the cache was dirty, and never re-saving it. That race is actually fairly likely, given that we will have just activated the experiment. It may be sufficient to just move the "this._dirty" set from line 955 to 953, but in that case we probably want to catch any exceptions writing the file and reset the state to dirty.
Are there any issues with waiting on the main task (which triggers the save to cache), i.e. "yield this._mainTask" from setExperimentBranch()? I'd rather not have different functions on each others implementation details if possible.
That introduces potentially significant delay in a function that is only bound by disk load time. It would block on any HTTP reload of the manifest. I don't think the dirty flag is an implementation detail: it's a marker that we need to save the cache to disk, and in this case we're racing on it.
Hi Steven, can you mark this bug as [qa+] or [qa-] for verification.
Created attachment 8461931 [details] [diff] [review] Patch - Prevent a race condition when writing the experiments cache With this patch we now clear _dirty before writing, and reset it if we catch any errors.
Hi Liz, I believe Kamil is on PTO. Can a new QA contact be assigned to verify this bug.
Hi Marco, Most of QA will be in the MTV office from 07/28/2014 - 08/01/2014. I'll try finishing this up over there whenever we have some "working time" available unless Liz knows someone from SV that can handle this one.
Thannks Kamil. I've added you back as the QA contact. (In reply to Kamil Jozwiak [:kjozwiak] from comment #8) > Hi Marco, > > Most of QA will be in the MTV office from 07/28/2014 - 08/01/2014. I'll try > finishing this up over there whenever we have some "working time" available > unless Liz knows someone from SV that can handle this one.
Hi Steven, Florin has asked for details on how to verify this bug. Can you provide him with that.
Hmm, thinking more about this it's probably going to be very difficult to verify. It would require being able to reproduce the race condition, causing the branch switches, with a high frequency. Then using a fixed build you'd attempt reproduction and see no more switches. This involves some very specific interleaving of execution, so it's probably going to be hard to reproduce well, and in the end we'd only be able to say "It looks like it's fixed, but we're not 100% sure". Based on the new code I'm confident the race shouldn't be possible, so we can probably skip verification. I must have made a snap decision when marking [qa+] and didn't consider it enough, sorry!
Thanks for taking a look Steven!
I expect the best way to verify this would be to run a dummy experiment on the nightly channel, with the branch-setting code that we were using for the search experiment. Then we can use FHR to examine whether people are still switching experiment branches unexpected. Should I file a separate bug for this kind of verification?
Yeah lets take care of that in a new bug.
saveToCache might be closed while in the yield-in-try, forcing a return. Better to assert dirty on entry and clear dirty after the yield (still in the try). /be
Comment on attachment 8461931 [details] [diff] [review] Patch - Prevent a race condition when writing the experiments cache Approval Request Comment [Feature/regressing bug #]: pre-existing bug in telemetry experiments [User impact if declined]: Slow shutdowns for some set of users. This blocks bug 1052545! [Describe test coverage new/current, TBPL]: Via bug 1052545: landed, manual verification done. Full verification via an experiment will take a few weeks to complete (bug 1052714). [Risks and why]: this is not risk-free, but all the risk is localized to the experiment system; it cannot affect release users and is unlikely to affect users who aren't in an experiment [String/UUID change made/needed]: none
Testing for the race condition issues in telemetry experiments can be found in Bug # 1052545