Closed Bug 1038174 Opened 11 years ago Closed 11 years ago

Race condition setting experiment branch while a write is pending

Tracking

(firefox33 fixed, firefox34 fixed)

Status:

RESOLVED FIXED

Milestone:

Firefox 34

Tracking Flags:

Tracking

Status

firefox33

---

fixed

firefox34

---

fixed

People

(Reporter: benjamin, Assigned: smacleod)

References

Details

Attachments

(1 file)

Patch - Prevent a race condition when writing the experiments cache 11 years ago Steven MacLeod [:smacleod] 2.10 KB, patch	gfritzsche : review+ Sylvestre : approval-mozilla-aurora+	Details \| Diff \| Splinter Review

Benjamin Smedberg

Reporter

Description

•

11 years ago

We're seeing people change experiment branches in bug 1037595. Right now we're not sure what's happening, but my current theory is that there's a race condition as described in that bug: I can imagine this happening in this scenario: we are starting the experiment at about the same time we're saving the experiment cache to disk here: http://hg.mozilla.org/mozilla-central/annotate/75f66f8cb99f/browser/experiments/Experiments.jsm#l954 We may end up interleaving the call which sets the dirty flag in the middle of that: http://hg.mozilla.org/mozilla-central/annotate/75f66f8cb99f/browser/experiments/Experiments.jsm#l645 which would leave us not knowing that the cache was dirty, and never re-saving it. That race is actually fairly likely, given that we will have just activated the experiment. It may be sufficient to just move the "this._dirty" set from line 955 to 953, but in that case we probably want to catch any exceptions writing the file and reset the state to dirty.

Flags: firefox-backlog+

Benjamin Smedberg

Reporter

Updated

•

11 years ago

Blocks: 1037595

Georg Fritzsche [:gfritzsche]

Comment 1

•

11 years ago

Are there any issues with waiting on the main task (which triggers the save to cache), i.e. "yield this._mainTask" from setExperimentBranch()? I'd rather not have different functions on each others implementation details if possible.

Benjamin Smedberg

Reporter

Comment 2

•

11 years ago

That introduces potentially significant delay in a function that is only bound by disk load time. It would block on any HTTP reload of the manifest. I don't think the dirty flag is an implementation detail: it's a marker that we need to save the cache to disk, and in this case we're racing on it.

Steven MacLeod [:smacleod]

Assignee

Updated

•

11 years ago

Assignee: nobody → smacleod

Status: NEW → ASSIGNED

Marco Mucci [:MarcoM]

Updated

•

11 years ago

Iteration: --- → 34.1

QA Whiteboard: [qa?]

Marco Mucci [:MarcoM]

Comment 3

•

11 years ago

Hi Steven, can you mark this bug as [qa+] or [qa-] for verification.

Flags: needinfo?(smacleod)

Steven MacLeod [:smacleod]

Assignee

Updated

•

11 years ago

QA Whiteboard: [qa?] → [qa+]

Flags: needinfo?(smacleod)

Steven MacLeod [:smacleod]

Assignee

Comment 4

•

11 years ago

Attached patch Patch - Prevent a race condition when writing the experiments cache — Details — Splinter Review

With this patch we now clear _dirty before writing, and reset it if we catch any errors.

Attachment #8461931 - Flags: review?(georg.fritzsche)

Georg Fritzsche [:gfritzsche]

Updated

•

11 years ago

Attachment #8461931 - Flags: review?(georg.fritzsche) → review+

Steven MacLeod [:smacleod]

Assignee

Comment 5

•

11 years ago

https://hg.mozilla.org/integration/fx-team/rev/6b4ab5ef92fa

Target Milestone: --- → Firefox 34

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 6

•

11 years ago

https://hg.mozilla.org/mozilla-central/rev/6b4ab5ef92fa

Status: ASSIGNED → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Liz Henry (:lizzard) (relman/hg->git project)

Updated

•

11 years ago

QA Contact: kamiljoz

Marco Mucci [:MarcoM]

Comment 7

•

11 years ago

Hi Liz, I believe Kamil is on PTO. Can a new QA contact be assigned to verify this bug.

Flags: needinfo?(lhenry)

QA Contact: kamiljoz

Kamil Jozwiak [:kjozwiak]

Comment 8

•

11 years ago

Hi Marco, Most of QA will be in the MTV office from 07/28/2014 - 08/01/2014. I'll try finishing this up over there whenever we have some "working time" available unless Liz knows someone from SV that can handle this one.

Marco Mucci [:MarcoM]

Comment 9

•

11 years ago

Thannks Kamil. I've added you back as the QA contact. (In reply to Kamil Jozwiak [:kjozwiak] from comment #8) > Hi Marco, > > Most of QA will be in the MTV office from 07/28/2014 - 08/01/2014. I'll try > finishing this up over there whenever we have some "working time" available > unless Liz knows someone from SV that can handle this one.

Flags: needinfo?(lhenry)

QA Contact: kamiljoz

Marco Mucci [:MarcoM]

Comment 10

•

11 years ago

Hi Steven, Florin has asked for details on how to verify this bug. Can you provide him with that.

Flags: needinfo?(smacleod)

Steven MacLeod [:smacleod]

Assignee

Comment 11

•

11 years ago

Hmm, thinking more about this it's probably going to be very difficult to verify. It would require being able to reproduce the race condition, causing the branch switches, with a high frequency. Then using a fixed build you'd attempt reproduction and see no more switches. This involves some very specific interleaving of execution, so it's probably going to be hard to reproduce well, and in the end we'd only be able to say "It looks like it's fixed, but we're not 100% sure". Based on the new code I'm confident the race shouldn't be possible, so we can probably skip verification. I must have made a snap decision when marking [qa+] and didn't consider it enough, sorry!

QA Whiteboard: [qa+] → [qa-]

Flags: needinfo?(smacleod)

Kamil Jozwiak [:kjozwiak]

Comment 12

•

11 years ago

Thanks for taking a look Steven!

Benjamin Smedberg

Reporter

Comment 13

•

11 years ago

I expect the best way to verify this would be to run a dummy experiment on the nightly channel, with the branch-setting code that we were using for the search experiment. Then we can use FHR to examine whether people are still switching experiment branches unexpected. Should I file a separate bug for this kind of verification?

Flags: needinfo?(smacleod)

Steven MacLeod [:smacleod]

Assignee

Comment 14

•

11 years ago

Yeah lets take care of that in a new bug.

Flags: needinfo?(smacleod)

Benjamin Smedberg

Reporter

Updated

•

11 years ago

Depends on: 1052714

Brendan Eich [:brendan]

Comment 15

•

11 years ago

saveToCache might be closed while in the yield-in-try, forcing a return. Better to assert dirty on entry and clear dirty after the yield (still in the try). /be

Benjamin Smedberg

Reporter

Updated

•

11 years ago

Depends on: 1052545

Georg Fritzsche [:gfritzsche]

Comment 16

•

11 years ago

Comment on attachment 8461931 [details] [diff] [review] Patch - Prevent a race condition when writing the experiments cache Approval Request Comment [Feature/regressing bug #]: pre-existing bug in telemetry experiments [User impact if declined]: Slow shutdowns for some set of users. This blocks bug 1052545! [Describe test coverage new/current, TBPL]: Via bug 1052545: landed, manual verification done. Full verification via an experiment will take a few weeks to complete (bug 1052714). [Risks and why]: this is not risk-free, but all the risk is localized to the experiment system; it cannot affect release users and is unlikely to affect users who aren't in an experiment [String/UUID change made/needed]: none

Attachment #8461931 - Flags: approval-mozilla-aurora?

Sylvestre Ledru [:Sylvestre]

Updated

•

11 years ago

status-firefox33: --- → affected

status-firefox34: --- → fixed

Sylvestre Ledru [:Sylvestre]

Updated

•

11 years ago

Attachment #8461931 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+

Georg Fritzsche [:gfritzsche]

Comment 17

•

11 years ago

https://hg.mozilla.org/releases/mozilla-aurora/rev/16c940d83269

status-firefox33: affected → fixed

Kamil Jozwiak [:kjozwiak]

Comment 18

•

11 years ago

Testing for the race condition issues in telemetry experiments can be found in Bug # 1052545

QA Whiteboard: [qa-]

Flags: qe-verify-

BMO Automation

Updated

•

7 years ago

Product: Firefox Health Report → Firefox Health Report Graveyard

You need to log in before you can comment on or make changes to this bug.