1052545 - Telemetry experiments: experiment branches still changing after bug 1038174 (search provider not being reverted back to original in search experiment)

Reporter

Description

•

11 years ago

Attached image Error in Browser Console — Details

I'm not exactly sure what happened but it appears that the default search provider is not being reverted once the error that I attached occurs in the browser console. STR (reproduced this many times): I reproduced the problem using the following two beta builds: - http://ftp.mozilla.org/pub/mozilla.org/firefox/releases/31.0b6/win32/en-US/ - http://ftp.mozilla.org/pub/mozilla.org/firefox/releases/32.0b5/win32/en-US/ 1) Install one of the beta builds listed above, add/change the following preferences: - devtools.chrome.enabled;true - experiments.force-sample-value;.5 - experiments.logging.level;0 - experiments.manifest.cert.checkAttributes;false 2) Once the above preferences have been added/changed, disable/enable "experiments.enabled" to initiate the installation of the search experiment. You should see the default search provider change to either Yahoo/Bing (double check what branch you're in using about:healthreport -> Raw Data) At this point, if you move the machines date ahead a few days and receive the error I attached, the search provider will not revert back to the original search provider, examples: Example #1: -> move the machines date ahead by 1 week, re-launch fx and check the browser console (you'll receive the same error message) -> disable the search experiment and you'll notice that the default search provider will not revert back to Google Example #2: -> move the machines date ahead so there's 4 days remaining in the experiment and re-launch fx -> check the browser cosnole and you should see the error -> close fx and move the machines date ahead by 1 day so there's 3 days remaining -> relaunch fx and you'll notice there's three days remaining but the search provider hasn't been reverted

Benjamin Smedberg

Assignee

Comment 1

•

11 years ago

Kamil, does the experiment branch stay constant throughout this process?

Flags: needinfo?(kamiljoz)

Kamil Jozwiak [:kjozwiak]

Reporter

Comment 2

•

11 years ago

Benjamin, once I received the error message in the browser console the branch usually changed to "0". There was one instance (case #3) where the branch changed several times while I moved the dates closer to the experiment expatriation. Case #1: - "lastActiveBranch": "5" (checked after the experiment was installed) - "lastActiveBranch": "0" (checked after moving the date forwards and receiving the error) - "lastActiveBranch": "0" (checked after disabling/removing the experiment through about:addons) Case #2: - "lastActiveBranch": "4" (checked after the experiment was installed) - "lastActiveBranch": "0" (checked after moving the date forwards so there's 3 days remaining in the experiment) Case #3: (despite the branches changing every time I moved the date forward, "Google" remained as the default provider) - "lastActiveBranch": "1" (checked after the experiment was installed) - "lastActiveBranch": "3" (checked after moving the date forwards and receiving the error) - "lastActiveBranch": "1" (checked after moving the date forwards and receiving the error) - "lastActiveBranch": "5" (checked after moving the date forwards and receiving the error) Case #4: - "lastActiveBranch": "5" (checked after the experiment was installed) - "lastActiveBranch": "0" (checked after moving the date forwards and receiving the error) - "lastActiveBranch": "0" (checked after moving the date forwards and receiving the error) - "lastActiveBranch": "0" (checked after moving the date forwards and receiving the error)

Flags: needinfo?(kamiljoz)

Benjamin Smedberg

Assignee

Comment 3

•

11 years ago

Cool, this might be a good testcase for bug 1038174, then, or a symptom of another bug. Could you retry this using a nightly build after bug 1038174 landed to see whether that fixes this or at least helps?

Flags: needinfo?(kamiljoz)

Kamil Jozwiak [:kjozwiak]

Reporter

Comment 4

•

11 years ago

Will do! Once bug #1038174 lands into m-c I'll go through this again and see if the patch helps or even fixes this issue. I ran into this while going through bug #1032970 on m-c and figured it might have been an issue due to changing the manifest on my personal staging server but ran into the same thing using the official staging server with the latest beta.

Flags: needinfo?(kamiljoz)

Kamil Jozwiak [:kjozwiak]

Reporter

Comment 5

•

11 years ago

Benjamin, I went through the same thing in comment #0 using the latest m-c and ran into the same thing (as you mentioned, bug #1038174 has already landed) - Build used: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014-08-12-03-02-02-mozilla-central/ I edited the manifest on my personal staging server to get it working with m-c, changed/added the following values: - "endTime": 1412035200 - "channel": ["nightly"] - "maxVersion": "34.*" Once the staging server was live, I basically ran through the same steps as comment #0 and received the same results mentioned in comment #2. Here's the manifest that was used: {"experiments": [{"maxStartTime": 1408043234, "xpiHash": "sha256:967b0870d0c08ad74d39d12cca34cf40cab7d344aa93180495366a2e4105b9fd", "appName": ["Firefox"], "locale": ["en-US"], "maxActiveSeconds": 2419200, "maxVersion": "34.*", "xpiURL": "http://www.kamiljozwiak.com/fx-searchtest-en-beta31%40mozilla.org/experiment.xpi", "sample": 0.5, "minBuildID": "20140630185627", "startTime": 1404216000, "minVersion": "31.0a", "endTime": 1412035200, "id": "fx-searchtest-en-beta31@mozilla.org", "channel": ["nightly"]}

Benjamin Smedberg

Assignee

Comment 6

•

11 years ago

Well that sucks!

Points: --- → 8

Flags: firefox-backlog+

Kamil Jozwiak [:kjozwiak]

Reporter

Comment 7

•

11 years ago

Benjamin, looks like you were correct about the VM's being slower during disk write. I went through the same test cases from comment #0 on my OSX 10.9.4 x64 and Windows 8.1 x64 and only seen the issue once on the Win 8.1 machine using the latest m-c. When I changed the date right after fx was closed, the entire machine basically slowed down to a crawl and disk usage shot up to 100%. When the machine finally became responsive after about 2 minutes, the experiment expired but the search provider wasn't reverted back to Google. (this happened after I changed the date around 15 different times which started to slow down Windows) Some results: Used the following build: - http://ftp.mozilla.org/pub/mozilla.org/firefox/releases/32.0b6/ "lastActiveBranch": "5" [Passed on both m-c & beta using OSX 10.9.4] <-- this one failed on Win 8.1 "lastActiveBranch": "1" [Passed on both m-c & beta using OSX 10.9.4/Win 8.1] "lastActiveBranch": "2" [Passed on both m-c & beta using OSX 10.9.4/Win 8.1] "lastActiveBranch": "4" [Passed on beta using OSX 10.9.4] "lastActiveBranch": "3" [Passed in m-c on both OSX 10.94/Win 8.1] Strange thing is that when I initially tested this on the same VM's, this wasn't happening.. I would've seen it as I always had the browser console opened and made sure the branches always matched via about:healthreport Should we close this ticket and monitor the results we get from bug #1052714?

Flags: needinfo?(benjamin)

Benjamin Smedberg

Assignee

Comment 8

•

11 years ago

If we have a testcase and we know it's still broken, then bug 1052714 isn't going to help. Looking for ways now to reliably simulate slow disk so I can debug locally.

Flags: needinfo?(benjamin)

Summary: Telemetry experiments: search provider not being reverted back to original → Telemetry experiments: experiment branches still changing after bug 1038174 (search provider not being reverted back to original in search experiment)

Benjamin Smedberg

Assignee

Updated

•

11 years ago

Assignee: nobody → benjamin

Marco Mucci [:MarcoM]

Updated

•

11 years ago

QA Whiteboard: [qa?]

Benjamin Smedberg

Assignee

Comment 9

•

11 years ago

1408035053127 Browser.Experiments.Experiments DEBUG Experiments #0::evaluateExperiments() - activating experiment experiment-branch-test-nightly@experiments.mozilla.org 1408035053144 Browser.Experiments.Experiments TRACE ExperimentEntry #4::start() for experiment-branch-test-nightly@experiments.mozilla.org 1408035053160 Browser.Experiments.Experiments TRACE ExperimentEntry #4::reconcileAddonState() 1408035053174 Browser.Experiments.Experiments INFO ExperimentEntry #4::reconcileAddonState() - Installing add-on. 1408035058642 Browser.Experiments.Experiments TRACE ExperimentEntry #4::_installAddon() - onDownloadEnded for experiment-branch-test-nightly@experiments.mozilla.org 1408035058660 Browser.Experiments.Experiments TRACE ExperimentEntry #4::_installAddon() - onInstallStarted for experiment-branch-test-nightly@experiments.mozilla.org 1408035058674 Browser.Experiments.Experiments TRACE Experiments #0::onInstallStarted() - experiment-branchtest-nightly@experiments.mozilla.org 1408035058690 Browser.Experiments.Experiments INFO Experiments #0::onInstallStarted allowing install because install tracked by us. 1408035059739 addons.xpi WARN Add-on experiment-branchtest-nightly@experiments.mozilla.org is missing bootstrap method install 1408035059785 Browser.Experiments.Experiments TRACE ExperimentEntry #4::_installAddon() - install ended for experiment-branch-test-nightly@experiments.mozilla.org 1408035059812 Browser.Experiments.Experiments TRACE ExperimentEntry #4::Add-on is disabled. Enabling. Branch experiment startup bootstrap.js:17 1408035059890 Browser.Experiments.Experiments INFO ExperimentEntry #4::onEnabled() for experiment-branchtest-nightly@experiments.mozilla.org 1408035060064 Services.Metrics.Provider.org.mozilla.addons WARN Add-on type without field: experiment 1408035060219 Browser.Experiments.Experiments TRACE Experiments #0::_saveToCache 1408035060262 Browser.Experiments.Experiments TRACE Experiments #0::writeAtomic starting. _dirty: false Branch experiment real startup bootstrap.js:22 No pre-existing branch, setting to: 3 bootstrap.js:32 1408035060471 Browser.Experiments.Experiments TRACE Experiments #0::setExperimentBranch before. _dirty: false 1408035060489 Browser.Experiments.Experiments TRACE Experiments #0::_run Successfully set branch bootstrap.js:34 1408035061268 Services.Metrics.Provider.org.mozilla.addons WARN Add-on type without field: experiment 1408035065467 Browser.Experiments.Experiments TRACE Experiments #0::writeAtomic finished. _dirty: true 1408035065475 Browser.Experiments.Experiments DEBUG Experiments #0::_saveToCache saved to c:\Users\bsmedberg\test-profile\experiments.json 1408035065489 Browser.Experiments.Experiments TRACE Experiments #0::_main finished, scheduling next run 1408035065500 Browser.Experiments.Experiments TRACE Experiments #0::scheduleExperimentEvaluation() - scheduling for 1408639859801, now: 1408035065497 1408035065518 Browser.Experiments.Experiments TRACE Experiments #0::setExperimentBranch after. _dirty: true So the code from bug 1038174 improved things in this race condition, but isn't sufficient. In particular, the _dirty flag is now correct, but we don't re-trigger writing the cache. I thought we rechecked _dirty in the main loop, but I was wrong: http://hg.mozilla.org/mozilla-central/annotate/5299864050ee/browser/experiments/Experiments.jsm#l804 shows that we only reloop on _refresh (updateManifest) or _terminateReason (disableExperiment). I believe the reason for this is that we don't want to inifite-loop in the case where writing the cache fails (the catch block at http://hg.mozilla.org/mozilla-central/annotate/5299864050ee/browser/experiments/Experiments.jsm#l1269) I propose to fix this by re-looping while (this._dirty) but making failure to write the cache rethrow its exception to exit out of the loop.

Benjamin Smedberg

Assignee

Updated

•

11 years ago

QA Whiteboard: [qa?] → [qa+]

Benjamin Smedberg

Assignee

Comment 10

•

11 years ago

Attached patch experiment-branch-norace — Details — Splinter Review

Unfortunately, adding an automatic test for this is still very difficult, though not impossible: it involves taking a patch to OS.File that allow you to delay the callback of the writeAtomic I/O operation, and then using that parameter (but only while testing) from the experiments manager. Georg I'd like to know whether you think that adding all that test-only plumbing is worth it for an automated test here.

Attachment #8473232 - Flags: review?(georg.fritzsche)

Benjamin Smedberg

Assignee

Comment 11

•

11 years ago

Attached patch writeAtomic-delayparam — Details — Splinter Review

If we wanted it, this is the patch which adds the delay to OS.File.

Georg Fritzsche [:gfritzsche]

Comment 12

•

11 years ago

Comment on attachment 8473232 [details] [diff] [review] experiment-branch-norace Review of attachment 8473232 [details] [diff] [review]: ----------------------------------------------------------------- Looking good to me pending below comments. ::: browser/experiments/Experiments.jsm @@ +63,5 @@ > > const URI_EXTENSION_STRINGS = "chrome://mozapps/locale/extensions/extensions.properties"; > const STRING_TYPE_NAME = "type.%ID%.name"; > > +const FAILED_CACHE_WRITE_RETRY = 60 * 3; This should be more descriptive, e.g. CACHE_WRITE_RETRY_DELAY_SEC? @@ -330,5 @@ > this.name = "AlreadyShutdownError"; > this.message = message; > this.stack = error.stack; > } > - Why remove this unrelated line break? Do we really need to cluster this together?

Attachment #8473232 - Flags: review?(georg.fritzsche) → review+

Georg Fritzsche [:gfritzsche]

Comment 13

•

11 years ago

(In reply to Benjamin Smedberg [:bsmedberg] from comment #10) > Unfortunately, adding an automatic test for this is still very difficult, > though not impossible: it involves taking a patch to OS.File that allow you > to delay the callback of the writeAtomic I/O operation, and then using that > parameter (but only while testing) from the experiments manager. > > Georg I'd like to know whether you think that adding all that test-only > plumbing is worth it for an automated test here. I'd like to see this - we should be able to avoid patching Experiments.jsm up by either: * directly mocking OS.File.writeAtomic() or * direct OS.File usage through Experiments.Policy and mock that

Benjamin Smedberg

Assignee

Comment 14

•

11 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/f9e0efd92b8d

Target Milestone: --- → Firefox 34

Benjamin Smedberg

Assignee

Updated

•

11 years ago

Blocks: 1054397

Ryan VanderMeulen [:RyanVM]

Comment 15

•

11 years ago

https://hg.mozilla.org/mozilla-central/rev/f9e0efd92b8d

Status: NEW → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Marco Mucci [:MarcoM]

Updated

•

11 years ago

Iteration: --- → 34.2

[SV Manager] Florin Mezei, QA (:FlorinMezei)

Updated

•

11 years ago

QA Contact: kamiljoz

Marco Mucci [:MarcoM]

Updated

•

11 years ago

Iteration: 34.2 → 34.3

QA Whiteboard: [qa+]

Flags: qe-verify+

Kamil Jozwiak [:kjozwiak]

Reporter

Comment 16

•

11 years ago

Went through verification using the following build: - http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014-08-19-03-02-02-mozilla-central/ - Experiment used: fx-searchtest-en-beta31@mozilla.org - Changed on local staging: "maxVersion": "34.*" - Changed on local staging: "channel": ["nightly"] Test Cases: - "activeExperimentBranch": "2" - PASSED - "activeExperimentBranch": "1" - PASSED - "activeExperimentBranch": "4" - PASSED - "activeExperimentBranch": "5" - PASSED - "activeExperimentBranch": "3" - PASSED - moved the computer date ahead by 2-3 day increments and ensured that the experiment was still using the same original branch - ensured that the search provider is restored to Google once there's 3 days remaining - ensured that the branch information remains in the telemetry payload until the experiment has been removed/expired - ensured that the search provider is restored to Google once the experiment has expired/removed - ensured that once the experiment is expired/removed, it appears as disabled under about:config, about:support & about:addons - ensured that the search provider isn't reverted over to Google once three days is reached and the user selected their own search provider - ensured that the search provider isn't reverted back to Google once the experiment has been removed/expired and the user selected their own search provider Still seeing the following errors appear in the browser console: DEPRECATION WARNING: Search service falling back to synchronous initialization. This is generally the consequence of an add-on using a deprecated search service API. You may find more details about this deprecation at: https://developer.mozilla.org/en-US/docs/XPCOM_Interface_Reference/nsIBrowserSearchService#async_warning resource://gre/components/nsSearchService.js 2863 SRCH_SVC__ensureInitialized resource://gre/components/nsSearchService.js 4125 SearchService.prototype.currentEngine resource://gre/modules/addons/XPIProvider.jsm -> jar:file:///C:/Users/kjozQA/AppData/Roaming/Mozilla/Firefox/Profiles/uxh9a34j.default/extensions/fx-searchtest@mozilla.org.xpi!/bootstrap.js -> resource://gre/modules/commonjs/toolkit/loader.js -> resource://fx-searchtest-at-mozilla-dot-org/fx-searchtest/lib/main.js 163 finishExperiment resource://gre/modules/addons/XPIProvider.jsm -> jar:file:///C:/Users/kjozQA/AppData/Roaming/Mozilla/Firefox/Profiles/uxh9a34j.default/extensions/fx-searchtest@mozilla.org.xpi!/bootstrap.js -> resource://gre/modules/commonjs/toolkit/loader.js -> resource://fx-searchtest-at-mozilla-dot-org/fx-searchtest/lib/main.js 126 exports.main resource://gre/modules/addons/XPIProvider.jsm -> jar:file:///C:/Users/kjozQA/AppData/Roaming/Mozilla/Firefox/Profiles/uxh9a34j.default/extensions/fx-searchtest@mozilla.org.xpi!/bootstrap.js -> resource://gre/modules/commonjs/toolkit/loader.js -> resource://gre/modules/commonjs/sdk/addon/runner.js 145 run resource://gre/modules/addons/XPIProvider.jsm -> jar:file:///C:/Users/kjozQA/AppData/Roaming/Mozilla/Firefox/Profiles/uxh9a34j.default/extensions/fx-searchtest@mozilla.org.xpi!/bootstrap.js -> resource://gre/modules/commonjs/toolkit/loader.js -> resource://gre/modules/commonjs/sdk/addon/runner.js 81 startup/</< resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js 866 Handler.prototype.process resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js 745 this.PromiseWalker.walkerLoop null 0 null Benjamin, I've been going through this for a few hours in the latest m-c and couldn't reproduce the issue. However there's a few things I wanted to double check with you and mention: - I'm still getting the above error message in the browser console and it usually appears consistently once the experiment has three days remaining and is reverted back to Google. The error also appears here and there while the experiment is still using the selected branch but not as often. - I can't reproduce the original issue as consistently as before using the same VM's. I'm not sure if it's because I updated them and perhaps changed something in the virtual HD that makes this harder to reproduce than before, I'm not really sure. - Is there a way that you reproduced the issue consistently while you were debugging the issue?

Flags: needinfo?(benjamin)

Benjamin Smedberg

Assignee

Comment 17

•

11 years ago

The deprecation warning isn't a big deal. The only way I could reproduce this consistently was to land a code change which introduced an artificial delay in the result of the experiments cache write. Bug 1054397 covers trying to make an automated test for this, and bug 1052714 covers trying to verify this by deploying a nightly experiment: so if your initial tests look good, I think we should call this as manually-verified as possible and rely on those two bugs to track the remaining work.

Flags: needinfo?(benjamin)

Kamil Jozwiak [:kjozwiak]

Reporter

Comment 18

•

11 years ago

Thanks Benjamin! I went through it a few more times on OSX 10.9.4 and couldn't reproduce the original issues. I followed the same test cases outlined in comment #16. As per comment # 17, I'm marking this as verified and will keep an eye on both bug # 1054397 and bug # 1052714.

Status: RESOLVED → VERIFIED

Benjamin Smedberg

Assignee

Updated

•

11 years ago

Blocks: 1038174

Benjamin Smedberg

Assignee

Comment 19

•

11 years ago

Comment on attachment 8473232 [details] [diff] [review] experiment-branch-norace Approval Request Comment [Feature/regressing bug #]: pre-existing bug in telemetry experiments [User impact if declined]: slow shutdowns for some set of users [Describe test coverage new/current, TBPL]: landed, manual verification done. Full verification via an experiment will take a few weeks to complete (bug 1052714). [Risks and why]: this is not risk-free, but all the risk is localized to the experiment system; it cannot affect release users and is unlikely to affect users who aren't in an experiment [String/UUID change made/needed]: none

Attachment #8473232 - Flags: approval-mozilla-aurora?

Sylvestre Ledru [:Sylvestre]

Updated

•

11 years ago

status-firefox33: --- → affected

status-firefox34: --- → verified

Sylvestre Ledru [:Sylvestre]

Updated

•

11 years ago

Attachment #8473232 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+

Ryan VanderMeulen [:RyanVM]

Comment 20

•

11 years ago

Needs rebasing for Aurora uplift.

Flags: needinfo?(benjamin)

Keywords: branch-patch-needed

Georg Fritzsche [:gfritzsche]

Comment 21

•

11 years ago

This is missing bug 1038174 on Aurora, we never uplifted that one. Approval requested there now.

Flags: needinfo?(benjamin)

Georg Fritzsche [:gfritzsche]

Comment 22

•

11 years ago

https://hg.mozilla.org/releases/mozilla-aurora/rev/36adbb8b0b83

status-firefox33: affected → fixed

Ryan VanderMeulen [:RyanVM]

Updated

•

11 years ago

Keywords: branch-patch-needed

Kamil Jozwiak [:kjozwiak]

Reporter

Comment 23

•

11 years ago

Went through verification using the following build: - http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014-08-29-00-40-03-mozilla-aurora/ - Experiment used: fx-searchtest-en-beta31 - Changed on local staging: "channel": ["aurora"] - Changed on local staging: "minVersion": "33.0a" - "activeExperimentBranch": "2" - PASSED - "activeExperimentBranch": "4" - PASSED - "activeExperimentBranch": "5" - PASSED - "activeExperimentBranch": "1" - PASSED - "activeExperimentBranch": "3" - PASSED I went through basically the same test cases from comment #16 and made sure that I didn't run into any race conditions were the branch would change once FX was re-launched.

status-firefox33: fixed → verified

Benjamin Smedberg

Assignee

Comment 24

•

11 years ago

Verified also via test experiment that this problem is essentially fixed: <branches>...,<count> 0,,6 2,,11 4,,3 ,199 0,11584 490089 1,11370 2,1,1 4,,4,1 ,0,8 0,2,1 1,,9 2,11442 3,,6 ,1,7 2,3,2 3,11447 ,2,6 4,11497 ,3,10 ,4,1 So there were four users in the experiment who switched branches, and that may be expected levels of I/O failure or QA testing. It's a little strange that there are occasionally empty branches, but it's also a small enough problem that I'm not going to followup on it now.

BMO Automation

Updated

•

7 years ago

Product: Firefox Health Report → Firefox Health Report Graveyard

Error in Browser Console 11 years ago Kamil Jozwiak [:kjozwiak] 149.93 KB, image/png		Details
experiment-branch-norace 11 years ago Benjamin Smedberg 6.93 KB, patch	gfritzsche : review+ Sylvestre : approval-mozilla-aurora+	Details \| Diff \| Splinter Review
writeAtomic-delayparam 11 years ago Benjamin Smedberg 3.34 KB, patch		Details \| Diff \| Splinter Review