Closed
Bug 1052545
Opened 10 years ago
Closed 10 years ago
Telemetry experiments: experiment branches still changing after bug 1038174 (search provider not being reverted back to original in search experiment)
Categories
(Firefox Health Report Graveyard :: Client: Desktop, defect)
Tracking
(firefox33 verified, firefox34 verified)
VERIFIED
FIXED
Firefox 34
People
(Reporter: kjozwiak, Assigned: benjamin)
References
Details
Attachments
(3 files)
149.93 KB,
image/png
|
Details | |
6.93 KB,
patch
|
gfritzsche
:
review+
Sylvestre
:
approval-mozilla-aurora+
|
Details | Diff | Splinter Review |
3.34 KB,
patch
|
Details | Diff | Splinter Review |
I'm not exactly sure what happened but it appears that the default search provider is not being reverted once the error that I attached occurs in the browser console. STR (reproduced this many times): I reproduced the problem using the following two beta builds: - http://ftp.mozilla.org/pub/mozilla.org/firefox/releases/31.0b6/win32/en-US/ - http://ftp.mozilla.org/pub/mozilla.org/firefox/releases/32.0b5/win32/en-US/ 1) Install one of the beta builds listed above, add/change the following preferences: - devtools.chrome.enabled;true - experiments.force-sample-value;.5 - experiments.logging.level;0 - experiments.manifest.cert.checkAttributes;false 2) Once the above preferences have been added/changed, disable/enable "experiments.enabled" to initiate the installation of the search experiment. You should see the default search provider change to either Yahoo/Bing (double check what branch you're in using about:healthreport -> Raw Data) At this point, if you move the machines date ahead a few days and receive the error I attached, the search provider will not revert back to the original search provider, examples: Example #1: -> move the machines date ahead by 1 week, re-launch fx and check the browser console (you'll receive the same error message) -> disable the search experiment and you'll notice that the default search provider will not revert back to Google Example #2: -> move the machines date ahead so there's 4 days remaining in the experiment and re-launch fx -> check the browser cosnole and you should see the error -> close fx and move the machines date ahead by 1 day so there's 3 days remaining -> relaunch fx and you'll notice there's three days remaining but the search provider hasn't been reverted
Assignee | ||
Comment 1•10 years ago
|
||
Kamil, does the experiment branch stay constant throughout this process?
Flags: needinfo?(kamiljoz)
Reporter | ||
Comment 2•10 years ago
|
||
Benjamin, once I received the error message in the browser console the branch usually changed to "0". There was one instance (case #3) where the branch changed several times while I moved the dates closer to the experiment expatriation. Case #1: - "lastActiveBranch": "5" (checked after the experiment was installed) - "lastActiveBranch": "0" (checked after moving the date forwards and receiving the error) - "lastActiveBranch": "0" (checked after disabling/removing the experiment through about:addons) Case #2: - "lastActiveBranch": "4" (checked after the experiment was installed) - "lastActiveBranch": "0" (checked after moving the date forwards so there's 3 days remaining in the experiment) Case #3: (despite the branches changing every time I moved the date forward, "Google" remained as the default provider) - "lastActiveBranch": "1" (checked after the experiment was installed) - "lastActiveBranch": "3" (checked after moving the date forwards and receiving the error) - "lastActiveBranch": "1" (checked after moving the date forwards and receiving the error) - "lastActiveBranch": "5" (checked after moving the date forwards and receiving the error) Case #4: - "lastActiveBranch": "5" (checked after the experiment was installed) - "lastActiveBranch": "0" (checked after moving the date forwards and receiving the error) - "lastActiveBranch": "0" (checked after moving the date forwards and receiving the error) - "lastActiveBranch": "0" (checked after moving the date forwards and receiving the error)
Flags: needinfo?(kamiljoz)
Assignee | ||
Comment 3•10 years ago
|
||
Cool, this might be a good testcase for bug 1038174, then, or a symptom of another bug. Could you retry this using a nightly build after bug 1038174 landed to see whether that fixes this or at least helps?
Flags: needinfo?(kamiljoz)
Reporter | ||
Comment 4•10 years ago
|
||
Will do! Once bug #1038174 lands into m-c I'll go through this again and see if the patch helps or even fixes this issue. I ran into this while going through bug #1032970 on m-c and figured it might have been an issue due to changing the manifest on my personal staging server but ran into the same thing using the official staging server with the latest beta.
Flags: needinfo?(kamiljoz)
Reporter | ||
Comment 5•10 years ago
|
||
Benjamin, I went through the same thing in comment #0 using the latest m-c and ran into the same thing (as you mentioned, bug #1038174 has already landed) - Build used: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014-08-12-03-02-02-mozilla-central/ I edited the manifest on my personal staging server to get it working with m-c, changed/added the following values: - "endTime": 1412035200 - "channel": ["nightly"] - "maxVersion": "34.*" Once the staging server was live, I basically ran through the same steps as comment #0 and received the same results mentioned in comment #2. Here's the manifest that was used: {"experiments": [{"maxStartTime": 1408043234, "xpiHash": "sha256:967b0870d0c08ad74d39d12cca34cf40cab7d344aa93180495366a2e4105b9fd", "appName": ["Firefox"], "locale": ["en-US"], "maxActiveSeconds": 2419200, "maxVersion": "34.*", "xpiURL": "http://www.kamiljozwiak.com/fx-searchtest-en-beta31%40mozilla.org/experiment.xpi", "sample": 0.5, "minBuildID": "20140630185627", "startTime": 1404216000, "minVersion": "31.0a", "endTime": 1412035200, "id": "fx-searchtest-en-beta31@mozilla.org", "channel": ["nightly"]}
Reporter | ||
Comment 7•10 years ago
|
||
Benjamin, looks like you were correct about the VM's being slower during disk write. I went through the same test cases from comment #0 on my OSX 10.9.4 x64 and Windows 8.1 x64 and only seen the issue once on the Win 8.1 machine using the latest m-c. When I changed the date right after fx was closed, the entire machine basically slowed down to a crawl and disk usage shot up to 100%. When the machine finally became responsive after about 2 minutes, the experiment expired but the search provider wasn't reverted back to Google. (this happened after I changed the date around 15 different times which started to slow down Windows) Some results: Used the following build: - http://ftp.mozilla.org/pub/mozilla.org/firefox/releases/32.0b6/ "lastActiveBranch": "5" [Passed on both m-c & beta using OSX 10.9.4] <-- this one failed on Win 8.1 "lastActiveBranch": "1" [Passed on both m-c & beta using OSX 10.9.4/Win 8.1] "lastActiveBranch": "2" [Passed on both m-c & beta using OSX 10.9.4/Win 8.1] "lastActiveBranch": "4" [Passed on beta using OSX 10.9.4] "lastActiveBranch": "3" [Passed in m-c on both OSX 10.94/Win 8.1] Strange thing is that when I initially tested this on the same VM's, this wasn't happening.. I would've seen it as I always had the browser console opened and made sure the branches always matched via about:healthreport Should we close this ticket and monitor the results we get from bug #1052714?
Flags: needinfo?(benjamin)
Assignee | ||
Comment 8•10 years ago
|
||
If we have a testcase and we know it's still broken, then bug 1052714 isn't going to help. Looking for ways now to reliably simulate slow disk so I can debug locally.
Flags: needinfo?(benjamin)
Summary: Telemetry experiments: search provider not being reverted back to original → Telemetry experiments: experiment branches still changing after bug 1038174 (search provider not being reverted back to original in search experiment)
Assignee | ||
Updated•10 years ago
|
Assignee: nobody → benjamin
Updated•10 years ago
|
QA Whiteboard: [qa?]
Assignee | ||
Comment 9•10 years ago
|
||
1408035053127 Browser.Experiments.Experiments DEBUG Experiments #0::evaluateExperiments() - activating experiment experiment-branch-test-nightly@experiments.mozilla.org 1408035053144 Browser.Experiments.Experiments TRACE ExperimentEntry #4::start() for experiment-branch-test-nightly@experiments.mozilla.org 1408035053160 Browser.Experiments.Experiments TRACE ExperimentEntry #4::reconcileAddonState() 1408035053174 Browser.Experiments.Experiments INFO ExperimentEntry #4::reconcileAddonState() - Installing add-on. 1408035058642 Browser.Experiments.Experiments TRACE ExperimentEntry #4::_installAddon() - onDownloadEnded for experiment-branch-test-nightly@experiments.mozilla.org 1408035058660 Browser.Experiments.Experiments TRACE ExperimentEntry #4::_installAddon() - onInstallStarted for experiment-branch-test-nightly@experiments.mozilla.org 1408035058674 Browser.Experiments.Experiments TRACE Experiments #0::onInstallStarted() - experiment-branchtest-nightly@experiments.mozilla.org 1408035058690 Browser.Experiments.Experiments INFO Experiments #0::onInstallStarted allowing install because install tracked by us. 1408035059739 addons.xpi WARN Add-on experiment-branchtest-nightly@experiments.mozilla.org is missing bootstrap method install 1408035059785 Browser.Experiments.Experiments TRACE ExperimentEntry #4::_installAddon() - install ended for experiment-branch-test-nightly@experiments.mozilla.org 1408035059812 Browser.Experiments.Experiments TRACE ExperimentEntry #4::Add-on is disabled. Enabling. Branch experiment startup bootstrap.js:17 1408035059890 Browser.Experiments.Experiments INFO ExperimentEntry #4::onEnabled() for experiment-branchtest-nightly@experiments.mozilla.org 1408035060064 Services.Metrics.Provider.org.mozilla.addons WARN Add-on type without field: experiment 1408035060219 Browser.Experiments.Experiments TRACE Experiments #0::_saveToCache 1408035060262 Browser.Experiments.Experiments TRACE Experiments #0::writeAtomic starting. _dirty: false Branch experiment real startup bootstrap.js:22 No pre-existing branch, setting to: 3 bootstrap.js:32 1408035060471 Browser.Experiments.Experiments TRACE Experiments #0::setExperimentBranch before. _dirty: false 1408035060489 Browser.Experiments.Experiments TRACE Experiments #0::_run Successfully set branch bootstrap.js:34 1408035061268 Services.Metrics.Provider.org.mozilla.addons WARN Add-on type without field: experiment 1408035065467 Browser.Experiments.Experiments TRACE Experiments #0::writeAtomic finished. _dirty: true 1408035065475 Browser.Experiments.Experiments DEBUG Experiments #0::_saveToCache saved to c:\Users\bsmedberg\test-profile\experiments.json 1408035065489 Browser.Experiments.Experiments TRACE Experiments #0::_main finished, scheduling next run 1408035065500 Browser.Experiments.Experiments TRACE Experiments #0::scheduleExperimentEvaluation() - scheduling for 1408639859801, now: 1408035065497 1408035065518 Browser.Experiments.Experiments TRACE Experiments #0::setExperimentBranch after. _dirty: true So the code from bug 1038174 improved things in this race condition, but isn't sufficient. In particular, the _dirty flag is now correct, but we don't re-trigger writing the cache. I thought we rechecked _dirty in the main loop, but I was wrong: http://hg.mozilla.org/mozilla-central/annotate/5299864050ee/browser/experiments/Experiments.jsm#l804 shows that we only reloop on _refresh (updateManifest) or _terminateReason (disableExperiment). I believe the reason for this is that we don't want to inifite-loop in the case where writing the cache fails (the catch block at http://hg.mozilla.org/mozilla-central/annotate/5299864050ee/browser/experiments/Experiments.jsm#l1269) I propose to fix this by re-looping while (this._dirty) but making failure to write the cache rethrow its exception to exit out of the loop.
Assignee | ||
Updated•10 years ago
|
QA Whiteboard: [qa?] → [qa+]
Assignee | ||
Comment 10•10 years ago
|
||
Unfortunately, adding an automatic test for this is still very difficult, though not impossible: it involves taking a patch to OS.File that allow you to delay the callback of the writeAtomic I/O operation, and then using that parameter (but only while testing) from the experiments manager. Georg I'd like to know whether you think that adding all that test-only plumbing is worth it for an automated test here.
Attachment #8473232 -
Flags: review?(georg.fritzsche)
Assignee | ||
Comment 11•10 years ago
|
||
If we wanted it, this is the patch which adds the delay to OS.File.
Comment 12•10 years ago
|
||
Comment on attachment 8473232 [details] [diff] [review] experiment-branch-norace Review of attachment 8473232 [details] [diff] [review]: ----------------------------------------------------------------- Looking good to me pending below comments. ::: browser/experiments/Experiments.jsm @@ +63,5 @@ > > const URI_EXTENSION_STRINGS = "chrome://mozapps/locale/extensions/extensions.properties"; > const STRING_TYPE_NAME = "type.%ID%.name"; > > +const FAILED_CACHE_WRITE_RETRY = 60 * 3; This should be more descriptive, e.g. CACHE_WRITE_RETRY_DELAY_SEC? @@ -330,5 @@ > this.name = "AlreadyShutdownError"; > this.message = message; > this.stack = error.stack; > } > - Why remove this unrelated line break? Do we really need to cluster this together?
Attachment #8473232 -
Flags: review?(georg.fritzsche) → review+
Comment 13•10 years ago
|
||
(In reply to Benjamin Smedberg [:bsmedberg] from comment #10) > Unfortunately, adding an automatic test for this is still very difficult, > though not impossible: it involves taking a patch to OS.File that allow you > to delay the callback of the writeAtomic I/O operation, and then using that > parameter (but only while testing) from the experiments manager. > > Georg I'd like to know whether you think that adding all that test-only > plumbing is worth it for an automated test here. I'd like to see this - we should be able to avoid patching Experiments.jsm up by either: * directly mocking OS.File.writeAtomic() or * direct OS.File usage through Experiments.Policy and mock that
Assignee | ||
Comment 14•10 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/f9e0efd92b8d
Target Milestone: --- → Firefox 34
Comment 15•10 years ago
|
||
https://hg.mozilla.org/mozilla-central/rev/f9e0efd92b8d
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Updated•10 years ago
|
Iteration: --- → 34.2
Updated•10 years ago
|
QA Contact: kamiljoz
Updated•10 years ago
|
Iteration: 34.2 → 34.3
QA Whiteboard: [qa+]
Flags: qe-verify+
Reporter | ||
Comment 16•10 years ago
|
||
Went through verification using the following build: - http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014-08-19-03-02-02-mozilla-central/ - Experiment used: fx-searchtest-en-beta31@mozilla.org - Changed on local staging: "maxVersion": "34.*" - Changed on local staging: "channel": ["nightly"] Test Cases: - "activeExperimentBranch": "2" - PASSED - "activeExperimentBranch": "1" - PASSED - "activeExperimentBranch": "4" - PASSED - "activeExperimentBranch": "5" - PASSED - "activeExperimentBranch": "3" - PASSED - moved the computer date ahead by 2-3 day increments and ensured that the experiment was still using the same original branch - ensured that the search provider is restored to Google once there's 3 days remaining - ensured that the branch information remains in the telemetry payload until the experiment has been removed/expired - ensured that the search provider is restored to Google once the experiment has expired/removed - ensured that once the experiment is expired/removed, it appears as disabled under about:config, about:support & about:addons - ensured that the search provider isn't reverted over to Google once three days is reached and the user selected their own search provider - ensured that the search provider isn't reverted back to Google once the experiment has been removed/expired and the user selected their own search provider Still seeing the following errors appear in the browser console: DEPRECATION WARNING: Search service falling back to synchronous initialization. This is generally the consequence of an add-on using a deprecated search service API. You may find more details about this deprecation at: https://developer.mozilla.org/en-US/docs/XPCOM_Interface_Reference/nsIBrowserSearchService#async_warning resource://gre/components/nsSearchService.js 2863 SRCH_SVC__ensureInitialized resource://gre/components/nsSearchService.js 4125 SearchService.prototype.currentEngine resource://gre/modules/addons/XPIProvider.jsm -> jar:file:///C:/Users/kjozQA/AppData/Roaming/Mozilla/Firefox/Profiles/uxh9a34j.default/extensions/fx-searchtest@mozilla.org.xpi!/bootstrap.js -> resource://gre/modules/commonjs/toolkit/loader.js -> resource://fx-searchtest-at-mozilla-dot-org/fx-searchtest/lib/main.js 163 finishExperiment resource://gre/modules/addons/XPIProvider.jsm -> jar:file:///C:/Users/kjozQA/AppData/Roaming/Mozilla/Firefox/Profiles/uxh9a34j.default/extensions/fx-searchtest@mozilla.org.xpi!/bootstrap.js -> resource://gre/modules/commonjs/toolkit/loader.js -> resource://fx-searchtest-at-mozilla-dot-org/fx-searchtest/lib/main.js 126 exports.main resource://gre/modules/addons/XPIProvider.jsm -> jar:file:///C:/Users/kjozQA/AppData/Roaming/Mozilla/Firefox/Profiles/uxh9a34j.default/extensions/fx-searchtest@mozilla.org.xpi!/bootstrap.js -> resource://gre/modules/commonjs/toolkit/loader.js -> resource://gre/modules/commonjs/sdk/addon/runner.js 145 run resource://gre/modules/addons/XPIProvider.jsm -> jar:file:///C:/Users/kjozQA/AppData/Roaming/Mozilla/Firefox/Profiles/uxh9a34j.default/extensions/fx-searchtest@mozilla.org.xpi!/bootstrap.js -> resource://gre/modules/commonjs/toolkit/loader.js -> resource://gre/modules/commonjs/sdk/addon/runner.js 81 startup/</< resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js 866 Handler.prototype.process resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js 745 this.PromiseWalker.walkerLoop null 0 null Benjamin, I've been going through this for a few hours in the latest m-c and couldn't reproduce the issue. However there's a few things I wanted to double check with you and mention: - I'm still getting the above error message in the browser console and it usually appears consistently once the experiment has three days remaining and is reverted back to Google. The error also appears here and there while the experiment is still using the selected branch but not as often. - I can't reproduce the original issue as consistently as before using the same VM's. I'm not sure if it's because I updated them and perhaps changed something in the virtual HD that makes this harder to reproduce than before, I'm not really sure. - Is there a way that you reproduced the issue consistently while you were debugging the issue?
Flags: needinfo?(benjamin)
Assignee | ||
Comment 17•10 years ago
|
||
The deprecation warning isn't a big deal. The only way I could reproduce this consistently was to land a code change which introduced an artificial delay in the result of the experiments cache write. Bug 1054397 covers trying to make an automated test for this, and bug 1052714 covers trying to verify this by deploying a nightly experiment: so if your initial tests look good, I think we should call this as manually-verified as possible and rely on those two bugs to track the remaining work.
Flags: needinfo?(benjamin)
Reporter | ||
Comment 18•10 years ago
|
||
Thanks Benjamin! I went through it a few more times on OSX 10.9.4 and couldn't reproduce the original issues. I followed the same test cases outlined in comment #16. As per comment # 17, I'm marking this as verified and will keep an eye on both bug # 1054397 and bug # 1052714.
Status: RESOLVED → VERIFIED
Assignee | ||
Comment 19•10 years ago
|
||
Comment on attachment 8473232 [details] [diff] [review] experiment-branch-norace Approval Request Comment [Feature/regressing bug #]: pre-existing bug in telemetry experiments [User impact if declined]: slow shutdowns for some set of users [Describe test coverage new/current, TBPL]: landed, manual verification done. Full verification via an experiment will take a few weeks to complete (bug 1052714). [Risks and why]: this is not risk-free, but all the risk is localized to the experiment system; it cannot affect release users and is unlikely to affect users who aren't in an experiment [String/UUID change made/needed]: none
Attachment #8473232 -
Flags: approval-mozilla-aurora?
Updated•10 years ago
|
status-firefox33:
--- → affected
status-firefox34:
--- → verified
Updated•10 years ago
|
Attachment #8473232 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment 20•10 years ago
|
||
Needs rebasing for Aurora uplift.
Flags: needinfo?(benjamin)
Keywords: branch-patch-needed
Comment 21•10 years ago
|
||
This is missing bug 1038174 on Aurora, we never uplifted that one. Approval requested there now.
Flags: needinfo?(benjamin)
Updated•10 years ago
|
Keywords: branch-patch-needed
Reporter | ||
Comment 23•10 years ago
|
||
Went through verification using the following build: - http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014-08-29-00-40-03-mozilla-aurora/ - Experiment used: fx-searchtest-en-beta31 - Changed on local staging: "channel": ["aurora"] - Changed on local staging: "minVersion": "33.0a" - "activeExperimentBranch": "2" - PASSED - "activeExperimentBranch": "4" - PASSED - "activeExperimentBranch": "5" - PASSED - "activeExperimentBranch": "1" - PASSED - "activeExperimentBranch": "3" - PASSED I went through basically the same test cases from comment #16 and made sure that I didn't run into any race conditions were the branch would change once FX was re-launched.
Assignee | ||
Comment 24•10 years ago
|
||
Verified also via test experiment that this problem is essentially fixed: <branches>...,<count> 0,,6 2,,11 4,,3 ,199 0,11584 490089 1,11370 2,1,1 4,,4,1 ,0,8 0,2,1 1,,9 2,11442 3,,6 ,1,7 2,3,2 3,11447 ,2,6 4,11497 ,3,10 ,4,1 So there were four users in the experiment who switched branches, and that may be expected levels of I/O failure or QA testing. It's a little strange that there are occasionally empty branches, but it's also a small enough problem that I'm not going to followup on it now.
Updated•6 years ago
|
Product: Firefox Health Report → Firefox Health Report Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•