Closed Bug 1052545 Opened 10 years ago Closed 10 years ago

Telemetry experiments: experiment branches still changing after bug 1038174 (search provider not being reverted back to original in search experiment)

Categories

(Firefox Health Report Graveyard :: Client: Desktop, defect)

x86
macOS
defect
Not set
normal

Tracking

(firefox33 verified, firefox34 verified)

VERIFIED FIXED
Firefox 34
Tracking Status
firefox33 --- verified
firefox34 --- verified

People

(Reporter: kjozwiak, Assigned: benjamin)

References

Details

Attachments

(3 files)

I'm not exactly sure what happened but it appears that the default search provider is not being reverted once the error that I attached occurs in the browser console.

STR (reproduced this many times):

I reproduced the problem using the following two beta builds:
- http://ftp.mozilla.org/pub/mozilla.org/firefox/releases/31.0b6/win32/en-US/
- http://ftp.mozilla.org/pub/mozilla.org/firefox/releases/32.0b5/win32/en-US/

1) Install one of the beta builds listed above, add/change the following preferences:

- devtools.chrome.enabled;true
- experiments.force-sample-value;.5
- experiments.logging.level;0
- experiments.manifest.cert.checkAttributes;false

2) Once the above preferences have been added/changed, disable/enable "experiments.enabled" to initiate the installation of the search experiment. You should see the default search provider change to either Yahoo/Bing (double check what branch you're in using about:healthreport -> Raw Data)

At this point, if you move the machines date ahead a few days and receive the error I attached, the search provider will not revert back to the original search provider, examples:

Example #1:
-> move the machines date ahead by 1 week, re-launch fx and check the browser console (you'll receive the same error message)
-> disable the search experiment and you'll notice that the default search provider will not revert back to Google

Example #2:
-> move the machines date ahead so there's 4 days remaining in the experiment and re-launch fx
-> check the browser cosnole and you should see the error
-> close fx and move the machines date ahead by 1 day so there's 3 days remaining
-> relaunch fx and you'll notice there's three days remaining but the search provider hasn't been reverted
Kamil, does the experiment branch stay constant throughout this process?
Flags: needinfo?(kamiljoz)
Benjamin, once I received the error message in the browser console the branch usually changed to "0". There was one instance (case #3) where the branch changed several times while I moved the dates closer to the experiment expatriation.

Case #1:
- "lastActiveBranch": "5" (checked after the experiment was installed)
- "lastActiveBranch": "0" (checked after moving the date forwards and receiving the error)
- "lastActiveBranch": "0" (checked after disabling/removing the experiment through about:addons)

Case #2:
- "lastActiveBranch": "4" (checked after the experiment was installed)
- "lastActiveBranch": "0" (checked after moving the date forwards so there's 3 days remaining in the experiment)

Case #3: (despite the branches changing every time I moved the date forward, "Google" remained as the default provider)
- "lastActiveBranch": "1" (checked after the experiment was installed)
- "lastActiveBranch": "3" (checked after moving the date forwards and receiving the error)
- "lastActiveBranch": "1" (checked after moving the date forwards and receiving the error)
- "lastActiveBranch": "5" (checked after moving the date forwards and receiving the error)

Case #4:

- "lastActiveBranch": "5" (checked after the experiment was installed)
- "lastActiveBranch": "0" (checked after moving the date forwards and receiving the error)
- "lastActiveBranch": "0" (checked after moving the date forwards and receiving the error)
- "lastActiveBranch": "0" (checked after moving the date forwards and receiving the error)
Flags: needinfo?(kamiljoz)
Cool, this might be a good testcase for bug 1038174, then, or a symptom of another bug.

Could you retry this using a nightly build after bug 1038174 landed to see whether that fixes this or at least helps?
Flags: needinfo?(kamiljoz)
Will do! Once bug #1038174 lands into m-c I'll go through this again and see if the patch helps or even fixes this issue.

I ran into this while going through bug #1032970 on m-c and figured it might have been an issue due to changing the manifest on my personal staging server but ran into the same thing using the official staging server with the latest beta.
Flags: needinfo?(kamiljoz)
Benjamin, I went through the same thing in comment #0 using the latest m-c and ran into the same thing (as you mentioned, bug #1038174 has already landed)
- Build used: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014-08-12-03-02-02-mozilla-central/

I edited the manifest on my personal staging server to get it working with m-c, changed/added the following values:

- "endTime": 1412035200
- "channel": ["nightly"]
- "maxVersion": "34.*"

Once the staging server was live, I basically ran through the same steps as comment #0 and received the same results mentioned in comment #2. Here's the manifest that was used:

{"experiments": [{"maxStartTime": 1408043234, "xpiHash": "sha256:967b0870d0c08ad74d39d12cca34cf40cab7d344aa93180495366a2e4105b9fd", "appName": ["Firefox"], "locale": ["en-US"], "maxActiveSeconds": 2419200, "maxVersion": "34.*", "xpiURL": "http://www.kamiljozwiak.com/fx-searchtest-en-beta31%40mozilla.org/experiment.xpi", "sample": 0.5, "minBuildID": "20140630185627", "startTime": 1404216000, "minVersion": "31.0a", "endTime": 1412035200, "id": "fx-searchtest-en-beta31@mozilla.org", "channel": ["nightly"]}
Well that sucks!
Points: --- → 8
Flags: firefox-backlog+
Benjamin, looks like you were correct about the VM's being slower during disk write. I went through the same test cases from comment #0 on my OSX 10.9.4 x64 and Windows 8.1 x64 and only seen the issue once on the Win 8.1 machine using the latest m-c.

When I changed the date right after fx was closed, the entire machine basically slowed down to a crawl and disk usage shot up to 100%. When the machine finally became responsive after about 2 minutes, the experiment expired but the search provider wasn't reverted back to Google. (this happened after I changed the date around 15 different times which started to slow down Windows)

Some results:

Used the following build:
- http://ftp.mozilla.org/pub/mozilla.org/firefox/releases/32.0b6/

"lastActiveBranch": "5" [Passed on both m-c & beta using OSX 10.9.4] <-- this one failed on Win 8.1
"lastActiveBranch": "1" [Passed on both m-c & beta using OSX 10.9.4/Win 8.1]
"lastActiveBranch": "2" [Passed on both m-c & beta using OSX 10.9.4/Win 8.1]
"lastActiveBranch": "4" [Passed on beta using OSX 10.9.4]
"lastActiveBranch": "3" [Passed in m-c on both OSX 10.94/Win 8.1]

Strange thing is that when I initially tested this on the same VM's, this wasn't happening.. I would've seen it as I always had the browser console opened and made sure the branches always matched via about:healthreport

Should we close this ticket and monitor the results we get from bug #1052714?
Flags: needinfo?(benjamin)
If we have a testcase and we know it's still broken, then bug 1052714 isn't going to help. Looking for ways now to reliably simulate slow disk so I can debug locally.
Flags: needinfo?(benjamin)
Summary: Telemetry experiments: search provider not being reverted back to original → Telemetry experiments: experiment branches still changing after bug 1038174 (search provider not being reverted back to original in search experiment)
Assignee: nobody → benjamin
QA Whiteboard: [qa?]
1408035053127	Browser.Experiments.Experiments	DEBUG	Experiments #0::evaluateExperiments() - activating experiment experiment-branch-test-nightly@experiments.mozilla.org
1408035053144	Browser.Experiments.Experiments	TRACE	ExperimentEntry #4::start() for experiment-branch-test-nightly@experiments.mozilla.org
1408035053160	Browser.Experiments.Experiments	TRACE	ExperimentEntry #4::reconcileAddonState()
1408035053174	Browser.Experiments.Experiments	INFO	ExperimentEntry #4::reconcileAddonState() - Installing add-on.
1408035058642	Browser.Experiments.Experiments	TRACE	ExperimentEntry #4::_installAddon() - onDownloadEnded for experiment-branch-test-nightly@experiments.mozilla.org
1408035058660	Browser.Experiments.Experiments	TRACE	ExperimentEntry #4::_installAddon() - onInstallStarted for experiment-branch-test-nightly@experiments.mozilla.org
1408035058674	Browser.Experiments.Experiments	TRACE	Experiments #0::onInstallStarted() - experiment-branchtest-nightly@experiments.mozilla.org
1408035058690	Browser.Experiments.Experiments	INFO	Experiments #0::onInstallStarted allowing install because install tracked by us.
1408035059739	addons.xpi	WARN	Add-on experiment-branchtest-nightly@experiments.mozilla.org is missing bootstrap method install
1408035059785	Browser.Experiments.Experiments	TRACE	ExperimentEntry #4::_installAddon() - install ended for experiment-branch-test-nightly@experiments.mozilla.org
1408035059812	Browser.Experiments.Experiments	TRACE	ExperimentEntry #4::Add-on is disabled. Enabling.
Branch experiment startup bootstrap.js:17
1408035059890	Browser.Experiments.Experiments	INFO	ExperimentEntry #4::onEnabled() for experiment-branchtest-nightly@experiments.mozilla.org
1408035060064	Services.Metrics.Provider.org.mozilla.addons	WARN	Add-on type without field: experiment
1408035060219	Browser.Experiments.Experiments	TRACE	Experiments #0::_saveToCache
1408035060262	Browser.Experiments.Experiments	TRACE	Experiments #0::writeAtomic starting. _dirty: false
Branch experiment real startup bootstrap.js:22
No pre-existing branch, setting to: 3 bootstrap.js:32
1408035060471	Browser.Experiments.Experiments	TRACE	Experiments #0::setExperimentBranch before. _dirty: false
1408035060489	Browser.Experiments.Experiments	TRACE	Experiments #0::_run
Successfully set branch bootstrap.js:34
1408035061268	Services.Metrics.Provider.org.mozilla.addons	WARN	Add-on type without field: experiment
1408035065467	Browser.Experiments.Experiments	TRACE	Experiments #0::writeAtomic finished. _dirty: true
1408035065475	Browser.Experiments.Experiments	DEBUG	Experiments #0::_saveToCache saved to c:\Users\bsmedberg\test-profile\experiments.json
1408035065489	Browser.Experiments.Experiments	TRACE	Experiments #0::_main finished, scheduling next run
1408035065500	Browser.Experiments.Experiments	TRACE	Experiments #0::scheduleExperimentEvaluation() - scheduling for 1408639859801, now: 1408035065497
1408035065518	Browser.Experiments.Experiments	TRACE	Experiments #0::setExperimentBranch after. _dirty: true

So the code from bug 1038174 improved things in this race condition, but isn't sufficient. In particular, the _dirty flag is now correct, but we don't re-trigger writing the cache. I thought we rechecked _dirty in the main loop, but I was wrong: http://hg.mozilla.org/mozilla-central/annotate/5299864050ee/browser/experiments/Experiments.jsm#l804 shows that we only reloop on _refresh (updateManifest) or _terminateReason (disableExperiment).

I believe the reason for this is that we don't want to inifite-loop in the case where writing the cache fails (the catch block at http://hg.mozilla.org/mozilla-central/annotate/5299864050ee/browser/experiments/Experiments.jsm#l1269)

I propose to fix this by re-looping while (this._dirty) but making failure to write the cache rethrow its exception to exit out of the loop.
QA Whiteboard: [qa?] → [qa+]
Unfortunately, adding an automatic test for this is still very difficult, though not impossible: it involves taking a patch to OS.File that allow you to delay the callback of the writeAtomic I/O operation, and then using that parameter (but only while testing) from the experiments manager.

Georg I'd like to know whether you think that adding all that test-only plumbing is worth it for an automated test here.
Attachment #8473232 - Flags: review?(georg.fritzsche)
If we wanted it, this is the patch which adds the delay to OS.File.
Comment on attachment 8473232 [details] [diff] [review]
experiment-branch-norace

Review of attachment 8473232 [details] [diff] [review]:
-----------------------------------------------------------------

Looking good to me pending below comments.

::: browser/experiments/Experiments.jsm
@@ +63,5 @@
>  
>  const URI_EXTENSION_STRINGS     = "chrome://mozapps/locale/extensions/extensions.properties";
>  const STRING_TYPE_NAME          = "type.%ID%.name";
>  
> +const FAILED_CACHE_WRITE_RETRY = 60 * 3;

This should be more descriptive, e.g. CACHE_WRITE_RETRY_DELAY_SEC?

@@ -330,5 @@
>    this.name = "AlreadyShutdownError";
>    this.message = message;
>    this.stack = error.stack;
>  }
> -

Why remove this unrelated line break? Do we really need to cluster this together?
Attachment #8473232 - Flags: review?(georg.fritzsche) → review+
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #10)
> Unfortunately, adding an automatic test for this is still very difficult,
> though not impossible: it involves taking a patch to OS.File that allow you
> to delay the callback of the writeAtomic I/O operation, and then using that
> parameter (but only while testing) from the experiments manager.
> 
> Georg I'd like to know whether you think that adding all that test-only
> plumbing is worth it for an automated test here.

I'd like to see this - we should be able to avoid patching Experiments.jsm up by either:
* directly mocking OS.File.writeAtomic() or
* direct OS.File usage through Experiments.Policy and mock that
https://hg.mozilla.org/integration/mozilla-inbound/rev/f9e0efd92b8d
Target Milestone: --- → Firefox 34
Blocks: 1054397
https://hg.mozilla.org/mozilla-central/rev/f9e0efd92b8d
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Iteration: --- → 34.2
QA Contact: kamiljoz
Iteration: 34.2 → 34.3
QA Whiteboard: [qa+]
Flags: qe-verify+
Went through verification using the following build:
- http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014-08-19-03-02-02-mozilla-central/
- Experiment used: fx-searchtest-en-beta31@mozilla.org
- Changed on local staging: "maxVersion": "34.*"
- Changed on local staging: "channel": ["nightly"]

Test Cases:

- "activeExperimentBranch": "2" - PASSED
- "activeExperimentBranch": "1" - PASSED
- "activeExperimentBranch": "4" - PASSED
- "activeExperimentBranch": "5" - PASSED
- "activeExperimentBranch": "3" - PASSED

- moved the computer date ahead by 2-3 day increments and ensured that the experiment was still using the same original branch
- ensured that the search provider is restored to Google once there's 3 days remaining
- ensured that the branch information remains in the telemetry payload until the experiment has been removed/expired
- ensured that the search provider is restored to Google once the experiment has expired/removed
- ensured that once the experiment is expired/removed, it appears as disabled under about:config, about:support & about:addons
- ensured that the search provider isn't reverted over to Google once three days is reached and the user selected their own search provider
- ensured that the search provider isn't reverted back to Google once the experiment has been removed/expired and the user selected their own search provider

Still seeing the following errors appear in the browser console:

DEPRECATION WARNING: Search service falling back to synchronous initialization. This is generally the consequence of an add-on using a deprecated search service API.
You may find more details about this deprecation at: https://developer.mozilla.org/en-US/docs/XPCOM_Interface_Reference/nsIBrowserSearchService#async_warning
resource://gre/components/nsSearchService.js 2863 SRCH_SVC__ensureInitialized
resource://gre/components/nsSearchService.js 4125 SearchService.prototype.currentEngine
resource://gre/modules/addons/XPIProvider.jsm -> jar:file:///C:/Users/kjozQA/AppData/Roaming/Mozilla/Firefox/Profiles/uxh9a34j.default/extensions/fx-searchtest@mozilla.org.xpi!/bootstrap.js -> resource://gre/modules/commonjs/toolkit/loader.js -> resource://fx-searchtest-at-mozilla-dot-org/fx-searchtest/lib/main.js 163 finishExperiment
resource://gre/modules/addons/XPIProvider.jsm -> jar:file:///C:/Users/kjozQA/AppData/Roaming/Mozilla/Firefox/Profiles/uxh9a34j.default/extensions/fx-searchtest@mozilla.org.xpi!/bootstrap.js -> resource://gre/modules/commonjs/toolkit/loader.js -> resource://fx-searchtest-at-mozilla-dot-org/fx-searchtest/lib/main.js 126 exports.main
resource://gre/modules/addons/XPIProvider.jsm -> jar:file:///C:/Users/kjozQA/AppData/Roaming/Mozilla/Firefox/Profiles/uxh9a34j.default/extensions/fx-searchtest@mozilla.org.xpi!/bootstrap.js -> resource://gre/modules/commonjs/toolkit/loader.js -> resource://gre/modules/commonjs/sdk/addon/runner.js 145 run
resource://gre/modules/addons/XPIProvider.jsm -> jar:file:///C:/Users/kjozQA/AppData/Roaming/Mozilla/Firefox/Profiles/uxh9a34j.default/extensions/fx-searchtest@mozilla.org.xpi!/bootstrap.js -> resource://gre/modules/commonjs/toolkit/loader.js -> resource://gre/modules/commonjs/sdk/addon/runner.js 81 startup/</<
resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js 866 Handler.prototype.process
resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js 745 this.PromiseWalker.walkerLoop
null 0 null

Benjamin, I've been going through this for a few hours in the latest m-c and couldn't reproduce the issue. However there's a few things I wanted to double check with you and mention:

- I'm still getting the above error message in the browser console and it usually appears consistently once the experiment has three days remaining and is reverted back to Google. The error also appears here and there while the experiment is still using the selected branch but not as often.

- I can't reproduce the original issue as consistently as before using the same VM's. I'm not sure if it's because I updated them and perhaps changed something in the virtual HD that makes this harder to reproduce than before, I'm not really sure.

- Is there a way that you reproduced the issue consistently while you were debugging the issue?
Flags: needinfo?(benjamin)
The deprecation warning isn't a big deal.

The only way I could reproduce this consistently was to land a code change which introduced an artificial delay in the result of the experiments cache write. Bug 1054397 covers trying to make an automated test for this, and bug 1052714 covers trying to verify this by deploying a nightly experiment: so if your initial tests look good, I think we should call this as manually-verified as possible and rely on those two bugs to track the remaining work.
Flags: needinfo?(benjamin)
Thanks Benjamin! I went through it a few more times on OSX 10.9.4 and couldn't reproduce the original issues. I followed the same test cases outlined in comment #16.

As per comment # 17, I'm marking this as verified and will keep an eye on both bug # 1054397 and bug # 1052714.
Status: RESOLVED → VERIFIED
Blocks: 1038174
Comment on attachment 8473232 [details] [diff] [review]
experiment-branch-norace

Approval Request Comment
[Feature/regressing bug #]: pre-existing bug in telemetry experiments
[User impact if declined]: slow shutdowns for some set of users
[Describe test coverage new/current, TBPL]: landed, manual verification done. Full verification via an experiment will take a few weeks to complete (bug 1052714).
[Risks and why]: this is not risk-free, but all the risk is localized to the experiment system; it cannot affect release users and is unlikely to affect users who aren't in an experiment
[String/UUID change made/needed]: none
Attachment #8473232 - Flags: approval-mozilla-aurora?
Attachment #8473232 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Needs rebasing for Aurora uplift.
Flags: needinfo?(benjamin)
This is missing bug 1038174 on Aurora, we never uplifted that one. Approval requested there now.
Flags: needinfo?(benjamin)
Went through verification using the following build:
- http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014-08-29-00-40-03-mozilla-aurora/

- Experiment used: fx-searchtest-en-beta31
- Changed on local staging: "channel": ["aurora"]
- Changed on local staging: "minVersion": "33.0a"

- "activeExperimentBranch": "2" - PASSED
- "activeExperimentBranch": "4" - PASSED
- "activeExperimentBranch": "5" - PASSED
- "activeExperimentBranch": "1" - PASSED
- "activeExperimentBranch": "3" - PASSED

I went through basically the same test cases from comment #16 and made sure that I didn't run into any race conditions were the branch would change once FX was re-launched.
Verified also via test experiment that this problem is essentially fixed:

<branches>...,<count>
0,,6
2,,11
4,,3
,199
0,11584
490089
1,11370
2,1,1
4,,4,1
,0,8
0,2,1
1,,9
2,11442
3,,6
,1,7
2,3,2
3,11447
,2,6
4,11497
,3,10
,4,1

So there were four users in the experiment who switched branches, and that may be expected levels of I/O failure or QA testing. It's a little strange that there are occasionally empty branches, but it's also a small enough problem that I'm not going to followup on it now.
Product: Firefox Health Report → Firefox Health Report Graveyard
You need to log in before you can comment on or make changes to this bug.