Telemetry experiments: experiment branches still changing after bug 1038174 (search provider not being reverted back to original in search experiment)

VERIFIED FIXED in Firefox 33

Status

Firefox Health Report
Client: Desktop
VERIFIED FIXED
3 years ago
3 years ago

People

(Reporter: kjozwiak, Assigned: bsmedberg)

Tracking

unspecified
Firefox 34
x86
Mac OS X
Points:
8
Dependency tree / graph
Bug Flags:
firefox-backlog +
qe-verify +

Firefox Tracking Flags

(firefox33 verified, firefox34 verified)

Details

Attachments

(3 attachments)

(Reporter)

Description

3 years ago
Created attachment 8471678 [details]
Error in Browser Console

I'm not exactly sure what happened but it appears that the default search provider is not being reverted once the error that I attached occurs in the browser console.

STR (reproduced this many times):

I reproduced the problem using the following two beta builds:
- http://ftp.mozilla.org/pub/mozilla.org/firefox/releases/31.0b6/win32/en-US/
- http://ftp.mozilla.org/pub/mozilla.org/firefox/releases/32.0b5/win32/en-US/

1) Install one of the beta builds listed above, add/change the following preferences:

- devtools.chrome.enabled;true
- experiments.force-sample-value;.5
- experiments.logging.level;0
- experiments.manifest.cert.checkAttributes;false

2) Once the above preferences have been added/changed, disable/enable "experiments.enabled" to initiate the installation of the search experiment. You should see the default search provider change to either Yahoo/Bing (double check what branch you're in using about:healthreport -> Raw Data)

At this point, if you move the machines date ahead a few days and receive the error I attached, the search provider will not revert back to the original search provider, examples:

Example #1:
-> move the machines date ahead by 1 week, re-launch fx and check the browser console (you'll receive the same error message)
-> disable the search experiment and you'll notice that the default search provider will not revert back to Google

Example #2:
-> move the machines date ahead so there's 4 days remaining in the experiment and re-launch fx
-> check the browser cosnole and you should see the error
-> close fx and move the machines date ahead by 1 day so there's 3 days remaining
-> relaunch fx and you'll notice there's three days remaining but the search provider hasn't been reverted
(Assignee)

Comment 1

3 years ago
Kamil, does the experiment branch stay constant throughout this process?
Flags: needinfo?(kamiljoz)
(Reporter)

Comment 2

3 years ago
Benjamin, once I received the error message in the browser console the branch usually changed to "0". There was one instance (case #3) where the branch changed several times while I moved the dates closer to the experiment expatriation.

Case #1:
- "lastActiveBranch": "5" (checked after the experiment was installed)
- "lastActiveBranch": "0" (checked after moving the date forwards and receiving the error)
- "lastActiveBranch": "0" (checked after disabling/removing the experiment through about:addons)

Case #2:
- "lastActiveBranch": "4" (checked after the experiment was installed)
- "lastActiveBranch": "0" (checked after moving the date forwards so there's 3 days remaining in the experiment)

Case #3: (despite the branches changing every time I moved the date forward, "Google" remained as the default provider)
- "lastActiveBranch": "1" (checked after the experiment was installed)
- "lastActiveBranch": "3" (checked after moving the date forwards and receiving the error)
- "lastActiveBranch": "1" (checked after moving the date forwards and receiving the error)
- "lastActiveBranch": "5" (checked after moving the date forwards and receiving the error)

Case #4:

- "lastActiveBranch": "5" (checked after the experiment was installed)
- "lastActiveBranch": "0" (checked after moving the date forwards and receiving the error)
- "lastActiveBranch": "0" (checked after moving the date forwards and receiving the error)
- "lastActiveBranch": "0" (checked after moving the date forwards and receiving the error)
Flags: needinfo?(kamiljoz)
(Assignee)

Comment 3

3 years ago
Cool, this might be a good testcase for bug 1038174, then, or a symptom of another bug.

Could you retry this using a nightly build after bug 1038174 landed to see whether that fixes this or at least helps?
Flags: needinfo?(kamiljoz)
(Reporter)

Comment 4

3 years ago
Will do! Once bug #1038174 lands into m-c I'll go through this again and see if the patch helps or even fixes this issue.

I ran into this while going through bug #1032970 on m-c and figured it might have been an issue due to changing the manifest on my personal staging server but ran into the same thing using the official staging server with the latest beta.
Flags: needinfo?(kamiljoz)
(Reporter)

Comment 5

3 years ago
Benjamin, I went through the same thing in comment #0 using the latest m-c and ran into the same thing (as you mentioned, bug #1038174 has already landed)
- Build used: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014-08-12-03-02-02-mozilla-central/

I edited the manifest on my personal staging server to get it working with m-c, changed/added the following values:

- "endTime": 1412035200
- "channel": ["nightly"]
- "maxVersion": "34.*"

Once the staging server was live, I basically ran through the same steps as comment #0 and received the same results mentioned in comment #2. Here's the manifest that was used:

{"experiments": [{"maxStartTime": 1408043234, "xpiHash": "sha256:967b0870d0c08ad74d39d12cca34cf40cab7d344aa93180495366a2e4105b9fd", "appName": ["Firefox"], "locale": ["en-US"], "maxActiveSeconds": 2419200, "maxVersion": "34.*", "xpiURL": "http://www.kamiljozwiak.com/fx-searchtest-en-beta31%40mozilla.org/experiment.xpi", "sample": 0.5, "minBuildID": "20140630185627", "startTime": 1404216000, "minVersion": "31.0a", "endTime": 1412035200, "id": "fx-searchtest-en-beta31@mozilla.org", "channel": ["nightly"]}
(Assignee)

Comment 6

3 years ago
Well that sucks!
Points: --- → 8
Flags: firefox-backlog+
(Reporter)

Comment 7

3 years ago
Benjamin, looks like you were correct about the VM's being slower during disk write. I went through the same test cases from comment #0 on my OSX 10.9.4 x64 and Windows 8.1 x64 and only seen the issue once on the Win 8.1 machine using the latest m-c.

When I changed the date right after fx was closed, the entire machine basically slowed down to a crawl and disk usage shot up to 100%. When the machine finally became responsive after about 2 minutes, the experiment expired but the search provider wasn't reverted back to Google. (this happened after I changed the date around 15 different times which started to slow down Windows)

Some results:

Used the following build:
- http://ftp.mozilla.org/pub/mozilla.org/firefox/releases/32.0b6/

"lastActiveBranch": "5" [Passed on both m-c & beta using OSX 10.9.4] <-- this one failed on Win 8.1
"lastActiveBranch": "1" [Passed on both m-c & beta using OSX 10.9.4/Win 8.1]
"lastActiveBranch": "2" [Passed on both m-c & beta using OSX 10.9.4/Win 8.1]
"lastActiveBranch": "4" [Passed on beta using OSX 10.9.4]
"lastActiveBranch": "3" [Passed in m-c on both OSX 10.94/Win 8.1]

Strange thing is that when I initially tested this on the same VM's, this wasn't happening.. I would've seen it as I always had the browser console opened and made sure the branches always matched via about:healthreport

Should we close this ticket and monitor the results we get from bug #1052714?
Flags: needinfo?(benjamin)
(Assignee)

Comment 8

3 years ago
If we have a testcase and we know it's still broken, then bug 1052714 isn't going to help. Looking for ways now to reliably simulate slow disk so I can debug locally.
Flags: needinfo?(benjamin)
Summary: Telemetry experiments: search provider not being reverted back to original → Telemetry experiments: experiment branches still changing after bug 1038174 (search provider not being reverted back to original in search experiment)
(Assignee)

Updated

3 years ago
Assignee: nobody → benjamin

Updated

3 years ago
QA Whiteboard: [qa?]
(Assignee)

Comment 9

3 years ago
1408035053127	Browser.Experiments.Experiments	DEBUG	Experiments #0::evaluateExperiments() - activating experiment experiment-branch-test-nightly@experiments.mozilla.org
1408035053144	Browser.Experiments.Experiments	TRACE	ExperimentEntry #4::start() for experiment-branch-test-nightly@experiments.mozilla.org
1408035053160	Browser.Experiments.Experiments	TRACE	ExperimentEntry #4::reconcileAddonState()
1408035053174	Browser.Experiments.Experiments	INFO	ExperimentEntry #4::reconcileAddonState() - Installing add-on.
1408035058642	Browser.Experiments.Experiments	TRACE	ExperimentEntry #4::_installAddon() - onDownloadEnded for experiment-branch-test-nightly@experiments.mozilla.org
1408035058660	Browser.Experiments.Experiments	TRACE	ExperimentEntry #4::_installAddon() - onInstallStarted for experiment-branch-test-nightly@experiments.mozilla.org
1408035058674	Browser.Experiments.Experiments	TRACE	Experiments #0::onInstallStarted() - experiment-branchtest-nightly@experiments.mozilla.org
1408035058690	Browser.Experiments.Experiments	INFO	Experiments #0::onInstallStarted allowing install because install tracked by us.
1408035059739	addons.xpi	WARN	Add-on experiment-branchtest-nightly@experiments.mozilla.org is missing bootstrap method install
1408035059785	Browser.Experiments.Experiments	TRACE	ExperimentEntry #4::_installAddon() - install ended for experiment-branch-test-nightly@experiments.mozilla.org
1408035059812	Browser.Experiments.Experiments	TRACE	ExperimentEntry #4::Add-on is disabled. Enabling.
Branch experiment startup bootstrap.js:17
1408035059890	Browser.Experiments.Experiments	INFO	ExperimentEntry #4::onEnabled() for experiment-branchtest-nightly@experiments.mozilla.org
1408035060064	Services.Metrics.Provider.org.mozilla.addons	WARN	Add-on type without field: experiment
1408035060219	Browser.Experiments.Experiments	TRACE	Experiments #0::_saveToCache
1408035060262	Browser.Experiments.Experiments	TRACE	Experiments #0::writeAtomic starting. _dirty: false
Branch experiment real startup bootstrap.js:22
No pre-existing branch, setting to: 3 bootstrap.js:32
1408035060471	Browser.Experiments.Experiments	TRACE	Experiments #0::setExperimentBranch before. _dirty: false
1408035060489	Browser.Experiments.Experiments	TRACE	Experiments #0::_run
Successfully set branch bootstrap.js:34
1408035061268	Services.Metrics.Provider.org.mozilla.addons	WARN	Add-on type without field: experiment
1408035065467	Browser.Experiments.Experiments	TRACE	Experiments #0::writeAtomic finished. _dirty: true
1408035065475	Browser.Experiments.Experiments	DEBUG	Experiments #0::_saveToCache saved to c:\Users\bsmedberg\test-profile\experiments.json
1408035065489	Browser.Experiments.Experiments	TRACE	Experiments #0::_main finished, scheduling next run
1408035065500	Browser.Experiments.Experiments	TRACE	Experiments #0::scheduleExperimentEvaluation() - scheduling for 1408639859801, now: 1408035065497
1408035065518	Browser.Experiments.Experiments	TRACE	Experiments #0::setExperimentBranch after. _dirty: true

So the code from bug 1038174 improved things in this race condition, but isn't sufficient. In particular, the _dirty flag is now correct, but we don't re-trigger writing the cache. I thought we rechecked _dirty in the main loop, but I was wrong: http://hg.mozilla.org/mozilla-central/annotate/5299864050ee/browser/experiments/Experiments.jsm#l804 shows that we only reloop on _refresh (updateManifest) or _terminateReason (disableExperiment).

I believe the reason for this is that we don't want to inifite-loop in the case where writing the cache fails (the catch block at http://hg.mozilla.org/mozilla-central/annotate/5299864050ee/browser/experiments/Experiments.jsm#l1269)

I propose to fix this by re-looping while (this._dirty) but making failure to write the cache rethrow its exception to exit out of the loop.
(Assignee)

Updated

3 years ago
QA Whiteboard: [qa?] → [qa+]
(Assignee)

Comment 10

3 years ago
Created attachment 8473232 [details] [diff] [review]
experiment-branch-norace

Unfortunately, adding an automatic test for this is still very difficult, though not impossible: it involves taking a patch to OS.File that allow you to delay the callback of the writeAtomic I/O operation, and then using that parameter (but only while testing) from the experiments manager.

Georg I'd like to know whether you think that adding all that test-only plumbing is worth it for an automated test here.
Attachment #8473232 - Flags: review?(georg.fritzsche)
(Assignee)

Comment 11

3 years ago
Created attachment 8473234 [details] [diff] [review]
writeAtomic-delayparam

If we wanted it, this is the patch which adds the delay to OS.File.
Comment on attachment 8473232 [details] [diff] [review]
experiment-branch-norace

Review of attachment 8473232 [details] [diff] [review]:
-----------------------------------------------------------------

Looking good to me pending below comments.

::: browser/experiments/Experiments.jsm
@@ +63,5 @@
>  
>  const URI_EXTENSION_STRINGS     = "chrome://mozapps/locale/extensions/extensions.properties";
>  const STRING_TYPE_NAME          = "type.%ID%.name";
>  
> +const FAILED_CACHE_WRITE_RETRY = 60 * 3;

This should be more descriptive, e.g. CACHE_WRITE_RETRY_DELAY_SEC?

@@ -330,5 @@
>    this.name = "AlreadyShutdownError";
>    this.message = message;
>    this.stack = error.stack;
>  }
> -

Why remove this unrelated line break? Do we really need to cluster this together?
Attachment #8473232 - Flags: review?(georg.fritzsche) → review+
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #10)
> Unfortunately, adding an automatic test for this is still very difficult,
> though not impossible: it involves taking a patch to OS.File that allow you
> to delay the callback of the writeAtomic I/O operation, and then using that
> parameter (but only while testing) from the experiments manager.
> 
> Georg I'd like to know whether you think that adding all that test-only
> plumbing is worth it for an automated test here.

I'd like to see this - we should be able to avoid patching Experiments.jsm up by either:
* directly mocking OS.File.writeAtomic() or
* direct OS.File usage through Experiments.Policy and mock that
(Assignee)

Comment 14

3 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/f9e0efd92b8d
Target Milestone: --- → Firefox 34
(Assignee)

Updated

3 years ago
Blocks: 1054397
https://hg.mozilla.org/mozilla-central/rev/f9e0efd92b8d
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED

Updated

3 years ago
Iteration: --- → 34.2
QA Contact: kamiljoz

Updated

3 years ago
Iteration: 34.2 → 34.3
QA Whiteboard: [qa+]
Flags: qe-verify+
(Reporter)

Comment 16

3 years ago
Went through verification using the following build:
- http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014-08-19-03-02-02-mozilla-central/
- Experiment used: fx-searchtest-en-beta31@mozilla.org
- Changed on local staging: "maxVersion": "34.*"
- Changed on local staging: "channel": ["nightly"]

Test Cases:

- "activeExperimentBranch": "2" - PASSED
- "activeExperimentBranch": "1" - PASSED
- "activeExperimentBranch": "4" - PASSED
- "activeExperimentBranch": "5" - PASSED
- "activeExperimentBranch": "3" - PASSED

- moved the computer date ahead by 2-3 day increments and ensured that the experiment was still using the same original branch
- ensured that the search provider is restored to Google once there's 3 days remaining
- ensured that the branch information remains in the telemetry payload until the experiment has been removed/expired
- ensured that the search provider is restored to Google once the experiment has expired/removed
- ensured that once the experiment is expired/removed, it appears as disabled under about:config, about:support & about:addons
- ensured that the search provider isn't reverted over to Google once three days is reached and the user selected their own search provider
- ensured that the search provider isn't reverted back to Google once the experiment has been removed/expired and the user selected their own search provider

Still seeing the following errors appear in the browser console:

DEPRECATION WARNING: Search service falling back to synchronous initialization. This is generally the consequence of an add-on using a deprecated search service API.
You may find more details about this deprecation at: https://developer.mozilla.org/en-US/docs/XPCOM_Interface_Reference/nsIBrowserSearchService#async_warning
resource://gre/components/nsSearchService.js 2863 SRCH_SVC__ensureInitialized
resource://gre/components/nsSearchService.js 4125 SearchService.prototype.currentEngine
resource://gre/modules/addons/XPIProvider.jsm -> jar:file:///C:/Users/kjozQA/AppData/Roaming/Mozilla/Firefox/Profiles/uxh9a34j.default/extensions/fx-searchtest@mozilla.org.xpi!/bootstrap.js -> resource://gre/modules/commonjs/toolkit/loader.js -> resource://fx-searchtest-at-mozilla-dot-org/fx-searchtest/lib/main.js 163 finishExperiment
resource://gre/modules/addons/XPIProvider.jsm -> jar:file:///C:/Users/kjozQA/AppData/Roaming/Mozilla/Firefox/Profiles/uxh9a34j.default/extensions/fx-searchtest@mozilla.org.xpi!/bootstrap.js -> resource://gre/modules/commonjs/toolkit/loader.js -> resource://fx-searchtest-at-mozilla-dot-org/fx-searchtest/lib/main.js 126 exports.main
resource://gre/modules/addons/XPIProvider.jsm -> jar:file:///C:/Users/kjozQA/AppData/Roaming/Mozilla/Firefox/Profiles/uxh9a34j.default/extensions/fx-searchtest@mozilla.org.xpi!/bootstrap.js -> resource://gre/modules/commonjs/toolkit/loader.js -> resource://gre/modules/commonjs/sdk/addon/runner.js 145 run
resource://gre/modules/addons/XPIProvider.jsm -> jar:file:///C:/Users/kjozQA/AppData/Roaming/Mozilla/Firefox/Profiles/uxh9a34j.default/extensions/fx-searchtest@mozilla.org.xpi!/bootstrap.js -> resource://gre/modules/commonjs/toolkit/loader.js -> resource://gre/modules/commonjs/sdk/addon/runner.js 81 startup/</<
resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js 866 Handler.prototype.process
resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js 745 this.PromiseWalker.walkerLoop
null 0 null

Benjamin, I've been going through this for a few hours in the latest m-c and couldn't reproduce the issue. However there's a few things I wanted to double check with you and mention:

- I'm still getting the above error message in the browser console and it usually appears consistently once the experiment has three days remaining and is reverted back to Google. The error also appears here and there while the experiment is still using the selected branch but not as often.

- I can't reproduce the original issue as consistently as before using the same VM's. I'm not sure if it's because I updated them and perhaps changed something in the virtual HD that makes this harder to reproduce than before, I'm not really sure.

- Is there a way that you reproduced the issue consistently while you were debugging the issue?
Flags: needinfo?(benjamin)
(Assignee)

Comment 17

3 years ago
The deprecation warning isn't a big deal.

The only way I could reproduce this consistently was to land a code change which introduced an artificial delay in the result of the experiments cache write. Bug 1054397 covers trying to make an automated test for this, and bug 1052714 covers trying to verify this by deploying a nightly experiment: so if your initial tests look good, I think we should call this as manually-verified as possible and rely on those two bugs to track the remaining work.
Flags: needinfo?(benjamin)
(Reporter)

Comment 18

3 years ago
Thanks Benjamin! I went through it a few more times on OSX 10.9.4 and couldn't reproduce the original issues. I followed the same test cases outlined in comment #16.

As per comment # 17, I'm marking this as verified and will keep an eye on both bug # 1054397 and bug # 1052714.
Status: RESOLVED → VERIFIED
(Assignee)

Updated

3 years ago
Blocks: 1038174
(Assignee)

Comment 19

3 years ago
Comment on attachment 8473232 [details] [diff] [review]
experiment-branch-norace

Approval Request Comment
[Feature/regressing bug #]: pre-existing bug in telemetry experiments
[User impact if declined]: slow shutdowns for some set of users
[Describe test coverage new/current, TBPL]: landed, manual verification done. Full verification via an experiment will take a few weeks to complete (bug 1052714).
[Risks and why]: this is not risk-free, but all the risk is localized to the experiment system; it cannot affect release users and is unlikely to affect users who aren't in an experiment
[String/UUID change made/needed]: none
Attachment #8473232 - Flags: approval-mozilla-aurora?
status-firefox33: --- → affected
status-firefox34: --- → verified
Attachment #8473232 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Needs rebasing for Aurora uplift.
Flags: needinfo?(benjamin)
Keywords: branch-patch-needed
This is missing bug 1038174 on Aurora, we never uplifted that one. Approval requested there now.
Flags: needinfo?(benjamin)
https://hg.mozilla.org/releases/mozilla-aurora/rev/36adbb8b0b83
status-firefox33: affected → fixed
Keywords: branch-patch-needed
(Reporter)

Comment 23

3 years ago
Went through verification using the following build:
- http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014-08-29-00-40-03-mozilla-aurora/

- Experiment used: fx-searchtest-en-beta31
- Changed on local staging: "channel": ["aurora"]
- Changed on local staging: "minVersion": "33.0a"

- "activeExperimentBranch": "2" - PASSED
- "activeExperimentBranch": "4" - PASSED
- "activeExperimentBranch": "5" - PASSED
- "activeExperimentBranch": "1" - PASSED
- "activeExperimentBranch": "3" - PASSED

I went through basically the same test cases from comment #16 and made sure that I didn't run into any race conditions were the branch would change once FX was re-launched.
status-firefox33: fixed → verified
(Assignee)

Comment 24

3 years ago
Verified also via test experiment that this problem is essentially fixed:

<branches>...,<count>
0,,6
2,,11
4,,3
,199
0,11584
490089
1,11370
2,1,1
4,,4,1
,0,8
0,2,1
1,,9
2,11442
3,,6
,1,7
2,3,2
3,11447
,2,6
4,11497
,3,10
,4,1

So there were four users in the experiment who switched branches, and that may be expected levels of I/O failure or QA testing. It's a little strange that there are occasionally empty branches, but it's also a small enough problem that I'm not going to followup on it now.
You need to log in before you can comment on or make changes to this bug.