Closed Bug 1145188 Opened 5 years ago Closed 4 years ago

Telemetry redesign: TelemetrySession init should be triggered by TelemetryController

Categories

(Toolkit :: Telemetry, defect, P3)

defect
Points:
2

Tracking

()

RESOLVED FIXED
mozilla49
Tracking Status
firefox49 --- fixed

People

(Reporter: gfritzsche, Assigned: yarik.sheptykin, Mentored)

References

(Blocks 3 open bugs)

Details

(Whiteboard: [measurement:client] [lang=js])

Attachments

(2 files, 37 obsolete files)

92.49 KB, patch
Dexter
: review+
Details | Diff | Splinter Review
23.20 KB, patch
yarik.sheptykin
: review+
gfritzsche
: feedback+
Details | Diff | Splinter Review
Currently both TelemetryPing and TelemetrySession are initialized delayed independently (1min after startup).
We should change that and probably just trigger TelemetrySession init (non-blockingly) after the TelemetryPing init.
Blocks: 1201022
No longer blocks: 1122482
Summary: TelemetrySession init should be triggered by TelemetryPing → TelemetrySession init should be triggered by TelemetryController
Blocks: 1210415
Blocks: 1156353
No longer blocks: 1201022
Points: --- → 2
Priority: -- → P2
Whiteboard: [unifiedTelemetry] [measurement:client]
Summary: TelemetrySession init should be triggered by TelemetryController → Telemetry redesign: TelemetrySession init should be triggered by TelemetryController
Blocks: 1217452
Bug 1145188 - Telemetry redesign: TelemetrySession init should be triggered by TelemetryController. r?gfritzsche
Attachment #8682526 - Flags: review?(gfritzsche)
Assignee: nobody → alessio.placitelli
Status: NEW → ASSIGNED
Priority: P2 → P1
Comment on attachment 8682526 [details]
MozReview Request: Bug 1145188 - Telemetry redesign: TelemetrySession init should be triggered by TelemetryController. r?gfritzsche

https://reviewboard.mozilla.org/r/24081/#review21533

I think we need to approach this differently, sorry for the detour.

We should have:
* TelemetryController call say TelemetrySession.earlyInit() on startup
* TelemetryController call say TelemetrySession.delayedInit() after the delay (and after its own delayed init)
* TelemetryController call TelemetrySession.shutdown() from its own shutdown routines (and remove TelemetrySessions blocker)
* content process init should not be really affected here

This lets us get rid of the two separate (and potentially racy) delays and gets us defined init and easier-to-argue-about init and shutdown routines.
Attachment #8682526 - Flags: review?(gfritzsche)
Unassigning and moving back to P2, as I'm working on other more important bugs.
Assignee: alessio.placitelli → nobody
Status: ASSIGNED → NEW
Priority: P1 → P2
Assignee: nobody → yarik.sheptykin
Mentor: alessio.placitelli, gfritzsche
Attached patch part 2 - Fix the tests (obsolete) — Splinter Review
The patches were bit-rotten, I've refreshed the test patch a bit.

Please ping me if you need any pointers on this.
Flags: needinfo?(yarik.sheptykin)
Thanks for sharing your work Alessio! I will try to preserve you work a much as possible and build upon it.
Flags: needinfo?(yarik.sheptykin)
A quick status update:

* Patches apply good.
* Unit tests fail at one place with:
ERROR toolkit/components/telemetry/tests/unit/test_TelemetrySession.js | Process still running after test!

I started addressing comment 2.
Hey Alessio!

Thanks for waiting patiently for a patch. I would like to share a work in progress with you. This is build upon your patches. Basically it fixes failing tests in test_TelemetrySession.js. Please, take a look if I my fixes make sense and let me know if I am moving in the right direction.
Attachment #8697990 - Flags: review?(alessio.placitelli)
Comment on attachment 8697990 [details] [diff] [review]
Moving initialization triggers for TelemetrySession into TelemetryContoller

Review of attachment 8697990 [details] [diff] [review]:
-----------------------------------------------------------------

Good job Iaroslav!

This looks much better now, thanks for looking into it. I see that you addressed most of the test issues, we should be almost there!

::: toolkit/components/telemetry/TelemetrySession.jsm
@@ -432,5 @@
>      this._lastTickTime = now;
>  
>      // Check if the daily ping is due.
>      const shouldSendDaily = this._isDailyPingDue(nowDate);
> -

Any reason why this blank line was removed?

@@ -581,5 @@
>      Impl._subsessionStartActiveTicks = 0;
>      Impl._subsessionStartTimeMonotonic = 0;
>      this.uninstall();
> -    //return this.setup();
> -    dump("\n**** DEBUG - CHECK THIS CALL! IT WAS CALLING .setup()!\n");

Did you make sure you were not getting this log line in the logs?

::: toolkit/components/telemetry/tests/unit/test_TelemetrySession.js
@@ +1345,5 @@
>  
>    // Write the ping to the aborted-session file. TelemetrySession will add it to the
>    // saved pings directory when it starts.
>    yield TelemetryStorage.savePingToFile(abortedSessionPing, ABORTED_FILE, false);
> +  Assert.ok(!!(yield OS.File.exists(ABORTED_FILE)),

Good idea to add this here, there's no need for the "!!" though: as long as yield OS.File... is wrapped within the parentheses, you will get the same result!

@@ +1346,5 @@
>    // Write the ping to the aborted-session file. TelemetrySession will add it to the
>    // saved pings directory when it starts.
>    yield TelemetryStorage.savePingToFile(abortedSessionPing, ABORTED_FILE, false);
> +  Assert.ok(!!(yield OS.File.exists(ABORTED_FILE)),
> +            "The aborted session ping must be exist now from the aborted session ping directory.");

Nit: "The aborted session ping must exist in the aborted session ping directory."

@@ +1352,5 @@
>    yield clearPendingPings();
>    PingServer.clearRequests();
>    yield TelemetryController.reset();
> +  // Forcing Telemetry to pick up pending pings.
> +  yield TelemetrySend.setup(true);

Why is this needed? We should be able to trigger this behaviour with |TelemetryController.reset|.

TelemetryController.reset tries to mimick a full Firefox shutdown + restart: the idea here is that we shouldn't need to call other setup functions.

@@ +1377,5 @@
>    let schedulerTickCallback = null;
>    let now = fakeNow(2040, 1, 1, 0, 0, 0);
>    // Fake scheduler functions to control aborted-session flow in tests.
>    fakeSchedulerTimer(callback => schedulerTickCallback = callback, () => {});
> +

Nit: empty blank line

@@ +1430,4 @@
>    // Trigger both the daily ping and the saved-session.
>    Assert.ok(!!schedulerTickCallback);
>    // Execute one scheduler tick.
> +  fakeNow(dailyDueDate);

Why did you move fakeNow?

@@ +1430,5 @@
>    // Trigger both the daily ping and the saved-session.
>    Assert.ok(!!schedulerTickCallback);
>    // Execute one scheduler tick.
> +  fakeNow(dailyDueDate);
> +  // TelemetryController.reset has triggered some pings. Clearing the queue from

Instead of doing this here, you could do something like

  yield clearPendingPings();
  PingServer.clearRequests();
  yield TelemetryController.reset();

where you previously called TelemetryController.reset() (line 1422).

The first line should get rid of the persisted pending pings.
The second line should clear the request queue on the mock webserver.
The third line.. well, you know about that one :)

@@ +1461,5 @@
>    }
>  
>    const ABORTED_FILE = OS.Path.join(DATAREPORTING_PATH, ABORTED_PING_FILE_NAME);
>  
>    clearPendingPings();

Whoops, that's missing a |yield|. Can you add it since we're here?

@@ +1549,5 @@
>  
>    // Remove any aborted-session ping from the previous tests.
>    yield OS.File.removeDir(DATAREPORTING_PATH, { ignoreAbsent: true });
>    yield clearPendingPings();
> +

Nit: empty blank line

@@ +1591,4 @@
>    Telemetry.canRecordExtended = false;
>  
>    yield clearPendingPings();
>    PingServer.clearRequests();

Maybe those two lines could be moved before yield TelemetryController.reset()?

@@ +1591,5 @@
>    Telemetry.canRecordExtended = false;
>  
>    yield clearPendingPings();
>    PingServer.clearRequests();
> +

Nit: empty blank

@@ +1596,4 @@
>    yield sendPing();
>  
>    let ping = yield PingServer.promiseNextPing();
> +

Nit: empty blank

@@ +1618,5 @@
>    // Restore the preference.
>    Telemetry.canRecordExtended = true;
>  
>    // Send a new ping that should contain the extended data.
> +  yield TelemetryController.reset();

Mh, I'm not sure that's needed here. Maybe we could get rid of this line.

@@ +1686,5 @@
>      // We don't have the aborted session or the daily ping here.
>      return;
>    }
>  
> +  let schedulerTickCallback = null;

Any particular reason why this block was moved?
Attachment #8697990 - Flags: review?(alessio.placitelli) → feedback+
Status: NEW → ASSIGNED
(In reply to Alessio Placitelli [:Dexter] from comment #9)

Thanks for the really good review, Alessio! That helped me a lot in improving the patch. Here is the new version with your feedback addressed.

> @@ -581,5 @@
> >      Impl._subsessionStartActiveTicks = 0;
> >      Impl._subsessionStartTimeMonotonic = 0;
> >      this.uninstall();
> > -    //return this.setup();
> > -    dump("\n**** DEBUG - CHECK THIS CALL! IT WAS CALLING .setup()!\n");
>
> Did you make sure you were not getting this log line in the logs?

I got this message in the logs but I believe it is ok. TelemetryController.reset() is calling this TelemetrySession.reset(). If I read this code correctly, earlier TelemetrySession would reset its values, and re-install itself. Since we handled control over TelemetrySession.[early|delayed]Init to TelemetryController, and it TelemetryController.reset takes care of this. I think we can simply drop this.setup() here.

Am I missing something?
Attachment #8697990 - Attachment is obsolete: true
Attachment #8700167 - Flags: review?(alessio.placitelli)
Comment on attachment 8700167 [details] [diff] [review]
part 3 - Finalizing work on tests

Review of attachment 8700167 [details] [diff] [review]:
-----------------------------------------------------------------

On a first look, this looks good. Before going for the final review, I have a little request: would you mind submitting one patch containing all the changes to Telemetry core files and one patch containing all the changes to the test?

So, instead of 3 patch files (2 of them being WIP work), we just have 2 final patches ready for review. These would include all the code from part 1 and part 2, that would now be obsolete.
Attachment #8700167 - Flags: review?(alessio.placitelli)
Hi Alessio! Here is the patch for the core.
Attachment #8693625 - Attachment is obsolete: true
Attachment #8693629 - Attachment is obsolete: true
Attachment #8700167 - Attachment is obsolete: true
Attachment #8700316 - Flags: review?(alessio.placitelli)
Here is the patch for tests.
Attachment #8700317 - Flags: review?(alessio.placitelli)
Comment on attachment 8700316 [details] [diff] [review]
Shifting TelemetrySession init control to TelemetryController (core)

Review of attachment 8700316 [details] [diff] [review]:
-----------------------------------------------------------------

Iaroslav, thank you for your patch!

It looks good, there's just a bunch if small things to remove (I think I added them in my WIP patch, so not your fault).

I just want to take one final look after you make those changes.

::: toolkit/components/telemetry/TelemetryController.jsm
@@ +187,5 @@
>    /**
>     * Used only for testing purposes.
>     */
> +  shutdownChrome: function() {
> +    return Impl.shutdown(true);

nit: you can drop the parameter here (see the comments below)

@@ +748,5 @@
>      this._attachObservers();
>  
> +    // Perform a lightweight, early initialization for the component, just registering
> +    // a few observers and initializing the session.
> +    TelemetrySession.earlyInit(testing);

nit: since we're using |this._testMode| in other places within this function, let's just use it in place of |testing| here as well.

@@ +770,5 @@
>          this._clientID = yield ClientID.getClientID();
>  
>          // Purge the pings archive by removing outdated pings. We don't wait for this
>          // task to complete, but TelemetryStorage blocks on it during shutdown.
>          TelemetryStorage.runCleanPingArchiveTask();

Could you move |TelemetryStorage.runCleanPingArchiveTask();| and the related comment after TelemetrySend.setup()?

@@ +773,5 @@
>          // task to complete, but TelemetryStorage blocks on it during shutdown.
>          TelemetryStorage.runCleanPingArchiveTask();
>  
> +        // Perform TelemetrySession delayed init.
> +        let sessionInitPromise = TelemetrySession.delayedInit(testing);

nit: as above.

@@ +779,5 @@
> +          // If we're testing, wait for telemetrySession to init.
> +          yield sessionInitPromise;
> +        }
> +
> +        // NOTE: TelemetrySend is set up after TelemetrySession, otherwise

nit: drop "NOTE" and mention that it would not be sent "in this session"

@@ +815,5 @@
>      }
>    },
>  
>    // Do proper shutdown waiting and cleanup.
> +  _cleanupOnShutdown: Task.async(function*(aTesting) {

nit: we can use |this._testMode| and drop |aTesting| here (and at the call site)

@@ +831,5 @@
>  
>        // Stop any ping sending.
>        yield TelemetrySend.shutdown();
>  
> +      yield TelemetrySession.shutdown(aTesting);

nit: as above

@@ +836,4 @@
>  
> +      // AsyncShutdown is not really test-friendly. We won't be able to reuse
> +      // the barriers/reset them after the wait, in tests.
> +      if (!aTesting) {

nit: as above

@@ +852,5 @@
>        this._initStarted = false;
>      }
>    }),
>  
> +  shutdown: function(aTesting = false) {

nit: if we use |this._testMode| in _cleanupOnShutdown we don't need this argument anymore.

@@ +870,5 @@
>  
>      // This handles 4).
>      if (!this._delayedInitTask) {
>        // We already ran the delayed initialization.
> +      return this._cleanupOnShutdown(aTesting);

nit: let's drop aTesting.

@@ +875,4 @@
>      }
>  
>      // This handles 2) and 3).
> +    return this._delayedInitTask.finalize().then(() => this._cleanupOnShutdown(aTesting));

nit: same here

@@ +1000,5 @@
>    reset: Task.async(function*() {
>      this._clientID = null;
>      this._detachObservers();
>  
> +    // TODO: ideally, we should call shutdown first.

nit: you can drop this comment here
Attachment #8700316 - Flags: review?(alessio.placitelli) → feedback+
Comment on attachment 8700317 [details] [diff] [review]
Shifting TelemetrySession init control to TelemetryController (test)

Review of attachment 8700317 [details] [diff] [review]:
-----------------------------------------------------------------

It's r+ with the following changes:

::: toolkit/components/telemetry/tests/unit/test_TelemetrySession.js
@@ +494,5 @@
>  });
>  
>  // Checks that a sent ping is correctly received by a dummy http server.
>  add_task(function* test_simplePing() {
> +  yield TelemetrySend.reset();

I think we no longer need this line, as that's called by TelemetryController.reset().

@@ +1025,5 @@
>    let now = new Date(2040, 1, 1, 12, 0, 0);
>    let timerCallback = null;
>    let timerDelay = null;
>  
> +  yield TelemetrySend.reset();

Let's drop this and change below to use |reset()| instead of |setup()|

@@ +1039,5 @@
>      [PREF_TEST, {what: TelemetryEnvironment.RECORD_PREF_VALUE}],
>    ]);
>  
>    // Setup.
> +  yield TelemetryController.setup();

Let's use reset() here

@@ +1689,5 @@
>      Assert.ok(!receivedPingRequest, "Telemetry must only send one daily ping.");
>      receivedPingRequest = req;
>    });
>  
> +  // NOTE: faking scheduler timer hast to happen before resetting

nit: drop "NOTE:"
Attachment #8700317 - Flags: review?(alessio.placitelli) → review+
Hey Iaroslav, could you also check tests outside toolkit/components/telemetry using TelemetryController.setup or TelemetrySession.setup?

I just stumbled upon this: https://dxr.mozilla.org/mozilla-central/rev/388bdc46ba51ee31da8b8abe977e0ca38d117434/browser/experiments/test/xpcshell/test_conditions.js#61
(In reply to Alessio Placitelli [:Dexter] from comment #16)

Hey Alessio!

> Hey Iaroslav, could you also check tests outside
> toolkit/components/telemetry using TelemetryController.setup or
> TelemetrySession.setup?

That's a very good though! I searched dxr for TelemetrySession (https://dxr.mozilla.org/mozilla-central/search?q=TelemetrySession.&redirect=false&case=false) and didn't see anything suspicious besides the tests in experiments. I will fix the experiments test in patch for tests.
(In reply to Iaroslav Sheptykin from comment #17)
> That's a very good though! I searched dxr for TelemetrySession
> (https://dxr.mozilla.org/mozilla-central/search?q=TelemetrySession.
> &redirect=false&case=false) and didn't see anything suspicious besides the
> tests in experiments. I will fix the experiments test in patch for tests.

Awesome! Thank you :)
Since I touched the code in experiments, clearing r+ and requesting a new review.
Attachment #8700317 - Attachment is obsolete: true
Attachment #8701049 - Flags: review?(alessio.placitelli)
Attached patch bug1145188-core.patch (obsolete) — Splinter Review
I played with the browser after applying this patch to see if anything pops up during runtime. Telemetry seems to operate normally. Console and terminal don't show any weird messages. Telemetry creates folders and pings as expected. I lack however a solid understanding of the whole process so you might want to doublecheck here.
Attachment #8700316 - Attachment is obsolete: true
Attachment #8701052 - Flags: review?(alessio.placitelli)
Comment on attachment 8701052 [details] [diff] [review]
bug1145188-core.patch

Review of attachment 8701052 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good, thank you for your persistence!
Attachment #8701052 - Flags: review?(alessio.placitelli) → review+
Comment on attachment 8701049 [details] [diff] [review]
Shifting TelemetrySession init control to TelemetryController (test)

Review of attachment 8701049 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good, assuming it passes all the xpcshell tests locally.

There's only one last thing: could you change the commit message for this test patch to something like "Bug 1145188: Fix the tests to use the new Telemetry initialization model, r?Dexter"
Attachment #8701049 - Flags: review?(alessio.placitelli) → review+
(In reply to Iaroslav Sheptykin from comment #20)
> I played with the browser after applying this patch to see if anything pops
> up during runtime. Telemetry seems to operate normally. Console and terminal
> don't show any weird messages. Telemetry creates folders and pings as
> expected. I lack however a solid understanding of the whole process so you
> might want to doublecheck here.

Good, thank you for testing that. I'll double check. Usually, we test for:

- Shutdown/saved-session pings being there for short sessions.
- Crash pings being sent
- Aborted-session pings being generated and sent
- Environment-changes pings being correctly generated and sent

We have a stub local web server to help us with the tests (https://github.com/georgf/gzipServer/)
Could you push the patches to try?
Attachment #8701052 - Attachment is obsolete: true
Attachment #8701201 - Flags: review+
Attachment #8701049 - Attachment is obsolete: true
Attachment #8701202 - Flags: review+
Hi Alessio!

I checked the tests and they seem o.k. to me. Some failed but those look unrelated. I re-triggered one job that has the most failures. How do the tests look to you?

One thing I found interesting in the logs:
(see https://treeherder.mozilla.org/#/jobs?repo=try&revision=ae0f02454236&selectedJob=14827395)

```
15:58:42     INFO -  *************************
15:58:42     INFO -  A coding exception was thrown and uncaught in a Task.
15:58:42     INFO -  Full message: TypeError: aString.substring is not a function
15:58:42     INFO -  Full stack: limitStringToLength@resource://gre/modules/TelemetryEnvironment.jsm:267:10
15:58:42     INFO -  EnvironmentAddonBuilder.prototype._getActiveAddons<@resource://gre/modules/TelemetryEnvironment.jsm:525:18
15:58:42     INFO -  EnvironmentAddonBuilder.prototype._updateAddons<@resource://gre/modules/TelemetryEnvironment.jsm:479:27
15:58:42     INFO -  TaskImpl_run@resource://gre/modules/Task.jsm:315:40
15:58:42     INFO -  TaskImpl@resource://gre/modules/Task.jsm:276:3
15:58:42     INFO -  createAsyncFunction/asyncFunction@resource://gre/modules/Task.jsm:250:14
15:58:42     INFO -  EnvironmentAddonBuilder.prototype._checkForChanges@resource://gre/modules/TelemetryEnvironment.jsm:437:25
15:58:42     INFO -  EnvironmentAddonBuilder.prototype._onAddonChange@resource://gre/modules/TelemetryEnvironment.jsm:422:5
15:58:42     INFO -  EnvironmentAddonBuilder.prototype.onInstalled@resource://gre/modules/TelemetryEnvironment.jsm:414:5
15:58:42     INFO -  AddonManagerInternal.callAddonListeners@resource://gre/modules/AddonManager.jsm:1801:11
15:58:42     INFO -  this.AddonManagerPrivate.callAddonListeners@resource://gre/modules/AddonManager.jsm:2842:5
15:58:42     INFO -  AddonInstaller/this.install@resource://gre/modules/SocialService.jsm:989:7
15:58:42     INFO -  this.SocialService._showInstallNotification/action.callback@resource://gre/modules/SocialService.jsm:572:9
15:58:42     INFO -  _onButtonEvent@resource://gre/modules/PopupNotifications.jsm:1159:7
15:58:42     INFO -  oncommand@chrome://browser/content/browser.xul:1:1
15:58:42     INFO -  tests.testDirectoryInstall/<@chrome://mochitests/content/browser/browser/base/content/test/social/browser_addons.js:196:7
15:58:42     INFO -  Handler.prototype.process@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:933:23
15:58:42     INFO -  this.PromiseWalker.walkerLoop@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:812:7
15:58:42     INFO -  Promise*this.PromiseWalker.scheduleWalkerLoop@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:743:11
15:58:42     INFO -  this.PromiseWalker.schedulePromise@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:775:7
15:58:42     INFO -  this.PromiseWalker.completePromise@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:710:7
15:58:42     INFO -  handler@chrome://mochitests/content/browser/browser/base/content/test/social/head.js:404:5
15:58:42     INFO -  EventListener.handleEvent*EventTargetInterposition.methods.addEventListener@resource://gre/modules/RemoteAddonsParent.jsm:624:5
15:58:42     INFO -  AddonInterpositionService.prototype.interposeProperty/desc.value@resource://gre/components/multiprocessShims.js:160:52
15:58:42     INFO -  ensureEventFired@chrome://mochitests/content/browser/browser/base/content/test/social/head.js:402:1
15:58:42     INFO -  tests.testDirectoryInstall@chrome://mochitests/content/browser/browser/base/content/test/social/browser_addons.js:193:5
15:58:42     INFO -  runSocialTests/runNextTest/</<@chrome://mochitests/content/browser/browser/base/content/test/social/head.js:208:11
15:58:42     INFO -  runSocialTests/cbPreTest@chrome://mochitests/content/browser/browser/base/content/test/social/head.js:181:31
15:58:42     INFO -  runSocialTests/runNextTest/<@chrome://mochitests/content/browser/browser/base/content/test/social/head.js:204:7
15:58:42     INFO -  testScope/test_executeSoon/<.run@chrome://mochikit/content/browser-test.js:968:9
15:58:42     INFO -  *************************

```

There seems to be a problem in the code collecting information about addons. Seems like we need to harden the typecheck for aString here: 
https://dxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/TelemetryEnvironment.jsm#264-267

I might be mistaken though. I didn't inspect this error in depth. I could file a bug on this and take a closer look. Makes sense?
Flags: needinfo?(alessio.placitelli)
(In reply to Iaroslav Sheptykin from comment #28)
> Hi Alessio!

Hey!

> There seems to be a problem in the code collecting information about addons.
> Seems like we need to harden the typecheck for aString here: 
> https://dxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/
> TelemetryEnvironment.jsm#264-267
> 
> I might be mistaken though. I didn't inspect this error in depth. I could
> file a bug on this and take a closer look. Makes sense?

Bug 1233438 landed recently, which hardens that check a bit more. It has been on m-c for a while, even though DXR is not showing it (it seems to be broken right now).

Still, this is a different case, as we're not getting a null or undefined string. Would you mind, as you suggested, file a bug on this and take a closer look? At least to rule out if that's a test-only issue.

Those tests look ok, would you mind trying for Android and B2G as well (we're not active on those platforms, but we've been seeing weirdness lately)? 

"try: -b do -p linux,linux64-asan,linux64-valgrind,macosx64,win32,android-api-9,android-api-11,android-x86,emulator,emulator-jb,linux32_gecko -u xpcshell,mochitests -t none"
Flags: needinfo?(alessio.placitelli)
Rebased onto the latest m-c
Attachment #8701202 - Attachment is obsolete: true
Attachment #8702061 - Flags: review+
Hey, Alessio! Our patch got out of synch with the current m-c. I merged the diffs and the tests seem to run fine. Could take a short look if things still look o.k.?
Attachment #8701201 - Attachment is obsolete: true
Attachment #8702063 - Flags: review?(alessio.placitelli)
Comment on attachment 8702063 [details] [diff] [review]
Shifting TelemetrySession init control to TelemetryController

Review of attachment 8702063 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good!
Attachment #8702063 - Flags: review?(alessio.placitelli) → review+
Tests failed on linux opt. that's weird. Investigating
(In reply to Iaroslav Sheptykin from comment #35)
> Tests failed on linux opt. that's weird. Investigating

Interesting, the Android X3 tests are failing as well. I saw that very same failure in the tests for bug 1179209, but I have to dig deeper.
Attachment #8702063 - Flags: feedback?(gfritzsche)
Comment on attachment 8702063 [details] [diff] [review]
Shifting TelemetrySession init control to TelemetryController

Review of attachment 8702063 [details] [diff] [review]:
-----------------------------------------------------------------

This looks pretty good, but i think we shouldn't land this in its current form.
I left some comments and questions below.

::: toolkit/components/telemetry/TelemetryController.jsm
@@ +772,5 @@
> +        // Perform TelemetrySession delayed init.
> +        let sessionInitPromise = TelemetrySession.delayedInit(this._testMode);
> +        if (this._testMode) {
> +          // If we're testing, wait for telemetrySession to init.
> +          yield sessionInitPromise;

This seems odd and can give us different ordering in tests than in normal sessions.
What is the problem that this is solving here?
Can't we just always wait on TelemetrySession.delayedInit()?

@@ +776,5 @@
> +          yield sessionInitPromise;
> +        }
> +
> +        // TelemetrySend is set up after TelemetrySession, otherwise
> +        // the aborted pings would not be sent in this session.

Why would they not get sent otherwise?
It seems like doing the TelemetrySend setup first should work fine, i'm not sure why we need to move it after TelemetrySession setup.

@@ +827,5 @@
>  
>      // Now do an orderly shutdown.
>      try {
>        // Stop the datachoices infobar display.
> +      TelemetryReportingPolicy.shutdown(this._testMode);

TelemetryReportingPolicy.shutdown() doesn't take any arguments.

@@ +832,5 @@
>  
>        // Stop any ping sending.
>        yield TelemetrySend.shutdown();
>  
> +      yield TelemetrySession.shutdown(this._testMode);

TelemetrySession.shutdown() doesn't actually use the testing argument, lets drop it here and in TelemetrySession.jsm

@@ +837,4 @@
>  
> +      // AsyncShutdown is not really test-friendly. We won't be able to reuse
> +      // the barriers/reset them after the wait, in tests.
> +      if (!this._testMode) {

I know that it was not working properly in tests before, but was it actually broken?
I think waiting on the barriers even with test mode on was fine, can we leave it that way?

::: toolkit/components/telemetry/TelemetrySession.jsm
@@ +1372,5 @@
>  
>      ppml.addMessageListener(MESSAGE_TELEMETRY_PAYLOAD, this);
>      ppml.addMessageListener(MESSAGE_TELEMETRY_USS, this);
> +  },
> +  

Nit: Trailing whitespace.

@@ +1385,2 @@
>      this._delayedInitTaskDeferred = Promise.defer();
> +    this._delayedInitTask = Task.spawn(function* () {

Now that we are using a standard task here, we can greatly simply this:
Task.spawn() returns a promise, so we can just remove this._delayedInitTaskDeferred and its usage.

@@ +1429,1 @@
>      return this._delayedInitTaskDeferred.promise;

Just return this._delayedInitTask

@@ +1796,2 @@
>      // or it was called and
>      //   2) _delayedInitTask was scheduled, but didn't run yet.

With the code changes here, this can't happen anymore.
Please remove it from the list and update the comments below accordingly.
Attachment #8702063 - Flags: feedback?(gfritzsche)
Attachment #8702061 - Attachment is obsolete: true
Attachment #8706131 - Flags: review?(alessio.placitelli)
Attachment #8702063 - Attachment is obsolete: true
Attachment #8706132 - Flags: review?(alessio.placitelli)
Comment on attachment 8706132 [details] [diff] [review]
Shifting TelemetrySession init control to TelemetryController

Review of attachment 8706132 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetryController.jsm
@@ +770,5 @@
>          this._clientID = yield ClientID.getClientID();
>  
> +        yield TelemetrySend.setup(this._testMode);
> +        // Perform TelemetrySession delayed init.
> +        yield TelemetrySession.delayedInit(this._testMode);

nit: could you add a newline after this?

@@ +829,2 @@
>  
> +      // AsyncShutdown is not really test-friendly. We won't be able to reuse

As Georg suggested in comment 37, we should wait on barriers even in test mode, unless there's a good reason not too.
Attachment #8706132 - Flags: review?(alessio.placitelli) → feedback+
(In reply to Alessio Placitelli [:Dexter] from comment #40)

Thanks for the quick review alessio!

> As Georg suggested in comment 37, we should wait on barriers even in test
> mode, unless there's a good reason not too.


Your are right. I forgot this part!
Comment on attachment 8706131 [details] [diff] [review]
Fix the tests to use the new Telemetry initialization model

I think it's better for :gfritzsche to take on reviewing those two patches.
Attachment #8706131 - Flags: review?(alessio.placitelli) → review?(gfritzsche)
Comment on attachment 8706131 [details] [diff] [review]
Fix the tests to use the new Telemetry initialization model

Lets wait for the other patch update first and whether this affects the patch here.
Please reflag later as needed :)
Attachment #8706131 - Flags: review?(gfritzsche)
Hi guys! Waiting on barriers causes crashes. We fail to register to events which are already over during shutdown. It seems we need to somehow "reset" barriers during TelemetryController reset. I am looking into how this could be done.
Had to update the test patch because test_TelemetryLog.js was failing.
Attachment #8706131 - Attachment is obsolete: true
Attachment #8708783 - Flags: review?(gfritzsche)
Hey Georg! Please take a look on how I reset the Barrier. Is that ok? Should we maybe move the [initialization](https://dxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/TelemetryController.jsm#383,385) of barriers into telemetry setup completely? Seems a good idea to me.
Attachment #8706132 - Attachment is obsolete: true
Attachment #8708784 - Flags: review?(gfritzsche)
Priority: P2 → P3
Whiteboard: [unifiedTelemetry] [measurement:client] → [measurement:client]
Comment on attachment 8708784 [details] [diff] [review]
Shifting TelemetrySession init control to TelemetryController (test)

Review of attachment 8708784 [details] [diff] [review]:
-----------------------------------------------------------------

We also need to update TelemetryStartup.js to only notify TelemetryController.

::: toolkit/components/telemetry/TelemetryController.jsm
@@ +185,5 @@
>  
>    /**
>     * Used only for testing purposes.
>     */
> +  shutdownChrome: function() {

Lets call this public function just "shutdown".

@@ +712,5 @@
>  
> +    if (this._connectionsBarrier.state === "Complete") {
> +      this._connectionsBarrier =
> +        new AsyncShutdown.Barrier("TelemetryController: Waiting for pending ping activity");
> +    }

This seems hacky. If we really need to reset the barriers, lets do this in reset().
I don't understand why this is needed for the _connectionsBarrier, but not for the _shutdownBarrier.

This sounds we might have to investigate this a bit to make sure that we don't have any misbehavior now.

What do i need to reproduce the issues you were seeing?
Take out the lines here and then run a specific test?
Attachment #8708784 - Flags: review?(gfritzsche)
Comment on attachment 8708783 [details] [diff] [review]
Fix the tests to use the new Telemetry initialization model

Review of attachment 8708783 [details] [diff] [review]:
-----------------------------------------------------------------

These look fine from a quick look, deferring detailed review until the other patch is finished.
Attachment #8708783 - Flags: review?(gfritzsche)
(In reply to Georg Fritzsche [:gfritzsche] from comment #47)

Hi Georg!

> What do i need to reproduce the issues you were seeing?
> Take out the lines here and then run a specific test?

Yes, just remove the reset of this barrier. For me test_TelemetrySession.js causes exceptions then, but you might also check test_TelemetryController.js.
Duplicate of this bug: 1210415
Hi Guys! A short update on my progress on this bug:
I have addressed most of the feedback from :gfritzsche but got stuck in test_TelemetrySession.js, which hangs for an unknown to me reason. I've been trying to get it pass for a couple of weeks now. I was mostly slowed down by problems in building firefox on my new laptop. If this isn't urgent I would try for one more week before giving up here.
That sounds good - if you are stuck at a specific point, we'd be happy to help out.
You could e.g. upload a WIP patch and point us to where it gets stuck.
Attachment #8708783 - Attachment is obsolete: true
Attachment #8708784 - Attachment is obsolete: true
Attachment #8735135 - Flags: review?(gfritzsche)
Attached patch bug1145188_test.patch (obsolete) — Splinter Review
Attachment #8735136 - Flags: review?(gfritzsche)
Attachment #8735136 - Attachment is obsolete: true
Attachment #8735136 - Flags: review?(gfritzsche)
Attachment #8735137 - Flags: review?(gfritzsche)
(In reply to Georg Fritzsche [:gfritzsche] from comment #47)
> > +  shutdownChrome: function() {
> 
> Lets call this public function just "shutdown".

shutdown is already in use by a getter:
https://dxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/TelemetryController.jsm#304-306
Do we want to change that?

> This seems hacky. If we really need to reset the barriers, lets do this in
> reset().
> I don't understand why this is needed for the _connectionsBarrier, but not
> for the _shutdownBarrier.
> 
> This sounds we might have to investigate this a bit to make sure that we
> don't have any misbehavior now.
> 
> What do i need to reproduce the issues you were seeing?
> Take out the lines here and then run a specific test?

Without this I get the following exception in test_TelemetrySession.js:

 0:08.78 LOG: Thread-1 INFO "CONSOLE_MESSAGE: (error) [JavaScript Error: "1458930290315	Toolkit.Telemetry	ERROR	TelemetrySend::_doPing - error making request to http://localhost:35516/submit/telemetry/f01bf1b4-16a6-422d-9362-93a17d3c4ddc/main/XPCShell/1/default/2007010101?v=4: abort" {file: "resource://gre/modules/Log.jsm" line: 751}]
App_append@resource://gre/modules/Log.jsm:751:9
Logger.prototype.log@resource://gre/modules/Log.jsm:388:7
LoggerRepository.prototype.getLoggerWithMessagePrefix/proxy.log@resource://gre/modules/Log.jsm:503:44
Logger.prototype.error@resource://gre/modules/Log.jsm:396:5
TelemetrySendImpl._doPing/errorhandler@resource://gre/modules/TelemetrySend.jsm:939:7
TelemetrySendImpl._cancelOutgoingRequests@resource://gre/modules/TelemetrySend.jsm:759:9
TelemetrySendImpl.shutdown<@resource://gre/modules/TelemetrySend.jsm:644:11
TaskImpl_run@resource://gre/modules/Task.jsm:319:40
TaskImpl@resource://gre/modules/Task.jsm:280:3
createAsyncFunction/asyncFunction@resource://gre/modules/Task.jsm:254:14
this.TelemetrySend.shutdown@resource://gre/modules/TelemetrySend.jsm:193:12
Impl._cleanupOnShutdown<@resource://gre/modules/TelemetryController.jsm:750:13
TaskImpl_run@resource://gre/modules/Task.jsm:319:40
TaskImpl@resource://gre/modules/Task.jsm:280:3
createAsyncFunction/asyncFunction@resource://gre/modules/Task.jsm:254:14
Impl.shutdown@resource://gre/modules/TelemetryController.jsm:787:14
this.TelemetryController<.shutdownChrome@resource://gre/modules/TelemetryController.jsm:161:12
test_savedSessionData@/home/yarik/Projects/mozilla-central/obj-x86_64-unknown-linux-gnu/_tests/xpcshell/toolkit/components/telemetry/tests/unit/test_TelemetrySession.js:1174:9
TaskImpl_run@resource://gre/modules/Task.jsm:319:40
Handler.prototype.process@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:937:23
this.PromiseWalker.walkerLoop@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:816:7
Promise*this.PromiseWalker.scheduleWalkerLoop@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:747:11
this.PromiseWalker.schedulePromise@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:779:7
this.PromiseWalker.completePromise@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:714:7
onError@resource://gre/modules/osfile/osfile_native.jsm:66:7
_do_main@/home/yarik/Projects/mozilla-central/testing/xpcshell/head.js:209:5
_execute_test@/home/yarik/Projects/mozilla-central/testing/xpcshell/head.js:533:5
@-e:1:1
"
(In reply to Iaroslav Sheptykin from comment #56)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #47)
> > > +  shutdownChrome: function() {
> > 
> > Lets call this public function just "shutdown".
> 
> shutdown is already in use by a getter:
> https://dxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/
> TelemetryController.jsm#304-306
> Do we want to change that?

There is actually only one user of that, TelemetrySession:
https://dxr.mozilla.org/mozilla-central/search?q=TelemetryController.shutdown&redirect=false&case=true
Given that we are changing the logic here for the shutdown and make TelemetryController explicitly shut down TelemetrySession... we can remove that public getter.
I get conflicts in TelemetrySession.jsm when trying to apply this patch on the current mozilla-central, could you update them? :)
Whiteboard: [measurement:client] → [measurement:client] [lang=js]
Attachment #8735135 - Flags: review?(gfritzsche) → feedback?(alessio.placitelli)
Attachment #8735137 - Flags: review?(gfritzsche) → feedback?(alessio.placitelli)
Comment on attachment 8735135 [details] [diff] [review]
Shifting TelemetrySession init control to TelemetryController (core)

Clearing the feedback: waiting for the rebased patches!
Attachment #8735135 - Flags: feedback?(alessio.placitelli)
Attachment #8735137 - Flags: feedback?(alessio.placitelli)
Attachment #8735137 - Attachment is obsolete: true
Attachment #8736023 - Flags: review?(alessio.placitelli)
Attachment #8735135 - Attachment is obsolete: true
Attachment #8736024 - Flags: review?(alessio.placitelli)
Rebased, having however to resolve some conflicts.
- The real error, easily reproducible with test_SubsessionChaining.js (much shorter than test_TelemetrySession.js!):

 0:03.30 LOG: Thread-3 ERROR Unexpected exception Error: Phase "TelemetryController: Waiting for pending ping activity" is finished, it is too late to register completion condition "Waiting for ping task" at resource://gre/modules/AsyncShutdown.jsm:693
Barrier/this.client.addBlocker@resource://gre/modules/AsyncShutdown.jsm:693:15
Impl._trackPendingPingTask@resource://gre/modules/TelemetryController.jsm:447:5
send@resource://gre/modules/TelemetryController.jsm:489:5
this.TelemetryController<.submitExternalPing@resource://gre/modules/TelemetryController.jsm:203:12
Impl.saveShutdownPings@resource://gre/modules/TelemetrySession.jsm:1665:14
Impl.shutdownChromeProcess/cleanup/<@resource://gre/modules/TelemetrySession.jsm:1902:15
Impl.shutdownChromeProcess/cleanup@resource://gre/modules/TelemetrySession.jsm:1901:14
Impl.shutdownChromeProcess@resource://gre/modules/TelemetrySession.jsm:1926:14
this.TelemetrySession<.shutdown@resource://gre/modules/TelemetrySession.jsm:623:12
Impl._cleanupOnShutdown<@resource://gre/modules/TelemetryController.jsm:753:13
Impl.shutdown/<@resource://gre/modules/TelemetryController.jsm:792:56
Handler.prototype.process@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:937:23
this.PromiseWalker.walkerLoop@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:816:7
Promise*this.PromiseWalker.scheduleWalkerLoop@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:747:11
this.PromiseWalker.schedulePromise@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:779:7
this.PromiseWalker.completePromise@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:714:7
onError@resource://gre/modules/osfile/osfile_native.jsm:66:7
_do_main@/home/dexter/mozilla-central/testing/xpcshell/head.js:209:5
_execute_test@/home/dexter/mozilla-central/testing/xpcshell/head.js:533:5
@-e:1:1
_run_next_test@/home/dexter/mozilla-central/testing/xpcshell/head.js:1540:9
do_execute_soon/<.run@/home/dexter/mozilla-central/testing/xpcshell/head.js:692:9
_do_main@/home/dexter/mozilla-central/testing/xpcshell/head.js:209:5
_execute_test@/home/dexter/mozilla-central/testing/xpcshell/head.js:533:5
@-e:1:1

- Why it doesn't happen with _shutdownBarrier?

Without these patches, we trigger |TelemetrySession.shutdownChromeProcess| when the |TelemetryController._shutdownBarrier| is
hit, making the call flow hard to follow. With these patches, we explicitly call |TelemetrySession.shutdownChromeProcess| when
shutting down, without relying on the |_shutdownBarrier|.

Moreover, |_shutdownBarrier| only gets used when waiting on pending pings removal from |_onUploadPrefChange|, which never happens
after shutting down (even in tests).

- Why it doesn't happen without this patch (core)?

We implicitly trigger shutdown from [1], by blocking |AsyncShutdown.sendTelemetry|. This only gets called once per xpcshell execution.
With this patch we manually call |TelemetryController.shutdownChrome()/shutdown()| every time we want to simulate a shutdown, thus
invalidating the barriers (see [2]).

- Solutions

Iaroslav's takes on this issue seem reasonable. To recap, he suggested to either:

1) Don't wait on the barries when calling shutdown (that's in one of the previous patches, see comment 37)
2) Instantiate a new barrier every time setup() is called (that's in the current patch)

Waiting on barriers in tests is probably a good thing. One way to retain the current behaviour and fix the test issues
would be to only skip waiting on barriers when shutdown/_cleanupOnShutdown is manually called (in test mode), while
keep on waiting when the shutdown comes from |AsyncShutdown.sendTelemetry|.

This adds some complexity and I'm not sure how much buys us in terms of coverage.

@Georg, do you have any opinion about the suggested solutions?

[1] - https://dxr.mozilla.org/mozilla-central/rev/d5d53a3b4e50b94cdf85d20690526e5a00d5b63e/toolkit/components/telemetry/TelemetryController.jsm#696
[2] - https://dxr.mozilla.org/mozilla-central/rev/d5d53a3b4e50b94cdf85d20690526e5a00d5b63e/toolkit/components/asyncshutdown/AsyncShutdown.jsm#827
Flags: needinfo?(gfritzsche)
(In reply to Alessio Placitelli [:Dexter] from comment #63)
> 1) Don't wait on the barries when calling shutdown (that's in one of the
> previous patches, see comment 37)
> 2) Instantiate a new barrier every time setup() is called (that's in the
> current patch)

Thanks for the investigation and summary.
We should go with 2), this gets us closer to "real" browser session behavior.

Let's avoid explicit state checks like this though: |this._connectionsBarrier.state === "Complete"|
When resetting the barrier, the module should already be shut down, so there should be no need to test for that.

While we are doing these changes here, can we also mark exposed "test only" functions to be clearly different, e.g. by prefixing them with "_" or similar?
Lets stay consistent with what we do elsewhere already.
Flags: needinfo?(gfritzsche)
Comment on attachment 8736024 [details] [diff] [review]
Shifting TelemetrySession init control to TelemetryController (core)

Review of attachment 8736024 [details] [diff] [review]:
-----------------------------------------------------------------

This looks pretty good! I think we're basically there, as everything should be working correctly with the changes below (from comment 64).

Also, as Georg mentioned, please also remove the barrier's public getter in TelemetryController.jsm (|get shutdown() {...}|).

::: toolkit/components/telemetry/TelemetryController.jsm
@@ +156,5 @@
>  
>    /**
>     * Used only for testing purposes.
>     */
> +  shutdownChrome: function() {

As suggested in comment 64, would you kindly rename this to |testShutdownChrome|?

Let's add a "test" prefix to all the functions, exposed in TelemetryController, TelemetrySession and TelemetryEnvironment, which are only used in the tests.

@@ +167,1 @@
>    setupContent: function() {

Same here.

@@ +689,5 @@
>          // Load the ClientID.
>          this._clientID = yield ClientID.getClientID();
>  
> +        // Perform TelemetrySession delayed init.
> +        yield TelemetrySession.delayedInit(this._testMode);

We can drop the |this._testMode| argument if we use |this._testing| in TelemetrySession.

@@ +912,5 @@
>      this._detachObservers();
>  
> +    yield TelemetrySession.reset();
> +
> +    if (this._connectionsBarrier.state === "Complete") {

Let's remove this check so we are always resetting the barrier, as per comment 64.

@@ +917,5 @@
> +      this._connectionsBarrier =
> +        new AsyncShutdown.Barrier("TelemetryController: Waiting for pending ping activity");
> +    }
> +
> +    if (this._connectionsBarrier.state === "Complete") {

Same here.

::: toolkit/components/telemetry/TelemetryEnvironment.jsm
@@ +90,5 @@
>      return getGlobal()._watchPreferences(prefMap);
>    },
> +
> +  /**
> +   * NOTE: Intended for use in tests only.

nit: drop the "NOTE:"

@@ +97,5 @@
> +   * with TelemetryEnvironment. This is problematic due to the fact that
> +   * TelemetryEnvironment is a singleton. We, therefore, need this helper
> +   * method to be able to re-set TelemetryEnvironment.
> +   */
> +  reset: function() {

Can you rename this to |testReset|?

::: toolkit/components/telemetry/TelemetrySession.jsm
@@ +618,5 @@
>     * Used only for testing purposes.
>     * @param {Boolean} [aForceSavePending=true] If true, always saves the ping whether Telemetry
>     *        can send pings or not, which is used for testing.
>     */
> +  shutdown: function() {

Can you rename this to testShutdown()?

@@ +648,5 @@
> +   * Does the "heavy" Telemetry initialization later on, so we
> +   * don't impact startup performance.
> +   * @return {Promise} Resolved when the initialization completes.
> +   */
> +  delayedInit: function(aTesting = false) {

We should drop |aTesting| and use |this._testing| in |delayedInit|.

@@ -716,5 @@
>    // A Set of outstanding USS report ids
>    _childrenToHearFrom: null,
>    // monotonically-increasing id for USS reports
>    _nextTotalMemoryId: 1,
> -  _testing: false,

I don't think we should be removing this. This is being used by |getHistograms| and |getKeyedHistograms|.

We should probably just set this in |earlyInit| and use |this._testing| from |delayedInit|, dropping the parameter from the latter.

@@ -1366,3 @@
>      this._initStarted = true;
> -    this._log.trace("setupChromeProcess");
> -    this._testing = testing;

Let's keep the this._testing = ...

@@ +1422,5 @@
> +  * Does the "heavy" Telemetry initialization later on, so we
> +  * don't impact startup performance.
> +  * @return {Promise} Resolved when the initialization completes.
> +  */
> +  delayedInit:function(testing) {

Drop the |testing| parameter here, please, and use |this._testing| in its place. Also make sure to remove the argument at the call site.

@@ -1477,5 @@
>     * Initializes telemetry for a content process.
>     */
>    setupContentProcess: function setupContentProcess(testing) {
>      this._log.trace("setupContentProcess");
> -    this._testing = testing;

We keep that, as we also get histograms from content processes.

@@ +1883,5 @@
>     * This tells TelemetrySession to uninitialize and save any pending pings.
>     * @param testing Optional. If true, always saves the ping whether Telemetry
>     *                can send pings or not, which is used for testing.
>     */
>    shutdownChromeProcess: function(testing = false) {

I think we can drop |testing| as it doesn't seem to be used.

@@ -1891,5 @@
>     * @param testing Optional. If true, always saves the ping whether Telemetry
>     *                can send pings or not, which is used for testing.
>     */
>    shutdownChromeProcess: function(testing = false) {
> -    this._log.trace("shutdownChromeProcess - testing: " + testing);

Please keep this line, just drop the "- testing ..." part.
Attachment #8736024 - Flags: review?(alessio.placitelli) → feedback+
Thanks for the feedback, guys! I am planning to address this tonight.
Comment on attachment 8736023 [details] [diff] [review]
Shifting TelemetrySession init control to TelemetryController (test)

Review of attachment 8736023 [details] [diff] [review]:
-----------------------------------------------------------------

There are some XPCSHELL tests using |TelemetrySession| that don't live in toolkit/components/telemetry/tests/, see [1]. Would you please make sure they are correctly working and, if they are not, fix them?

Other than that and the issue about test_TelemetryLog.js, this looks good, thank you for your efforts!

[1] - https://dxr.mozilla.org/mozilla-central/search?q=%22TelemetrySession.s%22+-path%3Aobj+-path%3Atoolkit%2Fcomponents%2Ftelemetry+ext%3Ajs&redirect=false&case=true

::: toolkit/components/telemetry/tests/unit/test_PingAPI.js
@@ -476,5 @@
>    }
>  });
>  
>  add_task(function* test_shutdown() {
> -  yield TelemetrySend.shutdown();

nit: Since we're removing direct calls to TelemetrySend and TelemetrySession, would you mind removing the related includes from the beginning of the file?

::: toolkit/components/telemetry/tests/unit/test_TelemetryLog.js
@@ +27,5 @@
>  add_task(function* ()
>  {
>    do_get_profile();
> +  yield TelemetryController.setup();
> +  Telemetry.canRecordExtended = true;

Why do we need |Telemetry.canRecordExtended = true|? Where does it break if don't manually set it?

::: toolkit/components/telemetry/tests/unit/test_TelemetrySession.js
@@ +1677,5 @@
>      Assert.ok(!receivedPingRequest, "Telemetry must only send one daily ping.");
>      receivedPingRequest = req;
>    });
>  
> +  // Faking scheduler timer hast to happen before resetting TelemetryController

nit: "hast to" -> "has to"
Attachment #8736023 - Flags: review?(alessio.placitelli) → feedback+
Attachment #8736024 - Attachment is obsolete: true
Attachment #8736915 - Flags: review?(alessio.placitelli)
Hei Alessio!

(In reply to Alessio Placitelli [:Dexter] from comment #67)
> Comment on attachment 8736023 [details] [diff] [review]
> ::: toolkit/components/telemetry/tests/unit/test_TelemetryLog.js
> @@ +27,5 @@
> >  add_task(function* ()
> >  {
> >    do_get_profile();
> > +  yield TelemetryController.setup();
> > +  Telemetry.canRecordExtended = true;
> 
> Why do we need |Telemetry.canRecordExtended = true|? Where does it break if
> don't manually set it?

Without this the test breaks with this exception:

0:01.14 LOG: Thread-1 ERROR Unexpected exception TypeError: TelemetrySession.getPayload(...).log is undefined at /home/yarik/Projects/mozilla-central/obj-x86_64-unknown-linux-gnu/_tests/xpcshell/toolkit/components/telemetry/tests/unit/test_TelemetryLog.js:37
@/home/yarik/Projects/mozilla-central/obj-x86_64-unknown-linux-gnu/_tests/xpcshell/toolkit/components/telemetry/tests/unit/test_TelemetryLog.js:37:13
_run_next_test@/home/yarik/Projects/mozilla-central/testing/xpcshell/head.js:1540:9
do_execute_soon/<.run@/home/yarik/Projects/mozilla-central/testing/xpcshell/head.js:692:9
_do_main@/home/yarik/Projects/mozilla-central/testing/xpcshell/head.js:209:5
_execute_test@/home/yarik/Projects/mozilla-central/testing/xpcshell/head.js:533:5
@-e:1:1

I have to take a closer look on why this happens.
Attachment #8736023 - Attachment is obsolete: true
Attachment #8736917 - Flags: review?(alessio.placitelli)
Comment on attachment 8736915 [details] [diff] [review]
Shifting TelemetrySession init control to TelemetryController (core)

Review of attachment 8736915 [details] [diff] [review]:
-----------------------------------------------------------------

Thank you, this looks good with the following addressed.

It's only a handful of changes, but I'd love to give the patches a fresh look after those (that's the only reason why it's f+ and not r+).

::: toolkit/components/telemetry/TelemetryController.jsm
@@ +745,5 @@
>  
>        // Stop any ping sending.
>        yield TelemetrySend.shutdown();
>  
> +      yield TelemetrySession.testShutdown();

We're using this in a non test code path. Can you rename |testShutdown| to |shutdown|? My bad for suggesting this change in the first place.

::: toolkit/components/telemetry/TelemetrySession.jsm
@@ +616,5 @@
>    },
>    /**
>     * Used only for testing purposes.
>     */
> +  testShutdown: function() {

Oh, my mistake. This is being used to trigger the real shutdown by TelemetryController (we're no longer blocking a barrier with shutdownChromeProcess), so it should be renamed. Would you kindly undo this change and name it back to "shutdown"?

The comment needs to be changed though.

@@ +1881,5 @@
>    },
>  
>    /**
>     * This tells TelemetrySession to uninitialize and save any pending pings.
>     * @param testing Optional. If true, always saves the ping whether Telemetry

nit: the "@param testing..." can be dropped now since we're no longer using the parameter.
Attachment #8736915 - Flags: review?(alessio.placitelli) → feedback+
Comment on attachment 8736917 [details] [diff] [review]
Shifting TelemetrySession init control to TelemetryController (test)

Review of attachment 8736917 [details] [diff] [review]:
-----------------------------------------------------------------

Great job with the tests, and thank you for the patience of going through the renaming.

Only a few changes left.

::: toolkit/components/telemetry/tests/unit/test_TelemetryLog.js
@@ +27,5 @@
>  add_task(function* ()
>  {
>    do_get_profile();
> +  yield TelemetryController.testSetup();
> +  Telemetry.canRecordExtended = true;

Instead of directly changing |canRecordExtended|, let's make sure the correct pref is on by calling |Services.prefs.setBoolPref(PREF_TELEMETRY_ENABLED, true);| before |testSetup|.

Please note that this is needed due to bug 1254550. Let's add a comment about this as well.

::: toolkit/components/telemetry/tests/unit/test_TelemetryTimestamps.js
@@ +31,5 @@
>    return TelemetrySession.getPayload().simpleMeasurements;
>  }
>  
>  function initialiseTelemetry() {
> +  return TelemetryController.testSetup().then(TelemetrySession.setup);

I think we can drop the |.then(TelemetrySession.setup)| part now, as we're triggering TelemetrySession's setup directly from TelemetryController.

@@ +37,5 @@
>  
>  function run_test() {
>    // Telemetry needs the AddonManager.
>    loadAddonManager();
> +  // Make profile available for |TelemetrySession.testShutdown()|.

If we change line 88, let's update this doc as well.

@@ +84,5 @@
>    do_check_true(simpleMeasurements.foo > 1); // foo was included
>    do_check_true(simpleMeasurements.bar > 1); // bar was included
>    do_check_eq(undefined, simpleMeasurements.baz); // baz wasn't included since it wasn't added
>  
> +  yield TelemetrySession.testShutdown(false);

We can change this as well to |yield TelemetryController.testShutdown();|
Attachment #8736917 - Flags: review?(alessio.placitelli) → feedback+
Comment on attachment 8736915 [details] [diff] [review]
Shifting TelemetrySession init control to TelemetryController (core)

Review of attachment 8736915 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetrySession.jsm
@@ +1807,3 @@
>      case "app-startup":
>        // app-startup is only registered for content processes.
>        return this.setupContentProcess();

Can we call TelemetrySession.setupContentProcess() from TelemetryController.setupContentProcess(), basically doing the same thing we do for the chrome process?

This would allow us to remove this "app-startup" case and the TelemetrySession dependencies from https://dxr.mozilla.org/mozilla-central/rev/bccb11375f2af838cda714d42fd8cef78f5c7bf1/toolkit/components/telemetry/TelemetryStartup.js#14,31
Depends on: 1261406
Blocks: 1261406
No longer depends on: 1261406
Attachment #8736915 - Attachment is obsolete: true
Attachment #8738186 - Flags: review?(alessio.placitelli)
Attachment #8736917 - Attachment is obsolete: true
Attachment #8738187 - Flags: review?(alessio.placitelli)
Comment on attachment 8738186 [details] [diff] [review]
Shifting TelemetrySession init control to TelemetryController (core)

Review of attachment 8738186 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good to me, with the change below addressed! Good job.

::: toolkit/components/telemetry/TelemetryStartup.js
@@ -27,5 @@
>  TelemetryStartup.prototype.QueryInterface = XPCOMUtils.generateQI([Components.interfaces.nsIObserver]);
>  TelemetryStartup.prototype.observe = function(aSubject, aTopic, aData) {
>    if (aTopic == "profile-after-change" || aTopic == "app-startup") {
>      TelemetryController.observe(null, aTopic, null);
> -    TelemetrySession.observe(null, aTopic, null);

TelemetrySession was only used here, let's remove |XPCOMUtils.defineLazyModuleGetter(this, "TelemetrySession",...| at the top of this file as well.
Attachment #8738186 - Flags: review?(alessio.placitelli) → review+
Comment on attachment 8738187 [details] [diff] [review]
Shifting TelemetrySession init control to TelemetryController (test)

Review of attachment 8738187 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good now, thank you for your efforts and patience!
Attachment #8738187 - Flags: review?(alessio.placitelli) → review+
In addition to pushing these patches to Try, we should take them for a manual test-drive. In order to do that, we need to point the "toolkit.telemetry.server" pref to the address of a local instance of the gzip server [2], in a non-debug build.

As reported by the documentation [3], your .mozconfig file must contain the following lines:

# Telemetry Enabled Mozconfig
export MOZILLA_OFFICIAL=1
export MOZ_TELEMETRY_REPORTING=1

Otherwise no ping would get sent. Let's also enable verbose telemetry logging by setting these prefs:

toolkit.telemetry.log.level -> "Trace" (String)
toolkit.telemetry.log.dump -> true (Bool)

The tests:

- Check that Telemetry inits correctly - 

Start Firefox, wait 60 seconds, check that TelemetrySession/TelemetryController get correctly initialized.

- Trigger an environment change - 

Wait a bit after Firefox is started (like 5 minutes so we don't get caught in environment throttling, if something else happened). Install/Enable/Disable a restartless extension and check that a "main" ping with an "environment-changed" reason gets sent to the local server.

- Check that a shutdown/saved-session ping gets saved -

After Telemetry is initialized, shut down Firefox. Start Firefox again, wait for Telemetry to be initialized. A shutdown and a saved-session ping should be sent to the local server for the previous session.

- Check that the aborted-session ping gets sent -

After Telemetry is initialized, kill the Firefox process. Start Firefox again, the logs should tell you that an aborted session ping was found. The ping should be sent to the local server.

- Check that the deletion ping gets sent -

After Telemetry is initialized, disable FHR in the Data Choices options (Preferences panel). A deletion ping should be sent to the local server.

This should cover pretty much everything, except for the daily ping, which is a bit more involving (requires changing the local clock).

[1] - https://gecko.readthedocs.org/en/latest/toolkit/components/telemetry/telemetry/pings.html#ping-types
[2] - https://github.com/georgf/gzipServer/
[3] - https://gecko.readthedocs.org/en/latest/toolkit/components/telemetry/telemetry/preferences.html
In addition to the tests in the previous comment, we should also make sure that:

- the search count is available and sane -

Start FF, wait for Telemetry to init (that's only a simplification for the test), perform a search somewhere, trigger an environment-changed ping as described in the previous comment and verify that the SEARCH_COUNTS keyed histogram contains coherent data.

- subsession chaining works - 

Within a single session (from FF startup to its shutdown), subsessions should be correctly linked.
Start Firefox, wait 60 seconds, trigger an environment-changed ping. Wait 6-7 minutes and trigger another one. Then shutdown Firefox. Start it again and wait for the shutdown/saved-session pings to be received by your local server.

You should verify that the received pings have the same sessionId (if they belong to the same session) and form a correct subsession chain (prev_ping[subsessionId] = curr_ping[previousSubsessionId]). You can find additional information about session chaining and the related fields here: https://gecko.readthedocs.org/en/latest/toolkit/components/telemetry/telemetry/main-ping.html
Updated patch comment
Attachment #8738186 - Attachment is obsolete: true
Attachment #8738634 - Flags: review+
Addressed comment 75
Attachment #8738187 - Attachment is obsolete: true
Attachment #8738635 - Flags: review+
Well, some tests failed on tryserver. Investigating why...
> TEST-UNEXPECTED-FAIL | toolkit/components/telemetry/tests/unit/test_TelemetrySession.js | test_simplePing - [test_simplePing : 514] "bf17c33f-5662-47b0-816e-2cabfdbe478f" == "bd314d15-95bf-4356-b682-b6c4a8942202"
seems new.

The rest looks unrelated.
Attached file test_simplePing failure log (obsolete) —
Hey Alession!

This is the excerpt from the test-log related to test_SimplePing. Something looks suspicious around here:

12:19:18     INFO -  "CONSOLE_MESSAGE: (error) [JavaScript Error: "1459970357081	Toolkit.Telemetry	ERROR	TelemetrySend::_doPing - error making request to http://doesnotexist/submit/telemetry/6eb28f58-34f2-4bf6-be39-cbfa672d82aa/main/XPCShell/1/default/20160315?v=4: error" {file: "resource://gre/modules/Log.jsm" line: 751}]
12:19:18     INFO -  App_append@resource://gre/modules/Log.jsm:751:9
12:19:18     INFO -  Logger.prototype.log@resource://gre/modules/Log.jsm:388:7
12:19:18     INFO -  LoggerRepository.prototype.getLoggerWithMessagePrefix/proxy.log@resource://gre/modules/Log.jsm:503:44
12:19:18     INFO -  Logger.prototype.error@resource://gre/modules/Log.jsm:396:5
12:19:18     INFO -  TelemetrySendImpl._doPing/errorhandler@resource://gre/modules/TelemetrySend.jsm:939:7
12:19:18     INFO -  _do_main@/builds/slave/test/build/tests/xpcshell/head.js:209:5
12:19:18     INFO -  _execute_test@/builds/slave/test/build/tests/xpcshell/head.js:533:5
12:19:18     INFO -  @-e:1:1
12:19:18     INFO -  "

Do you think this could explain the failure in test_SimplePing? Should I take a closer look on that?
Attachment #8739002 - Flags: feedback?(alessio.placitelli)
> Hey Alession!

I mean Alessio of coarse!
(In reply to Iaroslav Sheptykin from comment #84)
> Do you think this could explain the failure in test_SimplePing? Should I
> take a closer look on that?

Hang on, this is a known intermittent that has busted Linux opt tests once. The real (different) problem is highlighted by Windows 7 tests. See: https://treeherder.mozilla.org/logviewer.html#?job_id=19097749&repo=try

This is a little fishy:

06:28:58     INFO -  "CONSOLE_MESSAGE: (info) 1460035735861	Toolkit.Telemetry	TRACE	TelemetryStorage::_purgeOldPings"
06:28:58     INFO -  Unexpected exception Error:  at (unknown module)
06:28:58     INFO -  this.BasePromiseWorker.prototype.post@resource://gre/modules/PromiseWorker.jsm:263:12
06:28:58     INFO -  post/<@resource://gre/modules/osfile/osfile_async_front.jsm:429:25
06:28:58     INFO -  Handler.prototype.process@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:937:23
06:28:58     INFO -  this.PromiseWalker.walkerLoop@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:816:7
06:28:58     INFO -  Promise*this.PromiseWalker.scheduleWalkerLoop@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:747:11
06:28:58     INFO -  this.PromiseWalker.schedulePromise@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:779:7
06:28:58     INFO -  this.PromiseWalker.completePromise@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:714:7
06:28:58     INFO -  this.BasePromiseWorker.prototype._worker/worker.onmessage@resource://gre/modules/PromiseWorker.jsm:231:9
06:28:58     INFO -  _do_main@C:\slave\test\build\tests\xpcshell\head.js:209:5
06:28:58     INFO -  _execute_test@C:\slave\test\build\tests\xpcshell\head.js:533:5
06:28:58     INFO -  @-e:1:1
06:28:58     INFO -  _run_next_test@C:\slave\test\build\tests\xpcshell\head.js:1540:9
06:28:58     INFO -  do_execute_soon/<.run@C:\slave\test\build\tests\xpcshell\head.js:692:9
06:28:58     INFO -  _do_main@C:\slave\test\build\tests\xpcshell\head.js:209:5
06:28:58     INFO -  _execute_test@C:\slave\test\build\tests\xpcshell\head.js:533:5
06:28:58     INFO -  @-e:1:1
06:28:58     INFO -  exiting test

The problem seems to be easily reproducible on Windows, in |test_abortedDailyCoalescing|.

(In reply to Iaroslav Sheptykin from comment #85)
> > Hey Alession!
> 
> I mean Alessio of coarse!

Don't worry, Alession doesn't sound that bad ;-)
Attachment #8739002 - Flags: feedback?(alessio.placitelli)
Thanks(In reply to Alessio Placitelli [:Dexter] from comment #86)

> Hang on, this is a known intermittent that has busted Linux opt tests once.
> The real (different) problem is highlighted by Windows 7 tests. See:
> https://treeherder.mozilla.org/logviewer.html#?job_id=19097749&repo=try

Thanks, Alessio! I am gonna take a closer look on that.
(In reply to Iaroslav Sheptykin from comment #87)
> Thanks(In reply to Alessio Placitelli [:Dexter] from comment #86)
> 
> > Hang on, this is a known intermittent that has busted Linux opt tests once.
> > The real (different) problem is highlighted by Windows 7 tests. See:
> > https://treeherder.mozilla.org/logviewer.html#?job_id=19097749&repo=try
> 
> Thanks, Alessio! I am gonna take a closer look on that.

Don't mention it. I had another look at the test code, and I noticed that the shutdown behaviour of some tests changed a bit, as some calls to |yield TelemetrySession.shutdown();| were changed to |yield TelemetryController.testReset();|, instead of |yield TelemetryController.testShutdown();|.

The problem there is that the former doesn't perform a shutdown, so the test is not blocking on pending ping cleanup and other shutdown operations performed by TelemetryStorage. This explains the unexpected exceptions when trying to do I/O.
(In reply to Alessio Placitelli [:Dexter] from comment #88)

> Don't mention it. I had another look at the test code, and I noticed that
> the shutdown behaviour of some tests changed a bit, as some calls to |yield
> TelemetrySession.shutdown();| were changed to |yield
> TelemetryController.testReset();|, instead of |yield
> TelemetryController.testShutdown();|.
> 
> The problem there is that the former doesn't perform a shutdown, so the test
> is not blocking on pending ping cleanup and other shutdown operations
> performed by TelemetryStorage. This explains the unexpected exceptions when
> trying to do I/O.

Yeah, that makes sense. So we should fix those place to use |TelemetryController.testShutdown()|, right?
Comment on attachment 8738635 [details] [diff] [review]
Shifting TelemetrySession init control to TelemetryController (test)

Review of attachment 8738635 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry for not catching these before.

::: toolkit/components/telemetry/tests/unit/test_TelemetrySession.js
@@ +895,5 @@
>    Assert.equal(ping.payload.keyedHistograms[KEYED_ID]["a"].sum, 1);
>    Assert.equal(ping.payload.keyedHistograms[KEYED_ID]["b"].sum, 1);
>  
>    // Shutdown to cleanup the aborted-session if it gets created.
> +  yield TelemetryController.testReset();

This should be |yield TelemetryController.testShutdown();|.

@@ +949,5 @@
>    yield schedulerTickCallback();
>  
>    // Shutdown to cleanup the aborted-session if it gets created.
>    PingServer.resetPingHandler();
> +  yield TelemetryController.testReset();

This should be |yield TelemetryController.testShutdown();|.

@@ +1000,5 @@
>    Assert.equal(ping.type, PING_TYPE_MAIN);
>    Assert.equal(ping.payload.info.reason, REASON_DAILY);
>  
>    // Shutdown to cleanup the aborted-session if it gets created.
> +  yield TelemetryController.testReset();

This should be |yield TelemetryController.testShutdown();|.

@@ +1351,5 @@
>    const receivedPing = yield PingServer.promiseNextPing();
>    Assert.equal(receivedPing.type, PING_TYPE_MAIN, "Should have the correct type");
>    Assert.equal(receivedPing.payload.info.reason, REASON_ABORTED_SESSION, "Ping should have the correct reason");
>  
> +  yield TelemetryController.testReset();

This should be |yield TelemetryController.testShutdown();|.

@@ +1384,5 @@
>    // Remove the aborted session file and then shut down to make sure exceptions (e.g file
>    // not found) do not compromise the shutdown.
>    yield OS.File.remove(ABORTED_FILE);
>  
> +  yield TelemetryController.testReset();

This should be |yield TelemetryController.testShutdown();|.

@@ +1437,5 @@
>    let abortedSessionPing = JSON.parse(pingContent);
>    Assert.equal(abortedSessionPing.payload.info.sessionId, dailyPing.payload.info.sessionId);
>    Assert.equal(abortedSessionPing.payload.info.subsessionId, dailyPing.payload.info.subsessionId);
>  
> +  yield TelemetryController.testReset();

This should be |yield TelemetryController.testShutdown();|.

@@ +1475,5 @@
>  
>    Assert.ok((yield OS.File.exists(ABORTED_FILE)),
>              "There must be an aborted session ping.");
>  
> +  yield TelemetryController.testReset();

This should be |yield TelemetryController.testShutdown();|.

@@ +1520,5 @@
>    });
>  
>    // Execute one scheduler tick. It should not trigger a daily ping.
>    Assert.ok(!!schedulerTickCallback);
>    yield schedulerTickCallback();

This is missing the call to |yield TelemetryController.testShutdown();|.

@@ +1559,5 @@
>  
>    // Check that no aborted session ping was written to disk.
>    Assert.ok(!(yield OS.File.exists(ABORTED_FILE)));
>  
> +  yield TelemetryController.testReset();

This should be |yield TelemetryController.testShutdown();|.

@@ +1658,5 @@
>    fakeNow(now);
>    fakeIdleNotification("idle");
>    Assert.equal(schedulerTimeout, 10 * 60 * 1000);
>  
> +  yield TelemetryController.testReset();

This should be |yield TelemetryController.testShutdown();|.

@@ +1709,5 @@
>    const receivedPing = decodeRequestPayload(receivedPingRequest);
>    checkPingFormat(receivedPing, PING_TYPE_MAIN, true, true);
>    Assert.equal(receivedPing.payload.info.reason, REASON_DAILY);
>  
> +  yield TelemetryController.testReset();

This should be |yield TelemetryController.testShutdown();|.

@@ +1757,5 @@
>    const receivedPing = decodeRequestPayload(receivedPingRequest);
>    checkPingFormat(receivedPing, PING_TYPE_MAIN, true, true);
>    Assert.equal(receivedPing.payload.info.reason, REASON_DAILY);
>  
> +  yield TelemetryController.testReset();

This should be |yield TelemetryController.testShutdown();|.
Attachment #8738635 - Flags: review+
rebased the patch onto current central
Attachment #8738634 - Attachment is obsolete: true
Attachment #8740822 - Flags: review+
Reverted |TelemetryController.testReset| to |TelemetryController.testShutdown|.
Attachment #8738635 - Attachment is obsolete: true
Attachment #8740823 - Flags: review?(alessio.placitelli)
Comment on attachment 8740822 [details] [diff] [review]
Shifting TelemetrySession init control to TelemetryController (core)

Review of attachment 8740822 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetrySession.jsm
@@ +1885,2 @@
>     */
> +   shutdownChromeProcess: function() {

nit: there's an extra whitespace here, before the function name.
Comment on attachment 8740823 [details] [diff] [review]
Shifting TelemetrySession init control to TelemetryController (test)

Review of attachment 8740823 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good. Please make sure this runs fine on try (try: -b do -p all -u all -t none) and that the results are consistent with the manual testing.
Attachment #8740823 - Flags: review?(alessio.placitelli) → review+
Removed space.
Attachment #8739002 - Attachment is obsolete: true
Attachment #8740822 - Attachment is obsolete: true
Attachment #8740929 - Flags: review?(alessio.placitelli)
Attachment #8740929 - Flags: review?(alessio.placitelli) → review+
Hey Alessio! Tryserver finally completed testing jobs. Some tests failed as usual. Among those, which failed I see this under Android 4.3:

TEST-UNEXPECTED-FAIL | toolkit/components/telemetry/tests/unit/test_TelemetryControllerShutdown.js | xpcshell return code: 139
PROCESS | toolkit/components/telemetry/tests/unit/test_TelemetryControllerShutdown.js | FATAL ERROR: AsyncShutdown timeout in profile-before-change2 Conditions: [{"name":"TelemetryController: shutting down","state":{"initialized":true,"initStarted":true,"haveDelayedInitTask":false,"shutdownBarrier":"Not started","connectionsBarrier":"Not started","sendModule":{"sendingEnabled":false,"pendingPingRequestCount":0,"pendingPingActivityCount":0,"unpersistedPingCount":0,"persistedP
PROCESS | toolkit/components/telemetry/tests/unit/test_TelemetryControllerShutdown.js | \x07[14503] ###!!! ABORT: file resource://gre/modules/TelemetryController.jsm, line 708
PROCESS-CRASH | toolkit/components/telemetry/tests/unit/test_TelemetryControllerShutdown.js | application crashed [@ mozalloc_abort]

Does this sound familiar? Or is this something new?
Flags: needinfo?(alessio.placitelli)
(In reply to Iaroslav Sheptykin from comment #97)
> Hey Alessio! Tryserver finally completed testing jobs. Some tests failed as
> usual. Among those, which failed I see this under Android 4.3:
> 
> TEST-UNEXPECTED-FAIL |
> toolkit/components/telemetry/tests/unit/test_TelemetryControllerShutdown.js
> | xpcshell return code: 139
> PROCESS |
> toolkit/components/telemetry/tests/unit/test_TelemetryControllerShutdown.js
> | FATAL ERROR: AsyncShutdown timeout in profile-before-change2 Conditions:
> [{"name":"TelemetryController: shutting
> down","state":{"initialized":true,"initStarted":true,"haveDelayedInitTask":
> false,"shutdownBarrier":"Not started","connectionsBarrier":"Not
> started","sendModule":{"sendingEnabled":false,"pendingPingRequestCount":0,
> "pendingPingActivityCount":0,"unpersistedPingCount":0,"persistedP
> PROCESS |
> toolkit/components/telemetry/tests/unit/test_TelemetryControllerShutdown.js
> | \x07[14503] ###!!! ABORT: file
> resource://gre/modules/TelemetryController.jsm, line 708
> PROCESS-CRASH |
> toolkit/components/telemetry/tests/unit/test_TelemetryControllerShutdown.js
> | application crashed [@ mozalloc_abort]
> 
> Does this sound familiar? Or is this something new?

I'm afraid this is something new, would you mind taking a closer look?
Flags: needinfo?(alessio.placitelli)
The test crashes because we reach blocking timeout in AsyncShutdown [1]. Normally the timeout is set to 1 minute [2, 3]. In tests though we reduce it to a 100 ms [4]. Apparently this is not enough for android to shut the telemetry down properly. We might consider increasing the timeout value we use in |test_TelemetryControllerShutdown| or see how we can make the shutdown a little more snappy. What do you think, Alessio?

[1]: https://dxr.mozilla.org/mozilla-central/source/toolkit/components/asyncshutdown/AsyncShutdown.jsm#956
[2]: https://dxr.mozilla.org/mozilla-central/source/toolkit/components/asyncshutdown/AsyncShutdown.jsm#81-87
[3] https://dxr.mozilla.org/mozilla-central/source/modules/libpref/init/all.js#857
[4]: https://dxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/tests/unit/test_TelemetryControllerShutdown.js#41
Flags: needinfo?(alessio.placitelli)
Removed an unnecessary empty line.
Attachment #8740929 - Attachment is obsolete: true
Attachment #8742137 - Flags: review+
Hey Alessio! I updated the crash delay and re-sent the patches to the tryserver. Lets see how android behaves.
Attachment #8740823 - Attachment is obsolete: true
Flags: needinfo?(alessio.placitelli)
Attachment #8742138 - Flags: review?(alessio.placitelli)
Alright, android tests look greener now. The rest looks o.k. too. We have some random failures in test_TelemetrySession.jsm but they turn green on re-run (if I read the report right.) After this patch lands I could take a closer look at that.
Flags: needinfo?(alessio.placitelli)
Mh, but what are we doing differently than before? How did our shutdown behaviour change (with regards to this test) - test_TelemetryControllerShutdown.js?
Flags: needinfo?(alessio.placitelli) → needinfo?(yarik.sheptykin)
I believe that we made this test less stable by shutting down TelemetrySession from  TelemetryContoller [1]. TelemetrySession does quite a few things during its shutdown which delays |TelemetryController.shutdown| a bit. This delay is longer than 1 sec on Android, which triggers the emergency crash in AsyncShutdown. Before our patch TelemetryController setup and shutdown routines ran independently from those in TelemetrySession. The earlier test therefore tested solely the run-time of the TelemetryController. It set TelemetryController up [2,3] and shut it down [4,5,6,7]. With our patch it tests both TelemetryController and TelemetrySession. This change i believe makes the difference on Android. 

Do you think this explanation could be reasonable?

1: https://bugzilla.mozilla.org/attachment.cgi?id=8742137&action=diff#a/toolkit/components/telemetry/TelemetryController.jsm_sec8
2: https://dxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/tests/unit/test_TelemetryControllerShutdown.js#51
3: https://dxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/TelemetryController.jsm#625-701
4: https://dxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/tests/unit/test_TelemetryControllerShutdown.js#57
5: https://dxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/TelemetryController.jsm#696
6: https://dxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/TelemetryController.jsm#769,773
7: https://dxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/TelemetryController.jsm#719
Flags: needinfo?(yarik.sheptykin) → needinfo?(alessio.placitelli)
Thank you Iaroslav for looking into this. Since this is only affecting Android tests (and we're even running ping archiving on Android tests, which is not enabled on Fennec), I think increasing the timeout in tests is a fair change. We can always follow up with improving the shutdown.

@Georg, do you have an opinion on this?
Flags: needinfo?(alessio.placitelli) → needinfo?(gfritzsche)
Lets see if increasing the timeout reliably fixes the situation - i've retriggered the Android 4.3 xpcshell tests a few times.
If that's all green i guess we'll have to live with the situation for now, although the timeout seems rather high. Let's try 5 seconds for now.
Flags: needinfo?(gfritzsche)
tests look good.
(In reply to Alessio Placitelli [:Dexter] from comment #109)
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=cbaf5572e790

This push was testing the 5s timeout. Even if all the Android X3 look orange due to |TEST-UNEXPECTED-FAIL | security/manager/ssl/tests/unit/test_signed_dir.js | xpcshell return code: 0|, which is unrelated, all the Telemetry tests ran successfully:

|INFO -  TEST-PASS | toolkit/components/telemetry/tests/unit/test_TelemetryControllerShutdown|

Let's lower the timeout delay to 5s and land this.
Comment on attachment 8742138 [details] [diff] [review]
Shifting TelemetrySession init control to TelemetryController (test)

Review of attachment 8742138 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/tests/unit/test_TelemetryControllerShutdown.js
@@ +42,5 @@
> + * phases. Assumes that Telemetry shutdown routines do not take longer than
> + * 10 seconds to complete.
> + */
> +add_task(function* test_sendTelemetryShutsDownWithinReasonableTimeout() {
> +  const TEN_SECONDS = 10 * 1000;

Let's lower this to 5s.
Attachment #8742138 - Flags: review?(alessio.placitelli) → review+
Cool! Thanks for testing that, Alessio. I can update the patch tonight.
Limited the timeout to 5s.
Attachment #8742138 - Attachment is obsolete: true
Attachment #8745338 - Flags: review+
Blocks: 1144778
(In reply to Alessio Placitelli [:Dexter] from comment #77)

Hey Alessio! I ran the manual tests and everything seems fine. I went through your guide step by step and documented some relevant results below. Overall I successfully received 18 pings and did not notice any suspicious output in the console.

> - Check that Telemetry inits correctly - 

Telemetry inits correctly:
1461783541506	addons.xpi	WARN	Failed to call uninstall for jid0-dBgF7UkIiOsWqvBng4hYu@jetpack: Error: Unknown add-on ID jid0-dBgF7UkIiOsWqvBng4hYu@jetpack (resource://gre/modules/addons/XPIProvider.jsm:7725:11) JS Stack trace: DirectoryInstallLocation.prototype.getLocationForID@XPIProvider.jsm:7725:11 < this.XPIProvider.processPendingFileChanges@XPIProvider.jsm:3284:31 < this.XPIProvider.checkForChanges@XPIProvider.jsm:3613:19 < this.XPIProvider.startup@XPIProvider.jsm:2685:25 < callProvider@AddonManager.jsm:227:12 < _startProvider@AddonManager.jsm:778:5 < AddonManagerInternal.startup@AddonManager.jsm:962:9 < this.AddonManagerPrivate.startup@AddonManager.jsm:2922:5 < amManager.prototype.observe@addonManager.js:68:9
1461783541507	addons.xpi	WARN	Attempted to remove jid0-dBgF7UkIiOsWqvBng4hYu@jetpack from app-profile but it was already gone
1461783541508	addons.xpi	WARN	Failed to call uninstall for about-addons-memory@sturdiguns.org: Error: Unknown add-on ID about-addons-memory@sturdiguns.org (resource://gre/modules/addons/XPIProvider.jsm:7725:11) JS Stack trace: DirectoryInstallLocation.prototype.getLocationForID@XPIProvider.jsm:7725:11 < this.XPIProvider.processPendingFileChanges@XPIProvider.jsm:3284:31 < this.XPIProvider.checkForChanges@XPIProvider.jsm:3613:19 < this.XPIProvider.startup@XPIProvider.jsm:2685:25 < callProvider@AddonManager.jsm:227:12 < _startProvider@AddonManager.jsm:778:5 < AddonManagerInternal.startup@AddonManager.jsm:962:9 < this.AddonManagerPrivate.startup@AddonManager.jsm:2922:5 < amManager.prototype.observe@addonManager.js:68:9
1461783541508	addons.xpi	WARN	Attempted to remove about-addons-memory@sturdiguns.org from app-profile but it was already gone
1461783541623	Toolkit.Telemetry	TRACE	TelemetryController::observe - profile-after-change notified.
1461783541623	Toolkit.Telemetry	TRACE	TelemetryController::setupTelemetry
1461783541624	Toolkit.Telemetry	TRACE	TelemetryReportingPolicy::setup
1461783541625	Toolkit.Telemetry	CONFIG	TelemetryController::enableTelemetryRecording - canRecordBase:true, canRecordExtended: true
1461783541627	Toolkit.Telemetry	DEBUG	SessionRecorder::Recording last sessions as #7.
1461783541627	Toolkit.Telemetry	DEBUG	SessionRecorder::Resetting prefs from last session.
1461783541628	Toolkit.Telemetry	DEBUG	SessionRecorder::Missing startup field: firstPaint
1461783541628	Toolkit.Telemetry	DEBUG	SessionRecorder::Missing startup field: sessionRestored
1461783541629	Toolkit.Telemetry	TRACE	TelemetrySession::earlyInit
1461783541637	Toolkit.Telemetry	TRACE	TelemetryEnvironment::constructor
1461783541639	Toolkit.Telemetry	TRACE	TelemetryEnvironment::_getGFXData - Only one display adapter detected.
1461783541651	Toolkit.Telemetry	TRACE	TelemetryEnvironment::_updateSearchEngine - isInitialized: false
1461783541652	Toolkit.Telemetry	TRACE	TelemetryEnvironment::_updateSearchEngine - isInitialized: false
1461783541652	Toolkit.Telemetry	TRACE	TelemetryEnvironment::_updateAddons
1461783541655	Toolkit.Telemetry	TRACE	TelemetryEnvironment::registerChangeListener for CrashAnnotator
1461783541842	Toolkit.Telemetry	TRACE	TelemetryEnvironment::observe - aTopic: distribution-customization-complete, aData: null
1461783542026	Toolkit.Telemetry	TRACE	TelemetryEnvironment::_updateAddons: addons differ
1461783542034	Toolkit.Telemetry	TRACE	TelemetryEnvironment::_startWatchingPrefs - [object Map]
1461783542050	Toolkit.Telemetry	TRACE	TelemetryEnvironment::observe - Topic experiments-changed
1461783542050	Toolkit.Telemetry	TRACE	TelemetryEnvironment::_updateAddons
1461783542064	Toolkit.Telemetry	TRACE	TelemetryEnvironment::_updateAddons: addons differ
1461783542064	Toolkit.Telemetry	TRACE	TelemetryEnvironment::_onEnvironmentChange for experiment-changed
1461783542065	Toolkit.Telemetry	DEBUG	TelemetryEnvironment::_onEnvironmentChange - calling CrashAnnotator
1461783542377	Toolkit.Telemetry	TRACE	TelemetryEnvironment::observe - aTopic: browser-search-service, aData: find-jar-engines
1461783542389	Toolkit.Telemetry	TRACE	TelemetryEnvironment::observe - aTopic: browser-search-service, aData: init-complete
1461783542389	Toolkit.Telemetry	TRACE	TelemetryEnvironment::_updateSearchEngine - isInitialized: true
1461783542507	Toolkit.Telemetry	TRACE	TelemetrySession::observe - xul-window-visible notified.
1461783542513	Toolkit.Telemetry	TRACE	TelemetryEnvironment::observe - aTopic: compositor:created, aData: null
1461783542649	Toolkit.Telemetry	TRACE	TelemetrySession::observe - sessionstore-windows-restored notified.
1461783542649	Toolkit.Telemetry	TRACE	TelemetrySession::gatherStartup
1461783542649	Toolkit.Telemetry	TRACE	TelemetrySession::gatherStartupHistograms

(process:29594): GLib-CRITICAL **: g_path_get_basename: assertion 'file_name != NULL' failed
1461783542856	Toolkit.Telemetry	TRACE	TelemetryController::observe - app-startup notified.
1461783542856	Toolkit.Telemetry	CONFIG	TelemetryController::enableTelemetryRecording - canRecordBase:true, canRecordExtended: true
1461783542864	Toolkit.Telemetry	TRACE	TelemetrySession#content::setupContentProcess
1461783542864	Toolkit.Telemetry	TRACE	TelemetrySession#content::gatherStartupHistograms


> - Trigger an environment change - 

Environment changes are sent:

cat report4.json | grep reason
"reason": "environment-change",

> - Check that a shutdown/saved-session ping gets saved -
> - Check that the aborted-session ping gets sent -

aborted and saved sessions are stored and sent to the server:

cat report5.json | grep reason
"reason": "shutdown",

cat report6.json | grep reason
"reason": "saved-session",

cat report7.json | grep reason
"reason": "aborted-session",
"XPIDB_startup_load_reasons": [

cat report8.json | grep reason
"reason": "aborted-session",


> - Check that the deletion ping gets sent -

It does

cat report9.json
{
  "clientId": "85b4f9ab-28f5-4e47-97cc-c9de807a756b",
  "payload": {},
  "application": {
    "vendor": "Mozilla",
    "name": "Firefox",
    "buildId": "20160427192837",
    "platformVersion": "49.0a1",
    "version": "49.0a1",
    "architecture": "x86-64",
    "displayVersion": "49.0a1",
    "xpcomAbi": "x86_64-gcc3",
    "channel": "default"
  },
  "version": 4,
  "creationDate": "2016-04-27T18:43:57.888Z",
  "type": "deletion",
  "id": "472e0306-d420-42e9-99e2-3bf3bcc9465d"
}

(In reply to Alessio Placitelli [:Dexter] from comment #78)
> In addition to the tests in the previous comment, we should also make sure
> that:

> - the search count is available and sane -

Look realistic

      "SEARCH_COUNTS": {
        "google.searchbar": {
          "range": [
            1,
            2
          ],
          "histogram_type": 4,
          "values": {
            "1": 0,
            "0": 1
          },
          "bucket_count": 3,
          "sum": 1
        },
        "google.urlbar": {
          "range": [
            1,
            2
          ],
          "histogram_type": 4,
          "values": {
            "1": 0,
            "0": 1
          },
          "bucket_count": 3,
          "sum": 1
        }
      },

> - subsession chaining works - 

Subsequent "environment_change" pings refer to the same sessionId and and previousSessionId but different subsessionIds. Chaining of subsessions works fine:

ping 15 (first environment change):
      "reason": "environment-change",      
      "subsessionId": "6337a030-f90b-4fdc-a7d5-7d0d8435b13c",
      "previousSubsessionId": "8f68e35c-c39b-4911-906e-bb19cf286566",
      "sessionId": "9b8797f4-2e9c-4a23-8f82-b6d565f59edd",
      "previousSessionId": "adff5b21-754d-40ca-9a26-4fafcf950733",
      "subsessionStartDate": "2016-04-27T00:00:00.0+02:00",

ping 16 (second environment change after ~7 min of inactivity):
      "reason": "environment-change",
      "sessionId": "9b8797f4-2e9c-4a23-8f82-b6d565f59edd",
      "previousSessionId": "adff5b21-754d-40ca-9a26-4fafcf950733",
      "subsessionId": "afe4296f-a45f-4cb1-af43-3cddd5017889",
      "previousSubsessionId": "6337a030-f90b-4fdc-a7d5-7d0d8435b13c",
      "subsessionStartDate": "2016-04-27T00:00:00.0+02:00",

ping 17 (after shutdown): 
      "reason": "shutdown",      
      "sessionId": "9b8797f4-2e9c-4a23-8f82-b6d565f59edd",
      "previousSessionId": "adff5b21-754d-40ca-9a26-4fafcf950733",
      "subsessionId": "11b33d51-d452-41a4-a63f-2cdd34f8fc22",
      "previousSubsessionId": "afe4296f-a45f-4cb1-af43-3cddd5017889",
      "subsessionStartDate": "2016-04-27T00:00:00.0+02:00",

ping 18:
      "reason": "saved-session",
      "sessionId": "9b8797f4-2e9c-4a23-8f82-b6d565f59edd",
      "previousSessionId": "adff5b21-754d-40ca-9a26-4fafcf950733",
      "subsessionId": "11b33d51-d452-41a4-a63f-2cdd34f8fc22",
      "previousSubsessionId": "afe4296f-a45f-4cb1-af43-3cddd5017889",
      "subsessionStartDate": "2016-04-27T00:00:00.0+02:00",

These are the relevant excerpts from the actual pings.

Note that |subsessoinStartDate| does not record time correctly. Is this intended?
Flags: needinfo?(alessio.placitelli)
Excellent job in testing this manually, I'm glad everything works fine. I'm landing this today.

> Note that |subsessoinStartDate| does not record time correctly. Is this
> intended?

Yes, we're intentionally truncating to days (privacy) here: https://dxr.mozilla.org/mozilla-central/rev/fc15477ce628599519cb0055f52cc195d640dc94/toolkit/components/telemetry/TelemetrySession.jsm#1011
Flags: needinfo?(alessio.placitelli)
https://hg.mozilla.org/integration/fx-team/rev/9e90db2262ac9fda4c5aab02c5ed74c163217ba2
Bug 1145188: Shifting TelemetrySession init control to TelemetryController (core). r=Dexter

https://hg.mozilla.org/integration/fx-team/rev/e7570b6d28aaf015f5f69cf3a95f6396bf87e451
Bug 1145188: Shifting TelemetrySession init control to TelemetryController (test). r=Dexter
https://hg.mozilla.org/mozilla-central/rev/9e90db2262ac
https://hg.mozilla.org/mozilla-central/rev/e7570b6d28aa
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla49
Backed this out for frequent Android 4.2 x86 opt Sets(S4) failure in test_TelemetrySession.js

Backouts:
https://hg.mozilla.org/integration/fx-team/rev/0438ffc1f9bb
https://hg.mozilla.org/integration/fx-team/rev/f60f1ab969da

Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=9079780&repo=fx-team
12:07:34  WARNING -  TEST-UNEXPECTED-FAIL | toolkit/components/telemetry/tests/unit/test_TelemetrySession.js | test_simplePing - [test_simplePing : 512] "cc6ff1e4-2bb7-4316-afeb-9ac6f7d36c9b" == "bd314d15-95bf-4356-b682-b6c4a8942202"
12:07:34     INFO -      test_TelemetrySession.js:test_simplePing:512
12:07:34     INFO -      _run_next_test@/mnt/sdcard/tests/xpc/head.js:1540:9
12:07:34     INFO -      do_execute_soon/<.run@/mnt/sdcard/tests/xpc/head.js:692:9
12:07:34     INFO -      _do_main@/mnt/sdcard/tests/xpc/head.js:209:5
12:07:34     INFO -      _execute_test@/mnt/sdcard/tests/xpc/head.js:533:5
12:07:34     INFO -      @-e:1:1
Flags: needinfo?(alessio.placitelli)
Thanks Sebastian! Did the backout resolve the failure?
Status: RESOLVED → REOPENED
Flags: needinfo?(aryx.bugmail)
Resolution: FIXED → ---
Hey Alessio! It seems from the logs that the pending ping from |test_noServerPing| is not deleted properly on Android. So when we enter |test_SimplePing| we first send the pending ping and then the ping with faked session id. So we basically inspect the ping from the previous test not the one we generated in |test_simplePing|. It seems like |yield clearPendingPings();| does not work as expected on Android. Do you have any ideas why that could be?
(In reply to Iaroslav Sheptykin from comment #121)
> Hey Alessio! It seems from the logs that the pending ping from
> |test_noServerPing| is not deleted properly on Android. So when we enter
> |test_SimplePing| we first send the pending ping and then the ping with
> faked session id. So we basically inspect the ping from the previous test
> not the one we generated in |test_simplePing|. It seems like |yield
> clearPendingPings();| does not work as expected on Android. Do you have any
> ideas why that could be?

Hey Iaroslav, this seems to happen also on Linux opt, so it's not restricted to Android (even though it happens more frequently there). It would be interesting to see what ping is being received, so we can trace when it gets generated and pinpoint the problem. I spawned a few try pushes for that. I'll keep you posted here.
Flags: needinfo?(alessio.placitelli)
Looks like this debug 

https://hg.mozilla.org/try/rev/76e7be1dc90f7651f94a7cc8b384194298141588#l15.72

causes a crash:

14:29:38     INFO -  PROCESS | toolkit/components/telemetry/tests/unit/test_TelemetrySession.js | **** DEBUG - simplePing Ping id from test_noServerPing: 342c9c7e-887d-44df-8ef6-d27023c0b335
14:29:38     INFO -  TEST-PASS | toolkit/components/telemetry/tests/unit/test_TelemetrySession.js | test_simplePing - [test_simplePing : 169] typemust be available. - true == true
14:29:38     INFO -  TEST-PASS | toolkit/components/telemetry/tests/unit/test_TelemetrySession.js | test_simplePing - [test_simplePing : 169] idmust be available. - true == true
14:29:38     INFO -  TEST-PASS | toolkit/components/telemetry/tests/unit/test_TelemetrySession.js | test_simplePing - [test_simplePing : 169] creationDatemust be available. - true == true
14:29:38     INFO -  TEST-PASS | toolkit/components/telemetry/tests/unit/test_TelemetrySession.js | test_simplePing - [test_simplePing : 169] versionmust be available. - true == true
14:29:38     INFO -  TEST-PASS | toolkit/components/telemetry/tests/unit/test_TelemetrySession.js | test_simplePing - [test_simplePing : 169] applicationmust be available. - true == true
14:29:38     INFO -  TEST-PASS | toolkit/components/telemetry/tests/unit/test_TelemetrySession.js | test_simplePing - [test_simplePing : 169] payloadmust be available. - true == true
14:29:38     INFO -  TEST-PASS | toolkit/components/telemetry/tests/unit/test_TelemetrySession.js | test_simplePing - [test_simplePing : 172] The ping must have the correct type. - "main" == "main"
14:29:38     INFO -  TEST-PASS | toolkit/components/telemetry/tests/unit/test_TelemetrySession.js | test_simplePing - [test_simplePing : 173] The ping must have the correct version. - 4 == 4
14:29:38     INFO -  TEST-PASS | toolkit/components/telemetry/tests/unit/test_TelemetrySession.js | test_simplePing - [test_simplePing : 177] buildId must have the correct value. - "20160315" == "20160315"
14:29:38     INFO -  TEST-PASS | toolkit/components/telemetry/tests/unit/test_TelemetrySession.js | test_simplePing - [test_simplePing : 177] name must have the correct value. - "XPCShell" == "XPCShell"
14:29:38     INFO -  TEST-PASS | toolkit/components/telemetry/tests/unit/test_TelemetrySession.js | test_simplePing - [test_simplePing : 177] version must have the correct value. - "1" == "1"
14:29:38     INFO -  TEST-PASS | toolkit/components/telemetry/tests/unit/test_TelemetrySession.js | test_simplePing - [test_simplePing : 177] vendor must have the correct value. - "Mozilla" == "Mozilla"
14:29:38     INFO -  TEST-PASS | toolkit/components/telemetry/tests/unit/test_TelemetrySession.js | test_simplePing - [test_simplePing : 177] platformVersion must have the correct value. - "1.9.2" == "1.9.2"
14:29:38     INFO -  TEST-PASS | toolkit/components/telemetry/tests/unit/test_TelemetrySession.js | test_simplePing - [test_simplePing : 177] xpcomAbi must have the correct value. - "noarch-spidermonkey" == "noarch-spidermonkey"
14:29:38     INFO -  TEST-PASS | toolkit/components/telemetry/tests/unit/test_TelemetrySession.js | test_simplePing - [test_simplePing : 183] The application section must have an architecture field. - true == true
14:29:38     INFO -  TEST-PASS | toolkit/components/telemetry/tests/unit/test_TelemetrySession.js | test_simplePing - [test_simplePing : 185] The application section must have a channel field. - true == true
14:29:38     INFO -  TEST-PASS | toolkit/components/telemetry/tests/unit/test_TelemetrySession.js | test_simplePing - [test_simplePing : 189] true == true
14:29:38     INFO -  TEST-PASS | toolkit/components/telemetry/tests/unit/test_TelemetrySession.js | test_simplePing - [test_simplePing : 190] true == true
14:29:38     INFO -  PROCESS | toolkit/components/telemetry/tests/unit/test_TelemetrySession.js | JavaScript strict warning: test_TelemetrySession.js, line 512: ReferenceError: reference to undefined property ping.info
14:29:38     INFO -  PROCESS | toolkit/components/telemetry/tests/unit/test_TelemetrySession.js | *************************
14:29:38     INFO -  PROCESS | toolkit/components/telemetry/tests/unit/test_TelemetrySession.js | A coding exception was thrown and uncaught in a Task.
14:29:38     INFO -  PROCESS | toolkit/components/telemetry/tests/unit/test_TelemetrySession.js | Full message: TypeError: ping.info is undefined
14:29:38     INFO -  PROCESS | toolkit/components/telemetry/tests/unit/test_TelemetrySession.js | Full stack: test_simplePing@test_TelemetrySession.js:512:3
14:29:38     INFO -  PROCESS | toolkit/components/telemetry/tests/unit/test_TelemetrySession.js | _run_next_test@/storage/sdcard/tests/xpc/head.js:1540:9
14:29:38     INFO -  PROCESS | toolkit/components/telemetry/tests/unit/test_TelemetrySession.js | do_execute_soon/<.run@/storage/sdcard/tests/xpc/head.js:692:9
14:29:38     INFO -  PROCESS | toolkit/components/telemetry/tests/unit/test_TelemetrySession.js | _do_main@/storage/sdcard/tests/xpc/head.js:209:5
14:29:38     INFO -  PROCESS | toolkit/components/telemetry/tests/unit/test_TelemetrySession.js | _execute_test@/storage/sdcard/tests/xpc/head.js:533:5
14:29:38     INFO -  PROCESS | toolkit/components/telemetry/tests/unit/test_TelemetrySession.js | @-e:1:1
14:29:38     INFO -  PROCESS | toolkit/components/telemetry/tests/unit/test_TelemetrySession.js | *************************
14:29:38     INFO -  Unexpected exception TypeError: ping.info is undefined at test_TelemetrySession.js:512
copied too much. here is the relevant bit:

14:29:38     INFO -  Unexpected exception TypeError: ping.info is undefined at test_TelemetrySession.js:512
However the other debugs show that we receive different pings:

> **** DEBUG - Ping id from test_noServerPing: e95d5b5c-bc86-4c7c-882b-a4c850c6b6b7
> **** DEBUG - 2nd Ping id from test_noServerPing: 505f94bf-3bb5-4f35-a1da-4b08d3f3c45f
> **** DEBUG - simplePing Ping id from test_noServerPing: 342c9c7e-887d-44df-8ef6-d27023c0b335

The interesting part is the simplePing DEBUG. The log around it looks like this:

> TRACE	TelemetryController::submitExternalPing - ping assembled, id: 6056c648-bb20-4b64-b4bc-3eeb1de4ee3f"
> TRACE	TelemetrySend::submitPing - ping id: 6056c648-bb20-4b64-b4bc-3eeb1de4ee3f"
> INFO	TelemetryReportingPolicy::get dataSubmissionPolicyNotifiedDate - No date stored yet."
> TRACE	TelemetrySend::submitPing - can send pings, trying to send now"
> TRACE	TelemetrySend::Scheduler::triggerSendingPings - active send task: false, immediately: true"
> 
> TRACE	TelemetrySend::Scheduler::_doSendTask iteration"
> TRACE	TelemetrySend::Scheduler::_doSendTask - pending: 0, current: 1"
> TRACE	TelemetrySend::Scheduler::_doSendTask - can send - pending: 0, current: 1"
> TRACE	TelemetrySend::Scheduler::_doSendTask - triggering sending of 0 pings now, 0 pings waiting"
>
>
> TRACE	TelemetrySend::_doPing - server: http://localhost:40104, persisted: false, id: 6056c648-bb20-4b64-b4bc-3eeb1de4ee3f"
> TRACE	TelemetryStorage::savePingToFile - path: /mnt/sdcard/tests/xpc/p/datareporting/archived/2020-02/1580559300000.6056c648-bb20-4b64-b4bc-3eeb1de4ee3f.main.jsonlz4"
> 
> **** DEBUG - simplePing Ping id from test_noServerPing: 6056c648-bb20-4b64-b4bc-3eeb1de4ee3f

Whereas the log for the failing tests look like:

> TRACE	TelemetryController::submitExternalPing - ping assembled, id: 62508346-81e6-49ac-a58f-7915fb0539bb"
> TRACE	TelemetrySend::submitPing - ping id: 62508346-81e6-49ac-a58f-7915fb0539bb"
> INFO	TelemetryReportingPolicy::get dataSubmissionPolicyNotifiedDate - No date stored yet."
> TRACE	TelemetrySend::submitPing - can send pings, trying to send now"
> TRACE	TelemetrySend::Scheduler::triggerSendingPings - active send task: true, immediately: true"
> 
> TRACE	TelemetryStorage::_purgeOldPings"
> 
> TRACE	TelemetryStorage::savePingToFile - path: /mnt/sdcard/tests/xpc/p/datareporting/archived/2020-02/1580559300000.62508346-81e6-49ac-a58f-7915fb0539bb.main.jsonlz4"
> INFO	TelemetrySend::_doPing - successfully loaded, status: 200"
> TRACE	TelemetrySend::_onPingRequestFinished - success: true, persisted: true"
> 
> TRACE	TelemetryStorage::isDeletionPing - id: 0f59a680-6deb-46ac-a85f-d6028f6eb49e"
> TRACE	TelemetryStorage::removePendingPing - deleting ping with id: 0f59a680-6deb-46ac-a85f-d6028f6eb49e, path: /mnt/sdcard/tests/xpc/p/saved-telemetry-pings/0f59a680-6deb-46ac-a85f-d6028f6eb49e"
> *** CRASH ***

The first 4 messages are basically the same. Only ping ids differ. Here it becomes interesting:

> TRACE	TelemetrySend::Scheduler::triggerSendingPings - active send task: false, immediately: true" (debug)

vs.

> TRACE	TelemetrySend::Scheduler::triggerSendingPings - active send task: true, immediately: true" (failed-log)

Looks like when the tests fail we try to send a ping when another ping is being sent. In "*** DEBUG" log the ping is sent:

> TRACE	TelemetrySend::_doPing - server: http://localhost:40104, persisted: false, id: 6056c648-bb20-4b64-b4bc-3eeb1de4ee3f"

In failed log the ping is first persisted:

> TRACE	TelemetryStorage::savePingToFile - path: /mnt/sdcard/tests/xpc/p/datareporting/archived/2020-02/1580559300000.62508346-81e6-49ac-a58f-7915fb0539bb.main.jsonlz4"

and then we receive for the blocking send operation:

> INFO	TelemetrySend::_doPing - successfully loaded, status: 200"

To me it looks like |yield sendPing| races with another running send. I haven't yet figured out which one and why.
In the case of the crash the concurrently running ping seems to be coming from |test_noServerPing|

> INFO -  toolkit/components/telemetry/tests/unit/test_TelemetrySession.js | Starting test_noServerPin
> ...
> TRACE	TelemetryController::submitExternalPing - ping assembled, id: 0f59a680-6deb-46ac-a85f-d6028f6eb49e"
> TRACE	TelemetrySend::submitPing - ping id: 0f59a680-6deb-46ac-a85f-d6028f6eb49e"
> ...
> TRACE	TelemetryStorage::savePingToFile - path: /mnt/sdcard/tests/xpc/p/datareporting/archived/2016-04/1461956709945.0f59a680-6deb-46ac-a85f-d6028f6eb49e.main.jsonlz4"
> ...
> INFO -  (xpcshell/head.js) | test test_noServerPing finished (3)
> INFO -  toolkit/components/telemetry/tests/unit/test_TelemetrySession.js | Starting test_simplePing
> ...
> TRACE	TelemetrySend::_doPing - server: http://doesnotexist, persisted: false, id: 0f59a680-6deb-46ac-a85f-d6028f6eb49e"
> ...
> Toolkit.Telemetry	ERROR	TelemetrySend::_doPing - error making request to http://doesnotexist/submit/telemetry/0f59a680-6deb-46ac-a85f-d6028f6eb49e/main/XPCShell/1/default/20160315?v=4: error" {file: "resource://gre/modules/Log.jsm" line: 753}]
> ...
> INFO	TelemetrySend::sendPings - ping 0f59a680-6deb-46ac-a85f-d6028f6eb49e not sent, saving to disk: {"isTrusted":true}"
> TRACE	TelemetryStorage::savePingToFile - path: /mnt/sdcard/tests/xpc/p/saved-telemetry-pings/0f59a680-6deb-46ac-a85f-d6028f6eb49e"
> ...
> TelemetryStorage::savePendingPing - saved ping with id 0f59a680-6deb-46ac-a85f-d6028f6eb49e"
> ...
> TelemetryStorage::loadPendingPing - id: 0f59a680-6deb-46ac-a85f-d6028f6eb49e"
> ...
> TelemetrySend::_doPing - server: http://localhost:54681, persisted: true, id: 0f59a680-6deb-46ac-a85f-d6028f6eb49e
> ...
> INFO	TelemetrySend::_doPing - successfully loaded, status: 200"

from: http://archive.mozilla.org/pub/mobile/tinderbox-builds/fx-team-android-x86/1461829789/fx-team_ubuntu64_hw_test-androidx86-set-4-bm104-tests1-linux-build1088.txt.gz
(In reply to Iaroslav Sheptykin from comment #128)
> However the other debugs show that we receive different pings:
> 
> > **** DEBUG - Ping id from test_noServerPing: e95d5b5c-bc86-4c7c-882b-a4c850c6b6b7
> > **** DEBUG - 2nd Ping id from test_noServerPing: 505f94bf-3bb5-4f35-a1da-4b08d3f3c45f
> > **** DEBUG - simplePing Ping id from test_noServerPing: 342c9c7e-887d-44df-8ef6-d27023c0b335
> 

So, I added a few more debug messages here: https://treeherder.mozilla.org/#/jobs?repo=try&revision=12f387b8cb47&selectedJob=20365984
The interesting failure log is: http://archive.mozilla.org/pub/mobile/try-builds/alessio.placitelli@gmail.com-12f387b8cb47d56231bdd8d6ec627d811ea2b318/try-android-x86/try_ubuntu64_hw_test-androidx86-set-4-bm104-tests1-linux-build1368.txt.gz

The relevant bits:

1. **** DEBUG - Ping id from test_noServerPing: e5f1cae0-e7d9-4c9b-bdd5-752962f38202
2. **** DEBUG - 2nd Ping id from test_noServerPing: ae78248b-de83-44c5-8570-2dcd423a848c
3. **** DEBUG - simplePing Ping id from test_noServerPing: 18ad35a4-f9bc-4be6-bd7e-3b0f869d0fb0
4. **** DEBUG - Recvd ping id: ae78248b-de83-44c5-8570-2dcd423a848c
5. **** DEBUG - Recvd ping reason test_noServerPing: test-ping

And then:

13:35:05  WARNING -  TEST-UNEXPECTED-FAIL | toolkit/components/telemetry/tests/unit/test_TelemetrySession.js | test_simplePing - [test_simplePing : 517] "26abe548-0791-4889-8091-6b1c49583af5" == "bd314d15-95bf-4356-b682-b6c4a8942202"

As you imagined, we're sometimes receiving at least one ping from the previous |test_noServerPing| (see 2 and 4). |yield sendPing| waits until the ping is generated and eventually handed off to the send scheduler [1] or persisted to disk if we cannot send pings right now. With |test_noServerPing| we test are in the first case, as we can send pings but the server is not there.

Since we failed sending the pings, they get persisted to disk (pending pings). The second ping |test_noServerPing| happens to be persisted almost every time before we restart TelemetryController in |test_simplePing| (and thus load the list of pending pings!), but after we clear the pending pings.

> The first 4 messages are basically the same. Only ping ids differ. Here it
> becomes interesting:
> 
> > TRACE	TelemetrySend::Scheduler::triggerSendingPings - active send task: false, immediately: true" (debug)
> 
> vs.
> 
> > TRACE	TelemetrySend::Scheduler::triggerSendingPings - active send task: true, immediately: true" (failed-log)
> 
> Looks like when the tests fail we try to send a ping when another ping is
> being sent. In "*** DEBUG" log the ping is sent:

The thing that the active send task is still kicking there is part of the issue. We should probably wait for the ping activity to settle/shutdown in |test_noServerPing| before moving on to the next test.

[1] - https://dxr.mozilla.org/mozilla-central/rev/369a5ee3a2880a4a98df3a00bf3db8d8f36b181b/toolkit/components/telemetry/TelemetrySend.jsm#710
Attached patch WIP: fixing cleanPendingPings (obsolete) — Splinter Review
Hi Alessio!

I would like to share my WIP on cleanPendingPings with you. I haven't built and tested this patch yet but it shows (hopefully) the direction I am taking. Could you give me a quick feedback if this looks reasonable and if I don't miss anything?
Attachment #8751944 - Flags: feedback?(alessio.placitelli)
Comment on attachment 8751944 [details] [diff] [review]
WIP: fixing cleanPendingPings

Review of attachment 8751944 [details] [diff] [review]:
-----------------------------------------------------------------

Mh, maybe we can just make |testClearPendingPings| call |TelemetryStorage.runRemovePendingPingTask|? Sorry, I just noticed that.
As far as I can tell this doesn't check for shutdown, so we should be good there.
Attachment #8751944 - Flags: feedback?(alessio.placitelli)
(In reply to Alessio Placitelli [:Dexter] from comment #133)
> runRemovePendingPingTask:
> https://dxr.mozilla.org/mozilla-central/rev/
> c3f5e6079284a7b7053c41f05d0fe06ff031db03/toolkit/components/telemetry/
> TelemetryStorage.jsm#1394

Yes this is exactly what we need. Thanks for pointing me to this.
Attachment #8745338 - Attachment is obsolete: true
Attachment #8751944 - Attachment is obsolete: true
Attachment #8752259 - Flags: review?(alessio.placitelli)
Attachment #8742137 - Attachment is obsolete: true
Attachment #8752260 - Flags: review?(alessio.placitelli)
(In reply to Iaroslav Sheptykin from comment #135)
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=17677bbc30e7

The changeset here is basically the same, just split over 3 patches.
Comment on attachment 8752260 [details] [diff] [review]
Shifting TelemetrySession init control to TelemetryController (core)

Review of attachment 8752260 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good with the nit addressed.

::: toolkit/components/telemetry/TelemetryStorage.jsm
@@ +1768,5 @@
>      for (let f of FILES_TO_REMOVE) {
>        yield OS.File.remove(f, {ignoreAbsent: true})
>                     .catch(e => this._log.error("removeFHRDatabase - failed to remove " + f, e));
>      }
> +  })

nit: any particular reason this line was changed?
Attachment #8752260 - Flags: review?(alessio.placitelli) → review+
Attachment #8752259 - Flags: review?(alessio.placitelli) → review+
Added the comma back.
Attachment #8752260 - Attachment is obsolete: true
Attachment #8752833 - Flags: review+
Hey Alessio!

I addressed comment 139, updated the patch and carried r+. I checked the tests and I don't see any signs of failed telemetry tests. I think it makes sense to double check if we got rid of failing test_SimplePing on linux and android. I pushed the changeset to the tryserver as you did in comment 124:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=0e84a5f5d512

If this looks green we could try to land the patches again.
Tests seem good. Should we try to land this again?
Flags: needinfo?(alessio.placitelli)
Comment on attachment 8752833 [details] [diff] [review]
Shifting TelemetrySession init control to TelemetryController (core)

Georg, what do you think about the proposed solution? The only change to this patch, compared to the previous version, is the addition of |testClearPendingPings| to TelemetryStorage, which is now being used in the tests instead of |clearPendingPings|.

The tests look hulk-green.
Flags: needinfo?(alessio.placitelli)
Attachment #8752833 - Flags: feedback?(gfritzsche)
(In reply to Alessio Placitelli [:Dexter] from comment #144)

> Georg, what do you think about the proposed solution? The only change to
> this patch, compared to the previous version, is the addition of
> |testClearPendingPings| to TelemetryStorage, which is now being used in the
> tests instead of |clearPendingPings|.

Here is the diff:
https://hg.mozilla.org/try/rev/29f801ee188f00d16f06afc77463b470ddc48b7d
Comment on attachment 8752833 [details] [diff] [review]
Shifting TelemetrySession init control to TelemetryController (core)

Review of attachment 8752833 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good to me, cheers!
Attachment #8752833 - Flags: feedback?(gfritzsche) → feedback+
https://hg.mozilla.org/integration/fx-team/rev/45cfc5cfe6314077e969bbe3e3b8ab84d32d41c1
Bug 1145188: Shifting TelemetrySession init control to TelemetryController (core). r=Dexter

https://hg.mozilla.org/integration/fx-team/rev/036f62007472d4e2ada1564718b617f60cd00d43
Bug 1145188: Shifting TelemetrySession init control to TelemetryController (test). r=Dexter
Duplicate of this bug: 1160816
You need to log in before you can comment on or make changes to this bug.