Closed Bug 1261494 Opened 8 years ago Closed 8 years ago

Intermittent testUnifiedTelemetryClientId | Client id file does not exist yet - or Retrieving clientId from JS has timed out -

Categories

(Firefox for Android Graveyard :: Metrics, defect)

defect
Not set
normal

Tracking

(firefox48 wontfix, firefox49 fixed, firefox50 fixed)

RESOLVED FIXED
Firefox 50
Tracking Status
firefox48 --- wontfix
firefox49 --- fixed
firefox50 --- fixed

People

(Reporter: KWierso, Assigned: mcomella)

References

Details

(Keywords: intermittent-failure)

Attachments

(1 file, 2 obsolete files)

https://treeherder.mozilla.org/logviewer.html#?job_id=8345475&repo=fx-team
Summary: Intermittent testUnifiedTelemetryClientId | Client id file does not exist yet - → Intermittent testUnifiedTelemetryClientId | Client id file does not exist yet - or Retrieving clientId from JS has timed out -
Shucks.

I had difficulty with "Retrieving clientId" when I created the test (I increased the timeout from 2000ms to 20000ms) – maybe I should increase it more? Emulators are rough! :(

Without looking at the code, I don't know what could be causing Client id file does not exist yet because, afaik, we remove the client ID file before each test, but it's possible this file is just left over.

Then again, maybe this file is now being created by the startup code in the application (e.g. bug 1255657 moved this earlier in the startup process), before the test can run.
Depends on: 1249491
See Also: → 1261596
Assignee: nobody → michael.l.comella
Like intermittent bug 1268998 comment 6, I wonder if this could also be related to running the uploader in automation – soft dep.
Depends on: 1270191
re comment 13, backed up by the logs:

16:54:34     INFO -  05-25 16:51:17.342 D/GeckoTelemetryUploadSer( 2860): Service started
16:54:34     INFO -  05-25 16:51:18.220 I/Robocop ( 2860): {"message":"","time":1464220278202,"source":"robocop","status":"PASS","test":"testUnifiedTelemetryClientId","thread":null,"subtest":"The toolbar is not in the editing state","action":"test_status","pid":null}
...
16:54:34     INFO -  05-25 16:51:24.731 D/GeckoTelemetryUploadSer( 2860): Telemetry upload success!
16:54:34     INFO -  05-25 16:51:24.771 D/GeckoTelemetryUploadSer( 2860): Service finished: upload and prune attempts completed
16:54:34     INFO -  05-25 16:51:25.701 I/Robocop ( 2860): {"message":"","time":1464220285681,"source":"robocop","status":"PASS","test":"testUnifiedTelemetryClientId","thread":null,"subtest":"Waiting for UrlEditText to be input method target.","action":"test_status","pid":null}
I expect this to go away with the landing of bug 1270191 – please reopen if not so.

The failure rate is low enough that I don't think we need to uplift to aurora.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → DUPLICATE
The logs indicate the telemetry upload service is now disabled in automation, however, I realized that I don't know the Gecko side of things.

Georg, do you know if Gecko telemetry could be creating the client ID file during tests in automation? If you don't know, is there any logging output associated with it I could search the logs (e.g. [1]) for?

[1]: https://mozilla-releng-blobs.s3.amazonaws.com/blobs/mozilla-inbound/sha512/738bd6b78f096f1d78ff07cd482fc21d586d13a7870696eba59eef240ff3f89e8644ec461c7835a7d2896d9082b2a0aecbea64703d36ca25ebe1701b3075505f
Flags: needinfo?(gfritzsche)
(In reply to Michael Comella (:mcomella) from comment #18)
> The logs indicate the telemetry upload service is now disabled in
> automation, however, I realized that I don't know the Gecko side of things.
> 
> Georg, do you know if Gecko telemetry could be creating the client ID file
> during tests in automation?

Yes, if it doesn't exist yet we would create the client id on the first request:
https://dxr.mozilla.org/mozilla-central/rev/3e8ee3599a67edd971770af4982ad4b0fe77f073/toolkit/modules/ClientID.jsm#128
Flags: needinfo?(gfritzsche)
(In reply to Georg Fritzsche [:gfritzsche] from comment #24)
> > Georg, do you know if Gecko telemetry could be creating the client ID file
> > during tests in automation?
> 
> Yes, if it doesn't exist yet we would create the client id on the first
> request:
> https://dxr.mozilla.org/mozilla-central/rev/
> 3e8ee3599a67edd971770af4982ad4b0fe77f073/toolkit/modules/ClientID.jsm#128

Since we're testing this functionality directly, is it possible to disable the other background processes during the test that might attempt to create a client ID?

Looking at my code, since I don't know the calling code, my (likely poor) solution would be to:
 1) Disable ClientID.getClientID
 2) Add a second method with a boolean override (e.g. ClientID.getClientID(underTesting)) which we can call from testUnifiedTelemetryClientId [1]

[1]: https://dxr.mozilla.org/mozilla-central/rev/3e8ee3599a67edd971770af4982ad4b0fe77f073/mobile/android/tests/browser/robocop/testUnifiedTelemetryClientId.js#18
Flags: needinfo?(gfritzsche)
Comment on attachment 8759864 [details]
Bug 1261494 - Reduce telemetry init delay to 1 second for integration testing.

https://reviewboard.mozilla.org/r/57700/#review54796
Attachment #8759864 - Flags: review?(gkruglov) → review+
Comment on attachment 8759865 [details]
Bug 1261494 - Add assertions & logging to resetTest for sanity checking.

https://reviewboard.mozilla.org/r/57702/#review54862

::: mobile/android/tests/browser/robocop/src/org/mozilla/gecko/tests/testUnifiedTelemetryClientId.java:57
(Diff revision 1)
> +
>          if (resetJSCache) {
>              resetJSCache();
>          }
> -        getClientIdFile().delete();
> -        getFHRClientIdFile().delete();
> +        for (final File file : filesToDeleteOnReset) {
> +            file.delete();

Might also be useful to check return value of the `delete` call, but perhaps it doesn't matter since you're checking `exists` value right after.
Attachment #8759865 - Flags: review?(gkruglov) → review+
Comment on attachment 8759866 [details]
Bug 1261494 - Add logging before each test to help debug intermittent.

https://reviewboard.mozilla.org/r/57704/#review54864
Attachment #8759866 - Flags: review?(gkruglov) → review+
Georg suggested trying to prevent Gecko from creating new client ID files through method ordering: Gecko caches the client ID so if we preload the client ID into the cache, it won't try to create a new file. As such, we can keep the cache full until we decide it's time for Gecko to read/create a new client ID file.

There is still an opportunity for Gecko to attempt to access the client ID file before we can call this method, however, it greatly reduces the it.
Flags: needinfo?(gfritzsche)
I'm going to land the patches I wrote for this bug (so far) because they add helpful logging but I may end up undoing some of them as I implement comment 29.
Comment on attachment 8759864 [details]
Bug 1261494 - Reduce telemetry init delay to 1 second for integration testing.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/57700/diff/1-2/
Attachment #8759864 - Attachment description: Bug 1261494 - Remove outdated comment regarding disabling upload service. → Bug 1261494 - Block Gecko from screwing with the client ID file by resetting the cache at the last minute.
Attachment #8759864 - Flags: review+ → review?(gkruglov)
Comment on attachment 8759864 [details]
Bug 1261494 - Reduce telemetry init delay to 1 second for integration testing.

https://reviewboard.mozilla.org/r/57700/#review55802

That resetJSCache method is interesting... LGTM, hacky stuff though.
Attachment #8759864 - Flags: review?(gkruglov) → review+
https://hg.mozilla.org/integration/fx-team/rev/370eaabb204751e618b049d7cbc0cc8298d7b2c9
Bug 1261494 - Block Gecko from screwing with the client ID file by resetting the cache at the last minute. r=grisha
:mcomella -- This is still a very frequent failure. Can it be skipped?
Flags: needinfo?(michael.l.comella)
(In reply to Geoff Brown [:gbrown] from comment #45)
> :mcomella -- This is still a very frequent failure. Can it be skipped?

No – we use it to ensure we can correctly retrieve the client ID, Java <-> JS, which we can't do with a unit test. I'll prioritize fixing this.

Also:
(In reply to Georg Fritzsche [:gfritzsche] from bug 1261596 comment #9)
> Michael, can you tell whats going on here?
> Could this be fallout from bug 1233986?
Flags: needinfo?(michael.l.comella)
It looks like the increase in failures started around 6/26 & 6/29 (comment 39 & comment 40).

The patch in this bug landed on 6/12 (comment 37) so it's unlikely to have negatively affected the failure rate [1].

The patch in bug 1233986 landed several times but was backed out, finally landing on 6/28 (bug 1233986 comment 54), which is the middle of the range I saw above for the increase in failures. As the most likely cause, I'll take a look at the code and see what could have affected this test.

[1]: Tangentially, the patch in this bug appears to have reduced the failure rate (prior to the 6/26-6/29 increase) but not have eliminated it entirely, like I expected it to.
Looking at the errors, it looks like we timeout retrieving the client ID the first time [1].

My biggest lead is:
 1) Gecko retrieves the client ID and registers its `.then(...)` callbacks
 2) We attempt to retrieve the client ID by registering our `.then(...)` callbacks
 3) The client ID is retrieved & the promises resolve. Since promises resolve in the order they were registered, Gecko attempts upload
 4) Gecko's upload timeout (90s [2]) is longer than our "retrieve client ID from JS" timeout (20s) so we fail.

bug 1233986 moved Gecko telemetry upload to occur earlier during startup so its more likely to run into this 

Georg, do we attempt to upload during automation? Specifically the Android rocobop tests.

I'll drop a try run with an increased Java timeout (120s?) to see if this appears to fix the issue.

[1]: https://dxr.mozilla.org/mozilla-central/rev/88bebcaca249aeaca9197382e89d35b02be8292e/mobile/android/tests/browser/robocop/src/org/mozilla/gecko/tests/testUnifiedTelemetryClientId.java#77
[2]: https://dxr.mozilla.org/mozilla-central/rev/88bebcaca249aeaca9197382e89d35b02be8292e/toolkit/components/telemetry/TelemetrySend.jsm#68
Flags: needinfo?(gfritzsche)
Georg, also would the JS _log.trace statements appear in the test output? I can't seem to find any – I'm specifically looking for those TelemetrySend calls.

If not, is there a way to enable them or add my own log statements?
(In reply to Michael Comella (:mcomella) from comment #59)
> Georg, also would the JS _log.trace statements appear in the test output? I
> can't seem to find any – I'm specifically looking for those TelemetrySend
> calls.
> 
> If not, is there a way to enable them or add my own log statements?

I don't know where the testing prefs are set for those Fennec tests, but you enable this by setting e.g. "toolkit.telemetry.log.level" to "Trace".
Flags: needinfo?(gfritzsche)
(In reply to Michael Comella (:mcomella) from comment #57)
> Looking at the errors, it looks like we timeout retrieving the client ID the
> first time [1].
> 
> My biggest lead is:
>  1) Gecko retrieves the client ID and registers its `.then(...)` callbacks
>  2) We attempt to retrieve the client ID by registering our `.then(...)`
> callbacks
>  3) The client ID is retrieved & the promises resolve. Since promises
> resolve in the order they were registered, Gecko attempts upload
>  4) Gecko's upload timeout (90s [2]) is longer than our "retrieve client ID
> from JS" timeout (20s) so we fail.
> 
> bug 1233986 moved Gecko telemetry upload to occur earlier during startup so
> its more likely to run into this 

It should just have changed to submitting "too early" pings with the client-id instead of without, although it could have changed the timing a little bit.

> Georg, do we attempt to upload during automation? Specifically the Android
> rocobop tests.

It is possible, although we shouldn't.
We tried to set the "toolkit.telemetry.server" pref to a dummy address for the tests we found out about:
https://dxr.mozilla.org/mozilla-central/search?q=toolkit.telemetry.server&redirect=false
(In reply to Georg Fritzsche [:gfritzsche] from comment #62)
> (In reply to Michael Comella (:mcomella) from comment #59)
> > Georg, also would the JS _log.trace statements appear in the test output? I
> > can't seem to find any – I'm specifically looking for those TelemetrySend
> > calls.
> > 
> > If not, is there a way to enable them or add my own log statements?
> 
> I don't know where the testing prefs are set for those Fennec tests, but you
> enable this by setting e.g. "toolkit.telemetry.log.level" to "Trace".

Geoff, do you know how I might enable this about:config setting during a try run?
Flags: needinfo?(gbrown)
(In reply to Michael Comella (:mcomella) from comment #61)
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=f3fc4a3de8be

fyi, this push was invalidated because of bug 1286942.

Waiting on bug 1286942 to be fixed in order to push to try and fix this.
Depends on: 1286942
My push from comment 69 made no functional changes and failed this test 7/7 times (+1 infra issue, +1 still running).

My push from comment 58 added a timeout longer than the telemetry upload time and passed this test 13/13 times (+1 infra issue).

I have a high suspicion this is related to attempting to upload during the test. I will try to add some logging (via comment 66) to confirm this suspicion.
(In reply to Michael Comella (:mcomella) from comment #70)
> My push from comment 58 added a timeout longer than the telemetry upload
> time and passed this test 13/13 times (+1 infra issue).

In the worst case (or if someone wants this fixed right now), we can increase the timeout so this no longer fails in automation and test fixes via try.

I'll have more time to dig in tomorrow.
Looking at the logcat from the first failure from the push from comment 71 [1], submitPing does not run until after the test has already timed-out and failed.

However, I notice that Telemetry's deferred init runs in between when we initially request the clientID from JS and when we timeout. The getClientID method seems to return correctly here because the log statements indicate the methods before & after it run. Perhaps the telemetry init is taking too long and, given JS' single-threaded nature, is blocking us from getting the client ID.

My push in comment 75 increases the delayed init timeout so hopefully Java gets control of the JS thread sooner.

[1]: https://mozilla-releng-blobs.s3.amazonaws.com/blobs/try/sha512/870f1bd220acef368e6be70fb47a75b83beb9cbfb60399e33d601e5536ca1fa0f5a6b5a880b2cd10bcca49a5b95b92dbdee371952844ca7efa05cf4c2aa05854
My last push (comment 81) added log statements [1] to say how long it takes to retrieve the client ID from JS. In both runs, it takes 30 seconds to retrieve it initially and between 0 & 1 seconds to retrieve it each subsequent time.

Georg, do you know of any timeouts that could cause the client ID to take 30 seconds to retrieve? e.g. upload timeout, Gecko load timeout, etc.

If not, I'm just going to increase the timeout to 35 seconds because context-switching between my context graph work and fennec work is really expensive.

[1]: Look for the logtag "lol"
Flags: needinfo?(gfritzsche)
This sounds like we actually hit the Telemetry initialization delay in the test.
We delay it to 60 seconds after "profile-after-change", maybe the other 30sec already passed waiting on other operations before you timed it?

There are two ways around this:
* directly initializing Telemetry to trigger "testing" behavior: 
  https://dxr.mozilla.org/mozilla-central/rev/462dc6b44adb4573e8ce6b0dd688c206ebd516f7/toolkit/components/telemetry/tests/unit/test_TelemetryController.js#109
* setting the pref "toolkit.telemetry.initDelay" to a reasonable value:
  http://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/preferences.html#testing
Flags: needinfo?(gfritzsche)
(In reply to Georg Fritzsche [:gfritzsche] from comment #84)
> This sounds like we actually hit the Telemetry initialization delay in the
> test.

This appears to be the issue – my latest push reduced the telemetry initialization delay to one second and it took 0 seconds to retrieve the client ID the first time (as opposed to the duration comment 82). I wonder if that's why this test has always had intermittent issues.

To second guess our assumptions, why does retrieving a client ID, which I'd expect to have no (or few) dependencies, require a general "telemetry init" to occur?

Ignoring that for now, the quick solutions would be:

(In reply to Georg Fritzsche [:gfritzsche] from comment #84)
> * setting the pref "toolkit.telemetry.initDelay" to a reasonable value:

1) I would prefer to do this – telemetry will automatically be ready whenever anyone wants to test it & it's simpler than sending a message to JS. However, could this affect startup time for autophone tests? Will it noticeably affect the runtime of our robocop test suites?

And using:

> * directly initializing Telemetry to trigger "testing" behavior:

2) We could do this for each test that needs to test telemetry but it seems unreasonable to force people to remember to initialize telemetry for each test.

3) We can make this call during BaseTest or UITest initialization, which would prevent it from affecting autophone & friends (from #1), but could still affect the overall robocop runtime duration.

Geoff, do you have an opinion?
Flags: needinfo?(gbrown)
(In reply to Michael Comella (:mcomella) from comment #86)
> To second guess our assumptions, why does retrieving a client ID, which I'd
> expect to have no (or few) dependencies, require a general "telemetry init"
> to occur?

Then again, while I feel this could be worthwhile to do, we'd also run into this issue again if anyone else tries to write a integration test for telemetry so we might as well fix the problem generically for all telemetry code.
(In reply to Michael Comella (:mcomella) from comment #86)
> Geoff, do you have an opinion?

I trust your judgement. 

Personally, I would go with option #1, setting the pref for all robocop tests, as you did in your try push. There is very little use of robocop in autophone, and currently there are no autophone performance tests that use robocop. I'd check on the robocop test suite run time, but unless the difference is significant, I think it will be fine.
Flags: needinfo?(gbrown)
Comment on attachment 8759864 [details]
Bug 1261494 - Reduce telemetry init delay to 1 second for integration testing.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/57700/diff/2-3/
Attachment #8759864 - Attachment description: Bug 1261494 - Block Gecko from screwing with the client ID file by resetting the cache at the last minute. → Bug 1261494 - Reduce telemetry init delay to 1 second for integration testing.
Attachment #8759864 - Flags: review+ → review?(gbrown)
Comment on attachment 8759864 [details]
Bug 1261494 - Reduce telemetry init delay to 1 second for integration testing.

https://reviewboard.mozilla.org/r/57700/#review64670
Attachment #8759864 - Flags: review?(gbrown) → review+
Comment on attachment 8759864 [details]
Bug 1261494 - Reduce telemetry init delay to 1 second for integration testing.

Approval Request Comment
[Feature/regressing bug #]: initially implemented w/ small amount of intermittents, other bugs made it worse
[User impact if declined]: Users are unaffected but we will see a lot of intermittents
[Describe test coverage new/current, TreeHerder]: Tested locally.
[Risks and why]: Extremely low. We change an about:config preference before running robocop tests. In the most likely worst case, the additional processing time could slow down the running of our robocop test suites. However, in my testing, I did not see any noticeable impact.

[String/UUID change made/needed]: None
Attachment #8759864 - Flags: approval-mozilla-aurora?
Correction:

(In reply to Michael Comella (:mcomella) from comment #93)
> [Describe test coverage new/current, TreeHerder]: Tested locally.

And on try (comment 88).
Failures on beta & aurora seem uncommon so we don't need to land this change there, but it could be nice.

Since merge is soon, I did not flag for uplift to 48 (even though we don't change any shipping code).
(In reply to Michael Comella (:mcomella) from comment #86)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #84)
> > This sounds like we actually hit the Telemetry initialization delay in the
> > test.
> 
> This appears to be the issue – my latest push reduced the telemetry
> initialization delay to one second and it took 0 seconds to retrieve the
> client ID the first time (as opposed to the duration comment 82). I wonder
> if that's why this test has always had intermittent issues.
> 
> To second guess our assumptions, why does retrieving a client ID, which I'd
> expect to have no (or few) dependencies, require a general "telemetry init"
> to occur?

We just trigger the client id as part of the general telemetry init.
We don't want to hit the disk early to not impact startup performance.

> Ignoring that for now, the quick solutions would be:
> 
> (In reply to Georg Fritzsche [:gfritzsche] from comment #84)
> > * setting the pref "toolkit.telemetry.initDelay" to a reasonable value:
> 
> 1) I would prefer to do this – telemetry will automatically be ready
> whenever anyone wants to test it & it's simpler than sending a message to
> JS. However, could this affect startup time for autophone tests? Will it
> noticeably affect the runtime of our robocop test suites?

If this startup performance hit is fine, that sounds like the easiest option.
If startup perf is a concern here, maybe you can set prefs per-test instead before starting Gecko?
https://hg.mozilla.org/mozilla-central/rev/7c1ddc822381
Status: REOPENED → RESOLVED
Closed: 8 years ago8 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 50
Comment on attachment 8759864 [details]
Bug 1261494 - Reduce telemetry init delay to 1 second for integration testing.

Fix an intermittent, taking it.
Attachment #8759864 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
(In reply to Georg Fritzsche [:gfritzsche] from comment #97)
> > To second guess our assumptions, why does retrieving a client ID, which I'd
> > expect to have no (or few) dependencies, require a general "telemetry init"
> > to occur?
> 
> We just trigger the client id as part of the general telemetry init.
> We don't want to hit the disk early to not impact startup performance.

To be explicit, the issue I'm having is that when I try to access the client ID without changing the telemetry init delay, the ID takes 30 seconds the first time to access [1]. When I change the telemetry init delay to 1 second, it takes 1 second to access the client ID the first time. The issue presumably occurs because telemetry is not initialized yet.

To me, this is unexpected because retrieving the client ID doesn't appear to have any dependencies on waiting for telemetry to be initialized first.

If I had to guess, I'd think this is an issue in the way we spawn and handle the async tasks or promises to init telemetry & access the client ID. For example, an init telemetry task initially tries to access the client ID, e.g.:

ClientID.getClientID().then(function (clientID) {
  // Do some stuff
});

Since we also supply a method to `Promise.then` and these functions run sequentially, `// Do some stuff` has to finish running before my code, which accesses the client ID in the Java integration test, is allowed to run. While I saw no issues with the Promise code specifically that would cause this to be an issue, I don't know how the async task code, yields, etc. would affect the Promise resolution.

For example, does the async task code register Promise handlers when the async tasks are run, or when they're read by the interpreter? If it's the latter, the handler added by telemetry init could be registered prior to the Java test handler, blocking it from running until the telemetry init handler is run, which is not until telemetry init occurs.

In another possible case, how does promise resolution work with yield (e.g. how we frequently access the client ID in the telemetry code [2])? Does the code after the `yield` statement all have to run before the next promise handler is called? If so, in my provided sample, we'd have to wait for the upload attempt to finish before we can return (and thus run the Java test handler). 

[1]: Subsequent accesses take about 1 second.
[2]: https://dxr.mozilla.org/mozilla-central/rev/ddeb0295df692695b36295177d6790e5393e1f9a/toolkit/components/telemetry/TelemetryController.jsm#463
Product: Firefox for Android → Firefox for Android Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: