Closed
Bug 1261494
Opened 9 years ago
Closed 8 years ago
Intermittent testUnifiedTelemetryClientId | Client id file does not exist yet - or Retrieving clientId from JS has timed out -
Categories
(Firefox for Android Graveyard :: Metrics, defect)
Firefox for Android Graveyard
Metrics
Tracking
(firefox48 wontfix, firefox49 fixed, firefox50 fixed)
RESOLVED
FIXED
Firefox 50
People
(Reporter: KWierso, Assigned: mcomella)
References
Details
(Keywords: intermittent-failure)
Attachments
(1 file, 2 obsolete files)
58 bytes,
text/x-review-board-request
|
gbrown
:
review+
Sylvestre
:
approval-mozilla-aurora+
|
Details |
Reporter | ||
Comment 1•9 years ago
|
||
Summary: Intermittent testUnifiedTelemetryClientId | Client id file does not exist yet - → Intermittent testUnifiedTelemetryClientId | Client id file does not exist yet - or Retrieving clientId from JS has timed out -
Assignee | ||
Comment 2•9 years ago
|
||
Shucks.
I had difficulty with "Retrieving clientId" when I created the test (I increased the timeout from 2000ms to 20000ms) – maybe I should increase it more? Emulators are rough! :(
Without looking at the code, I don't know what could be causing Client id file does not exist yet because, afaik, we remove the client ID file before each test, but it's possible this file is just left over.
Then again, maybe this file is now being created by the startup code in the application (e.g. bug 1255657 moved this earlier in the startup process), before the test can run.
Depends on: 1249491
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Updated•9 years ago
|
Assignee: nobody → michael.l.comella
Assignee | ||
Comment 13•9 years ago
|
||
Like intermittent bug 1268998 comment 6, I wonder if this could also be related to running the uploader in automation – soft dep.
Depends on: 1270191
Assignee | ||
Comment 14•9 years ago
|
||
re comment 13, backed up by the logs:
16:54:34 INFO - 05-25 16:51:17.342 D/GeckoTelemetryUploadSer( 2860): Service started
16:54:34 INFO - 05-25 16:51:18.220 I/Robocop ( 2860): {"message":"","time":1464220278202,"source":"robocop","status":"PASS","test":"testUnifiedTelemetryClientId","thread":null,"subtest":"The toolbar is not in the editing state","action":"test_status","pid":null}
...
16:54:34 INFO - 05-25 16:51:24.731 D/GeckoTelemetryUploadSer( 2860): Telemetry upload success!
16:54:34 INFO - 05-25 16:51:24.771 D/GeckoTelemetryUploadSer( 2860): Service finished: upload and prune attempts completed
16:54:34 INFO - 05-25 16:51:25.701 I/Robocop ( 2860): {"message":"","time":1464220285681,"source":"robocop","status":"PASS","test":"testUnifiedTelemetryClientId","thread":null,"subtest":"Waiting for UrlEditText to be input method target.","action":"test_status","pid":null}
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 16•8 years ago
|
||
I expect this to go away with the landing of bug 1270191 – please reopen if not so.
The failure rate is low enough that I don't think we need to uplift to aurora.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → DUPLICATE
Comment 17•8 years ago
|
||
Still occurring: https://treeherder.mozilla.org/logviewer.html#?job_id=29344044&repo=mozilla-inbound
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
Assignee | ||
Comment 18•8 years ago
|
||
The logs indicate the telemetry upload service is now disabled in automation, however, I realized that I don't know the Gecko side of things.
Georg, do you know if Gecko telemetry could be creating the client ID file during tests in automation? If you don't know, is there any logging output associated with it I could search the logs (e.g. [1]) for?
[1]: https://mozilla-releng-blobs.s3.amazonaws.com/blobs/mozilla-inbound/sha512/738bd6b78f096f1d78ff07cd482fc21d586d13a7870696eba59eef240ff3f89e8644ec461c7835a7d2896d9082b2a0aecbea64703d36ca25ebe1701b3075505f
Flags: needinfo?(gfritzsche)
Assignee | ||
Comment 19•8 years ago
|
||
Assignee | ||
Comment 20•8 years ago
|
||
Review commit: https://reviewboard.mozilla.org/r/57700/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/57700/
Attachment #8759864 -
Flags: review?(gkruglov)
Attachment #8759865 -
Flags: review?(gkruglov)
Attachment #8759866 -
Flags: review?(gkruglov)
Assignee | ||
Comment 21•8 years ago
|
||
Review commit: https://reviewboard.mozilla.org/r/57702/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/57702/
Assignee | ||
Comment 22•8 years ago
|
||
Review commit: https://reviewboard.mozilla.org/r/57704/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/57704/
Assignee | ||
Updated•8 years ago
|
Keywords: leave-open
Comment hidden (Intermittent Failures Robot) |
Comment 24•8 years ago
|
||
(In reply to Michael Comella (:mcomella) from comment #18)
> The logs indicate the telemetry upload service is now disabled in
> automation, however, I realized that I don't know the Gecko side of things.
>
> Georg, do you know if Gecko telemetry could be creating the client ID file
> during tests in automation?
Yes, if it doesn't exist yet we would create the client id on the first request:
https://dxr.mozilla.org/mozilla-central/rev/3e8ee3599a67edd971770af4982ad4b0fe77f073/toolkit/modules/ClientID.jsm#128
Flags: needinfo?(gfritzsche)
Assignee | ||
Comment 25•8 years ago
|
||
(In reply to Georg Fritzsche [:gfritzsche] from comment #24)
> > Georg, do you know if Gecko telemetry could be creating the client ID file
> > during tests in automation?
>
> Yes, if it doesn't exist yet we would create the client id on the first
> request:
> https://dxr.mozilla.org/mozilla-central/rev/
> 3e8ee3599a67edd971770af4982ad4b0fe77f073/toolkit/modules/ClientID.jsm#128
Since we're testing this functionality directly, is it possible to disable the other background processes during the test that might attempt to create a client ID?
Looking at my code, since I don't know the calling code, my (likely poor) solution would be to:
1) Disable ClientID.getClientID
2) Add a second method with a boolean override (e.g. ClientID.getClientID(underTesting)) which we can call from testUnifiedTelemetryClientId [1]
[1]: https://dxr.mozilla.org/mozilla-central/rev/3e8ee3599a67edd971770af4982ad4b0fe77f073/mobile/android/tests/browser/robocop/testUnifiedTelemetryClientId.js#18
Flags: needinfo?(gfritzsche)
Comment 26•8 years ago
|
||
Comment on attachment 8759864 [details]
Bug 1261494 - Reduce telemetry init delay to 1 second for integration testing.
https://reviewboard.mozilla.org/r/57700/#review54796
Attachment #8759864 -
Flags: review?(gkruglov) → review+
Comment 27•8 years ago
|
||
Comment on attachment 8759865 [details]
Bug 1261494 - Add assertions & logging to resetTest for sanity checking.
https://reviewboard.mozilla.org/r/57702/#review54862
::: mobile/android/tests/browser/robocop/src/org/mozilla/gecko/tests/testUnifiedTelemetryClientId.java:57
(Diff revision 1)
> +
> if (resetJSCache) {
> resetJSCache();
> }
> - getClientIdFile().delete();
> - getFHRClientIdFile().delete();
> + for (final File file : filesToDeleteOnReset) {
> + file.delete();
Might also be useful to check return value of the `delete` call, but perhaps it doesn't matter since you're checking `exists` value right after.
Attachment #8759865 -
Flags: review?(gkruglov) → review+
Comment 28•8 years ago
|
||
Comment on attachment 8759866 [details]
Bug 1261494 - Add logging before each test to help debug intermittent.
https://reviewboard.mozilla.org/r/57704/#review54864
Attachment #8759866 -
Flags: review?(gkruglov) → review+
Assignee | ||
Comment 29•8 years ago
|
||
Georg suggested trying to prevent Gecko from creating new client ID files through method ordering: Gecko caches the client ID so if we preload the client ID into the cache, it won't try to create a new file. As such, we can keep the cache full until we decide it's time for Gecko to read/create a new client ID file.
There is still an opportunity for Gecko to attempt to access the client ID file before we can call this method, however, it greatly reduces the it.
Flags: needinfo?(gfritzsche)
Assignee | ||
Comment 30•8 years ago
|
||
I'm going to land the patches I wrote for this bug (so far) because they add helpful logging but I may end up undoing some of them as I implement comment 29.
Assignee | ||
Comment 31•8 years ago
|
||
Comment on attachment 8759864 [details]
Bug 1261494 - Reduce telemetry init delay to 1 second for integration testing.
Review request updated; see interdiff: https://reviewboard.mozilla.org/r/57700/diff/1-2/
Attachment #8759864 -
Attachment description: Bug 1261494 - Remove outdated comment regarding disabling upload service. → Bug 1261494 - Block Gecko from screwing with the client ID file by resetting the cache at the last minute.
Assignee | ||
Updated•8 years ago
|
Attachment #8759865 -
Attachment is obsolete: true
Assignee | ||
Updated•8 years ago
|
Attachment #8759866 -
Attachment is obsolete: true
Assignee | ||
Comment 32•8 years ago
|
||
https://hg.mozilla.org/integration/fx-team/rev/576a3ad4b084e3772951cab80695519fbd92f8c5
Bug 1261494 - Remove outdated comment regarding disabling upload service. r=grisha
https://hg.mozilla.org/integration/fx-team/rev/67b5ee17fc49e542c717c6b0c7710bd7307a7829
Bug 1261494 - Add assertions & logging to resetTest for sanity checking. r=grisha
https://hg.mozilla.org/integration/fx-team/rev/389014745e195369fb0930d0edc542a137649e24
Bug 1261494 - Add logging before each test to help debug intermittent. r=grisha
Assignee | ||
Comment 33•8 years ago
|
||
Assignee | ||
Updated•8 years ago
|
Attachment #8759864 -
Flags: review+ → review?(gkruglov)
Comment 34•8 years ago
|
||
bugherder |
Comment 35•8 years ago
|
||
Comment on attachment 8759864 [details]
Bug 1261494 - Reduce telemetry init delay to 1 second for integration testing.
https://reviewboard.mozilla.org/r/57700/#review55802
That resetJSCache method is interesting... LGTM, hacky stuff though.
Attachment #8759864 -
Flags: review?(gkruglov) → review+
Assignee | ||
Comment 36•8 years ago
|
||
https://hg.mozilla.org/integration/fx-team/rev/370eaabb204751e618b049d7cbc0cc8298d7b2c9
Bug 1261494 - Block Gecko from screwing with the client ID file by resetting the cache at the last minute. r=grisha
Comment 37•8 years ago
|
||
bugherder |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 45•8 years ago
|
||
:mcomella -- This is still a very frequent failure. Can it be skipped?
Flags: needinfo?(michael.l.comella)
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 48•8 years ago
|
||
(In reply to Geoff Brown [:gbrown] from comment #45)
> :mcomella -- This is still a very frequent failure. Can it be skipped?
No – we use it to ensure we can correctly retrieve the client ID, Java <-> JS, which we can't do with a unit test. I'll prioritize fixing this.
Also:
(In reply to Georg Fritzsche [:gfritzsche] from bug 1261596 comment #9)
> Michael, can you tell whats going on here?
> Could this be fallout from bug 1233986?
Flags: needinfo?(michael.l.comella)
Assignee | ||
Comment 49•8 years ago
|
||
It looks like the increase in failures started around 6/26 & 6/29 (comment 39 & comment 40).
The patch in this bug landed on 6/12 (comment 37) so it's unlikely to have negatively affected the failure rate [1].
The patch in bug 1233986 landed several times but was backed out, finally landing on 6/28 (bug 1233986 comment 54), which is the middle of the range I saw above for the increase in failures. As the most likely cause, I'll take a look at the code and see what could have affected this test.
[1]: Tangentially, the patch in this bug appears to have reduced the failure rate (prior to the 6/26-6/29 increase) but not have eliminated it entirely, like I expected it to.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 57•8 years ago
|
||
Looking at the errors, it looks like we timeout retrieving the client ID the first time [1].
My biggest lead is:
1) Gecko retrieves the client ID and registers its `.then(...)` callbacks
2) We attempt to retrieve the client ID by registering our `.then(...)` callbacks
3) The client ID is retrieved & the promises resolve. Since promises resolve in the order they were registered, Gecko attempts upload
4) Gecko's upload timeout (90s [2]) is longer than our "retrieve client ID from JS" timeout (20s) so we fail.
bug 1233986 moved Gecko telemetry upload to occur earlier during startup so its more likely to run into this
Georg, do we attempt to upload during automation? Specifically the Android rocobop tests.
I'll drop a try run with an increased Java timeout (120s?) to see if this appears to fix the issue.
[1]: https://dxr.mozilla.org/mozilla-central/rev/88bebcaca249aeaca9197382e89d35b02be8292e/mobile/android/tests/browser/robocop/src/org/mozilla/gecko/tests/testUnifiedTelemetryClientId.java#77
[2]: https://dxr.mozilla.org/mozilla-central/rev/88bebcaca249aeaca9197382e89d35b02be8292e/toolkit/components/telemetry/TelemetrySend.jsm#68
Flags: needinfo?(gfritzsche)
Assignee | ||
Comment 58•8 years ago
|
||
Assignee | ||
Comment 59•8 years ago
|
||
Georg, also would the JS _log.trace statements appear in the test output? I can't seem to find any – I'm specifically looking for those TelemetrySend calls.
If not, is there a way to enable them or add my own log statements?
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 61•8 years ago
|
||
Comment 62•8 years ago
|
||
(In reply to Michael Comella (:mcomella) from comment #59)
> Georg, also would the JS _log.trace statements appear in the test output? I
> can't seem to find any – I'm specifically looking for those TelemetrySend
> calls.
>
> If not, is there a way to enable them or add my own log statements?
I don't know where the testing prefs are set for those Fennec tests, but you enable this by setting e.g. "toolkit.telemetry.log.level" to "Trace".
Flags: needinfo?(gfritzsche)
Comment 63•8 years ago
|
||
(In reply to Michael Comella (:mcomella) from comment #57)
> Looking at the errors, it looks like we timeout retrieving the client ID the
> first time [1].
>
> My biggest lead is:
> 1) Gecko retrieves the client ID and registers its `.then(...)` callbacks
> 2) We attempt to retrieve the client ID by registering our `.then(...)`
> callbacks
> 3) The client ID is retrieved & the promises resolve. Since promises
> resolve in the order they were registered, Gecko attempts upload
> 4) Gecko's upload timeout (90s [2]) is longer than our "retrieve client ID
> from JS" timeout (20s) so we fail.
>
> bug 1233986 moved Gecko telemetry upload to occur earlier during startup so
> its more likely to run into this
It should just have changed to submitting "too early" pings with the client-id instead of without, although it could have changed the timing a little bit.
> Georg, do we attempt to upload during automation? Specifically the Android
> rocobop tests.
It is possible, although we shouldn't.
We tried to set the "toolkit.telemetry.server" pref to a dummy address for the tests we found out about:
https://dxr.mozilla.org/mozilla-central/search?q=toolkit.telemetry.server&redirect=false
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 65•8 years ago
|
||
(In reply to Georg Fritzsche [:gfritzsche] from comment #62)
> (In reply to Michael Comella (:mcomella) from comment #59)
> > Georg, also would the JS _log.trace statements appear in the test output? I
> > can't seem to find any – I'm specifically looking for those TelemetrySend
> > calls.
> >
> > If not, is there a way to enable them or add my own log statements?
>
> I don't know where the testing prefs are set for those Fennec tests, but you
> enable this by setting e.g. "toolkit.telemetry.log.level" to "Trace".
Geoff, do you know how I might enable this about:config setting during a try run?
Flags: needinfo?(gbrown)
Comment 66•8 years ago
|
||
Try adding something here:
http://hg.mozilla.org/mozilla-central/annotate/tip/testing/mochitest/runrobocop.py#l228
Flags: needinfo?(gbrown)
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 68•8 years ago
|
||
(In reply to Michael Comella (:mcomella) from comment #61)
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=f3fc4a3de8be
fyi, this push was invalidated because of bug 1286942.
Waiting on bug 1286942 to be fixed in order to push to try and fix this.
Depends on: 1286942
Assignee | ||
Comment 69•8 years ago
|
||
Assignee | ||
Comment 70•8 years ago
|
||
My push from comment 69 made no functional changes and failed this test 7/7 times (+1 infra issue, +1 still running).
My push from comment 58 added a timeout longer than the telemetry upload time and passed this test 13/13 times (+1 infra issue).
I have a high suspicion this is related to attempting to upload during the test. I will try to add some logging (via comment 66) to confirm this suspicion.
Assignee | ||
Comment 71•8 years ago
|
||
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 73•8 years ago
|
||
(In reply to Michael Comella (:mcomella) from comment #70)
> My push from comment 58 added a timeout longer than the telemetry upload
> time and passed this test 13/13 times (+1 infra issue).
In the worst case (or if someone wants this fixed right now), we can increase the timeout so this no longer fails in automation and test fixes via try.
I'll have more time to dig in tomorrow.
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 75•8 years ago
|
||
Assignee | ||
Comment 76•8 years ago
|
||
Looking at the logcat from the first failure from the push from comment 71 [1], submitPing does not run until after the test has already timed-out and failed.
However, I notice that Telemetry's deferred init runs in between when we initially request the clientID from JS and when we timeout. The getClientID method seems to return correctly here because the log statements indicate the methods before & after it run. Perhaps the telemetry init is taking too long and, given JS' single-threaded nature, is blocking us from getting the client ID.
My push in comment 75 increases the delayed init timeout so hopefully Java gets control of the JS thread sooner.
[1]: https://mozilla-releng-blobs.s3.amazonaws.com/blobs/try/sha512/870f1bd220acef368e6be70fb47a75b83beb9cbfb60399e33d601e5536ca1fa0f5a6b5a880b2cd10bcca49a5b95b92dbdee371952844ca7efa05cf4c2aa05854
Assignee | ||
Comment 77•8 years ago
|
||
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 81•8 years ago
|
||
Assignee | ||
Comment 82•8 years ago
|
||
My last push (comment 81) added log statements [1] to say how long it takes to retrieve the client ID from JS. In both runs, it takes 30 seconds to retrieve it initially and between 0 & 1 seconds to retrieve it each subsequent time.
Georg, do you know of any timeouts that could cause the client ID to take 30 seconds to retrieve? e.g. upload timeout, Gecko load timeout, etc.
If not, I'm just going to increase the timeout to 35 seconds because context-switching between my context graph work and fennec work is really expensive.
[1]: Look for the logtag "lol"
Flags: needinfo?(gfritzsche)
Comment hidden (Intermittent Failures Robot) |
Comment 84•8 years ago
|
||
This sounds like we actually hit the Telemetry initialization delay in the test.
We delay it to 60 seconds after "profile-after-change", maybe the other 30sec already passed waiting on other operations before you timed it?
There are two ways around this:
* directly initializing Telemetry to trigger "testing" behavior:
https://dxr.mozilla.org/mozilla-central/rev/462dc6b44adb4573e8ce6b0dd688c206ebd516f7/toolkit/components/telemetry/tests/unit/test_TelemetryController.js#109
* setting the pref "toolkit.telemetry.initDelay" to a reasonable value:
http://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/preferences.html#testing
Flags: needinfo?(gfritzsche)
Assignee | ||
Comment 85•8 years ago
|
||
Assignee | ||
Comment 86•8 years ago
|
||
(In reply to Georg Fritzsche [:gfritzsche] from comment #84)
> This sounds like we actually hit the Telemetry initialization delay in the
> test.
This appears to be the issue – my latest push reduced the telemetry initialization delay to one second and it took 0 seconds to retrieve the client ID the first time (as opposed to the duration comment 82). I wonder if that's why this test has always had intermittent issues.
To second guess our assumptions, why does retrieving a client ID, which I'd expect to have no (or few) dependencies, require a general "telemetry init" to occur?
Ignoring that for now, the quick solutions would be:
(In reply to Georg Fritzsche [:gfritzsche] from comment #84)
> * setting the pref "toolkit.telemetry.initDelay" to a reasonable value:
1) I would prefer to do this – telemetry will automatically be ready whenever anyone wants to test it & it's simpler than sending a message to JS. However, could this affect startup time for autophone tests? Will it noticeably affect the runtime of our robocop test suites?
And using:
> * directly initializing Telemetry to trigger "testing" behavior:
2) We could do this for each test that needs to test telemetry but it seems unreasonable to force people to remember to initialize telemetry for each test.
3) We can make this call during BaseTest or UITest initialization, which would prevent it from affecting autophone & friends (from #1), but could still affect the overall robocop runtime duration.
Geoff, do you have an opinion?
Flags: needinfo?(gbrown)
Assignee | ||
Comment 87•8 years ago
|
||
(In reply to Michael Comella (:mcomella) from comment #86)
> To second guess our assumptions, why does retrieving a client ID, which I'd
> expect to have no (or few) dependencies, require a general "telemetry init"
> to occur?
Then again, while I feel this could be worthwhile to do, we'd also run into this issue again if anyone else tries to write a integration test for telemetry so we might as well fix the problem generically for all telemetry code.
Assignee | ||
Comment 88•8 years ago
|
||
Comment 89•8 years ago
|
||
(In reply to Michael Comella (:mcomella) from comment #86)
> Geoff, do you have an opinion?
I trust your judgement.
Personally, I would go with option #1, setting the pref for all robocop tests, as you did in your try push. There is very little use of robocop in autophone, and currently there are no autophone performance tests that use robocop. I'd check on the robocop test suite run time, but unless the difference is significant, I think it will be fine.
Flags: needinfo?(gbrown)
Assignee | ||
Comment 90•8 years ago
|
||
Comment on attachment 8759864 [details]
Bug 1261494 - Reduce telemetry init delay to 1 second for integration testing.
Review request updated; see interdiff: https://reviewboard.mozilla.org/r/57700/diff/2-3/
Attachment #8759864 -
Attachment description: Bug 1261494 - Block Gecko from screwing with the client ID file by resetting the cache at the last minute. → Bug 1261494 - Reduce telemetry init delay to 1 second for integration testing.
Attachment #8759864 -
Flags: review+ → review?(gbrown)
Comment 91•8 years ago
|
||
Comment on attachment 8759864 [details]
Bug 1261494 - Reduce telemetry init delay to 1 second for integration testing.
https://reviewboard.mozilla.org/r/57700/#review64670
Attachment #8759864 -
Flags: review?(gbrown) → review+
Assignee | ||
Comment 92•8 years ago
|
||
https://hg.mozilla.org/integration/fx-team/rev/7c1ddc8223812aed002c47b4a169d7550a7c9c99
Bug 1261494 - Reduce telemetry init delay to 1 second for integration testing. r=gbrown
Assignee | ||
Comment 93•8 years ago
|
||
Comment on attachment 8759864 [details]
Bug 1261494 - Reduce telemetry init delay to 1 second for integration testing.
Approval Request Comment
[Feature/regressing bug #]: initially implemented w/ small amount of intermittents, other bugs made it worse
[User impact if declined]: Users are unaffected but we will see a lot of intermittents
[Describe test coverage new/current, TreeHerder]: Tested locally.
[Risks and why]: Extremely low. We change an about:config preference before running robocop tests. In the most likely worst case, the additional processing time could slow down the running of our robocop test suites. However, in my testing, I did not see any noticeable impact.
[String/UUID change made/needed]: None
Attachment #8759864 -
Flags: approval-mozilla-aurora?
Assignee | ||
Comment 94•8 years ago
|
||
Correction:
(In reply to Michael Comella (:mcomella) from comment #93)
> [Describe test coverage new/current, TreeHerder]: Tested locally.
And on try (comment 88).
Assignee | ||
Comment 95•8 years ago
|
||
Failures on beta & aurora seem uncommon so we don't need to land this change there, but it could be nice.
Since merge is soon, I did not flag for uplift to 48 (even though we don't change any shipping code).
Comment hidden (Intermittent Failures Robot) |
Comment 97•8 years ago
|
||
(In reply to Michael Comella (:mcomella) from comment #86)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #84)
> > This sounds like we actually hit the Telemetry initialization delay in the
> > test.
>
> This appears to be the issue – my latest push reduced the telemetry
> initialization delay to one second and it took 0 seconds to retrieve the
> client ID the first time (as opposed to the duration comment 82). I wonder
> if that's why this test has always had intermittent issues.
>
> To second guess our assumptions, why does retrieving a client ID, which I'd
> expect to have no (or few) dependencies, require a general "telemetry init"
> to occur?
We just trigger the client id as part of the general telemetry init.
We don't want to hit the disk early to not impact startup performance.
> Ignoring that for now, the quick solutions would be:
>
> (In reply to Georg Fritzsche [:gfritzsche] from comment #84)
> > * setting the pref "toolkit.telemetry.initDelay" to a reasonable value:
>
> 1) I would prefer to do this – telemetry will automatically be ready
> whenever anyone wants to test it & it's simpler than sending a message to
> JS. However, could this affect startup time for autophone tests? Will it
> noticeably affect the runtime of our robocop test suites?
If this startup performance hit is fine, that sounds like the easiest option.
If startup perf is a concern here, maybe you can set prefs per-test instead before starting Gecko?
Assignee | ||
Updated•8 years ago
|
Keywords: leave-open
Comment hidden (Intermittent Failures Robot) |
Comment 99•8 years ago
|
||
bugherder |
Status: REOPENED → RESOLVED
Closed: 8 years ago → 8 years ago
status-firefox50:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → Firefox 50
Updated•8 years ago
|
status-firefox49:
--- → affected
Comment 100•8 years ago
|
||
Comment on attachment 8759864 [details]
Bug 1261494 - Reduce telemetry init delay to 1 second for integration testing.
Fix an intermittent, taking it.
Attachment #8759864 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Assignee | ||
Comment 101•8 years ago
|
||
(In reply to Georg Fritzsche [:gfritzsche] from comment #97)
> > To second guess our assumptions, why does retrieving a client ID, which I'd
> > expect to have no (or few) dependencies, require a general "telemetry init"
> > to occur?
>
> We just trigger the client id as part of the general telemetry init.
> We don't want to hit the disk early to not impact startup performance.
To be explicit, the issue I'm having is that when I try to access the client ID without changing the telemetry init delay, the ID takes 30 seconds the first time to access [1]. When I change the telemetry init delay to 1 second, it takes 1 second to access the client ID the first time. The issue presumably occurs because telemetry is not initialized yet.
To me, this is unexpected because retrieving the client ID doesn't appear to have any dependencies on waiting for telemetry to be initialized first.
If I had to guess, I'd think this is an issue in the way we spawn and handle the async tasks or promises to init telemetry & access the client ID. For example, an init telemetry task initially tries to access the client ID, e.g.:
ClientID.getClientID().then(function (clientID) {
// Do some stuff
});
Since we also supply a method to `Promise.then` and these functions run sequentially, `// Do some stuff` has to finish running before my code, which accesses the client ID in the Java integration test, is allowed to run. While I saw no issues with the Promise code specifically that would cause this to be an issue, I don't know how the async task code, yields, etc. would affect the Promise resolution.
For example, does the async task code register Promise handlers when the async tasks are run, or when they're read by the interpreter? If it's the latter, the handler added by telemetry init could be registered prior to the Java test handler, blocking it from running until the telemetry init handler is run, which is not until telemetry init occurs.
In another possible case, how does promise resolution work with yield (e.g. how we frequently access the client ID in the telemetry code [2])? Does the code after the `yield` statement all have to run before the next promise handler is called? If so, in my provided sample, we'd have to wait for the upload attempt to finish before we can return (and thus run the Java test handler).
[1]: Subsequent accesses take about 1 second.
[2]: https://dxr.mozilla.org/mozilla-central/rev/ddeb0295df692695b36295177d6790e5393e1f9a/toolkit/components/telemetry/TelemetryController.jsm#463
Comment 102•8 years ago
|
||
bugherder uplift |
Comment hidden (Intermittent Failures Robot) |
Comment 104•8 years ago
|
||
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Updated•4 years ago
|
Product: Firefox for Android → Firefox for Android Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•