1294643 - Intermittent testUnifiedTelemetryClientId | Client ID from Java equals ID from JS - got d5c85dcf-c301-453d-8da7-c4bebc1c3a81, expected 3e5d83da-a79d-4f4e-bb5c-824b54182904

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Reporter

Description

•

8 years ago

+++ This bug was initially created as a clone of Bug #1268998 +++ https://treeherder.mozilla.org/logviewer.html#?job_id=33768726&repo=mozilla-inbound

Comment hidden (Intermittent Failures Robot)

Comment hidden (typo)

Michael Comella (:mcomella) [NI reported issues only: ex-Mozilla]

Comment 9

•

8 years ago

(Corrections to the comment above) I should look into this but, fwiw, it could be slow going because I don't actively work on fennec anymore – feel free to take it away from me if we need a faster fix. The fix in bug 1270191 (don't run uploader in automation) landed on 6/1; this fix did not work. The next fix landed in bug 1261494 (reduce gecko-side telemetry init to 1s), which landed on 7/29. bug 1261494 comment 105 (from 7/31) included the total from the past week – 162 intermittents – which dropped the following week to ~50 a week (between bug 1261494 and bug 1268998). This bug seems to contain all the intermittents from 8/16 forward, which is ~80 a week.

Comment hidden (Intermittent Failures Robot)

BMO Automation

Comment 11

•

8 years ago

Bulk assigning P3 to all open intermittent bugs without a priority set in Firefox components per bug 1298978.

Priority: -- → P3

Comment hidden (Intermittent Failures Robot)

Geoff Brown [:gbrown]

Assignee

Comment 15

•

8 years ago

Attached patch relax test to allow for a change in client ID (obsolete) — Details — Splinter Review

I looked into this a bit and realized that when this failure occurs, the client id from js is the same as the client id that would have come from java subsequently. Viewed from one perspective, this is a problem in timing: The id can change after the test has retrieved it from java. I found that adding a sleep before retrieving either id was fairly effective, but the sleep had to be quite long, about 8 seconds. I don't know why. I don't know what conditions are causing the change. But more broadly, the change is happening in gecko, and a robocop test has little hope of synchronizing with gecko to ensure that the id doesn't change. My approach in this patch is to accept that it might change: Get from Java, get from JS, then get from Java again and check that the id from JS matches either the one from Java before, or the one from Java after. This change effectively eliminates this type of failure on try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=273aa7212a2b1cb3fff14c2ff0731e823970424f What do you think?

Attachment #8788274 - Flags: review?(michael.l.comella)

Geoff Brown [:gbrown]

Assignee

Comment 16

•

8 years ago

Attached patch relax test to allow for a change in client ID — Details — Splinter Review

Sorry - attached the wrong patch.

Attachment #8788274 - Attachment is obsolete: true

Attachment #8788274 - Flags: review?(michael.l.comella)

Attachment #8788275 - Flags: review?(michael.l.comella)

Comment hidden (Intermittent Failures Robot)

Phil Ringnalda (:philor)

Comment 22

•

8 years ago

Just push it without review: this is the most frequent single-test failure we have, so next time I get annoyed enough by the way the fingers on my right hand are literally numb from hitting a trackpad from starring, I'll disable it otherwise.

Flags: needinfo?(gbrown)

Comment hidden (Intermittent Failures Robot)

Pulsebot

Comment 24

•

8 years ago

Pushed by gbrown@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/1642b6250199 Allow for more conditions in robocop testUnifiedTelemetryClientId; r=me

Geoff Brown [:gbrown]

Assignee

Comment 25

•

8 years ago

For philor :) Leaving open for a proper review when mcomella is available - we can follow-up or back out as needed.

Flags: needinfo?(gbrown)

Keywords: leave-open

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 26

•

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/1642b6250199

Comment hidden (Intermittent Failures Robot)

Michael Comella (:mcomella) [NI reported issues only: ex-Mozilla]

Comment 28

•

8 years ago

Comment on attachment 8788275 [details] [diff] [review] relax test to allow for a change in client ID Review of attachment 8788275 [details] [diff] [review]: ----------------------------------------------------------------- Great observations, Geoff! Thanks for looking into this! Unfortunately, I think this change defeats the purpose of the test: we're testing to see that Java can create a client ID that JS can read. If we ignore the first value that Java writes, we're letting JS ignore what Java has done. The production code could change such that JS ignores what Java has written and this test, which is meant to catch that case, will pass. If we were to make a comprise change, I think: * If we hit the case where JS doesn't read the Java value, we could delete the client ID files, have Java retry creation, and access from JS again. This isn't perfect – it'd miss the case where JS only sometimes ignores Java in prod. code – but it'd catch the case where the JS code always ignores Java, which is more likely to happen when the code is changed. Given the amount of time I've spent on this, I'd call this good enough. * We could also take Geoff's wait ~8 seconds approach That being said, maybe the reason we're having this intermittent failure is because this is actually an intermittent bug in prod. code! Thoughts: * I did an analysis on the test code below and I don't understand where this could be coming from in the test code so it's possible it's in the prod. code * We don't act exactly like this in prod. code – we create the client ID in Java before Gecko ever starts so it's possible this bug just wouldn't manifest itself in prod. code. Unfortunately, afaik, it's non-trivial to test the exact conditions: repeatedly starting up the application * Assuming both Java and JS can read from existing client ID files, this would only affect first run when we have to create the client ID file – it'd be fixed on subsequent runs. - Here is my understanding of the world. The code is currently: deleteClientIDFiles // eliminate corruption from previous test runs primeJSClientIdCache // HACK: retrieves client ID, storing the new id in cache. Creates files so: deleteClientIDFiles // delete the files we just created (inside testJavaCreatesClientId) getClientIdFromJava // saves to a fresh client ID file resetJSCache // clear the client ID from cache, allowing us to get the value from the new file getClientIdFromJS // should retrieve value from new file // assertion that JavaID == JSID intermittently fails here - The getClientID code seems to work like this: ClientID.getClientID -> call-through ClientIDImpl.getClientID -> if (cachedID) { return cachedID; } else { call-through } Impl._loadClientID -> if (task) { return task; } else { call-through, store output (task), and return it } Impl._doLoadClientID -> returns task that loads ID from disk or, if DNE, generates, stores, and returns it To translate: * We cache the returned ID so we don't need to access disk * We return with promises so that we can load the Client ID async. We cache the promise/task so that if multiple callers call, we can resolve the promise/task once for all of them and they can run their resolution code in sequence. However, this is all done with generators, tasks, yields, and promises which can be quite complex: I could be misunderstanding something and it's always possible there are bugs. - From Geoff's analysis (comment 15): > when this failure occurs, the client id from js is the same as the > client id that would have come from java subsequently. > Viewed from one perspective, this is a problem in timing: The id can > change after the test has retrieved it from java. From my understanding of the code, this should not happen. * When we prime the JS cache, we've blocked until the client ID read/write tasks have run so they can't be run again. This means no matter what gecko does (unless it calls the reset method reserved for tests), it cannot write new client ID data while the cache is full. * Java writes the client ID file and we reset the JS cache. Since the client ID file exists, any future invocations of Gecko client ID code should read from the file (unrelated gecko background work included) * JS is expected to read from the client ID file and return the value Java wrote. As Geoff mentions, that's not what happens. :( - > I found that adding a sleep before retrieving either id was fairly > effective, but the sleep had to be quite long, about 8 seconds. This is interesting: this implies to me that Gecko is doing something in the background that affects the client ID files (e.g. maybe it's creating/destroying the clientID file hierarchies as part of test preparations). It's also possible this destruction/creation of the client ID files happens in prod. code – see thoughts above.

Attachment #8788275 - Flags: review?(michael.l.comella) → feedback-

Michael Comella (:mcomella) [NI reported issues only: ex-Mozilla]

Comment 29

•

8 years ago

(In reply to Michael Comella (:mcomella) [not actively working on fennec: expect slow responses] from comment #28) > If we were to make a comprise change, I think: > * If we hit the case where JS doesn't read the Java value, we could delete > the client ID files, have Java retry creation, and access from JS again. > This isn't perfect – it'd miss the case where JS only sometimes ignores Java > in prod. code – but it'd catch the case where the JS code always ignores > Java, which is more likely to happen when the code is changed. Given the > amount of time I've spent on this, I'd call this good enough. To be clear, given the amount of time we've invested in this, I'd like to try this one out.

Michael Comella (:mcomella) [NI reported issues only: ex-Mozilla]

Comment 30

•

8 years ago

Also, I didn't do this, but there could be something to be learned from: * Why does this bug affect testJavaCreatesClientId but not the three other test methods? * The other intermittent issues with this test [1]. In particular, "Retrieving client ID from JS has timed out" (bug 1295188) which still appears to be active [1]: https://bugzilla.mozilla.org/buglist.cgi?quicksearch=intermittent%20testunifiedtelemetryclientid&list_id=13227162

Comment hidden (mozreview-request)

Michael Comella (:mcomella) [NI reported issues only: ex-Mozilla]

Comment 32

•

8 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=a90c9688c1c2

Michael Comella (:mcomella) [NI reported issues only: ex-Mozilla]

Comment 33

•

8 years ago

My retry solution (comment 28, try: comment 32) didn't appear to work. I retried for ten seconds (given Geoff's previous mention of 8 seconds seeming to iron out the issues) – one thought is to try for longer. That being said, it's concerning that in: getClientIDFromJava resetJSCache getClientIDFromJS the client ID from Java is not equal to the one from JS, even on multiple retries. One of my suspicions is that a request for the client ID is in-flight by Gecko [1] and so when we force the reset and/or access the client ID, it completes the request, causing the client ID file to be overwritten. It's unclear why this would happen after even after resets. [1]: I don't think this is possible with the JS cache primed. Also, it's crazy that this could happen even over the course of multiple retries. - I just noticed that my push from comment 32 does NOT fail in taskcluster but is intermittent on buildbot. Interesting! Geoff, do you know what's different between the configurations?

Flags: needinfo?(gbrown)

Geoff Brown [:gbrown]

Assignee

Comment 34

•

8 years ago

> Geoff, do you know what's different between the configurations? Yes - speed! Robocop tests run on a faster (more memory, more cores) aws instance type when run from taskcluster. (That is not the case for most test types migrated from buildbot to taskcluster, but I found that robocop was struggling on taskcluster until I bumped up the aws capabilities.) So, that might be a clue: Maybe the test fails when some change/response/etc is delayed?

Flags: needinfo?(gbrown)

Michael Comella (:mcomella) [NI reported issues only: ex-Mozilla]

Comment 35

•

8 years ago

When the emulator is slow, bugs typically exposes themselves in testing but not on device – I wonder if we should just move the TC runs to be the official test results, to avoid issues like this in the future. Any thoughts, Geoff? That being said, I'm going to be heads down in NMX work so I won't be able to investigate this for at least a few months. I'm concerned that the current code may muffle true failures (see comment 28 for more) but it's better than it was before: failing all the time (thanks Geoff!). CC Sebastian and Georg for awareness that this test is not fully covering what it should be.

Assignee: michael.l.comella → nobody

Flags: needinfo?(gbrown)

Geoff Brown [:gbrown]

Assignee

Comment 36

•

8 years ago

Running tests only on TC is the goal, but I don't think there is a reliable time-line for that. I have a feeling that we are at least a few months away from retiring the buildbot tests. (Bug 1280570 is one work in progress that needs to be resolved before we can advance TC Android goals.)

Flags: needinfo?(gbrown)

Michael Comella (:mcomella) [NI reported issues only: ex-Mozilla]

Comment 37

•

8 years ago

Is there a meta bug for running tests primarily on TC? I'd like to add this (or a new bug) as depending on that so, once we do that, we can revert the change here and see how if the code still passes.

Flags: needinfo?(gbrown)

Geoff Brown [:gbrown]

Assignee

Comment 38

•

8 years ago

Bug 1270544 might be best (see also bug 1141248 and bug 1118394).

Flags: needinfo?(gbrown)

Ryan VanderMeulen [:RyanVM]

Updated

•

8 years ago

Blocks: 1295188

Michael Comella (:mcomella) [NI reported issues only: ex-Mozilla]

Updated

•

8 years ago

Blocks: 1308336

Michael Comella (:mcomella) [NI reported issues only: ex-Mozilla]

Comment 39

•

8 years ago

(In reply to Michael Comella (:mcomella) [not actively working on fennec: expect slow responses] from comment #37) > Is there a meta bug for running tests primarily on TC? I'd like to add this > (or a new bug) as depending on that so, once we do that, we can revert the > change here and see how if the code still passes. Filed bug 1308336 blocking bug 1270544. This bug does not appear to be getting any more failures after comment 24's patch so I'm going to close it. I added the related intermittent, bug 1295188, to the "See also" of the new bug so we can more easily keep track of it.

Status: NEW → RESOLVED

Closed: 8 years ago

Resolution: --- → FIXED

Michael Comella (:mcomella) [NI reported issues only: ex-Mozilla]

Updated

•

8 years ago

Assignee: nobody → gbrown

Michael Comella (:mcomella) [NI reported issues only: ex-Mozilla]

Updated

•

8 years ago

Keywords: leave-open

Michael Comella (:mcomella) [NI reported issues only: ex-Mozilla]

Updated

•

8 years ago

Attachment #8793584 - Attachment is obsolete: true

Michael Comella (:mcomella) [NI reported issues only: ex-Mozilla]

Comment 40

•

8 years ago

Comment on attachment 8788275 [details] [diff] [review] relax test to allow for a change in client ID Despite negative feedback, this is the patch that landed. Please see comment 28+ for details.

Attachment #8788275 - Flags: feedback- → review+

Ryan VanderMeulen [:RyanVM]

Updated

•

8 years ago

status-firefox50: --- → unaffected

status-firefox51: affected → fixed

Target Milestone: --- → Firefox 51

BMO Automation

Updated

•

4 years ago

Product: Firefox for Android → Firefox for Android Graveyard

relax test to allow for a change in client ID 8 years ago Geoff Brown [:gbrown] 2.07 KB, patch		Details \| Diff \| Splinter Review
relax test to allow for a change in client ID 8 years ago Geoff Brown [:gbrown] 1.96 KB, patch	mcomella : review+	Details \| Diff \| Splinter Review
Bug 1294643 - Retry getClientID in Java until it equals JS. 8 years ago Michael Comella (:mcomella) [NI reported issues only: ex-Mozilla] 58 bytes, text/x-review-board-request		Details