Closed Bug 1294643 Opened 8 years ago Closed 8 years ago

Intermittent testUnifiedTelemetryClientId | Client ID from Java equals ID from JS - got d5c85dcf-c301-453d-8da7-c4bebc1c3a81, expected 3e5d83da-a79d-4f4e-bb5c-824b54182904

Categories

(Firefox for Android Graveyard :: Metrics, defect, P3)

defect

Tracking

(firefox50 unaffected, firefox51 fixed)

RESOLVED FIXED
Firefox 51
Tracking Status
firefox50 --- unaffected
firefox51 --- fixed

People

(Reporter: aryx, Assigned: gbrown)

References

Details

(Keywords: intermittent-failure)

Attachments

(1 file, 2 obsolete files)

(Corrections to the comment above) I should look into this but, fwiw, it could be slow going because I don't actively work on fennec anymore – feel free to take it away from me if we need a faster fix. The fix in bug 1270191 (don't run uploader in automation) landed on 6/1; this fix did not work. The next fix landed in bug 1261494 (reduce gecko-side telemetry init to 1s), which landed on 7/29. bug 1261494 comment 105 (from 7/31) included the total from the past week – 162 intermittents – which dropped the following week to ~50 a week (between bug 1261494 and bug 1268998). This bug seems to contain all the intermittents from 8/16 forward, which is ~80 a week.
Bulk assigning P3 to all open intermittent bugs without a priority set in Firefox components per bug 1298978.
Priority: -- → P3
I looked into this a bit and realized that when this failure occurs, the client id from js is the same as the client id that would have come from java subsequently. Viewed from one perspective, this is a problem in timing: The id can change after the test has retrieved it from java. I found that adding a sleep before retrieving either id was fairly effective, but the sleep had to be quite long, about 8 seconds. I don't know why. I don't know what conditions are causing the change. But more broadly, the change is happening in gecko, and a robocop test has little hope of synchronizing with gecko to ensure that the id doesn't change. My approach in this patch is to accept that it might change: Get from Java, get from JS, then get from Java again and check that the id from JS matches either the one from Java before, or the one from Java after. This change effectively eliminates this type of failure on try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=273aa7212a2b1cb3fff14c2ff0731e823970424f What do you think?
Attachment #8788274 - Flags: review?(michael.l.comella)
Sorry - attached the wrong patch.
Attachment #8788274 - Attachment is obsolete: true
Attachment #8788274 - Flags: review?(michael.l.comella)
Attachment #8788275 - Flags: review?(michael.l.comella)
Just push it without review: this is the most frequent single-test failure we have, so next time I get annoyed enough by the way the fingers on my right hand are literally numb from hitting a trackpad from starring, I'll disable it otherwise.
Flags: needinfo?(gbrown)
Pushed by gbrown@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/1642b6250199 Allow for more conditions in robocop testUnifiedTelemetryClientId; r=me
For philor :) Leaving open for a proper review when mcomella is available - we can follow-up or back out as needed.
Flags: needinfo?(gbrown)
Keywords: leave-open
Comment on attachment 8788275 [details] [diff] [review] relax test to allow for a change in client ID Review of attachment 8788275 [details] [diff] [review]: ----------------------------------------------------------------- Great observations, Geoff! Thanks for looking into this! Unfortunately, I think this change defeats the purpose of the test: we're testing to see that Java can create a client ID that JS can read. If we ignore the first value that Java writes, we're letting JS ignore what Java has done. The production code could change such that JS ignores what Java has written and this test, which is meant to catch that case, will pass. If we were to make a comprise change, I think: * If we hit the case where JS doesn't read the Java value, we could delete the client ID files, have Java retry creation, and access from JS again. This isn't perfect – it'd miss the case where JS only sometimes ignores Java in prod. code – but it'd catch the case where the JS code always ignores Java, which is more likely to happen when the code is changed. Given the amount of time I've spent on this, I'd call this good enough. * We could also take Geoff's wait ~8 seconds approach That being said, maybe the reason we're having this intermittent failure is because this is actually an intermittent bug in prod. code! Thoughts: * I did an analysis on the test code below and I don't understand where this could be coming from in the test code so it's possible it's in the prod. code * We don't act exactly like this in prod. code – we create the client ID in Java before Gecko ever starts so it's possible this bug just wouldn't manifest itself in prod. code. Unfortunately, afaik, it's non-trivial to test the exact conditions: repeatedly starting up the application * Assuming both Java and JS can read from existing client ID files, this would only affect first run when we have to create the client ID file – it'd be fixed on subsequent runs. - Here is my understanding of the world. The code is currently: deleteClientIDFiles // eliminate corruption from previous test runs primeJSClientIdCache // HACK: retrieves client ID, storing the new id in cache. Creates files so: deleteClientIDFiles // delete the files we just created (inside testJavaCreatesClientId) getClientIdFromJava // saves to a fresh client ID file resetJSCache // clear the client ID from cache, allowing us to get the value from the new file getClientIdFromJS // should retrieve value from new file // assertion that JavaID == JSID intermittently fails here - The getClientID code seems to work like this: ClientID.getClientID -> call-through ClientIDImpl.getClientID -> if (cachedID) { return cachedID; } else { call-through } Impl._loadClientID -> if (task) { return task; } else { call-through, store output (task), and return it } Impl._doLoadClientID -> returns task that loads ID from disk or, if DNE, generates, stores, and returns it To translate: * We cache the returned ID so we don't need to access disk * We return with promises so that we can load the Client ID async. We cache the promise/task so that if multiple callers call, we can resolve the promise/task once for all of them and they can run their resolution code in sequence. However, this is all done with generators, tasks, yields, and promises which can be quite complex: I could be misunderstanding something and it's always possible there are bugs. - From Geoff's analysis (comment 15): > when this failure occurs, the client id from js is the same as the > client id that would have come from java subsequently. > Viewed from one perspective, this is a problem in timing: The id can > change after the test has retrieved it from java. From my understanding of the code, this should not happen. * When we prime the JS cache, we've blocked until the client ID read/write tasks have run so they can't be run again. This means no matter what gecko does (unless it calls the reset method reserved for tests), it cannot write new client ID data while the cache is full. * Java writes the client ID file and we reset the JS cache. Since the client ID file exists, any future invocations of Gecko client ID code should read from the file (unrelated gecko background work included) * JS is expected to read from the client ID file and return the value Java wrote. As Geoff mentions, that's not what happens. :( - > I found that adding a sleep before retrieving either id was fairly > effective, but the sleep had to be quite long, about 8 seconds. This is interesting: this implies to me that Gecko is doing something in the background that affects the client ID files (e.g. maybe it's creating/destroying the clientID file hierarchies as part of test preparations). It's also possible this destruction/creation of the client ID files happens in prod. code – see thoughts above.
Attachment #8788275 - Flags: review?(michael.l.comella) → feedback-
(In reply to Michael Comella (:mcomella) [not actively working on fennec: expect slow responses] from comment #28) > If we were to make a comprise change, I think: > * If we hit the case where JS doesn't read the Java value, we could delete > the client ID files, have Java retry creation, and access from JS again. > This isn't perfect – it'd miss the case where JS only sometimes ignores Java > in prod. code – but it'd catch the case where the JS code always ignores > Java, which is more likely to happen when the code is changed. Given the > amount of time I've spent on this, I'd call this good enough. To be clear, given the amount of time we've invested in this, I'd like to try this one out.
Also, I didn't do this, but there could be something to be learned from: * Why does this bug affect testJavaCreatesClientId but not the three other test methods? * The other intermittent issues with this test [1]. In particular, "Retrieving client ID from JS has timed out" (bug 1295188) which still appears to be active [1]: https://bugzilla.mozilla.org/buglist.cgi?quicksearch=intermittent%20testunifiedtelemetryclientid&list_id=13227162
My retry solution (comment 28, try: comment 32) didn't appear to work. I retried for ten seconds (given Geoff's previous mention of 8 seconds seeming to iron out the issues) – one thought is to try for longer. That being said, it's concerning that in: getClientIDFromJava resetJSCache getClientIDFromJS the client ID from Java is not equal to the one from JS, even on multiple retries. One of my suspicions is that a request for the client ID is in-flight by Gecko [1] and so when we force the reset and/or access the client ID, it completes the request, causing the client ID file to be overwritten. It's unclear why this would happen after even after resets. [1]: I don't think this is possible with the JS cache primed. Also, it's crazy that this could happen even over the course of multiple retries. - I just noticed that my push from comment 32 does NOT fail in taskcluster but is intermittent on buildbot. Interesting! Geoff, do you know what's different between the configurations?
Flags: needinfo?(gbrown)
> Geoff, do you know what's different between the configurations? Yes - speed! Robocop tests run on a faster (more memory, more cores) aws instance type when run from taskcluster. (That is not the case for most test types migrated from buildbot to taskcluster, but I found that robocop was struggling on taskcluster until I bumped up the aws capabilities.) So, that might be a clue: Maybe the test fails when some change/response/etc is delayed?
Flags: needinfo?(gbrown)
When the emulator is slow, bugs typically exposes themselves in testing but not on device – I wonder if we should just move the TC runs to be the official test results, to avoid issues like this in the future. Any thoughts, Geoff? That being said, I'm going to be heads down in NMX work so I won't be able to investigate this for at least a few months. I'm concerned that the current code may muffle true failures (see comment 28 for more) but it's better than it was before: failing all the time (thanks Geoff!). CC Sebastian and Georg for awareness that this test is not fully covering what it should be.
Assignee: michael.l.comella → nobody
Flags: needinfo?(gbrown)
Running tests only on TC is the goal, but I don't think there is a reliable time-line for that. I have a feeling that we are at least a few months away from retiring the buildbot tests. (Bug 1280570 is one work in progress that needs to be resolved before we can advance TC Android goals.)
Flags: needinfo?(gbrown)
Is there a meta bug for running tests primarily on TC? I'd like to add this (or a new bug) as depending on that so, once we do that, we can revert the change here and see how if the code still passes.
Flags: needinfo?(gbrown)
Bug 1270544 might be best (see also bug 1141248 and bug 1118394).
Flags: needinfo?(gbrown)
(In reply to Michael Comella (:mcomella) [not actively working on fennec: expect slow responses] from comment #37) > Is there a meta bug for running tests primarily on TC? I'd like to add this > (or a new bug) as depending on that so, once we do that, we can revert the > change here and see how if the code still passes. Filed bug 1308336 blocking bug 1270544. This bug does not appear to be getting any more failures after comment 24's patch so I'm going to close it. I added the related intermittent, bug 1295188, to the "See also" of the new bug so we can more easily keep track of it.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Comment on attachment 8788275 [details] [diff] [review] relax test to allow for a change in client ID Despite negative feedback, this is the patch that landed. Please see comment 28+ for details.
Attachment #8788275 - Flags: feedback- → review+
Target Milestone: --- → Firefox 51
Product: Firefox for Android → Firefox for Android Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: