Closed Bug 1632184 Opened 5 years ago Closed 5 years ago

Investigate integration tests not running in Fenix

Categories

(Data Platform and Tools :: Glean: SDK, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Dexter, Assigned: Dexter)

References

Details

Attachments

(1 file)

While investigating writing additional tests, Chris Hartjes found that the baseline ping test is currently broken in Fenix.

The fix is trivial, requires changing the linked line to:

assertEquals("background", baselinePing.getJSONObject("ping_info")["reason"])

However this leads us to a question: are tests even running in Fenix?

Assignee: nobody → alessio.placitelli
Priority: -- → P1

Richard, in this PR we added a test to Fenix. The test is broken, and should have been failing and getting people angry at us :-) However, the test is still broken and nobody complained. I suspect this is not running on CI. Do you know anything about that?

Flags: needinfo?(rpappalardo)
QA Contact: chartjes

(In reply to Alessio Placitelli [:Dexter] from comment #1)

Richard, in this PR we added a test to Fenix. The test is broken, and should have been failing and getting people angry at us :-) However, the test is still broken and nobody complained. I suspect this is not running on CI. Do you know anything about that?

Forwarding to Isabel... can you take a look?

Flags: needinfo?(rpappalardo) → needinfo?(irios.mozilla)

(In reply to Alessio Placitelli [:Dexter] from comment #1)

Richard, in this PR we added a test to Fenix. The test is broken, and should have been failing and getting people angry at us :-) However, the test is still broken and nobody complained. I suspect this is not running on CI. Do you know anything about that?

Hey Alessio, I think I know whats happening...the UI tests that run on CI are the ones under ui package: https://github.com/mozilla-mobile/fenix/blob/master/automation/taskcluster/androidTest/flank-x86.yml#L41. So, I'm afraid that this test has not run on CI ever :( sorry about that...that's why no one has noticed about it failing now.

There are at least two options to add it to CI:
-Add it to UI tests
-Create a new task to run it, which may be the best option if you expect to have more tests for glean...

Hope that helps :) and please ping me if you want to talk more about the options/your preferences to have it running on CI.

Flags: needinfo?(irios.mozilla)

(In reply to Isabel Rios[:isabel_rios] from comment #3)

Hey Alessio, I think I know whats happening...the UI tests that run on CI are the ones under ui package: https://github.com/mozilla-mobile/fenix/blob/master/automation/taskcluster/androidTest/flank-x86.yml#L41. So, I'm afraid that this test has not run on CI ever :( sorry about that...that's why no one has noticed about it failing now.

Ouch, that's bad, but it explains what we're seeing :)

There are at least two options to add it to CI:
-Add it to UI tests
-Create a new task to run it, which may be the best option if you expect to have more tests for glean...

Hope that helps :) and please ping me if you want to talk more about the options/your preferences to have it running on CI.

What does it take to create a new task?
Would it be possible to "simply" add a new entry, i.e. - package org.mozilla.fenix.glean here, below the - package org.mozilla.fenix.ui, under test-targets?

Flags: needinfo?(irios.mozilla)

What does it take to create a new task?
Would it be possible to "simply" add a new entry, i.e. - package org.mozilla.fenix.glean here, below the - package org.mozilla.fenix.ui, under test-targets?

I think so, that's the option I would like to try first. Let me check and I will come back to you :)

Flags: needinfo?(irios.mozilla)

Hey Alessio,

Looks like that solution works. I created this PR: https://github.com/mozilla-mobile/fenix/pull/10146 and the test runs there.

Since test failures block PRs merge on Fenix, we would need to wait until the test is fixed to land this. Does that sound good?

FYi, This is the tests result report: https://console.firebase.google.com/project/moz-fenix/testlab/histories/bh.66b7091e15d53d45/matrices/8548420857237243210

Flags: needinfo?(alessio.placitelli)

(In reply to Isabel Rios[:isabel_rios] from comment #6)

Hey Alessio,

Looks like that solution works. I created this PR: https://github.com/mozilla-mobile/fenix/pull/10146 and the test runs there.

Since test failures block PRs merge on Fenix, we would need to wait until the test is fixed to land this. Does that sound good?

FYi, This is the tests result report: https://console.firebase.google.com/project/moz-fenix/testlab/histories/bh.66b7091e15d53d45/matrices/8548420857237243210

Great, thanks! Let me quickly create a PR for fixing the test. You will be able to rebase once that's merged. Thank you so much for looking into this, much appreciated.

I will ping you again once that's merged.

Flags: needinfo?(alessio.placitelli)

(In reply to Isabel Rios[:isabel_rios] from comment #6)

Hey Alessio,

Looks like that solution works. I created this PR: https://github.com/mozilla-mobile/fenix/pull/10146 and the test runs there.

Since test failures block PRs merge on Fenix, we would need to wait until the test is fixed to land this. Does that sound good?

FYi, This is the tests result report: https://console.firebase.google.com/project/moz-fenix/testlab/histories/bh.66b7091e15d53d45/matrices/8548420857237243210

Thanks Isabel! I think I fixed it in https://github.com/mozilla-mobile/fenix/pull/10152 (to be reviewed and merged). Any chance you could kindly check if that fixes the problem on CI?

Flags: needinfo?(irios.mozilla)

The test works for me locally but not on CI ...
These are the logs.. https://console.firebase.google.com/project/moz-fenix/testlab/histories/bh.66b7091e15d53d45/matrices/4775591608566573438/executions/bs.2b918418d317918/testcases/1/errors

Do you have access? If not I will paste them here...

Flags: needinfo?(irios.mozilla) → needinfo?(alessio.placitelli)

(In reply to Isabel Rios[:isabel_rios] from comment #9)

The test works for me locally but not on CI ...
These are the logs.. https://console.firebase.google.com/project/moz-fenix/testlab/histories/bh.66b7091e15d53d45/matrices/4775591608566573438/executions/bs.2b918418d317918/testcases/1/errors

Do you have access? If not I will paste them here...

Thanks Isabel, that was really helpful (yes, I do have access).

By looking at the logs, it seems that, on CI, the recently introduced startup-timeline ping is correctly getting triggered before the baseline ping in this test.

...
04-23 15:52:43.087: I/libglean_ffi(7789): glean_core::ping: Collecting startup-timeline
04-23 15:52:43.130: D/libglean_ffi(7789): glean_core::ping: Storing ping '6fcd14c0-e157-4fb3-b833-aace130b2129' at '/data/user/0/org.mozilla.fenix.debug/glean_data/pending_pings/6fcd14c0-e157-4fb3-b833-aace130b2129'
04-23 15:52:43.130: I/libglean_ffi(7789): glean_core: The ping 'startup-timeline' was submitted and will be sent as soon as possible
...
04-23 15:52:43.545: I/libglean_ffi(7789): glean_core::ping: Collecting baseline
04-23 15:52:43.547: D/libglean_ffi(7789): glean_core::ping: Storing ping 'a6dce3b4-f7fa-43f2-be58-fd847314b0c8' at '/data/user/0/org.mozilla.fenix.debug/glean_data/pending_pings/a6dce3b4-f7fa-43f2-be58-fd847314b0c8'
04-23 15:52:43.547: I/libglean_ffi(7789): glean_core: The ping 'baseline' was submitted and will be sent as soon as possible
...
04-23 15:52:43.605: I/MockWebServer(7789): MockWebServer[59188] received request: POST /submit/org-mozilla-fenix-debug/startup-timeline/1/6fcd14c0-e157-4fb3-b833-aace130b2129 HTTP/1.1 and responded: HTTP/1.1 200 OK
04-23 15:52:43.702: D/glean/ConceptFetchHttpUploader(7789): Ping successfully sent (200)
04-23 15:52:43.705: D/glean/PingUploadWorker(7789): 6fcd14c0-e157-4fb3-b833-aace130b2129 was deleted: true
04-23 15:52:43.716: I/WM-WorkerWrapper(7789): Worker result SUCCESS for Work [ id=4a385ef8-ea41-47b2-917b-f85b4351bc2c, tags={ mozilla.telemetry.glean.scheduler.PingUploadWorker, mozac_service_glean_ping_upload_worker } ]

After that, the MockWebServer times out waiting for the baseline ping which, for some reason, never arrives. I'm investigating why.

Flags: needinfo?(alessio.placitelli)

Isabel, sorry to bother you again. I updated my PR with something that should fix the failure on CI as well.

I'm not able to consistently reproduce the problem locally, but after this fix I was never able to reproduce it again.

Any chance you could test this again on CI, maybe letting it run a few times?

Flags: needinfo?(irios.mozilla)

Sure! let me try that and let you know the results

Flags: needinfo?(irios.mozilla)

Unfortunately, the test is still failing on CI.. see logs:
https://console.firebase.google.com/project/moz-fenix/testlab/histories/bh.66b7091e15d53d45/matrices/9222329471283168538/executions/bs.8a6fc5353b0cefea/testcases/1/test-cases

You can also try to add the line in my PR (https://github.com/mozilla-mobile/fenix/pull/10146/files#diff-2505cc9c0e1fbaa0bc3575cd3373d042R42) to your PR so you can see the results when you add new changes to your tests and no need to wait for me testing (it can take sometime depending on the day/time :S)

Flags: needinfo?(alessio.placitelli)

Ok, here's some notes about what I was able to see. The notes in comment 10 are correct: the test is failing because the startup-timeline ping is generated and submitted before the other pings. Here's what I think is happening:

  1. Glean is initialized by the test here. Using the testrule makes sure that Glean is in test mode.
  2. The test moves the app to background after 1s.
  3. When going to background, both glean and fenix generate some pings. Glean is supposed to send a baseline with reason background, while Fenix generates a startup-timeline.
  4. The order with which the pings are generated is non deterministic and depends by the OS.
  5. Both pings end up calling submitPingByName down the line, which is async in general, but sync when in test mode.
  6. Both pings are successfully collected (see the logs). However, the first one dispatches the worker for upload, that scans the upload directory. The second one attempts to dispatch the worker, but since we use a KEEP policy it basically does nothing. If the job dispatched from the first ping finishes iterating the directory while the second ping is still being written to disk, then we don't upload it. No data is lost: next time some upload is triggered we will upload that one too.

Jan-Erik, I suspect (6) will be dealt with using the new upload mechanism. Thoughts?

Flags: needinfo?(alessio.placitelli) → needinfo?(jrediger)

Yup, the new mechanism should deal with it, but we should probably add a test for this exact case just to be sure. I'll file a bug.

Flags: needinfo?(jrediger)
Attached file GitHub Pull Request
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: