Closed Bug 1374863 Opened 2 years ago Closed 2 years ago

'browser.engagement.tab_open_event_count' fails to properly count in automated tests.

Categories

(Toolkit :: Telemetry, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla56
Tracking Status
firefox56 --- fixed

People

(Reporter: Silne30, Assigned: Silne30)

Details

Attachments

(4 files)

When running the attached automated test, an initial tab is open when the browser opens. Two more tabs open totaling three tabs. When the ping is sent, 'browser.engagement.tab_open_event_count' == 1 rather than 2. This has recently surfaced as this test did not fail last week. When running the test manually, it counts properly. I have added conditions to wait for the tab to open but it still fails in automation.
I wonder if it's something related to what :Dexter said in bug 1271304 comment 1 about how open events during session restore are ignored. Or maybe a window's load event fires later than normal these days?

Do you have a regression window?
Flags: needinfo?(jdorlus)
Since the tests were not up and running yet, I don't have an exact regression window. I am just going to go back to when I remember it working and try to work from there.
Flags: needinfo?(jdorlus)
It was definitely working when b446a0c0642f8b8f705d9ea7808d09d857e8dab3 was made and I created the test. This was early april. I am running a git bisect to find out when the bug was introduced.
I don't think it's related to 1271304 since I am not restoring an old session in the automated test. It's just a regular session and the tab scalar seems to increment incorrectly. Only happens when I run the test in an automated way, not if I do it manually.
I was told that there was a mochitest covering most of these scenarios. Did it not catch anything like this?
Flags: needinfo?(chutten)
I ran a git bisect and have a regression range somewhere between 
[ab0396e461ae03210b1ec8110ba6dc66dc61ca96] Bug 1297072 - added 2d matrix utility functions; r=gregtatum   and 
[c33780a246c738ae1077ef9d77ebecf412452a4f] Bug 1358921: Lazily load SelfSupportBackend.jsm when first needed. r=florian

This is an approximation since it's my first time using the bisect tool. I would run the bisect, rerun the test , etc. I hope this helps.
No idea, I'm afraid. Alessio?
Flags: needinfo?(chutten) → needinfo?(alessio.placitelli)
(In reply to Chris H-C :chutten from comment #8)
> No idea, I'm afraid. Alessio?

John, could you provide some more context on the test? Is it running on Windows and, if so, which Windows version?
Does it happen on all platforms?

It would also be very useful to have a Telemetry log trace about the failure. Let me know if you have problems getting it!
Flags: needinfo?(alessio.placitelli) → needinfo?(jdorlus)
Context: on Windows 10 we open a new, additional tab by default on new profiles.
Context: I haven't run this test on windows. This fails on Linux only. And we are short a tab opening not over by 1. In the regression window, the test works on one end of the window and fails on the other. It could have to do with the speed of tabs opening. But I am not sure. But it doesn't fail when I try to reproduce manually. 

Env:

Linux 64 Bit
Test Run Method: Automated
Flags: needinfo?(jdorlus) → needinfo?(alessio.placitelli)
I will  get you some gecko logs.
Flags: needinfo?(alessio.placitelli)
Attached file geckologs.txt
I got you some gecko logs for the test that runs.
Attachment #8882456 - Flags: feedback?(alessio.placitelli)
(In reply to John Dorlus [:Silne30] from comment #13)
> Created attachment 8882456 [details]
> geckologs.txt
> 
> I got you some gecko logs for the test that runs.

Thanks! I'm seeing this error in the log file:

> TEST-UNEXPECTED-ERROR | test_debug_tabs_scalars.py TestMainTabScalars.test_main_tab_scalars | KeyError: 'browser.engagement.tab_open_event_count'

It looks like the ping you're examining doesn't contain that scalar. On a closer look, it seems that your ping data doesn't contain any scalar data. Are you sure you're looking at the correct ping and not at the 'saved-session' ping? The latter does not contain any scalar data.
Attachment #8882456 - Flags: feedback?(alessio.placitelli)
Is this failure actually from a landed test?
If not, then you should investigate this as part of the bug you are working on.
Flags: needinfo?(jdorlus)
This was caught by a landed test.
Flags: needinfo?(jdorlus) → needinfo?(gfritzsche)
Yes. One of the landed tests caught this. 1352269
Clearing needinfo as i don't see any question.
Note that there are still open questions for you here from 6 days ago.
Flags: needinfo?(gfritzsche)
Ah. I missed that. I am rerunning the test to make sure. Thanks.
Attached file mainpinglog.txt
Attachment #8884108 - Flags: review?(alessio.placitelli)
Comment on attachment 8884108 [details]
mainpinglog.txt

Hey John, I'm not sure what I should check/review in this log. Is anything from comment 14 unclear?
Flags: needinfo?(jdorlus)
Attachment #8884108 - Flags: review?(alessio.placitelli)
No. I addressed comment 14. You were right. That was a saved-session ping. But this time, I have uploaded the logs from the mainping and the count is still wrong.
Flags: needinfo?(jdorlus) → needinfo?(alessio.placitelli)
I'm not familiar with the test, could you briefly describe what do you expect? I see that the dumped ping does not contain the 'tab_open_event_count' scalar at all. Moreover, the 'subsessionCounter' is at 2, that means that a subsession split was performed before this ping and that scalars were cleared.

You're looking at a 'main-ping' with reason 'shutdown'. Is this what you're looking for? You're probably more interested in the ping from the other subsession split. But, again, as I'm not familiar with the test I don't know :)
Flags: needinfo?(jdorlus)
Flags: needinfo?(alessio.placitelli)
Hey Alessio,

Yes. The received ping is correct. The reason the subsession counter is at 2 is because in the beginning of the tests, when we change the telemetry prefs, marionette restarts the browser. So it rightfully is on the second subsession when the test starts. The problem is also not the ping because all of the other scalars that are also part of that subsession are there and accounted for. There is something up with the tab open event counter specifically. If I comment out the assertion for that scalar, the test passes 50/50 times. If I don't comment out that scalar, the test fails 50/50 times. What's interesting is that sometimes when the test fails, the scalar is not in the ping at all. Sometimes it's an assertion error. This is how the test runs.

Open a browser.
Make changes to telemetry prefs.
Restart Browser (Starting second subsession)
Open a second tab 
Wait for second tab to open
Open Third tab
Wait for third tab to open
Close third tab
Wait for third tab to close
Switch to second tab
Close second tab
Wait for second tab to close
Switch to first tab
Restart Browser
Wait for ping

The test asserts that max_concurrent_tab_count == 3 which passes
It asserts that max_concurrent_window_count == 1 which passes
But asserting tab_open_event_count == 2 fails. (1 is recorded in the ping or it's not there at all.)

This is really strange behavior but it seems to be something that only an automated test brings out. May be the speed of the tabs opening causes a lack of the count incrementing. Not sure. But this is not a mistake. Something is legitimately failing.
Flags: needinfo?(jdorlus) → needinfo?(alessio.placitelli)
I ran the test 100 times and it passed 8 times and failed 92. Perhaps there is some kind of race condition happening.
(In reply to John Dorlus [:Silne30] from comment #24)
> Hey Alessio,
> 
> Yes. The received ping is correct. The reason the subsession counter is at 2
> is because in the beginning of the tests, when we change the telemetry
> prefs, marionette restarts the browser. So it rightfully is on the second
> subsession when the test starts. The problem is also not the ping because
> all of the other scalars that are also part of that subsession are there and
> accounted for.

The |subsessionCounter| is the number of subsessions since the browser started, not since the profile was created [1].
Restarting the browser does not increase the |subsessionCounter|. This value is reset at every restart.

Are you absolutely sure there's no other ping coming in? I would suggest you dump *all* the pings you receive since the beginning of the test.

> There is something up with the tab open event counter
> specifically. If I comment out the assertion for that scalar, the test
> passes 50/50 times. If I don't comment out that scalar, the test fails 50/50
> times. What's interesting is that sometimes when the test fails, the scalar
> is not in the ping at all. Sometimes it's an assertion error. This is how
> the test runs.

At each session split, where the subsessionCounter is increased, all the scalars are reset. Some of them are set again (e.g. maximum number of tabs and windows), and there's why you see them in all the main pings despite of the subsession split. The particular one that's causing a failure here only gets set when a new tab is created.

Let's look at the test flow below.
 
> Open a browser.
> Make changes to telemetry prefs.
> Restart Browser (Starting second subsession)

Nope, you end the previous session here. Then start a new session when Firefox restarts.
So, when running again, you are at subsessionCounter = 1 unless there was a subsession split.

Moreover, when restarting the browser here, you would will have at least two pings on disk: saved-session and 'main-ping' with reason 'shutdown'. These are related to the first browsing session in the above 3 lines, and you are not really interested in them.

> Open a second tab 
> Wait for second tab to open
> Open Third tab
> Wait for third tab to open
> Close third tab
> Wait for third tab to close
> Switch to second tab
> Close second tab
> Wait for second tab to close
> Switch to first tab
> Restart Browser
> Wait for ping
> The test asserts that max_concurrent_tab_count == 3 which passes
> It asserts that max_concurrent_window_count == 1 which passes
> But asserting tab_open_event_count == 2 fails. (1 is recorded in the ping or
> it's not there at all.)

Which makes me thing, again, that something is triggering a subsession split in between your test or that you're opening the tabs before the session-restore message is received, meaning that the open-tab events are ignored (the max count isn't because it's simply counting independently).

> This is really strange behavior but it seems to be something that only an
> automated test brings out. May be the speed of the tabs opening causes a
> lack of the count incrementing. Not sure. But this is not a mistake.
> Something is legitimately failing.

I think we're still unable to show this is a legitimate issue with the code: we do have mochitest covering this [2] which are not failing and there are a few grey areas in the analysis that need to be cleared up (are we receiving multiple pings? are we waiting for the session restore?).

[1] - http://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/data/main-ping.html
[2] - http://searchfox.org/mozilla-central/search?q=&case=false&regexp=false&path=browser_UsageTelemetry
Flags: needinfo?(alessio.placitelli) → needinfo?(jdorlus)
Per IRC comment, going to add a 5 second wait after the first subsession split when we restart the browser after changing the prefs to see if that is causing the issue.
Flags: needinfo?(jdorlus)
Attachment #8887598 - Flags: review?(chutten)
Comment on attachment 8887598 [details]
Bug 1374863 - Added wait for telemetry to reinitialize after reboot when setting gecko prefs.

https://reviewboard.mozilla.org/r/157084/#review163788

Does what it says on the tin... 

But is there nothing better than time.sleep? testReset and friends have promises that you can await in JS.
Attachment #8887598 - Flags: review?(chutten) → review+
Comment on attachment 8887598 [details]
Bug 1374863 - Added wait for telemetry to reinitialize after reboot when setting gecko prefs.

https://reviewboard.mozilla.org/r/157084/#review163788

Unfortunately not.
Assignee: nobody → jdorlus
Status: NEW → ASSIGNED
Keywords: checkin-needed
We're sorry, Autoland could not rebase your commits for you automatically. Please manually rebase your commits and try again.

hg error in cmd: hg rebase -s 7626131f953b -d 664f305ee1a5: rebasing 408345:7626131f953b "Bug 1374863 - Added wait for telemetry to reinitialize after reboot when setting gecko prefs. r=chutten" (tip)
merging toolkit/components/telemetry/tests/marionette/harness/telemetry_harness/testcase.py
warning: conflicts while merging toolkit/components/telemetry/tests/marionette/harness/telemetry_harness/testcase.py! (edit, then use 'hg resolve --mark')
unresolved conflicts (see hg resolve, then hg rebase --continue)
We're sorry, Autoland could not rebase your commits for you automatically. Please manually rebase your commits and try again.

hg error in cmd: hg rebase -s 242b5dc9be46 -d 6fc6a92ad62e: rebasing 408932:242b5dc9be46 "Bug 1374863 - Added wait for telemetry to reinitialize after reboot when setting gecko prefs. r=chutten" (tip)
merging toolkit/components/telemetry/tests/marionette/harness/telemetry_harness/testcase.py
warning: conflicts while merging toolkit/components/telemetry/tests/marionette/harness/telemetry_harness/testcase.py! (edit, then use 'hg resolve --mark')
unresolved conflicts (see hg resolve, then hg rebase --continue)
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/614cdfbf0e2c
Added wait for telemetry to reinitialize after reboot when setting gecko prefs. r=chutten
https://hg.mozilla.org/mozilla-central/rev/614cdfbf0e2c
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla56
You need to log in before you can comment on or make changes to this bug.