Closed Bug 1115126 Opened 10 years ago Closed 9 years ago

e2e fails intermittently (perma orange) on the test server

Categories

(Hello (Loop) :: Client, defect)

defect
Not set
normal
Points:
5

Tracking

(Not tracked)

RESOLVED FIXED
mozilla38
Iteration:
38.1 - 26 Jan

People

(Reporter: drno, Assigned: standard8)

References

Details

Attachments

(2 files, 3 obsolete files)

Since I turned on the e2e test here: http://10.252.73.218:8080/view/hello/job/hello-e2e-marionette/
it seem to keep failing. I watched the machines remotely today a little bit via remote desktop and noticed two failure scenarios:

1) Only the drop down menu comes down, but the new conversation window never comes up.
  - I'm wondering if the machine is just too slow. As I noticed before the "start a conversation" turns dark blue only after its done downloading stuff (TokBox libraries/dependencies) from the Internet.
    - potential work around: wait longer - super ugly
    - can the test get informed that the button just turned dark blue and became clickable?
  - I think we don't verify that clicking the button actually was successful in any way. We probably should add some verification, to prevent the test from proceeding and spitting out mis-leading error messages.

2) The standalone page actually got loaded, but then displayed an error message about missing access to camera and microphone?!
  - I have no explanation for this. It looks like the media.navigator.permission.disabled flag is still set.
Everything works as expected on my Mac. Trying a Linux build and test run...
Linux works for me too.

Just noticed that the mic/camera prompt on standalone comes up, before the "you are the first in the room" message appears. So maybe it is in fact no problem of mic/camera access, but simply means the standalone client for some reason failed to join the room?!
Just realized that I overlooked one issue before deploying onto the test server: audio and video sources!

We have not hooked up the fake sources yet. I assume that might be a reason for the mic/cam access message not going away (if the test gets that far).
Verified that calls fail if no camera is attached to the machine. The result is that the prompt for access to the camera and mic stays on for ever. The test then fails as the call never gets established (is this a bug already - we had just audio in this case?) and the faces at the end of the call never show up.
By modifying the TB SDK I managed to establish a call with Firefox fake audio and video devices. It works fine on my Linux desktop.
On the test server this results in the Loop client showing a preview from the fake video source. But the standalone client still gets stuck at the "We need access to your camera and microphone" gray overlay in the video rendering area (I'm not referring to the door hanger here). Normally that gray overlay should switch to the next screen with "You are the first one here" before connecting the call.

This makes me wonder if the new hint on camera/mic access has introduce either:
- a problem when we disable the doorhanger
- or some kind of race conditions between the sequence of screens above
Mark, Dan,
do you guys have any idea what could be going wrong here? See comment #5 for my most recent ideas/concerns.
Flags: needinfo?(standard8)
Flags: needinfo?(dmose)
So without the fake devices the browser console shows this error message from the TB SDK:
  "Unknown Error while getting user media"
The TokBox SDK hides the real error message "Internal Error" from Firefox.

But even if I enable the fake video source (it shows on the Loop client preview window) and setting the media.video_loopback_dev I still get GUM errors.
As I see the SDK error message in the Web Console, I think the GUM problem is now (for some unknown reason) on the standalone client side.
So by re-enabling the door hanger shows that the standalone client only has an audio source as long as I have set media.video_loopback_dev set. If I then manually accept the door hanger the call connects fine. So for this suggests that we have some kind of race condition here, because if there is no delay in accepting because the door hanger is disabled then the standalone client is stuck at the "we need access..." grey overlay.
Jesup pointed out in IRC to me that with e10s on right now Loop can not share the access to resources from Loop and Standalone at the same time as the two try to access the same resource from different processes.
I turned off e10s in the test, but the problem persists. Which makes sense, because on our desktop machine with real cameras e10s should/would be on all the time as well and does not cause problems.

Which now leaves us with two facts:
- it works on our desktop machines with real cameras
- it works on the test server if I turn on the permission door hanger

This looks more and more like some kind of race condition in the order of expected events to me.
I don't have any ideas off-hand for what could be the issue here. Maybe we could get together sometime today or tomorrow and do some brainstorming.
Flags: needinfo?(standard8)
If I had to guess, I'd suspect the multiplexGum.js code.   Any or all of jaws, Standard8, or I could potentially get on a call with you tomorrow to try to reason our way through that code, which is unfortunately quite painful.

Other possibilities include: 

* do the icky thing of putting in a longer timeout, then wait until bug 1081248 is fixed, after which time we should be able to rip out multiplexGum.js entirely, I think.

* splatter the multiplexGum.js with a ton of console-logging to be turned on in test runs only, then see if the failing runs have any noticable differences from the passing ones.

* there was another gUM design cleanup bug that I think :florian was thinking about or working, but I can't find it now.  If I'm remember correctly, having that get fixed would likely also allow us to tear out multiplexGum.js.  Florian, is that stuff we talked about at Mozlandia captured in a bug?
Iteration: --- → 36.3
Flags: needinfo?(dmose) → needinfo?(florian)
See Also: → 1064257
From reading the topmost comment in multiplexGum.js this would make a lot of sense that this is causing the issues I'm seeing on the server.

Not sure if I mentioned it before:
- with the Firefox build in fake sources only the Loop client, but not the standalone client seems to utilize that
- with fake camera devices from the OS the standalone client seems to work, but the Loop client spits out the gUM error message from the TokBox SDK

So far this did not make any sense to me, but if multiplexGum.js still does what its description at the top claims I could easily see how that logic (which is probably not written for multiple clients and multiple devices) could get confused and act strange.

BTW even though I'm using a release build, I have everything on the server setup so that I can easily replace any JS or test code with instrumented versions.
(In reply to Dan Mosedale (:dmose) - needinfo? me for response from comment #14)

> * there was another gUM design cleanup bug that I think :florian was
> thinking about or working, but I can't find it now.  If I'm remember
> correctly, having that get fixed would likely also allow us to tear out
> multiplexGum.js.  Florian, is that stuff we talked about at Mozlandia
> captured in a bug?

The large set of gUM UI redesign bugs is the list of bugs blocking bug 1048209.

The conclusion of our discussion at Mozlandia was that the things we want to fix soon are:
- making gUM prompts harder to dismiss by accident
- adding mute buttons.
Everything else will likely have to wait a bit longer.
Flags: needinfo?(florian)
(In reply to Dan Mosedale (:dmose) - needinfo? me for response from comment #14)

> ...wait until bug
> 1081248 is fixed, after which time we should be able to rip out
> multiplexGum.js entirely, I think.

I don't think so. The user experience without this approach is still pretty bad, even if the doorhanger is less fragile. I think the long-term approach is probably asking TB to give us hooks to hand them a video stream instead of calling gUM directly.
(In reply to Adam Roach [:abr] from comment #18)
> I don't think so. The user experience without this
> approach is still pretty
> bad, even if the doorhanger is less fragile.

Yeah, I forgot about the key point of actually being able to request perms early. 

> I think the long-term approach is probably asking TB to
> give us hooks to hand them a video stream instead
> of calling gUM directly.

I _think_ we more or less did that in our last Tokbox call, didn't we?  And we're now waiting on them to get back to us have some engineers discuss more directly?


(In reply to Florian Quèze [:florian] [:flo] from comment #17)
> The conclusion of our discussion at Mozlandia was that the things we want to
> fix soon are:
> - making gUM prompts harder to dismiss by accident

This still seems like it would be a substantial UX win on its own.  But having skimmed that dependency tree twice, it doesn't seem to be captured there.  Am I missing something?
Flags: needinfo?(florian)
Flags: needinfo?(adam)
(In reply to Nils Ohlmeier [:drno] from comment #16)
> Not sure if I mentioned it before:
> - with the Firefox build in fake sources only the Loop client, but not the
> standalone client seems to utilize that
> - with fake camera devices from the OS the standalone client seems to work,
> but the Loop client spits out the gUM error message from the TokBox SDK

Both of these things are interesting, and could be worth pairing with me, (or jaws, or Standard8, or abr) on.

> BTW even though I'm using a release build, I have everything on the server
> setup so that I can easily replace any JS or test code with instrumented
> versions.

That's great to hear!
(In reply to Dan Mosedale (:dmose) - needinfo? me for response from comment #19)
> (In reply to Adam Roach [:abr] from comment #18)
> > I think the long-term approach is probably asking TB to
> > give us hooks to hand them a video stream instead
> > of calling gUM directly.
> 
> I _think_ we more or less did that in our last Tokbox call, didn't we?  And
> we're now waiting on them to get back to us have some engineers discuss more
> directly?

Did we? I recall asking for the ability to mock video for testing purposes, but the easy way to do that would be for them to let us add constraints to the gUM call (which wouldn't solve this problem). I know we planned to have follow-on conversations; we should ensure that the "getting permissions early" use case is included in those. I've updated the notes accordingly.
Flags: needinfo?(adam)
With Dan's help today I realized that my attempt to patch the TokBox SDK to enable the Firefox fake sources was only half way successful, because I patched omni.ja in the downloaded Firefox.
But as the standalone client gets it's SDK from the standalone server I needed to patch the SDK in the standalone server as well.

Now the fake sources work in both clients.

BUT the test keeps failing quite a lot on trying to click the initial "Start a conversation" button (I therefore kept the test disabled on the server).
We need to find out why the button is initially disabled, and what triggers it switching to dark blue and this triggers seems to be missing quite a lot recently. So who is expert for that button?
(In reply to Nils Ohlmeier [:drno] from comment #22)
> BUT the test keeps failing quite a lot on trying to click the initial "Start
> a conversation" button (I therefore kept the test disabled on the server).
> We need to find out why the button is initially disabled, and what triggers
> it switching to dark blue and this triggers seems to be missing quite a lot
> recently. So who is expert for that button?

It gets disabled until the registration with the push and loop-server has been completed, and the rooms list loaded.

It looks like we need to implement a "wait for enabled" function.
(In reply to Mark Banner (:standard8) from comment #23)
> It gets disabled until the registration with the push and loop-server has
> been completed, and the rooms list loaded.
> 
> It looks like we need to implement a "wait for enabled" function.

Yes I totally agree that would be helpful for the test(s). I can look into using a polling loop for now.

But from my personal observation the button changing its color just taking time something must have changed within the last couple of days so that it looks like that this registration/load process never finishes on a freshly started browser. I'll probably try to dig into that more today.
Attached patch adapt_hello_e2e_to_rooms.patch (obsolete) — Splinter Review
This updates the test case to the latest rooms UI.

Unfortunately neither wait_for_element_displayed or wait_for_element_exists work when waiting for the "start a conversation" button to become clickable. Mark any idea how to replace the hardcoded wait for that case?
Flags: needinfo?(standard8)
Mark the other helpful information would be: how can I debug why the button isn't switching to active? What logs can I turn on to get an idea of what is going on?
(In reply to Dan Mosedale (:dmose) - needinfo? me for response from comment #19)

> (In reply to Florian Quèze [:florian] [:flo] from comment #17)
> > The conclusion of our discussion at Mozlandia was that the things we want to
> > fix soon are:
> > - making gUM prompts harder to dismiss by accident
> 
> This still seems like it would be a substantial UX win on its own.  But
> having skimmed that dependency tree twice, it doesn't seem to be captured
> there.  Am I missing something?

It's not in the dependency tree because that issue is independent of the redesign/simplification effort. I expect the work to happen in bug 1004061 where you are already cc'ed.
Flags: needinfo?(florian)
This adds a wait for element for the start conversation button. I did a hack or two whilst testing to extend the time for which the button was disabled initially for 5 seconds, and it seemed to work fine.
Nils, can you try the patch I just attached? I think it should work from my testing.
Flags: needinfo?(standard8) → needinfo?(drno)
I applied the patch for now locally on the test server. In one run it worked fine. Although I had single successful runs in the past as well. I'll leave the test enabled for now, that should result in a few more test triggers over the next couple of hours. And we can close this depending on how further test runs look like.
Flags: needinfo?(drno)
I did a bunch of work on the script today and tidied up the namings of scripts, and replaced some of the sleeps with waits.

It seems stable for me, apart from a couple of issues that I've noted and commented out the tests for.

I think we should probably get this into the tree, as the next baseline, and see how it performs on Nils' test suite.
Attachment #8550329 - Flags: review?(dmose)
Attachment #8550329 - Flags: feedback?(drno)
Attachment #8548877 - Attachment is obsolete: true
Attachment #8545550 - Attachment is obsolete: true
(In reply to Mark Banner (:standard8) from comment #31)
> Created attachment 8550329 [details] [diff] [review]
> Rework the functional test to have better namings for rooms, and add some
> timeout handling.
> 
> I did a bunch of work on the script today and tidied up the namings of
> scripts, and replaced some of the sleeps with waits.

Great. Thanks.
 
> It seems stable for me, apart from a couple of issues that I've noted and
> commented out the tests for.

From the few test runs we got before bug 1108049 made into the new Nightly build we can see already that we will get failures like this:

TEST-UNEXPECTED-ERROR | test_1_browser_call.py Test1BrowserCall.test_1_browser_call | TimeoutException: TimeoutException: Timed out after 10.1 seconds with message: Timed out waiting for element to be enabled

But I'm fine with landing something here and open a specific bug just for that button enabling issue.
 
> I think we should probably get this into the tree, as the next baseline, and
> see how it performs on Nils' test suite.

I already have something in place which overwrites the test on the server to allow me debugging issue we only see on the server. So we could speed up things by "landing" a new test case on the test server even before it gets to mozilla-central.
Comment on attachment 8550329 [details] [diff] [review]
Rework the functional test to have better namings for rooms, and add some timeout handling.

Review of attachment 8550329 [details] [diff] [review]:
-----------------------------------------------------------------

I would like to keep using the "Copy" button for the URL.

::: browser/components/loop/test/functional/test_1_browser_call.py
@@ +89,4 @@
>          self.start_a_conversation()
>  
> +        room_url = self.wait_for_element_displayed(By.CLASS_NAME, \
> +                                                   "room-url-link").text

The loop-panel-iframe disappears after you pressed the conversation button. So this is getting the URL from something which is not visible or accessible to user at this time.

Dan and me decided that it would be better to follow the user flow here and press the "Copy" button in the chat window. That code was included in my previous patch.
The other alternative could be to open the loop-panel-iframe again after you started a conversation and take the URL from there. That is something the user could do. But it does not strike me as a natural flow a user would follow.
Therefore I would prefer to stick to using the "Copy" button in the chatbox.

@@ +155,2 @@
>  
> +        # Switch to the conversation window and join the local room

We are actually not joining the room here, the local client is in it already waiting.

@@ +158,3 @@
>  
>          # hangup the call
> +        self.leave_room_and_verify_feedback()

How about we prefix all these calls with standalone_ or local_ to differentiate who is acting here?
Then we could also have each of these call initially the function for switching to the expected context, so that you don't have to worry about that any more.
Attachment #8550329 - Flags: feedback?(drno) → feedback-
(In reply to Nils Ohlmeier [:drno] from comment #33)
> ::: browser/components/loop/test/functional/test_1_browser_call.py
> @@ +89,4 @@
> >          self.start_a_conversation()
> >  
> > +        room_url = self.wait_for_element_displayed(By.CLASS_NAME, \
> > +                                                   "room-url-link").text
> 
> The loop-panel-iframe disappears after you pressed the conversation button.
> So this is getting the URL from something which is not visible or accessible
> to user at this time.
> 
> Dan and me decided that it would be better to follow the user flow here and
> press the "Copy" button in the chat window. That code was included in my
> previous patch.
> The other alternative could be to open the loop-panel-iframe again after you
> started a conversation and take the URL from there. That is something the
> user could do. But it does not strike me as a natural flow a user would
> follow.
> Therefore I would prefer to stick to using the "Copy" button in the chatbox.

Maybe in the interest of less brittle and less external dependencies the approach of re-opening the loop-panel-iframe is actually the better.
But the current code is completely misleading IMHO as the function call is named wait_for_element_displayed() and the panel is actually not visible at that time any more.
Comment on attachment 8550329 [details] [diff] [review]
Rework the functional test to have better namings for rooms, and add some timeout handling.

I'll wait for a new patch with Nil's suggested changes before reviewing.
Attachment #8550329 - Flags: review?(dmose)
Instead of opening a new bug report I simply put a sleep(120) at the very beginning of the test, once its done with the setup. And now everything seems to be passing all the time.
So I guess we are fighting some strange browser warmup delay thing. Maybe the reason we don't see it on our dev machines is that this startup delay gets disabled in self combiled debug builds and only added in official builds (the test server is running the official, public Nightly)?

Does anyone know if we can alter that warmup/startup delay via a user pref?
I increased the timeout for wait_for_enable from 10 to 120s for the weekend, because that should work for developers and test machines without modifying user prefs.
Updated patch based on the comments, and merged in the copy code - I used the pyperclip module as that is more supported than the older clipboard version (and the clipboard one points to it anyway).

I also extended the commented-out video tests for both standalone and local so that we'll test video both ways once we can enable it (after bug 1122486)

I think what's here is a big improvement to what's in the tree, hence why I want to get the updates landed. With these changes, I'd be much more confident that it will pass than without.
Attachment #8550329 - Attachment is obsolete: true
Attachment #8551329 - Flags: review?(drno)
Iteration: 36.3 → 38.1 - 26 Jan
Here is one little patch which I think should be included in the fix.

The user pref for allowing ICE over loopback was wrong.
And the timeout for the button is too short for the server (I just chose 120s as a long value and have not take any attempts to bisect what would be a safer value between 10s and 120s).

We are actually running this now on the test server and it works.

Mark could you include this with your patch?
Attachment #8552082 - Flags: feedback?(standard8)
Comment on attachment 8551329 [details] [diff] [review]
Rework the Loop functional test to have better namings for rooms, and improve some timeout handling.

Review of attachment 8551329 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM with the little patch I added to the ticket for the timeout and the ICE loopback.
Attachment #8551329 - Flags: review?(drno) → review+
Comment on attachment 8552082 [details] [diff] [review]
bug_1115126_increase_timeout_fix_loopback_pref.patch

Ah, a proper ice pref, that's probably why I had intermittent failures due to ice before.

Adjusting the timeout seems fine. I'm a little curious as to why its slower on the test machines, but having a longer one seems fine as well - as its really to catch the failure cases only.
Attachment #8552082 - Flags: feedback?(standard8) → feedback+
I combined the two and landed them:

https://hg.mozilla.org/integration/fx-team/rev/7c69208584e5
Assignee: nobody → standard8
Points: --- → 5
Target Milestone: --- → mozilla38
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: