Closed Bug 1308453 Opened 3 years ago Closed 3 years ago

Firefox-UI update tests failing after fix in bug 1302707, with MarionetteException: Not a Number

Categories

(Testing :: Marionette, defect, major)

50 Branch
defect
Not set
major

Tracking

(firefox50 wontfix, firefox51 fixed, firefox52 fixed)

RESOLVED FIXED
mozilla52
Tracking Status
firefox50 --- wontfix
firefox51 --- fixed
firefox52 --- fixed

People

(Reporter: FlorinMezei, Assigned: ato)

References

Details

Attachments

(1 file)

All ondemand update tests failed on Beta 5: https://treeherder.mozilla.org/#/jobs?repo=mozilla-beta&revision=49776d31766d&group_state=expanded&filter-tier=3&filter-searchStr=Fxup-beta-cdntest(.

All failures are: TEST-UNEXPECTED-ERROR | test_direct_update.py TestDirectUpdate.test_update | MarionetteException: MarionetteException: Not a Number.

This was not showing in Beta 4, so the most likely regressor is: https://bugzilla.mozilla.org/show_bug.cgi?id=1302707 (https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=f4b45ea3c0c8&tochange=49776d31766d).
Flags: needinfo?(ato)
This is not only for ondemand update tests but also happens permanently on Aurora and Nightly. Problem here is:

03:52:54     INFO -   File "c:\jenkins\workspace\ondemand_update\build\venv\lib\site-packages\marionette\marionette_test.py", line 440, in setUp
03:52:54     INFO -     self.marionette.reset_timeouts()
03:52:54     INFO -   File "c:\jenkins\workspace\ondemand_update\build\venv\lib\site-packages\marionette_driver\marionette.py", line 740, in reset_timeouts
03:52:54     INFO -     self.set_page_load_timeout(30000)
03:52:54     INFO -   File "c:\jenkins\workspace\ondemand_update\build\venv\lib\site-packages\marionette_driver\marionette.py", line 1208, in set_page_load_timeout
03:52:54     INFO -     self._send_message("timeouts", {"page load": timeout})
03:52:54     INFO -   File "c:\jenkins\workspace\ondemand_update\build\venv\lib\site-packages\marionette_driver\decorators.py", line 42, in _
03:52:54     INFO -     return func(*args, **kwargs)
03:52:54     INFO -   File "c:\jenkins\workspace\ondemand_update\build\venv\lib\site-packages\marionette_driver\marionette.py", line 690, in _send_message
03:52:54     INFO -     self._handle_error(err)
03:52:54     INFO -   File "c:\jenkins\workspace\ondemand_update\build\venv\lib\site-packages\marionette_driver\marionette.py", line 723, in _handle_error
03:52:54     INFO -     raise errors.lookup(error)(message, stacktrace=stacktrace)

Looks like `reset_timeouts()` is not working correctly.
Summary: Ondemand Update tests failing after fix in bug 1302707, with MarionetteException: Not a Number → Firefox-UI update tests failing after fix in bug 1302707, with MarionetteException: Not a Number
To add, this is actually blocking QA from signing off for Firefox 50 Beta 5. So we should get this fixed ASAP.
Severity: normal → critical
We have concerns with the 50.0beta stability in general. Because beta 5 contains important fixes, I would like us to move fast. Can we backout changes in bug 1302707?
I’m guessing the out-of-repo (?) ondemand_update uses an unpinned version of the Marionette Python client that tries to communicate with a Firefox that does not have the protocol changes in bug 1302707.

The patches in bug 1302707 change the server’s Marionette:timeouts command to be backwards compatible with earlier clients, but the changes in the client aren’t backwards compatible with earlier servers.

Since we aren’t particularly attached to the changes in bug 1302707, it’s fine to back them out of beta.
Flags: needinfo?(ato)
(In reply to Andreas Tolfsen ‹:ato› from comment #4)
> I’m guessing the out-of-repo (?) ondemand_update uses an unpinned version of
> the Marionette Python client that tries to communicate with a Firefox that
> does not have the protocol changes in bug 1302707.

Those update tests (here for beta) are testing updates from older releases of Firefox eg. Firefox 49 Beta X or even earlier. As base those tests make use of the test package for the target Firefox version. It means that for the above case Firefox 49 b3 uses a test package with the related timeout changes included.

> The patches in bug 1302707 change the server’s Marionette:timeouts command
> to be backwards compatible with earlier clients, but the changes in the
> client aren’t backwards compatible with earlier servers.

The came client is used here for both versions, but different servers. So that is causing the problem and also reflects why we had partial test bustage on mozilla-central and mozilla-aurora for a couple of days. Once the source build contained all the changes server and client wise the failure is gone.

But keep in mind that this will not be the case for our release update tests, at least not for the next 3+ releases! We have to find a way to properly fix it, or we can scratch the update tests.

> Since we aren’t particularly attached to the changes in bug 1302707, it’s
> fine to back them out of beta.

It would only be a temporary fix as stated above, which would allow us to run update tests for beta 5. But it would also mean that any upcoming update tests which use Beta 5 as source build will fail too because of the backout - at least if we aren't able to fix it properly until the next beta release.
Yes, I agree to all of your points above.

I also wrote a little summary in https://bugzilla.mozilla.org/show_bug.cgi?id=1302707#c72 about the current situation with making changes to the Marionette client.

(In reply to Henrik Skupin (:whimboo) from comment #5)
> (In reply to Andreas Tolfsen ‹:ato› from comment #4)
> > Since we aren’t particularly attached to the changes in bug 1302707, it’s
> > fine to back them out of beta.
> 
> It would only be a temporary fix as stated above, which would allow us to
> run update tests for beta 5. But it would also mean that any upcoming update
> tests which use Beta 5 as source build will fail too because of the backout
> - at least if we aren't able to fix it properly until the next beta release.

Should we use this bug to track the compatibility fix?

I think it should look something like this:

    def set_page_load_timeout(self, timeout):
        try:
            self._send_message("timeouts", {"page load": timeout})
        except MarionetteException:
            self._send_message("timeouts", {"type": "page load", "ms": ms}
(In reply to Andreas Tolfsen ‹:ato› from comment #6)
> Should we use this bug to track the compatibility fix?

That sounds fine to me. 

> I think it should look something like this:
> 
>     def set_page_load_timeout(self, timeout):
>         try:
>             self._send_message("timeouts", {"page load": timeout})
>         except MarionetteException:
>             self._send_message("timeouts", {"type": "page load", "ms": ms}

I assume we cannot do anything regarding the protocol level? Or has this also been bumped for this API change? I'm asking because in case of another change in that area, we could end-up with spaghetti-code trying to many different ways in setting this timeout.
I don’t normally bump the protocol level unless there are more fundamental, structural changes to the data model.

We could of course do that regularly whenever there is an API or type change, but I don’t think a magical number is any better than the above:

    def some_function(self):
        if self.protocol <= 14:
            # do something
        elfi self.protocol >= 23:
            # something else
Whatever you think is better. But I think we should at least add a Firefox version to the fallback case, so we can remove this code whenever we no longer have to support it.
Maybe it’s more reasonable to encode and use the Gecko version number when managing shims such as these, so we know exactly in which version an API changed without having to jump through the hoops of looking up in the code when some API disappeared or which Marionette protocol version is tied to what release of Firefox?

    def set_page_load_timeout(self, timeout):
        if self._gecko > 52:
            self._send_message("timeouts", {"page load": timeout})
        else:
            self._send_message("timeouts", {"type": "page load", "ms": timeout})

A possible improvement over this is having some sort of “bridge”, where we detach the protocol implementation from the client API and choose what bridge level we want at a much earlier point.  We should do this if we really care about the code quality of the Marionette client:

    class Gecko34Bridge(Bridge):
        def timeout(self, type, ms):
            self._send_message("timeouts", {"type": type, "ms": ms})

    class Gecko52Bridge(Gecko34Bridge):
        def timeout(self, type, ms):
            self._send_message("timeouts", {type: ms})

    def start_session(self):
        …
        self.bridge = # some negotiation code
        …

    …

    def set_page_load_timeout(self, timeout):
        self.bridge.timeouts("page load", timeout)

But I don’t think I particularly want to put in the effort to code something like that.
(In reply to Andreas Tolfsen ‹:ato› from comment #10)
> Maybe it’s more reasonable to encode and use the Gecko version number when
> managing shims such as these, so we know exactly in which version an API
> changed without having to jump through the hoops of looking up in the code
> when some API disappeared or which Marionette protocol version is tied to
> what release of Firefox?
> 
>     def set_page_load_timeout(self, timeout):
>         if self._gecko > 52:
>             self._send_message("timeouts", {"page load": timeout})
>         else:
>             self._send_message("timeouts", {"type": "page load", "ms":
> timeout})

Sounds good bug it would still cause a failure for update tests in the next couple of days as long as the source build is on the same version number but does not have the changes included. So only a combination of version and buildnumber would satisfy that. 

> A possible improvement over this is having some sort of “bridge”, where we
> detach the protocol implementation from the client API and choose what
> bridge level we want at a much earlier point.  We should do this if we
> really care about the code quality of the Marionette client:
[..] 
> But I don’t think I particularly want to put in the effort to code something
> like that.

I feel the same way. 

Maybe we stay with try/catch for now?
There is such a thing as a Services.appinfo.platformBuildId too which I presume is more specific, but I don’t know how much we really care about intra-version incompatibility?  I guess I’m trying to imagine how this would fail in the future if we follow this model, disregarding this current fallout.
The platform buildid should not be used currently. See bug 1298328. Otherwise it's just a question how complicated a check like that should be. I think in the case that we know about a breaking API change it might be fine that tests are broken. But it should definitely not happen for Beta and Release builds.
Andreas, I assume it's fine to assign this bug to you.
Assignee: nobody → ato
Severity: critical → major
Status: NEW → ASSIGNED
Comment on attachment 8801722 [details]
Bug 1308453 - Backwards compat for old Marionette:timeouts API;

https://reviewboard.mozilla.org/r/86394/#review85108

::: testing/marionette/client/marionette_driver/marionette.py:1240
(Diff revision 2)
>  
>          """
> +        try:
> -        self._send_message("timeouts", {"script": timeout})
> +            self._send_message("timeouts", {"script": timeout})
> +        except errors.MarionetteException as e:
> +            if "Not a Number" in e.message:

Mind adding a comment here which refers 52.0a? Otherwise no-one will most likely remove it when 45.0esr gets deprecated.

Same for the other changes.
Attachment #8801722 - Flags: review?(hskupin) → review+
Comment on attachment 8801722 [details]
Bug 1308453 - Backwards compat for old Marionette:timeouts API;

https://reviewboard.mozilla.org/r/86394/#review85112

::: testing/marionette/client/marionette_driver/marionette.py:1240
(Diff revisions 2 - 3)
>  
>          """
>          try:
>              self._send_message("timeouts", {"script": timeout})
>          except errors.MarionetteException as e:
> +            # remove when 52.0a is stable

That is not true. We still support 45esr until 52.2.0.
Comment on attachment 8801722 [details]
Bug 1308453 - Backwards compat for old Marionette:timeouts API;

https://reviewboard.mozilla.org/r/86394/#review85112

::: testing/marionette/client/marionette_driver/marionette.py:1240
(Diff revisions 2 - 3)
>  
>          """
>          try:
>              self._send_message("timeouts", {"script": timeout})
>          except errors.MarionetteException as e:
> +            # remove when 52.0a is stable

That is not true. We still support 45esr until 52.2.0.
Pushed by atolfsen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/bf8f96812076
Backwards compat for old Marionette:timeouts API; r=whimboo
https://hg.mozilla.org/mozilla-central/rev/bf8f96812076
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
This is a test-only change which we would like to see uplifted to aurora to get a regression fixed. Thanks.
Whiteboard: [checkin-needed-aurora]
https://hg.mozilla.org/releases/mozilla-aurora/rev/59930726515c
Flags: in-testsuite+
Whiteboard: [checkin-needed-aurora]
To reduce confusion we should set status-firefox50 to wontfix. This patch never landed for that release, and the regressor got backed-out.
You need to log in before you can comment on or make changes to this bug.