Closed Bug 1284457 Opened 4 years ago Closed 3 years ago

Revise default socket_timeout of 360s

Categories

(Testing :: Marionette, defect)

49 Branch
defect
Not set

Tracking

(firefox49 fixed, firefox50 fixed, firefox51 fixed)

RESOLVED FIXED
mozilla51
Tracking Status
firefox49 --- fixed
firefox50 --- fixed
firefox51 --- fixed

People

(Reporter: whimboo, Assigned: whimboo)

References

Details

Attachments

(1 file)

While working on bug 1257476 I noticed that we have a very long hang in `socket.recv()` in case the application doesn't answer. Exactly these are 360s! Would you ever assume that you have to wait 360s until you get a response from the other side? 

I feel that we somehow mix timeouts together and that we bumped up this value due to possible timeouts during e.g. startup. But for that we actually have already the startup_timeout defined with 120s. So in case of socket failures we could still retry to send the same data (new session) until a response has been sent back. Once the connection has been established I wouldn't tolerate more than 60s socket timeouts. Maybe even that is already too high.

One more thing related to IOError exceptions in _send_message(). If those appear we have had a timeout of 360s, and an additional wait for the binary to shutdown of 120s. All in all this causes a 480s hang!

Not sure what the reason is for those long timeouts. Maybe David and Jonathan can give some information.
Flags: needinfo?(jgriffin)
Flags: needinfo?(dburns)
I don't really remember, but I suspect it may have something to do with B2G emulators. Since we no longer care about those, we can probably reduce those long timeouts.
Flags: needinfo?(jgriffin)
I am pretty sure that jgriffin is right here... EMulators in automation are ridiculously slow...
Flags: needinfo?(dburns)
If we add support for Fennec, presumably this runs inside the same Android emulator?  I do think it’s worth experimenting with reducing it.  Perhaps with an informed look into this we will be able to wait for the right events and reduce the need for individual timeouts that whimboo describes above.
With the attached proposal to reduce the default timeout to 60s I triggered a full try build:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=cdc42db5d37e

Lets see where we fail and what needs further improvements. Maybe it might be Fennec related.
The try push actually looks good. There is no job which is failing due to this change. Given that we do not run Marionette tests for Fennec on try yet, I will try those locally now.
Assignee: nobody → hskupin
Status: NEW → ASSIGNED
I would like to wait for bug 1284874 which would give me Fennec on try via TC for free.
Depends on: 1284874
Now that we have try support for Fennec I pushed the patch to Try:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b4cbb83ad385
All fine here with the try build for Fennec on Android. The one timeout related failure is not caused by my change but always happens on integration branches, and mozilla-central.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=b4cbb83ad385&filter-tier=1&filter-tier=2&filter-tier=3&selectedJob=26515865
Attachment #8781135 - Flags: review?(ato)
Comment on attachment 8781135 [details]
Bug 1284457 - Reduce default socket timeout for Marionette to 60s

https://reviewboard.mozilla.org/r/71636/#review73104

Hm, lots of try failures here but I don’t believe any of them are Marionette related.
Attachment #8781135 - Flags: review?(ato) → review+
Comment on attachment 8781135 [details]
Bug 1284457 - Reduce default socket timeout for Marionette to 60s

https://reviewboard.mozilla.org/r/71636/#review73104

Yes, none of them show any indication that it is related to this change. Also nearly all of them have existing bugs for known failures. The try build for Fennec is similar. It has lots of failing tests but this is known and covered by 1297394.
Pushed by hskupin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/799921e8d85a
Reduce default socket timeout for Marionette to 60s r=ato
https://hg.mozilla.org/mozilla-central/rev/799921e8d85a
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla51
See Also: → 1248056
As noticed this morning while checking marionette.py for other changes, I accidentally found bug 1248056. With it we raised the default timeout to 360s due to slowness with valgrind builds. I gave a comment over there that the fix was not that ideal, and that we should fix it specifically for the -valgrind option of mochitests but not for all test suites!

With the recent hangs in e10s we wasted a lot of machine hours in AWS, which we could have spent more wisely on other jobs. So I'm still behind this revert to 60s.
(In reply to Henrik Skupin (:whimboo) from comment #14)
> With the recent hangs in e10s we wasted a lot of machine hours in AWS, which
> we could have spent more wisely on other jobs. So I'm still behind this
> revert to 60s.

I wonder if we should even backport this patch to aurora and beta in case no major issues arise in the next couple of days. Btw I also talked to Julian Seward and he agrees to get this fixed for TC first, and care about developer needs afterward. 

David and Julian, what do you both think?
Flags: needinfo?(jseward)
Flags: needinfo?(dburns)
low risk to back port it.

What is the issue with TC? I can't see any mention of what the discussion was about in this bug
Flags: needinfo?(dburns)
We had random hangs in various tests due to bug 1294456, and there might be other cases in the future. So each time we spent 6 minutes in waiting for a socket timeout. Now it's only 1 minute.
This is a test-only change, so should be safe to uplift.  I think this fix was long overdue anyway.
It looks like that we agree here. Lets get this test-only uplifted to aurora and beta.
Flags: needinfo?(jseward)
Whiteboard: [checkin-needed-aurora][checkin-needed-beta]
Whiteboard: [checkin-needed-aurora][checkin-needed-beta]
You need to log in before you can comment on or make changes to this bug.