Closed Bug 1284457 Opened 9 years ago Closed 9 years ago

Revise default socket_timeout of 360s

Tracking

(firefox49 fixed, firefox50 fixed, firefox51 fixed)

Status: RESOLVED FIXED

Milestone: mozilla51

Tracking Flags:

             Tracking   Status
firefox49    ---        fixed
firefox50    ---        fixed
firefox51    ---        fixed

People

(Reporter: whimboo, Assigned: whimboo)

Attachments

(1 file)

Bug 1284457 - Reduce default socket timeout for Marionette to 60s
9 years ago, Henrik Skupin [:whimboo][⌚️UTC+2], 58 bytes, text/x-review-board-request, ato: review+

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Description

•

9 years ago

While working on bug 1257476 I noticed that we have a very long hang in `socket.recv()` in case the application doesn't answer. To be exact, it is 360s! Would you ever assume that you have to wait 360s until you get a response from the other side?

I feel that we somehow mixed timeouts together, and that we bumped up this value because of possible timeouts during e.g. startup. But for that we already have the startup_timeout, defined as 120s. So in case of socket failures we could still retry sending the same data (new session) until a response has been sent back. Once the connection has been established I wouldn't tolerate socket timeouts of more than 60s. Maybe even that is already too high.

One more thing related to IOError exceptions in _send_message(): if those appear, we have already had a timeout of 360s, plus an additional wait of 120s for the binary to shut down. All in all this causes a 480s hang!

I'm not sure what the reason is for those long timeouts. Maybe David and Jonathan can give some information.
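To illustrate the hang described above, here is a minimal, self-contained sketch in plain Python. It does not use Marionette's code: the constant and function names are made up for this example, and only the numbers (360s, 60s, 120s) come from this bug. It connects to a local peer that never answers and measures how long `socket.recv()` blocks before the socket timeout fires.

```python
import socket
import threading
import time

# Values quoted in this bug. The names are illustrative, not Marionette's own.
OLD_SOCKET_TIMEOUT = 360  # seconds
NEW_SOCKET_TIMEOUT = 60   # seconds
SHUTDOWN_WAIT = 120       # seconds spent waiting for the binary to shut down


def worst_case_hang(socket_timeout, shutdown_wait=SHUTDOWN_WAIT):
    """Socket timeout plus the additional wait for the binary to shut down."""
    return socket_timeout + shutdown_wait


def recv_from_silent_peer(timeout):
    """Connect to a local server that never sends data.

    Returns (timed_out, seconds_blocked_in_recv).
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    accepted = []

    def accept_and_stay_silent():
        conn, _ = server.accept()
        accepted.append(conn)  # keep the connection open, send nothing

    acceptor = threading.Thread(target=accept_and_stay_silent, daemon=True)
    acceptor.start()

    # The timeout passed here also applies to later recv() calls.
    client = socket.create_connection(server.getsockname(), timeout=timeout)
    start = time.monotonic()
    try:
        client.recv(1024)
        timed_out = False
    except socket.timeout:
        timed_out = True
    elapsed = time.monotonic() - start

    client.close()
    acceptor.join(timeout=1)
    for conn in accepted:
        conn.close()
    server.close()
    return timed_out, elapsed


if __name__ == "__main__":
    # A short timeout keeps the demo quick; the real values are 360s and 60s.
    print(recv_from_silent_peer(0.5))
    print(worst_case_hang(OLD_SOCKET_TIMEOUT))  # 480
    print(worst_case_hang(NEW_SOCKET_TIMEOUT))  # 180
```

With the shutdown wait unchanged, lowering the socket timeout from 360s to 60s would cut the worst case in this scenario from 480s to 180s.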

Flags: needinfo?(jgriffin)

Flags: needinfo?(dburns)

Jonathan Griffin (:jgriffin)

Comment 1

•

9 years ago

I don't really remember, but I suspect it may have something to do with B2G emulators. Since we no longer care about those, we can probably reduce those long timeouts.

Flags: needinfo?(jgriffin)

David Burns :automatedtester

Comment 2

•

9 years ago

I am pretty sure that jgriffin is right here... Emulators in automation are ridiculously slow...

Flags: needinfo?(dburns)

Andreas Tolfsen ❲:ato❳

Comment 3

•

9 years ago

If we add support for Fennec, presumably this runs inside the same Android emulator? I do think it’s worth experimenting with reducing it. Perhaps with an informed look into this we will be able to wait for the right events and reduce the need for individual timeouts that whimboo describes above.
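One way to read the suggestions so far is to keep a single overall startup deadline and use short per-attempt timeouts inside it, retrying until the application answers. The sketch below shows that pattern with plain sockets. `connect_with_deadline` and its parameters are hypothetical names for this example and are not part of the Marionette client; only the 120s default mirrors the startup_timeout mentioned in the description.

```python
import socket
import time


def connect_with_deadline(host, port, startup_timeout=120.0,
                          attempt_timeout=1.0, interval=0.1):
    """Retry connecting until it succeeds or the overall deadline passes.

    Hypothetical helper: each attempt uses a short timeout, while the total
    time spent is bounded by startup_timeout.
    """
    deadline = time.monotonic() + startup_timeout
    last_error = None
    while time.monotonic() < deadline:
        try:
            return socket.create_connection((host, port),
                                            timeout=attempt_timeout)
        except OSError as error:  # refused, unreachable, or timed out
            last_error = error
            time.sleep(interval)
    raise TimeoutError(
        "no connection to {}:{} within {}s".format(host, port, startup_timeout)
    ) from last_error
```

Once a connection is returned, the caller could switch to the steady-state timeout with `sock.settimeout(60)`, so a slow startup and a hung session are bounded separately.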

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Updated

•

9 years ago

Blocks: 1290372

Comment hidden (mozreview-request)

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 5

•

9 years ago

With the attached proposal to reduce the default timeout to 60s I triggered a full try build: https://treeherder.mozilla.org/#/jobs?repo=try&revision=cdc42db5d37e Let's see where we fail and what needs further improvement. It might be Fennec related.

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 6

•

9 years ago

The try push actually looks good. There is no job which is failing due to this change. Given that we do not run Marionette tests for Fennec on try yet, I will try those locally now.

Assignee: nobody → hskupin

Status: NEW → ASSIGNED

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 7

•

9 years ago

I would like to wait for bug 1284874 which would give me Fennec on try via TC for free.

Depends on: 1284874

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 8

•

9 years ago

Now that we have try support for Fennec I pushed the patch to Try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b4cbb83ad385

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 9

•

9 years ago

All fine here with the try build for Fennec on Android. The one timeout-related failure is not caused by my change; it always happens on the integration branches and on mozilla-central. https://treeherder.mozilla.org/#/jobs?repo=try&revision=b4cbb83ad385&filter-tier=1&filter-tier=2&filter-tier=3&selectedJob=26515865

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Updated

•

9 years ago

Attachment #8781135 - Flags: review?(ato)

Andreas Tolfsen ❲:ato❳

Comment 10

•

9 years ago

mozreview-review

Comment on attachment 8781135
Bug 1284457 - Reduce default socket timeout for Marionette to 60s
https://reviewboard.mozilla.org/r/71636/#review73104

Hm, lots of try failures here but I don’t believe any of them are Marionette related.

Attachment #8781135 - Flags: review?(ato) → review+

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 11

•

9 years ago

mozreview-review-reply

Comment on attachment 8781135
Bug 1284457 - Reduce default socket timeout for Marionette to 60s
https://reviewboard.mozilla.org/r/71636/#review73104

Yes, none of them show any indication of being related to this change. Also, nearly all of them have existing bugs for known failures. The try build for Fennec is similar: it has lots of failing tests, but this is known and covered by bug 1297394.

Pulsebot

Comment 12

•

9 years ago

Pushed by hskupin@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/799921e8d85a Reduce default socket timeout for Marionette to 60s r=ato

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 13

•

9 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/799921e8d85a

Status: ASSIGNED → RESOLVED

Closed: 9 years ago

status-firefox51: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla51

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Updated

•

9 years ago

Comment 14

•

9 years ago

This morning, while checking marionette.py for other changes, I accidentally found bug 1248056. With it we raised the default timeout to 360s due to slowness with valgrind builds. I left a comment over there that the fix was not ideal, and that we should fix it specifically for the -valgrind option of mochitests but not for all test suites! With the recent hangs in e10s we wasted a lot of machine hours in AWS, which we could have spent more wisely on other jobs. So I'm still behind this revert to 60s.
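As a rough sketch of what fixing it specifically for valgrind could look like, the longer timeout can be selected only when a valgrind run is requested. The function, constants and flags below are hypothetical and are not taken from marionette.py or the mochitest harness; only the values 60s and 360s come from this bug.

```python
# Hypothetical constants; only the numbers come from this bug.
DEFAULT_SOCKET_TIMEOUT = 60    # seconds, the value this bug reverts to
VALGRIND_SOCKET_TIMEOUT = 360  # seconds, the value raised in bug 1248056


def pick_socket_timeout(use_valgrind=False, override=None):
    """Return the socket timeout for a test run.

    An explicit override wins; otherwise only valgrind runs get the long
    timeout, and every other suite keeps the short default.
    """
    if override is not None:
        return override
    if use_valgrind:
        return VALGRIND_SOCKET_TIMEOUT
    return DEFAULT_SOCKET_TIMEOUT


if __name__ == "__main__":
    print(pick_socket_timeout())                   # 60
    print(pick_socket_timeout(use_valgrind=True))  # 360
    print(pick_socket_timeout(override=10))        # 10
```

Keeping the slow-build value behind an explicit switch means a hang in a regular suite costs 60s, not 360s.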

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 15

•

9 years ago

(In reply to Henrik Skupin (:whimboo) from comment #14)
> With the recent hangs in e10s we wasted a lot of machine hours in AWS, which
> we could have spent more wisely on other jobs. So I'm still behind this
> revert to 60s.

I wonder if we should even backport this patch to aurora and beta in case no major issues arise in the next couple of days. Btw I also talked to Julian Seward and he agrees to get this fixed for TC first, and care about developer needs afterward. David and Julian, what do you both think?

status-firefox49: --- → affected

status-firefox50: --- → affected

Flags: needinfo?(jseward)

Flags: needinfo?(dburns)

David Burns :automatedtester

Comment 16

•

9 years ago

Low risk to backport it. What is the issue with TC? I can't see any mention of what the discussion was about in this bug.

Flags: needinfo?(dburns)

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 17

•

9 years ago

We had random hangs in various tests due to bug 1294456, and there might be other cases in the future. Each time, we spent 6 minutes waiting for a socket timeout. Now it's only 1 minute.

Andreas Tolfsen ❲:ato❳

Comment 18

•

9 years ago

This is a test-only change, so should be safe to uplift. I think this fix was long overdue anyway.

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Comment 19

•

9 years ago

It looks like we agree here. Let's get this test-only change uplifted to aurora and beta.

Flags: needinfo?(jseward)

Whiteboard: [checkin-needed-aurora][checkin-needed-beta]

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 20

•

9 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-aurora/rev/c6d4012b25e6

status-firefox50: affected → fixed

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 21

•

9 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-beta/rev/668b6698a261

status-firefox49: affected → fixed

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Updated

•

9 years ago

Whiteboard: [checkin-needed-aurora][checkin-needed-beta]

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Updated

•

9 years ago

Blocks: 1283906

Henrik Skupin [:whimboo][⌚️UTC+2]

Assignee

Updated

•

9 years ago

Blocks: 1301661

BMO Automation

Updated

•

2 years ago

Product: Testing → Remote Protocol
