[mozmill] "Disconnect Error: Application unexpectedly closed" failures in various tests

RESOLVED FIXED in Thunderbird 55.0

Status

P2
major
RESOLVED FIXED
a year ago
a year ago

People

(Reporter: Taraman, Assigned: Taraman)

Tracking

Trunk
Thunderbird 55.0

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [Thunderbird-testfailure: Z Linux debug])

Attachments

(1 attachment)

(Assignee)

Description

a year ago
+++ This bug was initially created as a clone of Bug #1351343 +++

Some Mozmill tests are failing regularly with the message "Disconnect Error: Application unexpectedly closed".

This mainly occurs on debug-builds.

A local debug-build test run showed that it runs very slowly indeed, so I suspected a real timeout issue somewhere on the build machines.

Looking at where the error message comes from [1] took me to [2], where one of the possibilities is indeed a timeout.

Some debug code showed [3]:
>00:55:11  WARNING -  TEST-UNEXPECTED-FAIL | Disconnect Error: Application unexpectedly closed
>00:55:11     INFO -  Connection timed out

To prove it, I raised the timeout from 60s to 300s, and the disconnect errors disappeared [4].

I tried to log how long the longest run time of a test was, but unfortunately the log lines weren't written for the affected tests. So the longest logged value was:
>05:54:31     INFO -  Timeout counter reset at: 28.8 with timeout = 300.0
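
As a rough illustration of what such a log line suggests (a timeout counter that is reset whenever there is activity, so only a single long gap trips it), here is a minimal sketch. It is not the jsbridge code: the class `InactivityTimer` and its `reset`, `expired`, and `longest_gap` members are hypothetical names, and the injectable clock is an assumption made so the sketch can be exercised without real waiting.

```python
import time


class InactivityTimer:
    """Hypothetical helper: trips only after `timeout` seconds without activity."""

    def __init__(self, timeout=60.0, clock=time.monotonic):
        self.timeout = timeout
        self._clock = clock              # injectable so a fake clock can be used
        self._last_activity = clock()
        self.longest_gap = 0.0           # longest quiet period seen at a reset

    def reset(self):
        """Record activity and return the length of the gap that just ended."""
        now = self._clock()
        gap = now - self._last_activity
        self.longest_gap = max(self.longest_gap, gap)
        self._last_activity = now
        return gap

    def expired(self):
        """True if the current quiet period is longer than the timeout."""
        return self._clock() - self._last_activity > self.timeout


# Usage with a fake clock (values chosen for illustration only).
fake_now = [0.0]
timer = InactivityTimer(timeout=60.0, clock=lambda: fake_now[0])
fake_now[0] = 28.8
gap = timer.reset()                      # activity after 28.8s: counter restarts
fake_now[0] = 28.8 + 61.0                # 61s of silence afterwards
timed_out = timer.expired()
```

Under this model a test can run far longer than the timeout in total, as long as no single quiet period exceeds it; a slow debug build fails only when one individual step takes too long.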

So now the questions are:
- Is raising the timeout the right way?
- How far should we raise it?

p.s.: We also have the tests stopping after 2 hours:
>command timed out: 7200 seconds elapsed running
Maybe we can also address this issue here.

[1]: https://dxr.mozilla.org/comm-central/rev/8eeb5667420daba2148746b6b9e470d0d8ebc35d/mail/test/resources/mozmill/mozmill/__init__.py#805
[2]: https://dxr.mozilla.org/comm-central/rev/8eeb5667420daba2148746b6b9e470d0d8ebc35d/mail/test/resources/jsbridge/jsbridge/network.py#189
[3]: https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=c9c99b1f2625e7aa37b572a7779c481dab8cd49d
[4]: https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=8ded469355716bb268e1e4c2d24975fb06edac0b

Comment 1

a year ago
(In reply to Markus Adrario [:Taraman] from comment #0)
> This mainly occurs on debug-builds.
And on Linux, right?

Thanks for looking into it, I noticed the various red Z's :-(
Whiteboard: [Thunderbird-testfailure: Z Linux debug]

Comment 2

a year ago
It seemed to me the tests crashed due to an assert in m-c code.
E.g. at https://archive.mozilla.org/pub/thunderbird/try-builds/Mozilla@Adrario.de-c9c99b1f2625e7aa37b572a7779c481dab8cd49d/try-comm-central-linux64-debug/try-comm-central_ubuntu64_vm-debug_test-mozmill- bm54-tests1-linux64-build33.txt.gz, there is:
!!! ASSERTION: Table inline-size is less than the sum of its columns' min inline-sizes: '!(aISizeType == BTLS_FINAL_ISIZE && aISize < guess_min)', file /builds/slave/tb-try-c-cen-l64-d-00000000000/build/mozilla/layout/tables/BasicTableLayoutStrategy.cpp, line 812

If it is only a timeout issue we could split the tests.

But e.g. in the run https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&selectedJob=86419366 the only "unexpectedly closed" failure is in a calendar test, which is known and being worked on.

Comment 3

a year ago
(In reply to Markus Adrario [:Taraman] from comment #0)
> p.s.: We also have the tests stopping after 2 hours:
> >command timed out: 7200 seconds elapsed running
> Maybe we can also address this issue here.

This is covered in bug 1342828.
(Assignee)

Comment 4

a year ago
I also stumbled upon the Assertion failures. But these also happen in passing tests.

We actually have 2 timeout issues here. One is the 7200 secs which you stated are being worked on.

The other is within the tests. As far as I can see, these happen not only for a whole test but also within single steps of a test.
Correct me if I'm wrong.

In
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=1ec931a8aacfa30062f67aa76f2826280f78ebe0
we have 4 tests failing, not only the cal-recurrence test.

In my latest test run, with the timeout set to 120 secs instead of 60, there were no more disconnects:
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=2eac0e24319c990b7fdddf7e02a177f230c26460

Comment 5

a year ago
OK, in that run multiple tests fail.
From that set I tried test-cloudfile-backend-hightail.js locally and it takes 47 secs. But even if I downclock my CPU, the test runs for 3 minutes and passes fine. When does the timeout apply?

What are those "Timeout counter reset at: 0.2" messages in the log?

Comment 6

a year ago
(In reply to Markus Adrario [:Taraman] from comment #4)
> We actually have 2 timeout issues here. One is the 7200 secs which you
> stated are being worked on.

Not that it is being worked on, just that it is covered in the other bug. If you know how to solve it, please propose the change in the other bug.
(Assignee)

Comment 7

a year ago
Created attachment 8860024 [details] [diff] [review]
fix_timeouts V1

This patch addresses the timeout within the tests.

Splitting the test itself does not seem a solution to me, because looking at the code - and also from what I saw when running these tests locally - the timeout does not count for the entire test, but for different test steps.

In the daily-Recurrence test, a lot of events are added and this takes very long on the debug builds.

So raising the timeout seems to be a reasonable solution to me.
Another possibility would be to raise the timeout only when we are on a debug build.
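
The debug-only variant could look roughly like the sketch below. It is an illustration under assumptions, not the attached patch: `pick_timeout` and the `is_debug_build` flag are hypothetical names, 60 seconds is the existing value mentioned in this bug, and 120 seconds is simply the value tried in comment 4.

```python
DEFAULT_TIMEOUT = 60.0    # existing timeout, in seconds
DEBUG_TIMEOUT = 120.0     # assumed larger value for slow debug builds


def pick_timeout(is_debug_build, default=DEFAULT_TIMEOUT, debug=DEBUG_TIMEOUT):
    """Return the longer timeout only when running against a debug build."""
    return debug if is_debug_build else default


opt_timeout = pick_timeout(False)    # optimized builds keep the short timeout
debug_timeout = pick_timeout(True)   # debug builds get more headroom
```

How the harness would learn that it is running a debug build is left open here; the flag would have to come from the build configuration or the test runner.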

aceman, are you the right person to ask for review here?
Assignee: nobody → Mozilla
Attachment #8860024 - Flags: review?(acelists)
(Assignee)

Comment 8

a year ago
Here is a testrun with the patch:
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=ade81ade417736af6320d6a9038644cc06b9e00b

There is still one timeout in the cloudfile tests - maybe this one is a real error in the test itself...

Comment 9

a year ago
(In reply to Markus Adrario [:Taraman] from comment #7)
> Splitting the test itself does not seem a solution to me, because looking at
> the code - and also from what I saw when running these tests locally - the
> timeout does not count for the entire test, but for different test steps.

Does it time whole functions?
 
> In the daily-Recurrence test, a lot of events are added and this takes very
> long on the debug builds.

Maybe that could be split into separate functions. But that does not fix the timeouts of the other unrelated tests.
 
> aceman, are you the right person to ask for review here?

Yes, I'm a peer of TB 'testing infrastructure' module.

Comment 10

a year ago
Comment on attachment 8860024 [details] [diff] [review]
fix_timeouts V1

Review of attachment 8860024 [details] [diff] [review]:
-----------------------------------------------------------------

So I think we can try this, thanks.
Attachment #8860024 - Flags: review?(acelists) → review+

Comment 11

a year ago
Let me land that when I need a patch to trigger a build, please. I'll also remove the trailing space for you ;-)
(Assignee)

Comment 12

a year ago
Sure, have it!
Thx. ;-)

Comment 13

a year ago
https://hg.mozilla.org/comm-central/rev/756dadb206f5d280ce8b1f80eda047e7f5fa1e92

Change of plans: Landed it straight away. I'll let you resolve the bug if it helped ;-)
Target Milestone: --- → Thunderbird 55.0
(Assignee)

Comment 14

a year ago
No more disconnect errors in the last few days.

Only one cloudfile test as mentioned in comment #8.
Status: NEW → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED