Unexpected reboots of cb-seamonkey-osx-*

RESOLVED FIXED

Status

SeaMonkey
Release Engineering
--
major
RESOLVED FIXED
9 years ago
8 years ago

People

(Reporter: Robert Kaiser, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Reporter)

Description

9 years ago
The currently running Mac VMs in the SeaMonkey pool sometimes reboot unexpectedly, possibly due to system/VM crashes.
This could be connected to bug 493321 and be a Parallels issue; this bug is just for tracking the problem, as it blocks moving the pool to production.
(Reporter)

Comment 1

9 years ago
cb-seamonkey-osx-01 experienced network problems in recent cycles. It looked like it couldn't get new connections to the outside (for checkouts, etc.), but it could still send stuff to the buildmaster. I could also ssh in, and forced a reboot at about 6:53 today.
(Reporter)

Comment 2

9 years ago
cb-seamonkey-osx-02 unexpectedly rebooted between 10:33 ("slave lost" msg) and 10:38 ("slave connected" msg) today.
(Reporter)

Comment 3

9 years ago
And now cb-seamonkey-osx-01 did it between 14:50 and 14:53.
(Reporter)

Comment 4

9 years ago
cb-seamonkey-osx-02, between 11:11 and 11:15 today.
(Reporter)

Comment 5

9 years ago
We also have crashes in almost every mochitest-plain cycle from either of the two slaves, at different places in the test cycles, so they are not related to one specific thing the box is doing.
(Reporter)

Comment 6

9 years ago
To be clear, comment #5 is application crashes that are caught and reported by the test/buildbot harness; the rest in this bug are VM crashes/reboots, which we only see because we lose the slaves for a few minutes and they then come back online (due to autologin and automatic start of buildbot) with the machine's uptime reset.
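As a rough illustration of the detection described here, the sketch below pairs each "slave lost" message with the next "slave connected" message for the same slave and reports how long it was gone. The tuple format and the `outage_windows` helper are hypothetical (this is not buildbot's log format), and the sample times are taken from comments 2 and 3 and treated as a single day for simplicity. A gap alone does not prove a reboot; the uptime of the machine still has to be checked to rule out a plain network loss.

```python
from datetime import datetime, timedelta


def outage_windows(events):
    """Pair each 'lost' event with the next 'connected' event of the same slave.

    `events` is an iterable of (slave, "HH:MM", kind) tuples in chronological
    order within one day. This tuple format is an assumption for illustration.
    """
    pending = {}   # slave name -> time it was lost
    windows = []   # (slave name, timedelta offline)
    for slave, hhmm, kind in events:
        when = datetime.strptime(hhmm, "%H:%M")
        if kind == "lost":
            pending[slave] = when
        elif kind == "connected" and slave in pending:
            windows.append((slave, when - pending.pop(slave)))
    return windows


# Sample data: times reported in comments 2 and 3 of this bug.
events = [
    ("cb-seamonkey-osx-02", "10:33", "lost"),
    ("cb-seamonkey-osx-02", "10:38", "connected"),
    ("cb-seamonkey-osx-01", "14:50", "lost"),
    ("cb-seamonkey-osx-01", "14:53", "connected"),
]
windows = outage_windows(events)
for slave, gap in windows:
    print(slave, "offline for", int(gap.total_seconds() // 60), "minutes")
```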
(Reporter)

Comment 7

9 years ago
lost cb-seamonkey-osx-01 between 16:59 and 17:10
(Reporter)

Comment 8

9 years ago
Another one of -osx-01 between 18:33 and 18:37 yesterday might be related to phong trying out things for bug 493321, including reboots of -osx-03 and -osx-04, which came online at 18:44 and 18:52, respectively.

03 ran into networking problems very fast, very much the same as I reported in comment #1. 04 disconnected at 00:00 without having got anything to do up to that point.
I rebooted cb-seamonkey-osx-03 at 03:36, it came back online at 03:40.
(Reporter)

Comment 9

9 years ago
cb-seamonkey-osx-02 lost and regained network at 08:16; there was no reboot this time.

Comment 10

9 years ago
I think osx-04 is frozen again.  I'm going to power it down unless you tell me that it's still up and running.
(Reporter)

Comment 11

9 years ago
Phong: yes, like I stated in comment #8, osx-04 went away at midnight - it didn't come back from that. If you could bring back win32-02 instead that would be cool as having only one of the Windows VMs makes the "pool" a bit slow for testing config changes.

osx-03 has had network problems for a while, again just like in comment #8, and disconnected and reconnected to buildbot at 17:16. I now rebooted it between 18:04 and 18:06.
(Reporter)

Comment 12

9 years ago
cb-seamonkey-osx-03 rebooted between 13:03 and 13:18 today.
(Reporter)

Comment 13

9 years ago
cb-seamonkey-osx-02 rebooted 14:26 to 14:30 today.
(Reporter)

Comment 14

9 years ago
cb-seamonkey-osx-02 once again 22:49 to 22:53 yesterday.
Blocks: 494671

Comment 15

9 years ago
(In reply to comment #5)
> We also have crashes in almost every mochitest-plain cycle

(In reply to comment #6)
> To be clear, comment #5 is application crashes that are caught and reported by
> the test/buildbot harness

I filed bug 494671 about the mochitest-plain crash(es).
(Reporter)

Comment 16

9 years ago
cb-seamonkey-osx-01 rebooted between 10:56 and 10:58 today.
(Reporter)

Comment 17

9 years ago
cb-seamonkey-osx-01 again, 16:46 to 16:50 today.
(Reporter)

Comment 18

9 years ago
ugh. and cb-seamonkey-osx-01 once again, 21:09 to 21:13 yesterday
(Reporter)

Comment 19

9 years ago
cb-seamonkey-osx-02 hasn't been crashing for some time but now started to report errors such as "FAILED TO GET ASN FROM CORESERVICES so aborting." in mochitests and then got into failures to get stuff from the network, same pattern as reported earlier in here, so I rebooted the VM just now.
(Reporter)

Comment 20

9 years ago
Bang. Now cb-seamonkey-osx-02 did a crash reboot again, between 08:44 and 08:48.

This seems to (almost) always happen during mochitest-plain runs, where we launch SeaMonkey and run a ton of tests on it. It happens at different points during the testing though.
(Reporter)

Comment 21

9 years ago
cb-seamonkey-osx-01 crashed/rebooted between 18:06 and 18:10; a few video tests failed before it disconnected in /tests/layout/base/tests/test_bug467672-1c.html
(Reporter)

Comment 22

9 years ago
All test cycles since then failed with a crash/reboot:

cb-seamonkey-osx-02 between 18:51 and 18:55 yesterday, during nsITransactionManager Aggregate Batch Transaction Stress Test (make check).

cb-seamonkey-osx-01 between 21:59 and 22:06 yesterday, two video test failures, a test_jQuery.html failure, lost in /tests/layout/style/test/test_bug391221.html

cb-seamonkey-osx-02 between 23:34 and 23:37 yesterday, one video test failure, test_Scriptaculous.html failures, lost in /tests/layout/base/tests/test_bug441782-2b.html

cb-seamonkey-osx-02 between 3:42 and 3:46 today, some video test errors, later lost in /tests/layout/base/tests/test_bug441782-2d.html

Interestingly, the video failures before such a crash/reboot are wrong end times of the video element when it reports it's done playing.
(Reporter)

Comment 23

9 years ago
02 made it through a whole test cycle with a video end time failure and test_Scriptaculous.html failures, but without a crash or timeout, neither this one nor bug 494671.

cb-seamonkey-osx-01 did a crash/reboot again though between 7:24 and 7:28, failing in /tests/content/media/video/test/test_bug482461.html (only failure before that was an xpcshell test, but nothing seen usually)
(Reporter)

Comment 24

9 years ago
cb-seamonkey-osx-01, between 12:28 and 12:32, two video "currentTime at end" test failures, test_Scriptaculous.html failures, lost in /tests/layout/base/tests/test_bug467672-2c.html

At a similar time, cb-seamonkey-osx-02 ran into those network problems again (e.g. hg clones/updates failing, "abort: error: Temporary failure in name resolution") and I'm manually rebooting it right now.
(Reporter)

Comment 25

9 years ago
cb-seamonkey-osx-01, between 16:23 and 16:38, one video "currentTime at end" test failure, test_jQuery.html failures, lost in /tests/layout/base/tests/test_bug441782-4a.html
(Reporter)

Comment 26

9 years ago
cb-seamonkey-osx-01, between 18:18 and 18:22, one video "currentTime at end" test failure, lost in /tests/layout/base/tests/test_bug467672-1c.html

cb-seamonkey-osx-02, between 21:59 and 22:03, one video "currentTime at end" test failure, lost in /tests/layout/base/tests/test_bug467672-4e.html

Then 01 actually made it through a cycle without an OS crash, but it saw two video "currentTime at end" test failures and crashed the SeaMonkey process after passing /tests/content/media/video/test/test_wav_onloadedmetadata.html

cb-seamonkey-osx-01, between 02:02 and 02:06, three video "currentTime at end" test failures, lost in /tests/content/media/video/test/test_wav_ended1.html

cb-seamonkey-osx-02, between 02:47 and 02:50, while doing a nightly build cycle, somewhere in building mailnews/ for i386 (not that easy to pinpoint due to parallel build process).
(Reporter)

Comment 27

9 years ago
cb-seamonkey-osx-02, between 10:32 and 10:36, lost in xpcshell-tests after xpcshell/test_mailnewsglobaldb/unit/test_gloda_content.js

cb-seamonkey-osx-02, between 12:08 and 13:13, two video "currentTime at end" test failures, lost in /tests/content/media/video/test/test_volume.html

cb-seamonkey-osx-02, between 13:55 and 13:58, one reftest and one crashtest failure, two video "currentTime at end" test failures, lost in /tests/layout/base/tests/test_bug441782-1c.html
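
To get a quick overview of where the slaves were lost, here is a small sketch that tallies the last logged mochitest per directory, using only the paths reported in comments 21 through 27. The losses during make check, the nightly build and xpcshell-tests are left out. The sample is small, so this only shows where the runs happened to be when the VM went away, not a cause.

```python
from collections import Counter
from posixpath import dirname

# Last logged test before each loss, as reported in comments 21-27.
last_tests = [
    "/tests/layout/base/tests/test_bug467672-1c.html",
    "/tests/layout/style/test/test_bug391221.html",
    "/tests/layout/base/tests/test_bug441782-2b.html",
    "/tests/layout/base/tests/test_bug441782-2d.html",
    "/tests/content/media/video/test/test_bug482461.html",
    "/tests/layout/base/tests/test_bug467672-2c.html",
    "/tests/layout/base/tests/test_bug441782-4a.html",
    "/tests/layout/base/tests/test_bug467672-1c.html",
    "/tests/layout/base/tests/test_bug467672-4e.html",
    "/tests/content/media/video/test/test_wav_ended1.html",
    "/tests/content/media/video/test/test_volume.html",
    "/tests/layout/base/tests/test_bug441782-1c.html",
]

# Count losses per test directory, most frequent first.
by_dir = Counter(dirname(path) for path in last_tests)
for directory, count in by_dir.most_common():
    print(count, directory)
```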
Severity: normal → critical

Comment 28

9 years ago
It seems like the problem has been mostly with the OSX vms.  Can we try giving it more RAM?

Comment 29

9 years ago
[mid-air collision with comment 28 ... which I would gladly try first ;-)]

(with all these comments (to read))
It is not obvious to me whether we kind of narrowed down a "trigger" for this reboot behavior or not.

If not, I would suggest to try and disable various jobs:
start with mochitest-plain only, maybe all tests, even main build, up to whole buildbot (= leaving the VM idled).

I mean:
if this is caused by tests, let's find out which one(s),
...,
if it's an OS issue, no need to lose more time monitoring and commenting about build/tests.
(Reporter)

Comment 30

9 years ago
It's pretty clear that it's a virtualization/OS problem and not a test problem. The question is what things really trigger the OS or Parallels problem.
(Reporter)

Comment 31

9 years ago
And bugs on a non-production setup aren't critical IMHO.
Severity: critical → major

Comment 32

9 years ago
(In reply to comment #30)
> It's pretty clear that it's a virtualization/OS problem and not a test problem.

Yeah.

> The question is what things really trigger the OS or Parallels problem.

Comment 28 and comment 29 suggestions stand, I think.


(In reply to comment #31)

> And bugs on a non-production setup aren't critical IMHO.

(Well, see bug 493449 comment 5...)
(Reporter)

Comment 33

9 years ago
(In reply to comment #28)
> It seems like the problem has been mostly with the OSX vms.  Can we try giving
> it more RAM?

I'm just having a top open on ssh while mochitest-plain is running and I see that it says "1018M used, 6368K free." at this moment - so could we actually try that way of increasing RAM?
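
For reference, a minimal sketch of how one could turn such a top summary into a free-memory fraction. The `parse_mem_summary` helper is hypothetical, and it assumes that the K/M/G suffixes in top's output are binary multiples.

```python
import re

# Assumption: top's size suffixes are binary multiples (K = 1024 bytes, ...).
_UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_mem_summary(text):
    """Return {label: bytes} for fragments like '1018M used' in a summary line."""
    return {
        label: int(number) * _UNITS[unit]
        for number, unit, label in re.findall(r"(\d+)([KMG])\s+(\w+)", text)
    }


# The figures quoted above from the running mochitest-plain cycle.
mem = parse_mem_summary("1018M used, 6368K free.")
free_fraction = mem["free"] / (mem["used"] + mem["free"])
print(f"{free_fraction:.1%} of physical memory free")
```

With the numbers above, well under 1% of physical memory is free, which is why trying more RAM looks worthwhile.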
(Reporter)

Comment 34

9 years ago
BTW, just to get an impression of where the RAM is going, here is a snippet of the top output:

31574 seamonkey-  62.5% 12:22.84  11   188-  1053  122M-   40M+  165M+  356M+
31573 ssltunnel    0.0%  0:00.50   5    37     89  420K  2600K  2884K    84M
31560 xpcshell     1.2%  1:00.30   4    66   2805  394M    15M   401M   555M
31559 python       1.1%  1:07.85   1    15    113 2808K  1468K  4584K    79M
(Reporter)

Comment 35

9 years ago
OK, this VM crash-rebooted while I had top open, this was the last top output left on my ssh session:

SharedLibs: num =    7, resident =   31M code, 1992K data, 3656K linkedit.
MemRegions: num =  8317, resident =  414M +   18M private,  257M shared.
PhysMem:  143M wired,  337M active,  290M inactive,  775M used,  249M free.
VM: 3380M + 374M   154296(0) pageins, 12115(0) pageouts

  PID COMMAND      %CPU   TIME   #TH #PRTS #MREGS RPRVT  RSHRD  RSIZE  VSIZE
31657 httpd        0.0%  0:00.01   1    11    199  280K    10M   792K    38M
31617 top         20.0%  9:48.14   1    20     34 1368K   188K  1956K    19M
31608 bash         0.0%  0:00.11   1    14     18  188K   184K   828K    18M
31607 sshd         0.0%  0:01.46   1    10     59  116K   884K   508K    22M
31592 sshd         0.0%  0:00.47   1    20     59  204K   884K  1572K    22M
31574 seamonkey-  96.7% 30:37.32  25   629+  1284  158M-   45M   205M-  697M-
31573 ssltunnel    0.0%  0:02.52   5    37     90 1312K  2600K  3548K    84M
31560 xpcshell     0.0%  1:32.85   4    66   1796  170M    15M   177M   323M
31559 python       0.0%  1:20.28   1    15    113 2800K  1468K  4648K    79M
31557 sh           0.0%  0:00.01   1    13     18  112K   184K   588K    74M
31555 gnumake      0.0%  0:00.03   1    13     18  276K   312K   620K    74M
31553 gnumake      0.0%  0:00.07   1    13     19  188K   312K   532K    18M
19584 ssh-agent    0.0%  0:00.10   1    23     28  436K   200K  1112K    19M
  212 python       0.0%  2:49.64   2    27    141 5908K  1468K  5760K    25M
  180 Finder       0.0%  0:42.66   7   147    121 1600K  8668K  7164K    96M
  179 SystemUISe   0.0%  0:03.10   6   185    192 2032K  9760K  6156K    96M
  174 Dock         0.0%  0:02.23   4   102    180 1600K    10M  7776K    69M
  173 coreaudiod   0.0%  0:00.23   2    82     25  200K   208K   848K    18M
  172 ATSServer    0.0%  0:12.97   2    84    133 1364K    11M  7136K   147M
  171 pboard       0.0%  0:00.02   1    15     23  104K   184K   540K    19M
  165 UserEventA   0.0%  0:00.43   2   109     86  632K  1196K  2080K    36M
  164 Spotlight    0.0%  0:01.21   2    78     82 1008K  4920K  4348K    54M
  162 AppleVNCSe   0.0%  0:00.25   2    73     40  364K  2460K  2224K    41M
  159 AirPort Ba   0.0%  0:00.28   2    58     67  512K  4176K  2768K    67M
  158 dynres       0.0%  0:06.39   1    28     28  200K   224K  1184K    35M
  153 launchd      0.0%  0:10.11   3   122     24  188K   296K   520K    18M
  137 VNCPrivile   0.0%  0:00.04   1    16     24  100K   188K   576K    19M
  126 CoreRAIDSe   0.0%  0:03.57   1    33     28  212K   348K   908K    19M
  125 httpd        0.0%  0:00.06   1    11    201  660K    10M  2268K    38M
  107 WindowServ  18.2% 16:04.82   5   154    233 4800K    22M    25M    92M
  105 krb5kdc      0.0%  0:00.10   1    17     42   88K   372K   712K    18M
   92 coreservic   0.0%  0:06.62   4   117     67 1080K  2856K  3448K    25M
   90 timesync     0.0%  0:00.49   1    40     29  264K   800K  1472K    19M
   89 socketfilt   0.0%  0:04.16   3    36     25  392K   200K  1236K    18M
   87 autofsd      0.0%  0:00.09   1    21     18  140K   184K   660K    18M
   83 diskarbitr   0.0%  0:10.71   1   104     19  328K   188K   912K    18M
   80 dynamic_pa   0.0%  0:00.08   1    17     20  156K   184K   696K    18M
   79 emond        0.1%  0:30.21   1    32     22  320K  1764K  1824K    27M
   77 fseventsd    0.2%  0:48.15  11    64     48  796K   184K  1292K    23M
   75 hidd         0.0%  0:00.04   2    28     20  116K   204K   592K    18M
   74 hwmond       0.0%  0:56.09  10    91     47  320K   352K  1260K    23M
   72 kdcmond      0.0%  0:06.03   2    24     19  188K   184K   904K    18M
   71 KernelEven   0.0%  0:00.04   2    20     19  152K   184K   628K    18M
   70 loginwindo   0.0%  0:01.33   3   174    105 1184K  6220K  4136K    56M
   69 mds          0.0%  0:43.93  14   185     86 1848K   192K  3716K    28M
   68 PasswordSe   0.0%  0:02.37  10    61    115  148K  1188K   852K    23M
   67 RFBRegiste   0.0%  0:00.07   1    16     18  176K   184K  1036K    18M Connection to cb-seamonkey-osx-01 closed by remote host.
Connection to cb-seamonkey-osx-01 closed.

Comment 36

9 years ago
Fwiw, might it be a worse case of bug 494769?
(Reporter)

Comment 37

9 years ago
(In reply to comment #36)
> Fwiw, might it be a worse case of bug 494769?

I wouldn't have thought about your statement here being true in any way, but since you checked in this disabling of that test, we haven't crashed the OS again - yet.

We're still seeing bug 494671 so something's really fishy with the OS or virtualization (I guess the latter) but it definitely feels like that .wav test did strike a chord. Since we're seeing other stuff related to media tests as well, I wonder if the problem is somewhere in that area of media (audio) on those virtualized Leopard boxes.
Depends on: 494769

Comment 38

9 years ago
(In reply to comment #37)

> I wouldn't have thought about your statement here being true in any way,

My initial thought was there might be a (very little) chance the reboot would be triggered by a memory allocation (related) "error".

> We're still seeing bug 494671 so something's really fishy with the OS or
> virtualization (I guess the latter) but it definitely feels like that .wav test
> did strike a chord. Since we're seeing other stuff related to media tests as

Yes, the media area looked most suspicious, but bug 494671 logs did not confirm this.

> well, I wonder if the problem is somewhere in that area of media (audio) on
> those virtualized Leopard boxes.

Now, with the fresh details found by Ted, the probable explanation would be that the media tests simply rely more on the "time", and thus hit the underlying bug more easily.
(Reporter)

Comment 39

9 years ago
We'll need to watch this for a bit more, but it looks like the Parallels and system upgrades in bug 494462 might have fixed this. I'd like to see a day or so of non-crash data before closing the bug here though.
No longer blocks: 494671
(Reporter)

Updated

9 years ago
Depends on: 494671
(Reporter)

Comment 40

9 years ago
Hrm. cb-seamonkey-osx-01 rebooted again, between 11:20 and 11:24 today, while running /tests/layout/base/tests/test_bug467672-3d.html :(
(Reporter)

Comment 41

9 years ago
We lost cb-seamonkey-osx-03 after mozilla/layout/reftests/bugs/315920-3a.html today at 02:34 PDT, but the VM didn't come back so I'm unsure if it was a VM crash or a network loss.
(Reporter)

Comment 42

9 years ago
Phong says the 02:34 loss of osx-03 was a VM crash: "it was at the gray screen telling me to hold down the power button to reboot."
(Reporter)

Comment 43

9 years ago
cb-seamonkey-osx-03 again crash-rebooted between 13:49 and 13:54 today, last logged test was /tests/layout/base/tests/test_bug467672-3b.html
(Reporter)

Comment 44

8 years ago
Since the OS X VMs were reduced to one CPU, we haven't seen the machine crashes again, but I'm not completely trusting this yet, as bug 494671 still causes strange crashes in tests, so this machine problem might not actually be gone completely.
(Reporter)

Comment 45

8 years ago
From all I see, this apparently has been fixed (temporarily) by reducing the VMs to one CPU. The issue behind it is probably still around in bug 494671 even though that issue has become more rare as well.
Status: NEW → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED
(Reporter)

Updated

8 years ago
Component: Project Organization → Release Engineering
(Reporter)

Updated

8 years ago
QA Contact: organization → release