1309851 - Fix "Intermittent valgrind-test | Valgrind timed out (reached 1800 second limit)"

Reporter

Description

•

8 years ago

treeherder

Filed by: tomcat [at] mozilla.com

https://treeherder.mozilla.org/logviewer.html#?job_id=37539180&repo=mozilla-inbound

https://queue.taskcluster.net/v1/task/JnhisL2kSRG4L8x1WYCR4g/runs/0/artifacts/public%2Flogs%2Flive_backing.log

Carsten Book [:Tomcat]

Comment 1

•

8 years ago

julian, could you take a look at this ?

Flags: needinfo?(jseward)

Comment hidden (Intermittent Failures Robot)

8 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 4
* autoland: 3
* mozilla-central: 1

Platform breakdown:
* linux64: 8

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1309851&startday=2016-10-10&endday=2016-10-16&tree=all

Comment hidden (Intermittent Failures Robot)

33 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 19
* autoland: 12
* fx-team: 2

Platform breakdown:
* linux64: 33

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1309851&startday=2016-10-24&endday=2016-10-30&tree=all

Comment hidden (Intermittent Failures Robot)

22 automation job failures were associated with this bug yesterday.

Repository breakdown:
* mozilla-central: 7
* autoland: 7
* mozilla-inbound: 4
* try: 3
* fx-team: 1

Platform breakdown:
* linux64: 22

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1309851&startday=2016-10-31&endday=2016-10-31&tree=all

Comment hidden (Intermittent Failures Robot)

18 automation job failures were associated with this bug yesterday.

Repository breakdown:
* mozilla-inbound: 8
* autoland: 6
* try: 2
* mozilla-central: 2

Platform breakdown:
* linux64: 18

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1309851&startday=2016-11-01&endday=2016-11-01&tree=all

Julian Seward [:jseward]

Comment 7

•

8 years ago

I looked at a few of these logs.  In some cases, V makes steady but slow
progress, emitting a message of some kind or other every half minute or so,
but eventually falls over the 30 minute limit.

In other cases, there seems to be a very large delay in between messages
and then it times out almost immediately thereafter.

[task 2016-10-13T10:38:00.494751Z] 10:38:00     INFO -   1:39.97 --44540-- memcheck GC: 90410 nodes, 18427 survivors ( 20.3%)
[task 2016-10-13T11:05:21.592269Z] 11:05:21     INFO -  29:01.07 --44540-- memcheck GC: 90410 nodes, 19555 survivors ( 21.6%)

The "memcheck GC:" messages make it somewhat possible to identify points of
equal progress in the different logs.

I am puzzled by the large variability in rates of progress.  Do the
compute nodes have swap space?  Could it be that, in the slow cases,
we are hitting swap a lot?  Is there a way to tell?

FWIW, on a 2.5 GHz Haswell with essentially unlimited memory, this test
takes about 18 minutes to run.

Flags: needinfo?(jseward)

Julian Seward [:jseward]

Comment 8

•

8 years ago

Dustin tells me that we don't use swap on these machines, so cancel
that theory.  He did also say that there's considerable variability in
performance across different instances of supposedly the same type of
node, though.

Julian Seward [:jseward]

Comment 9

•

8 years ago

(In reply to Julian Seward [:jseward] from comment #7)
> FWIW, on a 2.5 GHz Haswell with essentially unlimited memory, this test
> takes about 18 minutes to run.

And about 2.9GB VSS, 1.9GB RSS.

One possibility is to disable origin tracking in the run.  This
doesn't reduce V's ability to detect errors, but it does make it
a bit more difficult to diagnose them.  Given that errors are
relatively rare and that diagnosing them usually requires multiple
local runs, it wouldn't be a bad tradeoff for runs on automation,
although we'd have to find a way to ensure origin tracking wasn't
disabled on local runs of |mach valgrind-test|.

With it disabled, the run time on this machine falls from 18 mins
to 10 mins, and max VSS is about 2.6GB, so the speedup is significant.

Nicholas Nethercote [inactive]

Comment 10

•

8 years ago

> we'd have to find a way to ensure origin tracking wasn't disabled on local runs of |mach valgrind-test|.

Keep origin tracking as the default for |mach valgrind-test|, but add a flag to disable it and then use that flag on automation?

Comment hidden (Intermittent Failures Robot)

50 automation job failures were associated with this bug yesterday.

Repository breakdown:
* autoland: 26
* mozilla-inbound: 20
* try: 4

Platform breakdown:
* linux64: 49
* windows7-32-vm: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1309851&startday=2016-11-02&endday=2016-11-02&tree=all

Julian Seward [:jseward]

Comment 12

•

8 years ago

(In reply to Julian Seward [:jseward] from comment #7)
> I am puzzled by the large variability in rates of progress.

The logs have timestamps.  These show wallclock time, not CPU time.
So there's no way to distinguish (1) "the compute node is running
slowly for whatever reason" vs (2) "valgrind has fallen into some
pathological slow path".  Given the difficulty V has with dealing with
some of the SpiderMonkey behaviour, and the fact that this is a
somewhat older version, (2) is a real possibility.

One way to distinguish these is to run with a hacked valgrind that
monitors its own CPU usage and prints a line every (eg) 30 seconds,
possibly with some other critical statistics.

So the problem becomes: how to do one of these runs on the automation
nodes, using a custom valgrind build.

Comment hidden (Intermittent Failures Robot)

27 automation job failures were associated with this bug yesterday.

Repository breakdown:
* autoland: 14
* mozilla-inbound: 12
* mozilla-central: 1

Platform breakdown:
* linux64: 27

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1309851&startday=2016-11-03&endday=2016-11-03&tree=all

Comment hidden (Intermittent Failures Robot)

31 automation job failures were associated with this bug yesterday.

Repository breakdown:
* mozilla-inbound: 14
* autoland: 12
* try: 4
* mozilla-central: 1

Platform breakdown:
* linux64: 31

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1309851&startday=2016-11-04&endday=2016-11-04&tree=all

Comment hidden (Intermittent Failures Robot)

21 automation job failures were associated with this bug yesterday.

Repository breakdown:
* mozilla-inbound: 9
* autoland: 7
* mozilla-central: 5

Platform breakdown:
* linux64: 21

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1309851&startday=2016-11-05&endday=2016-11-05&tree=all

Comment hidden (Intermittent Failures Robot)

177 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* autoland: 76
* mozilla-inbound: 69
* mozilla-central: 16
* try: 15
* fx-team: 1

Platform breakdown:
* linux64: 176
* windows7-32-vm: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1309851&startday=2016-10-31&endday=2016-11-06&tree=all

Joel Maher ( :jmaher ) (UTC -8)

Comment 17

•

8 years ago

:jseward, do you have plans to fix this issue?  As it stands this is our #1 outstanding intermittent orange.  I see you have explained the issue above, but I am not clear if there is a pending fix or if you need help, etc.

Flags: needinfo?(jseward)

Julian Seward [:jseward]

Comment 18

•

8 years ago

Well, one immediate thing we could try is to increase the 1800 second limit.
That's just treating the symtoms though.

I spent much of Friday trying to build the "desktop-test" (Ubuntu 12.04)
image locally, so I can figure out how to push custom valgrind builds
into the run.  I even have the custom valgrind available now.  But
my build attempts have so far failed.  I am continuing to try today.

If you know of some other way to get a modified valgrind into the runs,
I'd be happy to go with that too.

Flags: needinfo?(jseward)

Joel Maher ( :jmaher ) (UTC -8)

Comment 19

•

8 years ago

as a note we have a 12.04 and a 16.04 version of desktop-test, possibly starting with 16.04 would be better as we are slowly migrating there.

If you want to redo the custom image, we can test that on try (we do dynamic docker image creation).  A lot of custom tools are added here:
https://dxr.mozilla.org/mozilla-central/source/taskcluster/scripts/tester/test-ubuntu.sh?q=path%3Atest-ubuntu.sh&redirect_type=single

Possibly we can experiment on try, we would just need to make sure that all android/linux tests pass after modifying the image.

Comment hidden (Intermittent Failures Robot)

30 automation job failures were associated with this bug yesterday.

Repository breakdown:
* autoland: 15
* mozilla-inbound: 13
* try: 2

Platform breakdown:
* linux64: 30

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1309851&startday=2016-11-07&endday=2016-11-07&tree=all

Julian Seward [:jseward]

Comment 21

•

8 years ago

Attached patch with-instrumented-valgrind-2.cset — Details — Splinter Review

Patch that makes the test job run with a modified valgrind 3.12.0
(modification is itself within the patch)

Minimal try syntax: mach try -b o -p linux64-valgrind -u none -t none

Julian Seward [:jseward]

Comment 22

•

8 years ago

With help from dustin, jhford, jmaher, jonasfj (thanks all!) I managed
to get a modified V into the desktop-build image, which does the V test
run.  The first instrumented run is [1].  I am doing further runs to 
get a better picture, since many runs don't time out or experience
"unusual" delays.

The patch causes V to print a line like this every 10 CPU seconds:

 PROGRESS: U 110s, W 114s, 96.5% CPU, EvC 475.21M, TIn 294.6k, TOut 19.4k, #thr 59

This numbers are cumulative (viz, for the whole run so far).  They show:
  
   U:    user cpu time
   W:    wallclock time
   EvC:  a measure of the progress of the simulator -- the number
         of backwards and/or indirect branches processed
   TIn:  number of code superblocks instrumented
   TOut: number of blocks discarded due to invalidation or capacity
         constraints
   #thr: number of threads in the process

From these it's possible to deduce, eg:

* compute node hang or slow progress -- W increases, U doesn't
* simulation compute bound -- W and U increase together
* simulation in a loop -- EvC increases fast, TIn doesn't increase
* excessive discards probably due to SM code overwrites -- TOut increases fast


[1]
https://public-artifacts.taskcluster.net/PJlMoAraQB-GPz32zp0eHA/0/public/logs/live_backing.log

Julian Seward [:jseward]

Comment 23

•

8 years ago

> [1]
> https://public-artifacts.taskcluster.net/PJlMoAraQB-GPz32zp0eHA/0/public/
> logs/live_backing.log

This shows an interesting plateau between 170 and 360 seconds.  At this point
the instrumented block count stays fixed at 300.0k -- the bare minimum needed to
get Gecko started -- whilst the simulator (EvC #s) makes relatively rapid
forward progress, and U and W advance together.  This suggests that Gecko is in a
loop during this time and is not touching any new code.  And that resource problems
on the compute node are not a cause of the delay.

After this (not shown here) it returns to the more normal pattern, in which the
simulation advances more slowly, and TIn/TOut numbers increase rapidly, due to
processing of lots of short-lifetime bits of machine code created by Gecko's
various JITs.

 PROGRESS: U 160s, W 164s, 97.6% CPU, EvC 1541.29M, TIn 298.8k, TOut 19.7k, #thr 58
 PROGRESS: U 170s, W 174s, 97.7% CPU, EvC 1755.65M, TIn 299.6k, TOut 19.7k, #thr 58
 PROGRESS: U 180s, W 184s, 97.8% CPU, EvC 1972.37M, TIn 300.0k, TOut 19.7k, #thr 58
 PROGRESS: U 190s, W 194s, 97.9% CPU, EvC 2190.27M, TIn 300.0k, TOut 19.7k, #thr 58
 PROGRESS: U 200s, W 204s, 98.0% CPU, EvC 2408.18M, TIn 300.0k, TOut 19.7k, #thr 58
 PROGRESS: U 210s, W 214s, 98.1% CPU, EvC 2625.88M, TIn 300.0k, TOut 19.7k, #thr 58
 PROGRESS: U 220s, W 224s, 98.2% CPU, EvC 2843.98M, TIn 300.0k, TOut 19.7k, #thr 58
 PROGRESS: U 230s, W 234s, 98.3% CPU, EvC 3062.09M, TIn 300.0k, TOut 19.7k, #thr 58
 PROGRESS: U 240s, W 244s, 98.4% CPU, EvC 3279.69M, TIn 300.0k, TOut 19.7k, #thr 58
 PROGRESS: U 250s, W 254s, 98.4% CPU, EvC 3497.30M, TIn 300.0k, TOut 19.7k, #thr 58
 PROGRESS: U 260s, W 264s, 98.5% CPU, EvC 3715.10M, TIn 300.0k, TOut 19.7k, #thr 58
 PROGRESS: U 270s, W 274s, 98.5% CPU, EvC 3933.00M, TIn 300.0k, TOut 19.7k, #thr 58
 PROGRESS: U 280s, W 284s, 98.6% CPU, EvC 4150.81M, TIn 300.0k, TOut 19.7k, #thr 58
 PROGRESS: U 290s, W 294s, 98.6% CPU, EvC 4368.41M, TIn 300.0k, TOut 19.7k, #thr 58
 PROGRESS: U 300s, W 304s, 98.7% CPU, EvC 4586.52M, TIn 300.0k, TOut 19.7k, #thr 58
 PROGRESS: U 310s, W 314s, 98.7% CPU, EvC 4804.52M, TIn 300.0k, TOut 19.7k, #thr 58
 PROGRESS: U 320s, W 324s, 98.8% CPU, EvC 5022.13M, TIn 300.0k, TOut 19.7k, #thr 58
 PROGRESS: U 330s, W 334s, 98.8% CPU, EvC 5239.93M, TIn 300.0k, TOut 19.7k, #thr 58
 PROGRESS: U 340s, W 344s, 98.8% CPU, EvC 5457.94M, TIn 300.0k, TOut 19.7k, #thr 58
 PROGRESS: U 350s, W 354s, 98.9% CPU, EvC 5675.64M, TIn 300.0k, TOut 19.7k, #thr 58
 PROGRESS: U 360s, W 364s, 98.9% CPU, EvC 5893.15M, TIn 300.0k, TOut 19.7k, #thr 58
 PROGRESS: U 370s, W 374s, 98.9% CPU, EvC 5943.00M, TIn 316.2k, TOut 41.1k, #thr 60

Julian Seward [:jseward]

Comment 24

•

8 years ago

A second run [1] which does time out winds up in two apparent long
loops, similar to comment #23, except they occupy most of the 1800
seconds, and the second loop is still running when it times out:

PROGRESS: U 100s, W 105s, 95.2% CPU, EvC 241.39M, TIn 296.1k, TOut 20.5k, #thr 59
..
PROGRESS: U 560s, W 564s, 99.3% CPU, EvC 10273.09M, TIn 296.1k, TOut 20.5k, #thr 59

and
PROGRESS: U 590s, W 594s, 99.3% CPU, EvC 10897.11M, TIn 301.7k, TOut 20.9k, #thr 58
..
PROGRESS: U 1,790s, W 1,792s, 99.9% CPU, EvC 37044.95M, TIn 301.7k, TOut 20.9k, #thr 58

They can't be very far apart in the startup sequence, either: TIns of 296.1k
vs 301.7k.  By comparison a "normal" startup of the browser easily passes 400k TIn
before we get anything we can interact with.

What is it that the Valgrind test job actually does?  Does it involve any
kind of race condition or real-time-ness, that could account for such a
large amount of nondeterminism?

[1]
https://public-artifacts.taskcluster.net/P4n9SdIrTAuSV3_05kJrow/0/public/logs/live_backing.log

Nicholas Nethercote [inactive]

Comment 25

•

8 years ago

> What is it that the Valgrind test job actually does?  Does it involve any
> kind of race condition or real-time-ness, that could account for such a
> large amount of nondeterminism?

It just loads build/pgo/index.html, which opens a couple of basic pages and then runs sunspider. Nothing complicated at all.

Julian Seward [:jseward]

Comment 26

•

8 years ago

I believe this is caused by IcuTimeZoneInfo::acquire(), in
js/src/vm/DateTime.cpp.  It acquired its current form in bug 1220693.

Profiling valgrind running in automation [1] shows a spin of a 4
instruction block that lasts for more than a billion simulated
instructions.  This happens around the point TIn=295k, which is
consistent with observations in comments 23 and 24.

The block is JS::ResetTimeZone()+59.  JS::ResetTimeZone() doesn't have
a loop, but IcuTimeZoneInfo::acquire() is inlined into it, and
profiling locally shows the big cost in IcuTimeZoneInfo::acquire()+9,
so presumably the appearance in JS::ResetTimeZone() results from
inlining.

Imagine the following scenario, on a single core machine.  T1 enters
IcuTimeZoneInfo::acquire(), starts doing stuff, and is descheduled by
the kernel before it can call release().  T2 is then scheduled, and
calls acquire().  It can't acquire the lock because T1 holds it, so it
simply spins until the end of its timeslice, blocking T1 and any other
threads in of the process.

Valgrind serialises thread scheduling, because it isn't internally
multithreaded, and only allows CPU bound threads to drop the processor
every few hundred thousand simulated instructions.  So it is
effectively an extreme case of "single core machine".  I think this is
the cause of these variable-length delays.

[1] https://public-artifacts.taskcluster.net/eT29-666Q-21mkLBdO_MsA/0/public/logs/live_backing.log

Julian Seward [:jseward]

Comment 27

•

8 years ago

IMO IcuTimeZoneInfo has the following problems:

[1] Per comment #26, on single core hardware, it will cause stalling
    of all other threads for lengths of time depending on the kernel's
    scheduling, and possibly much longer than the length of time that
    the original authors envisaged a spinlock typically being held.

[2] acquire() does not include an Intel PAUSE instruction in the spin
    loop.  Per the Intel Optimisation guide ("Intel 64 and IA-32
    Architectures Optimization Reference Manual", Sept 2015), Sec 8.4.3
    "Optimization with Spin-Locks":
    
      If an application must use spin-locks, include the PAUSE
      instruction in the wait loop.
      [..]
      In spite of the name, the PAUSE instruction improves performance
      by introducing a slight delay in the loop and effectively
      causing the memory read requests to be issued at a rate that
      allows immediate detection of any store to the synchronization
      variable. This prevents the occurrence of a long delay due to
      memory order violation.

Adding a PAUSE insn (a.k.a. "rep nop") would also fix the problem with
Valgrind, since Valgrind also takes PAUSE as a hint to deschedule the
calling thread in near future, hence allowing other threads to run.

Unfortunately [2] is an Intel-specific fix.  I imagine other
architectures have similar instructions to deal with the same problem.

Considering the need to also fix [1], perhaps it would be safer to
rewrite the loop to spin for a few hundred iterations at most, and
then call sched_yield().  Is there any equivalent call for Windows?

Nicholas Nethercote [inactive]

Comment 28

•

8 years ago

ICU! /me shakes fist

Great analysis, jseward.

Nicholas Nethercote [inactive]

Comment 29

•

8 years ago

Any idea why this suddenly increased 9 days ago (i.e. at comment 4)?

Comment hidden (Intermittent Failures Robot)

35 automation job failures were associated with this bug yesterday.

Repository breakdown:
* autoland: 17
* mozilla-inbound: 16
* try: 1
* graphics: 1

Platform breakdown:
* linux64: 35

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1309851&startday=2016-11-08&endday=2016-11-08&tree=all

Julian Seward [:jseward]

Comment 31

•

8 years ago

(In reply to Nicholas Nethercote [:njn] from comment #29)
> Any idea why this suddenly increased 9 days ago (i.e. at comment 4)?

I don't know, but I have half a theory.  It seems to me that the
spinlock problems are being exacerbated by valgrind's use of its
pipe-based locking scheme (--fair-sched=no, the default).  This often
has the effect of causing the same thread to get scheduled for long
periods at a time even though other threads are runnable.

I'm experimenting with adding --fair-sched=yes as a backstop fix.
This will limit the worst-case spin length to around a million
instructions, which, whilst still bad, is much better than what's
described in comment 26.

I did three runs with this last night and it looked good.  jmaher
suggested 50 successful runs would be a good trial before landing it.
I'm testing more this morning.

So why has it changed?  Given that the current settings are very
sensitive to how the kernel schedules threads, it might be that the
EC2 instances are migrating to newer kernels with different
scheduling, or newer hardware, or a different VM setup.  At least,
that's my best guess.

Julian Seward [:jseward]

Comment 32

•

8 years ago

Attached patch bug1309851-1.cset — Details — Splinter Review

Proposed backstop fix, as described in comment 31.

Julian Seward [:jseward]

Comment 33

•

8 years ago

With the patch in comment 32, I have 13 out of 13 test runs finished
without timeout [1].  These are all on the us-west-* group, node type
m4.4xlarge.  If I had to guess, I'd say it means there's a baseline
"spin-free" time of about 19 minutes required, plus random extra time
for spins that can drag it out to about 28 minutes.  Still perilously
close to the 30 minute deadline.

Another 37 test runs are in progress.

[1]
20:59.11 TEST-PASS | valgrind-test | valgrind found no errors
18:57.00 TEST-PASS | valgrind-test | valgrind found no errors
25:14.30 TEST-PASS | valgrind-test | valgrind found no errors
19:19.14 TEST-PASS | valgrind-test | valgrind found no errors
20:00.80 TEST-PASS | valgrind-test | valgrind found no errors
21:42.84 TEST-PASS | valgrind-test | valgrind found no errors
23:20.61 TEST-PASS | valgrind-test | valgrind found no errors
23:18.44 TEST-PASS | valgrind-test | valgrind found no errors
19:59.72 TEST-PASS | valgrind-test | valgrind found no errors
19:28.44 TEST-PASS | valgrind-test | valgrind found no errors
19:28.30 TEST-PASS | valgrind-test | valgrind found no errors
27:09.68 TEST-PASS | valgrind-test | valgrind found no errors
27:40.08 TEST-PASS | valgrind-test | valgrind found no errors

Julian Seward [:jseward]

Comment 34

•

8 years ago

A further observation: whilst investigating earlier, I used a more
recent Valgrind (3.12) than what is in the desktop-build image (3.10).
It runs the workload in under 12 minutes, compared to 19+ for 3.10,
on the same node type (m4.4xlarge).

I'm not sure it's an apples-to-apples comparison, and I'm surprised
how big the difference is.  That said, 3.12 does contain various changes
that make it scale better for recent Gecko-sized workloads, so it would
be useful to move to that as a followup, if only to gain a bit more
performance headroom.

Three test runs with 3.12:

11:17.48 TEST-PASS | valgrind-test | valgrind found no errors
10:47.60 TEST-PASS | valgrind-test | valgrind found no errors
10:36.22 TEST-PASS | valgrind-test | valgrind found no errors

Progress numbers (just as a record), reporting interval 10s, last reports:

PROGRESS: U 670s, W 673s, 99.6% CPU, EvC 2224.82M, TIn 948.8k, TOut 654.3k, #thr 27
PROGRESS: U 640s, W 642s, 99.7% CPU, EvC 1472.68M, TIn 956.6k, TOut 666.8k, #thr 27
PROGRESS: U 630s, W 633s, 99.5% CPU, EvC 1608.70M, TIn 942.9k, TOut 629.9k, #thr 25

Dustin J. Mitchell [:dustin] (he/him)

Comment 35

•

8 years ago

We did upgrade instance kernels on Oct 21 (for badcow)

Julian Seward [:jseward]

Comment 36

•

8 years ago

(In reply to Dustin J. Mitchell [:dustin] from comment #35)
> We did upgrade instance kernels on Oct 21 (for badcow)

Was it a large change, or just a small fix just for that problem?

Julian Seward [:jseward]

Comment 37

•

8 years ago

Here are results from 49 test runs of the patch in comment 32 (the
50th failed in build for some reason).  These are sorted by run time
of the valgrind component.  There are 4 timeouts in 49 runs.  Is there
a way to determine whether this is an improvement the current situation?

18:57.00 TEST-PASS | valgrind-test | valgrind found no errors
19:14.51 TEST-PASS | valgrind-test | valgrind found no errors
19:14.81 TEST-PASS | valgrind-test | valgrind found no errors
19:19.14 TEST-PASS | valgrind-test | valgrind found no errors
19:20.47 TEST-PASS | valgrind-test | valgrind found no errors
19:28.30 TEST-PASS | valgrind-test | valgrind found no errors
19:28.44 TEST-PASS | valgrind-test | valgrind found no errors
19:38.65 TEST-PASS | valgrind-test | valgrind found no errors
19:44.36 TEST-PASS | valgrind-test | valgrind found no errors
19:53.37 TEST-PASS | valgrind-test | valgrind found no errors
19:54.08 TEST-PASS | valgrind-test | valgrind found no errors
19:59.72 TEST-PASS | valgrind-test | valgrind found no errors
20:00.80 TEST-PASS | valgrind-test | valgrind found no errors
20:04.75 TEST-PASS | valgrind-test | valgrind found no errors
20:05.67 TEST-PASS | valgrind-test | valgrind found no errors
20:11.92 TEST-PASS | valgrind-test | valgrind found no errors
20:12.05 TEST-PASS | valgrind-test | valgrind found no errors
20:19.26 TEST-PASS | valgrind-test | valgrind found no errors
20:24.41 TEST-PASS | valgrind-test | valgrind found no errors
20:29.52 TEST-PASS | valgrind-test | valgrind found no errors
20:42.91 TEST-PASS | valgrind-test | valgrind found no errors
20:43.77 TEST-PASS | valgrind-test | valgrind found no errors
20:49.58 TEST-PASS | valgrind-test | valgrind found no errors
20:57.86 ==43939== ERROR SUMMARY: 822 errors from 822 contexts (suppressed: 96 from 69) (non-timeout failure)
20:59.11 TEST-PASS | valgrind-test | valgrind found no errors
21:40.81 TEST-PASS | valgrind-test | valgrind found no errors
21:42.84 TEST-PASS | valgrind-test | valgrind found no errors
21:46.36 TEST-PASS | valgrind-test | valgrind found no errors
22:02.50 TEST-PASS | valgrind-test | valgrind found no errors
22:05.93 TEST-PASS | valgrind-test | valgrind found no errors
22:09.43 TEST-PASS | valgrind-test | valgrind found no errors
22:10.02 TEST-PASS | valgrind-test | valgrind found no errors
22:29.88 TEST-PASS | valgrind-test | valgrind found no errors
22:31.14 TEST-PASS | valgrind-test | valgrind found no errors
23:18.44 TEST-PASS | valgrind-test | valgrind found no errors
23:20.61 TEST-PASS | valgrind-test | valgrind found no errors
23:51.42 TEST-PASS | valgrind-test | valgrind found no errors
23:57.71 TEST-PASS | valgrind-test | valgrind found no errors
24:53.62 TEST-PASS | valgrind-test | valgrind found no errors
25:14.30 TEST-PASS | valgrind-test | valgrind found no errors
27:09.68 TEST-PASS | valgrind-test | valgrind found no errors
27:40.08 TEST-PASS | valgrind-test | valgrind found no errors
27:44.94 TEST-PASS | valgrind-test | valgrind found no errors
29:50.67 TEST-PASS | valgrind-test | valgrind found no errors
29:56.46 TEST-PASS | valgrind-test | valgrind found no errors
30:01.67 TEST-UNEXPECTED-FAIL | valgrind-test | Valgrind timed out (reached 1800 second limit)
30:01.74 TEST-UNEXPECTED-FAIL | valgrind-test | Valgrind timed out (reached 1800 second limit)
30:01.76 TEST-UNEXPECTED-FAIL | valgrind-test | Valgrind timed out (reached 1800 second limit)
30:01.81 TEST-UNEXPECTED-FAIL | valgrind-test | Valgrind timed out (reached 1800 second limit)

Julian Seward [:jseward]

Comment 38

•

8 years ago

I propose the following package of mitigations, in the given order:

[1] land the patch in comment 32.

[2] increase the timeout modestly to (eg) 2100 seconds.

[3] change the valgrind in the desktop-build image to
    3.12.0+mozilla-special-hacks, for improved performance
    and diagnosability.

[4] fix the underlying spinloop problem in IcuTimeZoneInfo,
    per comment 26.

Dustin J. Mitchell [:dustin] (he/him)

Comment 39

•

8 years ago

(In reply to Julian Seward [:jseward] from comment #36)
> Was it a large change, or just a small fix just for that problem?

Just in the patch version, but the increment was more than one:
  https://github.com/taskcluster/docker-worker/pull/256/files

Julian Seward [:jseward]

Updated

•

8 years ago

Attachment #8808982 - Flags: review?(n.nethercote)

Nicholas Nethercote [inactive]

Updated

•

8 years ago

Attachment #8808982 - Flags: review?(n.nethercote) → review+

Comment hidden (Intermittent Failures Robot)

33 automation job failures were associated with this bug yesterday.

Repository breakdown:
* mozilla-inbound: 17
* autoland: 12
* try: 2
* mozilla-central: 2

Platform breakdown:
* linux64: 33

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1309851&startday=2016-11-09&endday=2016-11-09&tree=all

Julian Seward [:jseward]

Updated

•

8 years ago

Summary: Intermittent valgrind-test | Valgrind timed out (reached 1800 second limit) → Fix "Intermittent valgrind-test | Valgrind timed out (reached 1800 second limit)"

Pulsebot

Comment 41

•

8 years ago

Pushed by jseward@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/87a690e25827
Fix "Intermittent valgrind-test | Valgrind timed out (reached 1800 second limit)".  r=n.nethercote@gmail.com.

Carsten Book [:Tomcat]

Comment 42

•

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/87a690e25827

Status: NEW → RESOLVED

Closed: 8 years ago

status-firefox52: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla52

Jeff Walden [:Waldo]

Assignee

Comment 43

•

8 years ago

Attached patch Possible patch — Details — Splinter Review

There's actually more than one of these sorts of spinlock in this code, so fix 'em both.

sfink, you reviewed one of these when it was added, and I'm going to pretend you're familiar these days with ExclusiveData use, so you get the short straw.  :-)

Attachment #8809626 - Flags: review?(sphink)

Jeff Walden [:Waldo]

Assignee

Updated

•

8 years ago

Assignee: nobody → jwalden+bmo

Jeff Walden [:Waldo]

Assignee

Updated

•

8 years ago

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Comment hidden (Intermittent Failures Robot)

124 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 60
* autoland: 48
* try: 11
* mozilla-central: 3
* graphics: 2

Platform breakdown:
* linux64: 124

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1309851&startday=2016-11-07&endday=2016-11-13&tree=all

Julian Seward [:jseward]

Comment 45

•

8 years ago

(In reply to Carsten Book [:Tomcat] from comment #42)
> https://hg.mozilla.org/mozilla-central/rev/87a690e25827

With the comment 42 landing on 10 Nov, the level of orangeness has
gone down by perhaps a factor of 5, but is still not at zero:

https://brasstacks.mozilla.com/orangefactor/index.html?display=Bug&bugid=1309851&entireHistory=true&tree=trunk

so we need Waldo's comment 43 patch in there too.

Julian Seward [:jseward]

Comment 46

•

8 years ago

Some numbers, measured locally, no origin tracking, somewhat unscientific:

comment 32 patch (set --fair-sched=yes)
 |    comment 43 patch (spinlock removal)
 |    |
 no   no   U 338s, W 340s, 99.4% CPU, EvC 1688.43M, TIn 898.1k, TOut 324.5k
 yes  no   U 333s, W 335s, 99.4% CPU, EvC 1530.30M, TIn 907.5k, TOut 327.7k
 no   yes  U 327s, W 328s, 99.7% CPU, EvC 1524.19M, TIn 897.7k, TOut 318.1k
 yes  yes  U 330s, W 332s, 99.4% CPU, EvC 1512.92M, TIn 900.1k, TOut 322.1k

This more or less confirms what was suspected above:

--fair-sched=yes [c32 patch] gives a backstop, EvC = 1688M reduced to 1530M
remove spinlocks [c43 patch] also fixes it,    EvC = 1688M reduced to 1524M

but both are needed to be safe, per comment 45.

Julian Seward [:jseward]

Comment 47

•

8 years ago

[10:02] <sfink> ok. I looked at the patch, and my comment would be
"r=me, but I haven't used ExclusiveData before and I'm not confident 
about the rules around static vs lazy initialization and things, and
I'm not going to educate myself right now, so I'd prefer if this got
a second review from fitzgen."

Julian Seward [:jseward]

Updated

•

8 years ago

Attachment #8809626 - Flags: review?(sphink) → review?(nfitzgerald)

Nick Fitzgerald [:fitzgen] [⏰PST; UTC-8]

Comment 48

•

8 years ago

Comment on attachment 8809626 [details] [diff] [review]
Possible patch

Review of attachment 8809626 [details] [diff] [review]:
-----------------------------------------------------------------

Looks great! Always love to see more ExclusiveData usage.

Nice bug hunting :)

Attachment #8809626 - Flags: review?(nfitzgerald) → review+

Pulsebot

Comment 49

•

8 years ago

Pushed by jwalden@mit.edu:
https://hg.mozilla.org/integration/mozilla-inbound/rev/96186bf75ec2
Use proper mutices to protect local time zone adjustment data and the up-to-dateness of ICU's cached local time zone.  r=sfink, r=fitzgen

Pulsebot

Comment 50

•

8 years ago

Pushed by jwalden@mit.edu:
https://hg.mozilla.org/integration/mozilla-inbound/rev/1c7e3dc77f56
Fix unified builds.  r=bustage

Comment hidden (Intermittent Failures Robot)

9 failures in 715 pushes (0.013 failures/push) were associated with this bug in the last 7 days.  

Repository breakdown:
* mozilla-inbound: 5
* autoland: 3
* graphics: 1

Platform breakdown:
* linux64: 9

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1309851&startday=2016-11-14&endday=2016-11-20&tree=all

Carsten Book [:Tomcat]

Comment 52

•

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/96186bf75ec2
https://hg.mozilla.org/mozilla-central/rev/1c7e3dc77f56

Status: REOPENED → RESOLVED

Closed: 8 years ago → 8 years ago

status-firefox53: --- → fixed

Resolution: --- → FIXED

Comment hidden (Intermittent Failures Robot)

8 failures in 623 pushes (0.013 failures/push) were associated with this bug in the last 7 days.  

Repository breakdown:
* mozilla-inbound: 4
* autoland: 3
* try: 1

Platform breakdown:
* linux64: 8

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1309851&startday=2016-11-21&endday=2016-11-27&tree=all

with-instrumented-valgrind-2.cset 8 years ago Julian Seward [:jseward] 8.83 KB, patch		Details \| Diff \| Splinter Review
bug1309851-1.cset 8 years ago Julian Seward [:jseward] 1.30 KB, patch	n.nethercote : review+	Details \| Diff \| Splinter Review
Possible patch 8 years ago Jeff Walden [:Waldo] 13.59 KB, patch	fitzgen : review+	Details \| Diff \| Splinter Review