Closed Bug 1309851
Opened 7 years ago
Closed 7 years ago
Fix "Intermittent valgrind-test | Valgrind timed out (reached 1800 second limit)"
Categories: Core :: General, defect
Tracking: RESOLVED FIXED in mozilla52
People: Reporter: intermittent-bug-filer; Assigned: Waldo
Keywords: intermittent-failure
Attachments (3 files)
8.83 KB, patch (Details | Diff | Splinter Review)
1.30 KB, patch, n.nethercote: review+ (Details | Diff | Splinter Review)
13.59 KB, patch, fitzgen: review+ (Details | Diff | Splinter Review)
Filed by: tomcat [at] mozilla.com
https://treeherder.mozilla.org/logviewer.html#?job_id=37539180&repo=mozilla-inbound
https://queue.taskcluster.net/v1/task/JnhisL2kSRG4L8x1WYCR4g/runs/0/artifacts/public%2Flogs%2Flive_backing.log
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment 7•7 years ago
I looked at a few of these logs. In some cases, V makes steady but slow progress, emitting a message of some kind or other every half minute or so, but eventually falls over the 30 minute limit. In other cases, there seems to be a very large delay in between messages and then it times out almost immediately thereafter.

[task 2016-10-13T10:38:00.494751Z] 10:38:00 INFO - 1:39.97 --44540-- memcheck GC: 90410 nodes, 18427 survivors ( 20.3%)
[task 2016-10-13T11:05:21.592269Z] 11:05:21 INFO - 29:01.07 --44540-- memcheck GC: 90410 nodes, 19555 survivors ( 21.6%)

The "memcheck GC:" messages make it somewhat possible to identify points of equal progress in the different logs.

I am puzzled by the large variability in rates of progress. Do the compute nodes have swap space? Could it be that, in the slow cases, we are hitting swap a lot? Is there a way to tell?

FWIW, on a 2.5 GHz Haswell with essentially unlimited memory, this test takes about 18 minutes to run.
Flags: needinfo?(jseward)
Comment 8•7 years ago
Dustin tells me that we don't use swap on these machines, so cancel that theory. He did also say that there's considerable variability in performance across different instances of supposedly the same type of node, though.
Comment 9•7 years ago
(In reply to Julian Seward [:jseward] from comment #7)
> FWIW, on a 2.5 GHz Haswell with essentially unlimited memory, this test
> takes about 18 minutes to run.

And about 2.9GB VSS, 1.9GB RSS.

One possibility is to disable origin tracking in the run. This doesn't reduce V's ability to detect errors, but it does make it a bit more difficult to diagnose them. Given that errors are relatively rare and that diagnosing them usually requires multiple local runs, it wouldn't be a bad tradeoff for runs on automation, although we'd have to find a way to ensure origin tracking wasn't disabled on local runs of |mach valgrind-test|.

With it disabled, the run time on this machine falls from 18 mins to 10 mins, and max VSS is about 2.6GB, so the speedup is significant.
Comment 10•7 years ago
> we'd have to find a way to ensure origin tracking wasn't disabled on local runs of |mach valgrind-test|.
Keep origin tracking as the default for |mach valgrind-test|, but add a flag to disable it and then use that flag on automation?
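For reference, origin tracking corresponds to Memcheck's --track-origins option. A sketch of the two configurations; the mach flag name below is hypothetical, since this bug doesn't record what was eventually chosen:

# Local developer runs: keep origin tracking for diagnosable errors.
./mach valgrind-test                      # harness passes --track-origins=yes

# Automation: trade diagnosability for roughly 2x speed (see comment 9).
# NOTE: the flag name is hypothetical, for illustration only.
./mach valgrind-test --no-track-origins   # harness would pass --track-origins=no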
Comment hidden (Intermittent Failures Robot)
Comment 12•7 years ago
(In reply to Julian Seward [:jseward] from comment #7)
> I am puzzled by the large variability in rates of progress.

The logs have timestamps. These show wallclock time, not CPU time. So there's no way to distinguish (1) "the compute node is running slowly for whatever reason" vs (2) "valgrind has fallen into some pathological slow path". Given the difficulty V has with dealing with some of the SpiderMonkey behaviour, and the fact that this is a somewhat older version, (2) is a real possibility.

One way to distinguish these is to run with a hacked valgrind that monitors its own CPU usage and prints a line every (eg) 30 seconds, possibly with some other critical statistics. So the problem becomes: how to do one of these runs on the automation nodes, using a custom valgrind build.
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment 17•7 years ago
:jseward, do you have plans to fix this issue? As it stands this is our #1 outstanding intermittent orange. I see you have explained the issue above, but I am not clear if there is a pending fix or if you need help, etc.
Flags: needinfo?(jseward)
Comment 18•7 years ago
Well, one immediate thing we could try is to increase the 1800 second limit. That's just treating the symptoms, though.

I spent much of Friday trying to build the "desktop-test" (Ubuntu 12.04) image locally, so I can figure out how to push custom valgrind builds into the run. I even have the custom valgrind available now. But my build attempts have so far failed. I am continuing to try today. If you know of some other way to get a modified valgrind into the runs, I'd be happy to go with that too.
Flags: needinfo?(jseward)
Comment 19•7 years ago
As a note, we have a 12.04 and a 16.04 version of desktop-test; possibly starting with 16.04 would be better, as we are slowly migrating there. If you want to redo the custom image, we can test that on try (we do dynamic docker image creation). A lot of custom tools are added here:

https://dxr.mozilla.org/mozilla-central/source/taskcluster/scripts/tester/test-ubuntu.sh?q=path%3Atest-ubuntu.sh&redirect_type=single

Possibly we can experiment on try; we would just need to make sure that all android/linux tests pass after modifying the image.
Comment hidden (Intermittent Failures Robot)
Comment 21•7 years ago
Patch that makes the test job run with a modified valgrind 3.12.0 (the modification is itself within the patch).

Minimal try syntax: mach try -b o -p linux64-valgrind -u none -t none
Comment 22•7 years ago
With help from dustin, jhford, jmaher, jonasfj (thanks all!) I managed to get a modified V into the desktop-build image, which does the V test run. The first instrumented run is [1]. I am doing further runs to get a better picture, since many runs don't time out or experience "unusual" delays.

The patch causes V to print a line like this every 10 CPU seconds:

PROGRESS: U 110s, W 114s, 96.5% CPU, EvC 475.21M, TIn 294.6k, TOut 19.4k, #thr 59

These numbers are cumulative (viz, for the whole run so far). They show:

U: user cpu time
W: wallclock time
EvC: a measure of the progress of the simulator -- the number of backwards and/or indirect branches processed
TIn: number of code superblocks instrumented
TOut: number of blocks discarded due to invalidation or capacity constraints
#thr: number of threads in the process

From these it's possible to deduce, eg:

* compute node hang or slow progress -- W increases, U doesn't
* simulation compute bound -- W and U increase together
* simulation in a loop -- EvC increases fast, TIn doesn't increase
* excessive discards, probably due to SM code overwrites -- TOut increases fast

[1] https://public-artifacts.taskcluster.net/PJlMoAraQB-GPz32zp0eHA/0/public/logs/live_backing.log
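As an illustration only, here is a minimal standalone C++ sketch of how the U / W / %CPU fields of such a PROGRESS line can be derived. The real patch lives inside Valgrind's core and uses Valgrind's internal APIs rather than libc, and the EvC/TIn/TOut counters come from its translation machinery, so they can't be reproduced outside it.

// Approximates only the U/W/%CPU fields of the PROGRESS lines above.
// Assumes POSIX (getrusage, clock_gettime); not the actual Valgrind patch.
#include <cstdio>
#include <time.h>
#include <sys/resource.h>

static double now_wall() {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

void report_progress(double wall_start) {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);             // cumulative user CPU time
    double user = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
    double wall = now_wall() - wall_start;   // cumulative wallclock time
    if (wall > 0)
        std::printf("PROGRESS: U %.0fs, W %.0fs, %.1f%% CPU\n",
                    user, wall, 100.0 * user / wall);
}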
Comment 23•7 years ago
> [1] https://public-artifacts.taskcluster.net/PJlMoAraQB-GPz32zp0eHA/0/public/logs/live_backing.log
This shows an interesting plateau between 170 and 360 seconds. At this point
the instrumented block count stays fixed at 300.0k -- the bare minimum needed to
get Gecko started -- whilst the simulator (EvC #s) makes relatively rapid
forward progress, and U and W advance together. This suggests that Gecko is in a
loop during this time and is not touching any new code. And that resource problems
on the compute node are not a cause of the delay.
After this (not shown here) it returns to the more normal pattern, in which the
simulation advances more slowly, and TIn/TOut numbers increase rapidly, due to
processing of lots of short-lifetime bits of machine code created by Gecko's
various JITs.
PROGRESS: U 160s, W 164s, 97.6% CPU, EvC 1541.29M, TIn 298.8k, TOut 19.7k, #thr 58
PROGRESS: U 170s, W 174s, 97.7% CPU, EvC 1755.65M, TIn 299.6k, TOut 19.7k, #thr 58
PROGRESS: U 180s, W 184s, 97.8% CPU, EvC 1972.37M, TIn 300.0k, TOut 19.7k, #thr 58
PROGRESS: U 190s, W 194s, 97.9% CPU, EvC 2190.27M, TIn 300.0k, TOut 19.7k, #thr 58
PROGRESS: U 200s, W 204s, 98.0% CPU, EvC 2408.18M, TIn 300.0k, TOut 19.7k, #thr 58
PROGRESS: U 210s, W 214s, 98.1% CPU, EvC 2625.88M, TIn 300.0k, TOut 19.7k, #thr 58
PROGRESS: U 220s, W 224s, 98.2% CPU, EvC 2843.98M, TIn 300.0k, TOut 19.7k, #thr 58
PROGRESS: U 230s, W 234s, 98.3% CPU, EvC 3062.09M, TIn 300.0k, TOut 19.7k, #thr 58
PROGRESS: U 240s, W 244s, 98.4% CPU, EvC 3279.69M, TIn 300.0k, TOut 19.7k, #thr 58
PROGRESS: U 250s, W 254s, 98.4% CPU, EvC 3497.30M, TIn 300.0k, TOut 19.7k, #thr 58
PROGRESS: U 260s, W 264s, 98.5% CPU, EvC 3715.10M, TIn 300.0k, TOut 19.7k, #thr 58
PROGRESS: U 270s, W 274s, 98.5% CPU, EvC 3933.00M, TIn 300.0k, TOut 19.7k, #thr 58
PROGRESS: U 280s, W 284s, 98.6% CPU, EvC 4150.81M, TIn 300.0k, TOut 19.7k, #thr 58
PROGRESS: U 290s, W 294s, 98.6% CPU, EvC 4368.41M, TIn 300.0k, TOut 19.7k, #thr 58
PROGRESS: U 300s, W 304s, 98.7% CPU, EvC 4586.52M, TIn 300.0k, TOut 19.7k, #thr 58
PROGRESS: U 310s, W 314s, 98.7% CPU, EvC 4804.52M, TIn 300.0k, TOut 19.7k, #thr 58
PROGRESS: U 320s, W 324s, 98.8% CPU, EvC 5022.13M, TIn 300.0k, TOut 19.7k, #thr 58
PROGRESS: U 330s, W 334s, 98.8% CPU, EvC 5239.93M, TIn 300.0k, TOut 19.7k, #thr 58
PROGRESS: U 340s, W 344s, 98.8% CPU, EvC 5457.94M, TIn 300.0k, TOut 19.7k, #thr 58
PROGRESS: U 350s, W 354s, 98.9% CPU, EvC 5675.64M, TIn 300.0k, TOut 19.7k, #thr 58
PROGRESS: U 360s, W 364s, 98.9% CPU, EvC 5893.15M, TIn 300.0k, TOut 19.7k, #thr 58
PROGRESS: U 370s, W 374s, 98.9% CPU, EvC 5943.00M, TIn 316.2k, TOut 41.1k, #thr 60
Comment 24•7 years ago
A second run [1] which does time out winds up in two apparent long loops, similar to comment #23, except they occupy most of the 1800 seconds, and the second loop is still running when it times out:

PROGRESS: U 100s, W 105s, 95.2% CPU, EvC 241.39M, TIn 296.1k, TOut 20.5k, #thr 59
..
PROGRESS: U 560s, W 564s, 99.3% CPU, EvC 10273.09M, TIn 296.1k, TOut 20.5k, #thr 59

and

PROGRESS: U 590s, W 594s, 99.3% CPU, EvC 10897.11M, TIn 301.7k, TOut 20.9k, #thr 58
..
PROGRESS: U 1,790s, W 1,792s, 99.9% CPU, EvC 37044.95M, TIn 301.7k, TOut 20.9k, #thr 58

They can't be very far apart in the startup sequence, either: TIns of 296.1k vs 301.7k. By comparison a "normal" startup of the browser easily passes 400k TIn before we get anything we can interact with.

What is it that the Valgrind test job actually does? Does it involve any kind of race condition or real-time-ness, that could account for such a large amount of nondeterminism?

[1] https://public-artifacts.taskcluster.net/P4n9SdIrTAuSV3_05kJrow/0/public/logs/live_backing.log
Comment 25•7 years ago
> What is it that the Valgrind test job actually does? Does it involve any
> kind of race condition or real-time-ness, that could account for such a
> large amount of nondeterminism?
It just loads build/pgo/index.html, which opens a couple of basic pages and then runs sunspider. Nothing complicated at all.
Comment 26•7 years ago
I believe this is caused by IcuTimeZoneInfo::acquire(), in js/src/vm/DateTime.cpp. It acquired its current form in bug 1220693.

Profiling valgrind running in automation [1] shows a spin of a 4 instruction block that lasts for more than a billion simulated instructions. This happens around the point TIn=295k, which is consistent with observations in comments 23 and 24. The block is JS::ResetTimeZone()+59. JS::ResetTimeZone() doesn't have a loop, but IcuTimeZoneInfo::acquire() is inlined into it, and profiling locally shows the big cost in IcuTimeZoneInfo::acquire()+9, so presumably the appearance in JS::ResetTimeZone() results from inlining.

Imagine the following scenario, on a single core machine. T1 enters IcuTimeZoneInfo::acquire(), starts doing stuff, and is descheduled by the kernel before it can call release(). T2 is then scheduled, and calls acquire(). It can't acquire the lock because T1 holds it, so it simply spins until the end of its timeslice, blocking T1 and any other threads in the process.

Valgrind serialises thread scheduling, because it isn't internally multithreaded, and only allows CPU bound threads to drop the processor every few hundred thousand simulated instructions. So it is effectively an extreme case of "single core machine". I think this is the cause of these variable-length delays.

[1] https://public-artifacts.taskcluster.net/eT29-666Q-21mkLBdO_MsA/0/public/logs/live_backing.log
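To make the failure mode concrete, here is a hypothetical sketch of the kind of bare spinlock described above; this is the general pattern, not the actual IcuTimeZoneInfo source:

#include <atomic>

// Hypothetical spinlock pattern, for illustration only.
class SpinLock {
    std::atomic<bool> locked_{false};
  public:
    void acquire() {
        // If the holder (T1) is descheduled before release(), a waiter
        // (T2) burns its entire timeslice in this loop. Under Valgrind's
        // serialised scheduler a CPU-bound thread keeps the (single)
        // simulated processor for a few hundred thousand instructions,
        // so the spin can run enormously long before T1 runs again.
        while (locked_.exchange(true, std::memory_order_acquire)) {
            // pure busy-wait: no PAUSE, no yield
        }
    }
    void release() { locked_.store(false, std::memory_order_release); }
};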
Comment 27•7 years ago
IMO IcuTimeZoneInfo has the following problems:

[1] Per comment #26, on single core hardware, it will cause stalling of all other threads for lengths of time depending on the kernel's scheduling, and possibly much longer than the length of time that the original authors envisaged a spinlock typically being held.

[2] acquire() does not include an Intel PAUSE instruction in the spin loop. Per the Intel Optimisation guide ("Intel 64 and IA-32 Architectures Optimization Reference Manual", Sept 2015), Sec 8.4.3 "Optimization with Spin-Locks":

  If an application must use spin-locks, include the PAUSE instruction in the wait loop. [..] In spite of the name, the PAUSE instruction improves performance by introducing a slight delay in the loop and effectively causing the memory read requests to be issued at a rate that allows immediate detection of any store to the synchronization variable. This prevents the occurrence of a long delay due to memory order violation.

Adding a PAUSE insn (a.k.a. "rep nop") would also fix the problem with Valgrind, since Valgrind also takes PAUSE as a hint to deschedule the calling thread in the near future, hence allowing other threads to run.

Unfortunately [2] is an Intel-specific fix. I imagine other architectures have similar instructions to deal with the same problem. Considering the need to also fix [1], perhaps it would be safer to rewrite the loop to spin for a few hundred iterations at most, and then call sched_yield(). Is there any equivalent call for Windows?
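A sketch of the bounded-spin-then-yield mitigation suggested above, under these assumptions: x86 for _mm_pause, POSIX for sched_yield; on Windows, SwitchToThread() and the YieldProcessor() macro play roughly the same roles. (The fix that eventually landed took a different route and replaced the spinlocks with proper mutexes; see comment 43.)

#include <atomic>
#include <immintrin.h>  // _mm_pause, a.k.a. "rep nop" (x86 only)
#include <sched.h>      // sched_yield (POSIX)

void acquire(std::atomic<bool>& locked) {
    for (;;) {
        // Spin a few hundred iterations at most, with PAUSE, which is
        // also a descheduling hint to Valgrind.
        for (int i = 0; i < 256; i++) {
            if (!locked.load(std::memory_order_relaxed) &&
                !locked.exchange(true, std::memory_order_acquire))
                return;
            _mm_pause();
        }
        // Bounded spin exhausted: give the lock holder a chance to run.
        sched_yield();
    }
}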
Comment 28•7 years ago
ICU! /me shakes fist Great analysis, jseward.
Comment 29•7 years ago
Any idea why this suddenly increased 9 days ago (i.e. at comment 4)?
Comment hidden (Intermittent Failures Robot)
Comment 31•7 years ago
(In reply to Nicholas Nethercote [:njn] from comment #29)
> Any idea why this suddenly increased 9 days ago (i.e. at comment 4)?

I don't know, but I have half a theory. It seems to me that the spinlock problems are being exacerbated by valgrind's use of its pipe-based locking scheme (--fair-sched=no, the default). This often has the effect of causing the same thread to get scheduled for long periods at a time even though other threads are runnable.

I'm experimenting with adding --fair-sched=yes as a backstop fix. This will limit the worst-case spin length to around a million instructions, which, whilst still bad, is much better than what's described in comment 26. I did three runs with this last night and it looked good. jmaher suggested 50 successful runs would be a good trial before landing it. I'm testing more this morning.

So why has it changed? Given that the current settings are very sensitive to how the kernel schedules threads, it might be that the EC2 instances are migrating to newer kernels with different scheduling, or newer hardware, or a different VM setup. At least, that's my best guess.
Comment 32•7 years ago
Proposed backstop fix, as described in comment 31.
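For context, --fair-sched is a stock Valgrind option (no|yes|try), so the patch only needs to flip it for this job. A sketch of the invocation change, with illustrative surrounding arguments rather than the exact automation command line:

# Default is --fair-sched=no: pipe-based locking, which can let one
# thread hold the simulated CPU for long stretches while others are
# runnable. Round-robin scheduling caps the worst-case spin length.
valgrind --fair-sched=yes --smc-check=all-non-file ./firefox ...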
Comment 33•7 years ago
With the patch in comment 32, I have 13 out of 13 test runs finished without timeout [1]. These are all on the us-west-* group, node type m4.4xlarge.

If I had to guess, I'd say it means there's a baseline "spin-free" time of about 19 minutes required, plus random extra time for spins that can drag it out to about 28 minutes. Still perilously close to the 30 minute deadline. Another 37 test runs are in progress.

[1]
20:59.11 TEST-PASS | valgrind-test | valgrind found no errors
18:57.00 TEST-PASS | valgrind-test | valgrind found no errors
25:14.30 TEST-PASS | valgrind-test | valgrind found no errors
19:19.14 TEST-PASS | valgrind-test | valgrind found no errors
20:00.80 TEST-PASS | valgrind-test | valgrind found no errors
21:42.84 TEST-PASS | valgrind-test | valgrind found no errors
23:20.61 TEST-PASS | valgrind-test | valgrind found no errors
23:18.44 TEST-PASS | valgrind-test | valgrind found no errors
19:59.72 TEST-PASS | valgrind-test | valgrind found no errors
19:28.44 TEST-PASS | valgrind-test | valgrind found no errors
19:28.30 TEST-PASS | valgrind-test | valgrind found no errors
27:09.68 TEST-PASS | valgrind-test | valgrind found no errors
27:40.08 TEST-PASS | valgrind-test | valgrind found no errors
Comment 34•7 years ago
A further observation: whilst investigating earlier, I used a more recent Valgrind (3.12) than what is in the desktop-build image (3.10). It runs the workload in under 12 minutes, compared to 19+ for 3.10, on the same node type (m4.4xlarge). I'm not sure it's an apples-to-apples comparison, and I'm surprised how big the difference is. That said, 3.12 does contain various changes that make it scale better for recent Gecko-sized workloads, so it would be useful to move to that as a followup, if only to gain a bit more performance headroom.

Three test runs with 3.12:

11:17.48 TEST-PASS | valgrind-test | valgrind found no errors
10:47.60 TEST-PASS | valgrind-test | valgrind found no errors
10:36.22 TEST-PASS | valgrind-test | valgrind found no errors

Progress numbers (just as a record), reporting interval 10s, last reports:

PROGRESS: U 670s, W 673s, 99.6% CPU, EvC 2224.82M, TIn 948.8k, TOut 654.3k, #thr 27
PROGRESS: U 640s, W 642s, 99.7% CPU, EvC 1472.68M, TIn 956.6k, TOut 666.8k, #thr 27
PROGRESS: U 630s, W 633s, 99.5% CPU, EvC 1608.70M, TIn 942.9k, TOut 629.9k, #thr 25
Comment 35•7 years ago
We did upgrade instance kernels on Oct 21 (for badcow)
Comment 36•7 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #35)
> We did upgrade instance kernels on Oct 21 (for badcow)

Was it a large change, or just a small fix for that problem?
Comment 37•7 years ago
Here are results from 49 test runs of the patch in comment 32 (the 50th failed in build for some reason). These are sorted by run time of the valgrind component. There are 4 timeouts in 49 runs. Is there a way to determine whether this is an improvement on the current situation?

18:57.00 TEST-PASS | valgrind-test | valgrind found no errors
19:14.51 TEST-PASS | valgrind-test | valgrind found no errors
19:14.81 TEST-PASS | valgrind-test | valgrind found no errors
19:19.14 TEST-PASS | valgrind-test | valgrind found no errors
19:20.47 TEST-PASS | valgrind-test | valgrind found no errors
19:28.30 TEST-PASS | valgrind-test | valgrind found no errors
19:28.44 TEST-PASS | valgrind-test | valgrind found no errors
19:38.65 TEST-PASS | valgrind-test | valgrind found no errors
19:44.36 TEST-PASS | valgrind-test | valgrind found no errors
19:53.37 TEST-PASS | valgrind-test | valgrind found no errors
19:54.08 TEST-PASS | valgrind-test | valgrind found no errors
19:59.72 TEST-PASS | valgrind-test | valgrind found no errors
20:00.80 TEST-PASS | valgrind-test | valgrind found no errors
20:04.75 TEST-PASS | valgrind-test | valgrind found no errors
20:05.67 TEST-PASS | valgrind-test | valgrind found no errors
20:11.92 TEST-PASS | valgrind-test | valgrind found no errors
20:12.05 TEST-PASS | valgrind-test | valgrind found no errors
20:19.26 TEST-PASS | valgrind-test | valgrind found no errors
20:24.41 TEST-PASS | valgrind-test | valgrind found no errors
20:29.52 TEST-PASS | valgrind-test | valgrind found no errors
20:42.91 TEST-PASS | valgrind-test | valgrind found no errors
20:43.77 TEST-PASS | valgrind-test | valgrind found no errors
20:49.58 TEST-PASS | valgrind-test | valgrind found no errors
20:57.86 ==43939== ERROR SUMMARY: 822 errors from 822 contexts (suppressed: 96 from 69) (non-timeout failure)
20:59.11 TEST-PASS | valgrind-test | valgrind found no errors
21:40.81 TEST-PASS | valgrind-test | valgrind found no errors
21:42.84 TEST-PASS | valgrind-test | valgrind found no errors
21:46.36 TEST-PASS | valgrind-test | valgrind found no errors
22:02.50 TEST-PASS | valgrind-test | valgrind found no errors
22:05.93 TEST-PASS | valgrind-test | valgrind found no errors
22:09.43 TEST-PASS | valgrind-test | valgrind found no errors
22:10.02 TEST-PASS | valgrind-test | valgrind found no errors
22:29.88 TEST-PASS | valgrind-test | valgrind found no errors
22:31.14 TEST-PASS | valgrind-test | valgrind found no errors
23:18.44 TEST-PASS | valgrind-test | valgrind found no errors
23:20.61 TEST-PASS | valgrind-test | valgrind found no errors
23:51.42 TEST-PASS | valgrind-test | valgrind found no errors
23:57.71 TEST-PASS | valgrind-test | valgrind found no errors
24:53.62 TEST-PASS | valgrind-test | valgrind found no errors
25:14.30 TEST-PASS | valgrind-test | valgrind found no errors
27:09.68 TEST-PASS | valgrind-test | valgrind found no errors
27:40.08 TEST-PASS | valgrind-test | valgrind found no errors
27:44.94 TEST-PASS | valgrind-test | valgrind found no errors
29:50.67 TEST-PASS | valgrind-test | valgrind found no errors
29:56.46 TEST-PASS | valgrind-test | valgrind found no errors
30:01.67 TEST-UNEXPECTED-FAIL | valgrind-test | Valgrind timed out (reached 1800 second limit)
30:01.74 TEST-UNEXPECTED-FAIL | valgrind-test | Valgrind timed out (reached 1800 second limit)
30:01.76 TEST-UNEXPECTED-FAIL | valgrind-test | Valgrind timed out (reached 1800 second limit)
30:01.81 TEST-UNEXPECTED-FAIL | valgrind-test | Valgrind timed out (reached 1800 second limit)
Comment 38•7 years ago
I propose the following package of mitigations, in the given order:

[1] land the patch in comment 32.
[2] increase the timeout modestly to (eg) 2100 seconds.
[3] change the valgrind in the desktop-build image to 3.12.0+mozilla-special-hacks, for improved performance and diagnosability.
[4] fix the underlying spinloop problem in IcuTimeZoneInfo, per comment 26.
Comment 39•7 years ago
(In reply to Julian Seward [:jseward] from comment #36)
> Was it a large change, or just a small fix for that problem?

Just in the patch version, but the increment was more than one:
https://github.com/taskcluster/docker-worker/pull/256/files
Updated•7 years ago
Attachment #8808982 - Flags: review?(n.nethercote)
Updated•7 years ago
Attachment #8808982 - Flags: review?(n.nethercote) → review+
Comment hidden (Intermittent Failures Robot)
Updated•7 years ago
Summary: Intermittent valgrind-test | Valgrind timed out (reached 1800 second limit) → Fix "Intermittent valgrind-test | Valgrind timed out (reached 1800 second limit)"
Comment 41•7 years ago
Pushed by jseward@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/87a690e25827
Fix "Intermittent valgrind-test | Valgrind timed out (reached 1800 second limit)". r=n.nethercote@gmail.com.
Comment 42•7 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/87a690e25827
Status: NEW → RESOLVED
Closed: 7 years ago
status-firefox52: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
Comment 43•7 years ago (Assignee)
There's actually more than one of these sorts of spinlock in this code, so fix 'em both. sfink, you reviewed one of these when it was added, and I'm going to pretend you're familiar with ExclusiveData use these days, so you get the short straw. :-)
Attachment #8809626 - Flags: review?(sphink)
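For readers unfamiliar with the pattern being moved to: ExclusiveData<T> couples data with the mutex that guards it, so the data is reachable only while the lock is held. A rough standalone approximation of the idea, using std::mutex and C++17 rather than SpiderMonkey's actual implementation (whose exact API this bug doesn't show):

#include <mutex>

// Approximation of the ExclusiveData<T> idea; not SpiderMonkey's code.
template <typename T>
class ExclusiveData {
    std::mutex mutex_;
    T value_;
  public:
    // RAII guard: the mutex is held exactly as long as the caller can
    // reach the protected value. Relies on C++17 copy elision.
    class Guard {
        std::lock_guard<std::mutex> hold_;
        T& value_;
      public:
        Guard(std::mutex& m, T& v) : hold_(m), value_(v) {}
        T* operator->() { return &value_; }
        T& operator*() { return value_; }
    };
    Guard lock() { return Guard(mutex_, value_); }  // blocks, never spins
};

// Usage sketch (names hypothetical): a waiter blocks in the kernel
// instead of burning its timeslice, which is what the spinlock got wrong.
struct TimeZoneState { bool icuCacheValid = false; };
ExclusiveData<TimeZoneState> tzState;

void resetTimeZone() {
    auto guard = tzState.lock();
    guard->icuCacheValid = false;
}  // mutex released here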
Updated•7 years ago (Assignee)
Assignee: nobody → jwalden+bmo
Updated•7 years ago (Assignee)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment hidden (Intermittent Failures Robot)
Comment 45•7 years ago
(In reply to Carsten Book [:Tomcat] from comment #42)
> https://hg.mozilla.org/mozilla-central/rev/87a690e25827

With the comment 42 landing on 10 Nov, the level of orangeness has gone down by perhaps a factor of 5, but is still not at zero:

https://brasstacks.mozilla.com/orangefactor/index.html?display=Bug&bugid=1309851&entireHistory=true&tree=trunk

so we need Waldo's comment 43 patch in there too.
Comment 46•7 years ago
Some numbers, measured locally, no origin tracking, somewhat unscientific:

c32 patch          c43 patch
(--fair-sched=yes) (spinlock removal)
no                 no                  U 338s, W 340s, 99.4% CPU, EvC 1688.43M, TIn 898.1k, TOut 324.5k
yes                no                  U 333s, W 335s, 99.4% CPU, EvC 1530.30M, TIn 907.5k, TOut 327.7k
no                 yes                 U 327s, W 328s, 99.7% CPU, EvC 1524.19M, TIn 897.7k, TOut 318.1k
yes                yes                 U 330s, W 332s, 99.4% CPU, EvC 1512.92M, TIn 900.1k, TOut 322.1k

This more or less confirms what was suspected above:

--fair-sched=yes [c32 patch] gives a backstop: EvC = 1688M reduced to 1530M
remove spinlocks [c43 patch] also fixes it:   EvC = 1688M reduced to 1524M

but both are needed to be safe, per comment 45.
Comment 47•7 years ago
[10:02] <sfink> ok. I looked at the patch, and my comment would be "r=me, but I haven't used ExclusiveData before and I'm not confident about the rules around static vs lazy initialization and things, and I'm not going to educate myself right now, so I'd prefer if this got a second review from fitzgen."
Updated•7 years ago
Attachment #8809626 - Flags: review?(sphink) → review?(nfitzgerald)
Comment 48•7 years ago
Comment on attachment 8809626 [details] [diff] [review]
Possible patch

Review of attachment 8809626 [details] [diff] [review]:
-----------------------------------------------------------------

Looks great! Always love to see more ExclusiveData usage. Nice bug hunting :)
Attachment #8809626 - Flags: review?(nfitzgerald) → review+
Comment 49•7 years ago
Pushed by jwalden@mit.edu:
https://hg.mozilla.org/integration/mozilla-inbound/rev/96186bf75ec2
Use proper mutices to protect local time zone adjustment data and the up-to-dateness of ICU's cached local time zone. r=sfink, r=fitzgen
Comment 50•7 years ago
Pushed by jwalden@mit.edu:
https://hg.mozilla.org/integration/mozilla-inbound/rev/1c7e3dc77f56
Fix unified builds. r=bustage
Comment hidden (Intermittent Failures Robot)
Comment 52•7 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/96186bf75ec2
https://hg.mozilla.org/mozilla-central/rev/1c7e3dc77f56
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
status-firefox53: --- → fixed
Resolution: --- → FIXED
Comment hidden (Intermittent Failures Robot)