Open Bug 1304005 Opened 3 years ago Updated 2 years ago
talos magically started crashing on PGO builds: Inconsistency detected by ld
.so: dl-open .c: 667: _dl _open: Assertion `_dl _debug _initialize (0, args .nsid)->r _state == RT _CONSISTENT' failed!
I have been plagued by hundreds of new reports in bug 1220362, this appears to have started up on September 15th and is becoming one of the top intermittent bugs in orange factor. I did a lot of retriggers and backfilling pgo, finally I narrowed it down to: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=linux%20pgo%20talos&fromchange=27f8a2467b316cfa2ea161f2235abbe074540494&tochange=0079d29d4ce323d812817b4a607eb91f63f680ea this is really bizarre because in rev 88a081d149b5 we have a few of these errors, then the next revision we had bad code land (no builds) which was backed out in the following revision f90c77639706 which has dozens of these errors. I verified the backout really backed out all the changes- so effectively we have two builds from the same code base producing wildly different results. I thought this might have been an issue with the builders or something, but on mozilla-aurora we see no problems, then after the merge yesterday we start seeing these problems. now if you go further down in the history, you can see this got worse with a breakpad landing and got better when it was backed out: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=linux%20pgo%20talos&fromchange=48cca0d90707e725f4fd89d835cf18783cdee0bb&tochange=0079d29d4ce323d812817b4a607eb91f63f680ea My naive viewpoint here is that our pgo process is doing something to cause this. Luckily this test fails fast, sometimes in 1 minute (yay for low test overhead), and we see a consistent pattern in the log file: 7:06:36 INFO - TEST-INFO | started process 6389 (/builds/slave/test/build/application/firefox/firefox -profile /tmp/tmphJvsM7/profile -tp file:/builds/slave/test/build/tests/talos/talos/tests/tp5n/tp5o.manifest.develop -tpchrome -tpnoisy -tpscrolltest -tpcycles 1 -tppagecycles 12) 17:06:36 INFO - PROCESS | 6389 | Inconsistency detected by ld.so: dl-open.c: 667: _dl_open: Assertion `_dl_debug_initialize (0, args.nsid)->r_state == RT_CONSISTENT' failed! 17:06:36 INFO - Terminating psutil.Process(pid=6389, name='firefox') 17:06:36 INFO - TEST-INFO | 6389: exit 127 17:06:36 INFO - TEST-UNEXPECTED-ERROR | tp5o_scroll | Could not find report in browser output: [('tsformat', ('__start_report', '__end_report')), ('tpformat', ('__start_tp_report', '__end_tp_report'))] that is the process starting up and terminating really fast (indicates a crash- but where is the crash report?). The assertion in _dl_open is odd. Why are we opening _dl_debug_initialize when we are running a non debug (pgo) build? in searching online, I see this error show up commonly and people have to reinstall plugins or libraries. I did see one issue related to firefox where changing the profile solved the problem, most likely some addon caused the issue. In the case of Talos, nothing has changed in recent times. We do have a lot of preferences set, and in the case of ts_paint (one of the tests where we see this happening), there is only one addon talos-powers: https://dxr.mozilla.org/mozilla-central/source/testing/talos/talos/talos-powers looking back in history on orange factor, August 5th is the first case of us seeing this: https://treeherder.mozilla.org/#/jobs?repo=fx-team&filter-searchStr=pgo%20talos%20linux&tochange=2c234f5a191673f6a4b0490435851e61852a54e8&fromchange=fff9d624dca283235c916d8769786e04f5622282 I did some retriggers on that range, the issues doesn't reproduce, I also did a bunch of retriggers and builds on the inbound revisions that merged into fx-team which showed the error: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=linux%20pgo%20talos&tochange=f0725e82f286&fromchange=9193c55ec783c7d2bf7049d6c28944e60b5fe612 no issues (still waiting on more data, but so far it is a super green machine). The next occurrence documented in orangefactor is August 12 on inbound, I have some retriggers pending, maybe this will show us more data. Either way something has landed/changed in our code base which is making linux64 talos tests on PGO only fail at 20%+ of the time. As it stands, linux64 talos will be hidden by default if the sheriffs stick to their visibility guidelines.
one difference between talos and other tests is we are running on hardware with different graphics card than what we have available on the AWS vm's we use for unittests.
breakpad code updated on Sept 16th in bug 1264367, not sure if this is related or not.
I was wondering if the Breakpad patch triggered this issue and then we're just seeing fallout from needing a clobber, but the PGO build process does a clean build in between steps when building with GCC, so I'm not sure that a dep build ought to matter: https://dxr.mozilla.org/mozilla-central/rev/62f79d676e0e11b3ad59a5425b3ebb3ec5bbefb5/Makefile.in#365
doing a lot of retriggers around the August 12th instance yields 1 more failure, it is still very rare: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=linux%20pgo%20talos&tochange=6accce4abaadad6c8de1de4da908132aad1f196e&fromchange=31f195c1bca6300f19bcb2ea1fd81b0deeb75564 Looking at ornage factor, then next set of failures start on August 17th, and a bunch on August 18th. let me hack on that for a while with a few hundred more jobs- currently I am backfilling a bunch of pgo builds (which will schedule a bunch of tests) on most of the commits in this range: https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-searchStr=linux%20pgo%20talos&tochange=30da32629a3c6ef74b8c66f819c08b1106f6da91&fromchange=e1c4820431c80bd3208f811f8dfb98e54811c89b
> The assertion in _dl_open is odd. Why are we opening _dl_debug_initialize when we are running a non debug (pgo) build? "Why are we". We're not. ld.so is. That message comes from ld.so, and is an assertion on internal state in the system dynamic loader. Looking at the code, it's supposed to be impossible to happen except if something touches that values *while* the _dl_open function is running (it sets it to RT_CONSISTENT, and then checks it). Or there could be a bug in glibc. Has the glibc changed on the talos machines?
I understand this is not Firefox causing problems, yet firefox is referencing API's and other libraries probably incorrectly. I would vote out glibc changing on the talos machines because doing massive retriggers this week on data from early august shows zero instances of this error happening, doing retriggers from mid august (17th) shows many instances of this happening. With the volume of retriggers and knowing we have a pool of ~70 machines, I see instances occurring in orangefactor on 001 -> 069 except for 044 and 063. I am not sure of the online pool size. :rail, can you comment on the pool size of the linux machines and if they are 001 -> 069, also if 044 and 063 are online? Lastly, do you know if we have updated the machines at all recently, possibly in early or mid august?
after many retriggers on the august 17th range: https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-searchStr=linux%20pgo%20talos&tochange=30da32629a3c6ef74b8c66f819c08b1106f6da91&fromchange=e1c4820431c80bd3208f811f8dfb98e54811c89b I have concluded that there is no one patch causing this problem, somehow we land code and get a few hundred great runs, then the next push we get a bunch of failures. Downloading a build that fails often locally on my ubuntu 16.04 desktop yields no reproducing- I have tried about 20 full test runs of a few tests with 100% passing rates. Possibly if we cannot figure this out we should consider upgrading glibc? Maybe just investigate upgrading from ubuntu 12.04 to 16.04 and rebaselining numbers? If we did that and this problem didn't go away, then where would we look? I want to explore all options and I am worried that we have no way to debug this PGO process. Maybe getting a loaner and reproducing with rr would help- I might have time on Friday for that.
(In reply to Joel Maher ( :jmaher) from comment #6) > :rail, can you comment on the pool size of the linux machines and if they > are 001 -> 069, also if 044 and 063 are online? 001 -> 069 sounds correct according to slavealloc, 044 and 063 are not in the pool, see https://bugzilla.mozilla.org/show_bug.cgi?id=1279159 https://bugzilla.mozilla.org/show_bug.cgi?id=1295369 > Lastly, do you know if we have updated the machines at all recently, possibly in early or mid august? I don't see anything obvious in http://hg.mozilla.org/build/puppet/, maybe http://hg.mozilla.org/build/puppet/rev/8a2bc1329427 somehow affects this?
adding bug 1221400 to the mix- same exact pattern of linux64 pgo only with mid august starting and spiking September 15th.
and these have thesame issue with ld.so in the log files- the ones from 1221400 have failed to initialize as we never get the warmup run completed in the browser.
:rail, thanks for confirming the machines and looking into changes. Those changes look harmless, but the timing seems right. I will keep it in the back of my mind as I didn't think machine changes would show much here- I suspect the issue is our pgo process. I have requested a loaner to reproduce this problem and try to debug it.
as an update, I have a loaner and can reproduce the problem fairly easily from a prompt using a pgo build. I have tried a few things to produce a core file: # ulimit -c unlimited # echo "* soft core unlimited" >> /etc/security/limits.conf; reboot these have resulted in no core files when the same error occurs. On top of that, I tried to use 'rr', but it didn't work out of the box, nor would it compile. I did run this under 'gdb' for a few hundred cycles and did not hit this ld.so error. I am out of options, should we upgrade to a newer version of Ubuntu? it is not straightforward to update libc6 as we currently run 2.15-0 and newer versions require many other packages to upgrade.
:rail if I were to install a new system, what graphics binaries/etc. would I need to install? I am considering installing a new OS on here so we can see if this error occurs on a more modern system. Then I can work to ensure all the talos jobs run smoothly on the updated version and we can consider upgrading all the machines.
DISCLAIMER :) It may not exactly reproduce what we have now, because I'm just trying to parse puppet manifests. Some packages and tweaks may be obsoleted and maybe not required anymore. I'm following the logic in https://dxr.mozilla.org/build-central/source/puppet/modules/toplevel/manifests/slave/releng/test/gpu.pp: It inherits a lot from https://dxr.mozilla.org/build-central/source/puppet/modules/toplevel/manifests/slave/releng/test.pp * make sure you use nvidia drivers * screen resolution should be set properly 1600x1200x32 * install xrestop, https://dxr.mozilla.org/build-central/source/puppet/modules/gui/manifests/init.pp#43-44 * https://dxr.mozilla.org/build-central/source/puppet/modules/gui/manifests/appearance.pp#39-49 disables screensaver using from https://dxr.mozilla.org/build-central/source/puppet/modules/gui/files/gsettings.gschema.override * make sure you have similar mesa packages installed: https://dxr.mozilla.org/build-central/source/puppet/modules/packages/manifests/mesa.pp#30-41 * fonts https://dxr.mozilla.org/build-central/source/puppet/modules/packages/manifests/fonts.pp#8-12 * ~/.fonts.conf https://dxr.mozilla.org/build-central/source/puppet/modules/tweaks/files/fonts.conf * tons of packages in https://dxr.mozilla.org/build-central/source/puppet/modules/talos/manifests/init.pp#20-49 * apache config https://dxr.mozilla.org/build-central/source/puppet/modules/talos/templates/talos-httpd.conf.erb
I got talos running on ubuntu 16.04 finally- after 25 cycles I was not able to see any error or reproduce this failure. I think there is a workaround we can do in mozharness to detect the error and retry (helping the sheriffs out), and then we need to figure out where upgrading to ubuntu 16.04 fits into the relops plan.
this started up again when we landed the omni.jar changed in bug 1340157. you can see it clearly here: https://treeherder.mozilla.org/#/jobs?repo=autoland&group_state=expanded&filter-searchStr=talos%20other%20linux%20pgo&tochange=29f4d23a9e97c324b40c63c71eff04add5fd2a09&fromchange=22d650dc282e70b96de2fbaac84bf69a22cd3b54 :catlee as the patch author, can you look into this? based on previous investigations this is related to the OS which we are waiting on relops to get new machines before reinstalling with ubuntu 16.04. Possibly we back out your changes?
Hmm, I'm not sure what's going on. Do we see this on Aurora as well? It also has omni.ja compression enabled.
this is not seen on aurora or beta, but it still consistent on trunk branches, for pgo only.
You need to log in before you can comment on or make changes to this bug.