Closed Bug 1045165 Opened 10 years ago Closed 10 years ago

Intermittent B2G Hazard Analysis "command timed out: 3600 seconds without output running ['scripts/scripts/hazard_build.py', '--target', 'emulator-jb', '--config-file', 'b2g/releng-emulator.py', '--b2g-config-dir', 'emulator-jb', '--c"

Categories

(Firefox Build System :: General, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(firefox32 unaffected, firefox33 unaffected, firefox34 fixed, firefox-esr24 unaffected, firefox-esr31 unaffected, b2g-v2.0 unaffected, b2g-v2.1 fixed)

RESOLVED FIXED
mozilla34
Tracking Status
firefox32 --- unaffected
firefox33 --- unaffected
firefox34 --- fixed
firefox-esr24 --- unaffected
firefox-esr31 --- unaffected
b2g-v2.0 --- unaffected
b2g-v2.1 --- fixed

People

(Reporter: RyanVM, Assigned: sfink)

References

Details

(Keywords: intermittent-failure)

Attachments

(1 file, 1 obsolete file)

Not the first of these we've been seeing, but I don't know for sure when they started either due to my own recent absence.

https://tbpl.mozilla.org/php/getParsedLog.php?id=44728692&tree=Mozilla-Inbound

b2g_mozilla-inbound_linux64-b2g-haz_dep on 2014-07-28 09:23:19 PDT for push 616ae23af35f
slave: b-linux64-hp-0026

10:10:20  WARNING -  /builds/slave/b2g_m-in_l64-b2g-haz_dep-00000/build/gecko/js/src/jsgcinlines.h:263:46: warning: '*((void*)(& i)+12).js::gc::ArenaCellIterImpl::thing' may be used uninitialized in this function [-Wmaybe-uninitialized]
10:10:20     INFO -  In file included from /builds/slave/b2g_m-in_l64-b2g-haz_dep-00000/build/objdir-gecko/js/src/Unified_cpp_js_src3.cpp:184:0:
10:10:20     INFO -  /builds/slave/b2g_m-in_l64-b2g-haz_dep-00000/build/gecko/js/src/jit/Ion.cpp:1255:34: note: '*((void*)(& i)+12).js::gc::ArenaCellIterImpl::thing' was declared here
10:10:20     INFO -  In file included from /builds/slave/b2g_m-in_l64-b2g-haz_dep-00000/build/gecko/js/src/gc/Barrier.h:12:0,
10:10:20     INFO -                   from /builds/slave/b2g_m-in_l64-b2g-haz_dep-00000/build/gecko/js/src/jsatom.h:14,
10:10:20     INFO -                   from /builds/slave/b2g_m-in_l64-b2g-haz_dep-00000/build/gecko/js/src/vm/Runtime.h:21,
10:10:20     INFO -                   from /builds/slave/b2g_m-in_l64-b2g-haz_dep-00000/build/gecko/js/src/jscntxt.h:15,
10:10:20     INFO -                   from /builds/slave/b2g_m-in_l64-b2g-haz_dep-00000/build/gecko/js/src/jit/Ion.h:14,
10:10:20     INFO -                   from /builds/slave/b2g_m-in_l64-b2g-haz_dep-00000/build/gecko/js/src/jit/FixedList.h:12,
10:10:20     INFO -                   from /builds/slave/b2g_m-in_l64-b2g-haz_dep-00000/build/gecko/js/src/jit/BaselineCompiler.h:12,
10:10:20     INFO -                   from /builds/slave/b2g_m-in_l64-b2g-haz_dep-00000/build/gecko/js/src/jit/BaselineCompiler.cpp:7,
10:10:20     INFO -                   from /builds/slave/b2g_m-in_l64-b2g-haz_dep-00000/build/objdir-gecko/js/src/Unified_cpp_js_src3.cpp:2:
10:10:20  WARNING -  /builds/slave/b2g_m-in_l64-b2g-haz_dep-00000/build/gecko/js/src/gc/Heap.h:207:49: warning: '*((void*)(& i)+12).js::gc::ArenaCellIterImpl::span.js::gc::FreeSpan::last' may be used uninitialized in this function [-Wmaybe-uninitialized]
10:10:20     INFO -  In file included from /builds/slave/b2g_m-in_l64-b2g-haz_dep-00000/build/objdir-gecko/js/src/Unified_cpp_js_src3.cpp:184:0:
10:10:20     INFO -  /builds/slave/b2g_m-in_l64-b2g-haz_dep-00000/build/gecko/js/src/jit/Ion.cpp:1255:34: note: '*((void*)(& i)+12).js::gc::ArenaCellIterImpl::span.js::gc::FreeSpan::last' was declared here
10:10:20     INFO -  In file included from /builds/slave/b2g_m-in_l64-b2g-haz_dep-00000/build/gecko/js/src/vm/Shape-inl.h:22:0,
10:10:20     INFO -                   from /builds/slave/b2g_m-in_l64-b2g-haz_dep-00000/build/gecko/js/src/jsscriptinlines.h:19,
10:10:20     INFO -                   from /builds/slave/b2g_m-in_l64-b2g-haz_dep-00000/build/gecko/js/src/jit/BaselineCompiler.cpp:22,
10:10:20     INFO -                   from /builds/slave/b2g_m-in_l64-b2g-haz_dep-00000/build/objdir-gecko/js/src/Unified_cpp_js_src3.cpp:2:
10:10:20  WARNING -  /builds/slave/b2g_m-in_l64-b2g-haz_dep-00000/build/gecko/js/src/jsgcinlines.h:215:9: warning: '*((void*)(& i)+12).js::gc::ArenaCellIterImpl::span.js::gc::FreeSpan::first' may be used uninitialized in this function [-Wmaybe-uninitialized]
10:10:20     INFO -  In file included from /builds/slave/b2g_m-in_l64-b2g-haz_dep-00000/build/objdir-gecko/js/src/Unified_cpp_js_src3.cpp:184:0:
10:10:20     INFO -  /builds/slave/b2g_m-in_l64-b2g-haz_dep-00000/build/gecko/js/src/jit/Ion.cpp:1255:34: note: '*((void*)(& i)+12).js::gc::ArenaCellIterImpl::span.js::gc::FreeSpan::first' was declared here

command timed out: 3600 seconds without output running ['scripts/scripts/hazard_build.py', '--target', 'emulator-jb', '--config-file', 'b2g/releng-emulator.py', '--b2g-config-dir', 'emulator-jb', '--config-file', 'hazards/common.py', '--config-file', 'hazards/build_b2g.py'], attempting to kill
bumping and triaging into mozharness component.

note: sfink is currently on pto.

we should find out exactly when this started happening or if it has always been intermittent so we can cross ref recent changes. I can look at this in the future once it surpasses my other prioritized items.

looking at http://hg.mozilla.org/build/mozharness/annotate/6eea1941b891/scripts/hazard_build.py it appears that catlee reviewed a lot of this so sending gentle ping that way.

catlee - anything jump out to you here or will this just need to be put through our normal debugging process?
Component: General Automation → Mozharness
Flags: needinfo?(catlee)
QA Contact: catlee
This does appear to be a relatively recent regression. The first instance I see on m-c was Saturday. None there prior to that.
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&jobname=b2g_mozilla-inbound_linux64-b2g-haz_dep&rev=e0f5ea8e9082 is the first push I can find these on, pointing to the regression occurring on inbound sometime between Thursday and Friday last week. Which conveniently is around the same time Windows PGO build time incurred a massive spike (bug 1045133). Maybe a common regressor?
Not sure. It looks like something in the build process itself is hanging. Do we expect any of the commands here to take longer than an hour without producing output?

What are all the "Received connection" lines in the log for?
Flags: needinfo?(catlee)
This was the same issue as the Win PGO slowdown, which was taken care of by glandium's patches.
Status: NEW → RESOLVED
Closed: 10 years ago
Component: Mozharness → Build Config
Product: Release Engineering → Core
Resolution: --- → WORKSFORME
Target Milestone: --- → mozilla34
Or not, I guess.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Steve, would you mind driving this? :-)
Blocks: 937771
Flags: needinfo?(sphink)
The failure rate of this is too high; hidden on trunk trees.
Blocks: 1050738
I've been able to reproduce on my slave what is hopefully the same problem.

It's running out of memory and swapping itself to death. In particular, the xmanager process gets up to >4GB in vsize, 2.2GB(!) of it resident. That plus half a dozen cc1plus processes that take up 800-900MB each are enough to push it over the edge.

The xmanager process seems to be large because it has several .xdb database files open that it is seeking around in, truncating, and generally futzing with. One in particular is at 1.1GB when I see the swap death happening.

So, options:

1. Rewrite the analysis to use a more standard key/value data store that doesn't depend so much on RAM. This would also make it more generally useful to switch away from ctypes to direct DB file manipulation with Typed Objects, eliminating an annoying dependency and associated configuration. It might also speed things up and make the analysis more approachable.

2. make -j<smaller number>, close my eyes, and hope it doesn't grow too much in the future.

#2 it is, then. Patch forthcoming.
Flags: needinfo?(sphink)
Oops, wanted to leave the needinfo.
Flags: needinfo?(sphink)
Testing out -j3 on my slave right now. I don't know if the build.sh -j3 gets passed through to the gecko build.
Assignee: nobody → sphink
Status: REOPENED → ASSIGNED
So, I could see this as a consequence of the recent build system changes which changed how parallel compilation is done, and which:
- changed the order in which things are built.
- allowed some compilations to happen while libxul is linking.

I can see both have consequence on the overall memory requirement of normal builds, so I'm not entirely surprised this could have consequence on build jobs that already required more memory to begin with.
Nasty hackaround. Hopefully it won't slow down the build too much.
Attachment #8475678 - Flags: review?(terrence)
Attachment #8475561 - Attachment is obsolete: true
(In reply to Mike Hommey [:glandium] from comment #119)
> So, I could see this as a consequence of the recent build system changes
> which changed how parallel compilation is done, and which:
> - changed the order in which things are built.

Yeah, this could do it. I could easily believe that I was already skating close to the memory limits.

> - allowed some compilations to happen while libxul is linking.

I did not observe this on my slave. When it was running into trouble, it was still only compiling. But it would have eventually gotten to the libxul linking, so maybe it would have run into even more trouble then.

> I can see both have consequence on the overall memory requirement of normal
> builds, so I'm not entirely surprised this could have consequence on build
> jobs that already required more memory to begin with.

Yeah, it'd be nice for make to magically tune the parallelism according to the available memory as well as load, but that doesn't sound very easy to do.

Fwiw, my slave completed the build fine with -j3. I don't know what peak memory usage it hit. It'd be nice to have atop configured on the slaves to log utilization stats. (Or I guess in the case of this build, mach's logging would work too.)
Flags: needinfo?(sphink)
Comment on attachment 8475678 [details] [diff] [review]
Compilation exceeds memory size for b2g hazard build

Review of attachment 8475678 [details] [diff] [review]:
-----------------------------------------------------------------

rs=me
Attachment #8475678 - Flags: review?(terrence) → review+
patch(es) in production
Marking this bug fixed. I scanned through m-i and didn't see any instances of this failure.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago10 years ago
Resolution: --- → FIXED
Product: Core → Firefox Build System
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: