Closed Bug 1453155 Opened 6 years ago Closed 4 years ago

Intermittent make[4]: *** Deleting file 'filename'

Categories

(Taskcluster :: Workers, defect, P5)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ccoroiu, Assigned: wcosta)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell infra])

Attachments

(1 file)

[task 2018-04-10T21:48:24.587Z] 21:48:24     INFO -  make[4]: Entering directory '/builds/worker/workspace/build/src/obj-firefox/modules/libmar/verify'
[task 2018-04-10T21:48:24.587Z] 21:48:24     INFO -  mkdir -p '.deps/'
[task 2018-04-10T21:48:24.588Z] 21:48:24     INFO -  make[4]: Leaving directory '/builds/worker/workspace/build/src/obj-firefox/modules/libmar/verify'
[task 2018-04-10T21:48:24.595Z] 21:48:24     INFO -  make[4]: *** Deleting file 'taiwncal.o'
[task 2018-04-10T21:48:24.595Z] 21:48:24     INFO -  make[4]: Entering directory '/builds/worker/workspace/build/src/obj-firefox/config/external/icu/i18n'
[task 2018-04-10T21:48:24.596Z] 21:48:24     INFO -  /builds/worker/workspace/build/src/sccache2/sccache /builds/worker/workspace/build/src/clang/bin/clang++ -B /builds/worker/workspace/build/src/gcc/bin -std=gnu++14 -o taiwncal.o -c -I/builds/worker/workspace/build/src/obj-firefox/dist/system_wrappers -include /builds/worker/workspace/build/src/config/gcc_hidden.h -DNDEBUG=1 -DTRIMMED=1 -DU_I18N_IMPLEMENTATION -DUCONFIG_NO_TRANSLITERATION -DUCONFIG_NO_REGULAR_EXPRESSIONS -DUCONFIG_NO_LEGACY_CONVERSION -DU_USING_ICU_NAMESPACE=0 -DU_NO_DEFAULT_INCLUDE_UTF_HEADERS=1 -DU_CHARSET_IS_UTF8 -DU_HAVE_NL_LANGINFO_CODESET=0 -I/builds/worker/workspace/build/src/config/external/icu/i18n -I/builds/worker/workspace/build/src/obj-firefox/config/external/icu/i18n -I/builds/worker/workspace/build/src/intl/icu/source/common -I/builds/worker/workspace/build/src/obj-firefox/dist/include -I/builds/worker/workspace/build/src/obj-firefox/dist/include/nspr -I/builds/worker/workspace/build/src/obj-firefox/dist/include/nss -fPIC -DMOZILLA_CLIENT -include /builds/worker/workspace/build/src/obj-firefox/mozilla-config.h -Qunused-arguments -U_FORTIFY_SOURCE -fno-common -Qunused-arguments -Wall -Wempty-body -Wignored-qualifiers -Woverloaded-virtual -Wpointer-arith -Wsign-compare -Wtype-limits -Wunreachable-code -Wunreachable-code-return -Wwrite-strings -Wno-invalid-offsetof -Wclass-varargs -Wloop-analysis -Wc++1z-compat -Wcomma -Wimplicit-fallthrough -Werror=non-literal-null-conversion -Wstring-conversion -Wno-inline-new-delete -Wno-error=deprecated-declarations -Wno-error=array-bounds -Wformat -Wformat-security -Wno-gnu-zero-variadic-macro-arguments -Wno-unknown-warning-option -Wno-return-type-c-linkage -D_GLIBCXX_USE_CXX11_ABI=0 -fno-sized-deallocation -fsanitize=address -U_FORTIFY_SOURCE -fno-common -fno-exceptions -fno-strict-aliasing -fno-rtti -ffunction-sections -fdata-sections -fno-exceptions -fno-math-errno -pthread -pipe -g -O2 -gline-tables-only -fno-omit-frame-pointer -frtti  -MD -MP -MF .deps/taiwncal.o.pp   /builds/worker/workspace/build/src/intl/icu/source/i18n/taiwncal.cpp
[task 2018-04-10T21:48:24.597Z] 21:48:24     INFO -  /builds/worker/workspace/build/src/config/rules.mk:1024: recipe for target 'taiwncal.o' failed
[task 2018-04-10T21:48:24.597Z] 21:48:24     INFO -  make[4]: *** [taiwncal.o] Killed
[task 2018-04-10T21:48:24.597Z] 21:48:24     INFO -  make[4]: Leaving directory '/builds/worker/workspace/build/src/obj-firefox/config/external/icu/i18n'
[task 2018-04-10T21:48:24.597Z] 21:48:24     INFO -  /builds/worker/workspace/build/src/config/recurse.mk:73: recipe for target 'config/external/icu/i18n/target' failed
[task 2018-04-10T21:48:24.597Z] 21:48:24     INFO -  make[3]: *** [config/external/icu/i18n/target] Error 2
`make[4]: *** [taiwncal.o] Killed` doesn't really seem like a build system issue. This is make reporting that the process exited due to SIGKILL. This isn't something sccache is doing.
Component: General → Worker
Product: Firefox Build System → Taskcluster
QA Contact: pmoore
Version: Version 3 → unspecified
FWIW, if you're going to use `clone bug` it'd be helpful to remove the duplicated comment 0, it was pretty confusing to me.
No longer blocks: 1445218
From the host logs:

kernel: [ 7917.099494] BUG: unable to handle kernel paging request at ffffffffcd1e4b60                                                                                                                                                                        
kernel: [ 7917.102417] IP: [<ffffffff811d02e6>] PageHeadHuge+0x16/0x30
kernel: [ 7917.104291] PGD 1c11067 PUD 1c13067 PMD 0 
kernel: [ 7917.105769] Oops: 0000 [#1] SMP 
kernel: [ 7917.106924] Modules linked in: xt_nat veth nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype xt_conntrack br_netfilter overlay snd_aloop snd_pcm snd_timer xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack xt_tcpudp bridge stp llc iptable_filter ip_tables x_tables dm_crypt ppdev crct10dif_pclmul crc32_pclmul parport_pc ghash_clmulni_intel 8250_fintek pvpanic parport snd i2c_piix4 soundcore serio_raw v4l2loopback(OE) videodev aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd ena nvme
kernel: [ 7917.125492] CPU: 0 PID: 4701 Comm: sccache Tainted: G           OE   4.4.0-1014-aws #14taskcluster1-Ubuntu
kernel: [ 7917.128407] Hardware name: Amazon EC2 m5.4xlarge/, BIOS 1.0 10/16/2017
kernel: [ 7917.130369] task: ffff88032dc43f00 ti: ffff880f9951c000 task.ti: ffff880f9951c000
kernel: [ 7917.132583] RIP: 0010:[<ffffffff811d02e6>]  [<ffffffff811d02e6>] PageHeadHuge+0x16/0x30
kernel: [ 7917.134992] RSP: 0018:ffff880f9951fc18  EFLAGS: 00010202
kernel: [ 7917.137651] RAX: 00000000097380a0 RBX: ffffea000973ffc0 RCX: 0000000009738000
kernel: [ 7917.140824] RDX: 017fff800004406c RSI: 00007fb99f000000 RDI: ffffea0009738000
kernel: [ 7917.143999] RBP: ffff880f9951fc18 R08: ffff880f9951fc78 R09: ffff880f9951fc2c
kernel: [ 7917.147159] R10: 0000000000000007 R11: 0000000000000080 R12: ffffea0009738000
kernel: [ 7917.150330] R13: 0000000000000000 R14: ffff880f9951fd78 R15: ffffea000973ffc0
kernel: [ 7917.153509] FS:  00007fb9a054b8c0(0000) GS:ffff880fa3200000(0000) knlGS:0000000000000000
kernel: [ 7917.158037] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: [ 7917.160817] CR2: ffffffffcd1e4b60 CR3: 0000000becaf8000 CR4: 0000000000360670
kernel: [ 7917.164003] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: [ 7917.167182] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kernel: [ 7917.170374] Stack:
kernel: [ 7917.172018]  ffff880f9951fc40 ffffffff811929ae ffffea000973ffc0 00007fb99efff000
kernel: [ 7917.176465]  0000000000000000 ffff880f9951fc58 ffffffff81192acd 00007fb99efff9d0
kernel: [ 7917.180867]  ffff880f9951fca8 ffffffff810f97a4 ffff880f9728f800 01ffffff813e8614
kernel: [ 7917.185263] Call Trace:
kernel: [ 7917.187058]  [<ffffffff811929ae>] put_compound_page+0x9e/0x1a0
kernel: [ 7917.189874]  [<ffffffff81192acd>] put_page+0x1d/0x50
kernel: [ 7917.192406]  [<ffffffff810f97a4>] get_futex_key+0x1f4/0x280
kernel: [ 7917.195117]  [<ffffffff810fa13e>] futex_wait_setup+0x4e/0x110
kernel: [ 7917.197889]  [<ffffffff810fa364>] futex_wait+0x164/0x270
kernel: [ 7917.200497]  [<ffffffff810f9d11>] ? futex_wake+0x81/0x150
kernel: [ 7917.203228]  [<ffffffff810fc2cb>] do_futex+0xdb/0x520
kernel: [ 7917.205821]  [<ffffffff8108d861>] ? do_send_specific+0x71/0x90
kernel: [ 7917.208656]  [<ffffffff8108d90b>] ? do_tkill+0x8b/0xa0
kernel: [ 7917.211231]  [<ffffffff810fc781>] SyS_futex+0x71/0x150
kernel: [ 7917.213849]  [<ffffffff817eb248>] entry_SYSCALL_64_fastpath+0x1c/0xbb
kernel: [ 7917.217207] Code: 48 89 e5 80 e6 40 74 0b 48 8b 47 40 48 c1 e8 0b 83 e0 01 5d c3 0f 1f 44 00 00 48 8b 17 55 31 c0 48 89 e5 80 e6 40 74 15 8b 47 68 <48> 81 3c c5 60 46 82 81 a0 f0 1c 81 0f 94 c0 0f b6 c0 5d c3 66  
kernel: [ 7917.230661] RIP  [<ffffffff811d02e6>] PageHeadHuge+0x16/0x30
kernel: [ 7917.233578]  RSP <ffff880f9951fc18>
kernel: [ 7917.235680] CR2: ffffffffcd1e4b60
kernel: [ 7917.238530] ---[ end trace a4a7d8eee8a6947b ]--- 

So it looks like this tickled a kernel bug.  Let's see how frequent this is.
Component: Worker → Docker-Worker
QA Contact: pmoore
not sure why we have 2 bugs here with the same summary, bug 1445631 is resolved and got starred with 1 failure- so 2 for the last week!
no failures in >2 weeks.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
This got filed again as bug 1459691. I don't know if it's the same root cause or not. It seems to be pretty low frequency (which is not unexpected). We should figure out how to make this easier to star consistently, though.
Callek fetched me the logs from papertrail for the worker from bug 1459691 and it looks like the same kernel issue:

[ 4716.181481] BUG: unable to handle kernel paging request at ffffffffb61a4b60
[ 4716.184130] IP: [<ffffffff811d02e6>] PageHeadHuge+0x16/0x30
[ 4716.185922] PGD 1c11067 PUD 1c12063 PMD 0
[ 4716.187348] Oops: 0000 [#1] SMP
[ 4716.188221] Modules linked in: xt_nat veth nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype xt_conntrac
k br_netfilter overlay snd_aloop snd_pcm snd_timer xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 ipta
ble_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack xt_tcpudp bridge stp llc iptable_filter ip_tabl
es x_tables dm_crypt ppdev crct10dif_pclmul crc32_pclmul ghash_clmulni_intel parport_pc 8250_fintek pvpanic parport i2c_
piix4 serio_raw snd soundcore v4l2loopback(OE) videodev aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryp
td ena nvme
[ 4716.205343] CPU: 4 PID: 4984 Comm: sccache Tainted: G           OE   4.4.0-1014-aws #14taskcluster1-Ubuntu
[ 4716.207731] Hardware name: Amazon EC2 c5.4xlarge/, BIOS 1.0 10/16/2017
[ 4716.209380] task: ffff8806e6323f00 ti: ffff8800367f4000 task.ti: ffff8800367f4000
[ 4716.211189] RIP: 0010:[<ffffffff811d02e6>]  [<ffffffff811d02e6>] PageHeadHuge+0x16/0x30
[ 4716.213200] RSP: 0018:ffff8800367f7c18  EFLAGS: 00010202
[ 4716.214797] RAX: 00000000069300a0 RBX: ffffea0006937fc0 RCX: 0000000006930000
[ 4716.217992] RDX: 017fff800004406c RSI: 00007fa5bee00000 RDI: ffffea0006930000
[ 4716.221148] RBP: ffff8800367f7c18 R08: ffff8800367f7c78 R09: ffff8800367f7c2c
[ 4716.224161] R10: 0000000000000007 R11: 0000000000000080 R12: ffffea0006930000
[ 4716.226622] R13: 0000000000000000 R14: ffff8800367f7d78 R15: ffffea0006937fc0
[ 4716.229351] FS:  00007fa5c01998a0(0000) GS:ffff8807df300000(0000) knlGS:0000000000000000
[ 4716.233079] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4716.235457] CR2: ffffffffb61a4b60 CR3: 0000000036782000 C[ 4716.237920] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4716.240357] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 4716.242794] Stack:
[ 4716.244036]  ffff8800367f7c40 ffffffff811929ae ffffea0006937fc0 00007fa5bedff000
[ 4716.247959]  0000000000000000 ffff8800367f7c58 ffffffff81192acd 00007fa5bedff9d0
[ 4716.251355]  ffff8800367f7ca8 ffffffff810f97a4 ffff8807d50ab480 01ffffff813e8614
[ 4716.254784] Call Trace:
[ 4716.256137]  [<ffffffff811929ae>] put_compound_page+0x9e/0x1a0
[ 4716.258282]  [<ffffffff81192acd>] put_page+0x1d/0x50
[ 4716.260243]  [<ffffffff810f97a4>] get_futex_key+0x1f4/0x280
[ 4716.262337]  [<ffffffff810fa13e>] futex_wait_setup+0x4e/0x110
[ 4716.264468]  [<ffffffff810fa364>] futex_wait+0x164/0x270
[ 4716.266521]  [<ffffffff810f9d11>] ? futex_wake+0x81/0x150
[ 4716.269103]  [<ffffffff810fc2cb>] do_futex+0xdb/0x520
[ 4716.271167]  [<ffffffff8108d861>] ? do_send_specific+0x71/0x90
[ 4716.273319]  [<ffffffff8108d90b>] ? do_tkill+0x8b/0xa0
[ 4716.275333]  [<ffffffff810fc781>] SyS_futex+0x71/0x150
[ 4716.277356]  [<ffffffff817eb248>] entry_SYSCALL_64_fastpath+0x1c/0xbb
[ 4716.279668] Code: 48 89 e5 80 e6 40 74 0b 48 8b 47 40 48 c1 e8 0b 83 e0 01 5d c3 0f 1f 44 00 00 48 8b 17 55 31 c0 48
89 e5 80 e6 40 74 15 8b 47 68 <48> 81 3c c5 60 46 82 81 a0 f0 1c 81 0f 94 c0 0f b6 c0 5d c3 66
[ 4716.291848] RIP  [<ffffffff811d02e6>] PageHeadHuge+0x16/0x30
[ 4716.294046]  RSP <ffff8800367f7c18>
[ 4716.295655] CR2: ffffffffb61a4b60
[ 4716.298004] ---[ end trace 70d4c0cd34286c45 ]---R4: 0000000000360670

I snipped out the leading timestamp, but the first line of the kernel log (BUG: ...) has the timestamp 2018-05-04T23:13:17Z

and the task log from bug 1459691 shows:
[task 2018-05-04T23:13:17.950Z] 23:13:17     INFO -  make[4]: *** [ucnv_cb.o] Killed

So yeah, this seems to be what caused the compiler process to get killed.
:ted, should we close this again, it seems to happen once/month or so.
Flags: needinfo?(ted)
It certainly doesn't seem to be frequent enough to be worth someone's time to track down a fix. It might be helpful to improve the treeherder suggestions for make failures like this so we get the actual failure message (the make[4]: *** [ucnv_cb.o] Killed bit). Reading the output of the build can be tricky so it's no wonder that these failures are hard to star correctly.
Flags: needinfo?(ted)
Severity: major → normal
Priority: P5 → --
Component: Docker-Worker → Workers
Assignee: nobody → wcosta
Priority: -- → P5

This seems to be accumulating all sorts of make failures -- on Windows and Linux -- so I think it's become a bit of a dumping ground. The linux kernel failure is a fraction of the overall rate, which is already about 0.02%, and as Ted suggested that's not frequent enough to warrant attention.

A build task may occupy up to 40GB of disk space, but our disk threshold
is currently configured to 30GB, which causes an intermittent our of
disk space failure.

The last failure was 4 months ago, closing it.

Status: REOPENED → RESOLVED
Closed: 6 years ago4 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: