Thunderbird automation and local build busted, 2018-04-18

RESOLVED FIXED

Status

defect
--
blocker
RESOLVED FIXED
Last year
Last year

People

(Reporter: jorgk, Assigned: Nika)

Tracking

Trunk
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(3 attachments, 4 obsolete attachments)

This appears to have started with the Daily run on 2018-04-18.

The build (N) jobs still ran on M-C 789e30ff2e3d6e1fcfce1a373c1e563548 but the repacks might have run on M-C 4af4ae0aee552a99a995ce4b32198b98294a95f7.

In case that's not so, certainly builds succeed at 789e30ff2e3d6e1fcfce1a373c1e563548 but failed at 4af4ae0aee552a99a995ce4b32198b98294a95f7

with obscure errors:
Automation Error: mozprocess timed out after 2400 seconds running ['/usr/bin/python2.7', 'mach', '--log-no-times', 'build', '-v']

Aceman reported on IRC about a local build:

23:06:59 - aceman: I updated right now and it seems python is stuck in some loop for 10 minutes already
23:07:18 - aceman: it is in the 'export' tier of 'mach build'

https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=789e30ff2e3d6e1fcfce1a373c1e563548 &tochange=4af4ae0aee552a99a995ce4b32198b9829
which is the push at Wed Apr 18 10:41:19 2018 +0000 and Wed Apr 18 10:50:38 2018 +0000.

Sadly I have no builds for 234 merged M-C changesets :-(
Flags: needinfo?(philipp)
Flags: needinfo?(mozilla)
On Windows, hangs in the XPT stage, last lines:

 0:49.01 zipwriter.xpt
 0:50.04 xptdata.cpp

Backout of
66a52ced217d	Tom Prince — Bug 1436662: Remove `installer` MOZ_AUTOMATION stage; r=mshal
fba7cab660da	Tom Prince — Bug 1436662: Package translated uninstaller; r=pike,mshal

Clear the hang. Well, I did Ctrl+C when it hung and then simply a non-clobber build, which already shows other bustage :-(
Hmm, looks like bug 1436662 was false alarm, even with that backed out I'm still hanging on xptdata.cpp :-(
So maybe bug Bug 1444745.
Nika, Andrew and/or Eric: What do we need to port to get going again. This busted *all* our builds.
Flags: needinfo?(nika)
Flags: needinfo?(erahm)
Flags: needinfo?(continuation)
> 0:49.01 zipwriter.xpt
> 0:50.04 xptdata.cpp

This is the same as what I see. Just the mach TIER indicator says it is at "export".

The problem continues also this morning (after midnight's merge).
I tried backing out bug 1444745 locally, but that needs a backout of bug 1444991 which in turn need backout of bug 1452183 :-(
(In reply to :aceman from comment #6)
> The problem continues also this morning (after midnight's merge).
Sure, caused by bug 1444745, so why would it magically solve itself?
I can confirm the problem, I am stuck at the same point.

top shows a python process with 99% CPU load.
Flags: needinfo?(mozilla)
Nika or I will take a look at this today. Bug 1444745 shouldn't have required any changes from Thunderbird, so maybe its XPIDL files are doing something that is hitting some kind of exponential behavior.
Blocks: 1444745
Flags: needinfo?(erahm)
The problem was that two different interfaces had the exact same UUID. This caused the perfect hash generator to hang forever, as it couldn't find a non-conflicting hash.

Giving both interfaces new UUIDs allowed the build to continue.
Attachment #8969352 - Flags: review?(jorgk)
Flags: needinfo?(nika)
Flags: needinfo?(philipp)
Flags: needinfo?(continuation)
Assignee: nobody → nika
Comment on attachment 8969353 [details] [diff] [review]
Part 2: Assert if there is a duplicated UUID or name in xptcodegen.py

Review of attachment 8969353 [details] [diff] [review]:
-----------------------------------------------------------------

Fun. Thanks for fixing this.
Attachment #8969353 - Flags: review?(continuation) → review+
Comment on attachment 8969352 [details] [diff] [review]
Part 1: Don't use the same UUID for two different interfaces (landed on C-C)

What can I say, old silly code from the year 1999 most likely :-(

I'll get this landed asap.
Attachment #8969352 - Flags: review?(jorgk) → review+
Pushed by mozilla@jorgk.com:
https://hg.mozilla.org/comm-central/rev/5b57bba232f3
Part 1: Don't use the same UUID for two different interfaces. r=jorgk DONTBUILD
Status: NEW → RESOLVED
Closed: Last year
Resolution: --- → FIXED
And a BIG THANKS or course!!!
Sadly this hasn't fully worked. The hang at 
 0:49.01 zipwriter.xpt
 0:50.04 xptdata.cpp
is not gone, but I have another one as per comment ..., oops, I never posted it, so here goes:

 1:47.99 Microsoft (R) 32b/64b MIDL Compiler Version 8.01.0622
 1:47.99 Copyright (c) Microsoft Corporation. All rights reserved.
 1:47.99 64 bit Processing c:\mozilla-source\comm-central\mozilla\other-licenses\ia2\IA2CommonTypes.idl
 1:48.00 IA2CommonTypes.idl
 1:50.44 \local\temp\tmpfewa1f
 1:50.44 C:/PROGRA~2/WI3CF2~1/10/bin/100150~1.0/x64/fxc.exe -nologo -Tvs_4_0 c:/mozilla-source/comm-central/mozilla/gfx/layers/d3d11/mlgshaders\blend-vs.hlsl -EBlendVertexVS -VnBlendVertexVS -Vi -DVERTEX_SHADER -Fhc:\users\jorgk\appdata\local\temp\tmpu54xch

[snip]

 1:50.83 C:/PROGRA~2/WI3CF2~1/10/bin/100150~1.0/x64/fxc.exe -nologo -Tvs_4_0 c:/mozilla-source/comm-central/mozilla/gfx/layers/d3d11/mlgshaders\test-features-vs.hlsl -ETestConstantBuffersVS -VnTestConstantBuffersVS -Vi -DVERTEX_SHADER -Fhc:\users\jorgk\appdata\local\temp\tmpeer7vk

So is that yet another problem?
Flags: needinfo?(nika)
BTW, the Mac try run had this problem:
https://taskcluster-artifacts.net/Z0TTxTMfSPGL7_gVYu2Drg/0/public/logs/live_backing.log

[task 2018-04-19T18:04:35.451Z] 18:04:35     INFO -  make[4]: Entering directory '/builds/worker/workspace/build/src/obj-thunderbird/ipc/ipdl'
[task 2018-04-19T18:04:35.451Z] 18:04:35     INFO -  /builds/worker/workspace/build/src/obj-thunderbird/_virtualenv/bin/python /builds/worker/workspace/build/src/config/pythonpath.py \
[task 2018-04-19T18:04:35.451Z] 18:04:35     INFO -    -I/builds/worker/workspace/build/src/other-licenses/ply \
[task 2018-04-19T18:04:35.451Z] 18:04:35     INFO -    /builds/worker/workspace/build/src/ipc/ipdl/ipdl.py \
[task 2018-04-19T18:04:35.451Z] 18:04:35     INFO -    --sync-msg-list=/builds/worker/workspace/build/src/ipc/ipdl/sync-messages.ini \
[task 2018-04-19T18:04:35.451Z] 18:04:35     INFO -    --msg-metadata=/builds/worker/workspace/build/src/ipc/ipdl/message-metadata.ini \
[task 2018-04-19T18:04:35.451Z] 18:04:35     INFO -    --outheaders-dir=_ipdlheaders \
[task 2018-04-19T18:04:35.451Z] 18:04:35     INFO -    --outcpp-dir=. \
[task 2018-04-19T18:04:35.453Z] 18:04:35     INFO -    -I/builds/worker/workspace/build/src/obj-thunderbird/ipc/ipdl

[snip]

/src/widget/PlatformWidgetTypes.ipdlh /builds/worker/workspace/build/src/widget/headless/HeadlessWidgetTypes.ipdlh
[task 2018-04-19T18:04:35.473Z] 18:04:35     INFO -  make[4]: Leaving directory '/builds/worker/workspace/build/src/obj-thunderbird/ipc/ipdl'
[task 2018-04-19T18:44:35.501Z] 18:44:35     INFO - Automation Error: mozprocess timed out after 2400 seconds running ['/usr/bin/python2.7', 'mach', '--log-no-times', 'build', '-v']
[task 2018-04-19T18:44:35.513Z] 18:44:35    ERROR - timed out after 2400 seconds of no output
There still seem to be problems surrounding that xptdata.cpp file. Should that be empty? Later I see:
 6:17.43 c1xx: fatal error C1083: Cannot open source file: 'c:/mozilla-source/comm-central/obj-x86_64-pc-mingw32/xpcom/reflect/xptinfo/xptdata.cpp': Permission denied

I closed the window/process locking the file and restarted the build. Still hanging on

$ mozilla/mach build
 0:00.95 Clobber not needed.
 0:00.95 Adding make options from c:\mozilla-source\comm-central\.mozconfig
    CONFIG_GUESS=x86_64-pc-mingw32
    MOZ_OBJDIR=c:/mozilla-source/comm-central/obj-x86_64-pc-mingw32
    OBJDIR=c:/mozilla-source/comm-central/obj-x86_64-pc-mingw32
    FOUND_MOZCONFIG=c:/mozilla-source/comm-central/.mozconfig
    export FOUND_MOZCONFIG
 0:01.03 c:\mozilla-build\mozmake\mozmake.exe -f client.mk -s
 0:02.02 Adding client.mk options from c:/mozilla-source/comm-central/.mozconfig:
 0:02.02     CONFIG_GUESS=x86_64-pc-mingw32
 0:02.02     FOUND_MOZCONFIG := c:/mozilla-source/comm-central/.mozconfig
 0:03.77 Elapsed: 0.00s; From dist/public: Kept 0 existing; Added/updated 0; Removed 0 files and 0 directories.
 0:03.77 Elapsed: 0.00s; From dist/private: Kept 0 existing; Added/updated 0; Removed 0 files and 0 directories.
 0:03.77 Elapsed: 0.00s; From dist/xpi-stage: Kept 11 existing; Added/updated 0; Removed 0 files and 0 directories.
 0:04.19 Elapsed: 0.41s; From dist/idl: Kept 0 existing; Added/updated 1178; Removed 0 files and 0 directories.
 0:04.52 Elapsed: 0.73s; From _tests: Kept 0 existing; Added/updated 1455; Removed 0 files and 0 directories.
 0:04.56 Elapsed: 0.80s; From dist/bin: Kept 0 existing; Added/updated 2121; Removed 0 files and 0 directories.
 0:05.81 Elapsed: 2.03s; From dist/include: Kept 0 existing; Added/updated 5837; Removed 0 files and 0 directories.
 0:07.44 Elapsed: 0.45s; From ../../dist/idl: Kept 0 existing; Added/updated 1178; Removed 0 files and 0 directories.
 0:07.60 xptdata.cpp

:-(
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
BTW, I added the M-C patch but see no messages about further duplicates :-(
Posted image xptcodegen-hang.png (obsolete) —
It hangs in xptcodegen.py, first argument is the xptdata.cpp file then follow lots of xpt files.

I tried to add prints to the Python file, nothing got printed. There wouldn't be a limitation on the number of arguments this can take?

Nika reported on IRC:
23:25:20 - nika: jorgk: no
23:25:25 - nika: jorgk: It shouldn't be empty
23:25:33 - nika: jorgk: It seems to be running fine on my end

Sadly it's not running fine neither for me nor for Aceman.
Just for fun, I ran
$ mozilla/xpcom/reflect/xptinfo/xptcodegen.py jk obj-x86_64-pc-mingw32/config/makefiles/xpidl/layout_printing.xpt

That created jk with some content. Also, all the prints I had added were shown. However, as part of the build, it hangs. When you kill it, it leaves behind a locked file.

If you build again, it will go further and fail later on the locked and empty file. If you kill the shell that owns the killed but still zombie python process, then start the build again, it hangs again.

Strangely, in a clobber build it appears to go past the processing of xptdata.cpp but hangs later as per comment #22, on something like:
 1:45.47 C:/PROGRA~2/WI3CF2~1/10/bin/100150~1.0/x64/fxc.exe -nologo -Tvs_4_0 c:/mozilla-source/comm-central/mozilla/gfx/layers/d3d11/mlgshaders\test-features-vs.hlsl -ETestConstantBuffersVS -VnTestConstantBuffersVS -Vi -DVERTEX_SHADER -Fhc:\users\jorgk\appdata\local\temp\tmpin_xkm

That's just misleading, since in the process explorer, one can see that xptcodegen.py hangs. No print have come out.

So I guess something is wrong with the invocation of the script.
https://stackoverflow.com/questions/714475/what-is-a-maximum-number-of-arguments-in-a-python-function
starts with:
  It's somewhat common knowledge that Python functions can have a maximum of 256 arguments.

I counted half the screen in the image I attached and came to 70, so the whole lot should be ~140. In my
C:\mozilla-source\comm-central\obj-x86_64-pc-mingw32\config\makefiles\xpidl
directory, I have 183 files.

OK, here's the next attempt:
$ mozilla/xpcom/reflect/xptinfo/xptcodegen.py jk obj-x86_64-pc-mingw32/config/makefiles/xpidl/*.xpt

This hangs and I only see the "111...111" from this debug:
def link_to_cpp(interfaces, fd):
    # Perfect Hash from IID into the ifaces array.
    print "11111111111111111111111111111"
    iid_phf = PerfectHash(PHFSIZE, [
        (iid_bytes(iface['uuid']), iface)
        for iface in interfaces
    ])
    # Perfect Hash from name to index in the ifaces array.
    print "22222222222222222222222222222"
    name_phf = PerfectHash(PHFSIZE, [
        (bytearray(iface['name'], 'ascii'), idx)
        for idx, iface in enumerate(iid_phf.values)
    ])
    print "3333333333333333333333333333"
Killing that gives:

Traceback (most recent call last):
  File "mozilla/xpcom/reflect/xptinfo/xptcodegen.py", line 540, in <module>
    main()
  File "mozilla/xpcom/reflect/xptinfo/xptcodegen.py", line 537, in main
    link_and_write(args.xpts, fd)
  File "mozilla/xpcom/reflect/xptinfo/xptcodegen.py", line 522, in link_and_write
    link_to_cpp(interfaces, outfile)
  File "mozilla/xpcom/reflect/xptinfo/xptcodegen.py", line 160, in link_to_cpp
    for iface in interfaces
  File "c:\mozilla-source\comm-central\mozilla\xpcom\reflect\xptinfo\perfecthash.py", line 98, in __init__
    slot = hash(bucket.entries[idx].key, basis) % len(self.values)
  File "c:\mozilla-source\comm-central\mozilla\xpcom\reflect\xptinfo\perfecthash.py", line 38, in hash
    h *= FNV_PRIME  # Multiply by the FNV prime
Works fine with PHFSIZE set to 512. There were 1630 UUIDs, so isn't that hash table a bit small?
Posted patch hash.patch (obsolete) — Splinter Review
I don't know what the 256 is about. If you think this makes sense, maybe get it landed with an appropriate commit message.

Next try:
M-C:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=be79666e4a595570642c9e8a0097e975c285c03e
C-C:
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=f2145f8b3970fa5072b668e2e5dee9144d4b9d9b
Attachment #8969512 - Flags: review?(nika)
Comment on attachment 8969512 [details] [diff] [review]
hash.patch

Andrew, given the urgency of the matter (no C-C builds for two days), perhaps you can take the review, suggest a good commit message and get this landed as well as Nika's other patch.
Attachment #8969512 - Flags: review?(continuation)
People watching this bug:
hg import https://hg.mozilla.org/try/rev/be79666e4a595570642c9e8a0097e975c285c03e
will give you a working local build.
And since I just hit that trap:
import it in subdir mozilla, not directly in C-C...
Strangely, I updated c-c+m-c today morning (without applying the hash patch) and I could build successfully.
We also have a successful build on C-C on some platforms:
https://treeherder.mozilla.org/#/jobs?repo=comm-central&revision=5b57bba232f3883c11030edd337e66e58b582462
Linux32 and Linux64 seem to have passed, Mac hung and Windows seems to have passed but died on a build issue.

Watch the latest push:
https://treeherder.mozilla.org/#/jobs?repo=comm-central&revision=e3d205e2a797fa405f995a58d352ca942a3c29d9

Maybe some random bug in the perfect hash function that appears to be mitigated by doubling the size.
Looks like the Linux opt builds worked, Mac failed and Windows failed due to something else.
Ah, nice find. I'd guess that Thunderbird has more XPIDL files, so it needs a larger hash table.
Comment on attachment 8969512 [details] [diff] [review]
hash.patch

Review of attachment 8969512 [details] [diff] [review]:
-----------------------------------------------------------------

I'm guessing that the increased number of IDLs in Thunderbird caused the PHF stage of the build to take longer than expected, timing out.

::: xpcom/reflect/xptinfo/xptcodegen.py
@@ +18,2 @@
>  # generate a more efficient modulo due to it being a power of 2.
> +PHFSIZE = 512

This is probably fine. This change in number will cause an additional 512 u32s to be included in static storage, totaling ~2kb additional uncompressed binary size.

This number controls how big the intermediate table used for the perfect hash function is. With the m-c setup, 256 worked well, so I didn't bother using 512, because it was fast enough as-is. This change will trade off compile time for binary size, which appears to be a pretty good tradeoff for Thunderbird at least.
Attachment #8969512 - Flags: review?(nika)
Attachment #8969512 - Flags: review?(continuation)
Attachment #8969512 - Flags: review+
Carrying forward Nika's r+. Added commit message.
Attachment #8969481 - Attachment is obsolete: true
Attachment #8969488 - Attachment is obsolete: true
Attachment #8969512 - Attachment is obsolete: true
Flags: needinfo?(nika)
Attachment #8969718 - Flags: review+
Keywords: checkin-needed
Attachment #8969352 - Attachment description: Part 1: Don't use the same UUID for two different interfaces → Part 1: Don't use the same UUID for two different interfaces (landed on C-C)
This is badly blocking TB, so please get this landed, preferably on M-C straight away.
Component: Build Config → XPCOM
Flags: needinfo?(aryx.bugmail)
Product: Thunderbird → Core
(In reply to :Nika Layzell from comment #41)
> I'm guessing that the increased number of IDLs in Thunderbird caused the PHF
> stage of the build to take longer than expected, timing out.
I beg to differ here. Running xptcodegen.py is simply hung, it times out after 40 minutes (2400 seconds) when a build on the fast servers takes 20-30 minutes. The behaviour is inconsistent across platforms and opt/debug builds, so I suggest there is a subtle bug somewhere. Please download attachment 8969488 [details] and see for yourself. I'd suggest to investigate this a bit further since hanging on a lot of input data is not really the right thing to do.
Pushed by dluca@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/6c972ac2e177
Part 2: Assert if there is a duplicated UUID or name in xptcodegen.py, r=mccr8
https://hg.mozilla.org/integration/mozilla-inbound/rev/ca83c0b3a19d
Part 3: Increase size of table used by perfect hash function from 256 to 512. r=nika
Keywords: checkin-needed
Flags: needinfo?(aryx.bugmail)
https://hg.mozilla.org/mozilla-central/rev/6c972ac2e177
https://hg.mozilla.org/mozilla-central/rev/ca83c0b3a19d
Status: REOPENED → RESOLVED
Closed: Last yearLast year
Resolution: --- → FIXED
Target Milestone: --- → mozilla61
Component: XPCOM → Build Config
Product: Core → Thunderbird
Target Milestone: mozilla61 → ---
You need to log in before you can comment on or make changes to this bug.