Open Bug 1285254 Opened 3 years ago Updated 12 days ago

Intermittent KeyError: u'buildsymbols'

Categories

(Firefox Build System :: General, defect, P5)

defect

Tracking

(Not tracked)

People

(Reporter: intermittent-bug-filer, Unassigned)

References

Details

(Keywords: intermittent-failure)

Attachments

(1 file)

Filed by: philringnalda@gmail.com

https://treeherder.mozilla.org/logviewer.html#?job_id=10347943&repo=fx-team

http://archive.mozilla.org/pub/firefox/tinderbox-builds/fx-team-win64/1467895222/fx-team-win64-bm70-build1-build45.txt.gz

07:00:16     INFO -  Error running mach:

 07:00:16     INFO -      ['--log-no-times', 'build', '-v']

 07:00:16     INFO -  The error occurred in code that was called by the mach command. This is either

 07:00:16     INFO -  a bug in the called code itself or in the way that mach is calling it.

 07:00:16     INFO -  You should consider filing a bug for this issue.

 07:00:16     INFO -  If filing a bug, please include the full output of mach, including this error

 07:00:16     INFO -  message.

 07:00:16     INFO -  The details of the failure are as follows:

 07:00:16     INFO -  KeyError: u'buildsymbols'

 07:00:16     INFO -    File "c:\builds\moz2_slave\fx-team-w64-000000000000000000\build\src\python/mozbuild/mozbuild/mach_commands.py", line 471, in build

 07:00:16     INFO -      telemetry_data = monitor.get_resource_usage()

 07:00:16     INFO -    File "c:\builds\moz2_slave\fx-team-w64-000000000000000000\build\src\python/mozbuild\mozbuild\controller\building.py", line 384, in get_resource_usage

 07:00:16     INFO -      o['tiers'] = self.tiers.tiered_resource_usage()

 07:00:16     INFO -    File "c:\builds\moz2_slave\fx-team-w64-000000000000000000\build\src\python/mozbuild\mozbuild\controller\building.py", line 119, in tiered_resource_usage

 07:00:16     INFO -      self.add_resources_to_dict(t_entry, phase=tier)

 07:00:16     INFO -    File "c:\builds\moz2_slave\fx-team-w64-000000000000000000\build\src\python/mozbuild\mozbuild\controller\building.py", line 128, in add_resources_to_dict

 07:00:16     INFO -      end=end, phase=phase, per_cpu=False)

 07:00:16     INFO -    File "c:\builds\moz2_slave\fx-team-w64-000000000000000000\build\src\testing/mozbase/mozsystemmonitor\mozsystemmonitor\resourcemonitor.py", line 445, in aggregate_cpu_percent

 07:00:16     INFO -      data = self.phase_usage(phase)

 07:00:16     INFO -    File "c:\builds\moz2_slave\fx-team-w64-000000000000000000\build\src\testing/mozbase/mozsystemmonitor\mozsystemmonitor\resourcemonitor.py", line 404, in phase_usage

07:00:16 INFO - time_start, time_end = self.phases[phase]
ha ha, from the log:
```
06:46:34     INFO -  c:/builds/moz2_slave/fx-team-w64-000000000000000000/build/src/mozmake.EXE -C c:/builds/moz2_slave/fx-team-w64-000000000000000000/build/src/security/nss/lib/smime libs  CC=' c:/mozilla-build/python27/python2.7.EXE c:/builds/moz2_slave/fx-team-w64-000000000000000000/build/src/sccache/sccache.py c:/builds/moz2_slave/fx-team-w64-000000000000000000/build/src/vs2015u2/VC/bin/amd64/cl.EXE' SOURCE_MD_DIR=c:/builds/moz2_slave/fx-team-w64-000000000000000000/build/src/obj-firefox/dist SOURCE_MDHEADERS_DIR=c:/builds/moz2_slave/fx-team-w64-000000000000000000/build/src/obj-firefox/dist/include/nspr DIST=c:/builds/moz2_slave/fx-team-w64-000000000000000000/build/src/obj-firefox/dist NSPR_INCLUDE_DIR=c:/builds/moz2_slave/fx-team-w64-000000000000000000/build/src/obj-firefox/dist/include/nspr NSPR_LIB_DIR=c:/builds/moz2_slave/fx-team-w64-000000000000000000/build/src/obj-firefox/dist/lib MOZILLA_CLIENT=1 NO_MDUPDATE=1 NSS_ENABLE_ECC=1 NSS_ENABLE_TLS_1_3=1 SQLITE_LIB_NAME=nss3 SQLITE_INCLUDE_DIR=c:/builds/moz2_slave/fx-team-w64-000000000000000000/build/src/obj-firefox/dist/include topsrcdir='c:/builds/moz2_slave/fx-team-w64-000000000000000000/build/src' BUILD='c:/builds/moz2_slave/fx-team-w64-000000000000000000/build/src/obj-firefox/security/$(subst $(topsrcdir)/security/,,$(CURDIR))' BUILD_TREE='$(BUILD)' OBJDIR='$(BUILD)' DEPENDENCIES='$(BUILD)/.deps' SINGLE_SHLIB_DIR='$(BUILD)' SOURCE_XP_DIR=c:/builds/moz2_slave/fx-team-w64-000000000000000000/build/src/obj-firefox/dist BUILD_OPT=1 OPT_CODE_SIZE=1 NS_USE_GCC= USE_64=1 OS_TARGET=WIN95 NSS_SSL_ENABLE_ZLIB= PROGRAMS= CHECKLOC= FREEBL_NO_DEPEND=0  NSS_NO_PKCS11_BYPASS=1 PUBLIC_EXPORT_DIR='c:/builds/moz2_slave/fx-team-w64-000000000000000000/build/src/obj-firefox/dist/include/$(MODULE)' SOURCE_XPHEADERS_DIR='$(SOURCE_XP_DIR)/include/$(MODULE)' MODULE_INCLUDES='$(addprefix -I$(SOURCE_XP_DIR)/include/,$(REQUIRES))' MAKE_OBJDIR='$(INSTALL) -D $(OBJDIR)' TARGETS='$(LIBRARY) $(SHARED_LIBRARY) $(PROGRAM)' XCFLAGS='' NSS_ENABLE_WERROR=0 PYTHON='c:/builds/moz2_slave/fx-team-w64-00000000000000000BUILDSTATUS TIER_FINISH buildsymbols
```

I wonder if this is Windows' inability to line-buffer screwing us, or something in make screwing us?
WTF
Bulk assigning P3 to all open intermittent bugs without a priority set in Firefox components per bug 1298978.
Priority: -- → P3
I think it has to do with the fact that we're using --output-sync=target, but the automation build steps use recursive make invocations to do things like 'make buildsymbols', 'make package', etc. The sub-make's output is synced only with itself, which means it can display at the same time as the parent make is displaying the BUILDSTATUS message. By using --output-sync=recurse, we should get the intended syncing of having the various automation steps display their output one at a time (including BUILDSTATUS messages).

I tested this locally with some dummy scripts that spit out lots of data in a recursive make invocation. With --output-sync=target, sometimes the two sub-makes would interrupt each other's output, but with --output-sync=recurse they seem to always be serialized.

Since this flag is only used when automation/* targets are specified, I think it is safe to land and just see if it helps.
Assignee: nobody → mshal
Comment on attachment 8858887 [details]
Bug 1285254 - Use 'recurse' output-sync to avoid Windows buffering issues;

https://reviewboard.mozilla.org/r/130876/#review134076

::: build/moz-automation.mk:8
(Diff revision 1)
>  # License, v. 2.0. If a copy of the MPL was not distributed with this
>  # file, You can obtain one at http://mozilla.org/MPL/2.0/.
>  
>  ifneq (,$(filter automation/%,$(MAKECMDGOALS)))
>  ifeq (4.0,$(firstword $(sort 4.0 $(MAKE_VERSION))))
> -MAKEFLAGS += --output-sync=target
> +MAKEFLAGS += --output-sync=recurse

Doesn't that make make essentially output nothing for most of the time, then dump large amounts of text all at once at the end? That seems like it would trigger intermittent kills of the jobs because of lack of output.

It also defeats config/link.py... Actually --output-sync=target probably does that too...
Attachment #8858887 - Flags: review?(mh+mozilla)
The --output-sync command is inside the 'ifneq (,$(filter automation/%,$(MAKECMDGOALS)))' block, so it is only turned on for the postflight automation/build steps, not the whole build (so link.py isn't included). In this case, I think we want to hold all the output until each step is complete, and then dump all of eg: 'make package' at once for a coherent log and functional BUILDSTATUS messages. Since these steps are on the order of minutes, I don't think we're in danger of triggering timeouts.
Comment on attachment 8858887 [details]
Bug 1285254 - Use 'recurse' output-sync to avoid Windows buffering issues;

gps, can you take a look? I didn't catch glandium in time.
Attachment #8858887 - Flags: review?(gps)
Comment on attachment 8858887 [details]
Bug 1285254 - Use 'recurse' output-sync to avoid Windows buffering issues;

https://reviewboard.mozilla.org/r/130876/#review135972

This patch and bug make me sad (through no fault of the patch author).

Yes, this change does fix the underlying issue of interleaved output causing `BUILDSTATUS` messages to get lost. But at what cost. As glandium noted, this will cause all output for these "tiers" to buffer until completion. This does result in more readable output. But it is also a significant setback for debugging logs.

For example, the timestamps in logs are no longer accurate since all output arrives at the same time to the process adding those timestamps. See https://public-artifacts.taskcluster.net/PN4jkgewTSC0xv-KMf-ZGg/0/public/logs/live_backing.log (search for "19:07:46"). It looks like symbols archive generation completed almost instantly! Same thing for packaging. I frequently look at those timestamps to help find performance issues in automation and losing them would be a major to triage automation slowness, IMO. Also, by buffering the output, we're losing details about what is executing when. Before, we know when different tasks are both active. With this patch, it's hard to tell when we are concurrent and what is executing concurrently.

There's a lot of lost visibility by making this change. And I'm really not thrilled about that. But at the same time, you have to acknowledge this change does fix a bug.

While I usually don't like scope bloating, does anyone have any ideas for (ugly) workarounds? Some crazy ideas off the top of my head:

* Write `BUILDSTATUS` messages to a special file defined by an environment variable. Should eliminate most conditions emitting that data. Will eliminate all race conditions if we do locking.
* Have a wrapper process for invoking `make` for a tier. That process will add timestamps such that when `--output-sync=recurse` output is emitted total ordering can be reconstructured with accurate timestamps. As long as the sub `make` inherits the parent's job server, I think we won't incur any scheduling issues.
Attachment #8858887 - Flags: review?(gps)
(In reply to Gregory Szorc [:gps] from comment #11)
> Yes, this change does fix the underlying issue of interleaved output causing
> `BUILDSTATUS` messages to get lost. But at what cost. As glandium noted,
> this will cause all output for these "tiers" to buffer until completion.
> This does result in more readable output. But it is also a significant
> setback for debugging logs.
> 
> For example, the timestamps in logs are no longer accurate since all output
> arrives at the same time to the process adding those timestamps. See
> https://public-artifacts.taskcluster.net/PN4jkgewTSC0xv-KMf-ZGg/0/public/
> logs/live_backing.log (search for "19:07:46"). It looks like symbols archive
> generation completed almost instantly! Same thing for packaging. I
> frequently look at those timestamps to help find performance issues in
> automation and losing them would be a major to triage automation slowness,
> IMO. Also, by buffering the output, we're losing details about what is
> executing when. Before, we know when different tasks are both active. With
> this patch, it's hard to tell when we are concurrent and what is executing
> concurrently.

I agree it's not ideal. Prior to these steps being in tree, they were run serially from buildbot. But because we could now run them in parallel, the logs from multiple steps would override each other and make it difficult to follow. The --output-sync flag was intended to fix this so they would at least be readable, but it has the issue of making the timestamps unusable as you mentioned.

> * Write `BUILDSTATUS` messages to a special file defined by an environment
> variable. Should eliminate most conditions emitting that data. Will
> eliminate all race conditions if we do locking.
> * Have a wrapper process for invoking `make` for a tier. That process will
> add timestamps such that when `--output-sync=recurse` output is emitted
> total ordering can be reconstructured with accurate timestamps. As long as
> the sub `make` inherits the parent's job server, I think we won't incur any
> scheduling issues.

I think either of these would probably work, but IMO we should just turn this into actual dependencies rather than use recursive make. I kept the make invocations when we moved this stuff out of buildbot into the tree mainly to keep the initial work simple (getting it to work on all platforms was already complicated enough!) But now that we have it in tree, we have much more flexibility in changing how the automation steps run. I'm not sure how hard that would be to do, though (there's recursion in at least moz-automation.mk and browser/build.mk). Maybe one of your suggestions would be simpler for now.
Unassigning for now, since it seems like this isn't a quick fix. I may revisit it later, but if someone else wants to pick it up, feel free!
Assignee: mshal → nobody
See Also: → 1298040
Blocks: 1382511
(In reply to Michael Shal [:mshal] from comment #12)
> I think either of these would probably work, but IMO we should just turn
> this into actual dependencies rather than use recursive make. I kept the
> make invocations when we moved this stuff out of buildbot into the tree
> mainly to keep the initial work simple (getting it to work on all platforms
> was already complicated enough!) But now that we have it in tree, we have
> much more flexibility in changing how the automation steps run. I'm not sure
> how hard that would be to do, though (there's recursion in at least
> moz-automation.mk and browser/build.mk). Maybe one of your suggestions would
> be simpler for now.

Realistically, aside from the build itself none of the other automation targets really need to be doing a recursive make. Most of them are just running a set of commands that's effectively a shell script.
Bulk priority update of open intermittent test failure bugs. 

P3 => P5

https://bugzilla.mozilla.org/show_bug.cgi?id=1381960
Priority: P3 → P5
Product: Core → Firefox Build System
You need to log in before you can comment on or make changes to this bug.