AUS serves partial updates to the wrong nightly build when there are multiple nightly builds in the same day

RESOLVED FIXED

Status

Release Engineering
General Automation
RESOLVED FIXED
3 years ago
2 years ago

People

(Reporter: rstrong, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Found while researching partial failures on telemetry.

Linux was not affected.

On Windows for the build date 4/12 there were 7,110 failures and only 25 successes when staging a partial.

On Mac for the build date 4/12 there were 188 failures and 35 successes when staging a partial.

On Windows for the build date 4/11 there were 45 failures and 6,790 successes when staging a partial.

On Mac for the build date 4/11 there was 1 failures and 175 successes when staging a partial.

On Windows for the build date 4/13 there were 81 failures and 12,640 successes when staging a partial.

On Mac for the build date 4/13 there was 3 failures and 386 successes when staging a partial.

So, as can be seen the telemetry for the 4/12 build date is extremely out of kilter from what we normally see.

From there I looked into what mar files were served around those dates. There are two partial mar files for Windows and Mac on 4/12 and only one partial mar file for Linux.
http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2015-04-12-03-02-33-mozilla-central/

Both 4/11 and 4/13 only had a single partial mar for each platform.

If the wrong partial was advertised that would explain it and the single partial for Linux would explain why it didn't appear to be affected.
(We should probably kill the AUS product now that we've killed aus3....)

I'm actually not sure how this is happening. The only thing I can think is that whatever is determining the buildid of the "from" MAR gets it wrong. This snippet from a recent nightly doesn't help, beyond pointing the finger at mach:
05:15:07     INFO -  mkdir current previous
05:15:07     INFO -  latestmar=$(ssh -l ffxbld -i /home/mock_mozilla/.ssh/ffxbld_rsa stage.mozilla.org 'ls -1t /pub/mozilla.org/firefox/nightly/latest-mozilla-central | grep linux-i686.complete.mar$ | head -n 1'); \
05:15:07     INFO -    if test -n "$latestmar"; then \
05:15:07     INFO -      wget -O /builds/slave/m-cen-lx-ntly-0000000000000000/build/src/obj-firefox/dist/update/previous.mar http://stage.mozilla.org//pub/mozilla.org/firefox/nightly/latest-mozilla-central/$latestmar && \
05:15:07     INFO -      (cd previous; \
05:15:07     INFO -        MAR=/builds/slave/m-cen-lx-ntly-0000000000000000/build/src/obj-firefox/dist/host/bin/mar perl /builds/slave/m-cen-lx-ntly-0000000000000000/build/src/tools/update-packaging/unwrap_full_update.pl '/builds/slave/m-cen-lx-ntly-0000000000000000/build/src/obj-firefox/dist/update/previous.mar') && \
05:15:07     INFO -      (cd current; \
05:15:07     INFO -        MAR=/builds/slave/m-cen-lx-ntly-0000000000000000/build/src/obj-firefox/dist/host/bin/mar perl /builds/slave/m-cen-lx-ntly-0000000000000000/build/src/tools/update-packaging/unwrap_full_update.pl '/builds/slave/m-cen-lx-ntly-0000000000000000/build/src/obj-firefox/dist/update/firefox-40.0a1.en-US.linux-i686.complete.mar') && \
05:15:07     INFO -      find current -name \*.pgc -print -delete && \
05:15:07     INFO -      find previous -name \*.pgc -print -delete && \
05:15:07     INFO -      rm -f /builds/slave/m-cen-lx-ntly-0000000000000000/build/src/obj-firefox/dist/update/*.partial.*.mar && \
05:15:07     INFO -      SRC_BUILD_ID=$(python /builds/slave/m-cen-lx-ntly-0000000000000000/build/src/config/printconfigsetting.py $(find previous -maxdepth 4 -type f -name application.ini) App BuildID) \
05:15:07     INFO -      DST_BUILD_ID=$(cat ../../config/buildid) \
05:15:07     INFO -      SRC_BUILD=previous DST_BUILD=current \
05:15:07     INFO -        /usr/bin/gmake partial-patch && \
05:15:07     INFO -      rm -f /builds/slave/m-cen-lx-ntly-0000000000000000/build/src/obj-firefox/dist/update/previous.mar; \
05:15:07     INFO -    else \
05:15:07     INFO -      echo "No previous MAR found; not creating a partial MAR"; \
05:15:07     INFO -    fi
05:15:09     INFO -  --2015-04-22 08:15:09--  http://stage.mozilla.org//pub/mozilla.org/firefox/nightly/latest-mozilla-central/firefox-40.0a1.en-US.linux-i686.complete.mar
05:15:09     INFO -  Resolving stage.mozilla.org... 63.245.215.47
05:15:09     INFO -  Connecting to stage.mozilla.org|63.245.215.47|:80... connected.
05:15:09     INFO -  HTTP request sent, awaiting response... 200 OK
05:15:09     INFO -  Length: 54892514 (52M) [application/octet-stream]
05:15:09     INFO -  Saving to: '/builds/slave/m-cen-lx-ntly-0000000000000000/build/src/obj-firefox/dist/update/previous.mar'
05:15:09     INFO -       0K .......... .......... .......... .......... ..........  0%  319K 2m48s

05:15:19     INFO -   53600K .....                                                 100%  116M=9.9s
05:15:19     INFO -  2015-04-22 08:15:19 (5.27 MB/s) - '/builds/slave/m-cen-lx-ntly-0000000000000000/build/src/obj-firefox/dist/update/previous.mar' saved [54892514/54892514]
05:15:32     INFO -  gmake[3]: Entering directory `/builds/slave/m-cen-lx-ntly-0000000000000000/build/src/obj-firefox/tools/update-packaging'
05:15:32     INFO -  MAR=/builds/slave/m-cen-lx-ntly-0000000000000000/build/src/obj-firefox/dist/host/bin/mar \
05:15:32     INFO -  MBSDIFF=/builds/slave/m-cen-lx-ntly-0000000000000000/build/src/obj-firefox/dist/host/bin/mbsdiff \
05:15:32     INFO -    /builds/slave/m-cen-lx-ntly-0000000000000000/build/src/tools/update-packaging/make_incremental_update.sh \
05:15:32     INFO -    '/builds/slave/m-cen-lx-ntly-0000000000000000/build/src/obj-firefox/dist/update/firefox-40.0a1.en-US.linux-i686.partial.20150421092926-20150422030206.mar' \


I don't _think_ we can do anything about this in Balrog itself. It's merely serving up the data that was given to it. If you point at a partial MAR that was a diff between X and Y, but it was actually a diff between W and Y, it has no way to know that that's wrong. I think this needs to be fixed from the mozharness or mach side of things. Or we ignore it because partial MAR generation should be changing in a way that would prevent this from happening when Funsize is ready.
Component: General → General Automation
Product: AUS → Release Engineering
QA Contact: catlee
Version: 3.0 → other
Also cc'ing jlund and mshal, in case they have insight into how mozharness or mach figures out the buildid of the "from" build. The tl;dr here is that it seems to get miscalculated if two nightlies are running at the same time.
Perhaps there could be a restriction added to the build system that prevents multiple nightly builds for a repo and platform from running at the same time. For example, yesterday a new nightly was triggered for the same issue by two people.
(In reply to Robert Strong [:rstrong] (use needinfo to contact me) from comment #3)
> Perhaps there could be a restriction added to the build system that prevents
> multiple nightly builds for a repo and platform from running at the same
> time. For example, yesterday a new nightly was triggered for the same issue
> by two people.

Yeah, some sort of locking mechanism around that would be nice. That might be easier after we move to Taskcluster...
The buildid of the from build is determined in that snippet of code you pasted:

SRC_BUILD_ID=$(python /builds/slave/m-cen-lx-ntly-0000000000000000/build/src/config/printconfigsetting.py $(find previous -maxdepth 4 -type f -name application.ini) App BuildID)

So, we're downloading the "latest mar", and then unpacking it and pulling out the buildid from its application.ini file. I'd guess the problem is with our definition of "latest mar", rather than with determining the buildid once we have the mar.

To determine the "latest mar", mozharness passes in the LATEST_MAR_DIR in the environment, which is set to '/pub/mozilla.org/firefox/nightly/latest-%(branch)s'. The update-packaging Makefile uses that and the funky ssh command to find the latest mar file in that directory, which should have been ported as-is from buildbotcustom: http://mxr.mozilla.org/build/source/buildbotcustom/process/factory.py#2118

Looking back on Apr 11, there were multiple nightlies triggered for OSX & Windows, but not for Linux:

Linux opt 	N
Linux x64 opt 	N
Linux x64 asan 	Nd No
OS X 10.8 opt 	N* N N
Windows XP opt 	N* N N
Windows 8 x64 opt 	N* N N N

I think what's happening is one of the nightlies completes, and uploads a complete mar and a partial mar (20150411030207-20150412030233.mar), and the next one gets the other nightly's complete mar, which has the same buildid (and builduid), so it creates 20150412030233-20150412030233.mar
I agree with mshal's analysis. We should be smarter about determining the 'from' when we generate partials, funsize will help I think. Locking around the number of nightlies running on a revision, or preventing extra nightlies unless there was an error before uploading, would also be needed.
Duplicate of this bug: 1167329
Duplicate of this bug: 1167425
Rail, is this still an issue with funsize?
Flags: needinfo?(rail)
I believe this is not an issue anymore. Funsize queries balrog to get a list of existing nightlies and they are grouped by build ID. Also no more `ssh ftp ls latest`. \o/
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Flags: needinfo?(rail)
Resolution: --- → FIXED
So far I can second that it looks to be fixed. I haven't seen any fallout anymore since we switched our update tests to use funsize. I will still keep an eye out.
You need to log in before you can comment on or make changes to this bug.