
Stage should not be a major point of failure for AUS



Release Engineering
12 years ago
4 years ago


(Reporter: morgamic, Unassigned)


Firefox Tracking Flags

(Not tracked)




12 years ago
Here is the breakdown of the situation (as I see it), in case this is confusing for anyone:
* Build systems push binary data to stage
* Stage gets pushed to mirror network
* Client gets update information from AUS2, grabs .mar from mirror network
* Client updates, or doesn't update depending on integrity of previous steps

Our ideal situation is to not have these items depend on each other and have graceful failure in the event that:
* Mirror network becomes unavailable
* Mirror becomes compromised and the checksum doesn't hold water
* AUS2 is unavailable
* Other possible scenarios
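
To make the graceful-failure goal concrete, here is a minimal client-side sketch. It is not the actual updater code: the function name `fetch_verified`, the injected `download` callable, and the choice of hash algorithm are all assumptions for illustration. The idea is just that an unreachable mirror or a bad checksum moves the client on to the next mirror, and the worst case is "no update this time" rather than a broken one.

```python
import hashlib
from typing import Callable, Iterable, Optional


def fetch_verified(
    mirrors: Iterable[str],
    expected_hash: str,
    download: Callable[[str], bytes],
    algorithm: str = "sha256",  # assumption: whatever hash the update metadata advertises
) -> Optional[bytes]:
    """Return the first download whose hash matches, or None if every mirror fails."""
    for url in mirrors:
        try:
            data = download(url)
        except OSError:
            # Mirror unreachable: skip it instead of aborting the update check.
            continue
        if hashlib.new(algorithm, data).hexdigest() == expected_hash.lower():
            return data
        # Checksum mismatch: treat this mirror's copy as bad and try the next one.
    # Nothing usable anywhere: the caller should skip this update and retry later.
    return None
```

Passing `download` in as a parameter keeps the retry logic separate from the network layer, so the same loop can be exercised against fake mirrors.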

This particular bug is related to bug 302348, bug 317303 and bug 317618 because they all cover specific failure situations caused by stage running out of disk, which then caused an inconsistency between the .mar files available on the mirror network and the AUS2 build data that generates the update XML files.

That said, there are some solutions to consider based on our observations:
* Adjust client to fail gracefully when data is missing/corrupted
* Adjust AUS2 so it verifies that published AUS2 updates actually exist on the mirrors?
* Adjust build scripts that publish update metadata for AUS2 so they don't depend on stage (needs more investigation)
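
As a rough illustration of the second option (server-side verification), the sketch below filters a list of candidate updates down to the ones whose .mar is confirmed present. It does not reflect how AUS2 is actually implemented: `offerable_updates`, the `"url"` key, and the `mirror_has` probe are hypothetical names. In practice the probe might be something like an HTTP HEAD request against the mirror, but that part is left to the caller here.

```python
from typing import Callable, Dict, List


def offerable_updates(
    updates: List[Dict[str, str]],
    mirror_has: Callable[[str], bool],
) -> List[Dict[str, str]]:
    """Keep only the updates whose .mar is confirmed present on the mirrors."""
    confirmed = []
    for update in updates:
        try:
            present = mirror_has(update["url"])
        except OSError:
            # An unreachable mirror counts as "not confirmed", so the update is withheld.
            present = False
        if present:
            confirmed.append(update)
    return confirmed
```

One design consequence worth noting: if the mirrors are unreachable from the server, this withholds every update, which is the same "offering no updates" worst case, just reached deliberately.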

Now, whether the responsibility for checking the mirror network should rest on the client or server side is definitely debatable.  Personally I feel it should rest on the client, because in worst-case scenarios the server may not be available.

Currently AUS2 assumes that stage does not run out of disk and that the build information is always provided (it gets pushed at the same time as the binaries when the build is finished -- correct me if I'm wrong, Chase).

The bit of mystery is the last point, where we need to investigate why the build data was no longer being pushed to AUS2 after the disk problem on stage.  The failure of AUS2, specifically, is this mysterious dependency on stage.  AUS2's worst case is offering no updates, which is in itself worrisome, especially if we needed to push a major security update.

Thoughts/comments?  :)

Comment 1

12 years ago
Since we're days from 1.5 release, maybe best to go with the server-side solution so the client doesn't have to be modified?

Comment 2

12 years ago
Chase, we (IT) don't have any action items in this bug.  They all involve changing the build process - could you take a look?
Assignee: server-ops → chase
Component: Server Operations → Build & Release
QA Contact: myk → chase

Comment 3

12 years ago
(In reply to comment #2)
> Chase, we (IT) don't have any action items in this bug.  They all involve
> changing the build process - could you take a look?

I'm working on changing the build process, but that's just one of three (or more) parts to this bug.  The other two are in the client (bug 302348) and in bouncer (which requires coordination between build+release, sysadmins, and Mike Morgan/OSUOSL).

This bug has a straightforward workaround (ensure that all of the missing builds are uploaded to stage) and the problem of stage filling up doesn't happen that often.  It's unlikely this bug will be wholly fixed on those merits alone in the short term.

I'm migrating all of the update build goodness into better automation that is more tolerant of failure modes.  When that's in place, the build+release component of this bug will be fixed.
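
For illustration only, the workaround above (making sure every build the update data points at is actually on stage) amounts to a set difference. The function names and the idea of gating the metadata push on the result are assumptions, not a description of the real build scripts.

```python
from typing import Iterable, List


def missing_from_stage(published: Iterable[str], on_stage: Iterable[str]) -> List[str]:
    """List the paths referenced by the update data that are absent from stage."""
    return sorted(set(published) - set(on_stage))


def safe_to_publish(published: Iterable[str], on_stage: Iterable[str]) -> bool:
    """Only push update metadata when every file it references is on stage."""
    return not missing_from_stage(published, on_stage)
```

Run before the metadata push, a check like this would turn a full disk on stage into a skipped publish rather than update data that points at files nobody can download.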

Comment 4

12 years ago
Mass reassign of open bugs for chase@mozilla.org to build@mozilla-org.bugs.
Assignee: chase → build
Severity: critical → normal
Priority: -- → P3
Assignee: build → nobody
QA Contact: chase → build
Depends on: 394069

Comment 5

Mike, is this kind of thing still an issue? Seems like bouncer provides us some protection from linking to a MAR that does not exist.

I'm guessing the "better automation that is more tolerant of failure modes" mentioned in comment 3 didn't happen :) 

I'm not sure what to do here, can we either redefine the problem to something solvable or WONTFIX?

Comment 6

WONTFIX as per comment #5. Please reopen if there's something to do here.
Last Resolved: 10 years ago
Resolution: --- → WONTFIX

Comment 7

For the record, the three bugs in comment #0 were all for nightly updates and go back to when we carried those on the mirrors. This problem got a lot better when only ftp.m.o had nightlies, and disappeared completely when stage.m.o and ftp.m.o used shared storage instead of an rsync. It _might_ reappear when we enable virus scanning before publishing any new files; the lag from that is not well quantified yet.


4 years ago
Product: mozilla.org → Release Engineering