Closed Bug 317623 Opened 19 years ago Closed 16 years ago

Stage should not be a major point of failure for AUS

Categories

(Release Engineering :: General, defect, P3)


Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: morgamic, Unassigned)

References

Details

Here is the breakdown of the situation (as I see it), in case this is confusing for anyone:
* Build systems push binary data to stage
* Stage gets pushed to mirror network
* Client gets update information from AUS2, grabs .mar from mirror network
* Client updates, or doesn't, depending on the integrity of the previous steps (a rough sketch of the client side of this flow follows the list)
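For anyone less familiar with the moving parts, here is a minimal sketch of the client side of that flow.  The AUS2 URL is hypothetical and the update-XML attribute names (URL, hashFunction, hashValue) are assumptions about the usual shape of the document, not the actual client code:

    import urllib.request
    import xml.etree.ElementTree as ET

    # Hypothetical AUS2 endpoint; the real URL layout varies per product/build.
    AUS_URL = ("https://aus2.example.org/update/Firefox/1.5/"
               "20051107/Linux_x86-gcc3/en-US/release/update.xml")

    def check_for_update(aus_url=AUS_URL):
        """Ask AUS2 for an update and return the mirror URL plus expected hash."""
        with urllib.request.urlopen(aus_url, timeout=30) as resp:
            root = ET.parse(resp).getroot()
        # Attribute names here are assumptions about the update-XML shape.
        for patch in root.iter("patch"):
            return {
                "mar_url": patch.get("URL"),
                "hash_function": patch.get("hashFunction"),
                "hash_value": patch.get("hashValue"),
            }
        return None  # no update offered -- the client keeps its current build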

Our ideal situation is for these items not to depend on each other, and for failure to be graceful in the event that:
* Mirror network becomes unavailable
* Mirror becomes compromised and the checksum doesn't hold water (see the verification sketch after this list)
* AUS2 is unavailable
* Other possible scenarios
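To make the checksum case concrete, a client-side graceful-failure check could look roughly like this.  This is a sketch only, assuming AUS2 advertises a sha512 hash for each .mar; the file path and hash function are placeholders:

    import hashlib

    def mar_is_trustworthy(mar_path, expected_hash, hash_function="sha512"):
        """Return True only if the downloaded .mar matches the advertised hash."""
        digest = hashlib.new(hash_function)
        with open(mar_path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest() == expected_hash.lower()

    # Graceful failure: if this returns False, throw the download away and try
    # again later instead of applying a corrupted or tampered update.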

This particular bug is related to bug 302348, bug 317303 and bug 317618 because they all cover specific failures caused by stage running out of disk, which then caused an inconsistency between the .mar files available on the mirror network and the AUS2 build data that generates the update XML files.

That said, there are some solutions to consider based on our observations:
* Adjust client to fail gracefully when data is missing/corrupted
* Adjust AUS2 so it verifies that published AUS2 updates actually exist on the mirrors? (a sketch of such a check follows this list)
* Adjust build scripts that publish update metadata for AUS2 so they don't depend on stage (needs more investigation)
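On the second bullet, the server-side check could be as simple as a HEAD request against the mirror (or bouncer) URL before the patch is advertised.  The sketch below is illustrative only, not actual AUS2 code; mar_url stands in for whatever URL the update XML is about to hand out:

    import urllib.error
    import urllib.request

    def mar_exists_on_mirror(mar_url, timeout=10):
        """Cheap HEAD request to confirm the advertised .mar actually exists."""
        req = urllib.request.Request(mar_url, method="HEAD")
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                return resp.status == 200
        except (urllib.error.URLError, OSError):
            return False

    # If this returns False, AUS2 could serve an empty <updates/> document
    # rather than pointing clients at a broken mirror link.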

Now, whether the responsibility for checking the mirror network should rest on the client or server side is definitely debatable.  Personally I feel it should rest on the client, because in worst-case scenarios the server may not be available.

Currently AUS2 assumes that stage does not run out of disk and that the build information provided to it (which gets pushed at the same time as the binaries when the build is finished -- correct me if I'm wrong, Chase) is complete and accurate.

The bit of mystery is the last point, where we need to investigate why the build data was no longer being pushed to AUS2 after the disk problem on stage.  The failure of AUS2, specifically, is this mysterious dependency on stage.  AUS2's worst case is offering no updates, which is in itself worrisome, especially if we needed to push a major security update.

Thoughts/comments?  :)
Since we're days from the 1.5 release, maybe it's best to go with the server-side solution so the client doesn't have to be modified?
Chase, we (IT) don't have any action items in this bug.  They all involve changing the build process - could you take a look?
Assignee: server-ops → chase
Component: Server Operations → Build & Release
QA Contact: myk → chase
(In reply to comment #2)
> Chase, we (IT) don't have any action items in this bug.  They all involve
> changing the build process - could you take a look?

I'm working on changing the build process, but that's just one of three (or more) parts to this bug.  The other two are in the client (bug 302348) and in bouncer (which requires coordination between build+release, sysadmins, and Mike Morgan/OSUOSL).

This bug has a straightforward workaround (ensure that all of the missing builds are uploaded to stage), and the problem of stage filling up doesn't happen that often.  It's unlikely this bug will be wholly fixed on those merits alone in the short term.
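For what it's worth, the workaround amounts to diffing what AUS2 advertises against what is actually on stage.  A rough sketch, with a hypothetical stage directory and a hypothetical manifest of expected filenames:

    import os

    def missing_mars(expected_mar_names, stage_dir="/pub/mozilla.org/firefox/nightly"):
        """Return the .mar filenames AUS2 expects that are not on stage."""
        present = set(os.listdir(stage_dir))
        return sorted(set(expected_mar_names) - present)

    # Anything this returns needs to be re-pushed to stage so the mirror
    # network and the AUS2 snippet data agree again.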

I'm migrating all of the update build goodness into better automation that is more tolerant of failure modes.  When that's in place, the build+release component of this bug will be fixed.
Mass reassign of open bugs for chase@mozilla.org to build@mozilla-org.bugs.
Assignee: chase → build
Severity: critical → normal
Priority: -- → P3
Assignee: build → nobody
QA Contact: chase → build
Mike, is this kind of thing still an issue? Seems like bouncer provides us some protection from linking to a MAR that does not exist.

I'm guessing the "better automation that is more tolerant of failure modes" mentioned in comment 3 didn't happen :) 

I'm not sure what to do here, can we either redefine the problem to something solvable or WONTFIX?
WONTFIX as per comment #5. Please reopen if there's something to do here.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → WONTFIX
For the record, the three bugs in comment #0 were all for nightly updates and go back to when we carried those on the mirrors. This problem got a lot better when only ftp.m.o had nightlies, and disappeared completely when stage.m.o and ftp.m.o used shared storage instead of an rsync. It _might_ reappear when we enable virus scanning before publishing any new files; the lag from that is not well quantified yet.
Product: mozilla.org → Release Engineering