317623 - Stage should not be a major point of failure for AUS

Reporter

Description

•

19 years ago

Here is the breakdown of the situation (as I see it), in case this is confusing for anyone:
* Build systems push binary data to stage
* Stage gets pushed to mirror network
* Client gets update information from AUS2, grabs .mar from mirror network
* Client updates, or doesn't update depending on integrity of previous steps

Our ideal situation is to not have these items depend on each other and have graceful failure in the event that:
* Mirror network becomes unavailable
* Mirror becomes compromised and the checksum doesn't hold water
* AUS2 is unavailable
* Other possible scenarios

This particular bug is related to bug 302348, bug 317303 and bug 317618 because these all cover specific failure situations caused by stage running out of disk, which then caused an inconsistency between the .mar files available on the mirror network and the AUS2 build data that generates the update XML files.

That said, there are some solutions to consider based on our observations:
* Adjust client to fail gracefully when data is missing/corrupted
* Adjust AUS2 so it verifies that published AUS2 updates actually exist on the mirrors?
* Adjust build scripts that publish update metadata for AUS2 so they don't depend on stage (needs more investigation)
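To make the second option concrete, here is a rough sketch (not how AUS2 actually works) of filtering update metadata down to entries whose .mar can be reached. The `Update` record, `mar_exists` and `publishable_updates` are made-up names for illustration, and a plain HTTP HEAD request is assumed to be an acceptable existence check:

```python
import urllib.request
from typing import Callable, Iterable, List, NamedTuple


class Update(NamedTuple):
    build_id: str  # hypothetical build identifier
    mar_url: str   # hypothetical mirror URL for the .mar


def mar_exists(url: str, timeout: float = 10.0) -> bool:
    """Return True if a HEAD request for the URL answers with HTTP 2xx."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, ValueError):
        # OSError covers network and HTTP errors; ValueError covers malformed URLs.
        return False


def publishable_updates(
    updates: Iterable[Update],
    exists: Callable[[str], bool] = mar_exists,
) -> List[Update]:
    """Keep only the updates whose .mar passes the existence check."""
    return [update for update in updates if exists(update.mar_url)]
```

The `exists` parameter is there so the check can be swapped out (for a bouncer lookup, say) or stubbed in tests without touching the network.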

Now, whether the responsibility for checking the mirror network should rest on the client or server side is definitely debatable.  Personally I feel it should rest on the client, because in worst-case scenarios the server may not be available.
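A minimal sketch of what a client-side check could look like, assuming the update XML carries an expected hash for the .mar. The function names, the `fetch` callable and the choice of SHA-1 are assumptions for illustration, not the actual updater code:

```python
import hashlib
from typing import Callable, Iterable, Optional


def mar_is_intact(data: bytes, expected_hex: str, algorithm: str = "sha1") -> bool:
    """Compare the digest of the downloaded bytes with the advertised one."""
    digest = hashlib.new(algorithm, data).hexdigest()
    return digest == expected_hex.lower()


def choose_update(
    mirror_urls: Iterable[str],
    expected_hex: str,
    fetch: Callable[[str], bytes],
) -> Optional[bytes]:
    """Return the first download that verifies, or None if every mirror fails."""
    for url in mirror_urls:
        try:
            data = fetch(url)
        except OSError:
            continue  # mirror unavailable: try the next one
        if mar_is_intact(data, expected_hex):
            return data
        # checksum mismatch: treat this mirror as bad and keep going
    return None  # caller skips the update instead of applying bad data
```

Returning `None` rather than raising is the "fail gracefully" part: the client just stays on its current build and tries again at the next update check.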

Currently AUS2 assumes that stage does not run out of disk and that the build information is always provided (it gets pushed at the same time as the binaries when the build is finished -- correct me if I'm wrong, Chase).

The bit of mystery is the last point, where we need to investigate why the build data was no longer being pushed to AUS2 after the disk problem on stage.  The failure of AUS2, specifically, is this mysterious dependency on stage.  AUS2's worst case is offering no updates, which is in itself worrisome, especially if we needed to push a major security update.

Thoughts/comments?  :)

elfguy

Comment 1

•

19 years ago

Since we're days from 1.5 release, maybe best to go with the server-side solution so the client doesn't have to be modified?

Justin Fitzhugh

Comment 2

•

19 years ago

Chase, we (IT) don't have any action items in this bug.  They all involve changing the build process - could you take a look?

Assignee: server-ops → chase

Component: Server Operations → Build & Release

QA Contact: myk → chase

Chase Phillips

Comment 3

•

19 years ago

(In reply to comment #2)
> Chase, we (IT) don't have any action items in this bug.  They all involve
> changing the build process - could you take a look?

I'm working on changing the build process, but that's just one of three (or more) parts to this bug.  The other two are in the client (bug 302348) and in bouncer (which requires coordination between build+release, sysadmins, and Mike Morgan/OSUOSL).

This bug has a straightforward workaround (ensure that all of the missing builds are uploaded to stage) and the problem of stage filling up doesn't happen that often.  On those merits alone, it's unlikely this bug will be wholly fixed in the short term.

I'm migrating all of the update build goodness into better automation that is more tolerant of failure modes.  When that's in place, the build+release component of this bug will be fixed.

Chase Phillips

Comment 4

•

18 years ago

Mass reassign of open bugs for chase@mozilla.org to build@mozilla-org.bugs.

Assignee: chase → build

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Updated

•

17 years ago

Severity: critical → normal

Priority: -- → P3

Reed Loden [:reed]

Updated

•

17 years ago

Assignee: build → nobody

QA Contact: chase → build

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Updated

•

17 years ago

Depends on: 394069

Robert Helmer [:rhelmer]

Comment 5

•

17 years ago

Mike, is this kind of thing still an issue? Seems like bouncer provides us some protection from linking to a MAR that does not exist.

I'm guessing the "better automation that is more tolerant of failure modes" mentioned in comment 3 didn't happen :) 

I'm not sure what to do here. Can we either redefine the problem as something solvable or WONTFIX?

Robert Helmer [:rhelmer]

Comment 6

•

16 years ago

WONTFIX as per comment #5. Please reopen if there's something to do here.

Status: NEW → RESOLVED

Closed: 16 years ago

Resolution: --- → WONTFIX

Nick Thomas [:nthomas] (UTC+12)

Comment 7

•

16 years ago

For the record, the three bugs in comment #0 were all for nightly updates and go back to when we carried those on the mirrors. This problem got a lot better when only ftp.m.o had nightlies, and disappeared completely when stage.m.o and ftp.m.o used shared storage instead of an rsync. It _might_ reappear when we enable virus scanning before publishing any new files; the lag from that is not well quantified yet.

Nobody; OK to take it and work on it

Assignee

Updated

•

11 years ago

Product: mozilla.org → Release Engineering

Categories

(Release Engineering :: General, defect, P3)

Tracking

(Not tracked)

People

(Reporter: morgamic, Unassigned)

Security

(public)
