Closed Bug 355309 (end2end-bld) Opened 18 years ago Closed 16 years ago

Tracking bug for end-to-end release run

Categories

(Release Engineering :: General, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rhelmer, Assigned: joduinn)

References

(Depends on 1 open bug)

Details

(Keywords: meta)

Attachments

(3 files, 2 obsolete files)

Tracking bug for automated release harness work (Bootstrap).
The code lives in mozilla/tools/release.
Depends on: 355606
Depends on: 356185
Depends on: 361297
Depends on: 363237
Blocks: 366850
No longer blocks: 366850
Depends on: 366850
Depends on: 367438
Depends on: 368579
Depends on: 369004
Depends on: 369538
Depends on: 370228
Depends on: 370459
Depends on: 370853
Talked with rhelmer; gonna co-opt this bug for tracking the end-to-end release work we're doing for Q2; the dependent bugs are all things that we should really try to fix as part of this effort anyway.
Alias: end2end-bld
Assignee: rhelmer → preed
Summary: Tracking bug for automated release harness (Bootstrap) → Tracking bug for end-to-end release run
Depends on: 371305
Depends on: 371325
Depends on: 372744
Depends on: 372746
Depends on: 372755
Depends on: 372757
Depends on: 372759
Depends on: 372762
Depends on: 372764
Depends on: 372765
Depends on: 373080
Depends on: 373401
Depends on: 373995
Depends on: 373116
Depends on: 374555
Depends on: 375006
Depends on: 375587
Depends on: 375714
No longer depends on: 372746
No longer depends on: 375714
Depends on: 375784
Depends on: 375785
Depends on: 375786
Depends on: 375787
Depends on: 375788
Depends on: 375789
Depends on: 376959
Depends on: 378529
Keywords: meta
Needed for "release automation", hence marking as critical.
Severity: major → critical
Priority: -- → P1
Priority: P1 → P2
Down to P3 for this triage round.
Priority: P2 → P3
Depends on: 387426
Depends on: 387970
Over to John.
Assignee: preed → joduinn
Priority: P3 → P2
Lots of activity this week, here's a summary:

To see what we really have working, we're setting up a staging environment, and will then duplicate that to create an equivalent production environment. 

A staging buildbot master is now up and running at http://staging-build-console.build.mozilla.org:8810 and also communicating with http://tinderbox.mozilla.org/showbuilds.cgi?tree=MozillaTest. This VM image is a clean, new clone of the current 1.8 production linux build environment.

A staging linux buildbot slave is up and running on staging-prometheus-vm, and connecting to buildbot master for jobs. This VM image is a clean, new clone of the current 1.8 production linux build environment.

We've found a few minor gotchas, which were either fixed on the spot or are being tracked in separate bugs. At this point, the tag, source, linuxbuild, update and stage steps seem to be running just fine.

Status: NEW → ASSIGNED
Depends on: 389206
A staging mac buildbot slave is up and running on bm-xserve14.build.mozilla.org, and connecting to buildbot master for jobs. This physical machine is a clean, new machine, running the current production mac build environment. Tracking bug#388373 closed.

We now have tag, source, linuxbuild, macosx-build, update and stage running fine.
Depends on: 388373
A staging win32 buildbot slave is up and running on staging-pacifica-vm.build.mozilla.org, and connecting to buildbot master for jobs. This VM image is a clean, new clone of the current 1.8 production win32
build environment. Tracking bug#388366 closed.

We now have tag, source, linuxbuild, win32-build, macosx-build, update and stage running fine.
Depends on: 388366
The l10n/repack step is now working for linux and win32, and can be seen on the staging-build-console (see URL above) along with the other working steps.

It looks like l10n/repack is now also working for mac, but we want to triple-check in the morning with fresh eyes before we declare success on mac.
l10n/repack confirmed working for mac as well; forgot to update earlier.
"l10n verification" now setup, but failing because we found a script hardcoded to use stage.mozilla.org, causing the step to fail. Fixing... 

"update verification" now setup and working fully for linux and win32. On macosx, slave setup but incorrectly tried to do linux update verification on the mac! Debugging...

rhelmer just fixed l10n verification.
macosx update verification now working. All we did was stop and restart the mac slave, and now this works. We'll have to watch this and see if it happens again.
After a few false-starts over the last week, we had our first human-free build today. It took from 10:12am to 7:23pm to complete from tag->stage, with no human intervention, and no triggering of subsequent steps. 

(The signing step was just stubbed out, doing a quick symlink to "fake out" that there were signed bits present, so the end-to-end time may slightly increase.)

Now that the staging systems are up and running, we started setting up the equivalent production systems.

The production build master is at http://build-console.build.mozilla.org:8810/

A setback.

While configuring the production mac slave, I discovered that the production mac slave (bm-xserve12), and the staging mac slave (bm-xserve14) were both Intel based machines. 

However, the current production mac builds are being done on a powerpc mac (bm-xserve02). This means that the staging setup is not as complete as we had thought. We now need to find two powerpc macs (bug#391496 for a staging slave, bug#391498 for a production slave) and install buildbot on them both.

For now, while I work on this, we are putting back the previously working Intel-based mac slave (bm-xserve14), so that at least work on the overall end-to-end automation can continue in the meantime.
Production linux slave (on production-prometheus-vm) and production win32 slave (on production-pacifica-vm) are now up and running. They were tested against the staging-build-console, worked fine and are now connected to build-console.

We're trying to find/setup some intel xserve hardware, for mac production slave. 
We were resigned to moving to Intel xserve hardware, because we couldn't find any more PPC xserves. We only had bm-xserve02.build.mozilla.org, and it was already being used for FF2.0.0.x production. If this machine failed, we had no replacement.

Last week, we found 2 PPC xserves: 
- during an inventory check, we found and reimaged a PPC xserve called fireball.build.mozilla.org. This is now called bm-xserve03.build.mozilla.org, and was reimaged from the current production bm-xserve02. This is being used for automation staging. For details, see bug#391496.
- IT repaired a long-broken bm-xserve05.build.mozilla.org. This was also reimaged from the current production bm-xserve02. This is being used for automation production. For details, see bug#391498.
- during an inventory check, we found bm-xserve01. This was being used by dmills, but he was happy to move to a different machine if needed.

This means we could roll out automation without having to worry about CPU architecture changes / cross compiling. And we also have spare machines, just in case...
We've left bm-xserve01 untouched, but it's good to know it exists!

The other two machines (bm-xserve03, bm-xserve05) both now have buildbot slaves installed, which connect to the build-console and staging-build-console correctly. So far each test run makes it through some steps ok, but then hits tinderbox setup problems, which we then fix, only to hit a later problem. Still debugging!
We also reimaged two Intel xserves (bm-xserve12, bm-xserve14) for use as mac slaves on TRUNK. These were able to connect and do builds from build-console / staging-build-console. 

For now, these have been disconnected from the buildmasters, and are both sitting idle. Once we have automation running on the FF1.8 branch, we can come back to do this for TRUNK. Keeping bug#388373 and bug#390519 open for tracking.
Last night, got the staging system to run through end-to-end just fine, including using the new PPC mac slaves. 

This morning+afternoon, rhelmer and I went through the list of "just for testing" hacks that we had on production. See details at: http://wiki.mozilla.org/Build:Release_Automation#Just_for_testing

We also reconfirmed that the cfg files are in sync between staging and production, manually cleaned up previous test runs, etc.

Now that all those workarounds are removed, we've started our first end-to-end production run. This is attempting to produce "FF2.0.0.7 RC1" from the live cvs repository, using a cutoff time from last night's nightly run:

mac (bm-xserve02)
 timestamp: 1187698800
 build log:  
http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla1.8/1187698800.1400.gz&fulltext=1
 push dir: 
http://stage.mozilla.org/pub/mozilla.org/firefox/nightly/2007-08-21-05-mozilla1.8/

win32
 timestamp: 1187694420
 build log: 
http://tinderbox.mozilla.org/showlog.cgi?tree=Mozilla1.8&errorparser=windows&logfile=1187694420.13259.gz&buildtime=1187694420&buildname=WINNT%205.2%20pacifica-vm%20Depend%20Fx-Nightly&fulltext=1
 push dir: 
http://stage.mozilla.org/pub/mozilla.org/firefox/nightly/2007-08-21-04-mozilla1.8/

linux
 timestamp: 1187690820
 build log: 
http://tinderbox.mozilla.org/showlog.cgi?tree=Mozilla1.8&errorparser=unix&logfile=1187690820.25345.gz&buildtime=1187690820&buildname=Linux%20prometheus-vm%20Depend%20Fx-Nightly&fulltext=1
push dir: 
http://stage.mozilla.org/pub/mozilla.org/firefox/nightly/2007-08-21-03-mozilla1.8/

Delayed yesterday at the manual signing step, because of errors in the docs. Fixed doc mid-afternoon. Once signed, the automation detected the logfile as planned and continued.

Finished 2007-rc1 builds were handed off to QA for testing. So far so good.
QA have run the following tests:

smoketests (linux, win, mac, vista)
BFT (linux, win, mac)
FFT (win)
l10n tests for 12 p1 locales (linux, win, mac)
addon tests
...and additional manual update testing

All pass!!
Attachment #278478 - Attachment mime type: application/octet-stream → text/plain
These files are currently being used on staging-build-console and build-console. Note: for both of these attached master.cfg files

1) The password fields for all buildbot slaves have been intentionally blanked out. However, the running staging and production systems do use passwords.

2) The staging buildmaster uses a differently named (but otherwise identical) set of slaves to the production buildmaster.

3) There are a couple of differences between the staging buildmaster and production buildmaster, around tagging and signing.

Diff-ing the two master.cfg files will show these differences.
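
To make the blanked-password point concrete, here is a minimal sketch of how the slave ("bot") entries look in a buildbot master.cfg of this era. The slave names, port number and blank passwords below are illustrative assumptions, not copied from the attached files:

# Hypothetical excerpt from a master.cfg of this vintage (buildbot 0.7.x style).
c = BuildmasterConfig = {}

# One (name, password) tuple per slave; the staging master uses a
# differently named but otherwise identical set of slaves.
c['bots'] = [
    ('linux-1.8-slave1',  ''),   # passwords intentionally blanked out
    ('win32-1.8-slave1',  ''),
    ('macosx-1.8-slave1', ''),
]

c['slavePortnum'] = 9989         # port the slaves connect back on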
Depends on: 394034
Attachment #278478 - Flags: review?(rhelmer)
Attachment #278478 - Flags: review?(joduinn)
Attachment #278479 - Flags: review?(joduinn)
Attachment #278479 - Flags: review?(rhelmer)
Comment on attachment 278478 [details]
buildbot master.cfg as used on production build

The two steps below should not be doing "clean tinder-config area" or "TinderConfig"; can you remove those before checkin please? :


>l10nverifyFactory.addStep(ShellCommand, description='clean tinder-config area', workdir='build',
>                     command=['rm', '-rfv', '/builds/config'])
>l10nverifyFactory.addStep(ShellCommand, description='TinderConfig', workdir='build',
>                     command=['perl', './release', '-o', 'TinderConfig'],
>                     timeout=36000, haltOnFailure=1, env={'CVS_RSH': 'ssh'})

>updateverifyFactory.addStep(ShellCommand, description='clean tinder-config area', workdir='build',
>                      command=['rm', '-rfv', '/builds/config'])
>updateverifyFactory.addStep(ShellCommand, description='TinderConfig', workdir='build',
>                     command=['perl', './release', '-o', 'TinderConfig'],
>                     timeout=36000, haltOnFailure=1, env={'CVS_RSH': 'ssh'})
>updateverifyFactory.addStep(ShellCommand, description='update verificaton', workdir='build',
>                     command=['perl', './release', '-v', '-o', 'Updates'],
>                     timeout=36000, haltOnFailure=1, env={'CVS_RSH': 'ssh'})


r=rhelmer with that change.

There's a bunch of stuff we know we need to do already, I'd like to get this checked in first but just to enumerate:

1) switch from Buildbot's CVS class to ShellCommand cvs, so we can always use branch provided in master.cfg (switch back if we can get the branch support working right with what we're doing)

2) set up more schedulers so we can resume the process after a failed build

3) general refactoring and cleanup (consider creating a Bootstrap subclass of Shell, etc).
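
As a rough illustration of point 3, a Bootstrap subclass of ShellCommand could fold the repeated "perl ./release -o <Step>" boilerplate into one place. This is only a sketch under assumed buildbot 0.7.x APIs, not the eventual implementation; depending on the buildbot version, subclassed steps may also need addFactoryArguments():

# Hypothetical Bootstrap step; import path assumed for buildbot 0.7.x.
from buildbot.steps.shell import ShellCommand

class Bootstrap(ShellCommand):
    """Run one Bootstrap stage (Tag, Source, Build, Updates, ...)."""

    def __init__(self, bootstrapStep, verbose=False, **kwargs):
        command = ['perl', './release', '-o', bootstrapStep]
        if verbose:
            command.insert(2, '-v')
        # Defaults mirror what each per-step ShellCommand currently repeats.
        kwargs.setdefault('workdir', 'build')
        kwargs.setdefault('description', bootstrapStep)
        kwargs.setdefault('timeout', 36000)
        kwargs.setdefault('haltOnFailure', 1)
        kwargs.setdefault('env', {'CVS_RSH': 'ssh'})
        ShellCommand.__init__(self, command=command, **kwargs)

# Usage sketch:
# updateverifyFactory.addStep(Bootstrap, bootstrapStep='Updates', verbose=True)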
Attachment #278478 - Flags: review?(rhelmer) → review+
Comment on attachment 278479 [details]
buildbot master.cfg as used on staging build system

Same as comment #27; remove the two TinderConfig related steps from l10nverify and updateverify.

Only the build step should need to run TinderConfig.
Attachment #278479 - Flags: review?(rhelmer) → review+
Buildbot configs have been going here:
mozilla/tools/buildbot-configs/

Probably makes sense to have something like:

mozilla/tools/buildbot-configs/automation

And inside there have "staging" and "production" subdirectories.
(In reply to comment #27)
> >updateverifyFactory.addStep(ShellCommand, description='update verificaton', workdir='build',
> >                     command=['perl', './release', '-v', '-o', 'Updates'],
> >                     timeout=36000, haltOnFailure=1, env={'CVS_RSH': 'ssh'})
> 
> 
> r=rhelmer with that change.

Sorry, you need this last line of course :) Overzealous cut and paste on my part.
1) renames remaining slaves, so now all slaves have naming format of: <os>-<branch>-slave<number>

2) remove extra lines, per rhelmer's review.
Attachment #278479 - Attachment is obsolete: true
Attachment #280136 - Flags: review?(rhelmer)
Attachment #280136 - Flags: review?(preed)
Attachment #280136 - Flags: review?(joduinn)
Attachment #278479 - Flags: review?(preed)
Attachment #278479 - Flags: review?(joduinn)
1) renames remaining slaves, so now all slaves have naming format of:
<os>-<branch>-slave<number>

2) remove extra lines, per rhelmer's review.
Attachment #278478 - Attachment is obsolete: true
Attachment #280138 - Flags: review?(rhelmer)
Attachment #280138 - Flags: review?(preed)
Attachment #280138 - Flags: review?(joduinn)
Attachment #278478 - Flags: review?(preed)
Attachment #278478 - Flags: review?(joduinn)
Attachment #280136 - Flags: review?(rhelmer) → review+
Comment on attachment 280138 [details]
buildbot master.cfg as used on production build [checked in]

>####### PROJECT IDENTITY
>c['projectName'] = "Release Automation Test"

This should probably not have "Test" in the name; looks fine besides!
Attachment #280138 - Flags: review?(rhelmer) → review+
Attachment #280136 - Flags: review?(joduinn) → review+
Attachment #280138 - Flags: review?(joduinn) → review+
Agreed... good catch. "Test" has been in the name of both "staging-build-console" and "build-console" forever. Will fix that later, with the next set of changes.
Comment on attachment 280138 [details]
buildbot master.cfg as used on production build [checked in]

Let me preface my review comments with the disclaimer I haven't been working on this project for quite some time now (not for lack of wanting to), and so given that this is a bit of a "big bang" landing, don't know how relevant these review comments are; I'm missing a lot of the context for why things were done in certain ways, and I'm not sure if there's a place I can find that in Bugzilla or in the dependent bugs. If there is, please point me at it (or any relevant documentation on the wiki).

Having said that:

-- There are 1.8 and trunk slaves; are we using this build automation for the trunk now? If so, in what capacity? Was bug 379278 fixed, and I missed it?

-- There are a bunch of slaves that are linux-1.8-slave1, etc. It looks like there's one set of slaves for staging ("1") and one set for production ("2"). If that's the case, it'd probably be clearer to name these linux-1.8-console-staging, unless the assumption is that you can mix and match these two sets in the future for redundancy?

-- I'm concerned that release builds are coming from machines that are different than where the nightlies come from, which is a fundamental process shift which I didn't see discussed anywhere. Is there a plan to address that, hopefully before 2.0.0.7?

-- Is there a reason make test is run before each step (especially in production)? Isn't it testing the same code on the slaves? (Upon further inspection, I see you have each step checking out the code, which I don't know if I understand, but ok. In that case, shouldn't it be checking out a stable tag in the production config? If a floating tag is used, a "cvs stat" after each checkout might be useful.)

-- In general, the log management seems a bit heavy handed; there are a lot of calls to "make clean_logs"; I remember talking about log management a bit, and don't know if there was any conclusion. I'm a little worried about losing logs entirely, and I think the response was "there are copies on the master," but how do we keep track of those longterm? Do we care? (I certainly do, but... maybe I'm alone here).

In general, the approach seems OK. I'm more familiar with the Bootstrap code, so it's harder for me to comment directly on the Buildbot approach. The dependent scheduling was what I had planned on using before these bugs were reassigned, so I think that's good/useful.

Can you point me at where the deliverables are popping out these days? I think that, combined with looking at the generated AUS snippets and such will be helpful.
Attachment #280138 - Flags: review?(preed)
Attachment #280136 - Flags: review?(preed)
(In reply to comment #29)
> Buildbot configs have been going here:
> mozilla/tools/buildbot-configs/
> 
> Probably makes sense to have something like:
> mozilla/tools/buildbot-configs/automation
> 
> And inside there have "staging" and "production" subdirectories.


Yeah, that seems good to me. Let's at least get what we used in 2007rc1 landed before we go any further...
Attached patch: patch as landed
RCS file: /cvsroot/mozilla/tools/buildbot-configs/automation/production/master.cfg,v
done
Checking in automation/production/master.cfg;
/cvsroot/mozilla/tools/buildbot-configs/automation/production/master.cfg,v  <--  master.cfg
initial revision: 1.1
done
RCS file: /cvsroot/mozilla/tools/buildbot-configs/automation/staging/master.cfg,v
done
Checking in automation/staging/master.cfg;
/cvsroot/mozilla/tools/buildbot-configs/automation/staging/master.cfg,v  <--  master.cfg
initial revision: 1.1
done
Attachment #280136 - Attachment description: buildbot master.cfg as used on staging build system → buildbot master.cfg as used on staging build system [checked in]
Attachment #280138 - Attachment description: buildbot master.cfg as used on production build → buildbot master.cfg as used on production build [checked in]
(In reply to comment #35)
> (From update of attachment 280138 [details])
> Let me preface my review comments with the disclaimer I haven't been working on
> this project for quite some time now (not for lack of wanting to), and so given
> that this is a bit of a "big bang" landing, don't know how relevant these
> review comments are; I'm missing a lot of the context for why things were done
> in certain ways, and I'm not sure if there's a place I can find that in
> Bugzilla or in the dependent bugs. If there is, please point me at it (or any
> relevant documentation on the wiki).

Sorry, the design context was covered in the special build team meeting you missed on 05sep2007. Let's talk offline to schedule another time we can redo this for you. Meanwhile, as you've taken yourself off the review list for this bug, we've gone ahead with landing these configs now, as they worked fine for 2007rc1, are worth preserving, and were not yet formally checked in anywhere.



> Having said that:
> -- There are 1.8 and trunk slaves; are we using this build automation for the
> trunk now? If so, in what capacity? Was bug 379278 fixed, and I missed it?

Automation is not yet being used for trunk, and bug#379278 has not been touched, afaik. The few trunk slaves that have been set up are listed in the "bot" section. This is intentional, as one buildmaster should be usable for both trunk and 1.8 slaves, hence putting both 1.8 slaves and trunk slaves in master.cfg.



> -- There are a bunch of slaves that are linux-1.8-slave1, etc. It looks like
> there's one set of slaves for staging ("1") and one set for production ("2").
> If that's the case, it'd probably be clearer to name these
> linux-1.8-console-staging, unless the assumption is that you can mix and match
> these two sets in the future for redundancy?

Correct, the idea is to enable more slaves for redundancy, so expect to soon see linux-1.8-slave3, 4, 5, etc. And yes, with trivial changes (for example, the ssh keys), we can switch a slave between staging & production. Therefore, I would rather not encode that in the slave name, to avoid confusion. 


> -- I'm concerned that release builds are coming from machines that are
> different than where the nightlies come from, which is a fundamental process
> shift which I didn't see discussed anywhere. Is there a plan to address that,
> hopefully before 2.0.0.7?

The traditional build machines did run both production release builds and nightly builds, as separate processes on the same machine. The new automation machines are intentionally separate machines from the traditional build machines, so we could in no way disrupt our live system while setting up the new automation system.

These new automation build machines were cloned from our traditional build machines, so hardware, exact OS patch levels, compilers, linkers, etc. are identical. We then added the minimum set of extra tools needed for automation (buildbot, python, twisted, etc). For details, see: http://wiki.mozilla.org/ReferencePlatforms/BuildBot. Yes, this does technically mean that the new automation machines are no longer bit-for-bit identical to the traditional build machines, though we kept the differences as small as possible. This was part of the reason QA gave 2007rc1 a most thorough testing.

Moving nightlies to these automation machines is in the plans, but is not done yet, as there were other differences to reconcile also.



> -- Is there a reason make test is run before each step (especially in
> production)? Isn't it testing the same code on the slaves? (Upon further
> inspection, I see you have each step checking out the code, which I don't know
> if I understand, but ok. In that case, shouldn't it be checking out a stable
> tag in the production config? If a floating tag is used, a "cvs stat" after
> each checkout might be useful.)
> 
> -- In general, the log management seems a bit heavy handed; there are a lot of
> calls to "make clean_logs"; I remember talking about log management a bit, and
> don't know if there was any conclusion. I'm a little worried about losing logs
> entirely, and I think the response was "there are copies on the master," but
> how do we keep track of those longterm? Do we care? (I certainly do, but...
> maybe I'm alone here).
> 
> In general, the approach seems OK. I'm more familiar with the Bootstrap code,
> so it's harder for me to comment directly on the Buildbot approach. The
> dependent scheduling was what I had planned on using before these bugs were
> reassigned, so I think that's good/useful.
> 
> Can you point me at where the deliverables are popping out these days? I think
> that, combined with looking at the generated AUS snippets and such will be
> helpful.

Which deliverables are you talking about here? Updates/downloadable-full-install/etc? Each buildbot step sends out this type of information in emails to Build@mozilla.org. For example, at 20:53 tonight, a staging run posted full win32 installable bits on http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2007-09-13-20-firefox2.0.0.4/

Let me know if you are looking for something not already covered in those emails.
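
For reference, the kind of master.cfg status block that generates those emails looks roughly like this in buildbot of this era; the sender address and options below are assumptions, not copied from the live config:

# Hypothetical sketch: mail every build's results to the build list.
from buildbot.status.mail import MailNotifier   # buildbot 0.7.x-era import path

c['status'].append(MailNotifier(
    fromaddr='buildbot@build.mozilla.org',    # assumed sender address
    extraRecipients=['build@mozilla.org'],    # list that receives the results
    sendToInterestedUsers=False,              # mail the list, not the committers
    mode='all'))                              # report both passing and failing builds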
(In reply to comment #38)
> (In reply to comment #35)
> Which deliverables you are talking about here?
> Updates/downloadable-full-install/etc? Each buildbot step sends out this type
> of information in emails to Build@mozilla.org. For example, at 20:53 tonight, a
> staging run posted full win32 installable bits on
> http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2007-09-13-20-firefox2.0.0.4/ 

Tinderbox is hardcoded to report that it pushed to "ftp.mozilla.org" but it doesn't really; on staging, everything pushes to staging-build-console. For production, everything pushes to build-console. The candidate and staging areas are then sync'd to the ftpserver.

The idea is to push to stage and FTP in the same exact locations as before, but we don't want to run a buildbot slave on the ftpserver. So, you can find all RC1 bits in the usual staging directories, just like any previous release.
(In reply to comment #38)

> Sorry, the design context was covered in the special build team meeting you
> missed on 05sep2007. Lets talk offline to schedule another time we can redo
> this for you.

Actually, I think it would be better to write up this information (or annotate http://wiki.mozilla.org/Build:Release_Automation with any relevant notes from this meeting), and post it publicly, for comment and review.

People outside of MoCo who are interested in the development of the release automation aren't able to attend special build team meetings.

As this is a community project we're working on (as much as Firefox is), it's important that that information is accessible to others outside of our walls, so they can comment and contribute. (We've asked the community for assistance before at MoFo project meetings, and special meetings to discuss development make it extremely difficult for anyone to contribute.)

Additionally, these meetings should also be announced in public and held in public.

> Automation is not yet being used for trunk, and bug#379278 has not been
> touched, afaik. The few trunk slaves that have been setup are listed in the
> "bot" section. This is intentional ae one buildmaster should be usable for both
> trunk and 1.8 slaves, hence putting 1.8 slaves and trunk slaves in master.cfg.

If the automation isn't being used for trunk, can those sections be commented out until we are using it? (I'm mostly worried that a slave will execute bootstrap in a trunk context; since the bug you pointed at has not been touched, that would be a Bad Thing (tm), thinking specifically of things like unexpected tagging behavior against the trunk, etc.)

> Correct, the idea is to enable more slaves for redundancy, so expect to soon
> see linux-1.8-slave3, 4, 5, etc. And yes, with trivial changes (for example,
> the ssh keys), we can switch a slave between staging & production. Therefore, I
> would rather not encode that in the slave name, to avoid confusion. 

My suggestion was about naming, not about redundancy. 

"1" and "2" are not as clear as "staging" and "production," which aren't as clear as "staging" and "staging-backup."

It's difficult to remember that "1" = "staging" and "2" = "production," so if that's the case, why not call them that, so the mapping doesn't have to be remembered?

(Ignore the redundancy issue; in total agreement there; I'm talking about what the slaves are called.)

> The traditional build machines did run both production release builds, and also
> nightly builds, as separate different processes on the same machine. The new
> automation machines are intentionally separate machines from the traditional
> build machines, so we could in no way disrupt our live system while setting up
> the new automation system.

[snip.]

> Moving nightlies to these automation machines is in the plans, but is not done
> yet, as there were other differences to reconcile also.

Is there a bug to track this? Is there a place to discuss this change? What other differences are there to reconcile?

I understand the reasoning behind doing this while it was in development, but it seems like we're now using this for production as well (i.e. 2007), and I don't see where the discussion to have this (rather large, I might add) release process change took place.

Maybe I missed it, and it's lurking in a bug/newsgroup somewhere?

> Which deliverables you are talking about here?
> Updates/downloadable-full-install/etc? Each buildbot step sends out this type
> of information in emails to Build@mozilla.org. For example, at 20:53 tonight, a
> staging run posted full win32 installable bits on
> http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2007-09-13-20-firefox2.0.0.4/ 

Rhelmer answered; I wanted to look at the actual deliverables from automation; if the staging builds were getting on ftp.m.o, I'd be a) very surprised, and b) very scared.

> Let me know if you are looking for something not already covered in those
> emails.

There were the questions about logs and cvs stat usage; I believe, though, that you mentioned you just forgot to answer them, so I'd be interested in the answer when you have a chance. :-)
Depends on: 397554
Depends on: 397842
Depends on: 398223
Depends on: 399628
Depends on: 399900
Depends on: 400103
QA Contact: mozpreed → build
Depends on: 401150
Depends on: 401290
Depends on: 401459
Depends on: 401579
Depends on: 401596
Depends on: 401628
When setting up build automation machines on trunk, the win32 slave consistently hangs at the end of "cvs co...". I was able to reproduce the problem manually on the machine using:

 $ cvs -d staging-trunk-automation.build.mozilla.org:/builds/cvsmirror/cvsroot co -d release -r release mozilla/tools/tinderbox-configs/firefox/win32

The files are correctly checked out from cvs, and present on the slave local disk, but the cvs command never returns to the msys/bash prompt. Eventually, tinderbox times out, kills it, and flags the "cvs co" step as failed. 

Found this post: http://osdir.com/ml/gnu.mingw.msys/2003-05/msg00029.html, which claimed this is a symptom of a known "ssh not disconnecting under msys" problem from 2003, and suggested using "-z3" or "-z5" as a workaround.

Adding "-z3" to the cvs command on the win32 trunk staging slave solved the problem. I was able to run this 10 times in a row, without any problems.

 $ cvs -z3 -d staging-trunk-automation.build.mozilla.org:/builds/cvsmirror/cvsroot co -d release -r release mozilla/tools/tinderbox-configs/firefox/win32

Making a note of this here, as it seems others are hitting similar problems, and I wonder if the same workaround will help.
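
In buildbot terms, the workaround just means adding -z3 to the checkout step's command. A sketch of what that step might look like follows; the factory name and surrounding arguments are assumptions patterned on the config excerpts earlier in this bug, not the actual trunk config:

# Hypothetical checkout step with the -z3 msys/ssh-hang workaround applied.
win32BuildFactory.addStep(ShellCommand,
    description='cvs co tinderbox-configs (with -z3 workaround)',
    workdir='build',
    command=['cvs', '-z3',
             '-d', 'staging-trunk-automation.build.mozilla.org:/builds/cvsmirror/cvsroot',
             'co', '-d', 'release', '-r', 'release',
             'mozilla/tools/tinderbox-configs/firefox/win32'],
    timeout=3600, haltOnFailure=1, env={'CVS_RSH': 'ssh'})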
Did a follow-on experiment on fx-win32-tbox, the current TRUNK production win32 machine, and found:

1) Running the same command:
 $ cvs -d staging-trunk-automation.build.mozilla.org:/builds/cvsmirror/cvsroot
co -d release -r release mozilla/tools/tinderbox-configs/firefox/win32
...hangs on fx-win32-tbox also, just like it did on the win32 staging slave. Changing the command to be "cvs -z3 -d staging...", like we did above, worked perfectly, just like it did on the win32 staging slave machine. 

2) Changing the same command to use a different CVS repo did *not* hang, even without the workaround -z3 parameter:
 $ cvs -d cvs.mozilla.org:/cvsroot
co -d release -r release mozilla/tools/tinderbox-configs/firefox/win32
...did not hang 


Looks like there is something different about the connections to staging-trunk-automation.build.mozilla.org and to cvs.mozilla.org???
Depends on: 404062
Depends on: 383297
Depends on: 408811
Depends on: 409393
No longer depends on: 409393
Depends on: 409394
Depends on: 409395
Depends on: 410861
No longer depends on: 397554
Depends on: 411928
This bug is huge and kind of nebulous. By some standards we can do "end to end" runs now; maybe we should go over all the deps on this bug, close what we can, and then file new tracking bugs with more specific mandates?
Depends on: 417703
Lots of discussions here, but the remaining work items seem to be covered in the dependent bugs, so closing.
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering