Closed Bug 511901 Opened 15 years ago Closed 15 years ago

Nightly automatic updates for en-US will take longer if the build finishes after 7AM

Categories

(Release Engineering :: General, defect, P2)

x86
Windows XP
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: stevee, Assigned: armenzg)

References

Details

(Keywords: regression)

Attachments

(7 files, 5 obsolete files)

2.25 KB, text/plain
Details
2.25 KB, text/plain
Details
14.37 KB, patch
coop
: review+
Details | Diff | Splinter Review
388 bytes, patch
coop
: review+
Details | Diff | Splinter Review
413 bytes, patch
coop
: review+
Details | Diff | Splinter Review
3.05 KB, patch
nthomas
: review+
Details | Diff | Splinter Review
4.13 KB, patch
nthomas
: review+
nthomas
: checked-in+
Details | Diff | Splinter Review
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2a2pre) Gecko/20090820 Namoroka/3.6a2pre ID:20090820053820

Currently the 20090821 Namoroka build finished three hours ago, but still there is nothing being pushed on the automatic update channel. For some reason the automatic update system is now taking much longer to push the next update to nightly testers. This happened with yesterday's build too.

For many years we nightly testers have received our next dollop of firefox goodness at the same time, so this unexpected change is causing some to scratch their heads and wonder if the update system is broken, especially as you can see the final builds on FTP.

Eventually the update system does work, but something has changed recently to cause this new delay between 'build finish' and 'update push', so I'm filing this bug to make sure it was an expected and planned change, or if it's a regression and there's something that can be done to make the process as fast as it used to be.
This is probably happening because the en-US Windows nightlies don't finish before the l10n nightlies start - which causes the en-US ones to get processed with a ton of l10n ones instead of beforehand. This will also affect any en-US nightlies that we need to respin in the morning, too.

I got a mac nightly update pretty early today, fwiw.
Start and end times for "WINNT 5.2 mozilla-1.9.2 nightly" (which they all started at their expected time): 

21st - Fri Aug 21 03:32:01 2009 - Fri Aug 21 07:56:42 2009
20th - Thu Aug 20 03:32:00 2009 - Thu Aug 20 08:18:44 2009
19th - Wed Aug 19 03:32:00 2009 - Wed Aug 19 07:39:06 2009

Any en-US nightly build that finishes after 7AM it will take much longer for its update to be pushed live since it will fall into the bucket of pending updates to be processed (locales get triggered at 7AM).

I see few ways of going with this:
- disable l10n nightly updates (I have not said this out loud!)
- start nightly builds earlier (this would not fix the situation for retriggered nightly builds)
- starts l10n repackages few hours later (this might fix it for most days)
- prioritize en-US in patch-packager.pl and push an update at a time (total time will probably increase)
- updates get generated and pushed by slaves (Long term, it is one of our goals and the best way)

Right now the logs of patch-packager are not kept locally and are only being emailed if there is any problems to build@m.o.
To help us debug or know what patch-packager is up to I have started emailing the logs to myself regardless if there is success or failure.

I was thinking that we might want to email them to tinderbox so people can now if their locale has been processed yet. What do people think?
Assignee: nobody → armenzg
Status: NEW → ASSIGNED
Depends on: 480081, 510524
Priority: -- → P2
Summary: Nightly automatic updates now take hours to appear and work. → Nightly automatic updates for en-US will take longer if the build finishes after 7AM
I won't fight you for this bug, but I think it is important to get it fixed fast. My thoughts are below:

(In reply to comment #2)
> - prioritize en-US in patch-packager.pl and push an update at a time (total
> time will probably increase)

I think this is the best short-term solution and the quickest path forward.

Here's a little perl snippet that will sort an array putting en-US ahead of everything else. Alter as requried:

sub enUS_first {
    if ($a =~ /en-US/) {
        if ($b =~ /en-US/) {
            return 0;
        } else {
            return -1;
        }
    } elsif ($b =~ /en-US/) {
        return 1;
    }
    return $a cmp $b;
}

In the patch-packager.pl script, you'll want to sort the keys from the %$proposed_update_graph in the &write_proposed_update_graph() sub using a routine similar to that above. You'll also want to move the rsync calls out of main() into &write_proposed_update_graph() and make upload only the current set of patches (complete+partial) being generated.

> - updates get generated and pushed by slaves (Long term, it is one of our goals
> and the best way)

I'll file the bug for this today. I think this shouldn't be too hard to do if we start keeping track of the previous nightly build for any given product/branch/platform on stage so we can easily pull down the mar+complete.txt for that combo.

> I was thinking that we might want to email them to tinderbox so people can now
> if their locale has been processed yet. What do people think?

When this becomes part of the build steps on the slave, this problem goes away.
(In reply to comment #3) 
> I'll file the bug for this today. I think this shouldn't be too hard to do if
> we start keeping track of the previous nightly build for any given
> product/branch/platform on stage so we can easily pull down the
> mar+complete.txt for that combo.

Bug 511967 filed.
(In reply to comment #1)
> This is probably happening because the en-US Windows nightlies don't finish
> before the l10n nightlies start

We don't want to be in this situation in general, because l10n will be repacking the previous day's build rather the freshly minted nightly. Pushing the the l10n builds until after en-US completes would help (perhaps there's been bug traffic about schedulers triggering l10n).

(In reply to comment #3)
> In the patch-packager.pl script, you'll want to sort the keys from the
> %$proposed_update_graph in the &write_proposed_update_graph() sub using a
> routine similar to that above. You'll also want to move the rsync calls out of
> main() into &write_proposed_update_graph() and make upload only the current set
> of patches (complete+partial) being generated.

The combination of sorting en-US to the top and uploading after each partial sounds good. The "upload only the current set of patches" is to avoid rsync overhead each time around the for loop ?
(In reply to comment #5)
> We don't want to be in this situation in general, because l10n will be
> repacking the previous day's build rather the freshly minted nightly. Pushing
> the the l10n builds until after en-US completes would help (perhaps there's
> been bug traffic about schedulers triggering l10n).

Nick: I have a working patch for bug 469364 now. In fact, my quickfire testing of it on Friday was probably why we ran out of space on staging over the weekend. Oops.

> The combination of sorting en-US to the top and uploading after each partial
> sounds good. The "upload only the current set of patches" is to avoid rsync
> overhead each time around the for loop ?

Armen and I talked about this morning. We're going to try to make the rsync calls more targeted through some testing in staging, but we'll eat the overhead if we have to.
For the curious here is my WIP which has not worked on the run I did earlier but it gives an idea that:
- moved the global removing at the top of the main function 
- doing an rsync (full rsync since I do not know yet how to compose the proper path to an individual locale) per processed locale
- sorting so en-US gets processed before other locales
The previous patch worked but I had run the unmodified version and thought it did not work.

This morning I moved aus as aus.older and forgot to create it again. As soon as I run patch-packager.pl it removed any work under app/. I think this has been fixed in this patch since in the rsyncs that delete files I have substituted the ";" for "&&" instead.

This patch is now based off the latest checkout and it contains some of coop's changes that allow patch-packager.pl to work with two configuration files (one for production and one for staging). I hope not too much work has been lost. The mentioned files are not in the patch.

I will continue tomorrow working on this.
Attachment #396303 - Attachment is obsolete: true
Attached patch configuration for staging (obsolete) — Splinter Review
Attachment #397047 - Flags: review?(ccooper)
Attached patch configuration for production (obsolete) — Splinter Review
Attachment #397048 - Flags: review?(ccooper)
Attachment #397050 - Attachment description: unsorted list of locales (just as an example) → unsorted list of updates (just as an example)
Attachment #397050 - Attachment mime type: application/octet-stream → text/plain
Attachment #397051 - Attachment mime type: application/octet-stream → text/plain
Comment on attachment 397044 [details] [diff] [review]
patch-packager.pl - sorts list of pending updates by starting with en-US first and enables email notification as locale is processed

+# the following variables must be declared in the patch-packager.cfg file
+# my ($prefix, $pid_file, $aus_host, $ftp_host, $stage_host, $ffox_user,
+#     $tbird_user, $default_hashtype, $notification_email, $doUpload,
+#     $doNotify, $verbose)

If it's going to be commented out, there's no need for the "my ()" declaration. Un-comment the section so all the variables are defined (makign sure they still get imported from the cfg file properly), or just list the vars in the comment.

+for my $name ('$prefix', '$pid_file', '$aus_host', '$ftp_host', 
+              '$stage_host', '$ffox_user', '$tbird_user',  
+             '$default_hashtype', '$notification_email',
+              '$doUpload', '$doNotify', '$verbose') {

This won't work. Some of these vars (i.e. $doUpload, $doNotify, $verbose) can be 0, which will fail the check_param test.

+sub enUS_first {
A descriptive comment would be nice here.

-    printf("Decrompressing from.mar... ");
+    printf("Decompressing from.mar... ");

Glad to see someone finally fixing that typo!
Attachment #397044 - Flags: review?(ccooper) → review-
Comment on attachment 397047 [details] [diff] [review]
configuration for staging

>$doNotify = true;
>$doUpload = true;
>$verbose = false;

Perl has no truth boolean type. You'll need to use 1(true) and 0(false).
Attachment #397047 - Flags: review?(ccooper) → review-
Comment on attachment 397048 [details] [diff] [review]
configuration for production

>$doNotify = true;
>$doUpload = true;
>$verbose = false;

Perl has no truth boolean type. You'll need to use 1(true) and 0(false).
Attachment #397048 - Flags: review?(ccooper) → review-
(In reply to comment #14)
> (From update of attachment 397044 [details] [diff] [review])
> +for my $name ('$prefix', '$pid_file', '$aus_host', '$ftp_host', 
> +              '$stage_host', '$ffox_user', '$tbird_user',  
> +             '$default_hashtype', '$notification_email',
> +              '$doUpload', '$doNotify', '$verbose') {
> 
> This won't work. Some of these vars (i.e. $doUpload, $doNotify, $verbose) can
> be 0, which will fail the check_param test.
> 
I realized and that is why I tried using the "true" and "false" values which seem to work but I guess it was because they were not empty and in the later conditions were probably mactched to true which was the intended value. I have fixed this but I had to add a "1;" at the end of patch-packager.cfg since the last variable that I define is "$verbose = 0;" which perl seems to understand as a module loaded incorrectly (I verified this by making $verbose = 1; and removing the last 1;"

I will post the patch for review in the morning. I will let this work overnight.
Attachment #397047 - Attachment is obsolete: true
Attachment #397274 - Flags: review?(ccooper)
Attachment #397048 - Attachment is obsolete: true
Attachment #397275 - Flags: review?(ccooper)
Attachment #397275 - Flags: review?(ccooper) → review+
Attachment #397274 - Flags: review?(ccooper) → review+
Attachment #397273 - Flags: review?(ccooper) → review+
RCS file: /mofo/release/patcher/README,v
done
Checking in README;
/mofo/release/patcher/README,v  <--  README
initial revision: 1.1
done
Checking in patch-packager-cron.sh;
/mofo/release/patcher/patch-packager-cron.sh,v  <--  patch-packager-cron.sh
new revision: 1.5; previous revision: 1.4
done
RCS file: /mofo/release/patcher/patch-packager-production.cfg,v
done
Checking in patch-packager-production.cfg;
/mofo/release/patcher/patch-packager-production.cfg,v  <--  patch-packager-production.cfg
initial revision: 1.1
done
RCS file: /mofo/release/patcher/patch-packager-staging.cfg,v
done
Checking in patch-packager-staging.cfg;
/mofo/release/patcher/patch-packager-staging.cfg,v  <--  patch-packager-staging.cfg
initial revision: 1.1
done
Checking in patch-packager.pl;
/mofo/release/patcher/patch-packager.pl,v  <--  patch-packager.pl
new revision: 1.24; previous revision: 1.23
done
Attachment #397273 - Flags: checked-in+
Attachment #397274 - Flags: checked-in+
Attachment #397275 - Flags: checked-in+
The latest changes have been deployed. I will watch over the weekend to see how this behaves.
Blocks: 513479
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Armen, this still isn't working as expected.  There has been mention in the recent (even today's) trunk threads on mozillazine about Firefox still not being able to update for a few hours after the builds are released.
(In reply to comment #23)
> Armen, this still isn't working as expected.  There has been mention in the
> recent (even today's) trunk threads on mozillazine about Firefox still not
> being able to update for a few hours after the builds are released.

This is a known issue, I'm afraid. 

Linux and and Mac builds finish relatively quickly, and then trigger their corresponding l10n repacks. These l10n repacks swamp the update machine with partial update requests before the original Windows build is even finished. Windows updates are delayed across the board.

I'm currently working on the fix in bug 511967. I wrote a blog post with more details here: http://coop.deadsquid.com/2009/09/consequences-of-l10n-nightly-updates/
As Chris mentions having made the L10n repackages starts as soon as en-US builds it makes the queue quite big before it can check for new updates and then do the sorting.

I am working on a patch that the script will process a set number of updates rather than all pending ones. This will make the script refresh with the new pending updates increasing the chances of processing en-US earlier.
This patch reduces the number of updates that get processed by each run of patch-packager.pl. This will make new en-US pending updates to be processed earlier.

I have done successful runs but the integrity of the file system in staging-nightly-updates is miserable and I was not able to do a perfect testing scneario.

I believe it is straight forward and I can test it more in the morning if more testing is needed.
Attachment #398522 - Flags: review?(nthomas)
Comment on attachment 398522 [details] [diff] [review]
reduce the number of updates processed to a set number

Looks fine to me, although there are some small problems I'll fix on checkin. The test should be '>=' to exit after completing 10 updates, rather than 11; the max pref needs adding to both production and staging configs.
Attachment #398522 - Flags: review?(nthomas) → review+
Checked into the repo. Deployed to the production system, and staging-nightly-updates:~/release.

Also landed the
 $notification_email = 'build-announce@mozilla.org';
which was a local change in production.
Attachment #398616 - Flags: review+
Attachment #398616 - Flags: checked-in+
Nick thanks tons for finishing it off and taking it all the way to the end.

These are the times of the last 4 en-US updates being generated: 
- WINNT 5.2 mozilla-1.9.1 nightly -- 5:35:08 (build end time) -- 5:36:13 (update generated)
- OS X 10.5.2 mozilla-1.9.2 nightly -- 6:22:19 (build end time) -- 6:21:41 (update generated)
- WINNT 5.2 mozilla-central nightly -- 6:45:16 (build end time) -- 6:41:19
- WINNT 5.2 mozilla-1.9.2 nightly -- 7:08:03 (build end time) -- 7:11:18 (update generated)

The times are reported from different machines and the variance should be little.

I declare this re-fixed.
Two users confirmed this as fixed on mozillazine.  Thanks Nick and Armen!
(In reply to comment #28)
> Created an attachment (id=398616) [details]
> [as checked in] reduce the number of updates processed to a set number
> 
The fix created idle times on prometheus-vm and this is tracked in bug 515364
Blocks: 515364
Product: mozilla.org → Release Engineering
You need to log in before you can comment on or make changes to this bug.