Bug 783784 - No Firefox win32 l10n builds for mozilla-central since Aug 15th
Status: RESOLVED FIXED
Product: Release Engineering
Classification: Other
Component: Other
Version: other
Platform: All All
Importance: P2 normal
Assigned To: Chris Cooper [:coop]
Depends on: 785748
 
Reported: 2012-08-18 08:29 PDT by Alexander L. Slovesnik
Modified: 2013-08-12 21:54 PDT


Attachments
Add periodic reboot after l10n jobs, clean up current.work working (5.04 KB, patch)
2012-08-30 14:23 PDT, Chris Cooper [:coop]
bugspam.Callek: review+
Remove .pgc files before creating the partial mar (2.60 KB, patch)
2012-08-31 15:54 PDT, Chris Cooper [:coop]
coop: review+
coop: checked‑in+

Description Alexander L. Slovesnik 2012-08-18 08:29:47 PDT
All win32 l10n builds in http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/latest-mozilla-central-l10n/ were built on Aug 15th.

Build log ( http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla-l10n-ru/1345300590.1345302212.8382.gz ) shows errors like: 

Failure: exceptions.RuntimeError: SIGKILL failed to kill process
Comment 1 Semtex 2012-08-18 10:32:06 PDT
Looks like the backout of bug https://bugzilla.mozilla.org/show_bug.cgi?id=782981 is causing problems with win32 Nightly l10n builds.
Comment 2 Justin Wood (:Callek) 2012-08-18 11:00:08 PDT
IIRC :rail was looking into this yesterday; I don't recall whether there was any outcome.
Comment 3 Alexander L. Slovesnik 2012-08-20 04:56:21 PDT
Looks like it's been fixed. I see win32 Nightly l10n builds from Aug 19th on ftp.
Comment 4 Pavel Cvrcek [:JasnaPaka] 2012-08-20 07:32:45 PDT
Latest cs Win32 build (ZIP).

17.0a1 (2012-08-15)
Built from http://hg.mozilla.org/mozilla-central/rev/86ee4deea55b

It looks like this is still the old version. No new updates are available.
Comment 5 Chris Cooper [:coop] 2012-08-20 07:35:11 PDT
Looks green to me:

http://l10n.mozilla-community.org/~axel/nightlies/
Comment 6 Chris Cooper [:coop] 2012-08-20 10:28:08 PDT
Still seeing many timeouts on m-c. As in comment #0, final output in the logs always seems to be:

Adding file patch and add instructions to file 'update.manifest'
      patch: xul.dll

Timeout is currently set to 1200s.
Comment 7 Chris Cooper [:coop] 2012-08-20 10:47:12 PDT
All the failures are timing out in "make_partial_mar." The only repacks that are succeeding are for locales where we can't find the previous complete mar, so we don't even try to make a partial mar.

e.g.:
http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla-l10n-mn/1345479237.1345483921.3415.gz&fulltext=1
Comment 8 Chris Cooper [:coop] 2012-08-21 10:29:23 PDT
I tried re-running the "make_partial_mar" command on mw32-ix-slave06 after a failed repack to see whether we needed a slightly longer timeout, but the repack hadn't made any visible progress after 1 hour.
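The manual re-run described above amounts to running one command under a longer limit and seeing whether it ever finishes. A minimal Python sketch of that idea follows. The helper name `run_with_timeout` is hypothetical, and it enforces a wall-clock limit, which may differ from how the build harness measures its 1200s timeout.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it ran longer than timeout_s.

    Hypothetical helper for illustration. subprocess.run() kills the child
    process when the timeout expires and then raises TimeoutExpired.
    """
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return None
```

A `None` result after a generous limit (an hour, in the experiment above) suggests the step is stuck rather than merely slow.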
Comment 9 Chris Cooper [:coop] 2012-08-21 11:21:39 PDT
I should note that *all* the mw32 slaves have been rebooted since yesterday, and I've scheduled multiple clobbers. Repacks are still failing.
Comment 10 Chris Cooper [:coop] 2012-08-21 12:45:52 PDT
Hrmm, some of the clobbers failed too, but that doesn't stop the build. 

Worse, manual clobbers of a few slaves (mw32-ix[02,03,06,26]) as Administrator haven't helped either. We're still timing out.
Comment 11 Chris Cooper [:coop] 2012-08-21 14:08:08 PDT
I did a reconfig this afternoon and whatever was blocking these builders seems to have become unblocked. We'll probably need to wait until tomorrow to get a full set of partial updates, but there will be some at least today.

Leaving open until I'm sure this is resolved.
Comment 12 Chris Cooper [:coop] 2012-08-22 09:15:42 PDT
(In reply to Chris Cooper [:coop] from comment #11)
> I did a reconfig this afternoon and whatever was blocking these builders
> seems to have become unblocked. We'll probably need to wait until tomorrow
> to get a full set of partial updates, but there will be some at least today.

Still broken AFAICT. We only managed to generate one partial successfully yesterday (it), and the timeout pattern has resumed this morning.
Comment 13 Semtex 2012-08-23 11:25:58 PDT
Looks like Linux and Mac builds also fail today.
Comment 14 Axel Hecht [:Pike] 2012-08-23 15:10:42 PDT
Yeah, I noticed, I suspect that's something with running configure, and then updating to the code of the previous nightly. I hope those will just settle tomorrow.
Comment 15 Chris Cooper [:coop] 2012-08-23 15:49:52 PDT
(In reply to Axel Hecht [:Pike] from comment #14)
> Yeah, I noticed, I suspect that's something with running configure, and then
> updating to the code of the previous nightly. I hope those will just settle
> tomorrow.

Mac and linux will certainly be resolved by bug 785066. Limited testing in staging indicates that it *might* unblock Windows repacks as well.
Comment 16 Semtex 2012-08-25 05:37:56 PDT
I got an update for my Nightly. I needed to download it manually, but it is working and stable.
Comment 17 Chris Cooper [:coop] 2012-08-25 10:16:37 PDT
Again, leaving this open until Monday to make sure we're green on both m-c and aurora again.
Comment 18 Semtex 2012-08-25 11:30:58 PDT
Still not OK: some builds are present, some are not.
Comment 19 Chris Cooper [:coop] 2012-08-26 06:25:48 PDT
(In reply to semtex2 from comment #18)
> Still not OK, some builds are present some not.

We're almost there. Aurora is back to normal, and about a third of m-c repacks failed yesterday. This could be due to certain repack slaves still needing a clobber: these slaves don't reboot very often, if at all, so they would have trouble clearing a wedged state on their own.
Comment 20 Chris Cooper [:coop] 2012-08-26 07:53:56 PDT
(In reply to Chris Cooper [:coop] from comment #19) 
> We're almost there. Aurora is back to normal, and about a third of m-c
> repacks failed yesterday. This could be due to certain repack slaves still
> needing a clobber: these slaves don't reboot very often, if at all, so they
> would have trouble clearing a wedged state on their own.

We've been so long without nightlies that the repacks yesterday didn't even try to generate partial patches; that's why they were green.
Comment 21 Robert Kaiser 2012-08-26 09:09:28 PDT
So, if I get that right, we fail either in creating the patch (binary diff) for xul.dll - or the one after it (does make_partial_mar print the name of the file before or after working on it?).
Can we try to re-enact this somewhere and maybe create more debug output from that script, so we can get to the bottom of what exactly is failing or timing out there?

It feels to me like we changed xul.dll in some way that it either grew too large (so that make_partial_mar runs too long without output) or has something in it that the binary diff tools don't like.
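One way to get the extra debug output asked for here is to log each file name both before and after its diff, with timing, so a hang points at a specific file. The sketch below is only an illustration in Python. `diff_with_progress` and `diff_fn` are hypothetical names, and `diff_fn` stands in for whatever per-file binary diff step the real script performs.

```python
import sys
import time


def diff_with_progress(names, diff_fn, out=sys.stdout):
    """Call diff_fn(name) for each file, logging before and after each call.

    diff_fn is a hypothetical stand-in for the per-file binary diff step.
    Returns a dict mapping each name to the seconds its diff took.
    """
    timings = {}
    for name in names:
        out.write("start: %s\n" % name)
        out.flush()  # flush so the line is visible even if the diff then hangs
        started = time.monotonic()
        diff_fn(name)
        timings[name] = time.monotonic() - started
        out.write("done:  %s (%.1fs)\n" % (name, timings[name]))
        out.flush()
    return timings
```

With output like this, a log that ends on a `start:` line with no matching `done:` line identifies the file being processed when the timeout hit.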
Comment 22 Alex Keybl [:akeybl] 2012-08-26 14:58:47 PDT
It's suspected that this may cause l10n repacks to fail once 17 merges to mozilla-aurora. Given that, nominating for tracking so that we check in again early in the week.

It'd be great to get a locally reproducible case for devs to look at, and a sample of the process that's timing out.
Comment 23 Nick Thomas [:nthomas] 2012-08-26 15:42:40 PDT
coop, if you take a failing working dir for m-c l10n, and copy in the mbsdiff executable from aurora (working, right?), does that resolve the issue diffing xul.dll? 

If so I'd suggest bug 579517 is causing funkiness on windows
* the Makefile in other-licenses/bsdiff/ includes toolkit/mozapps/update/updater
* http://hg.mozilla.org/mozilla-central/rev/88e47f6905e9 landed on Aug 8 (but maybe didn't show up until a clobber took effect?)
Comment 24 Nick Thomas [:nthomas] 2012-08-26 19:08:43 PDT
KaiRo pointed out that this theory should affect en-US partials too, which I countered with 'but our minimized l10n build setup could be quite a different environment'. Callek also pointed out that we might be using older MSVC.
Comment 25 Alex Keybl [:akeybl] 2012-08-29 16:06:38 PDT
I see builds in https://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/latest-mozilla-aurora-l10n/ with differing timestamps. Does that mean the issue is now on Aurora as well?
Comment 26 Chris Cooper [:coop] 2012-08-29 16:45:17 PDT
Yes, I'm seeing lots of purple for win32 l10n aurora repacks now.

Sorry I haven't had a lot of time to spend on this so far this week. Between the release and pymake, it's been busy.

As I was debugging this, I noticed that the process creates a working dir called current.work/. I amended the rm_unpack_dirs step to also remove this working dir, and saw the following on future attempts:

rm: cannot remove `current.work/xul!1.pgc.patch': Permission denied

This error persists until the slave is rebooted, so I've also added a maybe_reboot step for l10n that reboots the slave after 5 jobs, just like everything else.

The slave can remove the current.work/ dir after the reboot, but the first repack just gets it wedged in the same way again. :/
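The two additions described in this comment (cleaning up the working dir, and rebooting after a fixed number of jobs) can be sketched roughly as below. This is not the attached patch. The function names are hypothetical, and only the threshold of 5 jobs and the `current.work/` dir come from the comment.

```python
import shutil


def should_reboot(jobs_since_boot, max_jobs=5):
    """Return True once the slave has run max_jobs jobs since its last boot."""
    return jobs_since_boot >= max_jobs


def remove_work_dir(path):
    """Try to remove a working directory; return False if removal fails.

    Hypothetical helper. A locked file (as with the .pgc.patch file above)
    surfaces as an OSError such as PermissionError.
    """
    try:
        shutil.rmtree(path)
    except FileNotFoundError:
        return True  # nothing to clean up
    except OSError as exc:
        print("cannot remove %s: %s" % (path, exc))
        return False
    return True
```

Returning a flag rather than raising lets a caller decide whether a failed cleanup should trigger an early reboot instead of starting another job on a wedged slave.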
Comment 27 Robert Kaiser 2012-08-30 04:23:45 PDT
Actually, is it in the end "just" a problem with the ! in that filename?
Comment 28 Chris Cooper [:coop] 2012-08-30 06:58:38 PDT
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #27)
> Actually, is it in the end "just" a problem with the ! in that filename?

I can test that out, but as Ted indicates in https://bugzilla.mozilla.org/show_bug.cgi?id=785748#c2, the file shouldn't be there in the first place.
Comment 29 Chris Cooper [:coop] 2012-08-30 14:23:28 PDT
Created attachment 657034
Add periodic reboot after l10n jobs, clean up current.work working

This should help prevent one errant slave from burning many l10n jobs (as) quickly.
Comment 30 Chris Cooper [:coop] 2012-08-30 14:26:06 PDT
Not sure that bug 785748 really blocks here, given that we actually generated l10n win32 partials on m-c last night. Still busted on aurora though.
Comment 31 Aki Sasaki [:aki] back dec19 2012-08-30 16:33:58 PDT
This is now in production.
Comment 32 Chris Cooper [:coop] 2012-08-31 11:43:17 PDT
(In reply to Chris Cooper [:coop] from comment #28)
> (In reply to Robert Kaiser (:kairo@mozilla.com) from comment #27)
> > Actually, is it in the end "just" a problem with the ! in that filename?
> 
> I can test that out, but as Ted indicates in
> https://bugzilla.mozilla.org/show_bug.cgi?id=785748#c2, the file shouldn't
> be there in the first place.

I've tried this now without success. Partial patch generation still stalls on the .pgc file regardless of whether there are special chars in the filename.

I also tried subbing in working copies of mar and mbsdiff from m-c to aurora, but that didn't help. We still fail on the .pgc file.

Given those results, I went and looked at the unpacked directories for the complete mars on m-c since we're getting partial mars there again. The xul!1.pgc files are absent from those complete mars now, despite a lack of visible progress on bug 785748.

Did something land on m-c *after* the merge to aurora that would have fixed this on m-c? I'd like some help trying to track this down.
Comment 33 Chris Cooper [:coop] 2012-08-31 15:54:33 PDT
Created attachment 657472
Remove .pgc files before creating the partial mar

This gets updates unblocked on aurora until we figure out why the .pgc files are being packaged in the first place.
Comment 34 Chris Cooper [:coop] 2012-08-31 18:58:43 PDT
Comment on attachment 657472
Remove .pgc files before creating the partial mar

Got review from Aki on IRC after:
* escaping the wildcard
* adding the -print so we can see what gets deleted
* testing against Mac and Linux as well (we passed)

https://hg.mozilla.org/build/buildbotcustom/rev/93832dff28f6

This is in production now.
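The workaround itself lives in the buildbotcustom revision linked above and is not reproduced here. As a rough illustration of the same idea in Python (walk the unpacked directories, print each .pgc path so the log shows what was removed, then delete it), assuming a hypothetical helper named `remove_pgc_files`:

```python
import os


def remove_pgc_files(root):
    """Delete every *.pgc file under root and return the removed paths.

    Hypothetical helper for illustration only. Each path is printed before
    deletion so the build log records exactly what was removed.
    """
    removed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".pgc"):
                path = os.path.join(dirpath, name)
                print(path)
                os.remove(path)
                removed.append(path)
    return removed
```

Because the match is on the file name suffix alone, this is a no-op on trees that contain no .pgc files, which would be consistent with the Mac and Linux tests passing.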
Comment 35 Alex Keybl [:akeybl] 2012-09-05 16:05:27 PDT
Just checked https://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/latest-mozilla-aurora-l10n/. The bug looks less pronounced, but these builds are from 9/4:

MX
fa
ku
si
sk
sl
sr

while the rest are from 9/5. Can we call that fixed?
Comment 36 Semtex 2012-09-09 10:55:21 PDT
I don't know if the current situation is related to this bug, but for the past 2 days random Linux and Win32 builds have been failing again. In any case, most of them are missing...
Comment 37 Nick Thomas [:nthomas] 2012-09-09 14:48:11 PDT
It appears to be two unrelated problems. Filed
* win32 - bug 789838 - branding.nsi missing in win32 l10n builds
* linux32 - bug 789837 - fatal error: opening dependency file .deps/elf-dynstr-gc.pp in linux32 l10n builds
