Closed Bug 475276 Opened 16 years ago Closed 15 years ago

Some of l10n mozilla-central directory not removed on clobber, breaks subsequent builds

Categories

(Release Engineering :: General, defect, P2)

x86
Windows Server 2003
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ehsan.akhgari, Assigned: coop)

References

Details

Attachments

(3 files, 4 obsolete files)

Looks like we had a glitch with moz2-win32-slave11's mozilla-central checkout, which caused this problem for a bunch of locales. Hard to say what happened because the cleanup script has since removed the working directory, but we were getting this on hg update of m-c
  abort: repository default not found!
I've seen that when the dir for mozilla-central is empty (unknown cause; CC'ing catlee in case the cleanup script has a glitch). I was touching all these machines last night and may have screwed this one up in the process; apologies if so.

Should come right on the next scheduled build (1900 PST), please reopen if it happens again.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → WORKSFORME
The next set of win32 mozilla-central l10n builds will clobber their working dirs before building. There are at least a couple of slaves with bogus mozilla-central checkouts (01 and 09), mostly empty but a few files and the .hg dir remaining. Please reopen if that doesn't help, or the problem comes back.
Status: REOPENED → RESOLVED
Closed: 16 years ago → 15 years ago
Resolution: --- → FIXED
Those four are all on moz2-win32-slave08, but moz2-win32-slave16 is also broken. I've fixed both slaves up, and will leave this open to figure out the cause.

We're ending up with /build/moz2-slave/mozilla-central-win32-l10n-nightly/build/mozilla-central containing (only) these three files:
./.hg/store/data/toolkit/crashreporter/google-breakpad/src/processor/testdata/symbols/kernel32.pdb/_b_c_e8785_c57_b44245_a669896_b6_a19_b9542/kernel32.sym.i
./.hg/store/data/toolkit/crashreporter/google-breakpad/src/processor/testdata/symbols/test__app.pdb/5_a9832_e5287241_c1838_e_d98914_e9_b7_f_f1/test__app.sym.d
./.hg/store/data/toolkit/crashreporter/google-breakpad/src/processor/testdata/symbols/test__app.pdb/5_a9832_e5287241_c1838_e_d98914_e9_b7_f_f1/test__app.sym.i
Mode is 644, sensible file ownership. Something must be failing to clobber properly but it'll take a bit of digging to find out what.
Summary: Build failure on WINNT 5.2 mozilla-central l10n nightly (fa) → Some of l10n mozilla-central directory not removed on clobber, breaks subsequent builds
Might be a consequence of the ignoreErrors here
  http://hg.mozilla.org/build/tools/file/27453eb43283/buildfarm/maintenance/purge_builds.py#l50

catlee, do you recall why you added that option when writing this script in bug 464103?
I know we're investigating if there's some machine weirdness going on. 

But separate to that, I'm curious: this bug only talks about problems with the "fa" locale. If this was all machine-specific, shouldn't we see problems with other locales also processed on this same machine?
(In reply to comment #7)
Did you see the original summary of this bug? It would have affected other locales if they were given to broken slaves, but it seems unlikely Ehsan will be watching more than his own locale.
(In reply to comment #8)
> (In reply to comment #7)
> Did you see the original summary of this bug? It would have affected other
> locales if they were given to broken slaves, but it seems unlikely Ehsan will
> be watching more than his own locale.

That's right, I've never watched any other l10n tinderbox.

Comparing <http://tinderbox.mozilla.org/showbuilds.cgi?tree=Mozilla-l10n-fa&maxdate=1237066261&legend=0&norules=1> and <http://tinderbox.mozilla.org/showbuilds.cgi?tree=Mozilla-l10n&maxdate=1237066261&legend=0&norules=1>, I see that there are a lot of similar failures on the same column, and I'm sure that not all of them have been with fa, but I couldn't tell which other locales have been affected, as the build logs don't seem to show any problem.
(In reply to comment #8)
> (In reply to comment #7)
> Did you see the original summary of this bug?
Yep, I did. The original summary also talked about "fa" only.


> It would have affected other locales
> if they were given to broken slaves, but it seems unlikely Ehsan will be
> watching more than his own locale.
Yes, and I saw your comments about the clobber step not doing what's expected.
I understand that we expect this to be breaking other locales, but if this has
been happening since 25 Jan 2009, it strikes me as odd that no other locales
have noticed/reported it yet. Hence my question and cc'ing Axel.
Trying other locales randomly, I saw these two failures with nl:

<http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla-l10n-nl/1237052911.1237052955.32256.gz>
<http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla-l10n-nl/1237041526.1237041621.8058.gz>

This should be enough to show that it's not locale specific.
(BTW, I checked three or four locales)
(In reply to comment #11)
> Trying other locales randomly, I saw these two failures with nl:
> 
> <http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla-l10n-nl/1237052911.1237052955.32256.gz>
> <http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla-l10n-nl/1237041526.1237041621.8058.gz>
> 
> This should be enough to show that it's not locale specific.

Perfect, thanks Ehsan, exactly the data I was looking for.
Assignee: nobody → ccooper
catlee: any thoughts on comment 6?
Status: REOPENED → ASSIGNED
OS: Windows Vista → Windows Server 2003
Priority: -- → P2
(In reply to comment #14)
> catlee: any thoughts on comment 6?

I ignored errors because it seemed better to delete as many files as possible, instead of failing on the first error and possibly turning the build orange/red.

Perhaps purge_builds.py should list any files/directories it failed to delete, and then turn the build orange so that we can investigate?
I did some digging.

shutil.rmtree will apparently fail on Windows if it hits any files that are read-only. The Subversion guys actually ran into this, so one of their cleanup scripts sets the entire tree to r/w before running shutil.rmtree.

ref: http://svn.collab.net/repos/svn/trunk/tools/backup/hot-backup.py.in
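
For reference, a minimal sketch of that workaround, assuming an illustrative path and helper name (neither comes from the actual script):

  import os
  import shutil
  import stat

  def make_tree_writable(root):
      # Clear the read-only bit on every file so shutil.rmtree can delete
      # them on Windows -- roughly the trick the Subversion hot-backup
      # script uses before removing old backups.
      for dirpath, dirnames, filenames in os.walk(root):
          for name in filenames:
              os.chmod(os.path.join(dirpath, name), stat.S_IWRITE)

  # Illustrative path only.
  build_dir = r"e:\builds\moz2_slave\mozilla-central-win32-l10n-nightly\build"
  make_tree_writable(build_dir)
  shutil.rmtree(build_dir)

An alternative is to pass an onerror callback to shutil.rmtree that chmods the failing path and retries, which avoids walking the whole tree up front.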
Attachment #367646 - Flags: review?(catlee)
Our import story is sketchy on Windows, so it's easiest to just add the required function to the script.
Attachment #367646 - Attachment is obsolete: true
Attachment #367649 - Flags: review?(catlee)
Attachment #367646 - Flags: review?(catlee)
Attachment #367649 - Flags: review?(catlee) → review+
Comment on attachment 367649 [details] [diff] [review]
Use rmdirRecursive instead of shutil.rmtree, v2

changeset:   243:0f24104fe0a7
Attachment #367649 - Flags: checked‑in+
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Just caught a Windows box trying to delete its current build dir:

C:\WINDOWS\system32\cmd.exe /c python tools/buildfarm/maintenance/purge_builds.py -s 7 -n info -n repo_setup -n tag -n source -n updates -n final_verification -n l10n_verification -n macosx_update_verify -n macosx_build -n macosx_repack -n win32_update_verify -n win32_build -n win32_repack -n linux_update_verify -n linux_build -n linux_repack ..
 in dir e:\builds\moz2_slave\mozilla-central-win32\. (timeout 3600 secs)
...
Deleting ..\mozilla-central-win32
Traceback (most recent call last):
  File "tools/buildfarm/maintenance/purge_builds.py", line 111, in <module>
  File "tools/buildfarm/maintenance/purge_builds.py", line 82, in purge
  File "tools/buildfarm/maintenance/purge_builds.py", line 65, in rmdirRecursive
WindowsError: [Error 13] The process cannot access the file because it is being used by another process: '..\\mozilla-central-win32'
program finished with exit code 1

http://production-master.build.mozilla.org:8010/builders/WINNT%205.2%20mozilla-central%20build/builds/7779/steps/shell_5/logs/stdio

Went on to blow up the build by running out of space. Perhaps we should be appending -n <current build> to this.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
And this on linux:

python tools/buildfarm/maintenance/purge_builds.py -s 5 -n info -n repo_setup -n tag -n source -n updates -n final_verification -n l10n_verification -n macosx_update_verify -n macosx_build -n macosx_repack -n win32_update_verify -n win32_build -n win32_repack -n linux_update_verify -n linux_build -n linux_repack ..
 in dir /builds/moz2_slave/mozilla-1.9.1-linux/. (timeout 3600 secs)
....
Deleting ../tracemonkey-linux
Traceback (most recent call last):
  File "tools/buildfarm/maintenance/purge_builds.py", line 111, in <module>
    purge(args[0], options.size, options.skip, options.dry_run)
  File "tools/buildfarm/maintenance/purge_builds.py", line 82, in purge
    rmdirRecursive(d)
  File "tools/buildfarm/maintenance/purge_builds.py", line 61, in rmdirRecursive
    rmdirRecursive(full_name)
  File "tools/buildfarm/maintenance/purge_builds.py", line 61, in rmdirRecursive
    rmdirRecursive(full_name)
  File "tools/buildfarm/maintenance/purge_builds.py", line 61, in rmdirRecursive
    rmdirRecursive(full_name)
  File "tools/buildfarm/maintenance/purge_builds.py", line 61, in rmdirRecursive
    rmdirRecursive(full_name)
  File "tools/buildfarm/maintenance/purge_builds.py", line 63, in rmdirRecursive
    os.chmod(full_name, 0700)
OSError: [Errno 2] No such file or directory: '../tracemonkey-linux/build/configs/mozilla2/linux/mozilla-1.9.1'
program finished with exit code 1

Probably related to this symlink mozilla-1.9.1@ -> mozilla-central
Attachment #367649 - Flags: checked‑in+ → checked‑in-
Comment on attachment 367649 [details] [diff] [review]
Use rmdirRecursive instead of shutil.rmtree, v2

Backed this out for bustage in previous comments.
changeset:   244:d8b5cee9aff2
(In reply to comment #22)
> Probably related to this symlink mozilla-1.9.1@ -> mozilla-central

This we can fix by checking the link before we chmod, e.g.: "if not os.path.lexists(full_name)"

(In reply to comment #21)
 
> Went on to blow up the build by running out of space. Perhaps we should be
> appending -n <current build> to this.

Yeah, we were ignoring errors so this never bit us previously.

I'll whip up a new patch shortly.
This fixes up how we handle symlinks so that we don't skip over them at the outset and don't try to chmod them later on.

Need a separate buildbotcustom patch to avoid trying to delete the current working dir.
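
For illustration only, a rough sketch of a symlink-aware rmdirRecursive; this is not the attached patch, and details may differ:

  import os
  import stat

  def rmdirRecursive(directory):
      # Remove `directory` and everything under it. Symlinks are deleted
      # directly instead of being followed or chmod'd, so a dangling link
      # like mozilla-1.9.1 -> mozilla-central can't trip us up, and the
      # read-only bit is cleared so os.remove works on Windows.
      if os.path.islink(directory):
          os.remove(directory)
          return
      for name in os.listdir(directory):
          full_name = os.path.join(directory, name)
          if os.path.islink(full_name):
              os.remove(full_name)
          elif os.path.isdir(full_name):
              rmdirRecursive(full_name)
          else:
              os.chmod(full_name, stat.S_IWRITE)
              os.remove(full_name)
      os.rmdir(directory)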
Attachment #367649 - Attachment is obsolete: true
Attachment #367707 - Flags: review?(nthomas)
Status: REOPENED → ASSIGNED
I've tested the basename command on linux/mac/win32, and ran the build step on staging-master2 (although only against mac) to verify it works correctly.
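
Roughly, the step just needs to know its own directory name and add it to the exclusions; a hypothetical sketch of that idea (command and names are illustrative, not the actual buildbotcustom change):

  import os

  # Exclude the directory the current build is running in, so the purge
  # step can never delete it out from under us.
  builddir = os.path.basename(os.path.realpath(os.getcwd()))
  purge_cmd = ["python", "tools/buildfarm/maintenance/purge_builds.py",
               "-s", "7", "-n", "info", "-n", builddir, ".."]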
Attachment #367780 - Flags: review?(catlee)
Comment on attachment 367707 [details] [diff] [review]
Use rmdirRecursive instead of shutil.rmtree, v3

Switching reviewer...no offense intended to nthomas!
Attachment #367707 - Flags: review?(nthomas) → review?(catlee)
Attachment #367780 - Flags: review?(catlee) → review+
Comment on attachment 367780 [details] [diff] [review]
Set the builddir property, and then add it to the list of dirs to ignore.

changeset:   224:c14c4daed3bc
Attachment #367780 - Flags: checked‑in+
Attachment #367707 - Flags: review?(catlee) → review+
Comment on attachment 367707 [details] [diff] [review]
Use rmdirRecursive instead of shutil.rmtree, v3

can we test this on staging first?
(In reply to comment #29)
> (From update of attachment 367707 [details] [diff] [review])
> can we test this on staging first?

Sure thing. I'll get it running there this aft/tonight.
Attachment #367707 - Flags: checked‑in+
Comment on attachment 367707 [details] [diff] [review]
Use rmdirRecursive instead of shutil.rmtree, v3

changeset:   245:282a97d732d5
clobberer.py suffers from the same problem. See bug 483943 for details there.

The solution sounds like switching clobberer.py to use rmdirRecursive as well.
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Still having problems with some of the files mentioned in comment #5. I speculated that it might be a Windows path length problem, but it only seems to add up to 230 characters, which is below the 255 limit.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
See bug 396187 for a 2^8 length limit we've hit in the past, although we're getting a different WindowsError than what Rob Helmer mentioned in bug 396187 comment #18 ('The system cannot find the path specified' here vs 'the filename or extension is too long').
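
To sanity-check the path-length theory, something along these lines would report the longest full path under the offending checkout (the root here is only an example):

  import os

  root = r"e:\builds\moz2_slave\mozilla-central-win32-l10n-nightly\build\mozilla-central"
  longest = root
  for dirpath, dirnames, filenames in os.walk(root):
      for name in filenames:
          full = os.path.join(dirpath, name)
          if len(full) > len(longest):
              longest = full
  print len(longest), longest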
Depends on: 484670
Something else is also going on, eg from the m-1.9.1 nightly on moz2-win32-slave06 today:

1. On the nightly build it failed with
Deleting ..\mozilla-1.9.1-win32-l10n-nightly
...
  File "tools/buildfarm/maintenance/purge_builds.py", line 61, in rmdirRecursive
    os.chmod(full_name, 0600)
WindowsError: [Error 2] The system cannot find the path specified: '..\\mozilla-1.9.1-win32-l10n-nightly\\build\\mozilla-1.9.1\\.hg\\store\\data\\toolkit\\crashreporter\\google-breakpad\\src\\processor\\testdata\\symbols\\test__app.pdb\\5_a9832_e5287241_c1838_e_d98914_e9_b7_f_f1\\test__app.sym.d'

2. Then it did a bunch of l10n builds; the first didn't need to clean up any space, so I'm assuming the later ones didn't either.

3. Then we get a mozilla-1.9.1 dep build,
Deleting ..\mozilla-1.9.1-win32-l10n-nightly
Deleting ..\mozilla-central-win32-unittest
  File "tools/buildfarm/maintenance/purge_builds.py", line 70, in rmdirRecursive
    os.rmdir(dir)
WindowsError: [Error 41] The directory is not empty: '..\\mozilla-central-win32-unittest\\build\\modules\\freetype2\\src\\tools'

Two things of note here:
  a) it's removed the l10n dir it failed on last time, so perhaps one of the l10n builds fixed it somehow
  b) there's a new type of failure removing the unittest dir. This slave last did that type of build at Fri Mar 20 17:28:37 2009 (build 2150) and hit errors in most of the test suites. There are no processes hanging around at this point, several hours after the dir removal was attempted.

The next non-l10n build's removal step had
Deleting ..\mozilla-central-win32-unittest
9.26 GB of space available
Comment on attachment 367780 [details] [diff] [review]
Set the builddir property, and then add it to the list of dirs to ignore.

Given the continuing problems I'm going to back this out again. Occasionally broken l10n is a better situation than a much wider range of builds running out of space. Hopefully we can get this working reliably in staging.
Attachment #367780 - Flags: checked‑in+ → checked‑in-
Comment on attachment 367780 [details] [diff] [review]
Set the builddir property, and then add it to the list of dirs to ignore.

Sorry, munged the wrong attachment.
Attachment #367780 - Flags: checked‑in- → checked‑in+
Attachment #367707 - Flags: checked‑in+ → checked‑in-
Status: REOPENED → ASSIGNED
I'll be playing around with a version of purge_builds.py in my user repository (http://hg.mozilla.org/users/coop_mozilla.com/build-tools) in staging this week. 

Ideally, I'd like to catch a Windows slave in the act so I can poke at those .d files and find out why they think they're so special.
Looks like coop set some clobbers to resolve this problem.
clobberer.py has the same problem (bug 483943).

joduinn and I talked today about moving (renaming) the directory prior to a clobber. The problem we're seeing now is that undeleted files hold the whole checkout dir open, so our buildsteps try to update a mostly empty dir rather than clone a new copy. Moving the old dir out of the way prior to attempting to delete anything should solve that, even if part of the delete fails.

One possible problem I foresee is if a clobber attempts to move a dir when the previous clobber failed so the renamed dir already exists. Do we need a system process looking for failed clobber dirs to clean out? We can always increment the to-be-moved dirname to avoid collisions, but I wouldn't want that to get out of control, i.e. end up with many mostly-empty checkout dirs.
Running with a new patch in staging right now that moves top-level files/directories out of the way prior to deleting them, with appropriate checks to make sure we aren't piling up undeleted files/directories.
purge_builds.py:
- re-add rmdirRecursive()
- wrap actual purging in a try block

Both:
- rename top-level files/dirs prior to deleting them
- check for existing renamed files/dirs prior to deletion and delete them first

Has run in staging for a few days now without incident.
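
For context, a minimal sketch of the move-aside-then-delete idea for directories; this isn't the attached patch (which keeps rmdirRecursive), and the suffix and helper name are made up:

  import os
  import shutil

  CLOBBER_SUFFIX = ".deleteme"  # hypothetical suffix

  def safe_delete(path):
      # Rename `path` out of the way first, then delete the renamed copy.
      # Even if the delete partially fails, the original name is free again,
      # so the next build clones a fresh checkout instead of updating a
      # mostly-empty directory.
      target = path + CLOBBER_SUFFIX
      if os.path.exists(target):
          # Leftover from an earlier failed clobber: try to finish deleting
          # it, and fall back to a numbered name if it still won't go away.
          shutil.rmtree(target, ignore_errors=True)
      n = 1
      while os.path.exists(target):
          target = "%s.%d" % (path + CLOBBER_SUFFIX, n)
          n += 1
      os.rename(path, target)
      shutil.rmtree(target, ignore_errors=True)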
Attachment #367707 - Attachment is obsolete: true
Attachment #382760 - Flags: review?(catlee)
Same as previous patch, but moves the try block for purge_builds.py inside the loop, and prints any exceptions to stdout so that we still have a fighting chance of freeing the needed space.
Attachment #382760 - Attachment is obsolete: true
Attachment #382789 - Flags: review?(catlee)
Attachment #382760 - Flags: review?(catlee)
(In reply to comment #46)
> prints any exceptions to stdout

I of course meant stderr.
(In reply to comment #46)
> Created an attachment (id=382789) [details]
> Move dirs aside before deleting them, v2
> 
> Same as previous patch, but moves the try block for purge_builds.py inside the
> loop, and prints any exceptions to stdout so that we still have a fighting
> chance of freeing the needed space.

The updated patch ran successfully in staging last night. No failures in the logs (I checked Linux and Mac too), and there were no renamed dirs left lying around on the Windows slaves.
Attachment #382789 - Flags: review?(catlee) → review+
Comment on attachment 382789 [details] [diff] [review]
Move dirs aside before deleting them, v2

changeset:   296:e145709d107f
Attachment #382789 - Flags: checked‑in+
I'll monitor this over the weekend in case we see any issues (like before) in wider testing.
This is busted, causing clobberer to fail:

Removing mail/                                                                            
Couldn't clobber properly, bailing out.

Manually added a raise statement to the offending bit of code to see what was going on:

Traceback (most recent call last):                                                               
  File "../../tools/clobberer/clobberer.py", line 171, in <module>
    do_clobber(options.dryrun, options.skip)
  File "../../tools/clobberer/clobberer.py", line 90, in do_clobber
    if d.endswith(clobber_suffix):
NameError: global name 'd' is not defined
Simple fix of an incorrectly named variable.
Attachment #383310 - Flags: review?(catlee)
Comment on attachment 383310 [details] [diff] [review]
[checked in] Fix variable name typo in v2 patch

Ugh, cut-n-paste fail.
Attachment #383310 - Flags: review?(catlee) → review+
Comment on attachment 383310 [details] [diff] [review]
[checked in] Fix variable name typo in v2 patch

changeset:   298:01f01389bbf0
Attachment #383310 - Attachment description: Fix variable name typo in v2 patch → [checked in] Fix variable name typo in v2 patch
I forced a round of clobbers on m-c and both scripts seem to be working now.
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering