Closed Bug 522804 Opened 10 years ago Closed 10 years ago

ZipFile exception.WindowsError build problem

Categories

(Firefox Build System :: General, defect)

x86
Windows XP
defect
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ddahl, Assigned: ted)

Details

Attachments

(2 files)

Looks like a python build issue? 

Exception exceptions.WindowsError: (32, 'The process cannot access the file because it is being used by another process', '../../dist/bin/chrome\\toolkit.jar.lck') in <bound method LockFile.__del__ of <utils.LockFile object at 0x00B24E50>> ignored
Exception exceptions.AttributeError: "ZipFile instance has no attribute '_remove'" in <bound method ZipFile.__del__ of <MozZipFile.ZipFile instance at 0x00B2DC88>> ignored
Exception exceptions.AttributeError: "ZipFile instance has no attribute '_remove'" in <bound method ZipFile.__del__ of <MozZipFile.ZipFile instance at 0x00B2EC88>> ignored
make[6]: *** [libs] Error 1
make[6]: *** [libs] Error 1
make[5]: *** [generic_libs] Error 2
make[5]: *** Waiting for unfinished jobs....
make[5]: *** [mathml_libs] Error 2
make[4]: *** [libs_tier_gecko] Error 2
make[3]: *** [tier_gecko] Error 2
make[2]: *** [default] Error 2
make[1]: *** [build] Error 2
make: *** [profiledbuild] Error 2
Whiteboard: [orange]
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1255733733.1255734014.28630.gz

WINNT 5.2 mozilla-central build [busted] Started 15:55, finished 16:01
ddahl, can you post the link to the error log for comment 0?

Interestingly, the error message in comment 1 was slightly different from the one in Comment 0. In particular, Comment 0's error mentions "exceptions.WindowsError" and a file being "used by another process", but neither of those are in the log for comment 1.

Here's the error for comment 1:
../../../dist/bin/chrome\toolkit.jar.lck has been locked for more than 600 seconds (PID 2348
)
Exception exceptions.AttributeError: "ZipFile instance has no attribute '_remove'" in <bound method ZipFile.__del__ of <MozZipFile.ZipFile instance at 0x00B2FBE8>> ignored
make[7]: *** [libs] Error 1
make[6]: *** [libs] Error 2
make[5]: Leaving directory `/e/builds/moz2_slave/mozilla-central-win32/build/obj-firefox/intl'
make[4]: Leaving directory `/e/builds/moz2_slave/mozilla-central-win32/build/obj-firefox'
make[3]: Leaving directory `/e/builds/moz2_slave/mozilla-central-win32/build/obj-firefox'
make[2]: Leaving directory `/e/builds/moz2_slave/mozilla-central-win32/build/obj-firefox'
make[1]: Leaving directory `/e/builds/moz2_slave/mozilla-central-win32/build'
make[5]: *** [libs] Error 2
make[4]: *** [libs_tier_necko] Error 2
make[3]: *** [tier_necko] Error 2
make[2]: *** [default] Error 2
make[1]: *** [build] Error 2
make: *** [profiledbuild] Error 2
program finished with exit code 2
Pike wrote the JarMaker...
I guess the lock itself was some slave problem or so, and should have busted the build, IMHO.

The MozZipFile didn't deal with it too gracefully, hiding the original error. I'll check if that can be done better.
I'm having problems to reproduce this locally, if someone can create a test, that'd help a ton.
Here's a full log:

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1258497842.1258509206.14528.gz&fulltext=1

WINNT 5.2 mozilla-1.9.2 build [busted] Started 15:48, finished 16:27
This is clearly not blocking bug randoorange, this is an intermittent build failure.
No longer blocks: 438871
Hrm... Should this have [orange] status in whiteboard, then?  IIUC, the [orange] whiteboard-status is supposed to exactly match the "bug 438871 blocking status".

Support for this is on http://tinderbox.mozilla.org/Firefox/ :
 "Intermittent failures are tracked with a metabug and the [orange] list. "
and also on the last comment of the metabug, bug 438871 comment 22. ("Adding all the bugs which have [orange] in their whiteboard but do not block
this bug.")

I'd argue that exactly one of the following is true:
   (A) intermittent build failures can be considered "test failures" for the purposes of the metabug and for [orange], to make things easier on sheriffs
... OR....
   (B) we need a distinct whiteboard status (and/or metabug) for intermittent build failures.

I'd been assuming that (A) was the case, but maybe (B) is instead the case?
No, it should not have [orange] on the whiteboard. Removed.
Whiteboard: [orange]
I don't think that the ZipFile exception is the real problem here. It seems that the locks just go bad for a variety of different reasons, and the JarMaker aborts the build on that. Which on the way down raises the zipfile thing, but the real problem is that the JarMaker can't work right on the locks on windows.

I tried to setup a windows build to test this, but failed again, and I don't have the hours right now to fix that. If someone can step up and add some windows help, that'd be great.
I started hitting this pretty consistently on my Windows machine, probably because it was the first time I tried to do a full -j4 build since that parallel dirs patch landed. The problem appears to be that os.open can fail with EACCES as well as EEXIST on Windows, and we're not handling the former in lockFile. With this patch I don't hit this error at all.
Attachment #413665 - Flags: review?(l10n)
Comment on attachment 413665 [details] [diff] [review]
handle more exception types

Cool, thanks for finding that.
Attachment #413665 - Flags: review?(l10n) → review+
Since I haven't been able to land this patch yet, I'll just throw another patch on it. This fixes two more race conditions that I hit:
1) In LockFile.__del__, on Windows you can hit EACCES in os.remove if another process has the lockfile open. This adds a loop around the remove with a check for EACCES.
2) In the open for reading/stat, we can hit EACCES on Windows, so fix this like I did for the first open.

(also I fixed a silly printing nit where we didn't strip the trailing newline from the PID that was read from the file.)
Assignee: nobody → ted.mielczarek
Status: NEW → ASSIGNED
Attachment #414232 - Flags: review?(l10n)
I can build on Windows -j4 reliably with these patches without hitting any JarMaker errors.
Comment on attachment 414232 [details] [diff] [review]
fix two more race conditions

Let's do a quick time.sleep in the __del__ loop, I don't mind if 10 or 100 ms, as long as it's drastically faster than the 1 sec for the retry of locking the file. Given that there are process switches and disk io, 10 might be too quick?

r=me with that.
Attachment #414232 - Flags: review?(l10n) → review+
Component: General → Build Config
QA Contact: general → build-config
I pushed these both to m-c despite the freeze because they fix intermittent build bustage:
http://hg.mozilla.org/mozilla-central/rev/b85c66115cd2
http://hg.mozilla.org/mozilla-central/rev/941ad9d7d079
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: Core → Firefox Build System
You need to log in before you can comment on or make changes to this bug.