Closed Bug 563151 Opened 10 years ago Closed 10 years ago

[Debug Windows SeaMonkey] "nsinstall.exe: Bad file number" regression

Categories

(Core :: DOM: Core & HTML, defect, blocker)

x86 Windows Server 2003
Not set

Tracking


VERIFIED FIXED
mozilla1.9.3a5

People

(Reporter: sgautherie, Assigned: kairo)

References

Details

(Keywords: regression)

Attachments

(1 file)

First failing build:
http://tinderbox.mozilla.org/showlog.cgi?log=SeaMonkey/1272636421.1272645355.5375.gz
WINNT 5.2 comm-central-trunk leak test build on 2010/04/30 07:07:01
{
make[7]: Leaving directory `/e/builds/slave/comm-central-trunk-win32-debug/build/objdir/mozilla/content/base/test'
/bin/sh: e:/builds/slave/comm-central-trunk-win32-debug/build/objdir/mozilla/config/nsinstall.exe: Bad file number
NEXT ERROR make[7]: *** [libs] Error 126
make[6]: Leaving directory `/e/builds/slave/comm-central-trunk-win32-debug/build/objdir/mozilla/content/base'
make[5]: Leaving directory `/e/builds/slave/comm-central-trunk-win32-debug/build/objdir/mozilla/content'
make[4]: Leaving directory `/e/builds/slave/comm-central-trunk-win32-debug/build/objdir/mozilla'
make[3]: Leaving directory `/e/builds/slave/comm-central-trunk-win32-debug/build/objdir/mozilla'
NEXT ERROR make[6]: *** [tools] Error 2
make[5]: *** [base_tools] Error 2
}
It is not always logged exactly in the same order, but it's always the same failure.

Regression timeframe:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=254ea07099d2&tochange=37cd6605aea2
I see nothing obvious in these 2 sets of changesets,
but it seems odd that all our VMs started to fail with this error...

NB: "Bad file number" is not that helpful. Does it mean "file not found", "too many file handles", or something else?

PS: This means no more Windows test suites at all :-/
I saw a bit of that yesterday, but I was as confused as you that it started happening on multiple machines at roughly the same time - and I'm mostly away this weekend. :(

I wonder if there could be some code change that affects nsinstall, but then, I think it's only our boxes... really a strange thing, this.
Bad file number probably indicates a coding error, not a non-existent file or something like that.  It seems to indicate a program reference to a file which was opened but now closed, or was never opened.
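As an illustration of that explanation (not from the bug itself): "Bad file number" is the error string some platforms use for EBADF, the errno a program gets when it uses a file descriptor that was already closed or never opened. A minimal Python sketch of how that error arises:

```python
import errno
import os
import tempfile

# Open a temp file, close its descriptor, then try to use the stale
# descriptor anyway: the kernel rejects it with EBADF, which strerror()
# renders as "Bad file number" on some platforms ("Bad file descriptor"
# on others).
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    os.write(fd, b"data")
except OSError as e:
    print(e.errno == errno.EBADF)  # True
os.unlink(path)
```

Which of the two situations (closed vs. never opened) applies here cannot be told from the message alone.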
I now have clobbered mozilla/dist, mozilla/config (where nsinstall.exe gets built) and mozilla/content/base/test (where we fail) on all build machines and triggered a re-configure.
Let's see if that may help anything.
(In reply to comment #0)
> Regression timeframe:
> http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=254ea07099d2&tochange=37cd6605aea2
> I see nothing obvious in these 2 sets of changesets,
> but it seems odd that all our VMs started to fail with this error...

You mean other than both sets of changes patching content/base, which is where we're failing? And you mean other than the following one patching content/base/tests, which is even more exactly where we're failing?

http://hg.mozilla.org/mozilla-central/rev/efc77717d5ce

Even more, it adds a file to an already very long list of files, and something about "file number" makes me wonder if there is a problem with the count of files on that extremely long command line we get there - or with the files in a directory, given that we might possibly have FAT32 in use, which might have limits on how many files can be in one place...
(In reply to comment #4)
> given that we might possibly have FAT32 in use

I take that back, I just did take a look and the two machines I checked (one VMWare, one Parallels) do actually have NTFS for the E drive (which is where builds live).
So, the length of the command line is definitely a problem we're hitting here. Trying manually, I can remove any single entry from _TEST_FILES and a make succeeds in that directory, but with the full list, it fails.
Note that the command line is 33,469 characters long and contains 318 files to be installed.
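For context (my addition, not from the bug): Windows' CreateProcess limits the full command line to 32,767 characters, so a 33,469-character nsinstall invocation is over the limit. A small Python sketch of the arithmetic, using synthetic stand-in paths (the real file names are not listed in this bug):

```python
# CreateProcess limit on the full command-line string, in characters.
WINDOWS_CMDLINE_LIMIT = 32767

def cmdline_length(program, args):
    """Length of `program arg1 arg2 ...`: each arg adds its length plus a space."""
    return len(program) + sum(len(a) + 1 for a in args)

# Synthetic stand-in for the real list: 318 paths under the (real) objdir
# prefix, with hypothetical 18-character file names, lands in the same
# ballpark as the 33,469 characters observed in the failing build.
prefix = "e:/builds/slave/comm-central-trunk-win32-debug/build/objdir/mozilla/content/base/test/"
files = [prefix + "test_bug%05d.html" % i for i in range(318)]
total = cmdline_length("nsinstall.exe", files)
print(total > WINDOWS_CMDLINE_LIMIT)  # True: over the limit
```

This also explains why removing any one entry helps: each ~100-character path sits right at the margin of the limit.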
OK, it looks like the length of the command line is what's causing the problem after all.

Shortening the "comm-central-trunk-win32-debug" part of the local path to "cctwd" made the build succeed in a local test.

Our path "/e/builds/slave/comm-central-trunk-win32-debug/build/objdir/mozilla/content/base/test" is longer than the Firefox one both because of "comm-central-trunk-win32-debug" vs. "mozilla-central" and because of the additional "/mozilla" in the path - that's why this fails for us and succeeds for Firefox. Also, our non-debug builds just omit the "-debug" in the path and so still succeed.

I think we should push for the list of files to install being split so that we don't run into this problem here any more.

Ted, do you think that's a reasonable action here?
Seems fine. It's just installing test files.
Here's a patch to split this set of files into two groups. As we have about 320 files in that list right now, I decided to split at roughly 200 files, which should leave us well under limits everywhere.

I tested on Linux only, but saw the same 321 files in the directory before and after my patch.
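The patch itself is only attached, not inlined here, but the general shape of such a split (illustrative file and variable names, not the real ones) is to break the single huge list into two variables so that each nsinstall invocation gets its own, shorter command line:

```make
# Before: one variable => one nsinstall command line carrying 300+ files.
# After: two variables, each installed by a separate, shorter command line.
_TEST_FILES = \
		test_aaa.html \
		test_bbb.html \
		$(NULL)

_TEST_FILES2 = \
		test_ccc.html \
		test_ddd.html \
		$(NULL)

libs:: $(_TEST_FILES)
	$(INSTALL) $(foreach f,$^,"$(f)") $(DEPTH)/_tests/testing/mochitest/tests/$(relativesrcdir)

libs:: $(_TEST_FILES2)
	$(INSTALL) $(foreach f,$^,"$(f)") $(DEPTH)/_tests/testing/mochitest/tests/$(relativesrcdir)
```

The install destination shown is an assumption based on the usual mochitest layout; the point is simply that two `libs::` rules mean two invocations, neither of which exceeds the command-line limit.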
Assignee: nobody → kairo
Status: NEW → ASSIGNED
Attachment #443918 - Flags: review?(ted.mielczarek)
As a side note, this is more or less blocking the cutting of SeaMonkey 2.1 Alpha 1, as I'd like to see working test runs on Windows on the revision we're tagging, but as long as this isn't fixed, we don't get any test runs at all.
(In reply to comment #4)

> You mean [blahblahblah]

I mean I saw nothing obviously wrong in all these changesets when I had a look at them.

> Even more, it adds a file to an already very long list of files, and something
> with "file number" makes me wonder [...]

Wasn't my comment 0 hint |"too many file handles" or ??| close enough to the actual issue you then confirmed?

PS: Please make additional efforts to direct your rant at someone else. It's really annoying me now {and why I left the last SM meeting abruptly}. Here, I simply filed a bug with the data I could gather!
Blocks: 562652
Blocks: 563643
(In reply to comment #12)
> PS: Please, make additional efforts to direct your rant to someone else.

Hey, this wasn't a rant, just a pointer to the fact that I actually did find the cause there indeed, even if it was one of the rare cases where you didn't - but that's where multiple pairs of eyes are often better than one. I'm not sure why so many people are often trying to see the worst in me when I'm just trying to make things better. Please, everyone relax! When we work together and see the positive instead of the negative, we all win!

Serge, you're doing such tremendously good and helpful work and seeing and fixing so many things, I wouldn't want to have it any other way! Keep it up!

This was a tough nut to crack, and if I hadn't sat down, taken one of our few build machines out of the loop for a while and debugged this there, we probably still wouldn't know what's up - even though we knew that it was smaug's checkin that triggered it.

Oh, and if anyone has a problem with me and/or my reaction to something, please tell me straight out somewhere private (email or IRC query or whatever) and let's talk it out. The last thing we need anywhere in our projects is personal animosities. I always want the best for our project(s) and our community and everyone involved, take that for granted. I think we all want the best there. If it sometimes seems differently or someone overreacts, let's relax, take it easy and talk it out. Please, let's have fun all together, OK?
Attachment #443918 - Flags: review?(ted.mielczarek) → review+
Pushed the fix as http://hg.mozilla.org/mozilla-central/rev/a214b695dc7b with approval from peterv for pushing into network-unstable tree.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
(In reply to comment #13)

> even if it was one of the rare cases where you didn't -

Actually, I did think about (and hinted at) the actual cause: I just have no access to any builder to confirm it.

> why so many people are often trying to see the worst in me

Because of the way you wrote your comment 4: quite different from your later comments, which I have no problem with.

*****

V.Fixed, per tinderboxes.
No longer blocks: 563643
Status: RESOLVED → VERIFIED
blocking2.0: ? → ---
Flags: in-testsuite-
Target Milestone: --- → mozilla1.9.3a5
Component: DOM → DOM: Core & HTML