Closed Bug 496712 Opened 15 years ago Closed 15 years ago

[mostly try server] Corrupt file system on slaves with NTFS drives ("Circular directory structure")

Categories

(Release Engineering :: General, defect, P1)

x86
Windows Server 2003
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ehsan.akhgari, Assigned: bhearsum)

Attachments

(1 file, 1 obsolete file)

I submitted a try server patch and got the following from the WINCE builder:

http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1244297278.1244297292.19205.gz

rm: WARNING: Circular directory structure.
This almost certainly means that you have a corrupted file system.
NOTIFY YOUR SYSTEM MANAGER.
The following two directories have the same inode number:

build/objdir/xulrunner/profile
`build/objdir/xulrunner/profile/public/_xpidlgen'

I'm not sure what this means, but I thought I'd file it for you guys to check it out...
Ehsan: thanks for the heads-up, we'll investigate.

jhford/aki: any ideas?
I noticed this happening on other try server windows builds as well.  I am not sure what is going on there.
I have noticed this is mainly happening on slave15 and slave18; I don't know if it is happening on others.
win15 is 
 red   at Sat Jun 6 07:07  (build 96)
 red   at Thu Jun 4 21:21  (build 84)
 green at Wed Jun 3 07:59  (build 65)

win18 is 
 red   at Sun Jun 7 04:17  (build 104)
 red   at Fri Jun 5 00:00  (build 87)
 green at Tue Jun 2 17:52  (build 55) 
 green at Mon Jun 1 15:39  (build 38)

win10 is 
 red   at Tue Jun 2 11:13  (build 49)
but green for the 6 builds since then.

Can't see any further back than Sun May 31 08:01 (build 24). Have shut off buildbot on 15 and 18 to investigate.
Summary: Corrupt file system on WINCE's hg builder? → [try server] Corrupt file system on WINCE's hg builder?
Comment #4 is all for "WINCE try hg build", but win10 also hit problems with removing the previous build in
  Jun 02 11:42: WINNT 5.2 try hg unit test (build 1318)
  Jun 02 09:28: WINNT 5.2 try hg unit test (build 1315)
  Jun 02 02:36: WINNT 5.2 try hg unit test (build 1310)
but is subsequently OK. Not clear if it came right by itself or if someone intervened.
Here's some of the 'find -ls | sort' output from win15:/e/builds/slave/sendchange-wince-hg/build/objdir/xulrunner, trimmed to show just the inode number, number of hard links, mod. date, and dir/file name

224228    2 Jun  3 08:42 ./profile/public/_xpidlgen
224228    3 Jun  4 21:29 ./profile

224246    2 Jun  3 08:42 ./xpfe/browser/public/.deps
224246    4 Jun  3 08:14 ./rdf/tests

224247    2 Jun  3 08:14 ./rdf/tests/rdfcat
224247    2 Jun  3 08:42 ./xpfe/browser/public/_xpidlgen

plus a further 439 examples of duplicated inodes in that directory. There are some truly weird combinations in there, e.g. the zero-size file xpfe/components/directory/_xpidlgen/.done with the directory widget/src/gtkxtbin. It's not limited to zero-size objects though; there are some 200KB+ files which are too large to be stored in the NTFS MFT IIRC (acronyms FTW!!).
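
For anyone wanting to repeat this check on another slave, here is a rough Python sketch of the grouping step. It assumes the listing has already been trimmed to the same six columns shown above (inode, link count, month, day, time, path); the function name and the sample list are only for illustration and are not part of any releng tooling.

```python
from collections import defaultdict


def duplicated_inodes(lines):
    """Group paths by inode number and keep only inodes seen more than once."""
    by_inode = defaultdict(list)
    for line in lines:
        # Columns: inode, link count, month, day, time, path.
        # maxsplit=5 keeps any spaces inside the path intact.
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        inode, path = fields[0], fields[5]
        by_inode[inode].append(path)
    return {inode: paths for inode, paths in by_inode.items() if len(paths) > 1}


# The six lines quoted above from win15.
sample = [
    "224228    2 Jun  3 08:42 ./profile/public/_xpidlgen",
    "224228    3 Jun  4 21:29 ./profile",
    "224246    2 Jun  3 08:42 ./xpfe/browser/public/.deps",
    "224246    4 Jun  3 08:14 ./rdf/tests",
    "224247    2 Jun  3 08:14 ./rdf/tests/rdfcat",
    "224247    2 Jun  3 08:42 ./xpfe/browser/public/_xpidlgen",
]

if __name__ == "__main__":
    for inode, paths in sorted(duplicated_inodes(sample).items()):
        print(inode, paths)
```

Note that a shared inode is not suspicious by itself for regular files, since hard links legitimately share one. The worrying case here is two different directories reporting the same number, which the trimmed listing cannot distinguish on its own, so the output still needs a manual look.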

If I look on win11 (which hasn't had any errors) at the whole objdir for the wince build, then there are no duplicated inodes at all. Hardlink counts greater than 1 appear to be normal on both NTFS and FAT drives on our slaves, which seems a bit weird to me given FAT not supporting that. Perhaps msys's find is telling lies. Ted, what do you expect nsinstall to do wrt to link creation on windows ? Copy, right ?

chkdsk gives the E drive a clean bill of health, so I've moved sendchange-wince-hg to sendchange-wince-hg-broken to allow further investigation. Also moved it aside on win18, then both back into service.
Note that win10 through 19 are using NTFS disks rather than FAT.  And at least two of these VMs are on new storage partitions on the Equallogic setup:
  win15 - eql01-bm10
  win18 - eql01-bm11
and possibly also win10 on eql01-bm09. Phong, everything alright on those partitions ? There are obviously other VMs on those partitions so this might be a transient problem.
Phong - any info you can share about the questions from comment #7?
Blocks: 497580
No longer blocks: 497580
(In reply to comment #6)
> Ted, what do you expect nsinstall to do wrt to link
> creation on windows ? Copy, right ?

That's correct. The version of nsinstall used in MozillaBuild simply copies (and preserves timestamps, the way we call it).
Should I expect this not to bother trunk tinderboxen? The HTML5 parsing repo hits this issue every time, and I wonder if this should block landing the HTML5 preffed off on the trunk or not. (Bug 487949)
This just bit me on both the Windows & Windows CE tryserver builders:

http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1245280107.1245280903.17313.gz
WINNT 5.2 try hg unit test on 2009/06/17 16:08:27

http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1245280587.1245280603.16760.gz
WINCE try hg build on 2009/06/17 16:16:27

(Updating bug summary to indicate that this isn't CE-only)
Summary: [try server] Corrupt file system on WINCE's hg builder? → [try server] Corrupt file system on WINCE & Windows hg builder?
The above two builds happened on win14 and win12.
I'm really hoping this isn't something that will prevent us from using NTFS entirely.

I'm also wondering if we're hitting the max path length issue somehow. I'd be curious as to whether renaming the build dir from sendchange-wince-hg to just wince-hg would help things a bit.

Those are my hypotheses currently.
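
One way to test the path length hypothesis before renaming anything would be to scan an existing build directory for overlong paths. Below is a minimal Python sketch, assuming the classic Win32 MAX_PATH value of 260 characters (which counts the terminating NUL); the function name is made up for illustration.

```python
import os

MAX_PATH = 260  # classic Win32 limit, counting the terminating NUL


def overlong_paths(root, limit=MAX_PATH):
    """Yield (length, path) for entries under root whose absolute path
    is too long to fit in `limit` characters including the NUL."""
    root = os.path.abspath(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            if len(full) >= limit:
                yield len(full), full


if __name__ == "__main__":
    import sys

    for length, path in overlong_paths(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(length, path)
```

If this prints nothing for a build dir that has hit the error, path length is probably not the cause. For scale, dropping the "sendchange-" prefix shortens every path under the build dir by 11 characters.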
Attached patch Rename win32/wince hg build dirs (obsolete) — Splinter Review
The directory names are for our [releng's] use only, and sendchange- can be assumed when you're on a try-* slave.  Shortening the build dir names to help rule out max file path lengths.
Attachment #383820 - Flags: review?(bhearsum)
Don't forget that the script that starts buildbot on boot needs updating in this situation. It lives at /d/mozilla-build/start-buildbot.bat
Oh n/m, I misread the patch.
I didn't think that this issue was occurring before we landed the WINCE
builders, which makes me think that the issue is with the total number of files
on the drive.  We could try clobbering all windows builds after they complete
as a diagnostic step as this would ensure a minimum number of files on each
partition.

I was reading the links on the reference platform explanation and it seems that
there is a way to boost the size of the MFT
(http://www.windowsdevcenter.com/pub/a/windows/2005/02/08/NTFS_Hacks.html step
7)
(In reply to comment #16)
> I didn't think that this issue was occurring before we landed the WINCE
> builders, which makes me think that the issue is with the total number of files
> on the drive.  We could try clobbering all windows builds after they complete
> as a diagnostic step as this would ensure a minimum number of files on each
> partition.
> 
> I was reading the links on the reference platform explanation and it seems that
> there is a way to boost the size of the MFT
> (http://www.windowsdevcenter.com/pub/a/windows/2005/02/08/NTFS_Hacks.html step
> 7)

Worth looking into, but I would be very very surprised if this was the case. We have production slaves with NTFS drives and FAT drives which no doubt have more files on them, considering all the different builddirs they have.
Attachment #383820 - Flags: review?(bhearsum) → review+
Comment on attachment 383820 [details] [diff] [review]
Rename win32/wince hg build dirs

I've never seen the path length issue manifest itself like this, but it's worth a shot. Please double check any slaves that have 30GB drives and make sure we won't run out of space.
Assignee: nobody → aki
win32-slave07 hit this on pre-packaged mochitest:

rm -rf build
 in dir e:\builds\moz2_slave\mozilla-central-win32-unittest-mochitests\. (timeout 1200 secs)
rm: WARNING: Circular directory structure.
This almost certainly means that you have a corrupted file system.
NOTIFY YOUR SYSTEM MANAGER.
The following two directories have the same inode number:
build
`build/symbols/testservmgr.pdb/95E426C62F314C888AC7744C333E91B12'

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox-Unittest/1245634069.1245635540.13287.gz&fulltext=1

That step probably needs haltOnFailure set too, because it happily carried on.
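
To spell out what haltOnFailure would buy us here, this is a small plain-Python sketch of the behaviour. It is not Buildbot code and does not use the Buildbot API; run_steps and the (argv, halt_on_failure) pairs are invented purely to illustrate the idea.

```python
import subprocess
import sys


def run_steps(steps):
    """Run (argv, halt_on_failure) pairs in order and return the exit codes seen.

    A failing step that is flagged halt_on_failure stops the run, so later
    steps never execute against a half-deleted build directory.
    """
    codes = []
    for argv, halt_on_failure in steps:
        code = subprocess.call(argv)
        codes.append(code)
        if code != 0 and halt_on_failure:
            break
    return codes


if __name__ == "__main__":
    # Stand-ins for a failing "rm -rf build" and a following test step.
    failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
    passing = [sys.executable, "-c", "pass"]
    print(run_steps([(failing, True), (passing, False)]))
```

With the flag unset on the removal step, the second step still runs after the failure, which matches what the log above shows.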
I probably won't be able to handle this before going on vacation.
Throwing back in the pool.
Assignee: aki → nobody
I'll land this the next time we have downtime.
Assignee: nobody → bhearsum
Status: NEW → ASSIGNED
Comment on attachment 383820 [details] [diff] [review]
Rename win32/wince hg build dirs

Going to roll this out in tomorrow's downtime.
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1246131272.1246131287.929.gz
WINCE try hg build on 2009/06/27 12:34:32
"s: win16"
WINNT 5.2 mozilla-1.9.1 unit test on 2009/06/28 12:12:20 on win32-slave06:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.5/1246216340.1246216420.27374.gz&fulltext=1
Summary: [try server] Corrupt file system on WINCE & Windows hg builder? → [mostly try server] Corrupt file system on slaves with NTFS drives
We hit this again on try-w32-slave12,
  http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1246580677.1246580682.25662.gz
so I tried moving aside /d/mozilla-build/ and installing Ted's MozillaBuild 1.4rc1 (which includes msys 1.0.11-rc-1). That was able to rm -rf the directory which the old MB 1.2 install could not. 

So that's a way forward, but one that requires pretty extensive validation that nothing regresses across our builds. Perhaps we can start with try staging, and perhaps disable WinCE builds to prevent the spurious noise for developers.

Changes between MozillaBuild 1.3 and 1.4rc1
http://hg.mozilla.org/mozilla-build/pushloghtml?fromchange=95c0aff5718a&tochange=939fab300c0f
Changes between 1.2 (which is what our slaves are using) and 1.3 can be seen on http://hg.mozilla.org/mozilla-build/pushloghtml/3 using the marked tags.

try-w32-slave12 is back in the try pool with MB v1.2.
OS: Windows CE → Windows Server 2003
Summary: [mostly try server] Corrupt file system on slaves with NTFS drives → [mostly try server] Corrupt file system on slaves with NTFS drives ("Circular directory structure")
Comment on attachment 383820 [details] [diff] [review]
Rename win32/wince hg build dirs

Obsoleted by patches in bug 500056 (wince->winmo).
Attachment #383820 - Attachment is obsolete: true
Since I am going to be testing a TON of different builds already because of my changes with shared source directories, it might be worthwhile for me to test out the new version of mozilla-build on the windows machines.
Hit this on try-w32-slave11; see bug in comment 32.
Bumping the priority on this. Given the encouraging report in comment #29 I'm rolling out MozillaBuild 1.4 in try staging tonight. If everything looks good overnight I'll roll it out across the rest of the try slaves tomorrow.
Priority: -- → P1
I tried to upgrade the try slaves to MB 1.4 today. It went poorly due to what looks like a Buildbot bug. Need more testing and bugfixes before we try this again.
(In reply to comment #37)
> I tried to upgrade the try slaves to MB 1.4 today. It went poorly due to what
> looks like a Buildbot bug. Need more testing and bugfixes before we try this
> again.

I ended up pushing through and getting MozillaBuild 1.4 deployed yesterday. Let's keep this open for a couple days to make sure the issue is fixed.
We hit this on slave11 today. I logged in to check and it definitely has MB 1.4. While this confirms that the issue isn't fixed, it does seem quite a bit better with MB 1.4 in place. There were 12 occurrences of this between July 15th and July 21st and only 1 occurrence since MB 1.4 was installed two days ago.
Ok, so it sounds like we need to look at using some regular win32 command line tools or reuse our clobbering code to delete directories instead of using msys's rm.
This patch should switch us over to rmdir /s /q for Windows. I gave it a quick run in staging on all build types, letting them run until they got past the rm/rmdir step. Turns out we already use rmdir /s /q for the CVS try builds.
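
For readers without access to the attachment, the idea can be sketched roughly as below. This is not the patch itself, only an illustration of picking the delete command by platform; remove_dir_command is a hypothetical helper name.

```python
import sys


def remove_dir_command(path, platform=sys.platform):
    """Return an argv list that recursively deletes `path` on the given platform."""
    if platform.startswith("win"):
        # rmdir is a cmd.exe builtin, so it has to be run through cmd /c.
        # /s removes the whole tree, /q suppresses the confirmation prompt.
        return ["cmd", "/c", "rmdir", "/s", "/q", path]
    # Everywhere else keep using the regular rm.
    return ["rm", "-rf", path]


if __name__ == "__main__":
    print(remove_dir_command("build"))
```

One difference to keep in mind: rm -rf stays quiet when the directory is already gone, while rmdir /s /q reports an error in that case, so a step using it may need to tolerate a missing build dir.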
Attachment #390461 - Flags: review?(catlee)
Comment on attachment 390461 [details] [diff] [review]
use rmdir /s /q on windows

Assuming this works, it looks ok to me.
Attachment #390461 - Flags: review?(catlee) → review+
Comment on attachment 390461 [details] [diff] [review]
use rmdir /s /q on windows

changeset:   369:272144a467e0
changeset:   1372:76a7fa2d07c5

I updated the try server master with this patch. Let's see how things go over the weekend.
Attachment #390461 - Flags: checked‑in+
No failures with this error over the weekend. Leaving this open until tomorrow to be sure, though.
Haven't had any failures of this sort at all since the patch was landed, I'm going to consider this fixed.
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering