Closed Bug 963123 Opened 10 years ago Closed 10 years ago

New Windows build slaves fail in NSS on branches below trunk

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

x86
Windows Server 2008
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Assigned: armenzg)

Attachments

(3 files)

I've seen this for a few weeks with my weekly Try pushes of Aurora as if it were already merged to Beta. Since non-trunk isn't really supported on Try, I just retriggered until I got an older slave. But last night b-2008-ix-0001 (which must be the first new Win build slave we've had in the non-try pool for a while) took the Windows XULRunner nightly on Aurora and failed the same way. So apparently we've got a systemic problem with newly created slaves, which some trunk-only build change keeps from happening there.

https://tbpl.mozilla.org/php/getParsedLog.php?id=33444843&tree=Mozilla-Aurora

Creating Resource file: module.res
nsinstall: failed to create directory c:\builds\moz2_slave\m-aurora-w32-xr-ntly-000000000\build\obj-firefox\security\C:\builds\moz2_slave\m-aurora-w32-xr-ntly-000000000\build\security\nss\cmd\lib: [Error 123] The filename, directory name, or volume label syntax is incorrect: u'c:\\builds\\moz2_slave\\m-aurora-w32-xr-ntly-000000000\\build\\obj-firefox\\security\\C:'

Only on new slaves, only on branches below trunk, the NSS build system decides that C:\builds\moz2_slave\m-aurora-w32-xr-ntly-000000000\build\security\nss\cmd\lib is a path relative to the directory c:\builds\moz2_slave\m-aurora-w32-xr-ntly-000000000\build\obj-firefox\security
Assignee: nobody → armenzg
jhopkins, can you think of anything that is causing this?
I assume there is a difference between the rev2 and rev1 Win64 imaging.
I can reproduce the problem if I do this:
C:\builds\moz2_slave\m-aurora-w32-xr-ntly-000000000\build\obj-firefox>python C:\builds\moz2_slave\m-aurora-w32-xr-ntly-000000000\build\config\nsinstall.py -D c:\\builds\\moz2_slave\\m-aurora-w32-xr-ntly-000000000\\build\\obj-firefox\\security\\C:
nsinstall: failed to create directory c:\builds\moz2_slave\m-aurora-w32-xr-ntly-000000000\build\obj-firefox\security\C:: [Error 123] The filename, directory name, or volume label syntax is incorrect: u'c:\\builds\\moz2_slave\\m-aurora-w32-xr-ntly-000000000\\build\\obj-firefox\\security\\C:'
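
To see why Windows rejects this path, it can help to look at its components one at a time. The following is only a minimal sketch and is not part of nsinstall.py: the helper name first_invalid_component is hypothetical, it uses ntpath so it behaves the same on any platform, and it assumes the usual Windows rule that the characters < > : " | ? * are not allowed inside a file or directory name.

```python
import ntpath

# Characters Windows does not allow inside a single path component.
INVALID_CHARS = set('<>:"|?*')


def first_invalid_component(path):
    """Return the first component of a Windows path that contains a
    forbidden character, or None if every component looks valid."""
    # Split off the leading drive ("c:") so its colon is not flagged.
    _drive, rest = ntpath.splitdrive(path)
    for part in rest.replace("/", "\\").split("\\"):
        if INVALID_CHARS.intersection(part):
            return part
    return None


# The directory nsinstall.py was asked to create (hypothetical helper above).
bad = (r"c:\builds\moz2_slave\m-aurora-w32-xr-ntly-000000000\build"
       r"\obj-firefox\security\C:")
print(first_invalid_component(bad))
```

Under those assumptions the only offending component is the trailing "C:", which matches the path quoted in the [Error 123] message.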

The message is printed on line 83 of the function below (line 68 prints a different error, "is not a directory").
http://mxr.mozilla.org/mozilla-aurora/source/config/nsinstall.py#63

What I have not been able to figure out is who is passing 'c:\\builds\\moz2_slave\\m-aurora-w32-xr-ntly-000000000\\build\\obj-firefox\\security\\C:' to nsinstall.py.

Any ideas?

63   # just create one directory?
64   def maybe_create_dir(dir, mode, try_again):
65     dir = os.path.abspath(dir)
66     if os.path.exists(dir):
67       if not os.path.isdir(dir):
68         print('nsinstall: {0} is not a directory'.format(dir), file=sys.stderr)
69         return 1
70       if mode:
71         os.chmod(dir, mode)
72       return 0
73 
74     try:
75       if mode:
76         os.makedirs(dir, mode)
77       else:
78         os.makedirs(dir)
79     except Exception as e:
80       # We might have hit EEXIST due to a race condition (see bug 463411) -- try again once
81       if try_again:
82         return maybe_create_dir(dir, mode, False)
83       print("nsinstall: failed to create directory {0}: {1}".format(dir, e))
84       return 1
85     else:
86       return 0
All lines are trying to make the same directory:
u'c:\\builds\\moz2_slave\\m-aurora-w32-xr-ntly-000000000\\build\\obj-firefox\\security\\C:'

The line that differs most significantly is:
C:\builds\moz2_slave\m-aurora-w32-xr-ntly-000000000\build\obj-firefox\security\build\Makefile:486:0: command 'C:/mozilla-build/python27/python.exe c:/builds/moz2_slave/m-aurora-w32-xr-ntly-000000000/build/build/pymake/pymake/../make.py -C C:/builds/moz2_slave/m-aurora-w32-xr-ntly-000000000/build/security/nss/lib/crmf libs  CC=' cl' SOURCE_MD_DIR=c:/builds/moz2_slave/m-aurora-w32-xr-ntly-000000000/build/obj-firefox/dist SOURCE_MDHEADERS_DIR=c:/builds/moz2_slave/m-aurora-w32-xr-ntly-000000000/build/obj-firefox/dist/include/nspr DIST=c:/builds/moz2_slave/m-aurora-w32-xr-ntly-000000000/build/obj-firefox/dist NSPR_INCLUDE_DIR=c:/builds/moz2_slave/m-aurora-w32-xr-ntly-000000000/build/obj-firefox/dist/include/nspr NSPR_LIB_DIR=c:/builds/moz2_slave/m-aurora-w32-xr-ntly-000000000/build/obj-firefox/dist/lib MOZILLA_CLIENT=1 NO_MDUPDATE=1 NSS_ENABLE_ECC=1 SQLITE_LIB_NAME=nss3 SQLITE_INCLUDE_DIR=c:/builds/moz2_slave/m-aurora-w32-xr-ntly-000000000/build/obj-firefox/dist/include ABS_topsrcdir='c:/builds/moz2_slave/m-aurora-w32-xr-ntly-000000000/build' BUILD='c:/builds/moz2_slave/m-aurora-w32-xr-ntly-000000000/build/obj-firefox/security/$(subst $(ABS_topsrcdir)/security/,,$(CURDIR))' BUILD_TREE='$(BUILD)' OBJDIR='$(BUILD)' DEPENDENCIES='$(BUILD)/.deps' SINGLE_SHLIB_DIR='$(BUILD)' SOURCE_XP_DIR=c:/builds/moz2_slave/m-aurora-w32-xr-ntly-000000000/build/obj-firefox/dist BUILD_OPT=1 OPT_CODE_SIZE=1 NS_USE_GCC= OS_TARGET=WIN95 NSS_ENABLE_ZLIB= PROGRAMS= CHECKLOC= FREEBL_NO_DEPEND=0 NSS_NO_PKCS11_BYPASS=1 PUBLIC_EXPORT_DIR='c:/builds/moz2_slave/m-aurora-w32-xr-ntly-000000000/build/obj-firefox/dist/include/$(MODULE)' SOURCE_XPHEADERS_DIR='$(SOURCE_XP_DIR)/include/$(MODULE)' MODULE_INCLUDES='$(addprefix -I$(SOURCE_XP_DIR)/include/,$(REQUIRES))' MAKE_OBJDIR='$(INSTALL) -D $(OBJDIR)' TARGETS='$(LIBRARY) $(SHARED_LIBRARY) $(PROGRAM)' PYTHON='c:/builds/moz2_slave/m-aurora-w32-xr-ntly-000000000/build/obj-firefox/_virtualenv/Scripts/python.exe' 
NSINSTALL_PY='C:/builds/moz2_slave/m-aurora-w32-xr-ntly-000000000/build/config/nsinstall.py' NSINSTALL='$(PYTHON) $(NSINSTALL_PY)' INSTALL='$(NSINSTALL) -t' ' failed, return code 2
nsinstall: failed to create directory c:\builds\moz2_slave\m-aurora-w32-xr-ntly-000000000\build\obj-firefox\security\C:\builds\moz2_slave\m-aurora-w32-xr-ntly-000000000\build\security\nss\lib\dbm\src: [Error 123] The filename, directory name, or volume label syntax is incorrect: u'c:\\builds\\moz2_slave\\m-aurora-w32-xr-ntly-000000000\\build\\obj-firefox\\security\\C:'
I also see all of these failures (line breaks added for readability):
C:\builds\moz2_slave\m-aurora-w32-xr-ntly-000000000\build\obj-firefox\security\build\Makefile:486:0: command 
'C:/mozilla-build/python27/python.exe 
c:/builds/moz2_slave/m-aurora-w32-xr-ntly-000000000/build/build/pymake/pymake/../make.py -C C:/builds/moz2_slave/m-aurora-w32-xr-ntly-000000000/build/security/nss/lib/dbm libs  
CC=' cl' 
SOURCE_MD_DIR=c:/builds/moz2_slave/m-aurora-w32-xr-ntly-000000000/build/obj-firefox/dist 
SOURCE_MDHEADERS_DIR=c:/builds/moz2_slave/m-aurora-w32-xr-ntly-000000000/build/obj-firefox/dist/include/nspr 
DIST=c:/builds/moz2_slave/m-aurora-w32-xr-ntly-000000000/build/obj-firefox/dist 
NSPR_INCLUDE_DIR=c:/builds/moz2_slave/m-aurora-w32-xr-ntly-000000000/build/obj-firefox/dist/include/nspr 
NSPR_LIB_DIR=c:/builds/moz2_slave/m-aurora-w32-xr-ntly-000000000/build/obj-firefox/dist/lib 
MOZILLA_CLIENT=1 
NO_MDUPDATE=1 
NSS_ENABLE_ECC=1 
SQLITE_LIB_NAME=nss3 
SQLITE_INCLUDE_DIR=c:/builds/moz2_slave/m-aurora-w32-xr-ntly-000000000/build/obj-firefox/dist/include 
ABS_topsrcdir='c:/builds/moz2_slave/m-aurora-w32-xr-ntly-000000000/build' 
BUILD='c:/builds/moz2_slave/m-aurora-w32-xr-ntly-000000000/build/obj-firefox/security/$(subst $(ABS_topsrcdir)/security/,,$(CURDIR))'
BUILD_TREE='$(BUILD)' 
OBJDIR='$(BUILD)' 
DEPENDENCIES='$(BUILD)/.deps' 
SINGLE_SHLIB_DIR='$(BUILD)' 
SOURCE_XP_DIR=c:/builds/moz2_slave/m-aurora-w32-xr-ntly-000000000/build/obj-firefox/dist 
BUILD_OPT=1 
OPT_CODE_SIZE=1 
NS_USE_GCC= 
OS_TARGET=WIN95 
NSS_ENABLE_ZLIB= 
PROGRAMS= 
CHECKLOC= 
FREEBL_NO_DEPEND=0 
NSS_NO_PKCS11_BYPASS=1 
PUBLIC_EXPORT_DIR='c:/builds/moz2_slave/m-aurora-w32-xr-ntly-000000000/build/obj-firefox/dist/include/$(MODULE)'
SOURCE_XPHEADERS_DIR='$(SOURCE_XP_DIR)/include/$(MODULE)' 
MODULE_INCLUDES='$(addprefix -I$(SOURCE_XP_DIR)/include/,$(REQUIRES))' 
MAKE_OBJDIR='$(INSTALL) -D $(OBJDIR)' 
TARGETS='$(LIBRARY) $(SHARED_LIBRARY) $(PROGRAM)' 
PYTHON='c:/builds/moz2_slave/m-aurora-w32-xr-ntly-000000000/build/obj-firefox/_virtualenv/Scripts/python.exe' 
NSINSTALL_PY='C:/builds/moz2_slave/m-aurora-w32-xr-ntly-000000000/build/config/nsinstall.py' 
NSINSTALL='$(PYTHON) $(NSINSTALL_PY)' 
INSTALL='$(NSINSTALL) -t' ' failed, return code 2
I'm suspicious of OS_TARGET=WIN95
No, that's normal.
I'm putting the slave on staging to see what it does.

I'm out of ideas. I don't know how to debug Makefiles et al.
I have confirmed that it compiles for mozilla-central:
http://dev-master01.build.scl1.mozilla.com:8040/builders/WINNT%205.2%20mozilla-central%20xulrunner%20nightly
but not for mozilla-aurora:
http://dev-master01.build.scl1.mozilla.com:8040/builders/WINNT%205.2%20mozilla-aurora%20xulrunner%20nightly/builds/0/steps/compile/logs/stdio

I believe there are code differences between the two branches that cause the rev2 win64 machines to fail in some weird way.
mshal is actually helping me look into this.
Here's what I know so far:

The incorrect nsinstall paths come from this line in security/build/Makefile.in:

DEFAULT_GMAKE_FLAGS += BUILD='$(MOZ_BUILD_ROOT)/security/$$(subst $$(ABS_topsrcdir)/security/,,$$(CURDIR))'

MOZ_BUILD_ROOT is c:/builds/moz2_slave/m-aurora-w32-xr-ntly-000000000/build/obj-firefox
ABS_topsrcdir is c:/builds/moz2_slave/m-aurora-w32-xr-ntly-000000000/build
CURDIR is C:/builds/moz2_slave/m-aurora-w32-xr-ntly-000000000/build/security/nss/cmd/lib

CURDIR is the weird one - it has a 'C:' instead of 'c:', so the $(subst) doesn't match. CURDIR is built-in to make, which here means pymake. pymake gets its CURDIR value either from os.getcwd(), or from the directory passed in with '$(MAKE) -C subdir'
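
The mismatch can be reproduced outside of make. The following is a minimal Python sketch, not pymake code: make_subst and build_dir are hypothetical helpers, and the only assumption is that $(subst) performs a literal, case-sensitive text replacement (which is how make documents it). The paths are the ones quoted above.

```python
def make_subst(old, new, text):
    """Mimic make's $(subst old,new,text): a literal, case-sensitive replace."""
    return text.replace(old, new)


MOZ_BUILD_ROOT = "c:/builds/moz2_slave/m-aurora-w32-xr-ntly-000000000/build/obj-firefox"
ABS_TOPSRCDIR = "c:/builds/moz2_slave/m-aurora-w32-xr-ntly-000000000/build"


def build_dir(curdir):
    # BUILD = $(MOZ_BUILD_ROOT)/security/$(subst $(ABS_topsrcdir)/security/,,$(CURDIR))
    relative = make_subst(ABS_TOPSRCDIR + "/security/", "", curdir)
    return MOZ_BUILD_ROOT + "/security/" + relative


suffix = "/builds/moz2_slave/m-aurora-w32-xr-ntly-000000000/build/security/nss/cmd/lib"
print(build_dir("c:" + suffix))  # prefix matches, so it is stripped
print(build_dir("C:" + suffix))  # prefix does not match, CURDIR is kept whole
```

With a lower-case CURDIR the result ends in obj-firefox/security/nss/cmd/lib. With the upper-case one, the whole CURDIR is appended after obj-firefox/security/, which is the bogus path nsinstall was asked to create.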

Here's the weird part (to me) - if I print out os.getcwd() in pymake's main, when I run it on the command-line I get:

pymake getcwd is:  c:\builds\moz2_slave\m-aurora-w32-xr-ntly-000000000\build

But the same thing from a builder gets me:

pymake getcwd is:  C:\builds\moz2_slave\m-aurora-w32-xr-ntly-000000000\build

I tried wrapping this in os.path.normcase(), but due to the fact that paths also come from the $(MAKE) -C subdir route, it doesn't fix the problem. When nss does a $(MAKE) -C, it uses NSS_SUBDIR, which is set to $(topsrcdir), which comes from the configure-generated security/build/Makefile:

topsrcdir := C:/builds/moz2_slave/m-aurora-w32-xr-ntly-000000000/build

Note that in comparison, my own Windows machine generates topsrcdir with a lower-case "c:".

So I think we can try to fix the $(subst) to be more lenient, which might be a bit wonky to do in make. Or we can figure out why os.getcwd() gives us a different casing for "C:" in the command-line vs. in the builder. We would also need to figure out why topsrcdir gets generated with upper-case then (which would probably be the same root cause).
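
What a more lenient comparison could look like is sketched below in Python rather than make, purely to illustrate the idea. strip_prefix_ignoring_case is a hypothetical helper, not something from the NSS build system or pymake, and it assumes plain ASCII paths such as the ones in this bug.

```python
import ntpath


def strip_prefix_ignoring_case(prefix, path):
    """Remove `prefix` from the start of `path`, comparing the two the way a
    case-insensitive Windows filesystem would. Returns `path` unchanged if
    it does not start with `prefix`."""
    # ntpath.normcase lower-cases the string and turns "/" into "\\".
    # For ASCII paths this keeps the length the same, so slicing the
    # original string by len(prefix) still lines up.
    if ntpath.normcase(path).startswith(ntpath.normcase(prefix)):
        return path[len(prefix):]
    return path


abs_topsrcdir = "c:/builds/moz2_slave/m-aurora-w32-xr-ntly-000000000/build"
curdir = "C:/builds/moz2_slave/m-aurora-w32-xr-ntly-000000000/build/security/nss/cmd/lib"
print(strip_prefix_ignoring_case(abs_topsrcdir + "/security/", curdir))
```

Expressing the same thing with make's text functions is more awkward, which is presumably what "a bit wonky" refers to.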
Actually I think we can just replace ABS_topsrcdir with topsrcdir, since both topsrcdir & CURDIR will use the same capitalization. It would also get rid of a $(shell)...
Try this in python:
   import os
   os.chdir(r'c:\Program Files')
   print(os.getcwd())
   os.chdir(r'C:\Users')
   print(os.getcwd())

Yep, what os.getcwd() returns depends on how the current directory was set. And afaik, the scripts we run on slaves do change directory before launching commands.

So there's a simpler fix, then: fix the path used by the build slave scripts, which must contain an upper-case 'C' where the same path on other builders has a lower-case 'c'.
That's strange - I tried a similar test in the command-line:

$ pwd
/c/Users/marf

$ python ok.py
c:\Users\marf

$ cd /C/Users/marf

$ pwd
/C/Users/marf

$ python ok.py
c:\Users\marf

In both cases, os.getcwd() gives a lower-case 'c' instead of what pwd shows. I can reproduce your test though, so I guess the 'c' vs. 'C' must be internal to python rather than what it gets from the OS.
I'm not sure where in the configuration the 'C:' setting is coming from (I pinged armenzg about it, but he's not sure either). Either way I don't think we should fail in this bizarre way if one path says 'c:' and another says 'C:' on a case-insensitive filesystem.
Attachment #8366168 - Flags: review?(mh+mozilla)
(In reply to Michael Shal [:mshal] from comment #13)
> In both cases, os.getcwd() gives a lower-case 'c' instead of what pwd shows.
> I can reproduce your test though, so I guess the 'c' vs. 'C' must be
> internal to python rather than what it gets from the OS.

And the harness is in python...
Comment on attachment 8366168 [details] [diff] [review]
0001-Bug-963123-Fix-case-sensitivity-to-C-vs-c-in-NSS.patch

Review of attachment 8366168 [details] [diff] [review]:
-----------------------------------------------------------------

I still think this should be worked around at the builder level, too. There's no reason that can't be made to work without the patch, if it works on the older slaves. And I'm not terribly thrilled at the idea of touching the (fragile) nss build system on branches to make them work with new slaves.
Attachment #8366168 - Flags: review?(mh+mozilla) → review+
Oh wow, and I thought I could no longer be surprised by how one little thing can cause a problem so far away from it!
I really thought I would not be able to find anything after several hours of investigation.

When a machine starts, we grab a buildbot.tac file (this allows for allocating machines).
One of the values is basedir = 'C:\\\\builds\\\\moz2_slave'
Which runslave.py passes as cwd to buildbot when starting [1]:
        rv = subprocess.call(
            self.options.twistd_cmd +
                    [ '--no_save',
                      '--logfile', os.path.join(self.get_basedir(), 'twistd.log'),
                      '--python', self.get_filename(),
                    ],
            cwd=self.get_basedir())
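
How the cwd argument reaches a child process can be checked with a small experiment. This is only a sketch and not runslave.py: child_getcwd is a hypothetical helper, and a temporary directory stands in for the slave's basedir. On Windows, the observations in this bug suggest the child reports the drive letter in whatever case was passed as cwd; the sketch itself does not verify that.

```python
import os
import subprocess
import sys
import tempfile


def child_getcwd(cwd):
    """Start a child Python with the given working directory and return
    what os.getcwd() reports inside that child."""
    out = subprocess.check_output(
        [sys.executable, "-c", "import os; print(os.getcwd())"],
        cwd=cwd,
    )
    return out.decode().strip()


basedir = tempfile.mkdtemp()  # stand-in for the slave's basedir
print(child_getcwd(basedir))
```

Running it once with 'c:\\...' and once with 'C:\\...' on an affected Windows slave would show whether the casing survives into the child, which is what the build ends up seeing as CURDIR.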

It seems that 12 win64 rev2 machines have that value [2] as well.

I triggered a new build after fixing it [3] and I see this workdir:
 in dir c:\builds\moz2_slave\m-aurora-w32-xr-ntly-000000000\build (timeout 7200 secs) (maxTime 16200 secs)
instead of
 in dir C:\\builds\\moz2_slave\m-aurora-w32-xr-ntly-000000000\build (timeout 7200 secs) (maxTime 16200 secs)

I've fixed the 12 Win64 machines that had the issue and the b-2008-ix machines.
I will add b-2008-ix-0001 back into production.

[1] http://hg.mozilla.org/build/puppet-manifests/file/tip/modules/buildslave/files/runslave.py

[2]
mysql> select name, basedir from slaves where binary basedir like 'C:%' and name like 'w64-ix%';
+-----------------+------------------------+
| name            | basedir                |
+-----------------+------------------------+
| w64-ix-slave159 | C:\\builds\\moz2_slave |
| w64-ix-slave160 | C:\\builds\\moz2_slave |
| w64-ix-slave161 | C:\\builds\\moz2_slave |
| w64-ix-slave162 | C:\\builds\\moz2_slave |
| w64-ix-slave163 | C:\\builds\\moz2_slave |
| w64-ix-slave164 | C:\\builds\\moz2_slave |
| w64-ix-slave165 | C:\\builds\\moz2_slave |
| w64-ix-slave166 | C:\\builds\\moz2_slave |
| w64-ix-slave167 | C:\\builds\\moz2_slave |
| w64-ix-slave168 | C:\\builds\\moz2_slave |
| w64-ix-slave169 | C:\\builds\\moz2_slave |
| w64-ix-slave170 | C:\\builds\\moz2_slave |
+-----------------+------------------------+
12 rows in set (0.54 sec)

[3] 
mysql> select name, basedir from slaves where name like 'b-2008-ix-0001';
+----------------+------------------------+
| name           | basedir                |
+----------------+------------------------+
| b-2008-ix-0001 | C:\\builds\\moz2_slave |
+----------------+------------------------+
1 row in set (0.42 sec)

mysql> update slaves set basedir='c:\\builds\\moz2_slave' where binary basedir like 'C:%' and name like 'w64-ix%';
Query OK, 12 rows affected (0.42 sec)
Rows matched: 12  Changed: 12  Warnings: 0

mysql> update slaves set basedir='c:\\builds\\moz2_slave' where name like 'b-2008-ix-0001';
Query OK, 1 row affected (0.40 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql> select name, basedir from slaves where name like 'b-2008-ix-0001';
+----------------+----------------------+
| name           | basedir              |
+----------------+----------------------+
| b-2008-ix-0001 | c:\builds\moz2_slave |
+----------------+----------------------+
1 row in set (0.40 sec)
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
:glandium, would you still be fine if we land the build fix? This would prevent any issues in the future.
I'm glad you were able to find the source of the path! Personally I'd prefer not to leave the case-sensitivity in the build system since I think it's likely we'll stumble on it again, but I'll defer to glandium.
Oh yes, please land, but let it ride the trains instead of uplifting.
great :)
I will land it.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Depends on: 964855
Comment on attachment 8366168 [details] [diff] [review]
0001-Bug-963123-Fix-case-sensitivity-to-C-vs-c-in-NSS.patch

 https://hg.mozilla.org/integration/mozilla-inbound/rev/877ea08fb1cf
Attachment #8366168 - Flags: checked-in+
https://hg.mozilla.org/mozilla-central/rev/877ea08fb1cf
Status: REOPENED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard