Closed Bug 809978 Opened 12 years ago Closed 11 years ago

xpt.py link: race condition - unpack requires a string argument of length 32

Categories

(Firefox Build System :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: joey, Assigned: joey)

Details

Attachments

(1 file, 3 obsolete files)

When idl files are compiled in parallel and a module is later created by "xpt.py link" the link action can periodically fail attempting to link files while they are being compiled.

The error can be detected under a very specific condition so add a small retry block around the I/O loop when linking to give the compile thread a little extra time to finish writing content out to disk.
Assignee: nobody → joey
This really doesn't sound like the right solution: shouldn't makefile deps prevent attempting to link XPTs until the prerequisites are finished?
Comment on attachment 679770 [details] [diff] [review]
retry link I/O on potential race conditions

Added a 60-second (5x12) loop around the I/O block for reading source.

If exception reported is: 'unpack requires a string arg of length x'.  xpt.py operation is 'link' and the file was recently modified -- retry file I/O.


Cleanups:
  o include filename when reporting exceptions.
  o removed whitespace embedded within blank lines.
Attachment #679770 - Flags: review?(gps)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #2)
> This really doesn't sound like the right solution: shouldn't makefile deps
> prevent attempting to link XPTs until the prerequisites are finished?

Yes makefile deps would be the optimal answer but adding new logic in the makefiles tends to be frowned upon so this was a straightforward alternative.
This feels wrong. If our dependencies aren't handling this then our dependencies are broken.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #6)
> This feels wrong. If our dependencies aren't handling this then our
> dependencies are broken.

Yes the rules for xpidl module linking is definitely broken.  People have reported the issue a few times but the fix has been 'try and rebuild'.  If there will not be an issue adding makefile logic proper deps would help resolve the problem.

In the interim this patch provides a workaround that will reduce the chance of the transient failure affecting builds.  Deploy it to improve stability, retain the bug/state=open.  The patch will be easy enough to deprecate after makefile deps have been properly fixed.
Comment on attachment 679770 [details] [diff] [review]
retry link I/O on potential race conditions

Review of attachment 679770 [details] [diff] [review]:
-----------------------------------------------------------------

I'm pretty busy right now and this seems like a somewhat controversial patch. I'd rather not get myself involved at this juncture. Ask me again in a week or two if you still need review. Sorry.
Attachment #679770 - Flags: review?(gps)
Err msg reported: "struct.error: unpack requires a string argument of length 32"
(In reply to Joey Armstrong [:joey] from comment #9)
> Err msg reported: "struct.error: unpack requires a string argument of length
> 32"

ps> error message was reported as part of the subject line
Attachment #679770 - Attachment is obsolete: true
Attachment #719528 - Attachment is obsolete: true
Comment on attachment 720780 [details] [diff] [review]
retry link I/O on potential race conditions

Revive patch so work is not lost.
Note: this patch will not address the underlying build dependency problems, that is the topic of another bug.  This patch intends to board up the broken window so people will no longer trip over the problem.

o Add ArgParse support for formal command line arg support + usage/help text.
  Existing 'link' and 'dump' arguments retained.
  Added argument --workaround-bug-754358 to enable functionality.  If argument is commented out or removed while porting to mozbuild problem behavior will revert until bug 754358 is addressed.
o Race condition workaround
  - if --workaround-bug-754358 passed
  - and last error was a struct length error
  - and operation is (xpt.py) 'link'
  - and file was very recently modified

attempt to re-read the source file a few times.

If error persists or the initial error was something other than file size 0 fail as usual.
Attachment #720780 - Flags: review?(ted)
The simpler fix here is to stop linking xpt files in each directory where there are several, now that we can do that. It just requires to write more "interface" lines in the manifest.
Comment on attachment 720780 [details] [diff] [review]
retry link I/O on potential race conditions

cancel review - argparse python module not available on android or fedora.
Will need add a compatibility layer, fallback to using optparse when argparse not available until the transition to 2.7 as a default lands everywhere.
Attachment #720780 - Flags: review?(ted)
(In reply to Joey Armstrong [:joey] from comment #15)
> Comment on attachment 720780 [details] [diff] [review]
> retry link I/O on potential race conditions
> 
> cancel review - argparse python module not available on android or fedora.
> Will need add a compatibility layer, fallback to using optparse when
> argparse not available until the transition to 2.7 as a default lands
> everywhere.

or just do as comment 14 says.
Attachment #720780 - Attachment is obsolete: true
(In reply to Mike Hommey [:glandium] from comment #16)
> (In reply to Joey Armstrong [:joey] from comment #15)
> > Comment on attachment 720780 [details] [diff] [review]
> > retry link I/O on potential race conditions
> > 
> > cancel review - argparse python module not available on android or fedora.
> > Will need add a compatibility layer, fallback to using optparse when
> > argparse not available until the transition to 2.7 as a default lands
> > everywhere.
> 
> or just do as comment 14 says.

That would be a good future solution but this problem has persisted for a few months so why not board up the broken window in the interim so people stop tripping over it until the problem can be fixed permanently ?
Comment on attachment 721204 [details] [diff] [review]
retry link I/O on potential race conditions

Python problem was due to virtualenv path being stripped from xpidl_link.
Full command to xpt.py now passed.

https://tbpl.mozilla.org/?tree=Try&rev=308414ae8d0f

Revive patch so work is not lost.
Note: this patch will not address the underlying build dependency problems, that is the topic of another bug.  This patch intends to board up the broken window so people will no longer trip over the problem.

o Add ArgParse support for formal command line arg support + usage/help text.
  Existing 'link' and 'dump' arguments retained.
  Added argument --workaround-bug-754358 to enable functionality.  If argument is commented out or removed while porting to mozbuild problem behavior will revert until bug 754358 is addressed.
o Race condition workaround
  - if --workaround-bug-754358 passed
  - and last error was a struct length error
  - and operation is (xpt.py) 'link'
  - and file was very recently modified

attempt to re-read the source file a few times.

If error persists or the initial error was something other than file size 0 fail as usual
Attachment #721204 - Flags: review?(ted)
(In reply to Mike Hommey [:glandium] from comment #14)
> The simpler fix here is to stop linking xpt files in each directory where
> there are several, now that we can do that. It just requires to write more
> "interface" lines in the manifest.

Is there a bug open for these edits and an ETA when they are all expected to land ?
Comment on attachment 721204 [details] [diff] [review]
retry link I/O on potential race conditions

Review of attachment 721204 [details] [diff] [review]:
-----------------------------------------------------------------

::: config/rules.mk
@@ +1358,5 @@
>  # no need to link together if XPIDLSRCS contains only XPIDL_MODULE
>  ifneq ($(XPIDL_MODULE).idl,$(strip $(XPIDLSRCS)))
> +
> +# based on $(XPIDL_LINK)
> +XPT_PY ?= $(PYTHON) $(LIBXUL_DIST)/sdk/bin/xpt.py

I would feel better if you defined XPT_PY in config.mk, and then defined XPIDL_LINK as "$(XPT_PY) link". Also, this is sort of broken given the xpidl-module-deps below, isn't it? (you'll be adding a dependency on the python binary)

::: xpcom/typelib/xpt/tools/xpt.py
@@ +99,5 @@
>  class Type(object):
>      """
>      Data type of a method parameter or return value. Do not instantiate
>      this class directly. Rather, use one of its subclasses.
> +

Thanks for cleaning up the extraneous whitespace.

@@ +1129,5 @@
> +                    and len(sys.argv) > 1
> +                    and sys.argv[1] == 'link'
> +                    and len(err.args) > 0
> +                    and err.args[0].startswith('unpack requires a string argument of length') 
> +                   ):

This seems like a very awkward way to diagnose the condition "trying to read an empty file". I would instead just change the code above where it stats the input file to detect if the filesize is < 32 bytes, and do your retry loop there.

@@ +1440,5 @@
> +def local_argparse():
> +    """
> +    Use argparse to parse & validate command line args.
> +    A dictionary of values will be returned.
> +

This feels really extraneous to this patch. I wouldn't bother putting in argparse, I'd just make this behavior the default until we fix things so that we can remove it.
Attachment #721204 - Flags: review?(ted) → review-
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #21)
> Comment on attachment 721204 [details] [diff] [review]
> retry link I/O on potential race conditions
> 
> Review of attachment 721204 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> @@ +1129,5 @@
> > +                    and len(sys.argv) > 1
> > +                    and sys.argv[1] == 'link'
> > +                    and len(err.args) > 0
> > +                    and err.args[0].startswith('unpack requires a string argument of length') 
> > +                   ):
> 
> This seems like a very awkward way to diagnose the condition "trying to read
> an empty file". I would instead just change the code above where it stats
> the input file to detect if the filesize is < 32 bytes, and do your retry
> loop there.

This condition is not simply testing for an empty file, it was written to specifically limit when the retry loop is allowed to run.  If a non-standard record size is written to disk (32 > x > 0) the condition will be a false positive and failure is not immediate.  The race condition is known to happen with empty files so that is why the conditional is testing for it.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #21)
> Comment on attachment 721204 [details] [diff] [review]
> retry link I/O on potential race conditions
> 
> Review of attachment 721204 [details] [diff] [review]:
> -----------------------------------------------------------------

> @@ +1440,5 @@
> > +def local_argparse():
> > +    """
> > +    Use argparse to parse & validate command line args.
> > +    A dictionary of values will be returned.
> > +
> 
> This feels really extraneous to this patch. I wouldn't bother putting in
> argparse, I'd just make this behavior the default until we fix things so
> that we can remove it.

Question: why would adding formal argument parsing and help text within the script be a problem ?
Because it's not necessary? The script doesn't currently take any optional arguments, so you're adding this purely for the sake of adding the argument you're using, which I think you should just make the default behavior.
(In reply to Joey Armstrong [:joey] from comment #22)
> This condition is not simply testing for an empty file, it was written to
> specifically limit when the retry loop is allowed to run.  If a non-standard
> record size is written to disk (32 > x > 0) the condition will be a false
> positive and failure is not immediate.  The race condition is known to
> happen with empty files so that is why the conditional is testing for it.

This error condition happens because there are < 32 bytes in the file it's trying to read. You're testing this in a roundabout way by catching the specific struct.unpack error we hit in this condition. I think you should just test the result of os.stat directly, since it's testing the same thing but is more straightforward.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #21)
> Comment on attachment 721204 [details] [diff] [review]
> retry link I/O on potential race conditions
> 
> Review of attachment 721204 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: config/rules.mk
> @@ +1358,5 @@
> >  # no need to link together if XPIDLSRCS contains only XPIDL_MODULE
> >  ifneq ($(XPIDL_MODULE).idl,$(strip $(XPIDLSRCS)))
> > +
> > +# based on $(XPIDL_LINK)
> > +XPT_PY ?= $(PYTHON) $(LIBXUL_DIST)/sdk/bin/xpt.py
> 
> I would feel better if you defined XPT_PY in config.mk, and then defined
> XPIDL_LINK as "$(XPT_PY) link". Also, this is sort of broken given the
> xpidl-module-deps below, isn't it? (you'll be adding a dependency on the
> python binary)

Oh also the deps are not broken the added dependency is correct.  We should recompile when tools used to generate content are changed or it will create a hole for testing by leaving stale content on disk and/or not update timestamps.

This is a formula for landmines.  Stale content left on disk by not recompiling.  Whomever follows along later and modifies source to force a recompile is the luck person who has to deal with the untested changes from xpt.py.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #24)
> Because it's not necessary? The script doesn't currently take any optional
> arguments, so you're adding this purely for the sake of adding the argument
> you're using, which I think you should just make the default behavior.

Syntax for argunment handling in the script is a bit awkward as it stands.  Arguments are positional and checking for values like 'link' at *hardcoded* offsets.

If formal command line argument handling is added it will be a lot easier for people to enhance this script later on.  Even trying to add arguments like --debug or --verbose can be difficult with the positional handling.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #25)
> (In reply to Joey Armstrong [:joey] from comment #22)
> > This condition is not simply testing for an empty file, it was written to
> > specifically limit when the retry loop is allowed to run.  If a non-standard
> > record size is written to disk (32 > x > 0) the condition will be a false
> > positive and failure is not immediate.  The race condition is known to
> > happen with empty files so that is why the conditional is testing for it.
> 
> This error condition happens because there are < 32 bytes in the file it's
> trying to read. You're testing this in a roundabout way by catching the
> specific struct.unpack error we hit in this condition. I think you should
> just test the result of os.stat directly, since it's testing the same thing
> but is more straightforward.

Yes testing for <32 would widen the scope to detect problems but the majority of those conditions would not make sense in this loop.  If I were to introduce a typo in xpt.py which generated a bad typedef size, falling into the retry loop attempting to detect a race condition would add needless overhead.  If the record size were to change the compile should fail immediately, which happens now because of the empty file test.
I don't think you should be optimizing for the possibility of someone adding typos in the code. We have unit tests that should catch errors like that.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #29)
> I don't think you should be optimizing for the possibility of someone adding
> typos in the code. We have unit tests that should catch errors like that.

This is not optimization to detect typos it is restricting a condition to detect one very specific error condition.  Widening the check to handle (32 > x > 0) implies a wildcard match that will force the logic block to handle failures it was never intended to.

The only observed/reported cases for the race condition have been touch(file) + size(rec)==0.  If we begin observing size(rec) != 0 that would be the time to expand the conditional into a catchall.
I'm supposed to get this resolved.
Flags: needinfo?(gps)
I /think/ bug 850380 will invalidate this one since all idls and xpts will be defined in a single make file and we won't have to worry about race conditions since all dependencies will be explicit. Perhaps we can close this bug WONTFIX?
Flags: needinfo?(gps)
I was hitting this error and rebuilds didn't resolve it - something had written 0-byte nsILocalFile.xpt and nsILocalFileWin.xpt files, so rebuilding just hit the error again. 

Perhaps I aborted the build (I don't think I did, but can no longer remember), but that shouldn't cause all future builds to fail.  Could hitting the race condition per above leave the 0-byte files?

*At Least* on a 0-byte file, throw a usable error that says what the file was and that it should be removed and a rebuild done.  I was stuck on this for multiple days until I got annoyed enough to try to figure out why (and I didn't want to blow away a partially-done build on windows, given how long they take, and then find it happens again, so I let it sit).
We should probably never write target files directly but write to temporary files and then rename them.
patch with race workaround no longer needed.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WONTFIX
Product: Core → Firefox Build System
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: