Closed Bug 1215834 Opened 9 years ago Closed 8 years ago

Intermittent OverflowError: timeout period too long

Categories

(Core :: JavaScript Engine: JIT, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking


RESOLVED FIXED
Tracking Status
firefox44 --- affected

People

(Reporter: philor, Unassigned)

Details

(Keywords: intermittent-failure)

Attachments

(5 files, 4 obsolete files)

https://treeherder.mozilla.org/logviewer.html#?job_id=5181180&repo=fx-team

 19:35:54     INFO -  TEST-PASS | tests/jit-test/jit-test/tests/generators/throw-closes.js | Success (code 0, args "--ion-eager --ion-offthread-compile=off --non-writable-jitcode --ion-check-range-analysis --ion-extra-checks --no-sse3 --no-threads")
 19:35:54     INFO -  Traceback (most recent call last):
 19:35:54     INFO -    File "/builds/slave/test/build/tests/jit-test/jit-test/jit_test.py", line 294, in <module>
 19:35:54     INFO -      main(sys.argv[1:])
 19:35:54     INFO -    File "/builds/slave/test/build/tests/jit-test/jit-test/jit_test.py", line 282, in main
 19:35:54     INFO -      ok = jittests.run_tests(job_list, prefix, options)
 19:35:54     INFO -    File "/builds/slave/test/build/tests/jit-test/tests/lib/jittests.py", line 545, in run_tests
 19:35:54     INFO -      ok = process_test_results(gen, num_tests, pb, options)
 19:35:54     INFO -    File "/builds/slave/test/build/tests/jit-test/tests/lib/jittests.py", line 483, in process_test_results
 19:35:54     INFO -      for i, res in enumerate(results):
 19:35:54     INFO -    File "/builds/slave/test/build/tests/jit-test/tests/lib/tasks_unix.py", line 203, in run_all_tests
 19:35:54     INFO -      read_input(tasks, timeout)
 19:35:54     INFO -    File "/builds/slave/test/build/tests/jit-test/tests/lib/tasks_unix.py", line 107, in read_input
 19:35:54     INFO -      readable, _, _ = select.select(rlist, [], exlist, timeout)
 19:35:54     INFO -  OverflowError: timeout period too long
 19:35:54    ERROR - Return code: 1
 19:35:54    ERROR - No tests run or test summary not found
Attached patch log.patch — Splinter Review
This is really strange:

- in tasks_unix.py, read_input uses a timeout that run_all_tests computes as get_max_wait(tasks, options.timeout), and the value is not modified in the function itself.
- reading get_max_wait, the result *has* to be between 0 and ProgressBar.update_granularity(), that is 0.1 seconds.
- on my machine, select.select([0], [], [], X) starts to throw OverflowError for X > 2**64. If the value is negative, it throws InvalidError; it seems to work fine with floating-point values (e.g. 1.5) as well (a minimal repro is sketched after this comment).

So there is something really fishy going on here. VM bug? Float rounding error in the loop in get_max_wait?

It is sad that this is an intermittent issue. We should do something about it, at least log the timeout value when this fails, to understand more precisely what's happening.
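
For reference, a minimal sketch that reproduces the overflow described above (assuming a POSIX Python like the one driving these tests; the exact threshold at which select rejects the timeout varies by platform and Python version):

    from __future__ import print_function
    import os
    import select
    import sys

    # A pipe gives select() a real file descriptor to wait on.
    r, w = os.pipe()

    try:
        # Timeouts on the order of the values later seen in the logs
        # (e.g. 3.6e+159 seconds) cannot fit in the timeval that select()
        # uses internally, so CPython raises OverflowError (on Python 2.7
        # the message is the "timeout period too long" from this bug's title).
        select.select([r], [], [], 1e159)
    except OverflowError as exc:
        print("select rejected the timeout:", exc, file=sys.stderr)
    finally:
        os.close(r)
        os.close(w)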
Attachment #8732831 - Flags: review?(sphink)
Attached patch overflow.patch (obsolete) — Splinter Review
Random idea: clamp the value returned by get_max_wait to a maximum, just in case there's a non-obvious floating-point error in the loop. And use total_seconds as defined on the timedelta object, rather than an external helper function.

Not setting r? yet, I'd like to see what logging tells us about the timeout value first.
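
A minimal sketch of that clamping idea, following the get_max_wait shape in tasks_unix.py (the Task and ProgressBar definitions here are simplified stand-ins for the harness's real classes):

    import datetime
    from collections import namedtuple

    # Simplified stand-ins for the harness's classes.
    Task = namedtuple("Task", "start")

    class ProgressBar(object):
        @staticmethod
        def update_granularity():
            # The real harness refreshes its progress bar every 0.1 seconds.
            return datetime.timedelta(seconds=0.1)

    def get_max_wait(tasks, timeout):
        """Return how long read_input may block, in seconds: the time until
        the next event of interest (progress-bar tick or test timeout)."""
        now = datetime.datetime.now()
        timeout_delta = datetime.timedelta(seconds=timeout)

        wait = ProgressBar.update_granularity()
        for task in tasks:
            remaining = task.start + timeout_delta - now
            if remaining < wait:
                wait = remaining

        # Clamp, just in case: never more than the granularity, never negative,
        # and use timedelta.total_seconds() directly instead of a helper.
        wait = min(wait, ProgressBar.update_granularity())
        return max(wait.total_seconds(), 0)

    # Example: one task that started 2 seconds ago, with a 600s per-test timeout.
    tasks = [Task(start=datetime.datetime.now() - datetime.timedelta(seconds=2))]
    print(get_max_wait(tasks, 600))  # 0.1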
Comment on attachment 8732831 [details] [diff] [review]
log.patch

Review of attachment 8732831 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/tests/lib/tasks_unix.py
@@ +109,5 @@
> +    try:
> +        readable, _, _ = select.select(rlist, [], exlist, timeout)
> +    except OverflowError as e:
> +        print("timeout value", timeout)
> +        raise e

minor nit: the 'e' is unneeded, this can be

  except OverflowError:
    print("timeout value", timeout)
    raise

...except I'd also like that output to go to stderr, not stdout, so either make that

  print >> sys.stderr, "timeout value", timeout

or

  from __future__ import print_function
  ...
  print("timeout value", timeout, file=sys.stderr)

Not that any of this is important enough to bother with for a temporary diagnostic aid. :-) I didn't trace through the code to try to guess how we're getting the dang OverflowError in the first place; your analysis sounds about right and scary enough for me to stay away.
Attachment #8732831 - Flags: review?(sphink) → review+
Thank you for the quick review! Let's see what the logs have to say...
Keywords: leave-open
Attached patch followup.patch — Splinter Review
Looking at the latest oranges, I was worried that this had stopped reproducing. I think the intermittent error has just moved since then...

This is a major case of "duh".

Anybody feel free to steal review.
Attachment #8734305 - Flags: review?(sphink)
Attachment #8734305 - Flags: review?(sphink) → review+
Wow, the values are pretty scary, here's a sample:

https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=24795816
07:51:39 INFO - timeout value 3.61596749018e+159

https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=25073470
09:40:57 INFO - timeout value 2.22637487449e+246

https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=3574815
15:34:46 INFO - timeout value 4.45822451776e+246

Making a new patch that tries a few fixes based on totally random intuitions.
Attached patch maxwait.patch (obsolete) — Splinter Review
I suspect three possible causes:
- a Python VM error, in which case there's not much we can do.
- an IEEE 754 rounding error?
- run_all_tests is a generator: it reads options.timeout in a loop that yields results at some point. What if the caller, after reading a yielded value, somehow changes options.timeout to store something else there, confusing the next iteration of run_all_tests? To make sure this can't happen, just read the value of options.timeout once before the loop, so that it can't change (I've checked that the other callers using options.timeout aren't supposed to modify it; see the sketch below).

I haven't removed the try/except block, just in case it still happens with this patch. If you have any other idea of what might cause these issues, I'd take them!
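
A minimal sketch of the "read it once" idea from the last point above (the generator is heavily simplified; Options and the loop body are stand-ins for the real harness code):

    class Options(object):
        timeout = 600  # seconds, as passed on the harness command line

    def run_all_tests(options):
        # Capture options.timeout once, before the loop; the generator then
        # keeps using this value even if the object is mutated between yields.
        timeout = options.timeout
        for test_number in range(3):
            # Stand-in for: spawn the shell, call get_max_wait/read_input, etc.
            yield test_number, timeout

    options = Options()
    results = run_all_tests(options)
    print(next(results))        # (0, 600)
    options.timeout = 1e300     # simulate accidental clobbering by a caller
    print(next(results))        # still (1, 600): the captured value is used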
Attachment #8732833 - Attachment is obsolete: true
Attachment #8737732 - Flags: review?(sphink)
Comment on attachment 8737732 [details] [diff] [review]
maxwait.patch

Review of attachment 8737732 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry, I keep meaning to look closer at this.

I definitely like the removal of total_seconds(). It was kind of weird that that was there in the first place.

I'm fine with this patch, but it does feel like it papers over the problem a little too completely. Can you have it print out the values when they are out of bounds, before clamping them down to sanity?
Attachment #8737732 - Flags: review?(sphink) → review+
(In reply to Steve Fink [:sfink] [:s:] from comment #27)
> I'm fine with this patch, but it does feel like it papers over the problem a
> little too completely. Can you have it print out the values when they are
> out of bounds, before clamping them down to sanity?

Yeah, this patch is indeed dumb and just blindly trying things.

About the suggestion to print out the values: that sounds like a good idea, but shouldn't it be done before trying to fix the problem? If the problem somehow gets fixed by the patch above, then the printed values won't be easy to find (this intermittent happens once or twice a day at most, on average, and I am not aware of any tool to crawl the logs searching for a pattern, but maybe there's something?).
Flags: needinfo?(sphink)
Ok, instead of the previous patch that tries to fix the problem, let's land more diagnostics, since this still happens and we're not quite sure of the cause.
Flags: needinfo?(sphink)
Attachment #8747657 - Flags: review?(sphink)
(I wonder if this could be an underflow/overflow in the additions/subtractions in the loop, if one machine running the task doesn't have its clock set correctly.)
Comment on attachment 8747657 [details] [diff] [review]
morediagnostic.patch

Review of attachment 8747657 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry for the long review delay.

I read through the code a bit. The code is a little weird. get_max_wait computes min(update_granularity, time remaining for all tasks). So the name is a little deceptive; the documentation says "Return the maximum time we can wait before any task should time out." It's a very odd way of describing it; I would say something like "Return the time remaining until the earliest upcoming event of interest, where an event is either a progress bar update or a test timing out." But I can see how the current naming came to be.

A NullProgressBar's update_granularity will be timedelta.max. On my machine, that is datetime.timedelta(999999999, 86399, 999999), with total_seconds of 86400000000000.0 (10^9 days), aka 8.64e13. Which is way less than the observed values. But perhaps some other python version has a timedelta.max that uses INT_MAX or something stupid?
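
For reference, that observation is easy to check from a Python prompt (output shown is from a 64-bit CPython; this only verifies the numbers quoted above):

    import datetime

    print(datetime.timedelta.max)
    # 999999999 days, 23:59:59.999999
    print(datetime.timedelta.max.total_seconds())
    # 86400000000000.0, i.e. ~8.64e13 -- far below the observed 1e159+ values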

I would be tempted to go full-out. After the for loop, check if wait > ProgressBar.update_granularity. If so:

  for task in tasks:
      print >> sys.stderr, "start={} elapsed={} remaining={}".format(task.start, (now - task.start).total_seconds(), (task.start + timeout_delta - now).total_seconds())

::: js/src/tests/lib/tasks_unix.py
@@ +77,5 @@
> +    ret = max(total_seconds(wait), 0)
> +
> +    if ret > total_seconds(ProgressBar.update_granularity()):
> +        print >> sys.stderr, "timeout/return value/udpate_granularity in get_max_wait:", \
> +                ret, timeout, total_seconds(ProgressBar.update_granularity())

"return value/timeout/update_granularity in get_max_wait:". And print out 'now' as well.
Attachment #8747657 - Flags: review?(sphink) → review+
Fun: none of the logs from the previous comment contains the new diagnostic strings, which means that get_max_wait is always correct and we are somehow losing the value at some other point. That said, bug 1234743 just taught me there is a hard limit above us: the process that runs jit_test.py has a hard limit of 1000 seconds, so I'm willing to try clamping the value to a maximum of 970 seconds (leaving 30 seconds to recover from a timeout). This will move a lot of intermittent failures from bug 1234743 or from this bug to other crashes, but at least we should be closer to the source of the error and get more meaningful error messages, tied to precise test cases.
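
A minimal sketch of that clamp (970/1000 are the values from this comment; the constant and function names are made up for illustration):

    # The mozharness runner kills the whole jit_test.py invocation after
    # 1000 seconds (bug 1234743), so cap any single wait at 970 seconds:
    # that leaves ~30 seconds to report the timed-out test instead of the
    # harness being killed with no useful output.
    MAX_WAIT_SECONDS = 970  # hypothetical constant name

    def clamp_wait(seconds):
        return max(0, min(seconds, MAX_WAIT_SECONDS))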
Attached patch timeout.patch (obsolete) — Splinter Review
To be tested once try reopens, but after some long digging, I think this should do the right thing and fix bug 1234743 as well.
Let's land this part, independently of the possible fix to this intermittent.
Attachment #8737732 - Attachment is obsolete: true
Attachment #8756302 - Flags: review?(sphink)
Attached patch 2.timeoutoption.patch (obsolete) — Splinter Review
Pass a --timeout option to the JS shell. Jordan, I've put it here because other test suites have similar special options being passed here; please let me know if there is a better place (having a config option in the config files would seem unfortunate, as we'd need to repeat the key/value pair in every single config file).

This should solve the current intermittent because of the way we compute the max wait in https://dxr.mozilla.org/mozilla-central/source/js/src/tests/lib/tasks_unix.py#58.

As the timeout given to the shell is per test, this won't solve the other intermittent as I had thought (imagine two tests timing out in sequence: the second one would time out after the process running jit_test.py had itself timed out). If we still get intermittents here after this, it means we have a state inconsistency in the jit_test.py script (either options.timeout getting clobbered as we iterate over the tests, the Python VM being broken, or something else).
Attachment #8755438 - Attachment is obsolete: true
Attachment #8756306 - Flags: review?(jlund)
Attachment #8756302 - Flags: review?(sphink) → review+
Comment on attachment 8756306 [details] [diff] [review]
2.timeoutoption.patch

Review of attachment 8756306 [details] [diff] [review]:
-----------------------------------------------------------------

You mentioned not wanting to add this option to each config, which makes sense. However, there is an argument about whether it is better to update each config file when the value changes or to add complexity to the script by handling edge cases.

It has been a while since I looked at all this stuff, but aren't there just three places you would have to add a timeout to: https://dxr.mozilla.org/mozilla-central/search?q=path%3Amozharness%20path%3Aconfigs%20jittest&=mozilla-central&redirect=true ?

i.e. add '--timeout=970' here https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/configs/unittests/linux_unittest.py#88

feel free to push back. clearing r? for now. willing to r+ if you feel strongly about your solution or if I missed something.
Attachment #8756306 - Flags: review?(jlund)
Thank you for the feedback!

(In reply to Jordan Lund (:jlund) from comment #41)
> Comment on attachment 8756306 [details] [diff] [review]
> 2.timeoutoption.patch
> 
> Review of attachment 8756306 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> you mentioned not wanting to add this option to each config which makes
> sense however there is the argument of whether it is better to update each
> config file on change or add complexity to the script by handling edge cases.
> 
> It has been a while since I looked at all this stuff but isn't there just
> three places you would have to add a time out to:
> https://dxr.mozilla.org/mozilla-central/
> search?q=path%3Amozharness%20path%3Aconfigs%20jittest&=mozilla-
> central&redirect=true ?
> 
> i.e. add '--timeout=970' here
> https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/configs/
> unittests/linux_unittest.py#88
> 
> feel free to push back. clearing r? for now. willing to r+ if you feel
> strongly about your solution or if I missed something.

My (other) worry with this approach is that it would create an implicit dependency between values in different files: the timeout for the process that runs jit_test.py would be set in desktop_unittest.py (i.e. 1000), while the timeout for jit_test.py itself would be set in linux_unittest.py (970). Of course, we can add comments in both places gently asking to keep the two values in sync, but a comment can easily be missed. What do you prefer?
Flags: needinfo?(jlund)
(In reply to Benjamin Bouvier [:bbouvier] from comment #42)
> Thank you for the feedback!
> 
> (In reply to Jordan Lund (:jlund) from comment #41)
> > Comment on attachment 8756306 [details] [diff] [review]
> > 2.timeoutoption.patch
> > 
> > Review of attachment 8756306 [details] [diff] [review]:
> > -----------------------------------------------------------------
> > 
> > you mentioned not wanting to add this option to each config which makes
> > sense however there is the argument of whether it is better to update each
> > config file on change or add complexity to the script by handling edge cases.
> > 
> > It has been a while since I looked at all this stuff but isn't there just
> > three places you would have to add a time out to:
> > https://dxr.mozilla.org/mozilla-central/
> > search?q=path%3Amozharness%20path%3Aconfigs%20jittest&=mozilla-
> > central&redirect=true ?
> > 
> > i.e. add '--timeout=970' here
> > https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/configs/
> > unittests/linux_unittest.py#88
> > 
> > feel free to push back. clearing r? for now. willing to r+ if you feel
> > strongly about your solution or if I missed something.
> 
> My (other) worry with this approach was that this would make an implicit
> requirement among values in different files: the timeout option for the
> process that runs jit_tests.py would be set in desktop_unittest.py (ie
> 1000), the timeout option for jit_tests.py would be set in linux_unittest.py
> (970). Of course, we can add comments in both places gently asking to keep
> in sync between the two different values, but a comment can easily be
> missed. What do you prefer?

If we want to make sure that the mozharness timeout and the subprocess timeout stay in sync 30s apart, e.g. 1000 / 970, maybe we could add something like:

'run_timeout': 1000, after this line https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/configs/unittests/linux_unittest.py#96

that way it's configured in the config and all you need to add is something like the following to the script:
`cmd_timeout = c["suite_definitions"][suite_category].get('run_timeout', 1000)`
`if suite_category == 'cppunittest': cmd_timeout = 2500`
here: https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/scripts/desktop_unittest.py#612

of course it would be better if we added a run_timeout to all the 'cppunittest' configs so we don't need the second line but that's probably out of scope for this bug :)

Does this make sense or am I misunderstanding? Feel free to ping me on IRC for faster turnaround. :)
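
Putting that suggestion together, a minimal sketch (the run_timeout key and the 1000/2500 values come from the comment above; the surrounding dict layout and the helper name are simplified assumptions about the mozharness code):

    # In a mozharness config such as configs/unittests/linux_unittest.py,
    # inside the jittest suite definition:
    SUITE_DEFINITIONS_FRAGMENT = {
        "jittest": {
            "options": ["--timeout=970"],  # per-test timeout handed to jit_test.py
            "run_timeout": 1000,           # hard limit for the whole harness run
        },
    }

    # In scripts/desktop_unittest.py, read the limit back from the config,
    # defaulting to 1000, with the cppunittest special case kept for now:
    def get_cmd_timeout(config, suite_category):
        cmd_timeout = config["suite_definitions"][suite_category].get("run_timeout", 1000)
        if suite_category == "cppunittest":
            cmd_timeout = 2500
        return cmd_timeout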
Flags: needinfo?(jlund)
So something like this?
Attachment #8756306 - Attachment is obsolete: true
Attachment #8757933 - Flags: review?(jlund)
Comment on attachment 8757933 [details] [diff] [review]
2.timeoutoption.patch

Review of attachment 8757933 [details] [diff] [review]:
-----------------------------------------------------------------

looks great to me! :)
Attachment #8757933 - Flags: review?(jlund) → review+
No intermittents of this kind for 3 weeks (as stated earlier, I think they have moved to another intermittent instead, bug 1234743):

https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1215834&startday=2016-05-23&endday=2016-06-20&tree=all
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open