Closed Bug 1116376 Opened 5 years ago Closed 5 years ago

Intermittent jsreftest.html?test=js1_5/Regress/regress-360969-05.js | application timed out after 330 seconds with no output

Categories

(Core :: JavaScript Engine, defect)

x86
Linux
defect
Not set

Tracking


RESOLVED FIXED
mozilla39
Tracking Status
firefox37 --- wontfix
firefox38 --- fixed
firefox39 --- fixed
firefox-esr31 --- unaffected
b2g-v2.2 --- fixed
b2g-master --- fixed

People

(Reporter: philor, Assigned: arai)

References

Details

(Keywords: intermittent-failure)

Attachments

(8 files, 1 obsolete file)

No description provided.
Blocks: 1116663
Locally, regress-360969-05.js runs within 30 seconds if I run only regress-360969-*.js in jsreftest, but if I run all the tests under js1_5/, regress-360969-05.js takes too long and hits the timeout.

Then, if I add `gc()` before the loop, regress-360969-05.js takes 30 seconds even in the latter case, so I guess a GC (maybe a kind of partial GC?) took a long time during the loop.

I added `gc()` to all tests for bug 360969, so bug 1122413, bug 1137221, and bug 1116663 should also be fixed by this patch.

The try run is green, and regress-360969-05.js took 11 seconds there:
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=147f51049d70
Attachment #8572829 - Flags: review?(jcoppeard)
(In reply to Tooru Fujisawa [:arai] from comment #227)
Thanks for the patch.  I agree that makes the test pass, but I'd really like to know what is going wrong here.  GCs shouldn't make the test take so long.  So I'm going to leave the request outstanding until I work out what to do here.

It's interesting that you managed to reproduce this locally, though; I haven't been able to do that.

Also, it's possible that the fix for bug 1139832 might cure this.
Thank you for your feedback :)
Okay, I'll also try to investigate what's actually going on there, and try the patch from bug 1139832 on a local build.

I used a virtual machine on VirtualBox to reproduce it:
  Host
    Debian jessie 64bit (Linux 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt2-1 (2014-12-08) x86_64 GNU/Linux)
    CPU: Intel(R) Core(TM) i5 CPU 650 @ 3.20GHz
    RAM: 4GB
  Guest
    Debian jessie 32bit (Linux 3.16.0-4-586 #1 Debian 3.16.7-ckt4-3 (2015-02-03) i686 GNU/Linux)
    Processors: 2
    RAM: 2GB
I added logging in GCRuntime::collect and tried both the pass and fail cases (without my gc() patch).

When regress-360969-05.js passes, it's called twice during the loop, with incremental=true and reason=ALLOC_TRIGGER, and the test takes 31 seconds.

When regress-360969-05.js fails, it's called 212 times during the loop (roughly every 3 seconds), with incremental=true and reason=ALLOC_TRIGGER. It's also called twice before the loop, with incremental=true and reason=INTER_SLICE_GC. The test takes 767 seconds.

I'll add more logging.
testing patch for logging.
I'll post results for the pass and failure cases.
Attached file log_pass.txt
Attached file log_fail.txt
In the failure case, GC is performed many times but not much memory is freed.
(In reply to Tooru Fujisawa [:arai] from comment #245)
Thanks for logs!  212 collections is really surprising.  Where do the byte values in your logs come from?
They come from `usage.gcBytes()` in the following 2 macros, called in GCRuntime::collect, in attachment 8573619 [details] [diff] [review].
>+#define LOG_BEFORE() fprintf(stderr, "****collect**** [%d: %s] - before: %10zu bytes\n", incremental, reasonToString(reason), usage.gcBytes());
>+#define LOG_AFTER(MESSAGE) fprintf(stderr, "****collect**** [%d: %s] -  after: %10zu bytes (%s)\n", incremental, reasonToString(reason), usage.gcBytes(), MESSAGE);

Now I'm adding logging to GCRuntime::maybeAllocTriggerZoneGC to print usedBytes, thresholdBytes, and igcThresholdBytes, since all the reasons are ALLOC_TRIGGER.
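For illustration only (this is a rough sketch, not the contents of attachment 8573952): the extra logging can follow the same style as the macros above, printing the three values each time the trigger check runs. The local names usedBytes, thresholdBytes, and igcThresholdBytes are assumptions about the variables inside maybeAllocTriggerZoneGC:

  #define LOG_TRIGGER() fprintf(stderr, "****maybeAllocTriggerZoneGC**** used: %10zu threshold: %10zu igcThreshold: %10zu\n", usedBytes, thresholdBytes, igcThresholdBytes);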
testing patch for logging, in GCRuntime::maybeAllocTriggerZoneGC.
Attached file log_pass2.txt
Log for attachment 8573952 [details] [diff] [review], with the test passing.
Attached file log_fail2.txt
Log for attachment 8573952 [details] [diff] [review], with the test failing.

In the pass case, triggerZoneGC is called only under the `usedBytes >= igcThresholdBytes` condition.

In the failure case, triggerZoneGC is also called under the `usedBytes >= thresholdBytes` condition, and that path is hit many times.
So maybe the issue is related to the value of thresholdBytes?
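To make the two conditions concrete, here is a minimal standalone sketch of the trigger decision as read from the logs; it is a toy model over plain byte counts, not the real maybeAllocTriggerZoneGC, which operates on zone state:

  #include <cstddef>

  // Toy model of the two ALLOC_TRIGGER conditions seen in the logs.
  // Returns true if an allocation check at this point would trigger a GC.
  bool wouldTriggerAllocGC(size_t usedBytes, size_t thresholdBytes, size_t igcThresholdBytes)
  {
      if (usedBytes >= thresholdBytes)
          return true;   // the branch the failure logs hit repeatedly
      if (usedBytes >= igcThresholdBytes)
          return true;   // the only branch seen in the pass logs
      return false;
  }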
(In reply to Tooru Fujisawa [:arai] from comment #251)
So we are getting many ALLOC_TRIGGER GCs where the threshold for the zone stays the same (5259264 bytes).  This indicates that we aren't collecting that zone, because even if we collected a zone but didn't free anything in it, we would still update (i.e. increase) its threshold value afterwards.

So given that this test creates a lot of local variables, I'm guessing that this is the atoms zone we are triggering on.  We're trying to do a zone GC, but we don't actually collect the atoms zone unless we're doing a full GC (as any zone may contain references to atoms and they're not in the cross-compartment map).  So actually we will not collect this zone.

Then, since I fixed bug 1032750, we will not update the threshold values for zones we didn't collect, and so we get this problem.  Checking that bug reveals it landed on 30/12/14... the date this started happening.
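A minimal sketch of the feedback loop this describes, under the assumption stated above that thresholds are only recomputed for zones that were actually collected; this is a toy model, not SpiderMonkey code, and the growth factor is arbitrary:

  #include <cstddef>

  struct ZoneModel {
      size_t usedBytes;       // live bytes in the zone (e.g. the atoms zone)
      size_t thresholdBytes;  // ALLOC_TRIGGER threshold for this zone
      bool collected;         // was this zone actually collected by the last GC?
  };

  // After a GC, only collected zones get a new (larger) threshold.  An
  // uncollected zone -- like the atoms zone during a zone GC -- keeps its old
  // threshold (5259264 bytes in the logs), so the next allocation check
  // requests yet another GC.
  void updateThresholdAfterGC(ZoneModel &zone, double growthFactor)
  {
      if (zone.collected)
          zone.thresholdBytes = static_cast<size_t>(zone.usedBytes * growthFactor);
      // else: threshold unchanged, so usedBytes >= thresholdBytes stays true
      // and ALLOC_TRIGGER fires again almost immediately.
  }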
(In reply to Jon Coppeard (:jonco) from comment #253)
...wait, zone GC is not the culprit; we already take account of this in GCRuntime::triggerZoneGC() and trigger a full GC if a zone GC of the atoms zone is requested.  Maybe collection of atoms is being suppressed for another reason.
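For reference, the behaviour referred to here can be sketched as follows; this is a simplified rendering of the decision described in the comment, not the actual GCRuntime::triggerZoneGC body:

  // Toy sketch: a zone GC requested for the atoms zone is upgraded to a full
  // GC, since the atoms zone can only be collected by a full GC.  Even then,
  // the atoms zone may still be skipped if atom collection is suppressed for
  // some other reason (the possibility raised above).
  enum class GCKind { Zone, Full };

  GCKind chooseGCKind(bool isAtomsZone)
  {
      return isAtomsZone ? GCKind::Full : GCKind::Zone;
  }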