Closed Bug 993649 Opened 11 years ago Closed 11 years ago

Android 4.0 Debug JN tests are timing out with HTTP cache v2 on

Categories

(Core :: Networking: Cache, defect)

Hardware: All
OS: Android
Type: defect
Priority: Not set
Severity: normal

Tracking


RESOLVED DUPLICATE of bug 1005475

People

(Reporter: mayhemer, Assigned: mayhemer)

References

Details

The relevant part of the log is:

06:05:02 INFO - REFTEST TEST-LOAD | http://10.12.133.18:30542/jsreftest/tests/jsreftest.html?test=test262/ch11/11.5/11.5.1/S11.5.1_A3_T2.9.js | 852 / 2289 (37%)
06:05:23 INFO - REFTEST TEST-PASS | http://10.12.133.18:30542/jsreftest/tests/jsreftest.html?test=test262/ch11/11.5/11.5.1/S11.5.1_A3_T2.9.js | item 1

That test takes 20 secs to finish. The whole test suite is then killed after a minute. Need logs with cache2:5.
This seems like an infra or test bug that cache2 just exposes more. In the logs from *both good and bad runs* I can see an occasional one-minute stall in some (random) tests, like:

15:24:46 INFO - REFTEST TEST-LOAD | http://10.12.128.18:30096/jsreftest/tests/jsreftest.html?test=test262/ch11/11.4/11.4.1/11.4.1-4.a-17.js | 680 / 2289 (29%)
15:25:46 INFO - REFTEST TEST-PASS | http://10.12.128.18:30096/jsreftest/tests/jsreftest.html?test=test262/ch11/11.4/11.4.1/11.4.1-4.a-17.js | item 1

It took one minute to execute that one particular test; sometimes it's 20 seconds (exactly). In good runs I can see some 50 stalls like these, which matches well with the overall run time of the whole suite: 50 minutes! In bad runs (with cache2) I can see around 60 delayed tests. That is enough to exceed the 1-hour limit for the overall run.

The simplest fix to not block the cache2 landing is to prolong the timeout from 1h to, say, 2h or so. Then somebody familiar with the tests and the infra has to investigate what's wrong.

Geoff, any idea what's going on here?
Flags: needinfo?(gbrown)
Summary: HTTP cache v2: Android 4.0 Debug JN tests are timing out → Android 4.0 Debug JN tests are timing out with HTTP cache v2 on
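A rough way to count those stalls is to scan the host-side log for TEST-LOAD lines whose matching TEST-PASS arrives much later. A minimal sketch, assuming only the line format quoted above; the log file name and the 20-second threshold are placeholders:

import re
from datetime import datetime, timedelta

# Matches the host-side reftest log lines quoted above.
LINE = re.compile(r"^(\d{2}:\d{2}:\d{2}) .*REFTEST (TEST-LOAD|TEST-PASS) \| (\S+)")

def find_stalls(log_path, threshold=timedelta(seconds=20)):
    """Return (url, gap) pairs for tests whose LOAD -> PASS gap exceeds the threshold."""
    loads, stalls = {}, []
    with open(log_path) as f:
        for line in f:
            m = LINE.match(line)
            if not m:
                continue
            ts = datetime.strptime(m.group(1), "%H:%M:%S")
            kind, url = m.group(2), m.group(3)
            if kind == "TEST-LOAD":
                loads[url] = ts
            elif url in loads:
                gap = ts - loads.pop(url)
                if gap >= threshold:
                    stalls.append((url, gap))
    return stalls

print(len(find_stalls("jsreftest.log")))   # hypothetical log file name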
I would not pay much attention to those timestamps. The test, running on the device (pandaboard), is going to write out:

REFTEST TEST-LOAD | http://10.12.128.18:30096/jsreftest...

and then every once in a while -- and I think it might be every 20 seconds or every 60 seconds -- the python script running on the host/server is going to poll the pandaboard and retrieve any new content from the device log. It will add on the timestamp and "INFO", with the timestamp reflecting the time on the host when that portion of the log was retrieved.

Use the logcat timestamp or the NSPR timestamp to get a better idea of the time that something happened during the test:

15:43:15 INFO - 04-15 15:43:08.640 I/PRLog ( 2123): 2014-04-15 15:43:08.641662 UTC - 26602816[5be24100]: continuation state has been reset

Here "04-15 15:43:08.640" is the logcat time and "2014-04-15 15:43:08.641662 UTC" is the NSPR time.

I think we try to keep all test jobs to under 1 hour, but we might be able to run these tests in smaller chunks if necessary.
Flags: needinfo?(gbrown)
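For reference, a sketch of pulling the three timestamps apart from a line like the one above; the regex is written against only that one sample, so treat the exact field layout as an assumption:

import re

SAMPLE = ("15:43:15 INFO - 04-15 15:43:08.640 I/PRLog ( 2123): "
          "2014-04-15 15:43:08.641662 UTC - "
          "26602816[5be24100]: continuation state has been reset")

# host poll time, logcat time, NSPR time -- in the order they appear in the line
TIMES = re.compile(
    r"^(?P<host>\d{2}:\d{2}:\d{2}) \S+ - "                      # host-side timestamp (when the log chunk was polled)
    r"(?P<logcat>\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) .*?: "      # logcat timestamp (device clock)
    r"(?P<nspr>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+ UTC)"    # NSPR log timestamp
)

m = TIMES.match(SAMPLE)
if m:
    # Only the logcat/NSPR times reflect when the event happened on the device;
    # the host time is just when that portion of the log was retrieved.
    print(m.group("host"), "|", m.group("logcat"), "|", m.group("nspr"))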
Thanks Valentin!
Assignee: nobody → valentin.gosu
There are several issues with the reftests.

From what I can tell from tbpl, the first Android 4.0 Debug logs I can find are from somewhere at the beginning of April: https://tbpl.mozilla.org/?rev=723177fe1452 It took about 60 minutes back then too. Some of the recent runs on m-c show the tests taking 60-68 minutes for J1-J3: https://tbpl.mozilla.org/?rev=1d0496e30feb

The Gum tree has always failed the 4.0 Debug jsref tests, but it shows similar and sometimes better performance for the optimized jsref tests. The debug tests just seem to take longer, which pushes them over the time limit.

I too would recommend increasing the time limit for Android 4.0 Debug. We might have to do it anyway, since it's already close to the limit.
Geoff, are you the one to ask for prolonging the timeout here? It seems that the new cache just has more MOZ_ASSERTs and debug checks that prolong the execution; otherwise, in opt builds it performs better. Valentin, thanks for looking into this.
Flags: needinfo?(gbrown)
Yes, I can increase the timeout, or perhaps split the tests into more chunks. I want to be sure we are on the same page though.

On mozilla-central, the total job time is often 60-68 minutes, but the "test time", indicated by something like:

INFO | automation.py | Application ran for: 0:49:41.809401

is often 50 to 55 minutes. *That* is the time that needs to be less than 60 minutes. On Gum, the test time is hitting 60 minutes and only 30% to 50% of the tests are being run in that time. It looks to me like we would need to increase the timeout perhaps as high as 180 minutes.

When I read Comment 8, I thought "oh, these are just running a little slower, pushing it over the timeout", but when I look at the logs, it seems much worse. Do we agree that this is a significant increase in test run time, perhaps as much as 3 times as long as the time required for the same tests on mozilla-central?
Flags: needinfo?(gbrown)
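A quick way to check the number that actually matters, as a sketch built on the "Application ran for:" line and the 60-minute limit described above (the log file name is hypothetical):

import re
from datetime import timedelta

LIMIT = timedelta(minutes=60)   # per-job "test time" limit described above

def application_run_time(log_path):
    """Return the 'Application ran for:' duration from an automation.py log, if present."""
    pat = re.compile(r"Application ran for:\s*(\d+):(\d{2}):(\d{2})\.(\d+)")
    with open(log_path) as f:
        for line in f:
            m = pat.search(line)
            if m:
                h, mi, s, frac = m.groups()
                return timedelta(hours=int(h), minutes=int(mi), seconds=int(s),
                                 microseconds=int(frac[:6].ljust(6, "0")))
    return None

run = application_run_time("jsreftest-J1.log")   # hypothetical file name
if run is not None:
    print("test time", run, "over limit" if run > LIMIT else "within limit")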
(In reply to Geoff Brown [:gbrown] from comment #10)
> When I read Comment 8, I thought "oh, these are just running a little
> slower, pushing it over the timeout", but when I look at the logs, it seems
> much worse.

I didn't have time to check on this so far; I was only assuming.

> Do we agree that this is a significant increase in test run
> time, perhaps as much as 3 times as long as the time required for the same
> tests on mozilla-central?

Hard to say. For me the important part is that the opt build runs in the same or shorter time. As I said, the new cache does a lot of debug-only checks, probably more than the old cache. But 3 times the time for all the tests... hm... that sounds a bit weird.

Valentin, would you be willing to get some more data on where we could be so much slower? Bisecting might help, since this was not happening all the time; it only started a month ago.

If trying to find out becomes too complicated, I think we should just prolong the timeout for now and move this bug (or a new one) to be fixed after we enable the new cache. I don't think this is absolutely critical for turning cache2 on.
Flags: needinfo?(valentin.gosu)
I think the assumption that this is caused by the extra asserts is correct. There are 170 asserts in the old cache, and 512 in the new cache. We can probably settle the question by doing a try run with all of the asserts commented out. I'll give it a shot and see how far that takes us.
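The 170 vs. 512 figures can be reproduced with a simple count along these lines; the directory names are the in-tree locations of the old and new cache, and the exact numbers will vary with revision and with which assertion macros are included:

import os, re

ASSERT = re.compile(r"\bMOZ_ASSERT\w*\s*\(")   # MOZ_ASSERT, MOZ_ASSERT_IF, ...

def count_asserts(root):
    """Count assertion macro uses in .cpp/.h files under the given directory."""
    total = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.endswith((".cpp", ".h")):
                with open(os.path.join(dirpath, name), errors="replace") as f:
                    total += len(ASSERT.findall(f.read()))
    return total

for root in ("netwerk/cache", "netwerk/cache2"):   # old cache vs cache v2
    print(root, count_asserts(root))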
Ok, so from a try run with most of the asserts removed, it turns out there's no significant speedup. I'm still puzzled by the fact that performance is so much worse on DEBUG rather than OPT. I doubt there is a difference in the infrastructure between debug and opt, but there might be a difference in the amount of logging the debug version does. Honza, could this be the case?

https://tbpl.mozilla.org/?tree=Try&rev=cae19dc6848b

(In reply to Honza Bambas (:mayhemer) from comment #11)
> Bisecting might help, since this was not happening all the
> time, only started a month ago.

I was unable to find any Android 4.0 DEBUG run that passes the tests on Gum. I can't run the jsref tests locally; the most I can do is oprofile the debug/opt versions and look for differences during normal browsing.
Flags: needinfo?(valentin.gosu)
You may try to disable #define DEBUG_STATS 1 in netwerk/cache2/CacheIndex.h.

I might have been looking at a wrong test run on gum the first time (I mixed up opt and debug) to support my claim that this only started a few months ago.

You can also try: clone gum, update to some older changesets (just scroll down on tbpl to easily get some useful ones) and do a new try push from that changeset.
https://tbpl.mozilla.org/?tree=Try&rev=38f8bee788b7 I got the J1 tests to pass with an older changeset (99a4d8132178).
https://tbpl.mozilla.org/?tree=Try&rev=b4bad5bb61c2 J1 passes, J3 almost passes, and J2 finishes all the tests but hangs at shutdown. (rev 7dabdb8f2b52 on Gum)
(In reply to Valentin Gosu [:valentin] from comment #15)
> https://tbpl.mozilla.org/?tree=Try&rev=38f8bee788b7
>
> I got the J1 tests to pass with an older changeset (99a4d8132178).

This one is from Jan 17; does Jan 22 [1] work? Does Jan 14 [2] work?

[1] https://tbpl.mozilla.org/?tree=Gum&rev=212b73f337ca
[2] https://tbpl.mozilla.org/?tree=Gum&rev=fa46ddfd2751

Are you just reporting your bisection progress, or is this something that you expected not to work, but it works?
(In reply to Honza Bambas (:mayhemer) from comment #17)
> This one is from Jan 17, does Jan 22 [1] work? does Jan 14 [2] work?

They both pass.

> [1] https://tbpl.mozilla.org/?tree=Gum&rev=212b73f337ca
https://tbpl.mozilla.org/?tree=Try&rev=8e65f60f5bad

> [2] https://tbpl.mozilla.org/?tree=Gum&rev=fa46ddfd2751
https://tbpl.mozilla.org/?tree=Try&rev=0990ef1e7618

> Are you just reporting your bisection progress or is this something that you
> expect not to work, but it works?

I was just reporting on my findings so far.
So, it turns out that the bad commit is somewhere between 25 Mar and 4 Apr:

https://tbpl.mozilla.org/?tree=Try&rev=b4bad5bb61c2 (based on 7dabdb8f2b52) 25 Mar (passes, barely)
https://tbpl.mozilla.org/?tree=Try&rev=9d3221df75e2 (based on e8289dd6c7cc) 4 Apr (fails, by a long shot)

The commits don't seem to make a distinction between DEBUG and OPT, but I'll back out a few of them and hopefully we can find the one.
Ok, it seems that the issue is caused by rev 1e0b6c711512:

https://tbpl.mozilla.org/?tree=Try&rev=fc44ed7ed9a4 based on 1e0b6c711512 fails
https://tbpl.mozilla.org/?tree=Try&rev=651a90132615 based on 299f58206926 passes

I, for one, fail to see anything in https://hg.mozilla.org/projects/gum/rev/1e0b6c711512 that would cause this huge difference on DEBUG. Maybe it calls code that contains #ifdef-ed instructions?
Flags: needinfo?(honzab.moz)
Valentin, thanks a lot for your help here! I will look closely at that changeset and try to find out what's wrong.

Retry with m-c+cache2 (@https://hg.mozilla.org/mozilla-central/rev/6109876e1a22):

https://tbpl.mozilla.org/?tree=Try&rev=3adb877caa9c (expected to fail)
https://tbpl.mozilla.org/?tree=Try&rev=c9cce2ac7e2a (expected to pass)
Flags: needinfo?(honzab.moz)
(In reply to Honza Bambas (:mayhemer) from comment #21)
> https://tbpl.mozilla.org/?tree=Try&rev=3adb877caa9c (expected to fail)
> https://tbpl.mozilla.org/?tree=Try&rev=c9cce2ac7e2a (expected to pass)

Badly chosen base cs: https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=d5b0e9e6a849

Test useless... ://
Blocks: 968593
So, I backed out the patch from gum and pushed to try [1]. It's still timing out. Valentin, are you *100%* sure the changeset is the one where this bug starts?

[1] https://tbpl.mozilla.org/?tree=Try&rev=77ce41d966c6
Flags: needinfo?(valentin.gosu)
J1 (92%), J2 (87%), J3 (69%) is still about twice as fast as the most current gum revision, J1 (45%), J2 (54%), J3 (37%), so there is definitely something there. I think there may be another, more recent revision to blame for the additional delay.
Flags: needinfo?(valentin.gosu)
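For the record, "twice as fast" here is just the ratio of how far each chunk got before the one-hour kill, using the percentages quoted above:

# Fraction of each chunk completed before the job was killed.
backed_out  = {"J1": 0.92, "J2": 0.87, "J3": 0.69}
current_gum = {"J1": 0.45, "J2": 0.54, "J3": 0.37}

for chunk in backed_out:
    print(chunk, round(backed_out[chunk] / current_gum[chunk], 2))
# J1 2.04, J2 1.61, J3 1.86 -- roughly double the throughput in the same hour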
Backout of Parsable context key and Memory pool rework from m-c test: https://tbpl.mozilla.org/?tree=Try&rev=7b98ec4ec6e7
Blocks: 986179
The two bugs causing this DEBUG-only slowdown of the jsreftests: bug 986179 and bug 968593.

I have logs with timestamps from both cache1 and cache2 runs; I'm going to look into them to find where we are actually slow. Hopefully that will give an answer. If this turns out to be too long a trip, I would suggest prolonging the timeout for now and investigating (and fixing) the actual cause later.
So, the commit from comment 25 was wrong - cache2 was off. This one is correct, and still timing out!

https://tbpl.mozilla.org/?tree=Try&rev=08a92424b812

I'm getting lost here!
Seems like we sometimes get stuck in CacheEntry::AsyncOpen (for a replaced entry - truncating) for more than a second. I see this more or less regularly in the logs. The method is simple; however, it enters the entry's lock. That lock, at the moment, should not be known to anyone else. Needs more investigation.
Assignee: valentin.gosu → honzab.moz
Theory: since we lock under lock under lock (concretely: CacheService -> CacheEntry -> CacheFile) - only in this direction! - the deadlock detection may get crazy about it. I often see blocks (seconds) when one of these chained lock acquisitions is happening. Bug 1005475 tries to remove some of them.
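A toy illustration of the one-way acquisition order described above, in Python threading rather than the actual mozilla::Mutex classes; it only shows the nesting direction, while the suspected overhead comes from the debug-only deadlock-detection bookkeeping that real builds run on every acquire:

import threading

# One lock per layer, always taken in this order and never in reverse.
service_lock = threading.Lock()   # stands in for CacheService's lock
entry_lock   = threading.Lock()   # stands in for CacheEntry's lock
file_lock    = threading.Lock()   # stands in for CacheFile's lock

def open_entry():
    # Nested acquisition, strictly CacheService -> CacheEntry -> CacheFile.
    with service_lock:
        with entry_lock:
            with file_lock:
                pass  # work on the underlying file

# Because the order is one-way, no cycle (and hence no deadlock) is possible,
# but a debug-build deadlock detector still has to record every nested acquire.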
Loosely changing deps here.
No longer blocks: 968593, 986179
Depends on: 1005475
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → DUPLICATE