Closed Bug 993649 Opened 11 years ago Closed 11 years ago

Android 4.0 Debug JN tests are timing out with HTTP cache v2 on

Categories

(Core :: Networking: Cache, defect)

Hardware: All
OS: Android
Type: defect
Priority: Not set
Severity: normal

Tracking


RESOLVED DUPLICATE of bug 1005475

People

(Reporter: mayhemer, Assigned: mayhemer)

References

Details

The relevant part of the log is:

06:05:02 INFO - REFTEST TEST-LOAD | http://10.12.133.18:30542/jsreftest/tests/jsreftest.html?test=test262/ch11/11.5/11.5.1/S11.5.1_A3_T2.9.js | 852 / 2289 (37%)
06:05:23 INFO - REFTEST TEST-PASS | http://10.12.133.18:30542/jsreftest/tests/jsreftest.html?test=test262/ch11/11.5/11.5.1/S11.5.1_A3_T2.9.js | item 1

That test takes 20 secs to finish. The whole test suite is then killed after a minute. Need logs with cache2:5.
This seems like an infra or test bug that cache2 just exposes more. In the logs from *both good and bad runs* I can see an occasional one-minute stall in some (random) tests, like:

15:24:46 INFO - REFTEST TEST-LOAD | http://10.12.128.18:30096/jsreftest/tests/jsreftest.html?test=test262/ch11/11.4/11.4.1/11.4.1-4.a-17.js | 680 / 2289 (29%)
15:25:46 INFO - REFTEST TEST-PASS | http://10.12.128.18:30096/jsreftest/tests/jsreftest.html?test=test262/ch11/11.4/11.4.1/11.4.1-4.a-17.js | item 1

It took one minute to execute that one particular test; sometimes it's 20 seconds (exactly). In good runs I can see some 50 stalls like these, which matches well with the overall run time of the whole suite: 50 minutes! In bad runs (with cache2) I can see around 60 delayed tests. That is enough to exceed the 1-hour limit for the overall run.

The simplest fix to not block the cache2 landing is to prolong the timeout from 1h to, say, 2h or so. Then somebody familiar with the tests and the infra has to investigate what's wrong.

Geoff, any idea what's going on here?
Flags: needinfo?(gbrown)
Summary: HTTP cache v2: Android 4.0 Debug JN tests are timing out → Android 4.0 Debug JN tests are timing out with HTTP cache v2 on
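A rough way to count those stalls is to scan the host-side log for TEST-LOAD lines whose matching TEST-PASS arrives much later. A minimal sketch, assuming only the line format quoted above; the log file name and the 20-second threshold are placeholders:

import re
from datetime import datetime, timedelta

# Matches the host-side reftest log lines quoted above.
LINE = re.compile(r"^(\d{2}:\d{2}:\d{2}) .*REFTEST (TEST-LOAD|TEST-PASS) \| (\S+)")

def find_stalls(log_path, threshold=timedelta(seconds=20)):
    """Return (url, gap) pairs for tests whose LOAD -> PASS gap exceeds the threshold."""
    loads, stalls = {}, []
    with open(log_path) as f:
        for line in f:
            m = LINE.match(line)
            if not m:
                continue
            ts = datetime.strptime(m.group(1), "%H:%M:%S")
            kind, url = m.group(2), m.group(3)
            if kind == "TEST-LOAD":
                loads[url] = ts
            elif url in loads:
                gap = ts - loads.pop(url)
                if gap >= threshold:
                    stalls.append((url, gap))
    return stalls

print(len(find_stalls("jsreftest.log")))   # hypothetical log file name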
I would not pay much attention to those timestamps. The test, running on the device (pandaboard), is going to write out:

REFTEST TEST-LOAD | http://10.12.128.18:30096/jsreftest...

and then every once in a while -- and I think it might be every 20 seconds or every 60 seconds -- the python script running on the host/server is going to poll the pandaboard and retrieve any new content from the device log. It will add on the timestamp and "INFO", with the timestamp reflecting the time on the host when that portion of the log was retrieved.

Use the logcat timestamp or the NSPR timestamp to get a better idea of the time that something happened during the test:

15:43:15 INFO - 04-15 15:43:08.640 I/PRLog ( 2123): 2014-04-15 15:43:08.641662 UTC - 26602816[5be24100]: continuation state has been reset

Here "04-15 15:43:08.640" is the logcat time and "2014-04-15 15:43:08.641662 UTC" is the NSPR time.

I think we try to keep all test jobs to under 1 hour, but we might be able to run these tests in smaller chunks if necessary.
Flags: needinfo?(gbrown)
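For reference, a sketch of pulling the three timestamps apart from a line like the one above; the regex is written against only that one sample, so treat the exact field layout as an assumption:

import re

SAMPLE = ("15:43:15 INFO - 04-15 15:43:08.640 I/PRLog ( 2123): "
          "2014-04-15 15:43:08.641662 UTC - "
          "26602816[5be24100]: continuation state has been reset")

# host poll time, logcat time, NSPR time -- in the order they appear in the line
TIMES = re.compile(
    r"^(?P<host>\d{2}:\d{2}:\d{2}) \S+ - "                      # host-side timestamp (when the log chunk was polled)
    r"(?P<logcat>\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) .*?: "      # logcat timestamp (device clock)
    r"(?P<nspr>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+ UTC)"    # NSPR log timestamp
)

m = TIMES.match(SAMPLE)
if m:
    # Only the logcat/NSPR times reflect when the event happened on the device;
    # the host time is just when that portion of the log was retrieved.
    print(m.group("host"), "|", m.group("logcat"), "|", m.group("nspr"))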
Thanks Valentin!
Assignee: nobody → valentin.gosu
There are several issues with the reftests.

From what I can tell from tbpl, the first Android 4.0 Debug logs I can find are from somewhere at the beginning of April: https://tbpl.mozilla.org/?rev=723177fe1452 It took about 60 minutes back then too. Some of the recent runs on m-c show the tests taking 60-68 minutes for J1-J3: https://tbpl.mozilla.org/?rev=1d0496e30feb

The Gum tree has always failed the 4.0 Debug jsref tests, but it shows similar and sometimes better performance for the optimized jsref tests. The debug tests just seem to take longer, which pushes them over the time limit.

I too would recommend increasing the time limit for Android 4.0 Debug. We might have to do it anyway, since it's already close to the limit.
Geoff, are you the one to ask for prolonging the timeout here? It seems that the new cache just has more MOZ_ASSERTs and debug checks that prolong the execution; otherwise, in opt builds it performs better. Valentin, thanks for looking into this.
Flags: needinfo?(gbrown)
Yes, I can increase the timeout, or perhaps split the tests into more chunks. I want to be sure we are on the same page though.

On mozilla-central, the total job time is often 60-68 minutes, but the "test time", indicated by something like:

INFO | automation.py | Application ran for: 0:49:41.809401

is often 50 to 55 minutes. *That* is the time that needs to be less than 60 minutes. On Gum, the test time is hitting 60 minutes and only 30% to 50% of the tests are being run in that time. It looks to me like we would need to increase the timeout perhaps as high as 180 minutes.

When I read Comment 8, I thought "oh, these are just running a little slower, pushing it over the timeout", but when I look at the logs, it seems much worse. Do we agree that this is a significant increase in test run time, perhaps as much as 3 times as long as the time required for the same tests on mozilla-central?
Flags: needinfo?(gbrown)
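A quick way to check the number that actually matters, as a sketch built on the "Application ran for:" line and the 60-minute limit described above (the log file name is hypothetical):

import re
from datetime import timedelta

LIMIT = timedelta(minutes=60)   # per-job "test time" limit described above

def application_run_time(log_path):
    """Return the 'Application ran for:' duration from an automation.py log, if present."""
    pat = re.compile(r"Application ran for:\s*(\d+):(\d{2}):(\d{2})\.(\d+)")
    with open(log_path) as f:
        for line in f:
            m = pat.search(line)
            if m:
                h, mi, s, frac = m.groups()
                return timedelta(hours=int(h), minutes=int(mi), seconds=int(s),
                                 microseconds=int(frac[:6].ljust(6, "0")))
    return None

run = application_run_time("jsreftest-J1.log")   # hypothetical file name
if run is not None:
    print("test time", run, "over limit" if run > LIMIT else "within limit")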
(In reply to Geoff Brown [:gbrown] from comment #10)
> When I read Comment 8, I thought "oh, these are just running a little
> slower, pushing it over the timeout", but when I look at the logs, it seems
> much worse.

I didn't have time to check on this so far; I was only assuming.

> Do we agree that this is a significant increase in test run
> time, perhaps as much as 3 times as long as the time required for the same
> tests on mozilla-central?

Hard to say. For me the important part is that the opt build runs in the same or shorter time. As I said, the new cache does a lot of debug-only checks, probably more than the old cache. But 3 times the time for all the tests... hm... that sounds a bit weird.

Valentin, would you be willing to get some more data on where we could be so much slower? Bisecting might help, since this was not happening all the time; it only started a month ago.

If trying to find out becomes too complicated, I think we should just prolong the timeout for now and move this bug (or a new one) to be fixed after we enable the new cache. I don't think this is absolutely critical for turning cache2 on.
Flags: needinfo?(valentin.gosu)
I think the assumption that this is caused by the extra asserts is correct. There are 170 asserts in the old cache, and 512 in the new cache. We can probably settle the question by doing a try run with all of the asserts commented out. I'll give it a shot and see how far that takes us.
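The 170 vs. 512 figures can be reproduced with a simple count along these lines; the directory names are the in-tree locations of the old and new cache, and the exact numbers will vary with revision and with which assertion macros are included:

import os, re

ASSERT = re.compile(r"\bMOZ_ASSERT\w*\s*\(")   # MOZ_ASSERT, MOZ_ASSERT_IF, ...

def count_asserts(root):
    """Count assertion macro uses in .cpp/.h files under the given directory."""
    total = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.endswith((".cpp", ".h")):
                with open(os.path.join(dirpath, name), errors="replace") as f:
                    total += len(ASSERT.findall(f.read()))
    return total

for root in ("netwerk/cache", "netwerk/cache2"):   # old cache vs cache v2
    print(root, count_asserts(root))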
Ok, so from a try run with most of the asserts removed, it turns out there's no significant speedup. I'm still puzzled by the fact that performance is so much worse on DEBUG rather than OPT. I doubt there is a difference in the infrastructure between debug and opt, but there might be a difference in the amount of logging the debug version does. Honza, could this be the case?

https://tbpl.mozilla.org/?tree=Try&rev=cae19dc6848b

(In reply to Honza Bambas (:mayhemer) from comment #11)
> Bisecting might help, since this was not happening all the
> time, only started a month ago.

I was unable to find any Android 4.0 DEBUG run that passes the tests on Gum. I can't run the jsref tests locally; the most I can do is oprofile the debug/opt versions and look for differences during normal browsing.
Flags: needinfo?(valentin.gosu)
You may try to disable #define DEBUG_STATS 1 in netwerk/cache2/CacheIndex.h.

I might have been looking at a wrong test run on gum the first time (I mixed up opt and debug) to support my claim that this only started a few months ago.

You can also try: clone gum, update to some older changesets (just scroll down on tbpl to easily get some useful ones) and do a new try push from that changeset.
https://tbpl.mozilla.org/?tree=Try&rev=38f8bee788b7 I got the J1 tests to pass with an older changeset (99a4d8132178).
https://tbpl.mozilla.org/?tree=Try&rev=b4bad5bb61c2 J1 passes, J3 almost passes, and J2 finishes all the tests but hangs at shutdown. (rev 7dabdb8f2b52 on Gum)
(In reply to Valentin Gosu [:valentin] from comment #15)
> https://tbpl.mozilla.org/?tree=Try&rev=38f8bee788b7
>
> I got the J1 tests to pass with an older changeset (99a4d8132178).

This one is from Jan 17; does Jan 22 [1] work? Does Jan 14 [2] work?

[1] https://tbpl.mozilla.org/?tree=Gum&rev=212b73f337ca
[2] https://tbpl.mozilla.org/?tree=Gum&rev=fa46ddfd2751

Are you just reporting your bisection progress, or is this something that you expected not to work, but it works?
(In reply to Honza Bambas (:mayhemer) from comment #17)
> This one is from Jan 17, does Jan 22 [1] work? does Jan 14 [2] work?

They both pass.

> [1] https://tbpl.mozilla.org/?tree=Gum&rev=212b73f337ca
https://tbpl.mozilla.org/?tree=Try&rev=8e65f60f5bad

> [2] https://tbpl.mozilla.org/?tree=Gum&rev=fa46ddfd2751
https://tbpl.mozilla.org/?tree=Try&rev=0990ef1e7618

> Are you just reporting your bisection progress or is this something that you
> expect not to work, but it works?

I was just reporting on my findings so far.
So, it turns out that the bad commit is somewhere between 25 Mar and 4 Apr:

https://tbpl.mozilla.org/?tree=Try&rev=b4bad5bb61c2 (based on 7dabdb8f2b52) 25 Mar (passes, barely)
https://tbpl.mozilla.org/?tree=Try&rev=9d3221df75e2 (based on e8289dd6c7cc) 4 Apr (fails, by a long shot)

The commits don't seem to make a distinction between DEBUG and OPT, but I'll back out a few of them and hopefully we can find the one.
Ok, it seems that the issue is caused by rev 1e0b6c711512:

https://tbpl.mozilla.org/?tree=Try&rev=fc44ed7ed9a4 based on 1e0b6c711512 fails
https://tbpl.mozilla.org/?tree=Try&rev=651a90132615 based on 299f58206926 passes

I, for one, fail to see anything in https://hg.mozilla.org/projects/gum/rev/1e0b6c711512 that would cause this huge difference on DEBUG. Maybe it calls code that contains #ifdef-ed instructions?
Flags: needinfo?(honzab.moz)
Valentin, thanks a lot for your help here! I will look closely at that changeset and try to find out what's wrong.

Retry with m-c+cache2 (@https://hg.mozilla.org/mozilla-central/rev/6109876e1a22):

https://tbpl.mozilla.org/?tree=Try&rev=3adb877caa9c (expected to fail)
https://tbpl.mozilla.org/?tree=Try&rev=c9cce2ac7e2a (expected to pass)
Flags: needinfo?(honzab.moz)
(In reply to Honza Bambas (:mayhemer) from comment #21)
> https://tbpl.mozilla.org/?tree=Try&rev=3adb877caa9c (expected to fail)
> https://tbpl.mozilla.org/?tree=Try&rev=c9cce2ac7e2a (expected to pass)

Badly chosen base cs: https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=d5b0e9e6a849

Test useless... ://
Blocks: 968593
So, I backed out the patch from gum and pushed to try [1]. It's still timing out. Valentin, are you *100%* sure the changeset is the one where this bug starts?

[1] https://tbpl.mozilla.org/?tree=Try&rev=77ce41d966c6
Flags: needinfo?(valentin.gosu)
J1 (92%), J2 (87%), J3 (69%) is still about twice as fast as the most current gum revision, J1 (45%), J2 (54%), J3 (37%), so there is definitely something there. I think there may be another, more recent revision to blame for the additional delay.
Flags: needinfo?(valentin.gosu)
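For the record, "twice as fast" here is just the ratio of how far each chunk got before the one-hour kill, using the percentages quoted above:

# Fraction of each chunk completed before the job was killed.
backed_out  = {"J1": 0.92, "J2": 0.87, "J3": 0.69}
current_gum = {"J1": 0.45, "J2": 0.54, "J3": 0.37}

for chunk in backed_out:
    print(chunk, round(backed_out[chunk] / current_gum[chunk], 2))
# J1 2.04, J2 1.61, J3 1.86 -- roughly double the throughput in the same hour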
Backout of Parsable context key and Memory pool rework from m-c test: https://tbpl.mozilla.org/?tree=Try&rev=7b98ec4ec6e7
Blocks: 986179
The two bugs causing this DEBUG-only slowdown of the jsreftests: bug 986179 and bug 968593.

I have logs with timestamps from both cache1 and cache2 runs; I'm going to look into them to find where we are actually slow. Hopefully that will give an answer. If this turns out to be too long a trip, I would suggest prolonging the timeout for now and investigating (and fixing) the actual cause later.
So, the commit from comment 25 was wrong - cache2 was off. This one is correct, and still timing out!

https://tbpl.mozilla.org/?tree=Try&rev=08a92424b812

I'm getting lost here!
Seems like we sometimes get stuck in CacheEntry::AsyncOpen (for a replaced entry - truncating) for more than a second. I see this more or less regularly in the logs. The method is simple; however, it enters the entry's lock. That lock, at the moment, should not be known to anyone else. Needs more investigation.
Assignee: valentin.gosu → honzab.moz
Theory: since we lock under lock under lock (concretely: CacheService -> CacheEntry -> CacheFile) - only in this direction! - the deadlock detection may get crazy about it. I often see blocks (seconds) when one of these chained lock acquisitions is happening. Bug 1005475 tries to remove some of them.
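A toy illustration of the one-way acquisition order described above, in Python threading rather than the actual mozilla::Mutex classes; it only shows the nesting direction, while the suspected overhead comes from the debug-only deadlock-detection bookkeeping that real builds run on every acquire:

import threading

# One lock per layer, always taken in this order and never in reverse.
service_lock = threading.Lock()   # stands in for CacheService's lock
entry_lock   = threading.Lock()   # stands in for CacheEntry's lock
file_lock    = threading.Lock()   # stands in for CacheFile's lock

def open_entry():
    # Nested acquisition, strictly CacheService -> CacheEntry -> CacheFile.
    with service_lock:
        with entry_lock:
            with file_lock:
                pass  # work on the underlying file

# Because the order is one-way, no cycle (and hence no deadlock) is possible,
# but a debug-build deadlock detector still has to record every nested acquire.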
Loosely changing deps here.
No longer blocks: 968593, 986179
Depends on: 1005475
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → DUPLICATE