Closed
Bug 1315037
Opened 9 years ago
Closed 8 years ago
Intermittent dom/canvas/test/webgl-conf/generated/test_2_conformance__textures__video__tex-2d-rgba-rgba-unsigned_short_5_5_5_1.html | Test timed out.
Categories
(Core :: Graphics: CanvasWebGL, defect, P3)
Core
Graphics: CanvasWebGL
Tracking
()
People
(Reporter: intermittent-bug-filer, Unassigned)
References
Details
(Keywords: bulk-close-intermittents, intermittent-failure, Whiteboard: [gfx-noted][stockwell fixed:other])
Attachments
(3 files, 4 obsolete files)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment 5•9 years ago
jgilbert -- any idea why this would have started a few weeks ago?
Flags: needinfo?(jgilbert)
Comment 6•9 years ago
Nope. Maybe we can get someone to look into it?
Flags: needinfo?(jgilbert) → needinfo?(howareyou322)
Comment hidden (Intermittent Failures Robot)
Updated•9 years ago
Whiteboard: [gfx-noted]
Comment hidden (Intermittent Failures Robot)
Comment 9•9 years ago
Michael is looking at this now.
Assignee: nobody → cleu
Flags: needinfo?(howareyou322)
Comment 10•9 years ago
It seems that the test hangs at the very beginning; I am now trying to figure out why this happens.
I suspect there is a race condition that leaves our test in a dangling state and prevents it from continuing to execute.
Comment 11•9 years ago
Bug 1315091 has an almost identical log, so I suspect they hang for a similar reason.
Comment 12•9 years ago
I'm now trying to build a VM similar to the try server's to reproduce this locally.
However, although I managed to run mochitest on a 32-bit Windows 7 VM with limited system resources, we cannot obtain a GL context because the gfx driver (VMware SVGA Adapter) is unsupported.
Looking at the raw log of this test, it seems the VMs running mochitest-gl have access to physical NVIDIA graphics cards.
I'm now building a test environment with 32-bit Windows 7 and an HD4000; hopefully it can reproduce the hang.
Comment 13•9 years ago
Although I don't know the configuration of the VM this test runs on, I found an NVIDIA PhysX folder in the raw log, so I suspect this is related to the NVIDIA driver.
I tried to reproduce the failure locally on two machines, one with an NVIDIA GT630 and one with an Intel HD4000.
When I stress-test them by running this mochitest repeatedly, the GT630 machine barfs GL_OUT_OF_MEMORY after tens of iterations and then hits "test timed out", while the HD4000 machine runs flawlessly no matter how many times I repeat.
I will now loan a slave machine for further investigation; hopefully I can reproduce it or, ideally, recreate the whole environment locally.
Comment hidden (Intermittent Failures Robot)
Comment 15•9 years ago
FWIW, I am now trying to run this test on my local 32-bit Windows 7 VM, and I keep failing to obtain a WebGL2 context; the whole test hangs until it times out when I run it with mach mochitest. However, the same test runs with no problem on Khronos' website.
I found the cause is in the prefs: when we run mach mochitest, we generate a temporary profile that has webgl.force-enabled=true.
This pref leads to a failure to obtain a correct config in glesGetConfig, which prevents us from getting a valid context.
I don't know whether this is relevant to this bug; I will investigate further by adding logs to verify the pref and pushing to the try server.
Comment hidden (Intermittent Failures Robot)
Comment 17•9 years ago
I ran mochitest-gl on the t-win732-spot slave machine multiple times, and sometimes it hangs because it fails to obtain a WebGL context here:
https://dxr.mozilla.org/mozilla-central/source/dom/canvas/WebGLContext.cpp?q=error+during+angle+opengl+init&redirect_type=single#624
I compared the raw logs of successful and failed mochitest-gl runs; neither prints any logs from the WebGL conformance test, so I suspect this context failure is the cause of the intermittent failure.
I will investigate further by tracing CreateOffscreen to figure out why it sometimes fails.
Comment 18•9 years ago
OK... it seems this intermittent failure is not caused by failing to fetch a valid WebGL context.
But I found something interesting in the raw log.
The red-marked lines are the logs for this test, and they show different behavior from a normal successful run.
It seems the test hangs right after it starts at 08:58:01 until the 5-minute time limit is reached, around 09:03:01.
Additionally, the timed-out notification fires simultaneously with some test-info; we can see the two TEST-PASS logs emitted by requestFlakyTimeout at the beginning of mochitest-single.html.
From this information, I think the conformance test did not run at all in this try instance for some reason, and the problem is in the test infrastructure rather than the WebGL backend.
Comment 19•9 years ago
Hi Geoff,
The test occasionally hangs right after it starts until the 5-minute time limit ends.
My observations are in comment 18.
Do you have any suggestions about this one?
Flags: needinfo?(gbrown)
Comment 20•9 years ago
Michael, thanks for looking into this.
There's not much information in that log. As you say, we know we got as far as requestFlakyTimeout, but it's unclear how much further the test executed. Do you have any idea where the two JavaScript errors come from (what point in the test)? Maybe that's relevant...but I don't know.
If you can reproduce the hang, you could add more logging to the test, starting at the beginning, and see if those new messages are logged before the hang.
There are lots of "Test timed out" failures reported against many tests, but I'm not aware that any of those are caused by a problem in the test infrastructure; it's possible, but I don't know of any specific issue.
By the way, did you notice that this failure has only been reported on mozilla-aurora for the last few weeks? Can you still reproduce it on central? (It looks like bug 1315091 stopped too...maybe that's good!?)
Flags: needinfo?(gbrown)
Comment 21•9 years ago
Hi Geoff, thanks for your suggestion :)
I cannot reproduce the hang with the same behavior.
I hadn't noticed that the hangs were only reported on Aurora, so I am now using Aurora to try to reproduce the failure, with no luck yet.
If I still cannot reproduce it, I will look into other similar intermittent failures like bug 1315091 and go through the patches that landed just after the OrangeFactor robot stopped reporting; maybe one of them actually fixed the problem.
Comment hidden (Intermittent Failures Robot)
Comment 23•9 years ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f609da7a3fba32d5b1c09adc6c8481227e32b935&selectedJob=74231798
I deliberately retriggered this test multiple times; the failure rate seems higher when multiple tasks run simultaneously.
So I tried to throttle CPU and GPU resources by running hardware stress-testing software alongside the test on my local machine; the test ran slowly but never completely hung.
Comment 24•9 years ago
OK, I think I found the cause of this bug: it is actually a memory issue rather than a WebGL problem.
The two JavaScript errors in the attached screenshot do have something to do with the hang; I found they always appear just before the hang happens.
So I added logs to trace the C++ side of the JS modules that report the errors, and found the second error is always caused by a service creation failure due to out of memory.
This explains why the hang only appears in 32-bit non-e10s mode: it is the configuration most prone to pushing a process to its memory limit.
Now I wonder why the hang is absent after Aurora; maybe some patch improved memory management in the JS VM?
Hi Geoff, do you have any idea about it?
Should I find which patch fixed the hang and uplift it to beta?
Flags: needinfo?(gbrown)
Updated•9 years ago
Comment 25•9 years ago
The bugs marked in "see also" exhibit similar behavior; I think their cause is the same as this bug's.
Comment 26•9 years ago
(In reply to Michael Leu[:lenzak800](UTC+8) from comment #24)
> Hi Geoff, do you have any idea about it?
>
> Should I find which patch fixed the hang and uplift it to beta?
That's great progress - thanks for keeping with it. I don't have much insight into the problem, but I think it would be interesting and useful to find the patch that fixed this and get it uplifted...if that's not too much trouble.
Flags: needinfo?(gbrown)
Comment 27•9 years ago
Hi Geoff, do you think splitting mochitest-gl into more chunks could fix this issue?
The intermittent failure sometimes hits test_2_conformance__textures__video__tex-2d-rgba-rgba-unsigned_short_4_4_4_4.html, which is adjacent to this test.
And the "MEMORY STAT" lines indicate that these tests require more memory than other WebGL conformance tests.
If we isolate these tests by dividing them into different chunks, maybe we can prevent the OOM problem?
Flags: needinfo?(gbrown)
Comment 28•9 years ago
There's been a lot of discussion in bug 1300355 about similar issues, but related to reftests -- have you seen that?
One thought I have is that maybe something is going wrong with GC/CC...or maybe something could be done to be more aggressive about GC/CC during tests?
More chunks might help, but that doesn't seem like a proper fix: as tests are added or removed, they get redistributed across chunks, which might create another problematic chunk at any time.
A similar approach that's slightly better might be to separate tests into different manifests.
:jmaher - Anything to add? How about run-by-dir or something like that?
Flags: needinfo?(gbrown) → needinfo?(jmaher)
Comment 29•9 years ago
I think run-by-dir is good; by creating different manifests we can isolate 99% of tests, which would be a good position to be in.
Flags: needinfo?(jmaher)
Comment 30•9 years ago
FWIW, here is a screenshot of the try raw log after I added additional logging.
The try result is here: https://treeherder.mozilla.org/#/jobs?repo=try&revision=9d9d6719f7ff7d97880dc99cb515e70b94e1f289&selectedJob=75039636
You can see the error code -2147024882, which is 0x8007000E in hex, meaning out of memory:
https://dxr.mozilla.org/mozilla-central/source/xpcom/base/ErrorList.h#26
Without these added logs, the stock JavaScript error is too vague to tell what is wrong.
Maybe we should change the error message when we encounter this kind of error; I think that would be helpful for other intermittent failures.
Comment 31•9 years ago
Agreed.
Comment 32•9 years ago
Hi Bobby,
Do you think it is possible to make nsJSCID return a more detailed error message so we know why it fails?
NS_ERROR_XPC_GS_RETURNED_FAILURE is too vague to tell what's wrong, but maybe we can make the return code carry the failure reason rv here?
https://dxr.mozilla.org/mozilla-central/source/js/xpconnect/src/XPCJSID.cpp#695
It would help us track intermittent failures like this one.
Flags: needinfo?(bobbyholley)
Comment 33•9 years ago
(In reply to Michael Leu[:lenzak800](UTC+8) from comment #32)
> Hi Bobby
>
> Do you think it is possible to make nsJSCID return more detail error message
> to let us know why it fails?
>
> NS_ERROR_XPC_GS_RETURNED_FAILURE is too vague for us to know what's wrong,
> but maybe we can make the return code carry fail reason rv here?
> https://dxr.mozilla.org/mozilla-central/source/js/xpconnect/src/XPCJSID.cpp#695
>
> It would be helpful for us to track intermittent failures like this one.
In general the XPConnect error reporting stuff is all pretty fragile - messing around with it usually has unintended consequences.
Propagating |rv| instead could be more informative in some cases, but less in others. For example, if |rv| is NS_ERROR_FAILURE, NS_ERROR_XPC_GS_RETURNED_FAILURE is a lot more informative.
Would it be sufficient to add an NS_WARNING to the callsites you're interested in, to log more information in debug builds?
Flags: needinfo?(bobbyholley)
Comment 34•9 years ago
I think it is OK to add logs, since we just want to know what is going wrong.
I originally wanted it to generate a warning for every rv, but NS_WARNING doesn't support format strings like %X, so it only generates a warning when rv is NS_ERROR_OUT_OF_MEMORY, since memory issues are the most likely cause of intermittent failures like this.
Do you think it should generate warnings for all rv values in the core errors?
Attachment #8839314 - Flags: feedback?(bobbyholley)
Comment 35•9 years ago
Comment on attachment 8839314 [details] [diff] [review]
Add log for NS_ERROR_OUT_OF_MEMORY in nsJSCID
Review of attachment 8839314 [details] [diff] [review]:
-----------------------------------------------------------------
This is ok, but I think we should just add the right API to nsDebug.h. How about adding an NS_WARN_WITH_RESULT(str, rv) API that shares the stringification logic with NS_ENSURE_SUCCESS?
Attachment #8839314 - Flags: feedback?(bobbyholley) → feedback-
Comment 36•8 years ago
Attachment #8839314 - Attachment is obsolete: true
Comment 37•8 years ago
Attachment #8844795 - Attachment is obsolete: true
Comment 38•8 years ago
Comment on attachment 8844796 [details] [diff] [review]
Bug1315037 - Add Logs for XPCJSCID
Review of attachment 8844796 [details] [diff] [review]:
-----------------------------------------------------------------
Sorry for the late reply; do you mean creating a new macro in nsDebug.h like this?
Attachment #8844796 - Flags: review?(bobbyholley)
Comment 39•8 years ago
Comment on attachment 8844796 [details] [diff] [review]
Bug1315037 - Add Logs for XPCJSCID
Review of attachment 8844796 [details] [diff] [review]:
-----------------------------------------------------------------
lgtm with that change. Please get review from erahm on the final patch.
::: xpcom/base/nsDebug.h
@@ +306,5 @@
>
> +#define NS_WARN_WITH_RESULT_BODY(res, str) \
> + char *msg = mozilla::Smprintf("NS_WARN_WITH_RESULT failed with result"\
> + " 0x%" PRIX32 "with message: %s", \
> + static_cast<uint32_t>(__rv), str); \
There's no "failure" here. The string should just be like a regular warning except that it includes an nsresult.
Attachment #8844796 - Flags: review?(bobbyholley) → feedback+
Comment 40•8 years ago
Using more neutral words instead of failure in NS_WARN_WITH_RESULT
Attachment #8844796 - Attachment is obsolete: true
Comment 41•8 years ago
Comment on attachment 8845262 [details] [diff] [review]
Bug1315037 - Add Logs for XPCJSCID
Review of attachment 8845262 [details] [diff] [review]:
-----------------------------------------------------------------
Hi Eric, can you review this for me?
I want to add some logs to make diagnosing intermittent failures like this one easier in the future.
Attachment #8845262 - Flags: review?(erahm)
Comment 42•8 years ago
Comment on attachment 8845262 [details] [diff] [review]
Bug1315037 - Add Logs for XPCJSCID
Review of attachment 8845262 [details] [diff] [review]:
-----------------------------------------------------------------
I like the idea, just a few thoughts on a more useful implementation. If you choose to do the nsDebug changes can you please split them out into a separate patch with a detailed commit comment?
::: js/xpconnect/src/XPCJSID.cpp
@@ +657,5 @@
> rv = compMgr->CreateInstance(mDetails->ID(), nullptr, *iid, getter_AddRefs(inst));
> MOZ_ASSERT(NS_FAILED(rv) || inst, "component manager returned success, but instance is null!");
>
> + if (NS_FAILED(rv) || !inst) {
> + NS_WARN_WITH_RESULT(rv, "nsJSCID::CreateInstance Failed.");
Strictly speaking I don't think you need a new warning type. You could replace these lines with:
> MOZ_RELEASE_ASSERT(NS_FAILED(rv) || !inst, ...);
> NS_ENSURE_SUCCESS(rv, NS_ERROR_XPC_CI_RETURNED_FAILURE);
That said, generally we frown on adding more NS_ENSURE's and I agree with Bobby that this could be useful elsewhere. I'll leave it up to you if you want to pursue the change here or just file a follow up and use an |NS_ENSURE_SUCCESS| for now.
::: xpcom/base/nsDebug.h
@@ +342,4 @@
> } \
> } while(0)
>
> +#define NS_WARN_WITH_RESULT(res, str) \
This name isn't quite right, as-is it should be |NS_WARN_WITH_RESULT_IF_FAILED| or more succinctly |NS_WARN_IF_FAILED|. You could modify the implementation to just not check |__rv| and keep the name as-is.
I think it would be more useful if it were implemented like |NS_WARN_IF| so that it can be used in an |if| statement, ie:
> if (NS_WARN_IF_FAILED(rv) || !srvc)
> return NS_ERROR_XPC_GS_RETURNED_FAILURE;
Hopefully that makes sense! Thanks for doing this, I think it'll be a nice improvement for debug logging.
Attachment #8845262 - Flags: review?(erahm) → review-
Comment 43•8 years ago
I think it is better to use NS_ENSURE_SUCCESS here and open a follow-up bug to add the new warning API.
I am not sure using MOZ_RELEASE_ASSERT is correct, since this is just an if; we don't want to crash when the condition is true.
So I just moved rv into NS_ENSURE_SUCCESS while retaining the original failure handling for a null instance/service.
Besides, considering what I want from a debug message when something fails, I want to see the failure code immediately. So if I implement a new warning API, I'll make an NS_WARN_IF_FAILED(rv) that encapsulates NS_warn_if_impl just like NS_WARN_IF, but with an additional message indicating which rv it is, using GetErrorName.
Is that reasonable?
Attachment #8846521 - Flags: review?(erahm)
Updated•8 years ago
Attachment #8845262 - Attachment is obsolete: true
Comment 44•8 years ago
Comment on attachment 8846521 [details] [diff] [review]
Use NS_ENSURE_SUCCESS for nsresult diagnosis.
Review of attachment 8846521 [details] [diff] [review]:
-----------------------------------------------------------------
Looks good, r=me. Keep an eye out for warning spam :)
Attachment #8846521 - Flags: review?(erahm) → review+
Comment 45•8 years ago
Updated•8 years ago
Keywords: checkin-needed
Comment 46•8 years ago
Pushed by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/1853515da1f5
Use NS_ENSURE_SUCCESS for nsresult diagnosis. r=erahm
Keywords: checkin-needed
Comment 47•8 years ago
bugherder
Status: NEW → RESOLVED
Closed: 8 years ago
status-firefox55:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Comment 48•8 years ago
AFAICT, what landed was a diagnostic patch, not a real fix. That said, thanks for working on this! It's definitely a lingering failure on ESR52 that would be great to eliminate!
Status: RESOLVED → REOPENED
status-firefox52:
--- → wontfix
status-firefox53:
--- → affected
status-firefox54:
--- → affected
status-firefox-esr52:
--- → affected
Resolution: FIXED → ---
Target Milestone: mozilla55 → ---
Comment hidden (Intermittent Failures Robot)
Updated•8 years ago
Assignee: cleu → nobody
Comment 50•8 years ago
I don't see any failures in the last 9 days; this looks to be fixed.
Whiteboard: [gfx-noted] → [gfx-noted][stockwell fixed:other]
Comment 51•8 years ago
It stopped happening after the Aurora version five months ago, and only Windows 7 32-bit non-e10s mode has this issue.
I think we can consider this fixed, since it has stopped happening and we will soon disable non-e10s mochitests by default.
Comment hidden (Intermittent Failures Robot)
Updated•8 years ago
Priority: -- → P3
Updated•8 years ago
Status: REOPENED → RESOLVED
Closed: 8 years ago → 8 years ago
Keywords: bulk-close-intermittents
Resolution: --- → INCOMPLETE
Updated•4 years ago