Closed
Bug 1315037
Opened 9 years ago
Closed 8 years ago
Intermittent dom/canvas/test/webgl-conf/generated/test_2_conformance__textures__video__tex-2d-rgba-rgba-unsigned_short_5_5_5_1.html | Test timed out.
Categories
(Core :: Graphics: CanvasWebGL, defect, P3)
Core
Graphics: CanvasWebGL
Tracking
()
People
(Reporter: intermittent-bug-filer, Unassigned)
References
Details
(Keywords: bulk-close-intermittents, intermittent-failure, Whiteboard: [gfx-noted][stockwell fixed:other])
Attachments
(3 files, 4 obsolete files)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment 5•9 years ago
jgilbert -- any idea why this would have started a few weeks ago?
Flags: needinfo?(jgilbert)
Comment 6•9 years ago
Nope. Maybe we can get someone to look into it?
Flags: needinfo?(jgilbert) → needinfo?(howareyou322)
Comment hidden (Intermittent Failures Robot)
Updated•9 years ago
Whiteboard: [gfx-noted]
Comment hidden (Intermittent Failures Robot)
Comment 9•9 years ago
Michael is looking at this now.
Assignee: nobody → cleu
Flags: needinfo?(howareyou322)
Comment 10•9 years ago
It seems that the test hangs at the very beginning; I am now trying to figure out why this happens.
I suspect there is a race condition that leaves our test in a dangling state and prevents it from continuing to execute.
Comment 11•9 years ago
Bug 1315091 has an almost identical log, so I suspect they hang for a similar reason.
Comment 12•9 years ago
I'm now trying to build a VM similar to the try server's to reproduce this locally.
However, although I managed to run mochitest on a 32-bit Windows 7 VM with limited system resources, we cannot obtain a GL context because the gfx driver (VMware SVGA Adapter) is unsupported.
Looking at the raw log of this test, it seems the VMs running mochitest-gl have access to physical NVIDIA graphics cards.
I'm now building a test environment with 32-bit Windows 7 and an HD4000; hopefully it can reproduce the hang.
Comment 13•9 years ago
Although I don't know the configuration of the VM this test runs on, I found an NVIDIA PhysX folder in the raw log, so I suspect this is related to the NVIDIA driver.
I tried to reproduce the failure locally on two machines, one with an NVIDIA GT630 and one with an Intel HD4000.
When I stress-test them by running this mochitest repeatedly, the GT630 machine barfs GL_OUT_OF_MEMORY after tens of iterations and then hits "test timed out", while the HD4000 machine runs flawlessly no matter how many times I repeat.
I will now loan a slave machine for further investigation; hopefully I can reproduce it or, ideally, recreate the whole environment locally.
Comment hidden (Intermittent Failures Robot)
Comment 15•9 years ago
FWIW, I am now trying to run this test on my local 32-bit Windows 7 VM, and I keep failing to obtain a WebGL2 context; the whole test hangs until it times out when I run it with mach mochitest. However, the same test runs with no problem on Khronos' website.
I found the cause is in the prefs: when we run mach mochitest, we generate a temporary profile that has webgl.force-enabled=true.
This pref leads to a failure to obtain a correct config in glesGetConfig, which prevents us from getting a valid context.
I don't know whether this is relevant to this bug; I will investigate further by adding logs to verify the pref and pushing to the try server.
Comment hidden (Intermittent Failures Robot)
Comment 17•9 years ago
I ran mochitest-gl on the t-win732-spot slave machine multiple times, and sometimes it hangs because it fails to obtain a WebGL context here:
https://dxr.mozilla.org/mozilla-central/source/dom/canvas/WebGLContext.cpp?q=error+during+angle+opengl+init&redirect_type=single#624
I compared the raw logs of successful and failed mochitest-gl runs; neither prints any logs from the WebGL conformance test, so I suspect this context failure is the cause of the intermittent failure.
I will investigate further by tracing CreateOffscreen to figure out why it sometimes fails.
Comment 18•9 years ago
OK... it seems this intermittent failure is not caused by failing to fetch a valid WebGL context.
But I found something interesting in the raw log.
The red-marked lines are the logs for this test, and they show different behavior from a normal successful run.
It seems the test hangs right after it starts at 08:58:01 until the 5-minute time limit is reached, around 09:03:01.
Additionally, the timed-out notification fires simultaneously with some test-info; we can see the two TEST-PASS logs emitted by requestFlakyTimeout at the beginning of mochitest-single.html.
From this information, I think the conformance test did not run at all in this try instance for some reason, and the problem is in the test infrastructure rather than the WebGL backend.
Comment 19•9 years ago
Hi Geoff,
The test occasionally hangs right after it starts until the 5-minute time limit ends.
My observations are in comment 18.
Do you have any suggestions about this one?
Flags: needinfo?(gbrown)
Comment 20•9 years ago
Michael, thanks for looking into this.
There's not much information in that log. As you say, we know we got as far as requestFlakyTimeout, but it's unclear how much further the test executed. Do you have any idea where the two JavaScript errors come from (what point in the test)? Maybe that's relevant...but I don't know.
If you can reproduce the hang, you could add more logging to the test, starting at the beginning, and see if those new messages are logged before the hang.
There are lots of "Test timed out" failures reported against many tests, but I'm not aware that any of those are caused by a problem in the test infrastructure; it's possible, but I don't know of any specific issue.
By the way, did you notice that this failure has only been reported on mozilla-aurora for the last few weeks? Can you still reproduce it on central? (It looks like bug 1315091 stopped too...maybe that's good!?)
Flags: needinfo?(gbrown)
Comment 21•9 years ago
Hi Geoff, thanks for your suggestion :)
I cannot reproduce the hang with the same behavior.
I hadn't noticed that the hangs were only reported on Aurora, so I am now using Aurora to try to reproduce the failure, with no luck yet.
If I still cannot reproduce it, I will look into other similar intermittent failures like bug 1315091 and go through the patches that landed just after the OrangeFactor robot stopped reporting; maybe one of them actually fixed the problem.
Comment hidden (Intermittent Failures Robot)
Comment 23•9 years ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f609da7a3fba32d5b1c09adc6c8481227e32b935&selectedJob=74231798
I deliberately retriggered this test multiple times; the failure rate seems higher when multiple tasks run simultaneously.
So I tried to throttle CPU and GPU resources by running hardware stress-testing software alongside the test on my local machine; the test ran slowly but never completely hung.
Comment 24•9 years ago
OK, I think I found the cause of this bug: it is actually a memory issue rather than a WebGL problem.
The two JavaScript errors in the attached screenshot do have something to do with the hang; I found they always appear just before the hang happens.
So I added logs to trace the C++ side of the JS modules that report the errors, and found the second error is always caused by a service creation failure due to out of memory.
This explains why the hang only appears in 32-bit non-e10s mode: it is the configuration most prone to pushing a process to its memory limit.
Now I wonder why the hang is absent after Aurora; maybe some patch improved memory management in the JS VM?
Hi Geoff, do you have any idea about it?
Should I find which patch fixed the hang and uplift it to beta?
Flags: needinfo?(gbrown)
Updated•9 years ago
Comment 25•9 years ago
The bugs marked in "see also" exhibit similar behavior; I think their cause is the same as this bug's.
Comment 26•9 years ago
(In reply to Michael Leu[:lenzak800](UTC+8) from comment #24)
> Hi Geoff, do you have any idea about it?
>
> Should I find which patch fixed the hang and uplift it to beta?
That's great progress - thanks for keeping with it. I don't have much insight into the problem, but I think it would be interesting and useful to find the patch that fixed this and get it uplifted...if that's not too much trouble.
Flags: needinfo?(gbrown)
Comment 27•9 years ago
Hi Geoff, do you think splitting mochitest-gl into more chunks could fix this issue?
The intermittent failure sometimes hits test_2_conformance__textures__video__tex-2d-rgba-rgba-unsigned_short_4_4_4_4.html, which is adjacent to this test.
And the "MEMORY STAT" lines indicate that these tests require more memory than other WebGL conformance tests.
If we isolate these tests by dividing them into different chunks, maybe we can prevent the OOM problem?
Flags: needinfo?(gbrown)
Comment 28•9 years ago
There's been a lot of discussion in bug 1300355 about similar issues, but related to reftests -- have you seen that?
One thought I have is that maybe something is going wrong with GC/CC...or maybe something could be done to be more aggressive about GC/CC during tests?
More chunks might help, but that doesn't seem like a proper fix: as tests are added or removed, they get redistributed across chunks, which might create another problematic chunk at any time.
A similar approach that's slightly better might be to separate tests into different manifests.
:jmaher - Anything to add? How about run-by-dir or something like that?
Flags: needinfo?(gbrown) → needinfo?(jmaher)
Comment 29•9 years ago
I think run-by-dir is good; by creating different manifests we can isolate 99% of tests, which would be a good position to be in.
Flags: needinfo?(jmaher)
Comment 30•9 years ago
FWIW, here is a screenshot of the try raw log after I added additional logging.
The try result is here: https://treeherder.mozilla.org/#/jobs?repo=try&revision=9d9d6719f7ff7d97880dc99cb515e70b94e1f289&selectedJob=75039636
You can see the error code -2147024882, which is 0x8007000E in hex, meaning out of memory:
https://dxr.mozilla.org/mozilla-central/source/xpcom/base/ErrorList.h#26
Without these added logs, the stock JavaScript error is too vague to tell what is wrong.
Maybe we should change the error message when we encounter this kind of error; I think that would be helpful for other intermittent failures.
Comment 31•9 years ago
Agreed.
Comment 32•9 years ago
Hi Bobby,
Do you think it is possible to make nsJSCID return a more detailed error message so we know why it fails?
NS_ERROR_XPC_GS_RETURNED_FAILURE is too vague to tell what's wrong, but maybe we can make the return code carry the failure reason rv here?
https://dxr.mozilla.org/mozilla-central/source/js/xpconnect/src/XPCJSID.cpp#695
It would help us track intermittent failures like this one.
Flags: needinfo?(bobbyholley)
Comment 33•9 years ago
(In reply to Michael Leu[:lenzak800](UTC+8) from comment #32)
> Hi Bobby
>
> Do you think it is possible to make nsJSCID return more detail error message
> to let us know why it fails?
>
> NS_ERROR_XPC_GS_RETURNED_FAILURE is too vague for us to know what's wrong,
> but maybe we can make the return code carry fail reason rv here?
> https://dxr.mozilla.org/mozilla-central/source/js/xpconnect/src/XPCJSID.cpp#695
>
> It would be helpful for us to track intermittent failures like this one.
In general the XPConnect error reporting stuff is all pretty fragile - messing around with it usually has unintended consequences.
Propagating |rv| instead could be more informative in some cases, but less in others. For example, if |rv| is NS_ERROR_FAILURE, NS_ERROR_XPC_GS_RETURNED_FAILURE is a lot more informative.
Would it be sufficient to add an NS_WARNING to the callsites you're interested in, to log more information in debug builds?
Flags: needinfo?(bobbyholley)
Comment 34•9 years ago
I think it is OK to add logs, since we just want to know what is going wrong.
I originally wanted it to generate a warning for every rv, but NS_WARNING doesn't support format strings like %X, so it only generates a warning when rv is NS_ERROR_OUT_OF_MEMORY, since memory issues are the most likely cause of intermittent failures like this.
Do you think it should generate warnings for all rv values in the core errors?
Attachment #8839314 - Flags: feedback?(bobbyholley)
Comment 35•9 years ago
Comment on attachment 8839314 [details] [diff] [review]
Add log for NS_ERROR_OUT_OF_MEMORY in nsJSCID
Review of attachment 8839314 [details] [diff] [review]:
-----------------------------------------------------------------
This is ok, but I think we should just add the right API to nsDebug.h. How about adding an NS_WARN_WITH_RESULT(str, rv) API that shares the stringification logic with NS_ENSURE_SUCCESS?
Attachment #8839314 - Flags: feedback?(bobbyholley) → feedback-
Comment 36•8 years ago
Attachment #8839314 - Attachment is obsolete: true
Comment 37•8 years ago
Attachment #8844795 - Attachment is obsolete: true
Comment 38•8 years ago
Comment on attachment 8844796 [details] [diff] [review]
Bug1315037 - Add Logs for XPCJSCID
Review of attachment 8844796 [details] [diff] [review]:
-----------------------------------------------------------------
Sorry for the late reply; do you mean creating a new macro in nsDebug.h like this?
Attachment #8844796 - Flags: review?(bobbyholley)
Comment 39•8 years ago
Comment on attachment 8844796 [details] [diff] [review]
Bug1315037 - Add Logs for XPCJSCID
Review of attachment 8844796 [details] [diff] [review]:
-----------------------------------------------------------------
lgtm with that change. Please get review from erahm on the final patch.
::: xpcom/base/nsDebug.h
@@ +306,5 @@
>
> +#define NS_WARN_WITH_RESULT_BODY(res, str) \
> + char *msg = mozilla::Smprintf("NS_WARN_WITH_RESULT failed with result"\
> + " 0x%" PRIX32 "with message: %s", \
> + static_cast<uint32_t>(__rv), str); \
There's no "failure" here. The string should just be like a regular warning except that it includes an nsresult.
Attachment #8844796 - Flags: review?(bobbyholley) → feedback+
Comment 40•8 years ago
Using more neutral words instead of failure in NS_WARN_WITH_RESULT
Attachment #8844796 - Attachment is obsolete: true
Comment 41•8 years ago
Comment on attachment 8845262 [details] [diff] [review]
Bug1315037 - Add Logs for XPCJSCID
Review of attachment 8845262 [details] [diff] [review]:
-----------------------------------------------------------------
Hi Eric, can you review this for me?
I want to add some logs to make diagnosing intermittent failures like this one easier in the future.
Attachment #8845262 - Flags: review?(erahm)
Comment 42•8 years ago
Comment on attachment 8845262 [details] [diff] [review]
Bug1315037 - Add Logs for XPCJSCID
Review of attachment 8845262 [details] [diff] [review]:
-----------------------------------------------------------------
I like the idea, just a few thoughts on a more useful implementation. If you choose to do the nsDebug changes can you please split them out into a separate patch with a detailed commit comment?
::: js/xpconnect/src/XPCJSID.cpp
@@ +657,5 @@
> rv = compMgr->CreateInstance(mDetails->ID(), nullptr, *iid, getter_AddRefs(inst));
> MOZ_ASSERT(NS_FAILED(rv) || inst, "component manager returned success, but instance is null!");
>
> + if (NS_FAILED(rv) || !inst) {
> + NS_WARN_WITH_RESULT(rv, "nsJSCID::CreateInstance Failed.");
Strictly speaking I don't think you need a new warning type. You could replace these lines with:
> MOZ_RELEASE_ASSERT(NS_FAILED(rv) || !inst, ...);
> NS_ENSURE_SUCCESS(rv, NS_ERROR_XPC_CI_RETURNED_FAILURE);
That said, generally we frown on adding more NS_ENSURE's and I agree with Bobby that this could be useful elsewhere. I'll leave it up to you if you want to pursue the change here or just file a follow up and use an |NS_ENSURE_SUCCESS| for now.
::: xpcom/base/nsDebug.h
@@ +342,4 @@
> } \
> } while(0)
>
> +#define NS_WARN_WITH_RESULT(res, str) \
This name isn't quite right, as-is it should be |NS_WARN_WITH_RESULT_IF_FAILED| or more succinctly |NS_WARN_IF_FAILED|. You could modify the implementation to just not check |__rv| and keep the name as-is.
I think it would be more useful if it were implemented like |NS_WARN_IF| so that it can be used in an |if| statement, ie:
> if (NS_WARN_IF_FAILED(rv) || !srvc)
> return NS_ERROR_XPC_GS_RETURNED_FAILURE;
Hopefully that makes sense! Thanks for doing this, I think it'll be a nice improvement for debug logging.
Attachment #8845262 - Flags: review?(erahm) → review-
Comment 43•8 years ago
I think it is better to use NS_ENSURE_SUCCESS here and open a follow-up bug to add the new warning API.
I am not sure using MOZ_RELEASE_ASSERT is correct, since this is just an if; we don't want to crash when the condition is true.
So I just moved rv into NS_ENSURE_SUCCESS while retaining the original failure handling for a null instance/service.
Besides, considering what I want from a debug message when something fails, I want to see the failure code immediately. So if I implement a new warning API, I'll make an NS_WARN_IF_FAILED(rv) that encapsulates NS_warn_if_impl just like NS_WARN_IF, but with an additional message indicating which rv it is, using GetErrorName.
Is that reasonable?
Attachment #8846521 - Flags: review?(erahm)
Updated•8 years ago
Attachment #8845262 - Attachment is obsolete: true
Comment 44•8 years ago
Comment on attachment 8846521 [details] [diff] [review]
Use NS_ENSURE_SUCCESS for nsresult diagnosis.
Review of attachment 8846521 [details] [diff] [review]:
-----------------------------------------------------------------
Looks good, r=me. Keep an eye out for warning spam :)
Attachment #8846521 - Flags: review?(erahm) → review+
Comment 45•8 years ago
Updated•8 years ago
Keywords: checkin-needed
Comment 46•8 years ago
Pushed by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/1853515da1f5
Use NS_ENSURE_SUCCESS for nsresult diagnosis. r=erahm
Keywords: checkin-needed
Comment 47•8 years ago
bugherder
Status: NEW → RESOLVED
Closed: 8 years ago
status-firefox55:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Comment 48•8 years ago
AFAICT, what landed was a diagnostic patch, not a real fix. That said, thanks for working on this! It's definitely a lingering failure on ESR52 that would be great to eliminate!
Status: RESOLVED → REOPENED
status-firefox52:
--- → wontfix
status-firefox53:
--- → affected
status-firefox54:
--- → affected
status-firefox-esr52:
--- → affected
Resolution: FIXED → ---
Target Milestone: mozilla55 → ---
Comment hidden (Intermittent Failures Robot)
Updated•8 years ago
Assignee: cleu → nobody
Comment 50•8 years ago
I don't see any failures in the last 9 days; this looks to be fixed.
Whiteboard: [gfx-noted] → [gfx-noted][stockwell fixed:other]
Comment 51•8 years ago
It stopped happening after the Aurora version five months ago, and only Windows 7 32-bit non-e10s mode has this issue.
I think we can consider this fixed, since it has stopped happening and we will soon disable non-e10s mochitests by default.
Comment hidden (Intermittent Failures Robot)
Updated•8 years ago
Priority: -- → P3
Updated•8 years ago
Status: REOPENED → RESOLVED
Closed: 8 years ago → 8 years ago
Keywords: bulk-close-intermittents
Resolution: --- → INCOMPLETE
Updated•4 years ago