Closed Bug 863716 Opened 7 years ago Closed 6 years ago

WinXP M1 "remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion." from WebGL mochitests

Categories

(Core :: Canvas: WebGL, defect)

x86
Windows XP
defect
Not set

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: RyanVM, Unassigned)

References

Details

(Keywords: crash, intermittent-failure)

Attachments

(3 files, 2 obsolete files)

Bug 862520 made this better, but we're still seeing disconnects in the WebGL suite.

https://tbpl.mozilla.org/php/getParsedLog.php?id=22011122&tree=Mozilla-Central
Attached patch skip another WebGL test on WinXP (obsolete) — Splinter Review
So among the WebGL test pages run just before the OOM, the only one to do anything weird with huge 'size' arguments and also the last one run just before the OOM, is:

conformance/more/functions/bufferSubDataBadArgs.html

Indeed, it does:

gl.bufferSubData(gl.ELEMENT_ARRAY_BUFFER, 2400000, new Uint16Array([1,2,3]));

Now that shouldn't cause any buffer allocation, but it's not unconceivable that a buggy driver would OOM as a result of that call, and that's my only theory anyway.
Attachment #739763 - Flags: review?(jgilbert)
Have you tested that this fixes it? It would seem like this at-worst allocates ~2.4MB for the buffer.
I haven't tested it. You're right, we should push this to try and trigger M1 a lot of times.
Why does the Try run show quickCheckAPI-B2.html still running (and hanging)?
Ouch. Good catch. Needs debugging.
This try push, if I didn't make a syntax error, will tell us something in the logs.
https://tbpl.mozilla.org/?tree=Try&rev=747c3aa9ef7f
Indeed, "testsToSkip = <html><head><title>404 Not Found</title></head><body><h1>404 Not Found</h1>", pushed https://hg.mozilla.org/integration/mozilla-inbound/rev/6cfec8d6ce46 as a bug 862520 followup.
Haha, yeah, that would do it :-) sorry about this omission.
Looks like now we want to skip quickCheckAPI-B3.html and bufferSubDataBadArgs.html.
bufferSubDataBadArgs is already under review here.

Let me update the patch to add quickCheckAPI-B3.
Attached patch skip more testsSplinter Review
Attachment #739763 - Attachment is obsolete: true
Attachment #739763 - Flags: review?(jgilbert)
Attachment #744151 - Flags: review?(jgilbert)
Attachment #744151 - Flags: review?(jgilbert) → review+
Please land this for me, inbound is closed now.
Attachment #744268 - Flags: review+
Attachment #744268 - Flags: checkin?
So could we actually be leaking? (for some definition of 'leak') Does some accumulation of memory accrue while running the test suite, such that we really do hit an OOM?
https://tbpl.mozilla.org/php/getParsedLog.php?id=22483709&tree=Mozilla-Inbound - quickCheckAPI-B4.html, whom I'm already ready to see the backside of.
https://hg.mozilla.org/mozilla-central/rev/3b76894efafa
Assignee: nobody → bjacob
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla23
This smells a lot like we're bailing water instead of plugging the leaks.

If we split the conformance suite tests in half, I wouldn't be surprised to see these failures go away.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: [keep open]
Maybe "we" are leaking, but it's fairly likely to be a leak in a driver rather than our own --- this is mature code that runs well elsewhere and hasn't changed much recently --- so disabling tests may still be the only thing to do about it.
Attached patch skip yet more tests (obsolete) — Splinter Review
Attachment #745260 - Flags: review?(jgilbert)
(In reply to Benoit Jacob [:bjacob] from comment #23)
> Maybe "we" are leaking, but it's fairly likely to be a leak in a driver
> rather than our own --- this is mature code that runs well elsewhere and
> hasn't changed much recently --- so disabling tests may still be the only
> thing to do about it.

Maybe, but I think we're just going to end up disabling all tests after these ones, as they come back OOM one after the other.

Would we still OOM if we did half the tests, killed the tab, then reloaded it to do the second half?
https://tbpl.mozilla.org/php/getParsedLog.php?id=22574566&tree=Mozilla-Inbound
Summary: WebGL mochitests cause frequent Windows XP mochitest-1 crashes → WinXP M1 "remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion." from WebGL mochitests
Comment 28 would be attachment 745260 [details] [diff] [review] moving the failure on to quickCheckAPI-C.html.
Final tally: out of 100, 8 did hit this, 6 in quickCheckAPI-C.html and 2 in quickCheckAPI-D_G.html, but on the other hand, twice I hit a complex of media/ timeouts that we almost never hit because you eat every single slow run, and twice I hit a complex of XHR timeouts that we've never hit before because you eat every single slow run.

So I'm starting another try run with those two skipped as well, because all too often I've waited for the perfect through a couple thousand failures. If someone can write a patch that lets us run those tests without killing buildbot, sweet, all the more so if they write it while we're still young, but I'm not waiting.
Or, I guess I could just fix it.
Assignee: bjacob → philringnalda
Alas, those last two, in or around bufferSubDataBadArgs.html, mean I'm not actually fixing it, just making it much less frequent.
Assignee: philringnalda → nobody
Whiteboard: [keep open] → [leave open]
Target Milestone: mozilla23 → ---
Pushed https://hg.mozilla.org/integration/mozilla-inbound/rev/b109e2dbf03b to at least take us down from around 10% failure to around 1%. I didn't fix the comment to explain the double GC yet, because I haven't yet found anyone who will explain the why behind what I was cargo-culting from other uses.
And because the world is harsh and cruel, that's a deleteBufferBadArgs.html on my push.
(In reply to Jeff Gilbert [:jgilbert] from comment #25)
> (In reply to Benoit Jacob [:bjacob] from comment #23)
> > Maybe "we" are leaking, but it's fairly likely to be a leak in a driver
> > rather than our own --- this is mature code that runs well elsewhere and
> > hasn't changed much recently --- so disabling tests may still be the only
> > thing to do about it.
> 
> Maybe, but I think we're just going to end up disabling all tests after
> these ones, as they come back OOM one after the other.

We're still talking about bufferData / bufferSubData tests here. It's still possible that we're dealing with a driver issue that's limited to these functions. For that reason, I still believe that the present patch is worth a try.

In later comments Philor also mentioned failures in other pages, but they are pages just after the pages we are disabling here, so it could be just a deferred effect of this OOM bug.

> 
> Would we still OOM if we did half the tests, killed the tab, then reloaded
> it to do the second half?

I don't know how we would do that automatically on test slaves. Maybe xpcshell can do that, but that sounds like a lot of work.
Flags: needinfo?(jgilbert)
Jeff - review ping. I know this isn't a great solution, but neither is the current situation.
Please can you disable some more tests asap (the patch awaiting review seems out of date given comment 54) - since this is soon going to have things disabled by sheriffs, who may not be so discerning as to what we switch off. Thank you :-)
Flags: needinfo?(bjacob)
So first of all, it seems that this changeset accidentally removed from the skip-list two of the three tests that we had added to that skip-list in attempts to fix this bug:

https://hg.mozilla.org/mozilla-central/rev/b109e2dbf03b

So I just landed on inbound a cset re-adding them to the skip-list:

https://hg.mozilla.org/integration/mozilla-inbound/rev/a644a0c9f92d

Other than that I don't see how the patch awaiting review here is obsoleted by comment 54, but looking at recent starrings, if we are going to take them at face value, they would seem to imply that we have to skip also some of the other *BadArgs tests, namely the texture-related ones. But it's hard to tell if that isn't just an artifact of having removed the above-mentioned two tests from the skip-list.
Flags: needinfo?(bjacob)
We want to wait a little bit to see if my inbound push (previous comment), re-adding the two tests to the skip-list, is enough. If not, then this patch takes a heavy-handed approach to adding a bunch more tests to it, namely all the *BadArgs ones that pertain to buffers or textures.
Attachment #745260 - Attachment is obsolete: true
Attachment #745260 - Flags: review?(jgilbert)
Attachment #749862 - Flags: review?(jgilbert)
(In reply to Benoit Jacob [:bjacob] from comment #124)
> Other than that I don't see how the patch awaiting review here is obsoleted
> by comment 54

I meant more since the comment 54 changeset removed items from the disable list (as though the approach there was supposed to have fixed things), but that appears to have been an oversight now :-)
Thank you :-)
It wasn't an accidental removal, with an empty skiplist and 100 runs on try (multiple times, as I tried slightly different things, so say 400-500 runs) I was seeing zero quickCheckAPI failures.
Comment on attachment 749862 [details] [diff] [review]
add more tests to skip

As much as I love disabling tests (and that's *really really a lot*), you don't need to bother with this. If you expand the tbplbot comments and look at the slave names, you'll see that every single one of them includes "r3-xp" and that none of them includes "xp32-ix". That means that this only happens on the old WinXP Mac mini slaves, and it doesn't happen on the new slaves. Just let it ride, the old slaves (and their insufficient memory or cruddy driver or whatever the actual problem is) will be gone before too much longer.
Attachment #749862 - Flags: feedback-
Depends on: 925285
Whiteboard: [leave open]
Comment 248 was a mis-star.

The others are on release branches etc, likely for the reason in comment 193.
Status: REOPENED → RESOLVED
Closed: 7 years ago6 years ago
Resolution: --- → WORKSFORME
Flags: needinfo?(jgilbert)
Attachment #749862 - Flags: review?(jgilbert)
You need to log in before you can comment on or make changes to this bug.