Open Bug 896840 Opened 11 years ago Updated 2 years ago

Find cause of reftests sometimes failing with "load failed: null", on Android at least

Categories

(Testing :: Reftest, defect)

ARM
Android
defect

Tracking

(Not tracked)

People

(Reporter: dholbert, Unassigned)

References

Details

Attachments

(1 file)

As documented in bug 789751, we have lots of Android test failures where the harness reports the failure as e.g.:
{
REFTEST TEST-UNEXPECTED-FAIL | http://10.250.48.23:30311/tests/gfx/tests/crashtests/580719-1.html | load failed: null
}

The "null" there is us trying to print a passed-in explanation for the failure, but either no explanation was passed, or the explanation string was never set, or something like that.

We should be better about actually printing a useful message here. (I suspect these are mostly timeouts, but I can't really know for sure because all I have is "null".)

Some handy MXR links:
 - The failure message is printed here:
> 1641 function LoadFailed(why)
> 1642 {
> [...]
> 1644     gDumpLog("REFTEST TEST-UNEXPECTED-FAIL | " +
> 1645          gURLs[0]["url" + gState].spec + " | load failed: " + why + "\n");
[1]

So, presumably a null value is being passed for the "why" argument there.

LoadFailed gets called by RecvFailedLoad() [2], which is registered as a message listener for "reftest:FailedLoad"[3], which gets invoked asynchronously by SendFailedLoad in reftest-content.js [4], which gets invoked by LoadFailed() in reftest-content.js [5].

And this appears to be where the "why" explanation-string traces back to -- LoadFailed passes in gFailureReason as the explanation. And there are some clauses that set or append to gFailureReason, but there's also code (e.g. in RecordResult()) that sets it to null.

We might want to null-check it at the callsite in [5] and pass in the constant-string "unknown reason" or something like that, as a first pass, to make sure it's *actually* making it through to the actual logging code via the code-path I described above.

[1] http://mxr.mozilla.org/mozilla-central/source/layout/tools/reftest/reftest.js#293
[2] http://mxr.mozilla.org/mozilla-central/source/layout/tools/reftest/reftest.js#1869
[3] http://mxr.mozilla.org/mozilla-central/source/layout/tools/reftest/reftest.js#1801
[4] http://mxr.mozilla.org/mozilla-central/source/layout/tools/reftest/reftest-content.js#790
[5] http://mxr.mozilla.org/mozilla-central/source/layout/tools/reftest/reftest-content.js#655
(In reply to Daniel Holbert [:dholbert] from comment #0)
> As documented in bug 789751, we have lots of Android test failures where the
> harness reports the failure as e.g.:

I can read.
dholbert++ for filing this :-)
[Thanks, edmorley!]

So here's a patch to hopefully make some progress on this. Per comment 0, this seems to trace back to gFailureReason being undefined or null, so this patch initializes it and resets it to actual helpful string values instead of to those values.

(I initially wrote this as a temporary debugging patch, with the idea being we'd land it & watch for bug 789751 & then backout. But we might even just want to keep this or a modified version of it.)
Attachment #781970 - Flags: review?(dbaron)
(I pushed this to Try, hoping to hit an instance of bug 789751 & learn something, but even after triggering tons of reftest/crashtest jobs, I haven't hit bug 789751 yet, unfortunately. It's low-frequency enough that it's hard to hit on-demand, I guess.  That Try run is here, for reference: https://tbpl.mozilla.org/?tree=Try&rev=4167e2a99987 )
Comment on attachment 781970 [details] [diff] [review]
diagnosing patch 1

It doesn't seem all that interesting to distinguish between these two cases -- isn't all it will tell you whether this is the first reftest in this reftest run?

I think both of these strings should probably say
  "Haven't started test load yet."
... or at least should, if you have another assignment to gFailureReason for all later steps in the loading process.  (The current gFailureReason logic is mostly about debugging the special things we wait for, and not the "normal" loading.)

Also, for debugging this problem, it's probably more useful to find points *earlier* in the test loading process at which you could set gFailureReason -- that seems more likely to lead to useful information.  That may or may not mean building a complete set of gFailureReasons describing everything we're waiting for.

Also, given the way gFailureReason works, it seems bad to check in reasons with gaps in the logic *after* them, since gFailureReason is a description of what we're currently waiting for.

But I think once that's done, the "Haven't started test load yet" message would be useless, since if we have everything we're waiting for documented, then we would never get stuck in that state.
Attachment #781970 - Flags: review?(dbaron) → review-
FWIW, we're hitting these on B2G now as well.
OOps, just realised comments 8-10 were meant to be in bug 789751
Blocks: 789751
Summary: Reftests sometimes fail with "load failed: null", on Android at least → Find cause of reftests sometimes failing with "load failed: null", on Android at least
(In reply to David Baron [:dbaron] (needinfo? me) (UTC-7) from comment #6)
> Also, for debugging this problem, it's probably more useful to find points
> *earlier* in the test loading process at which you could set gFailureReason
> -- that seems more likely to lead to useful information.  That may or may
> not mean building a complete set of gFailureReasons describing everything
> we're waiting for.

I've been playing around with some try pushes for this, and got the jsreftest failure to occur with some additional logging:
https://tbpl.mozilla.org/php/getParsedLog.php?id=38005581&tree=Try

When this failure occurs, in StartTestURI() we return from LoadURI() [1] but then never enter OnDocumentLoad() [2].

[1] https://mxr.mozilla.org/mozilla-central/source/layout/tools/reftest/reftest-content.js#136
[2] https://mxr.mozilla.org/mozilla-central/source/layout/tools/reftest/reftest-content.js#556

David, are you the best person to investigate further?
Blocks: 720452
Flags: needinfo?(dbaron)
OS: Linux → Android
Hardware: x86_64 → ARM
Clearing old needinfo requests where I've given up expecting a response, to clear up my bugzilla-todos dashboard "to nag" list.
Flags: needinfo?(dbaron)
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: