frequent timeouts on tp5 require minor cleanups in pageloader

RESOLVED FIXED

Status

Testing
Talos
RESOLVED FIXED
5 years ago
5 years ago

People

(Reporter: jmaher, Assigned: jmaher)

Tracking

Trunk
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment, 1 obsolete attachment)

pageloader is the core addon for testing many of the talos tests, specifically tp5.  We have seen frequent timeouts on OSX 10.6 and Win7 platforms.

I have adjusted a few things:
1) support for tptimeout, now for tp5n and tp5o tests, if a page doesn't load in 5 seconds, it will hit the timeout routine
2) allow for retries (up to 20).  If we hit the timeout routine, retry, this involves some resetting of pageloader state and variables.
3) remove mozillafilelogger for desktop tests (except for xperf).  I am not sure if this is a problem or not, but I saw some issues with it while debugging this.

I also found that there were some internal variables that needed cleaning up as we added/removed events.

The core of the problem is we call loadURI(url), and while that completes, we never receive the event.  While I am not able to figure out why we are not receiving these events, a retry does a great job of picking back up.  In rare cases we either:
1) hit the 20 retries
2) never trigger the timeout event

the current timeouts are <3% of the time that we currently timeout- in fact, I believe with this patch we should see these 2 rare failures <1% of the time.

While we could optimize this more (I have thoughts of detecting a timeout and closing the test tab and replacing it with a new one) and we could debug why events are not being seen by our handlers, I am at a point where the amount of time invested has yielded more knowledge and what I view as a reasonable cleanup to a long standing persistent problem.

If this continues to be a problem there is a starting point to spend another week or two on figuring out the next steps.
Thank you for this Joel - really appreciated! :-)
Created attachment 8367377 [details] [diff] [review]
cleanup pageloader to be more resiliant to timeouts (1.0)

I still have more testing, but this has yielded a lot of green retriggers on try server!
Assignee: nobody → jmaher
Status: NEW → ASSIGNED
Attachment #8367377 - Flags: review?(dminor)
Created attachment 8367402 [details] [diff] [review]
cleanup pageloader to be more resiliant to timeouts (1.1)

minor adjustments as I forgot to remove some calculations to pull the width out of the current display frame- we are not using this at all- easy to remove and safe.  Less work is a good thing!
Attachment #8367377 - Attachment is obsolete: true
Attachment #8367377 - Flags: review?(dminor)
Attachment #8367402 - Flags: review?(dminor)

Comment 4

5 years ago
Comment on attachment 8367402 [details] [diff] [review]
cleanup pageloader to be more resiliant to timeouts (1.1)

Review of attachment 8367402 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good to me.

::: talos/pageloader/chrome/pageloader.js
@@ +369,5 @@
> +    dumpLine("__WARNTimeout (" + numRetries + "/" + maxRetries + ") exceeded on " + pageName + "__WARN");
> +    // TODO: make this a cleaner cleanup
> +    pageCycle--;
> +    content.removeEventListener('load', plLoadHandler, true);
> +    content.removeEventListener('load', plLoadHandlerCapturing, true);

nit: you're using " rather than ' in this function
Attachment #8367402 - Flags: review?(dminor) → review+
landed on talos:
https://hg.mozilla.org/build/talos/rev/3ee19c646c89

will be doing a try run before updating talos.json to use this revision.
Depends on: 965731
Status: ASSIGNED → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.