Closed Bug 1552334 Opened 5 months ago Closed 5 months ago

Work around slow packet.net instances

Categories

(Testing :: General, defect, P1)

Tracking

(firefox68 fixed)

RESOLVED FIXED
mozilla68

People

(Reporter: gbrown, Assigned: gbrown)

References

(Blocks 2 open bugs)

Details

Attachments

(2 files)

See bug 1545308: From the task perspective, some packet.net instances are slow. Sometimes tests run at approx 1/4 normal speed, and /proc/cpuinfo shows approx 1/4 the normal MHz. We have tried 2 potential solutions without success.

Android 4.3 emulator tests on AWS have experienced a similar issue: the emulator runs slowly on certain AWS instances. An existing mechanism in mozharness fails a task and triggers an automatic retry when the emulator reports bogomips below a configurable threshold.

Android x86_64 7.0 tests normally see an emulator bogomips value of 7000 - 8000; a quick random selection of problematic tasks found emulator bogomips values in the 1500 - 2500 range.
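
A minimal sketch of the kind of threshold check described above, not the mozharness implementation. The name `needs_retry` and the threshold of 3000 are hypothetical; the threshold is only chosen to sit between the slow (1500 - 2500) and normal (7000 - 8000) ranges quoted here.

```python
# Hypothetical threshold between the slow and normal ranges quoted above;
# the real mozharness value is configurable and is not stated in this bug.
MIN_BOGOMIPS = 3000.0


def needs_retry(bogomips, minimum=MIN_BOGOMIPS):
    """Return True when the emulator looks too slow to run tests reliably."""
    return bogomips < minimum


# Example values taken from the middle of the two observed ranges.
for value in (7500.0, 2000.0):
    action = "retry task" if needs_retry(value) else "run tests"
    print(f"bogomips={value:.0f}: {action}")
```

A task that lands on a slow instance would be retried rather than reported as a test failure, which is why bad instances should remain visible in the list of retried tasks.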

I would prefer to see us fully understand the problem and solve it in bug 1545308, but perhaps the time has come for a workaround. We should still be able to find bad instances by reviewing retried tasks, so work can continue in bug 1545308.

Let's enable the mozharness minimum bogomips check for packet.net Android tests.

For packet.net tests, enable the existing Android mozharness support for
retrying a task when the emulator's bogomips value is too low. This has
worked well for Android 4.3.

There are some other ideas in flight in bug 1545308; I don't mind putting this workaround aside if there is another solution imminent, but we really should get tests running reliably asap.

Pushed by gbrown@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/d51a13b4f0ef
Work around slow packet.net instances with min bogomips check; r=jmaher
Priority: -- → P1
Status: NEW → RESOLVED
Closed: 5 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla68
Blocks: 1411358, 1474758

This isn't working: Android 7.0 /proc/cpuinfo is slightly different from Android 4.3.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---

Older Android reported "BogoMIPS"; newer Android reports "bogomips".
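
To illustrate the mismatch, here is a hedged sketch of a case-insensitive search; it is not the landed patch. The helper name `find_bogomips` and the sample cpuinfo lines are made up for illustration.

```python
import re

# Matches "BogoMIPS : 292.86" as well as "bogomips : 7183.99".
BOGOMIPS_RE = re.compile(
    r"^\s*bogomips\s*:\s*([0-9]+(?:\.[0-9]+)?)",
    re.IGNORECASE | re.MULTILINE,
)


def find_bogomips(cpuinfo_text):
    """Return the first bogomips value in cpuinfo text, or None if absent."""
    match = BOGOMIPS_RE.search(cpuinfo_text)
    return float(match.group(1)) if match else None


# Illustrative sample lines, not captured from real emulators.
older = "Processor\t: ARMv7\nBogoMIPS\t: 292.86\n"
newer = "processor\t: 0\nbogomips\t: 7183.99\n"
print(find_bogomips(older), find_bogomips(newer))
```

A case-sensitive search for "BogoMIPS" would find nothing in the second sample, so no threshold comparison would ever be made.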

Pushed by gbrown@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/cff3c3d2b3c5
Ignore case when searching for android bogomips; r=jmaher
Status: REOPENED → RESOLVED
Closed: 5 months ago → 5 months ago
Resolution: --- → FIXED

Now this is definitely working. Many Android 7.0 retries are evident, and the majority are due to the bogomips check. A survey of these retries shows an excellent correlation with the problematic worker ids showing 800 MHz in cpuinfo.

There have been 0 Android 7.0 failures reported in bug 1474758 and bug 1411358 since this landing: an effective workaround.

excellent:)

Is there a high cost associated with this? How much extra CPU time do we spend on each slow-instance check and retry? How many do we have? I care more about capacity planning at this point.

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #10)

The workaround is inefficient; please see https://bugzilla.mozilla.org/show_bug.cgi?id=1545308#c36.

Regressions: 1553359
Blocks: 1556090