Closed Bug 893391 Opened 12 years ago Closed 8 years ago

Investigate CPU steal / unexpected 100% CPU utilization

Categories

(Release Engineering :: General, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: gps, Unassigned)

System resource monitoring has landed in mozharness in bug 859573! The early returns seem to show an unexpected CPU usage value on some platforms.

Let's take a sample xpcshell test result:

From OS X 10.8 opt: https://tbpl.mozilla.org/php/getParsedLog.php?id=25246533&tree=Cedar&full=1

17:38:28 INFO - Total resource usage - Wall time: 1441s; CPU: 12%; Read bytes: 37781504; Write bytes: 9456452608; Read time: 22955; Write time: 385785
17:38:28 INFO - install - Wall time: 11s; CPU: 12%; Read bytes: 90764288; Write bytes: 177738752; Read time: 11310; Write time: 32294
17:38:28 INFO - run-tests - Wall time: 1430s; CPU: 12%; Read bytes: 30744576; Write bytes: 9276365824; Read time: 19331; Write time: 351462

From Ubuntu64 opt: https://tbpl.mozilla.org/php/getParsedLog.php?id=25251650&tree=Cedar&full=1

19:38:06 INFO - Total resource usage - Wall time: 1835s; CPU: 100%; Read bytes: 6754304; Write bytes: 5492326400; Read time: 1004; Write time: 3301420
19:38:06 INFO - install - Wall time: 14s; CPU: 100%; Read bytes: 0; Write bytes: 229376; Read time: 0; Write time: 76
19:38:06 INFO - run-tests - Wall time: 1821s; CPU: 100%; Read bytes: 1683456; Write bytes: 5492097024; Read time: 512; Write time: 3301344

OS X ran in 1430s but only used 12% CPU on average. Ubuntu64 ran in 1821s (longer) but used 100%. That doesn't sound right!

Since the Ubuntu slave is a virtual machine (tst-linux64-ec2-382), I suspect the 100% accounts for other virtual machines on the physical machine. Linux reports CPU usage from other virtual machines as "CPU steal."

The underlying resource monitoring code does distinguish between various CPU usage types (user, system, steal, etc). We just don't report on it yet.

We should consider reporting on or filtering out CPU steal from the output. This is likely a temporary workaround until bug 893388 lands and we can analyze (and graph) the raw data.
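To make the reporting-or-filtering idea concrete, here is a minimal Python sketch of how a per-category breakdown could separate steal from the VM's own work. It is not the mozharness implementation: the `CpuTimes` tuple, the `cpu_percentages` helper, and the sample numbers are all hypothetical, and the field names merely mirror the CPU time categories Linux exposes.

```python
from collections import namedtuple

# Hypothetical container for cumulative CPU times, in seconds.
# The field names are an assumption, chosen to mirror Linux CPU time categories.
CpuTimes = namedtuple("CpuTimes", "user system idle iowait steal")


def cpu_percentages(start, end):
    """Return (percent counting steal as busy, percent for user+system only).

    Both percentages are relative to all CPU time elapsed between the
    two samples, so the denominator still includes steal.
    """
    deltas = {f: getattr(end, f) - getattr(start, f) for f in CpuTimes._fields}
    total = sum(deltas.values())
    if total <= 0:
        return 0.0, 0.0
    own = deltas["user"] + deltas["system"]
    with_steal = 100.0 * (own + deltas["steal"]) / total
    without_steal = 100.0 * own / total
    return with_steal, without_steal


# Invented sample values, not taken from any real slave.
start = CpuTimes(user=0, system=0, idle=0, iowait=0, steal=0)
end = CpuTimes(user=10, system=2, idle=8, iowait=0, steal=80)
print(cpu_percentages(start, end))  # (92.0, 12.0)
```

If steal really is what inflates the Ubuntu64 figure, reporting the second number next to the first would show it immediately; whether that is the case is still only a suspicion.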
I accidentally compared a debug and opt build. Here are the proper values:

OS X:

19:37:52 INFO - Total resource usage - Wall time: 874s; CPU: 10%; Read bytes: 28880896; Write bytes: 10214718464; Read time: 16594; Write time: 464182
19:37:52 INFO - install - Wall time: 18s; CPU: 13%; Read bytes: 215960576; Write bytes: 302470144; Read time: 17683; Write time: 30979
19:37:52 INFO - run-tests - Wall time: 856s; CPU: 10%; Read bytes: 21164032; Write bytes: 9909498880; Read time: 14153; Write time: 430615

Ubuntu64:

19:38:06 INFO - Total resource usage - Wall time: 1835s; CPU: 100%; Read bytes: 6754304; Write bytes: 5492326400; Read time: 1004; Write time: 3301420
19:38:06 INFO - install - Wall time: 14s; CPU: 100%; Read bytes: 0; Write bytes: 229376; Read time: 0; Write time: 76
19:38:06 INFO - run-tests - Wall time: 1821s; CPU: 100%; Read bytes: 1683456; Write bytes: 5492097024; Read time: 512; Write time: 3301344

This is even worse because the Ubuntu VM is executing the tests over 2x slower! (Part of it might be explained by I/O wait - although a straight compare of the numbers between platforms isn't advised due to differences in how things are measured.)
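For anyone wanting to repeat this comparison across more logs, the following Python sketch pulls the phase name, wall time, and CPU percentage out of a summary line and computes the slowdown ratio. The regular expression and the `parse_summary` helper are assumptions based only on the lines quoted above, not on a documented log format.

```python
import re

# Assumed pattern, derived only from the summary lines quoted in this bug.
SUMMARY_RE = re.compile(
    r"INFO\s+-\s+(?P<phase>.+?)\s+-\s+Wall time: (?P<wall>\d+)s; CPU: (?P<cpu>\d+)%"
)


def parse_summary(line):
    """Return (phase, wall_seconds, cpu_percent), or None if the line does not match."""
    match = SUMMARY_RE.search(line)
    if match is None:
        return None
    return match.group("phase"), int(match.group("wall")), int(match.group("cpu"))


osx_line = (
    "19:37:52 INFO - run-tests - Wall time: 856s; CPU: 10%; Read bytes: 21164032; "
    "Write bytes: 9909498880; Read time: 14153; Write time: 430615"
)
ubuntu_line = (
    "19:38:06 INFO - run-tests - Wall time: 1821s; CPU: 100%; Read bytes: 1683456; "
    "Write bytes: 5492097024; Read time: 512; Write time: 3301344"
)

osx = parse_summary(osx_line)
ubuntu = parse_summary(ubuntu_line)

# Ratio of run-tests wall times: Ubuntu64 relative to OS X.
slowdown = ubuntu[1] / osx[1]
print(f"{ubuntu[0]}: Ubuntu64 took {slowdown:.2f}x the OS X wall time")
```

The ratio of 1821s to 856s comes out at roughly 2.13, which is where "over 2x slower" comes from.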
Product: mozilla.org → Release Engineering
We're not going to investigate this for buildbot AWS instances.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
Component: General Automation → General