Some packet.net instances are slow: /proc/cpuinfo shows reduced MHz
Categories: Taskcluster :: General, defect
Tracking: Not tracked
People: Reporter: gbrown, Unassigned
References: Blocks 2 open bugs
Details: Keywords: leave-open
Attachments: 1 file
In bug 1540280 we noticed that some Android test tasks running on packet.net fail due to reduced performance and that poor performance is strongly associated with particular instances: machine-13 initially, but more recently machine-20, machine-7, and perhaps machine-12.
Bug 1474758 has a good collection of these test failures due to poor performance.
Android test tasks create an artifact called "android-performance.log" which includes a dump of /proc/cpuinfo. The /proc/cpuinfo from poor-test-performance logs shows very low "cpu MHz": around 800 MHz vs 3000+ MHz for normal performance runs.
Low MHz instances:
https://taskcluster-artifacts.net/VcmCJJ7HTj22PmZ9kSp7tg/0/public/test_info//android-performance.log
https://taskcluster-artifacts.net/EldBuqS9RdeIwMvvtdHXlA/0/public/test_info//android-performance.log
https://taskcluster-artifacts.net/doK_4p7NQbuMhJBDbMw0Bw/0/public/test_info//android-performance.log
eg.
Host /proc/cpuinfo:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 94
model name : Intel(R) Xeon(R) CPU E3-1240 v5 @ 3.50GHz
stepping : 3
microcode : 0x6a
cpu MHz : 799.941 <<<===
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb invpcid_single intel_pt kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
bugs :
bogomips : 7007.88
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
Normal MHz instances:
https://taskcluster-artifacts.net/ZCYF9RmrQm6o1wU6AQxNmw/0/public/test_info//android-performance.log
https://taskcluster-artifacts.net/BGlX0vd9SKiMTHJH-XGIqg/0/public/test_info//android-performance.log
https://taskcluster-artifacts.net/FmVpczkERQuqgTGDCZ6L5A/0/public/test_info//android-performance.log
eg
Host /proc/cpuinfo:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 94
model name : Intel(R) Xeon(R) CPU E3-1240 v5 @ 3.50GHz
stepping : 3
microcode : 0x8a
cpu MHz : 3700.156 <<<===
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb invpcid_single intel_pt kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
bugs :
bogomips : 7008.63
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
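A minimal sketch of pulling the lowest "cpu MHz" value out of a cpuinfo dump like the ones above. The name min_cpu_mhz is illustrative, and the optional file argument assumes the host dump has been saved on its own; android-performance.log contains other sections as well.

```shell
#!/bin/bash
# Print the lowest "cpu MHz" value found in a cpuinfo-style file.
# Defaults to the live /proc/cpuinfo; pass a saved copy as $1 instead.
min_cpu_mhz() {
    awk -F: '
        /^cpu MHz/ {
            mhz = $2 + 0    # numeric prefix of the text after the colon
            if (!seen || mhz < min) { min = mhz; seen = 1 }
        }
        END {
            if (!seen) exit 1    # no "cpu MHz" line at all
            printf "%.3f\n", min
        }
    ' "${1:-/proc/cpuinfo}"
}
```

Taking the minimum rather than the first value means one throttled core is enough to flag the host.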
Comment 1•6 years ago
Is that Linux CPU scaling?
Comment 2•6 years ago (Reporter)
See especially https://bugzilla.mozilla.org/show_bug.cgi?id=1474758#c41: "we may be facing resource starvation in shared tasks. The solution would be to reduce worker capacity and spawn more instances."
While I have some concerns (why didn't we see this effect when we initially experimented with a very small pool? why did this seem to start fairly suddenly a few weeks ago, when there were no big changes to load?), it seems the best explanation. Can we change to 2 workers per instance?
Comment 3•6 years ago
(In reply to Geoff Brown [:gbrown] from comment #2)
> While I have some concerns (why didn't we see this effect when we initially experimented with a very small pool? why did this seem to start fairly suddenly a few weeks ago, when there were no big changes to load?), it seems the best explanation. Can we change to 2 workers per instance?
It will take a while to recreate all the instances. I'll start tomorrow AM. I'll also want to socialize the associated increase in cost with Travis.
Comment 4•6 years ago
I've bumped the total number of instances up to 40 from 25.
I started recreating machine-0, but it took 5 attempts to recreate that single instance due to the flakiness of bootstrapping from scratch every time. Rather than recreate the existing instances, I've decided to provision the new, higher-number instances first instead. Once we have that new capacity, I'll return and recreate the low-numbered ones.
So far we have 4 instances that are each running only 2 workers instead of 4. These are:
machine-0
machine-25
machine-30
machine-34
That gives you some indication of how often provisioning works on the first pass. :/
I'll continue slogging through this.
Comment 5•6 years ago
machine-[24-39] are all running with 2 workers each. I'll now start recreating the existing instances.
Comment 6•6 years ago
There are now 40 packet.net instances, each of which is running 2 workers.
If we are still seeing timeouts and slowdowns in this new configuration, we can drop to a single worker per instance, but at that point, we've pretty much invalidated the reason for pursuing packet.net instances in the first place.
Comment 7•6 years ago (Reporter)
:coop -- Despite your efforts here, this condition continues; in fact, I see no difference.
From bug 1474758, all of these recent (April 28, 29) failures' android-performance.log artifacts show /proc/cpuinfo around 800 MHz:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=243333700&repo=autoland&lineNumber=14416
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=243310537&repo=autoland&lineNumber=12875
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=243217995&repo=mozilla-inbound&lineNumber=28534
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=243217990&repo=mozilla-inbound&lineNumber=27638
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=243217501&repo=autoland&lineNumber=29276
eg https://taskcluster-artifacts.net/H71E7qBJRbGwhymOkqE4YQ/0/public/test_info//android-performance.log
Host /proc/cpuinfo:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 158
model name : Intel(R) Xeon(R) CPU E3-1240 v6 @ 3.70GHz
stepping : 9
microcode : 0x8e
cpu MHz : 799.980
If we really are running max 2 workers/instance now, then I think there is no correlation between the reduced cpuinfo MHz and # workers/instance.
Comment 8•6 years ago
I would be surprised to see such a correspondence -- nothing in docker apportions MHz between containers.
This sounds a lot more like CPU throttling has somehow been enabled.
Comment 9•5 years ago
I just opened a new ticket in packet for the CPU throttling problem.
Comment 10•5 years ago
For reference, a few years ago :garndt and I worked on running talos at packet.net and saw the same thing: specific workers ran significantly slower than the majority of the other workers. We were doing 1 process/machine. We spent a few weeks working with our contact at packet.net at the time and didn't get anywhere; they were confused and couldn't explain it.
If we were to work around this and had 4 instances/machine, that means that all instances on those machines would be "auto retried/failed".
Does this change over time, as in one machine works fine at 9am but at 11am it is running slower? Was the machine rebooted in between, etc.?
Comment 11•5 years ago
I got a response from packet.net:
> So after reviewing what could be the possible cause for this, I dug deeper on c1.small.x86 capabilities instead.
> I highly suggest tuning your CPU and verifying this; if you haven't already, here is our best guide:
> https://support.packet.com/kb/articles/cpu-tuning
> This can be verified/confirmed by
> cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
I am working on another bug and will dig into this afterward.
Comment 12•5 years ago (Reporter)
Verify setting of scaling_governor by adding it to existing log.
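A rough sketch of what such logging could look like. This is not the patch that landed; dump_scaling_governors and its optional directory argument are made-up names for illustration. It prints one "path: governor" line per CPU.

```shell
#!/bin/bash
# Print "<path>: <governor>" for every CPU that exposes a cpufreq governor.
# $1 is the sysfs CPU directory; it is a parameter only so that the function
# can be pointed at a test tree instead of the real /sys.
dump_scaling_governors() {
    local root="${1:-/sys/devices/system/cpu}" f
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -r "$f" ] || continue    # glob did not match, or file unreadable
        printf '%s: %s\n' "$f" "$(cat "$f")"
    done
}
```

Reading these files needs no special privileges, which is why a check like this can run inside a task even when changing the governor cannot.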
Comment 13•5 years ago
I've seen this on multiple Intel based machines - desktops and servers.
It's related to the scaling governor - and the default appears to throttle down the processor to 800MHz (usually) if it's not modified. You may have to tweak the following scripts - depending on your processor type.
First to see what mode the scaling governor is in:
#!/bin/bash
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
----- call this mode.sh --------
Next to set all the cores for max performance - try this:
#!/bin/bash
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor >/dev/null
------ call this script perf.sh --------
Next - rerun the "mode.sh" script above and you should see:
performance
printed out once for every core you have access to.
Lastly - if you want to save energy, you can set the scaling governor to save power with:
#!/bin/bash
echo powersave | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor >/dev/null
------- call this psave.sh -------
These scripts are copied from a server with a 6-core, 12-thread Xeon processor with an Intel motherboard. I've hunted through all the BIOS settings and set everything up to "go fast" - that is - no throttling. Yet I have to run the perf.sh script every time the machine reboots!!
Comment 14•5 years ago
Pushed by gbrown@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/d6416b899841 Add cpufreq/scaling_governor info to android-performance.log; r=wcosta
Comment 15•5 years ago
bugherder
Comment 16•5 years ago (Reporter)
Thanks Al - sounds like we are on the right track.
And my diagnostic patch confirms that we usually see "powersave".
I can't write to scaling_governor in the same place -- no privileges. Maybe that's better done in the worker? Hoping wcosta or coop can sort that out...
Comment hidden (Intermittent Failures Robot)
Comment 18•5 years ago
I just redeployed instances with CPU governor set to "performance". :gbrown, could you please confirm the slowness issue is gone?
Comment 19•5 years ago (Reporter)
I still see "powersave" reported by recent tasks:
https://treeherder.mozilla.org/logviewer.html#?job_id=244855542&repo=autoland
https://taskcluster-artifacts.net/aA8f776nSMqt3jS99rDvzQ/0/public/test_info//android-performance.log
/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor: powersave
/sys/devices/system/cpu/cpu1/cpufreq/scaling_governor: powersave
/sys/devices/system/cpu/cpu2/cpufreq/scaling_governor: powersave
/sys/devices/system/cpu/cpu3/cpufreq/scaling_governor: powersave
/sys/devices/system/cpu/cpu4/cpufreq/scaling_governor: powersave
/sys/devices/system/cpu/cpu5/cpufreq/scaling_governor: powersave
/sys/devices/system/cpu/cpu6/cpufreq/scaling_governor: powersave
/sys/devices/system/cpu/cpu7/cpufreq/scaling_governor: powersave
Comment 20•5 years ago
Two things:
- Given that reducing the # of workers per instance didn't fix this problem, are we back to running 4 workers/instance with 25 instances total again? I never landed my change to switch to 2 workers/instance with 40 instances when it became clear that wasn't helping, and AFAICT there's been no other change to the terraform file: https://github.com/taskcluster/taskcluster-infrastructure/blob/master/docker-worker.tf Just want to make sure that the github repo is representative of the current state.
- The sooner we can get to image-based deployments in packet.net, the better. This iteration cycle is going to be painful otherwise. Wander: can you pick up bug 1523569 once sccache in GCP is done, please? Maybe a git repo is overkill, but having some sort of local filestore for images hosted in packet.net will be required for bug 1508790 anyway.
Comment 21•5 years ago
(In reply to Geoff Brown [:gbrown] from comment #19)
> I still see "powersave" reported by recent tasks:
> https://treeherder.mozilla.org/logviewer.html#?job_id=244855542&repo=autoland
> https://taskcluster-artifacts.net/aA8f776nSMqt3jS99rDvzQ/0/public/test_info//android-performance.log
> /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor: powersave
> /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor: powersave
> /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor: powersave
> /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor: powersave
> /sys/devices/system/cpu/cpu4/cpufreq/scaling_governor: powersave
> /sys/devices/system/cpu/cpu5/cpufreq/scaling_governor: powersave
> /sys/devices/system/cpu/cpu6/cpufreq/scaling_governor: powersave
> /sys/devices/system/cpu/cpu7/cpufreq/scaling_governor: powersave
There was a bustage in the code, I redeployed and it now has "performance"
Comment 22•5 years ago (Reporter)
(In reply to Wander Lairson Costa [:wcosta] from comment #21)
> There was a bustage in the code, I redeployed and it now has "performance"
Sorry, but I still see "powersave" reported by all recent tasks:
https://treeherder.mozilla.org/logviewer.html#?job_id=245395212&repo=autoland (Started: Wed, May 8, 13:04:33)
https://taskcluster-artifacts.net/Ghl2qrptRyujTc4L3ecniQ/0/public/test_info//android-performance.log
/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor: powersave
/sys/devices/system/cpu/cpu1/cpufreq/scaling_governor: powersave
/sys/devices/system/cpu/cpu2/cpufreq/scaling_governor: powersave
/sys/devices/system/cpu/cpu3/cpufreq/scaling_governor: powersave
/sys/devices/system/cpu/cpu4/cpufreq/scaling_governor: powersave
/sys/devices/system/cpu/cpu5/cpufreq/scaling_governor: powersave
/sys/devices/system/cpu/cpu6/cpufreq/scaling_governor: powersave
/sys/devices/system/cpu/cpu7/cpufreq/scaling_governor: powersave
Comment 23•5 years ago
(In reply to Wander Lairson Costa [:wcosta] from comment #21)
> There was a bustage in the code, I redeployed and it now has "performance"
Relevant PR is here: https://github.com/taskcluster/taskcluster-infrastructure/pull/46
I suggested to Wander in IRC that we should try fixing this by hand, i.e. ssh to each machine and manually set scaling_governor to performance. We can then iterate on making the deployment automation do this automatically.
Comment 24•5 years ago (Reporter)
The latest tasks have "performance" now:
https://treeherder.mozilla.org/logviewer.html#?job_id=245429039&repo=autoland
https://taskcluster-artifacts.net/Cug6JQYTT-2QwWLnjJR6tQ/0/public/test_info//android-performance.log
/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor: performance
/sys/devices/system/cpu/cpu1/cpufreq/scaling_governor: performance
/sys/devices/system/cpu/cpu2/cpufreq/scaling_governor: performance
/sys/devices/system/cpu/cpu3/cpufreq/scaling_governor: performance
/sys/devices/system/cpu/cpu4/cpufreq/scaling_governor: performance
/sys/devices/system/cpu/cpu5/cpufreq/scaling_governor: performance
/sys/devices/system/cpu/cpu6/cpufreq/scaling_governor: performance
/sys/devices/system/cpu/cpu7/cpufreq/scaling_governor: performance
Self ni to check on associated intermittent failures tomorrow.
Comment 25•5 years ago (Reporter)
Oh darn.
From bug 1474758,
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=245433579&repo=autoland&lineNumber=13527
https://taskcluster-artifacts.net/SOjCQvELT7OoeYGgaVYy-w/0/public/test_info//android-performance.log
Host cpufreq/scaling_governor:
/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor: performance
/sys/devices/system/cpu/cpu1/cpufreq/scaling_governor: performance
/sys/devices/system/cpu/cpu2/cpufreq/scaling_governor: performance
/sys/devices/system/cpu/cpu3/cpufreq/scaling_governor: performance
/sys/devices/system/cpu/cpu4/cpufreq/scaling_governor: performance
/sys/devices/system/cpu/cpu5/cpufreq/scaling_governor: performance
/sys/devices/system/cpu/cpu6/cpufreq/scaling_governor: performance
/sys/devices/system/cpu/cpu7/cpufreq/scaling_governor: performance
Host /proc/cpuinfo:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 94
model name : Intel(R) Xeon(R) CPU E3-1240 v5 @ 3.50GHz
stepping : 3
microcode : 0x6a
cpu MHz : 799.941
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb invpcid_single intel_pt kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
bugs :
bogomips : 7008.65
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
Notice "performance": this happened after yesterday's change.
Notice "cpu MHz : 799.941"!!
Comment 26•5 years ago
(In reply to Geoff Brown [:gbrown] from comment #25)
> Oh darn.
> Notice "cpu MHz : 799.941"!!
OK, that's concerning.
Wander: can you double-check the pool of instances to see how pervasive this is, and then reach out to packet for next steps here?
I've NI-ed Al who is already on this bug too.
Comment 27•5 years ago (Reporter)
I see 3 new examples so far:
https://treeherder.mozilla.org/logviewer.html#?job_id=245544363&repo=autoland
https://treeherder.mozilla.org/logviewer.html#?job_id=245433579&repo=autoland
https://treeherder.mozilla.org/logviewer.html#?job_id=245434394&repo=mozilla-central
2 of the 3 examples above have worker id "machine-4". Can that worker id be translated into a packet.net instance that someone could ssh into to investigate further?
Comment 28•5 years ago
The fix I provided in Comment 13 will not persist after a reboot.
Ensuring that the scaling governor stays in performance mode across reboots requires the following steps:
Add the following line:
GOVERNOR="performance"
in
/etc/init.d/cpufrequtils
On Ubuntu 18.04 you need to run:
sudo apt-get install cpufrequtils
sudo systemctl disable ondemand
Comment 29•5 years ago
I spotted frequencies going to 800 MHz even with scaling governor set to performance. What I am now doing is, besides setting scaling governor to performance, I also set the minimum CPU frequency to 3.5 GHz. :gbrown, could you please keep an eye on failing tasks?
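The deployment change itself is not shown in this bug. As a hedged illustration only, raising the floor through the cpufreq sysfs files could look like the sketch below. set_min_cpu_freq_khz is a hypothetical name; the one firm detail is that these files take values in kHz, so 3.5 GHz is written as 3500000.

```shell
#!/bin/bash
# Write a minimum frequency (in kHz, the unit the cpufreq sysfs files use)
# to every CPU's scaling_min_freq. Needs root on a real machine.
# $1: minimum frequency in kHz
# $2: sysfs CPU directory (a parameter only so the function can be tested)
set_min_cpu_freq_khz() {
    local khz="$1" root="${2:-/sys/devices/system/cpu}" f rc=0
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_min_freq; do
        [ -w "$f" ] || { echo "cannot write $f" >&2; rc=1; continue; }
        echo "$khz" > "$f" || rc=1
    done
    return "$rc"    # non-zero if any CPU could not be updated
}
```

On a host this would be called as root, for example `set_min_cpu_freq_khz 3500000`.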
Comment 30•5 years ago
Update: even so, I can see machines running with 800 MHz.
Comment 31•5 years ago (Reporter)
Associated test failures definitely continue and remain a big concern. Logs show "performance" and ~800 MHz.
Do we have more ideas? Any work in progress?
I am tempted to have the task fail and retry when it finds this condition.
Comment 32•5 years ago
Do we know how often this happens? Is it happening on all workers or only for some of those? Also which scaling driver is actually running? Maybe there is a bug we are just hitting here?
Comment 33•5 years ago (Reporter)
(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #32)
> Do we know how often this happens?
Not really. It is not very frequent, but there are about 10 cases found in bug 1474758 each day; that is only jsreftests, which account for maybe 10% of packet.net tasks, so a very gross estimate would be 100 cases of low MHz per day.
> Is it happening on all workers or only for some of those?
There is a correlation with certain worker-ids for a period of time, but the affected worker-ids seem to change from day to day. Look at the "Machine name" column of https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2019-05-09&endday=2019-05-16&tree=trunk&bug=1474758 to see what I mean.
> Also which scaling driver is actually running?
When a task cats /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor, it sees "performance" now.
> Maybe there is a bug we are just hitting here?
A scaling driver / governor bug?
Comment 34•5 years ago
(In reply to Geoff Brown [:gbrown] from comment #33)
> Not really. It is not very frequent, but there are about 10 cases found in bug 1474758 each day; that is only jsreftests, which account for maybe 10% of packet.net tasks, so a very gross estimate would be 100 cases of low MHz per day.
> > Is it happening on all workers or only for some of those?
> There is a correlation with certain worker-ids for a period of time, but the affected worker-ids seem to change from day to day. Look at the "Machine name" column of https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2019-05-09&endday=2019-05-16&tree=trunk&bug=1474758 to see what I mean.
It looks like it happens for the workers named 7, 8, 34, and 36. Others only appear once in that list. Maybe someone could check one of those manually? Can we blacklist them (take them out of the pool) for now?
> > Also which scaling driver is actually running?
> When a task cats /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor, it sees "performance" now.
So comment 23 referenced: https://github.com/taskcluster/taskcluster-infrastructure/pull/46/files
Where in this patch do we actually set the governor to performance? We only start the cpufreq service, don't we? Using cat in a task (?) to set it: doesn't that conflict with the service?
> A scaling driver / governor bug?
Yes, so it would be good to know which driver is actually used. Is it intel_pstate?
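The driver question can be answered from sysfs, which exposes a scaling_driver file next to scaling_governor. A small sketch follows; the function name and the directory argument are illustrative, not taken from any worker code.

```shell
#!/bin/bash
# Print the distinct cpufreq scaling drivers in use (for example
# intel_pstate or acpi-cpufreq), one per line.
# $1 is the sysfs CPU directory, overridable for testing.
scaling_drivers() {
    local root="${1:-/sys/devices/system/cpu}" f
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_driver; do
        [ -r "$f" ] && cat "$f"
    done | sort -u
}
```

A single line of output is the expected case; more than one line would mean different CPUs are managed by different drivers.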
Comment 35•5 years ago (Reporter)
Bug 1552334 recognizes the slow instances and retries the affected task. That is effective in avoiding test failures, but sometimes delays test runs and is inefficient in terms of worker use: It would still be great to see this bug resolved properly.
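The actual check lives in bug 1552334 and is not reproduced here. The following is only a sketch of the general idea, under stated assumptions: the 1000 MHz threshold, the exit status 4, and both function names are invented for illustration, and how a non-zero status turns into a retry is outside the sketch.

```shell
#!/bin/bash
# Succeed (return 0) when a measured MHz value is below a limit.
# awk does the comparison because the shell has no floating-point test.
mhz_below() {
    awk -v mhz="$1" -v limit="$2" 'BEGIN { exit !(mhz + 0 < limit + 0) }'
}

# Hypothetical task preamble: return a distinct status when the host looks
# throttled, so that a wrapper can give up early and ask for a retry.
# $1: measured MHz  $2: threshold (assumed default 1000)  $3: status to return
check_host_speed() {
    local mhz="$1" limit="${2:-1000}" retry_status="${3:-4}"
    if mhz_below "$mhz" "$limit"; then
        echo "host CPU at ${mhz} MHz (< ${limit} MHz): requesting retry" >&2
        return "$retry_status"
    fi
    return 0
}
```

With the values seen in this bug, `check_host_speed 799.941` returns 4 and `check_host_speed 3700.156` returns 0.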
Comment 36•5 years ago (Reporter)
(In reply to Geoff Brown [:gbrown] from comment #35)
> sometimes delays test runs and is inefficient in terms of worker use
As an example, one task was retried 4 times (getting worker 8 or worker 34 each time). The task takes about 5 minutes to detect the poor performance (this could be improved) + time for rescheduling, so the start of the successful task was delayed by about 30 minutes in total.
Comment 37•5 years ago
So maybe we estimate 7 minutes/retry, then calculate the total retries (or % retries, or retries/day); from that we could determine how many workers we need.
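A back-of-the-envelope version of that calculation, as a sketch. The inputs are the thread's own rough figures (about 100 low-MHz cases per day from comment 33, 7 minutes per retry from this comment) and should be read as guesses; retry_cost is an illustrative name.

```shell
#!/bin/bash
# Estimate worker time lost to retries.
# $1: retries per day, $2: minutes lost per retry.
# Prints the minutes lost per day and the equivalent number of workers kept
# busy around the clock (minutes / 1440).
retry_cost() {
    awk -v n="$1" -v m="$2" 'BEGIN {
        lost = n * m
        printf "%d worker-minutes/day, about %.2f workers\n", lost, lost / 1440
    }'
}

# retry_cost 100 7   prints: 700 worker-minutes/day, about 0.49 workers
```

On these numbers the retries cost roughly half a worker, before counting the delay to the retried tasks themselves.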
Actually if we have this retry in place, possibly we could consider running talos on linux @packet.net :)
Comment 38•5 years ago (Reporter)
This seems to have stopped!
I see no retries due to reduced MHz since May 25. Wonderful!
Let's keep this bug open, unless we know how it was fixed...
Comment 39•5 years ago
Self note: the correct command line to set the cpu governor for intel_pstate driver is:
echo performance | tee /sys/devices/system/cpu/cpufreq/policy*/scaling_governor
Comment 40•5 years ago
https://github.com/taskcluster/docker-worker/commit/0f3cef5b3d10e83fa9e5cf23c841637520f3d7bc
This is fixed in the image based workers.
Comment 41•5 years ago (Reporter)
Beginning May 30, tasks consistently reported "powersave" governors. Intermittent retries for reduced MHz were noticed today.