Closed Bug 1164464 Opened 9 years ago Closed 6 years ago

cache-match.https.html takes an order of magnitude longer to run on Win7/Win8 vs XP

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

x86
Windows
task
Not set
normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: coop, Unassigned)

Details

(Whiteboard: [windows][test])

bkelly reported relative slowness on the Win7 and Win8 platforms today in #releng. File I/O is very slow compared to XP. Here's the IRC log:

[10:07am] bkelly: hello, can anyone tell me about the win7 and win8 test machines on treeherder?  in particular, I'm curious about their disk subsystem as I've noticed its much slower than other platform test machines
[10:11am] coop: bkelly: what do you want to know? if there’s a command I can give you output from, i can poke a machine. if it’s more involved I can loan you one to investigate yourself
[10:12am] bkelly: coop: I'm just curious if its VM or physical machine... flash or spinning disk... do we ever defragment or reformat the disk partitions?
[10:13am] coop: bkelly: physical machines, spinning disk
[10:13am] coop: machines are re-installed when required, but not regularly. no regular defrag or reformatting
[10:14am] coop: pretty much the worst of all your options 
[10:14am] bkelly: coop: hmm, ok... I am seeing win7 and win8 run about 10x slower on these file I/O tests compared to other platforms... enough that the tests time out quite a bit
[10:15am] bkelly: I'm trying to optimize the code where I can, but if the file systems are completely hosed its an uphill battle for me
[10:15am] bkelly: these are not meant to be performance tests
[10:19am] bkelly: for example, linux64: cache-match.https.html | took 4753ms
[10:19am] bkelly: win xp: cache-match.https.html | took 7096ms
[10:19am] bkelly: win7: cache-match.https.html | took 49782ms  (this is faster than it normally runs actually)
[10:20am] bkelly: win8: cache-match.https.html | took 54127ms
[10:21am] bkelly: coop: anyway, if there is anything we can do to improve the file system performance on these tests it might help our build/test time as well
[10:21am] mshal: I'm surprised it's that different between winxp and win7... that definitely sounds like something is broken
[10:22am] coop: it’s quite possible there a configuration problem
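
For reference, the four timings quoted in the log can be turned into slowdown ratios. This is only a minimal sketch: the `parse_timings` and `slowdown` helpers are hypothetical names, and the input is just the four "took Nms" lines pasted from the log above, not a real treeherder log format.

```python
import re

# The four timings quoted in the IRC log above (trailing remarks removed).
LOG = """\
linux64: cache-match.https.html | took 4753ms
win xp: cache-match.https.html | took 7096ms
win7: cache-match.https.html | took 49782ms
win8: cache-match.https.html | took 54127ms
"""

# Matches "<platform>: <test> | took <N>ms" at the start of a line.
PATTERN = re.compile(
    r"^(?P<platform>[^:\n]+):\s+(?P<test>\S+)\s+\|\s+took\s+(?P<ms>\d+)ms",
    re.MULTILINE,
)


def parse_timings(text):
    """Return {platform: milliseconds} for every 'took Nms' line in text."""
    return {m["platform"].strip(): int(m["ms"]) for m in PATTERN.finditer(text)}


def slowdown(timings, baseline):
    """Return each platform's run time divided by the baseline platform's."""
    base = timings[baseline]
    return {platform: ms / base for platform, ms in timings.items()}


if __name__ == "__main__":
    timings = parse_timings(LOG)
    for baseline in ("linux64", "win xp"):
        print(f"relative to {baseline}:")
        for platform, ratio in slowdown(timings, baseline).items():
            print(f"  {platform}: {ratio:.1f}x")
```

With these numbers, Win7 and Win8 come out roughly 10x to 11x slower than linux64 and roughly 7x to 8x slower than XP.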

I'm not a Windows admin, so I've cc-ed the 3 Windows experts from relops to see whether they have any quick ideas for investigating.

Going forward, it may make sense to run a defrag job after N other tasks on the Windows platforms. Once runner is running on Windows (bug 1055794), this seems like a good post-flight task to add.
Note that a defrag can sometimes take on the order of hours and, depending on which processes are accessing the disk, can require a reboot (if it tries to move a file that is locked or in use).
Can we just re-image the machines with a pristine file system periodically?  Like once a week or something?
Have we actually tried a defrag or reinstall on a client to verify that disk fragmentation is the problem?
(In reply to Amy Rich [:arich] [:arr] from comment #3)
> Have we actually tried a defrag or reinstall on a client to verify that disk
> fragmentation is the problem?

Not that I'm aware of.
But running the test on my local win8.1 machine takes about 6 seconds.  This is similar to the other platforms in automation.
We don't have any indication that disk fragmentation is the cause, so implementing a solution for that is a bit premature. I think the best first step is to assign a loaner where you can defrag the disk to see if that helps your test speed, :bkelly.

If that doesn't yield any improvement, we can try reimaging the same machine and have you test things out again.

:selena, :mgerva can one of you take the bug and help with this, please?
Flags: needinfo?(sdeckelmann)
Flags: needinfo?(mgervasini)
I agree that verifying this is the root problem is a good idea.  The tests are fully automated, though.  Is there a reason I need to be in the loop to run the tests?  You just need to run:

 ./mach web-platform-tests service-worker/cache-storage/window/cache-match.https.html

And look for the time difference between start and end.
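
If it helps, the start-to-end time can be captured with a small wrapper instead of reading timestamps by hand. This is a rough sketch, not part of mach: `time_command` is a hypothetical helper, and `MACH_CMD` assumes the current directory is a mozilla-central checkout with a finished browser build. On Windows the command may need to be launched from the MozillaBuild shell; that part is an assumption.

```python
import subprocess
import time

# The command quoted above. Assumes the working directory is a
# mozilla-central checkout that already contains a browser build.
MACH_CMD = [
    "./mach",
    "web-platform-tests",
    "service-worker/cache-storage/window/cache-match.https.html",
]


def time_command(argv, cwd=None):
    """Run argv to completion and return (exit_code, elapsed_seconds).

    Uses a monotonic clock so wall-clock adjustments during the run do
    not distort the measurement.
    """
    start = time.monotonic()
    completed = subprocess.run(argv, cwd=cwd)
    elapsed = time.monotonic() - start
    return completed.returncode, elapsed
```

Calling `time_command(MACH_CMD)` on each platform would give one wall-clock number per run; it includes harness startup, so it will be larger than the per-test "took Nms" figure.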

Also, just running some file system benchmarks would probably show the same thing.
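
Something as simple as the following would do for a first comparison between machines. It is a crude sketch rather than a real benchmark tool: `bench_small_files` and its parameters are made up for illustration, and the file count and size are arbitrary. It times writing many small files (optionally with an fsync per file), reading them back, and deleting them.

```python
import os
import time


def bench_small_files(directory, count=200, size=4096, use_fsync=True):
    """Time writing, reading and deleting `count` files of `size` bytes.

    Returns a dict of elapsed seconds per phase. The read phase will
    mostly hit the OS cache, so the write and delete numbers are the
    more interesting ones for comparing disks.
    """
    payload = os.urandom(size)
    paths = [os.path.join(directory, f"bench_{i}.bin") for i in range(count)]

    start = time.monotonic()
    for path in paths:
        with open(path, "wb") as f:
            f.write(payload)
            if use_fsync:
                f.flush()
                os.fsync(f.fileno())  # force the data out to the disk
    write_s = time.monotonic() - start

    start = time.monotonic()
    for path in paths:
        with open(path, "rb") as f:
            f.read()
    read_s = time.monotonic() - start

    start = time.monotonic()
    for path in paths:
        os.remove(path)
    delete_s = time.monotonic() - start

    return {"write_s": write_s, "read_s": read_s, "delete_s": delete_s}
```

Running it once with `use_fsync=True` and once with `use_fsync=False` in the same directory on an XP and a Win7 machine would show whether the gap is in ordinary file operations or only in synced writes.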

Please note I will be landing bug 1162342 tomorrow which will roughly halve the run time of the cache-match.https.html test.  I think investigating the disk performance on these test machines would still be a good idea just for improving test throughput for the windows platform.
Assignee: relops → nobody
Component: RelOps → Buildduty
Flags: needinfo?(sdeckelmann)
Product: Infrastructure & Operations → Release Engineering
QA Contact: arich → bugspam.Callek
Flags: needinfo?(mgervasini) → needinfo?(kmoir)
:bkelly: what type of loaner windows test machine would you prefer - Win7 or Win8?
Flags: needinfo?(kmoir) → needinfo?(bkelly)
(In reply to Kim Moir [:kmoir] from comment #8)
> :bkelly: what type of loaner windows test machine would you prefer - Win7 or
> Win8?

I'm sorry, but I don't have time to investigate this myself.  I will be on PTO next week and have a number of other things I need to get done.

Also, I worked around the problem.  It still runs an order of magnitude slower than the other platforms, but I sped up the base test time enough that it no longer times out.

I guess if releng doesn't have time to investigate, then just go ahead and close this WONTFIX.
Flags: needinfo?(bkelly)
:bkelly

What directory does the 

./mach web-platform-tests service-worker/cache-storage/window/cache-match.https.html

command need to be run in?

i.e., what branch and type of build?
A mozilla-central browser build should work fine.  I guess that's a "B" in treeherder.  I don't know what directory that corresponds to on disk, though.
None, because we don't build and run tests on the same slave. That's what you would do after you built locally, not what you would do if you had a fresh test slave with no build on it. If you had a fresh test slave with no build on it... we don't actually have any documentation on how to run tests on a downloaded build and downloaded packaged tests, do we? The wailing and gnashing of teeth from everyone who has ever gotten a loaner certainly makes it seem like we don't.
Can we just run the defragmenter's analyze on the machine?  If the file system looks good, then maybe there is nothing to do here.

Of course, it would still be nice to try to explain the difference if that's the case.  For example, winxp might run on flash drives instead of spinning media, or the win7 machines might be configured to honor fsync while winxp isn't, etc.  I don't know much about how to configure Windows file systems, though, so I'm not even sure if that's possible.
(In reply to Kim Moir [:kmoir] from comment #10)
> :bkelly
> 
> What directory does the 
> 
> ./mach web-platform-tests
> service-worker/cache-storage/window/cache-match.https.html
> 
> command need to be run in?
> 
> i.e what branch and type of build

Kim: did this command get run? What's the follow-up here?
As Phil pointed out in comment 12, there is an extra step to download the build, which I guess not many people know how to do.

Note, we're also seeing weird behavior on individual windows machines over in bug 1160013 comment 197 now.  The same machine will be responsible for all the timeouts for days/weeks.  If that machine is disabled, then it might show up on a new machine (and stick with that new one) some time later.

Seems like these machines might be getting into a bad state somehow.
No, I didn't run it, because of comment #12.
Does this test not run as part of a larger suite? We have ways of running the suites, and could pare down a suite to target this particular test.
"web-platform-tests" - it's in web-platform-tests-4 if you run chunked.
Component: Buildduty → Platform Support
QA Contact: bugspam.Callek → coop
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Found in triage.
Hey Chris/Phil, any updates on this bug? Do we still need it open?
Flags: needinfo?(philringnalda)
Flags: needinfo?(coop)
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(coop)
Resolution: --- → INCOMPLETE
Flags: needinfo?(philringnalda)
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard