Closed Bug 987152 Opened 11 years ago Closed 10 years ago

Remove %APPDATA% and %LOCALAPPDATA% from Windows testers

Categories

(Infrastructure & Operations :: RelOps: General, task)

Hardware: x86_64
OS: Linux
Type: task
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Assigned: q)

References

Details

(In reply to Armen Zambrano [:armenzg] (Release Engineering) (EDT/UTC-4) from comment #501)
> Q, wfm from what you said about the cltbld context.
>
> FTR: %APPDATA% resolved to C:\Users\cltbld\AppData\Roaming
>
> (In reply to Q from comment #482)
> > I have a scheduled task ready to test that nukes the profile folders in
> > %APPDATA%\Mozilla\Firefox\Profiles\.
> >
> > The bat is simple and looks thus:
> >
> > for /F "delims=\" %%I in ('dir /ad /b %APPDATA%\Mozilla\Firefox\Profiles') DO (
> >     rd /S /Q "%APPDATA%\Mozilla\Firefox\Profiles\%%I"
> > )

(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #502)
> As mentioned on IRC, we need to clean %LOCALAPPDATA% as well so we don't
> perpetually accumulate stale cache directories.
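For readers less fluent in batch, the quoted loop can be sketched in Python. This is a hypothetical equivalent for illustration only, not the deployed script; the function name is mine.

```python
import os
import shutil

def clean_profiles(profiles_dir):
    """Remove every subdirectory of the Firefox Profiles folder,
    mirroring the batch script's `rd /S /Q` loop. Files directly in
    profiles_dir (e.g. stray logs) are left alone, just as the batch
    `dir /ad` (directories only) filter leaves them."""
    if not os.path.isdir(profiles_dir):
        return []
    removed = []
    for name in os.listdir(profiles_dir):
        path = os.path.join(profiles_dir, name)
        if os.path.isdir(path):
            shutil.rmtree(path, ignore_errors=True)
            removed.append(name)
    return removed
```

Note this deletes the profile directories but, like the batch script, does not touch profiles.ini, which becomes relevant later in this bug.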
Can we get an ETA on this? Bug 918507 remains an extremely high frequency test failure on branches where it wasn't disabled, and it is doubly bad because it aborts the test run when it hits.
Severity: normal → critical
Flags: needinfo?(q)
I am still testing but it should be done this week.
Flags: needinfo?(q)
I can implement this today with a full cleanup. To be clear, the previous cleanup in starttalos plus this change will clean the following:
%APPDATA%\Mozilla\Firefox\console.log
%LOCALAPPDATA%\Temp
%APPDATA%\Mozilla\Firefox\Profiles
%LOCALAPPDATA%\Mozilla\Firefox\Profiles
I also think we should clean %userprofile%\downloads. Gentlemen, what do you think?
Flags: needinfo?(armenzg)
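The target list above could be kept as data and expanded in one place. A sketch, assuming cmd.exe-style %VAR% expansion; the helper name and the explicit env parameter are illustrative, not part of any deployed tooling:

```python
import os

# The cleanup targets listed in the comment above;
# %userprofile%\downloads was proposed as a further addition.
CLEANUP_TARGETS = [
    r"%APPDATA%\Mozilla\Firefox\console.log",
    r"%LOCALAPPDATA%\Temp",
    r"%APPDATA%\Mozilla\Firefox\Profiles",
    r"%LOCALAPPDATA%\Mozilla\Firefox\Profiles",
]

def expand_targets(targets, env=None):
    """Expand %VAR% references the way cmd.exe would, so the list can
    be fed to a deletion routine. `env` lets tests inject fake values
    instead of reading the real environment."""
    env = env if env is not None else os.environ
    out = []
    for target in targets:
        expanded = target
        for key, val in env.items():
            expanded = expanded.replace("%%%s%%" % key, val)
        out.append(expanded)
    return out
```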
Also be aware that this may cause a delay in testing times for a period as some of these directories are packed and will take time to clear.
That is fine. Thanks for letting us know about the delays. Understandable. Thanks Q!
Flags: needinfo?(armenzg)
Great, the new clearing will happen at the next reboot, before runslave is launched.
We're 4 hours since this deployed (taking comment #8 as the timestamp), and all but 7 of our t-w864-ix slaves are down. The main trees are closed. We have to back out, reboot some boxes to get them back online, and figure out a gradual deployment strategy.
<nthomas> closed two hours now ? if we have a way to remediate this and get things open, why wouldn't we use it ?
<nthomas> I asked t-w864-ix-029 to delete 10.5k tmp<blah> dirs, 10 mins later it has found 625k files using 26G to delete, but is still counting
<nthomas> this is going to take ages to remove at 250 files/s
RyanVM|afk, nthomas: lets back out the change and keep at least 50% of the slaves running while it gets deployed to the other portion?
<nthomas> Q: ^^ please please pretty please I can handle reboots if that helps
<Q> On it
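A back-of-envelope check of the numbers in that IRC log supports the "ages" estimate:

```python
# From the log above: t-w864-ix-029 had already found 625k files (26 GB)
# to delete while still counting, and deletion ran at roughly 250 files
# per second. That gives a lower bound per slave:
files_found = 625_000
rate_per_sec = 250
minutes = files_found / rate_per_sec / 60  # ~42 minutes, and still counting
```

Multiplied across the whole t-w864-ix pool, that is hours of lost capacity at once, which is why a staged rollout was requested.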
Q commented out the cleanup commands in the startup script. I've walked up https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slavetype.html?class=test&type=t-w864-ix sorted by slave name ascending as far as t-w864-ix-090, rebooting if the slave was still cleaning up temp files. About 57 hosts in all, the rest had already finished themselves. There are about 20 slaves with higher NNN which are still cleaning up, which I'm leaving to finish.
Nick may be right that the issue with being orange on the first run was due to deleting the profile directory but not %APPDATA%\Mozilla\Firefox\profiles.ini.
I'm very sorry about this. I would not have thought it could take this long :( I think we should look into doing the cleanup through mozharness scripts, with or without the GPO cleanup.
Having just starred bug 918507 on Windows XP, I'm also reminded this cleanup would be nice to run on all Windows test slaves eventually once the bugs are worked out. Bonus points for OSX too (where I know I've seen screenshots of the desktop filled with garbage files generated during test runs).
If we do this cleanup every time, I don't think it will be an issue after we get over this first hump.
I am going to do a background cleanup for %LOCALAPPDATA%\Temp (which should be done at the beginning of every job) for any file older than 2 days, then put just that part of the cleaner script back. Where are we on the profiles bug? Q
Flags: needinfo?(armenzg)
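The proposed age-based background cleanup could look roughly like this. This is a sketch assuming a simple mtime cutoff; the actual GPO/scheduled-task script is not shown in this bug, and the function name is mine:

```python
import os
import time

def remove_stale_files(root, max_age_days=2, now=None):
    """Delete files under `root` whose modification time is older than
    `max_age_days`, mirroring the proposed background cleanup of
    %LOCALAPPDATA%\\Temp. Returns the paths removed. `now` is injectable
    for testing."""
    now = now if now is not None else time.time()
    cutoff = now - max_age_days * 86400
    removed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getmtime(path) < cutoff:
                    os.remove(path)
                    removed.append(path)
            except OSError:
                pass  # file vanished or is locked; skip it
    return removed
```

Because only files older than two days are touched, files created by a currently running job are left alone, which is the basis for the "should be invisible" claim in the IRC discussion below.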
If it is a background removal we could have talos jobs being affected. I don't know though. We should really fix this on the mozharness side.

<armenzg> Q, what does a "background" clean up mean? is it different than a normal cleanup?
<armenzg> Q, so what you're saying is that you're picking only 1 directory instead of the 4 ones mentioned on comment 3?
<armenzg> Q, is there a way to deploy a change to 20 machines at a time?
<armenzg> Q, that way we can see if the machines fall behind
<Q> Aremn: sure
<Q> Batchs are easy
<jhopkins|buildduty> Q: armen mentioned a "background cleanup" - if it is what it sounds like, could that background process overlap with a build?
<jhopkins|buildduty> just trying to make sure there's no race condition (eg. large number of temp files causes strange failures)
<Q> jhopkins|mtg: It would be a background script that would only kill files older than 2 days. The load is niced down so it "should" be invisible
<jhopkins|mtg> ok.. if we keep seeing failures around temp dirs we should confirm that more closely
<Q> RyanVM|sheriffduty: jhopkins|mtg: Should I kick off the background clean up ?
<RyanVM|sheriffduty> Q: by kick off you mean turn off?
<RyanVM|sheriffduty> or start?
<Q> to be clear this will only be for %LOCALAPPDATA\Temp
<Q> Start
<RyanVM|sheriffduty> what's happening now that we think might be causing problems?
<Callek> Q: as a data point, when did we do the cp python.exe python2.7.exe as well here?
<Callek> could we possibly be missing some process elevation whitelist entry with regard to that?
<Q> Callek: good question let me check
<Callek> (potentially as it relates to easy_install* since anything with install in name can trigger UAC on windows)
<RyanVM|sheriffduty> oh dammit, the failures on aurora are real bustage
<Q> Callek: I hate that "feature"
<Callek> Q: agreed :/
<Q> Can someone aim me at a machine currently having issues ?
<RyanVM|sheriffduty> "currently"?
<RyanVM|sheriffduty> didn't we go over this last week? By the time we see a problem, it's already on to another job?
<RyanVM|sheriffduty> so https://tbpl.mozilla.org/php/getParsedLog.php?id=37018979&tree=Mozilla-Aurora this is a failure
<RyanVM|sheriffduty> no clue what state that slave is in now
<RyanVM|sheriffduty> oh, it already ran green on another job
<Q> Wasn't that for the Profiles directory ?
<Q> Last week that is
<Q> RyanVM|sheriffduty ^
<RyanVM|sheriffduty> My point is that the inherent lag of these failures being reported makes it basically impossible for me to point you at a slave that's having problems "right now"
<RyanVM|sheriffduty> just one that was failing at one point in the recent past
<Q> Right Sorry I was getting hopeful and forgetful
Flags: needinfo?(armenzg)
I never got a chance to kick this off. You are right that we should be doing something on the mozharness side to fix this, with an OS-level catchall just in case. So there are two issues here:
1) Cleaning out the %TEMP% folder
2) Cleaning the Profiles folders (which causes a test failure) *
* This is a serious issue as we may have been using polluted profile folders for tests.
Okay, after looking at this for a while: for issue 2, I think we should delete %APPDATA%\Mozilla and %LOCALAPPDATA%\Mozilla entirely before each run, and mozharness should check for a clean environment as well. Those directories are created at browser first run (unless of course that import prompt will break anything, but I think our tests account for a run on a blank system). Thoughts? Q
Flags: needinfo?
Blocks: 991236
For me the plan makes sense. However, I currently cannot help with the mozharness component. Perhaps in a week or two.
Flags: needinfo?
As per IRC discussion, the cleanup that was backed out cleaned enough that we're not under as much pressure; however, we have some machines that could be time-bombs if they did not go through the cleanup process. I'm asking around if anyone can pick it up this week.
Severity: critical → major
As a status update: we are back to cleaning %TEMP%, so we are no longer in danger of drives filling up from that. What we need now is someone to verify that killing %APPDATA%\Mozilla and %LOCALAPPDATA%\Mozilla won't kill the next test that runs. The last conjecture here was that killing the profiles but not profiles.ini is what was causing a failure condition on the next test run. If that is true, then killing the parent directories should work.
Flags: needinfo?(armenzg)
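One way to check the profiles.ini conjecture is to look for profile directories that the file still references but that no longer exist on disk. A sketch (the function name is hypothetical; it relies only on profiles.ini being standard INI syntax with Path/IsRelative keys):

```python
import configparser
import os

def dangling_profiles(profiles_ini_path):
    """Return profile paths listed in profiles.ini that no longer exist
    on disk -- the suspected cause of the first-run orange: the Profiles
    directory was deleted but profiles.ini still pointed into it."""
    parser = configparser.ConfigParser()
    parser.read(profiles_ini_path)
    base = os.path.dirname(profiles_ini_path)
    missing = []
    for section in parser.sections():
        if not parser.has_option(section, "Path"):
            continue  # e.g. the [General] section has no Path
        path = parser.get(section, "Path")
        if parser.getboolean(section, "IsRelative", fallback=True):
            path = os.path.join(base, path)
        if not os.path.isdir(path):
            missing.append(path)
    return missing
```

Deleting the parent %APPDATA%\Mozilla directory removes profiles.ini along with the profiles, so this dangling state cannot occur, which is the argument for killing the parent directories.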
jmaher: do you know if deleting those 2 dirs will cause any issues? I don't think so, but wanted to double check. Do we have to regenerate them before starting tests? Q: I have a new idea, let me know if this is possible without trouble. Can GPO remove files for a few minutes and then stop? That way it would delete a bit every day until we have cleaned up enough. If it doesn't make sense, don't worry about it.
Flags: needinfo?(armenzg) → needinfo?(jmaher)
I am not sure if we will have problems; I would bet 3:1 that we would be fine with those directories removed. The idea of removing a few files at a time is a good one. Can mozharness do some of this as well? I know that isn't the right place to do it, but it would ensure success!
Flags: needinfo?(jmaher)
This could live in a preflight_clobber() (or postflight_clobber()) in mozharness.mozilla.testing.testbase.TestingMixin. Or we could add an optional self.config['additional_clobber_files'] that clobber() looks for, and set that for all appropriate tests.
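A rough sketch of the second suggestion, with names mirroring the comment above (additional_clobber_files, clobber). This is not the real mozharness API, just an illustration of the shape:

```python
import os
import shutil

class ClobberMixin:
    """Hypothetical stand-in for a mozharness-style mixin: clobber()
    additionally removes any paths listed under the optional
    'additional_clobber_files' config key. Not the actual mozharness
    implementation."""

    def __init__(self, config):
        self.config = config

    def clobber(self):
        removed = []
        for path in self.config.get("additional_clobber_files", []):
            path = os.path.expandvars(path)  # allow %APPDATA%-style entries
            if os.path.isdir(path):
                shutil.rmtree(path, ignore_errors=True)
                removed.append(path)
            elif os.path.isfile(path):
                os.remove(path)
                removed.append(path)
        return removed
```

A test config could then set additional_clobber_files to the %APPDATA%\Mozilla and %LOCALAPPDATA%\Mozilla paths discussed earlier, so the harness enforces a clean environment even if the OS-level catchall misses a run.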
This is done in starttalos as of yesterday
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED