In case of early aborts (e.g. failed download) log files are not always copied to upload dir

RESOLVED FIXED

Status

Release Engineering
Mozharness
RESOLVED FIXED
2 years ago
a year ago

People

(Reporter: whimboo, Assigned: whimboo)

Tracking

unspecified

Firefox Tracking Flags

(firefox45 wontfix, firefox46 fixed, firefox-esr45 fixed)

Details

Attachments

(3 attachments, 2 obsolete attachments)

(Assignee)

Description

2 years ago
Created attachment 8695293 [details]
log output

Yesterday I have seen that mozharness does not always copy the log files to the upload dir. This happens e.g. when a download fails. See the attached log file. The problem here is that CI systems which rely on log files to be present in the uploaded dir will not upload the log file, so Treeherder stays blank. :(

In our case for firefox-ui-tests I implemented a post-run listener for the report files and this is correctly executed even in the above case. You can also see that in the log output.

So I would assume the registering of the listener for copying the log files is not done correctly and that's why the method is not called.
(Assignee)

Comment 1

2 years ago
The `copy_logs_to_upload_dir()` method is part of the BaseScript which means that every existent mozharness entry level script makes use of it. What I do not see is that we register it as a postrun method! 

A change like the following would do it:

> +    @PostScriptRun
>      def copy_logs_to_upload_dir(self):

But it looks like there are some other scripts which directly call this method:

* https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/scripts/b2g_build.py?from=b2g_build.py#788
* https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/scripts/vcs-sync/initial_beagle.py#441
* https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/scripts/vcs-sync/vcs_sync.py#595
* https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/scripts/vcs-sync/vcs_sync.py#1080

So I'm not sure if we really want to get this added to BaseScript. I could easily overwrite the method for firefox-ui-tests and mark it with the PostScriptRun decorator which would make it also work.
Flags: needinfo?(jlund)
Flags: needinfo?(ahalberstadt)
(Assignee)

Comment 2

2 years ago
Created attachment 8695342 [details] [diff] [review]
Patch v1 (firefox-ui-tests only)

This would be the fix to fix it in our Firefox ui tests. Please be aware that I also set the dev mode so we do not run the download method of the proxxy.py file.

Comment 3

2 years ago
so I'm not sure why aki/gps[1] didn't put this[2] into the finally block. That would force it to run every time regardless of what happens within the try block that executes. It's not clear to me why we explicitly call it within other script's subclasses. maybe as a work around? maybe aki wanted it optional..

[1] https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/mozharness/base/script.py#1737 
[2] https://hg.mozilla.org/build/mozharness/rev/87a8c74982cf
Flags: needinfo?(jlund)

Comment 4

2 years ago
(In reply to Jordan Lund (:jlund) from comment #3)
> so I'm not sure why aki/gps[1] didn't put this[2] into the finally block.

typo: [1] and  [2] are swapped

Comment 5

2 years ago
Comment on attachment 8695342 [details] [diff] [review]
Patch v1 (firefox-ui-tests only)

Review of attachment 8695342 [details] [diff] [review]:
-----------------------------------------------------------------

I think you can do this from either post_fatal() or via like you are with a PostScriptRun decorator. I think post_fatal might be better so that we only call copy_logs_to_upload_dir() once in the case where we don't fail to download / call self.fatal()..
(Assignee)

Comment 6

2 years ago
(In reply to Jordan Lund (:jlund) from comment #3)
> so I'm not sure why aki/gps[1] didn't put this[2] into the finally block.
> That would force it to run every time regardless of what happens within the
> try block that executes. It's not clear to me why we explicitly call it
> within other script's subclasses. maybe as a work around? maybe aki wanted
> it optional..

Lets get an answer from aki and gps then. :)
Flags: needinfo?(gps)
Flags: needinfo?(escapewindow+mozbugs)
Flags: needinfo?(ahalberstadt)
(Assignee)

Comment 7

2 years ago
(In reply to Jordan Lund (:jlund) from comment #5)
> I think you can do this from either post_fatal() or via like you are with a
> PostScriptRun decorator. I think post_fatal might be better so that we only
> call copy_logs_to_upload_dir() once in the case where we don't fail to
> download / call self.fatal()..

Good call. But I will wait for the reply from gps / aki first to see what we really have to do here.

Comment 8

2 years ago
(In reply to Jordan Lund (:jlund) from comment #3)
> so I'm not sure why aki/gps[1] didn't put this[2] into the finally block.
> That would force it to run every time regardless of what happens within the
> try block that executes. It's not clear to me why we explicitly call it
> within other script's subclasses. maybe as a work around? maybe aki wanted
> it optional..
> 
> [1]
> https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/mozharness/
> base/script.py#1737 
> [2] https://hg.mozilla.org/build/mozharness/rev/87a8c74982cf

It's been a while since I've looked at this, but https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/mozharness/base/script.py#1737 is inside the outer |finally|, not the inner |finally|.  Since the inner try/except doesn't throw, we should hit the final lines of the outer |finally|.

(In reply to Henrik Skupin (:whimboo) from comment #1)
> The `copy_logs_to_upload_dir()` method is part of the BaseScript which means
> that every existent mozharness entry level script makes use of it. What I do
> not see is that we register it as a postrun method! 
> 
> A change like the following would do it:
> 
> > +    @PostScriptRun
> >      def copy_logs_to_upload_dir(self):

This is probably the desired behavior.

> But it looks like there are some other scripts which directly call this
> method:

Some history that may or may not be completely accurate:

* upload_dir was in my head since maybe 2005-6.  At a previous company, our uploads happened all over the place in the scripts, and we kept on running into issues where the directory wasn't created yet, or we wanted to change a file that was uploaded and we had to scour the entire codebase to find where/why/when it happened.  Having a single upload dir on the client, that gets uploaded to the server at the end, was my solution to make it sane.
* I coded the client-side portion of upload_dir into mozharness but never really used it directly in most? scripts, since Firefox builds have their own |make upload| syntax/process.  So upload_dir was just a non-functioning stub for most scripts, with the intention of making it useful at some point in the future.
** I used upload_dir in a few scripts, like the vcs-sync ones, and so copying the logs on fatal was something I noticed and fixed there.
* When TaskCluster became a thing, we intended to make use of upload_dir as a real thing.  I never saw this become reality at Mozilla, but if things went according to plan, this may now be true.
* Now that it's in use in more scripts, bugs like this are becoming more evident, and we can fix them properly.

I think the proper fix is the above PostScriptRun copy_logs_to_upload_dir, and removing the script-specific ones.  I'm lacking current context, so there may be additional side effects I'm not thinking about, but I think it'll be good overall.
Flags: needinfo?(escapewindow+mozbugs)

Comment 9

2 years ago
> I think the proper fix is the above PostScriptRun copy_logs_to_upload_dir,
> and removing the script-specific ones.  I'm lacking current context, so
> there may be additional side effects I'm not thinking about, but I think
> it'll be good overall.

thanks aki. +1 to this approach.

Updated

2 years ago
Flags: needinfo?(gps)
(Assignee)

Comment 10

a year ago
Created attachment 8707426 [details] [diff] [review]
Always copy logs

Ok, so I updated my patch so we cover all possibly involved scripts. I was able to test this with firefox-ui-tests and it works great. Some questions remain and I hope you can answer them:

1. When I override copy_logs_to_upload_dir in my own class, I absolutely have to add @PostScriptRun again? If I don't do so, the method won't be noticed anymore and is not getting run during the post steps.

2. I'm unclear about this code: https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/scripts/vcs-sync/vcs_sync.py#1079 - Is this still necessary? Where is `preflight_upload()` used? I cannot find any reference in mozharness. If we have to keep it I feel the if condition is wrong and the not has to be removed.

3. How can I test if the vcs changes are fine? For b2g_build.py I would see results via try but not sure about the vcs ones. Hal, could you help here?

Once the above questions have been cleared I will trigger a try build with those changes.
Attachment #8695342 - Attachment is obsolete: true
Flags: needinfo?(hwine)
Attachment #8707426 - Flags: feedback?(jlund)
Attachment #8707426 - Flags: feedback?(escapewindow+mozbugs)
(In reply to Henrik Skupin (:whimboo) from comment #10)
 
> 3. How can I test if the vcs changes are fine? For b2g_build.py I would see
> results via try but not sure about the vcs ones. Hal, could you help here?

There are several different chunks of 'vcs' code in mozharness.

You need not worry about any of the following code:
  mozharness/scripts/vcs-sync/*
  mozharness/scripts/b2g_bumper.py
  mozharness/configs/b2g_bumper/*
  mozharness/configs/vcs_sync/*
Those will be removed with bug 1225364.

As for testing, the 'tox' tests still work, as described in the README.txt file. (I'm not a mozharness developer, just a user.)
Flags: needinfo?(hwine)
(Assignee)

Comment 12

a year ago
Great info. Thank you Hal. I think in this case I will simply revert the changes in those files. I will upload a new version of the patch in a minute.
(Assignee)

Comment 13

a year ago
Created attachment 8707570 [details] [diff] [review]
Always copy logs v1.1
Attachment #8707426 - Attachment is obsolete: true
Attachment #8707426 - Flags: feedback?(jlund)
Attachment #8707426 - Flags: feedback?(escapewindow+mozbugs)
Attachment #8707570 - Flags: review?(jlund)
Attachment #8707570 - Flags: feedback?(escapewindow+mozbugs)
(Assignee)

Comment 14

a year ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=2c3f0312f1eb

Updated

a year ago
Attachment #8707570 - Flags: review?(jlund) → review+

Comment 15

a year ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/999cdf95280c

Comment 16

a year ago
Comment on attachment 8707570 [details] [diff] [review]
Always copy logs v1.1

It's been a while, but that passes the eyeball test.
Attachment #8707570 - Flags: feedback?(escapewindow+mozbugs) → feedback+

Comment 17

a year ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/999cdf95280c
Status: ASSIGNED → RESOLVED
Last Resolved: a year ago
status-firefox46: --- → fixed
Resolution: --- → FIXED
in reviewing another patch, I noticed that 1 test is failing now. I suspect by the nature of the test that this bug is related:

$cd testing/mozharness
$source some/venv && pip install tox
$tox

1) FAIL: test_decorators_registered (test_base_script.TestScriptDecorators)

   Traceback (most recent call last):
    test/test_base_script.py line 707 in test_decorators_registered
      self.assertEqual(len(self.s._listeners['post_run']), 2)
   AssertionError: 3 != 2
   -------------------- >> begin captured logging << --------------------
   root: INFO: MultiFileLogger online at 20160115 12:56:23 in /Users/jlund/devel/mozilla/dirtyRepos/mozilla-inbound/testing/mozharness
   root: INFO: Run as /Users/jlund/devel/mozilla/dirtyRepos/mozilla-inbound/testing/mozharness/.tox/py27-hg2.6/bin/nosetests -v --with-xunit --rednose --force-color
   --------------------- >> end captured logging << ---------------------
178 tests run in 50.1 seconds.
1 FAILED, 2 skipped (175 tests passed)
ERROR: InvocationError: '/Users/jlund/devel/mozilla/dirtyRepos/mozilla-inbound/testing/mozharness/.tox/py27-hg2.6/bin/coverage run --source configs,mozharness,scripts --branch /Users/jlund/devel/mozilla/dirtyRepos/mozilla-inbound/testing/mozharness/.tox/py27-hg2.6/bin/nosetests -v --with-xunit --rednose --force-color'
____________________________________________________________________________________ summary _____________________________________________________________________________________
ERROR:   py27-hg2.6: commands failed

Updated

a year ago
Status: RESOLVED → REOPENED
Resolution: FIXED → ---

Updated

a year ago
See Also: → bug 1240184
(Assignee)

Comment 19

a year ago
While also working on the unpack feature on bug 1237706 I noticed the following:

11:01:24     INFO - Using ZipFile to extract /mozilla/code/gecko/testing/mozharness/a/info_busted.zip to /mozilla/code/gecko/testing/mozharness/b
11:01:24    FATAL - File name in directory "info.txt" and header "info.sxt" differ.: /mozilla/code/gecko/testing/mozharness/a/info_busted.zip
11:01:24    FATAL - Running post_fatal callback...
11:01:24    FATAL - Exiting 2

When we call unpack for a broken archive within the constructor of the script, we still don't get the log files copied to the upload folder.
(Assignee)

Comment 20

a year ago
(In reply to Henrik Skupin (:whimboo) from comment #19)
> When we call unpack for a broken archive within the constructor of the
> script, we still don't get the log files copied to the upload folder.

I think that this is something we might not be able to change. That's why all probably breaking code should be part of an actual build step.

I will have a look at the test fixes, and would like to also have an uplift for aurora, so we can have this important fix for the 45ESR branch. Jordan, do you think that this works? So far we haven't seen any breakage
Status: REOPENED → ASSIGNED
status-firefox45: --- → affected
(Assignee)

Comment 21

a year ago
(In reply to Jordan Lund (:jlund) from comment #18)
> in reviewing another patch, I noticed that 1 test is failing now. I suspect
> by the nature of the test that this bug is related:

Sorry for the delay here. I wanted to fix it now but noticed that this test failure got fixed along with bug 1244781. So nothing left to do here.
Status: ASSIGNED → RESOLVED
Last Resolved: a year agoa year ago
Resolution: --- → FIXED
(Assignee)

Comment 22

a year ago
Created attachment 8736457 [details] [diff] [review]
backport patch for mozilla-esr45

Patch for mozilla-esr45 with the remaining test fix included. Mozharness unit tests are passing, but to ensure it doesn't bust anything I started a try build:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=41d2fc7305f3

As callek mentioned on IRC it could be that it wont pass in case the build configs drifted apart. So lets see.
Attachment #8736457 - Flags: review+
(Assignee)

Updated

a year ago
status-firefox45: affected → wontfix
status-firefox-esr45: --- → affected
(Assignee)

Comment 23

a year ago
The try build has lots of test failures and some aborts including crashes. But none of them looks to be related to my above change.

So can some please land attachment 8736457 [details] [diff] [review] as a test-only change on mozilla-esr45? Thanks.
Whiteboard: [checkin-needed-esr45]
https://hg.mozilla.org/releases/mozilla-esr45/rev/7327a6b02d2e
status-firefox-esr45: affected → fixed
Whiteboard: [checkin-needed-esr45]
You need to log in before you can comment on or make changes to this bug.