Closed Bug 1444826 Opened 2 years ago Closed 2 years ago

Add more logs to help debugging failing DAMP tests

Categories

(DevTools :: General, enhancement, P3)

59 Branch
enhancement

Tracking

(firefox61 fixed)

RESOLVED FIXED
Firefox 61
Tracking Status
firefox61 --- fixed

People

(Reporter: jdescottes, Assigned: ochameau)

References

(Blocks 2 open bugs)

Details

Attachments

(2 files)

While splitting DAMP in 1441703, there was very little logging when tests had errors. This made debugging issues a bit painful.

We should add minimal logs to know:
- if a given test failed to load
- if a helper is waiting for an async method to finish 

We might also want to add a log helper to avoid relying on dump(str + "\n")
> if a helper is waiting for an async method to finish 

As a quick note, I often remarked that in tests involving async methods we nearly never handle errors right. Especially we seldom fail the test in the error callback / in a promise's catch clause. I believe this is a big source of frustration when debugging an async test, more than missing logs. For example if you do a programming mistake (which would yield ReferenceError or TypeError) in an async function (be it a normal callback or a promise chain) it would be silently caught if no `catch` is present.
See also the discussion over bug 1378002 where most DAMP errors ends up being flagged as "Intermittent PROCESS-CRASH | damp | application crashed". I would like to help sheriffs flags various DAMP intermittents/failures with distinct bugs!

This may be slightly different topic but any error should appear correctly in PerfHerder summary without having to manually read the full stdout log!
Assignee: nobody → poirot.alex
Joel, thanks to bug 1378002 comment 65, I ended up tweaking the python part.
But:
* instead of doing something only about timeout, I also handle exception/errors.
* I used a per subtest timeout from damp.js (instead of tptimeout).

I had a first approach before this patch, instead of dumping "DAMP ERROR ", I was using "TEXT-UNEXPECTED-FAIL | damp | ".
So that the string would be referenced in treeherder. But it doesn't make Talos python app quit with an error code and so the job is still considered as successful.
In order to be more generic, we could grep for "TEXT-UNEXPECTED-FAIL |" and kill the browser session if we have one.
That would make the python patch more generic and usable by other talos test suites if they want to stop the test and report an error.

Otherwise, I used 15 minutes timeout arbitrarily, I'll do some analysis on how long each subtest last.
Comment on attachment 8963065 [details]
Bug 1444826 - Stop DAMP with explicit error on exceptions and timeouts.

https://reviewboard.mozilla.org/r/231922/#review237488

I have a pending question about the done variable.  outside of that, the idea of using TEST-UNEXPECTED-FAIL is interesting- it seems like a good route to go in the future, we would have to ensure the mozharness parsers accept it properly, that seems like a great followup bug.

::: testing/talos/talos/tests/devtools/addon/content/damp.js:25
(Diff revision 1)
>  // "normal". Print allocation sites to stdout if DEBUG_DEVTOOLS_ALLOCATIONS is set to
>  // "verbose".
>  const DEBUG_ALLOCATIONS = env.get("DEBUG_DEVTOOLS_ALLOCATIONS");
>  
> +// Maximum time spent in one test, in milliseconds
> +const TEST_TIMEOUT = 15 * 60000;

15 minutes for a single test seems very long- I would think 5 minutes would be enough.  Are there specific tests in damp which take a long time?  Is this leaving buffer room for profiling or running on debug builds?

to make things easier, I would like to add a comment in test.py where we define damp noting "there is a 15 minute internal timeout for damp".  That would help with debugging timeout issues if for some reason the DAMP ERROR is not present in the line.

::: testing/talos/talos/tests/devtools/addon/content/damp.js:290
(Diff revision 1)
>    _doneInternal() {
> +    // Ignore any duplicated call to this method
> +    if (this.done) {
> +      return;
> +    }
> +    this.done = true;

where is this.done defined at?  I am old fashioned and code more in python than javascript- does _done == this.done?
Comment on attachment 8963065 [details]
Bug 1444826 - Stop DAMP with explicit error on exceptions and timeouts.

https://reviewboard.mozilla.org/r/231922/#review237522

Thanks a lot Alex, that's a great improvement! 
I tested it with errors added in both test files and head.js and the test runner stopped in both cases, with a helpful log.
In a followup we could make out logs standout more by prefixing them with somthing like [DAMP]

r+ with the _done/done typo mentioned by Joel fixed.

::: testing/talos/talos/talos_process.py:77
(Diff revision 1)
>              self.event.set()
>          elif line == 'TART: TIMEOUT':
>              self.got_timeout = True
>              self.timeout_message = 'TART'
>              self.event.set()
> +        elif line.startswith('DAMP ERROR: '):

This is probably more a question for Joel, but isn't it weird to have something specific to DAMP here (rather than `TALOS ERROR:` or `TALOS ERROR (DAMP):`)

Maybe we could define a common prefix that all tests could use in a followup.

::: testing/talos/talos/tests/devtools/addon/content/damp.js:231
(Diff revision 1)
> +    if (promise && typeof(promise.catch) == "function") {
> +      promise.catch(e => {
> +        this.exception(e);
> +      });
> +    }

This makes me think that an interesting follow up would be to:
- change _runNextTest to an async method
- use `await testMethod();`
- run testTeardown() at the end of _runNextTest instead of manually calling it in every test
Attachment #8963065 - Flags: review?(jdescottes) → review+
Attached file damp-runtimes.txt
Here is the time it takes to run each DAMP subtests.
The slowest if webconsole/custom.js, which takes 27s to run.

Given that we suggest adding new tests to custom pages, it is going to grow over time. I imagine at some point, it would make sense to tweak DAMP harness to be able to timeout individual subtests, may be via `runTest` method.

In the meantime, what about using 5 minutes?
It is totally arbitrary, it gives you 5 minutes to debug damp before it is automatically killed.

To reply to Joel about how to debug DAMP with this new debug mechanism, I don't have a good answer excet making the timeout big enough to be able to debug a bit.
We could add a "--keep-open" argument, somewhat similar to mochitest one, but to disable timeouts? Or disable the timeout if this is local run? Or?
In the latest patch, I'm using TEST-UNEXPECTED-FAIL, it is correctly picked up by TreeHerder.
Then there is no need to save the error line on the python side,
the fact that it is on stdout is enough for treeherder to pick it up.
The only thing that python code should do is to quit with an error code.

See this failed run:
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=aeb6fcc513f5e62ffa730ca77d9bffe602d07aad&selectedJob=170819262

Also, in damp.js, I added some code to print the time it took to execute each test to help seeing when we reach to timeout limit.

And finally, a green try:
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=5b25ea077b1f91d2cd206b4b46ebf69e1ab6f104
Comment on attachment 8963065 [details]
Bug 1444826 - Stop DAMP with explicit error on exceptions and timeouts.

https://reviewboard.mozilla.org/r/231922/#review237604

thanks, this is looking good.

::: testing/talos/talos/tests/devtools/addon/content/damp.js:25
(Diff revision 2)
>  // "normal". Print allocation sites to stdout if DEBUG_DEVTOOLS_ALLOCATIONS is set to
>  // "verbose".
>  const DEBUG_ALLOCATIONS = env.get("DEBUG_DEVTOOLS_ALLOCATIONS");
>  
> +// Maximum time spent in one test, in milliseconds
> +const TEST_TIMEOUT = 5 * 60000;

one suggestion here is that if running locally via mach (there will be develop mode in the python harness) we could set a pref or send in a url param and the timeout could be much longer.
Attachment #8963065 - Flags: review?(jmaher) → review+
Green try running all talos suites:
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=9f2decb8d6d1f14873f42aa135b000bbfee5dc3b
Just to ensure that none of the other suites were printing TEST-UNEXPECTED-FAIL.
New version looks good to me!
Blocks: damp
Blocks: 1449919
Pushed by apoirot@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/84ba9406ef28
Stop DAMP with explicit error on exceptions and timeouts. r=jdescottes,jmaher
https://hg.mozilla.org/mozilla-central/rev/84ba9406ef28
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 61
Product: Firefox → DevTools
You need to log in before you can comment on or make changes to this bug.