Closed Bug 1401612 Opened 7 years ago Closed 6 years ago

Telemetry histogram validator should provide more than 1 error/warning/whatever at a time

Categories

(Toolkit :: Telemetry, defect, P3)

defect

Tracking

()

RESOLVED FIXED
mozilla59
Tracking Status
firefox57 --- wontfix
firefox59 --- fixed

People

(Reporter: Gijs, Assigned: adbugger, Mentored)

References

Details

(Whiteboard: [good second bug][lang=python])

Attachments

(1 file, 6 obsolete files)

STR:

1. try to add histogram
2. compile
3. error from the validator
4. fix error
5. compile
6. different error from the validator

(repeat steps 4-6 until you appease the validator)

Expected:

3. all errors from the validator
4. fix all the errors
5. compile
6. happy validator!


This is pretty annoying when trying to add a new histogram. We've added a lot of required properties, which then have whitelists for existing histograms, so just copy-pasting and modifying an existing histogram isn't really a guarantee to avoid validator issues.

Please can we make the compile/edit/recompile cycle less painful.
There are a variety of things we could do to make this less painful:

1) We could refactor the errors into "immediately-fatal" and "eventually-fatal" and let it run through all of the "eventually-fatal" failures (missing fields, wrong-typed fields, etc) before kicking out of the process.
2) We could fix the build rules so `./mach build faster toolkit/components/telemetry/` would run the validator quickly and effectively after just changing one of the data definition files (Histograms.json, Scalars.yaml, Events.yaml)
3) We could introduce a validate script so that developers could run just that code from the command line like `python toolkit/components/telemetry/validate_all.py`
Mentor: chutten
Priority: -- → P3
Whiteboard: [good second bug][lang=python]
(In reply to Chris H-C :chutten from comment #1)
> 2) We could fix the build rules so `./mach build faster
> toolkit/components/telemetry/` would run the validator quickly and
> effectively after just changing one of the data definition files
> (Histograms.json, Scalars.yaml, Events.yaml)
> 3) We could introduce a validate script so that developers could run just
> that code from the command line like `python
> toolkit/components/telemetry/validate_all.py`

Both of these are already solved by running

> ./mach build toolkit/components/telemetry

The parsers for all the .json/.yaml files are the first things that get called before any real compilation happens. However, printing all the errors (1) seems like a sensible thing to do.

(In reply to :Gijs from comment #0)
> This is pretty annoying when trying to add a new histogram. We've added a
> lot of required properties, which then have whitelists for existing
> histograms, so just copy-pasting and modifying an existing histogram isn't
> really a guarantee to avoid validator issues.
> 
> Please can we make the compile/edit/recompile cycle less painful.

FWIW, there's a sample histogram with all the required properties in the documentation: https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/telemetry/collection/histograms.html#declaring-a-histogram
Hi, I would like to work on this bug.

I've done some work regarding telemtry before but I'm fairly new and some resources regarding histograms (apart from https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/telemetry/collection/histograms.html) and telemetry in general would be much appreciated.

Apart from that, where do the compiler files reside? If it's ok to work on this, I'll start by looking there.
(In reply to Aditya Bharti [:adbugger] from comment #3)
> Hi, I would like to work on this bug.
 
Hi Aditya! I've assigned the bug to you, thank you for your interest in it!
 
> Apart from that, where do the compiler files reside? If it's ok to work on
> this, I'll start by looking there.

We have 3 different Python parser, one for each of our data primitives: scalars, events and histograms.

The one for the scalars lives at [1], the other 2 live in the same directory and are called parse_histograms.py and parse_events.py.

The istructions for fixing this bug can be found in comment 1

[1] - https://searchfox.org/mozilla-central/source/toolkit/components/telemetry/parse_scalars.py
Assignee: nobody → adibhar97
So, I was just thinking about how to go about this fix. What if, instead of raising the 'eventually fatal' errors, I keep appending the ParserError objects into an array. When an 'immediately fatal' error is hit, I print the error messages for all errors in array and then raise the final 'immediately fatal' error? 

Since this will require modifying almost all 'raise ParserError' calls, is it a good idea to delegate this responsibilty to the ParserError class itself? (The error array can be a class attribute.) That way, it will be applied to parse_histograms.py, parse_events.py and parse_scalars.py all in one go.

Does this seem like a sensible way to proceed?
Flags: needinfo?(chutten)
Here's an initial version of what I have in mind. Added a wrong histogram in Histograms.json and changed the ParserError class. For now, I've only changed two 'raise's in parse_histograms.py check_name() function to illustrate what happens.

Here is the output of ./mach build toolkit/components/telemetry:

 0:00.15 /usr/bin/make -C /home/user/mozilla-central/obj-x86_64-pc-linux-gnu -j4 -s backend
 0:00.22 /usr/bin/make -C toolkit/components/telemetry -j4 -s
 0:00.30 TelemetryHistogramData.inc
 0:00.30 TelemetryHistogramEnums.h
 0:00.30 TelemetryScalarData.h
 0:00.30 TelemetryScalarEnums.h
 0:00.52 Error for histogram name "LABELS_WRONG#HISTOGRAM": "#" is not allowed.
 0:00.52 
 0:00.52 Error processing histograms:
 0:00.52 Error for histogram name "LABELS_WRONG#HISTOGRAM":  can not start with "LABELS_".
 0:00.52 
 0:00.54 Error for histogram name "LABELS_WRONG#HISTOGRAM": "#" is not allowed.
 0:00.54 
 0:00.54 Error processing histograms:
 0:00.54 Error for histogram name "LABELS_WRONG#HISTOGRAM":  can not start with "LABELS_".
 0:00.54 
 0:00.54 backend.mk:30: recipe for target 'TelemetryHistogramEnums.h' failed
 0:00.54 make[1]: *** [TelemetryHistogramEnums.h] Error 1
 0:00.54 make[1]: *** Deleting file 'TelemetryHistogramEnums.h'
 0:00.56 make[1]: *** Waiting for unfinished jobs....
 0:00.56 backend.mk:23: recipe for target 'TelemetryHistogramData.inc' failed
 0:00.56 make[1]: *** [TelemetryHistogramData.inc] Error 1
 0:00.56 make[1]: *** Deleting file 'TelemetryHistogramData.inc'
 0:00.72 /home/user/mozilla-central/config/rules.mk:432: recipe for target 'default' failed
 0:00.72 make: *** [default] Error 2

Both errors are printed to console. I'm not sure why all errors are printed twice though; that was the behaviour even before my changes.
Flags: needinfo?(chutten)
Sorry, accidentally canceled the needinfo back there.
Flags: needinfo?(chutten)
I wonder if there's code that is printing the error and then a default exception handler that is also printing the error when it gets it.

This seems like a reasonable approach. I'm interested to see what a full conversion might look like, and how it might behave.
Flags: needinfo?(chutten)
IF there a default exception handler, then shouldn't only one error get duplicated? Only one 'immediately fatal' error is raised, the others just get their messages printed. I'm not sure what's happening here but I'll keep investigating.

As of now, a full conversion will require removing all 'raise' statements from the parsers and replacing them with ParserError().handle_now/later(). The 'raise' statements must be removed because they will only raise the error they are given and not print the messages of the previous 'eventually fatal' errors in array. 

To minimize changes to code, I thought about overloading the 'raise' keyword for the ParserError class so that the eventual errors are also printed but it appears that 'raise' is built into the python interpreter itself and is not a class method. So I guess modifying all the raise statements in the parser files is the only way. Given the changes involved, do you want me to proceed in this direction?
Flags: needinfo?(chutten)
Alright, so I think I know why each error is printed twice.

I changed the print statement in ParserError().handle_now() to: print(self.error + "handling now"). The output of ./mach build toolkit/components/telemetry is:
 0:00.51 Error for histogram name "LABELS_WRONG#HISTOGRAM": "#" is not allowed.handling now
 0:00.51 
 0:00.51 Error processing histograms:
 0:00.51 Error for histogram name "LABELS_WRONG#HISTOGRAM":  can not start with "LABELS_".
 0:00.52 
 0:00.52 Error for histogram name "LABELS_WRONG#HISTOGRAM": "#" is not allowed.handling now
 0:00.52 
 0:00.52 Error processing histograms:
 0:00.52 Error for histogram name "LABELS_WRONG#HISTOGRAM":  can not start with "LABELS_".
 0:00.53 
The same print function is being called twice; there is no other default exception handler.

Looking at these lines in the output:
 0:00.54 backend.mk:23: recipe for target 'TelemetryHistogramData.inc' failed    <--
 0:00.54 make[1]: *** [TelemetryHistogramData.inc] Error 1
 0:00.54 make[1]: *** Deleting file 'TelemetryHistogramData.inc'
 0:00.54 make[1]: *** Waiting for unfinished jobs....
 0:00.54 backend.mk:30: recipe for target 'TelemetryHistogramEnums.h' failed     <--
 0:00.54 make[1]: *** [TelemetryHistogramEnums.h] Error 1
 0:00.54 make[1]: *** Deleting file 'TelemetryHistogramEnums.h'

It appears there is a backend.mk makefile that is calling the parsers twice in the build process. Since the build uses 4 parallel jobs (make -j4),
 0:00.14 /usr/bin/make -C /home/user/mozilla-central/obj-x86_64-pc-linux-gnu -j4 -s backend
 0:00.22 /usr/bin/make -C toolkit/components/telemetry -j4 -s
the error on line 23 (in backend.mk) does not stop the build completely, and it waits for unfinished jobs, causing the parser to be called again on line 30 and printing the errors again.

On doing a `find . -type f -name backend.mk mozilla-central`, I found many backend.mk files generated during the build process in the obj-linux directory. I think the file of interest is at ./obj-x86_64-pc-linux-gnu/toolkit/components/telemetry/backend.mk . It definitely has the following lines:

export:: TelemetryHistogramData.inc
GARBAGE += TelemetryHistogramData.inc
EXTRA_MDDEPEND_FILES += TelemetryHistogramData.inc.pp
TelemetryHistogramData.inc: /home/user/mozilla-central/toolkit/components/telemetry/gen_histogram_data.py $(srcdir)/Histograms.json $(topsrcdir)/dom/base/UseCounters.conf $(topsrcdir)/dom/base/nsDeprecatedOperationList.h
	$(REPORT_BUILD)     [line 23]
	$(call py_action,file_generate,/home/user/mozilla-central/toolkit/components/telemetry/gen_histogram_data.py main TelemetryHistogramData.inc $(MDDEPDIR)/TelemetryHistogramData.inc.pp $(srcdir)/Histograms.json $(topsrcdir)/dom/base/UseCounters.conf $(topsrcdir)/dom/base/nsDeprecatedOperationList.h)

export:: TelemetryHistogramEnums.h
GARBAGE += TelemetryHistogramEnums.h
EXTRA_MDDEPEND_FILES += TelemetryHistogramEnums.h.pp
TelemetryHistogramEnums.h: /home/user/mozilla-central/toolkit/components/telemetry/gen_histogram_enum.py $(srcdir)/Histograms.json $(topsrcdir)/dom/base/UseCounters.conf $(topsrcdir)/dom/base/nsDeprecatedOperationList.h
	$(REPORT_BUILD)     [line 30]
	$(call py_action,file_generate,/home/user/mozilla-central/toolkit/components/telemetry/gen_histogram_enum.py main TelemetryHistogramEnums.h $(MDDEPDIR)/TelemetryHistogramEnums.h.pp $(srcdir)/Histograms.json $(topsrcdir)/dom/base/UseCounters.conf $(topsrcdir)/dom/base/nsDeprecatedOperationList.h)

So, yeah, I guess a build process using parallel jobs is the source of the errors being printed twice. Since that is not a bug, I guess the only question I have now is whether or not to proceed modifying the parse calls as per comment #9 . Not canceling needinfo.
Excellent deduction of the cause of the duplicate error messages! It highlights that we are parsing Histograms.json twice for each build (once for .inc, the other for .h), which might result in an interesting direction for future build improvements.

Please go ahead with your store-then-throw approach. I look forward to seeing the full solution.
Flags: needinfo?(chutten)
I have removed all raise ParserError statements from parse_histograms.py for now. However, some raise ValueError statements still exist which cannot be handled by the ParserError class.

Extended implementation of ParserError class. The old version would not have raised an error or printed any 'eventually fatal' messages unless there was an 'immediately fatal' error.

The problem I'm facing now is the case of 'eventually fatal' errors without any 'immediate' ones. The use of a sys.exitfunc in the parser file checks for any remaining 'eventually fatal' errors and prints them to console. However, the build does not stop and continues:

 0:00.14 /usr/bin/make -C /home/user/mozilla-central/obj-x86_64-pc-linux-gnu -j4 -s backend
 0:00.22 /usr/bin/make -C toolkit/components/telemetry -j4 -s
 0:00.30 TelemetryEventData.h
 0:00.30 TelemetryEventEnums.h
 0:00.30 TelemetryHistogramData.inc
 0:00.30 TelemetryHistogramEnums.h
 0:00.30 TelemetryProcessData.h
 0:00.49 TelemetryProcessEnums.h
 0:00.53 TelemetryScalarData.h
 0:00.66 TelemetryScalarEnums.h
 0:00.70 Error in sys.exitfunc:
 0:00.70 Error for histogram name "LABELS_WRONG_HISTOGRAM":  can not start with "LABELS_".
 0:00.70 Traceback (most recent call last):
 0:00.70   File "/home/user/mozilla-central/toolkit/components/telemetry/shared_telemetry_utils.py", line 44, in handle_now
 0:00.70     raise self
 0:00.70 shared_telemetry_utils.ParserError: Checking for residual errors.
 0:01.14 Error in sys.exitfunc:
 0:01.14 Error for histogram name "LABELS_WRONG_HISTOGRAM":  can not start with "LABELS_".
 0:01.14 Traceback (most recent call last):
 0:01.14   File "/home/user/mozilla-central/toolkit/components/telemetry/shared_telemetry_utils.py", line 44, in handle_now
 0:01.14     raise self
 0:01.14 shared_telemetry_utils.ParserError: Checking for residual errors.
 0:01.27 Elapsed: 0.05s; From ../../dist/idl: Kept 1012 existing; Added/updated 0; Removed 0 files and 0 directories.
 0:13.81 libtoolkit_components_telemetry.a.desc
 0:16.69 libtelemetrytest.a.desc
 0:17.15 /usr/bin/make -C browser/app -j4 -s
 0:17.42 /usr/bin/make -C toolkit/components/build -j4 -s
 0:17.70 /usr/bin/make -C toolkit/library -j4 -s
 0:17.75 libxul_s.a.desc
 0:17.75 libxul.so

There is an 'eventually fatal' ParserError being being printed, and a generic 'checking for residual errors' ParserError being thrown but it is not stopping the build for some reason. There are two options at this point:
1) Figure out how to make the build stop in this case when only 'eventually fatal' errors occur. In this case, I will take some time to study the build process in detail.
2) Leave it to the responsibility of the programmer to read and handle the case when only 'eventually fatal' errors occur; and continue with the build. Continuing with a possibly time-consuming build seems pointless when errors are there.

In any case, a careful segregation of errors into 'eventually' and 'immediately' fatal is required, regardless of the final implementation. For now, I need some advice on how to proceed.

I tried an approach where the class keeps a track of the number of ParserError objects instantiated and performs a check when the '__del__' function is called on the error objects. However, the __del__ function is not always called when the interpreter exits. Hence I resorted to the current sys.exitfunc approach. 

I will need some help handling this case of only 'eventually fatal' errors.

Note that when at least one 'immediately fatal' error occurs, everything works smoothly; the 'eventual' errors are printed and the 'immediate' error is thrown, stopping the build.
Flags: needinfo?(chutten)
Comment on attachment 8934497 [details] [diff] [review]
Example of partially functioning fix: extended ParserError class

Review of attachment 8934497 [details] [diff] [review]:
-----------------------------------------------------------------

Formatting and cleanup aside, this looks very promising. I think we should prefer handle_later in all but the worst cases to attempt to print as many errors as possible at once. Obviously, any error that results in a state that is unrecoverable must be handled immediately, but I think those cases should be few and far between.

Instead of sys.exitfunc (deprecated in Python 2.4) have you considered the atexit module? If you wanted, it looks as though you could use that to store the ParserErrors as atexit handlers instead of storing them in the type.

::: toolkit/components/telemetry/parse_histograms.py
@@ +11,4 @@
>  import sys
>  import shared_telemetry_utils as utils
>  
> +from shared_telemetry_utils import ParserError; sys.exitfunc = ParserError('Checking for residual errors.').handle_now

Maybe the "raise self" doesn't result in a non-zero exit value for the script?

@@ +288,5 @@
>          if self._strict_type_checks:
>              pattern = '^[a-z][a-z0-9_]+[a-z0-9]$'
>              if not re.match(pattern, name, re.IGNORECASE):
> +                ParserError('Error for histogram name "%s": name does not conform to "%s"' %
> +                                  (name, pattern)).handle_now()

This error should be fine to handle later, I think

@@ +310,2 @@
>                                 '\nSee: {}#expires-in-version')
> +                              .format(name, field, expiration, HISTOGRAMS_DOC_URL)).handle_now()

Can probably be handle_later

@@ +321,5 @@
>  
>          invalid = filter(lambda l: len(l) > MAX_LABEL_LENGTH, labels)
>          if len(invalid) > 0:
> +            ParserError('Label values for "%s" exceed length limit of %d: %s' %
> +                              (name, MAX_LABEL_LENGTH, ', '.join(invalid))).handle_now()

Can probably be handle_later

@@ +326,4 @@
>  
>          if len(labels) > MAX_LABEL_COUNT:
> +            ParserError('Label count for "%s" exceeds limit of %d' %
> +                              (name, MAX_LABEL_COUNT)).handle_now()

Can probably be handle_later

@@ +332,5 @@
>          # the label values to a strict pattern.
>          invalid = filter(lambda l: not re.match(CPP_IDENTIFIER_PATTERN, l, re.IGNORECASE), labels)
>          if len(invalid) > 0:
> +            ParserError('Label values for %s are not matching pattern "%s": %s' %
> +                              (name, CPP_IDENTIFIER_PATTERN, ', '.join(invalid))).handle_now()

Can probably be handle_later

@@ +345,5 @@
>          DOC_URL = HISTOGRAMS_DOC_URL + "#record-in-processes"
>  
>          if not rip:
> +            ParserError('Histogram "%s" must have a "%s" field:\n%s'
> +                              % (name, field, DOC_URL)).handle_now()

Should probably be handle_later, then we can get multiple "missing field" errors.

@@ +350,5 @@
>  
>          for process in rip:
>              if not utils.is_valid_process_name(process):
> +                ParserError('Histogram "%s" has unknown process "%s" in %s.\n%s' %
> +                                  (name, process, field, DOC_URL)).handle_now()

Can be handle_later

@@ +417,4 @@
>  
>          for field in ['alert_emails', 'bug_numbers']:
>              if field not in definition and name not in whitelists[field]:
> +                ParserError('New histogram "%s" must have a "%s" field.' % (name, field)).handle_now()

This can be handle_later, no?

@@ +421,4 @@
>              if field in definition and name in whitelists[field]:
>                  msg = 'Histogram "%s" should be removed from the whitelist for "%s" in ' \
>                        'histogram-whitelists.json.'
> +                ParserError(msg % (name, field)).handle_now()

We want this to be handle_later so it can tell us all the whitelists it can be removed from all at once.

@@ +494,5 @@
>          if not self._strict_type_checks:
>              return
>          for key in definition.iterkeys():
>              if key not in allowed_keys:
> +                ParserError('Key "%s" is not allowed for histogram "%s".' % (key, name)).handle_now()

handle_later will allow us to enumerate all the invalid keys at once

@@ +509,5 @@
>                      ' amounts of resources. Contact a Telemetry peer (e.g. in #telemetry)'
>                      ' if you think an exception ought to be made:\n'
>                      'https://wiki.mozilla.org/Modules/Toolkit#Telemetry'
>                      % self._name
> +                    ).handle_now()

Can probably be handle_later

@@ +569,5 @@
>  
>          value = definition.get('releaseChannelCollection', 'opt-in')
>          if value not in datasets:
> +            ParserError('Unknown value for releaseChannelCollection'
> +                              ' policy for histogram "%s".' % self._name).handle_now()

handle_later

@@ +581,4 @@
>      d = collections.OrderedDict()
>      for key, value in ordered_pairs:
>          if strict_type_checks and key in d:
> +            ParserError("Found duplicate key in Histograms file: %s" % key).handle_now()

handle_later

@@ +666,4 @@
>  
>          for (name, definition) in histograms.iteritems():
>              if name in all_histograms:
> +                ParserError('Duplicate histogram name "%s".' % name).handle_now()

Can probably be handle_later
(In reply to Chris H-C :chutten from comment #13)

> Instead of sys.exitfunc (deprecated in Python 2.4) have you considered the
> atexit module? If you wanted, it looks as though you could use that to store
> the ParserErrors as atexit handlers instead of storing them in the type.
I used sys.exitfunc because I thought importing an extra module could be avoided. Did not know about the deprecation though. Have used atexit.register in this version of the patch. Except for the strict_type_check error in check_name function (for demonstration, output below), all other errors have been converted to handle_later.

> ::: toolkit/components/telemetry/parse_histograms.py
> @@ +11,4 @@
> >  import sys
> >  import shared_telemetry_utils as utils
> >  
> > +from shared_telemetry_utils import ParserError; sys.exitfunc = ParserError('Checking for residual errors.').handle_now
> 
> Maybe the "raise self" doesn't result in a non-zero exit value for the
> script?
When calling ParserError.handle_now() directly, the "raise self" raises the error and correctly stops the build. Output below:
 0:00.14 /usr/bin/make -C /home/user/mozilla-central/obj-x86_64-pc-linux-gnu -j4 -s backend
 0:00.22 /usr/bin/make -C toolkit/components/telemetry -j4 -s
 0:00.30 TelemetryHistogramData.inc
 0:00.30 TelemetryHistogramEnums.h
 0:00.46 Error for histogram name "LABELS_WRONG#HISTOGRAM": "#" is not allowed.
 0:00.46 Error for histogram name "LABELS_WRONG#HISTOGRAM":  can not start with "LABELS_".
 0:00.46 
 0:00.46 Error processing histograms:
 0:00.46 Error for histogram name "LABELS_WRONG#HISTOGRAM": name does not conform to "^[a-z][a-z0-9_]+[a-z0-9]$"
 0:00.46 
 0:00.48 backend.mk:23: recipe for target 'TelemetryHistogramData.inc' failed
 0:00.48 make[1]: *** [TelemetryHistogramData.inc] Error 1
 0:00.48 make[1]: *** Deleting file 'TelemetryHistogramData.inc'
 0:00.48 make[1]: *** Waiting for unfinished jobs....
 0:00.48 Error for histogram name "LABELS_WRONG#HISTOGRAM": "#" is not allowed.
 0:00.48 Error for histogram name "LABELS_WRONG#HISTOGRAM":  can not start with "LABELS_".
 0:00.48 
 0:00.48 Error processing histograms:
 0:00.48 Error for histogram name "LABELS_WRONG#HISTOGRAM": name does not conform to "^[a-z][a-z0-9_]+[a-z0-9]$"
 0:00.48 
 0:00.50 backend.mk:30: recipe for target 'TelemetryHistogramEnums.h' failed
 0:00.50 make[1]: *** [TelemetryHistogramEnums.h] Error 1
 0:00.50 make[1]: *** Deleting file 'TelemetryHistogramEnums.h'
 0:00.50 /home/user/mozilla-central/config/rules.mk:432: recipe for target 'default' failed
 0:00.50 make: *** [default] Error 2
All errors are correctly printed and the last 'immediately fatal' error is raised, stopping the build, so I don't think the "raise self" statement is the problem.

> @@ +288,5 @@
> >          if self._strict_type_checks:
> >              pattern = '^[a-z][a-z0-9_]+[a-z0-9]$'
> >              if not re.match(pattern, name, re.IGNORECASE):
> > +                ParserError('Error for histogram name "%s": name does not conform to "%s"' %
> > +                                  (name, pattern)).handle_now()
> 
> This error should be fine to handle later, I think
This error is handled now just for above output. Will change in final fix. Also, why define pattern again, when the global CPP_IDENTIFIER_PATTERN is the same?

> Instead of sys.exitfunc (deprecated in Python 2.4) have you considered the
> atexit module? If you wanted, it looks as though you could use that to store
> the ParserErrors as atexit handlers instead of storing them in the type.
The atexit.register provides the same functionality as sys.exitfunc and runs sys.exitfunc in the background. I tried registering a new atexit handler for each 'handle_later' error. The same problem persists, the errors are printed but the build doesn't stop, even if an error is raised. It appears the return value of any exit handler is ignored. So for now, added a classmethod, ParserError.exit_func, and registered as exit_handler in parse_histograms.py to handle all 'eventually fatal' errors upon exit. The output now is more readable than multiple exit handlers but the same problem persists; all the errors are still printed but the build doesn't stop:

 0:00.17 /usr/bin/make -C /home/user/mozilla-central/obj-x86_64-pc-linux-gnu -j4 -s backend
 0:00.24 /usr/bin/make -C toolkit/components/telemetry -j4 -s
 0:00.32 TelemetryEventData.h
 0:00.32 TelemetryEventEnums.h
 0:00.33 TelemetryHistogramData.inc
 0:00.33 TelemetryHistogramEnums.h
 0:00.33 TelemetryProcessData.h
 0:00.53 TelemetryProcessEnums.h
 0:00.55 TelemetryScalarData.h
 0:00.70 TelemetryScalarEnums.h
 0:00.72 Error for histogram name "LABELS_WRONG#HISTOGRAM": "#" is not allowed.
 0:00.72 Error for histogram name "LABELS_WRONG#HISTOGRAM":  can not start with "LABELS_".
 0:00.72 Error for histogram name "LABELS_WRONG#HISTOGRAM": name does not conform to "^[a-z][a-z0-9_]+[a-z0-9]$"   <-- changed to handle later for this output
 0:00.72 Key "what_should_happen" is not allowed for histogram "LABELS_WRONG#HISTOGRAM".
 0:00.72 Error in atexit._run_exitfuncs:
 0:00.72 Traceback (most recent call last):
 0:00.72   File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
 0:00.72     func(*targs, **kargs)
 0:00.72   File "/home/user/mozilla-central/toolkit/components/telemetry/shared_telemetry_utils.py", line 54, in exit_func
 0:00.72     cls("Some errors occurred").handle_now()
 0:00.73   File "/home/user/mozilla-central/toolkit/components/telemetry/shared_telemetry_utils.py", line 44, in handle_now
 0:00.73     raise self
 0:00.73 ParserError: Some errors occurred
 0:00.73 Error in sys.exitfunc:
 0:00.73 Traceback (most recent call last):
 0:00.73   File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
 0:00.73     func(*targs, **kargs)
 0:00.73   File "/home/user/mozilla-central/toolkit/components/telemetry/shared_telemetry_utils.py", line 54, in exit_func
 0:00.73     cls("Some errors occurred").handle_now()
 0:00.73   File "/home/user/mozilla-central/toolkit/components/telemetry/shared_telemetry_utils.py", line 44, in handle_now
 0:00.73     raise self
 0:00.73 shared_telemetry_utils.ParserError: Some errors occurred
 0:01.15 Error for histogram name "LABELS_WRONG#HISTOGRAM": "#" is not allowed.
 0:01.15 Error for histogram name "LABELS_WRONG#HISTOGRAM":  can not start with "LABELS_".
 0:01.15 Error for histogram name "LABELS_WRONG#HISTOGRAM": name does not conform to "^[a-z][a-z0-9_]+[a-z0-9]$"
 0:01.15 Key "what_should_happen" is not allowed for histogram "LABELS_WRONG#HISTOGRAM".
 0:01.15 Error in atexit._run_exitfuncs:
 0:01.15 Traceback (most recent call last):
 0:01.15   File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
 0:01.15     func(*targs, **kargs)
 0:01.15   File "/home/user/mozilla-central/toolkit/components/telemetry/shared_telemetry_utils.py", line 54, in exit_func
 0:01.15     cls("Some errors occurred").handle_now()
 0:01.15   File "/home/user/mozilla-central/toolkit/components/telemetry/shared_telemetry_utils.py", line 44, in handle_now
 0:01.15     raise self
 0:01.15 ParserError: Some errors occurred
 0:01.15 Error in sys.exitfunc:
 0:01.15 Traceback (most recent call last):
 0:01.15   File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
 0:01.15     func(*targs, **kargs)
 0:01.15   File "/home/user/mozilla-central/toolkit/components/telemetry/shared_telemetry_utils.py", line 54, in exit_func
 0:01.15     cls("Some errors occurred").handle_now()
 0:01.15   File "/home/user/mozilla-central/toolkit/components/telemetry/shared_telemetry_utils.py", line 44, in handle_now
 0:01.15     raise self
 0:01.15 shared_telemetry_utils.ParserError: Some errors occurred
 0:01.28 Elapsed: 0.05s; From ../../dist/idl: Kept 1012 existing; Added/updated 0; Removed 0 files and 0 directories.
 0:01.63 In file included from /home/user/mozilla-central/toolkit/components/telemetry/Telemetry.h:16:0,
 0:01.63                  from /home/user/mozilla-central/toolkit/components/telemetry/KeyedStackCapturer.h:9,
 0:01.63                  from /home/user/mozilla-central/toolkit/components/telemetry/KeyedStackCapturer.cpp:7:
 0:01.63 /home/user/mozilla-central/obj-x86_64-pc-linux-gnu/dist/include/mozilla/TelemetryHistogramEnums.h:13:15: error: stray ‘#’ in program
 0:01.63    LABELS_WRONG#HISTOGRAM,
 0:01.63                ^
 0:01.83 In file included from /home/user/mozilla-central/toolkit/components/telemetry/Telemetry.h:16:0,
 0:01.83                  from /home/user/mozilla-central/toolkit/components/telemetry/Telemetry.cpp:51:
 0:01.83 /home/user/mozilla-central/obj-x86_64-pc-linux-gnu/dist/include/mozilla/TelemetryHistogramEnums.h:13:15: error: stray ‘#’ in program
 0:01.83    LABELS_WRONG#HISTOGRAM,
 0:01.83                ^ 
 
As per this, the atexit function runs and raises and error using the "raise self" statement which previously stopped the build. This time, the error when raised by the exitfunc (sys.exitfunc or atexit.register), it is ignored and the build continues, resulting in errors further on. Any other ideas on how to proceed? In the meantime, where can I find resources to understand the build process?

NOTES:
-----
1. Imported sys module in parser. All error messages are printed to sys.stderr now, instead of the normal stdout. Not directly related to this bug, but seemed appropriate because otherwise some errors would go to stdout and some errors (e.g. the final 'immediately fatal' one which is raised) would go to stderr.

2. In the 'strict_type_check' condition of the check_name function in parse_histograms.py, the pattern variable is defined again even though it is equal to the global CPP_IDENTIFIER_PATTERN. Not related to this bug so not touching, but seems like something that can be changed though.
Attachment #8933486 - Attachment is obsolete: true
Attachment #8934497 - Attachment is obsolete: true
You are likely correct about the pattern/CPP_IDENTIFIER_PATTERN duplication. It was likely missed in a refactor some time ago. Please file a bug in Toolkit::Telemetry so that we don't forget about it.

You've been very patient with the oddities in our build. Thank you for that. I'm not sure why raising errors at exit isn't returning a non-0 exit code (which is what the build process uses to detect failed steps).

So let's go straight to the source, then. Instead of raise self, print the error being handled now and then sys.exit(1). I've tried this locally and it brings the build to a satisfying halt.
Flags: needinfo?(chutten)
You are right. The build respects the return value from sys.exit. Output attached:

 0:00.16 /usr/bin/make -C /home/user/mozilla-central/obj-x86_64-pc-linux-gnu -j4 -s backend
 0:00.24 /usr/bin/make -C toolkit/components/telemetry -j4 -s
 0:00.31 TelemetryHistogramData.inc
 0:00.31 TelemetryHistogramEnums.h
 0:00.31 TelemetryProcessData.h
 0:00.31 TelemetryProcessEnums.h
 0:00.31 TelemetryScalarData.h
 0:00.52 TelemetryScalarEnums.h
 0:00.71 Error for histogram name "LABELS_WRONG#HISTOGRAM": "#" is not allowed.
 0:00.71 Error for histogram name "LABELS_WRONG#HISTOGRAM":  can not start with "LABELS_".
 0:00.71 Error for histogram name "LABELS_WRONG#HISTOGRAM": name does not conform to "^[a-z][a-z0-9_]+[a-z0-9]$"
 0:00.71 Key "what_should_happen" is not allowed for histogram "LABELS_WRONG#HISTOGRAM".
 0:00.71 Exception KeyError: KeyError(140319735588608,) in <module 'threading' from '/usr/lib/python2.7/threading.pyc'> ignored
 0:00.72 backend.mk:30: recipe for target 'TelemetryHistogramEnums.h' failed
 0:00.73 make[1]: *** [TelemetryHistogramEnums.h] Error 1
 0:00.73 make[1]: *** Deleting file 'TelemetryHistogramEnums.h'
 0:00.73 make[1]: *** Waiting for unfinished jobs....
 0:00.98 Error for histogram name "LABELS_WRONG#HISTOGRAM": "#" is not allowed.
 0:00.98 Error for histogram name "LABELS_WRONG#HISTOGRAM":  can not start with "LABELS_".
 0:00.98 Error for histogram name "LABELS_WRONG#HISTOGRAM": name does not conform to "^[a-z][a-z0-9_]+[a-z0-9]$"
 0:00.98 Key "what_should_happen" is not allowed for histogram "LABELS_WRONG#HISTOGRAM".
 0:00.98 Exception KeyError: KeyError(140009284740864,) in <module 'threading' from '/usr/lib/python2.7/threading.pyc'> ignored
 0:00.99 backend.mk:23: recipe for target 'TelemetryHistogramData.inc' failed
 0:00.99 make[1]: *** [TelemetryHistogramData.inc] Error 1
 0:00.99 make[1]: *** Deleting file 'TelemetryHistogramData.inc'
 0:01.03 /home/user/mozilla-central/config/rules.mk:432: recipe for target 'default' failed
 0:01.03 make: *** [default] Error 2

I guess now cleanup and formatting is left before this turns into a proper patch. 

What exactly did you mean by cleanup?
Regarding formatting, eslint shows no errors or warnings on the changed files. Is there another linter that I have to use?

Also, what about parse_scalars.py and parse_events.py? They also use the ParserError class and can be changed to print multiple errors at a time. I guess that will be a different bug, if at all.
Attachment #8934828 - Attachment is obsolete: true
Flags: needinfo?(chutten)
Accidentally uploaded wrong diff file in comment 16. Apologies for that.
Attachment #8935636 - Attachment is obsolete: true
Filed bug 1424148 regarding the duplication of pattern and CPP_IDENTIFIER_PATTERN in parse_histograms.py.
(In reply to Aditya Bharti [:adbugger] from comment #16)
> What exactly did you mean by cleanup?
> Regarding formatting, eslint shows no errors or warnings on the changed
> files. Is there another linter that I have to use?

For python files, flake8 should be used. You can run it via:

./mach lint -l flake8 toolkit/components/telemetry
Like for the pattern/CPP_IDENTIFIER_PATTERN duplication, we'll need separate bugs to handle those files. Luckily bugs are cheap :)

Follow Alessio's advice to run the python linter, remove any commented-out code, and set the r? flag on the patch to :chutten, and we'll see about making this available for everyone to enjoy.
Flags: needinfo?(chutten)
Attached patch 1401612_v5.patch (obsolete) — Splinter Review
Handling of 'only eventual errors' case:

0:00.16 /usr/bin/make -C /home/user/mozilla-central/obj-x86_64-pc-linux-gnu -j4 -s backend
0:00.23 /usr/bin/make -C toolkit/components/telemetry -j4 -s
0:00.31 TelemetryEventData.h
0:00.31 TelemetryEventEnums.h
0:00.31 TelemetryHistogramData.inc
0:00.31 TelemetryHistogramEnums.h
0:00.31 TelemetryProcessData.h
0:00.50 TelemetryProcessEnums.h
0:00.51 TelemetryScalarData.h
0:00.67 TelemetryScalarEnums.h
0:00.77 Error for histogram name "LABELS_WRONG#HISTOGRAM": "#" is not allowed.
0:00.77 Error for histogram name "LABELS_WRONG#HISTOGRAM":  can not start with "LABELS_".
0:00.77 Error for histogram name "LABELS_WRONG#HISTOGRAM": name does not conform to "^[a-z][a-z0-9_]+[a-z0-9]$"
0:00.77 Key "what_should_happen" is not allowed for histogram "LABELS_WRONG#HISTOGRAM".
0:00.77 Some errors occurred
0:00.77 Exception KeyError: KeyError(139650580150016,) in <module 'threading' from '/usr/lib/python2.7/threading.pyc'> ignored

Handling of case with an 'immediately fatal' error:

0:00.42 Traceback (most recent call last):
0:00.42   File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
0:00.42 Traceback (most recent call last):
0:00.42   File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
0:00.44     "__main__", fname, loader, pkg_name)
0:00.44     "__main__", fname, loader, pkg_name)
0:00.45   File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
0:00.45   File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
0:00.45     exec code in run_globals
0:00.45   File "/home/user/mozilla-central/python/mozbuild/mozbuild/action/file_generate.py", line 112, in <module>
0:00.45     exec code in run_globals
0:00.45   File "/home/user/mozilla-central/python/mozbuild/mozbuild/action/file_generate.py", line 112, in <module>
0:00.45     sys.exit(main(sys.argv[1:]))
0:00.45   File "/home/user/mozilla-central/python/mozbuild/mozbuild/action/file_generate.py", line 63, in main
0:00.45     sys.exit(main(sys.argv[1:]))
0:00.45     ret = module.__dict__[method](output, *args.additional_arguments)
0:00.45   File "/home/user/mozilla-central/toolkit/components/telemetry/gen_histogram_enum.py", line 49, in main
0:00.45   File "/home/user/mozilla-central/python/mozbuild/mozbuild/action/file_generate.py", line 63, in main
0:00.45     all_histograms = list(parse_histograms.from_files(filenames))
0:00.45   File "/home/user/mozilla-central/toolkit/components/telemetry/parse_histograms.py", line 704, in from_files
0:00.45     ret = module.__dict__[method](output, *args.additional_arguments)
0:00.45   File "/home/user/mozilla-central/toolkit/components/telemetry/gen_histogram_data.py", line 196, in main
0:00.45     histograms = list(parse_histograms.from_files(filenames))
0:00.45   File "/home/user/mozilla-central/toolkit/components/telemetry/parse_histograms.py", line 704, in from_files
0:00.45     yield Histogram(name, definition, strict_type_checks=strict_type_checks)
0:00.45   File "/home/user/mozilla-central/toolkit/components/telemetry/parse_histograms.py", line 131, in __init__
0:00.45     self.verify_attributes(name, definition)
0:00.45   File "/home/user/mozilla-central/toolkit/components/telemetry/parse_histograms.py", line 268, in verify_attributes
0:00.45     allowed_keys = table[kind]
0:00.45 KeyError    yield Histogram(name, definition, strict_type_checks=strict_type_checks)
0:00.45   File "/home/user/mozilla-central/toolkit/components/telemetry/parse_histograms.py", line 131, in __init__
0:00.45 : u'unsupported'
0:00.45     self.verify_attributes(name, definition)
0:00.45   File "/home/user/mozilla-central/toolkit/components/telemetry/parse_histograms.py", line 268, in verify_attributes
0:00.45     allowed_keys = table[kind]
0:00.45 KeyErrorUnknown kind "unsupported" for histogram "LABELS_WRONG#HISTOGRAM".
0:00.45 : u'unsupported'
0:00.45 Some errors occurred

More verbose but still informative.

With the wrong histogram removed, the build passes. Uploading patch. Commented code removed and files properly linted.
Attachment #8935638 - Attachment is obsolete: true
Attachment #8935999 - Flags: review?(chutten)
Comment on attachment 8935999 [details] [diff] [review]
1401612_v5.patch

Review of attachment 8935999 [details] [diff] [review]:
-----------------------------------------------------------------

I apologize for the delay.

This patch looks good! One question to be answered or addressed, and then we're good to go, I think. Do you have access to run this on try, or would you like me to handle it?

::: toolkit/components/telemetry/parse_histograms.py
@@ +423,4 @@
>  
>          for field in ['alert_emails', 'bug_numbers']:
>              if field not in definition and name not in whitelists[field]:
> +                ParserError('New histogram "%s" must have a "%s" field.' %

Does this have to be handled immediately?
Attachment #8935999 - Flags: review?(chutten) → review+
Attached patch 1401612_v6.patchSplinter Review
I guess you're right. With the build now halting successfully in case of eventual errors, handling this eventually makes sense too. Made the required change. Marked the previous patch obsolete. flake8 checks out.

I do not have try server access. I'll need your help to handle this one on the try server. For future work though, how do I get access?
Attachment #8935999 - Attachment is obsolete: true
Attachment #8936997 - Flags: review?(chutten)
Comment on attachment 8936997 [details] [diff] [review]
1401612_v6.patch

Review of attachment 8936997 [details] [diff] [review]:
-----------------------------------------------------------------

The full description of what Try is and how to get access is on the wiki over here: https://wiki.mozilla.org/ReleaseEngineering/TryServer

The short summary is you need Level 1 Commit Access to access try, and then you need to do some configuration. You don't have to worry about this now if you don't want to. There are always plenty of people around to help :)

For now I have pushed your commit up to try so that we can watch treeherder as the various builds come back and tell us whether or not they succeeded: https://treeherder.mozilla.org/#/jobs?repo=try&revision=d3c331f08ebabcbeca6d28c77d88586e27754c7f

If these all come back green, we'll mark this as checkin-needed and get it into the tree.
Attachment #8936997 - Flags: review?(chutten) → review+
Builds clean on try. Let's get this in.
Keywords: checkin-needed
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/92dbf32cc352
Telemetry histogram validator now prints multiple error messages at a time. r=chutten
Keywords: checkin-needed
Is it a good idea to file bugs regarding changing the parse_events.py and parse_scalars.py right after this check-in or should I wait for the status change from NEW to VERIFIED/RESOLVED?
Flags: needinfo?(chutten)
https://hg.mozilla.org/mozilla-central/rev/92dbf32cc352
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla59
You can do that (and start working on them, if you like :D ) whenever you'd like. I tend to file bugs as soon as I think of them, so I don't forget. Since they're related, you can put this bug's number (1401612) in the "See Also" field of these other bugs.
Flags: needinfo?(chutten)
See Also: → 1426460
See Also: → 1426461
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: