Closed Bug 1507408 Opened 7 years ago Closed 6 years ago

Automated Bug Generator changes required

Categories

(Infrastructure & Operations :: RelOps: Puppet, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: dlabici, Assigned: dragrom)

References

Details

Attachments

(1 file)

With the upcoming workflow changes, the automated bug generator will need some modifications, which I covered with :dragrom in a meeting. To outline the points discussed:
- The bug title should use the following format: [$(MDC)] $(HOSTNAME) Generic Worker CODE 69.
-- The end result should look like: [MDC1] t-yosemite-r7-XXX Generic Worker CODE 69
- Add an alias of $(HOSTNAME) to ensure we don't have duplicate bugs for the same machine. If that alias already exists, use the existing Problem Tracking bug.
- If possible, automatically add CiDuty's team account ( https://bugzilla.mozilla.org/user_profile?login=ciduty%40mozilla.com ) to NeedInfo (everyone on the team has visibility and will receive the email from the NI?).
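A minimal Python sketch of the title format and the alias check described above. The function names and the `existing_aliases` mapping are hypothetical stand-ins for illustration; a real implementation would look the alias up in Bugzilla rather than in a local dictionary.

```python
from typing import Dict, Optional


def build_bug_title(mdc: str, hostname: str, exit_code: int = 69) -> str:
    """Format the title as: [$(MDC)] $(HOSTNAME) Generic Worker CODE 69."""
    return f"[{mdc}] {hostname} Generic Worker CODE {exit_code}"


def find_existing_bug(hostname: str, existing_aliases: Dict[str, int]) -> Optional[int]:
    """Return the id of the bug already aliased to this hostname, if any.

    `existing_aliases` is an assumed alias -> bug id mapping. When this
    returns None, the caller would file a new bug with alias = hostname.
    """
    return existing_aliases.get(hostname)
```

For example, `build_bug_title("MDC1", "t-yosemite-r7-XXX")` produces the title shown in the end result above.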
Status: NEW → ASSIGNED

The alias will be $(HOSTNAME)_code69, and the bug will be linked to the tracker bug returned by the $(HOSTNAME) alias.

The worker is now in working state

Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Status: REOPENED → ASSIGNED
Flags: needinfo?(dcrisan)
Flags: needinfo?(dcrisan) → needinfo?
Flags: needinfo?
Flags: needinfo?(dcrisan)
Flags: needinfo?(dcrisan) → needinfo?
Flags: needinfo?
Flags: needinfo?
Flags: needinfo?

As part of this bug, I'll also convert run-generic-worker.sh.erb from a Bash script to Python.
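As a rough idea of what the Python version of the wrapper could look like, here is a sketch that runs a command and returns its exit code. The function name `run_worker` is an assumption, and the command in the usage block is a stand-in, not the real generic-worker invocation.

```python
import subprocess
import sys
from typing import Sequence


def run_worker(command: Sequence[str]) -> int:
    """Run the worker command to completion and return its exit code."""
    # check=False: a non-zero exit code is expected input here, not an error.
    completed = subprocess.run(list(command), check=False)
    return completed.returncode


if __name__ == "__main__":
    # Stand-in command that simply exits with code 69.
    print(run_worker([sys.executable, "-c", "raise SystemExit(69)"]))
```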

These are the possible exit codes of generic-worker:

  Exit Codes:

    0      Tasks completed successfully; no more tasks to run (see config setting
           numberOfTasksToRun).
    64     Not able to load generic-worker config. This could be a problem reading the
           generic-worker config file on the filesystem, a problem talking to AWS/GCP
           metadata service, or a problem retrieving config/files from the taskcluster
           secrets service.
    65     Not able to install generic-worker on the system.
    66     Not able to create an OpenPGP key pair.
    67     A task user has been created, and the generic-worker needs to reboot in order
           to log on as the new task user. Note, the reboot happens automatically unless
           config setting disableReboots is set to true - in either case this exit code will
           be issued.
    68     The generic-worker hit its idle timeout limit (see config settings idleTimeoutSecs
           and shutdownMachineOnIdle).
    69     Worker panic - either a worker bug, or the environment is not suitable for running
           a task, e.g. a file cannot be written to the file system, or something else did
           not work that was required in order to execute a task. See config setting
           shutdownMachineOnInternalError.
    70     A new deploymentId has been issued in the AWS worker type configuration, meaning
           this worker environment is no longer up-to-date. Typically workers should
           terminate.
    71     The worker was terminated via an interrupt signal (e.g. Ctrl-C pressed).
    72     The worker is running on spot infrastructure in AWS EC2 and has been served a
           spot termination notice, and therefore has shut down.
    73     The config provided to the worker is invalid.
    74     Could not grant provided SID full control of interactive windows stations and
           desktop.
    75     Not able to create an ed25519 key pair.
    76     Not able to save generic-worker config file after fetching it from AWS provisioner
           or Google Cloud metadata.
    77     Not able to apply required file access permissions to the generic-worker config
           file so that task users can't read from or write to it.

I would recommend that the following exit codes NOT cause the worker to be quarantined; all other exit codes should cause an automatic quarantine:

  Exit Codes:

    0      Tasks completed successfully; no more tasks to run (see config setting
           numberOfTasksToRun).
    67     A task user has been created, and the generic-worker needs to reboot in order
           to log on as the new task user. Note, the reboot happens automatically unless
           config setting disableReboots is set to true - in either case this exit code will
           be issued.
    68     The generic-worker hit its idle timeout limit (see config settings idleTimeoutSecs
           and shutdownMachineOnIdle).
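The recommendation above reduces to a small allow-list check. The sketch below is one way to express it in Python; the names `NON_QUARANTINE_EXIT_CODES` and `should_quarantine` are illustrative assumptions.

```python
# Exit codes that should NOT trigger a quarantine, per the list above:
# 0 (tasks completed), 67 (reboot for new task user), 68 (idle timeout).
NON_QUARANTINE_EXIT_CODES = frozenset({0, 67, 68})


def should_quarantine(exit_code: int) -> bool:
    """Return True when the worker should be quarantined automatically."""
    return exit_code not in NON_QUARANTINE_EXIT_CODES
```

Using an allow-list means any exit code added to generic-worker later defaults to quarantine until someone decides otherwise.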

I had a meeting with :dragrom today and we discussed how to approach the new requirements. Here is an overview of the work that will be done:

  • Only use the existing bugs that we have for the machine, we will not use "_code69" in the alias anymore.
  • Whenever an exit code from comment 4 is hit, we will reopen the bug (if needed) and comment with the exit code and its description.
  • If possible via the API, we will also update the whiteboard of the bug with the current exit codes (and remove them when they are fixed).
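The steps above could be sketched as a function that builds the update for an existing bug. Everything here is an assumption for illustration: the `[exit-code-N]` whiteboard tag format, the shortened descriptions, and the payload keys (`comment`, `whiteboard`, `status`), which are modeled on the Bugzilla REST bug-update fields and should be checked against the API documentation before use.

```python
from typing import Dict

# Shortened descriptions for a few exit codes from the list in this bug.
EXIT_CODE_SUMMARIES: Dict[int, str] = {
    64: "Not able to load generic-worker config.",
    65: "Not able to install generic-worker on the system.",
    69: "Worker panic.",
    73: "The config provided to the worker is invalid.",
}

CLOSED_STATUSES = {"RESOLVED", "VERIFIED"}


def whiteboard_tag(exit_code: int) -> str:
    """Hypothetical whiteboard marker for an active exit code."""
    return f"[exit-code-{exit_code}]"


def build_bug_update(exit_code: int, status: str, whiteboard: str) -> dict:
    """Build the update: always comment, tag the whiteboard, reopen if closed."""
    description = EXIT_CODE_SUMMARIES.get(exit_code, "No description available.")
    body = f"generic-worker exited with code {exit_code}: {description}"
    update: dict = {"comment": {"body": body}}
    tag = whiteboard_tag(exit_code)
    if tag not in whiteboard:
        update["whiteboard"] = f"{whiteboard} {tag}".strip()
    if status in CLOSED_STATUSES:
        update["status"] = "REOPENED"
    return update


def clear_exit_code(exit_code: int, whiteboard: str) -> str:
    """Return the whiteboard with this exit code's tag removed."""
    return " ".join(whiteboard.replace(whiteboard_tag(exit_code), " ").split())
```

Keeping the payload construction separate from the HTTP call makes the reopen and whiteboard logic easy to test without touching Bugzilla.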

For the division of work, :dragrom will do the implementation and I'll review it, including the comment content and style.

Refactoring automatic bug generation

Depends on: 1541435
Status: ASSIGNED → RESOLVED
Closed: 7 years ago → 6 years ago
Resolution: --- → INVALID