Closed Bug 513960 Opened 15 years ago Closed 15 years ago

talos should go orange on failure to post to graph server

Categories

(Release Engineering :: General, defect)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: anodelman, Assigned: anodelman)

Attachments

(2 files, 1 obsolete file)

Talos currently stays green when it fails to post to the graph server; it should report the error instead.

We should first attempt to resend the data 3-5 times before giving up and going orange.
Assignee: nobody → anodelman
Some cleanup here in how we generate correctly formatted data to send to the graph server, mostly getting rid of temp files.
Attachment #398466 - Flags: review?(catlee)
Found with the new error reporting code.  Figured that it can be part of this bug as we can't roll out the error reporting till we are vaguely sure that we won't cause everything to burn.
Attachment #398529 - Flags: review?(rdoherty)
Comment on attachment 398529 [details] [diff] [review]
[Checked in]add extra character (%) to allowed string in graph server

Ran the tests; all pass. Also changed

assert c.StringValidator.validate('1Aa9Zz._()-+ ')  == '1Aa9Zz._()-+ '
to
assert c.StringValidator.validate('1Aa9Zz._()%-+ ')  == '1Aa9Zz._()%-+ '

in server/pyfomatic/test/test_collect.py to verify it will accept a %.

If that could be included when committing it would make it even awesomer :)
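
For readers without the graph server source at hand, the check being tested can be sketched roughly as below. This is only an illustration: the standalone `validate` function and the `ALLOWED` pattern are hypothetical stand-ins, and the real `StringValidator` in pyfomatic may be structured quite differently.

```python
import re

# Hypothetical whitelist: digits, ASCII letters, and the punctuation from the
# test string above, now including '%'. Not taken from the real graph server.
ALLOWED = re.compile(r"[0-9A-Za-z._()%\-+ ]*")


def validate(value):
    """Return value unchanged if every character is allowed, else raise ValueError."""
    if ALLOWED.fullmatch(value) is None:
        raise ValueError("invalid characters in %r" % (value,))
    return value


# Mirrors the updated assertion from test_collect.py.
assert validate('1Aa9Zz._()%-+ ') == '1Aa9Zz._()%-+ '
```

Using `fullmatch` rather than `match` with a trailing `$` avoids accepting a string that ends in a newline.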
Attachment #398529 - Flags: review?(rdoherty) → review+
Comment on attachment 398466 [details] [diff] [review]
report graph posting errors, try to send 5 times before failing out

>+  #send all the strings along to the graph server
>+  for data_string in result_strings:
>+    RETRIES = 5
>+    times = 0
>+    msg = ""
>+    while (times < RETRIES):
>+      try:
>+        utils.stamped_msg("Transmitting test: " + testname, "Started")
>+        links += process_Request(post_file.post_multipart(results_server, results_link, [("key", "value")], [("filename", "data_string", data_string)]))
>+        break
>+      except talosError, e:
>+        times += 1
>+        msg = e.msg
>+    if times == RETRIES:
>+        raise talosError("Failed to send data %d times... quitting\n%s" % (RETRIES, msg))

There should be a time.sleep() call in the except block so that the graph
server has a chance to recover from whatever problem it's experiencing.  It
should also increase the time before the next retry each time through the loop.
E.g. sleep for 5 seconds after the first failure, 15 seconds after the second,
45 after the 3rd, etc.
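
As a rough illustration of that suggestion (not the patch that eventually landed), the retry loop with a growing wait could look something like the sketch below. The helper name `send_with_retries`, its parameters, and the use of `Exception` and `RuntimeError` in place of `talosError` are all assumptions made to keep the example self-contained.

```python
import time


def send_with_retries(send, retries=5, first_wait=5, factor=3, sleep=time.sleep):
    """Call send() until it succeeds or `retries` attempts have failed.

    Waits first_wait seconds after the first failure and multiplies the wait
    by `factor` after each later one (5, 15, 45, ... with the defaults).
    """
    wait = first_wait
    last_error = None
    for attempt in range(retries):
        try:
            return send()
        except Exception as e:  # stand-in for talosError in the real code
            last_error = e
            # No point sleeping after the final failed attempt.
            if attempt < retries - 1:
                sleep(wait)
                wait *= factor
    raise RuntimeError(
        "Failed to send data %d times... quitting\n%s" % (retries, last_error)
    )
```

Passing `factor=2` gives a doubling schedule instead. The `sleep` parameter is there only so the schedule can be checked without actually waiting.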
Attachment #398466 - Flags: review?(catlee) → review-
Wait between each attempt to send to the graph server. Double the wait time after each failure.
Attachment #398466 - Attachment is obsolete: true
Attachment #398760 - Flags: review?(catlee)
Attachment #398760 - Flags: review?(catlee) → review+
Comment on attachment 398529 [details] [diff] [review]
[Checked in]add extra character (%) to allowed string in graph server

changeset:   241:0b8873afb30f
Attachment #398529 - Attachment description: add extra character (%) to allowed string in graph server → [Checked in]add extra character (%) to allowed string in graph server
Attachment #398529 - Flags: checked-in+
Changes to graph server pushed to production.
Attachment #398760 - Attachment description: report graph posting errors, try to send 5 times before failing out (take 2) → [checked in]report graph posting errors, try to send 5 times before failing out (take 2)
Attachment #398760 - Flags: checked-in+
Successfully reported graph server errors.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering