Closed Bug 1790808 Opened 2 years ago Closed 2 years ago

since 2022-09-09, tecken nodes monotonically use disk and then run out of space

Categories

(Tecken :: Upload, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

We had a couple of out-of-disk-space errors today from the same tecken node. Looking into it further and zooming out, we can see tecken nodes increasing their disk usage over time.

https://earthangel-b40313e5.influxcloud.net/d/a9-7FT0Zk/tecken-app-metrics?orgId=1&refresh=1m&from=now-7d&to=now&viewPanel=70

Does Tecken clean up files correctly after successful and unsuccessful symbol uploads? If not, do those files ever get cleaned up, or do they sit around forever accumulating on the node over time?

Grabbing this to research for now. If there's something easy we can do in the short term, we should do that. Otherwise, we should wait for the upload code overhaul in bug #1652835.

Assignee: nobody → willkg
Status: NEW → ASSIGNED

The upload view is wrapped in a decorator that creates a temporary directory and removes it and all its contents if something fails. That seems fine.

The tecken app code runs in gunicorn using the sync worker (the default) with no max_requests setting.

The timeout is set to 5 minutes. If the timeout is hit, then gunicorn will kill the process and that'll (probably) leave stuff around on disk. There isn't another process to clean things up. As near as I can tell, the temp directory prefix doesn't include the process id.
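
For illustration, the pattern is probably something like this minimal sketch (not Tecken's actual code; the names here are made up):

    import functools
    import shutil
    import tempfile

    def with_upload_tmpdir(view):
        """Give the wrapped view a per-request temp directory and clean it up."""
        @functools.wraps(view)
        def wrapper(request, *args, **kwargs):
            # Assumed prefix; per the note above, the real prefix apparently
            # doesn't include the process id, so orphans can't be tied back
            # to a specific worker.
            tmpdir = tempfile.mkdtemp(prefix="symbols-upload-")
            try:
                return view(request, tmpdir, *args, **kwargs)
            finally:
                # This never runs if gunicorn kills the worker mid-request,
                # which is how orphaned directories could pile up on disk.
                shutil.rmtree(tmpdir, ignore_errors=True)
        return wrapper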

I don't think we should have a 5-minute timeout. Tecken upload requests are long, and gunicorn shouldn't be killing them. The timeout should well exceed the other timeouts for the request (nginx, ELB, etc.).

I think for a first pass, I want to double the timeout.

We can probably emit a metric when gunicorn kills a process because of the timeout:

https://docs.gunicorn.org/en/stable/settings.html#worker-abort

If we add that first, we can probably use that to prove the hypothesis.
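
As a sketch (not Tecken's actual configuration), a gunicorn config file could raise the timeout and hook worker_abort like this; the statsd client is an assumption standing in for whatever metrics client we actually use:

    # gunicorn.conf.py -- hedged sketch, not the real config
    import statsd

    # Sync workers are the default; timeout is how long the arbiter waits
    # before deciding a worker is hung and aborting it.
    worker_class = "sync"
    timeout = 600  # seconds; double the current 5 minutes

    _metrics = statsd.StatsClient("localhost", 8125)

    def worker_abort(worker):
        # Called when a worker receives SIGABRT, which generally happens on
        # timeout. Emit a metric so we can count these events.
        _metrics.incr("tecken.gunicorn_worker_abort")

If that metric starts incrementing right before a node's disk usage jumps, that would support the hypothesis.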

I fixed the graphs so the min is 0 and the max is 110, which makes it easier to eyeball disk usage over months. You can see monotonically increasing disk usage for all the nodes going back months. However, there aren't instances of "omg--ran out of disk!" until after 2022-09-09 or so. (I'll attach a screenshot.)

The recent deploy was 2022.09.01 (https://bugzilla.mozilla.org/show_bug.cgi?id=1788705). That had a handful of library updates, but nothing that would affect the Tecken app, so I don't currently believe that the update caused problems.

Rather, that deploy pushed out support for inline functions in Eliot, and then I did a prod deploy for Socorro which added support for inline functions to the stackwalker. That unblocked the change to dump_syms to include inline information. We're pretty sure that symbol files for Firefox builds after that point are much larger. I don't know how that affects upload handling.

I'm looking at the upload data now to see what I can see in it. Can we see an increase in symbol upload payload sizes? Can we see an increase in failed upload attempts / timeouts? Etc.

Summary: tecken nodes have increased disk usage over time → since 2022-09-09, tecken nodes monotonically use disk and then run out of space

I pulled down the upload records in Tecken and grouped them by created_at day. We get a lot of symbol uploads from a variety of things, but the ones I wanted to look at are Firefox uploads, to see whether the data answers these questions:

  1. Are uploads getting larger over time especially after we changed dump_syms to output inline function information?
  2. Are the number of incomplete symbol uploads increasing?
  3. Are the number of symbol upload retries increasing?
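
To make the grouping concrete, here's a rough sketch of how the per-day size columns in the table below could be computed from an export of the upload records. The field names (created_at, size), the interpretation of the 1.5gb cutoff, and the use of pandas are assumptions, not Tecken's actual schema or tooling:

    import pandas as pd

    LARGE = 1_500_000_000  # the 1.5gb threshold used in the table

    def daily_size_stats(uploads: pd.DataFrame) -> pd.DataFrame:
        """Per-day upload counts plus stats over the uploads larger than 1.5gb."""
        df = uploads.copy()
        df["day"] = pd.to_datetime(df["created_at"]).dt.date

        def summarize(group):
            # large_mean/large_median are assumed to be computed over the
            # subset of uploads above the 1.5gb threshold.
            large = group[group["size"] > LARGE]
            return pd.Series({
                "total": len(group),
                "over_1_5gb": len(large),
                "largest": group["size"].max(),
                "large_mean": large["size"].mean(),
                "large_median": large["size"].median(),
            })

        return df.groupby("day").apply(summarize)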

Statistics on the data for 2022-08-01 through 2022-09-15 (sizes are in bytes):

date total # > 1.5gb (%) largest large_mean large_median incomplete retries
2022-08-01 368 7 (2%) 1,576,903,409 1,567,346,745 1,565,997,640 0 50 **
2022-08-02 516 22 (4%) 1,745,710,279 1,593,816,537 1,561,064,084 1 ** 78 **
2022-08-03 397 6 (2%) 1,737,134,926 1,606,841,206 1,570,585,792 0 53 **
2022-08-04 519 5 (1%) 1,577,553,351 1,556,057,992 1,562,683,176 1 ** 76 **
2022-08-05 242 7 (3%) 1,745,912,052 1,582,702,699 1,562,650,105 0 34
2022-08-06 160 4 (2%) 1,746,446,640 1,605,762,392 1,564,373,941 0 21
2022-08-07 134 3 (2%) 1,727,490,725 1,612,724,496 1,562,831,666 0 24
2022-08-08 370 2 (1%) 1,562,786,047 1,555,314,904 1,555,314,904 0 42
2022-08-09 515 10 (2%) 1,746,567,744 1,597,133,634 1,566,503,841 0 68 **
2022-08-10 361 10 (3%) 1,746,098,211 1,601,122,417 1,577,491,940 0 49
2022-08-11 553 23 (4%) 1,756,919,860 1,607,231,930 1,577,198,021 0 79 **
2022-08-12 314 8 (3%) 1,747,959,178 1,605,184,042 1,566,210,165 0 41
2022-08-13 165 7 (4%) 1,581,368,389 1,574,916,091 1,581,368,389 2 ** 39
2022-08-14 154 2 (1%) 1,565,488,569 1,558,816,703 1,558,816,703 0 21
2022-08-15 380 8 (2%) 1,751,433,548 1,615,560,198 1,575,999,559 0 54 **
2022-08-16 376 12 (3%) 1,751,488,788 1,597,988,334 1,575,625,224 0 44
2022-08-17 287 13 (5%) 1,763,678,614 1,581,321,682 1,566,678,881 0 41
2022-08-18 495 21 (4%) 1,758,133,505 1,615,566,074 1,582,184,323 0 75 **
2022-08-19 291 5 (2%) 1,732,428,538 1,586,308,615 1,551,288,918 0 40
2022-08-20 139 2 (1%) 1,564,698,282 1,558,229,630 1,558,229,630 0 20
2022-08-21 91 2 (2%) 1,564,708,633 1,558,278,964 1,558,278,964 0 10
2022-08-22 506 11 (2%) 1,756,457,550 1,622,764,433 1,580,719,435 0 67 **
2022-08-23 514 20 (4%) 1,761,337,101 1,596,614,213 1,566,674,656 1 ** 80 **
2022-08-24 452 9 (2%) 1,580,946,776 1,562,992,787 1,564,757,649 0 54 **
2022-08-25 415 6 (1%) 1,580,827,951 1,550,335,272 1,558,500,759 0 50 **
2022-08-26 380 9 (2%) 1,581,424,375 1,540,914,378 1,554,163,400 0 51 **
2022-08-27 138 3 (2%) 1,566,849,056 1,541,293,599 1,556,111,623 0 24
2022-08-28 139 5 (4%) 1,580,838,803 1,554,773,989 1,566,880,690 0 19
2022-08-29 652 28 (4%) 1,757,737,547 1,567,978,328 1,564,860,868 2 ** 103 **
2022-08-30 315 8 (3%) 1,566,904,766 1,528,253,263 1,520,962,221 0 46
2022-08-31 307 9 (3%) 1,757,695,313 1,563,202,801 1,560,336,790 0 34
2022-09-01 379 16 (4%) 1,757,712,278 1,594,870,256 1,574,171,165 0 59 **
2022-09-02 351 22 (6%) 1,880,381,770 ** 1,644,686,469 1,584,237,154 2 ** 55 **
2022-09-03 179 4 (2%) 1,575,003,472 1,537,121,459 1,533,869,393 0 29
2022-09-04 135 4 (3%) 1,575,039,782 1,537,046,045 1,533,780,616 0 22
2022-09-05 201 4 (2%) 1,575,029,388 1,537,105,078 1,533,869,986 0 37
2022-09-06 429 21 (5%) 1,755,485,569 1,591,760,028 1,569,847,829 0 64 **
2022-09-07 284 13 (5%) 1,881,787,977 ** 1,656,032,726 1,576,388,878 0 46
2022-09-08 402 64 (16%) 2,093,423,277 ** 1,651,129,629 1,598,509,649 5 ** 71 **
2022-09-09 294 65 (22%) 1,863,099,416 ** 1,643,485,523 1,597,164,988 8 ** 52 **
2022-09-10 220 56 (25%) 2,093,424,033 ** 1,652,239,917 1,612,249,382 3 ** 36
2022-09-11 173 32 (18%) 2,110,969,213 ** 1,682,474,218 1,646,073,222 0 30
2022-09-12 363 54 (15%) 1,864,537,012 ** 1,636,018,866 1,608,426,648 3 ** 55 **
2022-09-13 290 85 (29%) 1,886,950,967 ** 1,656,497,723 1,616,254,826 5 ** 46
2022-09-14 264 65 (25%) 1,886,903,353 ** 1,647,921,235 1,608,794,966 8 ** 34
2022-09-15 266 37 (14%) 1,866,327,546 ** 1,642,404,826 1,606,754,392 8 ** 80 **

I don't know whether there are failed symbol upload attempts that aren't represented in this data. If that's important (and it might be), we should capture metrics in the upload_symbols code in m-c.

However, I think the data we have in the uploads table lets us answer the questions I posed and also lets us see whether changes in Tecken help/hinder the situation.

  1. Are uploads getting larger over time especially after we changed dump_syms to output inline function information?

    The size field of the upload_uploads table is the size of the data in the multipart/form-data field for symbols.zip. An HTTP POST should only contain one zip file.

    Yes. The number of symbol uploads > 1.5gb increased after 2022-09-07 and the size of the largest symbol upload increased and occasionally exceeds 2gb.

  2. Are the number of incomplete symbol uploads increasing?

    An incomplete is a record with a created_at datestamp but no completed_at datestamp (see the sketch after this list). I'm pretty sure this happens when the upload times out in a way that kills the gunicorn process. I need to look into this more.

    Yes. Incompletes have happened before, but there's a clear increase in likelihood after 2022-09-07.

  3. Are the number of symbol upload retries increasing?

    Symbol upload requests have no helpful metadata (product, version, release channel, buildid, etc.), so we have to invent a unique key. I'm using (email address, symbol upload size, date) as the key, so if two records with the same email address and symbol upload size occur on the same day, I'm treating that as a retry attempt (also sketched below, after this list).

    Bug #1789414 increased the timeout for the upload_symbols job, which gives it more time to do retries. That landed on 2022-09-06.

    I think the data here is less clear. We had retries before and we continue to have retries, but it's not obviously increasing after 2022-09-06.
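
For reference, here's a hedged sketch of how the incomplete and retry columns could be derived, continuing the assumptions from the earlier sketch (completed_at and user_email are assumed field names):

    import pandas as pd

    def daily_incompletes_and_retries(uploads: pd.DataFrame) -> pd.DataFrame:
        df = uploads.copy()
        df["day"] = pd.to_datetime(df["created_at"]).dt.date

        # Incomplete: a record with a created_at value but no completed_at value.
        incomplete = (
            df[df["completed_at"].isna()]
            .groupby("day")
            .size()
            .rename("incomplete")
        )

        # Retry heuristic (one plausible reading): records sharing the same
        # (day, email, size) key beyond the first are counted as retry attempts.
        retries = (
            df.groupby(["day", "user_email", "size"])
            .size()
            .sub(1)
            .groupby(level="day")
            .sum()
            .rename("retries")
        )

        return pd.concat([incomplete, retries], axis=1).fillna(0).astype(int)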

The gunicorn timeout change went out yesterday with bug #1791389.

I checked the dashboard and my data today. I don't see disk usage going up abnormally. There were two incompletes on 2022-09-19, but at least one of those happened before the deploy. The other might have happened either just before or during the deploy. There are no incompletes so far for 2022-09-20. Things look good so far....

I checked the dashboard and my data today.

Per the dashboard, the disk usage for all nodes is still under 40%. There's no case where disk usage for one node is going up abnormally. That's good.

Per upload data, there were 2 incompletes on 2022-09-19, 0 on 2022-09-20, and 1 on 2022-09-21 (today).

date total # > 1.5gb (%) largest large_mean large_median incomplete retries notes
2022-09-01 379 16 (4%) 1,757,712,278 1,594,870,256 1,574,171,165 0 59 **
2022-09-02 351 22 (6%) 1,880,381,770 ** 1,644,686,469 1,584,237,154 2 ** 55 **
2022-09-03 179 4 (2%) 1,575,003,472 1,537,121,459 1,533,869,393 0 29
2022-09-04 135 4 (3%) 1,575,039,782 1,537,046,045 1,533,780,616 0 22
2022-09-05 201 4 (2%) 1,575,029,388 1,537,105,078 1,533,869,986 0 37
2022-09-06 429 21 (5%) 1,755,485,569 1,591,760,028 1,569,847,829 0 64 **
2022-09-07 284 13 (5%) 1,881,787,977 ** 1,656,032,726 1,576,388,878 0 46
2022-09-08 402 64 (16%) 2,093,423,277 ** 1,651,129,629 1,598,509,649 5 ** 71 **
2022-09-09 294 65 (22%) 1,863,099,416 ** 1,643,485,523 1,597,164,988 8 ** 52 **
2022-09-10 220 56 (25%) 2,093,424,033 ** 1,652,239,917 1,612,249,382 3 ** 36
2022-09-11 173 32 (18%) 2,110,969,213 ** 1,682,474,218 1,646,073,222 0 30
2022-09-12 363 54 (15%) 1,864,537,012 ** 1,636,018,866 1,608,426,648 3 ** 55 **
2022-09-13 290 85 (29%) 1,886,950,967 ** 1,656,497,723 1,616,254,826 5 ** 46
2022-09-14 264 65 (25%) 1,886,903,353 ** 1,647,921,235 1,608,794,966 8 ** 34
2022-09-15 246 33 (13%) 1,866,327,546 ** 1,653,076,830 1,608,731,289 8 ** 60 **
2022-09-16 179 60 (34%) 2,115,394,298 ** 1,663,924,918 1,627,238,448 11 ** 32
2022-09-17 125 26 (21%) 1,866,764,012 ** 1,644,074,572 1,615,446,013 1 ** 21
2022-09-18 97 25 (26%) 1,866,796,714 ** 1,640,863,803 1,603,346,111 3 ** 16
2022-09-19 291 41 (14%) 1,888,487,575 ** 1,640,004,775 1,584,311,996 2 ** 52 ** timeout fix deployed
2022-09-20 307 68 (22%) 2,092,021,812 ** 1,658,221,868 1,628,013,860 0 59 **
2022-09-21 146 37 (25%) 1,952,903,687 ** 1,683,191,141 1,652,180,949 1 ** 44 partial day

Incompletes can happen for reasons other than gunicorn killing the process, and we have seen them in the past (for example, on 2022-09-02). The fact that disk usage is normal is good. If that continues for the rest of the week, then we can close this out.

I continued to watch the graphs and upload data and everything continues to look fine. Marking as FIXED.

Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
See Also: → 1817902