waitfor for localstack in bin/test.sh sometimes fails in CI
Categories
(Socorro :: General, defect, P2)
Tracking
(Not tracked)
People
(Reporter: willkg, Assigned: willkg)
Details
Attachments
(1 file)
On Wednesday, May 1st in pr 6594, I updated localstack from 1.1.0 to 3.4.0. Since then, the socorro tests fail periodically when running in CircleCI in bin/test.sh
at the waitfor line for localstack.
Example failure:
make: 'my.env' is up to date.
[+] Running 105/5
✔ pubsub 6 layers [⣿⣿⣿⣿⣿⣿] 0B/0B Pulled 57.0s
✔ postgresql 13 layers [⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿] 0B/0B Pulled 29.0s
✔ statsd 39 layers [⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿] 0B/0B Pulled 56.6s
✔ elasticsearch 16 layers [⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿] 0B/0B Pulled 34.5s
✔ localstack 26 layers [⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿] 0B/0B Pulled 57.4s
[+] Running 8/8
✔ Network socorro_default Created0.0s
✔ Container socorro-localstack-1 Created0.6s
✔ Container socorro-fakesentry-1 Created0.6s
✔ Container socorro-postgresql-1 Created0.6s
✔ Container socorro-gcs-emulator-1 Created0.6s
✔ Container socorro-statsd-1 Created0.6s
✔ Container socorro-elasticsearch-1 Created0.6s
✔ Container socorro-pubsub-1 Created0.6s
[+] Running 7/7
✔ Container socorro-elasticsearch-1 Started1.0s
✔ Container socorro-statsd-1 Started1.1s
✔ Container socorro-gcs-emulator-1 Started1.0s
✔ Container socorro-localstack-1 Started1.1s
✔ Container socorro-pubsub-1 Started1.1s
✔ Container socorro-postgresql-1 Started0.8s
✔ Container socorro-fakesentry-1 Started0.9s
^@^@+ echo '>>> set up environment'
>>> set up environment
+ DATABASE_URL=postgres://postgres:postgres@postgresql:5432/socorro_test
+ SENTRY_DSN=http://public@fakesentry:8090/1
+ ELASTICSEARCH_URL=http://elasticsearch:9200
+ LOCAL_DEV_AWS_ENDPOINT_URL=http://localstack:4566/
+ export PYTHONPATH=/app/:/app
+ PYTHONPATH=/app/:/app
++ which pytest
+ PYTEST=/usr/local/bin/pytest
++ which python
+ PYTHON=/usr/local/bin/python
+ echo '>>> wait for services to be ready'
>>> wait for services to be ready
+ urlwait postgres://postgres:postgres@postgresql:5432/socorro_test
+ urlwait http://elasticsearch:9200
+ urlwait http://pubsub:5010 10
+ urlwait http://gcs-emulator:8001/storage/v1/b 10
+ python ./bin/waitfor.py --verbose --codes=200,404 http://public@fakesentry:8090/1
Testing http://fakesentry:8090/1 for [200, 404] with timeout 15...
+ python ./bin/waitfor.py --timeout 20 --verbose http://localstack:4566/
Testing http://localstack:4566/ for [200] with timeout 20...
Traceback (most recent call last):
File "/app/./bin/waitfor.py", line 84, in <module>
sys.exit(main(sys.argv[1:]))
^^^^^^^^^^^^^^^^^^
File "/app/./bin/waitfor.py", line 64, in main
with urllib.request.urlopen(url) as resp:
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/urllib/request.py", line 216, in urlopen
return opener.open(url, data, timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/urllib/request.py", line 519, in open
response = self._open(req, data)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/urllib/request.py", line 536, in _open
result = self._call_chain(self.handle_open, protocol, protocol +
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/urllib/request.py", line 496, in _call_chain
result = func(*args)
^^^^^^^^^^^
File "/usr/local/lib/python3.11/urllib/request.py", line 1377, in http_open
return self.do_open(http.client.HTTPConnection, req)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/urllib/request.py", line 1352, in do_open
r = h.getresponse()
^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/http/client.py", line 1395, in getresponse
response.begin()
File "/usr/local/lib/python3.11/http/client.py", line 325, in begin
version, status, reason = self._read_status()
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/http/client.py", line 286, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/socket.py", line 706, in readinto
return self._sock.recv_into(b)
^^^^^^^^^^^^^^^^^^^^^^^
TimeoutError: timed out
Exited with code exit status 1
On Thursday, we increased the waitfor timeout from the default 10s to 20s, but it's still failing in the same spot.
This issue covers figuring out why it's failing and fixing the issue.
Assignee | ||
Comment 1•15 days ago
|
||
One theory is that localstack hasn't finished starting up even after 20 seconds which would be weird, but maybe due to running out of resources or something. We could set it up so that the localstack startup logs are viewable and then look at that.
Another theory is that we're hitting a bug in waitfor where it's hitting a different timeout than the one we're setting to 20s.
Instead of tinkering with all that, we could ditch all the urlwait and waitfor stuff for what we did with Tecken where we moved the waiting/healthcheck bits to docker compose using depends_on
.
In the meantime, "restart from start" seems to eventually work though it's a bit of a nuisance.
Assignee | ||
Comment 2•11 days ago
|
||
Grabbing this one because it's slowing down PRs.
Assignee | ||
Comment 3•11 days ago
|
||
Assignee | ||
Comment 4•11 days ago
|
||
Assignee | ||
Comment 5•10 days ago
|
||
That fix has passed 3 CI runs in a row, so that's promising. I'll keep tabs on it.
Assignee | ||
Comment 6•10 days ago
|
||
I saw a whole bunch of CI runs pass, so I think we're good now.
Assignee | ||
Comment 7•8 days ago
|
||
CI has done fine since I landed the fix. Marking as FIXED.
Description
•