Open Bug 1108821 Opened 10 years ago Updated 9 years ago

Travis selenium tests are failing pretty consistently for PostgreSQL for all branches

Categories

(Bugzilla :: QA Test Scripts, defect)

defect
Not set
blocker

People

(Reporter: dkl, Unassigned)

Attachments

(1 file)

Recent example: 
https://travis-ci.org/bugzilla/bugzilla/jobs/43374630

Will need to set up Pg locally and try to reproduce. The Apache error log is filled with lines like the following:

[Thu Nov 20 16:24:00 2014] buglist.cgi: 	(in cleanup) Insecure dependency in eval while running with -T switch at /home/travis/perl5/perlbrew/perls/5.10/lib/5.10.1/CGI.pm line 932 during global destruction.

which may give a clue as to what is happening.

dkl
Flags: blocking5.0+
Also found this error in the Apache log:

[Mon Dec 8 21:41:11 2014] process_bug.cgi: DBD::Pg::db do failed: ERROR: could not serialize access due to concurrent update [for Statement "UPDATE bug_user_last_visit SET last_visit_ts = ? WHERE id = ?"] at Bugzilla/Object.pm line 531.

[Mon Dec 8 21:41:11 2014] process_bug.cgi: Bugzilla::Object::update('Bugzilla::BugUserLastVisit=HASH(0x4d4f118)') called at Bugzilla/Bug.pm line 4264

[Mon Dec 8 21:41:11 2014] process_bug.cgi: Bugzilla::Bug::update_user_last_visit('Bugzilla::Bug=HASH(0x49e97d0)', 'Bugzilla::User=HASH(0x4715230)', '2014-12-08 21:41:12') called at Bugzilla/Bug.pm line 1136

[Mon Dec 8 21:41:11 2014] process_bug.cgi: Bugzilla::Bug::update('Bugzilla::Bug=HASH(0x49e97d0)') called at /home/travis/build/bugzilla/bugzilla/process_bug.cgi line 373
http://www.postgresql.org/docs/9.1/static/transaction-iso.html#XACT-REPEATABLE-READ has some information about this error.

Should we change ISOLATION_LEVEL to 'SERIALIZABLE' for DB/Pg.pm?

dkl
(In reply to David Lawrence [:dkl] from comment #2)
> Should we change ISOLATION_LEVEL to 'SERIALIZABLE' for DB/Pg.pm?

How is this supposed to help? SERIALIZABLE can get similar errors too. (I have to admit that the Pg documentation is not ultra clear to me about differences between transaction levels.)
(In reply to Frédéric Buclin from comment #3)
> How is this supposed to help? SERIALIZABLE can get similar errors too. (I
> have to admit that the Pg documentation is not ultra clear to me about
> differences between transaction levels.)

I was going by the documentation at http://www.postgresql.org/docs/9.1/interactive/transaction-iso.html#XACT-SERIALIZABLE which states that SERIALIZABLE is the better method but requires a minimum of Pg 9.1.

The weird thing is that after installing Pg locally, I am not able to get it to fail in this way, so unfortunately it may be related to the Travis environment. I created a Docker image for doing Pg testing if anyone else wants to try it.

https://github.com/dklawren/docker-bugzilla/tree/pgsql

Also this test case has passed recently on Travis which makes it even more strange.

https://travis-ci.org/bugzilla/bugzilla/builds/44470839

LpSolit, if you can run the QA suite yourself successfully on Pg then I feel we may be able to close this and move on. I really do want to find out why it is not consistently working in the Travis environment but do not feel we should hold up 5.0 any longer if it works for you.

dkl
Flags: needinfo?(LpSolit)
(In reply to David Lawrence [:dkl] from comment #4)
> LpSolit, if you can run the qa suite yourself successfully on Pg then I feel
> we may be able to close this and move on. I really do want to find out why
> it is not consistently working in the Travis environment but do not feel we
> should hold up 5.0 any longer if it works for you.

if we take this approach, these failing tests need to be removed/disabled from travis.  it may also be worth documenting in the relnotes that pg users are outside of our ci test coverage.
Removed from 5.0 testing temporarily and no longer a blocker for RC1. This is passing locally and should not be considered a blocker for now. I am leaving it enabled for master as we still want to find out how to make this work properly in Travis for the future.

To ssh://gitolite3@git.mozilla.org/bugzilla/bugzilla.git
   9d4d946..5d0b206  5.0 -> 5.0

dkl
Flags: needinfo?(LpSolit)
Flags: blocking5.0+
Keywords: relnote
Isn't this bug wontfix now, as you plan to leave Travis CI in favor of TaskCluster?
(In reply to Frédéric Buclin from comment #7)
> Isn't this bug wontfix now, as you plan to leave Travis CI in favor of
> TaskCluster?

Well Pg selenium is still failing for master on TaskCluster as well so there are still issues to be resolved. Hopefully the failures won't be because of the Travis CI environment. 

Let's leave this open until I see the Pg tests passing consistently in the near future.

dkl
We're looking at migrating from MySQL to postgres and are seeing similar serialization errors as well.

It's a side effect of using the REPEATABLE READ (bugzilla's default isolation level) and SERIALIZABLE (even stricter) transaction isolation levels.

Serialization errors are unavoidable in postgres when using these isolation modes.  To quote section 13.2.2 of the Postgres manual, which covers REPEATABLE READ transactions:

> Applications using this level must be prepared to retry transactions due to serialization failures.

http://www.postgresql.org/docs/9.1/interactive/transaction-iso.html

Essentially, if two transactions (A and B) are running concurrently, and transaction A changes the underlying data transaction B is using, and transaction A commits first, transaction B will be killed.

It's easily reproducible if you open two psql clients and execute transactions in the right order.
Locking (e.g., SELECT ... FOR UPDATE) will not prevent this problem.
See the attachment for examples of how to trigger this.
The finest level of locking that postgres can handle is a row, so any concurrent updates to a row using REPEATABLE READ or SERIALIZABLE will result in this problem.
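
To make the sequence above concrete, here is a toy Python model of the "first committer wins" behaviour. It is only a sketch and not how postgres is implemented: `ToyDatabase`, `ToyTransaction` and `SerializationFailure` are hypothetical names made up for illustration, and the model ignores the fact that a real second writer would first wait on the row lock until the first writer commits or rolls back.

```python
class SerializationFailure(Exception):
    """Stand-in for 'could not serialize access due to concurrent update'."""


class ToyDatabase:
    """Toy store: each row keeps its value and the commit number that last wrote it."""

    def __init__(self):
        self.commit_counter = 0
        self.rows = {}  # row id -> (value, commit number of last writer)

    def begin(self):
        return ToyTransaction(self)


class ToyTransaction:
    def __init__(self, db):
        self.db = db
        # The snapshot is fixed when the transaction starts (simplified).
        self.snapshot = db.commit_counter
        self.pending = {}

    def update(self, row_id, value):
        _, last_commit = self.db.rows.get(row_id, (None, 0))
        if last_commit > self.snapshot:
            # Another transaction committed a change to this row after our snapshot.
            raise SerializationFailure(f"concurrent update of row {row_id}")
        self.pending[row_id] = value

    def commit(self):
        self.db.commit_counter += 1
        for row_id, value in self.pending.items():
            self.db.rows[row_id] = (value, self.db.commit_counter)


db = ToyDatabase()
setup = db.begin()
setup.update(1, "2014-12-08 21:41:10")
setup.commit()

a = db.begin()  # transaction A
b = db.begin()  # transaction B, concurrent with A
a.update(1, "2014-12-08 21:41:11")
a.commit()      # A commits first

try:
    b.update(1, "2014-12-08 21:41:12")
    outcome = "B updated the row"
except SerializationFailure as exc:
    outcome = f"B failed: {exc}"
print(outcome)
```

In this model B can only succeed by starting over with a fresh snapshot, which is the same conclusion the manual reaches.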

I'm currently pondering how to solve this problem in Bugzilla.  Ultimately, the only way to do it is to retry the transaction.  Either throw the user a bug collision page and ask them to try again, or some other method.  This also applies to the RPC interface.
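
As a rough idea of what "retry the transaction" could look like, here is a minimal Python sketch (Bugzilla itself is Perl, so this is illustration only, not a patch). The assumptions: the database driver raises an exception carrying the SQLSTATE in a psycopg2-style `pgcode` attribute, and 40001 is the postgres SQLSTATE for serialization_failure. `run_with_retry`, `FakeSerializationError` and `update_last_visit` are hypothetical names.

```python
import time

SERIALIZATION_FAILURE = "40001"  # postgres SQLSTATE for serialization_failure


def run_with_retry(transaction, max_attempts=3, delay=0.0):
    """Call transaction() until it succeeds or max_attempts is reached.

    Only errors whose pgcode attribute is 40001 are retried; anything else,
    and the final failed attempt, propagate to the caller.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction()
        except Exception as exc:
            if getattr(exc, "pgcode", None) != SERIALIZATION_FAILURE:
                raise
            if attempt == max_attempts:
                raise
            time.sleep(delay * attempt)  # simple linear backoff


class FakeSerializationError(Exception):
    """Hypothetical stand-in for the driver's error, used only for this demo."""
    pgcode = SERIALIZATION_FAILURE


calls = []


def update_last_visit():
    # Pretend the first two attempts collide with a concurrent update.
    calls.append(1)
    if len(calls) < 3:
        raise FakeSerializationError(
            "could not serialize access due to concurrent update")
    return "updated"


result = run_with_retry(update_last_visit)
print(result, "after", len(calls), "attempts")
```

The callable has to redo the whole transaction (begin, statements, commit), with a rollback of the failed attempt before the next one; retrying only the failed statement would reuse the stale snapshot.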

Also, see bug https://bugzilla.mozilla.org/show_bug.cgi?id=514778 for a similar report of this problem.
Still, these errors do not make sense on Travis, because we don't run concurrent tests. A test cannot conflict with itself. So something else must conflict with our tests on Travis, i.e. something is interacting with our own DB?? If the DB error were always the same, we would know where to look. Is that the case? Is it always the same DB error at the same place?
(In reply to Frédéric Buclin from comment #10)
> Still, these errors do not make sense on Travis, because we don't run
> concurrent tests.

viewing a bug that you're involved in kicks off a background request that updates the bug_user_last_visit table.  it's possible this is happening concurrently with other requests, especially with the high frequency of requests a testsuite would trigger.
(In reply to Byron Jones ‹:glob› from comment #11)
> viewing a bug that you're involved in kicks off a background request that
> updates the bug_user_last_visit table.  it's possible this is happening
> concurrently with other requests, especially with the high frequency of
> requests a testsuite would trigger.

Ah, this helps to know where to look. :)
See Also: → 1141426
Keywords: relnote