Closed Bug 627552 Opened 10 years ago Closed 10 years ago

Python server at stage-auth doesn't actually sync

Categories

(Cloud Services :: Server: Sync, defect)

Type: defect
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: tracy, Assigned: petef)

References

Details

(Keywords: regression)

Attachments

(2 files)

Attached file client A log
Seen while testing beta9 on Mac and Win 7.

1) Created a new account and added a second client via PAKE.
2) Added some history, tabs, bookmarks, a form entry and a password to client A, then synced and allowed the sync to complete.
3) Synced client B.

Actual results: the logs look OK, but the data added on client A doesn't appear on client B.

Expected results: data is synced across clients.
Attached file client B log
Assignee: nobody → tarek
This is against stage-auth.  I'm blocked from further testing 'til this is sorted out.
Relevant line from the logs:
2011-01-20 16:28:04	Service.Main         INFO	Testing info/collections: {"tabs":1295562039.733859,"clients":null,"crypto":null,"bookmarks":null,"prefs":null,"history":null}

This indicates that the client is only getting a timestamp for tabs; every other collection stored in the DB comes back null.

After investigation, it turns out that part of the DB was corrupted. Richard fixed the DB and the timestamps are now back.

One question remains though: should the client stop with an error if some timestamps are null during sync?

While this error is due to a corrupted DB, sync could be stopped in that case. Or maybe the server could return a 503 if tabs has a timestamp but the other collections have none (which is an impossible state, AFAIK).
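
To make the question concrete, here is a rough sketch of the kind of check being discussed. It is not taken from the Sync client or the server; the function names are made up for illustration, and the only real data is the info/collections payload copied from the log line above.

```python
import json


def null_collections(info):
    """Return the sorted names of collections whose timestamp is null."""
    return sorted(name for name, ts in info.items() if ts is None)


def looks_inconsistent(info):
    """True when some collections have a timestamp and others are null."""
    values = list(info.values())
    has_null = any(ts is None for ts in values)
    has_timestamp = any(ts is not None for ts in values)
    return has_null and has_timestamp


# Payload copied from the client log line quoted in this bug.
payload = ('{"tabs":1295562039.733859,"clients":null,"crypto":null,'
           '"bookmarks":null,"prefs":null,"history":null}')
info = json.loads(payload)

if looks_inconsistent(info):
    # A client could abort the sync here; a server could answer 503 instead.
    print("null timestamps for: " + ", ".join(null_collections(info)))
```

Whether the check belongs in the client, the server, or both is exactly the open question; the sketch only shows that the bad state is cheap to detect.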
Tracy: I am going to sync several clients this morning and check that everything works fine now
Tracy: Repaired all InnoDB and MyISAM tables on weave-stage-db01 in all databases. Smoke test should work again. Could you confirm during your next smoke test pass, whenever that is?
Sync is working on stage-auth. I'll resolve this bug as fixed. Tarek, can you file a bug on how to handle the timestamp issue?
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
A bug was filed already for this: bug 627671
Sorry, the last comment pointed to an unrelated bug.

I am reopening the bug because Tracy had the issue again.
https://stage-node02.services.mozilla.com/1.0/7ohn6wlh6dz4otolncc4dkykm6zpzlwu is where I am currently seeing this.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Richard: it seems that another DB is corrupted. 

How could we prevent or detect DB corruption?
Can we tell from the logs whether the mysql servers are getting restarted often ("restarted mysql" in the error log)? This seems to be one possible cause of corrupted DBs.

If that is not the case, can we activate the general logger, if it's not already active, and try to find when the DB gets corrupted so we can build a test case to reproduce the issue?
Staging database is not being restarted.  What you're seeing in the logs is evidence of mysql worker threads crashing.  A clearer description of what you're seeing in the error log would be "disconnected from mysql server, reconnecting".

The general logger is active, but truncating after 6 hours (bug open to extend this time). Based on recent experience there are helpful stack traces any time the python code loses its connection to mysql, so we may be able to tell which specific query was affected.
Stage still seems out of order

I get a 500 here:  https://stage-auth.services.mozilla.com/user/1.0/k7ndfwezspuuuiwlvqzyrwuz6yzwwq6i/node/weave

What's the current status of stage? Do we use PHP for reg/sreg or not?
Do we have enough nodes provisioned in the available_nodes table?
Blocks: 609676
I'm waiting on this to make another complete testing pass against the python server.
Found two problems:

* available_nodes was 0; upped to 100 for each active node (and adding monitoring for this)
* gunicorn-syncstorage was not running

I can sign up a new user in stage & sync successfully.
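
As a rough idea of what the monitoring mentioned above could check, here is a minimal sketch. It assumes the per-node counts have already been read from the available_nodes table (its schema is not shown in this bug); the function name, threshold and node names are hypothetical.

```python
def nodes_running_low(available, threshold=10):
    """Return the sorted node names whose count is at or below threshold.

    `available` maps node name -> remaining available slots. In practice
    it would be read from the available_nodes table.
    """
    return sorted(node for node, free in available.items() if free <= threshold)


# Hypothetical snapshot: node names and counts are made up for illustration.
snapshot = {"node-a": 0, "node-b": 100}
for node in nodes_running_low(snapshot):
    print("ALERT: %s has %d available slots" % (node, snapshot[node]))
```

Alerting a little above zero rather than at zero would give someone time to bump the count before new signups start failing.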
Assignee: tarek → petef
Status: REOPENED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
As a result of this work, is Python reg/sreg up and running in staging, or is it PHP?
We need PHP in stage, so that we have an environment similar to what is going to be launched first.

reg/sreg in Python will be introduced in production in a second phase, to minimize the overall risk.
Pete, I just set up a new account against http://stage-auth.services.mozilla.com.  It still won't sync. Seeing the same in logs as initially reported.

2011-02-23 09:24:38	Net.Resource         DEBUG	GET fail 500 https://stage-auth.services.mozilla.com/user/1.0/p4stuhhw7gvybs3b37fsr5nh2ck7z5fy/node/weave
2011-02-23 09:24:38	Service.Main         DEBUG	Exception: Unexpected response code: 500 No traceback available
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
What's installed in stage ? 

I am under the impression that nothing has changed yet (see 632816) -- if so, doing QA on it for now is a waste of time.

We need:

- Python sync server 
- PHP reg / sreg server
(In reply to comment #19)
> It still won't sync. Seeing the same in logs as initially reported.

available_nodes dropped back to 0 again on adm1. I'll re-bump to a couple thousand I guess.


Tarek, I'm working on the other bug and getting php reg/sreg deployed in stage.
(In reply to comment #21)
> (In reply to comment #19)
> > It still won't sync. Seeing the same in logs as initially reported.
> 
> available_nodes dropped back to 0 again on adm1. I'll re-bump to a couple
> thousand I guess.

Do the admin scripts run in stage? I am thinking of the one that cleans up that table and increments the available nodes by the number of accounts deleted each day.

I am asking because Hudson generates several hundred users per day (and deletes them), so you will hit the problem again.
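
A back-of-the-envelope sketch of why the count keeps dropping to 0 when that script does not run. The numbers are assumptions: 2000 stands in for "a couple thousand" and 300 for "several hundred" users per day, and the function is purely illustrative, not the admin script itself.

```python
def simulate_days(start, created_per_day, deleted_per_day, days, replenish):
    """Return the available-node count at the end of each day.

    Each signup consumes one slot. When `replenish` is true, the slots
    of accounts deleted that day are credited back, which is what the
    admin script is described as doing.
    """
    available = start
    history = []
    for _ in range(days):
        available = max(available - created_per_day, 0)
        if replenish:
            available += deleted_per_day
        history.append(available)
    return history


# Assumed figures: 2000 slots, 300 test users created and deleted per day.
print(simulate_days(2000, 300, 300, 7, replenish=False))
print(simulate_days(2000, 300, 300, 7, replenish=True))
```

With these assumed figures the counter runs out within a week unless deletions are credited back, so re-bumping by hand only postpones the failure.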


> 
> Tarek, I'm working on the other bug and getting php reg/sreg deployed in stage.

Cool thanks !
(In reply to comment #22)
> Does the admin scripts run in stage ? I am thinking about the one that cleans
> that table to increment the available nodes w/ the daily deleted account
> numbers.
> 
> I am saying this because Hudson generates several hundreds users per day (and
> deletes them) so you will hit the problem again.

AFAICT, no. I'll talk to atoll about that today.

php reg/sreg deployed, and tarek ran a functional test through Hudson and everything passed (we had to temporarily disable captcha to get it to pass, which is expected). Also deployed latest syncstorage from the other bug.
Status: REOPENED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
stage-auth is working now
Status: RESOLVED → VERIFIED