`BatchStatement` occasionally gets data out of sync - c#

"Cassandra: The Definitive Guide, 2nd Edition" says:
Cassandra’s batches are a good fit for use cases such as making
multiple updates to a single partition, or keeping multiple tables in
sync. A good example is making modifications to denormalized tables
that store the same data for different access patterns.
The last sentence above applies to the following attempt, where all the Save... statements are prepared inserts into different tables:
var bLogged = new BatchStatement();
var now = DateTimeOffset.UtcNow;
var uuidNow = TimeUuid.NewId(now);
bLogged.Add(SaveMods.Bind(id, uuidNow, data1)); // 1
bLogged.Add(SaveMoreMods.Bind(id, uuidNow, data2)); // 2
bLogged.Add(SaveActivity.Bind(now.ToString("yyyy-MM-dd"), id, now)); // 3
await GetSession().ExecuteAsync(bLogged);
We'll focus on statements 1 and 2 (the 3rd one is just to signify there's one more statement in the batch).
Statement 1 writes to table1, partitioned by id, with uuidNow as a clustering column in descending order.
Statement 2 writes to table2, partitioned by id only, so it holds the tip (latest row) of table1 for the same id.
More often than I'd like, the two tables end up out of sync: table2 does not hold the tip of table1, lagging one or two mods behind within a few milliseconds.
While looking for a resolution I found that most advice on the web is against using batches at all, which prompted the following solution; it eliminated all mismatches:
await Task.WhenAll(
    GetSession().ExecuteAsync(SaveMods.Bind(id, uuidNow, data1)),
    GetSession().ExecuteAsync(SaveMoreMods.Bind(id, uuidNow, data2)),
    GetSession().ExecuteAsync(SaveActivity.Bind(now.ToString("yyyy-MM-dd"), id, now))
);
The question is: what are batches good for, just the first use case in the quote? And in that case, how do I ensure that modifications to different tables stay in sync?

Using a higher consistency level (i.e. QUORUM) on reads and writes may help, but there is always a possibility of inconsistencies between the tables/partitions.
Batch statements try to ensure that all the mutations in the batch eventually happen, or none of them do. They do not guarantee that all the mutations occur in an instant (there is no isolation: you can do a read where the first mutation has been applied but the others haven't). Batch statements also do not provide a consistent view of all the data across all the nodes. For linearizable consistency you should consider using Paxos (lightweight transactions) for conditional updates, and try to limit anything that requires linearizability to a single partition.
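As an illustration only, here is a minimal sketch of those two mitigations with the DataStax C# driver; the table2 column names (mod_id, data) and the comparison in the IF clause are assumptions, not the asker's actual schema:

// Raise the consistency level of the batch (QUORUM reads/writes narrow, but do not
// eliminate, the window in which the tables can be observed out of sync).
var batch = new BatchStatement()
    .Add(SaveMods.Bind(id, uuidNow, data1))
    .Add(SaveMoreMods.Bind(id, uuidNow, data2));
batch.SetConsistencyLevel(ConsistencyLevel.Quorum);
await GetSession().ExecuteAsync(batch);

// Alternatively, guard the single-partition "tip" table with a lightweight transaction
// (Paxos) so an older mod can never overwrite a newer one. Hypothetical columns below.
var saveTipLwt = GetSession().Prepare(
    "UPDATE table2 SET mod_id = ?, data = ? WHERE id = ? IF mod_id < ?");
var stmt = saveTipLwt.Bind(uuidNow, data2, id, uuidNow)
                     .SetSerialConsistencyLevel(ConsistencyLevel.Serial);
var rs = await GetSession().ExecuteAsync(stmt);
bool applied = rs.First().GetValue<bool>("[applied]"); // false => a newer mod was already there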

Ideas on incorrect ORDER BY results

I want to emphasize that I'm looking for ideas, not necessarily a concrete answer since it's difficult to show what my queries look like, but I don't believe that's needed.
The process looks like this:
Table A keeps filling up, like a bucket - an SQL job keeps calling SP_Proc1 every minute or less and it inserts multiple records into table A.
At the same time a C# process keeps calling another procedure SP_Proc2 every minute or less that does an ordered TOP 5 select from table A and returns the results to the C# method. After C# code finishes processing the results it deletes the selected 5 records from table A.
The problematic part is this: it is necessary that the records from table A be processed 5 at a time in the order specified, but a few times a month SP_Proc2 selects the ordered TOP 5 records in the wrong order, even though all the records are present in table A and have correct values in the columns used for ordering.
Something to note:
I'm ordering by integers, not varchar.
The C# part is using 1 thread.
Both SP_Proc1 and SP_Proc2 use a transaction and the READ COMMITTED or READ COMMITTED SNAPSHOT transaction isolation level.
One column that is used for ordering is a computed value, but a very simple one. It just checks if another column in table A is not null and sets the computed column to either 1 or 0.
There's a unique nonclustered index on primary key Id and a clustered index composed of the same columns used for ordering in SP_Proc2.
I'm using SQL Server 2012 (v11.0.3000)
I'm beginning to think that this might be a SQL Server bug, or that the records or index in table A get corrupted and are then deleted by the C# process, which is why I can't catch it.
Edit:
To clarify: SP_Proc1 commits a big batch of N records to table A at once, and SP_Proc2 pulls records from table A in batches of 5; it orders the records in the table and selects the TOP 5. Sometimes the wrong batch is selected: the batch itself is ordered correctly, but a different batch should have been selected according to the ORDER BY. I believe Rob Farley might have the right idea.
My guess is that your “out of order TOP 5” is ordered, but that a later five overlaps. Like, one time you get 1231, 1232, 1233, 1234, and 1236, and the next batch is 1235, 1237, and so on.
This can be an issue with locking and blocking. You’ve indicated your processes use transactions, so it wouldn’t surprise me if your 1235 hasn’t been committed yet, but can just be ignored by your snapshot isolation, and your 1236 can get picked up.
It doesn’t sound like there’s a bug here. What I’m describing above is a definite feature of snapshot isolation. If you must have 1235 picked up in an earlier batch than 1236, then don’t use snapshot isolation, and force your table to be locked until each block of inserts is finished.
An alternative suggestion would be to use a table lock (tablock) for the reading and writing procedures.
Though this is expensive, if you desire absolute consistency then this may be the way to go.
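As an illustration only (the table and column names below are assumptions, not the asker's schema), the reading side could combine a serializable transaction with an exclusive table lock so that a half-committed batch of inserts can neither be skipped nor interleaved:

using System.Data;
using System.Data.SqlClient;

// Sketch: hold an exclusive table lock while reading the next batch of 5,
// so rows inserted by SP_Proc1 are either all visible or not visible at all.
using (var conn = new SqlConnection(connectionString))
{
    conn.Open();
    using (var tx = conn.BeginTransaction(IsolationLevel.Serializable))
    {
        var cmd = new SqlCommand(
            @"SELECT TOP (5) Id, Payload
              FROM dbo.TableA WITH (TABLOCKX)     -- exclusive lock until commit
              ORDER BY ComputedFlag DESC, Id;", conn, tx);

        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                // process the row ...
            }
        }

        // delete the five processed rows here, inside the same transaction
        tx.Commit();
    }
}

This trades concurrency for consistency, as the answer above notes.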

DotNet Core C# Concurrency Entity Framework (Duplicate Key Value unique Constraints)

TL;DR: several concurrent Tasks are trying to place identical records into a database; essentially, several tasks are being spun up, each opening files that could be identical.
It is vital to save all the information, in a heavily nested table, keyed by IP address. Here is what I have tried so far over the last 4 days of work (even during Christmas!):
Tried to use a Transaction within a do while() loop (with context.Rollback()). [Didn't work!]
Tried to put random sleeps within each of the inserts to stop the race condition. [Didn't work!]
Made the code no longer async. [Didn't work!]
The current algorithm doesn't work and pegs the CPU! [Doesn't work!]
Separately added EACH object to the table individually. [Didn't work!]
Each of the objects increments during insert, which is why this doesn't make sense. I am at a loss for words.
Object Relationships
IP has many Incidents;
I think you might have a problem in these lines:
Vendor vendorInstancer = new Vendor();
vendorInstance.IncidentID = IncidentId;
context.Vendors.Add(vendorInstancer);
Note the variable names: you create vendorInstancer but set the ID on vendorInstance, i.e. not the entity you're saving to the database. That one-letter difference is hard to spot.
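For clarity, a minimal corrected version (using the Vendor type, IncidentID property and context from the question) keeps a single variable throughout:

// The fix is simply to create, populate and add the same instance.
var vendorInstance = new Vendor();
vendorInstance.IncidentID = IncidentId;
context.Vendors.Add(vendorInstance);   // the entity with the ID set is the one saved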

MongoDB - Inserting the result of a query in one round-trip

Consider this hypothetical snippet:
using (mongo.RequestStart(db))
{
    var collection = db.GetCollection<BsonDocument>("test");
    var insertDoc = new BsonDocument { { "currentCount", collection.Count() } };
    WriteConcernResult wcr = collection.Insert(insertDoc);
}
It inserts a new document with "currentCount" set to the value returned by collection.Count().
This implies two round-trips to the server. One to calculate collection.Count() and one to perform the insert. Is there a way to do this in one round-trip?
In other words, can the value assigned to "currentCount" be calculated on the server at the time of the insert?
Thanks!
There is no way to do this currently (Mongo 2.4).
The upcoming 2.6 version should have batch operations support but I don't know if it will support batching operations of different types and using the results of one operation from another operation.
What you can do, however, is execute this logic on the server by expressing it in JavaScript and using eval:
collection.Database.Eval(new BsonJavaScript(@"
    var count = db.test.count();
    db.test.insert({ currentCount: count });
"));
But this is not recommended, for several reasons: you lose the write concern, it is very unsafe in terms of security, it requires admin permissions, it holds a global write lock, and it won't work on sharded clusters :)
I think your best route at the moment would be to do this in two queries.
If you're looking for atomic updates or counters (which don't exactly match your example but seem somewhat related), take a look at findAndModify and the $inc operator of update.
If you've got a large collection and you're looking to save CPU, the recommended approach is to create another collection, called counters, with one document per collection you want to count, and to increment the corresponding counter document each time you insert a document into your collection.
See the guidance here.
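A minimal sketch of that counters pattern with the legacy 1.x C# driver used in the question (the counters collection, the "_id" value and the count field are assumptions):

// Assumes db and collection from the question's snippet, plus MongoDB.Driver.Builders.
var counters = db.GetCollection<BsonDocument>("counters");

// Atomically increment the counter for "test" on the server ($inc), no read-modify-write race.
var result = counters.FindAndModify(
    Query.EQ("_id", "test"),    // which counter document
    SortBy.Null,
    Update.Inc("count", 1),
    true);                      // return the updated document

long newCount = result.ModifiedDocument["count"].ToInt64();
collection.Insert(new BsonDocument { { "currentCount", newCount } });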
It appears that you can place a JavaScript function inside your query, so perhaps it can be done in one trip, but I haven't implemented this in my own app, so I can't confirm that.

EF and Linq with self referential table

I have a self-referential table in my database that looks sort of like above. Basically its setup in such a way that each row has a unique ID (identity PK) and a DependentID to indicate any other record in the set that it is dependent on. It is very similar to the parent-child type examples you often see in SQL textbooks but my case is subtly unique in the sense that a given record can also be dependent upon itself (see row 1 above)
Two questions:
Can EF be made to represent this relationship properly? I've read several posts on here suggesting that it does not deal with this scenario gracefully, so my initial thought was that it might not even be worth it; I might be better off treating it as a normal table and writing the business logic to ensure the data gets inserted/updated correctly. In my scenario I won't ever really be querying these entities through EF; the app will basically load them all at startup and then I'll run LINQ queries against them at runtime to filter as needed.
Assuming I cannot get it to work with EF and, as I note in #1, I simply load them all into memory at startup (there are only going to be 50-100 or so), what would be the most efficient way to join on this via LINQ? I would want to be able to pass in a DependentId and get all the records associated with it, along with their properties... so in this example I'd want to pass in '1' and get back:
1 - John - 10
2 - Mike - 25
3 - Bob - 5
thanks for the help
Indeed, the Entity Framework cannot represent such a relationship, certainly not in a recursively queryable form.
But you are not asking for recursive queries, so you could treat DependentId as just another data column. Doing that, it would be trivial to build and execute your question-two query against the database.
UPDATE:
That query would look something like
int dependentIdToSearch = 1;

var q = from something in db.mytable
        where something.DependentId == dependentIdToSearch
        select new { something.Id, something.Name, something.Value };
END UPDATE
If you do need recursive queries (all direct and indirect dependencies of a record), you need a table-valued function with a common table expression. The Entity Framework cannot deal with that either, at least not in the current version. If you need this support, you can wait for EF 5 or use LINQ to SQL (which has supported table-valued functions since its first version, years ago).
You can indeed also read the entire table in memory, provided that it is read-only, or that there is only "one memory" (single server, not load-balanced or client app with local database).
If it's read-only, you have the option to build an object graph once at load time, enabling efficient execution later. For example, you could give each object a collection of the objects that are dependent on it. Your query then becomes a trivial iteration over that collection.
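A minimal in-memory sketch of that idea; the Node class, its property names and the allNodes list are assumptions based on the sample rows, not the asker's actual model:

// Hypothetical shape of the rows loaded at startup.
public class Node
{
    public int Id { get; set; }
    public string Name { get; set; }
    public int Value { get; set; }
    public int DependentId { get; set; }
    public List<Node> Dependents { get; } = new List<Node>();
}

// Build the graph once after loading all 50-100 rows into allNodes.
var byId = allNodes.ToDictionary(n => n.Id);
foreach (var node in allNodes)
{
    // A row that depends on itself (like row 1) simply appears in its own list.
    byId[node.DependentId].Dependents.Add(node);
}

// Later, "all records associated with DependentId 1" is just:
var dependentsOfOne = byId[1].Dependents;   // John, Mike and Bob in the example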

How to speed up LINQ inserts with SQL CE?

History
I have a list of "records" (3,500) which I save to XML and compress on exit of the program. Since:
the number of the records increases
only around 50 records need to be updated on exit
saving takes about 3 seconds
I needed another solution -- embedded database. I chose SQL CE because it works with VS without any problems and the license is OK for me (I compared it to Firebird, SQLite, EffiProz, db4o and BerkeleyDB).
The data
The record structure: 11 fields, 2 of which make up the primary key (nvarchar + byte). The other fields are bytes, datetimes, doubles and ints.
I don't use any relations, joins, indices (except for the primary key), triggers, views, and so on. It is actually a flat Dictionary -- pairs of Key+Value. I modify some of them, and then I have to update them in the database. From time to time I add some new "records" and I need to store (insert) them. That's all.
LINQ approach
I have a blank database (file), so I make 3,500 inserts in a loop (one by one). I don't even check whether a record already exists, because the db is blank.
Execution time? 4 minutes, 52 seconds. I fainted (mind you: XML + compress = 3 seconds).
SQL CE raw approach
I googled a bit and, despite claims such as this one:
LINQ to SQL (CE) speed versus SqlCe
stating that SQL CE itself is at fault, I gave it a try.
The same loop but this time inserts are made with SqlCeResultSet (DirectTable mode, see: Bulk Insert In SQL Server CE) and SqlCeUpdatableRecord.
The outcome? Are you sitting comfortably? Well... 0.3 seconds (yes, a fraction of a second!).
The problem
LINQ is very readable; raw operations are quite the opposite. I could write a mapper which translates all column indexes to meaningful names, but that seems like reinventing the wheel -- after all, it is already done in... LINQ.
So maybe there is a way to tell LINQ to speed things up? QUESTION -- how do I do it?
The code
LINQ
foreach (var entry in dict.Entries.Where(it => it.AlteredByLearning))
{
    PrimLibrary.Database.Progress record = null;
    record = new PrimLibrary.Database.Progress();
    record.Text = entry.Text;
    record.Direction = (byte)entry.dir;
    db.Progress.InsertOnSubmit(record);
    record.Status = (byte)entry.LastLearningInfo.status.Value;
    // ... and so on
    db.SubmitChanges();
}
Raw operations
SqlCeCommand cmd = conn.CreateCommand();
cmd.CommandText = "Progress";
cmd.CommandType = System.Data.CommandType.TableDirect;
SqlCeResultSet rs = cmd.ExecuteResultSet(ResultSetOptions.Updatable);

foreach (var entry in dict.Entries.Where(it => it.AlteredByLearning))
{
    SqlCeUpdatableRecord record = null;
    record = rs.CreateRecord();
    int col = 0;
    record.SetString(col++, entry.Text);
    record.SetByte(col++, (byte)entry.dir);
    record.SetByte(col++, (byte)entry.LastLearningInfo.status.Value);
    // ... and so on
    rs.Insert(record);
}
Do more work per transaction.
Commits are generally very expensive operations for a typical relational database, as the database must wait for disk flushes to ensure data is not lost (ACID guarantees and all that). Conventional HDD disk IO without specialty controllers is very slow for this sort of operation: the data must be flushed to the physical disk -- perhaps only 30-60 commits can occur per second with an IO sync in between!
See the SQLite FAQ: INSERT is really slow - I can only do a few dozen INSERTs per second. Ignoring the different database engine, this is the exact same issue.
Normally, LINQ2SQL creates a new implicit transaction inside SubmitChanges. To avoid this implicit transaction/commit (commits are expensive operations) either:
Call SubmitChanges less often (say, once outside the loop), or
Setup an explicit transaction scope (see TransactionScope).
One example of using a larger transaction context is:
using (var ts = new TransactionScope())
{
    // LINQ2SQL will automatically enlist in the transaction scope.
    // SubmitChanges now will NOT create a new transaction/commit each time.
    DoImportStuffThatRunsWithinASingleTransaction();

    // Important: make sure to COMMIT the transaction.
    // (The transaction used for SubmitChanges is committed to the DB.)
    // This is when the disk sync actually has to happen,
    // but it only happens once, not 3500 times!
    ts.Complete();
}
However, the semantics of an approach using a single transaction or a single call to SubmitChanges are different than that of the code above calling SubmitChanges 3500 times and creating 3500 different implicit transactions. In particular, the size of the atomic operations (with respect to the database) is different and may not be suitable for all tasks.
For LINQ2SQL updates, changing the optimistic concurrency model (disabling it or just using a timestamp field, for instance) may result in small performance improvements. The biggest improvement, however, will come from reducing the number of commits that must be performed.
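For instance, as a sketch only (whether it helps depends on the actual mapping), a LINQ to SQL column mapping can opt out of the optimistic-concurrency comparison entirely; the column name and property below are hypothetical:

using System.Data.Linq.Mapping;

// Hypothetical entity property: UpdateCheck.Never stops LINQ2SQL from comparing this
// column's original value in the WHERE clause of generated UPDATE statements.
[Column(Name = "Text", UpdateCheck = UpdateCheck.Never)]
public string Text { get; set; }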
Happy coding.
I'm not positive on this, but it seems like the db.SubmitChanges() call should be made outside of the loop. Maybe that would speed things up?
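Roughly like this, reusing the loop from the question: queue all the inserts first, then submit once so only a single transaction/commit occurs.

foreach (var entry in dict.Entries.Where(it => it.AlteredByLearning))
{
    var record = new PrimLibrary.Database.Progress();
    record.Text = entry.Text;
    record.Direction = (byte)entry.dir;
    record.Status = (byte)entry.LastLearningInfo.status.Value;
    // ... and so on
    db.Progress.InsertOnSubmit(record);
}
db.SubmitChanges();   // one commit for all pending inserts instead of one per record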
