There are two data centers with 3 nodes each. I'm doing two simple inserts (very fast back to back) to the same table with a consistency level of local quorum. The table has one partitioning key and no clustering columns.
Sometimes the first insert wins over the second one: the data produced by the first insert statement is what ends up saved in the database, even though I issue the second insert right after it.
C# Code
var statement = new SimpleStatement("INSERT INTO customer (id, name) VALUES (1, 'foo')");
statement.SetConsistencyLevel(ConsistencyLevel.LocalQuorum);
session.Execute(statement);
Set the timestamp on the client. In most newer drivers this is done automatically to better ensure order is preserved. However, older drivers (or anything pre-Cassandra 2.1) don't support it, and the timestamp needs to go in the query itself. I don't know which driver or version you are using, but you can also put it in the CQL. It is supported at the protocol level, though, so the driver should have a better mechanism.
Something like: var statement = "INSERT INTO customer (id,name) VALUES (1, 'foo') USING TIMESTAMP {microsecond timestamp}";
The best approach is to use a monotonic timestamp so that each call is always higher than the last (i.e. use current milliseconds and add a counter). I don't know C# well enough to tell you how best to approach that. Look at https://docs.datastax.com/en/developer/csharp-driver/3.3/features/query-timestamps/#using-a-timestamp-generator
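A minimal sketch of such a monotonic generator, assuming you only need values that are strictly increasing within one process (class and method names here are illustrative, not from any particular driver):

```csharp
using System;
using System.Threading;

public static class MonotonicTimestamp
{
    private static long _last;

    // Returns a strictly increasing microsecond timestamp.
    public static long NextMicros()
    {
        while (true)
        {
            long last = Interlocked.Read(ref _last);
            // Current wall-clock time in microseconds since the Unix epoch.
            long now = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds() * 1000;
            // If the clock hasn't advanced (or has gone backwards),
            // bump the previous value by one microsecond instead.
            long next = now > last ? now : last + 1;
            if (Interlocked.CompareExchange(ref _last, next, last) == last)
                return next;
        }
    }
}
```

You would then plug the value into the `USING TIMESTAMP` clause (or a driver-level timestamp generator, where supported).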
If you don't have a timestamp set on the mutation, the coordinator will assign one after it parses the query. Since networks and Netty queues can do funny things, ordering is not a sure thing, especially as the writes may end up on different nodes that have some clock drift.
Related
This may be a dumb question, but I wanted to be sure. I am creating a WinForms app, using a C# OleDbConnection to connect to an MS Access database. Right now, I am using "SELECT * FROM table_name" and looping through each row to see if it is the row with the criteria I want, then breaking out of the loop if it is. I wonder if performance would improve if I used something like "SELECT * FROM table_name WHERE id=something" - basically using a "WHERE" clause instead of looping through every row?
The best way to validate the performance of anything is to test. Otherwise, a lot of assumptions are made about what is the best versus the reality of performance.
With that said, 100% of the time using a WHERE clause will be better than retrieving the data and then filtering via a loop. This is for a few different reasons, but ultimately you are filtering the data on the server before retrieving it, versus retrieving every row and then discarding most of them. Relational data should be dealt with according to set logic, which is how a WHERE clause works: against the data set. The loop is not set logic; it compares each individual row, expensively, discarding those that don't meet the criteria.
Don’t take my word for it though. Try it out. Especially try it out when your app has a lot of data in the table.
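For reference, a hedged sketch of pushing the filter into the query with an OleDb parameter instead of looping client-side (the connection string, table, and column names are placeholders for your own):

```csharp
using System.Data.OleDb;

// Let the database do the filtering: only matching rows cross the wire.
using (var conn = new OleDbConnection(connectionString))
using (var cmd = new OleDbCommand(
    "SELECT * FROM table_name WHERE id = ?", conn))
{
    // OleDb parameters are positional; the name is ignored.
    cmd.Parameters.AddWithValue("?", someId);
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            // Only the matching row(s) reach the client here.
        }
    }
}
```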
Yes, of course.
Say you have an Access database file shared in a folder, and you deploy your .NET desktop application to each workstation.
And furthermore, say the table has 1 million rows.
If you do this:
SELECT * from tblInvoice WHERE InvoiceNumber = 123245
Then ONLY one row is pulled down the network pipe - and this holds true EVEN if the table has 1 million rows. To traverse and pull 1 million rows is going to take a HUGE amount of time, but if you add criteria to your select, then it would be in this case about 1 million times faster to pull one row as opposed to the whole table.
And say this is/was multi-user? Then again, even over a network, ONLY the one record that meets your criteria will be pulled. The only requirement for this "one row pull" over the network? The Access data engine needs a usable index on that criteria column. By default the PK column (ID) always has that index, so no worries there. But if, as per above, we are pulling invoice numbers from a table, then an index on that column (InvoiceNumber) is required for the data engine to pull only one row. If no index can be used, then behind the scenes all rows are pulled until a match occurs - and over a network this means significant amounts of data will be pulled without that index (or, if local, pulled from the file on disk).
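If InvoiceNumber isn't already indexed, creating one is a one-liner (table and column names follow the example above; adjust to your schema):

```sql
CREATE INDEX IX_InvoiceNumber ON tblInvoice (InvoiceNumber);
```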
I've developed a .NET Core application for live streaming, which has a lot of functionality. One feature shows our clients how many people were watching in every 5-minute interval.
Right now, I'm saving a log row for each viewer, with ViewerID and Timestamp, in a SQL Server database at 5-minute intervals. It seems to be a bad approach, since in the first couple of days I've reached 100k rows in that table. I need that data, because we have a "Time Peak Chart" that shows how many people (and who) were watching in each 5-minute interval.
Anyway, does anyone have a suggestion for how I can handle this? I was thinking about a .txt file with the same data, but it also seems that I/O on the server could be a problem...
I also thought about a NoSQL database, maybe an existing MongoDB-as-a-service like scalegrid.io or mlab.com.
Can someone help me with this, please? Thanks in advance!
I presume this is related to one of your previous questions Filter SQL GROUP by a filter that is not in GROUP and an expansion of the question in comments 'how to make this better'.
This answer below is definitely not the only way to do this - but I think it's a good start.
As you're using SQL Server for the initial data storage (minute-by-minute) I would suggest continuing to use SQL Server for the next stage of data storage. I think you'd need a compelling argument to use something else for the next stage, as you then need to maintain both of them (e.g., keeping software up-to-date, backups, etc), as well as having all the fun of transferring data properly between the two pieces of software.
My suggested approach is to keep the most detailed/granular data that you need, but no more.
In the previous question, you were keeping data by the minute, then calculating up to the 5-minute bracket. In this answer I'd summarise (and store) the data for the 5-minute brackets then discard your minute-by-minute data once it has been summarised.
For example, you could have a table called 'StreamViewerHistory' that has the Viewer's ID and a timestamp (much like the original table).
This only has 1 row per viewer per 5 minute interval. You could make the timestamp field a smalldatetime (as you don't care about seconds) or even have it as an ID value pointing to another table that references each timeframe. I think smalldatetime is easier to start with.
Depending exactly on how it's used, I would suggest having the Primary Key (or at least the Clustered index) being the timestamp before the ViewerID - this means new rows get added to the end. It also assumes that most queries of data are filtered by timeframes first (e.g., last week's worth of data).
I would consider having an index on ViewerId then the timestamp, for when people want to view an individual's history.
e.g.,
CREATE TABLE [dbo].[StreamViewerHistory](
[TrackDate] smalldatetime NOT NULL,
[StreamViewerID] int NOT NULL,
CONSTRAINT [PK_StreamViewerHistory] PRIMARY KEY CLUSTERED
(
[TrackDate] ASC,
[StreamViewerID] ASC
)
)
GO
CREATE NONCLUSTERED INDEX [IX_StreamViewerHistory_StreamViewerID] ON [dbo].[StreamViewerHistory]
(
[StreamViewerID] ASC,
[TrackDate] ASC
)
GO
Now, on some sort of interval (either as part of your ping process, or a separate process run regularly) interrogate the data in your source table LiveStreamViewerTracks, crunch the data as per the previous question, and save the results in this new table. Then delete the rows from LiveStreamViewerTracks to keep it smaller and usable. Ensure you delete the relevant rows only though (e.g., the ones that have been processed).
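A hedged sketch of that crunch-and-delete step (it assumes the minute-level source table LiveStreamViewerTracks has the same two columns, and that each 5-minute bracket is summarised only once; adjust names to your schema). It rounds each timestamp down to its 5-minute bracket, inserts the summary, then deletes only the rows just processed:

```sql
BEGIN TRANSACTION;

-- Only process rows older than the current (still open) bracket.
DECLARE @CutOff smalldatetime = DATEADD(MINUTE, -5, GETDATE());

-- One row per viewer per 5-minute bracket.
INSERT INTO dbo.StreamViewerHistory (TrackDate, StreamViewerID)
SELECT DISTINCT
       DATEADD(MINUTE, (DATEDIFF(MINUTE, 0, t.TrackDate) / 5) * 5, 0),
       t.StreamViewerID
FROM   dbo.LiveStreamViewerTracks t
WHERE  t.TrackDate < @CutOff;

-- Remove only the rows that have just been summarised.
DELETE FROM dbo.LiveStreamViewerTracks
WHERE  TrackDate < @CutOff;

COMMIT;
```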
The advantage of the above process is that the data in this new table is very usable by SQL Server. Whenever you need a graph (e.g., of the last 14 days) it doesn't need to read the whole table - instead it just starts at the relevant day and only reads the relevant rows. Note: make sure your queries are SARGable though, e.g.,
-- This is SARGable and can use the index
SELECT TrackDate, StreamViewerID
FROM StreamViewerHistory
WHERE TrackDate >= '20201001'
-- These are non-SARGable and will read the whole table
SELECT TrackDate, StreamViewerID
FROM StreamViewerHistory
WHERE CAST(TrackDate as date) >= '20201001'
SELECT TrackDate, StreamViewerID
FROM StreamViewerHistory
WHERE DATEDIFF(day, TrackDate, '20201001') <= 0
Typically, if you want counts of users for every 5 minutes within a given timeframe, you'd have something like
SELECT TrackDate, COUNT(*) AS NumViewers
FROM StreamViewerHistory
WHERE TrackDate >= '20201001 00:00:00' AND TrackDate < '20201002 00:00:00'
GROUP BY TrackDate
This should be good enough for quite a while. If your views etc. do slow down a lot, you could consider other things to help, e.g. further calculations/other reporting tables: for example, a table with TrackDate and NumViewers, with one row per TrackDate. That would be very fast when reporting the overall number of viewers, but would not allow you to drill down to a specific user.
I want to emphasize that I'm looking for ideas, not necessarily a concrete answer since it's difficult to show what my queries look like, but I don't believe that's needed.
The process looks like this:
Table A keeps filling up, like a bucket - an SQL job keeps calling SP_Proc1 every minute or less and it inserts multiple records into table A.
At the same time, a C# process keeps calling another procedure, SP_Proc2, every minute or less, which does an ordered TOP 5 select from table A and returns the results to the C# method. After the C# code finishes processing the results, it deletes the selected 5 records from table A.
I bolded the problematic part above. It is necessary that the records from table A be processed 5 at a time in the order specified, but a few times a month SP_Proc2 selects the ordered TOP 5 records in a wrong order even though all the records are present in table A and have correct column values that are used for ordering.
Something to note:
I'm ordering by integers, not varchar.
The C# part is using 1 thread.
Both SP_Proc1 and SP_Proc2 use a transaction and use the READ COMMITTED or READ COMMITTED SNAPSHOT transaction isolation level.
One column that is used for ordering is a computed value, but a very simple one. It just checks if another column in table A is not null and sets the computed column to either 1 or 0.
There's a unique nonclustered index on primary key Id and a clustered index composed of the same columns used for ordering in SP_Proc2.
I'm using SQL Server 2012 (v11.0.3000)
I'm beginning to think that this might be an SQL bug or maybe the records or index in table A get corrupted and then deleted by the C# process and that's why I can't catch it.
Edit:
To clarify: SP_Proc1 commits a big batch of N records to table A at once, and SP_Proc2 pulls the records from table A in batches of 5. It orders the records in the table and selects the TOP 5, and sometimes the wrong batch is selected: the batch itself is ordered correctly, but a different batch should have been selected according to the ORDER BY. I believe Rob Farley might have the right idea.
My guess is that your “out of order TOP 5” is ordered, but that a later five overlaps. Like, one time you get 1231, 1232, 1233, 1234, and 1236, and the next batch is 1235, 1237, and so on.
This can be an issue with locking and blocking. You’ve indicated your processes use transactions, so it wouldn’t surprise me if your 1235 hasn’t been committed yet, but can just be ignored by your snapshot isolation, and your 1236 can get picked up.
It doesn’t sound like there’s a bug here. What I’m describing above is a definite feature of snapshot isolation. If you must have 1235 picked up in an earlier batch than 1236, then don’t use snapshot isolation, and force your table to be locked until each block of inserts is finished.
An alternative suggestion would be to use a table lock (tablock) for the reading and writing procedures.
Though this is expensive, if you desire absolute consistency then this may be the way to go.
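A sketch of what that could look like inside SP_Proc2 (table and column names are placeholders for your own):

```sql
BEGIN TRANSACTION;

-- TABLOCKX takes an exclusive table lock; HOLDLOCK keeps it until commit,
-- so no insert can slip in between the select and the delete.
SELECT TOP (5) Id, SortCol1, SortCol2
FROM   dbo.TableA WITH (TABLOCKX, HOLDLOCK)
ORDER BY SortCol1, SortCol2;

-- ... process and delete the selected rows here ...

COMMIT;
```

The trade-off is that every reader and writer of the table serialises behind this lock, which is why it's expensive.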
I am working on a C# application, which loads data from a MS SQL 2008 or 2008 R2 database. The table looks something like this:
ID | binary_data | Timestamp
I need to get only the last entry, and only the binary data. Entries are added to this table irregularly by another program, so I have no way of knowing when there is a new entry.
Which version is better (performance etc.) and why?
//Always a query, which might not be needed
public void ProcessData()
{
byte[] data = "query code get latest binary data from db"
}
vs
//Always a smaller check-query, and sometimes two queries
public void ProcessData()
{
DateTime timestamp = "query code get latest timestamp from db"
if(timestamp > old_timestamp)
data = "query code get latest binary data from db"
}
The binary_data field size will be around 30 kB. The "ProcessData" function will be called several times per minute, but sometimes as often as every 1-2 seconds. This is only a small part of a bigger program with lots of threading/database access, so I want the "lightest" solution. Thanks.
Luckily, you can have both:
SELECT TOP 1 binary_data
FROM myTable
WHERE Timestamp > @last_timestamp
ORDER BY Timestamp DESC
If there is no record newer than @last_timestamp, no record will be returned and thus no data transmission takes place (= fast). If there are new records, the binary data of the newest is returned immediately (= no need for a second query).
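In C# that single query might look like the following sketch (the connection string, table name, and variable names are placeholders):

```csharp
using System;
using System.Data.SqlClient;

byte[] data = null;
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(
    "SELECT TOP 1 binary_data, Timestamp FROM myTable " +
    "WHERE Timestamp > @last_timestamp ORDER BY Timestamp DESC", conn))
{
    cmd.Parameters.AddWithValue("@last_timestamp", lastTimestamp);
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        // Read() is only true when there is newer data.
        if (reader.Read())
        {
            data = (byte[])reader["binary_data"];
            lastTimestamp = reader.GetDateTime(1); // remember for the next call
        }
    }
}
```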
I would suggest you perform tests using both methods as the answer would depend on your usages. Simulate some expected behaviour.
I would say though, that you are probably okay to just do the first query. Do what works. Don't prematurely optimise, if the single query is too slow, try your second two-query approach.
The two-step approach is more efficient from the point of view of overall system workload:
Get informed that you need to query new data
Query new data
There are several ways to implement this approach. Here are a pair of them.
Using Query Notifications, which are built-in SQL Server functionality supported in .NET.
Using an implied method of getting informed of a database table update, e.g. the one described in this article on the SQL Authority blog.
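A minimal Query Notifications sketch using SqlDependency (this assumes Service Broker is enabled on the database; the connection string and table name are placeholders). Note that notification queries must use explicit column lists and two-part table names:

```csharp
using System.Data.SqlClient;

// Start the listener once at application start-up.
SqlDependency.Start(connectionString);

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(
    "SELECT ID, binary_data, Timestamp FROM dbo.myTable", conn))
{
    var dependency = new SqlDependency(cmd);
    dependency.OnChange += (sender, e) =>
    {
        // Fires once when the result set changes; re-run the query
        // and re-subscribe here.
    };

    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        // Consume the current results; the subscription is now active.
    }
}
```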
I think the better path is a stored procedure that keeps the logic inside the database: something with an output parameter holding the required data, and a return value such as TRUE/FALSE to signal the presence of new data.
I have the following code snippet:
var matchingAuthors = from authors in DB.AuthorTable
where m_authors.Keys.Contains(authors.AuthorId)
select authors;
foreach (AuthorTableEntry author in matchingAuthors)
{
....
}
where m_authors is a Dictionary containing the "Author" entries, and DB.AuthorTable is a SQL table. When the size of m_authors goes beyond a certain value (somewhere around the 3000 entries mark), I get an exception:
System.Data.SqlClient.SqlException: The incoming tabular data stream (TDS) remote procedure call (RPC) protocol stream is incorrect.
Too many parameters were provided in this RPC request. The maximum is 2100.
Is there any way I can get around this and work with a larger size dictionary? Alternatively, is there a better way to get all rows in a SQL table where a particular column value for that row matches one of the dictionary entries?
LINQ to SQL translates a local Contains() into a parameterized IN statement:
...
WHERE AuthorId IN (@p0, @p1, @p2, ...)
...
So the error you're seeing is that SQL ran out of parameters to use for your keys. I can think of two options:
Select the whole table and filter using LINQ to Objects.
Generate an expression tree from your keys: see Option 2 here.
Another option is to consider how you populate m_authors and whether you can include that in the query as a query element itself so it turns into a server-side join/subselect.
Depending on your requirements, you could break apart the work into multiple smaller chunks (first thousand, second thousand, etc.) This runs certain risks if your data is read-write and changes frequently, but it might give you a bit better scalability beyond pulling back thousands of rows in one big gulp. And, if your data can be worked on in part (i.e. without having the entire set in memory), you could send off chunks to be worked on in a separate thread while you are pulling back the next chunk.
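A sketch of that chunking idea: split the key set into batches that stay under SQL Server's 2100-parameter limit (2000 leaves headroom for any other parameters in the query). The helper below is generic; the usage comment assumes the m_authors dictionary and DB context from the question:

```csharp
using System.Collections.Generic;

public static class Chunker
{
    // Yields the source sequence in lists of at most 'size' items.
    public static IEnumerable<List<T>> Chunk<T>(IEnumerable<T> source, int size)
    {
        var batch = new List<T>(size);
        foreach (var item in source)
        {
            batch.Add(item);
            if (batch.Count == size)
            {
                yield return batch;
                batch = new List<T>(size);
            }
        }
        if (batch.Count > 0)
            yield return batch;
    }
}

// Usage (hypothetical, based on the question's names):
// foreach (var keyChunk in Chunker.Chunk(m_authors.Keys, 2000))
// {
//     var matching = DB.AuthorTable.Where(a => keyChunk.Contains(a.AuthorId));
//     // ... process this chunk's rows ...
// }
```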