I want to emphasize that I'm looking for ideas, not necessarily a concrete answer since it's difficult to show what my queries look like, but I don't believe that's needed.
The process looks like this:
Table A keeps filling up, like a bucket - an SQL job keeps calling SP_Proc1 every minute or less and it inserts multiple records into table A.
At the same time a C# process keeps calling another procedure SP_Proc2 every minute or less that does an ordered TOP 5 select from table A and returns the results to the C# method. After C# code finishes processing the results it deletes the selected 5 records from table A.
I bolded the problematic part above. It is necessary that the records from table A be processed 5 at a time in the specified order, but a few times a month SP_Proc2 selects the ordered TOP 5 records in the wrong order, even though all the records are present in table A and have correct values in the columns used for ordering.
Something to note:
I'm ordering by integers, not varchar.
The C# part is using 1 thread.
Both SP_Proc1 and SP_Proc2 use a transaction and run at the READ COMMITTED or READ COMMITTED SNAPSHOT transaction isolation level.
One column that is used for ordering is a computed value, but a very simple one. It just checks if another column in table A is not null and sets the computed column to either 1 or 0.
There's a unique nonclustered index on primary key Id and a clustered index composed of the same columns used for ordering in SP_Proc2.
I'm using SQL Server 2012 (v11.0.3000)
I'm beginning to think that this might be an SQL bug or maybe the records or index in table A get corrupted and then deleted by the C# process and that's why I can't catch it.
Edit:
To clarify: SP_Proc1 commits a big batch of N records to table A at once, and SP_Proc2 pulls records from table A in batches of 5 - it orders the records in the table and selects the TOP 5. Sometimes the wrong batch is selected: the batch itself is ordered correctly, but a different batch should have been selected according to the ORDER BY. I believe Rob Farley might have the right idea.
My guess is that your “out of order TOP 5” is ordered, but that a later five overlaps. Like, one time you get 1231, 1232, 1233, 1234, and 1236, and the next batch is 1235, 1237, and so on.
This can be an issue with locking and blocking. You’ve indicated your processes use transactions, so it wouldn’t surprise me if your 1235 hasn’t been committed yet - under snapshot isolation it is simply skipped, and your 1236 gets picked up instead.
It doesn’t sound like there’s a bug here. What I’m describing above is a definite feature of snapshot isolation. If you must have 1235 picked up in an earlier batch than 1236, then don’t use snapshot isolation, and force your table to be locked until each block of inserts is finished.
An alternative suggestion would be to use a table lock (TABLOCK) in both the reading and writing procedures.
Though this is expensive, if you need absolute consistency then it may be the way to go.
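To illustrate the table-lock suggestion, here is a minimal sketch of what the select inside SP_Proc2 could look like. The table and column names (TableA, SortCol1, SortCol2) are assumptions - adjust them to the real schema:

```sql
-- TABLOCKX + HOLDLOCK takes an exclusive table lock and holds it for the
-- whole transaction, so no insert batch can commit between the SELECT
-- and the DELETE, and no batch can be read "around" an uncommitted insert.
BEGIN TRANSACTION;

SELECT TOP (5) Id, SortCol1, SortCol2
FROM dbo.TableA WITH (TABLOCKX, HOLDLOCK)
ORDER BY SortCol1, SortCol2;

-- ... the C# caller processes the rows, then deletes them before COMMIT ...

COMMIT TRANSACTION;
```

The trade-off is that SP_Proc1's inserts will block while this transaction is open, so keep the processing between SELECT and COMMIT as short as possible.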
Related
This may be a dumb question, but I wanted to be sure. I am creating a WinForms app and using a C# OleDbConnection to connect to an MS Access database. Right now, I am using "SELECT * FROM table_name" and looping through each row to see if it is the row with the criteria I want, then breaking out of the loop if it is. I wonder if the performance would be improved if I used something like "SELECT * FROM table_name WHERE id=something" - so basically use a "WHERE" clause instead of looping through every row?
The best way to validate the performance of anything is to test. Otherwise, a lot of assumptions are made about what is the best versus the reality of performance.
With that said, 100% of the time using a WHERE clause will be better than retrieving the data and then filtering via a loop. This is for a few different reasons, but ultimately you are filtering the data on a column before retrieving all of the columns, versus retrieving all of the columns and then filtering out the data. Relational data should be dealt with according to set logic, which is how a WHERE clause works, according to the data set. The loop is not set logic and compares each individual row, expensively, discarding those that don’t meet the criteria.
Don’t take my word for it though. Try it out. Especially try it out when your app has a lot of data in the table.
Yes, of course.
Say you have an Access database file shared in a folder, and you deploy your .NET desktop application to each workstation.
And furthermore, say the table has 1 million rows.
If you do this:
SELECT * from tblInvoice WHERE InvoiceNumber = 123245
Then ONLY one row is pulled down the network pipe - and this holds true EVEN if the table has 1 million rows. Traversing and pulling 1 million rows would take a HUGE amount of time, but if you add criteria to your select, then in this case it is about 1 million times faster to pull one row as opposed to the whole table.
And if this is/was multi-user? Then again, even on a network, ONLY the one record that meets your criteria will be pulled. The only requirement for this "one row pull" over the network? The Access data engine needs a usable index on that criteria. Of course, by default the PK column (ID) always has that index - so no worries there. But if, as per above, we are pulling invoice numbers from a table, then an index on that column (InvoiceNumber) is required for the data engine to pull only one row. If no index can be used, then behind the scenes all rows are pulled until a match occurs - and over a network this means significant amounts of data will be pulled without that index (or, if local, pulled from the file on disk).
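Creating that index is a one-liner, sketched here against the tblInvoice example table used above:

```sql
-- Lets the Access data engine seek directly to the matching row
-- instead of scanning (and pulling) the whole table.
CREATE INDEX IX_tblInvoice_InvoiceNumber ON tblInvoice (InvoiceNumber);
```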
I've developed a .NET Core application for live streaming, which has a lot of functionality. One feature shows our clients how many people were watching in every 5-minute interval.
Right now I'm saving a log row in a SQL Server database for each viewer, with ViewerID and a timestamp, every 5 minutes. It seems to be a bad approach, since in the first couple of days I've already reached 100k rows in that table. I need that data, because we have a "Time Peek Chart" that shows how many people, and who, was watching in each 5-minute interval.
Anyway, does anyone have a suggestion for how I can handle this? I was thinking about a .txt file with the same data, but it also seems that I/O on the server could be a problem...
I also thought about a NoSQL database, maybe an existing MongoDB-as-a-service like scalegrid.io or mlab.com.
Can someone help me with this, please? Thanks in advance!
I presume this is related to one of your previous questions Filter SQL GROUP by a filter that is not in GROUP and an expansion of the question in comments 'how to make this better'.
This answer below is definitely not the only way to do this - but I think it's a good start.
As you're using SQL Server for the initial data storage (minute-by-minute) I would suggest continuing to use SQL Server for the next stage of data storage. I think you'd need a compelling argument to use something else for the next stage, as you then need to maintain both of them (e.g., keeping software up-to-date, backups, etc), as well as having all the fun of transferring data properly between the two pieces of software.
My suggested approach is to keep the most detailed/granular data that you need, but no more.
In the previous question, you were keeping data by the minute, then calculating up to the 5-minute bracket. In this answer I'd summarise (and store) the data for the 5-minute brackets then discard your minute-by-minute data once it has been summarised.
For example, you could have a table called 'StreamViewerHistory' that has the Viewer's ID and a timestamp (much like the original table).
This only has 1 row per viewer per 5 minute interval. You could make the timestamp field a smalldatetime (as you don't care about seconds) or even have it as an ID value pointing to another table that references each timeframe. I think smalldatetime is easier to start with.
Depending exactly on how it's used, I would suggest having the Primary Key (or at least the Clustered index) being the timestamp before the ViewerID - this means new rows get added to the end. It also assumes that most queries of data are filtered by timeframes first (e.g., last week's worth of data).
I would consider having an index on ViewerId then the timestamp, for when people want to view an individual's history.
e.g.,
CREATE TABLE [dbo].[StreamViewerHistory](
[TrackDate] smalldatetime NOT NULL,
[StreamViewerID] int NOT NULL,
CONSTRAINT [PK_StreamViewerHistory] PRIMARY KEY CLUSTERED
(
[TrackDate] ASC,
[StreamViewerID] ASC
)
)
GO
CREATE NONCLUSTERED INDEX [IX_StreamViewerHistory_StreamViewerID] ON [dbo].[StreamViewerHistory]
(
[StreamViewerID] ASC,
[TrackDate] ASC
)
GO
Now, on some sort of interval (either as part of your ping process, or a separate process run regularly), interrogate the data in your source table LiveStreamViewerTracks, crunch the data as per the previous question, and save the results into this new table. Then delete the rows from LiveStreamViewerTracks to keep it small and usable. Ensure you delete only the relevant rows, though (i.e., the ones that have been summarised).
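A rough sketch of that summarise-then-delete step, assuming the source table is LiveStreamViewerTracks(StreamViewerID, TrackDate) at minute granularity (names and the cut-off logic are assumptions):

```sql
-- @ProcessedUpTo marks the start of the current (still-open) 5-minute
-- bracket, so only completed brackets are summarised and deleted.
DECLARE @ProcessedUpTo smalldatetime =
    DATEADD(minute, (DATEDIFF(minute, 0, GETDATE()) / 5) * 5, 0);

BEGIN TRANSACTION;

-- One row per viewer per 5-minute bracket
INSERT INTO dbo.StreamViewerHistory (TrackDate, StreamViewerID)
SELECT DISTINCT
       DATEADD(minute, (DATEDIFF(minute, 0, t.TrackDate) / 5) * 5, 0),
       t.StreamViewerID
FROM dbo.LiveStreamViewerTracks t
WHERE t.TrackDate < @ProcessedUpTo;

-- Delete only the rows that were just summarised
DELETE FROM dbo.LiveStreamViewerTracks
WHERE TrackDate < @ProcessedUpTo;

COMMIT TRANSACTION;
```

Wrapping both statements in one transaction means a failure can't leave rows deleted but not summarised.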
The advantage of the above process is that the data in this new table is very usable by SQL Server. Whenever you need a graph (e.g., of the last 14 days) it doesn't need to read the whole table - instead it starts at the relevant day and reads only the relevant rows. Do make sure your queries are SARGable, though, e.g.,
-- This is SARGable and can use the index
SELECT TrackDate, StreamViewerID
FROM StreamViewerHistory
WHERE TrackDate >= '20201001'
-- These are non-SARGable and will read the whole table
SELECT TrackDate, StreamViewerID
FROM StreamViewerHistory
WHERE CAST(TrackDate as date) >= '20201001'
SELECT TrackDate, StreamViewerID
FROM StreamViewerHistory
WHERE DATEDIFF(day, TrackDate, '20201001') <= 0
Typically, if you want counts of users for every 5 minutes within a given timeframe, you'd have something like
SELECT TrackDate, COUNT(*) AS NumViewers
FROM StreamViewerHistory
WHERE TrackDate >= '20201001 00:00:00' AND TrackDate < '20201002 00:00:00'
GROUP BY TrackDate
This should be good enough for quite a while. If your views/etc do slow down a lot, you could consider other things to help e.g., you could also do further calculations/other reporting tables e.g., also have a table with TrackDate and NumViewers - where there's one row per TrackDate. This should be very fast when reporting overall number of users, but will not allow you to drill down to a specific user.
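A sketch of that further reporting table and how it could be populated (table and column names are assumptions):

```sql
-- One row per 5-minute bracket; very fast for overall-viewer charts,
-- but cannot drill down to an individual viewer.
CREATE TABLE dbo.StreamViewerCounts (
    TrackDate  smalldatetime NOT NULL PRIMARY KEY,
    NumViewers int           NOT NULL
);

-- Populated from the detail table as part of the same summarisation job:
INSERT INTO dbo.StreamViewerCounts (TrackDate, NumViewers)
SELECT TrackDate, COUNT(*)
FROM dbo.StreamViewerHistory
GROUP BY TrackDate;
```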
Suppose we have the following situation:
We have 2-3 tables in the database with a huge amount of data (say 50-100 million records) and we want to add 2k new records. But before adding them we need to check the DB for duplicates: if those 2k contain records we already have in our DB, we should ignore them. To find out whether a new record is a duplicate or not, we need info from both tables (for example, we need to do a left join).
The idea of the solution is: one task or thread creates the data suitable for comparison and pushes it into a queue (in batches, not record by record), so our queue (a ConcurrentQueue) is a global variable. A second thread takes a batch from the queue and looks through it. But there's a problem - memory keeps growing...
How can I clean memory after I've surfed through the batch?
P.S. If smb has another idea how to optimize this process - please describe it...
This is not a specific answer to the question you are asking, because what you are asking doesn't really make sense to me.
if you are looking to update specific rows:
INSERT INTO tablename (UniqueKey,columnname1, columnname2, etc...)
VALUES (UniqueKeyValue,value1,value2, etc....)
ON DUPLICATE KEY
UPDATE columnname1=value1, columnname2=value2, etc...
If not, simply ignore/remove the update statement.
This would be darn fast, considering, it would use the unique index of whatever field you want to be unique, and just do an insert or update. No need to validate in a separate table or anything.
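Note that the ON DUPLICATE KEY syntax above is MySQL-specific. If the database here is SQL Server (as elsewhere in this thread), a rough equivalent is MERGE; the table and column names below are placeholders matching the example:

```sql
-- Upsert keyed on the unique index: update the row if the key exists,
-- otherwise insert it. Assumes a unique index/constraint on UniqueKey.
MERGE dbo.tablename AS target
USING (VALUES (@UniqueKeyValue, @value1, @value2))
      AS source (UniqueKey, columnname1, columnname2)
ON target.UniqueKey = source.UniqueKey
WHEN MATCHED THEN
    UPDATE SET columnname1 = source.columnname1,
               columnname2 = source.columnname2
WHEN NOT MATCHED THEN
    INSERT (UniqueKey, columnname1, columnname2)
    VALUES (source.UniqueKey, source.columnname1, source.columnname2);
```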
"Cassandra: The Definitive Guide, 2nd Edition" says:
Cassandra’s batches are a good fit for use cases such as making multiple updates to a single partition, or keeping multiple tables in sync. A good example is making modifications to denormalized tables that store the same data for different access patterns.
The last statement above applies to the following attempt, where all the Save... are insert statements for different tables
var bLogged = new BatchStatement();
var now = DateTimeOffset.UtcNow;
var uuidNow = TimeUuid.NewId(now);
bLogged.Add(SaveMods.Bind(id, uuidNow, data1)); // 1
bLogged.Add(SaveMoreMods.Bind(id, uuidNow, data2)); // 2
bLogged.Add(SaveActivity.Bind(now.ToString("yyyy-MM-dd"), id, now)); // 3
await GetSession().ExecuteAsync(bLogged);
We'll focus on statements 1 and 2 (the 3rd one is just to signify there's one more statement in the batch).
Statement 1 writes to table1, partitioned by id, with uuidNow as a descending clustering key.
Statement 2 writes to table2, partitioned by id only, so it holds the tip of table1 for the same id.
More often than I'd like, the two tables get out of sync, in the sense that table2 does not have the tip of table1. It would be one or two mods behind, within a few milliseconds.
While looking for resolution most on the web advise against all batches, which prompted my solution eliminating all mismatches:
await Task.WhenAll(
GetSession().ExecuteAsync(SaveMods.Bind(id, uuidNow, data1)),
GetSession().ExecuteAsync(SaveMoreMods.Bind(id, uuidNow, data2)),
GetSession().ExecuteAsync(SaveActivity.Bind(now.ToString("yyyy-MM-dd"), id, now))
);
The question is: what are batches good for, just the first statement in the quote? In that case how do I ensure modifications to different tables are in sync?
Using a higher consistency level (e.g., QUORUM) on reads/writes may help, but there is always a possibility of inconsistencies between the tables/partitions.
Batch statements will try to ensure that all the mutations in the batch will all happen or not. It does not guarantee that all the mutations will occur in an instant (no isolation, you can do a read where first mutation has been applied but others haven't). Also, batch statements will not provide a consistent view of all the data across all the nodes. For linearizable consistency you should consider using paxos (lightweight transactions) for conditional updates and trying to limit things that require the linearizability into a single partition.
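For the lightweight-transaction suggestion, a minimal CQL sketch of conditional writes (table and column names are assumptions, not taken from the schema above):

```sql
-- Paxos-backed conditional insert: applied only if no row with this
-- primary key exists yet; the result row reports [applied] = true/false.
INSERT INTO table2 (id, data) VALUES (123, 'tip')
IF NOT EXISTS;

-- Paxos-backed conditional update: applied only if the current value matches.
UPDATE table2 SET data = 'new tip'
WHERE id = 123
IF data = 'tip';
```

These give linearizable behaviour, but only within a single partition, which is why the advice above is to keep anything needing linearizability in one partition.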
In part of my application I have to get the last ID of a table where a condition is met
For example:
SELECT MAX(ID) FROM TABLE WHERE Num = 2
So I can either grab the whole table and loop through it looking for Num = 2, or I can grab the data from the table where Num = 2. In the latter, I know the last item will be the MAX ID.
Either way, I have to do this around 50 times...so would it be more efficient grabbing all the data and looping through the list of data looking for a specific condition...
Or would it be better to grab the data several times based on the condition..where I know the last item in the list will be the max id
I have 6 conditions I will have to base the queries on
I'm just wondering which is more efficient: looping through a list of around 3500 items several times, or hitting the database several times, where I can already have the data broken down the way I need it.
I can speak for SQL Server. If you create a stored procedure where Num is a parameter that you pass, you will get the best performance, because the engine caches and reuses the stored procedure's execution plan. Of course, an index on that field is mandatory.
Let the database do this work, it's what it is designed to do.
Does this table have a high insert frequency? Does it have a high update frequency, specifically on the column that you're applying the MAX function to? If the answer is no, you might consider adding an IS_MAX BIT column and set it using an insert trigger. That way, the row you want is essentially cached, and it's trivial to look up.
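Putting the stored-procedure and index suggestions together, a minimal sketch (dbo.MyTable and the column names are placeholders for the real schema):

```sql
-- One parameterised procedure replaces the ~50 ad-hoc queries;
-- its plan is compiled once and reused for each Num value.
CREATE PROCEDURE dbo.GetMaxIdForNum
    @Num int
AS
BEGIN
    SET NOCOUNT ON;
    SELECT MAX(ID) AS MaxId
    FROM dbo.MyTable
    WHERE Num = @Num;
END
GO

-- Supporting index: with (Num, ID) the MAX becomes a single seek
-- to the end of the matching range instead of a scan.
CREATE INDEX IX_MyTable_Num_ID ON dbo.MyTable (Num, ID);
GO
```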