Best way to split up a long file: programming or SQL? - C#

I have a database table (in MS Access) of GPS information with a record of speed, location (lat/long) and bearing of a vehicle for every second. There is a field that shows time like this: 2007-09-25 07:59:53. The problem is that this table has merged information from several files that were collected on this project. So, for example, 2007-09-25 07:59:53 to 2007-09-25 08:15:42 could be one file and, after a gap of more than 10 seconds, the next file will start, like 2007-09-25 08:15:53 to 2007-09-25 08:22:12. I need to populate a file number field in this table, and the separating criterion for each file will be that the gap in time between the last record of one file and the first record of the next is more than 10 seconds. I did this using C# code by iterating over the table, comparing each record to the next and changing the file number whenever the gap is more than 10 seconds.
My question is, should this type of problem be solved using programming or is it better solved using a SQL query? I can load the data into a database like SQL Server, so there is no limitation to what tool I can use. I just want to know the best approach.
If it is better to solve this using SQL, will I need to use cursors?
When solving this using programming (for example C#), what is an efficient way to update a table when 20,000+ records need to be updated based on a modified DataSet? I used the DataAdapter.Update() method and it seemed to take a long time to update the table (30 minutes or so).

Assuming SQL Server 2008 and CTEs from your comments:
The best time to use SQL is generally when you are comparing or evaluating large sets of data.
Iterative programming languages like C# are better suited to more expansive analysis of individual records, or analysis of rows one at a time (Row By Agonizing Row, aka RBAR).
For examples of recursive CTEs, see here. MS has a good reference.
Also, depending on data structure, you could do this with a normal JOIN:
SELECT <stuff>
FROM MyTable T
INNER JOIN MyTable T2
    ON T2.pk = (SELECT MIN(pk) FROM MyTable WHERE pk > T.pk)  -- T2 is the row immediately after T
WHERE DATEDIFF(second, T.timefield, T2.timefield) > 10        -- keep only the > 10 second gaps
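For the C# side of things (the slow DataAdapter.Update() over 20,000+ rows), here is a minimal sketch, assuming the data has been moved into SQL Server and that the table is called GpsLog with Timestamp and FileNumber columns (hypothetical names). It numbers the files in one in-memory pass and then pushes the changes back in batches, which avoids the one-round-trip-per-row behaviour that usually makes DataAdapter.Update() take that long:

using System;
using System.Data;
using System.Data.SqlClient;

// Sketch only: assign file numbers in memory, then update in batches.
// "GpsLog", "Timestamp" and "FileNumber" are placeholder names; the table
// needs a primary key so the command builder can generate the UPDATE.
static void AssignFileNumbers(string connectionString)
{
    using (var connection = new SqlConnection(connectionString))
    using (var adapter = new SqlDataAdapter("SELECT * FROM GpsLog ORDER BY Timestamp", connection))
    using (var builder = new SqlCommandBuilder(adapter))
    {
        var table = new DataTable();
        adapter.Fill(table);

        int fileNumber = 1;
        DateTime? previous = null;
        foreach (DataRow row in table.Rows)
        {
            DateTime current = (DateTime)row["Timestamp"];
            if (previous.HasValue && (current - previous.Value).TotalSeconds > 10)
                fileNumber++;                                   // a gap of more than 10 s starts a new file
            row["FileNumber"] = fileNumber;
            previous = current;
        }

        // Batch the UPDATE statements instead of one round trip per row.
        adapter.UpdateCommand = builder.GetUpdateCommand();
        adapter.UpdateCommand.UpdatedRowSource = UpdateRowSource.None; // required when batching
        adapter.UpdateBatchSize = 500;
        adapter.Update(table);
    }
}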

Related

C# Winforms Fastest Way To Query MS Access

This may be a dumb question, but I wanted to be sure. I am creating a WinForms app and using a C# OleDbConnection to connect to an MS Access database. Right now, I am using "SELECT * FROM table_name" and looping through each row to see if it is the row with the criteria I want, then breaking out of the loop if it is. I wonder if the performance would be improved if I used something like "SELECT * FROM table_name WHERE id=something", so basically a "WHERE" clause instead of looping through every row?
The best way to validate the performance of anything is to test. Otherwise, a lot of assumptions are made about what is the best versus the reality of performance.
With that said, using a WHERE clause will be better than retrieving the data and then filtering via a loop 100% of the time. There are a few reasons for this, but ultimately it comes down to filtering rows on the server before anything is sent, versus retrieving every row and then throwing most of them away. Relational data should be dealt with according to set logic, which is how a WHERE clause works: it operates on the data set. The loop is not set logic; it compares each individual row, expensively, discarding those that don't meet the criteria.
Don’t take my word for it though. Try it out. Especially try it out when your app has a lot of data in the table.
Yes, of course.
Say you have an Access database file shared on a folder, and you deploy your .NET desktop application to each workstation.
And furthermore, say the table has 1 million rows.
If you do this:
SELECT * FROM tblInvoice WHERE InvoiceNumber = 123245
Then ONLY one row is pulled down the network pipe - and this holds true EVEN if the table has 1 million rows. To traverse and pull 1 million rows is going to take a HUGE amount of time, but if you add criteria to your select, then it would be in this case about 1 million times faster to pull one row as opposed to the whole table.
And what if this is multi-user? Then again, even on a network, ONLY the ONE record that meets your criteria will be pulled. The only requirement for this "one row pull" over the network? The Access data engine needs a usable index on that criteria column. Of course, by default the PK column (ID) always has that index - so no worries there. But if, as per above, we are pulling invoice numbers from a table, then an index on that column (InvoiceNumber) is required for the data engine to pull only one row. If no index can be used, then behind the scenes all rows are pulled until a match occurs - and over a network that means a significant amount of data crossing the wire (or, if local, being read from the file on disk).
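As a small illustration of that filtered pull, here is a sketch of the parameterised query (tblInvoice and InvoiceNumber are the example names from above, and connectionString is assumed to hold your Access OLE DB connection string):

using System.Data.OleDb;

// Sketch: let the Access data engine filter on the indexed column;
// only the matching row comes back over the network / from disk.
using (var connection = new OleDbConnection(connectionString))
using (var command = new OleDbCommand(
    "SELECT * FROM tblInvoice WHERE InvoiceNumber = ?", connection))
{
    command.Parameters.AddWithValue("?", 123245);   // OleDb parameters are positional
    connection.Open();
    using (var reader = command.ExecuteReader())
    {
        while (reader.Read())
        {
            // process the single matching row
        }
    }
}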

Manage live stream viewer huge log

I've developed a .NET Core application for live streaming, which has a lot of functionality. One of its features is to show our clients how many people were watching in every 5-minute interval.
Right now, I'm saving a log row in a SQL Server database for each viewer, with ViewerID and TimeStamp, at 5-minute intervals. It seems to be a bad approach, since in the first couple of days I've already reached 100k rows in that table. I need that data, because we have a "Time Peak Chart" that shows how many people (and who) were watching in each 5-minute interval.
Anyway, does anyone have a suggestion for how I can handle this? I was thinking about a .txt file with the same data, but it also seems that server I/O could be a problem...
I also thought about a NoSQL database, maybe an existing MongoDB-as-a-service offering like scalegrid.io or mlab.com.
Can someone help me with this, please? Thanks in advance!
I presume this is related to one of your previous questions, Filter SQL GROUP by a filter that is not in GROUP, and is an expansion of the question from the comments: 'how to make this better'.
This answer below is definitely not the only way to do this - but I think it's a good start.
As you're using SQL Server for the initial data storage (minute-by-minute) I would suggest continuing to use SQL Server for the next stage of data storage. I think you'd need a compelling argument to use something else for the next stage, as you then need to maintain both of them (e.g., keeping software up-to-date, backups, etc), as well as having all the fun of transferring data properly between the two pieces of software.
My suggested approach is to keep the most detailed/granular data that you need, but no more.
In the previous question, you were keeping data by the minute, then calculating up to the 5-minute bracket. In this answer I'd summarise (and store) the data for the 5-minute brackets then discard your minute-by-minute data once it has been summarised.
For example, you could have a table called 'StreamViewerHistory' that has the Viewer's ID and a timestamp (much like the original table).
This only has one row per viewer per 5-minute interval. You could make the timestamp field a smalldatetime (as you don't care about seconds), or even have it as an ID value pointing to another table that references each timeframe. I think smalldatetime is easier to start with.
Depending exactly on how it's used, I would suggest having the Primary Key (or at least the clustered index) be the timestamp before the ViewerID - this means new rows get added to the end. It also assumes that most queries of data are filtered by timeframe first (e.g., last week's worth of data).
I would consider having an index on ViewerId then the timestamp, for when people want to view an individual's history.
e.g.,
CREATE TABLE [dbo].[StreamViewerHistory](
[TrackDate] smalldatetime NOT NULL,
[StreamViewerID] int NOT NULL,
CONSTRAINT [PK_StreamViewerHistory] PRIMARY KEY CLUSTERED
(
[TrackDate] ASC,
[StreamViewerID] ASC
)
)
GO
CREATE NONCLUSTERED INDEX [IX_StreamViewerHistory_StreamViewerID] ON [dbo].[StreamViewerHistory]
(
[StreamViewerID] ASC,
[TrackDate] ASC
)
GO
Now, on some sort of interval (either as part of your ping process, or a separate process run regularly) interrogate the data in your source table LiveStreamViewerTracks, crunch the data as per the previous question, and save the results in this new table. Then delete the rows from LiveStreamViewerTracks to keep it smaller and usable. Ensure you delete the relevant rows only though (e.g., the ones that have been processed).
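As a rough sketch of that interval job (assumptions: the granular table LiveStreamViewerTracks has ViewerID and TimeStamp columns, and flooring the timestamp to its 5-minute bracket matches the crunch from the previous question), running the insert and the delete in one transaction keeps the two tables consistent:

using System;
using System.Data.SqlClient;

// Sketch only: summarise finished 5-minute brackets into StreamViewerHistory,
// then remove the summarised rows from the granular table.
// Column names and the 5-minute flooring are assumptions - adjust to your schema;
// in production you may also want a NOT EXISTS guard against re-processing a bracket.
static void SummariseViewerHistory(string connectionString, DateTime cutoff)
{
    const string sql = @"
BEGIN TRANSACTION;

INSERT INTO StreamViewerHistory (TrackDate, StreamViewerID)
SELECT DISTINCT
       DATEADD(minute, (DATEDIFF(minute, 0, t.[TimeStamp]) / 5) * 5, 0),  -- floor to the 5-minute bracket
       t.ViewerID
FROM   LiveStreamViewerTracks t
WHERE  t.[TimeStamp] < @cutoff;

DELETE FROM LiveStreamViewerTracks
WHERE  [TimeStamp] < @cutoff;

COMMIT TRANSACTION;";

    using (var connection = new SqlConnection(connectionString))
    using (var command = new SqlCommand(sql, connection))
    {
        command.Parameters.AddWithValue("@cutoff", cutoff);
        connection.Open();
        command.ExecuteNonQuery();
    }
}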
The advantage of the above process is that the data in this new table is very usable by SQL Server. Whenever you need a graph (e.g., of the last 14 days), it doesn't need to read the whole table - instead it starts at the relevant day and only reads the relevant rows. Do make sure your queries are SARGable though, e.g.:
-- This is SARGable and can use the index
SELECT TrackDate, StreamViewerID
FROM StreamViewerHistory
WHERE TrackDate >= '20201001'
-- These are non-SARGable and will read the whole table
SELECT TrackDate, StreamViewerID
FROM StreamViewerHistory
WHERE CAST(TrackDate as date) >= '20201001'
SELECT TrackDate, StreamViewerID
FROM StreamViewerHistory
WHERE DATEDIFF(day, TrackDate, '20201001') <= 0
Typically, if you want counts of users for every 5 minutes within a given timeframe, you'd have something like
SELECT TrackDate, COUNT(*) AS NumViewers
FROM StreamViewerHistory
WHERE TrackDate >= '20201001 00:00:00' AND TrackDate < '20201002 00:00:00'
GROUP BY TrackDate
This should be good enough for quite a while. If your views do slow down a lot, you could consider further pre-calculated reporting tables, e.g., a table with just TrackDate and NumViewers, with one row per TrackDate. That would be very fast when reporting the overall number of viewers, but would not allow you to drill down to a specific user.

SQL Server - Best practice to circumvent large IN (...) clause (>40000 items)

I'm developing an ASP.NET app that analyzes Excel files uploaded by users. The files contain various data about customers (one row = one customer); the key field is CustomerCode. Basically, the data comes in the form of a DataTable object.
At some point I need to get information about the specified customers from SQL and compare it to what the user uploaded. I'm doing it the following way:
Make a comma-separated list of customers from the CustomerCode column: 'Customer1','Customer2',...'CustomerN'.
Pass this string to the SQL query's IN (...) clause and execute it.
This was working okay until I ran into a "The query processor ran out of internal resources and could not produce a query plan" exception when trying to pass ~40,000 items inside the IN (...) clause.
The trivial way seems to be:
Replace IN (...) with = 'SomeCustomerCode' in query template.
Execute this query 40000 times for each CustomerCode.
Do DataTable.Merge 40000 times.
Is there any better way to work this problem around?
Note: I can't do IN (SELECT CustomerCode FROM ... WHERE SomeConditions) because the data comes from Excel files and thus cannot be queried from DB.
"Table valued parameters" would be worth investigating, which let you pass in (usually via a DataTable on the C# side) multiple rows - the downside is that you need to formally declare and name the data shape on the SQL server first.
Alternatively, though: you could use SqlBulkCopy to throw the rows into a staging table, and then just JOIN to that table. If you have parallel callers, you will need some kind of session identifier on the row to distinguish between concurrent uses (and: don't forget to remove your session's data afterwards).
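And a similar sketch of the SqlBulkCopy route; using a local temp table (#CustomerCodes) sidesteps the session-identifier bookkeeping mentioned above, since each connection gets its own copy (table and column names are placeholders):

using System.Data;
using System.Data.SqlClient;

// Sketch: bulk-copy the uploaded customer codes into a temp staging table,
// then JOIN against it instead of building a 40,000-item IN (...) list.
static DataTable GetCustomersViaStaging(string connectionString, DataTable uploadedCodes)
{
    using (var connection = new SqlConnection(connectionString))
    {
        connection.Open();

        // The temp table lives only for this connection, so no cleanup is needed.
        using (var create = new SqlCommand(
            "CREATE TABLE #CustomerCodes (CustomerCode nvarchar(50) NOT NULL)", connection))
        {
            create.ExecuteNonQuery();
        }

        using (var bulk = new SqlBulkCopy(connection))
        {
            bulk.DestinationTableName = "#CustomerCodes";
            bulk.ColumnMappings.Add("CustomerCode", "CustomerCode");
            bulk.WriteToServer(uploadedCodes);
        }

        var result = new DataTable();
        using (var query = new SqlCommand(@"
            SELECT c.*
            FROM   Customers c
            JOIN   #CustomerCodes s ON s.CustomerCode = c.CustomerCode", connection))
        using (var reader = query.ExecuteReader())
        {
            result.Load(reader);
        }
        return result;
    }
}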
You shouldn't process too many records at once: you hit errors like the one you mentioned, and such a big batch takes too long to run and can't be parallelised. You shouldn't process only one record at a time either, because then the overhead of the SQL Server communication becomes too big. Choose something in the middle and process, e.g., 10,000 records at a time. You can even parallelise the processing: start running the SQL for the next 10,000 in the background while you are processing the previous batch of 10,000.

SQL - Better two queries instead of one big one

I am working on a C# application, which loads data from a MS SQL 2008 or 2008 R2 database. The table looks something like this:
ID | binary_data | Timestamp
I need to get only the last entry, and only the binary data. Entries are added to this table irregularly from another program, so I have no way of knowing if there is a new entry.
Which version is better (performance etc.) and why?
//Always a query, which might not be needed
public void ProcessData()
{
    byte[] data = "query code: get latest binary data from db";
}
vs
//Always a smaller check-query, and sometimes two queries
public void ProcessData()
{
    DateTime timestamp = "query code: get latest timestamp from db";
    if (timestamp > old_timestamp)
        data = "query code: get latest binary data from db";
}
The binary_data field size will be around 30 kB. The ProcessData function will be called several times per minute, but sometimes it can be called every 1-2 seconds. This is only a small part of a bigger program with lots of threading/database access, so I want the "lightest" solution. Thanks.
Luckily, you can have both:
SELECT TOP 1 binary_data
FROM myTable
WHERE Timestamp > #last_timestamp
ORDER BY Timestamp DESC
If there is no record newer than #last_timestamp, no record is returned and thus no data transmission takes place (= fast). If there are new records, the binary data of the newest one is returned immediately (= no need for a second query).
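On the C# side, a minimal sketch of that single call (table and column names as in the question, connection handling simplified):

using System;
using System.Data.SqlClient;

// Sketch: one round trip that only returns binary_data when something newer exists.
static byte[] GetNewerData(string connectionString, DateTime lastTimestamp)
{
    const string sql = @"
        SELECT TOP 1 binary_data
        FROM   myTable
        WHERE  Timestamp > @last_timestamp
        ORDER BY Timestamp DESC";

    using (var connection = new SqlConnection(connectionString))
    using (var command = new SqlCommand(sql, connection))
    {
        command.Parameters.AddWithValue("@last_timestamp", lastTimestamp);
        connection.Open();
        // ExecuteScalar() returns null when no newer row exists, so nothing is transmitted.
        return command.ExecuteScalar() as byte[];
    }
}

In practice you would probably also select the new Timestamp in the same query so the caller can advance last_timestamp for the next call.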
I would suggest you perform tests using both methods, as the answer will depend on your usage. Simulate some expected behaviour.
I would say, though, that you are probably okay just doing the first query. Do what works. Don't prematurely optimise; if the single query turns out to be too slow, try your second, two-query approach.
A two-step approach is more efficient from the point of view of the overall system workload:
Get informed that you need to query new data
Query new data
There are several ways to implement this approach. Here are a pair of them.
Using Query Notifications, which are built-in SQL Server functionality with support in .NET (a minimal sketch follows below this list).
Using an indirect method of being informed of a database table update, e.g. the one described in this article on the SQL Authority blog.
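For the Query Notifications option, here is a minimal SqlDependency sketch. Assumptions: Service Broker is enabled on the database, the table is dbo.myTable, and the notification query lists its columns explicitly with a two-part table name (both are requirements of the feature):

using System;
using System.Data;
using System.Data.SqlClient;

// Sketch: get notified when the table changes instead of polling for a new timestamp.
// Each notification fires only once, so re-subscribe after handling it.
static void WatchForNewData(string connectionString)
{
    SqlDependency.Start(connectionString);   // call once per application / connection string

    using (var connection = new SqlConnection(connectionString))
    using (var command = new SqlCommand(
        "SELECT ID, binary_data, Timestamp FROM dbo.myTable", connection))
    {
        var dependency = new SqlDependency(command);
        dependency.OnChange += (sender, e) =>
        {
            // A change was detected: fetch the latest row now, then call WatchForNewData again.
        };

        connection.Open();
        using (var reader = command.ExecuteReader())
        {
            while (reader.Read()) { /* consume the current results to register the subscription */ }
        }
    }
}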
I think the better path is a stored procedure that keeps the logic inside the database: something with an output parameter carrying the required data, and a return value like TRUE/FALSE to signal the presence of new data.

What's the best way to use SqlBulkCopy to fill a really large table?

Nightly, I need to fill a SQL Server 2005 table from an ODBC source with over 8 million records. Currently I am using an insert statement from a linked server, with select syntax similar to this:
INSERT INTO SQLStagingTable SELECT * FROM OPENQUERY(ODBCSource, 'SELECT * FROM SourceTable')
This is really inefficient and takes hours to run. I'm in the middle of coding a solution using SqlBulkCopy code similar to the code found in this question.
The code in that question first populates a DataTable in memory and then passes that DataTable to SqlBulkCopy's WriteToServer method.
What should I do if the populated datatable uses more memory than is available on the machine it is running (a server with 16GB of memory in my case)?
I've thought about using the overloaded OdbcDataAdapter.Fill method, which allows you to fill only the records from x to n (where x is the start index and n is the number of records to fill). However, that could turn out to be an even slower solution than what I currently have, since it would mean re-running the select statement on the source a number of times.
What should I do? Just populate the whole thing at once and let the OS manage the memory? Should I populate it in chunks? Is there another solution I haven't thought of?
The easiest way would be to use ExecuteReader() against your odbc data source and pass the IDataReader to the WriteToServer(IDataReader) overload.
Most data reader implementations will only keep a very small portion of the total results in memory.
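A minimal sketch of that approach (connection strings and table names are placeholders); because the reader streams rows, memory use stays small regardless of the 8 million rows:

using System.Data.Odbc;
using System.Data.SqlClient;

// Sketch: stream rows straight from the ODBC source into SQL Server
// without materialising them in a DataTable first.
static void BulkLoad(string odbcConnectionString, string sqlConnectionString)
{
    using (var source = new OdbcConnection(odbcConnectionString))
    using (var command = new OdbcCommand("SELECT * FROM SourceTable", source))
    {
        source.Open();
        using (var reader = command.ExecuteReader())
        using (var bulk = new SqlBulkCopy(sqlConnectionString))
        {
            bulk.DestinationTableName = "SQLStagingTable";
            bulk.BulkCopyTimeout = 0;       // no timeout for a long-running load
            bulk.BatchSize = 10000;         // commit in batches rather than one huge transaction
            bulk.WriteToServer(reader);     // rows are streamed, not buffered in memory
        }
    }
}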
SSIS performs well and is very tweakable. In my experience 8 million rows is not out of its league. One of my larger ETLs pulls in 24 million rows a day and does major conversions and dimensional data warehouse manipulations.
If you have indexes on the destination table, you might consider disabling them until the records have been inserted, then rebuilding them afterwards?
