Retrieving and comparing very large data sets with multiple columns - c#

The requirement: I have multiple databases (Oracle, SQL Server, etc.). From each database I need to pull a large amount of data into a C# program and compare the data sets with one another. Each data set has a key (not 100% unique; there may be duplicates), and using that key I can compare rows across the other data sets/databases.
Each database will return approximately 1.5 million rows. I have 5 different databases from which I will be getting data, i.e. 7.5 million rows will be loaded into my program.
What is the best way to load the data into the program (currently each SQL query takes about 5 minutes on the database side)? Export to CSV and then read it in C#? Any other ideas?
I am planning to load the data into a HashSet in C#. Is that a good option?
DB 1:
Account  Amount
1234     1
9999     66
DB 2:
Account  Amount
1234     2
9999     66
DB 3:
Account  Amount
1234     1
9999     66
DB 4:
Account  Amount
1234     10
9999     66
After comparing, the output looks like:
Account  DB1 Amt  DB2 Amt  DB3 Amt  DB4 Amt  Match?
1234     1        2        1        10       No
9999     66       66       66       66       Yes

With respect, this is not a huge problem. It's a medium-sized problem in which you must process 7.5 megarows, and in your example those rows appear to be relatively short. If you have access to a computer with more than 2 GB of RAM, you can probably do the whole job in RAM fairly easily; almost any 64-bit Windows laptop of 2011 vintage or later can manage it.
You asked whether you should draw your data directly from the database systems or from CSVs. If you're planning to use this system in production, you should stick to working with the database systems. That avoids the possibility of working with stale data by mistake.
From your question it looks like the Account values in your various systems match each other exactly, without a lot of monkey-business about fuzzy matching. That is, it seems that an account is called "1234" in several databases, and not "1234" in one of them, "1234-001" in another, and "A1234-2014" in a third. That is very good news. It means you can use such things as HashSets to handle them in memory.
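As a rough illustration of that in-memory approach, here is a minimal sketch that keys a Dictionary on Account (a Dictionary rather than a plain HashSet, since you also need to carry the amounts alongside the key). The per-database loaders are placeholders for your own data readers, and the duplicate-key policy (last value wins) is only an assumption for the sketch:

using System;
using System.Collections.Generic;
using System.Linq;

class AccountComparer
{
    const int SourceCount = 5;

    static void Main()
    {
        // amounts[account][i] holds the amount reported by source i (null if that source has no row)
        var amounts = new Dictionary<string, decimal?[]>();

        for (int i = 0; i < SourceCount; i++)
        {
            foreach (var pair in LoadSource(i))
            {
                decimal?[] row;
                if (!amounts.TryGetValue(pair.Key, out row))
                    amounts[pair.Key] = row = new decimal?[SourceCount];
                row[i] = pair.Value;                  // last duplicate wins in this sketch
            }
        }

        foreach (var kvp in amounts)
        {
            bool match = kvp.Value.All(a => a.HasValue && a.Value == kvp.Value[0]);
            Console.WriteLine(kvp.Key + "\t" + string.Join("\t", kvp.Value) + "\t" + (match ? "Yes" : "No"));
        }
    }

    // Placeholder: in practice this would run "SELECT Account, Amount FROM ..." against one of
    // the five databases (optionally restricted to a test range of accounts).
    static IEnumerable<KeyValuePair<string, decimal>> LoadSource(int sourceIndex)
    {
        yield break;
    }
}

From the populated dictionary you can emit exactly the Account / per-database amount / Match? table shown in the question.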
You probably should set your system up so it can process either all data or an arbitrary subset of Account values. For example, you might allow a subset to be specified as '1000' - '1999'. This will come in very handy for testing, because you'll be able to do short runs with just a few thousand accounts. This should mean you can get everything working with short-running subset queries. When you're satisfied that everything is working well, you can start a production run and go home for the night.
Notice that you might also, if this is a one-off job, simply install some DBMS (MySQL or PostgreSQL would be good open source choices) on your personal machine, load the various extracts from various database systems into tables in it, and do JOINs on them.
Finally, if you are inheriting data of unknown quality, there is a very helpful free data-inspection and cleaning tool called OpenRefine (formerly Google Refine).

Related

C# Web API SQL Server 2012: Errors Under Load Test (14 million+ Records)

I am trying to build a Web API that connects to a SQL Server 2012 database with 14 million+ records in a single table. The only action for the API is GET, and it will be available to the public as an Open Data API, so it needs to be able to handle many concurrent users.
The table has seven fields:
field1 bigint
field2 nvarchar(50)
field3 nvarchar(10)
field4 float
field5 datetime
field6 nvarchar(20)
field7 nvarchar(10)
I have written simple test APIs in:
- C# .NET 4.6.1
- C# .NET Core 2
I've also tried ApirIO as a Nuget package, and as a command-line app.
I have also tried using Python Eve with SQLAlchemy but with similar results.
The API works in that I can see the results on my browser, Postman, cURL, etc. But when I try to load test using Vegeta at a rate of 30 requests per second (test duration 30 seconds), I get multiple connection errors and latency rises to around 30 seconds.
I have pasted the results for a load test on the API running on the AperIO command line app:
Requests [total, rate] 900, 30.03
Duration [total, attack, wait] 59.9700536s, 29.966666367s, 30.003387233s
Latencies [mean, 50, 95, 99, max] 29.903549803s, 30.002625352s, 30.004389905s, 30.012575115s, 30.03090955s
Bytes In [total, mean] 49579, 55.09
Bytes Out [total, mean] 0, 0.00
Success [ratio] 1.22%
Status Codes [code:count] 200:11 0:889
Error Set:
Get http://localhost:18092/xyz?pageSize=25: net/http: timeout awaiting response headers
I have tried C# with and without OData, and with manually coded paging classes with the max page size set to 50, then 25, then 10, then 5, but the results are all broadly similar.
Note: I truncated the table and re-populated with 5000 records and there were no problems with the load test. I re-populated with 14 million records and the errors re-appeared.
Is there some way SQL Server can be optimised to serve a database with a recordset of over 14 million to an API with multiple concurrent users (e.g. 1000 users) with a very low latency (around 0.015sec)?
Thanks in advance,
Mo
Edit (to clarify from comments):
The server has RAM 32GB / Memory 160GB / 4 CPUs
The CPUs went to 100% after about 5 seconds; we temporarily upped the server to 8 CPUs, but they also went to 100%.
There is a clustered and a non-clustered index on the table.
The Linq code for the Get method in the Controller is:
var source = (from aqm in _context.aqm_context.OrderBy(a => a.field1)
              select aqm).AsQueryable();
var items = source.Skip((CurrentPage - 1) * PageSize).Take(PageSize).ToList();
return items;
From SQL Server Profiler, the SQL being executed is
SELECT
[Extent1].[field1] AS [field1],
[Extent1].[field2] AS [field2],
[Extent1].[field3] AS [field3],
[Extent1].[field4] AS [field4],
[Extent1].[field5] AS [field5],
[Extent1].[field6] AS [field6],
[Extent1].[field7] AS [field7]
FROM [dbo].[table] AS [Extent1]
ORDER BY [Extent1].[field1] ASC
OFFSET 0 ROWS FETCH NEXT 10 ROWS ONLY
I was originally using MongoDB with Python Eve (with the same recordset) and I was getting 0.015 sec latency with no errors running the same tests. MongoDB is not on my organisation's list of approved technologies (yet) so I was asked to try SQL Server as a backend. I tried the same Eve config using SQLAlchemy connected to SQL Server and immediately received poor latency and connection errors.
So the question could also be:
- Is SQL Server a valid choice to serve an open data API with a recordset of 14 million to the public with potentially thousands of concurrent users? Or is a NoSQL document store like MongoDB better suited?
To answer your question: can SQL Server handle this amount of data?
Yes, we are already using it with 400 million rows. Below are the steps we followed:
We opted for a multi-core server, split tempdb into multiple files, and put them on SSD.
For us, read operations were the priority, so indexing was designed around them.
We enabled paging via OFFSET/FETCH and made sure we were hitting the right indexes (a sketch of this pattern is at the end of this answer).
We made it mandatory that client applications can extract only a limited number of rows at a time, never everything in one go. Only warehousing applications were allowed to pull bulk data, and even then only during off-peak hours (around 2 AM).
Indexes can work magic if reads are the priority.
We also have archiving in place, which moves records older than 2 years into archive tables. This keeps the size of our main table almost static.
Rather than asking exactly what you are doing (which is not 100% clear to me, since there are several questions here), I thought I would share what we did to achieve this performance. Again, this is not fool-proof, but it works for our case; something else might work better for you.
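For illustration, here is a minimal sketch of that capped OFFSET/FETCH paging done directly with ADO.NET, using the table and column names from the profiler output in the question. The connection string handling and the 50-row cap are assumptions, not part of the original setup:

using System;
using System.Data;
using System.Data.SqlClient;

static class PagedReader
{
    // Returns one page of rows, never more than MaxPageSize, using the OFFSET/FETCH pattern.
    public static DataTable GetPage(string connectionString, int page, int pageSize)
    {
        const int MaxPageSize = 50;                              // server-side cap per request
        pageSize = Math.Min(Math.Max(pageSize, 1), MaxPageSize);

        const string sql = @"
            SELECT field1, field2, field3, field4, field5, field6, field7
            FROM [dbo].[table]
            ORDER BY field1
            OFFSET @offset ROWS FETCH NEXT @pageSize ROWS ONLY;";

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.Add("@offset", SqlDbType.Int).Value = (page - 1) * pageSize;
            cmd.Parameters.Add("@pageSize", SqlDbType.Int).Value = pageSize;

            var result = new DataTable();
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                result.Load(reader);
            }
            return result;
        }
    }
}

With an index on field1, SQL Server can satisfy each page without scanning the whole 14-million-row table, although very deep pages (large offsets) still get progressively more expensive.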

How to Minimize Data Transfer Out From Azure Query in C# .NET

I have a small table (23 rows, 2 int columns), just a basic user-activity monitor. The first column is the user id. The second column holds a value that should be unique for every user, but I must alert the users if two values are the same. I'm using an Azure SQL database to hold this table, and LINQ to SQL in C# to run the query.
The problem: Microsoft bills me based on data transferred out of their data centers. I would like all of my users to be aware of the current state of this table at all times, second by second, while keeping data transfer under 5 GB per month. I'm thinking along the lines of a LINQ to SQL expression such as
UserActivity.Where(x => x.Val == myVal).Count() > 1;
But this would download the table to the client, which cannot happen. Should I be implementing a LINQ solution? Or would a SqlDataReader download less metadata from the server? Am I taking the right approach by using a database at all? Gimme thoughts!
If it is data transfer you are worried about, you need to do your processing on the server and return only the results. A SqlDataReader solution can return a smaller, already-processed set of data to minimise the traffic.
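For illustration, a minimal sketch of that idea, assuming the table is named UserActivity with an int column Val (names borrowed from the LINQ snippet in the question); the connection string is a placeholder:

using System.Data;
using System.Data.SqlClient;

static class ActivityCheck
{
    // Returns true if more than one row currently holds the given value.
    // Only a single integer is transferred out of the database.
    public static bool ValueIsDuplicated(string connectionString, int myVal)
    {
        const string sql = "SELECT COUNT(*) FROM UserActivity WHERE Val = @val;";

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.Add("@val", SqlDbType.Int).Value = myVal;
            conn.Open();
            int count = (int)cmd.ExecuteScalar();
            return count > 1;
        }
    }
}

Only the final count crosses the wire, a few bytes per request instead of the whole (admittedly tiny) table.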
A couple thoughts here:
First, I strongly encourage you to profile the SQL generated by your LINQ-to-SQL queries. There are several tools available for this, here's one at random (I have no particular preference or affiliation):
LINQ Profiler from Devart
Your prior experience with LINQ query inefficiency notwithstanding, the LINQ sample you quote in your question isn't particularly complex, so I would expect you could make it (or something similar) work efficiently, given a good feedback mechanism like the tool above.
Second, you don't explicitly mention whether your query client runs inside Azure or outside, but I gather from your concern about data egress costs that it's running outside Azure. So the data egress will consist of query results sent over the TDS protocol (the low-level protocol for SQL Server), which is pretty efficient. Some quick back-of-the-napkin math shows that you should be fine to stay below your monthly 5 GB limit:
23 users
× 10 hours/day
× 30 days/month (less if only weekdays)
× 3600 requests/hour/user
× 32 bits (4 bytes) of raw data per response
= about 24.8 million responses per month × 4 bytes ≈ 95 MB of raw response data per month
Even if you assume 10x overhead of TDS for header metadata, etc. (and if my math is right :-) ) then you've still got plenty of room underneath 5 GB. The point isn't that you should stop thinking about it and assume it's fine... but don't assume it isn't fine, either. In fact, don't assume anything. Test, and measure, and make an informed choice. I suspect you'll find a way to stay well under 5 GB without much trouble, even with LINQ.
One other thought... perhaps you could consider running your query inside Azure, and weigh the cost of that vs. the cost of data egress under the "query running outside Azure" scenario? This could (for example) take the form of a small Azure Web Job that runs the query every second and notifies the 23 users if the count goes above 1.
Azure Web Jobs
In essence, you wouldn't notify them if the condition is false, only when it's true. As for the notification mechanism, there are various cloud-friendly options:
Azure mobile push notifications
SMS messaging
SignalR notifications
The key here is to determine whether it's more cost-effective, and in line with any bigger-picture technology or business goals, to have each user issue the query continuously, or to use some separate process in Azure to notify users asynchronously when the "trigger condition" is met.
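If you go down that route, here is a hedged sketch of what the body of such a WebJob might look like. The connection string, the table and column names, and the NotifyUsers call are all placeholders; the query counts values shared by more than one user, and only a single integer leaves the database each second:

using System;
using System.Data.SqlClient;
using System.Threading;

class DuplicateWatcher
{
    // Counts how many Val values are shared by more than one user.
    const string DuplicateCountSql = @"
        SELECT COUNT(*)
        FROM (SELECT Val FROM UserActivity GROUP BY Val HAVING COUNT(*) > 1) AS dupes;";

    static void Main()
    {
        string connectionString = "<azure-sql-connection-string>";   // placeholder

        while (true)
        {
            int duplicatedValues;
            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(DuplicateCountSql, conn))
            {
                conn.Open();
                duplicatedValues = (int)cmd.ExecuteScalar();          // only one int leaves the database
            }

            if (duplicatedValues > 0)
            {
                NotifyUsers(duplicatedValues);   // push / SMS / SignalR, whichever mechanism you pick
            }

            Thread.Sleep(TimeSpan.FromSeconds(1));
        }
    }

    static void NotifyUsers(int duplicatedValues)
    {
        // Placeholder for the chosen notification mechanism.
        Console.WriteLine(duplicatedValues + " duplicated value(s) detected.");
    }
}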
Best of luck!

Streaming financial time series stored in Cassandra - poor performance

We're in the process of evaluating Cassandra for use with financial time series data and are trying to understand the best way to store and retrieve the data we need in the most performant way. We are running Cassandra on a virtual machine to which 8 cores and 8 GB of RAM have been allocated. The remaining resources of the host machine (another 8 cores and 12 GB of RAM) are used for development of the test client application. Our data is currently stored in flat files and is of the order of 100-150 GB per day (uncompressed). In terms of retrieving the data from Cassandra, we need to be able to stream any of the following:
All of the data - i.e. stream data for all securities for an entire day ordered by timestamp
All of the data for a particular time period which is a subset of the entire day ordered by timestamp
Data for a subset of the securities and a particular time period which is a subset of the entire day ordered by timestamp.
We have so far experimented with partitioning the data based on security and day with a table that has the following schema:
create table MarketData (
    Security text,
    Date date,
    Timestamp timestamp,
    -- other columns
    primary key ((Security, Date), Timestamp)
);
However, when we perform a simple paged query from within a C# client application, as below, it takes roughly 8 seconds to retrieve 50K records, which is very poor. We've experimented with different page sizes, and a page size of approximately 450 seems to give the least bad results.
var ps = client.Session.Prepare(
    "select security, date, timestamp, toUnixTimestamp(timestamp) from marketdata where security = ? and date = ?");
int pageSize = 450;
var statement = ps.Bind("AAPL_O", new LocalDate(2016, 01, 12)).SetPageSize(pageSize);

var stopwatch = new Stopwatch();
stopwatch.Start();
var rowSet = client.Session.Execute(statement);
foreach (Row row in rowSet)
{
    // iterate to force every page to be fetched
}
stopwatch.Stop();
Furthermore, this kind of a schema would also be problematic in terms of selecting SORTED data across partitions (i.e. for multiple securities) since it involves sorting across partitions which Cassandra doesn't seem to be well suited to.
We have also considered partitioning based on minute, with the following schema:
create table MarketData (
    Year int,
    Month int,
    Day int,
    Hour int,
    Minute int,
    Security text,
    Timestamp timestamp,
    -- other columns
    primary key ((Year, Month, Day, Hour, Minute), Timestamp)
);
However, our concern is that our preliminary test of paging through the results of a straightforward 'select' statement performs so poorly.
Are we approaching things in the wrong way? Could our configuration be incorrect? Or is Cassandra maybe not the appropriate bigdata solution for what we are trying to achieve?
Thanks
".... poor performance...."
"We are running Cassandra on a Virtual machine "
I think those two highlighted phrases are related :). Out of curiosity, what is the nature of your hard drive? Shared storage? SAN? Spinning disk? SSD? A shared (mutualised) drive?
Furthermore, this kind of a schema would also be problematic in terms of selecting SORTED data across partitions (i.e. for multiple securities)
Exactly: Cassandra does not sort across partition keys. You'll probably need to create another table (or a materialized view, a new Cassandra 3.0 feature) with PRIMARY KEY ((time_period), security, timestamp) so that you can order by security within a time period.
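To make that concrete, here is a rough sketch using the DataStax C# driver. The partition granularity (one partition per minute), the table name marketdata_by_period, the keyspace and the contact point are all assumptions, and the non-key columns are omitted:

using System;
using Cassandra;

class MarketDataByPeriod
{
    static void Main()
    {
        // Placeholder contact point and keyspace.
        var cluster = Cluster.Builder().AddContactPoint("127.0.0.1").Build();
        ISession session = cluster.Connect("marketdata_ks");

        // Partition by a coarse time period; cluster by security, then timestamp,
        // so rows within a period come back already ordered by security.
        session.Execute(@"
            CREATE TABLE IF NOT EXISTS marketdata_by_period (
                time_period text,
                security text,
                timestamp timestamp,
                PRIMARY KEY ((time_period), security, timestamp)
            )");

        var ps = session.Prepare(
            "SELECT security, timestamp FROM marketdata_by_period WHERE time_period = ?");
        var rows = session.Execute(ps.Bind("2016-01-12 14:30").SetPageSize(5000));

        foreach (Row row in rows)
        {
            Console.WriteLine(row.GetValue<string>("security") + " " +
                              row.GetValue<DateTimeOffset>("timestamp"));
        }

        cluster.Shutdown();
    }
}

Within a single time_period partition the rows come back already clustered by security and then timestamp, so no client-side sorting is needed; the trade-off is that you must pick a time_period granularity that keeps partitions a reasonable size.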
Are we approaching things in the wrong way?
Yes. Why do you want to run a performance benchmark on a virtual machine? Those two ideas are pretty much contradictory. The general recommendation with Cassandra is to use dedicated hard drives (spinning disks at the very least, preferably SSDs). Cassandra read performance is strongly bound to your disk I/O.
With virtual machines and virtualised storage you lose all of Cassandra's optimisations for disk throughput. Writing a sequential block of data to a virtualised disk does not guarantee that the data is actually written sequentially, because the hypervisor/virtual disk controller can reorder it and split it across several blocks on the actual physical disks.
Cassandra deployments on virtual machines are only suited to a proof of concept, to validate a data model and queries. You'll need dedicated physical hard drives to benchmark the actual performance of your data model with Cassandra.

High performance real time project .NET, SQL Server

I have a demanding project and I need your starting guidelines on this!
I need to have a database with approximately 2,000,000 records with lat/lng markers. These markers are moving objects that update their positions every 10 seconds. If a received marker does not exist in the database, it needs to be inserted.
I need the end user to get real-time data via a web request, e.g. www.example.com/getmarkers?minlat=x&maxlat=x&minlng=x&maxlng=x&zoom=x, for the specified zoom level, with markers that overlap each other eliminated.
The main server app will receive the update commands via the TCP and UDP protocols on multiple ports.
Can I use C# and an in-memory DataTable to do all these updates every second? And can the end user hit this DataTable so everything stays in memory and is faster? What do you think about the performance, and what is your opinion on developing a project like this? Real-time data is what I need.
I would prefer to use C# and SQL Server 2008.
Thanks a lot
I'd start off by making estimates based on the following data, with the goal of estimating the number of requests per minute or second.
Average number of moving markers at any time: if you have 200 vehicles to track, how many do you expect to be moving simultaneously? Does the time of day matter? If it does, make sure you base your calculations on the peak hours.
How many simultaneous requests from users do you expect? If you have 800 users, will they be using the application throughout the whole day, only several times a day, or once a week?
Once you have the numbers, multiply them by at least 3. This will accommodate any false assumptions you may have made in the calculations and allow for future growth. (For scale: the 2,000,000 markers updating every 10 seconds that you describe already imply on the order of 200,000 incoming updates per second before any safety factor.)
Once you get the final number, it will be a lot easier to decide whether you need only one or two 6-core CPU servers, four 12-core CPU servers, or a mini data center with in-memory databases and other advanced stuff.

Storing related timed-based data with variable logging frequencies

I work with a data logging system in car racing. I am developing an application that aids in the analysis of this logged data, and I have found some of the query functionality of DataSets, DataTables and LINQ to be very useful, e.g. minimums, averages, etc.
Currently, I am extracting all data from its native format into a DataTable and post-processing that data. I am also currently working with data where all channels are logged at the same rate, i.e. 50 Hz (50 samples per second). I would like to start writing this logged data to a database so it is somewhat platform independent and the extraction process doesn't have to happen every time I want to analyze a dataset.
Which leads me to the main question... Does anyone have a recommendation for the best way to store data that is related by time, but logged at different rates? I have approximately 200 channels that are logged and the rates vary from 1 Hz to 500 Hz.
Some of the methods I have thought of so far are:
creating a DataTable for all data at 500 Hz, using Double.NaN for values that fall between actual logged samples
creating separate tables for each logging frequency, i.e. one table for 1 Hz, another for 10 Hz, and another for 500 Hz
creating a separate table for each channel, with a relationship to a time table. Each time step would then be indexed, and the table of data for each channel would not depend on a fixed time frequency
I think I'm leaning towards the indexed time stamp with a separate table for each channel, but I wanted to find out if anyone has advice on a best practice.
For the record, the datasets can range from 10 MB to 200-300 MB depending on how long the car is on track.
I would like to have a single data store that houses an entire season, or at least an entire race event, so that is something I am considering as well.
Thanks very much for any advice!
Can you create a table something like this?
Channel, Timestamp, Measurement
The database structure doesn't need to depend on the frequency; the frequency can be determined by the amount of time between timestamps.
This gives you more flexibility, as you can write one piece of code to handle the calculations for all the channels; just give it a channel name.
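As a small illustration, here is a sketch of the kind of LINQ you could run over such a (Channel, Timestamp, Measurement) table once it is loaded into memory. The Sample class and the way the rows are obtained are placeholders, not part of the suggestion above:

using System;
using System.Collections.Generic;
using System.Linq;

class Sample
{
    public string Channel { get; set; }
    public DateTime Timestamp { get; set; }
    public double Measurement { get; set; }
}

static class ChannelStats
{
    public static void PrintStats(IEnumerable<Sample> samples, string channel)
    {
        var data = samples.Where(s => s.Channel == channel)
                          .OrderBy(s => s.Timestamp)
                          .ToList();
        if (data.Count == 0) return;

        // The effective logging rate falls out of the gaps between timestamps,
        // so a 1 Hz channel and a 500 Hz channel go through exactly the same code.
        double seconds = (data[data.Count - 1].Timestamp - data[0].Timestamp).TotalSeconds;
        double rateHz = seconds > 0 ? (data.Count - 1) / seconds : 0;

        Console.WriteLine(channel + ": min=" + data.Min(s => s.Measurement) +
                          ", avg=" + data.Average(s => s.Measurement).ToString("F3") +
                          ", ~" + rateHz.ToString("F1") + " Hz");
    }
}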
