I want to distribute a large amount of data across different C# applications. For example, my table contains millions of records. I would like to specify that the first 3 million records are processed by App1, the next 3 million by another C# application, App2, and so on. Table rows are deleted and added as required. Now I want to write a SQL query that will process the first 3 million records. If 5 records are deleted from App1's range, then App1 must fetch the next 5 records from App2's range, and App2 from App3's, so that the amount of data in each app always remains constant.
I have used LIMIT in the SQL query, but I didn't get the required output. How can I write the SQL query for this, and how should I design the C# application?
It looks a bit as if you want to take over from the database and do, in your own application, the processing that a database is designed and tailored to do. You talk of a SQL query with a LIMIT clause. Don't use that. Millions of records is not much in database terms. If you have performance issues, you may need to index your table or revisit the query design (check its execution plan for performance problems).
If you really cannot let the database do the task and you need to process the records one by one in your application, then network latency and bandwidth are likely to be earlier candidates for performance issues, and you won't make those any faster by using multiple apps (let alone the cost of such queries).
If my observations are wrong, and your processing of the records must take place outside the database, and neither the network, the processors, nor the database machine is a bottleneck, and multiple applications will provide a performance gain, then I suggest you create a dispatch application that reads the records and makes them available to your other applications (or better: threads) as normal POCOs. This is a much easier way of spreading the processing, and the dispatch application (or thread) can act as a kind of funnel for your processing applications.
However, look at the cost / benefit equation: is the trouble really going to gain you some performance, or is it better to revisit your design and find a more practical solution?
That sounds like a really bad idea. Requesting a limit of 3 million records is a very slow operation.
An alternative approach would be to have an instance number column and have each instance of your application reserve rows as it needs them by writing its instance number into this column. Process your data in smaller chunks if possible.
Adding an index to the instance number column will allow you to count how many rows you have already handled and also to find the next batch of 1000 (for example) that haven't been assigned to any instance yet.
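A rough sketch of how an instance could claim its next chunk this way, assuming SQL Server, a table called WorkItems, a BIGINT Id and an InstanceNumber column that is NULL for unassigned rows (all of those names are assumptions):

    using System.Collections.Generic;
    using System.Data;
    using System.Data.SqlClient;

    static List<long> ClaimNextBatch(string connectionString, int instanceNumber, int batchSize)
    {
        // Atomically reserve up to batchSize unassigned rows for this instance and
        // return their ids. READPAST lets concurrent instances skip rows that are
        // currently locked by another instance instead of blocking on them.
        const string sql = @"
            UPDATE TOP (@batchSize) w
            SET    InstanceNumber = @instance
            OUTPUT inserted.Id
            FROM   dbo.WorkItems w WITH (ROWLOCK, READPAST)
            WHERE  w.InstanceNumber IS NULL;";

        var ids = new List<long>();
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.Add("@batchSize", SqlDbType.Int).Value = batchSize;
            cmd.Parameters.Add("@instance", SqlDbType.Int).Value = instanceNumber;
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    ids.Add(reader.GetInt64(0));   // Id assumed to be BIGINT
            }
        }
        return ids;
    }

Each application can then top itself back up to its target count whenever rows disappear, instead of relying on fixed LIMIT/OFFSET ranges.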
I would benefit from a better understanding of the details of the application and of the process used to get, select, delete, etc. the data. However, to give a viable answer a shot:
In short, use partitioned tables and a partitioned view. Each application is "keyed" to those tables through the common partitioned view; if any application has to act on another table (or "key"), it can use the same view to act on the other tables.
In more detail ...
If you have the Enterprise or Developer edition of SQL Server, or any other edition that provides distributed partitioned views, then you can create three or more tables with a partitioning column ("App1", "App2", "App3"), along the lines of what Mark Byers has said; that would distribute the processing work against the data evenly.
Now create a view (WITH SCHEMABINDING) along the lines of:

    SELECT Field1, Field2, Field3 FROM dbo.Table1
    UNION ALL
    SELECT Field1, Field2, Field3 FROM dbo.Table2
    UNION ALL
    SELECT Field1, Field2, Field3 FROM dbo.Table3

(A partitioned view is defined with UNION ALL rather than UNION.)
Create a unique clustered key on the one or two fields that uniquely identify your data. When this is done, you can select/delete/update from the view WHERE partitioncolumn = 'App1' AND Id = ?. This setup allows action queries (insert/update/delete) through the view while acting only on the table that holds that partition's data.
So App1 sends a WHERE filter on "App1", and the database engine acts only on table1 even though the query goes through the view.
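A minimal sketch of what that looks like from the C# side, assuming a view named dbo.PartitionedView with a PartitionColumn (both names are made up here):

    using System.Data.SqlClient;

    // App1 only ever passes its own partition key, so even though the statement
    // goes through the common view, the engine touches only App1's table.
    string appKey = "App1";

    using (var conn = new SqlConnection(connectionString))   // connectionString supplied elsewhere
    using (var cmd = new SqlCommand(
        "SELECT Field1, Field2, Field3 FROM dbo.PartitionedView " +
        "WHERE PartitionColumn = @app AND Id = @id", conn))
    {
        cmd.Parameters.AddWithValue("@app", appKey);
        cmd.Parameters.AddWithValue("@id", someId);           // someId: the key being processed
        conn.Open();
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                // process the row ...
            }
        }
    }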
I am making an application in C#, which is the first professional project I will be adding to my portfolio. The application connects to a remote SQL Server and queries specific tables to get data such as Student information, details on courses, and data of student enrollment in each course.
My question is, even though the query will return a single record most of the time, is there a need to cache the tables? There are only 3 tables (Students, Courses, Enrollment), and Courses is the only table that doesn't change all that often, at least in comparison to the other two.
In a nutshell, the app is a CLI that lets the user view school courses, registered students, and each student's enrollment in those courses. The app lets the user enter student info such as name, mailing address and contact information, which is then persisted to the SQL Server. The same goes for the course details like the CourseID, Name and Description, and for the enrollment, where the server joins the StudentID and CourseID in a record to show that the specified student is enrolled in that course.
I am currently running a local instance of MSSQL, but plan to create a lightweight Virtual Machine to hold the SQL server to replicate a remote access scenario.
I figure that if the application is ever deployed to a large scale environment, the tables will grow to a large size and a simple query may take some time to execute remotely.
Should I be implementing a cache system if I envision the tables growing to a large size? Or should I just do it out of good practice?
So far, the query executes very quickly. However, that could be due to the fact that the MSSQL installation is local, or the fact that the tables currently only have 2-3 records of sample data. I do plan to create more sample data in the future to see if the execution time is manageable.
Caching is an optimisation tool. Try to avoid premature optimisation, especially in cases when you do not know (and can't even guess) what you are optimising for (CPU, Network, HD speed etc).
Keep in mind that databases are extremely efficient at searching and retrieving data. Provided adequate hardware is available, a database engine will always outperform C# cache structures.
Where caching is useful is in scenarios where network latency (between DB and the app) is an issue or chatty application designs (multiple simple DB calls into small tables in one interaction/page load).
Well, for a C# app (desktop/mobile), a cache system is generally good practice. But you can build a project for a school without a cache system, because leaving it out won't hurt your app's performance much at this scale. It's up to you whether you want to use it or not.
Caching is a good option for data that is accessed frequently but does not change very often. In your case that would apply to 'Courses', for which you said the data won't change frequently.
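If you do end up caching it, a minimal sketch for the slowly changing 'Courses' data could be a time-limited in-memory copy like the one below (the Course shape and the 30-minute expiry are assumptions):

    using System;
    using System.Collections.Generic;

    public class Course
    {
        public int CourseID { get; set; }
        public string Name { get; set; }
        public string Description { get; set; }
    }

    // Naive time-based cache: reload the Courses table only when the cached
    // copy is older than the chosen expiry window.
    public class CourseCache
    {
        private readonly TimeSpan _ttl = TimeSpan.FromMinutes(30);
        private readonly Func<List<Course>> _loadFromDb;   // e.g. SELECT CourseID, Name, Description FROM Courses
        private List<Course> _courses;
        private DateTime _loadedAt;

        public CourseCache(Func<List<Course>> loadFromDb)
        {
            _loadFromDb = loadFromDb;
        }

        public IReadOnlyList<Course> GetCourses()
        {
            if (_courses == null || DateTime.UtcNow - _loadedAt > _ttl)
            {
                _courses = _loadFromDb();
                _loadedAt = DateTime.UtcNow;
            }
            return _courses;
        }
    }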
However, for data that is going to grow in size and will see frequent inserts/updates, it is better to think about optimizing the way it is stored and retrieved from the data store. In your case, 'Students' and 'Enrollment' are the tables that can be expected to get lots of inserts/updates over time.
So it is better to write optimized procedures to perform CRUD operations on these tables and keep indexes of the right sort on the tables as well. It will not only provide better manageability of data but also give the performance that you are looking for, when compared to caching the results.
I'm quite new to coding in general, and I'm looking to copy 47 columns with roughly 300,000 rows of data from an Oracle database to a SQL Server database on a daily basis. The code will run as a Windows Service, executing at the same time every day (or more likely every night).
The data from the Oracle DB table (let's call this the Oracle_Source) will be used to both append to a history table (call this SQL_History) and also to append new/update matching/delete missing rows from a live table (call this SQL_Live). The two types of databases are housed on different servers, but the two SQL tables are on the same DB.
I have a few questions around the best way to approach this.
Using VB/C#, is it faster to loop through the rows of Oracle_Source (either one by one or in batches of 100/1000/etc.) and insert/update SQL_History/SQL_Live, OR to copy the entire Oracle_Source table in one go and insert it into the SQL tables? Previously I have used the loop approach to download data into a .csv.
Using the more efficient of the above methods, would it be faster to work on both SQL tables simultaneously, OR to copy the data into the SQL_History table and then use that to APPEND/UPDATE/DELETE from the SQL_Live table?
Am I approaching this completely wrong?
Any other advice available is also much appreciated.
The correct question is "What is the fastest way to copy the table?"
In your specific case, with two different servers and a "big" table to copy, you are probably limited by network I/O.
So the first point is to touch only the rows that must be changed (update/insert/delete), so that fewer bytes have to move.
To answer your first point: use transactions to improve the write speed on SQL Server. The right transaction size depends on different factors (db, machine, ...), but I usually build transactions of 500/1000 simple commands. In my personal experience, you can also use multi-row INSERTs and send around 500 rows per INSERT without performance issues.
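As a rough sketch of that idea (plain ADO.NET assumed; the table and column names here are made up), commit roughly every 500 commands instead of one transaction per row:

    using System.Data.SqlClient;

    const int batchSize = 500;
    int inBatch = 0;

    using (var conn = new SqlConnection(sqlServerConnectionString))
    {
        conn.Open();
        var tx = conn.BeginTransaction();
        foreach (var row in rowsFromOracle)   // rowsFromOracle: whatever enumeration you read from Oracle_Source
        {
            using (var cmd = new SqlCommand(
                "INSERT INTO dbo.SQL_History (Col1, Col2) VALUES (@c1, @c2)", conn, tx))
            {
                cmd.Parameters.AddWithValue("@c1", row.Col1);
                cmd.Parameters.AddWithValue("@c2", row.Col2);
                cmd.ExecuteNonQuery();
            }

            if (++inBatch == batchSize)        // commit every ~500 commands, then start a new transaction
            {
                tx.Commit();
                tx = conn.BeginTransaction();
                inBatch = 0;
            }
        }
        tx.Commit();                           // commit whatever is left in the final batch
    }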
In my experience, a bulk copy is faster than even efficient INSERT, UPDATE and DELETE statements, because the database does not maintain keys and does not check for duplicate rows during the load.
Better explained:
TRUNCATE all existing data,
DISABLE keys,
do a massive INSERT of all rows, and
re-ENABLE keys.
This is the fastest way to copy a table, but if the two servers communicate over a slow network connection it may not be the best choice.
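If a bulk load over the network is still the right trade-off for you, SqlBulkCopy can stream straight from the Oracle data reader into the SQL Server table; a sketch (connection strings, the managed Oracle provider and table names are assumptions):

    using System.Data.SqlClient;
    using Oracle.ManagedDataAccess.Client;   // assumption: Oracle's managed ADO.NET provider

    using (var oracleConn = new OracleConnection(oracleConnectionString))
    using (var oracleCmd = new OracleCommand("SELECT * FROM ORACLE_SOURCE", oracleConn))
    {
        oracleConn.Open();
        using (var reader = oracleCmd.ExecuteReader())
        using (var sqlConn = new SqlConnection(sqlServerConnectionString))
        {
            sqlConn.Open();
            using (var bulk = new SqlBulkCopy(sqlConn))
            {
                bulk.DestinationTableName = "dbo.SQL_History";
                bulk.BatchSize = 10000;       // send rows to the server in chunks
                bulk.BulkCopyTimeout = 0;     // no timeout for a long nightly load
                // Assumes source columns line up with the destination by position;
                // otherwise add entries to bulk.ColumnMappings.
                bulk.WriteToServer(reader);   // streams rows without loading them all into memory
            }
        }
    }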
Obviously, the best choice depends on your infrastructure and the table size.
For example:
If you have one server on your LAN and the second server in the cloud, the bottleneck is the speed of the internet connection, and you must pay more attention to efficient communication (fewer bytes).
If both servers are on your LAN with gigabit connections, the full network throughput is probably somewhere around 100 MB/s, and you can simply move all the table rows without a headache.
Currently I am creating a C# application which has to read a lot of data (over 2,000,000 records) from an existing database and compare it with a lot of other data (also about 2,000,000 records) that does not exist in the database. These comparisons will mostly be string comparisons. The amount of data will grow much bigger, and therefore I need to know which solution will give the best performance.
I have already searched the internet and I came up with two solutions:
Solution 1
The application will execute a single query (SELECT column_name FROM table_name, for example) and store all the data in a DataTable. The application will then compare all the stored data with the input, and if there is a comparison it will be written to the database.
Pros:
The query will only be executed once. After that, I can use the stored data multiple times for all incoming records.
Cons:
As the database grows bigger, so will my RAM usage. Currently I have to work with 1 GB (I know, tough life) and I'm afraid it won't fit if I practically download the whole content of the database into it.
Processing all the data will take lots and lots of time.
Solution 2
The application will execute a specific query for every record, for example
SELECT column_name FROM table_name WHERE value_name = value
and will then check whether the DataTable has any records, something like
if (dataTable.Rows.Count > 0) { /* etc */ }
If it has records, I can conclude there are matching records and I can write to the database.
Pros:
Probably a lot less usage of RAM since I will only get specific data.
Processing goes a lot faster.
Cons:
I will have to execute a lot of queries. If you are interested in numbers: it will probably be around 5 queries per record, and with 2,000,000 records that would be 10,000,000 queries.
My question is, what would be the smartest option, given that I have limited RAM?
Any other suggestions are welcome as well, of course.
If you have SQL Server available to you, this seems like a job directly suited to SQL Server Integration Services. You might consider using that tool instead of building your own. It depends on your exact business needs, but in general wouldn't data merging like this be a batch/unattended or tool-based operation?
You might be able to code it to run faster than SSIS, but I'd give it a try first to see if it's acceptable to you, and save yourself the cost of the custom development.
I am developing a C# application working with millions of records retrieved from a relational database (SQL Server). My main table "Positions" contains the following columns:
PositionID, PortfolioCode, SecurityAccount, Custodian, Quantity
Users must be able to retrieve quantities consolidated by some predefined set of columns, e.g. {PortfolioCode, SecurityAccount} or {PortfolioCode, Custodian}.
First, I simply used dynamic queries in my application code but, as the database grew, the queries became slower.
I wonder if it would be a good idea to add another table that will contain the consolidated quantities. I guess it depends on the distribution of those groups?
Besides, how to synchronize the source table with the consolidated one?
In SQL Server you could use indexed views to do this, it'd keep the aggregates synchronised with the underlying table, but would slow down inserts to the underlying table:
http://technet.microsoft.com/en-us/library/ms191432.aspx
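As a hedged illustration of what such an indexed view could look like for the Positions table described above (the view name, the run-it-from-C# approach and the assumption that Quantity is NOT NULL are mine; indexed views need SCHEMABINDING, two-part table names and a COUNT_BIG(*) when grouping):

    using System.Data.SqlClient;

    // DDL for an indexed view whose per-(PortfolioCode, SecurityAccount) totals
    // are maintained by the engine. CREATE VIEW must be the only statement in
    // its batch, hence two separate commands.
    string[] batches =
    {
        @"CREATE VIEW dbo.PositionTotals
          WITH SCHEMABINDING
          AS
          SELECT PortfolioCode, SecurityAccount,
                 SUM(Quantity) AS TotalQuantity,
                 COUNT_BIG(*)  AS RowCnt
          FROM dbo.Positions
          GROUP BY PortfolioCode, SecurityAccount;",

        @"CREATE UNIQUE CLUSTERED INDEX IX_PositionTotals
          ON dbo.PositionTotals (PortfolioCode, SecurityAccount);"
    };

    using (var conn = new SqlConnection(connectionString))   // connectionString supplied elsewhere
    {
        conn.Open();
        foreach (var sql in batches)
        {
            using (var cmd = new SqlCommand(sql, conn))
            {
                cmd.ExecuteNonQuery();
            }
        }
    }

Queries grouping by those columns can then read dbo.PositionTotals (WITH (NOEXPAND) on non-Enterprise editions) instead of re-aggregating millions of rows.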
If it's purely a count of grouped rows in a single table, would standard indexing not suffice here? More info on your structure would be useful.
Edit: Also, it sounds a little like you're using your OLTP server as a reporting server? If so, have you considered whether a data warehouse and an ETL process might be appropriate?
I am developing an application with Fluent NHibernate / NHibernate 3 / SQLite. I have run into a very specific problem that I need help with.
I have a product database and a batch database. Products are around 100k, but batches are around the 11 million+ mark as of now. When a product is selected, I need to fill a ComboBox with its batches. As I do not want to load all the batches at once because of memory constraints, I load them directly from the database when the product is selected. But the problem is that SQLite (or maybe the combination of SQLite and NHibernate) is a little slow at this. It normally takes around 3+ seconds to retrieve the batches for a particular product. Although that might not seem like a slow scenario, I want to know whether I can improve this time. I need sub-second results to make order entry a smooth experience.
The details:
New products and batches are imported periodically (bi-monthly).
Nothing in the already persisted products or batches ever changes (no updates).
Storing products is not an issue. Batches are the main culprit.
Product Ids are long
Batch Ids are string
Batches contain 3 fields, rate, mrp (both decimal) & expiry (DateTime).
The requirements:
The data has to be stored in a file based solution. I cannot use a client-server approach.
Storage time is not important. Search & retrieval time is.
I am open to storing the batch database using any other persistence model.
I am open to using anything like Lucene, a NoSQL database (like redis), or an OODB, provided they are based on a single-storage-file implementation.
Please suggest what I can use for fast object retrieval.
Thanks.
You need to profile or narrow down to find out where those 3+ seconds are.
Is it the database fetching?
Try running the same queries in a SQLite browser. Do the queries take 3+ seconds there too? Then you might need to do something with the database, like adding some good indexes.
Is it the filling of the combobox?
What if you only fill the first value in the combobox and throw away the others? Does that speed up the performance? Then you might try BeginUpdate and EndUpdate.
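For reference, the combobox part would look something like this (WinForms assumed), so the control is repainted once instead of once per item:

    // Suspend drawing while items are added, resume once at the end.
    comboBox.BeginUpdate();
    try
    {
        comboBox.Items.Clear();
        foreach (var batch in batchesForProduct)   // batchesForProduct: the batches loaded for the selected product
        {
            comboBox.Items.Add(batch);
        }
    }
    finally
    {
        comboBox.EndUpdate();
    }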
Are the 3+ seconds spent elsewhere? If so, find out where.
This may seem like a silly question, but I figured I'd double-check before proceeding to alternatives or other optimizations: is there an index (or, hopefully, a primary key) on the Batch Id column in your Batch table? Without indexes, those kinds of searches will be painfully slow.
For fast object retrieval, a key/value store is definitely a viable alternative. I'm not sure I would necessarily recommend redis in this situation, since your Batches database may be a little too large to fit into memory; although redis also persists to disk, it is generally better suited to a dataset that fits entirely in memory.
My personal favourite would be mongodb - but overall the best thing to do would be to take your batches data, load it into a couple of different nosql dbs and see what kind of read performance you're getting and pick the one that suits the data best. Mongo's quite fast and easy to work with - and you could probably ditch the nhibernate layer for such a simple data structure.
There is a daemon that needs to run locally, but depending on the size of the db it will be a single file (or a few files if it has to allocate more space). Again, ensure there is an index on your batch id column to ensure quick lookups.
3 seconds to load ~100 records from the database? That is slow. You should examine the generated sql and create an index that will improve the query's performance.
In particular, the ProductId column in the Batches table should be indexed.
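If the index is not already created by your mapping, a one-off statement against the SQLite file does it; for example through the NHibernate session (index, table and column names assumed from the description):

    // Run once, e.g. right after the bi-monthly import.
    session.CreateSQLQuery(
        "CREATE INDEX IF NOT EXISTS IX_Batches_ProductId ON Batches (ProductId)")
           .ExecuteUpdate();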