I'm using an Oracle 12c database and I wonder what the optimal way is to iterate in C# over the result of a select query. Once I get the row values I use them to do some work.
My idea is to use the full processor capacity, so I thought I need one thread per CPU core. Each thread would have its own connection, which it uses to fetch (select count(*) from table where condition)/(cores) rows, and then each thread does the work.
Each table has more than 500,000 rows.
Am I right, or is there a better way to do this?
Thank you in advance, and I apologize for my English.
It all DEPENDS... :).
Does it all sit on the same box? Oracle and your program?
If not, then the number of connections SHOULD NOT be a function of the number of CPU cores! It should depend on your network infrastructure: the number of connections between the box running your program and the box running your database. Sometimes even over one physical link it is possible to get higher throughput with multiple DB connections, but that has its own limits; it helps as long as the DB does not devote its maximum resources to a single request but rather shares them among multiple requests, and when producing each row of the result takes time. That is best decided by benchmarking.
If your program and database share the same box, that is a different scenario. In that case, I would generally use at most half of the available cores on my app's side, as the DB needs cores for its own processing too and the large number of result rows keeps arriving continuously...
It also depends on what you do next with the rows.
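For what it's worth, rather than pre-splitting the result into (count/cores) slices, a common pattern is to let one connection stream the rows and fan the CPU-bound work out to a pool of workers. A minimal sketch, assuming the Oracle.ManagedDataAccess.Client provider and a hypothetical connection string, query and DoWork method:

    using System;
    using System.Collections.Concurrent;
    using System.Threading.Tasks;
    using Oracle.ManagedDataAccess.Client;

    class RowWorker
    {
        static void Main()
        {
            // Bounded queue so the reader cannot run arbitrarily far ahead of the workers.
            var rows = new BlockingCollection<string>(boundedCapacity: 1000);

            // Producer: a single connection streams the result set.
            var producer = Task.Run(() =>
            {
                using (var conn = new OracleConnection("User Id=...;Password=...;Data Source=..."))
                using (var cmd = new OracleCommand("select some_column from some_table where some_condition = 1", conn))
                {
                    conn.Open();
                    using (var reader = cmd.ExecuteReader())
                        while (reader.Read())
                            rows.Add(reader.GetString(0));
                }
                rows.CompleteAdding();
            });

            // Consumers: one worker per core does the CPU-bound part.
            Parallel.ForEach(
                rows.GetConsumingEnumerable(),
                new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
                DoWork);

            producer.Wait();
        }

        static void DoWork(string value)
        {
            // placeholder for the real per-row work
        }
    }

This way the degree of parallelism is tuned to the CPU-bound work, while the number of database connections stays independent of the core count, as the answer above suggests.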
I have a small table (23 rows, 2 int columns), just a basic user-activity monitor. The first column represents the user id. The second column holds a value that should be unique to every user, but I must alert the users if two values are the same. I'm using an Azure SQL database to hold this table, and LINQ to SQL in C# to run the query.
The problem: Microsoft will bill me based on data transferred out of their data centers. I would like all of my users to be aware of the current state of this table at all times, second by second, and keep data transfer under 5 GB per month. I'm thinking along the lines of a LINQ to SQL expression such as
UserActivity.Where(x => x.Val == myVal).Count() > 1;
But this would download the table to the client, which cannot happen. Should I be implementing a Linq solution? Or would SqlDataReader download less metadata from the server? Am I taking the right approach by using a database at all? Gimme thoughts!
If it is data transfer you are worried about, you need to do your processing on the server and return only the results. A SqlDataReader solution can return a smaller, already-processed set of data to minimise the traffic.
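As a concrete illustration (the connection string, table and column names are assumptions, not taken from the question), the duplicate check can be pushed entirely to the server so that only a single integer crosses the wire:

    using System.Data.SqlClient;

    static class DuplicateCheck
    {
        public static bool HasDuplicate(string connectionString, int myVal)
        {
            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(
                "SELECT COUNT(*) FROM UserActivity WHERE Val = @val", conn))
            {
                cmd.Parameters.AddWithValue("@val", myVal);
                conn.Open();
                // Only one int comes back, not the table.
                return (int)cmd.ExecuteScalar() > 1;
            }
        }
    }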
A couple of thoughts here:
First, I strongly encourage you to profile the SQL generated by your LINQ-to-SQL queries. There are several tools available for this, here's one at random (I have no particular preference or affiliation):
LINQ Profiler from Devart
Your prior experience with LINQ query inefficiency notwithstanding, the LINQ sample you quote in your question isn't particularly complex, so I would expect you could make it (or something similar) work efficiently, given a good feedback mechanism like the tool above.
Second, you don't explicitly mention whether your query client is running in Azure or outside, but I gather from your concern about data egress costs that it's running outside Azure. So the data egress costs are going to be query results using the TDS protocol (the low-level protocol for SQL Server), which is pretty efficient. Some quick back-of-the-napkin math shows that you should be fine to stay below your monthly 5 GB limit:
23 users
10 hours/day
30 days/month (less if only weekdays)
3600 requests/hour/user
32 bits of raw data per response
= about 95 MB of raw response data per month
Even if you assume 10x overhead of TDS for header metadata, etc. (and if my math is right :-) ) then you've still got plenty of room underneath 5 GB. The point isn't that you should stop thinking about it and assume it's fine... but don't assume it isn't fine, either. In fact, don't assume anything. Test, and measure, and make an informed choice. I suspect you'll find a way to stay well under 5 GB without much trouble, even with LINQ.
One other thought... perhaps you could consider running your query inside Azure, and weigh the cost of that vs. the cost of data egress under the "query running outside Azure" scenario? This could (for example) take the form of a small Azure Web Job that runs the query every second and notifies the 23 users if the count goes above 1.
Azure Web Jobs
In essence, you wouldn't notify them if the condition is false, only when it's true. As for the notification mechanism, there are various cloud-friendly options:
Azure mobile push notifications
SMS messaging
SignalR notifications
The key here is to determine whether it's more cost-effective, and in line with any bigger-picture technology or business goals, to have each user issue the query continuously, or to use some separate process in Azure to notify users asynchronously if the "trigger condition" is met.
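To make the second option concrete, here is a rough sketch of what such a job could look like (the table, column and NotifyUsers call are hypothetical; a continuously running WebJob can be as simple as a console loop):

    using System;
    using System.Data.SqlClient;
    using System.Threading;

    class DuplicateWatcher
    {
        static void Main()
        {
            string cs = Environment.GetEnvironmentVariable("SQLAZURE_CONNECTION_STRING");

            while (true)
            {
                using (var conn = new SqlConnection(cs))
                using (var cmd = new SqlCommand(
                    "SELECT TOP 1 Val FROM UserActivity GROUP BY Val HAVING COUNT(*) > 1", conn))
                {
                    conn.Open();
                    object duplicateVal = cmd.ExecuteScalar();   // null when no value is shared
                    if (duplicateVal != null)
                        NotifyUsers((int)duplicateVal);          // push / SMS / SignalR, as listed above
                }
                Thread.Sleep(TimeSpan.FromSeconds(1));
            }
        }

        static void NotifyUsers(int duplicateVal)
        {
            // placeholder for whichever notification mechanism you choose
        }
    }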
Best of luck!
I have code that carries out data retrieval - basically it executes anything from 3 to 12 SQL (Oracle) read statements to retrieve data about an object.
Unfortunately it's running slowly (no SQL statement in particular, it's just the fact that I have so many of them - they take around 0.2 seconds each, which can mean over 2 seconds for the code to complete).
I am looking into ways of improving the performance. One way is to merge some of the tables into a single query (which can reduce the combined time by about 0.5 seconds). However it doesn't make sense to merge the rest, since there will only be data there under certain circumstances, and trying to determine when there is data there to marshal could get tricky.
I am considering introducing threading into my program, so after the initial query I would spawn a thread for each of the other queries, so they are executed at the same time. However I have never used threading and am wary of introducing deadlocks or other pitfalls.
Currently the other queries marshal the results into different sections of the SAME object. Would this cause any issues (i.e. since we are accessing/updating the same object in different threads, albeit through different sections/fields within the object)? Would it be better to return the results and marshal them into the object after all the threads have finished?
I know these types of questions are hard to answer since it's more about general advice, but I would appreciate hearing whether anyone thinks it is a good idea, or has other suggestions.
If you are only reading (select from), don't worry about deadlocks. Oracle reads are (mostly) not blocking. The biggest problem with threading queries to Oracle is how to deal with connections. Creating a connection, running a query and closing the connection is very, very, very bad. Connections are expensive. They are also limited, so you don't want to create a million connections to execute your logic.
As a result, you would use some sort of connection pool and put your queries in a queue.
Also, I hope you are using bind variables and not string concatenation to pass queries to Oracle.
In general, I would collect all the data (better in one query) and only then update the object. You could also consider breaking your object into its sections.
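A rough sketch of that last point, with hypothetical query text, connection string and result type (ODP.NET pools connections by default, so each task opening and closing its own OracleConnection is cheap as long as pooling stays enabled):

    using System.Threading.Tasks;
    using Oracle.ManagedDataAccess.Client;

    class ObjectLoader
    {
        const string Cs = "User Id=...;Password=...;Data Source=...;Pooling=true";

        public async Task<ObjectData> LoadAsync(int id)
        {
            // The independent reads run concurrently on pooled connections...
            Task<string> orders = Task.Run(() => QueryScalar("select order_info from orders where obj_id = :id", id));
            Task<string> notes  = Task.Run(() => QueryScalar("select note_text from notes where obj_id = :id", id));

            await Task.WhenAll(orders, notes);

            // ...and only one thread touches the shared object, after all reads have finished.
            return new ObjectData { Orders = orders.Result, Notes = notes.Result };
        }

        static string QueryScalar(string sql, int id)
        {
            using (var conn = new OracleConnection(Cs))
            using (var cmd = new OracleCommand(sql, conn))
            {
                cmd.Parameters.Add(new OracleParameter("id", id));  // bind variable, not concatenation
                conn.Open();
                return (string)cmd.ExecuteScalar();
            }
        }
    }

    class ObjectData
    {
        public string Orders;
        public string Notes;
    }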
Threading works perfectly. Two years ago I did a project that used a multi-stage / multi-threading approach to push data into an Oracle database (and pull some data out of it for updates).
I basically used a staged approach (a request would go through multiple stages, get consumed there, and new data would be pushed to the next stage), and every stage used a configurable thread pool, which would take a message, process it and post the new messages.
At that time we used, I think, close to 200 threads to process about a million SQL statements per minute (hitting an Oracle Exadata that was really getting some work out of that).
So, multithreading "just works" - obviously you have to know how to do it, and you have to get your architecture and the SQL statements nice and non-blocking. Databases in general are perfectly capable of handling multiple threads.
Now, for details: THAT DEPENDS.
Example:
Currently the other queries marshal the results into different sections of the SAME object. Would this cause any issues (i.e. since we are accessing/updating the same object in different threads, albeit through different sections/fields within the object)?
Absolutely no problem as long as:
You make sure all updates are finished before moving the object to the next phase, and
The updates do not overlap or depend on each other (update 1 must finish for update 2 to have the data it requires).
These are implementation details, and it is really hard (totally impossible, actually) to give a generic answer for them - especially as this is multi-threading 101 and has nothing to do with database access.
In general, you will also have to tune the number of threads. .NET cannot do that itself - it will see that the CPU is not busy and spawn more threads, even if the database server is the bottleneck. This is why we went with multiple stages - so we could tune the number of threads depending on what they do (the last stage used bulk inserts to put the aggregated data into temporary staging tables with a small number of threads, moving a lot of data in every statement - this needs some tuning options so you do not totally overload the database side).
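The staging idea can be expressed today with the TPL Dataflow library (my substitution for illustration, not what that project actually used); each stage gets its own, independently tunable degree of parallelism:

    using System;
    using System.Threading.Tasks.Dataflow;   // System.Threading.Tasks.Dataflow NuGet package

    class StagedPipeline
    {
        static void Main()
        {
            // Stage 1: CPU-bound transformation, wide.
            var transform = new TransformBlock<string, string>(
                msg => msg.ToUpperInvariant(),
                new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 8 });

            // Stage 2: database writes, deliberately narrow so the DB side is not overloaded.
            var persist = new ActionBlock<string>(
                rec => Console.WriteLine("insert " + rec),   // stand-in for a bulk insert into a staging table
                new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 2 });

            transform.LinkTo(persist, new DataflowLinkOptions { PropagateCompletion = true });

            foreach (var msg in new[] { "a", "b", "c" })
                transform.Post(msg);

            transform.Complete();
            persist.Completion.Wait();
        }
    }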
I am quite confused about which approach to take and what best practice is.
Let's say I have a C# application which does the following:
it sends emails from a queue. The emails to send and all their content are stored in the DB.
Now, I know how to make my C# application almost scalable, but I need to go somewhat further.
I want some way of distributing the tasks across, say, X servers, so it is not just one server doing all the processing but the work is shared amongst the servers.
If one server goes down, then the load is shared between the other servers. I know NLB does this, but I'm not looking for NLB here.
Sure, you could add a column of some kind in the DB table to indicate which server should be assigned to process that record, and each of the applications on the servers would have an ID of some kind that matches the value in the DB so they would only pull their own records - but I consider this cheap, bad practice and unrealistic.
Holding a DB table row lock is also not something I would do, due to potential deadlocks and other possible issues.
I am also NOT suggesting using threading "to the extreme" here, but yes, there will be a thread per item to process, or items batched up per thread for x number of threads.
How should I approach this, and what do you recommend for making a C# application which is scalable and has high availability? The aim is to have X servers, each with the same application, and for each to be able to get records and process them, but with the processing/items to process shared amongst the servers, so in case one server or service fails, the others can take on that load until another server is put back.
Sorry for my lack of understanding or knowledge, but I have been thinking about this quite a lot and have lost sleep trying to think of a good, robust solution.
I would be thinking of batching up the work, so each app only pulls back x number of records at a time, marking those retrieved records as taken with a bool field in the table. I'd amend the SELECT statement to pull only records not marked as taken/done. Table locks would be OK in this instance for very short periods, to ensure there is no overlap of apps processing the same records.
EDIT: It's not very elegant, but you could have a datestamp and a status for each entry (instead of a bool field as above). Then you could run a periodic Agent job which runs a sproc to reset the status of any records which have a status of In Progress but which have gone beyond a time threshold without being set to complete. They would be ready for reprocessing by another app later on.
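A rough sketch of the claim step under those assumptions (SQL Server, a hypothetical EmailQueue table with Id, Status and ClaimedAt columns): the UPDATE claims up to 50 rows atomically, and READPAST lets other app servers skip rows that are currently locked, so two servers never take the same batch.

    using System.Collections.Generic;
    using System.Data.SqlClient;

    static class EmailBatchClaimer
    {
        public static List<int> ClaimBatch(string connectionString)
        {
            const string sql = @"
                UPDATE TOP (50) EmailQueue WITH (ROWLOCK, READPAST)
                SET    Status = 'InProgress', ClaimedAt = SYSUTCDATETIME()
                OUTPUT inserted.Id
                WHERE  Status = 'Queued';";

            var ids = new List<int>();
            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(sql, conn))
            {
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                    while (reader.Read())
                        ids.Add(reader.GetInt32(0));
            }
            return ids;
        }
    }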
This may not be enterprise-y enough for your tastes, but I'd bet my hide that there are plenty of apps out there in the enterprise which are just as un-sophisticated and work just fine. The best things work with the least complexity.
I have one BIG table (90k rows, approx. 60 MB) which holds info about free room capacity for about 50 hotels. This table has very few updates/inserts per hour.
My application sends async requests to this (and joined tables) at most 30 times per second.
When I start 30 threads at one time (with the default AppPool class in .NET 3.5 C#), each with a random valid SQL query string, only a few (about 4) are processed asynchronously and the other threads wait. Why?
Is it because of SQL Server 2008 table locking, or because of .NET itself? Or something else?
If it is a SQL problem, would it help if I split this big table into one table per hotel?
My goal is to have at least 10 threads served at a time.
This table is tiny. It doesn't even qualify as a "medium-sized" table. It's trivial.
You could be full-table-scanning it 30 times per second, or copying the whole thing into RAM, and no server is going to be the slightest bit bothered.
If your data fits in RAM, databases are fast. If that's not what you're seeing, you're doing something REALLY WRONG. Therefore I also think the problems are all on the client side.
It is more than likely on the .NET side. If it were table locking, more threads would be running, but they would be waiting on their queries to return. If I remember correctly, there's a property on thread pools that controls how many actual threads they create at once. If there are more pending work items than that number, they get in line and wait for running threads to finish. Check that.
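If you want to check that, here is a minimal diagnostic sketch (the cause is my assumption, so treat it as something to rule out rather than a fix):

    using System;
    using System.Threading;

    class PoolCheck
    {
        static void Main()
        {
            int minWorker, minIo, maxWorker, maxIo;
            ThreadPool.GetMinThreads(out minWorker, out minIo);
            ThreadPool.GetMaxThreads(out maxWorker, out maxIo);
            Console.WriteLine("min workers: " + minWorker + ", max workers: " + maxWorker);

            // Keep at least 30 worker threads ready instead of letting the pool ramp up slowly.
            ThreadPool.SetMinThreads(30, minIo);

            // The ADO.NET connection pool has its own ceiling too; the default
            // "Max Pool Size" in the connection string is 100, so it is unlikely
            // to be the limit at 30 concurrent queries, but it is worth ruling out.
        }
    }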
Have you tried changing the transaction isolation level?
Even when reading from a table, SQL Server will take locks. Try setting the isolation level to read uncommitted and see if that improves the situation, but be advised that it is possible you will read 'dirty' data; make sure you understand the ramifications if this is the solution.
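A minimal sketch of that, assuming a hypothetical connection string, table and column (a TransactionScope with ReadUncommitted means the read takes no shared locks, at the cost of possibly seeing uncommitted data):

    using System.Data.SqlClient;
    using System.Transactions;

    class UncommittedRead
    {
        public static object ReadFreeRooms(string connectionString, int hotelId)
        {
            var options = new TransactionOptions { IsolationLevel = IsolationLevel.ReadUncommitted };

            using (var scope = new TransactionScope(TransactionScopeOption.Required, options))
            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand("SELECT FreeRooms FROM RoomCapacity WHERE HotelId = @id", conn))
            {
                cmd.Parameters.AddWithValue("@id", hotelId);
                conn.Open();
                object freeRooms = cmd.ExecuteScalar();   // may reflect uncommitted ('dirty') data
                scope.Complete();
                return freeRooms;
            }
        }
    }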
Rather than ask, measure. Each of your SQL queries that is actually submitted by your application will create a request on the server, and the sys.dm_exec_requests DMV shows the state of each request. When a request is blocked, the wait_type column shows a non-null value. You can judge from this whether your requests are blocked or not. If they are blocked, you'll also know the reason why.
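For example, something along these lines run from a second connection while your 30 threads are active (the connection string is a placeholder; the column names come from sys.dm_exec_requests):

    using System;
    using System.Data.SqlClient;

    class RequestMonitor
    {
        static void Main()
        {
            const string sql = @"
                SELECT session_id, status, wait_type, wait_time, blocking_session_id
                FROM   sys.dm_exec_requests
                WHERE  wait_type IS NOT NULL;";

            using (var conn = new SqlConnection("Data Source=...;Integrated Security=true"))
            using (var cmd = new SqlCommand(sql, conn))
            {
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                    while (reader.Read())
                        Console.WriteLine(reader["session_id"] + ": waiting on " + reader["wait_type"]
                                          + " (blocked by " + reader["blocking_session_id"] + ")");
            }
        }
    }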
Here I am dealing with a database containing tens of millions of records. I have an application which connects to the database, gets all the data from a single column in a table and does some operation on it and updates it (for SQL Server - using cursors).
For millions of records it is taking a very, very... long time to update. So I want to make it faster by
using multiple threads with an independent connection for each thread.
or
by using a single connection throughout all the threads to fire the update queries.
Which one is faster? Or if you have any other ideas, please explain.
I need a solution which is independent of the database type, but even if you only know specific solutions for particular databases, please reply.
The speedup you're trying to achieve won't work. On the contrary, it will slow down the overall processing, as the database now also has to keep multiple connections/sessions/transactions in sync.
Stick with as few connections/transactions as possible for repetitive and comparable operations.
If it takes too long for your taste, try to analyze whether the queries can be optimized somehow. Also have a look at database-specific extensions (e.g. bulk operations) suitable for your problem.
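For SQL Server specifically (the table, column, batch size and per-row operation below are hypothetical), one such approach is to express the operation in SQL and update in set-based batches rather than dragging every value through a client-side cursor:

    using System.Data.SqlClient;

    class BatchedUpdate
    {
        static void Main()
        {
            const string sql = @"
                UPDATE TOP (50000) dbo.BigTable
                SET    SomeColumn = UPPER(SomeColumn),   -- the per-row operation, expressed in SQL
                       Processed  = 1
                WHERE  Processed = 0;";

            using (var conn = new SqlConnection("Data Source=...;Initial Catalog=Work;Integrated Security=true"))
            using (var cmd = new SqlCommand(sql, conn))
            {
                cmd.CommandTimeout = 0;     // batches can take a while on tens of millions of rows
                conn.Open();

                int affected;
                do
                {
                    affected = cmd.ExecuteNonQuery();   // repeat until no unprocessed rows remain
                } while (affected > 0);
            }
        }
    }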
It all depends on the database, and the hardware it is running on.
It helps if the database can make use of concurrent processing and avoids contention on shared resources (e.g. page-based locks would span multiple records, record-based locks would not). Shared resources in this case include hardware: a single-core box will not be able to execute multiple CPU-intensive activities (e.g. parsing SQL) truly in parallel.
Network latency is something you might help alleviate with concurrent inserts even if the database is itself not able to exploit concurrency.
As with any question of performance, there is no substitute for testing in your specific scenario.
If possible, try to use a stored procedure to do all the processing and update the records.