I'm working on a project for an academic institution and I need advice on the best way to approach this problem. It's been a long time since I did any traditional application development (close to five years).
The college administration recently revised the college's academic standards policy. Previously, the administration only had three status codes, so this wasn't as big of an issue. However, the new policy has six status codes:
Good Standing
Academic Concern
Academic Intervention (1)
One-Term Dismissal
Academic Intervention (2)
Four-Term Dismissal
From here on, I'll differentiate between GPA for the term by saying termGPA and cumulative GPA by saying cumGPA. If a student's termGPA falls below 2.0, and that causes his/her cumGPA to also fall below 2.0, he/she gets placed on Academic Concern. Once on Academic Concern, one of three things can happen to students in following terms. They:
Return to good standing if their termGPA and cumGPA rise above 2.0.
Stay in the current status if their termGPA is above 2.0, but their cumGPA stays below 2.0.
Move to the next status if both their termGPA and cumGPA are below 2.0.
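In rough C#, the transition logic I need to apply each term looks something like this (the enum and method are just an illustration, treating exactly 2.0 as meeting the standard):

    enum PolicyCode
    {
        GoodStanding, AcademicConcern, AcademicIntervention1,
        OneTermDismissal, AcademicIntervention2, FourTermDismissal
    }

    // Rough sketch of the per-term transition. Dismissal handling and other
    // edge cases are omitted; statuses are assumed to escalate in list order.
    static PolicyCode NextStatus(PolicyCode previous, decimal termGpa, decimal cumGpa)
    {
        if (previous == PolicyCode.GoodStanding)
            return (termGpa < 2.0m && cumGpa < 2.0m)
                ? PolicyCode.AcademicConcern    // low termGPA dragged the cumGPA under 2.0
                : PolicyCode.GoodStanding;

        if (termGpa >= 2.0m && cumGpa >= 2.0m)
            return PolicyCode.GoodStanding;     // both recovered: back to good standing

        if (termGpa >= 2.0m)
            return previous;                    // term OK, cumGPA still low: stay put

        return previous == PolicyCode.FourTermDismissal
            ? previous
            : previous + 1;                     // both still low: move to the next status
    }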
Normally, I would approach this by writing a console application that processes each student iteratively, building the status codes as it goes. However, we're handling at least 8,000 students, and in most cases around 12,500 students per term.
Additionally, this policy has to be applied retroactively over an as-yet-unspecified period of time (since former students could return to the college and would then be subject to the new policy's restrictions), and once I include a student in the data set, I have to go back through that student's entire history with the college. I'm conservatively guessing that I'll go through at least a million student records, calculating each student's termGPA and rolling cumGPA along the way.
Questions:
1. Is there any way to handle this problem in SQL and avoid using a cursor?
2. (Assuming the answer to 1. is "No") How should I structure a console application? Should I build up a large collection and process a few thousand students at a time before writing to the database, or update the database after I process each student?
3. Am I making way too big a deal out of this?
Thanks in advance for any insight and advice.
Edit: Based on comments to answers here, I should've provided more information about the data structures and the way I'm calculating the GPAs.
I can't use the pre-calculated cumGPA values in our database -- I need the student's cumGPA at the end of each progressive term, like so (note: I made up the GPA values below):
ID TermID CumGpa TermGPA TermNumber PolicyCode
123545 09-10-2 2.08 2.08 1 GoodStanding
123545 09-10-3 1.94 0.00 2 AcademicConcern
123545 09-10-4 1.75 1.00 3 AcademicIntervention
123545 10-11-2 1.88 2.07 4 AcademicIntervention
123545 10-11-4 2.15 2.40 5 GoodStanding
123545 11-12-1 2.30 2.86 6 GoodStanding
The problem is that each subsequent term's status code could depend on the previous term's status code -- Good Standing is actually the only one that doesn't.
As far as I know, that means that I would have to use a cursor in SQL to get each student's most current status code, which is not something I'm interested in, as I work for a cash-strapped college that has precisely three database servers: one for testing, and two servers with the same data on them (we're in the process of moving to SQL Server 2008 R2).
That is interesting. I don't think you'll have to worry too much about the SQL performance; it will run fairly quickly for your application. I just ran a stupid little console app to fix a mess-up and inserted 15,000 records one at a time. It took about 5 seconds.
First of all, 12,000 records are nothing for today's databases, so that's not your concern. You should focus instead on keeping it simple. It sounds like your database will often be driven by events, so I would recommend using triggers, i.e.: a first trigger that updates cumGPA when a termGPA row is inserted, and a second trigger, after the cumGPA update, that checks your criteria and updates the status if they are met.
Even the free version of SQL Server now handles databases up to 10 GB, so 12,500 records is small. Going through a million records, you should iterate over each student (or over groups of students) to allow the transaction log to clear. That could be done with a cursor or a console application. If you can perform the calculation in T-SQL, then batching the updates would probably be faster than doing them one at a time. The downside is that the bigger the batch, the bigger the transaction log, so there is a sweet spot.
If the calculation is too complex for T-SQL and takes almost as long as (or longer than) the insert statement, you could insert on a separate thread (or calculate on a separate thread) so the insert and the calculation run in parallel. I do this in an application where I parse the words out of text; the parse takes about the same amount of time as inserting the words. But I don't let it spin up multiple insert threads: on the SQL side the server still has to maintain the indexes, and hitting it with inserts from two threads slowed it down. With just two threads, the faster thread ends up waiting on the slower one.
The order you do your updates in also matters: if you process in clustered index order, there is a better chance that the record is already in memory.
I ended up writing a console application in C# to process these status codes. My users changed the initial status update requirements to only include the previous two terms, but the process had enough edge cases that I opted to take my time and write cleaner, object-oriented code that will be easier to pick back up (he says, hopefully) once this policy matures and changes.
Also, I ended up having to deploy this database onto a SQL Server 2005 instance, so table-valued parameters were not available to me. If they had been, I would've opted to commit to the database only after processing each student, rather than after processing each term for each student.
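For the curious, the per-student pass boils down to something like the sketch below (heavily simplified, with made-up type, table, and column names; requires System.Linq and System.Data.SqlClient):

    // Walk each student's terms in order, recompute the rolling cumGPA,
    // and derive each term's status from the previous one.
    static void ProcessStudent(Student student, SqlConnection conn)
    {
        decimal totalPoints = 0m, totalHours = 0m;
        PolicyCode previous = PolicyCode.GoodStanding;

        foreach (var term in student.Terms.OrderBy(t => t.TermNumber))
        {
            decimal termGpa = term.CreditHours == 0 ? 0m : term.GradePoints / term.CreditHours;
            totalPoints += term.GradePoints;
            totalHours += term.CreditHours;
            decimal cumGpa = totalHours == 0 ? 0m : totalPoints / totalHours;

            PolicyCode current = NextStatus(previous, termGpa, cumGpa);  // policy rules above

            // One round trip per term; with TVPs this could have been batched
            // into a single call per student.
            using (var cmd = new SqlCommand(
                "UPDATE StudentTermStatus SET TermGPA = @t, CumGPA = @c, PolicyCode = @p " +
                "WHERE StudentID = @id AND TermID = @term", conn))
            {
                cmd.Parameters.AddWithValue("@t", termGpa);
                cmd.Parameters.AddWithValue("@c", cumGpa);
                cmd.Parameters.AddWithValue("@p", current.ToString());
                cmd.Parameters.AddWithValue("@id", student.Id);
                cmd.Parameters.AddWithValue("@term", term.TermId);
                cmd.ExecuteNonQuery();
            }

            previous = current;
        }
    }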
I have a small table (23 rows, 2 int columns); it's just a basic user-activity monitor. The first column holds a user id. The second column holds a value that should be unique to every user, but I must alert the users if two values are the same. I'm using an Azure SQL database to hold this table, and LINQ to SQL in C# to run the query.
The problem: Microsoft bills me based on data transferred out of their data centers. I would like all of my users to be aware of the current state of this table at all times, second by second, while keeping data transfer under 5 GB per month. I'm thinking along the lines of a LINQ-to-SQL expression such as
UserActivity.Where(x => x.Val == myVal).Count() > 1;
But this would download the table to the client, which cannot happen. Should I be implementing a Linq solution? Or would SqlDataReader download less metadata from the server? Am I taking the right approach by using a database at all? Gimme thoughts!
If it is data transfer you are worried about you need to do your processing on the server and return only the results. A SQLDataReader solution can return a smaller, already processed set of data to minimise the traffic.
A couple thoughts here:
First, I strongly encourage you to profile the SQL generated by your LINQ-to-SQL queries. There are several tools available for this, here's one at random (I have no particular preference or affiliation):
LINQ Profiler from Devart
Your prior experience with LINQ query inefficiency notwithstanding, the LINQ sample you quote in your question isn't particularly complex, so I would expect you could make it (or something similar) work efficiently, given a good feedback mechanism like the tool above.
Second, you don't explicitly mention whether your query client runs inside or outside Azure, but I gather from your concern about data egress costs that it's running outside Azure. So the data egress will be query results delivered over the TDS protocol (the low-level protocol for SQL Server), which is pretty efficient. Some quick back-of-the-napkin math shows that you should be fine staying below your monthly 5 GB limit:
23 users
10 hours/day
30 days/month (less if only weekdays)
3600 requests/hour/user
32 bits of raw data per response
= about 95 MB of raw response data per month
Even if you assume 10x overhead of TDS for header metadata, etc. (and if my math is right :-) ) then you've still got plenty of room underneath 5 GB. The point isn't that you should stop thinking about it and assume it's fine... but don't assume it isn't fine, either. In fact, don't assume anything. Test, and measure, and make an informed choice. I suspect you'll find a way to stay well under 5 GB without much trouble, even with LINQ.
One other thought... perhaps you could consider running your query inside Azure, and weigh the cost of that vs. the cost of data egress under the "query running outside Azure" scenario? This could (for example) take the form of a small Azure Web Job that runs the query every second and notifies the 23 users if the count goes above 1.
Azure Web Jobs
In essence, you wouldn't notify them if the condition is false, only when it's true. As for the notification mechanism, there are various cloud-friendly options:
Azure mobile push notifications
SMS messaging
SignalR notifications
The key here is to determine whether it's more cost-effective, and in line with any bigger-picture technology or business goals, to have each user issue the query continuously, or to use some separate process in Azure to notify users asynchronously when the "trigger condition" is met.
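As a very rough illustration of that second option (the table and column names come from your question; everything else, including the connection string, is a placeholder), such a job could be as small as:

    using System;
    using System.Data.SqlClient;
    using System.Threading;

    // Minimal polling job: count duplicates server-side and only notify
    // when the trigger condition is met, so almost no data leaves Azure.
    class DuplicateWatcher
    {
        static void Main()
        {
            var connStr = "<your Azure SQL connection string>";
            while (true)
            {
                using (var conn = new SqlConnection(connStr))
                using (var cmd = new SqlCommand(
                    "SELECT COUNT(*) FROM UserActivity GROUP BY Val HAVING COUNT(*) > 1", conn))
                {
                    conn.Open();
                    object result = cmd.ExecuteScalar();   // null if there are no duplicates
                    if (result != null)
                        NotifyUsers();                     // push/SMS/SignalR, etc.
                }
                Thread.Sleep(TimeSpan.FromSeconds(1));
            }
        }

        static void NotifyUsers()
        {
            // Hook up your notification mechanism of choice here.
            Console.WriteLine("Duplicate value detected at {0:T}", DateTime.Now);
        }
    }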
Best of luck!
I am working with an AWS SQS queue. The queue may hold a massive number of messages: if I do not process them, more than a million messages arrive per hour.
I am processing all the messages and putting them into a MySQL table (InnoDB, 22 columns) using INSERT ... ON DUPLICATE KEY UPDATE. I have a primary key and a unique key.
I am working in C#, where I run 80 threads to pull messages from SQS.
I use a transaction in C# to run the INSERT ... ON DUPLICATE KEY UPDATE query.
At the same time I am using a lock in C#, so only a single thread can update the table. If I do not use the C# lock, MySQL throws a deadlock exception.
The problem is that I can see a lot of threads waiting on the C# lock, and the wait time is gradually increasing. Can anybody suggest the best way to handle this?
Note: I have 8 GB of RAM and an Intel Xeon 2.53 GHz with a 1 Gb network connection. Please advise.
If I were doing it, the C# program would primarily be creating a CSV file to empty your SQS queue, or at least a significant chunk of it. The file would then be used for a bulk insert into an empty, completely non-indexed worktable. I would steer toward a non-temporary table, but whatever; I see no reason to add temporary tables to the mix when this is recurring, and when done the worktable is truncated anyway.
The bulk insert would be achieved through LOAD DATA INFILE fired off from the C# program. Alternatively, a new row could be written to some other table with an incrementing value saying file2 is ready, file3 is ready, and the LOAD happens inside an event that fires, say, every n minutes, put together with MySQL's CREATE EVENT. Six of one, half a dozen of the other.
But a sentinel, a mutex, might be of value here, since this whole thing happens in batches and the next batch(es) to be processed need to be suspended while this occurs. Let's call this concept The Blocker, and say the batch currently being worked on is row N.
OK, now your data is in the worktable, and it is safe from being stomped on until it is processed. Let's say you have 250k rows, with other batches shortly to follow. If you have special processing to perform, you may wish to create indexes, but at this moment there are none.
You perform a normal INSERT ... ON DUPLICATE KEY UPDATE (IODKU) into the REAL table using this worktable. That IODKU follows a normal INSERT INTO ... SELECT pattern, where the SELECT part comes from the worktable.
At the end of that statement, the worktable is truncated, any indexes dropped, row N has its status set to complete, and The Blocker is free to work on row N+1 when it appears.
The indexes are dropped to facilitate the next round of bulk inserts, where maintaining indexes would only slow things down. Indexes on the worktable may very well be unnecessary overhead during the IODKU anyway.
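A skeleton of that cycle from the C# side might look like the following, using MySQL Connector/NET; the table, column, and file names are placeholders, and depending on your connector and server settings you may need to enable LOCAL INFILE explicitly or use a server-side path:

    using MySql.Data.MySqlClient;   // MySQL Connector/NET

    // One batch cycle: bulk-load the CSV into the bare worktable, run the
    // IODKU from the worktable into the real table, then truncate.
    static void LoadBatch(string connectionString, string csvPath)
    {
        using (var conn = new MySqlConnection(connectionString))
        {
            conn.Open();

            Execute(conn, string.Format(
                "LOAD DATA LOCAL INFILE '{0}' INTO TABLE worktable " +
                "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'",
                csvPath.Replace('\\', '/')));

            Execute(conn,
                "INSERT INTO real_table (id, col1, col2) " +
                "SELECT id, col1, col2 FROM worktable " +
                "ON DUPLICATE KEY UPDATE col1 = VALUES(col1), col2 = VALUES(col2)");

            Execute(conn, "TRUNCATE TABLE worktable");
        }
    }

    static void Execute(MySqlConnection conn, string sql)
    {
        using (var cmd = new MySqlCommand(sql, conn)) { cmd.ExecuteNonQuery(); }
    }

Connector/NET also ships a MySqlBulkLoader helper that wraps LOAD DATA INFILE, if you would rather not build that statement by hand.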
In this manner, you get the best of both worlds:
LOAD DATA INFILE
IODKU
And the focus is taken off of multi-threading, a good thing to take one's focus off of.
Here is a nice article on performance and strategies titled Testing the Fastest Way to Import a Table into MySQL. Don't let the MySQL version in the title or inside the article scare you away. Jumping to the bottom and picking up some conclusions:
The fastest way you can import a table into MySQL without using raw files is the LOAD DATA syntax. Use parallelization for InnoDB for better results, and remember to tune basic parameters like your transaction log size and buffer pool. Careful programming and importing can make a >2-hour problem became a 2-minute process. You can disable temporarily some security features for extra performance.
I would separate the C# routine entirely from the actual LOAD DATA and IODKU work and leave that to the event mentioned above (CREATE EVENT), for several reasons, mainly better design. That way the C# program deals only with SQS and writing out files with incrementing file numbers.
I have a project that uses Entity Framework (v1 with .NET 3.5). It's been in use for a few years, but it's now being used by more people. We started getting timeout errors, and I have tracked them down to a few things. For simplicity's sake, let's say my database has three tables: product, part, and product_part. There are ~1400 parts and a handful of products.
The user has the ability to add any number of parts to a product. My problem is that when many parts are added to the product, the inserts take a long time. I think it's mostly due to network traffic/delay, but inserting all 1400 takes around a minute. If someone tries to view the details of a part while those records are being inserted, I get a timeout and can see a block in SQL Server's Activity Monitor.
What can I do to avoid this? My apologies if this has been asked before and I missed it.
Thanks,
Nick
I think the root problem is that your write transaction is taking so long. EF is not good at executing mass DML. It executes each insert in a separate network roundtrip and separate statement.
If you want to insert 1400 rows and performance matters, do the insert in one single statement using TVPs (INSERT ... SELECT * FROM @tvp). Or switch to bulk copy, but I don't think that will be advantageous at only 1400 rows.
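For example, assuming SQL Server 2008 or later, with a user-defined table type and a small proc on the server side, the client call could look something like this (the type, proc, and column names are invented):

    using System.Collections.Generic;
    using System.Data;
    using System.Data.SqlClient;

    // Assumes a table type and proc created once on the server, e.g.:
    //   CREATE TYPE dbo.ProductPartList AS TABLE (ProductId int, PartId int);
    //   CREATE PROCEDURE dbo.AddProductParts @parts dbo.ProductPartList READONLY
    //   AS INSERT INTO product_part (product_id, part_id)
    //      SELECT ProductId, PartId FROM @parts;
    static void InsertParts(string connectionString, int productId, IEnumerable<int> partIds)
    {
        var table = new DataTable();
        table.Columns.Add("ProductId", typeof(int));
        table.Columns.Add("PartId", typeof(int));
        foreach (int partId in partIds)
            table.Rows.Add(productId, partId);

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("dbo.AddProductParts", conn))
        {
            cmd.CommandType = CommandType.StoredProcedure;
            var parameter = cmd.Parameters.AddWithValue("@parts", table);
            parameter.SqlDbType = SqlDbType.Structured;
            parameter.TypeName = "dbo.ProductPartList";

            conn.Open();
            cmd.ExecuteNonQuery();   // all ~1400 rows in one round trip
        }
    }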
If your read transactions are getting blocked, and this is a problem, switch on snapshot isolation. That takes care of the readers 100% as they never block under snapshot isolation.
I am quite confused about which approach to take and what best practice is.
Let's say I have a C# application which does the following:
Sends emails from a queue. The emails to send and all of their content are stored in the DB.
Now, I know how to make my C# application almost scalable but I need to go somewhat further.
I want some way of distributing the tasks across, say, X servers, so it is not just one server doing all the processing but the work is shared amongst the servers.
If one server goes down, then the load is shared among the other servers. I know NLB does this, but I'm not looking for NLB here.
Sure, you could add a column of some kind in the DB table to indicate which server should be assigned to process that record, and each of the applications on the servers would have an ID of some kind that matches the value in the DB and they would only pull their own records - but this I consider to be cheap, bad practice and unrealistic.
Having a DB table row lock as well, is not something I would do due to potential deadlocks and other possible issues.
I am also NOT suggesting using threading "to the extreme" here, but yes, there will be a thread per item to process, or items batched up per thread for x number of threads.
How should I approach this, and what do you recommend for making a C# application which is scalable and highly available? The aim is to have X servers, each running the same application, each able to get records and process them, with the processing load/items to process shared amongst the servers, so that if one server or service fails, the others can take on that load until another server is put back.
Sorry for my lack of understanding or knowledge, but I have been thinking about this quite a lot and losing sleep trying to come up with a good, robust solution.
I would be thinking of batching up the work, so each app only pulls back x number of records at a time, marking those retrieved records as taken with a bool field in the table. I'd amend the SELECT statement to pull only records not marked as taken/done. Table locks would be OK in this instance for very short periods, to ensure there is no overlap of apps processing the same records.
EDIT: It's not very elegant, but you could have a datestamp and a status for each entry (instead of a bool field as above). Then you could run a periodic Agent job which runs a sproc to reset the status of any records which have a status of In Progress but which have gone beyond a time threshold without being set to complete. They would be ready for reprocessing by another app later on.
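A sketch of what the "grab a batch and mark it taken" step could look like against SQL Server (the table, column, and status names are just examples):

    using System.Collections.Generic;
    using System.Data.SqlClient;

    // Atomically claims up to batchSize unclaimed records for this server and
    // returns their ids. READPAST lets other servers skip rows that are already
    // locked instead of blocking on them.
    static List<int> ClaimBatch(string connectionString, string serverId, int batchSize)
    {
        const string sql =
            "UPDATE TOP (@batchSize) EmailQueue WITH (ROWLOCK, READPAST) " +
            "SET Status = 'InProgress', ClaimedBy = @serverId, ClaimedAt = GETUTCDATE() " +
            "OUTPUT inserted.Id " +
            "WHERE Status = 'Pending'";

        var ids = new List<int>();
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@batchSize", batchSize);
            cmd.Parameters.AddWithValue("@serverId", serverId);
            conn.Open();
            using (var reader = cmd.ExecuteReader())
                while (reader.Read())
                    ids.Add(reader.GetInt32(0));
        }
        return ids;
    }

A periodic job can then flip anything stuck in In Progress past the time threshold back to Pending, exactly as described in the edit above.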
This may not be enterprise-y enough for your tastes, but I'd bet my hide that there are plenty of apps out there in the enterprise which are just as un-sophisticated and work just fine. The best things work with the least complexity.
I have a SQL table, with indexes, containing leads to call. About 30 users will be calling these leads. To be sure that no two users call the same lead, the system has to be instant.
So I would like to go this way:
Set the table to the right index
Scan the table for a lead I can call (there are conditions), following the index
When I have a call, indicate that the record is "in use"
Here are my issues:
- I can't find any way to set a table to an index from C# code
- LINQ requires a DataContext (not instant) and ADO requires a DataSet
I have not found any resource to help me on that. If you have any, they are more than welcome.
Sorry if I may sound ignorant, I'm new to SQL databases.
Thank you very much in advance!
Mathieu
I've worked on similar systems before. The tack we took was to have a distribution routine that handled passing out the leads to the call center people. Typically we had a time limit on how long a lead was allowed to sit in any one user's queue before it was yanked away and given to someone else.
This allowed us to do some pretty complicated things like giving preference based on details about the lead as well as productivity of the individual call center person.
We had a very high volume of leads that came in and had our distribution routine set to run once a minute. The SLA was set so that a lead was contacted within 2 minutes of us knowing about them.
To support this, your leads table should have an AssignedUserId and probably a date/time stamp of when it was assigned. Write a proc or some C# code which grabs all the records from that table which aren't assigned, runs the assignment routine, and saves the changes back to the table. This routine should probably take into account how many leads each person is currently working and the acceptable number of open leads per person, in order to give preference in a round-robin distribution.
When the user refreshes they will have their leads. You can control the refresh rate in the UI.
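A much-simplified version of that assignment routine might look like this (the classes and names are invented for illustration; persistence and the preference rules are left out):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class Lead
    {
        public int Id;
        public DateTime ReceivedAt;
        public int? AssignedUserId;
        public DateTime? AssignedAt;
    }

    class CallAgent
    {
        public int UserId;
        public int OpenLeadCount;
    }

    static class LeadDistributor
    {
        // Hands unassigned leads to the least-loaded agents, respecting a
        // per-agent cap. Saving the changes back to the table is done afterwards.
        public static void Distribute(List<Lead> unassigned, List<CallAgent> agents, int maxOpenLeads)
        {
            foreach (var lead in unassigned.OrderBy(l => l.ReceivedAt))
            {
                var agent = agents
                    .Where(a => a.OpenLeadCount < maxOpenLeads)
                    .OrderBy(a => a.OpenLeadCount)      // least loaded first
                    .FirstOrDefault();
                if (agent == null) break;               // everyone is at capacity

                lead.AssignedUserId = agent.UserId;
                lead.AssignedAt = DateTime.UtcNow;
                agent.OpenLeadCount++;
            }
        }
    }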
I don't see how your requirement of being "instant" relates to the use of an index. Accessing a table by index is not instantaneous either.
To solve your problem, I would suggest locking the whole table while a lead is being picked and marked. This will limit performance, but it will also ensure that the same lead is never called by two users.
Example code:
Begin Transaction
Lock Table
Search for Lead
Update Lead to indicate that it is in use
Commit Transaction (removes the lock)
Locking a table in SQL Server until the end of the transaction can be done by using SELECT * FROM table WITH (HOLDLOCK, TABLOCKX) WHERE 1=0.
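Putting those steps together from C#, a minimal sketch could look like this (the table and column names are examples only):

    using System.Data.SqlClient;

    // Lock-the-whole-table approach: lock, find a lead, mark it in use, commit.
    static int? GrabNextLead(string connectionString)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();
            using (var tx = conn.BeginTransaction())
            {
                // Exclusive table lock, held until the transaction ends.
                using (var lockCmd = new SqlCommand(
                    "SELECT * FROM Leads WITH (HOLDLOCK, TABLOCKX) WHERE 1=0", conn, tx))
                    lockCmd.ExecuteNonQuery();

                // Find a callable lead (substitute your real conditions here).
                object id;
                using (var findCmd = new SqlCommand(
                    "SELECT TOP 1 Id FROM Leads WHERE InUse = 0", conn, tx))
                    id = findCmd.ExecuteScalar();

                if (id == null) { tx.Rollback(); return null; }

                using (var markCmd = new SqlCommand(
                    "UPDATE Leads SET InUse = 1 WHERE Id = @id", conn, tx))
                {
                    markCmd.Parameters.AddWithValue("@id", (int)id);
                    markCmd.ExecuteNonQuery();
                }

                tx.Commit();   // releases the lock
                return (int)id;
            }
        }
    }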
Disclaimer: Yes, I'm aware that cleaner solutions with less locking are possible. The advantage of the above solution is that it is simple (no worrying about the correct transaction isolation level, etc.) and it is usually performant enough (if you remember to keep the "locked part" short and there is not too much concurrent access).