I have a small table (23 rows, 2 int columns), just a basic user-activity monitor. The first column represents the user id. The second column holds a value that should be unique to every user, but I must alert the users if two values are the same. I'm using an Azure SQL database to hold this table, and LINQ to SQL in C# to run the query.
The problem: Microsoft will bill me based on data transferred out of their data centers. I would like all of my users to be aware of the current state of this table at all times, second by second, and to keep data transfer under 5 GB per month. I'm thinking along the lines of a LINQ to SQL expression such as
UserActivity.Where(x => x.Val == myVal).Count() > 1;
But this would download the table to the client, which cannot happen. Should I be implementing a LINQ solution? Or would a SqlDataReader pull less metadata from the server? Am I taking the right approach by using a database at all? Give me your thoughts!
If it is data transfer you are worried about, you need to do your processing on the server and return only the results. A SqlDataReader solution can return a smaller, already processed set of data to minimise the traffic.
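For instance, a minimal sketch of that idea, assuming the UserActivity table and Val column from the question and a placeholder connection string: the server does the counting and only a single integer comes back.

using System;
using System.Data.SqlClient;

class DuplicateCheck
{
    // Returns true if more than one row holds 'myVal'.
    // Only a single integer travels out of the data center, not the table.
    static bool ValueAlreadyTaken(string connectionString, int myVal)
    {
        const string sql = "SELECT COUNT(*) FROM UserActivity WHERE Val = @val";

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@val", myVal);
            conn.Open();
            return (int)cmd.ExecuteScalar() > 1;
        }
    }
}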
A couple of thoughts here:
First, I strongly encourage you to profile the SQL generated by your LINQ-to-SQL queries. There are several tools available for this, here's one at random (I have no particular preference or affiliation):
LINQ Profiler from Devart
Your prior experience with LINQ query inefficiency notwithstanding, the LINQ sample you quote in your question isn't particularly complex, so I would expect you could make it (or something similar) work efficiently, given a good feedback mechanism like the tool above.
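You don't even need a third-party tool to get started: LINQ to SQL can write the T-SQL it generates to any TextWriter through the DataContext.Log property. A rough sketch, assuming a generated context named MyDataContext with a UserActivities table and a myVal variable in scope (all hypothetical names):

using (var db = new MyDataContext(connectionString))
{
    // Write the generated T-SQL to the console so you can see exactly
    // what crosses the wire; in production, point this at a logger.
    db.Log = Console.Out;

    // This translates to a single SELECT COUNT(*) ... WHERE Val = @p0 on the
    // server, so only the scalar count is returned to the client.
    bool duplicate = db.UserActivities.Count(x => x.Val == myVal) > 1;
}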
Second, you don't explicitly mention whether your query client is running in Azure or outside it, but I gather from your concern about data egress costs that it's running outside Azure. So the data egress costs are going to be query results travelling over the TDS protocol (the low-level protocol for SQL Server), which is pretty efficient. Some quick back-of-the-napkin math shows that you should be fine to stay below your monthly 5 GB limit:
23 users
10 hours/day
30 days/month (less if only weekdays)
3600 requests/hour/user
32 bits of raw data per response
= 23 × 10 × 30 × 3,600 × 4 bytes ≈ 95 MB of raw response data per month
Even if you assume 10x overhead of TDS for header metadata, etc. (and if my math is right :-) ) then you've still got plenty of room underneath 5 GB. The point isn't that you should stop thinking about it and assume it's fine... but don't assume it isn't fine, either. In fact, don't assume anything. Test, and measure, and make an informed choice. I suspect you'll find a way to stay well under 5 GB without much trouble, even with LINQ.
One other thought... perhaps you could consider running your query inside Azure, and weigh the cost of that vs. the cost of data egress under the "query running outside Azure" scenario? This could (for example) take the form of a small Azure Web Job that runs the query every second and notifies the 23 users if the count goes above 1 (a rough sketch follows the notification options below).
Azure Web Jobs
In essence, you wouldn't notify them if the condition is false, only when it's true. As for the notification mechanism, there are various cloud-friendly options:
Azure mobile push notifications
SMS messaging
SignalR notifications
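To make the Web Job idea concrete, here's a rough sketch of a continuously running job, assuming the UserActivity table from the question, a placeholder connection-string setting, and a stub NotifyUsers method standing in for whichever notification channel you pick:

using System;
using System.Data.SqlClient;
using System.Threading;

class DuplicateWatcherJob
{
    static void Main()
    {
        // Placeholder setting name; configure it however your WebJob is set up.
        string connectionString =
            Environment.GetEnvironmentVariable("ActivityDbConnectionString");

        // Counts how many Val values appear more than once.
        const string sql =
            "SELECT COUNT(*) FROM (SELECT Val FROM UserActivity " +
            "GROUP BY Val HAVING COUNT(*) > 1) AS dupes";

        while (true)
        {
            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(sql, conn))
            {
                conn.Open();
                int duplicateValues = (int)cmd.ExecuteScalar();
                if (duplicateValues > 0)
                    NotifyUsers(duplicateValues); // push / SMS / SignalR -- your pick
            }
            Thread.Sleep(TimeSpan.FromSeconds(1));
        }
    }

    static void NotifyUsers(int duplicateValues)
    {
        // Placeholder: wire this up to Notification Hubs, SMS, or SignalR.
    }
}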
The key here is to determine whether it's more cost-effective, and in line with any bigger-picture technology or business goals, to have each user issue the query continuously, or to use some separate process in Azure to notify users asynchronously if the "trigger condition" is met.
Best of luck!
I am developing a web service which will serve users all over the world.
The server is based on a C# WCF application hosted on IIS.
It uses an MsSQL Configuration Database (access time is not important here),
and a MongoDB database which contains all the important data (access time is VERY important here).
It also serves small images (48px × 48px JPEGs).
Now, for the image hosting I will probably use Amazon's CloudFront CDN hosting (unless you guys have better suggestions).
My issue is maintaining a low access time (ping) to both the Web Application and the MongoDB.
I was thinking of leasing 4 servers (Singapore + US + Europe + Middle East) to get a low response time.
Each server will hold the Web Application and an instance of MongoDB.
And one server will hold the MsSQL instance.
I need all the MongoDB instances to be synced (not necessarily instantly, if that's an issue).
What design would you use?
Low access time is a function of cost vs. benefit. First you need to identify how low is low: do you need an overall response time of 100 ms from the app, or 1 s?
Once you do, you map out the different costs.
Total time taken = time for request (across the internet)
                 + processing by web app
                 + request for data
                 + preparing the response
                 + response back to client
If your desired latency is 100 ms, there is a good chance that it can't be done regardless of how fast your servers are, simply because the network traffic might take too long.
You need to analyze your dataset. Querying 1,000 documents is different from querying 1 billion docs. You need to calculate how much space the indexes take, and whether they fit in RAM. If an index is not in RAM, your access is going to be slow.
MongoDB configuration
MongoDB can work in a cluster, with automatic syncing (immediate or delayed, this is configurable) and automatic failover (or manual, also configurable). It also supports sharding if your dataset is huge, so each request is sent to the server that actually contains the data.
Similarly, you need to have a look at your app server and figure out how slow or fast its components are in order to get a guaranteed response time.
With the information you have provided, this is about as detailed a response as I can give.
Profile and then optimize
If 80% of your requests come from the Middle East, then you ought to make it fast for them first. Using the same principle, you need to figure out the slowest components of the response time and improve them. In order to do that, you need to gather the data.
Clustering
Setting up a cluster in one continent or across continents will help you provide redundancy, automatic failover (if configured), and load balancing (depending on how you configure it). If you have a lot of data, consider sharding.
Consider going through the docs for replication and sharding.
Example Server Setup
Suppose you want to have 10 shards with a replication factor of 3, i.e. your data is divided across 10 shards, and each shard is really a replica set of 3 servers (for availability and failover), i.e. each server in the replica set contains a duplicate of that shard's data.
Here the notation s1p1 means shard 1, primary 1; s1s1 means shard 1, secondary 1; and so on:
s1p1 s2p1 ... s10p1
s1s1 s2s1 ... s10s1
s1s2 s2s2 ... s10s2
Shards 1-10 divide the data, with each shard keeping roughly 1/10 of the total. Each shard consists of a replica set with a primary and 2 secondaries. You can increase this if you need more redundancy. Try to keep the member count odd, so that during elections there is a tie breaker. If you want to keep only 2 copies of the data, you can instead introduce an "arbiter" to break the tie.
You could analyze your queries and choose a shard key so that they go to the closest server, or to the server that serves the region. You'll most likely have to do some sort of analysis to optimize this bit.
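On the "closest server" point, the official MongoDB C# driver can also route reads to the lowest-latency replica set member for you. A minimal sketch, assuming the MongoDB.Driver package and placeholder host, database, and collection names:

using MongoDB.Bson;
using MongoDB.Driver;

class RegionalReads
{
    static IMongoCollection<BsonDocument> GetCollection()
    {
        // List members from several regions; readPreference=nearest lets the
        // driver send reads to the member with the lowest measured latency.
        var client = new MongoClient(
            "mongodb://mongo-eu:27017,mongo-us:27017,mongo-sg:27017" +
            "/?replicaSet=rs0&readPreference=nearest");

        return client.GetDatabase("appdata")
                     .GetCollection<BsonDocument>("importantData");
    }
}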
Hope it helps.
I have a simple Windows Forms application written in C# 4.0. The application shows some of the records from a database. The application features a query option which is initiated by the user.
The records in the database can be thought of as jobs.
Consider two of the columns, JobID and Status.
These are updated by two background services which effectively work as a producer and a consumer; the status of each job is updated by these services running in the background.
Now, the user has an option to query the records from the database, e.g. to query data based on status (Submitted, Processing, Completed). This can result in thousands of records, and the GUI might face performance glitches when displaying that much data.
Hence, it's important to display the query results in chunks, as pages. The GUI isn't refreshed until the user manually refreshes or makes a new query.
For example, since the jobs are constantly being updated by the services, a job's status can be different at any point in time. The basic requirement is that the pages should show the data as it was at the time it was fetched from the DB.
I am using LINQ to SQL for fetching data from the DB. It's quite easy to use, but it doesn't offer the mid-level caching required to meet this demand. Using process memory to cache the results can push memory usage to the extreme if the number of records is very high. Unfortunately, LINQ to SQL doesn't provide any mid-tier caching facilities with its DataContext objects.
What is the preferable way to implement a paging mechanism in a C# 4.0 + SQL Server + Windows environment?
Some alternatives I'm considering: a duplicate table/DB which can temporarily store the results as a cache, or using the Enterprise Library's Caching Application Block. I believe this is a typical problem faced by many developers. Which is the most efficient way to solve it? (NOTE: my application and DB are running on the same box.)
While caching is a sure way to improve performance, implementing a caching strategy properly can be more difficult than it may seem. The problem is managing cache expiration, or essentially ensuring that the cache is synchronized to the desired degree. Therefore, before considering caching, consider whether you need it in the first place. Based on what I can gather from the question, it seems like the data model is relatively simple and doesn't require any joins. If that is the case, why not optimize the tables and indexes for pagination? SQL Server and LINQ to SQL will handle pagination for thousands of records transparently and with ease.
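For example, a minimal sketch of server-side paging with LINQ to SQL, assuming a hypothetical JobsDataContext and Job entity based on the question's schema:

using System.Collections.Generic;
using System.Linq;

static List<Job> GetJobPage(JobsDataContext db, string status, int pageIndex, int pageSize)
{
    // Skip/Take are translated into a ROW_NUMBER()-based query on SQL Server,
    // so only one page of rows ever leaves the database.
    return db.Jobs
             .Where(j => j.Status == status)
             .OrderBy(j => j.JobID)   // a stable order is required for paging
             .Skip(pageIndex * pageSize)
             .Take(pageSize)
             .ToList();
}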
You are correct in stating that displaying too many records at once is prohibitive for the GUI, and it is also prohibitive for the user. No user will want to see more records than fill the screen at any given time. Given the constraint that the data doesn't need to be refreshed until requested by the user, it should be safe to assume that the number of queries will be relatively low. The additional constraint that the DB is on the same box as the application further solidifies the point that you don't need caching: SQL Server already does caching internally.
All advice about performance tuning states that you should profile and measure performance before attempting to make optimizations. As stated by Donald Knuth, premature optimization is the root of all evil.
I am quite confused about which approach to take and what best practice is.
Let's say I have a C# application which does the following:
It sends emails from a queue. The emails to send and all their content are stored in the DB.
Now, I know how to make my C# application almost scalable but I need to go somewhat further.
I want some way of distributing the tasks across, say, X servers, so that it is not just 1 server doing all the processing but the work is shared amongst the servers.
If one server goes down, then the load is shared between the other servers. I know NLB does this, but I'm not looking for NLB here.
Sure, you could add a column of some kind in the DB table to indicate which server should be assigned to process that record, and each of the applications on the servers would have an ID of some kind that matches the value in the DB so they would only pull their own records - but I consider this cheap, bad practice, and unrealistic.
Holding a DB table row lock is also not something I would do, due to potential deadlocks and other possible issues.
I am also NOT suggesting using threading "to the extreme" here, but yes, there will be a thread per item to process, or items batched up per thread for x number of threads.
How should I approach this, and what do you recommend, for making a C# application which is scalable and highly available? The aim is to have X servers, each running the same application, each able to get records and process them, but with the processing load/items shared amongst the servers, so that if one server or service fails, the others can take on that load until another server is put back.
Sorry for my lack of understanding or knowledge, but I have been thinking about this quite a lot and have lost sleep trying to come up with a good, robust solution.
I would be thinking of batching up the work, so each app only pulls back x number of records at a time, marking those retrieved records as taken with a bool field in the table. I'd amend the SELECT statement to pull only records not marked as taken/done. Table locks would be OK in this instance for very short periods, to ensure there is no overlap of apps processing the same records.
EDIT: It's not very elegant, but you could have a datestamp and a status for each entry (instead of the bool field above). Then you could run a periodic SQL Server Agent job which runs a sproc to reset the status of any records which are In Progress but have gone beyond a time threshold without being set to complete. They would then be ready for reprocessing by another app later on.
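To make the claim-and-mark step concrete, here's a rough sketch under assumed table and column names (EmailQueue, EmailId, Status, ClaimedAtUtc, and so on). The UPDATE ... OUTPUT statement claims a batch atomically in one round trip, and READPAST skips rows another server has already locked, so workers don't fight over the same records:

using System;
using System.Data.SqlClient;

class EmailWorker
{
    // Atomically claims up to @BatchSize queued rows for this worker and
    // returns them in the same round trip.
    const string ClaimSql = @"
        UPDATE TOP (@BatchSize) EmailQueue WITH (ROWLOCK, READPAST)
        SET    Status = 'InProgress',
               ClaimedAtUtc = GETUTCDATE()
        OUTPUT inserted.EmailId, inserted.Recipient, inserted.Subject, inserted.Body
        WHERE  Status = 'Queued';";

    public static void ProcessBatch(string connectionString, int batchSize)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(ClaimSql, conn))
        {
            cmd.Parameters.AddWithValue("@BatchSize", batchSize);
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    int emailId = reader.GetInt32(0);
                    // ... send the email, then set Status = 'Done' for emailId ...
                }
            }
        }
    }
}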
This may not be enterprise-y enough for your tastes, but I'd bet my hide that there are plenty of apps out there in the enterprise which are just as unsophisticated and work just fine. The best things work with the least complexity.
Greetings,
I've been working on a C#.NET app that interacts with a data logger. The user can query and obtain logs for a specified time period, and view plots of the data. Typically a new data log is created every minute and stores a measurement for a few parameters. To get meaningful information out of the logger, a reasonable number of logs need to be acquired - data for at least a few days. The hardware interface is a UART to USB module on the device, which restricts transfers to a maximum of about 30 logs/second. This becomes quite slow when reading in the data acquired over a number of days/weeks.
What I would like to do is improve the perceived performance for the user. I realize that, with the hardware speed limitation, the user will have to wait for the full download cycle at least the first time they acquire a larger set of data. My goal is to cache all data seen by the app, so that it can be obtained faster if it is ever requested again. The approach I have been considering is to use a lightweight database, like SQL Server CE, that can store the data logs as they are received. I am then hoping to search the cache first, prior to querying the device for logs. The cache would be updated with any logs obtained by the request that were not already cached.
Finally, my question: would you consider this to be a good approach? Are there any better alternatives you can think of? I've tried to search SO and Google for reinforcement of the idea, but I mostly run into discussions of web request/content caching.
Thanks for any feedback!
Seems like a very reasonable approach. Personally I'd go with SQL CE for storage, make sure you index the column holding the datetime of the record, then use TableDirect on the index for getting and inserting data so it's blazing fast. Since your data is already chronological, there's no need to get the (slower) SQL query processor involved; just seek to the date (or the end) and roll forward with a SqlCeResultSet. You'll end up being limited only by I/O. I profiled doing really, really similar stuff on a project and found that TableDirect with SQL CE was just as fast as a flat binary file.
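A rough sketch of that TableDirect approach, where the table name, index name, and column layout are all placeholders:

using System;
using System.Data;
using System.Data.SqlServerCe;

class LogCache
{
    // Seeks directly on the datetime index and rolls forward -- no query
    // processor involved.
    public static void ReadLogsFrom(string connectionString, DateTime fromUtc)
    {
        using (var conn = new SqlCeConnection(connectionString))
        using (var cmd = conn.CreateCommand())
        {
            conn.Open();
            cmd.CommandType = CommandType.TableDirect;
            cmd.CommandText = "DataLogs";                 // placeholder table
            cmd.IndexName = "IX_DataLogs_Timestamp";      // placeholder index
            cmd.SetRange(DbRangeOptions.InclusiveStart,
                         new object[] { fromUtc }, null);

            using (SqlCeResultSet rs =
                   cmd.ExecuteResultSet(ResultSetOptions.Scrollable))
            {
                while (rs.Read())
                {
                    DateTime timestamp = rs.GetDateTime(0);
                    // ... read the remaining parameter columns ...
                }
            }
        }
    }
}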
I think you're on the right track wanting to store it locally in some queryable form.
I'd strongly recommend SQLite. There's a .NET class here.
When writing ASP.NET pages, what signs do you look for that your page is making too many roundtrips to a database or server?
(This is a general question but I say ASP.NET as the majority of my coding is on the web side of things).
How much is too much? The €1M question! Profile. Then profile again. If your app is spending most of its time doing data access, you have a problem (and should look at a SQL trace). If it is spending most of its time drawing the UI, then (assuming your view isn't doing data access) you should probably look elsewhere first...
Round trips are more relevant to latency than the total quantity of data being moved, so it really does make sense to optimize for them. The usual way is to use stored procedures that do multiple steps, perhaps even returning multiple result sets.
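As a sketch of the multiple-result-set idea (the procedure name, its parameter, and the connectionString/customerId variables are all assumptions), one round trip brings back both sets:

using System.Data;
using System.Data.SqlClient;

// Hypothetical procedure "GetPageData" returns two result sets:
// the customer header and that customer's orders, fetched in one round trip.
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("GetPageData", conn))
{
    cmd.CommandType = CommandType.StoredProcedure;
    cmd.Parameters.AddWithValue("@CustomerId", customerId);
    conn.Open();

    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            // first result set: customer header
        }

        reader.NextResult();   // move to the second result set

        while (reader.Read())
        {
            // second result set: the customer's orders
        }
    }
}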
What I do is look at the ASP.NET performance counters and the SQL performance counters. To get an accurate measurement, you must ensure that there is no random noise activity on the SQL Server (i.e. no import batches running that are unrelated to the web site).
The relevant counters I look at are:
SQL Statistics/Batch requests/sec: This indicates exactly how many Transact-SQL batches the server receives. It can be, in most cases, equated 1:1 with the number of round trips from the web site to SQL.
Databases/Transactions/sec: this counter is instanced per database, so I can quickly see in which database there is 'activity'. This way I can correlate the web site data round trips (i.e. my app logic requests, which go to the app database) with the ASP session state and user traffic (which goes to the ASP session DB or tempdb).
Databases/Write Transactions/sec: I correlate this with the counter above (transactions per second) so I can get a feel for the read-to-write ratio of the site.
ASP.NET Applications/Requests/sec: With this counter I can get the number of requests/sec the site is seeing. Correlated with the number of SQL Batch Requests/sec it gives a good indication of the average number of round-trips per request.
The next thing to measure is usually where the time is spent in the request. On my own projects I make heavy use of performance counters that I publish myself, so it is really easy to measure. But I'm not always so lucky as to be cleaning up only my own mess... Profiling is usually not an option for me, because most of the time I'm troubleshooting live production systems that I cannot instrument.
My approach is to try to sort out the SQL side of things first, since it's easy to find the relevant statistics for execution times in SQL: SQL Server keeps nice aggregated statistics ready to look at in sys.dm_exec_query_stats. I can also use Profiler to measure execution duration in real time. With some analysis of the numbers collected, and knowing the normal request pattern of the most visited pages, you can give a pretty good estimate of the total time spent in SQL per web request. If this time adds up to nearly all the time it takes to serve the page, then you have your answer.
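As a rough sketch of reading those aggregated statistics from code (the column selection and the 200-character statement excerpt are just illustrative choices):

using System;
using System.Data.SqlClient;

class QueryStatsDump
{
    // Dumps the top statements by total elapsed time from
    // sys.dm_exec_query_stats (the numbers reset when the plan cache does).
    const string Sql = @"
        SELECT TOP (10)
               qs.total_elapsed_time / 1000 AS total_ms,
               qs.execution_count,
               SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1, 200) AS statement_text
        FROM   sys.dm_exec_query_stats AS qs
        CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
        ORDER BY qs.total_elapsed_time DESC;";

    public static void Dump(string connectionString)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(Sql, conn))
        {
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    Console.WriteLine("{0,10} ms  {1,8} execs  {2}",
                        reader.GetInt64(0), reader.GetInt64(1), reader.GetString(2));
                }
            }
        }
    }
}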
And to answer the original question title: to reduce the number of round trips, you make fewer requests. Seriously. First, cache what is appropriate to cache (that one, I guess, is obvious). Second, reduce the complexity: don't display unnecessary data on each page, cache and display stale data when you can get away with it, and hide details in secondary navigation panels.
If you feel that the problem is the number of round trips per se, as opposed to the number of requests (i.e. you would benefit tremendously from batching multiple requests into one round trip), then you should somehow measure that the round-trip overhead is what's killing you. With connection pooling on a normal network connection, this is usually not the most important factor.
And finally, you should check that everything that can be done in sets is done in sets. If you have some half-brained ORM that retrieves objects one at a time from an ID keyset, get rid of it.
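In LINQ to SQL terms, the difference looks roughly like this (db is a hypothetical DataContext with a Customers table, and customerIds is a List<int>):

// N+1 pattern: one round trip per ID -- this is what to get rid of.
var customersSlow = new List<Customer>();
foreach (int id in customerIds)
    customersSlow.Add(db.Customers.Single(c => c.Id == id));

// Set-based: Contains() is translated into a single IN (...) query,
// so the whole set comes back in one round trip.
var customersFast = db.Customers
                      .Where(c => customerIds.Contains(c.Id))
                      .ToList();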
I know this may sound repetitive, but the number of client-server round trips depends on how much of the program logic sits on each side of the connection.
The first thing to check is validation: you always have to validate and sanitize your input on the server side, but that doesn't mean you can't also do it on the client side, removing round trips that are used only to check input.
Second: what can you do on the client side to reduce server-side load? There are checks and calculations you can perform on the client. There is also AJAX, which can be used to load only the part of the page that is changing.
Third: can you delegate work to another server? If your server is too loaded, why not use web services or simply delegate part of the logic to another server?
As Mark wrote: how much is too much? It is up to you and your budget.
When writing ASP.NET pages, what signs do you look for that your page is making too many roundtrips to a database or server?
Of course it all depends, and you have to profile. However, here are some indicators; they do not necessarily mean there is a problem, but they often do indicate one:
The page takes a very long time to render locally.
Read this question: Slow response-time cheat sheet, in particular this link.
To render the page you need more than 30 round trips. I pulled that number out of my hat, but assuming a round trip takes about 3.5 ms, 30 round trips (30 × 3.5 ms ≈ 105 ms) will kick you over the 100 ms guideline, before any other kind of processing.
All the queries involved in rendering the page are heavily optimized and take no longer than a millisecond or two to execute, there are no CPU-heavy operations that run every time you render the page, and yet the page is still slow - the remaining time is likely going into round trips.
Data access is abstracted away and not cached in any way. If, for example, GetCustomer calls the DAL, which in turn issues a query, and your page asks for 100 Customer objects which are not retrieved in a batch, you are probably in trouble.