How "expensive" is GetDiskFreeSpaceEx? OR - is there a better way to fallback - c#

I am writing a service that saves files and may eventually fill up its primary storage; it may have a secondary storage to fall back to. I'd like to monitor the amount of free space on the primary storage and stop even trying to write to it if it's running low. To do this, I was thinking of calling GetDiskFreeSpaceEx fairly frequently - maybe once every few seconds. Is GetDiskFreeSpaceEx expensive - is there a reason not to do this? Or is there a better pattern to accomplish what I'm trying to do?
[EDIT]
The kind of answer I was hoping for might come from either
A) someone who has an idea of what GetDiskFreeSpaceEx actually does behind the scenes, who might say either "don't do that, the system does all the expensive operations when you ask for the free disk space" or "that's fine to do; the system keeps a count of that in memory and just has to retrieve it for you", or
B) someone who has written an application with a similar requirement and has already been down this road and done the research.
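For what it's worth, in .NET the simplest route is DriveInfo.AvailableFreeSpace, which wraps GetDiskFreeSpaceEx under the hood. A minimal sketch of the check-before-write pattern, with a cached value refreshed every few seconds (the drive letters, paths, and 5 GB threshold are made up for illustration):

    using System;
    using System.IO;

    class StorageSelector
    {
        const long MinFreeBytes = 5L * 1024 * 1024 * 1024;   // assumed threshold: 5 GB
        static readonly TimeSpan CheckInterval = TimeSpan.FromSeconds(5);

        DateTime _lastCheck = DateTime.MinValue;
        string _writeRoot = @"D:\primary";                   // hypothetical paths

        public string GetWriteRoot()
        {
            if (DateTime.UtcNow - _lastCheck > CheckInterval)
            {
                // AvailableFreeSpace calls GetDiskFreeSpaceEx internally.
                var primary = new DriveInfo("D");
                _writeRoot = primary.AvailableFreeSpace > MinFreeBytes
                    ? @"D:\primary"
                    : @"E:\secondary";
                _lastCheck = DateTime.UtcNow;
            }
            return _writeRoot;
        }
    }

Caching the result like this bounds the number of native calls no matter how often the service writes.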

Related

Trying to optimize I/O for MongoDB

I have an updater script that runs every few hours for various regions on a gaming server. I am looking to run this script more frequently and add more regions. Ideally I would love to spread the CPU and I/O load as evenly as possible. I used to run this script using mysql, but now the website uses mongodb for everything, so it kinda made sense to move the updater scripts to mongodb too. I am having really high I/O spikes when mongodb flushes all of the updates to the database.
The script is written in C#, although I don't think that's too relevant. More important is that we are doing about 500k to 1.2 million updates each time one of these scripts runs. We have done some small optimizations in the code and with indexes, but at this point we are stuck on how to optimize the actual mongodb settings.
Some other important information is that we do something like this:

    update({'someIdentifier':1}, $newDocument)

instead of this:

    $set : { internalName : 'newName' }
Not sure if this is a lot worse in performance than doing $set or not.
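For reference, here is how the two styles look in C# with the modern (2.x) MongoDB driver, which postdates this question; the database and collection names are hypothetical, the field names are from the post:

    using MongoDB.Bson;
    using MongoDB.Driver;

    var client = new MongoClient("mongodb://localhost");            // assumed connection string
    var collection = client.GetDatabase("game")
                           .GetCollection<BsonDocument>("regions"); // hypothetical names

    var filter = Builders<BsonDocument>.Filter.Eq("someIdentifier", 1);

    // Full-document replacement: the entire document is rewritten.
    var newDocument = new BsonDocument
    {
        { "someIdentifier", 1 },
        { "internalName", "newName" }
    };
    collection.ReplaceOne(filter, newDocument);

    // Targeted update: only the named field is touched.
    collection.UpdateOne(filter,
        Builders<BsonDocument>.Update.Set("internalName", "newName"));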
What can we do to try and spread the load out? I can assign more memory to the VM if that will help as well.
I am happy to provide more information.
Here are my thoughts:
1) Properly explain your performance concerns.
So far I can't really figure out what the issue is or if you have one at all. As far as I can tell you're doing around a GB of updates and are writing about a GB of data to the disk... not much of a shock.
Oh, and do some damn testing - "Not sure if this is a lot worse in performance than doing $set or not" - why don't you know? What do your tests say?
2) Check to see if there is any hardware mismatch.
Is your disk just slow? Is your working set bigger than RAM?
3) Ask on mongo-user and other MongoDB specific communities...
...simply because you might get a better answer there than the lack of answers here.
4) Consider trying TokuMX.
Wait, what? Didn't I just accuse the last guy who suggested that of basically spamming his own product?
Sure, it's a new product that's only very recently been introduced into Mongo (it appears to have had a mysql version for a bit longer), but the fundamentals seem sound. In particular, it's very fast at not only insertions but also updates/deletions. It achieves this by not actually going and making the changes to the document in question, but storing the insertion/update/deletion messages in a buffered queue as part of the index structure. As the buffer fills up it applies these changes in bulk, which is massively more efficient in terms of I/O. On top of that, it compresses stored data, which should additionally reduce I/O - there's physically less to write.
The biggest disadvantage I can see so far is that its best performance is seen with big data - if your data fits into RAM then it loses to B-trees in a bunch of tests. Still fast, but not as fast.
So yeah, it's very new and I would not trust it for anything without testing, and even then only for non-mission-critical stuff, but it might be what you're looking for. And TBH, as it's just a new index/storage sub-system... it fits the bill of being an optimisation for mongodb rather than a separate product. Especially since index/storage systems in mongodb have always been a bit simple - 'let's use memory-mapped files for caching' etc.

How do I know if I'm about to get an OutOfMemoryException?

I have a program that processes high volumes of data, and can cache much of it in memory for reuse with subsequent records. The more I cache, the faster it works. But if I cache too much, boom, start over, and that takes a lot longer!
I haven't been too successful trying to do anything after the exception occurs - I can't get enough memory to do anything.
Also I've tried allocating a huge object, then de-allocating it right away, with inconsistent results. Maybe I'm doing something wrong?
Anyway, what I'm stuck with is just setting a hardcoded limit on the number of cached objects that, from experience, seems to be low enough. Any better ideas? Thanks.
[EDIT after answer]
The following code seems to be doing exactly what I want:
    Do
        Dim memFailPoint As MemoryFailPoint = Nothing
        Try
            ' size in MB of the several objects I'm about to add to the cache
            memFailPoint = New MemoryFailPoint(mysize)
            memFailPoint.Dispose()
        Catch ex As InsufficientMemoryException
            ' dump the oldest items here
        End Try
        ' do work
    Loop
I need to test if it is slowing things down in this arrangement or not, but I can see the yellow line in Task Manager looking like a very healthy sawtooth pattern with a consistent top - yay!!
You can use MemoryFailPoint to check for available memory before allocating.
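In C#, a minimal sketch of that check looks like this (the 64 MB figure is an arbitrary estimate of the next batch of cached objects):

    using System;
    using System.Runtime;

    int sizeInMegabytes = 64; // estimated size of the objects about to be cached
    try
    {
        // Throws InsufficientMemoryException if the allocation would likely
        // fail, without actually allocating anything itself.
        using (new MemoryFailPoint(sizeInMegabytes))
        {
            // allocate and cache the objects here
        }
    }
    catch (InsufficientMemoryException)
    {
        // evict the oldest cache entries instead
    }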
You may need to think about your release strategy for the cached objects. There is no possible way you can hold all of them forever, so you need to come up with an expiration timeframe and have older cached objects removed from memory. It should be possible to find out how much memory is left and use that as part of your strategy, but one thing is certain: old objects must go.
If you implement your cache with WeakReferences (http://msdn.microsoft.com/en-us/library/system.weakreference.aspx), the cached objects remain eligible for garbage collection in situations where you might otherwise throw an OutOfMemory exception.
This is an alternative to a fixed-size cache, but potentially has the problem of being overly aggressive in clearing out the cache when a GC does occur.
You might consider a hybrid approach, where a (tunable) fixed number of entries are held with strong references but the cache is allowed to grow beyond that with weak references. Or this may be overkill.
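A minimal sketch of the weak-reference idea (the class and its members are invented for illustration):

    using System;
    using System.Collections.Generic;

    class WeakCache<TKey, TValue> where TValue : class
    {
        readonly Dictionary<TKey, WeakReference<TValue>> _items =
            new Dictionary<TKey, WeakReference<TValue>>();

        public void Add(TKey key, TValue value)
        {
            _items[key] = new WeakReference<TValue>(value);
        }

        public bool TryGet(TKey key, out TValue value)
        {
            value = null;
            // The entry survives only if the GC has not collected the target.
            return _items.TryGetValue(key, out var weak)
                && weak.TryGetTarget(out value);
        }
    }

For the hybrid approach, the first N entries would also be held in an ordinary Dictionary<TKey, TValue>, so they can never be collected.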
There are a number of metrics you can use to keep track of how much memory your process is using:
GC.GetTotalMemory
Environment.WorkingSet (This one isn't useful, my bad)
The native GlobalMemoryStatusEx function
There are also various properties on the Process class
The trouble is that there isn't really a reliable way of telling from these values alone whether or not a given memory allocation will fail: although there may be sufficient space in the address space for a given allocation, memory fragmentation means that the space may not be contiguous, so the allocation may still fail.
You can however use these values as an indication of how much memory the process is using and therefore whether or not you should think about removing objects from your cache.
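The managed ones are one-liners to read (GlobalMemoryStatusEx needs a P/Invoke declaration, so it is omitted here):

    using System;
    using System.Diagnostics;

    Console.WriteLine("Managed heap:  " + GC.GetTotalMemory(false));
    Console.WriteLine("Working set:   " + Environment.WorkingSet);

    using (var process = Process.GetCurrentProcess())
    {
        Console.WriteLine("Private bytes: " + process.PrivateMemorySize64);
        Console.WriteLine("Virtual bytes: " + process.VirtualMemorySize64);
    }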
Update: It's also important to make sure that you understand the distinction between virtual memory and physical memory - unless your page file is disabled (very unlikely), the OutOfMemoryException will be caused by a lack or fragmentation of the virtual address space.
If you're only using managed resources you can use the GC.GetTotalMemory method and compare the results with the maximum allowed memory for a process on your architecture.
A more advanced solution (I think this is how SQL Server manages to actually adapt to the available memory) is to use the CLR Hosting APIs:
the interface allows the CLR to inform the host of the consequences of
failing a particular allocation
which will mean actually removing some objects from the cache and trying again.
Anyway, I think this is probably overkill for almost all applications unless you really need amazing performance.
The simple answer... By knowing what your memory limit is.
The closer you get to that limit, the closer you are to getting an OutOfMemoryException.
The more elaborate answer... Unless you write such a mechanism yourself, programming languages/systems do not work that way; as far as I know they cannot inform you ahead of time that you are about to exceed a limit, BUT they will gladly inform you when the problem has occurred, and that usually happens through exceptions which you are supposed to write code to handle.
Memory is a resource that you can use; it has limits and it also has some conventions and rules for you to follow to make good use of that resource.
I believe that what you are doing - setting a good limit, hard-coded or configurable - is your best bet.

using C# for real time applications

Can C# be used for developing a real-time application that involves continuously taking input from a web cam and processing it?
You cannot use any mainstream garbage-collected language for "hard real-time systems", as the garbage collector will sometimes stop the system from responding within a defined time. Avoiding object allocation can help, but you need a way to prove you are not creating any garbage and that the garbage collector will not kick in.
However, most "real time" systems don't in fact need to always respond within a hard time limit, so it all comes down to what you mean by "real time".
Even when parts of the system need to be "hard real time", often other large parts of the system, like the UI, don't.
(I think your app needs to be fast rather than "real time"; if 1 frame is lost every 100 years, how many people will get killed?)
I've used C# to create multiple realtime, high speed, machine vision applications that run 24/7 and have moving machinery dependent on the application. If something goes wrong in the software, something immediately and visibly goes wrong in the real world.
I've found that C#/.Net provide pretty good functionality for doing so. As others have said, definitely stay on top of garbage collection. Break the processing up into several logical steps, and have a separate thread working on each. I've found the producer-consumer programming model to work well for this; perhaps ConcurrentQueue for starters.
You could start with something like:
Thread 1 captures the camera image, converts it to some format, and puts it into an ImageQueue
Thread 2 consumes from the ImageQueue, processing the image and comes up with a data object that is put onto a ProcessedQueue
Thread 3 consumes from the ProcessedQueue and does something interesting with the results.
If Thread 2 takes too long, Threads 1 and 3 are still chugging along. If you have a multicore processor you'll be throwing more hardware at the math. You could also use several threads in place of any thread that I wrote above, although you'd have to take care of ordering the results manually.
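A self-contained skeleton of that pipeline using BlockingCollection (the stage bodies are placeholders; a real app would capture from the camera, run the vision code, and act on the result):

    using System;
    using System.Collections.Concurrent;
    using System.Threading;
    using System.Threading.Tasks;

    class Pipeline
    {
        // Placeholder stage implementations.
        static byte[] CaptureFrame() => new byte[640 * 480];
        static int ProcessImage(byte[] frame) => frame.Length;
        static void UseResult(int result) => Console.WriteLine(result);

        static void Main()
        {
            // Bounded queues give back-pressure: a slow stage stalls the one
            // before it instead of letting memory grow without limit.
            var imageQueue = new BlockingCollection<byte[]>(boundedCapacity: 8);
            var processedQueue = new BlockingCollection<int>(boundedCapacity: 8);
            var cts = new CancellationTokenSource(TimeSpan.FromSeconds(5));

            var capture = Task.Run(() =>
            {                                            // Thread 1
                while (!cts.IsCancellationRequested)
                    imageQueue.Add(CaptureFrame());
                imageQueue.CompleteAdding();
            });
            var process = Task.Run(() =>
            {                                            // Thread 2
                foreach (var frame in imageQueue.GetConsumingEnumerable())
                    processedQueue.Add(ProcessImage(frame));
                processedQueue.CompleteAdding();
            });
            var consume = Task.Run(() =>
            {                                            // Thread 3
                foreach (var result in processedQueue.GetConsumingEnumerable())
                    UseResult(result);
            });

            Task.WaitAll(capture, process, consume);
        }
    }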
Edit
After reading other peoples answers, you could probably argue my definition of "realtime". In my case, the computer produces targets that it sends to motion controllers which do the actual realtime motion. The motion controllers provide their own safety layers for things like timing, max/min ranges, smooth accel/decelerations and safety sensors. These controllers read sensors across an entire factory with a cycle time of less than 1ms.
Absolutely. The key will be to avoid garbage collection and memory management as much as possible. Try to avoid new-ing objects as much as possible, using buffers or object pools when you can (a sketch follows below).
Of course, someone has even developed a library to do that: AForge.NET
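A trivial buffer pool along those lines, as a sketch rather than production code (modern .NET has System.Buffers.ArrayPool<T> built in for exactly this):

    using System.Collections.Concurrent;

    // Rent a buffer instead of allocating a new one, and return it when done,
    // so the GC sees far fewer short-lived objects.
    class BufferPool
    {
        readonly ConcurrentBag<byte[]> _bag = new ConcurrentBag<byte[]>();
        readonly int _size;

        public BufferPool(int bufferSize) { _size = bufferSize; }

        public byte[] Rent()
        {
            return _bag.TryTake(out var buffer) ? buffer : new byte[_size];
        }

        public void Return(byte[] buffer)
        {
            _bag.Add(buffer);
        }
    }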
As with any real-time application, and not just in C#, you'll have to manage the buffers well, as David suggested.
Not only that, there's also the XNA Framework (for things like 3D games), and you can program DirectX using C# as well, both of which are very much real-time territory.
And did you know that, if you want, you can do pointer manipulations in C# too?
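For example, pinning an image buffer and walking it with a raw pointer (requires compiling with /unsafe):

    static unsafe void Invert(byte[] pixels)
    {
        // fixed pins the array so the GC cannot move it while we hold a pointer.
        fixed (byte* p = pixels)
        {
            for (int i = 0; i < pixels.Length; i++)
                p[i] = (byte)(255 - p[i]);
        }
    }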
It depends on how 'real-time' it needs to be; ie, what your timing constraints are, and how quickly you need to 'do something'.
If you can handle 'doing something' maybe every 300ms or so in .NET, say on a timer event, I've found Windows to work okay. Note that this is something I found true on multiple systems of different ages and different speeds. As always, YMMV.
But that number is awfully long for a lot of applications. Maybe not for yours.
Do some research, make sure your app responds quickly enough for your application.

How to keep memory of process protected

I have a process that will have some important values in memory. I don't want anyone to be able to read the memory of my process and obtain those values. So I tried to create a program that would look at the list of running programs and determine if any of them were "debuggers", etc. But I realized that someone could just write a quick program to dump the memory of my process. I know several processes on my system have their memory protected. How could I also obtain this? (ps: I'm using C#)
Any application that runs under a user with sufficient privileges (e.g. local administrator) can call ReadProcessMemory and read your process at will, any time, without being attached to your process's debugging port, and without your process being able to prevent, or even detect, this. And I'm not even going into what a kernel-mode driver can do...
Ultimately, all solutions available to do this are either snake oil, or just a way to obfuscate the problem by raising the bar to make it harder. Some do make it really hard, but none make it bullet-proof. But ultimately, one cannot hide anything from a user that has physical access to the machine and has sufficiently elevated privileges.
If you don't want users to read something, simply don't have it on the user's machine. Use a service model where your IP lives on a server and users access it via the internet (i.e. web services).
First of all, there will always be a way to dump the memory image of your program. Your program can only make it harder, never impossible. That said, there may be ways to 'hide' the values. It is generally considered hard to do and not worth the trouble, but there are programs which encrypt those values in memory. However, to be able to use them, they need to decrypt them temporarily and re-encrypt (or discard) them afterwards.
While encryption is easy with the .Net framework, discarding the decrypted value is not an easy thing to do in C#. In C, you would allocate a chunk of memory to store the decrypted values and clear it (by writing zeros or random data to it) before freeing it. In C#, there is no guarantee that your data won't be stored somewhere (caching, garbage collection) and you won't be able to clear it. However, as eulerfx noted in a comment, in .Net 4.0 SecureString may be used. How safe that is, I don't know.
As you may see, there will always be a short time where the value lies in memory unencrypted, and that is the vulnerability here.
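For reference, the .NET pieces mentioned around here look roughly like this (both are .NET Framework APIs; ProtectedMemory requires the buffer length to be a multiple of 16 bytes, and newer .NET guidance has since moved away from SecureString):

    using System;
    using System.Security;
    using System.Security.Cryptography;

    // SecureString keeps its characters encrypted in memory until used.
    var secret = new SecureString();
    foreach (char c in "p@ssw0rd")   // example value only
        secret.AppendChar(c);
    secret.MakeReadOnly();

    // ProtectedMemory encrypts a byte buffer in place, for this process only.
    byte[] data = new byte[16];      // length must be a multiple of 16
    ProtectedMemory.Protect(data, MemoryProtectionScope.SameProcess);
    // ... later, decrypt just long enough to use it ...
    ProtectedMemory.Unprotect(data, MemoryProtectionScope.SameProcess);
    Array.Clear(data, 0, data.Length); // wipe the plaintext when done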
I think the best way to do it is to employ a commercial solution such as the one in this PDF brochure (this is their website). You may be better off going down this route if you really care about protecting the application from sniffing, IP theft, etc. instead of rolling your own solution...
Edit: I would not go down the route of kidding myself that the solution I craft up will be tamper-proof, crack-proof, and idiot-proof. Leave that to a company like the Arxan I mentioned (I ain't a sales rep - I'm just giving an example). Sure, it might be costly, but you can sleep better at night knowing it is much harder for a cracker to break than having no solution at all...

What is wrong with polling?

I have heard a few developers recently say that they are simply polling stuff (databases, files, etc.) to determine when something has changed and then run a task, such as an import.
I'm really against this idea and feel that utilising available technology such as Remoting, WCF, etc. would be far better than polling.
However, I'd like to identify the reasons why other people prefer one approach over the other and, more importantly, how I can convince others that polling is wrong in this day and age.
Polling is not "wrong" as such.
A lot depends on how it is implemented and for what purpose. If you really care about immediate notification of a change, it is very efficient. Your code sits in a tight loop, constantly polling (asking) a resource whether it has changed/updated. This means you are notified as soon as possible that something is different. But your code is not doing anything else, and there is overhead in terms of many, many calls to the object in question.
If you are less concerned with immediate notification you can increase the interval between polls, and this can also work well, but picking the correct interval can be difficult. Too long and you might miss critical changes, too short and you are back to the problems of the first method.
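The interval trade-off in code, as a rough sketch (the file path and five-second interval are arbitrary):

    using System;
    using System.IO;
    using System.Threading;

    TimeSpan interval = TimeSpan.FromSeconds(5);   // shorter = faster detection, more wasted checks
    DateTime lastWrite = DateTime.MinValue;

    while (true)
    {
        var current = File.GetLastWriteTimeUtc(@"C:\data\input.csv"); // hypothetical file
        if (current != lastWrite)
        {
            lastWrite = current;
            // handle the change
        }
        Thread.Sleep(interval);
    }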
Alternatives, such as interrupts or messages, can provide a better compromise in these situations. You are notified of a change as soon as is practically possible, but the delay is not something you control; it depends on the component itself being timely about passing on changes in state.
What is "wrong" with polling?
It can be resource hogging.
It can be limiting (especially if you have many things you want to know about / poll).
It can be overkill.
But...
It is not inherently wrong.
It can be very effective.
It is very simple.
Examples of things that use polling in this day and age:
Email clients poll for new messages (even with IMAP).
RSS readers poll for changes to feeds.
Search engines poll for changes to the pages they index.
StackOverflow users poll for new questions, by hitting 'refresh' ;-)
Bittorrent clients poll the tracker (and each other, I think, with DHT) for changes in the swarm.
Spinlocks on multi-core systems can be the most efficient synchronisation between cores, in cases where the delay is too short for there to be time to schedule another thread on this core, before the other core does whatever we're waiting for.
Sometimes there simply isn't any way to get asynchronous notifications: for example, to replace RSS with a push system, the server would have to know about everyone who reads the feed and have a way of contacting them. That is a mailing list - precisely one of the things RSS was designed to avoid. Hence the fact that most of my examples are network apps, where this is most likely to be an issue.
Other times, polling is cheap enough to work even where there is async notification.
For a local file, notification of changes is likely to be the better option in principle. For example, you might (might) prevent the disk from spinning down if you're forever poking it, although then again the OS might cache. And if you're polling every second on a file which only changes once an hour, you might be needlessly occupying 0.001% (or whatever) of your machine's processing power. This sounds tiny, but what happens when there are 100,000 files you need to poll?
In practice, though, the overhead is likely to be negligible whichever you do, making it hard to get excited about changing code that currently works. Best thing is to watch out for specific problems that polling causes on the system you want to change - if you find any then raise those rather than trying to make a general argument against all polling. If you don't find any, then you can't fix what isn't broken...
There are two reasons why polling could be considered bad in principle.
It is a waste of resources. It is very likely that you will check for a change while no change has occurred. The CPU cycles/bandwidth spent on this action do not result in a change and thus could have been better spent on something else.
Polling is done at a certain interval. This means that you won't know that a change has occurred until the next time the interval has passed.
It would be better to be notified of changes. This way you’re not polling for changes that haven’t occurred and you’ll know of a change as soon as you receive the notification.
Polling is easy to do, very easy; it's as easy as any procedural code. Not polling means you enter the world of asynchronous programming, which isn't as brain-dead easy and might even become challenging at times.
And as with everything in any system, the path of least resistance is the one most commonly taken, so there will always be programmers using polling, even great programmers, because sometimes there is no need to complicate things with asynchronous patterns.
I for one always strive to avoid polling, but sometimes I do it anyway, especially when the actual gains of asynchronous handling aren't that great, such as when acting against some small piece of local data (of course you get a bit faster, but users won't notice the difference in a case like this). So there is room for both methodologies, IMHO.
Client polling doesn't scale as well as server notifications. Imagine thousands of clients asking the server "any new data?" every 5 seconds. Now imagine the server keeping a list of clients to notify of new data. Server notification scales better.
I think people should realize that in most cases, at some level there is polling being done, even in event- or interrupt-driven situations; you're just isolated from the actual code doing the polling. Really, this is the most desirable situation... isolate yourself from the implementation, and just deal with the event. Even if you must implement the polling yourself, write the code so that it's isolated and the results are dealt with independently of the implementation.
The thing about polling is that it works! It's reliable and simple to implement.
The costs of polling can be high - if you are scanning a database for changes every minute when there are only two changes a day, you are consuming a lot of resources for a very small result.
However, the problem with any notification technology is that it is much more complex to implement, and not only can it be unreliable but (and this is a big BUT) you cannot easily tell when it is not working.
So if you do drop polling for some other technology, make sure it is usable by average programmers and is ultra reliable.
It's simple - polling is bad: inefficient, wasteful of resources, etc. There is always some form of connectivity in place that is monitoring for an event of some sort anyway, even if 'polling' is not chosen.
So why go the extra mile and put additional polling in place?
Callbacks are the best option - you just need to worry about tying the callback in with your current process. Underneath, there is polling going on to see that the connection is still in place anyhow.
If you keep phoning/ringing your girlfriend and she never answers, why keep calling? Just leave a message and wait until she 'calls back' ;)
I use polling occasionally for certain situations (for example, in a game, I would poll the keyboard state every frame), but never in a loop that ONLY does polling; rather, I do polling as a check (has resource X changed? If yes, do something; otherwise, process something else and check again later). Generally speaking, though, I avoid polling in favor of asynchronous notifications.
The reason is that I do not want to spend resources (CPU time, whatever) waiting for something to happen (especially if those resources could speed up that thing happening in the first place). In the cases where I use polling, I don't sit idle waiting; I use the resources elsewhere, so it's a non-issue (for me, at least).
If you are polling for changes to a file, then I agree that you should use the filesystem change notifications available in most operating systems now.
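In .NET that means FileSystemWatcher; a minimal example (the path and filter are placeholders):

    using System;
    using System.IO;

    var watcher = new FileSystemWatcher(@"C:\data", "*.csv")
    {
        NotifyFilter = NotifyFilters.LastWrite | NotifyFilters.FileName
    };
    watcher.Changed += (s, e) => Console.WriteLine("Changed: " + e.FullPath);
    watcher.Created += (s, e) => Console.WriteLine("Created: " + e.FullPath);
    watcher.EnableRaisingEvents = true; // start listening; no polling loop needed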
In a database you could put a trigger on update/insert and then call your external code to do something. However, it might just be that you don't have a requirement for instant action. For instance, you might only need to get data from Database A to Database B on a different network within 15 minutes. Database B might not be accessible from Database A, so you end up doing the polling from, or as a standalone program running near, Database B.
Also, polling is a very simple thing to program. It is often a first-step implementation done when time constraints are short, and because it works well enough, it remains.
I see many answers here, but I think the simplest answer is the answer itself:
Because it is (usually) much simpler to code a polling loop than to build the infrastructure for callbacks.
Then you get simpler code which, if it turns out to be a bottleneck later, can be easily understood and redesigned/refactored into something else.
This is not answering your question. But realistically, especially in this "day and age" where processor cycles are cheap, and bandwidth is large, polling is actually a pretty good solution for some tasks.
The benefits are:
Cheap
Reliable
Testable
Flexible
I agree that avoiding polling is a good policy. However, in reference to Robert's post, I would say that the simplicity of polling can make it a better approach in instances where the issues mentioned here are not such a big problem, as the asynchronous approach is often considerably less readable and harder to maintain, not to mention the bugs that can creep into its implementation.
As with everything, it depends. A large high-transaction system I work on currently uses notification with SQL (a DLL loaded within SQL Server that is called by an extended SP from triggers on certain tables; the DLL then notifies other apps that there is work to do).
However, we're moving away from this because we can practically guarantee that there will be work to do continuously. Therefore, in order to reduce the complexity and actually speed things up a bit, the apps will process their work and immediately poll the DB again for new work. Should there be none, it'll try again after a small interval.
This seems to work quicker and is much simpler. However, another part of the application which is much lower volume does not benefit from a speed increase using this method - unless the polling interval is very small, which leads to performance problems. So we're leaving it as is for this part. Therefore it's a good thing when it's appropriate, but everybody's needs are different.
Here is a good summary of relative merits of push and pull:
https://stpeter.im/index.php/2007/12/14/push-and-pull-in-application-architectures/
I wish I could summarize it further into this answer but some things are best left unabridged.
When thinking about SQL polling: back in the days of VB6 you used to be able to create recordsets using the WithEvents keyword, which was an early incarnation of async 'listening'.
I personally would always look for a way of using an event-driven implementation before polling. Failing that, a manual implementation of any of the following might help:
SQL Service Broker / the SqlDependency class
Some kind of queue technology (RabbitMQ or similar)
UDP broadcast - an interesting technique that can be built with multiple node listeners. Not always possible on some networks, though.
Some of these may require a slight redesign of your project, but in an enterprise world they might be the better route to go rather than a polling service.
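For the SQL Server option, SqlDependency is the usual starting point. A rough sketch (the connection string and query are placeholders, and the query must follow the Query Notification rules: two-part table names, no SELECT *):

    using System;
    using System.Data.SqlClient;

    const string connStr = "Data Source=.;Initial Catalog=MyDb;Integrated Security=true";

    SqlDependency.Start(connStr);
    using (var conn = new SqlConnection(connStr))
    using (var cmd = new SqlCommand("SELECT OrderId, Status FROM dbo.Orders", conn))
    {
        var dependency = new SqlDependency(cmd);
        dependency.OnChange += (s, e) =>
            Console.WriteLine("Data changed: " + e.Info); // fires once; re-subscribe to keep listening
        conn.Open();
        using (var reader = cmd.ExecuteReader()) { }      // must execute to register the subscription
    }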
Agree with most responses that async/messaging is usually better. I absolutely agree with Robert Gould's answer. But I'd like to add one more point.
One addition is that polling can kill two birds with one stone. In one particular use case, a project I was involved with used a message queue between databases but polled from an application server to one of the databases. Because the network from the app server to the DB was occasionally down, the polling was additionally used to notify the app of network issues.
In the end, use what makes the most sense for the use case, with scalability in mind.
I'm using polling to check for updates on a file because I'm getting information about that file across a heterogeneous system with different OS types, one of which is very old. The notifications for Linux won't work if the file is on a remote system with a different OS, because that information is not transmitted, but polling works. It's a low bandwidth check, so it doesn't hurt anything.
