See this SuperUser question. To summarize, VM software lets you save the state of arbitrary applications (by saving the whole VM image).
Would it be possible to write some software for Windows that allows you to save and reload arbitrary application state? If so (and presumably so), what would it entail?
I would be looking to implement this, if possible, in a high-level language like C#. I presume if I used something else, I would need to dump memory registers (or maybe dump the entire application memory block) to a file somewhere and load it back somewhere to refresh state.
So how do I build this thing?
Well, it's unlikely that this is going to happen, especially not in C# - addressing the low-level requirements in managed code would hardly be possible. Saving the state of a whole virtual machine is actually easier than saving a single process: all you have to do is dump the whole machine's memory and ensure a consistent disk image, which, given the capabilities of virtualization software, is rather straightforward.
Restoring a single process would imply loading all shared objects that the process refers to, including whatever objects it refers to in kernel space, i.e. file/memory/mutex/whatever handles. Without the whole machine / operating system being virtual, this would mean poking deep down into the internals of Windows...
All I'm saying is: while it is possible, the effort would be tremendous and probably not worth it.
Related
I have lots of data which I would like to save to disk in binary form and I would like to get as close to having ACID properties as possible. Since I have lots of data and cannot keep it all in memory, I understand I have two basic approaches:
Have lots of small files (e.g. write to disk every minute or so) - in case of a crash I lose only the last file. Performance will be worse, however.
Have a large file (e.g. open, modify, close) - best sequential read performance afterwards, but in case of a crash I can end up with a corrupted file.
So my question is specifically:
If I choose to go for the large-file option and open it as a memory-mapped file (or use Stream.Position and Stream.Write), and there is a loss of power, are there any guarantees about what could possibly happen to the file?
Is it possible to lose the entire large file, or just end up with the data corrupted in the middle?
Does NTFS ensure that a block of certain size (4k?) always gets written entirely?
Is the outcome better/worse on Unix/ext4?
I would like to avoid using NTFS TxF since Microsoft has already mentioned it is planning to retire it. I am using C#, but the language probably doesn't matter.
(additional clarification)
It seems that there should be a certain guarantee, because -- unless I am wrong -- if it were possible to lose the entire file (or suffer really weird corruption) while writing to it, then no existing DB would be ACID, unless they 1) use TxF or 2) make a copy of the entire file before writing? I don't think a journal will help you if you lose parts of the file you didn't even plan to touch.
You can call FlushViewOfFile, which initiates dirty page writes, and then FlushFileBuffers, which, according to this article, guarantees that the pages have been written.
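In managed code, a minimal sketch of that flush pattern might look like the following, assuming a .NET version with the System.IO.MemoryMappedFiles API and the CreateFromFile overload that takes a FileStream (the view accessor's Flush corresponds to FlushViewOfFile, and FileStream.Flush(true) issues FlushFileBuffers); the file name and the 64 MB capacity are just placeholders:

    using System.IO;
    using System.IO.MemoryMappedFiles;

    class MappedFlushSketch
    {
        static void Main()
        {
            // "data.bin" and the 64 MB capacity are placeholders.
            using (var stream = new FileStream("data.bin", FileMode.OpenOrCreate,
                                               FileAccess.ReadWrite, FileShare.None))
            using (var mmf = MemoryMappedFile.CreateFromFile(
                       stream, null, 64L * 1024 * 1024,
                       MemoryMappedFileAccess.ReadWrite,
                       HandleInheritability.None, leaveOpen: true))
            using (var view = mmf.CreateViewAccessor())
            {
                view.Write(0, 42L);               // mutate the mapped region

                view.Flush();                     // ~ FlushViewOfFile: start writing dirty pages
                stream.Flush(flushToDisk: true);  // ~ FlushFileBuffers: wait until they hit the disk
            }
        }
    }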
Calling FlushFileBuffers after each write might be "safer" but it's not recommended. You have to know how much loss you can tolerate. There are patterns that limit that potential loss, and even the best databases can suffer a write failure. You just have to come back to life with the least possible loss, which typically demands some logging with a multi-phase commit.
I suppose it's possible to open the memory-mapped file with FILE_FLAG_NO_BUFFERING and FILE_FLAG_WRITE_THROUGH, but that's going to hurt your throughput. I don't do this. I open the memory-mapped files for asynchronous I/O, letting the OS optimize the throughput with its own implementation of async I/O completion ports. It's the fastest possible throughput. I can tolerate potential loss, and have mitigated appropriately. My memory-mapped data is file backup data... and if I detect loss, I can re-back-up the lost data once the hardware error is cleared.
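For the write-through route in managed code, a sketch along these lines is possible (FileOptions.WriteThrough maps to FILE_FLAG_WRITE_THROUGH; there is no named FileOptions member for FILE_FLAG_NO_BUFFERING, and the file name here is made up):

    using System.IO;
    using System.Text;

    class WriteThroughSketch
    {
        static void Main()
        {
            // FileOptions.WriteThrough asks the OS not to report the write as complete
            // until it has gone through the cache to the disk. "journal.dat" is a
            // placeholder name; bufferSize 1 effectively disables the managed buffer.
            using (var fs = new FileStream("journal.dat", FileMode.Append,
                                           FileAccess.Write, FileShare.Read,
                                           1, FileOptions.WriteThrough))
            {
                byte[] record = Encoding.UTF8.GetBytes("record 1\n");
                fs.Write(record, 0, record.Length);
            }
        }
    }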
Obviously, the file system has to be reliable enough to operate a database application, but I don't know of any vendors who suggest you don't still need backups. Bad things will happen. Plan for loss. One thing I do is never write into the middle of data. My data is immutable and versioned, and each "data" file is limited to 2 GB, but each application employs different strategies.
The NTFS file system (and ext3/ext4) uses a transaction journal to apply changes. Each change is stored in the journal first, and then the journal itself is used to actually perform the change.
Except for catastrophic disk failures, the file system is designed to be consistent in its own data structures, not yours: in case of a crash, the recovery procedure will decide what to roll back in order to preserve consistency. In case of a rollback, your "not-yet-written but to-be-written" data is lost.
The file system will be consistent, but your data may not be.
Additionally, there are several other factors involved: software and hardware caches introduce an additional layer, and therefore a point of failure. Usually the operations are performed in the cache, and then the cache itself is flushed to disk. The file system driver won't see the operations performed "in" the cache, but will see the flush operations.
This is done for performance reasons, as the hard drive is the bottleneck. Hardware controllers do have batteries to guarantee that their own cache can be flushed even in the event of a power loss.
The size of a sector is another important factor, but this detail should not be relied upon, as the hard drive itself could lie about its native sector size for interoperability purposes.
If you have a memory-mapped file and you insert data in the middle when the power goes down, the content of the file might contain only part of the change you made if the change exceeds the size of the internal buffers.
TxF is a way to mitigate the issue, but it has several implications which limit the contexts in which you can use it: for example, it does not work across different drives or on network shares.
In order to be ACID, you need to design your data structures and/or the way you use them so that you do not rely on these implementation details. For example, Mercurial (a version control tool) always appends its data to its revision log.
There are many possible patterns; however, the more guarantees you need, the more technology-specific you will get (and be tied to).
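As an illustration of the append-only idea, a toy sketch in C# could look like this; the length-prefixed record framing and the file name are inventions for the example, not Mercurial's actual format:

    using System.IO;

    // A toy append-only record log: existing bytes are never rewritten,
    // so a crash can at worst truncate the most recent record.
    class AppendOnlyLog
    {
        private readonly string _path;

        public AppendOnlyLog(string path) { _path = path; }

        public void Append(byte[] payload)
        {
            using (var fs = new FileStream(_path, FileMode.Append,
                                           FileAccess.Write, FileShare.Read))
            using (var writer = new BinaryWriter(fs))
            {
                writer.Write(payload.Length);  // 4-byte length prefix
                writer.Write(payload);         // record body
                writer.Flush();
                fs.Flush(true);                // ask the OS to push it to the disk
            }
        }
    }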
I've designed a logging service for multiple applications in C#.
For performance reasons, all logs are stored in a buffer first and then written to the log file when the buffer is full.
However, some extension cards (PCI / PCI-e) that are not under my control cause BSoDs. The logs in the buffer will be lost when a BSoD occurs, but I want to find a way to keep them.
I've found some articles discussing how to dump data when software crashes. However, the minidump approach requires me to dump everything myself, and I think it will cause some performance issues; the other articles (A)(B) only cover a single application crash.
Does anyone have any suggestion for saving my logs even if a BSoD occurs?
EDIT: any suggestion that at least minimizes the loss of data is also welcome.
Since the buffer you have in your C# application is not written to disk for performance reasons, the only other place left for it is memory (RAM). Since you don't know how Windows manages your memory at the moment of the crash, we have to consider two cases: a) the log is really in RAM and b) that RAM has been swapped to disk (the page file). To get access to all RAM at the time of a BSoD, you have to configure Windows to create a full memory dump instead of a kernel minidump.
At the time of a blue screen, the operating system stops relying on almost anything, even most kernel drivers. The only attempt it makes is to write the contents of the physical RAM onto disk. Furthermore, since it cannot even rely on valid NTFS data structures, it writes to the only place of contiguous disk space it knows: the page file. That's also the reason why your page file needs to be at least as large as physical RAM plus some metadata, otherwise it won't be able to hold the information.
At this point we can already give an answer to case b): if your log was actually swapped to the page file, it will likely be overwritten by the dump.
If the buffer was really part of the working set (RAM), that part will be included in the kernel dump. Debugging .NET applications from a kernel dump is almost impossible, because the SOS commands to analyze the .NET heaps only work for a user mode full memory dump. If you can identify your log entries by some other means (e.g. a certain substring), you may of course do a simple string search on the kernel dump.
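If you go the substring route, a crude scan over the dump could look roughly like the sketch below; the dump path and the "[MYLOG]" marker are assumptions, and since .NET strings live in memory as UTF-16, the marker is encoded that way (you may need to try UTF-8 as well if your logger stores raw bytes):

    using System;
    using System.IO;
    using System.Text;

    class DumpScan
    {
        static void Main()
        {
            // Path and marker are placeholders; a kernel/full dump usually sits at
            // %SystemRoot%\MEMORY.DMP and can be many gigabytes, so scan in chunks.
            byte[] marker = Encoding.Unicode.GetBytes("[MYLOG]");  // UTF-16, as .NET keeps strings
            using (var fs = File.OpenRead(@"C:\Windows\MEMORY.DMP"))
            {
                var buffer = new byte[1 << 20];
                var carry = new byte[marker.Length - 1];
                int carryLen = 0;
                long offset = 0;
                int read;
                while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
                {
                    // Stitch the tail of the previous chunk onto this one so a match
                    // spanning a chunk boundary is not missed.
                    var window = new byte[carryLen + read];
                    Array.Copy(carry, 0, window, 0, carryLen);
                    Array.Copy(buffer, 0, window, carryLen, read);

                    for (int i = 0; i + marker.Length <= window.Length; i++)
                    {
                        bool hit = true;
                        for (int j = 0; j < marker.Length && hit; j++)
                            hit = window[i + j] == marker[j];
                        if (hit)
                            Console.WriteLine("marker near file offset 0x{0:X}", offset - carryLen + i);
                    }

                    carryLen = Math.Min(carry.Length, window.Length);
                    Array.Copy(window, window.Length - carryLen, carry, 0, carryLen);
                    offset += read;
                }
            }
        }
    }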
All in all, what you're trying to achieve sounds like an XY problem. If you want to test your service, why don't you remove or replace the unrelated problematic PCI cards, or test on a different PC?
If blue-screen logging is an explicit feature of your logging service, you should have considered this as a risk and evaluated it before writing the service. That's a project management issue and off-topic for StackOverflow.
Unfortunately I have to confirm what @MobyDisk said: it's (almost) impossible, and at the very least unreliable.
I'm working with an open-source .NET app that takes a long time to start up and initialize. It's creating thousands of objects and configuring them for first-time use. I'm trying to improve this startup time.
Is there a way to capture application memory using the Windows API or similar, and then quickly "restore" this state later after restarting the PC? Essentially is there a way to access and save the underlying memory of a .NET app and have the CLR "absorb" this memory at a later time?
The easiest way would be using Windows Hibernate to create "hiberfil.sys", and then saving a copy of this file (if that is possible). Every time Windows starts up, you overwrite the existing hibernation file with the saved "clean" version, for the next startup. This ensures that you can save/restore application state without having to deal with memory, pointers and handles. Could this work?
One way would be creating a mem-disk (virtual memory that actually works off the HDD, allowing the memory to be saved/restored as a simple file), although I don't know if restoration is possible.
Similar to this question, but a bit different since I don't mind re-inserting the application memory at the exact address it was saved in. The PC is entirely in my hands, and for the sake of simplicity assume there are no other apps running.
C# does not support continuations out of the box, although Workflow Foundation in .NET 3.0 and higher allows workflows to be stopped and restarted. I wonder how an application could behave as a workflow.
Raymond Chen argues against this in a blog post, but not much technical data here either.
YAPM, an open-source process monitor, is able to "display/release/change protection/decommit the memory regions in the virtual memory space of a process". Could this be something similar to what I'm after?
If you want an unchanged save/load process to avoid first-use, you may look into serialization.
Actually saving the memory could be possible, but you'll run into addressing problems when you try to restore it, and there's a chance you may not have enough memory, may not have a free block of the same size, and so on.
Serialization at the level of an object, or even a large group of objects, will allow you to save them and their state in an almost-identical manner to dumping memory, but greatly simplify the loading process and make it far more reliable. .NET offers pretty good serialization support and can output to binary files (small but version-dependent) or XML (larger, human-readable, somewhat more flexible). Other libraries may offer more methods of varying use (I believe there is a JSON one, which is slightly more verbose still, but works well with web apps).
Depending on how your app works, you may want to/be able to create the first-use models on the first run, serialize them to disk, and load them from then on. With some work, it should also be possible to add all the objects (of varying types) to a single collection and serialize that, allowing all the data to be stored in one file.
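A minimal sketch of that "build once, serialize, reload" flow, using XmlSerializer as one of the options mentioned above; the Model type, the cache file name and BuildFromScratch are stand-ins for whatever your app actually constructs:

    using System.Collections.Generic;
    using System.IO;
    using System.Xml.Serialization;

    public class Model                                 // stand-in for the expensive-to-build objects
    {
        public List<string> Settings { get; set; }
    }

    public static class ModelCache
    {
        const string CachePath = "model-cache.xml";    // placeholder location

        public static Model LoadOrBuild()
        {
            var serializer = new XmlSerializer(typeof(Model));

            if (File.Exists(CachePath))
            {
                using (var fs = File.OpenRead(CachePath))
                    return (Model)serializer.Deserialize(fs);   // fast path on later runs
            }

            Model model = BuildFromScratch();                   // the slow first-use work
            using (var fs = File.Create(CachePath))
                serializer.Serialize(fs, model);
            return model;
        }

        static Model BuildFromScratch()
        {
            return new Model { Settings = new List<string> { "default" } };
        }
    }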
So yes, this is possible and may indeed be faster, although not how you originally thought.
I am working on an application which has the potential for a large memory load (>5 GB) but is required to run on 32-bit, .NET 2 based desktops due to the customer deployment environment. My solution so far has been to use an app-wide data store for these large-volume objects: when an object is assigned to the store, the store checks the total memory usage of the app, and if it is getting close to the limit it starts serialising some of the older objects in the store to the user's temp folder, retrieving them back into memory as and when they are needed. This is proving to be decidedly unreliable, because if other objects within the app start using memory, the store gets no prompt to clear up and make space. I did look at using weak references to hold the in-memory data objects, with them being serialised to disk when they were released; however, the objects seemed to be released almost immediately, especially in debug, causing a massive performance hit as the app serialised everything.
Are there any useful patterns/paradigms I should be using to handle this? I have googled extensively but as yet haven't found anything useful.
I thought virtual memory was supposed to have you covered in this situation?
Anyway, it seems suspect that you really need all 5 GB of data in memory at any given moment - you can't possibly be processing all of that data at any given time, at least not on what sounds like a consumer PC. You didn't go into detail about your data, but to me it smells like the object itself is poorly designed, in the sense that you need the entire set in memory to work with it. Have you thought about fragmenting your data into more sensible units, and then doing some pre-emptive loading of the data from disk, just before it needs to be processed? You'd essentially be paying a more constant performance trade-off this way, but you'd reduce your current thrashing issue.
Maybe you should go with memory-mapped files; see Managing Memory-Mapped Files and look here. In .NET 2.0 you have to use P/Invoke to call those functions. Since .NET 4.0 you have efficient built-in functionality with MemoryMappedFile.
Also take a look at:
http://msdn.microsoft.com/en-us/library/dd997372.aspx
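To make the suggestion concrete, here's a small sketch of the .NET 4.0 route; the file name, the 8 GB capacity and the 64 MB window are illustrative. The point is that you map narrow views over a file far larger than a 32-bit address space and let the OS do the paging:

    using System;
    using System.IO;
    using System.IO.MemoryMappedFiles;

    class LargeDataWindow
    {
        const long Capacity   = 8L * 1024 * 1024 * 1024;  // backing file is extended to 8 GB (placeholder)
        const int  WindowSize = 64 * 1024 * 1024;          // map only 64 MB at a time

        static void Main()
        {
            using (var mmf = MemoryMappedFile.CreateFromFile(
                       "bigdata.bin", FileMode.OpenOrCreate, null, Capacity))
            {
                // Only the mapped window consumes address space, not the whole 8 GB.
                long offset = 2L * 1024 * 1024 * 1024;      // work somewhere past 2 GB
                using (var view = mmf.CreateViewAccessor(offset, WindowSize,
                                                         MemoryMappedFileAccess.ReadWrite))
                {
                    view.Write(0, 123.45);                  // position 0 within this view
                    double value = view.ReadDouble(0);
                    Console.WriteLine(value);
                }
            }
        }
    }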
You can't store 5 GB of data in memory efficiently. You have a 2 GB limit per process on a 32-bit OS and a 4 GB limit per 32-bit process on 64-bit Windows (WoW64).
So you have a choice:
Go the Google Chrome way (and Firefox 4's) and split the data between processes. It may be applicable if your application runs under a 64-bit OS and you have some reason to keep your app 32-bit, but this is not an easy path. If you don't have a 64-bit OS, I wonder where you get >5 GB of RAM from?
If you have a 32-bit OS, then any solution will be file-based. If you try to keep the data in memory (though I wonder how you even address it under the 32-bit, 2 GB-per-process limit), the OS just continuously swaps portions of the data (memory pages) to disk and restores them again and again as you access them. You incur a great performance penalty, and you have already noticed it (I guessed that from the description of your problem). The main problem is that the OS can't predict when you need one piece of data and when you want another, so it just tries to do its best by reading and writing memory pages to and from disk.
So you already use disk storage indirectly, in an inefficient way; MMFs give you the same solution in an efficient and controlled manner.
You can re-architect your application to use MMFs, and the OS will help you with efficient caching. Do a quick test yourself; an MMF may be good enough for your needs.
Anyway, I don't see any solution for working with a dataset greater than the available RAM other than a file-based one. And it is usually better to have direct control over the data manipulation, especially when such an amount of data arrives and needs to be processed.
When you have to store huge loads of data and maintain accessibility, sometimes the most useful solution is to use a data store and management system such as a database. A database (MySQL, for example) can store lots of typical data types, and of course binary data too. Maybe you can store your objects in the database (directly, or through a business object model) and retrieve them when you need to. This solution can often solve many problems with managing data (moving, backup, searching, updating...) and storage (the data layer), and it is location independent - maybe this point of view can help you.
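For instance, pushing a serialized object into a table as a blob with plain ADO.NET could look like this; the SQL Server connection string, table and column names are made up for the sketch (with MySQL you would use its own connector, but the shape is the same):

    using System.Data.SqlClient;

    class BlobStore
    {
        // Connection string, table and column names are placeholders.
        const string ConnectionString = "Server=.;Database=AppData;Integrated Security=true";

        public static void Save(int id, byte[] payload)
        {
            using (var connection = new SqlConnection(ConnectionString))
            using (var command = new SqlCommand(
                       "INSERT INTO Objects (Id, Payload) VALUES (@id, @payload)", connection))
            {
                command.Parameters.AddWithValue("@id", id);
                command.Parameters.AddWithValue("@payload", payload);   // stored as varbinary
                connection.Open();
                command.ExecuteNonQuery();
            }
        }
    }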
What is the logic behind disk defragmentation and Disk Check in Windows? Can I do it using C# coding?
For completeness' sake, here's a C# API wrapper for defragmentation:
http://blogs.msdn.com/jeffrey_wall/archive/2004/09/13/229137.aspx
Defragmentation with these APIs is (supposed to be) very safe nowadays. You shouldn't be able to corrupt the file system even if you wanted to.
Commercial defragmentation programs use the same APIs.
Look at Defragmenting Files at msdn for possible API helpers.
You should carefully think about using C# for this task, as it may introduce some undesired overhead for marshaling into native Win32.
If you don't know the logic for defragmentation, and if you didn't write the file system yourself so you can't authoritatively check it for errors, why not just start new processes running 'defrag' and 'chkdsk'?
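Launching the built-in tools from C# could look roughly like this; the drive letter and switches are examples (check defrag /? and chkdsk /? on your system), and both tools need to run from an elevated process:

    using System;
    using System.Diagnostics;

    class RunBuiltInTools
    {
        static void Main()
        {
            // Drive letter and switches are examples; run the program elevated.
            var defrag = Process.Start(new ProcessStartInfo
            {
                FileName = "defrag.exe",
                Arguments = "C: /A",          // /A = analyze only (verify with defrag /?)
                UseShellExecute = false,
                RedirectStandardOutput = true
            });
            Console.WriteLine(defrag.StandardOutput.ReadToEnd());
            defrag.WaitForExit();

            // chkdsk without /F only reports problems; /F would need a locked volume.
            var chkdsk = Process.Start("chkdsk.exe", "C:");
            chkdsk.WaitForExit();
        }
    }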
Mark Russinovich wrote an article, Inside Windows NT Disk Defragmentation, a while ago which gives in-depth details. If you really want to do this I would strongly advise you to use the built-in facilities for defragmenting. Moreover, on recent OSes I have never seen a need as a user to even care about defragmenting; it is done automatically on a schedule, and the NTFS folks at MS are definitely smarter at that stuff than you (sorry, but they have been doing this for quite some time now; you haven't).
Despite its importance, the file system is no more than a data structure that maps file names to lists of disk blocks, and keeps track of meta-information such as the actual length of the file and special files that hold lists of files (e.g., directories). A disk checker verifies that the data structure is consistent: every disk block must either be free for allocation to a file or belong to a single file. It can also check for certain cases where a set of disk blocks appears to be a file that should be in a directory but, for some reason, is not.
Defragmentation is about looking at the lists of disk blocks assigned to each file. Files will generally load faster if they use a contiguous set of blocks rather than blocks scattered all over the disk. And generally the entire file system will perform best if all the disk blocks in use confine themselves to a single contiguous range of the disk. Thus the trick is moving disk blocks around safely to achieve this end while not destroying the file system.
The major difficulty here is running these applications while the disk is in use. It is possible, but one has to be very, very, very careful not to make some kind of obvious or extremely subtle error and destroy most or all of the files. It is easier to work on a file system offline.
The other difficulty is dealing with the complexities of the file system. For example, you'd be much better off building something that supports FAT32 rather than NTFS because the former is a much, much simpler file system.
As long as you have low-level block access and some sensible way of dealing with concurrency problems (best handled by working on the file system when it is not in use), you can do this in C#, Perl or any language you like.
BUT BE VERY CAREFUL. Early versions of the program will destroy entire file systems. Later versions will do so but only under obscure circumstances. And users get extremely angry and litigious if you destroy their data.