Using the BitTorrent protocol to distribute nightly and CI builds - C#

This question continues from what I learnt from my question yesterday, titled "using git to distribute nightly builds".
In the answers to that question it was clear that git would not suit my needs, and I was encouraged to re-examine using BitTorrent.
Short Version
Need to distribute nightly builds to 70+ people each morning; would like to use BitTorrent to load-balance the transfer.
Long Version
NB. You can skip the below paragraph if you have read my previous question.
Each morning we need to distribute our nightly build to the studio of 70+ people (artists, testers, programmers, production, etc). Up until now we have copied the build to a server and have written a sync program that fetches it (using Robocopy underneath); even with mirrors set up, the transfer speed is unacceptably slow, taking up to an hour or longer to sync at peak times (off-peak is roughly 15 minutes), which points to a hardware I/O bottleneck and possibly network bandwidth.
What I know so far
What I have found so far:
I have found the excellent entry on Wikipedia about the BitTorrent protocol which was an interesting read (I had only previously known the basics of how torrents worked). Also found this StackOverflow answer on the BITFIELD exchange that happens after the client-server handshake.
I have also found the MonoTorrent C# Library (GitHub Source) that I can use to write our own tracker and client. We cannot use off-the-shelf trackers or clients (e.g. uTorrent).
Questions
In my initial design, I have our build system creating a .torrent file and adding it to the tracker. I would super-seed the torrent using our existing mirrors of the build.
Using this design, would I need to create a new .torrent file for each new build? In other words, would it be possible to create a "rolling" .torrent where, if the content of the build has only changed by 20%, that is all that needs to be downloaded to get the latest?
... Actually, in writing the above question I think I would need to create a new file; however, I would be able to download to the same location on the user's machine and the hash check would automatically determine what I already have. Is this correct?
In response to comments
For a completely fresh sync, the entire build (including the game, source code, localized data, and disc images for PS3 and X360) is ~37,000 files and comes in just under 50 GB. This is going to increase as production continues. This sync took 29 minutes to complete at a time when there were only 2 other syncs happening, which is low-peak if you consider that at 9 am we would have 50+ people wanting to get latest.
We have investigated the disk I/O and network bandwidth with the IT dept; the conclusion was that the network storage was being saturated. We are also recording sync statistics to a database; these records show that even with a handful of users we are getting unacceptable transfer rates.
In regard to not using off-the-shelf clients: it is a legal concern to have an application like uTorrent installed on users' machines, given that other items can easily be downloaded with that program. We also want a custom workflow for determining which build to get (e.g. only PS3 or X360, depending on which DEVKIT you have on your desk), to have notifications of new builds available, etc. Creating a client using MonoTorrent is not the part I'm concerned about.

To the question of whether or not you need to create a new .torrent, the answer is: yes.
However, depending a bit on the layout of your data, you may be able to do some simple semi-delta-updates.
If the data you distribute is a large collection of individual files, where with each build only some files have changed, you can simply create a new .torrent file and have all clients download it to the same location as the old one (just as you suggest). The clients would first check the files that already exist on disk, update the ones that have changed, and download new files. The main drawback is that removed files would not actually be deleted on the clients.
If you're writing your own client anyway, deleting files on the filesystem that aren't in the .torrent file is a fairly simple step that can be done separately.
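If it helps, that cleanup step could look something like this in C# (buildRoot and torrentFiles, i.e. the relative paths listed in the new .torrent, are names I've made up for the sketch):

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    static class TorrentCleanup
    {
        // buildRoot: where the build is synced; torrentFiles: relative paths listed in the new .torrent.
        public static void RemoveStaleFiles(string buildRoot, IEnumerable<string> torrentFiles)
        {
            var wanted = new HashSet<string>(
                torrentFiles.Select(p => Path.GetFullPath(Path.Combine(buildRoot, p))),
                StringComparer.OrdinalIgnoreCase);

            // Anything on disk that the new .torrent no longer references gets deleted.
            foreach (var file in Directory.EnumerateFiles(buildRoot, "*", SearchOption.AllDirectories))
            {
                if (!wanted.Contains(Path.GetFullPath(file)))
                    File.Delete(file);
            }
        }
    }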
This does not work if you distribute an image file, since the bits that stayed the same across versions may have moved, thus yielding different piece hashes.
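For context, BitTorrent verifies data in fixed-size pieces hashed with SHA-1, so an insertion near the start of a monolithic image shifts every later piece and invalidates its hash. A quick sketch of per-piece hashing (piece size and file name are arbitrary):

    using System;
    using System.IO;
    using System.Security.Cryptography;

    class PieceHasher
    {
        const int PieceLength = 256 * 1024; // 256 KiB pieces, a typical value

        static void Main()
        {
            using (var sha1 = SHA1.Create())
            using (var stream = File.OpenRead("build.image")) // arbitrary example file
            {
                var buffer = new byte[PieceLength];
                int read, piece = 0;
                while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
                {
                    // Each piece hash covers a fixed byte range; shift the data
                    // and every subsequent range (and hash) changes.
                    var hash = sha1.ComputeHash(buffer, 0, read);
                    Console.WriteLine("piece {0}: {1}", piece++, BitConverter.ToString(hash));
                }
            }
        }
    }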
I would not necessarily recommend using super-seeding. Depending on how strict the super seeding implementation you use is, it may actually harm transfer rates. Keep in mind that the purpose of super seeding is to minimize the number of bytes sent from the seed, not to maximize the transfer rate. If all your clients are behaving properly (i.e. using rarest first), the piece distribution shouldn't be a problem anyway.
Also, creating and hash-checking a 50 GiB torrent puts a lot of load on the drive, so you may want to benchmark the BitTorrent implementation you use to make sure it's performant enough. At 50 GiB, the difference between implementations may be significant.

Just wanted to add a few non-BitTorrent suggestions for your perusal:
If the delta between nightly builds is not significant, you may be able to use rsync to reduce your network traffic and decrease the time it takes to copy the build. At a previous company we used rsync to submit builds to our publisher, as we found our disc images didn't change much build-to-build.
Have you considered simply staggering the copy operations so that clients aren't slowing down the transfer for each other? We've been using a simple Python script internally when we do milestone branches: the script goes to sleep until a random time in a specified range, wakes up, downloads and checks out the required repositories and runs a build. The user runs the script when leaving work for the day; when they return they have a fresh copy of everything ready to go.
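For what it's worth, the same staggering idea translated to C# could be little more than a random sleep followed by your existing sync command (the robocopy paths and switches here are placeholders for whatever your sync actually runs):

    using System;
    using System.Diagnostics;
    using System.Threading;

    class StaggeredSync
    {
        static void Main()
        {
            // Wait a random amount of time (0-60 minutes) so clients don't all hit the server at once.
            var random = new Random();
            Thread.Sleep(TimeSpan.FromMinutes(random.Next(0, 60)));

            // Then run the existing sync; replace source/destination with your real paths.
            var sync = Process.Start("robocopy", @"\\buildserver\nightly C:\builds\nightly /MIR /R:5 /W:10");
            sync.WaitForExit();
        }
    }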

You could use BitTorrent Sync, which is somewhat like an alternative to Dropbox but without a server in the cloud. It allows you to synchronize any number of folders and files of any size with several people, and it uses the same algorithms as the BitTorrent protocol. You can create a read-only folder and share the key with others. This method removes the need to create a new torrent file for each build.

Just to throw another option into the mix, have you considered BITS (Background Intelligent Transfer Service)? I have not used it myself, but from reading the documentation it supports a distributed peer caching model, which sounds like it will achieve what you want.
The downside is that it is a background service, so it will give up network bandwidth in favour of user-initiated activity - nice for your users but possibly not what you want if you need data on a machine in a hurry.
Still, it's another option.

Related

Application architecture with data on a shared network, without a database on the server

I'm currently working on a C# project for an application we'd like to develop. We're brainstorming over the question of sharing the data between users. We'd like to be able to specify a folder where all the files of the application are going to be saved, and we'd like to be able to save them in a shared folder (server, different PC or Mac, NAS, etc.).
The deployment would be like so:
Installation on the first PC: we choose a network drive, share, whatever, and create all the files for the application in this location.
On the second PC we install the application and choose the same location (on the network); the application doesn't create anything, it sees that the data already exists and uses these files as the application's data.
Same thing on the other clients
The application's files are going to be documents (most likely XML-formatted documents), and when opening the application we want to show all the existing documents. The thing is, we don't only want to have the list of documents and be able to edit their content, we would also like to be able to edit the documents' properties, so in a way we'd like a file (SQLite, XML, whatever) representing the list of all the documents and their attributes. Same thing for a list of addresses.
I know all that looks exactly like a client/server-with-database solution, but that solution is out of the question. I was first looking at SQLite for my data files, but I know concurrency can be a real problem and file locking doesn't work well. The thing is, I would have the same problem with simple XML files (refreshing the content when several users are working, accessing locked files).
So I guess my final question is: Is it feasible? Is there an alternative I didn't see which would allow us to do that more easily?
EDIT :
OK, I'm not responding to every post or comment, because I'm currently testing concurrency with SQLite. What I did, and please correct me if the way I test this is wrong, is launch X BackgroundWorkers which are all going to insert records into a sample database (which is recreated every time I start the application). I tried launching 100 iterations of INSERT into the database via these BackgroundWorkers.
Of course concurrency works with one application running: it simply waits for the last BackgroundWorker to do its job and then writes the next record. I also tried inserting at (almost) the same time, meaning I put a loop in every BackgroundWorker waiting for a modulo-5 timestamp (every 5 seconds, every BackgroundWorker runs). Again, it waits for the previous insert query to end before doing the next, and everything works fine. I even tried it with 500 BackgroundWorkers and it worked fine.
I then tried launching my app several times and running the instances simultaneously. When doing this I did have some issues. With two instances of my app it was still working fine, but when trying this with 4-5 instances it got really buggy and I got two types of error: 1. database is locked, 2. disk I/O failure. But mostly locked databases.
What I did was pretty intensive; in the scenario of my application it will never come to 5 processes trying to simultaneously insert 500 rows at the same time (maybe I'll get a concurrency of two or three connections). But what really bugged me, and what makes me think my testing method is not really a good one, is that I got these errors trying to work on a database on a shared network, on a NAS, AND on my own HDD. Every time it worked for maybe 30-40 queries before throwing the "database is locked" error.
Am I testing it wrong? Maybe I shouldn't be trying so hard to make this work, but I'm still not convinced that SQLite is not a good alternative to what I'm trying to do, since the concurrency is going to be really small.
With your optimistic/pessimistic locking, you are ultimately trying to build a database. Also, you WILL have issues with consistency while trying to keep multiple files in sync with each other. Think about what happens if you update the "metadata" file and the write fails halfway through because of a network blip. File corruption will ensue, and you will be left trying to reconstruct things from backups.
I would suggest a couple of likely solutions:
1) Host the content yourselves, and let them be pure clients (cloud based deployments are ideal for this). Most network/firewall issues can be circumvented by using HTTP as your transport (web services).
2) Have one of the workstations be the "server", which keeps its data files on the NFS. This will give you transactional integrity, incremental backups, etc. There are lots of good embedded database management systems to help you manage this complexity. MS SQL Server even has some great options for this.
You're right, SQLite uses file locks on the database file, so storing all data files in the database would create a write-starvation problem for editing your documents.
Maybe it's a better choice to implement simple optimistic/pessimistic locking yourself at the individual-file level? For example, in the case of a pessimistic lock, you just don't allow anyone to edit a particular file if somebody is already in the process of editing it. In this case you hold a lock on just one file, not on the entire database. If the possibility of conflict (editing a particular file at the same time) is pretty low, it is better to go with optimistic locking.
Simple optimistic locking implementation:
When a user gets a file for reading, it's OK, no problem here. If a user gets a file for editing, you could calculate a hash for this file (or get the timestamp of the file's last update), and then, when the user tries to save the edited file, compare the current (at the moment of saving) hash/timestamp to make sure the file has not been changed by somebody else. If the file has not been changed, it's OK to save it. If the file has been changed, the current user is out of luck and you need to inform him about it. This optimistic scenario is nice when the possibility of this "out of luck" case is pretty low. Otherwise it's better to stick with pessimistic locking, where you do not allow a user to even start editing a file if somebody else is doing it.
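A minimal sketch of that optimistic check in C#, assuming you capture a hash when the user opens the document and compare it again at save time (method and variable names are mine, not from any library):

    using System;
    using System.IO;
    using System.Security.Cryptography;

    static class OptimisticLock
    {
        // Capture this when the user opens the document for editing.
        public static string ComputeHash(string path)
        {
            using (var sha = SHA256.Create())
            using (var stream = File.OpenRead(path))
                return BitConverter.ToString(sha.ComputeHash(stream));
        }

        // Call this when the user tries to save.
        public static bool TrySave(string path, string hashAtOpen, byte[] newContent)
        {
            if (ComputeHash(path) != hashAtOpen)
                return false;                         // somebody else changed the file; inform the user

            File.WriteAllBytes(path, newContent);     // no conflict detected, overwrite
            return true;
        }
    }

Note that this still leaves a small race window between the check and the write; combining it with a short-lived lock file, or retrying on failure, narrows that window.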

How to run C# Task Parallel Library across multiple machines (like a render farm)?

I'm writing a calculation intensive program in C# using the TPL. Some preliminary benchmarking shows good reduction in computation time through using processors with more cores/threads.
However, there is a limit to how many threads are available on a single CPU (I think even the best Xeons money can buy currently have about 16).
I've been reading about how render farms with a 'grid' of multiple inexpensive CPUs in their own machines are a good way to increase the overall core count, but I have no idea how to go about implementing one of these. Is it implemented at the OS level with Microsoft server technology (and if so, how?), or do I also need to modify the C# code itself?
Any help or links to existing information would be greatly appreciated.
If you want to do this at scale (100s of nodes) then developing your own system is hard. You have to handle nodes becoming unavailable, data replication to each node, tracking job progress... it's a long list. You also need to consider the sort of communication you're going to require between your nodes. Remember that the cost of sending a message (data) from one thread to another is tiny compared to the cost of sending it to another machine across a network (even a fast one). You may have to completely rewrite your multithreaded application to run well on a distributed system, even to the point of using a completely different algorithm.
Hadoop
Microsoft had plans to commercialize Dryad as LINQ to HPC, but this project was sidelined a while back (I worked on this project before I left Microsoft). I believe you can still get the final "public preview", but it's unsupported. The SQL team opted to work with the Hadoop/Hortonworks people on getting a Windows/Azure/.NET-friendly Hadoop distribution off the ground. As far as I know the only thing they have delivered is HDInsight, a Hadoop service running in Azure.
There is now a Microsoft .NET SDK For Hadoop which will allow you to manage a cluster and submit jobs, etc. It does not seem to allow you to write code that executes on the Hadoop nodes. You can, however, use the Hadoop streaming API. This is fairly low-level but is language-agnostic, so you can pretty much use it to integrate map-reduce code written in any language with Hadoop. More details on this can be found in this blog post.
Hadoop for .NET Developers
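To give a flavour of the streaming model mentioned above: a streaming mapper is just a console program that reads lines from stdin and writes tab-separated key/value pairs to stdout, so a minimal (hypothetical) C# mapper could look like this:

    using System;

    class WordCountMapper
    {
        static void Main()
        {
            string line;
            while ((line = Console.ReadLine()) != null)
            {
                foreach (var word in line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries))
                {
                    // Hadoop streaming expects "key<TAB>value" per line on stdout.
                    Console.WriteLine("{0}\t1", word);
                }
            }
        }
    }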
If you want to do this at a smaller scale (10s of nodes) then I would look for something like MPI .NET. It looks like this project has been abandoned, but something similar is probably what you want.
You might look into something like Dryad - http://research.microsoft.com/en-us/projects/dryadlinq/default.aspx
It might, on the other hand, be a bit too much for your situation, but the ideas in Dryad could be simplified to your needs.
You might also look into making your own TaskScheduler, which could handle the distribution of threads to agents running on other boxes, but you would have to implement a simple socket client/server communication to get and push the data.
Another, somewhat odd, suggestion, which might be okay for investigating things, is to do the following.
Let the master of the calculation cut the problem into the number of available client computers.
Write the parameters to kick off the calculation for each client to a file shared by all on the network.
Let the clients look for files dedicated to them, and kick off the calculation for their piece when the file appears. The output is written back to a result file.
The server will sit and listen for all clients completing their jobs.
The files could be replaced with a database, low-level sockets, REST services, Web Services etc. depending on your needs.
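A client in that file-based scheme could be little more than a polling loop; the share path and file-naming convention below are made up, and DoCalculation stands in for your real TPL work:

    using System;
    using System.IO;
    using System.Threading;

    class GridClient
    {
        static void Main()
        {
            string shared = @"\\server\jobs";          // hypothetical shared folder
            string mine = Environment.MachineName;     // parameters named "<machine>.job" by convention

            while (true)
            {
                string jobFile = Path.Combine(shared, mine + ".job");
                if (File.Exists(jobFile))
                {
                    string parameters = File.ReadAllText(jobFile);
                    string result = DoCalculation(parameters);       // your TPL-based work goes here
                    File.WriteAllText(Path.Combine(shared, mine + ".result"), result);
                    File.Delete(jobFile);                             // mark the job as consumed
                }
                Thread.Sleep(TimeSpan.FromSeconds(5));                // poll interval
            }
        }

        static string DoCalculation(string parameters) { return parameters; /* placeholder */ }
    }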

How to lock file in a multi-user file management system

I have a program (a copy deployed to each user's computer) for users to store files on a centralized file server with compression (CAB files).
When adding a file, the user needs to extract the archive onto his own disk, add the file, and compress it back onto the server. So if two users process the same compressed file at the same time, the later upload will replace the earlier one and cause data loss.
My strategy to prevent this is: before the user extracts the compressed file, the program checks whether a specific temp file exists on the server. If not, the program creates such a temp file to prevent other users from interfering, and deletes the temp file after uploading; if it does exist, the program waits until the temp file is deleted.
Is there a better way of doing this? And will frequently creating and deleting empty files damage the disk?
And will frequently creating and deleting empty files damage the disk?
No. If you're using a solid-state disk, there's a theoretical limit on the number of writes that can be performed (which is an inherent limitation of flash). However, you're incredibly unlikely to ever reach that limit.
Is there a better way of doing this?
Well, I would go about this differently:
Write a Windows Service that handles all disk access, and have your client apps talk to the service. So, when a client needs to retrieve a file, it would open a socket connection to your service and request the file and either keep it in memory or save it to their local disk. Perform any modifications on the client's local copy of the file (decompress, add/remove/update files, recompress, etc), and, when the operation is complete and you're ready to save (or commit in source-control lingo) your changes, open another socket connection to your service app (running on the server), and send it the new file contents as a binary stream.
The service app would then handle loading and saving the files to disk. This gives you a lot of additional capabilities, as well - the server can keep track of past versions (perhaps even committing each version to svn or another source control system), provide metadata such as what the latest version is, etc.
Now that I'm thinking about it, you may be better off just integrating an svn interface into your app. SharpSVN is a good library for this.
Creating temporary files to flag the lock is a viable and widely used option (and no, this won't damage the disk). Another option is to open the compressed file exclusively (or let other processes only read the file but not write it) and keep the file opened while the user works with the contents of the file.
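A sketch of that exclusive-open approach (the path is made up): as long as the stream stays open, other instances of the program get an IOException when they try to open the archive for writing; FileShare.Read instead of FileShare.None would still let others read it.

    using System;
    using System.IO;

    class ExclusiveOpen
    {
        static void Main()
        {
            // FileShare.None means nobody else can open the file until this stream is disposed.
            using (var cab = new FileStream(@"\\server\share\archive.cab",
                                            FileMode.Open, FileAccess.ReadWrite, FileShare.None))
            {
                // extract, modify and re-pack the archive contents here,
                // then write the new bytes back through this same stream
            }
        }
    }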
Is there a better way of doing this?
Yes. From what you've written here, it sounds like you are well on your way towards re-inventing revision control.
Perhaps you could use some off-the-shelf version control system?
Or perhaps at least re-use some code from such systems?
Or perhaps you could at least learn a little about the problems those systems faced, how fixing the obvious problems led to non-obvious problems, and attempt to make a system that works at least as well?
My understanding is that version control systems went through several stages (see "Edit Conflict Resolution" on the original wiki, the Portland Pattern Repository).
In roughly chronological order:
The master version is stored on the server. Last-to-save wins, leading to mysterious data loss with no warning.
The master version is stored on the server. When I pull a copy to my machine, the system creates a lock file on the server. When I push my changes to the server (or cancel), the system deletes that lock file. No one can change those files on the server, so we've fixed the "mysterious data loss" problem, but we have endless frustration when I need to edit some file that someone else checked out just before leaving on a long vacation.
The master version is stored on the server. First-to-save wins ("optimistic locking"). When I pull the latest version from the server, it includes some kind of version-number. When I later push my edits to the server, if the version-number I pulled doesn't match the current version on the server, someone else has cut in first and changed things ahead of me, and the system gives some sort of polite message telling me about it. Ideally I pull the latest version from the server and carefully merge it with my version, and then push the merged version to the server, and everything is wonderful. Alas, all too often, an impatient person pulls the latest version, overwrites it with "his" version, and pushes "his" version, leading to data loss.
Every version is stored on the server, in an unbroken chain. (Centralized version control like TortoiseSVN is like this).
Every version is stored in every local working directory; sometimes the chain forks into 2 chains; sometimes two chains merge back into one chain. (Distributed version control tools like TortoiseHg are like this).
So it sounds like you're doing what everyone else did when they moved from stage 1 to stage 2. I suppose you could slowly work your way through every stage.
Or maybe you could jump to stage 4 or 5 and save everyone time?
Take a look at the FileStream.Lock method. Quoting from MSDN:
Prevents other processes from reading from or writing to the FileStream.
...
Locking a range of a file stream gives the threads of the locking process exclusive access to that range of the file stream.
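In code, that looks roughly like this (the path and range are just for illustration; remember to Unlock the same range before closing):

    using System.IO;

    class RangeLockExample
    {
        static void Main()
        {
            using (var fs = new FileStream(@"\\server\share\archive.cab",
                                           FileMode.Open, FileAccess.ReadWrite, FileShare.ReadWrite))
            {
                fs.Lock(0, fs.Length);      // other processes now cannot read from or write to this range
                try
                {
                    // work with the file here
                }
                finally
                {
                    fs.Unlock(0, fs.Length);
                }
            }
        }
    }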

BizTalk server problem

We have a BizTalk server (a virtual one (1!)...) at our company, and an SQL server where the data is kept.
Now we have a lot of data traffic. I'm talking about hundreds of thousands. So I'm actually not even sure if one server is safe enough, but our company is not that easy to convince.
Recently we have been having a lot of problems.
Allow me to describe the situation in detail, so I'm not missing anything:
Our server has 5 applications:
One with 3 orchestrations, 12 send ports, 16 receive locations.
One with 4 orchestrations, 32 send ports, 20 receive locations.
One with 4 orchestrations, 24 send ports, 20 receive locations.
One with 47 (yes 47) orchestrations, 37 send ports, 6 receive locations.
One common application with a couple of resources.
Our problems have occurred since we deployed the application with the 47 orchestrations.
A lot of these orchestrations use assign shapes which use C# code to do the mapping. This is because we use HL7 extensions, which are kind of special, so using C# code and XPath made the mapping a lot easier because a lot of these schemas look alike. The C# reads in XmlNodes received through XPath and returns XmlNodes which are then assigned back to BizTalk messages. I'm not sure if this could be the cause, but I thought I'd mention it.
The send and receive ports have a lot of different types: File, MQSeries, SQL, MLLP, FTP.
Each of these types has a different host instance, to balance out the load.
Our orchestrations use the BiztalkApplication host.
A couple of scripts are also running on this server, mostly FTP upload scripts and a zipper script, which zips files every half hour into a daily zip and deletes the zip files after a month. We use this zip script on our backup files (we back up a lot, and backups are also on our server); we did this because the server had problems sending files to a location where there were a lot (A LOT) of files, so after the files were reduced to zips it went better.
The problems we are having recently are mainly two major ones:
Our most important problem is the following. We kept a receive location with a lot of messages on a queue for testing. After we start this receive location, which uses the 47 orchestrations, the running service instances start to skyrocket. OK, this is pretty normal. Let's say about 10,000, and then we stop the receive location to see how BizTalk handles these 10,000 instances. Normally they would go down pretty fast, and sometimes they do, but after a while it starts to "throttle", meaning they just stop being processed and the service instance count stays at the same number; for example, in 30 seconds it goes down from 10,000 to 4,000, then it stays at 4,000 and lowers very, very slowly, like 30 in 5 minutes or so. This means that all the service instances of the other applications are also stuck here, and they are also not processed.
We noticed that after restarting our host instances the instance number went down fast again. So we tried selectively restarting different host instances to locate the problem. We noticed that eventually restarting the file send/receive host instance would do the trick, so we thought file sends were the problem, considering that we make a lot of backups. So we replaced the file-type backups with MQSeries backups. The same problem occurred, and, funny thing, restarting the file send/receive host still fixes the problem.
No errors can be found in the event viewer either.
A second problem we're having is that sometimes at around 6 am, all or part of the host instances are stopped.
In the event viewer we noticed the following errors (these are more than one):
The receive location "MdnBericht SQL" with URL "SQL://ZNACDBPEG/mdnd0001/" is shutting down. Details:"The error threshold has been exceeded. The receive location is shutting down.".
The Messaging Engine failed to add a receive location "M2m Othello Export Start Bestand" with URL "\m2mservices\Othello_import$\DataFilter Start*.xml" to the adapter "FILE". Reason: "The FILE adapter cannot access the folder \m2mservices\Othello_import$\DataFilter Start.
Verify this folder exists.
Error: Logon failure: unknown user name or bad password.
".
The FILE adapter cannot access the folder \m2mservices\Othello_import$\DataFilter Start.
Verify this folder exists.
Error: Logon failure: unknown user name or bad password.
An attempt to connect to "BizTalkMsgBoxDb" SQL Server database on server "ZNACDBBTS" failed.
Error: "Login failed for user ''. The user is not associated with a trusted SQL Server connection."
It would seem that there's a login failure at this time and that, because of it, other services are also experiencing problems and are eventually shut down.
The thing is, our user is an admin, and it's impossible that its password is wrong "sometimes". We have considered that the problem could be due to an infrastructure issue, but that's not really our department.
I know it's a long post, but we're not sure anymore what to do. Would adding another server and balancing the load solve our problems? Is there a way to measure our balance and know where to start splitting? What are normal load numbers, etc.?
I appreciate any answers because these issues are getting worse and we're also on a deadline.
Thanks a lot for replies!
Your immediate problem is the BizTalk throttling feature. It's supposed to help BizTalk survive temporary overload conditions. One of its many problems is that you can see the throttling kick in only in the performance monitor and not in the event log.
What you should do:
Separate the new application into a different host from the rest of the applications. Throttling is done at the host level, so the problematic application won't affect the rest of the applications.
Read about how to disable throttling in the link above.
What we have done is implement an external throttling service that feeds the BizTalk receive location in small, digestible packets. It's ugly, but the problem is ugly.
Update to comment: You have enough host instances, so ignore that advice. You may reorder the applications between the instances, but there are no clear guidelines for doing that, so it's just shuffling and guessing.
About the safety of disabling throttling: this feature doesn't make much sense in many scenarios. You have to study it. Check which of the throttling parameters you are hitting (this can be seen in the performance monitor) and decide how to change the thresholds.
How many host instances do you have?
From the line:
"The send and receive ports have a lot of different types: File, MQSeries, SQL, MLLP, FTP. Each of these types has a different host instance, to balance out the load. Our orchestrations use the BiztalkApplication host"
It sounds like you have a lot - I recently did an audit of a system where BizTalk was self-throttling and the issue was in part due to too many host instances. Each host instance places its own load upon the BizTalk MessageBox, as well as chewing up a minimum of 200 MB of memory.
Reading your comment, you have 20 - this is too many and would be a big part of your problems.
A good starting host setup would be:
A dedicated tracking host
One host that contains all receive handlers for adapters
One host that contains all orchestrations
One host that contains all send handlers for adapters
One host for adapters that need to be clustered (like FTP and MSMQ)
You can then also consider things like introducing "real time" hosts and batched hosts, so you can tune the real time hosts for low latency.
You can also have hosts for specific applications if they are known to be unstable, but in general this should not be done.
I run a BizTalk system that has similar problems and can empathize with what you are seeing. I don't know if it's the same, but I thought I'd share my experience in case.
In the same manner, restarting the send/receive host seems to fix the problem. In my case I found a direct correlation with memory usage by the host processes. I used performance counters to see when a given host was throttled for memory. By creating extra hosts and moving orchestrations and ports between them, I was able to narrow down which business sets were causing the problem. Basically, in my case restarting the hosts was the equivalent of the ultimate "garbage collection" to free up memory. This was of course until enough instances came through to gobble it up again.
I'm afraid I have not solved the issue yet, but a few things I found alleviate it:
Raise the memory threshold for a given host process so that throttling does not occur, or occurs later
Each host instance, while informative, does add overhead. Try combining hosts that are not your problem children to reduce the memory footprint.
Throw hardware at the problem; RAM is cheap
I measure the following every few minutes in perfmon so I can diagnose where the problem is:
BizTalk:MessageAgent(*)\Process memory usage (MB)
BizTalk:MessageAgent(*)\Process memory usage threshold
Memory\Available MBytes
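If you want those same readings outside perfmon, the System.Diagnostics.PerformanceCounter class can sample them from code. A sketch follows; I've copied the category/counter names from the paths above and assumed the default host instance name "BizTalkServerApplication", so verify the exact names against what your perfmon shows before relying on it:

    using System;
    using System.Diagnostics;
    using System.Threading;

    class ThrottlingMonitor
    {
        static void Main()
        {
            // Instance name is the BizTalk host name; check the exact category/counter/instance names in perfmon.
            var usage     = new PerformanceCounter("BizTalk:MessageAgent", "Process memory usage (MB)", "BizTalkServerApplication");
            var threshold = new PerformanceCounter("BizTalk:MessageAgent", "Process memory usage threshold", "BizTalkServerApplication");
            var available = new PerformanceCounter("Memory", "Available MBytes");

            while (true)
            {
                Console.WriteLine("{0:T} usage={1} MB threshold={2} MB available={3} MB",
                    DateTime.Now, usage.NextValue(), threshold.NextValue(), available.NextValue());
                Thread.Sleep(TimeSpan.FromMinutes(5));   // sample every few minutes, like the perfmon setup above
            }
        }
    }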
A few other things to take a look at: make sure any custom pipelines use good BizTalk memory practices (i.e. no XML DOM manipulation hiding somewhere, etc.). Also, theoretically, reducing the number of threads for a given host should lower the amount of memory it can seize at one time; I did not seem to have much luck with this one. Maybe the BizTalk throttling overrode it, as others have mentioned, I don't know. Also, on a final note, if you dump the perfmon results to a CSV, you can make some pretty memory usage graphs with Excel. These might be useful for talking to management about buying more hardware. That's assuming your issue fits this scenario as well.
We fixed the problem temporarily thanks to a combination of all your answers.
We set the process memory usage throttling parameters of some hosts higher.
We divided the balance of the host instances better after I analyzed all the memory usage of all hosts, thanks to performance counters and also with the use of a tool called MsgBoxViewer.
And now we're trying to get more physical memory and hopefully also an extra server or a 64-bit server.
Thanks for all replies!
We recently installed a 64-bit server in cluster with our older server. Thanks to this we can balance the memory even better which solved a lot of problems.
Although the 64-bit server didn't give us much improvement (except for a bit more memory), since it can't use 64 bits for IBM MQ, MLLP, HL7 pipelines, etc...
The other answers are helpful for run-time performance tuning, but I would recommend a design change as well.
You say that you do a lot of message manipulation in the orchestration in the message assignment shapes.
I would recommend moving that code to dedicated transforms. They are much more lightweight and execute faster. You can combine custom XSLT and C# in these maps to do the hard work. Orchestrations cost more in development, design and testing, and a whole lot more in run-time performance.
You can then use transforms for message transformation, and leave the orchestrating (what is left of it after moving the message assignment code) to the orchestrations.
The added benefit of using transforms over orchestrations is that they are much more testable.

How to read an intermittent hard drive consistently?

I have a faulty hard drive that works intermittently. After cold booting, I can access it for about 30-60 seconds, then the hard drive fails. I'm willing to write software to back up this drive to a new and bigger disk. I can develop it under GNU/Linux or Windows, I don't care.
The problem is: I can only access the disk for a short time, and some files are big and will take longer than that to be copied. For this reason, I'm thinking of backing up the entire hard disk in smaller pieces, something like bit-torrenting. I'll read some megabytes and store them before trying to read another set. My main loop would be something like this:
while(1){
    if(!check_harddrive()){ sleep(100ms); continue; }
    read_some_megabytes();
    if(!check_harddrive()){ sleep(100ms); continue; }
    save_data();
    update_reading_pointer();
    if(all_done){ break; }
}
The problem is the check_harddrive() function. I'm willing to write this in C/C++ for maximum API/library compatibility. I'll need some control over my file handles to check if they are still valid, and I need something that returns bad data, but still returns, if the drive fails during the copy process.
Maybe C# would give me the best results if I abuse "hardcoded" hardware exceptions?
Another approach would be to measure how much time I have before I need to power-cycle my hard drive, and code a program that reads it during this window only and flags me when to power-cycle.
What would you do in this case? Are there any tools/utilities that already do this?
Oh, there is a GREAT app to read bad optical media here; it's called IsoPuzzle. It's not mine, I just wanted to share something related to my problem.
!EDIT!
Some clarifications: I'm a home user, a student of computer engineering at college, and I'd rather lose the data than spend thousands of dollars recovering it. The hard drive is still covered by Seagate's warranty, but since they gave me 5 years of warranty, I want to try everything possible until the time runs out.
When I say cold booting, I mean booting after some seconds without power. Hot booting would be rebooting your computer; cold booting would be shutting it down, waiting a few seconds, then booting it up again. Since the hard disk in question is internal but SATA, I can just disconnect the power cable, wait a few seconds and connect it again.
For now I'll go with robocopy; I'm just looking into it to see how I can use it. If I don't need to code it myself, but just script it, it'll be even easier.
!EDIT2!
I wasn't clear: my drive is a Seagate 7200.11. It's known that it has bad firmware, and it's not always fixable with a simple firmware update (not after this bug appears). The drive is physically in 100% working condition, just the firmware is screwed, making it enter an infinite busy state after some seconds.
I would work this from the hardware angle first. Is it an external drive - if so, can you try it in a different case?
You mention cold-booting works, then it quits. Is this heat related? Have you tried using the hard drive for an extended period in something like a freezer?
From the software side, I'd have a second thread keep an eye on a progress counter updated by a repeated loop reading small amounts of data; it would then be able to signal failure via a timeout you define.
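A crude shape for that watchdog in C# (the 10-second timeout and ReadNextChunk are placeholders for whatever copy mechanism you end up with):

    using System;
    using System.Threading;

    class CopyWatchdog
    {
        static long lastProgressTicks = DateTime.UtcNow.Ticks;

        // Placeholder for "read a small block from the failing drive and save it to the good disk".
        static void ReadNextChunk() { Thread.Sleep(100); }

        static void Main()
        {
            var reader = new Thread(() =>
            {
                while (true)
                {
                    ReadNextChunk();
                    Interlocked.Exchange(ref lastProgressTicks, DateTime.UtcNow.Ticks); // progress heartbeat
                }
            }) { IsBackground = true };
            reader.Start();

            // Watchdog: if the reader makes no progress for 10 seconds, assume the drive dropped out.
            while (true)
            {
                var idle = DateTime.UtcNow - new DateTime(Interlocked.Read(ref lastProgressTicks), DateTimeKind.Utc);
                if (idle > TimeSpan.FromSeconds(10))
                    Console.WriteLine("No progress for {0:F0}s - power-cycle the drive.", idle.TotalSeconds);
                Thread.Sleep(1000);
            }
        }
    }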
I think the simplest way for you is to copy the entire disk image. Under Linux your disk will appear as a block device, /dev/sdb1 for example.
Start copying the disk image until read errors appear. Then wait for the user to "repair" the disk and resume reading from the last position.
You can easily mount a disk image file and read its contents; see the -o loop option for mount.
Cool down the disk before use; I heard that helps.
You might be interested in robocopy ("Robust File Copy"). Robocopy is a command-line tool that can tolerate network outages and resume copying where it previously left off (incomplete files are noted with a date stamp corresponding to 1980-01-01 and contain a recovery record so Robocopy knows where to continue from).
You know... I like being "lazy"... Here is what I would do:
I would write 2 simple scripts. One of them would start robocopy (with persistence features turned off) and start the copying, while the other would periodically check whether the drive is still working (maybe by trying to list the contents of the root directory; if it takes more than a few seconds, the drive is dead... again), and if the HDD stopped working it would restart the machine. Get them to start up after login and set up auto-login, so when the machine reboots it automatically continues.
From a "I need to get my data back" perspective, if your data is really valuable to you, I would recommend sending the drive to a data recovery specialist. Depending on how valuable the data is, the cost (probably several hundred dollars) is trivial. Ideally, you would find a data recovery specialist that doesn't just run some software to do the recovery - if the software approach doesn't work, they should be able to do things like replace the circuit board on the drive, and probably other things (I am not a data recovery specialist).
If the value of the data on the drive doesn't quite rise to that level, you should consider purchasing one of the many pieces of software for data recovery. For example, I personally have used and would recommend GetDataBack from Runtime software http://www.runtime.org. I've used it to recover a failing drive, it worked for me.
And now on to more general information... The standard process for data recovery off of a failing drive is to do as little as possible on the drive itself. You should unplug the drive, and stop attempting to do anything. The drive is failing, and it is likely to get worse and worse. You don't want to play around with it. You need to maximize your chances of getting the data off.
The way the process works is to use software that reads the drive block-by-block (not file-by-file), and makes an image copy of the drive. The software attempts to read every block, and will retry the reads if they fail, and writes an image file which is an image of the entire hard drive.
Once the hard drive has been imaged, the software then works against the image to identify the various logical parts of the drive - the partitions, directories, and files. And then it enables you to copy the files off of the image.
The software can typically "deduce" structures from the image. For example, if the partition table is damaged or missing, the software will scan through the entire image looking for things that might be partitions, and if they look enough like partitions, it will treat them as partitions and see if it can find directories and files. So good software is written using a lot of knowledge about the different structures on the drive.
If you want to learn how to write such software, good for you! My recommendation is that you start with books about how various operating systems organize data on hard drives, so that you can start to get an intuitive feel for how a software might work with drive images to pull data from them.
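If you do end up experimenting with your own tool, the block-by-block imaging loop described above might start out roughly like this. The device path, block size and retry policy are assumptions (under Windows you would open something like \\.\PhysicalDrive1 instead, and raw device access needs root/admin rights):

    using System;
    using System.IO;
    using System.Threading;

    class DriveImager
    {
        const int BlockSize = 1024 * 1024;   // 1 MiB per read; small enough to finish before the drive drops out
        const int MaxRetries = 3;

        static void Main()
        {
            // Assumed paths: a Linux block device as the source, an image file on the good disk as the target.
            using (var source = new FileStream("/dev/sdb", FileMode.Open, FileAccess.Read))
            using (var image  = new FileStream("/backup/sdb.img", FileMode.OpenOrCreate, FileAccess.Write))
            {
                long offset = image.Length;            // resume where the previous run stopped
                var buffer = new byte[BlockSize];

                while (true)
                {
                    int read = ReadWithRetry(source, offset, buffer);
                    if (read <= 0) break;               // end of device, or block given up on

                    image.Seek(offset, SeekOrigin.Begin);
                    image.Write(buffer, 0, read);
                    image.Flush();                      // don't lose progress if we crash
                    offset += read;
                }
            }
        }

        static int ReadWithRetry(FileStream source, long offset, byte[] buffer)
        {
            for (int attempt = 0; attempt < MaxRetries; attempt++)
            {
                try
                {
                    source.Seek(offset, SeekOrigin.Begin);
                    return source.Read(buffer, 0, buffer.Length);
                }
                catch (IOException)
                {
                    Console.WriteLine("Read failed at offset {0}, waiting for the drive to come back...", offset);
                    Thread.Sleep(TimeSpan.FromSeconds(30));
                }
            }
            return -1;  // give up on this block; a real tool would log it and skip ahead instead
        }
    }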
