A very similar question has also been asked here on SO in case you are interested, but as we will see, the accepted answer of that question does not always hold (and it never holds for my application's usage pattern).
The performance-determining code consists of the FileStream constructor (to open a file) and a SHA1 hash (the .NET Framework implementation). The code is pretty much the C# version of what was asked in the question I've linked to above.
Case 1: The application is started either for the first time or for the Nth time, but with a different target file set. The application is now told to compute the hash values of files that were never accessed before.
~50ms
80% FileStream constructor
18% hash computation
Case 2: The application is now fully terminated and started again, and asked to compute the hash on the same files:
~8ms
90% hash computation
8% FileStream constructor
Problem
My application always falls into Case 1. It will never be asked to re-compute a hash on a file that was already visited once.
So my rate-determining step is the FileStream constructor! Is there anything I can do to speed up this use case?
Thank you.
P.S. Stats were gathered using JetBrains profiler.
... but with different target file set.
That is the key phrase: your app will not be able to take advantage of the file system cache, like it did in the second measurement. The directory info can't come from RAM because it hasn't been read yet; the OS always has to fall back to the disk drive, and that is slow.
Only better hardware can speed it up. 50 msec is about the standard amount of time needed by a spindle drive; 20 msec is about as low as such drives can go. Reader head seek time is the hard mechanical limit. That's easy to beat today: SSDs are widely available and reasonably affordable. The only problem with them is that once you get used to one, you never move back :)
The file system and/or disk controller will cache recently accessed files/sectors.
The rate-determining step is reading the file, not constructing a FileStream object, and it's completely normal that it will be significantly faster on the second run when data is in the cache.
Off-track suggestion, but this is something that I have done a lot, and it made our analyses 30%-70% faster:
Caching
Write another piece of code that will:
iterate over all the files;
compute the hash; and,
store it in another index file.
Now, don't call the FileStream constructor to compute the hash when your application starts. Instead, open the (expectedly much smaller) index file and read the precomputed hash from it.
Further, if these files are logs etc. which are freshly created every time before your application starts, add code in the file creator to also update the index file with the hash of the newly created file.
This way your application can always read the hash from the index file only.
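For illustration, a minimal sketch of that index-file idea (the index path, the tab-separated "path<TAB>hash" layout and the method names are my assumptions, not part of the original answer):
// needs: System, System.Collections.Generic, System.IO, System.Linq, System.Security.Cryptography

// Build (or refresh) the index right after the target files are created.
static void BuildIndex(IEnumerable<string> targetFiles, string indexPath)
{
    using (var sha1 = SHA1.Create())
    using (var index = new StreamWriter(indexPath))
    {
        foreach (var path in targetFiles)
        {
            using (var fs = File.OpenRead(path))
            {
                string hash = BitConverter.ToString(sha1.ComputeHash(fs)).Replace("-", "");
                index.WriteLine(path + "\t" + hash);   // one "path<TAB>hash" line per file
            }
        }
    }
}

// At application start: read the small index instead of opening every target file.
static Dictionary<string, string> LoadIndex(string indexPath)
{
    return File.ReadLines(indexPath)
               .Select(line => line.Split('\t'))
               .ToDictionary(parts => parts[0], parts => parts[1]);
}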
I concur with #HansPassant's suggestion of using SSDs to make your disk reads faster. This answer and his answer are complementary. You can implement both to maximize the performance.
As stated earlier, the file system has its own caching mechanism which perturbs your measurement.
However, the FileStream constructor performs several tasks which, the first time, are expensive and require accessing the file system (therefore something which might not be in the data cache). For explanatory reasons, you can take a look at the code and see that the CompatibilitySwitches class is used to detect sub-feature usage. Together with this class, Reflection is heavily used both directly (to access the current assembly) and indirectly (for CAS-protected sections, security link demands). The Reflection engine has its own cache and requires accessing the file system when its own cache is empty.
It feels a little bit odd that the two measurements are so different. We currently have something similar on our machines equipped with antivirus software configured with real-time protection. In this case, the antivirus software sits in the middle, and the cache is hit or missed the first time depending on the implementation of such software.
The antivirus software might decide to aggressively check certain image files, like PNGs, due to known decode vulnerabilities. Such checks introduce additional slowdown and account the time to the outermost .NET class, i.e. the FileStream class.
Profiling with native symbols and/or with kernel debugging should give you more insight.
Based on my experience, what you describe cannot be mitigated, as there are multiple hidden layers out of our control. Depending on your usage, which is not perfectly clear to me right now, you might turn the application into a service, so you could serve all the subsequent requests faster. Alternatively, you could batch multiple requests into one single call to achieve an amortized reduced cost.
You should try to use the native FILE_FLAG_SEQUENTIAL_SCAN flag; you will have to P/Invoke CreateFile in order to get a handle and pass it to FileStream.
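A sketch of what that P/Invoke could look like (the constant values below are the documented Win32 ones, but treat the whole thing as an untested starting point rather than a drop-in solution):
using System;
using System.ComponentModel;
using System.IO;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

static class SequentialScan
{
    const uint GENERIC_READ = 0x80000000;
    const uint FILE_SHARE_READ = 0x00000001;
    const uint OPEN_EXISTING = 3;
    const uint FILE_FLAG_SEQUENTIAL_SCAN = 0x08000000;

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    static extern SafeFileHandle CreateFile(
        string fileName, uint access, uint share, IntPtr securityAttributes,
        uint creationDisposition, uint flagsAndAttributes, IntPtr templateFile);

    public static FileStream OpenSequential(string path)
    {
        SafeFileHandle handle = CreateFile(path, GENERIC_READ, FILE_SHARE_READ, IntPtr.Zero,
                                           OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, IntPtr.Zero);
        if (handle.IsInvalid)
            throw new Win32Exception(Marshal.GetLastWin32Error());
        return new FileStream(handle, FileAccess.Read);   // hash this stream as usual
    }
}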
Related
This question arises because when someone wants to use a flat file as a database, most people will say "is a database not an option?" and things like that. This makes me think that most people believe that popular database software is reliable in handling data storage.
However, since database engines also write their data stores to files (or allow me to say "flat files"), I am confused as to why most people believe that protection from data loss is almost completely guaranteed in database engines.
I suppose that database software uses features like the Windows CreateFile() function with the FILE_FLAG_WRITE_THROUGH option set; yet Microsoft specifies in their documentation that "Not all hard disk hardware supports this write-through capability."
Then why can a database engine be more reliable than my C# code that also uses the unmanaged CreateFile() function to write to disk directly, using some algorithms (like this SO way) to prevent damage to the data? Especially when writing small files and appending small amounts of bytes to them at some future time? (Note: I'm not comparing in terms of robustness, features, etc., just reliability of data integrity.)
The key to most database systems integrity is the log file.
As well as updating the various tables/data stores/documents they also write all operations and associated data to a log file.
In most cases, when the program commits, it waits until all operations are written (really written!) to the log file. If anything happens after that, the database can be rebuilt using the log file data.
Note -- you could get something similar using standard disk I/O and calling flush() at the appropriate points. However, you could never guarantee the status of the file (many I/Os could have taken place before you called flush()), and you could never recover to a point in time, as you have no copy of deleted records or of a previous version of an updated record.
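As a small illustration of the "really written" part, a sketch assuming .NET 4+ where FileStream.Flush(true) also asks the OS to flush its buffers to disk (whether the drive itself honors that still depends on the hardware, as discussed elsewhere on this page):
// needs: System.IO, System.Text

// Append one operation record to the log and only return once it has been flushed.
static void CommitToLog(FileStream log, string operation)
{
    byte[] bytes = Encoding.UTF8.GetBytes(operation + "\n");
    log.Write(bytes, 0, bytes.Length);
    log.Flush(true);   // flush the stream's buffer AND the OS file buffers to disk
}

// var log = new FileStream("ops.log", FileMode.Append, FileAccess.Write, FileShare.Read);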
Of course you can write a very secure piece of C# code that handles all possible exceptions and faults, that calculates hash codes and checks them back for anything it is going to write on the disk, and that manages all the quirks of every operating system it's deployed on with respect to file caching, disk write buffering, and so forth.
The question is: why should you?
Admittedly, a DB is not always the right choice if you just want to write data to disk. But if you want to store data consistently, safely, and most importantly without losing too much of your time in nitty-gritty IO-operation details, then you should use some kind of well-established and tested piece of code that someone else wrote and took the time to debug (hint: a database is a good choice).
See?
Also, there are databases, like SQLite, that are perfect for fast, installation-less use in a program. Use them or not, it's your choice, but I wouldn't spend my time reinventing the wheel if I were you.
EDIT: As a result of the answers so far, I would like to add more focus on what I want to zero in on: a database that allows writing in memory (could be simple C# code) with persistence-to-storage options, in order to access the data from within R. Redis so far looks the most promising. I am also considering actually using something similar to Lockfree++ or ZeroMQ in order to avoid writing data concurrently to the database, but rather sending all to-be-persisted data over a message bus/other implementation and having one "actor" handle all write operations to an in-memory db or other solution. Any more ideas besides Redis (some mentioned SQLite and I still need to test its performance)? Any other suggestions?
I am searching for the ideal database structure/solution that meets most of my requirements below, but so far I have utterly failed. Can you please help?
My tasks: I run a process in .NET 4.5 (C#) and generate (generally) value types that I want to use for further analysis in other applications, and therefore would like to either preserve in memory or persist to disk. More below. The data is generated within different tasks/threads, and thus a row-based data format does not lend itself well to this situation (because the data generated in different threads is generated at different times and is thus not aligned). Thus I thought a columnar data structure may be suitable, but please correct me if I am wrong.
Example:
Task/Thread #1 generates the following data at given time stamps
datetime.ticks / value of output data
1000000001 233.23
1000000002 233.34
1000000006 234.23
...
Task/Thread #2 generates the following data at given time stamps
datetime.ticks / value of output data
1000000002 33.32
1000000005 34.34
1000000015 54.32
...
I do not need to align the time stamps at .NET run time; I am first and foremost after preserving the data and processing it within R or Python at a later point.
My requirements:
Fast writes, fast writes, fast writes: it can happen that I generate 100,000-1,000,000 data points per second and need to persist (worst case) or retain the data in memory. It's OK to run the writes on their own thread, so this process can lag the data generation process, but the limitation is 16 GB RAM (64-bit code); more below.
Preference is for a columnar db format, as it lends itself well to how I want to query the data later, but I am open to any other structure if it makes sense with regard to the examples above (document/key-value is also OK if all other requirements are met, especially in terms of write speed).
API that can be referenced from within .NET. Example: HDF5 may be considered capable by some, but I find its .NET port horrible. Something that supports .NET a little better would be a plus, but if all other requirements are met then I can deal with something similar to the HDF5 .NET port.
Concurrent writes if possible: as described earlier, I would like to write data concurrently from different tasks/threads.
I am constrained by 16 GB of memory (the .NET process runs in 64-bit) and thus I am probably looking for something that is not purely in-memory, as I may sometimes generate more data than that. Something in-memory which persists at times, or a pure persistence model, is probably preferable.
Preference for embedded, but if a server in a client/server solution can run as a Windows service then that's no issue.
In terms of data access, I have a strong preference for a db solution for which interfaces from R and Python already exist, because I would like to use the pandas library within Python for time series alignments and other analysis, and run analyses within R.
If the API/library additionally supports SQL/SQL-like/LINQ-like queries, that would be terrific, but generally I just need the absolute bare bones, such as loading columnar data between a start and end date (given the "key"/index is in such a format), because I analyze and run queries within R/Python.
If it comes with a management console or data visualizer, that would be a plus but not a must.
Should be open source or priced within "reach" (no, KDB does not qualify in that regard ;-))
OK, here is what I have so far, and again it's all I've got, because most db solutions simply fail already on the write-performance requirement:
Infobright and Db4o. I like what I have read so far, but I admit I have not checked into any performance stats.
Something done myself. I can easily store value types in binary format and index the data by datetime.ticks; I would just need to somehow write scripts to load/deserialize the data in Python/R. But it would be a massive task if I wanted to add concurrency, a query engine, and other goodies. Thus I am looking for something already out there.
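For what it's worth, the do-it-yourself option can start as small as the sketch below (the fixed 8-byte ticks + 8-byte double layout is my own choice; a Python reader would then have to assume the same layout, e.g. numpy.fromfile with dtype [('ticks', '<i8'), ('value', '<f8')]):
// needs: System.IO

// One file per task/thread, appending (ticks, value) pairs in a fixed binary layout.
static void AppendSamples(string path, long[] ticks, double[] values)
{
    using (var writer = new BinaryWriter(File.Open(path, FileMode.Append)))
    {
        for (int i = 0; i < ticks.Length; i++)
        {
            writer.Write(ticks[i]);    // 8 bytes, little-endian long
            writer.Write(values[i]);   // 8 bytes, IEEE 754 double
        }
    }
}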
I can't comment -- low rep (I'm new here) -- so you get a full answer instead...
First, are you sure you need a database at all? If fast write speed and portability to R are your biggest concerns, then have you considered a flat-file mechanism? According to your comments you're willing to batch writes out, but you need persistence; if those were my requirements I'd write a straight-to-disk buffering system that was lightning fast, then build a separate task that periodically took the disk files and moved them into a data store for R, and that's only if R reading the flat files wasn't sufficient in the first place.
If you can do alignment after-the-fact, then you could write the threads to separate files in your main parallel loop, cutting each file off every so often, and leave the alignment and database loading to the subprocess.
So (in crappy pseudocode), build a thread process that you'd call with BackgroundWorker or some such, and include a threadname string uniquely identifying each worker and thus each filestream (task/thread):
file_name = threadname + '0001.csv'   // or something
open(file_name for writing)

while (generating_data) {
    generate_data()
    while (buffer_not_full and very_busy) {
        write_data_to_buffer
        generate_data()
    }
    flush_buffer_to_disk(file_name)
    if (file is big enough or enough time has passed or we're not too busy) {
        close(file_name)
        move(file_name to bob's folder)
        increment file_name
        open(file_name for writing)
    }
}
Efficient and speedy file I/O and buffering is a straightforward and common problem. Nothing is going to be faster than this. Then you can just write another process to do the database loads and not sweat the performance there:
while(file_name in list of files in bob's folder sorted by date for good measure)
{
read bob's file
load bob's file to database
align dates, make pretty
}
And I wouldn't write that part in C#, I'd batch script it and use the database's native loader which is going to be as fast as anything you can build from scratch.
You'll have to make sure the two loops don't interfere much if you're running on the same hardware. That is, run the task threads at a higher priority, or build in some mutex or performance limiters so that the database load doesn't hog resources while the threads are running. I'd definitely segregate the database server and hardware so that file I/O to the flat files isn't compromised.
FIFO queues would work if you're on Unix, but you're not. :-)
Also, hardware is going to have more of a performance impact for you than the database engine, I'd imagine. If you're on a budget I'm guessing you're on COTS hardware, so springing for a solid state drive may up performance fairly cheaply. As I said, separating the DB storage from the flat file storage would help, and the CPU/RAM for R, the Database, and your Threads should all be segregated ideally.
What I'm saying is that the choice of DB vendor probably isn't your biggest issue, unless you have a lot of money to spend. You'll be hardware-bound most of the time otherwise. Database tuning is an art, and while you can eke out minor performance gains at the top end, having a good database administrator will keep most databases in the same ballpark for performance. I'd look at what R and Python support well and what you're comfortable with. If you think in a columnar fashion then look at R and C#'s support for Cassandra (my vote), Hana, Lucid, HBase, Infobright, Vertica and others, and pick one based on price and support. For traditional databases on a single commodity machine, I haven't seen anything that MySQL can't handle.
This is not to answer my own question but to keep track of all the databases I have tested so far and why they have not met my requirements (yet): each time I attempted to write 1 million single objects (1 long, 2 floats) to the database. For ooDBs, I stuck the objects into a collection and wrote the collection itself; similar story for key/value stores such as Redis, but I also attempted to write simple ints (1 mil) to columnar dbs such as InfoBright.
Db4o: awfully slow writes: 1 mil objects within a collection took about 45 seconds. I later optimized the collection structure and also wrote each object individually; not much love here.
InfoBright: Same thing, very slow in terms of write speed, which surprised me quite a bit as it organizes data in columnar format, but I think the "knowledge tree" only kicks in when querying data rather than when saving flat data structures/table-like structures.
Redis (through BookSleeve): Great API for .NET: full Redis functionality (though there are a couple of drawbacks to running the server on Windows machines vs. a Linux or Unix box). Performance was very fast... north of 1 million items per second. I serialized all objects using Protocol Buffers (protobuf-net; both it and BookSleeve were written by Marc Gravell). I still need to play a lot more with the library, but R and Python both have full access to the Redis DB, which is a big plus. Love it so far. The async framework that Marc wrote around the Redis base functions is awesome, really neat, and it works so far. I want to spend a little more time experimenting with the Redis lists/collection types as well, as I have so far only serialized to byte arrays.
SQLite: I ran it purely in-memory and managed to write 1 million value-type elements in around 3 seconds. Not bad for a pure RDBMS; obviously the in-memory option really speeds things up. I only created one connection and one transaction, created one command and one parameter, and simply adjusted the value of the parameter within a loop and ran ExecuteNonQuery on each iteration. The transaction commit was then run outside the loop.
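For anyone who wants to reproduce that SQLite test, the pattern described is roughly the sketch below (assuming the System.Data.SQLite ADO.NET provider; the table and column names are made up):
// needs: System.Data, System.Data.SQLite

using (var conn = new SQLiteConnection("Data Source=:memory:"))
{
    conn.Open();
    new SQLiteCommand("CREATE TABLE samples (ticks INTEGER, value REAL)", conn).ExecuteNonQuery();

    using (var tx = conn.BeginTransaction())
    using (var cmd = new SQLiteCommand("INSERT INTO samples VALUES (@t, @v)", conn, tx))
    {
        var t = cmd.Parameters.Add("@t", DbType.Int64);
        var v = cmd.Parameters.Add("@v", DbType.Double);
        for (int i = 0; i < 1000000; i++)
        {
            t.Value = (long)i;
            v.Value = i * 0.1;
            cmd.ExecuteNonQuery();   // one prepared command, only the parameter values change
        }
        tx.Commit();                 // single commit outside the loop, as described above
    }
}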
HDF5: Though there is a .NET port, and there also exists a library to somehow work with HDF5 files out of R, I strongly discourage anyone from doing so. It's a pure nightmare. The .NET port is very badly written; heck, the whole HDF5 concept is more than questionable. It's a very old and, in my opinion, outgrown solution to store vectorized/columnar data. This is 2012, not 1995. If one cannot completely delete datasets and vectors out of the file in which they were stored before, then I do not call that an annoyance but a major design flaw. The API in general (not just .NET) is very badly designed and written imho; there are tons of class objects that nobody, without having spent hours and hours studying the file structure, understands how to use. I think that is somewhat evidenced by the very sparse amount of documentation and example code that is out there. Furthermore, the h5r R library is a drama, an absolute nightmare. It's badly written as well (often the file is not correctly closed upon writing due to a faulty flush, and it corrupts files), and the library has issues even being properly installed on 32-bit OSes... and it goes on and on. I write the most about HDF5 because I spent most of my time on this piece of .... and ended up with the most frustration. The idea of having a fast columnar file storage system, accessible from R and .NET, was enticing, but it just does not deliver what it promised in terms of API integration and usability, or lack thereof.
Update: I ditched testing VelocityDB simply because there does not seem to be any adapter available to access the db from within R. I am currently contemplating writing my own GUI with a charting library, which would access the generated data either from a written binary file, or have it sent over a broker-less message bus (ZeroMQ), or sent through LockFree++ to an "actor" (my GUI). I could then call R from within C# and have results returned to my GUI. That would possibly allow me the most flexibility and freedom, but would obviously also be the most tedious to code. I am running into more and more limitations during my tests; with each db test I befriend this idea more and more.
RESULT: Thanks for the participation. In the end I awarded the bounty points to Chipmonkey because he suggested part of what I considered the important points of the solution to my problem (though I chose my own, different solution in the end).
I ended up with a hybrid between Redis in-memory storage and direct calls out of .NET to the R.dll. Redis allows access to its data, stored in memory, by different processes. This makes it a convenient solution for quickly storing the data as key/value pairs in Redis and then accessing the same data out of R. Additionally, I directly send data and invoke functions in R through its .dll and the excellent R.Net library. Passing a collection of 1 million value types to R takes about 2.3 seconds on my machine, which is fast enough given the convenience of just passing in the data, invoking computational functions within R from the .NET environment, and getting the results back, sync or async.
Just a note: I once had a similar problem posted by a fellow in a Delphi forum. I could help him with a simple ID-key-value database backend I wrote at that time (kind of a NoSQL engine). Basically, it uses a B-tree to store triplets (32-bit ObjectID, 32-bit PropertyKey, 64-bit Value). I managed to save about 500k values per second in real time (about 5 years ago). Of course, the data was indexed on all three values (ID, property-ID and value). You could optimize this by ignoring the value index.
The source I still have is in Delphi, but I would think about implementing something like that using C#. I cannot tell you whether it will meet your needs for performance, but if all else fails, give it a try. Using a buffered write should also drastically improve performance.
I would go with a way that combines persistent storage (I personally prefer db4o, but you can use files as well, as mentioned above) and storing objects in memory, like this:
use a BlockingCollection<T> to store objects in memory (I believe you will achieve better performance than 1,000,000/s storing objects in memory), and then have one or more processing threads which will consume the objects and store them in the persistent database
// using System.Collections.Concurrent;
var blockingCollection = new BlockingCollection<MyObject>();

// Producing thread
for (int i = 0; i < 1000000; i++)
    blockingCollection.Add(myObject);

// Consuming threads
while (true)
{
    var myObject = blockingCollection.Take();   // blocks until an item is available
    db4oSession.Store(myObject);                // or write it to the files or whatever
}
BlockingCollection pretty much solves the producer-consumer workflow, and in case you use multiple instances of it and use AddToAny/TakeFromAny, you can reach any kind of multithreaded performance.
Each consuming thread could have a different db4o session (file) to reach the desired performance (db4o is single-threaded).
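A small sketch of that AddToAny/TakeFromAny variant, reusing the names from the snippet above (the number of queues and the capacity are arbitrary choices):
// using System.Collections.Concurrent;

// Several bounded collections, each drained by its own consumer (and its own db4o file).
var queues = new[]
{
    new BlockingCollection<MyObject>(100000),
    new BlockingCollection<MyObject>(100000)
};

// Producer: let the framework pick whichever queue will accept the item first.
BlockingCollection<MyObject>.AddToAny(queues, myObject);

// Consumer i: take from whichever queue has an item, then persist it.
MyObject item;
BlockingCollection<MyObject>.TakeFromAny(queues, out item);
db4oSessions[i].Store(item);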
Since you want to use ZeroMQ why not use memcache over Redis?
ZeroMQ offers no persistence as far as I know. Memcache also offers no persistence and is a bit faster than Redis.
Or perhaps the other way around: if you use Redis, why not use beanstalk MQ?
If you want to use Redis (for the persistence), you might want to switch from ZeroMQ to beanstalk MQ (also a fast in-memory queue, but it also has persistence via logging). Beanstalk also has C# libs.
We are using File.WriteAllBytes to write data to the disk. But if a reboot happens just about the time when we close the file, Windows adds nulls to the file. This seems to be happening on Windows 7, so when we come back to the file we see nulls in it. Is there a way to prevent this? Is Windows closing its internal handle after a certain time, and can this be forced to happen immediately?
Depending on what behavior you want, you can either put the machine on a UPS as 0A0D suggested, or, in addition, you can use Windows' Vista+ Transactional NTFS functionality. This allows you to write to the file system atomically, so in your case nothing would be written rather than improper data. It isn't directly part of the .NET Framework yet, but there are plenty of managed wrappers to be found online.
Sometimes no data is better than wrong data. When your application starts up again, it can see that the file is missing and "continue" from where it left off, depending on what your application does.
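If pulling in a Transactional NTFS wrapper feels too heavy, a simpler pattern with a similar all-or-nothing flavor (my sketch, not part of this answer) is to write to a temporary file, flush it, and then swap it into place with File.Replace, which wraps the Win32 ReplaceFile call:
// needs: System.IO

static void WriteAllBytesSafely(string path, byte[] data)
{
    string temp = path + ".tmp";
    using (var fs = new FileStream(temp, FileMode.Create, FileAccess.Write))
    {
        fs.Write(data, 0, data.Length);
        fs.Flush(true);                            // push the data through the OS buffers
    }
    if (File.Exists(path))
        File.Replace(temp, path, path + ".bak");   // swap the new file in, keep the old one as backup
    else
        File.Move(temp, path);
}
// Readers of 'path' see either the complete old file or the complete new one, not a half-written file.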
Based on your comments, there are no guarantees when writing a file - especially if you lose power during a file write. Your best bet is to put the PC on an uninterruptible power supply. If you are able to create an auto-restore mechanism, like Microsoft Office products have, then that would prevent complete loss of data, but it won't fix the missing data upon power loss.
I would consider this a case of a fatal exception (sudden loss of power). There isn't anything you can do about it, and generally, trying to handle them only makes matters worse.
I have had to deal with something similar; essentially an embedded system running on Windows, where the expectation is that the power might be shut off at any time.
In practice, I work with the understanding that a file written to disk less than 10 seconds before loss-of-power means that the file will be corrupted. (I use 30 seconds in my code to play it safe).
I am not aware of any way of guaranteeing from code that a file has been fully closed, flushed to disk, and that the disk hardware has finalized its writes. Except to know that 10 (or 30) seconds has elapsed. It's not a very satisfying situation, but there it is.
Here are some pointers I have used in a real-life embedded project...
Use a system of checksums and backup files.
Checksums: at the end of any file you write, include a checksum (if it's a custom XML file then perhaps include a <checksum .../> tag of some sort). Then upon reading, if the checksum tag isn't there, or doesn't match the data, then you must reject the file as corrupt.
Backups: every time you write a file, save a copy to one of two backups, say A and B. If A exists on disk but is less than 30 seconds old, then copy to B instead. Then upon reading, read the original file first; if it is corrupt, read A; if that is corrupt too, read B.
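A minimal sketch of the checksum part (a plain-text "CHECKSUM:" trailer line is just one possible layout; the XML-tag variant mentioned above works the same way):
// needs: System, System.IO, System.Linq, System.Security.Cryptography, System.Text

static void WriteWithChecksum(string path, string[] lines)
{
    string body = string.Join(Environment.NewLine, lines);
    File.WriteAllText(path, body + Environment.NewLine + "CHECKSUM:" + Sha1Hex(body));
}

static bool TryReadWithChecksum(string path, out string[] lines)
{
    lines = null;
    string[] all = File.ReadAllLines(path);
    if (all.Length == 0 || !all[all.Length - 1].StartsWith("CHECKSUM:"))
        return false;                              // no checksum trailer: reject the file as corrupt
    string body = string.Join(Environment.NewLine, all.Take(all.Length - 1));
    if (Sha1Hex(body) != all[all.Length - 1].Substring("CHECKSUM:".Length))
        return false;                              // checksum mismatch: reject the file as corrupt
    lines = all.Take(all.Length - 1).ToArray();
    return true;
}

static string Sha1Hex(string text)
{
    using (var sha1 = SHA1.Create())
        return BitConverter.ToString(sha1.ComputeHash(Encoding.UTF8.GetBytes(text))).Replace("-", "");
}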
Also
If it is an embedded system, you need to run the DOS command "chkdsk /F" on the drive you do your writes to, upon boot. Because if you are getting corrupted files, then you are also going to be getting a corrupted file system.
NTFS file systems are meant to be more robust against errors than FAT32, but I believe that NTFS disks can also require more time to fully flush their data. I use FAT32 when I can.
Final thought: if you are really using an embedded system, under windows, you would do well to learn more about Windows Embedded, and the Enhanced Write Filter system.
What is the logic behind disk defragmentation and Disk Check in Windows? Can I do it using C# code?
For completeness sake, here's a C# API wrapper for defragmentation:
http://blogs.msdn.com/jeffrey_wall/archive/2004/09/13/229137.aspx
Defragmentation with these APIs is (supposed to be) very safe nowadays. You shouldn't be able to corrupt the file system even if you wanted to.
Commercial defragmentation programs use the same APIs.
Look at Defragmenting Files at msdn for possible API helpers.
You should carefully think about using C# for this task, as it may introduce some undesired overhead for marshaling into native Win32.
If you don't know the logic for defragmentation, and if you didn't write the file system yourself so you can't authoritatively check it for errors, why not just start new processes running 'defrag' and 'chkdsk'?
Mark Russinovich wrote an article, Inside Windows NT Disk Defragmentation, a while ago which gives in-depth details. If you really want to do this, I would really advise you to use the built-in facilities for defragmenting. More so, on recent OSes I have never seen a need as a user to even care about defragmenting; it is done automatically on a schedule, and the NTFS folks at MS are definitely smarter at that stuff than you (sorry, but they have been doing this for some time now; you haven't).
Despite its importance, the file system is no more than a data structure that maps file names into lists of disk blocks, and keeps track of meta-information such as the actual length of the file and special files that keep lists of files (e.g., directories). A disk checker verifies that the data structure is consistent: every disk block must either be free for allocation to a file or belong to a single file. It can also check for certain cases where a set of disk blocks appears to be a file that should be in a directory but is not for some reason.
Defragmentation is about looking at the lists of disk blocks assigned to each file. Files will generally load faster if they use a contiguous set of blocks rather than ones scattered all over the disk, and generally the entire file system will perform best if all the disk blocks in use confine themselves to a single contiguous range of the disk. The trick is moving disk blocks around safely to achieve this end while not destroying the file system.
The major difficulty here is running these applications while a disk is in use. It is possible, but one has to be very, very, very careful not to make some kind of obvious or extremely subtle error and destroy most or all of the files. It is easier to work on a file system offline.
The other difficulty is dealing with the complexities of the file system. For example, you'd be much better off building something that supports FAT32 rather than NTFS because the former is a much, much simpler file system.
As long as you have low-level block access and some sensible way for dealing with concurrency problems (best handled by working on the file system when it is not in use) you can do this in C#, perl or any language you like.
BUT BE VERY CAREFUL. Early versions of the program will destroy entire file systems. Later versions will do so but only under obscure circumstances. And users get extremely angry and litigious if you destroy their data.
I have a problem which requires me to parse several log files from a remote machine.
There are a few complications:
1) The file may be in use
2) The files can be quite large (100mb+)
3) Each entry may be multi-line
To solve the in-use issue, I need to copy it first. I'm currently copying it directly from the remote machine to the local machine and parsing it there. That leads to issue 2. Since the files are quite large, copying them locally can take quite a while.
To enhance parsing time, I'd like to make the parser multi-threaded, but that makes dealing with multi-lined entries a bit trickier.
The two main issues are:
1) How do I speed up the file transfer (compression? Is transferring locally even necessary? Can I read an in-use file some other way?)
2) How do I deal with multi-line entries when splitting up the lines among threads?
UPDATE: The reason I didn't do the obvious thing and parse on the server is that I want to have as little CPU impact as possible. I don't want to affect the performance of the system I'm testing.
If you are reading a sequential file, you want to read it line by line over the network. You need a transfer method capable of streaming. You'll need to review your IO streaming technology to figure this out.
Large IO operations like this won't benefit much by multithreading since you can probably process the items as fast as you can read them over the network.
Your other great option is to put the log parser on the server, and download the results.
The better option, from the perspective of performance, is going to be to perform your parsing at the remote server. Apart from exceptional circumstances the speed of your network is always going to be the bottleneck, so limiting the amount of data that you send over your network is going to greatly improve performance.
This is one of the reasons that so many databases use stored procedures that are run at the server end.
Improvements in parsing speed (if any) through the use of multithreading are going to be swamped by the comparative speed of your network transfer.
If you're committed to transferring your files before parsing them, an option that you could consider is the use of on-the-fly compression while doing your file transfer.
There are, for example, sftp servers available that will perform compression on the fly.
At the local end you could use something like libcurl to do the client side of the transfer, which also supports on-the-fly decompression.
The easiest way, considering you are already copying the file, would be to compress it before copying and decompress once copying is complete. You will get huge gains compressing text files because zip algorithms generally work very well on them. Also, your existing parsing logic could be kept intact rather than having to hook it up to a remote network text reader.
The disadvantage of this method is that you won't be able to get line by line updates very efficiently, which are a good thing to have for a log parser.
I guess it depends on how "remote" it is. 100 MB on a 100 Mb LAN would take about 8 seconds... up it to gigabit, and you'd have it in around 1 second. $50 * 2 for the cards and $100 for a switch would be a very cheap upgrade you could do.
But, assuming it's further away than that, you should be able to open it with just read mode (as you're reading it when you're copying it). SMB/CIFS supports file block reading, so you should be streaming the file at that point (of course, you didn't actually say how you were accessing the file - I'm just assuming SMB).
Multithreading won't help, as you'll be disk or network bound anyway.
Use compression for transfer.
If your parsing is really slowing you down and you have multiple processors, you can break the parsing job up; you just have to do it in a smart way -- have a deterministic algorithm for deciding which workers are responsible for dealing with incomplete records. Assuming you can determine that a line is part of the middle of a record, you could, for example, break the file into N/M segments, each responsible for M lines; when one of the jobs determines that its record is not finished, it just has to read on until it reaches the end of the record. When one of the jobs determines that it's reading a record for which it doesn't have a beginning, it should skip the record.
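A rough sketch of that chunking rule (Record, IsRecordStart and ParseRecord are placeholders for whatever defines a record in the actual log format):
// needs: System, System.Collections.Generic, System.Linq

// Parses only the records that START inside [start, start + count), reading past the
// chunk boundary to finish the last record, and skipping a leading partial record.
static List<Record> ParseChunk(string[] lines, int start, int count)
{
    int end = Math.Min(start + count, lines.Length);
    int i = start;
    while (i < end && !IsRecordStart(lines[i])) i++;   // skip a record begun in the previous chunk

    var results = new List<Record>();
    while (i < end)
    {
        var recordLines = new List<string> { lines[i++] };
        while (i < lines.Length && !IsRecordStart(lines[i]))
            recordLines.Add(lines[i++]);               // may run past 'end' to complete the record
        results.Add(ParseRecord(recordLines));
    }
    return results;
}

// Usage: split the copied file into one chunk per core and parse the chunks in parallel.
// string[] lines = File.ReadAllLines(localCopy);
// int chunkSize = lines.Length / Environment.ProcessorCount + 1;
// var records = Enumerable.Range(0, Environment.ProcessorCount)
//     .AsParallel()
//     .SelectMany(n => ParseChunk(lines, n * chunkSize, chunkSize))
//     .ToList();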
If you can copy the file, you can read it. So there's no need to copy it in the first place.
EDIT: use the FileStream class to have more control over the access and sharing modes.
new FileStream("logfile", FileMode.Open, FileAccess.Read, FileShare.ReadWrite)
should do the trick.
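For example, wrapped in a reader (Parse stands in for whatever the existing parsing code does):
using (var fs = new FileStream("logfile", FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
using (var reader = new StreamReader(fs))
{
    string line;
    while ((line = reader.ReadLine()) != null)
        Parse(line);   // hypothetical parse routine
}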
I've used SharpZipLib to compress large files before transferring them over the Internet. So that's one option.
Another idea for 1) would be to create an assembly that runs on the remote machine and does the parsing there. You could access the assembly from the local machine using .NET remoting. The remote assembly would need to be a Windows service or be hosted in IIS. That would allow you to keep your copies of the log files on the same machine, and in theory it would take less time to process them.
I think using compression (deflate/gzip) would help.
The given answers do not satisfy me, and maybe my answer will help others not to think it is super complicated or that multithreading wouldn't benefit in such a scenario. Maybe it will not make the transfer faster, but depending on the complexity of your parsing, it may make the parsing or the analysis of the parsed data faster.
It really depends upon the details of your parsing. What kind of information do you need to get from the log files? Is this information something like statistics, or is it dependent on multiple log messages?
You have several options:
parsing multiple files at the same time would be the easiest, I guess; you have the file as context and can create one thread per file
another option, as mentioned before, is to use compression for the network communication
you could also use a helper that splits the log file into lines that belong together as a first step, and then process these blocks of lines with multiple threads; the parsing of these dependent lines should be quite easy and fast
Very important in such a scenario is to measure where your actual bottleneck is. If your bottleneck is the network, you won't benefit from optimizing the parser too much. If your parser creates a lot of objects of the same kind, you could use the ObjectPool pattern and create objects with multiple threads. Try to process the input without allocating too many new strings. Often parsers are written using a lot of string.Split and so forth, which is not really as fast as it could be. You could navigate the stream by checking the upcoming values without reading the complete string and splitting it again, and instead directly fill the objects you will need after parsing is done.
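As a tiny illustration of the string.Split point, a single field can often be pulled out with IndexOf/Substring instead of splitting the whole line into an array of new strings (the three-fields-then-message layout here is made up):
// Instead of: string level = line.Split(' ')[2];   // allocates an array plus one string per field
int first  = line.IndexOf(' ');
int second = line.IndexOf(' ', first + 1);
int third  = line.IndexOf(' ', second + 1);
string level = line.Substring(second + 1, third - second - 1);   // only the one string you need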
Optimization is almost always possible; the question is how much you get out for how much you put in and how critical your scenario is.