I have a binary data file that is written to from a live data stream, so it keeps growing as the stream comes in. Meanwhile, I need to open it at the same time in read-only mode to display its data in my application (a time-series chart). Opening the whole file takes a few minutes because it is pretty large (a few hundred megabytes).
What I would like to do is, rather than re-opening/reading the whole file every x seconds, read only the data that was added since the last read and append it to the data already in memory.
I would suggest using FileSystemWatcher to be notified of changes to the file. From there, cache information such as the size of the file between events and add some logic to only respond to full lines, etc. You can use the Seek() method of the FileStream class to jump to a particular point in the file and read only from there. I hope it helps.
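A minimal sketch of that idea (the class and member names are mine, not from the question): remember the offset where the last read stopped and, on each FileSystemWatcher event or timer tick, open the file with FileShare.ReadWrite, Seek past that offset and read only the new bytes.

using System;
using System.IO;

class TailReader
{
    private readonly string _path;
    private long _lastPosition;                          // where the previous read stopped

    public TailReader(string path) { _path = path; }

    public byte[] ReadAppendedBytes()
    {
        // FileShare.ReadWrite lets the writer keep appending while we read.
        using (var fs = new FileStream(_path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        {
            if (fs.Length <= _lastPosition)
                return new byte[0];                      // nothing new yet

            fs.Seek(_lastPosition, SeekOrigin.Begin);    // skip what was already read
            var buffer = new byte[fs.Length - _lastPosition];
            int read = fs.Read(buffer, 0, buffer.Length);
            _lastPosition += read;                       // remember for the next event
            return buffer;
        }
    }
}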
If you control the writing of this file, I would split it into several files of a predefined size.
When the writer determines that the current file is larger than, say, 50 MB, it closes it and immediately creates a new file to write to. The process writing the data should always know which file is the current one.
The reader thread/process would read all these files in order, jumping to the next file when the current file was read completely.
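A rough sketch of the writer side, assuming a 50 MB limit and a simple sequential naming scheme (both are illustrative, not prescribed by this answer):

using System;
using System.IO;

class RollingWriter : IDisposable
{
    private const long MaxFileSize = 50L * 1024 * 1024;   // roll over after ~50 MB
    private readonly string _directory;
    private int _fileIndex;
    private FileStream _current;

    public RollingWriter(string directory)
    {
        _directory = directory;
        OpenNextFile();
    }

    public void Write(byte[] data)
    {
        if (_current.Length + data.Length > MaxFileSize)
            OpenNextFile();                                // close the full file, start a new one
        _current.Write(data, 0, data.Length);
    }

    private void OpenNextFile()
    {
        if (_current != null) _current.Dispose();
        string path = Path.Combine(_directory, string.Format("data_{0:D4}.bin", _fileIndex++));
        _current = new FileStream(path, FileMode.CreateNew, FileAccess.Write, FileShare.Read);
    }

    public void Dispose()
    {
        if (_current != null) _current.Dispose();
    }
}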
You can probably use a FileSystemWatcher to monitor for changes in the file, like the example given here: Reading changes in a file in real-time using .NET.
But I'd suggest that you evaluate another solution involving a queue such as RabbitMQ or Redis - any queue with a publisher-subscriber model. Then you just push the live data into the queue and have two different listeners (subscribers): one to save the data to the file, and the other to process the newly appended data. This way you gain more flexibility in distributing the load of the application.
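As a minimal sketch of the publisher-subscriber idea, here is what it could look like with StackExchange.Redis (the channel name, payload format and handlers are illustrative assumptions, and it assumes a Redis instance on localhost):

using System;
using System.IO;
using StackExchange.Redis;

class PubSubSketch
{
    static void Main()
    {
        var redis = ConnectionMultiplexer.Connect("localhost");
        ISubscriber sub = redis.GetSubscriber();

        // Subscriber 1: append every message to the file on disk.
        sub.Subscribe("live-data", (channel, message) =>
            File.AppendAllText("data.log", (string)message + Environment.NewLine));

        // Subscriber 2: hand the same message to the chart/processing code.
        sub.Subscribe("live-data", (channel, message) =>
            Console.WriteLine("new sample: " + message));

        // Publisher: the process receiving the live stream just pushes each sample.
        sub.Publish("live-data", "42.7;2024-01-01T12:00:00Z");

        Console.ReadLine();
    }
}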
I have a logging application that works great, but I want to apply the ability to maintain the size of the log file - stop it from getting too large.
Ideally, I want to check the size of the file periodically, and if it's over the configured amount (5 MB or something), delete from the beginning until it's back down to some size, like 4 MB.
From reading other questions I'm still unclear on whether I can update/delete from a file without reading its entire contents. My ideal situation would be:
if (filesize > 5MB)
{
    while (filesize > 4MB)
        Delete_First_X_Many_Lines(file);
}
Thank you in advance for any pointers and direction.
I would do this:
Lock the log file (prevent writes).
Copy the end of the log file you want to keep to a new file.
Copy the new file on top of the old log file.
Unlock the log file.
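A rough sketch of those four steps (the 5 MB/4 MB thresholds come from the question; the temp-file name and the "skip to the next line break" detail are my assumptions):

using System;
using System.IO;

static class LogTrimmer
{
    const long MaxSize  = 5 * 1024 * 1024;   // trim trigger
    const long KeepSize = 4 * 1024 * 1024;   // roughly how much to keep

    public static void TrimIfNeeded(string logPath)
    {
        var info = new FileInfo(logPath);
        if (!info.Exists || info.Length <= MaxSize)
            return;

        string tempPath = logPath + ".tmp";

        // Lock the log (FileShare.None blocks writers) and copy the tail we want to keep.
        using (var source = new FileStream(logPath, FileMode.Open, FileAccess.Read, FileShare.None))
        using (var target = new FileStream(tempPath, FileMode.Create, FileAccess.Write))
        {
            source.Seek(source.Length - KeepSize, SeekOrigin.Begin);

            // Skip forward to the next line break so we don't keep half a line.
            int b;
            while ((b = source.ReadByte()) != -1 && b != '\n') { }

            source.CopyTo(target);
        }

        // The lock is released when the streams are disposed; now swap the files.
        File.Delete(logPath);
        File.Move(tempPath, logPath);
    }
}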
I need to parse a large CSV file in real-time, while it's being modified (appended) by a different process. By large I mean ~20 GB at this point, and slowly growing. The application only needs to detect and report certain anomalies in the data stream, for which it only needs to store small state info (O(1) space).
I was thinking about polling the file's attributes (size) every couple of seconds, opening a read-only stream, seeking to the previous position, and then continuing to parse where I last stopped. But since this is a text (CSV) file, I obviously need to keep track of newline characters somehow when continuing, to ensure I always parse an entire line.
If I am not mistaken, this shouldn't be such a problem to implement, but I wanted to know if there is a common way/library which solves some of these problems already?
Note: I don't need a CSV parser. I need info about a library which simplifies reading lines from a file which is being modified on the fly.
I did not test it, but I think you can use a FileSystemWatcher to detect when a different process modifies your file. In the Changed event, you will be able to seek to a position you saved before and read the additional content.
There is a small problem here:
Reading and parsing CSV requires a TextReader
Positioning doesn't work (well) with TextReaders.
First thought: keep it open. If both the producer and the analyzer operate in non-exclusive mode, it should be possible to ReadLine-until-null, pause, ReadLine-until-null, etc.
You noted that "it should be 7-bit ASCII, just some Guids and numbers".
That makes it feasible to track the file Position (pos += line.Length+2). Do make sure you open it with Encoding.ASCII. You can then re-open it as a plain binary Stream, Seek to the last position and only then attach a StreamReader to that stream.
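A minimal sketch of that approach, assuming pure ASCII content and CRLF line endings (hence the +2 per line); the class and method names are mine:

using System;
using System.IO;
using System.Text;

class CsvTail
{
    private long _lastPosition;

    public void ReadNewLines(string path, Action<string> handleLine)
    {
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        {
            stream.Seek(_lastPosition, SeekOrigin.Begin);   // resume where we stopped

            using (var reader = new StreamReader(stream, Encoding.ASCII))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    // Only accept the line if the file already contains its CRLF terminator;
                    // otherwise it may still be half-written, so re-read it next time.
                    long lineEnd = _lastPosition + line.Length + 2;
                    if (lineEnd > stream.Length)
                        break;

                    handleLine(line);
                    _lastPosition = lineEnd;
                }
            }
        }
    }
}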
Why don't you just spin off a separate process / thread each time you start parsing - that way, you move the concurrent (on-the-fly) part away from the data source and towards your data sink - so now you just have to figure out how to collect the results from all your threads...
This will mean doing a reread of the whole file for each thread you spin up, though...
You could run a diff program on the two versions and pick up from there, depending on how well-formed the CSV data source is: does it modify records already written, or does it just append new records? If the latter, you can just split off the new stuff (last position to current EOF) into a new file and process it at leisure in a background thread (a sketch follows the list below):
polling thread remembers last file size
when file gets bigger: seek from last position to end, save to temp file
background thread processes any temp files still left, in order of creation/modification
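A rough sketch of the polling step (the temp-file naming is an illustrative assumption):

using System;
using System.IO;

class AppendSplitter
{
    private long _lastSize;      // file size remembered from the previous poll
    private int _tempIndex;

    public void Poll(string sourcePath, string tempDirectory)
    {
        long currentSize = new FileInfo(sourcePath).Length;
        if (currentSize <= _lastSize)
            return;                                    // nothing new since the last poll

        string tempPath = Path.Combine(tempDirectory, string.Format("chunk_{0:D6}.csv", _tempIndex++));

        using (var source = new FileStream(sourcePath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        using (var target = new FileStream(tempPath, FileMode.CreateNew, FileAccess.Write))
        {
            source.Seek(_lastSize, SeekOrigin.Begin);  // last position -> current EOF
            source.CopyTo(target);
            _lastSize = source.Position;               // remember how far we actually copied
        }
    }
}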
Almost all file transfer software (NetSupport, Radmin, PcAnywhere, ...), and also the different code I have used in my application, slows down the transfer speed when you send a lot of small files (< 1 KB), such as a game folder that contains a lot of files.
For example, on a LAN (Ethernet, Cat 5 cables), if I send a single file, say a video, the transfer rate is between 2 MB/s and 9 MB/s,
but when I send a game folder with a lot of files, the transfer rate is about 300-800 KB/s.
I guess it's because of the way a file is sent:
Send file info [file_path, file_size].
Send file bytes [loop till end of the file].
End transfer [ensure it was received completely].
But when you use the regular Windows copy-paste on a shared folder on the network, the transfer rate when sending a folder is always as fast as sending a single file.
So I'm trying to develop a file transfer application using a WCF service (C# 4.0) that would use the maximum speed available on the LAN, and I'm thinking of doing it this way:
Get all files from the folder.
if (FileSize < 1 MB)
{
    Create additional thread to send;
    SendFile(FilePath);
}
else
{
    Wait for the large file to be sent. // fileSize > 1 MB
}

void SendFile(string path) // a regular single-file send
{
    SendFileInfo;
    Open socket and wait for the server application to connect;
    SendFileBytes;
    Dispose;
}
But I'm confused about using more than one socket for a file transfer, because that will use more ports and more time (the delay of listening and accepting).
So is it a good idea to do this?
I need an explanation of whether it's possible, how to do it, and whether there is a protocol better suited to this than TCP.
Thanks in advance.
It should be noted you won't ever achieve 100% LAN speed usage - I'm hoping you're not hoping for that - there are too many factors there.
In response to your comment as well, you can't reach the same level that the OS uses to transfer files, because you're a lot further away from the bare metal than windows is. I believe file copying in Windows is only a layer or two above the drivers themselves (possibly even within the filesystem driver) - in a WCF service you're a lot further away!
The simplest thing for you to do will be to package multiple files into archives and transmit them that way, then at the receiving end you unpack the complete package into the target folder. Sure, some of those files might already be compressed and so won't benefit - but in general you should see a big improvement. For rock-solid compression in which you can preserve directory structure, I'd consider using SharpZipLib
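For illustration, a minimal sketch using SharpZipLib's FastZip helper (the paths and the null file filter, meaning "no filter", are assumptions; check the overloads in the version you use):

using ICSharpCode.SharpZipLib.Zip;

class FolderPackager
{
    public static string PackFolder(string sourceFolder, string zipPath)
    {
        var fastZip = new FastZip();
        fastZip.CreateZip(zipPath, sourceFolder, true, null);   // recurse, no file filter
        return zipPath;
    }

    public static void UnpackTo(string zipPath, string targetFolder)
    {
        var fastZip = new FastZip();
        fastZip.ExtractZip(zipPath, targetFolder, null);         // null = extract everything
    }
}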
A system that uses compression intelligently (probably medium-level, low CPU usage but which will work well on 'compressible' files) might match or possibly outperform OS copying. Windows doesn't use this method because it's hopeless for fault-tolerance. In the OS, a transfer halted half way through a file will still leave any successful files in place. If the transfer itself is compressed and interrupted, everything is lost and has to be started again.
Beyond that, you can consider the following:
Get it working using compression by default first, before trying any enhancements. In some cases (depending on the size and number of files) you might be able to simply compress the whole folder and transmit it in one go. Beyond a certain size, however, this might take too long, so you'll want to create a series of smaller zips.
Write the compressed file to a temporary location on disk as it's being received; don't buffer the whole thing in memory (see the sketch after this list). Delete the file once you've unpacked it into the target folder.
Consider adding the ability to mark certain file types as able to be sent 'naked', i.e. uncompressed. That way you can exclude .zip, .avi, etc. files from the compression process. That said, a folder with a million 1 KB zip files will clearly benefit from being packed into one single archive - so perhaps give yourself the ability to set a minimum size below which such files will still be packed into a compressed folder (or perhaps a file count/size-on-disk ratio for a folder itself - including sub-folders).
Beyond this advice you will need to play around to get the best results.
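As referenced in the second point above, a rough sketch of spooling the incoming compressed stream to a temporary file rather than buffering it in memory (UnpackTo is the hypothetical helper from the SharpZipLib sketch above; the incoming Stream would come from your WCF operation):

using System.IO;

static class PackageReceiver
{
    public static void ReceivePackage(Stream incoming, string targetFolder)
    {
        string tempPath = Path.GetTempFileName();

        using (var temp = new FileStream(tempPath, FileMode.Create, FileAccess.Write))
        {
            incoming.CopyTo(temp);       // write to disk as it arrives, not into memory
        }

        FolderPackager.UnpackTo(tempPath, targetFolder);
        File.Delete(tempPath);           // delete once it has been unpacked
    }
}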
Perhaps an easy solution would be to gather all the files into one big stream (like zipping them, but just appending them to keep it fast) and send that one stream. This would give more speed, but it will use up some CPU on both devices, and you'll need a good idea of how to separate the files in the stream again.
But using more ports would, from what I know, only be a disadvantage, since more streams would collide with each other and the speed would go down.
In my application, the user selects a big file (> 100 MB) on their drive. I wish for the program to then take the selected file and chop it up into archived parts of 100 MB or less. How can this be done? What libraries and file format should I use? Could you give me some sample code? After the first 100 MB archived part is created, I am going to upload it to a server, then upload the next 100 MB part, and so on until the upload is finished. After that, from another computer, I will download all these archived parts and then join them back into the original file. Is this possible with the 7zip libraries, for example? Thanks!
UPDATE: From the first answer, I think I'm going to use SevenZipSharp, and I believe I understand now how to split a file into 100 MB archived parts, but I still have two questions:
Is it possible to create the first 100mb archived part and upload it before creating the next 100mb part?
How do you extract a file with SevenZipSharp from multiple split archives?
UPDATE #2: I was just playing around with the 7-zip GUI and creating multi-volume/split archives, and I found that selecting the first one and extracting from it will extract the whole file from all of the split archives. This leads me to believe that the paths to the subsequent parts are included in the first one (or that it just looks for the consecutively numbered parts). However, I'm not sure whether this would also work directly from the console, but I will try that now and see if it solves question #2 from the first update.
Take a look at SevenZipSharp; you can use this to create your split 7z files, do whatever you want to upload them, then extract them on the server side.
To split the archive look at the SevenZipCompressor.CustomParameters member, passing in "v100m". (you can find more parameters in the 7-zip.chm file from 7zip)
You can split the data into 100MB "packets" first, and then pass each packet into the compressor in turn, pretending that they are just separate files.
However, this sort of compression is usually stream-based. As long as the library you are using will do its I/O via a Stream-derived class, it would be pretty simple to implement your own Stream that "packetises" the data any way you like on the fly - as data is passed into your Write() method you write it to a file. When you exceed 100MB in that file, you simply close that file and open a new one, and continue writing.
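A rough sketch of such a "packetising" write-only Stream (the 100 MB part size and the .001/.002 naming are illustrative assumptions):

using System;
using System.IO;

class SplittingStream : Stream
{
    private const long PartSize = 100L * 1024 * 1024;
    private readonly string _basePath;     // e.g. archive.7z -> archive.7z.001, archive.7z.002, ...
    private int _partIndex;
    private FileStream _current;

    public SplittingStream(string basePath)
    {
        _basePath = basePath;
        OpenNextPart();
    }

    public override void Write(byte[] buffer, int offset, int count)
    {
        while (count > 0)
        {
            if (_current.Length >= PartSize)
                OpenNextPart();                                   // roll over to the next part

            int toWrite = (int)Math.Min(count, PartSize - _current.Length);
            _current.Write(buffer, offset, toWrite);
            offset += toWrite;
            count -= toWrite;
        }
    }

    private void OpenNextPart()
    {
        if (_current != null) _current.Dispose();
        _partIndex++;
        _current = new FileStream(_basePath + "." + _partIndex.ToString("D3"), FileMode.Create);
    }

    // Minimum Stream plumbing for a write-only stream.
    public override bool CanRead  { get { return false; } }
    public override bool CanSeek  { get { return false; } }
    public override bool CanWrite { get { return true; } }
    public override long Length   { get { throw new NotSupportedException(); } }
    public override long Position { get { throw new NotSupportedException(); } set { throw new NotSupportedException(); } }
    public override void Flush()  { _current.Flush(); }
    public override int Read(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }

    protected override void Dispose(bool disposing)
    {
        if (disposing && _current != null) _current.Dispose();
        base.Dispose(disposing);
    }
}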
Either of these approaches would allow you to easily upload one "packet" while continuing to compress the next.
Edit:
Just to be clear - Decompression is just the reverse sequence of the above, so once you've got the compression code working, decompression will be easy.
I have a program that captures live data from a piece of hardware on a set interval. The data is returned as XML. There are several things I would like to do with this data (in order):
- display it to the user
- save it to disk
- eventually, upload it to a database
My current approach is to take the XML, parse it into a hashtable so I can display the correct values to the user.
Next, I want to save the XML to a file on disk. For each data capture session I am planning on creating a unique XML file and I will dump all the data into it.
Finally, I would like to reparse the XML and upload it to a MySQL database. The data cannot be immediately uploaded to the database.
This seems like a really inefficient method of solving this problem, and I would love some advice.
Is it a waste of hd space to save the data as XML?
Is it THAT inefficient to have to reparse the XML in order to write it to a database?
Thank you!
To clarify: a typical XML response is ~1 KB, and responses are captured at a rate of about one every 15-60 seconds.
I think I do want to store the XML as XML on the disk because the data is very valuable and a pain to reproduce (if it is even possible). Thank you!
When you receive a new XML document from the source, save it directly to disk and parse it to display to the user.
With a background process, or user-initiated, read the XML files from disk and send them to the server ordered by creation date (so you can pick up only the latest ones) for insertion into MySQL.
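A minimal sketch of that background step (the folder, the file pattern and the UploadToMySql routine are illustrative assumptions):

using System;
using System.IO;
using System.Linq;
using System.Xml.Linq;

static class PendingUploader
{
    public static void UploadPending(string captureFolder, DateTime lastUploaded)
    {
        var pending = new DirectoryInfo(captureFolder)
            .GetFiles("*.xml")
            .Where(f => f.CreationTimeUtc > lastUploaded)   // only the files not sent yet
            .OrderBy(f => f.CreationTimeUtc);               // oldest first

        foreach (var file in pending)
        {
            XDocument doc = XDocument.Load(file.FullName);  // re-parse the saved capture
            UploadToMySql(doc);                             // hypothetical upload routine
        }
    }

    static void UploadToMySql(XDocument doc)
    {
        // Insert the parsed values into the database here (e.g. via a MySqlConnection).
    }
}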
That mostly depends on the amount and rate of data you're moving. If you only upload to the database occasionally and it only takes a few seconds, the flexibility of XML is surely good to have. If you never use the locally stored data except to upload it to the database, and parsing takes a few minutes, you might want to rethink the strategy.
Perhaps there should be a separate thread for the data fetching and the processing.