I need to parse a large CSV file in real-time, while it's being modified (appended) by a different process. By large I mean ~20 GB at this point, and slowly growing. The application only needs to detect and report certain anomalies in the data stream, for which it only needs to store small state info (O(1) space).
I was thinking about polling the file's attributes (size) every couple of seconds, opening a read-only stream, seeking to the previous position, and then continuing to parse where I previously stopped. But since this is a text (CSV) file, I somehow need to keep track of newline characters when continuing, to ensure I always parse an entire line.
If I am not mistaken, this shouldn't be too hard to implement, but I wanted to know: is there a common approach or library that already solves some of these problems?
Note: I don't need a CSV parser. I need info about a library which simplifies reading lines from a file which is being modified on the fly.
I did not test it, but I think you can use a FileSystemWatcher to detect when a different process modifies your file. In the Changed event, you will be able to seek to the position you saved earlier and read the additional content.
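A minimal sketch of that idea (untested, as noted above); the watched path and file name are hypothetical:

```csharp
using System;
using System.IO;

class Program
{
    static void Main()
    {
        // Watch a single file for size/content changes (path is hypothetical).
        var watcher = new FileSystemWatcher(@"C:\data", "stream.csv")
        {
            NotifyFilter = NotifyFilters.Size | NotifyFilters.LastWrite
        };
        watcher.Changed += (sender, e) =>
        {
            // Seek to the previously saved offset and read the new content here.
            Console.WriteLine($"{e.FullPath} changed; reading appended data...");
        };
        watcher.EnableRaisingEvents = true;
        Console.ReadLine(); // keep the watcher alive
    }
}
```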
There is a small problem here:
Reading and parsing CSV requires a TextReader
Positioning doesn't work (well) with TextReaders.
First thought: keep it open. If both the producer and the analyzer open the file in non-exclusive (shared) mode, it should be possible to ReadLine-until-null, pause, ReadLine-until-null, etc.
"it should be 7-bit ASCII, just some Guids and numbers"
That makes it feasible to track the file Position (pos += line.Length + 2, assuming CRLF line endings). Do make sure you open it with Encoding.ASCII. You can then re-open it as a plain binary Stream, Seek to the last position, and only then attach a StreamReader to that stream.
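A rough sketch of that approach (my own illustration, not tested against a live writer). Note that a line still being appended may be read before its terminator arrives, so a robust version would verify the line ending before advancing:

```csharp
using System.IO;
using System.Text;

class TailReader
{
    long position; // byte offset of the first unread line

    public void ReadNewLines(string path)
    {
        // Re-open as a plain binary stream, seek, then attach the reader.
        using (var fs = new FileStream(path, FileMode.Open,
                                       FileAccess.Read, FileShare.ReadWrite))
        {
            fs.Seek(position, SeekOrigin.Begin);
            using (var reader = new StreamReader(fs, Encoding.ASCII))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    // 7-bit ASCII: one byte per char; +2 assumes CRLF.
                    position += line.Length + 2;
                    Process(line);
                }
            }
        }
    }

    void Process(string line) { /* anomaly detection goes here */ }
}
```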
Why don't you just spin off a separate process/thread each time you start parsing? That way, you move the concurrent (on-the-fly) part away from the data source and towards your data sink, so you only have to figure out how to collect the results from all your threads...
This will mean re-reading the whole file for each thread you spin up, though...
You could run a diff program on the two versions and pick up from there, depending on how well-behaved the CSV data source is: does it modify records already written, or does it only append new records? If it only appends, you can just split off the new stuff (last position to current EOF) into a new file and process it at leisure in a background thread:
polling thread remembers last file size
when file gets bigger: seek from last position to end, save to temp file
background thread processes any temp files still left, in order of creation/modification
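A minimal sketch of that polling loop, with hypothetical paths and a two-second interval:

```csharp
using System;
using System.IO;
using System.Threading;

class SplitOffPoller
{
    static void Main()
    {
        const string source = @"C:\data\stream.csv";   // hypothetical paths
        const string tempDir = @"C:\data\chunks";
        Directory.CreateDirectory(tempDir);
        long lastSize = 0;

        while (true)
        {
            long size = new FileInfo(source).Length;
            if (size > lastSize)
            {
                // Copy only the newly appended bytes into a timestamped chunk
                // file that the background thread can pick up later.
                string chunk = Path.Combine(tempDir, DateTime.UtcNow.Ticks + ".part");
                using (var input = new FileStream(source, FileMode.Open,
                                                  FileAccess.Read, FileShare.ReadWrite))
                using (var output = File.Create(chunk))
                {
                    input.Seek(lastSize, SeekOrigin.Begin);
                    input.CopyTo(output);
                    lastSize = input.Position; // bytes actually copied
                }
            }
            Thread.Sleep(2000); // poll every couple of seconds
        }
    }
}
```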
I'm writing a system for indexing the contents of files. The system monitors changes to files through FileSystemWatcher and, on each event, starts updating the index.
Obviously, if a file has changed, re-indexing the whole file is a bad solution, because the file can be large. The time spent re-indexing should depend on the number of changes, not on the file size.
The question is: how do I quickly determine which part of the file was changed?
Calculating the diff between file versions is too resource-intensive.
I thought of doing the following:
Split the file into parts of a few kilobytes each
Calculate a digest (hash) of each part (SHA-1 or MD5)
On the file-change event, quickly compare the digests of each part of the file
Re-index only the changed parts of the file.
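A minimal sketch of the hashing step (steps 1-2), assuming fixed 4 KB chunks; comparing the returned list against the previously stored one identifies the changed parts:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

static class ChunkHasher
{
    const int ChunkSize = 4096; // "a few kilobytes"

    // Returns one SHA-1 digest per fixed-size chunk of the file.
    public static List<byte[]> HashChunks(string path)
    {
        var digests = new List<byte[]>();
        using (var sha1 = SHA1.Create())
        using (var fs = File.OpenRead(path))
        {
            var buffer = new byte[ChunkSize];
            int read;
            while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
                digests.Add(sha1.ComputeHash(buffer, 0, read));
        }
        return digests;
    }
}
```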
There are 2 problems here:
The changed part (#N) may contain shifts. These shifts affect parts #N+1, #N+2, etc., which means I will get different hashes for all of those parts. If the file was changed at the beginning, the whole file will be re-indexed.
Hash functions have collisions, so matching digests do not guarantee that the original data has not changed.
Are there any ideas on how best to do this?
Maybe there is a way to get the modified file-system pages of the file?
I have to change a specific line of a text file in ASP.NET.
Can I change/replace the text in a particular line only?
I have used the Replace function on the text file, but it replaces text in the entire file.
I want to replace only the one line I specify.
Waiting for the reply.
Thanks in advance.
File systems don't generally allow you to edit within a file other than directly overwriting byte-by-byte. If your text file uses the same number of bytes for every line, then you can very efficiently replace a line of text - but that's a relatively rare case these days.
It's more likely that you'll need to take one of these options:
Load the whole file into memory using File.ReadAllLines, change the relevant line, and then write it out again using File.WriteAllLines. This is inefficient in terms of memory, but really simple to code. If your file is small, it's a good option.
Open the input file and a new output file. Read a line of text at a time from the input, and either copy it to the output or write a different line instead. Then close both files, delete the input file and rename the output file. This only requires a single line of text in memory at a time, but it's considerably more fiddly.
The second option has another benefit - you can shuffle the files around (using lots of rename steps) so that at no point do you ever have the possibility of losing the input file unless the output file is known to be complete and in the right place. That's even more complicated though.
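Hedged sketches of both options (the helper names and the zero-based line index are my own):

```csharp
using System.IO;

class LineReplacer
{
    // Option 1 (small files): read everything, change one line, write it back.
    public static void ReplaceInMemory(string path, int lineNumber, string newText)
    {
        var lines = File.ReadAllLines(path);
        lines[lineNumber] = newText;
        File.WriteAllLines(path, lines);
    }

    // Option 2: stream line by line into a temp file, then swap the files.
    public static void ReplaceStreaming(string path, int lineNumber, string newText)
    {
        string temp = path + ".tmp";
        using (var reader = new StreamReader(path))
        using (var writer = new StreamWriter(temp))
        {
            string line;
            int current = 0;
            while ((line = reader.ReadLine()) != null)
                writer.WriteLine(current++ == lineNumber ? newText : line);
        }
        File.Delete(path);
        File.Move(temp, path);
    }
}
```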
I'm trying to write skeleton data into a BVH file; for that I need to get the total number of frames and write it before the joint data, as the BVH file format requires.
The SensorSkeletonFrameReady handler gives me the frame number, but I'm using it to extract the joint data of each frame and write it directly into the BVH file.
Can anyone help me, please?
BVH files store the total number of frames in the file itself. It is impossible to know this number until you are done recording.
Using the SkeletonFrameReady event you could:
save the data to a List (or some other array type structure)
stop recording and count the number of frames (i.e., List items)
write your file, with the total frames
... or ...
output the file in real-time (as you indicate in your question), keeping a running total of the frame count
stop recording and close the file as best you can
re-open the file, seek to your "frames" line and enter the appropriate value you've stored
... or ...
output the skeleton tracking data in real-time
keep seeking back to the point in your file where the frames are defined and keep updating it, then seek back to the end to write the next frame.
I'm not really taking that last one too seriously. But it all comes down to the fact that you don't know the number of frames until you are done! You have to complete your recording first, before you output that line in the file.
Unless you're recording really long sessions, storing your data in a List and then writing the data file once you've stopped is the most straightforward approach, in my opinion.
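If you do go with the second option, here is a minimal sketch: write a fixed-width placeholder for the count, remember its offset, and patch it after recording. The "Frames:" line is part of the BVH format; the 10-character padding and the helper names are my own assumptions:

```csharp
using System.IO;
using System.Text;

class BvhFrameCounter
{
    long frameCountOffset; // where the padded count field starts

    public void WriteMotionHeader(StreamWriter writer)
    {
        writer.Write("MOTION\n");
        writer.Flush(); // make BaseStream.Position accurate
        frameCountOffset = writer.BaseStream.Position + "Frames: ".Length;
        // Fixed-width placeholder so the real value can't overrun the line.
        writer.Write("Frames: 0         \n");
    }

    public void PatchFrameCount(string path, int totalFrames)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Write))
        {
            fs.Seek(frameCountOffset, SeekOrigin.Begin);
            var bytes = Encoding.ASCII.GetBytes(totalFrames.ToString().PadRight(10));
            fs.Write(bytes, 0, bytes.Length);
        }
    }
}
```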
I have a binary data file that is written to from a live data stream, so it keeps growing as the stream comes in. Meanwhile, I need to open it at the same time in read-only mode to display data in my application (a time-series chart). Opening the whole file takes a few minutes, as it is pretty large (a few hundred megabytes).
What I would like to do is, rather than re-opening/reading the whole file every x seconds, read only the last data that was added to the file and append it to the data that was already read.
I would suggest using FileSystemWatcher to be notified of changes to the file. From there, cache information such as the size of the file between events and add some logic to only respond to full lines, etc. You can use the Seek() method of the FileStream class to jump to a particular point in the file and read only from there. I hope it helps.
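A rough sketch of the incremental read, assuming the writing process opened the file with sharing enabled:

```csharp
using System;
using System.IO;

class IncrementalReader
{
    long lastOffset; // how far we have read so far

    // Read only the bytes appended since the previous call.
    public byte[] ReadNewBytes(string path)
    {
        using (var fs = new FileStream(path, FileMode.Open,
                                       FileAccess.Read, FileShare.ReadWrite))
        {
            fs.Seek(lastOffset, SeekOrigin.Begin);
            var buffer = new byte[fs.Length - lastOffset];
            int read = fs.Read(buffer, 0, buffer.Length);
            lastOffset += read;
            Array.Resize(ref buffer, read); // keep only what was actually read
            return buffer; // append these bytes to the in-memory series
        }
    }
}
```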
If you control the writing of this file, I would split it into several files of a predefined size.
When the writer determines that the current file is larger than, say, 50MB, close it and immediately create a new file to write data to. The process writing this data should always know the current file to write received data to.
The reader thread/process would read all these files in order, jumping to the next file when the current file was read completely.
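A minimal sketch of the writer side, with an assumed 50 MB cap and a hypothetical naming scheme:

```csharp
using System;
using System.IO;

class RollingWriter : IDisposable
{
    const long MaxSize = 50L * 1024 * 1024; // 50 MB per file
    int index;
    FileStream current;

    public RollingWriter() { current = Open(); }

    FileStream Open() =>
        new FileStream($"data_{index:D6}.bin", FileMode.Create, FileAccess.Write);

    public void Write(byte[] data)
    {
        // Roll over to the next numbered file once the size cap is reached.
        if (current.Length >= MaxSize)
        {
            current.Dispose();
            index++;
            current = Open();
        }
        current.Write(data, 0, data.Length);
    }

    public void Dispose() => current.Dispose();
}
```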
You can probably use a FileSystemWatcher to monitor for changes in the file, like the example given here: Reading changes in a file in real-time using .NET.
But I'd suggest that you evaluate another solution involving a queue, such as RabbitMQ or Redis: any queue with a publisher-subscriber model. Then you'll just push the live data into the queue and have two different listeners (subscribers): one to save the data to the file, and the other to process the last-appended data. This way you can achieve more flexibility in distributing the application's load.
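A minimal sketch of the two-subscriber idea using the StackExchange.Redis client (the channel name, file name, and payload are made up):

```csharp
using System;
using System.IO;
using StackExchange.Redis;

class Program
{
    static void Main()
    {
        var redis = ConnectionMultiplexer.Connect("localhost");
        var sub = redis.GetSubscriber();

        // Subscriber 1: persist every sample to the file.
        sub.Subscribe("live-data", (channel, message) =>
            File.AppendAllText("data.log", message + Environment.NewLine));

        // Subscriber 2: process the freshly appended data (e.g., update the chart).
        sub.Subscribe("live-data", (channel, message) =>
            Console.WriteLine($"New sample: {message}"));

        // The producer publishes instead of writing to the file directly.
        sub.Publish("live-data", "42.7");
        Console.ReadLine();
    }
}
```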
In my application, the user selects a big file (>100 MB) on their drive. I want the program to take the selected file and chop it up into archived parts that are 100 MB or less. How can this be done? What libraries and file format should I use? Could you give me some sample code? After the first 100 MB archived part is created, I am going to upload it to a server, then I will upload the next 100 MB part, and so on until the upload is finished. After that, from another computer, I will download all these archived parts and connect them back into the original file. Is this possible with the 7-Zip libraries, for example? Thanks!
UPDATE: From the first answer, I think I'm going to use SevenZipSharp, and I believe I now understand how to split a file into 100 MB archived parts, but I still have two questions:
Is it possible to create the first 100 MB archived part and upload it before creating the next 100 MB part?
How do you extract a file with SevenZipSharp from multiple split archives?
UPDATE #2: I was just playing around with the 7-Zip GUI, creating multi-volume/split archives, and I found that selecting the first one and extracting from it will extract the whole file from all of the split archives. This leads me to believe that either the paths to the subsequent parts are stored in the first one, or the parts are simply found by their consecutive names. However, I'm not sure whether this would work directly from the console, but I will try that now and see if it solves question #2 from the first update.
Take a look at SevenZipSharp; you can use it to create your split 7z files, do whatever you want to upload them, then extract them on the server side.
To split the archive, look at the SevenZipCompressor.CustomParameters member, passing in "v100m". (You can find more parameters in the 7-zip.chm file from 7-Zip.)
You can split the data into 100MB "packets" first, and then pass each packet into the compressor in turn, pretending that they are just separate files.
However, this sort of compression is usually stream-based. As long as the library you are using does its I/O via a Stream-derived class, it would be pretty simple to implement your own Stream that "packetises" the data any way you like on the fly: as data is passed into your Write() method, you write it to a file. When you exceed 100 MB in that file, you simply close it, open a new one, and continue writing.
Either of these approaches would allow you to easily upload one "packet" while continuing to compress the next.
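A sketch of that second approach: a write-only Stream that transparently rolls to a new file every 100 MB, so each completed "packet" can be uploaded while compression continues (the class and naming scheme are my own illustration):

```csharp
using System;
using System.IO;

// Write-only stream that rolls to a new numbered file every 100 MB.
class PacketisingStream : Stream
{
    const long PacketSize = 100L * 1024 * 1024;
    readonly string baseName;
    FileStream current;
    int index;

    public PacketisingStream(string baseName)
    {
        this.baseName = baseName;
        current = File.Create($"{baseName}.{index:D3}");
    }

    public override void Write(byte[] buffer, int offset, int count)
    {
        while (count > 0)
        {
            if (current.Length >= PacketSize)
            {
                current.Dispose(); // this packet is complete; it can be uploaded now
                index++;
                current = File.Create($"{baseName}.{index:D3}");
            }
            int chunk = (int)Math.Min(count, PacketSize - current.Length);
            current.Write(buffer, offset, chunk);
            offset += chunk;
            count -= chunk;
        }
    }

    public override void Flush() => current.Flush();
    public override bool CanRead => false;
    public override bool CanSeek => false;
    public override bool CanWrite => true;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }
    public override int Read(byte[] b, int o, int c) => throw new NotSupportedException();
    public override long Seek(long o, SeekOrigin so) => throw new NotSupportedException();
    public override void SetLength(long v) => throw new NotSupportedException();

    protected override void Dispose(bool disposing)
    {
        if (disposing) current.Dispose();
        base.Dispose(disposing);
    }
}
```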
Edit:
Just to be clear: decompression is just the reverse of the above, so once you've got the compression code working, decompression will be easy.