I have a program that reads certain information from another program's memory. The program I am monitoring basically outputs a continuously expanding amount of text, and I would like my program to replicate this text in its UI. Syntax-wise, coding all of this was fine, but I am struggling with how to solve it logically, given the text's behavior in memory. Below is the information I have found in memory and how the string behaves.
This is the information I have:
A memory pointer to the start of the string segment
A memory pointer to the byte that follows after the last byte that was written
A memory pointer to an int that tells us how many bytes there are between the two above
Now, what I initially did was store how many bytes I had read, then, with a timer that fired every 2 seconds, just read the bytes between what I had last read and the end (as implied by pointer 3).
The above approach crashed after a random period of time, because after some number of bytes has been written, the string segment actually wraps and starts writing from the top again (the location pointer 1 points to), overwriting what was there before. Pointers 2 and 3 get updated when this happens, so that is fine, but I am still not able to figure out how I should solve it.
I have thought about this potential approach, but found it to be erroneous: when polling, before making any other calculations, check whether the number of bytes I have read so far is larger than the number of bytes indicated by pointer 3. If it is, set the number of bytes read to 0 and start from the top again. The problem with this approach is that I might miss something at the bottom of the string segment: something could be written at the bottom and then, before I have had the chance to spot it, something else could be written that makes the string wrap.
Any thoughts, insights, tips, or solutions would be greatly appreciated.
The easiest thing to do is to have the Reader process read the full contents of the shared memory every time, then display the entirety of the text in the UI. There is no need to have a "cursor".
The next easiest thing to do is to read the full contents of the shared memory, but have the Reader process keep a copy of the last batch of text it read. Then, after every read, it compares the new text to the cached text to determine whether anything changed. Still no need for cursors.
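A minimal sketch of that cached-comparison idea (ReadAllTargetBytes() and uiTextBox are hypothetical stand-ins for however you read the target process's memory and whatever UI control you display into):

using System;
using System.Text;

string lastSnapshot = string.Empty;

void OnPollTimerTick(object sender, EventArgs e)
{
    byte[] raw = ReadAllTargetBytes();              // read the full string segment on every poll
    string current = Encoding.ASCII.GetString(raw); // use whatever encoding the target actually writes

    if (current != lastSnapshot)                    // did anything change since the last poll?
    {
        uiTextBox.Text = current;                   // show the whole text again; no cursor bookkeeping
        lastSnapshot = current;
    }
}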
I would recommend using a more robust way to share data between the processes, if possible. Eventually, you will run into problems synchronizing the two processes. In the end you will write an inter-process communication framework. I think some simple implementation of an inter-process message queue would be ideal for this kind of application.
Related
So I have a large file which has ~2 million lines. File reading is the bottleneck in my code. Any suggested ways and expert opinions on reading the file faster are welcome. The order in which lines are read from the text file is unimportant. All lines are pipe ('|') separated fixed-length records.
What have I tried? I started parallel StreamReaders and made sure the resource was locked properly, but this approach failed: I now had multiple threads fighting to get hold of the single StreamReader and wasting more time on locking etc., thereby slowing the code down further.
One intuitive approach is to break the file apart and then read it, but I wish to leave the file intact and still somehow be able to read it faster.
I would try maximizing the buffer size. The default size is 1024 bytes; increasing this should increase performance. I would suggest trying other buffer size options.
StreamReader(Stream, Encoding, Boolean, Int32): Initializes a new instance of the StreamReader class for the specified stream, with the specified character encoding, byte order mark detection option, and buffer size.
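For example, a sketch with larger buffers (path is a placeholder, and 64 KB is just one value to try; tune it for your own workload):

using System.IO;
using System.Text;

// 64 KB buffers instead of the defaults; the fourth StreamReader argument is the buffer size.
using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read, 1 << 16))
using (var reader = new StreamReader(stream, Encoding.UTF8, true, 1 << 16))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        string[] fields = line.Split('|');   // pipe-separated fixed-length record
    }
}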
I now understand that my problem is not related to software; it is a 'mechanical' problem. Unless it is possible to change the hardware, there is no way to improve the reading performance. Why is that? There is only one head to read from the disk, so even if I try to read the file from both ends, for example, it is that same head that now has to move even more to serve the two threads. Hence it is wiser to let the reader read sequentially, and that is the maximum performance achievable.
Thank you all for the explanations; they helped me understand this concept. It may be a very basic and straightforward point for most people here on Stack Overflow, but from this question I really learned something about file reading and hardware performance, and finally understood things taught to me in college.
I'm developing a multi-segment file downloader. To accomplish this task I'm currently creating as many temporary files on disk as I have segments (their number is fixed for the duration of the download). In the end I just create a new file f and copy all the segments' contents into f.
I was wondering if there isn't a better way to accomplish this. My ideal is to initially create f at its full size and then have the different threads write directly into their own portions. There need not be any interaction between them: we can assume each of them will start at its own starting point in the file and then only fill in data sequentially until its task is over.
I've heard about Memory-Mapped files (http://msdn.microsoft.com/en-us/library/dd997372(v=vs.110).aspx) and I'm wondering if they are the solution to my problem or not.
Thanks
Using the memory-mapped API is absolutely doable and it will probably perform quite well - of course, some testing would be recommended.
If you want to look for a possible alternative implementation, I have the following suggestion.
Create a static stack data structure, where the download threads can push each file segment as soon as it's downloaded.
Have a separate thread listen for push notifications on the stack, pop the file segments off the stack, and save each segment into the target file in a single-threaded way.
By following the above pattern, you have separated downloading the file segments from saving them into a regular file, by putting a stack container in between.
Depending on the implementation of the stack handling, you will be able to implement this with very little thread locking, which will maximise performance.
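A rough sketch of that decoupling, using a BlockingCollection as the container between the download threads and a single writer thread (a BlockingCollection is FIFO by default rather than a literal stack, but the decoupling is the same; the Segment type and names are made up for illustration):

using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

class Segment { public long Offset; public byte[] Data; }

class SegmentWriter
{
    public BlockingCollection<Segment> Queue { get; } = new BlockingCollection<Segment>();

    // Download threads call: writer.Queue.Add(new Segment { Offset = start, Data = bytes });
    // When every download has finished: writer.Queue.CompleteAdding();

    public Task StartAsync(string targetPath)
    {
        return Task.Run(() =>
        {
            using (var output = new FileStream(targetPath, FileMode.OpenOrCreate, FileAccess.Write))
            {
                // Single-threaded consumer: pop segments and write each at its own offset.
                foreach (var segment in Queue.GetConsumingEnumerable())
                {
                    output.Position = segment.Offset;
                    output.Write(segment.Data, 0, segment.Data.Length);
                }
            }
        });
    }
}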
The pros of this are that you have 100% control over what is going on and a solution that might be more portable (if that should ever be a concern).
The stack decoupling pattern can also be implemented fairly generically and might even be reused in the future.
The implementation of this is not that complex, and probably on par with what would be needed around the memory-mapped API.
Have fun...
/Anders
The answers posted so far are, of course, addressing your question, but you should also consider the fact that multi-threaded I/O writes will most likely NOT give you any performance gains.
The reason for multi-threading downloads is obvious, and it has dramatic results. When you try to combine the files, though, remember that you are having multiple threads manipulate a mechanical head on conventional hard drives. In the case of SSDs you may see better performance.
If you use a single thread, you are already writing to the HDD SEQUENTIALLY at its full capacity, and that IS by definition the fastest way to write to conventional disks.
If you believe otherwise, I would be interested to know why. I would rather concentrate on tweaking the write performance of a single thread by playing around with buffer sizes, etc.
Yes, it is possible, but the only precaution you need to take is to ensure that no two threads write to the same location in the file; otherwise the file content will be incorrect.
// Open the destination file for writing; FileShare.Write lets the other segment threads open it too
FileStream writeStream = new FileStream(destinationPath, FileMode.OpenOrCreate, FileAccess.Write, FileShare.Write);
// Seek to this segment's own offset -- REMEMBER: this piece of calculation is important
writeStream.Position = startPositionOfSegments;
// Write the bytes for this segment: just read from your source and then write
writeStream.Write(ReadBytes, 0, bytesReadFromInputStream);
After each Write we call writeStream.Flush() so that buffered data gets written to the file, but you can change that according to your requirements.
Since you already have working code that downloads the file segments in parallel, the only change you need to make is to open the file stream as posted above, and instead of creating many segment files locally, just open a stream on the single target file.
startPositionOfSegments is very important; calculate it exactly so that no two segments write their downloaded bytes to the same location in the file, otherwise the result will be incorrect.
The above procedure works perfectly fine at our end, but this can be a problem if your segment size is too small (we faced it too, but after increasing the segment size it was fixed). If you face any exception, you can also synchronize only the Write part.
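For equally sized segments, the offset calculation can be as simple as the following sketch (segmentIndex, segmentLength and totalFileLength are illustrative names, not from the code above):

// Each segment writes at its own non-overlapping offset.
long startPositionOfSegments = (long)segmentIndex * segmentLength;

// The last segment may be shorter than the others.
long remaining = totalFileLength - startPositionOfSegments;
long thisSegmentLength = Math.Min(segmentLength, remaining);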
Let's say I have a 1 GB text file and I want to read it. If I try to open this file, I get a "Memory Overflow" error. I know the usual answer is "use the StreamReader.ReadLine() method", but I am wondering how this works. If the program using the ReadLine method wants to get a line, it will have to open the entire text file sooner or later. As far as I know, files are stored on disk and they can be opened in memory on an "all or nothing" principle. If only one line of my 1 GB text file is stored in memory at a time by using the ReadLine() method, this means we have to do disk I/O for every line of my 1 GB text file while reading it. Isn't that a terrible thing for performance?
I'm so confused and I want some details about this.
this means we have to do disk I/O for every line of my 1 GB text file
No, there are lots of layers between your ReadLine() call and the physical disk, designed to not make this a problem. The ones that matter most:
FileStream, the underlying class that does the job for StreamReader, uses a buffer to reduce the number of ReadFile() calls. Default size is 4096 bytes
ReadFile() reads file data from the file system cache, not the disk. That may result in a call to the disk driver, but that's not so common. The operating system is smart enough to guess that you are likely to read more data from the file and pre-reads it from the disk as long as that is cheap to do and RAM isn't being used for anything else. It typically slurps an entire disk cylinder worth of data.
The disk drive itself has a cache as well, usually several megabytes.
The file system cache is by far the most important one. It is also a tricky one, because it stops you from accurately profiling your program: when you run your test over and over again, your program in fact never reads from the disk, only from the cache, which makes it unrealistically fast. Albeit a 1 GB file might not quite fit; that depends on how much RAM the machine has.
Usually behind the scenes a FileStream object is opened which reads a large block of your file from disk and pulls it into memory. This block acts as a cache for ReadLine() to read from, so you don't have to worry about each ReadLine() causing a disk access.
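So a plain streaming loop like this keeps at most one buffered block plus the current line in memory, no matter how big the file is (path stands in for your 1 GB file):

using System.IO;

using (var reader = new StreamReader(path))   // backed by a buffered FileStream under the hood
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // handle one line at a time; the whole file is never loaded
    }
}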
Terrible thing for the performance of what?
Obviously it would be faster, given that you have enough memory available to hold the whole file in memory.
Finding and allocating a contiguous block is a cost though.
A gig is a significant block of RAM; if your process has it available, what's the harm?
Swapping could easily hurt more than streaming.
Do you need all of the file at once? Do you need it all the time?
If you needed read/write access, what would that do to you?
What if the file went to 2 gig?
You can optimise for one factor. Before you do, you've got to make sure it's the right one, and above all you have to remember this is a real machine. You have a finite amount of resources, so optimisation is always robbing Peter to pay Paul. Peter might get upset...
I have written a program which analyzes a project's source code and reports various issues and metrics based on the code.
To analyze the source code, I load the code files that exist in the project's directory structure and analyze the code from memory. The code goes through extensive processing before it is passed to other methods to be analyzed further.
The code is passed around to several classes when it is processed.
The other day I was running it on one of the larger projects my group has, and my program crapped out on me because there was too much source code loaded into memory. This is a corner case at this point, but I want to be able to handle this issue in the future.
What would be the best way to avoid memory issues?
I'm thinking about loading the code, doing the initial processing of the file, then serializing the results to disk, so that when I need to access them again I do not have to go through the process of manipulating the raw code again. Does this make sense? Or is serialization/deserialization more expensive than processing the code again?
I want to keep a reasonable level of performance while addressing this problem. Most of the time, the source code will fit into memory without issue, so is there a way to only "page" my information when I am low on memory? Is there a way to tell when my application is running low on memory?
Update:
The problem is not that a single file fills memory; it's that all of the files held in memory at once fill it. My current idea is to rotate them off to the disk drive as I process them.
1.6GB is still manageable and by itself should not cause memory problems. Inefficient string operations might do it.
As you parse the source code you probably split it apart into substrings - tokens or whatever you call them. If your tokens combined account for the entire source code, that doubles memory consumption right there. Depending on the complexity of the processing you do, the multiplier can be even bigger.
My first move here would be to have a closer look at how you use your strings and find a way to optimize that - e.g. discard the original after the first pass, compress the whitespace, or use indexes (pointers) into the original strings rather than actual substrings - there are a number of techniques that can be useful here.
If none of this helps, then I would resort to swapping them to and from the disk.
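To illustrate the index/pointer idea mentioned above, a minimal sketch (the names are made up): keep a single copy of each file's text and let tokens refer to it by position instead of owning their own substrings.

using System.Collections.Generic;

// A token is just (start, length) into the one source string.
struct Token
{
    public int Start;
    public int Length;
}

class TokenizedFile
{
    public string Source;            // the single copy of the file's text
    public List<Token> Tokens;       // cheap value types instead of duplicated substrings

    // Materialize a token's text only when it is actually needed.
    public string GetText(Token t) => Source.Substring(t.Start, t.Length);
}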
If the problem is that a single copy of your code is causing you to fill the available memory, then there are at least two options:
serialize to disk
compress files in memory. If you have a lot of CPU, it can be faster to zip and unzip information in memory instead of caching to disk (see the sketch below).
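For the second option, a sketch of in-memory compression with GZipStream (whether this beats going to disk depends on your CPU/IO balance):

using System.IO;
using System.IO.Compression;

static byte[] Compress(string text)
{
    using (var output = new MemoryStream())
    {
        using (var gzip = new GZipStream(output, CompressionMode.Compress))
        using (var writer = new StreamWriter(gzip))
        {
            writer.Write(text);               // source text goes in compressed
        }
        return output.ToArray();              // keep only the compressed bytes in memory
    }
}

static string Decompress(byte[] compressed)
{
    using (var input = new MemoryStream(compressed))
    using (var gzip = new GZipStream(input, CompressionMode.Decompress))
    using (var reader = new StreamReader(gzip))
    {
        return reader.ReadToEnd();            // inflate only when the file is needed again
    }
}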
You should also check if you are disposing of objects properly. Do you have memory problems due to old copies of objects being in memory?
Use WinDbg with SOS to see what is holding on to the string references (or whatever is causing the extreme memory usage).
Serializing/deserializing sounds like a good strategy. I've done a fair amount of this and it is very fast. In fact I have an app that instantiates objects from a DB and then serializes them to the hard drives of my web nodes. It has been a while since I benchmarked it, but it was serializing several hundred a second and maybe over 1k back when I was load testing.
Of course it will depend on the size of your code files. My files were fairly small.
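A small sketch of the round trip (System.Text.Json is used here purely as an example serializer, and AnalysisResult is a made-up stand-in for whatever your processed output type is):

using System.IO;
using System.Text.Json;

// After the expensive first pass, park the processed results on disk...
void SaveResults(string cachePath, AnalysisResult result)
{
    File.WriteAllText(cachePath, JsonSerializer.Serialize(result));
}

// ...and load them back later instead of re-processing the raw code.
AnalysisResult LoadResults(string cachePath)
{
    return JsonSerializer.Deserialize<AnalysisResult>(File.ReadAllText(cachePath));
}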
There are some text files (records) which I need to access using C#.NET. The problem is that those files are larger than 1 GB (the minimum size is 1 GB).
What do I need to do?
What are the factors I need to concentrate on?
Can someone give me an idea of how to overcome this situation?
EDIT:
Thanks for the fast responses. Yes, they are fixed-length records. These text files come from a local company (their last month's transaction records).
Is it possible to access these files like normal text files (using a normal file stream)?
and
What about memory management?
Expanding on CasperOne's answer
Simply put, there is no way to reliably put a 100 GB file into memory all at once. On a 32-bit machine there is simply not enough addressing space. On a 64-bit machine there is enough addressing space, but in the time it would take to actually get the file into memory, your user will have killed your process out of frustration.
The trick is to process the file incrementally. The base System.IO.Stream class is designed to process a variable (and possibly infinite) stream in distinct quantities. It has several Read methods that will only progress down a stream by a specific number of bytes. You will need to use these methods in order to divide up the stream.
I can't give more information because your scenario is not specific enough. Can you give us more details on your record delimiters, or some sample lines from the file?
Update
If they are fixed-length records then System.IO.Stream will work just fine. You can even use File.Open() to get access to the underlying Stream object. Stream.Read takes the number of bytes to read from the file. Since they are fixed-length records, this should work well for your scenario.
As long as you don't call File.ReadAllText() and instead use the Stream.Read() methods, which take explicit byte arrays, memory won't be an issue. The underlying Stream class will take care not to put the entire file into memory (unless, of course, you ask it to :) ).
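For example, a sketch of reading fixed-length records in chunks (the 200-byte record length, the path variable and ProcessRecord are made up for illustration):

using System.IO;

const int RecordLength = 200;                 // substitute your real fixed record length
byte[] record = new byte[RecordLength];

using (Stream stream = File.Open(path, FileMode.Open, FileAccess.Read))
{
    int read;
    // Read may return fewer bytes than asked for, so check the count it reports.
    while ((read = stream.Read(record, 0, RecordLength)) > 0)
    {
        ProcessRecord(record, read);          // only one record's worth of bytes is ever held here
    }
}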
You aren't specifically listing the problems you need to overcome. A file can be 100GB and you can have no problems processing it.
If you have to process the file as a whole then that is going to require some creative coding, but if you can simply process sections of the file at a time, then it is relatively easy to move to the location in the file you need to start from, process the data you need to process in chunks, and then close the file.
More information here would certainly be helpful.
What are the main problems you are having at the moment? The big thing to remember is to think in terms of streams - i.e. keep the minimum amount of data in memory that you can. LINQ is excellent at working with sequences (although there are some buffering operations you need to avoid, such as OrderBy).
For example, here's a way of handling simple records from a large file efficiently (note the iterator block).
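A sketch of what such an iterator block might look like for pipe-separated records (an illustration under that assumption, not the answer's original code):

using System.Collections.Generic;
using System.IO;

// Records are yielded one at a time, so the whole file is never in memory.
static IEnumerable<string[]> ReadRecords(string path)
{
    using (var reader = new StreamReader(path))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            yield return line.Split('|');
        }
    }
}

// Usage: foreach (string[] record in ReadRecords(path)) { /* handle one record */ }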
For performing multiple aggregates/analysis over large data from files, consider Push LINQ in MiscUtil.
Can you add more context to the problems you are thinking of?
Expanding on JaredPar's answer.
If the file is a binary file (i.e. ints stored as 4 bytes, fixed-length strings, etc.) you can use the BinaryReader class. That is easier than pulling out n bytes and then trying to interrogate them yourself.
Also note that the Read method on System.IO.Stream is not guaranteed to fill your buffer: if you ask for 100 bytes it may return fewer than that, even without having reached the end of the file.
The BinaryReader.ReadBytes method, by contrast, will block until it has read the requested number of bytes or reached end of file, whichever comes first.
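For example (a sketch; path and recordLength stand in for your actual file and fixed record length):

using System.IO;

using (var reader = new BinaryReader(File.OpenRead(path)))
{
    byte[] record;
    // ReadBytes blocks until it has recordLength bytes or hits end of file.
    while ((record = reader.ReadBytes(recordLength)).Length == recordLength)
    {
        // handle one complete record
    }
}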
Nice collaboration lads :)
Hey Guys, I realize that this post hasn't been touched in a while, but I just wanted to post a site that has the solution to your problem.
http://thedeveloperpage.wordpress.com/c-articles/using-file-streams-to-write-any-size-file-introduction/
Hope it helps!
-CJ