Out of memory exception reading "Large" file - c#

I'm trying to serialize an object into a string
The first problem I encountered was that the XMLSerializer.Serialize method threw an Out of memory exception, I've trying all kind of solutions and none worked so I serialized it into a file.
The file is about 300mb's (32 bit process, 8gb ram) and trying to read it with StreamReader.ReadToEnd also results in Out of memory exception.
The XML format and loading it on a string are not an option but a must.
The question is:
Any reason that a 300mb file will throw that kind of exception? 300mb is not really a large file.
Serialization code that fails on .Serialize
using (MemoryStream ms = new MemoryStream())
{
var type = obj.GetType();
if (!serializers.ContainsKey(type))
serializers.Add(type,new XmlSerializer(type));
// new XmlSerializer(obj.GetType()).Serialize(ms, obj);
serializers[type].Serialize(ms, obj);
ms.Position = 0;
using (StreamReader sr = new StreamReader(ms))
{
return sr.ReadToEnd();
}
}
Serialization and read from file that fails on ReadToEnd
var type = obj.GetType();
if (!serializers.ContainsKey(type))
serializers.Add(type,new XmlSerializer(type));
FileStream fs = new FileStream(#"c:/temp.xml", FileMode.Create);
TextWriter writer = new StreamWriter(fs, new UTF8Encoding());
serializers[type].Serialize(writer, obj);
writer.Close();
fs.Close();
using (StreamReader sr = new StreamReader(#"c:/temp.xml"))
{
return sr.ReadToEnd();
}
The object is large because its an elaborate system entire configuration object...
UPDATE:
Reading the file in chucks (8*1024 chars) will load the file into a StringBuilder but the builders fails on ToString().... starting to think there is no way which is really strange.

Yeah, if you're using 32-bit, trying to load 300MB in one chunk is going to be awkward, especially when using approaches that don't know the final size (number of characters, not bytes) in advance, thus have to keep doubling an internal buffer. And that is just when processing the string! It then needs to rip that into a DOM, which can often take several times as much space as the underlying data. And finally, you need to deserialize it into the actual objects, usually taking about the same again.
So - indeed, trying to do this in 32-bit will be tough.
The first thing to try is: don't use ReadToEnd - just use XmlReader.Create with either the file path or the FileStream, and let XmlReader worry about how to load the data. Don't load the contents for it.
After that... the next thing to do is: don't limit it to 32-bit.
Well, you could try enabling the 3GB switch, but... moving to 64-bit would be preferable.
Aside: xml is not a good choice for large volumes of data.

Exploring the source code for StreamReader.ReadToEnd reveals that it internally makes use of the StringBuilder.Append method:
public override String ReadToEnd()
{
if (stream == null)
__Error.ReaderClosed();
#if FEATURE_ASYNC_IO
CheckAsyncTaskInProgress();
#endif
// Call ReadBuffer, then pull data out of charBuffer.
StringBuilder sb = new StringBuilder(charLen - charPos);
do {
sb.Append(charBuffer, charPos, charLen - charPos);
charPos = charLen; // Note we consumed these characters
ReadBuffer();
} while (charLen > 0);
return sb.ToString();
}
which most probably throws this exception that leads to the this question/answer: interesting OutOfMemoryException with StringBuilder
al

Related

C# - Read bytes from file from a specific string

I'm trying to parse a crg-file in C#. The file is mixed with plain text and binary data. The first section of the file contains plain text while the rest of the file is binary (lots of floats), here's an example:
$
$ROAD_CRG
reference_line_start_u = 100
reference_line_end_u = 120
$
$KD_DEFINITION
#:KRBI
U:reference line u,m,730.000,0.010
D:reference line phi,rad
D:long section 1,m
D:long section 2,m
D:long section 3,m
...
$
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
�#z����RA����\�l
...
I know I can read bytes starting at a specific offset but how do I find out which byte to start from? The last row before the binary section will always contain at least four dollar signs "$$$$". Here's what I've got so far:
using var fs = new FileStream(#"crg_sample.crg", FileMode.Open, FileAccess.Read);
var startByte = ??; // How to find out where to start?
using (BinaryReader reader = new BinaryReader(fs))
{
reader.BaseStream.Seek(startByte, SeekOrigin.Begin);
var f = reader.ReadSingle();
Debug.WriteLine(f);
}
When you have a mixture of text data and binary data, you need to treat everything as binary. This means you should be using raw Stream access, or something similar, and using binary APIs to look through the text data (often looking for cr/lf/crlf at bytes as sentinels, although it sounds like in your case you could just look for the $$$$ using binary APIs, then decode the entire block before, and scan forwards). When you think you have an entire line, then you can use Encoding to parse each line - the most convenient API being encoding.GetString(). When you've finished looking through the text data as binary, then you can continue parsing the binary data, again using the binary API. I would usually recommend against BinaryReader here too, because frankly it doesn't gain you much over more direct API. The other problem you might want to think about is CPU endianness, but assuming that isn't a problem: BitConverter.ToSingle() may be your friend.
If the data is modest in size, you may find it easiest to use byte[] for the data; either via File.ReadAllBytes, or by renting an oversized byte[] from the array-pool, and loading it from a FileStream. The Stream API is awkward for this kind of scenario, because once you've looked at data: it has gone - so you need to maintain your own back-buffers. The pipelines API is ideal for this, when dealing with large data, but is an advanced topic.
UPDATE: This code may not work as expected. Please review the valuable information in the comments.
using (var fs = new FileStream(#"crg_sample.crg", FileMode.Open, FileAccess.Read))
{
using (StreamReader sr = new StreamReader(fs, Encoding.ASCII, true, 1, true))
{
var line = sr.ReadLine();
while (!string.IsNullOrWhiteSpace(line) && !line.Contains("$$$$"))
{
line = sr.ReadLine();
}
}
using (BinaryReader reader = new BinaryReader(fs))
{
// TODO: Start reading the binary data
}
}
Solution
I know this is far from the most optimized solution but in my case it did the trick and since the plain text section of the file was known to be fairly small this didn't cause any noticable performance issues. Here's the code:
using var fileStream = new FileStream(#"crg_sample.crg", FileMode.Open, FileAccess.Read);
using var reader = new BinaryReader(fileStream);
var newLine = '\n';
var markerString = "$$$$";
var currentString = "";
var foundMarker = false;
var foundNewLine = false;
while (!foundNewLine)
{
var c = reader.ReadChar();
if (!foundMarker)
{
currentString += c;
if (currentString.Length > markerString.Length)
currentString = currentString.Substring(1);
if (currentString == markerString)
foundMarker = true;
}
else
{
if (c == newLine)
foundNewLine = true;
}
}
if (foundNewLine)
{
// Read binary
}
Note: If you're dealing with larger or more complex files you should probably take a look at Mark Gravell's answer and the comment sections.

File Copy Program Doesn't Properly Copy File

Hello
I've been working on terminal-like application to get better at programming in c#, just something to help me learn. I've decided to add a feature that will copy a file exactly as it is, to a new file... It seems to work almost perfect. When opened in Notepad++ the file are only a few lines apart in length, and very, very, close to the same as far as actual file size goes. However, the duplicated copy of the file never runs. It says the file is corrupt. I have a feeling it's within the methods for reading and rewriting binary to files that I created. The code is as follows, thank for the help. Sorry for the spaghetti code too, I get a bit sloppy when I'm messing around with new ideas.
Class that handles the file copying/writing
using System;
using System.IO;
//using System.Collections.Generic;
namespace ConsoleFileExplorer
{
class FileTransfer
{
private BinaryWriter writer;
private BinaryReader reader;
private FileStream fsc; // file to be duplicated
private FileStream fsn; // new location of file
int[] fileData;
private string _file;
public FileTransfer(String file)
{
_file = file;
fsc = new FileStream(file, FileMode.Open);
reader = new BinaryReader(fsc);
}
// Reads all the original files data to an array of bytes
public byte[] ReadAllDataToArray()
{
byte[] bytes = reader.ReadBytes((int)fsc.Length); // reading bytes from the original file
return bytes;
}
// writes the array of original byte data to a new file
public void WriteDataFromArray(byte[] fileData, string path) // got a feeling this is the problem :p
{
fsn = new FileStream(path, FileMode.Create);
writer = new BinaryWriter(fsn);
int i = 0;
while(i < fileData.Length)
{
writer.Write(fileData[i]);
i++;
}
}
}
}
Code that interacts with this class .
(Sleep(5000) is because I was expecting an error on first attempt...
case '3':
Console.Write("Enter source file: ");
string sourceFile = Console.ReadLine();
if (sourceFile == "")
{
Console.Clear();
Console.ForegroundColor = ConsoleColor.DarkRed;
Console.Error.WriteLine("Must input a proper file path.\n");
Console.ForegroundColor = ConsoleColor.White;
Menu();
} else {
Console.WriteLine("Copying Data"); System.Threading.Thread.Sleep(5000);
FileTransfer trans = new FileTransfer(sourceFile);
//copying the original files data
byte[] data = trans.ReadAllDataToArray();
Console.Write("Enter Location to store data: ");
string newPath = Console.ReadLine();
// Just for me to make sure it doesnt exit if i forget
if(newPath == "")
{
Console.Clear();
Console.ForegroundColor = ConsoleColor.DarkRed;
Console.Error.WriteLine("Cannot have empty path.");
Console.ForegroundColor = ConsoleColor.White;
Menu();
} else
{
Console.WriteLine("Writing data to file"); System.Threading.Thread.Sleep(5000);
trans.WriteDataFromArray(data, newPath);
Console.WriteLine("File stored.");
Console.ReadLine();
Console.Clear();
Menu();
}
}
break;
File compared to new file
right-click -> open in new tab is probably a good idea
Original File
New File
You're not properly disposing the file streams and the binary writer. Both tend to buffer data (which is a good thing, especially when you're writing one byte at a time). Use using, and your problem should disappear. Unless somebody is editing the file while you're reading it, of course.
BinaryReader and BinaryWriter do not just write "raw data". They also add metadata as needed - they're designed for serialization and deserialization, rather than reading and writing bytes. Now, in the particular case of using ReadBytes and Write(byte[]) in particular, those are really just raw bytes; but there's not much point to use these classes just for that. Reading and writing bytes is the thing every Stream gives you - and that includes FileStreams. There's no reason to use BinaryReader/BinaryWriter here whatsover - the file streams give you everything you need.
A better approach would be to simply use
using (var fsn = ...)
{
fsn.Write(fileData, 0, fileData.Length);
}
or even just
File.WriteAllBytes(fileName, fileData);
Maybe you're thinking that writing a byte at a time is closer to "the metal", but that simply isn't the case. At no point during this does the CPU pass a byte at a time to the hard drive. Instead, the hard drive copies data directly from RAM, with no intervention from the CPU. And most hard drives still can't write (or read) arbitrary amounts of data from the physical media - instead, you're reading and writing whole sectors. If the system really did write a byte at a time, you'd just keep rewriting the same sector over and over again, just to write one more byte.
An even better approach would be to use the fact that you've got file streams open, and stream the files from source to destination rather than first reading everything into memory, and then writing it back to disk.
There is an File.Copy() Method in C#, you can see it here https://msdn.microsoft.com/ru-ru/library/c6cfw35a(v=vs.110).aspx
If you want to realize it by yourself, try to place a breakpoint inside your methods and use a debug. It is like a story about fisher and god, who gived a rod to fisher - to got a fish, not the exactly fish.
Also, look at you int[] fileData and byte[] fileData inside last method, maybe this is problem.

StringBuilder.ToString() throws OutOfMemoryException

I have a created a StringBuilder of length "132370292", when I try to get the string using the ToString() method it throws OutOfMemoryException.
StringBuilder SB = new StringBuilder();
for(int i =0; i<=5000; i++)
{
SB.Append("Some Junk Data for testing. My Actual Data is created from different sources by Appending to the String Builder.");
}
try
{
string str = SB.ToString(); // Throws OOM mostly
Console.WriteLine("String Created Successfully");
}
catch(OutOfMemoryException ex)
{
StreamWriter sw = new StreamWriter(#"c:\memo.txt", true);
sw.Write(SB.ToString()); //Always writes to the file without any error
Console.WriteLine("Written to File Successfully");
}
What is the reason for the OOM while creating a new string and why it doesn't throw OOM while writing to a file?
Machine Details: 64-bit, Windows-7, 2GB RAM, .NET version 2.0
What is the reason for the OOM while creating a new string
Because you're running out of memory - or at least, the CLR can't allocate an object with the size you've requested. It's really that simple. If you want to avoid the errors, don't try to create strings that don't fit into memory. Note that even if you have a lot of memory, and even if you're running a 64-bit CLR, there are limits to the size of objects that can be created.
and why it doesn't throw OOM while writing to a file ?
Because you have more disk space than memory.
I'm pretty sure the code isn't exactly as you're describing though. This line would fail to compile:
sw.write(SB.ToString());
... because the method is Write rather than write. And if you're actually calling SB.ToString(), then that's just as likely to fail as str = SB.ToString().
It seems more likely that you're actually writing to the file in a streaming fashion, e.g.
using (var writer = File.CreateText(...))
{
for (int i = 0; i < 5000; i++)
{
writer.Write(mytext);
}
}
That way you never need to have huge amounts of text in memory - it just writes it to disk as it goes, possibly with some buffering, but not enough to cause memory issues.
Workaround: Suppose you would want to write a big string stored in StringBuilder to a StreamWriter, I would do a write this way to avoid SB.ToString's OOM exception. But if your OOM exception is due to StringBuilder's content add itself, you should work on that.
public const int CHUNK_STRING_LENGTH = 30000;
while (SB.Length > CHUNK_STRING_LENGTH )
{
sw.Write(SB.ToString(0, CHUNK_STRING_LENGTH ));
SB.Remove(0, CHUNK_STRING_LENGTH );
}
sw.Write(SB);
You have to remember that strings in .NET are stored in memory in 16-bit unicode. This means string of length 132370292 will reqire 260MB of RAM.
Furthermore, while executing
string str = SB.ToString();
you are creating a COPY of your string (another 260MB).
Keep in mind that each process have its own RAM limit so OutOfMemoryException can be thrown even if you have some free RAM left.
Might help someone , if your logic needs large objects then you can change your application to 64bit and also
change your app.config by adding this section
<runtime>
<gcAllowVeryLargeObjects enabled="true" />
</runtime>
gcAllowVeryLargeObjects On 64-bit platforms, enables arrays that are greater than 2 gigabytes (GB) in total size.
String m_filename = "c:\temp\myfile.xml"
StreamWriter sw = new StreamWriter(m_filename);
while (sb.Length > 0)
{
int writelen = Math.Min(sb.Length, 30000);
sw.Write (sb.ToString(0, writelen));
sb.Remove (0,writelen);
}
sw.Flush();
sw.Close();
sw = null;

c# binary serialization to file line by line or how to seperate

I have a collection of objects at runtime, which is already serializable, I need to persist the state of the object to a file. I did a quick coding using BinaryFormatter and saved A serialized object to a file.
I was thinking that I can save object per line. but when i open the file in a notepad, it was longer than a line. It wasnt scrolling. How can i store an binary serialized object per line?
I am aware that i can use a separator after each object so while reading them back to the application, i can know the end of the object. Well, according to information theory, this increases the size of the data(Sipser book).
What s the best algorithm to come up with a separator that woudldnt break the information?
Instead of binary serialization? Do you think JSon format is more feasible? can i store the entity in a json format, line by line?
Also, serialization/deserialization introduces overhead, hits the performance. Would Json be faster?
ideas?
Thanks.
Thanks.
Serialization functions like a FIFO queue, you dont have to read parts of the file because the formatter does it for you you just have to know the order you pushed objects inside.
public class Test
{
public void testSerialize()
{
TestObj obj = new TestObj();
obj.str = "Some String";
IFormatter formatter = new BinaryFormatter();
Stream stream = new FileStream("MyFile.bin", FileMode.Create, FileAccess.Write, FileShare.None);
formatter.Serialize(stream, obj);
formatter.Serialize(stream, 1);
formatter.Serialize(stream, DateTime.Now);
stream.Close();
}
public void TestDeserialize()
{
Stream stream = new FileStream("MyFile.bin", FileMode.Open, FileAccess.Read, FileShare.None);
IFormatter formatter = new BinaryFormatter();
TestObj obj = (TestObj)formatter.Deserialize(stream);
int obj2 = (int)formatter.Deserialize(stream);
DateTime dt = (DateTime)formatter.Deserialize(stream);
stream.Close();
}
}
[Serializable]
class TestObj
{
public string str = "1";
int i = 2;
}
Well,
Serialization/deserialization introduces overhead, would Json be faster?
JSON is still a form of serialisation, and no it probably wouldn't be faster than binary serialisation - binary serialisation is intended to be compact and quick, wheras JSON serialisation puts more emphasis on readability and so many be slower as is very likely to be less compact.
You could serialise each object individually and emit some separator between each object (e.g. a newline character), but I don't know what separator you could use that is guarenteed to not appear in the serialised data (what happens if you serialise a string containing a newline character?).
If you use a separator that the .Net serialisation framework emits then obviously you will make it difficult (if not impossible) to correctly determine where the breaks between objects are leading to deserialisation failures.
Why exactly do you want to put each object on its own line?
Binary serialization saves the data to arbitrary bytes; these bytes can include newline characters.
You're asking to use newlines as separators. Newlines are no different from other separators; they will also increase the size of the data.
You could also create a ArrayList and add the objects to it and then serialize it ;)
ArrayList list = new ArrayList();
list.Add(1);
list.Add("Hello World");
list.Add(DateTime.Now);
BinaryFormatter bf = new BinaryFormatter();
FileStream fsout = new FileStream("file.dat", FileMode.Create);
bf.Serialize(fsout, list);
fsout.Close();
FileStream fsin = new FileStream("file.dat", FileMode.Open);
ArrayList list2 = (ArrayList)bf.Deserialize(fsin);
fsin.Close();
foreach (object o in list2)
Console.WriteLine(o.GetType());

Bytes consumed by StreamReader

Is there a way to know how many bytes of a stream have been used by StreamReader?
I have a project where we need to read a file that has a text header followed by the start of the binary data. My initial attempt to read this file was something like this:
private int _dataOffset;
void ReadHeader(string path)
{
using (FileStream stream = File.OpenRead(path))
{
StreamReader textReader = new StreamReader(stream);
do
{
string line = textReader.ReadLine();
handleHeaderLine(line);
} while(line != "DATA") // Yes, they used "DATA" to mark the end of the header
_dataOffset = stream.Position;
}
}
private byte[] ReadDataFrame(string path, int frameNum)
{
using (FileStream stream = File.OpenRead(path))
{
stream.Seek(_dataOffset + frameNum * cbFrame, SeekOrigin.Begin);
byte[] data = new byte[cbFrame];
stream.Read(data, 0, cbFrame);
return data;
}
return null;
}
The problem is that when I set _dataOffset to stream.Position, I get the position that the StreamReader has read to, not the end of the header. As soon as I thought about it this made sense, but I still need to be able to know where the end of the header is and I'm not sure if there's a way to do it and still take advantage of StreamReader.
You can find out how many bytes the StreamReader has actually returned (as opposed to read from the stream) in a number of ways, none of them too straightforward I'm afraid.
Get the result of textReader.CurrentEncoding.GetByteCount(totalLengthOfAllTextRead) and then seek to this position in the stream.
Use some reflection hackery to retrieve the value of the private variable of the StreamReader object that corresponds to the current byte position within the internal buffer (different from that with the stream - usually behind, but no more than equal to of course). Judging by .NET Reflector, the this variable seems to be named bytePos.
Don't bother using a StreamReader at all but instead implement your custom ReadLine function built on top of the Stream or BinaryReader even (BinaryReader is guaranteed never to read further ahead than what you request). This custom function must read from the stream char by char, so you'd actually have to use the low-level Decoder object (unless the encoding is ASCII/ANSI, in which case things are a bit simpler due to single-byte encoding).
Option 1 is going to be the least efficient I would imagine (since you're effectively re-encoding text you just decoded), and option 3 the hardest to implement, though perhaps the most elegant. I'd probably recommend against using the ugly reflection hack (option 2), even though it's looks tempting, being the most direct solution and only taking a couple of lines. (To be quite honest, the StreamReader class really ought to expose this variable via a public property, but alas it does not.) So in the end, it's up to you, but either method 1 or 3 should do the job nicely enough...
Hope that helps.
So the data is utf8 (the default encoding for StreamReader). This is a multibyte encoding, so IndexOf would be inadvisable. You could:
Encoding.UTF8.GetByteCount(string)
on your data so far, adding 1 or 2 bytes for the missing line ending.
If you're needing to count bytes, I'd go with the BinaryReader. You can take the results and cast them about as needed, but I find its idea of its current position to be more reliable (in that since it reads in binary, its immune to character-set problems).
So your last line contains 'DATA' + an unknown amount of data bytes. You could extract the position by using IndexOf() with your last read line. Then readjust the stream.Position.
But I am not sure if you should use ReadLine() at all in this case. Maybe it would be better to read byte by byte until you reach the 'DATA' mark.
The line breaks are easily identifiable without needing to decode the stream first (except for some encodings rarely used for text files like EBCDIC, UTF-16, UTF-32), so you can just read each line as bytes and then decode the entire line:
using (FileStream stream = File.OpenRead(path)) {
List<byte> buffer = new List<byte>();
bool hasCr = false;
bool done = false;
while (!done) {
int b = stream.ReadByte();
if (b == -1) throw new IOException("End of file reached in header.");
if (b == 13) {
hasCr = true;
} else if (b == 10 && hasCr) {
string line = Encoding.UTF8.GetString(buffer.ToArray(), 0, buffer.Count);
if (line == "DATA") {
done = true;
} else {
HandleHeaderLine(line);
}
buffer.Clear();
hasCr = false;
} else {
if (hasCr) buffer.Add(13);
hasCr = false;
buffer.Add((byte)b);
}
}
_dataOffset = stream.Position;
}
Instead of closing the stream and open it again, you could of course just keep on reading the data.

Categories