I want to read a CSV file that can be hundreds of GB, or even a TB, in size. I have a constraint that I can only read the file in 32 MB chunks. My solution to the problem not only works rather slowly, it can also break a line in the middle.
I wanted to ask if you know of a better solution:
const int MAX_BUFFER = 33554432; //32MB
byte[] buffer = new byte[MAX_BUFFER];
int bytesRead;

using (FileStream fs = File.Open(filePath, FileMode.Open, FileAccess.Read))
using (BufferedStream bs = new BufferedStream(fs))
{
    string line;
    while ((bytesRead = bs.Read(buffer, 0, MAX_BUFFER)) != 0) //reading only 32mb chunks at a time
    {
        var stream = new StreamReader(new MemoryStream(buffer, 0, bytesRead));
        while ((line = stream.ReadLine()) != null)
        {
            //process line
        }
    }
}
Please do not respond with a solution that reads the file line by line (for example, File.ReadLines is NOT an acceptable solution). Why? Because I'm just looking for another solution...
The problem with your solution is that you recreate the streams in each iteration. Try this version:
const int MAX_BUFFER = 33554432; //32MB
byte[] buffer = new byte[MAX_BUFFER];
int bytesRead;
StringBuilder currentLine = new StringBuilder();

using (FileStream fs = File.Open(filePath, FileMode.Open, FileAccess.Read))
using (BufferedStream bs = new BufferedStream(fs))
{
    string line;
    var memoryStream = new MemoryStream(buffer);
    var stream = new StreamReader(memoryStream);
    while ((bytesRead = bs.Read(buffer, 0, MAX_BUFFER)) != 0)
    {
        memoryStream.Seek(0, SeekOrigin.Begin);
        memoryStream.SetLength(bytesRead);  //don't re-read stale bytes in the final, shorter chunk
        stream.DiscardBufferedData();       //reset the reader's internal buffer after seeking
        while (!stream.EndOfStream)
        {
            line = ReadLineWithAccumulation(stream, currentLine);
            if (line != null)
            {
                //process line
            }
        }
    }
}
private char[] charBuffer = new char[1];

private string ReadLineWithAccumulation(StreamReader stream, StringBuilder currentLine)
{
    while (stream.Read(charBuffer, 0, 1) > 0)
    {
        if (charBuffer[0].Equals('\n'))
        {
            string result = currentLine.ToString();
            currentLine.Clear();
            if (result.Length > 0 && result.Last() == '\r') //strip the '\r' when newlines are "\r\n"
            {
                result = result.Substring(0, result.Length - 1);
            }
            return result;
        }
        else
        {
            currentLine.Append(charBuffer[0]);
        }
    }
    return null; //line not complete yet
}
NOTE: This needs some tweaking if newlines are two characters long and you need the newline characters to be contained in the result. The worst case would be the newline pair "\r\n" being split across two blocks, but since you were using ReadLine, I assumed that you don't need this.
Also, note that if your whole data contains only one line, this will still end up attempting to read the entire data into memory anyway.
"which can be at a size of hundreds of GBs and even TB"
For processing files this large, the most suitable class is the MemoryMappedFile class.
Some advantages:
- It is ideal for accessing a data file on disk without performing explicit file I/O operations and without buffering the file's content yourself. This works great when you deal with large data files.
- You can use memory-mapped files to allow multiple processes running on the same machine to share data with each other.
So try it and you will notice the difference, as swapping between memory and hard disk is a time-consuming operation.
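For what it's worth, a minimal sketch of walking a huge file through 32 MB views might look like the following; the file path, view size, and processing step are placeholders, and lines that straddle two views still need stitching across views, just as in the other answers:
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

class MmfChunkReader
{
    const long ViewSize = 32 * 1024 * 1024; // 32 MB window

    static void Main()
    {
        string path = "data.csv"; // hypothetical input file
        long fileSize = new FileInfo(path).Length;

        using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        {
            for (long offset = 0; offset < fileSize; offset += ViewSize)
            {
                long size = Math.Min(ViewSize, fileSize - offset);
                using (var view = mmf.CreateViewStream(offset, size, MemoryMappedFileAccess.Read))
                {
                    // copy this window out and process it; the view ends exactly
                    // at 'size', so this loop always terminates
                    var window = new byte[size];
                    int readTotal = 0;
                    while (readTotal < size)
                        readTotal += view.Read(window, readTotal, (int)size - readTotal);
                }
            }
        }
    }
}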
I need to copy the content of a one.xaml file into a byte clob. This is my code. It looks like I am not accessing the content of this file. Can anyone tell me why? I am new to the C# APIs, but I am a programmer. The choice of 4000 is because of the maximum string size restriction, just in case someone wonders. I might have bugs about sizes etc., but the main thing is that I want to get the content of the xaml file into the clob. Thanks.
string LoadedFileName = @"C:\temp2\one.xaml"; //Fd.FileName;
byte[] stringSizeClob = new byte[4000]; //4000-byte chunk buffer
byte[] clobByteTotal;

FileStream stream = new FileStream(LoadedFileName, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);

if (stream.Length % 2 >= 1)
{
    clobByteTotal = new byte[stream.Length + 1];
}
else clobByteTotal = new byte[stream.Length];

for (int i = 0; i <= stream.Length / 4000; i++)
{
    int x = (stream.Length / 4000 == 0) ? (int)stream.Length : 4000;
    stream.Read(stringSizeClob, i * 4000, x);
    String tempString1 = stringSizeClob.ToString();
    byte[] clobByteSection = Encoding.Unicode.GetBytes(stringSizeClob.ToString());
    Buffer.BlockCopy(clobByteSection, 0, clobByteTotal, i * clobByteSection.Length, clobByteSection.Length);
}
If you just need to read the content of a text file into a byte array, you can do this:
string xamlText = File.ReadAllText(LoadedFileName);
byte[] xamlBytes = Encoding.Unicode.GetBytes(xamlText); //if this is Unicode and not UTF8
//write byte data somewhere
This is a much shorter option, which is naturally suitable only for files that are not too big.
Any reason not to use File.ReadAllBytes?
byte[] xamlBytes = File.ReadAllBytes(path);
I'm using the following code to convert a hex string written in a txt file to a byte file. The problem is that it doesn't handle large txt files and I get an "out of memory" exception. I know that it should be done in "chunks" but I just can't get it right.
Please help! The code:
protected void Button1_Click(object sender, EventArgs e)
{
    string tempFileName = Server.MapPath("~\\Tempfolder\\" + FileUpload2.FileName);

    using (FileStream fs = new FileStream(tempFileName, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
    using (StreamReader sr = new StreamReader(fs))
    {
        string s = sr.ReadToEnd();
        if (s.Length % 2 == 1) { lblispis.Text = "String must have an even length"; }
        else
        {
            string hexString = s;
            File.WriteAllBytes(tempFileName + ".bin", StringToByteArray(hexString));
            lblispis.Text = "Done.";
        }
    }
}

public static byte[] StringToByteArray(String hex)
{
    int NumberChars = hex.Length;
    byte[] bytes = new byte[NumberChars / 2];
    for (int i = 0; i < NumberChars; i += 2)
        bytes[i / 2] = Convert.ToByte(hex.Substring(i, 2), 16);
    return bytes;
}
You could replace the ReadToEnd call with ReadLine and wrap it in a loop, if the file format allows that.
If that's not the case, there's always the option of reading an even number of characters (Read(char[], int, int)) until you hit the end of the file. Of course, that way you only detect an odd number of characters very late, after having done quite some work already.
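A rough sketch of that chunked approach, assuming the whole file is one long hex string with no line breaks; hexPath and binPath are placeholder names, and StringToByteArray is the helper from the question:
const int ChunkChars = 4096; // even, so a hex pair is never split across chunks
char[] chunk = new char[ChunkChars];

using (var reader = new StreamReader(hexPath))
using (var writer = new BinaryWriter(File.Open(binPath, FileMode.Create)))
{
    int read;
    while ((read = reader.ReadBlock(chunk, 0, ChunkChars)) > 0)
    {
        // only the final partial chunk can be odd; this is the "late detection" noted above
        if (read % 2 == 1)
            throw new InvalidDataException("Hex string has an odd number of characters.");
        writer.Write(StringToByteArray(new string(chunk, 0, read)));
    }
}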
To add to @Wormbo's answer, note that a hex string only contains twice as many characters as the resulting byte array. In .NET, the object size limit is 2GB (and 2GB is also the user address space limit for a 32-bit process), but you can easily have problems allocating even ~800MB contiguous blocks due to heap fragmentation.
In other words, you will want to write directly to disk, immediately after converting:
using (StreamReader reader = new StreamReader(hex))
using (BinaryWriter writer = new BinaryWriter(File.Open(bin, FileMode.Create)))
{
    string line;
    while ((line = reader.ReadLine()) != null)
        writer.Write(StringToByteArray(line));
}
[Edit]
I've fixed it, parentheses had to be added around the assignment (check the while statement above).
Note that this is only a shorthand for something like:
string line = reader.ReadLine();
while (line != null)
{
    writer.Write(...);
    line = reader.ReadLine();
}
I wrote the method below to archive files into one file in binary mode:
// Compile archive
public void CompileArchive(string FilePath, ListView FilesList, Label Status, ProgressBar Progress)
{
    FileTemplate TempFile = new FileTemplate();
    if (FilesList.Items.Count > 0)
    {
        BinaryWriter Writer = new BinaryWriter(File.Open(FilePath, FileMode.Create), System.Text.Encoding.ASCII);
        Progress.Maximum = FilesList.Items.Count - 1;
        Writer.Write((long)FilesList.Items.Count);
        for (int i = 0; i <= FilesList.Items.Count - 1; i++)
        {
            TempFile.Name = FilesList.Items[i].SubItems[1].Text;
            TempFile.Path = "%ARCHIVE%";
            TempFile.Data = this.ReadFileData(FilesList.Items[i].SubItems[2].Text + "\\" + TempFile.Name);
            Writer.Write(TempFile.Name);
            Writer.Write(TempFile.Path);
            Writer.Write(TempFile.Data);
            Status.Text = "Status: Writing '" + TempFile.Name + "'";
            Progress.Value = i;
        }
        Writer.Close();
        Status.Text = "Status: None";
        Progress.Value = 0;
    }
}
I read each file's data with ReadFileData (called in the method above), which returns the data as a string (using a StreamReader). Next I extract my archive. Everything appears to work, but the data stored by the extraction method is wrong, so the extracted files don't contain their original content and lose their functionality.
Extract method:
// Extract archive
public void ExtractArchive(string ArchivePath, string ExtractPath, ListView FilesList, Label Status, ProgressBar Progress)
{
    FileTemplate TempFile = new FileTemplate();
    BinaryReader Reader = new BinaryReader(File.Open(ArchivePath, FileMode.Open), System.Text.Encoding.ASCII);
    long Count = Reader.ReadInt64();
    if (Count > 0)
    {
        Progress.Maximum = (int)Count - 1;
        FilesList.Items.Clear();
        for (int i = 0; i <= Count - 1; i++)
        {
            TempFile.Name = Reader.ReadString();
            TempFile.Path = Reader.ReadString();
            TempFile.Data = Reader.ReadString();
            Status.Text = "Status: Reading '" + TempFile.Name + "'";
            Progress.Value = i;
            if (!Directory.Exists(ExtractPath))
            {
                Directory.CreateDirectory(ExtractPath);
            }
            BinaryWriter Writer = new BinaryWriter(File.Open(ExtractPath + "\\" + TempFile.Name, FileMode.Create), System.Text.Encoding.ASCII);
            Writer.Write(TempFile.Data);
            Writer.Close();
            string[] ItemArr = new string[] { i.ToString(), TempFile.Name, TempFile.Path };
            ListViewItem ListItem = new ListViewItem(ItemArr);
            FilesList.Items.Add(ListItem);
        }
        Reader.Close();
        Status.Text = "Status: None";
        Progress.Value = 0;
    }
}
The structure:
struct FileTemplate
{
    public string Name, Path, Data;
}
Thanks.
Consider using byte arrays to write and save the data.
Byte array (write):
Byte[] bytes = File.ReadAllBytes(..);
// Write it into your stream (assuming myStream is a BinaryWriter)
myStream.Write(bytes.Length);
myStream.Write(bytes, 0, bytes.Length);
Byte array (read):
// assuming myStream is a BinaryReader
Int32 byteCount = myStream.ReadInt32();
Byte[] bytes = new Byte[byteCount];
myStream.Read(bytes, 0, byteCount);
The example of an icon makes it clear; you are using string-based APIs to handle data that isn't strings (icons are not string-based). Moreover, you are using ASCII, so only characters in the 0-127 range would ever be correct. Basically, you can't do that. You need to handle binary data using binary methods (perhaps using the Stream API).
Other options:
- use serialization to store instances of objects with the data properties and a BLOB (byte[]) for the content
- use something like zip (maybe SharpZipLib), which essentially does something very similar
If your Data can be binary, then you shouldn't keep it in a string; it should be a byte[].
When you write a string using the ASCII encoding like you do, and try to write binary data, many of the bytes (treated as Unicode characters) can't be encoded, and so you end up with damaged data.
Moral of the story: never treat binary data as text.
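A minimal sketch of what a binary-safe layout could look like, assuming a FileTemplate-style struct whose Data field is a byte[]; the struct and method names here are illustrative, not from the original code:
using System.IO;

struct BinaryFileTemplate
{
    public string Name, Path;
    public byte[] Data;
}

static class ArchiveFormat
{
    // BinaryWriter length-prefixes strings itself; the byte[] gets an
    // explicit length prefix so it can be read back exactly.
    public static void WriteEntry(BinaryWriter writer, BinaryFileTemplate file)
    {
        writer.Write(file.Name);
        writer.Write(file.Path);
        writer.Write(file.Data.Length);
        writer.Write(file.Data);
    }

    public static BinaryFileTemplate ReadEntry(BinaryReader reader)
    {
        var file = new BinaryFileTemplate();
        file.Name = reader.ReadString();
        file.Path = reader.ReadString();
        file.Data = reader.ReadBytes(reader.ReadInt32());
        return file;
    }
}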
What is the best way to add text to the beginning of a file using C#?
I couldn't find a straightforward way to do this, but came up with a couple of workarounds:
1. Open up a new file, write the text that I want to add, then append the text from the old file to the end of the new file.
2. Since the text I want to add should be less than 200 characters, I could add whitespace characters to the beginning of the file and then overwrite them with the text I want to add.
Has anyone else come across this problem, and if so, what did you do?
This works for me, but only for small files; it's probably not a very good solution otherwise.
string currentContent = String.Empty;
if (File.Exists(filePath))
{
    currentContent = File.ReadAllText(filePath);
}
File.WriteAllText(filePath, newContent + currentContent);
Adding to the beginning of a file (prepending, as opposed to appending) is generally not a supported operation. Your #1 option is fine. If you can't write a temp file, you can pull the entire file into memory, prepend your data to the byte array and then overwrite it back out (this is only really feasible if your files are small and you don't have to have a bunch in memory at once, because prepending to the array is not necessarily easy without a copy either).
Yeah, basically you can use something like this:
public static void PrependString(string value, FileStream file)
{
    if (!file.CanWrite)
        throw new ArgumentException("The specified file cannot be written.", "file");

    // read the existing contents into memory
    var buffer = new byte[file.Length];
    int offset = 0, read;
    while ((read = file.Read(buffer, offset, buffer.Length - offset)) != 0)
    {
        offset += read;
    }

    file.Position = 0;
    var data = Encoding.Unicode.GetBytes(value);
    file.SetLength(buffer.Length + data.Length);
    file.Write(data, 0, data.Length);
    file.Write(buffer, 0, buffer.Length);
}

public static void Prepend(this FileStream file, string value)
{
    PrependString(value, file);
}
Then
using (var file = File.Open("yourtext.txt", FileMode.Open, FileAccess.ReadWrite))
{
    file.Prepend("Text you want to write.");
}
It's not really efficient for huge files, though.
Using two streams, you can do it in place, but keep in mind that this will still loop over the whole file on every addition:
using System;
using System.IO;
using System.Text;

namespace FilePrepender
{
    public class FilePrepender
    {
        private string file = null;

        public FilePrepender(string filePath)
        {
            file = filePath;
        }

        public void prependline(string line)
        {
            prepend(line + Environment.NewLine);
        }

        private void shiftSection(byte[] chunk, FileStream readStream, FileStream writeStream)
        {
            long initialOffsetRead = readStream.Position;
            long initialOffsetWrite = writeStream.Position;
            int offset = 0;
            int remaining = chunk.Length;
            do //ensure that the entire chunk length gets read and shifted
            {
                int read = readStream.Read(chunk, offset, remaining);
                offset += read;
                remaining -= read;
            } while (remaining > 0);
            writeStream.Write(chunk, 0, chunk.Length);
            writeStream.Seek(initialOffsetWrite, SeekOrigin.Begin);
            readStream.Seek(initialOffsetRead, SeekOrigin.Begin);
        }

        public void prepend(string text)
        {
            byte[] bytes = Encoding.Default.GetBytes(text);
            byte[] chunk = new byte[bytes.Length];
            using (FileStream readStream = File.Open(file, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
            {
                using (FileStream writeStream = File.Open(file, FileMode.OpenOrCreate, FileAccess.Write, FileShare.ReadWrite))
                {
                    readStream.Seek(0, SeekOrigin.End);             //seek to the end of the file,
                    writeStream.Seek(chunk.Length, SeekOrigin.End); //and chunk.Length past it, which lets the loop run without special cases
                    long size = readStream.Position;
                    //while there's a whole chunk's worth above the read head, shift the file contents down from the end
                    while (readStream.Position - chunk.Length >= 0)
                    {
                        readStream.Seek(-chunk.Length, SeekOrigin.Current);
                        writeStream.Seek(-chunk.Length, SeekOrigin.Current);
                        shiftSection(chunk, readStream, writeStream);
                    }
                    //clean up the remaining shift for the bytes that don't fit in size % chunk.Length
                    readStream.Seek(0, SeekOrigin.Begin);
                    writeStream.Seek(Math.Min(size, chunk.Length), SeekOrigin.Begin);
                    shiftSection(chunk, readStream, writeStream);
                    //finally, write the text you want to prepend
                    writeStream.Seek(0, SeekOrigin.Begin);
                    writeStream.Write(bytes, 0, bytes.Length);
                }
            }
        }
    }
}
I think the best way is to create a temp file: add your text, then read the contents of the original file, appending it to the temp file. Then you can overwrite the original with the temp file.
prepend:
private const string tempDirPath = @"c:\temp\log.log", tempDirNewPath = @"c:\temp\log.new";
StringBuilder sb = new StringBuilder();
...
File.WriteAllText(tempDirNewPath, sb.ToString());
File.AppendAllText(tempDirNewPath, File.ReadAllText(tempDirPath));
File.Delete(tempDirPath);
File.Move(tempDirNewPath, tempDirPath);

using (FileStream fs = File.OpenWrite(tempDirPath))
{
    //truncate to a reasonable length
    if (16384 < fs.Length) fs.SetLength(16384);
    fs.Close();
}
// The file we'll prepend to
string filePath = path + "\\log.log";
// A temp file we'll write to
string tempFilePath = path + "\\temp.log";

// 1) Write your prepended contents to a temp file.
using (var writer = new StreamWriter(tempFilePath, false))
{
    // Write whatever you want to prepend
    writer.WriteLine("Hi");
}

// 2) Use stream lib methods to append the original contents to the Temp file.
using (var oldFile = new FileStream(filePath, FileMode.OpenOrCreate, FileAccess.Read, FileShare.Read))
{
    using (var tempFile = new FileStream(tempFilePath, FileMode.Append, FileAccess.Write, FileShare.Read))
    {
        oldFile.CopyTo(tempFile);
    }
}

// 3) Finally, dump the Temp file back to the original, keeping all its
// original permissions etc.
File.Replace(tempFilePath, filePath, null);
Even if what you're writing is small, the temp file gets the entire original file appended to it before the .Replace(), so it does need to be on disk.
Note that this code is not thread-safe; if more than one thread accesses this code you can lose writes in the file swapping going on here. That said, it's also pretty expensive, so you'd want to gate access to it anyway: pass writes from multiple producer threads into a buffer, which periodically empties out via this prepend method on a single consumer thread.
You should be able to do this without opening a new file. Use the following File method:
public static FileStream Open(
    string path,
    FileMode mode,
    FileAccess access
)
Making sure to specify FileAccess.ReadWrite.
Using the FileStream returned from File.Open, read all of the existing data into memory. Then reset the pointer to the beginning of the file, write your new data, then write the existing data.
(If the file is big and/or you're suspicious of using too much memory, you can do this without having to read the whole file into memory, but implementing that is left as an exercise to the reader.)
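For illustration, a minimal sketch of that read-then-rewrite approach, assuming the file fits in memory and the prepended text should be UTF-8; filePath and the prepended string are placeholders:
using (var fs = File.Open(filePath, FileMode.Open, FileAccess.ReadWrite))
{
    // read all of the existing data into memory
    var existing = new byte[fs.Length];
    int offset = 0;
    while (offset < existing.Length)
        offset += fs.Read(existing, offset, existing.Length - offset);

    // reset the pointer, write the new data, then write the existing data after it
    var prefix = Encoding.UTF8.GetBytes("Text to prepend" + Environment.NewLine);
    fs.Position = 0;
    fs.Write(prefix, 0, prefix.Length);
    fs.Write(existing, 0, existing.Length);
}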
The following approach may solve the problem pretty easily; note, though, that it reads the whole file into memory, so it is only appropriate for files that fit comfortably in RAM:
string outPutFile = @"C:\Output.txt";
string result = "Some new string" + DateTime.Now.ToString() + Environment.NewLine;
StringBuilder currentContent = new StringBuilder();

List<string> rawList = File.ReadAllLines(outPutFile).ToList();
foreach (var item in rawList)
{
    currentContent.Append(item + Environment.NewLine);
}

File.WriteAllText(outPutFile, result + currentContent.ToString());
Use this class:
public static class File2
{
    private static readonly Encoding _defaultEncoding = new UTF8Encoding(false, true); // encoding used in File.ReadAll*()
    private static object _bufferSizeLock = new Object();
    private static int _bufferSize = 1024 * 1024; // 1 MB

    public static int BufferSize
    {
        get
        {
            lock (_bufferSizeLock)
            {
                return _bufferSize;
            }
        }
        set
        {
            lock (_bufferSizeLock)
            {
                _bufferSize = value;
            }
        }
    }

    public static void PrependAllLines(string path, IEnumerable<string> contents)
    {
        PrependAllLines(path, contents, _defaultEncoding);
    }

    public static void PrependAllLines(string path, IEnumerable<string> contents, Encoding encoding)
    {
        var temp = Path.GetTempFileName();
        File.WriteAllLines(temp, contents, encoding);
        AppendToTemp(path, temp, encoding);
        File.Replace(temp, path, null);
    }

    public static void PrependAllText(string path, string contents)
    {
        PrependAllText(path, contents, _defaultEncoding);
    }

    public static void PrependAllText(string path, string contents, Encoding encoding)
    {
        var temp = Path.GetTempFileName();
        File.WriteAllText(temp, contents, encoding);
        AppendToTemp(path, temp, encoding);
        File.Replace(temp, path, null);
    }

    private static void AppendToTemp(string path, string temp, Encoding encoding)
    {
        var bufferSize = BufferSize;
        char[] buffer = new char[bufferSize];

        using (var writer = new StreamWriter(temp, true, encoding))
        {
            using (var reader = new StreamReader(path, encoding))
            {
                int charsRead;
                while ((charsRead = reader.ReadBlock(buffer, 0, bufferSize)) != 0)
                {
                    writer.Write(buffer, 0, charsRead);
                }
            }
        }
    }
}
Put the file's contents in a string. Prepend the new data you want to add to that string: string = newdata + string. Then move the seek position of the file to 0 and write the string into the file.
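A short sketch of that description, assuming the file is small enough to hold in memory; path and newData are placeholder names:
string contents = File.ReadAllText(path);

using (var fs = new FileStream(path, FileMode.Open, FileAccess.Write))
using (var writer = new StreamWriter(fs))
{
    fs.Seek(0, SeekOrigin.Begin);     // move the seek position to 0
    writer.Write(newData + contents); // the new data followed by the old contents
}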