I am trying to read from a large text file with one word on each line and put all the values into an SQL database. With a small text file this works fine, but with a larger text file, say 300,000 lines, I run out of memory.
What is the best way to avoid this? Is there a way to read only a portion of the file, add it to the database, then release it from memory and move on to the next portion?
Here is my code so far:
string path = Server.MapPath("~/content/wordlist.txt");
StreamReader word_stream = new StreamReader(path);
string wordlist = word_stream.ReadToEnd();
string[] all_words = wordlist.Split(new string[] { Environment.NewLine }, StringSplitOptions.None);
I then loop through the array adding each value to the database, but when the file is too large it simply doesn't work.
Do it like this:
// Choose the size of the buffer according
// to your requirements and/or available memory.
int bufferSize = 256 * 1024 * 1024;
string path = Server.MapPath("~/content/wordlist.txt");
using (FileStream stream = new FileStream(path, FileMode.Open, FileAccess.Read))
using (BufferedStream bufferedStream = new BufferedStream(stream, bufferSize))
using (StreamReader reader = new StreamReader(bufferedStream))
{
while (!reader.EndOfStream)
{
string line = reader.ReadLine();
// ... put line into DB ...
}
}
Also, do not forget exception handling.
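For example, here is a minimal sketch of the same loop with basic exception handling; InsertWord is a hypothetical helper standing in for your actual database insert:

try
{
    using (StreamReader reader = new StreamReader(path))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            InsertWord(line); // hypothetical helper that writes one word to the DB
        }
    }
}
catch (IOException ex)
{
    // Log and handle as appropriate for your application
    Console.Error.WriteLine(ex.Message);
}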
Try it with yield return:

static IEnumerable<string> ReadLines(string path)
{
    using (StreamReader r = new StreamReader(path))
    {
        while (!r.EndOfStream)
        {
            string line = r.ReadLine();
            yield return line;
        }
    }
}
You could also read ten lines at a time, yield return them as a batch, write them to the database, and then move on to the next portion.
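A minimal sketch of that batching idea (the batch size of ten is arbitrary, ReadBatches is a hypothetical name, and the usual System.Collections.Generic and System.IO imports are assumed):

static IEnumerable<List<string>> ReadBatches(string path, int batchSize = 10)
{
    using (var reader = new StreamReader(path))
    {
        var batch = new List<string>(batchSize);
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            batch.Add(line);
            if (batch.Count == batchSize)
            {
                yield return batch; // the caller writes this batch to the database
                batch = new List<string>(batchSize);
            }
        }
        if (batch.Count > 0)
        {
            yield return batch; // flush the final partial batch
        }
    }
}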
I have a really big file with around 30,000 rows. I have to parse this file and cannot delete entries from it. So my idea is to skip the already-read lines. I tried something like this:
//Gets the number of already-read lines
int readLines = GetCurrentCounter();
//Open File
FileStream stream = new FileStream(LogDatabasePath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
using (StreamReader reader = new StreamReader(stream))
{
int counter = 0;
string line;
//If the file was already read to a specified line, skip those lines
if (readLines != 0) reader.ReadLine().Skip(readLines);
//Check if new lines are available
while ((line = reader.ReadLine()) != null)
{
if (counter >= readLines)
{
//If the line contains the searched test-system name
if (line.Contains(TestSystemName.ToUpper()))
{
//Create new Database-Entry
new TestsystemError().GenerateNewDatabaseEntry(line, counter);
}
}
System.Console.WriteLine(line);
counter++;
}
}
The problem is that reader.ReadLine().Skip(readLines) has no effect, or I am using it the wrong way.
I need a way to skip lines without calling reader.ReadLine() for each of them, because that is very slow (I get performance problems when I have to iterate through all ~30,000 lines).
Is there a way to skip lines? If so, it would be great if you could share code. Thanks.
The method reader.ReadLine() returns a string.
The extension method Skip(readLines) iterates that string and returns an iterator which skips the first readLines characters of the string.
This has no effect on the reader.
If you want to skip the first n lines, either read the first n lines by calling reader.ReadLine() n times, or read the stream until you have read in n end-of-line character sequences before creating the reader. The latter approach avoids creating strings for the lines you want to ignore, but is more code.
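For example, a minimal sketch of the first approach, reading and discarding the lines to skip:

// Skip the first readLines lines by reading and discarding them
for (int i = 0; i < readLines; i++)
{
    if (reader.ReadLine() == null)
    {
        break; // reached end of file before skipping all lines
    }
}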
If you happen to have extremely regular data so that all the rows are the same length, then you can seek the stream past them before you create the reader:
FileStream stream = new FileStream(LogDatabasePath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
stream.Seek(readLines * lengthOfRowInBytes, SeekOrigin.Begin);
using (StreamReader reader = new StreamReader(stream))
// etc
If you have the row number encoded in the row, you could also do a binary search, but that's more code.
Instead of keeping track of the number of lines, keep track of the number of characters read. Then you can use stream.Seek() to quickly skip to the last read position instead of iterating through the whole file every time.
long currentPosition = GetCurrentPosition();
//Open File
FileStream stream = new FileStream(LogDatabasePath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
using (StreamReader reader = new StreamReader(stream))
{
string line;
// Seek to the previously read position
stream.Seek(currentPosition, SeekOrigin.Begin);
//Check if new lines are available
while ((line = reader.ReadLine()) != null)
{
// do stuff with the line
// ...
Console.WriteLine(line);
// keep track of the current character position
currentPosition += line.Length + 2; // add 2 for "\r\n" (assumes CRLF line endings and a single-byte encoding)
}
}
SaveCurrentPosition(currentPosition);
You should skip the lines as you read them:

//If the file was already read to a specified line, skip those lines first
int skipped = 0;
while (skipped < readLines && reader.ReadLine() != null)
{
    skipped++;
}
//Check if new lines are available
while ((line = reader.ReadLine()) != null)
I have a website with many large CSV files (up to 100,000 lines each). From each CSV file, I need to read the last line in the file. I know how to solve the problem when I save the file on disk before reading its content:
var url = "http://data.cocorahs.org/cocorahs/export/exportreports.aspx?ReportType=Daily&Format=csv&Date=1/1/2000&Station=UT-UT-24";
var client = new System.Net.WebClient();
var tempFile = System.IO.Path.GetTempFileName();
client.DownloadFile(url, tempFile);
var lastLine = System.IO.File.ReadLines(tempFile).Last();
Is there any way to get the last line without saving a temporary file on disk?
I tried:
using (var stream = client.OpenRead(seriesUrl))
{
using (var reader = new StreamReader(stream))
{
var lastLine = reader.ReadLines("file.txt").Last();
}
}
but the StreamReader class does not have a ReadLines method ...
StreamReader does not have a ReadLines method, but it does have a ReadLine method to read the next line from the stream. You can use it to read the last line from the remote resource like this:
using (var stream = client.OpenRead(seriesUrl))
{
using (var reader = new StreamReader(stream))
{
string lastLine;
while ((lastLine = reader.ReadLine()) != null)
{
// Do nothing...
}
// lastLine now contains the very last line from reader
}
}
Reading one line at a time with ReadLine will use less memory compared to StreamReader.ReadToEnd, which will read the entire stream into memory as a string. For CSV files with 100,000 lines this could be a significant amount of memory.
This worked for me, though the service returned no data rows (only the CSV headers):
public void TestMethod1()
{
var url = "http://data.cocorahs.org/cocorahs/export/exportreports.aspx?ReportType=Daily&Format=csv&Date=1/1/2000&Station=UT-UT-24";
var client = new System.Net.WebClient();
using (var stream = client.OpenRead(url))
{
using (var reader = new StreamReader(stream))
{
var str = reader.ReadToEnd().Split('\n').Where(x => !string.IsNullOrEmpty(x)).LastOrDefault();
Debug.WriteLine(str);
Assert.IsNotEmpty(str);
}
}
}
I am reading text files into my program (they are encoded in Unicode; the output must be in UTF-8). The code below works fine for smaller files (around 150 lines, where each line is one word only); however, when I use it on bigger files (like 20,000 lines, still only one word per line), the program takes around half a minute to complete its task. Should I write new code, or is there a way to optimize this?
int next;
string storage = "";
using (StreamReader sr = new StreamReader(path))
{
while( (next = sr.Read()) != -1 )
{
storage += Char.ConvertFromUtf32(next);
}
sr.Close();
}
Use StringBuilder instead of String:
int next;
StringBuilder storage = new StringBuilder();
using (StreamReader sr = new StreamReader(path))
{
    while ((next = sr.Read()) != -1)
    {
        storage.Append(Char.ConvertFromUtf32(next));
    }
}
string result = storage.ToString();
So, everything started working really smoothly when I used a different StreamReader constructor:
using (StreamReader sr = new StreamReader(path, Encoding.Unicode))
This let me read properly formatted strings instead of ints indicating characters, which improved the speed a lot.
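A minimal sketch of the combined fix (assuming the input really is UTF-16, which is what Encoding.Unicode means): read whole lines with the correct encoding and accumulate them in a StringBuilder:

var storage = new StringBuilder();
using (var sr = new StreamReader(path, Encoding.Unicode))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        storage.AppendLine(line); // one word per line in the input
    }
}
string result = storage.ToString();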
I want to read a CSV file which can be hundreds of GBs or even TBs in size. I have a limitation that I can only read the file in chunks of 32MB. My current solution not only runs somewhat slowly, it can also break a line in the middle.
I wanted to ask if you know of a better solution:
const int MAX_BUFFER = 33554432; //32MB
byte[] buffer = new byte[MAX_BUFFER];
int bytesRead;
using (FileStream fs = File.Open(filePath, FileMode.Open, FileAccess.Read))
using (BufferedStream bs = new BufferedStream(fs))
{
string line;
bool stop = false;
while ((bytesRead = bs.Read(buffer, 0, MAX_BUFFER)) != 0) //reading only 32mb chunks at a time
{
var stream = new StreamReader(new MemoryStream(buffer));
while ((line = stream.ReadLine()) != null)
{
//process line
}
}
}
Please do not respond with a solution which reads the file line by line (for example File.ReadLines is NOT an acceptable solution). Why? Because I'm just searching for another solution...
The problem with your solution is that you recreate the streams in each iteration. Try this version:
const int MAX_BUFFER = 33554432; //32MB
byte[] buffer = new byte[MAX_BUFFER];
int bytesRead;
StringBuilder currentLine = new StringBuilder();
using (FileStream fs = File.Open(filePath, FileMode.Open, FileAccess.Read))
using (BufferedStream bs = new BufferedStream(fs))
{
    string line;
    var memoryStream = new MemoryStream(buffer);
    var stream = new StreamReader(memoryStream);
    while ((bytesRead = bs.Read(buffer, 0, MAX_BUFFER)) != 0)
    {
        memoryStream.SetLength(bytesRead); // only the freshly read bytes are valid
        memoryStream.Seek(0, SeekOrigin.Begin);
        stream.DiscardBufferedData(); // drop characters buffered from the previous chunk
        while (!stream.EndOfStream)
        {
            line = ReadLineWithAccumulation(stream, currentLine);
            if (line != null)
            {
                //process line
            }
        }
    }
}
private string ReadLineWithAccumulation(StreamReader stream, StringBuilder currentLine)
{
    while (stream.Read(charBuffer, 0, 1) > 0) // read one character at a time
    {
        if (charBuffer[0].Equals('\n'))
        {
            string result = currentLine.ToString();
            currentLine.Clear();
            if (result.Length > 0 && result[result.Length - 1] == '\r') // strip the '\r' of a "\r\n" pair
            {
                result = result.Substring(0, result.Length - 1);
            }
            return result;
        }
        else
        {
            currentLine.Append(charBuffer[0]);
        }
    }
    return null; //line not complete yet
}

private char[] charBuffer = new char[1];
NOTE: This needs some tweaking if newlines are two characters long and you need the newline characters to be contained in the result. The worst case is the newline pair "\r\n" being split across two blocks. However, since you were using ReadLine, I assumed you don't need this.
Also, be aware that if your whole data contains only one line, this still ends up attempting to read the entire data into memory anyway.
which can be at a size of hundreds of GBs and even TB
For processing files this large, the most suitable class is the MemoryMappedFile class.
Some advantages:
It is ideal for accessing a data file on disk without performing file I/O operations and without buffering the file's content. This works great when you deal with large data files.
You can use memory mapped files to allow multiple processes running on the same machine to share data with each other.
So try it and you will notice the difference, since swapping between memory and the hard disk is a time-consuming operation.
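A minimal sketch of that idea, mapping one 32MB window of the file at a time (filePath and the 32MB limit are taken from the question, the System.IO.MemoryMappedFiles namespace is required, and reassembling lines that span window boundaries would still need the accumulation technique shown above):

long fileLength = new FileInfo(filePath).Length;
const int WINDOW = 32 * 1024 * 1024; //32MB
using (var mmf = MemoryMappedFile.CreateFromFile(filePath, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
{
    for (long offset = 0; offset < fileLength; offset += WINDOW)
    {
        long size = Math.Min(WINDOW, fileLength - offset);
        // Map only this window; the file is never copied into managed memory as a whole
        using (var view = mmf.CreateViewStream(offset, size, MemoryMappedFileAccess.Read))
        {
            // process the bytes of this window
        }
    }
}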
I need to get just the last line from a big log file. What is the best way to do that?
You want to read the file backwards using ReverseLineReader:
How to read a text file reversely with iterator in C#
Then call .First() on it.
var lines = new ReverseLineReader(filename);
var last = lines.First();
You'll want to use Jon Skeet's library MiscUtil directly rather than copying/pasting the code.
String lastline="";
String filedata;
// Open file to read
var fullfiledata = new FileStream(filepath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
StreamReader sr = new StreamReader(fullfiledata);
//long offset = sr.BaseStream.Length - ((sr.BaseStream.Length * lengthWeNeed) / 100);
// Assuming a line doesnt have more than 500 characters, else use above formula
long offset = sr.BaseStream.Length - 500;
//directly move the last 500th position
sr.BaseStream.Seek(offset, SeekOrigin.Begin);
//From there read lines, not whole file
while (!sr.EndOfStream)
{
filedata = sr.ReadLine();
// Interate to see last line
if (sr.Peek() == -1)
{
lastline = filedata;
}
}
return lastline;
}
Or you can do it in two lines (.NET 4 only):
var lines = File.ReadLines(path);
string line = lines.Last();