Get Estimate of Line Count in a text file - c#

I would like to get an estimate of the number of lines in a csv/text file so that I can use that number for a progress bar. The file could be extremely large so getting the exact number of lines will take too long for this purpose.
What I have come up with is below (read in a portion of the file and count the number of lines and use the file size to estimate the total number of lines):
public static int GetLineCountEstimate(string file)
{
double count = 0;
using (var fs = new FileStream(file, FileMode.Open, FileAccess.Read))
{
long byteCount = fs.Length;
int maxByteCount = 524288;
if (byteCount > maxByteCount)
{
var buf = new byte[maxByteCount];
fs.Read(buf, 0, maxByteCount);
string s = System.Text.Encoding.UTF8.GetString(buf, 0, buf.Length);
count = s.Split('\n').Length * byteCount / maxByteCount;
}
else
{
var buf = new byte[byteCount];
fs.Read(buf, 0, (int)byteCount);
string s = System.Text.Encoding.UTF8.GetString(buf, 0, buf.Length);
count = s.Split('\n').Length;
}
}
return Convert.ToInt32(count);
}
This seems to work ok, but I have some concerns:
1) I would like to have my parameter simply as Stream (as opposed to a filename) since I may also be reading from the clipboard (MemoryStream). However Stream doesn't seem to be able to read n bytes at once into a buffer or get the total length of the Stream in bytes, like FileStream can. Stream is the parent class to both MemoryStream and FileStream.
2) I don't want to assume an encoding such as UTF8
3) I don't want to assume an end of line character (it should work for CR, CRLF, and LF)
I would appreciate any help to make this function more robust.

Here is what I came up with as a more robust solution for estimating line count.
public static int EstimateLineCount(string file)
{
using (var fs = new FileStream(file, FileMode.Open, FileAccess.Read))
{
return EstimateLineCount(fs);
}
}
public static int EstimateLineCount(Stream s)
{
//if file is larger than 10MB estimate the line count, otherwise get the exact line count
const int maxBytes = 10485760; //10MB = 1024*1024*10 bytes
s.Position = 0;
using (var sr = new StreamReader(s, Encoding.UTF8))
{
int lineCount = 0;
if (s.Length > maxBytes)
{
while (s.Position < maxBytes && sr.ReadLine() != null)
lineCount++;
return Convert.ToInt32((double)lineCount * s.Length / s.Position);
}
while (sr.ReadLine() != null)
lineCount++;
return lineCount;
}
}
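For the encoding and end-of-line concerns raised in the question, one option is to count line terminators on the raw bytes instead of decoding text. The following is a minimal sketch under stated assumptions, not a definitive implementation: the class name LineEstimator and the sampleBytes parameter are hypothetical, the stream must be seekable, and the encoding is assumed to be ASCII-compatible (UTF-8, Latin-1 and similar), so that CR is byte 0x0D and LF is byte 0x0A. UTF-16 or UTF-32 input would be miscounted.

```csharp
using System;
using System.IO;

public static class LineEstimator
{
    // Hypothetical helper: samples up to sampleBytes from the start of a
    // seekable stream and scales the sampled line count by the total length.
    // Assumption: ASCII-compatible encoding (CR = 0x0D, LF = 0x0A).
    public static long EstimateLineCount(Stream s, int sampleBytes = 1 << 20)
    {
        if (!s.CanSeek) throw new NotSupportedException("Stream must be seekable.");
        if (s.Length == 0) return 0;
        s.Position = 0;

        var buffer = new byte[8192];
        long total = 0, lines = 0;
        bool prevCr = false;
        byte last = 0;
        int read;
        while (total < sampleBytes &&
               (read = s.Read(buffer, 0, (int)Math.Min(buffer.Length, sampleBytes - total))) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                byte b = buffer[i];
                if (b == (byte)'\r') lines++;                 // CR alone, or first half of CRLF
                else if (b == (byte)'\n' && !prevCr) lines++; // bare LF (LF after CR is not recounted)
                prevCr = b == (byte)'\r';
                last = b;
            }
            total += read;
        }

        if (total >= s.Length)
        {
            // The whole stream was read: the count is exact, plus a final
            // line if the data does not end with a terminator.
            if (last != (byte)'\r' && last != (byte)'\n') lines++;
            return lines;
        }
        // Otherwise scale the sampled count up to the full length.
        return (long)Math.Round((double)lines * s.Length / total);
    }
}
```

Because it takes a Stream, the same call should work for a FileStream or a MemoryStream, for example LineEstimator.EstimateLineCount(File.OpenRead(path)).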

var lineCount = File.ReadLines(@"C:\file.txt").Count();
Another way:
var lineCount = 0;
using (var reader = File.OpenText(@"C:\file.txt"))
{
while (reader.ReadLine() != null)
{
lineCount++;
}
}

You're cheating! You're asking more than one question... I'll try to help you anyway :P
1) No, you can't use Stream, but you can use StreamReader. This should provide the flexibility you need.
2) Test for the encoding, since I gather you'll be working with several. Keep in mind, however, that it's usually hard to cater for ALL scenarios, so pick a few important ones first and extend your program later.
3) Don't - let me show you how:
First, consider your source. Whether it's a file or a memory stream, you should have an idea of its size. I've done the file bit because I'm lazy and it's easy, so you'll have to figure out the memory stream bit yourself. What I've done is much simpler but less accurate: read the first line of the file and express it as a percentage of the size of the file. Note that I multiplied the length of the string by 2, as that is the number of extra bytes used per extra character in a .NET string (UTF-16 in memory; the bytes per character on disk depend on the file's encoding). Obviously this isn't very accurate, so you can extend it to x lines; just keep in mind that you'll have to change the formula as well.
static void Main(string[] args)
{
FileInfo fileInfo = new FileInfo(@"C:\Muckabout\StringCounter\test.txt");
using (var stream = new StreamReader(fileInfo.FullName))
{
var firstLine = stream.ReadLine(); // Read the first line.
Console.WriteLine("First line read. This is roughly " + (firstLine.Length * 2.0) / fileInfo.Length * 100 + " per cent of the file.");
}
Console.ReadKey();
}
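The suggestion to extend the idea from one line to x lines could look roughly like the sketch below. It is an illustration under assumptions, not the answerer's code: SampleEstimator, EstimateFromFirstLines and sampleLines are hypothetical names, the encoding is taken from a byte order mark when present and otherwise assumed to be UTF-8, and each line terminator is approximated as one byte.

```csharp
using System;
using System.IO;
using System.Text;

public static class SampleEstimator
{
    // Hypothetical helper: measures the byte size of the first sampleLines
    // lines and extrapolates to the size of the whole file.
    public static long EstimateFromFirstLines(string path, int sampleLines = 100)
    {
        long fileBytes = new FileInfo(path).Length;
        if (fileBytes == 0) return 0;

        // Third argument: detect the encoding from a byte order mark if present.
        using (var reader = new StreamReader(path, Encoding.UTF8, true))
        {
            long sampledBytes = 0;
            int sampled = 0;
            string line;
            while (sampled < sampleLines && (line = reader.ReadLine()) != null)
            {
                // +1 approximates the terminator (CRLF would really be 2 bytes).
                sampledBytes += reader.CurrentEncoding.GetByteCount(line) + 1;
                sampled++;
            }

            // Fewer lines than requested means the whole file was read: exact count.
            if (sampled < sampleLines) return sampled;
            return (long)Math.Round((double)sampled * fileBytes / sampledBytes);
        }
    }
}
```

The estimate is only as good as the sample: if the first lines are a header or are unusually short or long, a larger sampleLines value (or sampling from several positions) would likely be needed.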

Related

Read and process large text file after every 2000 characters with a new line to it

I have a large text file that needs a new line inserted after every 2000 characters. This is what I have so far:
string FilePath = Path.Combine(strFullProcessedPath, strFileName);
StreamReader reader = new StreamReader(FilePath);
string firstLine = reader.ReadLine();
if (firstLine.Length > 2000)
{
string text = File.ReadAllText(FilePath);
text = Regex.Replace(text, @"(.{2000})", "$1\r\n", RegexOptions.Multiline);
reader.Close();
File.WriteAllText(FilePath, text);
}
It throws an out-of-memory exception. Could anyone offer some advice?
In the case of a very large (multi-gigabyte) file which doesn't fit in memory, you can try storing the processed data in a temporary file. Avoid ReadAllText; instead, read and write with the help of a buffer (2000 chars is a convenient size in this context).
// Initial and target file
string FilePath = Path.Combine(strFullProcessedPath, strFileName);
// Temporary file
string tempFile = Path.ChangeExtension(FilePath, ".~temp");
char[] buffer = new char[2000];
using (StreamReader reader = new StreamReader(FilePath)) {
bool first = true;
using (StreamWriter writer = new StreamWriter(tempFile)) {
while (true) {
int size = reader.ReadBlock(buffer, 0, buffer.Length);
if (size > 0) { // Do we have anything to write?
if (!first) // Are we in the middle and have to add a new line?
writer.WriteLine();
for (int i = 0; i < size; ++i)
writer.Write(buffer[i]);
}
// The last (incomplete) chunk
if (size < buffer.Length)
break;
first = false;
}
}
}
// Replace the original file with the temporary one.
// File.Move removes the temporary file, so no further delete is needed.
File.Delete(FilePath);
File.Move(tempFile, FilePath);
Edit: Even if the file is not that large (300MB, see comments), avoid string processing: several copies of the initial string can easily lead to an out-of-memory exception.
Something like this:
private static IEnumerable<string> ToChunks(string text, int size) {
int n = text.Length / size + (text.Length % size == 0 ? 0 : 1);
for (int i = 0; i < n; ++i)
if (i == n - 1)
yield return text.Substring(i * size); // Last chunk
else
yield return text.Substring(i * size, size); // Inner chunk
}
...
string FilePath = Path.Combine(strFullProcessedPath, strFileName);
// Read once; do not Replace or otherwise transform the string
string text = File.ReadAllText(FilePath);
// ... but extracting 2000 char chunks
File.WriteAllLines(FilePath, ToChunks(text, 2000));
You can't simply insert newlines into an existing file - you need to rewrite the entire thing, basically. The easiest way to do that is to use two files - a source and a destination - and then perhaps delete and rename at the end (so the temporary destination file takes the name of the original). This means you can now loop over the source file without reading it all into memory first; essentially, as pseudo-code:
using(...open source for read...)
using(...create dest for write...)
{
char[] buffer = new char[2000];
int charCount;
while(TryBuffer(source, buffer, out charCount)) {
// if true, we filled the buffer; don't need to worry
// about charCount
Write(destination, buffer, buffer.Length);
Write(destination, CRLF);
}
if(charCount != 0) // final chunk when returned false
{
// write any remaining charCount chars as a final chunk
Write(destination, buffer, charCount);
}
}
So that leaves the implementation of TryBuffer and Write. In this case, TextReader and TextWriter are probably your friends, since you are dealing in characters rather than bytes.
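As one possible way to fill in the pseudo-code above with TextReader and TextWriter, here is a minimal sketch. The names Chunker and WrapEvery are hypothetical. It differs slightly from the pseudo-code in that it writes the line break between chunks rather than after every full chunk, so no trailing line break is added when the input length is an exact multiple of the chunk size. Existing line breaks in the source are treated as ordinary characters.

```csharp
using System;
using System.IO;

public static class Chunker
{
    // Hypothetical helper: copies source to destination, separating
    // consecutive chunks of chunkSize characters with CRLF.
    public static void WrapEvery(TextReader source, TextWriter destination, int chunkSize)
    {
        char[] buffer = new char[chunkSize];
        bool first = true;
        int count;
        // ReadBlock keeps reading until the buffer is full or the input ends,
        // so only the final chunk can be shorter than chunkSize.
        while ((count = source.ReadBlock(buffer, 0, buffer.Length)) > 0)
        {
            if (!first) destination.Write("\r\n");
            destination.Write(buffer, 0, count);
            first = false;
        }
    }
}
```

For files, the reader and writer could be a StreamReader on the source path and a StreamWriter on a temporary path, followed by the delete and rename step described above.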

Read a large binary file(5GB) into a byte array in C#?

I have a recording file (binary) of more than 5 GB. I have to read that file and filter out the data that needs to be sent to a server.
The problem is that a byte[] array supports only up to 2 GB of file data, so I need help from anyone who has already dealt with this type of situation.
using (FileStream str = File.OpenRead(textBox2.Text))
{
int itemSectionStart = 0x00000000;
BinaryReader breader = new BinaryReader(str);
breader.BaseStream.Position = itemSectionStart;
int length = (int)breader.BaseStream.Length;
byte[] itemSection = breader.ReadBytes(length ); //first frame data
}
Issues:
1: The length exceeds the range of an integer.
2: I tried using long and uint, but byte[] only supports integer sizes.
Edit.
Another approach I want to try: read the data one frame buffer at a time. Suppose my frame buffer size is 24000: the byte array stores that much frame data, I process it, then flush the byte array and store the next 24000, and so on until the end of the binary file.
You cannot read a file that big at once, so you have to either split the file into small portions and then process them,
OR
read the file through a buffer and, once you are done with the data in that buffer, flush it and refill it.
I faced the same issue, so I tried the buffer-based approach and it worked for me.
FileStream inputTempFile = new FileStream(Path, FileMode.Open, FileAccess.Read);
int Buffer_value = 1024;
int bytesRead;
byte[] Array_buffer = new byte[Buffer_value];
while ((bytesRead = inputTempFile.Read(Array_buffer, 0, Buffer_value)) > 0)
{
// Only the first bytesRead bytes are valid; step through them 4 bytes at a time
for (int z = 0; z + 4 <= bytesRead; z = z + 4)
{
string temp_id = BitConverter.ToString(Array_buffer, z, 4);
string[] temp_strArrayID = temp_id.Split(new char[] { '-' });
string temp_ArraydataID = temp_strArrayID[0] + temp_strArrayID[1] + temp_strArrayID[2] + temp_strArrayID[3];
}
}
This way you can process your data.
In my case I was trying to store the buffered data in a List; that works fine up to 2 GB of data, after which it throws an out-of-memory exception.
The approach I followed: read the data from the buffer, apply the needed filters, write the filtered data to a text file, and then process that file.
//text file approach
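FileStream inputTempFile = new FileStream(Path, FileMode.Open, FileAccess.Read);
int Buffer_value = 1024;
int bytesRead;
// outputPath is a hypothetical name for a separate output file; it must differ from the input Path
StreamWriter writer = new StreamWriter(outputPath, true);
byte[] Array_buffer = new byte[Buffer_value];
while ((bytesRead = inputTempFile.Read(Array_buffer, 0, Buffer_value)) > 0)
{
// Only the first bytesRead bytes are valid; step through them 4 bytes at a time
for (int z = 0; z + 4 <= bytesRead; z = z + 4)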
{
string temp_id = BitConverter.ToString(Array_buffer, z, 4);
string[] temp_strArrayID = temp_id.Split(new char[] { '-' });
string temp_ArraydataID = temp_strArrayID[0] + temp_strArrayID[1] + temp_strArrayID[2] + temp_strArrayID[3];
if(temp_ArraydataID =="XYZ Condition")
{
writer.WriteLine(temp_ArraydataID);
}
}
}
writer.Close();
As said in the comments, I think you have to read your file as a stream. Here is how you can do this:
int nbRead;
const int step = 10000;
byte[] buffer = new byte[step];
// Reuse one buffer; accumulating every chunk in a list would hit the same memory limit
while ((nbRead = breader.Read(buffer, 0, step)) > 0)
{
for (int i = 0; i < nbRead; i++)
{
byte oneByte = buffer[i];
// Here you can read byte by byte this subpart
}
}
If I understand your needs correctly, you are looking for a specific pattern in your file?
I think you can do it by looking for the start of your pattern byte by byte. Once you find it, you can start reading the important bytes. If the important data as a whole is greater than 2GB, as said in the comments, you will have to send it to your server in several parts.
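Looking for a byte pattern while reading in chunks has one catch: the pattern may straddle two chunks. A minimal sketch of one way to handle that is shown below; PatternFinder, IndexOf and bufferSize are hypothetical names, and the approach (a naive comparison that carries the last pattern.Length - 1 bytes over to the next chunk) is an illustration rather than the answerer's code.

```csharp
using System;
using System.IO;

public static class PatternFinder
{
    // Hypothetical helper: returns the stream offset (relative to where
    // reading started) of the first occurrence of pattern, or -1 if absent.
    public static long IndexOf(Stream stream, byte[] pattern, int bufferSize = 64 * 1024)
    {
        if (pattern.Length == 0) return 0;

        // Extra room for the bytes carried over from the previous chunk.
        var buffer = new byte[bufferSize + pattern.Length - 1];
        int carried = 0;      // bytes kept from the previous chunk
        long bufferStart = 0; // stream offset of buffer[0]
        int read;
        while ((read = stream.Read(buffer, carried, buffer.Length - carried)) > 0)
        {
            int valid = carried + read;
            for (int i = 0; i + pattern.Length <= valid; i++)
            {
                int j = 0;
                while (j < pattern.Length && buffer[i + j] == pattern[j]) j++;
                if (j == pattern.Length) return bufferStart + i;
            }

            // Keep the tail that could still be the start of a match.
            int keep = Math.Min(pattern.Length - 1, valid);
            Array.Copy(buffer, valid - keep, buffer, 0, keep);
            bufferStart += valid - keep;
            carried = keep;
        }
        return -1;
    }
}
```

Memory use stays at roughly bufferSize regardless of file size, and the returned offset is a long, so positions beyond 2 GB are representable.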

loading/streaming a file into a buffer/buffers

I have been trying for a couple of days now to load a file in chunks, to allow the user to work with very large (GB) files and still keep the program fast. Currently I have the following code:
using (FileStream filereader = new FileStream(filename, FileMode.Open, FileAccess.Read))
{
using (StreamReader reader = new StreamReader(filereader))
{
while (toRead > 0 && (bytesread = reader.Read(buffer, offset, toRead)) > 0)
{
toRead -= bytesread;
offset += bytesread;
}
if (toRead > 0) throw new EndOfStreamException();
foreach (var item in buffer)
{
temporary = temporary += item.ToString();
}
temporary.Replace("\n", "\n" + System.Environment.NewLine);
Below are the declarations to avoid any confusion (hopefully):
const int Max_Buffer = 5000;
char[] buffer = new char[Max_Buffer];
int bytesread;
int toRead = 5000;
int offset = 0;
At the moment the program reads 5000 characters of the text file, then converts them into a string which I pass to a StringReader so I can extract the information I want.
My problem is that the buffer can stop halfway through a line, so when I read my data with the StringReader I get index/length errors.
What I need to know is how to seek back in the array to find the set of characters that marks the start of a line, and then return only the data before that point for processing into a string.
The other issue, once the seeking back is sorted, is how to keep the data I didn't process and bring in more data to fill the buffer.
I hope this is explained well. I know I can sometimes be confusing; I hope someone can help.
I would suggest using reader.ReadLine() instead of reader.Read() in your loop:
string line = reader.ReadLine();
bytesread = line.Length * 2; // each character is a UTF-16 code unit, i.e. 2 bytes in memory
You can then check whether (toRead - bytesread) < 0.
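Building on the ReadLine suggestion, whole lines can be grouped into batches of bounded size so that no line is ever split across two batches. The sketch below is one way to do that; LineBatcher, ReadLineBatches and maxChars are hypothetical names, and the budget is counted in characters (line terminators excluded) rather than bytes.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class LineBatcher
{
    // Hypothetical helper: yields lists of complete lines whose combined
    // length stays within maxChars. A single line longer than maxChars is
    // returned alone in its own batch.
    public static IEnumerable<List<string>> ReadLineBatches(TextReader reader, int maxChars)
    {
        var batch = new List<string>();
        int used = 0;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // Start a new batch if this line would exceed the budget.
            if (batch.Count > 0 && used + line.Length > maxChars)
            {
                yield return batch;
                batch = new List<string>();
                used = 0;
            }
            batch.Add(line);
            used += line.Length;
        }
        if (batch.Count > 0) yield return batch;
    }
}
```

Each batch can then be processed directly, without the seek-back and carry-over bookkeeping that a fixed-size character buffer requires.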

Remove last x lines from a streamreader

I need to read in all but the last x lines from a file to a streamreader in C#. What is the best way to do this?
Many Thanks!
If it's a large file, is it possible to just seek to the end of the file and examine the bytes in reverse for the '\n' character? I am aware that both \n and \r\n exist. I whipped up the following code and tested it on a fairly trivial file. Can you try testing this on the files that you have? I know my solution looks long, but I think you'll find that it's faster than reading from the beginning and rewriting the whole file.
public static void Truncate(string file, int linesToTruncate)
{
using (FileStream fs = File.Open(file, FileMode.OpenOrCreate, FileAccess.ReadWrite, FileShare.None))
{
fs.Position = fs.Length;
// \n \r\n (both uses \n for lines)
const int BUFFER_SIZE = 2048;
// Start at the end until # lines have been encountered, record the position, then truncate the file
long currentPosition = fs.Position;
int linesProcessed = 0;
byte[] buffer = new byte[BUFFER_SIZE];
while (linesProcessed < linesToTruncate && currentPosition > 0)
{
int bytesRead = FillBuffer(buffer, fs);
// We now have a buffer containing the later contents of the file
for (int i = bytesRead - 1; i >= 0; i--)
{
currentPosition--;
if (buffer[i] == '\n')
{
linesProcessed++;
if (linesProcessed == linesToTruncate)
break;
}
}
}
// Truncate the file
fs.SetLength(currentPosition);
}
}
private static int FillBuffer(byte[] buffer, FileStream fs)
{
if (fs.Position == 0)
return 0;
int bytesRead = 0;
// Calculate how many bytes of the buffer can be filled (remember that we're going in reverse)
long expectedBytesToRead = (fs.Position < buffer.Length) ? fs.Position : buffer.Length;
fs.Position -= expectedBytesToRead;
while (bytesRead < expectedBytesToRead)
{
// Read only the bytes still missing, appending after those already read
int n = fs.Read(buffer, bytesRead, (int)expectedBytesToRead - bytesRead);
if (n == 0)
break; // unexpected end of stream
bytesRead += n;
}
// We have to reset the position again because we moved the reader forward;
fs.Position -= bytesRead;
return bytesRead;
}
Since you are only planning on deleting the end of the file, it seems wasteful to rewrite everything, especially if it's a large file and small N. Of course, one can make the argument that if someone wanted to eliminate all lines, then going from the beginning to the end is more efficient.
Since you are referring to lines in a file, I'm assuming it's a text file. If you just want to get the lines you can read them into an array of strings like so:
string[] lines = File.ReadAllLines(@"C:\test.txt");
Or if you really need to work with StreamReaders:
using (StreamReader reader = new StreamReader(@"C:\test.txt"))
{
while (!reader.EndOfStream)
{
Console.WriteLine(reader.ReadLine());
}
}
You don't really read INTO a StreamReader. In fact, for the pattern you're asking for you don't need the StreamReader at all. System.IO.File has the useful static method 'ReadLines' that you can leverage instead:
IEnumerable<string> allBut = File.ReadLines(path).Reverse().Skip(5).Reverse();
The previous, flawed version is kept here because the comment thread refers to it:
List<string> allLines = File.ReadLines(path).ToList();
IEnumerable<string> allBut = allLines.Take(allLines.Count - 5);
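Both variants above hold every line in memory at some point. For a large file, a small queue can delay output by n lines so that the last n are never emitted. This is a minimal sketch; TailSkipper and SkipLastLines are hypothetical names, and n is assumed to be non-negative.

```csharp
using System;
using System.Collections.Generic;

public static class TailSkipper
{
    // Hypothetical helper: lazily yields every line except the last n,
    // holding at most n + 1 lines in memory at a time.
    public static IEnumerable<string> SkipLastLines(IEnumerable<string> lines, int n)
    {
        var pending = new Queue<string>();
        foreach (var line in lines)
        {
            pending.Enqueue(line);
            // Once more than n lines are queued, the oldest one cannot be
            // among the last n, so it is safe to emit.
            if (pending.Count > n) yield return pending.Dequeue();
        }
    }
}
```

Usage would be along the lines of TailSkipper.SkipLastLines(File.ReadLines(path), 5). On newer frameworks (.NET Core 2.0 and later), Enumerable.SkipLast offers the same behavior out of the box.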

GZIP file Total length in C#

I have a gzipped file several GB in size. I want to get the size of the unzipped contents without actually unzipping the file in C#. Which library can I use? When I right-click the .gz file and go to Properties, the Archive tab shows a property named TotalLength with this value, but I want to get it programmatically using C#. Any ideas?
The last 4 bytes of the gz file contain the length.
So it should be something like:
using(var fs = File.OpenRead(path))
{
fs.Position = fs.Length - 4;
var b = new byte[4];
fs.Read(b, 0, 4);
uint length = BitConverter.ToUInt32(b, 0);
Console.WriteLine(length);
}
The last four bytes of a .gz file are the uncompressed input size modulo 2^32. If your uncompressed file isn't larger than 4GB, just read the last 4 bytes of the file. If you have a larger file, I'm not sure that it's possible to get the size without uncompressing the stream.
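A slightly more defensive version of the trailer read could look like the sketch below. GzipInfo and ReadIsize are hypothetical names. It relies on BinaryReader.ReadUInt32 reading little-endian, which matches how the gzip format stores this field, and it rejects files too short to hold a gzip header and trailer. The value is still only the size modulo 2^32, and for a file made of several concatenated gzip members it reflects only the last member.

```csharp
using System;
using System.IO;

public static class GzipInfo
{
    // Hypothetical helper: returns the 32-bit size field stored in the last
    // four bytes of a gzip file (uncompressed size modulo 2^32).
    public static uint ReadIsize(string path)
    {
        using (var fs = File.OpenRead(path))
        {
            // 10-byte header plus 8-byte trailer is the bare minimum.
            if (fs.Length < 18)
                throw new InvalidDataException("Too short to be a gzip file.");

            fs.Seek(-4, SeekOrigin.End);
            using (var reader = new BinaryReader(fs))
            {
                // BinaryReader reads integers as little-endian on every platform.
                return reader.ReadUInt32();
            }
        }
    }
}
```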
EDIT: See the answers by Leppie and Gabe; the only reason I'm keeping this (rather than deleting it) is that it may be necessary if you suspect the length is > 4GB
For gzip, that data doesn't seem to be directly available - I've looked at GZipStream and the SharpZipLib equivalent - neither works. The best I can suggest is to run it locally:
long length = 0;
using(var fs = File.OpenRead(path))
using (var gzip = new GZipStream(fs, CompressionMode.Decompress)) {
var buffer = new byte[10240];
int count;
while ((count = gzip.Read(buffer, 0, buffer.Length)) > 0) {
length += count;
}
}
If it was a zip, then SharpZipLib:
long size = 0;
using(var zip = new ZipFile(path)) {
foreach (ZipEntry entry in zip) {
size += entry.Size;
}
}
public static long mGetFileLength(string strFilePath)
{
if (!string.IsNullOrEmpty(strFilePath))
{
System.IO.FileInfo info = new System.IO.FileInfo(strFilePath);
return info.Length;
}
return 0;
}
