How to read specific line number of text file without looping ReadLine()? - c#

Is there a class that lets you read lines by line number in C#?
I know about StreamReader and TextFieldParser but AFAIK those don't have this functionality. For example, if I know that line number 34572 in my text file contains certain data, it would be nice to not have to call StreamReader.ReadLine() 34572 times.

Unless the file has a precise and pre-determined format, for instance with every line having the same length, there is no way to seek within a text file.
In order to find the ith line, you must find the first i-1 line ends. And if you do not know anything about where those line ends could be, it follows that you must read the entire file up until the ith line.

This is not a problem of C# - it's a problem of line terminators. There's no way to skip to line 34572, because you don't know where it starts - the only thing you know is that it starts after the 34571st \r\n. If you need this functionality, you don't want to be using text files at all :)
A simple (but still slow) way would be to use File.ReadLines(...):
var line = File.ReadLines(fileName).Skip(34571).FirstOrDefault();
The best way, however, would be to know the actual byte offset of the line. If you remember the offset instead of the line number, you can simply seek in the stream and avoid reading the unnecessary data. Then you'd just continue reading the line as usual:
streamReader.BaseStream.Seek(offset, SeekOrigin.Begin);
streamReader.DiscardBufferedData(); // drop any stale buffered text after seeking
var line = streamReader.ReadLine();
This is useful if the file is append-only (e.g. a log file) and you can afford to remember bookmarks. It will only work if the file isn't modified in front of the bookmark, though.
All in all, there are three options:
Add indexing - have a table that contains all (or some) of the line-start offsets (a sketch follows this answer)
Have a fixed line length - this allows you to seek predictably without an index; however, this will not work well with unicode, so it's pretty much useless these days
Parse the file - this pretty much amounts to reading the file line by line, the only optimisation being that you don't actually need to allocate the strings - a simple reusable byte buffer would do.
There's a reason why text formats aren't preferred when performance is important - when you work with user-editable general text formats, the third option is the only option. Thus, reading from a JSON, XML, text log file etc. will always mean reading up to at least the line you want.
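To make the indexing option concrete, here is a minimal sketch (class and method names are mine, and it assumes single-byte text such as ASCII or UTF-8 without a BOM): scan once to record where each line starts, then seek straight to any line afterwards.

using System.Collections.Generic;
using System.IO;
using System.Text;

static class LineIndex
{
    // Scan once, recording the byte offset at which each line starts.
    public static List<long> Build(string path)
    {
        var offsets = new List<long> { 0 }; // line 0 starts at offset 0
        using (FileStream fs = File.OpenRead(path))
        {
            int b;
            long pos = 0;
            while ((b = fs.ReadByte()) != -1)
            {
                pos++;
                if (b == '\n')
                    offsets.Add(pos); // the next line starts right after '\n'
            }
        }
        return offsets;
    }

    // Seek to the recorded offset and read just the one line.
    public static string ReadLineAt(string path, List<long> offsets, int lineNumber)
    {
        using (FileStream fs = File.OpenRead(path))
        {
            fs.Seek(offsets[lineNumber], SeekOrigin.Begin);
            using (var reader = new StreamReader(fs, Encoding.UTF8))
                return reader.ReadLine();
        }
    }
}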

This is probably not your case, but if you knew the length of each line, you could calculate the starting byte of the line you are looking for and proceed like the snippet below:
// Assumes startByte can be computed up front
// (e.g. startByte = lineNumber * fixedLineLength).
FileStream fs = new FileStream(fullFileName, FileMode.Open, FileAccess.Read);
fs.Seek(startByte, SeekOrigin.Begin);
int read;
while ((read = fs.ReadByte()) != -1)
{
    char value = (char)read;
    // ... process the character ...

    // To determine the end of the line you can use the conditions below:
    if (value == '\n')        // hex 0x0A
    {
        break;                // reached the end of the line
    }
    else if (value == '\r')   // hex 0x0D
    {
        continue;             // skip the carriage return; '\n' follows
    }
}
fs.Close();

Related

Open a file without loading it into memory at once

Hi, I have a problem to solve for college and I have a hard time understanding the statement of the problem.
This is the problem I have:
Reverse the order of bytes from a file without loading the entire file into memory at once. You have to solve this problem in C#, Java, PHP and Python.
Now there are two things that I do not understand here.
First, I am not sure if "bytes" refers to the actual characters of the file, or to something else. The problem does not state if it is a text file or not.
Second, I am not sure how to open a file without actually loading it into memory.
This is how I would normally approach this problem, but I think if I do it this way the file gets loaded into memory:
string fileName = "file.txt";
StreamReader reader = new StreamReader(fileName);
string line;
while ((line = reader.ReadLine()) != null)
{
    Console.WriteLine(line + "\n");
}
Also, I am not sure how I would actually reverse all the characters if I am reading it one line at a time.
EDIT: Sorry for posting in multiple languages. I do not want the solution to the problem; I only want to clarify it so I can solve it myself. I assumed that because I have to solve it in four different languages, the concept would apply to all four and it did not matter who answered.
Open a FileStream and use the Seek method to go to the end of the file. From there, go backwards, reading one byte at a time. This will read in reverse order. So, until you reach the beginning of the file, loop:
read 1 byte
// do whatever you want with that byte...write to another file?
seek back 2 bytes
As to efficiency, you can read a buffer of, say, 1024 bytes in memory. That way, you don't issue Read operations for each byte of the file. Once you have the buffer filled, reverse it and you're good to go.
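A minimal sketch of that approach in C# (file names are illustrative): read the source backwards in buffered chunks, reverse each chunk in memory, and write it out.

using System;
using System.IO;

static class FileReverser
{
    public static void Reverse(string inputPath, string outputPath)
    {
        const int BufferSize = 1024;
        byte[] buffer = new byte[BufferSize];

        using (FileStream input = File.OpenRead(inputPath))
        using (FileStream output = File.Create(outputPath))
        {
            long remaining = input.Length;
            while (remaining > 0)
            {
                int chunk = (int)Math.Min(BufferSize, remaining);
                remaining -= chunk;
                input.Seek(remaining, SeekOrigin.Begin); // start of this chunk
                input.Read(buffer, 0, chunk);            // fill the buffer
                Array.Reverse(buffer, 0, chunk);         // reverse it in memory
                output.Write(buffer, 0, chunk);          // emit the reversed chunk
            }
        }
    }
}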

C# : Fastest way for specific columns in CSV Files

I have a very large CSV file (millions of records).
I have developed a smart search algorithm to locate specific line ranges in the file to avoid parsing the whole file.
Now I am facing a trickier issue: I am only interested in the content of a specific column.
Is there a smart way to avoid looping line by line through a 200MB file and retrieve only the content of a specific column?
I'd use an existing library, as codeulike has suggested. For a very good reason why, read this article:
Stop Rolling Your Own CSV Parser!
You mean get every value from every row for a specific column?
You're probably going to have to visit every row to do that.
This C# CSV Reading library is very quick so you might be able to use it:
LumenWorks.Framework.IO.Csv by Sebastien Lorien
Unless all CSV fields have a fixed width (so that even an empty field still occupies n bytes of blank space between the separators surrounding it), no.
If yes
Then each row, in turn, also has a fixed length, and therefore you can skip straight to the first value for that column; once you've read it, you immediately advance to the next row's value for the same field, without having to read any intermediate values.
I think this is pretty simple - but I'm on a roll at the moment (and at lunch), so I'm going to finish it anyway :)
To do this, we first want to know how long each row is in characters (adjust for bytes according to Unicode, UTF8 etc):
row_len = sum(widths[0..n-1]) + n-1 + row_sep_length
Where n is the total number of columns on each row - this is a constant for the whole file. We add an extra n-1 to it to account for the separators between column values.
And row_sep_length is the length of the separator between two rows - usually a newline, or potentially a [carriage-return & line-feed] pair.
The value for a column row[r]col[i] will be offset characters from the start of row[r], where offset is defined as:
offset = i > 0 ? sum(widths[0..i-1]) + i : 0;
// or: the sum of the widths of all columns before col[i],
// plus one character for each separator between adjacent columns
And then, assuming you've read the whole column value, up to the next separator, the offset to the starting character of the next column value row[r+1]col[i] is calculated by subtracting the width of your column from the row length. This is yet another constant for the file:
next-field-offset = row_len - widths[i];
//widths[i] is the width of the field you are actually reading.
Throughout, i is zero-based in this pseudo-code, as is the indexing of the vectors/arrays.
To read, then, you first advance the file pointer by offset characters - taking you to the first value you want. You read the value (taking you to the next separator) and then simply advance the file pointer by next-field-offset characters. If you reach EOF at this point, you're done.
I might have missed a character either way in this - so if it's applicable - do check it!
This only works if you can guarantee that all field values - even nulls - for all rows will be the same length, that the separators are always the same length, and that all row separators are the same length. If not - then this approach won't work.
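To make the arithmetic above concrete, here is a rough sketch (assuming single-byte characters; the widths, the one-character separators, and all names are illustrative):

using System;
using System.IO;
using System.Text;

static class FixedWidthColumnReader
{
    // widths[i] is the fixed character width of column i;
    // rowSepLength is 1 for '\n' or 2 for "\r\n".
    public static void ReadColumn(string path, int[] widths, int i,
                                  int rowSepLength, Action<string> onValue)
    {
        long rowLen = 0;
        foreach (int w in widths) rowLen += w;
        rowLen += (widths.Length - 1) + rowSepLength; // separators + row terminator

        long offset = 0;                              // col[i]'s offset within a row
        for (int k = 0; k < i; k++) offset += widths[k] + 1;

        byte[] buffer = new byte[widths[i]];
        using (FileStream fs = File.OpenRead(path))
        {
            fs.Seek(offset, SeekOrigin.Begin);        // jump to the first value
            while (fs.Read(buffer, 0, buffer.Length) == buffer.Length)
            {
                onValue(Encoding.ASCII.GetString(buffer));
                fs.Seek(rowLen - widths[i], SeekOrigin.Current); // same column, next row
            }
        }
    }
}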
If not
You'll have to do it the slow way - find the column in each line and do whatever it is you need to do.
If you're doing a significant amount of work on the column value each time, one optimisation will be to pull out all the column values first into a list (set with a known initial capacity too) or something (batching at 100,000 a time or something like that), then iterate through those.
If you keep each loop focused on a single task, that should be more efficient than one big loop.
Equally, once you've batched 100,000 column values you could use Parallel LINQ to distribute the second loop (not the first, since there's no point parallelising reading from a file).
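A hedged sketch of that batching idea (names are mine; the naive Split will not handle quoted fields, so treat it as a placeholder for a real CSV parser):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class BatchedColumnProcessor
{
    public static void Process(string path, int columnIndex)
    {
        const int BatchSize = 100000;
        var batch = new List<string>(BatchSize); // known initial capacity

        foreach (string line in File.ReadLines(path))
        {
            batch.Add(line.Split(',')[columnIndex]);
            if (batch.Count == BatchSize)
            {
                RunBatch(batch);
                batch.Clear();
            }
        }
        if (batch.Count > 0) RunBatch(batch);
    }

    static void RunBatch(List<string> values)
    {
        // Only this second loop is parallelised; file reading stays sequential.
        var results = values.AsParallel()
                            .Select(v => v.Trim()) // stand-in for the real work
                            .ToList();
        Console.WriteLine("Processed {0} values", results.Count);
    }
}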
There are only shortcuts if you can pose specific limitations on the data.
For example, you can only read the file line by line if you know that there are no values in the file that contain line breaks. If you don't know this, you have to parse the file record by record as a stream, and each record ends where there is a line break that is not inside a value.
However, unless you know that each line takes up exactly the same number of bytes, there is no other way to read the file than to read it line by line. A line break in a file is just another character (or pair of characters); there is no way to locate a line in a text file other than reading all the lines that come before it.
You can take similar shortcuts when reading a record if you can pose limitations on the fields in the records. If, for example, you know that the fields to the left of the one you are interested in are all numerical, you can use a simpler parsing method to find the start of the field.

C# code to perform Binary search in a very big text file

Is there a library that I can use to perform binary search in a very big text file (can be 10GB).
The file is a sort of a log file - every row starts with a date and time. Therefore rows are ordered.
I started to write the pseudo-code on how to do it, but I gave up since it may seem condescending. You probably know how to write a binary search; it's really not complicated.
You won't find it in a library, for two reasons:
It's not really "binary search" - the line sizes are different, so you need to adapt the algorithm (e.g. look for the middle of the file, then look for the next "newline" and consider that to be the "middle").
Your datetime log format is most likely non-standard (ok, it may look "standard", but think a bit... you probably use '[]' or something to separate the date from the log message, something like [10/02/2001 10:35:02] My message).
In summary - I think your need is too specific, and too simple to implement in custom code, for anyone to bother writing a library :)
As the line lengths are not guaranteed to be the same length, you're going to need some form of recognisable line delimiter e.g. carriage return or line feed.
The binary search pattern can then be pretty much your traditional algorithm. Seek to the 'middle' of the file (by length), seek backwards (byte by byte) to the start of the line you happen to land in, as identified by the line delimiter sequence, read that record and make your comparison. Depending on the comparison, seek halfway up or down (in bytes) and repeat.
When you identify the start index of a record, check whether it was the same as the last seek. You may find that, as you dial in on your target record, moving halfway won't get you to a different record. e.g. you have adjacent records of 100 bytes and 50 bytes respectively, so jumping in at 75 bytes always takes you back to the start of the first record. If that happens, read on to the next record before making your comparison.
You should find that you will reach your target pretty quickly.
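That pattern can be sketched roughly as below (a sketch under stated assumptions, not production code: single-byte encoding, '\n' line terminators, and the caller supplies the comparison that parses the date out of a line):

using System;
using System.IO;
using System.Text;

static class LogBinarySearch
{
    // Returns the offset of the first line that compares >= the target.
    public static long FindLine(string path, Func<string, int> compareToTarget)
    {
        using (FileStream fs = File.OpenRead(path))
        {
            long lo = 0, hi = fs.Length;
            while (lo < hi)
            {
                long mid = lo + (hi - lo) / 2;
                long lineStart = SeekLineStart(fs, mid); // align to a record start
                string line = ReadLineAt(fs, lineStart);
                if (compareToTarget(line) < 0)
                    lo = lineStart + line.Length + 1;    // move past this line
                else
                    hi = lineStart;                      // answer is here or earlier
            }
            return lo;
        }
    }

    // Walk backwards from pos to the byte just after the previous '\n'.
    static long SeekLineStart(FileStream fs, long pos)
    {
        while (pos > 0)
        {
            fs.Position = pos - 1;
            if (fs.ReadByte() == '\n')
                break;
            pos--;
        }
        return pos;
    }

    // Read a single line starting at offset, one byte at a time.
    static string ReadLineAt(FileStream fs, long offset)
    {
        fs.Position = offset;
        var sb = new StringBuilder();
        int b;
        while ((b = fs.ReadByte()) != -1 && b != '\n')
            sb.Append((char)b);
        return sb.ToString();
    }
}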
You would need to be able to stream the file, but you would also need random access. I'm not sure how you accomplish this short of a guarantee that each line of the file contains the same number of bytes. If you had that, you could get a Stream of the object and use the Seek method to move around in the file, and from there you could conduct your binary search by reading in the number of bytes that constitute a line. But again, this is only valid if the lines are the same number of bytes. Otherwise, you would jump in and out of the middle of lines.
Something like
byte[] buffer = new byte[lineLength];
stream.Seek(lineLength * searchPosition, SeekOrigin.Begin); // jump straight to the line
stream.Read(buffer, 0, lineLength);                         // read exactly one line
string line = Encoding.Default.GetString(buffer);
This shouldn't be too bad under the constraint that you hold an Int64 in memory for every line-feed in the file. That really depends upon how long the lines of text are on average; given 1000 bytes per line and 8 bytes per Int64, you'd be looking at around (10,000,000,000 / 1000 * 8) = 80MB. Very big, but possible.
So try this:
Scan the file and store the ordinal offset of each line-feed in a List
Binary search the List with a custom comparer that scans to the file offset and reads the data.
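A hedged sketch of those two steps (names are mine, and it assumes the timestamp prefix of each line sorts lexicographically, so ordinal string comparison works):

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class OffsetListSearch
{
    // lineOffsets holds the start offset of every line, collected in one scan.
    public static int Find(string path, List<long> lineOffsets, string target)
    {
        using (FileStream fs = File.OpenRead(path))
        {
            // The comparer ignores the dummy search item and instead reads
            // the line at each candidate offset, comparing it to `target`.
            IComparer<long> comparer = Comparer<long>.Create((offset, ignored) =>
            {
                fs.Position = offset;
                var sb = new StringBuilder();
                int b;
                while ((b = fs.ReadByte()) != -1 && b != '\n')
                    sb.Append((char)b);
                string line = sb.ToString();
                return string.CompareOrdinal(
                    line.Substring(0, Math.Min(target.Length, line.Length)), target);
            });
            return lineOffsets.BinarySearch(0L, comparer);
        }
    }
}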
If your file is static (or changes rarely) and you have to run "enough" queries against it, I believe the best approach will be creating an "index" file:
Scan the initial file and take the datetime parts plus their positions in the original file (this is why it has to be pretty static). Encode them somehow, for example: unix time (full 10 digits) + nanoseconds (zero-filled 4 digits) + line position (zero-filled 10 digits). This way you will have an index file with consistent "lines".
Perform binary search on that file (you may need to be a bit creative in order to achieve range search) and get the relevant location(s) in the original file.
read directly from the original file starting from the given location / read the given range
You've got range search with O(log(n)) run-time :) (and you've created primitive DB functionality)
Needless to say, if the data file is updated "too" frequently or you don't run "enough" queries against the index file, you may end up spending more time on creating the index file than you save on the queries.
Btw, working with this index file doesn't require the data file to be sorted. As log files tend to be append-only, and sorted, you may speed up the whole thing by simply creating an index file that only holds the locations of the EOL marks (zero-filled 10 digits) in the data file - this way you can perform the binary search directly on the data file (using the index file to determine the seek positions in the original file), and if lines are appended to the log file you can simply add (append) their EOL positions to the index file.
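A rough sketch of that append-only variant (the 10-digit width and all names are illustrative): because every index entry is zero-filled to the same width, the index file itself has fixed-length lines and can be seeked into directly.

using System;
using System.IO;
using System.Text;

static class EolIndex
{
    const int EntryWidth = 10; // zero-filled offset digits
    static readonly int EntrySize = EntryWidth + Environment.NewLine.Length;

    // Append the EOL offset of a newly written log line to the index file.
    public static void Append(string indexPath, long eolOffset)
    {
        File.AppendAllText(indexPath,
            eolOffset.ToString().PadLeft(EntryWidth, '0') + Environment.NewLine);
    }

    // Fetch the data-file offset stored in index entry number n,
    // which a binary search can call for its midpoint probes.
    public static long EntryAt(FileStream index, long n)
    {
        byte[] buffer = new byte[EntryWidth];
        index.Position = n * EntrySize;
        index.Read(buffer, 0, EntryWidth);
        return long.Parse(Encoding.ASCII.GetString(buffer));
    }
}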
The List<T> class has a BinarySearch method.
http://msdn.microsoft.com/en-us/library/w4e7fxsh%28VS.80%29.aspx

What is the BEST way to replace text in a File using C# / .NET?

I have a text file that is being written to as part of a very large data extract. The first line of the text file is the number of "accounts" extracted.
Because of the nature of this extract, that number is not known until the very end of the process, but the file can be large (a few hundred megs).
What is the BEST way in C# / .NET to open a file (in this case a simple text file), and replace the data that is in the first "line" of text?
IMPORTANT NOTE: - I do not need to replace a "fixed amount of bytes" - that would be easy. The problem here is that the data that needs to be inserted at the top of the file is variable.
IMPORTANT NOTE 2: - A few people have asked about / mentioned simply keeping the data in memory and then replacing it... however that's completely out of the question. The reason why this process is being updated is because of the fact that sometimes it crashes when loading a few gigs into memory.
If you can, you should insert a placeholder which you overwrite at the end with the actual number, padded with spaces.
If that is not an option write your data to a cache file first. When you know the actual number create the output file and append the data from the cache.
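A minimal sketch of the cache-file route (names are illustrative): stream the records to a temporary file first, then write the final file with the count in front and append the cached data.

using System.IO;
using System.Text;

static class CacheFileWriter
{
    // Called once the real account count is known.
    public static void Finish(string cachePath, string outputPath, long accountCount)
    {
        using (FileStream output = File.Create(outputPath))
        {
            byte[] header = Encoding.ASCII.GetBytes(accountCount + "\r\n");
            output.Write(header, 0, header.Length); // the count goes in first
            using (FileStream cache = File.OpenRead(cachePath))
                cache.CopyTo(output);               // then the cached records
        }
    }
}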
BEST is very subjective. For any smallish file, you can easily open the entire file in memory and replace what you want using a string replace and then re-write the file.
Even for largish files, it would not be that hard to load into memory. In the days of multi-gigs of memory, I would consider hundreds of megabytes to still be easily done in memory.
Have you tested this naive approach? Have you seen a real issue with it?
If this is a really large file (gigabytes in size), I would consider writing all of the data first to a temp file and then write the correct file with the header line going in first and then appending the rest of the data. Since it is only text, I would probably just shell out to DOS:
TYPE temp.txt >> outfile.txt
I do not need to replace a "fixed amount of bytes"
Are you sure?
If you write a big number to the first line of the file (UInt32.MaxValue or UInt64.MaxValue), then when you find the correct actual number, you can replace that number of bytes with the correct number, but left padded with zeros, so it's still a valid integer.
e.g.
Replace 999999 - your "large number placeholder"
With 000100 - the actual number of accounts
Seems to me if I understand the question correctly?
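A sketch of that placeholder idea (names are mine; 20 digits is the width of UInt64.MaxValue): reserve the widest possible number up front, then overwrite it in place with a zero-padded value of exactly the same length.

using System;
using System.IO;
using System.Text;

static class PlaceholderHeader
{
    // 20 digits: wide enough for UInt64.MaxValue (18446744073709551615).
    const string Placeholder = "99999999999999999999";

    public static void WritePlaceholder(Stream output)
    {
        byte[] bytes = Encoding.ASCII.GetBytes(Placeholder + Environment.NewLine);
        output.Write(bytes, 0, bytes.Length);
    }

    public static void FixUp(string path, ulong actualCount)
    {
        string padded = actualCount.ToString().PadLeft(Placeholder.Length, '0');
        using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Write))
        {
            byte[] bytes = Encoding.ASCII.GetBytes(padded);
            fs.Write(bytes, 0, bytes.Length); // same byte count, so nothing shifts
        }
    }
}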
What is the BEST way in C# / .NET to open a file (in this case a simple text file), and replace the data that is in the first "line" of text?
How about placing at the top of the file a token {UserCount} when it is first created.
Then use TextReader to read the file line by line. If it is the first line, look for {UserCount} and replace it with your value. Write out each line you read using TextWriter.
Example:
int lineNumber = 1;
int userCount = 1234;
string line = null;

using (TextReader tr = File.OpenText("OriginalFile"))
using (TextWriter tw = File.CreateText("ResultFile"))
{
    while ((line = tr.ReadLine()) != null)
    {
        if (lineNumber == 1)
        {
            line = line.Replace("{UserCount}", userCount.ToString());
        }
        tw.WriteLine(line);
        lineNumber++;
    }
}
If the extracted file is only a few hundred megabytes, then you can easily keep all of the text in-memory until the extraction is complete. Then, you can write your output file as the last operation, starting with the record count.
OK, earlier I suggested an approach that would be better when dealing with existing files.
However, in your situation you want to create the file and, during the create process, go back to the top and write out the user count. This will do just that.
Here is one way to do it that saves you from having to write a temporary file.
private void WriteUsers()
{
    const int HeaderWidth = 40; // reserve enough space for the count line
    int userCounter = 0;

    using (StreamWriter sw = File.CreateText("myfile.txt"))
    {
        // Write a fixed-width placeholder line; it will later be
        // overwritten in place with our user count.
        sw.WriteLine(new string(' ', HeaderWidth));

        // Write out the records and keep track of the count.
        for (int i = 1; i < 100; i++)
        {
            sw.WriteLine("User" + i);
            userCounter++;
        }

        // Flush buffered text before touching the underlying stream,
        // then overwrite the reserved line at the top of the file.
        sw.Flush();
        sw.BaseStream.Position = 0;
        byte[] userCountBytes = Encoding.ASCII.GetBytes("User Count: " + userCounter);
        sw.BaseStream.Write(userCountBytes, 0, userCountBytes.Length);
    }
}

.NET C# - Random access in text files - no easy way?

I've got a text file that contains several 'records' inside of it. Each record contains a name and a collection of numbers as data.
I'm trying to build a class that will read through the file, present only the names of all the records, and then allow the user to select which record data he/she wants.
The first time I go through the file, I only read header names, but I can keep track of the 'position' in the file where the header is. I need random access to the text file to seek to the beginning of each record after a user asks for it.
I have to do it this way because the file is too large to be read in completely in memory (1GB+) with the other memory demands of the application.
I've tried using the .NET StreamReader class to accomplish this (which provides very easy-to-use ReadLine functionality), but there is no way to capture the true position of the file (the position in the BaseStream property is skewed due to the buffer the class uses).
Is there no easy way to do this in .NET?
There are some good answers provided, but I couldn't find some source code that would work in my very simplistic case. Here it is, with the hope that it'll save someone else the hour that I spent searching around.
The "very simplistic case" that I refer to is: the text encoding is fixed-width, and the line ending characters are the same throughout the file. This code works well in my case (where I'm parsing a log file, and I sometimes have to seek ahead in the file and then come back). I implemented just enough to do what I needed to do (e.g. only one constructor, and only overriding ReadLine()), so most likely you'll need to add code... but I think it's a reasonable starting point.
public class PositionableStreamReader : StreamReader
{
    public PositionableStreamReader(string path)
        : base(path)
    { }

    private int myLineEndingCharacterLength = Environment.NewLine.Length;

    public int LineEndingCharacterLength
    {
        get { return myLineEndingCharacterLength; }
        set { myLineEndingCharacterLength = value; }
    }

    public override string ReadLine()
    {
        string line = base.ReadLine();
        if (null != line)
            // Assumes a fixed-width, single-byte encoding, so that
            // characters map 1:1 to bytes in the underlying stream.
            myStreamPosition += line.Length + myLineEndingCharacterLength;
        return line;
    }

    private long myStreamPosition = 0;

    public long Position
    {
        get { return myStreamPosition; }
        set
        {
            myStreamPosition = value;
            this.BaseStream.Position = value;
            this.DiscardBufferedData();
        }
    }
}
Here's an example of how to use the PositionableStreamReader:
PositionableStreamReader sr = new PositionableStreamReader("somepath.txt");

// read some lines
while (something)
    sr.ReadLine();

// bookmark the current position
long streamPosition = sr.Position;

// read some lines
while (something)
    sr.ReadLine();

// go back to the bookmarked position
sr.Position = streamPosition;

// read some lines
while (something)
    sr.ReadLine();
FileStream has a Seek() method.
You can use a System.IO.FileStream instead of StreamReader. If you know exactly what the file contains (the encoding, for example), you can do all the same operations as with StreamReader.
If you're flexible with how the data file is written and don't mind it being a little less text editor-friendly, you could write your records with a BinaryWriter:
using (BinaryWriter writer =
    new BinaryWriter(File.Open("data.txt", FileMode.Create)))
{
    writer.Write("one,1,1,1,1");
    writer.Write("two,2,2,2,2");
    writer.Write("three,3,3,3,3");
}
Then, initially reading each record is simple because you can use the BinaryReader's ReadString method:
using (BinaryReader reader = new BinaryReader(File.OpenRead("data.txt")))
{
    string line = null;
    long position = reader.BaseStream.Position;
    while (reader.PeekChar() > -1)
    {
        line = reader.ReadString();
        // parse the name out of the line here...
        Console.WriteLine("{0},{1}", position, line);
        position = reader.BaseStream.Position;
    }
}
The BinaryReader isn't buffered so you get the proper position to store and use later. The only hassle is parsing the name out of the line, which you may have to do with a StreamReader anyway.
Is the encoding a fixed-size one (e.g. ASCII or UCS-2)? If so, you could keep track of the character index (based on the number of characters you've seen) and find the binary index based on that.
Otherwise, no - you'd basically need to write your own StreamReader implementation which lets you peek at the binary index. It's a shame that StreamReader doesn't implement this, I agree.
I think that the FileHelpers library runtime records feature might help you. http://filehelpers.sourceforge.net/runtime_classes.html
A couple of items that may be of interest.
1) If the lines are a fixed number of characters in length, that is not necessarily useful information if the character set has variable-size characters (like UTF-8). So check your character set.
2) You can ascertain the exact position of the file cursor from StreamReader by using the BaseStream.Position value IF you Flush() the buffers first (which will force the current position to be where the next read will begin - one byte after the last byte read).
3) If you know in advance that the exact length of each record will be the same number of characters, and the character set uses fixed-width characters (so each line is the same number of bytes long), then you can use FileStream with a fixed buffer size to match the size of a line, and the position of the cursor at the end of each read will be, perforce, the beginning of the next line.
4) Is there any particular reason why, if the lines are the same length (assuming in bytes here) that you don't simply use line numbers and calculate the byte-offset in the file based on line size x line number?
Are you sure that the file is "too large"? Have you tried it that way and has it caused a problem?
If you allocate a large amount of memory, and you aren't using it right now, Windows will just swap it out to disk. Hence, by accessing it from "memory", you will have accomplished what you want -- random access to the file on disk.
This exact question was asked in 2006 here: http://www.devnewsgroups.net/group/microsoft.public.dotnet.framework/topic40275.aspx
Summary:
"The problem is that the StreamReader buffers data, so the value returned in
BaseStream.Position property is always ahead of the actual processed line."
However, "if the file is encoded in a text encoding which is fixed-width, you could keep track of how much text has been read and multiply that by the width"
and if not, you can just use the FileStream and read a char at a time and then the BaseStream.Position property should be correct
Starting with .NET 6, the methods in the System.IO.RandomAccess class are the official and supported way to randomly read and write to a file. These APIs work with Microsoft.Win32.SafeHandles.SafeFileHandle, which can be obtained with the new System.IO.File.OpenHandle function, also introduced in .NET 6.
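For illustration, here is a small sketch of that API (the file name, offset, and buffer size are made up for the example):

using System;
using System.IO;
using System.Text;
using Microsoft.Win32.SafeHandles;

class RandomAccessExample
{
    static void Main()
    {
        // Open a handle; no stream position is maintained.
        using SafeFileHandle handle = File.OpenHandle("data.txt");
        long length = RandomAccess.GetLength(handle);

        // Read up to 100 bytes starting at byte offset 4096.
        byte[] buffer = new byte[100];
        int bytesRead = RandomAccess.Read(handle, buffer, fileOffset: 4096);
        Console.WriteLine(Encoding.UTF8.GetString(buffer, 0, bytesRead));
    }
}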
