.NET C# - Random access in text files - no easy way?

I've got a text file that contains several 'records' inside of it. Each record contains a name and a collection of numbers as data.
I'm trying to build a class that will read through the file, present only the names of all the records, and then allow the user to select which record data he/she wants.
The first time I go through the file, I only read header names, but I can keep track of the 'position' in the file where the header is. I need random access to the text file to seek to the beginning of each record after a user asks for it.
I have to do it this way because the file is too large to be read in completely in memory (1GB+) with the other memory demands of the application.
I've tried using the .NET StreamReader class to accomplish this (it provides very easy-to-use ReadLine functionality), but there is no way to capture the true position in the file: the BaseStream.Position value is skewed by the buffer the class uses.
Is there no easy way to do this in .NET?

There are some good answers provided, but I couldn't find any source code that would work in my very simplistic case. Here it is, with the hope that it'll save someone else the hour that I spent searching around.
The "very simplistic case" that I refer to is: the text encoding is fixed-width, and the line-ending characters are the same throughout the file. This code works well in my case (where I'm parsing a log file, and I sometimes have to seek ahead in the file and then come back). I implemented just enough to do what I needed to do (e.g. only one constructor, and only overriding ReadLine()), so most likely you'll need to add code... but I think it's a reasonable starting point.
public class PositionableStreamReader : StreamReader
{
    public PositionableStreamReader(string path)
        : base(path)
    { }

    private int myLineEndingCharacterLength = Environment.NewLine.Length;
    public int LineEndingCharacterLength
    {
        get { return myLineEndingCharacterLength; }
        set { myLineEndingCharacterLength = value; }
    }

    public override string ReadLine()
    {
        string line = base.ReadLine();
        if (null != line)
            // Only valid for single-byte fixed-width encodings, where
            // character count equals byte count.
            myStreamPosition += line.Length + myLineEndingCharacterLength;
        return line;
    }

    private long myStreamPosition = 0;
    public long Position
    {
        get { return myStreamPosition; }
        set
        {
            myStreamPosition = value;
            this.BaseStream.Position = value;
            // Throw away buffered data so the next read starts at the new position.
            this.DiscardBufferedData();
        }
    }
}
Here's an example of how to use the PositionableStreamReader:
PositionableStreamReader sr = new PositionableStreamReader("somepath.txt");

// read some lines
for (int i = 0; i < 10; i++)
    sr.ReadLine();

// bookmark the current position
long streamPosition = sr.Position;

// read some more lines
for (int i = 0; i < 10; i++)
    sr.ReadLine();

// go back to the bookmarked position
sr.Position = streamPosition;

// re-read the lines after the bookmark
for (int i = 0; i < 10; i++)
    sr.ReadLine();

FileStream has the Seek() method.

You can use a System.IO.FileStream instead of StreamReader. If you know exactly what the file contains (the encoding, for example), you can do everything you would with StreamReader.
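For instance, here's a minimal sketch of that idea, assuming a single-byte encoding such as ASCII and a hypothetical records.txt; because the bytes are read and decoded by hand, the position is never skewed by a text reader's buffer:

using System;
using System.IO;
using System.Text;

class RecordScanner
{
    static void Main()
    {
        using (FileStream fs = new FileStream("records.txt", FileMode.Open, FileAccess.Read))
        {
            long lineStart = 0;               // byte offset where the current line begins
            StringBuilder sb = new StringBuilder();
            int b;
            while ((b = fs.ReadByte()) != -1)
            {
                if (b == '\n')
                {
                    // fs.Position is trustworthy here: no text-level buffering is involved.
                    Console.WriteLine("{0}: {1}", lineStart, sb.ToString().TrimEnd('\r'));
                    sb.Length = 0;
                    lineStart = fs.Position;
                }
                else
                {
                    sb.Append((char)b);
                }
            }
        }
    }
}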

If you're flexible with how the data file is written and don't mind it being a little less text editor-friendly, you could write your records with a BinaryWriter:
using (BinaryWriter writer =
    new BinaryWriter(File.Open("data.txt", FileMode.Create)))
{
    writer.Write("one,1,1,1,1");
    writer.Write("two,2,2,2,2");
    writer.Write("three,3,3,3,3");
}
Then, initially reading each record is simple because you can use the BinaryReader's ReadString method:
using (BinaryReader reader = new BinaryReader(File.OpenRead("data.txt")))
{
    string line = null;
    long position = reader.BaseStream.Position;
    while (reader.PeekChar() > -1)
    {
        line = reader.ReadString();
        // parse the name out of the line here...
        Console.WriteLine("{0},{1}", position, line);
        position = reader.BaseStream.Position;
    }
}
The BinaryReader isn't buffered so you get the proper position to store and use later. The only hassle is parsing the name out of the line, which you may have to do with a StreamReader anyway.

Is the encoding a fixed-size one (e.g. ASCII or UCS-2)? If so, you could keep track of the character index (based on the number of characters you've seen) and find the binary index based on that.
Otherwise, no - you'd basically need to write your own StreamReader implementation which lets you peek at the binary index. It's a shame that StreamReader doesn't implement this, I agree.
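As a sketch of that bookkeeping (the widths below are assumptions for illustration), with a fixed-size encoding the binary index is just arithmetic on the character count:

// With a fixed-size encoding, byte offset = characters seen x bytes per character
// (plus any preamble/BOM at the start of the file).
long CharsToByteOffset(long charsSeen, int bytesPerChar, int preambleLength)
{
    return preambleLength + charsSeen * bytesPerChar;
}
// e.g. after 20 chars plus "\r\n" in ASCII: (20 + 2) * 1 = 22 bytes in.
// The same line in UCS-2:                   (20 + 2) * 2 = 44 bytes in.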

I think that the FileHelpers library's runtime records feature might help you. http://filehelpers.sourceforge.net/runtime_classes.html

A couple of items that may be of interest.
1) If the lines are a fixed number of characters in length, that is not necessarily useful information if the character set has variable-size characters (like UTF-8). So check your character set.
2) You can't recover the exact file position from StreamReader's BaseStream.Position, because the reader buffers ahead of the text it has handed back. Calling DiscardBufferedData() resynchronises the reader with BaseStream.Position, but it throws away the buffered, not-yet-read text rather than telling you where the last read ended.
3) If you know in advance that every record is exactly the same number of characters, and the character set uses fixed-width characters (so each line is the same number of bytes long), then you can use a FileStream with a buffer size matching the size of a line, and the position of the cursor at the end of each read will be, perforce, the beginning of the next line.
4) Is there any particular reason why, if the lines are the same length (in bytes), you don't simply use line numbers and calculate the byte offset in the file as line size x line number? (See the sketch below.)
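A sketch of point 4, with an assumed record width; the file name and width are made up for illustration:

// Hypothetical layout: every line is exactly lineBytes long, terminator included.
const int lineBytes = 52;          // assumed fixed record width in bytes
int lineNumber = 34572;            // zero-based line to fetch
using (FileStream fs = new FileStream("records.txt", FileMode.Open, FileAccess.Read))
{
    fs.Seek((long)lineNumber * lineBytes, SeekOrigin.Begin);
    using (StreamReader reader = new StreamReader(fs, Encoding.ASCII))
    {
        Console.WriteLine(reader.ReadLine()); // reads exactly the requested record
    }
}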

Are you sure that the file is "too large"? Have you tried it that way and has it caused a problem?
If you allocate a large amount of memory, and you aren't using it right now, Windows will just swap it out to disk. Hence, by accessing it from "memory", you will have accomplished what you want -- random access to the file on disk.

This exact question was asked in 2006 here: http://www.devnewsgroups.net/group/microsoft.public.dotnet.framework/topic40275.aspx
Summary:
"The problem is that the StreamReader buffers data, so the value returned in
BaseStream.Position property is always ahead of the actual processed line."
However, "if the file is encoded in a text encoding which is fixed-width, you could keep track of how much text has been read and multiply that by the width"
and if not, you can just use the FileStream and read a char at a time and then the BaseStream.Position property should be correct

Starting with .NET 6, the methods in the System.IO.RandomAccess class are the official and supported way to randomly read and write to a file. These APIs work with Microsoft.Win32.SafeHandles.SafeFileHandles, which can be obtained with the new System.IO.File.OpenHandle function, also introduced in .NET 6.
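For example (a sketch only; the path and offset are hypothetical), each call names its own file offset, so no shared file pointer is involved:

using System;
using System.IO;
using System.Text;
using Microsoft.Win32.SafeHandles;

class Program
{
    static void Main()
    {
        // .NET 6+ only; path and offset are placeholders.
        using SafeFileHandle handle =
            File.OpenHandle("records.txt", FileMode.Open, FileAccess.Read);

        byte[] buffer = new byte[64];
        long offset = 4096;                          // byte position of a known record
        int read = RandomAccess.Read(handle, buffer, offset);
        Console.WriteLine(Encoding.ASCII.GetString(buffer, 0, read));
    }
}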

Related

How can I write to the column I want with StreamWriter? [duplicate]

I am trying to use StreamReader and StreamWriter to open a text file (fixed width) and to modify a few specific columns of data. I have dates with the following format that are going to be converted to packed COMP-3 fields.
020100718F
020100716F
020100717F
020100718F
020100719F
I want to be able to read in the dates from a file using StreamReader, then convert them to packed fields (5 characters), and then output them using StreamWriter. However, I haven't found a way to use StreamWriter to write to a specific position, and am beginning to wonder if it is possible.
I have the following code snippet.
System.IO.StreamWriter writer;
this.fileName = @"C:\Test9.txt";
reader = new System.IO.StreamReader(System.IO.File.OpenRead(this.fileName));
currentLine = reader.ReadLine();
currentLine = currentLine.Substring(30, 10); // substring containing the date
reader.Close();
...
// Convert currentLine to packed field
...
writer = new System.IO.StreamWriter(System.IO.File.Open(this.fileName, System.IO.FileMode.Open));
writer.Write(currentLine);
Currently what I have does the following:
After:
!##$%0718F
020100716F
020100717F
020100718F
020100719F
!##$% = ASCII characters SO can't display
Any ideas? Thanks!
UPDATE
Information on Packed Fields COMP-3
Packed Fields are used by COBOL systems to reduce the number of bytes a field requires in files. Please see the following SO post for more information: Here
Here is a picture of the date "20120123" packed in COMP-3. This is my end result; I have included it because I wasn't sure if it would affect possible answers.
My question is how do you get StreamWriter to dynamically replace data inside a file and change the lengths of rows?
I have always found it better to read the input file, filter/process the data and write the output to a temporary file. When finished, delete the original file (or make a backup) and copy the temporary file over it. This way you haven't lost half your input file in case something goes wrong in the middle of processing.
You should probably be using a Stream directly (probably a FileStream). This would allow you to change position.
However, you're not going to be able to change record sizes this way, at least, not in-line. You can have one Stream reading from the original file, and another writing to a new, converted copy of the file.
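A sketch of that two-stream shape, assuming the date sits in columns 30-39 as in the question; PackDate is a hypothetical stand-in for the real COMP-3 conversion:

using System.IO;

class Converter
{
    // Placeholder: the real COMP-3 packing would go here.
    static string PackDate(string date) => date;

    static void Convert()
    {
        using (StreamReader reader = new StreamReader(@"C:\Test9.txt"))
        using (StreamWriter writer = new StreamWriter(@"C:\Test9.converted.txt"))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                string packed = PackDate(line.Substring(30, 10));
                // Rebuild the line with the (shorter) packed field in place of the date.
                writer.WriteLine(line.Substring(0, 30) + packed + line.Substring(40));
            }
        }
    }
}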
However, I haven't found a way to use StreamWriter to write to a specific position, and am beginning to wonder if it is possible.
You can use the StreamWriter.BaseStream.Seek method:
using (StreamWriter wr = new StreamWriter(File.Create(@"c:\Temp\aaa.txt")))
{
    wr.Write("ABC");
    wr.Flush(); // push buffered text to the base stream before seeking
    wr.BaseStream.Seek(0, SeekOrigin.Begin);
    wr.Write("Z"); // overwrites 'A', leaving "ZBC"
}

How to read specific line number of text file without looping ReadLine()?

Is there a class that lets you read lines by line number in C#?
I know about StreamReader and TextFieldParser but AFAIK those don't have this functionality. For example, if I know that line number 34572 in my text file contains certain data, it would be nice to not have to call StreamReader.ReadLine() 34572 times.
Unless the file has a precise and pre-determined format, for instance with every line having the same length, there is no way to seek within a text file.
In order to find the ith line, you must find the first i-1 line ends. And if you do not know anything about where those line ends could be, it follows that you must read the entire file up until the ith line.
This is not a problem of C# - this is a problem of line terminators. There's no way to skip to line 34572, because you don't know where it starts - the only thing you know is that it starts after you find 34571 \r\ns. If you need this functionality, you don't want to be using text files at all :)
A simple (but still slow) way would be to use File.ReadLines(...):
var line = File.ReadLines(fileName).Skip(34571).FirstOrDefault();
The best way, however, would be to know the actual byte offset of the line. If you remember the offset instead of the line number, you can simply seek in the stream and avoid reading the unnecessary data. Then you'd just continue reading the line as usual:
streamReader.BaseStream.Seek(offset, SeekOrigin.Begin);
var line = streamReader.ReadLine();
This is useful if the file is append-only (e.g. a log file) and you can afford to remember bookmarks. It will only work if the file isn't modified in front of the bookmark, though.
All in all, there are three options:
Add indexing - keep a table of all (or some) of the line-start offsets (see the sketch below)
Have a fixed line length - this allows you to seek predictably without an index; however, it doesn't work with variable-width encodings such as UTF-8, so it's of limited use these days
Parse the file - this pretty much amounts to reading the file line by line, the only optimisation being that you don't actually need to allocate the strings - a simple reusable byte buffer would do
There's a reason why text formats aren't preferred when performance is important - when you work with user-editable general text formats, the third option is the only option. Thus, reading from a JSON, XML or text log file will always mean reading up to at least the line you want.
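A sketch of the first option, with a hypothetical log.txt; scanning for the '\n' byte is safe even in UTF-8, since that byte never occurs inside a multi-byte sequence:

using System;
using System.Collections.Generic;
using System.IO;

class LineIndex
{
    static void Main()
    {
        // One pass to record where every line starts.
        List<long> lineOffsets = new List<long> { 0 };
        using (FileStream fs = File.OpenRead("log.txt"))
        {
            int b;
            while ((b = fs.ReadByte()) != -1)
                if (b == '\n')
                    lineOffsets.Add(fs.Position);
        }

        // Later: jump straight to line 34572 (zero-based) without rereading.
        using (FileStream fs = File.OpenRead("log.txt"))
        using (StreamReader reader = new StreamReader(fs))
        {
            fs.Seek(lineOffsets[34572], SeekOrigin.Begin);
            Console.WriteLine(reader.ReadLine());
        }
    }
}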
This is probably not your case, but if you knew the length of each line you could calculate the starting byte of the line you're looking for and seek straight to it:
// Assumes every line has the same known byte length (lineLength), so the
// start of line N is simply N * lineLength.
long startByte = (long)lineIndex * lineLength;
using (FileStream fs = new FileStream(fullFileName, FileMode.Open, FileAccess.Read))
{
    fs.Seek(startByte, SeekOrigin.Begin); // jump straight to the line's first byte
    int b;
    while ((b = fs.ReadByte()) != -1)
    {
        char value = (char)b;
        // ...process the character...

        // The line ends at '\n' (hex 0xA); '\r' (hex 0xD) precedes it in \r\n files.
        if (value == '\n')
            break;
    }
}

Open a file without loading it into memory at once

Hi, I have a problem to solve for college and I'm having a hard time understanding the statement of the problem.
This is the problem I have:
Reverse the order of bytes in a file without loading the entire file into memory at once. You have to solve this problem in C#, Java, PHP and Python.
Now there are two things that I do not understand here.
First, I am not sure if "bytes" refers to the actual characters of the file, or to something else. The problem does not state whether it is a text file or not.
Second, I am not sure how to open a file without actually loading it into memory.
This is how I would normally approach this problem, but I think if I do it this way the file gets loaded into memory:
string fileName = "file.txt";
reader = new StreamReader(fileName);
string line;
while ((line = reader.ReadLine()) != null)
{
    Console.WriteLine(line + "\n");
}
Also, I am not sure how I would actually reverse all the characters if I am reading it one line at a time.
EDIT: Sorry for posting in multiple languages. I do not want the solution to the problem, I only want to clarify it so I can solve it myself. I assumed that because I have to solve it in four different languages, the concept would apply to all four and it did not matter who answered.
Open a FileStream and use the Seek method to go to the end of the file. From there, go backwards, reading one byte at a time. This will read in reverse order. So, until you reach the beginning of the file, loop:
read 1 byte
// do whatever you want with that byte...write to another file?
seek back 2 bytes
As to efficiency, you can read a buffer of, say, 1024 bytes in memory. That way, you don't issue Read operations for each byte of the file. Once you have the buffer filled, reverse it and you're good to go.
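A sketch of that buffered variant (file names are made up): each pass seeks to the start of the previous chunk, reverses it in memory, and appends it to the output:

using System;
using System.IO;

class ReverseBytes
{
    static void Main()
    {
        const int BufferSize = 1024;
        using (FileStream input = File.OpenRead("input.bin"))
        using (FileStream output = File.Create("reversed.bin"))
        {
            byte[] buffer = new byte[BufferSize];
            long remaining = input.Length;
            while (remaining > 0)
            {
                int chunk = (int)Math.Min(BufferSize, remaining);
                remaining -= chunk;                      // start offset of this chunk
                input.Seek(remaining, SeekOrigin.Begin);
                int read = 0;                            // fill the buffer completely
                while (read < chunk)
                    read += input.Read(buffer, read, chunk - read);
                Array.Reverse(buffer, 0, chunk);         // reverse within the chunk
                output.Write(buffer, 0, chunk);
            }
        }
    }
}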

How to optimize "real-time" C# write-to-file & MATLAB read-from-file operation

I am trying to find a good method to write data from a NetworkStream (via C#) to a text file while "quasi-simultaneously" reading the newly written data from the text file into Matlab.
Basically, is there a good method or technique for coordinating write/read operations (from separate programs) such that a read operation does not block a write operation (and vice-versa) and the lag between successive write/reads is minimized?
Currently I am just writing (appending) data from the network stream to a text file via a WriteLine loop, and reading the data by looping Matlab's fscanf function which also marks the last element read and repositions the file-pointer to that spot.
Relevant portions of C# code:
(Note: The loop conditions I'm using are arbitrary, I'm just trying to see what works right now.)
NetworkStream network_stream = tcp_client.GetStream();
string path = @"C:\Matlab\serial_data.txt";
FileInfo file_info = new FileInfo(path);
using (StreamWriter writer = file_info.CreateText())
{
    string foo = "";
    writer.WriteLine(foo);
}
using (StreamWriter writer = File.AppendText(path))
{
    byte[] buffer = new byte[1];
    int maxlines = 100000;
    int lines = 0;
    while (lines <= maxlines)
    {
        network_stream.Read(buffer, 0, buffer.Length);
        byte byte2string = buffer[0];
        writer.WriteLine(byte2string);
        lines++;
    }
}
Relevant Matlab Code:
i = 0;
while i < 100
    a = fopen('serial_data.txt');
    b = fscanf(a, '%g', [1000 1]);
    fclose(a);
    i = i + 1;
end
When I look at the data read into Matlab there are large stretches of zeros in between the actual data, and the most disconcerting part is that number of consecutive data-points read between these "false zero" stretches varies drastically.
I was thinking about trying to insert some delays (Thread.sleep and wait(timerObject)) into C# and Matlab, respectively, but even then, I don't feel confident that will guarantee I always obtain the data received over the network stream, which is imperative.
Any advice/suggestions would be greatly appreciated.
Looks like there's an issue with how fscanf is being used in the reader on the Matlab side.
The reader code looks like it's going to reread the entire file each time through the loop, because it's re-opening it on each pass through the loop. Is this intentional? If you want to track the end of a file, you probably want to keep the file handle open, and just keep checking to see if you can read further data from it with repeated fscanf calls on the same open filehandle.
Also, that fscanf call looks like it might always return a zero-padded 1000-element array, regardless of how large the file it read was. Maybe that's where your "false zeros" are coming from. How many there are would vary with how much data is actually in the file and how often the Matlab code read it between writes. Grab the second output argument of fscanf to see how many elements it actually read.
[b,nRead] = fscanf(a, '%g', [1000 1]);
fprintf('Read %d numbers\n', nRead);
b = b(1:nRead);
Check the doc page for fscanf. In the "Output Arguments" section: "If the input contains fewer than sizeA elements, MATLABĀ® pads A with zeros."
And then you may want to look at this question: How can I do an atomic write/append in C#, or how do I get files opened with the FILE_APPEND_DATA flag?. Keeping the writes shorter than the output stream's buffer (like they are now) will make them atomic, and flushing after each write will make them visible to the reader in a timely manner.

What is the BEST way to replace text in a File using C# / .NET?

I have a text file that is being written to as part of a very large data extract. The first line of the text file is the number of "accounts" extracted.
Because of the nature of this extract, that number is not known until the very end of the process, but the file can be large (a few hundred megs).
What is the BEST way in C# / .NET to open a file (in this case a simple text file), and replace the data that is in the first "line" of text?
IMPORTANT NOTE: - I do not need to replace a "fixed amount of bytes" - that would be easy. The problem here is that the data that needs to be inserted at the top of the file is variable.
IMPORTANT NOTE 2: - A few people have asked about / mentioned simply keeping the data in memory and then replacing it... however that's completely out of the question. The reason why this process is being updated is because of the fact that sometimes it crashes when loading a few gigs into memory.
If you can, you should insert a placeholder which you overwrite at the end with the actual number padded with spaces.
If that is not an option write your data to a cache file first. When you know the actual number create the output file and append the data from the cache.
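A sketch of the cache-file route; ProduceRecords and the file names are stand-ins for the real extract:

using System.Collections.Generic;
using System.IO;

class Extract
{
    // Stand-in for the real record source.
    static IEnumerable<string> ProduceRecords() { yield return "acct1"; yield return "acct2"; }

    static void Run()
    {
        // 1. Stream the records to a cache file, counting as you go.
        int accountCount = 0;
        using (StreamWriter cache = new StreamWriter("extract.tmp"))
        {
            foreach (string record in ProduceRecords())
            {
                cache.WriteLine(record);
                accountCount++;
            }
        }

        // 2. Write the count first, then append the cached data stream-to-stream.
        using (StreamWriter output = new StreamWriter("extract.txt"))
        {
            output.WriteLine(accountCount);
            output.Flush(); // make sure the header precedes the copied bytes
            using (FileStream cache = File.OpenRead("extract.tmp"))
                cache.CopyTo(output.BaseStream);
        }
        File.Delete("extract.tmp");
    }
}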
BEST is very subjective. For any smallish file, you can easily open the entire file in memory and replace what you want using a string replace and then re-write the file.
Even for largish files, it would not be that hard to load into memory. In the days of multi-gigs of memory, I would consider hundreds of megabytes to still be easily done in memory.
Have you tested this naive approach? Have you seen a real issue with it?
If this is a really large file (gigabytes in size), I would consider writing all of the data first to a temp file and then write the correct file with the header line going in first and then appending the rest of the data. Since it is only text, I would probably just shell out to DOS:
TYPE temp.txt >> outfile.txt
I do not need to replace a "fixed amount of bytes"
Are you sure?
If you write a big number to the first line of the file (UInt32.MaxValue or UInt64.MaxValue), then when you find the correct actual number, you can replace that number of bytes with the correct number, but left padded with zeros, so it's still a valid integer.
e.g.
Replace 999999 - your "large number placeholder"
With 000100 - the actual number of accounts
That seems to do it, if I understand the question correctly?
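A sketch of that in-place patch, with an assumed width and path; because the replacement has exactly the placeholder's length, nothing after it shifts:

using System.IO;
using System.Text;

class HeaderPatch
{
    const string Path = "extract.txt"; // hypothetical output file
    const int Width = 10;              // digits reserved for the count

    static void Demo()
    {
        // While extracting: reserve a fixed-width numeric placeholder on line one.
        using (StreamWriter sw = new StreamWriter(Path))
        {
            sw.WriteLine(new string('9', Width)); // 9999999999
            sw.WriteLine("record data...");
        }

        // At the end: overwrite it with the real count, left-padded with zeros.
        int accounts = 42;
        using (FileStream fs = new FileStream(Path, FileMode.Open, FileAccess.Write))
        {
            byte[] digits = Encoding.ASCII.GetBytes(accounts.ToString().PadLeft(Width, '0'));
            fs.Write(digits, 0, digits.Length); // same length, so nothing shifts
        }
    }
}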
What is the BEST way in C# / .NET to open a file (in this case a simple text file), and replace the data that is in the first "line" of text?
How about placing a token {UserCount} at the top of the file when it is first created?
Then use a TextReader to read the file line by line. If it is the first line, look for {UserCount} and replace it with your value. Write out each line you read using a TextWriter.
Example:
int lineNumber = 1;
int userCount = 1234;
string line = null;
using (TextReader tr = File.OpenText("OriginalFile"))
using (TextWriter tw = File.CreateText("ResultFile"))
{
    while ((line = tr.ReadLine()) != null)
    {
        if (lineNumber == 1)
        {
            line = line.Replace("{UserCount}", userCount.ToString());
        }
        tw.WriteLine(line);
        lineNumber++;
    }
}
If the extracted file is only a few hundred megabytes, then you can easily keep all of the text in-memory until the extraction is complete. Then, you can write your output file as the last operation, starting with the record count.
Ok, earlier I suggested an approach that would be better if dealing with existing files.
However, in your situation you want to create the file and, during the create process, go back to the top and write out the user count. This will do just that.
Here is one way to do it that saves you having to write a temporary file.
private void WriteUsers()
{
    const int headerWidth = 30; // reserve enough room for the final count line
    ASCIIEncoding enc = new ASCIIEncoding();
    int userCounter = 0;
    using (StreamWriter sw = File.CreateText("myfile.txt"))
    {
        // Write a fixed-width placeholder line that will later hold the count;
        // padding it keeps the patch from overwriting the records below.
        sw.WriteLine(new string(' ', headerWidth));
        // Write out the records and keep track of the count.
        for (int i = 1; i < 100; i++)
        {
            sw.WriteLine("User" + i);
            userCounter++;
        }
        // Flush buffered text before touching the base stream, then patch line one.
        sw.Flush();
        sw.BaseStream.Position = 0;
        string userCountString = "User Count: " + userCounter;
        byte[] userCountBytes = enc.GetBytes(userCountString);
        sw.BaseStream.Write(userCountBytes, 0, userCountBytes.Length);
    }
}
