StreamReader returning another char - c#

I'm trying to read the content of a file with a StreamReader, that receives a FileStream. The file has some spaces inside (char 32) and the StreamReader is reading them as 0 (char 48). The screenshot shows the FileStream buffer and the StreamReader buffer. Both have the value 32, but when I call Read(), it returns 48. Am I missing something here? By the way, the code is running under .NET Compact Framework.
Screenshot: http://www.freeimagehosting.net/uploads/9f72b61bbe.png
The code that reads the data:
public void Read() {
using (StreamReader reader = new StreamReader(InputStream, Encoding.UTF8)) {
foreach (var property in DataObject.EnumerateProperties()) {
OffsetInfo offset = property.GetTextOffset();
reader.BaseStream.Position = offset.Start - 1;
StringBuilder builder = new StringBuilder(offset.Size);
int count = 0;
while (reader.Peek() >= 0 && count < offset.Size) {
char c = (char)reader.Read();
if ((int)c != 32 && c != '\r' && c != '\n') {
builder.Append(c);
count++;
} else {
reader.BaseStream.Position++;
}
}
property.SetValue(DataObject,
Convert.ChangeType(builder.ToString(), property.PropertyType, CultureInfo.CurrentCulture),
null
);
}
}
}
EDIT: Changing the encoding didn't work (neither Unicode nor Default).
EDIT: The input looks like this:
000636920000000532000404100100000001041000000001041000000001031000000000000000000000000000000000000000001730173017301730203020302030203021302130213021300027900267841515150000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000280010000000280010000000280010000020
260007464616011007464816011009005321011009005621011010041621011010041821011013574026011013574226011014564729011014564929011018343318021018343618021020035418021020035618021022583818021022584018021005474302031005474502031010311305031010311505031011265308031011265508031011265508031011274108031021524009
0310215242090310060151130310063110130310160022210310160024210310022837280310022839280310
00206377740002484841000029844400181529330003034081000000000000000000
The problem happens with the spaces that start in the third line and continue into the fourth.

I suspect your problem is the Encoding.ASCII. Are you positive your file is encoded this way? I'd wager your file is actually encoded with Encoding.Unicode, which is why you're encountering zeroes.
In this case you say your encoding is UTF-8, so set your encoding to Encoding.UTF8 and see what happens.

OK, I just ran a little test. Repositioning the BaseStream doesn't work for a TextReader, so you are simply reading from another position than you think you are (and are checking in the Watch window).
To solve it, you will have to create a new StreamReader for each property, and be careful not to close it (don't use a using block).
But I would go for reading it all at once (it is all text, right?) and operate on the string(s).
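The "read it all at once" suggestion could look roughly like the sketch below. FieldSlicer and Slice are hypothetical names, not part of the question's code, and the sketch assumes a single-byte encoding so that a character index equals a byte offset. It applies the same filter as the question's loop (spaces and line breaks are skipped and do not count toward the field size).

```csharp
using System;
using System.Text;

// Hypothetical helper: works on a string that was read in one go,
// so no stream repositioning is involved.
public static class FieldSlicer
{
    // Collects up to `size` characters starting at the 1-based offset `start`,
    // skipping spaces and line breaks (mirroring the filter in the question).
    public static string Slice(string text, int start, int size)
    {
        var builder = new StringBuilder(size);
        for (int i = start - 1; i < text.Length && builder.Length < size; i++)
        {
            char c = text[i];
            if (c != ' ' && c != '\r' && c != '\n')
            {
                builder.Append(c);
            }
        }
        return builder.ToString();
    }
}
```

The text itself could come from a single reader.ReadToEnd() call on the existing StreamReader, after which each property calls Slice with its own offset.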

Related

Seek and ReadLine in c# [duplicate]

Can you use StreamReader to read a normal text file, save the current position and close the StreamReader partway through, and then open a StreamReader again and resume reading from that position?
If not what else can I use to accomplish the same case without locking the file ?
I tried this but it doesn't work:
var fs = File.Open(@"C:\testfile.txt", FileMode.Open, FileAccess.Read);
var sr = new StreamReader(fs);
Debug.WriteLine(sr.ReadLine()); //Prints:firstline
var pos = fs.Position;
while (!sr.EndOfStream)
{
Debug.WriteLine(sr.ReadLine());
}
fs.Seek(pos, SeekOrigin.Begin);
Debug.WriteLine(sr.ReadLine());
//Prints Nothing, i expect it to print SecondLine.
Here is the other code I also tried :
var position = -1;
StreamReaderSE sr = new StreamReaderSE(@"c:\testfile.txt");
Debug.WriteLine(sr.ReadLine());
position = sr.BytesRead;
Debug.WriteLine(sr.ReadLine());
Debug.WriteLine(sr.ReadLine());
Debug.WriteLine(sr.ReadLine());
Debug.WriteLine(sr.ReadLine());
Debug.WriteLine("Wait");
sr.BaseStream.Seek(position, SeekOrigin.Begin);
Debug.WriteLine(sr.ReadLine());
I realize this is really belated, but I just stumbled onto this incredible flaw in StreamReader myself; the fact that you can't reliably seek when using StreamReader. Personally, my specific need is to have the ability to read characters, but then "back up" if a certain condition is met; it's a side effect of one of the file formats I'm parsing.
Using ReadLine() isn't an option because it's only useful in really trivial parsing jobs. I have to support configurable record/line delimiter sequences and support escape delimiter sequences. Also, I don't want to implement my own buffer so I can support "backing up" and escape sequences; that should be the StreamReader's job.
This method calculates the actual position in the underlying stream of bytes on demand. It works for UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE, and any single-byte encoding (e.g. code pages 1252, 437, 28591, etc.), regardless of the presence of a preamble/BOM. This version will not work for UTF-7, Shift-JIS, or other variable-byte encodings.
When I need to seek to an arbitrary position in the underlying stream, I directly set BaseStream.Position and then call DiscardBufferedData() to get StreamReader back in sync for the next Read()/Peek() call.
And a friendly reminder: don't arbitrarily set BaseStream.Position. If you bisect a character, you'll invalidate the next Read() and, for UTF-16/-32, you'll also invalidate the result of this method.
public static long GetActualPosition(StreamReader reader)
{
System.Reflection.BindingFlags flags = System.Reflection.BindingFlags.DeclaredOnly | System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance | System.Reflection.BindingFlags.GetField;
// The current buffer of decoded characters
char[] charBuffer = (char[])reader.GetType().InvokeMember("charBuffer", flags, null, reader, null);
// The index of the next char to be read from charBuffer
int charPos = (int)reader.GetType().InvokeMember("charPos", flags, null, reader, null);
// The number of decoded chars presently used in charBuffer
int charLen = (int)reader.GetType().InvokeMember("charLen", flags, null, reader, null);
// The current buffer of read bytes (byteBuffer.Length = 1024; this is critical).
byte[] byteBuffer = (byte[])reader.GetType().InvokeMember("byteBuffer", flags, null, reader, null);
// The number of bytes read while advancing reader.BaseStream.Position to (re)fill charBuffer
int byteLen = (int)reader.GetType().InvokeMember("byteLen", flags, null, reader, null);
// The number of bytes the remaining chars use in the original encoding.
int numBytesLeft = reader.CurrentEncoding.GetByteCount(charBuffer, charPos, charLen - charPos);
// For variable-byte encodings, deal with partial chars at the end of the buffer
int numFragments = 0;
if (byteLen > 0 && !reader.CurrentEncoding.IsSingleByte)
{
if (reader.CurrentEncoding.CodePage == 65001) // UTF-8
{
byte byteCountMask = 0;
while ((byteBuffer[byteLen - numFragments - 1] >> 6) == 2) // if the byte is "10xx xxxx", it's a continuation-byte
byteCountMask |= (byte)(1 << ++numFragments); // count bytes & build the "complete char" mask
if ((byteBuffer[byteLen - numFragments - 1] >> 6) == 3) // if the byte is "11xx xxxx", it starts a multi-byte char.
byteCountMask |= (byte)(1 << ++numFragments); // count bytes & build the "complete char" mask
// see if we found as many bytes as the leading-byte says to expect
if (numFragments > 1 && ((byteBuffer[byteLen - numFragments] >> 7 - numFragments) == byteCountMask))
numFragments = 0; // no partial-char in the byte-buffer to account for
}
else if (reader.CurrentEncoding.CodePage == 1200) // UTF-16LE
{
if (byteBuffer[byteLen - 1] >= 0xd8) // high-surrogate
numFragments = 2; // account for the partial character
}
else if (reader.CurrentEncoding.CodePage == 1201) // UTF-16BE
{
if (byteBuffer[byteLen - 2] >= 0xd8) // high-surrogate
numFragments = 2; // account for the partial character
}
}
return reader.BaseStream.Position - numBytesLeft - numFragments;
}
Of course, this uses Reflection to get at private variables, so there is risk involved. However, this method works with .Net 2.0, 3.0, 3.5, 4.0, 4.0.3, 4.5, 4.5.1, 4.5.2, 4.6, and 4.6.1. Beyond that risk, the only other critical assumption is that the underlying byte-buffer is a byte[1024]; if Microsoft changes it the wrong way, the method breaks for UTF-16/-32.
This has been tested against a UTF-8 file filled with Ažテ𣘺 (10 bytes: 0x41 C5 BE E3 83 86 F0 A3 98 BA) and a UTF-16 file filled with A𐐷 (6 bytes: 0x41 00 01 D8 37 DC). The point being to force-fragment characters along the byte[1024] boundaries, all the different ways they could be.
UPDATE (2013-07-03): I fixed the method, which originally used the broken code from that other answer. This version has been tested against data containing characters that require surrogate pairs. The data was put into 3 files, each with a different encoding: one UTF-8, one UTF-16LE, and one UTF-16BE.
UPDATE (2016-02): The only correct way to handle bisected characters is to directly interpret the underlying bytes. UTF-8 is properly handled, and UTF-16/-32 work (given the length of byteBuffer).
Yes you can, see this:
var sr = new StreamReader("test.txt");
sr.BaseStream.Seek(2, SeekOrigin.Begin); // Check sr.BaseStream.CanSeek first
Update:
Be aware that you can't necessarily use sr.BaseStream.Position for anything useful: StreamReader buffers its input, so the position will not reflect what you have actually read. You are likely to have problems finding the true position, because you can't just count characters (different encodings have different character lengths). I think the best way is to work with the FileStream itself.
Update:
Use the TGREER.myStreamReader from here:
http://www.daniweb.com/software-development/csharp/threads/35078
This class adds BytesRead etc. (it works with ReadLine() but apparently not with the other read methods),
and then you can do this:
File.WriteAllText("test.txt", "1234\n56789");
long position = -1;
using (var sr = new myStreamReader("test.txt"))
{
Console.WriteLine(sr.ReadLine());
position = sr.BytesRead;
}
Console.WriteLine("Wait");
using (var sr = new myStreamReader("test.txt"))
{
sr.BaseStream.Seek(position, SeekOrigin.Begin);
Console.WriteLine(sr.ReadToEnd());
}
If you just want to search for a start position within a text stream, I added this extension to StreamReader so that I could determine where the edit of the stream should occur. Granted, the logic increments by characters, but for my purposes it works great for getting the position within a text/ASCII-based file based upon a string pattern. You can then use that location as a start point for reading, to write a new file that excludes the data prior to the start point.
The returned position within the stream can be provided to Seek to start from that position within text-based stream reads. It works. I've tested it. However, there may be issues when matching to non-ASCII Unicode chars during the matching algorithm. This was based upon American English and the associated character page.
Basics: it scans through a text stream, character-by-character, looking for the sequential string pattern (that matches the string parameter) forward only through the stream. Once the pattern doesn't match the string parameter (i.e. going forward, char by char), then it will start over (from the current position) trying to get a match, char-by-char. It will eventually quit if the match can't be found in the stream. If the match is found, then it returns the current "character" position within the stream, not the StreamReader.BaseStream.Position, as that position is ahead, based on the buffering that the StreamReader does.
As indicated in the comments, this method WILL affect the position of the StreamReader, and it will be set back to the beginning (0) at the end of the method. StreamReader.BaseStream.Seek should be used to run to the position returned by this extension.
Note: the position returned by this extension will also work with BinaryReader.Seek as a start position when working with text files. I actually used this logic for that purpose to rewrite a PostScript file back to disk, after discarding the PJL header information to make the file a "proper" PostScript readable file that could be consumed by GhostScript. :)
The string to search for within the PostScript (after the PJL header) is: "%!PS-", which is followed by "Adobe" and the version.
public static class StreamReaderExtension
{
/// <summary>
/// Searches from the beginning of the stream for the indicated
/// <paramref name="pattern"/>. Once found, returns the position within the stream
/// that the pattern begins at.
/// </summary>
/// <param name="pattern">The <c>string</c> pattern to search for in the stream.</param>
/// <returns>If <paramref name="pattern"/> is found in the stream, then the start position
/// within the stream of the pattern; otherwise, -1.</returns>
/// <remarks>Please note: this method will change the current stream position of this instance of
/// <see cref="System.IO.StreamReader"/>. When it completes, the position of the reader will
/// be set to 0.</remarks>
public static long FindSeekPosition(this StreamReader reader, string pattern)
{
if (!string.IsNullOrEmpty(pattern) && reader.BaseStream.CanSeek)
{
try
{
reader.BaseStream.Position = 0;
reader.DiscardBufferedData();
StringBuilder buff = new StringBuilder();
long start = 0;
long charCount = 0;
List<char> matches = new List<char>(pattern.ToCharArray());
bool startFound = false;
while (!reader.EndOfStream)
{
char chr = (char)reader.Read();
if (chr == matches[0] && !startFound)
{
startFound = true;
start = charCount;
}
if (startFound && matches.Contains(chr))
{
buff.Append(chr);
if (buff.Length == pattern.Length
&& buff.ToString() == pattern)
{
return start;
}
bool reset = false;
if (buff.Length > pattern.Length)
{
reset = true;
}
else
{
string subStr = pattern.Substring(0, buff.Length);
if (buff.ToString() != subStr)
{
reset = true;
}
}
if (reset)
{
buff.Length = 0;
startFound = false;
start = 0;
}
}
charCount++;
}
}
finally
{
reader.BaseStream.Position = 0;
reader.DiscardBufferedData();
}
}
return -1;
}
}
FileStream.Position (or equivalently, StreamReader.BaseStream.Position) will usually be ahead -- possibly way ahead -- of the TextReader position because of the underlying buffering taking place.
If you can determine how newlines are handled in your text files, you can add up the number of bytes read based on line lengths and end-of-line characters.
File.WriteAllText("test.txt", "1234" + System.Environment.NewLine + "56789");
long position = -1;
long bytesRead = 0;
int newLineBytes = System.Environment.NewLine.Length;
using (var sr = new StreamReader("test.txt"))
{
string line = sr.ReadLine();
bytesRead += line.Length + newLineBytes;
Console.WriteLine(line);
position = bytesRead;
}
Console.WriteLine("Wait");
using (var sr = new StreamReader("test.txt"))
{
sr.BaseStream.Seek(position, SeekOrigin.Begin);
Console.WriteLine(sr.ReadToEnd());
}
For more complex text file encodings you might need to get fancier than this, but it worked for me.
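For encodings where a character is not always one byte, one way to "get fancier" is to ask the encoding how many bytes each line occupies instead of using the string length. This is a minimal sketch; LineOffsets and LineByteLength are hypothetical names, and it assumes you know which newline sequence the file uses and that any BOM at the start of the file is accounted for separately.

```csharp
using System.Text;

// Hypothetical helper for tracking byte offsets while reading lines.
public static class LineOffsets
{
    // Bytes occupied on disk by one line plus its newline sequence
    // in the given encoding (a BOM, if present, is not included).
    public static int LineByteLength(string line, string newLine, Encoding encoding)
    {
        return encoding.GetByteCount(line) + encoding.GetByteCount(newLine);
    }
}
```

In the loop above, bytesRead would then grow by LineByteLength(line, Environment.NewLine, sr.CurrentEncoding) rather than by line.Length + newLineBytes.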
I found the suggestions above to not work for me -- my use case was to simply need to back up one read position (I'm reading one char at a time with a default encoding). My simple solution was inspired by above commentary ... your mileage may vary...
I saved the BaseStream.Position before reading, then determined if I needed to back up... if yes, then set position and invoke DiscardBufferedData().
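The effect of DiscardBufferedData() after repositioning can be seen in a small sketch like the one below. SeekDemo and ReadAfterSeek are hypothetical names, and the sketch assumes ASCII content in a MemoryStream, so byte offsets and character indexes coincide.

```csharp
using System;
using System.IO;
using System.Text;

public static class SeekDemo
{
    // Returns the character read after moving BaseStream to byteOffset.
    // When discard is false, the reader keeps serving its stale buffer.
    public static char ReadAfterSeek(string content, long byteOffset, bool discard)
    {
        using (var stream = new MemoryStream(Encoding.ASCII.GetBytes(content)))
        using (var reader = new StreamReader(stream, Encoding.ASCII))
        {
            reader.Read();                            // first read fills the internal buffer
            reader.BaseStream.Position = byteOffset;  // move the underlying stream
            if (discard)
            {
                reader.DiscardBufferedData();         // drop the buffered bytes and chars
            }
            return (char)reader.Read();
        }
    }
}
```

With the content "ABCDEFGHIJ" and offset 5, the call without discarding returns 'B' (the next buffered character), while the call with discarding returns 'F'.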
From MSDN:
StreamReader is designed for character input in a particular encoding, whereas the Stream class is designed for byte input and output. Use StreamReader for reading lines of information from a standard text file.
In most of the examples involving StreamReader, you will see reading line by line using the ReadLine(). The Seek method comes from Stream class which is basically used to read or handle data in bytes.

Why can I use FileSystemObjects for reading and writing client-side binary files, but not for reading and sending them to the server?

I created a binary file in the following manner (to ensure that all the possible byte values are in the binary file):
using (var fs = File.Create(fileName))
{
// Use an int counter so that 255 (Byte.MaxValue) is written as well
for (int b = 0; b <= Byte.MaxValue; b++)
{
fs.WriteByte((byte)b);
}
}
and I read it in this way (for testing that it works):
using (var fs = File.Open(fileName, FileMode.Open))
{
long oldPos = -1;
long pos = 0;
while (oldPos != pos)
{
oldPos = pos;
Console.WriteLine(Convert.ToString(fs.ReadByte(), 2).PadLeft(8, '0'));
pos = fs.Position;
}
}
In javascript in IE, copying the file (reading it, then writing it back out) works just fine when using the FileSystemObject:
var fso = new ActiveXObject("Scripting.FileSystemObject");
var from = fso.OpenTextFile(fileToRead, 1, 0); // read, ASCII (-1 for unicode)
var to = fso.CreateTextFile(fileToWriteTo, true, false);
while (!from.AtEndOfStream) {
to.Write(from.Read(1));
}
from.Close();
to.Close();
When I read the outputted binary file, I get 00000000,00000001,00000010... etc.
But attempting to read it into javascript appears to cause the failure to read:
var fso = new ActiveXObject("Scripting.FileSystemObject");
var from = fso.OpenTextFile(fileToRead, 1, 0);
var test = [];
while (!from.AtEndOfStream) {
test.push(0xff & from.Read(1)); // make it a byte.
}
from.Close();
which results in test containing mostly 0s, with only a few non-zero items.
Can somebody please explain why it works for one and not the other? What do I need to do to get the values into javascript?
By the way, here is a related read on reading files off the client machine:
First, do you know whether the final array is the same length as the file?
Try the read and the push in separate operations, like:
...
Test2 = from.Read(1);
// Possibly display value of Test2 as string
test.push(Test2);
...
Also, you could try this with Text data to see if it is the binary nature of the file/data causing the issue.

Stream.Seek behaviour

I came across this earlier today and am not sure why it happens.
I have the following code that sets the internal position of the file stream to a location so I can read the number of lines from that position. It is similar to this other post but when I used stream.Seek I see strange results
StringBuilder b = new StringBuilder();
using(var stream = _streamFactory.CreateStream())
using (var streamReader = new System.IO.StreamReader(stream, _streamFactory.Encoding))
{
stream.Seek(startPosition, System.IO.SeekOrigin.Begin);
string value;
for (int i = 0; i < lines; i++)
{
if ((value = streamReader.ReadLine()) != null)
{
b.AppendLine(value);
}
}
}
Now what I am doing is reading a file using the UTF-8 encoding so I know there are extra bits at the start of the file that denote this but are not part of the text I want to extract.
Say, for example, I have the following text in the file:
Hello my name is bob
So if I set startPosition to 0, my result is Hello my name is bob. However, when I set startPosition to 1, I don't get ello my name is bob but rather ##Hello my name is bob, where ## are 2 bytes from the encoding bits.
So my question is why when I set .Seek(0) and then do a ReadLine I get the correct line but Seek(1) will return the 2nd and 3rd bytes of the encoding?
Seek(3) will also yield the same results as Seek(0). If this was consistent I would have thought Seek(0) would return ###Hello my name is bob
Also how do I know how many extra bytes are at the start of the file without reading it (but knowing the encoding)?
I tried looking at the disassembled code and had to stop before my brain went on strike.
Note:
The stream factory in this case just creates a FileStream. I do this so I can unit test this code using a MemoryStream.
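The first bytes of the file are a byte order mark (BOM) that identifies its encoding; for UTF-8 the BOM is three bytes (EF BB BF), while for UTF-16 it is two. Take a look at this article.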
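As for knowing how many extra bytes to expect when only the encoding is known, Encoding.GetPreamble() returns the byte order mark that the encoding would write. The sketch below uses hypothetical names (BomInfo, PreambleLength, StartsWithPreamble). Note that a file is not required to contain a BOM, so the second method checks the actual leading bytes rather than assuming one is present.

```csharp
using System;
using System.Text;

public static class BomInfo
{
    // Number of bytes the encoding writes as a byte order mark (0 if none).
    public static int PreambleLength(Encoding encoding)
    {
        return encoding.GetPreamble().Length;
    }

    // True if `data` begins with the encoding's byte order mark.
    public static bool StartsWithPreamble(byte[] data, Encoding encoding)
    {
        byte[] preamble = encoding.GetPreamble();
        if (preamble.Length == 0 || data.Length < preamble.Length)
        {
            return false;
        }
        for (int i = 0; i < preamble.Length; i++)
        {
            if (data[i] != preamble[i])
            {
                return false;
            }
        }
        return true;
    }
}
```

For Encoding.UTF8 the preamble length is 3, so a start position would typically need to be offset by 3 bytes when the file does begin with the BOM.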

Unable to read beyond the end of the stream

I did some quick method to write a file from a stream but it's not done yet. I receive this exception and I can't find why:
Unable to read beyond the end of the stream
Is there anyone who could help me debug it?
public static bool WriteFileFromStream(Stream stream, string toFile)
{
FileStream fileToSave = new FileStream(toFile, FileMode.Create);
BinaryWriter binaryWriter = new BinaryWriter(fileToSave);
using (BinaryReader binaryReader = new BinaryReader(stream))
{
int pos = 0;
int length = (int)stream.Length;
while (pos < length)
{
int readInteger = binaryReader.ReadInt32();
binaryWriter.Write(readInteger);
pos += sizeof(int);
}
}
return true;
}
Thanks a lot!
Not really an answer to your question but this method could be so much simpler like this:
public static void WriteFileFromStream(Stream stream, string toFile)
{
// dont forget the using for releasing the file handle after the copy
using (FileStream fileToSave = new FileStream(toFile, FileMode.Create))
{
stream.CopyTo(fileToSave);
}
}
Note that I also removed the return value, since it's pretty much useless: in your code there is only one return statement, and it always returns true.
Apart from that, you perform a Length check on the stream, but many streams don't support checking Length.
As for your problem: you first check whether the stream is at its end, and if not, you read 4 bytes. Here is the problem. Let's say you have an input stream of 6 bytes. First you check whether the stream is at its end. The answer is no, since there are 6 bytes left. You read 4 bytes and check again. Of course the answer is still no, since there are 2 bytes left. Now you read another 4 bytes, but that of course fails since there are only 2 bytes left (ReadInt32 reads the next 4 bytes).
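Where Stream.CopyTo is not available (it was added in .NET 4.0), a plain buffered loop avoids the 4-byte assumption entirely. This is a minimal sketch with hypothetical names (StreamCopy, CopyAll); it relies only on Stream.Read returning 0 at the end of the stream, so it also works for streams that don't support Length.

```csharp
using System.IO;

public static class StreamCopy
{
    // Copies everything from source to destination in chunks and
    // returns the number of bytes copied. Works for any byte count,
    // including lengths that are not a multiple of 4.
    public static long CopyAll(Stream source, Stream destination)
    {
        var buffer = new byte[4096];
        long total = 0;
        int read;
        while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
        {
            destination.Write(buffer, 0, read);
            total += read;
        }
        return total;
    }
}
```

The destination would be the FileStream from the simplified method above, still wrapped in a using block so the file handle is released.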
I presume that the input stream contains only ints (Int32). You need to test with the PeekChar() method:
while (binaryReader.PeekChar() != -1)
{
int readInteger = binaryReader.ReadInt32();
binaryWriter.Write(readInteger);
}
You are doing while (pos < length) and length is the actual length of the stream in bytes. So you are effectively counting the bytes in the stream and then trying to read that many number of ints (which is incorrect). You could take length to be stream.Length / 4 since an Int32 is 4 bytes.
try
int length = (int)binaryReader.BaseStream.Length;
After the binary reader has read the stream, the stream's position is at the end; you have to set the position back to zero with "stream.Position = 0;".

Extracting Byte Arrays from a File

I'm trying to read a file and extract 2 blocks of data, let's call them block1 and block2, from the file, where the file would contain many blocks of data. Both blocks need to be returned in a byte array. Block1 would begin at the place in the file where the line begins "block1:" followed by the number of bytes to read. Block2, not necessarily appearing after block1, would begin at the place in the file where the line begins "block2:" followed by the number of bytes to read. I am limited to .NET 3.5 at the highest.
You can use File.ReadAllBytes and extract your blocks from the returned byte[] using one of the Array.Copy overloads if you know the indexes they are in.
As others have mentioned, without header information you'll need to, at the very least, stream the contents of the file through a filter of some kind looking for your "block" markers.
If you do have header information (or at least some information somewhere as to the offset of your block markers), you could use a memory mapped file:
http://www.developer.com/net/article.php/3828586/Using-Memory-Mapped-Files-in-NET-40.htm
This requires .NET 4.0, although you could also use the Win32 API if you're not using .NET 4.
Without any sort of header information in your file, you'll have to scan the entire file, searching for your block1: or block2: markers.
Update:
Here's a sample of how you'd do this (not necessarily the best implementation):
byte[] GetBlockOfData(string fileName, string blockName)
{
var allBytes = File.ReadAllBytes(fileName);
// Assuming block names are ASCII-encoded
var blockMarker = Encoding.ASCII.GetBytes(blockName + ":");
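// Scan for the first byte of the marker; stop early enough that the
// marker and the length byte after it still fit inside the array
for (var i = 0; i < allBytes.Length - blockMarker.Length; i++)
{
if (allBytes[i] == blockMarker[0])
{
// See if this is the entire marker
var isMatch = true;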
for (var j = 0; j < blockMarker.Length; j++)
{
if (allBytes[i + j] != blockMarker[j])
{
isMatch = false;
break;
}
}
if (isMatch)
{
// Assuming it's a byte...
var blockLength = allBytes[i + blockMarker.Length];
var result = new byte[blockLength];
Array.Copy(
allBytes, i + blockMarker.Length + 1, result, 0,
blockLength);
return result;
}
}
}
return null;
}
