I have a text file called message.txt which contains the text "abcdef".
Now, the code below outputs:
a if I seek with offset 0
? if I seek with offset 1 or 2
a (again) if I seek with offset 3
b if I seek with offset 4
c if I seek with offset 5
and so on.
static void Main(string[] args)
{
    StreamReader sr = new StreamReader("Message.txt");
    sr.BaseStream.Seek(2, SeekOrigin.Begin);
    Console.WriteLine((char)sr.Read());
}
QUESTION
From offset 3 onward it behaves as expected, but ideally the same output should have started at offset 1. Hence:
Q1. Why does the same output, a, occur at offsets 0 and 3?
Q2. Why do I get a ? for offsets 1 and 2?
Thanks
You have a BOM at the start of your file: a byte order mark, the Unicode header.
Look at your file in a hex editor. (Rename it to .bin and open it in Visual Studio.) This particular BOM tells the computer that this is a UTF-8 file.
There are three likely factors here:
encoding: in most encodings, bytes != characters
buffers: if you Seek a base stream, you must tell the reader to drop any buffers it may have, or it will get badly confused; to do this call sr.DiscardBufferedData()
byte order marks at the start of the file
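Putting those three factors together, here is a minimal sketch of a corrected read, assuming Message.txt was saved as UTF-8 with a 3-byte BOM (as the question's symptoms suggest):

```csharp
using System;
using System.IO;

class Program
{
    static void Main()
    {
        // A UTF-8 file with a BOM begins EF BB BF, so "abcdef" starts at byte 3.
        using (var sr = new StreamReader("Message.txt"))
        {
            // Seek past the BOM plus one character, i.e. to 'b' at byte 4.
            sr.BaseStream.Seek(4, SeekOrigin.Begin);

            // The reader may hold buffered data from the old position;
            // discard it so the next Read reflects the new position.
            sr.DiscardBufferedData();

            Console.WriteLine((char)sr.Read()); // b
        }
    }
}
```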
Related
I'm working on a file reader for a custom file format. Part of the format is like the following:
[HEADER]
...
[EMBEDDED_RESOURCE_1]
[EMBEDDED_RESOURCE_2]
[EMBEDDED_RESOURCE_3]
...
Now what I'm trying to do is to open a new stream whose boundaries cover only one resource. For instance, EMBEDDED_RESOURCE_1's first byte is at byte 100 and its length is 200 bytes, so its boundaries are 100 - 300. Is there any way to do so without using any buffers?
Thanks!
Alternatively - MemoryStream.
Before reading the necessary number of bytes, set the initial position via the Position property.
But it is necessary to read the whole slice into the MemoryStream first.
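A sketch of that approach, using the offsets from the question (the file name container.bin is hypothetical). Note this copies the slice into memory rather than creating a true bounded sub-stream:

```csharp
using System;
using System.IO;

class Program
{
    static void Main()
    {
        // Hypothetical offsets from the question: resource starts at byte 100,
        // and is 200 bytes long.
        const int offset = 100;
        const int length = 200;

        using (var fs = File.OpenRead("container.bin"))
        {
            fs.Position = offset;          // jump to the resource
            var slice = new byte[length];
            int read = 0;
            while (read < length)          // Read may return fewer bytes than asked
            {
                int n = fs.Read(slice, read, length - read);
                if (n == 0) throw new EndOfStreamException();
                read += n;
            }

            using (var ms = new MemoryStream(slice, writable: false))
            {
                // ms is now bounded to the single resource: Position 0..200.
                Console.WriteLine(ms.Length);
            }
        }
    }
}
```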
I've been reading about this topic and didn't find the specific info for my question:
(maybe the following is incorrect - but please do correct me)
Every file (text or binary) stores BYTES.
A byte is 8 bits, hence 2^8 = 256 possible codes (values 0..255).
Those codes divide into 2 groups:
0..127: textual chars
128..255: special chars.
So a binary file contains char codes from the whole range 0..255 (ASCII chars + special chars).
1 ) correct ?
2) Now, let's say I'm saving one INT in a binary file (4 bytes on a 32-bit system).
How does the file tell the program reading it that these are not 4 single unrelated bytes but an int which is 4 bytes?
Underneath, all files are stored as bytes, so in a sense what you're saying is correct. However, if you open a file that's intended to be read as binary and try to read it in a text editor, it will look like gibberish.
How does a program know whether to read a file as text or as binary? (ie as special sets of ASCII or other encoded bytes, or just as the underlying bytes with a different representation)?
Well, it doesn't know - it just does what it's told.
In Windows, you open .txt files in notepad - notepad expects to be reading text. Try opening a binary file in notepad. It will open, you will see stuff, but it will be rubbish.
If you're writing your own program you can write using BinaryWriter and read using BinaryReader if you want to store everything as binary. What would happen if you wrote using BinaryWriter and read using StreamReader?
To answer your specific example:
using (var test = new BinaryWriter(new FileStream(@"c:\test.bin", FileMode.Create)))
{
    test.Write(10);
    test.Write("hello world");
}

using (var test = new BinaryReader(new FileStream(@"c:\test.bin", FileMode.Open)))
{
    var out1 = test.ReadInt32();
    var out2 = test.ReadString();
    Console.WriteLine("{0} {1}", out1, out2);
}
See how you have to read in the same order that's written? The file doesn't tell you anything.
Now switch the second part around:
using (var test = new BinaryReader(new FileStream(@"c:\test.bin", FileMode.Open)))
{
    var out1 = test.ReadString();
    var out2 = test.ReadInt32();
    Console.WriteLine("{0} {1}", out1, out2);
}
You'll get gibberish out (if it works at all). Yet there is nothing you can read in the file that will tell you that beforehand. There is no special information there. The program must know what to do based on some out of band information (a specification of some sort).
so a binary file contains char codes from the whole range 0..255 (ASCII chars + special chars).
No, a binary file just contains bytes. Values between 0 and 255. They should only be considered as character at all if you decide to ascribe that meaning to them. If it's a binary file (e.g. a JPEG) then you shouldn't do that - a byte 65 in image data isn't logically an 'A' - it's whatever byte 65 means at that point in the file.
(Note that even text files aren't divided into "ASCII characters" and "special characters" - it depends on the encoding. In UTF-16, each code unit takes two bytes regardless of its value. In UTF-8 the number of bytes depends on the character you're trying to represent.)
how does the file tell the program reading it that these are not 4 single unrelated bytes but an int which is 4 bytes?
The file doesn't tell the program. The program has to know how to read the file. If you ask Notepad to open a JPEG file, it won't show you an image - it will show you gibberish. Likewise if you try to force an image viewer to open a text file as if it were a JPEG, it will complain that it's broken.
Programs reading data need to understand the structure of the data they're going to read - they have to know what to expect. In some cases the format is quite flexible, like XML: there are well-specified layers, but then the program reads the values with higher-level meaning - elements, attributes etc. In other cases, the format is absolutely precise: first you'll start with a 4 byte integer, then two 2-byte integers or whatever. It depends on the format.
EDIT: To answer your specific (repeated) comment:
In a cmd shell... you've written your binary file. I have no clue what you did there. How am I supposed to know whether to read 4 single bytes or 4 bytes at once?
Either the program reading the data needs to know the meaning of the data or it doesn't. If it's just copying the file from one place to another, it doesn't need to know the meaning of the data. It doesn't matter whether it copies it one byte at a time or all four bytes at once.
If it does need to know the meaning of the data, then just knowing that it's a four byte integer doesn't really help much - it would need to know what that integer meant to do anything useful with it. So your file written from the command shell... what does it mean? If I don't know what it means, what does it matter whether I know to read one byte at a time or four bytes as an integer?
(As I mentioned above, there's an intermediate option where code can understand structure without meaning, and expose that structure to other code which then imposes meaning - XML is a classic example of that.)
It's all a matter of interpretation. Neither the file nor the system know what's going on in your file, they just see your storage as a sequence of bytes that has absolutely no meaning in itself. The same thing happens in your brain when you read a word (you attempt to choose a language to interpret it in, to give the sequence of characters a meaning).
It is the responsibility of your program to interpret the data the way you want it, as there is no single valid interpretation. For example, the sequence of bytes 48 65 6C 6C 6F 20 53 6F 6F 68 6A 75 6E can be interpreted as:
A string (Hello Soohjun)
A sequence of 13 one-byte characters (H, e, l, l, o, space, S, o, o, h, j, u, n)
A sequence of 3 unsigned ints followed by a character (1214606444, 1864389487, 1869113973, 110)
A character followed by a float followed by an unsigned int followed by a float (72, 6.977992E22, 542338927, 4.4287998E24), and so on...
You are the one choosing the meaning of those bytes, another program would make a different interpretation of the very same data, much the same a combination of letters has a different interpretation in say, English and French.
PS: By the way, that's the goal of reverse engineering file formats: find the meaning of each byte.
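A sketch of the same idea in code. Note that BitConverter uses the machine's native byte order (little-endian on typical x86/x64), so the integer value it produces differs from the big-endian readings listed above:

```csharp
using System;
using System.Text;

class Program
{
    static void Main()
    {
        // The byte sequence from the answer above.
        byte[] data = { 0x48, 0x65, 0x6C, 0x6C, 0x6F, 0x20,
                        0x53, 0x6F, 0x6F, 0x68, 0x6A, 0x75, 0x6E };

        // Interpretation 1: thirteen one-byte ASCII characters.
        Console.WriteLine(Encoding.ASCII.GetString(data)); // Hello Soohjun

        // Interpretation 2: a 32-bit integer from the first four bytes
        // (little-endian on most platforms).
        Console.WriteLine(BitConverter.ToInt32(data, 0));

        // Interpretation 3: a 32-bit float from the same four bytes.
        Console.WriteLine(BitConverter.ToSingle(data, 0));
    }
}
```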
I want to read/write a binary file which has the following structure:
The file is composed by "RECORDS". Each "RECORD" has the following structure:
I will use the first record as example
(red)START byte: 0x5A (always 1 byte, fixed value 0x5A)
(green) LENGTH bytes: 0x00 0x16 (always 2 bytes, value can change from
"0x00 0x02" to "0xFF 0xFF")
(blue) CONTENT: the number of bytes indicated by the decimal value of the LENGTH field minus 2. In this case the LENGTH field value is 22 (0x00 0x16 converted to decimal), therefore the CONTENT will contain 20 (22 - 2) bytes.
My goal is to read each record one by one, and write it to an output file.
Currently I have a read function and a write function (some pseudocode):
private void Read(BinaryReader binaryReader, BinaryWriter binaryWriter)
{
    byte START = 0x5A;
    byte[] content = null;
    byte[] length = new byte[2];

    while (binaryReader.PeekChar() != -1)
    {
        // Check the first byte, which should equal 0x5A
        if (binaryReader.ReadByte() != START)
        {
            throw new Exception("0x5A Expected");
        }

        // Extract the length field value
        length = binaryReader.ReadBytes(2);

        // Convert the length field to decimal
        int decimalLength = GetLength(length);

        // Extract the content field value
        content = binaryReader.ReadBytes(decimalLength - 2);

        // DO WORK
        // modify the content

        // Write the record
        Write(binaryWriter, content, length, START);
    }
}

private void Write(BinaryWriter binaryWriter, byte[] content, byte[] length, byte START)
{
    binaryWriter.Write(START);
    binaryWriter.Write(length);
    binaryWriter.Write(content);
}
This approach actually works.
However, since I am dealing with very large files, I find it performs poorly, because I read and write 3 times for each record. I would like to read big chunks of data instead of small amounts of bytes, and maybe work in memory, but my experience with streams stops at BinaryReader and BinaryWriter. Thanks in advance.
FileStream is already buffered, so I'd expect it to work pretty well. You could always wrap the original stream in a BufferedStream to add extra buffering if you really need to, but I doubt it would make a significant difference.
You say it's "not performing at all" - how fast is it working? How sure are you that the IO is where your time is going? Have you performed any profiling of the code?
I might also suggest that you read 3 (or 6?) bytes initially, instead of 2 separate reads. Put the initial bytes in a small array, check the 0x5A check byte, then the 2-byte length indicator, then the 3-byte AFP op-code, THEN read the remainder of the AFP record.
It's a small difference, but it gets rid of one of your read calls.
I'm no Jon Skeet, but I did work at one of the biggest print & mail shops in the country for quite a while, and we did mostly AFP output :-)
(usually in C, though)
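A sketch of that suggestion, assuming the 2-byte LENGTH field is big-endian as the question's example implies (0x00 0x16 = 22). It also avoids PeekChar, which decodes characters and can misbehave on raw binary data:

```csharp
using System;
using System.IO;

class RecordCopier
{
    public static void Copy(BinaryReader reader, BinaryWriter writer)
    {
        const byte START = 0x5A;
        byte[] header;

        // Read the start byte and the length field in a single 3-byte call,
        // using the returned array length (rather than PeekChar) to detect EOF.
        while ((header = reader.ReadBytes(3)).Length == 3)
        {
            if (header[0] != START)
                throw new InvalidDataException("0x5A expected");

            // Big-endian 2-byte length that counts itself, so the content
            // occupies length - 2 bytes.
            int length = (header[1] << 8) | header[2];
            byte[] content = reader.ReadBytes(length - 2);

            // DO WORK on content here, then write the record back out.
            writer.Write(header);
            writer.Write(content);
        }
    }
}
```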
System.IO.BinaryWriter outfile;
System.IO.FileStream fs = new System.IO.FileStream(some_object.text, System.IO.FileMode.Create);
outfile = new System.IO.BinaryWriter(fs);
outfile.Write('A'); // Line 1
outfile.Write('B'); // Line 2
outfile.Write('C'); // Line 3
outfile.Write( Convert.ToUInt16(some_object.text, 16) ); // Line 4
outfile.Write((ushort)0); // Line 5
Here i declare a BinaryWriter for creating my output file.
What I need to know clearly is how exactly the file is being written.
Do Lines 1, 2, 3 write to the file byte by byte, i.e. 1 byte at a time, if I am correct?
This some_object.text holds the value 2000.
How many bytes exactly does Line 4 write? (2 bytes/16 bits, since a UInt16 is 16 bits?)
Take a look at the chart from MSDN to see how many bytes are written:
BinaryWriter.Write Method
The BinaryWriter uses the BitConverter class to create sequences of bytes that are written to the underlying stream. A great way to understand what is going on, at the lowest level, is to use .NET Reflector. It can decompile assemblies and easily be used to figure out framework implementation details.
Most of the binary write methods use the native representation in little-endian form (though endianness is architecture specific and varies between platforms such as XBOX and Windows). The only exception is strings, which are by default encoded using UTF-8.
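One way to check the counts yourself is to write into a MemoryStream and inspect its length (a sketch; the char overload writes the character's UTF-8 encoding, so 'A' is a single byte):

```csharp
using System;
using System.IO;

class Program
{
    static void Main()
    {
        var ms = new MemoryStream();
        using (var w = new BinaryWriter(ms))
        {
            w.Write('A');          // char: 1 byte (UTF-8 encoding of 'A')
            w.Write((ushort)2000); // UInt16: always 2 bytes
            w.Write(2000);         // Int32: always 4 bytes
            w.Flush();

            Console.WriteLine(ms.Length); // 7
        }
    }
}
```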
I have a binary file. It consists of 4 messages, each 100 bytes in size.
I want to read the last 2 messages again. I am using a BinaryReader object.
I seek to position 200 and then read: BinaryReaderObject.Read(charBuffer, 0, 10000),
where charBuffer is big enough.
The amount read is always missing 1: instead of getting 200 I get 199, and instead of getting 400 I get 399.
I checked and saw that the size of the file is correct and that the data I get starts at the right place.
Thanks,
Try this code and see what happens with your file.
String message = @"Read {0} bytes into the buffer.";
String fileName = @"TEST.DAT";
Int32 recordSize = 100;
Byte[] buffer = new Byte[recordSize];

using (BinaryReader br = new BinaryReader(File.OpenRead(fileName)))
{
    br.BaseStream.Seek(2 * recordSize, SeekOrigin.Begin);
    Console.WriteLine(message, br.Read(buffer, 0, recordSize));
    Console.WriteLine(message, br.Read(buffer, 0, recordSize));
}
Console.ReadLine();
Console.ReadLine();
I get the following output with a 400 byte test file.
Read 100 bytes into the buffer.
Read 100 bytes into the buffer.
If I seek to 2 * recordSize + 1 or use a 399 byte file, I get the following output.
Read 100 bytes into the buffer.
Read 99 bytes into the buffer.
So it works as expected.
Hint: zero-based array indexes, and zero-based positions ...
First byte will start at position zero.
Seek to the end and print position. Is it as expected?
Print the position after reading the 199 bytes -- is it as expected?
Try to read 1 more byte from the position where you got 199 -- do you get EOF?
How are you checking the size of the file?
Diff the 199 bytes with the expected ones -- what is different?
Two things I would check
CR/LF transformations
That the size is what you think it is.
The problem was that I used a wrapper around the BinaryReader object.
The Read method has several overloads. Instead of using the byte[] signature, I used char[]. Until now it worked fine because the data was plain UTF-8 text, but when real binary data appeared at the beginning of each message it caused the problem.
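A sketch of that difference: the byte[] overload returns raw bytes, while the char[] overload decodes them (UTF-8 by default), so a multi-byte sequence in the data shrinks the reported count. Here the two bytes 0xC3 0xA9 decode to the single character 'é':

```csharp
using System;
using System.IO;

class Program
{
    static void Main()
    {
        // Four bytes: a 2-byte UTF-8 sequence (é) followed by "AB".
        byte[] raw = { 0xC3, 0xA9, 0x41, 0x42 };
        File.WriteAllBytes("test.bin", raw);

        using (var br = new BinaryReader(File.OpenRead("test.bin")))
        {
            var bytes = new byte[4];
            // byte[] overload: raw bytes, no decoding.
            Console.WriteLine(br.Read(bytes, 0, 4)); // 4
        }

        using (var br = new BinaryReader(File.OpenRead("test.bin")))
        {
            var chars = new char[4];
            // char[] overload: the first two bytes collapse into one char.
            Console.WriteLine(br.Read(chars, 0, 4)); // 3
        }
    }
}
```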