Why is last byte of copied file different? - c#

I am writing a program to read and write a specific binary file format.
I believe I have it 95% working, but I am running into a strange problem.
In the screenshot I am showing a program I wrote that compares two files byte by byte. The very last byte should be 0 but is FFFFFFF.
Using a binary viewer I can see no difference in the files. They appear to be identical.
Also, Windows tells me the Size of the files is different, but the Size on disk is the same.
Can someone help me understand what is going on?
The original is on the left and my copy is on the right.

Possible answers:
You forgot to call Stream.Close() or Stream.Dispose().
Your code is mixing up text and other kinds of data (e.g. casting a -1 returned from a Read() method into a char and then writing it); see the sketch below.
We need to see your code though...
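As a rough illustration of the first two points, a minimal copy loop might look like the sketch below (the file names are placeholders). The using blocks guarantee both streams are flushed and closed, and checking the return value of Read means a -1 end-of-stream marker is never written into the output:

using System.IO;

using (var input = new FileStream("original.bin", FileMode.Open, FileAccess.Read))
using (var output = new FileStream("copy.bin", FileMode.Create, FileAccess.Write))
{
    var buffer = new byte[4096];
    int bytesRead;
    // Read returns 0 at end of stream (it never returns -1 when filling a byte array),
    // so nothing bogus is appended to the copy.
    while ((bytesRead = input.Read(buffer, 0, buffer.Length)) > 0)
    {
        output.Write(buffer, 0, bytesRead);
    }
} // Dispose flushes and closes both streams here.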

Size on disk vs Size
First of all you should note that the Size on disk is almost always different from the Size value because the Size on disk value reflects the allocated drive storage but the Size reflects the actual length of the file.
A disk drive splits its space into blocks of the same size. For example, if your drive works with 4 KB blocks then even the smallest file, containing a single byte, will still take up 4 KB on the disk, as that is the minimum space it can allocate. Once you write 4 KB + 1 byte it will allocate another 4 KB block of storage, making it 8 KB on disk. Hence the Size on disk is always a multiple of 4 KB, so the fact that the source and destination files have the same Size on disk does not mean the files are the same length. (Different drives have different block sizes; it is not always 4 KB.)
The Size value is the actual defined length of the file data within the disk blocks.
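To make the arithmetic concrete, here is a small hedged sketch; the 4,096-byte cluster size and the file path are assumptions (real drives vary), and it simply rounds the actual length up to whole clusters the way the Size on disk figure does:

using System;
using System.IO;

const long clusterSize = 4096;                     // assumed cluster size; query the volume for the real value
var info = new FileInfo(@"C:\temp\example.bin");   // hypothetical file

long size = info.Length;                                                 // "Size": the actual byte count
long sizeOnDisk = (size + clusterSize - 1) / clusterSize * clusterSize;  // rounded up to whole clusters

Console.WriteLine($"Size: {size:N0} bytes, Size on disk: {sizeOnDisk:N0} bytes");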
Your Size Issue
As your Size values are different, the operating system has saved different lengths of data. Hence you have a fundamental problem with your copying routine, not just an issue with the last byte as you think at the moment. One of your files is 3,434 bytes and the other 2,008, which is a big difference. Your first step must be to work out why you have such a big difference.
If your hex comparing routine is simply looking at the block data then it will think they are the same length as it is comparing disk blocks rather than actual file length.

Related

Optimal size of files for parallel reading

I have a big file (around 8GB) with thousands of entries (an entry can be up to 1100 bytes).
I need to read each entry and thought about parallelizing this process by splitting the big file into multiple smaller files.
For reading the files I am using C# (FileStream object).
As far as I understood it makes sense that every smaller file has the same size.
The question is now what the optimal size for those smaller files is, or whether it even makes a difference if they are all 20 MB or 50 MB large.
In my research I came across different sizes (20 MB, 8 KB, 10 MB or 8 MB), but always without further explanation.
Update
Each entry has a fixed size in bytes that depends on the kind of entry, which is described in the first 23 bytes of the entry.
So I read the first 23 bytes, which tell me how many bytes I have to read until the entry ends. This way I managed to split the file by kind of entry (34 files).
Since not every file has the same size, the tasks for the different files terminate at different times.
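A hedged sketch of that entry-by-entry read is shown below; the header layout (entry kind in byte 0, body length as a 16-bit value at offset 1) is invented for illustration since the real format isn't given, and the file name is a placeholder:

using System;
using System.IO;

static byte[] ReadExactly(Stream s, int count)
{
    var buffer = new byte[count];
    int got = 0;
    while (got < count)                                // Read may return fewer bytes than asked for
    {
        int n = s.Read(buffer, got, count - got);
        if (n == 0) throw new EndOfStreamException();
        got += n;
    }
    return buffer;
}

using var fs = new FileStream("entries.bin", FileMode.Open, FileAccess.Read);  // hypothetical input

while (fs.Position < fs.Length)
{
    byte[] header = ReadExactly(fs, 23);               // fixed 23-byte header

    byte kind = header[0];                             // assumed: entry kind in byte 0
    int bodyLength = BitConverter.ToUInt16(header, 1); // assumed: body length at offset 1

    byte[] body = ReadExactly(fs, bodyLength);

    // ... route (kind, body) to the per-kind output file or processing task ...
}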

Extract Data of 2GB XML File in c#

I have a 2 GB XML file containing around 2.5 million records. I am not able to load it in C#; it throws an OutOfMemoryException. Please help me resolve this with a simple method.
A simple and general methodology when you have these problems:
As written by mjwills and TheGeneral, compile for 64 bit.
As written by Prateek, use XmlReader (see the sketch after this list). Don't load the whole file into memory; don't use XDocument/XmlDocument/XmlSerializer.
If the size of the output is proportional to the size of the input (you are converting between formats, for example), write the result of your reading out one piece at a time; if possible you shouldn't hold the whole output in memory at once. Read an object (a node) from the source file, do your processing, write the result to a new file or to a database, then discard the result of that processing.
If the output is instead a summary of the input (for example, you are calculating some statistics on it), so that the size of the output is much smaller than the size of the input, then it is normally fine to keep it in memory.
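A hedged sketch of that streaming approach: the element name "record" and its attributes are invented since the real schema isn't shown, and the file names are placeholders. Each record is read, its result is written out immediately, and nothing accumulates in memory:

using System.IO;
using System.Xml;

using var reader = XmlReader.Create("huge.xml");       // hypothetical 2 GB input
using var output = new StreamWriter("records.csv");    // hypothetical output

// ReadToFollowing streams forward through the document one element at a time.
while (reader.ReadToFollowing("record"))
{
    string id = reader.GetAttribute("id");              // assumed attribute names
    string amount = reader.GetAttribute("amount");
    output.WriteLine($"{id},{amount}");                  // write each result immediately, keep nothing
}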

How FileStream works in c#?

I have the following piece of code. I am not fully understanding its implementation.
img stores the path of the image, e.g. c:\desktop\my.jpg
FileStream fs = new FileStream(img, FileMode.Open, FileAccess.Read);
byte[] bimage = new byte[fs.Length];
fs.Read(bimage, 0, Convert.ToInt32(fs.Length));
In the first line the FileStream is opening the image located at the path img for reading.
The second line is (I guess) converting the opened file to bytes.
What does fs.Length represent?
Does an image have a length, or is it the length of the file's name (I guess not)?
What is the third line doing?
Please help me clarify!
fs is one of many C# I/O objects; it wraps a file handle and exposes methods like the Read used in your example. Read fills a byte array that you pass in, so you declare the array first and size it to the file length (the second line; fs.Length is the file length in bytes). Then all you need to do is read the file content into that array (third line). This can be done in one call (as in the example) or by reading blocks in a loop. When you are done reading, it is good practice to dispose of the fs object (e.g. with a using block) so the file handle is released.
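Putting those points together, a hedged rewrite of the snippet from the question might look like this (File.ReadAllBytes(img) would do the same job in a single call):

using System.IO;

string img = @"c:\desktop\my.jpg";                  // path from the question

byte[] bimage;
using (var fs = new FileStream(img, FileMode.Open, FileAccess.Read))
{
    bimage = new byte[fs.Length];                   // fs.Length is the file size in bytes
    int offset = 0;
    while (offset < bimage.Length)                  // Read may return fewer bytes than requested
    {
        int n = fs.Read(bimage, offset, bimage.Length - offset);
        if (n == 0) throw new EndOfStreamException();
        offset += n;
    }
}   // disposing fs here releases the file handle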
A "stream", in computing, is commonly a control buffer you open (or connect), read a chunk and close (or disconnect).
In case of files, when opening OS finds file, handle pointers and locks on the resource.
You do reads. When you read, you are picking a range of bytes ("chunk") and putting it in memory. In this case, that second line byte array.
You could, in thesis, pick any number. But life is hard: you have physical memory limitation in any computer.
If your file fits in your RAM + virtual memory... you may use a large byte array (FSB and motherboard throughput applies).
So, in a low memory system, like a Raspberry Pi B (512MB), this can cause errors or failures.
This is where fs.Length comes in. It returns the length of the file in bytes, as recorded by the file system, so you know up front how much data there is.
Knowing the length lets you do some math on your byte array's maximum size versus its optimum size (hardware capacity versus file chunk size).
Your buffers should take into account the computer's total memory and the other processes running (and using memory) in parallel.
You should NEVER, on any platform, rely only on the file size to define your memory buffer size.
And remember to always close/disconnect/release/dispose locked I/O resources, like TCP connections, files, consoles, database connections and thread-safety locks.
Imagine reading a 10 GB file, such as a payment transaction log, on a Raspberry Pi with 512 MB of RAM and a 2 GB SD card.
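A hedged sketch of reading such a file with a fixed-size buffer instead of a buffer sized to the whole file; the file name is a placeholder and the 64 KB chunk size is an arbitrary choice, but the point is that memory use stays constant no matter how big the file is:

using System.IO;

using var fs = new FileStream("payments.log", FileMode.Open, FileAccess.Read);  // hypothetical 10 GB file

var buffer = new byte[64 * 1024];                  // fixed-size chunk, independent of the file size
long totalBytes = 0;
int bytesRead;

while ((bytesRead = fs.Read(buffer, 0, buffer.Length)) > 0)
{
    // Process just this chunk (parse it, hash it, forward it, ...) and then reuse the buffer.
    totalBytes += bytesRead;
}
// fs is disposed when it goes out of scope, releasing the file handle.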

Compare files byte by byte or read all bytes?

I came across this code http://support.microsoft.com/kb/320348 which made me wonder what would be the best way to compare 2 files in order to figure out if they differ.
The main idea is to optimize my program, which needs to verify whether files are equal or not, in order to build a list of changed files and/or files to delete/create.
Currently I compare the sizes of the files, and if they match I compute an MD5 checksum of the two files. But after looking at the code linked at the beginning of this question, I wonder whether it is really worth using that approach over creating a checksum of the two files (which basically requires reading all the bytes anyway).
Also, what other checks should I make to reduce the work of checking each file?
Read both files into a small buffer (4K or 8K), which is optimised for reading, and then compare the buffers in memory (byte by byte), which is optimised for comparing.
This will give you optimum performance for all cases (whether the difference is at the start, middle or end).
Of course, the first step is to check whether the file lengths differ; if that's the case, the files are indeed different.
If you haven't already computed hashes of the files, then you might as well do a proper comparison (instead of looking at hashes), because if the files are the same it's the same amount of work, but if they're different you can stop much earlier.
Of course, comparing a byte at a time is probably a bit wasteful - probably a good idea to read whole blocks at a time and compare them.
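A hedged sketch of that strategy: check the lengths first, then compare fixed-size blocks and stop at the first mismatch. The file names and the 8 KB block size are placeholders:

using System;
using System.IO;

static bool FilesAreEqual(string pathA, string pathB)
{
    using var a = new FileStream(pathA, FileMode.Open, FileAccess.Read);
    using var b = new FileStream(pathB, FileMode.Open, FileAccess.Read);

    if (a.Length != b.Length)
        return false;                               // different lengths: definitely different

    var bufA = new byte[8192];
    var bufB = new byte[8192];

    while (true)
    {
        int readA = FillBlock(a, bufA);
        int readB = FillBlock(b, bufB);             // lengths match, so readA == readB

        if (readA == 0)
            return true;                            // both files ended with no mismatch found

        if (!bufA.AsSpan(0, readA).SequenceEqual(bufB.AsSpan(0, readB)))
            return false;                           // stop at the first differing block
    }
}

static int FillBlock(FileStream fs, byte[] buffer)
{
    int total = 0;
    while (total < buffer.Length)                   // Read may return fewer bytes than requested
    {
        int n = fs.Read(buffer, total, buffer.Length - total);
        if (n == 0) break;                          // end of file
        total += n;
    }
    return total;
}

Console.WriteLine(FilesAreEqual("old.bin", "new.bin"));   // hypothetical files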

What's the best way to read/write array contents from/to binary files in C#?

I would like to read and write the contents of large, raw volume files (e.g. MRI scans). These files are just a sequence of e.g. 32 x 32 x 32 floats so they map well to 1D arrays. I would like to be able to read the contents of the binary volume files into 1D arrays of e.g. float or ushort (depending on the data type of the binary files) and similarly export the arrays back out to the raw volume files.
What's the best way to do this with C#? Read/Write them 1 element at a time with BinaryReader/BinaryWriter? Read them piece-wise into byte arrays with FileStream.Read and then do a System.Buffer.BlockCopy between arrays? Write my own Reader/Writer?
EDIT: It seems it's not possible to work with > 2GB arrays, but the question still stands for smaller arrays (around 256 MB or so)
You're not going to get arrays with more than 2GB data. From what I remember, there's a CLR limit of 1GB per object. It's possible that's been lifted for .NET 4 on 64-bit, but I haven't heard about it.
EDIT: According to this article the limit is 2GB, not 1GB - but you still won't manage more than 2GB.
Do you really have to have all the data in memory at one time? Can you work on chunks of it at a time?
EDIT: Okay, so it's now just about reading from a file into a float array? It's probably simplest to read chunks (either using BinaryReader.Read(byte[], int, int) or BinaryReader.ReadBytes(int)) and then use Buffer.BlockCopy to efficiently convert from bytes to floats etc. Note that this will be endian-sensitive, however. If you want to convert more robustly (so that you can change endianness later, or run on a big-endian platform) you'd probably want to call ReadSingle() repeatedly.
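To illustrate the BlockCopy route with a hedged sketch: the 32 x 32 x 32 dimensions come from the question, the file name is a placeholder, and as noted it only works when the file's byte order matches the machine's:

using System;
using System.IO;

int floatCount = 32 * 32 * 32;                                       // volume dimensions from the question

using var reader = new BinaryReader(File.OpenRead("volume.raw"));    // hypothetical raw volume file

byte[] bytes = reader.ReadBytes(floatCount * sizeof(float));         // read all the raw bytes at once
if (bytes.Length != floatCount * sizeof(float))
    throw new EndOfStreamException("File is shorter than expected.");

var floats = new float[floatCount];
Buffer.BlockCopy(bytes, 0, floats, 0, bytes.Length);                 // reinterpret bytes as floats (endian-sensitive)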
How convinced are you that you actually have a performance issue in this area of the code? It's worth doing the simplest thing that will work and then profiling it, to start with...
