Change encoding of text file using Stream class object

I am using the UploadCsv method of the BigQueryClient class. This method accepts a Stream object. Can I change the encoding of my file and pass it to the stream without first converting the text file to a byte array? The file is large, and loading it into a byte array throws an out-of-memory exception.
UploadCsvAsync(string datasetId, string tableId, TableSchema schema, Stream input, UploadCsvOptions options = null, CancellationToken cancellationToken = default);

It does not seem possible to change the encoding through the stream object itself, because UploadCsvAsync requires its input to be a System.IO.Stream.
Stream is the abstract base class of all streams in .NET. A stream is a sequence of bytes, such as a file, an input/output device, an inter-process communication pipe, or a TCP/IP socket. It carries no notion of text encoding, so any re-encoding has to happen before the bytes reach the stream you pass in.
So, for the main question, "Is it possible to change the encoding?", the answer for the Stream object itself is no.
The natural follow-up is: "If that is not possible, what do we need to do instead?", and that's a great question!
The out-of-memory exception can be avoided by changing the way the data is read. Reading the whole file into a single array in one pass is what exhausts the available memory.
Instead, read the data through a fixed-size buffer, as they did in this other question, or use smaller arrays or jagged arrays, as done in this question.
Hope this is helpful! :)
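The buffered-reading advice can also be applied to the re-encoding step itself. The following is only a minimal sketch, not part of the original answer: `Transcode` is a hypothetical helper, and the UTF-16 to UTF-8 direction and the 4096-character buffer are assumptions chosen for illustration. It decodes with a StreamReader and re-encodes with a StreamWriter one buffer at a time, so the file never has to fit in a single byte array.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper: re-encodes text from `input` to `output` one buffer at a time.
// Both streams are left open so the caller decides when to dispose them.
static void Transcode(Stream input, Encoding sourceEncoding, Stream output, Encoding targetEncoding)
{
    using var reader = new StreamReader(input, sourceEncoding, false, 4096, true);
    using var writer = new StreamWriter(output, targetEncoding, 4096, true);

    var buffer = new char[4096];
    int read;
    while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
    {
        writer.Write(buffer, 0, read);
    }
    // Disposing the writer at the end of this method flushes any pending characters.
}

// Demo with in-memory streams; the encodings here are assumptions.
string text = "id,name\n1,Zo\u00eb\n";
var utf8NoBom = new UTF8Encoding(false);

using var source = new MemoryStream(Encoding.Unicode.GetBytes(text)); // UTF-16 input
using var target = new MemoryStream();
Transcode(source, Encoding.Unicode, target, utf8NoBom);

string roundTripped = utf8NoBom.GetString(target.ToArray());
Console.WriteLine(roundTripped == text);
```

With a real file, the source would be something like File.OpenRead(path) and the target a temporary file, which can then be reopened as a FileStream and handed to the upload method. On .NET 5 and later, Encoding.CreateTranscodingStream may be an alternative that avoids the temporary file.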

Related

Guidelines for designing a robust file format writer?

Suppose you want to write a .WAV file format writer like so:
using var stream = File.Create("test.wav");
using var writer = new WavWriter(stream, ... /* .WAV format parameters */);
// write the file
// writer.Dispose() does a few things:
// - writes user-added chunks
// - updates the file header (chunk size) so the file is valid
There is a conceptual problem in doing so:
the user can change the stream position and thereby corrupt the writing process
You may suggest the following:
the writer should own the stream; this would work when writing to a file, but not to an arbitrary stream
the writer should own its own memory stream so it can write to streams too; that works, but raises memory concerns
I guess you get the point...
To me, the only viable option would be to document that aspect, but I may have missed something, hence the question.
Question:
How can a file format writer write to a stream yet defend itself against possible changes to the stream's position?

My suggestion would be to keep an internal position field in the WavWriter. Each time you do some operation you can check that this matches the position in the backing stream and throw an exception if it does not. Update this value at the end of each write operation.
Ideally you should also handle streams that do not support seeking, but it does not sound like your design would permit that anyway. It might be a good idea to check CanSeek in the constructor and throw if seeking is not supported. In general, it is a good idea to validate any arguments before use.
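The position check described above could look roughly like the sketch below. `GuardedWrite` is a hypothetical helper, and the tracked position is passed as a `ref` parameter purely to keep the example short; in a real WavWriter it would be a private field, and the CanSeek check would live in the constructor as the answer suggests.

```csharp
using System;
using System.IO;

// Hypothetical helper: writes `data` only if nobody moved the stream since the last write.
static void GuardedWrite(Stream stream, ref long expectedPosition, byte[] data)
{
    if (!stream.CanSeek)
        throw new NotSupportedException("A seekable stream is required.");
    if (stream.Position != expectedPosition)
        throw new InvalidOperationException("The stream position was changed outside the writer.");

    stream.Write(data, 0, data.Length);
    expectedPosition = stream.Position; // remember where this write ended
}

var stream = new MemoryStream();
long expected = stream.Position;

GuardedWrite(stream, ref expected, new byte[] { 1, 2, 3, 4 });

stream.Position = 0; // the caller interferes with the stream

bool detected = false;
try
{
    GuardedWrite(stream, ref expected, new byte[] { 5 });
}
catch (InvalidOperationException)
{
    detected = true; // the guard noticed the moved position
}

Console.WriteLine(detected);
```

This only detects interference, it does not prevent it; whether to throw or to silently seek back to the expected position is a design choice the answer leaves open.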

Difference between loading a file from a path and from a stream C#

This is a follow up question to this question:
Difference between file path and file stream?
I didn't fully understand everything answered in the linked question.
I am using the Microsoft.SqlServer.Dac.BacPackage which contains a Load method with 2 overloads - one that receives a string path and one that receives a Stream.
This is the documentation of the Load method:
https://learn.microsoft.com/en-us/dotnet/api/microsoft.sqlserver.dac.bacpackage.load?view=sql-dacfx-150
What exactly is the difference between the two? Am I correct in assuming that the overloading of the string path saves all the file in the memory first, while the stream isn't? Are there other differences?

No, the file will not usually be fully loaded all at once.
A string path parameter normally means it will just open the file as a FileStream and pass it to the other version of the function. There is no reason why the stream should fully load the file into memory unless requested.
A Stream parameter means you open the file and pass the resulting Stream. You could also pass any other type of Stream, such as a network stream, a zip or decryption stream, a memory-backed stream, anything really.

Short answer:
The fact that you have two methods, one that accepts a filename and one that accepts a stream is just for convenience. Internally, the one with the filename will open the file as a stream and call the other method.
Longer answer
You can consider a stream as a sequence of bytes. The reason to use a stream instead of a byte[] or List<byte>, is, that if the sequence is really, really large, and you don't need to have access to all bytes at once, it would be a waste to put all bytes in memory before processing them.
For instance, if you want to calculate the checksum for all bytes in a file: you don't need to put all data in memory before you can start calculating the sum. In fact, anything that efficiently can deliver you the bytes one by one would suffice.
That is the reason why people would want to read a file as a stream.
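As a small illustration of the checksum case, the sketch below sums the bytes of any Stream through a fixed 4096-byte buffer. `SumBytes` is a hypothetical name and a plain byte sum stands in for a real checksum algorithm; both are assumptions made for the example.

```csharp
using System;
using System.IO;

// Hypothetical helper: adds up all bytes of a stream without loading it fully into memory.
static long SumBytes(Stream input)
{
    var buffer = new byte[4096];
    long sum = 0;
    int read;
    // Read returns 0 only at the end of the stream.
    while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
    {
        for (int i = 0; i < read; i++)
        {
            sum += buffer[i];
        }
    }
    return sum;
}

using var data = new MemoryStream(new byte[] { 1, 2, 3, 250 });
long checksum = SumBytes(data);
Console.WriteLine(checksum); // 256
```

The same method would work unchanged on a FileStream or a network stream, since it only relies on Read.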
The reason why people want a stream as input for their data is that they want to give the caller the opportunity to specify the source of the data: callers can provide a stream that reads from a file, but also a stream with data from the internet, from a database, or from a text box. The procedure does not care, as long as it can read the bytes one by one, or sometimes per chunk of bytes:
using (Stream fileStream = File.OpenRead(fileName))
{
    ProcessInputData(fileStream);
}
Or:
byte[] bytesToProcess = ...
using (Stream memoryStream = new MemoryStream(bytesToProcess))
{
    ProcessInputData(memoryStream);
}
Or:
string operatorInput = this.textBox1.Text;
// A MemoryStream wraps bytes, so the text has to be encoded first (UTF-8 assumed here)
using (Stream memoryStream = new MemoryStream(Encoding.UTF8.GetBytes(operatorInput)))
{
    ProcessInputData(memoryStream);
}
Conclusion
Methods use streams in their interface to indicate that they don't need all data in memory at once. One-by-one, or per chunk is enough. The caller is free to decide where the data comes from.

FileStream.Read() - bytes read

FileStream.Read() returns the number of bytes read, but... is there any situation, other than having reached the end of the file, in which it will read fewer bytes than requested and not throw an exception?
the documentation says:
The Read method returns zero only after reaching the end of the stream. Otherwise, Read always reads at least one byte from the stream before returning. If no data is available from the stream upon a call to Read, the method will block until at least one byte of data can be returned. An implementation is free to return fewer bytes than requested even if the end of the stream has not been reached.
But this doesn't quite explain in what situations data would be unavailable and cause the method to block until it can read again. I mean, shouldn't most situations where data is unavailable force an exception?
What are real situations where comparing the number of bytes read against the number of expected bytes could differ (assuming that we're already checking for end of file when we mention number of bytes expected)?
EDIT: A bit more information, reason why I'm asking this is because I've come across a bit of code where the developer pretty much did something like this:
bytesExpected = (remainingBytesInFile > 94208 ? 94208 : remainingBytesInFile);
while (bytesRead < bytesExpected)
{
    bytesRead += fileStream.Read(buffer, bytesRead, bytesExpected - bytesRead);
}
Now, I can't see any advantage to having this while at all, I'd expect it to throw an exception if it can't read the number of bytes expected (bearing in mind it's already taking into account that there are those many bytes left to read)
What would the reason one could possibly have for something like this? I'm sure I'm missing something

The documentation is for Stream.Read, from which FileStream is derived. Since FileStream is a stream, it should obey the stream contract. Not all streams do, but unless you have a very good reason, you should stick to that.
In a typical file stream, you'll only get a return value smaller than count when you reach the end of file (and it's a pretty simple way of checking for the end of file).
However, in a NetworkStream, for example, you keep reading in a loop until the method returns zero - signalling the end of stream. The same works for file streams - you know you're at the end of the file when Read returns zero.
Most importantly, FileStream isn't just for what you'd consider files - it's also for pseudo-files like standard input/output pipes and COM ports, for example (try opening a file stream on PRN, for example). In that case, you're not reading a file with a fixed length, and the behaviour is the same as with NetworkStream.
Finally, don't forget that FileStream isn't sealed. It's perfectly fine for you to implement a virtualized file system, for example - and it's perfectly fine if your virtualized file system doesn't support seeking, or checking the length of file.
EDIT:
To address your edit, this is exactly how you're supposed to read any stream. Nothing wrong with it. If there's nothing else to read in a stream, the Read method will simply return 0, and you know the stream is over. The only thing is, it seems that he tries to fill his buffer to full, one buffer at a time - this only makes sense if you explicitly need to partition the file by 94208 bytes, and pass that byte[] for further processing somewhere.
If that's not the case, you don't really need to fill the full buffer - you just keep reading (and probably writing on some other side) until Read returns 0. And indeed, by default, FileStream will always fill the whole buffer unless it's built around a pipe handle - but since that's a possibility, you shouldn't rely on the "real file" behaviour, so as long as you need those byte[] for something non-stream (e.g. parsing messages), this is entirely fine. If you're only using the stream as an actual stream, and you're streaming the data somewhere else, it doesn't have a point, really - you only need one while to read the file.
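For the case where a full buffer really is needed, a fill loop along the lines of the one in the question might be sketched as follows. `ReadUpTo` is a hypothetical helper name. Unlike the loop quoted in the question, it also stops when Read returns 0, so it cannot spin forever if the stream ends earlier than expected.

```csharp
using System;
using System.IO;

// Hypothetical helper: keeps reading until `count` bytes are in the buffer
// or the stream ends, and returns how many bytes were actually read.
static int ReadUpTo(Stream stream, byte[] buffer, int count)
{
    int total = 0;
    while (total < count)
    {
        int read = stream.Read(buffer, total, count - total);
        if (read == 0)
        {
            break; // end of stream reached before `count` bytes arrived
        }
        total += read;
    }
    return total;
}

var tenBytes = new MemoryStream(new byte[10]);
var block = new byte[16];

int first = ReadUpTo(tenBytes, block, 4);   // plenty of data left
int second = ReadUpTo(tenBytes, block, 16); // only 6 bytes remain

Console.WriteLine($"{first} {second}"); // 4 6
```

A MemoryStream never returns short reads in the middle of its data, so this demo only exercises the end-of-stream path; with a pipe or network stream, the inner loop is what collects the partial reads.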

Your expectations would only apply to the case where the stream is reading data from a no-latency source. Other I/O sources can be slow, which is why the Read method will not always be able to return immediately. That doesn't mean that there is an error (so no exception), just that it has to wait for data to arrive.
Examples: network stream, file stream on slow disk, etc.
(UPDATE, HDD example) To give an example specific to files (since your case is FileStream, although Read is defined on Stream and so all implementations should fulfill the requirements): mechanical hard drives go to "sleep" when not active (especially on battery-powered devices, i.e. laptops). Spinning up can take a second or so. That is not an IOException, but your read would have to wait for a second before any data is read.

Simple answer is that on a FileStream it probably never happens.
However, keep in mind that the Read method is inherited from Stream, which serves as the base for many other streams such as NetworkStream. In that case you may not be able to read as many bytes as you requested, simply because they haven't been received from the network yet.
So like the documentation says it all depends on the implementation of the specific type of stream - FileStream, NetworkStream, etc.

MemoryStream.WriteTo(Stream destinationStream) versus Stream.CopyTo(Stream destinationStream)

Which one is better : MemoryStream.WriteTo(Stream destinationStream) or Stream.CopyTo(Stream destinationStream)??
I am talking about the comparison of these two methods without Buffer as I am doing like this :
Stream str = File.OpenRead("SomeFile.file");
MemoryStream mstr = new MemoryStream(File.ReadAllBytes("SomeFile.file"));
using (var Ms = File.Create("NewFile.file", 8 * 1024))
{
    str.CopyTo(Ms); // or: mstr.WriteTo(Ms); which one will be better?
}
Update
Here is what I want to Do :
Open File [ Say "X" Type File]
Parse the Contents
From here I get a Bunch of new Streams [ 3 ~ 4 Files ]
Parse One Stream
Extract Thousands of files [ The Stream is an Image File ]
Save the Other Streams To Files
Editing all the Files
Generate a New "X" Type File.
I have written every bit of code which is actually working correctly..
But Now I am optimizing the code to make the most efficient.

It is an historical accident that there are two ways to do the same thing. MemoryStream always had the WriteTo() method, Stream didn't acquire the CopyTo() method until .NET 4.
The MemoryStream.WriteTo() version looks like this:
public virtual void WriteTo(Stream stream)
{
// Exception throwing code elided...
stream.Write(this._buffer, this._origin, this._length - this._origin);
}
The Stream.CopyTo() implementation like this:
private void InternalCopyTo(Stream destination, int bufferSize)
{
int num;
byte[] buffer = new byte[bufferSize];
while ((num = this.Read(buffer, 0, buffer.Length)) != 0)
{
destination.Write(buffer, 0, num);
}
}
Stream.CopyTo() is more universal: it works for any stream, and it helps programmers who fumble copying data from, say, a NetworkStream, where forgetting to pay attention to the return value of Read() was a very common bug. But it of course copies the bytes twice and allocates that temporary buffer; MemoryStream doesn't need it since it can write directly from its own buffer. So you'd still prefer WriteTo(), although you are unlikely to notice the difference.

MemoryStream.WriteTo: Writes the entire contents of this memory stream to another stream.
Stream.CopyTo: Reads the bytes from the current stream and writes them to the destination stream. Copying begins at the current position in the current stream.
You'll need to seek back to 0, to get the whole source stream copied.
So I think MemoryStream.WriteTo is the better option for this situation.
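The position difference between the two methods can be seen in a short sketch; the variable names are illustrative only.

```csharp
using System;
using System.IO;

var source = new MemoryStream();
source.Write(new byte[] { 1, 2, 3, 4 }, 0, 4); // Position is now 4, at the end

// WriteTo writes the entire contents, regardless of the current position.
var viaWriteTo = new MemoryStream();
source.WriteTo(viaWriteTo);

// CopyTo starts at the current position, so nothing is copied here.
var viaCopyTo = new MemoryStream();
source.CopyTo(viaCopyTo);

// After seeking back to 0, CopyTo copies everything.
source.Position = 0;
var viaCopyToRewound = new MemoryStream();
source.CopyTo(viaCopyToRewound);

Console.WriteLine($"{viaWriteTo.Length} {viaCopyTo.Length} {viaCopyToRewound.Length}"); // 4 0 4
```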

If you use Stream.CopyTo, you don't need to read all the bytes into memory to start with. However:
This code would be simpler if you just used File.Copy
If you are going to load all the data into memory, you can just use:
byte[] data = File.ReadAllBytes("input");
File.WriteAllBytes("output", data);
You should have a using statement for the input as well as the output stream
If you really need processing so can't use File.Copy, using Stream.CopyTo will cope with larger files than loading everything into memory. You may not need that, of course, or you may need to load the whole file into memory for other reasons.
If you have got a MemoryStream, I'd probably use MemoryStream.WriteTo rather than Stream.CopyTo, but it probably won't make much difference which you use, except that you need to make sure you're at the start of the stream when using CopyTo.

I think Hans Passant's claim of a bug in MemoryStream.WriteTo() is wrong; it does not "ignore the return value of Write()". Stream.Write() returns void, which implies to me that the entire count bytes are written, which implies that Stream.Write() will block as necessary to complete the operation to, e.g., a NetworkStream, or throw if it ultimately fails.
That is indeed different from the write() system call in *nix, and its many emulations in libc and so forth, which can return a "short write". I suspect Hans leaped to the conclusion that Stream.Write() followed that, which I would have expected, too, but apparently it does not.
It is conceivable that Stream.Write() could perform a "short write", without returning any indication of that, requiring the caller to check that the Position property of the Stream has actually been advanced by count. That would be a very error-prone API, and I doubt that it does that, but I have not thoroughly tested it. (Testing it would be a bit tricky: I think you would need to hook up a TCP NetworkStream with a reader on the other end that blocked forever, and write enough to fill up the wire buffers. Or something like that...)
The comments for Stream.Write() are not quite unambiguous:
Summary:
When overridden in a derived class, writes a sequence of bytes to the current
stream and advances the current position within this stream by the number
of bytes written.
Parameters: buffer:
An array of bytes. This method copies count bytes from buffer to the current stream.
Compare that to the Linux man page for write(2):
write() writes up to count bytes from the buffer pointed buf to the file referred to by the file descriptor fd.
Note the crucial "up to". That sentence is followed by explanation of some of the conditions under which a "short write" might occur, making it very explicit that it can occur.
This is really a critical issue: we need to know how Stream.Write() behaves, beyond all doubt.

The CopyTo method creates a buffer, populates it with data from the original stream, and then calls the Write method, passing the created buffer as a parameter. WriteTo uses the MemoryStream's internal buffer to write. That is the difference. Which is better is up to you to decide.

Creating a MemoryStream from a HttpInputStream in Vb.Net:
Dim filename As String = MyFile.PostedFile.FileName
Dim fileData As Byte() = Nothing
Using binaryReader = New BinaryReader(MyFile.PostedFile.InputStream)
binaryReader.BaseStream.Position = 0
fileData = binaryReader.ReadBytes(MyFile.PostedFile.ContentLength)
End Using
Dim memoryStream As MemoryStream = New MemoryStream(fileData)

C# - Creating byte array of unknown size?

I'm trying to create a class to manage the opening of a certain file. I would like one of the properties to be a byte array of the file, but I don't know how big the file is going to be. I tried declaring the byte array as:
public byte[] file;
...but it won't allow me to set it the ways I've tried. br is my BinaryReader:
file = br.ReadBytes(br.BaseStream.Length);
br.Read(file,0,br.BaseStream.Length);
Neither way works. I assume it's because I have not initialized my byte array, but I don't want to give it a size if I don't know the size. Any ideas?
edit: Alright, I think it's because the Binary Reader's BaseStream length is a long, but its readers take int32 counts. If I cast the 64s into 32s, is it possible I will lose bytes in larger files?

I had no problems reading a file stream:
byte[] file;
var br = new BinaryReader(new FileStream("c:\\Intel\\index.html", FileMode.Open));
file = br.ReadBytes((int)br.BaseStream.Length);
Your code doesn't compile because the Length property of BaseStream is of type long but you are trying to use it as an int. Implicit casting which might lead to data loss is not allowed so you have to cast it to int explicitly.
Update
Just bear in mind that the code above aims to highlight your original problem and should not be used as it is. Ideally, you would use a buffer to read the stream in chunks. Have a look at this question and the solution suggested by Jon Skeet

You can't create an array of unknown size.
byte[] file = new byte[br.BaseStream.Length];
PS: For larger files, you should repeatedly read chunks of bytes.
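Reading in chunks when the total size is not known up front might be sketched as below. `ReadToEnd` is a hypothetical helper and the 8192-byte chunk size is an arbitrary assumption; a MemoryStream collects the chunks and grows as needed, so no array size has to be chosen in advance.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper: reads a stream of unknown length chunk by chunk
// and returns everything as one byte array.
static byte[] ReadToEnd(Stream input)
{
    using var collected = new MemoryStream();
    var chunk = new byte[8192];
    int read;
    while ((read = input.Read(chunk, 0, chunk.Length)) > 0)
    {
        collected.Write(chunk, 0, read);
    }
    return collected.ToArray();
}

var original = new byte[20000];
new Random(1).NextBytes(original);

byte[] copy = ReadToEnd(new MemoryStream(original));
Console.WriteLine(copy.Length);                  // 20000
Console.WriteLine(copy.SequenceEqual(original)); // True
```

The result still has to fit in memory as a single array, so this addresses the unknown size, not very large files; for those, processing each chunk as it arrives is the safer route.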

BinaryReader.ReadBytes returns a byte[]. There is no need to initialize a byte array because that method already does so internally and returns the complete array to you.
If you're looking to read all the bytes from a file, there's a handy method in the File class:
http://msdn.microsoft.com/en-us/library/system.io.file.readallbytes.aspx
