There are many questions that ask for detecting file encoding, which is tricky. I only need to know if the file contains only valid UTF8 sequences, thus being safe to treat as UTF8 (plain ASCII can be safely treated as UTF8)
The file arrives as a Stream from within ASP.NET Core.
I assume I will have to read the stream twice, first to check it doesn't contain any invalid UTF8 sequences, and second to actually process it further.
Based on a comment by @madreflection:
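public static async Task<bool> IsValidUtf8(Stream stream)
{
    // throwOnInvalidBytes: true makes the decoder throw instead of substituting U+FFFD.
    // BOM detection is disabled so a UTF-16/UTF-32 BOM cannot silently switch the reader's encoding.
    var reader = new StreamReader(stream, new UTF8Encoding(true, true), detectEncodingFromByteOrderMarks: false);
    try
    {
        await reader.ReadToEndAsync();
        return true;
    }
    catch (DecoderFallbackException)
    {
        return false;
    }
}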
I was hoping I wouldn't have to read the stream twice, but avoiding that is not possible, and not even desirable, since I need to make a decision before processing.
One thing that bothers me about this is that the whole stream is read into RAM, but CodeReview might be a better place to discuss that.
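On the RAM concern: a possible variation, sketched here under assumptions rather than offered as the definitive answer, is to feed the stream through a throwing UTF-8 Decoder one buffer at a time, so only a fixed-size buffer is held in memory. The class name Utf8Check, the method name IsValidUtf8Chunked and the 8192-byte default are made up for illustration.

```csharp
using System;
using System.IO;
using System.Text;
using System.Threading.Tasks;

public static class Utf8Check
{
    // Hypothetical helper: validates UTF-8 chunk by chunk instead of via ReadToEndAsync.
    public static async Task<bool> IsValidUtf8Chunked(Stream stream, int bufferSize = 8192)
    {
        // throwOnInvalidBytes: true -> DecoderFallbackException on any invalid sequence.
        var encoding = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
        Decoder decoder = encoding.GetDecoder();
        var bytes = new byte[bufferSize];
        // Large enough for one buffer plus bytes carried over from the previous call.
        var chars = new char[encoding.GetMaxCharCount(bufferSize)];
        try
        {
            int read;
            while ((read = await stream.ReadAsync(bytes, 0, bytes.Length)) > 0)
            {
                // flush: false keeps an incomplete trailing sequence for the next chunk.
                decoder.GetChars(bytes, 0, read, chars, 0, flush: false);
            }
            // Final flush: a sequence cut off at the end of the stream is invalid too.
            decoder.GetChars(bytes, 0, 0, chars, 0, flush: true);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

The stream still has to be rewound (or buffered) before the second pass, so this only addresses memory use, not the double read.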
Related
I'm making a chat system over TCP, which requires sending data as byte arrays. When I convert an image into a byte array, send it, and then convert it back, I get this error: 'End of Stream encountered before parsing was completed.' With strings it works just fine.
public byte[] ObjectToByteArray(object obj)
{
    BinaryFormatter formatter = new BinaryFormatter();
    using (var stream = new MemoryStream())
    {
        formatter.Serialize(stream, obj);
        return stream.ToArray();
    }
}
public object ByteArrayToObject(byte[] bytes)
{
    using (var stream = new MemoryStream())
    {
        var binForm = new BinaryFormatter();
        stream.Write(bytes, 0, bytes.Length);
        stream.Position = 0;
        var obj = binForm.Deserialize(stream);
        return obj;
    }
}
There's two separate things here; firstly, and I cannot emphasize this enough; do not use BinaryFormatter. Ever. It will hurt you. Lots of serializers exist, and BinaryFormatter (and the cousin NetDataContractSerializer) is literally the absolute last you should use. I can expand on that if you like, or I can suggest alternatives if you like.
Now; as for the actual problem: I strongly suspect that it isn't what you think it is. I have a hunch, based on decades of working on network code, that the real problem here is "framing". By which I mean: TCP is a stream protocol, not a message/packet protocol. I strongly suspect that you have not correctly deframed the exact bytes that were sent. I can't say this for sure without seeing your socket code, but... as I say: it is a hunch based on lots of experience. To investigate this: note the length of the bytes you send, and note the length of the bytes you've received. I'm pretty sure you'll find they are different. If there's still doubt: get the base-64 or hex string of the sent payload and the received payload (Convert.ToBase64String, for example), and compare that string. I'm pretty sure they'll turn out to be different.
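To make "framing" concrete, here is a minimal sketch of one common scheme: each message is prefixed with its length, and the receiver loops until exactly that many bytes have arrived. The Framing class, its method names and the 4-byte big-endian prefix are assumptions for illustration, not something taken from the asker's socket code.

```csharp
using System;
using System.Buffers.Binary;
using System.IO;

public static class Framing
{
    // Writes a 4-byte big-endian length prefix followed by the payload.
    public static void WriteFrame(Stream stream, byte[] payload)
    {
        var header = new byte[4];
        BinaryPrimitives.WriteInt32BigEndian(header, payload.Length);
        stream.Write(header, 0, header.Length);
        stream.Write(payload, 0, payload.Length);
    }

    // Reads one complete frame, however many Read calls that takes.
    public static byte[] ReadFrame(Stream stream)
    {
        byte[] header = ReadExactly(stream, 4);
        int length = BinaryPrimitives.ReadInt32BigEndian(header);
        if (length < 0) throw new InvalidDataException("Negative frame length.");
        return ReadExactly(stream, length);
    }

    // A single Read may return fewer bytes than requested, so keep reading.
    private static byte[] ReadExactly(Stream stream, int count)
    {
        var buffer = new byte[count];
        int offset = 0;
        while (offset < count)
        {
            int read = stream.Read(buffer, offset, count - offset);
            if (read == 0) throw new EndOfStreamException("Connection closed mid-frame.");
            offset += read;
        }
        return buffer;
    }
}
```

With a NetworkStream on both ends, the sender would call WriteFrame per message and the receiver ReadFrame per message. A real protocol would also cap the maximum frame length.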
Ultimately, network code is hard; I could try and explain individual points, but "how to correctly send messages over a network" could fill a book. IMO, if you're not interested in specializing in writing network code for the next 5 years: use an existing tool that will do the job for you, for example gRPC. Lots and lots of other messaging RPC tools exist.
This is a follow up question to this question:
Difference between file path and file stream?
I didn't fully understand everything answered in the linked question.
I am using the Microsoft.SqlServer.Dac.BacPackage which contains a Load method with 2 overloads - one that receives a string path and one that receives a Stream.
This is the documentation of the Load method:
https://learn.microsoft.com/en-us/dotnet/api/microsoft.sqlserver.dac.bacpackage.load?view=sql-dacfx-150
What exactly is the difference between the two? Am I correct in assuming that the string-path overload loads the whole file into memory first, while the Stream overload doesn't? Are there other differences?
No, the file will not usually be fully loaded all at once.
A string path parameter normally means it will just open the file as a FileStream and pass it to the other version of the function. There is no reason why the stream should fully load the file into memory unless requested.
A Stream parameter means you open the file and pass the resulting Stream. You could also pass any other type of Stream, such as a network stream, a zip or decryption stream, a memory-backed stream, anything really.
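The usual shape of such a pair of overloads can be sketched as follows. This is a hypothetical LineCounter class written for illustration; it is not the actual implementation of BacPackage.Load, whose internals are not shown in the documentation.

```csharp
using System.IO;
using System.Text;

public static class LineCounter
{
    // Convenience overload: opens the file and delegates to the Stream overload.
    public static int CountLines(string path)
    {
        using (FileStream stream = File.OpenRead(path))
        {
            return CountLines(stream);
        }
    }

    // Core overload: works with any readable Stream (file, network, memory, ...).
    public static int CountLines(Stream stream)
    {
        // leaveOpen: true so the caller stays responsible for the stream's lifetime.
        using (var reader = new StreamReader(stream, Encoding.UTF8, true, 1024, leaveOpen: true))
        {
            int count = 0;
            while (reader.ReadLine() != null)
            {
                count++;
            }
            return count;
        }
    }
}
```

Neither overload loads the whole file at once; the reader pulls data through a small buffer as lines are requested.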
Short answer:
The fact that you have two methods, one that accepts a filename and one that accepts a stream is just for convenience. Internally, the one with the filename will open the file as a stream and call the other method.
Longer answer
You can consider a stream as a sequence of bytes. The reason to use a stream instead of a byte[] or List<byte> is that if the sequence is really, really large and you don't need access to all bytes at once, it would be a waste to put all bytes in memory before processing them.
For instance, if you want to calculate the checksum for all bytes in a file: you don't need to put all data in memory before you can start calculating the sum. In fact, anything that efficiently can deliver you the bytes one by one would suffice.
That is the reason why people would want to read a file as a stream.
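As a rough sketch of the checksum idea: the method below adds up all bytes of a stream while holding only a small buffer in memory. The name SumBytes and the 4096-byte buffer are made up for this example, and a plain sum stands in for whatever checksum is actually wanted.

```csharp
using System.IO;

public static class StreamChecksum
{
    // Sums every byte in the stream; memory use stays at one buffer regardless of stream size.
    public static long SumBytes(Stream stream)
    {
        var buffer = new byte[4096];
        long sum = 0;
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                sum += buffer[i];
            }
        }
        return sum;
    }
}
```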
The reason why people want a stream as input for their data is that they want to give the caller the opportunity to specify the source of the data. Callers can provide a stream that reads from a file, but also a stream with data from the internet, from a database, or from a text box. The procedure does not care, as long as it can read the bytes one by one, or sometimes per chunk of bytes:
using (Stream fileStream = File.OpenRead(fileName))
{
    ProcessInputData(fileStream);
}
Or:
byte[] bytesToProcess = ...
using (Stream memoryStream = new MemoryStream(bytesToProcess))
{
    ProcessInputData(memoryStream);
}
Or:
string operatorInput = this.textBox1.Text;
// MemoryStream needs bytes, so the text has to be encoded first (UTF-8 chosen here as an example).
using (Stream memoryStream = new MemoryStream(Encoding.UTF8.GetBytes(operatorInput)))
{
    ProcessInputData(memoryStream);
}
Conclusion
Methods use streams in their interface to indicate that they don't need all data in memory at once. One-by-one, or per chunk is enough. The caller is free to decide where the data comes from.
I am using the code below to read binary data from a text file and divide it into small chunks. I want to do the same with a text file containing alphanumeric data, which obviously does not work with the binary reader. Which reader would be best for that (stream, string, or text), and how would I implement it in the following code?
public static IEnumerable<IEnumerable<byte>> ReadByChunk(int chunkSize)
{
    IEnumerable<byte> result;
    int startingByte = 0;
    do
    {
        result = ReadBytes(startingByte, chunkSize);
        startingByte += chunkSize;
        yield return result;
    } while (result.Any());
}
public static IEnumerable<byte> ReadBytes(int startingByte, int byteToRead)
{
    byte[] result;
    using (FileStream stream = File.Open(@"C:\Users\file.txt", FileMode.Open, FileAccess.Read, FileShare.Read))
    using (BinaryReader reader = new BinaryReader(stream))
    {
        int bytesToRead = Math.Max(Math.Min(byteToRead, (int)reader.BaseStream.Length - startingByte), 0);
        reader.BaseStream.Seek(startingByte, SeekOrigin.Begin);
        result = reader.ReadBytes(bytesToRead);
    }
    return result;
}
I can only help you get the general process figured out:
String/Text is the 2nd worst data format to read, write or process. It should be reserved exclusively for output to and input from the user. It has some serious issues as a storage and retrieval format.
If you have to transmit, store or retrieve something as text, make sure you use a fixed Encoding and Culture Format (usually invariant) at all endpoints. You do not want to run into issues with those two.
The worst data format is raw binary. But there is a special 0th place for raw binary that you have to interpret into text, to then further process. To quote the most important parts of what I linked on encodings:
It does not make sense to have a string without knowing what encoding it uses. [...]
If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.
Almost every stupid “my website looks like gibberish” or “she can’t read my emails when I use accents” problem comes down to one naive programmer who didn’t understand the simple fact that if you don’t tell me whether a particular string is encoded using UTF-8 or ASCII or ISO 8859-1 (Latin 1) or Windows 1252 (Western European), you simply cannot display it correctly or even figure out where it ends. There are over a hundred encodings and above code point 127, all bets are off.
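Putting the advice about a fixed encoding into practice, a chunked text reader could look roughly like the sketch below. It is only one possible approach: the TextChunks class and ReadTextByChunk name are invented here, the caller is assumed to know the file's encoding, and chunks are measured in characters rather than bytes.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class TextChunks
{
    // Yields the text in pieces of at most chunkSize characters, decoded with the given encoding.
    public static IEnumerable<string> ReadTextByChunk(Stream stream, int chunkSize, Encoding encoding)
    {
        using (var reader = new StreamReader(stream, encoding))
        {
            var buffer = new char[chunkSize];
            int read;
            // ReadBlock keeps reading until the buffer is full or the stream ends.
            while ((read = reader.ReadBlock(buffer, 0, buffer.Length)) > 0)
            {
                yield return new string(buffer, 0, read);
            }
        }
    }
}
```

For a file, the stream would come from File.OpenRead. Because the reader decodes before chunking, a multi-byte character is never cut in half at the byte level, although a chunk boundary can still fall between the two halves of a surrogate pair.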
I want to detect the encoding of an XML document before parsing it. I found this script on Stack Overflow.
public static XElement GetXMLFromStream(Stream uploadStream)
{
    /** Remember position */
    var position = uploadStream.Position;
    /** Get encoding */
    var xmlReader = new XmlTextReader(uploadStream);
    xmlReader.MoveToContent();
    /** Move to remembered position */
    uploadStream.Seek(position, SeekOrigin.Begin); // does not work with "pos" = 0 either
    uploadStream.Seek(position, SeekOrigin.Current); // if I remove this I have the same issue!
    /** Read content with detected encoding */
    var streamReader = new StreamReader(uploadStream, xmlReader.Encoding);
    var streamReaderString = streamReader.ReadToEnd();
    return XElement.Parse(streamReaderString);
}
But it doesn't work.
I always get EndOfStream == true, even though the stream is not at its end.
For example, take the string <test></test>.
Begin: 0, End: 13
If I call ReadToEnd or MoveToContent, the end is reached as expected, and EndOfStream is then true.
If I then reset the position to 0 (for example) via Seek or Position, a new StreamReader still always reports EndOfStream as true.
The catch is that uploadStream is a stream I cannot close.
It's a SharpZipLib stream wrapping an HTTP upload stream, so I can't close it; I can only work with it.
The only problem is that Position and Seek don't work, and ReadToEnd relies on the position. Otherwise it would work, I think.
Maybe you can help me with this situation :-)
Thank you very much in advance!
Example:
This approach is fundamentally incompatible with some types of input streams. Streams are not required to support Seek at all. In fact, Stream has a property specifically to detect whether Seek is usable, called CanSeek. Code needs to take into account that Seek can fail.
The simple but not very memory-efficient way is to copy your stream's content into a MemoryStream. That one does support Seek, and you can then do whatever you want with it. The fact that you're using ReadToEnd() suggests that the data is not so large that the memory use is going to cause a problem, so you can probably just go with this.
Note: as documented, if Seek is not supported, it's supposed to throw a NotSupportedException. It looks like with the stream implementation you're dealing with, it's not supported, but not properly implemented. I hope at least that CanSeek returns false for you, so you can still reliably detect this.
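A minimal sketch of the MemoryStream approach, assuming the input is small enough to buffer: check CanSeek first and copy only when seeking is unavailable. The StreamUtil class and EnsureSeekable name are hypothetical.

```csharp
using System.IO;

public static class StreamUtil
{
    // Returns the input itself if it can seek; otherwise a rewound in-memory copy of its remaining content.
    public static Stream EnsureSeekable(Stream input)
    {
        if (input.CanSeek)
        {
            return input;
        }
        var buffer = new MemoryStream();
        input.CopyTo(buffer); // reads the source to its end; the source is left open
        buffer.Position = 0;  // rewind so the caller starts at the beginning
        return buffer;
    }
}
```

The returned stream can then be read, rewound with Seek, and read again. Note that this relies on CanSeek being reported truthfully; for a stream that claims to be seekable but is not, the copy would have to be made unconditionally.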
Option 1:
XElement has a Load() method that reads directly from an XML stream. It manages the encoding for you internally, and it is more efficient because it avoids a needless intermediate string. So why not use it?
XElement.Load(uploadStream);
Option 2:
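If you really want to work with a string, don't use new XmlTextReader(). The static XmlReader.Create() method (also reachable as XmlTextReader.Create()) has more features, so use that instead:
var xmlReader = XmlReader.Create(uploadStream);
xmlReader.MoveToContent(); // position on the root element; ReadOuterXml() returns an empty string otherwise
var streamReaderString = xmlReader.ReadOuterXml();
return XElement.Parse(streamReaderString);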
I found some questions on encoding issues before asking, but they are not what I want. Currently I have two methods, which I'd rather not modify.
//FileManager.cs
public byte[] LoadFile(string id);
public FileStream LoadFileStream(string id);
They work correctly for all kinds of files. Now I have the ID of a text file (it's guaranteed to be a .txt file) and I want to get its content. I tried the following:
byte[] data = manager.LoadFile(id);
string content = Encoding.UTF8.GetString(data);
But obviously it's not working for other non-UTF8 encodings. To resolve the encoding issue I tried to get its FileStream first and then use a StreamReader.
public StreamReader(Stream stream, bool detectEncodingFromByteOrderMarks);
I hoped this overload could resolve the encoding, but I still get strange content.
using (var stream = manager.LoadFileStream(id))
using (var reader = new StreamReader(stream, true))
{
    content = reader.ReadToEnd(); //still incorrect
}
Maybe I misunderstood the usage of detectEncodingFromByteOrderMarks? And how to resolve the encoding issue?
ByteOrderMarks are sometimes added to files encoded in one of the Unicode formats, to indicate whether characters made up of multiple bytes are stored in big- or little-endian order (is byte 1 stored first, and then byte 0? Or byte 0 first, and then byte 1?). This is particularly relevant when files are exchanged between, for instance, Windows and Unix machines, which may write these multibyte characters in opposite byte orders.
If you read a file and the first few bytes equal that of a ByteOrderMark, chances are quite high the file is encoded in the unicode format that matches that ByteOrderMark. You never know for sure, though, as Shadow Wizard mentioned. Since it's always a guess, the option is provided as a parameter.
If there is no ByteOrderMark in the first bytes of the file, it'll be hard to guess the file's encoding.
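For illustration, the check on the first few bytes can be sketched by hand as below. The BomSniffer class and DetectFromBom name are invented for this example; it only covers the UTF-8, UTF-16 and UTF-32 marks, and a null result means "no BOM found", not "not Unicode".

```csharp
using System.Text;

public static class BomSniffer
{
    // Returns the encoding implied by a leading byte order mark, or null if there is none.
    public static Encoding DetectFromBom(byte[] b)
    {
        // UTF-32 LE must be tested before UTF-16 LE: both start with FF FE.
        if (b.Length >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00)
            return Encoding.UTF32;
        if (b.Length >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF)
            return new UTF32Encoding(bigEndian: true, byteOrderMark: true);
        if (b.Length >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
            return Encoding.UTF8;
        if (b.Length >= 2 && b[0] == 0xFF && b[1] == 0xFE)
            return Encoding.Unicode;          // UTF-16 little endian
        if (b.Length >= 2 && b[0] == 0xFE && b[1] == 0xFF)
            return Encoding.BigEndianUnicode; // UTF-16 big endian
        return null;
    }
}
```

When this returns null for a .txt file, the encoding has to come from somewhere else (metadata, a convention, or a heuristic), which is why the result for BOM-less legacy files still looks wrong.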
More info: http://en.wikipedia.org/wiki/Byte_order_mark