Text file encoding issue - c#

I found some questions on encoding issues before asking, but they are not what I'm after. I currently have two methods, which I would rather not modify.
//FileManager.cs
public byte[] LoadFile(string id);
public FileStream LoadFileStream(string id);
They work correctly for all kinds of files. Now I have the ID of a text file (it's guaranteed to be a .txt file) and I want to get its content. I tried the following:
byte[] data = manager.LoadFile(id);
string content = Encoding.UTF8.GetString(data);
But obviously it's not working for other non-UTF8 encodings. To resolve the encoding issue I tried to get its FileStream first and then use a StreamReader.
public StreamReader(Stream stream, bool detectEncodingFromByteOrderMarks);
I hoped this overload could resolve the encoding, but I still get strange contents.
using(var stream = manager.LoadFileStream(id))
using(var reader = new StreamReader(stream, true))
{
content = reader.ReadToEnd(); //still incorrect
}
Maybe I misunderstood the usage of detectEncodingFromByteOrderMarks? And how to resolve the encoding issue?

Byte order marks (BOMs) are sometimes added to files encoded in one of the Unicode formats, to indicate whether characters made up of multiple bytes are stored in big-endian or little-endian format (is byte 1 stored first, and then byte 0? Or byte 0 first, and then byte 1?). This is particularly relevant when files are exchanged between machines or programs that use different byte orders, because they store these multibyte values in opposite orders.
If you read a file and the first few bytes equal that of a ByteOrderMark, chances are quite high the file is encoded in the unicode format that matches that ByteOrderMark. You never know for sure, though, as Shadow Wizard mentioned. Since it's always a guess, the option is provided as a parameter.
If there is no ByteOrderMark in the first bytes of the file, it'll be hard to guess the file's encoding.
More info: http://en.wikipedia.org/wiki/Byte_order_mark
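As a minimal sketch of the point above: StreamReader has an overload that takes both a fallback encoding and the detectEncodingFromByteOrderMarks flag, so a BOM wins when one is present and the fallback is used otherwise. The class name TextLoader and the choice of fallback are assumptions for illustration; the right fallback depends on where the files come from.

```csharp
using System.IO;
using System.Text;

public static class TextLoader
{
    // Decodes a stream as text. If the stream starts with a byte order mark,
    // the BOM decides the encoding; otherwise the caller-supplied fallback is used.
    public static string ReadAllText(Stream stream, Encoding fallback)
    {
        using (var reader = new StreamReader(stream, fallback, true))
        {
            return reader.ReadToEnd();
        }
    }
}
```

With the question's API this could be called as TextLoader.ReadAllText(manager.LoadFileStream(id), someFallback). Without a BOM the result is only as good as the fallback guess.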

Related

How to validate a Stream is valid UTF8 in c#

There are many questions that ask for detecting file encoding, which is tricky. I only need to know if the file contains only valid UTF8 sequences, thus being safe to treat as UTF8 (plain ASCII can be safely treated as UTF8)
The File comes in a form of a Stream from within AspNetCore.
I assume I will have to read the stream twice, first to check it doesn't contain any invalid UTF8 sequences, and second to actually process it further.
Based on a comment by @madreflection:
public static async Task<bool> IsValidUtf8(Stream stream)
{
var reader = new StreamReader(stream, new UTF8Encoding(true, true));
try
{
await reader.ReadToEndAsync();
return true;
}
catch (DecoderFallbackException)
{
return false;
}
}
I was hoping I wouldn't have to read the stream twice, but that's not possible. It would also be undesirable, as I need to make a decision before processing.
One thing that bothers me about this is that the whole stream is read into RAM, but CodeReview might be a better place to discuss that.
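One possible way around the memory concern is to feed the stream through a stateful Decoder in fixed-size chunks, so that only one buffer is held at a time. This is a synchronous sketch, not the original poster's code; the class name Utf8Validator and the default buffer size are assumptions.

```csharp
using System.IO;
using System.Text;

public static class Utf8Validator
{
    // Returns true if the stream contains only valid UTF-8 sequences.
    // Only one buffer of bufferSize bytes is held in memory at a time.
    public static bool IsValidUtf8Chunked(Stream stream, int bufferSize = 4096)
    {
        // throwOnInvalidBytes: true makes the decoder throw instead of substituting U+FFFD.
        var encoding = new UTF8Encoding(false, true);
        Decoder decoder = encoding.GetDecoder();
        var bytes = new byte[bufferSize];
        var chars = new char[encoding.GetMaxCharCount(bufferSize)];
        try
        {
            int read;
            while ((read = stream.Read(bytes, 0, bytes.Length)) > 0)
            {
                // flush: false keeps a partial multibyte sequence for the next chunk.
                decoder.GetChars(bytes, 0, read, chars, 0, false);
            }
            // flush: true reports a sequence that was cut off at the end of the stream.
            decoder.GetChars(bytes, 0, 0, chars, 0, true);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

The stream still has to be read a second time for the actual processing, so a seekable stream would need its Position reset to 0 afterwards.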

How do I access the data in a Avro.snz file with C#

I have an Avro.snz file whose
avro.codecs is snappy
This can be opened with com.databricks.avro in Spark but it seems snappy is unsupported by Apache.Avro and Confluent.Avro, they only have deflate and null. Although they can get me the Schema, I cannot get at the data.
The following code throws an error. IronSnappy is unable to decompress the file either; it says the input is corrupt.
using (Avro.File.IFileReader<generic> reader = Avro.File.DataFileReader<generic>.OpenReader(avro_path))
{
schema = reader.GetSchema();
Console.WriteLine(reader.HasNext()); //true
var hi = reader.Next(); // error
Console.WriteLine(hi.ElementAt(0).ToString()); // error
}
I'm starting to wonder if there is anything in the Azure HDInsight library, but I can't seem to find a NuGet package that gives me a way to read Avro with support for Snappy compression.
I'm open to any solution, even if that means downloading the source for Apache.Avro and adding in Snappy support manually, but to be honest, I'm sort of a newbie and have no idea how compression even works let alone add support to a library.
Can anyone help?
Update:
Just adding the snappy codec to Apache.Avro and changing the DeflateStream to Ironsnappy stream failed. It gave Corrupt input again. Is there anything anywhere that can open Snappy compressed Avro files with C#?
Or how do I determine what part of the Avro is snappy compressed and pass that to Ironsnappy.
Ok, so not even any comments on this. But I eventually solved my problem. Here is how I solved it.
I tried Apache.Avro and the Confluent version as well, but their .NET versions have no snappy support, darn. I can still get the schema, though, as that is apparently uncompressed.
Since Parquet.Net uses IronSnappy, I added a snappy codec to Apache.Avro by basically cloning its deflate code and changing a few names. Failed. "Corrupt input", IronSnappy says.
I researched Avro and saw that a file consists of an uncompressed schema, followed by the name of the compression codec of the data, then the data itself, which is divided into blocks. Well, I had no idea where a block starts and ends. The binary in the file gives that info somehow, but I couldn't work it out even with a hex editor. I think Apache.Avro reads a long or a varint, and the hex editor I used doesn't show me that.
I found the avro-tools.jar tool inside Apache.Avro. To make it easier to use, I made it an executable with launch4j (a totally superfluous move, but whatever). Then I used it to cat my Avro file down to 1 row, once uncompressed and once with snappy. I used that as my base and followed the flow of Apache.Avro in the debugger, while also tracking the byte indexes with the hex editor and the C# debugger.
With 1 row, it is guaranteed to be 1 block. So I ran a loop over the byte start index and end index. I found my Snappy block and was able to decompress it with IronSnappy. I modified the codec portion of my Apache.Avro snappy codec code to make it work with 1 block (which was basically whatever block Apache.Avro took minus 4 bytes, which I assume is the Snappy CRC check, which I ignored).
It failed with multiple blocks. I found it's because Apache.Avro always hands the deflate codec a 4096-byte array after the first block. I reduced it to the size actually read and did the minus-4 thing again. It worked.
Success! So basically it was: copy deflate over as a template for snappy, reduce the block by 4 bytes, then make sure to resize the byte array to the block size before getting IronSnappy to decompress.
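For context, the Avro specification describes each data block as an object count (long), the size in bytes of the serialized data (long), the data itself, and a 16-byte sync marker, with longs stored as variable-length zig-zag integers. A hex editor will not show those values directly, which may explain the difficulty above. The following is a minimal sketch of decoding one such long; the class name AvroVarint is an assumption and this is not the Apache.Avro implementation.

```csharp
using System.IO;

public static class AvroVarint
{
    // Reads one Avro "long": 7 bits per byte, least significant group first,
    // high bit set on every byte except the last, then zig-zag decoded.
    public static long ReadLong(Stream stream)
    {
        ulong raw = 0;
        int shift = 0;
        while (true)
        {
            int b = stream.ReadByte();
            if (b < 0) throw new EndOfStreamException();
            raw |= (ulong)(b & 0x7F) << shift;
            if ((b & 0x80) == 0) break;   // last byte of this value
            shift += 7;
            if (shift > 63) throw new InvalidDataException("Varint is too long.");
        }
        // Zig-zag: even raw values are non-negative, odd raw values are negative.
        return (long)(raw >> 1) ^ -(long)(raw & 1);
    }
}
```

Calling ReadLong twice at the start of a block would give the row count and then the byte length of the compressed data that follows.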
public override byte[] Decompress(byte[] compressedData)
{
int snappySize = compressedData.Length - 4;
byte[] compressedSnappy_Data = new byte[snappySize];
System.Array.Copy(compressedData, compressedSnappy_Data, snappySize);
byte[] result = IronSnappy.Snappy.Decode(compressedSnappy_Data);
return result;
}
if (_codec.GetHashCode() == DataFileConstants.SnappyCodecHash)
{
byte[] snappyBlock = new byte[(int)_currentBlock.BlockSize];
System.Array.Copy(_currentBlock.Data, snappyBlock, (int)_currentBlock.BlockSize);
_currentBlock.Data = snappyBlock;
}
I didn't bother with actually using the checksum as I don't know how or need to? At least not right now. And I totally ignored the compress function.
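On the checksum: per the Avro specification, each snappy-compressed block is followed by the 4-byte, big-endian CRC32 of the uncompressed data, which matches the 4 trailing bytes stripped in Decompress. A hedged sketch of how that check could look is below. The class and method names are hypothetical, and the bitwise CRC-32 is written out only to keep the example self-contained.

```csharp
public static class SnappyBlockCheck
{
    // Standard CRC-32 (reflected polynomial 0xEDB88320), computed bit by bit.
    public static uint Crc32(byte[] data)
    {
        uint crc = 0xFFFFFFFFu;
        foreach (byte b in data)
        {
            crc ^= b;
            for (int i = 0; i < 8; i++)
            {
                crc = (crc & 1) != 0 ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
            }
        }
        return ~crc;
    }

    // Reads the last 4 bytes of a block as a big-endian unsigned integer.
    public static uint ReadTrailingCrc(byte[] block)
    {
        int n = block.Length;
        return ((uint)block[n - 4] << 24) | ((uint)block[n - 3] << 16)
             | ((uint)block[n - 2] << 8) | block[n - 1];
    }

    // True if the checksum stored after the compressed bytes matches the decompressed data.
    public static bool Matches(byte[] block, byte[] decompressed)
    {
        return ReadTrailingCrc(block) == Crc32(decompressed);
    }
}
```

In the Decompress method above, this would be called with compressedData and result before returning.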
but if you really want my compress function here it is
public override byte[] Compress(byte[] uncompressedData)
{
return new byte[0];
}
The simplest solution would be to use AvroConvert:
ResultModel resultObject = AvroConvert.Deserialize<ResultModel>(avroObject); // avroObject is a byte[]
From https://github.com/AdrianStrugala/AvroConvert
The null, deflate, snappy and gzip codecs are supported.

Reading alphanumeric data from text file

I am using the code below to read binary data from a text file and divide it into small chunks. I want to do the same with a text file containing alphanumeric data, which obviously does not work with the binary reader. Which reader would be best for that (stream, string or text), and how would I use it in the following code?
public static IEnumerable<IEnumerable<byte>> ReadByChunk(int chunkSize)
{
IEnumerable<byte> result;
int startingByte = 0;
do
{
result = ReadBytes(startingByte, chunkSize);
startingByte += chunkSize;
yield return result;
} while (result.Any());
}
public static IEnumerable<byte> ReadBytes(int startingByte, int byteToRead)
{
byte[] result;
using (FileStream stream = File.Open(@"C:\Users\file.txt", FileMode.Open, FileAccess.Read, FileShare.Read))
using (BinaryReader reader = new BinaryReader(stream))
{
int bytesToRead = Math.Max(Math.Min(byteToRead, (int)reader.BaseStream.Length - startingByte), 0);
reader.BaseStream.Seek(startingByte, SeekOrigin.Begin);
result = reader.ReadBytes(bytesToRead);
}
return result;
}
I can only help you get the general process figured out:
String/Text is the 2nd worst data format to read, write or process. It should be reserved exclusively for output towards and input from the user. It has some serious issues as a storage and retrieval format.
If you have to transmit, store or retrieve something as text, make sure you use a fixed Encoding and Culture Format (usually invariant) at all endpoints. You do not want to run into issues with those two.
The worst data format is raw binary. But there is a special 0th place for raw binary that you have to interpret into text before processing it further. To quote the most important parts of what I linked on encodings:
It does not make sense to have a string without knowing what encoding it uses. [...]
If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.
Almost every stupid “my website looks like gibberish” or “she can’t read my emails when I use accents” problem comes down to one naive programmer who didn’t understand the simple fact that if you don’t tell me whether a particular string is encoded using UTF-8 or ASCII or ISO 8859-1 (Latin 1) or Windows 1252 (Western European), you simply cannot display it correctly or even figure out where it ends. There are over a hundred encodings and above code point 127, all bets are off.
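Applied to the question's chunking code, the advice about a fixed encoding could look like the following sketch: a StreamReader with an explicitly chosen encoding, yielding chunks of characters rather than bytes. The class name TextChunker and the parameters are assumptions for illustration.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class TextChunker
{
    // Lazily yields the file's text in chunks of at most chunkSize characters,
    // decoding with the encoding the caller states explicitly.
    public static IEnumerable<string> ReadByChunk(string path, int chunkSize, Encoding encoding)
    {
        using (var reader = new StreamReader(path, encoding))
        {
            var buffer = new char[chunkSize];
            int read;
            // ReadBlock keeps reading until the buffer is full or the file ends.
            while ((read = reader.ReadBlock(buffer, 0, chunkSize)) > 0)
            {
                yield return new string(buffer, 0, read);
            }
        }
    }
}
```

Unlike the byte-based version, the file is opened once and no empty trailing chunk is produced. Chunks are counted in UTF-16 chars, so a surrogate pair can still be split across two chunks.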

Interpreting base 64 in C# from an image passed via JSON/PHP (base64_encode)

So I'm able to read an image file successfully, and pass it back to my C# application but I'm unable to decode it properly.
I'm returning the JSON data as such (the json_encode function isn't shown) via PHP:
$imgbinary = fread(fopen($filename, "r"), filesize($filename));
if ( strlen($imgbinary) > 0 ){
return array("success"=>true, "map"=>base64_encode($imgbinary));
}
Then in C# I use Newtonsoft.Json to decode the string (I can read success and the map properties successfully), but I'm unable to then use base64 decode to properly write the image to a file (or to display).
I'm doing it as such:
File.WriteAllText(System.Windows.Forms.Application.StartupPath + "\\MyDir\\" + FileName, Base64Decode(FileData));
public string Base64Decode(string data)
{
byte[] binary = Convert.FromBase64String(data);
return Encoding.Default.GetString(binary);
}
Am I missing something crazy simple here? What is really strange is after I decode the data, the file size is LARGER than the original file. (I realize once you encode, data increases by about 33%, just strange that after I then decode, it is still larger).
Any help/pointers would be greatly appreciated!
Am I missing something crazy simple here?
Yes. An image isn't a text file, so you shouldn't be using File.WriteAllText. What characters do you believe are present in an image file? It's really, really important to distinguish between when your data is fundamentally text, and when it's fundamentally binary. If you try to treat either as if it were the other, you're asking for trouble.
Don't convert back from the byte array to text (your Encoding.Default.GetString call will be losing data) - just use:
File.WriteAllBytes(path, Convert.FromBase64String(data));
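A small sketch of why the text round trip makes the file larger: bytes that are not valid in the chosen text encoding get replaced or re-encoded as multi-byte sequences. The example assumes UTF-8 in both directions to keep it deterministic (the question used Encoding.Default, whose behaviour depends on the platform), and the class name Base64Demo is hypothetical.

```csharp
using System;
using System.Text;

public static class Base64Demo
{
    // Correct: the decoded bytes are the image and are never treated as text.
    public static byte[] DecodeCorrectly(string base64)
    {
        return Convert.FromBase64String(base64);
    }

    // Lossy: mimics decoding to a string and writing that string out as text.
    public static byte[] DecodeViaString(string base64)
    {
        string asText = Encoding.UTF8.GetString(Convert.FromBase64String(base64));
        return Encoding.UTF8.GetBytes(asText);
    }
}
```

For the first four bytes of a PNG file (0x89 0x50 0x4E 0x47), DecodeCorrectly returns the same 4 bytes, while DecodeViaString returns 6: the 0x89 byte is not valid UTF-8, so it becomes U+FFFD, which is written back as three bytes.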

Determining size of a future file while data is still in memory

This is C#/.NET 2.0.
So I have a string that contains the future contents of an XML file. It contains metadata and binary data from image files. I would like to somehow determine how big the XML file will be once I write the data in the string to the file system.
I've tried the following and neither works:
Console.Out.WriteLine("Size: " + data.Length/1024 + "KB");
and
Console.Out.WriteLine("Size: " + (data.Length * sizeof(char))/1024 + "KB");
The actual size of the resulting file deviates from what either of these returns. I'm obviously missing something here. Any help would be appreciated.
XML Serialization:
// doc is an XMLDocument that I've built previously
StringWriter sw = new StringWriter();
doc.Save(sw);
string XMLAsString = sw.ToString();
Writing to file system (XMLAsString passed to this function as variable named data):
Random rnd = new Random(DateTime.Now.Millisecond);
FileStream fs = File.Open(@"C:\testout" + rnd.Next(1000).ToString() + ".txt", FileMode.OpenOrCreate);
StreamWriter sw = new StreamWriter(fs);
app.Diagnostics.Write("Size of XML: " + (data.Length * sizeof(char))/1024 + "KB");
sw.Write(data);
sw.Close();
fs.Close();
Thanks
You're missing how the encoding process works. Try this:
string data = "this is what I'm writing";
byte[] mybytes = System.Text.Encoding.UTF8.GetBytes(data);
The size of the array is exactly the number of bytes the text should take up on disk if it's being written in a somewhat "normal" way, as UTF-8 is the default encoding for StreamWriter. Depending on how the writer is configured, a byte order mark may add a few bytes at the start, but you should be really close with that.
Edit: I think it's worth it for everybody to remember that characters in C#/.NET are NOT one byte long, but two, and are unicode characters, that are then encoded to whatever the output format needs. That's why any approach with data.Length*sizeof(char) would not work.
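To make the difference concrete, here is a minimal sketch that computes the encoded size for a given encoding, including the byte order mark that the encoding advertises through GetPreamble. The class name SizeEstimate is an assumption, and whether a BOM is actually written depends on the writer being used.

```csharp
using System.Text;

public static class SizeEstimate
{
    // Number of bytes the string occupies once encoded, plus the encoding's
    // byte order mark (0 bytes for encodings constructed without one).
    public static int BytesOnDisk(string data, Encoding encoding)
    {
        return encoding.GetPreamble().Length + encoding.GetByteCount(data);
    }
}
```

For the 5-character string "héllo", data.Length * sizeof(char) gives 10, while BytesOnDisk gives 6 for new UTF8Encoding(false), 9 for Encoding.UTF8 (3-byte BOM) and 12 for Encoding.Unicode (2-byte BOM).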
In NTFS, if your file system is set to compress, the final file might be smaller than what your actual file might be. Is that your problem?
If you want to determine if your file will fit on the media, you have to take into account what the allocation size of the file system is. A file that is 10 bytes long does not occupy 10 bytes on the disk. The space requirement increases in discrete steps, determined by the allocation size (also called cluster size).
See this Microsoft support article for more info about NTFS and FAT cluster sizes.
What is data in your example above? How is the binary data represented in the xml file?
It's quite likely that you'll want to do a full serialization into a byte array to get an accurate guess of the size. The serializer may do arbitrary things like add CDATA tags and if you for some reason need to save the file in UTF-16 instead of UTF-8, well that'll double your size right there probably.
You can save (or write) it to a memory stream and then check how big that memory stream has become. That's the only way to determine the actual size without writing it to disk.
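A minimal sketch of the memory stream approach, assuming the same StreamWriter-based writing as in the question; the class name StreamSize and the explicit encoding parameter are illustrative choices.

```csharp
using System.IO;
using System.Text;

public static class StreamSize
{
    // Writes the text to an in-memory stream exactly as it would be written
    // to a file, then reports how many bytes were produced.
    public static long Measure(string data, Encoding encoding)
    {
        var ms = new MemoryStream();
        using (var writer = new StreamWriter(ms, encoding))
        {
            writer.Write(data);
            writer.Flush();       // push buffered characters into the stream
            return ms.Length;     // read the length before the writer closes the stream
        }
    }
}
```

This reports the logical file size only; the space actually occupied on disk is still rounded up to the file system's cluster size.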
I can't see there being any point to that. You may as well just save it to a local file, take a look at the final file size, then make a choice as to what to do with it.
If all you want to do is make a reasonable estimate of how big an XML file will become once you've added a bunch of encoded binary elements, and if we can assume that the rest of the XML will be negligible in comparison to the encoded binary content, then it's a matter of determining the bloat introduced by the encoding.
Typically we would encode binary content with base64 encoding, which results in 4 bytes of ASCII for every 3 bytes of binary; that is a 33% bloat. So an estimate would be data.Length * 1.33333, where data is the binary content.
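The estimate can be made exact: base64 emits 4 characters for every started group of 3 input bytes, padding the last group. A small sketch follows, with a hypothetical class name and assuming output without line breaks, which is what Convert.ToBase64String produces by default.

```csharp
public static class Base64Size
{
    // Exact length of the base64 text for byteCount input bytes:
    // 4 output characters per group of 3 bytes, rounding the group count up.
    public static long EncodedLength(long byteCount)
    {
        return 4 * ((byteCount + 2) / 3);
    }
}
```

For example, 1 or 3 input bytes both give 4 characters, and 4 input bytes give 8.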
