Upload file with BOM encoding - C#

I need to upload/create a file on an FTP server with UCS-2 LE BOM encoding. I'm using C#.
Based on Microsoft's documentation, I have to use UnicodeEncoding with bigEndian:false and byteOrderMark:true. Here is the code:
using (WebClient client = new WebClient())
{
    client.Encoding = new UnicodeEncoding(false, true);
    client.Credentials = myCredentials;
    client.UploadString(path, WebRequestMethods.Ftp.UploadFile, myCsvInString);
}
The file created on the FTP server comes out as UCS-2 Little Endian, but without the BOM. For test purposes, I tried switching byteOrderMark to false and got the same result.
Why? What am I missing?
I know that I could add '\uFEFF' but why isn't it done automatically?

The API and documentation of UnicodeEncoding could be improved regarding the handling of the byte order mark. UnicodeEncoding has a byte order mark flag, but the only method (aside from Equals and GetHashCode) that uses it is GetPreamble. All other methods, and in particular the core method GetBytes, don't.
The idea is to ensure the byte order mark is only ever written at the beginning of a file. UnicodeEncoding has no knowledge of the context, so it's up to the caller to add the preamble (i.e. the byte order mark) if needed.
Based on this concept, WebClient.UploadString cannot assume it's uploading a file. It might be some other Unicode content. So it doesn't add the preamble.
You will have to add the preamble yourself. UnicodeEncoding.GetPreamble will return it.
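Here is a minimal sketch of that approach (reusing the names path, myCredentials and myCsvInString from the question; this is one way to do it, not the only one). It builds the byte array by hand, BOM first, and uploads raw bytes instead of a string:
// Prepend the preamble (BOM) yourself, then upload the raw bytes.
UnicodeEncoding encoding = new UnicodeEncoding(false, true);   // UTF-16 LE
byte[] preamble = encoding.GetPreamble();                      // 0xFF 0xFE
byte[] payload = encoding.GetBytes(myCsvInString);

byte[] data = new byte[preamble.Length + payload.Length];
Buffer.BlockCopy(preamble, 0, data, 0, preamble.Length);
Buffer.BlockCopy(payload, 0, data, preamble.Length, payload.Length);

using (WebClient client = new WebClient())
{
    client.Credentials = myCredentials;
    client.UploadData(path, WebRequestMethods.Ftp.UploadFile, data);
}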

System.Text.Encoding.Default.GetBytes fails

Here is my sample code:
Code Snippet 1: This code executes on my file repository server and returns the file as an encoded string via the WCF service:
byte[] fileBytes = new byte[0];
using (FileStream stream = System.IO.File.OpenRead(@"D:\PDFFiles\Sample1.pdf"))
{
    fileBytes = new byte[stream.Length];
    stream.Read(fileBytes, 0, fileBytes.Length);
    stream.Close();
}
string retVal = System.Text.Encoding.Default.GetString(fileBytes); // fileBytes size is 209050
Code Snippet 2:
The client box, which requested the PDF file, receives the encoded string, converts it back to a PDF, and saves it locally.
byte[] encodedBytes = System.Text.Encoding.Default.GetBytes(retVal); // GETTING corrupted here
string pdfPath = @"C:\DemoPDF\Sample2.pdf";
using (FileStream fileStream = new FileStream(pdfPath, FileMode.Create)) // encodedBytes size is 327279
{
    fileStream.Write(encodedBytes, 0, encodedBytes.Length);
    fileStream.Close();
}
The above code works absolutely fine on .NET Framework 4.5 and 4.6.1.
When I use the same code in ASP.NET Core 2.0, it fails to convert to a byte array properly. I am not getting any runtime error, but the final PDF cannot be opened after it is created; it throws an error that the PDF file is corrupted.
I tried with Encoding.Unicode and Encoding.UTF8 also, but I get the same error for the final PDF.
Also, I have noticed that when I use Encoding.Unicode, at least the original byte array and the resulting byte array are the same size. With the other encoding types even the byte sizes don't match.
So, the question is: is System.Text.Encoding.Default.GetBytes broken in .NET Core 2.0?
I have edited my question for better understanding.
Sample1.pdf exists on a different server; WCF is used to transmit the data to the client, which stores the encoded file stream and converts it to Sample2.pdf.
Hopefully my question makes some sense now.
1: the number of times you should ever use Encoding.Default is essentially zero; there may be a hypothetical case, but if there is one: it is elusive
2: PDF files are not text, so trying to use an Encoding on them is just... wrong; you aren't "GETTING corrupted here" - it just isn't text.
You may wish to see Extracting text from PDFs in C# or Reading text from PDF in .NET
If you simply wish to copy the content without parsing it: File.Copy or Stream.CopyTo are good options.
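If the WCF contract really must carry the content as a string, Base64 is the round-trip-safe way to do it; a minimal sketch (the paths come from the question, the rest is mine, not the original code):
// Server side: read the raw bytes and encode them as Base64 text.
byte[] fileBytes = File.ReadAllBytes(@"D:\PDFFiles\Sample1.pdf");
string retVal = Convert.ToBase64String(fileBytes);

// Client side: decode the Base64 text back to the exact same bytes.
byte[] decodedBytes = Convert.FromBase64String(retVal);
File.WriteAllBytes(@"C:\DemoPDF\Sample2.pdf", decodedBytes);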

Unzip Byte Array in C#

I'm working with the EBICS protocol and I want to read data in an XML file to compare it with another file.
I have successfully decoded the data from base64 using
Convert.FromBase64String(OrderData); but now I have a byte array.
To read the content I have to unzip it. I tried to unzip it with GZip like this example:
static byte[] Decompress(byte[] data)
{
    using (var compressedStream = new MemoryStream(data))
    using (var zipStream = new GZipStream(compressedStream, CompressionMode.Decompress))
    using (var resultStream = new MemoryStream())
    {
        zipStream.CopyTo(resultStream);
        return resultStream.ToArray();
    }
}
But it does not work; I get an error message:
The magic number in GZip header is not correct. Make sure you are passing in a GZip stream.
Now I have no idea how I can unzip it, please help me!
Thanks!
The first four bytes provided by the OP in a comment to another answer, 0x78 0xda 0xe5 0x98, are the start of a zlib stream. It is neither gzip nor zip, but zlib. You need a ZlibStream, which for some reason Microsoft does not provide. That's fine though, since what Microsoft does provide is buggy.
You should use DotNetZip, which provides ZlibStream, and it works.
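Assuming DotNetZip's Ionic.Zlib namespace is available, decompressing the decoded byte array looks roughly like this (a sketch, not verified against your data; the UTF-8 assumption for the XML is mine):
using Ionic.Zlib;  // DotNetZip

byte[] orderDataRaw = Convert.FromBase64String(OrderData);
byte[] unzipped = ZlibStream.UncompressBuffer(orderDataRaw);  // inflates a zlib stream
string xml = System.Text.Encoding.UTF8.GetString(unzipped);   // assuming the XML is UTF-8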
Try using SharpZipLib. It copes with various compression formats and is free under the GPL license.
As others have pointed out, I suspect you have a zip stream and not gzip. If you check the first 4 bytes in a hex view, ZIP files always start with 0x04034b50 ZIP File Format Wiki whereas GZIP files start with 0x8b1f GZIP File Format Wiki
I think I finally got it - as usual the problem is not what is in the title. Luckily I've noticed the word EBICS in your post. So, according to EBICS spec the data is first compressed, then encrypted and finally base64 encoded. As you see, after decoding base64 you need first to decrypt the data and then try to unzip it.
UPDATE: If that's not the case, it turns out from the EBICS spec (Chapter 16 Appendix: Standards and references) that ZIP refers to the zlib/deflate format, so all you need to do is replace GZipStream with DeflateStream.
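Note that .NET's DeflateStream expects raw deflate data, so the two-byte zlib header (0x78 0xDA here) has to be skipped first; a minimal sketch under that assumption (the trailing Adler-32 checksum is simply ignored):
static byte[] DecompressZlib(byte[] data)
{
    // Skip the 2-byte zlib header; DeflateStream only understands raw deflate data.
    using (var compressedStream = new MemoryStream(data, 2, data.Length - 2))
    using (var deflateStream = new DeflateStream(compressedStream, CompressionMode.Decompress))
    using (var resultStream = new MemoryStream())
    {
        deflateStream.CopyTo(resultStream);
        return resultStream.ToArray();
    }
}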

Detect Stream or Byte Array Encoding on Windows Phone

I'm trying to read XMLs that I downloaded with WebClient.OpenReadAsync() in a Windows Phone application. The problem is that sometimes the XML won't come with UTF-8 encoding; it might come with other encodings such as ISO-8859-1, so the accents get messed up.
I was able to load one of the ISO-8859-1 XMLs perfectly using this code:
var buff = e.Result.ReadFully(); //Gets byte array from the stream
var resultBuff = Encoding.Convert(Encoding.GetEncoding("ISO-8859-1"), Encoding.UTF8, buff);
var result = Encoding.UTF8.GetString(resultBuff, 0, resultBuff.Length);
It works beautifully with ISO-8859-1, the text comes out perfect, but it messes up the UTF-8 XMLs.
So the idea here is to detect the encoding of the byte array or the stream before doing this; then, if it's not UTF-8, convert the data using the method above with the detected encoding.
I have been searching the internet for a method that can detect the encoding, but I cannot find any!
Does anybody know how I could do this kind of thing on Windows Phone?
Thanks!
You can look for the "Content-Type" value in the WebClient.ResponseHeaders property; if you are lucky the server sets it to indicate the media type plus its encoding (e.g. "text/html; charset=ISO-8859-4").
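A rough sketch of what that can look like (the helper name and the UTF-8 fallback are my assumptions, not part of the answer):
static Encoding GetResponseEncoding(WebClient client)
{
    // e.g. "text/xml; charset=ISO-8859-1"
    string contentType = client.ResponseHeaders != null
        ? client.ResponseHeaders["Content-Type"]
        : null;

    if (!string.IsNullOrEmpty(contentType))
    {
        int index = contentType.IndexOf("charset=", StringComparison.OrdinalIgnoreCase);
        if (index >= 0)
        {
            string charset = contentType.Substring(index + "charset=".Length).Trim(' ', '"', ';');
            try { return Encoding.GetEncoding(charset); }
            catch (ArgumentException) { /* unknown charset name, fall back below */ }
        }
    }
    return Encoding.UTF8;  // assumed default when the header gives no hint
}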

How do I get this encoding right with ANTLR?

I'm working on a project for school. We are making a static code analyzer.
A requirement for this is to analyse C# code in Java, which is going well so far with ANTLR.
I have made some example C# code to scan with ANTLR in Visual Studio. I analyse every C# file in the solution. But it does not work: I get a memory leak and the error message:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at org.antlr.runtime.Lexer.emit(Lexer.java:151)
at org.antlr.runtime.Lexer.nextToken(Lexer.java:86)
at org.antlr.runtime.CommonTokenStream.fillBuffer(CommonTokenStream.java:119)
at org.antlr.runtime.CommonTokenStream.LT(CommonTokenStream.java:238)
After a while I thought it was an issue with encoding, because all the files are in UTF-8. I think it can't read the encoded stream. So I opened Notepad++ and changed the encoding of every file to ANSI, and then it worked. I don't really understand what ANSI means, is it one character set or some kind of organisation?
I want to change the encoding from any encoding (probably UTF-8) to this ANSI encoding so I won't get memory leaks anymore.
This is the code that makes the Lexer and Parser:
InputStream inputStream = new FileInputStream(new File(filePath));
CharStream charStream = new ANTLRInputStream(inputStream);
CSharpLexer cSharpLexer = new CSharpLexer(charStream);
CommonTokenStream commonTokenStream = new CommonTokenStream(cSharpLexer);
CSharpParser cSharpParser = new CSharpParser(commonTokenStream);
Does anyone know how to change the encoding of the InputStream to the right encoding?
And what does Notepad++ do when I change the encoding to ANSI?
When reading text files you should set the encoding explicitly. Try your examples with the following change:
CharStream charStream = new ANTLRInputStream(inputStream, "UTF-8");
I solved this issue by putting the InputStream into a BufferedStream and then removing the byte order mark.
I guess my parser didn't like that encoding, because I also tried setting the encoding explicitly.

C# - Detecting encoding in a file, write change to file using the found encoding

I wrote a small program for iterating through a lot of files and applying some changes where a certain string match is found, the problem I have is that different files have different encodings. So what I would like to do is check the encoding, then overwrite the file in its original encoding.
What would be the prettiest way of doing that in C# .net 2.0?
My code looks very simple as of now;
String f1 = File.ReadAllText(fileList[i]).ToLower();
if (f1.Contains(oPath))
{
    f1 = f1.Replace(oPath, nPath);
    File.WriteAllText(fileList[i], f1, Encoding.Unicode);
}
I took a look at Auto encoding detect in C# which made me realize how I could detect encoding, but I am not sure how I could use that information to write in the same encoding.
Would greatly appreciate any help here.
Unfortunately encoding is one of those subjects where there is not always a definitive answer. In many cases it's much closer to guessing the encoding than detecting it. Raymond Chen did an excellent blog post on this subject that is worth the read:
http://blogs.msdn.com/b/oldnewthing/archive/2007/04/17/2158334.aspx
The gist of the article is:
If the BOM (byte order mark) exists then you're golden
Else it's guesswork and heuristics
However I still think the best approach is the one Darin mentioned in the question you linked: let StreamReader guess for you rather than re-inventing the wheel. It only requires a very slight modification to your sample.
String f1;
Encoding encoding;
using (var reader = new StreamReader(fileList[i]))
{
    f1 = reader.ReadToEnd().ToLower();
    encoding = reader.CurrentEncoding;
}

if (f1.Contains(oPath))
{
    f1 = f1.Replace(oPath, nPath);
    File.WriteAllText(fileList[i], f1, encoding);
}
By default, .NET uses UTF-8. It is hard to detect the character encoding because most of the time .NET will read it as UTF-8; I always have problems with ANSI.
My trick is to read the file as a stream, force it to be read as UTF-8, and check for the usual characters that should appear in the text. If they are found, it's UTF-8, otherwise ANSI, and I tell the user they can use just two encodings, either ANSI or UTF-8. Auto-detection doesn't quite work in my language. :p
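One concrete way to implement that kind of check (not the answerer's exact code, just a sketch of the idea) is to attempt a strict UTF-8 decode and fall back to the ANSI codepage when it fails:
static Encoding GuessUtf8OrAnsi(string path)
{
    byte[] bytes = File.ReadAllBytes(path);
    // Strict decoder: throws on any byte sequence that is not valid UTF-8.
    var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
    try
    {
        strictUtf8.GetString(bytes);
        return Encoding.UTF8;
    }
    catch (DecoderFallbackException)
    {
        return Encoding.Default;  // treat as ANSI (the system codepage)
    }
}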
I am afraid you will have to know the encoding. For UTF-based encodings, though, you can use StreamReader's built-in functionality.
Taken from here:
With regard to encodings - you will need to have identified the encoding in order to use the StreamReader. However, the StreamReader itself can help if you create it with one of the constructor overloads that allows you to supply the flag detectEncodingFromByteOrderMarks as true (or you can use Encoding.GetPreamble and look at the byte preamble yourself).
Both these methods will only help auto-detect UTF based encodings though - so any ANSI encodings with a specified codepage will probably not be parsed correctly.
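A short sketch of the constructor overload the quote refers to (the path variable is assumed; the UTF-8 fallback is my choice):
Encoding detected;
using (var reader = new StreamReader(path, Encoding.UTF8, detectEncodingFromByteOrderMarks: true))
{
    reader.Peek();                      // force the reader to look at the BOM
    detected = reader.CurrentEncoding;  // the encoding the reader actually settled on
}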
Probably a bit late, but I encountered the same problem myself. Using the previous answers I found a solution that works for me: it reads in the text using StreamReader's default encoding, extracts the encoding used for that file, and uses StreamWriter to write it back with the changes using the found encoding. It also removes and re-adds the ReadOnly flag.
string file = "File to open";
string text;
Encoding encoding;
string oldValue = "string to be replaced";
string replacementValue = "New string";

var attributes = File.GetAttributes(file);
File.SetAttributes(file, attributes & ~FileAttributes.ReadOnly);

using (StreamReader reader = new StreamReader(file, Encoding.Default))
{
    text = reader.ReadToEnd();
    encoding = reader.CurrentEncoding;
    reader.Close();
}

bool changedValue = false;
if (text.Contains(oldValue))
{
    text = text.Replace(oldValue, replacementValue);
    changedValue = true;
}

if (changedValue)
{
    using (StreamWriter write = new StreamWriter(file, false, encoding))
    {
        write.Write(text.ToString());
        write.Close();
    }
    File.SetAttributes(file, attributes | FileAttributes.ReadOnly);
}
The solution for all Germans => ÄÖÜäöüß
This function opens the file and determines the encoding by the BOM.
If the BOM is missing, the file will be interpreted as ANSI, but if it contains UTF-8 encoded German umlauts, it will be detected as UTF-8.
https://stackoverflow.com/a/69312696/9134997
