How do I get this encoding right with ANTLR? - C#

I'm working on a project for school. We are making a static code analyzer.
A requirement for this is to analyse C# code in Java, which has been going well so far with ANTLR.
I have made some example C# code in Visual Studio to scan with ANTLR, and I analyse every C# file in the solution. But it does not work: the process runs out of heap space and I get the error message:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at org.antlr.runtime.Lexer.emit(Lexer.java:151)
at org.antlr.runtime.Lexer.nextToken(Lexer.java:86)
at org.antlr.runtime.CommonTokenStream.fillBuffer(CommonTokenStream.java:119)
at org.antlr.runtime.CommonTokenStream.LT(CommonTokenStream.java:238)
After a while I thought it was an issue with encoding, because all the files are in UTF-8; I think it can't read the encoded stream. So I opened Notepad++ and changed the encoding of every file to ANSI, and then it worked. I don't really understand what ANSI means. Is it a single character set, or some kind of organisation?
I want to convert the files from any encoding (probably UTF-8) to this ANSI encoding so I won't run out of memory anymore.
This is the code that makes the Lexer and Parser:
InputStream inputStream = new FileInputStream(new File(filePath));
CharStream charStream = new ANTLRInputStream(inputStream);
CSharpLexer cSharpLexer = new CSharpLexer(charStream);
CommonTokenStream commonTokenStream = new CommonTokenStream(cSharpLexer);
CSharpParser cSharpParser = new CSharpParser(commonTokenStream);
Does anyone know how to change the encoding of the InputStream to the right encoding?
And what does Notepad++ do when I change the encoding to ANSI?

When reading text files you should set the encoding explicitly. Try your example with the following change:
CharStream charStream = new ANTLRInputStream(inputStream, "UTF-8");

I solved this issue by putting the InputStream into a BufferedInputStream and then removing the byte order mark (BOM).
I guess my parser didn't like that mark, because I had also tried setting the encoding explicitly.
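For the C# side, the same idea looks roughly like this minimal sketch (the file name is a placeholder): peek at the first three bytes and skip them only if they are the UTF-8 BOM (0xEF 0xBB 0xBF).
using System.IO;

using (var stream = new FileStream("Example.cs", FileMode.Open, FileAccess.Read))
{
    var bom = new byte[3];
    int read = stream.Read(bom, 0, 3);
    bool hasBom = read == 3 && bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF;
    if (!hasBom)
        stream.Seek(0, SeekOrigin.Begin); // no BOM: rewind so nothing is lost
    // hand the stream to whatever consumes the file from here on
}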

Related

C# .csv-file in WinForm with Ä, Ö, Ü [duplicate]

I'm using the code below to read a text file that contains foreign characters. The file is encoded as ANSI and looks fine in Notepad. The code below doesn't work: when the file's values are read and shown in the data grid, the characters appear as squares. Could there be another problem elsewhere?
StreamReader reader = new StreamReader(inputFilePath, System.Text.Encoding.ANSI);
using (reader = File.OpenText(inputFilePath))
Thanks
Update 1: I have tried all the encodings found under System.Text.Encoding, and all fail to show the file correctly.
Update 2: I've changed the file encoding (resaved the file) to Unicode, used System.Text.Encoding.Unicode, and it worked just fine. So why did Notepad read it correctly? And why didn't System.Text.Encoding.Unicode read the ANSI file?
You may also try the Default encoding, which uses the current system's ANSI codepage.
StreamReader reader = new StreamReader(inputFilePath, Encoding.Default, true);
When you use the Notepad "Save As" menu with the original file, look at the encoding combo box. It will tell you which encoding Notepad guessed the file uses.
Also, if it is an ANSI file, the detectEncodingFromByteOrderMarks parameter will probably not help much.
I had the same problem and my solution was simple: instead of
Encoding.ASCII
use
Encoding.GetEncoding("iso-8859-1")
The answer was found here.
Edit: more solutions. This may be a more accurate one:
Encoding.GetEncoding(1252);
Also, in some cases this will work for you too, if your OS default encoding matches the file encoding:
Encoding.Default;
Yes, it could be the actual encoding of the file, probably Unicode. Try UTF-8, as that is the most common Unicode encoding. Otherwise, if the file is ASCII, then the standard ASCII encoding should work.
Using Encoding.Unicode won't accurately decode an ANSI file in the same way that a JPEG decoder won't understand a GIF file.
I'm surprised that Encoding.Default didn't work for the ANSI file if it really was ANSI - if you ever find out exactly which code page Notepad was using, you could use Encoding.GetEncoding(int).
In general, where possible I'd recommend using UTF-8.
Try a different encoding such as Encoding.UTF8. You can also try letting StreamReader find the encoding itself:
StreamReader reader = new StreamReader(inputFilePath, System.Text.Encoding.UTF8, true);
Edit: Just saw your update. Try letting StreamReader do the guessing.
For Swedish Å Ä Ö, the only solution from the ones above that worked was:
Encoding.GetEncoding("iso-8859-1")
Hopefully this will save someone time.
File.OpenText() always uses a UTF-8 StreamReader implicitly. Create your own StreamReader instance instead and specify the desired encoding, like:
using (StreamReader reader = new StreamReader(@"C:\test.txt", Encoding.Default))
{
// ...
}
I solved my problem of reading Portuguese characters by changing the source file in Notepad++.
C#:
var url = System.Web.HttpContext.Current.Server.MapPath(@"~/Content/data.json");
string s = string.Empty;
using (System.IO.StreamReader sr = new System.IO.StreamReader(url, System.Text.Encoding.UTF8, true))
{
s = sr.ReadToEnd();
}
I'm also reading an exported file which contains French and German text. I used Encoding.GetEncoding("iso-8859-1") (with detectEncodingFromByteOrderMarks set to true), which worked without any problems.
For Arabic, I used Encoding.GetEncoding(1256). It is working well.
I had a similar problem with ProcessStartInfo and the StandardOutputEncoding property. I set it to code page 850 for German-language console output. This way I could read the output as ausführen instead of ausf�hren.
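For reference, a minimal sketch of that setup; the command line here is just a placeholder:
using System.Diagnostics;
using System.Text;

var psi = new ProcessStartInfo("cmd.exe", "/c dir")
{
    RedirectStandardOutput = true,
    UseShellExecute = false,
    // DOS codepage 850 so German umlauts survive the console round trip
    StandardOutputEncoding = Encoding.GetEncoding(850)
};
using (var process = Process.Start(psi))
{
    string output = process.StandardOutput.ReadToEnd();
    process.WaitForExit();
}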

C# : Change String Encoding?

I'm struggling with the encoding of one of my strings.
On a mail-sending web service, I'm receiving a bad string containing "�" instead of "é" (that's what I see in Visual Studio's debug mode, at least).
The character comes from some JSON that is deserialized when entering the WS into my DTO.
Changing the Content-Type of the JSON is not solving the thing.
So I thought I'd change the encoding of my string myself, because the JSON encoding thing seems like a VS deserialization issue (I started a thread here if any of you want to take a look at it).
I tried :
Encoding iso = Encoding.GetEncoding("ISO-8859-1");
Encoding defaultEncoding = Encoding.Default;
byte[] bytes = defaultEncoding.GetBytes(messedUpString);
byte[] isoBytes = Encoding.Convert(defaultEncoding, iso, bytes);
string cleanString = iso.GetString(isoBytes);
Or :
byte[] bytes = Encoding.Default.GetBytes(messedUpString);
string cleanString = Encoding.UTF8.GetString(bytes);
And it's not really effective... I get rid of the "�" char, which is the nice part, but I receive "?" in the cleanString instead of the expected "é", and that is not really nice, or at least not the expected behavior.
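For what it's worth, the damage usually happens when the bytes are first decoded, not afterwards: once a character has been decoded to "�" (U+FFFD), no re-encoding can bring the original back. A small illustration; the byte values are simply "é" encoded as UTF-8:
using System;
using System.Text;

byte[] utf8Bytes = { 0xC3, 0xA9 }; // "é" encoded as UTF-8

// Decoding with the wrong encoding mangles the text ("Ã©")...
string wrong = Encoding.GetEncoding("ISO-8859-1").GetString(utf8Bytes);

// ...while decoding with the right one works ("é"). If the wrong decode
// had produced '�', the original bytes would already be lost and
// Encoding.Convert could not restore them.
string right = Encoding.UTF8.GetString(utf8Bytes);

Console.WriteLine(wrong + " vs " + right);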
In fact, everything was fine in my application.
I was using SoapUI to test, and that was my error.
I downloaded a REST plugin for my browser, tried from there, and everything worked.
Thanks for the help though, @MattiVirkkunen.

Detect Stream or Byte Array Encoding on Windows Phone

I'm trying to read XML files that I download with WebClient.OpenReadAsync() in a Windows Phone application. The problem is that sometimes the XML doesn't come with UTF-8 encoding; it might come in other encodings such as ISO-8859-1, so the accents come out messed up.
I was able to load one of the ISO-8859-1 XML files perfectly using this code:
var buff = e.Result.ReadFully(); //Gets byte array from the stream
var resultBuff = Encoding.Convert(Encoding.GetEncoding("ISO-8859-1"), Encoding.UTF8, buff);
var result = Encoding.UTF8.GetString(resultBuff, 0, resultBuff.Length);
It works beautifully with ISO-8859-1; the text came out perfect afterwards, but it messed up the UTF-8 XML files.
So the idea here is to detect the encoding of the byte array or the stream before doing this; if it's not UTF-8, the data will be converted with the method above using the detected encoding.
I am searching the internet for a method that can detect the encoding, but I cannot find any!
Does anybody know how I could do this kind of thing on Windows Phone?
Thanks!
You can look for the "Content-Type" value in the WebClient.ResponseHeaders property; if you are lucky, the server sets it to indicate the media type plus its encoding (e.g. "text/html; charset=ISO-8859-4").
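A rough sketch of that approach, assuming client is the WebClient from the question and buff the downloaded bytes; the UTF-8 fallback is an arbitrary choice:
using System.Text;
using System.Text.RegularExpressions;

// e.g. "text/xml; charset=ISO-8859-1"
string contentType = client.ResponseHeaders["Content-Type"];
Encoding encoding = Encoding.UTF8; // fallback when no charset is declared
if (contentType != null)
{
    Match match = Regex.Match(contentType, @"charset=([\w-]+)", RegexOptions.IgnoreCase);
    if (match.Success)
        encoding = Encoding.GetEncoding(match.Groups[1].Value);
}
string result = encoding.GetString(buff, 0, buff.Length);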

CSV encoding issues (Microsoft Excel)

I am dynamically creating CSV files using C#, and I am encountering some strange encoding issues. I currently use the ASCII encoding, which works fine in Excel 2010, which I use at home and on my work machine. However, the customer uses Excel 2007, and for them there are some strange formatting issues, namely that the '£' sign (UK pound sign) is preceded by an accented 'A' character.
What encoding should I use? The annoying thing is that I can hardly test these fixes as I don't have access to Excel 2007!
I'm using Windows ANSI codepage 1252 without any problems on Excel 2003. I explicitly changed to this because of the same issue you are seeing.
private const int WIN_1252_CP = 1252; // Windows ANSI codepage 1252
this._writer = new StreamWriter(fileName, false, Encoding.GetEncoding(WIN_1252_CP));
I've successfully used UTF8 encoding when writing CSV files intended to work with Excel.
The only problem I had was making sure to use the overload of the StreamWriter constructor that takes an encoding as a parameter. The default encoding of StreamWriter claims to be UTF-8, but it's really UTF-8 without a byte order mark, and without a BOM Excel will mess up characters that use multiple bytes.
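In other words, something like this minimal sketch, where the file name and rows are placeholders; new UTF8Encoding(true) emits a BOM, while the plain new StreamWriter(path) overload would not:
using System.IO;
using System.Text;

using (var writer = new StreamWriter("export.csv", false, new UTF8Encoding(true)))
{
    writer.WriteLine("Name,Price");
    writer.WriteLine("Widget,£9.99");
}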
You need to add the preamble to the file:
var data = Encoding.UTF8.GetBytes(csv);
var result = Encoding.UTF8.GetPreamble().Concat(data).ToArray(); // Concat requires System.Linq
return File(new MemoryStream(result), "application/octet-stream", "file.csv");

C# - Detecting encoding in a file, write change to file using the found encoding

I wrote a small program for iterating through a lot of files and applying some changes where a certain string match is found. The problem I have is that different files have different encodings, so what I would like to do is check the encoding, then overwrite the file in its original encoding.
What would be the prettiest way of doing that in C# on .NET 2.0?
My code looks very simple as of now;
String f1 = File.ReadAllText(fileList[i]).ToLower();
if (f1.Contains(oPath))
{
    f1 = f1.Replace(oPath, nPath);
    File.WriteAllText(fileList[i], f1, Encoding.Unicode);
}
I took a look at Auto encoding detect in C# which made me realize how I could detect encoding, but I am not sure how I could use that information to write in the same encoding.
Would greatly appreciate any help here.
Unfortunately encoding is one of those subjects where there is not always a definitive answer. In many cases it's much closer to guessing the encoding than detecting it. Raymond Chen did an excellent blog post on this subject that is worth reading:
http://blogs.msdn.com/b/oldnewthing/archive/2007/04/17/2158334.aspx
The gist of the article is:
If the BOM (byte order mark) exists, then you're golden.
Otherwise it's guesswork and heuristics.
However, I still think the best approach is the one Darin mentioned in the question you linked: let StreamReader guess for you instead of reinventing the wheel. It only requires a very slight modification to your sample.
String f1;
Encoding encoding;
using (var reader = new StreamReader(fileList[i])) {
    f1 = reader.ReadToEnd().ToLower();
    encoding = reader.CurrentEncoding;
}
if (f1.Contains(oPath))
{
    f1 = f1.Replace(oPath, nPath);
    File.WriteAllText(fileList[i], f1, encoding);
}
By default, .NET uses UTF-8. It is hard to detect the character encoding because most of the time .NET will read a file as UTF-8; I always have problems with ANSI.
My trick is to read the file as a stream, force it to be decoded as UTF-8, and check for the usual characters that should appear in the text. If they are found, it's UTF-8; otherwise it's ANSI. Then I tell the user they can use only two encodings, ANSI or UTF-8. Auto-detection doesn't quite work for my language :p
I am afraid you will have to know the encoding. For UTF-based encodings, though, you can use StreamReader's built-in functionality.
Taken from here:
With regard to encodings - you will need to have identified the encoding in order to use the StreamReader. However, the StreamReader itself can help if you create it with one of the constructor overloads that allows you to supply the flag detectEncodingFromByteOrderMarks as true (or you can use Encoding.GetPreamble and look at the byte preamble yourself).
Both these methods will only help auto-detect UTF-based encodings, though - so any ANSI encodings with a specified codepage will probably not be parsed correctly.
Probably a bit late, but I encountered the same problem myself. Using the previous answers, I found a solution that works for me. It reads in the text using the StreamReader's default encoding, extracts the encoding used for that file, and uses StreamWriter to write it back with the changes using the found encoding. It also removes and then re-adds the ReadOnly flag.
string file = "File to open";
string text;
Encoding encoding;
string oldValue = "string to be replaced";
string replacementValue = "New string";

var attributes = File.GetAttributes(file);
File.SetAttributes(file, attributes & ~FileAttributes.ReadOnly);

using (StreamReader reader = new StreamReader(file, Encoding.Default))
{
    text = reader.ReadToEnd();
    encoding = reader.CurrentEncoding;
    reader.Close();
}

bool changedValue = false;
if (text.Contains(oldValue))
{
    text = text.Replace(oldValue, replacementValue);
    changedValue = true;
}

if (changedValue)
{
    using (StreamWriter write = new StreamWriter(file, false, encoding))
    {
        write.Write(text.ToString());
        write.Close();
    }
    File.SetAttributes(file, attributes | FileAttributes.ReadOnly);
}
The solution for all Germans => ÄÖÜäöüß
This function opens the file and determines the encoding by the BOM.
If the BOM is missing, the file will be interpreted as ANSI, but if it contains UTF-8 encoded German umlauts, it will be detected as UTF-8.
https://stackoverflow.com/a/69312696/9134997
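The linked function isn't reproduced here, but the idea can be sketched roughly like this; a strict UTF-8 decode stands in for the umlaut check, so treat it as an approximation rather than the linked answer's exact code:
using System.IO;
using System.Text;

static Encoding DetectEncoding(string path)
{
    byte[] bytes = File.ReadAllBytes(path);

    // A BOM makes the answer unambiguous.
    if (bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
        return Encoding.UTF8;
    if (bytes.Length >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE)
        return Encoding.Unicode;            // UTF-16 little endian
    if (bytes.Length >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
        return Encoding.BigEndianUnicode;   // UTF-16 big endian

    // No BOM: if the bytes decode as valid UTF-8 (e.g. they contain
    // UTF-8 encoded umlauts), call it UTF-8; otherwise fall back to ANSI.
    try
    {
        new UTF8Encoding(false, true).GetString(bytes); // strict decoder throws on invalid UTF-8
        return Encoding.UTF8;
    }
    catch (DecoderFallbackException)
    {
        return Encoding.Default;
    }
}

Note that a pure ASCII file also passes the strict UTF-8 check, which is harmless here since ASCII is a subset of UTF-8.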
