Reading a CSV file containing Greek characters - C#

I am trying to read the data from a CSV file using the following:
var lines = File.ReadAllLines(@"c:\test.csv").Select(a => a.Split(';'));
It works, but the fields that contain words are written in Greek characters and they are displayed as symbols.
How can I set the encoding correctly in order to read those Greek characters?

ReadAllLines has an overload that takes an Encoding along with the file path:
var lines = File.ReadAllLines(@"c:\test.csv", Encoding.Unicode)
.Select(line => line.Split(';'));
Testing:
File.WriteAllText(@"c:\test.csv", "ϗϡϢϣϤ", Encoding.Unicode);
Console.WriteLine(File.ReadAllText(@"c:\test.csv", Encoding.Unicode));
will print:
ϗϡϢϣϤ
To find out which encoding the file was actually written in, use the following snippet:
using (var r = new StreamReader(@"c:\test.csv", detectEncodingFromByteOrderMarks: true))
{
r.Peek(); // encoding detection only happens on the first read
Console.WriteLine(r.CurrentEncoding.BodyName);
}
In my scenario it prints:
utf-8
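The detection above relies on a byte order mark (BOM), so it can only report what the BOM says. The following is a minimal sketch of that behaviour, not a general encoding detector; the EncodingProbe class and its Detect helper are hypothetical names introduced for illustration. It writes the same Greek text once with a BOM and once without, and shows that the reader falls back to the encoding passed in when no BOM is present.

```csharp
using System;
using System.IO;
using System.Text;

static class EncodingProbe
{
    // Hypothetical helper: returns the encoding StreamReader settles on
    // after looking for a byte order mark. UTF-8 is the fallback.
    public static Encoding Detect(string path)
    {
        using (var reader = new StreamReader(path, Encoding.UTF8, detectEncodingFromByteOrderMarks: true))
        {
            reader.Peek(); // detection only happens on the first read
            return reader.CurrentEncoding;
        }
    }

    static void Main()
    {
        string greek = "\u03B1\u03B2\u03B3;\u03B4\u03B5\u03B6"; // "αβγ;δεζ"
        string path = Path.GetTempFileName();
        try
        {
            // WriteAllText emits the UTF-16 BOM, so detection succeeds.
            File.WriteAllText(path, greek, Encoding.Unicode);
            Console.WriteLine(Detect(path).WebName); // utf-16

            // Same text as raw UTF-16 bytes without a BOM: the reader
            // cannot tell and keeps the UTF-8 fallback.
            File.WriteAllBytes(path, Encoding.Unicode.GetBytes(greek));
            Console.WriteLine(Detect(path).WebName); // utf-8
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

If the CSV was exported without a BOM (as legacy code-page files always are), the reported encoding is only the fallback, and the real encoding still has to be known or guessed.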

Related

Replace a string in a text read from a csv and save it

I managed to load the csv and now want to change a few strings inside and then save it again.
First problem: the text is never changed to '0. Replacing only "4" with "0" works, but the replacement never happens when my search string has more than one character.
Second problem: the last replace, where I remove all double quotes. When I open the csv in an editor it shows some weird Asian characters instead of nothing.
(䈀攀稀甀最猀瀀爀)
There are no spaces in my csv. The csv looks like
.....";"++49 then more random numbers and so on.
This is just the part where ++49 is to be found.
Relevant code:
Encoding ansi = Encoding.GetEncoding(1252);
foreach (string file in Directory.EnumerateFiles(@"path comes here", "*.csv"))
{
string text = File.ReadAllText(file, ansi);
text = text.Replace(@"++49", "'0");
text = text.Replace("+49", "'0");
text = text.Replace(@"""", "");
File.WriteAllText(file, text, ansi);
}
Am I doing something fundamentally wrong?
edit: What it looks like: ";"++49<morenumbers>";; What it should look like: ;0<morenumbers>;;
As people mentioned in the comments, the problem is the encoding used to decode the file. In this case you can try this:
foreach (string file in Directory.EnumerateFiles(@"path comes here", "*.csv"))
{
Encoding ansi;
using (var reader = new System.IO.StreamReader(file, true))
{
reader.Peek(); // the encoding is only detected on the first read
ansi = reader.CurrentEncoding; // please tell what you have here ! :)
}
string text = File.ReadAllText(file, ansi);
text = text.Replace(@"++49", "'0");
text = text.Replace(@"+49", "'0");
text = text.Replace(@"""", "");
File.WriteAllText(file, text, ansi);
}
For me it works fine with every format I was able to set, and you no longer have to hard-code the encoding.
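Both symptoms in the question are consistent with the files being UTF-16 while the code reads them one byte per character. That is an assumption, since the question never states the real encoding. The sketch below reproduces the effect in memory. ISO-8859-1 stands in for code page 1252 so the example runs without registering extra code pages, and the class and method names are hypothetical.

```csharp
using System;
using System.Text;

static class Utf16AsAnsiDemo
{
    // Assumption: ISO-8859-1 replaces code page 1252 here. Both map bytes
    // 0x00-0x7F identically, which is all this example needs.
    static readonly Encoding SingleByte = Encoding.GetEncoding("ISO-8859-1");

    // Decodes UTF-16 bytes one byte per character, as the question's code would.
    public static string ReadAsSingleByte(string content) =>
        SingleByte.GetString(Encoding.Unicode.GetBytes(content));

    // Removes the quotes, writes single-byte, then views the bytes as UTF-16 again.
    public static string StripQuotesAndReopen(string content)
    {
        string stripped = ReadAsSingleByte(content).Replace("\"", "");
        return Encoding.Unicode.GetString(SingleByte.GetBytes(stripped));
    }

    static void Main()
    {
        string content = "\"++49123\";\"Bezugspr\"";

        string misread = ReadAsSingleByte(content);
        // A NUL character sits between every pair of characters, so
        // multi-character searches fail while single characters still match.
        Console.WriteLine(misread.Contains("++49")); // False
        Console.WriteLine(misread.Contains("4"));    // True

        // Deleting a quote removes one byte of a two-byte unit, shifting the
        // rest of the text by a byte: "Bezugspr" turns into CJK characters.
        string reopened = StripQuotesAndReopen(content);
        Console.WriteLine(reopened.Contains("\u4200\u6500\u7A00\u7500\u6700\u7300\u7000\u7200")); // True
    }
}
```

The escaped sequence in the last line is the same text as the (䈀攀稀甀最猀瀀爀) sample from the question, which is why detecting the encoding before reading resolves both problems at once.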

Read a CSV file and write into a file without " " using C#

I am trying to read a CSV file and store all the values in a single list. The CSV file contains credentials as uid (user id) and pass (password), separated by ','. I have successfully read all the lines and written them to a file, but the values are written between double quotes, like "abcdefgh3 12345678". What I actually want is to remove these double quotes when I write to the file. I am pasting my code here:
static void Main(string[] args)
{
var reader = new StreamReader(File.OpenRead(@"C:\Desktop\userid1.csv"));
List<string> listA = new List<string>();
while (!reader.EndOfStream)
{
var line = reader.ReadLine();
var values = line.Split(',');
listA.Add(values[0]);
listA.Add(values[1]);
}
foreach (string a in listA)
{
TextWriter tr = new StreamWriter(@"E:\newfiless",true);
tr.Write(a);
tr.Write(tr.NewLine);
tr.Close();
}
}
and the resulted output is like this:
"uid
pass"
"Martin123
123456789"
"Damian
91644"
but i want in this form:
uid
pass
Martin123
123456789
Damian
91644
Thanking you all in advance.
The original file clearly has quotes, which makes it a CSV file with only one column, and in that column there are two values. Not usual, but it happens.
To actually remove quotes you can use Trim, TrimEnd or TrimStart.
You can remove the quotes while reading, or while writing, in this case it doesn't really matter.
var line = reader.ReadLine().Trim('"');
This will remove the quotes while reading. Note that this assumes the CSV is of this "broken" variant.
tr.WriteLine(a.Trim('"'));
This will handle it on write. This will work even if the file is "correct" CSV having two columns and values in quotes.
Note that you can use WriteLine to add the newline, no need for two Write calls.
Also as others have commented, don't create a TextWriter in the loop for every value, create it once.
using (TextWriter tr = new StreamWriter(@"E:\newfiless"))
{
foreach (string a in listA)
{
tr.WriteLine(a.Trim('"'));
}
}
The using will take care of closing the file and other possible resources even if there is an exception.
I assume that all you need is to read the input file, strip out all starting/ending quotation marks, then split by comma and write it all to another file. You can actually accomplish it in a one-liner using SelectMany, which will produce a "flat" collection:
File.WriteAllLines(
@"c:\temp\output.txt",
File
.ReadAllLines(@"c:\temp\input.csv")
.SelectMany(line => line.Trim('"').Split(','))
);
It's not quite clear from your example where quotation marks are located in the file. For a typical .CSV file some comma-separated field might be wrapped in quotation marks to allow commas to be a part of the content. If it's the case, then parsing will be more complex.
You can use
tr.Write(a.Substring(1, a.Length - 2));
Edited
Please use Trim
tr.Write(a.TrimEnd('"').TrimStart('"'));

How do I read chars from other countries such as ß ä?

How do I read chars from other countries such as ß ä?
The following code reads all chars, including chars such as 0x0D.
StreamReader srFile = new StreamReader(gstPathFileName);
char[] acBuf = null;
int iReadLength = 100;
while (srFile.Peek() >= 0) {
acBuf = new char[iReadLength];
srFile.Read(acBuf, 0, iReadLength);
string s = new string(acBuf);
}
But it does not interpret correctly chars such as ß ä.
I don't know what coding the file uses. It is exported from code (into a .txt file) that was written 20 plus years ago from a C-Tree database.
The ß ä display fine with Notepad.
By default, the StreamReader constructor assumes the UTF-8 encoding (which is the de facto universal standard today). Since that's not decoding your file correctly, your characters (ß, ä) suggest that it's probably encoded using Windows-1252 (Western European):
var encoding = Encoding.GetEncoding("Windows-1252");
using (StreamReader srFile = new StreamReader(gstPathFileName, encoding))
{
// ...
}
A closely-related encoding is ISO/IEC 8859-1. If the above gives some unexpected results, use Encoding.GetEncoding("ISO-8859-1") instead.
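As a rough illustration of why the default fails, the sketch below decodes the single bytes that Windows-1252 and ISO-8859-1 both use for ß (0xDF) and ä (0xE4). It uses ISO-8859-1 because that encoding is available everywhere without extra setup (an assumption of this example; on .NET Core and later, Windows-1252 may first require registering the code pages encoding provider). The DecodeDemo class is a hypothetical name.

```csharp
using System;
using System.Text;

static class DecodeDemo
{
    public static string DecodeLatin1(byte[] bytes) =>
        Encoding.GetEncoding("ISO-8859-1").GetString(bytes);

    public static string DecodeUtf8(byte[] bytes) =>
        Encoding.UTF8.GetString(bytes);

    static void Main()
    {
        byte[] raw = { 0xDF, 0x20, 0xE4 }; // "ß ä" in a single-byte Western encoding

        // Decoded with the matching encoding, the text comes back intact.
        Console.WriteLine(DecodeLatin1(raw) == "\u00DF \u00E4"); // True

        // As UTF-8 these bytes are invalid sequences, so the decoder
        // substitutes the replacement character U+FFFD.
        Console.WriteLine(DecodeUtf8(raw).IndexOf('\uFFFD') >= 0); // True
    }
}
```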

UTF-8 File data to ANSII

I have UTF-8 files (with Swedish äåö characters). I read those as:
List<MyData> myDataList = new List<MyData>();
string[] allLines = File.ReadAllLines(csvFile[0], Encoding.Default);
foreach (string line in allLines)
{
MyData myData = new MyData();
string[] words = line.Split(";");
myData.ID = words[0];
myData.Name = words[1];
myData.Age = words[2];
myData.Date = words[3];
myData.Score = words[4];
//Do something...
myDataList.Add(myData);
}
StringBuilder sb = new StringBuilder();
foreach (MyData data in myDataList)
{
sb.AppendLine(string.Format("{0},{1},{2},{3},{4}",
data.ID,
data.Name,
data.Age,
data.Date,
data.Score));
}
File.WriteAllText("output.txt", sb.ToString(), Encoding.ASCII);
I get the output.txt file in ANSII, but without the Swedish characters. Can someone tell me how I can save file data from UTF-8 to ANSII? Thanks.
What you probably mean by "ANSII"¹ is the codepage Windows-1252, used by most Western European countries.
At the moment, you are reading the file in your system default encoding, which is probably Windows-1252, and writing it as ASCII, which defines only the first 128 characters and does not include any non-English characters (such as äåö):
string[] allLines = File.ReadAllLines(csvFile[0], Encoding.Default);
...
File.WriteAllText("output.txt", sb.ToString(), Encoding.ASCII);
Both are wrong. If you want to convert your file from UTF-8 to Windows-1252, you need to read as UTF-8 and write as Windows-1252, i.e.
string[] allLines = File.ReadAllLines(csvFile[0], Encoding.UTF8);
...
File.WriteAllText("output.txt", sb.ToString(), Encoding.GetEncoding(1252));
¹ It is spelled ANSI; but even that is not entirely correct (quote from Wikipedia):
Historically, the phrase “ANSI Code Page” (ACP) is used in Windows to refer to various code pages considered as native. The intention was that most of these would be ANSI standards such as ISO-8859-1. Even though Windows-1252 was the first and by far most popular code page named so in Microsoft Windows parlance, the code page has never been an ANSI standard. Microsoft-affiliated bloggers now state that “The term ANSI as used to signify Windows code pages is a historical reference, but is nowadays a misnomer that continues to persist in the Windows community.”
Currently you are writing the file in ASCII, which is very limited and cannot represent those Swedish characters. I would recommend trying this:
System.IO.File.WriteAllText(path, text, Encoding.GetEncoding(28603));
This writes the file using code page 28603 (ISO 8859-13, also known as Latin-7), which includes ä, å and ö. For background, I would recommend the Wikipedia article on ISO 8859.
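To see why the target encoding matters, here is a small in-memory sketch of the conversion both answers describe. It uses ISO-8859-1 as the single-byte target only because it is available without registering extra code pages (an assumption of this example; Windows-1252 encodes ä, å and ö to the same bytes). The TranscodeDemo class and its Transcode helper are hypothetical names.

```csharp
using System;
using System.Text;

static class TranscodeDemo
{
    // Hypothetical helper: re-encodes bytes from one encoding to another.
    public static byte[] Transcode(byte[] input, Encoding from, Encoding to) =>
        Encoding.Convert(from, to, input);

    static void Main()
    {
        byte[] utf8 = Encoding.UTF8.GetBytes("\u00E4\u00E5\u00F6"); // "äåö", two bytes each
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");

        // A single-byte Western European encoding keeps the characters.
        Console.WriteLine(BitConverter.ToString(Transcode(utf8, Encoding.UTF8, latin1))); // E4-E5-F6

        // ASCII has no slot for them, so each one becomes '?' (0x3F).
        Console.WriteLine(BitConverter.ToString(Transcode(utf8, Encoding.UTF8, Encoding.ASCII))); // 3F-3F-3F
    }
}
```

The second line of output is what the question's Encoding.ASCII call does to the Swedish characters: they are replaced, not merely displayed wrongly, so the information is gone from the output file.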

How to read a file into a string with CR/LF preserved?

If I asked the question "how to read a file into a string" the answer would be obvious. However -- here is the catch with CR/LF preserved.
The problem is, File.ReadAllText strips those characters. StreamReader.ReadToEnd just converted LF into CR for me which led to long investigation where I have bug in pretty obvious code ;-)
So, in short, if I have file containing foo\n\r\nbar I would like to get foo\n\r\nbar (i.e. exactly the same content), not foo bar, foobar, or foo\n\n\nbar. Is there some ready to use way in .Net space?
The outcome should be always single string, containing entire file.
Are you sure that those methods are the culprits that are stripping out your characters?
I tried to write up a quick test; StreamReader.ReadToEnd preserves all newline characters.
string str = "foo\n\r\nbar";
using (Stream ms = new MemoryStream(Encoding.ASCII.GetBytes(str)))
using (StreamReader sr = new StreamReader(ms, Encoding.UTF8))
{
string str2 = sr.ReadToEnd();
Console.WriteLine(string.Join(",", str2.Select(c => ((int)c))));
}
// Output: 102,111,111,10,13,10,98,97,114
// f o o \n \r \n b a r
An identical result is achieved when writing to and reading from a temporary file:
string str = "foo\n\r\nbar";
string temp = Path.GetTempFileName();
File.WriteAllText(temp, str);
string str2 = File.ReadAllText(temp);
Console.WriteLine(string.Join(",", str2.Select(c => ((int)c))));
It appears that your newlines are getting lost elsewhere.
This piece of code will preserve LF and CR:
string r = File.ReadAllText(@".\TestData\TR120119.TRX", Encoding.ASCII);
The outcome should be always single string, containing entire file.
It takes two hops. First one is File.ReadAllBytes() to get all the bytes in the file. Which doesn't try to translate anything, you get the raw data in the file so the weirdo line-endings are preserved as-is.
But that's bytes, you asked for a string. So second hop is to apply Encoding.GetString() to convert the bytes to a string. The one thing you have to do is pick the right Encoding class, the one that matches the encoding used by the program that wrote the file. Given that the file is pretty messed up if it contains \n\r\n sequences, and you didn't document anything else about the file, your best bet is to use Encoding.Default. Tweak as necessary.
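The two hops can be sketched as follows. This is a minimal example that assumes the file is UTF-8; swap in whichever encoding the writing program actually used. The ReadExact helper and the class name are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;

static class PreserveLineEndings
{
    // Hypothetical helper: reads the raw bytes, then decodes them in one step.
    // No line-ending translation happens at either hop.
    public static string ReadExact(string path, Encoding encoding) =>
        encoding.GetString(File.ReadAllBytes(path));

    static void Main()
    {
        string original = "foo\n\r\nbar";
        string path = Path.GetTempFileName();
        try
        {
            // Write the bytes directly so no BOM or newline conversion is involved.
            File.WriteAllBytes(path, Encoding.UTF8.GetBytes(original));

            string roundTripped = ReadExact(path, Encoding.UTF8);
            Console.WriteLine(roundTripped == original); // True
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

Note that decoding the bytes this way leaves a byte order mark, if the file has one, as the first character of the string.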
You can read the contents of a file using File.ReadAllLines, which will return an array of the lines. Then use String.Join to merge the lines together using a separator.
string[] lines = File.ReadAllLines(#"C:\Users\User\file.txt");
string allLines = String.Join("\r\n", lines);
Note that this will lose the precision of the actual line terminator characters. For example, if the lines end in only \n or \r, the resulting string allLines will have replaced them with \r\n line terminators.
There are of course other ways of achieving this without losing the true EOL terminator; however, ReadAllLines is handy in that it can detect many types of text encoding by itself, and it also takes up very few lines of code.
ReadAllText doesn't return carriage returns.
This method opens a file, reads each line of the file, and then adds each line as an element of a string. It then closes the file. A line is defined as a sequence of characters followed by a carriage return ('\r'), a line feed ('\n'), or a carriage return immediately followed by a line feed. The resulting string does not contain the terminating carriage return and/or line feed.
From MSDN - https://msdn.microsoft.com/en-us/library/ms143368(v=vs.110).aspx
This is similar to the accepted answer, but I wanted to be more to the point. sr.ReadToEnd() will read the content as desired:
string myFilePath = @"C:\temp\somefile.txt";
string myEvents = String.Empty;
FileStream fs = new FileStream(myFilePath, FileMode.Open);
StreamReader sr = new StreamReader(fs);
myEvents = sr.ReadToEnd();
sr.Close();
fs.Close();
You could even also do those in cascaded using statements. But I wanted to describe how the way you write to that file in the first place will determine how to read the content from the myEvents string, and might really be where the problem lies. I wrote to my file like this:
using System.Reflection;
using System.IO;
private static void RecordEvents(string someEvent)
{
string folderLoc = Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location);
if (!folderLoc.EndsWith(@"\")) folderLoc += @"\";
folderLoc = folderLoc.Replace(@"\\", @"\"); // replace double-slashes with single slashes
string myFilePath = folderLoc + "myEventFile.txt";
if (!File.Exists(myFilePath))
File.Create(myFilePath).Close(); // must .Close() since will conflict with opening FileStream, below
FileStream fs = new FileStream(myFilePath, FileMode.Append);
StreamWriter sr = new StreamWriter(fs);
sr.Write(someEvent + Environment.NewLine);
sr.Close();
fs.Close();
}
Then I could use the code farther above to get the string of the contents. Because I was going further and looking for the individual strings, I put this code after THAT code, up there:
if (myEvents != String.Empty) // we have something
{
// (char)0x2660 is ♠ -- I could have chosen any delimiter I did not
// expect to find in my text
myEvents = myEvents.Replace(Environment.NewLine, ((char)0x2660).ToString());
string[] eventArray = myEvents.Split((char)0x2660);
foreach (string s in eventArray)
{
if (!String.IsNullOrEmpty(s))
{
// do whatever with the individual strings from your file
}
}
}
And this worked fine. So I know that myEvents had to have the Environment.NewLine characters preserved, because I was able to replace them with (char)0x2660 and do a .Split() on that string using that character to divide it into the individual segments.
