I have UTF-8 files (with Swedish äåö characters). I read those as:
List<MyData> myDataList = new List<MyData>();
string[] allLines = File.ReadAllLines(csvFile[0], Encoding.Default);
foreach (string line in allLines)
{
MyData myData = new MyData();
string[] words = line.Split(";");
myData.ID = words[0];
myData.Name = word[1];
myData.Age = words[2];
myData.Date = words[3];
myData.Score = words[4];
//Do something...
myDataList.Add(myData);
}
StringBuilder sb = new StringBuilder();
foreach (string data in myDataList)
{
sb.AppendLine(string.Format("{0},{1},{2},{3},{4}",
data.ID,
data.Name,
data.Age,
data.Date,
data.Score));
}
File.WriteAllText("output.txt", sb.ToString(), Encoding.ASCII);
I get output.txt file in ansii but not with Swedish characters. Can someone help me to know how can I save file data from UTF-8 to Ansii? Thanks.
What you probably mean by "ANSII"¹ is the codepage Windows-1252, used by most Western European countries.
At the moment, you are reading the file in your system default encoding, which is probably Windows-1252, and writing it as ASCII, which defines only the first 128 characters and does not include any non-English characters (such as äåö):
string[] allLines = File.ReadAllLines(csvFile[0], Encoding.Default);
...
File.WriteAllText("output.txt", sb.ToString(), Encoding.ASCII);
This is both wrong. If you want to convert your file from UTF-8 to Windows-1252, you need to read as UTF-8 and write as Windows 1252, i.e.
string[] allLines = File.ReadAllLines(csvFile[0], Encoding.UTF8);
...
File.WriteAllText("output.txt", sb.ToString(), new Encoding(1252));
¹ It is spelled ANSI; but even that is not entirely correct (quote from Wikipedia):
Historically, the phrase “ANSI Code Page” (ACP) is used in Windows to refer to various code pages considered as native. The intention was that most of these would be ANSI standards such as ISO-8859-1. Even though Windows-1252 was the first and by far most popular code page named so in Microsoft Windows parlance, the code page has never been an ANSI standard. Microsoft-affiliated bloggers now state that “The term ANSI as used to signify Windows code pages is a historical reference, but is nowadays a misnomer that continues to persist in the Windows community.”
Currently you are writing the file in ASCII, which is very limited and not capable of showing those "swedish" characters. I would recommend to try this :
System.IO.File.WriteAllText(path, text, Encoding.GetEncoding(28603));
This writes the file in ANSI encoding with codepage Latin-4. I would recommend you the wikipedia article: ISO 8859
Related
I managed to load the csv and now want to change a few strings inside and then save it again.
First problem: He doesnt want to change the text to '0 . Replacing only "4" with "0" works, but never when my string has more than 1 character.
Second problem: The last replace where I delete all ' to "". When opening the csv in an editor it shows some weird asian characters instead of nothing.
(䈀攀稀甀最猀瀀爀)
There are no spaces in my csv. The csv looks like
.....";"++49 then more random numbers and so on.
This is just the part where ++49 is to be found.
Relevant code:
Encoding ansi = Encoding.GetEncoding(1252);
foreach (string file in Directory.EnumerateFiles(#"path comes here, "*.csv"))
{
string text = File.ReadAllText(file, ansi);
text = text.Replace(#"++49", "'0");
text = text.Replace("+49", "'0");
text = text.Replace(#"""", "");
File.WriteAllText(file, text, ansi);
}
Am i doing something fundamentally wrong?
edit: What it looks like: ";"++49<morenumbers>";; What it should look like: ;0<morenumbers>;;
As people mentioned in comments, problem is with your file encoding decoding. So in this case you can try this:
foreach(string file in Directory.EnumerateFiles(#"path comes here","*.csv"))
{
Encoding ansi;
using (var reader = new System.IO.StreamReader(file, true))
{
ansi = reader.CurrentEncoding; // please tell what you have here ! :)
}
string text = File.ReadAllText(file, ansi);
text = text.Replace(#"++49", "'0");
text = text.Replace(#"+49", "'0");
text = text.Replace(#"""", "");
File.WriteAllText(file, text, ansi);
}
For me it works fine with all formats I was able to set. Then you do not have to set your encoding as hardcoded value
How do I read chars from other countries such as ß ä?
The following code reads all chars, including chars such as 0x0D.
StreamReader srFile = new StreamReader(gstPathFileName);
char[] acBuf = null;
int iReadLength = 100;
while (srFile.Peek() >= 0) {
acBuf = new char[iReadLength];
srFile.Read(acBuf, 0, iReadLength);
string s = new string(acBuf);
}
But it does not interpret correctly chars such as ß ä.
I don't know what coding the file uses. It is exported from code (into a .txt file) that was written 20 plus years ago from a C-Tree database.
The ß ä display fine with Notepad.
By default, the StreamReader constructor assumes the UTF-8 encoding (which is the de facto universal standard today). Since that's not decoding your file correctly, your characters (ß, ä) suggest that it's probably encoded using Windows-1252 (Western European):
var encoding = Encoding.GetEncoding("Windows-1252");
using (StreamReader srFile = new StreamReader(gstPathFileName, encoding))
{
// ...
}
A closely-related encoding is ISO/IEC 8859-1. If the above gives some unexpected results, use Encoding.GetEncoding("ISO-8859-1") instead.
I am trying to read the data from a CSV file using the following:
var lines = File.ReadAllLines(#"c:\test.csv").Select(a => a.Split(';'));
It works but the fields that contain words are written with Greek charactes and they are presented as symbols.
How can I set the Encoding correctly in order to read those greek characters?
ReadAllLines has overload, which takes Encoding along file path
var lines = File.ReadAllLines(#"c:\test.csv", Encoding.Unicode)
.Select(line => line.Split(';'));
Testing:
File.WriteAllText(#"c:\test.csv", "ϗϡϢϣϤ", Encoding.Unicode);
Console.WriteLine(File.ReadAllLines(#"c:\test.csv", Encoding.Unicode));
will print:
ϗϡϢϣϤ
To find out in which encoding the file was actually written, use next snippet:
using (var r = new StreamReader(#"c:\test.csv", detectEncodingFromByteOrderMarks: true))
{
Console.WriteLine (r.CurrentEncoding.BodyName);
}
for my scenario it will print
utf-8
If I asked the question "how to read a file into a string" the answer would be obvious. However -- here is the catch with CR/LF preserved.
The problem is, File.ReadAllText strips those characters. StreamReader.ReadToEnd just converted LF into CR for me which led to long investigation where I have bug in pretty obvious code ;-)
So, in short, if I have file containing foo\n\r\nbar I would like to get foo\n\r\nbar (i.e. exactly the same content), not foo bar, foobar, or foo\n\n\nbar. Is there some ready to use way in .Net space?
The outcome should be always single string, containing entire file.
Are you sure that those methods are the culprits that are stripping out your characters?
I tried to write up a quick test; StreamReader.ReadToEnd preserves all newline characters.
string str = "foo\n\r\nbar";
using (Stream ms = new MemoryStream(Encoding.ASCII.GetBytes(str)))
using (StreamReader sr = new StreamReader(ms, Encoding.UTF8))
{
string str2 = sr.ReadToEnd();
Console.WriteLine(string.Join(",", str2.Select(c => ((int)c))));
}
// Output: 102,111,111,10,13,10,98,97,114
// f o o \n \r \n b a r
An identical result is achieved when writing to and reading from a temporary file:
string str = "foo\n\r\nbar";
string temp = Path.GetTempFileName();
File.WriteAllText(temp, str);
string str2 = File.ReadAllText(temp);
Console.WriteLine(string.Join(",", str2.Select(c => ((int)c))));
It appears that your newlines are getting lost elsewhere.
This piece of code will preserve LR and CR
string r = File.ReadAllText(#".\TestData\TR120119.TRX", Encoding.ASCII);
The outcome should be always single string, containing entire file.
It takes two hops. First one is File.ReadAllBytes() to get all the bytes in the file. Which doesn't try to translate anything, you get the raw data in the file so the weirdo line-endings are preserved as-is.
But that's bytes, you asked for a string. So second hop is to apply Encoding.GetString() to convert the bytes to a string. The one thing you have to do is pick the right Encoding class, the one that matches the encoding used by the program that wrote the file. Given that the file is pretty messed up if it contains \n\r\n sequences, and you didn't document anything else about the file, your best bet is to use Encoding.Default. Tweak as necessary.
You can read the contents of a file using File.ReadAllLines, which will return an array of the lines. Then use String.Join to merge the lines together using a separator.
string[] lines = File.ReadAllLines(#"C:\Users\User\file.txt");
string allLines = String.Join("\r\n", lines);
Note that this will lose the precision of the actual line terminator characters. For example, if the lines end in only \n or \r, the resulting string allLines will have replaced them with \r\n line terminators.
There are of course other ways of acheiving this without losing the true EOL terminator, however ReadAllLines is handy in that it can detect many types of text encoding by itself, and it also takes up very few lines of code.
ReadAllText doesn't return carriage returns.
This method opens a file, reads each line of the file, and then adds each line as an element of a string. It then closes the file. A line is defined as a sequence of characters followed by a carriage return ('\r'), a line feed ('\n'), or a carriage return immediately followed by a line feed. The resulting string does not contain the terminating carriage return and/or line feed.
From MSDN - https://msdn.microsoft.com/en-us/library/ms143368(v=vs.110).aspx
This is similar to the accepted answer, but wanted to be more to the point. sr.ReadToEnd() will read the bytes like is desired:
string myFilePath = #"C:\temp\somefile.txt";
string myEvents = String.Empty;
FileStream fs = new FileStream(myFilePath, FileMode.Open);
StreamReader sr = new StreamReader(fs);
myEvents = sr.ReadToEnd();
sr.Close();
fs.Close();
You could even also do those in cascaded using statements. But I wanted to describe how the way you write to that file in the first place will determine how to read the content from the myEvents string, and might really be where the problem lies. I wrote to my file like this:
using System.Reflection;
using System.IO;
private static void RecordEvents(string someEvent)
{
string folderLoc = Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location);
if (!folderLoc.EndsWith(#"\")) folderLoc += #"\";
folderLoc = folderLoc.Replace(#"\\", #"\"); // replace double-slashes with single slashes
string myFilePath = folderLoc + "myEventFile.txt";
if (!File.Exists(myFilePath))
File.Create(myFilePath).Close(); // must .Close() since will conflict with opening FileStream, below
FileStream fs = new FileStream(myFilePath, FileMode.Append);
StreamWriter sr = new StreamWriter(fs);
sr.Write(someEvent + Environment.NewLine);
sr.Close();
fs.Close();
}
Then I could use the code farther above to get the string of the contents. Because I was going further and looking for the individual strings, I put this code after THAT code, up there:
if (myEvents != String.Empty) // we have something
{
// (char)2660 is ♠ -- I could have chosen any delimiter I did not
// expect to find in my text
myEvents = myEvents.Replace(Environment.NewLine, ((char)2660).ToString());
string[] eventArray = myEvents.Split((char)2660);
foreach (string s in eventArray)
{
if (!String.IsNullOrEmpty(s))
// do whatever with the individual strings from your file
}
}
And this worked fine. So I know that myEvents had to have the Environment.NewLine characters preserved because I was able to replace it with (char)2660 and do a .Split() on that string using that character to divide it into the individual segments.
As part of a recent project I had to read and write from a CSV file and put in a grid view in c#. In the end decided to use a ready built parser to do the work for me.
Because I like to do that kind of stuff, I wondered how to go about writing my own.
So far all I've managed to do is this:
//Read the header
StreamReader reader = new StreamReader(dialog.FileName);
string row = reader.ReadLine();
string[] cells = row.Split(',');
//Create the columns of the dataGridView
for (int i = 0; i < cells.Count() - 1; i++)
{
DataGridViewTextBoxColumn column = new DataGridViewTextBoxColumn();
column.Name = cells[i];
column.HeaderText = cells[i];
dataGridView1.Columns.Add(column);
}
//Display the contents of the file
while (reader.Peek() != -1)
{
row = reader.ReadLine();
cells = row.Split(',');
dataGridView1.Rows.Add(cells);
}
My question: is carrying on like this a wise idea, and if it is (or isn't) how would I test it properly?
As a programming exercise (for learning and gaining experience) it is probably a very reasonable thing to do. For production code, it may be better to use an existing library mainly because the work is already done. There are quite a few things to address with a CSV parser. For example (randomly off the top of my head):
Quoted values (strings)
Embedded quotes in quoted strings
Empty values (NULL ... or maybe even NULL vs. empty).
Lines without the correct number of entries
Headers vs. no headers.
Recognizing different data types (e.g., different date formats).
If you have a very specific input format in a very controlled environment, though, you may not need to deal with all of those.
... is carrying on like this a wise idea ...?
Since you're doing this as a learning exercise, you may want to dig deeper into lexing and parsing theory. Your current approach will show its shortcomings fairly quickly as described in Stop Rolling Your Own CSV Parser!. It's not that parsing CSV data is difficult. (It's not.) It's just that most CSV parser projects treat the problem as a text splitting problem versus a parsing problem. If you take the time to define the CSV "language", the parser almost writes itself.
RFC 4180 defines a grammar for CSV data in ABNF form:
file = [header CRLF] record *(CRLF record) [CRLF]
header = name *(COMMA name)
record = field *(COMMA field)
name = field
field = (escaped / non-escaped)
escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
non-escaped = *TEXTDATA
COMMA = %x2C
CR = %x0D ;as per section 6.1 of RFC 2234
DQUOTE = %x22 ;as per section 6.1 of RFC 2234
LF = %x0A ;as per section 6.1 of RFC 2234
CRLF = CR LF ;as per section 6.1 of RFC 2234
TEXTDATA = %x20-21 / %x23-2B / %x2D-7E
This grammar shows how single characters are built up to create more and more complex language elements. (As written, definitions go the opposite direction from complex to simple.)
If you start with a grammar, you can write parsing functions that mirror non-terminal grammar elements (the lowercase items). Julian M Bucknall describes the process in Writing a parser for CSV data. Take a look at Test-Driven Development with ANTLR for an example of the same process using a parser generator.
Keep in mind, there is no one accepted CSV definition. CSV data in the wild is not guaranteed to implement all of the RFC 4180 suggestions.
Get (or make) some CSV data and write Unit Tests using NUnit or Visual Studio Testing Tools.
Be sure to test edge cases like
"csv","Data","with","a","trailing","comma",
and
"csv","Data","with,","commas","and","""quotes""","in","it"
This come from
http://www.gigawebsolution.com/Posts/Details/61/Building-a-Simple-CSV-Parser-in-C#
public interface ICsvReaderWriter
{
List<string[]> Read(string filePath, char delimiter);
void Write(string filePath, List<string[]> lines, char delimiter);
}
public class CsvReaderWriter : ICsvReaderWriter
{
public List<string[]> Read(string filePath, char delimiter)
{
var fileContent = new List<string[]>();
using (var reader = new StreamReader(filePath, Encoding.Unicode))
{
string line;
while ((line = reader.ReadLine()) != null)
{
if (!string.IsNullOrEmpty(line))
{
fileContent.Add(line.Split(delimiter));
}
}
}
return fileContent;
}
public void Write(string filePath, List<string[]> lines, char delimiter)
{
using (var writer = new StreamWriter(filePath, true, Encoding.Unicode))
{
foreach (var line in lines)
{
var data = line.Aggregate(string.Empty,
(current, column) => current +
string.Format("{0}{1}", column,delimiter))
.TrimEnd(delimiter);
writer.WriteLine(data);
}
}
}
}
Parsing a CSV file isn't difficult, but it involves more than simply calling String.Split().
You are breaking the lines at each comma. But it's possible for fields to contain embedded commas. In these cases, CSV wraps the field in double quotes. So you must also look for double quotes and ignore commas within those quotes. In addition, it's even possible for fields to contain embedded double quotes. Double quotes must appear within double quotes and be "doubled up" to indicate the quote is a literal character.
If you'd like to see how I did it, you can check out this article.