Strange line break character appearing in C# generated CSV - c#

I use an OleDb data reader to read a number of records and then write them to a CSV file. I then read from this CSV using File.ReadAllLines and split on commas to get my data. The problem is that some parts of the CSV include a character I can't display (it shows up as a square), which appears to act as a line break. This line break corrupts the CSV, so I need to get rid of it.
I've tried replacing Environment.NewLine with something else (a blank space) when writing the CSV, and likewise with \r and \n, but to no avail: the character isn't replaced. What other ways are there to remove it?

I then read from this CSV using File.ReadAllLines, then split on commas to get my data.
Stop rolling your own CSV parser.
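As a rough illustration of what "not rolling your own" can look like without adding a third-party dependency, here is a minimal sketch built on the TextFieldParser class from Microsoft.VisualBasic.FileIO. The names CsvReading and ReadRows are made up for this example, and it assumes a comma-delimited file that uses double quotes as the text qualifier.

```csharp
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class CsvReading
{
    // Hypothetical helper: reads comma-delimited records from any TextReader.
    // Quoted fields may contain commas without being split.
    public static List<string[]> ReadRows(TextReader reader)
    {
        var rows = new List<string[]>();
        using (var parser = new TextFieldParser(reader))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;
            while (!parser.EndOfData)
            {
                rows.Add(parser.ReadFields());
            }
        }
        return rows;
    }
}
```

It could be called as CsvReading.ReadRows(new StreamReader(path)) in place of File.ReadAllLines plus Split(','). This only helps with stray line breaks if the writer also quotes the fields that contain them.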

Don't write CSV files this way, as it won't work in every scenario.
Use OLE DB to do it for you.
http://devlicio.us/blogs/sergio_pereira/archive/2008/09/17/tip-export-to-csv-using-ado-net.aspx
Hope that helps.

Related

Escape semicolon in interpolated string C#

I am writing data to .csv file and I need the below expression to be read correctly:
csvWriter.WriteLine($"Security: {sec.ID} ; Serial number: {sec.SecuritySerialNo}");
the semicolon in between is used to put the serial number in a separate cell.
The problem is that ID can also contain semicolons and mess up the data, therefore I need to escape it.
I have tried to use replace:
csvWriter.WriteLine($"Security: {sec.ID.Replace(";", "")} ; Serial number: {sec.SecuritySerialNo}");
though deleting semicolons is not what I want to achieve, I just want to escape them.
Let's emphasize again that the best way to create a CSV file is through a specialized CSV Parser library.
However, just to resolve the simple case presented by your question you could add double quotes around each field. This should be enough to explain to the Excel parser how to handle your fields.
So, export the fields in this way:
csvWriter.WriteLine($"\"Security: {sec.ID}\";\"Serial number: {sec.SecuritySerialNo}\"");
Notice that I have removed the blank space around the semicolon. This is important; otherwise Excel will not parse the line correctly.
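If the values themselves may contain double quotes, wrapping alone is not enough; the usual convention is to double each embedded quote. A minimal sketch of that idea follows. CsvFields, Quote and Line are hypothetical names introduced here, not part of any library.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class CsvFields
{
    // Wraps a value in double quotes and doubles any embedded double quotes.
    public static string Quote(string value)
    {
        return "\"" + (value ?? string.Empty).Replace("\"", "\"\"") + "\"";
    }

    // Quotes every value and joins them with the delimiter (';' in the question),
    // with no spaces around the delimiter.
    public static string Line(char delimiter, params string[] values)
    {
        return string.Join(delimiter.ToString(), values.Select(Quote));
    }
}
```

With this in place the call from the question might become csvWriter.WriteLine(CsvFields.Line(';', $"Security: {sec.ID}", $"Serial number: {sec.SecuritySerialNo}")); so a semicolon inside sec.ID stays inside its quoted field.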

Field and text delimiters within cells in csv files

This is likely a very basic question that, despite trying, I could not find a satisfying answer to. Feel free to skip to the question at the end if you aren't interested in the background.
The task:
I wish to create an easy localisation solution for my Unity projects. After some initial research I concluded it would be best to use a .csv file read by a StreamReader, so that translators would only ever have to interact with the CSV table, where information is neatly organized.
The main problem:
Due to the nature of the text, I need to account for line breaks and special characters in the actual fields. As such, I could not use the normal ReadLine() method.
I worked around this by using Read() and checking whether a line break falls inside a pair of text delimiters. But since I check for the text delimiter, I am afraid the reader might run into an unescaped delimiter that is part of the normal in-cell text (the normal text delimiter being quotation marks).
So I switched the delimiter to §. But now every time I open the file I have to re-enter § as the text delimiter in OpenOffice Calc, probably due to encoding differences. This is annoying but not the end of the world.
My question:
How does OpenOffice (or similar software) usually tell in-cell commas/quotation marks apart from the ones used as delimiters? If I knew that, I could probably incorporate a similar approach in my reading of the file.
I've tried looking at the files with Notepad++, which reveals a difference in line breaks (\r instead of \r\n), and the in-cell line break is obviously inside a pair of text delimiters. But when it comes to how it separates its delimiters from ones just entered in the text/field, I am drawing a blank.
Translation file in OpenOffice Calc:
Translation file in NotePad++, showing all characters:
I'd appreciate any insight or links on the topic.
From https://en.wikipedia.org/wiki/Comma-separated_values:
The CSV file format is not fully standardized. The basic idea of separating fields with a comma is clear, but that idea gets complicated when the field data may also contain commas or even embedded line breaks.
LibreOffice Calc has a reasonable way to handle these things.
Use LF for line breaks and CR at the end of each record. It seems your code already handles this.
Use quotes to delimit strings when needed. If the string contains one or more quotes, then duplicate the quote to make it literal.
From the example in your question, it looks like you told Calc not to use any quotes as string delimiters. Why did you do this? When I tried it, LibreOffice (or Apache OpenOffice) showed the fields in different columns after opening the file saved that way.
The following example CSV file has fields that contain commas, quotes and line breaks.
When viewed in Calc:
    | A       | B
----+---------+---
  1 | 1,",2", | 3
----+---------+---
  2 | a       | c
    | b       |
Calc correctly reads and saves the file as shown below. Settings when saving are Field delimiter , and String delimiter " which are the defaults.
"1,"",2"",",3[CR]
"a
b",c[CR]
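The two rules above (quote a field when needed, double a quote to make it literal) can be read back with a small state machine. The following is only a sketch of that idea, not how Calc is actually implemented; QuotedCsv and Parse are hypothetical names, and for simplicity any CR, LF or CRLF outside quotes is treated as the end of a record.

```csharp
using System.Collections.Generic;
using System.Text;

public static class QuotedCsv
{
    // Splits CSV text into records. Inside quotes, delimiters and line breaks
    // are ordinary text and a doubled quote stands for one literal quote.
    public static List<List<string>> Parse(string text, char delimiter = ',', char quote = '"')
    {
        var records = new List<List<string>>();
        var record = new List<string>();
        var field = new StringBuilder();
        bool inQuotes = false;
        bool recordHasData = false; // true once the current record has any content

        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (inQuotes)
            {
                if (c != quote)
                {
                    field.Append(c); // delimiter or line break inside quotes
                }
                else if (i + 1 < text.Length && text[i + 1] == quote)
                {
                    field.Append(quote); // doubled quote = literal quote
                    i++;
                }
                else
                {
                    inQuotes = false; // closing quote
                }
            }
            else if (c == quote)
            {
                inQuotes = true;
                recordHasData = true;
            }
            else if (c == delimiter)
            {
                record.Add(field.ToString());
                field.Clear();
                recordHasData = true;
            }
            else if (c == '\r' || c == '\n')
            {
                // Consume the LF of a CRLF pair so it ends only one record.
                if (c == '\r' && i + 1 < text.Length && text[i + 1] == '\n') i++;
                if (recordHasData)
                {
                    record.Add(field.ToString());
                    records.Add(record);
                }
                record = new List<string>();
                field.Clear();
                recordHasData = false;
            }
            else
            {
                field.Append(c);
                recordHasData = true;
            }
        }
        if (recordHasData) // last record without a trailing line break
        {
            record.Add(field.ToString());
            records.Add(record);
        }
        return records;
    }
}
```

Fed the file shown above, this yields two records of two fields each, with the comma, the quotes and the in-cell line break kept inside the first field of their rows. Blank lines are skipped, which may or may not be what a translation table needs.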

CSV file double-spacing lines

I am having trouble with OpenOffice Calc opening a CSV file that I create using StreamWriter in C#. When it opens, there are empty lines between every line that should be there (double-spaced). There seems to be some kind of doubling of the carriage returns. When I open it in Notepad it reads correctly. When I changed the program to write integers instead of strings, the problem went away. It seems to be adding a return at the end of each string, and then the formatting adds another return that I'm not seeing.
Output looks like this...
1...

2...

3...
Output should look like this...
1...
2...
3...
Here is the ForEach loop I use to write the List to file...
// Requires: using System.IO; using System.Text;
using (StreamWriter sw = new StreamWriter(@"c:\andy\Arduino StreamWriter.csv", false, Encoding.UTF8))
{
    foreach (string element in SerialPortString)
    {
        sw.WriteLine(element);
    }
}
There is only one field of data per line, so there are no delimiters, just new lines. I tried formatting so that it would write with quotes around each field hoping that would eliminate confusion for the CSV format, but I wasn't able to figure that out either.
Any help would be appreciated.
Thanks.
Change
sw.WriteLine(element);
to
sw.WriteLine(element.Trim());
or maybe
sw.WriteLine(element.TrimEnd());
Trim the element first. That will remove any LineFeeds or other whitespace characters around the 'edges' of the characters. Then the StreamWriter's CRLFs will be the only newlines present.
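To see the effect in isolation, here is a small sketch that writes to any TextWriter. It assumes, as a guess that the question does not confirm, that each serial-port string arrives with a trailing "\r". LineWriter and WriteTrimmed are hypothetical names.

```csharp
using System.Collections.Generic;
using System.IO;

public static class LineWriter
{
    // Writes one element per line after stripping trailing whitespace,
    // which includes stray CR and LF characters.
    public static void WriteTrimmed(TextWriter writer, IEnumerable<string> elements)
    {
        foreach (string element in elements)
        {
            writer.WriteLine(element.TrimEnd());
        }
    }
}
```

Passing the StreamWriter from the question as the first argument would leave the writer's own line terminator as the only line break after each value.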

What is the best way to check a .TXT extension file for CSV format data?

I need to export and import a TXT file filled with CSV-format data, and I want to do it in MVC4. What is the best approach?
The TXT file can contain a large amount of CSV-format data.
Just run it through a CSV parser (I've used this one in the past - worked fine) and check that it makes semantic sense, and has the same number of columns on each row. That would be very unlikely if it wasn't CSV data. Note: columns != commas - you need to watch out for quoted data "like, this", and line-breaks - both of which a parser will help you with. You cannot just Split by ',' or use line-endings to detect rows - CSV is more complex than that.
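The "same number of columns on each row" check can be sketched independently of whichever parser produces the rows. CsvShape and HasConsistentColumns are hypothetical names, and the rows are assumed to come from a real CSV parser rather than from Split(',').

```csharp
using System.Collections.Generic;

public static class CsvShape
{
    // Returns true when there is at least one row, the first row has at least
    // one field, and every row has the same number of fields as the first.
    public static bool HasConsistentColumns(IEnumerable<string[]> rows)
    {
        int expected = -1;
        foreach (string[] row in rows)
        {
            if (expected < 0)
            {
                expected = row.Length;
            }
            else if (row.Length != expected)
            {
                return false;
            }
        }
        return expected > 0;
    }
}
```

A passing check is only a heuristic: a one-column text file also passes, so it is worth combining with whatever semantic checks the data allows.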
If all you want is to check the file extension, then endsWith or split will do the trick.
Using endsWith
String myFile = "some.file.txt";
System.out.println(myFile.endsWith(".txt"));
Using split
String myFile = "some.file.txt";
String[] myFileArray = myFile.split("\\.(?=[^\\.]+$)");
if (myFileArray[myFileArray.length - 1].equalsIgnoreCase("txt")) {
    System.out.println("Ends with .txt");
}
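The snippets above are Java. Since the question mentions MVC4, a rough C# counterpart could rely on Path.GetExtension instead; FileKind and IsTxt are names invented for this sketch.

```csharp
using System;
using System.IO;

public static class FileKind
{
    // True when the file name ends in ".txt", ignoring case.
    public static bool IsTxt(string fileName)
    {
        return string.Equals(Path.GetExtension(fileName), ".txt", StringComparison.OrdinalIgnoreCase);
    }
}
```

As with the Java version, this says nothing about whether the contents are actually CSV.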

.net program to parse .doc file

I want to create an application that can parse doc/docx files. The structure of such a file is shown below:
par-000.01 - some content
par-000.21 - some content
par-000.31 - some content
par-001.32 - some content
The content can span multiple lines and is not regular. What I want to do is put this content into a database: for the first record, par-000.01 goes into the code column and "some content" into the text column. I cannot do this manually because I have about 15 docs, each containing about 10 pages of paragraphs that I want to put into my database. I cannot find any article on how to parse a whole doc file, but I believe it could be possible if I write a proper regular expression. Can anyone point me to an article on how to do this? I can't find anything that suits me; I am probably using the wrong keywords.
You have a reasonable amount of data (15 docs * 10 pages/doc * ~100 lines/page = 15000 lines), which is manageable in a Word document, and you did not say that this is a repeating data feed, so I assume it is a one-time conversion. I would therefore do it in an editor that supports global find and replace, and convert the text to a comma-separated values (CSV) format. Most databases I know can load a CSV file.
I know you asked for a C# app, but that is overkill in time and effort for your problem.
So:
Convert '<start of line>' to '<start of line>"'
    for MS Word with Find and Replace
    find: ^p
    replace: ^&"
Convert ' - ' to '","'
    for MS Word with Find and Replace
    find: ' - ' (note: don't type the tick marks)
    replace: ","
Convert '<end of line>' to '"<end of line>'
    for MS Word with Find and Replace
    find: ^p
    replace: "^&
Manually fix up the start of the first line and the end of the last line.
You should get:
"par-000.01","some content"
"par-000.21","some content"
Now just load that into a DB using its CSV load.
Also if you insist on doing this with C#, then realize that you can probably save the text as a *.txt file without all of the Word tags and it will be much easier to take apart with a C# app. Don't get fixated on the Word tags, just side step the problem with creative thinking.
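For the plain-text route, a minimal sketch might look like the following. It assumes every entry starts with a code shaped like the samples in the question (par-, three digits, a dot, two digits, then " - "), and that any other non-empty line continues the previous entry. ParagraphParser and Parse are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ParagraphParser
{
    // Matches lines such as "par-000.01 - some content" (pattern assumed from the samples).
    private static readonly Regex Header = new Regex(@"^(par-\d{3}\.\d{2}) - (.*)$");

    // Returns (code, text) pairs; non-empty lines without a code are appended
    // to the text of the previous entry, separated by a line feed.
    public static List<KeyValuePair<string, string>> Parse(IEnumerable<string> lines)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (string raw in lines)
        {
            string line = raw.TrimEnd();
            Match m = Header.Match(line);
            if (m.Success)
            {
                result.Add(new KeyValuePair<string, string>(m.Groups[1].Value, m.Groups[2].Value));
            }
            else if (result.Count > 0 && line.Length > 0)
            {
                KeyValuePair<string, string> last = result[result.Count - 1];
                result[result.Count - 1] =
                    new KeyValuePair<string, string>(last.Key, last.Value + "\n" + line);
            }
        }
        return result;
    }
}
```

The input could come from File.ReadAllLines on the exported .txt file; each pair then maps to the code and text columns.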
You can automate parsing of Word documents (.doc or .docx) in C# using GroupDocs.Parser for .NET API. The text can be extracted from the documents either line by line or as a whole. This is how you can do it.
// Extract all the text
WordsTextExtractor extractor = new WordsTextExtractor("sample.docx");
Console.Write(extractor.ExtractAll());

// OR

// Extract text line by line
string line = extractor.ExtractLine();
// If the line is null, then the end of the file is reached
while (line != null)
{
    // Print a line to the console
    Console.Write(line);
    // Extract another line
    line = extractor.ExtractLine();
}
Disclosure: I work as Developer Evangelist at GroupDocs.
