What I want to achieve is that I read from a CSV file and the system automatically creates a new CSV file in a different format.
I am able to read and reformat the CSV file; however, I have issues with number formatting, because the values use thousands separators (1,000). For example, when I read from the CSV and split each line on ',', my values change.
Ex Line 1: Test Name, Test Desc, Test Currency, 12,500
var line1 = line.Split(',');
This splits the value into 12 & 500 because of the ',' delimiter. How can I get the number back as a whole amount, please?
using (var reader = new StreamReader(openFileDialog1.FileName))
{
    while (!reader.EndOfStream)
    {
        var line = reader.ReadLine();
        var values = line.Split(',');
    }
}
You can't. When a CSV file contains numbers (or any other text with a ',' in it), it needs to quote those fields. It is impossible for simple code (i.e. not AI) to differentiate them the way your human eye can.
Ex Line 1: Test Name, Test Desc, Test Currency, 12,500
Should be:
Ex Line 1: "Test Name", "Test Desc", "Test Currency", "12,500"
Common CSV parsers/libraries will know how to handle this (e.g. CsvHelper).
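For illustration, a minimal sketch using CsvHelper to read such a file (the file name and column index are assumptions, not from the original question):

using System.Globalization;
using CsvHelper;

using (var reader = new StreamReader("input.csv"))
using (var csv = new CsvReader(reader, CultureInfo.InvariantCulture))
{
    while (csv.Read())
    {
        // a properly quoted field such as "12,500" comes back as one value, comma included
        var amount = csv.GetField(3);
    }
}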
If you have control over the CSV file generation, then you should make this change. If it's from a 3rd party then see if you can get them to make a change.
There may be an edge case in your example if there is always a space after the delimiter but never inside the number fields. Your delimiter then becomes ", " instead of just ','.
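If you want to try that, a quick sketch using the string[] overload of Split (relying on the assumption above about spaces):

var values = line.Split(new[] { ", " }, StringSplitOptions.None);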
Side note:
You should avoid culture-specific separators in a .csv file, because they always lead to headaches when the data is exported/imported with different regional settings.
Possible solutions:
I suggest dumping and parsing numbers (dates, etc.) with the invariant culture:
myNumber.ToString(CultureInfo.InvariantCulture)
If you really need to dump numbers with a comma decimal sign, enclose the field in quotes. This does not turn the numbers into strings, as .csv has no type information.
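A minimal sketch of the round trip with the invariant culture (the value is just an example):

using System.Globalization;

decimal myNumber = 12500.75m;
string field = myNumber.ToString(CultureInfo.InvariantCulture);      // "12500.75" on every machine
decimal parsed = decimal.Parse(field, CultureInfo.InvariantCulture); // round-trips regardless of regional settings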
Excel vs. the .csv format
Another side note for Excel: Microsoft's .csv handling is somewhat confusing and contradicts the RFC standard. When you export a .csv from Excel, the numbers are always dumped using the regional settings. To avoid confusion with delimiters, Excel uses a different character (usually the semicolon) as the delimiter if the decimal separator is a comma.
The delimiter used is the one set as the list separator in the operating system's regional settings; in .NET it can be retrieved via the CultureInfo.TextInfo.ListSeparator property.
I find this solution from Microsoft quite unfortunate, as .csv files dumped under different regional settings cannot always be read on another computer, and this has been causing trouble for decades.
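For example, this one-liner shows the delimiter Excel would use on the current machine (just an illustration):

using System.Globalization;

var listSeparator = CultureInfo.CurrentCulture.TextInfo.ListSeparator; // "," or ";" depending on regional settings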
Related
I am busy doing a project where I have to add import and export functionality. I was able to quite easily make the export functionality work. Here is my question regarding Import:
I was able to make a template with 3 columns: stock_code;item_name;price
All future imports will only have these 3 columns. Now here is my question:
How can I determine the delimiter on import?
I have done the following on the file input
<input type="file" class="custom-file-input" id="File" accept=".csv/text/plain" />
This sets the browse window to Custom Files and not All Files. The problem here is that if they do set it to All Files, I need to determine the delimiter so there is no error.
2. In addition to determining the delimiter, I need to make sure that the file has only 3 columns.
Useful Info:
I do have the CsvHelper NuGet package
and I'm working on ASP.NET MVC on .NET 4.6.
Please help.
I have written code that implements a "delimiter detector" for CSV data, so I can describe the strategy that I used. First, I defined a set of acceptable delimiters ordered by priority: ',', '\t', ';', '|'. Those are the most common delimiters that I've seen in use.
Then I read the first line of data out of the CSV; this can be done with StreamReader.ReadLine(). I then iterate over every character in the line and keep track of how many times I see each delimiter. The delimiter that was seen the most is the winner.
This works quite well, especially when the first line in the file is a header row that contains mostly alpha characters. However, in some cultures it is common for ',' to be used as the decimal point in numbers, and those cultures tend to use ';' as the CSV delimiter. If the first row is all numeric values with decimal points (no headers), this algorithm can mis-detect ',' as the delimiter, though I doubt that would ever happen in practice.
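A rough sketch of that counting heuristic (this is not the library implementation itself; the candidate set and the tie-break by priority order follow the description above):

using System.IO;
using System.Linq;

static char DetectDelimiter(string path)
{
    // candidates ordered by priority; OrderByDescending is a stable sort, so ties keep this order
    var candidates = new[] { ',', '\t', ';', '|' };
    using (var reader = new StreamReader(path))
    {
        var firstLine = reader.ReadLine() ?? string.Empty;
        return candidates
            .OrderByDescending(c => firstLine.Count(ch => ch == c))
            .First();
    }
}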
My implementation can be seen here. There is a bit of extra complexity because I'm processing data out of an intermediate buffer; that could be simplified if you adapt it to your own needs.
Depending on what you intend to do with the CSV data, you could use the library that this code lives in: Sylvan.Data.Csv. It is available as a NuGet package.
using Sylvan.Data.Csv;
...
// by default, detects the delimiter and assumes a header row is present
var csv = CsvDataReader.Create("MyData.csv");
if (csv.FieldCount != 3)
{
    throw new Exception("Invalid file");
}
while (csv.Read())
{
    if (csv.RowFieldCount != 3)
    {
        // the row contains a different number of columns than expected
        throw new Exception("Invalid row at " + csv.RowNumber);
    }
    var stockCode = csv.GetString(0);
    var name = csv.GetString(1);
    // this next line will throw a FormatException if the price column doesn't contain
    // a numeric value
    var price = csv.GetDecimal(2);
}
I am having some problems with a quite easy task - I feel like I'm missing something very obvious here.
I have a .csv file which is semicolon separated. In this file are several numbers that contain dots, like "1.300", but there are also dates included, like "2015.12.01". The task is to find and delete all dots, but only those in numbers, not those in dates. The dates and numbers are completely variable and never at the same position in the file.
My question now: what is the 'best' way to handle this problem?
From a programmer's point of view: is it a good solution to just split at every semicolon, count the dots, and delete the dot if there is only one? This is the only way to solve the problem I have been able to think of so far.
Example source file:
2015.12.01;
13.100;
500;
1.200;
100;
Example result:
2015.12.01;
13100;
500;
1200;
100;
If you can rely on the fact that dates have two dots and numbers just one, you can use that as a filter:
using System.Linq; // required for the Count extension method

string s = "123.45";
if (s.Count(x => x == '.') == 1)
{
    s = s.Replace(".", null);
}
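Applied to each field of the semicolon-separated example, that might look like this (a sketch; the output matches the example result above):

using System.Linq;

var line = "2015.12.01;13.100;500;1.200;100";
var cleaned = string.Join(";", line.Split(';')
    .Select(f => f.Count(c => c == '.') == 1 ? f.Replace(".", null) : f));
// cleaned == "2015.12.01;13100;500;1200;100"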
The source file looks like a valid file generated by a program running on a machine whose locale uses '.' as the thousands separator (most of Europe does) and as the date separator (German locales only, I think). Such locales also use ';' as the list separator.
If the question were only how to parse such dates and numbers, the answer would be to pass the proper culture to the parse function, e.g. decimal.Parse("13.500", new CultureInfo("de-at")) would return 13500. The actual issue, though, is that the data must be fed to another program that uses '.' as the decimal separator.
The safest option would be to change the locale used by the exporting program, e.g. change the thread CultureInfo if the exporter is a .NET program, the locale in an SSIS package, etc., to a locale like en-gb so it exports with '.' and avoids the weird date format. This assumes that the next program in the pipeline doesn't expect German dates and English numbers.
Another option would be to load the text, parse the fields using the proper locale then export them in the format required by the next program.
Finally, a regular expression could be used to match only the numeric fields and remove the dot. This can be a bit tricky and depends on the actual contents.
For example, (\d+)\.(\d{3}) can be used to match numbers if there is only one thousands separator. This can fail if some text field contains similar values. Or ;(\d+)\.(\d{3}); could match only a full field, excluding the first and last fields, e.g.:
Regex.Replace("1.457;2016.12.30;13.000;1,50;2015.12.04;13.456", @";(\d+)\.(\d{3});", @"$1$2;")
produces:
1.457;2016.12.3013000;1,50;2015.12.04;13.456
A regular expression that would match either numbers between ; or the first/last field could be
(^|;)(\d+)\.(\d{3})(;|$)
This would produce 1457;2016.12.30;13000;1,50;2015.12.04;13456, e.g.:
var data = "1.457;2016.12.30;13.000;1,50;2015.12.04;13.456";
var pattern = @"(^|;)(\d+)\.(\d{3})(;|$)";
var replacement = @"$1$2$3$4";
var result = Regex.Replace(data, pattern, replacement);
The advantage of a regex over splitting and replacing strings is that it's a lot faster and more memory-efficient. Instead of generating temporary strings for each split and manipulation, a Regex only calculates indexes into the source string. A string object is generated only when you request the final text result. This results in far fewer allocations and garbage collections.
Even on medium-sized files this can result in 10x better performance.
I wouldn't rely on the number of dots, as mistakes can be made.
You can use double.TryParse to safely test whether the string is a number:
var data = "2015.12.01;13.100;500;1.200;100;";
var dataArray = data.Split(';');
foreach (var s in dataArray)
{
    double result;
    if (double.TryParse(s, out result))
        // implement your logic here
        Console.WriteLine(s.Replace(".", string.Empty));
}
I received a requirement to save data in CSV file and send it to customers.
Customers use both Excel and Notepad to view this file.
Data look like:
975567EB, 973456CE, 971343C8
And my data has some numbers ending in "E3", like:
98765E3
so when opened in Excel, it changes to:
9.8765E+7
I wrote a program in C# that changes this format to text by turning the value into ="98765E3":
while (!sr.EndOfStream)
{
    var line = sr.ReadLine();
    var values = line.Split(',');
    values[0] = "=" + "\"" + values[0] + "\""; // change number format to string
    listA.Add(new string[] { values[0], values[1], values[2], values[3] });
}
But for customers who use Notepad to open the CSV file, it will show:
="98765E3"
How can I save a number as text in the CSV so that it opens in both Excel and Notepad with the same result? I greatly appreciate any suggestion!
Don't Shoot the messenger.
Your problem is not the way you are exporting (creating...?) data in C#. It is with the way that you are opening the CSV files in Excel.
Excel has numerous options for importing text files that allow for the use of a FieldInfo parameter that specifies the TextFileColumnDataTypes property for each field (aka column) of data being brought in.
If you choose to double-click a CSV file in an Explorer folder window, then you will have to put up with Excel's 'best guesses' about your intended field type for each column. It's not going to stop halfway through an import process to ask your opinion. Some common errors include:
An alphanumeric value with an E will often be interpreted as scientific notation.
Half of the DMY dates will be misinterpreted as the wrong MDY dates (or vice versa). The other half will become text, since Excel cannot process something like 14/08/2015 as MDY.
Any value that starts with a + will produce a #NAME? error, because Excel thinks you are attempting to bring in a formula that references a name.
That's a short list of common errors. There are others. Here are some common solutions.
Use Data ► Get External Data ► From Text. Explicitly specify any ambiguous column data type; e.g. 98765E3 as Text, dates as either DMY, MDY, YMD, etc as the case may be. There is even the option to discard a column of useless data.
Use File ► Open ► Text Files, which brings you through the same import wizard as the option above. These actions can be recorded for repeated use with either command.
Use VBA's Workbooks.OpenText method and specify each column's FieldInfo position and data type (the latter with an XlColumnDataType constant).
Read the import file into memory and process it in a memory array before dumping it into the target worksheet.
There are less precise solutions that are still subject to some interpretation from Excel.
Use a Range.PrefixCharacter to force numbers with leading zeroes, or alphanumeric values that could conceivably be misinterpreted as scientific notation, into the worksheet as text.
Use a text qualifier character; typically ASCII character 034 (e.g. ") to wrap values you want to be interpreted as text.
Copy and paste the entire text file into the target worksheet's column A then use the Range.TextToColumns method (again with FieldInfo options available for each column).
These latter two methods are going to cause some odd values in Notepad but Notepad isn't Excel and cannot process a half-million calculations and other operations in several seconds. If you must mash-up the two programs there will be some compromises.
My suggestion is to leave the values as best as they can be in Notepad and use the facilities and processes readily available in Excel to import the data properly.
I am having an issue with importing a CSV file. The problem arises when an address field has multiple comma-separated values, e.g. home no, street no, town, etc.
I tried to use http://forums.asp.net/t/1705264.aspx/1 this article but the problem was not solved, because a single field contains multiple comma-separated values.
Any idea or solution? I didn't find any help.
Thanks
Don't split the string yourself. Parsing CSV files is not trivial, and using str.Split(',') will give you a lot of headaches. Try using a more robust library like CsvHelper
- https://github.com/JoshClose/CsvHelper
If that doesn't work then the format of the file is probably incorrect. Check to make sure the text fields are properly quoted.
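For reference, a minimal sketch with CsvHelper (the file name and column name here are placeholders); when a field is properly quoted, it comes back as a single value:

using System.Globalization;
using CsvHelper;

using (var reader = new StreamReader("import.csv"))
using (var csv = new CsvReader(reader, CultureInfo.InvariantCulture))
{
    csv.Read();
    csv.ReadHeader();
    while (csv.Read())
    {
        // a quoted field like "home no, street no, town" is returned as one string
        var address = csv.GetField("Address");
    }
}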
Do you control the format of the CSV file? You could see about changing it to qualify the values by surrounding them with double quotes (""). Another option is to switch to a different delimiter like tabs.
Barring that, if the address is always in the same format, you could read in the file, and then inspect each record and manually concatenate columns 2, 3, and 4.
Are the fields surrounded by quotation marks? If so, split on "," rather than just , (see the sketch after these suggestions).
Is the address field at the beginning or end of a record? If so, you can ignore the first x commas (if at the beginning) or split only the correct number of fields (if at the end).
Ideally, if you have control of the source file's creation, you would change the delimiter of either the address sub-fields, or the record fields.
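If you do go the manual route from the first suggestion (every field surrounded by quotation marks), a rough sketch, assuming no field contains embedded quotes:

var line = "\"Test Name\",\"home no, street no, town\",\"12,500\"";
var fields = line.Trim('"').Split(new[] { "\",\"" }, StringSplitOptions.None);
// fields[1] == "home no, street no, town"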
I have to build a C# program that creates CSV files containing long numbers (held as strings in my program). The problem is that when I open such a CSV file in Excel, the numbers appear like this:
1234E+ or 1234560000000 (the end of the number is 0)
How can I retain the formatting of the numbers? If I open the file as a text file, the numbers are formatted correctly.
Thanks in advance.
As others have mentioned, you can force the data to be a string. The best way to do that is ="1234567890123". The = makes the cell a formula, and the quotation marks make the enclosed value an Excel string literal. This will display all the digits, even beyond Excel's numeric precision limit, but the cell (generally) won't be directly usable in numeric calculations.
If you need the data to remain numeric, the best way is probably to create a native Excel file (.xls or .xlsx). Various approaches for that can be found in the solutions to this related Stack Overflow question.
If you don't mind having thousands separators, there is one other trick you can use, which is to make your C# program insert the thousands separators and surround the value in quotes: "1,234,567,890,123". Do not include a leading = (as that will force it to be a string). Note that in this case, the quotation marks are for protecting the commas in the CSV, not for specifying an Excel string literal.
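A small sketch of producing both variants from C# (the values are just examples):

using System.IO;

// forces Excel to show every digit as text, but Notepad shows the ="..." wrapper
var asExcelText = "=\"1234567890123\"";

// stays plain in Notepad; the outer quotes only protect the commas in the CSV
var withSeparators = "\"1,234,567,890,123\"";

File.WriteAllText("out.csv", asExcelText + "," + withSeparators);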
Format those long numbers as strings by putting a ' (apostrophe) in front or making a formula out of it: ="1234567890123"
You can't. Excel stores numbers with fifteen digits of precision. If you don't mind not having the ability to perform calculations on the numbers from within Excel, you can store them as Text, and all of the digits will display.
When I generate data to be imported into Excel, I do not generate a CSV file if I want control over how the data are displayed. Instead, I write out an Excel file where the properties of the cells are set appropriately. I do not know if there is a library out there that would do that for you in C# without requiring Excel to be installed on the machine generating the files, but it is something to look into.
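For what it's worth, libraries that write .xlsx files without Excel installed do exist; a minimal sketch with ClosedXML (named here only as one possibility, not necessarily the right fit for your situation):

using ClosedXML.Excel;

using (var workbook = new XLWorkbook())
{
    var sheet = workbook.AddWorksheet("Data");
    // writing the value as a string keeps every digit and avoids scientific notation
    sheet.Cell(1, 1).SetValue("1234567890123");
    workbook.SaveAs("output.xlsx");
}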
My two cents:
I think it's important to realize that there is a difference between "data" and "formatting". In this example you are, in a way, trying to store both in a data-only file. This will, as you can tell from other answers, change the nature of the data (in other words, cause it to be converted to a string). A CSV file is a data-only file. You can do some tricks here and there to merge formatting in with data, but to my way of thinking this essentially corrupts the data by mixing it with non-data values, i.e. "formatting".
If you really need to store formatting information, I suggest that, if you have time to develop it, you switch to a file type capable of storing formatting separately from the data. This problem sounds like a good candidate for an XML Spreadsheet solution. That way you can specify not only your data but also its type and any formatting you choose to use.