Split does not work as expected with commas

Split does not work as expected with commas - c#

I need to write a CSV Parser I am now trying to separat the fields to manipulate them.
Sample CSV:
mitarbeiter^tagesdatum^lohnart^kostenstelle^kostentraeger^menge^betrag^belegnummer
11005^23.01.2018^1^^31810020^5,00^^
11081^23.01.2018^1^^31810020^5,00^^
As you can see, there a several empty cells.
I am doing the following:
using (CsvFileReader reader = new CsvFileReader(path))
{
CsvRow row = new CsvRow();
while (reader.ReadRow(row))
{
foreach (string s in row)
{
csvROW.Add(new aCSVROW());
string[] items = s.Split(new char[] { '^' }, StringSplitOptions.None);
csvROW[0].mitarbeiter = items[0];
csvROW[0].tagesdatum = items[1];
csvROW[0].lohnart = items[2];
csvROW[0].kostenstelle = items[3];
csvROW[0].kostentraeger = items[4];
csvROW[0].menge = items[5];
csvROW[0].betrag = items[6];
csvROW[0].belegnummer = items[7];
}
}
}
Problem:
It seems that Split stops after the comma (5,00). The separator is ^ ... is there a reason why?
I tried several things without success...
Thank you so much!

CsvFileReader reads rows from a CSV file and then strings within that row. What else do you expect the CsvFileReader to do than separating the row?
After reading the second line, row will have the contents
11005^23.01.2018^1^^31810020^5
and
00^^
When you split the first row by ^, the last entry of the resulting array will be "5". Anyway, your code will throw, because you are trying to access items exceeding the bounds of the array.
I don't know CsvFileReader. Maybe you can pass ^ as a separator and spare the splitting of the string. Anyway, you could use a StreamReader, too. This will work much more like you expected.
using (StreamReader reader = new StreamReader(path))
{
while (!reader.EndOfStream)
{
var csvLine = reader.ReadLine();
csvROW.Add(new aCSVROW());
string[] items = csvLine.Split(new char[] { '^' }, StringSplitOptions.None);
csvROW[0].mitarbeiter = items[0];
csvROW[0].tagesdatum = items[1];
csvROW[0].lohnart = items[2];
csvROW[0].kostenstelle = items[3];
csvROW[0].kostentraeger = items[4];
csvROW[0].menge = items[5];
csvROW[0].betrag = items[6];
csvROW[0].belegnummer = items[7];
}
}

Is CsvRow meant to be the data of all rows, or of one row? Because as it is, you keep adding a new aCSVROW object into csvROW for each read line, but you keep replacing the data on just csvROW[0], the first inserted aCSVROW. This means that in the end, you will have a lot of rows that all have no data in them, except for the one on index 0, that had its properties overwritten on each iteration, and ends up containing the data of the last read row.
Also, despite using a CsvReader class, you are using plain normal String.Split to actually separate the fields. Surely that's what the CsvReader class is for?
Personally, I always use the TextFieldParser, from the Microsoft.VisualBasic.FileIO namespace. It has the advantage it's completely native in the .Net framework, and you can simply tell it which separator to use.
This function can get the data out of it as simple List<String[]>:
A:
Using C# to search a CSV file and pull the value in the column next to it
Once you have your data, you can paste it into objects however you want.
List<String[]> lines = SplitFile(path, textEncoding, "^");
// I assume "CsvRow" is some kind of container for multiple rows?
// Looks like pretty bad naming to me...
CsvRow allRows = new CsvRow();
foreach (String items in lines)
{
// Create new object, and add it to list.
aCSVROW row = new aCSVROW();
csvROW.Add(row);
// Fill the actual newly created object, not the first object in allRows.
// conside adding index checks here though to avoid index out of range exceptions.
row.mitarbeiter = items[0];
row.tagesdatum = items[1];
row.lohnart = items[2];
row.kostenstelle = items[3];
row.kostentraeger = items[4];
row.menge = items[5];
row.betrag = items[6];
row.belegnummer = items[7];
}
// Done. All rows added to allRows.

CsvRow row = new CsvRow();
while (reader.ReadRow(row))
{
foreach (string s in row)
{
csvROW.Add(new aCSVROW());
s.Split("^","");
csvROW[0].mitarbeiter = items[0];
csvROW[0].tagesdatum = items[1];
csvROW[0].lohnart = items[2];
csvROW[0].kostenstelle = items[3];
csvROW[0].kostentraeger = items[4];
csvROW[0].menge = items[5];
csvROW[0].betrag = items[6];
csvROW[0].belegnummer = items[7];
}
}
}

Related

C# - check which element in a csv is not in an other csv and then write the elements to another csv

My task is to check which of the elements of a column in one csv are not included in the elements of a column in the other csv. There is a country column in both csv and the task is to check which countries are not in the secong csv but are in the first csv.
I guess I have to solve it with Lists after I read the strings from the two csv. But I dont know how to check which items in the first list are not in the other list and then put it to a third list.

There are many way to achieve this, for many real world CSV applications it is helpful to read the CSV input into a typed in-memory store there are standard libraries that can assist with this like CsvHelper as explained in this canonical post: Parsing CSV files in C#, with header
However for this simple requirement we only need to parse the values for Country form the master list, in this case the second csv. We don't need to manage, validate or parse any of the other fields in the CSVs
Build a list of unique Country values from the second csv
Iterate the first csv
Get the Country value
Check against the list of countries from the second csv
Write to the third csv if the country was not found
You can test the following code on .NET Fiddle
NOTE: this code uses StringWriter and StringReader as their interfaces are the same as the file reader and writers in the System.IO namespace. but we can remove the complexity associated with file access for this simple requirement
string inputcsv = #"Id,Field1,Field2,Country,Field3
1,one,two,Australia,three
2,one,two,New Zealand,three
3,one,two,Indonesia,three
4,one,two,China,three
5,one,two,Japan,three";
string masterCsv = #"Field1,Country,Field2
one,Indonesia,...
one,China,...
one,Japan,...";
string errorCsv = "";
// For all in inputCsv where the country value is not listed in the masterCsv
// Write to errorCsv
// Step 1: Build a list of unique Country values
bool csvHasHeader = true;
int countryIndexInMaster = 1;
char delimiter = ',';
List<string> countries = new List<string>();
using (var masterReader = new System.IO.StringReader(masterCsv))
{
string line = null;
if (csvHasHeader)
{
line = masterReader.ReadLine();
// an example of how to find the column index from first principals
if(line != null)
countryIndexInMaster = line.Split(delimiter).ToList().FindIndex(x => x.Trim('"').Equals("Country", StringComparison.OrdinalIgnoreCase));
}
while ((line = masterReader.ReadLine()) != null)
{
string country = line.Split(delimiter)[countryIndexInMaster].Trim('"');
if (!countries.Contains(country))
countries.Add(country);
}
}
// Read the input CSV, if the country is not in the master list "countries", write it to the errorCsv
int countryIndexInInput = 3;
csvHasHeader = true;
var outputStringBuilder = new System.Text.StringBuilder();
using (var outputWriter = new System.IO.StringWriter(outputStringBuilder))
using (var inputReader = new System.IO.StringReader(inputcsv))
{
string line = null;
if (csvHasHeader)
{
line = inputReader.ReadLine();
if (line != null)
{
countryIndexInInput = line.Split(delimiter).ToList().FindIndex(x => x.Trim('"').Equals("Country", StringComparison.OrdinalIgnoreCase));
outputWriter.WriteLine(line);
}
}
while ((line = inputReader.ReadLine()) != null)
{
string country = line.Split(delimiter)[countryIndexInInput].Trim('"');
if(!countries.Contains(country))
{
outputWriter.WriteLine(line);
}
}
outputWriter.Flush();
errorCsv = outputWriter.ToString();
}
// dump output to the console
Console.WriteLine(errorCsv);

Since you write about solving it with lists, I assume you can load those values from the CSV to the lists, so let's start with:
List<string> countriesIn1st = LoadDataFrom1stCsv();
List<string> countriesIn2nd = LoadDataFrom2ndCsv();
Then you can easily solve it with linq:
List<string> countriesNotIn2nd = countriesIn1st.Where(country => !countriesIn2nd.Contains(country)).ToList();
Now you have your third list with countries that are in first, but not in the second list. You can save it.

Remove csv row where text does not contain value

I am processing data in a datagrid and want to filter out any rows which do not contain specific text. Is it possible to do this?
Code below is where I am reading the data. I don't want to read/process lines which do not contain the word "INTEREST"
while (fileReader.Peek() != -1)
{
fileRow = fileReader.ReadLine();
fileRow = fileRow.Replace("\"", "");
// fileRow = fileRow.Replace("-", "");
fileDataField = fileRow.Split(delimiter);
fileDataField = fileRow.Split(',');
gridLGTCash.Rows.Add(fileDataField);
}
fileReader.Close();

If you want to skip all lines that contain "INTEREST" check for that string using Contains:
while (fileReader.Peek() != -1)
{
fileRow = fileReader.ReadLine();
if (!fileRow.Contains("INTEREST")) //<--- Add a test here
{
fileRow = fileRow.Replace("\"", "");
// fileRow = fileRow.Replace("-", "");
fileDataField = fileRow.Split(delimiter);
fileDataField = fileRow.Split(',');
gridLGTCash.Rows.Add(fileDataField);
}
}
fileReader.Close();

Here is my fix for no record or blank record on a row. Blank row in cvs looks like this [,,,,,,,..].
using System.Text.RegularExpressions;
:
:
do {
fileRow = fileReader.ReadLine();
if (!Regex.IsMatch(fileRow, #"^,*$"))
{
fileRow = fileRow.Replace("\"", "");
// fileRow = fileRow.Replace("-", "");
fileDataField = fileRow.Split(delimiter);
fileDataField = fileRow.Split(',');
gridLGTCash.Rows.Add(fileDataField);
}
}while (fileReader.Peek() != -1)
fileReader.Close();

First of all, you'll be much better off with a dedicated CSV reader. string.Split() performs poorly and fails for all kinds of edge cases. There are three (at least) built into the framework, but you can easily get other (imo better) options via NuGet. That will likely side-step this whole issue for you.
Assuming for a moment you can't do that, I wouldn't use Peek() for this, either. Use the File.ReadLines() method (note: this is not the same as File.ReadAllLines(), which can be a memory hog):
var lines = File.ReadLines("filename here"))
.Select(line => line.Replace("\"", "").Split(','))
.Where(line => line.Any(field => field.Trim().Length > 0));
gridLGTCash.Rows.AddRange(lines.ToArray());

CsvHelper Parser.Read() not splitting columns

For some to me unknown reason the csvHelper.parser.read() method returns a string array with only one entry containing the entire row.
The csv-file looks like this:
Name;Vorname;Alter
Petersen;Peter;18
Heinzen;Heinz;19
The code is like this:
using (CsvReader reader = new CsvReader(new StreamReader(path, Encoding.Default)))
{
String[] cells = reader.Parser.Read();
// cells = {"Name;Vorname;Alter"} (length = 1)
}
What am I doing wrong, or how do I get it to output an array of strings with three entries?
Edit:
CsvHelper: https://joshclose.github.io/CsvHelper/
expected result:
cells = {"Name", "Vorname", "Alter"} (length = 3)

Well, i feel stupid now...
Change the reader.Configuration.Delimiter = ";";
Thanks Benjamin Podszun for getting me on the right track

Using LINQ for CSV data

My code works great, as long as there aren't any commas in the data.
IEnumerable<Account> AccountItems = from line in File.ReadAllLines(filePath).Skip(1)
let columns = line.Split(',')
select new Account
{
AccountName = columns[0],
BKAccountID = columns[1],
Brand = columns[2],
FirstOE = columns[3],
LastOE = columns[4]
};
But the output includes data with commas, and wraps the data in double quotes when there is a comma in the data. I'm not sure if I can still use LINQ to do this.
Acme Health Care,{C2F9A7DD-0000-0000-0000-8B06859016AD},"Data With, LLC",2/4/2013,2/18/2013

Take a look at this question:
Reading CSV files using C#
TextFieldParser parser = new TextFieldParser(#"c:\temp\test.csv");
parser.TextFieldType = FieldType.Delimited;
parser.SetDelimiters(",");
while (!parser.EndOfData)
{
//Processing row
string[] fields = parser.ReadFields();
foreach (string field in fields)
{
//TODO: Process field
}
}
parser.Close();
No need to reinvent the wheel when .NET can hold your hand.

Speedily Read and Parse Data

As of now, I am using this code to open a file and read it into a list and parse that list into a string[]:
string CP4DataBase =
"C:\\Program\\Line Balancer\\FUJI DB\\KTS\\KTS - CP4 - Part Data Base.txt";
CP4DataBaseRTB.LoadFile(CP4DataBase, RichTextBoxStreamType.PlainText);
string[] splitCP4DataBaseLines = CP4DataBaseRTB.Text.Split('\n');
List<string> tempCP4List = new List<string>();
string[] line1CP4Components;
foreach (var line in splitCP4DataBaseLines)
tempCP4List.Add(line + Environment.NewLine);
string concattedUnitPart = "";
foreach (var line in tempCP4List)
{
concattedUnitPart = concattedUnitPart + line;
line1CP4PartLines++;
}
line1CP4Components = new Regex("\"UNIT\",\"PARTS\"", RegexOptions.Multiline)
.Split(concattedUnitPart)
.Where(c => !string.IsNullOrEmpty(c)).ToArray();
I am wondering if there is a quicker way to do this. This is just one of the files I am opening, so this is repeated a minimum of 5 times to open and properly load the lists.
The minimum file size being imported right now is 257 KB. The largest file is 1,803 KB. These files will only get larger as time goes on as they are being used to simulate a database and the user will continually add to them.
So my question is, is there a quicker way to do all of the above code?
EDIT:
***CP4***
"UNIT","PARTS"
"BLOCK","HEADER-"
"NAME","106536"
"REVISION","0000"
"DATE","11/09/03"
"TIME","11:10:11"
"PMABAR",""
"COMMENT",""
"PTPNAME","R160805"
"CMPNAME","R160805"
"BLOCK","PRTIDDT-"
"PMAPP",1
"PMADC",0
"ComponentQty",180
"BLOCK","PRTFORM-"
"PTPSZBX",1.60
"PTPSZBY",0.80
"PTPMNH",0.25
"NeedGlue",0
"BLOCK","TOLEINF-"
"PTPTLBX",0.50
"PTPTLBY",0.40
"PTPTLCL",10
"PTPTLPX",0.30
"PTPTLPY",0.30
"PTPTLPQ",30
"BLOCK","ELDT+" "PGDELSN","PGDELX","PGDELY","PGDELPP","PGDELQ","PGDELP","PGDELW","PGDELL","PGDELWT","PGDELLT","PGDELCT","PGDELR"
0,0.000,0.000,0,0,0.000,0.000,0.000,0.000,0.000,0.000,0
"BLOCK","VISION-"
"PTPVIPL",0
"PTPVILCA",0
"PTPVILB",0
"PTPVICVT",10
"PENVILIT",0
"BLOCK","ENVDT"
"ELEMENT","CP43ENVDT-"
"PENNMI",1.0
"PENNMA",1.0
"PENNZN",""
"PENNZT",1.0
"PENBLM",12
"PENCRTS",0
"PENSPD1",100
"PTPCRDCT",0
"PENVICT",1
"PCCCRFT",1
"BLOCK","CARRING-"
"PTPCRAPO",0
"PTPCRPCK",0
"PTPCRPUX",0.00
"PTPCRPUY",0.00
"PTPCRRCV",0
"BLOCK","PACKCLS-"
"FDRTYPE","Emboss"
"TAPEWIDTH","8mm"
"FEEDPITCH",4
"REELDIAMETER",0
"TAPEDEPTH",0.0
"DOADVVACUUM",0
"CHKBEFOREFEED",0
"TAPEARMLENGTH",0
"PPCFDPP",0
"PPCFDEC",4
"PPCMNPT",30
"UNIT","PARTS"
"BLOCK","HEADER-"
"NAME","106653"
"REVISION","0000"
"DATE","11/09/03"
"TIME","11:10:42"
"PMABAR",""
"COMMENT",""
"PTPNAME","0603R"
"CMPNAME","0603R"
"BLOCK","PRTIDDT-"
"PMAPP",1
"PMADC",0
"ComponentQty",18
"BLOCK","PRTFORM-"
"PTPSZBX",1.60
"PTPSZBY",0.80
"PTPMNH",0.23
"NeedGlue",0
"BLOCK","TOLEINF-"
"PTPTLBX",0.50
"PTPTLBY",0.34
"PTPTLCL",0
"PTPTLPX",0.60
"PTPTLPY",0.40
"PTPTLPQ",30
"BLOCK","ELDT+" "PGDELSN","PGDELX","PGDELY","PGDELPP","PGDELQ","PGDELP","PGDELW","PGDELL","PGDELWT","PGDELLT","PGDELCT","PGDELR"
0,0.000,0.000,0,0,0.000,0.000,0.000,0.000,0.000,0.000,0
"BLOCK","VISION-"
"PTPVIPL",0
"PTPVILCA",0
"PTPVILB",0
"PTPVICVT",10
"PENVILIT",0
"BLOCK","ENVDT"
"ELEMENT","CP43ENVDT-"
"PENNMI",1.0
"PENNMA",1.0
"PENNZN",""
"PENNZT",1.0
"PENBLM",12
"PENCRTS",0
"PENSPD1",80
"PTPCRDCT",0
"PENVICT",1
"PCCCRFT",1
"BLOCK","CARRING-"
"PTPCRAPO",0
"PTPCRPCK",0
"PTPCRPUX",0.00
"PTPCRPUY",0.00
"PTPCRRCV",0
"BLOCK","PACKCLS-"
"FDRTYPE","Emboss"
"TAPEWIDTH","8mm"
"FEEDPITCH",4
"REELDIAMETER",0
"TAPEDEPTH",0.0
"DOADVVACUUM",0
"CHKBEFOREFEED",0
"TAPEARMLENGTH",0
"PPCFDPP",0
"PPCFDEC",4
"PPCMNPT",30
... the file goes on and on and on.. and will only get larger.
The REGEX is placing each "UNIT PARTS" and the following code until the NEXT "UNIT PARTS" into a string[].
After this, I am checking each string[] to see if the "NAME" section exists in a different list. If it does exist, I am outputting that "UNIT PARTS" at the end of a textfile.

This bit is a potential performance killer:
string concattedUnitPart = "";
foreach (var line in tempCP4List)
{
concattedUnitPart = concattedUnitPart + line;
line1CP4PartLines++;
}
(See this article for why.) Use a StringBuilder for repeated concatenation:
// No need to use tempCP4List at all
StringBuilder builder = new StringBuilder();
foreach (var line in splitCP4DataBaseLines)
{
concattedUnitPart.AppendLine(line);
line1CP4PartLines++;
}
Or even just:
string concattedUnitPart = string.Join(Environment.NewLine,
splitCP4DataBaseLines);
Now the regex part may well also be slow - I'm not sure. It's not obvious what you're trying to achieve, whether you need regular expressions at all, or whether you really need to do the whole thing in one go. Can you definitely not just process it line by line?

You could achieve the same output list 'line1CP4Components' using the following:
Regex StripEmptyLines = new Regex(#"^\s*$", RegexOptions.Multiline);
Regex UnitPartsMatch = new Regex(#"(?<=\n)""UNIT"",""PARTS"".*?(?=(?:\n""UNIT"",""PARTS"")|$)", RegexOptions.Singleline);
string CP4DataBase =
"C:\\Program\\Line Balancer\\FUJI DB\\KTS\\KTS - CP4 - Part Data Base.txt";
CP4DataBaseRTB.LoadFile(CP4DataBase, RichTextBoxStreamType.PlainText);
List<string> line1CP4Components = new List<string>(
UnitPartsMatch.Matches(StripEmptyLines.Replace(CP4DataBaseRTB.Text, ""))
.OfType<Match>()
.Select(m => m.Value)
);
return line1CP4Components.ToArray();
You may be able to ignore the use of StripEmptyLines, but your original code is doing this via the Where(c => !string.IsNullOrEmpty(c)). Also your original code is causing the '\r' part of the "\r\n" newline/linefeed pair to be duplicated. I assumed this was an accident and not intentional?
Also you don't seem to be using the value in 'line1CP4PartLines' so I omitted the creation of the value. It was seemingly inconsistent with the omission of empty lines later so I guess you're not depending on it. If you need this value a simple regex can tell you how many new lines are in the string:
int linecount = new Regex("^", RegexOptions.Multiline).Matches(CP4DataBaseRTB.Text).Count;

// example of what your code will look like
string CP4DataBase = "C:\\Program\\Line Balancer\\FUJI DB\\KTS\\KTS - CP4 - Part Data Base.txt";
List<string> Cp4DataList = new List<string>(File.ReadAllLines(CP4DataBase);
//or create a Dictionary<int,string[]> object
string strData = string.Empty;//hold the line item data which is read in line by line
string[] strStockListRecord = null;//string array that holds information from the TFE_Stock.txt file
Dictionary<int, string[]> dctStockListRecords = null; //dictionary object that will hold the KeyValuePair of text file contents in a DictList
List<string> lstStockListRecord = null;//Generic list that will store all the lines from the .prnfile being processed
if (File.Exists(strExtraLoadFileLoc + strFileName))
{
try
{
lstStockListRecord = new List<string>();
List<string> lstStrLinesStockRecord = new List<string>(File.ReadAllLines(strExtraLoadFileLoc + strFileName));
dctStockListRecords = new Dictionary<int, string[]>(lstStrLinesStockRecord.Count());
int intLineCount = 0;
foreach (string strLineSplit in lstStrLinesStockRecord)
{
lstStockListRecord.Add(strLineSplit);
dctStockListRecords.Add(intLineCount, lstStockListRecord.ToArray());
lstStockListRecord.Clear();
intLineCount++;
}//foreach (string strlineSplit in lstStrLinesStockRecord)
lstStrLinesStockRecord.Clear();
lstStrLinesStockRecord = null;
lstStockListRecord.Clear();
lstStockListRecord = null;
//Alter the code to fit what you are doing..

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Split does not work as expected with commas - c#

Related

C# - check which element in a csv is not in an other csv and then write the elements to another csv

Remove csv row where text does not contain value

CsvHelper Parser.Read() not splitting columns

Using LINQ for CSV data

Speedily Read and Parse Data

Categories

Resources