How to validate a .csv file before storage in C#? - c#

I have some .csv files that I parse before storing them in a database.
I would like to make the application more robust by validating the .csv files before saving them to the database.
So I am asking whether you have any good links, code examples, patterns, or advice on how to do this.
I will paste an example of my .csv file below. The different data fields in the .csv file are separated by tabs. Each new row of data is on a new line.
I have been thinking a little about the things I should validate against and came up with the list below (I am very open to other suggestions if you think anything should be added):
Correct file encoding.
The file is not empty.
Correct number of lines/columns.
Correct number/text/date formats.
Correct number ranges.
This is what my .csv file looks like (a file with two lines; the data on each line is separated by tabs).
4523424 A123456 GT-P1000 mobile phone Samsung XSD1234 135354191325234
345353 A134211 A8181 mobile phome HTC S4112-ad3 111911911932343
The string representation of the above looks like:
"4523424\tA123456\tGT-P1000\tmobile phone\tSamsung\tXSD1234\t135354191325234\r\n345353\tA134211\tA8181\tmobile phome\tHTC\tS4112-ad3\t111911911932343\r\n"
So do you have any good design, links, patterns, code examples, etc. on how to do this in C#?

I do it like this:
Create a class to hold each parsed line, with the expected types
internal sealed class Record {
public int Field1 { get; set; }
public DateTime Field2 { get; set; }
public decimal? PossibleEmptyField3 { get; set; }
...
}
Create a method that parses a line into the record
public Record ParseRecord(string[] fields) {
    if (fields.Length < SomeLineLength)
        throw new MalformedLineException(...);
    var record = new Record();
    // NumberStyles and CultureInfo live in System.Globalization
    record.Field1 = int.Parse(fields[0], NumberStyles.None, CultureInfo.InvariantCulture);
    record.Field2 = DateTime.ParseExact(fields[1], "yyyyMMdd", CultureInfo.InvariantCulture);
    if (fields[2] != "")
        record.PossibleEmptyField3 = decimal.Parse(fields[2]...);
    return record;
}
Create a method parsing the entire file
public List<Record> ParseStream(Stream stream) {
    // TextFieldParser and MalformedLineException are in Microsoft.VisualBasic.FileIO
    var tfp = new TextFieldParser(stream);
    ...
    try {
        while (!tfp.EndOfData) {
            records.Add(ParseRecord(tfp.ReadFields()));
        }
    }
    catch (FormatException ex) {
        ... // show error
    }
    catch (MalformedLineException ex) {
        ... // show error
    }
    return records;
}
And then I create a number of methods validating the fields
public void ValidateField2(IEnumerable<Record> records) {
foreach (var invalidRecord in records.Where(x => x.Field2 < DateTime.Today))
... // show error
}
I have tried various tools, but since the pattern is straightforward they don't help much.
(You should still use a tool to split each line into fields.)

You can use FileHelpers, a free, open-source .NET library, to deal with CSV and many other file formats.

adrianm and Nipun Ambastha
Thank you for your response to my question.
I solved my problem by writing a solution to validate my .csv file myself.
A more elegant solution could quite possibly be built on adrianm's code. I didn't do that, but I encourage others to give adrianm's code a look.
I am validating the items in the list below.
Empty file
new FileInfo(dto.AbsoluteFileName).Length == 0
Wrong formatting of file lines.
string[] items = line.Split('\t');
if (items.Count() == 20)
Wrong datatype in line fields.
int number;
bool isNumber = int.TryParse(dataRow.ItemArray[0].ToString(), out number);
Missing required line fields.
if (dataRow.ItemArray[4].ToString().Length < 1)
To work through the contents of the .csv file I based my code on this code example:
http://bytes.com/topic/c-sharp/answers/256797-reading-tab-delimited-file
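For anyone who wants the checks above in one place, here is a minimal sketch that runs them in a single pass and collects the messages instead of stopping at the first problem. The class name TabFileValidator and the parameters expectedColumns, numericColumn and requiredColumn are made up for illustration; they are not from my real code, and the sketch assumes both column indexes are smaller than expectedColumns.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TabFileValidator
{
    // Returns one message per problem found; an empty list means the input passed.
    // expectedColumns, numericColumn and requiredColumn are hypothetical parameters.
    public static List<string> Validate(IList<string> lines, int expectedColumns,
                                        int numericColumn, int requiredColumn)
    {
        var errors = new List<string>();
        if (lines.Count == 0)
        {
            errors.Add("File is empty.");
            return errors;
        }

        for (int i = 0; i < lines.Count; i++)
        {
            string[] items = lines[i].Split('\t');
            if (items.Length != expectedColumns)
            {
                errors.Add($"Line {i + 1}: expected {expectedColumns} columns, found {items.Length}.");
                continue; // column indexes are not reliable for this line
            }

            // NumberStyles.None accepts digits only (no sign, no whitespace).
            if (!long.TryParse(items[numericColumn], NumberStyles.None,
                               CultureInfo.InvariantCulture, out _))
            {
                errors.Add($"Line {i + 1}: column {numericColumn} is not a number.");
            }

            if (items[requiredColumn].Trim().Length == 0)
            {
                errors.Add($"Line {i + 1}: column {requiredColumn} is required but empty.");
            }
        }
        return errors;
    }
}
```

Calling Validate(File.ReadAllLines(path), 20, 0, 4) would roughly match the checks listed above; an empty result means the file can go on to be stored.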

You should probably take a look at
http://www.codeproject.com/Articles/9258/A-Fast-CSV-Reader
We have been using this in our projects; it's quite robust and does what it says.
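None of the answers above touches the first item on the question's list, the encoding check. A minimal sketch, assuming the expected encoding is UTF-8 (the question does not say which one it is): decode the raw bytes with a strict decoder that throws on invalid sequences instead of silently replacing them. EncodingCheck and IsValidUtf8 are illustrative names.

```csharp
using System;
using System.Text;

public static class EncodingCheck
{
    // Returns true when the bytes form valid UTF-8. Assumes UTF-8 is the expected encoding.
    public static bool IsValidUtf8(byte[] bytes)
    {
        // throwOnInvalidBytes: true makes the decoder throw instead of
        // substituting the replacement character for bad sequences.
        var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                                      throwOnInvalidBytes: true);
        try
        {
            strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

A file that passes this check can still be in another encoding that happens to produce valid UTF-8 bytes (plain ASCII, for example), so treat it as a sanity check rather than proof.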

Related

Custom Class to CSV

I have a requirement to output some of our ERP data to a very specific CSV format with an exact number of fields per record, most of which we won't be providing at this time (or which have default values). To support future changes, I decided to model the CSV format as a custom class of strings (all fields are strings), marking the fields we are not currently using as readonly and giving them their default values, mostly String.Empty. So the class looks something like this:
private class CustomClass
{
public string field1 = String.Empty;
public readonly string field2 = String.Empty; //Not going to be used
public string field3 = String.Empty;
public readonly string field4 = "N/A"; //Not going to be used
...
}
Now, after I populate the used fields, I need to take this data and export a specifically formatted comma delimited string. So using other posts on StackOverflow I came up with the following function to add to the class:
public string ToCsvFields()
{
StringBuilder sb = new StringBuilder();
foreach (var f in typeof(CustomClass).GetFields())
{
if (sb.Length > 0)
sb.Append(",");
var x = f.GetValue(this);
if (x != null)
sb.Append("\"" + x.ToString() + "\"");
}
return sb.ToString();
}
This works and gives me the exact CSV output I need for each line when I call CustomClass.ToCsvFields(), and makes it pretty easy to maintain if the consumer of the CSV changes their column definition. But this line in particular makes me feel like something could go wrong in production code: var x = f.GetValue(this);
I understand what it is doing, but I generally shy away from "this" in my code; am I just being paranoid and this is totally acceptable code for this purpose?
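Setting "this" aside, a few other things in the function above look more likely to bite in production: a value containing a double quote would break the quoting, a null in the first field would drop its separator because of the sb.Length check, and the documentation for Type.GetFields says the fields are not returned in any guaranteed order. A minimal sketch of the quoting part, assuming the consumer expects the usual convention of doubling embedded quotes (CsvLine, Quote and ToCsvLine are illustrative names, not part of the class above):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CsvLine
{
    // Wraps a value in quotes and doubles any embedded quotes.
    // A null value is written as an empty quoted field so the column count stays fixed.
    public static string Quote(string value)
    {
        return "\"" + (value ?? string.Empty).Replace("\"", "\"\"") + "\"";
    }

    // Joins already-ordered values into one CSV record.
    public static string ToCsvLine(IEnumerable<string> values)
    {
        return string.Join(",", values.Select(Quote));
    }
}
```

Passing the values in an explicit order (for example from an array built by hand in the class) sidesteps the reflection ordering question entirely, at the cost of listing the fields once more.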

Read Specific Strings from Text File

I'm trying to get certain strings out of a text file and put them in variables.
This is what the structure of the text file looks like. Keep in mind that this is just one entry; every entry looks like this, and entries are separated by a blank line:
Date: 8/12/2013 12:00:00 AM Source Path: \\build\PM\11.0.64.1\build.11.0.64.1.FileServerOutput.zip Destination Path: C:\Users\Documents\.NET Development\testing\11.0.64.1\build.11.0.55.5.FileServerOutput.zip Folder Updated: 11.0.64.1 File Copied: build.11.0.55.5.FileServerOutput.zip
I wasn't sure what to use as a delimiter for this text file, or even whether I should be using a delimiter at all, so the format is subject to change.
As a quick example of what I want: I want to go through the file, grab the Destination Path, and store it in a variable such as strDestPath.
Overall, the code I have come up with so far is this:
//find the variables from the text file
string[] lines = File.ReadAllLines(GlobalVars.strLogPath);
Yeah, not much, but I thought perhaps I could read one line at a time and search that line for what I am looking for. Honestly, I'm not 100% sure whether I should stick with that approach...
If you are concerned about how large your file is, you should use ReadLines, which uses deferred execution, instead of ReadAllLines:
var lines = File.ReadLines(GlobalVars.strLogPath);
The ReadLines and ReadAllLines methods differ as follows:
When you use ReadLines, you can start enumerating the collection of strings before the whole collection is returned; when you use ReadAllLines, you must wait for the whole array of strings be returned before you can access the array. Therefore, when you are working with very large files, ReadLines can be more efficient.
As weird as it might sound, you should take a look at Log Parser. If you are free to set the file format, you could use one that fits Log Parser and, believe me, it will make your life a lot easier.
Once you load the file with Log Parser you can use queries to get the information you want. If you don't mind using interop in your project, you can even add a COM reference and use it from any .NET project.
This sample reads a HUGE CSV file and bulk-copies it to the DB, where the final steps are performed. This is not really your case, but it shows how easy this is with Log Parser:
COMTSVInputContextClass logParserTsv = new COMTSVInputContextClass();
COMSQLOutputContextClass logParserSql = new COMSQLOutputContextClass();
logParserTsv.separator = ";";
logParserTsv.fixedSep = true;
logParserSql.database = _sqlDatabaseName;
logParserSql.server = _sqlServerName;
logParserSql.username = _sqlUser;
logParserSql.password = _sqlPass;
logParserSql.createTable = false;
logParserSql.ignoreIdCols = true;
// query shortened for clarity purposes
string SelectPattern = @"Select TO_STRING(UserName),TO_STRING(UserID) INTO {0} From {1}";
string query = string.Format(SelectPattern, _sqlTable, _csvPath);
// logParser is the Log Parser query object; its declaration is not shown in this sample
logParser.ExecuteBatch(query, logParserTsv, logParserSql);
Log Parser is one of those hidden gems from Microsoft that most people don't know about. I have used it to read IIS logs, CSV files, txt files, etc. You can even generate charts!
Just check it here http://support.microsoft.com/kb/910447/en
Looks like you need to create a Tokenizer. Try something like this:
Define a list of token values:
List<string> gTkList = new List<string>() {"Date:","Source Path:" }; //...etc.
Create a Token class:
public class Token
{
private readonly string _tokenText;
private string _val;
private int _begin, _end;
public Token(string tk, int beg, int end)
{
this._tokenText = tk;
this._begin = beg;
this._end = end;
this._val = String.Empty;
}
public string TokenText
{
get{ return _tokenText; }
}
public string Value
{
get { return _val; }
set { _val = value; }
}
public int IdxBegin
{
get { return _begin; }
}
public int IdxEnd
{
get { return _end; }
}
}
Create a method to find your tokens:
List<Token> FindTokens(string str)
{
    List<Token> retVal = new List<Token>();
    if (!String.IsNullOrWhiteSpace(str))
    {
        foreach (string cd in gTkList)
        {
            int fIdx = str.IndexOf(cd);
            if (fIdx > -1)
                retVal.Add(new Token(cd, fIdx, fIdx + cd.Length));
        }
    }
    // order by position so each value ends where the next token begins
    retVal.Sort((a, b) => a.IdxBegin.CompareTo(b.IdxBegin));
    return retVal;
}
Then just do something like this:
foreach (string ln in lines)
{
    // returns ordered list of tokens
    var tkns = FindTokens(ln);
    for (int i = 0; i < tkns.Count; i++)
    {
        // the value runs from the end of this token to the start of the next one (or the end of the line)
        int len = (i == tkns.Count - 1) ? ln.Length - tkns[i].IdxEnd : tkns[i + 1].IdxBegin - tkns[i].IdxEnd;
        tkns[i].Value = ln.Substring(tkns[i].IdxEnd, len).Trim();
    }
    // Do something with the gathered values
    foreach (Token tk in tkns)
    {
        // stuff
    }
}
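If a full Token class feels like more than you need, the same idea can be sketched more compactly: find where each label starts, sort by position, and take the text between one label and the next. This is only a sketch under the assumption that the five labels in the sample line are the complete set and never appear inside a value; LogLineParser and Parse are illustrative names.

```csharp
using System;
using System.Collections.Generic;

public static class LogLineParser
{
    // Labels as they appear in the sample line; assumed to be the complete set.
    static readonly string[] Labels =
        { "Date:", "Source Path:", "Destination Path:", "Folder Updated:", "File Copied:" };

    // Returns label (without the colon) -> value for every label found in the line.
    public static Dictionary<string, string> Parse(string line)
    {
        // Record where each label starts.
        var found = new List<KeyValuePair<int, string>>();
        foreach (string label in Labels)
        {
            int idx = line.IndexOf(label, StringComparison.Ordinal);
            if (idx >= 0)
                found.Add(new KeyValuePair<int, string>(idx, label));
        }
        found.Sort((a, b) => a.Key.CompareTo(b.Key));

        // Each value runs from the end of its label to the start of the next label.
        var result = new Dictionary<string, string>();
        for (int i = 0; i < found.Count; i++)
        {
            int start = found[i].Key + found[i].Value.Length;
            int end = (i + 1 < found.Count) ? found[i + 1].Key : line.Length;
            result[found[i].Value.TrimEnd(':')] = line.Substring(start, end - start).Trim();
        }
        return result;
    }
}
```

With that in place, string strDestPath = LogLineParser.Parse(line)["Destination Path"]; gives the variable the question asks for.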

Search a string from 500k entries in txt

I have a .txt file which has about 500k entries, each separated by new line. The file size is about 13MB and the format of each line is the following:
SomeText<tab>Value<tab>AnotherValue<tab>
My problem is to find a certain "string" with the input from the program, from the first column in the file, and get the corresponding Value and AnotherValue from the two columns.
The first column is not sorted, but the second and third column values in the file are actually sorted. But, this sorting is of no good use to me.
The file is static and does not change. I was thinking to use the Regex.IsMatch() here but I am not sure if that's the best approach here to go line by line.
If the lookup time would increase drastically, I could probably go for rearranging the first column (and hence un-sorting the second & third column). Any suggestions on how to implement this approach or the above approach if required?
After locating the string, how should I fetch those two column values?
EDIT
I realized that there will be quite a few searches in the file for at least one request by the user. If I have an array of values to find, how can I return some kind of dictionary holding the corresponding values of the matches found?
Maybe with this code:
// path is the location of the .txt file
var myLine = File.ReadAllLines(path)
    .Select(line => line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries))
    .Single(s => s[0] == "string to find");
myLine is an array of strings that represents a row. You may also use the .AsParallel() extension method for better performance.
How many times do you need to do this search?
Is the cost of some pre-processing on startup worth it if you save time on each search?
Is loading all the data into memory at startup feasible?
Parse the file into objects and stick the results into a hashtable?
I don't think Regex will help you more than any of the standard string options. You are looking for a fixed string value, not a pattern, but I stand to be corrected on that.
Update
Presuming that the "SomeText" is unique, you can use a dictionary like this
Data represents the values coming in from the file.
MyData is a class to hold them in memory.
public IEnumerable<string> Data = new List<string>() {
"Text1\tValue1\tAnotherValue1\t",
"Text2\tValue2\tAnotherValue2\t",
"Text3\tValue3\tAnotherValue3\t",
"Text4\tValue4\tAnotherValue4\t",
"Text5\tValue5\tAnotherValue5\t",
"Text6\tValue6\tAnotherValue6\t",
"Text7\tValue7\tAnotherValue7\t",
"Text8\tValue8\tAnotherValue8\t"
};
public class MyData {
public String SomeText { get; set; }
public String Value { get; set; }
public String AnotherValue { get; set; }
}
[TestMethod]
public void ParseAndFind() {
var dictionary = Data.Select(line =>
{
var pieces = line.Split(new char[] { '\t' }, StringSplitOptions.RemoveEmptyEntries);
return new MyData {
SomeText = pieces[0],
Value = pieces[1],
AnotherValue = pieces[2],
};
}).ToDictionary<MyData, string>(dat =>dat.SomeText);
Assert.AreEqual("AnotherValue3", dictionary["Text3"].AnotherValue);
Assert.AreEqual("Value7", dictionary["Text7"].Value);
}
hth,
Alan
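For the EDIT in the question (several values to find per request, returned as some kind of dictionary), the same dictionary idea could be sketched as follows: build the index once at startup, then look up every requested key in it. TabLookup, BuildIndex and FindAll are names made up for this sketch, and it assumes that when SomeText is duplicated the last occurrence should win.

```csharp
using System;
using System.Collections.Generic;

public static class TabLookup
{
    // Builds first column -> { second column, third column } from tab-separated lines.
    public static Dictionary<string, string[]> BuildIndex(IEnumerable<string> lines)
    {
        var index = new Dictionary<string, string[]>(StringComparer.Ordinal);
        foreach (string line in lines)
        {
            string[] parts = line.Split('\t');
            if (parts.Length < 3)
                continue; // skip malformed lines
            index[parts[0]] = new[] { parts[1], parts[2] }; // last occurrence wins
        }
        return index;
    }

    // Returns only the keys that were found, with their two values.
    public static Dictionary<string, string[]> FindAll(
        Dictionary<string, string[]> index, IEnumerable<string> keys)
    {
        var found = new Dictionary<string, string[]>();
        foreach (string key in keys)
        {
            if (index.TryGetValue(key, out string[] values))
                found[key] = values;
        }
        return found;
    }
}
```

Feeding BuildIndex with File.ReadLines(path) reads the 13MB file once; after that each lookup is a hash lookup rather than a scan.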
var firstFoundLine = File.ReadLines("filename").FirstOrDefault(s => s.StartsWith("string"));
if (firstFoundLine != null) // FirstOrDefault returns null when nothing matches
{
    char yourColumnDelimiter = '\t';
    var columnValues = firstFoundLine.Split(new[] { yourColumnDelimiter });
    var secondColumn = columnValues[1];
    var thirdColumn = columnValues[2];
}
File.ReadLines is better than File.ReadAllLines because you won't need to read the whole file, only up to the first matching line: http://msdn.microsoft.com/en-us/library/dd383503.aspx
Parse this monstrosity into some sort of database.
SQL Server/MySQL would be preferable, but if you can't use them for various reasons, SQLite or even Access or Excel could work.
Doing that a single time is not hard.
After you are done with that, searching will become easy and fast.
GetLines(inputPath).FirstOrDefault(p => p.Split('\t')[0] == "SearchText")
private static IEnumerable<string> GetLines(string inputFile)
{
string filePath = Path.Combine(Directory.GetCurrentDirectory(),inputFile);
return File.ReadLines(filePath);
}

Use and parse a text file in C# to initialize a component based game model

I have a text file that should initialize my objects, which are built around a component-based model. It is my first time trying a data-driven approach, and I'm not sure whether I am heading in the right direction here.
The file I currently have in mind looks like this:
EliteGoblin.txt
#Goblin.txt
[general]
hp += 20
strength = 12
description = "A big menacing goblin"
tacticModifier += 1.3
[skills]
fireball
Where the # symbol says which other files to parse at that point.
The names in [] correspond to component classes in the code.
And below them is how to configure them.
For example, hp += 20 would take the value from Goblin.txt and increase it by 20, and so on.
My question is how I should go about parsing this file. Is there some sort of parser built into C#?
Could I change the format of my document to match an already defined format that already has support in .NET?
How do I go about working out the type of each value (int/float/string)?
Does this seem like a viable solution at all?
Thanks in advance, Xtapodi.
Drop the flat file and pick up XML. Definitely look into XML serialization. You can simply create all of your objects in C# as classes, serialize them into XML, and reload them into your application without having to worry about parsing a flat file. Because your objects will act as the schema for your XML, you won't have to worry about casting objects and writing a huge parsing routine; .NET will handle it for you. You will save many moons of headache.
For instance, you could rewrite your class to look like this:
public class Monster
{
public GeneralInfo General {get; set;}
public SkillsInfo Skills {get; set;}
}
public class GeneralInfo
{
public int Hp {get; set;}
public string Description {get; set;}
public double TacticModifier {get; set;}
}
public class SkillsInfo
{
public string[] SkillTypes {get; set;}
}
...and the XML they serialize to would look something like...
<Monster>
<General>
<Hp>20</Hp>
<Description>A big menacing goblin</Description>
<TacticModifier>1.3</TacticModifier>
</General>
<SkillTypes>
<SkillType>Fireball</SkillType>
<SkillType>Water</SkillType>
</SkillTypes>
</Monster>
..Some of my class names, hierarchy, etc. may be wrong, as I punched this in real quick, but you get the general gist of how serialization will work.
You might want to check out Sprache, a .NET library for creating DSLs, by Autofac creator Nicholas Blumhardt. From the Google site:
Sprache is a small library for constructing parsers directly in C# code.
It isn't an "industrial strength" framework - it fits somewhere in between regular expressions and a full-blown toolset like ANTLR.
Usage
Unlike most parser-building frameworks, you use Sprache directly from your program code, and don't need to set up any build-time code generation tasks. Sprache itself is a single tiny assembly.
A simple parser might parse a sequence of characters:
// Parse any number of capital 'A's in a row
var parseA = Parse.Char('A').AtLeastOnce();
Sprache provides a number of built-in functions that can make bigger parsers from smaller ones, often callable via Linq query comprehensions:
Parser<string> identifier =
    from leading in Parse.Whitespace.Many()
    from first in Parse.Letter.Once()
    from rest in Parse.LetterOrDigit.Many()
    from trailing in Parse.Whitespace.Many()
    select new string(first.Concat(rest).ToArray());
var id = identifier.Parse(" abc123 ");
Assert.AreEqual("abc123", id);
The linked article builds a questionnaire that is driven by a simple text file with the following format:
identification "Personal Details"
[
name "Full Name"
department "Department"
]
employment "Current Employer"
[
name "Your Employer"
contact "Contact Number"
#months "Total Months Employed"
]
There is no built-in function that does exactly what you want, but you can easily write it.
void ParseLine(string charClass, string line) {
    // your code to parse line here...
    Console.WriteLine("{0} : {1}", charClass, line);
}
void ParseFile(string fileName) {
    string currentClass = "";
    // StreamReader reads the file's contents; StringReader would read the file name itself
    using (StreamReader sr = new StreamReader(fileName)) {
        string line;
        while ((line = sr.ReadLine()) != null) {
            line = line.Trim();
            if (line.Length == 0) continue; // skip blank lines
            if (line[0] == '#') {
                string embeddedFile = line.Substring(1);
                ParseFile(embeddedFile);
            }
            else if (line[0] == '[') {
                // "[general]" -> "general"
                currentClass = line.Substring(1, line.Length - 2);
            }
            else ParseLine(currentClass, line);
        }
    }
}
What you want to do isn't going to be easy. The statistic inheritance in particular.
So, unless you can find some existing code to leverage, I suggest you start with simpler requirements with a view to adding the more involved functionality later and build up the functionality incrementally.
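As one possible first increment, and as a partial answer to the "what type is each value" part of the question, here is a minimal sketch that splits a line such as hp += 20 into key, operator and value, and guesses the type by trying int, then double, then falling back to string. Assignment and ConfigLine are made-up names, and the try-in-order rule is an assumption rather than anything the file format defines.

```csharp
using System;
using System.Globalization;

public sealed class Assignment
{
    public string Key;
    public string Operator; // "=" or "+="
    public object Value;    // int, double or string
}

public static class ConfigLine
{
    // Returns null for lines without '=' (for example a bare skill name).
    public static Assignment Parse(string line)
    {
        int eq = line.IndexOf('=');
        if (eq < 0)
            return null;

        // "+=" is recognised only when '+' sits directly before the first '='.
        bool isAdd = eq > 0 && line[eq - 1] == '+';
        string key = line.Substring(0, isAdd ? eq - 1 : eq).Trim();
        string raw = line.Substring(eq + 1).Trim();
        return new Assignment { Key = key, Operator = isAdd ? "+=" : "=", Value = Infer(raw) };
    }

    // Quoted text -> string, then int, then double, otherwise the raw text.
    static object Infer(string raw)
    {
        if (raw.Length >= 2 && raw[0] == '"' && raw[raw.Length - 1] == '"')
            return raw.Substring(1, raw.Length - 2);
        if (int.TryParse(raw, NumberStyles.Integer, CultureInfo.InvariantCulture, out int i))
            return i;
        if (double.TryParse(raw, NumberStyles.Float, CultureInfo.InvariantCulture, out double d))
            return d;
        return raw;
    }
}
```

Applying += to the value inherited from Goblin.txt would then be a matter of looking up the previous value for Key and adding to it, which is the part that needs the inheritance design mentioned above.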

C# file input from text file

I have a function like this:
List<float> myList = new List<float>();
public void numbers(string filename)
{
string input;
float number;
if (System.IO.File.Exists(filename) == true)
{
System.IO.StreamReader objectReader;
objectReader = new System.IO.StreamReader(filename);
while ((input = objectReader.ReadLine()) != null)
{
number = Convert.ToSingle(input);
myList.Add(number);
}
objectReader.Close();
}
else
{
MessageBox.Show("No Such File" + filename);
}
}
Where I'm trying to add numbers (floats) from a text file into a List. But I keep getting errors saying wrong format. The numbers in the text file are one number per line... any help?
I would suggest you do a Trim call like this
number = Convert.ToSingle(input.Trim());
However, better code would use a TryParse call
float tmp;
if (float.TryParse(input.Trim(), out tmp))
{
    myList.Add(tmp);
}
Your code worked fine for me except for the case of a blank line (and of course for entries that were not numbers at all).
Here is a version that should work for you, using TryParse to check whether each line can be converted to a Single:
public void Numbers(string filename)
{
List<float> myList = new List<float>();
string input;
if (System.IO.File.Exists(filename) == true)
{
System.IO.StreamReader objectReader;
objectReader = new System.IO.StreamReader(filename);
while ((input = objectReader.ReadLine()) != null)
{
Single output;
if (Single.TryParse(input, out output ))
{
myList.Add(output);
}
else
{
// Huh? Should this happen, maybe some logging can go here to track down why you couldn't just use the .Convert()
}
}
objectReader.Close();
}
else
{
MessageBox.Show("No Such File" + filename);
}
}
As Mike C rightly points out, this could be potentially risky - swallowing good data that has been corrupted by the output process. The tryParse method returns false when it fails so you could add in an else branch and some logging to check just what is causing the failures and see if there is another bug floating around that can be corrected.
Do you have any blank lines in the file, or failures to convert the number? My guess is that you have a line which is not castable to float from its current format. You should make sure you sanitize the lines before reading them in (strip off everything that is not a number using a regex) and throw the line out if it fails the check.
One thing you might do is use double instead and do a Convert.ToDouble().
Are there spaces or commas or anything? The best thing to do would be to set a breakpoint on
number = Convert.ToSingle(input);
to see what input is actually before you try to convert it.
There's a wonderful free package called FileHelpers which helps with importing data from all sorts of text files. The advantage with this is that a lot of the deeper error handling is already in place.
By the way,
if (System.IO.File.Exists(filename) == true)
can be shortened to
if (System.IO.File.Exists(filename))
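One more possible cause that nobody above has mentioned: Convert.ToSingle(string) parses with the current culture, so on a machine whose culture uses a comma as the decimal separator, a line like 1.5 may not parse the way you expect. This is only a guess at what is happening in your case. A minimal sketch that pins the culture, skips blank lines and keeps the rejected lines for inspection (FloatLines and its rejected parameter are illustrative names):

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class FloatLines
{
    // Parses one float per line using the invariant culture ('.' as decimal separator).
    // Blank lines are skipped; lines that fail to parse are appended to 'rejected'.
    public static List<float> Parse(IEnumerable<string> lines, List<string> rejected)
    {
        var numbers = new List<float>();
        foreach (string line in lines)
        {
            string text = line.Trim();
            if (text.Length == 0)
                continue;

            if (float.TryParse(text, NumberStyles.Float, CultureInfo.InvariantCulture, out float value))
                numbers.Add(value);
            else
                rejected.Add(line);
        }
        return numbers;
    }
}
```

Calling FloatLines.Parse(File.ReadLines(filename), rejected) and then looking at rejected should show exactly which lines cause the "wrong format" error.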
