I know that the data should be correct. I have no control over the data and my boss is just going to tell me that I need to figure out a way to deal with someone else's mistake. So please don't tell me it's not my problem that the data is bad, because it is.
Anywho, this is what I'm looking at:
"Words","email#email.com","","4253","57574","FirstName","","LastName, MD","","","576JFJD","","1971","","Words","Address","SUITE "A"","City","State","Zip","Phone","",""
Data has been scrubbed for confidentiality reasons.
So as you can see, the data contains quotation marks, and there are commas inside some of the quoted fields, so I cannot simply remove the quotes. But the "SUITE "A"" field is throwing off the parser: there are too many quotation marks. >.<
I'm using the TextFieldParser in the Microsoft.VisualBasic.FileIO namespace with these settings:
parser.HasFieldsEnclosedInQuotes = true;
parser.SetDelimiters(",");
parser.TextFieldType = FieldType.Delimited;
The error is
MalformedLineException: Line 9871 cannot be parsed using the current
delimiters.
I would like to scrub the data somehow to account for this but I'm not sure how to do it. Or maybe there's a way to just skip this line? Although I suspect my higher ups will not approve of me just skipping data that we might need.
If you are only trying to get rid of the stray " marks in your CSV, you can use the following regex to find them and replace them with '
String sourcestring = "source string to match with pattern";
String matchpattern = @"(?<!^|,)""(?!,|$)";
String replacementpattern = "'";
Console.WriteLine(Regex.Replace(sourcestring,matchpattern,replacementpattern,RegexOptions.Multiline));
Explanation:
@"(?<!^|,)""(?!,|$)" will find any " that is not preceded by the beginning of the line or a , and that is not followed by the end of the line or a ,
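To see what this kind of replacement does to a line shaped like the sample, here is a minimal sketch. The class and method names (`QuoteScrubber`, `Scrub`) are made up for illustration, and the pattern is the same idea as above: a quote counts as stray when it does not sit at a field boundary.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuoteScrubber
{
    // A double quote is treated as stray when it is NOT right after the start
    // of a line or a comma, and NOT right before a comma or the end of a line.
    private static readonly Regex StrayQuote =
        new Regex(@"(?<!^|,)""(?!,|$)", RegexOptions.Multiline);

    // Replaces every stray double quote with a single quote.
    public static string Scrub(string line)
    {
        return StrayQuote.Replace(line, "'");
    }
}
```

Calling `QuoteScrubber.Scrub` on `"Address","SUITE "A"","City",""` leaves the field boundaries and the empty field alone and only rewrites the two quotes around A. A stray quote that happens to sit directly next to a comma cannot be told apart from a real field boundary, so this will not catch every case, and lines that still carry a trailing \r may need trimming first.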
I am not familiar with TextFieldParser. However with CsvHelper, you can add a custom handler for invalid data:
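// Note: IgnoreReadingExceptions and ReadingExceptionCallback may not exist under
// these names in every CsvHelper version; check the documentation for yours.
var config = new CsvConfiguration();
config.IgnoreReadingExceptions = true;
config.ReadingExceptionCallback += (e, row) =>
{
    // you can add some custom patching here if possible
    // or, save the line numbers and add/edit them manually later.
};
using (var file = new StreamReader(".csv"))
using (var reader = new CsvReader(file, config))
{
    // GetRecords is lazy, so enumerate it to actually read the file
    var records = reader.GetRecords<YourDtoClass>().ToList();
}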
My only addition to what everyone is saying (because we've all been there) is to try to rectify each new issue you encounter with code. There are some decent regex strings out there (https://www.google.com/?ion=1&espv=2#q=c-sharp+regex+csv+clean), or you could manually fix things using String.Replace (String.Replace("\"\"\"","").Replace("\"\","").Replace("\",,","\",") or such). Eventually, as you detect and find ways of correcting more and more mistakes, the amount of manual recovery you have to do will shrink substantially (most of your bad data will likely come from similar mistakes). Cheers!
PS - Idea-ish (it's been a while, and the logic may need some tweaking as I'm writing from memory), but you'll get the gist:
public string[] parseCSVWithQuotes(string csvLine, int expectedNumberOfDataPoints)
{
    string ret = "";
    string thisChar = "";
    string lastChar = "";
    bool needleDown = false; // true while inside single quotes: characters are treated literally
    for (int i = 0; i < csvLine.Length; i++)
    {
        thisChar = csvLine.Substring(i, 1);
        if (thisChar == "'" && lastChar != "'")
            needleDown = !needleDown;
        if (thisChar == "," && lastChar != ",")
        {
            // inside quotes: convert the literal comma to a pipe so it doesn't cause a break on split
            // outside quotes: keep the comma, because a break on split is intended
            ret += needleDown ? "|" : ",";
            lastChar = thisChar;
            continue; // the comma has been handled; don't add it a second time below
        }
        if (!needleDown && (thisChar == "\"" || thisChar == "*"))
        {
            // do not add -- this eliminates undesired characters outside single quotes
            // (repeat for any other undesired character, or use a Regex)
        }
        else if ((lastChar == "'" || lastChar == "\"" || lastChar == ",") && thisChar == lastChar)
        {
            // do not add -- this eliminates doubled characters
        }
        else
        {
            // this character is not undesired and is not a double, so it is valid
            ret += thisChar;
            lastChar = thisChar;
        }
    }
    // we've cleaned as best we can
    string[] parts = ret.Split(',');
    if (parts.Length == expectedNumberOfDataPoints)
    {
        for (int i = 0; i < parts.Length; i++)
        {
            // go back and replace the temporary pipe with the literal comma AFTER the split
            parts[i] = parts[i].Replace("|", ",");
        }
        return parts;
    }
    else
    {
        // save ret to bad CSV log
        return null;
    }
}
I've had to do this before.
The first step is to split each line using string.Split(',').
The next step is to combine the segments that belong together.
What I essentially did was:
make a new list representing the combined strings
if a segment begins with a quote, push it onto your new list
if it does not begin with a quote, append it (with the comma restored) to the last string in your list
Bonus: throw an exception when a segment ends with a quote but the next one does not begin with a quote
Depending on what the rules are regarding what can actually appear in your data, you might have to change your code to account for that.
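The split-then-recombine steps above could be sketched roughly like this. The names (`SegmentMerger`, `SplitAndMerge`) are hypothetical, and the sketch assumes every real field starts with a double quote, as in the sample data; the "bonus" validation is left out.

```csharp
using System;
using System.Collections.Generic;

public static class SegmentMerger
{
    // Splits on every comma, then glues back together the segments that
    // were only separated because a comma appeared inside a quoted field.
    public static List<string> SplitAndMerge(string line)
    {
        var fields = new List<string>();
        foreach (string segment in line.Split(','))
        {
            if (fields.Count == 0 || segment.StartsWith("\""))
            {
                // A segment that begins with a quote starts a new field.
                fields.Add(segment);
            }
            else
            {
                // Otherwise it belongs to the previous field; restore the comma.
                fields[fields.Count - 1] += "," + segment;
            }
        }
        return fields;
    }
}
```

With `"a","LastName, MD","SUITE "A"",""` this gives four fields, still wrapped in their quotes, with `"LastName, MD"` reassembled. It breaks down if a comma inside a field is directly followed by a quote.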
At the core of CSV's file format, each line is a row, each cell in that row is separated by a comma. In your case, your format also contains the (very unfortunate) stipulation that commas inside a pair of quotation marks do not count as separators and are instead part of the data. I say very unfortunate because a misplaced quotation mark affects the entire rest of the line, and since quotation marks in standard ASCII do not distinguish between open and closed, there really is nothing you can do to recover from this without knowing the original intent.
That is when you log a message in a way that the person who does know the original intent (the person that provided the data) can look at the file and correct the error:
if (parse_line(line, &data)) {
// save the data
} else {
// log the error
fprintf(stderr, "Bad line: %s\n", line);
}
And since your quotation marks aren't escaping newlines, you can keep on going with the next line after running into this error.
ADDENDUM: And if your company has a choice (i.e. your data is being serialized by a company tool) don't use CSV. Use something like XML or JSON with a much more clearly defined parsing mechanism.
I had to do this once as well. My approach was to go through a line and keep track of what I was reading.
Basically, I coded my own scanner that chops tokens off the input line, which gave me full control over my faulty .csv data.
This is what I did:
For each character on a line of input:
1. when outside of a string and meeting a comma => all of the previous string (which can be empty) is a valid token.
2. when outside of a string and meeting anything but a comma or a quote => now you have a real problem, unquoted text => handle as you see fit.
3. when outside of a string and meeting a quote => found the start of a string.
4. when inside of a string and meeting a comma => accept the comma as part of the string.
5. when inside of a string and meeting a quote => trouble starts here, mark this point.
6. continue, and when meeting a comma (skipping white space if desired) close the string, 'unread' the comma and continue (that will bring you to point 1).
7. or continue, and when meeting a quote => obviously, what was read must be part of the string; add it to the string, 'unread' the quote and continue (that will bring you to point 5).
8. or continue and find whitespace, then End Of Line ('\n') => the last quote must be the closing quote; accept the string as a value.
9. or continue and find non-whitespace, then End Of Line => now you have a real problem: you have the start of a string but it is not closed => handle the error as you see fit.
If the number of fields in your .csv file is fixed, you can count the commas you recognise as field separators, and when you see an End Of Line you know whether you have another problem or not.
With the stream of strings received from the input line you can build a 'clean' .csv line and this way build a buffer of accepted and cleaned input that you can use in your already existing code.
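A compact sketch of such a scanner follows; it is not the original author's code. The names (`TolerantCsvScanner`, `Scan`) are invented, and it makes two simplifying choices: a quote inside a string closes it only when the next non-whitespace character is a comma or the end of the line, and unquoted text is kept rather than rejected.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class TolerantCsvScanner
{
    public static List<string> Scan(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inString = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (!inString)
            {
                if (c == ',')
                {
                    // Points 1 and 6: a comma outside a string ends the field.
                    fields.Add(current.ToString());
                    current.Clear();
                }
                else if (c == '"')
                {
                    inString = true; // point 3: start of a string
                }
                else if (!char.IsWhiteSpace(c))
                {
                    current.Append(c); // point 2: unquoted text, kept as-is here
                }
            }
            else if (c == '"')
            {
                // Point 5: look ahead past whitespace to classify this quote.
                int next = i + 1;
                while (next < line.Length && char.IsWhiteSpace(line[next])) next++;
                if (next == line.Length || line[next] == ',')
                {
                    inString = false; // points 6 and 8: closing quote
                    i = next - 1;     // skip the whitespace that was looked past
                }
                else
                {
                    current.Append(c); // point 7: the quote is part of the data
                }
            }
            else
            {
                current.Append(c); // point 4: commas inside a string are data
            }
        }

        // Point 9: the line ended while a string was still open.
        if (inString) throw new FormatException("Unterminated quoted field: " + line);
        fields.Add(current.ToString());
        return fields;
    }
}
```

For `"a","LastName, MD","SUITE "A"",""` this yields the four values a, LastName, MD, SUITE "A" and an empty string, which can then be re-quoted and written out as a clean line for the existing parser.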
What I want to do is rename all file in a particular folder, such that if a filename contains any digit in it, it is removed.
say, if a filename is
someFileName.someExtension it remains the same, but if a file is like this,
03 - Rocketman Elton John it should be renamed to Rocketman Elton John (I did the part to remove the -), another example, if the filename is 15-Trey Songz - Unfortunate (Prod. by Noah 40 Shebib) it should be renamed to Trey Songz Unfortunate (Prod. by Noah Shebib) (again I can remove -). The user is asked to select the folder like this
private void txtFolder_MouseDown(object sender, MouseEventArgs e)
{
FolderBrowserDialog fd = new FolderBrowserDialog();
fd.RootFolder = Environment.SpecialFolder.Desktop;
fd.ShowNewFolderButton = true;
if (fd.ShowDialog() == DialogResult.OK)
{
txtFolder.Text = fd.SelectedPath;
}
}
Also, the renaming starts like this
private void btnGo_Click(object sender, EventArgs e)
{
StartRenaming(txtFolder.Text);
}
and
private void StartRenaming(string FolderName)
{
string[] files = Directory.GetFiles(FolderName);
foreach (string file in files)
RenameFile(file);
}
Now in RenameFile, I need the regular expression that will remove any number(s) from the file name.
It is implemented as
private void RenameFile(string FileName)
{
string fileName = Path.GetFileNameWithoutExtension(FileName);
/* here the function goes that will find numbers in filename using regular experssion and replace them */
}
so what I can do is, I can use something like
1 var matches = Regex.Matches(fileName, @"\d+");
2
3 if (matches.Count == 0)
4 return null;
5
6 // start the loop
7 foreach(var match in matches)
8 {
9 fileName = fileName.Replace(match, ""); /* or fileName.Replace(match.ToString(), ""), whatever be the case */
10 }
11 File.Move(FileName, Path.Combine(Path.GetDirectoryName(FileName), fileName));
12 return;
But I don't think that's the right way to do it? Is there any better option to do this? or is this the best (and only option) to do this? Also, is there anything like IN in String.Replace? Say in sql I can use IN in a select command and specify a bunch of where conditions, but is there something like this with String.Replace so that I don't have to run the loop I ran from line 7 to 10? Are there any other better options?
ps: about that regex, I posted a question Regular Expression for numbers? (apparently I wasn't clear enough) and from that I got my regex, if you think someother regex would do better please tell me, also if you need any other information please let me know...
You can try Regex.Replace to remove digits, i.e.:
fileName = Regex.Replace(fileName, @"\d", "");
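As a slightly fuller sketch of the same idea: strip the digits and hyphens from the name part only, tidy up the leftover spaces, and rebuild the target path with the original extension. `NameCleaner`, `Clean` and `BuildTargetPath` are hypothetical names, and collapsing whitespace is an added assumption based on the examples in the question.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

public static class NameCleaner
{
    // Removes digits and hyphens in one pass, then collapses the
    // whitespace left behind and trims the ends.
    public static string Clean(string nameWithoutExtension)
    {
        string stripped = Regex.Replace(nameWithoutExtension, @"[\d-]+", "");
        return Regex.Replace(stripped, @"\s+", " ").Trim();
    }

    // Builds the new full path, leaving the directory and extension untouched.
    public static string BuildTargetPath(string fullPath)
    {
        string directory = Path.GetDirectoryName(fullPath) ?? "";
        string cleaned = Clean(Path.GetFileNameWithoutExtension(fullPath));
        return Path.Combine(directory, cleaned + Path.GetExtension(fullPath));
    }
}
```

`File.Move(path, NameCleaner.BuildTargetPath(path))` would then do the rename. Two files can end up with the same cleaned name, so checking `File.Exists` on the target first is probably wise.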
In the off chance that you are merely looking to simply rename the files and you thought that creating your own program would be the best way - I would recommend PFrank as a standalone tool (especially if you understand regex already)
If you do desire this and if you do take my suggestion (and since it's not the simplest and clearest interface), you would use \d+(\s?-)? for the match expression (in the first column in PFrank), which should match any number of digits, optionally followed by a hyphen and an additional optional whitespace character between the two. You would then have no replacement expression (zero-length string or an empty second column in PFrank). Finally, select the folder containing the files you want renamed and click the scan button; in the dialog that pops up, confirm your results and click the rename button. Sorry if I wasted anyone's time!
For replacing you should look into Regex.Replace, which can replace all occurrences at once.
Otherwise the code looks OK (with the exception of the strange fileName.Replace("match", ""), which uses a constant string...)
How about this ?
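private void StartRenaming(string FolderName)
{
    string[] files = Directory.GetFiles(FolderName);
    // match against the file name only, so digits in the folder path or extension don't count
    string[] applicableFiles = (from string s in files
                                where Regex.IsMatch(Path.GetFileNameWithoutExtension(s), @"(\d+)|(-+)", RegexOptions.None)
                                select s).ToArray<string>();
    foreach (string file in applicableFiles)
        RenameFile(file);
}
private void RenameFile(string file)
{
    // clean only the name part; keep the directory and the extension (e.g. ".mp3") untouched
    string newFileName = Regex.Replace(Path.GetFileNameWithoutExtension(file), @"(\d+)|(-+)", "") + Path.GetExtension(file);
    File.Move(file, Path.Combine(Path.GetDirectoryName(file), newFileName));
}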
StartRenaming method will now limit the number of files to be processed based on Regex match. If the file contains a digit or - then it will be processed, thus optimizing the complete process.
RenameFile replaces digits and - in a string and gives you a newFileName
I am not quite sure about the correctness of File.Move(file, Path.Combine(Path.GetDirectoryName(file), newFileName)); though, but I guess your problem was to avoid the foreach loop, and I think I have provided an appropriate solution.
Please note that I was not able to completely test this, so let me know whether it works for you and if it doesn't I will be happy to help you further.
EDIT : Forgot to mention that file.Replace(@"(\d+)|(-+)", "") will remove digits as well as - from the file string.
EDIT : Corrected file.Replace to Regex.Replace
I prefer to use brackets to select the before and after and then use the $n method to rebuild the string how you want it to be.
"03 - Rocketman Elton John" -Replace '^([^-]*) - ([^-]*)', '$1 $2'
I'm stuck with regular expressions. The program is a console application written in C#. There are a few commands. I want to check the arguments are right first. I thought it'll be easy with Regex but couldn't do that:
var strArgs = "";
foreach (var x in args)
{
strArgs += x + " ";
}
if (!Regex.IsMatch(strArgs, @"(-\?|-help|-c|-continuous|-l|-log|-ip|)* .{1,}"))
{
Console.WriteLine("Command arrangement is wrong. Use \"-?\" or \"-help\" to see help.");
return;
}
Usage is:
program.exe [-options] [domains]
The problem is that the program accepts all commands. I also need to check that "-" prefixed options come before the domains. I think the problem is not difficult to solve.
Thanks...
Since you will end up writing a switch statement to process the options anyway, you would be better off doing the checking there:
switch(args[i])
{
case "-?": ...
case "-help": ...
...
default:
if (args[i][0] == '-')
throw new Exception("Unrecognised option: " + args[i]);
}
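A minimal sketch of doing that check without a regex, including the "options before domains" rule from the question. `ArgChecker` and `TryValidate` are made-up names, the option list is copied from the question's pattern, and the sketch assumes no option takes a value of its own.

```csharp
using System;
using System.Collections.Generic;

public static class ArgChecker
{
    // Option names taken from the pattern in the question.
    private static readonly HashSet<string> KnownOptions = new HashSet<string>
    {
        "-?", "-help", "-c", "-continuous", "-l", "-log", "-ip"
    };

    // Returns false and sets error when an option is unknown
    // or appears after the first domain.
    public static bool TryValidate(string[] args, out string error)
    {
        error = null;
        bool seenDomain = false;
        foreach (string arg in args)
        {
            if (arg.StartsWith("-"))
            {
                if (seenDomain)
                {
                    error = "Option after domain: " + arg;
                    return false;
                }
                if (!KnownOptions.Contains(arg))
                {
                    error = "Unrecognised option: " + arg;
                    return false;
                }
            }
            else
            {
                seenDomain = true; // anything without a dash counts as a domain
            }
        }
        return true;
    }
}
```

In `Main`, calling `ArgChecker.TryValidate(args, out error)` before the switch lets the switch itself focus on acting on the options.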
First, to parse command line arguments don't use regular expressions. Here is a related question that I think you should look at:
Best way to parse command line arguments in C#?
But for your specific problem with your regular expression: the options are optional, and then you match a space followed by anything at all, where "anything" can include, for example, invalid domains and/or invalid options. So for example this is valid according to your regular expression:
program.exe -c -invalid
One way to improve this is by being more precise about the allowed characters in a domain rather than just matching anything.
Another problem with your regular expressions is that you don't allow spaces between the switches. To handle that you probably want something like this:
(?:(?:-\?|-help|-c|-continuous|-l|-log|-ip) +)*
I'd also like to point out that you should use string.Join instead of the loop you are currently using.
string strArgs = string.Join(" ", args);
Don't reinvent the wheel, handling command line arguments is a solved problem.
I've gotten good use out of the Command Line Parser Library for .Net.
Actually the easiest way to achieve command line argument parsing is to create a powershell commandlet. That gives you a really nice way to work with arguments.
I have been using this function with success... perhaps it will be useful for someone else...
First, define your variables:
private string myVariable1;
private string myVariable2;
private Boolean debugEnabled = false;
Then, execute the function:
loadArgs();
and add the function to your code:
private void loadArgs()
{
    const string namedArgsPattern = "^(/|-)(?<name>\\w+)(?:\\:(?<value>.+)$|\\:$|$)";
    System.Text.RegularExpressions.Regex argRegEx = new System.Text.RegularExpressions.Regex(namedArgsPattern, System.Text.RegularExpressions.RegexOptions.Compiled);
    foreach (string arg in Environment.GetCommandLineArgs())
    {
        System.Text.RegularExpressions.Match namedArg = argRegEx.Match(arg);
        if (namedArg.Success)
        {
            // the name is lower-cased first, so the case labels must be lower case too
            switch (namedArg.Groups["name"].ToString().ToLower())
            {
                case "myarg1":
                    myVariable1 = namedArg.Groups["value"].ToString();
                    break;
                case "myarg2":
                    myVariable2 = namedArg.Groups["value"].ToString();
                    break;
                case "debug":
                    debugEnabled = true;
                    break;
                default:
                    break;
            }
        }
    }
}
and to use it you can use the command syntax with either a forward slash "/" or a dash "-":
myappname.exe /myArg1:Hello /myArg2:Chris -debug
This regex parses the command line arguments into matches and groups so that you can build a parser based on this regex.
((?:|^\b|\s+)--(?<option_1>.+?)(?:\s|=|$)(?!-)(?<value_1>[\"\'].+?[\"\']|.+?(?:\s|$))?|(?:|^\b)-(?<option_2>.)(?:\s|=|$)(?!-)(?<value_2>[\"\'].+?[\"\']|.+?(?:\s|$))?|(?<arg>[\"\'].+?[\"\']|.+?(?:\s|$)))
This Regex will parse the Following and works in almost all the languages
--in-argument hello --out-stdout false positional -x
--in-argument 'hello world"
"filename"
--in-argument="hello world'
--in-argument='hello'
--in-argument hello
"hello"
helloworld
--flag-off
-v
-x="hello"
-u positive
C:\serverfile
--in-arg1='abc' --in-arg2=hello world c:\\test
Try on Regex101
I have a Stringbuilder object that has been populated from a text file.
How can I check the StringBuilder object for and remove consecutive "blank" lines.
i.e
Line 1: This is my text
Line 2:
Line 3: Another line after the 1st blank one
Line 4:
Line 5:
Line 6: Next line after 2 blank lines
(Line numbers given as reference only)
The blank line on Line 2 is fine, but I would like to remove the duplicate blank line, on Line 5, and so on.
If for argument sake Line 6 would have also been a blank line, and a Line 7 had a value, I would like Blank Line 5 and Blank Line 6 removed, so that there would only be 1 blank line between the Line 3 and Line 7.
Thanks in advance.
Do you have to already have the file contents in a StringBuilder?
It would be nicer to be able to read line-by-line. Something like:
private IEnumerable<string> GetLinesFromFile(string fileName)
{
    using (var streamReader = new StreamReader(fileName))
    {
        string line = null;
        bool previousLineWasBlank = false;
        while ((line = streamReader.ReadLine()) != null)
        {
            bool lineIsBlank = string.IsNullOrEmpty(line);
            // skip a blank line only when the line before it was blank too
            if (!(previousLineWasBlank && lineIsBlank))
            {
                yield return line;
            }
            previousLineWasBlank = lineIsBlank;
        }
    }
}
Now you can read in your text (which has had dupe blank lines removed) like this:
foreach (var line in GetLinesFromFile("myFile.txt"))
{
Console.WriteLine(line);
}
Note: I'm only illustrating a technique here. There are other considerations: e.g. my iterator method holds the file open while the consumers are processing the foreach. This is nice and memory efficient (more so than reading into a string for example) as you are only dealing with one line at a time, but not ideal for files that take a long time to process.
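Probably not very efficient, but it's easy. Note that one blank line shows up as two consecutive newlines, so to keep a single blank line you collapse three newlines into two:
string tripleNewLine = Environment.NewLine + Environment.NewLine + Environment.NewLine;
string doubleNewLine = Environment.NewLine + Environment.NewLine;
while(sb.ToString().Contains(tripleNewLine))
{
    sb = sb.Replace(tripleNewLine, doubleNewLine);
}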
StringBuilder is a lot less flexible when it comes to searching and removing. It exists to speed up repeated concatenation, since building a large string with "string" + "another string" in a loop copies the text each time and gets costly.
I would suggest using .ToString() then Regex.Replace with a compiled regular expression with flags set to allow multiline.
You'll probably want a search pattern of:
(\n[\w-\n]*\n)
And you replace it with the empty string.
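As a sketch of the ToString-then-Regex.Replace route, here is one way it could look. The pattern is a different one from the pattern quoted above, chosen because it is easy to check by hand, and `BlankLineCollapser` / `Collapse` are invented names. It treats lines holding only spaces or tabs as blank, which is an assumption.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BlankLineCollapser
{
    // One line break followed by two or more further (possibly
    // whitespace-only) line breaks means at least two blank lines in a row.
    private static readonly Regex ExtraBlankLines =
        new Regex(@"(\r?\n)(?:[ \t]*\r?\n){2,}", RegexOptions.Compiled);

    // Keeps exactly one blank line wherever there were several.
    public static string Collapse(string text)
    {
        return ExtraBlankLines.Replace(text, "$1$1");
    }
}
```

With a StringBuilder named `sb`, something like `sb = new StringBuilder(BlankLineCollapser.Collapse(sb.ToString()))` puts the result back. Single blank lines are left untouched.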
Check out Expresso for a great .NET Regular expression tool.
I have an Excel spreadsheet being converted into a CSV file in C#, but am having a problem dealing with line breaks. For instance:
"John","23","555-5555"
"Peter","24","555-5
555"
"Mary,"21","555-5555"
When I read the CSV file, if a record does not start with a double quote (") then a line break is there by mistake and I have to remove it. I have some CSV reader classes from the internet but I am concerned that they will fail on the line breaks.
How should I handle these line breaks?
Thanks everybody very much for your help.
Here is what I've done so far. My records have a fixed format and all start with
JTW;...;....;...;
JTW;...;...;....
JTW;....;...;..
..;...;... (wrong record, line break inserted)
JTW;...;...
So I checked for the ; in the [3] position of each line. If true, I write; if false, I'll append on the last (removing the line-break)
I'm having problems now because I'm saving the file as a txt.
By the way, I am converting the Excel spreadsheet to csv by saving as csv in Excel. But I'm not sure if the client is doing that.
So the file as a TXT is perfect. I've checked the records and totals. But now I have to convert it back to csv, and I would really like to do it in the program. Does anybody know how?
Here is my code:
using System;
using System.IO;

namespace EditorCSV
{
    class Program
    {
        static void Main(string[] args)
        {
            ReadFromFile("c:\\source.csv");
        }

        static void ReadFromFile(string filename)
        {
            int i = 0;
            using (StreamReader SR = File.OpenText(filename))
            using (StreamWriter SW = File.CreateText("c:\\target.csv"))
            {
                // The first line is written as it is and is not counted.
                string S = SR.ReadLine();
                SW.Write(S);
                while ((S = SR.ReadLine()) != null)
                {
                    // Lines too short to have a character at index 3 are skipped.
                    if (S.Length <= 3)
                        continue;
                    if (S[3] == ';')
                    {
                        // ';' at index 3 ("JTW;...") marks the start of a new record.
                        SW.Write("\r\n" + S);
                        i = i + 1;
                    }
                    else
                    {
                        // Otherwise the line continues the previous record.
                        SW.Write(S);
                    }
                }
            }
            Console.WriteLine("Records Processed: " + i.ToString() + " .");
            Console.WriteLine("File Created Successfully");
            Console.ReadKey();
        }
    }
}
CSV has predefined ways of handling that. This site provides an easy to read explanation of the standard way to handle all the caveats of CSV.
Nevertheless, there is really no reason to not use a solid, open source library for reading and writing CSV files to avoid making non-standard mistakes. LINQtoCSV is my favorite library for this. It supports reading and writing in a clean and simple way.
Alternatively, this SO question on CSV libraries will give you the list of the most popular choices.
Rather than check if the current line is missing the (") as the first character, check instead to see if the last character is a ("). If it is not, you know you have a line break, and you can read the next line and merge it together.
I am assuming your example data was accurate - fields were wrapped in quotes. If quotes might not delimit a text field (or new-lines are somehow found in non-text data), then all bets are off!
There is a built-in method for reading CSV files in .NET (requires Microsoft.VisualBasic assembly reference added):
public static IEnumerable<string[]> ReadSV(TextReader reader, params string[] separators)
{
var parser = new Microsoft.VisualBasic.FileIO.TextFieldParser(reader);
parser.SetDelimiters(separators);
while (!parser.EndOfData)
yield return parser.ReadFields();
}
If you're dealing with really large files this CSV reader claims to be the fastest one you'll find: http://www.codeproject.com/Articles/9258/A-Fast-CSV-Reader
I've used this piece of code recently to parse rows from a CSV file (this is a simplified version):
private void Parse(TextReader reader)
{
var row = new List<string>();
var isStringBlock = false;
var sb = new StringBuilder();
long charIndex = 0;
int currentLineCount = 0;
while (reader.Peek() != -1)
{
charIndex++;
char c = (char)reader.Read();
if (c == '"')
isStringBlock = !isStringBlock;
if (c == separator && !isStringBlock) //end of word
{
row.Add(sb.ToString().Trim()); //add word
sb.Length = 0;
}
else if (c == '\n' && !isStringBlock) //end of line
{
row.Add(sb.ToString().Trim()); //add last word in line
sb.Length = 0;
//DO SOMETHING WITH row HERE!
currentLineCount++;
row = new List<string>();
}
else
{
if (c != '"' && c != '\r') sb.Append(c == '\n' ? ' ' : c);
}
}
row.Add(sb.ToString().Trim()); //add last word
//DO SOMETHING WITH LAST row HERE!
}
Try CsvHelper (a library I maintain). It ignores empty rows. I believe there is a flag you can set in FastCsvReader to have it handle empty rows also.
Heed the advice from the experts and Don't roll your own CSV parser.
Your first thought is, "How do I handle new line breaks?"
Your next thought is, "I need to handle commas inside of quotes."
Your next thought will be, "Oh, crap, I need to handle quotes inside of quotes. Escaped quotes. Double quotes. Single quotes..."
It's a road to madness. Don't write your own. Find a library with an extensive unit test coverage that hits all the hard parts and has gone through hell for you. For .NET, use the free CsvHelper library.
Maybe you could count for (") during the ReadLine(). If they are odd, that will raise the flag. You could either ignore those lines, or get the next two and eliminate the first "\n" occurrence of the merge lines.
What I usually do is read the text in character by character as opposed to line by line, due to this very problem.
As you're reading each character, you should be able to figure out where each cell starts and stops, but also the difference between a line break between rows and one inside a cell: if I remember correctly, for Excel-generated files anyway, rows end with \r\n, and newlines in cells are only \n.
There is an example parser in C# that seems to handle your case correctly. Then you can read your data in and purge the line breaks out of it post-read.
Part 2 is the parser, and there is a Part 1 that covers the writer portion.
Read the line.
Split into columns(fields).
If you have enough columns expected for each line, then process.
If not, read the next line, and capture the remaining columns until you get what you need.
Repeat.
A somewhat simple regular expression could be used on each line. When it matches, you process each field from the match. When it doesn't find a match, you skip that line.
The regular expression could look something like this.
Match match = Regex.Match(line, @"^(?:,?(?:(?<q>['""])(?<field>.*?)\k<q>|(?<field>[^,]*)))+$");
if (match.Success)
{
    foreach (Capture capture in match.Groups["field"].Captures)
    {
        string fieldValue = capture.Value;
        // Use the value.
    }
}
Have a look at FileHelpers Library
It supports reading\writing CSV with line breaks as well as reading\writing to excel
The LINQy solution:
string csvText = File.ReadAllText("C:\\Test.txt");
var query = csvText
.Replace(Environment.NewLine, string.Empty)
.Replace("\"\"", "\",\"").Split(',')
.Select((i, n) => new { i, n }).GroupBy(a => a.n / 3);
You might also check out my CSV parser SoftCircuits.CsvParser on NuGet. It will not only parse a CSV file but--if wanted--can also automatically map column values to your class properties. And it runs nearly four times faster than CsvHelper.
For a line break to exist in a CSV, there must be an open double quote that's not closed.
Assuming that all CSVs cells must open and close a double quote, just check if there's an odd number of quotation marks
my_string.Count(c => c == '"') % 2 == 1
and if that's the case, continue reading until you have the even number.
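The quote-counting idea could be sketched like this. `RecordJoiner` and `ReadRecords` are hypothetical names, and the sketch assumes every field is wrapped in double quotes and that the stray line break should simply be dropped, as in the question.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

public static class RecordJoiner
{
    // Keeps appending physical lines until the number of double quotes
    // seen so far is even, then emits the joined text as one record.
    public static IEnumerable<string> ReadRecords(TextReader reader)
    {
        var pending = new StringBuilder();
        int quoteCount = 0;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            pending.Append(line);
            quoteCount += line.Count(c => c == '"');
            if (quoteCount % 2 == 0)
            {
                // Balanced quotes: the record is complete (blank lines are skipped).
                if (pending.Length > 0) yield return pending.ToString();
                pending.Clear();
                quoteCount = 0;
            }
        }
        // Whatever is left has unbalanced quotes; hand it back for inspection.
        if (pending.Length > 0) yield return pending.ToString();
    }
}
```

Passing a `StreamReader` over the file to `RecordJoiner.ReadRecords` gives one string per logical record, which can then go to whichever CSV parser you use. A record that is genuinely missing a quote keeps the count odd and will swallow the lines after it, so it is worth logging any record that ends up suspiciously long.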