Using RegEx to read through a CSV file

Using RegEx to read through a CSV file - c#

I have a CSV file, with the following type of data:
0,'VT,C',0,
0,'C,VT',0,
0,'VT,H',0,
and I desire the following output
0
VT,C
0
0
C,VT
0
0
VT,H
0
Therefore splitting the string on the comma however ignoring the comma within quote marks. At the moment I'm using the following RegEx:
("(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)"
however this gives me the result of:
0
VT
C
0
0
C
VT
0
0
VT
H
0
This show the RegEx is not reading the quote mark properly. Can anyone suggest some alterations that might help?

Usually when it comes to CSV parsing, people use specific libraries well suited for the programming language they are using to code their application.
Anyway if you are going to use a regular expression to make a really loose(!) parsing you may try using something like this:
'(?<value>[^']*?)'
It will match anything in between single quotes, and assuming the csv file is well formed, it will not miss a field. Of course it doesn't accept embedded quotes but it easily gets the job done. That's what I use when I need to get the job done really quickly. Please don't consider it a complete solution to your problem...it just works in special conditions when the requirements are what you described and the input is well formed.
[EDIT]
I was checking again your question and noticed you want to include also non quoted fields...well ok in that case my expression will not work at all. Anyway listen...if you think hard about your problem, you'll find that's something quite difficult to solve without ambiguity. Because you need fixed rules and if you allow quoted and not quoted fields, the parser will have hard time figuring out legit commas as separator/quoted.
Another expression to model such a solution may be:
('[^']+'|[^,]+),?
It will match both quoted/notquoted fields...anyway I'm not sure if it needs to assume the csv HAS to adhere to strict conditions. That will work much safer then a split strategy as far as I can tell ... you just need to collect all matches and print the matched_value + \r\n on your target string.

This regex is based of the fact you have 1 digit before and after your 'value'
Regex.Replace(input, #"(?:(?<=\d),|,(?=\d))", "\n");
You can test it out on RegexStorm

foreach(var m in Regex.Matches(s,"(('.*?')|[0-9])"))

I have manages to get the following method to read the file as required:
public List<string> SplitCSV(string input, List<string> line)
{
Regex csvSplit = new Regex("(([^,^\'])*(\'.*\')*([^,^\'])*)(,|$)", RegexOptions.Compiled);
foreach (Match match in csvSplit.Matches(input))
{
line.Add(match.Value.TrimStart(','));
}
return line;
}
Thanks for everyone help though.

Related

C# Regex filter problems

At this moment in time, i posted something earlier asking about the same type of question regarding Regex. It has given me headaches, i have looked up loads of documentation of how to use regex but i still could not put my finger on it. I wouldn't want to waste another 6 hours looking to filter simple (i think) expressions.
So basically what i want to do is filter all filetypes with the endings of HTML extensions (the '*' stars are from a Winforms Tabcontrol signifying that the file has been modified. I also need them in IgnoreCase:
.html, .htm, .shtml, .shtm, .xhtml
.html*, .htm*, .shtml*, .shtm*, .xhtml*
Also filtering some CSS files:
.css
.css*
And some SQL Files:
.sql, .ddl, .dml
.sql*, .ddl*, .dml*
My previous question got an answer to filtering Python files:
.py, .py, .pyi, .pyx, .pyw
Expression would be: \.py[3ixw]?\*?$
But when i tried to learn from the expression above i would always end up with opening a .xhtml only, the rest are not valid.
For the HTML expression, i currently have this: \.html|.html|.shtml|.shtm|.xhtml\*?$ with RegexOptions.IgnoreCase. But the output will only allow .xhtml case sensitive or insensitive. .html files, .htm and the rest did not match. I would really appreciate an explanation to each of the expressions you provide (so i don't have to ask the same question ever again).
Thank you.

For such cases you may start with a simple regex that can be simplified step by step down to a good regex expression:
In C# this would basically, with IgnoreCase, be
Regex myRegex = new Regex("PATTERN", RegexOptions.IgnoreCase);
Now the pattern: The most easy one is simply concatenating all valid results with OR + escaping (if possible):
\.html|\.htm|\.shtml|\.shtm|\.xhtml|\.html*|\.htm*|\.shtml*|\.shtm*|\.xhtml*
With .html* you mean .html + anything, which is written as .*(Any character, 0-infinite times) in regex.
\.html|\.htm|\.shtml|\.shtm|\.xhtml|\.html.*|\.htm.*|\.shtml.*|\.shtm.*|\.xhtml.*
Then, you may take all repeating patterns and group them together. All file endings start with a dot and may have an optional end and ending.* always contains ending:
\.(html|htm|shtml|shtm|xhtml).*
Then, I see htm pretty often, so I try to extract that. Taking all possible characters before and after htm together (? means 0 or 1 appearance):
\.(s|x)?(htm)l?.*
And, I always check if it's still working in regexstorm for .Net
That way, you may also get regular expressions for the other 2 ones and concat them all together in the end.

Regex hangs trying to find match

I am trying to match an assignment string in VB code (as in I'm passing in text that is VB code into my program that's written in C#). The assignment string that I'm trying to match is something for example like
CustomClassInitializer(someParameter, anotherParameter, someOtherClassAsParameterWithInitialization()).SomeProperty = 7
and I realize that's rather complex, but it actually isn't far off from some of the real text I'm trying to match.
In order to do so I wrote a Regex. This Regex:
#"[\w,.]+\(([\w,.]*\(*,* *\)*)+ = "
which correctly matches. The problem is it becomes VERY slow (with timeouts), which I've researched and found is probably because of "backtracking". One of the suggested solutions to help with backtracking in general was to add "?>" to the regex, which I think would go in this position:
[\w,.]+\(?>([\w,.]*\(*,* *\)*)+ =
but this no longer matches properly.
I'm fairly new to Regex, so I imagine that there is a much better pattern. What is it please? Or how can I improve my times in general?
Helpful notes:
I'm only interested in position 0 of the string I'm searching for a
match in. My code is "if (isMatch && match.index == 0) { ... }. Can
I tell it to only check position 0 and if it's not a match move on?
The reason I use all the 0 or more things is the match could be as simple as CustomClass() = new CustomClass(), and as complicated as the above or perhaps a bit worse. I'm trying to get as many cases as possible.
This Regex is interested in "[\w,.]+(" and then "whatever may be inside the parentheses" (I tried to think of what all could be inside them based on the fact that it's valid VB code) until you get to the close parenthesis and then " = ". Perhaps I can use a wildcard for literally anything until it get's to ") = " in the string? - Like I said, fairly new to Regex.
Thanks in advance!

This seems to do what you want. Normally, I like to be more specific than .*, but it is working correctly. Note that I am using the Multi-line option.
^.*=\s*.+$
Here is a working example in RegExStorm.net example

Returning the regular expression match as part of a split (or equivalent functionality)

I am trying to parse through some log files and put them into a database for analysis. A single line looks something like this:
2012-09-30 17:16:27,213 [39] (boxes) ERROR Assembly.Places [(null)] - Error while displaying a thing
I have made a regular expression that works well for pulling out the date in front and breaking up the lines that way, but I lose the date itself. This is a pretty important bit of data, and I don't want to lose it!
I cannot just do this by \r\n, because some logs are fatal errors that include stack traces for the developers. Those, obviously, use \r\n to make them readable.
My current code looks like this for reference:
var logpath = Directory.GetFiles(#"C:\a\directory", "*.log");
foreach (var log in logpath)
{
var fileStream = new StreamReader(log);
var fileString = fileStream.ReadToEnd();
var records = Regex.Split(fileString, "[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}");
...
}

Split() will always remove the matched delimiter. The trick is not to match any actual text, but rather a position in the string.
This is done through zero-width look-ahead:
var datePattern = "^(?=[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3})";
var datePositions = new Regex(datePattern, RegexOptions.Multiline);
// ...
Regex.Split(fileString, datePositions);

You should match instead of splitting
This is the regex.Use singleLine Mode
([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3})(.*?)((?=[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}|$))
Group 1 contains date
Group 2 contains the required date
NOTE
The regex is conceptually like this.
(yourDate)(.*?yourdata)(?=till the other date|$)
Dont forget to use singlelineMode

Well, I'm not an expert on the subject but I did found this: Regex.Match.
From what I see you can receive the first match of the date format with a Match object
which has all kind of nice properties that put together you can probably cut the parts you want.
p.s. also exists a Regex.Matches which will return all matches in the file, might be easier for use.
Sorry I don't have time for to find a complete code example.
good day

Regex MatchCollection obj hangs\"Function evuluation timed out" after Regex.Matches

I'm kind of new too C#, and regular expression for that matter, but I've searched a couple of hours to find a solution too this problem so, hopefully this is easy for you guys:)
My application uses a regex to match email addresses in a given string,
then loops throu the matches.:
String EmailPattern = "\\w+([-+.]\\w+)*#\\w+([-.]\\w+)*\\.\\w+([-.]\\w+)*";
MatchCollection mcemail = Regex.Matches(rawHTML, EmailPattern);
foreach (Match memail in mcemail)
Works fine, but, when I downloaded the string from a certain page, http://www.sp.se/sv/index/services/quality/sidor/default.aspx, the MatchCollection(mcemail) object "hangs" the loop. When using a break point and accessing the object, I get "Function evuluation timed out" on everything(.Count etc).
Update
I've tried my pattern and other email patterns on the same string, everyone(regex desingers, python based web pages etc.) fails/timesout when trying too match this particular string.
How can I detect that the matchcollection obj is not "ready" to use?

If you can post the email that's causing the problem (perhaps anonymized in some way), that will give us more information, but I'm thinking the problem is this little guy right here:
([-.]\\w+)*\\.\\w+([-.]\\w+)*
To understand the problem, let's break that into groups:
([-.]\\w+)*
\\.\\w+
([-.]\\w+)*
The strings that will match \\.\\w+ are a subset of those that will match [-.]\\w+. So if part of your input looks like foo.bar.baz.blah.yadda.com, your regex engine has no way of knowing which group is supposed to match it. Does that make sense? So the first ([-.]\\w+)* could match .bar.baz.blah, then the \\.\\w+ could match .yadda, then the last ([-.]\\w+)* could match .com...
...OR the first clause could match .bar.baz, the second could match .blah, and the last could match .yadda.com. Since it doesn't know which one is right, it will keep trying different combinations. It should stop eventually, but that could still take a long time. This is called "catastrophic backtracking".
This issue is compounded by the fact that you're using capturing groups rather than non-capturing groups; i.e. ([-+.]\\w+) instead of (?:[-+.]\\w+). That causes the engine to try and separate and save whatever matches inside the parentheses for later reference. But as I explained above, it's ambiguous which group each substring belongs in.
You might consider replacing everything after the # with something like this:
\\w[-\\w]*\\.[-.\\w]+
That could use some refinement to make it more specific, but you get the general idea. Hope I explained all this well enough; grouping and backreferences are kind of tough to describe.
EDIT:
Looking back at your pattern, there's a deeper issue here, still related to the backtracking/ambiguity problem I mentioned. The clause \\w+([-.]\\w+)* is ambiguous all by itself. Splitting it into parts, we have:
\\w+
([-.]\\w+)*
Suppose you have a string like foobar. Where does the \\w+ end and the ([-.]\\w+)* begin? How many repetitions of ([-.]\\w+) are there? Any of the following could work as matches:
f(oobar)
foo(bar)
f(o)(oba)(r)
f(o)(o)(b)(a)(r)
foobar
etc...
The regex engine doesn't know which is important, so it will try them all. This is the same problem I pointed out above, but it means you have it in multiple places in your pattern.
Even worse, ([-.]\\w+)* is also ambiguous, because of the + after the \\w. How many groups are there in blah? I count 16 possible combinations: (blah), (b)(lah), (bl)(ah)...
The amount of different possible combinations is going to be huge, even for a relatively small input, so your engine is going to be in overdrive. I would definitely simplify it if I were you.

I just did a local test and it appears either the sheer document size or something in the ViewState causes the Regex match evaluation to time out. (Edit: I'm pretty sure it's the size, actually. Removing the ViewState just reduces the size significantly.)
An admittedly crude way to solve this would be something like this:
string[] rawHtmlLines = File.ReadAllLines(#"C:\default.aspx");
string filteredHtml = String.Join(Environment.NewLine,
rawHtmlLines.Where(line => !line.Contains("_VIEWSTATE")).ToArray());
string emailPattern = #"\w+([-+.]\w+)*#\w+([-.]\w+)*\.\w+([-.]\w+)*";
var emailMatches = Regex.Matches(filteredHtml, emailPattern);
foreach (Match match in emailMatches)
{
//...
}
Overall I suspect the email pattern is just not well optimised (or intended) to filter out emails in a large string but just used as validation for user input. Generally it might be a good idea to limit the string you search in to just the parts you are actually interested in and keep it as small as possible - for example by leaving out the ViewState which is guaranteed to not contain any readable email addresses.
If performance is important, it's probably also a better idea to create the filtered HTML using a StringBuilder and IndexOf (etc.) instead of splitting lines and LINQing up the result :)
Edit:
To further minimize the length of the string the Regex needs to check you could only include lines that contain the # character to begin with, like so:
string filteredHtml = String.Join(Environment.NewLine,
rawHtmlLines.Where(line => line.IndexOf('#') >= 0 && !line.Contains("_VIEWSTATE")).ToArray());

From "Function evaluation timed out", I'm assuming you're doing this in the debugger. The debugger has some fairly quick timeouts with regard to how long a method takes. Not eveything happens quickly. I would suggest going the operation in code, storing the result, then viewing that result in the debugger (i.e. let the call to Matches run and put a breakpoint after it).
Now, with regard to detecting whether the string will make Matches take a long time; that's a bit of a black art. You basically have to perform some sort of input validation. Just because you got some value from the internet, doesn't mean that value will work well with Matches. The ultimate validation logic is up to you; but, starting with the length of rawHtmlLines might be useful. (i.e. if the lenght is 1000000 bytes, Matches might take a while) But, you have to decide what to do if the length is too long; e.g give an error to the user.

Splitting string on commas when data can contain commas

I have a CSV file (which I didn't design and I can't change now nor will I ever be able to change it) that contains lines like the following:
"Surname, Firstname", yes, no, somestring, whatever, etc
As you can see here, the first , is not a comma on which I'd want to split the string. Notice that this particular comma is enclosed within the quotation marks.
Because of this, a simple string.split(',') obviously won't work, as it would give me an array of length 7 for the above string instead of 6.
Is there a way to get around this? I was thinking of using regex to split the string instead but I'm not competent enough in regex to think of a pattern that would only split on commas that are not enclosed inside quotation marks.
I can think of ugly, hacky ways to do it by reading each string char by char but this would have to be a last resort as I'm sure there's a better way to do it!

You can handle this easily by using the TextFieldParser class. Just set HasFieldsEnclosedInQuotes to true.

I would suggest using a CSV parser library - there are other cases that you wouldn't have thought of (new line as part of a quoted field).
The VisualBasic namespace has a nice library that can help - the TextFieldParser.

I know there's a lot of people here who think character-by-character comparisons should never be used and will strongly disagree with me but I'm not convinced companies like Microsoft aren't the only ones who should be doing that sort of programming.
Afterall, Split does character-by-character comparisons so why is it any less ugly when you call existing code that doesn't quite do exactly what you want?
At any rate, my approach was to write my own code. And I've posted the code online at http://www.blackbeltcoder.com/Articles/files/reading-and-writing-csv-files-in-c.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.