Single RegEx expressiong to decode CSV with embedded dobule quotes and Commas

Single RegEx expressiong to decode CSV with embedded dobule quotes and Commas - c#

I have lots of CSV data that I am trying to decode using regex. I am actually tried to build on an existing code base that other people/projects hit and dont want to risk breaking their data flows by refactoring the class too much. So, I was wondering if it is possible to decode this text with a single regex (which is how the class works currently):
f1,f2,f3,f4,f5,f6,f7
,"clean text","with,embedded,commas.","with""embedded""double""quotes",,"6.1",
First row is the header. If I save this as xxx.csv and open in Excel, it properly decompiles it to read (note the space between the fields are the cell breaks):
f1 f2 f3 f4 f5 f6 f7
clean text with,embedded,commas. with"embedded"double"quotes 6.1
But when I try this in .net, I get stuck on the regex. I have this:
string regExp = "(((?<x>(?=[,\\r\\n]+))|\"(?<x>([^\"]|\"\")+)\"|(?<x>[^,\\r\\n]+)),?)";
You can see it in action here:
http://ideone.com/hRq8xe
Which results in this:
<start>
clean text
with,embedded,commas.
with""embedded""double""quotes
6.1
<end>
This is very close but it does not replace the escaped double-double quotes with a single-double quote like Excel does. I could not come up with a regex that worked better. Can it be done?

Maybe you can somehow manage to match your string using regular-expression-conditionals with the following constructors:
if-then sentence(?(?=regex)then|else)
multiple if-then sentences(?(?=condition)(then1|then2|then3)|(else1|else2|else3))
I came up with the following pattern in order to match the body of your text: ([^\,]+(?(?=[^\,])([^\"]+")|([^\,]+,))), however, you will need to put an extra effort in order to create a completly matching expression for your text or end up using a file parser. If so, You can take a look at FileHelpers, a pretty neat library for parsing text files.
Sources:
Regular Expression Conditionals
Alternation Constructs in Regular Expressions

Related

Escape semicolon in interpolated string C#

I am writing data to .csv file and I need the below expression to be read correctly:
csvWriter.WriteLine($"Security: {sec.ID} ; Serial number: {sec.SecuritySerialNo}");
the semicolon in between is used to put the serial number in a separate cell.
The problem is that ID can also contain semicolons and mess up the data, therefore I need to escape it.
I have tried to use replace:
csvWriter.WriteLine($"Security: {sec.ID.Replace(";", "")} ; Serial number: {sec.SecuritySerialNo}");
though deleting semicolons is not what I want to achieve, I just want to escape them.

Let's emphasize again that the best way to create a CSV file is through a specialized CSV Parser library.
However, just to resolve the simple case presented by your question you could add double quotes around each field. This should be enough to explain to the Excel parser how to handle your fields.
So, export the fields in this way:
csvWriter.WriteLine($"\"Security: {sec.ID}\";\"Serial number: {sec.SecuritySerialNo}\"");
Notice that I have removed the blank space around the semicolon. It is important otherwise Excel will not parse the line correctly

C# Regex filter problems

At this moment in time, i posted something earlier asking about the same type of question regarding Regex. It has given me headaches, i have looked up loads of documentation of how to use regex but i still could not put my finger on it. I wouldn't want to waste another 6 hours looking to filter simple (i think) expressions.
So basically what i want to do is filter all filetypes with the endings of HTML extensions (the '*' stars are from a Winforms Tabcontrol signifying that the file has been modified. I also need them in IgnoreCase:
.html, .htm, .shtml, .shtm, .xhtml
.html*, .htm*, .shtml*, .shtm*, .xhtml*
Also filtering some CSS files:
.css
.css*
And some SQL Files:
.sql, .ddl, .dml
.sql*, .ddl*, .dml*
My previous question got an answer to filtering Python files:
.py, .py, .pyi, .pyx, .pyw
Expression would be: \.py[3ixw]?\*?$
But when i tried to learn from the expression above i would always end up with opening a .xhtml only, the rest are not valid.
For the HTML expression, i currently have this: \.html|.html|.shtml|.shtm|.xhtml\*?$ with RegexOptions.IgnoreCase. But the output will only allow .xhtml case sensitive or insensitive. .html files, .htm and the rest did not match. I would really appreciate an explanation to each of the expressions you provide (so i don't have to ask the same question ever again).
Thank you.

For such cases you may start with a simple regex that can be simplified step by step down to a good regex expression:
In C# this would basically, with IgnoreCase, be
Regex myRegex = new Regex("PATTERN", RegexOptions.IgnoreCase);
Now the pattern: The most easy one is simply concatenating all valid results with OR + escaping (if possible):
\.html|\.htm|\.shtml|\.shtm|\.xhtml|\.html*|\.htm*|\.shtml*|\.shtm*|\.xhtml*
With .html* you mean .html + anything, which is written as .*(Any character, 0-infinite times) in regex.
\.html|\.htm|\.shtml|\.shtm|\.xhtml|\.html.*|\.htm.*|\.shtml.*|\.shtm.*|\.xhtml.*
Then, you may take all repeating patterns and group them together. All file endings start with a dot and may have an optional end and ending.* always contains ending:
\.(html|htm|shtml|shtm|xhtml).*
Then, I see htm pretty often, so I try to extract that. Taking all possible characters before and after htm together (? means 0 or 1 appearance):
\.(s|x)?(htm)l?.*
And, I always check if it's still working in regexstorm for .Net
That way, you may also get regular expressions for the other 2 ones and concat them all together in the end.

How to ignore \t character inside double quotes using regex?

I am trying to parse a file using regex split, it works well with the '\t' character but some lines have the '\t' inside a field instead of acting as the delimiter.
Like :
G2226 TEST 1 C 29 Internal Head Office D Head Office ZZZ Unassigned 10910 10/10/2011 11/10/2011 10/10/2011 11/10/2011 "Test call Sort the customer out some data. See the customer again tomorrow to talk about Prod " Mr ABC Mr ABC Mr ABC Mr ABC Credit Requested BDM Call Internal Note 10
This part has 2 tabs I wish were ignored :
"Test call Sort the customer out some data. See the customer again tomorrow to talk about Prod\t\t"
The good thing is, they are included in double quotes, but I cannot work out how to ignore them, any ideas?
Edit:
My goal is to get 36 columns, some columns may come out more after a Regex.Split(lineString,'\t') using '\t' because they include '\t' characters inside some of the fields. I would like to ignore those ones. The one above comes out to 38 cols, which is rejected by my datatable as the header is only 36 cols, I would like to solve this problem.

If you have a simple CSV file, then regex split is usually the easiest way to process it.
However, if your CSV file contains more complex elements, such as quoted fields that contain separator characters or newlines, then this approach will no longer work. It is not a trivial matter to correctly parse these types of files, so you should use a library when possible.
The answers to this question give several options for C# libraries that can read a CSV file.

Regex is not the right tool for this.
You have basically a CSV format, it is "tab separated", not "comma separated", but it works exactly the same. So, find a CSV parser and use that - the separation character is usually configurable.

If you really need a regular expression, you can try something like this:
(?!\t")\t(?!\t")

Is it possible to use Regex to extract text from attributes repeated in a text file - c# .NET

I am working something at the moment and need to extract an attribute from a big list tags, they are formatted like this:
<appid="928" appname="extractapp" supportemail="me#mydomain.com" /><appid="928" appname="extractapp" supportemail="me#mydomain.com" />
The tags are repeated one after another and all have different appid, appname, supportemail.
I need to just extract all of the support emails, just the email address, without the supportemail=
Will I need to use two regex statements, one to seperate each individual tag, then loop through the result and pull out the emails?
I would then go through and Add the emails to a list, then loop through the list and write each one to a txt file, with a comma after it.
I've never really used Regex too much, so don't know if it's suitable for the above?
I would spend more time trying it myself but it's quite urgent. So hopefully somebody can help.

Have you considered Linq to XML?
http://www.hookedonlinq.com/LINQtoXML5MinuteOverview.ashx

Using XML is better, perhaps, but here's the regular expression you'd use (in case there's a particular reason you need/want to use regular expressions to read XML):
(appid="(?<AppID>[^"]+)" appname="(?<AppName>[^"]+)" supportemail="(?<SupportEmail>[^"]+)")
You can just take the last bit there for the support email but this will extract all of the attributes you mentioned and they will be "grouped" within each tag.

What about modify the string to have proper xml format and load xml to extract all the values of supportemail attribute?

Use
string pattern = "supportemail=\"([^\"]+)";
MatchCollection matches = Regex.Matches(inputString, pattern);
foreach(Match m in matches)
Console.WriteLine(m.Groups[1].Value);
See it here.

Problems you'll encounter by using regular expressions instead of an XML DOM:
All of the example regexes posted thus far will fail in the extremely common case that the attribute values are delimited by single quotes.
Any regex that depends on the attributes appearing in a specific order (e.g. appId before appName) will fail in the event that attributes - whose ordering is insignificant to XML - appear in an order different from what the regex expects.
A DOM will resolve entity references for you and a regex will not; if you use regex, you must check the returned values for (at least) the XML character entitites &, &apos;, >, <, and ".
There's a well-known edge case where using regular expressions to parse XML and XHTML unleashes the Great Old Ones. This will complicate your task considerably, as you will be reduced to gibbering madness and then the Earth will be eaten.

Regular expression to define format of backup filenames

In the application I am currently working on, I have an option to create automatic backups of a certain file on the hard disk. What I would like to do is offer the user the possibility to configure the name of the file and its extension.
For example, the backup filename could be something like : "backup_month_year_username.bak". I had the idea to save the format in the form of a regular expression. For the example above, the regexp would look like :
"^backup_(?<Month>\d{2})_(?<Year>\d{2})_(?<Username>\w).(?<extension>bak)$"
I thought about using regex because I will also have to browse through the directory of backuped files to delete those older than a certain date. The main trouble I have now is how to create a filename using the regex. In a way I should replace the tags with the information. I could do that using regex.replace and another regex, but I feel it's a big weird doing that and it might be a better way.
Thanks
[Edit] Maybe I wasn't really clear in the first go, but the idea is of course that the user (in this case an admin that will know regex syntax) will have the possibility to modify the form of the filename, that's all the idea behind it[/Edit]

... and if the regex changes, it is next to impossible to reconstruct a string from a given regex.
Edit:
Create some predefined "place-holders": %u could be the user's name, %y could be the year, etc.:
backup_%m_%y_%u.bak
and then simple replace the %? with their actual values.

It sounds like you're trying to use the regular expression to create the file name from a pattern which the user should be able to specify.
Regular expressions can - AFAIK - not be used to create output, but only to validate input, so you'd have the user specify two things:
a file name production pattern like Bart suggested
a validation pattern in form of a regular expression that helps you split the file names into their parts
EDIT
By the way, your sample regex contains an error: The "." is use for "any character", also \w only matches one word character, so I guess you meant to write
"^backup_(?<Month>\d{2})_(?<Year>\d{2})_(?<Username>\w+)\.(?<extension>bak)$"

If the filename is always in this form, there is no reason for a regex, as it's easier to process with string.Split ...

With Bart's solution it is easy enough to split (using string.Split) the generated file name using underscore as the delimiter, to get back the information.

Ok, I think I have found a way to use only the regex. As I am using groups to get the information, I will use another regular expression to match the regular expression and replace the groups with the value:
Regex rgx = new Regex("\(\?\<Month\>.+?\)");
rgx.Replace("^backup_(?<Month>\d{2})_(?<Year>\d{2})_(?<Username>\w+)\.(?<extension>bak)$"
, DateTime.Now.Month.ToString());
Ok, it's really a hack, but at least it works and I have only one pattern defined by the user. It might not work if the regex is too complex, but I think I can deal with that problem.
What do you think?

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Single RegEx expressiong to decode CSV with embedded dobule quotes and Commas - c#

Related

Escape semicolon in interpolated string C#

C# Regex filter problems

How to ignore \t character inside double quotes using regex?

Is it possible to use Regex to extract text from attributes repeated in a text file - c# .NET

Regular expression to define format of backup filenames

Categories

Resources