Extract data from a text file using a RegEx

I am trying to write a RegEx to extract data from a file.
The file looks like the following:
"a123 100 Start"
"a123 101 Today"
"a123 101 Tomorrow"
"a123 102 End"
The file contains multiple lines of records just like the one above. In each line of the file there is a code on a fixed position (100 - start of record, 101 - record detail, 102 - end of record). I would like to extract from that file a structure like a List<List<string>> where the outer list will store all the groups of records that are in the file.
My first approach was to parse this file with a foreach, but I think there should be a way to achieve this with a RegEx. Since I would like to expand my RegEx knowledge, I think this is a great example for me.
Is it possible for such data to be parsed with a RegEx? If so, can someone help out with the RegEx itself?
Thanks!

If your file has this specific structure, you do not need Regex. Just call Split(' ') on each line and work with the resulting array.
Regex carries a performance cost compared with a plain string split.
But if you would like to use Regex anyway, you can use Regex.Match(line, @"\S+ \S+ \S+") for this file structure.
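To get the List<List<string>> the question asks for, the lines still have to be grouped by their code after they are matched. Here is a minimal sketch of that step. It assumes every line has the layout "<id> <code> <text>", that a record always opens with 100 and closes with 102, and that the surrounding double quotes in the sample may or may not be part of the file. RecordGrouper and the group names are hypothetical, not from the question.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RecordGrouper
{
    // Assumed layout: <id> <code> <text>, where code is 100, 101 or 102.
    private static readonly Regex LinePattern =
        new Regex(@"^(?<id>\S+) (?<code>10[012]) (?<text>.*)$");

    public static List<List<string>> Group(IEnumerable<string> lines)
    {
        var groups = new List<List<string>>();
        List<string> current = null;

        foreach (var raw in lines)
        {
            // Drop surrounding whitespace and quotes, if the file has them.
            var line = raw.Trim().Trim('"');
            var match = LinePattern.Match(line);
            if (!match.Success) continue;            // skip lines that do not fit

            var code = match.Groups["code"].Value;
            if (code == "100") current = new List<string>();  // start of record
            if (current == null) continue;           // detail or end without a start

            current.Add(line);
            if (code == "102")                       // end of record
            {
                groups.Add(current);
                current = null;
            }
        }
        return groups;
    }
}
```

Calling RecordGrouper.Group(File.ReadLines(path)) would give one inner list per 100...102 block. A record that never reaches its 102 line is dropped in this sketch.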

Related

Single RegEx expression to decode CSV with embedded double quotes and commas

I have lots of CSV data that I am trying to decode using regex. I am trying to build on an existing code base that other people/projects hit, and I don't want to risk breaking their data flows by refactoring the class too much. So, I was wondering if it is possible to decode this text with a single regex (which is how the class works currently):
f1,f2,f3,f4,f5,f6,f7
,"clean text","with,embedded,commas.","with""embedded""double""quotes",,"6.1",
The first row is the header. If I save this as xxx.csv and open it in Excel, it is parsed correctly and reads as follows (the spaces between the fields are the cell breaks):
f1 f2 f3 f4 f5 f6 f7
clean text with,embedded,commas. with"embedded"double"quotes 6.1
But when I try this in .net, I get stuck on the regex. I have this:
string regExp = "(((?<x>(?=[,\\r\\n]+))|\"(?<x>([^\"]|\"\")+)\"|(?<x>[^,\\r\\n]+)),?)";
You can see it in action here:
http://ideone.com/hRq8xe
Which results in this:
<start>
clean text
with,embedded,commas.
with""embedded""double""quotes
6.1
<end>
This is very close but it does not replace the escaped double-double quotes with a single-double quote like Excel does. I could not come up with a regex that worked better. Can it be done?

You may be able to match your string using regular expression conditionals, which have the following forms:
if-then-else: (?(?=regex)then|else)
multiple alternatives: (?(?=condition)(then1|then2|then3)|(else1|else2|else3))
I came up with the following pattern to match the body of your text: ([^\,]+(?(?=[^\,])([^\"]+")|([^\,]+,))). However, you will need to put in extra effort to create an expression that matches your text completely, or end up using a file parser. If so, you can take a look at FileHelpers, a pretty neat library for parsing text files.
Sources:
Regular Expression Conditionals
Alternation Constructs in Regular Expressions
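A regex match only captures text, it does not rewrite it, so the doubled quotes have to be collapsed in a step after matching. Here is a minimal sketch that keeps the pattern from the question unchanged and un-doubles the quotes in each quoted field. CsvFields and Parse are hypothetical names, and the sketch assumes one record per input line.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CsvFields
{
    // The pattern from the question, unchanged.
    private static readonly Regex FieldPattern = new Regex(
        "(((?<x>(?=[,\\r\\n]+))|\"(?<x>([^\"]|\"\")+)\"|(?<x>[^,\\r\\n]+)),?)");

    public static List<string> Parse(string line)
    {
        var fields = new List<string>();
        foreach (Match m in FieldPattern.Matches(line))
        {
            string value = m.Groups["x"].Value;

            // Only quoted fields use "" as an escape for a literal quote.
            bool quoted = m.Value.Length > 0 && m.Value[0] == '"';
            if (quoted)
                value = value.Replace("\"\"", "\"");

            fields.Add(value);
        }
        return fields;
    }
}
```

Like the original pattern, this does not report an empty field after the final comma of a line, so that case would need separate handling.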

Encoding Special Characters Before Processing Invalid XML

I have some invalid XML from a vendor that I need to process. Here is an example:
<a>foo</a>
<b>bar</b>
<c>foobar is < $15</c>
So, we have a few problems. First, there is no root document. I overcome that by adding a root document. No problem. The second, and more difficult problem, is the less than symbol. I can just encode the whole thing but it will encode the XML tags. Is there a library or simple method out there somewhere for handling this? I really don't want to reinvent the wheel as I'm sure hundreds of people have dealt with "quasi-XML" like this. Appreciate any help.

I would read the file line by line and use a regex to get the values between the nodes. Your example doesn't have nested elements, so this is pretty easy. While reading line by line, you can encode the inner values. The named capture group (?<xml>.*?) will get everything between the nodes into the group named xml.
var regex = "<.*?>(?<xml>.*?)</.*?>";
var badXML = Regex.Match(line, regex, RegexOptions.IgnoreCase).Groups["xml"].Value;
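The encoding step itself could look roughly like the sketch below. It assumes each line holds exactly one non-nested element, as in the example, and that the inner text is not already encoded (an existing &amp; would be encoded a second time). QuasiXml and EscapeLine are hypothetical names; SecurityElement.Escape is the framework method that replaces <, >, &, " and ' with their XML entities.

```csharp
using System.Security;
using System.Text.RegularExpressions;

public static class QuasiXml
{
    // Group 1: opening tag, group 2: inner text, group 3: closing tag.
    private static readonly Regex ElementLine =
        new Regex(@"^(\s*<[^<>/]+>)(.*)(</[^<>]+>\s*)$");

    public static string EscapeLine(string line)
    {
        // Lines that do not look like <tag>text</tag> are returned unchanged.
        return ElementLine.Replace(line, m =>
            m.Groups[1].Value +
            SecurityElement.Escape(m.Groups[2].Value) +
            m.Groups[3].Value);
    }
}
```

Each line would go through EscapeLine before the root element is wrapped around the result and the whole text is handed to an XML parser.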

How to ignore \t character inside double quotes using regex?

I am trying to parse a file using Regex.Split. It works well with the '\t' character as the delimiter, but some lines have a '\t' inside a field instead.
Like :
G2226 TEST 1 C 29 Internal Head Office D Head Office ZZZ Unassigned 10910 10/10/2011 11/10/2011 10/10/2011 11/10/2011 "Test call Sort the customer out some data. See the customer again tomorrow to talk about Prod " Mr ABC Mr ABC Mr ABC Mr ABC Credit Requested BDM Call Internal Note 10
This part has 2 tabs I wish were ignored :
"Test call Sort the customer out some data. See the customer again tomorrow to talk about Prod\t\t"
The good thing is, they are included in double quotes, but I cannot work out how to ignore them, any ideas?
Edit:
My goal is to get 36 columns. Splitting with Regex.Split(lineString, "\t") can produce more, because some fields contain '\t' characters themselves. I would like those tabs to be ignored. The line above comes out at 38 columns and is rejected by my DataTable, whose header has only 36 columns.

If you have a simple CSV file, then regex split is usually the easiest way to process it.
However, if your CSV file contains more complex elements, such as quoted fields that contain separator characters or newlines, then this approach will no longer work. It is not a trivial matter to correctly parse these types of files, so you should use a library when possible.
The answers to this question give several options for C# libraries that can read a CSV file.

Regex is not the right tool for this.
You have basically a CSV format, it is "tab separated", not "comma separated", but it works exactly the same. So, find a CSV parser and use that - the separation character is usually configurable.

If you really need a regular expression, you can try something like this:
(?!\t")\t(?!\t")
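If a library is not an option, a small hand-written splitter is another route. Note that this is not a regex: it walks the line once and treats a tab as a delimiter only while it is outside double quotes. The sketch assumes quotes are balanced and never escaped inside a field; TabSplitter is a hypothetical name.

```csharp
using System.Collections.Generic;
using System.Text;

public static class TabSplitter
{
    public static List<string> Split(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        foreach (char c in line)
        {
            if (c == '"')
            {
                inQuotes = !inQuotes;   // entering or leaving a quoted field
                current.Append(c);      // the quote itself is kept in the field
            }
            else if (c == '\t' && !inQuotes)
            {
                fields.Add(current.ToString());  // tab outside quotes ends the field
                current.Clear();
            }
            else
            {
                current.Append(c);      // includes tabs inside quotes
            }
        }

        fields.Add(current.ToString()); // last field has no trailing tab
        return fields;
    }
}
```

The tabs inside the quoted text stay in that field, so a line like the one in the question should come out with the expected number of columns.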

Using RegEx to read through a CSV file

I have a CSV file, with the following type of data:
0,'VT,C',0,
0,'C,VT',0,
0,'VT,H',0,
and I desire the following output
0
VT,C
0
0
C,VT
0
0
VT,H
0
Therefore splitting the string on the comma however ignoring the comma within quote marks. At the moment I'm using the following RegEx:
("(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)"
however this gives me the result of:
0
VT
C
0
0
C
VT
0
0
VT
H
0
This shows that the RegEx is not reading the quote marks properly. Can anyone suggest some alterations that might help?

Usually when it comes to CSV parsing, people use specific libraries well suited for the programming language they are using to code their application.
Anyway if you are going to use a regular expression to make a really loose(!) parsing you may try using something like this:
'(?<value>[^']*?)'
It will match anything in between single quotes, and assuming the csv file is well formed, it will not miss a field. Of course it doesn't accept embedded quotes but it easily gets the job done. That's what I use when I need to get the job done really quickly. Please don't consider it a complete solution to your problem...it just works in special conditions when the requirements are what you described and the input is well formed.
[EDIT]
I checked your question again and noticed you also want to include unquoted fields. In that case my expression will not work at all. If you think hard about your problem, you'll find it is quite difficult to solve without ambiguity: you need fixed rules, and if you allow both quoted and unquoted fields, the parser will have a hard time telling separator commas from commas inside quoted values.
Another expression to model such a solution may be:
('[^']+'|[^,]+),?
It will match both quoted and unquoted fields, although I'm not sure whether it needs to assume the CSV adheres to strict conditions. As far as I can tell, it will be much safer than a split strategy. You just need to collect all matches and append matched_value + \r\n to your target string.
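Collecting the matches could look like the sketch below. It uses a variant of the expression above with two named groups, so the text inside the quotes can be read without the quote marks. It assumes single-quoted fields with no embedded quotes, and it skips empty fields, which matches the output the question asks for. QuotedCsv and the group names q and u are hypothetical.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuotedCsv
{
    // q: text inside single quotes, u: an unquoted field.
    private static readonly Regex Field =
        new Regex(@"'(?<q>[^']*)'|(?<u>[^,]+)");

    public static List<string> Split(string line)
    {
        var result = new List<string>();
        foreach (Match m in Field.Matches(line))
        {
            // Prefer the quoted capture; fall back to the unquoted one.
            result.Add(m.Groups["q"].Success
                ? m.Groups["q"].Value
                : m.Groups["u"].Value);
        }
        return result;
    }
}
```

For the line 0,'VT,C',0, this gives the three values 0, VT,C and 0.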

This regex is based on the fact that you have one digit before and after your 'value':
Regex.Replace(input, @"(?:(?<=\d),|,(?=\d))", "\n");
You can test it out on RegexStorm

foreach(var m in Regex.Matches(s,"(('.*?')|[0-9])"))

I have managed to get the following method to read the file as required:
public List<string> SplitCSV(string input, List<string> line)
{
    Regex csvSplit = new Regex("(([^,^\'])*(\'.*\')*([^,^\'])*)(,|$)", RegexOptions.Compiled);
    foreach (Match match in csvSplit.Matches(input))
    {
        // Each match ends with its separator, so the comma is trimmed from the end.
        line.Add(match.Value.TrimEnd(','));
    }
    return line;
}
Thanks for everyone's help, though.

Deleting line breaks in a text file with a condition

I have a text file. Some of the lines in it end with lf and some end with crlf. I only need to delete lfs and leave all crlfs.
Basically, my file looks like this
Mary had a lf
dog.crlf
She liked her lf
dog very much. crlf
I need it to be
Mary had a dog.crlf
She liked her dog very much.crlf
Now, I tried just deleting all lfs unconditionally, but then I couldn't figure out how to write it back into the text file. If I use File.WriteAllLines and put a string array into it, it automatically creates line breaks all over again. If I use File.WriteAllText, it just forms one single line.
So the question is - how do I take a text file like the first and make it look like the second? Thank you very much for your time.
BTW, I checked similar questions, but still have trouble figuring it out.

Use regex with a negative look-behind and only replace the \n not preceded by a \r:
var result = Regex.Replace(sampleFileContent, @"(?<!\r)\n", String.Empty);
The (?<! ... ) is a negative look-behind. It means that we only want to replace instances of \n when there isn't a \r directly behind it.
Disclaimer: How viable this is depends on the size of your file(s). It is a good solution if you're not concerned with overhead or you're doing some quick fixes, but I'd look into a more robust parser if the files are going to be huge.
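For the part of the question about writing the result back, here is a minimal sketch that reads the whole file, applies the same look-behind replacement, and saves it. File.WriteAllText writes the string exactly as given, so the remaining \r\n pairs are kept as line breaks. LineFeedCleaner and the path parameters are hypothetical, and the whole file is held in memory.

```csharp
using System.IO;
using System.Text.RegularExpressions;

public static class LineFeedCleaner
{
    // Removes every \n that is not directly preceded by \r.
    public static string RemoveBareLineFeeds(string text)
    {
        return Regex.Replace(text, @"(?<!\r)\n", string.Empty);
    }

    public static void CleanFile(string inputPath, string outputPath)
    {
        string content = File.ReadAllText(inputPath);
        File.WriteAllText(outputPath, RemoveBareLineFeeds(content));
    }
}
```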

This is an alternative to Brad Christie's answer, which doesn't use Regex.
String result = sampleFileContent.Replace("\r\n", "**newline**")
                                 .Replace("\n", "")
                                 .Replace("**newline**", "\r\n");
Here's a demo. Seems faster than the regex solution according to this site, but uses a bit more memory.

Just tested it:
string file = File.ReadAllText("test.txt");
file = file.Replace("\r", "");
File.WriteAllText("test_replaced.txt", file);
