Regex for Information Extraction from CS File


Below is a snapshot of lines from a C# class file (.cs). I'm trying to extract the mandatory and supported fields from it.
1) Is there a way to dynamically load the .cs file into a .NET application and extract this information, starting from just the file path?
2) Following on from the question above, I'm currently resorting to regular expressions to extract the information.
First Regex - (m_oSupportedFields.).+?(?=EnumSupported.Mandatory;|EnumSupported.Supported)
Second Regex - (..+)\=
What I'm trying to achieve is to extract Personal.Forename, Personal.Surname and the other fields with a Regex (one Regex for EnumSupported.Mandatory, and one for EnumSupported.Supported).
I'm also trying to cater for malformed lines such as
m_oSupportedFields.Personal.DOB.Day.Supported=EnumSupported.Supported;
(Note there is no space around the equals sign)
or
m_oSupportedFields.Personal.DOB.Day.Supported  =  EnumSupported.Supported;
(Note the double spaces around the equals sign)
or even
m_oSupportedFields.Personal.Surname.Supported =
EnumSupported.Mandatory;
(Note the enum value is on the second line)
Please advise on how I should write the Regex for such situations.
Thanks.
Update: textual version of the snapshot
m_oSupportedFields.Personal.Surname.Supported = EnumSupported.Mandatory;
m_oSupportedFields.Personal.Forename.Supported = EnumSupported.Mandatory;
m_oSupportedFields.Personal.MiddleName.Supported = EnumSupported.Supported;
m_oSupportedFields.Personal.DOB.Day.Supported = EnumSupported.Supported;
m_oSupportedFields.Personal.DOB.Month.Supported = EnumSupported.Supported;
m_oSupportedFields.Personal.DOB.Year.Supported = EnumSupported.Supported;

So from each line, you want to extract the part after m_oSupportedFields. and before .Supported =, as well as the part after the =. You want to allow only plain spaces before the =, but any whitespace (including a line break) after it.
Your regular expression will be: ^m_oSupportedFields\.([\w\.]+)\.Supported *=\s*(EnumSupported\.\w+);
By default ^ matches only at the very start of the input. Pass RegexOptions.Multiline to make it match at the start of every line, or omit the ^ if the text does not have to begin at the start of a line.
Using C#, you can access the match groups like this:
using System;
using System.Text.RegularExpressions;

string regex = @"^m_oSupportedFields\.([\w\.]+)\.Supported *=\s*(EnumSupported\.\w+);";
string input = @"m_oSupportedFields.Personal.DOB.Day.Supported=EnumSupported.Supported;";
foreach (Match m in Regex.Matches(input, regex))
{
    // Groups[0] is the whole match; Groups[1] and Groups[2] are the two capture groups.
    Console.WriteLine(m.Groups[1].Value);
    Console.WriteLine(m.Groups[2].Value);
}
// Console:
// Personal.DOB.Day
// EnumSupported.Supported
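To apply the same idea to a whole file and separate Mandatory from Supported fields, something along these lines might work. This is only a sketch: the class and method names (SupportedFieldScanner, Scan) are made up for illustration, the pattern is a variation that drops the ^ and allows any whitespace on both sides of the =, and in real use the source text would presumably come from File.ReadAllText(path).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class SupportedFieldScanner
{
    // Group 1: field path (e.g. Personal.DOB.Day); group 2: enum member name.
    // \s* on both sides of '=' tolerates missing spaces, extra spaces and line breaks.
    private static readonly Regex Assignment = new Regex(
        @"m_oSupportedFields\.([\w.]+)\.Supported\s*=\s*EnumSupported\.(\w+)\s*;");

    // Returns enum member name -> field paths, in the order they appear in the source.
    public static Dictionary<string, List<string>> Scan(string source)
    {
        var result = new Dictionary<string, List<string>>();
        foreach (Match m in Assignment.Matches(source))
        {
            string field = m.Groups[1].Value;
            string kind = m.Groups[2].Value;
            if (!result.TryGetValue(kind, out var list))
            {
                list = new List<string>();
                result[kind] = list;
            }
            list.Add(field);
        }
        return result;
    }

    public static void Main()
    {
        // Assumed sample input covering the split-line and no-space cases.
        string source =
            "m_oSupportedFields.Personal.Surname.Supported =\n" +
            "    EnumSupported.Mandatory;\n" +
            "m_oSupportedFields.Personal.DOB.Day.Supported=EnumSupported.Supported;\n";

        foreach (var pair in Scan(source))
            Console.WriteLine(pair.Key + ": " + string.Join(", ", pair.Value));
    }
}
```

Grouping by the captured enum member avoids maintaining two nearly identical patterns, one per enum value.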

1) Is there a way for me to dynamically load the cs file into the .NET application and extract the information out, starting just by loading cs file from file path?
Possibly. The .NET Compiler Platform ("Roslyn", the compiler as a service), which is used by VS2015, exposes the C# parser as a library. Look into creating a stand-alone code analysis tool with it.
extract Personal.Forename, Personal.Surname and other fields by a Regex (one Regex for EnumSupported.Mandatory, and one for EnumSupported.Supported).
A pattern can be very general or very specific about what it captures. The more general the pattern, the more complex it becomes, along with the supporting code needed to extract the data.
Capture into Enumerable Dynamic Entities
This is a specific pattern that takes the results and places them into a LINQ sequence of anonymous-type entities. Note that it handles the possible line split.
string data = @"
m_oSupportedFields.Personal.Surname.Supported =
EnumSupported.Mandatory;
m_oSupportedFields.Personal.Forename.Supported=EnumSupported.Mandatory;
m_oSupportedFields.Personal.MiddleName.Supported = EnumSupported.Supported;
m_oSupportedFields.Personal.DOB.Day.Supported = EnumSupported.Supported;
m_oSupportedFields.Personal.DOB.Month.Supported = EnumSupported.Supported;
m_oSupportedFields.Personal.DOB.Year.Supported = EnumSupported.Supported;
";
string pattern = @"
Personal\. # Anchor for match
(?<Full> # Grouping for Or condition
(?<Name>[^.]+) # Just the name
| # Or
(?<Combined>[^.]+\.[^.]+) # Name/subname
) # End Or Grouping
(?=\.Supported) # Look ahead to anchor to Supported (does not capture)
\.Supported
\s*= # Possible whitespace and =
[\s\r\n]*EnumSupported\.
(?<SupportType>Mandatory|Supported) # Capture support type";
// RegexOptions.IgnorePatternWhitespace lets us comment the pattern instead of writing
// it on one line. Unescaped whitespace and # comments in the pattern are ignored.
// Requires: using System.Linq; using System.Text.RegularExpressions;
var fields = Regex.Matches(data, pattern, RegexOptions.IgnorePatternWhitespace)
    .OfType<Match>()
    .Select(mt => new
    {
        FullName = mt.Groups["Full"].Value,
        IsName = mt.Groups["Name"].Success,
        IsCombined = mt.Groups["Combined"].Success,
        Type = mt.Groups["SupportType"].Value
    });
Each result records whether the extracted name has a single part (Forename) or two parts (DOB.Day). Either form is captured by the named group "Full" (exposed as FullName), and the success of the "Name" and "Combined" groups provides the "is-a" booleans IsName and IsCombined.

Related

How to structure REGEX in C#

I currently have a regex that checks whether a US state is spelled correctly:
var r = new Regex(string.Format(@"\b(?:{0})\b", pattern), RegexOptions.IgnoreCase);
pattern is a pipe-delimited string containing all US states.
It was working as intended until today, when one of the states was spelled like "Florida..". I would have liked it to pick up the fact that there was a full stop character.
I found this regex that will only match letters:
^[a-zA-Z]+
How do I combine this with my current Regex, or is it not possible?
I tried some variations of this but it didn't work:
var r = new Regex(string.Format(@"\b^[a-zA-Z]+(?:{0})\b", pattern), RegexOptions.IgnoreCase);
EDIT: Florida.. was in my input string. My pattern string hasn't changed at all. Apologies for not being clearer.

It seems you need start of string (^) and end of string ($) anchors:
var r = new Regex(string.Format(@"^(?:{0})$", pattern), RegexOptions.IgnoreCase);
The regex above would match any string comprising a name of a state only.

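You should escape the regex special characters in the pattern variable. One of them is the . character. Something similar to pattern.Replace(".", @"\.") would do it, but it has to cover all the special characters. Regex.Escape does this, but apply it to each state name rather than to the whole pipe-delimited string, because it escapes | as well.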
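A rough sketch of how escaping and anchoring could be combined is shown here. It assumes the state names are available as separate strings rather than as one pipe-delimited string; StateValidator and Build are illustrative names only.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not taken from any of the answers.
public static class StateValidator
{
    // Builds ^(?:name1|name2|...)$ where every name is matched literally.
    public static Regex Build(params string[] names)
    {
        // Regex.Escape neutralizes metacharacters (., +, spaces, ...) in each name.
        string alternation = string.Join("|", names.Select(Regex.Escape));
        return new Regex("^(?:" + alternation + ")$", RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        Regex r = Build("Florida", "New Mexico", "Colorado");
        Console.WriteLine(r.IsMatch("florida"));     // True
        Console.WriteLine(r.IsMatch("Florida.."));   // False: trailing dots are rejected
    }
}
```

With the ^ and $ anchors the whole input has to be a state name, which is what rejects "Florida..".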

I believe you can't merge both patterns into one, so you would have to perform two different regex operations: one to split the states into a list, and a subsequent one to validate each item in it.
I'd rather go for something "simpler" such as
var states = input.Split('|').Select(s => new string(s.Where(char.IsLetter).ToArray()))
.Where(s => !string.IsNullOrWhiteSpace(s));

Basically, don't use a regex here.
// Requires: using System; using System.Collections.Generic; using System.Linq;
List<string> values = new List<string>() { "florida" /* , ...remaining states */ };
string input = "Florida..";
// Is any known state name contained in the input? The comparison ignores case.
bool correct = values.Any(a =>
    input.IndexOf(a, StringComparison.CurrentCultureIgnoreCase) >= 0);
This should be considerably more efficient than a regex-based option. It matches florida, Florida, Florida..., etc.

Rather than searching for the names directly, tell the regex to consume everything that is not one of the unwanted characters, for example [^\|.]+. The set [ ] with the negation indicator ^ means "consume anything that is not a literal | or .", so each match contains just the text needed. For example, running it on
Colorado|Florida..|New Mexico
returns 3 matches: Colorado, Florida and New Mexico.
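The negated-set approach above could be tried out with a short program such as this one. NameSplitter and Names are names invented for the sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper illustrating the negated character class.
public static class NameSplitter
{
    // Each match is a maximal run of characters that are neither '|' nor '.'.
    public static List<string> Names(string input)
    {
        return Regex.Matches(input, @"[^|.]+")
            .Cast<Match>()
            .Select(m => m.Value)
            .ToList();
    }

    public static void Main()
    {
        foreach (string name in Names("Colorado|Florida..|New Mexico"))
            Console.WriteLine(name);
        // Colorado
        // Florida
        // New Mexico
    }
}
```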

c# Regex: find placeholders as substring

I have the following string:
"hello [#NAME#]. nice to meet you. I heard about you via [#SOURCE#]."
The text above contains two placeholders, NAME and SOURCE.
I want to extract these substrings using a regex.
What would be the regex pattern to find the list of these placeholders?
I tried
string pattern = @"\[#(\w+)#\]";
Result:
hello
NAME
. nice to meet you. I heard about you via
SOURCE
.
What I want is only
NAME
SOURCE
Sample code:
string tex = "hello [#NAME#]. nice to meet you. I heard about you via [#SOURCE#].";
string pattern = @"\[#(\w+)#\]";
var sp = Regex.Split(tex, pattern);
sp.Dump();

Your regex is working correctly. That's how Regex.Split() should behave (see the doc): when the pattern contains a capturing group, the captured text is included in the resulting array. If what you said is really what you want, you can use something like:
var matches = from Match match in Regex.Matches(text, pattern)
select match.Groups[1].Value;
If, on the other hand, you wanted to replace the placeholders using some rules (e.g. using a Dictionary<string, string>), then you could do:
Regex.Replace(text, pattern, m => substitutions[m.Groups[1].Value]);
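Putting both snippets together, a self-contained version might look like the following. It is a sketch under a few assumptions: the Placeholders class, its method names and the sample value "Ann" are invented here, and unknown placeholders are left untouched instead of throwing, which is a design choice rather than something the answer specifies.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the pattern from the question.
public static class Placeholders
{
    private const string Pattern = @"\[#(\w+)#\]";

    // Lists the placeholder names in order of appearance.
    public static List<string> Names(string text)
    {
        return Regex.Matches(text, Pattern)
            .Cast<Match>()
            .Select(m => m.Groups[1].Value)
            .ToList();
    }

    // Replaces known placeholders; unknown ones are kept as they are.
    public static string Fill(string text, IDictionary<string, string> values)
    {
        return Regex.Replace(text, Pattern, m =>
            values.TryGetValue(m.Groups[1].Value, out string value) ? value : m.Value);
    }

    public static void Main()
    {
        string text = "hello [#NAME#]. nice to meet you. I heard about you via [#SOURCE#].";
        Console.WriteLine(string.Join(", ", Names(text)));   // NAME, SOURCE

        var values = new Dictionary<string, string> { { "NAME", "Ann" } };
        Console.WriteLine(Fill(text, values));
        // hello Ann. nice to meet you. I heard about you via [#SOURCE#].
    }
}
```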

Try this regex:
\[#([A-Z]+)#\]

^hello (.*?). nice to meet you. I heard about you via (.*?).$
Very simply, ( ) means you want to capture what's inside, .*? is a lazy ("ungreedy") match that takes as few characters as possible, and . means any character.
If your placeholders are always going to use the [# prefix and the #] postfix, see the other users' posts instead.

How to get numbers from http://www.example.com/images/business/113.jpg

Using regex I need to get the numbers between the last "/" and ".jpg" (this actually might be .png, .gif, etc) in this:
http://www.example.com/images/business/113.jpg
Any ideas?
Thank you

Easy enough using split:
var fileName = myUrl.Split('/')[myUrl.Split('/').Length - 1];
var justTheFileName = fileName.Split('.')[0];

Regular expressions are absolutely unnecessary here.
Just do:
using System.IO;
var fileName = Path.GetFileNameWithoutExtension("http://www.example.com/images/business/113.jpg");
Take a look at the documentation of the method GetFileNameWithoutExtension:
Returns the file name of the specified path string without the extension.
Edit:
If you still want to use regex for this purpose, the following one will work:
// Either pattern will work here
var pattern = @"/([^/]*)\.jpg";
var pattern2 = @".*/(.*)\.jpg";
// Regex.Matches takes the input first, then the pattern
var matches = Regex.Matches("http://www.example.com/images/business/113.jpg", pattern);
if (matches.Count > 0)
    Console.WriteLine(matches[0].Groups[1].Value);
Note:
I didn't compile the regex. This was a small & fast example.

I see that you found a solution that matches a single digit in your URL 3 times, but not the entire number. You may want to go with something more "readable" (heh) like this:
(?<=\/)\d+(?=\.\w+$)
If you're trying to capture the number and use it, throw it into a group:
(?<=\/)(\d+)(?=\.\w+$)

Got it!! (?=[\s\S]*?\\.)(?![\s\S]+?/)[0-9]
PS: The regular expression workbench by microsoft KICKS ASS

You could use the following regular expression:
/(?<number>\d+)\.jpg$
It will capture the number into the named group 'number'. The regular expression works as follows:
Search for /
Capture 1 or more times a digit (0-9) to the named group 'number'
Check for .jpg
$ matches the end of the string.
Matching the end makes stuff a lot easier. I don't believe look-ahead or look-behind is necessary.
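In C#, the named-group approach could be used roughly as follows. Note one deliberate variation, labeled here as an assumption: the pattern uses \.\w+$ instead of \.jpg$ so that .png, .gif and similar extensions are accepted too, as the question asks. ImageNumber and TryGet are illustrative names.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper built around the named group 'number'.
public static class ImageNumber
{
    // '/' + digits captured as 'number' + '.' + any extension, anchored at the end.
    private static readonly Regex Pattern = new Regex(@"/(?<number>\d+)\.\w+$");

    // Returns true and sets 'number' when the last path segment is digits plus an extension.
    public static bool TryGet(string url, out int number)
    {
        number = 0;
        Match m = Pattern.Match(url);
        return m.Success && int.TryParse(m.Groups["number"].Value, out number);
    }

    public static void Main()
    {
        if (TryGet("http://www.example.com/images/business/113.jpg", out int n))
            Console.WriteLine(n);   // 113
    }
}
```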

Convert C# regex Code to Java

I have found this Regex extractor code in C#.
Can someone tell me how this works, and how do I write the equivalent in Java?
// extract songtitle from metadata header.
// Trim was needed, because some stations don't trim the songtitle
fileName =
Regex.Match(metadataHeader,
"(StreamTitle=')(.*)(';StreamUrl)").Groups[2].Value.Trim();

This should be what you want.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Create the Regex pattern
Pattern p = Pattern.compile("(StreamTitle=')(.*)(';StreamUrl)");
// Create a matcher that matches the pattern against your input
Matcher m = p.matcher(metadataHeader);
// if we found a match
if (m.find()) {
    // the filename is the second group (the `(.*)` part); trim() mirrors the C# Trim()
    filename = m.group(2).trim();
}

It pulls "MyTitle" from a string such as "StreamTitle='MyTitle';StreamUrl".
The ( ) operators define match groups; there are 3 in your regex. The second one contains the string of interest and is retrieved with Groups[2].Value.
There are a few very good regex designers out there. The one I use is Rad Software's Regular Expression Designer (www.radsoftware.com.au). It is very useful for figuring out stuff like this (and it uses C# regexes).

C# reliable way to pattern match?

At the moment I am trying to match patterns such as
text text date1 date2
So I have regular expressions that do just that. However, the issue is for example if users input data with say more than 1 whitespace or if they put some of the text in a new line etc the pattern does not get picked up because it doesn't exactly match the pattern set.
Is there a more reliable way for pattern matching? The goal is to make it very simple for the user to write but make it easily matchable on my end. I was considering stripping out all the whitespace/newlines etc and then trying to match the pattern with no spaces i.e. texttextdate1date2.
Anyone got any better solutions?
Update
Here is a small example of the pattern I would need to match:
FIND me@test.com 01/01/2010 to 10/01/2010
Here is my current regex:
FIND [A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4} [0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4} to [0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}
This works fine 90% of the time. However, if users submit this information via email, it can have all kinds of formatting and HTML that I am not interested in. I am using a combination of the HtmlAgilityPack and an HTML tag-removing regex to strip all the HTML from the email, but even then I can't seem to get a match on some occasions.
I believe this could be a more parsing related question than pattern matching, but I think maybe there is a better way of doing this...

To match one or more whitespace characters (space, tab, newline), use:
\s+
Substitute the above wherever you have the physical space in your pattern and you should be fine.
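Applied to the FIND pattern from the question, the substitution might look like the sketch below. It assumes the address separator is @, adds capture groups for the three values, and uses RegexOptions.IgnoreCase because the character classes only list upper-case letters. FindCommand and Parse are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical parser for "FIND <email> <date> to <date>" with flexible whitespace.
public static class FindCommand
{
    // Every literal space of the original pattern is replaced by \s+.
    private static readonly Regex Pattern = new Regex(
        @"FIND\s+([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})" +
        @"\s+([0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4})" +
        @"\s+to\s+([0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4})",
        RegexOptions.IgnoreCase);

    // Returns { email, fromDate, toDate }, or null when the text does not match.
    public static string[] Parse(string text)
    {
        Match m = Pattern.Match(text);
        if (!m.Success) return null;
        return new[] { m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value };
    }

    public static void Main()
    {
        // Extra spaces, a line break and a tab between the parts still match.
        string[] parts = Parse("FIND   me@test.com\n01/01/2010   to\t10/01/2010");
        Console.WriteLine(parts == null ? "no match" : string.Join(" | ", parts));
        // me@test.com | 01/01/2010 | 10/01/2010
    }
}
```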

Example of matching multiple groups in a text with multiple whitespace characters and/or newlines:
var txt = "text text date1\ndate2";
var match = Regex.Match(txt, @"([a-z]+)\s+([a-z]+)\s+([a-z0-9]+)\s+([a-z0-9]+)");
match.Groups[n].Value with n from 1 to 4 will contain the captured values.

I would split the string into a string array and match each resulting string to the necessary Regular Expression.

\b(text)[\s]+(text)[\s]+(date1)[\s]+(date2)\b

It's a nasty expression, but here is something that will work for the input you provided:
^(\w+)\s+([\w@.]+)\s+(\d{2}\/\d{2}\/\d{4})[^\d]+(\d{2}\/\d{2}\/\d{4})$
This will work with variable amounts of whitespace between the capture groups as well.

Through ORegex you can tokenize your string and just pattern match on token sequences:
var tokens = input.Split(new[]{' ','\t','\n','\r'}, StringSplitOptions.RemoveEmptyEntries);
var oregex = new ORegex<string>("{0}{0}{1}{1}", IsText, IsDate);
var matches = oregex.Matches(tokens); //here is your subsequence tokens.
...
public bool IsText(string str)
{
...
}
public bool IsDate(string str)
{
...
}
