Regex - split by "_" and exclude file extension - c#

I need to split the following string AAA_BBB_CCC.extension by "_" and exclude from the results any file extension.
Where A, B and C can be any character or space. I wish to get AAA, BBB and CCC.
I know that \.(?:.(?!\.))+$ will match .extension but I could not combine it with matching "_" for splitting.

Use the Path.GetFileNameWithoutExtension function to strip the extension from the file name.
Then use String.Split to get an array with three items:
var fileName = Path.GetFileNameWithoutExtension(fullName);
var parts = fileName.Split('_');
var partAAA = parts[0];
var partBBB = parts[1];
var partCCC = parts[2];
If the parts always have the same fixed length, you can also extract them with the Substring function. There is no need to resort to regex here.
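As a minimal sketch of the approach above, the two calls can be wrapped in a helper that also checks the number of parts before indexing into the array. The class name FileNameParts and the method name Split are hypothetical, chosen only for this illustration.

```csharp
using System;
using System.IO;

public static class FileNameParts
{
    // Hypothetical helper: strips the extension, then splits on '_'.
    public static string[] Split(string fullName)
    {
        string fileName = Path.GetFileNameWithoutExtension(fullName);
        return fileName.Split('_');
    }

    public static void Main()
    {
        string[] parts = Split("AAA_BBB_CCC.extension");

        // Guard against names that do not have exactly three parts.
        if (parts.Length == 3)
            Console.WriteLine(string.Join(", ", parts)); // AAA, BBB, CCC
        else
            Console.WriteLine("Expected 3 parts, got " + parts.Length);
    }
}
```

Note that Path.GetFileNameWithoutExtension removes only the last extension, so a name such as a_b.tar.gz would keep the .tar portion in its final part.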

Another option is to use the .NET Group.Captures property: capture every run of characters other than _ in the same named group, then read all of that group's captures from the match.
^(?'val'[^_]+)(?:_(?'val'[^_]+))+\.\w+$
Explanation
^ Start of string
(?'val'[^_]+) Named group val, match 1+ chars other than _ using a negated character class
(?: Non capture group
_(?'val'[^_]+) Match an _ and capture again 1+ chars other than _ in same named group val
)+ Close the non capture group and repeat 1+ times for at least 1 occurrence with _
\.\w+ Match a . and 1+ word chars
$ End of string
string pattern = @"^(?'val'[^_]+)(?:_(?'val'[^_]+))+\.\w+$";
string input = @"AAA_BBB_CCC.extension";
Match m = Regex.Match(input, pattern);
foreach (Capture capture in m.Groups["val"].Captures) {
Console.WriteLine(capture.Value);
}
Output
AAA
BBB
CCC

If you wanted to use a regex based approach here, you could try doing a find all on the following regex pattern:
[^_]+(?=.*\.\w+$)
This pattern matches every term between underscores. The lookahead requires a file extension to follow somewhere later in the string, so the extension itself is never returned as a match.
Regex rx = new Regex(@"[^_]+(?=.*\.\w+$)");
string text = "AAA_BBB_CCC.extension";
MatchCollection matches = rx.Matches(text);
foreach (Match match in matches)
{
Console.WriteLine(match.Groups[0].Value);
}
This prints:
AAA
BBB
CCC


Regex to match words between underscores after second occurrence of underscore

I would like to get the words between underscores after the second occurrence of an underscore.
This is my string:
ABC_BC_BE08_C1000004_0124
I've assembled this expression:
(?<=_)[^_]+
It matches what I need, but it only skips the first word, since there is no underscore before it. I would like it to skip ABC and BC and just get the last three strings. I've tried messing around but I am stuck and can't make it work. Thanks!
You can use a non-regex approach here with Split and Skip:
var text = "ABC_BC_BE08_C1000004_0124";
var result = text.Split('_').Skip(2);
foreach (var s in result)
Console.WriteLine(s);
Output:
BE08
C1000004
0124
With regex, you can use
var result = Regex.Matches(text, @"(?<=^(?:[^_]*_){2,})[^_]+").Cast<Match>().Select(x => x.Value);
The regex matches:
(?<=^(?:[^_]*_){2,}) - a positive lookbehind that matches a location that matches the following patterns immediately to the left of the current location:
^ - start of string
(?:[^_]*_){2,} - two or more ({2,}) sequences of any zero or more chars other than _ ([^_]*) and then a _ char
[^_]+ - one or more chars other than _
In .NET there is also a captures collection that you can use with a regex and a repeated capture group.
^[^_]*_[^_]*(?:_([^_]+))+
The pattern matches:
^ Start of string
[^_]*_[^_]* Match zero or more chars other than _, then an _, then again zero or more chars other than _
(?: Non capture group
_([^_]+) Match _ and capture 1 or more times any char except _ in group 1
)+ Close the non capture group and repeat 1 or more times
For example:
var pattern = @"^[^_]*_[^_]*(?:_([^_]+))+";
var str = "ABC_BC_BE08_C1000004_0124";
var strings = Regex.Match(str, pattern).Groups[1].Captures.Select(c => c.Value);
foreach (String s in strings)
{
Console.WriteLine(s);
}
Output
BE08
C1000004
0124
If you want to match only word characters in between the underscores, another option for a pattern could be using a negated character class [^\W_] excluding the underscore from the word characters in between:
^[^\W_]*_[^\W_]*(?:_([^\W_]+))+
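A minimal sketch of how the word-character variant could be used, following the same Captures approach as the example above. The class name WordCaptures and the method name AfterSecondUnderscore are hypothetical; Cast<Capture>() is used so the code should also compile on older .NET Framework versions, where CaptureCollection is not generic.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordCaptures
{
    // Hypothetical helper: returns the words after the second underscore,
    // or an empty array when the input does not match the pattern.
    public static string[] AfterSecondUnderscore(string input)
    {
        Match m = Regex.Match(input, @"^[^\W_]*_[^\W_]*(?:_([^\W_]+))+");
        if (!m.Success)
            return new string[0];

        // Group 1 is repeated, so every iteration is kept in Captures.
        return m.Groups[1].Captures.Cast<Capture>().Select(c => c.Value).ToArray();
    }

    public static void Main()
    {
        foreach (string s in AfterSecondUnderscore("ABC_BC_BE08_C1000004_0124"))
            Console.WriteLine(s);
    }
}
```

Because the pattern has no end anchor, it stops at the first character that is neither a word character nor an underscore. Append $ if the whole string has to consist of underscore-separated words.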

Get the middle part of a filename using regex

I need a regex that can return up to 10 characters in the middle of a file name.
filename: returns:
msl_0123456789_otherstuff.csv -> 0123456789
msl_test.xml -> test
anythingShort.w1 -> anythingSh
I can capture the beginning and end for removal with the following regex:
Regex.Replace(filename, "(^msl_)|([.][[:alnum:]]{1,3}$)", string.Empty);
but I also need to have only 10 characters when I am done.
Explanation of the regex above:
(^msl_) - match lines that start with "msl_"
| - or
([.] - match a period
[[:alnum:]]{1,3} - followed by 1-3 alphanumeric characters
$) - at the end of the line
Note [[:alnum:]] can't work in a .NET regex, because it does not support POSIX character classes. You may use \w (to match letters, digits, underscores) or [^\W_] (to match letters or digits).
You can use your regex and just keep the first 10 chars in the string:
new string(Regex.Replace(s, @"^msl_|\.\w{1,3}$", "").Take(10).ToArray())
For example:
var strings = new List<string> { "msl_0123456789_otherstuff.csv", "msl_test.xml", "anythingShort.w1" };
foreach (var s in strings)
{
Console.WriteLine("{0} => {1}", s, new string(Regex.Replace(s, @"^msl_|\.\w{1,3}$", "").Take(10).ToArray()));
}
Output:
msl_0123456789_otherstuff.csv => 0123456789
msl_test.xml => test
anythingShort.w1 => anythingSh
Replacing with the alternation removes either alternative from the start or the end of the string. However, it also succeeds when the extension is absent, and it does not limit the number of characters in the middle.
If the file extension must be present, you might use a capturing group and make msl_ optional at the beginning.
Then match 1-10 word characters other than _, followed by optional word characters up to the dot:
^(?:msl_)?([^\W_]{1,10})\w*\.[^\W_]{2,}$
A bit broader match could be using \S instead of \w and match until the last dot:
^(?:msl_)?(\S{1,10})\S*\.[^\W_]{2,}$
For example:
string[] strings = {"msl_0123456789_otherstuff.csv", "msl_test.xml","anythingShort.w1", "123456testxxxxxxxx"};
string pattern = @"^(?:msl_)?(\S{1,10})\S*\.[^\W_]{2,}$";
foreach (String s in strings) {
Match match = Regex.Match(s, pattern);
if (match.Success)
{
Console.WriteLine(match.Groups[1]);
}
}
Output
0123456789
test
anythingSh

Regular expression matching a given structure

I need to generate a regex to match any string with this structure:
{"anyWord"}{"aSpace"}{"-"}{"anyLetter"}
How can I do it?
Thanks
EDIT
I have tried:
string txt="print -c";
string re1="((?:[a-z][a-z]+))"; // Word 1
Regex r = new Regex(re1,RegexOptions.IgnoreCase|RegexOptions.Singleline);
Match m = r.Match(txt);
if (m.Success)
{
String word1=m.Groups[1].ToString();
Console.Write("("+word1.ToString()+")"+"\n");
}
Console.ReadLine();
but this only matches the word "print"
This would be pretty straight-forward :
[a-zA-Z]+\s\-[a-zA-Z]
explained as follows :
[a-zA-Z]+ # Matches 1 or more letters
\s # Matches a single whitespace character (space, tab, etc.)
\- # Matches a single hyphen / dash
[a-zA-Z] # Matches a single letter
If you needed to implement this in C#, you could just use the Regex class and specifically the Regex.Matches() method:
var matches = Regex.Matches(yourString, @"[a-zA-Z]+\s\-[a-zA-Z]");
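A minimal sketch of what matching with this pattern could look like. The class name FlagMatches, the method name Find, and the sample input are hypothetical, made up only for this illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class FlagMatches
{
    // Hypothetical helper: returns every "word -letter" occurrence in the input.
    public static string[] Find(string input)
    {
        return Regex.Matches(input, @"[a-zA-Z]+\s\-[a-zA-Z]")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        // Prints "ls -a" and then "cp -r".
        foreach (string hit in Find("run ls -a then cp -r"))
            Console.WriteLine(hit);
    }
}
```

The pattern is not anchored, so it also finds a match inside a longer string such as "print -cx". To validate a whole string instead, one option is to wrap the pattern in ^ and $ and call Regex.IsMatch.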

Regex to match and return group names

I need to match the following strings and returns the values as groups:
abctic
abctac
xyztic
xyztac
ghhtic
ghhtac
The pattern I wrote, with grouping, is as follows:
(?<arch>[abc,xyz,ghh])(?<flavor>[tic,tac]$)
The above returns only parts of group names. (meaning match is not correct).
If I use * in each sub pattern instead of $ at the end, groups are correct, but that would mean that abcticff will also match.
Please let me know what my correct regex should be.
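Your pattern is incorrect because square brackets define a character class, which matches a single character from the set, so [abc,xyz,ghh] matches just one of those letters or a comma. Alternation between whole strings uses the pipe symbol |, not commas inside brackets.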
Your pattern should be: ^(?<arch>abc|xyz|ghh)(?<flavor>tic|tac)$
The ^ and $ metacharacters ensure the string matches from start to end. If you need to match text in a larger string, you could replace them with \b to match on a word boundary.
Try this approach:
string[] inputs = { "abctic", "abctac", "xyztic", "xyztac", "ghhtic", "ghhtac" };
string pattern = @"^(?<arch>abc|xyz|ghh)(?<flavor>tic|tac)$";
foreach (var input in inputs)
{
var match = Regex.Match(input, pattern);
if (match.Success)
{
Console.WriteLine("Arch: {0} - Flavor: {1}",
match.Groups["arch"].Value,
match.Groups["flavor"].Value);
}
else
Console.WriteLine("No match for: " + input);
}
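The word-boundary variant mentioned in this answer can be sketched as follows, with ^ and $ swapped for \b so that the pattern finds every occurrence inside a longer text. The class name ArchFlavor, the method name FindAll, and the sample sentence are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class ArchFlavor
{
    // Hypothetical helper: returns "arch/flavor" for each whole-word match.
    public static string[] FindAll(string text)
    {
        const string pattern = @"\b(?<arch>abc|xyz|ghh)(?<flavor>tic|tac)\b";
        return Regex.Matches(text, pattern)
                    .Cast<Match>()
                    .Select(m => m.Groups["arch"].Value + "/" + m.Groups["flavor"].Value)
                    .ToArray();
    }

    public static void Main()
    {
        // "abcticff" is skipped because no word boundary follows "tic".
        foreach (string hit in FindAll("build abctic and xyztac, not abcticff"))
            Console.WriteLine(hit);
    }
}
```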

RegEx replace query to pick out wiki syntax

I've got a string of HTML that I need to grab the "[Title|http://www.test.com]" pattern out of e.g.
"dafasdfasdf, adfasd. [Test|http://www.test.com/] adf ddasfasdf [SDAF|http://www.madee.com/] assg ad"
I need to replace "[Title|http://www.test.com]" with "<a href='http://www.test.com/'>Title</a>".
What is the best way to approach this?
I was getting close with:
string test = "dafasdfasdf adfasd [Test|http://www.test.com/] adf ddasfasdf [SDAF|http://www.madee.com/] assg ad ";
string p18 = @"(\[.*?|.*?\])";
MatchCollection mc18 = Regex.Matches(test, p18, RegexOptions.Singleline | RegexOptions.IgnoreCase);
foreach (Match m in mc18)
{
string value = m.Groups[1].Value;
string fulltag = value.Substring(value.IndexOf("["), value.Length - value.IndexOf("["));
Console.WriteLine("text=" + fulltag);
}
There must be a cleaner way of getting the two values out e.g. the "Title" bit and the url itself.
Any suggestions?
Replace the pattern:
\[([^|]+)\|[^]]*]
with:
$1
A short explanation:
\[ # match the character '['
( # start capture group 1
[^|]+ # match any character except '|' and repeat it one or more times
) # end capture group 1
\| # match the character '|'
[^]]* # match any character except ']' and repeat it zero or more times
] # match the character ']'
A C# demo would look like:
string test = "dafasdfasdf adfasd [Test|http://www.test.com/] adf ddasfasdf [SDAF|http://www.madee.com/] assg ad ";
string adjusted = Regex.Replace(test, @"\[([^|]+)\|[^]]*]", "$1");
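If both the title and the URL are needed, a second capture group can hold the URL. The following is a minimal sketch that assumes the desired output is an HTML anchor with a single-quoted href; that output format, the class name WikiLinks, and the method name ToAnchors are assumptions made for illustration, not part of the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WikiLinks
{
    // Hypothetical helper: group 1 captures the title, group 2 the URL.
    public static string ToAnchors(string text)
    {
        return Regex.Replace(
            text,
            @"\[([^|\]]+)\|([^\]]+)\]",
            "<a href='$2'>$1</a>");
    }

    public static void Main()
    {
        string test = "adfasd [Test|http://www.test.com/] adf [SDAF|http://www.madee.com/] assg";
        Console.WriteLine(ToAnchors(test));
    }
}
```

This sketch does not HTML-encode the title or validate the URL, which real input would likely require.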
