Problem creating regex to match filename - c#

I am trying to create a regex in C# to extract the artist, track number and song title from a filename named like: 01.artist - title.mp3
Right now I can't get the thing to work, and am having problems finding much relevant help online.
Here is what I have so far:
string fileRegex = "(?<trackNo>\\d{1,3})\\.(<artist>[a-z])\\s-\\s(<title>[a-z])\\.mp3";
Regex r = new Regex(fileRegex);
Match m = r.Match(song.Name); // song.Name is the filename
if (m.Success)
{
Console.WriteLine("Artist is {0}", m.Groups["artist"]);
}
else
{
Console.WriteLine("no match");
}
I'm not getting any matches at all, and all help is appreciated!

You might want to put ?'s before the <> tags in all your groupings, and put a + sign after your [a-z]'s, like so:
string fileRegex = "(?<trackNo>\\d{1,3})\\.(?<artist>[a-z]+)\\s-\\s(?<title>[a-z]+)\\.mp3";
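Then it should work for names made up of lowercase letters only. The ?'s are required so that the contents of the angled brackets <> are interpreted as a group name, and the +'s are required to match 1 or more repetitions of the last element, which is any character between (and including) a-z here. Because [a-z]+ does not match spaces, digits, or capitals, widen the character class if your artist or title can contain them.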
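As a rough sketch of the corrected pattern in use: the class name TrackNameParser and the method TryParse are hypothetical, and the pattern assumes that artist and title may contain spaces and mixed case, so it uses .+? and .+ where the answer above uses [a-z]+.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TrackNameParser
{
    // Assumption: names look like "01.artist - title.mp3". Artist and title
    // may contain any characters, so .+? / .+ stand in for [a-z]+.
    private static readonly Regex FilePattern = new Regex(
        @"^(?<trackNo>\d{1,3})\.\s*(?<artist>.+?)\s-\s(?<title>.+)\.mp3$",
        RegexOptions.IgnoreCase);

    // Returns true and fills the out parameters when the name matches;
    // on failure the out parameters are empty strings.
    public static bool TryParse(string fileName, out string trackNo,
                                out string artist, out string title)
    {
        Match m = FilePattern.Match(fileName);
        trackNo = m.Groups["trackNo"].Value;
        artist = m.Groups["artist"].Value;
        title = m.Groups["title"].Value;
        return m.Success;
    }
}
```

Calling TrackNameParser.TryParse(song.Name, out trackNo, out artist, out title) would then replace the Match/Groups code in the question. The lazy .+? for the artist stops at the first " - ", so any later " - " ends up in the title.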

Your artist and title groups are matching exactly one character. Try:
"(?<trackNo>\\d{1,3})\\.(?<artist>[a-z]+)\\s-\\s(?<title>[a-z]+)\\.mp3"
I really recommend http://www.ultrapico.com/Expresso.htm for building regular expressions. It's brilliant and free.
P.S. I like to write my regex patterns as verbatim string literals, like so:
@"(?<trackNo>\d{1,3})\.(?<artist>[a-z]+)\s-\s(?<title>[a-z]+)\.mp3"

Maybe try:
"(?<trackNo>\\d{1,3})\\.(?<artist>[a-z]*)\\s-\\s(?<title>[a-z]*)\\.mp3";

CODE
String fileName = @"01. Pink Floyd - Another Brick in the Wall.mp3";
String regex = @"^(?<TrackNumber>[0-9]{1,3})\. ?(?<Artist>(.(?!= - ))+) - (?<Title>.+)\.mp3$";
Match match = Regex.Match(fileName, regex);
if (match.Success)
{
Console.WriteLine(match.Groups["TrackNumber"]);
Console.WriteLine(match.Groups["Artist"]);
Console.WriteLine(match.Groups["Title"]);
}
OUTPUT
01
Pink Floyd
Another Brick in the Wall

Related

How to find the third element value using Regex

All, I am currently trying to use regex and C# to parse each element that has the format below and extract a value from inside the (). For example, I would like to extract 2002_max_allow_date. Note that not all of the names will be alphanumeric.
I initially had the pattern: Regex regex = new Regex(@"(\w\d\d\d.[A-Z])\w+");
However, this only returns the names that contain digits.
Based on a reply, I tried the following and am trying to format it so that I do not get a syntax error; I also don't want to change the regex query.
Can someone please help me find the name located in the third position? For example: this,'46032','46032','2002_MAX_ALLOW_DATE'
<button class="longlist-cb longlist-cb-yes" id="cb46032"
onclick="$ll.CATG.toggleCb(this,'46032','46032','2002_MAX_ALLOW_DATE')">
</button>
Please try this
Regex rex = new Regex("'[^']+','[^']+','(?<ThirdElement>[^']+)'");
String data = "'46032','46032','2002_MAX_ALLOW_DATE'";
Match match = rex.Match(data);
Console.WriteLine(match.Groups["ThirdElement"]); // Output: 2002_MAX_ALLOW_DATE
SECOND EDIT:
I've written some code that provides all the elements inside the onclick as capture groups:
Regex regex = new Regex("onclick=\"\\$ll.CATG.toggleCb\\((.*),\\s?(.*),\\s?(.*),\\s?(.*)\\)");
string x = "<button class=\"longlist - cb longlist - cb - yes\" id=\"cb46032\" onclick=\"$ll.CATG.toggleCb(this, '46032', '46032', '2002_MAX_ALLOW_DATE')\"></button>";
Match match = regex.Match(x);
if (match.Success)
{
Console.WriteLine("match.Value returns: " + match.Value);
foreach (Group y in match.Groups)
{
Console.WriteLine("the current capture group: " + y.Value);
}
}
else
{
Console.Write("No match");
}
Console.ReadKey();
will print:
EDIT: After trying with VS, this worked for me: Regex regex = new Regex("onclick=\"\\$ll.CATG.toggleCb\\((.*),.*,.*,.*\\)");
ORIGINAL ANSWER:
If you were to use Regex regex = new Regex(@"onclick=""\$ll.CATG.toggleCb\(.*,.*,(.*),.*\)"); on your provided text, the capture group should return '46032'.
You could alter this regex by moving the capturing ( and ) to a different .* to capture another element. For example, onclick="\$ll.CATG.toggleCb\((.*),.*,.*,.*\) would capture the first element, this.
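Why not read just the value of the onclick attribute instead of matching against the button's entire HTML? Matching the whole element makes the problem more complicated than it needs to be.
String.Split would solve this simply, if you are not set on using a regex (the piece it returns still carries the quotes and the closing parenthesis, which you would need to trim):
the_button_element.GetAttribute("onclick").Split(',')[3]
Or use a regex:
new Regex(@".*?,'(\w+)'\)$")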
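Working on the onclick attribute value alone, one possible sketch is to collect every single-quoted argument and pick one by position. The class name OnclickArgs and the method QuotedArgument are hypothetical, and the code assumes the arguments never contain an escaped single quote.

```csharp
using System.Text.RegularExpressions;

public static class OnclickArgs
{
    // Returns the single-quoted argument at the given 0-based position,
    // or null when there are not that many quoted arguments.
    // Assumption: argument values do not contain escaped single quotes.
    public static string QuotedArgument(string onclick, int index)
    {
        MatchCollection quoted = Regex.Matches(onclick, @"'([^']*)'");
        if (index < 0 || index >= quoted.Count)
        {
            return null;
        }
        // Group 1 is the text between the quotes.
        return quoted[index].Groups[1].Value;
    }
}
```

For the button in the question, OnclickArgs.QuotedArgument(onclickValue, 2) would give the third quoted value, with no quotes or parenthesis left to trim.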

Finding Definition of an Acronym in a sentence

I am working on a sample program in C# with Visual Studio 2013. I have some logic that will find an all-uppercase acronym, as shown below:
string docStr = "Made at Training And Doctrine (TAD)";
string allUpperRegStr = "\\([A-Z]{2,}\\)";
Match mUpper = Regex.Match(docStr, allUpperRegStr);
if (mUpper.Success)
{
string remWS = mUpper.Value.Trim();
}
So the above logic finds the (TAD) acronym, what I need is some way to parse the sentence and find a match for the definition of the acronym which is Training And Doctrine. Any help would be appreciated.
You should construct a new regex that will look something like (T[a-z]+\sA[a-z]+\sD[a-z]+), and that should be able to capture "Training And Doctrine". You might have to consider scenarios where the definition contains punctuation characters or other variations (multiple spaces, for example), and adjust the regex string accordingly.
EDIT: Full Solution - EDIT2: Ignore case (this was not validated to work yet)
string docStr = "Made at Training And Doctrine (TAD)";
string allUpperRegStr = "\\([A-Z]{2,}\\)";
Match mUpper = Regex.Match(docStr, allUpperRegStr);
if (mUpper.Success)
{
string remWS = mUpper.Value.Trim();
char[] chars = remWS.ToCharArray();
IEnumerable<string> lowerUpper = from l in chars
where l !='(' && l != ')'
select string.Format("[{0}{1}][a-z]+", Char.ToLower(l), Char.ToUpper(l));
string regex2 = string.Format("({0})", string.Join("\\s", lowerUpper));
Match mDefinition = Regex.Match(docStr, regex2);
if (mDefinition.Success)
{
string definition = mDefinition.Value.Trim();
}
}
Latest fix included:
Fixing target regex string (the last character has to be followed by [a-z] as well)
Strip '(' & ')' from lowerUpper
Working regex sample
Just use the words directly as your regex, like \Training And Doctrine\
A regex parser will recognize the pattern based on a string.

How to find repeatable characters

I can't understand how to solve the following problem:
I have input string "aaaabaa" and I'm trying to search for string "aa" (I'm looking for positions of characters)
Expected result is
0 1 2 5
aa aabaa
a aa abaa
aa aa baa
aaaab aa
This problem is already solved by me using another approach (non-RegEx).
But I need a RegEx solution. I'm new to RegEx, so searching Google hasn't really helped me.
Any help appreciated! Thanks!
P.S.
I've tried to use (aa)* and "\b(\w+(aa))*\w+" but those expressions are wrong
You can solve this by using a lookahead
a(?=a)
will find every "a" that is followed by another "a".
If you want to do this more generally
(\p{L})(?=\1)
This will find every letter that is followed by the same letter. The matched letter is stored in a capturing group (because of the parentheses around it), and the positive lookahead assertion (the (?=...)) reuses that group through the backreference \1, which holds the matched character.
\p{L} matches any Unicode code point in the category "letter".
Code
String text = "aaaabaa";
Regex reg = new Regex(@"(\p{L})(?=\1)");
MatchCollection result = reg.Matches(text);
foreach (Match item in result) {
Console.WriteLine(item.Index);
}
Output
0
1
2
5
The following code should work with any regular expression without having to change the actual expression:
Regex rx = new Regex(@"(a)\1"); // or any other word you're looking for.
int position = 0;
string text = "aaaaabbbbccccaaa";
int textLength = text.Length;
Match m = rx.Match(text, position);
while (m != null && m.Success)
{
Console.WriteLine(m.Index);
if (m.Index <= textLength)
{
m = rx.Match(text, m.Index + 1);
}
else
{
m = null;
}
}
Console.ReadKey();
It uses the option to change the start index of a regex search for each consecutive search. The actual problem comes from the fact that the Regex engine, by default, will always continue searching after the previous match. So it will never find a possible match within another match, unless you instruct it to by using a Look ahead construction or by manually setting the start index.
Another, relatively easy, solution is to just stick the whole expression in a forward look ahead:
string expression = @"(a)\1";
Regex rx2 = new Regex("(?=" + expression + ")");
MatchCollection ms = rx2.Matches(text);
var indexes = ms.Cast<Match>().Select(match => match.Index);
That way the engine will automatically advance the index by one for every match it finds.
From the docs:
When a match attempt is repeated by calling the NextMatch method, the regular expression engine gives empty matches special treatment. Usually, NextMatch begins the search for the next match exactly where the previous match left off. However, after an empty match, the NextMatch method advances by one character before trying the next match. This behavior guarantees that the regular expression engine will progress through the string. Otherwise, because an empty match does not result in any forward movement, the next match would start in exactly the same place as the previous match, and it would match the same empty string repeatedly.
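The lookahead approach and the empty-match behaviour quoted above can be combined into a small helper. This is only a sketch: the class name OverlapFinder and the method Positions are hypothetical, and it assumes you are searching for a literal string rather than an arbitrary pattern, which is why Regex.Escape is applied.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class OverlapFinder
{
    // Returns the start index of every occurrence of a literal string,
    // including occurrences that overlap each other.
    public static List<int> Positions(string text, string literal)
    {
        // The lookahead makes every match empty, so the engine advances
        // by one character after each hit instead of skipping past it.
        Regex rx = new Regex("(?=" + Regex.Escape(literal) + ")");
        List<int> result = new List<int>();
        foreach (Match m in rx.Matches(text))
        {
            result.Add(m.Index);
        }
        return result;
    }
}
```

OverlapFinder.Positions("aaaabaa", "aa") yields the indexes 0, 1, 2 and 5 asked for in the question.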
Try this:
How can I find repeated characters with a regex in Java?
It is in java, but the regex and non-regex way is there. C# Regex is very similar to the Java way.

Regex - Find from both sides only if it has spaces

I need some help with a regex. I need to find a word that is surrounded by some delimiter, for example *. But I need to match it only if it has a space or nothing on either side. For example, if it is at the start of the text there can't really be a space before it, and the same goes for the end.
Here is what I came up with:
string myString = "You will find *me*, and *me* also!";
string findString = @"(\*(.*?)\*)";
string foundText;
MatchCollection matchCollection = Regex.Matches(myString, findString);
foreach (Match match in matchCollection)
{
foundText = match.Value.Replace("*", "");
myString = myString.Replace(match.Value, "->" + foundText + "<-");
match.NextMatch();
}
Console.WriteLine(myString);
You will find ->me<-, and ->me<- also!
This works correctly. The problem is when I add a * in the middle of the text; I don't want it to match then.
Example: You will find *m*e*, and *me* also!
Output: You will find ->m<-e->, and <-me* also!
How can I fix that?
Try the following pattern:
string findString = @"(?<=\s|^)\*(.*?)\*(?=\s|$)";
(?<=\s|^)X will match any X only if preceded by a space-char (\s), or the start-of-input, and
X(?=\s|$) matches any X if followed by a space-char (\s), or the end-of-input.
Note that it will not match *me* in foo *me*, bar since the second * has a , after it! If you want to match that too, you need to include the comma like this:
string findString = @"(?<=[\s,]|^)\*(.*?)\*(?=[\s,]|$)";
You'll need to expand the set [\s,] as you see fit, of course. You might want to add !, ? and . at the very least: [\s,!?.] (and no, . and ? do not need to be escaped inside a character-set!).
EDIT
A small demo:
string Txt = "foo *m*e*, bar";
string Pattern = @"(?<=[\s,]|^)\*(.*?)\*(?=[\s,]|$)";
Console.WriteLine(Regex.Replace(Txt, Pattern, ">$1<"));
which would print:
foo >m*e<, bar
You can add "beginning of line or space" and "space or end of line" around your match:
(^|\s)\*(.*?)\*(\s|$)
You'll now need to refer to the middle capture group for the match string.
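To illustrate what referring to the middle capture group looks like, here is a small sketch using Regex.Replace. The class name StarMarker and the method Mark are hypothetical; groups 1 and 3 hold the surrounding whitespace (or nothing at the start or end of the input) and are written back unchanged.

```csharp
using System.Text.RegularExpressions;

public static class StarMarker
{
    // Wraps each *word* that has whitespace or the input boundary on both
    // sides in -> and <-. $2 is the text between the asterisks; $1 and $3
    // restore whatever the outer groups consumed.
    public static string Mark(string input)
    {
        return Regex.Replace(input, @"(^|\s)\*(.*?)\*(\s|$)", "$1->$2<-$3");
    }
}
```

StarMarker.Mark("say *hi* to *me*") gives "say ->hi<- to ->me<-". One thing to be aware of with this variant: because the surrounding whitespace is consumed by the match, two starred words separated by a single space (such as "*a* *b*") only get the first one marked, which the lookbehind/lookahead version avoids.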

Google-like search query tokenization & string splitting

I'm looking to tokenize a search query similar to how Google does it. For instance, if I have the following search query:
the quick "brown fox" jumps over the "lazy dog"
I would like to have a string array with the following tokens:
the
quick
brown fox
jumps
over
the
lazy dog
As you can see, the tokens preserve the spaces with in double quotes.
I'm looking for some examples of how I could do this in C#, preferably not using regular expressions, however if that makes the most sense and would be the most performant, then so be it.
Also I would like to know how I could extend this to handle other special characters, for example, putting a - in front of a term to force exclusion from a search query and so on.
So far, this looks like a good candidate for a regex. If it gets significantly more complicated, then a more complex tokenizing scheme may be necessary, but you should avoid that route unless necessary, as it is significantly more work. (On the other hand, for complex schemas, a regex quickly turns into a dog and should likewise be avoided.)
This regex should solve your problem:
("[^"]+"|\w+)\s*
Here is a C# example of its usage:
string data = "the quick \"brown fox\" jumps over the \"lazy dog\"";
string pattern = @"(""[^""]+""|\w+)\s*";
MatchCollection mc = Regex.Matches(data, pattern);
foreach(Match m in mc)
{
    // Groups[1] is the token itself; Groups[0] would also include the trailing whitespace.
    string group = m.Groups[1].Value;
}
The real benefit of this method is that it can easily be extended to include your "-" requirement, like so:
string data = "the quick \"brown fox\" jumps over " +
    "the \"lazy dog\" -\"lazy cat\" -energetic";
string pattern = @"(-""[^""]+""|""[^""]+""|-\w+|\w+)\s*";
MatchCollection mc = Regex.Matches(data, pattern);
foreach(Match m in mc)
{
    // Groups[1] is the token itself; Groups[0] would also include the trailing whitespace.
    string group = m.Groups[1].Value;
}
Now I hate reading Regex's as much as the next guy, but if you split it up, this one is quite easy to read:
(
-"[^"]+"
|
"[^"]+"
|
-\w+
|
\w+
)\s*
Explanation
If possible match a minus sign, followed by a " followed by everything until the next "
Otherwise match a " followed by everything until the next "
Otherwise match a - followed by any word characters
Otherwise match as many word characters as you can
Put the result in a group
Swallow up any following space characters
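Putting the steps above together, a tokenizer built on this pattern might look like the sketch below. The class name QueryTokenizer and the method Tokenize are hypothetical; the four alternatives are folded into two with an optional leading minus, and the choice to strip the quotes from each token is an assumption based on the token list in the question.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QueryTokenizer
{
    // -?"[^"]+" : optionally negated quoted phrase
    // -?\w+     : optionally negated single word
    private static readonly Regex TokenPattern =
        new Regex(@"(-?""[^""]+""|-?\w+)\s*");

    // Returns the tokens in order, with quotes removed and any
    // leading minus sign kept so the caller can detect exclusions.
    public static List<string> Tokenize(string query)
    {
        List<string> tokens = new List<string>();
        foreach (Match m in TokenPattern.Matches(query))
        {
            // Group 1 excludes the whitespace swallowed by \s*.
            tokens.Add(m.Groups[1].Value.Replace("\"", ""));
        }
        return tokens;
    }
}
```

A caller could then treat any token that starts with "-" as an exclusion term.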
I was just trying to figure out how to do this a few days ago. I ended up using Microsoft.VisualBasic.FileIO.TextFieldParser which did exactly what I wanted (just set HasFieldsEnclosedInQuotes to true). Sure it looks somewhat odd to have "Microsoft.VisualBasic" in a C# program, but it works, and as far as I can tell it is part of the .NET framework.
To get my string into a stream for the TextFieldParser, I used "new MemoryStream(new ASCIIEncoding().GetBytes(stringvar))". Not sure if this is the best way to do it.
Edit: I don't think this would handle your "-" requirement, so maybe the RegEx solution is better
Go through the string char by char, like this (sort of pseudocode):
array words = {} // empty array
string word = "" // empty word
bool in_quotes = false
for char c in search string:
    if in_quotes:
        if c is '"':
            append word to words
            word = "" // empty word
            in_quotes = false
        else:
            append c to word
    else if c is '"':
        in_quotes = true
    else if c is ' ': // space
        if not empty word:
            append word to words
            word = "" // empty word
    else:
        append c to word
// Rest
if not empty word:
    append word to words
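A direct C# translation of this pseudocode could look like the following sketch. The class name CharTokenizer and the method Split are hypothetical; the logic follows the pseudocode step for step, including its behaviour of treating only the space character as a separator.

```csharp
using System.Collections.Generic;
using System.Text;

public static class CharTokenizer
{
    // Splits a query on spaces, keeping text inside double quotes together.
    public static List<string> Split(string query)
    {
        List<string> words = new List<string>();
        StringBuilder word = new StringBuilder();
        bool inQuotes = false;

        foreach (char c in query)
        {
            if (inQuotes)
            {
                if (c == '"')
                {
                    // Closing quote: the quoted phrase is complete.
                    words.Add(word.ToString());
                    word.Clear();
                    inQuotes = false;
                }
                else
                {
                    word.Append(c);
                }
            }
            else if (c == '"')
            {
                inQuotes = true;
            }
            else if (c == ' ')
            {
                // A space ends the current word, if there is one.
                if (word.Length > 0)
                {
                    words.Add(word.ToString());
                    word.Clear();
                }
            }
            else
            {
                word.Append(c);
            }
        }

        // Rest: whatever is left after the last character.
        if (word.Length > 0)
        {
            words.Add(word.ToString());
        }
        return words;
    }
}
```

Since an opening quote does not flush the current word, an input such as -"lazy cat" comes out as the single token -lazy cat, which happens to suit the exclusion requirement.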
I was looking for a Java solution to this problem and came up with one based on @Michael La Voie's answer. I thought I would share it here even though the question asks for C#. Hope that's okay.
public static final List<String> convertQueryToWords(String q) {
List<String> words = new ArrayList<>();
Pattern pattern = Pattern.compile("(\"[^\"]+\"|\\w+)\\s*");
Matcher matcher = pattern.matcher(q);
while (matcher.find()) {
MatchResult result = matcher.toMatchResult();
if (result != null && result.group() != null) {
if (result.group().contains("\"")) {
words.add(result.group().trim().replaceAll("\"", "").trim());
} else {
words.add(result.group().trim());
}
}
}
return words;
}
