Extract Menu from String - c#

I want to extract a menu from a string whenever there is one.
recipe ABC: Quelle bonne idC)e!
L: 33348, C: 2130
1 Like
2 Comment
3 Next
4 See Comments
# Home
Since I am new to regex, I tried this for a start:
If Regex.IsMatch(text, "(\d\w*\n)*") Then
End If
And it returned true.
Am I doing this right?
I want to be able to extract the menu whenever there is one. Menus don't have a pre-defined format. So I used whatever starts with number \d followed by alphanumeric character \w and new line \n.
After regex returning true, how can I extract the text that did match the regex?
Any help would be appreciated.

You can use a regex (?sm).*(?=^\d+\s+\p{L}+[\r\n]) that is taking everything from the beginning and up to a line (due to ^) that starts with a number (\d+), then some spaces (\s+), then some letters (\p{L}), then a newline ([\r\n]):
var txt ="Lorem ipsum:amet, consectetur adipiscing elit!!\r\nL: 33348, C: 2130\r\n\r\n1 Next\r\n\r\n2 Forward\r\n\r\n3 Last\r\n\r\n4 See more";
var rx = new Regex(#"(?sm).*?(?=^\d+\s+\p{L}+[\r\n])");
var res = rx.Match(txt).Value;
However, I believe your menu always starts with 1 at the line start, and all menu items are generally capitalized. That is why I suggest using another regex to reflect the following conditions: take all until a line that starts with 1 followed by some space(s), and then by an uppercase letter:
var rx = new Regex(#"(?sm).*(?=^1\s+\p{Lu})");
Or, you can try to split the string into lines, and check if a line starts with 1.
var out2 = string.Join("\r\n",txt.Split(new string[] { "\r\n" }, StringSplitOptions.None).TakeWhile(p => !p.StartsWith("1 ")).ToList());
Results:

You are using isMatch, which will return only the information "Did the pattern match anything"?
You should use something like this :
Regex regex = new Regex(#"(\d\w*\n)*");
Match match = regex.Match(yourText);
if (match.Success)
{
Console.WriteLine(match.Value);
}
Disclaimer : As your question was not about your expression itself, I haven't checked what it does. You haven't asked help on that part so I didn't give any.

Related

REGEX help needed in c#

I am very new to reg-ex and i am not sure whats going on with this one.... however my friend gave me this to solve my issue BUT somehow it is not working....
string: department_name:womens AND item_type_keyword:base-layer-underwear
reg-ex: (department_name:([\\w-]+))?(item_type_keyword:([\\w-]+))?
desired output: array OR group
1st element should be: department_name:womens
2nd should be: womens
3rd: item_type_keyword:base-layer-underwear
4th: base-layer-underwear
strings can contain department_name OR item_type_keyword, BUT not mendatory, in any order
C# Code
Regex regex = new Regex(#"(department_name:([\w-]+))?(item_type_keyword:([\w-]+))?");
Match match = regex.Match(query);
if (match.Success)
if (!String.IsNullOrEmpty(match.Groups[4].ToString()))
d1.ItemType = match.Groups[4].ToString();
this C# code only returns string array with 3 element
1: department_name:womens
2: department_name:womens
3: womens
somehow it is duplicating 1st and 2nd element, i dont know why. BUT its not return the other elements that i expect..
can someone help me please...
when i am testing the regex online, it looks fine to me...
http://fiddle.re/crvw1
Thanks
You can use something like this to get the output you have in your question:
string txt = "department_name:womens AND item_type_keyword:base-layer-underwear";
var reg = new Regex(#"(?:department_name|item_type_keyword):([\w-]+)", RegexOptions.IgnoreCase);
var ms = reg.Matches(txt);
ArrayList results = new ArrayList();
foreach (Match match in ms)
{
results.Add(match.Groups[0].Value);
results.Add(match.Groups[1].Value);
}
// results is your final array containing all results
foreach (string elem in results)
{
Console.WriteLine(elem);
}
Prints:
department_name:womens
womens
item_type_keyword:base-layer-underwear
base-layer-underwear
match.Groups[0].Value gives the part that matched the pattern, while match.Groups[1].Value will give the part captured in the pattern.
In your first expression, you have 2 capture groups; hence why you have twice department_name:womens appearing.
Once you get the different elements, you should be able to put them in an array/list for further processing. (Added this part in edit)
The loop then allows you to iterate over each of the matches, which you cannot exactly do with if and .Match() (which is better suited for a single match, while here I'm enabling multiple matches so the order they are matched doesn't matter, or the number of matches).
ideone demo
(?:
department_name # Match department_name
| # Or
item_type_keyword # Match item_type_keyword
)
:
([\w-]+) # Capture \w and - characters
It's better to use the alternation (or logical OR) operator | because we don't know the order of the input string.
(department_name:([\w-]+))|(item_type_keyword:([\w-]+))
DEMO
String input = #"department_name:womens AND item_type_keyword:base-layer-underwear";
Regex rgx = new Regex(#"(?:(department_name:([\w-]+))|(item_type_keyword:([\w-]+)))");
foreach (Match m in rgx.Matches(input))
{
Console.WriteLine(m.Groups[1].Value);
Console.WriteLine(m.Groups[2].Value);
Console.WriteLine(m.Groups[3].Value);
Console.WriteLine(m.Groups[4].Value);
}
IDEONE
Another idea using a lookahead for capturing and getting all groups in one match:
^(?!$)(?=.*(department_name:([\w-]+))|)(?=.*(item_type_keyword:([\w-]+))|)
as a .NET String
"^(?!$)(?=.*(department_name:([\\w-]+))|)(?=.*(item_type_keyword:([\\w-]+))|)"
test at regexplanet (click on .NET); test at regex101.com
(add m multiline modifier if multiline input: "^(?m)...)
If you use any spliting with And Or , etc that you can use
(department_name:(.*?)) AND (item_type_keyword:(.*?)$)
•1: department_name:womens
•2: womens
•3: item_type_keyword:base-layer-underwear
•4: base-layer-underwear
(?=(department_name:\w+)).*?:([\w-]+)|(?=(item_type_keyword:.*)$).*?:([\w-]+)
Try this.This uses a lookahead to capture then backtrack and again capture.See demo.
http://regex101.com/r/lS5tT3/52

C# creating a string which will be parsed, based on user input fails when they enter a tokenizing character

I know what is going on, but i was trying to make it so that my .Split() ignores certain characters.
sample:
1|2|3|This is a string|type:1
the parts "This is a string" is user input The user could enter in a splitting character, | in this case, so i wanted to escape it with \|. It still seems to split based on that. This is being done on the web, so i was thinking that a smart move might actually be just JSON.encode(user_in) to get around it?
1|2|3| This is \|a string|type:1
Still splits on the escaped character because i didnt define it as a special case. How would i get around this issue?
you could use Regex.Split instead and then split on | not preceded by a .
// -- regex for | not preceded by a \
string input = #"1|2|3|This is a string\|type:1";
string pattern = #"(?<!\\)[|]";
string[] substrings = Regex.Split(input, pattern);
foreach (string match in substrings)
{
Console.WriteLine("'{0}'", match);
}
You can replace your delimiter with something special first, next split it and finally replace it back.
var initial = #"1|2|3|This is \| a string|type:1";
var modified = initial.Replace(#"\|", "###");
IEnumerable<string> result = modified.Split('|');
result = result.Select(i => i.Replace("###", #"\|"));

Removing words with special characters in them

I have a long string composed of a number of different words.
I want to go through all of them, and if the word contains a special character or number (except '-'), or starts with a Capital letter, I want to delete it (the whole word not just that character). For all intents and purposes 'foreign' letters can count as special characters.
The obvious solution is to run a loop through each word (after splitting it) and then a loop through each character - but I'm hoping there's a faster way of doing it? Perhaps using Regex but I've almost no experience with it.
Thanks
ADDED:
(What I want for example:)
Input: "this Is an Example of 5 words in an input like-so from example.com"
Output: {this,an,of,words,in,an,input,like-so,from}
(What I've tried so far)
List<string> response = new List<string>();
string[] splitString = text.Split(' ');
foreach (string s in splitString)
{
bool add = true;
foreach (char c in s.ToCharArray())
{
if (!(c.Equals('-') || (Char.IsLetter(c) && Char.IsLower(c))))
{
add = false;
break;
}
if (add)
{
response.Add(s);
}
}
}
Edit 2:
For me a word should be a number of characters (a..z) seperated by a space. ,/./!/... at the end shouldn't count for the 'special character' condition (which is really mostly just to remove urls or the like)
So:
"I saw a dog. It was black!"
should result in
{saw,a,dog,was,black}
So you want to find all "words" that only contain characters a-z or -, for words that are separated by spaces?
A regex like this will find such words:
(?<!\S)[a-z-]+(?!\S)
To also allow for words that end with single punctuation, you could use:
(?<!\S)[a-z-]+(?=[,.!?:;]?(?!\S))
Example (ideone):
var re = #"(?<!\S)[a-z-]+(?=[,.!?:;]?(?!\S))";
var str = "this, Is an! Example of 5 words in an input like-so from example.com foo: bar?";
var m = Regex.Matches(str, re);
Console.WriteLine("Matched: ");
foreach (Match i in m)
Console.Write(i + " ");
Notice the punctuation in the string.
Output:
Matched:
this an of words in an input like-so from foo bar
How about this?
(?<=^|\s+)(?[a-z-]+)(?=$|\s+)
Edit: Meant (?<=^|\s+)(?<word>[a-z\-]+)(?=(?:\.|,|!|\.\.\.)?(?:$|\s+))
Rules:
Word can only be preceded by start of line or some number of whitespace characters
Word can only be followed by end of line or some number of whitespace characters (Edit supports words ending with periods, commas, exclamation points, and ellipses)
Word can only contain lower case (latin) letters and dashes
The named group containing each word is "word"
Have a look at Microsoft's How to: Search Strings Using Regular Expressions (C# Programming Guide) - it's about regexes in C#.
List<string> strings = new List<string>() {"asdf", "sdf-sd", "sdfsdf"};
for (int i = strings.Count-1; i > 0; i--)
{
if (strings[i].Contains("-"))
{
strings.Remove(strings[i]);
}
}
This could be a starting point. right now it just checks only for "." as a special char. This outputs : "this an of words in an like-so from"
string pattern = #"[A-Z]\w+|\w*[0-9]+\w*|\w*[\.]+\w*";
string line = "this Is an Example of 5 words in an in3put like-so from example.com";
System.Text.RegularExpressions.Regex r = new System.Text.RegularExpressions.Regex(pattern);
line = r.Replace(line,"");
You can do this in two ways, the white-list way and the black-list way. With a white-list you define the set of characters that you consider to be acceptable and with the black-list its the opposite.
Lets assume the white-list way and that you accept only characters a-z, A-Z and the - character. Additionally you have the rule that the first character of a word cannot be an upper case character.
With this you can do something like this:
string target = "This is a white-list example: (Foo, bar1)";
var matches = Regex.Matches(target, #"(?:\b)(?<Word>[a-z]{1}[a-zA-Z\-]*)(?:\b)");
string[] words = matches.Cast<Match>().Select(m => m.Value).ToArray();
Console.WriteLine(string.Join(", ", words));
Outputs:
// is, a, white-list, example
You can use look-aheads and look-behinds to do this. Here's a regex that matches your example:
(?<=\s|^)[a-z-]+(?=\s|$)
The explanation is: match one or more alphabetic characters (lowercase only, plus hyphen), as long as what comes before the characters is whitespace (or the start of the string), and as long as what comes after is whitespace or the end of the string.
All you need to do now is plug that into System.Text.RegularExpressions.Regex.Matches(input, regexString) to get your list of words.
Reference: http://www.mikesdotnetting.com/Article/46/CSharp-Regular-Expressions-Cheat-Sheet

How to find repeatable characters

I can't understand how to solve the following problem:
I have input string "aaaabaa" and I'm trying to search for string "aa" (I'm looking for positions of characters)
Expected result is
0 1 2 5
aa aabaa
a aa abaa
aa aa baa
aaaab aa
This problem is already solved by me using another approach (non-RegEx).
But I need a RegEx I'm new to RegEx so google-search can't help me really.
Any help appreciated! Thanks!
P.S.
I've tried to use (aa)* and "\b(\w+(aa))*\w+" but those expressions are wrong
You can solve this by using a lookahead
a(?=a)
will find every "a" that is followed by another "a".
If you want to do this more generally
(\p{L})(?=\1)
This will find every character that is followed by the same character. Every found letter is stored in a capturing group (because of the brackets around), this capturing group is then reused by the positive lookahead assertion (the (?=...)) by using \1 (in \1 there is the matches character stored)
\p{L} is a unicode code point with the category "letter"
Code
String text = "aaaabaa";
Regex reg = new Regex(#"(\p{L})(?=\1)");
MatchCollection result = reg.Matches(text);
foreach (Match item in result) {
Console.WriteLine(item.Index);
}
Output
0
1
2
5
The following code should work with any regular expression without having to change the actual expression:
Regex rx = new Regex("(a)\1"); // or any other word you're looking for.
int position = 0;
string text = "aaaaabbbbccccaaa";
int textLength = text.Length;
Match m = rx.Match(text, position);
while (m != null && m.Success)
{
Console.WriteLine(m.Index);
if (m.Index <= textLength)
{
m = rx.Match(text, m.Index + 1);
}
else
{
m = null;
}
}
Console.ReadKey();
It uses the option to change the start index of a regex search for each consecutive search. The actual problem comes from the fact that the Regex engine, by default, will always continue searching after the previous match. So it will never find a possible match within another match, unless you instruct it to by using a Look ahead construction or by manually setting the start index.
Another, relatively easy, solution is to just stick the whole expression in a forward look ahead:
string expression = "(a)\1"
Regex rx2 = new Regex("(?=" + expression + ")");
MatchCollection ms = rx2.Matches(text);
var indexes = ms.Cast<Match>().Select(match => match.Index);
That way the engine will automatically advance the index by one for every match it finds.
From the docs:
When a match attempt is repeated by calling the NextMatch method, the regular expression engine gives empty matches special treatment. Usually, NextMatch begins the search for the next match exactly where the previous match left off. However, after an empty match, the NextMatch method advances by one character before trying the next match. This behavior guarantees that the regular expression engine will progress through the string. Otherwise, because an empty match does not result in any forward movement, the next match would start in exactly the same place as the previous match, and it would match the same empty string repeatedly.
Try this:
How can I find repeated characters with a regex in Java?
It is in java, but the regex and non-regex way is there. C# Regex is very similar to the Java way.

Regex - Find from both sides only if it has spaces

I need some help on Regex. I need to find a word that is surrounded by whatever element, for example - *. But I need to match it only if it has spaces or nothing on the ether sides. For example if it is at start of the text I can't really have space there, same for end.
Here is what I came up to
string myString = "You will find *me*, and *me* also!";
string findString = #"(\*(.*?)\*)";
string foundText;
MatchCollection matchCollection = Regex.Matches(myString, findString);
foreach (Match match in matchCollection)
{
foundText = match.Value.Replace("*", "");
myString = myString.Replace(match.Value, "->" + foundText + "<-");
match.NextMatch();
}
Console.WriteLine(myString);
You will find ->me<-, and ->me<- also!
Works correct, the problem is when I add * in the middle of text, I don't want it to match then.
Example: You will find *m*e*, and *me* also!
Output: You will find ->m<-e->, and <-me* also!
How can I fix that?
Try the following pattern:
string findString = #"(?<=\s|^)\*(.*?)\*(?=\s|$)";
(?<=\s|^)X will match any X only if preceded by a space-char (\s), or the start-of-input, and
X(?=\s|$) matches any X if followed by a space-char (\s), or the end-of-input.
Note that it will not match *me* in foo *me*, bar since the second * has a , after it! If you want to match that too, you need to include the comma like this:
string findString = #"(?<=[\s,]|^)\*(.*?)\*(?=[\s,]|$)";
You'll need to expand the set [\s,] as you see fit, of course. You might want to add !, ? and . at the very least: [\s,!?.] (and no, . and ? do not need to be escaped inside a character-set!).
EDIT
A small demo:
string Txt = "foo *m*e*, bar";
string Pattern = #"(?<=[\s,]|^)\*(.*?)\*(?=[\s,]|$)";
Console.WriteLine(Regex.Replace(Txt, Pattern, ">$1<"));
which would print:
>m*e<
You can add "beginning of line or space" and "space or end of line" around your match:
(^|\s)\*(.*?)\*(\s|$)
You'll now need to refer to the middle capture group for the match string.

Categories