How to find a string with missing fragments? - c#

I'm building a chatbot in C# using AIML files, at the moment I've this code to process:
<aiml>
<category>
<pattern>a * is a *</pattern>
<template>when a <star index="1"/> is not a <star index="2"/>?</template>
</category>
</aiml>
I would like to do something like:
if (user_string == pattern_string) return template_string;
but I don't know how to tell the computer that the star character can be anything, and expecially that can be more than one word!
I was thinking to do it with regular expressions, but I don't have enough experience with it. Can somebody help me? :)

Using Regex
static bool TryParse(string pattern, string text, out string[] wildcardValues)
{
// ^ and $ means that whole string must be matched
// Regex.Escape (http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.escape(v=vs.110).aspx)
// (.+) means capture at least one character and place it in match.Groups
var regexPattern = string.Format("^{0}$", Regex.Escape(pattern).Replace(#"\*", "(.+)"));
var match = Regex.Match(text, regexPattern, RegexOptions.Singleline);
if (!match.Success)
{
wildcardValues = null;
return false;
}
//skip the first one since it is the whole text
wildcardValues = match.Groups.Cast<Group>().Skip(1).Select(i => i.Value).ToArray();
return true;
}
Sample usage
string[] wildcardValues;
if(TryParse("Hello *. * * to *", "Hello World. Happy holidays to all", out wildcardValues))
{
//it's a match
//wildcardValues contains the values of the wildcard which is
//['World','Happy','holidays','all'] in this sample
}
By the way, you don't really need Regex for this, it's overkill. Just implement your own algorithm by splitting the pattern into tokens using string.Split then finding each token using string.IndexOf. Although using Regex does result in shorter code

Do you think this should work for you?
Match match = Regex.Match(pattern_string, #"<pattern>a [^<]+ is a [^<]+</pattern>");
if (match.Success)
{
// do something...
}
Here [^<]+ represents for one or more characters which is/are not <
If you think you may have < character in your *, then you can simply use .+ instead of [^<]+
But this will be risky as .+ means any characters having one or multiple times.

Related

How to remove substring after occurence of certain characters in a string?

I have the requirement as follows:
input => "Employee.Addresses[].Address.City"
output => "Empolyee.Addresses[].City"
(Address is removed which is present after [].)
input => "Employee.Addresses[].Address.Lanes[].Lane.Name"
output => "Employee.Addresses[].Lanes[].Name"
(Address is removed which is present after []. and Lane is removed which is present after [].)
How to do this in C#?
private static IEnumerable<string> Filter(string input)
{
var subWords = input.Split('.');
bool skip = false;
foreach (var word in subWords)
{
if (skip)
{
skip = false;
}
else
{
yield return word;
}
if (word.EndsWith("[]"))
{
skip = true;
}
}
}
And now you use it like this:
var filtered = string.Join(".", Filter(input));
How about a regular expression?
Regex rgx = new Regex(#"(?<=\[\])\..+?(?=\.)");
string output = rgx.Replace(input, String.Empty);
Explanation:
(?<=\[\]) //positive lookbehind for the brackets
\. //match literal period
.+? //match any character at least once, but as few times as possible
(?=\.) //positive lookahead for a literal period
Your description of what you need is lacking. Please correct me if I have understood it incorrectly.
You need to find the pattern "[]." and then remove everything after this pattern until the next dot .
If this is the case, I believe using a Regular Expression could solve the problem easily.
So, the pattern "[]." can be written in a regular expression as
"\[\]\."
Then you need to find everything after this pattern until the next dot: ".*?\." (The .*? means every character as many times as possible but in a non-greedy way, i.e. stopping at the first dot it finds).
So, the whole pattern would be:
var regexPattern = #"\[\]\..*?\.";
And you want to replace all matches of this pattern with "[]." (i.e. removing what was match after the brackets until the dot).
So you call the Replace method in the Regex class:
var result = Regex.Replace(input, regexPattern, "[].");

Match the last bracket

I have a string which contains some text followed by some brackets with different content (possibly empty). I need to extract the last bracket with its content:
atext[d][][ef] // should return "[ef]"
other[aa][][a] // should return "[a]"
xxxxx[][xx][x][][xx] // should return "[xx]"
yyyyy[] // should return "[]"
I have looked into RegexOptions.RightToLeft and read up on lazy vs greedy matching, but I can't for the life of me get this one right.
This regex will work
.*(\[.*\])
Regex Demo
More efficient and non-greedy version
.*(\[[^\]]*\])
C# Code
string input = "atext[d][][ef]\nother[aa][][a]\nxxxxx[][xx][x][][xx]\nyyyyy[]";
string pattern = "(?m).*(\\[.*\\])";
Regex rgx = new Regex(pattern);
Match match = rgx.Match(input);
while (match.Success)
{
Console.WriteLine(match.Groups[1].Value);
match = match.NextMatch();
}
Ideone Demo
It may give unexpected results for nested [] or unbalanced []
Alternatively, you could reverse the string using a function similar to this:
public static string Reverse( string s )
{
char[] charArray = s.ToCharArray();
Array.Reverse( charArray );
return new string( charArray );
}
And then you could perform a simple Regex search to just look for the first [someText] group or just use a for loop to iterate through and then stop when the first ] is reached.
With negative lookahead:
\[[^\]]*\](?!\[)
This is relatively efficient and flexible, without the evil .*. This will be also work with longer text which contains multiple instances.
Regex101 demo here
The correct way for .net is indeed to use the regex option RightToLeft with the appropriate method Regex.Match(String, String, RegexOptions).
In this way you keep the pattern very simple and efficient since it doesn't produce the less backtracking step and, since the pattern ends with a literal character (the closing bracket), allows a quick search for possible positions in the string where the pattern may succeeds before the "normal" walk of the regex engine.
public static void Main()
{
string input = #"other[aa][][a]";
string pattern = #"\[[^][]*]";
Match m = Regex.Match(input, pattern, RegexOptions.RightToLeft);
if (m.Success)
Console.WriteLine("Found '{0}' at position {1}.", m.Value, m.Index);
}

.NET Regex - "Not" Match

I have a regular expression:
12345678|[0]{8}|[1]{8}|[2]{8}|[3]{8}|[4]{8}|[5]{8}|[6]{8}|[7]{8}|[8]{8}|[9]{8}
which matches if the string contains 12345679 or 11111111 or 22222222 ... or ... 999999999.
How can I changed this to only match if NOT the above? (I am not able to just !IsMatch in the C# unfortunately)...EDIT because that is black box code to me and I am trying to set the regex in an existing config file
This will match everything...
foundMatch = Regex.IsMatch(SubjectString, #"^(?:(?!123456789|(\d)\1{7}).)*$");
unless one of the "forbidden" sequences is found in the string.
Not using !isMatch as you can see.
Edit:
Adding your second constraint can be done with a lookahead assertion:
foundMatch = Regex.IsMatch(SubjectString, #"^(?=\d{9,12})(?:(?!123456789|(\d)\1{7}).)*$");
Works perfectly
string s = "55555555";
Regex regx = new Regex(#"^(?:12345678|(\d)\1{7})$");
if (!regx.IsMatch(s)) {
Console.WriteLine("It does not match!!!");
}
else {
Console.WriteLine("it matched");
}
Console.ReadLine();
Btw. I simplified your expression a bit and added anchors
^(?:12345678|(\d)\1{7})$
The (\d)\1{7} part takes a digit \d and the \1 checks if this digit is repeated 7 more times.
Update
This regex is doing what you want
Regex regx = new Regex(#"^(?!(?:12345678|(\d)\1{7})$).*$");
First of all, you don't need any of those [] brackets; you can just do 0{8}|1{8}| etc.
Now for your problem. Try using a negative lookahead:
#"^(?:(?!123456789|(\d)\1{7}).)*$"
That should take care of your issue without using !IsMatch.
I am not able to just !IsMatch in the C# unfortunately.
Why not? What's wrong with the following solution?
bool notMatch = !Regex.Match(yourString, "^(12345678|[0]{8}|[1]{8}|[2]{8}|[3]{8}|[4]{8}|[5]{8}|[6]{8}|[7]{8}|[8]{8}|[9]{8})$");
That will match any string that contains more than just 12345678, 11111111, ..., 99999999

Simple regex question C#

I need to match the string that is shown in the window displayed below :
8% of setup_av_free.exe from software-files-l.cnet.com Completed
98% of test.zip from 65.55.72.119 Completed
[numeric]%of[filename]from[hostname | IP address]Completed
I have written the regex pattern halfway
if (Regex.IsMatch(text, #"[\d]+%[\s]of[\s](.+?)(\.[^.]*)[\s]from[\s]"))
MessageBox.Show(text);
and I now need to integrate the following regex into my code above
ValidIpAddressRegex = "^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$";
ValidHostnameRegex = "^(([a-zA-Z]|[a-zA-Z][a-zA-Z0-9\-]*[a-zA-Z0-9])\.)*([A-Za-z]|[A-Za-z][A-Za-z0-9\-]*[A-Za-z0-9])$";
The 2 regex were taken from this link. These 2 regex works well when i use the Regex.ismatch to match "123.123.123.123" and "software-files-l.cnet.com" . However i cannot get it to work when i intergrate both of them to my existin regex code. I tried several variant but not able to get it to work. Can someone guide me to integrate the 2 regex to my existing code. Thanks in advance.
You can certainly combine all these regular expressions into one, but I'd recommend against it. Consider this method, first it checks wether your input text has the correct form overall, then it checks if the "from" part is an IP address or a hostname.
bool CheckString(string text) {
const string ValidIpAddressRegex = #"^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$";
const string ValidHostnameRegex = #"^(([a-zA-Z]|[a-zA-Z][a-zA-Z0-9\-]*[a-zA-Z0-9])\.)*([A-Za-z]|[A-Za-z][A-Za-z0-9\-]*[A-Za-z0-9])$";
var match = Regex.Match(text, #"[\d]+%[\s]of[\s](.+?)(\.[^.]*)[\s]from[\s](\S+)");
if(!match.Success)
return false;
string address = match.Groups[3].Value;
return Regex.IsMatch(address, ValidIpAddressRegex) ||
Regex.IsMatch(address, ValidHostnameRegex);
}
It does what you want and is much more readable and than single monster-sized regular expression. If you aren't going to call this method millions of time in a loop there is no reason to be concerned about it being less performant that single regex.
Also, in case you aren't aware of that the brackets around \d or \s aren't necessary.
The "Problem" that those two regexes do not match your string is that they start with ^ and end with $
^ means match the start of the string (or row if the m modifier is activated)
$ means match the end of the string (or row if the m modifier is activated)
When you try it this is true but in your real text they are in the middle of the string, so it is not matched.
Try just remove the ^ at the very beginning and the $ at the very end.
Here you go.
^[\d]+%[\s+]of[\s+](.+?)(\.[^.]*)[\s+]from[\s+]((([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])|((([a-zA-Z]|[a-zA-Z][a-zA-Z0-9\-]*[a-zA-Z0-9])\.)*([A-Za-z]|[A-Za-z][A-Za-z0-9\-]*[A-Za-z0-9])))[\s+]Completed
Remove the ^ and $ characters from the ValidIpAddressRegex and ValidHostnameRegex samples above, and add them separated by the or character (|) enclosed by parentheses.
You could use this, its should work for all cases. I mightve accidentally deleted a character while formatting so let me know if it doesnt work.
string captureString = "8% of setup_av_free.exe from software-files-l.cnet.com Completed";
Regex reg = new Regex(#"(?<perc>\d+)% of (?<file>\w+\.\w+) from (?<host>" +
#"(\d+\.\d+.\d+.\d+)|(((https?|ftp|gopher|telnet|file|notes|ms-help):" +
#"((//)|(\\\\))+)?[\w\d:##%/;$()~_?\+-=\\\.&]*)) Completed");
Match m = reg.Match(captureString);
string perc = m.Groups["perc"].Value;
string file = m.Groups["file"].Value;
string host = m.Groups["host"].Value;

Google-like search query tokenization & string splitting

I'm looking to tokenize a search query similar to how Google does it. For instance, if I have the following search query:
the quick "brown fox" jumps over the "lazy dog"
I would like to have a string array with the following tokens:
the
quick
brown fox
jumps
over
the
lazy dog
As you can see, the tokens preserve the spaces with in double quotes.
I'm looking for some examples of how I could do this in C#, preferably not using regular expressions, however if that makes the most sense and would be the most performant, then so be it.
Also I would like to know how I could extend this to handle other special characters, for example, putting a - in front of a term to force exclusion from a search query and so on.
So far, this looks like a good candidate for RegEx's. If it gets significantly more complicated, then a more complex tokenizing scheme may be necessary, but your should avoid that route unless necessary as it is significantly more work. (on the other hand, for complex schemas, regex quickly turns into a dog and should likewise be avoided).
This regex should solve your problem:
("[^"]+"|\w+)\s*
Here is a C# example of its usage:
string data = "the quick \"brown fox\" jumps over the \"lazy dog\"";
string pattern = #"(""[^""]+""|\w+)\s*";
MatchCollection mc = Regex.Matches(data, pattern);
foreach(Match m in mc)
{
string group = m.Groups[0].Value;
}
The real benefit of this method is it can be easily extened to include your "-" requirement like so:
string data = "the quick \"brown fox\" jumps over " +
"the \"lazy dog\" -\"lazy cat\" -energetic";
string pattern = #"(-""[^""]+""|""[^""]+""|-\w+|\w+)\s*";
MatchCollection mc = Regex.Matches(data, pattern);
foreach(Match m in mc)
{
string group = m.Groups[0].Value;
}
Now I hate reading Regex's as much as the next guy, but if you split it up, this one is quite easy to read:
(
-"[^"]+"
|
"[^"]+"
|
-\w+
|
\w+
)\s*
Explanation
If possible match a minus sign, followed by a " followed by everything until the next "
Otherwise match a " followed by everything until the next "
Otherwise match a - followed by any word characters
Otherwise match as many word characters as you can
Put the result in a group
Swallow up any following space characters
I was just trying to figure out how to do this a few days ago. I ended up using Microsoft.VisualBasic.FileIO.TextFieldParser which did exactly what I wanted (just set HasFieldsEnclosedInQuotes to true). Sure it looks somewhat odd to have "Microsoft.VisualBasic" in a C# program, but it works, and as far as I can tell it is part of the .NET framework.
To get my string into a stream for the TextFieldParser, I used "new MemoryStream(new ASCIIEncoding().GetBytes(stringvar))". Not sure if this is the best way to do it.
Edit: I don't think this would handle your "-" requirement, so maybe the RegEx solution is better
Go char by char to the string like this: (sort of pseudo code)
array words = {} // empty array
string word = "" // empty word
bool in_quotes = false
for char c in search string:
if in_quotes:
if c is '"':
append word to words
word = "" // empty word
in_quotes = false
else:
append c to word
else if c is '"':
in_quotes = true
else if c is ' ': // space
if not empty word:
append word to words
word = "" // empty word
else:
append c to word
// Rest
if not empty word:
append word to words
I was looking for a Java solution to this problem and came up with a solution using #Michael La Voie's. Thought I would share it here despite the question being asked for in C#. Hope that's okay.
public static final List<String> convertQueryToWords(String q) {
List<String> words = new ArrayList<>();
Pattern pattern = Pattern.compile("(\"[^\"]+\"|\\w+)\\s*");
Matcher matcher = pattern.matcher(q);
while (matcher.find()) {
MatchResult result = matcher.toMatchResult();
if (result != null && result.group() != null) {
if (result.group().contains("\"")) {
words.add(result.group().trim().replaceAll("\"", "").trim());
} else {
words.add(result.group().trim());
}
}
}
return words;
}

Categories