C# Regex partial string match


Hi everyone, I've got the function below, which should return true if the input is a bad word:
public bool isAdultKeyword(string input)
{
if (input == null || input.Length == 0)
{
return false;
}
else
{
Regex regex = new Regex(@"\b(badword1|badword2|anotherbadword)\b");
return regex.IsMatch(input);
}
}
The function above only matches whole words: the input badword1 matches, but there is no match when the bad word is embedded in a longer string.
What I'm trying to do is get a match when any part of the input contains one of the bad words.

So under your logic, would you match as to ass?
Also, remember the classic place Scunthorpe - your adult filter needs to be able to allow this word through.

You probably don't have to do it in such a complex way, but you can try implementing Knuth-Morris-Pratt. I tried using it in one of my failed (totally my fault) OCR enhancer modules.

Try:
Regex regex = new Regex(@"(\bbadword1\b|\bbadword2\b|\banotherbadword\b)");
return regex.IsMatch(input);

Your method seems to be working fine. Can you clarify what is wrong with it? My tester program below shows it passing a number of tests with no failures.
using System;
using System.Text.RegularExpressions;
namespace CSharpConsoleSandbox {
class Program {
public static bool isAdultKeyword(string input) {
if (input == null || input.Length == 0) {
return false;
} else {
Regex regex = new Regex(@"\b(badword1|badword2|anotherbadword)\b");
return regex.IsMatch(input);
}
}
private static void test(string input) {
string matchMsg = "NO : ";
if (isAdultKeyword(input)) {
matchMsg = "YES: ";
}
Console.WriteLine(matchMsg + input);
}
static void Main(string[] args) {
// These cases should match
test("YES badword1");
test("YES this input should match badword2 ok");
test("YES this input should match anotherbadword. ok");
// These cases should not match
test("NO badword5");
test("NO this input will not matchbadword1 ok");
}
}
}
Output:
YES: YES badword1
YES: YES this input should match badword2 ok
YES: YES this input should match anotherbadword. ok
NO : NO badword5
NO : NO this input will not matchbadword1 ok

\b is the word-boundary anchor in a regular expression, so your pattern only looks for entire words.
Removing the \b anchors will match any occurrence of the bad words, including where one appears as part of a larger word.
Regex regex = new Regex(@"(bad|awful|worse)", RegexOptions.IgnoreCase);
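To make the difference concrete, here is a minimal sketch comparing the two patterns side by side. The names BadWordFilter, ContainsWholeWord and ContainsAnywhere are made up for illustration, the word list is the question's placeholder list, and the call to Regex.Escape assumes the words might come from outside the code.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class BadWordFilter
{
    // Hypothetical word list, taken from the question's placeholders.
    private static readonly string[] Words = { "badword1", "badword2", "anotherbadword" };

    // Escape each word so that regex metacharacters are treated literally.
    private static readonly string Alternation =
        string.Join("|", Words.Select(w => Regex.Escape(w)));

    // With \b on both sides, only whole words match.
    private static readonly Regex WholeWord =
        new Regex(@"\b(" + Alternation + @")\b", RegexOptions.IgnoreCase);

    // Without \b, the words match anywhere, even inside longer words.
    private static readonly Regex Anywhere =
        new Regex("(" + Alternation + ")", RegexOptions.IgnoreCase);

    public static bool ContainsWholeWord(string input) =>
        !string.IsNullOrEmpty(input) && WholeWord.IsMatch(input);

    public static bool ContainsAnywhere(string input) =>
        !string.IsNullOrEmpty(input) && Anywhere.IsMatch(input);
}

public static class Program
{
    public static void Main()
    {
        string embedded = "this will not matchbadword1 ok";
        Console.WriteLine(BadWordFilter.ContainsWholeWord(embedded)); // False
        Console.WriteLine(BadWordFilter.ContainsAnywhere(embedded));  // True
    }
}
```

Keep in mind that the "anywhere" variant is exactly the one that trips over place names like Scunthorpe, so which of the two you want depends on how many false positives you can tolerate.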

Related

Replacing anchor/link in text

I'm having trouble with a find-and-replace step in my function. I'm extracting <a href="link">anchor</a> elements from an article and replacing each one with the format [link anchor]. The link and anchor are dynamic, so I can't hard-code the values. What I have so far is:
public static string GetAndFixAnchor(string articleBody, string articleWikiCheck) {
string theString = string.Empty;
switch (articleWikiCheck) {
case "id|wpTextbox1":
StringBuilder newHtml = new StringBuilder(articleBody);
Regex r = new Regex(@"\<a href=\""([^\""]+)\"">([^<]+)");
string final = string.Empty;
foreach (var match in r.Matches(theString).Cast<Match>().OrderByDescending(m => m.Index))
{
string text = match.Groups[2].Value;
string newHref = "[" + match.Groups[1].Index + " " + match.Groups[1].Index + "]";
newHtml.Remove(match.Groups[1].Index, match.Groups[1].Length);
newHtml.Insert(match.Groups[1].Index, newHref);
}
theString = newHtml.ToString();
break;
default:
theString = articleBody;
break;
}
Helpers.ReturnMessage(theString);
return theString;
}
Currently it just returns the article unchanged, with the traditional anchor format: <a href="link">anchor</a>
Can anyone see what I have done wrong?
Regards

If your input is HTML, you should consider using a corresponding parser, HtmlAgilityPack being really helpful.
As for the current code, it looks too verbose. You may use a single Regex.Replace to perform the search and replace in one pass:
public static string GetAndFixAnchor(string articleBody, string articleWikiCheck) {
if (articleWikiCheck == "id|wpTextbox1")
{
return Regex.Replace(articleBody, @"<a\s+href=""([^""]+)"">([^<]+)", "[$1 $2]");
}
else
{
// Helpers.ReturnMessage(articleBody); // Uncomment if it is necessary
return articleBody;
}
}
See the regex demo.
The <a\s+href="([^"]+)">([^<]+) regex matches <a, 1 or more whitespaces, href=", then captures into Group 1 any one or more chars other than ", then matches "> and then captures into Group 2 any one or more chars other than <.
The [$1 $2] replacement replaces the matched text with [, Group 1 contents, space, Group 2 contents and a ].
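A small sketch of that replacement in action. The sample HTML and the names AnchorDemo and ToWikiLinks are invented for the example, and the optional trailing (?:</a>)? is my own addition, on the assumption that the closing tag should disappear too (the pattern as written above leaves it in the output).

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    // Same pattern as above, plus an optional closing tag (an assumption,
    // not part of the original answer).
    private static readonly Regex Anchor =
        new Regex(@"<a\s+href=""([^""]+)"">([^<]+)(?:</a>)?");

    // $1 is the href value, $2 is the anchor text.
    public static string ToWikiLinks(string html) =>
        Anchor.Replace(html, "[$1 $2]");
}

public static class Program
{
    public static void Main()
    {
        string html = "See <a href=\"http://example.com\">the example</a> for details.";
        Console.WriteLine(AnchorDemo.ToWikiLinks(html));
        // See [http://example.com the example] for details.
    }
}
```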

Updated (Corrected regex to support whitespaces and new lines)
You can try this expression
Regex r = new Regex(@"<[\s\n]*a[\s\n]*(([^\s]+\s*[ ]*=*[ ]*[\s|\n*]*('|"").*\3)[\s\n]*)*href[ ]*=[ ]*('|"")(?<link>.*)\4[.\n]*>(?<anchor>[\s\S]*?)[\s\n]*<\/[\s\n]*a>");
It will match your anchors even if they are split across multiple lines. It is this long because it allows whitespace between the tags and their values, and C# does not support subroutines, so the [\s\n]* part has to be repeated several times.
You can see a working sample at dotnetfiddle
You can use it in your example like this.
public static string GetAndFixAnchor(string articleBody, string articleWikiCheck) {
if (articleWikiCheck == "id|wpTextbox1")
{
return Regex.Replace(articleBody,
@"<[\s\n]*a[\s\n]*(([^\s]+\s*[ ]*=*[ ]*[\s|\n*]*('|"").*\3)[\s\n]*)*href[ ]*=[ ]*('|"")(?<link>.*)\4[.\n]*>(?<anchor>[\s\S]*?)[\s\n]*<\/[\s\n]*a>",
"[${link} ${anchor}]");
}
else
{
return articleBody;
}
}

C# Method to Check if a String Contains Certain Letters

I'm trying to create a method which takes two parameters, "word" and "input". The aim of the method is to print any word where all of its characters can be found in "input" no more than once (this is why the character is removed if a letter is found).
Not all the letters from "input" must be in "word" - eg, for input = "cacten" and word = "ace", word would be printed, but if word = "aced" then it would not.
However, when I run the program it produces unexpected results (words longer than "input", or containing letters not found in "input"). I have coded the solution several ways, all with the same outcome. This has stumped me for hours and I cannot work out what's going wrong. Any and all help will be greatly appreciated, thanks. My full code for the method is written below.
static void Program(string input, string word)
{
int letters = 0;
List<string> remaining = new List<string>();
foreach (char item in input)
{
remaining.Add(item.ToString());
}
input = remaining.ToString();
foreach (char letter in word)
{
string c = letter.ToString();
if (input.Contains(c))
{
letters++;
remaining.Remove(c);
input = remaining.ToString();
}
}
if (letters == word.Length)
{
Console.WriteLine(word);
}
}

OK, so just to go through where you are going wrong.
First, when you assign remaining.ToString() to your input variable, what you actually assign is System.Collections.Generic.List`1[System.String]. Calling ToString on a List just gives you the type of the list; it doesn't join all your characters back up. That's probably the main thing that is causing you issues.
Also, you are forcing everything into string types when you often don't need to. Because string already implements IEnumerable<char>, you can get your string as a list of chars by just calling myString.ToList().
So there is no need for this:
foreach (char item in input)
{
remaining.Add(item.ToString());
}
Things like string.Contains have overloads that take chars, so again there is no need to convert to string here:
foreach (char letter in word)
{
string c = letter.ToString();
if (input.Contains(c))
{
letters++;
remaining.Remove(c);
input = remaining.ToString();
}
}
You can just use the letter variable of type char and pass that into Contains, and because remaining is now a List<char> you can remove a char from it.
Again, don't reassign remaining.ToString() back into input. Use string.Join like this:
string.Join(string.Empty, remaining);
As someone else has posted, there are probably better ways of doing this, but I hope what I've put here helps you understand what was going wrong and helps you learn.
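Putting those points together, one possible corrected version looks like the sketch below. The names LetterCheck and CanBuild are my own, and I return a bool rather than printing so the result is easy to check. It relies on List<T>.Remove returning false when the item is not in the list, which also removes the need to rebuild the input string at all.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LetterCheck
{
    // Returns true when every letter of word can be taken from input,
    // using each letter of input at most once.
    public static bool CanBuild(string input, string word)
    {
        List<char> remaining = input.ToList();
        foreach (char letter in word)
        {
            // Remove deletes the first occurrence and returns false
            // when the letter is not (or no longer) available.
            if (!remaining.Remove(letter))
            {
                return false;
            }
        }
        return true;
    }
}

public static class Program
{
    public static void Main()
    {
        Console.WriteLine(LetterCheck.CanBuild("cacten", "ace"));  // True
        Console.WriteLine(LetterCheck.CanBuild("cacten", "aced")); // False
    }
}
```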

You can also use Regular Expression which was created for such scenarios.
bool IsMatch(string input, string word)
{
var pattern = string.Format("\\b[{0}]+\\b", input);
var r = new Regex(pattern);
return r.IsMatch(word);
}
I created a sample code for you on DotNetFiddle.
You can check what the pattern does at Regex101. It has a pretty "Explanation" and "Quick Reference" panel.

There are a lot of ways to achieve that, here is a suggestion:
static void Main(string[] args)
{
Func("cacten","ace");
Func("cacten", "aced");
Console.ReadLine();
}
static void Func(string input, string word)
{
bool isMatch = true;
foreach (Char s in word)
{
if (!input.Contains(s.ToString()))
{
isMatch = false;
break;
}
}
// success
if (isMatch)
{
Console.WriteLine(word);
}
// no match
else
{
Console.WriteLine("No Match");
}
}

Not really an answer to your question, but it's always fun to do this sort of thing with LINQ:
static void Print(string input, string word)
{
if (word.All(ch => input.Contains(ch) &&
word.GroupBy(c => c)
.All(g => g.Count() <= input.Count(c => c == g.Key))))
Console.WriteLine(word);
}
Functional programming is all about what you want without all the pesky loops, ifs and what nots... Notice that this code does what you'd do in your head without needing to painfully specify step by step how you'd actually do it:
Make sure all characters in word are in input.
Make sure all characters in word are used at most as many times as they are present in input.
Still, getting the basics right is a must, posted this answer as additional info.

Regex C# is it possible to use a variable in substitution?

I have a bunch of strings in a text that look something like this:
h1. this is the Header
h3. this one the header too
h111. and this
And I have a function that is supposed to process this text depending on the, let's say, iteration at which it is called:
public void ProcessHeadersInText(string inputText, int atLevel = 1)
so the output should look like the one below when it is called as
ProcessHeadersInText(inputText, 2)
Output should be:
<h3>this is the Header</h3>
<h5>this one the header too</h5>
<h9>and this</h9>
(The last one looks like this because if the value after the letter h is more than 9, it is supposed to be 9 in the output.)
So I started to think about using regex.
Here's the example: https://regex101.com/r/spb3Af/1/
(As you can see, I came up with the regex (^(h([\d]+)\.+?)(.+?)$) and tried the substitution <h$3>$4</h$3> on it.)
It's almost what I'm looking for, but I need to add some logic to handle the heading level.
Is it possible to work with variables in the substitution?
Or do I need to find another way (extract all headings first, replace them taking into account the function's variables and the header value, and only then use the regex I wrote)?

The regex you may use is
^h(\d+)\.+\s*(.+)
If you need to make sure the match does not span across line, you may replace \s with [^\S\r\n]. See the regex demo.
When replacing inside C#, parse Group 1 value to int and increment the value inside a match evaluator inside Regex.Replace method.
Here is the example code that will help you:
using System;
using System.Linq;
using System.Text.RegularExpressions;
using System.IO;
public class Test
{
// Demo: https://regex101.com/r/M9iGUO/2
public static readonly Regex reg = new Regex(@"^h(\d+)\.+\s*(.+)", RegexOptions.Compiled | RegexOptions.Multiline);
public static void Main()
{
var inputText = "h1. Topic 1\r\nblah blah blah, because of bla bla bla\r\nh2. PartA\r\nblah blah blah\r\nh3. Part a\r\nblah blah blah\r\nh2. Part B\r\nblah blah blah\r\nh1. Topic 2\r\nand its cuz blah blah\r\nFIN";
var res = ProcessHeadersInText(inputText, 2);
Console.WriteLine(res);
}
public static string ProcessHeadersInText(string inputText, int atLevel = 1)
{
return reg.Replace(inputText, m =>
string.Format("<h{0}>{1}</h{0}>",
// Add the level first, then cap the result at 9.
Math.Min(9, int.Parse(m.Groups[1].Value) + atLevel), m.Groups[2].Value.Trim()));
}
}
See the C# online demo
Note I am using .Trim() on m.Groups[2].Value as . matches \r. You may use TrimEnd('\r') to get rid of this char.

You can use a regex like the one below to fix your issue.
Regex.Replace(s, @"^(h\d+)\.(.*)$", @"<$1>$2<$1>", RegexOptions.Multiline)
Let me explain what I am doing:
// This will capture the header number which is followed
// by a '.' but ignore the . in the capture
(h\d+)\.
// This will capture the remaining of the string till the end
// of the line (see the multi-line regex option being used)
(.*)$
The parentheses capture the matched text into groups that can be referenced as "$1" for the first capture and "$2" for the second capture.

Try this:
private static string ProcessHeadersInText(string inputText, int atLevel = 1)
{
// Group 1 = value after 'h'
// Group 2 = Content of header without leading whitespace
string pattern = @"^h(\d+)\.\s*(.*?)\r?$";
return Regex.Replace(inputText, pattern, match => EvaluateHeaderMatch(match, atLevel), RegexOptions.Multiline);
}
private static string EvaluateHeaderMatch(Match m, int atLevel)
{
int hVal = int.Parse(m.Groups[1].Value) + atLevel;
if (hVal > 9) { hVal = 9; }
return $"<h{hVal}>{m.Groups[2].Value}</h{hVal}>";
}
Then just call
ProcessHeadersInText(input, 2);
This uses the Regex.Replace(string, string, MatchEvaluator, RegexOptions) overload with a custom evaluator function.
You could of course streamline this solution into a single function with an inline lambda expression:
public static string ProcessHeadersInText(string inputText, int atLevel = 1)
{
string pattern = @"^h(\d+)\.\s*(.*?)\r?$";
return Regex.Replace(inputText, pattern,
match =>
{
int hVal = int.Parse(match.Groups[1].Value) + atLevel;
if (hVal > 9) { hVal = 9; }
return $"<h{hVal}>{match.Groups[2].Value}</h{hVal}>";
},
RegexOptions.Multiline);
}

There are a lot of good solutions in this thread, but I don't think you really need a regex solution for your problem. For fun and as a challenge, here is a non-regex solution:
Try it online!
using System;
using System.Linq;
public class Program
{
public static void Main()
{
string extractTitle(string x) => x.Substring(x.IndexOf(". ") + 2);
string extractNumber(string x) => x.Remove(x.IndexOf(". ")).Substring(1);
string build(string n, string t) => $"<h{n}>{t}</h{n}>";
var inputs = new [] {
"h1. this is the Header",
"h3. this one the header too",
"h111. and this" };
foreach (var line in inputs.Select(x => build(extractNumber(x), extractTitle(x))))
{
Console.WriteLine(line);
}
}
}
I use C# 7 local functions and C# 6 interpolated strings. If you want, I can use more legacy C#. The code should be easy to read; I can add comments if needed.
C#5 version
using System;
using System.Linq;
public class Program
{
static string extractTitle(string x)
{
return x.Substring(x.IndexOf(". ") + 2);
}
static string extractNumber(string x)
{
return x.Remove(x.IndexOf(". ")).Substring(1);
}
static string build(string n, string t)
{
return string.Format("<h{0}>{1}</h{0}>", n, t);
}
public static void Main()
{
var inputs = new []{
"h1. this is the Header",
"h3. this one the header too",
"h111. and this"
};
foreach (var line in inputs.Select(x => build(extractNumber(x), extractTitle(x))))
{
Console.WriteLine(line);
}
}
}

remove text in between delimiters in a string - regex

I have been trying hard to understand regular expressions. Is there any way I can replace the character(s) between two delimiters? For example, I have:
string datax = "a4726e1e-babb-4898-a5d5-e29d2bc40028;POPULATE DATA AØ99c1d133-15f5-4ef5-bc59- d9ed673b70c6;POPULATE DATA BØ";
How can I remove the string between ";" and "Ø"?
I tried code like this:
string xresult = Regex.Replace(datax, @"(?<=;)(\w+?)(?=Ø)", "");
but it does not work.
Please correct it and suggest a solution. Thanks.
I want the result to look like this:
string datax = "a4726e1e-babb-4898-a5d5-e29d2bc40028;Ø99c1d133-15f5-4ef5-bc59-d9ed673b70c6;Ø";

I think you need to understand regex a little better, and how the Replace function works. With regex you define capture groups, and with the Replace function you replace those groups.
how to remove string between regex ";" and "Ø" ???
Step 1: First find ";",then capture all characters up to and including "Ø".
That's (;.*?Ø)
( New Capture Group
; Match ";"
. Match Anything
* Zero or more times
? Be Lazy
Ø Match "Ø"
) End Capture
Step 2: Replace each group with ";Ø"
public static string Replace(string input, string pattern, string replacement)
So you need to put back the ";Ø" you removed from the original capture.
static void Test2()
{
foreach (string item in SO2588078())
{
Console.WriteLine(item);
}
string input = "a4726e1e-babb-4898-a5d5-e29d2bc40028;POPULATE DATA AØ99c1d133-15f5-4ef5-bc59- d9ed673b70c6;POPULATE DATA BØ";
string regex = "(;.*?Ø)";
string output = Regex.Replace(input, regex, ";Ø");
if (output == string.Join(";Ø", SO2588078()) + ";Ø")
{
Console.WriteLine("TRUE");
}
}
An alternative would be to parse the string without regex. It's a simple format and this gives you more control over the process so you can see what's happening, why it's gone wrong and why it gives the results it does. Since you can step through it.
private static IEnumerable<string> SO2588078()
{
string datax = "a4726e1e-babb-4898-a5d5-e29d2bc40028;POPULATE DATA AØ99c1d133-15f5-4ef5-bc59- d9ed673b70c6;POPULATE DATA BØ";
string temp = datax;
while (!string.IsNullOrEmpty(temp))
{
int index1 = temp.IndexOf(';');
if (index1 > -1)
{
string guid = temp.Remove(index1);
yield return guid;
int index2 = temp.IndexOf('Ø');
if (index2 > -1)
{
temp = temp.Substring(index2 + 1);
}
else
{
temp = null;
}
}
else
{
temp = null;
}
}
}
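For comparison, the lookaround pattern from the question can also be made to work. As far as I can tell it fails because \w does not match the spaces in "POPULATE DATA A", so the lazy \w+? can never reach the Ø. Swapping it for [^Ø]* is one possible fix; the sketch below uses shortened sample data and the made-up names DelimiterDemo and StripBetween.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DelimiterDemo
{
    // The lookbehind and lookahead keep ';' and 'Ø' out of the match,
    // so replacing with "" removes only the text between them.
    private static readonly Regex Between = new Regex(@"(?<=;)[^Ø]*(?=Ø)");

    public static string StripBetween(string input) => Between.Replace(input, "");
}

public static class Program
{
    public static void Main()
    {
        // Shortened stand-in for the question's data.
        string datax = "id-1;POPULATE DATA AØid-2;POPULATE DATA BØ";
        string result = DelimiterDemo.StripBetween(datax);
        // result is "id-1;Øid-2;Ø"
        Console.WriteLine(result);
    }
}
```

With this form there is nothing to put back in the replacement string, because the delimiters were never part of the match.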

Match a string against an easy pattern

I am trying to future-proof a program I am creating so that the pattern users must enter is not hard-coded. There is always a chance that the letter or number pattern will change, but when it does I need everyone to remain consistent. I also want the managers to be able to control what goes in without relying on me. Is it possible to use regex or another string tool to compare input against a list stored in a database? I want it to be easy, so the patterns stored in the database would look like X###### or X######-X####### and so on.

Sure, just store the regular expression rules in a string column in a table and then load them into an IEnumerable<Regex> in your app. Then, a match is simply if ANY of those rules match. Beware that conflicting rules could be prone to greedy race (first one to be checked wins) so you'd have to be careful there. Also be aware that there are many optimizations that you could perform beyond my example, which is designed to be simple.
List<string> regexStrings = db.GetRegexStrings();
var result = new List<Regex>(regexStrings.Count);
foreach (var regexString in regexStrings)
{
result.Add(new Regex(regexString));
}
...
// The check
bool matched = result.Any(i => i.IsMatch(testInput));

You could store your patterns as-is in your database, and then translate them to regexes.
I don't know specifically what characters you'd need in your format, but let's suppose you just want to substitute a number to # and leave the rest as-is, here's some code for that:
public static Regex ConvertToRegex(string pattern)
{
var sb = new StringBuilder();
sb.Append("^");
foreach (var c in pattern)
{
switch (c)
{
case '#':
sb.Append(@"\d");
break;
default:
sb.Append(Regex.Escape(c.ToString()));
break;
}
}
sb.Append("$");
return new Regex(sb.ToString());
}
You can also use options like RegexOptions.IgnoreCase if that's what you need.
NB: For some reason, Regex.Escape escapes the # character, even though it's not special... So I just went for the character-by-character approach.
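A usage sketch that combines this conversion with the "any rule matches" check from the previous answer. The mask list is hard-coded here as a stand-in for whatever the database returns, and the names MaskMatcher and MatchesAny are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

public static class MaskMatcher
{
    // Translates a mask such as "X######" into an anchored regex:
    // '#' becomes a digit, every other character is matched literally.
    public static Regex ConvertToRegex(string mask)
    {
        var sb = new StringBuilder("^");
        foreach (char c in mask)
        {
            sb.Append(c == '#' ? @"\d" : Regex.Escape(c.ToString()));
        }
        sb.Append("$");
        return new Regex(sb.ToString());
    }

    // True when the input satisfies at least one of the masks.
    public static bool MatchesAny(IEnumerable<string> masks, string input) =>
        masks.Select(m => ConvertToRegex(m)).Any(r => r.IsMatch(input));
}

public static class Program
{
    public static void Main()
    {
        // Stand-in for the masks loaded from the database.
        var masks = new[] { "X######", "X######-X#######" };
        Console.WriteLine(MaskMatcher.MatchesAny(masks, "X123456"));          // True
        Console.WriteLine(MaskMatcher.MatchesAny(masks, "X12345"));           // False
        Console.WriteLine(MaskMatcher.MatchesAny(masks, "X123456-X1234567")); // True
    }
}
```

In a real application you would probably build the Regex objects once when the masks are loaded instead of on every check.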

private bool TestMethod()
{
const string textPattern = "X###";
string text = textBox1.Text;
bool match = true;
if (text.Length == textPattern.Length)
{
char[] chrStr = text.ToCharArray();
char[] chrPattern = textPattern.ToCharArray();
int length = text.Length;
for (int i = 0; i < length; i++)
{
if (chrPattern[i] != '#')
{
if (chrPattern[i] != chrStr[i])
{
return false;
}
}
}
}
else
{
return false;
}
return match;
}
This is doing everything I need it to do now. Thanks for all the tips though. I will have to look into the regex more in the future.

Using MaskedTextProvider, you could do something like this:
using System.Globalization;
using System.ComponentModel;
string pattern = "X&&&&&&-X&&&&&&&";
string text = "Xabcdef-Xasdfghi";
var culture = CultureInfo.GetCultureInfo("sv-SE");
var matcher = new MaskedTextProvider(pattern, culture);
int position;
MaskedTextResultHint hint;
if (!matcher.Set(text, out position, out hint))
{
Console.WriteLine("Error at {0}: {1}", position, hint);
}
else if (!matcher.MaskCompleted)
{
Console.WriteLine("Not enough characters");
}
else if (matcher.ToString() != text)
{
Console.WriteLine("Missing literals");
}
else
{
Console.WriteLine("OK");
}
For a description of the format, see: http://msdn.microsoft.com/en-us/library/system.windows.forms.maskedtextbox.mask
