Regex C# is it possible to use a variable in substitution? - c#

I have a bunch of strings in a text that look something like this:
h1. this is the Header
h3. this one the header too
h111. and this
And I have a function that is supposed to process this text depending on, let's say, the iteration at which it is called:
public void ProcessHeadersInText(string inputText, int atLevel = 1)
so the output should look like the one below when it is called as
ProcessHeadersInText(inputText, 2)
Output should be:
<h3>this is the Header</h3>
<h5>this one the header too</h5>
<h9>and this</h9>
(The last one looks like this because if the value after the letter h is more than 9, it is supposed to be 9 in the output.)
So, I started to think about using regex.
Here's the example https://regex101.com/r/spb3Af/1/
(As you can see, I came up with the regex (^(h([\d]+)\.+?)(.+?)$) and tried to use the substitution <h$3>$4</h$3> on it.)
It's almost what I'm looking for, but I need to add some logic to work with the heading level.
Is it possible to work with variables in a substitution?
Or do I need to find another way? (Extract all headings first, replace them considering the function variables and the value of the header, and only then use the regex I wrote?)

The regex you may use is
^h(\d+)\.+\s*(.+)
If you need to make sure the match does not span across lines, you may replace \s with [^\S\r\n]. See the regex demo.
When replacing in C#, parse the Group 1 value to int and increment it inside a match evaluator passed to the Regex.Replace method.
Here is example code that will help you:
using System;
using System.Linq;
using System.Text.RegularExpressions;
using System.IO;
public class Test
{
    // Demo: https://regex101.com/r/M9iGUO/2
    public static readonly Regex reg = new Regex(@"^h(\d+)\.+\s*(.+)", RegexOptions.Compiled | RegexOptions.Multiline);
    public static void Main()
    {
        var inputText = "h1. Topic 1\r\nblah blah blah, because of bla bla bla\r\nh2. PartA\r\nblah blah blah\r\nh3. Part a\r\nblah blah blah\r\nh2. Part B\r\nblah blah blah\r\nh1. Topic 2\r\nand its cuz blah blah\r\nFIN";
        var res = ProcessHeadersInText(inputText, 2);
        Console.WriteLine(res);
    }
    public static string ProcessHeadersInText(string inputText, int atLevel = 1)
    {
        // Shift the heading level by atLevel and cap the result at 9
        return reg.Replace(inputText, m =>
            string.Format("<h{0}>{1}</h{0}>",
                Math.Min(9, int.Parse(m.Groups[1].Value) + atLevel),
                m.Groups[2].Value.Trim()));
    }
}
See the C# online demo
Note I am using .Trim() on m.Groups[2].Value as . matches \r. You may use TrimEnd('\r') to get rid of this char.
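The carriage-return remark can be checked with a small sketch. It assumes Windows-style \r\n line endings in the input; the class and method names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CarriageReturnDemo
{
    // Returns the raw Group 1 text of the first header line, without trimming.
    public static string FirstHeaderText(string input)
    {
        Match m = Regex.Match(input, @"^h\d+\.\s*(.+)", RegexOptions.Multiline);
        return m.Groups[1].Value;
    }

    public static void Main()
    {
        string raw = FirstHeaderText("h1. Topic 1\r\nbody");
        // In .NET, '.' matches everything except '\n', so the '\r' is captured too.
        Console.WriteLine(raw.EndsWith("\r", StringComparison.Ordinal));
        // TrimEnd('\r') removes only that trailing character.
        Console.WriteLine(raw.TrimEnd('\r') == "Topic 1");
    }
}
```

TrimEnd('\r') leaves other trailing whitespace in place, whereas Trim() removes whitespace on both sides.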

You can use a regex like the one below to fix your issue.
Regex.Replace(s, @"^(h\d+)\.(.*)$", @"<$1>$2</$1>", RegexOptions.Multiline)
Let me explain what I am doing:
// This will capture the header number which is followed
// by a '.' but ignore the . in the capture
(h\d+)\.
// This will capture the remaining of the string till the end
// of the line (see the multi-line regex option being used)
(.*)$
The parentheses capture the matched text into groups that can be referenced as "$1" for the first capture and "$2" for the second capture.
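A substitution string can only copy captured text; it cannot do arithmetic on it. The following is a minimal sketch of that difference, assuming the heading level should be shifted by a caller-supplied offset and capped at 9. The class, method and parameter names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SubstitutionDemo
{
    // Static substitution: groups are copied verbatim, no computation is possible.
    public static string WithSubstitution(string input)
    {
        return Regex.Replace(input, @"^(h\d+)\.\s*(.*)$", "<$1>$2</$1>", RegexOptions.Multiline);
    }

    // MatchEvaluator: arbitrary C# runs for every match, so variables can be used.
    public static string WithEvaluator(string input, int offset)
    {
        return Regex.Replace(input, @"^h(\d+)\.\s*(.*)$", m =>
        {
            int level = Math.Min(9, int.Parse(m.Groups[1].Value) + offset);
            return "<h" + level + ">" + m.Groups[2].Value + "</h" + level + ">";
        }, RegexOptions.Multiline);
    }

    public static void Main()
    {
        Console.WriteLine(WithSubstitution("h1. this is the Header")); // <h1>this is the Header</h1>
        Console.WriteLine(WithEvaluator("h1. this is the Header", 2)); // <h3>this is the Header</h3>
        Console.WriteLine(WithEvaluator("h111. and this", 2));         // <h9>and this</h9>
    }
}
```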

Try this:
private static string ProcessHeadersInText(string inputText, int atLevel = 1)
{
    // Group 1 = value after 'h'
    // Group 2 = Content of header without leading whitespace
    string pattern = @"^h(\d+)\.\s*(.*?)\r?$";
    return Regex.Replace(inputText, pattern, match => EvaluateHeaderMatch(match, atLevel), RegexOptions.Multiline);
}
private static string EvaluateHeaderMatch(Match m, int atLevel)
{
    int hVal = int.Parse(m.Groups[1].Value) + atLevel;
    if (hVal > 9) { hVal = 9; }
    return $"<h{hVal}>{m.Groups[2].Value}</h{hVal}>";
}
Then just call
ProcessHeadersInText(input, 2);
This uses the Regex.Replace(string, string, MatchEvaluator, RegexOptions) overload with a custom evaluator function.
You could of course streamline this solution into a single function with an inline lambda expression:
public static string ProcessHeadersInText(string inputText, int atLevel = 1)
{
    string pattern = @"^h(\d+)\.\s*(.*?)\r?$";
    return Regex.Replace(inputText, pattern,
        match =>
        {
            int hVal = int.Parse(match.Groups[1].Value) + atLevel;
            if (hVal > 9) { hVal = 9; }
            return $"<h{hVal}>{match.Groups[2].Value}</h{hVal}>";
        },
        RegexOptions.Multiline);
}

A lot of good solutions in this thread, but I don't think you really need a regex solution for your problem. For fun and challenge, here is a non-regex solution:
Try it online!
using System;
using System.Linq;
public class Program
{
    public static void Main()
    {
        string extractTitle(string x) => x.Substring(x.IndexOf(". ") + 2);
        string extractNumber(string x) => x.Remove(x.IndexOf(". ")).Substring(1);
        string build(string n, string t) => $"<h{n}>{t}</h{n}>";
        var inputs = new [] {
            "h1. this is the Header",
            "h3. this one the header too",
            "h111. and this" };
        foreach (var line in inputs.Select(x => build(extractNumber(x), extractTitle(x))))
        {
            Console.WriteLine(line);
        }
    }
}
I use C# 7 local functions and C# 6 interpolated strings. If you want, I can use more legacy C#. The code should be easy to read; I can add comments if needed.
C# 5 version:
using System;
using System.Linq;
public class Program
{
    static string extractTitle(string x)
    {
        return x.Substring(x.IndexOf(". ") + 2);
    }
    static string extractNumber(string x)
    {
        return x.Remove(x.IndexOf(". ")).Substring(1);
    }
    static string build(string n, string t)
    {
        return string.Format("<h{0}>{1}</h{0}>", n, t);
    }
    public static void Main()
    {
        var inputs = new []{
            "h1. this is the Header",
            "h3. this one the header too",
            "h111. and this"
        };
        foreach (var line in inputs.Select(x => build(extractNumber(x), extractTitle(x))))
        {
            Console.WriteLine(line);
        }
    }
}
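The non-regex versions above copy the heading number as-is. A possible extension that also applies the level shift and the cap at 9 from the question might look like the sketch below. The HeaderLines class and ConvertLine method are hypothetical names, and lines that do not look like "hN. title" are assumed to pass through unchanged.

```csharp
using System;

public static class HeaderLines
{
    // Converts one "hN. title" line; anything else is returned unchanged.
    public static string ConvertLine(string line, int atLevel)
    {
        int dot = line.IndexOf(". ", StringComparison.Ordinal);
        if (dot < 2 || line[0] != 'h') return line;

        int n;
        if (!int.TryParse(line.Substring(1, dot - 1), out n)) return line;

        int level = Math.Min(9, n + atLevel);   // shift the level, cap at 9
        string title = line.Substring(dot + 2); // text after ". "
        return string.Format("<h{0}>{1}</h{0}>", level, title);
    }

    public static void Main()
    {
        var inputs = new[] { "h1. this is the Header", "h3. this one the header too", "h111. and this" };
        foreach (var line in inputs)
        {
            Console.WriteLine(ConvertLine(line, 2));
        }
    }
}
```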

Related

How to check how many times a string exist in another string

I want to see how many times a string occurs in another string. For example, I want to see how many times 2018 occurs in this paragraph:
zaeazeaze2018
azeazeazeazeaze2018azezaaze
azeaze4azeaze2018
In this case it occurs 3 times.
I tried the following code,
but the problem is that it always returns 0,
and I can't find the mistake here:
public static string count(string k)
{
int i = 0;
foreach(var line in k)
{
if (line.ToString().Contains("Bestellung sehen"))
{
i++;
i = +i;
}
}
return i.ToString();
}
Use this:
string text = "Hello2018,world2018\r\nWe have five 2018 here\r\n2018is coming2018";
int Counter = Regex.Matches(text, "2018").Count;
Console.WriteLine(Counter.ToString()); // writes: 5
You can use Regular Expressions to handle such cases. Regular expressions give you good flexibility over your pattern matching in a string. In your case, I have prepared a sample code for you using Regular Expressions:
using System;
using System.Text.RegularExpressions;
public class Program
{
    public static void Main()
    {
        string str="zaeazeaze2018azeazeazeazeaze2018azezaazeazeaze4azeaze2018";
        string regexPattern = @"2018";
        int numberOfOccurence = Regex.Matches(str, regexPattern).Count;
        Console.WriteLine(numberOfOccurence);
    }
}
Working example: https://dotnetfiddle.net/PGgbm8
If you look at the line string regexPattern = @"2018";, it sets the pattern to find all occurrences of 2018 in your string. You can change this pattern according to what you require. A simple example: if I changed the pattern to string regexPattern = @"\d+";, it would give me 4 as output, because that pattern matches every run of digits in the string.
This can be accomplished using Regular Expressions with the following:
using System.Text.RegularExpressions;
public static int count(string fullString, string searchPattern)
{
int i = Regex.Matches(fullString, searchPattern).Count;
return i;
}
For example, the following returns 2 as an int, not string:
count("asdfasdfasfdfindmeasdfadfasdasdfasdffindmesadf","findme")
I find this is quick enough for most of my use cases.
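One caveat with the method above: the second argument is interpreted as a regular expression, so a search term containing characters such as . or + is treated as a pattern rather than literal text. Below is a minimal sketch of two literal-counting alternatives, assuming non-overlapping occurrences are wanted. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OccurrenceCounter
{
    // Escapes the search term so regex metacharacters are matched literally.
    public static int CountWithRegex(string text, string term)
    {
        return Regex.Matches(text, Regex.Escape(term)).Count;
    }

    // Plain IndexOf loop, no regex involved; counts non-overlapping occurrences.
    public static int CountWithIndexOf(string text, string term)
    {
        if (string.IsNullOrEmpty(term)) return 0;
        int count = 0;
        int pos = 0;
        while ((pos = text.IndexOf(term, pos, StringComparison.Ordinal)) >= 0)
        {
            count++;
            pos += term.Length; // continue searching after this occurrence
        }
        return count;
    }

    public static void Main()
    {
        Console.WriteLine(CountWithRegex("a.b a.b axb", "a.b"));   // 2
        Console.WriteLine(CountWithIndexOf("a.b a.b axb", "a.b")); // 2
    }
}
```

Without Regex.Escape, the pattern a.b would also match axb and the count would be 3.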
str is the string passed to your count method
str2 is the substring to look for, which is Bestellung sehen
int n = str2.Length;
int k = 0;
// Compare the n-character window starting at each position with str2
for (int i = 0; i + n <= str.Length; i++)
{
    if (str.Substring(i, n) == str2)
    {
        k++;
    }
}
return k.ToString();

Replacing anchor/link in text

I'm having issues doing a find/replace type of action in my function. I'm extracting the <a href="link">anchor from an article and replacing it with this format: [link anchor]. The link and anchor will be dynamic, so I can't hard-code the values. What I have so far is:
public static string GetAndFixAnchor(string articleBody, string articleWikiCheck) {
string theString = string.Empty;
switch (articleWikiCheck) {
case "id|wpTextbox1":
StringBuilder newHtml = new StringBuilder(articleBody);
Regex r = new Regex(@"\<a href=\""([^\""]+)\"">([^<]+)");
string final = string.Empty;
foreach (var match in r.Matches(theString).Cast<Match>().OrderByDescending(m => m.Index))
{
string text = match.Groups[2].Value;
string newHref = "[" + match.Groups[1].Index + " " + match.Groups[1].Index + "]";
newHtml.Remove(match.Groups[1].Index, match.Groups[1].Length);
newHtml.Insert(match.Groups[1].Index, newHref);
}
theString = newHtml.ToString();
break;
default:
theString = articleBody;
break;
}
Helpers.ReturnMessage(theString);
return theString;
}
Currently, it just returns the article as it originally is, with the traditional anchor format: <a href="link">anchor
Can anyone see what I have done wrong?
regards
If your input is HTML, you should consider using a corresponding parser, HtmlAgilityPack being really helpful.
As for the current code, it looks too verbose. You may use a single Regex.Replace to perform the search and replace in one pass:
public static string GetAndFixAnchor(string articleBody, string articleWikiCheck) {
    if (articleWikiCheck == "id|wpTextbox1")
    {
        return Regex.Replace(articleBody, @"<a\s+href=""([^""]+)"">([^<]+)", "[$1 $2]");
    }
    else
    {
        // Helpers.ReturnMessage(articleBody); // Uncomment if it is necessary
        return articleBody;
    }
}
See the regex demo.
The <a\s+href="([^"]+)">([^<]+) regex matches <a, 1 or more whitespaces, href=", then captures into Group 1 any one or more chars other than ", then matches "> and then captures into Group 2 any one or more chars other than <.
The [$1 $2] replacement replaces the matched text with [, Group 1 contents, space, Group 2 contents and a ].
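The pattern above stops at the first < after the anchor text, so a closing </a> tag, if present, stays in the output. A minimal sketch of a variant that consumes it as well follows. It assumes the anchors have the simple form <a href="...">text</a> with no other attributes, and the AnchorRewriter class and ToWikiLinks method are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorRewriter
{
    // Assumption: anchors look like <a href="...">text</a> with no other attributes.
    private static readonly Regex Anchor =
        new Regex(@"<a\s+href=""([^""]+)"">([^<]+)</a>", RegexOptions.IgnoreCase);

    // Rewrites every matching anchor as [link text].
    public static string ToWikiLinks(string html)
    {
        return Anchor.Replace(html, m => "[" + m.Groups[1].Value + " " + m.Groups[2].Value.Trim() + "]");
    }

    public static void Main()
    {
        Console.WriteLine(ToWikiLinks("See <a href=\"http://example.com\">the site</a> now."));
        // See [http://example.com the site] now.
    }
}
```

For anything beyond this simple shape (extra attributes, nested tags), an HTML parser is likely the safer choice.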
Updated (Corrected regex to support whitespaces and new lines)
You can try this expression
Regex r = new Regex(@"<[\s\n]*a[\s\n]*(([^\s]+\s*[ ]*=*[ ]*[\s|\n*]*('|"").*\3)[\s\n]*)*href[ ]*=[ ]*('|"")(?<link>.*)\4[.\n]*>(?<anchor>[\s\S]*?)[\s\n]*<\/[\s\n]*a>");
It will match your anchors even if they are split across multiple lines. The reason it is so long is that it allows whitespace between the tags and their values, and C# regexes do not support subroutines, so the [\s\n]* part has to be repeated multiple times.
You can see a working sample at dotnetfiddle
You can use it in your example like this.
public static string GetAndFixAnchor(string articleBody, string articleWikiCheck) {
    if (articleWikiCheck == "id|wpTextbox1")
    {
        return Regex.Replace(articleBody,
            @"<[\s\n]*a[\s\n]*(([^\s]+\s*[ ]*=*[ ]*[\s|\n*]*('|"").*\3)[\s\n]*)*href[ ]*=[ ]*('|"")(?<link>.*)\4[.\n]*>(?<anchor>[\s\S]*?)[\s\n]*<\/[\s\n]*a>",
            "[${link} ${anchor}]");
    }
    else
    {
        return articleBody;
    }
}

C# - How can stop recursive function?

I want to replace all the words that start with # with another word. Here is my code:
public string SemiFinalText { get; set; }
public string FinalText { get; set; }
//sample text : "aaaa bbbb #cccc dddd #eee fff g"
public string GetProperText(string text)
{
if (text.Contains('#'))
{
int index = text.IndexOf('#');
string restText = text.Substring(index);
var indexLast = restText.IndexOf(' ');
var oldName = text.Substring(index, indexLast);
string restText2 = text.Substring( index + indexLast);
SemiFinalText += text.Substring(0, index + indexLast).Replace(oldName, "#New");
if (restText2.Contains('#'))
{
GetProperText(restText2);
}
FinalText = SemiFinalText + restText2;
return FinalText;
}
else
{
return text;
}
}
When return FinalText; is executed, I want to stop the recursive function. How can I fix it?
Maybe another approach is better than a recursive function. If you know another way, please give me an answer.
You don't need a recursive solution for this problem. You have a string containing a number of words (separated by spaces) and you want to replace the ones starting with '#' with another string. You can modify your solution into a simple method that splits on spaces, replaces all words starting with # and then joins them again.
Using Linq:
string text = "aaaa bbbb #cccc dddd #eee fff g";
FinalText = GetProperText(text, "New");
public string GetProperText(string text, string replacewith)
{
text = string.Join(" ", text.Split(' ').Select(x => x.StartsWith("#") ? replacewith: x));
return text;
}
Output: aaaa bbbb New dddd New fff g
Using Regex:
Regex rgx = new Regex("#([^ #])*");
string result = rgx.Replace(text, replaceword);
Solution with Regular Expressions:
using System;
using System.Text.RegularExpressions;
public class Program
{
    public static void Main()
    {
        string pattern = @"#\w+";
        var r = new Regex(pattern);
        Console.WriteLine(r.Replace("ABC #ABC ABC #DEF klm.#bhsh", "BOOM!"));
    }
}
This does not rely on the space character being the delimiter; any non-word character (anything other than letters, digits and the underscore) can separate the 'words'. This example outputs:
ABC BOOM! ABC BOOM! klm.BOOM!
You can test it out here: https://dotnetfiddle.net/rZyjjg
If you're new to Regex: .NET Introduction to Regular Expressions
Here is also the proper way to do it recursively, for anyone interested. I think your stopping condition was actually okay, but you should concatenate the outcome of the recursive function call to the already processed text. Also, I think that using global variables in a recursive function defeats its purpose a little bit.
That being said, I think that using the Regex from one of the supplied answers is better and faster.
The recursive code:
//sample text : "aaaa bbbb #cccc dddd #eee fff g"
public string GetProperText(string text)
{
    int index = text.IndexOf('#');            // Index of the first '#'
    if (index < 0)
    {
        return text;                          // Stopping condition: nothing left to replace
    }
    int indexLast = text.IndexOf(' ', index); // Index of the first ' ' after '#'
    if (indexLast < 0)
    {
        indexLast = text.Length;              // The '#' word runs to the end of the text
    }
    string processedText = text.Substring(0, index) + "New"; // Text before '#' plus the new name
    string restText = text.Substring(indexLast);             // Rest of the text
    // Here the outcome of the recursive call is appended to the already processed part.
    return processedText + GetProperText(restText);
}

Remove BR tag from the beginning and end of a string

How can I use something like
return Regex.Replace("/(^)?(<br\s*\/?>\s*)+$/", "", source);
to replace this cases:
<br>thestringIwant => thestringIwant
<br><br>thestringIwant => thestringIwant
<br>thestringIwant<br> => thestringIwant
<br><br>thestringIwant<br><br> => thestringIwant
thestringIwant<br><br> => thestringIwant
It can have multiple br tags at the beginning or end, but I don't want to remove any br tag in the middle.
A couple of loops would solve the issue and be easier to read and understand (use a regex and tomorrow you will look at your own code wondering what the heck is going on):
while (source.StartsWith("<br>"))
    source = source.Substring(4);
while (source.EndsWith("<br>"))
    source = source.Substring(0, source.Length - 4);
return source;
Judging by your regular expression, spaces may be allowed within the br tag.
So you can try something like:
string s = Regex.Replace(input,@"\<\s*br\s*\/?\s*\>","");
There is no need to use a regular expression for this.
You can simply use
yourString.Replace("<br>", "");
This will remove all occurrences of <br> from your string.
EDIT:
To keep the tags in the middle of the string, use the following:
var regex = new Regex(Regex.Escape("<br>"));
var newText = regex.Replace("<br>thestring<br>Iwant<br>", "", 1);
newText = newText.Substring(0, newText.LastIndexOf("<br>"));
Response.Write(newText);
This will remove only the first and last occurrence of <br> from your string.
How about doing it in two goes, so ...
string result1 = Regex.Replace(source, @"^(<br\s*\/?>\s*)+", "");
then feed the result of that into
string result2 = Regex.Replace(result1, @"(<br\s*\/?>\s*)+$", "");
It's a bit of added overhead, I know, but it simplifies things enormously and saves trying to counter-match everything in the middle that isn't a BR.
Note the subtle difference between those two: one matches the tags at the start and the other matches them at the end. Doing it this way keeps the flexibility of a regular expression that allows for the general formatting of BR tags rather than being too strict.
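The two passes can be wrapped in one helper. The sketch below uses hypothetical BrTrimmer and TrimBr names and assumes case-insensitive matching is acceptable. It is checked against the cases listed in the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BrTrimmer
{
    public static string TrimBr(string source)
    {
        // Pass 1: strip any run of <br>, <br/> or <br /> tags at the very start.
        string result = Regex.Replace(source, @"^(<br\s*/?>\s*)+", "", RegexOptions.IgnoreCase);
        // Pass 2: strip any run of the same tags at the very end.
        return Regex.Replace(result, @"(<br\s*/?>\s*)+$", "", RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        Console.WriteLine(TrimBr("<br><br>thestringIwant<br><br>")); // thestringIwant
        Console.WriteLine(TrimBr("the<br>string<br />"));            // the<br>string
    }
}
```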
If you also want it to work with
<br />
then you could use
return Regex.Replace(source, @"((?:<br\s*/?>)*<br\s*/?>$|^<br\s*/?>(?:<br\s*/?>)*)", "");
EDIT:
Now it should also take care of multiple
<br\s*/?>
at the start and end of the lines.
You can write extension methods for this:
public static string TrimStart(this string value, string stringToTrim)
{
if (value.StartsWith(stringToTrim, StringComparison.CurrentCultureIgnoreCase))
{
return value.Substring(stringToTrim.Length);
}
return value;
}
public static string TrimEnd(this string value, string stringToTrim)
{
if (value.EndsWith(stringToTrim, StringComparison.CurrentCultureIgnoreCase))
{
return value.Substring(0, value.Length - stringToTrim.Length);
}
return value;
}
you can call it like
string example = "<br> some <br> test <br>";
example = example.TrimStart("<br>").TrimEnd("<br>"); //output some <br> test
I believe that one should not ignore the power of Regex. If you name the regular expression appropriately, it will not be difficult to maintain in the future.
I have written a sample program which does your task using Regex. It also ignores character case and white space at the beginning and end. You can try other source string samples you have.
Most importantly, it should be faster.
using System;
using System.Text.RegularExpressions;
namespace ConsoleDemo
{
    class Program
    {
        static void Main(string[] args)
        {
            string result;
            var source = @"<br><br>thestringIwant<br><br> => thestringIwant<br/> same <br/> <br/> ";
            result = RemoveStartEndBrTag(source);
            Console.WriteLine(result);
            Console.ReadKey();
        }
        private static string RemoveStartEndBrTag(string source)
        {
            const string replaceStartEndBrTag = @"(^(<br>[\s]*)+|([\s]*<br[\s]*/>)+[\s]*$)";
            return Regex.Replace(source, replaceStartEndBrTag, "", RegexOptions.IgnoreCase);
        }
    }
}

Count regex replaces (C#)

Is there a way to count the number of replacements a Regex.Replace call makes?
E.g. for Regex.Replace("aaa", "a", "b"); I want to get the number 3 out (result is "bbb"); for Regex.Replace("aaa", "(?<test>aa?)", "${test}b"); I want to get the number 2 out (result is "aabab").
Ways I can think to do this:
1. Use a MatchEvaluator that increments a captured variable, doing the replacement manually
2. Get a MatchCollection and iterate it, doing the replacement manually and keeping a count
3. Search first and get a MatchCollection, get the count from that, then do a separate replace
Methods 1 and 2 require manual parsing of $ replacements; method 3 requires regex matching the string twice. Is there a better way?
Thanks to both Chevex and Guffa. I started looking for a better way to get the results and found that there is a Result method on the Match class that does the substitution. That's the missing piece of the jigsaw. Example code below:
using System.Text.RegularExpressions;
namespace regexrep
{
    class Program
    {
        static int Main(string[] args)
        {
            string fileText = System.IO.File.ReadAllText(args[0]);
            int matchCount = 0;
            string newText = Regex.Replace(fileText, args[1],
                (match) =>
                {
                    matchCount++;
                    return match.Result(args[2]);
                });
            System.IO.File.WriteAllText(args[0], newText);
            return matchCount;
        }
    }
}
With a file test.txt containing aaa, the command line regexrep test.txt "(?<test>aa?)" ${test}b will set %errorlevel% to 2 and change the text to aabab.
You can use a MatchEvaluator that runs for each replacement, that way you can count how many times it occurs:
int cnt = 0;
string result = Regex.Replace("aaa", "a", m => {
cnt++;
return "b";
});
The second case is trickier as you have to produce the same result as the replacement pattern would:
int cnt = 0;
string result = Regex.Replace("aaa", "(?<test>aa?)", m => {
cnt++;
return m.Groups["test"] + "b";
});
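The manual reconstruction in the second case can be avoided with Match.Result, which expands a replacement pattern for a single match. Below is a minimal sketch of a reusable helper built on it. The CountingReplace class and its Replace method are hypothetical names, not part of the framework.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CountingReplace
{
    // Behaves like Regex.Replace(input, pattern, replacement) but also reports
    // how many replacements were made.
    public static string Replace(string input, string pattern, string replacement, out int count)
    {
        int n = 0; // out parameters cannot be captured by a lambda, so count locally
        string result = Regex.Replace(input, pattern, m =>
        {
            n++;
            return m.Result(replacement); // expands $1, ${name}, etc. for this match
        });
        count = n;
        return result;
    }

    public static void Main()
    {
        int count;
        Console.WriteLine(Replace("aaa", "a", "b", out count) + " " + count);                     // bbb 3
        Console.WriteLine(Replace("aaa", "(?<test>aa?)", "${test}b", out count) + " " + count);   // aabab 2
    }
}
```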
This should do it.
int count = 0;
string result = Regex.Replace(text,
    @"(((http|ftp|https):\/\/|www\.)[\w\-_]+(\.[\w\-_]+)+([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?)", //Example expression. This one captures URLs.
    match =>
    {
        string replacementValue = String.Format("<a href='{0}'>{0}</a>", match.Value);
        count++;
        return replacementValue;
    });
I am not on my dev computer so I can't do it right now, but I am going to experiment later and see if there is a way to do this with lambda expressions instead of declaring the method IncrementCount() just to increment an int.
EDIT modified to use a lambda expression instead of declaring another method.
EDIT2 If you don't know the pattern in advance, you can still get all the groupings (The $ groups you refer to) within the match object as they are included as a GroupCollection. Like so:
int count = 0;
string result = Regex.Replace(text,
    @"(((http|ftp|https):\/\/|www\.)[\w\-_]+(\.[\w\-_]+)+([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?)", //Example expression. This one captures URLs.
    match =>
    {
        string replacementValue = String.Format("<a href='{0}'>{0}</a>", match.Value);
        count++;
        foreach (Group g in match.Groups)
        {
            string value = g.Value; // Do stuff with value
        }
        return replacementValue;
    });
