Replacing anchor/link in text

Replacing anchor/link in text - c#

I'm having issues doing a find / replace type of action in my function, i'm extracting the < a href="link">anchor from an article and replacing it with this format: [link anchor] the link and anchor will be dynamic so i can't hard code the values, what i have so far is:
public static string GetAndFixAnchor(string articleBody, string articleWikiCheck) {
string theString = string.Empty;
switch (articleWikiCheck) {
case "id|wpTextbox1":
StringBuilder newHtml = new StringBuilder(articleBody);
Regex r = new Regex(#"\<a href=\""([^\""]+)\"">([^<]+)");
string final = string.Empty;
foreach (var match in r.Matches(theString).Cast<Match>().OrderByDescending(m => m.Index))
{
string text = match.Groups[2].Value;
string newHref = "[" + match.Groups[1].Index + " " + match.Groups[1].Index + "]";
newHtml.Remove(match.Groups[1].Index, match.Groups[1].Length);
newHtml.Insert(match.Groups[1].Index, newHref);
}
theString = newHtml.ToString();
break;
default:
theString = articleBody;
break;
}
Helpers.ReturnMessage(theString);
return theString;
}
Currently, it just returns the article as it originally is, with the traditional anchor text format: < a href="link">anchor
Can anyone see what i have done wrong?
regards

If your input is HTML, you should consider using a corresponding parser, HtmlAgilityPack being really helpful.
As for the current code, it looks too verbose. You may use a single Regex.Replace to perform the search and replace in one pass:
public static string GetAndFixAnchor(string articleBody, string articleWikiCheck) {
if (articleWikiCheck == "id|wpTextbox1")
{
return Regex.Replace(articleBody, #"<a\s+href=""([^""]+)"">([^<]+)", "[$1 $2]");
}
else
{
// Helpers.ReturnMessage(articleBody); // Uncomment if it is necessary
return articleBody;
}
}
See the regex demo.
The <a\s+href="([^"]+)">([^<]+) regex matches <a, 1 or more whitespaces, href=", then captures into Group 1 any one or more chars other than ", then matches "> and then captures into Group 2 any one or more chars other than <.
The [$1 $2] replacement replaces the matched text with [, Group 1 contents, space, Group 2 contents and a ].

Updated (Corrected regex to support whitespaces and new lines)
You can try this expression
Regex r = new Regex(#"<[\s\n]*a[\s\n]*(([^\s]+\s*[ ]*=*[ ]*[\s|\n*]*('|"").*\3)[\s\n]*)*href[ ]*=[ ]*('|"")(?<link>.*)\4[.\n]*>(?<anchor>[\s\S]*?)[\s\n]*<\/[\s\n]*a>");
It will match your anchors, even if they are splitted into multiple lines. The reason why it is so long is because it supports empty whitespaces between the tags and their values, and C# does not supports subroutines, so this part [\s\n]* has to be repeated multiple times.
You can see a working sample at dotnetfiddle
You can use it in your example like this.
public static string GetAndFixAnchor(string articleBody, string articleWikiCheck) {
if (articleWikiCheck == "id|wpTextbox1")
{
return Regex.Replace(articleBody,
#"<[\s\n]*a[\s\n]*(([^\s]+\s*[ ]*=*[ ]*[\s|\n*]*('|"").*\3)[\s\n]*)*href[ ]*=[ ]*('|"")(?<link>.*)\4[.\n]*>(?<anchor>[\s\S]*?)[\s\n]*<\/[\s\n]*a>",
"[${link} ${anchor}]");
}
else
{
return articleBody;
}
}

Related

How to do I cut off a certain part a String?

I have a big String in my program.
For Example:
String Newspaper = "...Blablabla... What do you like?...Blablabla... ";
Now I want to cut out the "What do you like?" an write it to a new String. But the problem is that the "Blablabla" is everytime something diffrent. Whit "cut out" I mean that you submit a start and a end word and all the things wrote between these lines should be in the new string. Because the sentence "What do you like?" changes sometimes except the start word "What" and the end word "like?"
Thanks for every responds

You can write the following method:
public static string CutOut(string s, string start, string end)
{
int startIndex = s.IndexOf(start);
if (startIndex == -1) {
return null;
}
int endIndex = s.IndexOf(end, startIndex);
if (endIndex == -1) {
return null;
}
return s.Substring(startIndex, endIndex - startIndex + end.Length);
}
It returns null if either the start or end pattern is not found. Only end patterns that follow the start pattern are searched for.
If you are working with C# 8+ and .NET Core 3.0+, you can also replace the last line with
return s[startIndex..(endIndex + end.Length)];
Test:
string input = "...Blablabla... What do you like?...Blablabla... ";
Console.WriteLine(CutOut(input, "What ", " like?"));
prints:
What do you like?
If you are happy with Regex, you can also write:
public static string CutOutRegex(string s, string start, string end)
{
Match match = Regex.Match(s, $#"\b{Regex.Escape(start)}.*{Regex.Escape(end)}");
if (match.Success) {
return match.Value;
}
return null;
}
The \b ensures that the start pattern is only found at the beginning of a word. You can drop it if you want. Also, if the end pattern occurs more than once, the result will include all of them unlike the first example with IndexOf which will only include the first one.

You have to do a substring, like the example below. See source for more information on substrings.
// A long string
string bio = "Mahesh Chand is a founder of C# Corner. Mahesh is also an
author, speaker, and software architect. Mahesh founded C# Corner in
2000.";
// Get first 12 characters substring from a string
string authorName = bio.Substring(0, 12);
Console.WriteLine(authorName);

In this case I would do it like this, cut the first part and then the second and concatenate with the fixed words using them as a parameter for cutting.
public string CutPhrase(string phrase)
{
var fst = "What";
var snd = "like?";
string[] cut1 = phrase.Split(new[] { fst }, StringSplitOptions.None);
string[] cut2 = cut1[1].Split(new[] { snd }, StringSplitOptions.None);
var rst = $"{fst} {cut2[0]} {snd}";
return rst;
}

Extract ID and replace everything in `Example HTML`

New to Regular Expressions, I want to have the following text in my HTML and would like to replace with something else
Example HTML:
{{Object id='foo'}}
Extract the id into a variable like this:
string strId = "foo";
So far I have the following Regular Expression code that will capture the Example HTML:
string strStart = "Object";
string strFind = "{{(" + strStart + ".*?)}}";
Regex regExp = new Regex(strFind, RegexOptions.IgnoreCase);
Match matchRegExp = regExp.Match(html);
while (matchRegExp.Success)
{
//At this point, I have this variable:
//{{Object id='foo'}}
//I can find the id='foo' (see below)
//but not sure how to extract 'foo' and use it
string strFindInner = "id='(.*?)'"; //"{{Slider";
Regex regExpInner = new Regex(strFindInner, RegexOptions.IgnoreCase);
Match matchRegExpInner = regExpInner.Match(matchRegExp.Value.ToString());
//Do something with 'foo'
matchRegExp = matchRegExp.NextMatch();
}
I understand this might be a simple solution, I am hoping to gain more knowledge about Regular Expressions but more importantly, I am hoping to receive a suggestion on how to approach this cleaner and more efficiently.
Thank you
Edit:
Is this an example that I could potentially use: c# regex replace

While I am not solving my initial question with Regular Expressions, I did move into a simpler solution using SubString, IndexOf and string.Split for the time being, I understand that my code needs to be cleaned up but thought I would post the answer that I have thus far.
string html = "<p>Start of Example</p>{{Object id='foo'}}<p>End of example</p>"
string strObject = "Slider"; //Example
//When found, this will contain "{{Object id='foo'}}"
string strCode = "";
//ie: "id='foo'"
string strCodeInner = "";
//Tags will be a list, but in this example, only "id='foo'"
string[] tags = { };
//Looking for the following "{{Object "
string strFindStart = "{{" + strObject + " ";
int intFindStart = html.IndexOf(strFindStart);
//Then ending in the following
string strFindEnd = "}}";
int intFindEnd = html.IndexOf(strFindEnd) + strFindEnd.Length;
//Must find both Start and End conditions
if (intFindStart != -1 && intFindEnd != -1)
{
strCode = html.Substring(intFindStart, intFindEnd - intFindStart);
//Remove Start and End
strCodeInner = strCode.Replace(strFindStart, "").Replace(strFindEnd, "");
//Split by spaces, this needs to be improved if more than IDs are to be used
//but for proof of concept this is perfect
tags = strCodeInner.Split(new char[] { ' ' });
}
Dictionary<string, string> dictTags = new Dictionary<string, string>();
foreach (string tag in tags)
{
string[] tagSplit = tag.Split(new char[] { '=' });
dictTags.Add(tagSplit[0], tagSplit[1].Replace("'", "").Replace("\"", ""));
}
//At this point, I can replace "{{Object id='foo'}}" with anything I'd like
//What I don't show is that I go into the website's database,
//get the object (ie: Slider) and return the html for slider with the ID of foo
html = html.Replace(strCode, strView);
/*
"html" variable may contain:
<p>Start of Example</p>
<p id="foo">This is the replacement text</p>
<p>End of example</p>
*/

C# - How can stop recursive function?

I want to replace all the word that start via # with another word, here is my code:
public string SemiFinalText { get; set; }
public string FinalText { get; set; }
//sample text : "aaaa bbbb #cccc dddd #eee fff g"
public string GetProperText(string text)
{
if (text.Contains('#'))
{
int index = text.IndexOf('#');
string restText = text.Substring(index);
var indexLast = restText.IndexOf(' ');
var oldName = text.Substring(index, indexLast);
string restText2 = text.Substring( index + indexLast);
SemiFinalText += text.Substring(0, index + indexLast).Replace(oldName, "#New");
if (restText2.Contains('#'))
{
GetProperText(restText2);
}
FinalText = SemiFinalText + restText2;
return FinalText;
}
else
{
return text;
}
}
When return FinalText; is executed I want to stop recursive function. How can fix it?
Maybe another approach is better than recursive function. If you know another way please give an answer to me.

You don't need a recursive solution for this problem. You have a string containing a number of words (separated by spaces) and you want to replace the ones starting with an '#' with another string. Modifying your solution to have a simple method that splits based on spaces, replaces all words starting with # and then combines them once again.
Using Linq:
string text = "aaaa bbbb #cccc dddd #eee fff g";
FinalText = GetProperText(text, "New");
public string GetProperText(string text, string replacewith)
{
text = string.Join(" ", text.Split(' ').Select(x => x.StartsWith("#") ? replacewith: x));
return text;
}
Output: aaaa bbbb New dddd New fff g
Using Regex:
Regex rgx = new Regex("#([^ #])*");
string result = rgx.Replace(text, replaceword);

Solution with Regular Expressions:
using System;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
string pattern = #"#\w+";
var r = new Regex(pattern);
Console.WriteLine(r.Replace("ABC #ABC ABC #DEF klm.#bhsh", "BOOM!"));
}
}
This does not rely on space character being the delimiter, any non-word (letters and numbers) can be used to separate the 'words'. This example outputs:
ABC BOOM! ABC BOOM! klm.BOOM!
You can test it out here: https://dotnetfiddle.net/rZyjjg
If you're new to Regex: .NET Introduction to Regular Expressions

Here also the proper way to do it recursively for anyone interested. I think your stopping condition was actually oke, but you should concatenate the outcome of the recursive function call to the already processed text. Also I think that using global variables in a recursive function defeats its purpose a little bit.
That being said I think that using RegEx from one of the supplied answer is better and faster.
The recursive code:
//sample text : "aaaa bbbb #cccc dddd #eee fff g"
public string GetProperText(string text)
{
if (text.Contains('#'))
{
int index = text.IndexOf('#'); //Index of first occuring '#'
var indexLast = text.IndexOf(' ',index); //Index of first ' ' after '#'
var oldName = text.Substring(index, indexLast); //Old Name
string processedText = text.Substring(0, index + indexLast).Replace(oldName, "New"); //String with new name
string restText = text.Substring(indexLast); //Rest Text
if (text.Contains('#'))
{
//Here the outcome of the function is pasted on the allready processed text part.
text = processedText + GetProperText(restText);
}
return text;
}
else
{
return text;
}
}

remove text in between delimiters in a string - regex

I have been trying real hard understanding regular expression, Is there any way I can replace character(s) that is between two regex/ For example I have
string datax = "a4726e1e-babb-4898-a5d5-e29d2bc40028;POPULATE DATA AØ99c1d133-15f5-4ef5-bc59- d9ed673b70c6;POPULATE DATA BØ";
how to remove string between regex ";" and "Ø" ???
i try to use code like this :
string xresult = Regex.Replace(datax, #"(?<=;)(\w+?)(?=Ø)", "");
But not working.
please corrected and give me solutions...
thanks...
i want the result like this sir :
string datax = "a4726e1e-babb-4898-a5d5-e29d2bc40028;Ø99c1d133-15f5-4ef5-bc59-d9ed673b70c6;Ø";

I think you need to understand regex a little better and how the replace function works. with regex you're defining capture groups, and with the replace function you want to replace those groups.
how to remove string between regex ";" and "Ø" ???
Step 1: First find ";",then capture all characters up to and including "Ø".
That's (;.*?Ø)
( New Capture Group
; Match ";"
. Match Anything
* Zero or more times
? Be Lazy
Ø Match "Ø"
) End Capture
Step 2: Replace each group with ";Ø"
public static string Replace(string input, string pattern, string
replacement)
So you need to put back the ";Ø" you removed from the original capture.
static void Test2()
{
foreach (string item in SO2588078())
{
Console.WriteLine(item);
}
string input = "a4726e1e-babb-4898-a5d5-e29d2bc40028;POPULATE DATA AØ99c1d133-15f5-4ef5-bc59- d9ed673b70c6;POPULATE DATA BØ";
string regex = "(;.*?Ø)";
string output = Regex.Replace(input, regex, ";Ø");
if (output == string.Join(";Ø", SO2588078()) + ";Ø")
{
Console.WriteLine("TRUE");
}
}
An alternative would be to parse the string without regex. It's a simple format and this gives you more control over the process so you can see what's happening, why it's gone wrong and why it gives the results it does. Since you can step through it.
private static IEnumerable<string> SO2588078()
{
string datax = "a4726e1e-babb-4898-a5d5-e29d2bc40028;POPULATE DATA AØ99c1d133-15f5-4ef5-bc59- d9ed673b70c6;POPULATE DATA BØ";
string temp = datax;
while (!string.IsNullOrEmpty(temp))
{
int index1 = temp.IndexOf(';');
if (index1 > -1)
{
string guid = temp.Remove(index1);
yield return guid;
int index2 = temp.IndexOf('Ø');
if (index2 > -1)
{
temp = temp.Substring(index2 + 1);
}
else
{
temp = null;
}
}
else
{
temp = null;
}
}
}

c# Remove string from XML

I have an xml that has several attributes and values such as follows:
<z:row ID="1"
Author="2;#Bruce, Banner"
Editor="1;#Bruce, Banner"
FileRef="1;#Reports/Pipeline Tracker Report.xltm"
FileDirRef="1;#Reports"
Last_x0020_Modified="1;#2014-04-04 12:05:56"
Created_x0020_Date="1;#2014-04-04 11:36:21"
File_x0020_Size="1;#311815"
/>
How can I remove the string from after the " up to the #?
Original
'Author="2;#Bruce, Banner"'
Converted
'Author="Bruce, Banner"'

See if this helps.
private string FilterValue(string input)
{
// If the string does not contain #, return value
if (!input.Contains("#"))
return input;
// # does exist in the string so
// 1) find its location
// 2) Read everything from that point to the end of the string
// 3) Return the SubString value
var index = input.IndexOf("#", StringComparison.Ordinal) + 1;
return input.Substring(index, input.Length - index);
}

Something like this ?
// same logic then M Patel.
// This one will fit only if you have three items to remove (one digit, one semi-colon and one sharp).
// use M Patel solution
string CleanElement(string elem)
{
return elem.Substring(3, elem.Length - 3);
}
or like this :
// slower I guess but still a solution
string CleanElement(string elem)
{
string[] strs = elem.Split('#');
strs[0] = "";
return string.Join("", strs);
}

You can use string.Substring and string.IndexOf methods
string value= node.Attributes["Author"].Value;
value=value.Substring(0, value.IndexOf('#'));
I hope this is what you are looking for assuming that you are already reading your node from xml document
If you are new to reading XML in c#, I would recommend you to take a look at following msdn link https://msdn.microsoft.com/en-us/library/cc189056(v=vs.95).aspx

You can use regex for for seraching you pattern and use regEx.Replace() method.
Regex might goes like this "\d;#".

It should work if entry is 2;#Bruce, Banner!
string value= node.Attributes["Author"].Value;
var op = value.Split('#');
string name = op[1];
If other # is expected then,
string value1 = value.Substring(3, value.Length - 3);

You can use a simple regex:
string s = #"<z:row ID=""1""
Author=""2;#Bruce, Banner""
Editor=""1;#Bruce, Banner""
FileRef=""1;#Reports/Pipeline Tracker Report.xltm""
FileDirRef=""1;#Reports""
Last_x0020_Modified=""1;#2014-04-04 12:05:56""
Created_x0020_Date=""1;#2014-04-04 11:36:21""
File_x0020_Size=""1;#311815""
/>";
string result = Regex.Replace(s,"\"([0-9];#)","");

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Replacing anchor/link in text - c#

Related

How to do I cut off a certain part a String?

Extract ID and replace everything in `Example HTML`

C# - How can stop recursive function?

remove text in between delimiters in a string - regex

c# Remove string from XML

Categories

Resources