Troubles with finding/replacing in string - c#

I am having some trouble finding and replacing a value in a string. I don't know whether I should do it with Regex or whether C# has some nifty feature for it. Regex gives me a headache.
The problem:
<doc name="tester" value="p1,p2,p3" />
So I want the "value" (p1,p2,p3) and replace it with the current value + ",p4".
Any help appreciated.

Although Regex gives you a headache, this is actually very simple to do with the following regex:
@"(?<=value=\"")[^""]+"
It starts by looking back for 'value="', then matches all characters up to the closing double quote.
string test = @"<doc name=""tester"" value=""p1,p2,p3"" />";
Regex regex = new Regex(@"(?<=value=\"")[^""]+");
string result = regex.Replace(test, "p1,p2,p3,p4");
// result will be: @"<doc name=""tester"" value=""p1,p2,p3,p4"" />";
Edit:
You can of course capture the original content, simply by calling:
string match = regex.Match(test).Value;
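The question asks to append ",p4" to whatever the attribute currently holds, while the snippet above writes the full new value by hand. A minimal sketch of the append variant, assuming the same lookbehind pattern and using the `$0` substitution (the whole match) in the replacement string; the variable name `appended` is just for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

string test = @"<doc name=""tester"" value=""p1,p2,p3"" />";

// Same idea as above: look back for value=" and match up to the closing quote.
Regex regex = new Regex(@"(?<=value="")[^""]+");

// $0 stands for the matched text, so the current value is kept and ",p4" is appended.
string appended = regex.Replace(test, "$0,p4");

Console.WriteLine(appended);
```

If the appended text has to be computed at run time, the `Regex.Replace` overload that takes a `MatchEvaluator` delegate is probably the better fit.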

Related

.Net regex for string between \" "\

I have been trying to get the id from the following text
<body id=\"body\" runat=\"server\">
I tried this in C# using Substring or even Regex, but nothing seems to work. No matter what regex I use, I always get the whole line back. I have tried ^id, ^id.*, ^id=\\\\\\\\.* and id=.*, but they either don't work or don't give me the desired output. Is there any way I can get the id portion of this text, which is enclosed between the characters \" "\?
Try this:
string htmlString = "<body id=\"body\" runat=\"server\">";
Regex regex = new Regex("id=\"(.*?)\"");
Match m = regex.Match(htmlString);
Group g = m.Groups[1];
string id = g.ToString();
Console.WriteLine(id); //body
Test here:
http://rextester.com/BQSF93427
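To see why the attempts in the question behave the way they do, here is a small sketch that runs two of them next to a capturing pattern. The pattern with `\b` and `[^"]*` is my own variation on the answer's regex, not something from the original post, and the variable names are illustrative:

```csharp
using System;
using System.Text.RegularExpressions;

string htmlString = "<body id=\"body\" runat=\"server\">";

// ^ anchors at the start of the input, and the input starts with "<body",
// so a pattern beginning with ^id cannot match here.
bool anchoredMatches = Regex.IsMatch(htmlString, "^id");

// id=.* is greedy: it runs from "id=" to the end of the line.
string greedy = Regex.Match(htmlString, "id=.*").Value;

// Capture only what sits between the quotes; check Success before reading the group.
Match m = Regex.Match(htmlString, "\\bid=\"([^\"]*)\"");
string id = m.Success ? m.Groups[1].Value : "";

Console.WriteLine(anchoredMatches);
Console.WriteLine(greedy);
Console.WriteLine(id);
```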

C# Extract part of the string that starts with specific letters

I have a string which I extract from an HTML document like this:
var elas = htmlDoc.DocumentNode.SelectSingleNode("//a[@class='a-size-small a-link-normal a-text-normal']");
if (elas != null)
{
//
_extractedString = elas.Attributes["href"].Value;
}
The HREF attribute contains this part of the string:
gp/offer-listing/B002755TC0/
And I'm trying to extract the B002755TC0 value, but the problem is that the string varies in length, so I cannot simply use the Substring method that C# offers with fixed positions...
Instead I was wondering whether there's a clever way to do this, perhaps by matching the part of the string that precedes what I'm searching for?
For example, I know for a fact that each href has the structure I've shown, so I would simply match this keyword:
offer-listing/
So I would find this keyword and start extracting the part of the string, B002755TC0, until the next "/" sign?
Can someone help me out with this?
This is a perfect job for a regular expression:
string text = "gp/offer-listing/B002755TC0/";
Regex pattern = new Regex(@"offer-listing/(\w+)/");
Match match = pattern.Match(text);
string whatYouAreLookingFor = match.Groups[1].Value;
Explanation: we just match the exact pattern you need.
'offer-listing/'
followed by any combination of (at least one) 'word characters' (letters, digits and underscore),
followed by a slash.
The parentheses () mean 'capture this group' (so we can extract it later with match.Groups[1]).
EDIT: if you also want to extract from this: /dp/B01KRHBT9Q/
Then you could use this pattern:
Regex pattern = new Regex(@"/(\w+)/$");
which will match both this string and the previous one. The $ stands for the end of the string, so this literally means:
capture the characters in between the last two slashes of the string
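As a quick check of the end-anchored pattern, the sketch below runs it against both sample hrefs and against a hypothetical input without a trailing slash (that third input is my assumption, added only to show the failure case):

```csharp
using System;
using System.Text.RegularExpressions;

// Capture the word characters between the last two slashes of the input.
Regex lastSegment = new Regex(@"/(\w+)/$");

string first = lastSegment.Match("gp/offer-listing/B002755TC0/").Groups[1].Value;
string second = lastSegment.Match("/dp/B01KRHBT9Q/").Groups[1].Value;

// Without a trailing slash the pattern does not match at all,
// and the group of a failed match is an empty string.
Match miss = lastSegment.Match("gp/offer-listing/B002755TC0");

Console.WriteLine(first);
Console.WriteLine(second);
Console.WriteLine(miss.Success);
```

Checking `Success` before reading `Groups[1]` is worth doing whenever the href format is not guaranteed.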
Though there is already an accepted answer, I thought I'd share another solution that doesn't use Regex. Find the position of your pattern in the input and add its length; the wanted text starts at that index. To find the end, search for the first "/" after the beginning of the wanted text:
string input = "gp/offer-listing/B002755TC0/";
string pat = "offer-listing/";
int beginning = input.IndexOf(pat) + pat.Length;
int end = input.IndexOf("/", beginning);
string result = input.Substring(beginning, end - beginning);
If your desired output is always the last piece, you could also use Split and take the last non-empty piece:
// Last() needs: using System.Linq;
string result2 = input.Split(new string[]{"/"}, StringSplitOptions.RemoveEmptyEntries)
.ToList().Last();

Regular expressions multiple matches

I have this text and I want to get two matches from it, but I always get only one match. This is the sample code in C#:
string formattedTag = "{Tag 1}::[FORMAT] asdfa {Tag 2}::[FORMAT]";
var tagMatches = Regex.Matches(formattedTag, @"(\{.+\}\:\:\[.+\])");
I am expecting two matches here, "{Tag 1}::[FORMAT]" and "{Tag 2}::[FORMAT]",
but the result of this code is the entire value of the variable formattedTag.
It must be something in the regex pattern, so can somebody help me figure it out?
I would appreciate any help. Thanks in advance!
You need to use the following regular expression:
(\{[^}]+\}\:\:\[[^]]+\])
You want to match any character except the closing bracket within your bracketed portions of the string. Otherwise the whole string is matched, because quantifiers such as .+ are greedy by default and take the longest match they can.
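The difference is easy to see by counting matches. A minimal sketch using the question's input; the second pattern writes the closing bracket as `\]` inside the class, which is my own spelling of the same idea and avoids relying on how a leading `]` is parsed:

```csharp
using System;
using System.Text.RegularExpressions;

string formattedTag = "{Tag 1}::[FORMAT] asdfa {Tag 2}::[FORMAT]";

// Greedy .+ runs to the last } and the last ], so everything collapses into one match.
MatchCollection greedy = Regex.Matches(formattedTag, @"\{.+\}::\[.+\]");

// Negated classes cannot cross a closing brace or bracket, so each tag matches on its own.
MatchCollection bounded = Regex.Matches(formattedTag, @"\{[^}]+\}::\[[^\]]+\]");

Console.WriteLine(greedy.Count);
foreach (Match m in bounded)
{
    Console.WriteLine(m.Value);
}
```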
string formattedTag = "{tag 1}::[admin] adfaf{tag 2}::[test.user]";
var tagMatches = Regex.Matches(formattedTag, @"\{(\w+\s*\d{1,2})\}::\[(.*?)\]");
foreach(Match item in tagMatches){
Console.WriteLine(item.Groups[0]);
Console.WriteLine(item.Groups[1] + "=" + item.Groups[2]);
}

Regex string issue in making plain text urls clickable

I need working Regex code in C# that detects plain-text URLs (http/https/ftp/ftps) in a string and makes them clickable by wrapping each one in an anchor tag with the same URL. I have already written a Regex pattern; the code is attached below.
However, if the input string already contains a clickable URL, the code puts another anchor tag around it. For example, the existing substring "<a href='ftp://www.abc.com'>ftp://www.abc.com</a>" in sContent below gets another anchor tag when the code is run. Is there any way to fix it?
string sContent = "ttt <a href='ftp://www.abc.com'>ftp://www.abc.com</a> abc ftp://www.abc.com abbbbb http://www.abc2.com";
Regex regx = new Regex("(http|https|ftp|ftps)://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?", RegexOptions.IgnoreCase);
MatchCollection matches = regx.Matches(sContent);
foreach (Match match in matches)
{
sContent = sContent.Replace(match.Value, "<a href='" + match.Value + "'>" + match.Value + "</a>");
}
Also, I want Regex code that makes email addresses clickable with a "mailto" link. I can do that myself, but the same double-anchor issue will appear there too.
I noticed in your example test string that a link such as ftp://www.abc.com appears both already linked and as plain text, and the result is that the linked copy gets anchored a second time. That happens because string.Replace substitutes every occurrence of the matched text, including the ones already inside an anchor. The regular expression that you already have and that @stema has supplied will work, but you need to change how you replace the matches in the sContent variable.
The following code example should give you what you want:
string sContent = "ttt <a href='ftp://www.abc.com'>ftp://www.abc.com</a> abc ftp://www.abc.com abbbbb http://www.abc2.com";
Regex regx = new Regex("(?<!(?:href='|<a[^>]*>))(http|https|ftp|ftps)://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?", RegexOptions.IgnoreCase);
MatchCollection matches = regx.Matches(sContent);
for (int i = matches.Count - 1; i >= 0 ; i--)
{
string newURL = "<a href='" + matches[i].Value + "'>" + matches[i].Value + "</a>";
sContent = sContent.Remove(matches[i].Index, matches[i].Length).Insert(matches[i].Index, newURL);
}
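The loop above walks the matches from last to first so that earlier indexes stay valid while the string grows. An alternative that needs no index bookkeeping is the `Regex.Replace` overload that takes a `MatchEvaluator`. The sketch below uses a deliberately simplified URL pattern of my own (scheme, "://", then any run of characters other than whitespace, angle brackets or a single quote), so treat the pattern as an assumption rather than a drop-in replacement for the one above:

```csharp
using System;
using System.Text.RegularExpressions;

string sContent = "ttt <a href='ftp://www.abc.com'>ftp://www.abc.com</a> abc ftp://www.abc.com abbbbb http://www.abc2.com";

// The lookbehind skips URLs that directly follow href=' or an opening <a ...> tag.
// The URL part is a simplified stand-in pattern, assumed for this sketch.
Regex regx = new Regex(@"(?<!href='|<a[^>]*>)(?:https?|ftps?)://[^\s<>']+", RegexOptions.IgnoreCase);

// Regex.Replace rewrites each match at its own position in the input.
string linked = regx.Replace(sContent, m => "<a href='" + m.Value + "'>" + m.Value + "</a>");

Console.WriteLine(linked);
```

Because the replacement is tied to match positions rather than to matched text, the copy that is already inside the anchor is left alone.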
Try this
Regex regx = new Regex("(?<!(?:href='|>))(http|https|ftp|ftps)://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?", RegexOptions.IgnoreCase);
It should work for your example.
(?<!(?:href='|>)) is a negative lookbehind, which means the pattern matches only if it is not preceded by "href='" or ">".
See the lookarounds page on regular-expressions.info
and especially the zero-width negative lookbehind assertion on MSDN.
See something similar on Regexr. I had to remove the alternation from the lookbehind, but .NET should be able to handle it.
Update
To ensure that cases like "<p>ftp://www.def.com</p>" are also handled correctly, I improved the regex:
Regex regx = new Regex("(?<!(?:href='|<a[^>]*>))(http|https|ftp|ftps)://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?", RegexOptions.IgnoreCase);
The lookbehind (?<!(?:href='|<a[^>]*>)) now checks that the match is preceded neither by "href='" nor by a tag starting with "<a".
For the test string
ttt <a href='ftp://www.abc.com'>ftp://www.abc.com</a> abc <p>ftp://www.def.com</p> abbbbb http://www.ghi.com
this expression produces
ttt <a href='ftp://www.abc.com'>ftp://www.abc.com</a> abc <p><a href='ftp://www.def.com'>ftp://www.def.com</a></p> abbbbb <a href='http://www.ghi.com'>http://www.ghi.com</a>
I know I arrived late to this party, but there are several problems with the regex that the existing answers don't address. First and most annoying, there's that forest of backslashes. If you use C#'s verbatim strings, you don't have to do all that double escaping. And anyway, most of the backslashes weren't needed in the first place.
Second, there's this bit: ([\\w+?\\.\\w+])+. The square brackets form a character class, and everything inside them is treated either as a literal character or a class shorthand like \w. But getting rid of the square brackets isn't enough to make it work. I suspect this is what you were trying for: \w+(?:\.\w+)+.
Third, the quantifiers at the end of the regex - ]*)? - are mismatched. * can match zero or more characters, so there's no point making the enclosing group optional. Also, that kind of arrangement can result in severe performance degradation. See this page for details.
There are other, minor problems, but I won't go into them right now. Here's the new and improved regex:
@"(?n)(https?|ftps?)://\w+(\.\w+)+([-a-zA-Z0-9~!@#$%^&*()_=+/?.:;',\\]*)(?!(?>[^<>]*)(>|</a>))"
The negative lookahead - (?!(?>[^<>]*)(>|</a>)) - is what prevents matches inside tags or in the content of an anchor element. The atomic group (?>...) plays the role of a possessive quantifier, which .NET does not support. It's still very crude, though. There are several areas, like inside <script> elements, where you don't want it to match but it does. But trying to cover all the possibilities would result in a mile-long regex.
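Two of the points above can be checked directly: that `[\w+?\.\w+]` is only a character class, and that the lookahead keeps matches out of tags and anchor content. In this sketch the lookahead is written with an atomic group `(?>...)`, since .NET has no possessive quantifiers; the variable names and the junk input "+?+" are mine:

```csharp
using System;
using System.Text.RegularExpressions;

// Inside [...] the +, ? and . are literals, so the class accepts a run like "+?+".
bool classAcceptsJunk = Regex.IsMatch("+?+", @"^[\w+?\.\w+]+$");

// The intended shape: word characters separated by at least one dot.
bool hostOk = Regex.IsMatch("www.abc.com", @"^\w+(?:\.\w+)+$");
bool junkRejected = !Regex.IsMatch("+?+", @"^\w+(?:\.\w+)+$");

string sContent = "ttt <a href='ftp://www.abc.com'>ftp://www.abc.com</a> abc ftp://www.abc.com abbbbb http://www.abc2.com";

// (?>...) is atomic: once [^<>]* has consumed its run it does not give anything back.
Regex urls = new Regex(@"(?n)(https?|ftps?)://\w+(\.\w+)+([-a-zA-Z0-9~!@#$%^&*()_=+/?.:;',\\]*)(?!(?>[^<>]*)(>|</a>))");
MatchCollection found = urls.Matches(sContent);

foreach (Match m in found)
{
    Console.WriteLine(m.Value);
}
```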
Check out the questions "Detect email in text using regex" and "Regex URL Replace, ignore Images and existing Links". Just replace the regex for links; it will never replace a link inside a tag, only in text content.
http://html-agility-pack.net/?z=codeplex
Something like:
string textToBeLinkified = "... your text here ...";
const string regex = @"((www\.|(http|https|ftp|news|file)+\:\/\/)[_.a-z0-9-]+\.[a-z0-9\/_:#=.+?,##%&~-]*[^.|\'|\# |!|\(|?|,| |>|<|;|\)])";
Regex urlExpression = new Regex(regex, RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(textToBeLinkified);
var nodes = doc.DocumentNode.SelectNodes("//text()[not(ancestor::a)]") ?? new HtmlNodeCollection();
foreach (var node in nodes)
{
node.InnerHtml = urlExpression.Replace(node.InnerHtml, @"<a href=""$0"">$0</a>");
}
string linkifiedText = doc.DocumentNode.OuterHtml;

problem in regular expression

I have a regular expression
Regex r = new Regex(@"(\s*)([A|B|C|E|G|H|J|K|L|M|N|P|R|S|T|V|Y|X]\d(?!.*[DFIOQU])(?:[A-Z](\s?)\d[A-Z]\d))(\s*)",RegexOptions.IgnoreCase);
and a string
string test="LJHLJHL HJGJKDGKJ JGJK C1C 1C1 LKJLKJ";
I have to fetch C1C 1C1. This runs fine.
But if I modify the test string to
string test="LJHLJHL HJGJKDGKJ JGJK C1C 1C1 ON";
then it is unable to find the pattern, i.e. C1C 1C1.
Any idea why this expression is failing?
You have a negative look ahead:
(?!.*[DFIOQU])
That matches the "O" in "ON" and since it is a negative look ahead, the whole pattern fails. And, as an aside, I think you want to replace this:
[A|B|C|E|G|H|J|K|L|M|N|P|R|S|T|V|Y|X]
With this:
[A-CEGHJ-NPR-TVYX]
A pipe (|) is a literal character inside a character class, not an alternation, and you can use ranges to help highlight the characters that you're leaving out.
A single regex might not be the best way to parse that string. Or perhaps you just need a looser regex.
With your negative lookahead (?!.*[DFIOQU]) you require that none of D, F, I, O, Q, U follows anywhere after that point.
In your second string there is an O at the end, in ON, so the match fails.
If you remove the .* from the negative lookahead, it checks only the character that directly follows instead of the whole rest of the string (is this what you want?).
\s*([ABCEGHJKLMNPRSTVYX]\d(?![DFIOQU])(?:[A-Z]\s?\d[A-Z]\d))\s*
Then it works; see it on Regexr. It now checks that the character directly after the digit is not one of the characters in the class. I don't know if this is intended.
Btw., I removed the | from your first character class, since it's not needed, and also some brackets around your whitespace, which aren't needed either.
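The effect of the `.*` inside the lookahead can be checked against both test strings from the question. This is a sketch with the surrounding whitespace groups left out and the first class written with ranges; the names `wholeRest` and `nextChar` are mine:

```csharp
using System;
using System.Text.RegularExpressions;

string first = "LJHLJHL HJGJKDGKJ JGJK C1C 1C1 LKJLKJ";
string second = "LJHLJHL HJGJKDGKJ JGJK C1C 1C1 ON";

// (?!.*[DFIOQU]) scans the whole remainder of the input after the first digit.
Regex wholeRest = new Regex(@"[A-CEGHJ-NPR-TVXY]\d(?!.*[DFIOQU])[A-Z]\s?\d[A-Z]\d", RegexOptions.IgnoreCase);

// (?![DFIOQU]) only inspects the single character right after the first digit.
Regex nextChar = new Regex(@"[A-CEGHJ-NPR-TVXY]\d(?![DFIOQU])[A-Z]\s?\d[A-Z]\d", RegexOptions.IgnoreCase);

// Inside a character class the pipe is an ordinary character.
bool pipeIsLiteral = Regex.IsMatch("|", "^[A|B]$");

Console.WriteLine(wholeRest.IsMatch(first));
Console.WriteLine(wholeRest.IsMatch(second));
Console.WriteLine(nextChar.Match(second).Value);
Console.WriteLine(pipeIsLiteral);
```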
As I understand it, you need to find the text C1C 1C1 in your string.
I used this regex to do it:
string strRegex = @"^.*(?<c1c>C1C)\s*(?<c1c2>1C1).*$";
After that you can extract the text from the named groups:
string strRegex = @"^.*(?<c1c>C1C)\s*(?<c1c2>1C1).*$";
RegexOptions myRegexOptions = RegexOptions.Multiline;
Regex myRegex = new Regex(strRegex, myRegexOptions);
string strTargetString = @"LJHLJHL HJGJKDGKJ JGJK C1C 1C1 LKJLKJ";
string secondStr = "LJHLJHL HJGJKDGKJ JGJK C1C 1C1 ON";
Match match = myRegex.Match(strTargetString);
string c1c = match.Groups["c1c"].Value;
string c1c2 = match.Groups["c1c2"].Value;
Console.WriteLine(c1c + " " +c1c2);
