I have a block of HTML that looks something like this:
<p><a href="...">33</a></p>
There are hundreds of anchor links whose href I need to replace based on the anchor text. For example, I need to replace the link above with something like:
<a href="...">33</a>
I will need to take the value 33 and do a lookup on my database to find the new link to replace the href with.
I need to keep it all in the original html as above!
How can I do this? Help!
Although this doesn't answer your question, the HTML Agility Pack is a great tool for manipulating and working with HTML: http://html-agility-pack.net
It could at least make grabbing the values you need and doing the replaces a little easier.
This question contains links on using the HTML Agility Pack: How to use HTML Agility pack
Slurp your HTML into an XmlDocument (your markup is valid, isn't it?) Then use XPath to find all the <a> tags with an href attribute. Apply the transform and assign the new value to the href attribute. Then write the XmlDocument out.
Easy!
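A minimal sketch of that XmlDocument approach, assuming the markup really is well-formed XML (an unclosed tag or an entity such as &nbsp; will make LoadXml throw). The HrefRewriter class and the lookup callback are hypothetical names; the callback stands in for the database query.

```csharp
using System;
using System.Xml;

static class HrefRewriter
{
    // Rewrites the href of every <a href="..."> element, using the anchor
    // text as the lookup key. 'lookup' is a stand-in for the database query.
    public static string Rewrite(string xhtml, Func<string, string> lookup)
    {
        var doc = new XmlDocument();
        doc.PreserveWhitespace = true;   // keep the original layout
        doc.LoadXml(xhtml);              // throws XmlException if not well-formed

        foreach (XmlElement a in doc.SelectNodes("//a[@href]"))
        {
            a.SetAttribute("href", lookup(a.InnerText));
        }
        return doc.OuterXml;
    }
}
```

Calling HrefRewriter.Rewrite(html, text => "/items/" + text) would give every link an href built from its own text; swap the lambda for the real lookup.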
Use a regexp to find the values and replace them.
A regexp like "<p><a href=\"[^\"]+\">([^<]+)<\\/a><\\/p>" will match the link and capture the anchor text.
Consider using the following rough algorithm.
using System;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

static class Program
{
    static void Main()
    {
        // Read the whole HTML file into this string; the href value here is a placeholder.
        string html = "<p><a href=\"old.html\">33</a></p>";
        StringBuilder newHtml = new StringBuilder(html);
        // Group 1 captures the href to replace; group 2 captures the anchor text to look up.
        Regex r = new Regex(@"<a href=""([^""]+)"">([^<]+)");
        foreach (var match in r.Matches(html).Cast<Match>().OrderByDescending(m => m.Index))
        {
            string text = match.Groups[2].Value;
            string newHref = DBTranslate(text);
            newHtml.Remove(match.Groups[1].Index, match.Groups[1].Length);
            newHtml.Insert(match.Groups[1].Index, newHref);
        }
        Console.WriteLine(newHtml);
    }

    // Stand-in for the database lookup.
    static string DBTranslate(string s)
    {
        return "junk_" + s;
    }
}
(The OrderByDescending makes sure the indexes don't change as you modify the StringBuilder.)
So, what you want to do is generate the replacement string based on the contents of the match. Consider using one of the Regex.Replace overloads that take a MatchEvaluator. Example:
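static void Main()
{
    Regex r = new Regex(@"<a href=""[^""]+"">([^<]+)");
    string s0 = @"<p><a href=""old.html"">33</a></p>"; // sample input; the href value is a placeholder
    string s1 = r.Replace(s0, m => GetNewLink(m));
    Console.WriteLine(s1);
}

static string GetNewLink(Match m)
{
    // Rebuild the opening tag and anchor text; the unmatched </a></p> stays as it was.
    return string.Format(@"<a href=""{0}.html"">{0}", m.Groups[1]);
}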
I've actually taken it a step further and used a lambda expression instead of explicitly creating a delegate method.
Related
I have this string variable which contains text and HTML tags. How do I apply a regex only within the HTML table tag? Is this possible?
string input = "Hello,\nTRAVEL DETAILS\n<table border=\"1\">\n<tr>\n<th align=\"center\">Initial Travel Date</th>\n<th align=\"center\">Reference Number</th>\n<th align=\"center\">First Name</th>\n<th align=\"center\">Surname</th>\n<th align=\"center\">Main Reason</th>\n<th align=\"center\">Client ID</th>\n</tr>\n<tr>\n<td align=\"center\">{TRV TRL INIT.trn}</td>\n<td align=\"center\">{TRV REF NO.trn}</td>\n<td align=\"center\">{TRV FIRST NM.trn}</td>\n<td align=\"center\">{TRV SURNAME.trn}</td>\n<td align=\"center\">Internal Meeting</td>\n<td align=\"center\">{TRV CLIEN ID.trn}</td>\n</tr>\n</table>";
string output = Regex.Replace(input, @"\t|\n|\r", "");
return output;
I only need to remove the "\n" inside the table element.
You can use the WebBrowser control to parse the HTML string, get the table chunk and remove the new lines from there.
Or you can use the IHTMLDocument, IHTMLDocument2, IHTMLDocument3 ... up to 8 interfaces to parse the HTML. You need to include Mshtml.dll in your project references though.
Or use a 3rd party HTML parser.
Do not try to manipulate the raw string unless you want to write your own HTML parser.
I have found a way to eliminate the "\n" inside the table, although it ended up not using a regex. Here's the updated code:
string input = emailMessage.Message.Replace("\n<tr>\n", "<tr>").Replace("</th>\n", "</th>").Replace("\n</tr>", "</tr>")
.Replace("</td>\n", "</td>").Replace("\n</table>", "</table>");
string output = input;
return output;
Thank you for all of the comments and suggestions.
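For comparison, here is a rough sketch of what the question originally asked for: running the replacement only inside the table element. It uses Regex.Replace with a MatchEvaluator so that each <table>...</table> block is rewritten on its own. TableNewlines is a hypothetical name, and this assumes tables are not nested (the lazy match would stop at the first </table>), so the parser-based advice above still applies to arbitrary HTML.

```csharp
using System.Text.RegularExpressions;

static class TableNewlines
{
    // Matches one <table>...</table> block; Singleline lets '.' span newlines.
    static readonly Regex Table = new Regex(
        @"<table\b.*?</table>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    public static string StripInsideTables(string input)
    {
        // The evaluator runs once per table and only rewrites that block,
        // so tabs and line breaks outside the table are left alone.
        return Table.Replace(input, m => Regex.Replace(m.Value, @"[\t\r\n]", ""));
    }
}
```

Text before and after the table keeps its line breaks; only the markup between <table and </table> is collapsed onto one line.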
I need to surround every word in loaded html text with a span which will uniquely identify every word. The problem is that some content is not being handled by my regex pattern. My current problems include...
1) Special HTML characters like &rdquo; &ldquo; are treated as words.
2) Currency values. e.g. $2,500 end up as "2" "500" (I need "$2,500")
3) Double hyphened words. e.g. one-legged-man. end up "one-legged" "man"
I'm new to regular expressions and after looking at various other posts have derived the following pattern that seems to work for everything except the above exceptions.
What I have so far is:
string pattern = @"(?<!<[^>]*?)\b('\w+)|(\w+['-]\w+)|(\w+')|(\w+)\b(?![^<]*?>)";
string newText = Regex.Replace(oldText, pattern, delegate(Match m) {
wordCnt++;
return "<span data-wordno='" + wordCnt.ToString() + "'>" + m.Value + "</span>";
});
How can I fix or extend the above pattern to cater for these problems, or should I be using a different approach altogether?
A fundamental problem that you're up against here is that html is not a "regular language". This means that html is complex enough that you are always going to be able to come up with valid html that isn't recognized by any regular expression. It isn't a matter of writing a better regular expression; this is a problem that regex can't solve.
What you need is a dedicated html parser. You could try this nuget package. There are many others, but HtmlAgilityPack is quite popular.
Edit: Below is an example program using HtmlAgilityPack. When an HTML document is parsed, the result is a tree (aka the DOM). In the DOM, text is stored inside text nodes. So something like <p>Hello World</p> is parsed into a node representing the p tag, with a child text node holding "Hello World". So what you want to do is find all the text nodes in your document and then, for each node, split the text into words and surround the words with spans.
You can search for all the text nodes using an xpath query. The xpath I have below is /html/body//*[not(self::script)]/text(), which avoids the html head and any script tags in the body.
using System;
using System.Linq;
using HtmlAgilityPack;
using static System.Console;

class Program
{
    static void Main(string[] args)
    {
        var doc = new HtmlDocument();
        doc.Load(args[0]);
        var wordCount = 0;
        var nodes = doc.DocumentNode
            .SelectNodes("/html/body//*[not(self::script)]/text()");
        foreach (var node in nodes)
        {
            var words = node.InnerHtml.Split(' ');
            var surroundedWords = words.Select(word =>
            {
                if (String.IsNullOrWhiteSpace(word))
                {
                    return word;
                }
                else
                {
                    return $"<span data-wordno={wordCount++}>{word}</span>";
                }
            });
            // Rejoin with the spaces that Split removed, so words stay separated.
            var newInnerHtml = String.Join(" ", surroundedWords);
            node.InnerHtml = newInnerHtml;
        }
        WriteLine(doc.DocumentNode.InnerHtml);
    }
}
Fix 1) by adding "negative look-behind assertions" (?<!\&). I believe they are needed at the beginning of the 1st, 3rd, and 4th alternatives in the original pattern above.
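Fix 2) by adding a new alternative (\$?(\d+[,.])+\d+)| at the start of the pattern. Alternatives are tried left to right, so if it were placed at the end, a number without a dollar sign would already have been split by (\w+). Placed first, it also handles non-dollar and decimal-pointed numbers at the same time.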
Fix 3) by enhancing the (\w+['-]\w+) alternative to read instead ((\w+['-])+\w+).
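As a rough illustration of fixes 2) and 3) only, here is a cut-down pattern run against plain text. It deliberately leaves out the tag look-arounds and the entity fix from the original pattern, and it puts the number alternative first, which is my own assumption about ordering. WordPattern and Words are hypothetical names.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

static class WordPattern
{
    // Alternatives are tried left to right at each position, so the most
    // specific forms come first: numbers, then words joined by ' or -,
    // then plain words.
    const string Pattern = @"\$?(?:\d+[,.])+\d+|(?:\w+['-])+\w+|\w+";

    // Returns every token the pattern recognises as a "word".
    public static string[] Words(string text)
    {
        return Regex.Matches(text, Pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }
}
```

With the input "The one-legged-man paid $2,500." this yields four tokens: The, one-legged-man, paid and $2,500 (the trailing full stop is not part of any token).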
I found some examples that use regular expressions to detect patterns for URLs inside text paragraphs and add the HTML code to make them links. The problem I have with this approach is that, sometimes, the input paragraph contains both URLs written in plain text (which I want to convert to clickable) but also some URLs that already have markup for links. For example, consider this paragraph:
My favourite search engine is http://www.google.com but
sometimes I also use <a href="http://www.yahoo.com">http://www.yahoo.com</a>
I only want to convert the Google link but leave the two Yahoo links as they are.
What I am after is a C# function that uses regex to detect URLs and convert them but which ignores URLs that either have "A" markup tags surrounding them or inside an "A" tag already.
Edit
Here is what I have so far:
PostBody = "My favourite search engine is http://www.google.com but sometimes I also use <a href=\"http://www.yahoo.com\">http://www.yahoo.com</a>";
String pattern = @"http(s)?://([\w+?\.\w+])+([a-zA-Z0-9\~\!\@\#\$\%\^\&\*\(\)_\-\=\+\\\/\?\.\:\;\'\,]*)?";
System.Text.RegularExpressions.Regex regex = new System.Text.RegularExpressions.Regex(pattern);
System.Text.RegularExpressions.MatchCollection matches = regex.Matches(PostBody);
for (int i = 0; i < matches.Count; i++)
{
    PostBody = PostBody.Replace(matches[i].Value, String.Format("<a href=\"{0}\">{1}</a>", matches[i].Value, matches[i].Value));
}
ltrlPostBody.Text = PostBody;
And here is what I am getting (I split it in multiple lines for clarity):
My favourite search engine is
http://www.google.com
but sometimes I also use
http://www.yahoo.com">
http://www.yahoo.com</a>">
I only want to convert the first link (in this case) because it does not already make part of a link markup.
You can also use HTML Agility Pack, which gives you more power (for example, you don't want to convert URLs inside <script></script> elements and style elements):
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

namespace ConsoleApplication3 {
    class Program {
        static void Main(string[] args) {
            var text = @"My favourite search engine is http://www.google.com but
sometimes I also use <a href=""http://www.yahoo.com"">http://www.yahoo.com</a>
<div>http://catchme.com</div>
<script>
var thisCanHurt = 'http://noescape.com';
</script>";
            var doc = new HtmlDocument();
            doc.LoadHtml(text);
            var regex = new Regex(@"http(s)?://([\w+?\.\w+])+([a-zA-Z0-9\~\!\@\#\$\%\^\&\*\(\)_\-\=\+\\\/\?\.\:\;\'\,]*)?", RegexOptions.IgnoreCase);
            var nodes = doc.DocumentNode.SelectNodes("//text()");
            foreach (var node in nodes) {
                // Skip text that is already a link, and text inside script or style elements.
                if (node.ParentNode != null && (node.ParentNode.Name == "a" || node.ParentNode.Name == "script" || node.ParentNode.Name == "style")) {
                    continue;
                }
                // Wrap each plain-text URL in an anchor tag.
                node.InnerHtml = regex.Replace(node.InnerText, (match) => {
                    return string.Format(@"<a href=""{0}"">{0}</a>", match.Value);
                });
            }
            var builder = new StringBuilder(100);
            using (var writer = new StringWriter(builder)) {
                doc.Save(writer);
            }
            var compose = builder.ToString();
        }
    }
}
If you have the Regex already written to determine when text is wrapped with an anchor tag, you can use RegularExpressions to determine if your input is a match, via http://msdn.microsoft.com/en-us/library/sdx2bds0.aspx
Can do something as simple as
private string Pattern = "whateverregexpatternyouhavewritten";
private bool MatchesPattern(string input)
{
return Regex.IsMatch(input, Pattern); // input comes first, then the pattern
}
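If you want to stay with regex alone, one common trick is to match the things you want to skip first. The sketch below puts a whole existing <a>...</a> element in the first alternative and a bare URL in the second, then only rewrites the match when the URL group took part. Linkifier, the group name url and the simplified URL pattern are my own assumptions, not the pattern from the question; it does not protect script or style blocks, and trailing punctuation right after a URL would be swallowed into the link.

```csharp
using System.Text.RegularExpressions;

static class Linkifier
{
    // First alternative swallows a whole existing <a>...</a> element;
    // second alternative captures a bare URL in the named group "url".
    static readonly Regex AnchorOrUrl = new Regex(
        @"<a\b[^>]*>.*?</a>|(?<url>https?://[^\s<>""]+)",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string Linkify(string html)
    {
        return AnchorOrUrl.Replace(html, m =>
            m.Groups["url"].Success
                ? string.Format("<a href=\"{0}\">{0}</a>", m.Value) // bare URL: wrap it
                : m.Value);                                         // existing link: leave untouched
    }
}
```

Run on the sample paragraph from the question, this wraps the Google URL and returns the existing Yahoo anchor unchanged.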
I guess I need some regex help. I want to find all tags like <?abc?> so that I can replace each one with whatever the result is for the code run inside. I just need help regexing the tag/code string, not parsing the code inside :p.
<b><?abc print 'test' ?></b> would result in <b>test</b>
Edit: Not specifically but in general, matching (<?[chars] (code group) ?>)
This will build up a new copy of the string source, replacing <?abc code?> with the result of process(code)
Regex abcTagRegex = new Regex(#"\<\?abc(?<code>.*?)\?>");
StringBuilder newSource = new StringBuilder();
int curPos = 0;
foreach (Match abcTagMatch in abcTagRegex.Matches(source)) {
string code = abcTagMatch.Groups["code"].Value;
string result = process(code);
newSource.Append(source.Substring(curPos, abcTagMatch.Index - curPos)); // second argument is a length, not an end index
newSource.Append(result);
curPos = abcTagMatch.Index + abcTagMatch.Length;
}
newSource.Append(source.Substring(curPos));
source = newSource.ToString();
N.B. I've not been able to test this code, so some of the functions may be slightly the wrong name, or there may be some off-by-one errors.
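The same idea can probably be written more compactly with the Regex.Replace overload that takes a MatchEvaluator, which does the copying between matches for you. This is only a sketch: AbcTags and Expand are hypothetical names, process is a hypothetical callback that runs the embedded code, and the pattern additionally trims whitespace around the code, which is my assumption.

```csharp
using System;
using System.Text.RegularExpressions;

static class AbcTags
{
    // <?abc ...code... ?> with the code captured lazily in the group "code".
    static readonly Regex Tag = new Regex(
        @"<\?abc\s+(?<code>.*?)\s*\?>", RegexOptions.Singleline);

    // 'process' is a hypothetical callback that runs the embedded code
    // and returns the text to substitute for the whole tag.
    public static string Expand(string source, Func<string, string> process)
    {
        return Tag.Replace(source, m => process(m.Groups["code"].Value));
    }
}
```

For the example in the question, process would receive the string print 'test' and whatever it returns replaces the whole tag.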
var regex = new Regex(@"<\?(\w+) (\w+) (.+?)\s*\?>");
This will take this source
<b><?abc print 'test' ?></b>
and break it up like this:
Value: <?abc print 'test' ?>
SubMatch: abc
SubMatch: print
SubMatch: 'test'
These can then be sent to a method that can handle it differently depending on what the parts are.
If you need more advanced syntax handling, I believe you need to go beyond regex.
I designed a template engine using Antlr, but that's way more complex ;)
exp = new Regex(@"<\?abc print '(.+?)' \?>");
str = exp.Replace(str, "$1");
Something like this should do the trick. Change the regexes as you see fit.
I have some site content that contains abbreviations. I have a list of recognised abbreviations for the site, along with their explanations. I want to create a regular expression which will allow me to replace all of the recognised abbreviations found in the content with some markup.
For example:
content: This is just a little test of the memb to see if it gets picked up.
Deb of course should also be caught here.
abbreviations: memb = Member; deb = Debut;
result: This is just a little test of the [a title="Member"]memb[/a] to see if it gets picked up.
[a title="Debut"]Deb[/a] of course should also be caught here.
(This is just example markup for simplicity).
Thanks.
EDIT:
CraigD's answer is nearly there, but there are issues. I only want to match whole words. I also want to keep the correct capitalisation of each word replaced, so that deb is still deb, and Deb is still Deb as per the original text. For example, this input:
This is just a little test of the memb.
And another memb, but not amemba.
Deb of course should also be caught here.deb!
First you would need to Regex.Escape() all the input strings.
Then you can look for them in the string, and iteratively replace them by the markup you have in mind:
string abbr = "memb";
string word = "Member";
string pattern = String.Format(@"\b{0}\b", Regex.Escape(abbr)); // verbatim string: "\b" in a normal string is a backspace
string substitute = String.Format("[a title=\"{0}\"]{1}[/a]", word, abbr);
string output = Regex.Replace(input, pattern, substitute);
EDIT: I asked if a simple String.Replace() wouldn't be enough - but I can see why regex is desirable: you can use it to enforce "whole word" replacements only by making a pattern that uses word boundary anchors.
You can go as far as building a single pattern from all your escaped input strings, like this:
\b(?:{abbr_1}|{abbr_2}|{abbr_3}|{abbr_n})\b
and then using a match evaluator to find the right replacement. This way you can avoid iterating the input string more than once.
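A minimal standalone sketch of that single-pattern idea, using the example markup from the question. AbbreviationMarkup and Apply are hypothetical names; the case-insensitive dictionary copy is my own choice so that "Deb" finds the "deb" entry while the matched text keeps its original casing.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class AbbreviationMarkup
{
    public static string Apply(string input, IDictionary<string, string> abbreviations)
    {
        // Case-insensitive copy so "Deb" finds the "deb" entry.
        var lookup = new Dictionary<string, string>(abbreviations, StringComparer.OrdinalIgnoreCase);

        // Builds e.g. \b(?:memb|deb)\b, escaping each key in case it
        // contains regex metacharacters.
        string pattern = @"\b(?:" + string.Join("|", lookup.Keys.Select(k => Regex.Escape(k))) + @")\b";

        // One pass over the input; the evaluator picks the explanation
        // for whichever abbreviation matched and reuses the matched text.
        return Regex.Replace(input, pattern,
            m => string.Format("[a title=\"{0}\"]{1}[/a]", lookup[m.Value], m.Value),
            RegexOptions.IgnoreCase);
    }
}
```

Because of the \b anchors, "amemba" is left alone, while "here.deb!" still matches since the full stop and the exclamation mark both count as word boundaries.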
Not sure how well this will scale to a big word list, but I think it should give the output you want (although in your question the 'result' seems identical to 'content')?
Anyway, let me know if this is what you're after
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            var input = @"This is just a little test of the memb to see if it gets picked up.
Deb of course should also be caught here.";
            var dictionary = new Dictionary<string,string>
            {
                {"memb", "Member"}
                ,{"deb","Debut"}
            };
            var regex = "(" + String.Join(")|(", dictionary.Keys.ToArray()) + ")";
            foreach (Match metamatch in Regex.Matches(input
                , regex /*@"(memb)|(deb)"*/
                , RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture))
            {
                input = input.Replace(metamatch.Value, dictionary[metamatch.Value.ToLower()]);
            }
            Console.Write (input);
            Console.ReadLine();
        }
    }
}
For anyone interested, here is my final solution. It is for a .NET user control. It uses a single pattern with a match evaluator, as suggested by Tomalak, so there is no foreach loop. It's an elegant solution, and it gives me the correct output for the sample input while preserving correct casing for matched strings.
public partial class Abbreviations : System.Web.UI.UserControl
{
    private Dictionary<String, String> dictionary = DataHelper.GetAbbreviations();

    protected void Page_Load(object sender, EventArgs e)
    {
        string input = "This is just a little test of the memb. And another memb, but not amemba to see if it gets picked up. Deb of course should also be caught here.deb!";
        var regex = "\\b(?:" + String.Join("|", dictionary.Keys.ToArray()) + ")\\b";
        MatchEvaluator myEvaluator = new MatchEvaluator(GetExplanationMarkup);
        input = Regex.Replace(input, regex, myEvaluator, RegexOptions.IgnoreCase);
        litContent.Text = input;
    }

    private string GetExplanationMarkup(Match m)
    {
        return string.Format("<b title='{0}'>{1}</b>", dictionary[m.Value.ToLower()], m.Value);
    }
}
The output looks like this (below). Note that it only matches full words, and that the casing is preserved from the original string:
This is just a little test of the <b title='Member'>memb</b>. And another <b title='Member'>memb</b>, but not amemba to see if it gets picked up. <b title='Debut'>Deb</b> of course should also be caught here.<b title='Debut'>deb</b>!
I doubt it will perform better than a plain string.Replace, so if performance is critical, measure it (after refactoring a bit to use a compiled regex). You can do the regex version as:
var abbrsWithPipes = "(abbr1|abbr2)";
var regex = new Regex(abbrsWithPipes);
return regex.Replace(html, m => GetReplaceForAbbr(m.Value));
You need to implement GetReplaceForAbbr, which receives the specific abbr being matched.
I'm doing pretty much exactly what you're looking for in my application, and this works for me.
The parameter str is your content:
public static string GetGlossaryString(string str)
{
    // This collection would contain your abbreviations; you could make it a Dictionary
    // to hold the abbreviation/full-term pairs and use them in the loop below.
    List<string> glossaryWords = GetGlossaryItems();
    // Quick and dirty way to also match the first and last word in the content.
    str = string.Format(" {0} ", str);
    foreach (string word in glossaryWords)
        str = Regex.Replace(str, "([\\W])(" + word + ")([\\W])", "$1<span class='glossaryItem'>$2</span>$3", RegexOptions.IgnoreCase);
    return str.Trim();
}