Replacing html content in a string - c#

I have a string that has html contents such as:
string myMessage = "Please the website for more information (<a class=\"link\" href=\"http://www.africau.edu/images/default/sample.pdf\" target=_blank\" id=\"urlLink\"> easy details given</a>)";
What I need in the end is:
string myMessage = "Please the website for more information http://www.africau.edu/images/default/sample.pdf easy details given";
I can do this by removing each piece with myMessage = myMessage.Replace("string to replace", "");, but then I have to list every string and replace it with an empty one. Is there a better solution?

If I understand you correctly, you have a larger text with multiple occurrences of "<a ....>" and you want to replace each whole element with just the URL given in its href.
I am not sure this makes it much easier for you, but you could use Regex.Matches, for example:
var myMessage = "Please the website for more information (<a class=\"link\" href=\"http://www.africau.edu/images/default/sample.pdf\" target=_blank\" id=\"urlLink\"> easy details given</a>)";
var matches = Regex.Matches(myMessage, "(.+?)<a.+?href=\"(.+?)\".+?<\\/a>(.+?)");
var strBuilder = new StringBuilder();
foreach (Match match in matches)
{
var groups = match.Groups;
strBuilder.Append(groups[1]) // Please the website for more information (
.Append(groups[2]) // http://www.africau.edu/images/default/sample.pdf
.Append(groups[3]); // )
}
Debug.Log(strBuilder.ToString());
So what does this do?
(.+?) will create a group for everything before the first encounter of the following <a => groups[1]
<a.+?href=" matches everything starting with <a and ending with href=" => ignored
(.+?) will create a group for everything between href=" and the next " (so the URL) => groups[2]
".+?<\/a> matches everything from the " until the next </a> => ignored
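(.+?) will create a group for the text after the </a>; because it is lazy and nothing follows it in the pattern, it captures only a single character, here the ) => groups[3]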
and groups[0] is the entire match.
so finally we just want to combine
groups[1] + groups[2] + groups[3]
but in a loop so we find possibly multiple matches within the same string and it is simply more efficient to use a StringBuilder for that.
Result
Please the website for more information (http://www.africau.edu/images/default/sample.pdf)
You can easily adjust this to also remove the ( ) or to include the text between the tags, but I figured this makes the most sense for now.
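The adjustment mentioned above (dropping the parentheses and keeping the link text) can be sketched with a single Regex.Replace call and named groups. This is only a minimal sketch: the class name LinkFlattener and the group names url and text are made up for illustration, and the pattern assumes double-quoted href values and no nested tags inside the link.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkFlattener
{
    // Matches one <a ... href="URL" ...>text</a> element. A parenthesis directly
    // before and/or after the element is consumed as well (assumption: such
    // parentheses only exist to wrap the link, as in the question's sample).
    private static readonly Regex Anchor = new Regex(
        @"\(?<a\b[^>]*?\bhref=""(?<url>[^""]*)""[^>]*>(?<text>.*?)</a>\)?",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Replaces every matched element with its URL followed by its inner text.
    public static string Flatten(string html)
    {
        // ${url} and ${text} refer to the named groups in the pattern above.
        return Anchor.Replace(html, "${url}${text}");
    }
}
```

Calling LinkFlattener.Flatten(myMessage) on the string from the question should give the form the asker wanted, with the URL followed by " easy details given". Strings without any anchor are returned unchanged.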

I personally don't like to rely on the string format always being what I expect as this can lead to errors down the road.
Instead, I offer two ways I can think of doing this:
Use regular expressions:
string myMessage = "Please the website for more information (<a class=\"link\" href=\"http://www.africau.edu/images/default/sample.pdf\" target=_blank\" id=\"urlLink\"> easy details given</a>)";
var capturePattern = @"(.+)\(<a .*href.*?=""(.*?)"".*>(.*)</a>\)";
var regex = new Regex(capturePattern);
var captures = regex.Match(myMessage);
var newString = $"{captures.Groups[1]}{captures.Groups[2]}{captures.Groups[3]}";
Console.WriteLine(myMessage);
Console.WriteLine(newString);
Output:
Please the website for more information (<a class="link" href="http://www.africau.edu/images/default/sample.pdf" target=_blank" id="urlLink"> easy details given</a>)
Please the website for more information http://www.africau.edu/images/default/sample.pdf easy details given
Of course, regular expressions are only as good as the cases you can think of/test. I wrote this up quickly just to illustrate so make sure to verify for other string variations.
The other way is using HTMLAgilityPack:
string myMessage = "Please the website for more information (<a class=\"link\" href=\"http://www.africau.edu/images/default/sample.pdf\" target=_blank\" id=\"urlLink\"> easy details given</a>)";
var doc = new HtmlDocument();
doc.LoadHtml(myMessage);
var prefix = doc.DocumentNode.ChildNodes[0].InnerText;
var url = doc.DocumentNode.SelectNodes("//a[@href]").First().GetAttributeValue("href", string.Empty);
var suffix= doc.DocumentNode.ChildNodes[1].InnerText + doc.DocumentNode.ChildNodes[2].InnerText;
var newString = $"{prefix}{url}{suffix}";
Console.WriteLine(myMessage);
Console.WriteLine(newString);
Output:
Please the website for more information (<a class="link" href="http://www.africau.edu/images/default/sample.pdf" target=_blank" id="urlLink"> easy details given</a>)
Please the website for more information (http://www.africau.edu/images/default/sample.pdf easy details given)
Notice that this method preserves the parentheses around the link. This is because, from the agility pack's perspective, the opening parenthesis is part of the text node before the link and the closing one is part of the text node after it. You can always remove them with a quick replace.
This method adds a dependency, but the library is very mature and has been around for a long time.
It goes without saying that, for both methods, you should add error-handling checks for unexpected conditions.

Related

Using C# and Regex to find and surround all words and numbers within some html text with a span

I need to surround every word in loaded html text with a span which will uniquely identify every word. The problem is that some content is not being handled by my regex pattern. My current problems include...
1) Special HTML entities such as &rdquo; and &ldquo; are treated as words.
2) Currency values. e.g. $2,500 end up as "2" "500" (I need "$2,500")
3) Double hyphened words. e.g. one-legged-man. end up "one-legged" "man"
I'm new to regular expressions and after looking at various other posts have derived the following pattern that seems to work for everything except the above exceptions.
What I have so far is:
string pattern = @"(?<!<[^>]*?)\b('\w+)|(\w+['-]\w+)|(\w+')|(\w+)\b(?![^<]*?>)";
string newText = Regex.Replace(oldText, pattern, delegate(Match m) {
wordCnt++;
return "<span data-wordno='" + wordCnt.ToString() + "'>" + m.Value + "</span>";
});
How can I fix/extend the above pattern to cater for these problems or should I be using a different approach all together?
A fundamental problem that you're up against here is that html is not a "regular language". This means that html is complex enough that you are always going to be able to come up with valid html that isn't recognized by any regular expression. It isn't a matter of writing a better regular expression; this is a problem that regex can't solve.
What you need is a dedicated html parser. You could try this nuget package. There are many others, but HtmlAgilityPack is quite popular.
Edit: Below is an example program using HtmlAgilityPack. When an HTML document is parsed, the result is a tree (aka the DOM). In the DOM, text is stored inside text nodes. So something like <p>Hello World</p> is parsed into a node that represents the p tag, with a child text node that holds the "Hello World". So what you want to do is find all the text nodes in your document and then, for each node, split the text into words and surround the words with spans.
You can search for all the text nodes using an xpath query. The xpath I have below is /html/body//*[not(self::script)]/text(), which avoids the html head and any script tags in the body.
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        var doc = new HtmlDocument();
        doc.Load(args[0]);
        var wordCount = 0;
        // SelectNodes returns null when the XPath matches nothing.
        var nodes = doc.DocumentNode
            .SelectNodes("/html/body//*[not(self::script)]/text()");
        if (nodes != null)
        {
            foreach (var node in nodes)
            {
                var words = node.InnerHtml.Split(' ');
                var surroundedWords = words.Select(word =>
                {
                    if (String.IsNullOrWhiteSpace(word))
                    {
                        return word;
                    }
                    else
                    {
                        return $"<span data-wordno={wordCount++}>{word}</span>";
                    }
                });
                // Join with the separator used by Split so the spaces between words survive.
                var newInnerHtml = String.Join(" ", surroundedWords);
                node.InnerHtml = newInnerHtml;
            }
        }
        Console.WriteLine(doc.DocumentNode.InnerHtml);
    }
}
Fix 1) by adding "negative look-behind assertions" (?<!\&). I believe they are needed at the beginning of the 1st, 3rd, and 4th alternatives in the original pattern above.
Fix 2) by adding a new alternative |(\$?(\d+[,.])+\d+) at the end of the pattern. This also handles non-dollar and decimal-pointed numbers at the same time.
Fix 3) by enhancing the (\w+['-]\w+) alternative to read instead ((\w+['-])+\w+).
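The three fixes above can be sketched together in one pattern, applied to the text of a single text node (so tags never reach the regex). This is a rough sketch, not the pattern from the question: the class name WordWrapper is invented for illustration, the number alternative is placed before the word alternative so that digits are tried as a number first, and leading or trailing apostrophes are deliberately not handled.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordWrapper
{
    // Alternatives are tried in order at each position:
    //   1. an HTML entity such as &rdquo; or &#8220; (matched so it can be skipped)
    //   2. a number with optional leading $ and , or . separators ($2,500 or 3.14)
    //   3. a word with any number of inner ' or - joins (one-legged-man, don't)
    private static readonly Regex Token = new Regex(
        @"&#?\w+;|\$?\d+(?:[,.]\d+)*|\w+(?:['-]\w+)*");

    // Wraps every word or number in a numbered span; entities are left as they are.
    public static string Wrap(string text)
    {
        var wordNo = 0;
        return Token.Replace(text, m =>
            m.Value[0] == '&'
                ? m.Value
                : "<span data-wordno='" + (++wordNo) + "'>" + m.Value + "</span>");
    }
}
```

Because the entity alternative comes first, &rdquo; is consumed whole and returned untouched instead of having its name wrapped as a word.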

Cleaning/formatting URLs

If I have the following URL:
/sites/testsite/subsite/shared%20documents1/projects/project%20-%20csf%20healthcare%20patient%20dining%20development
/sites/testsite/subsite2/healthcare/sd/Documents/Cleaning%20Services
I need to be able to clean the URLs so I do this with the following:
string webUrl = sd.Key.Substring(0, sd.Key.ToLower().IndexOf("documents") - 1);
This works great for the second link and gives me the following cleaned-up URL:
/sites/testsite/subsite2/healthcare/sd
This however is not universal and it does not work for the first Url, and what I get is the following:
/sites/testsite/subsite/Shared%2
Ideally what I would want to get here is
/sites/testsite/subsite
Is there a better way (universal) to ensure that this works for both URLs?
These are escaped (percent-encoded) strings; use the JavaScript function unescape() to unescape them.
e.g.
unescape('/sites/testsite/subsite/shared%20documents1/projects/project%20-%20csf%20healthcare%20patient%20dining%20development')
// /sites/testsite/subsite/shared documents1/projects/project - csf healthcare patient dining development
In C#, use HttpUtility.UrlDecode (HttpUtility.HtmlDecode only decodes HTML entities such as &amp;, not percent-encoding):
var result = HttpUtility.UrlDecode("/sites/testsite/subsite/shared%20documents1/projects/project%20-%20csf%20healthcare%20patient%20dining%20development");
The best way to do this is by using Uri.UnescapeDataString
Uri.UnescapeDataString(@"/sites/testsite/subsite/shared%20documents1/projects/project%20-%20csf%20healthcare%20patient%20dining%20development")
This will give you
/sites/testsite/subsite/shared documents1/projects/project - csf healthcare patient dining development
Then you can remove the spaces if you want to.
If you are trying to retrieve just "/sites/sitename/subsite" you can do this
var match = Regex.Match("/sites/compass/community/Shared%20documents1/projects/project%20-%20csf%20healthcare%20patient%20dining%20development", "^/sites/.*?/.*?/", RegexOptions.IgnoreCase);
if (match.Success)
    Console.WriteLine(match.Value); // "/sites/compass/community/" (note the trailing slash)
Or, to retrieve everything to the left of the path segment that contains "documents":
var source = new [] {"/sites/testsite/subsite/shared%20documents1/projects/project%20-%20csf%20healthcare%20patient%20dining%20development",
"/sites/testsite/subsite2/healthcare/sd/Documents/Cleaning%20Services"};
foreach (var item in source)
{
var match = Regex.Match(item, "^(.*)(?:/.*?Documents)", RegexOptions.IgnoreCase);
if (match.Success)
Console.WriteLine(match.Groups[1].Value);
}
Output:
/sites/testsite/subsite
/sites/testsite/subsite2/healthcare/sd
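The same result can be reached without a regular expression by working on path segments. The sketch below is one possible approach under stated assumptions: the class name SitePath and the method name BeforeDocuments are invented for illustration, and "the documents library" is assumed to be the first segment whose decoded name contains "documents" in any casing.

```csharp
using System;
using System.Linq;

public static class SitePath
{
    // Returns the part of the path before the first segment whose decoded name
    // contains "documents" (case-insensitive). If no segment matches, the path
    // is returned unchanged.
    public static string BeforeDocuments(string path)
    {
        var kept = path.Split('/').TakeWhile(segment =>
            Uri.UnescapeDataString(segment)
               .IndexOf("documents", StringComparison.OrdinalIgnoreCase) < 0);
        return string.Join("/", kept);
    }
}
```

Working per segment avoids the original problem, where IndexOf("documents") - 1 cut into the middle of "shared%20documents1" instead of stopping at the preceding slash.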

extract text from <p>...</p> tag or directly from an HTML file

I have an HTML page that contains some filenames that I want to download from a web server.
I need to read these filenames in order to create a list that will be passed to my web application, which downloads the files from the server. These filenames have some extension.
I have dug into this topic but haven't found anything except:
Regex cannot be used to parse HTML.
Use HTML Agility Pack
Is there no other way to search an HTML file for text that matches a pattern like filename.ext?
Sample HTML that contains filename -
<p class=3DMsoNormal style=3D'margin-top:0in;margin-right:0in;margin-bottom=:0in; margin-left:1.5in;margin-bottom:.0001pt;text-indent:-.25in;line-height:normal;mso-list:l1 level3 lfo8;tab-stops:list 1.5in'><![if !supportLists]> <span style=3D'font-family:"Times New Roman","serif";mso-fareast-font-family:"Times New Roman"'><span style=3D'mso-list:Ignore'>1.<span style=3D'font:7.0pt "Times New Roman"'>
</span></span></span><![endif]><span style=3D'font-family:"Times New Roman","serif"; mso-fareast-font-family:"Times New Roman"'>**13572_PostAccountingReport_2009-06-03.acc**<o:p></o:p></span></p>
I can't use HTML Agility Pack because I am not allowed to download or use any additional application or tool.
Can't this be achieved by some other logic?
This is what I have done so far:
string pageSource = "";
string geturl = @"C:\Documents and Settings\NASD_Download.mht";
WebRequest getRequest = WebRequest.Create(geturl);
WebResponse getResponse = getRequest.GetResponse();
using (StreamReader sr = new StreamReader(getResponse.GetResponseStream()))
{
pageSource = sr.ReadToEnd();
pageSource.Replace("=", "");
}
var fileNames = from Match m in Regex.Matches(pageSource, @"[0-9]+_+[A-Za-z]+_+[0-9]+-+[0-9]+-+[0-9]+.+[a-z]")
select m.Value;
foreach (var s in fileNames)
Response.Write(s);
Because of some "=" characters occurring in every file name, I am not able to get the filename. How can I remove the occurrences of "=" in the pageSource string?
Thanks in advance
Akhil
Well, knowing that regex aren't ideal to find values in HTML:
var files = [];
var p = document.getElementsByTagName('p');
for (var i = 0; i < p.length; i++){
var match = p[i].innerHTML.match(/\s(\S+\.ext)\s/)
if (match)
files.push(match[1]);
}
Note:
Read the comments to the question.
If the extension can be anything, you can use this:
var files = [];
var p = document.getElementsByTagName('p');
for (var i = 0; i < p.length; i++){
var match = p[i].innerHTML.match(/\b(\S+\.\S+)\b/)
console.log(match)
if (match)
files.push(match[1]);
}
document.getElementById('result').innerHTML = files + "";
But this is really not reliable.
Well, you can use regular expressions to extract stuff that looks like file names. Since, as you correctly point out, regular expressions do not parse HTML, you might get false positives, i.e., you might get results that look like file names but are not.
Let's take an example:
string html = @"<p class=3DMsoNormal ...etc...";
var fileNames = from Match m in Regex.Matches(html, @"\b[A-Za-z0-9_-]+\.[A-Za-z0-9_-]{3}\b")
select m.Value;
foreach (var s in fileNames)
Console.WriteLine(s);
Console.ReadLine();
This will return
1.5in
1.5in
7.0pt
13572_PostAccountingReport_2009-06-03.acc
You see, HTML stuff that looks like a file name will be returned. Of course, you could refine the regular expression (for example, replace + with {3,}, so that at least three characters are required for the part before the dot) so that the false positives in this example are filtered out. Still, it's always going to be an approximate result, not an exact one.
It may be impossible to get file names using a common pattern because of 1.5in, -.25in, 7.0pt and the like. Try to be more specific (if possible), for example:
/[a-z0-9_-]+\.[a-z]+/gi or
/>[a-z0-9_-]+\.[a-z]+</gi (markup included) or even
/>\d+_PostAccountingReport_\d+-\d+-\d+\.[a-z]+</gi
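On the question's last point (removing the "=" characters): pageSource.Replace("=", "") has no visible effect in the posted code because .NET strings are immutable and the returned string is discarded; the result has to be assigned, as in pageSource = pageSource.Replace("=", "");. The sketch below goes one step further under an assumption that is not confirmed by the question: that the stray "=" are quoted-printable soft line breaks (an "=" at the end of a line), which the "=3D" sequences in the sample suggest. The class name FileNameFinder and the date-shaped pattern are illustrative only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class FileNameFinder
{
    // Digits, underscore, letters, underscore, yyyy-mm-dd, dot, lowercase extension.
    // Modelled on the single sample name in the question.
    private static readonly Regex FileName = new Regex(
        @"\d+_[A-Za-z]+_\d{4}-\d{2}-\d{2}\.[a-z]+");

    public static List<string> Find(string pageSource)
    {
        // Assumption: "=" followed by a line break is a quoted-printable soft
        // line break, so removing it rejoins names that were split across lines.
        // The result is assigned because Replace returns a new string.
        var cleaned = Regex.Replace(pageSource, @"=\r?\n", "");
        return FileName.Matches(cleaned).Cast<Match>().Select(m => m.Value).ToList();
    }
}
```

Because the pattern requires the underscore and date parts, CSS values such as 1.5in or 7.0pt are not reported as file names.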

Find/parse server-side <?abc?>-like tags in html document

I guess I need some regex help. I want to find all tags like <?abc?> so that I can replace each one with the result of the code run inside it. I just need help with a regex for the tag/code string, not with parsing the code inside :p.
<b><?abc print 'test' ?></b> would result in <b>test</b>
Edit: Not specifically but in general, matching (<?[chars] (code group) ?>)
This will build up a new copy of the string source, replacing <?abc code?> with the result of process(code)
Regex abcTagRegex = new Regex(@"\<\?abc(?<code>.*?)\?>");
StringBuilder newSource = new StringBuilder();
int curPos = 0;
foreach (Match abcTagMatch in abcTagRegex.Matches(source)) {
string code = abcTagMatch.Groups["code"].Value;
string result = process(code);
// The second argument of Substring is a length, not an end index.
newSource.Append(source.Substring(curPos, abcTagMatch.Index - curPos));
newSource.Append(result);
curPos = abcTagMatch.Index + abcTagMatch.Length;
}
newSource.Append(source.Substring(curPos));
source = newSource.ToString();
N.B. I've not been able to test this code, so some of the functions may be slightly the wrong name, or there may be some off-by-one errors.
var regex = new Regex(@"<\?(\w+) (\w+) (.+?)\?>");
This will take this source
<b><?abc print 'test' ?></b>
and break it up like this:
Value: <?abc print 'test' ?>
SubMatch: abc
SubMatch: print
SubMatch: 'test'
These can then be sent to a method that can handle it differently depending on what the parts are.
If you need more advanced syntax handling you need to go beyond regex I believe.
I designed a template engine using Antlr, but that's way more complex ;)
var exp = new Regex(@"<\?abc print '(.+)' \?>");
str = exp.Replace(str, "$1");
Something like this should do the trick. Change the regexes as you see fit.
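The approaches above can also be combined into one Regex.Replace call with a match evaluator, which avoids the manual StringBuilder bookkeeping. This is a minimal sketch, not a definitive implementation: the class name TagExpander and the process callback are hypothetical stand-ins for whatever actually runs the code inside the tag.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagExpander
{
    // <?name code ?> : "name" is the tag type (e.g. abc), "code" is everything
    // up to the closing ?>, with surrounding whitespace left out of the group.
    private static readonly Regex Tag = new Regex(
        @"<\?(?<name>\w+)\s+(?<code>.*?)\s*\?>", RegexOptions.Singleline);

    // Replaces each tag with the string returned by process(name, code).
    public static string Expand(string source, Func<string, string, string> process)
    {
        return Tag.Replace(source, m =>
            process(m.Groups["name"].Value, m.Groups["code"].Value));
    }
}
```

For example, a process callback that only understands print 'x' and returns x would turn <b><?abc print 'test' ?></b> into <b>test</b>.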

Highlight a list of words using a regular expression in c#

I have some site content that contains abbreviations. I have a list of recognised abbreviations for the site, along with their explanations. I want to create a regular expression which will allow me to replace all of the recognised abbreviations found in the content with some markup.
For example:
content: This is just a little test of the memb to see if it gets picked up.
Deb of course should also be caught here.
abbreviations: memb = Member; deb = Debut;
result: This is just a little test of the [a title="Member"]memb[/a] to see if it gets picked up.
[a title="Debut"]Deb[/a] of course should also be caught here.
(This is just example markup for simplicity).
Thanks.
EDIT:
CraigD's answer is nearly there, but there are issues. I only want to match whole words. I also want to keep the correct capitalisation of each word replaced, so that deb is still deb, and Deb is still Deb as per the original text. For example, this input:
This is just a little test of the memb.
And another memb, but not amemba.
Deb of course should also be caught here.deb!
First you would need to Regex.Escape() all the input strings.
Then you can look for them in the string, and iteratively replace them by the markup you have in mind:
string abbr = "memb";
string word = "Member";
// Verbatim string: in a regular string literal, "\b" would be a backspace character.
string pattern = String.Format(@"\b{0}\b", Regex.Escape(abbr));
string substitute = String.Format("[a title=\"{0}\"]{1}[/a]", word, abbr);
string output = Regex.Replace(input, pattern, substitute);
EDIT: I asked if a simple String.Replace() wouldn't be enough - but I can see why regex is desirable: you can use it to enforce "whole word" replacements only by making a pattern that uses word boundary anchors.
You can go as far as building a single pattern from all your escaped input strings, like this:
\b(?:{abbr_1}|{abbr_2}|{abbr_3}|{abbr_n})\b
and then using a match evaluator to find the right replacement. This way you can avoid iterating the input string more than once.
Not sure how well this will scale to a big word list, but I think it should give the output you want (although in your question the 'result' seems identical to 'content')?
Anyway, let me know if this is what you're after
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
var input = @"This is just a little test of the memb to see if it gets picked up.
Deb of course should also be caught here.";
var dictionary = new Dictionary<string,string>
{
{"memb", "Member"}
,{"deb","Debut"}
};
var regex = "(" + String.Join(")|(", dictionary.Keys.ToArray()) + ")";
foreach (Match metamatch in Regex.Matches(input
, regex /*@"(memb)|(deb)"*/
, RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture))
{
input = input.Replace(metamatch.Value, dictionary[metamatch.Value.ToLower()]);
}
Console.Write (input);
Console.ReadLine();
}
}
}
For anyone interested, here is my final solution. It is for a .NET user control. It uses a single pattern with a match evaluator, as suggested by Tomalak, so there is no foreach loop. It's an elegant solution, and it gives me the correct output for the sample input while preserving correct casing for matched strings.
public partial class Abbreviations : System.Web.UI.UserControl
{
private Dictionary<String, String> dictionary = DataHelper.GetAbbreviations();
protected void Page_Load(object sender, EventArgs e)
{
string input = "This is just a little test of the memb. And another memb, but not amemba to see if it gets picked up. Deb of course should also be caught here.deb!";
var regex = "\\b(?:" + String.Join("|", dictionary.Keys.ToArray()) + ")\\b";
MatchEvaluator myEvaluator = new MatchEvaluator(GetExplanationMarkup);
input = Regex.Replace(input, regex, myEvaluator, RegexOptions.IgnoreCase);
litContent.Text = input;
}
private string GetExplanationMarkup(Match m)
{
return string.Format("<b title='{0}'>{1}</b>", dictionary[m.Value.ToLower()], m.Value);
}
}
The output looks like this (below). Note that it only matches full words, and that the casing is preserved from the original string:
This is just a little test of the <b title='Member'>memb</b>. And another <b title='Member'>memb</b>, but not amemba to see if it gets picked up. <b title='Debut'>Deb</b> of course should also be caught here.<b title='Debut'>deb</b>!
I doubt it will perform better than a plain string.Replace, so if performance is critical, measure it (after refactoring a bit to use a compiled regex). You can do the regex version as:
var abbrsWithPipes = "(abbr1|abbr2)";
var regex = new Regex(abbrsWithPipes);
return regex.Replace(html, m => GetReplaceForAbbr(m.Value));
You need to implement GetReplaceForAbbr, which receives the specific abbr being matched.
I'm doing pretty much exactly what you're looking for in my application, and this works for me.
The parameter str is your content:
public static string GetGlossaryString(string str)
{
List<string> glossaryWords = GetGlossaryItems();//this collection would contain your abbreviations; you could just make it a Dictionary so you can have the abbreviation-full term pairs and use them in the loop below
str = string.Format(" {0} ", str);//quick and dirty way to also search the first and last word in the content.
foreach (string word in glossaryWords)
str = Regex.Replace(str, "([\\W])(" + word + ")([\\W])", "$1<span class='glossaryItem'>$2</span>$3", RegexOptions.IgnoreCase);
return str.Trim();
}
