I did some searching and couldn't figure out why my solution is not working. Basically I need to take a string (which is HTML code), parse it, and look for mailto links (which I then want to replace as part of an obfuscation). Here is what I have thus far:
string text = "<p>Some Person<br /> Person's Position<br />p. 123-456-7890<br /> e. <a title=\"Email Some Person\" target=\"_blank\" href=\"mailto:someperson%40domain.com\">someperson#domain.com</a></p>";
text = Server.UrlDecode(text);
string safeEmails = Regex.Replace(text, "()(.*?)()", "<a class=\"mailme\" href=\"$2*$4\">$6</a>");
Response.Write( Server.HtmlDecode(safeEmails));
The text is coming out of a WYSIWYG text editor (Telerik RadEditor, for those familiar) and for all intents and purposes I cannot control what comes out of it.
Basically I need to find and replace any:
someone#domain.com
With:
<a class="mailme" href="someone#domain.com">someone#domain.com</a>
Some background: I am attempting to create a mailto link that will avoid detection by harvesters. The problem is that I receive a string with the e-mail as a standard mailto link. I cannot control the incoming string, so the mailto will always be an unprotected mailto. My objective is to find all of them, obfuscate them, then use JavaScript to "fix" the links so that human visitors can easily use them. I am open to new approaches as well as modifications to the above code.
You could use a regex or the HTML Agility Pack to find and obfuscate all your mailto links. If you want a good obfuscation, try reading "Ten methods to obfuscate e-mail addresses compared".
EDIT:
Sorry, from the first version of your question I didn't realize you had a problem getting your regex to work. Since you're using a WYSIWYG text editor, I think the HTML that comes out of it should be pretty "regular", so you may be fine using a regex.
You can try changing your Replace line like this:
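// [^\"]* and the lazy (.*?) keep each match inside a single link when the text contains several
string safeEmails = Regex.Replace(text, "href=\"mailto:[^\"]*\">(.*?)</a>", "class=\"mailme\" href=\"$1\">$1</a>");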
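As a rough sketch of how the replacement could also mask the address itself, the version below swaps the "@" for a "*" (an assumption, suggested only by the `$2*$4` in the question's replacement string). The class name `MailtoObfuscator` and the choice to overwrite the link text with the masked address are illustrative, not something the question specifies.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: rewrites mailto links so that neither the href nor the
// link text contains a literal e-mail address.
static class MailtoObfuscator
{
    // Group 1 = local part, group 2 = domain, group 3 = original link text (unused).
    // Accepts both a literal "@" and the URL-encoded form "%40".
    static readonly Regex MailtoLink = new Regex(
        "href=\"mailto:([^\"@]+)(?:@|%40)([^\"]+)\">(.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string Obfuscate(string html)
    {
        return MailtoLink.Replace(html, m =>
        {
            // Assumption: "*" stands in for "@" and client-side script restores it.
            string masked = m.Groups[1].Value + "*" + m.Groups[2].Value;
            return "class=\"mailme\" href=\"" + masked + "\">" + masked + "</a>";
        });
    }
}
```

Other attributes on the anchor (title, target) are left untouched because the pattern starts at `href=`. A script on the page would then look for `a.mailme` elements and rebuild the real mailto link.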
Related
What I want to do is to replace a part of the text inside the clipboard, but the problem is it is html formatted text and I am unable to modify its content using the below given code in C#. Any solutions?
Steps to replicate my doing:
1- Copy an entry from Cambridge Advanced Learner's Dictionary 4 to the clipboard, OR any other HTML-formatted text
2- Use this C# code in a Windows Forms application to modify and replace text while keeping its HTML formatting:
private void button1_Click(object sender, EventArgs e)
{
string myStr = Clipboard.GetText(TextDataFormat.Html);
myStr.Replace("Cambridge Advanced Learner's Dictionary - 4th Edition", "******************************");
Clipboard.SetText(myStr,TextDataFormat.Html);
}
But it seems that it does not work at all!
NOTE: I want to keep the html formatting, I don't want to strip string from its html formatting.
I used Regex and it seems to work when I use:
myStr = Regex.Replace(myStr, "Cambridge Advanced Learner's Dictionary - 4th Edition", "");
but when I want to use:
myStr = Regex.Replace(myStr, "Cambridge Advanced Learner's Dictionary - 4th Edition<br /><br />", "");
it does not work! Any solutions to remove those HTML tags: <br /><br />?
Using Regex solved the problem to some extent like this:
private void button1_Click(object sender, EventArgs e)
{
string myStr = Clipboard.GetText(TextDataFormat.Html);
myStr = Regex.Replace(myStr, "Cambridge Advanced Learner's Dictionary - 4th Edition", "");
Clipboard.SetText(myStr,TextDataFormat.Html);
}
but still unable to remove HTML tags like <br /><br /> from clipboard.
Since HTML input can be arbitrary, here are the steps I suggest:
Assuming you have a way to detect that the clipboard content is indeed in HTML, tidy it using a C# library of your choice (for example, this). This will allow the app to work with content which is "sanitized", i.e., HTML breaks such as <br> and <br /> below will be tidied to standard <br/> which you can then omit or replace.
Instead of using a "one-off" RegEx replacement like the one you have for handling HTML breaks, try to make your code a bit more flexible by anticipating future additions to the list of offending HTML elements you need to replace, i.e., use groups (for example, this). You will then be able to provide the user of your forms app a way to configure which elements to omit.
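One possible shape for such a replacement is sketched below: the phrase is escaped and followed by an optional run of line breaks, so `<br>`, `<br/>` and `<br />` are all covered in one pass. The names `ClipboardCleaner` and `RemoveWithTrailingBreaks` are made up for this illustration, and the sketch only handles break tags, not arbitrary elements.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for the clipboard question above.
static class ClipboardCleaner
{
    // Removes every occurrence of `phrase` together with any <br> variants
    // (and whitespace between them) that immediately follow it.
    public static string RemoveWithTrailingBreaks(string html, string phrase)
    {
        // Regex.Escape makes the phrase literal; the non-capturing group
        // matches <br>, <br/>, <br /> in any letter case, zero or more times.
        string pattern = Regex.Escape(phrase) + @"(?:\s*<br\s*/?>)*";
        return Regex.Replace(html, pattern, "", RegexOptions.IgnoreCase);
    }
}
```

Inside the button handler this would be called on the string returned by `Clipboard.GetText(TextDataFormat.Html)`, with the result assigned back before `Clipboard.SetText`.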
You must format the text in a special HTML Clipboard Format (link to description).
It looks like this (a working example, unlike the example given in the link, which has wrong Start- and End- numbers):
Version:1.0
StartHTML:00085
EndHTML:00287
StartFragment:00105
EndFragment:00269
<!--StartFragment--><HTML><HEAD><META HTTP-EQUIV="Content-Type" CONTENT="text/html;charset=UTF-8" /><TITLE></TITLE></HEAD><BODY>YOUR <B>HTML FORMATTED</B> TEXT GOES HERE!</BODY></HTML><!--EndFragment-->
Also make sure to fill in the right Start- and End- numbers in the top section. More specifically, you must adapt EndHTML, EndFragment and EndSelection to reflect the change in the length of your text. Replacing alone won't work.
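Rather than patching the numbers by hand, the header can be regenerated whenever the text changes. The sketch below is one way to do that, assuming UTF-8 byte offsets and a minimal `<html><body>` wrapper; the class name `HtmlClipboard` is hypothetical, and the optional StartSelection/EndSelection fields are left out.

```csharp
using System;
using System.Text;

// Hypothetical helper that wraps an HTML fragment in the HTML Clipboard Format
// and computes the offsets from the actual content.
static class HtmlClipboard
{
    // Every offset is zero-padded to 8 digits, so the header length is constant.
    const string HeaderTemplate =
        "Version:1.0\r\n" +
        "StartHTML:{0:D8}\r\n" +
        "EndHTML:{1:D8}\r\n" +
        "StartFragment:{2:D8}\r\n" +
        "EndFragment:{3:D8}\r\n";

    const string Prefix = "<html><body><!--StartFragment-->";
    const string Suffix = "<!--EndFragment--></body></html>";

    public static string Build(string fragment)
    {
        // Offsets are counted in bytes from the start of the whole string.
        int headerLength = Encoding.UTF8.GetByteCount(
            string.Format(HeaderTemplate, 0, 0, 0, 0));
        int startHtml = headerLength;
        int startFragment = startHtml + Encoding.UTF8.GetByteCount(Prefix);
        int endFragment = startFragment + Encoding.UTF8.GetByteCount(fragment);
        int endHtml = endFragment + Encoding.UTF8.GetByteCount(Suffix);

        return string.Format(HeaderTemplate, startHtml, endHtml, startFragment, endFragment)
            + Prefix + fragment + Suffix;
    }
}
```

The result would then be passed to `Clipboard.SetText(..., TextDataFormat.Html)` in place of the hand-edited string.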
I have the following mailto link on an ASP.NET MVC 5 application:
<a rel="nofollow" href="@(
String.Format("mailto:?subject={0}&body={1}",
"The title", "The description" + "%0D%0A" + "http://theurl.xyz")">
share by email
</a>
This is not validating on HTML Validator. I get the error:
Bad value mailto:?subject=The subject&body=This is the url:%0D%0Ahttp://localhost:8580/home for attribute href on element a: Whitespace in query component. Use %20 in place of spaces.
I tried encoding using HttpUtility.UrlEncode but when I open the email I get "+" signs and others in the subject and body and I am not able to solve that.
I know this is a little old, but I came across this when I was trying to figure out the best way to encode mailto links. I've found the best way is to use Uri.EscapeDataString for each parameter and then encode the entire attribute value using HttpUtility.HtmlAttributeEncode:
HttpUtility.HtmlAttributeEncode(
String.Format("mailto:?subject={0}&body={1}",
Uri.EscapeDataString(subject),
Uri.EscapeDataString(body)))
HttpUtility.UrlEncode and HttpUtility.UrlEncodeUnicode do not correctly encode spaces -- they become plus signs ("+") which then show up as plus signs in the subject line/body/etc. HttpUtility.UrlPathEncode seems to fix that problem, but doesn't properly encode other characters like ?, #, and /. Uri.EscapeDataString seems to be the only method that properly encodes all of these characters. I imagine Uri.HexEscape would work equally well, but it seems like that might be overkill.
Caveat: I haven't tested this with even a remotely wide variety of browsers and email clients
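For illustration, the two steps can be wrapped in a small helper like the one below. `MailtoBuilder` and `BuildHref` are made-up names, and the sketch assumes `System.Web.HttpUtility` is available to the project.

```csharp
using System;
using System.Web; // HttpUtility

// Hypothetical helper: builds a mailto href that is safe to place in an HTML attribute.
static class MailtoBuilder
{
    public static string BuildHref(string subject, string body)
    {
        // Step 1: percent-encode each parameter on its own, so spaces become %20
        // and characters such as ?, #, / and & cannot break the query string.
        string uri = string.Format("mailto:?subject={0}&body={1}",
            Uri.EscapeDataString(subject),
            Uri.EscapeDataString(body));

        // Step 2: HTML-encode the finished URI for use inside an attribute,
        // which turns the "&" separator into "&amp;".
        return HttpUtility.HtmlAttributeEncode(uri);
    }
}
```

With this approach the line break can be passed as a plain "\r\n" inside the body; Uri.EscapeDataString turns it into %0D%0A. Note that Razor's `@(...)` already HTML-encodes string output, so inside a view the second step would either be skipped or the result wrapped in `Html.Raw`.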
You need to use the HttpUtility.UrlPathEncode instead of the HttpUtility.UrlEncode:
<a rel="nofollow" href="@(
(String.Format("mailto:?subject={0}&body={1}",
HttpUtility.UrlPathEncode("The subject line"),
HttpUtility.UrlPathEncode("The body") + "%0D%0A" + "http://theurl.xyz")))">
share by email
</a>
Note: you need to HttpUtility.UrlPathEncode the parts separately, and you cannot put the HttpUtility.UrlPathEncode around the whole String.Format because the HttpUtility.UrlPathEncode handles the ? specially and only encodes the text before the ?.
From MSDN:
You can encode a URL using with the UrlEncode method or the
UrlPathEncode method. However, the methods return different results.
The UrlEncode method converts each space character to a plus character
(+). The UrlPathEncode method converts each space character into the
string "%20", which represents a space in hexadecimal notation.
That's because razor doesn't know when to start processing c# code. You have to explicitly tell razor when to interpret c#...
<a rel="nofollow" href="@(string.Format("mailto:?subject={0}&body={1}", ViewBag.Title, ViewBag.Description + "%0D%0A" + ViewBag.Url))">Share by Email</a>
Edit
Answer
I got carried away by the syntax errors you had in your Razor. However, even after you edited your question I can still see some issues there. The first issue is that you open two parentheses but close only one. The second issue is that you specify an empty email address; you should at least add a space (not HTML-encoded). And the last issue is that you are not actually separating the subject and body because you are using ? instead of &. If you correct these issues you should be good to go. Here's an example based on your question...
<a rel="nofollow" href="@(String.Format("mailto: ?subject={0}&body={1}"
, "The title"
, "The description" + "%0D%0A" + "http://theurl.xyz"))">
share by email
</a>
That should work as it is. But if you want to do more funky stuff, please read this RFC
I am trying to get a page title from the page source of different pages. But let's say some pages have a title like this:
&quot;This is an example,&quot; ABC.
It has some HTML in it, like "&quot;". If I use a string in C# to get this title, I get the whole thing, and it is displayed exactly as above, which is wrong. Is there any way to ignore or to take into account HTML values in C#?
I am also using htmlagilitypack so anything in that will do too.
You can use WebUtility.HtmlDecode to decode html, link on MSDN:
WebUtility.HtmlDecode("&quot;This is an example,&quot; ABC.");
just use:
using System.Net;
The result will be: "\"This is an example,\" ABC."
You also can use HtmlEntity.DeEntitize in HTML Agility Pack:
HtmlEntity.DeEntitize(string text)
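A minimal sketch of the first option, using only `System.Net.WebUtility`, might look like this. The wrapper name `TitleDecoder` is hypothetical, and the `Trim` call is an extra assumption about what a displayed title should look like.

```csharp
using System;
using System.Net;

// Hypothetical wrapper around WebUtility.HtmlDecode for page titles.
static class TitleDecoder
{
    public static string Decode(string rawTitle)
    {
        // HtmlDecode handles named entities (&quot;, &amp;) as well as
        // numeric ones (&#39;); Trim drops stray surrounding whitespace.
        return WebUtility.HtmlDecode(rawTitle).Trim();
    }
}
```

The raw text would typically come from the title node's inner text when the page is loaded with HTML Agility Pack.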
You don't know what you will find in a page title; sometimes it is a whole mess. My suggestion is to take the string as it is and process it before showing or saving it.
In this case, the solution is simple: replace the
&quot;
with the corresponding character.
Each time you read an HTML document to extract some tags, watch out for tags that are never closed. If the author forgets to close the title tag, you will get the whole rest of the page in that title!
I have a RegEx which nicely finds the href's in a URL:
<[aA][^>]*? href=[\"'](?<url>[^\"]+?)[\"'][^>]*?>
However, I want it to NOT find any href that contains the text, 'javascript:' in it.
The reason is that I sometimes need to mod the href and sometimes don't. When there is a 'javascript:' text in the href I want it not to be found by the regex.
(ASP.NET, C#)
I really wouldn't recommend using a regexp for this, since HTML isn't regular and there are no end of edge cases to cater for. If at all possible, please use an HTML parser. I think you'll find it a lot less grief.
The word "javascript" can be written in other ways. Look at the ha.ckers.org article.
Simply excluding the word "javascript" doesn't provide you any safety at all.
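If a regex is still used despite those caveats, a negative lookahead is one way to skip such hrefs. The sketch below is a variation on the pattern from the question, not a security filter: it only rejects values that literally begin with "javascript:" (ignoring case and leading whitespace), and the class name `HrefFinder` is invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: collects href values from anchor tags, skipping javascript: links.
static class HrefFinder
{
    // (?!\s*javascript:) fails the match when the attribute value starts with "javascript:".
    static readonly Regex Anchor = new Regex(
        @"<a\b[^>]*?\shref=[""'](?!\s*javascript:)(?<url>[^""']+?)[""'][^>]*?>",
        RegexOptions.IgnoreCase);

    public static List<string> FindUrls(string html)
    {
        var urls = new List<string>();
        foreach (Match m in Anchor.Matches(html))
        {
            urls.Add(m.Groups["url"].Value);
        }
        return urls;
    }
}
```

Encoded or otherwise disguised spellings of "javascript" will slip past this check, which is exactly the point made in the answers above.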
C# Need to locate web addresses using REGEX is that possible?
Basically I need to parse a string prior to loading it into a WebBrowser
myString = "this is an example string http://www.google.com , and I need to make the link clickable";
webBrow.DocumentText = myString;
Basically what I want to happen is a replace of the web address so that it looks like a hyperlink, and do this with any address pulled in to the string. I would need to replace the web address so that web address would read like
<a href='web address'>web address</a>
This would allow me to have the links clickable..
Any Ideas?
new Regex(@"https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?").Match(myString)
It's possible depending on how strict or permissive you want your parsing to be.
As a first cut, you can try @"\bhttp://\S+" which will match any string starting with "http://" at a word boundary (non-word character, such as whitespace or punctuation).
To search using a regex and replace all occurrences with your custom text, you could use the Regex.Replace method.
You may want to read up on Regular Expression Language Elements to learn more.
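Putting the permissive pattern together with Regex.Replace could look roughly like the sketch below. `Linkifier` is a hypothetical name; the pattern is widened to accept https as well, and the matched address is HTML-encoded before it is written into the attribute, both of which are choices made for this example.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper: wraps every http/https address in an anchor tag.
static class Linkifier
{
    // Permissive first-cut pattern: scheme, then everything up to the next whitespace.
    static readonly Regex Url = new Regex(@"\bhttps?://\S+", RegexOptions.IgnoreCase);

    public static string Linkify(string text)
    {
        return Url.Replace(text, m =>
        {
            // Encode the match so quotes or ampersands in the URL cannot break the markup.
            string safe = WebUtility.HtmlEncode(m.Value);
            return "<a href='" + safe + "'>" + safe + "</a>";
        });
    }
}
```

The result can be assigned to `webBrow.DocumentText` as in the question. Because `\S+` runs up to the next whitespace, punctuation typed directly after an address (a trailing comma or period) ends up inside the link; a stricter pattern would be needed to avoid that.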