I found some examples that use regular expressions to detect patterns for URLs inside text paragraphs and add the HTML code to make them links. The problem I have with this approach is that, sometimes, the input paragraph contains both URLs written in plain text (which I want to convert to clickable) but also some URLs that already have markup for links. For example, consider this paragraph:
My favourite search engine is http://www.google.com but
sometimes I also use <a href="http://www.yahoo.com">http://www.yahoo.com</a>
I only want to convert the Google link but leave the two Yahoo links as they are.
What I am after is a C# function that uses regex to detect URLs and convert them, but which ignores URLs that are already wrapped in an "A" tag or that already appear inside an "A" tag (for example, in its href attribute).
Edit
Here is what I have so far:
PostBody = "My favourite search engine is http://www.google.com but sometimes I also use <a href=\"http://www.yahoo.com\">http://www.yahoo.com</a>";
String pattern = @"http(s)?://([\w+?\.\w+])+([a-zA-Z0-9\~\!\@\#\$\%\^\&\*\(\)_\-\=\+\\\/\?\.\:\;\'\,]*)?";
System.Text.RegularExpressions.Regex regex = new System.Text.RegularExpressions.Regex(pattern);
System.Text.RegularExpressions.MatchCollection matches = regex.Matches(PostBody);
for (int i = 0; i < matches.Count; i++)
{
PostBody = PostBody.Replace(matches[i].Value, String.Format("<a href=\"{0}\">{1}</a>", matches[i].Value, matches[i].Value));
}
ltrlPostBody.Text = PostBody;
And here is what I am getting (I split it in multiple lines for clarity):
My favourite search engine is
http://www.google.com
but sometimes I also use
http://www.yahoo.com">
http://www.yahoo.com</a>">
I only want to convert the first link (in this case) because it is not already part of link markup.
You can also use HTML Agility Pack, which gives you more control (for example, you probably don't want to touch URLs inside <script></script> elements and style elements):
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using HtmlAgilityPack;
namespace ConsoleApplication3 {
class Program {
static void Main(string[] args) {
var text = @"My favourite search engine is http://www.google.com but
sometimes I also use <a href=""http://www.yahoo.com"">http://www.yahoo.com</a>
<div>http://catchme.com</div>
<script>
var thisCanHurt = 'http://noescape.com';
</script>";
var doc = new HtmlDocument();
doc.LoadHtml(text);
var regex = new Regex(@"http(s)?://([\w+?\.\w+])+([a-zA-Z0-9\~\!\@\#\$\%\^\&\*\(\)_\-\=\+\\\/\?\.\:\;\'\,]*)?", RegexOptions.IgnoreCase);
var nodes = doc.DocumentNode.SelectNodes("//text()");
foreach (var node in nodes) {
if (node.ParentNode != null && (node.ParentNode.Name == "a" || node.ParentNode.Name == "script" || node.ParentNode.Name == "style")) {
continue;
}
node.InnerHtml = regex.Replace(node.InnerText, (match) => {
return string.Format(@"<a href=""{0}"">{0}</a>", match.Value);
});
}
var builder = new StringBuilder(100);
using (var writer = new StringWriter(builder)) {
doc.Save(writer);
}
var compose = builder.ToString();
}
}
}
If you already have a regex that detects text wrapped in an anchor tag, you can use System.Text.RegularExpressions to determine whether your input is a match, via http://msdn.microsoft.com/en-us/library/sdx2bds0.aspx
You can do something as simple as:
private string Pattern = "whateverregexpatternyouhavewritten";
private bool MatchesPattern(string input)
{
return Regex.IsMatch(input, Pattern); // the input string comes first, then the pattern
}
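For the original question (convert bare URLs but leave existing links alone), a regex-only variant is also possible. The sketch below is not taken from any of the answers above: it uses a simplified URL pattern and the hypothetical names Linkifier and Linkify. The idea is to let the first alternative consume whole existing <a>...</a> elements so that the URLs inside them are never offered to the second alternative. It assumes anchors are not nested and does not strip trailing punctuation from URLs or HTML-encode them.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Linkifier
{
    // Alternative 1 swallows complete <a ...>...</a> elements.
    // Alternative 2 (named group "url") matches a bare URL (simplified pattern).
    private static readonly Regex AnchorOrUrl = new Regex(
        @"<a\b[^>]*>.*?</a>|(?<url>https?://[^\s<>""']+)",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string Linkify(string html)
    {
        return AnchorOrUrl.Replace(html, m =>
            m.Groups["url"].Success
                ? string.Format("<a href=\"{0}\">{0}</a>", m.Value) // bare URL: wrap it
                : m.Value);                                          // existing anchor: keep as is
    }
}
```

Calling Linkifier.Linkify on the sample paragraph should wrap only the Google URL. For anything beyond simple markup, the HTML Agility Pack approach above is the more robust choice.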
In Umbraco I have made a field (supportListItemWordMarking) where the user can enter a comma-separated list of words to be highlighted in a header. For example, given the header:
This is a very fine header
suppose the user wants the words "very" and "header" highlighted in red.
The field in Umbraco will then look like this:
very,header
I am trying to make a razor script so that it will look in the header for words inputted in the supportListItemWordMarking field and then output something like this:
<h1>This is a <span class="red">very</span> fine <span class="red">header</span></h1>
I have come up with this:
@{
if(subItem.HasValue("supportListItemWordMarking")) {
string[] wordMarking = subItem.GetValue("supportListItemWordMarking").ToString().Split(',');
}
}
But I am not sure if this is the right approach, so I am stuck.
This is probably how I would do it:
var txt = "This is a very fine header";
var wordMarking = new string[] { "very", "header" };
// search for all words using regex
var rx = new Regex(@"(\w+)", RegexOptions.Compiled);
// the text to replace all regex matches with
// any words found will be inserted into {0} using string.Format
var replacementText = "<span class=\"red\">{0}</span>";
var newTxt = rx.Replace(txt, (match) =>
{
var wordFound = match.Groups[1].Value;
// check if word should be marked
if (wordMarking.Contains(wordFound))
{
// return the new word with the replacement
return string.Format(replacementText, wordFound);
}
return wordFound;
});
If the list wordMarking has many items, then you should use a hash-based collection (a HashSet<string> or a dictionary) instead of an array, so that each lookup does not scan the whole list.
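A minimal sketch of that hash-based variant follows. The names WordMarker and Mark are made up for illustration, and the case-insensitive comparison is an assumption; the original code compares case-sensitively.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordMarker
{
    public static string Mark(string text, IEnumerable<string> words)
    {
        // HashSet gives constant-time lookups regardless of how many words are marked.
        // Assumption: matching should ignore case.
        var lookup = new HashSet<string>(words, StringComparer.OrdinalIgnoreCase);

        return Regex.Replace(text, @"\w+", m =>
            lookup.Contains(m.Value)
                ? "<span class=\"red\">" + m.Value + "</span>"
                : m.Value);
    }
}
```

In the Razor script, the array produced by Split(',') could be passed straight in as the words argument.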
I need to surround every word in loaded html text with a span which will uniquely identify every word. The problem is that some content is not being handled by my regex pattern. My current problems include...
1) Special HTML character entities like &rdquo; &ldquo; are treated as words.
2) Currency values. e.g. $2,500 end up as "2" "500" (I need "$2,500")
3) Double hyphened words. e.g. one-legged-man. end up "one-legged" "man"
I'm new to regular expressions and after looking at various other posts have derived the following pattern that seems to work for everything except the above exceptions.
What I have so far is:
string pattern = @"(?<!<[^>]*?)\b('\w+)|(\w+['-]\w+)|(\w+')|(\w+)\b(?![^<]*?>)";
string newText = Regex.Replace(oldText, pattern, delegate(Match m) {
wordCnt++;
return "<span data-wordno='" + wordCnt.ToString() + "'>" + m.Value + "</span>";
});
How can I fix or extend the above pattern to cater for these problems, or should I be using a different approach altogether?
A fundamental problem that you're up against here is that html is not a "regular language". This means that html is complex enough that you are always going to be able to come up with valid html that isn't recognized by any regular expression. It isn't a matter of writing a better regular expression; this is a problem that regex can't solve.
What you need is a dedicated html parser. You could try this nuget package. There are many others, but HtmlAgilityPack is quite popular.
Edit: Below is an example program using HtmlAgilityPack. When an HTML document is parsed, the result is a tree (aka the DOM). In the DOM, text is stored inside text nodes. So something like <p>Hello World</p> is parsed into a node to represent the p tag, with a child text node to hold the "Hello World". So what you want to do is find all the text nodes in your document, and then, for each node, split the text into words and surround the words with spans.
You can search for all the text nodes using an xpath query. The xpath I have below is /html/body//*[not(self::script)]/text(), which avoids the html head and any script tags in the body.
class Program
{
static void Main(string[] args)
{
var doc = new HtmlDocument();
doc.Load(args[0]);
var wordCount = 0;
var nodes = doc.DocumentNode
.SelectNodes("/html/body//*[not(self::script)]/text()");
foreach (var node in nodes)
{
var words = node.InnerHtml.Split(' ');
var surroundedWords = words.Select(word =>
{
if (String.IsNullOrWhiteSpace(word))
{
return word;
}
else
{
return $"<span data-wordno={wordCount++}>{word}</span>";
}
});
var newInnerHtml = String.Join(" ", surroundedWords); // rejoin with the space that Split(' ') removed
node.InnerHtml = newInnerHtml;
}
Console.WriteLine(doc.DocumentNode.InnerHtml);
}
}
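Fix 1) by adding a negative look-behind assertion, (?<!\&), so that a word immediately preceded by an ampersand (an HTML entity name) is not matched. I believe it is needed at the beginning of the 1st, 3rd, and 4th alternatives in the original pattern above.
Fix 2) by adding a new alternative, (\$?(\d+[,.])+\d+). Put it before the (\w+) alternative: alternatives are tried left to right, so if it comes last, a number without a dollar sign such as 2,500 is still split by (\w+). Placed earlier, it handles dollar amounts, plain numbers with separators, and decimal numbers alike.
Fix 3) by enhancing the (\w+['-]\w+) alternative to read ((\w+['-])+\w+) instead.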
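The three fixes can be tried out on plain text with a sketch like the one below. It is a from-scratch pattern written for illustration, not the asker's pattern with the fixes applied: it leaves out the tag-skipping lookarounds and the leading/trailing apostrophe alternatives, and it only skips named entities (numeric ones such as &#8221; are not handled). The names WordWrapper and Wrap are hypothetical. It is meant to be run on the text of a single text node, for example one returned by an HTML parser.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordWrapper
{
    // Alternative 1: numbers with an optional leading $ and , or . separators,
    //                not followed by a word character (so "3rd" stays one word).
    // Alternative 2: words, optionally joined by ' or -, not preceded by &.
    private static readonly Regex Word = new Regex(
        @"\$?\d+(?:[,.]\d+)*(?!\w)|(?<!&)\b\w+(?:['-]\w+)*");

    public static string Wrap(string text)
    {
        int wordNo = 0;
        return Word.Replace(text, m =>
            "<span data-wordno='" + (++wordNo) + "'>" + m.Value + "</span>");
    }
}
```

The number alternative comes first on purpose, so that digits are claimed as a whole amount before the generic word alternative can split them.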
I am looking to write a utility to batch rename a bunch of files at once using a regular expression. The files that I will be renaming all at once follow a certain naming convention, and I want to alter them to a new naming convention using data that's already in the filenames; but not all my files follow the same convention currently.
So I want to be able to write a general use program that lets me input into a textbox during runtime the pattern of the filename, and what tokens I want to extract from the filename to use for renaming.
For example - Assume I have one file named [Coalgirls]_Suite_Precure_02_(1280x720_Blu-Ray_FLAC)_[33D74D55].mkv. I want to be able to rename this file to Suite Precure - Ep 02 [Coalgirls][33D74D55].mkv
This means I would preferably be able to enter into my program before renaming something akin to [%group%]_Suite_Precure_%ep%_(...)_[%crc%].mkv and it would populate the local variables group, ep, and crc to use in the batch rename.
One particular program I'm thinking of that does this is mp3tag, used for converting file names to id3 tags. It lets you put something like %artist% - %album% - %tracknumber% - %title%, and it takes those 4 tokens and puts them into the respective id3 tags.
How can I make a system similar to this without having to make the user know regex syntax?
As mentioned by usr, you can extract all the named placeholders in the search string using %(?<name>[^%]+)%. This will get you "group", "ep", and "crc".
Now you need to scan all the fragments between the placeholders and put a capture at each placeholder in the regex. I'd iterate through the matches from above (you can get start offset and length of each match to navigate through the non-placeholder fragments).
(There are mistakes in your example, I'll assume the last part is correct and I'm dropping the mysterious (...))
It would build a regex that looks like this:
^\[(?<group>.*?)]_Suite_Precure_(?<ep>.*?)_\[(?<crc>.*?)]\.mkv$
Pass the literal fragments to Regex.Escape before using it in the regex to handle troublesome characters properly.
Now, for each filename, you try to match the regex against it. If it matches, you get the values of the placeholders for this file. Then you take those placeholder values and merge them into the output pattern, replacing the placeholders appropriately. This gives you the new name, and you can then do the rename.
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
namespace renamer
{
class RenameImpl
{
public static IEnumerable<Tuple<string,string>> RenameWithPatterns(
string path, string curpattern, string newpattern,
bool caseSensitive)
{
var placeholderNames = new List<string>();
// Extract all the cur_placeholders from the user's input pattern
var input_regex = new Regex(@"(\%[^%]+\%)");
var cur_matches = input_regex.Matches(curpattern);
var new_matches = input_regex.Matches(newpattern);
var regex_pattern = new StringBuilder();
if (!caseSensitive)
regex_pattern.Append("(?i)");
regex_pattern.Append('^');
// Do a pass over the matches and grab info about each capture
var cur_placeholders = new List<Tuple<string, int, int>>();
var new_placeholders = new List<Tuple<string, int, int>>();
for (var i = 0; i < cur_matches.Count; ++i)
{
var m = cur_matches[i];
cur_placeholders.Add(new Tuple<string, int, int>(
m.Value, m.Index, m.Length));
}
for (var i = 0; i < new_matches.Count; ++i)
{
var m = new_matches[i];
new_placeholders.Add(new Tuple<string, int, int>(
m.Value, m.Index, m.Length));
}
// Build the regular expression
for (var i = 0; i < cur_placeholders.Count; ++i)
{
var ph = cur_placeholders[i];
// Get the literal before the first capture if it is the first
if (i == 0 && ph.Item2 > 0)
regex_pattern.Append(Regex.Escape(
curpattern.Substring(0, ph.Item2)));
// Generate the capture for the placeholder
regex_pattern.AppendFormat("(?<{0}>.*?)",
ph.Item1.Replace("%", ""));
// The literal after the placeholder
if (i + 1 == cur_placeholders.Count)
regex_pattern.Append(Regex.Escape(
curpattern.Substring(ph.Item2 + ph.Item3)));
else
regex_pattern.Append(Regex.Escape(
curpattern.Substring(ph.Item2 + ph.Item3,
cur_placeholders[i + 1].Item2 - (ph.Item2 + ph.Item3))));
}
regex_pattern.Append('$');
var re = new Regex(regex_pattern.ToString());
foreach (var pathname in Directory.EnumerateFileSystemEntries(path))
{
var file = Path.GetFileName(pathname);
var m = re.Match(file);
if (!m.Success)
continue;
// New name is initially same as target pattern
var newname = newpattern;
// Iterate through the placeholder names
for (var i = new_placeholders.Count; i > 0; --i)
{
// Target placeholder name
var tn = new_placeholders[i-1].Item1.Replace("%", "");
// Get captured value for this capture
var ct = m.Groups[tn].Value;
// Perform the replacement
newname = newname.Remove(new_placeholders[i - 1].Item2,
new_placeholders[i - 1].Item3);
newname = newname.Insert(new_placeholders[i - 1].Item2, ct);
}
newname = Path.Combine(path, newname);
yield return new Tuple<string, string>(pathname, newname);
}
}
}
}
Make the regex pattern %(?<name>[^%]+)%. This will capture all tokens in the string that are surrounded by percent signs.
Then, use Regex.Replace to replace them:
var replaced = Regex.Replace(input, pattern, (Match m) => EvaluateToken(m.Groups["name"].Value));
Regex.Replace can take a callback that allows you to provide a dynamic value.
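As a rough illustration of that callback, the sketch below fills a template from a dictionary of token values. TokenTemplate and Fill are hypothetical names, and the dictionary stands in for whatever EvaluateToken would do; tokens with no known value are left untouched, which is an assumption about the desired behaviour.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TokenTemplate
{
    private static readonly Regex Token = new Regex("%(?<name>[^%]+)%");

    public static string Fill(string template, IDictionary<string, string> values)
    {
        return Token.Replace(template, m =>
        {
            string value;
            // Unknown tokens are kept verbatim rather than replaced with an empty string.
            return values.TryGetValue(m.Groups["name"].Value, out value) ? value : m.Value;
        });
    }
}
```

With group, ep, and crc extracted from the old file name, Fill applied to the output pattern yields the new file name.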
I have an HTML page that contains some filenames that I want to download from a web server.
I need to read these filenames in order to create a list that will be passed to my web application, which downloads the files from the server. These filenames have some extension.
I have dug into this topic but haven't found anything except:
Regex cannot be used to parse HTML.
Use HTML Agility Pack.
Is there no other way to search an HTML file for text that matches a pattern like filename.ext?
Sample HTML that contains filename -
<p class=3DMsoNormal style=3D'margin-top:0in;margin-right:0in;margin-bottom=:0in; margin-left:1.5in;margin-bottom:.0001pt;text-indent:-.25in;line-height:normal;mso-list:l1 level3 lfo8;tab-stops:list 1.5in'><![if !supportLists]> <span style=3D'font-family:"Times New Roman","serif";mso-fareast-font-family:"Times New Roman"'><span style=3D'mso-list:Ignore'>1.<span style=3D'font:7.0pt "Times New Roman"'>
</span></span></span><![endif]><span style=3D'font-family:"Times New Roman","serif"; mso-fareast-font-family:"Times New Roman"'>**13572_PostAccountingReport_2009-06-03.acc**<o:p></o:p></span></p>
I can't use HTML Agility Pack because I'm not allowed to download or use any external application or tool.
Can't this be achieved by some other logic?
This is what I have done so far:
string pageSource = "";
string geturl = @"C:\Documents and Settings\NASD_Download.mht";
WebRequest getRequest = WebRequest.Create(geturl);
WebResponse getResponse = getRequest.GetResponse();
using (StreamReader sr = new StreamReader(getResponse.GetResponseStream()))
{
pageSource = sr.ReadToEnd();
pageSource.Replace("=", "");
}
var fileNames = from Match m in Regex.Matches(pageSource, @"[0-9]+_+[A-Za-z]+_+[0-9]+-+[0-9]+-+[0-9]+.+[a-z]")
select m.Value;
foreach (var s in fileNames)
Response.Write(s);
Because of some "=" occurring in every file name, I am not able to get the filename. How can I remove the occurrences of "=" in the pageSource string?
Thanks in advance
Akhil
Well, knowing that regexes aren't ideal for finding values in HTML:
var files = [];
var p = document.getElementsByTagName('p');
for (var i = 0; i < p.length; i++){
var match = p[i].innerHTML.match(/\s(\S+\.ext)\s/)
if (match)
files.push(match[1]);
}
Note:
Read the comments to the question.
If the extension can be anything, you can use this:
var files = [];
var p = document.getElementsByTagName('p');
for (var i = 0; i < p.length; i++){
var match = p[i].innerHTML.match(/\b(\S+\.\S+)\b/)
console.log(match)
if (match)
files.push(match[1]);
}
document.getElementById('result').innerHTML = files + "";
But this is really not reliable.
Well, you can use regular expressions to extract stuff that looks like file names. Since, as you correctly point out, regular expressions do not parse HTML, you might get false positives, i.e., you might get results that look like file names but are not.
Let's take an example:
string html = @"<p class=3DMsoNormal ...etc...";
var fileNames = from Match m in Regex.Matches(html, @"\b[A-Za-z0-9_-]+\.[A-Za-z0-9_-]{3}\b")
select m.Value;
foreach (var s in fileNames)
Console.WriteLine(s);
Console.ReadLine();
This will return
1.5in
1.5in
7.0pt
13572_PostAccountingReport_2009-06-03.acc
You see, HTML stuff that looks like a file name will be returned. Of course, you could refine the regular expression (for example, replace + with {3,}, so that at least three characters are required for the part before the dot) so that the false positives in this example are filtered out. Still, it's always going to be an approximate result, not an exact one.
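The refinement suggested above ({3,} instead of + before the dot) might look like the following sketch. FileNameFinder and Find are hypothetical names, and the result is still only an approximation, as noted. Separately, keep in mind that string.Replace returns a new string rather than changing the original, so a call like pageSource.Replace(...) has no effect unless its result is assigned back.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class FileNameFinder
{
    // At least three name characters, a dot, then exactly three extension characters.
    private static readonly Regex Candidate = new Regex(
        @"\b[A-Za-z0-9_-]{3,}\.[A-Za-z0-9_-]{3}\b");

    public static List<string> Find(string html)
    {
        return Candidate.Matches(html)
                        .Cast<Match>()
                        .Select(m => m.Value)
                        .ToList();
    }
}
```

On a fragment that contains 1.5in, 7.0pt, and the sample file name, only the file name should remain, because the CSS lengths have fewer than three characters before the dot.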
It may be impossible to get file names with a generic pattern because of values such as 1.5in, -.25in, and 7.0pt. Try to be more specific (if possible), for example:
/[a-z0-9_-]+\.[a-z]+/gi or
/>[a-z0-9_-]+\.[a-z]+</gi (markup included) or even
/>\d+_PostAccountingReport_\d+-\d+-\d+\.[a-z]+</gi
I have a block of html that looks something like this;
<p>33</p>
There are basically hundreds of anchor links which I need to replace the href based on the anchor text. For example, I need to replace the link above with something like;
33.
I will need to take the value 33 and do a lookup on my database to find the new link to replace the href with.
I need to keep it all in the original html as above!
How can I do this? Help!
Although this doesn't answer your question, the HTML Agility Pack is a great tool for manipulating and working with HTML: http://html-agility-pack.net
It could at least make grabbing the values you need and doing the replaces a little easier.
Contains links to using the HTML Agility Pack: How to use HTML Agility pack
Slurp your HTML into an XmlDocument (your markup is valid, isn't it?) Then use XPath to find all the <a> tags with an href attribute. Apply the transform and assign the new value to the href attribute. Then write the XmlDocument out.
Easy!
Use a regexp to find the values and replace
A regexp like /<p><a href="[^"]+">([^<]+)<\/a><\/p>/ will match the link and capture the anchor text.
Consider using the following rough algorithm.
using System;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
static class Program
{
static void Main ()
{
string html = "<p>33</p>"; // read the whole html file into this string.
StringBuilder newHtml = new StringBuilder (html);
Regex r = new Regex (@"\<a href=\""([^\""]+)\"">([^<]+)"); // 1st capture for the replacement and 2nd for the find
foreach (var match in r.Matches(html).Cast<Match>().OrderByDescending(m => m.Index))
{
string text = match.Groups[2].Value;
string newHref = DBTranslate (text);
newHtml.Remove (match.Groups[1].Index, match.Groups[1].Length);
newHtml.Insert (match.Groups[1].Index, newHref);
}
Console.WriteLine (newHtml);
}
static string DBTranslate(string s)
{
return "junk_" + s;
}
}
(The OrderByDescending makes sure the indexes don't change as you modify the StringBuilder.)
So, what you want to do is generate the replacement string based on the contents of the match. Consider using one of the Regex.Replace overloads that take a MatchEvaluator. Example:
static void Main()
{
Regex r = new Regex(@"<a href=""[^""]+"">([^<]+)");
string s0 = @"<p>33</p>";
string s1 = r.Replace(s0, m => GetNewLink(m));
Console.WriteLine(s1);
}
static string GetNewLink(Match m)
{
return string.Format(@"<a href=""{0}.html"">{0}", m.Groups[1]);
}
I've actually taken it a step further and used a lambda expression instead of explicitly creating a delegate method.
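Putting the MatchEvaluator idea together with a lookup, a self-contained sketch could look like this. HrefRewriter and Rewrite are hypothetical names, the lookup delegate stands in for the database query, and the old.html href in the usage below is invented sample input, since the original markup is not shown in full.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HrefRewriter
{
    // Group 1: everything up to the opening quote of href.
    // Group 2: the closing quote and '>'.
    // Group 3: the anchor text, used as the lookup key.
    private static readonly Regex Anchor = new Regex(
        @"(<a\s+href="")[^""]*("">)([^<]+)", RegexOptions.IgnoreCase);

    public static string Rewrite(string html, Func<string, string> lookup)
    {
        return Anchor.Replace(html, m =>
            m.Groups[1].Value + lookup(m.Groups[3].Value) +
            m.Groups[2].Value + m.Groups[3].Value);
    }
}
```

For example, HrefRewriter.Rewrite("<p><a href=\"old.html\">33</a></p>", text => "/items/" + text) replaces only the href value and leaves the rest of the markup as it was. It assumes href is the first attribute of the tag; anchors written differently are left untouched.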