I want to split the body of article by html div tag so I have a pattern to search div.
the problem is that the pattern also split \r\n
[enter image description here][1]
string pattern = #"<div[^<>]*>(.*?)</div>";
string[] bodyParagraphsnew = Regex.Split(body, pattern,RegexOptions.None);
Response.Write("num of paragraph =" + bodyParagraphsnew.Length);
for (int i = 0; i < bodyParagraphsnew.Length; i++)
{
Response.Write("bodyParagraphs" + i + "= " + bodyParagraphsnew[i]+ Environment.NewLine);
}
When I debug this code I see a lot of "\r\n" in the array bodyParagraphsnew.
Its seen that the pattern include split by the string "\r\n"
I try to replace \r\n to string empty and i hoped that bodyParagraphsnew length will change.but not.I got instead of item(in array) that contain \r\n it contain ""
WHY?
here is link to image http://i.stack.imgur.com/Hxqki.gif that explain the problem
What you are seeing is the text that is between the end of the first </div> tag and the start of the next <div> tag. This is what Split does, it finds the text between the Regular Expression matches.
What is curious here though is that you are also going to get the text between the open and close tags because you put brackets in your string forming a capturing group. Consider the following program:
using System;
using System.Text.RegularExpressions;
class Program
{
static void Main(string[] args)
{
string body = "<div>some text</div>\r\n<div>some more text</div>";
string pattern = #"<div[^>]*?>(.*?)</div>";
string[] bodyParagraphsnew = Regex.Split(body, pattern, RegexOptions.None);
Console.WriteLine("num of paragraph =" + bodyParagraphsnew.Length);
for (int i = 0; i < bodyParagraphsnew.Length; i++)
{
Console.WriteLine("bodyParagraphs {0}: '{1}'", i, bodyParagraphsnew[i]);
}
}
}
What you will get from this is:
"" - An empty string taken from before the first <div>.
"some text" - The contents of the first <div>, because of the capturing group.
"\r\n" - The text between the end of the first </div> and the start of the last <div>.
"some more text" - The contents of the second div, again because of the capturing group.
"" - An empty string taken from after the last </div>.
What you are probably after is the contents of the div tags. This can kind of be achieved using this code:
using System;
using System.Text.RegularExpressions;
class Program
{
static void Main(string[] args)
{
string body = "<div>some text</div>\r\n<div>some more text</div>";
string pattern = #"<div[^>]*?>(.*?)</div>";
MatchCollection bodyParagraphsnew = Regex.Matches(body, pattern, RegexOptions.None);
Console.WriteLine("num of paragraph =" + bodyParagraphsnew.Count);
for (int i = 0; i < bodyParagraphsnew.Count; i++)
{
Console.WriteLine("bodyParagraphs {0}: '{1}'", i, bodyParagraphsnew[i].Groups[1].Value);
}
}
}
Note however that in HTML, div tags can be nested within each other. For example, the following is a valid HTML string:
string test = "<div>Outer div<div>inner div</div>outer div again</div>";
With this kind of situation Regular expressions are not going to work! This is largely due to HTML not being a Regular Language. To deal with this situation you are going to need to write a Parser (of which regular expressions are only a small part). However personally I wouldn't bother as there are plenty of open source HTML parsers already available HTML Agility Pack for example.
Two possibilies
you use llist instead of array and list.remove
you go through your array search for \r\n and remove it by index
if(bodyParagraphsnew[i] == "\r\n")
{
bodyParagraphsnew = bodyParagraphsnew.Where(w => w != bodyParagraphsnew[i]).ToArray();
}
Not very nice but maybe it is what you were looking for
Related
I'm having issues doing a find / replace type of action in my function, i'm extracting the < a href="link">anchor from an article and replacing it with this format: [link anchor] the link and anchor will be dynamic so i can't hard code the values, what i have so far is:
public static string GetAndFixAnchor(string articleBody, string articleWikiCheck) {
string theString = string.Empty;
switch (articleWikiCheck) {
case "id|wpTextbox1":
StringBuilder newHtml = new StringBuilder(articleBody);
Regex r = new Regex(#"\<a href=\""([^\""]+)\"">([^<]+)");
string final = string.Empty;
foreach (var match in r.Matches(theString).Cast<Match>().OrderByDescending(m => m.Index))
{
string text = match.Groups[2].Value;
string newHref = "[" + match.Groups[1].Index + " " + match.Groups[1].Index + "]";
newHtml.Remove(match.Groups[1].Index, match.Groups[1].Length);
newHtml.Insert(match.Groups[1].Index, newHref);
}
theString = newHtml.ToString();
break;
default:
theString = articleBody;
break;
}
Helpers.ReturnMessage(theString);
return theString;
}
Currently, it just returns the article as it originally is, with the traditional anchor text format: < a href="link">anchor
Can anyone see what i have done wrong?
regards
If your input is HTML, you should consider using a corresponding parser, HtmlAgilityPack being really helpful.
As for the current code, it looks too verbose. You may use a single Regex.Replace to perform the search and replace in one pass:
public static string GetAndFixAnchor(string articleBody, string articleWikiCheck) {
if (articleWikiCheck == "id|wpTextbox1")
{
return Regex.Replace(articleBody, #"<a\s+href=""([^""]+)"">([^<]+)", "[$1 $2]");
}
else
{
// Helpers.ReturnMessage(articleBody); // Uncomment if it is necessary
return articleBody;
}
}
See the regex demo.
The <a\s+href="([^"]+)">([^<]+) regex matches <a, 1 or more whitespaces, href=", then captures into Group 1 any one or more chars other than ", then matches "> and then captures into Group 2 any one or more chars other than <.
The [$1 $2] replacement replaces the matched text with [, Group 1 contents, space, Group 2 contents and a ].
Updated (Corrected regex to support whitespaces and new lines)
You can try this expression
Regex r = new Regex(#"<[\s\n]*a[\s\n]*(([^\s]+\s*[ ]*=*[ ]*[\s|\n*]*('|"").*\3)[\s\n]*)*href[ ]*=[ ]*('|"")(?<link>.*)\4[.\n]*>(?<anchor>[\s\S]*?)[\s\n]*<\/[\s\n]*a>");
It will match your anchors, even if they are splitted into multiple lines. The reason why it is so long is because it supports empty whitespaces between the tags and their values, and C# does not supports subroutines, so this part [\s\n]* has to be repeated multiple times.
You can see a working sample at dotnetfiddle
You can use it in your example like this.
public static string GetAndFixAnchor(string articleBody, string articleWikiCheck) {
if (articleWikiCheck == "id|wpTextbox1")
{
return Regex.Replace(articleBody,
#"<[\s\n]*a[\s\n]*(([^\s]+\s*[ ]*=*[ ]*[\s|\n*]*('|"").*\3)[\s\n]*)*href[ ]*=[ ]*('|"")(?<link>.*)\4[.\n]*>(?<anchor>[\s\S]*?)[\s\n]*<\/[\s\n]*a>",
"[${link} ${anchor}]");
}
else
{
return articleBody;
}
}
I have a snippet of html held as a string "s", it's user generated and may come from multiple sources, so I can't control the encoding of characters etc.
I have a simple string "comparison", and I need to check if comparison exists as a substring of "s". "comparison" does not have any html tags or encoding.
I am decoding, normalizing, and using a regex to strip out html tags, but am still unable to find the substring even when I know it is there...
string s = "<p>this is my string.</p><p>my string is html with tags and <a href="someurl">links</a> and encoding.</p><p>i want to find a substring but my comparison might not have tags & encoding.";
string comparison = "i want to find a substring";
string decode = HttpUtility.HtmlDecode(s);
string tagsreplaced = Regex.Replace(decode, "<.*?>", " ");
string normalized = tagsreplaced.Normalize();
Literal1.Text = normalized;
if (normalized.IndexOf(comparison) != -1)
{
Label1.Text = "substring found";
}
else
{
Label1.Text = "substring not found";
}
This is returning "substring not found". I can see by clicking view source that the string sent to the Literal absolutely includes the comparison string exactly as provided, so why isn't in being found?
Is there another way to achieve this?
The answer is that the HTML entity decoding still decodes your to the character 0xc2 0xa0 which is not a normal space character ' ' (which is 0x20). Verfy this with the following program:
using System;
using System.Text;
using System.Text.RegularExpressions;
using System.Web;
namespace TestStuff
{
class Program
{
static void Main(string[] args)
{
string s = "<p>this is my string.</p><p>my string is html with tags and <a href="someurl">links</a> and encoding.</p><p>i want to find a substring but my comparison might not have tags & encoding.";
s = "i want to find a substring";
string comparison = "i want to find a substring";
string decode = HttpUtility.HtmlDecode(s);
string tagsreplaced = Regex.Replace(decode, "<.*?>", " ");
string normalized = tagsreplaced.Normalize();
Console.WriteLine("Dumping first string");
Console.WriteLine(normalized);
Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(normalized)));
Console.WriteLine("Dumping second string");
Console.WriteLine(comparison);
Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(comparison)));
if (normalized.IndexOf(comparison) != -1)
Console.WriteLine("substring found");
else
Console.WriteLine("substring not found");
Console.ReadLine();
return;
}
}
}
It dumps the UTF8 encodings of the two strings for you. You'll see as output:
Dumping first string
i want to find a substring
69-20-77-61-6E-74-20-74-6F-C2-A0-66-69-6E-64-C2-A0-61-C2-A0-73-75-62-73-74-72-69-6E-67
Dumping second string
i want to find a substring
69-20-77-61-6E-74-20-74-6F-20-66-69-6E-64-20-61-20-73-75-62-73-74-72-69-6E-67
substring not found
You see that the bytearrays do not match, therefore they aren't equal, therefore .IndexOf() is right to tell you that nothing was found.
So, the problem lies within the HTML itself since there is a non-breaking space character which you don't decode to a normal space. You can hack around it by substituting a " " for a " " in the string using String.Replace().
I need to surround every word in loaded html text with a span which will uniquely identify every word. The problem is that some content is not being handled by my regex pattern. My current problems include...
1) Special html characters like ” “ are treated as words.
2) Currency values. e.g. $2,500 end up as "2" "500" (I need "$2,500")
3) Double hyphened words. e.g. one-legged-man. end up "one-legged" "man"
I'm new to regular expressions and after looking at various other posts have derived the following pattern that seems to work for everything except the above exceptions.
What I have so far is:
string pattern = #"(?<!<[^>]*?)\b('\w+)|(\w+['-]\w+)|(\w+')|(\w+)\b(?![^<]*?>)";
string newText = Regex.Replace(oldText, pattern, delegate(Match m) {
wordCnt++;
return "<span data-wordno='" + wordCnt.ToString() + "'>" + m.Value + "</span>";
});
How can I fix/extend the above pattern to cater for these problems or should I be using a different approach all together?
A fundamental problem that you're up against here is that html is not a "regular language". This means that html is complex enough that you are always going to be able to come up with valid html that isn't recognized by any regular expression. It isn't a matter of writing a better regular expression; this is a problem that regex can't solve.
What you need is a dedicated html parser. You could try this nuget package. There are many others, but HtmlAgilityPack is quite popular.
Edit: Below is an example program using HtmlAgilityPack. When an HTML document is parsed, the result is a tree (aka the DOM). In the DOM, text is stored inside text nodes. So something like <p>Hello World<\p> is parsed into a node to represent the p tag, with a child text node to hold the "Hello World". So what you want to do is find all the text nodes in your document, and then, for each node, split the text into words and surround the words with spans.
You can search for all the text nodes using an xpath query. The xpath I have below is /html/body//*[not(self::script)]/text(), which avoids the html head and any script tags in the body.
class Program
{
static void Main(string[] args)
{
var doc = new HtmlDocument();
doc.Load(args[0]);
var wordCount = 0;
var nodes = doc.DocumentNode
.SelectNodes("/html/body//*[not(self::script)]/text()");
foreach (var node in nodes)
{
var words = node.InnerHtml.Split(' ');
var surroundedWords = words.Select(word =>
{
if (String.IsNullOrWhiteSpace(word))
{
return word;
}
else
{
return $"<span data-wordno={wordCount++}>{word}</span>";
}
});
var newInnerHtml = String.Join("", surroundedWords);
node.InnerHtml = newInnerHtml;
}
WriteLine(doc.DocumentNode.InnerHtml);
}
}
Fix 1) by adding "negative look-behind assertions" (?<!\&). I believe they are needed at the beginning of the 1st, 3rd, and 4th alternatives in the original pattern above.
Fix 2) by adding a new alternative |(\$?(\d+[,.])+\d+)' at the end of pattern. This also handles non-dollar and decimal-pointed numbers at the same time.
Fix 3) by enhancing the (\w+['-]\w+) alternative to read instead ((\w+['-])+\w+).
I am attempting to split strings using '?' as the delimiter. My code reads data from a CSV file, and certain symbols (like fractions) are not recognized by C#, so I am trying to replace them with a relevant piece of data (bond coupon in this case). I have print statements in the following code (which is embedded in a loop with index variable i) to test the output:
string[] l = lines[i][1].Split('?');
//string[] l = Regex.Split(lines[i][1], #"\?");
System.Console.WriteLine("L IS " + l.Length.ToString() + " LONG");
for (int j = 0; j < l.Length; j++)
System.Console.WriteLine("L["+ j.ToString() + "] IS " + l[j]);
if (l.Length > 1)
{
double cpn = Convert.ToDouble(lines[i][12]);
string couponFrac = (cpn - Math.Floor(cpn)).ToString().Remove(0,1);
lines[i][1] = l[0].Remove(l[0].Length-1) + couponFrac + l[1]; // Recombine, replacing '?' with CPN
}
The issue is that both split methods (string.Split() and Regex.Split() ) produce inconsistent results with some of the string elements in lines splitting correctly and the others not splitting at all (and thus the question mark is still in the string).
Any thoughts? I've looked at similar posts on split methods and they haven't been too helpful.
I had no problem using String.Split. Could you post your input and output?
If at all you could probably use String.Replace to replace your desired '?' with a character that does not occur in the string and then use String.Split on that character to split the resultant string for the same effect. (just a try)
I didn't have any trouble parsing the following.
var qsv = "now?is?the?time";
var keywords = qsv.Split('?');
keywords.Dump();
screenshot of code and output...
UPDATE:
There doesn't appear to be any problem with Split. There is a problem somewhere else because in this small scale test it works just fine. I would suggest you use LinqPad to test out these kinds of scenarios small scale.
var qsv = "TII 0 ? 04/15/15";
var keywords = qsv.Split('?');
keywords.Dump();
qsv = "TII 0 ? 01/15/22";
keywords = qsv.Split('?');
keywords.Dump();
New updated output:
In the program I'm working on, I need to strip the tags around certain parts of a string, and then insert a comma after each character WITHIN the tag (not not after any other characters in the string). In case this doesn't make sense, here's an example of what needs to happen -
This is a string with a < a > tag < /a > (please ignore the spaces within the tag)
(needs to become)
This is a string with a t,a,g,.
Can anyone help me with this? I've managed to strip the tags using RegEx, but I can't figure out how to insert the commas only after the characters contained within the tag. If someone could help that would be great.
#Dour High Arch I'll elaborate a little bit. The code is for a text-to-speech app that won't recognize SSML tags. When the user enters a message for the text to speech app, they have the option of enclosing a word in a < a > tag to make the speaker say the world as an acronym. Because the acronym SSML tag won't work, I want to remove the < a > tag whenever present, and place commas after each character contained in the tag to fake it out (ex: < a > test< /a > becomes t,e,s,t,). All the non-tagged words in the string do not need commas after them, just those enclosed in tags (see my first example if need be).
If you have figured out the regex, I would imagine it would be simple to capture the inner text of the tag. Then it's a really simple operation to insert the commas:
var commaString = string.Join(",", capturedString.ToList());
Assuming you have your target string already parsed via your RegEx i.e. no tags around it...
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace ConsoleApplication32
{
class Program
{
static void Main(string[] args)
{
// setup a test string
string stringToProcess = "Test";
// actual solution here
string result = String.Concat(stringToProcess.Select(c => c + ","));
// results: T,e,s,t,
Console.WriteLine(result);
}
}
}
Parsing XML is very problematic because you may have to deal with things like CDATA sections, nested elements, entities, surrogate characters, and on and on. I would use a state-based parser like ANTLR.
However, if you are just starting out with C# it is instructive to solve this using the built-in .Net string and array classes. No ANTLR, LINQ, or regular expressions needed:
using System;
class ReplaceAContentsWithCommaSeparatedChars
{
static readonly string acroStartTag = "<a>";
static readonly string acroEndTag = "</a>";
static void Main(string[] args)
{
string s = "Alpha <a>Beta</a> Gamma <a>Delta</a>";
while (true)
{
int start = s.IndexOf(acroStartTag);
if (start < 0)
break;
int end = s.IndexOf(acroEndTag, start + acroStartTag.Length);
if (end < 0)
end = s.Length;
string contents = s.Substring(start + acroStartTag.Length, end - start - acroStartTag.Length);
string[] chars = Array.ConvertAll<char, string>(contents.ToCharArray(), c => c.ToString());
s = s.Substring(0, start)
+ string.Join(",", chars)
+ s.Substring(end + acroEndTag.Length);
}
Console.WriteLine(s);
}
}
Please be aware this does not deal with any of the issues I mentioned. But then, none of the other suggestions do either.