I'm working on a WYSIWYG editor with a built-in Hunspell spell checker and live highlighting of misspelled words. I'm using the WebBrowser control as the HTML handler. It's much easier to spell check plain text than the HTML inside the WebBrowser control, but going that way I lose all of the HTML formatting.
So the question is: is there any way to spell check the body's innerText and then merge the corrections back into the body's innerHTML while keeping the previous formatting? (Without using HtmlAgilityPack, Majestic12, SgmlReader, or ZetaHtmlTidy.)
Thanks in advance.
As opposed to checking the spelling of the innerText property of a given element, a better approach might be to loop through the child elements and check the spelling of each child's innerText instead.
This approach, while possibly limiting context-based spell-checking, should keep the markup intact.
Note: You might want to take into consideration that each child node may also contain further children.
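For illustration, here is a minimal sketch of that recursive walk, assuming a WinForms WebBrowser control; CheckSpelling(string) is a placeholder for however you invoke Hunspell and is not part of the original post:
private void SpellCheckElement(HtmlElement element)
{
    if (element.Children.Count == 0)
    {
        // Leaf element: its InnerText contains no nested markup
        if (!string.IsNullOrEmpty(element.InnerText))
            CheckSpelling(element.InnerText);
    }
    else
    {
        // Each child may itself contain further children, so recurse
        foreach (HtmlElement child in element.Children)
            SpellCheckElement(child);
    }
}
You would kick it off with something like SpellCheckElement(webBrowser.Document.Body). Note that direct text of elements that also have children (mixed content) is not covered by this simplification.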
I chose to check the spelling of the innerText property, but when replacing any changed words, I replaced them within the innerHTML. This was fairly easy when changing all instances of a misspelled word: use a regular expression to gather the indices of all matching words in the innerHTML and replace each one.
// Match whole words; the @ verbatim prefix and the + quantifier are needed here
Regex wordEx = new Regex(@"[A-Za-z]+", RegexOptions.Compiled);
MatchCollection mcol = wordEx.Matches(webEditor.Document.Body.InnerHtml);
List<int> wordIndeces = new List<int>();
foreach (Match m in mcol)
{
    // Basic checking for whether this word is an HTML tag. This is not perfect.
    if (m.Value == e.Word &&
        (m.Index == 0 || webEditor.Document.Body.InnerHtml.Substring(m.Index - 1, 1) != "<"))
    {
        wordIndeces.Add(m.Index);
    }
}
// Replace from the last index to the first so the earlier indices stay valid
// when the replacement word has a different length.
wordIndeces.Reverse();
foreach (int curWordTextIndex in wordIndeces)
{
    Word word = Word.GetWordFromPosition(webEditor.Document.Body.InnerHtml, curWordTextIndex);
    string tmpText = webEditor.Document.Body.InnerHtml.Remove(word.Start, word.Length);
    webEditor.Document.Body.InnerHtml = tmpText.Insert(word.Start, e.NewWord);
}
UpdateSpellingForm(e.TextIndex);
When replacing a single instance, I looped through the InnerText to find which occurrence needed to be replaced, then looped through the InnerHTML until I found the corresponding occurrence and replaced it.
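As a rough sketch of that single-instance case (the occurrence counting below is my assumption of how it could work, not the original code), where occurrence is the zero-based count of the chosen instance found in the InnerText:
private static string ReplaceNthOutsideTags(string innerHtml, string word, string newWord, int occurrence)
{
    int seen = 0;
    foreach (Match m in Regex.Matches(innerHtml, @"\b" + Regex.Escape(word) + @"\b"))
    {
        // Same basic tag check as above: skip matches directly preceded by '<'
        if (m.Index > 0 && innerHtml[m.Index - 1] == '<')
            continue;
        if (seen++ == occurrence)
            return innerHtml.Remove(m.Index, m.Length).Insert(m.Index, newWord);
    }
    return innerHtml;
}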
I am going to leave this here in case anyone can still answer it, but I am going to go a different route for my search.
I know there are several questions on here that are similar but none get me where I am going.
I have the search part basically finished. It works beautifully. Gets all occurrences of the searched word or phrase ignoring case. But the problem is, if you were to search for "div" or "table" or some other word that is an html element name or attribute value, the search tries to highlight that too and totally screws up the page.
So I really just need a simple way to make sure the search ignores those occurrences. Here is what I have. I assume I probably need a really good regex but I can't write a regex to save my life, so help would be appreciated.
private void PerformSearch()
{
    string searchString = SearchTextBox.Text;
    HtmlDocument doc = ManualViewBrowser.Document;
    StringBuilder html = new StringBuilder(doc.Body.InnerHtml);
    doc.Body.InnerHtml = Regex.Replace(html.ToString(), searchString,
        new MatchEvaluator(Highlight), RegexOptions.IgnoreCase);
}

private string Highlight(Match m)
{
    return "<em class=\"highlight\">" + m.Value + "</em>";
}
Just remove all html tags from that html string with this method:
private string RemoveHtmlTags(string html)
{
    return Regex.Replace(html, "<.*?>", String.Empty);
}
Edit:
You are right, so instead of searching inside the HTML, just loop through all the nodes of the page and search for the word inside each of them.
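A rough sketch of that per-node idea, assuming the WinForms WebBrowser from the question and reusing the Highlight() evaluator above; it only touches leaf elements, so the replacement cannot hit tag names or attribute values (mixed-content elements are skipped in this simplification):
private void PerformSearchPerNode()
{
    string pattern = Regex.Escape(SearchTextBox.Text);

    // Collect the leaf elements first so the DOM is not modified while enumerating
    List<HtmlElement> leaves = new List<HtmlElement>();
    foreach (HtmlElement element in ManualViewBrowser.Document.All)
    {
        if (element.Children.Count == 0 && !string.IsNullOrEmpty(element.InnerText))
            leaves.Add(element);
    }

    foreach (HtmlElement element in leaves)
    {
        element.InnerHtml = Regex.Replace(element.InnerText, pattern,
            new MatchEvaluator(Highlight), RegexOptions.IgnoreCase);
    }
}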
This is a tough nut to crack.
I have some HTML which needs to be stripped of certain tags, attributes AND properties.
Basically there are three different approaches which are to be considered:
String Operations: Iterate through the HTML string and strip it via string operations 'manually'
Regex: Parsing HTML with RegEx is evil. Is stripping HTML evil too?
Using a library to strip it (e.g. HTML Agility Pack)
My wish is that I have lists for:
acceptedTags (e.g. SPAN, DIV, OL, LI)
acceptedAttributes (e.g. STYLE, SRC)
acceptedProperties (e.g. TEXT-ALIGN, FONT-WEIGHT, COLOR, BACKGROUND-COLOR)
Which I can pass to this function which strips the HTML.
Example Input:
<BODY STYLE="font-family:Tahoma;font-size:11;"> <DIV STYLE="margin:0 0 0 0;text-align:Left;font-family:Tahoma;font-size:16;"> <SPAN STYLE="font-weight:bold;color:#000000;background-color:#FF0000;font-family:tahoma;font-size:11;">Hello</SPAN></BODY>
Example Output (with parameter lists from above):
<DIV STYLE="text-align:Left;"> <SPAN STYLE="font-weight:bold;color:#000000;background-color:#FF0000;">Hello</SPAN>
the entire BODY tag is stripped (not an accepted tag)
the properties margin, font-family and font-size are stripped from the DIV tag
the properties font-family and font-size are stripped from the SPAN tag.
What have I tried?
Regex seemed to be the best approach at the first glance. But I couldn't get it working properly.
Articles on Stackoverflow I had a look at:
Regular expression to remove HTML tags
How to clean HTML tags using C#
...and many more.
I tried the following regex:
Dim AcceptableTags As String = "font|span|html|i|b|u|sup|sub|ol|ul|li|br|h2|h3|h4|h5|span|div|p|a|img|blockquote"
Dim WhiteListPattern As String = "</?(?(?=" & AcceptableTags & _
")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>"
Dim Html as String = Regex.Replace(b.HTML, WhiteListPattern, "", RegexOptions.Compiled)
However, this only removes tags, not attributes or properties!
I'm definitely not looking for someone to do the whole job, rather for someone who can point me in the right direction.
I'm happy with either C# or VB.NET as answers.
Definitely use a library!
With HTMLAgilityPack you can do pretty much everything you want:
Remove tags you don't want:
string[] allowedTags = { "SPAN", "DIV", "OL", "LI" };
// "//*" selects element nodes only, so text nodes are not swept away with them
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//*"))
{
    if (!allowedTags.Contains(node.Name.ToUpper()))
    {
        HtmlNode parent = node.ParentNode;
        // keepGrandChildren = true, so the node's contents are kept in place
        parent.RemoveChild(node, true);
    }
}
Remove attributes you don't want & remove properties
string[] allowedAttributes = { "STYLE", "SRC" };
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//*"))
{
    List<HtmlAttribute> attributesToRemove = new List<HtmlAttribute>();
    foreach (HtmlAttribute att in node.Attributes)
    {
        if (!allowedAttributes.Contains(att.Name.ToUpper()))
        {
            attributesToRemove.Add(att);
        }
        else
        {
            string newAttrib = string.Empty;
            // Do string manipulation based on your accepted properties.
            // One way would be to split the attribute value on semicolons and do a
            // String.Contains() on each declaration, not appending those that don't
            // match. Maybe use a StringBuilder instead, too.
            att.Value = newAttrib;
        }
    }
    foreach (HtmlAttribute attribute in attributesToRemove)
    {
        node.Attributes.Remove(attribute);
    }
}
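As one possible way to fill in that commented section (just a sketch; FilterStyle and the property whitelist are my names, not part of the original answer, and it assumes using System.Linq and System.Text):
private string FilterStyle(string styleValue)
{
    string[] allowedProperties = { "TEXT-ALIGN", "FONT-WEIGHT", "COLOR", "BACKGROUND-COLOR" };
    StringBuilder sb = new StringBuilder();
    // Split the "name:value" declarations on semicolons and keep only whitelisted names
    foreach (string declaration in styleValue.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries))
    {
        string name = declaration.Split(':')[0].Trim().ToUpper();
        if (allowedProperties.Contains(name))
            sb.Append(declaration.Trim()).Append(';');
    }
    return sb.ToString();
}
Inside the else branch above you would then set att.Value = FilterStyle(att.Value) when att.Name is STYLE, and leave SRC values alone.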
I would probably actually just write this myself as a multi-step process:
1) Exclude all rules for removing properties from tags that are listed as tags to be removed (the tags won't be there anyway!)
2) Walk the document, taking a copy of the document without the excluded tags (i.e. in your example, copy everything up until "<div", then wait until I see ">" before continuing to copy). If I'm in copy mode and I see "ExcludedTag=", stop copying until I see the closing quotation mark.
You'll probably want to do some pre-work validation on the html and getting the formatting the same, etc. before running this process to avoid broken output.
Oh, and copy in chunks, i.e. just keep the index of the copy start until you reach the copy end, then copy the whole chunk rather than individual characters (a rough sketch follows below).
Hopefully that helps as a starting point.
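A very rough sketch of the chunk-copying idea from step 2, limited to stripping unaccepted tags (attribute and property handling omitted) and assuming well-formed markup with no ">" inside attribute values:
private static string StripUnacceptedTags(string html, HashSet<string> acceptedTags)
{
    StringBuilder output = new StringBuilder();
    int copyStart = 0;
    Regex tagRegex = new Regex(@"</?\s*([a-zA-Z0-9]+)[^>]*>");

    foreach (Match m in tagRegex.Matches(html))
    {
        if (!acceptedTags.Contains(m.Groups[1].Value.ToLower()))
        {
            // Copy the whole chunk before this tag, then skip the tag itself
            output.Append(html, copyStart, m.Index - copyStart);
            copyStart = m.Index + m.Length;
        }
    }
    output.Append(html, copyStart, html.Length - copyStart);
    return output.ToString();
}
You would call it with something like StripUnacceptedTags(input, new HashSet<string> { "span", "div", "ol", "li" }).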
OK, so I am really new to the XPath queries used in HtmlAgilityPack.
So let's consider this page: http://health.yahoo.net/articles/healthcare/what-your-favorite-flavor-says-about-you. What I want is to extract just the page content and nothing else.
So for that I first remove the script and style tags.
Document = new HtmlDocument();
Document.LoadHtml(page);
TempString = new StringBuilder();

foreach (HtmlNode style in Document.DocumentNode.Descendants("style").ToArray())
{
    style.Remove();
}
foreach (HtmlNode script in Document.DocumentNode.Descendants("script").ToArray())
{
    script.Remove();
}
After that I am trying to use //text() to get all the text nodes.
foreach (HtmlTextNode node in Document.DocumentNode.SelectNodes("//text()"))
{
    TempString.AppendLine(node.InnerText);
}
However, not only am I not getting just the text, I am also getting numerous \r \n characters.
I would appreciate a little guidance in this regard.
If you consider that script and style nodes only have text nodes for children, you can use this XPath expression to get text nodes that are not in script or style tags, so that you don't need to remove the nodes beforehand:
//*[not(self::script or self::style)]/text()
You can further exclude text nodes that are only whitespace using XPath's normalize-space():
//*[not(self::script or self::style)]/text()[not(normalize-space(.)="")]
or the shorter
//*[not(self::script or self::style)]/text()[normalize-space()]
But you will still get text nodes that may have leading or trailing whitespace. This can be handled in your application, as @aL3891 suggests.
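Putting the last expression into the code from the question would look roughly like this (SelectNodes returns null when nothing matches, hence the guard):
HtmlNodeCollection textNodes = Document.DocumentNode.SelectNodes(
    "//*[not(self::script or self::style)]/text()[normalize-space()]");
if (textNodes != null)
{
    foreach (HtmlNode node in textNodes)
    {
        // Trim() takes care of any remaining leading/trailing whitespace
        TempString.AppendLine(node.InnerText.Trim());
    }
}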
If the \r \n characters in the final string are the problem, you could just remove them after the fact:
// String.Replace returns a new string, so the result has to be assigned somewhere
string cleaned = TempString.ToString().Replace("\r", "").Replace("\n", "");
I need to write some code that will search and replace whole words in a string that are outside HTML tags. So if I have this string:
string content = "the brown fox jumped over <b>the</b> lazy dog over there";
string keyword = "the";
I need to do something like:
if (content.ToLower().Contains(keyword.ToLower()))
    content = content.Replace(keyword, String.Format("<span style=\"background-color:yellow;\">{0}</span>", keyword));
but I don't want to replace the "the" in the bold tags or the "the" in "there", just the first "the".
You can use this library to parse your HTML and replace only the words that are not inside any HTML. To replace only the word "the" and not "there", use Regex.Replace with a word-boundary pattern such as @"\bthe\b" instead of a plain string replace.
Try this:
content = Regex.Replace(content, "(?<!>)"
    + keyword
    + @"(?!(<|\w))", "<span blah...>" + keyword + "</span>");
Edit: I fixed the "these" case, but not the case where more than the keyword is wrapped in HTML, e.g., "fox jumped over the lazy dog."
What you're asking for is going to be nearly impossible with RegEx and normal, everyday HTML, because to know if you're "inside" a tag, you would have to "pair" each start and end tag, and ignore tags that are intended to be self-closing (BR and IMG, for instance).
If this is merely eye candy for a web site, I suggest going the other route: fix your CSS so the SPAN you are adding only impacts the HTML outside of a tag.
For example:
content = content.Replace("the", "<span class=\"highlight\">the</span>");
Then, in your CSS:
span.highlight { background-color: yellow; }
b span.highlight,
i span.highlight,
em span.highlight,
strong span.highlight,
p span.highlight,
blockquote span.highlight { background: none; }
Just add an exclusion for each HTML tag whose contents should not be highlighted.
I like the suggestion to use an HTML parser, but let me propose a way to enumerate the top-level text (no enclosing tags) regions, which you can transform and recombine at your leisure.
Essentially, you can treat each top-level open tag as a {, and track the nesting of only that tag. This might be simple enough compared to regular parsing that you want to do it yourself.
Here are some potential gotchas:
If it's not XHTML, you need a list of tags which are always empty:
<hr>, <br>, and <img> (are there more?).
For all opening tags, if it ends in />, it's immediately closed - {} rather than {.
Case insensitivity - I believe you'll want to match tag names insensitively (just lc them all).
Super-permissive generous browser interpretations like
"<p> <p>" = "<p> </p><p>" = {}{
Quoted entities are NOT allowed to contain <> (they need to use &lt;), but maybe browsers are super permissive there as well.
Essentially, if you want to parse correct HTML markup, there's no problem.
So, the algorithm:
"end of previous tag" = start of string
repeatedly search for the next open-tag (case insensitive), or end of string:
< *([^ >/]+)[^/>]*(/?) *>|$
handle (end of previous tag, start of match) as a region outside all tags.
set tagname=lc($1). if there was a / ($2 isn't empty), then update end and continue at start. else, with depth=1,
while depth > 0, scan for next (also case insensitive):
< *(/?) *$tagname *(/?) *>
If $1, then it's a close tag (depth-=1). Else if not $2, it's another open tag; depth+=1. In any case, keep looping (back to 1.)
Back to start (you're at top level again). Note that I said at the top "scan for next start of top-level open tag, or end of string", i.e. make sure you process the toplevel text hanging off the last closing tag.
That's it. Essentially, you get to ignore all other tags than the current topmost one you're monitoring, on the assumption that the input markup is properly nested (it will still work properly against some types of mis-nesting).
Also, wherever I wrote a space above, should probably be any whitespace (between < > / and tag name you're allowed any whitespace you like).
As you can see, just because the problem is slightly easier than full HTML parsing, doesn't necessarily mean you shouldn't use a real HTML parser :) There's a lot you could screw up.
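For what it's worth, here is a rough C# sketch of the enumeration described above. It uses a slightly adjusted open-tag pattern and inherits the general caveats the answer mentions: comments, doctypes, and mis-nested or otherwise malformed markup can throw it off.
private static readonly HashSet<string> EmptyTags =
    new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "br", "hr", "img" };

private static IEnumerable<string> OutsideTagRegions(string html)
{
    Regex openTag = new Regex(@"<\s*([^\s>/]+)[^>]*?(/?)\s*>", RegexOptions.IgnoreCase);
    int pos = 0;
    while (pos < html.Length)
    {
        Match m = openTag.Match(html, pos);
        int regionEnd = m.Success ? m.Index : html.Length;
        if (regionEnd > pos)
            yield return html.Substring(pos, regionEnd - pos);   // top-level text region
        if (!m.Success)
            yield break;

        string tag = m.Groups[1].Value;
        pos = m.Index + m.Length;
        // A self-closed ("/>") or inherently empty tag is {} - nothing to track
        if (m.Groups[2].Value == "/" || EmptyTags.Contains(tag))
            continue;

        // Track the nesting of only this tag name until its matching close tag
        Regex sameTag = new Regex(@"<\s*(/?)\s*" + Regex.Escape(tag) + @"\b[^>]*?(/?)\s*>",
                                  RegexOptions.IgnoreCase);
        int depth = 1;
        while (depth > 0)
        {
            Match n = sameTag.Match(html, pos);
            if (!n.Success) { pos = html.Length; break; }        // unbalanced markup
            pos = n.Index + n.Length;
            if (n.Groups[1].Value == "/") depth--;               // close tag
            else if (n.Groups[2].Value != "/") depth++;          // another open tag
        }
    }
}
Each yielded string is a top-level text region; to rebuild the document after transforming the regions you would also want to track their start and end offsets, which is omitted here for brevity.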
You'll need to give more details.
For example:
<p>the brown fox</p>
is technically inside HTML tags.
What would be the best way to search through HTML inside a C# string variable to find a specific word/phrase and mark (or wrap) that word/phrase with a highlight?
Thanks,
Jeff
I like using Html Agility Pack; it's very easy to use, and although there haven't been many updates lately, it is still usable. For example, grabbing all the links:
HtmlWeb client = new HtmlWeb();
HtmlDocument doc = client.Load("http://yoururl.com");
HtmlNodeCollection Nodes = doc.DocumentNode.SelectNodes("//a[@href]");
foreach (var link in Nodes)
{
    Console.WriteLine(link.Attributes["href"].Value);
}
Regular Expression would be my way. ;)
If the HTML you're using is XHTML compliant, you could load it as an XML document and then use XPath/XSL - long-winded, but kind of elegant?
An approach I used in the past is to use HTMLTidy to convert messy HTML to XHTML, and then use XSL/XPath for screen scraping content into a database, to create a reverse content management system.
Regular expressions would do it, but could be complicated once you try stripping out tags, image names etc, to remove false positives.
In simple cases, regular expressions will do.
string input = "ttttttgottttttt";
string output = Regex.Replace(input, "go", "<strong>$0</strong>");
will yield: "tttttt<strong>go</strong>ttttttt"
But when you say HTML, if you're referring to final text rendered, that's a bit of a mess. Say you've got this HTML:
<span class="firstLetter">B</span>ook
To highlight the word 'Book', you would need the help of a proper HTML renderer. To simplify, one can first remove all tags and leave only contents, and then do the usual replace, but it doesn't feel right.
You could look at using Html DOM, an open source project on SourceForge.net.
This way you could programmatically manipulate your text instead of relying on regular expressions.
For searching for strings, you'll want to look up regular expressions. As for marking the match, once you have the position of the substring it should be simple enough to wrap something around the phrase.