I am trying to create a tree-like structure for a webpage. In my case a user has a set of projects, each project has a set of tasks, tasks can have sub-tasks, and sub-tasks can have their own sub-tasks. So, basically, I am looking for an elegant solution for displaying this tree-like structure using HTML elements in an aesthetically pleasing manner. Any tips or ideas would be appreciated.
A good place to start would be to write some pseudo-code. Most people on this website won't give you an entire solution. I have almost no knowledge of HTML, but in every other programming language the first place to start would be pseudo-code. Also, this sounds like you need to define some objects and structure your code around them.
I would leave this as a comment, but I'm lacking a couple of rep points. You may want to focus more on the JavaScript part of building the tree. The HTML is the easy part. Do some searching on jQuery treeview controls and see what you come back with. Here's an example:
http://www.jstree.com/
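jsTree works off plain nested <ul>/<li> markup, which also maps naturally onto your projects/tasks/sub-tasks hierarchy. As a rough sketch of generating that markup server-side (the TaskNode class here is hypothetical, just to show the recursion):

    using System.Collections.Generic;
    using System.Text;

    class TaskNode
    {
        public string Name;   // HTML-encode this in real code
        public List<TaskNode> SubTasks = new List<TaskNode>();
    }

    static class TreeRenderer
    {
        // Emits one nested <ul> per level; a plugin like jsTree (or plain
        // CSS) can then style and collapse the resulting list.
        public static string Render(List<TaskNode> tasks)
        {
            var sb = new StringBuilder("<ul>");
            foreach (var task in tasks)
            {
                sb.Append("<li>").Append(task.Name);
                if (task.SubTasks.Count > 0)
                    sb.Append(Render(task.SubTasks)); // recurse into sub-tasks
                sb.Append("</li>");
            }
            return sb.Append("</ul>").ToString();
        }
    }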
Follow the link here. Very good explanation, and you can customize it according to your needs.
MS Word has this capability (coloring diacritical marks separately from their base letters) in its Hebrew and Arabic versions. I would like to achieve this in a Windows desktop application, using .NET (maybe with Win32 API calls).
As explained in the link provided by Otaku here, current rich text edit controls cannot handle this (unless you go for the hack the OP in that question used, which did not seem like a very good solution).
You could write code to do this manually yourself, ditching the text edit control completely, but that would probably mean a lot of work. It took Microsoft years to get support for combining diacritics working properly in MS Word. I would search for open-source software that has this capability and look at how other developers have done it. It might be hard to find, though, and you would likely have to step outside .NET-land. Maybe OpenOffice can do this?
This discussion might also be of help.
I am afraid that you will find, though, that you'll have to manually parse the Unicode and assign colors to the correct glyphs. If you want to be complete, that is one heck of a job.
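If you do go down the manual route, .NET's Unicode tables at least make the parsing step straightforward. A minimal sketch, assuming you just need to tell combining marks (the diacritics) apart from base characters before assigning colors; the actual colored rendering (GDI+/Uniscribe owner-drawing) is the hard part and is not shown:

    using System;
    using System.Globalization;

    static class DiacriticScanner
    {
        // Flags each char as a combining mark (a candidate for separate
        // coloring) or a base character, using the Unicode category tables.
        // Hebrew niqqud and Arabic harakat fall under NonSpacingMark.
        public static void Scan(string text)
        {
            foreach (char c in text)
            {
                UnicodeCategory cat = CharUnicodeInfo.GetUnicodeCategory(c);
                bool isMark = cat == UnicodeCategory.NonSpacingMark
                           || cat == UnicodeCategory.SpacingCombiningMark;
                Console.WriteLine("U+{0:X4} {1}", (int)c, isMark ? "mark" : "base");
            }
        }
    }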
I have a textbox and a button on one page. I want to enter a word in the textbox and click the button. After clicking the button, I want to display the names of the web pages containing the word entered in the textbox. Please tell me how to do this. I am using C#.
So you want to create a search engine internal to your website. There are a couple of different options:
You can use something like Google Custom Search, which requires no coding and uses Google's technology, which I think we can all agree does a pretty good job compared to other search engines. More information at http://www.google.com/cse/
Or you can implement it in .NET, which I will try to give some pointers about below.
A search engine in general consists of (some of) the following parts:
an index, which is searched against
a query system, which allows searches to be specified and results to be shown
a way to get documents into the index, such as a crawler or an event handler that runs when documents are created/published/updated
These are non-trivial things to create, especially if you want a rich feature set like stemming (matching plural and other inflected forms of search terms), highlighting results, and indexing different document formats like PDF, RTF, HTML, etc., so you want to use something already made for this purpose. That would leave only the task of connecting and orchestrating the different parts and writing the flow-control logic.
You could use Lucene.Net, an open-source project with a lot of features. http://usoniandream.blogspot.com/2007/10/tutorial-implementing-lucenenet-search.html explains how to get started with it.
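To give a feel for what that looks like, here's a minimal sketch (API names as in Lucene.Net 3.x; the "url"/"content" field names and the index path are made up for illustration):

    using System;
    using System.IO;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Documents;
    using Lucene.Net.Index;
    using Lucene.Net.QueryParsers;
    using Lucene.Net.Search;
    using Lucene.Net.Store;
    using Version = Lucene.Net.Util.Version;

    class SearchExample
    {
        static void Main()
        {
            var dir = FSDirectory.Open(new DirectoryInfo("index"));
            var analyzer = new StandardAnalyzer(Version.LUCENE_30);

            // Indexing: one Document per page, done when pages are
            // created/published/updated.
            using (var writer = new IndexWriter(dir, analyzer, true,
                                                IndexWriter.MaxFieldLength.UNLIMITED))
            {
                var doc = new Document();
                doc.Add(new Field("url", "/about.aspx",
                                  Field.Store.YES, Field.Index.NOT_ANALYZED));
                doc.Add(new Field("content", "page text goes here",
                                  Field.Store.YES, Field.Index.ANALYZED));
                writer.AddDocument(doc);
            }

            // Searching: roughly what your button's click handler would do.
            using (var searcher = new IndexSearcher(dir, true)) // true = read-only
            {
                var query = new QueryParser(Version.LUCENE_30, "content", analyzer)
                                .Parse("searchword");
                foreach (var hit in searcher.Search(query, 10).ScoreDocs)
                    Console.WriteLine(searcher.Doc(hit.Doc).Get("url"));
            }
        }
    }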
The other option is the Microsoft Indexing Service, which comes with Windows, but I would advise against it since it's difficult to tweak to work the way you want, and the results are sub-optimal in my opinion.
You are going to need some sort of backing store, and full text indexing. To the best of my knowledge, C# alone is not enough.
I'm looking for an algorithm (or some other technique) to read the actual content of news articles on websites and ignore anything else on the page. In a nutshell, I'm reading an RSS feed programmatically from Google News. I'm interested in scraping the actual content of the underlying articles. On my first attempt, I took the URLs from the RSS feed, simply followed them, and scraped the HTML from those pages. This very clearly resulted in a lot of "noise", whether HTML tags, headers, navigation, etc., basically all the information that is unrelated to the actual content of the article.
Now, I understand this is an extremely difficult problem to solve; it would theoretically involve writing a parser for every website out there. What I'm interested in is an algorithm (I'd even settle for an idea) for how to maximize the actual content I see when I download the article and minimize the amount of noise.
A couple of additional notes:
Scraping the HTML is simply the first attempt I tried. I'm not sold that this is the best way to do things.
I don't want to write a parser for every website I come across, I need the unpredictability of accepting whatever Google provides through the RSS feed.
I know whatever algorithm I end up with is not going to be perfect, but I'm interested in the best possible solution.
Any ideas?
As long as you've accepted the fact that whatever you try is going to be imperfect given your requirements, I'd recommend you look into Bayesian filtering. This technique has proven to be very effective in filtering spam out of email.
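Sketching the idea: train on blocks of pages you've hand-labeled as article text vs. noise, then score new blocks the way spam filters score mail. Everything below (labels, tokenizer, smoothing) is a deliberately crude illustration, not a production classifier:

    using System;
    using System.Collections.Generic;

    class BlockClassifier
    {
        readonly Dictionary<string, int>[] counts =
            { new Dictionary<string, int>(), new Dictionary<string, int>() };
        readonly int[] totals = new int[2];

        public void Train(string text, int label)   // 0 = noise, 1 = content
        {
            foreach (var w in Tokenize(text))
            {
                int n;
                counts[label].TryGetValue(w, out n);
                counts[label][w] = n + 1;
                totals[label]++;
            }
        }

        public int Classify(string text)            // assumes equal priors
        {
            var logProb = new double[2];
            foreach (var w in Tokenize(text))
                for (int label = 0; label < 2; label++)
                {
                    int n;
                    counts[label].TryGetValue(w, out n);
                    // add-one smoothing so unseen words don't zero things out
                    logProb[label] += Math.Log((n + 1.0) / (totals[label] + 2.0));
                }
            return logProb[1] > logProb[0] ? 1 : 0;
        }

        static string[] Tokenize(string text)
        {
            return text.ToLowerInvariant().Split(
                new[] { ' ', '\t', '\n', '.', ',' },
                StringSplitOptions.RemoveEmptyEntries);
        }
    }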
When reading news outside of my RSS reader, I often use Readability to filter out everything but the meat of the article. It is JavaScript-based, so the technique would not directly apply to your problem, but the algorithm has a high success rate in my experience and is worth a look. Hope this helps.
Take a look at templatemaker (Google code homepage). The basic idea is that you request a few different pages from the same site, then mark down what elements are common across the set of pages. From there you can figure out where the dynamic content is.
Try running diff on two pages from the same site to get an idea of how it works. The parts of the page that are different are the places where there is dynamic (interesting) content.
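A crude sketch of that principle (a real diff is smarter about ordering and near-matches, but this shows the idea): lines appearing in both pages are template, lines unique to one page are likely its dynamic content.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class TemplateDiff
    {
        // Lines occurring in both pages are assumed to be template/boilerplate;
        // lines unique to pageA are candidates for its dynamic content.
        public static IEnumerable<string> UniqueLines(string pageA, string pageB)
        {
            var common = new HashSet<string>(pageB.Split('\n'));
            return pageA.Split('\n')
                        .Where(line => !common.Contains(line))
                        .Where(line => line.Trim().Length > 0);
        }
    }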
Here's what I would do, after checking the robots.txt file to make sure it's fine to scrape the article and parsing the document as an XML tree:
Make sure the article is not broken into many pages. If it is, 'print view', 'single page', or 'mobile view' links may help bring it onto a single page. Of course, don't bother if you only want the beginning of the article.
Find the main content frame. To do that, I would count the amount of information in every tag. What we're looking for is a node that is big but consists of many small subnodes (a sketch of this scoring follows the list).
Now I would try to filter out any noise inside the content frame. Well, the websites I read don't put any crap there, only useful images, but you do need to kill anything that has inline JavaScript, and any external links.
Optionally, flatten that into plain text (that is, go into the tree and open all elements; block elements create a new paragraph).
Guess the header. It's usually something with h1, h2, or at least a big font size, but you can simplify life by assuming that it somehow resembles the page title.
Finally, find the authors (something with names and emails), the copyright notice (try metadata or the word copyright), and the site name. Assemble these together with the link to the original, and state clearly that it's probably fair use (or whatever legal doctrine you feel applies to you).
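For step 2 (finding the main content frame), a rough sketch using HtmlAgilityPack (mentioned in another answer here); the tag list and the weighting are my own guesses to illustrate the "big node, many small subnodes" idea:

    using System.Linq;
    using HtmlAgilityPack;

    static class ContentFrameFinder
    {
        // Prefers a node with lots of text spread over many <p> children
        // rather than one giant text blob.
        // Usage: var doc = new HtmlDocument(); doc.LoadHtml(html);
        //        var main = ContentFrameFinder.Find(doc);
        public static HtmlNode Find(HtmlDocument doc)
        {
            return doc.DocumentNode
                .Descendants()
                .Where(n => n.Name == "div" || n.Name == "td" || n.Name == "article")
                .OrderByDescending(n =>
                    n.Descendants("p").Count() * 100 + n.InnerText.Length)
                .FirstOrDefault();
        }
    }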
There is an almost perfect tool for this job, Boilerpipe.
In fact it has its own tag here, boilerpipe, though it's little used. Here's the description right from the tag wiki:
The boilerpipe library for Java provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
The source is all there in the project if you just want to learn the algorithms and techniques, but in fact somebody has already ported it to C# which is quite possibly perfect for your needs: NBoilerpipe.
BTE (Body Text Extraction) is a Python module that finds the portion of a document with the highest ratio of text to tags on a page.
http://www.aidanf.net/archive/software/bte-body-text-extraction
It's a nice, simple way of getting real text out of a website.
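The underlying idea is simple enough to sketch: score each token +1 if it's text and -1 if it's a tag, then find the contiguous run with the maximum total (a classic maximum-subarray problem). Something like the following, which is my own simplified take, not the module's actual code:

    using System;
    using System.Linq;
    using System.Text.RegularExpressions;

    static class BodyTextExtraction
    {
        // Finds the contiguous token span where text most outweighs markup,
        // using Kadane's algorithm over +1 (text) / -1 (tag) scores.
        public static string Extract(string html)
        {
            var tokens = Regex.Matches(html, @"<[^>]+>|[^<\s]+")
                              .Cast<Match>().Select(m => m.Value).ToList();
            int best = 0, bestStart = 0, bestEnd = -1, sum = 0, start = 0;
            for (int i = 0; i < tokens.Count; i++)
            {
                sum += tokens[i].StartsWith("<") ? -1 : 1;
                if (sum <= 0) { sum = 0; start = i + 1; }
                else if (sum > best) { best = sum; bestStart = start; bestEnd = i; }
            }
            return string.Join(" ", tokens
                .Skip(bestStart).Take(bestEnd - bestStart + 1)
                .Where(t => !t.StartsWith("<")));
        }
    }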
Here's my (probably naive) plan for how to approach this:
Assuming the RSS feed contains the opening words of the article, you could use these to locate the start of the article in the DOM. Walk back up the DOM a little (first parent DIV? first non-inline container element?) and snip. That should be the article.
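A sketch of that first idea using HtmlAgilityPack (the snippet matching is simplistic; real feeds may entity-encode or rewrite the opening text):

    using System.Linq;
    using HtmlAgilityPack;

    static class ArticleLocator
    {
        // Finds the text node containing the article's opening words (taken
        // from the RSS item) and climbs to the nearest enclosing <div>.
        public static HtmlNode Locate(HtmlDocument doc, string openingWords)
        {
            var textNode = doc.DocumentNode.Descendants()
                .FirstOrDefault(n => n.NodeType == HtmlNodeType.Text
                                  && n.InnerText.Contains(openingWords));
            return textNode == null ? null
                 : textNode.Ancestors("div").FirstOrDefault();
        }
    }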
Assuming you can get the document as XML (HtmlAgilityPack can help here), you could (for instance) grab all descendant text from <p> elements with the following LINQ to XML:
document
    .Descendants(XName.Get("p", "http://www.w3.org/1999/xhtml"))
    .Select(p => p
        .DescendantNodes()
        .OfType<XText>()          // keep only the text nodes
        .Select(t => t.Value))    // .Value avoids the XML-escaping that ToString() applies
    .Where(c => c.Any())
    .Select(c => c.Aggregate((a, b) => a + b))    // join the text within each <p>
    .Aggregate((a, b) => a + "\r\n\r\n" + b);     // blank line between paragraphs
We successfully used this formula for scraping, but it seems like the terrain you have to cross is considerably more inhospitable.
Obviously not a whole solution, but instead of trying to find the relevant content, it might be easier to disqualify non-relevant content. You could classify certain types of noise and work on coming up with smaller solutions that eliminate them. You could have advertisement filters, navigation filters, etc.
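For example, one such filter might strip elements whose id/class hints at ads or navigation (the keyword list is a guess you would tune per noise type; again assuming HtmlAgilityPack):

    using System.Linq;
    using HtmlAgilityPack;

    static class NoiseFilters
    {
        // Naive substring matching ("ad" alone would also hit "header"),
        // so a real filter should match whole class/id tokens instead.
        static readonly string[] Hints =
            { "advert", "banner", "nav", "menu", "footer", "sidebar" };

        public static void Strip(HtmlDocument doc)
        {
            var noisy = doc.DocumentNode.Descendants()
                .Where(n => Hints.Any(h =>
                    (n.GetAttributeValue("class", "") + " " +
                     n.GetAttributeValue("id", "")).Contains(h)))
                .ToList();              // materialize before mutating the tree
            foreach (var node in noisy)
                node.Remove();
        }
    }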
I think that the larger question is: do you need one solution to work across a wide range of content, or are you willing to create a framework that you can extend and implement on a site-by-site basis? On top of that, how often do you expect the underlying data sources to change (i.e., how volatile are they)?
You might want to look at Latent Dirichlet Allocation, which is an IR technique for generating topics from text data. It should help you reduce noise and get some precise information about what the page is about.
One of the guys I work with needs a custom control that would work like a multiline DDL (drop-down list), since such a thing does not exist as far as we have been able to discover.
Does anyone have any ideas, or has anyone created such a thing before?
We have a couple of ideas, but they involve too much database usage.
We'd prefer that it be FREE!!!
We use a custom-modified version of Suckerfish at work. DB performance isn't an issue for us because we cache the control.
The control renders out nested UL/LIs either for all nodes in the web.sitemap or for a certain set of pages pulled from the DB. We then use jQuery to do all the cool javascript stuff. Because it uses such basic HTML, it's pretty easy to have multi-line or wrapped long items once you style it with CSS.
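A stripped-down sketch of that rendering, assuming ASP.NET's SiteMap provider (the real control adds caching and styling hooks on top):

    using System.Text;
    using System.Web;

    static class SiteMapListRenderer
    {
        // Recursively renders SiteMapNodes as nested <ul>/<li> markup,
        // which CSS/jQuery can then turn into menus or tree views.
        // Usage: "<ul>" + SiteMapListRenderer.Render(SiteMap.RootNode) + "</ul>"
        public static string Render(SiteMapNode node)
        {
            var sb = new StringBuilder();
            sb.AppendFormat("<li><a href=\"{0}\">{1}</a>", node.Url, node.Title);
            if (node.HasChildNodes)
            {
                sb.Append("<ul>");
                foreach (SiteMapNode child in node.ChildNodes)
                    sb.Append(Render(child));    // recurse into child pages
                sb.Append("</ul>");
            }
            return sb.Append("</li>").ToString();
        }
    }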
Have a look at EasyListBox. I used it on a project, and while it was a bit quirky at first, it got the job done.
I'm not sure exactly what you mean by multi-line, but if it is selecting multiple elements in a drop-down list, see this demo.
If it's showing elements that wrap multiple lines in a drop-down, see this demo. You can put a break in the HTML to achieve what you might be looking for. I've used this control in this manner before, so I can confirm it works.
Good luck.