Find href attribute values that do not contain “javascript:”

I have a regex that nicely finds the hrefs in a page's anchor tags:
<[aA][^>]*? href=[\"'](?<url>[^\"]+?)[\"'][^>]*?>
However, I want it to NOT find any href that contains the text 'javascript:'.
The reason is that I sometimes need to modify the href and sometimes don't. When the href contains 'javascript:', I don't want the regex to match it.
(ASP.NET, C#)

I really wouldn't recommend using a regexp for this, since HTML isn't regular and there are no end of edge cases to cater for. If at all possible, please use an HTML parser. I think you'll find it a lot less grief.

The word "javascript" can be written in other ways. Look at the ha.ckers.org article.
Simply excluding the word "javascript" doesn't provide any safety at all.
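
If a regex has to be used anyway, a negative lookahead placed right after the opening quote is one way to skip hrefs whose value contains 'javascript:'. This is only a sketch under assumptions: the class and method names (HrefFinder, FindHrefs) are made up for illustration, attribute values are assumed to be quoted, and, as the answers above point out, this is a matching convenience and not a security filter.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original question.
public static class HrefFinder
{
    // (?![^"']*javascript:) rejects the match when "javascript:" appears
    // anywhere inside the quoted attribute value.
    private static readonly Regex AnchorHref = new Regex(
        @"<a\b[^>]*?\shref=[""'](?![^""']*javascript:)(?<url>[^""']+)[""'][^>]*>",
        RegexOptions.IgnoreCase);

    // Returns the href values of anchors that do not contain "javascript:".
    public static List<string> FindHrefs(string html)
    {
        var urls = new List<string>();
        foreach (Match m in AnchorHref.Matches(html))
        {
            urls.Add(m.Groups["url"].Value);
        }
        return urls;
    }
}
```

RegexOptions.IgnoreCase also covers spellings such as 'JavaScript:', but not encoded or obfuscated variants.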

Related

Remove everything except src in Image Tag using Regex

I want to remove everything except src in an image tag using regex. I am using C#, but I don't want to use HTMLAgilityPack; I want to do it with regex only.
How can I do this?
If the string is <img id="image" class="header" src="test.png">, then it should return <img src="test.png">
The image tag may contain many other extra attributes.

To clarify my comments: normally I wouldn't recommend parsing HTML using Regex. However, this is one of the few times when it's possible without ending up with a disastrously complicated regex string, because here you have a single node, with one pair of matching angle brackets. In addition, the OP only needs a single attribute from this string. If he needed to do anything more complicated, I'd agree that he should use HTMLAgilityPack, but this is perfectly doable.
What you do is extract the src attribute from the string using this regex: (src=['\"].+?['\"]). Then you take what you extracted from the string and paste it into a new string:
String newImgTag = String.Format("<img {0}>", srcMatch);
Again, if this were any more complicated (or if I had to do other HTML manipulation), I would just skip the regex and go for the established solutions like the aforementioned HTMLAgilityPack, because it offers far more support for HTML manipulation.
However, I don't view this as HTML manipulation, because you got a single tag without even a matching closing tag. This is more like basic string manipulation. It's similar to calculating a number to the second power: I doubt anyone would import the entire math library just for that, they'd just do N * N.
I fully expect and accept that people will downvote me for even considering to use Regex for this. Before you do so, however, read the post and think about it. This is one of those borderline cases where HTMLAgilityPack would make the project far more complicated without actually adding anything except that you're not using Regex. Regex has its uses, it's only when you abuse it that it becomes a monster to work with.
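
The extract-and-rebuild approach described in this answer might look like the following in full. It is a minimal sketch: ImgTagCleaner and KeepOnlySrc are illustrative names, the input is assumed to be a single img tag with a quoted src value, and the lookbehind (?<=\s) is an addition so that attributes such as data-src are not picked up by mistake.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper wrapping the regex from the answer.
public static class ImgTagCleaner
{
    // Rebuilds the tag with only its src attribute.
    // Returns the input unchanged when no src attribute is found.
    public static string KeepOnlySrc(string imgTag)
    {
        Match srcMatch = Regex.Match(
            imgTag, @"(?<=\s)src=[""'].+?[""']", RegexOptions.IgnoreCase);
        return srcMatch.Success
            ? string.Format("<img {0}>", srcMatch.Value)
            : imgTag;
    }
}
```

A src value that itself contains a quote character would be cut short by this pattern, which is one of the edge cases where a real parser becomes the safer choice.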

Matching tags in C#

I'm trying to match tags with C# and I'm having some trouble getting it to work. I have these tags:
<categories=1></categories=1>
The =1 could really be any number: 1, 2, 3, or any other given number. Is there a way to match this tag in C# using IndexOf, RegEx, or a better method?
To give an example of how I want to use it, I would have something like:
if (PUT WORKING CODE HERE ONCE FIGURED OUT)
{
Do Something
}
Is there an easy way to do this?
Thanks!

I would suggest first making the document valid XML by replacing those equals signs, then using any XML parser.

There is only one valid answer to this need, unless you are doing homework and need to learn how to code this yourself:
avoid reinventing things from scratch and use Html Agility Pack.
It is called Html, but it also handles XML files, in case you have to do more complex things, like parsing, and don't want to or cannot use pure XPath and the XML-related .NET Framework classes.
See here for some examples: How to use HTML Agility pack
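
For the simple yes/no check the question asks for, a regex with a backreference is one possible sketch. It assumes the tags are not nested and that the number fits in an int; the names CategoryTags and TryGetCategory are invented for this example.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for the <categories=N>...</categories=N> format.
public static class CategoryTags
{
    // \k<id> requires the closing tag to carry the same number as the opening tag.
    private static readonly Regex CategoryTag = new Regex(
        @"<categories=(?<id>[0-9]{1,9})>(?<body>.*?)</categories=\k<id>>",
        RegexOptions.Singleline);

    // Finds the first matching tag pair and reports its number and content.
    public static bool TryGetCategory(string text, out int id, out string body)
    {
        Match m = CategoryTag.Match(text);
        id = m.Success ? int.Parse(m.Groups["id"].Value) : 0;
        body = m.Success ? m.Groups["body"].Value : string.Empty;
        return m.Success;
    }
}
```

The condition in the question's if statement could then be a call to CategoryTags.TryGetCategory(input, out id, out body).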

C# regex html table inside a table

I am using the following regex:
(<(table|h[1-6])[^>]*>(?<op>.+?)<\/(table|h[1-6])>)
to extract tables (and headings) from a html document.
It works quite well on the documents we are using (documents converted with Word's "save as filtered HTML"). However, if a table contains another table, the regex matches from the outer table's start tag to the inner table's end tag rather than to the outer table's end tag.
Is there a way in regex to specify that, if it finds another table start tag within the match, it should ignore the next end tag and go for the one after, and so on?

Don't do this.
HTML is not a regular grammar, so a regular expression is not a good tool with which to parse it. What you are asking for in your last sentence is a contextual parser, not a regular expression. Bare regular expression parsing is too likely to fail to parse HTML correctly to be responsible coding.
HtmlAgilityPack is an Ms-PL-licensed solution I've used in the past. It has widely acceptable license terms and provides a well-formed DOM that can be probed with XPath or manipulated in other useful ways. ("Extract all text, dropping out tags" is a popular one, for importing HTML mail for search, for example; it is nigh trivial once a DOM parser has ripped through the HTML, leaving you to code only the part that adds value for your specific business case.)

Is there a way in regex to specify
that, if it finds another table start
tag within the match, it should ignore
the next end tag and go for the one
after, and so on?
Since nobody's actually answered this part, I will: no.
This is part of what makes regular languages "regular". A regular language is one that can be recognized by a certain regular grammar, often described in syntax that looks very much like basic regular expressions (10* to match 1 followed by any number of 0s), or a DFA. "Regular Expressions" are based strongly off of these regular languages, as their name implies, but add some functions such as lookaheads and lookbehinds. As a general rule, a regular language knows nothing about what's around it or what it's seen, only what it's looking at currently, and which of its finite states it's in.
TL;DR: Why does this matter to you? Since a regular language cannot "count" elements in that way, it is impossible to keep a tally of the number of <table> and </table> elements you have seen. An HTML parser does just that: since it is not trying to emulate a regular language, it can count the number of opening and closing tags it sees.
This is the prime example of why it's best not to use regular expressions to parse HTML; even though you know how it may be formed, you cannot parse it since there may be nested elements. If you could guarantee there would be no nested tables, it may be feasible to do this, but even then, using a parser would be much simpler.
Plea to the theoretical computer scientists: I did my best to explain what I know from the CS Theory classes I've taken in a way that most people here should be able to understand. I know that regular languages can "count" finite numbers of things. Feel free to correct me, but please be kind!
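
The counting that this answer describes can be done in a few lines of ordinary code: a regex only locates the individual <table> and </table> tags, and a depth counter decides which closing tag ends the outermost table. The following is a rough sketch, not a replacement for a parser. It assumes well-formed nesting and ignores comments, scripts, and '>' characters inside attribute values; TableExtractor and OuterTables are illustrative names.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper that extracts top-level tables by counting tag depth.
public static class TableExtractor
{
    // Matches a single opening or closing table tag; "close" is set for </table>.
    private static readonly Regex TableTag = new Regex(
        @"<(?<close>/)?table\b[^>]*>", RegexOptions.IgnoreCase);

    public static List<string> OuterTables(string html)
    {
        var tables = new List<string>();
        int depth = 0;   // how many tables are currently open
        int start = -1;  // index where the current outermost table begins

        foreach (Match tag in TableTag.Matches(html))
        {
            if (!tag.Groups["close"].Success)
            {
                if (depth == 0) start = tag.Index;
                depth++;
            }
            else if (depth > 0)
            {
                depth--;
                if (depth == 0)
                {
                    // The outermost table has just been closed.
                    int end = tag.Index + tag.Length;
                    tables.Add(html.Substring(start, end - start));
                }
            }
        }
        return tables;
    }
}
```

Nested tables are returned as part of their enclosing table, so each list entry starts at an outer start tag and ends at its own end tag.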

Regular expressions are not really suited to this, because what you're trying to do depends on knowing that this is a nested language. Without that knowledge, extracting this information will be really hard, and the result will be hard to read and maintain.
Maybe do something with an XPath navigator?

How to strip all tags from wikipedia pages or make page more readable

I want to strip all tags and remove the [show][Hide] stuff from Wikipedia pages. Alternatively, is there some website that presents pages in a more readable format?
I am aware of the Wikipedia printable version, but I don't want any tags in the output, as I have another use for it. So please answer the original question only: is there any website, web service, or code snippet in PHP/C# that removes the tags from a web page?
Also, when I copy a list from Firefox, it replaces <li> with *. Is it possible to configure Firefox to use some other character instead, like some kind of dot?

You can start by taking a look at the strip_tags function.

You could use an HTML parser, BeautifulSoup (Python) or Simple HTML DOM for example. Or you could try using an XML parser.

I want to strip all tags, remove the
[show][Hide] stuffs from wikipedia, or
is there some website that makes pages
in more readable format.
You should take a look at DBpedia: it is Wikipedia, but just the data.
http://dbpedia.org/About

What about HtmlAgilityPack?
A similar thread is available on Stack Overflow:
Is there a Wikipedia API?
Try this function.
Dim pattern As String = "<(.|\n)*?>"
Return System.Text.RegularExpressions.Regex.Replace(strHtmlString, pattern, String.Empty).Trim()
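
Since the question asks for C#, the same idea could be written roughly as follows. This is a sketch with an invented wrapper (TagStripper.StripTags); the pattern <[^>]*> is swapped in for "<(.|\n)*?>" because both stop at the first '>' after a '<'. It removes only the tags themselves, so script or style contents and HTML entities are left in the text.

```csharp
using System.Text.RegularExpressions;

// Hypothetical C# counterpart of the VB snippet.
public static class TagStripper
{
    // Deletes everything from a '<' up to the next '>' and trims the result.
    public static string StripTags(string html)
    {
        return Regex.Replace(html, @"<[^>]*>", string.Empty).Trim();
    }
}
```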

What is the best way to search through HTML in a C# string for specific text and mark the text?

What would be the best way to search through HTML inside a C# string variable to find a specific word/phrase and mark (or wrap) that word/phrase with a highlight?
Thanks,
Jeff

I like using Html Agility Pack. It is very easy to use, and although there haven't been many updates lately, it is still usable. For example, grabbing all the links:
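// Requires the HtmlAgilityPack package: using HtmlAgilityPack;
HtmlWeb client = new HtmlWeb();
HtmlDocument doc = client.Load("http://yoururl.com");
// XPath attribute tests use '@', so "//a[@href]" selects anchors that have an href.
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@href]");
if (nodes != null) // SelectNodes returns null when nothing matches
{
    foreach (var link in nodes)
    {
        Console.WriteLine(link.Attributes["href"].Value);
    }
}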

Regular Expression would be my way. ;)

If the HTML you're using is XHTML compliant, you could load it as an XML document and then use XPath/XSL. Long-winded, but kind of elegant?
An approach I used in the past is to use HTMLTidy to convert messy HTML to XHTML, and then use XSL/XPath for screen scraping content into a database, to create a reverse content management system.
Regular expressions would do it, but could be complicated once you try stripping out tags, image names etc, to remove false positives.

In simple cases, regular expressions will do.
string input = "ttttttgottttttt";
string output = Regex.Replace(input, "go", "<strong>$0</strong>");
will yield: "tttttt<strong>go</strong>ttttttt"
But when you say HTML, if you're referring to final text rendered, that's a bit of a mess. Say you've got this HTML:
<span class="firstLetter">B</span>ook
To highlight the word 'Book', you would need the help of a proper HTML renderer. To simplify, one can first remove all tags and leave only contents, and then do the usual replace, but it doesn't feel right.

You could look at using Html DOM, an open source project on SourceForge.net.
This way you could programmatically manipulate your text instead of relying on regular expressions.

Searching for strings, you'll want to look up regular expressions. As for marking it, once you have the position of the substring it should be simple enough to use that to add in something to wrap around the phrase.
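
One way to keep a plain regex replace from touching tag names and attribute values is to match whole tags as one alternative and the phrase as the other, wrapping only the phrase. The sketch below uses invented names (Highlighter, Highlight) and picks <strong> as the wrapper, following the earlier example. It does not find words that are split across tags, such as the <span class="firstLetter">B</span>ook case, which still needs a parser or renderer.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper that highlights a phrase only in text outside of tags.
public static class Highlighter
{
    public static string Highlight(string html, string phrase)
    {
        if (string.IsNullOrEmpty(phrase)) return html;

        // Either a complete tag (copied through unchanged) or the escaped phrase.
        string pattern = @"(?<tag><[^>]*>)|(?<hit>" + Regex.Escape(phrase) + ")";

        return Regex.Replace(
            html,
            pattern,
            m => m.Groups["hit"].Success
                ? "<strong>" + m.Value + "</strong>"
                : m.Value,
            RegexOptions.IgnoreCase);
    }
}
```

Because tags are consumed as whole matches, an occurrence of the phrase inside an attribute, for example class="go", is never wrapped.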
