I want to download an HTML source, search it for the username and some other information, and then display this in my program.
I'm pretty new to programming, and a complete noob when it comes to things like this (Regex), so I hope you can explain it to me.
I have used Regex before to extract a K/D ratio from an HTML source; for that I used this code:
string pattern = @"<span class=""kdratio"">\d+\.\d+";
But I have no idea how to start on this one...
This is the line of the source that contains the information:
<section class="profile-header" profile="true" motto="user's motto" user="User" figure="hr-3322-45.hd-190-1.ch-3342-64-66.lg-285-64.sh-3068-82-66.ea-1404-64">
I only need the parts user="User" and figure="x". I couldn't try anything because I really don't know how to start; the HTML line looks so different from what I have experience with.
Regular expressions are not a good idea for matching HTML unless it's very simple, single-tag matching. See here: RegEx match open tags except XHTML self-contained tags
I recommend using an HTML DOM-parsing library with XPath or CSS selectors to get the information you want. For .NET, HtmlAgilityPack is recommended. For CSS selectors you'll want Fizzler (an add-on for HtmlAgilityPack).
In JavaScript (easily rewritten to C# and HtmlAgilityPack) it would be this:
var section = document.querySelector(
"section[class=profile-header][profile=true]"
);
section.getAttribute("user"); // likewise section.getAttribute("figure")
HtmlAgilityPack: http://html-agility-pack.net
Fizzler: https://www.nuget.org/packages/Fizzler.Systems.HtmlAgilityPack/
Generally, Regex is not a good choice for parsing HTML! HTML tends to be complicated, and it is very hard to write a single Regex that matches everything. Instead, use a parser like Html Agility Pack.
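As a rough illustration of the parser approach for the profile-header line in the question, here is a minimal sketch. It assumes the HtmlAgilityPack NuGet package is installed; ReadProfile is a hypothetical helper name, and the empty section body is made up for the example.

```csharp
using System;
using HtmlAgilityPack; // NuGet package, assumed to be installed

// Hypothetical helper: reads the user and figure attributes from the
// first <section class="profile-header"> element, or returns null if none exists.
static (string User, string Figure)? ReadProfile(string html)
{
    var doc = new HtmlDocument();
    doc.LoadHtml(html);

    // XPath: any section element whose class attribute is exactly "profile-header"
    var section = doc.DocumentNode.SelectSingleNode("//section[@class='profile-header']");
    if (section == null)
    {
        return null;
    }

    // GetAttributeValue returns the supplied default when the attribute is missing
    return (section.GetAttributeValue("user", ""), section.GetAttributeValue("figure", ""));
}

string html =
    "<section class=\"profile-header\" profile=\"true\" motto=\"user's motto\" user=\"User\" " +
    "figure=\"hr-3322-45.hd-190-1.ch-3342-64-66.lg-285-64.sh-3068-82-66.ea-1404-64\"></section>";

var profile = ReadProfile(html);
Console.WriteLine(profile?.User);
Console.WriteLine(profile?.Figure);
```

Because the lookup goes by element and attribute name, the order of the attributes and the apostrophe inside the motto value do not matter here.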
Related
There is a sample html code like below:
<div><span>span1</span></div>
<b>for test</b>
<span>span2</span>
Is there any way to get all span tags that are not inside div tags (in this sample: span2)?
According to this post, C# Regular Expression excluding a string, this is my pattern, but it does not work.
pattern: ((?:(?!\b<div>\b))*)((.|\n)*?)<span>((.|\n)*?)</span>((.|\n)*?)((?:(?!\b</div>\b))*)
You really don't want to be using regular expressions to try to parse HTML. You can read more about the many reasons on this Stack Overflow question:
RegEx match open tags except XHTML self-contained tags
You should use an HTML parser like Html Agility Pack, or even a simple XML parser like XmlReader.
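For the small sample in the question, the XML route could look something like the sketch below. It uses only the .NET base class library and assumes the markup is well-formed XML, which real-world HTML often is not; SpansOutsideDiv is a hypothetical helper name.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

// Hypothetical helper: returns the text of every <span> that has no <div> ancestor.
static string[] SpansOutsideDiv(string fragment)
{
    // The sample has several top-level elements, so wrap it in a single root
    // element to make it a valid XML document (assumes well-formed markup).
    XDocument doc = XDocument.Parse("<root>" + fragment + "</root>");

    return doc.XPathSelectElements("//span[not(ancestor::div)]")
              .Select(span => span.Value)
              .ToArray();
}

string sample = "<div><span>span1</span></div>\n<b>for test</b>\n<span>span2</span>";

foreach (string text in SpansOutsideDiv(sample))
{
    Console.WriteLine(text); // prints "span2"
}
```

The same XPath expression should also work with Html Agility Pack's SelectNodes when the input is not well-formed.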
I need a regular expression that removes the <a> tag from a link whose text is Name, so that only the string "Name" is output. I am using C#.NET.
Any help is appreciated.
This will do a pretty good job:
str = Regex.Replace(str, @"<a\b[^>]+>([^<]*(?:(?!</a)<[^<]*)*)</a>", "$1");
You should be looking at Html Agility Pack. RegEx works in almost all cases, but it fails on some basic cases and on broken HTML. The grammar of HTML is not regular, so no single regex can cover everything, while Html Agility Pack keeps working in those cases.
If you only need a one-off fix for this particular anchor tag case, any of the regexes above would work for you, but Html Agility Pack is the solid long-term solution for stripping HTML tags.
Ref: Using C# regular expressions to remove HTML tags
You can try to use this one. It has not been tested under all conditions, but it will return the correct value from your example.
\<[^\>]+\>(.[^\<]+)</[^\>]+\>
Here's a version that will work for only <a> tags.
\<a\s[^\>]+\>(.[^\<]+)</a\>
I tested it on the following HTML and it returned Name and Value only.
Name<label>This is a label</label> Value
I agree with Priyank that using a parser is the safer bet. If you do go the route of using a regex, consider how you want to handle edge cases. It's easy to transform the simple case you mentioned in your question, and if that is indeed the only form the markup will take, a simple regex can handle it. But if the markup is, for example, user generated or from a third-party source, consider cases such as these:
<a>foo</a> --> foo # a bare anchor tag, with no attributes
# the regexes listed above wouldn't handle this
<b>boldness</b> --> <b>boldness</b>
# stripping out only the anchor tag
<A onClick="javascript:alert('foo')">Upper\ncase</A> --> Upper\ncase
# and obviously the regex should be case insensitive and
# apply to the entire string, not just one line at a time.
<b>bold</b>bar --> <b>bold</b>bar
# cases such as this tend to break a lot of regexes,
# if the markup in question is user generated, you're leaving
# yourself open to the risk of XSS
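To make a few of those edge cases concrete, here is a minimal regex sketch using only System.Text.RegularExpressions. StripAnchors is a hypothetical helper name and the inputs are illustrative. It removes opening and closing anchor tags and keeps everything between them; it is not a sanitizer and does nothing about the XSS concern above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: removes <a ...> and </a> tags, leaving their content in place.
// - "/?" covers both opening and closing tags
// - "\b" stops the pattern from matching other tags that start with "a", such as <abbr>
// - "[^>]*" allows zero attributes (a bare <a>) and also spans line breaks
// - IgnoreCase handles <A ...> as well
static string StripAnchors(string html) =>
    Regex.Replace(html, @"</?a\b[^>]*>", "", RegexOptions.IgnoreCase);

Console.WriteLine(StripAnchors("<a>foo</a>"));                       // foo
Console.WriteLine(StripAnchors("<b>boldness</b>"));                  // <b>boldness</b>
Console.WriteLine(StripAnchors("<A onClick=\"x()\">Upper</A>"));     // Upper
Console.WriteLine(StripAnchors("<a href=\"u\"><b>bold</b>bar</a>")); // <b>bold</b>bar
Console.WriteLine(StripAnchors("<abbr>HTML</abbr>"));                // <abbr>HTML</abbr>
```

One known limitation of this sketch: an attribute value that contains a literal > character ends the match early, which is one more reason a parser is the safer choice for untrusted markup.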
Following is working for me.
Regex.Replace(inputvalue, "\<[\/]*a[^\>]*\>", "")
I want to strip all tags and remove the [show][Hide] stuff from Wikipedia pages, or is there some website that presents pages in a more readable format?
I am aware of the Wikipedia printable version, but I don't want any tags in the output, as I have another use for it. So please answer the original question only: is there any website, web service, or code snippet in PHP/C# that removes the tags from a web page?
Also, when I copy a list from Firefox it replaces <li> with *. Is it possible to set something in Firefox so that it uses some other non-readable character instead, like some kind of dot?
You can start by taking a look at the strip_tags function.
You could use an HTML parser, BeautifulSoup (Python) or Simple HTML DOM for example. Or you could try using an XML parser.
"I want to strip all tags, remove the [show][Hide] stuffs from wikipedia, or is there some website that makes pages in more readable format."
You should take a look at DBpedia: it's Wikipedia, but just the data.
http://dbpedia.org/About
What about HtmlAgilityPack?
A similar thread is available on Stack Overflow:
Is there a Wikipedia API?
Try this function.
Dim pattern As String = "<(.|\n)*?>"
Return System.Text.RegularExpressions.Regex.Replace(strHtmlString, pattern, String.Empty).Trim()
I am using the following regex to get the src value of the first img tag in an HTML document.
string match = "src=(?:\"|\')?(?<imgSrc>[^>]*[^/].(?:jpg|png))(?:\"|\')?"
Right now it captures the whole src attribute, which I don't need. I just need the URL inside the src attribute. How can I do that?
Parse your HTML with something else. HTML is not regular and thus regular expressions aren't at all suited to parsing it.
Use an HTML parser, or an XML parser if the HTML is strict. It's a lot easier to get the src attribute's value using XPath:
//img/@src
XML parsing is built into the System.Xml namespace. It's incredibly powerful. HTML parsing is a bit more difficult if the HTML isn't strict, but there are lots of libraries around that will do it for you.
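A minimal sketch of that XPath lookup with System.Xml might look like this. It assumes the page is strict, well-formed XHTML with no default namespace and no DTD; a page that declares the XHTML namespace would need an XmlNamespaceManager and a prefixed path. FirstImageSrc and the sample markup are made up for illustration.

```csharp
using System;
using System.Xml;

// Hypothetical helper: returns the src value of the first <img> in document order,
// or an empty string when there is none. Throws XmlException on malformed input.
static string FirstImageSrc(string xhtml)
{
    var doc = new XmlDocument();
    doc.LoadXml(xhtml);

    // "//img/@src" selects src attributes; SelectSingleNode returns the first one
    var src = doc.SelectSingleNode("//img/@src");
    return src?.Value ?? "";
}

string page =
    "<html><body><p>intro</p>" +
    "<img src=\"images/photo.jpg\" alt=\"first\" />" +
    "<img src=\"images/other.png\" alt=\"second\" />" +
    "</body></html>";

Console.WriteLine(FirstImageSrc(page)); // images/photo.jpg
```

The result is the attribute value alone, without the src= prefix or the quotes, which is what the question asks for.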
see When not to use Regex in C# (or Java, C++ etc) and Looking for C# HTML parser
PS, how can I put a link to a StackOverflow question in a comment?
Your regex should (in English) match any character after a quote that is not itself a quote, inside an img tag on the src attribute.
In perl regex, it would be like this:
/src=[\"\']([^\"\']+)/
The URL will be in $1 after running this.
Of course, this assumes that the urls in your src attributes are quoted. You can modify the values in the [] brackets accordingly if they are not.
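Since the question is about C#, here is a rough translation of that Perl pattern using System.Text.RegularExpressions. FirstQuotedSrc is a hypothetical helper name, and the same assumption applies: the src value must be quoted.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the first quoted src value, or "" when nothing matches.
static string FirstQuotedSrc(string html)
{
    // Same pattern as the Perl version: a quote after src=, then capture
    // everything up to the next single or double quote.
    Match m = Regex.Match(html, "src=[\"']([^\"']+)");

    // Groups[1] plays the role of Perl's $1
    return m.Success ? m.Groups[1].Value : "";
}

string html = "<p>text</p><img src=\"http://example.com/a.jpg\" alt=\"x\">";
Console.WriteLine(FirstQuotedSrc(html)); // http://example.com/a.jpg
```

Note that this matches the first src= anywhere in the string, so a script or iframe tag appearing before the image would be picked up instead.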
What would be the best way to search through HTML inside a C# string variable to find a specific word/phrase and mark (or wrap) that word/phrase with a highlight?
Thanks,
Jeff
I like using Html Agility Pack. It is very easy to use, and although there haven't been many updates lately, it is still usable. For example, grabbing all the links:
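using System;
using HtmlAgilityPack;
HtmlWeb client = new HtmlWeb();
HtmlDocument doc = client.Load("http://yoururl.com");
// Select every <a> element that has an href attribute
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@href]");
if (nodes != null) // SelectNodes returns null when nothing matches
{
    foreach (var link in nodes)
    {
        Console.WriteLine(link.Attributes["href"].Value);
    }
}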
Regular Expression would be my way. ;)
If the HTML you're using is XHTML compliant, you could load it as an XML document and then use XPath/XSL. Long-winded, but kind of elegant?
An approach I used in the past is to use HTMLTidy to convert messy HTML to XHTML, and then use XSL/XPath for screen scraping content into a database, to create a reverse content management system.
Regular expressions would do it, but could be complicated once you try stripping out tags, image names etc, to remove false positives.
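One way to cut down on those false positives, sketched here under simplifying assumptions, is to split the string on tags and run the replacement only on the text between them. Highlight is a hypothetical helper name. The sketch treats anything between < and > as a tag, does not special-case script or style content, and will not find a phrase that is broken up by tags.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: wraps every occurrence of phrase in <strong> tags,
// but only in the text between tags, never inside tag names or attributes.
static string Highlight(string html, string phrase)
{
    // Because the pattern is wrapped in a capture group, Regex.Split keeps the tags:
    // even indexes hold text, odd indexes hold the tags themselves.
    string[] parts = Regex.Split(html, @"(<[^>]*>)");
    string escaped = Regex.Escape(phrase); // treat the phrase as literal text

    for (int i = 0; i < parts.Length; i += 2)
    {
        parts[i] = Regex.Replace(parts[i], escaped, "<strong>$0</strong>", RegexOptions.IgnoreCase);
    }

    return string.Concat(parts);
}

Console.WriteLine(Highlight("<a href=\"go.html\">go</a> and Go", "go"));
// <a href="go.html"><strong>go</strong></a> and <strong>Go</strong>
```

The "go" inside the href attribute is left alone, while both occurrences in the visible text are wrapped with their original casing preserved.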
In simple cases, regular expressions will do.
string input = "ttttttgottttttt";
string output = Regex.Replace(input, "go", "<strong>$0</strong>");
will yield: "tttttt<strong>go</strong>ttttttt"
But when you say HTML, if you're referring to final text rendered, that's a bit of a mess. Say you've got this HTML:
<span class="firstLetter">B</span>ook
To highlight the word 'Book', you would need the help of a proper HTML renderer. To simplify, one can first remove all tags and leave only contents, and then do the usual replace, but it doesn't feel right.
You could look at using Html DOM, an open source project on SourceForge.net.
This way you could programmatically manipulate your text instead of relying on regular expressions.
Searching for strings, you'll want to look up regular expressions. As for marking it, once you have the position of the substring it should be simple enough to use that to add in something to wrap around the phrase.