Extract all images from html string using Regex

I'm trying to use Regex to extract all image sources from an HTML string. For a couple of reasons I cannot use HTML Agility Pack.
I need to extract 'gfx/image.jpg' from strings that look like this:
<table cellpadding="0" cellspacing="0" border="0" style="height:350px; margin:0; background: url('gfx/image.jpg') no-repeat;">
<table cellpadding="0" cellspacing="0" border="0" background="gfx/image.jpg" style=" width:700px; height:250px; "><tr><td valign="middle">

You can use this regex: (['"])([^'"]+\.jpg)\1
Then read Groups[2] from each match. This code works fine:
var str = @"<table cellpadding=""0"" cellspacing=""0"" border=""0"" style=""height:350px; margin:0; background: url('gfx/image.jpg') no-repeat;"">
<table cellpadding=""0"" cellspacing=""0"" border=""0"" background=""gfx/image.jpg"" style="" width:700px; height:250px; ""><tr><td valign=""middle"">";
var regex = new Regex(@"(['""])([^'""]+\.jpg)\1");
var match = regex.Match(str);
while (match.Success)
{
    Console.WriteLine(match.Groups[2].Value);
    match = match.NextMatch();
}
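
If the images are not all .jpg files, the same idea can be widened. The following is only a sketch: the class name ImagePathExtractor and the extension list (jpg, jpeg, png, gif) are assumptions, not something from the question.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ImagePathExtractor
{
    // Same shape as the pattern above, with the extension list widened (an assumption).
    // Group 1 is the opening quote, group 2 the path; \1 requires the matching closing quote.
    private static readonly Regex QuotedImage = new Regex(
        @"(['""])([^'""]+\.(?:jpe?g|png|gif))\1",
        RegexOptions.IgnoreCase);

    // Returns every quoted value that ends in one of the listed extensions.
    public static List<string> Extract(string html)
    {
        return QuotedImage.Matches(html)
            .Cast<Match>()
            .Select(m => m.Groups[2].Value)
            .ToList();
    }
}
```

Calling ImagePathExtractor.Extract(str) on the sample string above should return the two gfx/image.jpg paths. Like the original pattern, this picks up any quoted value with an image extension, so it can also match strings that are not actually image sources.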

Related

how to place C# code/syntax within concatenated string

I have a string that I am concatenating in order to generate a PDF using C# and iTextSharp. I have some values from my model, such as paymentId, that I would also like to display on the PDF.
The PDF is generated successfully until I try to add the values from my model, e.g. "onlineTransactionViewModel.OnlineTransaction.PaymentId"
var example_html = @"<html>
<body style = 'font-family: Helvetica Neue, Helvetica, Helvetica, Arial, sans-serif; text-align: center; color: #777'>;
<div class='invoice- box'>
<table>
<tr class='top'>
<td colspan=""5"">
<table>
<tr>
<td colspan= ""3"">'onlineTransactionViewModel.OnlineTransaction.PaymentId' ""</br>"" ""onlineTransactionViewModel.OnlineTransaction.PayFastReference"" ""</br>"" ""onlineTransactionViewModel.OnlineTransaction.PayFastReference""
<td class= ""title"" style = 'text - align:right'></ td >
</ tr >
</ table >
</ table >
</ div >
</ body >
</ html>";
using (var htmlWorker = new iTextSharp.text.html.simpleparser.HTMLWorker(doc))
{
//HTMLWorker doesn't read a string directly but instead needs a TextReader (which StringReader subclasses)
using (var sr = new StringReader(example_html))
{
//Parse the HTML
htmlWorker.Parse(sr);
}
}
doc.Close();

You need to use string interpolation.
Since C# 6, string interpolation can be combined with verbatim string literals by prefixing the string with "$@".
When using string interpolation, wrap each expression you want evaluated in "{" and "}". Any literal "{" or "}" in the text must then be doubled ("{{" and "}}").
If you update your string to the following, you should get the intended result:
var example_html = $@"<html>
<body style = 'font-family: Helvetica Neue, Helvetica, Helvetica, Arial, sans-serif; text-align: center; color: #777'>;
<div class='invoice- box'>
<table>
<tr class='top'>
<td colspan=""5"">
<table>
<tr>
<td colspan= ""3"">'{onlineTransactionViewModel.OnlineTransaction.PaymentId}' ""</br>"" ""{onlineTransactionViewModel.OnlineTransaction.PayFastReference}"" ""</br>"" ""{onlineTransactionViewModel.OnlineTransaction.PayFastReference}""
<td class= ""title"" style = 'text - align:right'></ td >
</ tr >
</ table >
</ table >
</ div >
</ body >
</ html>";
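
As a smaller illustration of the same technique, here is a sketch with a trimmed-down template. The class InvoiceHtml, the method Build and its parameters are hypothetical stand-ins for the view model properties in the question, and the style block is only there to show brace escaping.

```csharp
public static class InvoiceHtml
{
    // Hypothetical helper: paymentId and reference stand in for the
    // model properties used in the question.
    public static string Build(int paymentId, string reference)
    {
        // $@ = interpolated verbatim string:
        //   ""   is a literal double quote
        //   {{ }} are literal braces (needed for the CSS rule)
        //   {x}  is replaced with the value of x
        return $@"<html>
<head><style>td {{ text-align: right; }}</style></head>
<body>
<table>
<tr><td colspan=""3"">{paymentId}<br/>{reference}</td></tr>
</table>
</body>
</html>";
    }
}
```

The result of InvoiceHtml.Build(...) can then be handed to a StringReader in the same way as example_html in the question.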

Regex Capturing Group within a Capture Group

I am searching through a database to find span tags with video information for the purpose of migration.
My regex works well and I can extract most of the information I need. The trouble I run into is when the style attribute is in a different position than expected. This throws off the expression and results in only about two thirds of the captures I would expect.
If I try to nest the style capture group inside the main capture group, it fails to capture anything. I also tried negative and positive lookaheads, but it only ever works if I make it an optional capture group. I think the problem is that I'm not nesting it correctly. Most of the related questions give the answer of a negative lookbehind, but my understanding is that's more of an assertion/quantifier.
So how can I always capture the style attribute regardless of its position in the span tag?
Regex flavor is .NET (server side)
I have a Regexr setup
/(?<tag><span class='vidly-vid' data-thumb='(?<thumb>http.+\.jpg)'.+aspect-ratio='(?<aspect>\d{1,3}:\d{1,3})'.+sources='\[{"file":.+"(?<src>(?<uri>https:\/\/cf1234.cloudfront\.net\/Vids\/)(?<key>(?<ident>[0-9a-fA-F]{8}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{12}|[a-z0-9]{6})\/(?<mp4>mp4_1080.mp4|mp4_720.mp4|mp4_480.mp4|mp4_360.mp4|mp4.mp4))).+style='(?<style>.+width: (?<width>.+)px.+height: (?<height>.+)px.+)'.+<\/span>)/gmi
Sample Data
All of these should match. The first one does NOT, the other three do.
<span class='vidly-vid' data-thumb='https://cf1234.cloudfront.net/Vids/Thumbnails/691DBB43-5EC8-4D57-AF7B-99896D9BD5D1_19127.jpg' data-aspect-ratio='4:3' style='border-width: 0px; width: 352px; height: 240px;' data-sources='[{"file":"https://cf1234.cloudfront.net/Vids/6v1j0a/hls.m3u8","label":"HD"},{"file":"https://cf1234.cloudfront.net/Vids/6v1j0a/mp4_360.mp4","label":"360p SD"}]'> </span>
<span class='vidly-vid' data-thumb='https://cf1234.cloudfront.net/Vids/Thumbnails/b181cfa5-565d-470a-b93a-2610987bb4da_28142.jpg' data-aspect-ratio='160:117' data-sources='[{"file":"https://cf1234.cloudfront.net/Vids/b181cfa5-565d-470a-b93a-2610987bb4da/hls.m3u8","label":"HD"},{"file":"https://cf1234.cloudfront.net/Vids/b181cfa5-565d-470a-b93a-2610987bb4da/mp4_480.mp4","label":"480p SD"},{"file":"https://cf1234.cloudfront.net/Vids/b181cfa5-565d-470a-b93a-2610987bb4da/mp4_360.mp4","label":"360p SD"},{"file":"https://cf1234.cloudfront.net/Vids/b181cfa5-565d-470a-b93a-2610987bb4da/mp4_720.mp4","label":"720p HD"},{"file":"https://cf1234.cloudfront.net/Vids/b181cfa5-565d-470a-b93a-2610987bb4da/mp4_1080.mp4","label":"1080p HD"}]' style='border-width: 0px; width: 600px; height: 480px;'> </span>
<table align="left" border="0" cellpadding="5" cellspacing="5" style="width:600px"> <tbody> <tr> <td><img alt="" src="/content/generator/Course_90016206/Case-10-LMLO_MG_FLAVOR1label.jpg" style="height:497px; width:324px" /></td> <td><span class='vidly-vid' data-thumb='https://cf1234.cloudfront.net/Vids/Thumbnails/b2a7cbd3-5d31-49a5-bf89-aef0cf9f7414_28142.jpg' data-aspect-ratio='146:225' data-sources='[{"file":"https://cf1234.cloudfront.net/Vids/b2a7cbd3-5d31-49a5-bf89-aef0cf9f7414/hls.m3u8","label":"HD"},{"file":"https://cf1234.cloudfront.net/Vids/b2a7cbd3-5d31-49a5-bf89-aef0cf9f7414/mp4_480.mp4","label":"480p SD"},{"file":"https://cf1234.cloudfront.net/Vids/b2a7cbd3-5d31-49a5-bf89-aef0cf9f7414/mp4_360.mp4","label":"360p SD"},{"file":"https://cf1234.cloudfront.net/Vids/b2a7cbd3-5d31-49a5-bf89-aef0cf9f7414/mp4_720.mp4","label":"720p HD"},{"file":"https://cf1234.cloudfront.net/Vids/b2a7cbd3-5d31-49a5-bf89-aef0cf9f7414/mp4_1080.mp4","label":"1080p HD"}]' style='border-width: 0px; width: 324px; height: 500px;'> </span></td> </tr> </tbody> </table>
<span class='vidly-vid' data-thumb='https://cf1234.cloudfront.net/Vids/Thumbnails/231913a7-b608-4d8b-9332-64b6840c22f0_28142.jpg' data-aspect-ratio='16:9' data-sources='[{"file":"https://cf1234.cloudfront.net/Vids/231913a7-b608-4d8b-9332-64b6840c22f0/hls.m3u8","label":"HD"},{"file":"https://cf1234.cloudfront.net/Vids/231913a7-b608-4d8b-9332-64b6840c22f0/mp4_480.mp4","label":"480p SD"},{"file":"https://cf1234.cloudfront.net/Vids/231913a7-b608-4d8b-9332-64b6840c22f0/mp4_360.mp4","label":"360p SD"},{"file":"https://cf1234.cloudfront.net/Vids/231913a7-b608-4d8b-9332-64b6840c22f0/mp4_720.mp4","label":"720p HD"},{"file":"https://cf1234.cloudfront.net/Vids/231913a7-b608-4d8b-9332-64b6840c22f0/mp4_1080.mp4","label":"1080p HD"}]' style='border-width: 0px; width: 920px; height: 520px;'> </span>

I'd personally just split up the regex into more manageable chunks, like so:
// inputs is assumed to be the collection of HTML strings to scan
var spanRegex = new Regex(@"<span class='vidly-vid'.+<\/span>");
var attrRegexes = new[]{
        @"data-thumb='(?<thumb>http.+\.jpg)'",
        @"aspect-ratio='(?<aspect>\d{1,3}:\d{1,3})'",
        @"sources='\[{""file"":.+""(?<src>(?<uri>https:\/\/cf1234.cloudfront\.net\/Vids\/)(?<key>(?<ident>[0-9a-fA-F]{8}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{12}|[a-z0-9]{6})\/(?<mp4>mp4_1080.mp4|mp4_720.mp4|mp4_480.mp4|mp4_360.mp4|mp4.mp4)))",
        @"style='(?<style>.+width: (?<width>.+)px.+height: (?<height>.+)px.+)'",
    }
    .Select(r => new Regex(r))
    .ToList();
var results = inputs.Select(i => spanRegex.Match(i).Value)
    .Select(i => new
    {
        i,
        attributes =
            from r in attrRegexes
            let match = r.Match(i)
            from g in match.Groups.Cast<Group>().Skip(1)
            select new {g.Name, capture = g.Value}
    });
Linqpad example
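
If you would rather keep a single expression, another option is to put each attribute in its own positive lookahead, since .NET keeps groups captured inside lookaheads. This is only a sketch under assumptions: the class VidlySpanParser is hypothetical, attribute values are taken to be single-quoted with no '>' inside them, and only aspect ratio, width and height are extracted to keep it short.

```csharp
using System.Text.RegularExpressions;

public static class VidlySpanParser
{
    // Each lookahead scans the opening tag from the same position,
    // so the order of the attributes does not matter.
    private static readonly Regex SpanRegex = new Regex(
        @"<span\s+class='vidly-vid'" +
        @"(?=[^>]*\sdata-aspect-ratio='(?<aspect>\d{1,3}:\d{1,3})')" +
        @"(?=[^>]*\sstyle='(?<style>[^']*)')" +
        @"[^>]*>",
        RegexOptions.IgnoreCase);

    // The lookbehind keeps "border-width" from being read as "width".
    private static readonly Regex WidthRegex = new Regex(@"(?<![\w-])width:\s*(?<width>\d+)px");
    private static readonly Regex HeightRegex = new Regex(@"(?<![\w-])height:\s*(?<height>\d+)px");

    // Returns null when the span or its width/height cannot be found.
    public static (string Aspect, int Width, int Height)? Parse(string html)
    {
        Match span = SpanRegex.Match(html);
        if (!span.Success) return null;

        string style = span.Groups["style"].Value;
        Match w = WidthRegex.Match(style);
        Match h = HeightRegex.Match(style);
        if (!w.Success || !h.Success) return null;

        return (span.Groups["aspect"].Value,
                int.Parse(w.Groups["width"].Value),
                int.Parse(h.Groups["height"].Value));
    }
}
```

With this shape, a span whose style comes before data-sources and one whose style comes last are handled the same way. The thumb and source groups from the question could be added as further lookaheads.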

Format Exception thrown “input string was not in a correct format”

I am getting an error message when running a small bit of code, saying that the input string was not in a correct format. The issue comes up when parsing this HTML:
Mark up One
<td class="tdRow1Color" width="100%">
<table width="100%" cellpadding="0" cellspacing="0" border="0">
<tr><td class="plaintextbold">Item Number: 1258</td></tr>
<tr><td><img alt="" src="images/clear.gif" width="1" height="10" border="0"></td></tr>
<tr>
<td class="plaintext" valign="middle"> <img src="../images/0note.gif" border="0" align="absmiddle"> <a class="prodlink" href="writeReview.asp?number=1258"><i><u>Be the first to review this item</u></i></a></td>
</tr>
<tr><td><img alt="" src="images/clear.gif" width="1" height="10" border="0"></td></tr>
<tr><td class="plaintext"><b>RRP £50.00 - Now £39.99</b> </td>
Mark up Two
<tr><td class="tdRow1Color" width="100%">
<table width="100%" cellpadding="0" cellspacing="0" border="0">
<tr><td class="plaintextbold">Item Number: 2525</td></tr>
<tr><td><img alt="" src="images/clear.gif" width="1" height="10" border="0"></td></tr>
<tr>
<td class="plaintext" valign="middle"> <img src="../images/0note.gif" border="0" align="absmiddle"> <a class="prodlink" href="writeReview.asp?number=2525"><i><u>Be the first to review this item</u></i></a></td>
</tr>
<tr><td><img alt="" src="images/clear.gif" width="1" height="10" border="0"></td></tr>
<tr><td class="plaintext">RRP £45 - Now £38
I am extracting the RRP price with this regex:
private Regex _originalPriceRegex = new Regex(@"RRP \s(\d+\.?\d+?)");
and picking up the RRP prices through this XPath:
ProductProperties.priceOriginal, new HtmlElementLocator("//td[@class='tdRow1Color']//td[@class='plaintext']//text()[starts-with(., 'RRP')]",
The issue seems to arise when the XPath value is passed into the function below. The exception is thrown when it returns priceMatch.Groups[1].Value
private string LookForOrignalPrice(HtmlNode node)
{
string text = node.InnerText;
Match priceMatch = _originalPriceRegex.Match(text);
if (priceMatch.Success)
Console.WriteLine("++++++price is " + priceMatch);
return priceMatch.Groups[1].Value;
return null;
}
Thanks for any advice which you can give.

When using if, it's good practice to always add braces; otherwise you can fall into errors like this one.
Here, without braces the if guards only the Console.WriteLine call, so priceMatch.Groups[1].Value is returned even when priceMatch.Success is false. In that case the value is an empty string, and return null is never reached.
So you should change your code like this:
private string LookForOrignalPrice(HtmlNode node)
{
    string text = node.InnerText;
    Match priceMatch = _originalPriceRegex.Match(text);
    if (priceMatch.Success) // braces group both statements under the if
    {
        Console.WriteLine("++++++price is " + priceMatch);
        return priceMatch.Groups[1].Value;
    }
    return null;
}
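
Separately from the braces, the pattern in the question expects whitespace directly before the digits, while the sample markup shows a £ sign there, so it may never match. Below is a sketch of a more forgiving version. It assumes the node text looks like "RRP £50.00 - Now £39.99" or "RRP £45 - Now £38"; the class PriceParser and the method TryGetOriginalPrice are hypothetical names, and the pound sign is written as \u00A3.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

public static class PriceParser
{
    // Assumption: "RRP", an optional pound sign (\u00A3), then digits
    // with an optional decimal part, e.g. "RRP £50.00" or "RRP £45".
    private static readonly Regex OriginalPriceRegex =
        new Regex(@"RRP\s*\u00A3?\s*(\d+(?:\.\d+)?)");

    // Returns null instead of throwing when no price is found or it cannot be parsed.
    public static decimal? TryGetOriginalPrice(string text)
    {
        Match m = OriginalPriceRegex.Match(text ?? string.Empty);
        if (!m.Success)
        {
            return null;
        }

        return decimal.TryParse(m.Groups[1].Value, NumberStyles.AllowDecimalPoint,
                                CultureInfo.InvariantCulture, out decimal price)
            ? price
            : (decimal?)null;
    }
}
```

Using TryParse with the invariant culture avoids a format exception on an empty or unexpected string and does not depend on the machine's regional settings.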

C# REGEX need to be corrected

I'm writing a C# regex to find and replace patterns in HTML content.
I need to take markup like this:
<table border=0 align=center id=mytable5>
and correct it to this:
<table border="0" align="center" id="mytable5">
I tried this:
String pattern = @"\s(?<element>[a-z])=(?<valeur>\d+?[a-z])\s?[\>]";
String replacePattern = "${element}=[\"]${valeur}[\"]";
html = Regex.Replace(html, pattern, replacePattern, RegexOptions.IgnoreCase);
but it has absolutely no effect.
Any help would be greatly appreciated.
Thank you all.
Actually, King King, there is a problem with your regex:
<table border=0 align="center" id="mytable5">
will give
<table border="0" align=""center"" id=""mytable5"">
That's why the regex must check for this:
[a space][a-z]=[a-z0-9][a space or '>']

var html = "<table border=0 align=center id=mytable5>";
html = Regex.Replace(html, @"=\s*(\S+?)([ >])", "=\"${1}\"${2}", RegexOptions.IgnoreCase);

I got it:
String pattern = @"([a-z]+)=([a-z0-9_-]+)([ >])";
String replacePattern = "${1}=\"${2}\"${3}";
html = Regex.Replace(html, pattern, replacePattern, RegexOptions.IgnoreCase);
This takes
<table border=0 align="center" id="mytable5">
and corrects it to this:
<table border="0" align="center" id="mytable5">
Thanks to King King, who showed me the path.
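
For reference, here is the same idea as a small helper, written as a sketch: the class AttributeQuoter is a hypothetical name, and a lookahead is used in place of the third group so the trailing space or '>' is not consumed. It assumes unquoted values contain only letters, digits, '_' and '-'.

```csharp
using System.Text.RegularExpressions;

public static class AttributeQuoter
{
    // name=value where value is unquoted and followed by whitespace or '>'.
    // Values that already start with a quote do not match, so they stay untouched.
    private static readonly Regex Unquoted = new Regex(
        @"(?<name>[a-z][a-z0-9-]*)=(?<value>[a-z0-9_-]+)(?=[\s>])",
        RegexOptions.IgnoreCase);

    public static string Quote(string html)
    {
        return Unquoted.Replace(html, "${name}=\"${value}\"");
    }
}
```

Because quoted values are skipped, running Quote twice gives the same result as running it once. It still works on the raw text, so something like x=1 in ordinary page text or inside a quoted value containing spaces would also be rewritten.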

Getting text inside table

I have a table like the one below, and I want to get just the text FOO COMPANY from between the td tags. How can I get it?
<table class="left_company">
<tr>
<td style="BORDER-RIGHT: medium none; bordercolor="#FF0000" align="left" width="291" bgcolor="#FF0000">
<table cellspacing="0" cellpadding="0" width="103%" border="0">
<tr style="CURSOR: hand" onclick="window.open('http://www.foo.com')">
<td class="title_post" title="FOO" valign="center" align="left" colspan="2">
<font style="font-weight: 700" face="Tahoma" color="#FFFFFF" size="2">***FOO COMPANY***</font>
</td>
</tr>
</table>
</td>
</tr>
<table>
I'm using the following code, but nS is null.
doc = hw.Load("http://www.foo.aspx?page=" + j);
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//table[@class='left_company']"))
{
    nS = doc.DocumentNode.SelectNodes("//td[@class='title_post']");
}

var text = doc.DocumentNode.Descendants()
    .FirstOrDefault(n => n.Attributes["class"] != null &&
                         n.Attributes["class"].Value == "title_post")
    .Element("font").InnerText;
or
var text2 = doc.DocumentNode.SelectNodes("//td[@class='title_post']/font")
    .First().InnerText;

Most likely the page you are calling generates the content of interest using JavaScript. HtmlAgilityPack does not execute JavaScript, so that content cannot be extracted. One way to confirm this is to visit the page with scripting turned off and check whether the element you are interested in still exists.

Add an attribute to the font element, such as company="FOO", then use jQuery to get that element:
alert($('font[company="FOO"]').html())

Close: nS = doc.DocumentNode.SelectNodes("//td[@class='title_post']//text()");
You can then open the nS node to retrieve the text. If there's more than one text node, you'll need to iterate over them.
