C# regex needs to be corrected

I'm writing a C# regex to find and replace patterns in HTML content.
I need to take markup like this:
<table border=0 align=center id=mytable5>
and correct it to this:
<table border="0" align="center" id="mytable5">
I tried this:
String pattern = @"\s(?<element>[a-z])=(?<valeur>\d+?[a-z])\s?[\>]";
String replacePattern = "${element}=[\"]${valeur}[\"]";
html = Regex.Replace(html, pattern, replacePattern, RegexOptions.IgnoreCase);
but it has absolutely no effect.
Any help would be greatly appreciated.
Thank you all.
Actually, King King, there is a problem with your regex:
<table border=0 align="center" id="mytable5">
will give
<table border="0" align=""center"" id=""mytable5"">
That's why the regex must check for this:
[a space][a-z]=[a-z0-9][a space or '>']

var html = "<table border=0 align=center id=mytable5>";
html = Regex.Replace(html, @"=\s*(\S+?)([ >])", "=\"${1}\"${2}", RegexOptions.IgnoreCase);

I got it:
String pattern = @"([a-z]+)=([a-z0-9_-]+)([ >])";
String replacePattern = "${1}=\"${2}\"${3}";
html = Regex.Replace(html, pattern, replacePattern, RegexOptions.IgnoreCase);
This takes
<table border=0 align="center" id="mytable5">
and corrects it to this:
<table border="0" align="center" id="mytable5">
Thanks to King King for pointing me in the right direction.
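A variation on the pattern above, as a minimal sketch. It uses a lookahead instead of a third capture group, so the trailing space or '>' is not consumed, and it accepts any unquoted value that contains no whitespace, quotes, or '>'. The names AttributeQuoter and QuoteAttributes are made up for illustration. It can still misfire on text such as a=b inside an already quoted value, so treat it as a quick fix rather than an HTML parser.

```csharp
using System.Text.RegularExpressions;

public static class AttributeQuoter
{
    // name=value where the name follows whitespace, the value is unquoted,
    // and the value is followed by whitespace or '>' (checked, not consumed).
    private static readonly Regex Unquoted = new Regex(
        @"(?<=\s)(?<name>[a-z][a-z0-9_-]*)=(?<value>[^\s""'>]+)(?=[\s>])",
        RegexOptions.IgnoreCase);

    // Wraps every unquoted attribute value in double quotes.
    // Values that already start with a quote do not match and are left alone.
    public static string QuoteAttributes(string html)
    {
        return Unquoted.Replace(html, "${name}=\"${value}\"");
    }
}
```

Usage would look like html = AttributeQuoter.QuoteAttributes(html);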

Related

Extract all images from html string using Regex

I'm trying to use a regex to extract all image sources from an HTML string. For a couple of reasons I cannot use HTML Agility Pack.
I need to extract 'gfx/image.jpg' from strings that look like this:
<table cellpadding="0" cellspacing="0" border="0" style="height:350px; margin:0; background: url('gfx/image.jpg') no-repeat;">
<table cellpadding="0" cellspacing="0" border="0" background="gfx/image.jpg" style=" width:700px; height:250px; "><tr><td valign="middle">
You can use this regex: (['"])([^'"]+\.jpg)\1
Then read Groups[2]. This code works fine:
var str = @"<table cellpadding=""0"" cellspacing=""0"" border=""0"" style=""height:350px; margin:0; background: url('gfx/image.jpg') no-repeat;"">
<table cellpadding=""0"" cellspacing=""0"" border=""0"" background=""gfx/image.jpg"" style="" width:700px; height:250px; ""><tr><td valign=""middle"">";
var regex = new Regex(@"(['""])([^'""]+\.jpg)\1");
var match = regex.Match(str);
while (match.Success)
{
    Console.WriteLine(match.Groups[2].Value);
    match = match.NextMatch();
}

Regex Capturing Group within a Capture Group

I am searching through a database to find span tags with video information for the purpose of migration.
My regex works well and, for the most part, I can extract all of the information I need. The trouble comes when the style attribute is in a different position than expected. This throws off the expression, and I get only about two-thirds of the captures I would expect.
If I try to nest the style capture group inside the main capture group, it fails to capture anything. I also tried negative and positive lookaheads, but it only ever works if I make it an optional capture group. I think the problem is that I'm not nesting it correctly. Most of the related questions give a negative lookbehind as the answer, but my understanding is that's more of an assertion/quantifier.
So how can I always capture the style attribute regardless of its position in the span tag?
Regex flavor is .NET (server side)
I have a Regexr setup
/(?<tag><span class='vidly-vid' data-thumb='(?<thumb>http.+\.jpg)'.+aspect-ratio='(?<aspect>\d{1,3}:\d{1,3})'.+sources='\[{"file":.+"(?<src>(?<uri>https:\/\/cf1234.cloudfront\.net\/Vids\/)(?<key>(?<ident>[0-9a-fA-F]{8}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{12}|[a-z0-9]{6})\/(?<mp4>mp4_1080.mp4|mp4_720.mp4|mp4_480.mp4|mp4_360.mp4|mp4.mp4))).+style='(?<style>.+width: (?<width>.+)px.+height: (?<height>.+)px.+)'.+<\/span>)/gmi
Sample Data
All of these should match. The first one does NOT, the other three do.
<span class='vidly-vid' data-thumb='https://cf1234.cloudfront.net/Vids/Thumbnails/691DBB43-5EC8-4D57-AF7B-99896D9BD5D1_19127.jpg' data-aspect-ratio='4:3' style='border-width: 0px; width: 352px; height: 240px;' data-sources='[{"file":"https://cf1234.cloudfront.net/Vids/6v1j0a/hls.m3u8","label":"HD"},{"file":"https://cf1234.cloudfront.net/Vids/6v1j0a/mp4_360.mp4","label":"360p SD"}]'> </span>
<span class='vidly-vid' data-thumb='https://cf1234.cloudfront.net/Vids/Thumbnails/b181cfa5-565d-470a-b93a-2610987bb4da_28142.jpg' data-aspect-ratio='160:117' data-sources='[{"file":"https://cf1234.cloudfront.net/Vids/b181cfa5-565d-470a-b93a-2610987bb4da/hls.m3u8","label":"HD"},{"file":"https://cf1234.cloudfront.net/Vids/b181cfa5-565d-470a-b93a-2610987bb4da/mp4_480.mp4","label":"480p SD"},{"file":"https://cf1234.cloudfront.net/Vids/b181cfa5-565d-470a-b93a-2610987bb4da/mp4_360.mp4","label":"360p SD"},{"file":"https://cf1234.cloudfront.net/Vids/b181cfa5-565d-470a-b93a-2610987bb4da/mp4_720.mp4","label":"720p HD"},{"file":"https://cf1234.cloudfront.net/Vids/b181cfa5-565d-470a-b93a-2610987bb4da/mp4_1080.mp4","label":"1080p HD"}]' style='border-width: 0px; width: 600px; height: 480px;'> </span>
<table align="left" border="0" cellpadding="5" cellspacing="5" style="width:600px"> <tbody> <tr> <td><img alt="" src="/content/generator/Course_90016206/Case-10-LMLO_MG_FLAVOR1label.jpg" style="height:497px; width:324px" /></td> <td><span class='vidly-vid' data-thumb='https://cf1234.cloudfront.net/Vids/Thumbnails/b2a7cbd3-5d31-49a5-bf89-aef0cf9f7414_28142.jpg' data-aspect-ratio='146:225' data-sources='[{"file":"https://cf1234.cloudfront.net/Vids/b2a7cbd3-5d31-49a5-bf89-aef0cf9f7414/hls.m3u8","label":"HD"},{"file":"https://cf1234.cloudfront.net/Vids/b2a7cbd3-5d31-49a5-bf89-aef0cf9f7414/mp4_480.mp4","label":"480p SD"},{"file":"https://cf1234.cloudfront.net/Vids/b2a7cbd3-5d31-49a5-bf89-aef0cf9f7414/mp4_360.mp4","label":"360p SD"},{"file":"https://cf1234.cloudfront.net/Vids/b2a7cbd3-5d31-49a5-bf89-aef0cf9f7414/mp4_720.mp4","label":"720p HD"},{"file":"https://cf1234.cloudfront.net/Vids/b2a7cbd3-5d31-49a5-bf89-aef0cf9f7414/mp4_1080.mp4","label":"1080p HD"}]' style='border-width: 0px; width: 324px; height: 500px;'> </span></td> </tr> </tbody> </table>
<span class='vidly-vid' data-thumb='https://cf1234.cloudfront.net/Vids/Thumbnails/231913a7-b608-4d8b-9332-64b6840c22f0_28142.jpg' data-aspect-ratio='16:9' data-sources='[{"file":"https://cf1234.cloudfront.net/Vids/231913a7-b608-4d8b-9332-64b6840c22f0/hls.m3u8","label":"HD"},{"file":"https://cf1234.cloudfront.net/Vids/231913a7-b608-4d8b-9332-64b6840c22f0/mp4_480.mp4","label":"480p SD"},{"file":"https://cf1234.cloudfront.net/Vids/231913a7-b608-4d8b-9332-64b6840c22f0/mp4_360.mp4","label":"360p SD"},{"file":"https://cf1234.cloudfront.net/Vids/231913a7-b608-4d8b-9332-64b6840c22f0/mp4_720.mp4","label":"720p HD"},{"file":"https://cf1234.cloudfront.net/Vids/231913a7-b608-4d8b-9332-64b6840c22f0/mp4_1080.mp4","label":"1080p HD"}]' style='border-width: 0px; width: 920px; height: 520px;'> </span>
I'd personally just split up the regex into more manageable chunks, like so:
var spanRegex = new Regex(@"<span class='vidly-vid'.+<\/span>");
var attrRegexes = new[]{
        @"data-thumb='(?<thumb>http.+\.jpg)'",
        @"aspect-ratio='(?<aspect>\d{1,3}:\d{1,3})'",
        @"sources='\[{""file"":.+""(?<src>(?<uri>https:\/\/cf1234.cloudfront\.net\/Vids\/)(?<key>(?<ident>[0-9a-fA-F]{8}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{12}|[a-z0-9]{6})\/(?<mp4>mp4_1080.mp4|mp4_720.mp4|mp4_480.mp4|mp4_360.mp4|mp4.mp4)))",
        @"style='(?<style>.+width: (?<width>.+)px.+height: (?<height>.+)px.+)'",
    }
    .Select(r => new Regex(r))
    .ToList();
var results = inputs.Select(i => spanRegex.Match(i).Value)
    .Select(i => new
    {
        i,
        attributes =
            from r in attrRegexes
            let match = r.Match(i)
            from g in match.Groups.Cast<Group>().Skip(1)
            select new {g.Name, capture = g.Value}
    });
Linqpad example
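If a single pattern is preferred, another option is to put each attribute in its own positive lookahead. Every lookahead starts from the same position just after "<span", so the order of the attributes no longer matters, and .NET keeps the groups captured inside lookaheads. The sketch below is only an illustration: the names SpanAttributes and Parse are made up, it captures just the aspect ratio and the style, and it assumes no '>' appears inside the opening tag (true for the sample data above).

```csharp
using System.Text.RegularExpressions;

public static class SpanAttributes
{
    // Each (?=...) scans forward inside the opening tag only ([^>]*),
    // then the final [^>]*> consumes the tag itself.
    private static readonly Regex SpanTag = new Regex(
        @"<span\b" +
        @"(?=[^>]*\sclass='vidly-vid')" +
        @"(?=[^>]*\sdata-aspect-ratio='(?<aspect>\d{1,3}:\d{1,3})')" +
        @"(?=[^>]*\sstyle='(?<style>[^']*)')" +
        @"[^>]*>",
        RegexOptions.IgnoreCase);

    // The lookbehind stops "width" from matching inside "border-width".
    private static readonly Regex Width = new Regex(@"(?<![-\w])width:\s*(\d+)px", RegexOptions.IgnoreCase);
    private static readonly Regex Height = new Regex(@"(?<![-\w])height:\s*(\d+)px", RegexOptions.IgnoreCase);

    // Returns null when the tag, or the width/height inside its style, is missing.
    public static (string Aspect, int Width, int Height)? Parse(string html)
    {
        Match tag = SpanTag.Match(html);
        if (!tag.Success) return null;

        string style = tag.Groups["style"].Value;
        Match w = Width.Match(style);
        Match h = Height.Match(style);
        if (!w.Success || !h.Success) return null;

        return (tag.Groups["aspect"].Value, int.Parse(w.Groups[1].Value), int.Parse(h.Groups[1].Value));
    }
}
```

The thumb and sources groups from the original pattern could be added the same way, one lookahead per attribute.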

Best way to filter HTML-file for certain Words

I want to read and filter an HTML file. My plan is to read the file line by line and save only the lines that are "encapsulated" by td tags (I can't paste the tags here because they won't show up for some reason).
For example:
<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>
Tabelle Artikel</title>
</head>
<body>
<table width="900" cellspacing="0" cellpadding="0">
<tr>
<td style="vertical-align:top;font-family:Arial;font-size:14pt;">
Tabelle <strong>Artikel</strong><span style="font-size:10pt;padding-left:10">Tabellen-Übersicht</span></td>
</tr>
<tr>
<td style="vertical-align:top;font-family:Arial;font-size:.8em;padding-bottom:5;padding-left:30;"> </td>
</tr>
<tr>
<td>
<table border="0" cellpadding="1" cellspacing="0" style="border:1px solid darkgray;width:100%">
<thead>
<tr>
<td style="font-family:Arial; font-size:.9em;font-weight:bold;">Name</td>
<td style="font-family:Arial; font-size:.9em;font-weight:bold;">Label</td>
<td style="font-family:Arial; font-size:.9em;font-weight:bold;">Typ</td>
<td style="font-family:Arial; font-size:.9em;font-weight:bold;">MDT</td>
<td style="font-family:Arial; font-size:.9em;font-weight:bold;">Format</td>
</tr>
</thead>
<tbody>
<tr style="background-color:black;color:white;">
In this example, only the lines encapsulated by the td tags would be saved to the string.
It would be pretty straightforward if it weren't for the fact that the encapsulated text can span multiple lines. This makes it hard for me to filter (I'm a beginner).
static void Main(string[] args)
{
    string line = "";
    string FilteredHTML = "";
    bool SaveLine = false;
    using (StreamReader Reader = new StreamReader(@"FilePath"))
    {
        while ((line = Reader.ReadLine()) != null)
        {
            if (line.Contains("<td") || SaveLine == true)
            {
                {
                    FilteredHTML += line;
                    SaveLine = true;
                }
            }
            if (line.Contains("</td>") && (!line.Contains("<td")))
            {
                {
                    FilteredHTML += line;
                    SaveLine = false;
                }
            }
        }
    }
    // Collapse runs of spaces into a single space.
    while (FilteredHTML.Contains("  ")) FilteredHTML = FilteredHTML.Replace("  ", " ");
    string Output = "";
    string Delimiter = "</td>";
    var tokens = Regex.Split(FilteredHTML, Delimiter);
    foreach (var item in tokens)
    {
        Output = item;
        Output += Delimiter;
        Console.WriteLine(Output);
    }
    Console.ReadLine();
}
This works to a certain extent, but I still get text before the tags. And because I add the delimiter back manually after splitting on it, several delimiters in a row can show up.
As you can see, this is a mess. I tried my best to make it work, but I can't help thinking there has to be an easier way to filter an HTML file for certain lines.
I also tried the HTML Agility Pack, but I didn't find anything that would help me filter out certain lines.
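One way around the multi-line problem, sketched under some assumptions: read the whole file into one string and let a regex with RegexOptions.Singleline match across line breaks, instead of tracking state line by line. The names CellExtractor and ExtractCells are made up for illustration. The pattern refuses to cross another opening td, so for nested tables it returns only the innermost cells. It is a sketch, not a full HTML parser.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CellExtractor
{
    // Singleline lets '.' match newlines, so a cell may span several lines.
    // (?!<td\b) stops the body from running into a nested <td>.
    private static readonly Regex Cell = new Regex(
        @"<td\b[^>]*>(?<body>(?:(?!<td\b).)*?)</td>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the text of each innermost <td>, with inner tags removed
    // and whitespace collapsed. Cells that end up empty are skipped.
    public static List<string> ExtractCells(string html)
    {
        var cells = new List<string>();
        foreach (Match m in Cell.Matches(html))
        {
            string text = Regex.Replace(m.Groups["body"].Value, @"<[^>]+>", " ");
            text = Regex.Replace(text, @"\s+", " ").Trim();
            if (text.Length > 0) cells.Add(text);
        }
        return cells;
    }
}
```

For a file on disk, something like CellExtractor.ExtractCells(File.ReadAllText(path)) would replace the StreamReader loop.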

Getting text inside table

I have a table like the one below, and I want to get just the text FOO COMPANY from between the td tags. How can I get it?
<table class="left_company">
<tr>
<td style="BORDER-RIGHT: medium none; bordercolor="#FF0000" align="left" width="291" bgcolor="#FF0000">
<table cellspacing="0" cellpadding="0" width="103%" border="0">
<tr style="CURSOR: hand" onclick="window.open('http://www.foo.com')">
<td class="title_post" title="FOO" valign="center" align="left" colspan="2">
<font style="font-weight: 700" face="Tahoma" color="#FFFFFF" size="2">***FOO COMPANY***</font>
</td>
</tr>
</table>
</td>
</tr>
<table>
I'm using the following code, but nS is null.
doc = hw.Load("http://www.foo.aspx?page=" + j);
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//table[@class='left_company']"))
{
    nS = doc.DocumentNode.SelectNodes("//td[@class='title_post']");
}
var text = doc.DocumentNode.Descendants()
    .FirstOrDefault(n => n.Attributes["class"] != null &&
                         n.Attributes["class"].Value == "title_post")
    .Element("font").InnerText;
or
var text2 = doc.DocumentNode.SelectNodes("//td[@class='title_post']/font")
    .First().InnerText;
The page you are calling likely generates the content of interest using JavaScript. HtmlAgilityPack does not execute JavaScript, so that content cannot be extracted. One way to confirm this is to visit the page with scripting turned off and check whether the element you are interested in still exists.
Add an attribute to the font element, such as company="FOO",
then use jQuery to get that element, like this:
alert($('font[company="FOO"]').html())
Cheers
Close: nS = doc.DocumentNode.SelectNodes("//td[@class='title_post']//text()");
You can then read the text from the nodes in nS. If there is more than one text node, you'll need to iterate over them.

How to get a link's title and href value separately with html agility pack?

I'm trying to download a page that contains a table like this:
<table id="content-table">
<tbody>
<tr>
<th id="name">Name</th>
<th id="link">link</th>
</tr>
<tr class="tt_row">
<td class="ttr_name">
<a title="name_of_the_movie" href="#"><b>name_of_the_movie</b></a>
<br>
<span class="pre">message</span>
</td>
<td class="td_dl">
<img alt="Download" src="#">
</td>
</tr>
<tr class="tt_row"> .... </tr>
<tr class="tt_row"> .... </tr>
</tbody>
</table>
I want to extract name_of_the_movie from td class="ttr_name" and the download link from td class="td_dl".
This is the code I use to loop through the table rows:
HtmlAgilityPack.HtmlDocument hDocument = new HtmlAgilityPack.HtmlDocument();
hDocument.LoadHtml(htmlSource);
HtmlNode table = hDocument.DocumentNode.SelectSingleNode("//table");
foreach (var row in table.SelectNodes("//tr"))
{
    HtmlNode nameNode = row.SelectSingleNode("td[0]");
    HtmlNode linkNode = row.SelectSingleNode("td[1]");
}
Currently I have no idea how to check nameNode and linkNode and extract the data inside them.
Any help would be appreciated.
Regards
I can't test it right now, but it should be something along the lines of:
string name = nameNode.Element("a").Element("b").InnerText;
string url = linkNode.Element("a").GetAttributeValue("href", "unknown");
nameNode.Attributes["title"]
linkNode.Attributes["href"]
presuming you are getting the correct Nodes.
public const string UrlExtractor = @"(?: href\s*=)(?:[\s""']*)(?!#|mailto|location.|javascript|.*css|.*this\.)(?<url>.*?)(?:[\s>""'])";
public static Match GetMatchRegEx(string text)
{
return new Regex(UrlExtractor, RegexOptions.IgnoreCase).Match(text);
}
Here is how you can extract all href URLs. I'm using this regex in one of my projects; you can modify it to fit your needs and extend it to match the title as well. I think it is more convenient to match them in bulk.
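For comparison, here is a sketch that stays with HtmlAgilityPack and reads both attributes from the same anchor. Two details in the question's loop are worth noting: XPath positions are 1-based, so td[0] never matches, and "//tr" searches the whole document even when called on a node, while ".//tr" stays relative. The names MovieTable and ReadRows are made up for illustration, and the code assumes the HtmlAgilityPack package is referenced and that the markup matches the sample table above.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class MovieTable
{
    // Returns the title and href of the first link in each row's name cell.
    public static List<(string Title, string Href)> ReadRows(string htmlSource)
    {
        var results = new List<(string Title, string Href)>();
        var doc = new HtmlDocument();
        doc.LoadHtml(htmlSource);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection rows = doc.DocumentNode.SelectNodes(
            "//table[@id='content-table']//tr[@class='tt_row']");
        if (rows == null) return results;

        foreach (HtmlNode row in rows)
        {
            // The leading "." keeps the query relative to the current row.
            HtmlNode link = row.SelectSingleNode(".//td[@class='ttr_name']//a");
            if (link == null) continue; // row without a link, skip it

            results.Add((link.GetAttributeValue("title", ""),
                         link.GetAttributeValue("href", "")));
        }
        return results;
    }
}
```

The download cell could be read the same way with a second SelectSingleNode call on ".//td[@class='td_dl']", followed by the same null check.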
