C# find string and extract the href


I have a text file and need to search it for a string, then extract the href on the line where that string appears.
file.txt contains the HTML of a basic WordPress homepage.
In the end I want a link such as http://example.com.
I have tried several approaches, for example:
DateTime dateTime = DateTime.UtcNow.Date;
string stringpart = dateTime.ToString("-dd-M-yyyy");
string finalword = "candy" + stringpart;
List<List<string>> groups = new List<List<string>>();
List<string> current = null;
foreach (var line in File.ReadAllLines(@"E:/file.txt"))
{
if (line.Contains("-22-8-2014") && current == null)
current = new List<string>();
else if (line.Contains("candy") && current != null)
{
groups.Add(current);
current = null;
}
if (current != null)
current.Add(line);
}
foreach (object o in groups)
{
Console.WriteLine(o);
}
Console.ReadLine();
}

To do this correctly, you must parse the HTML file. Use something like CsQuery, HTML Agility Pack, or SgmlReader.
A solution to your problem with CsQuery:
public IEnumerable<string> ExtractLinks(string htmlFile)
{
var page = CQ.CreateFromFile(htmlFile);
return page.Select("a[href]").Select(tag => tag.GetAttribute("href"));
}

In case you decide to use HtmlAgilityPack, this should be easy:
var doc = new HtmlDocument();
// load your HTML file into the HtmlDocument
doc.Load("path_to_your_html.html");
// select all <a> tags that have an href attribute
var links = doc.DocumentNode.SelectNodes("//a[@href]");
// SelectNodes returns null when nothing matches
if (links != null)
{
foreach (HtmlNode link in links)
{
// print the value of the href attribute
Console.WriteLine(link.GetAttributeValue("href", ""));
}
}
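To get only the link that goes with a given search string, as the question asks, the selected anchors can be filtered before reading href. This is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; the helper name FindLinks, the sample markup, and the choice to match against the anchor's own href and text are assumptions for illustration, not part of the answer above.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkFinder
{
    // Hypothetical helper: returns the href of every <a> whose href or text contains searchText.
    public static List<string> FindLinks(string html, string searchText)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();
        // SelectNodes returns null (not an empty collection) when nothing matches
        var links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null)
            return result;

        foreach (HtmlNode link in links)
        {
            string href = link.GetAttributeValue("href", "");
            if (href.Contains(searchText) || link.InnerText.Contains(searchText))
                result.Add(href);
        }
        return result;
    }

    public static void Main()
    {
        // Sample markup is made up for illustration
        string html = "<p><a href=\"http://example.com/candy-22-8-2014\">candy</a></p>" +
                      "<p><a href=\"http://example.com/other\">other</a></p>";
        foreach (string href in FindLinks(html, "candy-22-8-2014"))
            Console.WriteLine(href);
    }
}
```

For a file on disk, doc.Load(path) can replace doc.LoadHtml(html), as in the answer above.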

Related

How to create a CSV file from a XML file

I am very new to C#. In my project I need to create a CSV file from XML data. I can already read the XML and print some particular attributes to the logger, but I am not sure how to store those attributes in a CSV file.
Here is the XML file from which I need to create the CSV file.
<?xml version="1.0" encoding="utf-8"?>
<tlp:WorkUnits xmlns:tlp="http://www.timelog.com/XML/Schema/tlp/v4_4"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.timelog.com/XML/Schema/tlp/v4_4 http://www.timelog.com/api/xsd/WorkUnitsRaw.xsd">
<tlp:WorkUnit ID="130">
<tlp:EmployeeID>3</tlp:EmployeeID>
<tlp:AllocationID>114</tlp:AllocationID>
<tlp:TaskID>239</tlp:TaskID>
<tlp:ProjectID>26</tlp:ProjectID>
<tlp:ProjectName>LIK Template</tlp:ProjectName>
<tlp:CustomerId>343</tlp:CustomerId>
<tlp:CustomerName>Lekt Corp Inc.</tlp:CustomerName>
<tlp:IsBillable>1</tlp:IsBillable>
<tlp:ApprovedStatus>0</tlp:ApprovedStatus>
<tlp:LastModifiedBy>AL</tlp:LastModifiedBy>
</tlp:WorkUnit>
And here is my code, where I get these values in the logger. I am not sure how to create a CSV file that stores the values in order.
Edited
namespace TimeLog.ApiConsoleApp
{
/// <summary>
/// Template class for consuming the reporting API
/// </summary>
public class ConsumeReportingApi
{
private static readonly ILog Logger = LogManager.GetLogger(typeof(ConsumeReportingApi));
public static void Consume()
{
if (ServiceHandler.Instance.TryAuthenticate())
{
if (Logger.IsInfoEnabled)
{
Logger.Info("Successfully authenticated on reporting API");
}
var customersRaw = ServiceHandler.Instance.Client.GetWorkUnitsRaw(ServiceHandler.Instance.SiteCode,
ServiceHandler.Instance.ApiId,
ServiceHandler.Instance.ApiPassword,
WorkUnit.All,
Employee.All,
Allocation.All,
Task.All,
Project.All,
Department.All,
DateTime.Now.AddDays(-5).ToString(),
DateTime.Now.ToString()
);
if (customersRaw.OwnerDocument != null)
{
var namespaceManager = new XmlNamespaceManager(customersRaw.OwnerDocument.NameTable);
namespaceManager.AddNamespace("tlp", "http://www.timelog.com/XML/Schema/tlp/v4_4");
var workUnit = customersRaw.SelectNodes("tlp:WorkUnit", namespaceManager);
var output = new StringBuilder();
output.AppendLine("AllocationID,ApprovedStatus,CustomerId,CustomerName,EmployeeID");
if (workUnit != null)
{
foreach (XmlNode customer in workUnit)
{
var unit = new WorkUnit();
var childNodes = customer.SelectNodes("./*");
if (childNodes != null)
{
foreach (XmlNode childNode in childNodes)
{
if (childNode.Name == "tlp:EmployeeID")
{
unit.EmployeeID = Int32.Parse(childNode.InnerText);
}
if (childNode.Name == "tlp:EmployeeFirstName")
{
unit.EmployeeFirstName = childNode.InnerText;
}
if (childNode.Name == "tlp:EmployeeLastName")
{
unit.EmployeeLastName = childNode.InnerText;
}
if (childNode.Name == "tlp:AllocationID")
{
unit.AllocationID = Int32.Parse(childNode.InnerText);
}
if (childNode.Name == "tlp:TaskName")
{
unit.TaskName = childNode.InnerText;
}
}
}
output.AppendLine($"{unit.EmployeeID},{unit.EmployeeFirstName},{unit.EmployeeLastName},{unit.AllocationID},{unit.TaskName}");
//Console.WriteLine("---");
}
Console.WriteLine(output.ToString());
File.WriteAllText("c:\\...\\WorkUnits.csv", output.ToString());
}
}
else
{
if (Logger.IsWarnEnabled)
{
Logger.Warn("Failed to authenticate to reporting API");
}
}
}
}
}
}

You want to write the columns in the correct order to the CSV (of course), so you need to process them in the correct order. Two options:
intermediate class
Create a new class (let's call it WorkUnit) with properties for each of the columns that you want to write to the CSV. Create a new instance for every <tlp:WorkUnit> node in your XML and fill the properties when you encounter the correct subnodes. When you have processed the entire WorkUnit node, write out the properties in the correct order.
var output = new StringBuilder();
foreach (XmlNode customer in workUnit)
{
// fresh instance of the class that holds all columns (so all properties are cleared)
var unit = new WorkUnit();
var childNodes = customer.SelectNodes("./*");
if (childNodes != null)
{
foreach (XmlNode childNode in childNodes)
{
if(childNode.Name== "tlp:EmployeeID")
{
// employeeID node found, now write to the corresponding property:
unit.EmployeeId = childNode.InnerText;
}
// etc for the other XML nodes you are interested in
}
}
// all nodes have been processed for this one WorkUnit node
// so write a line to the CSV
output.AppendLine($"{unit.EmployeeId},{unit.AllocationId}, etc");
}
read in correct order
Instead of using foreach to loop through all subnodes in whatever order they appear, search for specific subnodes in the order you want. Then you can write out the CSV in the same order. Note that even when you don't find some subnode, you still need to write out the separator.
var output = new StringBuilder();
foreach (XmlNode customer in workUnit)
{
// search for value for first column (EmployeeID)
// (query the current WorkUnit node; the tlp prefix needs the namespace manager)
var node = customer.SelectSingleNode("tlp:EmployeeID", namespaceManager);
if (node != null)
{
output.Append(node.InnerText).Append(',');
}
else
{
output.Append(','); // no content, but we still need a separator
}
// etc for the other columns
output.AppendLine(); // end of the CSV row
}
And of course watch out for string values that contain the separator.
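Values that contain the separator are usually handled by quoting the field. The sketch below follows the common RFC 4180 convention (wrap the field in double quotes and double any embedded quotes); the class and method names are hypothetical and not part of the answer above.

```csharp
using System;

public static class Csv
{
    // Hypothetical helper: quotes a field when it contains a comma,
    // a double quote, or a line break; embedded quotes are doubled.
    public static string Escape(string field)
    {
        if (field == null)
            return "";
        bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        if (!needsQuotes)
            return field;
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    public static void Main()
    {
        Console.WriteLine(Escape("LIK Template"));     // prints: LIK Template
        Console.WriteLine(Escape("Lekt Corp, Inc."));  // prints: "Lekt Corp, Inc."
    }
}
```

Each value would then go through Csv.Escape before it is appended, for example output.Append(Csv.Escape(node.InnerText)).Append(',').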

Assuming that you have put your XML data into a List:
StringBuilder str = new StringBuilder();
foreach (var f in list)
{
str.Append(f.listfield.ToString() + ",");
}
To create a new line (this replaces the trailing comma):
str.Replace(",", Environment.NewLine, str.Length - 1, 1);
To save:
string filename = "DirectoryPath/filename.csv";
File.WriteAllText(filename, str.ToString());

Try this:
var output = new StringBuilder();
output.AppendLine("AllocationID,ApprovedStatus,CustomerId,CustomerName,EmployeeID");
if (workUnit != null)
{
foreach (XmlNode customer in workUnit)
{
var unit = new WorkUnit();
var childNodes = customer.SelectNodes("./*");
if (childNodes != null)
{
for (int i = 0; i<childNodes.Count; ++i)
{
XmlNode childNode = childNodes[i];
if (childNode.Name == "tlp:EmployeeID")
{
unit.EmployeeID = Int32.Parse(childNode.InnerText);
}
if (childNode.Name == "tlp:EmployeeFirstName")
{
unit.EmployeeFirstName = childNode.InnerText;
}
if (childNode.Name == "tlp:EmployeeLastName")
{
unit.EmployeeLastName = childNode.InnerText;
}
if (childNode.Name == "tlp:AllocationID")
{
unit.AllocationID = Int32.Parse(childNode.InnerText);
}
if (childNode.Name == "tlp:TaskName")
{
unit.TaskName = childNode.InnerText;
}
output.Append(childNode.InnerText);
if (i<childNodes.Count - 1)
output.Append(",");
}
output.Append(Environment.NewLine);
}
}
Console.WriteLine(output.ToString());
File.WriteAllText("c:\\Users\\mnowshin\\projects\\WorkUnits.csv", output.ToString());
}

You can use this sequence:
a. Deserialize your XML (i.e. convert it from XML to C# objects).
b. Write a simple loop to write the data to a file.
The advantages of this sequence:
You get a readable list of typed objects that any other code can work with.
If your XML schema changes at any time, the code is easy to maintain.
The solution
a. Deserialize:
Copy the contents of your XML file. Note: modify the XML before copying it. Duplicate the WorkUnit node, so that Visual Studio recognizes a list of WorkUnit nodes nested inside the WorkUnits node.
From the Visual Studio menus select Edit -> Paste Special -> Paste XML as Classes.
Use the deserialization code:
var workUnitsNode = customersRaw.SelectSingleNode("tlp:WorkUnits", namespaceManager);
XmlSerializer ser = new XmlSerializer(typeof(WorkUnits));
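// XmlSerializer.Deserialize needs a reader, so wrap the node in an XmlNodeReader
WorkUnits workUnits = (WorkUnits)ser.Deserialize(new XmlNodeReader(workUnitsNode));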
b. Write the csv file
StringBuilder csvContent = new StringBuilder();
// add the header line
csvContent.AppendLine("AllocationID,ApprovedStatus,CustomerId,CustomerName,EmployeeID");
foreach (var unit in workUnits.WorkUnit)
{
csvContent.AppendFormat(
"{0}, {1}, {2}, {3}, {4}",
new object[]
{
unit.AllocationID,
unit.ApprovedStatus,
unit.CustomerId,
unit.CustomerName,
unit.EmployeeID
// you get the idea
});
csvContent.AppendLine();
}
File.WriteAllText(@"G:\Projects\StackOverFlow\WpfApp1\WorkUnits.csv", csvContent.ToString());

You can use Cinchoo ETL, if you are free to use an open source library:
using (var csvWriter = new ChoCSVWriter("sample1.csv").WithFirstLineHeader())
{
using (var xmlReader = new ChoXmlReader("sample1.xml"))
csvWriter.Write(xmlReader);
}
Output:
ID,tlp_EmployeeID,tlp_AllocationID,tlp_TaskID,tlp_ProjectID,tlp_ProjectName,tlp_CustomerId,tlp_CustomerName,tlp_IsBillable,tlp_ApprovedStatus,tlp_LastModifiedBy
130,3,114,239,26,LIK Template,343,Lekt Corp Inc.,1,0,AL
Disclaimer: I'm the author of this library.

HtmlAgilityPack multiple element

I have an HTML document that contains multiple divs.
Example:
<div class="element">
<div class="title">
<a href="127.0.0.1" title="Test">Test</a>
</div>
</div>
Now I'm using this code to extract the title element.
List<string> items = new List<string>();
var nodes = Web.DocumentNode.SelectNodes("//*[@title]");
if (nodes != null)
{
foreach (var node in nodes)
{
foreach (var attribute in node.Attributes)
if (attribute.Name == "title")
items.Add(attribute.Value);
}
}
I don't know how to adapt my code to extract the href and the title attributes at the same time.
Each div should become an object with the attributes of the contained a tag as properties.
public class CheckBoxListItem
{
public string Text { get; set; }
public string Href { get; set; }
}

You can use the following XPath query to retrieve only a tags that have both a title and an href:
//a[@title and @href]
Then you can use your code like this:
List<CheckBoxListItem> items = new List<CheckBoxListItem>();
var nodes = Web.DocumentNode.SelectNodes("//a[@title and @href]");
if (nodes != null)
{
foreach (var node in nodes)
{
items.Add(new CheckBoxListItem()
{
Text = node.Attributes["title"].Value,
Href = node.Attributes["href"].Value
});
}
}

I very often use ScrapySharp's package together with HtmlAgilityPack for css selection.
(add a using statement for ScrapySharp.Extensions so you can use the CssSelect method).
using HtmlAgilityPack;
using ScrapySharp.Extensions;
In your case, I would do:
HtmlWeb w = new HtmlWeb();
var htmlDoc = w.Load("myUrl");
var titles = htmlDoc.DocumentNode.CssSelect(".title");
foreach (var title in titles)
{
string href = string.Empty;
var anchor = title.CssSelect("a").FirstOrDefault();
if (anchor != null)
{
href = anchor.GetAttributeValue("href");
}
}

C# HtmlDecode Specific tags only

I have a large HTML-encoded string and I want to decode only specific whitelisted HTML tags.
Is there a way to do this in C#? WebUtility.HtmlDecode() decodes everything.
I am looking for an implementation of DecodeSpecificTags() that will pass the test below.
[Test]
public void DecodeSpecificTags_SimpleInput_True()
{
string input = "&lt;span&gt;i am &lt;strong color=blue&gt;very&lt;/strong&gt; big &lt;br&gt;man.&lt;/span&gt;";
string output = "&lt;span&gt;i am <strong color=blue>very</strong> big <br>man.&lt;/span&gt;";
List<string> whiteList = new List<string>(){ "strong","br" } ;
Assert.IsTrue(DecodeSpecificTags(whiteList,input) == output);
}

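You could do something like this:
public string DecodeSpecificTags(List<string> whiteListedTagNames, string encodedInput)
{
foreach (string s in whiteListedTagNames)
{
// match an encoded opening or closing tag, e.g. &lt;strong color=blue&gt; or &lt;/strong&gt;
string regex = "&lt;" + @"\s*/?\s*" + Regex.Escape(s) + @"\b.*?" + "&gt;";
// decode only the matched tag; everything else stays encoded
encodedInput = Regex.Replace(encodedInput, regex, m => WebUtility.HtmlDecode(m.Value));
}
return encodedInput;
}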

A better approach could be to use an HTML parser such as HtmlAgilityPack, CsQuery or NSoup to find the specific elements and decode them in a loop.
Check this for links and examples of parsers.
Here is how I did it using CsQuery:
string input = "&lt;span&gt;i am &lt;strong color=blue&gt;very&lt;/strong&gt; big &lt;br&gt;man.&lt;/span&gt;";
string output = "&lt;span&gt;i am <strong color=blue>very</strong> big <br>man.&lt;/span&gt;";
var decoded = HttpUtility.HtmlDecode(output);
var encoded =input ; // HttpUtility.HtmlEncode(decoded);
Console.WriteLine(encoded);
Console.WriteLine(decoded);
var doc=CsQuery.CQ.CreateDocument(decoded);
var paras=doc.Select("strong").Union(doc.Select ("br")) ;
var tags=new List<KeyValuePair<string, string>>();
var counter=0;
foreach (var element in paras)
{
HttpUtility.HtmlEncode(element.OuterHTML).Dump();
var key ="---" + counter + "---";
var value= HttpUtility.HtmlDecode(element.OuterHTML);
var pair= new KeyValuePair<String,String>(key,value);
element.OuterHTML = key ;
tags.Add(pair);
counter++;
}
var finalstring= HttpUtility.HtmlEncode(doc.Document.Body.InnerHTML);
finalstring.Dump();
foreach (var element in tags)
{
finalstring=finalstring.Replace(element.Key,element.Value);
}
Console.WriteLine(finalstring);

Or you could use HtmlAgilityPack with a blacklist or a whitelist, depending on your requirement. I'm using the blacklist approach.
My blacklisted tags are stored in a text file, for example "script|img".
public static string DecodeSpecificTags(this string content, List<string> blackListedTags)
{
if (string.IsNullOrEmpty(content))
{
return content;
}
blackListedTags = blackListedTags.Select(t => t.ToLowerInvariant()).ToList();
var decodedContent = HttpUtility.HtmlDecode(content);
var document = new HtmlDocument();
document.LoadHtml(decodedContent);
decodedContent = blackListedTags.Select(blackListedTag => document.DocumentNode.Descendants(blackListedTag))
.Aggregate(decodedContent,
(current1, nodes) =>
nodes.Select(htmlNode => htmlNode.WriteTo())
.Aggregate(current1,
(current, nodeContent) =>
current.Replace(nodeContent, HttpUtility.HtmlEncode(nodeContent))));
return decodedContent;
}

HTML agility pack - removing unwanted tags without removing content?

I've seen a few related questions out here, but they don’t exactly talk about the same problem I am facing.
I want to use the HTML Agility Pack to remove unwanted tags from my HTML without losing the content within the tags.
So for instance, in my scenario, I would like to preserve the tags "b", "i" and "u".
And for an input like:
<p>my paragraph <div>and my <b>div</b></div> are <i>italic</i> and <b>bold</b></p>
The resulting HTML should be:
my paragraph and my <b>div</b> are <i>italic</i> and <b>bold</b>
I tried using HtmlNode's Remove method, but it removes my content too. Any suggestions?

I wrote an algorithm based on Oded's suggestions. Here it is. Works like a charm.
It removes all tags except strong, em, u and raw text nodes.
internal static string RemoveUnwantedTags(string data)
{
if(string.IsNullOrEmpty(data)) return string.Empty;
var document = new HtmlDocument();
document.LoadHtml(data);
var acceptableTags = new String[] { "strong", "em", "u"};
var nodes = new Queue<HtmlNode>(document.DocumentNode.SelectNodes("./*|./text()"));
while(nodes.Count > 0)
{
var node = nodes.Dequeue();
var parentNode = node.ParentNode;
if(!acceptableTags.Contains(node.Name) && node.Name != "#text")
{
var childNodes = node.SelectNodes("./*|./text()");
if (childNodes != null)
{
foreach (var child in childNodes)
{
nodes.Enqueue(child);
parentNode.InsertBefore(child, node);
}
}
parentNode.RemoveChild(node);
}
}
return document.DocumentNode.InnerHtml;
}

How to recursively remove a given list of unwanted html tags from an html string
I took @mathias's answer and improved his extension method so that you can supply the tags to exclude as a List<string> (e.g. {"a","p","hr"}). I also fixed the logic so that it works properly when applied recursively:
public static string RemoveUnwantedHtmlTags(this string html, List<string> unwantedTags)
{
if (String.IsNullOrEmpty(html))
{
return html;
}
var document = new HtmlDocument();
document.LoadHtml(html);
HtmlNodeCollection tryGetNodes = document.DocumentNode.SelectNodes("./*|./text()");
if (tryGetNodes == null || !tryGetNodes.Any())
{
return html;
}
var nodes = new Queue<HtmlNode>(tryGetNodes);
while (nodes.Count > 0)
{
var node = nodes.Dequeue();
var parentNode = node.ParentNode;
var childNodes = node.SelectNodes("./*|./text()");
if (childNodes != null)
{
foreach (var child in childNodes)
{
nodes.Enqueue(child);
}
}
if (unwantedTags.Any(tag => tag == node.Name))
{
if (childNodes != null)
{
foreach (var child in childNodes)
{
parentNode.InsertBefore(child, node);
}
}
parentNode.RemoveChild(node);
}
}
return document.DocumentNode.InnerHtml;
}

Try the following, you might find it a bit neater than the other proposed solutions:
public static int RemoveNodesButKeepChildren(this HtmlNode rootNode, string xPath)
{
HtmlNodeCollection nodes = rootNode.SelectNodes(xPath);
if (nodes == null)
return 0;
foreach (HtmlNode node in nodes)
node.RemoveButKeepChildren();
return nodes.Count;
}
public static void RemoveButKeepChildren(this HtmlNode node)
{
foreach (HtmlNode child in node.ChildNodes)
node.ParentNode.InsertBefore(child, node);
node.Remove();
}
public static bool TestYourSpecificExample()
{
string html = "<p>my paragraph <div>and my <b>div</b></div> are <i>italic</i> and <b>bold</b></p>";
HtmlDocument document = new HtmlDocument();
document.LoadHtml(html);
document.DocumentNode.RemoveNodesButKeepChildren("//div");
document.DocumentNode.RemoveNodesButKeepChildren("//p");
return document.DocumentNode.InnerHtml == "my paragraph and my <b>div</b> are <i>italic</i> and <b>bold</b>";
}

Before removing a node, get its parent and its InnerText, then remove the node and re-assign the InnerText to the parent.
var parent = node.ParentNode;
var innerText = parent.InnerText;
node.Remove();
parent.AppendChild(doc.CreateTextNode(innerText));

If you do not want to use HTML Agility Pack and still want to remove unwanted HTML tags, then you can do it as shown below.
public static string RemoveHtmlTags(string strHtml)
{
string strText = Regex.Replace(strHtml, "<(.|\n)*?>", String.Empty);
strText = HttpUtility.HtmlDecode(strText);
strText = Regex.Replace(strText, @"\s+", " ");
return strText;
}

How to get the contents of a HTML element using HtmlAgilityPack in C#?

I want to get the contents of an ordered list from an HTML page using HtmlAgilityPack in C#. I have tried the following code, but it is not working. Can anyone help? I want to pass in HTML text and get the contents of the first ordered list found in it.
private bool isOrderedList(HtmlNode node)
{
if (node.NodeType == HtmlNodeType.Element)
{
if (node.Name.ToLower() == "ol")
return true;
else
return false;
}
else
return false;
}
public string GetOlList(string htmlText)
{
string s="";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlText);
HtmlNode nd = doc.DocumentNode;
foreach (HtmlNode node in nd.ChildNodes)
{
if (isOrderedList(node))
{
s = node.WriteContentTo();
break;
}
else if (node.HasChildNodes)
{
string sx= GetOlList(node.WriteTo());
if (sx != "")
{
s = sx;
break;
}
}
}
return s;
}

The following code worked for me
public static string GetComments(string html)
{
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
string s = "";
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//ol"))
{
s += node.OuterHtml;
}
return s;
}

How about:
HtmlNode el = doc.DocumentNode
.SelectSingleNode("//ol");
if (el != null)
{
string s = el.OuterHtml;
}
(untested, from memory)
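If the individual entries of the list are needed instead of the raw markup, the li children of the first ol can be read one by one. This is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; the helper name GetFirstOlItems and the sample markup are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class OrderedListReader
{
    // Hypothetical helper: returns the trimmed text of each <li> directly under the first <ol>.
    public static List<string> GetFirstOlItems(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var items = new List<string>();
        // SelectSingleNode returns the first match in document order, or null
        HtmlNode ol = doc.DocumentNode.SelectSingleNode("//ol");
        if (ol == null)
            return items;

        // SelectNodes returns null when the list has no <li> children
        var listItems = ol.SelectNodes("./li");
        if (listItems == null)
            return items;

        foreach (HtmlNode li in listItems)
            items.Add(li.InnerText.Trim());
        return items;
    }

    public static void Main()
    {
        // Sample markup is made up for illustration
        string html = "<div><ol><li>first</li><li>second</li></ol><ol><li>ignored</li></ol></div>";
        foreach (string item in GetFirstOlItems(html))
            Console.WriteLine(item);
    }
}
```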
