How to extract the text values of a given attribute using XPath?

I want to extract the text within the content attribute using XPath.
<meta name="keywords" content="football,cricket,Rugby,Volleyball">
I want to select only "football,cricket,Rugby,Volleyball".
I'm using C# and HtmlAgilityPack.
This is how I tried to do it, but it did not work:
private void scrapBtn_Click(object sender, EventArgs e)
{
    string url = urlTextBox.Text;
    HtmlWeb web = new HtmlWeb();
    HtmlAgilityPack.HtmlDocument doc = web.Load(url);
    try
    {
        var node = doc.DocumentNode.SelectSingleNode("//head/title/text()");
        var node1 = doc.DocumentNode.SelectSingleNode("//head/meta[@name='DESCRIPTION']/@content");
        try
        {
            label4.Text = "Title:";
            label4.Text += "\t" + node.Name.ToUpper() + ": " + node.OuterHtml;
        }
        catch (NullReferenceException)
        {
            MessageBox.Show(url + "does not contain <Title>", "Oppz, Sorry");
        }
        try
        {
            label4.Text += "\nMeta Keywords:";
            label4.Text += "\n\t" + node1.Name.ToUpper() + ": " + node1.OuterHtml;
        }
        catch (NullReferenceException)
        {
            MessageBox.Show(url + "does not contain <meta='Keywords'>", "Oppz, Sorry");
        }
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.StackTrace, "Oppz, Sorry");
    }
}

With HTML Agility Pack you can use doc.DocumentNode.SelectSingleNode("/html/head/meta[@name = 'keywords']").Attributes["content"].Value. I think its XPath support for attribute nodes is a bit odd, so it is better to select the element, then use the Attributes property to select the attribute and the Value property to extract the value. If you want to use pure XPath to get the attribute value as a string, use doc.CreateNavigator().Evaluate("string(/html/head/meta[@name = 'keywords']/@content)"). Note that the XPath in the question tests for name='DESCRIPTION', while the element you want has name='keywords'; the comparison is case-sensitive.

You can use string() to get just the value. Apply it to the attribute node itself, since an attribute has no text() children:
string(//head/meta[@name='keywords']/@content)
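
As a rough illustration of the string() approach from the answers above, here is a minimal sketch. It assumes the page is well-formed XML and uses System.Xml.Linq with XPathEvaluate instead of HTML Agility Pack, so that it runs without extra packages; the class name MetaContentDemo, the helper GetMetaContent and the sample markup are made up for this example.

```csharp
using System;
using System.Xml.Linq;
using System.Xml.XPath;

public static class MetaContentDemo
{
    // Returns the content attribute of the <meta> element with the given name,
    // or an empty string when no such element or attribute exists.
    public static string GetMetaContent(XDocument page, string metaName)
    {
        // string(...) converts the first selected attribute node to its value.
        // metaName is concatenated for brevity; it must not contain a single quote.
        string xpath = "string(/html/head/meta[@name='" + metaName + "']/@content)";
        return (string)page.XPathEvaluate(xpath);
    }

    public static void Main()
    {
        // Hypothetical sample page, written as well-formed XML.
        var page = XDocument.Parse(
            "<html><head><title>Sports</title>" +
            "<meta name=\"keywords\" content=\"football,cricket,Rugby,Volleyball\" />" +
            "</head><body /></html>");

        Console.WriteLine(GetMetaContent(page, "keywords"));
    }
}
```

Real-world HTML is often not well-formed, which is why the question uses HTML Agility Pack; the XPath expression itself should carry over unchanged.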

Related

Get only child nodes of a parent node

I am trying to work with HTML Agility Pack. The basics work fine, but when I try to get the child nodes of one part of the page, I get all nodes with the class 'dealer-offer', regardless of which parent node they are in.
Here is the code that I use:
private void getListOfDiv(string html, string classname)
{
if (html != null)
{
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var divProduktkategorie = doc.DocumentNode.SelectSingleNode("//div[@class='" + classname + "']");
//this.txtHtmlCode.Text = divProduktkategorie.InnerHtml;
//return;
int i = 1;
foreach( var divAngebote in divProduktkategorie.SelectNodes("//div[@class='dealer-offer']"))
{
this.listBox1.Items.Add(i + ": " + classname);
this.txtHtmlCode.AppendText(divAngebote.OuterHtml);
i++;
}
}
}
When I write divProduktkategorie to the output field, I get only the 3 items that are under this single node, but when I run the loop, I get every node with the class 'dealer-offer' and not only those 3 items.
Where is my mistake? I could not find it myself.
Thanks for helping.

Select the 3 nodes directly with the correct path and then iterate over them, instead of searching again from the divProduktkategorie reference. An XPath that starts with // always searches the whole document, even when it is called on a child node; a path relative to that node would have to start with .// instead.
private void getListOfDiv(string html, string classname)
{
    if (html != null)
    {
        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(html);
        // SelectNodes (not SelectSingleNode) returns every offer inside the category div.
        var divAngebote = doc.DocumentNode.SelectNodes("//div[@class='" + classname + "']//div[@class='dealer-offer']");
        if (divAngebote == null) return; // SelectNodes returns null when nothing matches
        int i = 1;
        foreach (var divAngebot in divAngebote)
        {
            this.listBox1.Items.Add(i + ": " + classname);
            this.txtHtmlCode.AppendText(divAngebot.OuterHtml);
            i++;
        }
    }
}
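
The difference between an absolute // path and a relative .// path can be sketched as follows. This is only an illustration: it uses XmlDocument from System.Xml as a stand-in for HTML Agility Pack (the behaviour described in the question suggests both treat // the same way), and the class names cat-a and cat-b, the helper CountOffers and the sample markup are invented for the example.

```csharp
using System;
using System.Xml;

public static class RelativeXPathDemo
{
    // Hypothetical markup: three offers in one category, one offer in another.
    public const string Sample =
        "<body>" +
        "<div class='cat-a'>" +
        "<div class='dealer-offer'>A1</div>" +
        "<div class='dealer-offer'>A2</div>" +
        "<div class='dealer-offer'>A3</div>" +
        "</div>" +
        "<div class='cat-b'><div class='dealer-offer'>B1</div></div>" +
        "</body>";

    // Selects the category div, then counts the nodes matched by offerXPath
    // when that XPath is evaluated on the category node.
    public static int CountOffers(string xml, string classname, string offerXPath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        XmlNode category = doc.SelectSingleNode("//div[@class='" + classname + "']");
        if (category == null) return 0;
        return category.SelectNodes(offerXPath).Count;
    }

    public static void Main()
    {
        // "//" starts at the document root, so offers from every category are counted.
        Console.WriteLine(CountOffers(Sample, "cat-a", "//div[@class='dealer-offer']"));  // 4
        // ".//" starts at the category node, so only its own offers are counted.
        Console.WriteLine(CountOffers(Sample, "cat-a", ".//div[@class='dealer-offer']")); // 3
    }
}
```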

How to access and replace text in certain paragraphs using OpenXML PowerTools case by case

I am trying to redact some Word files using C# and OpenXML. I need to do a controlled replacement of numbers with a certain phrase. Each Word file contains a different amount of information. I want to use OpenXML PowerTools for this purpose.
I used the normal OpenXML method to replace, but it is very unreliable and produces random errors such as a zero-length error. I used regex replace and that seems to work, but it replaces throughout the whole document, which is highly undesirable.
Here is a snippet of the code:
private void redact_Replaceall(string wfile)
{
try
{
using (WordprocessingDocument doc = WordprocessingDocument.Open(wfile, true))
{
var ydoc = doc.MainDocumentPart.GetXDocument();
IEnumerable<XElement> content = ydoc.Descendants(W.body);
Regex regex = new Regex(@"\d+\.\d{2,3}");
int count1 = OpenXmlPowerTools.OpenXmlRegex.Match(content, regex);
int count2 = OpenXmlPowerTools.OpenXmlRegex.Replace(content, regex, replace_text, null);
statusBar1.Text = "Try 1: Found: " + count1 + ", Replaced: " + count2;
doc.MainDocumentPart.PutXDocument();
}
}
catch(Exception e)
{
MessageBox.Show("Replace all exprienced error: " + e.Message);
}
}
Basically, I want to do this redaction based on the content of each paragraph. I am able to get the paragraphs using the following, but not their IDs:
IEnumerable<XElement> content = ydoc.Descendants(W.p);
Here is my approach using the normal OpenXML method, but I get a lot of errors depending on the file:
foreach (DocumentFormat.OpenXml.Wordprocessing.Paragraph para in bod.Descendants<DocumentFormat.OpenXml.Wordprocessing.Paragraph>())
{
foreach (var run in para.Elements<Run>())
{
foreach (var text in run.Elements<Text>())
{
string temp = text.Text;
int firstlength = first.Length + 1;
int secondlength = second.Length + 1;
if (text.Text.Contains(first) && !(temp.Length > firstlength))
{
text.Text = text.Text.Replace(first, "DELETED");
}
if (text.Text.Contains(second) && !(temp.Length > secondlength))
{
text.Text = text.Text.Replace(second, "DELETED");
}
}
}
}
Here is my latest approach, but I am stuck on it:
private void redact_Replacebadones(string wfile)
{
try
{
using (WordprocessingDocument doc = WordprocessingDocument.Open(wfile, true))
{
var ydoc = doc.MainDocumentPart.GetXDocument();
/* from XElement xele in ydoc.Root.Elements();
List<string> lhsElements = xele.Elements("lhs")
.Select(el => el.Attribute("id").Value)
.ToList();
*/
/// XElement
IEnumerable<XElement> content = ydoc.Descendants(W.p);
foreach (var p in content )
{
if (p.Value.Contains("each") && !p.Value.Contains("DELETED"))
{
string to_overwrite = p.Value;
Regex regexop = new Regex(@"\d+\.\d{2,3}");
regexop.Replace(to_overwrite, "Deleted");
p.SetValue(to_overwrite);
MessageBox.Show("NAME :" + p.GetParagraphInfo() +" VValue:"+to_overwrite);
}
}
doc.MainDocumentPart.PutXDocument();
}
}
catch (Exception e)
{
MessageBox.Show("Replace each exprienced error: " + e.Message);
}
}

Maybe a bit late. OpenXML PowerTools by Eric White has a function SearchAndReplace that replaces text content, so you don't have to handle it with regex.
This function also handles text that is split across runs. (When you edit a Word document, a word can be split into several runs, so you don't find the search phrase directly.)
Maybe this helps somebody.
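
Independently of SearchAndReplace, one detail in the question's last attempt is worth a sketch: Regex.Replace returns a new string and leaves its input untouched, so the result has to be captured and written back. The example below is only an illustration on plain, un-namespaced p elements with made-up content (the class ParagraphRedactDemo and the helper RedactParagraphs are hypothetical). In a real WordprocessingML paragraph, calling SetValue would also discard the run structure, which is the problem the run-aware PowerTools functions are meant to avoid.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class ParagraphRedactDemo
{
    // Same pattern as in the question: digits, a dot, then two or three digits.
    static readonly Regex Number = new Regex(@"\d+\.\d{2,3}");

    // Redacts numbers only in <p> elements whose text contains the marker word.
    // Returns how many paragraphs were changed.
    public static int RedactParagraphs(XElement body, string marker, string replacement)
    {
        int changed = 0;
        foreach (XElement p in body.Elements("p").ToList())
        {
            if (!p.Value.Contains(marker)) continue;

            // Regex.Replace does not modify its input; keep the returned string.
            string redacted = Number.Replace(p.Value, replacement);
            if (redacted != p.Value)
            {
                p.SetValue(redacted);
                changed++;
            }
        }
        return changed;
    }

    public static void Main()
    {
        // Hypothetical stand-in for a document body.
        var body = XElement.Parse("<body><p>12.50 each</p><p>total 99.99</p></body>");
        int changed = RedactParagraphs(body, "each", "DELETED");
        Console.WriteLine(changed);
        Console.WriteLine(body);
    }
}
```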

XML Parsing for Mediawiki link

I have this link http://en.wikipedia.org/w/api.php?format=xml&action=query&titles=panadol&prop=revisions&rvprop=content
I need to get the content inside the rev tag, so I used this code:
private void HttpsCompleted(object sender, DownloadStringCompletedEventArgs e)
{
WebClient wwc = new WebClient();
String xmlStr = "http://en.wikipedia.org/w/api.php?format=xml&action=query&titles=" + medName + "&prop=revisions&rvprop=content";
wwc.DownloadStringCompleted += wwc_DownloadStringCompleted;
wwc.DownloadStringAsync(new Uri(xmlStr));
}
else
{
MessageBox.Show("Couldn't search for medicine!\nCheck the internet connection.");
}
}
catch (Exception)
{
// do nothing
}
}
I am also calling this method:
XNamespace ns = "http://www.w3.org/2005/Atom";
var entry = XDocument.Parse(e.Result);
var xmlData = new xmlWiki();
var g = entry.Element(ns + "rev").Value.ToString();
}
}
catch (Exception f)
{
MessageBox.Show(f.ToString());
}
}
But I get a NullReferenceException when the code executes var g = entry.Element(ns + "rev").Value.ToString();
Any help would be appreciated. Thank you in advance.

rev is not a direct child of the root of the tree. This is the path to it:
api
query
pages
page
revisions
rev
You can use .Descendants() to reach it.
var entry = XDocument.Parse(html);
var g = entry.Descendants("rev").First().Value;
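
A slightly more defensive variant of the lines above might look like the sketch below. The sample XML only mimics the nesting listed in this answer and is not a real API response; the class RevisionDemo and the helper GetFirstRevision are hypothetical names. FirstOrDefault is used so that a response without any rev element yields null instead of an exception.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class RevisionDemo
{
    // Returns the text of the first <rev> element anywhere in the document,
    // or null when the document contains no <rev> element.
    public static string GetFirstRevision(string xml)
    {
        XElement rev = XDocument.Parse(xml).Descendants("rev").FirstOrDefault();
        return rev == null ? null : rev.Value;
    }

    public static void Main()
    {
        // Hypothetical response with the api/query/pages/page/revisions/rev nesting.
        string xml =
            "<api><query><pages><page title=\"Panadol\">" +
            "<revisions><rev>Sample wikitext</rev></revisions>" +
            "</page></pages></query></api>";

        Console.WriteLine(GetFirstRevision(xml));
    }
}
```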

Reading specific text from XML files

I have created a small XML tool that gives me the count of specific XML tags across multiple XML files.
The code for this is as follows:
public void SearchMultipleTags()
{
if (txtSearchTag.Text != "")
{
try
{
//string str = null;
//XmlNodeList nodelist;
string folderPath = textBox2.Text;
DirectoryInfo di = new DirectoryInfo(folderPath);
FileInfo[] rgFiles = di.GetFiles("*.xml");
foreach (FileInfo fi in rgFiles)
{
int i = 0;
XmlDocument xmldoc = new XmlDocument();
xmldoc.Load(fi.FullName);
//rtbox2.Text = fi.FullName.ToString();
foreach (XmlNode node in xmldoc.GetElementsByTagName(txtSearchTag.Text))
{
i = i + 1;
//
}
if (i > 0)
{
rtbox2.Text += DateTime.Now + "\n" + fi.FullName + " \nInstance: " + i.ToString() + "\n\n";
}
else
{
//MessageBox.Show("No Markup Found.");
}
//rtbox2.Text += fi.FullName + "instances: " + str.ToString();
}
}
catch (Exception)
{
MessageBox.Show("Invalid Path or Empty File name field.");
}
}
else
{
MessageBox.Show("Dont leave field blanks.");
}
}
This code returns the tag counts for the XML files the user selects.
Now I want to search for a particular text in the same way and count its occurrences in the XML files.
Can you suggest code for this using the XML classes?
Thanks and Regards,
Mayur Alaspure

Use LINQ to XML instead. It is simple and a complete replacement for the other XML APIs.
XElement doc = XElement.Load(fi.FullName);
// count of specific XML tags (Descendants(name) also finds direct children of the root)
int XmlTagCount = doc.Descendants(txtSearchTag.Text).Count();
// count of elements whose text equals the search text
int particularTextCount = doc.Descendants().Where(x => x.Value == "text2search").Count();
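
To make the two counts concrete, here is a small self-contained sketch on made-up XML. The class XmlCountDemo and its helpers CountTags and CountText are hypothetical names; the text count is restricted to leaf elements as an assumption, because the Value of a parent element is the concatenation of all its descendants' text.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class XmlCountDemo
{
    // Counts elements with the given tag name at any depth below the root.
    public static int CountTags(XElement root, string tagName)
    {
        return root.Descendants(tagName).Count();
    }

    // Counts leaf elements whose text is exactly the search text.
    public static int CountText(XElement root, string text)
    {
        return root.Descendants().Count(e => !e.HasElements && e.Value == text);
    }

    public static void Main()
    {
        // Hypothetical sample document.
        var root = XElement.Parse(
            "<root><item>apple</item>" +
            "<group><item>pear</item><note>apple</note></group></root>");

        Console.WriteLine(CountTags(root, "item"));  // 2
        Console.WriteLine(CountText(root, "apple")); // 2
    }
}
```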

Use System.Xml.XPath.
XPath supports counting: count(//nodeName)
If you want to count nodes with specific text, try count(//*[text()='Hello'])
See How to get count number of SelectedNode with XPath in C#?
By the way, your function should probably look something more like this:
private int SearchMultipleTags(string searchTerm, string folderPath) { ...
//...
return i;
}

Try using XPath:
//var document = new XmlDocument();
int count = 0;
var nodes = document.SelectNodes(String.Format(@"//*[text()='{0}']", searchTxt));
if (nodes != null)
count = nodes.Count;
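
The XPath count() function mentioned in the answers above can also be evaluated directly, without materializing a node list. The following is a minimal sketch under stated assumptions: the class XPathCountDemo, the helper Count and the sample XML are invented for the example, and the search text is assumed not to contain a single quote.

```csharp
using System;
using System.Xml;
using System.Xml.XPath;

public static class XPathCountDemo
{
    // Evaluates an XPath expression of the form count(...) and returns the
    // result as an int. XPath numbers are doubles, hence the conversion.
    public static int Count(XmlDocument document, string countExpression)
    {
        XPathNavigator navigator = document.CreateNavigator();
        return Convert.ToInt32((double)navigator.Evaluate(countExpression));
    }

    public static void Main()
    {
        // Hypothetical sample document.
        var document = new XmlDocument();
        document.LoadXml(
            "<root><item>apple</item>" +
            "<group><item>pear</item><note>apple</note></group></root>");

        Console.WriteLine(Count(document, "count(//item)"));              // 2
        Console.WriteLine(Count(document, "count(//*[text()='apple'])")); // 2
    }
}
```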

Get the ancestor nodes of a selected XML node using C# into a multiline text box

I'm using C#/.NET. I have an XML file that contains many nodes, and I have loaded it into a tree view. When I select a particular node in the tree view, I want to display all of its ancestors in a multiline text box. Please suggest how to do this.

I'm not really sure what you want, but this might be something to start with.
This extension method builds the XPath to the parent of an XElement node, including the attributes of each ancestor to identify it more precisely.
public static string ToXPath(this XElement element)
{
var current = element.Parent;
string result = "";
while (current != null)
{
string currentDef = current.Name.ToString();
string attribsDef = "";
foreach (var attrib in current.Attributes())
{
attribsDef += " and @" + attrib.Name + "='" + attrib.Value + "'";
}
if (attribsDef.Length > 0)
{
currentDef += "[" + attribsDef.Substring(5) + "]";
}
result = "/" + currentDef + result;
current = current.Parent;
}
return result.Substring(1);
}
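
If only the ancestor names are needed, one per line for a multiline text box, a shorter variant could look like the sketch below. It relies on XElement.Ancestors() from LINQ to XML; the class AncestorListDemo, the helper AncestorLines and the sample document are made up for the illustration, and mapping the selected TreeNode back to its XElement (for example via the node's Tag property) is assumed to be handled elsewhere.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class AncestorListDemo
{
    // Returns the names of all ancestors of the element, root first,
    // separated by line breaks.
    public static string AncestorLines(XElement element)
    {
        // Ancestors() yields the parent first, so reverse to start at the root.
        return string.Join(
            Environment.NewLine,
            element.Ancestors().Reverse().Select(a => a.Name.LocalName));
    }

    public static void Main()
    {
        // Hypothetical sample document.
        var doc = XDocument.Parse(
            "<library><shelf><book><title>Example</title></book></shelf></library>");
        XElement selected = doc.Descendants("title").First();

        // In a WinForms application this string could be assigned to the
        // Text property of a TextBox whose Multiline property is true.
        Console.WriteLine(AncestorLines(selected));
    }
}
```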
