XML Parsing for Mediawiki link

I have this link: http://en.wikipedia.org/w/api.php?format=xml&action=query&titles=panadol&prop=revisions&rvprop=content
I need to get the content inside the rev tag, so I used this code:
private void HttpsCompleted(object sender, DownloadStringCompletedEventArgs e)
{
WebClient wwc = new WebClient();
String xmlStr = "http://en.wikipedia.org/w/api.php?format=xml&action=query&titles=" + medName + "&prop=revisions&rvprop=content";
wwc.DownloadStringCompleted += wwc_DownloadStringCompleted;
wwc.DownloadStringAsync(new Uri(xmlStr));
}
else
{
MessageBox.Show("Couldn't search for medicine!\nCheck the internet connection.");
}
}
catch (Exception)
{
// do nothing
}
}
I am also calling this method:
XNamespace ns = "http://www.w3.org/2005/Atom";
var entry = XDocument.Parse(e.Result);
var xmlData = new xmlWiki();
var g = entry.Element(ns + "rev").Value.ToString();
}
}
catch (Exception f)
{
MessageBox.Show(f.ToString());
}
}
But I am getting a NullReferenceException when the code executes "var g = entry.Element(ns + "rev").Value.ToString();".
Any help would be appreciated. Thank you in advance.

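rev is not a direct child of the document root, so entry.Element(...) returns null and .Value throws. This is the path to it:
api > query > pages > page > revisions > rev
You can use .Descendants() to reach it at any depth. Note that the lookup below uses the plain element name, without the Atom namespace: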
var entry = XDocument.Parse(html);
var g = entry.Descendants("rev").First().Value;
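To make the difference concrete, here is a minimal sketch of the Descendants() lookup. The sample XML is hand-written to follow the path listed above and is not real API output; the class and method names (RevExtractor, FirstRevisionText) are made up for illustration. It uses FirstOrDefault() instead of First() so that a response without any rev element yields null rather than an exception.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class RevExtractor
{
    // Returns the text of the first <rev> element found at any depth,
    // or null when the document contains no <rev> element.
    public static string FirstRevisionText(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        XElement rev = doc.Descendants("rev").FirstOrDefault();
        return rev?.Value;
    }

    public static void Main()
    {
        // Hypothetical sample that mirrors api > query > pages > page > revisions > rev.
        string sample =
            "<api><query><pages><page title=\"Panadol\"><revisions>" +
            "<rev>Sample wikitext</rev>" +
            "</revisions></page></pages></query></api>";

        Console.WriteLine(FirstRevisionText(sample));            // Sample wikitext
        Console.WriteLine(FirstRevisionText("<api />") == null); // True
    }
}
```

In the download handler you would pass e.Result to the method and check the return value for null before using it.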

Related

PDDocument addPage /importPage not working correct

I am using PDFBox 1.8.3 and need to take a set of ZPL labels and convert them all into one PDF. For the conversion (ZPL to PDF) we use Labelary, and that part works correctly. I am basically using PDFBox to "stitch" the individual PDFs together. My source code is very simple:
var outputByteStream = new ByteArrayOutputStream();
var destinationDoc = null;
var doc = null;
for each(var img in images.toArray()) {
log.debug("Working with image : " + img);
if("ZPL".equalsIgnoreCase(img.getFormat()) || "ZPL203".equalsIgnoreCase(img.getFormat())) {
try {
var convertorServiceUrl = "http://labelary ...." + img.getId()+"?labelSize=4x6&density=8dpmm";
var urlObject = new URL(convertorServiceUrl);
var conn = urlObject.openConnection();
conn.connect();
if(destinationDoc == null ) {
var tmpfile = java.io.File.createTempFile(pw +"-"+uniqKey, ".pdf");
var raf = new org.apache.pdfbox.io.RandomAccessFile(tmpfile, "rw");
destinationDoc = PDDocument.load(conn.getInputStream(), raf);
}
else {
doc = PDDocument.load(conn.getInputStream());
if (doc != null && doc.getNumberOfPages() > 0) {
var page = doc.getDocumentCatalog().getAllPages().get(0);
destinationDoc.importPage(page);
}
}
} catch (res) {
log.error("Error message retrieved is " + exceptionMsg);
throw new BaseRuntimeException("Unable to convert the PDF for image with id " + img.getId(), res);
}
}
}
try {
if(destinationDoc != null) {
destinationDoc.save(outputByteStream);
destinationDoc.close();
}
} catch (e1) {
log.error("Error in writing the document to the output stream " + pw + "." , e1);
throw e1;
}
return outputByteStream.toByteArray();
The code runs and generates a PDF, but every page of the PDF shows the first page. So if my for loop runs 4 times, all 4 pages of the PDF show the first label.
If I use addPage instead, like below:
var outputByteStream = new ByteArrayOutputStream();
var destinationDoc = new PDDocument();
var doc = null;
for each(var img in images.toArray()) {
log.debug("Working with image : " + img);
if("ZPL".equalsIgnoreCase(img.getFormat()) || "ZPL203".equalsIgnoreCase(img.getFormat())) {
try {
var convertorServiceUrl = "http://labelary ...." + img.getId()+"?labelSize=4x6&density=8dpmm";
var urlObject = new URL(convertorServiceUrl);
var conn = urlObject.openConnection();
conn.connect();
doc = PDDocument.load(conn.getInputStream());
if (doc != null && doc.getNumberOfPages() > 0) {
var page = doc.getDocumentCatalog().getAllPages().get(0);
destinationDoc.addPage(page);
}
} catch (res) {
log.error("Error message retrieved is " + exceptionMsg);
throw new BaseRuntimeException("Unable to convert the PDF for image with id " + img.getId(), res);
}
}
}
try {
if(destinationDoc != null) {
destinationDoc.save(outputByteStream);
destinationDoc.close();
}
} catch (e1) {
log.error("Error in writing the document to the output stream " + pw + "." , e1);
throw e1;
}
return outputByteStream.toByteArray();
Then the result is a PDF in which all pages are empty.
I have already verified that the content returned from the Labelary service is correct: if I simply take the response and save it to a file, it works. Saving the PDDocument one page at a time also produces a correct PDF.
The problem is with the "stitching" of the PDFs. According to the documentation it should work, but I am not sure what I am doing wrong.

Get only child nodes of a parent node

I am trying to work with Html Agility Pack. The basics work fine, but when I try to get the child nodes of one part of the page, I get every node with the class 'dealer-offer', regardless of which parent node it is under.
Here is the code I use:
private void getListOfDiv(string html, string classname)
{
if (html != null)
{
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var divProduktkategorie = doc.DocumentNode.SelectSingleNode("//div[@class='" + classname + "']");
//this.txtHtmlCode.Text = divProduktkategorie.InnerHtml;
//return;
int i = 1;
foreach( var divAngebote in divProduktkategorie.SelectNodes("//div[@class='dealer-offer']"))
{
this.listBox1.Items.Add(i + ": " + classname);
this.txtHtmlCode.AppendText(divAngebote.OuterHtml);
i++;
}
}
}
When I write divProduktkategorie to the output field, I get only the 3 items that sit under this single node. But when I run the loop, I get every node with the class 'dealer-offer', not only those 3 items.
Where is my mistake? I couldn't find it myself.
Thanks for helping.

Select the 3 nodes directly with one XPath that is scoped to the category div, then loop over the result. Don't run a second "//" query on divProduktkategorie: an XPath that starts with "//" searches the whole document, no matter which node you call it on.
private void getListOfDiv(string html, string classname)
{
    if (html != null)
    {
        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(html);
        // SelectNodes returns every match, or null when nothing matches.
        var divAngebote = doc.DocumentNode.SelectNodes("//div[@class='" + classname + "']//div[@class='dealer-offer']");
        if (divAngebote == null)
        {
            return;
        }
        int i = 1;
        foreach (var divAngebot in divAngebote)
        {
            this.listBox1.Items.Add(i + ": " + classname);
            this.txtHtmlCode.AppendText(divAngebot.OuterHtml);
            i++;
        }
    }
}
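An alternative that keeps the two-step structure of the question is a relative XPath: a leading "." anchors the query at the node it is called on. The sketch below assumes the HtmlAgilityPack package is referenced; the class name DealerOffers, the method CountOffers, and the sample markup are invented for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class DealerOffers
{
    // Counts the 'dealer-offer' divs that sit below the div with the given class.
    public static int CountOffers(string html, string classname)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var category = doc.DocumentNode.SelectSingleNode("//div[@class='" + classname + "']");
        if (category == null)
        {
            return 0;
        }

        // ".//" searches only below 'category'; a plain "//" would search the whole document.
        var offers = category.SelectNodes(".//div[@class='dealer-offer']");
        return offers == null ? 0 : offers.Count;
    }

    public static void Main()
    {
        // Hypothetical markup with two categories.
        string html =
            "<div class='cat-a'><div class='dealer-offer'>A1</div><div class='dealer-offer'>A2</div></div>" +
            "<div class='cat-b'><div class='dealer-offer'>B1</div></div>";

        Console.WriteLine(CountOffers(html, "cat-a")); // 2
        Console.WriteLine(CountOffers(html, "cat-b")); // 1
    }
}
```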

How to extract the text values of a given attribute using Xpath?

I want to extract the text of the content attribute using XPath.
<meta name="keywords" content="football,cricket,Rugby,Volleyball">
I want to select only "football,cricket,Rugby,Volleyball".
I'm using C# and HtmlAgilityPack.
This is how I tried to do it, but it did not work:
private void scrapBtn_Click(object sender, EventArgs e)
{
string url = urlTextBox.Text;
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load(url);
try
{
var node = doc.DocumentNode.SelectSingleNode("//head/title/text()");
var node1 = doc.DocumentNode.SelectSingleNode("//head/meta[@name='DESCRIPTION']/@content");
try
{
label4.Text = "Title:";
label4.Text += "\t"+node.Name.ToUpper() + ": " + node.OuterHtml;
}
catch (NullReferenceException)
{
MessageBox.Show(url + "does not contain <Title>", "Oppz, Sorry");
}
try
{
label4.Text += "\nMeta Keywords:";
label4.Text += "\n\t" + node1.Name.ToUpper() + ": " + node1.OuterHtml;
}
catch (NullReferenceException)
{
MessageBox.Show(url + "does not contain <meta='Keywords'>", "Oppz, Sorry");
}
}
catch(Exception ex){
MessageBox.Show(ex.StackTrace, "Oppz, Sorry");
}
}

With HTML Agility Pack you can use doc.DocumentNode.SelectSingleNode("/html/head/meta[@name = 'keywords']").Attributes["content"].Value. I think its XPath support for attribute nodes is a bit odd, so it is better to select the element, then use the Attributes property to pick the attribute and the Value property to read its value. If you want to use pure XPath to get the attribute value as a string, use doc.CreateNavigator().Evaluate("string(/html/head/meta[@name = 'keywords']/@content)").

You can use string() to get just the value. An attribute node has no text() children, so apply string() to the attribute itself:
string(//head/meta[@name='keywords']/@content)
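To show the XPath expression on its own, the sketch below evaluates it with the standard System.Xml.XPath classes instead of HTML Agility Pack. That is a deliberate swap so the example needs no extra package, and it only works here because the sample is well-formed XML (note the self-closing meta tag); real-world HTML usually is not. The names MetaKeywords and ReadKeywords are hypothetical.

```csharp
using System;
using System.IO;
using System.Xml.XPath;

public static class MetaKeywords
{
    // Evaluates string(...) on the content attribute of <meta name="keywords">.
    // string() of an empty node-set is "", so a missing tag gives an empty string.
    public static string ReadKeywords(string xml)
    {
        XPathNavigator navigator = new XPathDocument(new StringReader(xml)).CreateNavigator();
        return (string)navigator.Evaluate("string(/html/head/meta[@name='keywords']/@content)");
    }

    public static void Main()
    {
        string page =
            "<html><head>" +
            "<meta name=\"keywords\" content=\"football,cricket,Rugby,Volleyball\" />" +
            "</head></html>";

        Console.WriteLine(ReadKeywords(page)); // football,cricket,Rugby,Volleyball
    }
}
```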

GoogleDrive C# SDK: Too much items in FileList

When I retrieve all files (and folders) of my Google Drive account I should get about 1500 list elements back, but I get a bit more than 3000. I looked into the list and found that some files appear in it 2-3 times. Why is that?
Here is the code I use to retrieve the files:
public async Task<List<File>> RetrieveAllFilesAsList(DriveService service, string query = null)
{
List<File> result = new List<File>();
FilesResource.ListRequest request = service.Files.List();
if (query != null)
{
request.Q = query;
}
do
{
try
{
FileList files = await request.ExecuteAsync();
result.AddRange(files.Items);
request.PageToken = files.NextPageToken;
}
catch (Exception e)
{
Console.WriteLine("An error occurred (from RetrieveAllFilesAsList): " + e.Message);
request.PageToken = null;
}
}
while (!String.IsNullOrEmpty(request.PageToken));
return result;
}
Update1:
public async Task<List<File>> RetrieveAllFilesAsList(DriveService service, string query = null)
{
List<File> result = new List<File>();
FilesResource.ListRequest request = service.Files.List();
request.MaxResults = 1000;
if (query != null)
{
request.Q = query + " AND trashed=false";
}
else
{
request.Q = "trashed=false";
}
do
{
try
{
FileList files = await request.ExecuteAsync();
result.AddRange(files.Items);
request.PageToken = files.NextPageToken;
}
catch (Exception e)
{
Console.WriteLine("An error occurred (from RetrieveAllFilesAsList): " + e.Message);
request.PageToken = null;
}
}
while (!String.IsNullOrEmpty(request.PageToken));
int i;
for (i = 0; i < result.Count; i++ )
{
System.IO.File.AppendAllText(@"C:\Users\carl\Desktop\log.txt", result[i].Id + "\t" + result[i].Title + "\t" + result[i].ExplicitlyTrashed.ToString() + "\r\n");
}
// prints 3120 Lines
System.IO.File.AppendAllText(@"C:\Users\carl\Desktop\log.txt", "" + i + Environment.NewLine);
//Count = 3120
System.IO.File.AppendAllText(@"C:\Users\carl\Desktop\log.txt", "" + result.Count);
return result;
}
Word failed to give me the right line count, so I counted in my function instead.
But I can still find the same file Id 2-3 times in the log file.

I cannot comment yet, so I am posting this as an answer. According to the Google API documentation:
"Note: This method returns all files by default. This includes files with trashed=true in the results. Use the trashed=false query parameter to filter these from the results."
So can you check which URL of the REST API is actually being called? It seems you need to put some filters on the List method.

Why does XDocument.Parse throw NotSupportedException?

I am trying to parse XML data using XDocument.Parse, which throws NotSupportedException, just like in the topic "Is XDocument.Parse different in Windows Phone 7?". I updated my code according to the advice posted there, but it still doesn't help. Some time ago I parsed RSS using a similar (but simpler) method and that worked just fine.
public void sList()
{
WebClient client = new WebClient();
client.Encoding = Encoding.UTF8;
string url = "http://eztv.it";
Uri u = new Uri(url);
client.DownloadStringAsync(u);
client.DownloadStringCompleted += new DownloadStringCompletedEventHandler(client_DownloadStringCompleted);
}
private void client_DownloadStringCompleted(object sender, DownloadStringCompletedEventArgs e)
{
try
{
string s = e.Result;
s = cut(s);
XmlReaderSettings settings = new XmlReaderSettings();
settings.DtdProcessing = DtdProcessing.Ignore;
XDocument document = null;// XDocument.Parse(s);//Load(s);
using (XmlReader reader = XmlReader.Create(new StringReader(e.Result), settings))
{
document = XDocument.Load(reader); // error thrown here
}
// ... rest of code
}
catch (Exception ex)
{
MessageBox.Show( ex.Message);
}
}
string cut(string s)
{
int iod = s.IndexOf("<select name=\"SearchString\">");
int ido = s.LastIndexOf("</select>");
s = s.Substring(iod, ido - iod + 9);
return s;
}
When I substitute the following for string s:
//string s = "<select name=\"SearchString\"><option value=\"308\">10 Things I Hate About You</option><option value=\"539\">2 Broke Girls</option></select>";
everything works and no exception is thrown, so what am I doing wrong?

There are special symbols like '&' in e.Result.
I just tried replacing these symbols (everything except '<', '>' and '"') with HttpUtility.HtmlEncode(), and XDocument parsed it.
UPD:
I didn't want to show my code, but you left me no choice :)
string y = "";
for (int i = 0; i < s.Length; i++)
{
    // Keep the markup characters as they are; encode everything else.
    if (s[i] == '<' || s[i] == '>' || s[i] == '"')
    {
        y += s[i];
    }
    else
    {
        y += HttpUtility.HtmlEncode(s[i].ToString());
    }
}
XDocument document = XDocument.Parse(y);
var options = (from option in document.Descendants("option")
               select option.Value).ToList();
It works for me on WP7. Please do not use this code for HTML conversion; I wrote it quickly just for test purposes.
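A narrower variant of the same idea is to escape only the ampersands that do not already start an entity, and leave every other character alone. This is a sketch of a different technique from the loop above, not a general HTML-to-XML converter: named HTML entities such as &nbsp; are left untouched and would still be rejected by the XML parser. The names AmpersandFixer, EscapeBareAmpersands and OptionTexts, and the sample markup, are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class AmpersandFixer
{
    // Matches '&' unless it begins something shaped like an entity (&name; &#123; &#x1F;).
    private static readonly Regex BareAmpersand =
        new Regex(@"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    public static string EscapeBareAmpersands(string markup)
    {
        return BareAmpersand.Replace(markup, "&amp;");
    }

    // Escapes the markup, parses it, and returns the text of every <option>.
    public static List<string> OptionTexts(string markup)
    {
        XDocument document = XDocument.Parse(EscapeBareAmpersands(markup));
        return document.Descendants("option").Select(o => o.Value).ToList();
    }

    public static void Main()
    {
        // Hypothetical fragment with one unescaped ampersand.
        string s = "<select name=\"SearchString\">" +
                   "<option value=\"1\">Law & Order</option>" +
                   "<option value=\"2\">2 Broke Girls</option>" +
                   "</select>";

        Console.WriteLine(string.Join(" | ", OptionTexts(s))); // Law & Order | 2 Broke Girls
    }
}
```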
