C# Trying to read a page using XmlNode - c#

I am trying to read the Steam market listing sorted from the lowest price to the highest. I have the URL I need, and I have written some code that worked in the past but no longer works. I have spent several days trying to fix this, but I cannot find the cause.
Link I am trying to read.
Here is the code.
//List of items from the Steam market from lowest to highest
private void priceFromMarket(int StartPage)
{
if (valueList.Count != 0)
{
valueList.Clear();
numList.Clear();
nameList.Clear();
}
string pageContent = null;
string results_html = null;
try
{
HttpWebRequest myReq = (HttpWebRequest)WebRequest.Create("http://steamcommunity.com/market/search/render/?query=appid:730&start=" + StartPage.ToString() + "&sort_column=price&sort_dir=asc&count=100&currency=1&l=english");
HttpWebResponse myRes = (HttpWebResponse)myReq.GetResponse();
using (StreamReader sr = new StreamReader(myRes.GetResponseStream()))
{
pageContent = sr.ReadToEnd();
}
}
catch { Thread.Sleep(30000); priceFromMarket(StartPage); }
if (pageContent == null) { priceFromMarket(StartPage); }
try
{
JObject user = JObject.Parse(pageContent);
bool success = (bool)user["success"];
if (success)
{
results_html = (string)user["results_html"];
string data = results_html;
data = "<root>" + data + "</root>";
XmlDocument document = new XmlDocument();
document.LoadXml(System.Net.WebUtility.HtmlDecode(data));
XmlNode rootnode = document.SelectSingleNode("root");
XmlNodeList items = rootnode.SelectNodes("./a/div");
foreach (XmlNode node in items)
{
//This does not work anymore!
//The try fails here at line 574!
string value = node.SelectSingleNode("./div[contains(concat(' ', @class, ' '), ' market_listing_their_price ')]/span/span").InnerText;
string num = node.SelectSingleNode("./div[contains(concat(' ', @class, ' '), ' market_listing_num_listings ')]/span/span").InnerText;
string name = node.SelectSingleNode("./div/span[contains(concat(' ', @class, ' '), ' market_listing_item_name ')]").InnerText;
valueList.Add(value); //Lowest price for the item
numList.Add(num); //Volume of that item
nameList.Add(name); //Name of that item
}
}
else { Thread.Sleep(60000); priceFromMarket(StartPage); }
}
catch { Thread.Sleep(60000); priceFromMarket(StartPage); }
}

Parsing HTML as XML is never reliable, because HTML does not have to be well-formed XML for a browser to parse it properly.
For parsing HTML in C#, I prefer CsQuery: https://www.nuget.org/packages/CsQuery/
It lets you query HTML in C# much as you would with jQuery.
Another option is HTML Agility Pack, which you could probably use without changing much of your code, since its functions are similar to those of the System.Xml.XmlDocument library.
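To see why the XmlDocument approach is fragile, here is a minimal sketch using only System.Xml. The class and method names are made up for illustration. It wraps a fragment in <root> the same way the question does and reports whether LoadXml accepts it. An unclosed tag or a bare & is acceptable in HTML but makes LoadXml throw an XmlException. Note also that calling HtmlDecode before LoadXml, as the question does, turns &amp; into exactly such a bare &.

```csharp
using System;
using System.Xml;

// Hypothetical helper, not part of the question's code.
public static class WellFormedCheck
{
    // Returns true when the fragment, wrapped in <root>, loads as XML.
    public static bool LoadsAsXml(string fragment)
    {
        try
        {
            var document = new XmlDocument();
            document.LoadXml("<root>" + fragment + "</root>");
            return true;
        }
        catch (XmlException)
        {
            // Typical causes: an unclosed tag such as <img ...>, or a bare '&'.
            return false;
        }
    }
}
```

If a check like this returns false for the results_html you receive, that would explain why the XmlNode code stopped working once the markup changed.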

Related

Read XML File Using Linq is not reading element

I am not able to get the value from this XML response. I would appreciate any help.
<Response>
<Result>
<Item1>GREEN</Item1>
<Item2>05/19/2017 22:08:14</Item2>
</Result>
<Other>
<Id>xxxxxxxxxxxxc</Id>
</Other>
</Response>
This is what I have tried so far, but the result is empty:
string responseXml = response.ToXML();
XElement doc = XElement.Load(new StringReader(responseXml));
var results = from p in
doc.Descendants("Result")
select new
{
item = p.Element("Item1").Value,
};
foreach (var elm in results)
{
Console.WriteLine(elm.item);
}
Use Parse instead of Load. You may also be getting an error because of extra characters in the string. The string you posted contains single quotes; I am not sure whether they are in the actual string you are using.
string responseXml = "<Response>" +
"<Result>" +
"<Item1>GREEN</Item1>" +
"<Item2>05/19/2017 22:08:14</Item2>" +
"</Result>" +
"<Other>" +
"<Id>xxxxxxxxxxxxc</Id>" +
"</Other>" +
"</Response>";
XElement doc = XElement.Parse(responseXml);
var results = from p in
doc.Descendants("Result")
select new
{
item = p.Element("Item1").Value,
};
foreach (var elm in results)
{
Console.WriteLine(elm.item);
}
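As a side note, p.Element("Item1").Value throws a NullReferenceException when a Result has no Item1 child. A minimal sketch of a more forgiving version follows; the class and method names are made up for illustration. It relies on the explicit string conversion that XElement defines, which yields null for a missing element instead of throwing.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of the original answer.
public static class ResultReader
{
    // Returns the Item1 text of every Result element that has one.
    public static List<string> ReadItem1(string responseXml)
    {
        XElement doc = XElement.Parse(responseXml);
        return doc.Descendants("Result")
                  .Select(p => (string)p.Element("Item1")) // null when Item1 is missing
                  .Where(value => value != null)
                  .ToList();
    }
}
```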

My function to get text between two strings isn't finding the correct words

I am creating an application that fetches information about a website. I have tried several approaches to getting the information out of the HTML tags. The website is who.is, and I am trying to get information about Google (as a test). The source can be viewed at view-source:https://who.is/whois/google.com/ (in Chrome).
The problem is that I am trying to get the name of the creator of the website (Mark or something), but I am not getting the correct result. My code:
//GET name
string getName = source;
string nameBegin = "<div class=\"col-md-4 queryResponseBodyKey\">Name</div><div class=\"col-md-8 queryResponseBodyValue\">";
string nameEnd = "</div>";
int nameStart = getName.IndexOf(nameBegin) + nameBegin.Length;
int nameIntEnd = getName.IndexOf(nameEnd, nameStart);
string creatorName = getName.Substring(nameStart, nameIntEnd - nameStart);
lb_name.Text = creatorName;
(source contains the HTML of the page.)
This does not produce the correct result, though. I suspect it has something to do with the backslashes I use to escape the double quotes inside the string.
What am I doing wrong?
Instead of trying to parse the HTML manually, use a real HTML parser such as HtmlAgilityPack:
using (var client = new HttpClient())
{
var html = await client.GetStringAsync("https://who.is/whois/google.com/");
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var nodes = doc.DocumentNode.SelectNodes("//*[@class='col-md-4 queryResponseBodyKey']");
var results = nodes.ToDictionary(n=>n.InnerText, n=>n.NextSibling.NextSibling.InnerText);
//print
foreach(var kv in results)
{
Console.WriteLine(kv.Key + " => " + kv.Value);
}
}
string getName = "<div class=\"col-md-4 queryResponseBodyKey\">Name</div><div class=\"col-md-8 queryResponseBodyValue\">";
string nameBegin = "<div class=\"col-md-4 queryResponseBodyKey\">";
string nameEnd = "</div>";
int nameStart = getName.IndexOf(nameBegin) + nameBegin.Length;
int nameIntEnd = getName.IndexOf(nameEnd, nameStart);
string creatorName = getName.Substring(nameStart, nameIntEnd - nameStart);
//lb_name.Text = creatorName;
Console.WriteLine(creatorName);
Console.ReadLine();
Is this what you are looking for, to get "Name" from that div?
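One thing worth checking in the IndexOf approach is the case where a marker is not found. IndexOf then returns -1, so adding nameBegin.Length to it silently produces a wrong start position rather than an error. A minimal sketch that guards against this follows; the class and method names are made up for illustration.

```csharp
using System;

// Hypothetical helper, not part of the original answers.
public static class TextBetween
{
    // Returns the text between the first 'begin' marker and the next 'end'
    // marker, or null when either marker is missing from 'source'.
    public static string Extract(string source, string begin, string end)
    {
        int beginIndex = source.IndexOf(begin, StringComparison.Ordinal);
        if (beginIndex < 0) return null; // begin marker not present

        int start = beginIndex + begin.Length;
        int stop = source.IndexOf(end, start, StringComparison.Ordinal);
        if (stop < 0) return null; // end marker not present after begin

        return source.Substring(start, stop - start);
    }
}
```

If this returns null for the who.is page, the begin marker does not appear in the downloaded HTML exactly as typed (for example because of whitespace between the two divs), which is a common reason for this kind of code to fail.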

How to access and replace text in certain paragraphs using OPENXML powertools case by case

I am trying to redact some Word files using C# and Open XML. I need to do a controlled replacement of certain numbers with a given phrase. Each Word file contains a different amount of information. I want to use Open XML PowerTools for this purpose.
I used the plain Open XML SDK to do the replacement, but it is very unreliable and produces random errors such as a zero-length error. I then used a regex replace, which seems to work, but it replaces throughout the whole document, which is highly undesirable.
Here is a snippet of the code:
private void redact_Replaceall(string wfile)
{
try
{
using (WordprocessingDocument doc = WordprocessingDocument.Open(wfile, true))
{
var ydoc = doc.MainDocumentPart.GetXDocument();
IEnumerable<XElement> content = ydoc.Descendants(W.body);
Regex regex = new Regex(@"\d+\.\d{2,3}");
int count1 = OpenXmlPowerTools.OpenXmlRegex.Match(content, regex);
int count2 = OpenXmlPowerTools.OpenXmlRegex.Replace(content, regex, replace_text, null);
statusBar1.Text = "Try 1: Found: " + count1 + ", Replaced: " + count2;
doc.MainDocumentPart.PutXDocument();
}
}
catch(Exception e)
{
MessageBox.Show("Replace all experienced error: " + e.Message);
}
}
Basically, I want to do this redaction based on the content of each paragraph. I am able to get the paragraphs using the following, but not their ids:
IEnumerable<XElement> content = ydoc.Descendants(W.p);
Here is my approach using the plain Open XML SDK, but I get a lot of errors depending on the file.
foreach (DocumentFormat.OpenXml.Wordprocessing.Paragraph para in bod.Descendants<DocumentFormat.OpenXml.Wordprocessing.Paragraph>())
{
foreach (var run in para.Elements<Run>())
{
foreach (var text in run.Elements<Text>())
{
string temp = text.Text;
int firstlength = first.Length + 1;
int secondlength = second.Length + 1;
if (text.Text.Contains(first) && !(temp.Length > firstlength))
{
text.Text = text.Text.Replace(first, "DELETED");
}
if (text.Text.Contains(second) && !(temp.Length > secondlength))
{
text.Text = text.Text.Replace(second, "DELETED");
}
}
}
}
Here is my latest approach, but I am stuck on it:
private void redact_Replacebadones(string wfile)
{
try
{
using (WordprocessingDocument doc = WordprocessingDocument.Open(wfile, true))
{
var ydoc = doc.MainDocumentPart.GetXDocument();
/* from XElement xele in ydoc.Root.Elements();
List<string> lhsElements = xele.Elements("lhs")
.Select(el => el.Attribute("id").Value)
.ToList();
*/
/// XElement
IEnumerable<XElement> content = ydoc.Descendants(W.p);
foreach (var p in content )
{
if (p.Value.Contains("each") && !p.Value.Contains("DELETED"))
{
string to_overwrite = p.Value;
Regex regexop = new Regex(@"\d+\.\d{2,3}");
regexop.Replace(to_overwrite, "Deleted");
p.SetValue(to_overwrite);
MessageBox.Show("NAME :" + p.GetParagraphInfo() +" VValue:"+to_overwrite);
}
}
doc.MainDocumentPart.PutXDocument();
}
}
catch (Exception e)
{
MessageBox.Show("Replace each experienced error: " + e.Message);
}
}
This may be a bit late. Open XML PowerTools by Eric White has a function, SearchAndReplace, that replaces text content, so you do not have to handle it with Regex.
This function also handles text that is split across runs. (When you edit a word in Word, it can be split into several runs, so you do not find the search phrase directly.)
Maybe this helps somebody.
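Separately from the library choice, note that Regex.Replace returns a new string and leaves its input unchanged, so in the last approach the result of regexop.Replace(to_overwrite, "Deleted") has to be assigned back before it is used. A minimal sketch of the per-paragraph decision on plain strings follows. The class and method names are made up for illustration, and it does not touch the Open XML document itself; calling SetValue on a paragraph element would also replace its child runs with plain text.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper working on plain paragraph text only.
public static class Redactor
{
    // Same pattern as in the question: digits, a dot, then two or three digits.
    private static readonly Regex Amount = new Regex(@"\d+\.\d{2,3}");

    // Redacts amounts only in paragraphs that contain the keyword.
    public static string Redact(string paragraphText, string keyword, string replacement)
    {
        if (!paragraphText.Contains(keyword))
        {
            return paragraphText; // paragraph does not qualify, leave it alone
        }
        // Replace returns the modified copy; the input string is not changed.
        return Amount.Replace(paragraphText, replacement);
    }
}
```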

Web scraper using HtmlAgilityPack

I am new to C#, so this may be very obvious or far too complex for me, but I am trying to scrape a web page using HtmlAgilityPack. My code compiles, but when I write out the string I get only one result, and it is the last li in the ul. The reason for the string split is so that I can eventually output the title and description strings to a .csv file for further use. I am unsure what to do next, so any help or suggestions would be appreciated. Thank you!
private void button1_Click(object sender, EventArgs e)
{
List<string> cities = new List<string>();
//var xpath = "//h2[span/@id='Cities']";
var xpath = "//h2[span/@id='Cities']" + "/following-sibling::ul[1]" + "/li";
WebClient web = new WebClient();
String html = web.DownloadString("http://wikitravel.org/en/Vietnam");
hap.HtmlDocument doc = new hap.HtmlDocument();
doc.LoadHtml(html);
foreach (hap.HtmlNode node in doc.DocumentNode.SelectNodes(xpath))
{
string all = node.InnerText;
//splits text between '—', '-' or ' ' into 2 parts
string[] split = all.Split(new char[] { '—', ' ', '-' }, StringSplitOptions.None);
string title;
string description;
int nodeCount;
nodeCount = node.ChildNodes.Count;
if (nodeCount == 2)
{
title = node.ChildNodes[0].InnerText;
description = node.ChildNodes[1].InnerText;
}
else if (nodeCount == 4)
{
title = node.ChildNodes[0].InnerText;
description = node.ChildNodes[1].InnerText + node.ChildNodes[2].InnerText;
}
else
{
title = "Error";
description = "The node count was not 2 or 4. Check the div section.";
}
System.IO.StreamWriter write = new System.IO.StreamWriter(@"C:\Users\cbrannin\Desktop\textTest\testText.txt");
write.WriteLine(all);
write.Close();
}
}
}
One problem is that you're overwriting the output file each time through the loop. You probably want to do this:
using (StreamWriter write = new StreamWriter(@"filename"))
{
foreach (hap.HtmlNode node in doc.DocumentNode.SelectNodes(xpath))
{
// do your thing
write.WriteLine(all);
}
}
Also, have you single-stepped through this to see whether you are getting more than one HtmlNode from your SelectNodes call?
Finally, I don't see where you do anything with the title or description. Were you planning to use those for something else?
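On the split itself: splitting on every space and hyphen breaks the line into many pieces rather than a title and a description. A minimal sketch follows, assuming each list item looks like "Title", then the dash character U+2014 that the question splits on, then the description. The class and method names are made up for illustration. It splits at the first dash only, using the Split overload that takes a maximum count.

```csharp
using System;

// Hypothetical helper, not part of the original question or answer.
public static class CityLine
{
    // Returns { title, description }; description is empty when no dash is found.
    public static string[] Split(string line)
    {
        // '\u2014' is the long dash; the count of 2 stops after the first match.
        string[] parts = line.Split(new[] { '\u2014' }, 2);
        string title = parts[0].Trim();
        string description = parts.Length > 1 ? parts[1].Trim() : string.Empty;
        return new[] { title, description };
    }
}
```

The two returned strings can then be written as one CSV row per list item inside the single using block shown above.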

Reading specific text from XML files

I have created a small XML tool that gives me the count of specific XML tags across multiple XML files.
The code for this is as follows:
public void SearchMultipleTags()
{
if (txtSearchTag.Text != "")
{
try
{
//string str = null;
//XmlNodeList nodelist;
string folderPath = textBox2.Text;
DirectoryInfo di = new DirectoryInfo(folderPath);
FileInfo[] rgFiles = di.GetFiles("*.xml");
foreach (FileInfo fi in rgFiles)
{
int i = 0;
XmlDocument xmldoc = new XmlDocument();
xmldoc.Load(fi.FullName);
//rtbox2.Text = fi.FullName.ToString();
foreach (XmlNode node in xmldoc.GetElementsByTagName(txtSearchTag.Text))
{
i = i + 1;
//
}
if (i > 0)
{
rtbox2.Text += DateTime.Now + "\n" + fi.FullName + " \nInstance: " + i.ToString() + "\n\n";
}
else
{
//MessageBox.Show("No Markup Found.");
}
//rtbox2.Text += fi.FullName + "instances: " + str.ToString();
}
}
catch (Exception)
{
MessageBox.Show("Invalid Path or Empty File name field.");
}
}
else
{
MessageBox.Show("Don't leave the field blank.");
}
}
This code returns the count of the tag the user asks for across multiple XML files.
Now I want to do the same for a particular piece of text: search for it and count how often it occurs in the XML files.
Can you suggest code for this using the XML classes?
Thanks and Regards,
Mayur Alaspure
Use LINQ to XML instead. It is simple and a complete replacement for the other XML APIs.
XElement doc = XElement.Load(fi.FullName);
//count of specific XML tags (Descendants(name) also covers direct children of the root)
int XmlTagCount=doc.Descendants(txtSearchTag.Text).Count();
//count elements whose text equals the search text
int particularTextCount=doc.Descendants().Count(x=>x.Value=="text2search");
You can also use System.Xml.XPath.
XPath supports counting: count(//nodeName)
If you want to count nodes with specific text, try count(//*[text()='Hello'])
See How to get count number of SelectedNode with XPath in C#?
By the way, your function should probably look something more like this:
private int SearchMultipleTags(string searchTerm, string folderPath) { ...
//...
return i;
}
Try using XPath:
//var document = new XmlDocument();
int count = 0;
var nodes = document.SelectNodes(String.Format(@"//*[text()='{0}']", searchTxt));
if (nodes != null)
count = nodes.Count;
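The count() expressions mentioned above can also be evaluated directly through an XPathNavigator. A minimal sketch follows; the class and method names are made up for illustration, and it assumes the search text contains no single quote, since the text is pasted into the XPath expression unescaped.

```csharp
using System;
using System.Xml;
using System.Xml.XPath;

// Hypothetical helper, not part of the original answers.
public static class XPathCount
{
    // Counts elements with the given tag name anywhere in the document.
    public static int CountElements(string xml, string elementName)
    {
        return Evaluate(xml, "count(//" + elementName + ")");
    }

    // Counts elements that have a text node equal to 'text'.
    public static int CountText(string xml, string text)
    {
        if (text.Contains("'"))
        {
            throw new ArgumentException("Single quotes are not handled in this sketch.");
        }
        return Evaluate(xml, "count(//*[text()='" + text + "'])");
    }

    private static int Evaluate(string xml, string expression)
    {
        var document = new XmlDocument();
        document.LoadXml(xml);
        XPathNavigator navigator = document.CreateNavigator();
        // count() is an XPath number, which Evaluate returns as a boxed double.
        return Convert.ToInt32(navigator.Evaluate(expression));
    }
}
```

In the tool from the question, the same two calls could be made per file by loading each document with Load(fi.FullName) instead of LoadXml.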
