Efficient Parsing of XML - c#

Good day,
I'm writing a C# .NET program to manage the products in my store.
From a given link I can retrieve an XML file that contains all the products I can list on my storefront.
The XML structure looks like this:
<Product StockCode="103-10440">
<lastUpdated><![CDATA[Fri, 20 May 2016 17:00:03 GMT]]></lastUpdated>
<StockCode><![CDATA[103-10440]]></StockCode>
<Brand><![CDATA[3COM]]></Brand>
<BrandID><![CDATA[14]]></BrandID>
<ProdName><![CDATA[BIG FLOW BLOWING JUNCTION FLEX BLOCK, TAKES 32, 40]]> </ProdName>
<ProdDesc/>
<Categories>
<TopCat><![CDATA[Accessories]]></TopCat>
<TopCatID><![CDATA[24]]></TopCatID>
</Categories>
<ProdImg/>
<ProdPriceExclVAT><![CDATA[30296.79]]></ProdPriceExclVAT>
<ProdQty><![CDATA[0]]></ProdQty>
<ProdExternalURL><![CDATA[http://pinnacle.eliance.co.za/#!/product/4862]]></ProdExternalURL>
</Product>
Here are the entries I'm looking for:
lastUpdated
StockCode
Brand
ProdName
ProdDesc
TopCat <--- nested in Categories tag.
ProdImg
ProdPriceExclVAT
ProdQty
ProdExternalURL
This is all fine to handle, and in fact I did:
public ProductList Parse() {
XmlDocument doc = new XmlDocument();
doc.Load(XMLLink);
XmlNodeList ProductNodeList = doc.GetElementsByTagName("Product");
foreach (XmlNode node in ProductNodeList) {
Product Product = new Product();
for (int i = 0; i < node.ChildNodes.Count; i++) {
if (node.ChildNodes[i].Name == "StockCode") {
Product.VariantSKU = Convert.ToString(node.ChildNodes[i].InnerText);
}
if (node.ChildNodes[i].Name == "Brand") {
Product.Vendor = Convert.ToString(node.ChildNodes[i].InnerText);
}
if (node.ChildNodes[i].Name == "ProdName") {
Product.Title = Convert.ToString(node.ChildNodes[i].InnerText);
Product.SEOTitle = Product.Title;
Product.Handle = Product.Title;
}
if (node.ChildNodes[i].Name == "ProdDesc") {
Product.Body = Convert.ToString(node.ChildNodes[i].InnerText);
Product.SEODescription = Product.Body;
if (Product.Body == "") {
Product.Body = "ERROR";
Product.SEODescription = "ERROR";
}
}
if (node.ChildNodes[i].Name == "Categories") {
if (!tempList.Categories.Contains(node.ChildNodes[i].ChildNodes[0].InnerText)) {
if (!tempList.Categories.Contains("All")) {
tempList.Categories.Add("All");
}
tempList.Categories.Add(node.ChildNodes[i].ChildNodes[0].InnerText);
}
Product.Type = Convert.ToString(node.ChildNodes[i].ChildNodes[0].InnerText);
}
if (node.ChildNodes[i].Name == "ProdImg") {
Product.ImageSrc = Convert.ToString(node.ChildNodes[i].InnerText);
if (Product.ImageSrc == "") {
Product.ImageSrc = "ERROR";
}
Product.ImageAlt = Product.Title;
}
if (node.ChildNodes[i].Name == "ProdPriceExclVAT") {
float baseprice = float.Parse(node.ChildNodes[i].InnerText);
double Costprice = ((baseprice * 0.14) + (baseprice * 0.15) + baseprice);
Product.VariantPrice = Costprice.ToString("0.##");
}
}
Product.Supplier = "Pinnacle";
if (!tempList.Suppliers.Contains(Product.Supplier)) {
tempList.Suppliers.Add(Product.Supplier);
}
tempList.Products.Add(Product);
}
return tempList;
}
}
The problem, however, is that doing it this way takes about 10 seconds to finish, and this is only the first of several such files that I have to parse.
I am looking for the most efficient way to parse this XML file and get the data for all the fields I mentioned above.
EDIT:
I benchmarked the code both with a pre-downloaded copy of the file and when downloading the file from the server at runtime:
With local copy: 5 seconds.
Without local copy: 7.30 seconds.

With large XML files, use an XmlReader, which streams the document instead of loading it all into memory. The code below reads one Product at a time.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
XmlReader reader = XmlReader.Create("filename");
while(!reader.EOF)
{
if (reader.Name != "Product")
{
reader.ReadToFollowing("Product");
}
if (!reader.EOF)
{
XElement product = (XElement)XElement.ReadFrom(reader);
string lastUpdated = (string)product.Element("lastUpdated");
}
}
}
}
}
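To connect the reader loop above to the fields listed in the question, here is a minimal sketch. The names ProductStream, ReadProducts and ToFields are made up for illustration, and the dictionary stands in for the asker's own Product class, which is not shown in full.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class ProductStream
{
    // Yields one <Product> element at a time without loading the whole file.
    public static IEnumerable<XElement> ReadProducts(TextReader input)
    {
        using (XmlReader reader = XmlReader.Create(input))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "Product")
                {
                    // ReadFrom consumes the element and leaves the reader on the
                    // node after it, so no extra Read() is needed in this branch.
                    yield return (XElement)XNode.ReadFrom(reader);
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }

    // Hypothetical helper: copies the fields named in the question into a dictionary.
    // Casting a missing element to string yields null instead of throwing.
    public static Dictionary<string, string> ToFields(XElement product)
    {
        return new Dictionary<string, string>
        {
            ["lastUpdated"] = (string)product.Element("lastUpdated"),
            ["StockCode"] = (string)product.Element("StockCode"),
            ["Brand"] = (string)product.Element("Brand"),
            ["ProdName"] = (string)product.Element("ProdName"),
            ["ProdDesc"] = (string)product.Element("ProdDesc"),
            ["TopCat"] = (string)product.Element("Categories")?.Element("TopCat"),
            ["ProdImg"] = (string)product.Element("ProdImg"),
            ["ProdPriceExclVAT"] = (string)product.Element("ProdPriceExclVAT"),
            ["ProdQty"] = (string)product.Element("ProdQty"),
            ["ProdExternalURL"] = (string)product.Element("ProdExternalURL")
        };
    }
}
```

For a file on disk you could pass a StreamReader, for example ProductStream.ReadProducts(new StreamReader("filename")). Whether this is actually faster than XmlDocument for your file is something to measure rather than assume.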

Related

Loop through specific Node of XML in C#

I have researched a lot but I cannot find solution to my particular problem. I have to read an external xml file in C# and read the values in an Object. Here is the snapshot of my xml file:
<DatabaseList>
<DatabaseDetails>
<ConnectionId>1</ConnectionId>
<ConnectionName>MyConn1</ConnectionName>
<ServerConnection xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<CobConnection>
<CobConnection />
<ConnectionType>MSSQL</ConnectionType>
<Database />
<Server />
</CobConnection>
<ConnectionType>MSSQL</ConnectionType>
<Database>MyDB1</Database>
<Port>2431</Port>
<Server>MyServerName1</Server>
</ServerConnection>
</DatabaseDetails>
<DatabaseDetails>
<ConnectionId>2</ConnectionId>
<ConnectionName>MyConn2</ConnectionName>
<ServerConnection xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<CobConnection>
<CobConnection />
<ConnectionType>MSSQL</ConnectionType>
<Database />
<Server />
</CobConnection>
<ConnectionType>MSSQL</ConnectionType>
<Database>MyDB2</Database>
<Port>2431</Port>
<Server> MyServerName2</Server>
</ServerConnection>
</DatabaseDetails>
</DatabaseList>
For example, if ConnectionName = MyConn2 is passed to the procedure below, the code should read the values for MyConn2. I'm selecting xmlNodesLvl2 correctly, but the selection starts from the beginning of the file. I need to read the values under the node found in the previous step: for that particular Database.ConnectionName, I need the node values for Database, ConnectionType, Server, etc. Instead, the next step starts again from the beginning. I have marked the spot in my code with the comment //Problem here.
public static void GetInfo(string ConnectionName)
{
XmlDocument xmlDoc = new XmlDocument();
bool bfound = false;
xmlDoc.Load(@"C:\path..\Database.xml");
XmlNodeList xmlNodesLvl1 = xmlDoc.SelectNodes("DatabaseList/DatabaseDetails");
foreach (XmlNode xmlNode in xmlNodesLvl1)
{
if (xmlNode.HasChildNodes)
{
foreach (XmlNode item in xmlNode.ChildNodes)
{
string tagName = item.Name;
if (tagName == "ConnectionId")
{
Database.ConnectionId = item.InnerText;
}
if (tagName == "ConnectionName")
{
if (item.InnerText == ConnectionName)
{
Database.ConnectionName = item.InnerText;
bfound = true;
XmlNodeList xmlNodesLvl2 = null;
//Problem here
if (Enviroment == "COB")
{
xmlNodesLvl2 = xmlDoc.SelectNodes("DatabaseList/DatabaseDetails/ServerConnection/CobConnection");
}
else
{
xmlNodesLvl2 = xmlDoc.SelectNodes("DatabaseList/DatabaseDetails/ServerConnection");
}
foreach (XmlNode xmlNodeLvl2 in xmlNodesLvl2)
{
if (xmlNodeLvl2.HasChildNodes)
{
foreach (XmlNode itemLvl2 in xmlNodeLvl2.ChildNodes)
{
if (itemLvl2.Name == "CobConnection")
{
Database.CobConnection = itemLvl2.InnerText;
}
if (itemLvl2.Name == "Database")
{
Database.Name = itemLvl2.InnerText;
}
if (itemLvl2.Name == "ConnectionType")
{
Database.ConnectionType = itemLvl2.InnerText;
}
if (itemLvl2.Name == "Server")
{
Database.Server = itemLvl2.InnerText;
}
}
if (bfound == true)
{
break;
}
}
}
if (bfound == true)
{
break;
}
}
}
}
if (bfound == true)
{
break;
}
}
}
}
Please advise!
Try putting the values in a DataTable:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Data;
using System.Xml;
using System.Xml.Linq;
namespace ConsoleApplication1
{
class Program
{
const string FILENAME = @"c:\temp\test.xml";
static void Main(string[] args)
{
XDocument doc = XDocument.Load(FILENAME);
DataTable dt = new DataTable();
dt.Columns.Add("ConnectionId",typeof(string));
dt.Columns.Add("ConnectionName",typeof(string));
dt.Columns.Add("CobConnection",typeof(string));
dt.Columns.Add("CobConnectionType",typeof(string));
dt.Columns.Add("CobDatabase",typeof(string));
dt.Columns.Add("CobServer",typeof(string));
dt.Columns.Add("ConnectionType",typeof(string));
dt.Columns.Add("Database",typeof(string));
dt.Columns.Add("Port",typeof(string));
dt.Columns.Add("Server", typeof(string));
List<XElement> details = doc.Descendants("DatabaseDetails").ToList();
foreach (XElement detail in details)
{
string id = (string)detail.Element("ConnectionId");
string name = (string)detail.Element("ConnectionName");
XElement xCobConnection = detail.Descendants("CobConnection").FirstOrDefault();
string cobConnection = (string)xCobConnection.Element("CobConnection");
string cobType = (string)xCobConnection.Element("ConnectionType");
string cobDatabase = (string)xCobConnection.Element("Database");
string cobServer = (string)xCobConnection.Element("Server");
XElement serverConnection = detail.Element("ServerConnection");
string connectionType = (string)serverConnection.Element("ConnectionType");
string database = (string)serverConnection.Element("Database");
string port = (string)serverConnection.Element("Port");
string server = (string)serverConnection.Element("Server");
dt.Rows.Add(new object[] {
id,
name,
cobConnection,
cobType,
cobDatabase,
cobServer,
connectionType,
database,
port,
server
});
}
}
}
}
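If you only need the values for one connection name, as in the question, the key point is to query relative to the DatabaseDetails node you found rather than from the document root. With XmlDocument, that means calling SelectNodes("ServerConnection") on the matching xmlNode instead of on xmlDoc. A minimal LINQ to XML sketch of the same idea follows; ConnectionLookup, Find and the useCob flag are hypothetical names standing in for the asker's Enviroment check.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class ConnectionLookup
{
    // Returns the <ServerConnection> (or its nested <CobConnection> when useCob is true)
    // that belongs to the <DatabaseDetails> whose <ConnectionName> matches, or null.
    public static XElement Find(XDocument doc, string connectionName, bool useCob)
    {
        XElement details = doc.Descendants("DatabaseDetails")
            .FirstOrDefault(d => (string)d.Element("ConnectionName") == connectionName);
        if (details == null) return null;

        // Element() searches only the direct children of this node,
        // so the lookup cannot drift back to the first entry in the file.
        XElement server = details.Element("ServerConnection");
        if (server == null) return null;

        return useCob ? server.Element("CobConnection") : server;
    }
}
```

Usage would look something like: string database = (string)ConnectionLookup.Find(doc, "MyConn2", false)?.Element("Database");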

How to get the node value by passing type in XDocument C#

I have below XML.
<subscription>
<subscription_add_ons type="array">
<subscription_add_on>
<add_on_code>premium_support</add_on_code>
<name>Premium Support</name>
<quantity type="integer">1</quantity>
<unit_amount_in_cents type="integer">15000</unit_amount_in_cents>
<add_on_type>fixed</add_on_type>
<usage_percentage nil="true"></usage_percentage>
<measured_unit_id nil="true"></measured_unit_id>
</subscription_add_on>
</subscription_add_ons>
</subscription>
My XMLParse function
public XNode GetXmlNodes(XElement xml, string elementName)
{
List<string> addOnCodes= new List<string>();
//elementName = "subscription_add_ons ";
var addOns = xml.DescendantNodes().Where(x => x.Parent.Name == elementName).FirstOrDefault();
foreach (XNode addOn in addOns)
{
//Needed to do something like this
/*var len = "add_on_code".Length + 2;
var sIndex = addOn.ToString().IndexOf("<add_on_code>") + len;
var eIndex = addOn.ToString().IndexOf("</add_on_code>");
var addOnCode = addOn.ToString().Substring(sIndex, (eIndex - sIndex)).Trim().ToLower();
addOnCodes.Add(addOnCode);*/
}
As mentioned in the comments by @JonSkeet, I updated my snippet as below.
var addOns = xml.Descendants(elementName).Single().Elements();
foreach (XNode addOn in addOns)
{
/*addon = {<subscription_add_on>
<add_on_code>premium_support</add_on_code>
<name>Premium Support</name>
<quantity type="integer">1</quantity>
<unit_amount_in_cents type="integer">15000</unit_amount_in_cents>
<add_on_type>fixed</add_on_type>
<usage_percentage nil="true"></usage_percentage>
<measured_unit_id nil="true"></measured_unit_id>
</subscription_add_on>} */
//how to get the addOnCode node value ?
var addOnCode = string.Empty;
addOnCodes.Add(addOnCode);
}
What I need is to take the passed XML, get all the nodes of type subscription_add_on, read the value contained in add_on_code, and add it to a string collection.
More generally, how can I get the value of a node by passing its type? I tried the methods that Visual Studio IntelliSense suggests, but I could not find one that does exactly this.
Thanks!
Here is a solution with LINQ to XML (XDocument):
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
namespace ConsoleApplication107
{
class Program
{
const string FILENAME = @"c:\temp\test.xml";
static void Main(string[] args)
{
XDocument doc = XDocument.Load(FILENAME);
var results = doc.Descendants("subscription_add_on").Select(x => new
{
add_on_code = (string)x.Element("add_on_code"),
name = (string)x.Element("name"),
quantity = (int)x.Element("quantity"),
amount = (int)x.Element("unit_amount_in_cents"),
add_on_type = (string)x.Element("add_on_type")
}).ToList();
}
}
}
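If you want to keep the shape of the function in the question (pass in the XML and a container name, get back a list of strings), a sketch along these lines may be closer to what was asked. AddOnCodes and Collect are hypothetical names, and the Trim/lower-casing mirrors the commented-out code in the question rather than anything the XML requires.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class AddOnCodes
{
    // For every child of each <containerName> element, reads the child element
    // named valueName and returns the trimmed, lower-cased values.
    public static List<string> Collect(XElement xml, string containerName, string valueName)
    {
        return xml.Descendants(containerName)
                  .Elements()
                  .Select(e => (string)e.Element(valueName)) // null when the child is absent
                  .Where(v => v != null)
                  .Select(v => v.Trim().ToLowerInvariant())
                  .ToList();
    }
}
```

Called as AddOnCodes.Collect(xml, "subscription_add_ons", "add_on_code"), this returns the add_on_code of each subscription_add_on. Note that Descendants does not include the element it is called on, so the container must sit below the element you pass in.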

C# XML multiple subchild

Good day! I'm trying to parse XML sub-children using a DataSet. The problem is that it does not read "SiteCode" when the element has multiple values.
For example:
string filePath = @"" + _clsPathIntervalSttngs.localPath + "/" + "hehe.xml";
DataSet dataSet = new DataSet();
dataSet.ReadXml(filePath, XmlReadMode.InferSchema);
// Then display informations to test
foreach (DataTable table in dataSet.Tables)
{
Console.WriteLine(table);
for (int i = 0; i < table.Columns.Count; ++i)
{
Console.Write("\t" + table.Columns[i].ColumnName.Substring(0, Math.Min(6, table.Columns[i].ColumnName.Length)));
Console.WriteLine();
}
foreach (var row in table.AsEnumerable())
{
for (int i = 0; i < table.Columns.Count; ++i)
{
Console.Write("\t" + row[i]);
}
Console.WriteLine();
}
}
This is what it is returning.
It returns a 0 value and selects the Product instead of SiteCode.
Where did I go wrong?
You might have to check the code because I just took something similar I had lying around and changed it to look at your document hierarchy. I also didn't use a DataSet. Consider the following code:
var filePath = "<path to your file.xml>";
var xml = XDocument.Load(filePath);
var items = from item in xml.Descendants("Product").Elements()
select item.Value;
Array.ForEach(items.ToArray(), Console.WriteLine);
That should show you the values of each element under product. If you want the whole element, remove the .Value in the select clause of the LINQ query.
Update
I'm now projecting to an anonymous type. You'll get one of these for each Product element in the file.
var items = from item in xml.Descendants("Product")
select new
{
RefCode = item.Element("RefCode").Value,
Codes = string.Join(", ", item.Elements("SiteCode").Select(x => x.Value)),
Status = item.Element("Status").Value
};
Array.ForEach(items.ToArray(), Console.WriteLine);
I have flattened the codes to a comma separated string but you can keep the IEnumerable or ToList it as you wish.
Using LINQ to XML:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
namespace ConsoleApplication51
{
class Program
{
const string FILENAME = @"c:\temp\test.xml";
static void Main(string[] args)
{
XElement doc = XElement.Load(FILENAME);
List<Product> products = doc.Descendants("Product").Select(x => new Product()
{
refCode = (string)x.Element("RefCode"),
siteCode = x.Elements("SiteCode").Select(y => (int)y).ToArray(),
status = (string)x.Element("Status")
}).ToList();
}
}
public class Product
{
public string refCode { get; set; }
public int[] siteCode { get; set; }
public string status { get; set; }
}
}

CSV Delimited to XML - Folder Hierarchy

This is my first time posting, so I apologize for any ignorance or poor use of examples.
I have a console app project to create: I have been given a fair number of CSV files and need to build some kind of parent/child/grandchild relationship out of them (XML, maybe? Then I can use that to do the uploads and writes to the DMS with minimal calls. I don't want to query whether a folder exists over and over.)
I am a little out of my depth on this one.
I need to know the best way to do this in pure C#, without third-party library dependencies. Using the OLEDB JET provider is most likely required, as it will handle the parsing. The CSV files are in no particular date order: previous years could appear further down the list and vice versa.
Here's an example of the CSV output
"DESCRIPTION1","8, 5/8\" X 6.4MM","STRING","filename001.pdf","2016-09-19","1"
"DESCRIPTION2","12, 3/4\" X 6.4MM","STRING","filename001.pdf","2016-09-19","1"
"DESCRIPTION3","12, 3/4\" X 6.4MM","STRING","filename001.pdf","2016-09-19","1"
"another description 20# gw","1","388015","Scan123.pdf","2015-10-24","1"
"another description 20# gw","3","385902","Scan456.pdf","2015-04-14","1"
"STRINGVAL1","273.10 X 9.27 X 6000","45032-01","KHJDWNEJWKFD9101529.pdf","2012-02-03","1"
"STRINGVAL2","273.10 X 21.44 X 6000","7-09372","DJSWH68767681540.pdf","2017-02-03","1"
The end output will be YEAR/MONTH/FILENAME, plus attributes for each file (these are for eventually updating columns inside a DMS).
The year and month are taken from the date column.
If the YEAR already exists, it will not be created again.
If the month under that year exists, it will not be created again.
If the filename already exists under that YEAR/MONTH, it will not be created again, BUT the additional ATTRIBUTES for that FileName will be added to its attributes (line separated?).
Required Output:
I attempted a LINQ query to start producing the required XML, but it outputs every row and does no grouping. I am not familiar with LINQ at the moment.
I also ran into escaping issues when using .Split(',') (compare the original CSV examples above with my test file, which uses TAB separation, in the example below), which is why I want the OLEDB provider to handle it.
string[] source = File.ReadAllLines(@"C:\Processing\In\mockCsv.csv");
XElement item = new XElement("Root",
from str in source
let fields = str.Split('\t')
select new XElement("Year", fields[4].Substring(0, 4),
new XElement("Month", fields[4].Substring(5, 2),
new XElement("FileName", fields[3]),
new XElement("Description",fields[0]),
new XElement("Length", fields[1]),
new XElement("Type", fields[2]),
new XElement("FileName", fields[3]),
new XElement("Date", fields[4]),
new XElement("Authorised", fields[5]))
)
);
I also need to log every step of the process, so I have set up a Logger class:
private class Logger
{
private static string LogFile = null;
internal enum MsgType
{
Info,
Debug,
Error
}
static Logger()
{
var processingDetails = ConfigurationManager.GetSection(SECTION_PROCESSINGDETAILS) as NameValueCollection;
LogFile = Path.Combine(processingDetails[KEY_WORKINGFOLDER],
String.Format("Log_{0}.txt", StartTime.ToString("MMMyyyy")));
if (File.Exists(LogFile))
File.Delete(LogFile);
}
internal static void Write(string msg, MsgType msgType, bool isNewLine, bool closeLine)
{
if (isNewLine)
msg = String.Format("{0} - {1} : {2}", DateTime.Now.ToString("dd/MM/yyyy HH:mm:ss"), msgType, msg);
if (closeLine)
Console.WriteLine(msg);
else
Console.Write(msg);
if (String.IsNullOrEmpty(LogFile))
return;
try
{
using (StreamWriter sw = new StreamWriter(LogFile, true))
{
if (closeLine)
sw.WriteLine(msg);
else
sw.Write(msg);
}
}
catch (Exception ex)
{
Console.WriteLine(ex.Message);
}
}
}
It is used like this:
Logger.Write(String.Format("Reading records from csv file ({0})... ",
csvFile), Logger.MsgType.Info, true, false);
Try the following. If you are reading from a file, use StreamReader instead of StringReader:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
using System.IO;
using System.Text.RegularExpressions;
namespace ConsoleApplication74
{
class Program
{
static void Main(string[] args)
{
string input =
"\"DESCRIPTION1\",\"8, 5/8 X 6.4MM\",\"STRING\",\"filename001.pdf\",\"2016-09-19\",\"1\"\n" +
"\"DESCRIPTION2\",\"12, 3/4 X 6.4MM\",\"STRING\",\"filename001.pdf\",\"2016-09-19\",\"1\"\n" +
"\"DESCRIPTION3\",\"12, 3/4 X 6.4MM\",\"STRING\",\"filename001.pdf\",\"2016-09-19\",\"1\"\n" +
"\"another description 20# gw\",\"1\",\"388015\",\"Scan123.pdf\",\"2015-10-24\",\"1\"\n" +
"\"another description 20# gw\",\"3\",\"385902\",\"Scan456.pdf\",\"2015-04-14\",\"1\"\n" +
"\"STRINGVAL1\",\"273.10 X 9.27 X 6000\",\"45032-01\",\"KHJDWNEJWKFD9101529.pdf\",\"2012-02-03\",\"1\"\n" +
"\"STRINGVAL2\",\"273.10 X 21.44 X 6000\",\"7-09372\",\"DJSWH68767681540.pdf\",\"2017-02-03\",\"1\"\n";
string pattern = "\\\"\\s*,\\s*\\\"";
string inputline = "";
StringReader reader = new StringReader(input);
XElement root = new XElement("Root");
while ((inputline = reader.ReadLine()) != null)
{
string[] splitLine = Regex.Split(inputline,pattern);
Item newItem = new Item() {
description = splitLine[0].Replace("\"",""),
length = splitLine[1],
type = splitLine[2],
filename = splitLine[3],
date = DateTime.Parse(splitLine[4]),
authorized = splitLine[5].Replace("\"", "") == "1" ? true : false
};
Item.items.Add(newItem);
}
foreach(var year in Item.items.GroupBy(x => x.date.Year).OrderBy(x => x.Key))
{
XElement newYear = new XElement("_" + year.Key.ToString());
root.Add(newYear);
foreach(var month in year.GroupBy(x => x.date.Month).OrderBy(x => x.Key))
{
XElement newMonth = new XElement("_" + month.Key.ToString());
newYear.Add(newMonth);
newMonth.Add(
month.OrderBy(x => x.date).Select(x => new XElement(
x.filename,
string.Join("\r\n", new object[] {
x.description,
x.length,
x.type,
x.date.ToString(),
x.authorized.ToString()
})
)));
}
}
}
}
public class Item
{
public static List<Item> items = new List<Item>();
public string description { get; set; }
public string length { get; set; }
public string type { get; set; }
public string filename { get; set; }
public DateTime date { get; set; }
public Boolean authorized { get; set; }
}
}
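The code above creates one element per row, so a filename that appears several times under the same year and month is repeated rather than merged. A possible way to get the "create once, then append attributes" behaviour from the question is to group a third time by filename. This is only a sketch: CsvRow and FolderTree are hypothetical names, it carries just the description as the per-row attribute, and it uses fixed element names (Year, Month, File) with the values in XML attributes, which is my own choice to avoid relying on filenames being valid XML element names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical row type; the real rows have more columns.
public class CsvRow
{
    public string Description { get; set; }
    public string FileName { get; set; }
    public DateTime Date { get; set; }
}

public static class FolderTree
{
    public static XElement Build(IEnumerable<CsvRow> rows)
    {
        XElement root = new XElement("Root");
        foreach (var year in rows.GroupBy(r => r.Date.Year).OrderBy(g => g.Key))
        {
            XElement yearNode = new XElement("Year", new XAttribute("value", year.Key));
            root.Add(yearNode);
            foreach (var month in year.GroupBy(r => r.Date.Month).OrderBy(g => g.Key))
            {
                XElement monthNode = new XElement("Month",
                    new XAttribute("value", month.Key.ToString("00")));
                yearNode.Add(monthNode);
                // One <File> per distinct filename; every row for that file
                // contributes one <Description> child.
                foreach (var file in month.GroupBy(r => r.FileName))
                {
                    monthNode.Add(new XElement("File",
                        new XAttribute("name", file.Key),
                        file.Select(r => new XElement("Description", r.Description))));
                }
            }
        }
        return root;
    }
}
```

Because the grouping happens in memory, the order of the rows in the CSV files does not matter, and each year, month and file node is created exactly once.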

Web scraping a listings website

I'm trying to scrape a website. I've accomplished this on other projects, but I can't seem to get this one right. It could be that I've been up working for over two days and am missing something. Could someone please look over my code? Here it is:
using System;
using System.Collections.Generic;
using HtmlAgilityPack;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;
using System.Linq;
using System.Xml.Linq;
using System.IO;
public partial class _Default : System.Web.UI.Page
{
List<string> names = new List<string>();
List<string> address = new List<string>();
List<string> number = new List<string>();
protected void Page_Load(object sender, EventArgs e)
{
string url = "http://www.scoot.co.uk/find/" + "cafe" + " " + "-in-uk?page=" + "4";
var Webget = new HtmlWeb();
var doc = Webget.Load(url);
List<List<string>> mainList = new List<List<string>>();
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//h2//a"))
{
names.Add(Regex.Replace(node.ChildNodes[0].InnerHtml, @"\s{2,}", " "));
}
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//p[@class='result-address']"))
{
address.Add(Regex.Replace(node.ChildNodes[0].InnerHtml, @"\s{2,}", " "));
}
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//p[@class='result-number']"))
{
number.Add(Regex.Replace(node.ChildNodes[0].InnerHtml, @"\s{2,}", " "));
}
XDocument doccy = new XDocument(
new XDeclaration("1.0", "utf-8", "yes"),
new XComment("Business For Sale"),
new XElement("Data",
from data in mainList
select new XElement("data", new XAttribute("data", "data"),
new XElement("Name : ", names[0]),
new XElement("Add : ", address[0]),
new XElement("Number : ", number[0])
)
)
);
var xml = doccy.ToString();
Response.ContentType = "text/xml"; //Must be 'text/xml'
Response.ContentEncoding = System.Text.Encoding.UTF8; //We'd like UTF-8
doccy.Save(Response.Output); //Save to the text-writer
}
}
The website lists the business name, phone number, and address, each identified by a class name (result-address, result-number, etc.). I am trying to produce XML output containing the business name, address, and phone number of each listing on page 4 for a presentation tomorrow, but I can't get it to work at all!
The results are correct in all three foreach loops, but they are not written to the XML; I get an out-of-range error.
My first piece of advice would be to keep your CodeBehind as light as possible. If you bloat it up with business logic then the solution will become difficult to maintain. That's off topic, but I recommend looking up SOLID principles.
First, I've created a custom object to work with instead of using Lists of strings which have no way of knowing which address item links up with which name:
public class Listing
{
public string Name { get; set; }
public string Address { get; set; }
public string Number { get; set; }
}
Here is the heart of it, a class that does all the scraping and serializing (I've broken SOLID principles but sometimes you just want it to work right.)
using System.Collections.Generic;
using HtmlAgilityPack;
using System.IO;
using System.Xml;
using System.Xml.Serialization;
using System.Linq;
public class TheScraper
{
public List<Listing> DoTheScrape(int pageNumber)
{
List<Listing> result = new List<Listing>();
string url = "http://www.scoot.co.uk/find/" + "cafe" + " " + "-in-uk?page=" + pageNumber;
var Webget = new HtmlWeb();
var doc = Webget.Load(url);
// select top level node, this is the closest we can get to the elements in which all the listings are a child of.
var nodes = doc.DocumentNode.SelectNodes("//*[@id='list']/div/div/div/div");
// loop through each child
if (nodes != null)
{
foreach (var node in nodes)
{
Listing listing = new Listing();
// get each individual listing and manually check for nulls
// listing.Name = node.SelectSingleNode("./div/div/div/div/h2/a")?.InnerText; --easier way to null check if you can use null propagating operator
var nameNode = node.SelectSingleNode("./div/div/div/div/h2/a");
if (nameNode != null) listing.Name = nameNode.InnerText;
var addressNode = node.SelectSingleNode("./div/div/div/div/p[@class='result-address']");
if (addressNode != null) listing.Address = addressNode.InnerText.Trim();
var numberNode = node.SelectSingleNode("./div/div/div/div/p[@class='result-number']/a");
if (numberNode != null) listing.Number = numberNode.Attributes["data-visible-number"].Value;
result.Add(listing);
}
}
// filter out the nulls
result = result.Where(x => x.Name != null && x.Address != null && x.Number != null).ToList();
return result;
}
public string SerializeTheListings(List<Listing> listings)
{
var xmlSerializer = new XmlSerializer(typeof(List<Listing>));
using (var stringWriter = new StringWriter())
using (var xmlWriter = XmlWriter.Create(stringWriter, new XmlWriterSettings { Indent = true }))
{
xmlSerializer.Serialize(xmlWriter, listings);
return stringWriter.ToString();
}
}
}
Then your code behind would look something like this, plus references to the scraper class and model class:
public partial class _Default : System.Web.UI.Page
{
protected void Page_Load(object sender, EventArgs e)
{
TheScraper scraper = new TheScraper();
List<Listing> listings = new List<Listing>();
// quick hack to do a loop 5 times, to get all 5 pages. if this is being run frequently you'd want to automatically identify how many pages or start at page one and find / use link to next page.
for (int i = 0; i < 5; i++)
{
listings = listings.Union(scraper.DoTheScrape(i)).ToList();
}
string xmlListings = scraper.SerializeTheListings(listings);
}
}
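If you would rather stay with XDocument, as in the question, instead of XmlSerializer, something like the sketch below should work. One thing to watch: XML element names cannot contain spaces or colons, so names such as "Name : " are rejected by XElement. The ListingXml class and the element names here are my own choices, and Listing is repeated only so the block compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Same shape as the Listing class above, repeated to keep the sketch self-contained.
public class Listing
{
    public string Name { get; set; }
    public string Address { get; set; }
    public string Number { get; set; }
}

public static class ListingXml
{
    // Builds one <Listing> element per item, so each name stays paired
    // with its own address and number.
    public static XDocument Build(IEnumerable<Listing> listings)
    {
        return new XDocument(
            new XDeclaration("1.0", "utf-8", "yes"),
            new XElement("Data",
                listings.Select(l => new XElement("Listing",
                    new XElement("Name", l.Name),
                    new XElement("Address", l.Address),
                    new XElement("Number", l.Number)))));
    }
}
```

The resulting document can be written out with Save, for example to Response.Output as in the original code-behind.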
