Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about a specific programming problem, a software algorithm, or software tools primarily used by programmers. If you believe the question would be on-topic on another Stack Exchange site, you can leave a comment to explain where the question may be able to be answered.
Closed 9 years ago.
I need to get all the item links (URLs) from this webpage into a text file, delimited by line breaks (in other words, a list like: "Item #1", "Item #2", etc.).
http://dota-trade.com/equipment?order=name is the webpage, and if you scroll down it goes on and on to about 500-1000 items.
What programming language would I have to use, or how would I be able to do this? I also have experience using iMacros.
You need to download HtmlAgilityPack:
using HtmlAgilityPack;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text;
using System.Threading.Tasks;
namespace ConsoleApplication5
{
class Program
{
static void Main(string[] args)
{
WebClient wc = new WebClient();
var sourceCode = wc.DownloadString("http://dota-trade.com/equipment?order=name");
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(sourceCode);
var node = doc.DocumentNode;
var nodes = node.SelectNodes("//a");
List<string> links = new List<string>();
foreach (var item in nodes)
{
var link = item.Attributes["href"].Value;
links.Add(link.Contains("http") ? link : "http://dota-trade.com" + link);
}
int index = 1;
while (true)
{
sourceCode = wc.DownloadString("http://dota-trade.com/equipment?order=name&offset=" + index.ToString());
doc = new HtmlDocument();
doc.LoadHtml(sourceCode);
node = doc.DocumentNode;
nodes = node.SelectNodes("//a");
var cont = node.SelectSingleNode("//tr[@itemtype='http://schema.org/Thing']");
if (cont == null) break;
foreach (var item in nodes)
{
var link = item.Attributes["href"].Value;
links.Add(link.Contains("http") ? link : "http://dota-trade.com" + link);
}
index++;
}
System.IO.File.WriteAllLines(@"C:\Users\Public\WriteLines.txt", links);
}
}
}
I would recommend using any language with regular expression support. I use Ruby a lot, so I would do something like this:
require 'net/http'
require 'uri'
uri = URI.parse("http://dota-trade.com/equipment?order=name")
req = Net::HTTP::Get.new(uri.path)
http = Net::HTTP.new(uri.host, uri.port)
response = http.request(req)
links = response.body.scan(/<a.+?href="(.+?)"/).flatten
This is off the top of my head, but scan returns an array of the captured href values, one per link.
puts links.join("\n")
The last line should dump what you want but likely won't include the host. If you want the host included, do something like this:
puts links.map { |l| "http://dota-trade.com" + l }.join("\n")
Please keep in mind this is untested.
Closed. This question is not reproducible or was caused by typos. It is not currently accepting answers.
This question was caused by a typo or a problem that can no longer be reproduced. While similar questions may be on-topic here, this one was resolved in a way less likely to help future readers.
Closed 1 year ago.
Pretty new to C# and object-oriented programming in general. I'm currently receiving an error when trying to call the GetListFromCSV method, which resides in another ".cs" file but in the same namespace. I'm not sure why I'm not able to call this method.
I originally had the code from the GetListFromCSV method in Main, but wanted to put it in its own class file to practice the SOLID principles. Maybe it doesn't make sense in this case?
Any help would be appreciated.
Thanks!
Main:
using MathNet.Numerics;
using System.Collections.Generic;
namespace SimPump_Demo
{
class Program
{
static void Main(string[] args)
{
// Get CSV file location
string dirCSVLocation = System.IO.Path.GetDirectoryName(System.Reflection.Assembly.GetExecutingAssembly().Location);
string fileCSVLocation = dirCSVLocation + @"\PumpCurveCSVFiles\pumpcurve.csv";
// Read CSV contents into variable "records"
//CSVToList CSVIsntance = new CSVToList();
List<PumpCurveCSVInput> records = GetListFromCVS(fileCSVLocation);
//float pumpFlowOutput;
double[] pumpFlow = new double[records.Count];
double[] pumpHead = new double[records.Count];
int i = 0;
foreach (var record in records)
{
//if (record.pumpHead == 50)
//{
// pumpFlowOutput = record.pumpFlow;
//}
pumpFlow[i] = record.pumpFlow;
pumpHead[i] = record.pumpHead;
i++;
}
// Determine pump curve
Polynomial.Fit(pumpFlow, pumpHead, 3);
}
}
}
GetListFromCSV Method:
using CsvHelper;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;
namespace SimPump_Demo
{
public class CSVToList
{
public static List<PumpCurveCSVInput> GetListFromCVS(string fileCSV)
{
List<PumpCurveCSVInput> records;
using (var reader = new StreamReader(fileCSV))
using (var csv = new CsvReader(reader, CultureInfo.InvariantCulture))
{
records = csv.GetRecords<PumpCurveCSVInput>().ToList();
}
return records;
}
}
}
Even though GetListFromCVS is a static method, it still belongs to the CSVToList class. Therefore you must call it like this:
List<PumpCurveCSVInput> records = CSVToList.GetListFromCVS(fileCSVLocation);
Just use the name of the class without creating an instance.
If you make your method non-static:
public class CSVToList
{
public List<PumpCurveCSVInput> GetListFromCVS(string fileCSV)
{
// Your code
}
}
In that case you should create an instance of the CSVToList class before being able to use this method:
var csvHelper = new CSVToList();
List<PumpCurveCSVInput> records = csvHelper.GetListFromCVS(fileCSVLocation);
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 1 year ago.
I have this file named "Files.txt" which contains a list of file names inside:
------Files.TXT------
TS0001.000
TS0002.000
TS0003.000 ...
I need to look for the string "Process Error" inside of the files in the mentioned list and output the filename and the error to a new file as a log.
How can I do it? Complete newbie here :)
Thanks in advance!!
Your question does not provide complete information, so the solution may need to be changed a bit based on your complete problem:
using System;
using System.Collections.Generic;
using System.IO;
namespace ConsoleApp1
{
class Program
{
static void Main(string[] args)
{
var path = @"c:\file.txt";
var outPath = @"c:\error.log";
var outs = new List<string>();
string[] files = File.ReadAllLines(path); // one file name per line
foreach (var file in files)
{
var readText = File.ReadAllLines(@"C:\" + file);
foreach (string line in readText)
{
if (line.Contains("Process Error"))
{
outs.Add(file);
outs.Add(line);
}
}
}
File.WriteAllLines(outPath, outs);
}
}
}
I am using GetSafeHtmlFragment in my website and I found that all tags except <p> and <a> are removed.
I researched around and found that there is no resolution for it from Microsoft.
Is there any replacement for it, or is there any other solution?
Thanks.
Amazing that Microsoft terribly overcompensated in the 4.2.1 version for a security leak in the 4.2 XSS library, and still hasn't updated it a year later. The GetSafeHtmlFragment method should have been renamed to StripHtml, as I read someone commenting somewhere.
I ended up using the HtmlSanitizer library suggested in this related SO issue. I liked that it was available as a package through NuGet.
This library basically implements a variation of the white-list approach the now-accepted answer uses. However, it is based on CsQuery instead of the HTML Agility Pack. The package also gives some additional options, like being able to keep style information (e.g. HTML attributes). Using this library resulted in code in my project like the snippet below, which - at least - is a lot less code than the accepted answer :).
using Html;
...
var sanitizer = new HtmlSanitizer();
sanitizer.AllowedTags = new List<string> { "p", "ul", "li", "ol", "br" };
string sanitizedHtml = sanitizer.Sanitize(htmlString);
An alternative solution would be to use the Html Agility Pack in conjunction with your own tag white-list:
using System;
using System.IO;
using System.Text;
using System.Linq;
using System.Collections.Generic;
using HtmlAgilityPack;
class Program
{
static void Main(string[] args)
{
var whiteList = new[]
{
"#comment", "html", "head",
"title", "body", "img", "p",
"a"
};
var html = File.ReadAllText("input.html");
var doc = new HtmlDocument();
doc.LoadHtml(html);
var nodesToRemove = new List<HtmlAgilityPack.HtmlNode>();
var e = doc
.CreateNavigator()
.SelectDescendants(System.Xml.XPath.XPathNodeType.All, false)
.GetEnumerator();
while (e.MoveNext())
{
var node =
((HtmlAgilityPack.HtmlNodeNavigator)e.Current)
.CurrentNode;
if (!whiteList.Contains(node.Name))
{
nodesToRemove.Add(node);
}
}
nodesToRemove.ForEach(node => node.Remove());
var sb = new StringBuilder();
using (var w = new StringWriter(sb))
{
doc.Save(w);
}
Console.WriteLine(sb.ToString());
}
}
I can't figure out what goes wrong. I just created a project to test HtmlAgilityPack, and here is what I've got:
using System;
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack;
namespace parseHabra
{
class Program
{
static void Main(string[] args)
{
HTTP net = new HTTP(); //some http wraper
string result = net.MakeRequest("http://stackoverflow.com/", null);
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(result);
//Get all summary blocks
HtmlNodeCollection news = doc.DocumentNode.SelectNodes("//div[@class=\"summary\"]");
foreach (HtmlNode item in news)
{
string title = String.Empty;
//trouble is here for each element item i get the same value
//all the time
title = item.SelectSingleNode("//a[@class=\"question-hyperlink\"]").InnerText.Trim();
Console.WriteLine(title);
}
Console.ReadLine();
}
}
}
It looks like I make the XPath query not for each node I've selected, but for the whole document. Any suggestions why that is so? Thanks in advance.
I have not tried your code, but from a quick look I suspect the problem is that // searches from the root of the entire document, not from the root of the current element as I guess you are expecting.
Try putting a . before the //:
".//a[@class=\"question-hyperlink\"]"
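To see why the dot matters, here is a minimal, self-contained sketch using the built-in System.Xml.XPath types on a hypothetical snippet of markup (no HtmlAgilityPack required); the same XPath scoping rule applies to HtmlAgilityPack's SelectSingleNode:

```csharp
using System;
using System.IO;
using System.Xml.XPath;

class XPathScopeDemo
{
    public static void Main()
    {
        // Two items, each containing its own link.
        var xml = "<root><item><a>first</a></item><item><a>second</a></item></root>";
        var nav = new XPathDocument(new StringReader(xml)).CreateNavigator();

        foreach (XPathNavigator item in nav.Select("//item"))
        {
            // "//a" always starts at the document root: the first <a> every time.
            var absolute = item.SelectSingleNode("//a").Value;
            // ".//a" starts at the current <item>: that item's own <a>.
            var relative = item.SelectSingleNode(".//a").Value;
            Console.WriteLine(absolute + " vs " + relative);
        }
        // Prints:
        // first vs first
        // first vs second
    }
}
```

The HtmlAgilityPack call behaves the same way: item.SelectSingleNode("//a[...]") searches the whole document on every iteration, while the .// form stays inside the current node.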
I'd rewrite your XPath as a single query that finds all the question titles, rather than finding the summaries and then the titles. Chris's answer points out the problem, which could have easily been avoided.
var web = new HtmlWeb();
var doc = web.Load("http://stackoverflow.com");
var xpath = "//div[starts-with(@id,'question-summary-')]//a[@class='question-hyperlink']";
var questionTitles = doc.DocumentNode
.SelectNodes(xpath)
.Select(a => a.InnerText.Trim());
What I want is to open a link from a website (from its HTML content) and get the HTML of the newly opened site.
Example: I have www.google.com, and now I want to find all links.
For each link I want to have the HTML content of the new site.
I do something like this:
foreach (String link in GetLinksFromWebsite(htmlContent))
{
using (var client = new WebClient())
{
htmlContent = client.DownloadString("http://" + link);
}
foreach (Match treffer in istBildURL)
{
string bildUrl = treffer.Groups[1].Value;
bildLinks.Add(bildUrl);
}
}
public static List<String> GetLinksFromWebsite(string htmlSource)
{
string linkPattern = "<a href=\"(.*?)\">(.*?)</a>";
MatchCollection linkMatches = Regex.Matches(htmlSource, linkPattern, RegexOptions.Singleline);
List<string> linkContents = new List<string>();
foreach (Match match in linkMatches)
{
linkContents.Add(match.Value);
}
return linkContents;
}
The other problem is that I only get links, not LinkButtons (ASP.NET).
How can I solve the problem?
Steps to follow:
Download Html Agility Pack
Reference the assembly you have downloaded in your project
Throw out of your project everything that starts with the word regex or regular expression and deals with parsing HTML (read this answer to better understand why). In your case, that would be the contents of the GetLinksFromWebsite method.
Replace what you have thrown away with a simple call to the Html Agility Pack parser.
Here's an example:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using HtmlAgilityPack;
class Program
{
static void Main()
{
using (var client = new WebClient())
{
var htmlSource = client.DownloadString("http://www.stackoverflow.com");
foreach (var item in GetLinksFromWebsite(htmlSource))
{
// TODO: you could easily write a recursive function
// that will call itself here and retrieve the respective contents
// of the site ...
Console.WriteLine(item);
}
}
}
public static List<String> GetLinksFromWebsite(string htmlSource)
{
var doc = new HtmlDocument();
doc.LoadHtml(htmlSource);
return doc
.DocumentNode
.SelectNodes("//a[#href]")
.Select(node => node.Attributes["href"].Value)
.ToList();
}
}