Building on a code snippet I wrote earlier, I'm now trying to download multiple images at once from a certain subreddit into a local directory. My problem is that I can't get my LINQ statement to work properly. I also don't want to download the thumbnail pictures, which is why I looked at the HTML page and found that the links I want to retrieve sit at level 5, in the href attribute:
(...)
Level 1: <div class="content">...</div>
Level 2: <div class="spacer">...</div>
Level 3: <div class="siteTable">...</div>
Level 4: <div class=" thing id-t3_6dj7qp odd link ">...</div>
Level 5: <a class="thumbnail may-blank outbound" href="http://i.imgur.com/jZ2ZAyk.jpg">...</a>
This was my best attempt for the line marked '???':
.Where(link => Directory.GetParent(link).Equals(@"http://i.imgur.com"))
Sadly, it throws an error stating:
Object reference not set to an instance of an object
Well, now I know why it's not working, but I still have no clue how to rewrite this line since I'm fairly new to lambda expressions. To be honest, I don't really know why I got a System.NullReferenceException on this line but not on the next one. What's the difference? Maybe my approach to this problem isn't good practice at all, so please let me know how I could proceed.
using System;
using System.Collections.Generic;
using System.Linq;
using System.IO;
using System.Net;
using HtmlAgilityPack;
namespace GetAllImages
{
class Program
{
static void Main(string[] args)
{
List<string> imageLinks = new List<string>();
// Specify Directory manually
string dirName = "Jessica Clements";
string rootPath = @"C:\Users\Stefan\Desktop";
string dirPath = Path.Combine(rootPath, dirName);
// Specify the subReddit manually
string subReddit = "r/Jessica_Clements";
string url = @"https://www.reddit.com/" + subReddit;
try
{
DirectoryInfo imageFolder = Directory.CreateDirectory(dirPath);
HtmlDocument document = new HtmlWeb().Load(url);
imageLinks = document.DocumentNode.Descendants("a")
.Select(element => element.GetAttributeValue("href", null))
.Where(???)
.Where(stringLink => !String.IsNullOrEmpty(stringLink))
.ToList();
foreach(string link in imageLinks)
{
using (WebClient _wc = new WebClient())
{
_wc.DownloadFileAsync(new Uri(link), Path.Combine(dirPath, Path.GetFileName(link)));
}
}
Console.WriteLine($"Files successfully saved in '{Path.GetFileName(dirPath)}'.");
}
catch(Exception e)
{
while(e != null)
{
Console.WriteLine(e.Message);
e = e.InnerException;
}
}
if(System.Diagnostics.Debugger.IsAttached)
{
Console.WriteLine("Press any key to continue . . .");
Console.ReadKey(true);
}
}
}
}
Edit: In case someone is interested, this is how I made it work in the end using the answers below:
HtmlDocument document = new HtmlWeb().Load(url);
imageLinks = document.DocumentNode.Descendants("a")
.Select(element => element.GetAttributeValue("href", null))
.Where(link => (link?.Contains(@"http://i.imgur.com") == true))
.Distinct()
.ToList();
Given that this line throws the exception:
.Where(link => Directory.GetParent(link).Equals(@"http://i.imgur.com"))
I'd make sure that link is not null and that the result of GetParent(link) is not null either. So you could do:
.Where(link => link != null && (Directory.GetParent(link)?.Equals(@"http://i.imgur.com") ?? false))
Notice the null check and the ?. after GetParent(). The ?. stops evaluating the rest of the expression if GetParent() returns null. It is called the null-conditional operator, sometimes nicknamed the "Elvis operator" because it can be seen as two eyes with a quiff of hair. The ?? false supplies the default value in case the evaluation was cut short by a null value.
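As a small self-contained illustration of how ?. and ?? combine, here is a sketch that uses plain strings instead of the scraped links. The class name, method name and sample values are made up for this example.

```csharp
using System;

public static class NullOperatorsDemo
{
    // Returns true only when link is non-null and contains the marker text.
    public static bool ContainsImgur(string link)
    {
        // link?.Contains(...) evaluates to null (a bool?) when link is null,
        // and ?? false turns that null into false.
        return link?.Contains("i.imgur.com") ?? false;
    }

    public static void Main()
    {
        Console.WriteLine(ContainsImgur("http://i.imgur.com/jZ2ZAyk.jpg")); // True
        Console.WriteLine(ContainsImgur("https://www.reddit.com/"));        // False
        Console.WriteLine(ContainsImgur(null));                              // False
    }
}
```

Without the ?. the third call would throw a NullReferenceException, which is the kind of failure described in the question.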
However, if you plan to parse HTML code you should definitely have a look at the Html Agility Pack (HAP).
If you are trying to get all links pointing to http://i.imgur.com, you need something like this:
imageLinks = document.DocumentNode.Descendants("a")
.Select(element => element.GetAttributeValue("href", null))
.Where(link => link?.Contains(@"http://i.imgur.com") == true)
.ToList();
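A note on the original attempt: Directory.GetParent is meant for file system paths and returns a DirectoryInfo, so comparing its result to a string with Equals will not match. If you want to compare the host rather than search the whole string, one possible sketch is to parse each href with System.Uri. The class name, method name and sample links below are assumptions for illustration, and the list of strings stands in for the href values selected from the document.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ImgurLinkFilter
{
    // Keeps only absolute links whose host is exactly i.imgur.com, without duplicates.
    public static List<string> FilterImgurLinks(IEnumerable<string> hrefs)
    {
        return hrefs
            .Where(href => !string.IsNullOrEmpty(href))
            .Where(href => Uri.TryCreate(href, UriKind.Absolute, out Uri uri)
                           && uri.Host.Equals("i.imgur.com", StringComparison.OrdinalIgnoreCase))
            .Distinct()
            .ToList();
    }

    public static void Main()
    {
        // Sample input: a duplicate, a null, a relative link and a non-imgur link.
        var hrefs = new[]
        {
            "http://i.imgur.com/jZ2ZAyk.jpg",
            null,
            "/r/Jessica_Clements/",
            "http://i.imgur.com/jZ2ZAyk.jpg",
            "https://www.reddit.com/"
        };

        foreach (string link in FilterImgurLinks(hrefs))
        {
            Console.WriteLine(link);
        }
    }
}
```

Compared with Contains, this also rejects links that merely mention the imgur address somewhere in a query string.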
I'm trying to update a site that uses a sanitizer based on AngleSharp to process user-generated HTML content. The site's users need to be able to embed iframes, and I am trying to use a whitelist to control which domains a frame can load. I'd like to rewrite the 'blocked' iframes to a new custom element "blocked-iframe" that will then be stripped out by the sanitizer, so we can review whether other domains need to be added to the whitelist.
I'm trying to use a solution based on this answer: https://stackoverflow.com/a/55276825/794
It looks like so:
string BlockIFrames(string content)
{
var parser = new HtmlParser(new HtmlParserOptions { });
var doc = parser.Parse(content);
foreach (var element in doc.QuerySelectorAll("iframe"))
{
var src = element.GetAttribute("src");
if (string.IsNullOrEmpty(src) || !Settings.Sanitization.IFrameWhitelist.Any(wls => src.StartsWith(wls)))
{
var newElement = doc.CreateElement("blocked-iframe");
foreach (var attr in element.Attributes)
{
newElement.SetAttribute(attr.Name, attr.Value);
}
element.Insert(AdjacentPosition.BeforeBegin, newElement.OuterHtml);
element.Remove();
}
}
return doc.FirstElementChild.OuterHtml;
}
It ostensibly works, but I notice that the angle brackets in the new element's tag are being escaped on insertion, so the result just gets written into the page as text. I think I could build a map of replacements and just execute them against the string before sending it back, but I'm wondering if there's a way to do it using AngleSharp's API. The site is using 0.9.9 currently and I'm not sure how far ahead we'll be able to update considering some of the other dependencies in play.
Digging around in the source I found the ReplaceChild method in INode, which works if called from the parent of element:
string BlockIFrames(string content)
{
var parser = new HtmlParser(new HtmlParserOptions { });
var doc = parser.Parse(content);
foreach (var element in doc.QuerySelectorAll("iframe"))
{
var src = element.GetAttribute("src");
if (string.IsNullOrEmpty(src) ||
!Settings.Sanitization.IFrameWhitelist.Any(wls => src.StartsWith(wls)))
{
var newElement = doc.CreateElement("blocked-iframe");
foreach (var attr in element.Attributes)
{
newElement.SetAttribute(attr.Name, attr.Value);
}
element.Parent.ReplaceChild(newElement, element);
}
}
return doc.FirstElementChild.OuterHtml;
}
I will keep testing, but this seems decent enough to me. If there is a better way, I'd love to hear it.
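One thing that may be worth checking separately from the AngleSharp part is the whitelist test itself. If a whitelist entry is a URL prefix without a trailing slash, src.StartsWith(entry) also accepts hosts that merely begin with the same text. A minimal sketch of a host-based check follows, assuming the whitelist holds host names. The contents of Settings.Sanitization.IFrameWhitelist are not shown above, so the class name, method name and sample hosts here are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class IFrameWhitelist
{
    // Returns true when src is an absolute http(s) URL whose host is in allowedHosts.
    public static bool IsAllowed(string src, IEnumerable<string> allowedHosts)
    {
        if (string.IsNullOrEmpty(src)) return false;
        if (!Uri.TryCreate(src, UriKind.Absolute, out Uri uri)) return false;
        if (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps) return false;

        // Compare the parsed host exactly instead of matching a string prefix.
        return allowedHosts.Any(host => uri.Host.Equals(host, StringComparison.OrdinalIgnoreCase));
    }

    public static void Main()
    {
        var allowed = new[] { "www.youtube.com" };
        Console.WriteLine(IsAllowed("https://www.youtube.com/embed/abc", allowed));          // True
        Console.WriteLine(IsAllowed("https://www.youtube.com.evil.example/embed", allowed)); // False
        Console.WriteLine(IsAllowed(null, allowed));                                         // False
    }
}
```

The second call is the case a prefix match on "https://www.youtube.com" would let through.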
I'm after some help with some code I've written to extract data using HttpClient.
I am new to writing code, so I can't find my problem. Could someone please help me troubleshoot this?
I expect to write the data of the table I'm scraping to the console.
Any help appreciated.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using HtmlAgilityPack;
namespace weatherCheck
{
class Program
{
private static void Main(string[] args)
{
GetHtmlAsync();
Console.ReadLine();
}
protected static async void GetHtmlAsync()
{
var url = "https://www.weatherzone.com.au/vic/melbourne/melbourne";
var httpClient = new HttpClient();
var html = await httpClient.GetStringAsync(url);
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(html);
//grab the rain chance, rain in mm and date
var MyTable = Enumerable.FirstOrDefault(htmlDocument.DocumentNode.Descendants("table")
.Where(table => table.Attributes.Contains("id"))
, table => table.Attributes["id"].Value == "forecast-table");
List<HtmlNode> rows = htmlDocument.DocumentNode.SelectNodes("//tr").ToList();
foreach (var row in rows)
{
try
{
if (MyTable != null)
{
Console.WriteLine(MyTable.GetAttributeValue("forecast-table", " "));
}
}
catch (Exception)
{
}
}
}
}
}
I used your code to look up the values, but it didn't produce anything for me either. When I look at htmlDocument.DocumentNode.OuterHtml to view the entire HTML it is scraping, I don't see anything in the document that reflects an attribute forecast-table.
Also, you are checking MyTable on every iteration over rows. You should check row != null and print the attribute from row instead.
var MyTable = Enumerable.FirstOrDefault(htmlDocument.DocumentNode.Descendants("table")
.Where(table => table.Attributes.Contains("id")), table => table.Attributes["id"].Value == "forecast-table");
List<HtmlNode> rows = htmlDocument.DocumentNode.SelectNodes("//tr").ToList();
foreach (var row in rows)
{
try
{
if (row != null) // Check row here, not MyTable, and read the attribute from row in the line below.
Console.WriteLine(row.GetAttributeValue("forecast-table", " "));
}
catch (Exception)
{
}
}
You should also know that the HTML you view with Dev Tools in Chrome is not the same as the HTML you get in HtmlAgilityPack. Chrome renders the page after executing its scripts, whereas HtmlAgilityPack simply gives you the raw HTML returned for the page. This is the reason why you are not able to get the value of forecast-table.
The problem is that, according to the documentation, GetAttributeValue(name, def) returns def if the attribute is not found.
So in your case it prints " " (the default you passed), because no attribute with that name is found.
Remove async and await and block on the task that httpClient.GetStringAsync(url) already returns:
var html =httpClient.GetStringAsync(url).Result;
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(html);
And print,
Console.WriteLine(MyTable.GetAttributeValue("forecast-table","SOME_TEXT_HERE").ToString());
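If you would rather keep async and await than block on .Result, one alternative sketch is to return a Task instead of void and await it from an async Main (this needs C# 7.1 or later). FetchHtmlAsync below is a made-up stand-in for httpClient.GetStringAsync(url) so that the example runs without network access. All names in it are hypothetical.

```csharp
using System;
using System.Threading.Tasks;

public static class AsyncMainDemo
{
    // Stand-in for httpClient.GetStringAsync(url); it only simulates latency.
    public static async Task<string> FetchHtmlAsync(string url)
    {
        await Task.Delay(10);
        return "<html><body>" + url + "</body></html>";
    }

    // Returning Task<int> rather than void lets the caller await completion
    // and observe any exception thrown inside the method.
    public static async Task<int> GetHtmlLengthAsync(string url)
    {
        string html = await FetchHtmlAsync(url);
        return html.Length;
    }

    public static async Task Main()
    {
        int length = await GetHtmlLengthAsync("https://example.org/");
        Console.WriteLine(length);
    }
}
```

With an async void method, by contrast, the caller has no task to wait on, and an exception thrown after the first await cannot be caught by the caller.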
I need to check whether an element exists. If it does, I want to open a URL, then go back to the original page and continue writing as before. I tried a few approaches but they kept throwing exceptions. I added comments to the lines in question. I just can't figure out how to implement it.
foreach (string line in File.ReadLines(@"C:\tumblrextract\in7.txt"))
{
if (line.Contains("@"))
{
searchEmail.SendKeys(line);
submitButton.Click();
var result = driver.FindElement(By.ClassName("invite_someone_success")).Text;
if (driver.FindElements(By.ClassName("invite_someone_failure")).Count != 0)
// If invite_someone_failure exists open this url
driver.Url = "https://www.tumblr.com/lookup";
else
// Then back to following page and continue searchEmail.SendKeys(line); submitButton.Click(); write loop
driver.Url = "https://www.tumblr.com/following";
using (StreamWriter writer = File.AppendText("C:\\tumblrextract\\out7.txt"))
{
writer.WriteLine(result + ":" + line);
}
}
}
Which exception are you getting? FindElement throws a NoSuchElementException when no element matches, so consider checking that the element exists before reading its text, for example:
var successElements = driver.FindElements(By.ClassName("invite_someone_success"));
if (successElements.Count != 0)
{
var result = successElements[0].Text;
}
The above is not verified code, just a sketch.
You are using Selenium, and several lines of your code can throw exceptions. Bear in mind that I don't know the Tumblr website or its HTML structure.
But first:
You're in a foreach loop, and every time a page is loaded, all the elements you found earlier become stale. So these lines:
var searchEmail = driver.FindElement(By.Name("follow_this"));
var submitButton = driver.FindElement(By.Name("submit"));
will probably be stale on the next iteration (StaleElementReferenceException).
Repeat them after:
driver.Url = "https://www.tumblr.com/following";
Second:
When using the FindElement method you have to make sure the element exists, or a NoSuchElementException will be thrown.
var result = driver.FindElement(By.ClassName("invite_someone_success")).Text;
var isThere = driver.FindElements(By.ClassName("invite_someone_failure"));
The .NET Selenium client has a class to help with that (with static methods, I believe): ExpectedConditions,
which you can use to check whether an element is present before trying to read its text.
I invite you to learn how Selenium works, especially StaleElementReferenceException.
Have fun.
I have been working successfully with the C# OpenXml SDK (Unofficial Microsoft Package 2.5 from NuGet) for some time now, but have recently noticed that the following line of code returns different results depending on what mood Microsoft Word appears to be in when the file gets saved:
var fields = document.Descendants<FieldCode>();
From what I can tell, when creating the document in the first place (using Word 2013 on Windows 8.1) if you use the Insert->QuickParts->Field and choose MergeField from the Field names left hand pane, and then provide a Field name in the field properties and click OK then the field code is correctly saved in the document as I would expect.
Then when using the aforementioned line of code I will receive a field code count of 1 field. If I subsequently edit this document (and even leave this field well alone) the subsequent saving could mean that this field code no longer is returned in my query.
Another case of the same curiousness is when I see the FieldCode nodes split across multiple items. So rather than seeing say:
" MERGEFIELD Author \\* MERGEFORMAT "
As the node name, I will see:
" MERGEFIELD Aut"
"hor \\* MERGEFORMAT"
Split as two FieldCode node values. I have no idea why this would be the case, but it certainly makes my ability to match nodes that much more exciting. Is this expected behaviour? A known bug? I don't really want to have to crack open the raw xml and edit this document to work until I understand what is going on. Many thanks all.
I came across this very problem myself, and found a solution that exists within OpenXML: a utility class called MarkupSimplifier which is part of the PowerTools for Open XML project. Using this class solved all the problems I was having that you describe.
The full article is located here.
Here are some pertinent excerpts:
Perhaps the most useful simplification that this performs is to merge adjacent runs with identical formatting.
It goes on to say:
Open XML applications, including Word, can arbitrarily split runs as necessary. If you, for instance, add a comment to a document, runs will be split at the location of the start and end of the comment. After MarkupSimplifier removes comments, it can merge runs, resulting in simpler markup.
An example of the utility class in use is:
SimplifyMarkupSettings settings = new SimplifyMarkupSettings
{
RemoveComments = true,
RemoveContentControls = true,
RemoveEndAndFootNotes = true,
RemoveFieldCodes = false,
RemoveLastRenderedPageBreak = true,
RemovePermissions = true,
RemoveProof = true,
RemoveRsidInfo = true,
RemoveSmartTags = true,
RemoveSoftHyphens = true,
ReplaceTabsWithSpaces = true,
};
MarkupSimplifier.SimplifyMarkup(wordDoc, settings);
I have used this many times with Word 2010 documents using VS2015 .Net Framework 4.5.2 and it has made my life much, much easier.
Update:
I have revisited this code and have found that it cleans up runs in MERGEFIELDs but not in IF fields that reference merge fields, e.g.
{if {MERGEFIELD When39} = "Y???" "Y" "N" }
I have no idea why this might be so, and examination of the underlying XML offers no hints.
Word will often split a text run into multiple text runs for no reason I've ever understood. When searching, comparing, tidying etc., we preprocess the body with a method that combines adjacent runs with identical formatting into a single text run.
/// <summary>
/// Combines the identical runs.
/// </summary>
/// <param name="body">The body.</param>
public static void CombineIdenticalRuns(W.Body body)
{
List<W.Run> runsToRemove = new List<W.Run>();
foreach (W.Paragraph para in body.Descendants<W.Paragraph>())
{
List<W.Run> runs = para.Elements<W.Run>().ToList();
for (int i = runs.Count - 2; i >= 0; i--)
{
W.Text text1 = runs[i].GetFirstChild<W.Text>();
W.Text text2 = runs[i + 1].GetFirstChild<W.Text>();
if (text1 != null && text2 != null)
{
string rPr1 = "";
string rPr2 = "";
if (runs[i].RunProperties != null) rPr1 = runs[i].RunProperties.OuterXml;
if (runs[i + 1].RunProperties != null) rPr2 = runs[i + 1].RunProperties.OuterXml;
if (rPr1 == rPr2)
{
text1.Text += text2.Text;
runsToRemove.Add(runs[i + 1]);
}
}
}
}
foreach (W.Run run in runsToRemove)
{
run.Remove();
}
}
I tried to simplify the document with PowerTools, but the result was a corrupted Word file. I wrote this routine to simplify only field codes with specific names. It works in all parts of the document (main document part, headers and footers):
internal static void SimplifyFieldCodes(WordprocessingDocument document)
{
var masks = new string[] { Constants.VAR_MASK, Constants.INP_MASK, Constants.TBL_MASK, Constants.IMG_MASK, Constants.GRF_MASK };
SimplifyFieldCodesInElement(document.MainDocumentPart.RootElement, masks);
foreach (var headerPart in document.MainDocumentPart.HeaderParts)
{
SimplifyFieldCodesInElement(headerPart.Header, masks);
}
foreach (var footerPart in document.MainDocumentPart.FooterParts)
{
SimplifyFieldCodesInElement(footerPart.Footer, masks);
}
}
internal static void SimplifyFieldCodesInElement(OpenXmlElement element, string[] regexpMasks)
{
foreach (var run in element.Descendants<Run>()
.Select(item => (Run)item)
.ToList())
{
var fieldChar = run.Descendants<FieldChar>().FirstOrDefault();
if (fieldChar != null && fieldChar.FieldCharType == FieldCharValues.Begin)
{
string fieldContent = "";
List<Run> runsInFieldCode = new List<Run>();
var currentRun = run.NextSibling();
while ((currentRun is Run) && currentRun.Descendants<FieldCode>().FirstOrDefault() != null)
{
var currentRunFieldCode = currentRun.Descendants<FieldCode>().FirstOrDefault();
fieldContent += currentRunFieldCode.InnerText;
runsInFieldCode.Add((Run)currentRun);
currentRun = currentRun.NextSibling();
}
// If there is more than one Run for the FieldCode, and is one we must change, set the complete text in the first Run and remove the rest
if (runsInFieldCode.Count > 1)
{
// Check the field code to see whether it is one we must simplify (so as not to change TOC, PAGEREF, etc.)
bool applyTransform = false;
foreach (string regexpMask in regexpMasks)
{
Regex regex = new Regex(regexpMask);
Match match = regex.Match(fieldContent);
if (match.Success)
{
applyTransform = true;
break;
}
}
if (applyTransform)
{
var currentRunFieldCode = runsInFieldCode[0].Descendants<FieldCode>().FirstOrDefault();
currentRunFieldCode.Text = fieldContent;
runsInFieldCode.RemoveAt(0);
foreach (Run runToRemove in runsInFieldCode)
{
runToRemove.Remove();
}
}
}
}
}
}
Hope this helps!!!
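To make the splitting concrete without the Open XML SDK, here is a rough sketch that reads a WordprocessingML paragraph with System.Xml.Linq and joins the instrText pieces found between a "begin" field character and the following "separate" or "end". The sample XML is hand-written for this example, the class and method names are made up, and nested fields (such as an IF field containing a MERGEFIELD) are deliberately not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Xml.Linq;

public static class FieldCodeReader
{
    static readonly XNamespace W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns one string per field, joining the instrText pieces that follow a
    // "begin" fldChar up to the next "separate" or "end" fldChar.
    public static List<string> ReadFieldCodes(XElement paragraph)
    {
        var codes = new List<string>();
        StringBuilder current = null;

        foreach (XElement node in paragraph.Descendants())
        {
            if (node.Name == W + "fldChar")
            {
                string type = (string)node.Attribute(W + "fldCharType");
                if (type == "begin")
                {
                    current = new StringBuilder();
                }
                else if (current != null)
                {
                    // "separate" or "end" closes the instruction part of the field.
                    codes.Add(current.ToString());
                    current = null;
                }
            }
            else if (node.Name == W + "instrText" && current != null)
            {
                current.Append(node.Value);
            }
        }
        return codes;
    }

    public static void Main()
    {
        // One MERGEFIELD whose instruction text is split across two runs.
        string xml =
            "<w:p xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>" +
            "<w:r><w:fldChar w:fldCharType='begin'/></w:r>" +
            "<w:r><w:instrText xml:space='preserve'> MERGEFIELD Aut</w:instrText></w:r>" +
            "<w:r><w:instrText xml:space='preserve'>hor \\* MERGEFORMAT </w:instrText></w:r>" +
            "<w:r><w:fldChar w:fldCharType='separate'/></w:r>" +
            "<w:r><w:t>Author</w:t></w:r>" +
            "<w:r><w:fldChar w:fldCharType='end'/></w:r>" +
            "</w:p>";

        foreach (string code in ReadFieldCodes(XElement.Parse(xml)))
        {
            Console.WriteLine("[" + code + "]");
        }
    }
}
```

Reading the instruction this way leaves the document untouched, which can be useful when you only need to match field names and would rather not rewrite the runs.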
This is what my team and I chose to do for our school project. Well, actually we haven't decided on how to parse the C# source files yet.
What we are aiming to achieve is to perform a full analysis of a C# source file and produce a report.
The report is going to describe what is happening in the code.
The report only has to contain:
string literals
method names
variable names
field names
etc
I'm in charge of looking into the Irony library. To be honest, I don't know the best way to sort the data out into a clean, readable report. I am using the C# grammar class that ships in the zip.
Is there any step where I can properly identify each node's children (e.g. using directives, namespace declaration, class declaration, method body)?
Any help or advice would be very much appreciated. Thanks.
EDIT: Sorry, I forgot to say we need to analyze the method calls too.
Your main goal is to master the basics of formal languages. A good starting point might be found here. This article describes how to use Irony, using the grammar of a simple numeric calculator as an example.
Suppose you want to parse a file containing C# code whose path you know:
private void ParseForLongMethods(string path)
{
_parser = new Parser(new CSharpGrammar());
if (_parser == null || !_parser.Language.CanParse()) return;
_parseTree = null;
GC.Collect(); //to avoid disruption of perf times with occasional collections
_parser.Context.SetOption(ParseOptions.TraceParser, true);
try
{
string contents = File.ReadAllText(path);
_parser.Parse(contents);//, "<source>");
}
catch (Exception ex)
{
}
finally
{
_parseTree = _parser.Context.CurrentParseTree;
TraverseParseTree();
}
}
And here is the traversal method itself, which collects some information from the nodes. Specifically, this code counts the number of statements in every method of the class. If you have any questions, you are always welcome to ask me.
private void TraverseParseTree()
{
if (_parseTree == null) return;
ParseNodeRec(_parseTree.Root);
}
private void ParseNodeRec(ParseTreeNode node)
{
if (node == null) return;
string functionName = "";
if (node.ToString().CompareTo("class_declaration") == 0)
{
ParseTreeNode tmpNode = node.ChildNodes[2];
currentClass = tmpNode.AstNode.ToString();
}
if (node.ToString().CompareTo("method_declaration") == 0)
{
foreach (var child in node.ChildNodes)
{
if (child.ToString().CompareTo("qual_name_with_targs") == 0)
{
ParseTreeNode tmpNode = child.ChildNodes[0];
while (tmpNode.ChildNodes.Count != 0)
{ tmpNode = tmpNode.ChildNodes[0]; }
functionName = tmpNode.AstNode.ToString();
}
if (child.ToString().CompareTo("method_body") == 0) //method_declaration
{
int statementsCount = FindStatements(child);
//Register bad smell
if (statementsCount>(((LongMethodsOptions)this.Options).MaxMethodLength))
{
//function.StartPoint.Line
int functionLine = GetLine(functionName);
foundSmells.Add(new BadSmellRegistry(name, functionLine,currentFile,currentProject,currentSolution,false));
}
}
}
}
foreach (var child in node.ChildNodes)
{ ParseNodeRec(child); }
}
I'm not sure this is what you need, but you could use the CodeDom and CodeDom.Compiler namespaces to compile the C# code and then analyze the results using reflection, something like:
// Create the assembly in memory
CompilerParameters compileParams = new CompilerParameters { GenerateInMemory = true };
CodeSnippetCompileUnit code = new CodeSnippetCompileUnit(classCode);
CSharpCodeProvider provider = new CSharpCodeProvider();
CompilerResults results = provider.CompileAssemblyFromDom(compileParams, code);
foreach (var type in results.CompiledAssembly.GetTypes())
{
// Your analysis goes here
}
Update: In VS2015 you could use the new C# compiler (AKA Roslyn) to do the same, for example:
// classCode is the source text; this sample assumes it contains at least one using directive
var tree = CSharpSyntaxTree.ParseText(classCode);
var root = (CompilationUnitSyntax)tree.GetRoot();
var compilation = CSharpCompilation.Create("HelloTDN")
.AddReferences(MetadataReference.CreateFromFile(typeof(object).Assembly.Location))
.AddSyntaxTrees(tree);
var model = compilation.GetSemanticModel(tree);
var nameInfo = model.GetSymbolInfo(root.Usings[0].Name);
var systemSymbol = (INamespaceSymbol)nameInfo.Symbol;
foreach (var ns in systemSymbol.GetNamespaceMembers())
{
Console.WriteLine(ns.Name);
}
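For the report described in the question (string literals, method names, variable and field names, method calls), the Roslyn syntax tree alone may be enough, with no compilation needed. The following is a minimal sketch under that assumption. It requires the Microsoft.CodeAnalysis.CSharp NuGet package, and the SourceReport class, its method names and the sample source text are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

public static class SourceReport
{
    // Names of all declared methods.
    public static List<string> MethodNames(SyntaxNode root) =>
        root.DescendantNodes().OfType<MethodDeclarationSyntax>()
            .Select(m => m.Identifier.Text).ToList();

    // Values of all string literals.
    public static List<string> StringLiterals(SyntaxNode root) =>
        root.DescendantNodes().OfType<LiteralExpressionSyntax>()
            .Where(l => l.IsKind(SyntaxKind.StringLiteralExpression))
            .Select(l => l.Token.ValueText).ToList();

    // Names of fields and local variables (both are declared through variable declarators).
    public static List<string> VariableAndFieldNames(SyntaxNode root) =>
        root.DescendantNodes().OfType<VariableDeclaratorSyntax>()
            .Select(v => v.Identifier.Text).ToList();

    // Text of the expression being invoked in each call, e.g. "System.Console.WriteLine".
    public static List<string> MethodCalls(SyntaxNode root) =>
        root.DescendantNodes().OfType<InvocationExpressionSyntax>()
            .Select(i => i.Expression.ToString()).ToList();

    public static void Main()
    {
        const string code =
            @"class C { int count; void Greet() { string s = ""hi""; System.Console.WriteLine(s); } }";
        SyntaxNode root = CSharpSyntaxTree.ParseText(code).GetRoot();

        Console.WriteLine("Methods:   " + string.Join(", ", MethodNames(root)));
        Console.WriteLine("Literals:  " + string.Join(", ", StringLiterals(root)));
        Console.WriteLine("Variables: " + string.Join(", ", VariableAndFieldNames(root)));
        Console.WriteLine("Calls:     " + string.Join(", ", MethodCalls(root)));
    }
}
```

Each node also exposes its position through GetLocation(), which could be used to add line numbers to the report. Resolving which method a call actually binds to would need the semantic model shown above.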