I'm writing a web-gallery scraper and I want to parallelize the per-file processing as much as possible with TPL Dataflow.
To scrape, I first get the gallery main page and parse the HTML to get the image page links as a list. Then I go to each page in the list and parse the HTML to get the link to the image which I then want to save to disk.
Here's the outline of my program:
var galleryBlock = new TransformBlock<Uri, IEnumerable<Uri>>(async uri =>
{
// 1. Get the page
// 2. Parse the page to get the urls of each image page
return imagePageLinks;
});
var imageBlock = new TransformBlock<Uri, Uri>(async uri =>
{
// 1. Go to the url and fetch the image page html
// 2. Parse the html to retrieve the image url
return imageUri;
});
var downloadBlock = new ActionBlock<Uri>(async uri =>
{
// Download the image from uri and save it to disk
});
var opts = new DataflowLinkOptions { PropagateCompletion = true};
galleryBlock.LinkTo(imageBlock, opts); // this doesn't work: the block returns a list, not a single item, but I still want it to run in parallel
imageBlock.LinkTo(downloadBlock, opts);
You can use a TransformManyBlock in place of your TransformBlock:
var galleryBlock = new TransformManyBlock<Uri, Uri>(async uri =>
{
return Enumerable.Empty<Uri>(); //just to get it compiling
});
var imageBlock = new TransformBlock<Uri, Uri>(async uri =>
{
return null; //just to get it compiling
});
var opts = new DataflowLinkOptions { PropagateCompletion = true };
galleryBlock.LinkTo(imageBlock, opts); // bingo!
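For reference, here is a fuller sketch of the wired-up pipeline. The parsing bodies are placeholders, the `MaxDegreeOfParallelism` value and the gallery URL are illustrative assumptions, and completion is propagated so you can await the final block:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Illustrative: cap each stage at 4 concurrent messages.
var blockOpts = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 };

// TransformManyBlock flattens the returned sequence, emitting each
// image-page link as a separate message downstream.
var galleryBlock = new TransformManyBlock<Uri, Uri>(async uri =>
{
    // fetch + parse the gallery page here (placeholder)
    await Task.Yield();
    return Enumerable.Empty<Uri>();
}, blockOpts);

var imageBlock = new TransformBlock<Uri, Uri>(async uri =>
{
    // fetch + parse the image page here (placeholder)
    await Task.Yield();
    return uri;
}, blockOpts);

var downloadBlock = new ActionBlock<Uri>(async uri =>
{
    // download the image to disk here (placeholder)
    await Task.Yield();
}, blockOpts);

var linkOpts = new DataflowLinkOptions { PropagateCompletion = true };
galleryBlock.LinkTo(imageBlock, linkOpts);
imageBlock.LinkTo(downloadBlock, linkOpts);

// Drive the pipeline and wait for it to drain.
galleryBlock.Post(new Uri("https://example.com/gallery"));
galleryBlock.Complete();
await downloadBlock.Completion;
```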
I'm having issues with Puppeteer, I am trying to type in a textbox that is in an IFrame.
I have created a simple repo with a code snippet, this one contains an IFrame with a tweet from Twitter.
await new BrowserFetcher().DownloadAsync(BrowserFetcher.DefaultChromiumRevision);
var launchOptions = new LaunchOptions
{
Headless = false,
DefaultViewport = null
};
launchOptions.Args = new[] { "--disable-web-security", "--disable-features=IsolateOrigins,site-per-process" };
ChromeDriver = await Puppeteer.LaunchAsync(launchOptions);
page = await ChromeDriver.NewPageAsync();
await page.GoToAsync(Url, new NavigationOptions { WaitUntil = new WaitUntilNavigation[] { WaitUntilNavigation.Networkidle0 } });
var selectorIFrame = "#twitter_iframe";
var frameElement1 = await page.WaitForSelectorAsync(selectorIFrame);
var frame1 = await frameElement1.ContentFrameAsync();
var frameContent1 = await frame1.GetContentAsync();
var frame1 = await frameElement1.ContentFrameAsync(); fails with the error "Frame # not found".
Versions:
PuppeteerSharp 7.0
.NET 6
Git example
Try disabling some of the security features that can be turned off when launching Puppeteer.
Also check chrome://flags/ in Puppeteer's Chrome in case something there is blocking iframe access; it may be insecure content, or you may have to explicitly disable the site isolation trials.
My two cents: the following args should allow access from non-secure origins:
Args = new[]
{
"--disable-web-security",
"--disable-features=IsolateOrigins,site-per-process,BlockInsecurePrivateNetworkRequests",
"--disable-site-isolation-trials"
}
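If ContentFrameAsync still fails after relaxing those flags, one workaround is to look the frame up on the page's Frames collection instead of going through the element handle. This is a sketch: the URL filter assumes the tweet frame's URL contains "twitter", so adjust it to your page:

```csharp
using System.Linq;
using PuppeteerSharp;

// Wait for the iframe element so we know the frame has been attached.
await page.WaitForSelectorAsync("#twitter_iframe");

// Look the frame up directly on the page instead of via the element handle.
// The URL filter here is an assumption; adjust it to your frame's URL.
var frame = page.Frames.FirstOrDefault(f => f.Url.Contains("twitter"));
if (frame != null)
{
    var frameContent = await frame.GetContentAsync();
    // interact with elements inside the frame, e.g.:
    // await frame.TypeAsync("input", "some text");
}
```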
import {
browser,
Config
} from 'protractor';
var fs = require('fs')
describe("protractor screenshot", () => {
browser.manage().window().setPosition(0, 0);
it("Demo", async () => {
browser.get("file:///D:/New%20folder/ej2-documenteditor-e2e/demos/CR_Issues/samples/height/image.html");
browser.sleep(2000);
function writeScreenShot(data: string, filename: string) {
var stream = fs.createWriteStream(filename);
stream.write(Buffer.from(data, 'base64')); // new Buffer() is deprecated
stream.end();
}
browser.executeScript('window.scrollTo(0,document.body.scrollHeight)');
screenShotUtils.takeScreenshot({
saveTo: "fullpageScreenshot.png"
})
});
});
I'm getting the error Cannot find name 'screenShotUtils'. It appears after the imports in the Protractor spec. I need to take a screenshot of the image on the website and compare it with an already-present image using Protractor.
Your code has no module or object called screenShotUtils; you should try something like this:
https://www.protractortest.org/#/api?view=webdriver.WebElement.prototype.takeScreenshot
function writeScreenShot(data, filename) {
var stream = fs.createWriteStream(filename);
stream.write(Buffer.from(data, 'base64')); // new Buffer() is deprecated
stream.end();
}
var foo = element(by.id('foo'));
//of element
foo.takeScreenshot().then((png) => {
writeScreenShot(png, 'foo.png');
});
//of entire page in viewport
browser.takeScreenshot().then((png) => {
writeScreenShot(png, 'page.png'); // use a different filename so the element shot isn't overwritten
});
I'm trying to click a song in a results list on YouTube.
I'll try to keep things simple without sharing all my classes, while still showing you the elements I'm using.
IWebDriver _webdriver = new ChromeDriver();
var wait = new WebDriverWait(_webdriver, TimeSpan.FromSeconds(10));
_webdriver.Navigate().GoToUrl("https://www.youtube.com/");
var searchBox = wait.Until(x => x.FindElement(By.Id("search")));
searchBox.SendKeys("Perfect");
var searchButton = wait.Until(x => x.FindElement(By.CssSelector("#search-icon-legacy>yt-icon")));
searchButton.Click();
var content = wait.Until(x => x.FindElement(By.Id("contents")));
var songHREF = content.FindElements(By.CssSelector("#video-title"));
songHREF[2].Click();
The main problem is that in 90% of runs, songHREF clicks an element (a song link) that is actually on the main page, not the results page.
The other 10% of runs it just fails: the songHREF element isn't found (element not visible).
Try waiting until the element is clickable:
var wait = new WebDriverWait(driver, TimeSpan.FromMinutes(1));
wait.Until(ExpectedConditions.ElementToBeClickable(songHREF[2]));
songHREF[2].Click();
This waits up to one minute for the element to become clickable and only then clicks it.
Full code would be like this:
IWebDriver _webdriver = new ChromeDriver();
var wait = new WebDriverWait(_webdriver, TimeSpan.FromSeconds(10));
_webdriver.Navigate().GoToUrl("https://www.youtube.com/");
var searchBox = wait.Until(x => x.FindElement(By.Id("search")));
searchBox.SendKeys("Perfect");
var searchButton = wait.Until(x => x.FindElement(By.CssSelector("#search-icon-legacy>yt-icon")));
searchButton.Click();
// refresh the page
_webdriver.Navigate().Refresh();
var content = wait.Until(x => x.FindElement(By.Id("contents")));
var songHREF = content.FindElements(By.CssSelector("#video-title"));
var wait2 = new WebDriverWait(_webdriver, TimeSpan.FromMinutes(1));
wait2.Until(ExpectedConditions.ElementToBeClickable(songHREF[2]));
songHREF[2].Click();
I tracked down the issue. Sometimes the locators wrongly match invisible elements, but simply refreshing the page fixes it. Note: refresh the page before locating the elements, as in the code snippet above.
Normally the locator finds the expected search-result elements. Sometimes, with the same locator, a different result comes back in which the first 25 elements are invisible. After refreshing the page, the result is as in the first case (as expected).
As per your question: once you send a character sequence, e.g. Perfect, to the search field and initiate the search, you can build a list of all the songs and then click the one of your choice, e.g. the song whose heading contains the text Lyrics. You can use the following solution:
IWebDriver _webdriver = new ChromeDriver();
_webdriver.Navigate().GoToUrl("https://www.youtube.com/");
new WebDriverWait(_webdriver, TimeSpan.FromSeconds(10)).Until(ExpectedConditions.ElementToBeClickable(By.CssSelector("input#search"))).SendKeys("Perfect");
_webdriver.FindElement(By.CssSelector("button.style-scope.ytd-searchbox>yt-icon")).Click();
IList<IWebElement> contents = new WebDriverWait(_webdriver, TimeSpan.FromSeconds(10)).Until(ExpectedConditions.VisibilityOfAllElementsLocatedBy(By.CssSelector("h3.title-and-badge.style-scope.ytd-video-renderer a")));
foreach (IWebElement content in contents)
{
    if (content.GetAttribute("innerHTML").Contains("Lyrics"))
    {
        content.Click();
        break;
    }
}
I'm working with an existing library whose goal is to pull text out of PDFs and verify it against expected values, to quality-check recorded data against the data in the PDF.
I'm looking for a way to succinctly pull a specific page's worth of text, given a string that should appear only on that page.
var pdfDocument = new Document(file.PdfFilePath);
var textAbsorber = new TextAbsorber{
ExtractionOptions = {
FormattingMode = TextExtractionOptions.TextFormattingMode.Pure
}
};
pdfDocument.Pages.Accept(textAbsorber);
foreach (var page in pdfDocument.Pages)
{
}
I'm stuck inside the foreach (var page in pdfDocument.Pages) loop... or is that even the right place to be looking?
Answer: recreate the TextAbsorber for each page, inside the foreach loop.
If the absorber isn't recreated, it keeps accumulating text from previous iterations.
public List<string> ProcessPage(MyInfoClass file, string find)
{
var pdfDocument = new Document(file.PdfFilePath);
foreach (Page page in pdfDocument.Pages)
{
var textAbsorber = new TextAbsorber {
ExtractionOptions = {
FormattingMode = TextExtractionOptions.TextFormattingMode.Pure
}
};
page.Accept(textAbsorber);
var ext = textAbsorber.Text;
var exts = ext.Replace("\n", "").Split('\r').ToList();
if (ext.Contains(find))
return exts;
}
return null;
}
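If you also need to know which page matched, a small variation works; this is a sketch against the same Aspose-style API shown above, with a hypothetical tuple-returning helper name:

```csharp
// using Aspose.Pdf; using Aspose.Pdf.Text;
// using System.Collections.Generic; using System.Linq;

// Hypothetical helper: returns the 1-based page number whose text
// contains `find`, plus that page's lines, or (0, null) if not found.
public (int PageNumber, List<string> Lines) FindPage(MyInfoClass file, string find)
{
    var pdfDocument = new Document(file.PdfFilePath);
    for (int i = 1; i <= pdfDocument.Pages.Count; i++) // Pages is 1-based
    {
        // Recreate the absorber per page so text doesn't accumulate.
        var textAbsorber = new TextAbsorber
        {
            ExtractionOptions =
            {
                FormattingMode = TextExtractionOptions.TextFormattingMode.Pure
            }
        };
        pdfDocument.Pages[i].Accept(textAbsorber);
        var text = textAbsorber.Text;
        if (text.Contains(find))
            return (i, text.Replace("\n", "").Split('\r').ToList());
    }
    return (0, null);
}
```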
I use Html Agility Pack for parsing HTML, and it's great stuff, but I've run into a problem.
This is my background code:
public static HtmlDocument GetXHtmlFromUri2(string uri)
{
HttpClient client = HttpClientFactory.Create(new CustomeHeaderHandler());
var htmlDoc = new HtmlDocument()
{
OptionCheckSyntax = true,
OptionFixNestedTags = true,
OptionAutoCloseOnEnd = true,
OptionReadEncoding = true,
OptionDefaultStreamEncoding = Encoding.UTF8,
};
htmlDoc.LoadHtml(client.GetStringAsync(uri).Result); // note: .Result blocks the thread; prefer await in async code
return htmlDoc;
}
I use Html Agility Pack in a Web API (MVC 4) project, and this is the GET method logic:
//GET api/values
public string GetHtmlFlights()
{
var result = ClientFlightTabale.GetXHtmlFromUri2("http://ikiafids.ir/departureFA.html");
HtmlNode node = result.DocumentNode.SelectSingleNode("//table[1]/tbody/tr[1]");
string temp = node.FirstChild.InnerHtml.Trim();
return temp;
}
But when I call this method (from a browser and from Fiddler) I get an exception, Object reference not set to an instance of an object, thrown on this line:
string temp = node.FirstChild.InnerHtml.Trim();
Can anyone help me, please?
I think you are looking for something like this:
var result = ClientFlightTabale.GetXHtmlFromUri2("http://ikiafids.ir/departureFA.html");
var tableNode = result.DocumentNode.SelectSingleNode("//table[1]");
var titles = tableNode.Descendants("th")
.Select(th => th.InnerText)
.ToList();
var table = tableNode.Descendants("tr").Skip(1)
.Select(tr => tr.Descendants("td")
.Select(td => td.InnerText)
.ToList())
.ToList();
I think your selector is wrong: browsers insert a tbody element when they render a table, but the raw HTML that Html Agility Pack downloads may not contain one, so //table[1]/tbody/tr[1] matches nothing and SelectSingleNode returns null. Try this instead:
result.DocumentNode.SelectSingleNode("//table/tr[1]")
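Since SelectSingleNode returns null whenever the XPath matches nothing, a defensive version of the GET method avoids the NullReferenceException entirely. This is a sketch using the same nodes as above, trying both selector variants with a null guard:

```csharp
// Sketch of a defensive GetHtmlFlights; the fallback message is illustrative.
public string GetHtmlFlights()
{
    var result = ClientFlightTabale.GetXHtmlFromUri2("http://ikiafids.ir/departureFA.html");

    // Try without tbody first (raw HTML), then with tbody (browser-style markup).
    var node = result.DocumentNode.SelectSingleNode("//table/tr[1]")
            ?? result.DocumentNode.SelectSingleNode("//table[1]/tbody/tr[1]");

    if (node == null)
        return "No table row found; inspect the downloaded HTML.";

    return node.FirstChild.InnerHtml.Trim();
}
```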