I'm looking for a way to get the document information (or document text) from another applications webbrowser control (and possibly alter it).
The other application is written in .net, but not by me.
I'm looking for an ability like this:
I would like an eventhandler for the OnDocumentCompleted that can get me the information of that document.
If possible, i would also like to intercept certain pages, add some html, and send them back to the second app to be displayed.
Searching the web pointed me towards using 'Hooks', but not much is found using hooks in this situation.
Hope you can help me out
Anthony
This code provides an example of html parsing that returns plain text (
the parsing depends on page content).
private string GetPlainText(WebBrowser webBrowser)
{
StringBuilder sb = new StringBuilder();
// Pick out a heading.
foreach (HtmlElement h1 in webBrowser.Document.GetElementsByTagName("H1"))
sb.Append(h1.InnerText + ". ");
// Select only some text, ignoring everything else.
foreach (HtmlElement div in webBrowser.Document.GetElementsByTagName("DIV"))
if (div.GetAttribute("classname") == "story-body")
foreach (HtmlElement p in div.GetElementsByTagName("P"))
{
string classname = p.GetAttribute("classname");
if (classname == "introduction" || classname == "") sb.Append(p.InnerText + " ");
}
return sb.ToString();
}
}
Related
I'm having trouble with datascraping on this web address: http://patorjk.com/software/taag/#p=display&f=Graffiti&t=Type%20Something%20.
The problem is: I've written a code that is supposed to grab the contents of a certain node and display it on console. However, the contents withing the node and the specific node itself seem to be unreachable, but I know they exists for the fact that I've created a condition within my code in order to let me know if nodes withing a certain body are being found and it is indeed being found but not displayed for some reason:
private static void getTextArt(string font, string word)
{
HtmlWeb web = new HtmlWeb();
//cureHtml method is just meant to return the http address
HtmlDocument htmlDoc = web.Load(cureHtml(font, word));
if(web.Load(cureHtml(font, word)) != null)
Console.WriteLine("Connection Established");
else
Console.WriteLine("Connection Failed!");
var nodes = htmlDoc.DocumentNode.SelectSingleNode(nodeXpath).ChildNodes;
foreach(HtmlNode node in nodes)
{
if(node != null)
Console.WriteLine("Node Found.");
else
Console.WriteLine("Node not found!");
Console.WriteLine(node.OuterHtml);
}
}
private const string nodeXpath = "//div[#id='maincontent']";
}
The Html displayed by the website looks like this:
The Html code within the website. Arrows point at the node I'm trying to reach and the content within it I'm trying to display on the console
When I run my code on console to check for the node and its contents and try to display the OuterHtml string of the Xpath, this is how console will display it to me:
Console Window Display
I hope some of you are able to explain to me why is it behaving this way. I've tried all kinds of searches on google for two days trying to figure out the problem for no use. Thank you all in advance.
The content you desire is loaded dynamically.
Use the HtmlWeb.LoadFromBrowser() method instead. Also, check htmlDoc for null, instead of calling it twice. Your current logic doesn't guarantee your state.
HtmlDocument htmlDoc = web.LoadFromBrowser(cureHtml(font, word));
if (htmlDoc != null)
Console.WriteLine("Connection Established");
else
Console.WriteLine("Connection Failed!");
Also, you'll need to decode the result.
Console.WriteLine(WebUtility.HtmlDecode(node.OuterHtml));
If this doesn't work, then your cureHtml() method is broken, or you're targeting .NET Core :)
I am using HTMLElementCollection, HtmlElement to iterate through a website and using Get/Set attributes of a website HTML and returning it to a ListView. Is it possible to get values from website a and website b to return it to the ListView?
HtmlElementCollection oCol1 = oDoc.Body.GetElementsByTagName("input");
foreach (HtmlElement oElement in oCol1)
{
if (oElement.GetAttribute("id").ToString() == "search")
{
oElement.SetAttribute("value", m_sPartNbr);
}
if (oElement.GetAttribute("id").ToString() == "submit")
{
oElement.InvokeMember("click");
}
}
HtmlElementCollection oCol1 = oDoc.Body.GetElementsByTagName("tr");
foreach (HtmlElement oElement1 in oCol1)
{
if (oElement1.GetAttribute("data-mpn").ToString() == m_sPartNbr.ToUpper())
{
HtmlElementCollection oCol2 = oElement1.GetElementsByTagName("td");
foreach (HtmlElement oElement2 in oCol2)
{
if (oElement2 != null)
{
if (oElement2.InnerText != null)
{
if (oElement2.InnerText.StartsWith("$"))
{
string sPrice = oElement2.InnerText.Replace("$", "").Trim();
double dblPrice = double.Parse(sPrice);
if (dblPrice > 0)
m_dblPrices.Add(dblPrice);
}
}
}
}
}
}
As one of the comments mentioned the better approach would be to use HttpWebRequest to send a get request to www.bestbuy.com or whatever site. What it returns is the full HTML code (what you see) which you can then parse through. This kind of approach keeps you from seinding too many requests and getting blacklisted. If you need to click a button or type in a text field its best to mimic human input to avoid being blacklisted also. I would suggest injecting a simple javascript into the page header or body and execute it from the app to send a 'onClick' event from the button (which would then reply with a new page to parse or display) or to modify the text property of something.
this example is in c++/cx but it originally came from a c# example. the script sets the username and password text fields then clicks the login button:
String^ script = "document.GetElementById('username-text').value='myUserName';document.getElementById('password-txt').value='myPassword';document.getElementById('btn-go').click();";
auto args = ref new Platform::Collections::Vector<Platform::String^>();
args->Append(script);
create_task(wv->InvokeScriptAsync("eval", args)).then([this](Platform::String^ response){
//LOGIN COMPLETE
});
//notes: wv = webview
EDIT:
as pointed out the absolute best approach would be to get/request an api. I was surprised to see that site mason pointed out for bestbuy developers. Personally I have only tried to work with auto part stores who either laugh while saying I can't afford it or have no idea what I'm asking for and hang up (when calling corporate).
EDIT 2: in my code the site used was autozone. I had to use chrome developer tools (f12) to get the names of the username, password, and button name. From the developer tools you can also watch what is sent from your computer to the site/server. This allows you to recreate everything and mimic javascript input and actions using post/get with HttpWebRequest.
I have a requirement where I would like users to type some string tokens into a Word document so that they can be replaced via a C# application with some values. So say I have a document as per the image
Now using the SDK I can read the document as follows:
private void InternalParseTags(WordprocessingDocument aDocumentToManipulate)
{
StringBuilder sbDocumentText = new StringBuilder();
using (StreamReader sr = new StreamReader(aDocumentToManipulate.MainDocumentPart.GetStream()))
{
sbDocumentText.Append(sr.ReadToEnd());
}
however as this comes back as the raw XML I cannot search for the tags easily as the underlying XML looks like:
<w:t><:</w:t></w:r><w:r w:rsidR="002E53FF" w:rsidRPr="000A794A"><w:t>Person.Meta.Age
(and obviously is not something I would have control over) instead of what I was hoping for namely:
<w:t><: Person.Meta.Age
OR
<w:t><: Person.Meta.Age
So my question is how do I actually work on the string itself namely
<: Person.Meta.Age :>
and still preserve formatting etc. so that when I have replaced the tokens with values I have:
Note: Bolding of the value of the second token value
Do I need to iterate document elements or use some other approach? All pointers greatly appreciated.
This is a bit of a thorny problem with OpenXML. The best solution I've come across is explained here:
http://openxmldeveloper.org/blog/b/openxmldeveloper/archive/2011/06/13/open-xml-presentation-generation-using-a-template-presentation.aspx
Basically Eric expands the content such that each character is in a run by itself, then looks for the run that starts a '<:' sequence and then the end sequence. Then he does the substitution and recombines all runs that have the same attributes.
The example is for PowerPoint, which is generally much less content-intensive, so performance might be a factor in Word; I expect there are ways to narrow down the scope of paragraphs or whatever you have to blow up.
For example, you can extract the text of the paragraph to see if it includes any placeholders and only do the expand/replace/condense operation on those paragraphs.
Instead of doing find/replace of tokens directly, using OpenXML, you could use some 3rd party OpenXML-based template which is trivial to use and can pays itself off soon.
As Scanny pointed out, OpenXML is full of nasty details that one has to master on on-by-one basis. The learning curve is long and steep. If you want to become OpenXML guru then go for it and start climbing. If you want to have time for some decent social life there are other alternatives: just pick one third party toolkit that is based on OpenXML. I've evaluated Docentric Toolkit. It offers template based approach, where you prepare a template, which is a file in Word format, which contains placeholders for data that gets merged from the application at runtime. They all support any formatting that MS Word supports, you can use conditional content, tables, etc.
You can also create or change a document using DOM approach. Final document can be .docx or .pdf.
Docentric is licensed product, but you will soon compensate the cost by the time you will save using one of these tools.
If you will be running your application on a server, don't use interop - see this link for more details: (http://support2.microsoft.com/kb/257757).
Here is some code I slapped together pretty quickly to account for tokens spread across runs in the xml. I don't know the library much, but was able to get this to work. This could use some performance enhancements too because of all the looping.
/// <summary>
/// Iterates through texts, concatenates them and looks for tokens to replace
/// </summary>
/// <param name="texts"></param>
/// <param name="tokenNameValuePairs"></param>
/// <returns>T/F whether a token was replaced. Should loop this call until it returns false.</returns>
private bool IterateTextsAndTokenReplace(IEnumerable<Text> texts, IDictionary<string, object> tokenNameValuePairs)
{
List<Text> tokenRuns = new List<Text>();
string runAggregate = String.Empty;
bool replacedAToken = false;
foreach (var run in texts)
{
if (run.Text.Contains(prefixTokenString) || runAggregate.Contains(prefixTokenString))
{
runAggregate += run.Text;
tokenRuns.Add(run);
if (run.Text.Contains(suffixTokenString))
{
if (possibleTokenRegex.IsMatch(runAggregate))
{
string possibleToken = possibleTokenRegex.Match(runAggregate).Value;
string innerToken = possibleToken.Replace(prefixTokenString, String.Empty).Replace(suffixTokenString, String.Empty);
if (tokenNameValuePairs.ContainsKey(innerToken))
{
//found token!!!
string replacementText = runAggregate.Replace(prefixTokenString + innerToken + suffixTokenString, Convert.ToString(tokenNameValuePairs[innerToken]));
Text newRun = new Text(replacementText);
run.InsertAfterSelf(newRun);
foreach (Text runToDelete in tokenRuns)
{
runToDelete.Remove();
}
replacedAToken = true;
}
}
runAggregate = String.Empty;
tokenRuns.Clear();
}
}
}
return replacedAToken;
}
string prefixTokenString = "{";
string suffixTokenString = "}";
Regex possibleTokenRegex = new Regex(prefixTokenString + "[a-zA-Z0-9-_]+" + suffixTokenString);
And some samples of calling the function:
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(memoryStream, true))
{
bool replacedAToken = true;
//continue to loop document until token's have not bee replaced. This is because some tokens are spread across 'runs' and may need a second iteration of processing to catch them.
while (replacedAToken)
{
//get all the text elements
IEnumerable<Text> texts = wordDoc.MainDocumentPart.Document.Body.Descendants<Text>();
replacedAToken = this.IterateTextsAndTokenReplace(texts, tokenNameValuePairs);
}
wordDoc.MainDocumentPart.Document.Save();
foreach (FooterPart footerPart in wordDoc.MainDocumentPart.FooterParts)
{
if (footerPart != null)
{
Footer footer = footerPart.Footer;
if (footer != null)
{
replacedAToken = true;
while (replacedAToken)
{
IEnumerable<Text> footerTexts = footer.Descendants<Text>();
replacedAToken = this.IterateTextsAndTokenReplace(footerTexts, tokenNameValuePairs);
}
footer.Save();
}
}
}
foreach (HeaderPart headerPart in wordDoc.MainDocumentPart.HeaderParts)
{
if (headerPart != null)
{
Header header = headerPart.Header;
if (header != null)
{
replacedAToken = true;
while (replacedAToken)
{
IEnumerable<Text> headerTexts = header.Descendants<Text>();
replacedAToken = this.IterateTextsAndTokenReplace(headerTexts, tokenNameValuePairs);
}
header.Save();
}
}
}
}
So I am starting to learn how to use XML data within a app and decided to use some free data to do this however I cannot for the life of me get it working this is my code so far. (I have done a few apps with static data before but hey apps are designed to use the web right? :p)
public partial class MainPage : PhoneApplicationPage
{
List<XmlItem> xmlItems = new List<XmlItem>();
// Constructor
public MainPage()
{
InitializeComponent();
LoadXmlItems("http://hatrafficinfo.dft.gov.uk/feeds/datex/England/CurrentRoadworks/content.xml");
test();
}
public void test()
{
foreach (XmlItem item in xmlItems)
{
testing.Text = item.Title;
}
}
public void LoadXmlItems(string xmlUrl)
{
WebClient client = new WebClient();
client.OpenReadCompleted += (sender, e) =>
{
if (e.Error != null)
return;
Stream str = e.Result;
XDocument xdoc = XDocument.Load(str);
***xmlItems = (from item in xdoc.Descendants("situation id")
select new XmlItem()
{
Title = item.Element("impactOnTraffic").Value,
Description = item.Element("trafficRestrictionType").Value
}).ToList();***
// close
str.Close();
// add results to the list
xmlItems.Clear();
foreach (XmlItem item in xmlItems)
{
xmlItems.Add(item);
}
};
client.OpenReadAsync(new Uri(xmlUrl, UriKind.Absolute));
}
}
I am basically trying to learn how to do this at the moment as I am intrigued how to actually do it (I know there are many ways but ATM this way seems the easiest) I just don't get what the error is ATM. (The bit in * is where it says the error is)
I also know the display function ATM is not great (As it will only show the last item) but for testing this will do for now.
To some this may seem easy, as a learner its not so easy for me just yet.
The error in picture form:
(It seems I cant post images :/)
Thanks in advance for the help
Edit:
Answer below fixed the error :D
However still nothing is coming up. I "think" it's because of the XML layout and the amount of descendants it has (Cant work out what I need to do being a noob at XML and pulling it from the web as a data source)
Maybe I am starting too complicated :/
Still any help/tips on how to pull some elements from the feed (As there all in Descendants) correctly and store them would be great :D
Edit2:
I have it working (In a crude way) but still :D
Thanks Adam Maras!
The last issue was the double listing. (Adding it to a list, to then add it to another list was causing a null exception) Just using the 1 list within the method solved this issue, (Probably not the best way of doing it but it works for now) and allowed for me to add the results to a listbox until I spend some time working out how to use ListBox.ItemTemplate & DataTemplate to make it look more appealing. (Seems easy enough I say now...)
Thanks Again!!!
from item in xdoc.Descendants("situation id")
// ^
XML tag names can't contain spaces. Looking at the XML, you probably just want "situation" to match the <situation> elements.
After looking at your edit and further reviewing the XML, I figured out what the problem is. If you look at the root element of the document:
<d2LogicalModel xmlns="http://datex2.eu/schema/1_0/1_0" modelBaseVersion="1.0">
You'll see that it has a default namespace applied. The easiest solution to your problem will be to first get the namespsace from the root element:
var ns = xdoc.Root.Name.Namespace;
And then apply it wherever you're using a string to identify an element or attribute name:
from item in xdoc.Descendants(ns + "situation")
// ...
item.Element(ns + "impactOnTraffic").Value
item.Element(ns + "trafficRestrictionType").Value
One more thing: <impactOnTraffic> and <trafficRestrictionType> aren't direct children of the <situation> element, so you'll need to change that code as well:
Title = items.Descendants(ns + "impactOnTraffic").Single().Value,
Description = item.Descendants(ns + "trafficRestrictionType").Single().Value
I'm trying to get an XML-text from a wabpage, that is already opened in IE. Web requests are not allowed because of a security of target page (long boring story with certificates etc). I use method to walk through all opened pages and, if I found a match with page's URI, I need to get it's XML.
Some time ago I needed to get an HTML-code between body tags. I used method with IHTMLDocument2 like this:
private string GetSourceHTML()
{
Regex reg = new Regex(patternURL);
Match match;
string result;
foreach (SHDocVw.InternetExplorer ie in shellWindows)
{
match = reg.Match(ie.LocationURL.ToString());
if (!string.IsNullOrEmpty(match.Value))
{
mshtml.IHTMLDocument2 doc = (mshtml.IHTMLDocument2)ie.Document;
result = doc.body.innerHTML.ToString();
return result;
}
}
result = string.Empty;
return result;
}
So now I need to get a whole XML-code of a target page. I've googled a lot, but didn't find anything useful. Any ideas? Thanks.
Have you tried this? It should get the HTML, which hopefully you could parse to XML?
Retrieving the HTML source code