Parse full string in Html using C# - c#

I have the following two examples of html-
User: <a style="color:#333" href="http://foo.com/word"></a> blue elephant ·
User: <a style="color:#333" href="http://foo.com/word">#<b>word</b></a> blue elephant ·
I am trying to parse this using C# and write the result to a CSV file. It works to an extent; however, when the HTML contains the '#' symbol, the CSV cell is either left blank or is missing the word that follows the '#'. The main part I am trying to get is #word blue elephant, but this brings back a blank cell, whereas the first HTML example brings back blue elephant as desired.
I am using the following technique to do this-
string[] comm = System.Text.RegularExpressions.Regex.Split(content[1], "<a");
How can I alter this to work for the second html example?

You want to use a proper HTML parser, like the one in HTML Agility Pack, in this situation (and save yourself from invoking the wrath of Cthulhu).
Some examples of how to use it
Getting started
Easily extracting links from a snippet of html with HtmlAgilityPack
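As a rough sketch of what the parser-based approach could look like for the two snippets in the question: this assumes the HtmlAgilityPack NuGet package is referenced, and the ExtractComment helper and its clean-up rule (dropping the "User:" label and the trailing "·") are hypothetical, not taken from the answer.

```csharp
// Assumes the HtmlAgilityPack NuGet package is referenced by the project.
using System;
using HtmlAgilityPack;

string[] samples =
{
    "User: <a style=\"color:#333\" href=\"http://foo.com/word\"></a> blue elephant ·",
    "User: <a style=\"color:#333\" href=\"http://foo.com/word\">#<b>word</b></a> blue elephant ·"
};

foreach (string sample in samples)
{
    Console.WriteLine(ExtractComment(sample));
}

// Hypothetical helper: returns the visible text after the "User:" label.
static string ExtractComment(string markup)
{
    var doc = new HtmlDocument();
    doc.LoadHtml(markup);

    // InnerText concatenates the text of all descendant nodes, so the "#",
    // the <b>word</b> and the trailing text come back as one string.
    string text = HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);

    // Illustrative clean-up only: drop the label and the trailing separator.
    int colon = text.IndexOf(':');
    return text.Substring(colon + 1).Trim(' ', '·');
}
```

Because the parser walks the nodes instead of splitting on "<a", the nested <b> element inside the link no longer affects the result.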


How to create and assign a string to clipboard so that when pasted MS Word will accept it as a table?

I have to find a way to generate a string so that when set and then pasted to a Word document it will display a table.
I did some research about this, but none of the approaches I found seem to work. One of the things I tried was to generate a string of HTML, then set it as the clipboard data and pass the format as HTML.
string html = @"<html><body><table>
<tr>
<th>Month</th>
<th>Savings</th>
</tr>
<tr>
<td>January</td>
<td>$100</td>
</tr>
</table></body></html>";
Clipboard.SetData(DataFormats.Html, html);
but it did not work, nothing was pasted to word doc when I tried. And then when I set Data format as text (DataFormats.Text) it was pasted but only as text, not as a table.
Basically, everything you need to know is in Microsoft's own article on this very topic: how to use Win32's Clipboard API to copy and paste HTML:
https://learn.microsoft.com/en-us/windows/win32/dataxchg/html-clipboard-format
(Note: at the time of writing, there's a small irony in that an article about correctly formatting HTML for the clipboard is itself a mesh of Markdown and broken HTML... so I thought I'd act all civic for once and fix the broken HTML and... I ended up basically rewriting the article, here's the PR).
To pass HTML through Windows' Clipboard API, you need to do a few things:
The entire HTML text must be structurally valid HTML (so you can still hang on to any Netscape-era HTML from when everyone WAS SHOUTING ALL THE TIME BECAUSE HTML TAGS WERE ALL UPPERCASE and few bothered to ever use a </p> or </li>).
...but you can't do things like <script><head></p>.
While HTML like <span><div></div></span> is syntactically valid and well-formed, HTML itself doesn't allow a <span> to parent a <div>, and requires browsers to break up the outer <span> into separate siblings and cousins and bring the single <div> to the top level. The documentation doesn't mention what happens if you attempt to copy (or paste) "HTML" like that. I assume Windows doesn't care and just treats it as a big string, but be wary of other applications that consume pasted HTML (e.g. Word, Excel, etc.), because this is how security vulnerabilities start.
It's also completely undocumented how Windows and other applications handle HTML's Custom Elements feature.
Then you need to generate a header, with the right parameters, and the right formatting just for it to work. Computers are awful.
As you're passing in a single large element (the <table>, I assume?) then the <table>'s outerHTML can be the bounds of the actual HTML fragment that will (or rather, should) be copied by the receiving application.
This outer element, the fragment, is marked with HTML comments <!--StartFragment--> and <!--EndFragment-->.
Only now can you calculate the values of those header parameters (which also represent an intentionally redundant representation of the <!--Start/EndFragment--> bounds)...
... the problem is that inserting those Start/EndFragment comments into the HTML makes the message longer, which would break any previously calculated absolute offsets.
So, (Doug DeMuro voice) this is the template for text data representing a HTML fragment being placed into the Windows Clipboard:
Version:1.0
StartHTML:AAAA
EndHTML:BBBB
StartFragment:CCCC
EndFragment:DDDD
[StartSelection:CCCC
EndSelection:CCCC]
<HTML goes here>
The headers are separated by simple line-breaks. Curiously, MS's documentation says that all 3 major line-break styles are permissible (\r, \r\n, and \n).
The last 2 headers/parameters (StartSelection and EndSelection) are optional, but you must either supply neither or supply both: supplying only one of the two Start/EndSelection parameters will break things in unspecified ways.
My personal favourite is nasal demons.
Now do this:
Copy that template into a new String and stick your completed HTML directly at the end of the header, separated by a single line-break.
Take a look at that header above. See at the top where it says Version:1.0? Yeah? Good. Ignore it.
Now see the StartHTML parameter below it. We will calculate that last. Ignore it for now.
Next is EndHTML, which is the absolute byte offset (from the start of the header, not the HTML) of the end of the HTML text itself - basically it's the message length.
StartFragment is the absolute byte offset (from the start of the header) to the < in a well-formed HTML element representing the fragment itself.
The inserted <!--Start/EndFragment--> comments must be considered outside the fragment (i.e. a String representation of your fragment's raw HTML would not have those comments in it anywhere).
EndFragment is the absolute byte offset (from the start of the header) to the end of the fragment. This is an exclusive upper-bound (so if StartFragment=100 and EndFragment=150 then the fragment is 50 bytes in length, simple).
Important note: Remember the text run represented by Start/EndFragment must be a valid kinda-"self-contained" element. So you cannot have a fragment that abruptly ends inside a tag or its attributes - nor start half-way through a container element's many text nodes...
...however the Start/EndSelection headers, if present, can start and end at arbitrary points in the human-readable text (but conversely, the documentation does not say what happens if Start/EndSelection are inside an element's tags or attributes, for example).
Then substitute those calculated numbers into the BBBB/CCCC/etc. placeholders above, but do not delete unused placeholder characters: instead, pad each number with leading '0' digits so the header keeps the same length and the offsets stay valid.
Don't worry: in this case (at least) any leading zeroes are not interpreted as forcing octal (base-8) integer parsing, phew.
Oh, almost forgot: go back to the top to calculate StartHTML, which is the absolute byte offset to the start of the HTML text (the opening '<' in <html>, I assume).
So you'll end up with this:
string htmlInTheClipboardLooksLikeThis =
@"Version:1.0
StartHTML:0081
EndHTML:0263
StartFragment:0016
EndFragment:0126
<html>
<body>
<!--StartFragment--><table>
<tr>
<th>Month</th>
<th>Savings</th>
</tr>
<tr>
<td>January</td>
<td>$100</td>
</tr>
</table><!--EndFragment-->
</body>
</html>";
and you just need to pass it into SetText (not SetData) with TextDataFormat.Html:
Clipboard.SetText( htmlInTheClipboardLooksLikeThis, TextDataFormat.Html );
and just running that in Linqpad gives us something we can paste directly into Word, and Word did indeed import it as a table.
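Counting the offsets by hand is error-prone, so here is a minimal sketch of how they could be computed instead. The BuildClipboardHtml helper is hypothetical (it is not part of the answer above), and it assumes \r\n line breaks, fixed-width 10-digit zero-padded offsets, and UTF-8 byte counts, with a wrapper of just <html><body> around the fragment.

```csharp
using System;
using System.Globalization;
using System.Text;

string table =
    "<table><tr><th>Month</th><th>Savings</th></tr>" +
    "<tr><td>January</td><td>$100</td></tr></table>";

string payload = BuildClipboardHtml(table);
Console.WriteLine(payload);
// In a Windows Forms app the result would then go to:
// Clipboard.SetText(payload, TextDataFormat.Html);

// Hypothetical helper: wraps a fragment in the header described above.
static string BuildClipboardHtml(string fragment)
{
    // Fixed-width (10 digit) fields mean the header length does not change
    // when the real numbers are filled in, so offsets can be computed once.
    const string headerTemplate =
        "Version:1.0\r\n" +
        "StartHTML:{0:D10}\r\n" +
        "EndHTML:{1:D10}\r\n" +
        "StartFragment:{2:D10}\r\n" +
        "EndFragment:{3:D10}\r\n";
    const string prefix = "<html>\r\n<body>\r\n<!--StartFragment-->";
    const string suffix = "<!--EndFragment-->\r\n</body>\r\n</html>";

    // All offsets are byte offsets from the start of the header.
    int startHtml = Encoding.UTF8.GetByteCount(
        string.Format(CultureInfo.InvariantCulture, headerTemplate, 0, 0, 0, 0));
    int startFragment = startHtml + Encoding.UTF8.GetByteCount(prefix);
    int endFragment = startFragment + Encoding.UTF8.GetByteCount(fragment);
    int endHtml = endFragment + Encoding.UTF8.GetByteCount(suffix);

    string header = string.Format(CultureInfo.InvariantCulture, headerTemplate,
        startHtml, endHtml, startFragment, endFragment);
    return header + prefix + fragment + suffix;
}
```

Padding every offset to the same width sidesteps the chicken-and-egg problem: the header has a known size before any of the values are known.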

Use OpenXML to replace text in DOCX file - strange content

I'm trying to use the OpenXML SDK and the samples on Microsoft's pages to replace placeholders with real content in Word documents.
It used to work as described here, but after editing the template file in Word adding headers and footers it stopped working. I wondered why and some debugging showed me this:
Which is the content of texts in this piece of code:
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(DocumentFile, true))
{
var texts = wordDoc.MainDocumentPart.Document.Body.Descendants<Text>().ToList();
}
So what I see here is that the body of the document is "fragmented", even though in Word the content looks like this:
Can somebody tell me how I can get around this?
I have been asked what I'm trying to achieve. Basically I want to replace user defined "placeholders" with real content. I want to treat the Word document like a template. The placeholders can be anything. In my above example they look like {var:Template1}, but that's just something I'm playing with. It could basically be any word.
So for example if the document contains the following paragraph:
Do not use the name USER_NAME
The user should be able to replace the USER_NAME placeholder with the word admin for example, keeping the formatting intact. The result should be
Do not use the name admin
The problem I see with working at paragraph level (concatenating the content and then replacing the content of the paragraph) is that I fear losing the formatting that should be kept, as in
Do not use the name admin
Various things can fragment text runs. Most frequently proofing markup (as apparently is the case here, where there are "squigglies") or rsid (used to compare documents and track who edited what, when), as well as the "Go back" bookmark Word sets in the background. These become readily apparent if you view the underlying WordOpenXML (using the Open XML SDK Productivity Tool, for example) in the document.xml "part".
It usually helps to go an element level "higher". In this case, get the list of Paragraph descendants and from there get all the Text descendants and concatenate their InnerText.
OpenXML is indeed fragmenting your text:
I created a library that does exactly this: it renders a Word template with the values from a JSON object.
From the documentation of docxtemplater:
Why you should use a library for this
Docx is a zipped format that contains some xml. If you want to build a simple replace {tag} by value system, it can already become complicated, because the {tag} is internally separated into <w:t>{</w:t><w:t>tag</w:t><w:t>}</w:t>. If you want to embed loops to iterate over an array, it becomes a real hassle.
The library basically will do the following to keep formatting :
If the text is :
<w:t>Hello</w:t>
<w:t>{name</w:t>
<w:t>} !</w:t>
<w:t>How are you ?</w:t>
The result would be :
<w:t>Hello</w:t>
<w:t>John !</w:t>
<w:t>How are you ?</w:t>
You also have to replace the tag with <w:t xml:space="preserve"> to ensure that spaces are not stripped out if there are any in your variables.
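To make the idea concrete, here is a small sketch of replacing a placeholder that is split across several <w:t> elements. It uses only System.Xml.Linq on a hand-written paragraph, not the Open XML SDK and not docxtemplater's actual implementation; the ReplaceAcrossRuns helper is hypothetical and only handles the first occurrence in a paragraph.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

XNamespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

XElement paragraph = XElement.Parse(
    "<w:p xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\">" +
    "<w:r><w:t>Hello</w:t></w:r>" +
    "<w:r><w:t>{name</w:t></w:r>" +
    "<w:r><w:t>} !</w:t></w:r>" +
    "<w:r><w:t>How are you ?</w:t></w:r>" +
    "</w:p>");

ReplaceAcrossRuns(paragraph, w, "{name}", "John");

foreach (XElement textNode in paragraph.Descendants(w + "t"))
{
    Console.WriteLine("[" + textNode.Value + "]");
}

// Hypothetical helper: replaces the first occurrence of 'tag', even when
// Word has split it over several <w:t> elements. Runs that do not overlap
// the tag are left untouched, so their formatting is kept.
static void ReplaceAcrossRuns(XElement para, XNamespace ns, string tag, string replacement)
{
    List<XElement> texts = para.Descendants(ns + "t").ToList();
    string joined = string.Concat(texts.Select(t => t.Value));

    int start = joined.IndexOf(tag, StringComparison.Ordinal);
    if (start < 0) return;
    int end = start + tag.Length;

    int pos = 0;
    bool inserted = false;
    foreach (XElement t in texts)
    {
        string s = t.Value;
        int nodeStart = pos;
        int nodeEnd = pos + s.Length;
        pos = nodeEnd;
        if (nodeEnd <= start || nodeStart >= end) continue; // no overlap with the tag

        string before = start > nodeStart ? s.Substring(0, start - nodeStart) : "";
        string after = end < nodeEnd ? s.Substring(end - nodeStart) : "";
        t.Value = before + (inserted ? "" : replacement) + after;
        // Keep leading/trailing spaces when Word reads the file back.
        t.SetAttributeValue(XNamespace.Xml + "space", "preserve");
        inserted = true;
    }
}
```

The replacement text lands in the first run that contained part of the tag, so it inherits that run's formatting; whether that is the right run to pick is a design choice this sketch does not try to settle.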

How to retrieve data from an html string from a span tag by using Regular Expressions?

I need to retrieve some info from an HTML document, since the web service that will return JSON or XML is not ready yet. I'm working with C# and using regular expressions to get the data I need from the HTML string. I've managed to get the div I want to work with out of the whole HTML string, but now I'm having trouble getting the content of the first span tag.
I've attempted to retrieve the data between ; and the first closing span tag, but what I really want is the content of the first span tag.
Here's the regular expression i've written so far, but it's not working:
".*;(?<Content>(\r|\n|.)*)</span>"
I also tried this, but it didn't work either:
"<span class=""type"">(?<Content>(\r|\n|.)*)</span>"
Here is the div i want to retrieve the data from:
<div class="main">ABASASDFÓ 18/06/2014 17:38h Blabla Balbal <span class="type">15.80€ </span>+1.94 % +0.30€ | HOME <SPAN class="type2">11,398.70</span> +0.65 % +74.10</div>
EDIT: I can't use HtmlAgilityPack since my client does not want us to use any external library. I've also heard about using XmlReader, but I'm not sure the structure of the HTML will be valid XML.
This regex will capture the string:
"<span class=\"type\">(?<Content>([^<]*))</span>"
Although, I agree with the other answers: you should use something like XPath instead of regexes for parsing HTML.
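For reference, a minimal sketch of running that pattern in C# against the div from the question. RegexOptions.IgnoreCase and the final Trim() are additions made here for illustration and are not part of the answer.

```csharp
using System;
using System.Text.RegularExpressions;

string divHtml =
    "<div class=\"main\">ABASASDFÓ 18/06/2014 17:38h Blabla Balbal " +
    "<span class=\"type\">15.80€ </span>+1.94 % +0.30€ | HOME " +
    "<SPAN class=\"type2\">11,398.70</span> +0.65 % +74.10</div>";

// [^<]* cannot run past the next tag, so the match ends at the first </span>
// instead of greedily swallowing everything up to the last one.
Match match = Regex.Match(
    divHtml,
    "<span class=\"type\">(?<Content>[^<]*)</span>",
    RegexOptions.IgnoreCase);

string content = match.Success ? match.Groups["Content"].Value.Trim() : "";
Console.WriteLine(content);
```

The two patterns in the question fail because (\r|\n|.)* is greedy and matches through to the last </span> in the string.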
Here's how it is done with a regex in Javascript. You should be able to adapt this for C# pretty easily.
var inner = html.match( /<span class="type"(?:\s+[a-z]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)))*\s*>([\S\s]*)<\/span>/i)[1];
Fiddle: http://jsfiddle.net/GarryPas/uk32r8vz/
You want to use XPath for that. Something like this:
div/span/text()
I understand not wanting an external 3rd-party library in your solution. The solution to that is to go fetch the source code of the entire library:
https://htmlagilitypack.codeplex.com/
Now you don't have an external library, you have an internal library and you can use the right tool for the job!
XmlReader is a fairly low-level tool, it could technically do the job for you but what you're more after is "use XmlReader to do XPath" which is talked about here: https://msdn.microsoft.com/en-us/library/ms950778.aspx
The XPathReader class is the result of all that, which has been superseded by LINQ to XML: https://msdn.microsoft.com/en-ca/library/bb387098.aspx
So another option here is to try to use some LINQ to process your HTML file, but that might be tricky since HTML isn't good XML. Still, it's another option if you're looking for those.
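A brief sketch of the LINQ to XML route, using only the base class library. It assumes the markup can be made well-formed first; the Replace call that fixes the mismatched <SPAN> tag and the TryParse helper are hypothetical and only cover this one sample.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;
using System.Xml.XPath;

string raw =
    "<div class=\"main\">ABASASDFÓ 18/06/2014 17:38h Blabla Balbal " +
    "<span class=\"type\">15.80€ </span>+1.94 % +0.30€ | HOME " +
    "<SPAN class=\"type2\">11,398.70</span> +0.65 % +74.10</div>";

// The raw HTML is not valid XML: <SPAN> is closed by </span>, and XML names
// are case-sensitive, so parsing it as-is fails.
bool rawIsXml = TryParse(raw, out _);

// Illustrative fix-up for this sample only.
string wellFormed = raw.Replace("<SPAN", "<span");
bool fixedIsXml = TryParse(wellFormed, out XElement div);

// XPath relative to the <div>: the child span whose class is "type".
string price = div.XPathSelectElement("span[@class='type']").Value.Trim();
Console.WriteLine(price);

// Hypothetical helper: reports whether the markup parses as XML.
static bool TryParse(string markup, out XElement element)
{
    try
    {
        element = XElement.Parse(markup);
        return true;
    }
    catch (XmlException)
    {
        element = null;
        return false;
    }
}
```

This illustrates the caveat above: it works only for as long as the page happens to be XML-compatible, which real-world HTML often is not.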

C# CsQuery as Html Documents Builder

So far I used HtmlAgilityPack for building Html documents.
The problem is that it is not stable: I get StackOverflowExceptions, and it doesn't support jQuery syntax.
What I am trying to use to build Html documents is CsQuery.
My question is:
Is it designated for building Html documents?
I like the functions it offers, but I cannot render the modified html document.
For example:
CQ fragment= CQ.CreateFragment("<p>some text</p>");
CQ html = CQ.CreateFromFile(@"index.html");
CQ modified_html= html.Select("#test").Append(fragment);
Which means, I want to append fragment variable to element with id "test".
The problem is that I expect modified_html.Render() to return the modified version (including <p>some text</p> appended to the #test element), but it actually doesn't!
Is there any way to achieve this?
Actually it does. I checked with your code, and it does append <p>some text</p> to modified_html. The only possible issue I can think of is that there is no element with id = "test" in index.html. You may also want to save the modified HTML to a file so it will be easier for you to examine the output:
modified_html.Save(@"index_modified.html");

How to read line of HTML as string in c#

I am trying to get the page title from the page source of different pages. But let's say some pages have a title like this:
&quot;This is an example,&quot; ABC.
It has some HTML in it, like &quot;. If I use a string in C# to get this title, I get the whole thing, and when displaying it, it is shown as above, which is wrong. Is there any way to ignore or take into account HTML entities in C#?
I am also using htmlagilitypack so anything in that will do too.
You can use WebUtility.HtmlDecode to decode html, link on MSDN:
WebUtility.HtmlDecode("&quot;This is an example,&quot; ABC.");
just use:
using System.Net;
The result will be: "\"This is an example,\" ABC."
You also can use HtmlEntity.DeEntitize in HTML Agility Pack:
HtmlEntity.DeEntitize(string text)
You don't know what you will find in the page title. Sometimes it is a whole mess. My suggestion is to get the string as it is and process it before showing or saving it.
In this case, the solution is simple: replace the
&quot;
with the corresponding character.
Each time you read an HTML document to extract some tags, watch out for tags that are never closed. If the author forgot to close the title tag... you'll get the whole page in that line!
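One way to guard against a missing </title> is to stop reading at the next tag rather than searching for the closing tag. The sketch below does that with a regex and then decodes the entities; the ExtractTitle helper and the sample page are hypothetical, and with HTML Agility Pack available you would more likely select the title node instead.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Sample page where the author forgot to close the <title> tag.
string page =
    "<html><head><title>&quot;This is an example,&quot; ABC." +
    "<meta charset=\"utf-8\"></head><body><p>Body text</p></body></html>";

string title = ExtractTitle(page);
Console.WriteLine(title);

// Hypothetical helper: returns the decoded title text, or "" if none found.
static string ExtractTitle(string html)
{
    // [^<]* stops at the next tag of any kind, so an unclosed <title>
    // cannot pull in the rest of the page.
    Match m = Regex.Match(html, @"<title[^>]*>([^<]*)", RegexOptions.IgnoreCase);
    return m.Success ? WebUtility.HtmlDecode(m.Groups[1].Value).Trim() : string.Empty;
}
```

WebUtility.HtmlDecode handles the other named and numeric entities as well, so there is no need to replace &quot; by hand.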
