Edit HTML files manipulating DOM in a jQuery style - c#

I have a batch of HTML files that need some edits which are easy to perform with jQuery (mainly selecting some nodes and changing their attributes).
My approach so far has been to open them one by one in Google Chrome, execute the jQuery code in the console, and then copy the resulting DOM back into my HTML editor.
Since this takes a lot of time, and since every file needs the same edit (i.e., the same jQuery/JS code will work for every HTML file), I am considering writing a script or program to do this.
However, I am not sure which of the following approaches (if any) I should take to accomplish this task.
1) Write a JavaScript script with jQuery using some FileSystem/File manipulation library (which one?)
2) Write a Java or C# program using some jQuery-based library (like CsQuery)
3) Find a plugin for one of my editors (Aptana, Notepad++, Eclipse, etc.) or a completely different editor that supports jQuery-like editing commands (just as Notepad++ supports regex replacement). This would be slow with big batches, but at least it would let me avoid the annoying copy/paste to and from Chrome.
Is one of these approaches the right way to accomplish what I need? (Is there a right way to do this?) Which would be the most straightforward?
I think that #2 would be easier for me, since I have a lot more experience in Java and C# than in JavaScript, but I suspect it might be using a sledgehammer to crack a nut.

You should consider using PhantomJS. It is a headless WebKit browser that can be run from the command line. It accepts a JavaScript or CoffeeScript file as an argument, which can be used to, e.g., do something with a web page. Here is an example:
var page = require('webpage').create();
page.open('http://m.bing.com', function(status) {
    var title = page.evaluate(function(s) {
        return document.querySelector(s).innerText;
    }, 'title');
    console.log(title);
    phantom.exit();
});

I am not sure there is a single right way, but it sounds like you are familiar with C#, so I would think writing a class library would involve the least overhead for automation. Here are some potential solutions:
Scripting Library (e.g., C#.NET) - You can use a library like the one you mentioned or something like ScriptSharp if you want to use DOM manipulation. If the HTML has appropriate closing tags, you can also use LINQ to easily navigate the HTML (or something like the HTML Agility Pack found on CodePlex). I would even recommend using Mustache with an HTML file template in C#.
JavaScript Library - If you want to stay in pure JavaScript, you can use Node.js. There are file manipulation libraries you can use.
Headless Browsers - I haven't thought through how to save the resulting HTML automatically, but you can use something like jsTestDriver or PhantomJS.
You can go with the editor plugins as well, but I would stick with a Java, C#, Python, etc. library that you can potentially call from an existing application or schedule as a job/service.
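To make the "library you can schedule as a job" option concrete, here is a minimal sketch in Python using only the standard library. It is an illustration under stated assumptions, not CsQuery or the HTML Agility Pack: the names AttributeRewriter, open_links_in_new_tab, rewrite and rewrite_folder are made up for this example, and the rule it applies (adding target="_blank" to links with class "external") is just a stand-in for whatever jQuery edit you run today. The output is re-serialized, so quoting and tag case may be normalized compared with the input.

```python
from html import escape
from html.parser import HTMLParser
from pathlib import Path


class AttributeRewriter(HTMLParser):
    """Re-emit HTML, passing every tag's attributes through `edit`."""

    def __init__(self, edit):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.edit = edit  # hypothetical callback: (tag, attrs dict) -> attrs dict
        self.out = []

    def _open(self, tag, attrs, closer):
        parts = [tag]
        for name, value in self.edit(tag, dict(attrs)).items():
            # Attribute values arrive unescaped, so escape them again on output.
            parts.append(name if value is None else f'{name}="{escape(value)}"')
        self.out.append("<" + " ".join(parts) + closer)

    def handle_starttag(self, tag, attrs):
        self._open(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        self._open(tag, attrs, " />")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def open_links_in_new_tab(tag, attrs):
    # Rough equivalent of $('a.external').attr('target', '_blank')
    if tag == "a" and "external" in (attrs.get("class") or "").split():
        attrs["target"] = "_blank"
    return attrs


def rewrite(html, edit):
    parser = AttributeRewriter(edit)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


def rewrite_folder(folder, edit):
    # Overwrites each file in place; work on a copy of the folder first.
    for path in Path(folder).glob("*.html"):
        text = path.read_text(encoding="utf-8")
        path.write_text(rewrite(text, edit), encoding="utf-8")
```

The same shape carries over to C# or Java: one function that describes the edit, one that applies it to a document, and a loop over the files.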

Related

What is the most straightforward way of modifying PDF start-up options via C#?

I'd like to know what would be the most straightforward (i.e., preferably without adding DLLs to my solution) way to write a C# code snippet for modifying a PDF document's page layout and magnification options.
I know this is incredibly easy in LaTeX (well, it's easy to set these for a PDF to-be-generated using a single package) through hyperref options.
Modifying an already-created PDF document is a lot more work however. Also, most MS Office print-to-PDF options don't seem to include this ability while it's obviously annoying to keep opening documents up in Acrobat and manually modifying these for the proper (desired) opening settings (layout: single page, and magnification: fit page, for me).
So, I would like to write a code snippet that could do this. What is the most straightforward way of modifying PDF start-up options via C#?
Alternatively, is there a way to force-apply these options when generating PDFs from MS Office software suites?
I looked into PDFSharp and MigraDoc at one point for PDF-making from C#, but that really didn't match what I was hoping for. Plus, it added a DLL set (of ten dependencies) which was sub-optimal with respect to how many ancillary files were included with the build for managing a relatively simple function.
However, as noted in the comments, such a batch modification code is likely to be very difficult without any dependencies, so I'll also accept answers which reference dependencies.
Past topics (but not very good matches):
This is similar but for JavaScript, and actually suggests it shouldn't be done.
This is a bit more elaborate for JavaScript, but not exactly what I am looking for.
This topic seems to be this question for PHP but including some external references...

Secure way to run runtime code (XML file editing) C#

I have a CLI application which is able to edit XML files with some parameters.
However I'm needing now a more powerful way to do it.
I want to give users the option to edit XML files using custom code from a .txt for total control over the XML editing.
For example:
#CODE File<file name for XML editing>
<code>
# Custom XML parser/editing code
for elem in tree.iter(tag='location'):
    if elem.text == 'J':
        elem.text = 'January'
</code>
What would be the safest way to do this in .NET C#? I mean that the user should only be able to edit the XML file and not do anything else that compromises the security of the system (like deleting files).
I'm thinking of using a JavaScript engine (like this one) and running JavaScript code from the file. I believe JavaScript would limit what the user is able to do. I also thought of C# code and Python, but these may introduce security issues.
Edit:
One requirement is that it must work on mono.
I have chosen the IronJS .NET runtime with a JavaScript XML library discussed here (XML for <SCRIPT> W3C DOM Parser).
I have also looked at other JavaScript .NET runtimes such as Javascript .NET, Jurassic and Jint (I opted for IronJS because of its better performance). I also tested some .NET Lua libraries, namely KopiLua, but opted for the JavaScript solution because it seemed more complete, better documented and easier to use.
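For reference, the example edit from the question runs as follows with Python's standard xml.etree.ElementTree. This is only a sketch of the edit itself: the sample document and the function name expand_month_codes are assumptions made for illustration, and running user-supplied Python like this does nothing to address the sandboxing concern.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; only the element name 'location' comes from the question.
SAMPLE = "<events><location>J</location><location>March</location></events>"


def expand_month_codes(xml_text):
    """Replace the text 'J' with 'January' in every <location> element."""
    root = ET.fromstring(xml_text)
    for elem in root.iter("location"):
        if elem.text == "J":
            elem.text = "January"
    # encoding="unicode" returns a str without an XML declaration.
    return ET.tostring(root, encoding="unicode")


print(expand_month_codes(SAMPLE))
```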

How to access IE XHTML DOM+JS engines without starting the browser itself

I'm trying to build a headless browser in C#. C# has plenty of classes that are supposed to make this possible, for example JScriptCodeProvider.
I am looking for the IE XML DOM classes for the JavaScript code to work with. Can anyone tell me where to find those and, if possible, provide me with a working example of what I'm trying to do?
Use the WebBrowser control. That should get you everything you need.

.Net HTML Editor Control

I need to add in a WYSIWYG control into a .NET form. I found this one from SpiceLogic on several sites and was wondering if this is a decent library to use?
http://www.spicelogic.com/Products/NET-Win-HTML-Editor-Control-8/
If anyone has any additional input, I also would like to know of any other decent alternatives, both free and non-free.
Thanks in advance for any opinions on this!
EDIT Should have clarified this before, but this is a desktop application.
You can also try one of these strategies:
Use the RichTextBox control, which exposes a FlowDocument. Write a program that converts the FlowDocument to HTML. Since FlowDocuments are much more constrained than HTML, this conversion might be pretty straightforward (sections -> div, paragraph -> p, styles -> CSS or style attributes, etc.).
Use MSHTML and put it into edit mode. http://msdn.microsoft.com/en-us/library/aa753622(v=vs.85).aspx
You may want to try XStandard. I have used it in CMS web sites and it works great. You can also use it with desktop apps. There is a free "lite" version and a for-pay pro version. It creates XHTML markup and has lots of slick built-in functionality.
As a comparison, I have used Telerik RAD Editor and XStandard is much better (IMO). I have also tried other web-specific solutions like FCKEditor and TinyMCE and I prefer XStandard.
If your concern is to get XHTML all the time right from the beginning which should be published on the Web, then, I would say, "Yes", you can try that component from SpiceLogic, especially the version 5.x which was released very recently. It comes with many features like embedding images for an email client, Uploading local images to FTP, paste from MS Word, rich Dialogs for Tables, Images, Hyperlink, Symbols, Inline Spell Checker and Spell Checker dialogs, and more.
https://www.spicelogic.com/Products/NET-WinForms-HTML-Editor-Control-8
All Screenshots:
http://www.spicelogic.com/Products/NET-WinForms-HTML-Editor-Control-8/Screenshots
TinyMCE is a great way to achieve this. Here is a way to embed TinyMCE in Winform. I tested it and it works pretty well: https://github.com/Rocker93/winforms-html-editor
Another solution is CefSharp. The integration is not easy, but it's very well documented, and it's the most powerful free solution I have found.
At work we use Telerik controls for this stuff:
http://www.telerik.com/products/aspnet-ajax/editor.aspx
It's definitely not free though.

How can I make an application in c# collect data from a website?

First of all, I hope my question doesn't bother you. I really need to get an idea of how I can accomplish this, but unfortunately I'm a real beginner; I'm still crawling when it comes to programming. I'm struggling to learn it the best way I can, and I'll be thankful for any help you can give me.
Here's the task: I was asked to find a way to collect some data from a website using a C# application. This will be done every day, in order to update the data which we'll use to calculate a financial index.
I know my question might sound vague; even telling me how I can be more precise will help me. I know I seem desperate, but putting aside all the personal issues, my scholarship kind of depends on it.
Thanks in advance! (Please don't mind the bad English; I'm Brazilian and my English might not be that good yet.)
First, your English is fine. In fact, I thought you were a native speaker until you said otherwise.
The term you're looking for is 'site scraping'. Observe this question: Options for HTML scraping?. The second answer points to an HTML agility pack library you can use.
Now, there are two possibilities here. The first is you have to parse the HTML and scrape your data out of it. This is more computationally intensive and depends on the layout of the page. If they change the way the site looks, it could break the scraper.
The second possibility is they provide some XML or JSON web service you can consume. In this case you aren't scraping anything, but are rather using a true data feed. If the layout of the site changes, you will not break. Whether your target site supports this form of data feed is up to the site.
If I understand your question, you're being asked to do some Web Scraping, where you 1) download the contents of a web page and 2) try to parse data from that content.
For step #1, you should look into using a WebClient object in C# to download the HTML from the web page. You can give a WebClient object the URL you want to download the content from and obtain a String containing the content (probably HTML) of the URL.
How you go about doing step #2 depends on what content is present at the web site. If you know of certain patterns you're looking for in the HTML, you can search the HTML string using various methods. A more general solution for parsing HTML data can be found through using the Html Agility Pack, which will let you handle the HTML as a tree structure (DOM).
Use the WebClient class to get the page.
Turn the html into xml.
Use XPath to select the data you are interested in.
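The three steps above can be sketched as follows. This is a Python illustration rather than C#, with urllib standing in for WebClient; the function names, the sample markup and the td class "price" are assumptions made up for the example. It also assumes the page is already well-formed XHTML, which real pages often are not (that is the gap a library like the Html Agility Pack fills).

```python
import urllib.request
import xml.etree.ElementTree as ET


def download(url):
    """Step 1: fetch the page as text (the role WebClient plays in C#)."""
    with urllib.request.urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


def select_prices(xhtml):
    """Steps 2 and 3: parse well-formed XHTML, then select cells with XPath."""
    root = ET.fromstring(xhtml)
    # ElementTree supports only a subset of XPath; this expression is within it.
    return [td.text for td in root.findall(".//td[@class='price']")]


# Hypothetical page content, used here instead of a live download.
SAMPLE_PAGE = (
    "<html><body><table>"
    "<tr><td class='name'>A</td><td class='price'>10.5</td></tr>"
    "<tr><td class='name'>B</td><td class='price'>7.25</td></tr>"
    "</table></body></html>"
)

print(select_prices(SAMPLE_PAGE))
```

Against a live site you would call select_prices(download(url)), provided the markup parses as XML.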
Ok, this is a pretty straightforward app design, and a lot of the code exists that you can reuse. Since you're a beginner, I'll break down into steps of what you need to do and recommend approaches.
1) You will use classes from System.Net to pull the web pages (WebClient being the easiest to use). You will want to have this part of the program run on a timer if you can (using the scheduled jobs feature of the OS) and have it just pull the pages and drop them in a folder.
2) You have a second job which will run separately, pulling unread files from that folder, parsing them (using the Html Agility Pack library is best) and then storing them in an index of some kind (Lucene is best for that).
3) You have a front end application of some sort (web or desktop) which queries that index for the information you're looking for.
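A rough sketch of how jobs 1 and 2 hand work to each other through a folder, written in Python for brevity. Everything named here is hypothetical: store_page and process_inbox are invented for the example, a plain dict stands in for the Lucene index, and the parse argument is whatever function extracts your data from a page. The download itself and the scheduling are left out.

```python
import shutil
import tempfile
from pathlib import Path


def store_page(inbox, name, html):
    """Job 1 (after downloading): drop the raw page into the inbox folder."""
    inbox.mkdir(parents=True, exist_ok=True)
    (inbox / name).write_text(html, encoding="utf-8")


def process_inbox(inbox, done, parse, index):
    """Job 2: parse each unread file, record the result, then move it to done."""
    done.mkdir(parents=True, exist_ok=True)
    for path in sorted(inbox.glob("*.html")):
        index[path.name] = parse(path.read_text(encoding="utf-8"))
        # Moving the file out of the inbox is what marks it as read.
        shutil.move(str(path), str(done / path.name))
    return index


# Small demonstration in a throwaway directory; len() stands in for a real parser.
with tempfile.TemporaryDirectory() as tmp:
    inbox, done = Path(tmp) / "inbox", Path(tmp) / "done"
    store_page(inbox, "2024-01-01.html", "<p>1</p>")
    print(process_inbox(inbox, done, len, {}))
```

Keeping the two jobs separate means a parsing bug never costs you a day's download: the raw pages stay on disk and can be reprocessed.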
