Modular programming to accommodate future changes (software for scraping websites)

I developed an application in C# using Windows Forms to scrape selected websites for images.
The first problem is that the websites I monitor constantly change their look and feel, so my code keeps needing updates. I switched to using XPaths to isolate the divs I look for, but the div ids change too. I have thought of using a text file holding the div XPath for each site, which the software would read, saving me from editing and recompiling the code. Is there a better way to solve this problem? Maybe CodeDom?
Secondly, since every website uses different formatting and encoding, I had to rewrite the parts of the code that use HtmlDocument, HttpWebResponse, HtmlNode and others for each of them, which ended up accounting for nearly half of my code. I could not merge them since some sites need extra scraping and pagination and some do not. Is there a way to simplify this?
Lastly, I have all the code in one class file of around 600 lines. The only methods I have are the BackgroundWorker handlers, UI event handlers, one scraping method per site, and one method to save the images. Is it all right to have all the code in one class? When I wrote Java, I often used multiple classes and instantiated them as objects, which made changing particular sections easier. Can I do the same in C#?
Is there a more efficient approach to building the software? I was thinking of making a class for each site, so that modifications could be made directly to the class in question, but that would cause a lot of lines to be repeated in each class. Or is it okay to keep everything in one class file?
Thanks.
PS: This software is for personal use, but I think it is a good opportunity to learn and apply good programming.

Related

XML with .NET - Build in code or read from a file?

I am building a web application that will generate charts and graphs using a 3rd party charting component. This charting component requires it receive an XML file containing the design parameters and data in order to render the chart. The application may render up to 10 to 20 charts per page view. I am looking for suggestions for the most efficient way to handle this.
I need to load XML templates, of which there will be about 15-20, one for each chart type definition. With the templates loaded, I will then add the chart-specific data and send it off to the charting component for rendering. Some of the possible ways of handling this, off the top of my head:
Build each XML template in code, using StringBuilder
Build each XML template in code, using one of the .NET XML classes
Store each XML template in a file, load it from the disk on demand
Store each XML template in a file, load them all at once on application start
Storing the XML templates in files would greatly simplify the development processes for me, but I don't know what kind of performance hit I would take, especially if I was continually reading them off the disk. It seems like option 4 would be the better way to go, but I'm not quite sure the best practice way to implement that solution.
So.. any thoughts out there?

I'm just taking a crack at it, but I'd save the templates into constants like so, then use string.Format to substitute the values, load the result as XML and pass it along to the 3rd party component.
const string cChart1 = @"<chart type='pie'>
<total>{0}</total>
<sections count='{1}'>
<section>{2}</section>
<section>{3}</section>
<section>{4}</section>
</sections>
</chart>";
XmlDocument xmlChart1 = new XmlDocument();
xmlChart1.LoadXml(string.Format(cChart1, somevalue1, somevalue2, somevalue3, somevalue4, somevalue5));
// ThirdPartyChartComponent is a placeholder name for the 3rd party component's type
ThirdPartyChartComponent cc = new ThirdPartyChartComponent(xmlChart1);

Thanks for your suggestions everyone.
I created a test application that ran x number of trials for each of the suggested methods to see which performed best. As it turns out, building the XML string directly using StringBuilder was orders of magnitude faster, unsurprisingly.
Involving an XmlDocument in any way greatly reduced performance. It should be noted that my results were based on running thousands of trials for each method... but in a practical sense, any of these methods is fast enough to do the job right, in my opinion.
Of course, building everything using StringBuilder is a bit on the messy side. I like Jarealist's suggestion; it's a lot easier on the eyes, and if I handle the XML as a string throughout rather than loading it into an XmlDocument, it's one of the fastest ways to go.

Are the same templates used more than once? You could store the template as a static variable. Then add a property getter that builds the template (I would probably use your #2) if it hasn't yet been created, and then return it.
This would impose a performance hit the first time the template is used and be very fast after that.
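A minimal sketch of that idea, assuming a single pie chart template built with XmlDocument (option #2 from the question). The names ChartTemplates, PieTemplate and BuildPieTemplate are made up for illustration, and the template is cut down to one element:

```csharp
using System.Xml;

public static class ChartTemplates
{
    private static readonly object Sync = new object();
    private static string pieTemplate; // stays null until the first request

    // Hypothetical property: builds the template once, then returns the cached copy.
    public static string PieTemplate
    {
        get
        {
            lock (Sync) // web requests can arrive on several threads at once
            {
                if (pieTemplate == null)
                {
                    pieTemplate = BuildPieTemplate();
                }
                return pieTemplate;
            }
        }
    }

    // Builds the template with the .NET XML classes and keeps it as a string
    // containing a string.Format placeholder.
    private static string BuildPieTemplate()
    {
        XmlDocument doc = new XmlDocument();
        XmlElement chart = doc.CreateElement("chart");
        chart.SetAttribute("type", "pie");
        doc.AppendChild(chart);

        XmlElement total = doc.CreateElement("total");
        total.InnerText = "{0}";
        chart.AppendChild(total);

        return doc.OuterXml;
    }
}
```

The cost of XmlDocument is paid once per template; every later call such as string.Format(ChartTemplates.PieTemplate, 42) only touches the cached string.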

I am pretty sure you can tell the compiler to bundle those XML files inside your CLR executable. Reading from these would not imply a noticeable performance hit, as they would already be in memory. You will need to research a bit, as I can't get the code out of my head right now; too sleepy.
EDIT.
http://msdn.microsoft.com/en-us/library/f45fce5x(v=vs.100).aspx - More info on the subject.
Another benefit of this approach is that the CLR can guarantee the readability and existence of those files; otherwise your executable will be corrupt and won't run.
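For what it's worth, here is a rough sketch of reading such a bundled file, assuming the bundling meant here is .NET embedded resources (Build Action set to Embedded Resource in Visual Studio). The class EmbeddedTemplates is a made-up helper; Assembly.GetManifestResourceStream is the real framework call, and it returns null when no resource has the given name:

```csharp
using System.IO;
using System.Reflection;

public static class EmbeddedTemplates
{
    // Returns the text of an embedded resource, or null if the assembly
    // contains no resource with that name.
    public static string Read(Assembly assembly, string resourceName)
    {
        using (Stream stream = assembly.GetManifestResourceStream(resourceName))
        {
            if (stream == null)
            {
                return null;
            }
            using (StreamReader reader = new StreamReader(stream))
            {
                return reader.ReadToEnd();
            }
        }
    }
}
```

By default the resource name is the project's default namespace plus the folder path and file name joined with dots, for example "MyApp.Templates.Pie.xml" (a hypothetical name). Calling assembly.GetManifestResourceNames() lists the names actually present.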

Is it good or bad to write everything in C# rather than using web controls?

I'm confused with choosing between several ways of getting result in ASP.NET.
For example, with the Web Forms control SqlDataSource, you retrieve data from the database and show the results in other controls such as DataGridView, BulletedList etc. However, all of those things can be written in C#: you create a string that holds your HTML with the retrieved data, then insert that HTML into a div using InnerHtml. What's the difference?
Example:
<div id='block1' runat='server'></div>
and in the code-behind:
block1.InnerHtml = myString;
After writing the C# code (SqlConnection, loops, DataTable), you put your HTML string into myString.
Why not implement everything in C#?

Think about what's easiest. For simple cases, using markup, templates and databinding is usually easiest and most simple, because most of what's written is static markup - so we can stay in markup's "native land". But if the markup could radically change based on programmatic logic, then trying to express that in ASP.NET markup can be tedious at best.
Also think about deployment and reuse - templates might also be easier to maintain for a single application, but harder to package and reuse in different applications.
You want to minimize effort and complexity. Achieving this flows directly into fewer bugs and more stability, plus shorter delivery time. So think about how effort and complexity are affected by:
How hard will it be for you to write?
How hard will it be for you (or others) to change? - if this is a throwaway application, or unlikely to change much, this is less of a concern.
How hard will it be to deploy?
How hard will it be to reuse? - if there is no reuse, this is not a concern.

Writing it all in pure C# is possible but not very convenient when you are trying to achieve a specific html layout - it is painful to maintain, and very hard to work alongside a developer if you want to take their html and just tweak it to add the data.
Personally I'd look at MVC here; for example, I've been playing with razor recently which allows very elegant integration between C# and html in the same file:
<div id="@obj.Id">
    <ul>
    @foreach(var item in obj.Items) {
        <li>@item.Name</li>
    }
    </ul>
</div>
There I can:
clearly see at a glance how the code maps to the source I can see at the client
make changes with confidence, both from visual inspection and the IDE telling me if I do something obviously wrong
compare to the designer's draft easily

Mostly for maintenance reasons.
Can you imagine how difficult it would be to make changes to it or debug it? And since it is not a traditional approach, any programmer after you who has to work on that code will of course not be happy with it.
Always remember,
HTML is for markup (for example, Building)
UI customization/styles go to CSS and Themes are for Server Control customization (for example, Paint)
C# (or code-behind to be specific) is for logic (for example, Amenities or wiring up).

Generating HTML Programmatically in C#, Targeting Printed Reports

I've taken over a C# (2.0) code base that has the ability to print information. The code to do this is insanely tedious. Elements are drawn onto each page, with magic constants representing positioning. I imagine the programmer sitting with a ruler, designing each page by measuring and typing in the positions. And yes, one could certainly come up with some nice abstractions to make this approach rational. But I am looking at a different method.
The idea is that I'll replace the current code that prints with code that generates static HTML pages, saves them to a file, and then launches the web browser on that file. The most obvious benefit is that I don't have to deal with formatting-- I can let the web browser do that for me with tags and CSS.
So what I am looking for is a very lightweight set of classes that I can use to help generate HTML. I don't need anything as heavyweight as HtmlTextWriter. What I'm looking for is something to avoid fragments like this:
String.Format("<tr><td>{0}</td><td>{1}</td></tr>", foo, bar);
And instead have this kind of feel:
...
table().
tr().
td(foo).
td(bar)
Or something like that. I've seen lightweight classes like that for other languages but can't find the equivalent (or better) for C#. I can certainly write it myself, but I'm a firm believer in not reinventing wheels.
Know anything like this? Know anything better than this?

Just as an idea: why do you want to assemble the HTML in your applications code? Sounds a bit tedious to me. You could aggregate the data needed for the report and pass this on to one of the template engines (that are "normally" used for web apps) existing for C#. Then you save the result of the template engine to a html file.
The main benefits I see with this approach:
separates view from business logic
html templates can be edited by non-C# developers, html/css knowledge is enough
no need to recompile the application if the html changes
I haven't used it yet, but I heard that the Spark View Engine is OK: http://sparkviewengine.com/ (not sure if it is for C# 2.0 though)
Some time ago I experimented (in PHP) with Gagawa ( http://code.google.com/p/gagawa/ ), where you can do stuff like:
$div = new Div();
$div->setId("mydiv")->setCSSClass("myclass");
$link = new A();
$link->setHref("http://www.example.com")->setTarget("_blank");
$div->appendChild( $link );
But I soon dropped that approach in favor of a template engine.

Another approach is converting the data to XML and applying an XSL stylesheet. In order to change the HTML formatting you just need to replace the stylesheet.
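A small sketch of how that could look, assuming the report data is already available as an XML string. ReportRenderer and Render are made-up names; XslCompiledTransform is the framework class in System.Xml.Xsl (available since .NET 2.0):

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class ReportRenderer
{
    // Applies the given XSLT stylesheet to the report data and returns the output text.
    public static string Render(string dataXml, string stylesheetXslt)
    {
        XslCompiledTransform transform = new XslCompiledTransform();
        using (XmlReader xsltReader = XmlReader.Create(new StringReader(stylesheetXslt)))
        {
            transform.Load(xsltReader);
        }

        using (XmlReader dataReader = XmlReader.Create(new StringReader(dataXml)))
        using (StringWriter output = new StringWriter())
        {
            // No stylesheet parameters are passed, hence the null argument list.
            transform.Transform(dataReader, null, output);
            return output.ToString();
        }
    }
}
```

The returned string can then be written to an .html file with File.WriteAllText and opened in the browser. Changing the report layout means editing the stylesheet only, with no recompile.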

How powerful is the <script> tag in ASP.NET?

I'm new at web development with .NET, and I'm currently studying a page where I have both separated codebehinds (in my case, a .CS file associated to the ASPX file), and codebehind that is inside the ASPX file inside tags like this:
<script runat="server">
//code
</script>
Q1: What is the main difference (besides logical matters like organization, readability, etc.)? What could be done one way that could not be done the other? What is each mode best suited for?
Q2: If I'm going to develop a simple page with a database connection, library imports, access to controls (ascx) and image access in other folders, which method should I choose?

Anything you can do in a code-behind, you can do in an inline script like what you posted. But you should use a code-behind most of the time anyway. Some things (like using directives) are just a little easier there, and it helps keep your code organized.

Q1: Nothing. Aside from what you and the others have mentioned (separation, readability), you can do everything "code behind" can do with "inline" (code within page itself) coding.
Inline coding doesn't necessarily mean it's like "spaghetti code" where UI and code are mixed in (like old-school ASP). All your code can live outside of the UI/HTML but still be inline. You can copy/paste all the code-behind code into your inline page, make a few adjustments (wiring, namespaces, import declarations, etc.) and that's that.
The other comments hit the nail: portability and quick fixes/modifications.
Depending on your use case, you may not want certain sections of code exposed (proprietary), but available for use. This is common for web dev professionals. Inline code allows your customers to quickly/easily customize functionality any way they want to, and can use some of your (proprietary) libraries (dlls) whenever they want to, without having to be code jocks (if they were, they wouldn't have hired you in the first place).
So in practical terms, it's like sending off an "html" file to clients containing instructions on how to change things around (without breaking things)...instead of sending off source code files along with html (aspx) pages and hoping your clients know what to do with them....
Q2: While either style (inline or code-behind) will work, it's really a matter of looking at your application in "tiers". Usually, these will be: UI, business logic and data tiers. Thinking about things this way will save you a lot of time.
Practical examples:
If more than one page of your web app must expose/access data, then having a data tier is the best approach. Actually, even if you currently need only one page, it's unlikely to stay that way, so think of it as best practice.
If more than one page of your web app will collect input from users (i.e. contact us, registration/sign up, etc.) then you're likely going to need to validate input. So instead of doing this on a page by page basis, a common input validation library will save you time, and lessen the amount of code you need.
In the above examples, you've "separated" a lot of the processing into their own tiers. Your individual html/aspx pages can then use the "code libraries" (data and input validation) quickly with minimal code at the "page level". Then the decision to use either inline or code-behind styles at the "page level" wouldn't matter much - you've essentially "dumbed it down" to whatever your use case is at the time.
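As a rough illustration of the shared input validation library mentioned above, here is a minimal sketch. The class InputValidator, its method names and the deliberately loose email pattern are all assumptions made for the example, not a recommended production rule set:

```csharp
using System.Text.RegularExpressions;

public static class InputValidator
{
    // Deliberately loose check: something@something.something, with no whitespace.
    private static readonly Regex EmailPattern =
        new Regex(@"^[^@\s]+@[^@\s]+\.[^@\s]+\z", RegexOptions.Compiled);

    // True when a required text field contains something other than whitespace.
    public static bool IsRequiredTextPresent(string value)
    {
        return value != null && value.Trim().Length > 0;
    }

    // True when the value looks roughly like an email address.
    public static bool IsPlausibleEmail(string value)
    {
        return value != null && EmailPattern.IsMatch(value);
    }
}
```

Each page, whether inline or code-behind, then needs only a line such as InputValidator.IsPlausibleEmail(emailBox.Text) (emailBox being a hypothetical control), and a rule change happens in one place.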
Hope this helps....

Keep it separated. Use the .aspx page for your layout, use the .aspx.cs page for any page specific code and for preference, pull your data access/business logic out into their own layer, makes for much simpler maintenance/re-use later on.
Slight caveat there - ASP.net MVC uses inline scripts in its views, and I've really come round to that idea - it can keep the simple stuff simple, but the architecture used in MVC ensures that your business code remains separate from your presentation code.

I'm not saying you should ever be hacking live code... but one bit of flexibility from having the "code behind" as in-line script is that you could hack in changes without having to rebuild/publish the site.
Personally, I don't ever do this but I've heard instances where people have done it to get in an emergency fix.

There is no difference between the script tag and code behind. The code behind option actually came out of using the script tag or the <% %> from "Classic ASP". A lot of developers didn't like the fact that the server-side code sat alongside the UI code, because it made the file look messy, and it was a lot more difficult for the HTML people (web designers or whatever you would like to call them) to work on the same page as the developers at the same time.
Most people like using the code behind option (It's actually considered the standard way of doing things), because it keeps the UI and the Code separate. It's what I prefer, but you really can use either.

You can use all the same stuff
Always try to keep the code separated unless you have a compelling reason not to
Funnily enough, I used the <script runat="server"> in the code infront only today! I did this because you do not need to Build the whole web application to deploy a fix that needs code behind. Yes- it was a bug fix ;)

Reducing a large single page AJAX application (jQuery, ASP.net)

I'm currently building a single page AJAX application. It's a large "sign-up form" that's been built as a multi-step wizard with multiple branches and different verbiage based on what choices the user makes. At the end of the form is an editable review page. Once the user submits the form, it sends a rather large email to us, and a small email to them. It's sort of like a very boring choose your own adventure book.
Feature creep has pushed the size of this app beyond the abilities of the current architecture, and it's too slow to work on slower computers (not good for a web app), especially those using Internet Explorer. It currently has 64 individual steps, 5400 DOM elements and the .aspx file alone weighs in at 300kb (4206 LOC). Loading the app takes anywhere from 1.5 seconds on a fast machine running Firefox 3, to 20 seconds on a slower machine running IE7. Moving between steps takes about the same amount of time.
So let's recap the features:
Multi-Step, multi-path wizard style form (64 steps)
Current step is shown in a fashion similar to this: http://codylindley.com/CSS/325/css-step-menu
Multiple validated fields
Changing verbiage based on user choices
Final, editable review page
I'm using jQuery 1.3.2 and the following plugins:
jQuery Form Wizard Plugin
jQuery clueTip plugin
jQuery sexycombo
jQuery meioMask plugin
As well as some custom script for loading the verbiage from an XML file, running the review page and some aesthetic accoutrements.
I don't have this posted anywhere public, but I'm mostly looking for some tips on how to approach this sort of project and make it light weight and extensible. If anyone has any ideas as far as tools, tutorials or technologies, that's what I'm looking for. I'm a pretty novice programmer (I'm mostly a CSS/xHTML/Design guy), so speak gently. I just need a good plan of attack to make this app faster. Any ideas?

One way would be to break apart the steps into multiple pages / requests. To do this you would have to store the state of the previous pages somewhere. You could use a database to do this or some other method.
Another way would be to dynamically load the parts you need via AJAX. This won't help with the 5400 DOM elements though, but it would help with the initial page load.
Based on the question comments a quick way to "solve" this problem is to make a C# class that mirrors all the fields in your question. Something like this:
public class MySurvey
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
    // and so on...
}
Then you would store this in the session (to keep it easy... I know it's not the "best" way) like this:
public MySurvey Survey
{
    get
    {
        // Reuse the survey already stored in the session, or create and store a new one.
        var survey = Session["MySurvey"] as MySurvey;
        if (survey == null)
        {
            survey = new MySurvey();
            Session["MySurvey"] = survey;
        }
        return survey;
    }
}
This way you'll always have a non-null Survey object you can work with.
The next step would be to break that big form into smaller pages, let's say: step1.aspx, step2.aspx, step3.aspx etc. All these pages would inherit from a common base page that would include the property above. After this all you'd need to do is send the request from step1.aspx back and save it to Survey, similar to what you're doing now but for each small piece. When done redirect (Response.Redirect("~/stepX.aspx")) to the next page. The info from the previous page would be saved in the session object. If they close the browser page they won't be able to get back though.
Rather than saving it to the session you could save it in a database or in a cookie, but you're limited to 4K for cookies so it may not fit.

I agree with PBZ, saving the individual steps would be ideal. You can, however, do this with AJAX. It would require some work that sounds like it might be outside your mostly front-end skillset, though: you'd probably need to create a new database row, tie it to the user's session ID, and update that row every time they click through to the next step. You could possibly even tie it to their IP address so that if the whole thing blows up they can come back and hit "remember me?" for your application to retrieve it.
As far as optimizing the existing structure, jQuery is fairly heavy when it comes to optimization, and adding a lot of jQuery modules doesn't help that. I'm not saying it's bad, because it saves you a lot of time, but there are some instances where you are using a module for one of its many functionalities, and you can replace that entire module with a few lines of jQuery enabled javascript.
As far as minimizing the individual DOM elements, the step above I mentioned could help slim that down, because you're probably loading a lot of extensible functions for those modules that you may or may not need.
On the back end, I'd have to see the source to see how to tell you to optimize it, but it sounds like there's a lot of redundancy in individual steps, some of that can probably be trimmed down into functions that include a little recursion, or at the least delegate some of the tasks to one another.
I wish I could help more but without digging through your source I can only suggest basic strategies. Best of luck, though!

Agree, break up the steps. 5400 elements is too many.
There are a few options if you need to keep it on one page.
AJAX requests to get back either raw HTML, or an array of objects to parse into HTML or DOM
Frames or Iframes
JavaScript to set innerHTML or manipulate the DOM based on the current step. Note with this option IE7 and especially IE6 will have memory leaks. Google IE6 JavaScript memory leaks for more info.
Use document.write to include only the .js file(s) needed for the current step.
HTH.

Sounds like mostly a jQuery optimization problem.
First suggestion would be to switch as many selectors to ID selectors as you can. I've had speedups of over 200-300x by being able to move to id attribute selection only.
Second suggestion is more of a plan of attack. Since IE is your main problem area, I suggest using the IE8 debugger. You just need to hit F12 in IE8... Tabs 3 and 4 are script and profiler respectively.
Once you've done as much of #1 as you think you can, to get a starting point, just go to profiler, hit start profiling, do some slow action on the webpage, and then stop profiling. You will see your longest method calls, and just work your way through it.
For finer testing/dev, go to the script tab. Breakpoints locals etc are there for analysis. You can dev/test changes via the immediate window... i.e. put a break point where you want to change a function, trigger the function, execute your javascript instead of the defined javascript in the immediate window.
When you think you have something figured out, profile your changes to make sure they are really improvements. Just start the profiler, run the old code, stop it and note your benchmark. Then re-start the profiler and use the immediate window to execute your altered function.
That's about it. If that flow can't take you far enough, then, as mentioned above, jQuery itself (and hence its plugins) is not terribly performant, and replacing it with standard JavaScript will speed everything up. If your plugins benchmark slow, look at replacing them with other plugins.
