I'm considering writing my own tool for tracking visitors/sales as Google Analytics and others are just not comprehensive enough in the data dept. They have nice GUIs but if you have SQL skills those GUIs are unnecessary.
I'm wondering what the best approach is to do this.
I could simply log the IP, etc., to a text file and then have an async service run in the background to dump it into the DB. Or maybe that's overkill and I can just put it straight into the DB. But one DB WRITE per web request seems like a poor choice where scalability is concerned. Thoughts?
As a sidenote, it is possible to capture the referring URL of any incoming traffic, right? So if visitors came from a forum post or something, you can track that actual URL, is that right?
It just seems that this is a very standard requirement and I don't want to go reinventing the wheel.
As always, thanks for the insight SOF.
The answer to this question mentions the open-source GAnalytics alternative Piwik - it's not C# but you might get some ideas looking at the implementation.
For a .NET solution I would recommend reading Matt Berseth's Visit/PageView Analysis Services Cube blog posts (plus his earlier posts and examples, since they aren't easy to find on his site).
I'm not sure if he ever posted the server-side code (although you will find his openurchin.js linked in his html), but you will find most of the concepts explained. You could probably get something working pretty quickly by following his instructions.
I don't think you'd want to write to a text file - locking issues might arise; I'd go for INSERTs into a database table. If the table grows too big you can always 'roll up' the results periodically and purge old records.
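As a rough illustration, the per-request insert can be as simple as the sketch below (ADO.NET; the PageViews table, its columns, and the AnalyticsDb connection-string name are assumptions for the example, not anything from the original post):

using System;
using System.Configuration;
using System.Data.SqlClient;

public static class VisitLogger
{
    public static void LogVisit(string ipAddress, string url, string referrer)
    {
        string connStr = ConfigurationManager
            .ConnectionStrings["AnalyticsDb"].ConnectionString;

        const string sql =
            "INSERT INTO PageViews (VisitedAt, IpAddress, Url, Referrer) " +
            "VALUES (@visitedAt, @ip, @url, @referrer)";

        using (var conn = new SqlConnection(connStr))
        using (var cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@visitedAt", DateTime.UtcNow);
            cmd.Parameters.AddWithValue("@ip", ipAddress);
            cmd.Parameters.AddWithValue("@url", url);
            cmd.Parameters.AddWithValue("@referrer", (object)referrer ?? DBNull.Value);

            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}

If one write per request ever becomes a bottleneck, the same method could queue rows in memory and flush them in batches instead.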
As for the REFERER Url, you can definitely grab that info from the HTTP HEADERS (assuming it has been sent by the client and not stripped off by proxies or strict AV s/w settings).
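In ASP.NET the referrer is already parsed for you; a hedged sketch (Web Forms, with the usual null check since the header is optional):

using System;
using System.Web.UI;

public partial class TrackedPage : Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        // Request.UrlReferrer is built from the Referer header and is null
        // when the client didn't send one (direct visits, strict proxies, etc.).
        Uri referrer = Request.UrlReferrer;
        string referringUrl = referrer != null ? referrer.AbsoluteUri : "(direct / unknown)";

        // The raw header value is also available if you prefer it:
        string rawHeader = Request.Headers["Referer"];

        // Store referringUrl along with the rest of the visit record.
    }
}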
BTW, keep in mind that Google Analytics adds a lot of value to stats - it geocodes IP addresses to show results by location (country/city) and also by ISP/IP owner. Their javascript does Flash detection and segments the User-Agent into useful 'browser categories', and also detects other user settings like operating system and screen resolution. That's some non-trivial coding that you will have to do if you want to achieve the same level of reporting - not to mention the data and calculations to get entry & exit page info, returning visits, unique visitors, returning visitors, time spent on site, etc.
There is a Google Analytics API that you might want to check out, too.
Have you looked at Log Parser to parse the IIS logs?
I wouldn't have thought writing to a text file would be more efficient than writing to a database - quite the opposite, in fact. You would have to lock the text file while writing, to avoid concurrency problems, and this would probably have more of an impact than writing to a database (which is designed for exactly that kind of scenario).
I'd also be wary of re-inventing the wheel. I'm not at all clear what you think a bespoke hits logger could do better than Google Analytics, which is extremely comprehensive. Believe me, I've been down that road and written my own, and Analytics made it quite redundant.
I am making a webcrawler in C# which needs to find webshops. The problem I'm having is that I need to detect whether a webpage is a webshop, and if it is, find out what type of e-commerce software it is using. The problem is that I don't know how to detect that from the source code.
I have also found a Chrome plugin called BuiltWith which can detect all kinds of software, but I have yet to work out how they do that.
It would be nice if someone could help me with this problem.
Before giving you an actual answer, it's worth noting that what you're proposing could be in violation of the terms of use for many websites out there. You should take the time to investigate what legal liability you might be exposing yourself and your organization to.
This is going to be a lot of time-consuming work, but it's not difficult. Your crawler simply needs to take a rules-based approach to detecting signatures in the payload of the page.
Find the specific software that you're intending to detect.
Find 2-3 sites that are definitely using the software.
Review the HTML payload to see what scripts, CSS, and HTML patterns they have that are common across the sites.
Build a code-based rule that can detect those patterns consistently. For example: if (html.Contains("widgetName")) isPlatformName = true; (a fuller sketch follows below).
Test that patterns across more sites that you know for certain are using that software.
Repeat for each software vendor.
The more complicated part will be when the targets have multiple versions and you need to adapt your rules to be aware of the various versions, or when platforms are very similar.
I think the most complicated part of this is having a well-thought-out bot issue detection, reporting, and throttling architecture in place. You should probably spend the bulk of your time planning that.
That's it.
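A minimal sketch of that rules-based approach in C# might look like the following. The signature strings are invented examples for illustration; you would build the real list by inspecting storefronts you know are running each platform (steps 2-3 above).

using System;
using System.Collections.Generic;
using System.Linq;

public class PlatformDetector
{
    // Map each platform to markers you expect to see in its HTML payload.
    private static readonly Dictionary<string, string[]> Signatures =
        new Dictionary<string, string[]>
        {
            { "Shopify",     new[] { "cdn.shopify.com", "Shopify.theme" } },
            { "Magento",     new[] { "/skin/frontend/", "Mage.Cookies" } },
            { "WooCommerce", new[] { "wp-content/plugins/woocommerce" } },
        };

    public string Detect(string html)
    {
        foreach (var platform in Signatures)
        {
            // Require every marker for a platform to match, to cut down false positives.
            bool allMatch = platform.Value.All(sig =>
                html.IndexOf(sig, StringComparison.OrdinalIgnoreCase) >= 0);

            if (allMatch)
                return platform.Key;
        }
        return null; // unknown, custom-built, or not a shop at all
    }
}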
There are a couple different ways to determine the technologies a site is using. Firstly, if you are technically savvy, you can right click on an eCommerce page (either catalog, checkout page, etc) and look at the source code. Many platforms will have hints in the source code that will give you an idea what the site is running.
You can also look at the DNS/hosting information, which would help you determine if the eCommerce solution is hosted or SaaS (like Shopify, for example).
You can also try using InterNIC and enter the domain name. The results will return the nameservers which could point you in the right direction.
Finally, if all that sleuthing seems too difficult, there’s an easier way! Try BuiltWith. It’s generally pretty reliable, as long as the system you're looking up isn’t custom/proprietary. Enter a domain into BuiltWith and it will show you the platform, widgets used, analytics and tracking codes, CDNs, CMS, payment processors, and more.
What are good principles for creating a scalable website, predominantly in C#? What design patterns are more common for a C#-based website? Links to good books or articles are welcome.
I think these apply to all websites, not just C#
Set proper expectations
Scaling means different things at different times. Are you trying to scale up from your 1000 beta users, to 100,000 active users on launch day? Are you trying to handle constant growth without refactoring? Do you just want to make sure if there's a good old "Slashdot" effect on your site, you can handle it? These all require scalability, but some are very different than others.
Calculate the value of data
Oftentimes people freak out about data loss. But really what people mean to freak out about is data consistency. I won't be really mad if the account I created 30 seconds ago disappears. I will be mad if the photo I uploaded of my trip to Prague is replaced by Halloween Harlots downing beer bongs. So if you are able to calculate the risk of data loss, it's a fairly easy process to then calculate the impact, and make a real business decision on whether or not it's OK, and if so, how much before it hurts you.
Simplicity trumps Coolness
I love block diagrams as much as anyone else, but how many times have you heard someone go "DAMN, this TV is SO WELL DESIGNED I have to buy it." More often than not they'll say things like "Sony hasn't let me down, and this TV LOOKS great, I have to have it!" Keeping things simple and modular, even if it means NOT using some really cool, ultra-abstracted infinitely scalable pattern, will allow you to scale when and where you need it.
You will not always be alone
The toughest part of your website (and any business) to scale is always people. Sure, by the time you need more engineers, you should be rolling in money so you can just hire 10 people and they'll rewrite the site to be maintainable. However, if you can just hire 2, and they don't immediately surf to the daily wtf to post your entire code base in a 7 part exposé ... you should come out ahead.
This is hard to answer without more details about your architecture.
Which version of .NET and C# are you using?
Are you using MVC or webforms?
Do you have multiple webservers connecting to one database?
When designing a website for scalability I start with having a lot of javascript that will be the controller, pulling information from the webservice, either a WCF app or .asmx service. This way the webserver serves out the pages, but after that the rendering is done by the javascript. This helps relieve stress on the webserver.
If you can, have the webserver serve all the static content, and for any business logic have the code-behind call another server that does the processing, gets the info from the database, and returns the result to the code-behind.
By having this separation, if you need to add more servers in one layer you can determine which part needs extra horsepower, and ensure that the webserver is only really doing one task, interfacing with the browsers.
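A hedged sketch of that separation with WCF is below; IOrderService, the endpoint name, and the page itself are made-up names for illustration, not anything prescribed by the answer.

using System;
using System.ServiceModel;
using System.Web.UI;

// Hypothetical contract implemented by the business-logic tier.
[ServiceContract]
public interface IOrderService
{
    [OperationContract]
    decimal GetOrderTotal(int orderId);
}

public partial class OrderPage : Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        // The web tier never touches the database; it just calls the app tier.
        var factory = new ChannelFactory<IOrderService>("OrderServiceEndpoint");
        IOrderService service = factory.CreateChannel();
        try
        {
            decimal total = service.GetOrderTotal(42); // processing happens on the app/database tier
            // Bind 'total' to whatever control renders it; omitted here.
        }
        finally
        {
            ((IClientChannel)service).Close();
            factory.Close();
        }
    }
}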
If you could go into more detail about your architecture it would be helpful, as well as how much of a load you are planning for, then it would be easier to give a more detailed response.
Martin Fowler's Patterns of Enterprise Application Architecture (summaries on his website) are a good place to start. There's an awful, awful lot of components and technologies that can go into building a scalable website... load balancing, caching, application serving, database setup, networking, and somewhere in there is the actual code being written.
I am trying to detect whether a visitor is human or not. I just got an idea but am not sure if it will work: I could store a cookie in the person's browser and retrieve it when they are browsing my site. If I successfully retrieve the cookie, can that be a good technique to detect bots and spiders?
A well-designed bot or spider can certainly store -- and send you back -- whatever cookies you're sending. So, no, this technique won't help one bit.
Browsers are just code. Bots are just code. Code can do anything you program it to. That includes cookies.
Bots, spammers and the like work on the principle of low-hanging fruit. They're after as many sites or users as they can get with as little effort as possible. Thus they go after popular packages like phpBB and vBulletin because getting into those will get them into a lot of sites.
By the same token, they won't spend a lot of effort to get into your site if the effort pays off only for your site (unless your site happens to be Facebook or the like). So the best defense against malicious activity of this kind is simply to be different, in such a way that an automatic script already written won't work on your site.
But an "I am human" cookie isn't the answer.
No, as Alex says this won't work; the typical process is to use a robots.txt to get them to behave. Further to that, you start to investigate the user-agent string (but this can be spoofed). Any more work than this and you're into CAPTCHA territory.
What are you actually trying to avoid?
You should take a look at the information in the actual http headers and how .Net exposes these things to you. The extra information you have about the person hitting your website is there. Take a look at what Firefox is doing by downloading Live Http Headers plugin and go to your own site. Basically, at a page level, the Request.Headers property exposes this information. I don't know if it's the same in asp.net mvc though. So, the important header for what you want is the User-Agent. This can be altered, obviously, but the major crawlers will let you know who they are by sending a unique UserAgent that identifies them. Same thing with the major browsers.
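As a rough sketch (classic Web Forms), checking the User-Agent against a list of known crawlers might look like this. The crawler names here are illustrative only, and the header can be spoofed, so treat it as a heuristic rather than a guarantee.

using System;
using System.Linq;
using System.Web.UI;

public partial class BotAwarePage : Page
{
    private static readonly string[] KnownCrawlers =
        { "Googlebot", "Bingbot", "Slurp", "Baiduspider" };

    protected void Page_Load(object sender, EventArgs e)
    {
        string userAgent = Request.Headers["User-Agent"] ?? string.Empty;

        bool looksLikeCrawler = KnownCrawlers.Any(bot =>
            userAgent.IndexOf(bot, StringComparison.OrdinalIgnoreCase) >= 0);

        if (looksLikeCrawler)
        {
            // Skip visitor logging, or record the hit as a crawler visit.
        }
    }
}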
I wrote a bot that works with cookies and javascript. The easiest form of bot/spam prevention is to use the NoBot component in the Ajax Control Toolkit.
http://www.asp.net/AJAX/AjaxControlToolkit/Samples/NoBot/NoBot.aspx
I am a web developer who is very conscious of security and tries to make my web applications as secure as possible.
However, I have started writing my own Windows applications in C#, and when it comes to testing the security of my C# applications, I am really only a novice.
Just wondering if anyone has any good tutorials/readmes on how to hack your own Windows application and write secure code.
The books by Michael Howard are a good starting point:
19 Deadly Sins of Software Security (with examples in several languages)
Writing Secure Code
There are loads of links and interesting articles on Michael Howard's blog here.
There's an interesting powerpoint presentation from Microsoft about threat assessment, risks and ASP here.
Apart from all the obvious answers to prevent buffer overflows, code injection, session hijacking et al., you should find somebody else to check your code/software, because you can only think about ways to hack your software that you already know how to prevent. Just because you can't find a way to hack your own software doesn't mean that nobody else can.
This is something that is very difficult for you to do, and I think that you are approaching the problem from the wrong angle. If you are writing an application of any size then attempting to deal with security at the end, by looking for specific ways of breaking your own software, is almost impossible.
This is for a number of reasons. You already think about your software in a certain way. You think of specific ways of interacting with it, and you know how to get the best from it. You don't think about it in terms of ways to exploit it, and this is a hard thing to do with software that you are intimately familiar with.
Another problem is that the task by this point is too big to deal with. Any problems that you do find may open up any number of other problems. A system wide security check is nowhere near granular enough.
What you should be doing is thinking about security while you write the software. Learn the best practices, and consider each method and class that you write from a security perspective. This goes hand in hand with unit testing: try to consider what inputs could make this specific part of the program break, and then deal with them at that level.
After that I think it's a matter of responding quickly to any security concerns that you are made aware of.
Small things that I have come across through my own experience.
Do not use dynamic SQL; you are then vulnerable to SQL injection. Rather, use SQL queries with parameters (see the sketch after this list).
Do not have incrementing IDs like user_id = 1, 2, 3, etc. and then use them in a URL, e.g. something.aspx?user_id=1; I can then guess the next ID and hop into another session. The same goes for accounts and whatever else is sensitive.
Watch out for XSS (cross-site scripting). If you accept user input and store it directly, make sure users can't insert alert() or similar script as their name.
This is by no means a complete list. Just the stuff that I have run into recently.
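To make the first point concrete, here is a minimal sketch of the difference between dynamic SQL and a parameterised query (the Users table and column names are made up for the example):

using System.Data.SqlClient;

public static class UserLookup
{
    // BAD: dynamic SQL built by concatenation -- one crafted userName and
    // you are running the attacker's SQL:
    //   string sql = "SELECT * FROM Users WHERE UserName = '" + userName + "'";

    // BETTER: the value travels as a parameter, never as SQL text.
    public static bool UserExists(SqlConnection openConnection, string userName)
    {
        const string sql = "SELECT COUNT(*) FROM Users WHERE UserName = @userName";
        using (var cmd = new SqlCommand(sql, openConnection))
        {
            cmd.Parameters.AddWithValue("@userName", userName);
            return (int)cmd.ExecuteScalar() > 0;
        }
    }
}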
You could do much worse than reading Ross Anderson's Security Engineering book. The first edition is downloadable as a PDF and is a good read. I haven't read the second edition, but I suspect it's better and has more goodies in it.
Do note it is a book that explains how to build security in from the start, not how to break security, but the exposition of assorted security faults should give you a good idea for where to start looking.
To secure your WinForms application, open it and try to do everything an average user shouldn't do! I'll explain:
If you say "enter yes or no", try A-Z and 0-9 instead, because that's what some users do to try to surface an interesting stack trace. So put validators everywhere.
Beware of connections to databases, but if you come from web dev you should be more aware of that than me :).
The hardest part is watching out for memory leaks and the like, but that mostly matters in very big apps or in poorly developed ones.
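Building on the "put validators everywhere" advice above, here is a minimal WinForms sketch of defensive input handling. The form and control names are assumptions for the example; in a real project the controls would live in the designer file.

using System;
using System.Windows.Forms;

public class SignupForm : Form
{
    // Declared here only so the sketch stands alone.
    private readonly TextBox ageTextBox = new TextBox();
    private readonly ErrorProvider errorProvider1 = new ErrorProvider();

    private void okButton_Click(object sender, EventArgs e)
    {
        int age;
        // TryParse never throws, so garbage input can't surface a stack trace.
        if (!int.TryParse(ageTextBox.Text, out age) || age < 0 || age > 150)
        {
            errorProvider1.SetError(ageTextBox, "Please enter a valid age.");
            return;
        }

        errorProvider1.SetError(ageTextBox, string.Empty);
        // Continue with the validated value...
    }
}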
Windows, Firefox and Google Chrome all monitor usage statistics and analyze the crash reports that are sent to them. I am thinking of implementing the same feature in my application.
Of course it's easy to litter an application with a lot of logging statements, but that's the approach I want to avoid, because I don't want my code to have too many cross-cutting concerns in a single function. I am thinking about using AOP to do it, but before that I want to know how other people implement this feature.
Does anyone have any suggestions?
Clarification: I am working on a desktop application, and it doesn't involve any RDBMS.
Joel had a blog article about something like this - his app(s) trap crashes and then contact his server with some set of details. I think he checks for duplicates and throws them out. It is a great system and I was impressed when I read it.
http://www.fogcreek.com/FogBugz/docs/30/UsingFogBUGZtoGetCrashRep.html
We did this at a place I was at that had a public server set up to receive data. I am not a db guy and have no servers I control on the public internets. My personal projects unfortunately do not have this great feature yet.
In "Debugging .Net 2.0 Applications" John Robbins (of Wintellect) writes extensively about how to generate and debug crash reports (acutally windbg/SOS mini dumps). His Superassert class contains code to generate these. Be warned though - there is a lot of effort required to set this up properly: symbol servers, source servers as well as a good knowledge of VS2005 and windbg. His book, however, guides you through the process.
Regarding usage statistics, I have often tied this into authorisation, i.e. whether a user has the right to carry out a particular task. Put overly simply, this could be a method like the following (ApplicationActions is an enum):
public static bool HasPermission( ApplicationActions action )
{
    // Validate user has permission.
    // Log request and result.
    // (Both left as placeholders; this is only a shape sketch.)
    return true;
}
This method could be added to a singleton SecurityService class. As I said, this is overly simple, but it should indicate the sort of service I have in mind.
I would take a quick look at the Logging Application Block that is part of the Enterprise Library. It provides a large number of the things you require and is well maintained. Check out some of the scenarios and samples available; I think you will find them to your liking.
http://msdn.microsoft.com/en-us/library/cc309506.aspx
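If it helps, the block's basic entry point is Logger.Write; a hedged sketch is below (the "Usage" category is an assumption and has to match a category defined in the block's configuration).

using Microsoft.Practices.EnterpriseLibrary.Logging;

public static class UsageTracker
{
    public static void TrackFeature(string featureName)
    {
        // Where the entry ends up (flat file, event log, database, ...) is
        // decided entirely by configuration, which keeps the cross-cutting
        // concern out of your methods.
        Logger.Write("Feature used: " + featureName, "Usage");
    }
}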