Parse a file containing HTML to get the server-side script - C#

I'm currently building a web server which can receive requests and send back responses.
I've managed to embed a port of Google's V8 JavaScript engine for C# (Javascript.Net) in my project, and I want to parse a requested file and run the server-side JavaScript code that is in it. I decided that this code will be contained inside 2-character brackets: <: for opening and :> for closing. I started to parse it with code I wrote, but after encountering some problems which made the code messy and probably not very efficient, I decided to go ahead and try using RegEx (I had to study it because I've never used it before). BUT WAIT. After talking to my friend about it, he sent me the post RegEx match open tags except XHTML self-contained tags, and I understood that it isn't a good idea...
So my question is: how do I parse such a thing? (Taking efficiency and clean code into account; after all, it's a web server.)
Thanks in advance!

Ideally, what you'd want to do is hook into V8's lexer so you don't end up catching things inside of strings and such. I looked at the source of that .NET wrapper, however, and it looks like it doesn't allow that much customization. Instead, you may want to create a small state machine (a rough sketch follows the example below). You'd probably want at least these states:
Literal data (for stuff outside of your <: and :> tags)
Left angle bracket (for once you've consumed a < and are waiting for a potential :)
Script state (for stuff inside of your <: and :> tags)
Script double-quote string state
Script double-quote string escape state
Script single-quote string state
Script single-quote string escape state
Script slash state (for comments and regular expressions¹)
Script line comment state
Script block comment state
Script block comment star state
Script regular expression state
Script colon state (for when you've encountered a : and are unsure whether a > or something else is next)
It may not be so quick to write as a regular expression, but it would be able to handle code like this:
Hello, world!
<:
document.write("At least you won't think the script :> ends there.");
:>
¹ On second thought, it's probably not so easy to detect regular expressions.
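
For illustration, here is a rough C# sketch of such a state machine, assuming the goal is to split the file into literal chunks and script chunks. It covers only the literal, string, and colon states; the slash, comment, and regular-expression states from the list are omitted for brevity, and all the names are my own rather than anything from the Javascript.Net wrapper.
using System.Collections.Generic;
using System.Text;

static class ScriptExtractor
{
    enum State { Literal, LeftAngle, Script, ScriptColon,
                 ScriptDouble, ScriptDoubleEscape, ScriptSingle, ScriptSingleEscape }

    // Returns the input as a sequence of (isScript, text) chunks.
    public static IEnumerable<(bool IsScript, string Text)> Split(string input)
    {
        var state = State.Literal;
        var buffer = new StringBuilder();
        var chunks = new List<(bool, string)>();

        foreach (char c in input)
        {
            switch (state)
            {
                case State.Literal:                          // outside <: ... :>
                    if (c == '<') state = State.LeftAngle;
                    else buffer.Append(c);
                    break;
                case State.LeftAngle:                        // consumed '<', waiting for ':'
                    if (c == ':') { chunks.Add((false, buffer.ToString())); buffer.Clear(); state = State.Script; }
                    else if (c == '<') buffer.Append('<');   // stay: the newest '<' may still open a tag
                    else { buffer.Append('<').Append(c); state = State.Literal; }
                    break;
                case State.Script:                           // inside <: ... :>
                    if (c == ':') state = State.ScriptColon;
                    else if (c == '"') { buffer.Append(c); state = State.ScriptDouble; }
                    else if (c == '\'') { buffer.Append(c); state = State.ScriptSingle; }
                    else buffer.Append(c);
                    break;
                case State.ScriptColon:                      // consumed ':', waiting for '>'
                    if (c == '>') { chunks.Add((true, buffer.ToString())); buffer.Clear(); state = State.Literal; }
                    else if (c == ':') buffer.Append(':');   // stay: the newest ':' may still close the script
                    else { buffer.Append(':').Append(c); state = State.Script; }
                    break;
                case State.ScriptDouble:                     // inside a "..." string, ':>' is ignored here
                    buffer.Append(c);
                    if (c == '\\') state = State.ScriptDoubleEscape;
                    else if (c == '"') state = State.Script;
                    break;
                case State.ScriptDoubleEscape:
                    buffer.Append(c);
                    state = State.ScriptDouble;
                    break;
                case State.ScriptSingle:                     // inside a '...' string
                    buffer.Append(c);
                    if (c == '\\') state = State.ScriptSingleEscape;
                    else if (c == '\'') state = State.Script;
                    break;
                case State.ScriptSingleEscape:
                    buffer.Append(c);
                    state = State.ScriptSingle;
                    break;
            }
        }
        if (buffer.Length > 0) chunks.Add((state == State.Literal, buffer.ToString()));
        return chunks;
    }
}
Run over the sample above, Split yields one literal chunk ("Hello, world!") and one script chunk containing the whole document.write(...) line, since the :> inside the double-quoted string never reaches the colon state.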

If I understand correctly, you want to take everything between "<" and ">", including any "<" and ">" that are nested inside? Well... since RegEx isn't a good fit for this, maybe try to find the first "<", then keep a counter which increases for every following "<" and decreases for every ">". When the counter comes back to 0 on a ">", you have the end of the server-side script. If you have some embedded HTML and want to get rid of it, try to detect quote characters (") or something like that. This solution is slow, but it's the simplest I can imagine.
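A rough sketch of that counter idea, covering only the bracket balancing (none of the string or HTML handling mentioned above, and assuming plain '<' and '>' as the delimiters):
// Finds the span from the first '<' to its matching '>', allowing nested brackets.
// Returns null if there is no '<' or the brackets are unbalanced.
static (int Start, int End)? FindOuterBrackets(string text)
{
    int start = text.IndexOf('<');
    if (start < 0) return null;

    int depth = 0;
    for (int i = start; i < text.Length; i++)
    {
        if (text[i] == '<') depth++;
        else if (text[i] == '>' && --depth == 0)
            return (start, i);           // counter back at 0: end of the outer span
    }
    return null;
}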

Related

C# Regex filter problems

I posted a question earlier asking about the same type of problem regarding Regex. It has given me headaches; I have looked at loads of documentation on how to use regex, but I still could not put my finger on it. I don't want to waste another 6 hours trying to filter simple (I think) expressions.
So basically, what I want to do is filter all file types ending with HTML extensions (the '*' stars are from a WinForms TabControl, signifying that the file has been modified). I also need them matched with IgnoreCase:
.html, .htm, .shtml, .shtm, .xhtml
.html*, .htm*, .shtml*, .shtm*, .xhtml*
Also filtering some CSS files:
.css
.css*
And some SQL Files:
.sql, .ddl, .dml
.sql*, .ddl*, .dml*
My previous question got an answer for filtering Python files:
.py, .py3, .pyi, .pyx, .pyw
Expression would be: \.py[3ixw]?\*?$
But when I tried to learn from the expression above, I would always end up matching only .xhtml; the rest were not valid.
For the HTML expression, I currently have this: \.html|.html|.shtml|.shtm|.xhtml\*?$ with RegexOptions.IgnoreCase. But the output will only allow .xhtml, case sensitive or insensitive; .html files, .htm and the rest did not match. I would really appreciate an explanation of each of the expressions you provide (so I don't have to ask the same question ever again).
Thank you.
For such cases you may start with a simple regex that can be simplified step by step down to a good regex expression:
In C# this would basically, with IgnoreCase, be
Regex myRegex = new Regex("PATTERN", RegexOptions.IgnoreCase);
Now the pattern: the easiest one is simply concatenating all valid results with OR, escaping where necessary:
\.html|\.htm|\.shtml|\.shtm|\.xhtml|\.html*|\.htm*|\.shtml*|\.shtm*|\.xhtml*
With .html* you mean .html + anything, which is written as .* (any character, zero to infinite times) in regex.
\.html|\.htm|\.shtml|\.shtm|\.xhtml|\.html.*|\.htm.*|\.shtml.*|\.shtm.*|\.xhtml.*
Then, you may take all repeating patterns and group them together. All file endings start with a dot and may have an optional tail, and since .* can also match nothing, ending.* covers plain ending as well:
\.(html|htm|shtml|shtm|xhtml).*
Then, I see htm pretty often, so I try to extract that. Taking all possible characters before and after htm together (? means 0 or 1 appearance):
\.(s|x)?(htm)l?.*
And I always check that it's still working in regexstorm for .NET.
That way, you can also derive regular expressions for the other two groups and concatenate them all together in the end.
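Putting it together in C# might look like this. The combined pattern is my own concatenation of the three groups, so treat it as a starting point rather than the final answer:
using System;
using System.Text.RegularExpressions;

// \.(s|x)?html?   -> .htm, .html, .shtm, .shtml, .xhtm, .xhtml
// \.css           -> .css
// \.(sql|ddl|dml) -> .sql, .ddl, .dml
// .*$             -> allows the optional trailing '*' (or anything else) after the extension
Regex fileFilter = new Regex(@"\.((s|x)?html?|css|sql|ddl|dml).*$", RegexOptions.IgnoreCase);

Console.WriteLine(fileFilter.IsMatch("page.XHTML*"));  // True
Console.WriteLine(fileFilter.IsMatch("style.css"));    // True
Console.WriteLine(fileFilter.IsMatch("query.sql*"));   // True
Console.WriteLine(fileFilter.IsMatch("script.py"));    // False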

Is XSS possible through the MailAddress class?

Considering I parse user input, which is supposed to be an email address, into the MailAddress class:
var mailString = Request.QueryString["mail"];
var mail = new MailAddress(mailString);
Is there any possibility left for a cross-site-scripting attack if I output the MailAddress object later in any way? For example through a Literal control in WebForms:
litMessage.Text = "Your mail address is " + mail.Address;
Is it necessary to sanitize the output even though I made sure that the address is a valid email address by parsing the string?
From what I could gather the RFC for mail addresses is pretty complicated, so I am unsure if cross site scripts can be hidden in a mail address considered valid by .NET.
EDIT:
MSDN says that > and < brackets are allowed in an email address:
The address parameter can contain a display name and the associated e-mail address if you enclose the address in angle brackets. For example: "Tom Smith <tsmith@contoso.com>"
So the question remains if this is enough for an XSS attack and/or if the MailMessage class does anything to escape dangerous parts.
Generally speaking, you shouldn't need to validate the output later. However, I always recommend that you do so for the following reasons:
There may be a hole somewhere in your app that doesn't validate the input properly. This could be discovered by an attacker and used for XSS. This is especially possible when many different devs are working on the app.
There may be old data in the database that was stored before implementing/updating your filter on the input. This could contain malicious code that could be used for XSS.
Attackers are very clever and can usually figure out a way to beat a filter. Microsoft puts a lot of attention on preventing this, but it's never going to be perfect. It makes the attacker's job that much harder if they face an outgoing filter as well as an incoming filter.
I know it's a pain to constantly filter, but there is a lot of value in doing so. A Defense-in-Depth strategy is necessary in today's world.
Edit:
Sorry I didn't really answer the second part of your question. Based on the documentation I don't get the impression that the API is focused on sanitizing as much as it is on verifying valid formatting. Therefore I don't know that it is safe to rely on it for security purposes.
However, writing your own sanitizer isn't terribly hard, and you can update it immediately if you find flaws. First run the address through a good RegEx filter (see: Regex Email validation), then recursively remove every character that isn't valid in an email address (these shouldn't get through at this point, but do this for completeness and in case you want to reuse the class elsewhere), then escape every character with HTML meaning. I emphasize the recursive application of the filter because attackers can take advantage of a non-recursive filter with stuff like this:
<scr<script>ipt>
Notice that a non-recursive filter would remove the middle occurrence of <script> and leave the outer occurrence intact.
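As an illustration of that last point, here is a tiny sketch showing why the removal has to be repeated until nothing changes. The single hard-coded pattern is just for demonstration; a real filter would cover far more than <script>:
using System;
using System.Text.RegularExpressions;

// Repeatedly strip "<script>" until the input stops changing.
static string StripScriptTags(string input)
{
    string previous;
    string current = input;
    do
    {
        previous = current;
        current = Regex.Replace(current, "<script>", "", RegexOptions.IgnoreCase);
    } while (current != previous);
    return current;
}

// A single pass over "<scr<script>ipt>" removes the inner tag and leaves "<script>";
// the loop above catches that on the second pass and returns an empty string.
Console.WriteLine(StripScriptTags("<scr<script>ipt>"));  // prints nothing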
Is it necessary to sanitize the output?
You don't 'sanitise' output, you encode it. Every string that you output into an HTML document needs to be HTML-encoded, so if there was a < character in the mail address it wouldn't matter - you'd get &lt; in the HTML source as a result, and that would display correctly as a literal < on the page.
Many ASP.NET controls automatically take care of HTML-escaping for you, but Literal does not by default because it can be used to show markup. But if you set the Mode property of the Literal control to Encode then setting the Text like you're doing is perfectly fine.
You should make sure you always use safe HTML-encoded output every time you put content into an HTML page, regardless of whether you think the values you're using will ever be able to include a < character. This is a separation-of-concerns issue: HTML output code knows all about HTML formatting, but it shouldn't know anything about what characters are OK in an e-mail address or other application field.
Leaving out an escape because you think the value is 'safe' introduces an implicit and fragile coupling between the output stage and the input stage, making it difficult to verify that the code is safe and easy to make it unsafe when you make changes.
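Using the snippet from the question, that could look like the following. HttpUtility.HtmlEncode is shown here; Server.HtmlEncode or the Literal's Encode mode achieve the same thing:
using System.Web;

// Encode at the point of output: any '<', '>', '&' or '"' in the address
// ends up as harmless character references in the HTML source.
litMessage.Text = "Your mail address is " + HttpUtility.HtmlEncode(mail.Address);

// Or let the control do the encoding by declaring it with Mode="Encode":
// <asp:Literal ID="litMessage" runat="server" Mode="Encode" />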

Designing a translation API - How to handle spaces

My application consumes an external Translation API (no option to use other translation engines). I'm seeing the following unexpected behavior when I call the translation engine.
input
<b1> Hello World. </b1>
expected output
<b1> Hola a todos. </b1>
actual output
<b1>Hola a todos.</b1>
Is it proper for the API to be trimming the spaces? This feels wrong to me.
Note: it is documented to replace non-html tags with <b1></b1> tag pairs (numbers increment to keep tag pairs unique).
Update: The end result was that I had to hack around the issue by encoding spaces before I call the translation API. I don't like it, but I was not able to convince the API owner to change it to GIGO (Garbage In, Garbage Out).
Well, in general whitespace is not considered part of a word, so it is not really surprising that the API is doing that. Whether or not this behaviour is OK is probably debatable (at least it should be documented), but you should follow the rule "be liberal in what you accept and strict in what you produce". As you produce the tokens, you should be more strict.
As far as I know, whitespace in HTML is not particularly significant: multiple spaces are collapsed to a single space, newlines are ignored, etc., so it's not much of a surprise that the leading and trailing spaces in that string are being dropped. From the browser's point of view, they're equivalent.
So the question then becomes, is there an option in the API to preserve spaces or treat the incoming text as "plain text" and not html?
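One workaround sketch, as an alternative to encoding the spaces (and assuming you can hand the engine the inner text of each tag pair rather than the whole tagged string): pull the leading and trailing whitespace off before translating and re-attach it afterwards. The translateAsync delegate is a placeholder for whatever the external API call actually looks like.
using System;
using System.Threading.Tasks;

static async Task<string> TranslatePreservingEdges(string input, Func<string, Task<string>> translateAsync)
{
    if (input.Trim().Length == 0) return input;   // nothing to translate

    // Capture whatever whitespace the engine would otherwise trim away.
    string leading  = input.Substring(0, input.Length - input.TrimStart().Length);
    string trailing = input.Substring(input.TrimEnd().Length);

    string translated = await translateAsync(input.Trim());   // external translation call (assumed)

    return leading + translated + trailing;
}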

C# Special Characters in String Crashing Program

I have a slight problem with a path:
"D:\\Music\\DJ Ti%C3%ABsto\\Tiesto\\Adagio For Strings (Spirit of London).mp3"
"D:\\Music\\Dj Tiësto\\Tiesto\\Adagio For Strings (Spirit of London).mp3"
Currently, when it sends that path to my Audio Library, it cannot open the path. (The reason for the crash is trying to assign -1 to a trackbar... but that's irrelevant.)
So I'm wondering, is there any way to prevent C# from switching special characters with %[code]? I've done a .Replace for "[" and "]", but I'd rather not have to look up every single special character and add a line of code to prevent each one. Is there any way around this?
Call Uri.UnescapeDataString.
By the way, when putting paths in strings, you can put an @ sign before the string to tell the compiler not to process escape codes, like this: @"D:\Music\DJ Tiësto\Tiesto\Adagio For Strings (Spirit of London).mp3". This way, you don't need to double up every backslash.
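For example, with the path from the question:
using System;

string escaped = @"D:\Music\DJ Ti%C3%ABsto\Tiesto\Adagio For Strings (Spirit of London).mp3";
string path = Uri.UnescapeDataString(escaped);

Console.WriteLine(path);
// D:\Music\DJ Tiësto\Tiesto\Adagio For Strings (Spirit of London).mp3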

Is it possible to programmatically 'clean' emails?

Does anyone have any suggestions as to how I can clean the body of incoming emails? I want to strip out disclaimers, images and maybe any previous email text that may also be present, so that I am left with just the body text content. My guess is it isn't going to be possible in any reliable way, but has anyone tried it? Are there any libraries geared towards this sort of thing?
In email, there are a couple of agreed-upon markings that indicate the things you wish to strip. You can look for these lines using regular expressions. I doubt you can "sanitize" your emails really well, but some things you can look for:
Line starting with "> " (greater than then whitespace) marks a quote
Line with "-- " (two hyphens then whitespace then linefeed) marks the beginning of a signature, see Signature block on Wikipedia
Multipart messages, boundaries start with --, beyond that you need to do some searching to separate the message body parts from unwanted parts (like base64 images)
As for an actual C# implementation, I leave that for you or other SOers.
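That said, a rough sketch covering just the two line-based conventions above might look like this (multipart handling is deliberately left out):
using System;
using System.Collections.Generic;

// Drop quoted lines ("> ") and everything from the signature separator ("-- ") onwards.
static string StripQuotesAndSignature(string body)
{
    var kept = new List<string>();

    foreach (var line in body.Replace("\r\n", "\n").Split('\n'))
    {
        if (line.TrimEnd() == "--")        // signature separator "-- "; trailing space is often lost
            break;
        if (line.StartsWith("> "))         // quoted text from a previous message
            continue;
        kept.Add(line);
    }
    return string.Join(Environment.NewLine, kept);
}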
A few obvious things to look at:
If the mail is anything but pure plain text, the message will be multi-part MIME. Any part whose type is "image/*" (image/jpeg, etc.) can probably be dropped. In all likelihood any part whose type is not "text/*" can go.
An HTML message will probably have a part of type "multipart/alternative" (I think), which will have 2 parts, one "text/plain" and one "text/html". The two parts should be just about equivalent, so you can drop the HTML part. If the only part present is the HTML bit, you may have to do an HTML to plain text conversion.
The usual format for quoted text is to precede the text with a ">" character. You should be able to drop these lines, unless the line starts with ">From", in which case the ">" has been inserted to prevent the mail reader from thinking that the "From " is the start of a new mail.
The signature should start with "-- \r\n", though there is a very good chance that the trailing space will be missing.
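A sketch of the "prefer text/plain, fall back to HTML" part, using MimeKit as the parser (a third-party library chosen here as an assumption, not something named in the answers), could be:
using MimeKit;   // third-party MIME parser, not part of the BCL

static string GetPreferredBody(string pathToEmlFile)
{
    var message = MimeMessage.Load(pathToEmlFile);

    // The text/plain alternative, if present, is usually equivalent to the HTML part.
    if (!string.IsNullOrEmpty(message.TextBody))
        return message.TextBody;

    // Only HTML is present: a real implementation would convert it to plain text here.
    return message.HtmlBody ?? string.Empty;
}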
Version 3 of OSBF-Lua has a mail-parsing library that will handle the MIME and split a message into its MIME parts and so on. I currently have a mess of Lua scripts that do stuff like ignore most non-text attachments, prefer plain text to HTML, and so on. (I also wrap long lines to 80 characters while trying to preserve quoting.)
As far as removing previously quoted mail, the suggestions above are all good (you must subscribe to some ill-mannered mailing lists).
Removing disclaimers reliably is probably going to be hard. My first cut would be simply to maintain a library of disclaimers that would be stripped off the end of each mail message; I would write a script to make it easy for me to add to the library. For something more sophisticated I would try some kind of machine learning.
I've been working on spam filtering since Feb 2007 and I've learned that anything to do with email is a mess. A good rule of thumb is that whatever you want to do is a lot harder than you think it is :-(
Given your question "Is it possible to programmatically ‘clean’ emails?", I'd answer "No, not reliably".
The danger you face isn't really a technological one, but a sociological one.
It's easy enough to spot, and filter out, some aspects of the messages - like images. Filtering out signatures and disclaimers is, likewise, possible to achieve (though more of a challenge).
The real problem is the cost of getting it wrong.
What happens if your filter happens to remove a critical piece of the message? Can you trace it back to find the missing piece, or is your filtering destructive? Worse, would you even notice that the piece was missing?
There's a classic comedy sketch I saw years ago that illustrates the point. Two guys working together on a car. One is underneath doing the work, the other sitting nearby reading instructions from a service manual - it's clear that neither guy knows what he's doing, but they're doing their best.
Manual guy, reading aloud: "Undo the bolt in the centre of the oil pan ..." [turns page]
Tool guy: "Ok, it's out."
Manual guy: "... under no circumstances."
If you're creating your own application, I'd look into Regex to find text and replace it. To make the application a little nicer, I'd create a class called Email, and in that class have a property called Raw and a property called Stripped.
Just some hints; you'll gather the rest when you look into regex!
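For what it's worth, the class layout suggested there is just this (property names taken from the answer):
public class Email
{
    public string Raw { get; set; }        // the original message body
    public string Stripped { get; set; }   // the body after the regex clean-up
}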
SigParser has an assembly you can use in .NET. It gives you the body back in both HTML and text forms with the rest of the stuff stripped out. If you give it an HTML email it will convert the email to text if you need that.
var parser = new SigParser.EmailParsing.EmailParser();
var result = await parser.GetCleanedBodyAsync(new SigParser.EmailParsing.Models.CleanedBodyInput {
    FromEmailAddress = "john.smith@example.com",
    FromName = "John Smith",
    TextBody = @"Hi Mark,
This is my message.
Thanks
John Smith
888-333-4434"
});
// This would print "Hi Mark,\r\nThis is my message."
Console.WriteLine(result.CleanedBodyPlain);
