C# Regex to get ID value from HTML

I want to use a regex expression to extract an ID value from a web-response which I get from a website.
This is the line which has the id value required:
SONET.globalSettings = $.extend(SONET.globalSettings, {"esc_my_persona":"1234566","esc_my_player":"1234567","esc_my_nucleus":"123456677","esc_my_platform":"cem_ea_id"}
I need to get the value of esc_my_nucleus, that is: 123456677.
Help please?

You can do
string regex = @"esc_my_nucleus"":""(\d+)";
string val = Regex.Match(input, regex, RegexOptions.IgnoreCase).Groups[1].Value;
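The same pattern is easy to try outside C#. A minimal sketch in JavaScript, assuming the response body is already held in a string; the function name extractNucleusId and the shortened sample string are made up for illustration:

```javascript
// Hypothetical helper: pulls the digits that follow "esc_my_nucleus":" out of a page body.
function extractNucleusId(html) {
  // Capture group 1 holds the run of digits between the quotes.
  const match = /"esc_my_nucleus"\s*:\s*"(\d+)"/i.exec(html);
  return match ? match[1] : null; // null when the key is absent
}

// Shortened version of the line from the question (assumed input).
const sample = 'SONET.globalSettings = $.extend(SONET.globalSettings, ' +
  '{"esc_my_persona":"1234566","esc_my_nucleus":"123456677"}';
console.log(extractNucleusId(sample));
```

Unlike Groups[1].Value in the C# version, which yields an empty string when nothing matches, this sketch returns null so the caller can tell the two cases apart.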

Assuming you're using ASP.NET and the JavaScript library is making an HTTP GET, then the name-value pairs should be in the query part of the URL. Hence:
var val = HttpContext.Request.QueryString["esc_my_nucleus"];
If the JavaScript is performing an HTTP POST then you need to extract from the request body
var val = HttpContext.Request.Form["esc_my_nucleus"];
Or use HttpContext.Request.Params which combines both the above.
However knowing the server side framework (WebForms, MVC or something else?) and how that JavaScript is expressed as an HTTP request would allow a better answer.
NB. none of these needs a regex: even if you are parsing the URL yourself regular expressions are unlikely to be the right answer (too much flexibility in what a URL can contain, too easy to open security vulnerabilities).

Related

Regular expressions redirection

I want to set redirection from
www.somesite.com/products/dynamicstring/randomtext1/randomtext2
to www.somesite.com/products/dynamicstring
Is it possible to do that through regex?
That is, if my incoming URL is
www.somesite.com/products/myproducts/test1/test2 it should redirect to www.somesite.com/products/myproducts/
To give a bit more detail:
@TomLord I am using HttpContext.Current.Response.RedirectPermanent(matchingDefinition.To). I have all the redirects' "From" and "To" values in a class object, in the form of regex expressions, for example From "/product/*" and To "/products". I read these objects and try to redirect, but I am not able to redirect something like /products/dynamicstring/randomtext1/ to /products/dynamicstring, where dynamicstring is a random string. I can't find any regular expression that can do this. For example, /products/samples/randomtext1 should redirect to /products/samples/

Redirection cannot be done with regex alone. Look up what a regular expression actually is. The short answer: it's a string-like expression that describes a search pattern. So it can't redirect, replace a substring, or do anything other than match and capture parts of the matched string.
That being said, regex can help us do what you want. I am going to assume you can use JavaScript, because I can't give a solution in every language. I am also going to assume you will work through the code rather than copy, paste and press enter; if that's all you need, hire a programmer. If you use another language, the principle should be the same:
obtain URL
define regex
use capture group to extract the part of your URL that you need
construct a new URL
redirect to it
While matching the URLs in general is a fair bit more complex, like:
^(?:https?://)?(?:[\w]+\.)(?:\.?[\w]{2,})+$
As long as you are sure you will only be getting URLs and in the format you wanna, we will do it far simpler.
Basically, let's say you have:
some text with 2 dots that ends in com
then a /products/dynamicstring/
then text
then /
then text
As a regex that is:
/\w*.\w*.com\/products\/dynamicstring\/\w*\/\w*/g
Crude matching is done, but we still need to add a capture group to extract the part of the string we need:
/(\w*.\w*.com\/products\/)dynamicstring\/\w*\/\w*/g
OK, now let's use this regex to do the rest of the work:
Define the regex (with the capture group):
var regex = /(\w*.\w*.com\/products\/)dynamicstring\/\w*\/\w*/g;
Get the current URL. If you already have the URL, use it.
var currUrl = window.location.href;
Extract the capture group from the string:
var match = regex.exec(currUrl);
Use that to build the new URL from the old one:
var redirectUrl = match[1] + "myproducts/";
Finally, we redirect with:
window.location.replace(redirectUrl);
I wrote all this straight from my head so I recommend you go over each step, look how it works, read some documentation about functions used. You might find an error as well as learn a lot.
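To handle any value in place of dynamicstring, the middle segment can be captured instead of hard-coded. A rough sketch, assuming URLs of the shape shown in the question; redirectTarget is a made-up name and the pattern is deliberately simple:

```javascript
// Hypothetical helper: returns the URL to redirect to, or null if no redirect is needed.
function redirectTarget(url) {
  // Group 1: optional scheme, host, "/products/" and exactly one more path segment.
  // The trailing "\/.+" requires extra text after that segment.
  const match = /^((?:https?:\/\/)?[^\/]+\/products\/[^\/]+)\/.+$/.exec(url);
  return match ? match[1] + "/" : null;
}

console.log(redirectTarget("www.somesite.com/products/myproducts/test1/test2"));
console.log(redirectTarget("www.somesite.com/products/samples/randomtext1"));
```

A URL that is already in the /products/name/ form returns null, so the caller can skip the redirect and avoid a loop.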

How to skip encoding params in ASP.NET Routes

In my ASP.NET WebForm application I have simple rule:
routes.MapPageRoute("RouteSearchSimple", "search/{SearchText}", "~/SearchTicket.aspx");
As "SearchText" param I need to use cyrillic words, so to create Url I use:
string searchText = "текст";
string url = Page.GetRouteUrl("RouteSearchSimple",
new
{
SearchText = searchText
});
GetRouteUrl automatically encodes the searchText value, and as a result
url = /search/%D1%82%D0%B5%D0%BA%D1%81%D1%82
but I need -> /search/текст
How can I get that from the Page.GetRouteUrl function?
Thanks a lot!

Actually, I believe Alexei Levenkov is close to the answer. Ultimately, a URL may only contain ASCII characters, so anything beyond alphanumeric characters will be URL encoded (even things like spaces).
Now, to your point, there are browsers out there that will display non-ASCII characters, but that is up to the implementation of the browser (behind the scenes, it is still performing the encoding). GetRouteUrl, however, will return the ASCII-encoded form every time because that is a requirement for URLs.
(As an aside, that "some 8 year old document" defines URLs. It's written by Tim Berners Lee. He had a bit of an impact on the Internet.)
Update
And because you got me interested, I did a bit more research. It looks as though Internationalized Domain Names do exist. However, from what I understand from the article, underneath the covers, ToASCII or ToUnicode are applied to the names. More can be read in this spec: RFC 3490. So, again, you're still at the same point. More discussion can be found at this Stackoverflow question.

OK, guys, thank you for the replies, they helped a lot. The simple answer is: it's impossible to do that with the Page.GetRouteUrl() function. It's strange that it wasn't designed to leave encoding/decoding of params to developers, as with Request.Params or .QueryString, or at least that there isn't an alternative routing function where developers could control that.
One way I found is getting the Url from RouteTable and building it manually; in my case it would be like:
string url = (System.Web.Routing.RouteTable.Routes["RouteSearchSimple"] as System.Web.Routing.Route).Url.Replace("{SearchText}", "текст");
or the simplest way is just creating the url via string concatenation:
string param = "текст";
string url = "/search/" + param;
which is what I already did, but in that case you will need to change the code everywhere it appears if you change your route url, so it is better to create a static function like GetSearchUrl(string searchText) in one place.
And it works like a charm: the URLs look human readable and I can read params via RouteData.Values

The simplest solution is to decode with the UrlDecode method:
string searchText = "текст";
string url = Page.GetRouteUrl("RouteSearchSimple",
new
{
SearchText = searchText
});
string decodedUrl = Server.UrlDecode(url); // => /search/текст
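The encoded and readable forms are two spellings of the same path, which can be checked in any language. A small JavaScript sketch (not ASP.NET; the variable names are illustrative) using the standard encodeURIComponent and decodeURI functions:

```javascript
const searchText = "текст";

// Percent-encode the UTF-8 bytes of the route parameter.
const encodedUrl = "/search/" + encodeURIComponent(searchText);

// Decode again, for display purposes only.
const readableUrl = decodeURI(encodedUrl);

console.log(encodedUrl);
console.log(readableUrl);
```

The encoded string here is the same one GetRouteUrl produced in the question, which suggests decoding is only a display concern.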

URL from user, XSS security? [duplicate]

We have a high security application and we want to allow users to enter URLs that other users will see.
This introduces a high risk of XSS hacks - a user could potentially enter javascript that another user ends up executing. Since we hold sensitive data it's essential that this never happens.
What are the best practices in dealing with this? Is any security whitelist or escape pattern alone good enough?
Any advice on dealing with redirections ("this link goes outside our site" message on a warning page before following the link, for instance)
Is there an argument for not supporting user entered links at all?
Clarification:
Basically our users want to input:
stackoverflow.com
And have it output to another user:
<a href="http://stackoverflow.com">stackoverflow.com</a>
What I really worry about is them using this in an XSS hack, i.e. they input:
alert('hacked!');
So other users get this link:
<a href="javascript:alert('hacked!');">stackoverflow.com</a>
My example is just to explain the risk - I'm well aware that javascript and URLs are different things, but by letting them input the latter they may be able to execute the former.
You'd be amazed how many sites you can break with this trick - HTML is even worse. If they know to deal with links do they also know to sanitise <iframe>, <img> and clever CSS references?
I'm working in a high security environment - a single XSS hack could result in very high losses for us. I'm happy that I could produce a Regex (or use one of the excellent suggestions so far) that could exclude everything that I could think of, but would that be enough?

If you think URLs can't contain code, think again!
https://owasp.org/www-community/xss-filter-evasion-cheatsheet
Read that, and weep.
Here's how we do it on Stack Overflow:
/// <summary>
/// returns "safe" URL, stripping anything outside normal charsets for URL
/// </summary>
public static string SanitizeUrl(string url)
{
    return Regex.Replace(url, @"[^-A-Za-z0-9+&@#/%?=~_|!:,.;\(\)]", "");
}

The process of rendering a link "safe" should go through three or four steps:
Unescape/re-encode the string you've been given (RSnake has documented a number of tricks at http://ha.ckers.org/xss.html that use escaping and UTF encodings).
Clean the link up: Regexes are a good start - make sure to truncate the string or throw it away if it contains a " (or whatever you use to close the attributes in your output); If you're doing the links only as references to other information you can also force the protocol at the end of this process - if the portion before the first colon is not 'http' or 'https' then append 'http://' to the start. This allows you to create usable links from incomplete input as a user would type into a browser and gives you a last shot at tripping up whatever mischief someone has tried to sneak in.
Check that the result is a well formed URL (protocol://host.domain[:port][/path][/[file]][?queryField=queryValue][#anchor]).
Possibly check the result against a site blacklist or try to fetch it through some sort of malware checker.
If security is a priority I would hope that the users would forgive a bit of paranoia in this process, even if it does end up throwing away some safe links.
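The clean-up, protocol-forcing and well-formedness steps above can be sketched roughly as follows. This is only an illustration under assumptions: cleanLink is a made-up name, the rejected character set is a guess at "whatever closes your attributes", and the standard URL class stands in for the well-formedness check. The unescape and blacklist steps are not covered.

```javascript
// Hypothetical sketch: returns a normalised http(s) URL, or null if the input is thrown away.
function cleanLink(input) {
  const trimmed = String(input).trim();

  // Throw the link away if it contains quotes, angle brackets or whitespace.
  if (/["'<>\s]/.test(trimmed)) return null;

  // Force the protocol: anything before the first colon that is not
  // http or https gets "http://" put in front of the whole string.
  const colon = trimmed.indexOf(":");
  const scheme = colon === -1 ? "" : trimmed.slice(0, colon).toLowerCase();
  const candidate =
    scheme === "http" || scheme === "https" ? trimmed : "http://" + trimmed;

  // Check that the result is a well-formed URL.
  let parsed;
  try {
    parsed = new URL(candidate);
  } catch {
    return null;
  }
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") return null;
  return parsed.href;
}

console.log(cleanLink("stackoverflow.com"));
console.log(cleanLink("javascript:alert(1)"));
```

With the forced prefix, javascript:alert(1) becomes http://javascript:alert(1), which fails to parse because alert(1) is not a valid port, so the input is discarded.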

Use a library, such as OWASP-ESAPI API:
PHP - http://code.google.com/p/owasp-esapi-php/
Java - http://code.google.com/p/owasp-esapi-java/
.NET - http://code.google.com/p/owasp-esapi-dotnet/
Python - http://code.google.com/p/owasp-esapi-python/
Read the following:
https://www.golemtechnologies.com/articles/prevent-xss#how-to-prevent-cross-site-scripting
https://www.owasp.org/
http://www.secbytes.com/blog/?p=253
For example:
$url = "http://stackoverflow.com"; // e.g., $_GET["user-homepage"];
$esapi = new ESAPI( "/etc/php5/esapi/ESAPI.xml" ); // Modified copy of ESAPI.xml
$sanitizer = ESAPI::getSanitizer();
$sanitized_url = $sanitizer->getSanitizedURL( "user-homepage", $url );
Another example is to use a built-in function. PHP's filter_var function is an example:
$url = "http://stackoverflow.com"; // e.g., $_GET["user-homepage"];
$sanitized_url = filter_var($url, FILTER_SANITIZE_URL);
Note that FILTER_SANITIZE_URL only strips characters that are not legal in a URL; it does not reject javascript: URLs or restrict the scheme to http and https, so it is not sufficient on its own. Using the OWASP ESAPI Sanitizer is probably the best option.
Still another example is the code from WordPress:
http://core.trac.wordpress.org/browser/tags/3.5.1/wp-includes/formatting.php#L2561
Additionally, since there is no way of knowing where the URL links (i.e., it might be a valid URL, but the contents of the URL could be mischievous), Google has a safe browsing API you can call:
https://developers.google.com/safe-browsing/lookup_guide
Rolling your own regex for sanitation is problematic for several reasons:
Unless you are Jon Skeet, the code will have errors.
Existing APIs have many hours of review and testing behind them.
Existing URL-validation APIs consider internationalization.
Existing APIs will be kept up-to-date with emerging standards.
Other issues to consider:
What schemes do you permit (are file:/// and telnet:// acceptable)?
What restrictions do you want to place on the content of the URL (are malware URLs acceptable)?

Just HTMLEncode the links when you output them. Make sure you don't allow javascript: links. (It's best to have a whitelist of protocols that are accepted, e.g., http, https, and mailto.)
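A minimal sketch of that advice, assuming the link is rendered as an anchor tag; renderLink, escapeHtml and ALLOWED_PROTOCOLS are hypothetical names, and the standard URL class is used to read the protocol:

```javascript
// Whitelist of accepted protocols (as suggested above).
const ALLOWED_PROTOCOLS = new Set(["http:", "https:", "mailto:"]);

// HTML-encode the characters that matter inside text and attribute values.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Returns an anchor for whitelisted protocols, otherwise just the encoded text.
function renderLink(href, text) {
  let parsed;
  try {
    parsed = new URL(href);
  } catch {
    return escapeHtml(text); // not an absolute URL: show plain text only
  }
  if (!ALLOWED_PROTOCOLS.has(parsed.protocol)) return escapeHtml(text);
  return '<a href="' + escapeHtml(parsed.href) + '">' + escapeHtml(text) + "</a>";
}

console.log(renderLink("http://stackoverflow.com", "stackoverflow.com"));
console.log(renderLink("javascript:alert('hacked!');", "stackoverflow.com"));
```

Falling back to plain text for anything outside the whitelist keeps the user's label visible without producing a clickable link.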

You don't specify the language of your application, so I will presume ASP.NET; for this you can use the Microsoft Anti-Cross Site Scripting Library.
It is very easy to use: all you need is an include and that is it :)
While you're on the topic, why not give Design Guidelines for Secure Web Applications a read?
If you use another language: if there is a library for ASP.NET, there is bound to be one for other languages as well (PHP, Python, RoR, etc.)

For Pythonistas, try Scrapy's w3lib.
OWASP ESAPI pre-dates Python 2.7 and is archived on the now-defunct Google Code.

How about not displaying them as a link? Just use the text.
Combined with a warning to proceed at your own risk may be enough.
addition - see also Should I sanitize HTML markup for a hosted CMS? for a discussion on sanitizing user input

There is a library for javascript that solves this problem
https://github.com/braintree/sanitize-url
Try it =)

In my project written in JavaScript I use this regex as white list:
url.match(/^((https?|ftp):\/\/|\.{0,2}\/)/)
the only limitation is that you need to put ./ in front for files in same directory but I think I can live with that.
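To see what this whitelist accepts and rejects, here is a small check built around the same regex; the wrapper name isAllowedUrl and the sample inputs are illustrative only:

```javascript
// Same pattern as above: http(s)/ftp URLs, or paths starting with /, ./ or ../
const ALLOWED_PREFIX = /^((https?|ftp):\/\/|\.{0,2}\/)/;

function isAllowedUrl(url) {
  return ALLOWED_PREFIX.test(url);
}

const samples = [
  "https://example.com/page",
  "./file.html",
  "../up/file.html",
  "file.html",
  "javascript:alert(1)",
  "//example.com/page",
];
for (const s of samples) {
  console.log(s, isAllowedUrl(s));
}
```

Note that a protocol-relative URL such as //example.com/page also passes, because zero dots followed by a slash satisfy the second alternative; whether that is acceptable depends on the application.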

Using regular expressions to prevent XSS vulnerabilities becomes complicated and hard to maintain over time, and can still leave some vulnerabilities behind. URL validation with a regular expression is helpful in some scenarios, but is better not mixed with vulnerability checks.
A solution is probably to combine an encoder such as AntiXssEncoder.UrlEncode for the query portion of the URL with UriBuilder for the rest:
using System;
using System.Web.Security.AntiXss;

public sealed class AntiXssUrlEncoder
{
    public string EncodeUri(Uri uri, bool isEncoded = false)
    {
        // Encode the query portion of the URL to prevent XSS if it is not already encoded.
        // UriBuilder takes care of the remaining parts.
        var encodedQuery = isEncoded ? uri.Query.TrimStart('?') : AntiXssEncoder.UrlEncode(uri.Query.TrimStart('?'));
        var encodedUri = new UriBuilder
        {
            Scheme = uri.Scheme,
            Host = uri.Host,
            Path = uri.AbsolutePath,
            Query = encodedQuery.Trim(),
            // Uri.Fragment includes the leading '#'; trim it so it is not emitted twice.
            Fragment = uri.Fragment.TrimStart('#')
        };
        // Only set the port explicitly when it is not a default HTTP/HTTPS port.
        if (uri.Port != 80 && uri.Port != 443)
        {
            encodedUri.Port = uri.Port;
        }
        return encodedUri.ToString();
    }

    public static string Encode(string uri)
    {
        var baseUri = new Uri(uri);
        var antiXssUrlEncoder = new AntiXssUrlEncoder();
        return antiXssUrlEncoder.EncodeUri(baseUri);
    }
}
You may need to add whitelisting to exclude some characters from encoding. That can be helpful for particular sites.
HTML-encoding the page that renders the URL is another thing you may need to consider.
BTW, note that encoding the URL may interfere with Web Parameter Tampering checks, so the encoded link may not work as expected.
Also, you need to be careful about double encoding.
P.S. AntiXssEncoder.UrlEncode would be better named AntiXssEncoder.EncodeForUrl to be more descriptive. It encodes a string for use in a URL; it does not encode a given URL and return a usable URL.

You could use a hex code to convert the entire URL and send it to your server. That way the client would not understand the content in the first glance. After reading the content, you could decode the content URL = ? and send it to the browser.

Allowing a URL and allowing JavaScript are 2 different things.

C# ASP.NET HttpWebRequest automatically decodes ampersand (&) values from query string?

Assume the following Url:
"http://server/application1/TestFile.aspx?Library=Testing&Filename=Documents & Functions + Properties.docx&Save=true"
I use HttpUtility.UrlEncode() to encode the value of the Filename parameter and I create the following Url:
"http://server/application1/TestFile.aspx?Library=Testing&Filename=Documents%20%26%20Functions%20%2B%20Properties.docx&Save=true"
I send the following (encoded version) of request from a client to a C# Web Application. On the server when I process the request I have a problem. The HttpRequest variable contains the query string partially decoded. That is to say when I try to use or quick watch the following properties of HttpRequest they have the following values.
Property = Value
================
HttpRequest.QueryString = "{Library=Testing&Filename=Documents+&+Functions+++Properties.docx&Save=true}"
HttpRequest.Url = "{http://server/application1/TestFile.aspx?Library=Testing&Filename=Documents & Functions + Properties.docx&Save=true}"
HttpRequest.Url.AbsoluteUri = "http://server/application1/TestFile.aspx?Library=Testing&Filename=Documents%20&%20Functions%20+%20Properties.docx&Save=true"
I have also checked the following properties but all of them have the & value decoded. However all other values remain properly encoded (e.g. space is %20).
HttpRequest.Url.OriginalString
HttpRequest.Url.Query
HttpRequest.Url.PathAndQuery
HttpRequest.RawUrl
There is no way I can read the value of the parameter Filename properly. Am I missing something?

The QueryString property returns a NameValueCollection object that maps the querystring keys to fully-decoded values.
You need to write Request.QueryString["FileName"].
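The general principle, that the parser splits on the raw & first and only then decodes each value, can be illustrated outside ASP.NET. A small JavaScript sketch using the standard URL and URLSearchParams classes with the URL from the question; it is an analogy, not the HttpRequest implementation:

```javascript
const filename = "Documents & Functions + Properties.docx";

// Encode only the parameter value, as in the question.
const requestUrl =
  "http://server/application1/TestFile.aspx?Library=Testing&Filename=" +
  encodeURIComponent(filename) +
  "&Save=true";

// Reading by key returns the fully decoded value, ampersand and plus included.
const params = new URL(requestUrl).searchParams;
console.log(requestUrl);
console.log(params.get("Filename"));
console.log(params.get("Save"));
```

Because %26 is only decoded after the query string has been split into pairs, the ampersand inside the file name does not start a new parameter.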

I'm answering this question many years later because I just had this problem and figured out the solution. The problem is that HttpRequest.Url isn't really the value that you gave. HttpRequest.Url is a Uri class and that value is the ToString() value of that class. ToString() for the Uri class decodes the Url. Instead, what you want to use is HttpRequest.Url.OriginalString. That is the encoded version of the URL that you are looking for. Hope this helps some future person having this problem.

What happens when you don't use UrlEncode? You didn't show how exactly you are using the url that you created using UrlEncode, so it is quite possible that things are just being double encoded (lots of the framework will encode the URLs for you automatically).

FWIW I ran into this same problem with RavenDB (version 960). They implement their own HttpRequest object that behaves just like this -- it first decodes just the ampersands (from %26 to &) and then decodes the entire value. I believe this is a bug.
A couple of workarounds to this problem:
Implement your own query string parsing on the server. It's not fun but it is effective.
Double-encode ampersands. First encode just the ampersands in the string, then encode the entire string. (It's an easy solution but not extensible because it puts the burden on the client.)

How to get QueryString from a href?

I am trying to stop XSS attack so I am using html agility pack to make my whitelist and Microsoft Anti-Cross Site Scripting Library to deal with the rest.
Now I am looking at encoding all HTML hrefs. I get a big string of HTML that can contain hrefs. According to the MS library documentation there is a URL encode method, but if you encode the whole URL it can't be used, so in the example they encode only the query string:
UrlEncode: untrusted input is used in a URL (such as a value in a querystring); the example given is a "Click Here!" link.
http://msdn.microsoft.com/en-us/library/aa973813.aspx
So now my question is: how do I parse an href and find the query string? Is it always just "?" followed by the query string, or can it have spaces and be written in different ways?
Edit
These URLs will not be written by me but by the users who share them. That's why I need a way to make sure I get all query strings and not just the ones in a valid format. If an invalid format still works, I have to catch those too. Hackers won't care whether the format is valid as long as it still does what they want.

I believe it is always the part after the ? but you can easily use the Uri class for this:
Uri uri = new Uri("http://foo.com/page.html?query");
string query = uri.Query;
That will include the ? itself. Of course, you can fetch the other bits as well, which could be handy.
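The same idea in JavaScript, for comparison: the standard URL class exposes the query through its search property, which also includes the leading ?. The helper name queryOf is made up, and returning null for input that cannot be parsed is an assumption about how a caller might want to treat malformed hrefs:

```javascript
// Hypothetical helper: returns the query part of an absolute URL, or null if it cannot be parsed.
function queryOf(href) {
  try {
    return new URL(href).search; // includes the leading "?" when a query is present
  } catch {
    return null; // not a parseable absolute URL
  }
}

console.log(queryOf("http://foo.com/page.html?query"));
console.log(queryOf("http://foo.com/page.html"));
console.log(queryOf("not a url"));
```

Letting a real parser decide where the query starts avoids guessing about spaces or unusual spellings in user-supplied hrefs.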

What about using an encrypted query string, which you can decrypt in your code?
Or you can use Request.PathInfo, so that you don't need ? in the query string.

Here's a W3C reference addressing the composition of URIs with querystrings, which says in part:
The question mark ("?", ASCII 3F hex)
is used to delimit the boundary
between the URI of a queryable object,
and a set of words used to express a
query on that object.
