how do i get my c# code to recognize "ö"? - c#

how do i get my c# code to recognize "ö"?
The output of the query is nice and formatted all special characters are visible, but in codebehind, i cannot use them for sorting.
example:
if (link.Contains("teborg"))
{
CountRss++;
Response.Write("<p class='RssCont'><a href='" + link + "' target='new'><b>" + title + "</b></a><br/>");
Response.Write(description + "</p>");
}
will give several results with "Göteborg" in title but:
if (link.Contains("Göteborg"))
{
CountRss++;
Response.Write("<p class='RssCont'><a href='" + link + "' target='new'><b>" + title + "</b></a><br/>");
Response.Write(description + "</p>");
}
will give no results at all.

Your code is perfectly sensible and good as code, the issue is with data. There are four general possibilities here.
The first is encoding issues, but I doubt this is the case as you say that it's rendering okay, so I'd highly doubt that's the issue or you'd have problems there too.
The second is a conflict between composed ö and ö formed from a o followed with combining-diaresis. This is unlikely, but putting the string into NFC with link.Normalize() will catch that.
The third is that since it's a URI it might be in URI rather than IURI form. So it'll be G%c3%b6teborg (indeed, it could be G%C3%b6teborg, G%c3%B6teborg or G%C3%B6teborg). Unescape the string with Uri.UnescapeDataString(link) or any of the various methods for this. This is the one I'd bet on.
The fourth is that it could be XML escaped (since it's from RSS to judge from the names used), in which case HtmlDecode should sort that out as barring a DTD defining other entities, the encoding for HTML is a superset of that for XML. However, this is only possible if you are parsing the RSS with text-based rather than XML-based methods in which case you've got bigger problems. If you're using XmlReader or XmlDocument or any other XML based class, this decoding will have been done for you already if necessary, so that's not the issue.
So the third is by far seeming the most likely, and Uri.UnescapeDataString(link) seems the most promising.
You might want a less precise check that case-sensitive exact char for char. Other methods will let you match göteborg and GÖTEBORG too. There are also some that would e.g. match goeteborg (it's common to transliterate ö to oe in English - this is more often done with German than Swedish but it might still be done). (Matching e.g. the English Gothenburg or the Danish Gøteborg is a much more involved matter).

If your code renders link correctly it should be encoded and as result will not contain non-ASCII characters.
Depending on position of the word in the url you may need to search for different text to find match.
Note that using proper Uri class to deal with url will make life easier. Also make sure you have correctly encoded link to avoid script injection attacks on your page.

Related

Best Method of standard string to XML legal string - C#

Currently my understanding of XML legal strings is that all is required is that you convert any instances of: &, ", ', <, > with & " &apos; < >
So I made the following parser:
private static string ToXmlCompliantStr(string uriStr)
{
string uriXml = uriStr;
uriXml = uriXml.Replace("&", "&");
uriXml = uriXml.Replace("\"", """);
uriXml = uriXml.Replace("'", "&apos;");
uriXml = uriXml.Replace("<", "<");
uriXml = uriXml.Replace(">", ">");
return uriXml;
}
I am aware that there are similar questions out there with good answers (which is how I was able to write this function) I am writing this question to ask if this code will translate ANY string that C# can throw at it and have XDocument parse it as a part of a whole document without any complaints as all the questions out there that I've found state that these are the only escape characters, not that parsing them will cause 100% valid XML string. I've gone as far as reading through the decompiled XNode class trying to see how that parse it.
Thanks
Firstly, you should absolutely not do this yourself. Use an XML API - that way you can trust that to do the right thing, rather than worrying about covering corner cases etc. You generally shouldn't be trying to come up with an "escaped string" at all - you should pass the string to the XElement constructor (or XAttribute, or whatever your situation is).
In other words, I think you should try really hard to design your application so that you don't need a method of the kind you've shown in your question at all. Look at where you'd be using that method, and see whether you can just create an XElement (or whatever) instead. If you try to treat XML as a data structure in itself rather than just as text, you'll have a much better experience in my experience.
Secondly, you need to understand that in XML 1.0 at least, there are Unicode characters that cannot be validly represented in XML, no matter how much escaping you use. In particular, values U+0000 to U+001F are unrepresentable other than U+0009 (tab), U+000A (line feed) and U+000D (carriage return). Also if you have a string which contains invalid UTF-16 (e.g. an unmatched half of a surrogate pair), that can't be correctly represented in XML.

Replacing URL encoded data with their symbols

Hi I am trying to format my Url in order for it to look more user friendly.So far I managed to replace spaces with "-" but it seems that there are special characters like # and : that display as encoded data.This is what I mean:
http://localhost:51208/Home/Details/C%23-in-Depth%2c-Second-Edition/BookId-3
The "#" symbol is displayed as %23 and the "," is displayed as %2c.I would like to be able to replace this encoding with their original symbols.
Does such a way exist?
Oh no, you totally don't want to replace it with #. This symbol has a special meaning in an url. It represents the fragment identifier and its value is never sent to the server. This basically means that if there's a # symbol in your url, everything that follows it gets truncated and never sent to the server. You may take a look at the following post to see what StackOverflow uses to format the slug in the question title. You could run your string through this replace function in order to make sure that no dangerous characters are left.
I would also recommend you reading the following blog post from Scott Hanselman where he covers the various scenarios you might encounter with IIS if you attempt to send special characters in the path portion of your url. I am quoting his conclusion here:
After ALL this effort to get crazy stuff in the Request Path, it's
worth mentioning that simply keeping the values as a part of the Query
String (remember WAY back at the beginning of this post?) is easier,
cleaner, more flexible, and more secure.
Just replace "#" with "sharp" and ":" with "-", you cannot just put those special characters in the url

c# equivalent to stripcslashes function?

I am working with a project that includes getting MMS from a mms-gateway and storing the image on disk.
This includes using a received base64encoded string and storing it as a zip to a web server. This zip is then opened, and the image is retrieved.
We have managed to store it as a zip file, but it is corrupted and cannot be opened.
The documentation from the gateway is pretty sparse, and we have only a php example to rely on. I think we have figured out how to "translate" most of it, except for the PHP function stripcslashes(inputvalue). Can anyone shed shed any light on how to do the same thing in c#?
We are thankful for any help!
stripcslashes() looks for "\x" type elements within longer strings (where 'x' could be any character, or perhaps, more than one). If the 'x' is not recognised as meaningful, it just removes the '\' but if it does recognise it as a valid C-style escape sequence (i.e. "\n" is newline; "\t" is tab, etc.), as I understand it, the recognised character is inserted instead: \t will be replaced by a tab character (0x09, I think) in your string.
I'm not aware of any simple way to get the .net framework to do the same thing without building a similar function yourself. This obviously isn't very hard, but you need to know which escape sequences to process.
If you happen to know (or find out by inspecting your base64 text) that the only thing in your input that will need processing is a particular one or two sequences (say, tab characters), it becomes very easy and the following snippet shows use of String.Replace():
string input = #"Some\thing"; // '#' means string stored without processing '\t'
Console.WriteLine(input);
string output = input.Replace(#"\t", "\t");
Console.WriteLine(output);
Of course, if you really do simply want to remove all the slashes:
string output = input.Replace(#"\", "");

Is it possible to programmatically 'clean' emails?

Does anyone have any suggestions as to how I can clean the body of incoming emails? I want to strip out disclaimers, images and maybe any previous email text that may be also be present so that I am left with just the body text content. My guess is it isn't going to be possible in any reliable way, but has anyone tried it? Are there any libraries geared towards this sort of thing?
In email, there is couple of agreed markings that mean something you wish to strip. You can look for these lines using regular expressions. I doubt you can't really well "sanitize" your emails, but some things you can look for:
Line starting with "> " (greater than then whitespace) marks a quote
Line with "-- " (two hyphens then whitespace then linefeed) marks the beginning of a signature, see Signature block on Wikipedia
Multipart messages, boundaries start with --, beyond that you need to do some searching to separate the message body parts from unwanted parts (like base64 images)
As for an actual C# implementation, I leave that for you or other SOers.
A few obvious things to look at:
if the mail is anything but pure plain text, the message will be multi-part mime. Any part whose type is "image/*" (image/jpeg, etc), can probably be dropped. In all likelyhood any part whose type is not "text/*" can go.
A HTML message will probably have a part of type "multipart/alternative" (I think), and will have 2 parts, one "text/plain" and one "text/html". The two parts should be just about equivalent, so you can drop the HTML part. If the only part present is the HTML bit, you may have to do a HTML to plain text conversion.
The usual format for quoted text is to precede the text by a ">" character. You should be able to drop these lines, unless the line starts ">From", in which case the ">" has been inserted to prevent the mail reader from thinking that the "From " is the start of a new mail.
The signature should start with "-- \r\n", though there is a very good chance that the trailing space will be missing.
Version 3 of OSBF-Lua has a mail-parsing library that will handle the MIME and split a message into its MIME parts and so on. I currently have a mess of Lua scripts that do
stuff like ignore most non-text attachments, prefer plain text to HTML, and so on. (I also wrap long lines to 80 characters while trying to preserve quoting.)
As far as removing previously quoted mail, the suggestions above are all good (you must subscribe to some ill-mannered mailing lists).
Removing disclaimers reliably is probably going to be hard. My first cut would be simply to maintain a library of disclaimers that would be stripped off the end of each mail message; I would write a script to make it easy for me to add to the library. For something more sophisticated I would try some kind of machine learning.
I've been working on spam filtering since Feb 2007 and I've learned that anything to do with email is a mess. A good rule of thumb is that whatever you want to do is a lot harder than you think it is :-(
Given your question "Is it possible to programmatically ‘clean’ emails?", I'd answer "No, not reliably".
The danger you face isn't really a technological one, but a sociological one.
It's easy enough to spot, and filter out, some aspects of the messages - like images. Filtering out signatures and disclaimers is, likewise, possible to achieve (though more of a challenge).
The real problem is the cost of getting it wrong.
What happens if your filter happens to remove a critical piece of the message? Can you trace it back to find the missing piece, or is your filtering desctructive? Worse, would you even notice that the piece was missing?
There's a classic comedy sketch I saw years ago that illustrates the point. Two guys working together on a car. One is underneath doing the work, the other sitting nearby reading instructions from a service manual - it's clear that neither guy knows what he's doing, but they're doing their best.
Manual guy, reading aloud: "Undo the bold in the centre of the oil pan ..." [turns page]
Tool guy: "Ok, it's out."
Manual guy: "... under no circumstances."
If you creating your own application i'd look into Regex, to find text and replace it. To make the application a little nice, i'd create a class Called Email and in that class i have a property called RAW and a property called Stripped.
Just some hints, you'll gather the rest when you look into regex!
SigParser has an assembly you can use in .NET. It gives you the body back in both HTML and text forms with the rest of the stuff stripped out. If you give it an HTML email it will convert the email to text if you need that.
var parser = new SigParser.EmailParsing.EmailParser();
var result = await parser.GetCleanedBodyAsync(new SigParser.EmailParsing.Models.CleanedBodyInput {
FromEmailAddress = "john.smith#example.com",
FromName = "John Smith",
TextBody = #"Hi Mark,
This is my message.
Thanks
John Smith
888-333-4434"
});
// This would print "Hi Mark,\r\nThis is my message."
Console.WriteLine(result.CleanedBodyPlain);

QueryString malformed after URLDecode

I'm trying to pass in a Base64 string into a C#.Net web application via the QueryString. When the string arrives the "+" (plus) sign is being replaced by a space. It appears that the automatic URLDecode process is doing this. I have no control over what is being passed via the QueryString. Is there any way to handle this server side?
Example:
http://localhost:3399/Base64.aspx?VLTrap=VkxUcmFwIHNldCB0byAiRkRTQT8+PE0iIHBsdXMgb3IgbWludXMgNSBwZXJjZW50Lg==
Produces:
VkxUcmFwIHNldCB0byAiRkRTQT8 PE0iIHBsdXMgb3IgbWludXMgNSBwZXJjZW50Lg==
People have suggested URLEncoding the querystring:
System.Web.HttpUtility.UrlEncode(yourString)
I can't do that as I have no control over the calling routine (which is working fine with other languages).
There was also the suggestion of replacing spaces with a plus sign:
Request.QueryString["VLTrap"].Replace(" ", "+");
I had though of this but my concern with it, and I should have mentioned this to start, is that I don't know what other characters might be malformed in addition to the plus sign.
My main goal is to intercept the QueryString before it is run through the decoder.
To this end I tried looking at Request.QueryString.toString() but this contained the same malformed information. Is there any way to look at the raw QueryString before it is URLDecoded?
After further testing it appears that .Net expects everything coming in from the QuerString to be URL encoded but the browser does not automatically URL encode GET requests.
The suggested solution:
Request.QueryString["VLTrap"].Replace(" ", "+");
Should work just fine. As for your concern:
I had though of this but my concern with it, and I should have mentioned this to start, is that I don't know what other characters might be malformed in addition to the plus sign.
This is easy to alleviate by reading about base64. The only non alphanumeric characters that are legal in modern base64 are "/", "+" and "=" (which is only used for padding).
Of those, "+" is the only one that has special meaning as an escaped representation in URLs. While the other two have special meaning in URLs (path delimiter and query string separator), they shouldn't pose a problem.
So I think you should be OK.
You could manually replace the value (argument.Replace(' ', '+')) or consult the HttpRequest.ServerVariables["QUERY_STRING"] (even better the HttpRequest.Url.Query) and parse it yourself.
You should however try to solve the problem where the URL is given; a plus sign needs to get encoded as "%2B" in the URL because a plus otherwise represents a space.
If you don't control the inbound URLs, the first option would be preferred as you avoid the most errors this way.
I'm having this exact same issue except I have control over my URL. Even with Server.URLDecode and Server.URLEncode it doesn't convert it back to a + sign, even though my query string looks as follows:
http://localhost/childapp/default.aspx?TokenID=0XU%2fKUTLau%2bnSWR7%2b5Z7DbZrhKZMyeqStyTPonw1OdI%3d
When I perform the following.
string tokenID = Server.UrlDecode(Request.QueryString["TokenID"]);
it still does not convert the %2b back into a + sign. Instead I have to do the following:
string tokenID = Server.UrlDecode(Request.QueryString["TokenID"]);
tokenID = tokenID.Replace(" ", "+");
Then it works correctly. Really odd.
I had similar problem with a parameter that contains Base64 value and when it comes with '+'.
Only Request.QueryString["VLTrap"].Replace(" ", "+"); worked fine for me;
no UrlEncode or other encoding helping because even if you show encoded link on page yourself with '+' encoded as a '%2b' then it's browser that changes it to '+' at first when it showen and when you click it then browser changes it to empty space. So no way to control it as original poster says even if you show links yourself. The same thing with such links even in html emails.
If you use System.Uri.UnescapeDataString(yourString) it will ignore the +. This method should only be used in cases like yours where when the string was encoded using some sort of legacy approach either on the client or server.
See this blog post:
http://blogs.msdn.com/b/yangxind/archive/2006/11/09/don-t-use-net-system-uri-unescapedatastring-in-url-decoding.aspx
If you URLEncode the string before adding it to the URL you will not have any of those problems (the automatic URLDecode will return it to the original state).
Well, obviously you should have the Base64 string URLEncoded before sending it to the server.
If you cannot accomplish that, I would suggest simply replacing any embedded spaces back to +; since b64 strings are not suposed to have spaces, its a legitimate tactic...
System.Web.HttpUtility.UrlEncode(yourString) will do the trick.
As a quick hack you could replace space with plus character before base64-decoding.
I am by no means a C# developer but it looks like you need to url ENCODE your Base64 string before sending it as a url.
Can't you just assume a space is a + and replace it?
Request.QueryString["VLTrap"].Replace(" ", "+");
;)

Categories