Sanitizing string to url safe format

I am trying to sanitize a string so that it can be placed in a URL. It is only for display in the URL. I was using this PHP code, which worked fine:
$CleanString = IconV('UTF-8', 'ASCII//TRANSLIT//IGNORE', $String);
$CleanString = Preg_Replace("/[^a-zA-Z0-9\/_|+ -]/", '', $CleanString);
$CleanString = StrToLower( Trim($CleanString, '-') );
$CleanString = Preg_Replace("/[\/_|+ -]+/", $Delimiter, $CleanString);
Now I am trying to port this to C#. The regexes are no problem, but the first line is a bit tricky. What is a safe way to replace characters such as é, á, ó with their plain equivalents e, a, o?
For example, the code above would transform:
The cát ís running & getting away
to
the-cat-is-running-getting-away

The CharUnicodeInfo.GetUnicodeCategory(c) method can tell you if a character is a "Non spacing mark". This can only be used when the string is in a form where accents ("diacritics") are separated from their letter, which can be obtained with Normalize(NormalizationForm.FormD).
Here is the full string extension method:
using System.Text;
using System.Globalization;
...
public static string RemoveDiacritics(this string strThis)
{
    if (strThis == null)
        return null;
    var sb = new StringBuilder();
    foreach (char c in strThis.Normalize(NormalizationForm.FormD))
    {
        if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            sb.Append(c);
    }
    return sb.ToString();
}
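For completeness, here is a minimal sketch of how the whole PHP pipeline from the question might be ported, with the diacritic removal as the first step. The class name SlugSketch and the method name ToSlug are hypothetical, and the regular expressions are copied from the PHP version rather than tuned for C#. Unlike iconv with //TRANSLIT, this only strips combining marks, so letters such as ß or ø that have no decomposition are simply dropped by the second step.

```csharp
using System;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper: a rough port of the PHP snippet in the question.
public static class SlugSketch
{
    public static string ToSlug(string input, string delimiter = "-")
    {
        if (string.IsNullOrEmpty(input))
            return string.Empty;

        // Step 1: decompose accented letters and drop the combining marks.
        var sb = new StringBuilder();
        foreach (char c in input.Normalize(NormalizationForm.FormD))
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }
        string clean = sb.ToString();

        // Step 2: remove everything outside the allowed character set.
        clean = Regex.Replace(clean, @"[^a-zA-Z0-9/_|+ -]", "");

        // Step 3: trim hyphens at both ends and lower-case the result.
        clean = clean.Trim('-').ToLowerInvariant();

        // Step 4: collapse each run of separator characters into one delimiter.
        return Regex.Replace(clean, @"[/_|+ -]+", delimiter);
    }
}
```

With the example from the question, SlugSketch.ToSlug("The cát ís running & getting away") returns the-cat-is-running-getting-away.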

Related

getting CS0029 error when using StringBuilder

I'm trying to refresh my knowledge of C# and came across this problem:
Have the function StringChallenge(str) take the str parameter being passed and return a compressed version of the string using the Run-length encoding algorithm. This algorithm works by taking the occurrence of each repeating character and outputting that number along with a single character of the repeating sequence. For example: "wwwggopp" would return 3w2g1o2p. The string will not contain any numbers, punctuation, or symbols.
and my code is
using System;
using System.Text;
class MainClass {
public static string StringChallenge(string str) {
// code goes here
var newString = new StringBuilder();
var result = new StringBuilder();
foreach (var c in str){
if (newString.Length == 0 || newString[newString.Length - 1] == c){
newString.Append(c);
}
else{
result.Append($"{newString.Length}{newString[0]}");
newString.Clear();
newString.Append(c);
}
}
if (newString.Length > 0){
result.Append($"{newString.Length}{newString[0]}");
}
return result;
}
static void Main() {
// keep this function call here
Console.WriteLine(StringChallenge(Console.ReadLine()));
}
}
please help. thank you!
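CS0029 is the "cannot implicitly convert type" compiler error. In the code above it is most likely triggered by `return result;`, because result is a StringBuilder while the method is declared to return string. A minimal sketch of the fix, assuming nothing else needs to change, is to call ToString() on the builder. The class name RunLengthSketch and the method name Encode are hypothetical stand-ins for MainClass.StringChallenge.

```csharp
using System;
using System.Text;

// Hypothetical stand-in for the question's StringChallenge method.
public static class RunLengthSketch
{
    public static string Encode(string str)
    {
        var run = new StringBuilder();     // the current run of identical characters
        var result = new StringBuilder();  // the encoded output so far

        foreach (var c in str)
        {
            if (run.Length == 0 || run[run.Length - 1] == c)
            {
                run.Append(c);
            }
            else
            {
                // The run ended: emit its length and character, then start a new run.
                result.Append($"{run.Length}{run[0]}");
                run.Clear();
                run.Append(c);
            }
        }

        if (run.Length > 0)
        {
            result.Append($"{run.Length}{run[0]}");
        }

        // StringBuilder does not convert to string implicitly, so convert explicitly.
        return result.ToString();
    }
}
```

RunLengthSketch.Encode("wwwggopp") returns 3w2g1o2p, matching the example in the problem statement.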

C# Split a string and build a stringarray out of the string [duplicate]

I need to split a string on newlines in .NET, and the only way I know of to split strings is with the Split method. However, that will not allow me to (easily) split on a newline, so what is the best way to do it?

To split on a string you need to use the overload that takes an array of strings:
string[] lines = theText.Split(
new string[] { Environment.NewLine },
StringSplitOptions.None
);
Edit:
If you want to handle different types of line breaks in a text, you can use the ability to match more than one string. This will correctly split on either type of line break, and preserve empty lines and spacing in the text:
string[] lines = theText.Split(
new string[] { "\r\n", "\r", "\n" },
StringSplitOptions.None
);

What about using a StringReader?
using (System.IO.StringReader reader = new System.IO.StringReader(input)) {
string line = reader.ReadLine();
}

Try to avoid using string.Split for a general solution, because you'll use more memory everywhere you use the function -- the original string, and the split copy, both in memory. Trust me that this can be one hell of a problem when you start to scale -- run a 32-bit batch-processing app processing 100MB documents, and you'll crap out at eight concurrent threads. Not that I've been there before...
Instead, use an iterator like this:
public static IEnumerable<string> SplitToLines(this string input)
{
    if (input == null)
    {
        yield break;
    }
    using (System.IO.StringReader reader = new System.IO.StringReader(input))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            yield return line;
        }
    }
}
This will allow you to do a more memory-efficient loop around your data:
foreach (var line in document.SplitToLines())
{
    // one line at a time...
}
Of course, if you want it all in memory, you can do this:
var allTheLines = document.SplitToLines().ToArray();

You should be able to split your string pretty easily, like so:
aString.Split(Environment.NewLine.ToCharArray());

Based on Guffa's answer, in an extension class, use:
public static string[] Lines(this string source) {
return source.Split(new string[] { "\r\n", "\n" }, StringSplitOptions.None);
}

Regex is also an option:
private string[] SplitStringByLineFeed(string inpString)
{
string[] locResult = Regex.Split(inpString, "[\r\n]+");
return locResult;
}

For a string variable s:
s.Split(new string[]{Environment.NewLine},StringSplitOptions.None)
This uses your environment's definition of line endings. On Windows, line endings are CR-LF (carriage return, line feed) or in C#'s escape characters \r\n.
This is a reliable solution, because if you recombine the lines with String.Join, this equals your original string:
var lines = s.Split(new string[]{Environment.NewLine},StringSplitOptions.None);
var reconstituted = String.Join(Environment.NewLine,lines);
Debug.Assert(s==reconstituted);
What not to do:
Use StringSplitOptions.RemoveEmptyEntries, because this will break markup such as Markdown where empty lines have syntactic purpose.
Split on the separator Environment.NewLine.ToCharArray(), because on Windows this will create one empty string element for each new line.
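The difference between splitting on the newline string and splitting on its individual characters can be shown with a small sketch. It hard-codes "\r\n" instead of Environment.NewLine so that it behaves the same on every platform. The class and method names are hypothetical.

```csharp
using System;

// Hypothetical demo of the two separator styles discussed above.
public static class SplitPitfallSketch
{
    // Treats the two-character sequence "\r\n" as a single separator.
    public static string[] SplitOnString(string s)
    {
        return s.Split(new[] { "\r\n" }, StringSplitOptions.None);
    }

    // Treats '\r' and '\n' as two independent separators, so every
    // Windows-style line break yields an extra empty element.
    public static string[] SplitOnChars(string s)
    {
        return s.Split("\r\n".ToCharArray());
    }
}
```

For the input "a\r\nb", SplitOnString returns two elements (a and b), while SplitOnChars returns three, with an empty string in the middle. Joining the result of SplitOnString with "\r\n" gives back the original input.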

I just thought I would add my two-bits, because the other solutions on this question do not fall into the reusable code classification and are not convenient.
The following block of code extends the string object so that it is available as a natural method when working with strings.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Collections;
using System.Collections.ObjectModel;
namespace System
{
public static class StringExtensions
{
public static string[] Split(this string s, string delimiter, StringSplitOptions options = StringSplitOptions.None)
{
return s.Split(new string[] { delimiter }, options);
}
}
}
You can now use the .Split() function from any string as follows:
string[] result;
// Call it on a string literal, passing the delimiter
result = "My simple string".Split(" ");
// Split an existing string by delimiter only
string foo = "my - string - i - want - split";
result = foo.Split("-");
// You can even pass the split options parameter. When omitted it is
// set to StringSplitOptions.None
result = foo.Split("-", StringSplitOptions.RemoveEmptyEntries);
To split on a newline character, simply pass "\n" or "\r\n" as the delimiter parameter.
Comment: It would be nice if Microsoft implemented this overload. (.NET Core 2.0 and later do ship a built-in Split(string, StringSplitOptions) overload.)

Starting with .NET 6 we can use the new String.ReplaceLineEndings() method to canonicalize cross-platform line endings, so these days I find this to be the simplest way:
var lines = input
.ReplaceLineEndings()
.Split(Environment.NewLine, StringSplitOptions.None);

I'm currently using this function (based on other answers) in VB.NET:
Private Shared Function SplitLines(text As String) As String()
Return text.Split({Environment.NewLine, vbCrLf, vbLf}, StringSplitOptions.None)
End Function
It tries to split on the platform-local newline first, and then falls back to each possible newline.
I've only needed this inside one class so far. If that changes, I will probably make this Public and move it to a utility class, and maybe even make it an extension method.
Here's how to join the lines back up, for good measure:
Private Shared Function JoinLines(lines As IEnumerable(Of String)) As String
Return String.Join(Environment.NewLine, lines)
End Function

Well, actually split should do:
//Constructing string...
StringBuilder sb = new StringBuilder();
sb.AppendLine("first line");
sb.AppendLine("second line");
sb.AppendLine("third line");
string s = sb.ToString();
Console.WriteLine(s);
//Splitting multiline string into separate lines
string[] splitted = s.Split(new string[] {System.Environment.NewLine}, StringSplitOptions.RemoveEmptyEntries);
// Output (separate lines)
for (int i = 0; i < splitted.Length; i++)
{
Console.WriteLine("{0}: {1}", i, splitted[i]);
}

string[] lines = text.Split(
Environment.NewLine.ToCharArray(),
StringSplitOptions.RemoveEmptyEntries);
The RemoveEmptyEntries option will make sure you don't have empty entries due to \n following a \r
(Edit to reflect comments:) Note that it will also discard genuine empty lines in the text. This is usually what I want but it might not be your requirement.

I did not know about Environment.NewLine, but I guess this is a very good solution.
My try would have been:
string str = "Test Me\r\nTest Me\nTest Me";
var splitted = str.Split('\n').Select(s => s.Trim()).ToArray();
The additional Trim removes any \r or \n that might still be present (e.g. when running on Windows but splitting a string with OS X newline characters). Probably not the fastest method, though.
EDIT:
As the comments correctly pointed out, this also removes any whitespace at the start of the line or before the new line feed. If you need to preserve that whitespace, use one of the other options.

The examples here are great and helped me with a current "challenge": splitting RSA keys so they can be presented in a more readable way. Based on Steve Cooper's solution:
string Splitstring(string txt, int n = 120, string AddBefore = "", string AddAfterExtra = "")
{
//Split the text into a list of strings of length n
var Lines = Enumerable.Range(0, txt.Length / n).Select(i => txt.Substring(i * n, n)).ToList();
//Check if there are any characters left after split, if so add the rest
if(txt.Length > ((txt.Length / n)*n) )
Lines.Add(txt.Substring((txt.Length/n)*n));
//Create return text, with extras
string txtReturn = "";
foreach (string Line in Lines)
txtReturn += AddBefore + Line + AddAfterExtra + Environment.NewLine;
return txtReturn;
}
Presenting an RSA key with a width of 33 characters and wrapped in quotes is then simply:
Console.WriteLine(Splitstring(RSAPubKey, 33, "\"", "\""));
Hopefully someone finds it useful...

Silly answer: write to a temporary file so you can use the venerable
File.ReadLines
var s = "Hello\r\nWorld";
var path = Path.GetTempFileName();
using (var writer = new StreamWriter(path))
{
writer.Write(s);
}
var lines = File.ReadLines(path);

using System.IO;
string textToSplit;
if (textToSplit != null)
{
List<string> lines = new List<string>();
using (StringReader reader = new StringReader(textToSplit))
{
for (string line = reader.ReadLine(); line != null; line = reader.ReadLine())
{
lines.Add(line);
}
}
}

Very easy, actually.
VB.NET:
Private Function SplitOnNewLine(input As String) As String()
    Return input.Split({Environment.NewLine}, StringSplitOptions.None)
End Function
C#:
string[] SplitOnNewLine(string input)
{
    return input.Split(new[] { Environment.NewLine }, StringSplitOptions.None);
}

Decode HTML string in c# [duplicate]

How do I decode the string 'Sch\u00f6nen' (@"Sch\u00f6nen") in C#? I've tried HttpUtility, but it doesn't give me the result I need, which is "Schönen".

Regex.Unescape did the trick:
System.Text.RegularExpressions.Regex.Unescape(@"Sch\u00f6nen");
Note that you need to be careful when testing your variants or writing unit tests: "Sch\u00f6nen" is already "Schönen". You need @ in front of the string to treat \u00f6 as literal characters in the string.
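The distinction between the verbatim and the regular literal can be checked with a short sketch. The class name UnescapeSketch and the method name Decode are hypothetical. Keep in mind that Regex.Unescape processes every regex-style escape in the input, not only \uXXXX sequences, so it may not suit input that contains other backslashes.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper around Regex.Unescape for \uXXXX sequences.
public static class UnescapeSketch
{
    public static string Decode(string raw)
    {
        // Converts escape sequences such as \u00f6 into the characters they denote.
        return Regex.Unescape(raw);
    }
}
```

Here @"Sch\u00f6nen" is 12 characters long (backslash, u and four hex digits are separate characters), while "Sch\u00f6nen" is 7 characters long because the compiler has already replaced the escape with ö. UnescapeSketch.Decode(@"Sch\u00f6nen") returns the 7-character string.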

If you landed on this question because you see "Sch\u00f6nen" (or similar \uXXXX values in a string constant): it is not an encoding. It is a way to represent Unicode characters as an escape sequence, similar to how a string literal represents a new line with \n and a carriage return with \r.
I don't think you have to decode.
string unicodestring = "Sch\u00f6nen";
Console.WriteLine(unicodestring);
This outputs Schönen.

I wrote some code that converts Unicode escape sequences to actual characters. (But the best answer in this topic works fine and is less complex.)
string stringWithUnicodeSymbols = @"{""id"": 10440119, ""photo"": 10945418, ""first_name"": ""\u0415\u0432\u0433\u0435\u043d\u0438\u0439""}";
var splitted = Regex.Split(stringWithUnicodeSymbols, @"\\u([a-fA-F\d]{4})");
string outString = "";
foreach (var s in splitted)
{
try
{
if (s.Length == 4)
{
var decoded = ((char) Convert.ToUInt16(s, 16)).ToString();
outString += decoded;
}
else
{
outString += s;
}
}
catch (Exception e)
{
outString += s;
}
}

unable to find a substring in html after decode/normalize

I have a snippet of html held as a string "s", it's user generated and may come from multiple sources, so I can't control the encoding of characters etc.
I have a simple string "comparison", and I need to check if comparison exists as a substring of "s". "comparison" does not have any html tags or encoding.
I am decoding, normalizing, and using a regex to strip out html tags, but am still unable to find the substring even when I know it is there...
string s = "<p>this is my string.</p><p>my string is html with tags and <a href=\"someurl\">links</a> and encoding.</p><p>i want to&nbsp;find&nbsp;a&nbsp;substring but my comparison might not have tags &amp; encoding.";
string comparison = "i want to find a substring";
string decode = HttpUtility.HtmlDecode(s);
string tagsreplaced = Regex.Replace(decode, "<.*?>", " ");
string normalized = tagsreplaced.Normalize();
Literal1.Text = normalized;
if (normalized.IndexOf(comparison) != -1)
{
Label1.Text = "substring found";
}
else
{
Label1.Text = "substring not found";
}
This is returning "substring not found". I can see by clicking view source that the string sent to the Literal absolutely includes the comparison string exactly as provided, so why isn't it being found?
Is there another way to achieve this?

The answer is that the HTML entity decoding still decodes your &nbsp; to the character U+00A0 (UTF-8 bytes 0xC2 0xA0), which is not a normal space character ' ' (0x20). Verify this with the following program:
using System;
using System.Text;
using System.Text.RegularExpressions;
using System.Web;
namespace TestStuff
{
class Program
{
static void Main(string[] args)
{
string s = "<p>this is my string.</p><p>my string is html with tags and <a href=\"someurl\">links</a> and encoding.</p><p>i want to&nbsp;find&nbsp;a&nbsp;substring but my comparison might not have tags &amp; encoding.";
s = "i want to&nbsp;find&nbsp;a&nbsp;substring";
string comparison = "i want to find a substring";
string decode = HttpUtility.HtmlDecode(s);
string tagsreplaced = Regex.Replace(decode, "<.*?>", " ");
string normalized = tagsreplaced.Normalize();
Console.WriteLine("Dumping first string");
Console.WriteLine(normalized);
Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(normalized)));
Console.WriteLine("Dumping second string");
Console.WriteLine(comparison);
Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(comparison)));
if (normalized.IndexOf(comparison) != -1)
Console.WriteLine("substring found");
else
Console.WriteLine("substring not found");
Console.ReadLine();
return;
}
}
}
It dumps the UTF8 encodings of the two strings for you. You'll see as output:
Dumping first string
i want to find a substring
69-20-77-61-6E-74-20-74-6F-C2-A0-66-69-6E-64-C2-A0-61-C2-A0-73-75-62-73-74-72-69-6E-67
Dumping second string
i want to find a substring
69-20-77-61-6E-74-20-74-6F-20-66-69-6E-64-20-61-20-73-75-62-73-74-72-69-6E-67
substring not found
You can see that the byte arrays do not match, so the strings aren't equal, and .IndexOf() is right to tell you that nothing was found.
So the problem lies within the HTML itself: it contains a non-breaking space character which you don't convert to a normal space. You can hack around it by substituting " " for "&nbsp;" in the string using String.Replace().
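One way to apply that workaround is sketched below. It replaces the decoded non-breaking space (U+00A0) after decoding, which has the same effect as replacing the entity before decoding. It uses System.Net.WebUtility.HtmlDecode instead of HttpUtility so that it does not depend on System.Web. The class name HtmlTextSketch and the method name ToPlainText are hypothetical.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper: decode, strip tags, then normalise non-breaking spaces.
public static class HtmlTextSketch
{
    public static string ToPlainText(string html)
    {
        // Decode entities such as &nbsp; and &amp; into characters.
        string decoded = WebUtility.HtmlDecode(html);

        // Replace each tag with a space, as in the question's code.
        string noTags = Regex.Replace(decoded, "<.*?>", " ");

        // Turn non-breaking spaces (U+00A0) into ordinary spaces (U+0020).
        return noTags.Replace('\u00A0', ' ');
    }
}
```

After this step, ToPlainText("<p>i want to&nbsp;find&nbsp;a&nbsp;substring</p>").IndexOf("i want to find a substring") is no longer -1.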

Is there a ReadWord() method in the .NET Framework?

I'd hate to reinvent something that was already written, so I'm wondering if there is a ReadWord() function somewhere in the .NET Framework that extracts words from text delimited by white space and line breaks.
If not, do you have an implementation that you'd like to share?
string data = "Four score and seven years ago";
List<string> words = new List<string>();
WordReader reader = new WordReader(data);
while (true)
{
string word =reader.ReadWord();
if (string.IsNullOrEmpty(word)) return;
//additional parsing logic goes here
words.Add(word);
}

Not that I'm aware of directly. If you don't mind getting them all in one go, you could use a regular expression:
Regex wordSplitter = new Regex(@"\W+");
string[] words = wordSplitter.Split(data);
If you have leading/trailing whitespace you'll get an empty string at the beginning or end, but you could always call Trim first.
A different option is to write a method which reads a word based on a TextReader. It could even be an extension method if you're using .NET 3.5. Sample implementation:
using System;
using System.IO;
using System.Text;
public static class Extensions
{
public static string ReadWord(this TextReader reader)
{
StringBuilder builder = new StringBuilder();
int c;
// Ignore any trailing whitespace from previous reads
while ((c = reader.Read()) != -1)
{
if (!char.IsWhiteSpace((char) c))
{
break;
}
}
// Finished?
if (c == -1)
{
return null;
}
builder.Append((char) c);
while ((c = reader.Read()) != -1)
{
if (char.IsWhiteSpace((char) c))
{
break;
}
builder.Append((char) c);
}
return builder.ToString();
}
}
public class Test
{
static void Main()
{
// Give it a few challenges :)
string data = @"Four score and
seven years ago ";
using (TextReader reader = new StringReader(data))
{
string word;
while ((word = reader.ReadWord()) != null)
{
Console.WriteLine("'{0}'", word);
}
}
}
}
Output:
'Four'
'score'
'and'
'seven'
'years'
'ago'

Not as such, however you could use String.Split to split the string into an array of string based on a delimiting character or string. You can also specify multiple strings / characters for the split.
If you'd prefer to do it without loading everything into memory then you could write your own stream class that does it as it reads from a stream but the above is a quick fix for small amounts of data word splitting.
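A minimal sketch of the String.Split approach: passing a null separator array makes Split treat white-space characters as delimiters, and RemoveEmptyEntries drops the empty strings produced by consecutive or trailing white space. The class name WordSplitSketch and the method name Words are hypothetical.

```csharp
using System;

// Hypothetical helper: split a string into words on any white space.
public static class WordSplitSketch
{
    public static string[] Words(string data)
    {
        // A null separator array means "split on white-space characters".
        return data.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

For "Four score and\r\n seven years ago " this returns the six words Four, score, and, seven, years, ago.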
