Using regex or string manipulation when creating permalinks

Using regex or string manipulation when creating permalinks - c#

I have following method(and looks expensive too) for creating permalinks but it's lacking few stuff that are quite important for nice permalink:
public string createPermalink(string text)
{
text = text.ToLower().TrimStart().TrimEnd();
foreach (char c in text.ToCharArray())
{
if (!char.IsLetterOrDigit(c) && !char.IsWhiteSpace(c))
{
text = text.Replace(c.ToString(), "");
}
if (char.IsWhiteSpace(c))
{
text = text.Replace(c, '-');
}
}
if (text.Length > 200)
{
text = text.Remove(200);
}
return text;
}
Few stuff that it is lacking:
if someone enters text like this:
"My choiches are:foo,bar" would get returned as "my-choices-arefoobar"
and it should be like: "my-choiches-are-foo-bar"
and If someone enters multiple white spaces it would get returned as "---" which is not nice to have in url.
Is there some better way to do this in regex(I really only used it few times)?
UPDATE:
Requirement was:
Any non digit or letter chars at beginning or end are not allowed
Any non digit or letter chars should be replaced by "-"
When replaced with "-" chars should not reapeat like "---"
And finally stripping string at index 200 to ensure it's not too long

Change to
public string createPermalink(string text)
{
text = text.ToLower();
StringBuilder sb = new StringBuilder(text.Length);
// We want to skip the first hyphenable characters and go to the "meat" of the string
bool lastHyphen = true;
// You can enumerate directly a string
foreach (char c in text)
{
if (char.IsLetterOrDigit(c))
{
sb.Append(c);
lastHyphen = false;
}
else if (!lastHyphen)
{
// We use lastHyphen to not put two hyphens consecutively
sb.Append('-');
lastHyphen = true;
}
if (sb.Length == 200)
{
break;
}
}
// Remove the last hyphen
if (sb.Length > 0 && sb[sb.Length - 1] == '-')
{
sb.Length--;
}
return sb.ToString();
}
If you really want to use regexes, you can do something like this (based on the code of Justin)
Regex rgx = new Regex(#"^\W+|\W+$");
Regex rgx2 = new Regex(#"\W+");
return rgx2.Replace(rgx.Replace(text.ToLower(), string.Empty), "-");
The first regex searches for non-word characters (1 or more) at the beginning (^) or at the end of the string ($) and removes them. The second one replaces one or more non-word characters with -.

This should solve the problem that you have explained. Please let me know if it needs any further explanation.
Just as an FYI, the regex makes use of lookarounds to get it done in one run
//This will find any non-character word, lumping them in one group if more than 1
//It will ignore non-character words at the beginning or end of the string
Regex rgx = new Regex(#"(?!\W+$)\W+(?<!^\W+)");
//This will then replace those matches with a -
string result = rgx.Replace(input, "-");
To keep the string from going beyond 200 characters, you will have to use substring. If you do this before the regex, then you will be ok, but if you do it after, then you run the risk of having a trailing dash again, FYI.
example:
myString.Substring(0,200)

I use an iterative approach for this - because in some cases you might want certain characters to be turned into words instead of having them turned into '-' characters - e.g. '&' -> 'and'.
But when you're done you'll also end up with a string that potentially contains multiple '-' - so you have a final regex that collapses all multiple '-' characters into one.
So I would suggest using an ordered list of regexes, and then run them all in order. This code is written to go in a static class that is then exposed as a single extension method for System.String - and is probably best merged into the System namespace.
I've hacked it from code I use, which had extensibility points (e.g. you could pass in a MatchEvaluator on construction of the replacement object for more intelligent replacements; and you could pass in your own IEnumerable of replacements, as the class was public), and therefore it might seem unnecessarily complicated - judging by the other answers I'm guessing everybody will think so (but I have specific requirements for the SEO of the strings that are created).
The list of replacements I use might not be exactly correct for your uses - if not, you can just add more.
private class SEOSymbolReplacement
{
private Regex _rx;
private string _replacementString;
public SEOSymbolReplacement(Regex r, string replacement)
{
//null-checks required.
_rx = r;
_replacementString = replacement;
}
public string Execute(string input)
{
/null-check required
return _rx.Replace(input, _replacementString);
}
}
private static readonly SEOSymbolReplacement[] Replacements = {
new SEOSymbolReplacement(new Regex(#"#", RegexOptions.Compiled), "Sharp"),
new SEOSymbolReplacement(new Regex(#"\+", RegexOptions.Compiled), "Plus"),
new SEOSymbolReplacement(new Regex(#"&", RegexOptions.Compiled), " And "),
new SEOSymbolReplacement(new Regex(#"[|:'\\/,_]", RegexOptions.Compiled), "-"),
new SEOSymbolReplacement(new Regex(#"\s+", RegexOptions.Compiled), "-"),
new SEOSymbolReplacement(new Regex(#"[^\p{L}\d-]",
RegexOptions.IgnoreCase | RegexOptions.Compiled), ""),
new SEOSymbolReplacement(new Regex(#"-{2,}", RegexOptions.Compiled), "-")};
/// <summary>
/// Transforms the string into an SEO-friendly string.
/// </summary>
/// <param name="str"></param>
public static string ToSEOPathString(this string str)
{
if (str == null)
return null;
string toReturn = str;
foreach (var replacement in DefaultReplacements)
{
toReturn = replacement.Execute(toReturn);
}
return toReturn;
}

Related

Remove list of words from string

I have a list of words that I want to remove from a string I use the following method
string stringToClean = "The.Flash.2014.S07E06.720p.WEB-DL.HEVC.x265.RMTeam";
string[] BAD_WORDS = {
"720p", "web-dl", "hevc", "x265", "Rmteam", "."
};
var cleaned = string.Join(" ", stringToClean.Split(' ').Where(w => !BAD_WORDS.Contains(w, StringComparer.OrdinalIgnoreCase)));
but it is not working And the following text is output
The.Flash.2014.S07E06.720p.WEB-DL.HEVC.x265.RMTeam

For this it would be a good idea to create a reusable method that splits a string into words. I'll do this as an extension method of string. If you are not familiar with extension methods, read extension methods demystified
public static IEnumerable<string> ToWords(this string text)
{
// TODO implement
}
Usage will be as follows:
string text = "This is some wild text!"
List<string> words = text.ToWords().ToList();
var first3Words = text.ToWords().Take(3);
var lastWord = text.ToWords().LastOrDefault();
Once you've got this method, the solution to your problem will be easy:
IEnumerable<string> badWords = ...
string inputText = ...
IEnumerable<string> validWords = inputText.ToWords().Except(badWords);
Or maybe you want to use Except(badWords, StringComparer.OrdinalIgnoreCase);
The implementation of ToWords depends on what you would call a word: everything delimited by a dot? or do you want to support whitespaces? or maybe even new-lines?
The implementation for your problem: A word is any sequence of characters delimited by a dot.
public static IEnumerable<string> ToWords(this string text)
{
// find the next dot:
const char dot = '.';
int startIndex = 0;
int dotIndex = text.IndexOf(dot, startIndex);
while (dotIndex != -1)
{
// found a Dot, return the substring until the dot:
int wordLength = dotIndex - startIndex;
yield return text.Substring(startIndex, wordLength;
// find the next dot
startIndex = dotIndex + 1;
dotIndex = text.IndexOf(dot, startIndex);
}
// read until the end of the text. Return everything after the last dot:
yield return text.SubString(startIndex, text.Length);
}
TODO:
Decide what you want to return if text starts with a dot ".ABC.DEF".
Decide what you want to return if the text ends with a dot: "ABC.DEF."
Check if the return value is what you want if text is empty.

Your split/join don't match up with your input.
That said, here's a quick one-liner:
string clean = BAD_WORDS.Aggregate(stringToClean, (acc, word) => acc.Replace(word, string.Empty));
This is basically a "reduce". Not fantastically performant but over strings that are known to be decently small I'd consider it acceptable. If you have to use a really large string or a really large number of "words" you might look at another option but it should work for the example case you've given us.
Edit: The downside of this approach is that you'll get partials. So for example in your token array you have "720p" but the code I suggested here will still match on "720px" but there are still ways around it. For example instead of using string's implementation of Replace you could use a regex that will match your delimiters something like Regex.Replace(acc, $"[. ]{word}([. ])", "$1") (regex not confirmed but should be close and I added a capture for the delimiter in order to put it back for the next pass)

Check if an uppercase Letter is inside a string

I am currently learning C# and RegEx. I am working on a small wordcrawler. I get a big list of many words where I escape those which don't fit to my RegEx.
Here's my code:
var WordRegex = new Regex("^[a-zA-Z]{4,}$", RegexOptions.Compiled | RegexOptions.CultureInvariant);
var secondRegex = new Regex("([A-Z]{1})");
var words = new List<string>();
var finalList = new List<string>();
foreach (var word in words)
{
if (WordRegex.IsMatch(word) && secondRegex.Matches(word).Count == 1 || secondRegex.Matches(word).Count == 0)
{
finalList.Add(word);
}
}
So this works fine if the word is 'McLaren' (two uppercase letters) it won't add it to finalList. But if the words is something like 'stackOverflow' (one uppercase letter but not at start of string), it does take it to finallist. Is there any simple way to prevent this problem?
PS: if there is any better way than RegEx let me know!
Here are some examples:
("McLaren");//false
("Nissan");//true
("BMW");//false
("Subaru");//true
("Maserati");//true
("Mercedes Benz");//false
("Volkswagen");//true
("audi");//true
("Alfa Romeo");//false
("rollsRoyce");//false
("drive");//true
These with true should be accepted and the other shouldn't be accepted.
What I want to reach is that the regex shouldnt add when its written like this'rollsRoyce' but if it's written like 'Rollsroyce' or 'RollsRoyce' it should be accepted. So I have to check if there are uppercase letters inside the string.

If you want to check if the string contains an upper letter - this would be my approach
string sValue = "stackOverflow";
bool result = !sValue.Any(x => char.IsUpper(x));
Update to the updated question
string sValue = "stackOverflow";
bool result = sValue.Where(char.IsUpper).Skip(1).Any();
this ignores the 1st char and determines if the rest of the string contains at least one upper letter

There is a very easy solution without regex or linq:
bool hasUppercase = !str.equals(str.toLowerCase());
Now you can easily check:
if(!hasUppercase) {
// no uppercase letter
}
else {
// there is an uppercase letter
}
Just checking if the string is equal to its lowercased self.

Match and split string with regex

I want to validate an input string against a regular expression and then split it.
The input string can be any combination of the letter A and letter A followed by an exclamation mark. For example these are valid input strings: A, A!, AA, AA!, A!A, A!A!, AAA, AAA!, AA!A, A!AA, ... Any other characters should yield an invalid match.
My code would probably look something like this:
public string[] SplitString(string s)
{
Regex regex = new Regex(#"...");
if (!regex.IsMatch(s))
{
throw new ArgumentException("Wrong input string!");
}
return regex.Split(s);
}
How should my regex look like?
Edit - some examples:
input string "AAA", function should return an array of 3 strings ("A", "A", "A")
input string "A!AAA!", function should return an array of 4 strings ("A!", "A", "A", "A!")
input string "AA!b", function should throw an ArgumentException

Doesn't seem like a Regex is a good plan here. Have a look at this:
private bool ValidString(string myString)
{
char[] validChars = new char[] { 'A', '!' };
if (!myString.StartsWith("A"))
return false;
if (myString.Contains("!!"))
return false;
foreach (char c in myString)
{
if (!validChars.Contains(c))
return false;
}
return true;
}
private List<string> SplitMyString(string myString)
{
List<string> resultList = new List<string>();
if (ValidString(myString))
{
string resultString = "";
foreach (char c in myString)
{
if (c == 'A')
resultString += c;
if (c == '!')
{
resultString += c;
resultList.Add(string.Copy(resultString));
resultString = "";
}
}
}
return resultList;
}
The reason for Regex not being a good plan is that you can write the logic out in a few simple if-statements that compile and function a lot faster and cheaper. Also Regex isn't so good at repeating patterns for an unlimited length string. You'll either end up writing a long Regex or something illegible.
EDIT
At the end of my code you will either have a List<string> with the split input string like in your question. Or an empty List<string>. You can adjust it a little to throw an ArgumentException if that requirement is very important to you. Alternatively you can do a Count on the list to see if it was successful.

Regex regex = new Regex(#"^(A!|A)+$");
Edit:
Use something like http://gskinner.com/RegExr/ to play with Regular Expressions
Edit after comment:
Ok, you have made it a bit more clear what you want. Don't approach it like that. Because in what you are doing, you cannot expect to match the entire input and then split as it would be the entire input. Either use separate regular expression for the split part, or use groups to get the matched values.
Example:
//Initial match part
Regex regex2 = new Regex(#"(A!)|(A)");
return regex2.Split(s);
And again, regular expressions are not always the answer. See how this might impact your application.

You could try something like:
Regex regex = new Regex(#"^[A!]+$");

((A+!?)+)
Try looking at Espresso http://www.ultrapico.com/Expresso.htm or Rad Software Regular Expression Designer http://www.radsoftware.com.au/regexdesigner/ for designing and testing RE's.

I think I have a solution that satisfies all examples. I've had to break it into two regular expressions (which I don't like)...
public string[] SplitString(string s)
{
Regex regex = new Regex(#"^[A!]+$");
if (!regex.IsMatch(s))
{
throw new ArgumentException("Wrong input string!");
}
return Regex.Split(s, #"(A!?)").Where(x => !string.IsNullOrEmpty(x)).ToArray();
}
Note the use of linq - required to remove the empty matches.

Need regex (using C#) to condense all whitepace into single whitespaces

I need to replace multiple whitespaces into a single whitespace (per iteration) in a document. Doesn't matter whether they are spaces, tabs or newlines, any combination of any kind of whitespace needs to be truncated to a single whitespace.
Let's say we have the string: "Hello,\t \t\n  \t    \n world", (where \t and \n represent tabs and newlines respectively) then I'd need it to become "Hello, world".
I'm so completely bewildered by regex more generally that I ended up just asking.
Considerations:
I have no control over the document, since it could be any document on the internet.
I'm using C#, so if anyone knows how to do this in C# specifically, that would be even more awesome.
I don't really have to use regex (before someone asks), but I figured it's probably the optimal way, since regex is designed for this sort of stuff, and my own strpos/str_replace/substr soup would probably not perform as well. Performance is important on this one so what I'm essentially looking for is an efficient way to do this to any random text file on the internet (remember, I can't predict the size!).
Thanks in advance!

newString = Regex.Replace(oldString, #"\s+", " ");
The "\s" is a regex character class for any whitespace character, and the + means "one or more". It replaces each occurence with a simple space character.

You may find this SO answer useful:
How do I replace multiple spaces with a single space in C#?
Adapting the answer to also replace tabs and newlines as well is relatively straight forward:
RegexOptions options = RegexOptions.None;
Regex regex = new Regex(#"\s+", options);
tempo = regex.Replace(tempo, #" ");

As someone who sympathizes with Jamie Zawinski's position on Regex, I'll offer an alternative for what it's worth.
Not wanting to be religious about it, but I'd say it's faster than Regex, though whether you'll ever be processing strings long enough to see the difference is another matter.
public static string CompressWhiteSpace(string value)
{
if (value == null) return null;
bool inWhiteSpace = false;
StringBuilder builder = new StringBuilder(value.Length);
foreach (char c in value)
{
if (Char.IsWhiteSpace(c))
{
inWhiteSpace = true;
}
else
{
if (inWhiteSpace) builder.Append(' ');
inWhiteSpace = false;
builder.Append(c);
}
}
return builder.ToString();
}

I would suggest you replace your chomp with
$line =~ s/\s+$//;
which will strip off all trailing white spaces - tabs, spaces, new lines and returns as well.
Taken from: http://www.wellho.net/forum/Perl-Programming/New-line-characters-beware.html
I'm aware its Perl, but it should be helpful enough for you.

Actually I think an extension method would probably be more efficient as you don't have the state machine overhead of the regex. Essentially, it becomes a very specialized pattern matcher.
public static string Collapse( this string source )
{
if (string.IsNullOrEmpty( source ))
{
return source;
}
StringBuilder builder = new StringBuilder();
bool inWhiteSpace = false;
bool sawFirst = false;
foreach (var c in source)
{
if (char.IsWhiteSpace(c))
{
inWhiteSpace = true;
}
else
{
// only output a whitespace if followed by non-whitespace
// except at the beginning of the string
if (inWhiteSpace && sawFirst)
{
builder.Append(" ");
}
inWhiteSpace = false;
sawFirst = true;
builder.Append(c);
}
}
return builder.ToString();
}

Best way to parse Space Separated Text

I have string like this
/c SomeText\MoreText "Some Text\More Text\Lol" SomeText
I want to tokenize it, however I can't just split on the spaces. I've come up with somewhat ugly parser that works, but I'm wondering if anyone has a more elegant design.
This is in C# btw.
EDIT: My ugly version, while ugly, is O(N) and may actually be faster than using a RegEx.
private string[] tokenize(string input)
{
string[] tokens = input.Split(' ');
List<String> output = new List<String>();
for (int i = 0; i < tokens.Length; i++)
{
if (tokens[i].StartsWith("\""))
{
string temp = tokens[i];
int k = 0;
for (k = i + 1; k < tokens.Length; k++)
{
if (tokens[k].EndsWith("\""))
{
temp += " " + tokens[k];
break;
}
else
{
temp += " " + tokens[k];
}
}
output.Add(temp);
i = k + 1;
}
else
{
output.Add(tokens[i]);
}
}
return output.ToArray();
}

The computer term for what you're doing is lexical analysis; read that for a good summary of this common task.
Based on your example, I'm guessing that you want whitespace to separate your words, but stuff in quotation marks should be treated as a "word" without the quotes.
The simplest way to do this is to define a word as a regular expression:
([^"^\s]+)\s*|"([^"]+)"\s*
This expression states that a "word" is either (1) non-quote, non-whitespace text surrounded by whitespace, or (2) non-quote text surrounded by quotes (followed by some whitespace). Note the use of capturing parentheses to highlight the desired text.
Armed with that regex, your algorithm is simple: search your text for the next "word" as defined by the capturing parentheses, and return it. Repeat that until you run out of "words".
Here's the simplest bit of working code I could come up with, in VB.NET. Note that we have to check both groups for data since there are two sets of capturing parentheses.
Dim token As String
Dim r As Regex = New Regex("([^""^\s]+)\s*|""([^""]+)""\s*")
Dim m As Match = r.Match("this is a ""test string""")
While m.Success
token = m.Groups(1).ToString
If token.length = 0 And m.Groups.Count > 1 Then
token = m.Groups(2).ToString
End If
m = m.NextMatch
End While
Note 1: Will's answer, above, is the same idea as this one. Hopefully this answer explains the details behind the scene a little better :)

The Microsoft.VisualBasic.FileIO namespace (in Microsoft.VisualBasic.dll) has a TextFieldParser you can use to split on space delimeted text. It handles strings within quotes (i.e., "this is one token" thisistokentwo) well.
Note, just because the DLL says VisualBasic doesn't mean you can only use it in a VB project. Its part of the entire Framework.

There is the state machine approach.
private enum State
{
None = 0,
InTokin,
InQuote
}
private static IEnumerable<string> Tokinize(string input)
{
input += ' '; // ensure we end on whitespace
State state = State.None;
State? next = null; // setting the next state implies that we have found a tokin
StringBuilder sb = new StringBuilder();
foreach (char c in input)
{
switch (state)
{
default:
case State.None:
if (char.IsWhiteSpace(c))
continue;
else if (c == '"')
{
state = State.InQuote;
continue;
}
else
state = State.InTokin;
break;
case State.InTokin:
if (char.IsWhiteSpace(c))
next = State.None;
else if (c == '"')
next = State.InQuote;
break;
case State.InQuote:
if (c == '"')
next = State.None;
break;
}
if (next.HasValue)
{
yield return sb.ToString();
sb = new StringBuilder();
state = next.Value;
next = null;
}
else
sb.Append(c);
}
}
It can easily be extended for things like nested quotes and escaping. Returning as IEnumerable<string> allows your code to only parse as much as you need. There aren't any real downsides to that kind of lazy approach as strings are immutable so you know that input isn't going to change before you have parsed the whole thing.
See: http://en.wikipedia.org/wiki/Automata-Based_Programming

You also might want to look into regular expressions. That might help you out. Here is a sample ripped off from MSDN...
using System;
using System.Text.RegularExpressions;
public class Test
{
public static void Main ()
{
// Define a regular expression for repeated words.
Regex rx = new Regex(#"\b(?<word>\w+)\s+(\k<word>)\b",
RegexOptions.Compiled | RegexOptions.IgnoreCase);
// Define a test string.
string text = "The the quick brown fox fox jumped over the lazy dog dog.";
// Find matches.
MatchCollection matches = rx.Matches(text);
// Report the number of matches found.
Console.WriteLine("{0} matches found in:\n {1}",
matches.Count,
text);
// Report on each match.
foreach (Match match in matches)
{
GroupCollection groups = match.Groups;
Console.WriteLine("'{0}' repeated at positions {1} and {2}",
groups["word"].Value,
groups[0].Index,
groups[1].Index);
}
}
}
// The example produces the following output to the console:
// 3 matches found in:
// The the quick brown fox fox jumped over the lazy dog dog.
// 'The' repeated at positions 0 and 4
// 'fox' repeated at positions 20 and 25
// 'dog' repeated at positions 50 and 54

Craig is right — use regular expressions. Regex.Split may be more concise for your needs.

[^\t]+\t|"[^"]+"\t
using the Regex definitely looks like the best bet, however this one just returns the whole string. I'm trying to tweak it, but not much luck so far.
string[] tokens = System.Text.RegularExpressions.Regex.Split(this.BuildArgs, #"[^\t]+\t|""[^""]+""\t");

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Using regex or string manipulation when creating permalinks - c#

Related

Remove list of words from string

Check if an uppercase Letter is inside a string

Match and split string with regex

Need regex (using C#) to condense all whitepace into single whitespaces

Best way to parse Space Separated Text

Categories

Resources