Cutting text to specific length preserving the words - c#

I have the following text:
Test some text. Now here is some new realylonglonglong text
And I need to cut it to 50 characters but without cutting the words. So, the desire result is:
Test some text. Now here is some new ...
I am looking only for solution using regular expression replace. The following regular expression:
^.{0,50}(?= |$)
matches:
Test some text. Now here is some new
but I failed transforming it for use in replace function.
In my real case I have SQL CLR function called [dbo].[RegexReplace] and I am calling it like this:
SELECT [dbo].[RegexReplace](#TEST, '^.{0,50}(?= |$)', '...')
Its C# definition is:
public static string Replace(SqlString sqlInput, SqlString sqlPattern, SqlString sqlReplacement)
{
string input = (sqlInput.IsNull) ? string.Empty : sqlInput.Value;
string pattern = (sqlPattern.IsNull) ? string.Empty : sqlPattern.Value;
string replacement = (sqlReplacement.IsNull) ? string.Empty : sqlReplacement.Value;
return Regex.Replace(input, pattern, replacement);
}
That's why I want to to this with regular expression replace function.

This is the regex you want:
string result = Regex.Replace("Test some text. Now here is some new realylonglonglong text", "(?=.{50,})(^.{0,50}) .*", "$1...");
so look for ^(?=.{50,})(.{0,50}) .* and replace it with $1...
Explanation... You are looking for texts that are AT LEAST 50 characters long, because shorter texts don't need shortening, so (?=.{50,}) (but note that this won't capture anything). Then you look for the first 0...50 characters (.{0,50}) followed by a space , followed by anything else .*. You'll replace all of this with the first 0...50 characters ($1) followed by ...
I need the (?=.{50,}) because otherwise the regex would replace Test test with Test..., replacing from the first space.

Related

C# Regex Formatting String without Input string

I have the following Regex that is being used to matching incoming packets:
public static class ProtobufConstants
{
public static Regex ContentTypeNameRegex => new Regex("application/protobuf; proto=(.*?)");
}
I also need to write outgoing packets strings in the same format, i.e. create strings similar to "application/protobuf; proto=mynamespace.class1" ideally by using the same regex definition new Regex("application/protobuf; proto=(.*?)");.
To keep this code in one place, is it possible to use this regex template and replace the (.*?) parameter with a string (as per above example i would like to substitute "mynamespace.class1").
I see there is a Regex.Replace(string input, string replacement) but given the above ContentTypeNameRegex already has the format defined I don't have an input per se, I just want to format - not sure what to put here, if anything.
Is it possible to use in this manner, or do i need to revert to string.Format?
If you just want to replace the matched group with something else, you can change your pattern to:
(application/protobuf; proto=)(.*?)
That way, you can replace it by doing something like:
Regex re = ContentTypeNameRegex;
string replacement = "mynamespace.class1";
re.Replace(input, "$1" + replacement);
Use Regex.Replace but use the match evaluator to handle your formatting needs. Here is an example which simply replaces a slash with a dash and visa versa, based on what has been matched.
var text = "001-34/323";
Regex.Replace(text, "[-/]", me => { return me.Value == "-" ? "/" : "-"; })
Result
001/34-323
You can do the same with your input, to decide to change it or send it on as is.

C# Extract part of the string that starts with specific letters

I have a string which I extract from an HTML document like this:
var elas = htmlDoc.DocumentNode.SelectSingleNode("//a[#class='a-size-small a-link-normal a-text-normal']");
if (elas != null)
{
//
_extractedString = elas.Attributes["href"].Value;
}
The HREF attribute contains this part of the string:
gp/offer-listing/B002755TC0/
And I'm trying to extract the B002755TC0 value, but the problem here is that the string will vary by its length and I cannot simply use Substring method that C# offers to extract that value...
Instead I was thinking if there's a clever way to do this, to perhaps a match beginning of the string with what I search?
For example I know for a fact that each href has this structure like I've shown, So I would simply match these keywords:
offer-listing/
So I would find this keyword and start extracting the part of the string B002755TC0 until the next " / " sign ?
Can someone help me out with this ?
This is a perfect job for a regular expression :
string text = "gp/offer-listing/B002755TC0/";
Regex pattern = new Regex(#"offer-listing/(\w+)/");
Match match = pattern.Match(text);
string whatYouAreLookingFor = match.Groups[1].Value;
Explanation : we just match the exact pattern you need.
'offer-listing/'
followed by any combination of (at least one) 'word characters' (letters, digits, hyphen, etc...),
followed by a slash.
The parenthesis () mean 'capture this group' (so we can extract it later with match.Groups[1]).
EDIT: if you want to extract also from this : /dp/B01KRHBT9Q/
Then you could use this pattern :
Regex pattern = new Regex(#"/(\w+)/$");
which will match both this string and the previous. The $ stands for the end of the string, so this literally means :
capture the characters in between the last two slashes of the string
Though there is already an accepted answer, I thought of sharing another solution, without using Regex. Just find the position of your pattern in the input + it's lenght, so the wanted text will be the next character. to find the end, search for the first "/" after the begining of the wanted text:
string input = "gp/offer-listing/B002755TC0/";
string pat = "offer-listing/";
int begining = input.IndexOf(pat)+pat.Length;
int end = input.IndexOf("/",begining);
string result = input.Substring(begining,end-begining);
If your desired output is always the last piece, you could also use split and get the last non-empty piece:
string result2 = input.Split(new string[]{"/"},StringSplitOptions.RemoveEmptyEntries)
.ToList().Last();

Regex to remove specific string if exist

I wanna remove the -L from the end of my string if exists
So
ABCD => ABCD
ABCD-L => ABCD
at the moment I'm using something like the line below which uses the if/else type of arrangement in my Regex, however, I have a feeling that it should be way more easier than this.
var match = Regex.Match("...", #"(?(\S+-L$)\S+(?=-L)|\S+)");
How about just doing:
Regex rgx = new Regex("-L$");
string result = rgx.Replace("ABCD-L", "");
So basically: if the string ends with -L, replace that part with an empty string.
If you want to not only invoke the replacement at the end of the string, but also at the end of a word, you can add an additional switch to detect word boundaries (\b) in addition to the end of the string:
Regex rgx = new Regex("-L(\b|$)");
string result = rgx.Replace("ABCD-L ABCD ABCD-L", "");
Note that detecting word boundaries can be a little ambiguous. See here for a list of characters that are considered to be word characters in C#.
You also can use String.Replace() method to find a specific string inside a string and replace it with another string in this case with an empty string.
http://msdn.microsoft.com/en-us/library/fk49wtc1(v=vs.110).aspx
Use Regex.Replace function,
Regex.Replace(string, #"(\S+?)-L(?=\s|$)", "$1")
DEMO
Explanation:
( group and capture to \1:
\S+? non-whitespace (all but \n, \r, \t, \f,
and " ") (1 or more times)
) end of \1
-L '-L'
(?= look ahead to see if there is:
\s whitespace (\n, \r, \t, \f, and " ")
| OR
$ before an optional \n, and the end of
the string
) end of look-ahead
You certainly can use Regex for this, but why when using normal string functions is clearer?
Compare this:
text = text.EndsWith("-L")
? text.Substring(0, text.Length - "-L".Length)
: text;
to this:
text = Regex.Replace(text, #"(\S+?)-L(?=\s|$)", "$1");
Or better yet, define an extension method like this:
public static string RemoveIfEndsWith(this string text, string suffix)
{
return text.EndsWith(suffix)
? text.Substring(0, text.Length - suffix.Length)
: text;
}
Then your code can look like this:
text = text.RemoveIfEndsWith("-L");
Of course you can always define the extension method using the Regex. At least then your calling code looks a lot cleaner and is far more readable and maintainable.

Using Regex to determine if string contains a repeated sequence of a particular substring with comma separators and nothing else

I want to find if a string contains a repeated sequence of a known substring (with comma separators) and nothing else and return true if this is the case; otherwise false. For example: the substring is "0,8"
String A: "0,8,0,8,0,8,0,8" returns true
String B: "0,8,0,8,1,0,8,0" returns false because of '1'
I tried using the C# string functions Contains but it does not suit my requirements. I am totally new to regular expression but I feel it should be powerful enough to do this. What RegEx should I use to do this?
The pattern for a string containing nothing but a repeated number of a given substring (possibly zero of them, resulting in an empty string) is \A(?:substring goes here)*\z. The \A matches the beginning of the string, the \z the end of the string, and the (?:...)* matches 0 or more copies of anything matching the thing between the colon and the close parenthesis.
But your string doesn't actually match \A(?:0,8)*\z, because of the extra commas; an example that would match is "0,80,80,80,8". You need to account for the commas explicitly with something like \A0,8(?:,0,8)*\z.
You can build such a thing in C# thus:
string OkSubstring = "0,8";
string aOk = "0,8,0,8,0,8,0,8";
string bOK = "0,8,0,8,1,0,8,0";
Regex OkRegex = new Regex( #"\A" + OkSubstring + "(?:," + OkSubstring + #")*\z" );
OkRegex.isMatch(aOK); // True
OkRegex.isMatch(bOK); // False
That hard-codes the comma-delimiter; you could make it more general. Or maybe you just need the literal regex. Either way, that's the pattern you need.
EDIT Changed the anchors per Mike Samuel's suggestion.

problem in regular expression

I am having a regular expression
Regex r = new Regex(#"(\s*)([A|B|C|E|G|H|J|K|L|M|N|P|R|S|T|V|Y|X]\d(?!.*[DFIOQU])(?:[A-Z](\s?)\d[A-Z]\d))(\s*)",RegexOptions.IgnoreCase);
and having a string
string test="LJHLJHL HJGJKDGKJ JGJK C1C 1C1 LKJLKJ";
I have to fetch C1C 1C1.This running fine.
But if a modify test string as
string test="LJHLJHL HJGJKDGKJ JGJK C1C 1C1 ON";
then it is unable to find the pattern i.e C1C 1C1.
any idea why this expression is failing?
You have a negative look ahead:
(?!.*[DFIOQU])
That matches the "O" in "ON" and since it is a negative look ahead, the whole pattern fails. And, as an aside, I think you want to replace this:
[A|B|C|E|G|H|J|K|L|M|N|P|R|S|T|V|Y|X]
With this:
[A-CEGHJ-NPR-TVYX]
A pipe (|) is a literal character inside a character class, not an alternation, and you can use ranges to help hilight the characters that you're leaving out.
A single regex might not be the best way to parse that string. Or perhaps you just need a looser regex.
You are searching for a not a following DFIOQU with your negative look ahead (?!.*[DFIOQU])
In your second string there is a O at the end in ON, so it must be failing to match.
If you remove the .* in your negative look ahead it will only check the directly following character and not the complete string to the end (Is it this what you want?).
\s*([ABCEGHJKLMNPRSTVYX]\d(?![DFIOQU])(?:[A-Z]\s?\d[A-Z]\d))\s*
then it works, see it here on Regexr. It is now checking if there is not one of the characters in the class directly after the digit, I don't know if this is intended.
Btw. I removed the | from your first character class, its not needed and also some brackets around your whitespaces, also not needed.
As I understood you need to find the C1C 1C1 text in your string
I've used this regex for do this
string strRegex = #"^.*(?<c1c>C1C)\s*(?<c1c2>1C1).*$";
after that you can extract text from named groups
string strRegex = #"^.*(?<c1c>C1C)\s*(?<c1c2>1C1).*$";
RegexOptions myRegexOptions = RegexOptions.Multiline;
Regex myRegex = new Regex(strRegex, myRegexOptions);
string strTargetString = #"LJHLJHL HJGJKDGKJ JGJK C1C 1C1 LKJLKJ";
string secondStr = "LJHLJHL HJGJKDGKJ JGJK C1C 1C1 ON";
Match match = myRegex.Match(strTargetString);
string c1c = match.Groups["c1c"].Value;
string c1c2 = match.Groups["c1c2"].Value;
Console.WriteLine(c1c + " " +c1c2);

Categories