Regex: how to exclude a possible following word, if it is present - C#

I'm reading a list line by line and using a regex in C# to capture the fields:
fed line 1: Type: eBook Year: 1990 Title: This is ebook 1 ISBN:15465452 Pages: 100 Authors: Cendric, Paul
fed line 2: Type: Movie Year: 2016 Title: This is movie 1 Authors: Pepe Giron ; Yamasaki Suzuki Length: 4500 Media Type: DVD
string pattern = @"(?:(Type: )(?<type>\w+)) *(?:(Year: )(?<year>\d{4})) *(?:(Title: )(?<title>[^ISBN]*))(?:(ISBN:) *(?<ISBN>\d*))* *(?:(Pages: )(?<pages>\d*))* *(?:(Authors: )(?<author1>[\w ,]*)) *;* *(?<author2>[\w ,]*) *(?:(Length: )(?<length>\d*))* *(?:Media Type: )*(?<discType>[\w ,]*)";
MatchCollection matches = Regex.Matches(line, pattern);
If the line fed has "Length: " I want to stop capturing the surname of the Author excluding the word Length.
If I use (?:(Length: )(?<length>\d*))* Length is added to the surname of the second author for match.Groups["author2"].Value. If I use (?:(Length: )(?<length>\d*))+ I get no matches for the first line.
Can you please give me guidance.
Thank you, Sergio

Using full regexes for something as fuzzy as the format you have is always a way to hurt yourself. As @Kevin wrote, you should look for the keys and extract the values.
My proposal is to look for those keys and split the string before and after them. There is a nifty, randomly working (its behavior even changed between .NET 1.1 and .NET 2.0), nearly unknown feature of Regex called Regex.Split(). We could try to use it :-)
string pattern = @"(?<=^| )(Type: |Year: |Title: |ISBN:|Pages: |Authors: |Length: |Media Type: )";
var rx = new Regex(pattern);
string[] parts = rx.Split(line);
Now parts is an array in which every element that holds a key is followed by an element that holds its value. Regex.Split can add an empty element at the beginning of the array.
string type = null, title = null, mediaType = null;
int? year = null, length = null;
string[] authors = new string[0];
// The parts[0] == string.Empty ? 1 : 0 is caused by the "strangeness" of Regex.Split,
// which can add an empty element at the beginning of the array
for (int i = parts[0] == string.Empty ? 1 : 0; i + 1 < parts.Length; i += 2)
{
    string key = parts[i].TrimEnd();
    string value = parts[i + 1].Trim();
    Console.WriteLine("[{0}|{1}]", key, value);
    switch (key)
    {
        case "Type:":
            type = value;
            break;
        case "Year:":
            {
                int temp;
                if (int.TryParse(value, out temp))
                {
                    year = temp;
                }
            }
            break;
        case "Title:":
            title = value;
            break;
        case "Authors:":
            {
                // The string[] overload works on both .NET Framework and .NET Core
                authors = value.Split(new[] { " ; " }, StringSplitOptions.None);
            }
            break;
        case "Length:":
            {
                int temp;
                if (int.TryParse(value, out temp))
                {
                    length = temp;
                }
            }
            break;
        case "Media Type:":
            mediaType = value;
            break;
    }
}
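To make the key/value pairing more concrete, here is a minimal sketch that wraps the same Regex.Split idea in a helper returning a dictionary. The names RecordParser and Parse are hypothetical, and the sketch assumes the line starts with one of the known keys, as in the two sample lines from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RecordParser
{
    // Same pattern as above: the capturing group makes Regex.Split
    // return the keys as array elements of their own.
    private static readonly Regex KeyPattern = new Regex(
        @"(?<=^| )(Type: |Year: |Title: |ISBN:|Pages: |Authors: |Length: |Media Type: )");

    // Hypothetical helper: maps each key (without the colon) to its trimmed value.
    public static Dictionary<string, string> Parse(string line)
    {
        string[] parts = KeyPattern.Split(line);
        var fields = new Dictionary<string, string>();

        // Skip the empty first element that Regex.Split may produce.
        int start = parts.Length > 0 && parts[0].Length == 0 ? 1 : 0;
        for (int i = start; i + 1 < parts.Length; i += 2)
        {
            string key = parts[i].Trim().TrimEnd(':');
            fields[key] = parts[i + 1].Trim();
        }
        return fields;
    }
}
```

Calling RecordParser.Parse on the second sample line should give entries such as "Authors" and "Length", with the author list no longer swallowing the word Length.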

After all, @xanathos is right. An overcomplicated regex that is hard to maintain and error-prone may not serve you well in the long run.
But to answer your question, your regex can be fixed with a tempered greedy token*, e.g. do not allow Length: in the author's pattern:
(?:(?:(?!Length: )[\w ,])*)
* The linked description uses a . in the greedy token but it's useful to limit the range of allowed characters more here.
Arguably, this should be added to the author1 and author2 part.
The final pattern then looks like this:
(?:(Type: )(?<type>\w+)) *(?:(Year: )(?<year>\d{4})) *(?:(Title: )(?<title>[^ISBN]*))(?:(ISBN:) *(?<ISBN>\d*))* *(?:(Pages: )(?<pages>\d*))* *(?:(Authors: )(?<author1>(?:(?:(?!Length: )[\w ,])*) *)) *;* *(?<author2>(?:(?:(?!Length: )[\w ,])*) *)(?:(Length: )(?<length>\d*))* *(?:Media Type: )*(?<discType>[\w ,]*)
Demo
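As a smaller illustration of the tempered greedy token on its own, the following sketch extracts only the authors part of a line. The class and method names are hypothetical, and the character class additionally allows ';' so that both authors land in one group, which is a simplification compared with the full pattern above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TemperedTokenDemo
{
    // (?:(?!Length: )[\w ,;])* consumes one allowed character at a time,
    // but only while "Length: " does not start at the current position.
    private static readonly Regex AuthorsPattern = new Regex(
        @"Authors: (?<authors>(?:(?!Length: )[\w ,;])*)");

    // Hypothetical helper: returns the authors text, or null if there is no match.
    public static string ExtractAuthors(string line)
    {
        Match m = AuthorsPattern.Match(line);
        return m.Success ? m.Groups["authors"].Value.Trim() : null;
    }
}
```

With the second sample line, the capture stops right before "Length: "; with the first sample line, which has no Length field, it simply runs to the end of the line.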


How do I cut out a certain part of a string?

I have a big String in my program.
For Example:
String Newspaper = "...Blablabla... What do you like?...Blablabla... ";
Now I want to cut out the "What do you like?" and write it to a new string. The problem is that the "Blablabla" is something different every time. By "cut out" I mean that you supply a start word and an end word, and everything written between them should end up in the new string, because the sentence "What do you like?" sometimes changes, except for the start word "What" and the end word "like?".
Thanks for every response.
You can write the following method:
public static string CutOut(string s, string start, string end)
{
    int startIndex = s.IndexOf(start);
    if (startIndex == -1) {
        return null;
    }
    int endIndex = s.IndexOf(end, startIndex);
    if (endIndex == -1) {
        return null;
    }
    return s.Substring(startIndex, endIndex - startIndex + end.Length);
}
It returns null if either the start or end pattern is not found. Only end patterns that follow the start pattern are searched for.
If you are working with C# 8+ and .NET Core 3.0+, you can also replace the last line with
return s[startIndex..(endIndex + end.Length)];
Test:
string input = "...Blablabla... What do you like?...Blablabla... ";
Console.WriteLine(CutOut(input, "What ", " like?"));
prints:
What do you like?
If you are happy with Regex, you can also write:
public static string CutOutRegex(string s, string start, string end)
{
    Match match = Regex.Match(s, $@"\b{Regex.Escape(start)}.*{Regex.Escape(end)}");
    if (match.Success) {
        return match.Value;
    }
    return null;
}
The \b ensures that the start pattern is only found at the beginning of a word. You can drop it if you want. Also, if the end pattern occurs more than once, the result will include all of them unlike the first example with IndexOf which will only include the first one.
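If you want the regex version to stop at the first occurrence of the end pattern, like the IndexOf version does, a lazy quantifier is one option. This is a minimal sketch; the name CutOutLazy is hypothetical, and the \b is left out here for simplicity.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CutOutDemo
{
    // .*? is the lazy form of .* : it expands only as far as needed,
    // so the match ends at the first occurrence of the end pattern.
    public static string CutOutLazy(string s, string start, string end)
    {
        Match match = Regex.Match(s, $@"{Regex.Escape(start)}.*?{Regex.Escape(end)}");
        return match.Success ? match.Value : null;
    }
}
```

Note that by default the dot does not match a newline, so both variants only find start and end patterns that sit on the same line unless RegexOptions.Singleline is passed.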
You have to take a substring, like in the example below. See source for more information on substrings.
// A long string
string bio = "Mahesh Chand is a founder of C# Corner. Mahesh is also an " +
    "author, speaker, and software architect. Mahesh founded C# Corner in " +
    "2000.";
// Get the first 12 characters of the string
string authorName = bio.Substring(0, 12);
Console.WriteLine(authorName);
In this case I would do it like this: cut off the first part, then the second, and concatenate the remainder with the fixed words that were used as the parameters for cutting.
public string CutPhrase(string phrase)
{
    var fst = "What";
    var snd = "like?";
    // Assumes both words occur in phrase; otherwise cut1[1] throws.
    string[] cut1 = phrase.Split(new[] { fst }, StringSplitOptions.None);
    string[] cut2 = cut1[1].Split(new[] { snd }, StringSplitOptions.None);
    // cut2[0] keeps its surrounding spaces, so no extra spaces are added here.
    var rst = $"{fst}{cut2[0]}{snd}";
    return rst;
}

Replacing anchor/link in text

I'm having issues doing a find/replace type of action in my function. I'm extracting the <a href="link">anchor from an article and replacing it with this format: [link anchor]. The link and anchor will be dynamic, so I can't hard-code the values. What I have so far is:
public static string GetAndFixAnchor(string articleBody, string articleWikiCheck) {
    string theString = string.Empty;
    switch (articleWikiCheck) {
        case "id|wpTextbox1":
            StringBuilder newHtml = new StringBuilder(articleBody);
            Regex r = new Regex(@"\<a href=\""([^\""]+)\"">([^<]+)");
            string final = string.Empty;
            foreach (var match in r.Matches(theString).Cast<Match>().OrderByDescending(m => m.Index))
            {
                string text = match.Groups[2].Value;
                string newHref = "[" + match.Groups[1].Index + " " + match.Groups[1].Index + "]";
                newHtml.Remove(match.Groups[1].Index, match.Groups[1].Length);
                newHtml.Insert(match.Groups[1].Index, newHref);
            }
            theString = newHtml.ToString();
            break;
        default:
            theString = articleBody;
            break;
    }
    Helpers.ReturnMessage(theString);
    return theString;
}
Currently, it just returns the article as it originally is, with the traditional anchor format: <a href="link">anchor
Can anyone see what I have done wrong?
Regards
If your input is HTML, you should consider using a corresponding parser, HtmlAgilityPack being really helpful.
As for the current code, it looks too verbose. You may use a single Regex.Replace to perform the search and replace in one pass:
public static string GetAndFixAnchor(string articleBody, string articleWikiCheck) {
    if (articleWikiCheck == "id|wpTextbox1")
    {
        return Regex.Replace(articleBody, @"<a\s+href=""([^""]+)"">([^<]+)", "[$1 $2]");
    }
    else
    {
        // Helpers.ReturnMessage(articleBody); // Uncomment if it is necessary
        return articleBody;
    }
}
See the regex demo.
The <a\s+href="([^"]+)">([^<]+) regex matches <a, 1 or more whitespaces, href=", then captures into Group 1 any one or more chars other than ", then matches "> and then captures into Group 2 any one or more chars other than <.
The [$1 $2] replacement replaces the matched text with [, Group 1 contents, space, Group 2 contents and a ].
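The replacement can be tried in isolation with a sketch like the one below. The names AnchorRewriter and ToWikiLinks are hypothetical, and the pattern here also consumes a closing </a> tag, which is an assumption about the input; the pattern above stops before the closing tag and would leave it in the text.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorRewriter
{
    // Group 1 captures the href value, Group 2 the anchor text.
    // Assumption: every anchor is closed by a literal </a>.
    public static string ToWikiLinks(string html)
    {
        return Regex.Replace(
            html,
            @"<a\s+href=""([^""]+)"">([^<]+)</a>",
            "[$1 $2]");
    }
}
```

For example, ToWikiLinks applied to the text See <a href="http://example.com">the site</a> now is expected to produce See [http://example.com the site] now.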
Updated (Corrected regex to support whitespaces and new lines)
You can try this expression
Regex r = new Regex(@"<[\s\n]*a[\s\n]*(([^\s]+\s*[ ]*=*[ ]*[\s|\n*]*('|"").*\3)[\s\n]*)*href[ ]*=[ ]*('|"")(?<link>.*)\4[.\n]*>(?<anchor>[\s\S]*?)[\s\n]*<\/[\s\n]*a>");
It will match your anchors even if they are split across multiple lines. It is this long because it allows whitespace between the tags and their values, and because .NET regexes do not support subroutines, so the [\s\n]* part has to be repeated multiple times.
You can see a working sample at dotnetfiddle
You can use it in your example like this.
public static string GetAndFixAnchor(string articleBody, string articleWikiCheck) {
    if (articleWikiCheck == "id|wpTextbox1")
    {
        return Regex.Replace(articleBody,
            @"<[\s\n]*a[\s\n]*(([^\s]+\s*[ ]*=*[ ]*[\s|\n*]*('|"").*\3)[\s\n]*)*href[ ]*=[ ]*('|"")(?<link>.*)\4[.\n]*>(?<anchor>[\s\S]*?)[\s\n]*<\/[\s\n]*a>",
            "[${link} ${anchor}]");
    }
    else
    {
        return articleBody;
    }
}

How can I read input as two different answers

Say I get the following question
Console.WriteLine("Which teams have faced eachother? - use Red vs Blue format");
Then my answer to the question above will contain two teams. But how can I read them as two separate values?
I only want to read [Red] and [Blue], but the "vs" part in between has to be there.
I hope you understand what I am trying to say. My English is not great.
Best regards,
PS: as you can tell, I am pretty new to programming.
Edit: oh, and this is all in C#.
You can use String.Split():
var answers = userInput.Split(new String[] { "vs" }, StringSplitOptions.RemoveEmptyEntries);
if (answers.Length == 2) {
    var red = answers[0];
    var blue = answers[1];
}
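One thing to keep in mind with the snippet above is that the parts keep their surrounding spaces ("Red " and " Blue"), and a team name that itself contains the letters "vs" would be split as well. A slightly more defensive sketch could look like this; TeamParser and TryParseTeams are hypothetical names, and it assumes the separator is always written as " vs " with single spaces around it.

```csharp
using System;

public static class TeamParser
{
    // Splits "Red vs Blue" into its two team names.
    // Returns false if the input does not contain exactly one " vs " separator
    // or if one of the names is empty.
    public static bool TryParseTeams(string input, out string team1, out string team2)
    {
        team1 = null;
        team2 = null;
        if (input == null)
        {
            return false;
        }

        string[] parts = input.Split(new[] { " vs " }, StringSplitOptions.None);
        if (parts.Length != 2)
        {
            return false;
        }

        string first = parts[0].Trim();
        string second = parts[1].Trim();
        if (first.Length == 0 || second.Length == 0)
        {
            return false;
        }

        team1 = first;
        team2 = second;
        return true;
    }
}
```

A typical call would be TeamParser.TryParseTeams(Console.ReadLine(), out var red, out var blue), asking the user again whenever it returns false.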
There are many options: you can use the Split function to turn the input into an array and drop the "vs",
or simply use the String.Replace("vs", "") function to replace the "vs" string with an empty string.
You can try using a regular expression:
Match m = Regex.Match(userInput, @"^(?<team1>.+) vs (?<team2>.+)$");
if (m.Success)
{
    string team1 = m.Groups["team1"].Value;
    string team2 = m.Groups["team2"].Value;
}
Please note that in Regex.Match the input string is the first parameter and the pattern is the second; refer to IntelliSense if in doubt.
You can read everything as one string and then split it with "vs" as the separator; you'll then get an array with the 2 strings that you need.
Use the String.Split function, as others have suggested. This will split your string into an array of strings. Then, identify which string in the array is the 'vs' string. Take the value of the index prior to 'vs' and after 'vs'. For example:
string input = "Which teams have faced eachother? - use Red vs Blue format";
string[] inputArray = input.Split( ' ' );
int vsLocation = 0;
for ( int i = 0; i < inputArray.Length; i++ ) {
    if ( inputArray[i] == "vs" ) {
        vsLocation = i;
        break;
    }
}
// Make sure there is a word on both sides of "vs"
if ( vsLocation > 0 && vsLocation + 1 < inputArray.Length ) {
    string team1 = inputArray[vsLocation - 1];
    string team2 = inputArray[vsLocation + 1];
}

How to capitalize the first character of each word, or the first character of a whole string, with C#?

I could write my own algorithm to do it, but I feel there should be the equivalent to ruby's humanize in C#.
I googled it but only found ways to humanize dates.
Examples:
A way to turn "Lorem Lipsum Et" into "Lorem lipsum et"
A way to turn "Lorem lipsum et" into "Lorem Lipsum Et"
As discussed in the comments of @miguel's answer, you can use TextInfo.ToTitleCase which has been available since .NET 1.1. Here is some code corresponding to your example:
string lipsum1 = "Lorem lipsum et";
// Creates a TextInfo based on the "en-US" culture.
TextInfo textInfo = new CultureInfo("en-US", false).TextInfo;
// Changes a string to titlecase.
Console.WriteLine("\"{0}\" to titlecase: {1}",
    lipsum1,
    textInfo.ToTitleCase(lipsum1));
// Will output: "Lorem lipsum et" to titlecase: Lorem Lipsum Et
It leaves words that are entirely in upper case, such as "LOREM LIPSUM ET", unchanged, because it treats them as acronyms, so that "IEEE" (Institute of Electrical and Electronics Engineers) won't become "ieee" or "Ieee".
However, if you only want to capitalize the first character, you can use the solution that is over here... or you could just split the string and capitalize the first word in the list:
string lipsum2 = "Lorem Lipsum Et";
string lipsum2lower = textInfo.ToLower(lipsum2);
string[] lipsum2split = lipsum2lower.Split(' ');
bool first = true;
foreach (string s in lipsum2split)
{
    if (first)
    {
        Console.Write("{0} ", textInfo.ToTitleCase(s));
        first = false;
    }
    else
    {
        Console.Write("{0} ", s);
    }
}
// Will output: Lorem lipsum et
There is another elegant solution:
Define the function ToTitleCase in a static class of your project
using System.Globalization;
public static string ToTitleCase(this string title)
{
    return CultureInfo.CurrentCulture.TextInfo.ToTitleCase(title.ToLower());
}
And then use it like a string extension anywhere in your project:
"have a good day !".ToTitleCase() // "Have A Good Day !"
Using regular expressions for this looks much cleaner:
string s = "the quick brown fox jumps over the lazy dog";
s = Regex.Replace(s, @"(^\w)|(\s\w)", m => m.Value.ToUpper());
All the examples seem to make the other characters lowercase first, which isn't what I needed.
customerName = CustomerName <-- Which is what I wanted
this is an example = This Is An Example
public static string ToUpperEveryWord(this string s)
{
    // Check for empty string.
    if (string.IsNullOrEmpty(s))
    {
        return string.Empty;
    }
    var words = s.Split(' ');
    var t = "";
    foreach (var word in words)
    {
        // Skip empty entries produced by consecutive spaces (word[0] would throw).
        if (word.Length == 0) continue;
        t += char.ToUpper(word[0]) + word.Substring(1) + ' ';
    }
    return t.Trim();
}
If you just want to capitalize the first character, just stick this in a utility method of your own:
return string.IsNullOrEmpty(str)
    ? str
    : char.ToUpperInvariant(str[0]) + str.Substring(1).ToLowerInvariant();
There's also a library method to capitalize the first character of every word:
http://msdn.microsoft.com/en-us/library/system.globalization.textinfo.totitlecase.aspx
CSS technique is ok but only changes the presentation of the string in the browser. A better method is to make the text itself capitalised before sending to browser.
Most of the above implementations are ok, but none of them address the issue of what happens if you have mixed case words that need to be preserved, or if you want to use true Title Case, for example:
"Where to Study PHd Courses in the USA"
or
"IRS Form UB40a"
Also using CultureInfo.CurrentCulture.TextInfo.ToTitleCase(string) preserves upper case words as in
"sports and MLB baseball" which becomes "Sports And MLB Baseball" but if the whole string is put in upper case, then this causes an issue.
So I put together a simple function that allows you to keep the capital and mixed case words and make small words lower case (if they are not at the start and end of the phrase) by including them in a specialCases and lowerCases string arrays:
public static string TitleCase(string value) {
    string titleString = ""; // destination string, this will be returned by function
    if (!String.IsNullOrEmpty(value)) {
        string[] lowerCases = new string[12] { "of", "the", "in", "a", "an", "to", "and", "at", "from", "by", "on", "or"}; // list of lower case words that should only be capitalised at start and end of title
        string[] specialCases = new string[7] { "UK", "USA", "IRS", "UCLA", "PHd", "UB40a", "MSc" }; // list of words that need capitalisation preserved at any point in title
        string[] words = value.ToLower().Split(' ');
        bool wordAdded = false; // flag to confirm whether this word was already added from the lower case or special case list
        int counter = 1;
        foreach (string s in words) {
            // check if word appears in lower case list
            foreach (string lcWord in lowerCases) {
                if (s.ToLower() == lcWord) {
                    // if lower case word is the first or last word of the title then it still needs capital so skip this bit.
                    // counter starts at 1, so the first word is counter == 1
                    if (counter == 1 || counter == words.Length) { break; }
                    titleString += lcWord;
                    wordAdded = true;
                    break;
                }
            }
            // check if word appears in special case list
            foreach (string scWord in specialCases) {
                if (s.ToUpper() == scWord.ToUpper()) {
                    titleString += scWord;
                    wordAdded = true;
                    break;
                }
            }
            if (!wordAdded) { // word does not appear in special cases or lower cases, so capitalise first letter and add to destination string
                titleString += char.ToUpper(s[0]) + s.Substring(1).ToLower();
            }
            wordAdded = false;
            if (counter < words.Length) {
                titleString += " "; //dont forget to add spaces back in again!
            }
            counter++;
        }
    }
    return titleString;
}
This is just a quick and simple method - and can probably be improved a bit if you want to spend more time on it.
If you want to keep the capitalisation of smaller words like "a" and "of", then just remove them from the lowerCases string array. Different organisations have different rules on capitalisation.
You can see an example of this code in action on this site: Egg Donation London - this site automatically creates breadcrumb trails at the top of the pages by parsing the url eg "/services/uk-egg-bank/introduction" - then each folder name in the trail has hyphens replaced with spaces and capitalises the folder name, so uk-egg-bank becomes UK Egg Bank. (preserving the upper case 'UK')
An extension of this code could be to have a lookup table of acronyms and uppercase/lowercase words in a shared text file, database table or web service so that the list of mixed case words can be maintained from one single place and apply to many different applications that rely on the function.
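As a rough sketch of that lookup-table idea, the two arrays can be replaced by case-insensitive collections, which also removes the nested loops. The class name TitleCaser and the in-memory tables are assumptions for illustration; in a real application the tables might be loaded from the shared file, database table or web service mentioned above.

```csharp
using System;
using System.Collections.Generic;

public static class TitleCaser
{
    // Words that stay lower case unless they are the first or last word.
    private static readonly HashSet<string> LowerWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "of", "the", "in", "a", "an", "to", "and", "at", "from", "by", "on", "or" };

    // Words whose exact casing must be preserved, keyed case-insensitively.
    private static readonly Dictionary<string, string> SpecialWords =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "UK", "UK" }, { "USA", "USA" }, { "IRS", "IRS" }, { "UCLA", "UCLA" },
            { "PHd", "PHd" }, { "UB40a", "UB40a" }, { "MSc", "MSc" }
        };

    public static string TitleCase(string value)
    {
        if (string.IsNullOrWhiteSpace(value))
        {
            return string.Empty;
        }

        string[] words = value.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        for (int i = 0; i < words.Length; i++)
        {
            string word = words[i];
            bool isEdge = i == 0 || i == words.Length - 1;

            if (SpecialWords.TryGetValue(word, out string special))
            {
                words[i] = special;                       // preserve acronym casing
            }
            else if (!isEdge && LowerWords.Contains(word))
            {
                words[i] = word.ToLowerInvariant();       // small word inside the title
            }
            else
            {
                words[i] = char.ToUpperInvariant(word[0]) + word.Substring(1).ToLowerInvariant();
            }
        }
        return string.Join(" ", words);
    }
}
```

Unlike the function above, this sketch checks the special-case table first and collapses repeated spaces, so the two are not drop-in equivalents.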
There is no prebuilt solution for proper linguistic capitalization in .NET. What kind of capitalization are you going for? Are you following the Chicago Manual of Style conventions? AMA or MLA? Even plain English sentence capitalization has thousands of special exceptions for words. I can't speak to what ruby's humanize does, but I imagine it likely doesn't follow linguistic rules of capitalization and instead does something much simpler.
Internally, we encountered this same issue and had to write a fairly large amount of code just to handle proper (in our little world) casing of article titles, not even accounting for sentence capitalization. And it indeed does get "fuzzy" :)
It really depends on what you need - why are you trying to convert the sentences to proper capitalization (and in what context)?
I have achieved the same using custom extension methods. For the first letter of the first substring, use the method yourString.ToFirstLetterUpper(). For the first letter of every substring, excluding articles and some prepositions, use the method yourString.ToAllFirstLetterInUpper(). Below is a console program:
using System;
using System.Linq;

class Program
{
    static void Main(string[] args)
    {
        Console.WriteLine("this is my string".ToAllFirstLetterInUpper());
        Console.WriteLine("uniVersity of lonDon".ToAllFirstLetterInUpper());
    }
}
public static class StringExtension
{
    public static string ToAllFirstLetterInUpper(this string str)
    {
        var array = str.Split(' ');
        for (int i = 0; i < array.Length; i++)
        {
            // Leave empty entries, articles and prepositions untouched
            if (array[i] == "" || array[i] == " " || listOfArticles_Prepositions().Contains(array[i])) continue;
            array[i] = array[i].ToFirstLetterUpper();
        }
        return string.Join(" ", array);
    }
    private static string ToFirstLetterUpper(this string str)
    {
        return str?.First().ToString().ToUpper() + str?.Substring(1).ToLower();
    }
    private static string[] listOfArticles_Prepositions()
    {
        return new[]
        {
            "in","on","to","of","and","or","for","a","an","is"
        };
    }
}
OUTPUT
This is My String
University of London
Process finished with exit code 0.
As far as I know, there's not a way to do that without writing (or cribbing) code. C# nets (ha!) you upper, lower and title (what you have) cases:
http://support.microsoft.com/kb/312890/EN-US/

Regular expression that returns a constant value as part of a match

I have a regular expression to match 2 different number formats: \=(?<Value>[0-9]+)\?|\+(?<Value>[0-9]+)\?
This should return 9876543 as its Value for ;1234567890123456?+1234567890123456789012345123=9876543? and ;1234567890123456?+9876543?
What I would like is to be able to return another value along with the matched 'Value'.
So, for example, if the first string was matched, I'd like it to return:
Value:
9876543
Format:
LongFormat
And if matched in the second string:
Value:
9876543
Format:
ShortFormat
Is this possible?
Another option, which is not quite the solution you wanted, but saves you using two separate regexes, is to use named groups, if your implementation supports it.
Here is some C#:
var regex = new Regex(@"\=(?<Long>[0-9]+)\?|\+(?<Short>[0-9]+)\?");
string test1 = ";1234567890123456?+1234567890123456789012345123=9876543?";
string test2 = ";1234567890123456?+9876543?";
var match = regex.Match(test1);
Console.WriteLine("Long: {0}", match.Groups["Long"]); // 9876543
Console.WriteLine("Short: {0}", match.Groups["Short"]); // blank
match = regex.Match(test2);
Console.WriteLine("Long: {0}", match.Groups["Long"]); // blank
Console.WriteLine("Short: {0}", match.Groups["Short"]); // 9876543
Basically, just modify your regex to include the names, and then match.Groups[GroupName] will either have a value or won't. You could even just use the Success property of the group to know which one matched (match.Groups["Long"].Success).
UPDATE:
You can get the group name out of the match, with the following code:
static void Main(string[] args)
{
    var regex = new Regex(@"\=(?<Long>[0-9]+)\?|\+(?<Short>[0-9]+)\?");
    string test1 = ";1234567890123456?+1234567890123456789012345123=9876543?";
    string test2 = ";1234567890123456?+9876543?";
    ShowGroupMatches(regex, test1);
    ShowGroupMatches(regex, test2);
    Console.ReadLine();
}
private static void ShowGroupMatches(Regex regex, string testCase)
{
    int i = 0;
    foreach (Group grp in regex.Match(testCase).Groups)
    {
        if (grp.Success && i != 0)
        {
            Console.WriteLine(regex.GroupNameFromNumber(i) + " : " + grp.Value);
        }
        i++;
    }
}
I'm ignoring the 0th group, because that is always the entire match in .NET
No, you can't match text that isn't there. The match can only return a substring of the target.
You essentially want to match against two patterns and take different actions in each case. See if you can separate them in your code:
if match(\=(?<Value>[0-9]+)\?) then
    return 'Value: ' + match + 'Format: LongFormat'
else if match(\+(?<Value>[0-9]+)\?) then
    return 'Value: ' + match + 'Format: ShortFormat'
(Excuse the dodgy pseudocode, but you get the idea.)
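In C#, the same two-step idea could be sketched roughly as follows. The class and method names are hypothetical, and the group name Value is assumed from the question; the long format is tried first, then the short one.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FormatDetector
{
    private static readonly Regex LongFormat = new Regex(@"=(?<Value>[0-9]+)\?");
    private static readonly Regex ShortFormat = new Regex(@"\+(?<Value>[0-9]+)\?");

    // Returns a description such as "Value: 9876543 Format: LongFormat",
    // or null if neither pattern matches.
    public static string Describe(string input)
    {
        Match m = LongFormat.Match(input);
        if (m.Success)
        {
            return "Value: " + m.Groups["Value"].Value + " Format: LongFormat";
        }

        m = ShortFormat.Match(input);
        if (m.Success)
        {
            return "Value: " + m.Groups["Value"].Value + " Format: ShortFormat";
        }

        return null;
    }
}
```

The constant text ("LongFormat" or "ShortFormat") comes from the code branch that matched, not from the regex itself, which is the point made above.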
You can't match text that isn't there - but, depending on what language you're using, you can process what you match, and conditionally add text based on what is there.
With some implementations of regex, you can specify a "callback function" which allows you to run logic against each result.
Here's a pseudo-code example:
Input.replaceAll( /[+=][0-9]+(?=\?)/ , formatValue );
formatValue : function(match,groups)
{
    switch( left(match,1) )
    {
        case '+' : Format = 'Short'; break;
        case '=' : Format = 'Long'; break;
        default : Format = 'Unknown'; break;
    }
    Value = match.replace('[+=]', '');
    return 'Value: '+Value+' Format: ' + Format;
}
What that will do, in a language that supports regex callbacks, is execute the formatValue function every time it finds a match, and use the result of the function as the replacement text.
You haven't specified which implementation you're using, so this may or may not be possible for you, but it is definitely worth checking out.
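If the implementation happens to be .NET, the callback approach maps onto the Regex.Replace overload that takes a MatchEvaluator delegate. The sketch below is only an illustration under that assumption; CallbackDemo and Annotate are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CallbackDemo
{
    // Replaces every "+digits" or "=digits" that is followed by '?'
    // with a text that also names the format.
    public static string Annotate(string input)
    {
        return Regex.Replace(input, @"[+=][0-9]+(?=\?)", m =>
        {
            // The first character of the match decides the format.
            string format = m.Value[0] == '+' ? "Short" : "Long";
            string value = m.Value.Substring(1);
            return "Value: " + value + " Format: " + format;
        });
    }
}
```

Because the lookahead does not consume the trailing '?', that character stays in the output after the replaced text.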
