Regex to find all substrings and longest substring - c#

I would usually do something like this using a string library. But I wonder if it can be done using a regex.
I want to do the following: Given a search string:
Seattle is awesome
I want to find all substrings of it in a given sentence. So applying a regex to the following sentence
Seattle Seattle is awesome is awesome awesome is Seattle
Should give me
Seattle, Seattle is awesome, is awesome, awesome, is, Seattle
One constraint that might be helpful is that the sentence will always have only the words present in the search string and whitespaces in between.
Note: if there's a match, it should be the longest possible string. So in the above example the matches shouldn't be single words but rather the longest possible substrings. The order of the words also needs to be maintained. That's why
awesome is Seattle
in the sentence above gives us
awesome, is and Seattle
I am not sure if something like this can be done with a regex though, since it is greedy. Would appreciate any insight into this!
I'm familiar with both C# and Java and could use either of their regex libraries.

I don't think you can do this with a regex. Wikipedia has a good article on the longest common subsequence problem.

There is no good way to express such a pattern directly with a regex.
You'll need to list all allowed combinations:
Seattle is awesome|Seattle is|Seattle|is awesome|is|awesome
or more succinctly:
Seattle( is( awesome)?)?|is( awesome)?|awesome
You can programmatically transform your input string to this format.
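As a rough sketch of that transformation (in Python rather than C# or Java, and assuming words are separated by single spaces as in the question), the nested pattern could be generated like this. The helper names build_pattern and longest_runs are made up for the illustration:

```python
import re

def build_pattern(search):
    """Build an alternation that matches any run of consecutive search words."""
    words = [re.escape(w) for w in search.split()]
    alternatives = []
    for start in range(len(words)):
        # Nest the following words as optional groups: w0(?: w1(?: w2)?)?
        nested = ""
        for word in reversed(words[start + 1:]):
            nested = "(?: " + word + nested + ")?"
        alternatives.append(words[start] + nested)
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b")

def longest_runs(search, sentence):
    """Return the greedy, left-to-right matches found in the sentence."""
    return [m.group(0) for m in build_pattern(search).finditer(sentence)]

print(longest_runs("Seattle is awesome",
                   "Seattle Seattle is awesome is awesome awesome is Seattle"))
```

Because the optional groups are greedy, each match extends as far as the word order allows, which is what gives the "longest possible" behaviour asked for in the question.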

Can you describe your problem a little further? It sounds a lot more like a search engine than simple string matching. I would highly recommend checking out Apache Lucene -- it has a bit of a learning curve, but it is a great little tool for smart searching. It handles lots of things that are gotchas when dealing with search. You can set up the scoring of hits to do pretty much exactly what you describe.

In Java, not tested. This returns an iterator on lists of strings. Each list is a matching subsequence.
Just put spaces between the members of the list to print. If this is getting used a lot, the use of intern() may be bad.
static Iterator<List<String>> getSequences(String squery, String starget)
{
    List<String> query = Arrays.asList(squery.split(" "));
    for (int i = 0; i < query.size(); i++)
        query.set(i, query.get(i).intern());
    List<String> target = Arrays.asList(starget.split(" "));
    for (int i = 0; i < target.size(); i++)
        target.set(i, target.get(i).intern());
    // Because the strings are all intern'ed, this HashSet acts like we want -
    // If two lists are the same sequence of words, they are equal.
    // It's used to remove duplicates.
    HashSet<List<String>> ret = new HashSet<List<String>>();
    for (int qBegin = 0; qBegin < query.size(); qBegin++) {
        for (int tBegin = 0; tBegin < target.size(); tBegin++) {
            for (int iCursor = 0;
                 iCursor < min(query.size() - qBegin, target.size() - tBegin);
                 iCursor++) {
                if (query.get(qBegin + iCursor) == target.get(tBegin + iCursor))
                    ret.add(query.subList(qBegin, qBegin + iCursor + 1));
                else break;
            }
        }
    }
    return ret.iterator();
}
static int min(int a, int b) { return (a < b) ? a : b; }

Related

What would the regex be to match a string whether it has 1 or less apostrophes at any number of different indexes or none at all?

I really don't know Regular Expression syntax that well, but I am using a simple highlighting plug-in for jQuery, and I need it to select a word whether it has 1 or less apostrophes at any number of different indexes or none at all.
For example, say I have a string: Tods (note that this string could be anything).
I need a regular expression that could still select: Tod's, To'ds, T'ods, or 'Tods. (Note that I did not include an apostrophe at the last index, as this is not necessary, although, it probably wouldn't hurt anything).
So far I have this code in jQuery...:
$("input.highlightTerm").each(function () {
    $(".resultValue").highlight($(this).val());
});
...where $(this).val() is the string that will be highlighted.
It is also possible for me to do this in C#, as I populate the hidden input fields that this jQuery code picks up ($("input.highlightTerm")) on server-side, using C#.
Simple C# Razor Syntax:
for (var n = 0; n < searchTermsArray.Length; n++)
{
    <input class="highlightTerm" type="hidden" value="@searchTermsArray[n]" />
}
What is the regular expression syntax I need to get this done?
More Examples of What Should and Shouldn't Match:
T'o'd's [Should Match]
Tod's [Should Match]
'Tods' [Should Match]
'Tods OR Tods' [Really doesn't matter, because of how the plug-in works, but I guess Should Match, is preferred]
Tod''s [Shouldn't Match]
''Tods [Shouldn't Match]
--Pretty much I only want matches if there is 1 or less apostrophes among any number of different indexes within the string.
First make sure the string has length and that there are no double-apostrophes (this rules out triples and higher as well). Then test the string for containing only word characters or apostrophes.
var re = /^[\w']*$/;
function checkForApostrophe(str) {
    if ( !str.length ) { return; }
    if ( str.indexOf("''") !== -1 ) { return; }
    if ( str.charAt(str.length-1) === "'" ) { return; }
    return re.test(str);
}
Replace '\w' in the regex with [a-zA-Z], possibly including [0-9] depending on your requirements.
It's a little difficult to tell from the question exactly what you want, so if this isn't quite right please comment.
I think after reading the comments on the other answers, I've figured out what it is you're going for. You don't need a single regex that can do this for any possible input, you already have input, and you need to build a regex that matches it and its variations. What you need to do is this.
var re = new RegExp("'?" + "tods".split("").join("'?") + "'?")
This will create a regex that matches in the way you're describing, provided it's OK that it also matches the original string.
In this case, the above line builds this regex:
/'?t'?o'?d'?s'?/
This may still not be 100% right. You know, since I don't have that highlight function around myself to play with, but I think it should get you on the right track.
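For comparison, here is a minimal Python sketch of the same idea. The function name apostrophe_tolerant is made up, and the case-insensitive flag is an assumption that the JavaScript line above does not make. It also escapes each character, so search terms containing regex metacharacters would not break the pattern:

```python
import re

def apostrophe_tolerant(term):
    """Allow one optional apostrophe before, between and after the characters."""
    optional = "'?"
    body = optional.join(re.escape(ch) for ch in term)
    # IGNORECASE is an assumption; drop it if matching should be case-sensitive
    return re.compile(optional + body + optional, re.IGNORECASE)

pattern = apostrophe_tolerant("tods")
for candidate in ["Tod's", "T'o'd's", "'Tods'", "Tod''s", "''Tods"]:
    print(candidate, bool(pattern.fullmatch(candidate)))
```

Using fullmatch here checks the whole candidate; a highlighter that searches inside longer text would use search or finditer instead, and would then still find the Tods part of ''Tods.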
I think you have to do something like this!
function checkForApostrophe(str) {
    var length = str.length;
    if (length === 0) {
        return false;
    }
    // Makes sure string contains only letters, numbers and apostrophes: nothing else
    if (!/^[a-zA-Z0-9']*$/.test(str)) {
        return false;
    }
    // Makes sure there is only one or zero apostrophes
    if (str.indexOf("'") !== str.lastIndexOf("'")) {
        return false;
    }
    // Makes sure there is no apostrophe stranded at the end
    return str.lastIndexOf("'") !== length - 1;
}

Find index of first Char(Letter) in string

I have a mental block and can't seem to figure this out; I'm sure it's pretty easy 0_o
I have the following string: "5555S1"
String can contain any number of digits, followed by a Letter(A-Z), followed by numbers again.
How do I get the index of the Letter(S), so that I can use substring to get everything following the Letter?
Ie: 5555S1
Should return S1
Cheers
You could also check if the integer representation of the character is >= 65 && <=90.
Simple Python:
test = '5555Z187456764587368457638'
for i in range(0, len(test)):
    if test[i].isalpha():
        break
print(test[i:])
Yields: Z187456764587368457638
Given that you didn't say what language you're using, I'm going to pick the one I want to answer in: C#.
String.IndexOf, see http://msdn.microsoft.com/en-us/library/system.string.indexof.aspx for more.
For good measure, here it is in Java: string.indexOf
One way could be to loop through the string until you find a letter.
while (!isAlpha(s[i]))
    i++;
or something should work.
This doesn't answer your question but it does solve your problem.
(Although you can use it to work out the index)
Your problem is a good candidate for Regular Expressions (regex)
Here is one I prepared earlier:
String code = "1234A0987";
//timeout optional but needed for security (so bad guys don't overload your server)
TimeSpan timeout = TimeSpan.FromMilliseconds(150);
//Magic here:
//Pattern == (Block of 1 or more numbers)(block of 1 or more not numbers)(Block of 1 or more numbers)
String regexPattern = @"^(?<firstNum>\d+)(?<notNumber>\D+)(?<SecondNum>\d+)?";
Regex r = new Regex(regexPattern, RegexOptions.None, timeout);
Match m = r.Match(code);
if (m.Success)//We got a match!
{
    Console.WriteLine("SecondNumber: {0}", m.Result("${SecondNum}"));
    Console.WriteLine("All data (formatted): {0}", m.Result("${firstNum}-${notNumber}-${SecondNum}"));
    Console.WriteLine("Offset length (not that you need it now): {0}", m.Result("${firstNum}").Length);
}
Output:
SecondNumber: 0987
All data (formatted): 1234-A-0987
Offset length (not that you need it now): 4
Further info on this example here.
So there you go you can even work out what that index was.
Regex cheat sheet
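If only the index and the trailing part are needed, a much smaller regex is enough. The following is a Python sketch rather than C#, with a made-up helper name, and it assumes "letter" means an ASCII letter as in the question:

```python
import re

def from_first_letter(code):
    """Return (index, remainder) for the first ASCII letter, or (-1, "") if none."""
    match = re.search(r"[A-Za-z]", code)
    if match is None:
        return -1, ""
    return match.start(), code[match.start():]

print(from_first_letter("5555S1"))
```

The match object carries the position, so the substring step is just a slice from match.start().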

auto detect tag within a text

Is there any library or algorithm that can automatically detect tags in a text (ignoring the common words of the chosen language)?
Something like this:
string[] keywords = GetKeyword("Your order is num #0123456789")
and keywords[] would contain "order" and "#0123456789" ...?
Does it exist? Or will the user have to select all the tags of every document by hand every time?
foreach (string keyword in keywords) { // where keywords is a List<string>
    if ("Your order is num #0123456789".Contains(keyword)) {
        keywordsPresent.Add(keyword); // where keywordsPresent is a List<string>
    }
}
return keywordsPresent;
The above does not cater for your #0123456789; for that, add some more logic to find the index of the # or something similar.
Sorry, I misunderstood the question. If you want to look for specific words, the algorithm will depend on your strings. For example, you can use string.Split() to generate an array of words from one string, and then work with that, like this:
// requires using System.Linq;
string[] words = "Your order is num #0123456789".Split(' ');
string orderNumber = "";
if (words.Contains("order") && words.Count(w => w.StartsWith("#")) > 0)
{
    orderNumber = words.Where(w => w.StartsWith("#")).FirstOrDefault();
}
This will first generate an array of words from "Your order is num #0123456789"; then, if it contains the word "order", it will find a word that starts with "#" and select that.
I think that a lot of different algorithms can be used. Some of them are simple, others are very complex. I can suggest the following basic approach:
Split all the text into an array of words.
Remove stop words from the array. (Google "stop words list" to get a full list of stop words.)
Walk through the array and calculate count of each word.
Sort words in accordance with their 'weight' in the array.
Choose necessary amount of tags.
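The steps above could be sketched roughly as follows, in Python. The stop-word set here is a tiny, hypothetical stand-in for a real list (note that it includes "num" purely so the example sentence gives the tags from the question), and extract_tags is a made-up name:

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; use a real one in practice
STOP_WORDS = {"your", "is", "the", "a", "an", "of", "and", "to", "num"}

def extract_tags(text, limit=5):
    # Split into words, keeping a leading '#' so "#0123456789" survives
    words = re.findall(r"#?\w+", text.lower())
    # Remove stop words
    candidates = [w for w in words if w not in STOP_WORDS]
    # Count each remaining word and rank by that count (its 'weight')
    counts = Counter(candidates)
    # Keep the requested number of tags
    return [word for word, _ in counts.most_common(limit)]

print(extract_tags("Your order is num #0123456789"))
```

Raw frequency is the simplest possible weight; anything smarter (for example weighting against how common a word is across all documents) would replace the Counter step.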

Replace Bad words using Regex

I am trying to create a bad word filter method that I can call before every insert and update to check the string for any bad words and replace with "[Censored]".
I have an SQL table which has a list of bad words. I want to load them into a List or string array, check the text that has been passed in, replace any bad words that are found, and return the filtered string.
I am using C# for this.
Please see this "clbuttic" (or for your case cl[Censored]ic) article before doing a string replace without considering word boundaries:
http://www.codinghorror.com/blog/2008/10/obscenity-filters-bad-idea-or-incredibly-intercoursing-bad-idea.html
Update
Obviously not foolproof (see article above - this approach is so easy to get around or produce false positives...) or optimized (the regular expressions should be cached and compiled), but the following will filter out whole words (no "clbuttics") and simple plurals of words:
const string CensoredText = "[Censored]";
const string PatternTemplate = @"\b({0})(s?)\b";
const RegexOptions Options = RegexOptions.IgnoreCase;
string[] badWords = new[] { "cranberrying", "chuffing", "ass" };
IEnumerable<Regex> badWordMatchers = badWords.
    Select(x => new Regex(string.Format(PatternTemplate, x), Options));
string input = "I've had no cranberrying sleep for chuffing chuffings days - " +
    "the next door neighbour is playing classical music at full tilt!";
string output = badWordMatchers.
    Aggregate(input, (current, matcher) => matcher.Replace(current, CensoredText));
Console.WriteLine(output);
Gives the output:
I've had no [Censored] sleep for [Censored] [Censored] days - the next door neighbour is playing classical music at full tilt!
Note that "classical" does not become "cl[Censored]ical", as whole words are matched with the regular expression.
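The same whole-word idea can be sketched in Python. This is an illustration, not a port of the C# above: it combines all the words into one alternation and does a single pass instead of one regex per word, and it escapes each word in case the list contains regex metacharacters. The name censor is made up:

```python
import re

CENSORED = "[Censored]"

def censor(text, bad_words):
    """Replace whole bad words (and simple plurals) with CENSORED."""
    alternation = "|".join(re.escape(w) for w in bad_words)
    # \b on both sides keeps "classical" from matching "ass"
    pattern = re.compile(r"\b(?:" + alternation + r")s?\b", re.IGNORECASE)
    return pattern.sub(CENSORED, text)

text = ("I've had no cranberrying sleep for chuffing chuffings days - "
        "the next door neighbour is playing classical music at full tilt!")
print(censor(text, ["cranberrying", "chuffing", "ass"]))
```

It has the same weaknesses as the original: look-alike characters or inserted punctuation get straight past it.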
Update 2
And to demonstrate a flavour of how this (and in general basic string\pattern matching techniques) can be easily subverted, see the following string:
"I've had no cranberryıng sleep for chuffıng chuffıngs days - the next door neighbour is playing classical music at full tilt!"
I have replaced the "i"s with Turkish lower case dotless "ı"s. Still looks pretty offensive!
Although I'm a big fan of Regex, I think it won't help you here. You should fetch your bad words into a string List or string Array and use System.String.Replace on your incoming message.
Maybe better, use System.String.Split and .Join methods:
string mayContainBadWords = "... bla bla ...";
string[] badWords = new string[]{"bad", "worse", "worst"};
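// Split is an instance method; StringSplitOptions.None keeps the empty entries so that leading, trailing and adjacent bad words are still replaced
string[] temp = mayContainBadWords.Split(badWords, StringSplitOptions.None);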
string cleanString = string.Join("[Censored]", temp);
In the sample, mayContainBadWords is the string you want to check; badWords is a string array, you load from your bad word sql table and cleanString is your result.
You can use the string.Replace() method or the Regex class.
There is also a nice article about it, which can be found here.
With a little html-parsing skills, you can get a large list with swear words from noswear

.NET String parsing performance improvement - Possible Code Smell

The code below is designed to take a string in and remove any of a set of arbitrary words that are considered non-essential to a search phrase.
I didn't write the code, but need to incorporate it into something else. It works, and that's good, but it just feels wrong to me. However, I can't seem to get my head outside the box that this method has created to think of another approach.
Maybe I'm just making it more complicated than it needs to be, but I feel like this might be cleaner with a different technique, perhaps by using LINQ.
I would welcome any suggestions; including the suggestion that I'm over thinking it and that the existing code is perfectly clear, concise and performant.
So, here's the code:
private string RemoveNonEssentialWords(string phrase)
{
    //This array is being created manually for demo purposes. In production code it's passed in from elsewhere.
    string[] nonessentials = {"left", "right", "acute", "chronic", "excessive", "extensive",
                              "upper", "lower", "complete", "partial", "subacute", "severe",
                              "moderate", "total", "small", "large", "minor", "multiple", "early",
                              "major", "bilateral", "progressive"};
    int index = -1;
    for (int i = 0; i < nonessentials.Length; i++)
    {
        index = phrase.ToLower().IndexOf(nonessentials[i]);
        while (index >= 0)
        {
            phrase = phrase.Remove(index, nonessentials[i].Length);
            phrase = phrase.Trim().Replace("  ", " ");
            index = phrase.IndexOf(nonessentials[i]);
        }
    }
    return phrase;
}
Thanks in advance for your help.
Cheers,
Steve
This appears to be an algorithm for removing stop words from a search phrase.
Here's one thought: If this is in fact being used for a search, do you need the resulting phrase to be a perfect representation of the original (with all original whitespace intact), but with stop words removed, or can it be "close enough" so that the results are still effectively the same?
One approach would be to tokenize the phrase (using the approach of your choice - could be a regex, I'll use a simple split) and then reassemble it with the stop words removed. Example:
public static string RemoveStopWords(string phrase, IEnumerable<string> stop)
{
    var tokens = Tokenize(phrase);
    var filteredTokens = tokens.Where(s => !stop.Contains(s));
    return string.Join(" ", filteredTokens.ToArray());
}
public static IEnumerable<string> Tokenize(string phrase)
{
    return phrase.Split(' ');
    // Or use a regex, such as:
    // return Regex.Split(phrase, @"\W+");
}
This won't give you exactly the same result, but I'll bet that it's close enough and it will definitely run a lot more efficiently. Actual search engines use an approach similar to this, since everything is indexed and searched at the word level, not the character level.
I guess your code is not doing what you want it to do anyway. "moderated" would be converted to "d" if I'm right. To get a good solution you have to specify your requirements in a bit more detail. I would probably use Replace or regular expressions.
I would use a regular expression (created inside the function) for this task. I think it would be capable of doing all the processing at once without having to make multiple passes through the string or having to create multiple intermediate strings.
private string RemoveNonEssentialWords(string phrase)
{
    return Regex.Replace(phrase, // input
        @"\b(" + String.Join("|", nonessentials) + @")\b", // pattern
        "", // replacement
        RegexOptions.IgnoreCase)
        .Replace("  ", " ");
}
The \b at the beginning and end of the pattern makes sure that the match is on a boundary between alphanumeric and non-alphanumeric characters. In other words, it will not match just part of the word, like your sample code does.
Yeah, that smells.
I like little state machines for parsing, they can be self-contained inside a method using lists of delegates, looping through the characters in the input and sending each one through the state functions (which I have return the next state function based on the examined character).
For performance I would flush out whole words to a string builder after I've hit a separating character and checked the word against the list (might use a hash set for that)
I would create a hash table of the words to remove, then parse each word and drop it if it is in the hash table. That takes only one pass through the array, and I believe that creating a hash table is O(n).
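A minimal Python sketch of the hash-table idea, assuming words are separated by whitespace and that matching should ignore case (the function name is borrowed from the question, not from any library):

```python
def remove_nonessential_words(phrase, nonessentials):
    """Drop whole words found in nonessentials, in one pass over the tokens."""
    # Building the set is O(n); each membership test is O(1) on average
    stop = {w.lower() for w in nonessentials}
    kept = [word for word in phrase.split() if word.lower() not in stop]
    # Joining with single spaces also normalises any doubled whitespace
    return " ".join(kept)

print(remove_nonessential_words("Acute left  knee pain", ["left", "acute"]))
```

Since whole tokens are compared, a word such as "moderated" is left alone instead of being cut down to "d".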
How does this look?
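foreach (string nonEssent in nonessentials)
{
    // Strings are immutable: Replace returns a new string, so assign the result back
    phrase = phrase.Replace(nonEssent, String.Empty);
}
phrase = phrase.Replace("  ", " ");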
If you want to go the Regex route, you could do it like this. If you're going for speed it's worth a try and you can compare/contrast with other methods:
Start by creating a Regex from the array input. Something like:
var regexString = "\\b(" + string.Join("|", nonessentials) + ")\\b";
That will result in something like:
\b(left|right|chronic)\b
Then create a Regex object to do the find/replace:
System.Text.RegularExpressions.Regex regex = new System.Text.RegularExpressions.Regex(regexString, System.Text.RegularExpressions.RegexOptions.IgnoreCase);
Then you can just do a Replace like so:
string fixedPhrase = regex.Replace(phrase, "");
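For a quick experiment outside .NET, the same steps look like this in Python. This is a sketch with made-up helper names; escaping the words and collapsing the leftover whitespace are additions that the C# lines above do not include:

```python
import re

def build_remover(nonessentials):
    """Compile one case-insensitive pattern covering every word in the list."""
    alternation = "|".join(re.escape(w) for w in nonessentials)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)

def remove_words(phrase, remover):
    stripped = remover.sub("", phrase)
    # Collapse the whitespace runs left behind by the removed words
    return re.sub(r"\s+", " ", stripped).strip()

remover = build_remover(["left", "right", "chronic"])
print(remove_words("Chronic left knee pain", remover))
```

Compiling the pattern once and reusing it is what makes this worth comparing against the other approaches for speed.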
