Splitting a string into words in a culture neutral way

I've come up with the method below, which aims to split a text of variable length into an array of words for further full-text index processing (stop word removal, followed by a stemmer). The results seem to be OK, but I would like to hear opinions on how reliable this implementation would be against texts in different languages. Would you recommend using a regex for this instead? Please note that I've opted against using String.Split() because that would require me to pass a list of all known separators, which is exactly what I was trying to avoid when I wrote the function.
P.S.: I can't use a full-blown full-text search engine like Lucene.Net for several reasons (Silverlight, overkill for the project scope, etc.).
public string[] SplitWords(string Text)
{
    var result = new List<string>();
    // Guard against null or empty input (the original indexed Text[0] unconditionally).
    if (string.IsNullOrEmpty(Text))
        return result.ToArray();
    bool inWord = false;
    var sbWord = new StringBuilder();
    for (int i = 0; i < Text.Length; i++)
    {
        Char c = Text[i];
        // non separator char?
        if (!Char.IsSeparator(c) && !Char.IsControl(c))
        {
            if (!inWord)
            {
                sbWord = new StringBuilder();
                inWord = true;
            }
            if (!Char.IsPunctuation(c) && !Char.IsSymbol(c))
                sbWord.Append(c);
        }
        // it is a separator or control char
        else
        {
            if (inWord)
            {
                string word = sbWord.ToString();
                if (word.Length > 0)
                    result.Add(word);
                sbWord.Clear();
                inWord = false;
            }
        }
    }
    // Flush the last word when the text does not end with a separator.
    if (inWord && sbWord.Length > 0)
        result.Add(sbWord.ToString());
    return result.ToArray();
}

Since you asked for a culture-neutral way, I really doubt that a regular expression (word boundary: \b) will do. I have googled a bit and found this. I hope it is useful.
I am pretty surprised that there is no built-in equivalent of Java's BreakIterator...
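As a rough middle ground between \b and a full break iterator, a pattern built on Unicode general categories can pick out runs of letters, combining marks and digits without listing any separators. This is only a sketch under stated assumptions: the WordSplitter class and its method name are made up for illustration, and it will not segment languages that are written without spaces between words (Chinese, Japanese, Thai, for example), which need dictionary-based segmentation.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original question or answer.
public static class WordSplitter
{
    // A word is a run of letters (L), combining marks (M) and decimal digits (Nd).
    // Everything else (separators, punctuation, symbols, controls) ends the word.
    private static readonly Regex WordPattern = new Regex(@"[\p{L}\p{M}\p{Nd}]+");

    public static string[] SplitWords(string text)
    {
        if (string.IsNullOrEmpty(text))
            return new string[0];

        return WordPattern.Matches(text)
                          .Cast<Match>()
                          .Select(m => m.Value)
                          .ToArray();
    }
}
```

Note one behavioural difference from the method in the question: punctuation inside a word splits it here (don't becomes don and t), whereas the question's loop drops the apostrophe and keeps dont as one word.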

Related

how to find text in a string in c#

I am learning .NET and C# on my own.
How can I find out whether a given text exists in a string and, if it does, how many times the word is repeated in that string? And if the word is misspelled, how can I still find it and print that the word is misspelled?
We can do this with collections or LINQ in C#, but here I used the string class and its Contains method, and I am stuck after that.
If this can be done with LINQ, how? LINQ works with collections, right? You need a list in order to use LINQ, but here we are working with a string (a paragraph). How can LINQ be used to find a word in a paragraph?
Kindly help. Here is what I have tried so far:
string str = "Education is a ray of light in the darkness. It certainly is a hope for a good life. Eudcation is a basic right of every Human on this Planet. To deny this right is evil. Uneducated youth is the worst thing for Humanity. Above all, the governments of all countries must ensure to spread Education";
if (str.Contains("Education"))
{
    Console.WriteLine("found");
}
else
{
    Console.WriteLine("not found");
}

You can turn a string into a string[] by splitting it on a character or string. Then you can use LINQ:
if (str.Split().Contains("makes"))
{
    // note that Split without arguments splits on all white-space characters, including tabs and new-lines
}
If you don't care whether it is a word or just a sub-string, you can use str.Contains("makes") directly.
If you want to compare in a case-insensitive way, use the overload of Contains that takes a comparer:
if (str.Split().Contains("makes", StringComparer.InvariantCultureIgnoreCase)) {}

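string str = "money makes many makes things";
// Split(' ') compiles on every .NET version; the Split(" ") string overload needs .NET Core 2.0 or later.
var strArray = str.Split(' ');
var count = strArray.Count(x => x == "makes");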

The simplest way is to use the Split method to split the string into an array of words.
Here is an example:
var words = str.Split(' ');
if (words.Length > 0)
{
    foreach (var word in words)
    {
        // Prints one line per word, so expect "not found" for every non-matching word.
        if (word.IndexOf("makes", StringComparison.InvariantCultureIgnoreCase) != -1)
        {
            Console.WriteLine("found");
        }
        else
        {
            Console.WriteLine("not found");
        }
    }
}
Now, since you just want the number of occurrences of the word, you can use LINQ to do that in a single line like this:
var totalOccurrences = str.Split(' ').Count(x => x.IndexOf("makes", StringComparison.InvariantCultureIgnoreCase) != -1);
Note that StringComparison.InvariantCultureIgnoreCase is required if you want a case-insensitive comparison.
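The answers above cover finding and counting, but not the misspelling part of the question. One possible approach, sketched here under assumptions, is to treat any word within a small edit distance of the search term as a likely misspelling. The WordFinder class, its method names and the distance threshold of 2 are all hypothetical choices for illustration; the edit distance is the standard Levenshtein distance.

```csharp
using System;
using System.Linq;

// Hypothetical helper: counts exact matches and flags near misses by edit distance.
public static class WordFinder
{
    private static readonly char[] Punctuation = { '.', ',', ';', ':', '!', '?', '"', '(', ')' };

    // Split on white space and strip leading/trailing punctuation from each word.
    public static string[] Words(string text)
    {
        return text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
                   .Select(w => w.Trim(Punctuation))
                   .Where(w => w.Length > 0)
                   .ToArray();
    }

    // Classic two-row Levenshtein distance (insert, delete, substitute each cost 1).
    public static int Levenshtein(string a, string b)
    {
        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            var tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.Length];
    }

    // Case-insensitive count of exact word matches.
    public static int CountExact(string text, string word)
    {
        return Words(text).Count(w => string.Equals(w, word, StringComparison.OrdinalIgnoreCase));
    }

    // Words that are not exact matches but lie within maxDistance edits of the search term.
    public static string[] FindNearMisses(string text, string word, int maxDistance)
    {
        string target = word.ToLowerInvariant();
        return Words(text)
            .Where(w =>
            {
                int d = Levenshtein(w.ToLowerInvariant(), target);
                return d > 0 && d <= maxDistance;
            })
            .ToArray();
    }
}
```

With a threshold of 2, a transposition such as Eudcation is reported as a near miss of Education, while a longer related word such as Uneducated is not. A small threshold keeps short, unrelated words from being flagged, but it is a heuristic and not a spell checker.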

Regex match up to the end of a standard pattern

I'm working on an application to manage the filenames of downloaded TV shows. Basically, it will search the directory and clean up the filenames: replacing full stops with spaces and getting rid of the description at the end of the filename (e.g. .1080p.BluRay.x264-ROVERS) that follows the easily recognizable pattern of, for example, S01E13.
What I want is a regular expression for use in C# that extracts whatever comes before the SnnEnn pattern, including the pattern itself (where n is any digit).
But I don't know enough regex to get going.
For example, if I had the filename TV.Show.S01E01.1080p.BluRay.x264-ROVERS, the query would only return TV.Show.S01E01, irrespective of how many words come before the pattern, so it could be TV.Show.On.ABC.S01E01 and it would still work.
Thanks for any help :)

Try this:
string input = "TV.Show.S01E01.1080p.BluRay.x264-ROVERS";
string pattern = @"(?'pattern'^.*\d\d[A-Z]\d\d)";
string results = Regex.Match(input, pattern).Groups["pattern"].Value;

There is a more obvious way without regex:
string GetNameByPattern(string s)
{
    const int patternLength = 6; // SnnEnn
    // <= so that a pattern at the very end of the string is still found
    for (int i = 0; i <= s.Length - patternLength; i++)
    {
        string part = s.Substring(i, patternLength);
        if (part[0] == 'S' && part[3] == 'E') // candidate
            if (Char.IsDigit(part[1]) && Char.IsDigit(part[2]) && Char.IsDigit(part[4]) && Char.IsDigit(part[5]))
                return s.Substring(0, i + patternLength);
    }
    return "";
}
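The question also asks for the full stops to be replaced with spaces. A minimal sketch that combines both steps, assuming the episode marker is always an uppercase S and E with exactly two digits each, could look like this. The EpisodeName class and the Clean method are hypothetical names, and returning null for a filename without a marker is an arbitrary choice.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: trims a release name after its SnnEnn marker and tidies the dots.
public static class EpisodeName
{
    // Lazily match everything up to and including the first SnnEnn marker.
    private static readonly Regex UpToEpisode = new Regex(@"^(.*?S\d{2}E\d{2})");

    // Returns the cleaned name, or null when no SnnEnn marker is present.
    public static string Clean(string fileName)
    {
        Match m = UpToEpisode.Match(fileName);
        if (!m.Success)
            return null;

        return m.Groups[1].Value.Replace('.', ' ');
    }
}
```

The lazy quantifier stops at the first marker, so any number of dot-separated words may precede it.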

Match a string against an easy pattern

I am trying to future-proof a program I am creating so that the pattern I need users to enter is not hard-coded. There is always a chance that the letter or number pattern can change, but when it does I need everyone to remain consistent. Plus, I want the managers to be able to control what goes in without relying on me. Is it possible to use regex or another string tool to compare input against a list stored in a database? I want it to be easy, so the patterns stored in the database would look like X###### or X######-X####### and so on.

Sure, just store the regular expression rules in a string column in a table and then load them into an IEnumerable<Regex> in your app. Then, the input matches if ANY of those rules match. Beware that conflicting rules could be prone to a greedy race (the first one to be checked wins), so you'd have to be careful there. Also be aware that there are many optimizations that you could perform beyond my example, which is designed to be simple.
List<string> regexStrings = db.GetRegexStrings();
var result = new List<Regex>(regexStrings.Count);
foreach (var regexString in regexStrings)
{
    result.Add(new Regex(regexString));
}
...
// The check
bool matched = result.Any(i => i.IsMatch(testInput));

You could store your patterns as-is in your database, and then translate them to regexes.
I don't know specifically what characters you'd need in your format, but let's suppose you just want # to stand for a digit and leave the rest as-is. Here's some code for that:
public static Regex ConvertToRegex(string pattern)
{
    var sb = new StringBuilder();
    sb.Append("^");
    foreach (var c in pattern)
    {
        switch (c)
        {
            case '#':
                sb.Append(@"\d");
                break;
            default:
                sb.Append(Regex.Escape(c.ToString()));
                break;
        }
    }
    sb.Append("$");
    return new Regex(sb.ToString());
}
You can also use options like RegexOptions.IgnoreCase if that's what you need.
NB: Regex.Escape also escapes the # character, because # starts a comment when RegexOptions.IgnorePatternWhitespace is in effect. Escaping the whole pattern up front would therefore turn every # into \#, so I just went for the character-by-character approach.
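The two answers above can be combined: keep the human-readable masks in the database, translate each one to a regex, and accept the input if any mask matches. The sketch below assumes the masks arrive as a plain list of strings (the database access is left out), uses [0-9] so that only ASCII digits are accepted, and the MaskMatcher class and its method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper: checks input against a list of masks such as "X######".
public static class MaskMatcher
{
    // Translate a mask into an anchored regex; '#' stands for exactly one ASCII digit,
    // every other character must appear literally.
    public static Regex ToRegex(string mask)
    {
        var sb = new StringBuilder("^");
        foreach (char c in mask)
            sb.Append(c == '#' ? "[0-9]" : Regex.Escape(c.ToString()));
        sb.Append("$");
        return new Regex(sb.ToString());
    }

    // True when the input satisfies at least one of the masks.
    public static bool MatchesAny(IEnumerable<string> masks, string input)
    {
        return masks.Select(m => ToRegex(m)).Any(r => r.IsMatch(input));
    }
}
```

In a real application the regexes would probably be built once when the masks are loaded and cached, instead of being rebuilt on every check as this sketch does.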

private bool TestMethod()
{
    const string textPattern = "X###";
    string text = textBox1.Text;
    bool match = true;
    if (text.Length == textPattern.Length)
    {
        char[] chrStr = text.ToCharArray();
        char[] chrPattern = textPattern.ToCharArray();
        int length = text.Length;
        for (int i = 0; i < length; i++)
        {
            if (chrPattern[i] != '#')
            {
                if (chrPattern[i] != chrStr[i])
                {
                    return false;
                }
            }
        }
    }
    else
    {
        return false;
    }
    return match;
}
This is doing everything I need it to do now. Thanks for all the tips though. I will have to look into regex more in the future.

Using MaskedTextProvider, you could do something like this:
using System.Globalization;
using System.ComponentModel;
string pattern = "X&&&&&&-X&&&&&&&";
string text = "Xabcdef-Xasdfghi";
var culture = CultureInfo.GetCultureInfo("sv-SE");
var matcher = new MaskedTextProvider(pattern, culture);
int position;
MaskedTextResultHint hint;
if (!matcher.Set(text, out position, out hint))
{
Console.WriteLine("Error at {0}: {1}", position, hint);
}
else if (!matcher.MaskCompleted)
{
Console.WriteLine("Not enough characters");
}
else if (matcher.ToString() != text)
{
Console.WriteLine("Missing literals");
}
else
{
Console.WriteLine("OK");
}
For a description of the format, see: http://msdn.microsoft.com/en-us/library/system.windows.forms.maskedtextbox.mask

Check string for invalid characters? Smartest way?

I would like to check a string for invalid characters. By invalid characters I mean characters that should not be there. Which characters are these? That varies, but I think that's not important. What matters is how I should do it, and what the easiest and best (in terms of performance) way to do it is.
Let's say I just want strings that contain 'A-Z', 'empty', '.', '$', '0-9'
So if I have a string like "HELLO STaCKOVERFLOW" => invalid, because of the 'a'.
OK, now how to do that? I could make a List<char>, put every char in it that is not allowed, and check the string against this list. Maybe not a good idea, because there would be a lot of chars. But I could make a list that contains all of the allowed chars, right? And then? Do I have to compare every char in the string against the List<char>? Is there any smart code for this? And another question: if I want to add A-Z to the List<char>, I have to add 26 chars manually, but as far as I know these chars are 65-90 in the ASCII table. Can I add them more easily? Any suggestions? Thank you

You can use a regular expression for this. The negated character class matches any single character outside the allowed set, so a match anywhere in the string means it is invalid:
Regex r = new Regex("[^A-Z0-9.$ ]");
if (r.IsMatch(SomeString)) {
    // validation failed
}
To create a list of characters from A-Z or 0-9 you would use a simple loop:
for (char c = 'A'; c <= 'Z'; c++) {
    // c or c.ToString() depending on what you need
}
But you don't need that with the Regex - pretty much every regex engine understands the range syntax (A-Z).

I have only just written such a function, along with an extended version that restricts the first and last characters when needed. The original function merely checks whether the string consists of valid characters only. The extended function adds two integers: the number of valid characters at the beginning of the list to skip when checking the first character and the last character, respectively. In practice it simply calls the original function three times. In the example below it ensures that the string begins with a letter and doesn't end with an underscore.
StrChr(String, "_0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ");
StrChrEx(String, "_0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ", 11, 1);
BOOL __cdecl StrChr(CHAR* str, CHAR* chars)
{
for (int s = 0; str[s] != 0; s++)
{
int c = 0;
while (true)
{
if (chars[c] == 0)
{
return false;
}
else if (str[s] == chars[c])
{
break;
}
else
{
c++;
}
}
}
return true;
}
BOOL __cdecl StrChrEx(CHAR* str, CHAR* chars, UINT excl_first, UINT excl_last)
{
char first[2] = {str[0], 0};
char last[2] = {str[strlen(str) - 1], 0};
if (!StrChr(str, chars))
{
return false;
}
if (excl_first != 0)
{
if (!StrChr(first, chars + excl_first))
{
return false;
}
}
if (excl_last != 0)
{
if (!StrChr(last, chars + excl_last))
{
return false;
}
}
return true;
}

If you are using C#, you can do this easily using a List and Contains. You can do this with single characters (as strings) or with multi-character strings just the same:
var pn = "The String To ChecK";
var badStrings = new List<string>()
{
" ","\t","\n","\r"
};
foreach(var badString in badStrings)
{
if(pn.Contains(badString))
{
//Do something
}
}

If you're not super good with regular expressions, then there is another way to go about this in C#. Here is a block of code I wrote to test a string variable named notifName:
var alphabet = "a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z";
var numbers = "0,1,2,3,4,5,6,7,8,9";
var specialChars = " ,(,),_,[,],!,*,-,.,+,-";
var validChars = (alphabet + "," + alphabet.ToUpper() + "," + numbers + "," + specialChars).Split(',');
for (int i = 0; i < notifName.Length; i++)
{
if (Array.IndexOf(validChars, notifName[i].ToString()) < 0) {
errorFound = $"Invalid character '{notifName[i]}' found in notification name.";
break;
}
}
You can change the characters added to the array as needed. The Array IndexOf method is the key to the whole thing. Of course if you want commas to be valid, then you would need to choose a different split character.

Not enough rep to comment directly, but I recommend the Regex approach. One small caveat: you probably need to anchor both ends of the input string, and you will want at least one character to match. So (with thanks to ThiefMaster), here's my regex to validate user input for a simple arithmetical calculator (plus, minus, multiply, divide):
Regex r = new Regex(@"^[0-9\.\-\+\*\/ ]+$");

I'd go with a regex, but still need to add my 2 cents here, because all the proposed non-regex solutions are O(MN) in the worst case (string is valid) which I find repulsive for religious reasons.
Even more so when LINQ offers a simpler and more efficient solution than nesting loops:
var isInvalid = "The String To Test".Intersect("ALL_INVALID_CHARS").Any();
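The Intersect one-liner checks against a list of forbidden characters, whereas the question describes a list of allowed ones. A whitelist variant with a HashSet keeps each lookup at constant expected time, so the whole check is linear in the length of the input. This is a sketch only: the CharValidator class is a hypothetical name, and it assumes that 'empty' in the question means the space character.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: validates that a string uses only whitelisted characters.
public static class CharValidator
{
    private static readonly HashSet<char> Allowed = BuildAllowed();

    // Allowed set from the question: A-Z, 0-9, space, '.' and '$'.
    private static HashSet<char> BuildAllowed()
    {
        var set = new HashSet<char>(" .$");
        for (char c = 'A'; c <= 'Z'; c++) set.Add(c);
        for (char c = '0'; c <= '9'; c++) set.Add(c);
        return set;
    }

    // True when every character of the input is in the allowed set.
    // An empty string is considered valid because it contains no invalid character.
    public static bool IsValid(string input)
    {
        return input.All(c => Allowed.Contains(c));
    }
}
```

The loops over 'A'..'Z' and '0'..'9' also answer the side question about adding ranges without typing every character by hand.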

What's the most efficient way to format the following string?

I have a very simple question, and I shouldn't be hung up on this, but I am. Haha!
I have a string that I receive in the following format(s):
123
123456-D53
123455-4D
234234-4
123415
The desired output, post formatting, is:
123-455-444
123-455-55
123-455-5
or
123-455
The format is ultimately dependent upon the total number of characters in the original string.
I have several ideas of how to do this, but I keep thinking there's a better way than string.Replace and concatenation...
Thanks for the suggestions..
Ian

Tanascius is right, but I can't comment or upvote due to my lack of rep. If you want additional info on string.Format, I've found this helpful:
http://blog.stevex.net/string-formatting-in-csharp/

I assume this does not merely rely upon the inputs always being numeric? If so, I'm thinking of something like this
private string ApplyCustomFormat(string input)
{
StringBuilder builder = new StringBuilder(input.Replace("-", ""));
int index = 3;
while (index < builder.Length)
{
builder.Insert(index, "-");
index += 4;
}
return builder.ToString();
}

Here's a method that uses a combination of regular expressions and LINQ to extract groups of up to three word characters at a time and then joins them together again. Note: it assumes that the input has already been validated. The validation can also be done with a regular expression.
string s = "123456-D53";
string[] groups = Regex.Matches(s, @"\w{1,3}")
    .Cast<Match>()
    .Select(match => match.Value)
    .ToArray();
string result = string.Join("-", groups);
Result:
123-456-D53

EDIT: See history for old versions.
You could use char.IsDigit() for finding digits, only.
var output = new StringBuilder();
var digitCount = 0;
foreach( var c in input )
{
if( char.IsDigit( c ) )
{
output.Append( c );
digitCount++;
if( digitCount % 3 == 0 )
{
output.Append( "-" );
}
}
}
// Remove possible last -
return output.ToString().TrimEnd('-');
This code should fill from left to right (now I got it, first read, then code) ...
Sorry, I still can't test this right now.

Not the fastest, but easy on the eyes (ed: to read):
string Normalize(string value)
{
if (String.IsNullOrEmpty(value)) return value;
int appended = 0;
var builder = new StringBuilder(value.Length + value.Length/3);
for (int ii = 0; ii < value.Length; ++ii)
{
if (Char.IsLetterOrDigit(value[ii]))
{
builder.Append(value[ii]);
if ((++appended % 3) == 0) builder.Append('-');
}
}
return builder.ToString().TrimEnd('-');
}
Uses a guess to pre-allocate the StringBuilder's length. This will accept any Alphanumeric input with any amount of junk being added by the user, including excess whitespace.
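For comparison, the same idea can be written as two explicit steps: strip everything that is not a letter or digit, then cut the remainder into chunks of three and join them. This is a sketch under the assumption that any non-alphanumeric character is noise; the Grouper class and the GroupInThrees method are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: regroups an identifier into dash-separated chunks of three.
public static class Grouper
{
    public static string GroupInThrees(string input)
    {
        // Step 1: keep only letters and digits (drops dashes, spaces and other junk).
        string clean = new string((input ?? string.Empty).Where(c => char.IsLetterOrDigit(c)).ToArray());

        // Step 2: slice into chunks of at most three characters.
        var groups = new List<string>();
        for (int i = 0; i < clean.Length; i += 3)
            groups.Add(clean.Substring(i, Math.Min(3, clean.Length - i)));

        // Step 3: join with dashes; no trailing dash to trim.
        return string.Join("-", groups);
    }
}
```

Because the chunks are joined at the end, there is no trailing separator to remove, which is the one step the StringBuilder versions above need TrimEnd for.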
