Finding a word - String Operation or LINQ - C#

I have a string full of a few hundred words.
How would I get each "word" (which can also be a single letter, number, or punctuation mark) and remove it from the string as it is found?
Is this possible?
Example:
String:
"this is a string full of words and letters and also some punctuation! and num6er5."
As far as the algorithm is concerned, there are exactly 15 words in the above string.

What you're trying to do is known as tokenizing.
In C#, the string Split() method works pretty well. Used without any parameters, as in Niedermair's code, it returns an array of strings split on whitespace, like this:
"I have spaces" -> {"I", "have", "spaces"}
You can also give any chars to split on as a parameter to Split() (for instance, ',' or ';' to handle csv files).
The Split() method pays no heed to what goes into the strings, so any letters, numbers and other chars will be handled.
About removing the words from the string: you could write the string into a buffer to achieve this, but I seriously think that's going too far. Strings are immutable, which means that any time you remove the "next word" you have to recreate the entire string object.
It will be a lot easier to just Split() the entire string, throw the string away, and work with the array from there on.
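As a rough sketch of that approach, the snippet below splits the question's sample string on spaces and counts the resulting tokens. The variable names are only illustrative, and the RemoveEmptyEntries option is an added assumption to guard against doubled spaces.

```csharp
using System;

// Sample text taken from the question.
string text = "this is a string full of words and letters and also some punctuation! and num6er5.";

// Split on spaces; RemoveEmptyEntries drops the empty tokens that
// consecutive spaces would otherwise produce.
string[] tokens = text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

foreach (string token in tokens)
{
    Console.WriteLine(token);
}

Console.WriteLine(tokens.Length); // 15
```

Punctuation stays attached to its word ("punctuation!", "num6er5."), which matches the count of 15 given in the question.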

Related

Using Regex to Compare and Compensate Missing Characters from 2 Strings to One Final String

I have two strings that come from two different machine readings of the same code line. The two readers use different technologies because neither is always accurate, so either reading may have missing characters.
The form of the strings should be like this example string:
string s = "<0123456<:012345678:00112233445566778899<";
but because of the accuracy, they could be like this:
string reading1 = "?012?456<:012?45678:00112?33445566?78899<";
string reading2 = "??<0?23456??012?45676?00112?3344556?778890????";
where a question mark is an unreadable character that could be 0-9, <, or :, or could come from noise at the start or end of the code line, as in reading2.
Also, the length of the number in each of the 3 parts of s can vary.
I am new to Regex, and I need a way to use it to get one final string that fills in the missing characters from both strings. For the readings above, the final string should be:
string finalString = "<0123456<:012?45678:00112?33445566778899<";
reading1 has top priority, so if the same character differs between the two strings, the character from reading1 should be used, as with the last 9 in the code line. Characters that are unreadable in both strings should remain question marks in the final string.
I am not sure whether this can be implemented with Regex. I searched and found many Regex examples, but none of them matches my problem, although some solve parts of it.
Thank you.

Way to parse a string into segments when segments are maximum size or terminated with a particular character

I am looking to parse out a string in C# to get relevant data segments from the string.
The rule for one part of the data stream is for Address with this rule set:
Address with $ between address lines. Terminated with “^” if less than 29 characters.
Some examples:
28 Atol Av$Suite 2$^
Hiawatha Park$Apt 2037^
340 Brentwood Dr.$Fall Estate
There are other similar rules for segments, but if I have a solid plan for this segment I can modify it for the rest of the parsing.
I am wondering if there is a regex that could be used.
I have .{0,29}\^ which seems to do the trick. I wasn't escaping the ^ initially.
thanks,
Dan
You can use string.Split() to do this.
// input is the address segment, e.g. "28 Atol Av$Suite 2$^"
string[] substrings = input.Split('$');
Now you have an array of strings that contains the values between the '$' characters.
Then, I imagine you just want to get rid of the '^' character on the last element of the array (if it exists).
int index = substrings.Length - 1;
substrings[index] = substrings[index].TrimEnd('^');
You can use regular expressions and Regex.Split(), but you really don't need it if all you need to do is split on '$' and trim '^'. Writing a regular expression for this would be overkill.
EDIT: Now that I think of it, you could split on both '$' and '^' and just discard the empty entries, saving you the trimming step.
string[] substrings = input.Split("$^".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
I'll leave the pre-edit code as-is since it's more explicit, and explains the usage better.
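To illustrate the second variant, here is a minimal sketch run against the three sample addresses from the question. It assumes each address segment has already been isolated from the rest of the data stream; SplitAddress is a hypothetical helper name, not something from the original post.

```csharp
using System;

// Hypothetical helper: splits one address segment on '$' and '^'
// and drops the empty entry left behind by a trailing terminator.
string[] SplitAddress(string segment) =>
    segment.Split(new[] { '$', '^' }, StringSplitOptions.RemoveEmptyEntries);

string[] samples =
{
    "28 Atol Av$Suite 2$^",
    "Hiawatha Park$Apt 2037^",
    "340 Brentwood Dr.$Fall Estate" // exactly 29 characters, so no '^' terminator
};

foreach (string sample in samples)
{
    Console.WriteLine(string.Join(" | ", SplitAddress(sample)));
}
```

Each sample yields two address lines, whether or not the terminator is present.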

Regex.Split command in c#

I am trying to use Regex.Split to split a string while keeping all of its contents, including the delimiters I use. The string is a math problem, for example 5+9/2*1-1. I have it working if the string contains a + sign, but I don't know how to add more than one operator to the delimiter list. I have looked at multiple pages online, but everything I try gives me errors. Here is the Regex.Split line I have (it works for the plus; now I need it to also handle -, *, and /):
string[] everything = Regex.Split(inputBox.Text, @"(\+)");
Use a character class to match any of the math operations: [*/+-]
string input = "5+9/2*1-1";
string pattern = @"([*/+-])";
string[] result = Regex.Split(input, pattern);
Be aware that character classes allow ranges, such as [0-9], which matches any digit from 0 up to 9. Therefore, to avoid accidental ranges, you can escape the - or place it at either the beginning or end of the character class.
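For reference, a small self-contained sketch of what that split produces for the question's expression. The variable name parts is only illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "5+9/2*1-1";

// The capturing group makes Regex.Split keep each matched operator in the result.
// The hyphen sits last in the class, so it is a literal rather than a range.
string[] parts = Regex.Split(input, @"([*/+-])");

Console.WriteLine(string.Join(" ", parts)); // 5 + 9 / 2 * 1 - 1
```

Without the parentheses around the character class, the operators would be dropped and only the numbers returned.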

C# Trouble with Regex.Replace

Been scratching my head all day about this one!
Ok, so I have a string which contains the following:
?\"width=\"1\"height=\"1\"border=\"0\"style=\"display:none;\">');
I want to convert that string to the following:
?\"width=1height=1border=0style=\"display:none;\">');
I could theoretically just do a String.Replace on "\"1\"" etc. But this isn't really a viable option as the string could theoretically have any number within the expression.
I also thought about removing the string "\"", however there are other occurrences of this which I don't want to be replaced.
I have been attempting to use the Regex.Replace method as I believe this exists to solve problems along my lines. Here's what I've got:
chunkContents = Regex.Replace(chunkContents, "\".\"", ".");
Now that really messes things up (it replaces the correct elements, but with a full stop), but I think you can see what I am attempting to do with it. I am also worried that this will only work for single digits (\"1\" rather than \"11\"). That led me to think about using "*" or "+" rather than ".", but I foresaw the problem of this picking up all of the text in between the desired characters (which are dotted all over the place), whereas I only want to replace the ones with numeric characters between them.
Hope I've explained that clearly enough, will be happy to provide any extra info if needed :)
Try this
var str = "?\"width=\"1\"height=\"1234\"border=\"0\"style=\"display:none;\">');";
str = Regex.Replace(str , "\"(\\d+)\"", "$1");
(\\d+) is a capturing group that looks for one or more digits and $1 references what the group captured.
This works
String input = @"?\""width=\""1\""height=\""1\""border=\""0\""style=\""display:none;\"">');";
//replace the entire match of the regex with only what's captured (the number)
String result = Regex.Replace(input, @"\\""(\d+)\\""", match => match.Result("$1"));
//control string for expected result
String shouldBe = @"?\""width=1height=1border=0style=\""display:none;\"">');";
//prints True
Console.WriteLine(result.Equals(shouldBe).ToString());

Regex Word splitting in C#

I know similar questions have been asked before, but I can't find one that is like mine, or enough like mine to help me out :). So essentially I want to split up a string which contains a bunch of words, and I don't want to return any characters that are not words (this is the key problem I am struggling with, ignoring characters). This is how I define the problem:
What constitutes a word is a string of any character a-zA-Z only
(no numbers or anything else)
In between any word, there can be any number of random other characters
I want to get back a string[] containing only the words
eg: text: "apple^&**^orange1247pear"
I want to return: apple, orange, pear in an array.
The closest I have found I suppose is this:
Regex.Split("apple^orange7pear", @"([a-zA-Z]*)")
Which splits out the apple/orange/pear, but also returns a bunch of other junk and blank strings.
Anyone know how to stop the split function from returning certain parts of the string, or is that not possible?
Thanks in advance for any help you give me :)
Split should match the tokens between your words. In your regex you've added a group around the word, so it is included in the result, but that isn't desired in this case. Note that this regex matches anything besides valid words - anything that isn't an ASCII letter:
string[] words = Regex.Split(str, "[^a-zA-Z]+");
Another option is to match the words directly:
MatchCollection matches = Regex.Matches(str, "[a-zA-Z]+");
string[] words2 = matches.Cast<Match>().Select(m => m.Value).ToArray();
The second option is probably clearer, and will not include blank elements on the start or end of the array.
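The difference shows up when the text starts or ends with non-letters. The sketch below uses a made-up input (the question's example with an extra leading ^ and trailing !) to compare the two options.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical input that begins and ends with non-letter characters.
string sample = "^apple^&**^orange1247pear!";

// Option 1: split on runs of non-letters. The leading and trailing
// separators leave an empty string at each end of the array.
string[] viaSplit = Regex.Split(sample, "[^a-zA-Z]+");

// Option 2: match the words directly; no empty entries appear.
string[] viaMatches = Regex.Matches(sample, "[a-zA-Z]+")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();

Console.WriteLine(viaSplit.Length);   // 5 ("", apple, orange, pear, "")
Console.WriteLine(viaMatches.Length); // 3 (apple, orange, pear)
```

If Split is preferred anyway, the empty entries can be filtered out afterwards, for example with Where(s => s.Length > 0).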
var splits = Regex.Split("aaa $$$bbb ccc", @"[^A-Za-z]+");
But to include non-latin letters, I would use this:
var splits = Regex.Split("aaa $$$bbb ccc", @"\P{L}+");
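A quick sketch of why the Unicode category matters, using a made-up sample with accented letters (written as escapes so the source encoding does not affect the result):

```csharp
using System;
using System.Text.RegularExpressions;

// "héllo wörld", with the accented letters as precomposed code points.
string sample = "h\u00E9llo w\u00F6rld";

// The ASCII-only class treats the accented letters as separators
// and breaks each word apart.
string[] asciiOnly = Regex.Split(sample, @"[^A-Za-z]+");

// \P{L} matches anything that is not a Unicode letter, so the words stay whole.
string[] anyLetter = Regex.Split(sample, @"\P{L}+");

Console.WriteLine(asciiOnly.Length); // 4 (h, llo, w, rld)
Console.WriteLine(anyLetter.Length); // 2
```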
Try this:
Regex.Matches("kalle kula(/()&//()nisse8978971", @"[A-Za-z]+")
Matches() collects only the words, whereas Split() divides the string on the pattern, which is not what you want here.
The second option Kobi listed is better and easier to control. I use the following regular expression to locate common entities such as words, numbers, and email addresses in a string:
var regex = new Regex(@"[\p{L}\p{N}\p{M}]+(?:[-.'´_@][\p{L}|\p{N}|\p{M}]+)*", RegexOptions.Compiled);
