Find if string contains at least 2 characters similar to another? C#

I need a method to check whether a string shares a run of two or more adjacent characters with another string. I don't want to match every string that merely contains the letter "D".
For example, if I have the string "Christopher" and want to see if "Chris" is contained in it, I want that to match. However, if I check "Candy" against "Christopher", I don't want a match just because they have a "C" in common.
I have tried the .Contains() method but can't give it a rule for 2 or more similar characters. I have thought about using regular expressions, but that might be overkill. The similar letters must be next to each other.
Thank you :)

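This takes each two-character substring (2-gram) of s1 and searches for it in s2.
string s1 = "Chrx";
string s2 = "Christopher";
bool match = IsMatchOn2Characters(s1, s2); // true: "ch" and "hr" both occur in s2
static bool IsMatchOn2Characters(string a, string b)
{
    // Lower-case both strings so the comparison ignores case.
    string s1 = a.ToLowerInvariant();
    string s2 = b.ToLowerInvariant();
    for (int i = 0; i < s1.Length - 1; i++)
    {
        // Ordinal search for the pair starting at position i.
        if (s2.IndexOf(s1.Substring(i, 2), StringComparison.Ordinal) >= 0)
            return true; // match
    }
    return false; // no match
}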

This looks a lot like a longest common substring problem, which can be solved with dynamic programming in O(m*n), where m and n are the lengths of the two strings.
If you are not worried about performance and don't really want to implement this, you can also go with the brute-force solution of searching for every substring of s1 in s2.
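The dynamic-programming idea mentioned above could be sketched as follows. This is only an illustration, not code from the answer: the class and method names (CommonSubstringSketch, LongestCommonSubstringLength, SharesRun) are made up for this example, and it assumes a case-insensitive comparison as in the other answer.

```csharp
using System;

static class CommonSubstringSketch
{
    // Length of the longest run of adjacent characters shared by a and b (case-insensitive).
    public static int LongestCommonSubstringLength(string a, string b)
    {
        string s1 = a.ToLowerInvariant();
        string s2 = b.ToLowerInvariant();
        // previous[j] holds the length of the common run ending at
        // s1[i - 2] and s2[j - 1], i.e. the row computed for the previous i.
        int[] previous = new int[s2.Length + 1];
        int best = 0;
        for (int i = 1; i <= s1.Length; i++)
        {
            int[] current = new int[s2.Length + 1];
            for (int j = 1; j <= s2.Length; j++)
            {
                if (s1[i - 1] == s2[j - 1])
                {
                    // Extend the run that ended one character earlier in both strings.
                    current[j] = previous[j - 1] + 1;
                    best = Math.Max(best, current[j]);
                }
            }
            previous = current;
        }
        return best;
    }

    // True when the two strings share a run of at least minLength characters.
    public static bool SharesRun(string a, string b, int minLength)
    {
        return LongestCommonSubstringLength(a, b) >= minLength;
    }

    static void Main()
    {
        Console.WriteLine(SharesRun("Chris", "Christopher", 2)); // True
        Console.WriteLine(SharesRun("Candy", "Christopher", 2)); // False
    }
}
```

For a fixed threshold of 2 the 2-gram loop in the other answer is simpler; the table only pays off if you also want to know how long the shared run is.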

Related

Split string with plus sign as a delimiter

I have an issue with a string containing the plus sign (+).
I want to split that string (or if there is some other way to solve my problem)
string ColumnPlusLevel = "+-J10+-J10+-J10+-J10+-J10";
string strpluslevel = "";
strpluslevel = ColumnPlusLevel;
string[] strpluslevel_lines = Regex.Split(strpluslevel, "+");
foreach (string line in strpluslevel_lines)
{
MessageBox.Show(line);
strpluslevel_summa = strpluslevel_summa + line;
}
MessageBox.Show(strpluslevel_summa, "summa sumarum");
The MessageBox is for my testing purpose.
Now... The ColumnPlusLevel string can have very varied content, but it is always a repeated pattern starting with the plus sign,
e.g. "+MJ+MJ+MJ" or "+PPL14.1+PPL14.1+PPL14.1".
(It comes from another piece of software and I can't edit the output from that software.)
How can I find out what pattern is being repeated?
In these examples that is +-J10 or +MJ or +PPL14.1.
In my case above I have tested it using only a MessageBox to show the result, but I want the repeated pattern stored in a string later on.
Maybe I'm doing it wrong by using Split; maybe there is another solution.
Maybe I use Split in the wrong way.
I hope you understand my problem and the result I want.
Thanks for any advice.
/Tomas

How can I find out what that pattern is that is being repeated?
Maybe I didn't understand the requirement fully, but isn't it as easy as:
string[] tokens = ColumnPlusLevel.Split(new[]{'+'}, StringSplitOptions.RemoveEmptyEntries);
string first = tokens[0];
bool repeatingPattern = tokens.Skip(1).All(s => s == first);
If repeatingPattern is true you know that the pattern itself is first.
Can you maybe explain how the logic works?
The line which contains tokens.Skip(1) is a LINQ query, so you need to add using System.Linq at the top of your code file. Since tokens is a string[], which implements IEnumerable<string>, you can use any LINQ extension method. Enumerable.Skip(1) skips the first token because I have already stored it in a variable and want to know whether all the others are the same. Therefore I use All, which returns false as soon as one item doesn't match the condition (that is, one string differs from the first). If all are the same, you know that there is a repeating pattern, which is already stored in the variable first.

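You should use the String.Split function. Because the string starts with a plus sign, drop the empty first entry, otherwise index 0 is an empty string:
string pattern = ColumnPlusLevel.Split(new[] { '+' }, StringSplitOptions.RemoveEmptyEntries)[0]; // "-J10", without the leading plus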

...but it is always a repeated pattern starting with the plus sign.
Why do you even need String.Split() here if the pattern always only repeats itself?
string input = @"+MJ+MJ+MJ";
int indexOfSecondPlus = input.IndexOf('+', 1); // -1 if the pattern occurs only once
string pattern = input.Remove(indexOfSecondPlus, input.Length - indexOfSecondPlus);
//pattern is now "+MJ"
No need for string split, no need to use LINQ.

String has a method called Split which lets you split the string on a given character or set of characters:
string givenString = "+-J10+-J10+-J10+-J10+-J10";
string splittedString = givenString.Split(new[] { '+' }, StringSplitOptions.RemoveEmptyEntries)[0]; // '+' is the separator; index 0 is the first non-empty piece, "-J10"
string result = splittedString.Replace("-", ""); // Replace swaps every "-" for an empty string, leaving "J10"; drop this line if you want to keep the hyphen

C# - Input string was not in a correct format

I am working on a simple windows forms application that the user enters a string with delimiters and I parse the string and only get the variables out of the string.
So for example if the user enters:
2X + 5Y + z^3
I extract the values 2,5 and 3 from the "equation" and simply add them together.
This is how I get the integer values from a string.
int thirdValue;
string temp;
temp = Regex.Match(variables[3], @"\d+").Value;
thirdValue = int.Parse(temp);
variables is just an array of strings I use to store strings after parsing.
However, I get the following error when I run the application:
Input string was not in a correct format

Why is everyone moaning about this question and marking it down? It's easy to explain what is happening, and the questioner was right to describe it as he did.
Regex.Match(variables[3], @"\d+").Value
does not throw by itself. If the string (here variables[3]) doesn't contain any digits, the match fails and .Value is an empty string. The "Input string was not in a correct format" FormatException then comes from int.Parse(temp), which cannot parse an empty string.
Returning an empty string for a failed match is by design, but it is easy to trip over when the result is passed straight to int.Parse. astef's advice to skip Regex altogether is one way to sidestep this, though he got marked down too.
The way round it is to check for a match first and only parse inside the check:
if (Regex.IsMatch(variables[3], @"\d+")){
temp = Regex.Match(variables[3], @"\d+").Value;
thirdValue = int.Parse(temp);
}
If this still doesn't work for you, you may need an approach without Regex. I have seen the same error in a C# service where variables[3] apparently could not be accessed, which I suspect was a bug, and I had to stop using Regex there.

I prefer simple and lightweight solutions without Regex:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

static class Program
{
static void Main()
{
Console.WriteLine("2X + 65Y + z^3".GetNumbersFromString().Sum());
Console.ReadLine();
}
static IEnumerable<int> GetNumbersFromString(this string input)
{
StringBuilder number = new StringBuilder();
foreach (char ch in input)
{
if (char.IsDigit(ch))
number.Append(ch);
else if (number.Length > 0)
{
yield return int.Parse(number.ToString());
number.Clear();
}
}
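// Emit a trailing number only if the input actually ended with digits;
// int.Parse on an empty string would throw a FormatException.
if (number.Length > 0)
yield return int.Parse(number.ToString());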
}
}

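You can convert the string to a char array, check whether each character is a digit, and add the digits up. Note that this sums individual digits, so a value like 65 contributes 6 + 5 rather than 65.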
string temp = textBox1.Text;
char[] arra = temp.ToCharArray();
int total = 0;
foreach (char t in arra)
{
if (char.IsDigit(t))
{
total += int.Parse(t + "");
}
}
textBox1.Text = total.ToString();

This should solve your problem:
string temp;
temp = Regex.Matches(textBox1.Text, @"\d+", RegexOptions.IgnoreCase)[2].Value;
int thirdValue = int.Parse(temp);
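A more defensive variant of the Regex approach might look like the sketch below. It rests on two facts: Match.Value is an empty string when nothing matched, and int.Parse throws a FormatException on an empty string, so the sketch iterates only over real matches and uses int.TryParse. The names NumberSumSketch and SumNumbers are made up for this example and do not come from the question.

```csharp
using System;
using System.Text.RegularExpressions;

static class NumberSumSketch
{
    // Adds up every run of digits found in the input; returns 0 if there are none.
    public static int SumNumbers(string input)
    {
        int total = 0;
        foreach (Match m in Regex.Matches(input, @"\d+"))
        {
            int value;
            // TryParse returns false instead of throwing, e.g. when the
            // digit run is too large to fit in an int; such runs are skipped.
            if (int.TryParse(m.Value, out value))
                total += value;
        }
        return total;
    }

    static void Main()
    {
        Console.WriteLine(SumNumbers("2X + 5Y + z^3"));  // 10
        Console.WriteLine(SumNumbers("no digits here")); // 0
    }
}
```

Because the loop body runs only once per match, an input with fewer numbers than expected yields a smaller sum rather than an exception.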

Most efficient way to parse a delimited string in C#

This has been asked a few different ways but I am debating on "my way" vs "your way" with another developer. Language is C#.
I want to parse a pipe delimited string where the first 2 characters of each chunk is my tag.
The rules. Not my rules but rules I have been given and must follow.
I can't change the format of the string.
This function will be called possibly many times so efficiency is key.
I need to keep it simple.
The input string and tag I am looking for may/will change during runtime.
Example input string: AOVALUE1|ABVALUE2|ACVALUE3|ADVALUE4
Example tag I may need value for: AB
My way: I split the string into an array on the delimiter and loop through the array each time the function is called. I then look at the first 2 characters of each element and return the value minus those 2 characters.
The "other guy's" way is to use a combination of IndexOf and Substring to find the start and end of the field I am looking for, then use Substring again to pull out the value minus the first 2 characters. So he would call IndexOf("|AB") and then find the next pipe in the string. Those would be the start and the end. Then he would Substring that out.
Now I would think that IndexOf and Substring scan the string character by character on every call, so this would be less efficient than splitting into large chunks and reading each chunk minus the first 2 characters. Or is there another way that is better than what both of us have proposed?

The other guy's approach is going to be more efficient in time, given that the input string needs to be re-evaluated each time. If the input string is long, it also won't require the extra memory that splitting the string would.
If I'm trying to code a really tight loop I prefer to directly use array/string operators rather than LINQ to avoid that additional overhead:
static string inputString = "AOVALUE1|ABVALUE2|ACVALUE3|ADVALUE4"; // static so the static method below can read it
static string FindString(string tag)
{
int startIndex;
if (inputString.StartsWith(tag))
{
startIndex = tag.Length;
}
else
{
startIndex = inputString.IndexOf(string.Format("|{0}", tag));
if (startIndex == -1)
return string.Empty;
startIndex += tag.Length + 1;
}
int endIndex = inputString.IndexOf('|', startIndex);
if (endIndex == -1)
endIndex = inputString.Length;
return inputString.Substring(startIndex, endIndex - startIndex);
}

I've done a lot of parsing in C# and I would probably take the approach suggested by the "other guys" just because it would be a bit lighter on resources used and likely to be a little faster as well.
That said, as long as the data isn't too big, there's nothing wrong with the first approach and it will be much easier to program.

Something like this may work OK:
string myString = "AOVALUE1|ABVALUE2|ACVALUE3|ADVALUE4";
string selector = "AB";
// Substring strips only the leading tag; Replace would also remove any later "AB" inside the value.
var results = myString.Split('|').Where(x => x.StartsWith(selector)).Select(x => x.Substring(selector.Length));
Returns: a sequence of the matches, in this case just one, "VALUE2".
If you are just looking for the first or only match, this will work:
string result = myString.Split('|').Where(x => x.StartsWith(selector)).Select(x => x.Substring(selector.Length)).FirstOrDefault();

Substring does not parse the string.
IndexOf does parse the string.
My preference would be the Split method, primarily for coding efficiency:
string[] inputArr = input.Split("|".ToCharArray()).Select(s => s.Substring(2)).ToArray(); // Substring(2) drops the 2-character tag
is pretty concise. How many lines of code does the Substring/IndexOf method take?

sub string according to some sign

Q:
I want to get substrings separated by some sign such as "-".
Example:
If I have a string like this:
saturday-sa-0-
and I want to get:
saturday
sa
0
I searched and found the following method:
string substring = name.Split('-')[i];
My code sample:
foreach (string name in q)
{
for (int i = 0; i < 3; i++)
{
string substring = name.Split('-')[i];
}
}
But I read comments about performance drawbacks when the string is long.
My question is: is there any way to take substrings on a specific sign without badly affecting the performance of the code?

Splitting a string is O(N), no more and no less, and that is the actual complexity of String.Split. So even if you write your own procedure, it cannot be REALLY faster. Perhaps it can be slightly faster. In any case, first make sure that the performance of String.Split is indeed unsatisfactory for you.
And yes, if you split it over and over in a LOOP, then it will be a performance issue. You must first split it and then iterate over the array; see the other answers.

First, you should execute the Split operation only once. I.e., instead of
some loop {
...
string substring = name.Split('-')[i];
...
}
use
string[] substrings = name.Split('-');
some loop {
...
string substring = substrings[i];
...
}
Second, don't worry about the performance of String.Split too much unless
you have a real, measurable performance problem and
you know that String.Split is the culprit.
For example, if you have some database operation that takes 1 second, it does not really matter if the subsequent Split operation takes 0.001 or 0.002 seconds.
EDIT: Regarding the code in your comment: You can refactor
foreach (string name in q) {
for (int i = 0; i < 3; i++) {
string substring = name.Split('-')[i];
// do something with substring
}
}
to
foreach (string name in q) {
string[] substrings = name.Split('-');
for (int i = 0; i < 3; i++) {
string substring = substrings[i];
// do something with substring
}
}

The issues with Split arise when it is overused. If you need to split a string on a character, then split the string. Regex is the next most used way, but it comes with its own set of performance gotchas. If you really need to keep your footprint small, scanning the string and doing your processing in place is your best option. However, this is fraught with peril as well, since .NET strings are immutable and you may well run into the same issues you run into with Split. So the long and the short of it is: use Split, and if that doesn't meet your needs, re-evaluate.
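The "scan the string in place" option could look roughly like this sketch, which walks the string with IndexOf instead of building the whole array up front. The names SegmentScanSketch and Segments are invented for this example. It still allocates one string per piece, so treat it as an illustration of the idea rather than a guaranteed speed-up, and measure before adopting it.

```csharp
using System;
using System.Collections.Generic;

static class SegmentScanSketch
{
    // Lazily yields the pieces of text between separators, skipping empty pieces.
    public static IEnumerable<string> Segments(string text, char separator)
    {
        int start = 0;
        while (start <= text.Length)
        {
            // Find the next separator at or after start; treat "not found" as end of text.
            int end = text.IndexOf(separator, start);
            if (end < 0) end = text.Length;
            if (end > start)
                yield return text.Substring(start, end - start);
            start = end + 1;
        }
    }

    static void Main()
    {
        // Prints saturday, sa and 0 on separate lines.
        foreach (string part in Segments("saturday-sa-0-", '-'))
            Console.WriteLine(part);
    }
}
```

Because the method is an iterator, a caller that needs only the first piece can stop enumerating early and the rest of the string is never scanned.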

I'd say it depends on your data. However, you should not repeatedly do ...
string substring = name.Split('-')[i];
... since this will split your string into parts every time you need to access just one of the parts. Instead, cache the split result like this ...
string[] parts = name.Split('-');
... and then use ...
string substring = parts[i];
... to access the respective parts.

CSV Parsing with double quotes

I am trying to use C# to parse CSV. I used regular expressions to find "," and read string if my header counts were equal to my match count.
Now this will not work if I have a value like:
"a",""b","x","y"","c"
then my output is:
'a'
'"b'
'x'
'y"'
'c'
but what I want is:
'a'
'"b","x","y"'
'c'
Is there any regex or any other logic I can use for this ?

CSV, when dealing with things like multi-line, quoted, different delimiters* etc - can get trickier than you might think... perhaps consider a pre-rolled answer? I use this, and it works very well.
*=remember that some locales use [tab] as the C in CSV...

CSV is a great example for code reuse - No matter which one of the csv parsers you choose, don't choose your own. Stop Rolling your own CSV parser

I would use FileHelpers if I were you. Regular Expressions are fine but hard to read, especially if you go back, after a while, for a quick fix.
Just for the sake of exercising my mind, here is a quick & dirty C# procedure (it needs using System.Linq for Count, plus System.Text and System.Collections.Generic):
public static List<string> SplitCSV(string line)
{
if (string.IsNullOrEmpty(line))
throw new ArgumentException();
List<string> result = new List<string>();
bool inQuote = false;
StringBuilder val = new StringBuilder();
// parse line
foreach (var t in line.Split(','))
{
int count = t.Count(c => c == '"');
if (count > 2 && !inQuote)
{
inQuote = true;
val.Append(t);
val.Append(',');
continue;
}
if (count > 2 && inQuote)
{
inQuote = false;
val.Append(t);
result.Add(val.ToString());
val.Clear(); // reset the buffer so the next quoted group starts empty
continue;
}
if (count == 2 && !inQuote)
{
result.Add(t);
continue;
}
if (count == 2 && inQuote)
{
val.Append(t);
val.Append(',');
continue;
}
}
// remove quotation
for (int i = 0; i < result.Count; i++)
{
string t = result[i];
result[i] = t.Substring(1, t.Length - 2);
}
return result;
}

There's an oft quoted saying:
Some people, when confronted with a
problem, think "I know, I'll use
regular expressions." Now they have
two problems. (Jamie Zawinski)
Given that there's no official standard for CSV files (instead there are a large number of slightly incompatible styles), you need to make sure that what you implement suits the files you will be receiving. No point in implementing anything fancier than what you need - and I'm pretty sure you don't need Regular Expressions.
Here's my stab at a simple method to extract the terms - basically, it loops through the line looking for commas, keeping track of whether the current index is within a string or not:
public IEnumerable<string> SplitCSV(string line)
{
int index = 0;
int start = 0;
bool inString = false;
foreach (char c in line)
{
switch (c)
{
case '"':
inString = !inString;
break;
case ',':
if (!inString)
{
yield return line.Substring(start, index - start);
start = index + 1;
}
break;
}
index++;
}
if (start < index)
yield return line.Substring(start, index - start);
}
Standard caveat - untested code, there may be off-by-one errors.
Limitations
The quotes around a value aren't removed automatically.
To do this, add a check just before the yield return statement near the end.
Single quotes aren't supported in the same way as double quotes
You could add a separate boolean inSingleQuotedString, renaming the existing boolean to inDoubleQuotedString and treating both the same way. (You can't make the existing boolean do double work because you need the string to end with the same quote that started it.)
Whitespace isn't automatically removed
Some tools introduce whitespace around the commas in CSV files to "pretty" the file; it then becomes difficult to tell intentional whitespace from formatting whitespace.

In order to have a parseable CSV file, any double quotes inside a value need to be properly escaped somehow. The two standard ways to do this are by representing a double quote either as two double quotes back to back, or a backslash double quote. That is one of the following two forms:
""
\"
In the second form your initial string would look like this:
"a","\"b\",\"x\",\"y\"","c"
If your input string is not formatted against some rigorous format like this then you have very little chance of successfully parsing it in an automated environment.
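To make the first escaping form concrete, here is a minimal sketch of a line parser that treats two double quotes inside a quoted field as one literal quote. It is not taken from any answer here; the names QuotedCsvSketch and ParseLine are made up, and it assumes a single line with comma separators and no multi-line fields.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class QuotedCsvSketch
{
    // Splits one CSV line, treating "" inside a quoted field as a literal quote.
    public static List<string> ParseLine(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;
        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c == '"' && i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"'); // escaped quote: keep one, skip the other
                    i++;
                }
                else if (c == '"')
                {
                    inQuotes = false; // closing quote of the field
                }
                else
                {
                    current.Append(c); // commas inside quotes are ordinary text
                }
            }
            else if (c == '"')
            {
                inQuotes = true; // opening quote of a field
            }
            else if (c == ',')
            {
                fields.Add(current.ToString()); // separator outside quotes ends the field
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }
        fields.Add(current.ToString()); // last field has no trailing comma
        return fields;
    }

    static void Main()
    {
        // The asker's value re-encoded with doubled quotes:
        // "a","""b"",""x"",""y""","c"
        string line = "\"a\",\"\"\"b\"\",\"\"x\"\",\"\"y\"\"\",\"c\"";
        foreach (string field in ParseLine(line))
            Console.WriteLine(field);
    }
}
```

With the input escaped this way the three fields come out as a, "b","x","y" and c. The unescaped form from the question stays ambiguous, which is the point the answer above makes.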

If all your values are guaranteed to be in quotes, look for values, not for commas:
("".*?""|"[^"]*")
This takes advantage of the fact that the alternatives are tried from left to right at each position: it looks for double-quoted values (wrapped in "" ... "") first and falls back to normally quoted values.
If you don't want the enclosing quote to be part of the match, use:
"(".*?"|[^"]*)"
and go for the value in match group 1.
As I said: Prerequisite for this to work is well-formed input with guaranteed quotes or double quotes around each value. Empty values must be quoted as well! A nice side-effect is that it does not care for the separator char. Commas, TABs, semi-colons, spaces, you name it. All will work.
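As a rough illustration of how the second pattern could be used from C#, the sketch below runs it over the value from the question and collects match group 1. The class and method names (QuotedValueRegexSketch, ExtractValues) are invented for this example, and the same prerequisite applies: every value must be quoted.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class QuotedValueRegexSketch
{
    // The pattern "(".*?"|[^"]*)" written as a verbatim string (each quote doubled).
    // Group 1 captures the value without its enclosing quotes.
    static readonly Regex QuotedValue = new Regex(@"""("".*?""|[^""]*)""");

    public static List<string> ExtractValues(string line)
    {
        var values = new List<string>();
        foreach (Match m in QuotedValue.Matches(line))
            values.Add(m.Groups[1].Value);
        return values;
    }

    static void Main()
    {
        // The value from the question: "a",""b","x","y"","c"
        string line = @"""a"",""""b"",""x"",""y"""",""c""";
        foreach (string value in ExtractValues(line))
            Console.WriteLine(value);
    }
}
```

For this input the extracted values are a, "b","x","y" and c, which is the output the question asks for.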

FileHelpers supports multiline fields.
You could parse files like these:
a,"line 1
line 2
line 3"
b,"line 1
line 2
line 3"
Here is the datatype declaration:
[DelimitedRecord(",")]
public class MyRecord
{
public string field1;
[FieldQuoted('"', QuoteMode.OptionalForRead, MultilineMode.AllowForRead)]
public string field2;
}
Here is the usage:
static void Main()
{
FileHelperEngine engine = new FileHelperEngine(typeof(MyRecord));
// The non-generic engine returns object[], so cast to the record type.
MyRecord[] res = (MyRecord[])engine.ReadFile("file.csv");
}

Try CsvHelper (a library I maintain) or FastCsvReader. Both work well. CsvHelper does writing also. Like everyone else has been saying, don't roll your own. :P

FileHelpers for .Net is your friend.

See the link "Regex fun with CSV" at:
http://snippets.dzone.com/posts/show/4430

The Lumenworks CSV parser (open source, free but needs a codeproject login) is by far the best one I've used. It'll save you having to write the regex and is intuitive to use.

Well, I'm no regex wiz, but I'm certain they have an answer for this.
Procedurally it's going through letter by letter. Set a variable, say dontMatch, to FALSE.
Each time you run into a quote toggle dontMatch.
Each time you run into a comma, check dontMatch. If it's TRUE, ignore the comma. If it's FALSE, split at the comma.
This works for the example you give, but the logic you use for quotation marks is fundamentally faulty - you must escape them or use another delimiter (single quotes, for instance) to set major quotations apart from minor quotations.
For instance,
"a", ""b", ""c", "d"", "e""
will yield bad results.
This can be fixed with another patch. Rather than simply keeping a true false you have to match quotes.
To match quotes you have to know what was last seen, which gets into pretty deep parsing territory. You'll probably, at that point, want to make sure your language is designed well, and if it is you can use a compiler tool to create a parser for you.
-Adam

I have just tried your regular expression in my code. It works fine for formatted text with quotes,
but I am wondering if we can parse the value below with Regex:
"First_Bat7679",""NAME","ENAME","FILE"","","","From: "DDD,_Ala%as"@sib.com"
I am looking for result as:
'First_Bat7679'
'"NAME","ENAME","FILE"'
''
''
'From: "DDD,_Ala%as"@sib.com'
Thanx
