How do you find a delimited/isolated substring with string.contains? - c#

I am trying to parse out and identify some values from strings that I have in a list.
I am using string.Contains to identify the value I'm looking for, but I am getting hits even if the value is surrounded by other text. How can I make sure I only get a hit if the value is isolated?
Example parse:
Looking for value = "302"
string sale =
"199708. (30), italiano, delim fabricata modella, serialNumber302. tnr F18529302E.";
var result = sale.ToLower().Contains("302");
In this example I will get a hit for "serialNumber302" and "F18529302E", which in the context is incorrect since I only want a hit if it finds “302” isolated, like “dontfind302 shouldfind 302”.
Any ideas on how to do this?

If you try Regex, you can define a word boundary using \b:
string sale =
"199708. (30), italiano, delim fabricata modella, serialNumber302. tnr F18529302E.";
bool result = Regex.IsMatch(sale, @"\b302\b"); // false
sale = "A string with 302 isolated";
result = Regex.IsMatch(sale, @"\b302\b"); // true
So 302 will only be found if it is at the start of the string, at the end of the string, or if it is surrounded by non-word characters i.e. not a-z A-Z 0-9 or _
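If the value to search for is only known at run time, the same word-boundary idea could be wrapped in a small helper. This is only a sketch: the names IsolatedSearch and ContainsWholeWord are made up for illustration, and Regex.Escape is added so that regex metacharacters in the value are treated literally.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IsolatedSearch
{
    // Hypothetical helper: true only when 'value' occurs as a whole word in 'text'.
    public static bool ContainsWholeWord(string text, string value)
    {
        // Regex.Escape keeps characters such as '.' or '+' in the value literal.
        string pattern = @"\b" + Regex.Escape(value) + @"\b";
        return Regex.IsMatch(text, pattern);
    }
}
```

Note that \b only behaves as expected when the value itself starts and ends with a word character (a letter, digit or underscore).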

EDIT: From the comments I realised that it wasn't clear whether or not "serialNum302" should get a hit. I assumed so in this answer.
I see a few easy ways you could do this:
1) If the input is always a number as in the example, one option would be to only search for substrings not surrounded by more numbers, by examining all the results of an initial search and comparing their neighboring characters against the string "0123456789". I really don't think this is the best option though, because sooner or later it's going to break when it misinterprets one of the other bits of data.
2) If the string sale always has the serial number in the format "serialNumber[Num]", instead of just looking for Num, look for "serialNumber" + Num, as this is less likely to be messed up with the other data.
3) From your string, it looks like you have a standardized format that's being introduced to the system. In this case, parse it in a standardized way, e.g. by splitting it into substrings at the commas, then parsing each substring differently as it requires.
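Option 1 above could be sketched roughly as follows. The class and method names are hypothetical, and char.IsDigit is used in place of comparing against "0123456789" (it also accepts non-ASCII digits, which may or may not matter for your data).

```csharp
using System;

public static class DigitBoundarySearch
{
    // Hypothetical helper: true when 'number' occurs with no digit directly
    // before or after it, so "serialNumber302" hits but "F18529302E" does not.
    public static bool ContainsNumber(string text, string number)
    {
        if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(number)) return false;

        int index = text.IndexOf(number, StringComparison.Ordinal);
        while (index >= 0)
        {
            int end = index + number.Length;
            bool digitBefore = index > 0 && char.IsDigit(text[index - 1]);
            bool digitAfter = end < text.Length && char.IsDigit(text[end]);
            if (!digitBefore && !digitAfter) return true;

            // Keep looking from the next character onwards.
            index = text.IndexOf(number, index + 1, StringComparison.Ordinal);
        }
        return false;
    }
}
```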

Related

Format String to Match Specific Pattern

I am trying to figure out how to format a string to a specific pattern.
When a user is entering their employee id number, they often get confused on what is expected from them. Because they are often told that their employee id is either a 5 digit or 4 digit number depending on when they were hired.
For example, my employee id number is E004033 but for most of our systems, I just have to enter 4033 and the system will find me.
We are trying to add this to one of our custom pages. Basically what I want to do is format a string to always look like E0XXXXX
So if they enter 4033 the script will convert it to E004033, if they enter something like 0851 it will convert it to E000851 or if they enter 11027 it will convert it to E011027
Is there a way basically add padding zeros and a leading E if they are missing from the users input?
You can simply:
var formattedId = "E" + id.PadLeft(6, '0');
To remove an existing leading E(s)
var text = "E" + val.TrimStart(new[] {'E'}).PadLeft(6, '0');
Make sure the user's input is an integer, then pad it to 6 digits using String.Format.
int parsedId;
bool ok = int.TryParse(id, out parsedId);
if (ok)
{
return String.Format("E{0:000000}", parsedId);
}
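The two suggestions above (strip an existing E, validate, pad) could be combined along these lines. This is a sketch under a few assumptions: the names EmployeeIdFormatter and Normalize are invented, a lowercase "e" is also accepted, and anything that is not 1 to 6 digits is rejected by returning null.

```csharp
using System;
using System.Linq;

public static class EmployeeIdFormatter
{
    // Hypothetical helper: returns the id as "E" plus six digits, or null if the input is not usable.
    public static string Normalize(string input)
    {
        if (input == null) return null;

        // Drop surrounding whitespace and any leading E/e the user typed.
        string digits = input.Trim().TrimStart('E', 'e');

        bool valid = digits.Length > 0
                     && digits.Length <= 6
                     && digits.All(c => c >= '0' && c <= '9');
        if (!valid) return null;

        return "E" + digits.PadLeft(6, '0');
    }
}
```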

Convert string into three letter Abbreviation

I've recently been given a new project by work to convert Any given string into 1-3 letter abbreviations.
An example of something similar to what I must produce is below however the strings given could be anything:
switch (string.Name)
{
case "Emotional, Social & Personal": return "ESP";
case "Speech & Language": return "SL";
case "Physical Development": return "PD";
case "Understanding the World": return "UW";
case "English": return "E";
case "Expressive Art & Design": return "EAD";
case "Science": return "S";
case "Understanding The World And It's People": return "UTW";
}
I figured that I could use string.Split and count the number of words in the array, then add conditions for handling strings of particular lengths, since these sentences generally won't be longer than 4 words. However, the problems I expect to encounter are:
If a string is longer than I expected it wouldn't be handled
Symbols must be excluded from the abbreviation
Any suggestions as to the logic I could apply would be very appreciated.
Thanks
Something like the following should work with the examples you have given.
string abbreviation = new string(
input.Split()
.Where(s => s.Length > 0 && char.IsLetter(s[0]) && char.IsUpper(s[0]))
.Take(3)
.Select(s => s[0])
.ToArray());
You may need to adjust the filter based on your expected input. Possibly adding a list of words to ignore.
It seems that if it doesn't matter, you could just go for the simplest thing. If the string is shorter than 4 words, take the first letter of each string.
If the string is longer than 4, eliminate all "ands", and "ors", then do the same.
To be better, you could have a lookup dictionary of words that you wouldn't care about - like "the" or "so".
You could also keep a 3D char array, in alphabetical order for quick lookup. That way, you wouldn't have any repeating abbreviations.
However, there are only a finite number of abbreviations. Therefore, it might be better to keep the 'useless' words stored in another string. That way, if the abbreviation your program does by default is already taken, you can use the useless words to make a new one.
If all of the above fail, you could start to linearly move through string to get a different 3 letter word abbreviation - sort of like codons on DNA.
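The "ignore the useless words" idea might look something like this. It is only a sketch: the stop-word list, the separator characters and the names Abbreviator and Abbreviate are all assumptions, and it makes no attempt to avoid duplicate abbreviations.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class Abbreviator
{
    // Assumed list of words to skip; adjust it to the expected input.
    private static readonly HashSet<string> StopWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "and", "or", "the", "so", "of" };

    // Hypothetical helper: first letter of up to maxLetters significant words.
    public static string Abbreviate(string input, int maxLetters = 3)
    {
        char[] separators = { ' ', ',', '&', '\t' };
        var letters = input
            .Split(separators, StringSplitOptions.RemoveEmptyEntries)
            .Where(w => char.IsLetter(w[0]) && !StopWords.Contains(w))
            .Take(maxLetters)
            .Select(w => char.ToUpperInvariant(w[0]));
        return new string(letters.ToArray());
    }
}
```

With this list, "Understanding the World" gives "UW", but "Understanding The World And It's People" gives "UWI" rather than the "UTW" from the question, so a lookup table for exceptions is probably still needed.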
Perfect place to use a dictionary
Dictionary<string, string> dict = new Dictionary<string, string>() {
{"Emotional, Social & Personal", "ESP"},
{"Speech & Language","SL"},
{"Physical Development", "PD"},
{"Understanding the World","UW"},
{"English","E"},
{"Expressive Art & Design","EAD"},
{"Science","S"},
{"Understanding The World And It's People","UTW"}
};
string results = dict["English"];
Following snippet may help you:
string input = "Emotional, Social & Personal"; // an example from the question
string plainText = CultureInfo.CurrentCulture.TextInfo.ToTitleCase(Regex.Replace(input, @"[^0-9A-Za-z ,]", "").ToLower()); // produces text without special characters
string abbreviation = String.Join("",plainText.Split(" ".ToCharArray(),StringSplitOptions.RemoveEmptyEntries).Select(y =>y[0]).ToArray());// get first character from each word

Ignoring a line on Comparing two strings

I need to compare two strings representing an html (something like 300 lines both). They should be identical, except a line which contains a date in this format dd/MM/yyyy hh:mm:ss, so I need to ignore that line.
The problem is that I have a static file containing one html which I use as the base in comparing, and the other one I get on runtime from a URL. So this line with that date will be always different.
The line doesn't have any identifier tag, like id or name, even the parent elements doesn't have nothing to identify it. So, what options do I have to ignore this line in the comparing method?
Remove the date time with a Regex.Replace, then compare the strings.
You can try to find the position in the string of the sequence of chars that defines the date line.
Suppose your date line starts with "mydate".
Get the first part of the string from index 0 to indexOf("mydate") from the two files and compare them (if you do not find "mydate", then something is really different, the date line was not found).
Then get the second part of the string from the index of what should be directly after the date line from the two files and compare them.
You can remove both datetimes from both htmls using regex, then compare them.
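A minimal sketch of the regex approach, assuming the timestamp always appears as two-digit day, month, hour, minute and second with a four-digit year (dd/MM/yyyy hh:mm:ss). The names HtmlComparer and EqualIgnoringDateTime are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlComparer
{
    // Assumed timestamp shape: dd/MM/yyyy hh:mm:ss
    private static readonly Regex DateTimePattern =
        new Regex(@"\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}");

    // Hypothetical helper: compares two documents after blanking out every timestamp.
    public static bool EqualIgnoringDateTime(string html1, string html2)
    {
        string a = DateTimePattern.Replace(html1, "");
        string b = DateTimePattern.Replace(html2, "");
        return string.Equals(a, b, StringComparison.Ordinal);
    }
}
```

This strips only the timestamp itself, so any other difference on that line would still be reported.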
A simple solution consists of identifying the characters of the static HTML (s1) that are not identical to the HTML (s2) obtained from the URL.
A prerequisite is to update the static HTML s1 by replacing the DateTime with a string like "##.##.##.##.##.##", ensuring that no character of this string can match any character (including separators) of the DateTime in s2.
string originalDateTimeString = "##.##.##.##.##.##" ;
// check to see if same length
bool compareok=s1.Length==s2.Length ;
// check all char. when different store char in diff1
string diff1="" ;
int lastDiffIndex =-1 ;
for (int i=0;i<s1.Length && compareok; i ++) if(s1[i]!=s2[i])
{ // Check if differences are consecutive
compareok = lastDiffIndex==-1 || lastDiffIndex==i-1 ;
diff1+=s1[i] ;
lastDiffIndex=i ;
}
// The comparison succeeds if the differences matches the original DateTime string
compareok = compareok && diff1==originalDateTimeString ;

C# Regex.Match to decimal

I have a string "-4.00 %" which I need to convert to a decimal so that I can declare it as a variable and use it later. The string itself is found in string[] rows. My code is as follows:
foreach (string[] row in rows)
{
string row1 = row[0].ToString();
Match rownum = Regex.Match(row1.ToString(), @"\-?\d+\.+?\d+[^%]");
string act = Convert.ToString(rownum); //wouldn't convert match to decimal
decimal actual = Convert.ToDecimal(act);
textBox1.Text = (actual.ToString());
}
This results in "Input string was not in a correct format." Any ideas?
Thanks.
I see two things happening here that could contribute.
You are treating the Regex Match as though you expect it to be a string, but what Regex.Match returns is a Match object whose captured text is exposed through its Groups collection.
Rather than converting rownum to a string, you need to look at rownum.Groups[0].Value.
Secondly, you have no parenthesised match to capture. @"(\-?\d+\.+?\d+)%" will create a capture group from the whole lot. This may not matter, I don't know how C# behaves in this circumstance exactly, but if you start stretching your regexes you will want to use bracketed capture groups so you might as well start as you want to go on.
Here's a modified version of your code that changes the regex to use a capturing group and explicitly look for a %. As a consequence, this also simplifies the parsing to decimal (no longer need an intermediary string):
EDIT : check rownum.Success as per executor's suggestion in comments
string[] rows = new [] {"abc -4.01%", "def 6.45%", "monkey" };
foreach (string row in rows)
{
//regex captures number but not %
Match rownum = Regex.Match(row.ToString(), @"(\-?\d+\.+?\d+)%");
//check for match
if(!rownum.Success) continue;
//get value of first (and only) capture
string capture = rownum.Groups[1].Value;
//convert to decimal
decimal actual = decimal.Parse(capture);
//TODO: do something with actual
}
If you're going to use the Match class to handle this, then you have to access the Match.Groups property to get the collection of matches. This class assumes that more than one occurrence appears. If you can guarantee that you'll always get 1 and only 1 you could get it with:
string act = rownum.Groups[0].Value;
Otherwise you'll need to parse through it as in the MSDN documentation.
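One more thing that can produce "Input string was not in a correct format" is the current culture: decimal.Parse uses it by default, so "4.00" fails on machines where the decimal separator is a comma. A hedged sketch that parses with the invariant culture instead. The pattern here is a slightly tightened variant of the ones above, not the original, and the names PercentParser and TryParsePercent are invented.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class PercentParser
{
    // Hypothetical helper: pulls the number out of text such as "-4.00 %".
    public static bool TryParsePercent(string text, out decimal value)
    {
        value = 0m;

        // Optional sign, digits, optional fraction, optional whitespace, then a literal %.
        Match m = Regex.Match(text, @"(-?\d+(?:\.\d+)?)\s*%");
        if (!m.Success) return false;

        // InvariantCulture always treats '.' as the decimal separator.
        return decimal.TryParse(m.Groups[1].Value, NumberStyles.Number,
                                CultureInfo.InvariantCulture, out value);
    }
}
```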

CSV Parsing with double quotes

I am trying to use C# to parse CSV. I used regular expressions to find "," and read string if my header counts were equal to my match count.
Now this will not work if I have a value like:
"a",""b","x","y"","c"
then my output is:
'a'
'"b'
'x'
'y"'
'c'
but what I want is:
'a'
'"b","x","y"'
'c'
Is there any regex or any other logic I can use for this ?
CSV, when dealing with things like multi-line, quoted, different delimiters* etc - can get trickier than you might think... perhaps consider a pre-rolled answer? I use this, and it works very well.
*=remember that some locales use [tab] as the C in CSV...
CSV is a great example for code reuse - No matter which one of the csv parsers you choose, don't choose your own. Stop Rolling your own CSV parser
I would use FileHelpers if I were you. Regular Expressions are fine but hard to read, especially if you go back, after a while, for a quick fix.
Just for sake of exercising my mind, quick & dirty working C# procedure:
public static List<string> SplitCSV(string line)
{
if (string.IsNullOrEmpty(line))
throw new ArgumentException();
List<string> result = new List<string>();
bool inQuote = false;
StringBuilder val = new StringBuilder();
// parse line
foreach (var t in line.Split(','))
{
int count = t.Count(c => c == '"');
if (count > 2 && !inQuote)
{
inQuote = true;
val.Append(t);
val.Append(',');
continue;
}
if (count > 2 && inQuote)
{
inQuote = false;
val.Append(t);
result.Add(val.ToString());
val.Clear(); // reset the buffer so a later quoted field starts empty
continue;
}
if (count == 2 && !inQuote)
{
result.Add(t);
continue;
}
if (count == 2 && inQuote)
{
val.Append(t);
val.Append(',');
continue;
}
}
// remove quotation
for (int i = 0; i < result.Count; i++)
{
string t = result[i];
result[i] = t.Substring(1, t.Length - 2);
}
return result;
}
There's an oft quoted saying:
Some people, when confronted with a
problem, think "I know, I'll use
regular expressions." Now they have
two problems. (Jamie Zawinski)
Given that there's no official standard for CSV files (instead there are a large number of slightly incompatible styles), you need to make sure that what you implement suits the files you will be receiving. No point in implementing anything fancier than what you need - and I'm pretty sure you don't need Regular Expressions.
Here's my stab at a simple method to extract the terms - basically, it loops through the line looking for commas, keeping track of whether the current index is within a string or not:
public IEnumerable<string> SplitCSV(string line)
{
int index = 0;
int start = 0;
bool inString = false;
foreach (char c in line)
{
switch (c)
{
case '"':
inString = !inString;
break;
case ',':
if (!inString)
{
yield return line.Substring(start, index - start);
start = index + 1;
}
break;
}
index++;
}
if (start < index)
yield return line.Substring(start, index - start);
}
Standard caveat - untested code, there may be off-by-one errors.
Limitations
The quotes around a value aren't removed automatically.
To do this, add a check just before the yield return statement near the end.
Single quotes aren't supported in the same way as double quotes
You could add a separate boolean inSingleQuotedString, renaming the existing boolean to inDoubleQuotedString and treating both the same way. (You can't make the existing boolean do double work because you need the string to end with the same quote that started it.)
Whitespace isn't automatically removed
Some tools introduce whitespace around the commas in CSV files to "pretty" the file; it then becomes difficult to tell intentional whitespace from formatting whitespace.
In order to have a parseable CSV file, any double quotes inside a value need to be properly escaped somehow. The two standard ways to do this are by representing a double quote either as two double quotes back to back, or a backslash double quote. That is one of the following two forms:
""
\"
In the second form your initial string would look like this:
"a","\"b\",\"x\",\"y\"","c"
If your input string is not formatted against some rigorous format like this then you have very little chance of successfully parsing it in an automated environment.
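For input that does follow the first convention (a quote inside a value written as two quotes), a single-line splitter could be sketched like this. It is an illustration only, not a replacement for a real CSV library: it handles neither multi-line fields nor backslash escapes, and the names SimpleCsv and SplitLine are made up.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class SimpleCsv
{
    // Hypothetical helper: splits one line on commas, honouring "..." fields and "" escapes.
    public static List<string> SplitLine(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c == '"' && i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"'); // doubled quote is a literal quote
                    i++;                 // skip the second quote of the pair
                }
                else if (c == '"')
                {
                    inQuotes = false;    // closing quote of the field
                }
                else
                {
                    current.Append(c);   // commas inside quotes are data
                }
            }
            else if (c == '"')
            {
                inQuotes = true;         // opening quote of a field
            }
            else if (c == ',')
            {
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        fields.Add(current.ToString());  // last field has no trailing comma
        return fields;
    }
}
```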
If all your values are guaranteed to be in quotes, look for values, not for commas:
("".*?""|"[^"]*")
This takes advantage of the fact that "the earliest longest match wins" - it looks for double quoted values first, and with a lower priority for normal quoted values.
If you don't want the enclosing quote to be part of the match, use:
"(".*?"|[^"]*)"
and go for the value in match group 1.
As I said: Prerequisite for this to work is well-formed input with guaranteed quotes or double quotes around each value. Empty values must be quoted as well! A nice side-effect is that it does not care for the separator char. Commas, TABs, semi-colons, spaces, you name it. All will work.
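In C# the second pattern could be used roughly as follows. Inside a verbatim string every quote in the pattern has to be doubled, which makes it look noisier than it is. The names QuotedValueExtractor and Extract are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuotedValueExtractor
{
    // The pattern "(".*?"|[^"]*)" written as a C# verbatim string.
    private static readonly Regex ValuePattern = new Regex(@"""("".*?""|[^""]*)""");

    // Hypothetical helper: returns the content of match group 1 for every quoted value.
    public static List<string> Extract(string line)
    {
        var values = new List<string>();
        foreach (Match m in ValuePattern.Matches(line))
        {
            values.Add(m.Groups[1].Value);
        }
        return values;
    }
}
```

For the line from the question this gives three values: a, then "b","x","y" with its inner quotes kept, then c.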
FileHelpers supports multiline fields.
You could parse files like these:
a,"line 1
line 2
line 3"
b,"line 1
line 2
line 3"
Here is the datatype declaration:
[DelimitedRecord(",")]
public class MyRecord
{
public string field1;
[FieldQuoted('"', QuoteMode.OptionalForRead, MultilineMode.AllowForRead)]
public string field2;
}
Here is the usage:
static void Main()
{
FileHelperEngine engine = new FileHelperEngine(typeof(MyRecord));
MyRecord[] res = engine.ReadFile("file.csv");
}
Try CsvHelper (a library I maintain) or FastCsvReader. Both work well. CsvHelper does writing also. Like everyone else has been saying, don't roll your own. :P
FileHelpers for .Net is your friend.
See the link "Regex fun with CSV" at:
http://snippets.dzone.com/posts/show/4430
The Lumenworks CSV parser (open source, free but needs a codeproject login) is by far the best one I've used. It'll save you having to write the regex and is intuitive to use.
Well, I'm no regex wiz, but I'm certain they have an answer for this.
Procedurally it's going through letter by letter. Set a variable, say dontMatch, to FALSE.
Each time you run into a quote toggle dontMatch.
each time you run into a comma, check dontMatch. If it's TRUE, ignore the comma. If it's FALSE, split at the comma.
This works for the example you give, but the logic you use for quotation marks is fundamentally faulty - you must escape them or use another delimiter (single quotes, for instance) to set major quotations apart from minor quotations.
For instance,
"a", ""b", ""c", "d"", "e""
will yield bad results.
This can be fixed with another patch. Rather than simply keeping a true false you have to match quotes.
To match quotes you have to know what was last seen, which gets into pretty deep parsing territory. You'll probably, at that point, want to make sure your language is designed well, and if it is you can use a compiler tool to create a parser for you.
-Adam
I have just tried your regular expression in my code. It works fine for formatted text with quotes,
but I am wondering whether the value below can be parsed by regex:
"First_Bat7679",""NAME","ENAME","FILE"","","","From: "DDD,_Ala%as"@sib.com"
I am looking for result as:
'First_Bat7679'
'"NAME","ENAME","FILE"'
''
''
'From: "DDD,_Ala%as"@sib.com'
Thanx
