Find the longest sequence of digits in a string - c#

I am trying to clear up the results for poor quality OCR reads, attempting to remove everything I can safely assume is a mistake.
The desired result is a 6 digit numerical string, so I can rule out any character that isn't a digit from the results. I also know these numbers appear sequentially, so any numbers out of sequence are also very likely to be incorrect.
(Yes, fixing the quality would be best but no... they won't/can't change their documents)
I immediately Trim() to remove white space, and because these are going to end up as file names I also remove all illegal characters.
I've found out which characters are digits and added them to a dictionary, keyed by the array position at which they were found.
This leaves me with a clear visual indication of the number sequences, but I am struggling with the logic of how to get my program to recognise this.
Tested with the string "Oct', 2$3622" (an actual bad read)
The ideal output for this would be "3622"
public String FindLongest(string OcrText)
{
try
{
Char[] text = OcrText.ToCharArray();
List<char> numbers = new List<char>();
Dictionary<int, char> consec = new Dictionary<int, char>();
for (int a = 0; a < text.Length; a++)
{
if (Char.IsDigit(text[a]))
{
consec.Add(a, text[a]);
// Won't allow duplicates?
//consec.Add(text[a].ToString(), true);
}
}
foreach (var item in consec.Keys)
{
#region Idea that didn't work
// Combine values with consecutive keys into new list
// With most consecutive?
for (int i = 0; i < consec.Count; i++)
{
// if index key doesn't match loop, value was not consecutive
// Ah... falsely assuming it will start at 1. Won't work.
if (item == i)
numbers.Add(consec[item]);
else
numbers.Add(Convert.ToChar("#")); //string split value
}
#endregion
}
return null;
}
catch (Exception ex)
{
string message;
if (ex.InnerException != null)
message =
"Exception: " + ex.Message +
"\r\n" +
"Inner: " + ex.InnerException.Message;
else
message = "Exception: " + ex.Message;
MessageBox.Show(message);
return null;
}
}

A quick and dirty way to get the longest sequence of digits would be by using a Regex like this:
var t = "sfas234sdfsdf55323sdfasdf23";
var longest = Regex.Matches(t, @"\d+").Cast<Match>().OrderByDescending(m => m.Length).First();
Console.WriteLine(longest);
This will actually get all the sequences and obviously you can use LINQ to select the longest of these.
This doesn't handle multiple sequences of the same length.
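If ties matter, one option is a single pass over the string that keeps every run sharing the greatest length. The following is only a sketch: `DigitRuns` and `LongestRuns` are hypothetical names, not part of the answer above, and `char.IsDigit` also accepts non-ASCII Unicode digits, which may or may not be wanted for OCR output.

```csharp
using System;
using System.Collections.Generic;

public static class DigitRuns
{
    // Returns every maximal run of digits that has the greatest length,
    // in order of appearance. Returns an empty list if there are no digits.
    public static List<string> LongestRuns(string text)
    {
        var best = new List<string>();
        int bestLength = 0;
        int i = 0;
        while (i < text.Length)
        {
            if (!char.IsDigit(text[i]))
            {
                i++;
                continue;
            }
            // Walk to the end of the current run of digits.
            int start = i;
            while (i < text.Length && char.IsDigit(text[i]))
            {
                i++;
            }
            int length = i - start;
            if (length > bestLength)
            {
                // A strictly longer run replaces everything found so far.
                bestLength = length;
                best.Clear();
            }
            if (length == bestLength)
            {
                best.Add(text.Substring(start, length));
            }
        }
        return best;
    }
}
```

For the bad read from the question, `DigitRuns.LongestRuns("Oct', 2$3622")` yields a single entry, "3622"; for "12ab34" it yields both "12" and "34", leaving the caller to decide how to break the tie.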

So you just need to find the longest digit sequence? Why not use a regex?
Regex reg = new Regex(@"\d+");
MatchCollection mc = reg.Matches(input);
foreach (Match mt in mc)
{
// mt.Groups[0].Value.Length is the len of the sequence
// just find the longest
}
Just a thought.

Since you strictly want numeric matches, I would suggest using a regex that matches (\d+).
MatchCollection matches = Regex.Matches(input, @"(\d+)");
string longest = string.Empty;
foreach (Match match in matches) {
if (match.Success) {
if (match.Value.Length > longest.Length) longest = match.Value;
}
}
This will give you the number of the longest length. If you wanted to actually compare values (which would also work with the "longest length", but could solve an issue with same-length matches):
MatchCollection matches = Regex.Matches(input, @"(\d+)");
int biggest = 0;
foreach (Match match in matches) {
if (match.Success) {
int current = 0;
int.TryParse(match.Value, out current);
if (current > biggest) biggest = current;
}
}

var split = Regex.Split(OcrText, @"\D+").ToList();
var longest = (from s in split
orderby s.Length descending
select s).FirstOrDefault();
I would recommend using Regex.Split with \D (@"\D+" in code), which matches every run of characters that are not digits. I would then perform a LINQ query to find the longest string by .Length.
As you can see, it's both simple and very readable.


Fixing badly formatted string with number and thousands separator

I am receiving a string with numbers, nulls, and delimiters that are the same as characters in the numbers. Also there are quotes around numbers that contain a comma(s). With C#, I want to parse out the string, such that I have a nice, pipe delimited series of numbers, no commas, 2 decimal places.
I tried the standard replace, removing certain string patterns to clean it up but I can't hit every case. I've removed the quotes first, but then I get extra numbers as the thousands separator turns into a delimiter. I attempted to use Regex.Replace with wildcards but can't get anything out of it due to the multiple numbers with quotes and commas inside the quotes.
edit for Silvermind: temp = Regex.Replace(temp, "(?:\",.*\")","($1 = .\n)");
I don't have control over the file I receive. I can get most of the data cleaned up. It's when the string looks like the following, that there is a problem:
703.36,751.36,"1,788.36",887.37,891.37,"1,850.37",843.37,"1,549,797.36",818.36,749.36,705.36,0.00,"18,979.70",934.37
Should I look for the quote character, find the next quote character, remove commas from everything between those 2 chars, and move on? This is where I'm headed but there has to be something more elegant out there (yes - I don't program in C# that often - I'm a DBA).
I would like to see the thousands separator removed, and no quotes.
This regex pattern will match all of the individual numbers in your string:
(".*?")|(\d+(\.\d+)?)
(".*?") matches quoted values like "1,788.36"
(\d+(\.\d+)?) matches unquoted values like 123.45 or 123 (the dot is escaped so that it only matches a literal decimal point)
From there, you can do a simple search and replace on each match to get a "clean" number.
Full code:
var s = "703.36,751.36,\"1,788.36\",887.37,891.37,\"1,850.37\",843.37,\"1,549,797.36\",818.36,749.36,705.36,0.00,\"18,979.70\",934.37";
Regex r = new Regex("(\".*?\")|(\\d+(\\.\\d+)?)");
List<double> results = new List<double>();
foreach (Match m in r.Matches(s))
{
string cleanNumber = m.Value.Replace("\"", "");
results.Add(double.Parse(cleanNumber));
}
Console.WriteLine(string.Join(", ", results));
Output:
703.36, 751.36, 1788.36, 887.37, 891.37, 1850.37, 843.37, 1549797.36, 818.36, 749.36, 705.36, 0, 18979.7, 934.37
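The question asks for a pipe-delimited result with two decimal places, which the output above does not quite give (0.00 prints as 0). A minimal sketch of that last step follows. `NumberCleaner` and `ToPipeDelimited` are hypothetical names, the pattern is a simplified variant that treats each field as either a quoted value or a run of non-comma characters, and it assumes every non-empty field is a valid number in invariant-culture format. Empty (null) fields are skipped here rather than preserved.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class NumberCleaner
{
    // Turns a comma-separated line with optionally quoted numbers into
    // a pipe-delimited line with no thousands separators and two decimals.
    public static string ToPipeDelimited(string line)
    {
        var parts = new List<string>();
        // Either a whole quoted field, or an unquoted run without commas.
        foreach (Match m in Regex.Matches(line, "\"[^\"]*\"|[^,]+"))
        {
            // Drop the surrounding quotes and the thousands separators.
            string raw = m.Value.Trim('"').Replace(",", "");
            decimal value = decimal.Parse(raw, NumberStyles.Number, CultureInfo.InvariantCulture);
            // "F2" always writes exactly two decimal places.
            parts.Add(value.ToString("F2", CultureInfo.InvariantCulture));
        }
        return string.Join("|", parts);
    }
}
```

With the input `703.36,"1,788.36",0.00` this produces `703.36|1788.36|0.00`. Using `decimal` with an explicit culture avoids both binary rounding surprises and machine-dependent decimal separators.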
This would be simpler to solve with a parser-style solution that keeps track of state. Regex is for regular text; any time you have context, it gets hard to solve with regex. Something like this would work.
internal class Program
{
private static string testString = "703.36,751.36,\"1,788.36\",887.37,891.37,\"1,850.37\",843.37,\"1,549,797.36\",818.36,749.36,705.36,0.00,\"18,979.70\",934.37";
private static void Main(string[] args)
{
bool inQuote = false;
List<string> numbersStr = new List<string>();
int StartPos = 0;
StringBuilder SB = new StringBuilder();
for(int x = 0; x < testString.Length; x++)
{
if(testString[x] == '"')
{
inQuote = !inQuote;
continue;
}
if(testString[x] == ',' && !inQuote )
{
numbersStr.Add(SB.ToString());
SB.Clear();
continue;
}
if(char.IsDigit(testString[x]) || testString[x] == '.')
{
SB.Append(testString[x]);
}
}
if(SB.Length != 0)
{
numbersStr.Add(SB.ToString());
}
var nums = numbersStr.Select(x => double.Parse(x));
foreach(var num in nums)
{
Console.WriteLine(num);
}
Console.ReadLine();
}
}

C# "between strings" run several times

Here is my code to find a string between { }:
var text = "Hello this is a {Testvar}...";
int tagFrom = text.IndexOf("{") + "{".Length;
int tagTo = text.LastIndexOf("}");
String tagResult = text.Substring(tagFrom, tagTo - tagFrom);
tagResult Output: Testvar
This only works for one time use.
How can I apply this for several Tags? (eg in a While loop)
For example:
var text = "Hello this is a {Testvar}... and we have more {Tagvar} in this string {Endvar}.";
tagResult[] Output (eg Array): Testvar, Tagvar, Endvar
IndexOf() has another overload that takes the index at which to start searching the given string. If you omit it, the search always starts from the beginning and always finds the first occurrence.
var text = "Hello this is a {Testvar}...";
int start = 0, end = -1;
List<string> results = new List<string>();
while(true)
{
start = text.IndexOf("{", start) + 1;
if(start != 0)
end = text.IndexOf("}", start);
else
break;
if(end==-1) break;
results.Add(text.Substring(start, end - start));
start = end + 1;
}
I strongly recommend using regular expressions for the task.
using System;
using System.Text.RegularExpressions;
namespace ConsoleApp1
{
class Program
{
static void Main(string[] args)
{
var regex = new Regex(@"(\{(?<var>\w*)\})+", RegexOptions.IgnoreCase);
var text = "Hello this is a {Testvar}... and we have more {Tagvar} in this string {Endvar}.";
var matches = regex.Matches(text);
foreach (Match match in matches)
{
var variable = match.Groups["var"];
Console.WriteLine($"Found {variable.Value} from position {variable.Index} to {variable.Index + variable.Length}");
}
}
}
}
Output:
Found Testvar from position 17 to 24
Found Tagvar from position 47 to 53
Found Endvar from position 71 to 77
For more information about regular expression visit the MSDN reference page:
https://learn.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference
and this tool may be great to start testing your own expressions:
http://regexstorm.net/tester
Hope this helps!
I would use Regex pattern {(\\w+)} to get the value.
Regex reg = new Regex("{(\\w+)}");
var text = "Hello this is a {Testvar}... and we have more {Tagvar} in this string {Endvar}.";
string[] tagResult = reg.Matches(text)
.Cast<Match>()
.Select(match => match.Groups[1].Value).ToArray();
foreach (var item in tagResult)
{
Console.WriteLine(item);
}
Result
Testvar
Tagvar
Endvar
Many ways to skin this cat, here are a few:
Split it on { then loop through, splitting each result on } and taking element 0 each time
Split on { or } then loop through taking only odd numbered elements
Adjust your existing logic so you use IndexOf twice (instead of lastindexof). When you’re looking for a } pass the index of the { as the start index of the search
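As a rough illustration of the second option in the list above (split on both braces and keep the odd-numbered elements), here is a sketch. `TagExtractor` and `ExtractTags` are made-up names, and the approach assumes the braces are balanced and never nested; otherwise the odd/even positions no longer line up with the tags.

```csharp
using System;
using System.Collections.Generic;

public static class TagExtractor
{
    // Splits on '{' and '}' so that text outside braces lands on even
    // indexes and text inside braces lands on odd indexes.
    public static List<string> ExtractTags(string text)
    {
        string[] pieces = text.Split('{', '}');
        var tags = new List<string>();
        for (int i = 1; i < pieces.Length; i += 2)
        {
            tags.Add(pieces[i]);
        }
        return tags;
    }
}
```

For the sample sentence from the question this returns Testvar, Tagvar and Endvar, in that order.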
This is so easy by using Regular Expressions just by using a simple pattern like {([\d\w]+)}.
See the example below:-
using System.Text.RegularExpressions;
...
MatchCollection matches = Regex.Matches("Hello this is a {Testvar}... and we have more {Tagvar} in this string {Endvar}.", @"{([\d\w]+)}");
foreach(Match match in matches){
Console.WriteLine("match : {0}, index : {1}", match.Groups[1], match.Index);
}
It can find any series of letters or number in these brackets one by one.

Regex Puzzle Find all Valid String Combinations

I am trying to find the possible substrings within a string which satisfy all of the given conditions.
The first letter is a lowercase English letter.
Next, it contains a sequence of zero or more of the following characters:
lowercase English letters, digits, and colons.
Next, it contains a forward slash '/'.
Next, it contains a sequence of one or more of the following characters:
lowercase English letters and digits.
Next, it contains a backward slash '\'.
Next, it contains a sequence of one or more lowercase English letters.
Given some string, s, we define the following:
s[i..j] is a substring consisting of all the characters in the inclusive range between index i and index j.
Two substrings, s[i1..j1] and s[i2..j2], are said to be distinct if either i1 ≠ i2 or j1 ≠ j2.
For example, your command line is abc:/b1c\xy. Valid command substrings are:
abc:/b1c\xy
bc:/b1c\xy
c:/b1c\xy
abc:/b1c\x
bc:/b1c\x
c:/b1c\x
which I tried to solve with ^([a-z])([a-z0-9:]*)(/)([a-z0-9]+)([\\])([a-z]*)
but this doesn't satisfy the second condition. I then tried ^([a-z])([a-z0-9:]*)(/)([a-z0-9]+)([\\])([a-z]+[a-z]*), but for w:/a\bc the answer should be 2 subsets [w:/a\b, w:/a\bc], whereas the regex gives only 1, which is obvious. What am I doing wrong?
Edit: w:/a\bc should yield two subsets [w:/a\b, w:/a\bc] because both satisfy all 6 constraints and they are distinct: w:/a\bc is a superset of w:/a\b.
You have to perform substring operations after matching the string.
For example:
Your string is "abc:/b1c\xy" and you have matched it using your regex; now it's time to extract the required data.
int startIndex=1;
String st="abc:/b1c\xy";
regex1="[a-z0-9:]*(/)"
regex2="(/)([a-z0-9]+)([\\])";
regex3="([\\])([a-z])+";
String PrefixedString=regex1.match(st).group(0);
String CenterString=regex2.match(st).group(0);
String PostfixedString=regex3.match(st).group(0);
if(PrefixedString.contains(":"))
{ startIndex=2; }
for(int i=0;i<PrefixedString.length-startIndex;i++)//ends with -startIndex because '/' is included in the string or ':' may be
{
String temp=PrefixedString[i];
if(i!=PrefixedString.length)
{
for(int j=i+1;j<PrefixedString.length;j++)
{
temp+=PrefixedString[j];
}
}
print(temp+CenterString+PostfixedString);
}
for(int i=1;i<PostfixedString.length;i++)//starts with -1 because '\' is included in the string
{
String temp=PrefixedString+CenterString+PostfixedString[i];
if(i!=PostfixedString.length)
{
for(int j=i+1;j<PostfixedString.length;j++)
{
temp+=PostfixedString[j];
}
}
print(temp);
}
I hope this will give you some idea.
You may be able to create a regex that helps you in separating all relevant result parts, but as far as I know, you can't create a regex that gives you all result sets with a single search.
The tricky parts are the first two conditions, since there can be many possible starting points when there is a mix of letters, digits and colons.
In order to find possible starting points, I suggest the following pattern for the part before the forward slash: (?:([a-z]+)(?:[a-z0-9:]*?))+
This will match potentially multiple captures where every letter within the capture could be a starting point to the substring.
Whole regex: (?:([a-z]+)(?:[a-z0-9:]*?))+/[a-z0-9]+\\([a-z]*)
Create your results by combining all postfix sub-lengths from all captures of group 1 and all prefix sub-lengths from group 2.
Example code:
var testString = @"a:ab2c:/b1c\xy";
var reg = new Regex(@"(?:([a-z]+)(?:[a-z0-9:]*?))+/[a-z0-9]+\\([a-z]*)");
var matches = reg.Matches(testString);
foreach (Match match in matches)
{
var prefixGroup = match.Groups[1];
var postfixGroup = match.Groups[2];
foreach (Capture prefixCapture in prefixGroup.Captures)
{
for (int i = 0; i < prefixCapture.Length; i++)
{
for (int j = 0; j < postfixGroup.Length; j++)
{
var start = prefixCapture.Index + i;
var end = postfixGroup.Index + postfixGroup.Length - j;
Console.WriteLine(testString.Substring(start, end - start));
}
}
}
}
Output:
a:ab2c:/b1c\xy
a:ab2c:/b1c\x
ab2c:/b1c\xy
ab2c:/b1c\x
b2c:/b1c\xy
b2c:/b1c\x
c:/b1c\xy
c:/b1c\x
An intuitive way, though it might not be correct:
var regex = new Regex(@"(^[a-z])([a-z0-9:]*)(/)([a-z0-9]+)([\\])([a-z]+)");
var counter = 0;
for (var c = 0; c < command.Length; c++)
{
var isMatched = regex.Match(string.Join(string.Empty, command.Skip(c)));
if (isMatched.Success)
{
counter += isMatched.Groups.Last().Value.ToCharArray().Length;
}
}
return counter;
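To cross-check any of the approaches in this thread on small inputs, a brute-force count can help: test every substring against the six conditions written as one fully anchored pattern. This is only a verification sketch, with the hypothetical names `CommandCounter` and `CountValidSubstrings`; it tests a quadratic number of substrings, so it is not meant for long inputs.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommandCounter
{
    // The six conditions from the question as one anchored pattern:
    // letter, then [a-z0-9:]*, '/', [a-z0-9]+, '\', [a-z]+ up to the very end.
    private static readonly Regex Whole =
        new Regex(@"^[a-z][a-z0-9:]*/[a-z0-9]+\\[a-z]+\z");

    // Counts the distinct (start, end) pairs whose substring is valid.
    public static int CountValidSubstrings(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            for (int j = i; j < s.Length; j++)
            {
                if (Whole.IsMatch(s.Substring(i, j - i + 1)))
                {
                    count++;
                }
            }
        }
        return count;
    }
}
```

For `abc:/b1c\xy` this counts 6 substrings and for `w:/a\bc` it counts 2, matching the examples in the question.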

How to find the number of occurrences of a letter in only the first sentence of a string?

I want to find the number of occurrences of the letter "a" in only the first sentence. The code below counts "a" in all sentences, but I want the count for the first sentence only.
static void Main(string[] args)
{
string text; int k = 0;
text = "bla bla bla. something second. maybe last sentence.";
foreach (char a in text)
{
char b = 'a';
if (b == a)
{
k += 1;
}
}
Console.WriteLine("number of a in first sentence is " + k);
Console.ReadKey();
}
This splits the string into an array separated by '.', '!' or '?', then counts the number of 'a' chars in the first element of the array (the first sentence).
var count = text.Split(new[] { '.', '!', '?' })[0].Count(c => c == 'a');
This example assumes a sentence is separated by a ., ? or !. If you have a decimal number in your string (e.g. 123.456), that will count as a sentence break. Breaking up a string into accurate sentences is a fairly complex exercise.
This is perhaps more verbose than what you were looking for, but hopefully it'll breed understanding as you read through it.
public static void Main()
{
//Make an array of the possible sentence enders. Doing this pattern lets us easily update
// the code later if it becomes necessary, or allows us easily to move this to an input
// parameter
string[] SentenceEnders = new string[] {"$", @"\.", @"\?", @"\!" /* Add Any Others */};
string WhatToFind = "a"; //What are we looking for? Regular Expressions Will Work Too!!!
string SentenceToCheck = "This, but not to exclude any others, is a sample."; //First example
string MultipleSentencesToCheck = @"
Is this a sentence
that breaks up
among multiple lines?
Yes!
It also has
more than one
sentence.
"; //Second Example
//This will split the input on all the enders put together(by way of joining them in [] inside a regular
// expression.
string[] SplitSentences = Regex.Split(SentenceToCheck, "[" + String.Join("", SentenceEnders) + "]", RegexOptions.IgnoreCase);
//SplitSentences is an array, with sentences on each index. The first index is the first sentence
string FirstSentence = SplitSentences[0];
//Now, split that single sentence on our matching pattern for what we should be counting
string[] SubSplitSentence = Regex.Split(FirstSentence, WhatToFind, RegexOptions.IgnoreCase);
//Now that it's split, it's split a number of times that matches how many matches we found, plus one
// (The "Left over" is the +1
int HowMany = SubSplitSentence.Length - 1;
System.Console.WriteLine(string.Format("We found, in the first sentence, {0} '{1}'.", HowMany, WhatToFind));
//Do all this again for the second example. Note that ideally, this would be in a separate function
// and you wouldn't be writing code twice, but I wanted you to see it without all the comments so you can
// compare and contrast
SplitSentences = Regex.Split(MultipleSentencesToCheck, "[" + String.Join("", SentenceEnders) + "]", RegexOptions.IgnoreCase | RegexOptions.Singleline);
SubSplitSentence = Regex.Split(SplitSentences[0], WhatToFind, RegexOptions.IgnoreCase | RegexOptions.Singleline);
HowMany = SubSplitSentence.Length - 1;
System.Console.WriteLine(string.Format("We found, in the second sentence, {0} '{1}'.", HowMany, WhatToFind));
}
Here is the output:
We found, in the first sentence, 3 'a'.
We found, in the second sentence, 4 'a'.
You didn't define "sentence", but if we assume it's always terminated by a period (.), just add this inside the loop:
if (a == '.') {
break;
}
Expand from this to support other sentence delimiters.
Simply "break" the foreach(...) loop when you encounter a "." (period)
Well, assuming you define a sentence as being ended with a '.':
Use String.IndexOf() to find the position of the first '.'. After that, search in a Substring instead of the entire string.
Find the position of the '.' in the text (you can use Split).
Count the 'a' characters in the text from position 0 up to the first '.'.
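The IndexOf/Substring idea from the last two answers could look roughly like this. `FirstSentence` and `CountInFirstSentence` are illustrative names, and the sketch assumes a sentence ends at the first period; if there is no period, the whole text is treated as one sentence.

```csharp
using System;
using System.Linq;

public static class FirstSentence
{
    // Counts how often 'letter' occurs before the first '.' in 'text'.
    public static int CountInFirstSentence(string text, char letter)
    {
        int end = text.IndexOf('.');
        // IndexOf returns -1 when no period exists; fall back to the full text.
        string first = end >= 0 ? text.Substring(0, end) : text;
        return first.Count(c => c == letter);
    }
}
```

For the text from the question, `FirstSentence.CountInFirstSentence("bla bla bla. something second. maybe last sentence.", 'a')` returns 3.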
string SentenceToCheck = "Hi, I can wonder this situation where I can do best";
//Here I am giving several way to find this
//Using Regular Expression
int HowMany = Regex.Split(SentenceToCheck, "a", RegexOptions.IgnoreCase).Length - 1;
int i = Regex.Matches(SentenceToCheck, "a").Count;
// Simple way
int Count = SentenceToCheck.Length - SentenceToCheck.Replace("a", "").Length;
//Linq (method syntax), case-insensitive
int _lamdaCount = SentenceToCheck.Count(t => char.ToUpperInvariant(t) == 'A');
//Linq (query syntax), case-insensitive
var _linqAIEnumareable = from _char in SentenceToCheck
where char.ToUpperInvariant(_char) == 'A'
select _char;
int upperOrLowerCount = _linqAIEnumareable.Count();
//Linq (query syntax), case-sensitive
var _linqCount = from g in SentenceToCheck
where g == 'a'
select g;
int lowerCount = _linqCount.Count();

How to find the first x occurrences of a Char in a String using Regex

I'm trying to find out how I can get the first x matches of a char in a string. I tried using a MatchCollection but I can't find any escape sequence to stop after the x-th match.
FYI:
I need this for strings of variable length with a varying number of occurrences of the searched char, so just getting all matches and using only the first x isn't a solution.
Thanks in advance
Edit:
I am using a StreamReader to get information out of .txt files and write it to a string, one string per file. These strings have very different lengths. Every string should contain, let's say, 3 keywords. But sometimes something went wrong and I have only one or two of the keywords. Between the keywords are other fields separated with a ;. So if I use a MatchCollection to get the indexes of the ;'s and one keyword is missing, the information in the file is shifted. Because of that I need to find the first x occurrences before/after an (existing) keyword.
Do you really need Regex? Won't something like this do?
string simpletext = "Hello World";
int firstoccur = simpletext.IndexOfAny(new char[]{'o'});
Since you want all the indexes for that character you can try in this fashion
string simpletext = "Hello World";
int[] occurences = Enumerable.Range(0, simpletext.Length).Where(x => simpletext[x] == 'o').ToArray();
You can use the Match class. It returns only one result at a time, but you can iterate over the string until it finds the last one.
Something like this:
Match match = Regex.Match(input, pattern);
int count = 0;
while (match.Success)
{
count++;
// do something with match
match = match.NextMatch();
// Exit the loop when your match number is reached
}
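The loop above leaves the exit condition as a comment. One way to fill it in is to stop as soon as enough matches have been collected, so the rest of the string is never scanned. In this sketch `FirstMatches`, `FirstIndexes` and the `limit` parameter are hypothetical names chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FirstMatches
{
    // Returns the start indexes of at most 'limit' matches of 'pattern',
    // stopping early instead of collecting every match first.
    public static List<int> FirstIndexes(string input, string pattern, int limit)
    {
        var indexes = new List<int>();
        Match match = Regex.Match(input, pattern);
        while (match.Success && indexes.Count < limit)
        {
            indexes.Add(match.Index);
            // NextMatch resumes the search where the previous match ended.
            match = match.NextMatch();
        }
        return indexes;
    }
}
```

`FirstMatches.FirstIndexes("a;b;c;d", ";", 2)` returns the indexes 1 and 3. If the input holds fewer matches than `limit`, the list is simply shorter.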
If you're determined to use Regex then I'd do this with Matches as opposed to Match; largely because you get the count up front.
string pattern = "a";
string source = "this is a test of a regex match";
int maxMatches = 2;
MatchCollection mc = Regex.Matches(source, pattern);
// Never read past the matches that actually exist
int limit = Math.Min(maxMatches, mc.Count);
for (int i = 0; i < limit; i++)
{
//do something with mc[i].Index, mc[i].Length
}
The split operation is pretty fast so if the regex is not a requirement this could be used:
public static IEnumerable<int> IndicesOf(this string text, char value, int count)
{
var tokens = text.Split(value);
var sum = tokens[0].Length;
var currentCount = 0;
for (int i = 1; i < tokens.Length &&
sum < text.Length &&
currentCount < count; i++)
{
yield return sum;
sum += 1 + tokens[i].Length;
currentCount++;
}
}
This executes in roughly 60% of the time of the regex.
