Match a row with fixed columns as long as possible

I'm going to parse a position-based file from a legacy system. Each column in the file has a fixed width, and each row can be at most 80 characters long. The problem is that you don't know how long a row is. Sometimes only the first five columns are filled in, and sometimes all columns are used.
If I KNEW that all 80 characters were used, then I could simply do this:
^\s*
(?<a>\w{3})
(?<b>[ \d]{2})
(?<c>[ 0-9a-fA-F]{2})
(?<d>.{20})
...
But the problem with this is that if the last columns are missing, the row will not match. The last column that is present can even contain fewer characters than the maximum for that column.
See the examples:
Text to match a b c d
"AQM45A3A text " => AQM 45 A3 "A text " //group d has 9 chars instead of 20
"AQM45F5" => AQM 45 F5 //group d is missing
"AQM4" => AQM 4 //group b has 1 char instead of 2
"AQM4 ASome Text" => AQM 4 A "Some Text" //group b and c only uses one char, but fill up the gap with space
"AQM4FSome Text" => No match, group b should have two numbers, but it is only one.
"COM*A comment" => Comments do not match (all comments are prefixed with COM*)
" " => Empty lines do not match
How should I design the Regular Expression to match this?
Edit 1
In this example, EACH row that I want to parse starts with AQM
Column a is always starting at position 0
Column b is always starting at position 3
Column c is always starting at position 5
Column d is always starting at position 7
If a column does not use all of its space, it is filled up with spaces
Only the last column that is used can be trimmed
Edit 2
To make it clearer, I enclose here some examples of what the data might look like, along with the definition of the columns (note that the examples I mentioned earlier in the question were heavily simplified)

I'm not sure a regexp is the right thing to use here. If I understand your structure, you want something like
if (length >= 8)
    d = everything from the 8th column on
    remove field d
else
    d = empty
if (length >= 6)
    c = everything from the 6th column on
    remove field c
else
    c = empty
etc. Maybe a regexp can do it, but it will probably be rather contrived.

Try using a ? after the groups which might not be there. That way you still get a match when some group is missing.
Edit n, after Sguazz's answer
I would use
(?<a>AQM)(?<b>[ \d]{2})?(?<c>[ 0-9a-fA-F]{2})?(?<d>.{0,20})?
or even a + instead of the {0,20} for the last group, if there could be more than 20 chars.
Edit n+1,
Better like this?
(?<a>\w{3})(?<b>\d[ \d])(?<c>[0-9a-fA-F][ 0-9a-fA-F])(?<d>.+)

So, just to rephrase: in your example you have a sequence of characters, and you know that the first 3 belong to group A, the following 2 belong to group B, then 2 to group C and 20 to group D, but there might not be that many characters.
Try with:
(?<a>\w{0,3})(?<b>[ \d]{0,2})(?<c>[ 0-9a-fA-F]{0,2})(?<d>.{0,20})
Basically these numbers are now an upper limit of the group as opposed to a fixed size.
EDIT, to reflect your last comment: if you know that all your relevant rows start with 'AQM', you can replace group A with (?<a>AQM)
ANOTHER EDIT: Let's try with this instead.
(?<a>AQM)(?<b>[ \d]{2}|[ \d]$)(?<c>[ 0-9a-fA-F]{0,2})(?<d>.{0,20})
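The requirement that a later column may only appear when all earlier ones are present can also be expressed by nesting the optional groups. The following is only a sketch for the four simplified columns a to d; the class name RowMatcher and the method Parse are made up for illustration, and a real 80-character layout would need one nesting level per column.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RowMatcher
{
    // Hypothetical pattern covering only the simplified columns a-d.
    // Each later column sits inside an optional group of the previous one,
    // so it can only match when every column before it is present.
    private static readonly Regex Row = new Regex(
        @"^(?<a>AQM)" +
        @"(?:(?<b>[ \d]{2}|[ \d]$)" +               // full width, or one char at end of row
        @"(?:(?<c>[ 0-9a-fA-F]{2}|[ 0-9a-fA-F]$)" + // same idea for column c
        @"(?<d>.{0,20}))?)?$");                     // last column may be trimmed

    public static Match Parse(string line) => Row.Match(line);
}
```

With this pattern a row such as "AQM4FSome Text" is rejected, because column b can only be a single character when that character is the last one in the row. Whether a group was present at all can be checked with Groups["c"].Success.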

Perhaps you could use a function like this one to break the string into its column values. It doesn't parse comment strings and is able to handle strings that are shorter than 80 characters. It doesn't validate the contents of the columns though. Maybe you can do that when you use the values.
/// <summary>
/// Break a data row into a collection of strings based on the expected column widths.
/// </summary>
/// <param name="input">The width delimited input data to break into sub strings.</param>
/// <returns>
/// An empty collection if the input string is empty or a comment.
/// A collection of the width delimited values contained in the input string otherwise.
/// </returns>
private static IEnumerable<string> ParseRow(string input) {
    const string COMMENT_PREFIX = "COM*";
    var columnWidths = new int[] { 3, 2, 2, 3, 6, 14, 2, 2, 3, 2, 2, 10, 7, 7, 2, 1, 1, 2, 7, 1, 1 };
    int inputCursor = 0;
    int columnIndex = 0;
    var parsedValues = new List<string>();
    if (String.IsNullOrEmpty(input) || input.StartsWith(COMMENT_PREFIX) || input.Trim().Length == 0) {
        return parsedValues;
    }
    while (inputCursor < input.Length && columnIndex < columnWidths.Length) {
        //Make sure the column width never exceeds the bounds of the input string. This can happen if the input string doesn't end on the edge of a column.
        int columnWidth = Math.Min(columnWidths[columnIndex++], input.Length - inputCursor);
        string columnValue = input.Substring(inputCursor, columnWidth);
        parsedValues.Add(columnValue);
        inputCursor += columnWidth;
    }
    return parsedValues;
}

Related

Find smallest number in given range in an array

Hi, I have an array of size N. The array values will always be one of the integers 1, 2 or 3. Now I need to find the lowest number within a given range of array indices. For example, with array = 2 1 3 1 2 3 1 3 3 2, the lowest values for some ranges are [2-4] = 1, [4-5] = 2, [7-8] = 3, etc.
Below is my code :
static void Main(String[] args) {
    string[] width_temp = Console.ReadLine().Split(' ');
    int[] width = Array.ConvertAll(width_temp, Int32.Parse); // Main Array
    string[] tokens_i = Console.ReadLine().Split(' ');
    int i = Convert.ToInt32(tokens_i[0]);
    int j = Convert.ToInt32(tokens_i[1]);
    int vehicle = width[i];
    for (int beg = i + 1; beg <= j; beg++) {
        if (vehicle > width[beg]) {
            vehicle = width[beg];
        }
    }
    Console.WriteLine("{0}", vehicle);
}
The above code works fine, but my concern is efficiency. Above I am handling just one range, but in practice there will be many ranges and I have to return the lowest value for each of them. The problem is that for a range like [0-N], where N is the array size, I end up comparing all the items to find the lowest. So I was wondering if there is a way to optimize the code for efficiency?

I think this is an RMQ (Range Minimum Query) problem, and there are several implementations which may fit your scenario.
Here is a nice TopCoder tutorial covering a lot of them; I recommend two of them.
Using the notation in the tutorial, define <P, T> as <Preprocessing Complexity, Query Complexity>. There are two well-known and common implementations / data structures which can handle RMQ: Square Rooting Array and Segment Tree.
Segment Tree is well known yet harder to implement. It can solve RMQ in <O(n), O(lg n)>, which is a better complexity than Square Rooting Array (<O(n), O(sqrt(n))>).
Square Rooting Array (<O(n), O(sqrt(n))>)
Note that this is not an official name for the technique or for any data structure; indeed, I do not know whether the technique has an official name at all, but here we go.
In terms of query time it is definitely not the best you can get for RMQ, but it has an advantage: it is easy to implement (compared to Segment Tree).
Here is the high-level concept of how it works:
Let N be the length of the array. We split the array into about sqrt(N) groups, each containing about sqrt(N) elements.
Now we use O(N) time to find the minimum value of each group and store them in another array called M.
So using the above array, M[0] = min(A[0..2]), M[1] = min(A[3..5]), M[2] = min(A[6..8]), M[3] = min(A[9..9])
(The image from the TopCoder tutorial stores the index of the minimum element instead of its value.)
Now let's see how to query:
For any range [p..q], we can always split this range into at most 3 parts.
Two parts are the left and right boundaries, which are leftover elements that cannot form a whole group.
One part is the elements in between, which form some whole groups.
Using the same example, RMQ(2,7) can be split into 3 parts:
Left Boundary (left over elements): A[2]
Right Boundary (left over elements): A[6], A[7]
In between elements (elements across whole group): A[3],A[4],A[5]
Notice that for the in-between elements we have already preprocessed their minimum in M, so we do not need to look at each element; we can look at and compare the entries of M instead. There are at most O(sqrt(N)) of them (that is the length of M, after all).
The boundary parts cannot form a whole group by definition, which means each contains at most O(sqrt(N)) elements (that is the length of one whole group, after all).
So combining the two boundary parts with the one part of in-between elements, we only need to compare O(3*sqrt(N)) = O(sqrt(N)) elements.
You can refer to the tutorial for more details (it even includes some pseudocode).
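To make the description above concrete, here is a minimal sketch of the technique in C#. The class name SqrtRmq and its members are my own naming, not something from the tutorial, and it stores the minimum value of each group rather than its index.

```csharp
using System;

public sealed class SqrtRmq
{
    private readonly int[] _a;
    private readonly int[] _m;   // _m[g] = minimum value of group g
    private readonly int _size;  // number of elements per group, about sqrt(N)

    public SqrtRmq(int[] a)
    {
        _a = a;
        _size = Math.Max(1, (int)Math.Sqrt(a.Length));
        _m = new int[(a.Length + _size - 1) / _size];
        for (int g = 0; g < _m.Length; g++) _m[g] = int.MaxValue;
        // O(N) preprocessing: fold every element into the minimum of its group.
        for (int i = 0; i < a.Length; i++)
            _m[i / _size] = Math.Min(_m[i / _size], a[i]);
    }

    // Minimum of a[p..q], both bounds inclusive, assuming 0 <= p <= q < a.Length.
    public int Query(int p, int q)
    {
        int min = int.MaxValue;
        int i = p;
        while (i <= q)
        {
            if (i % _size == 0 && i + _size - 1 <= q)
            {
                // A whole group lies inside the range: use its precomputed minimum.
                min = Math.Min(min, _m[i / _size]);
                i += _size;
            }
            else
            {
                // Boundary element that does not belong to a fully covered group.
                min = Math.Min(min, _a[i]);
                i++;
            }
        }
        return min;
    }
}
```

For the array from the question, new SqrtRmq(new[] { 2, 1, 3, 1, 2, 3, 1, 3, 3, 2 }) uses groups of 3 elements, and Query(2, 7) looks at A[2], M[1], A[6] and A[7] only.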

You could do this using Linq extension methods.
List<int> numbers = new List<int> { 2, 1, 3, 1, 2, 3, 1, 3, 3, 2 };
int minindex = 1, maxindex = 3, minimum = -1;
if (minindex >= 0 && minindex <= maxindex && maxindex < numbers.Count)
{
    minimum = Enumerable.Range(minindex, maxindex - minindex + 1) // max inclusive, remove +1 if you want to exclude
        .Select(x => numbers[x]) // Get the elements between given indices
        .Min(); // Get the minimum among them.
}
Check this Demo

This seems a fun little problem. My first point would be that scanning a fixed array tends to be pretty fast (millions per second), so you'd need a vast amount of data to warrant a more complex solution.
The obvious first thing is to break out of the loop as soon as you have found a 1, since that is the lowest possible value.
If you want something more advanced:
Create a new array of int. Create a preload function that populates each item of this array with the next index where the value gets lower.
Create a loop that uses the new array to skip ahead.
Here is what I mean. Take the following arrays.
int[] initialArray = new int[] { 3, 3, 3, 3, 2, 2, 2, 1 };
int[] searchArray = new int[] { 4, 4, 4, 4, 7, 7, 7, 7 };
So the idea is to find the lowest between positions 0-7.
Start at initialArray[0] and get value 3.
Read searchArray[0] and get the value 4. The 4 is the next index where the number is lower.
Read initialArray[4] and get the value 2.
etc.
So basically you'd need to put some effort into building the searchArray, but once it's complete you can scan each range much faster.
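Here is one way the preload function and the skipping loop could look. This is only a sketch under my own assumptions: the names SkipSearch, BuildSearchArray and Min are invented, and an entry points to its own index when no later element is lower (which is how the last entry of searchArray above behaves).

```csharp
using System;
using System.Collections.Generic;

public static class SkipSearch
{
    // next[i] = index of the first element after i that is lower than a[i],
    // or i itself when no later element is lower.
    public static int[] BuildSearchArray(int[] a)
    {
        var next = new int[a.Length];
        var stack = new Stack<int>(); // indexes to the right that can still be a "next lower"
        for (int i = a.Length - 1; i >= 0; i--)
        {
            // Drop candidates that are not lower than the current value.
            while (stack.Count > 0 && a[stack.Peek()] >= a[i]) stack.Pop();
            next[i] = stack.Count > 0 ? stack.Peek() : i;
            stack.Push(i);
        }
        return next;
    }

    // Lowest value in a[from..to], both inclusive, following the skip links.
    public static int Min(int[] a, int[] next, int from, int to)
    {
        int pos = from;
        while (next[pos] != pos && next[pos] <= to) pos = next[pos];
        return a[pos];
    }
}
```

Because the values can only be 1, 2 or 3, a query follows at most two links regardless of how long the range is.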

Structure your loop like this:
int[] inputArray = { 2, 1, 3, 1, 2, 3, 1, 3, 3, 2 };
int minIndex = 2;
int maxIndex = 5;
int minVal = 3; // 3 is the largest value the array can contain
for (int i = minIndex; i <= maxIndex; i++)
{
    if (inputArray[i] <= minVal)
        minVal = inputArray[i];
}
Console.WriteLine("Minimum value in the given range is = {0}", minVal);

How to set string values into an enumerator

I have one string that contains integers and strings, separated by a comma.
For example:
0, Link Alive,1, Link Dead,2, Link Weak,3, Wiznet 0 Dead,4, Wiznet 1 Dead,5, Wiznets Dead
I want to make an enum out of this string like this:
public enum myEnums {
Link Alive = 0,
Link Dead = 2,
Link Weak = 1,
Wiznet 0 Dead = 3,
Wiznet 1 Dead = 4,
Wiznets Dead = 5
}
I was thinking about changing the string into a char array. After that I loop through the char array.
If I detect an integer, I assign its value to a temporary integer value. If I detect a string, I assign its value to a temporary string. After this I'll assign the temporary integer and string to an enumerator.
Only thing is, I don't know how to deal with the comma and the equal sign.
Can someone show me how it's supposed to be done?

It sounds to me like what you really ought to be doing is creating a Dictionary<int,string>. Unless you are going to generate code, you can't change an enum at runtime; it's constant.
Now looking at your string:
0, Link Alive,1, Link Dead,2, Link Weak,3, Wiznet 0 Dead,4, Wiznet 1 Dead,5, Wiznets Dead
It looks like you have a set of comma-delimited values. So split on , and then each pair of values is an int and a string. Make that your dictionary.
So a simple way to do that might look like this (assuming your data is good, i.e. it has an even number of items and every other item, starting with the first, can actually be parsed as an int):
var dict = new Dictionary<int,string>();
var cells = source.Split(',');
for (var i=0; i < cells.Length; i+=2)
{
dict[int.Parse(cells[i])] = cells[i+1].Trim(); // Note: you might want to check boundaries first!
}
Or using Linq, you could do something like this:
string source = "0, Link Alive,1, Link Dead,2, Link Weak,3, Wiznet 0 Dead,4, Wiznet 1 Dead,5, Wiznets Dead";
var dict = source.Split(',')
.Select((v,i) => new { v, i })
.GroupBy(x => x.i/2)
.ToDictionary(x => int.Parse(x.First().v), x => x.Skip(1).First().v.Trim());
Here's a fiddle.
To explain what we are doing here:
First we Split your string on ,. This gives us a string array with ["0"," Link Alive","1"," Link Dead",...] (the names still carry their leading space, which is why we Trim them later).
Next we use Select to select each item and its index as a pair. So now we have a collection of objects that looks something like [{v="0",i=0},{v=" Link Alive",i=1},...]
Now we group this by dividing the index by 2. Because this is integer division, it will truncate. So 0/2 == 0 and 1/2 == 0 and 2/2 == 1 and 3/2 == 1. So we are sorting into pairs of values.
Finally we convert these groups (which we know are pairs of values) into a dictionary. To do that we take the first item in each group, parse it into an int and use that as the key for our dictionary. Then we use the second item, trimmed, as the value. This finally gives us our dictionary.
Now with your dictionary, if you want to look up a value, it's easy:
var myValue = dict[2]; // myValue is now "Link Weak"

By enumerator I assume you mean something over which you can iterate. An 'enum' is basically a set of named integers.
So if you have a string of items separated by commas and want to 'iterate' over them, then this may help:
string input = "0, Link Alive,1, Link Dead,2, Link Weak,3, Wiznet 0 Dead,4, Wiznet 1 Dead,5, Wiznets Dead";
string[] parts = input.Split(new char[] {','}, StringSplitOptions.RemoveEmptyEntries);
foreach (string part in parts)
{
    // do something
}

How to store the last few numbers as a variable in a table row

If I have a table with a column of values like 70-0002098 (let's just call the column ID),
I need the last 4 digits for every table row.
So what I need is something like
foreach(var row in table)
{
    var Id = row.ID(but just the last 4 digits)
}

Not sure what format you want to store it as, or what you want to do with it after, but...
Edit: Added a length check to avoid an index out of bounds condition. Also corrected syntax: SubString() => Substring()
int count = 0;
foreach (var row in table) {
    string temp = row.ID.ToString();
    // Take the last 4 characters when the value is longer than that.
    count += (temp.Length > 4) ? Convert.ToInt32(temp.Substring(temp.Length - 4)) : Convert.ToInt32(temp);
}
// But I have no idea what datatype you have for that or what
// you want to do (count up the integer values or store in an array or something).
// From here you can do whatever you want.

Your illustration suggests that the row ID is not currently a number (it's got a hyphen in it), so I assume it's a string
id.Right(4);
will return the right four characters. It doesn't guarantee they are digits, though. Right is not built into string; it is an extension method which can easily be written, or copied from this thread: Right Function in C#?
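For completeness, such an extension method could be sketched as follows. The class name StringExtensions is arbitrary, and returning the whole string when it is shorter than the requested length is my own choice, not necessarily what the linked thread does.

```csharp
using System;

public static class StringExtensions
{
    // Returns the last 'length' characters of 'value',
    // or the whole string when it is shorter than that.
    public static string Right(this string value, int length)
    {
        if (value == null) throw new ArgumentNullException(nameof(value));
        if (length < 0) throw new ArgumentOutOfRangeException(nameof(length));
        return value.Length <= length ? value : value.Substring(value.Length - length);
    }
}
```

With this in scope, "70-0002098".Right(4) gives "2098", and a value shorter than four characters is returned unchanged instead of throwing.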

Are there any algorithms to categorize an array among certain patterns?

To start with a simple version of the problem, take an array of length 5 (in practice the array length might be 20).
I have a predefined set of patterns, like AAAAB, AAABA, BAABC, BCAAA, .... Each pattern has the same length as the input array. I need a function that takes any integer array as input and returns all the patterns it matches (an array may match several patterns) as fast as possible.
"A" means that in the pattern all numbers at the positions of A are equal. E.g. AAAAA simply means all numbers are equal, {1, 1, 1, 1, 1} matches AAAAA.
"B" means the number at the positions of B is not equal to the number at the positions of A (i.e. a wildcard for a number which is not A). Numbers represented by B don't have to be equal to each other. E.g. ABBAA means the 1st, 4th, 5th numbers are equal to, say x, and the 2nd, 3rd are not equal to x. {2, 3, 4, 2, 2} matches ABBAA.
"C" means this position can be any number (i.e. a wildcard for a number). {1, 2, 3, 5, 1} matches ACBBA, {1, 1, 3, 5, 1} also matches ACBBA
I am looking for an efficient algorithm (in terms of the number of comparisons). It doesn't have to be optimal, but it shouldn't be too far from optimal. I feel it is sort of like a decision tree...
A very straightforward but inefficient way is like the following:
Try to match each pattern against the input. say AABCA against {a, b, c, d, e}. It checks if (a=b=e && a!=c).
If the number of patterns is n, the length of the pattern/array is m, then the complexity is about O(n*m)
Update:
Please feel free to suggest better wording for the question, as I don't know how to make it simple to understand without causing confusion.
An ideal algorithm would need some kind of preparation, like transforming the set of patterns into a decision tree, so that the complexity after preprocessing can be brought down to something like O(log n * log m) for some special pattern sets (just a guess).
Some figures that may be helpful: the predefined pattern set has roughly 30 patterns. The number of input arrays to match against it is about 10 million.
Say AAAAA and AAAAC are both in the predefined pattern set. Then if AAAAA matches, AAAAC matches as well. I am looking for an algorithm which can recognize that.
Update 2
@Gareth Rees's answer gives an O(n) solution, but under the assumption that there are not many "C"s (otherwise the storage is huge and there are many unnecessary comparisons).
I would also welcome any ideas on how to deal with situations where there are many "C"s, say, for an input array of length 20 where each predefined pattern contains at least 10 "C"s.

Here's an idea that trades O(2^n) preparation and storage for O(n)-ish runtime. If your arrays are no longer than your machine's word size (you imply that 20 would be a typical size), or if there are not too many occurrences of C in the patterns, this idea might work for you. (If neither of these conditions is satisfied, avoid!)
1. (Preparatory step, done once.) Create a dictionary d mapping numbers to sets of patterns. For each pattern p, and each subset S of the occurrences of C in that pattern, let n be the number that has a set bit corresponding to each A in the pattern, and for each occurrence of C in S. Add p to the set of patterns d[n].
2. (Remaining steps are done each time a new array needs to be matched against the patterns.) Create a dictionary e mapping numbers to numbers.
3. Let j run over the indexes of the array, and for each j:
3.1. Let i be the j-th integer in the array.
3.2. If i is not in the dictionary e, set e[i] = 0.
3.3. Set e[i] = e[i] + 2^(ℓ − j − 1) where ℓ is the length of the array.
4. Now the keys of e are the distinct numbers i in the array, and the value e[i] has a set bit corresponding to each occurrence of i in the array. For each value e[i] that is found in the dictionary d, all the patterns in the set d[e[i]] match the array.
(Note: in practice you'd build the bitsets the other way round, and use 2^j at step 3.3 instead of 2^(ℓ − j − 1), but I've described the algorithm this way for clarity of exposition.)
Here's an example. Suppose we have the patterns AABBA and ACBBA. In the preprocessing step AABBA turns into the number 25 (11001 in binary), and ACBBA turns into the numbers 25 (11001 in binary) and 17 (10001 in binary), for the two possible subsets of the occurrences of C in the pattern. So the dictionary d looks like this:
17 → {ACBBA}
25 → {AABBA, ACBBA}
After processing the array {1, 2, 3, 5, 1} we have e = {1 → 17, 2 → 8, 3 → 4, 5 → 2}. The value e[1] = 17 is found in d, so this input matches the pattern ACBBA.
After processing the array {1, 1, 2, 3, 1} we have e = {1 → 25, 2 → 4, 3 → 2}. The value e[1] = 25 is found in d, so this input matches the patterns AABBA and ACBBA.
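The steps above could be sketched in C# roughly as follows. The names PatternIndex, Build and Match are made up for this illustration, the bit order follows the exposition (leftmost position is the highest bit) so that the numbers agree with the example, and the code assumes patterns of at most 31 characters so that a mask fits in an int.

```csharp
using System;
using System.Collections.Generic;

public static class PatternIndex
{
    // Preparatory step: map every bitmask a pattern can accept to that pattern.
    public static Dictionary<int, List<string>> Build(IEnumerable<string> patterns)
    {
        var d = new Dictionary<int, List<string>>();
        foreach (string p in patterns)
        {
            int baseMask = 0;            // bits for the positions of A
            var cBits = new List<int>(); // one bit per occurrence of C
            for (int j = 0; j < p.Length; j++)
            {
                int bit = 1 << (p.Length - j - 1);
                if (p[j] == 'A') baseMask |= bit;
                else if (p[j] == 'C') cBits.Add(bit);
            }
            // Enumerate every subset of the C positions (2^c subsets).
            for (int subset = 0; subset < (1 << cBits.Count); subset++)
            {
                int mask = baseMask;
                for (int k = 0; k < cBits.Count; k++)
                    if ((subset & (1 << k)) != 0) mask |= cBits[k];
                if (!d.TryGetValue(mask, out var list))
                    d[mask] = list = new List<string>();
                list.Add(p);
            }
        }
        return d;
    }

    // Per-array step: build e (value -> bitmask of its positions) and look each mask up in d.
    public static List<string> Match(Dictionary<int, List<string>> d, int[] array)
    {
        var e = new Dictionary<int, int>();
        for (int j = 0; j < array.Length; j++)
        {
            e.TryGetValue(array[j], out int mask); // mask stays 0 when the value is new
            e[array[j]] = mask | (1 << (array.Length - j - 1));
        }
        var result = new List<string>();
        foreach (int mask in e.Values)
            if (d.TryGetValue(mask, out var hits)) result.AddRange(hits);
        return result;
    }
}
```

Building the index from AABBA and ACBBA produces the entries 17 and 25 shown above, and matching {1, 1, 2, 3, 1} returns both patterns.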

Get the index of the first A in the pattern, get the value for that position, then loop through the positions.
The following checks whether the array named array matches the pattern in the string named pattern, leaving the result in the boolean match:
int index = pattern.IndexOf('A');
int value = array[index];
bool match = true;
for (int i = 0; i < array.Length; i++) {
    if (pattern[i] != 'C' && i != index) {
        // An A position must hold the same value; a B position must hold a different one.
        if ((pattern[i] == 'A') != (array[i] == value)) {
            match = false;
            break;
        }
    }
}

Aligning strings into columns

I have a collection of strings that the user can add to or remove from. I need a way to print the strings out in columns so that the first letters of the strings are aligned. However, the number of columns must be changeable at run time. Although the default is 4 columns, the user can opt for any number from 1 to 6. I have no idea how to format an unknown quantity of strings into an unknown number of columns.
Example Input:
it we so be a i o u t y z c yo bo go an
Example output of four columns
"Words" with 2 letters:
it so be we
yo bo go an
"Words" with 1 letter:
a i o u
t y z c
Note: not worried about parsing of the words I already have that in my code which I can add if helpful.

If you are trying to create fixed width columns, you can use string.PadLeft(width, paddingChar) and string.PadRight(width, paddingChar) when you are creating your rows.
http://msdn.microsoft.com/en-us/library/system.string.padleft.aspx
You can loop through your words and call .PadXXXX(width) on each word. It will automatically pad your words with the correct number of spaces to make your string the width you supplied.

You can divide the total line width by the number of columns and pad each string to that length. You may also want to trim extra long strings. Here's an example that pads strings that are shorter than the column width and trims strings that are longer. You may want to tweak the behavior for longer strings:
int Columns = 4;
int LineLength = 80;
public void WriteGroup(String[] group)
{
    // determine the column width given the number of columns and the line width
    int columnWidth = LineLength / Columns;
    for (int i = 0; i < group.Length; i++)
    {
        if (i > 0 && i % Columns == 0)
        {   // Finished a complete line; write a new-line to start on the next one
            Console.WriteLine();
        }
        if (group[i].Length > columnWidth)
        {   // This word is too long; truncate it to the column width
            Console.Write(group[i].Substring(0, columnWidth));
        }
        else
        {   // Write out the word with spaces padding it to fill the column width
            Console.Write(group[i].PadRight(columnWidth));
        }
    }
}
If you call the above method with this sample code:
var groupOfWords = new String[] { "alphabet", "alegator", "ant",
"ardvark", "ark", "all", "amp", "ally", "alley" };
WriteGroup(groupOfWords);
Then you should get output that looks like this:
alphabet alegator ant ardvark
ark all amp ally
alley
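The question also asks for the words to be grouped by length before they are laid out. A small sketch of that step, assuming the words are already parsed, could look like this; ColumnFormatter and FormatByLength are hypothetical names, and the method returns the lines instead of writing them so that the caller decides where they go.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class ColumnFormatter
{
    // Groups the words by length (longest first) and lays each group out
    // in rows of 'columns' words, each padded to 'columnWidth' characters.
    public static List<string> FormatByLength(IEnumerable<string> words, int columns, int columnWidth)
    {
        var lines = new List<string>();
        foreach (var group in words.GroupBy(w => w.Length).OrderByDescending(g => g.Key))
        {
            var row = new StringBuilder();
            int i = 0;
            foreach (string word in group)
            {
                row.Append(word.PadRight(columnWidth));
                if (++i % columns == 0)
                {
                    // Row is full: store it without the trailing padding.
                    lines.Add(row.ToString().TrimEnd());
                    row.Clear();
                }
            }
            if (row.Length > 0) lines.Add(row.ToString().TrimEnd());
        }
        return lines;
    }
}
```

The number of columns is just a parameter here, so switching between 1 and 6 columns at run time only means passing a different value.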
