Regex Split behaviour when splitting string by variable length in C# [duplicate]


This question already has answers here:
C# Regex.Split: Removing empty results
(9 answers)
Closed 3 years ago.
An application produces a flat file where each line represents data to be imported into another application. The type of data is irrelevant to this question, but suppose the first line is the string of digits "0123456789" and each column has a different fixed width. For example, I have to split the string into pieces of lengths 1, 2, 3 and 4, giving:
0
12
345
6789
The following code tests this using Regex.Split(input, pattern), but can anyone explain why the string is split into 6 elements when I expected 4?
using System;
using System.Text.RegularExpressions;

public class Program
{
    public static void Main()
    {
        string data = "0123456789";
        string splitPattern = "^";
        for (int x = 1; x < 5; ++x)
        {
            splitPattern += string.Format("(.{{{0}}})", x);
        }
        string[] processedData = Regex.Split(data, splitPattern);
        Console.WriteLine($"Using {splitPattern} to split {data} yields {processedData.Length} results.");
        foreach (string d in processedData)
        {
            Console.WriteLine(String.Format("[{0}]", d));
        }
    }
}
Running this code prints the following lines:
Using ^(.{1})(.{2})(.{3})(.{4}) to split 0123456789 yields 6 results.
[]
[0]
[12]
[345]
[6789]
[]
In reality the data includes text, numbers and punctuation. Also, the column lengths are not incremental, but I was stumped by the way this was split.
Edit
Thanks for the answers and comments. I don't consider this to be a duplicate of C# Regex.Split: Removing empty results since the user actually edited their question to explain it was relating to their regex pattern. I understand now that the behaviour I've noticed is expected and after thinking about it, appreciate why this is so. The pattern in Regex.Split(data, splitPattern) kind of denotes where the delimiter should be. So if the pattern matches the start (and end), then an empty string is the result before (and after) the match.
I prefer Split over Match in this instance since it returns a simple string[] instead of a Match.

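This happens because Regex.Split returns the text before and after each match of the pattern. When the pattern contains capturing groups, the captured text is included in the result as well. Your pattern matches the entire string, so you get an empty string before the match, the four captured groups, and an empty string after it.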
See tweaked demo: https://dotnetfiddle.net/gUnxGP
using System;
using System.Text.RegularExpressions;

public class Program
{
    public static void Main()
    {
        string data = "01234,56789";
        var splitPattern = ","; // two results
        string[] processedData = Regex.Split(data, splitPattern);
        Console.WriteLine($"Using {splitPattern} to split {data} yields {processedData.Length} results.");
        foreach (string d in processedData)
        {
            Console.WriteLine(String.Format("[{0}]", d));
        }
        splitPattern = "(,)"; // three results (includes the comma itself)
        processedData = Regex.Split(data, splitPattern);
        Console.WriteLine($"Using {splitPattern} to split {data} yields {processedData.Length} results.");
        foreach (string d in processedData)
        {
            Console.WriteLine(String.Format("[{0}]", d));
        }
    }
}
/* Output
Using , to split 01234,56789 yields 2 results.
[01234]
[56789]
Using (,) to split 01234,56789 yields 3 results.
[01234]
[,]
[56789]
*/
As Wiktor commented, you should probably be using Matches instead of Split.
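A minimal sketch of the Match-based approach, assuming the column widths are known up front. The FixedWidth class and SplitByWidths method are hypothetical names used only for illustration; the pattern is built the same way as in the question, and the capture groups are read directly so no empty strings appear.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class FixedWidth
{
    // Hypothetical helper: returns one string per requested column width,
    // or an empty array when the line is too short to fill every column.
    public static string[] SplitByWidths(string line, params int[] widths)
    {
        // Builds e.g. ^(.{1})(.{2})(.{3})(.{4}) for widths 1, 2, 3, 4.
        string pattern = "^" + string.Concat(widths.Select(w => "(.{" + w + "})"));
        Match match = Regex.Match(line, pattern);
        if (!match.Success)
        {
            return Array.Empty<string>();
        }
        // Group 0 is the whole match, so skip it and keep the captured columns.
        return match.Groups.Cast<Group>().Skip(1).Select(g => g.Value).ToArray();
    }
}
```

Calling FixedWidth.SplitByWidths("0123456789", 1, 2, 3, 4) returns a plain string[] with the four columns, which keeps the convenience the asker wanted from Split.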

Related

Split a file using specific word in C# [duplicate]

This question already has answers here:
How to tell a RegEx to be greedy on an 'Or' Expression
(2 answers)
Closed 2 years ago.
There is a file which I want to split:
MSH|^~\&||||^asdasdasd|||asdasd|637226866166648574|637226866166648574|2.4
EVN|asd|20200416|20200416
PID|1|PW9074asdasd41|asd|PW907441|asdsad^wqe^wqeqwe||19700524|M
MSH|^~\&||||^qweqwewqe|||qwewqeqw|637226866166648574|637226866166648574|2.4
EVN|P03|20200416|20200416
PID|1|PW907441|PW907441|PW907441|Purvis^Walter^Rayshawn||19700524|M
I want to split it on MSH so that the result is an array of strings:
array[0]=
"MSH|^~\&||||^asdasdasd|||asdasd|637226866166648574|637226866166648574|2.4
EVN|asd|20200416|20200416
PID|1|PW9074asdasd41|asd|PW907441|asdsad^wqe^wqeqwe||19700524|M";
array[1]=
"MSH|^~\&||||^asdasdasd|||asdasd|637226866166648574|637226866166648574|2.4
EVN|asd|20200416|20200416
PID|1|PW9074asdasd41|asd|PW907441|asdsad^wqe^wqeqwe||19700524|M";
What I have tried so far:
string[] sentences = Regex.Split(a, @"\W*((?i)MSH(?-i))\W*");
result:
array[0]="";
array[1]="MSH";
array[2]="asdasdasd|||asdasd|637226866166648574|637226866166648574|2.4
EVN|asd|20200416|20200416
PID|1|PW9074asdasd41|asd|PW907441|asdsad^wqe^wqeqwe||19700524|M";
array[3]="MSH";
array[4]="asdasdasd|||asdasd|637226866166648574|637226866166648574|2.4
EVN|asd|20200416|20200416
PID|1|PW9074asdasd41|asd|PW907441|asdsad^wqe^wqeqwe||19700524|M";
Or at least it should not drop |^~\&||||^ after the split in index 1 and 2.

You can simply use the Split() function for this. The code below generates an IEnumerable, which you can turn into an array with ToArray if you want:
// Requires: using System; using System.Linq;
void Main()
{
    string s = @"MSH|^~\&||||^asdasdasd|||asdasd|637226866166648574|637226866166648574|2.4
EVN|asd|20200416|20200416
PID|1|PW9074asdasd41|asd|PW907441|asdsad^wqe^wqeqwe||19700524|M
MSH|^~\&||||^qweqwewqe|||qwewqeqw|637226866166648574|637226866166648574|2.4
EVN|P03|20200416|20200416
PID|1|PW907441|PW907441|PW907441|Purvis^Walter^Rayshawn||19700524|M";
    // Split removes the "MSH" delimiter, so it is put back in front of each piece.
    foreach (var element in s.Split(new string[] { "MSH" }, StringSplitOptions.RemoveEmptyEntries).Select(x => $"MSH{x}"))
    {
        Console.WriteLine(element);
    }
}

If you want to split on MSH, Cetin Basoz is right. This works perfectly well:
var sentences = a.Split(new String[] { "MSH" }, StringSplitOptions.RemoveEmptyEntries);
If you want the split to be case-insensitive, you can use this, which is much simpler than the regex you used previously:
var sentences = Regex.Split(a, "MSH", RegexOptions.IgnoreCase);
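Both calls above discard the MSH text itself, because the delimiter is not part of the pieces Split returns. A minimal sketch that keeps it, assuming every message starts with "MSH|" at the beginning of a line: it splits on a zero-width lookahead, so nothing is consumed. The Hl7Splitter class and SplitMessages method are hypothetical names for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class Hl7Splitter
{
    // Hypothetical helper: splits the text in front of every line that starts
    // with "MSH|" and keeps the "MSH|" prefix in each resulting message.
    public static string[] SplitMessages(string text)
    {
        // (?=^MSH\|) is a lookahead, so the match has zero width.
        // Multiline makes ^ match at the start of every line.
        return Regex.Split(text, @"(?=^MSH\|)", RegexOptions.Multiline | RegexOptions.IgnoreCase)
            .Select(part => part.Trim())      // drop the trailing line break of each message
            .Where(part => part.Length > 0)   // drop the empty piece before the first match
            .ToArray();
    }
}
```

The Where filter is needed because a match at the very start of the input makes Regex.Split emit an empty first element.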

how to extract a number only, not any operators

Hi, I am trying to match only the numbers in a string that contains operators.
However, the following regex is also giving me the operators and I don't know why.
For example, I have the string "2X/8" and I am trying to get rid of 8.
if (Regex.IsMatch(elements[i], @"\d"))
{
    Console.WriteLine("Adding to numberstack: {0}", elements[i]);
    numberStack.Push(elements[i]);
}
if (i >= elements.Length - 1)
{
    Console.WriteLine("Inside the popper");
    if ((i - 2) >= 0)
    {
        Console.WriteLine(numberStack.Peek());
        if (elements[i - 1].Contains("/*") && elements[i - 2].Contains("X"))
        {
            numberStack.Pop();
        }
    }
}

The question seems a bit confusing to me, but I'm assuming the following (correct me if I'm wrong):
You are trying to discover if there is a number at the beginning of your string, and extract that number
"elements" is an array of string elements.
"numberStack" is a stack of string elements.
If so, notice you are pushing the "elements[i]" value to the stack, which I understand contains the whole string of the expression you are trying to evaluate ("2X/8" as per your example), and not only the number at the start of the expression. This would explain why you are getting not only the number "2" on your result, but the whole "2X/8" value.
There are several ways to extract the numbers for an expression, and selecting one of them depends on your specific needs.
As a quick example, if you just want to extract each set of numbers from a string, you can get all matches for your regular expression inside that string and iterate through them:
string theExpression = "2X/8+123";
MatchCollection matches = Regex.Matches(theExpression, @"\d+");
foreach (Match m in matches)
    Console.WriteLine(m.Value);
Console.ReadLine();
This example would print the numbers 2, 8 and 123 from the given expression to the output.
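If the operators are needed as well (for example to decide later which number to pop), the same idea can be extended into a small tokenizer. This is only a sketch, assuming a token is either a run of digits or a single non-whitespace character; ExpressionTokenizer and Tokenize are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ExpressionTokenizer
{
    // Hypothetical helper: returns each token together with a flag that says
    // whether the token is a number.
    public static List<(string Text, bool IsNumber)> Tokenize(string expression)
    {
        var tokens = new List<(string Text, bool IsNumber)>();
        // \d+ is tried first, so runs of digits stay together; \S then picks up
        // any other single non-whitespace character such as X, / or +.
        foreach (Match m in Regex.Matches(expression, @"\d+|\S"))
        {
            tokens.Add((m.Value, char.IsDigit(m.Value[0])));
        }
        return tokens;
    }
}
```

With this, only the tokens whose IsNumber flag is true would be pushed onto the number stack, rather than the whole expression string.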

String splitting with a special structure

I have strings of the following form:
str = "[int]:[int],[int]:[int],[int]:[int],[int]:[int], ..." (for undefined number of times).
What I did was this:
string[] str_split = str.Split(',');
for( int i = 0; i < str_split.Length; i++ )
{
string[] str_split2 = str_split[i].Split(':');
}
Unfortunately this breaks when some of the numbers contain a ',' themselves. For example, we have something like this:
695,000:14,306,000:12,136000:12,363000:6
in which the numbers, ordered from left to right, are the following:
695,000
14
306,000
12
136000
12
363000
6
How can I resolve this string splitting problem?

If it is the case that only the number to the left of the colon separator can contain commas, then you could simply express this as:
string s = "695,000:14,306,000:12,136000:12,363000:6";
var parts = Regex.Split(s, @":|(?<=:\d+),");
The regex pattern, which identifies the separators, reads: "any colon, or any comma that follows a colon and a sequence of digits (but not another comma)".
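As a rough sketch of how the split result could be turned into integer pairs, assuming (as above) that only the left-hand number of each pair may contain commas and that those commas are thousands separators. PairParser and Parse are hypothetical names.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class PairParser
{
    // Hypothetical helper: parses "left:right,left:right,..." into integer pairs.
    public static (int Left, int Right)[] Parse(string s)
    {
        // Same separator pattern as in the answer: a colon, or a comma that
        // directly follows a colon and a run of digits.
        string[] parts = Regex.Split(s, @":|(?<=:\d+),");
        var pairs = new (int Left, int Right)[parts.Length / 2];
        for (int i = 0; i < pairs.Length; i++)
        {
            // AllowThousands lets "695,000" parse as 695000.
            int left = int.Parse(parts[2 * i], NumberStyles.AllowThousands, CultureInfo.InvariantCulture);
            int right = int.Parse(parts[2 * i + 1], CultureInfo.InvariantCulture);
            pairs[i] = (left, right);
        }
        return pairs;
    }
}
```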

A simple solution is split using : as delimiter. The resultant array will have numbers of the format [int],[int]. Parse through the array and split each entry using , as the delimiter. This will give you an array of [int] numbers.
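One way this two-step split could be sketched, assuming only the number to the left of each colon may contain commas: after splitting on ':', every segment except the first starts with a right-hand number, so the first comma in that segment is the pair separator. ColonFirstSplitter and SplitNumbers are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class ColonFirstSplitter
{
    // Hypothetical helper: returns the numbers as strings, in left-to-right order.
    public static List<string> SplitNumbers(string s)
    {
        string[] segments = s.Split(':');
        // The first segment is entirely the first left-hand number.
        var numbers = new List<string> { segments[0] };
        for (int i = 1; i < segments.Length; i++)
        {
            string segment = segments[i];
            int comma = segment.IndexOf(',');
            if (comma < 0)
            {
                // Last segment: only a right-hand number remains.
                numbers.Add(segment);
            }
            else
            {
                // Text before the first comma is the right-hand number;
                // everything after it is the next left-hand number.
                numbers.Add(segment.Substring(0, comma));
                numbers.Add(segment.Substring(comma + 1));
            }
        }
        return numbers;
    }
}
```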

It might not be the best way to do it and it might not work all the time but here's what I'd do.
string[] leftRightDoubles = str.Split(':');
foreach (string substring in leftRightDoubles)
{
    string[] indivNumbers = substring.Split(',');
    // If indivNumbers.Length == 2, you know that these two are separate numbers.
    // If indivNumbers.Length > 2, use heuristics to determine which parts belong to which number.
    if (indivNumbers.Length > 2)
    {
        for (int i = 0; i < indivNumbers.Length; i++)
        {
            if (indivNumbers[i] != "000") // Or use some other heuristic
            {
                // It's a new number
            }
            else
            {
                // It's the rest of the previous number
            }
        }
    }
}
// It's sort of pseudocode with comments (haven't touched C# in a while so I don't want to write full C# code)

splitting string into array with a specific number of elements, c#

I have a string which consists of a number of ordered terms separated by newlines (\n), as shown in the following example (note that the string is an element of an array of strings):
term 1
term 2
.......
.......
term n
I want to take only a specific number of terms, let's say 1000, and discard the rest. I'm trying the following code:
string[] training = traindocs[tr].Trim().Split('\n');
List<string> trainterms = new List<string>();
for (int i = 0; i < 1000; i++)
{
    if (i >= training.Length)
        break;
    trainterms.Add(training[i].Trim().Split('\t')[0]);
}
Can I do this without using a List or any other data structure? I mean, can I extract the specific number of terms into the array (training) directly? Thanks in advance.

How about LINQ? The .Take() extension method kind of seems to fit your bill:
List<string> trainterms = traindocs[tr].Trim().Split('\n').Take(1000).ToList();
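Since the question asks for an array rather than a List, the same LINQ chain can end in ToArray. A small sketch, assuming (as in the question's loop) that only the text before the first tab on each line is wanted; TermExtractor and FirstTerms are hypothetical names.

```csharp
using System;
using System.Linq;

public static class TermExtractor
{
    // Hypothetical helper: returns at most maxTerms terms from a
    // newline-separated document, keeping only the part before the first tab.
    public static string[] FirstTerms(string document, int maxTerms)
    {
        return document.Trim()
            .Split('\n')
            .Take(maxTerms)                              // stops early if there are fewer lines
            .Select(line => line.Trim().Split('\t')[0])  // same per-line handling as the question
            .ToArray();
    }
}
```

For the question's data this would be called as TermExtractor.FirstTerms(traindocs[tr], 1000).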

According to MSDN you can use an overloaded version of the Split method:
public string[] Split(char[] separator, int count, StringSplitOptions options)
Parameters
separator (System.Char[]): An array of Unicode characters that delimit the substrings in this string, an empty array that contains no delimiters, or null.
count (System.Int32): The maximum number of substrings to return.
options (System.StringSplitOptions): StringSplitOptions.RemoveEmptyEntries to omit empty array elements from the array returned; or StringSplitOptions.None to include empty array elements in the array returned.
Return Value
System.String[]: An array whose elements contain the substrings in this string that are delimited by one or more characters in separator. For more information, see the Remarks section.
So something like so:
String str = "A,B,C,D,E,F,G,H,I";
String[] str2 = str.Split(new Char[]{','}, 5, StringSplitOptions.RemoveEmptyEntries);
System.Console.WriteLine(str2.Length);
System.Console.Read();
Would print: 5
EDIT:
Upon further investigation it seems that the count parameter just instructs when the splitting stops. The rest of the string will be kept in the last element.
So, the code above, would yield the following result:[0] = A, [1] = B, [2] = C, [3] = D, [4] = E,F,G,H,I, which is not something you seem to be after.
To fix this, you would need to do something like so:
String str = "A\nB\nC\nD\nE\nF\nG\nH\nI";
List<String> myList = str.Split(new Char[]{'\n'}, 5, StringSplitOptions.RemoveEmptyEntries).ToList<String>();
myList[myList.Count - 1] = myList[myList.Count - 1].Split(new Char[] { '\n' })[0];
System.Console.WriteLine(myList.Count);
foreach (String str1 in myList)
{
System.Console.WriteLine(str1);
}
System.Console.Read();
The code above will only retain the first 5 (in your case, 1000) elements. Thus, I think that Darin's solution might be cleaner, if you will.

If you want the most efficient (fastest) way, use the overload of String.Split that takes the maximum number of items to return.
If you want the easy way, use LINQ.

C# - Parsing a line of text - what's the best way to do this? [duplicate]

This question already has answers here:
Closed 13 years ago.
Possible Duplicate:
Parse multiple doubles from string in C#
Say I have a line of text that looks as follows:
"45.690 24.1023 .09223 4.1334"
What would be the most efficient way, in C#, to extract just the numbers from this line? The number of spaces between each number varies and is unpredictable from line to line. I have to do this thousands of times, so efficiency is key.
Thanks.

IEnumerable<double> doubles = s.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
    .Select<string, double>(double.Parse);
Updated to use StringSplitOptions.RemoveEmptyEntries, since the number of spaces varies.

Use a Regex split. This will allow you to split on whitespace of any length between your numbers:
string input = "45.690 24.1023 .09223 4.1334";
string pattern = "\\s+"; // Split on one or more whitespace characters
string[] substrings = Regex.Split(input, pattern);
foreach (string match in substrings)
{
    Console.WriteLine("'{0}'", match);
}

I haven't measured, but simplicity is key if you are trying to be efficient, so probably something like:
// Collects only the digit characters; decimal points and number boundaries are not kept.
var chars = new List<char>();
for (int i = 0; i < text.Length; ++i)
    if (char.IsDigit(text[i]))
        chars.Add(text[i]);
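A slightly fuller sketch of the same plain-loop idea that keeps whole numbers together, assuming the numbers are separated only by whitespace and use '.' as the decimal separator. LineParser and ParseDoubles are hypothetical names, and this has not been benchmarked against the other answers.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class LineParser
{
    // Hypothetical helper: scans the line once and parses each
    // whitespace-delimited token as a double.
    public static List<double> ParseDoubles(string text)
    {
        var result = new List<double>();
        int i = 0;
        while (i < text.Length)
        {
            // Skip any run of whitespace between numbers.
            while (i < text.Length && char.IsWhiteSpace(text[i])) i++;
            int start = i;
            // Advance to the end of the current token.
            while (i < text.Length && !char.IsWhiteSpace(text[i])) i++;
            if (i > start)
            {
                result.Add(double.Parse(text.Substring(start, i - start), CultureInfo.InvariantCulture));
            }
        }
        return result;
    }
}
```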

You want efficient...
var regex = new Regex(@"([\d\.]+)", RegexOptions.Compiled);
var matches = regex.Matches(input);
