I have the following node(s), which I read with a StreamReader. There can be many of them. I only want to retrieve a few fields from each node, for instance REPLICATE_ID, ASSAY_NUMBER, and a few date fields.
The order of the fields within a node can vary, and new fields may sometimes appear, but the fields I want to extract will not change.
So far, my regex matches the entire node, so it breaks whenever a node has new fields or a different field order. Is it possible to match only the groups I am interested in?
TEST_REPLICATE
{
REPLICATE_ID 453w
ASSAY_NUMBER 334
ASSAY_VERSION 4
ASSAY_STATUS test
DILUTION_ID 1
SAMPLE_ID "NC_dede"
SAMPLE_TYPE Specimen
TEST_ORDER_DATE 05.23.2012
TEST_ORDER_TIME 04:25:07
TEST_INITIATION_DATE 05.23.2012
TEST_INITIATION_TIME 05:19:43
TEST_COMPLETION_DATE 05.23.2012
TEST_COMPLETION_TIME 05:48:01
ASSAY_CALIBRATION_DATE NA
ASSAY_CALIBRATION_TIME NA
TRACK 1
PROCESSING_LANE 1
MODULE_SN "EP004"
LOAD_LIST_NAME C:\BwedwQwedw_SCC\edwLoadlist2RACKSB.json
OPERATOR_ID "Q_dwe"
DARK_SUBREADS 16 23 19 20 16 18 21 16 17 18 19 19 20 22 19 20 19 20 18 20 17 20 21 16 19 23 20 22 19 20
SIGNAL_SUBREADS 18 17 20 21 42 61 41 31 30 30 26 26 25 22 24
DARK_COUNT 577
SIGNAL_COUNT 781
CORRECTED_COUNT 204
STD_BAK 1.95965044971226
AVG_BAK 19.2333333333333
STD_FOR 8.67212471810898
AVG_FOR 26.0333333333333
SHAPE NA
EXCEPTION_STRING TestException - Parameters:Unable to process test, background read failure.
RESULT NA
REPORTED_RESULT NA
REPORTED_RESULT_UNITS NA
REAGENT_MASTER_LOT 13600LI02
REAGENT_SERIAL_NUMBER 25022
RESULT_FLAGS RUO
RESULT_INTERPRETATION NA
DILUTION_PROTOCOL UNDILUTED
RESULT_COMMENT frer 1 LANE A
DATA_MANAGEMENT_FIELD_1 NA
DATA_MANAGEMENT_FIELD_2 NA
DATA_MANAGEMENT_FIELD_3 NA
DATA_MANAGEMENT_FIELD_4 NA
}
string pat = @"TEST_REPLICATE\s*{\s*REPLICATE_ID\s*([^}]*?)\s+ASSAY_NUMBER\s*([^}]*?)\s+ASSAY_VERSION\s*([^}]*?)\s+DILUTION_ID\s*([^}]*?)\s+SAMPLE_ID\s*([^}]*?)\s+SAMPLE_TYPE\s*([^}]*?)\s+TEST_ORDER_DATE\s*([^}]*?)\s+TEST_ORDER_TIME\s*([^}]*?)\s+TEST_INITIATION_DATE\s*([^}]*?)\s+TEST_INITIATION_TIME\s*([^}]*?)\s+TEST_COMPLETION_DATE\s*([^}]*?)\s+TEST_COMPLETION_TIME\s*([^}]*?)\s+ASSAY_CALIBRATION_DATE\s*([^}]*?)\s+ASSAY_CALIBRATION_TIME\s*([^}]*?)\s+TRACK\s*([^}]*?)\s+PROCESSING_LANE\s*([^}]*?)\s+MODULE_SN\s*([^}]*?)\s+LOAD_LIST_NAME\s*([^}]*?)\s+OPERATOR_ID\s*([^}]*?)\s+DARK_SUBREADS\s*([^}]*?)\s+SIGNAL_SUBREADS\s*([^}]*?)\s+DARK_COUNT\s*([^}]*?)\s+SIGNAL_COUNT\s*([^}]*?)\s+CORRECTED_COUNT\s*([^}]*?)\s+STD_BAK\s*([^}]*?)\s+AVG_BAK\s*([^}]*?)\s+STD_FOR\s*([^}]*?)\s+AVG_FOR\s*([^}]*?)\s+SHAPE\s*([^}]*?)\s+EXCEPTION_STRING\s*([^}]*?)\s+RESULT\s*([^}]*?)\s+REPORTED_RESULT\s*([^}]*?)\s+REPORTED_RESULT_UNITS\s*([^}]*?)\s+REAGENT_MASTER_LOT\s*([^}]*?)\s+REAGENT_SERIAL_NUMBER\s*([^}]*?)\s+RESULT_FLAGS\s*([^}]*?)\s+RESULT_INTERPRETATION\s*([^}]*?)\s+DILUTION_PROTOCOL\s*([^}]*?)\s+RESULT_COMMENT\s*([^}]*?)\s+DATA_MANAGEMENT_FIELD_1\s*([^}]*?)\s+DATA_MANAGEMENT_FIELD_2\s*([^}]*?)\s+DATA_MANAGEMENT_FIELD_3\s*([^}]*?)\s+DATA_MANAGEMENT_FIELD_4\s*([^}]*?)\s*}";
Yeah, you probably should just parse the record for key-value pairs.
Here is a code sample if you want to extract key-value pairs from a record.
When a match is found, the keys you're looking for can be tested against those in the capture collection.
You can also alter the regex to change how the beginning and end of a record are allowed to appear.
But don't alter the core; it protects against catastrophic backtracking.
Regex alternatives:
# Record starts on a new line, closing brace can be anywhere
^ [^\S\n]*TEST_REPLICATE\s*\{
(?>
\s* (?<key> [^\s{}]+ ) [^\S\n]* (?<val> [^\n{}]*? ) [^\S\n]* (?:$|(?=\}))
)*
\s*\}
# Record starts anywhere, closing brace is on a new line
TEST_REPLICATE\s*\{
(?>
\s* (?<key> [^\s{}]+ ) [^\S\n]* (?<val> [^\n{}]*? ) [^\S\n]* $
)*
\s*\}
C# test code:
Regex testRx = new Regex(
@"
^ [^\S\n]* TEST_REPLICATE # Record, starts on a newline
\s* # Optional whitespaces (trims blank lines)
\{ # Record opening brace
(?> # Atomic group
\s* # Optional many whitespace (trims blank lines)
# Line in record to be recorded
(?<key> [^\s{}]+) # required <key>, not whitespace nor braces
[^\S\n]* # trim whitespaces (don't include newline)
(?<val> [^\n{}]*?) # optional <value>, not newlines nor braces
[^\S\n]* # trim whitespaces (don't include newline)
(?:$|(?=\})) # End of line, or next char is a closing brace
)* # End atomic group, do many times (optional)
\s* # Optional whitespaces (trims blank lines)
\} # Record closing brace
", RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline);
string testdata = @"
TEST_REPLICATE{}
TEST_REPLICATE{
REPLICATE_ID 1asdf985
ASSAY_NUMBER 123sdg
ASSAY_VERSION 4sdgn
ASSAY_TYPE unknown
}
TEST_REPLICATE
{
REPLICATE_ID
ASSAY_NUMBER 123
ASSAY_VERSION 4
ASSAY_TYPE unknown
DILUTION_ID 1
SAMPLE_ID ""NC_HIV1""
SAMPLE_TYPE Specimen
TEST_ORDER_DATE 05.21.2012
TEST_ORDER_TIME 03:44:01
TEST_INITIATION_DATE 05.21.2012
TEST_INITIATION_TIME 04:03:36
TEST_COMPLETION_DATE 05.21.2012
TEST_COMPLETION_TIME 04:29:32
ASSAY_CALIBRATION_DATE NA
ASSAY_CALIBRATION_TIME NA
TRACK 1
PROCESSING_LANE 1
MODULE_SN ""EP004""
LOAD_LIST_NAME C:\sdddd
OPERATOR_ID ""Q_SI""
DARK_SUBREADS NA
SIGNAL_SUBREADS NA
DARK_COUNT NA
SIGNAL_COUNT NA
CORRECTED_COUNT NA
STD_BAK NA
AVG_BAK NA
STD_FOR NA
AVG_FOR NA
SHAPE NA
EXCEPTION_STRING Test execution was stopped.
RESULT NA
REPORTED_RESULT NA
REPORTED_RESULT_UNITS NA
REAGENT_MASTER_LOT 2345
REAGENT_SERIAL_NUMBER 25022
RESULT_FLAGS NA
RESULT_INTERPRETATION NA
DILUTION_PROTOCOL UNDILUTED
RESULT_COMMENT HIV NC 1
DATA_MANAGEMENT_FIELD_1 NA
DATA_MANAGEMENT_FIELD_2 NA
DATA_MANAGEMENT_FIELD_3 NA
DATA_MANAGEMENT_FIELD_4 NA
}
";
Match m_testrec = testRx.Match(testdata);
// Each match contains a single record
//
while (m_testrec.Success)
{
Console.WriteLine("New Record\n------------------------");
CaptureCollection cc_key = m_testrec.Groups["key"].Captures;
CaptureCollection cc_val = m_testrec.Groups["val"].Captures;
for (int i = 0; i < cc_key.Count; i++)
{
Console.WriteLine("'{0}' = '{1}'", cc_key[i].Value, cc_val[i].Value);
//
// Test specific keys here
// if (cc_key[i].Value == "REAGENT_SERIAL_NUMBER") ...
}
Console.WriteLine("------------------------");
// Get next record
m_testrec = m_testrec.NextMatch();
}
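To pick out only a few fields regardless of their order, the captures can be filtered into a dictionary per record. The following is a minimal sketch, not a definitive implementation: the names `ReplicateParser`, `Wanted` and `Parse` are made up for illustration, the list of wanted keys is an assumption, and the pattern is the first alternative above. It assumes values never contain `{` or `}`.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ReplicateParser
{
    // Hypothetical list of wanted keys; adjust it to the fields you need.
    static readonly HashSet<string> Wanted = new HashSet<string>
    {
        "REPLICATE_ID", "ASSAY_NUMBER", "TEST_ORDER_DATE"
    };

    // Same record pattern as the first alternative above.
    static readonly Regex RecordRx = new Regex(@"
        ^ [^\S\n]* TEST_REPLICATE \s* \{
        (?>
            \s* (?<key> [^\s{}]+ ) [^\S\n]* (?<val> [^\n{}]*? ) [^\S\n]* (?:$|(?=\}))
        )*
        \s* \}",
        RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline);

    // Returns one dictionary per record, holding only the wanted keys.
    public static List<Dictionary<string, string>> Parse(string text)
    {
        var records = new List<Dictionary<string, string>>();
        foreach (Match m in RecordRx.Matches(text))
        {
            CaptureCollection keys = m.Groups["key"].Captures;
            CaptureCollection vals = m.Groups["val"].Captures;
            var record = new Dictionary<string, string>();
            for (int i = 0; i < keys.Count; i++)
            {
                // Unknown or unwanted keys are simply skipped.
                if (Wanted.Contains(keys[i].Value))
                    record[keys[i].Value] = vals[i].Value;
            }
            records.Add(record);
        }
        return records;
    }

    public static void Main()
    {
        // Fields deliberately out of order, with one unknown field.
        string sample =
            "TEST_REPLICATE\n{\nASSAY_VERSION 4\nASSAY_NUMBER 334\n" +
            "REPLICATE_ID 453w\nNEW_FIELD x y\nTEST_ORDER_DATE 05.23.2012\n}\n";
        foreach (var record in Parse(sample))
            foreach (var pair in record)
                Console.WriteLine("'{0}' = '{1}'", pair.Key, pair.Value);
    }
}
```

Because the lookup is by key name, new fields or a different field order do not affect the result.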
I have a C# program that takes as input a subtitle text file with contents like this:
1
00: 00: 07.966 -> 00: 00: 11.166
How's the sea?
- This is great.
2
00: 00: 12.967 -> 00: 00: 15.766
It's really pretty.
What I want to do is basically correct it, so that it will skip any spaces, replace the . character with the , character and add another hyphen to the -> string, so that it will become -->. For the previous example, the correct output would be:
1
00:00:07,966 --> 00:00:11,166
How's the sea?
- This is great.
2
00:00:12,967 --> 00:00:15,766
It's really pretty.
So far, I've thought about iterating through each line and checking if it starts and ends with a digit, like so:
if (line.StartsWith("[0-9]") && line.EndsWith("[0-9]")) {
}
I don't know how to state the regular expression to do this, though.
Please note that my input can have spaces anywhere on the subtitle timing line, not just after the : character, so the string can end up being as bad as this:
"^ 0 0 : 0 0 : 0 7 . 9 6 6 -> 0 0 : 0 0 : 1 1 . 1 6 6 $"
It may not be a single regex that does everything, but I think that is actually an advantage and the logic is easy to follow and modify.
using System.IO;
using System.Text.RegularExpressions;

using var input = new StreamReader(inputPath);
using var output = new StreamWriter(outputPath);
// Matches a whole line that contains "->" (possibly "- >") and no letters.
// The ^ and $ anchors keep dialogue lines that merely contain "->" from matching.
var timestampRegex = new Regex(@"^[^A-Za-z]*-\s*>[^A-Za-z]*$");
string line;
while ((line = input.ReadLine()) != null)
{
    // A timestamp line is normalized; every other line is copied unchanged.
    if (timestampRegex.IsMatch(line))
    {
        line = Regex.Replace(line, @"\s", ""); // remove all whitespace
        line = line.Replace('.', ',');          // decimal point to comma
        line = line.Replace("->", " --> ");     // update arrow style
    }
    output.WriteLine(line);
}
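The same steps can be tried without any files by wrapping them in a string-in, string-out helper. This is only a sketch under the assumption that a timing line never contains letters; `SubtitleFixer` and `FixLine` are hypothetical names, and the sample is the worst-case line from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SubtitleFixer
{
    // The whole line has no letters and contains "->" (possibly "- >").
    static readonly Regex TimestampRx = new Regex(@"^[^A-Za-z]*-\s*>[^A-Za-z]*$");

    public static string FixLine(string line)
    {
        // Anything that is not a timing line is returned untouched.
        if (!TimestampRx.IsMatch(line))
            return line;

        string compact = Regex.Replace(line, @"\s", "");   // drop every space
        return compact.Replace('.', ',')                    // 07.966 becomes 07,966
                      .Replace("->", " --> ");              // restore the arrow with spacing
    }

    public static void Main()
    {
        Console.WriteLine(FixLine(" 0 0 : 0 0 : 0 7 . 9 6 6 -> 0 0 : 0 0 : 1 1 . 1 6 6 "));
        Console.WriteLine(FixLine("- This is great."));
    }
}
```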
You can solve it with the regular expression:
(?m)(?:\G(?!\A)|^(?=\d.*\d\r?$))(\d{2}:)[ \t](?:(\d+,\d+[ \t])(-)(>[ \t]))?
The replacement will be $1$2$3$3$4.
EXPLANATION
--------------------------------------------------------------------------------
(?m) set flags for this block (with ^ and $
matching start and end of line) (case-
sensitive) (with . not matching \n)
(matching whitespace and # normally)
--------------------------------------------------------------------------------
(?: group, but do not capture:
--------------------------------------------------------------------------------
\G where the last m//g left off
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
\A the beginning of the string
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
^ the beginning of a "line"
--------------------------------------------------------------------------------
(?= look ahead to see if there is:
--------------------------------------------------------------------------------
\d digits (0-9)
--------------------------------------------------------------------------------
.* any character except \n (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
\d digits (0-9)
--------------------------------------------------------------------------------
\r? '\r' (carriage return) (optional
(matching the most amount possible))
--------------------------------------------------------------------------------
$ before an optional \n, and the end of
a "line"
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
) end of grouping
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
\d{2} digits (0-9) (2 times)
--------------------------------------------------------------------------------
: ':'
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
[ \t] any character of: ' ', '\t' (tab)
--------------------------------------------------------------------------------
(?: group, but do not capture (optional
(matching the most amount possible)):
--------------------------------------------------------------------------------
( group and capture to \2:
--------------------------------------------------------------------------------
\d+ digits (0-9) (1 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
, ','
--------------------------------------------------------------------------------
\d+ digits (0-9) (1 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
[ \t] any character of: ' ', '\t' (tab)
--------------------------------------------------------------------------------
) end of \2
--------------------------------------------------------------------------------
( group and capture to \3:
--------------------------------------------------------------------------------
- '-'
--------------------------------------------------------------------------------
) end of \3
--------------------------------------------------------------------------------
( group and capture to \4:
--------------------------------------------------------------------------------
> '>'
--------------------------------------------------------------------------------
[ \t] any character of: ' ', '\t' (tab)
--------------------------------------------------------------------------------
) end of \4
--------------------------------------------------------------------------------
)? end of grouping
C# code:
using System;
using System.Text.RegularExpressions;
public class Example
{
public static void Main()
{
string pattern = @"(?:\G(?!\A)|^(?=\d.*\d\r?$))(\d{2}:)[ \t](?:(\d+,\d+[ \t])(-)(>[ \t]))?";
string substitution = @"$1$2$3$3$4";
string input = @"1
00: 00: 07,966 -> 00: 00: 11,166
How's the sea?
- This is great.
2
00: 00: 12,967 -> 00: 00: 15,766
It's really pretty.";
RegexOptions options = RegexOptions.Multiline;
Regex regex = new Regex(pattern, options);
string result = regex.Replace(input, substitution);
Console.Write(result);
}
}
Results:
1
00:00:07,966 --> 00:00:11,166
How's the sea?
- This is great.
2
00:00:12,967 --> 00:00:15,766
It's really pretty.
I want to trim all spaces between numbers before words "usd" and "eur".
I have a regex pattern like this:
@"\b(\d\s*)+\s(usd|eur)"
How can I exclude the space and usd|eur from the resulting match?
String example: "sdklfjsd 10 343 usd ds 232 300 eur"
Result should be: "sdklfjsd 10343 usd ds 232300 eur"
string line = "2 300 $ 12 Asdsfd 2 300 530 usd and 2 351 eur";
MatchCollection matches;
Regex defaultRegex = new Regex(@"\b(\d+\s*)+(usd|eur)");
matches = defaultRegex.Matches(line);
WriteLine("Parsing '{0}'", line);
for (int ctr = 0; ctr < matches.Count; ctr++)
WriteLine("={0}){1}", ctr, matches[ctr].Value);
There may be a more elegant way, but it can be done easily with a MatchEvaluator:
// Requires: using System.Linq; using System.Text.RegularExpressions;
string result = new Regex(@"\b(\d+\s*)+(?=\s(usd|eur))")
    .Replace("sdklfjsd 10 343 usd ds 232 300 eur",
        m => string.Join("", m.Groups[1].Captures.Cast<Capture>().Select(c => c.Value.Trim())));
The Regex \b(\d+\s*)+(?=\s(usd|eur)) uses a look-ahead to only match numbers that are followed by \s(usd|eur) and a grouping to match each consecutive match to \d+\s* (I assume the \b boundary from your question so that with abc12 34 56 eur it would only match 34 56 is desired, remove it otherwise).
Then for each match it gets all of that group's captures, trims them all, and concatenates them together to produce the replacement text.
(Note that currency codes should generally be capitalised, so you may have another issue there.)
Try Regex: (\d+) *(\d+)(?= (?:usd|eur))
Assuming there only two numbers, you can use
\b(\d+)\s*(\d+)(?=\s(usd|eur)) with a replacement string of $1$2
You could also use a positive lookbehind and a positive lookahead to match all the spaces you want to remove:
(?<=\d)\s+(?=(?:\d+\s+)*\d+\s+(?:eur|usd)\b)
Explanation
(?<=\d) Positive lookbehind to assert what is on the left is a digit
\s+ Match 1+ whitespace characters
(?= Positive lookahead to assert what is on the right is
(?:\d+\s+)* Repeat 0+ times matching 1+ digits followed by 1+ whitespace characters
\d+\s+(?:eur|usd)\b match 1+ digits followed by 1+ whitespace characters and eur or usd
) Close positive lookahead
string line = "2 300 $ 12 Asdsfd 2 300 530 usd and 2 351 eur";
string result = Regex.Replace(line, @"(?<=\d)\s+(?=(?:\d+\s+)*\d+\s+(?:eur|usd)\b)", "");
Console.WriteLine(result); // 2 300 $ 12 Asdsfd 2300530 usd and 2351 eur
Text from txt file:
10 25
32 44
56 88
102 127
135 145
...
The first line should start with 0; every other line should start with the last number of the previous line. Is it possible to do this with a regex, or do I need to loop through the lines after the regex parse?
0 10 25
25 32 44
44 56 88
88 102 127
127 135 145
(?<Middle>\d+)\s(?<End>\d+) //(?<Start>...)
I would advise against using regex for readability reasons but this will work:
var input = ReadFromFile();
var regex = @"(?<num>\d*)[\n\r]+";
var replace = "${num}\n${num} ";
var output = Regex.Replace(input, regex, replace);
That will do everything apart from the first 0.
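The missing leading 0 can be added with plain string concatenation after the replace. A minimal sketch, assuming the input has no trailing line break and no blank lines; `LineShifter` and `Shift` are hypothetical names, and `\d+` is used instead of `\d*` so that an empty number never matches.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LineShifter
{
    public static string Shift(string input)
    {
        // Copy the last number of each line to the start of the next line.
        string shifted = Regex.Replace(input, @"(?<num>\d+)[\n\r]+", "${num}\n${num} ");
        // The first line has no predecessor, so it gets a literal 0.
        return "0 " + shifted;
    }

    public static void Main()
    {
        Console.WriteLine(Shift("10 25\n32 44\n56 88"));
    }
}
```

Note that the replacement normalizes every line break to `\n`.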
Note that regex is not a great fit for a task like this. It can be used for small input strings; for larger ones, it is better to write some more logic and parse the text line by line.
So, more from academic interest, here is a regex solution showing how to replace with different replacement patterns based on whether the line matched is first or not:
var pat = @"(?m)(?:(\A)|^(?!\A))(.*\b\s+(\d+)\r?\n)";
var s = "10 25\n32 44\n56 88\n102 127\n135 14510 25\n32 44\n56 88\n102 127\n135 145";
var res = Regex.Replace(s, pat, m => m.Groups[1].Success ?
$"0 {m.Groups[2].Value}{m.Groups[3].Value} " : $"{m.Groups[2].Value}{m.Groups[3].Value} ");
Result of the C# demo:
0 10 25
25 32 44
44 56 88
88 102 127
127 135 14510 25
25 32 44
44 56 88
88 102 127
127 135 145
Note the \n line breaks are hardcoded, but it is still just an illustration of regex capabilities.
Pattern details
(?m) - an inline RegexOptions.Multiline modifier
(?:(\A)|^(?!\A)) - a non-capturing group matching either
(\A) - start of string capturing it to Group 1
| - or
^(?!\A) - start of a line (but not string due to the (?!\A) negative lookahead)
(.*\b\s+(\d+)\r?\n) - Group 2:
.*\b - 0+ chars other than newline up to the last word boundary on a line followed with...
\s+ - 1+ whitespaces (may be replaced with [\p{Zs}\t]+ to only match horizontal whitespaces)
(\d+) - Group 3: one or more digits
\r?\n - a CRLF or LF line break.
The replacement logic is inside the match evaluator: if Group 1 matched (m.Groups[1].Success ?) replace with 0 and Group 2 + Group 3 values + space. Else, replace with Group 2 + Group 3 + space.
With C#.
var lines = File.ReadLines(fileName);
var st = new StringBuilder(); // or a StreamWriter directly to disk, etc.
var last = "0";
foreach (var line in lines)
{
st.AppendLine(last + " " + line );
last = line.Split().LastOrDefault();
}
var lines2 = st.ToString();
I was wondering if this was possible using Regex. I would like to exclude all letters (upper and lowercase) and the following 14 characters ! “ & ‘ * + , : ; < = > # _
The problem is the equal sign. In the string that I will be validating (which must be either 20 or 37 characters long), the equal sign must be in either the 17th or the 20th position, because it is used as a separator in those positions. So the regex must check that the equal sign does not appear anywhere other than the 17th or 20th position (and not in both). The following are some examples:
pass: 1234567890123456=12345678901234567890
pass: 1234567890123456789=12345678901234567
don't pass: 123456=890123456=12345678901234567
don't pass: 1234567890123456=12=45678901234567890
I am having a hard time with the part that I must allow the equal sign in those two positions and not sure if that's possible with Regex. Adding an if-statement would require substantial code change and regression testing because this function that stores this regex currently is used by many different plug-ins.
I'll go for
^([^a-zA-Z!"&'*+,:;<=>#_]{16}=[^a-zA-Z!"&'*+,:;<=>#_]+|[^a-zA-Z!"&'*+,:;<=>#_]{19}=[^a-zA-Z!"&'*+,:;<=>#_]*)$
Explanations :
1) Start with your allowed char :
^[^a-zA-Z!"&'*+,:;<=>#_]$
[^xxx] means all except xxx, where a-z is lower case letters A-Z upper case ones, and your others chars
2) Repeat it 16 times, then =, then more allowed chars (the "allowed char" class followed by '+' means it is repeated 1 to n times)
^[^a-zA-Z!"&'*+,:;<=>#_]{16}=[^a-zA-Z!"&'*+,:;<=>#_]+$
At this point you'll match your first case, when = is at position 17.
3) Your second case will be
^[^a-zA-Z!"&'*+,:;<=>#_]{19}=[^a-zA-Z!"&'*+,:;<=>#_]*$
with the last part followed by * instead of + to handle strings that are only 20 chars long and end with =
4) just use the (case1|case2) to handle both
^([^a-zA-Z!"&'*+,:;<=>#_]{16}=[^a-zA-Z!"&'*+,:;<=>#_]+|[^a-zA-Z!"&'*+,:;<=>#_]{19}=[^a-zA-Z!"&'*+,:;<=>#_]*)$
Tested OK with notepad++ and your examples
Edit to match exactly 20 or 37 chars
^([^a-zA-Z!"&'*+,:;<=>#_]{16}=[^a-zA-Z!"&'*+,:;<=>#_]{3}|[^a-zA-Z!"&'*+,:;<=>#_]{16}=[^a-zA-Z!"&'*+,:;<=>#_]{20}|[^a-zA-Z!"&'*+,:;<=>#_]{19}=|[^a-zA-Z!"&'*+,:;<=>#_]{19}=[^a-zA-Z!"&'*+,:;<=>#_]{17})$
More readable view with explanation :
^(
// 20 chars with = at 17
[^a-zA-Z!"&'*+,:;<=>#_]{16} // 16 allowed chars
= // followed by =
[^a-zA-Z!"&'*+,:;<=>#_]{3} // followed by 3 allowed chars
|
[^a-zA-Z!"&'*+,:;<=>#_]{16} // 37 chars with = at 17
=
[^a-zA-Z!"&'*+,:;<=>#_]{20}
|
[^a-zA-Z!"&'*+,:;<=>#_]{19} // 20 chars with = at 20
=
|
[^a-zA-Z!"&'*+,:;<=>#_]{19} // 37 chars with = at 20
=
[^a-zA-Z!"&'*+,:;<=>#_]{17}
)$
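In C#, the four alternatives can be assembled from a single character-class constant so that the class is written only once. This is a sketch rather than a drop-in replacement; `TrackValidator`, `Allowed` and `IsValid` are hypothetical names, and the inputs are the examples from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TrackValidator
{
    // One allowed character: anything except letters and the excluded symbols.
    // Inside a verbatim string, "" stands for a single double quote.
    const string Allowed = @"[^a-zA-Z!""&'*+,:;<=>#_]";

    static readonly Regex Rx = new Regex(
        "^(?:" +
        Allowed + "{16}=" + Allowed + "{3}"  + "|" +   // 20 chars, = at position 17
        Allowed + "{16}=" + Allowed + "{20}" + "|" +   // 37 chars, = at position 17
        Allowed + "{19}="                    + "|" +   // 20 chars, = at position 20
        Allowed + "{19}=" + Allowed + "{17}" +         // 37 chars, = at position 20
        ")$");

    public static bool IsValid(string s) => Rx.IsMatch(s);

    public static void Main()
    {
        string[] samples =
        {
            "1234567890123456=12345678901234567890",   // should pass
            "1234567890123456789=12345678901234567",   // should pass
            "123456=890123456=12345678901234567",      // should not pass
            "1234567890123456=12=45678901234567890"    // should not pass
        };
        foreach (string s in samples)
            Console.WriteLine("{0} -> {1}", s, IsValid(s));
    }
}
```

In .NET, `$` also matches just before a single trailing newline; if that matters, `\z` can be used instead to require the true end of the string.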
I've omitted the full set of allowed symbols and just used [^=] in one place; you should put your own class of all allowed symbols (except =) there.
var r = new Regex(@"^(([0-9\:\<\>]{16,16}=(([0-9\:\<\>]{20})|([0-9\:\<\>]{3})))|(^[^=]{19,19}=(([0-9\:\<\>]{17}))?))$");
/*
@"^(
([0-9\:\<\>]{16,16}
=
(([0-9\:\<\>]{20})|([0-9\:\<\>]{3})))
|
(^[^=]{19,19}
=
(([0-9\:\<\>]{17}))?)
)$"
*/
Using {length,length} you can pin the exact length of each part, and therefore the overall string length. The ^ at the beginning and the $ at the end are also important.
Hi, I am a newbie with regex. I have a text like:
[JUNCTIONS]
;ID Elev Demand Pattern
3 50 100 ;
4 50 30 ;
5 50 20 ;
6 40 20 ;
7 50 5 ;
8 30 5 ;
9 30 5 ;
2 50 80 ;
10 50 70 ;
11 50 30 ;
12 50 52 ;
13 50 40 ;
14 50 40 ;
15 50 10 ;
16 50 10 ;
17 50 10 ;
18 0 0 ;
19 0 0 ;
[RESERVOIRS]
;ID Head Pattern
1 100 ;
[TANKS]
I want to create a pattern that outputs the text between [JUNCTIONS] and [RESERVOIRS], then between [RESERVOIRS] and [TANKS], and so on. The [XXXX] names are not known to me in advance. I want to get the text between one [XXX] and the next [XXX]. How can I do that?
Here is the regex:
(?=(\[\S+\].*?\[\S+\]))
or
(?=(\[(?:JUNCTIONS|RESERVOIRS)\].*?\[(?:RESERVOIRS|TANKS)\]))
Assuming you want to handle all the [...] things from your input.
Note: make sure the regex can match across multiple lines in your C# code (for example, RegexOptions.Singleline lets . match newlines). And don't forget to escape the \ character if needed.
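Because the whole pattern sits inside a lookahead, each match consumes nothing, so the closing header of one block can also open the next block. A minimal sketch of how the first regex might be used from C#; `OverlappingBlocks` and `Find` are hypothetical names, and the input is a shortened stand-in for the real file.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class OverlappingBlocks
{
    public static List<string> Find(string text)
    {
        var blocks = new List<string>();
        // Singleline lets . cross line breaks; group 1 holds the text from
        // one [HEADER] up to and including the next [HEADER].
        MatchCollection matches =
            Regex.Matches(text, @"(?=(\[\S+\].*?\[\S+\]))", RegexOptions.Singleline);
        foreach (Match m in matches)
            blocks.Add(m.Groups[1].Value);
        return blocks;
    }

    public static void Main()
    {
        string text = "[JUNCTIONS]\n3 50 100 ;\n[RESERVOIRS]\n1 100 ;\n[TANKS]";
        foreach (string block in Find(text))
        {
            Console.WriteLine(block);
            Console.WriteLine("----");
        }
    }
}
```

Each returned block still includes both headers, so they need to be trimmed off if only the body is wanted.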
Here is some c# code to do the match, and get the results.
Be sure to add error checking, for example to make sure that the match actually worked.
Note the Singleline flag - this lets the dot (.) match all characters, including newlines. You'll also probably need to cleanup and trim the output, to remove any trailing newlines, etc.
MatchCollection matches = Regex.Matches(test, @"^\[JUNCTIONS\](.*)\[RESERVOIRS\](.*)\[TANKS\](.*)$", RegexOptions.Singleline);
GroupCollection groups = matches[0].Groups;
// JUNCTIONS text
Console.WriteLine(groups[1]);
// RESERVOIRS text
Console.WriteLine(groups[2]);
Edit - Updated to match OP's changes
If you want to match an unspecified number of sections, it's a little trickier. This regex will match a [TEXT] header and anything that comes after it, until it hits a [ character. To use it, loop over the MatchCollection and, for each section, read .Groups[1] for the header name and .Groups[2] for the body.
MatchCollection matches =
Regex.Matches(test, @"\[([\w+]+)\]([^\[]+)?", RegexOptions.Singleline);
// for each block / section of the document
foreach(Match match in matches){
GroupCollection groups = match.Groups;
// [TEXT] part will be here
Console.WriteLine(groups[1]);
// The rest will be here
Console.WriteLine(groups[2]);
}
Why use a regex?
Assuming you can read this input text one line at a time, it will probably be quicker and easier to just loop over the lines, and output those you need. Some variant of:
Update:
In response to your comment below: you can probably use this to skip any lines with [something] in them, and print out the rest:
// Pattern: any instance of [] with one or more characters between them:
var pattern = @"\[.+\]";
while((line = file.ReadLine()) != null)
{
if(!Regex.IsMatch(line, pattern)) // Skip lines that match
{
Console.WriteLine(line);
}
}
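If the lines should also be grouped by the section they belong to, the same line-by-line loop can remember the most recent header. The following is only a sketch: `SectionReader` and `Read` are hypothetical names, and it assumes every header sits alone on its own line.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

public static class SectionReader
{
    // A header line is a name in square brackets, alone on its line.
    static readonly Regex HeaderRx = new Regex(@"^\s*\[([^\]]+)\]\s*$");

    public static Dictionary<string, List<string>> Read(TextReader reader)
    {
        var sections = new Dictionary<string, List<string>>();
        List<string> current = null;   // lines of the section being read
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            Match m = HeaderRx.Match(line);
            if (m.Success)
            {
                // A new header starts a new, initially empty, section.
                current = new List<string>();
                sections[m.Groups[1].Value] = current;
            }
            else if (current != null && line.Trim().Length > 0)
            {
                // Non-blank lines belong to the most recent header.
                current.Add(line);
            }
        }
        return sections;
    }

    public static void Main()
    {
        string text = "[JUNCTIONS]\n;ID Elev Demand Pattern\n3 50 100 ;\n" +
                      "[RESERVOIRS]\n;ID Head Pattern\n1 100 ;\n[TANKS]\n";
        foreach (var pair in Read(new StringReader(text)))
            Console.WriteLine("{0}: {1} line(s)", pair.Key, pair.Value.Count);
    }
}
```

Passing a `StreamReader` instead of the `StringReader` reads the same way from a file.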