String split on dynamic separator - c#

I have to deal with the following problem.
I have to extract messages from a communication buffer. Sadly the communication protocol is lousy and not well-structured. The only way I came up with to distinguish packets in the buffer is an intermediate "ack" command that is transmitted by the server.
Example:
[Packet1][ACK][Packet2][ACK][Packet3]
I could have used String.Split(ACK), but the separator is also not consistent. Though, there are 3 rules to identify the ack packet.
Starts with "AK".
Ends with "0" or "1".
Total length is 5 characters.
Ack example:
"AKxxy" where:
xx: (01 to 99)
y: (0 or 1)
I hope there is a regular expression that can solve my problem, but I lack the needed knowledge and time.
Is there any RegEx "expert" who could possibly help me? Feel free to suggest any solution.
Thank you.
Edit:
Example packet (I really had to remove the packet information):
AK010CONFIDENTIALPACKET1AK011CONFIDENTIALPACKET2AK020AK011CONFIDENTIALPACKET3AK021CONFIDENTIALPACKET4AK050
Sadly, each packet in the protocol does not start or end with a specific character so I cannot distinguish them. To identify each one I have to split them using the ack packet and then perform different checks on each one.

The direct translation would be
\bAK\d{2}[01]\b
That is
\b # a word boundary
AK # AK literally
\d{2} # two digits
[01] # one of 0 or 1
\b # another word boundary
The expression needs to be tested though (see a demo on regex101.com).

EDIT:
looking at the other answers, this has probably merely ornamental value.
The solution of @Jan and @ThymosK
var packets = Regex.Split(buffer, @"AK\d{2}[01]");
seems much more elegant.
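As a rough illustration of the split approach, here is a minimal sketch in Python rather than C# (Python's re.split plays the role of Regex.Split here, and the pattern syntax is the same). It runs on the example buffer from the question; the variable names are only for illustration. It also shows why the \b variant finds nothing in this buffer: the acks sit directly next to other word characters.

```python
import re

# Ack pattern from the question: "AK", two digits, then 0 or 1.
# Note: \d{2} also accepts "00", which the stated range (01 to 99) excludes.
ACK = re.compile(r"AK\d{2}[01]")

buffer = ("AK010CONFIDENTIALPACKET1AK011CONFIDENTIALPACKET2"
          "AK020AK011CONFIDENTIALPACKET3AK021CONFIDENTIALPACKET4AK050")

# Splitting yields empty strings where two acks are adjacent or where the
# buffer starts or ends with an ack, so those are dropped.
packets = [p for p in ACK.split(buffer) if p]

# With word boundaries nothing matches, because every ack touches a
# letter or digit on at least one side.
boundary_hits = re.findall(r"\bAK\d{2}[01]\b", buffer)

print(packets)
print(boundary_hits)
```

Dropping the empty strings is a choice made for this sketch; if an empty slot between two acks carries meaning in the protocol, keep them.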
But I think it might be nice to see how all the parsing can be moved inside the regex. Even if it is way too unreadable :P
I have designed a regex that can give you messages and delimiters as groups:
(?s)(AK[0-9][0-9][01])|((?:(?!AK[0-9][0-9][01]).)*)
It can analyze text like this:
You can test it here.
As usual, regexes are write only. I can hardly read this myself. But I'll try and go through it:
The first group is straightforward and simply catches your ack command:
(AK[0-9][0-9][01])
The second group contains a negative lookahead (?! ... ) which matches anything not followed by the regex specified by .... Here we insert your ack syntax, so anything not followed by ack is matched. Then we add a single character, to extend this to actually match anything up to ack.
Basically this second part asserts that we are currently not followed by ack and then adds a single character. This is repeated as long as possible, until we find the ack. I turn this into the second group.
Since I don't have C# currently, I can't wrap this in code with the C# regex engine. But python works nicely with it and offers a useful findall method, that gives you all those groups.
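A minimal sketch of how this could look in Python, assuming the two-group pattern above with [01] as the last character class. It uses finditer instead of findall so that the empty matches the second group can produce are easy to skip. The function name tokenize and the short sample buffer are made up for illustration.

```python
import re

ACK = r"AK[0-9][0-9][01]"
# Group 1: an ack. Group 2: any run of characters that does not start an ack.
PATTERN = re.compile(rf"(?s)({ACK})|((?:(?!{ACK}).)*)")

def tokenize(buffer):
    """Return (acks, messages) in the order they appear in the buffer."""
    acks, messages = [], []
    for m in PATTERN.finditer(buffer):
        ack, message = m.group(1), m.group(2)
        if ack:
            acks.append(ack)
        elif message:  # skips the empty matches group 2 can produce
            messages.append(message)
    return acks, messages

# Hypothetical buffer, shaped like the example in the question.
acks, messages = tokenize("AK010P1AK011P2AK020AK011P3")
print(acks)
print(messages)
```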

string interim = Regex.Replace(buffer, @"AK\d{2}[01]", "|");
var commands = interim.Split('|');
Assuming that | is not a valid input char. You can pick something very exotic.

Related

Regular Expression how to match IP and port after a few garbage data

I'm trying to extract the IP and Port from html text.
The data looks like this
177.93.79.34\n<!--\n<img src='images/proxy/3537536.gif' border='0' hspace='0' vspace='0' width='140' height='14' alt='View this Proxy details'/>\n-->\n</a></td>\n\t<td><a href='/proxy-4145-Socks4--ssl.htm' title='Select proxies with port number 4145'>4145
My regular expression pattern looks like this.
MatchCollection Match = Regex.Matches(source, @"\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b\s\S.*Select proxies with port number ([0-9]+)");
I also try this
\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b[\s\S].*Select proxies with port number ([0-9]+)
But I get 0 results. If I take off \b\s\S.*Select proxies with port number ([0-9]+) it finds all the IP addresses fine, but the information is useless without the port data. How would I match both in one regular expression?
SSpoke,
If you're trying to keep things as close to your original Regex as possible, simply replacing [\s\S].* with [\s\S]* in the middle takes care of it. (Demo)
Generally, however, it's best to throw your different parts of interest into capture groups, like such:
(\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b)([\s\S]*?Select proxies with port number)\s([0-9]+)
(Demo)
The main issue (as I see it) involves this portion of your expression: [\s\S].*
That section is supposed to deal with all of the junk in between the IP and the port number, but what it really says is: find one character that is either a \s or a \S (that is, any character), followed by any character except a newline, zero or more times (.*). Although the text pasted into your original question was a single line, the "source" in your C# code (the HTML) was not. To solve the problem, you just need to change the expression to [\s\S]*, which will match any character including newline characters, zero or more times.
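The difference can be checked with a small sketch. It is written in Python rather than C#, on the assumption that the relevant behaviour is the same in both engines (the dot does not match \n by default). The HTML is a shortened copy of the sample from the question, and the names IP, TAIL, broken and fixed are only for illustration.

```python
import re

# Shortened version of the multi-line HTML from the question.
html = ("177.93.79.34\n<!--\n<img src='images/proxy/3537536.gif'/>\n-->\n"
        "</a></td>\n\t<td><a href='/proxy-4145-Socks4--ssl.htm' "
        "title='Select proxies with port number 4145'>4145")

IP = r"\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b"
TAIL = r"Select proxies with port number ([0-9]+)"

# [\s\S] consumes one character (the newline), then .* stops at the end
# of the next line, so the port text is never reached.
broken = re.search(IP + r"[\s\S].*" + TAIL, html)

# [\s\S]*? may cross as many lines as needed.
fixed = re.search("(" + IP + ")" + r"[\s\S]*?" + TAIL, html)

print(broken)
print(fixed.groups())
```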
If the HTML format is certain, maybe this regex can help (in JavaScript)
const regex = /((?:[0-9]{1,3}\.){3}[0-9]{1,3}).+Select proxies with port number.+?([0-9]{4})/gm
with the results being captured in two groups.

c#, find multiline message by regex request

Main task
Find all DEBUG messages and select message fully (no matter single line message or multiline with unknown length)
I wrote such regex code:
\d{13}\t.*DEBUG.*(?=\d{13})
It searches perfectly, but only matches single-line messages.
I also tried this code:
string myReg1 = @"\d{13}\t.*DEBUG.*(?=\d{13})";
MatchCollection match1 = Regex.Matches(logData, myReg1, RegexOptions.Singleline);
but this code found only one match, where there should be 147 matches.
I have logs like this:
1426174736798 addons.manager DEBUG Registering shutdown blocker for OpenH264Provider
1426174736799 addons.manager DEBUG Registering shutdown blocker for PluginProvider
*** Blocklist::_preloadBlocklistFile: blocklist is disabled
Try using this non-greedy regex instead (EDIT: tweaked a bit for input):
\d{13}\t.{0,100}DEBUG.+?(?=\d{13}|$)
Now this is tweaked a bit more closely to your input data. I can't really think of an ideal way to keep that first dot before the DEBUG from eating up other rows that you don't want. In a perfect world, you could write a phrase to say something like, "any character except a row of 13 digits", but this is not really something that regex does well. Maybe someone else can make this better. In the meantime, I have restricted the first dot to consume no more than 100 characters. If it goes more than 100 characters past the 13 digit number and has not found the string "DEBUG" yet, it is fairly safe to assume it is on a row we don't care about. You may need to tweak this number up or down a bit to fit your data (and I hate imperfect solutions like this), but hopefully this will get you in the neighborhood.
Changing .* to .+? makes the dot non-greedy. I also added an alternative with $ to the lookahead at the end, so that your last record is captured too, since no 13-digit number follows it. (With RegexOptions.Singleline and without RegexOptions.Multiline, $ matches only at the end of the input, or before a final newline.)
This appears to work correctly in Expresso, which uses the same regex engine as .NET
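For comparison, here is a sketch of a different technique from the 100-character cap used above: the header is anchored to the start of a line, and the message runs until a lookahead sees the next line that starts with a 13-digit timestamp. It is written in Python and assumes the fields are tab-separated in the order timestamp, logger, level, message, which the question's regex suggests but the pasted log does not confirm. The INFO line in the sample is made up to show that non-DEBUG records are skipped.

```python
import re

# Sample log; fields are assumed to be tab-separated.
# The last line is invented for illustration.
log = (
    "1426174736798\taddons.manager\tDEBUG\tRegistering shutdown blocker for OpenH264Provider\n"
    "1426174736799\taddons.manager\tDEBUG\tRegistering shutdown blocker for PluginProvider\n"
    "*** Blocklist::_preloadBlocklistFile: blocklist is disabled\n"
    "1426174736800\taddons.manager\tINFO\tStartup finished\n"
)

# MULTILINE lets ^ match at each line start; DOTALL lets the lazy .*?
# run across lines until the next timestamped line or the end of input.
RECORD = re.compile(
    r"^\d{13}\t[^\t\n]*\tDEBUG\t.*?(?=^\d{13}\t|\Z)",
    re.MULTILINE | re.DOTALL,
)

debug_messages = [m.group(0).rstrip("\n") for m in RECORD.finditer(log)]
for record in debug_messages:
    print(repr(record))
```

The same pattern should carry over to .NET with RegexOptions.Multiline | RegexOptions.Singleline and \z in place of \Z, but that has not been verified here.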

Regex for ^ | in C#

I am working on HL7 messages and I need a regex. This doesn't work:
HL7 message=MSH|^~\&|DATACAPTOR|123|123|20100816171948|ORU^R01|081617194802900|P|2.3|8859/1
My regex is:
MSH|^~\&|DATACAPTOR|\d{3}|\d{3}|(\d{4}\d{2}\d{2}\d{2}\d{2}\d{2})|ORU\\^R01|\d{20}|P|2.3|8859/1
Can anybody suggest a regex for special characters?
I am using this code:
strRegex = "\\vMSH|^~\\&|DATACAPTOR|\\d{3}|\\d{3}|
(\\d{4}\\d{2}\\d{2}\\d{2}\\d{2}\\d{2})|ORU\\^R01|\\d{20}|P|2.3|8859/1";
Regex rx = new Regex(strRegex, RegexOptions.Compiled | RegexOptions.IgnoreCase );
|, ^, and \ are all special characters in regular expressions, so you'd have to escape them with \. Remember \ is also an escape character within a regular string literal so you'd have to escape that, too:
var strRegex = "\\vMSH\\|\\^~\\\\&\\|DATACAPTOR\\|…
But it's generally a lot easier to use a verbatim string literal (@"…"):
var strRegex = @"\vMSH\|\^~\\&\|DATACAPTOR\|…
Finally, note that (\d{4}\d{2}\d{2}\d{2}\d{2}\d{2}) can be simplified to (\d{14}).
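To see the escaping at work, here is a small sketch in Python, where a raw string r"..." plays the role of the C# verbatim literal. Only the first fields of the sample message are matched. The leading \v is left out because the pasted sample has no vertical tab in front of MSH.

```python
import re

message = r"MSH|^~\&|DATACAPTOR|123|123|20100816171948|ORU^R01|081617194802900|P|2.3|8859/1"

# |, ^ and \ are escaped; ~ and & are ordinary characters in a regex.
pattern = re.compile(r"MSH\|\^~\\&\|DATACAPTOR\|\d{3}\|\d{3}\|(\d{14})\|ORU\^R01\|")

m = pattern.match(message)
timestamp = m.group(1) if m else None
print(timestamp)

# Without the escapes, each | becomes an alternation, so the pattern
# matches just "MSH" instead of the whole header.
unescaped = re.match(r"MSH|^~\&|DATACAPTOR", message)
print(unescaped.group(0))
```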
However, for a structure like this, it's probably easier to just use the Split method.
var segment = @"MSH|^~\&|DATACAPTOR…";
var fields = segment.Split('|');
var timestamp = fields[5];
Warning: HL7 messages may use different control characters. They are declared starting with the 4th character of the MSH segment, which is the field separator (in this case |^~\& are the control characters). It's best to parse the control characters first if you don't control your input and these control characters may change.
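Reading the control characters first could look roughly like this. It is a sketch in Python, not a full HL7 parser: it handles a single MSH segment only, and the function name split_msh is made up for illustration.

```python
def split_msh(segment):
    """Split an MSH segment using the separators it declares itself."""
    if not segment.startswith("MSH") or len(segment) < 8:
        raise ValueError("not an MSH segment")
    field_sep = segment[3]          # 4th character: field separator
    encoding_chars = segment[4:8]   # component, repetition, escape, subcomponent
    fields = segment.split(field_sep)
    return field_sep, encoding_chars, fields

segment = r"MSH|^~\&|DATACAPTOR|123|123|20100816171948|ORU^R01|081617194802900|P|2.3|8859/1"
field_sep, encoding_chars, fields = split_msh(segment)

timestamp = fields[5]
# Split the message type on the declared component separator, not on a
# hard-coded "^".
message_type = fields[6].split(encoding_chars[0])

print(field_sep, encoding_chars, timestamp, message_type)
```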
For me your question describes two distinct problems.
Problem 1) "..I need a regex..this doesn't work..My regex is..anybody suggest a (better) regex..?"
This is the good part of your question.
As already pointed out by @p-s-w-g, some special characters in regular expressions must be escaped. The page Microsoft Developer Network: Character Escapes in Regular Expressions tells you which characters are special and how to escape them.
In order to easily test if your regex recognizes the grammar you may find useful some interactive regex testing tools, e.g. Regex Hero or The Regulator
Problem 2) "I am working on HL7 messages..this doesn't work..My regex is..anybody suggest a (better) regex..?"
This is the bad part of your question.
The
MSH|^~\&|DATACAPTOR|123|123|20100816171948|ORU^R01|081617194802900|P|2.3|8859/1
example shown in your question is already not a valid HL7 message fragment. It is something similar to HL7, but it was already damaged, probably by some text pre-processing code. HL7 v2 messages are not transmitted using a text protocol that can be manipulated with text tools. The protocol is binary, but at the same time partially readable and thus controllable by humans without any special tools. Still, it is a binary protocol and must be processed as such. Regex is a tool for working with text strings, not binary strings. And although it may seem possible to outsmart some ancient 20-year-old protocol with a new-age regex one-liner, it is not a good approach. I have tried to explain why not in the comments on your question.
Basic decoding of the fragment is:
MSH-0: MSH
MSH-1: |
MSH-2: ^~\&
MSH-3: DATACAPTOR
MSH-4: 123
MSH-5: 123
MSH-6: ! missing !
MSH-7: 20100816171948
MSH-8: ! missing !
MSH-9: ORU^R01
MSH-10: 081617194802900
MSH-11: P
MSH-12: 2.3
MSH-13: ! missing !
MSH-14: ! missing !
MSH-15: ! missing !
MSH-16: ! missing !
MSH-17: ! missing !
MSH-18: 8859/1
The ! missing ! pieces are really missing. In normal MSH segment they should be there at their corresponding positions, just having default empty value.
By reading Health Level Seven, Version 2.3.1 © 1999 - Chapter 2.24.1 MSH - message header segment we can see that
The message was created 4 years ago in 2010, probably by Capsule Tech, Inc.'s DataCaptor™, and formatted by the rules defined in Health Level Seven, Version 2.3 © 1997, a 17-year-old standard that has been updated several times. It was supposed to be used in one of the countries listed in Wikipedia: ISO/IEC 8859-1
From your question I can't see more. But whatever you are trying to do, and whatever data you are going to process for whatever reason, the code fragment you are starting with is already wrong, and the HL7 regex parsing approach is strange in general. If you're working on serious software to be used anywhere in the healthcare industry, please consider writing or using a serious and tested parser, e.g. the one used by the NHapi library: http://sourceforge.net/p/nhapi/code/HEAD/tree/NHapi20/NHapi.Base/Parser/PipeParser.cs

Parse directories from a string

First of all, I have spent three hours trying to solve this. Also, please don't suggest not using regex. I appreciate other comments and can easily use other methods, but I am practicing regex as much as possible.
I am using VB.Net
Example string:
"Hello world this is a string C:\Example\Test E:\AnotherExample"
Pattern:
"[A-Z]{1}:.+?[^ ]*"
Works fine. However, what if the directory name contains whitespace? I have tried to match all strings that start with one uppercase letter followed by a colon and then anything else. This needs to match up until a whitespace, one uppercase letter and a colon, and then match the same sequence again.
Hope I have made sense.
How about "[A-Z]{1}:((?![A-Z]{1}:).)*", which should stop before the next drive letter and colon?
That "?!" is a "negative lookaround" or "zero-width negative lookahead" which, according to Regular expression to match a line that doesn't contain a word? is the way to get around the lack of inverse matching in regexes.
Not to be too picky, but most filesystems disallow a small number of characters (like <>/\:?"), so a correct pattern for a file path would be more like [A-Z]:\\((?![A-Z]{1}:)[^<>/:?"])*.
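Here is a quick sketch of the lookahead pattern in action, in Python rather than VB.Net on the assumption that the syntax used is the same in both engines. The group is made non-capturing so the whole match is returned, and the whitespace left before the next drive letter is stripped. The name find_paths and the second test string are made up for illustration.

```python
import re

# A drive letter and colon, then any characters that do not start
# another "X:" drive prefix.
DRIVE_PATH = re.compile(r"[A-Z]:(?:(?![A-Z]:).)*")

def find_paths(text):
    """Return drive-letter paths found in text, trimmed of whitespace."""
    return [m.group(0).strip() for m in DRIVE_PATH.finditer(text)]

print(find_paths(r"Hello world this is a string C:\Example\Test E:\AnotherExample"))
print(find_paths(r"C:\Program Files\App D:\My Data"))
```

Any prose that follows the last path is swallowed into that match, since nothing in the pattern marks where a path ends.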
The other important point that has been raised is how you expect to parse input like "hello path is c:\folder\file.extension this is not part of the path:P"? This is a problem you commonly run into when you start trying to parse without specifying the allowed range of inputs, or the grammar that a parser accepts. This particular problem seems pretty ad hoc and so I don't really expect you to come up with a grammar or to define how particular messages are encoded. But the next time you approach a parsing problem, see if you can first define what messages are allowed and what they mean (syntax and semantics). I think you'll find that once you've defined the structure of allowed messages, parsing can be almost trivial.

Packet detection using regex

I'm new to regular expressions, and I need to write a set of regular expressions that match different data packet formats.
My problem is, usually I only need to look for the start and ending parts of the packet to distinguish between them, the data in between is irrelevant.
What's the most efficient way to ignore the data between the start and end?
Here's a simple example.
The packet I'm looking for starts with $CH; and ends with #
Currently my regex is \$CH;.*?#
It's the .*? I'm worried about. Is there a better (or more efficient) way to accept any character between the packet header and ending character?
Also, some of the packets have \n chars in the data, so using . won't work at all if it means [^\n].
I've also considered [^\x00]*? to detect any characters since null is never used in the data.
Any suggestions?
\$CH;.*?# is fine and should be quite efficient. You can make it more explicit that there should be no backtracking by writing it as \$CH;[^#]*#, if you like.
You can use (.|\n) or [\w\W] to match truly any character--or even better, use the RegexOptions.Singleline option to change the behavior of .:
Specifies single-line mode. Changes the meaning of the dot (.) so it matches every character (instead of every character except \n).
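The three variants can be compared on a small made-up buffer. This sketch is in Python, where re.DOTALL is the counterpart of RegexOptions.Singleline. The sample data is hypothetical.

```python
import re

# Hypothetical buffer: the first packet contains a newline in its data.
data = "noise$CH;abc\ndef#junk$CH;xyz#tail"

# Default dot: stops at the newline, so the first packet is missed.
lazy_default = re.findall(r"\$CH;.*?#", data)

# Negated class: matches anything except the terminator, newlines included.
negated = re.findall(r"\$CH;[^#]*#", data)

# Dot-matches-newline mode: the lazy dot now crosses the newline too.
lazy_dotall = re.findall(r"\$CH;.*?#", data, flags=re.DOTALL)

print(lazy_default)
print(negated)
print(lazy_dotall)
```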
Try this:
\$CH;[\s\S]*?#
To detect start of line/data use ^ anchor, to detect the end, use $ anchor:
^start.*?end$
Be aware that .*? may fail to match newlines, one option is to change it for [\s\S]*?
I would recommend checking the initial and terminal sequences separately using anchored regular expressions.
