c#, find multiline message by regex request


Main task
Find all DEBUG messages and select each message in full, whether it is a single line or spans an unknown number of lines.
I wrote this regex:
\d{13}\t.*DEBUG.*(?=\d{13})
It works perfectly, but only for single-line messages.
I also tried this code:
string myReg1 = @"\d{13}\t.*DEBUG.*(?=\d{13})";
MatchCollection match1 = Regex.Matches(logData, myReg1, RegexOptions.Singleline);
but this code finds only one match, where there should be 147 matches.
I have logs like this:
1426174736798 addons.manager DEBUG Registering shutdown blocker for OpenH264Provider
1426174736799 addons.manager DEBUG Registering shutdown blocker for PluginProvider
*** Blocklist::_preloadBlocklistFile: blocklist is disabled

Try using this non-greedy regex instead (EDIT: tweaked a bit for input):
\d{13}\t.{0,100}DEBUG.+?(?=\d{13}|$)
Now this is tweaked a bit more closely to your input data. I can't really think of an ideal way to keep that first dot before the DEBUG from eating up other rows that you don't want. In a perfect world, you could write a phrase to say something like, "any character except a row of 13 digits", but this is not really something that regex does well. Maybe someone else can make this better. In the meantime, I have restricted the first dot to consume no more than 100 characters. If it goes more than 100 characters past the 13 digit number and has not found the string "DEBUG" yet, it is fairly safe to assume it is on a row we don't care about. You may need to tweak this number up or down a bit to fit your data (and I hate imperfect solutions like this), but hopefully this will get you in the neighborhood.
Changing .* to .+? makes the dot non-greedy. I also added an alternative, $, to the final lookahead so that it also succeeds at the end of the input (with RegexOptions.Singleline the dot crosses line breaks, so the whole input is effectively treated as one line). This ensures that your last record is captured, since no 13-digit number follows it.
This appears to work correctly in Expresso, which uses the same regex engine as .NET.
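One possible way to express "any character up to the start of the next record" is a lazy dot that stops at a lookahead anchored to the beginning of a line. The following is only a sketch, not the answer's regex: it assumes the fields are tab-separated (as the \t in the question's pattern suggests), that every record starts at the beginning of a line with a 13-digit timestamp, and the names DebugLogDemo and FindDebugRecords are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class DebugLogDemo
{
    // Assumption: a record starts at a line start with 13 digits and a tab,
    // and the level appears as a tab-delimited field on that same line.
    // Multiline lets ^ match at every line start; Singleline lets . cross newlines.
    private static readonly Regex DebugRecord = new Regex(
        @"^\d{13}\t[^\n]*?\tDEBUG\t.*?(?=^\d{13}\t|\z)",
        RegexOptions.Multiline | RegexOptions.Singleline);

    // Returns every DEBUG record, including continuation lines that follow it.
    public static string[] FindDebugRecords(string logData)
    {
        return DebugRecord.Matches(logData)
            .Cast<Match>()
            .Select(m => m.Value.TrimEnd())
            .ToArray();
    }

    public static void Main()
    {
        string logData =
            "1426174736798\taddons.manager\tDEBUG\tRegistering shutdown blocker for OpenH264Provider\n" +
            "1426174736799\taddons.manager\tDEBUG\tRegistering shutdown blocker for PluginProvider\n" +
            "*** Blocklist::_preloadBlocklistFile: blocklist is disabled\n" +
            "1426174736800\taddons.manager\tINFO\tStartup complete\n";

        // Prints two records; the second one spans two lines.
        foreach (string record in FindDebugRecords(logData))
            Console.WriteLine("[" + record + "]");
    }
}
```

Because [^\n]*? cannot leave the first line, a non-DEBUG record can never borrow the word DEBUG from a later row, which removes the need for the 100-character limit under these assumptions.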

Related

String split on dynamic separator

I am to deal with the following problem.
I have to extract messages from a communication buffer. Sadly the communication protocol is lousy and not well-structured. The only way I came up with to distinguish packets in the buffer is an intermediate "ack" command that is transmitted by the server.
Example:
[Packet1][ACK][Packet2][ACK][Packet3]
I could have used String.Split(ACK), but the separator is also not consistent. Though, there are 3 rules to identify the ack packet.
Starts with "AK".
Ends with "0" or "1".
Total length is 5 characters.
Ack example:
"AKxxy" where:
xx: (01 to 99)
y: (0 or 1)
I hope that there may be a regular expression that can solve my problem, but I lack the needed knowledge and time.
Is there any RegEx "expert" who could possibly help me? Feel free to suggest any solution.
Thank you.
Edit:
Example packet (I really had to remove the packet information):
AK010CONFIDENTIALPACKET1AK011CONFIDENTIALPACKET2AK020AK011CONFIDENTIALPACKET3AK021CONFIDENTIALPACKET4AK050
Sadly, each packet in the protocol does not start or end with a specific character so I cannot distinguish them. To identify each one I have to split them using the ack packet and then perform different checks on each one.

The direct translation would be
\bAK\d{2}[01]\b
That is
\b # a word boundary
AK # AK literally
\d{2} # two digits
[01] # one of 0 or 1
\b # another word boundary
The expression needs to be tested though (see a demo on regex101.com).

EDIT:
looking at the other answers, this has probably merely ornamental value.
The solution of @Jan and @ThymosK
var packets = Regex.Split(buffer, @"AK\d{2}[01]");
seems much more elegant.
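As a small runnable sketch of that Regex.Split approach, applied to the example buffer from the question: the class and method names are hypothetical, and dropping empty entries is my own assumption about what is wanted when two acks are adjacent or the buffer starts or ends with an ack.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class AckSplitDemo
{
    // Splits the buffer on every ack ("AK", two digits, then 0 or 1)
    // and drops the empty strings that Regex.Split returns when the
    // buffer begins or ends with an ack, or when two acks are adjacent.
    public static string[] SplitPackets(string buffer)
    {
        return Regex.Split(buffer, @"AK\d{2}[01]")
            .Where(packet => packet.Length > 0)
            .ToArray();
    }

    public static void Main()
    {
        string buffer =
            "AK010CONFIDENTIALPACKET1AK011CONFIDENTIALPACKET2" +
            "AK020AK011CONFIDENTIALPACKET3AK021CONFIDENTIALPACKET4AK050";

        // Prints CONFIDENTIALPACKET1 through CONFIDENTIALPACKET4, one per line.
        foreach (string packet in SplitPackets(buffer))
            Console.WriteLine(packet);
    }
}
```

Note that \d{2} also accepts 00, which is slightly wider than the stated range of 01 to 99.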
But I think it might be nice to see how all the parsing can be moved inside the regex. Even if it is way too unreadable :P
I have designed a regex that can give you messages and delimiters as groups:
(?s)(AK[0-9][0-9][01])|((?:(?!AK[0-9][0-9][01]).)*)
It can analyze text like this:
You can test it here.
As usual, regexes are write only. I can hardly read this myself. But I'll try and go through it:
The first group is straightforward and simply catches your ack command:
(AK[0-9][0-9][01])
The second group contains a negative lookahead (?! ... ) which matches anything not followed by the regex specified by .... Here we insert your ack syntax, so anything not followed by ack is matched. Then we add a single character, to extend this to actually match anything up to ack.
Basically this second part asserts that we are currently not followed by ack and then adds a single character. This is repeated as long as possible, until we find the ack. I turn this into the second group.
Since I don't have C# currently, I can't wrap this in code with the C# regex engine. But python works nicely with it and offers a useful findall method, that gives you all those groups.
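For what it is worth, here is a rough sketch of how that pattern might be wrapped in C#, with Regex.Matches playing the role of Python's findall. The class written as [01] is used for the last ack character, the names AckTokenizerDemo and Tokenize are invented for the example, and zero-length matches are skipped because the second alternative is allowed to match nothing.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AckTokenizerDemo
{
    // Group 1: an ack. Group 2: any run of characters that does not start an ack.
    private static readonly Regex Token = new Regex(
        @"(?s)(AK[0-9][0-9][01])|((?:(?!AK[0-9][0-9][01]).)*)");

    // Returns the buffer as a sequence of "ACK ..." and "MSG ..." tokens.
    public static List<string> Tokenize(string buffer)
    {
        var tokens = new List<string>();
        foreach (Match m in Token.Matches(buffer))
        {
            if (m.Length == 0)
                continue; // the second alternative can match the empty string

            string kind = m.Groups[1].Success ? "ACK " : "MSG ";
            tokens.Add(kind + m.Value);
        }
        return tokens;
    }

    public static void Main()
    {
        foreach (string token in Tokenize("AK010P1AK020AK011P2"))
            Console.WriteLine(token);
        // ACK AK010
        // MSG P1
        // ACK AK020
        // ACK AK011
        // MSG P2
    }
}
```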

string interim = Regex.Replace(buffer, @"AK\d{2}[01]", "|");
var commands = interim.Split('|');
This assumes that | is not a valid input character; if it is, pick a more exotic placeholder.

Conditional match without false force a match?

I'm using the following regex in c# to match some input cases:
^
(?<entry>[#])?
(?(entry)(?<id>\w+))
(?<value>.*)
$
The regex is compiled with the option to ignore pattern whitespace.
My input looks as follows:
hello
#world
[xxx]
This all can be tested here: DEMO
My problem is that this regex will not match the last line. Why?
What I'm trying to do is to check for an entry character. If it's there I force an identifier by \w+. The rest of the input should be captured in the last group.
This is a simplified regex and simplified input.
The problem can be fixed if I change the id regex to something like (?(entry)(?<id>\w+)|), (?(entry)(?<id>\w+))? or (?(entry)(?<id>\w+)?).
I'm trying to understand why the conditional group doesn't match as written in the original regex.
I'm comfortable with regex and know that the regex can be simplified to ^(\#(?<id>\w+))?(?<value>.*)$ to match my needs. But the real regex contains two more optional groups:
^
(?<entry>[#])?
(\?\:)?
(\(\?(?:\w+(?:-\w+)?|-\w+)\))?
(?(entry)(?<id>\w+))
(?<value>.*)
$
That's the reason why I'm trying to use a conditional match.
UPDATE 10/12/2018
I experimented a little more. I found the following regex, which should match every input, even an empty one, but it doesn't:
(?(a)a).*
DEMO
I believe this is a bug in the .NET regex engine and have reported it to Microsoft: See here for more information

The error is not in the regex parser but in the use of the . wildcard specifier. The . specifier consumes every character except, wait for it, the linefeed character \n. (See "the any character" in Character Classes in Regular Expressions.)
If you want your regex to work, you need to consume all characters including the linefeed, and that can be done by specifying the Singleline option. To paraphrase the documentation:
Singleline tells the parser to let . match all characters, including \n.
Why does it still fail outside Singleline mode, when the other lines are consumed? Because the final match leaves the current position at the \n, and the only construct left to try (as the pattern is written) is the .*, which, as mentioned, cannot consume it; hence the parser stops. The $ also locks in the operation at this point.
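A minimal sketch of just that dot behavior, separate from the conditional group: it assumes \r\n line endings as in the description above, and the helper names FirstDotStar and Visible are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotNewlineDemo
{
    // Returns what a bare .* consumes from the start of the input.
    public static string FirstDotStar(string input, RegexOptions options)
    {
        return Regex.Match(input, ".*", options).Value;
    }

    // Makes carriage returns and linefeeds visible when printing.
    public static string Visible(string s)
    {
        return s.Replace("\r", "\\r").Replace("\n", "\\n");
    }

    public static void Main()
    {
        string input = "#world\r\n[xxx]";

        // Default: the dot matches \r but stops in front of \n.
        Console.WriteLine(Visible(FirstDotStar(input, RegexOptions.None)));
        // prints: #world\r

        // Singleline: the dot matches \n as well, so everything is consumed.
        Console.WriteLine(Visible(FirstDotStar(input, RegexOptions.Singleline)));
        // prints: #world\r\n[xxx]
    }
}
```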
Let me demonstrate what is happening with a tool I have created that illustrates the issue. In the tool, the upper left corner shows the example text as we see it. Below that is what the parser sees, with the \r\n characters represented by ↵¶ respectively. That pane also shows what is matched at the time, enclosed in yellow boxes. The middle box is the actual pattern, and the box on the right shows the match results in detail by listing out the returned structures, again with the whitespace made visible.
Notice the second match (as index 1) has world in group capture id and value as ↵.
I surmise your token processor isn't getting what you want in the proper groups and because one doesn't actually see the successful match of value as the \r, it is overlooked.
Let us turn on Singleline and see what happens.
Now everything is consumed, but there is a different problem. :-)

Parse directories from a string

Firstly, I have spent three hours trying to solve this. Also, please don't suggest not using regex. I appreciate other comments and can easily use other methods, but I am practicing regex as much as possible.
I am using VB.Net
Example string:
"Hello world this is a string C:\Example\Test E:\AnotherExample"
Pattern:
"[A-Z]{1}:.+?[^ ]*"
It works fine. However, what if the directory name contains whitespace? I have tried to match all strings that start with one uppercase letter followed by a colon and then anything else. This needs to match up until a whitespace, one uppercase letter and a colon, and then match the same sequence again.
I hope I have made sense.

How about "[A-Z]{1}:((?![A-Z]{1}:).)*", which should stop before the next drive letter and colon?
That "?!" is a "negative lookaround" or "zero-width negative lookahead" which, according to Regular expression to match a line that doesn't contain a word? is the way to get around the lack of inverse matching in regexes.
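A quick sketch of that pattern in use (written in C# rather than VB.Net, though the pattern string is the same in both; the redundant {1} is dropped, and the names DrivePathDemo and FindPaths are invented for the example). Each match runs up to the next drive letter, so the trailing space is trimmed afterwards.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class DrivePathDemo
{
    // Matches a drive letter and colon, then every character that does not
    // begin another "letter + colon" pair. Trailing whitespace is trimmed.
    public static string[] FindPaths(string text)
    {
        return Regex.Matches(text, @"[A-Z]:((?![A-Z]:).)*")
            .Cast<Match>()
            .Select(m => m.Value.Trim())
            .ToArray();
    }

    public static void Main()
    {
        string text = @"Hello world this is a string C:\Example\Test E:\AnotherExample";
        foreach (string path in FindPaths(text))
            Console.WriteLine(path);
        // C:\Example\Test
        // E:\AnotherExample

        // A directory name containing a space stays in one piece.
        foreach (string path in FindPaths(@"see C:\My Folder\Test E:\Other"))
            Console.WriteLine(path);
        // C:\My Folder\Test
        // E:\Other
    }
}
```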

Not to be too picky, but most filesystems disallow a small number of characters (like <>/\:?"), so a correct pattern for a file path would be more like [A-Z]:\\((?![A-Z]{1}:)[^<>/:?"])*.
The other important point that has been raised is how you expect to parse input like "hello path is c:\folder\file.extension this is not part of the path:P"? This is a problem you commonly run into when you start trying to parse without specifying the allowed range of inputs, or the grammar that a parser accepts. This particular problem seems pretty ad hoc and so I don't really expect you to come up with a grammar or to define how particular messages are encoded. But the next time you approach a parsing problem, see if you can first define what messages are allowed and what they mean (syntax and semantics). I think you'll find that once you've defined the structure of allowed messages, parsing can be almost trivial.

RegEx Performance Issue

We are having a problem with the following regular expression:
(.*?)\|\*\|([0-9]+)\*\|\*(.*?)
It should match things like: |*25 *|
We are using the .NET Framework 4 Regex class. The code is the following:
string expression = "(.*?)" +
Regex.Escape(Constants.FIELD_START_DELIMITER_BACK_END) +
"([0-9]+)" +
Regex.Escape(Constants.FIELD_END_DELIMITER_BACK_END) +
"(.*?)";
Regex r = new Regex(expression);
r.Matches(contentText)
It is taking too long (about 60 seconds) with a 40,000-character text.
But with a text of 180,000 characters the speed is very acceptable (3 seconds or less).
The only difference between the texts is that the first one (the slow one) is contained entirely in a single line, with no line breaks. Can this be the issue? Is that what is affecting the performance?
Thanks

@David Gorsline's solution (from the comment) is correct:
string expression =
Regex.Escape(Constants.FIELD_START_DELIMITER_BACK_END) +
"([0-9]+)" +
Regex.Escape(Constants.FIELD_END_DELIMITER_BACK_END);
Specifically, it's the (.*?) at the beginning that's doing you in. What that does is take over doing what the regex engine should be doing itself--scan for the next place where the regex can match--and doing it much, much less efficiently. At each position, the (.*?) effectively performs a lookahead to determine whether the next part of the regex can match, and only if that fails does it go ahead and consume the next character.
But even if you used something more efficient, like [^|]*, you would still be slowing it down. Leave that part off, though, and the regex engine can instead scan for the first constant portion of the regex, probably using an algorithm like Boyer-Moore or Knuth-Morris-Pratt. So don't worry about what's around the bits you want to match; just tell the regex engine what you're looking for and get out of its way.
On the other hand, the trailing (.*?) has virtually no effect, because it never really does anything. The ? turns the .* reluctant, so what does it take to make it go ahead and consume the next character? It will only do so if there's something following it in the regex that forces it to. For example, foo.*?bar consumes everything from the next "foo" to the next "bar" after that, but foo.*? stops as soon as it's consumed "foo". It never makes sense to have a reluctant quantifier as the last thing in a regex.

You've answered your question: the problem is that . fails to match new-lines (it doesn't by default), which results in many failed attempts - almost one for every position on your 40000 character string.
On the long but single-line file, the engine can match the pattern in a single pass over the file (assuming a successful match exists; if it doesn't, I suspect it will take a long time to fail...).
On the shorter file, with many lines, the engine tries to match from the first character. It matches .*? until the end of the first line (this is a lazy match, so a lot more is happening, but let's ignore that), and fails. Now it starts again from the second character, not the second line! This results in n² complexity even before matching the number.
A simple solution is to make . match newlines:
Regex r = new Regex(expression, RegexOptions.Singleline);
You can also make sure to match from start to end using the absolute start and end anchors, \A and \z:
string expression = "\\A(.*?)" +
Regex.Escape(Constants.FIELD_START_DELIMITER_BACK_END) +
"([0-9]+)" +
Regex.Escape(Constants.FIELD_END_DELIMITER_BACK_END) +
"(.*?)\\z";
Another note:
As David suggests in the comments, \|\*\|([0-9]+)\*\|\* should work well enough. Even if you need to "capture" all text before and after the match, you can easily get it using the position of the match.
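To illustrate that last point, here is a sketch that matches only the delimited number and recovers the surrounding text from Match.Index and Match.Length. The delimiter values |*| and *|* are an assumption inferred from the escaped pattern in the question (the real Constants class is not shown), and DelimitedFieldDemo and Describe are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DelimitedFieldDemo
{
    // Assumed delimiter values, inferred from the pattern in the question.
    private const string FieldStart = "|*|";
    private const string FieldEnd = "*|*";

    // No leading or trailing (.*?): the engine only searches for the field itself.
    private static readonly Regex Field = new Regex(
        Regex.Escape(FieldStart) + "([0-9]+)" + Regex.Escape(FieldEnd));

    // Returns the text as alternating "text:..." and "field:..." entries.
    public static List<string> Describe(string contentText)
    {
        var parts = new List<string>();
        int previousEnd = 0;
        foreach (Match m in Field.Matches(contentText))
        {
            // Text between the previous field and this one.
            parts.Add("text:" + contentText.Substring(previousEnd, m.Index - previousEnd));
            parts.Add("field:" + m.Groups[1].Value);
            previousEnd = m.Index + m.Length;
        }
        // Whatever follows the last field.
        parts.Add("text:" + contentText.Substring(previousEnd));
        return parts;
    }

    public static void Main()
    {
        foreach (string part in Describe("abc|*|25*|*def|*|7*|*xyz"))
            Console.WriteLine(part);
        // text:abc
        // field:25
        // text:def
        // field:7
        // text:xyz
    }
}
```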

What Regex to capture Multiline Text Between Two Phrases?

I need to capture form data text from an email form by capturing what exists between elements.
The text I get in the body of the email is multiline with a lot of whitespace between keywords. I don't care about the whitespace; I'll trim it out, but I have to be able to capture what occurs between two form field descriptors.
The key phrases are really clear and unique, but I can't get the Regex to work:
Sample data:
Loan Number:
123456789
Address:
101 Main Street
My City, WA
99101
Servicemember Name:
Joe Smith
Servicemember Phone Number:
423-283-5000
Complaint Description:
He has a complaint
Associate Information
Associate Name:
Some Dude
Phone Login:
654312
Complaint Date:
1/10/2012
Regex (to capture the loan number, for example):
^Loan Number:(.*?)Address:.$
What am I missing?
EDIT: Also, in addition to capturing data between the various form labels, I need to capture the data between the last label and the end of the file. After reading the responses here, I've been able to capture the data between form labels, but not the last piece of data, the Complaint Date.

What am I missing?
You'll need to drop the anchors (^ and $) and enable dotall mode, which allows the . to match newlines. In C# that is RegexOptions.Singleline, or the s inline modifier (not m, which is Multiline and only changes how ^ and $ behave).
Why is this so difficult?
Regular Expressions are a very powerful tool. With great power comes great responsibility. That is, no one said it would be easy...
UPDATE
After reviewing the question more closely, you have solid anchor points and a very specific capture (i.e. the loan number digits). The following regular expression should work without the modifier mentioned above.
Loan Number\s+(\d+)\s+Escalation Required

This one works for me:
Loan Number(?<Number>(.*\n)+)Escalation Required
Where Number named group is the result.

Your main problem is that you aren't specifying Multiline mode. Without that, ^ only matches the very beginning of the text and $ only matches the very end. Also, the (.*?) needs to match the line separators before and after the loan number in addition to the number itself, and it can't do that unless you specify Singleline mode.
There are two ways you can specify these matching modes. One is by passing the appropriate RegexOptions argument when you create the Regex:
Regex r = new Regex(@"^Loan Number(.*?)Escalation Required.$",
RegexOptions.Multiline | RegexOptions.Singleline);
The other is by adding "inline" modifiers to the regex itself:
Regex r = new Regex(@"(?ms)^Loan Number(.*?)Escalation Required.$");
But I recommend you do this instead:
Regex r = new Regex(@"(?m)^Loan Number\s*(\d+)\s*Escalation Required(?=\z|\r\n|[\r\n])");
About \s*(\d+)\s*:
In Singleline mode (known as DOTALL mode in some flavors), there's nothing to stop .*? from matching all the way to the end of the document, however long it happens to be. It will try to consume as little as possible thanks to the non-greedy modifier (?), but in cases where no match is possible, the regex engine will have to do a lot of pointless work before it admits defeat. I practically never use Singleline mode for that reason.
Singleline mode or not, don't use .* or .*? without at least considering something more specific. In this case, \s*(\d+)\s* has the advantage that it allows you to capture the loan number only. You don't have to trim whitespace or perform any other operations to extract the part that interests you.
About (?=\z|\r\n|[\r\n]):
According to the Unicode standard, $ in multiline mode should match before a carriage-return (\r) or before a linefeed (\n) if it's not preceded by \r--it should never match between \r and \n. There are several other single-character line separators as well, but the .NET regex flavor doesn't recognize anything but \n. Your source text (an email message) uses \r\n to separate lines, which is why you had to add that dot before the anchor: .$.
But what if you don't know which kind of line separators to expect? Realistically, \n or \r\n are by far the most common choices, but even if you disregard the others, .$ is going to fail half the time. (?=\z|\r\n|[\r\n]) is still a hack, but it's a much more portable hack. ;) It even handles \r (carriage return only), the line separator associated with pre-OS X Macintosh systems.
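For the more general need in the question (the text between any two labels, and between the last label and the end of the message), one possible sketch is a small helper that builds the pattern from the labels. It does rely on a lazy .*? in Singleline mode, which the answer above cautions against, on the assumption that email bodies are short enough for that not to matter. FormFieldDemo and Between are hypothetical names, and the labels are taken from the sample data in the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FormFieldDemo
{
    // Returns the text between label and nextLabel with surrounding whitespace
    // excluded, or null if the labels are not found. When nextLabel is null,
    // the value runs to the end of the input (\z).
    public static string Between(string body, string label, string nextLabel = null)
    {
        string end = nextLabel == null ? @"\z" : Regex.Escape(nextLabel);
        string pattern = Regex.Escape(label) + @"\s*(.*?)\s*" + end;

        Match m = Regex.Match(body, pattern, RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        string body =
            "Loan Number:\r\n123456789\r\n" +
            "Address:\r\n101 Main Street\r\nMy City, WA\r\n99101\r\n" +
            "Complaint Date:\r\n1/10/2012\r\n";

        Console.WriteLine(Between(body, "Loan Number:", "Address:"));
        // 123456789

        Console.WriteLine(Between(body, "Complaint Date:"));
        // 1/10/2012
    }
}
```

Regex.Escape keeps characters such as spaces and colons in the labels literal, so the labels can be passed exactly as they appear in the email.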
