C# Get Substring from String using Regex

I have a log from a firewall, which is in what I think is a horrible format, but the information I actually want to extract is relatively consistently delimited. An Example (although I've removed all the specific information for privacy) would be:
<46>Nov7 04:33:25 FirewallDeviceName [Some identifier from the firewall, can contain spaces]: in:[InterfaceName] out:[InterfaceName], connection-state:new src-mac [Mac-ID], proto UDP, [SourceIP]:[SourcePort]->[Dst-IP]:[Dst-Port], len 32
What I want to extract from this is just the Source and Destination IP Addresses and ports, and maybe also the In and Out interfaces, and maybe the protocol.
I thought that the best way to do this would be to use a combination of .SubString(pos,length) and .IndexOf(char) with RegEx to match the bits of the string that I need for each.
For Example:
\s[0-9]+\. would get the part of the string where the source-IP starts.
[0-9]+\, would get the end of the section containing the IP-Addresses
I can then Split() this using "->" to separate the source from the destination, and split each of those using ":" to separate the IP address from the port.
The bit I don't know is how to use RegEx within either IndexOf (to get the character position) or SubString, or even whether that's possible.
Any help or advice here would be appreciated.
What I'm basically looking to write initially is a parser to parse some text-file logs that I've generated (these were generated from a syslog listener that I wrote for our new firewall, to work out what the output looked like)... ultimately the parser will be built into the listener itself so that the bits I want are logged directly to an SQL Database, but that bit I can do... it's the parser with the Regex that I'm not sure about.
Thanks very much.

Based on the example text given, the RegEx
, (\[SourceIP\]:\[SourcePort\])->(\[Dst-IP\]:\[Dst-Port\]),
will capture the source and destination into $1 and $2. However, I suspect that they are actually numbers with dots and not words within square brackets. Thus a better expression may be
, ([\d.]+:\d+)->([\d.]+:\d+),
This RegEx matches the two parts within proto UDP, 1.2.3.4:567->8.9.0:123, len 32.
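As a rough sketch of how that second expression could be used from C#: the version below adds named groups so the IP and port come out separately, which avoids the extra Split() calls. The class name, method name and group names are made up for illustration, and it assumes the addresses are dotted IPv4 as guessed above. On the IndexOf part of the question: a Match exposes Index and Length, so the regex itself gives you the position without needing IndexOf.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FirewallLogParser
{
    // Hypothetical pattern: ", srcIP:srcPort->dstIP:dstPort," with named groups.
    private static readonly Regex Endpoints = new Regex(
        @", (?<srcIp>[\d.]+):(?<srcPort>\d+)->(?<dstIp>[\d.]+):(?<dstPort>\d+),");

    // Returns null when the line does not contain the expected section.
    public static (string SrcIp, int SrcPort, string DstIp, int DstPort)? Parse(string line)
    {
        Match m = Endpoints.Match(line);
        if (!m.Success) return null;

        // m.Index and m.Length would give the position of the match in the line.
        return (m.Groups["srcIp"].Value, int.Parse(m.Groups["srcPort"].Value),
                m.Groups["dstIp"].Value, int.Parse(m.Groups["dstPort"].Value));
    }
}
```

Calling FirewallLogParser.Parse(line) on each log line and skipping the null results should be enough for a first pass; the interfaces and protocol could be picked up with similar named groups.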

Related

Email validation C# asp.net [duplicate]

I used the following pattern to validate my email field.
return Regex.IsMatch(email,
    @"^(?("")("".+?(?<!\\)""@)|(([0-9a-z]((\.(?!\.))|[-!#\$%&'\*\+/=\?\^`\{\}\|~\w])*)(?<=[0-9a-z])@))" +
    @"(?(\[)(\[(\d{1,3}\.){3}\d{1,3}\])|(([0-9a-z][-0-9a-z]*[0-9a-z]*\.)+[a-z0-9][\-a-z0-9]{0,22}[a-z0-9]))$",
    RegexOptions.IgnoreCase, TimeSpan.FromMilliseconds(250));
It uses the following reference:
https://learn.microsoft.com/en-us/dotnet/standard/base-types/how-to-verify-that-strings-are-in-valid-email-format
My requirement is a maximum of 64 characters for the user part and a maximum of 254 characters for the whole email string. The pattern in the reference only allows a maximum of 134 characters. Can someone give a clear explanation of what the pattern means? What is the right pattern to achieve my goal?

The code you cited is over-engineered. All you need to verify an email is to check for an at symbol and for a dot. If you need anything more precise, you are probably at the point where you actually need to email the recipient and ask them to confirm that they hold the address, which is simpler than a complex regex and far more precise.
Such a regex would simply be:
.+@.+\..+
Annotated below:
.+ At least one of any character
@ The at symbol
.+ At least one character
\. The . symbol
.+ At least one character
Of course this means that some emails might be accepted as false positives, like tomas@company.c when the user intended tomas@company.com, but even if you design the most robust of regexes, one that checks against a list of accepted TLDs, you will never catch tomas@company.co, and you might introduce false negatives like tomas@company.blockchain when a new TLD is released and your code isn't updated.
So just keep it simple.

If you wanted to avoid using regex (which is, in my opinion, difficult to decipher), you could use the .Split() method on the email string using the "@" symbol as your delimiter. Then, you can check the string lengths of the two components from there.
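A minimal sketch of that idea, applied to the 64 / 254 character limits from the question. The class and method names are made up, and it uses LastIndexOf rather than Split, on the assumption that the last "@" is the one separating the user part from the domain (a quoted user part may itself contain an "@"). It is only a sanity check, not a full syntax validation.

```csharp
using System;

public static class EmailCheck
{
    // Hypothetical helper: checks overall shape and the two length limits only.
    public static bool LooksLikeEmail(string email)
    {
        if (string.IsNullOrEmpty(email) || email.Length > 254) return false;

        // The last '@' separates the user part from the domain.
        int at = email.LastIndexOf('@');
        if (at < 1 || at == email.Length - 1) return false;

        string local = email.Substring(0, at);
        string domain = email.Substring(at + 1);

        // User part at most 64 characters; domain must contain a dot.
        return local.Length <= 64 && domain.Contains(".");
    }
}
```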

Several years back, I wrote an email validation attribute in C# that should recognize most of that subset of syntactically valid email addresses that have the form local-part@domain. I say "most" because I didn't bother to try to deal with things like punycode, IPv4 address literals (dotted quads), or IPv6 address literals.
I'm sure there's lots of other edge cases I missed as well. But it worked well enough for our purposes at the time.
Use it in good health: C# Email Address validation
Before you go down the road of writing your own, you might want to (1) read through the multiple relevant RFCs and try to understand the vagaries of what constitutes a "valid" email address (it's not what you think), and (2) stop trying to validate an RFC 822 email address. About the only way to "validate" an email address is to send mail to it and see if it bounces or not. Which doesn't mean that anybody is home at that address, or that that mailbox won't disappear next week.
https://haacked.com/archive/2007/08/21/i-knew-how-to-validate-an-email-address-until-i.aspx/
https://jackfoxy.github.io/FsRegEx/emailregex.html
Jeffrey Friedl's book Mastering Regular Expressions has a [more-or-less?] complete regular expression to match syntactically valid email addresses. It's 6,598 characters long.
Did you know that postmaster@. is a legal email address? It theoretically gets you to the postmaster of the root DNS server.
Or that [theoretically] "bang path" email addresses like MyDepartmentServer!MainServer!BigRouter!TheirDepartmentServer!SpecificServer!jsmith are valid? Here you define the actual path through the network that the email should take. Helps if you know the network topology involved.

How do you delete text surrounding a string that you want?

I've looked online for this but not been able to find an answer unfortunately (sorry if there is something I have missed).
I have some code which filters out a specific string (which can change depending on what is read from the serial port). I want to be able to delete all of the characters which I am not using.
e.g. the string I want from the text below is "ThisIsTheStringIWant"
efefhokiehfdThisIsTheStringIWantcbunlokew
Now, I already have a function with some code which will identify this and print it to where I want. However, as the comms could be coming in from multiple ports at any frequency, before printing the string to where I want it, I need to have a piece of code which will recognise everything I don't want and delete it from my buffer.
e.g. Using the same random text above, I want to get rid of the two random strings at the ends (which are before and after "ThisIsTheStringIWant" in the middle).
efefhokiehfdThisIsTheStringIWantcbunlokew
I have tried using the highest voted answer from this question, however I can't find a way to delete the unwanted text before my wanted string. Remove characters after specific character in string, then remove substring?
If anyone can help, that would be great!
Thanks!
Edit:
Sorry, I should have probably made my question clearer.
Any possible number of characters could be before and/or after the actual string I want, and as the string I want is coming from a serial port it will be different every time depending on what comms are coming in from the serial port. In my application I have a cell in a DGV called "Extract", into which I type the first bit of the comms I am expecting (in this case, the extract would be This), but that will be different depending on what I am doing.

Find the position of the string you want, delete everything from the beginning up to the character just before that position, then delete everything that follows the length of your string.
String: efefhokiehfdThisIsTheStringIWantcbunlokew
Step 1 - "ThisIsTheStringIWant" starts at position 13, so delete the first twelve, leaving...
String: ThisIsTheStringIWantcbunlokew
Step 2 - "ThisIsTheStringIWant" is 20 characters long, so delete from character 21 to the length of the string, leaving:
String: ThisIsTheStringIWant
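The two steps above could be sketched in C# roughly as follows. The class and method names are hypothetical, and it assumes (as the steps do) that the full wanted string is known; IndexOf is zero-based, so position 13 in the walkthrough is index 12 here.

```csharp
using System;

public static class BufferTrimmer
{
    // Returns the wanted string cut out of the buffer, or null if it is absent.
    public static string Extract(string buffer, string wanted)
    {
        int start = buffer.IndexOf(wanted, StringComparison.Ordinal);
        if (start < 0) return null;

        // Step 1: delete everything before the wanted string.
        string withoutPrefix = buffer.Remove(0, start);

        // Step 2: keep only the wanted length, dropping whatever follows it.
        return withoutPrefix.Substring(0, wanted.Length);
    }
}
```

If only the beginning of the wanted string is known (the "Extract" cell), step 1 still applies unchanged, but step 2 needs some other rule for where the string ends, such as a terminator character or a fixed length.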

How do I determine a delimiter in a text file

I have 2 types of input files:
1. comma delimited (i.e: lastName, firstName, Address)
2. space delimited (i.e lastName firstName Address)
The comma delimited file HAS spaces between the ',' and the next word.
How do I go about determining which file I am dealing with ?
I am using C# btw

I've done tons of work with various delimited file types and as everyone else is saying, without normalization you can't really handle the whole thing programmatically.
Generally (and it seems like it would be totally necessary for space-delim) a delimited file will have a text qualifier character (often double quotes). A couple of examples of this point:
Space Delimited:
a lastName such as "Von Marshall" is impossible without qualifiers.
Addresses would be altogether impossible as well.
Comma Delimited:
addresses are generally unworkable unless they are broken into separate fields or having a solid string is acceptable for your use-case.
So the space delim should be easy enough to determine since you're looking for " ". If this is the case I'd (personally) replace all " " with "," to change it to comma-delim. That way you'd only have to build a single method for handling the text, otherwise I imagine you'll need methods for spaces and commas separately.
If your comma-delim file does not have a text qualifier, you're in a really tricky spot. I haven't found any "perfect" way of addressing this without any human work, but it can be minimized. I've used Notepad++ a lot to do batch replacement with its regular expression functions.
However, you can also use C#'s regex abilities. Here's what MSDN says on that.
So, to answer your question to the best of my ability: unless you can establish something unique to each of the 2 file types, there's no way. However, if the text has proper text qualifiers, the files have different file extensions, or they are generated in different directories, you could use any of those qualities or a mix thereof to decide what type of file it is. I have no experience doing this as yet (though I've just started a project using it), so I can't give an exact example, but for anyone to build a perfect example it would be best if you showed example strings for each file.

As other users have said, without some guarantee that there are no commas in the space-delimited version, you cannot tell with 100% accuracy.
With some information, say that there will always be three fields for all records in all cases when parsed correctly, you could just do both and test the results for the correct number of fields. Address is a big block here, though, since we do not know what that format could be. Also, these rules seem odd at best when talking about addresses... is
1111somestreest.houston,tx11111 or
1111 somestreet st. Houston, Tx 11111
a valid format?

You could count the number of commas per line of the file. If you have at least 2 commas per line (considering your info is last name, first name, address), you probably have a comma-separated file. If at least one line has fewer than 2 commas, you should consider it space-separated.
I, however, would skip this step and ignore the commas when evaluating the input by replacing all of them by spaces and would implement a single read/grab information procedure (considering only space separated files).
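For the comma-counting heuristic, something along these lines might do. The names are invented, the three-field layout is taken from the question, and blank lines are skipped on the assumption that they carry no information. Note that an address containing commas in a space-delimited file would fool it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DelimiterSniffer
{
    // Guesses ',' when every non-blank line has at least (expectedFields - 1) commas,
    // otherwise ' '. An input with no non-blank lines is reported as ','.
    public static char Guess(IEnumerable<string> lines, int expectedFields = 3)
    {
        bool everyLineHasEnoughCommas = lines
            .Where(line => !string.IsNullOrWhiteSpace(line))
            .All(line => line.Count(c => c == ',') >= expectedFields - 1);

        return everyLineHasEnoughCommas ? ',' : ' ';
    }
}
```

Usage would be something like DelimiterSniffer.Guess(File.ReadLines(path)), with File coming from System.IO.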

Create Variable from part of file

Hi, I'm having trouble working out where to start with creating a variable from a file created by Windows Remote Assistance. I need to extract the port from the text file so I can create an SSH tunnel allowing remote assistance from anywhere.
The port appears after the IP address in 'RCTICKET="65538,1,192.168.9.22:7532,'. The colon is the first one in the whole file, so I think I need to search for the first ":" and then copy the 4 digits that come after it, unless the port is 5 digits (I'm thinking of checking the 5th character: if it's a comma the port has 4 digits, or if it's a number the port has 5 digits).
Any help on where to start with this would be appreciated. I've been googling for hours and just can't think how to put this into a search term.
Below is an example of test.msrcincident, the file created by Microsoft Remote Assistance that I need to extract the port from:
<?xml version="1.0"?>
<UPLOADINFO TYPE="Escalated"><UPLOADDATA USERNAME="jon" LHTICKET="BDF9C9782B31A1BC276C029A169930ABB4490E2088169FA45A3A095258F5C54D345F4D793363E2C9 B924C5D6A38210AF2E86B3E3D33E5BEB3E35729ECDA88D5F5CE23879899768432726AF419FA2147194F4358BA2A0F245C4307EC8CAB882E2B670977562E5423C90EC336A15BA3DC57496F1EBB26B55B449B45FBD317CD4E422186EA7989F78C6FC3019BCF5831B1E060B174C5254D92448992A543079E576A66617F8B5BEA4C5961FC75C0B67F28B996CD4F1247DBC1C725B9D69B094B53AE24A533501A607CF119ED99C34F0C7210376C6564A48E25871AA32934409D981CF63F60DA956B0877AFBD669DFC321D16D55A34B9949AE0B26B6EEB473915AC416ABFC1129C08021F4011F1F0D1869BB86842C0218C03286C956FC7897B319E0B3A495EBA8ED41835E84E6BAD6B30199F6ACF191B6529DF2C5A264F578AF3B31A84997DA9C4BF1F8AD9E4931F99AE94A0E66D941F050AC0B025523148A95D24E60A6C548341C486BB40089B2088F5FE49AC966D65B728E36E0D7D76C98827335983BEC912DFC0B714DBBBFA060DE62658E7BABDB9BEB45486138950548DA62FDFD6437D0798A67D20CA1911880F58FCDA5F98FA5E0CAEF643171FE9DA8AF046" RCTICKET="65538,1,192.168.9.22:7532,*,U15FphW2EDtpPVdlHmafYLmnO/aVc+YFoFEw30tpjJ+6vJ+LspOTtaqgFoDt3bsp,*,*,P1ooZJPDyfMMTXqlz5hACdwD8F4=" PassStub="TE*0ViGNuB2T6I" RCTICKETENCRYPTED="1" DtStart="1379526042" DtLength="360" L="0"/></UPLOADINFO>
Thank you for reading

Something simple like this would get you the data you need, to some extent:
// Requires: using System.Xml.Linq;
var document = XDocument.Load("path to XML file");
var data = document.Element("UPLOADINFO")
    .Element("UPLOADDATA")
    .Attribute("RCTICKET")
    .Value; // the raw RCTICKET string
var values = data.Split(',');
You will need to work with that RCTICKET string to extract the value you need. It would be a bit safer to work with commas, colons, and whatnot in the context of a single attribute instead of the whole file. Caveat: When I generated an incident file, I ended up with multiple IP addresses in the RCTICKET field. I have multiple VPNs and ethernet adapters in my machine. You will have to pick the right one.
You will also want to handle failures if the XML isn't in the format we expect, or if the file is otherwise inaccessible. You can do this with a try/catch and/or checking for nulls.
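Putting the null checks and the port extraction together, a sketch might look like this. The class and method names are made up, it takes the XML as a string for simplicity, and it returns the port of the first IPv4 address:port pair it finds in RCTICKET, so with multiple adapters you would still have to pick the right one. Malformed XML still throws an XmlException, which the caller would need to catch.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class IncidentFile
{
    // Returns the port of the first "a.b.c.d:port" pair in RCTICKET, or null.
    public static int? FirstPort(string xml)
    {
        XDocument doc = XDocument.Parse(xml);

        // Null-conditional access covers a missing element or attribute.
        string ticket = doc.Root?.Element("UPLOADDATA")?.Attribute("RCTICKET")?.Value;
        if (ticket == null) return null;

        // Capturing 1 to 5 digits handles both 4-digit and 5-digit ports.
        Match m = Regex.Match(ticket, @"\d{1,3}(?:\.\d{1,3}){3}:(\d{1,5})");
        return m.Success ? int.Parse(m.Groups[1].Value) : (int?)null;
    }
}
```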

regex that can handle horribly misspelled words

Is there a way to create a regex that will ensure that five out of eight characters are present, in order, within a given character range (like 20 chars, for example)?
I am dealing with horrible OCR/scanning, and I can stand the false positives.
Is there a way to do this?
Update: I want to match, for example, "mshpeln" as misspelling. I do not want to do OCR. The OCR job has been done, but it has been done poorly (i.e. it originally said misspelling, but the OCR'd copy reads "mshpeln"). I do not know what the text that I will have to match against will be (i.e. I do not know that it is "mshpeln"; it could be "mispel" or any number of other combinations).
I am not trying to use this as a spell checker, but merely find the end of a capture group. As an aside, I am currently having trouble getting the all.css file, so commenting is impossible temporarily.

I think what you need is not a regex, but a database of all valid words and some creative use of functions like soundex() and/or levenshtein().
You can do this: create a table with all valid words (a dictionary), populate it with columns like word and snd (computed as soundex(word)), and create indexes on both the word and snd columns.
For example, for the word mispeling you would fill snd as M214. If you use SQLite, it has soundex() built in, though only in builds compiled with the SQLITE_SOUNDEX option.
Now, when you get a new bad word, compute soundex() for it and look it up in your indexed table. For example, for the word mshpeln it would be soundex('mshpeln') = M214. There you go, this way you can get back the correct word.
But this would not look anything like regex - sorry.
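If the database at hand has no soundex(), the code can be computed in C# before the lookup. Below is a minimal sketch of American Soundex; the class name is invented, and other implementations (including a database's own) may differ in edge cases, so the dictionary column and the lookup should be computed with the same function.

```csharp
using System;
using System.Text;

public static class Phonetic
{
    // Soundex digit for each letter A..Z; '0' marks letters that are not coded.
    private const string Codes = "01230120022455012623010202";

    public static string Soundex(string word)
    {
        var result = new StringBuilder();
        char lastCode = '0';
        foreach (char letter in word.ToUpperInvariant())
        {
            if (letter < 'A' || letter > 'Z') continue;   // ignore non-letters
            char code = Codes[letter - 'A'];
            if (result.Length == 0)
            {
                result.Append(letter);                    // keep the first letter as-is
                lastCode = code;
            }
            else if (letter != 'H' && letter != 'W')      // H and W do not separate equal codes
            {
                if (code != '0' && code != lastCode) result.Append(code);
                lastCode = code;                          // a vowel resets the last code
                if (result.Length == 4) break;
            }
        }
        return result.Length == 0 ? "" : result.ToString().PadRight(4, '0');
    }
}
```

With this, Phonetic.Soundex("mispeling") and Phonetic.Soundex("mshpeln") both give M214, matching the example above.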

To be honest, I think that a project like this would be better for an actual human to do, not a computer. If the project is too large for 1 or 2 people to do easily, you might want to look into something like Amazon's Mechanical Turk, where you can outsource the work for pennies per solution.

This can't be done with a regex, but it can be done with a custom algorithm.
For example, to find words that are like 'misspelling' in your body of text:
1) Preprocess. Create a Set (in the mathematical sense, collection of guaranteed to be unique elements) with all of the unique letters that are in misspelling - {e, i, g, l, m, n, p, s}
2) Split the body of text into words.
3) For each word, create a Set with all of its unique letters. Then, perform the operation of set intersection on this set and the set of the word you are matching against - this will get you letters that are contained by both sets. If this set has 5 or more characters left in it, you have a possible match here.
If the OCR can add in erroneous spaces, then consider two words at a time instead of single words. And etc based on what your requirements are.
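A rough C# sketch of steps 1 to 3, using HashSet for the letter sets. The class and method names are made up, words are split on whitespace only, and the threshold of 5 shared letters is the one suggested above. Since it ignores letter order, expect plenty of false positives.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LetterOverlap
{
    // Yields every word in the text that shares at least minShared distinct letters with target.
    public static IEnumerable<string> Candidates(string text, string target, int minShared = 5)
    {
        // Step 1: the set of unique letters in the word being searched for.
        var targetLetters = new HashSet<char>(
            target.ToLowerInvariant().Where(c => char.IsLetter(c)));

        // Step 2: split the body of text into words (on any whitespace).
        foreach (string word in text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
        {
            // Step 3: intersect the word's unique letters with the target's.
            var shared = new HashSet<char>(
                word.ToLowerInvariant().Where(c => char.IsLetter(c)));
            shared.IntersectWith(targetLetters);

            if (shared.Count >= minShared) yield return word;
        }
    }
}
```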

I have no solution for this problem; in fact, what I have is exactly the opposite.
Correcting OCR errors is not programmatically possible, for two reasons:
1) You cannot quantify the error that was made by the OCR algorithm, as it can be anywhere between 0 and 100%.
2) To apply a correction, you need to know what the maximum error could be in order to set an acceptable level.
Let nello world be the first guess of "hello world", which is quite similar. Then, with another font that is written in "painful" yellow or something, a second guess is noiio verio for the same expression. How should a computer know that this word would have been similar if it was better recognized?
Otherwise, given a predetermined error, mvp's solution seems to be the best in my opinion.
UPDATE:
After digging a little, I found a reference that may be relevant: String similarity measures
