I've looked online for this but not been able to find an answer unfortunately (sorry if there is something I have missed).
I have some code which filters out a specific string (which can change depending on what is read from the serial port). I want to be able to delete all of the characters which I am not using.
e.g. the string I want from the text below is "ThisIsTheStringIWant"
efefhokiehfdThisIsTheStringIWantcbunlokew
Now, I already have a function with some code which will identify this and print it to where I want. However, as the comms could be coming in from multiple ports at any frequency, before printing the string to where I want it, I need to have a piece of code which will recognise everything I don't want and delete it from my buffer.
e.g. Using the same random text above, I want to get rid of the two random strings at the ends (which are before and after "ThisIsTheStringIWant" in the middle).
efefhokiehfdThisIsTheStringIWantcbunlokew
I have tried using the highest voted answer from the question "Remove characters after specific character in string, then remove substring?", however I can't find a way to delete the unwanted text before my wanted string.
If anyone can help, that would be great!
Thanks!
Edit:
Sorry, I should have probably made my question clearer.
Any number of characters could come before and/or after the string I want, and as that string is coming from a serial port it will be different every time, depending on what comms are coming in. In my application I have a cell in a DGV called "Extract", into which I type the first bit of the comms I am expecting (in this case, the extract would be This), but that will be different depending on what I am doing.
Find the position of the string you want, delete everything before that position, then delete everything from just past the length of your string to the end.
String: efefhokiehfdThisIsTheStringIWantcbunlokew
Step 1 - "ThisIsTheStringIWant" starts at position 13, so delete the first twelve, leaving...
String: ThisIsTheStringIWantcbunlokew
Step 2 - "ThisIsTheStringIWant" is 20 characters long, so delete from character 21 to the length of the string, leaving:
String: ThisIsTheStringIWant
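The two steps above could be sketched in C# roughly as follows. This is only an illustration: BufferTrimmer and TryTrim are hypothetical names, and the sketch assumes the full wanted string is already known (for instance, identified by the existing function). Note that IndexOf is zero-based, so "position 13" in the walkthrough corresponds to index 12.

```csharp
using System;

public static class BufferTrimmer
{
    // Hypothetical helper: strips everything before and after 'wanted' in 'buffer'.
    // Returns false (and an empty result) when 'wanted' does not occur at all.
    public static bool TryTrim(string buffer, string wanted, out string result)
    {
        result = string.Empty;

        int start = buffer.IndexOf(wanted, StringComparison.Ordinal);
        if (start < 0)
        {
            return false;
        }

        // Step 1: drop everything before the wanted string.
        string withoutPrefix = buffer.Substring(start);

        // Step 2: keep only the first wanted.Length characters, dropping the tail.
        result = withoutPrefix.Substring(0, wanted.Length);
        return true;
    }
}
```

Called with the buffer "efefhokiehfdThisIsTheStringIWantcbunlokew" and the wanted string "ThisIsTheStringIWant", TryTrim returns true and sets result to "ThisIsTheStringIWant".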
Related
I have a log from a firewall, which is in what I think is a horrible format, but the information I actually want to extract is relatively consistently delimited. An Example (although I've removed all the specific information for privacy) would be:
<46>Nov7 04:33:25 FirewallDeviceName [Some identifier from the firewall, can contain spaces]: in:[InterfaceName] out:[InterfaceName], connection-state:new src-mac [Mac-ID], proto UDP, [SourceIP]:[SourcePort]->[Dst-IP]:[Dst-Port], len 32
What I want to extract from this is just the Source and Destination IP Addresses and ports, and maybe also the In and Out interfaces, and maybe the protocol.
I thought that the best way to do this would be to use a combination of .SubString(pos,length) and .IndexOf(char) with RegEx to match the bits of the string that I need for each.
For Example:
\s[0-9]+\. would get the part of the string where the source-IP starts.
[0-9]+\, would get the end of the section containing the IP-Addresses
I can then Split() this, first using "->" to split the source and destination, and then split each of these using ":" to separate the IP Address from the Port.
The bit I don't know is how to use RegEx within either the IndexOf (to get the character position) or within SubString functions, or even if that's possible.
Any help or advice here would be appreciated.
What I'm basically looking to write initially is a parser to parse some text-file logs that I've generated (these were generated from a syslog listener that I wrote for our new firewall, to work out what the output looked like)... ultimately the parser will be built into the listener itself so that the bits I want are logged directly to an SQL Database, but that bit I can do... it's the parser with the Regex that I'm not sure about.
Thanks very much.
Based on the example text given, the RegEx
, (\[SourceIP\]:\[SourcePort\])->(\[Dst-IP\]:\[Dst-Port\]),
will capture the source and destination into $1 and $2. However, I suspect that they are actually numbers with dots and not words within square brackets. Thus a better expression may be
, ([\d.]+:\d+)->([\d.]+:\d+),
This RegEx matches the two parts within proto UDP, 1.2.3.4:567->8.9.0:123, len 32.
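As a rough sketch of how that expression might be used from C#, the version below captures the IP addresses and ports in four separate groups, so no IndexOf or Split calls are needed afterwards. FirewallLogParser and TryParse are hypothetical names, and the sketch assumes IPv4 addresses and the ", src:port->dst:port," layout shown in the example line.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FirewallLogParser
{
    // Variation on the expression above: IP and port get their own capture groups.
    private static readonly Regex Endpoints =
        new Regex(@", ([\d.]+):(\d+)->([\d.]+):(\d+),", RegexOptions.Compiled);

    // Hypothetical helper: returns false if the line does not contain the pattern.
    public static bool TryParse(string line, out string srcIp, out int srcPort,
                                out string dstIp, out int dstPort)
    {
        srcIp = string.Empty;
        dstIp = string.Empty;
        srcPort = 0;
        dstPort = 0;

        Match m = Endpoints.Match(line);
        if (!m.Success)
        {
            return false;
        }

        // m.Index and m.Length hold the character position and length of the
        // whole match, should Substring-style slicing still be wanted.
        srcIp = m.Groups[1].Value;
        dstIp = m.Groups[3].Value;
        return int.TryParse(m.Groups[2].Value, out srcPort)
            && int.TryParse(m.Groups[4].Value, out dstPort);
    }
}
```

For the fragment "proto UDP, 1.2.3.4:567->8.9.0:123, len 32" this yields the source 1.2.3.4 with port 567 and the destination 8.9.0 with port 123. The character class [\d.]+ does not validate that an address is well formed; it only picks out runs of digits and dots.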
I want my bot to accept two different parameters. That is, I want to extract one part of the input and then, after a "," or another symbol, extract the remainder separately, so that I get two strings from one input: the first part and the rest. I am not planning on updating to 1.0, so tell me if this isn't possible in 0.9.6.
Your question isn't very clear but I think I know what you are looking for. This is a general answer for C# as I don't know how the Discord interface differs. You seem to be taking input in the form of a string, for example: play *songname*,*channelname*. To split this string into two inputs you want to use String.Split(',')
An example would be this:
string stringTakenFromDiscord = "play *songname*,*channelname*";
String[] input = stringTakenFromDiscord.Split(',');
//input[0] will be equal to what comes before the comma
//if you were to print it, it would be "play *songname*"
//input[1] will be what comes after the comma
//if you were to print it, it would be "*channelname*"
Now you can do anything you want with either of the values of the array input[] and feed them through your code to parse them. Do note that when it splits by the character, the character won't appear in either of the output strings. You will get exactly two strings only for inputs that contain one instance of your chosen character; every additional instance produces another element in the array. You can change the character to whatever you want.
It occurs to me that it might be easier to just take the input on two separate lines instead.
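If the second parameter may itself contain the separator, one option is the Split overload that takes a maximum count, so that everything after the first separator stays in one piece. The sketch below is an assumption about what the asker needs, not part of any Discord library; CommandSplitter and SplitOnce are hypothetical names.

```csharp
using System;

public static class CommandSplitter
{
    // Hypothetical helper: splits 'input' at the first separator only.
    // 'Rest' is empty when the separator does not occur.
    public static (string First, string Rest) SplitOnce(string input, char separator = ',')
    {
        // A count of 2 means: at most two parts, the second one keeps any later separators.
        string[] parts = input.Split(new[] { separator }, 2);

        string first = parts[0].Trim();
        string rest = parts.Length > 1 ? parts[1].Trim() : string.Empty;
        return (first, rest);
    }
}
```

With this, SplitOnce("play *songname*,*channelname*") gives "play *songname*" and "*channelname*", while an input such as "a,b,c" gives "a" and "b,c".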
Hi, I'm having trouble working out where to start with creating a variable from a file created by Windows Remote Assistance. I need to extract the port from the text file so I can create an ssh tunnel allowing remote assistance from anywhere.
The port appears after the IP address in 'RCTICKET="65538,1,192.168.9.22:7532,'. The colon is the first one in the whole file, so I think I need to search for the first ":" and then copy the 4 digits that come after it, unless the port is 5 digits (I think by checking the 5th character: a comma means a 4 digit port, and a number means a 5 digit port).
Any help on where to start with this would be appreciated. I've been googling for hours and just can't think how to put this in a search term.
Below is an example of test.msrcincident, the file created by Microsoft Remote Assistance that I need to extract the port from:
<?xml version="1.0"?>
<UPLOADINFO TYPE="Escalated"><UPLOADDATA USERNAME="jon" LHTICKET="BDF9C9782B31A1BC276C029A169930ABB4490E2088169FA45A3A095258F5C54D345F4D793363E2C9 B924C5D6A38210AF2E86B3E3D33E5BEB3E35729ECDA88D5F5CE23879899768432726AF419FA2147194F4358BA2A0F245C4307EC8CAB882E2B670977562E5423C90EC336A15BA3DC57496F1EBB26B55B449B45FBD317CD4E422186EA7989F78C6FC3019BCF5831B1E060B174C5254D92448992A543079E576A66617F8B5BEA4C5961FC75C0B67F28B996CD4F1247DBC1C725B9D69B094B53AE24A533501A607CF119ED99C34F0C7210376C6564A48E25871AA32934409D981CF63F60DA956B0877AFBD669DFC321D16D55A34B9949AE0B26B6EEB473915AC416ABFC1129C08021F4011F1F0D1869BB86842C0218C03286C956FC7897B319E0B3A495EBA8ED41835E84E6BAD6B30199F6ACF191B6529DF2C5A264F578AF3B31A84997DA9C4BF1F8AD9E4931F99AE94A0E66D941F050AC0B025523148A95D24E60A6C548341C486BB40089B2088F5FE49AC966D65B728E36E0D7D76C98827335983BEC912DFC0B714DBBBFA060DE62658E7BABDB9BEB45486138950548DA62FDFD6437D0798A67D20CA1911880F58FCDA5F98FA5E0CAEF643171FE9DA8AF046" RCTICKET="65538,1,192.168.9.22:7532,*,U15FphW2EDtpPVdlHmafYLmnO/aVc+YFoFEw30tpjJ+6vJ+LspOTtaqgFoDt3bsp,*,*,P1ooZJPDyfMMTXqlz5hACdwD8F4=" PassStub="TE*0ViGNuB2T6I" RCTICKETENCRYPTED="1" DtStart="1379526042" DtLength="360" L="0"/></UPLOADINFO>
Thank you for reading
Something simple like this would get you the data you need, to some extent:
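using System.Xml.Linq;

var reader = XDocument.Load("path to XML file");
// Read the RCTICKET attribute as a string (.Value), not as an XAttribute
var data = reader.Element("UPLOADINFO")
                 .Element("UPLOADDATA")
                 .Attribute("RCTICKET")
                 .Value;
var values = data.Split(',');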
You will need to work with that RCTICKET string to extract the value you need. It would be a bit safer to work with commas, colons, and whatnot in the context of a single attribute instead of the whole file. Caveat: When I generated an incident file, I ended up with multiple IP addresses in the RCTICKET field. I have multiple VPNs and ethernet adapters in my machine. You will have to pick the right one.
You will also want to handle failures if the XML isn't in the format we expect, or if the file is otherwise inaccessible. You can do this with a try/catch and/or checking for nulls.
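To make the remaining work concrete, here is a minimal sketch that goes from the XML text to a port number, with the null checks and try/catch mentioned above. IncidentFile and ExtractPort are hypothetical names. The sketch assumes each address sits in its own comma-separated field in host:port form, as in the sample file, and it simply returns the first such port, so it does not solve the "pick the right adapter" caveat.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class IncidentFile
{
    // Hypothetical helper: returns the first port found in RCTICKET, or null
    // when the XML is malformed or has no host:port field.
    public static int? ExtractPort(string xml)
    {
        XDocument doc;
        try
        {
            doc = XDocument.Parse(xml);
        }
        catch (XmlException)
        {
            return null; // not well-formed XML
        }

        // ?. short-circuits to null if any element or attribute is missing.
        var ticket = doc.Element("UPLOADINFO")
                        ?.Element("UPLOADDATA")
                        ?.Attribute("RCTICKET")
                        ?.Value;
        if (ticket == null)
        {
            return null;
        }

        foreach (string field in ticket.Split(','))
        {
            int colon = field.LastIndexOf(':');
            // Everything after the colon must parse as a number, whether 4 or 5 digits.
            if (colon > 0 && int.TryParse(field.Substring(colon + 1), out int port))
            {
                return port;
            }
        }
        return null;
    }
}
```

Because the port is taken as "everything after the colon up to the next comma", there is no need to test whether the fifth character is a digit or a comma. To read from disk instead, XDocument.Load(path) could replace XDocument.Parse, with file access errors handled as well.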
Is there a way to create a regex that will ensure that five out of eight characters are present in order in a given character range (like 20 chars for example)?
I am dealing with horrible OCR/scanning, and I can stand the false positives.
Is there a way to do this?
Update: I want to match for example "mshpeln" as misspelling. I do not want to do OCR. The OCR job has been done, but it has been done poorly (i.e. it originally said misspelling, but the OCR'd copy reads "mshpeln"). I do not know what the text that I will have to match against will be (i.e. I do not know that it is "mshpeln"; it could be "mispel" or any number of other combinations).
I am not trying to use this as a spell checker, but merely find the end of a capture group. As an aside, I am currently having trouble getting the all.css file, so commenting is impossible temporarily.
I think what you need is not a regex, but a database of all valid words and creative use of functions like soundex() and/or levenshtein().
You can do this: create a table with all valid words (a dictionary), populate it with columns like word and snd (computed as soundex(word)), and create indexes for both the word and snd columns.
For example, for the word mispeling you would fill snd as M214. If you use SQLite, soundex() is available when SQLite is built with the SQLITE_SOUNDEX compile-time option.
Now, when you get a new bad word, compute soundex() for it and look it up in your indexed table. For example, for the word mshpeln it would be soundex('mshpeln') = M214. There you go, this way you can get back the correct word.
But this would not look anything like regex - sorry.
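For anyone who would rather do the lookup in application code, the following is a small C# sketch of the same idea: a plain Soundex function plus an in-memory dictionary standing in for the indexed table. SoundexLookup, Soundex and BuildIndex are hypothetical names, and Soundex variants differ slightly in how they treat h and w, so codes may not match every database implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class SoundexLookup
{
    // Digit for each letter a..z; '0' marks vowels and other uncoded letters.
    private const string Codes = "01230120022455012623010202";

    public static string Soundex(string word)
    {
        var sb = new StringBuilder();
        char prev = '0';
        foreach (char raw in word)
        {
            char c = char.ToLowerInvariant(raw);
            if (c < 'a' || c > 'z') continue;          // ignore non-letters
            char code = Codes[c - 'a'];
            if (sb.Length == 0)
            {
                sb.Append(char.ToUpperInvariant(c));   // first letter is kept as-is
                prev = code;
                continue;
            }
            if (c == 'h' || c == 'w') continue;        // h/w do not separate equal codes
            if (code != '0' && code != prev) sb.Append(code);
            prev = code;
            if (sb.Length == 4) break;
        }
        return sb.ToString().PadRight(4, '0');
    }

    // In-memory stand-in for the indexed table: soundex code -> valid words.
    public static Dictionary<string, List<string>> BuildIndex(IEnumerable<string> words)
    {
        var index = new Dictionary<string, List<string>>();
        foreach (string w in words)
        {
            string key = Soundex(w);
            if (!index.TryGetValue(key, out var list))
            {
                list = new List<string>();
                index[key] = list;
            }
            list.Add(w);
        }
        return index;
    }
}
```

Both Soundex("mispeling") and Soundex("mshpeln") return "M214", so looking up the code of the bad word in the index returns every dictionary word that shares it. Several valid words can share one code, so the result is a list of candidates rather than a single answer.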
To be honest, I think that a project like this would be better for an actual human to do, not a computer. If the project is too large for 1 or 2 people to do easily, you might want to look into something like Amazon's Mechanical Turk, where you can outsource the work for pennies per solution.
This can't be done with a regex, but it can be done with a custom algorithm.
For example, to find words that are like 'misspelling' in your body of text:
1) Preprocess. Create a Set (in the mathematical sense, collection of guaranteed to be unique elements) with all of the unique letters that are in misspelling - {e, i, g, l, m, n, p, s}
2) Split the body of text into words.
3) For each word, create a Set with all of its unique letters. Then, perform the operation of set intersection on this set and the set of the word you are matching against - this will get you letters that are contained by both sets. If this set has 5 or more characters left in it, you have a possible match here.
If the OCR can add in erroneous spaces, then consider two words at a time instead of single words, and so on, depending on what your requirements are.
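The three steps above might look like this in C#, using HashSet for the sets. LetterSetMatcher and FindCandidates are hypothetical names, and the threshold of five shared letters is taken from the question. Note that comparing sets ignores letter order and repeats, so this is looser than the "in order" requirement and will produce extra false positives.

```csharp
using System;
using System.Collections.Generic;

public static class LetterSetMatcher
{
    // Hypothetical helper: returns the words in 'text' that share at least
    // 'minShared' distinct letters with 'target'.
    public static List<string> FindCandidates(string text, string target, int minShared = 5)
    {
        // 1) Preprocess: the unique letters of the word being searched for.
        var targetLetters = new HashSet<char>(target.ToLowerInvariant());

        var candidates = new List<string>();

        // 2) Split the body of text into words on whitespace.
        char[] separators = { ' ', '\t', '\r', '\n' };
        foreach (string word in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            // 3) Intersect the word's letter set with the target's letter set.
            var shared = new HashSet<char>(word.ToLowerInvariant());
            shared.IntersectWith(targetLetters);
            if (shared.Count >= minShared)
            {
                candidates.Add(word);
            }
        }
        return candidates;
    }
}
```

For the text "the mshpeln was bad" and the target "misspelling", only "mshpeln" is returned: it shares the six letters m, s, p, e, l and n with the target.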
I have no solution for this problem; in fact, what follows is exactly the opposite.
Correcting OCR errors is not programmatically possible for two reasons:
You cannot quantify the error that was made by the OCR algorithm, as it can range anywhere between 0 and 100%
To apply a correction, you need to know what the maximum error could be in order to set an acceptable level.
Let nello world be the first guess of "hello world", which is quite similar. Then, with another font that is written in "painful" yellow or something, a second guess is noiio verio for the same expression. How should a computer know that this word would have been similar if it was better recognized?
Otherwise, given a predetermined error, mvp's solution seems to be the best in my opinion.
UPDATE:
After digging a little, I found a reference that may be relevant: String similarity measures
Having used SQL Server Bulk insert of CSV file with inconsistent quotes (CsvToOtherDelimiter option) as my basis, I discovered a few weirdnesses with the RemoveCSVQuotes part [it chopped the last char from quoted strings that contained a comma!]. So.. rewrote that bit (maybe a mistake?)
One wrinkle is that the client has asked 'what about data like this?'
""17.5179C,""
I assume if I wanted to keep using the CsvToOtherDelimiter solution, I'd have to amend the RegExp...but it's WAY beyond me... what's the best approach?
To clarify: we are using C# to pre-process the file into a pipe-delimited format prior to running a bulk insert using a format file. Speed is pretty vital.
The accepted answer from your link starts with:
You are going to need to preprocess the file, period.
Why not transform your csv to xml? Then you would be able to verify your data against an xsd before storing into a database.
To convert a CSV string into a list of elements, you could write a program that keeps track of state (in quotes or out of quotes) as it processes the string one character at a time, and emits the elements it finds. The rules for quoting in CSV are weird, so you'll want to make sure you have plenty of test data.
The state machine could go like this:
1. scan until quote (go to 2) or comma (go to 3)
2. if the next character is a quote, add only one of the two quotes to the field and return to 1. Otherwise, go to 4 (or report an error if the quote isn't the first character in the field).
3. emit the field, go to 1
4. scan until quote (go to 5)
5. if the next character is a quote, add only one of the two quotes to the field and return to 4. Otherwise, emit the field, scan for a comma, and go to 1.
This should correctly scan stuff like:
hello, world, 123, 456
"hello world", 123, 456
"He said ""Hello, world!""", "and I said hi"
""17.5179C,"" (correctly reports an error, since there should be a
separator between the first quoted string "" and the second field
17.5179C).
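A compact C# sketch in the spirit of that state machine is shown below. It is not a line-for-line transcription of the five states: it treats a quote at the start of a field as opening a quoted field, a doubled quote inside it as one literal quote, and anything other than a comma after the closing quote as an error. CsvLine and Split are hypothetical names, and trimming spaces around fields is an assumption made so that the sample lines above scan cleanly.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CsvLine
{
    // Hypothetical helper: splits one CSV line into fields, honouring quotes.
    // Throws FormatException on malformed quoting.
    public static List<string> Split(string line)
    {
        var fields = new List<string>();
        var field = new StringBuilder();
        int i = 0;
        while (true)
        {
            while (i < line.Length && line[i] == ' ') i++;   // skip leading spaces
            field.Clear();
            if (i < line.Length && line[i] == '"')
            {
                i++;                                         // consume opening quote
                while (true)
                {
                    if (i >= line.Length)
                        throw new FormatException("Unterminated quoted field.");
                    if (line[i] != '"') { field.Append(line[i++]); continue; }
                    if (i + 1 < line.Length && line[i + 1] == '"')
                    {
                        field.Append('"');                   // "" is one literal quote
                        i += 2;
                        continue;
                    }
                    i++;                                     // closing quote
                    break;
                }
                while (i < line.Length && line[i] == ' ') i++;
                if (i < line.Length && line[i] != ',')
                    throw new FormatException("Expected a separator after a quoted field.");
                fields.Add(field.ToString());
            }
            else
            {
                while (i < line.Length && line[i] != ',')
                {
                    if (line[i] == '"')
                        throw new FormatException("Quote inside an unquoted field.");
                    field.Append(line[i++]);
                }
                fields.Add(field.ToString().TrimEnd());
            }
            if (i >= line.Length) return fields;
            i++;                                             // consume the comma
        }
    }
}
```

With this sketch the first three sample lines split into four, three and two fields respectively, and ""17.5179C,"" throws a FormatException because the empty quoted string is followed by 1 rather than by a separator. Since each line is handled independently, the fields could then be joined with a pipe character for the bulk insert.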
Another way would be to find some existing library that does it well. Surely, CSV is common enough that such a thing must exist?
edit:
You mention that speed is vital, so I wanted to point out that (so long as the quoted strings aren't allowed to include line returns...) each line may be processed independently in parallel.
I ended up using the csv parser that I didn't know we already had (it comes as part of our code generation tool), and noting that ""17.5179C,"" is not valid and will cause errors.