Extract text from large file using RegEx? - c#

I have a big file with a lot of data in it, but I only want to grab certain parts of it. Let me explain which parts I'm interested in:
(imagine "x" as an IP address)
(imagine "?" as an alphanumeric string of any length)
(imagine "MD5" as an MD5 hash)
(The text file looks like this, though not literally:)
'xxx.xxx.xxx.xxx'
xxxxxxxxxx
'?'
'?'
'MD5'
My question is: how can I find the line
'xxx.xxx.xxx.xxx'
anywhere in the file and then automatically write both '?' entries and the 'MD5' entry to another file, for each IP address found?
In a nutshell, the program should start at the beginning of the file and read the contents. When it hits an IP address (the regex '\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b' works fine for me), it should skip the next line, then copy the following lines to another file until it hits the MD5 entry (the regex '[a-f0-9]{32}' works fine for me). It should then continue from that point, looking for the next IP address, and keep going until it reaches the end of the file.
I'm trying to do this myself, but I don't know where to start or which methods to use.

You can use the following pattern to match the content you are looking for and copy it to the desired file:
('\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b')(\s*.+\s*)([\s\S]*?)('\b[a-f0-9]{32}\b')
Then extract $1$3$4: group 1 is the IP address, group 2 is the line to skip, group 3 holds the two '?' entries, and group 4 is the MD5 hash.
Code (Java; myString holds the file contents):
// Requires java.util.regex.Pattern and java.util.regex.Matcher.
String regex = "('\\b\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\b')(\\s*.+\\s*)([\\s\\S]*?)('\\b[a-f0-9]{32}\\b')";
Pattern pattern = Pattern.compile(regex);
Matcher m = pattern.matcher(myString);
while (m.find()) {
    System.out.println("IP: " + m.group(1));
    // Group 2 is the skipped line, so it is not printed.
    //System.out.println("Skipped: " + m.group(2));
    System.out.println("Entries: " + m.group(3));
    System.out.println("MD5: " + m.group(4));
}

Given the fact that your file is machine generated and that the overall pattern is pretty specific, I don't think it's necessary to be overly specific with the IP address.
Matching it as "a bunch of digits and dots in single quotes" is probably enough, in the context of the rest of the pattern (*).
Here is an expression that matches your entire requirement into named groups:
^'(?<IP>[\d.]+)'\s+
^(?<ID>\w*)\s+
^'(?<line1>\w*)'\s+
^'(?<line2>\w*)'\s+
^'(?<MD5>[A-Fa-f0-9]{32})'
Use it with the Multiline and IgnorePatternWhitespace regex options (the latter means you can keep the regex layout for better readability).
(*) Besides, regex patterns for IP addresses are literally all over the place, in countless examples. Of course you can use something more sophisticated than '[\d.]+' if you think it's necessary.
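
A minimal C# sketch of how this pattern might be applied, assuming the whole file fits in memory as one string. The class name RecordExtractor and the comma-separated output format are illustrative choices, not part of the answer above.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RecordExtractor
{
    // The pattern from the answer, kept in its multi-line layout.
    // IgnorePatternWhitespace makes the regex engine skip the layout whitespace;
    // Multiline lets ^ match at the start of every line.
    private static readonly Regex RecordPattern = new Regex(@"
        ^'(?<IP>[\d.]+)'\s+
        ^(?<ID>\w*)\s+
        ^'(?<line1>\w*)'\s+
        ^'(?<line2>\w*)'\s+
        ^'(?<MD5>[A-Fa-f0-9]{32})'",
        RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace);

    // Returns one "line1,line2,md5" string per record found in the text.
    public static List<string> Extract(string text)
    {
        var results = new List<string>();
        foreach (Match m in RecordPattern.Matches(text))
        {
            results.Add(m.Groups["line1"].Value + ","
                + m.Groups["line2"].Value + ","
                + m.Groups["MD5"].Value);
        }
        return results;
    }
}
```

It could be called as File.WriteAllLines("output.txt", RecordExtractor.Extract(File.ReadAllText("input.txt"))), where both file names are placeholders.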

I tried this out in Java as follows.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestRegex
{
    /**
     * @param args command-line arguments (unused)
     */
    public static void main(String[] args)
    {
        String input = "assasasa 123.234.223.223 333 aad sddsf 343sdd sds23343 ssdfs33344 MD5=aas jjsjjdjd 143.234.223.223 333 aad sddsf 343sdd sds23343 ssdfs33344 MD5=asas";
        String regexPattern = "(\\b[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\b).*?([A-Z a-z]+[0-9]+=.*?\\s)";
        Pattern pattern = Pattern.compile(regexPattern);
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            // Print the position of each match, then the whole match and its two groups.
            System.out.println("start(): " + m.start());
            System.out.println("end(): " + m.end());
            System.out.println("match: " + m.group());
            System.out.println("group(1): " + m.group(1));
            System.out.println("group(2): " + m.group(2));
        }
    }
}
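
The answers above match against one in-memory string. For a large file, the steps described in the question (find an IP address, skip one line, copy lines until the MD5 hash) could also be sketched in C# as a line-by-line state machine. This is only a sketch: the class name StreamingExtractor is hypothetical, and it assumes each entry sits on its own line as in the sample.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class StreamingExtractor
{
    // The two patterns quoted in the question.
    private static readonly Regex Ip = new Regex(@"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b");
    private static readonly Regex Md5 = new Regex(@"[a-f0-9]{32}");

    // Yields the lines between an IP line (plus one skipped line) and the MD5 line, inclusive.
    public static IEnumerable<string> Extract(IEnumerable<string> lines)
    {
        int state = 0; // 0 = looking for an IP, 1 = skipping the line after it, 2 = copying
        foreach (string line in lines)
        {
            if (state == 0)
            {
                if (Ip.IsMatch(line)) state = 1;
            }
            else if (state == 1)
            {
                state = 2; // this is the line to skip
            }
            else
            {
                yield return line;
                if (Md5.IsMatch(line)) state = 0; // record finished, look for the next IP
            }
        }
    }
}
```

Because File.ReadLines reads lazily, File.WriteAllLines("output.txt", StreamingExtractor.Extract(File.ReadLines("input.txt"))) would process the file without loading it all at once (both file names are placeholders).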

Related

C# Adding Whitespace around a specific character for spacing in file names

I'm building a program which processes documents based on their file path and file name.
My current solution is based on file names containing 3 strings each separated by a space, dash and another space so that a valid name would be: "STRING1 - STRING2 - STRING3.pdf".
My program reads these values by using IndexOf().
string1.Substring(fileName.IndexOf("-") - 1)
string3.Substring(fileName.LastIndexOf("-") + 2)
The problem is that this breaks when the file names don't contain spaces around the dashes. So I opted to use Regex instead, but how would I add a condition so that it doesn't add spaces to a name which already contains them?
Example:
fileName[1] = "Test123 - Dog - Page 1.pdf";
fileName[2] = "Test123-Dog-Page1.pdf";
fileName[1] = Regex.Replace(fileName[1], "-", " - ");
fileName[2] = Regex.Replace(fileName[2], "-", " - ");
Output:
fileName[1] = Test123  -  Dog  -  Page 1.pdf
fileName[2] = Test123 - Dog - Page1.pdf
fileName[1] was originally valid, now it's invalid.
fileName[2] was originally invalid, now it's valid.
I need both to be valid, via an if condition.
P.S. Apologies if anything is unclear, I'm new to posting on Stack Overflow.

You don't need regex if pure string methods are more readable for you:
// Requires: using System.Linq;
string FixFileName(string fn)
{
    // Split the name (without extension) on dashes, trim each part, and rejoin with " - ".
    string fnwe = System.IO.Path.GetFileNameWithoutExtension(fn);
    return string.Join(" - ", fnwe.Split('-').Select(token => token.Trim()))
        + System.IO.Path.GetExtension(fn);
}
Demo: https://dotnetfiddle.net/alv6sB
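
If a regex is still preferred, one possible approach is to let the pattern swallow any whitespace already surrounding the dash, so that names with and without spaces end up the same. A minimal sketch; the class and method names are made up for illustration.

```csharp
using System.Text.RegularExpressions;

public static class FileNameFixer
{
    // Replaces each dash, together with any whitespace around it, by " - ".
    // Names that already use " - " come out unchanged.
    public static string NormalizeDashes(string fileName)
    {
        return Regex.Replace(fileName, @"\s*-\s*", " - ");
    }
}
```

Like the string-based version, this treats every dash as a separator, so it assumes the three parts never contain a dash themselves.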

Make a new file whose name is a directory path

I'm creating a csv file with a bunch of data. This file is going to be pushed up to another location and its name is going to be used to put it in the directory it belongs in. I need to create the filename to mimic a directory, without actually using that directory.
I'm using the below, basically "outputDirectory" is where the file should live, everything after it needs to be part of the filename.
String fileName = outputDirectory + DateTime.Now.ToString("yyyy-mm-hh") + "//" + app + "//" + client +"//" + site +"//" + unit + ".csv";
using (StreamWriter sw = new StreamWriter(fileName, false))
{
    foreach (AFValue AFval in AFvals)
    {
        string tagname = AFval.PIPoint.Name;
        string timestamp = AFval.Timestamp.ToString();
        string value = AFval.Value.ToString();
        var newLine = string.Format("{0},{1},{2}", tagname, timestamp, value);
        sw.Write(newLine);
        sw.Write(Environment.NewLine);
    }
}
Right now this code throws an exception:
'Could not find a part of the path 'C:\Users\user\Desktop\Output\2019-53-01\app\client\site\Unit.csv'.'
I need it to create a file in 'C:\Users\user\Desktop\Output\' called
'2019-53-01\app\client\site\Unit.csv'.
Any ideas?

You cannot use a slash (\ or /) in the file name.
Here is an excerpt from Naming Files, Paths, and Namespaces
Use any character in the current code page for a name, including Unicode characters and characters in the extended character set (128–255), except for the following:
The following reserved characters:
< (less than)
> (greater than)
: (colon)
" (double quote)
/ (forward slash)
\ (backslash)
| (vertical bar or pipe)
? (question mark)
* (asterisk)
Integer value zero, sometimes referred to as the ASCII NUL character.
Characters whose integer representations are in the range from 1 through 31, except for alternate data streams where these characters are allowed. For more information about file streams, see File Streams.
Any other character that the target file system does not allow.
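
Since the separators cannot appear in the name itself, one possible workaround is to substitute a permitted character and have the receiving side translate it back. The sketch below assumes the chosen placeholder (for example '~') never occurs in the real path segments; the class name, method names and placeholder are all hypothetical. Separately, note that in .NET custom date format strings "mm" means minutes and "hh" means hours, which is why the path in the error message contains 2019-53-01; "yyyy-MM-dd" is probably what was intended.

```csharp
public static class FlatFileName
{
    // Turns a path-like name into a single legal file name by replacing
    // both kinds of slash with the placeholder character.
    public static string Encode(string relativePath, char placeholder)
    {
        return relativePath.Replace('\\', placeholder).Replace('/', placeholder);
    }

    // Reverses Encode on the receiving side, using whichever separator is wanted there.
    public static string Decode(string flatName, char placeholder, char separator)
    {
        return flatName.Replace(placeholder, separator);
    }
}
```

The file would then be created as Path.Combine(outputDirectory, FlatFileName.Encode(name, '~')), which needs no directories beyond outputDirectory.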

Get data from file and split into an array

I have information formatted on a webpage which looks like the following:
Key=submission_id, Value=300348811884547965
Key=formID, Value=50514289063151
Key=ip, Value=xxxxx
Key=editimage, Value=Yes
Key=openimage5, Value=Yes
Key=copyimage, Value=Yes
How would I go about getting the value from each line? I was thinking of doing some sort of next() while taking everything after the second equals sign on each line, but I am unsure how to do it in C#. I am sure there is a better solution than what I have in mind. Please let me know your thoughts.

A regex works nicely for parsing data structured this way.
// Requires: using System; using System.Text.RegularExpressions;
Regex splitter = new Regex(@"Key=(\w+), Value=(\w+)");
string path = "TextFile1.txt";
string[] lines = System.IO.File.ReadAllLines(path);
foreach (string s in lines)
{
    // Group 1 is the key, group 2 is the value.
    Match match = splitter.Match(s);
    if (match.Success)
    {
        Console.WriteLine("The Key is " + match.Groups[1].Value + " and the value is " + match.Groups[2].Value);
    }
}
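
As a variation, the pairs could be collected into a dictionary instead of being printed. The sketch below also loosens the value pattern so that values containing dots (such as a real IP address) are accepted; the class name KeyValueParser and the choice to let later keys overwrite earlier ones are assumptions, not part of the answer above.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class KeyValueParser
{
    // Key: everything up to the first comma. Value: the rest of the line.
    private static readonly Regex LinePattern =
        new Regex(@"^Key=(?<key>[^,]+), Value=(?<value>.*)$");

    // Builds a key -> value map, silently skipping lines that do not match.
    public static Dictionary<string, string> Parse(IEnumerable<string> lines)
    {
        var result = new Dictionary<string, string>();
        foreach (string line in lines)
        {
            Match m = LinePattern.Match(line.Trim());
            if (m.Success)
            {
                result[m.Groups["key"].Value] = m.Groups["value"].Value;
            }
        }
        return result;
    }
}
```

A lookup such as KeyValueParser.Parse(lines)["formID"] then returns the value for that key.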

Match Multiline & IgnoreSome

I'm trying to extract some information from a JCL source using regex in C#
Basically, this is a string I can have:
//JOBNAME0 JOB (BLABLABLA),'SOME TEXT',MSGCLASS=YES,ILIKE=POTATOES, GRMBL
// IALSOLIKE=TOMATOES, ANOTHER GARBAGE
// FINALLY=BYE
//OTHER STUFF
So I need to extract the job name JOBNAME0, the info (BLABLABLA), the description 'SOME TEXT' and the other parameters MSGCLASS=YES ILIKE=POTATOES IALSOLIKE=TOMATOES FINALLY=BYE.
I must ignore everything after the space, like GRMBL or ANOTHER GARBAGE.
I must continue to the next line if the last valid character was a comma, and stop if it was not.
So far, I have managed to get the job name, the info and the description, which was pretty easy. For the other parameters, I'm able to get them all and split them, but I don't know how to get rid of the garbage.
Here is my code:
var regex = "//([^\\s]*) JOB (\\([^)]*\\))?,?(\\'[^']*\\')?,?([^,]*[,|\\s|$])*";
Match match2 = Regex.Match(test5, regex,RegexOptions.Singleline);
string CarteJob2 = match2.Groups[0].Value;
string JobName2 = match2.Groups[1].Value;
string JobInfo2 = match2.Groups[2].Value;
string JobDesc2 = match2.Groups[3].Value;
IEnumerable<string> parms = match2.Groups[4].Captures.OfType<Capture>().Select(x => x.Value);
string JobParms2 = String.Join("|", parms);
Console.WriteLine(CarteJob2 + "|");
Console.WriteLine(JobName2 + "|");
Console.WriteLine(JobInfo2 + "|");
Console.WriteLine(JobDesc2 + "|");
Console.WriteLine(JobParms2 + "|");
The output I get is this one:
//JOBNAME0 JOB (BLABLABLA),'SOME TEXT',MSGCLASS=YES,ILIKE=POTATOES, GRMBL
// IALSOLIKE=TOMATOES, ANOTHER GARBAGE
// FINALLY=BYE
//OTHER |
JOBNAME0|
(BLABLABLA)|
'SOME TEXT'|
MSGCLASS=YES,|ILIKE=POTATOES,| GRMBL
// IALSOLIKE=TOMATOES,| ANOTHER GARBAGE
// FINALLY=BYE
//OTHER |
The output I would like to see is:
//JOBNAME0 JOB (BLABLABLA),'SOME TEXT',MSGCLASS=YES,ILIKE=POTATOES, GRMBL
// IALSOLIKE=TOMATOES, ANOTHER GARBAGE
// FINALLY=BYE|
JOBNAME0|
(BLABLABLA)|
'SOME TEXT'|
MSGCLASS=YES|ILIKE=POTATOES|IALSOLIKE=TOMATOES|FINALLY=BYE|
Is there a way to get what I want ?

I think I'd try and do this with two Regex expressions.
The first one to get all the starting information from the beginning of the string - job name, info, description.
The second one to get all the parameters, which all seem to have a simple pattern of <param name>=<param value>.
The first Regex might look like this:
^//(?<job>[\d\w]+)[ ]+JOB[ ]+\((?<info>[\d\w]+)\),'(?<description>[\d\w ]+)'
I don't know whether the rules permit whitespace in the job name, info or description, so adjust as needed. Also, by using the ^ character I'm assuming the job card is at the start of the string. Finally, this regex defines named groups, so getting the values should be easier in C#.
The second Regex might be something like this:
(?<param>[\w\d]+)=(?<value>[\w\d]+)
Again, grouping is added to help get the parameter names and values.
Hope this helps.
EDIT:
A small tip: you can put the @ sign before a string literal in C# (a verbatim string) so that backslashes do not need to be doubled, which makes such regex patterns easier to write. For example:
Regex reg = new Regex(@"(?<param>[\w\d]+)=(?<value>[\w\d]+)");
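
Putting the two expressions together might look like the following sketch. The class name JobCardParser and the flat list it returns are illustrative. It assumes the input string contains only the JOB card and its continuation lines, because the parameter regex would otherwise keep matching NAME=VALUE pairs in later statements.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class JobCardParser
{
    // First regex from the answer: job name, info and description.
    private static readonly Regex Header = new Regex(
        @"^//(?<job>[\d\w]+)[ ]+JOB[ ]+\((?<info>[\d\w]+)\),'(?<description>[\d\w ]+)'");

    // Second regex from the answer: one NAME=VALUE parameter.
    private static readonly Regex Param = new Regex(@"(?<param>[\w\d]+)=(?<value>[\w\d]+)");

    // Returns job name, info, description, then each "NAME=VALUE" pair.
    // Returns an empty list if the header does not match.
    public static List<string> Parse(string jobCard)
    {
        var fields = new List<string>();
        Match header = Header.Match(jobCard);
        if (!header.Success) return fields;

        fields.Add(header.Groups["job"].Value);
        fields.Add(header.Groups["info"].Value);
        fields.Add(header.Groups["description"].Value);

        // Search for parameters only after the end of the header match.
        foreach (Match p in Param.Matches(jobCard, header.Index + header.Length))
        {
            fields.Add(p.Groups["param"].Value + "=" + p.Groups["value"].Value);
        }
        return fields;
    }
}
```

Words without an equals sign, such as GRMBL or ANOTHER GARBAGE, are simply never matched by the parameter regex, which is how the garbage drops out.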

Extract sub-string between two certain words right to left side

Example String
This is an important example about regex for my work.
I can extract "important example about regex" with the pattern (?<=an).*?(?=for).
But I would like to extract the string from right to left. In this example, the first position must be (for) and the second position must be (an).
In other words, the extraction should work backwards.
I tried to do this in the else if branch of the code below, but it doesn't work.
public string FnExtractString(string _QsString, string _QsStart, string _QsEnd, string _QsWay = "LR")
{
    if (_QsWay == "LR")
        return Regex.Match(_QsString, @"(?<=" + _QsStart + ").*?(?=" + _QsEnd + ")").Value;
    else if (_QsWay == "RL")
        return Regex.Match(_QsString, @"(?=" + _QsStart + ").*?(<=" + _QsEnd + ")").Value;
    else
        return _QsString;
}
Thanks in advance.
EDIT
My real example as below
#Var|First String|ID_303#Var|Second String|ID_304#Var|Third String|DI_t55
When I pass two strings to my method (for example "|ID_304" and "#Var|"), I would like to extract "Second String". This example is only a small piece of my real string, and the string can change.

No need for lookahead or lookbehind! You could just use:
\san\s(.*)\sfor\s
The \s demands whitespace, so you don't match the "an" inside "important". Capture group 1 then holds the text between the two words.

One potential problem in your current solution is that the strings passed in may contain special characters, which need to be escaped with Regex.Escape before concatenation:
return Regex.Match(_QsString, @"(?<=" + Regex.Escape(_QsStart) + ").*?(?=" + Regex.Escape(_QsEnd) + ")").Value;
As for your other requirement of matching right to left (RL), I don't understand what you are asking for.
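
One possible reading of the right-to-left requirement is: find the first token, then walk left to the nearest occurrence of the second token, and return what lies between them. Under that assumption it can be sketched with plain string methods rather than a regex, which also avoids any escaping. The class and parameter names are hypothetical.

```csharp
using System;

public static class BackwardExtractor
{
    // Finds rightToken, then the nearest leftToken before it, and returns
    // the text between the two. Returns "" if either token is missing.
    public static string ExtractRightToLeft(string text, string rightToken, string leftToken)
    {
        int right = text.IndexOf(rightToken, StringComparison.Ordinal);
        if (right < 0) return string.Empty;

        // Only look at the part of the string that precedes rightToken.
        int left = text.Substring(0, right).LastIndexOf(leftToken, StringComparison.Ordinal);
        if (left < 0) return string.Empty;

        int from = left + leftToken.Length;
        return text.Substring(from, right - from);
    }
}
```

With the string from the question's edit, passing "|ID_304" and "#Var|" picks the "#Var|" closest to the left of "|ID_304", so the text of the first entry is not included in the result.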
