Parsing log file, ambiguous delimiter

I have to parse a log file and I'm not sure how best to extract the different pieces of each line. The problem is that the original developer used ':' to delimit tokens, which was a poor choice since each line contains timestamps that themselves contain ':'!
A sample line looks something like this:
transaction_date_time:[systemid]:sending_system:receiving_system:data_length:data:[ws_name]
2019-05-08 15:03:13:494|2019-05-08 15:03:13:398:[192.168.1.2]:ABC:DEF:67:cd71f7d9a546ec2b32b,AACN90012001000012,OPNG:[WebService.SomeName.WebServiceModule::WebServiceName]
I have no problem reading the log file and accessing each line, but I'm not sure how to get the pieces parsed.

Since the delimiter character is also part of the content, the input string can't simply be split; a regular expression can be used instead.
It's simple and probably fast enough, even with the default settings.
The different parts of the input string can be separated with these capturing groups:
string pattern = @"^(.*?)\|(.*?):\[(.*?)\]:(.*?):(.*?):(\d+):(.*?):\[(.*)\]$";
This gives you 8 capturing groups, plus Groups[0], which contains the whole match.
Using the Regex class, simply pass a string to parse (named line, here) and the regex (named pattern) to the Match() method, using default settings:
var result = Regex.Match(line, pattern);
Each group's Value property returns the text captured by that group. For example, the two dates (HH is the 24-hour clock and fff is the milliseconds):
var dateEnd = DateTime.ParseExact(result.Groups[1].Value, "yyyy-MM-dd HH:mm:ss:fff", CultureInfo.InvariantCulture);
var dateStart = DateTime.ParseExact(result.Groups[2].Value, "yyyy-MM-dd HH:mm:ss:fff", CultureInfo.InvariantCulture);
The IpAddress is extracted with: \[(.*?)\].
You could give a name to this grouping, so it's more clear what the value refers to. Simply add a string, prefixed with ? and enclosed in <> or single quotes ' to name the grouping:
...\[(?<IpAddress>.*?)\]...
Note, however, that naming a group will modify the Regex.Groups indexing: the un-named groups will be inserted first, the named groups after. So, naming only the IpAddress group will cause it to become the last item, Groups[8]. Of course you can name all the groups and the indexing will be preserved.
var hostAddress = IPAddress.Parse(result.Groups["IpAddress"].Value);
This pattern should allow a mid-range machine to parse 130,000~150,000 strings per second.
You'll have to test it to find the best pattern. For example, the first match (corresponding to the first date), (.*?)\|, is much faster when non-greedy (using the lazy quantifier *?). The opposite is true for the last match, \[(.*)\]. The pattern used by jdweng is even faster than the one used here.
See Regex101 for a detailed description on the use and meaning of each token.
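Putting the pieces above together, here is a minimal sketch that names every group and parses the sample line from the question. The group names (end, start, ip and so on) are illustrative choices, not part of the original log format, and the code uses top-level statements, so it assumes C# 9 or later.

```csharp
using System;
using System.Globalization;
using System.Net;
using System.Text.RegularExpressions;

// Sample line taken from the question.
string line = "2019-05-08 15:03:13:494|2019-05-08 15:03:13:398:[192.168.1.2]:ABC:DEF:67:"
            + "cd71f7d9a546ec2b32b,AACN90012001000012,OPNG:[WebService.SomeName.WebServiceModule::WebServiceName]";

// Same structure as the pattern above, with every group named (names are assumptions).
string pattern = @"^(?<end>.*?)\|(?<start>.*?):\[(?<ip>.*?)\]:(?<sender>.*?):(?<receiver>.*?):"
               + @"(?<length>\d+):(?<data>.*?):\[(?<ws>.*)\]$";

// HH = 24-hour clock, fff = milliseconds; the colon before fff is part of the log format.
const string TimeFormat = "yyyy-MM-dd HH:mm:ss:fff";

Match m = Regex.Match(line, pattern);
if (!m.Success)
{
    throw new FormatException("Unexpected log line: " + line);
}

DateTime end = DateTime.ParseExact(m.Groups["end"].Value, TimeFormat, CultureInfo.InvariantCulture);
DateTime start = DateTime.ParseExact(m.Groups["start"].Value, TimeFormat, CultureInfo.InvariantCulture);
IPAddress host = IPAddress.Parse(m.Groups["ip"].Value);
int length = int.Parse(m.Groups["length"].Value, CultureInfo.InvariantCulture);

Console.WriteLine($"{start:O} -> {end:O} from {host}, {length} bytes, service {m.Groups["ws"].Value}");
```

Naming all the groups avoids the index reshuffling described above, at the cost of a longer pattern.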

Using Regex I was able to parse everything. It looks like the data came from Excel, because the fraction of seconds has a colon instead of a period. C# does not like the colon, so I had to replace it with a period. I also parsed from right to left to get around the colon issues.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.IO;
namespace ConsoleApplication3
{
class Program1
{
const string FILENAME = @"c:\temp\test.txt";
static void Main(string[] args)
{
string line = "";
int rowCount = 0;
StreamReader reader = new StreamReader(FILENAME);
string pattern = @"^(?'time'.*):\[(?'systemid'[^\]]+)\]:(?'sending'[^:]+):(?'receiving'[^:]+):(?'length'[^:]+):(?'data'[^:]+):\[(?'ws_name'[^\]]+)\]";
while ((line = reader.ReadLine()) != null)
{
line = line.Trim();
if (line.Length > 0)
{
if (++rowCount != 1) //skip header row
{
Log_Data newRow = new Log_Data();
Log_Data.logData.Add(newRow);
Match match = Regex.Match(line, pattern, RegexOptions.RightToLeft);
newRow.ws_name = match.Groups["ws_name"].Value;
newRow.data = match.Groups["data"].Value;
newRow.length = int.Parse(match.Groups["length"].Value);
newRow.receiving_system = match.Groups["receiving"].Value;
newRow.sending_system = match.Groups["sending"].Value;
newRow.systemid = match.Groups["systemid"].Value;
//end data is first then start date is second
string[] date = match.Groups["time"].Value.Split(new char[] {'|'}).ToArray();
string replacePattern = @"(?'leader'.+):(?'trailer'\d+)";
string stringDate = Regex.Replace(date[1], replacePattern, "${leader}.${trailer}", RegexOptions.RightToLeft);
newRow.startDate = DateTime.Parse(stringDate);
stringDate = Regex.Replace(date[0], replacePattern, "${leader}.${trailer}", RegexOptions.RightToLeft);
newRow.endDate = DateTime.Parse(stringDate );
}
}
}
}
}
public class Log_Data
{
public static List<Log_Data> logData = new List<Log_Data>();
public DateTime startDate { get; set; } //transaction_date_time:[systemid]:sending_system:receiving_system:data_length:data:[ws_name]
public DateTime endDate { get; set; }
public string systemid { get; set; }
public string sending_system { get; set; }
public string receiving_system { get; set; }
public int length { get; set; }
public string data { get; set; }
public string ws_name { get; set; }
}
}

Related

Using enum names in a multiline string to associate each string line with the integer value of the enum. Is there a better way?

My RTF parser needs to process two flavors of rtf files (one file per program execution): rtf files as saved from Word and rtf files as created by a COTS report generator utility. The rtf for each is valid, but different. My parser uses regex patterns to detect, extract, and process the various rtf elements in the two types of rtf files.
I decided to implement the list of rtf regex patterns in two dictionaries, one for the rtf regex patterns needed for a Word rtf file and another for the rtf regex patterns needed for a COTS utility rtf file. At runtime, my parser detects which type of rtf file is being processed (Word rtf includes the rtf element //schemas.microsoft.com/office/word and the COTS rtf does not) and then obtains the needed regex pattern from the appropriate dictionary.
To ease the task of referencing the patterns as I write the code, I implemented an enum where each enum value represents a specific regex pattern. To ease the task of keeping the patterns in sync with their corresponding enum, I implemented the regex patterns as a here-string where each line is a csv composition: {enum name}, {word rtf regex pattern}, {cots rtf regex pattern}. Then, at run time when the patterns are loaded into their dictionaries, I obtain the int value of the enum from the csv and use it to create the dictionary key.
This makes writing the code easier, but I'm not sure it's the best way to implement and reference the rtf expressions. Is there a better way?
Example code:
public enum Rex {FOO, BAR};
string ex = @"FOO, word rtf regex pattern for FOO, cots rtf regex pattern for FOO
BAR, word rtf regex pattern for BAR, cots rtf regex pattern for BAR
";
I load the dictionaries like this:
using (StringReader reader = new StringReader(ex))
{
string line;
while ((line = reader.ReadLine()) != null)
{
string[] splitLine = line.Split(',');
int enumIntValue = (int)(Rex)Enum.Parse(typeof(Rex), splitLine[0].Trim());
ObjWordRtfDict.Add(enumIntValue, line.Split(',')[1].Trim());
ObjRtfDict.Add(enumIntValue, line.Split(',')[2].Trim());
}
}
Then, at runtime, I access ObjWordRtfDict or ObjRtfDict based on the type of rtf file the parser detects.
string regExPattFoo = ObjRegExExpr.GetRegExPattern(ClsRegExExpr.Rex.FOO);
public string GetRegExPattern(Rex patternIndex)
{
string regExPattern = "";
if (isWordRtf)
{
ObjWordRtfDict.TryGetValue((int)patternIndex, out regExPattern);
}
else
{
ObjRtfDict.TryGetValue((int)patternIndex, out regExPattern);
}
return regExPattern;
}
Modified New code based on Asif's recommendations
I kept my enum for pattern names so references to pattern names can be checked by the compiler
Example csv file included as an embedded resource
SECT,^\\pard.*\{\\rtlch.*\\sect\s\}, ^\\pard.*\\sect\s\}
HORZ_LINE2, \{\\pict.*\\pngblip, TBD
Example usage
string sectPattern = ObjRegExExpr.GetRegExPattern(ClsRegExPatterns.Names.SECT);
ClsRegExPatterns class
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Reflection;
using System.Text;
using System.Text.RegularExpressions;
namespace foo
{
public class ClsRegExPatterns
{
readonly bool isWordRtf = false;
List<ClsPattern> objPatternList;
public enum Names { SECT, HORZ_LINE2 };
public class ClsPattern
{
public string Name { get; set; }
public string WordRtfRegex { get; set; }
public string COTSRtfRegex { get; set; }
}
public ClsRegExPatterns(StringBuilder rawRtfTextFromFile)
{
// determine if input file is Word rtf or not Word rtf
if ((Regex.Matches(rawRtfTextFromFile.ToString(), "//schemas.microsoft.com/office/word", RegexOptions.IgnoreCase)).Count == 1)
{
isWordRtf = true;
}
//read patterns from embedded content csv file
string patternsAsCsv = new StreamReader((Assembly.GetExecutingAssembly()).GetManifestResourceStream("eLabBannerLineTool.Packages.patterns.csv")).ReadToEnd();
//create list to hold patterns
objPatternList = new List<ClsPattern>();
//load pattern list
using (StringReader reader = new StringReader(patternsAsCsv))
{
string line;
while ((line = reader.ReadLine()) != null)
{
string[] splitLine = line.Split(',');
ClsPattern objPattern = new ClsPattern
{
Name = splitLine[0].Trim(),
WordRtfRegex = splitLine[1].Trim(),
COTSRtfRegex = splitLine[2].Trim()
};
objPatternList.Add(objPattern);
}
}
}
public string GetRegExPattern(Names patternIndex)
{
string regExPattern = "";
string patternName = patternIndex.ToString();
if (isWordRtf)
{
regExPattern = objPatternList.SingleOrDefault(x => x.Name == patternName)?.WordRtfRegex;
}
else
{
regExPattern = objPatternList.SingleOrDefault(x => x.Name == patternName)?.COTSRtfRegex;
}
return regExPattern;
}
}
}

If I understand your problem statement correctly, I would prefer something like the following.
Create a class called RtfProcessor
public class RtfProcessor
{
public string Name { get; set; }
public string WordRtfRegex { get; set; }
public string COTSRtfRegex { get; set; }
void ProcessFile()
{
throw new NotImplementedException();
}
}
Here Name signifies FOO, BAR, etc. You can maintain a list of these objects and populate it from the csv text like this:
List<RtfProcessor> fileProcessors = new List<RtfProcessor>();
using (StringReader reader = new StringReader(ex))
{
string line;
while ((line = reader.ReadLine()) != null)
{
string[] splitLine = line.Split(',');
RtfProcessor rtfProcessor = new RtfProcessor();
rtfProcessor.Name = splitLine[0].Trim();
rtfProcessor.WordRtfRegex = splitLine[1].Trim();
rtfProcessor.COTSRtfRegex = splitLine[2].Trim();
fileProcessors.Add(rtfProcessor);
}
}
And to retrieve regex pattern for FOO or BAR
// to get the regex pattern for FOO you can use
fileProcessors.SingleOrDefault(x => x.Name == "FOO")?.WordRtfRegex;
hope this helps.
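As a variation on the list lookup above, here is a minimal sketch that loads the same csv text into a dictionary keyed by pattern name, so each lookup does not scan a list. The helper names LoadPatterns and GetPattern are hypothetical, and the code uses top-level statements (C# 9 or later). Where an enum is kept for compile-time checking, its ToString() value could supply the key.

```csharp
using System;
using System.Collections.Generic;

// Csv text in the question's format: name, Word pattern, COTS pattern.
string csv = "FOO, word rtf regex pattern for FOO, cots rtf regex pattern for FOO\n"
           + "BAR, word rtf regex pattern for BAR, cots rtf regex pattern for BAR\n";

var patterns = LoadPatterns(csv);
bool isWordRtf = true; // would come from the file-type detection described in the question
Console.WriteLine(GetPattern(patterns, "FOO", isWordRtf));

// Hypothetical helper: builds name -> (Word pattern, COTS pattern).
Dictionary<string, (string Word, string Cots)> LoadPatterns(string text)
{
    var result = new Dictionary<string, (string Word, string Cots)>(StringComparer.Ordinal);
    foreach (string raw in text.Split('\n'))
    {
        if (raw.Trim().Length == 0) continue; // skip blank lines

        // Split on the first two commas only, so the last field may contain commas.
        // A comma inside the Word pattern would still break this simple format.
        string[] parts = raw.Split(new[] { ',' }, 3);
        if (parts.Length != 3) throw new FormatException("Expected three fields: " + raw);

        result.Add(parts[0].Trim(), (parts[1].Trim(), parts[2].Trim()));
    }
    return result;
}

// Hypothetical helper: throws KeyNotFoundException for an unknown name.
string GetPattern(Dictionary<string, (string Word, string Cots)> map, string name, bool wordRtf)
{
    var entry = map[name];
    return wordRtf ? entry.Word : entry.Cots;
}
```

Because regex patterns often contain commas (for example in quantifiers such as {1,2}), a different field separator may be safer for real pattern files.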

How to append newline before every occurrence of time stamp in string property?

I have a list of records, each with a string property called Actions. Each Actions string contains multiple text entries, each starting with a timestamp, like below:
05/10/2016 15:23:42- UTC--test
05/10/2016 16:07:04- UTC--test
05/10/2016 16:33:54- UTC--test
06/10/2016 08:24:52- UTC--test
What I'd like to do is insert a newline \n character before each timestamp in the string property.
So I looped through each record in the list, then tried to modify each string property by adding a newline to each timestamp. But I'm not sure how to get the timestamp value in the string to perform the replace:
//Not sure how to find the instance of timestamp in the string
foreach (var record in escList)
{
record.Actions = record.Actions.Replace("timestamp_text_string","\n" + "timestamp_text_value");
}
I was thinking of using a regex to match every string matching a timestamp pattern, but not sure if the regex works in this context:
string pattern = @"\[[0-9]:[0-9]{1,2}:[0-9]{1,2}\]"; //timestamp pattern
record.Actions = record.Actions.Replace(pattern,"\n" + pattern);
How can you append a newline before every occurrence of time stamp in string property?
The desired result is that for every entry in the string property, i.e, 05/10/2016 15:23:42- UTC--test there would be a new line added before that portion of the string. Giving the following output:
05/10/2016 15:23:42- UTC--test
05/10/2016 16:07:04- UTC--test
05/10/2016 16:33:54- UTC--test
06/10/2016 08:24:52- UTC--test

Use Split:
List<string> result=new List<string>();
foreach (var record in escList)
{
result.Add(record.Actions.Replace(record.Actions.Split(' ')[1], "\n" + record.Actions.Split(' ')[1]));
}

Not sure if I understood your desired result correctly, but performance-wise you might be interested in using a StringBuilder instead of a List. Here's a sample I made:
class Program
{
static void Main(string[] args)
{
string action1 = "05/10/2016 15:23:42- UTC--test";
string action2 = "05/10/2016 16:07:04- UTC--test";
string action3 = "05/10/2016 16:33:54- UTC--test";
string action4 = "06/10/2016 08:24:52- UTC--test";
List<string> sample_actions = new List<string>() { action1, action2, action3, action4 };
Record rec = new Record();
foreach (string sample_action in sample_actions)
{
rec.Actions.AppendLine(sample_action).AppendLine();
}
}
}
class Record
{
public StringBuilder Actions { get; set; }
public Record()
{
Actions = new StringBuilder();
}
}
Edited to match your needs

Assuming actions has at least one element:
var spacedActions = actions.Take(1).Concat(actions.Skip(1).Select(a => $"\n{a}"));
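For the regex route the question asks about, here is a minimal sketch using Regex.Replace with a zero-width lookahead. It assumes the entries are run together in one string and that every timestamp has the dd/MM/yyyy HH:mm:ss shape shown in the question. The helper name InsertNewlines is hypothetical, and the code uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

// Entries run together in a single string, as assumed above.
string actions = "05/10/2016 15:23:42- UTC--test"
               + "05/10/2016 16:07:04- UTC--test"
               + "06/10/2016 08:24:52- UTC--test";

string spaced = InsertNewlines(actions);
Console.WriteLine(spaced);

// Hypothetical helper: inserts "\n" before every timestamp except one at the very start.
string InsertNewlines(string text)
{
    // (?!^)   : not at the start of the string
    // (?=...) : the position is followed by a timestamp; nothing is consumed,
    //           so the timestamp itself is left untouched.
    return Regex.Replace(text, @"(?!^)(?=\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})", "\n");
}
```

string.Replace, as tried in the question, treats its first argument as literal text rather than a pattern, which is why the regex has to go through Regex.Replace.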

Using split() method without text qualifier

I'm trying to get some field value from a text file using a streamReader.
To read my custom value, I'm using split() method. My separator is a colon ':' and my text format looks like:
Title: Mytitle
Manager: Him
Thema: Free
.....
Main Idea: best idea ever
.....
My problem is, when I try to get the first field, which is title, I use:
string title= text.Split(':')[1];
I get title = MyTitle Manager
instead of just: title= MyTitle.
Any suggestions would be nice.
My text looks like this:
My mail : ........................text............
Manager mail : ..................text.............
Entity :.......................text................
Project Title :...............text.................
Principal idea :...................................
Scope of the idea : .........text...................
........................text...........................
Description and detail :................text.......
..................text.....
Cost estimation :..........
........................text...........................
........................text...........................
........................text...........................
Advantage for us :.................................
.......................................................
Direct Manager IM :................................

Updated per your post
//I would create a class to use if you haven't
//Just cleaner and easier to read
public class Entry
{
public string MyMail { get; set; }
public string ManagerMail { get; set; }
public string Entity { get; set; }
public string ProjectTitle { get; set; }
// ......etc
}
//in case your format location ever changes only change the index value here
public enum EntryLocation
{
MyMail = 0,
ManagerMail = 1,
Entity = 2,
ProjectTitle = 3
}
//return the entry
private Entry ReadEntry()
{
string s =
string.Format("My mail: test@test.com{0}Manager mail: test2@test2.com{0}Entity: test entity{0}Project Title: test project title", Environment.NewLine);
//in case you change your delimiter only need to change it once here
char delimiter = ':';
//your entry contains newline so lets split on that first
string[] split = s.Split(new string[] { Environment.NewLine }, StringSplitOptions.None);
//populate the entry
Entry entry = new Entry()
{
//use the enum makes it cleaner to read what value you are pulling
MyMail = split[(int)EntryLocation.MyMail].Split(delimiter)[1].Trim(),
ManagerMail = split[(int)EntryLocation.ManagerMail].Split(delimiter)[1].Trim(),
Entity = split[(int)EntryLocation.Entity].Split(delimiter)[1].Trim(),
ProjectTitle = split[(int)EntryLocation.ProjectTitle].Split(delimiter)[1].Trim()
};
return entry;
}

That is because split returns strings delimited by the sign you've specified. In your case:
Title
Mytitle Manager
Him
1. You can change your data format to get the value you need, for example:
Title: Mytitle:Manager: Him
Then every second element is a value:
text.Split(':')[1] == " Mytitle";
text.Split(':')[3] == " Him";
2. Or you can call text.Split(' ', ':') to get the names and values as separate elements without changing the format (empty entries appear wherever a colon is followed by a space).
3. Also, if each entry is on its own line in the file, like:
Title: Mytitle
Manager: Him
and your content is read into a single string, then you can also do:
text.Split(new string[] {Environment.NewLine, ":"}, StringSplitOptions.None);
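Building on the third option, here is a minimal sketch that splits the text into lines first and then splits each line on its first colon only, so a value that itself contains a colon stays intact. The helper name ParseFields is hypothetical, the sample text is the short example from the question, and the code uses top-level statements (C# 9 or later). Continuation lines, as in the longer sample, are not handled.

```csharp
using System;
using System.Collections.Generic;

string text = "Title: Mytitle\nManager: Him\nThema: Free\nMain Idea: best idea ever";

var fields = ParseFields(text);
Console.WriteLine(fields["Title"]);

// Hypothetical helper: maps each field name to its trimmed value.
Dictionary<string, string> ParseFields(string input)
{
    var result = new Dictionary<string, string>();
    string[] lines = input.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);
    foreach (string line in lines)
    {
        // A count of 2 means only the first ':' separates name from value.
        string[] pair = line.Split(new[] { ':' }, 2);
        if (pair.Length == 2)
        {
            result[pair[0].Trim()] = pair[1].Trim();
        }
        // Lines without a colon are ignored in this sketch.
    }
    return result;
}
```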

Dollar $ symbol in mongodb

I have a couple of collections that contain strings like this.
This is a cool stock. $AAPL. Let's buy it.
This is a cool stock. $MSFT. Let's buy it.
This is a cool stock. $GOOG. Let's buy it.
How do I find the AAPL one?
I use something like db.collection_name.find(fieldname: /$AAPL/), but it doesn't like the dollar symbol. If I run it without the $, it works fine, but I only want results where $AAPL is in the text.
Cheers.

A complete C# example:
// sample class with a property that could contain the sample string
// in your example, "This is a cool stock. $MSFT"
public class Talk {
public string Message { get; set; }
}
var client = new MongoClient("mongodb://localhost");
var server = client.GetServer();
var database = server.GetDatabase("stocktalk");
var collection = database.GetCollection<Talk>("talk");
var query = Query<Talk>.EQ(m => m.Message,
new BsonRegularExpression(@"\$MSFT"));
// get all of the Talk objects that match
var matches = collection.FindAs<Talk>(query);
Also note that this is a very inefficient query in general as it would need to search through all documents in the collection to find a match. You might want to consider storing the stock ticker symbols in a distinct array property as part of the document and using $in to find them (you could then use an index for example and it would be very fast to find matching strings):
public class Talk {
public string Message { get; set; }
public string[] TickerSymbols { get; set; }
}
var query = Query<Talk>.In(m => m.TickerSymbols, new string[]{"$MSFT"});

$ is a special character in regular expressions; it matches the end of the original string.
To match a literal $ character, you need to escape it with a backslash:
db.collection_name.find({fieldname: /\$AAPL/})
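When the ticker comes from user input in C#, the escaping can be done programmatically. Here is a minimal sketch using Regex.Escape from the .NET base library. It only shows the escaping and its effect on a local match, on the assumption that the escaped string is then passed to the driver as the pattern. The code uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

string message = "This is a cool stock. $AAPL. Let's buy it.";
string ticker = "$AAPL";

// Regex.Escape puts a backslash in front of regex metacharacters such as '$'.
string escaped = Regex.Escape(ticker);

// Unescaped, '$' anchors to the end of the input, so "AAPL" can never follow it.
bool unescapedMatches = Regex.IsMatch(message, ticker);
bool escapedMatches = Regex.IsMatch(message, escaped);

Console.WriteLine($"pattern: {escaped}, unescaped: {unescapedMatches}, escaped: {escapedMatches}");
```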

Regular expression problem in C#

I have these strings as a response from a FTP server:
02-17-11 01:39PM <DIR> dec
04-06-11 11:17AM <DIR> Feb 2011
05-10-11 07:09PM 87588 output.xlsx
06-10-11 02:52PM 3462 output.xlsx
where the pattern is: [datetime] [length or <dir>] [filename]
Edit: my code was: @"^\d{2}-\d{2}-\d{2}(\s)+(<DIR>|(\d)+)+(\s)+(.*)+"
I need to parse these strings in this object:
class Files {
DateTime modifiedTime;
bool ifTrueThenFile;
string name;
}
Please note that, filename may have spaces.
I am not good at regex matching, can you help?

Regex method
One approach is using this regex
@"(\d{2}-\d{2}-\d{2} \d{2}:\d{2}(?:PM|AM)) (<DIR>|\d+) (.+)";
I am capturing groups, so
// Group 1 - Matches the DateTime
(\d{2}-\d{2}-\d{2} \d{2}:\d{2}(?:PM|AM))
Notice the syntax (?:xx): it is a non-capturing group. We need to match PM or AM, but we don't need that part as a group of its own.
Next I match the file size or <DIR> with
// Group 2 - Matches the file size or <DIR>
(<DIR>|\d+)
Catching the result in a group.
The last part matches directory names or file names
// Group 3 - Matches the dir/file name
(.+)
Now that we captured all groups we can parse the values:
DateTime.Parse(g[1].Value); // be careful with current culture
// a different culture may not work
To check if the captured entry is a file or not you can just check if it is <DIR> or a number.
IsFile = g[2].Value != "<DIR>"; // it is a file if it is not <DIR>
And the name is just what is left
Name = g[3].Value; // returns a string
Then you can use the groups to build the object, an example:
public class Files
{
public DateTime ModifiedTime { get; set; }
public bool IsFile { get; set; }
public string Name { get; set; }
public Files(GroupCollection g)
{
ModifiedTime = DateTime.Parse(g[1].Value);
IsFile = g[2].Value != "<DIR>";
Name = g[3].Value;
}
}
static void Main(string[] args)
{
var p = @"(\d{2}-\d{2}-\d{2} \d{2}:\d{2}(?:PM|AM)) (<DIR>|\d+) (.+)";
var regex = new Regex(p, RegexOptions.IgnoreCase);
var m1 = regex.Match("02-17-11 01:39PM <DIR> dec");
var m2 = regex.Match("05-10-11 07:09PM 87588 output.xlsx");
// DateTime: 02-17-11 01:39PM
// IsFile : false
// Name : dec
var file1 = new Files(m1.Groups);
// DateTime: 05-10-11 07:09PM
// IsFile : true
// Name : output.xlsx
var file2 = new Files(m2.Groups);
}
Further reading
Regex class
Regex groups
String manipulation method
Another way to achieve this is to split the string which can be much faster:
public class Files
{
public DateTime ModifiedTime { get; set; }
public bool IsFile { get; set; }
public string Name { get; set; }
public Files(string line)
{
// Gets the date part and parse to DateTime
ModifiedTime = DateTime.Parse(line.Substring(0, 16));
// Gets the file information part and split
// in two parts
var fileBlock = line.Substring(17).Split(new char[] { ' ' }, 2);
// first part tells if it is a file
IsFile = fileBlock[0] != "<DIR>";
// second part tells the name
Name = fileBlock[1];
}
}
static void Main(string[] args)
{
// DateTime: 02-17-11 01:39PM
// IsFile : false
// Name : dec
var file3 = new Files("02-17-11 01:39PM <DIR> dec");
// DateTime: 05-10-11 07:09PM
// IsFile : true
// Name : out put.xlsx
var file4 = new Files("05-10-11 07:09PM 87588 out put.xlsx");
}
Further reading
String split
String.Split Method (Char[], Int32)

You can try with something like:
^(\d\d-\d\d-\d\d)\s+(\d\d:\d\d[AP]M)\s+(\S+)\s+(.*)$
The first capture group will contain the date, the second the time, the third the size (or <DIR>), and the last everything else (which will be the filename).
(Note that this is probably not portable, the time format is locale dependent.)

Here you go:
(\d{2})-(\d{2})-(\d{2}) (\d{2}):(\d{2})([AP]M) (<DIR>|\d+) (.+)
I used a lot of subexpressions so it catches all the relevant parts, like year, hour, minute, etc. Maybe you don't need them all; just remove the parentheses around the ones you don't need.

try this
// line is assumed to hold one entry of the FTP listing
String regexTemp = @"(?<Date>\d\d-\d\d-\d\d\s*\d\d:\d\d[AP]M)\s*(?<LengthOrDir><DIR>|\d+)\s*(?<Name>.*)";
Match mExprStatic = Regex.Match(line, regexTemp, RegexOptions.IgnoreCase | RegexOptions.Singleline);
if (mExprStatic.Success)
{
DateTime _date = DateTime.Parse(mExprStatic.Groups["Date"].Value);
String lengthOrDir = mExprStatic.Groups["LengthOrDir"].Value;
String Name = mExprStatic.Groups["Name"].Value;
}

A lot of good answers, but I like regex puzzles, so I thought I'd contribute a slightly different version...
^([\d :-]{14}[AP]M)\s+(<DIR>|\d+)\s(.+)$
For help in testing, I always use this site : http://www.myregextester.com/index.php

You don't need to use regex here. Why don't you split the string by spaces with a number_of_elements limit:
var split = yourEntryString.Split(new string []{" "}, 4,
StringSplitOptions.RemoveEmptyEntries);
var date = string.Join(" ", new string[] {split[0], split[1]});
var length = split[2];
var filename = split[3];
this is of course assuming that the pattern is correct and none of the entries would be empty.

I like the regex Leif posted.
However, i'll give you another solution which people will probably hate: fast and dirty solution which i am coming up with just as i am typing:
string[] allParts = inputText.Split(' ');
allParts[0-1] = parse your DateTime
allParts[2] = <DIR> or Size
allParts[3-n] = string.Join(" ",...) your filename
There are some checks missing there, but you get the idea.
Is it nice code? Probably not. Will it work? With the right amount of time, surely.
Is it more readable? I tend to think "yes", but others might disagree.

You should be able to implement this with a simple string.Split, an if statement, and Parse/ParseExact to convert the values. If it is a file, just concatenate the remaining tokens so you can reconstruct a filename that contains spaces.
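The split-based approach described in the last few answers can be sketched as follows. It assumes the date is month-day-year (the sample 02-17-11 suggests month first) and uses DateTime.ParseExact with the invariant culture so the result does not depend on the machine's locale. The helper name ParseListing is hypothetical, and the code uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Globalization;

var entry = ParseListing("05-10-11 07:09PM 87588 out put.xlsx");
Console.WriteLine($"{entry.Modified:s} | file: {entry.IsFile} | name: {entry.Name}");

// Hypothetical helper: one listing line -> (modified time, is-file flag, name).
(DateTime Modified, bool IsFile, string Name) ParseListing(string line)
{
    // At most 4 parts: date, time, size or <DIR>, then the rest of the line as the name,
    // so spaces inside the file name are preserved.
    string[] parts = line.Split(new[] { ' ' }, 4, StringSplitOptions.RemoveEmptyEntries);
    if (parts.Length != 4)
    {
        throw new FormatException("Unexpected listing line: " + line);
    }

    // hh = 12-hour clock, tt = AM/PM designator; MM-dd-yy order is an assumption.
    DateTime modified = DateTime.ParseExact(
        parts[0] + " " + parts[1], "MM-dd-yy hh:mmtt", CultureInfo.InvariantCulture);

    return (modified, parts[2] != "<DIR>", parts[3]);
}
```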
