I have a couple of collections that have strings like this:
This is a cool stock. $AAPL. Let's buy it.
This is a cool stock. $MSFT. Let's buy it.
This is a cool stock. $GOOG. Let's buy it.
How do I find the AAPL one?
I use something like db.collection_name.find({fieldname: /$AAPL/}), but it doesn't like the dollar symbol. If I run it without the $, it works fine, but I only want results where $AAPL is in the text.
Cheers.
A complete C# example:
// sample class with a property that could contain the sample string
// in your example, "This is a cool stock. $MSFT"
public class Talk {
public string Message { get; set; }
}
var client = new MongoClient("mongodb://localhost");
var server = client.GetServer();
var database = server.GetDatabase("stocktalk");
var collection = database.GetCollection<Talk>("talk");
var query = Query<Talk>.Matches(m => m.Message,
new BsonRegularExpression(@"\$MSFT"));
// get all of the Talk objects that match
var matches = collection.FindAs<Talk>(query);
Also note that this is a very inefficient query in general as it would need to search through all documents in the collection to find a match. You might want to consider storing the stock ticker symbols in a distinct array property as part of the document and using $in to find them (you could then use an index for example and it would be very fast to find matching strings):
public class Talk {
public string Message { get; set; }
public string[] TickerSymbols { get; set; }
}
var query = Query<Talk>.In(m => m.TickerSymbols, new string[]{"$MSFT"});
$ is a special character in regular expressions; it matches the end of the original string.
To match a literal $ character, you need to escape it with a backslash:
db.collection_name.find({fieldname: /\$AAPL/})
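The same escaping rule applies to .NET regular expressions. A minimal C# sketch, assuming the sample messages from the question (the class name DollarEscapeDemo is made up for illustration), showing a hand-escaped pattern next to Regex.Escape:

```csharp
using System;
using System.Text.RegularExpressions;

static class DollarEscapeDemo
{
    static void Main()
    {
        string[] messages =
        {
            "This is a cool stock. $AAPL. Let's buy it.",
            "This is a cool stock. $MSFT. Let's buy it.",
            "AAPL without the dollar sign."
        };

        // Escape the $ by hand, or let Regex.Escape do it for arbitrary input.
        string byHand = @"\$AAPL";
        string escaped = Regex.Escape("$AAPL");

        foreach (string message in messages)
        {
            // Only the message containing the literal text "$AAPL" matches.
            bool hit = Regex.IsMatch(message, byHand);
            Console.WriteLine($"{hit}: {message}");
        }

        // Both ways of escaping produce the same pattern text.
        Console.WriteLine(byHand == escaped);
    }
}
```

Regex.Escape is the safer choice when the ticker comes from user input, since it also escapes any other metacharacters.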
I have to parse a log file and am not sure how best to extract the different pieces of each line. The problem I am facing is that the original developer used ':' to delimit tokens, which was a bit idiotic since each line contains timestamps that themselves contain ':'!
A sample line looks something like this:
transaction_date_time:[systemid]:sending_system:receiving_system:data_length:data:[ws_name]
2019-05-08 15:03:13:494|2019-05-08 15:03:13:398:[192.168.1.2]:ABC:DEF:67:cd71f7d9a546ec2b32b,AACN90012001000012,OPNG:[WebService.SomeName.WebServiceModule::WebServiceName]
I have no problem reading the log file and accessing each line, but I'm not sure how to parse the pieces.
Since the input string is not exactly splittable, because the delimiter char is also part of the content, a simple regular expression can be used instead.
Simple but probably fast enough, even with the default settings.
The different parts of the input string can be separated with these capturing groups:
string pattern = @"^(.*?)\|(.*?):\[(.*?)\]:(.*?):(.*?):(\d+):(.*?):\[(.*)\]$";
This will give you 8 groups + 1 (Groups[0]), which contains the whole match.
Using the Regex class, simply pass a string to parse (named line, here) and the regex (named pattern) to the Match() method, using default settings:
var result = Regex.Match(line, pattern);
The Value property of each item in Groups returns the text of that capturing group. For example, the two dates (HH is the 24-hour clock, fff the milliseconds):
var dateEnd = DateTime.ParseExact(result.Groups[1].Value, "yyyy-MM-dd HH:mm:ss:fff", CultureInfo.InvariantCulture);
var dateStart = DateTime.ParseExact(result.Groups[2].Value, "yyyy-MM-dd HH:mm:ss:fff", CultureInfo.InvariantCulture);
The IpAddress is extracted with: \[(.*?)\].
You could give a name to this grouping, so it's more clear what the value refers to. Simply add a string, prefixed with ? and enclosed in <> or single quotes ' to name the grouping:
...\[(?<IpAddress>.*?)\]...
Note, however, that naming a group will modify the Regex.Groups indexing: the un-named groups will be inserted first, the named groups after. So, naming only the IpAddress group will cause it to become the last item, Groups[8]. Of course you can name all the groups and the indexing will be preserved.
var hostAddress = IPAddress.Parse(result.Groups["IpAddress"].Value);
This pattern should allow a medium machine to parse 130,000~150,000 strings per second.
You'll have to test it to find the perfect pattern. For example, the first match (corresponding to the first date), (.*?)\|, is much faster if non-greedy (using the *? lazy quantifier). The opposite is true for the last match: \[(.*)\]. The pattern used by jdweng is even faster than the one used here.
See Regex101 for a detailed description on the use and meaning of each token.
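Putting the pieces above together, here is a minimal sketch that names every group and parses the sample line from the question. The group names (End, Start, Sender and so on) and the class name are made up for illustration, and the date format assumes the 24-hour HH and millisecond fff specifiers:

```csharp
using System;
using System.Globalization;
using System.Net;
using System.Text.RegularExpressions;

static class LogLineDemo
{
    static void Main()
    {
        string line = "2019-05-08 15:03:13:494|2019-05-08 15:03:13:398:[192.168.1.2]:ABC:DEF:67:" +
                      "cd71f7d9a546ec2b32b,AACN90012001000012,OPNG:" +
                      "[WebService.SomeName.WebServiceModule::WebServiceName]";

        // Same structure as the pattern above, but every group is named,
        // so nothing depends on the numeric group order.
        string pattern = @"^(?<End>.*?)\|(?<Start>.*?):\[(?<IpAddress>.*?)\]:(?<Sender>.*?):" +
                         @"(?<Receiver>.*?):(?<Length>\d+):(?<Data>.*?):\[(?<WsName>.*)\]$";

        Match m = Regex.Match(line, pattern);
        if (!m.Success)
        {
            Console.WriteLine("Line does not match the expected layout.");
            return;
        }

        // The log uses ':' before the milliseconds, so the format says so too.
        const string format = "yyyy-MM-dd HH:mm:ss:fff";
        DateTime end = DateTime.ParseExact(m.Groups["End"].Value, format, CultureInfo.InvariantCulture);
        DateTime start = DateTime.ParseExact(m.Groups["Start"].Value, format, CultureInfo.InvariantCulture);
        IPAddress host = IPAddress.Parse(m.Groups["IpAddress"].Value);
        int length = int.Parse(m.Groups["Length"].Value, CultureInfo.InvariantCulture);

        Console.WriteLine($"{start:O} -> {end:O}");
        Console.WriteLine($"{host} {m.Groups["Sender"].Value} -> {m.Groups["Receiver"].Value} ({length})");
        Console.WriteLine(m.Groups["Data"].Value);
        Console.WriteLine(m.Groups["WsName"].Value);
    }
}
```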
Using Regex I was able to parse everything. It looks like the data came from Excel because the fraction of seconds has a colon instead of a period. C# does not like the colon, so I had to replace the colon with a period. I also parsed from right to left to get around the colon issues.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.IO;
namespace ConsoleApplication3
{
class Program1
{
const string FILENAME = @"c:\temp\test.txt";
static void Main(string[] args)
{
string line = "";
int rowCount = 0;
StreamReader reader = new StreamReader(FILENAME);
string pattern = @"^(?'time'.*):\[(?'systemid'[^\]]+)\]:(?'sending'[^:]+):(?'receiving'[^:]+):(?'length'[^:]+):(?'data'[^:]+):\[(?'ws_name'[^\]]+)\]";
while ((line = reader.ReadLine()) != null)
{
line = line.Trim();
if (line.Length > 0)
{
if (++rowCount != 1) //skip header row
{
Log_Data newRow = new Log_Data();
Log_Data.logData.Add(newRow);
Match match = Regex.Match(line, pattern, RegexOptions.RightToLeft);
newRow.ws_name = match.Groups["ws_name"].Value;
newRow.data = match.Groups["data"].Value;
newRow.length = int.Parse(match.Groups["length"].Value);
newRow.receiving_system = match.Groups["receiving"].Value;
newRow.sending_system = match.Groups["sending"].Value;
newRow.systemid = match.Groups["systemid"].Value;
//end data is first then start date is second
string[] date = match.Groups["time"].Value.Split(new char[] {'|'}).ToArray();
string replacePattern = @"(?'leader'.+):(?'trailer'\d+)";
string stringDate = Regex.Replace(date[1], replacePattern, "${leader}.${trailer}", RegexOptions.RightToLeft);
newRow.startDate = DateTime.Parse(stringDate);
stringDate = Regex.Replace(date[0], replacePattern, "${leader}.${trailer}", RegexOptions.RightToLeft);
newRow.endDate = DateTime.Parse(stringDate );
}
}
}
}
}
public class Log_Data
{
public static List<Log_Data> logData = new List<Log_Data>();
public DateTime startDate { get; set; } //transaction_date_time:[systemid]:sending_system:receiving_system:data_length:data:[ws_name]
public DateTime endDate { get; set; }
public string systemid { get; set; }
public string sending_system { get; set; }
public string receiving_system { get; set; }
public int length { get; set; }
public string data { get; set; }
public string ws_name { get; set; }
}
}
When parsing in Superpower, how do I match a string only if it is the first thing in a line?
For example, I need to match the A colon in "A: Hello Goodbye\n" but not in "Goodbye A: Hello\n"
Using your example here, I would change your ActorParser and NodeParser definitions to this:
public readonly static TokenListParser<Tokens, Node> ActorParser =
from name in NameParser
from colon in Token.EqualTo(Tokens.Colon)
from text in TextParser
select new Node {
Actor = name + colon.ToStringValue(),
Text = text
};
public readonly static TokenListParser<Tokens, Node> NodeParser =
from node in ActorParser.Try()
.Or(TextParser.Select(text => new Node { Text = text }))
select node;
I feel like there is a bug with Superpower, as I'm not sure why in the NodeParser I had to put a Try() on the first parser when chaining it with an Or(), but it would throw an error if I didn't add it.
Also, your validation when checking input[1] is incorrect (probably just a copy paste issue). It should be checking against "Goodbye A: Hello" and not "Hello A: Goodbye"
Unless RegexOptions.Multiline is set, ^ matches the beginning of a string regardless of whether it is at the beginning of a line.
You can probably use inline (?m) to turn on multiline:
static TextParser<Unit> Actor { get; } =
from start in Span.Regex(@"(?m)^[A-Za-z][A-Za-z0-9_]*:")
select Unit.Value;
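To see the effect of the multiline option in isolation, here is a small sketch using the plain Regex class rather than Superpower. The input puts the unwanted "Goodbye A:" line first, and the class name is made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

static class LineStartDemo
{
    static void Main()
    {
        string text = "Goodbye A: Hello\nA: Hello Goodbye\n";
        string pattern = @"^[A-Za-z][A-Za-z0-9_]*:";

        // Default options: ^ only matches at the very start of the input,
        // and the input starts with "Goodbye ", so nothing matches.
        Console.WriteLine(Regex.Matches(text, pattern).Count);                         // 0

        // Multiline: ^ also matches right after each '\n', so the "A:" that
        // starts the second line is found; the "A:" inside line one is not.
        Console.WriteLine(Regex.Matches(text, pattern, RegexOptions.Multiline).Count); // 1

        // The inline (?m) flag is equivalent to RegexOptions.Multiline.
        Console.WriteLine(Regex.Matches(text, "(?m)" + pattern).Count);                // 1
    }
}
```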
I have actually done something similar, but I do not use a Tokenizer.
private static string _keyPlaceholder;
private static TextParser<MyClass> Actor { get; } =
Span.Regex("^[A-Za-z][A-Za-z0-9_]*:")
.Then(x =>
{
_keyPlaceholder = x.ToStringValue();
return Character.AnyChar.Many();
}
)
.Select(value => new MyClass { Key = _keyPlaceholder, Value = new string(value) });
I have not tested this, just wrote it out from memory. The above parser should produce the following:
myClass.Key = "A:"
myClass.Value = " Hello Goodbye"
So, splitting a string based on a delimiter is easy with good ol' string.Split. Now let's say I want to split on an open curly bracket and a closed curly bracket. Also straightforward with:
var foo = "{foo}{bar}";
var splitme = foo.Split(new char[] { '{', '}'});
Now let's make it more complicated by adding nested { } inside the initial opening/closing { }, up to n levels deep. What I'm after is parsing what looks to be a proprietary text file format for game mods (Stellaris, great game), and I'm looking for a good way to parse this thing. How would I go about preserving each part of the bracketized (tokenized?) piece of text? Adding to the mix is preserving the key-value pairs, which use an = as the indicator of a relation.
Here is an example of something I'm trying to parse in this fashion:
#Neutronium Materials
tech_ship_armor_5 = {
area = engineering
cost = @tier3cost4
tier = 3
category = { materials }
ai_update_type = military
prerequisites = { "tech_ship_armor_4" "tech_mine_neutronium" }
weight = @tier3weight4
weight_modifier = {
factor = 1.25
modifier = {
factor = 1.25
research_leader = {
area = engineering
has_trait = "leader_trait_expertise_materials"
}
}
}
ai_weight = {
modifier = {
factor = 1.25
research_leader = {
area = engineering
has_trait = "leader_trait_expertise_materials"
}
}
}
}
My first approach was to read this bad boy line by line with a StreamReader, and keep track of how many { I run into before they start getting closed with the corresponding }. Within each chunk of {} I hunt down that = and then figure out my key value pair that I just found, and where it exists in the hierarchy. This... doesn't seem ideal. Is there a better way with some regex magic or an off the shelf text parsing library?
My first thought would be to look at a JSON parser and see how it's done there.
Your sample looks to be best parsed via recursion: for example, consider tech_ship_armor_5 to be an object, get its opening tag, verify existence of its closing tag and go from there.
So then you'd have a tech_ship_armor_5.area property with a value of engineering; the value of the category property would then be another object materials with properties of its own.
Yep, JSON-like parsing is the way to go with this.
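A minimal sketch of that recursive approach, not a complete implementation of the game's format. The Node type and the method names are hypothetical, and it assumes that # starts a comment, that quoted strings contain no escaped quotes, and that a block holds either key = value entries or bare values:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical node type: an optional key plus either a scalar value or children.
class Node
{
    public string Key;
    public string Value;
    public List<Node> Children;
}

static class BlockParser
{
    // Split the text into tokens: "{", "}", "=", quoted strings and bare words.
    static List<string> Tokenize(string text)
    {
        var tokens = new List<string>();
        int i = 0;
        while (i < text.Length)
        {
            char c = text[i];
            if (char.IsWhiteSpace(c)) { i++; }
            else if (c == '#') { while (i < text.Length && text[i] != '\n') i++; } // skip comment
            else if (c == '{' || c == '}' || c == '=') { tokens.Add(c.ToString()); i++; }
            else if (c == '"')
            {
                int end = text.IndexOf('"', i + 1);
                if (end < 0) throw new FormatException("Unterminated string");
                tokens.Add(text.Substring(i + 1, end - i - 1)); // the quotes are dropped
                i = end + 1;
            }
            else
            {
                int start = i;
                while (i < text.Length && !char.IsWhiteSpace(text[i]) && "{}=\"#".IndexOf(text[i]) < 0) i++;
                tokens.Add(text.Substring(start, i - start));
            }
        }
        return tokens;
    }

    // Parse entries until the closing brace of the current block (or end of input
    // at the top level). Each nested "{" triggers one recursive call.
    static List<Node> ParseEntries(List<string> t, ref int pos, bool insideBlock)
    {
        var nodes = new List<Node>();
        while (pos < t.Count && t[pos] != "}")
        {
            var node = new Node();
            if (pos + 1 < t.Count && t[pos + 1] == "=")
            {
                node.Key = t[pos];
                pos += 2; // skip the key and the '='
            }
            if (pos >= t.Count || t[pos] == "}" || t[pos] == "=")
                throw new FormatException("Value expected");
            if (t[pos] == "{")
            {
                pos++; // consume '{'
                node.Children = ParseEntries(t, ref pos, true);
                pos++; // consume the matching '}'
            }
            else
            {
                node.Value = t[pos++];
            }
            nodes.Add(node);
        }
        if (insideBlock && pos >= t.Count) throw new FormatException("Missing closing brace");
        if (!insideBlock && pos < t.Count) throw new FormatException("Unexpected closing brace");
        return nodes;
    }

    // Print the tree with one indentation step per nesting level.
    static void Dump(List<Node> nodes, string indent)
    {
        foreach (Node n in nodes)
        {
            if (n.Children == null) Console.WriteLine($"{indent}{n.Key} -> {n.Value}");
            else { Console.WriteLine($"{indent}{n.Key}:"); Dump(n.Children, indent + "  "); }
        }
    }

    static void Main()
    {
        string text = @"#Neutronium Materials
tech_ship_armor_5 = {
    area = engineering
    tier = 3
    category = { materials }
    prerequisites = { ""tech_ship_armor_4"" ""tech_mine_neutronium"" }
    weight_modifier = { factor = 1.25 modifier = { factor = 1.25 } }
}";
        int pos = 0;
        Dump(ParseEntries(Tokenize(text), ref pos, false), "");
    }
}
```

Tokenizing first keeps the recursive part small: it only has to decide between "key = value", a nested block and a bare value, and the nesting depth is tracked by the call stack instead of a manual brace counter.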
I have a .txt file which has about 500k entries, each separated by new line. The file size is about 13MB and the format of each line is the following:
SomeText<tab>Value<tab>AnotherValue<tab>
My problem is to find a certain "string" with the input from the program, from the first column in the file, and get the corresponding Value and AnotherValue from the two columns.
The first column is not sorted, but the second and third column values in the file are actually sorted. But, this sorting is of no good use to me.
The file is static and does not change. I was thinking to use the Regex.IsMatch() here but I am not sure if that's the best approach here to go line by line.
If the lookup time would increase drastically, I could probably go for rearranging the first column (and hence un-sorting the second & third column). Any suggestions on how to implement this approach or the above approach if required?
After locating the string, how should I fetch those two column values?
EDIT
I realized that there will be quite a few searches in the file for at least one request by the user. If I have an array of values to find, how can I return some kind of dictionary with the corresponding values of the found matches?
Maybe with this code:
var myLine = File.ReadAllLines("filename")
.Select(line => line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries))
.Single(s => s[0] == "string to find");
myLine is an array of strings that represents a row. You may also use .AsParallel() extension method for better performance.
How many times do you need to do this search?
Is the cost of some pre-processing on startup worth it if you save time on each search?
Is loading all the data into memory at startup feasible?
Parse the file into objects and stick the results into a hashtable?
I don't think Regex will help you more than any of the standard string options. You are looking for a fixed string value, not a pattern, but I stand to be corrected on that.
Update
Presuming that the "SomeText" is unique, you can use a dictionary like this
Data represents the values coming in from the file.
MyData is a class to hold them in memory.
public IEnumerable<string> Data = new List<string>() {
"Text1\tValue1\tAnotherValue1\t",
"Text2\tValue2\tAnotherValue2\t",
"Text3\tValue3\tAnotherValue3\t",
"Text4\tValue4\tAnotherValue4\t",
"Text5\tValue5\tAnotherValue5\t",
"Text6\tValue6\tAnotherValue6\t",
"Text7\tValue7\tAnotherValue7\t",
"Text8\tValue8\tAnotherValue8\t"
};
public class MyData {
public String SomeText { get; set; }
public String Value { get; set; }
public String AnotherValue { get; set; }
}
[TestMethod]
public void ParseAndFind() {
var dictionary = Data.Select(line =>
{
var pieces = line.Split(new char[] { '\t' }, StringSplitOptions.RemoveEmptyEntries);
return new MyData {
SomeText = pieces[0],
Value = pieces[1],
AnotherValue = pieces[2],
};
}).ToDictionary<MyData, string>(dat =>dat.SomeText);
Assert.AreEqual("AnotherValue3", dictionary["Text3"].AnotherValue);
Assert.AreEqual("Value7", dictionary["Text7"].Value);
}
hth,
Alan
var firstFoundLine = File.ReadLines("filename").FirstOrDefault(s => s.StartsWith("string"));
if (firstFoundLine != null) // FirstOrDefault returns null when nothing matches
{
char yourColumnDelimiter = '\t';
var columnValues = firstFoundLine.Split(new []{yourColumnDelimiter});
var secondColumn = columnValues[1];
var thirdColumns = columnValues[2];
}
File.ReadLines is better than File.ReadAllLines because you won't need to read the whole file -- only until the matching string is found: http://msdn.microsoft.com/en-us/library/dd383503.aspx
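For the case raised in the question's edit (several keys per request), a single lazy pass can collect all of them. A minimal sketch, assuming tab-separated lines with the key in the first column; FindAll, the class name and the sample data are made up for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class LookupDemo
{
    // Scan the lines once and return the rows whose first column is a wanted key.
    static Dictionary<string, string[]> FindAll(IEnumerable<string> lines, IEnumerable<string> keys)
    {
        var wanted = new HashSet<string>(keys);
        var found = new Dictionary<string, string[]>();
        foreach (string line in lines)
        {
            string[] columns = line.Split('\t');
            if (columns.Length >= 3 && wanted.Contains(columns[0]) && !found.ContainsKey(columns[0]))
            {
                found[columns[0]] = new[] { columns[1], columns[2] };
                if (found.Count == wanted.Count) break; // everything located, stop reading
            }
        }
        return found;
    }

    static void Main()
    {
        // Write a tiny sample file so the sketch is runnable on its own.
        string path = Path.GetTempFileName();
        File.WriteAllLines(path, new[]
        {
            "Text1\tValue1\tAnotherValue1\t",
            "Text2\tValue2\tAnotherValue2\t",
            "Text3\tValue3\tAnotherValue3\t"
        });

        var result = FindAll(File.ReadLines(path), new[] { "Text2", "Text3", "Missing" });
        foreach (var pair in result)
            Console.WriteLine($"{pair.Key}: {pair.Value[0]}, {pair.Value[1]}");

        File.Delete(path);
    }
}
```

Keys that are not in the file are simply absent from the result. If the same file is searched on every request, loading it into a dictionary once at startup avoids rescanning it each time.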
Parse this monstrosity into some sort of database.
SQL Server/MySQL would be preferable, but if you can't use them for various reasons, SQLite or even Access or Excel could work.
Doing that a single time is not hard.
After you are done with that, searching will become easy and fast.
GetLines(inputPath).FirstOrDefault(p => p.Split('\t')[0] == "SearchText")
private static IEnumerable<string> GetLines(string inputFile)
{
string filePath = Path.Combine(Directory.GetCurrentDirectory(),inputFile);
return File.ReadLines(filePath);
}
I have a log file that I want to parse and load into a database. I'm struggling with the best way to go about parsing it.
The log file is in the format Category: Information
Case Number: CASE01
User ID: JOSM
Software: Microsoft Word
Date Started: 21-01-2010
Date Ended: 22-01-2010
Thing is, there's other bits and pieces thrown into the log file that mean the information isn't always present on the same line. I also only want the information, not the category.
So far, I've tried sticking it all into an array separated by \r\n, but I have to know the index of the information I want in order to consistently retrieve it, and that changes. I've also tried feeding it through StreamReader and saying
if (line.Contains("Case Number"))
{
tbReport.AppendText("Case Number: " + line.Remove(0, 13) + "\r\n");
}
Which gets me the information I want, but makes it very hard to do anything with.
I feel I'm better off going down the array path, but I could do with some guidance on how to search the array for the category, and then parse the information.
Once I can parse it accurately, adding it into a database should be fairly straight forward. As it's my first time attempting this, I'd be interested in any tips or guidance as to the best way to go about this though.
Thanks.
This will give you a collection with all key/value pairs.
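List<KeyValuePair<string, string>> items = new List<KeyValuePair<string, string>>();
var line = reader.ReadLine();
while (line != null)
{
int pos = line.IndexOf(':');
if (pos > 0) // skip lines that are not in "Category: Information" form
{
items.Add(new KeyValuePair<string, string>(line.Substring(0, pos), line.Substring(pos + 1).Trim()));
}
line = reader.ReadLine();
}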
If you have a log class which contains all possible names as properties, you can use reflection instead:
class LogEntry
{
public string CaseNumber { get; set; }
public string UserID { get; set; } // must match "User ID" with the space removed
public string Software{ get; set; }
public string DateStarted { get; set; }
public string DateEnded { get; set; }
}
List<LogEntry> items = new List<LogEntry>();
var line = reader.ReadLine();
var currentEntry = new LogEntry();
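while (line != null)
{
if (line == "") //empty line = new log entry. Change to your delimiter.
{
items.Add(currentEntry);
currentEntry = new LogEntry();
}
else
{
int pos = line.IndexOf(':');
if (pos > 0) // ignore lines without a "Category:" prefix
{
var name = line.Substring(0, pos).Replace(" ", string.Empty);
var value = line.Substring(pos + 1).Trim();
var pi = currentEntry.GetType().GetProperty(name);
if (pi != null) // ignore categories that have no matching property
{
pi.SetValue(currentEntry, value, null);
}
}
}
line = reader.ReadLine();
}
items.Add(currentEntry); // the last entry has no trailing empty line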
Note that I've not tested the code (just written it directly in here). You have to add error checking and such. The last alternative is not very performant as it is, but should do OK.
Sounds like a good candidate for regular expressions:
http://www.regular-expressions.info/dotnet.html
They're not too easy to learn, but once you get the basic understanding, they can't be beaten for this kind of task.
It's not really a simple answer, but have you maybe thought about using a regular expression for parsing the information out?
Regular expressions are kinda hardcore stuff, but they can parse advanced files quite easily.
From what I can see, the format is like this:
A line starts with A-Z, then (a-z or A-Z or 0-9 or space) one or more times, followed by a : then a space, and then the value.
So if you make a regular expression for that (if you wait a while I will try to make one for you), then you could test each line with it. If it matches, we can also use the regular expression to take out the last part and the "key". If it doesn't match, we just append the line to the value of the last key.
Beware that it's not totally foolproof, as a continuation line could happen to start this way, but it's kinda the best thing we can do, I think.
As promised here is a starting point for your regular expression:
^(?'key'[A-Z][a-zA-Z0-9\s]+):\s(?'value'.+)
To explain what it does, we need to go through each part:
^ ensures that a match starts at the beginning of a line.
(?'key' is the syntax to begin a named "capture" group. The regular expression will then give us easy access to the "key" part of the match.
We start that group with [A-Z], a character class that matches any uppercase letter, but only one.
[a-zA-Z0-9\s]+ is like the previous class, but for all uppercase or lowercase letters, digits and whitespace (\s). The plus after the class means it must match one or more times. (No commas between the ranges: inside [] a comma would match a literal comma.)
Then we end the group and put in our : followed by a whitespace character (\s).
We then begin a new group, the value group, just like the key group.
Then we write . (which matches any character except a newline), followed by a + so that it matches one or more of them.
I actually think that you can just take the whole string and match it all at once with Regex.Matches (passing RegexOptions.Multiline so that ^ matches at the start of every line), and loop over the matches.
Then just take match.Groups["key"].Value and match.Groups["value"].Value and put them into your array. (Sorry, I don't have Visual Studio handy to test it out.)
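A minimal sketch of that loop, assuming the sample log from the question plus one made-up unrelated line. The class name is hypothetical. The key class uses a literal space instead of \s, because \s also matches a newline and could let a key run across two lines when the whole text is matched at once:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class KeyValueDemo
{
    static void Main()
    {
        string log = string.Join("\n",
            "Case Number: CASE01",
            "some unrelated line",
            "User ID: JOSM",
            "Software: Microsoft Word",
            "Date Started: 21-01-2010",
            "Date Ended: 22-01-2010");

        // Multiline makes ^ match at the start of every line, not just the first.
        var pattern = new Regex(@"^(?<key>[A-Z][a-zA-Z0-9 ]+):\s(?<value>.+)", RegexOptions.Multiline);

        var entries = new Dictionary<string, string>();
        foreach (Match match in pattern.Matches(log))
        {
            // Trim removes a trailing '\r' if the file uses Windows line endings.
            entries[match.Groups["key"].Value] = match.Groups["value"].Value.Trim();
        }

        foreach (var entry in entries)
            Console.WriteLine($"{entry.Key} = {entry.Value}");
    }
}
```

Lines that do not look like "Category: Information" are skipped, which covers the extra bits and pieces mentioned in the question.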