Referring to this topic: Minimize LINQ string token counter
And using the following provided code:
string src = "for each character in the string, take the rest of the " +
"string starting from that character " +
"as a substring; count it if it starts with the target string";
var results = src.Split() // default split by whitespace
.GroupBy(str => str) // group words by the value
.Select(g => new
{
str = g.Key, // the value
count = g.Count() // the count of that value
});
I need to enumerate all values (keywords and their number of occurrences) and load them into a NameValueCollection. Due to my very limited LINQ knowledge, I can't figure out how to do it. Please advise. Thanks.
I won't guess why you want to put anything in a NameValueCollection, but is there some reason why
foreach (var result in results)
collection.Add(result.str, result.count.ToString());
is not sufficient?
(EDIT: Changed accessor to Add, which may be better for your use case.)
If the answer is "no, that works" you should probably stop and figure out what the hell the above code is doing before using it in your project.
Looks like your particular problem could just as easily use a Dictionary instead of a NameValueCollection. I forget if this is the correct ToDictionary syntax, but just google the ToDictionary() method:
Dictionary<string, int> useADictionary = results.ToDictionary(x => x.str, x => x.count);
You certainly want a Dictionary instead of NameValueCollection. The whole point is to show unique tokens (strings) with each token's occurrence count (an int), yes?
NameValueCollection is a special-purpose collection that requires both key and value to be strings. Dictionary<string, int> is the mainstream .NET way to associate a unique string key with its corresponding int value.
Take a look at the various System.Collections namespaces to understand what each is intended to achieve. Typically these days, System.Collections.Generic is the most widely-seen, with System.Collections.Concurrent for multithreaded programs.
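To make the two suggestions concrete, here is a minimal sketch that counts tokens into a Dictionary<string, int> and, if a NameValueCollection really is required, copies the counts across as strings. The names TokenCounter, CountTokens and ToNameValueCollection are made up for illustration; only Split, GroupBy, ToDictionary and NameValueCollection.Add come from the framework.

```csharp
using System;
using System.Collections.Generic;
using System.Collections.Specialized;
using System.Linq;

public static class TokenCounter
{
    // Count how often each whitespace-separated token occurs.
    public static Dictionary<string, int> CountTokens(string src)
    {
        // A null separator array means "split on whitespace".
        return src.Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
                  .GroupBy(token => token)
                  .ToDictionary(g => g.Key, g => g.Count());
    }

    // Copy the counts into a NameValueCollection (values must be strings).
    public static NameValueCollection ToNameValueCollection(Dictionary<string, int> counts)
    {
        var collection = new NameValueCollection();
        foreach (var pair in counts)
        {
            collection.Add(pair.Key, pair.Value.ToString());
        }
        return collection;
    }

    public static void Main()
    {
        var counts = CountTokens("take the rest of the string starting from the string");
        var collection = ToNameValueCollection(counts);
        foreach (string key in collection.AllKeys)
        {
            Console.WriteLine("{0}: {1}", key, collection[key]);
        }
    }
}
```

The dictionary keeps the count as an int, so the conversion to string only happens at the point where the NameValueCollection is filled.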
Related
I have a file that is formatted this way --
{2000}000000012199{3100}123456789*{3320}110009558*{3400}9876
54321*{3600}CTR{4200}D2343984*JOHN DOE*1232 STREET*DALLAS TX
78302**{5000}D9210293*JANE DOE*1234 STREET*SUITE 201*DALLAS
TX 73920**
Basically, the number in curly brackets denotes field, followed by the value for that field. For example, {2000} is the field for "Amount", and the value for it is 121.99 (implied decimal). {3100} is the field for "AccountNumber" and the value for it is 123456789*.
I am trying to figure out a way to split the file into "records" and each record would contain the record type (the value in the curly brackets) and record value, but I don't see how.
How do I do this without a loop going through each character in the input?
A different way to look at it.... The { character is a record delimiter, and the } character is a field delimiter. You can just use Split().
var input = @"{2000}000000012199{3100}123456789*{3320}110009558*{3400}987654321*{3600}CTR{4200}D2343984*JOHN DOE*1232 STREET*DALLAS TX78302**{5000}D9210293*JANE DOE*1234 STREET*SUITE 201*DALLASTX 73920**";
var rows = input.Split( new [] {"{"} , StringSplitOptions.RemoveEmptyEntries);
foreach (var row in rows)
{
var fields = row.Split(new [] { "}"}, StringSplitOptions.RemoveEmptyEntries);
Console.WriteLine("{0} = {1}", fields[0], fields[1]);
}
Output:
2000 = 000000012199
3100 = 123456789*
3320 = 110009558*
3400 = 987654321*
3600 = CTR
4200 = D2343984*JOHN DOE*1232 STREET*DALLAS TX78302**
5000 = D9210293*JANE DOE*1234 STREET*SUITE 201*DALLASTX 73920**
This regular expression should get you going:
Match a literal {
Match 1 or more digits ("a number")
Match a literal }
Match all characters that are not an opening {
\{\d+\}[^{]+
It assumes that the values themselves cannot contain an opening curly brace. If they can, you need to be more clever, e.g. @"\{\d+\}(?:\\{|[^{])+" (there are likely better ways).
Create a Regex instance and have it match against the text. Each "field" will be a separate match:
var text = @"{123}abc{456}xyz";
var regex = new Regex(@"\{\d+\}[^{]+", RegexOptions.Compiled);
// Type the loop variable explicitly: on older frameworks MatchCollection enumerates as object.
foreach (Match match in regex.Matches(text)) {
    Console.WriteLine(match.Groups[0].Value);
}
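If the field number and the value are wanted separately, capture groups can do the split in the same pass. This is only a sketch: FieldParser and Parse are illustrative names, it assumes that line breaks have already been removed from the input, and it returns a list rather than a dictionary on the assumption that a field number may repeat.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FieldParser
{
    // Group 1: the digits between the braces. Group 2: everything up to the next '{'.
    private static readonly Regex FieldPattern = new Regex(@"\{(\d+)\}([^{]*)");

    public static List<KeyValuePair<string, string>> Parse(string text)
    {
        var fields = new List<KeyValuePair<string, string>>();
        foreach (Match match in FieldPattern.Matches(text))
        {
            fields.Add(new KeyValuePair<string, string>(
                match.Groups[1].Value,
                match.Groups[2].Value));
        }
        return fields;
    }

    public static void Main()
    {
        foreach (var field in Parse("{2000}000000012199{3100}123456789*{3600}CTR"))
        {
            Console.WriteLine("{0} = {1}", field.Key, field.Value);
        }
    }
}
```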
This doesn't fully answer the question, but it was getting too long to be a comment, so I'm leaving it here in Community Wiki mode. It does, at least, present a better strategy that may lead to a solution:
The main thing to understand here is it's rare — like, REALLY rare — to genuinely encounter a whole new kind of a file format for which an existing parser doesn't already exist. Even custom applications with custom file types will still typically build the basic structure of their file around a generic format like JSON or XML, or sometimes an industry-specific format like HL7 or MARC.
The strategy you should follow, then, is to first determine exactly what you're dealing with. Look at the software that generates the file; is there an existing SDK, reference, or package for the format? Or look at the industry surrounding this data; is there a special set of formats related to that industry?
Once you know this, you will almost always find an existing parser ready and waiting, and it's usually as easy as adding a NuGet package. These parsers are genuinely faster, need less code, and will be less susceptible to bugs (because most will have already been found by someone else). It's just an all-around better way to address the issue.
Now what I see in the question isn't something I recognize, so it's just possible you genuinely do have a custom format for which you'll need to write a parser from scratch... but even so, it doesn't seem like we're to that point yet.
Here is how to do it in LINQ without slow regex:
string x = "{2000}000000012199{3100}123456789*{3320}110009558*{3400}987654321*{3600}CTR{4200}D2343984*JOHN DOE*1232 STREET*DALLAS TX78302**{5000}D9210293*JANE DOE*1234 STREET*SUITE 201*DALLASTX 73920**";
var result =
    x.Split(new[] { '{' }, StringSplitOptions.RemoveEmptyEntries)
     .Aggregate(new List<Tuple<string, string>>(),
         (l, z) =>
         {
             // az[0] is the field number, az[1] is the field value
             var az = z.Split('}');
             l.Add(new Tuple<string, string>(az[0], az[1]));
             return l;
         });
This seems so simple that I'm convinced I must be overlooking something. I cannot establish how to do the following in Lucene:
The problem
I'm searching for place names.
I have a field called Name
It is using Lucene.Net.Analysis.Standard.StandardAnalyzer
It is TOKENIZED
The value of Name contains 1 space in the value: halong bay.
The search term may or may not contain an extra space due to culturally different spellings or genuine spelling mistakes. E.g. ha long bay instead of halong bay.
If I use the term halong bay I get a hit.
If I use the term ha long bay I do not get a hit.
The attempted solution
Here's the code I'm using to build my predicate using LINQ to Lucene from Sitecore:
var searchContext = ContentSearchManager.GetIndex("my_index").CreateSearchContext();
var term = "ha long bay";
var predicate = PredicateBuilder.Create<MySearchResultItemClass>(sri => sri.Name == term);
var results = searchContext.GetQueryable<MySearchResultItemClass>().Where(predicate);
I have also tried a fuzzy match using the .Like() extension:
var predicate = PredicateBuilder.Create<MySearchResultItemClass>(sri => sri.Like(term));
This also yields no results for ha long bay.
How do I configure Lucene in Sitecore to return a hit for both halong bay and ha long bay search terms, ideally without having to do anything fancy with the input term (e.g. stripping space, adding wildcards, etc)?
Note: I recognise that this would also allow the term h a l o n g b a y to produce a hit, but I don't think I have a problem with this.
A TOKENIZED field means that the field value is split on a delimiter (whitespace in this case) and the resulting terms are added to the index dictionary. If you index "halong bay" in such a field, it will create the "halong" and "bay" terms.
It's normal for the search engine to fail to retrieve this result for the "ha long" search query because it doesn't know any result with the "ha" or "long" terms.
A manual approach would be to define all the other ways to write the place name in another multi-value computed index field named AlternateNames. Then you could issue this kind of query: Name==query OR AlternateNames==query.
An automatic approach would be to also index the place names without spaces in a separate computed index field named CompactName. Then you could issue this kind of query: Name==query OR CompactName==compactedQueryWithoutSpaces
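The normalisation behind the CompactName idea can be sketched in plain C#. How the computed field is registered in Sitecore is not shown here; PlaceNameNormalizer and Compact are hypothetical names, and the sketch assumes the same helper would be applied both when the field is indexed and when the query is built.

```csharp
using System;
using System.Linq;

public static class PlaceNameNormalizer
{
    // Remove all whitespace and lower-case the text so that
    // "ha long bay" and "halong bay" produce the same compact form.
    public static string Compact(string name)
    {
        if (string.IsNullOrEmpty(name))
        {
            return string.Empty;
        }
        var withoutSpaces = name.Where(c => !char.IsWhiteSpace(c)).ToArray();
        return new string(withoutSpaces).ToLowerInvariant();
    }

    public static void Main()
    {
        // Both spellings collapse to "halongbay".
        Console.WriteLine(Compact("ha long bay") == Compact("halong bay")); // prints True
    }
}
```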
I hope this helps
Jeff
Something like this might do the trick:
var predicate = PredicateBuilder.False<MySearchResultItemClass>();
foreach (var t in term.Split(' '))
{
var tempTerm = t;
predicate = predicate.Or(p => p.Name.Contains(tempTerm));
}
var results = searchContext.GetQueryable<MySearchResultItemClass>().Where(predicate);
It does split your input string, but I guess that is not 'fancy' ;)
I am trying to find if a list of strings contains a specific string in C#.
For example, suppose I have 3 entries in my list:
List<string> s1 = new List<string>(){
"the lazy boy went to the market in a car",
"tom",
"balloon"};
string s2 = "market";
Now I want to return true if s1 contains s2, which it does in this case.
return s1.Contains(s2);
This returns false which is not what I want. I was reading about Predicate but could not make much sense out of it for this case.
Thanks in advance.
The simplest way is to search each string individually:
bool exists = s1.Any(s => s.Contains(s2));
The List<string>.Contains() method is going to check if any whole string matches the string you ask for. You need to check each individual list element to accomplish what you want.
Note that this may be a time-consuming operation, if your list has a large number of elements in it, very long strings, and especially in the case where the string you're searching for either does not exist or is found only near the end of the list.
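A small self-contained comparison of the two calls, using the list from the question. SubstringSearch and AnyContains are illustrative names only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SubstringSearch
{
    // True if any element of the sequence contains the needle as a substring.
    public static bool AnyContains(IEnumerable<string> items, string needle)
    {
        return items.Any(item => item.Contains(needle));
    }

    public static void Main()
    {
        var s1 = new List<string>
        {
            "the lazy boy went to the market in a car",
            "tom",
            "balloon"
        };

        Console.WriteLine(s1.Contains("market"));     // False: no element equals "market"
        Console.WriteLine(AnyContains(s1, "market")); // True: the first element contains it
    }
}
```

Any stops at the first element that matches, so the worst case is the one described above: the search string is missing or only appears near the end of the list.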
An alternative to Contains is IndexOf:
var res = s1.Any(s => s.IndexOf(s2, StringComparison.Ordinal) >= 0);
StringComparison.Ordinal is passed as a parameter because Contains() also uses it internally.
Peter Duniho's answer is generally the best way. I am providing an alternate solution. This one does not require LINQ, lambdas, or loops. It only requires the built-in string type's methods.
string.Concat(listOfString).Contains("data");
Note: This approach can lead to incorrect results. For example:
string.Concat("da", "ta").Contains("data");
will return true when it should be false.
I'm sure this isn't as complicated as I'm making it.
I have a string that follows the following pattern:
#,"value"#,"next value"#,"next value" . . .
I need to parse out the number/value pairs in order to use the data in my application.
Here is a sample of a return:
0,"120"1,""10,"298630427"29,"577015971830"30,"SG MSNA "33,"A1"34,"4625"35,"239"36,"2105"37,"2759"60,"15"112,"0"
To complicate matters the string can contain newline characters (\r,\n, or \r\n).
I can handle that by simply removing the newlines with a few string.replace calls.
I would ultimately like to parse the data into key/value pairs. I just can't seem to get my mind onto the right path.
I apologize if this is trivial but I've been pulling 18+ hour days for two months trying to meet a deadline and my brain is shot. Any assistance or guidance in the right direction will be most appreciated.
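var numVal = Regex.Matches(input, @"(\d+),""([^""]*)""")
    .Cast<Match>()
    .Select(x => new
    {
        num = x.Groups[1].Value,   // the number before the comma
        value = x.Groups[2].Value  // the text between the quotes (may be empty)
    });
Now you can iterate over numVal:
foreach (var nv in numVal)
{
    Console.WriteLine("{0} = {1}", nv.num, nv.value);
}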
If you're going straight to key/value pairs, you might be able to use LINQ to make your life easier. You'll have to be aware of and handle cases where the input doesn't match the key/value format (or where a key repeats). However, you might be able to achieve it using something like this.
// Splitting on the quote yields alternating pieces: "key," then value, then "key," and so on.
var pieces = string_value.Split('"');
var dictionary = Enumerable.Range(0, pieces.Length / 2)
    .ToDictionary(i => pieces[2 * i].TrimEnd(',').Trim(), i => pieces[2 * i + 1]);
I have a string of attribute names and definitions.
I am trying to split the string on the attribute name into a Dictionary of string, string, where the key is the attribute name and the value is the definition. I won't know the attribute names ahead of time, so I have been trying to somehow split on the ":" character, but am having trouble with that because the attribute name is not included in the split.
For example, I need to split this string on "Organization:", "OrganizationType:", and "Nationality:" into a Dictionary. Any ideas on the best way to do this with C#.Net?
Organization: Name of a governmental, military or other organization. OrganizationType: Organization classification to one of the following types: sports, governmental military, governmental civilian or political party. (required) Nationality: Organization nationality if mentioned in the document. (required)
Here is some sample code to help:
private static void Main()
{
const string str = "Organization: Name of a governmental, military or other organization. OrganizationType: Organization classification to one of the following types sports, governmental military, governmental civilian or political party. (required) Nationality: Organization nationality if mentioned in the document. (required)";
var array = str.Split(':');
var dictionary = array.ToDictionary(x => x[0], x => x[1]);
foreach (var item in dictionary)
{
Console.WriteLine("{0}: {1}", item.Key, item.Value);
}
// Expecting to see the following output:
// Organization: Name of a governmental, military or other organization.
// OrganizationType: Organization classification to one of the following types sports, governmental military, governmental civilian or political party.
// Nationality: Organization nationality if mentioned in the document. (required)
}
Here is a visual explanation of what I am trying to do:
http://farm5.static.flickr.com/4081/4829708565_ac75b119a0_b.jpg
I'd do it in two phases, firstly split into the property pairs using something like this:
Regex.Split(input, @"\s(?=[A-Z][A-Za-z]*:)")
This looks for any whitespace followed by an alphabetic string followed by a colon. The alphabetic string must start with a capital letter. It then splits on that whitespace. That will get you three strings of the form "PropertyName: PropertyValue". Splitting on that first colon is then pretty easy (I'd personally probably just use Substring and IndexOf rather than another regular expression), but you sound like you can do that bit fine on your own. Shout if you do want help with the second split.
The only thing to say is be careful in case you get false matches due to the input being awkward. In that case you'll just have to make the regex more complicated to try to compensate.
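For completeness, the two phases might be put together roughly as follows. This is a sketch rather than a tested production parser: AttributeParser and Parse are illustrative names, and it assumes every attribute name is a single capitalised alphabetic word immediately followed by a colon.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AttributeParser
{
    public static Dictionary<string, string> Parse(string input)
    {
        var result = new Dictionary<string, string>();

        // Phase 1: split on whitespace that is followed by "CapitalisedWord:".
        foreach (var pair in Regex.Split(input, @"\s(?=[A-Z][A-Za-z]*:)"))
        {
            // Phase 2: split on the first colon only, so colons inside
            // the description stay part of the value.
            int colon = pair.IndexOf(':');
            if (colon < 0)
            {
                continue; // no attribute name in this chunk; skip it
            }
            string name = pair.Substring(0, colon).Trim();
            string definition = pair.Substring(colon + 1).Trim();
            result[name] = definition;
        }
        return result;
    }

    public static void Main()
    {
        const string str = "Organization: Name of a governmental, military or other organization. "
            + "OrganizationType: Organization classification to one of the following types: sports, "
            + "governmental military, governmental civilian or political party. (required) "
            + "Nationality: Organization nationality if mentioned in the document. (required)";

        foreach (var item in Parse(str))
        {
            Console.WriteLine("{0}: {1}", item.Key, item.Value);
        }
    }
}
```

Using the indexer (result[name] = ...) means a repeated attribute name overwrites the earlier entry instead of throwing, which may or may not be what you want.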
You would need some delimiter to indicate when it is the end of each pair as opposed to having one large string with sections in between e.g.
Organization: Name of a governmental, military or other organization.|OrganizationType: Organization classification to one of the following types: sports, governmental military, governmental civilian or political party. (required) |Nationality: Organization nationality if mentioned in the document. (required)
Notice the | character, which marks the end of each pair. Within a pair it is then just a case of using a very specific delimiter, something that is not likely to appear in the description text: instead of one colon you could use two (::), since a single colon could crop up on occasion, as others have suggested. That means you would just need to do:
// split the string into rows
string[] rows = myString.Split('|');
Dictionary<string, string> pairs = new Dictionary<string, string>();
foreach (var r in rows)
{
// split each row into a pair and add to the dictionary
string[] split = Regex.Split(r, "::");
pairs.Add(split[0], split[1]);
}
You can use LINQ as others have suggested, the above is more for readability so you can see what is happening.
Another alternative is to devise some custom regex to do what you need but again you would need to be making a lot of assumptions of how the description text would be formatted etc.
Considering that each word in front of the colon always has at least one capital (please confirm), you could solve this by using regular expressions (otherwise you'd end up splitting on all colons, which also appear inside the sentences):
var resultDict = Regex.Split(input, @"(?<= [A-Z][a-zA-Z]+):")
.ToDictionary(a => a[0], a => a[1]);
The (?<=...) is a positive look-behind expression that doesn't "eat up" the characters, thus only the colon is removed from the output. Tested with your input here.
The [A-Z][a-zA-Z]+ means: a word that starts with a capital.
Note that, as others have suggested, a "smarter" delimiter will provide easier parsing, as does escaping the delimiter (i.e. like "::" or ":") when you are required to use colons. Not sure if those are options for you though, hence the solution with regular expressions above.
Edit
For one reason or another, I kept getting errors when using ToDictionary, so here's the unwound version; at least it works. Apologies for the earlier non-working version. Note that the regular expression has changed: the first one did not separate the key from the preceding value.
var splitArray = Regex.Split(input, @"(?<=( |^)[A-Z][a-zA-Z]+):|( )(?=[A-Z][a-zA-Z]+:)")
.Where(a => a.Trim() != "").ToArray();
Dictionary<string, string> resultDict = new Dictionary<string, string>();
for(int i = 0; i < splitArray.Count(); i+=2)
{
resultDict.Add(splitArray[i], splitArray[i+1]);
}
Note: the regular expression becomes a tad complex in this scenario. As suggested in the thread below, you can split it in smaller steps. Also note that the current regex creates a few empty matches, which I remove with the Where-expression above. The for-loop should not be needed if you manage to get ToDictionary working.