How to split a string into an array - c#

I have a string of attribute names and definitions.
I am trying to split the string on the attribute name, into a Dictionary of string string. Where the key is the attribute name and the definition is the value. I won't know the attribute names ahead of time, so I have been trying to somehow split on the ":" character, but am having trouble with that because the attribute name is is not included in the split.
For example, I need to split this string on "Organization:", "OranizationType:", and "Nationality:" into a Dictionary. Any ideas on the best way to do this with C#.Net?
Organization: Name of a governmental, military or other organization. OrganizationType: Organization classification to one of the following types: sports, governmental military, governmental civilian or political party. (required) Nationality: Organization nationality if mentioned in the document. (required)
Here is some sample code to help:
private static void Main()
{
const string str = "Organization: Name of a governmental, military or other organization. OrganizationType: Organization classification to one of the following types sports, governmental military, governmental civilian or political party. (required) Nationality: Organization nationality if mentioned in the document. (required)";
var array = str.Split(':');
var dictionary = array.ToDictionary(x => x[0], x => x[1]);
foreach (var item in dictionary)
{
Console.WriteLine("{0}: {1}", item.Key, item.Value);
}
// Expecting to see the following output:
// Organization: Name of a governmental, military or other organization.
// OrganizationType: Organization classification to one of the following types sports, governmental military, governmental civilian or political party.
// Nationality: Organization nationality if mentioned in the document. (required)
}
Here is a visual explanation of what I am trying to do:
http://farm5.static.flickr.com/4081/4829708565_ac75b119a0_b.jpg

I'd do it in two phases, firstly split into the property pairs using something like this:
Regex.Split(input, "\s(?=[A-Z][A-Za-z]*:)")
this looks for any whitespace, followed by a alphabetic string followed by a colon. The alphabetic string must start with a capital letter. It then splits on that white space. That will get you three strings of the form "PropertyName: PropertyValue". Splitting on that first colon is then pretty easy (I'd personally probably just use substring and indexof rather than another regular expression but you sound like you can do that bit fine on your own. Shout if you do want help with the second split.
The only thing to say is be carful in case you get false matches due to the input being awkward. In this case you'll just have to make the regex more complicated to try to compensate.

You would need some delimiter to indicate when it is the end of each pair as opposed to having one large string with sections in between e.g.
Organization: Name of a governmental, military or other organization.|OrganizationType: Organization classification to one of the following types: sports, governmental military, governmental civilian or political party. (required) |Nationality: Organization nationality if mentioned in the document. (required)
Notice the | character which is indicating the end of the pair. Then it is just a case of using a very specific delimiter, something that is not likely to be used in the description text, instead of one colon you could use 2 :: as one colon could possibly crop up on occassions as others have suggested. That means you would just need to do:
// split the string into rows
string[] rows = myString.Split('|');
Dictionary<string, string> pairs = new Dictionary<string, string>();
foreach (var r in rows)
{
// split each row into a pair and add to the dictionary
string[] split = Regex.Split(r, "::");
pairs.Add(split[0], split[1]);
}
You can use LINQ as others have suggested, the above is more for readability so you can see what is happening.
Another alternative is to devise some custom regex to do what you need but again you would need to be making a lot of assumptions of how the description text would be formatted etc.

Considering that each word in front of the colon always has at least one capital (please confirm), you could solve this by using regular expressions (otherwise you'd end up splitting on all colons, which also appear inside the sentences):
var resultDict = Regex.Split(input, #"(?<= [A-Z][a-zA-Z]+):")
.ToDictionary(a => a[0], a => a[1]);
The (?<=...) is a positive look-behind expression that doesn't "eat up" the characters, thus only the colon is removed from the output. Tested with your input here.
The [A-Z][a-zA-Z]+ means: a word that starts with a capital.
Note that, as others have suggested, a "smarter" delimiter will provide easier parsing, as does escaping the delimiter (i.e. like "::" or ":" when you are required to use colons. Not sure if those are options for you though, hence the solution with regular expressions above.
Edit
For one reason or another, I kept getting errors with using ToDictionary, so here's the unwinded version, at least it works. Apologies for earlier non-working version. Not that the regular expression is changed, the first did not include the key, which is the inverse of the data.
var splitArray = Regex.Split(input, #"(?<=( |^)[A-Z][a-zA-Z]+):|( )(?=[A-Z][a-zA-Z]+:)")
.Where(a => a.Trim() != "").ToArray();
Dictionary<string, string> resultDict = new Dictionary<string, string>();
for(int i = 0; i < splitArray.Count(); i+=2)
{
resultDict.Add(splitArray[i], splitArray[i+1]);
}
Note: the regular expression becomes a tad complex in this scenario. As suggested in the thread below, you can split it in smaller steps. Also note that the current regex creates a few empty matches, which I remove with the Where-expression above. The for-loop should not be needed if you manage to get ToDictionary working.

Related

how to get a value from json with just the index?

Im making an app which needs to loop through steam games.
reading libraryfolder.vbf, i need to loop through and find the first value and save it as a string.
"libraryfolders"
{
"0"
{
"path" "D:\\Steam"
"label" ""
"contentid" "-1387328137801257092942"
"totalsize" "0"
"update_clean_bytes_tally" "42563526469"
"time_last_update_corruption" "1663765126"
"apps"
{
"730" "31892201109"
"4560" "9665045969"
"9200" "22815860246"
"11020" "776953234"
"34010" "11967809445"
"34270" "1583765638"
for example, it would record:
730
4560
9200
11020
34010
34270
Im already using System.Text.JSON in the program, is there any way i could loop through and just get the first value using System.Text.JSON or would i need to do something different as vdf doesnt separate the values with colons or commas?
That is not JSON, that is the KeyValues format developed by Valve. You can read more about the format here:
https://developer.valvesoftware.com/wiki/KeyValues
There are existing stackoverflow questions regarding converting a VDF file to JSON, and they mention libraries already developed to help read VDF which can help you out.
VDF to JSON in C#
If you want a very quick and dirty way to read the file without needing any external library I would probably use REGEX and do something like this:
string pattern = "\"apps\"\\s+{\\s+(\"(\\d+)\"\\s+\"\\d+\"\\s+)+\\s+}";
string libraryPath = #"C:\Program Files (x86)\Steam\steamapps\libraryfolders.vdf";
string input = File.ReadAllText(libraryPath);
List<string> indexes = Regex.Matches(input, pattern, RegexOptions.Singleline)
.Cast<Match>().ToList()
.Select(m => m.Groups[2].Captures).ToList()
.SelectMany(c => c.Cast<Capture>())
.Select(c => c.Value).ToList();
foreach(string s in indexes)
{
Debug.WriteLine(s);
}
See the regular expression explaination here:
https://regex101.com/r/bQSt79/1
It basically captures all occurances of "apps" { } in the 0 group, and does a repeating capture of pairs of numbers inbetween the curely brackets in the 1 group, but also captures the left most number in the pair of numbers in the 2 group. Generally repeating captures will only keep the last occurance but because this is C# we can still access the values.
The rest of the code takes each match, the 2nd group of each match, the captures of each group, and the values of those captures, and puts them in a list of strings. Then a foreach will print the value of those strings to log.

Parsing this special format file

I have a file that is formatted this way --
{2000}000000012199{3100}123456789*{3320}110009558*{3400}9876
54321*{3600}CTR{4200}D2343984*JOHN DOE*1232 STREET*DALLAS TX
78302**{5000}D9210293*JANE DOE*1234 STREET*SUITE 201*DALLAS
TX 73920**
Basically, the number in curly brackets denotes field, followed by the value for that field. For example, {2000} is the field for "Amount", and the value for it is 121.99 (implied decimal). {3100} is the field for "AccountNumber" and the value for it is 123456789*.
I am trying to figure out a way to split the file into "records" and each record would contain the record type (the value in the curly brackets) and record value, but I don't see how.
How do I do this without a loop going through each character in the input?
A different way to look at it.... The { character is a record delimiter, and the } character is a field delimiter. You can just use Split().
var input = #"{2000}000000012199{3100}123456789*{3320}110009558*{3400}987654321*{3600}CTR{4200}D2343984*JOHN DOE*1232 STREET*DALLAS TX78302**{5000}D9210293*JANE DOE*1234 STREET*SUITE 201*DALLASTX 73920**";
var rows = input.Split( new [] {"{"} , StringSplitOptions.RemoveEmptyEntries);
foreach (var row in rows)
{
var fields = row.Split(new [] { "}"}, StringSplitOptions.RemoveEmptyEntries);
Console.WriteLine("{0} = {1}", fields[0], fields[1]);
}
Output:
2000 = 000000012199
3100 = 123456789*
3320 = 110009558*
3400 = 987654321*
3600 = CTR
4200 = D2343984*JOHN DOE*1232 STREET*DALLAS TX78302**
5000 = D9210293*JANE DOE*1234 STREET*SUITE 201*DALLASTX 73920**
Fiddle
This regular expression should get you going:
Match a literal {
Match 1 or more digts ("a number")
Match a literal }
Match all characters that are not an opening {
\{\d+\}[^{]+
It assumes that the values itself cannot contain an opening curly brace. If that's the case, you need to be more clever, e.g. #"\{\d+\}(?:\\{|[^{])+" (there are likely better ways)
Create a Regex instance and have it match against the text. Each "field" will be a separate match
var text = #"{123}abc{456}xyz";
var regex = new Regex(#"\{\d+\}[^{]+", RegexOptions.Compiled);
foreach (var match in regex.Matches(text)) {
Console.WriteLine(match.Groups[0].Value);
}
This doesn't fully answer the question, but it was getting too long to be a comment, so I'm leaving it here in Community Wiki mode. It does, at least, present a better strategy that may lead to a solution:
The main thing to understand here is it's rare — like, REALLY rare — to genuinely encounter a whole new kind of a file format for which an existing parser doesn't already exist. Even custom applications with custom file types will still typically build the basic structure of their file around a generic format like JSON or XML, or sometimes an industry-specific format like HL7 or MARC.
The strategy you should follow, then, is to first determine exactly what you're dealing with. Look at the software that generates the file; is there an existing SDK, reference, or package for the format? Or look at the industry surrounding this data; is there a special set of formats related to that industry?
Once you know this, you will almost always find an existing parser ready and waiting, and it's usually as easy as adding a NuGet package. These parsers are genuinely faster, need less code, and will be less susceptible to bugs (because most will have already been found by someone else). It's just an all-around better way to address the issue.
Now what I see in the question isn't something I recognize, so it's just possible you genuinely do have a custom format for which you'll need to write a parser from scratch... but even so, it doesn't seem like we're to that point yet.
Here is how to do it in linq without slow regex
string x = "{2000}000000012199{3100}123456789*{3320}110009558*{3400}987654321*{3600}CTR{4200}D2343984*JOHN DOE*1232 STREET*DALLAS TX78302**{5000}D9210293*JANE DOE*1234 STREET*SUITE 201*DALLASTX 73920**";
var result =
x.Split('{',StringSplitOptions.RemoveEmptyEntries)
.Aggregate(new List<Tuple<string, string>>(),
(l, z) => { var az = z.Split('}');
l.Add(new Tuple<string, string>(az[0], az[1]));
return l;})
LinqPad output:

Build regular expression for replacing duplicated string into single word

I'm working of filtering comments. I'd like to replace string like this:
llllolllllllllllooooooooooooouuuuuuuuuuuddddddddddddddllllollllllllllllloooooooooooooooooouuuuuuuuuuuuuuuuuuddddddddddddddllllollllllllllllloooooooooooooooooouuuuuuuuuuuuuuuuuuddddddddddddddllllollllllllllllloooooooooooouuuuuuuuuuuuuuuuudddddddddddddd
with two words: lol loud
string like this:
cuytwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww
with: cuytw
And string like this:
hyyuyuyuyuyuyuyuyuyuyuyuyuyu
with: hyu
but not modify strings like look, geek.
Is there any way to achieve this with single regular expression in C#?
I think I can answer this categorically.
This definitely cant be done with RegEx or even standard code due to your input and output requirements without at minimum some sort of dictionary and algorithm to try and reduce doubles in a permutation check for legitimate words.
The result (at best) would give you a list of possible non mutually-exclusive combinations of nonsense words and legitimate words with doubles.
In fact, I'd go as far to say with your current requirements and no extra specificity on rules, your input and output are generically impossible and could only be taken at face value for the cases you have given.
I'm not sure how to use RegEx for this problem, but here is an alternative which is arguably easier to read.*
Assuming you just want to return a string comprising the distinct letters of the input in order, you can use GroupBy:
private static string filterString(string input)
{
var groups = input.GroupBy(c => c);
var output = new string(groups.Select(g => g.Key).ToArray());
return output;
}
Passes:
Returns loud for llllolllllllllllooooooooooooouuuuuuuuuuuddddddddddddddllllollllllllllllloooooooooooooooooouuuuuuuuuuuuuuuuuuddddddddddddddllllollllllllllllloooooooooooooooooouuuuuuuuuuuuuuuuuuddddddddddddddllllollllllllllllloooooooooooouuuuuuuuuuuuuuuuudddddddddddddd
Returns cuytw for cuytwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww
Returns hyu for hyyuyuyuyuyuyuyuyuyuyuyuyuyu
Failures:
Returns lok for look
Returns gek for geek
* On second read you want to leave words like look and geek alone; this is a partial answer.

Determine POS tagging in English based on database files

I'm a little bit confused how to determine part-of-speech tagging in English. In this case, I assume that one word in English has one type, for example word "book" is recognized as NOUN, not as VERB. I want to recognize English sentences based on tenses. For example, "I sent the book" is recognized as past tense.
Description:
I have a number of database (*.txt) files: NounList.txt, verbList.txt, adjectiveList.txt, adverbList.txt, conjunctionList.txt, prepositionList.txt, articleList.txt. And if input words are available in the database, I assume that type of those words can be concluded. But, how to begin lookup in the databases? For example, "I sent the book": how to begin a search in the databases for every word, "I" as Noun, "sent" as verb, "the" as article, "book" as noun? Any better approach than searching every word in every database? I doubt that every databases has unique element.
I enclose my perspective here.
private List<string> ParseInput(String allInput)
{
List<string> listSentence = new List<string>();
char[] delimiter = ".?!;".ToCharArray();
var sentences = allInput.Split(delimiter, StringSplitOptions.RemoveEmptyEntries).Select(s => s.Trim());
foreach (var s in sentences)
listSentence.Add(s);
return listSentence;
}
private void tenseReviewMenu_Click(object sender, EventArgs e)
{
string allInput = rtbInput.Text;
List<string> listWord = new List<string>();
List<string> listSentence = new List<string>();
HashSet<string> nounList = new HashSet<string>(getDBList("nounList.txt"));
HashSet<string> verbList = new HashSet<string>(getDBList("verbList.txt"));
HashSet<string> adjectiveList = new HashSet<string>(getDBList("adjectiveList.txt"));
HashSet<string> adverbList = new HashSet<string>(getDBList("adverbList.txt"));
char[] separator = new char[] { ' ', '\t', '\n', ',' etc... };
listSentence = ParseInput(allInput);
foreach (string sentence in listSentence)
{
foreach (string word in sentence.Split(separator))
if (word.Trim() != "")
listWord.Add(word);
}
string testPOS = "";
foreach (string word in listWord)
{
if (nounList.Contains(word.ToLowerInvariant()))
testPOS += "noun ";
else if (verbList.Contains(word.ToLowerInvariant()))
testPOS += "verb ";
else if (adjectiveList.Contains(word.ToLowerInvariant()))
testPOS += "adj ";
else if (adverbList.Contains(word.ToLowerInvariant()))
testPOS += "adv ";
}
tbTest.Text = testPOS;
}
POS tagging is my secondary explanation in my assignment. So I use a simple approach to determine POS tagging that is based on database. But, if there's a simpler approach: easy to use, easy to understand, easy to get pseudocode, easy to design... to determine POS tagging, please let me know.
I hope the pseudocode I present below proves helpful to you. If I find time, I'd also write some code for you.
This problem can be tackled by following the steps below:
Create a dictionary of all the common sentence patterns in the English language. For example, Subject + Verb is an English pattern and all the sentences like I sleep, Dog barked and Ship will arrive match the S-V pattern. You can find a list of the most common english patterns here. Please note that for some time you may need to keep revising this dictionary to enhance the accuracy of your program.
Try to fit the input sentence in one of the patterns in the dictionary you created above, for example, if the input sentence is Snakes, unlike elephants, are venomous., then your code must be able to find a match with the pattern: Subject, unlike AnotherSubject, Verb Object or S-,unlike-S`-, -V-O. To successfully perform this step, you may need to write code that's good at spotting Structure Markers like the word unlike, in this example sentence.
When you have found a match for your input sentence in your pattern dictionary, you can easily assign a tag to each word in the sentence. For example, in our sentence, the word Snakes would be tagged as a subject, just like the word elephants, the word are would be tagged as a verb and finally the word venomous would be tagged as an object.
Once you have assigned a unique tag to each of the words in your sentence, you can go lookup the word in the appropriate text files that you already have and determine whether or not your sentence is valid.
If your sentence doesn't match any sentence pattern, then you have two options:
a) Add the pattern of this unrecognized sentence in your pattern dictionary if it is a valid English sentence.
b) Or, discard the input sentence as an invalid English sentence.
Things like what you're trying to achieve are best solved using machine learning techniques so that the system can learn any new patterns. So, you may want to include a trainer system that would add a new pattern to your pattern dictionary whenever it finds a valid English sentence not matching any of the existing patterns. I haven't thought much about how this can be done, but for now, you may manually revise your Sentence Pattern dictionary.
I'd be glad to hear your opinion about this pseudocode and would be available to brainstorm it further.

String sorting issue in C#

I have List like this
List<string> items = new List<string>();
items.Add("-");
items.Add(".");
items.Add("a-");
items.Add("a.");
items.Add("a-a");
items.Add("a.a");
items.Sort();
string output = string.Empty;
foreach (string s in items)
{
output += s + Environment.NewLine;
}
MessageBox.Show(output);
The output is coming back as
-
.
a-
a.
a.a
a-a
where as I am expecting the results as
-
.
a-
a.
a-a
a.a
Any idea why "a-a" is not coming before "a.a" where as "a-" comes before "a."
I suspect that in the last case "-" is treated in a different way due to culture-specific settings (perhaps as a "dash" as opposed to "minus" in the first strings). MSDN warns about this:
The comparison uses the current culture to obtain culture-specific
information such as casing rules and the alphabetic order of
individual characters. For example, a culture could specify that
certain combinations of characters be treated as a single character,
or uppercase and lowercase characters be compared in a particular way,
or that the sorting order of a character depends on the characters
that precede or follow it.
Also see in this MSDN page:
The .NET Framework uses three distinct ways of sorting: word sort,
string sort, and ordinal sort. Word sort performs a culture-sensitive
comparison of strings. Certain nonalphanumeric characters might have
special weights assigned to them; for example, the hyphen ("-") might
have a very small weight assigned to it so that "coop" and "co-op"
appear next to each other in a sorted list. String sort is similar to
word sort, except that there are no special cases; therefore, all
nonalphanumeric symbols come before all alphanumeric characters.
Ordinal sort compares strings based on the Unicode values of each
element of the string.
So, hyphen gets a special treatment in the default sort mode in order to make the word sort more "natural".
You can get "normal" ordinal sort if you specifically turn it on:
Console.WriteLine(string.Compare("a.", "a-")); //1
Console.WriteLine(string.Compare("a.a", "a-a")); //-1
Console.WriteLine(string.Compare("a.", "a-", StringComparison.Ordinal)); //1
Console.WriteLine(string.Compare("a.a", "a-a", StringComparison.Ordinal)); //1
To sort the original collection using ordinal comparison use:
items.Sort(StringComparer.Ordinal);
If you want your string sort to be based on the actual byte value as opposed to the rules defined by the current culture you can sort by Ordinal:
items.Sort(StringComparer.Ordinal);
This will make the results consistent across all cultures (but it will produce unintuitive sortings of "14" coming before "9" which may or may not be what you're looking for).
The Sort method of the List<> class relies on the default string comparer of the .NET Framework, which is actually an instance of the current CultureInfo of the Thread.
The CultureInfo specifies the alphabetical order of characters and it seems that the default one is using an order different order to what you would expect.
When sorting you can specify a specific CultureInfo, one that you know will match your sorting requirements, sample (german culture):
var sortCulture = new CultureInfo("de-DE");
items.Sort(sortCulture);
More info can be found here:
http://msdn.microsoft.com/en-us/library/b0zbh7b6.aspx
http://msdn.microsoft.com/de-de/library/system.stringcomparer.aspx

Categories