Searching strings in C# - ignore x - c#

I wish to create a program for my business.
I will have a set of data, such as
Post to:
FirstName LastName
Their Address
Their Address
Australia
MORE RANDOM WORDS/DATA (such as the item they ordered)
What I wish to do is create a string of everything between "Post to:" and "Australia". How would I go about doing this as I would have maybe 30 customers and that means 30 (Post to:) and (Australia). I wish to take each of these and separate them to eventually copy them to the clipboard.
I will be using a windows form for this.
EDIT: I think creating a method which returns data from an array would do this. How would i do the searching though.

You could simply split based on Post to & then you'll get an array of items + australia + what you're not interested in
Then as a second step on all elements i'd split on " " and do a takeuntil it matches australia which gives you what you want but as a sequence of words
As a last step you'd then recombine those strings, all of this is pretty inefficient but for the little amount of data you mention it will be plenty fast & very easy to write / maintain.
Here's some pseudocode
var s = file.ReadToEnd(yourfile);
s.Split(new string[]{"Post to"}) // Split on post to to get an array of string with 1 string per customer
.Select(item=>item.Split(new char[]{' '}) // split on each word so that we can find australia
.TakeUntill(substring => substring == "Australia") // take all the words untill we find australia, then stop
.Aggregate((a,b)=>a+" " + b)// rebuild the string by summing all those words preceeding australia
);

Related

How to tell which delimiter string was split on

I'm trying to parse out line items from text extracted from a PDF. The text extracted comes out poorly formatted and in one long string per page. There aren't any useful delimiters, but the lines start with one of two strings. I've set up the Split() using a string array with both of those strings, but I need to know which delimiter the elements were split on.
I found this link, but I'm not that great at RegEx. Can someone assist in writing the RegEx string?
var lineItems = page.PageText.Split(new string[] { "First String Delimiter", "Second String Delimiter" }, StringSplitOptions.None);
What I need is to know is if element[x] was a result of "First String Delimiter" or "Second String Delimiter".
EDIT: I don't care if Regex is the solution. Linq may be equally suited. Linq didn't come out until after I earned my degrees, so I'm similarly unfamiliar with it.
Imagine a page with about 15-20 of these end to end coming back as one long string with no carriage returns: Since they all start with "Corporate Trade Payment Credit" or "Preauthorized ACH Credit", I can split on those, but I need to know what type it was.
Preauthorized ACH Credit (165) 10,000.00 489546541 0000000000 Text Some long description about transaction- Preauthorized ACH Credit (165) 5,310.99 8465498461 0000000000 Text Another long description Corporate Trade Payment Credit (165) 4,933.17 8478632458775 0000000000 Text Another confidential string description.
Why don't you just run the split twice, once with the first delimiter, then again with the second delimiter?
var firstDelimiterItems = page.PageText.Split("First String Delimiter");
var secondDelimiterItems = page.PageText.Split("Second String Delimiter");
Sometimes the simplest solutions are the best ones. Don't know why this didn't occur to me earlier.
var pageText = page.PageText.Replace("Corporate Trade Payment", "\r\nCorporate Trade Payment").Replace("Preauthorized ACH Credit", "\r\nPreauthorized ACH Credit");
This gives me the line items on their own lines. No Regex needed. Thank you all for your help, and if you find a way to the original question with Regex, please post. I'm always up to learning more.

C# read from text file and store in variables

I have a text file that reads
1 "601 Cross Street College Station TX 71234"
2 "(another address)"
3 ...
.
.
I wanted to know how to parse this text file into an integer and a string using C#. The integer would hold the S.No and the string the address without the quotes.
I need to do this because later on I have a function that takes these two values from the text file as input and spits out some data. This function has to be executed on each entry in the text file.
If i is an integer and add is the string, the output should be
a=1; add=601 Cross Street College Station TX 71234 //for the first line and so on
As one can observe the address needs to be one string.
This is not a homework question. And what I have been able to accomplish so far is to read out all the lines using
string[] lines = System.IO.File.ReadAllLines(#"C:\Users\KS\Documents\input.txt");
Any help is appreciated.
I would need to see more of your input data to determine the most reliable method.
But one approach would be to split each address into words. You can then loop through the words and find each word that contains only digits. This will be your street number. You could look after the street number and look for S, So, or South but as your example illustrates, there might be no such indicator.
Also, you haven't provided what you want to happen if more than one number is found.
As far as removing the quotes, just remove the first and last characters. I'd recommend checking that they are in fact quotes before removing them.
From your description, every entry has this format:
[space][number][space][quote][address][quote]
Here is some quick and dirty code that will parse this format into an int/string tuple:
using namespace System;
using namespace System.Linq;
static Tuple<int, string> ParseLine(string line)
{
var tokens = line.Split(); // Split by spaces
var number = int.Parse(tokens[1]); // The number is the 2nd token
var address = string.Join(" ", tokens.Skip(2)); // The address is every subsequent token
address = address.Substring(1, address.Length - 2); // ... minus the first and last characters
return Tuple.Create(number, address);
}

How do I prevent a string from appearing in a result string when a set of child strings are concatenated to form the result string?

I have 5 strings, let's call them
EarthString
FireString
WindString
WaterString
HeartString
All of them can have varying length, any of them can be empty, or can be very long (but never null).
These 5 strings are very good friends, and every weekend they are concatenated to form a result string using this c# statement
ResultString = EarthString + FireString + WindString + WaterString + HeartString
Depending on the values of these strings, sometimes (only sometimes), ResultString will contain "Captain Planet" as a substring.
My question is, how do I manipulate each of the 5 strings before they are concatenated, so that when they are combined, "Captain Planet" will never appear as a substring in the resultant string?
The only way I can think of right now is to examine each character in each string, in sequential order, but that seems very tedious. Since each of the 5 good friends strings can be of any length, examining the characters individually will also require some kind of concatenation before we can determine whether any character need to be dropped.
Edit: The resultant string is a filtered version of the 5 strings concatenated together, all the other content remain the same except the "Captain Planet" string is dropped. Yes, i'm looking for a solution which allows the 5 strings to be manipulated before concatenation. (this is actually a simplification of a bigger programming problem i'm encountering). Thanks guys.
If you want to do it pre-concat you could
Assign the start and end of each string a numeric value based on the portion of "CaptainPlanet" they contein. Ex: if Air = "net the big captain" then it would get 3 for a start value and 7 for an end value. to determine if you could concat 2 values safely you would just check to see if the end of the left string + start of the right string were not equal to the total length of "CaptainPlanet". If you had very large strings this would allow you to inspect just the first x and last x characters of the string to compute the start/end value.
This solution doesn't account for short strings like ei air = "Cap" , earth ="tain" and fire="Planet". In that case you would need to have a special case for tokens that are shorter than the length of "CaptainPlanet" For those.
Is there a particular reason you can't just do this?
ResultString.Replace("CaptainPlanet", "x");
If it doesn't matter how many chars will be dropped, you can remove f.e. all 'C' in all strings.
The original answer cleared all of the strings, but as pointed out by J.Steen, there was already a formulation of the expected output. So there we go.
Run elementString.Replace("Captain Planet", "") on every substring.
Now you have to identify all the prefixes / suffixes of "Captain Planet" on each of the substrings, and keep that information so that it can be processed before contatenation. That is, e.g. if the substring ends with "Capt", then you should have an information that "substring contains at the end a prefix of the 4 first letters of 'Captain Planet'". You also have to consider the cases of complete substrings (e.g. one of the strings is "ptain Pla"). The problem also becomes more complex if any of the e.g. prefixes can be recursive or repeated (e.g. "CaptainCap" contains 2 kinds of valid prefixes for "CaptainCaptain", and "apt" can be found at two locations in the resulting string);
You process that information before concatenation so that the result string has the same thing as ResultString.Replace("Captain Planet", ""). Congratulations, you have made your program much more complex than necessary!
But in short, you cannot get both the result that you want (all of the substrings intact except for the combined result output) and do the processing wholly before the concatenation step.

Efficient and fast way to parse a string with different languages

I have a string something like (generated via Google Transliterate REST call, and transliterated into 2 languages):
" This world is beautiful and थिस वर्ल्ड इस बेऔतिफुल एंड
থিস বর্ল্ড ইস বিয়াউতিফুল আন্দ amazingly mysterious
अमज़िन्ग्ली म्य्स्तेरिऔस আমাজিন্গ্লি ম্য্স্তেরীয়ুস "
Now Google Transliterate REST call allows FIVE words at a time, so I had to loop, add it to the list and then concatenate the string. That's why we see that each CHUNK (of each language) is of 5 words. The total number of words is 7 words, so first 5 (This world is beautiful and) lies before rest 2 (amazingly mysterious) later.
How do I most efficiently parse the sentence such that I get something like:
This world is beautiful and amazingly mysterious थिस वर्ल्ड इस बेऔतिफुल एंड अमज़िन्ग्ली म्य्स्तेरिऔस থিস বর্ল্ড ইস বিয়াউতিফুল আন্দ আমাজিন্গ্লি ম্য্স্তেরীয়ুস
Since the length of sentence, and the number of languages it can be converted into can be dynamic, may be using lists of each language can work, and then concatenated later?
I used an approach where I transliterated each word, one at a time, it works well, but too slow as it increases the number of calls to the API.
Can someone help me with an efficient (and dynamic) implementation of such a scenario? Thanks a bunch!
One list per language is the way to go.
if you mean different character ASCII code by different languages, you can use this answer here:
Regular expression Spanish and Arabic words
Pay for google translate's API and then your length restriction goes up to 5,000 characters per request https://developers.google.com/translate/v2/faq
Also, yes, as Daniel has said - grouping the text by language will be necessary
I have tried a work out, correct me if i misinterpret your question
string statement = "This world is beautiful and थिस वर्ल्ड इस बेऔतिफुल एंड থিস বর্ল্ড ইস বিয়াউতিফুল আন্দ amazingly mysterious अमज़िन्ग्ली म्य्स्तेरिऔस আমাজিন্গ্লি ম্য্স্তেরীয়ুস ";
string otherLangStmt = statement;
MatchCollection matchCollection = Regex.Matches(statement, "([a-zA-Z]+)");
string result = "";
foreach (Match match in matchCollection)
{
if (match.Groups.Count > 0)
{
result += match.Groups[0].Value + " ";
otherLangStmt = otherLangStmt.Replace(match.Groups[0].Value, string.Empty);
}
}
otherLangStmt = Regex.Replace(otherLangStmt.Trim(), "[\\s]", " ");
Console.WriteLine(result);
Console.WriteLine(otherLangStmt);

parsing words in a continuous string

If a have a string with words and no spaces, how should I parse those words given that I have a dictionary/list that contains those words?
For example, if my string is "thisisastringwithwords" how could I use a dictionary to create an output "this is a string with words"?
I hear that using the data structure Tries could help but maybe if someone could help with the pseudo code? For example, I was thinking that maybe you could index the dictionary into a trie structure, then follow each char down the trie; problem is, I'm unfamiliar with how to do this in (pseudo)code.
I'm assuming that you want an efficient solution, not the obvious one where you repeatedly check if your text starts with a dictionary word.
If the dictionary is small enough, I think you could try and modify the standard KMP algorithm. Basically, build a finite-state machine on your dictionary which consumes the text character by character and yields the constructed words.
EDIT: It appeared that I was reinventing tries.
I already did something similar. You cannot use a simple dictionary. The result will be messy. It depends if you only have to do this once or as whole program.
My solution was to:
Connect to a database with working
words from a dictionary list (for
example online dictionary)
Filter long and short words in dictionary and check if you want to trim stuff (for example don't use words with only one character like 'I')
Start with short words and compare your bigString with the database dictionary.
Now you need to create a "table of possibility". Because a lot of words can fit into 100% but are wrong. As longer the word as more sure you are, that this word is the right one.
It is cpu intensive but it can work precise in the result.
So lets say, you are using a small dictionary of 10,000 words and 3,000 of them are with a length of 8 characters, you need to compare your bigString at start with all 3,000 words and only if result was found, it is allowed to proceed to the next word. If you have 200 characters in your bigString you need about (2000chars / 8 average chars) = 250 full loops minimum with comparation.
For me, I also did a small verification of misspelled words into the comparation.
example of procedure (don't copy paste)
Dim bigString As String = "helloworld.thisisastackoverflowtest!"
Dim dictionary As New List(Of String) 'contains the original words. lets make it case insentitive
dictionary.Add("Hello")
dictionary.Add("World")
dictionary.Add("this")
dictionary.Add("is")
dictionary.Add("a")
dictionary.Add("stack")
dictionary.Add("over")
dictionary.Add("flow")
dictionary.Add("stackoverflow")
dictionary.Add("test")
dictionary.Add("!")
For Each word As String In dictionary
If word.Length < 1 Then dictionary.Remove(word) 'remove short words (will not work with for each in real)
word = word.ToLower 'make it case insentitive
Next
Dim ResultComparer As New Dictionary(Of String, Double) 'String is the dictionary word. Double is a value as percent for a own function to weight result
Dim i As Integer = 0 'start at the beginning
Dim Found As Boolean = False
Do
For Each word In dictionary
If bigString.IndexOf(word, i) > 0 Then
ResultComparer.Add(word, MyWeightOfWord) 'add the word if found, long words are better and will increase the weight value
Found = True
End If
Next
If Found = True Then
i += ResultComparer(BestWordWithBestWeight).Length
Else
i += 1
End If
Loop
I told you that it seems like an impossible task. But you can have a look at this related SO question - it may help you.
If you are sure you have all the words of the phrase in the dictionary, you can use that algo:
String phrase = "thisisastringwithwords";
String fullPhrase = "";
Set<String> myDictionary;
do {
foreach(item in myDictionary){
if(phrase.startsWith(item){
fullPhrase += item + " ";
phrase.remove(item);
break;
}
}
} while(phrase.length != 0);
There are so many complications, like, some items starting equally, so the code will be changed to use some tree search, BST or so.
This is the exact problem one has when trying to programmatically parse languages like Chinese where there are no spaces between words. One method that works with those languages is to start by splitting text on punctuation. This gives you phrases. Next you iterate over the phrases and try to break them into words starting with the length of the longest word in your dictionary. Let's say that length is 13 characters. Take the first 13 characters from the phrase and see if it is in your dictionary. If so, take it as a correct word for now, move forward in the phrase and repeat. Otherwise, shorten your substring to 12 characters, then 11 characters, etc.
This works extremely well, but not perfectly because we've accidentally put in a bias towards words that come first. One way to remove this bias and double check your result is to repeat the process starting at the end of the phrase. If you get the same word breaks you can probably call it good. If not, you have an overlapping word segment. For example, when you parse your sample phrase starting at the end you might get (backwards for emphasis)
words with string a Isis th
At first, the word Isis (Egyptian Goddess) appears to be the correct word. When you find that "th" is not in your dictionary, however, you know there is a word segmentation problem nearby. Resolve this by going with the forward segmentation result "this is" for the non-aligned sequence "thisis" since both words are in the dictionary.
A less common variant of this problem is when adjacent words share a sequence which could go either way. If you had a sequence like "archand" (to make something up), should it be "arc hand" or "arch and"? The way to determine is to apply a grammar checker to the results. This should be done to the whole text anyway.
Ok, I will make a hand wavy attempt at this. The perfect(ish) data structure for your problem is (as you've said a trie) made up of the words in the dictionary. A trie is best visualised as a DFA, a nice state machine where you go from one state to the next on every new character. This is really easy to do in code, a Java(ish) style class for this would be :
Class State
{
String matchedWord;
Map<char,State> mapChildren;
}
From hereon, building the trie is easy. Its like having a rooted tree structure with each node having multiple children. Each child is visited on one character transition. The use of a HashMap kind of structure trims down time to look up character to next State mappings. Alternately if all you have are 26 characters for the alphabet, a fixed size array of 26 would do the trick as well.
Now, assuming all of that made sense, you have a trie, your problem still isn't fully solved. This is where you start doing things like regular expressions engines do, walk down the trie, keep track of states which match to a whole word in the dictionary (thats what I had the matchedWord for in the State structure), use some backtracking logic to jump to a previous match state if the current trail hits a dead end. I know its general but given the trie structure, the rest is fairly straightforward.
If you have dictionary of words and need a quick implmentation this can be solved efficiently with dynamic programming in O(n^2) time, assuming the dictionary lookups are O(1). Below is some C# code, the substring extraction could and dictionary lookup could be improved.
public static String[] StringToWords(String str, HashSet<string> words)
{
//Index of char - length of last valid word
int[] bps = new int[str.Length + 1];
for (int i = 0; i < bps.Length; i++)
bps[i] = -1;
for (int i = 0; i < str.Length; i++)
{
for (int j = i + 1; j <= str.Length ; j++)
{
if (bps[j] == -1)
{
//Destination cell doesn't have valid backpointer yet
//Try with the current substring
String s = str.Substring(i, j - i);
if (words.Contains(s))
bps[j] = i;
}
}
}
//Backtrack to recovery sequence and then reverse
List<String> seg = new List<string>();
for (int bp = str.Length; bps[bp] != -1 ;bp = bps[bp])
seg.Add(str.Substring(bps[bp], bp - bps[bp]));
seg.Reverse();
return seg.ToArray();
}
Building a hastset with the word list from /usr/share/dict/words and testing with
foreach (var s in StringSplitter.StringToWords("thisisastringwithwords", dict))
Console.WriteLine(s);
I get the output "t hi sis a string with words". Because as others have pointed out this algorithm will return a valid segmentation (if one exists), however this may not be the segmentation you expect. The presence of short words is reducing the segmentation quality, you might be able to add heuristic to favour longer words if two valid sub-segmentation enter an element.
There are more sophisticated methods that finite state machines and language models that can generate multiple segmentations and apply probabilistic ranking.

Categories