How to tell which delimiter string was split on - c#

I'm trying to parse out line items from text extracted from a PDF. The text extracted comes out poorly formatted and in one long string per page. There aren't any useful delimiters, but the lines start with one of two strings. I've set up the Split() using a string array with both of those strings, but I need to know which delimiter the elements were split on.
I found this link, but I'm not that great at RegEx. Can someone assist in writing the RegEx string?
var lineItems = page.PageText.Split(new string[] { "First String Delimiter", "Second String Delimiter" }, StringSplitOptions.None);
What I need to know is whether element[x] was a result of "First String Delimiter" or "Second String Delimiter".
EDIT: I don't care if Regex is the solution. Linq may be equally suited. Linq didn't come out until after I earned my degrees, so I'm similarly unfamiliar with it.
Imagine a page with about 15-20 of these end to end, coming back as one long string with no carriage returns. Since they all start with "Corporate Trade Payment Credit" or "Preauthorized ACH Credit", I can split on those, but I need to know which type each one was.
Preauthorized ACH Credit (165) 10,000.00 489546541 0000000000 Text Some long description about transaction- Preauthorized ACH Credit (165) 5,310.99 8465498461 0000000000 Text Another long description Corporate Trade Payment Credit (165) 4,933.17 8478632458775 0000000000 Text Another confidential string description.

Why don't you just run the split twice, once with the first delimiter, then again with the second delimiter?
var firstDelimiterItems = page.PageText.Split("First String Delimiter");
var secondDelimiterItems = page.PageText.Split("Second String Delimiter");

Sometimes the simplest solutions are the best ones. Don't know why this didn't occur to me earlier.
var pageText = page.PageText.Replace("Corporate Trade Payment", "\r\nCorporate Trade Payment").Replace("Preauthorized ACH Credit", "\r\nPreauthorized ACH Credit");
This gives me the line items on their own lines. No Regex needed. Thank you all for your help, and if you find a way to answer the original question with Regex, please post it. I'm always up for learning more.
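For the original question (knowing which delimiter each element came from), one option is Regex.Split with the delimiters wrapped in a capturing group, because .NET then keeps the captured delimiter text in the result array. The following is a minimal sketch, not a tested solution for the real PDF text: the sample string is a shortened, made-up version of the example above, and the names pageText, delimiters and lineItems are only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Shortened, hypothetical sample modeled on the text in the question.
string pageText = "Preauthorized ACH Credit (165) 10,000.00 Text first " +
                  "Preauthorized ACH Credit (165) 5,310.99 Text second " +
                  "Corporate Trade Payment Credit (165) 4,933.17 Text third";

// The two known line prefixes, escaped so they are matched literally.
string[] delimiters = { "Corporate Trade Payment Credit", "Preauthorized ACH Credit" };
string pattern = "(" + string.Join("|", delimiters.Select(d => Regex.Escape(d))) + ")";

// With a capturing group, Regex.Split returns the delimiters as well:
// [text before first match, delimiter, body, delimiter, body, ...]
string[] parts = Regex.Split(pageText, pattern);

var lineItems = new List<(string Type, string Body)>();
for (int i = 1; i + 1 < parts.Length; i += 2)
{
    // parts[i] is the delimiter that was matched, parts[i + 1] the text after it.
    lineItems.Add((parts[i], parts[i + 1].Trim()));
}

foreach (var item in lineItems)
{
    Console.WriteLine($"{item.Type} -> {item.Body}");
}
```

Each element of lineItems then carries its own type, so there is no need to split twice or to re-scan the string afterwards.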

Related

Can't replace single whitespace with string.Replace()

I have run into a problem I do not understand. I am reading data from a file and have run into a situation where string.Replace(" ", "<whatever>") on an entry from the file will not replace the occurrence of a single whitespace. I cannot help but feel there is something very basic that I have missed, since the same kind of string declared as a literal works fine.
A typical line from the file (each entry is separated by a tab):
"2016-feb-08 09:54:00" "2016-feb-08 17:28:00" "Short" "227" "5 170,00" "+3,90%" "0,00"
The data from the file is read using File.ReadAllLines(), and each line is split into an array using Split(new[] {"\t" }, StringSplitOptions.None);.
I then want to clean up the fifth entry for further processing, and this is when I run into the problem:
entries[4].Replace(" ", string.Empty).Replace("\"", string.Empty); gives "5 170,00"
Regex.Replace(entries[4], @"\s+", string.Empty).Replace("\"", string.Empty); gives "5170,00", which is the result I am looking for.
Running the first Replace() on a literal with a single space works fine, so I am curious if the whitespace inside the strings from the file are different somehow? And while the Regex solution works, I really want to know what my "issue" is.
You can use code like the following to check the hex values of the characters. A normal space is 0x20, which is the value that shows up between the five and the one in the text you posted.
string input = "2016-feb-08 09:54:00 2016-feb-08 17:28:00 Short 227 5 170,00 +3,90% 0,00";
byte[] output = Encoding.UTF8.GetBytes(input);
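One possible explanation, which is only an assumption here since the original file is not available, is that the separator in the file is a non-breaking space (U+00A0) rather than a normal space (U+0020). string.Replace(" ", ...) matches only U+0020, while \s in a .NET regex also matches other Unicode space characters. The sketch below uses a made-up entry containing U+00A0 to show the difference and a way to inspect the actual code points.

```csharp
using System;
using System.Linq;

// Assumption: the thousands separator in the file is U+00A0 (non-breaking space).
string entry = "\"5\u00A0170,00\"";

// Print each character's code point to see what the "space" really is.
foreach (char c in entry)
{
    Console.WriteLine($"U+{(int)c:X4}");
}

// Replace(" ", ...) only removes U+0020, so the U+00A0 stays in place.
string viaReplace = entry.Replace(" ", string.Empty).Replace("\"", string.Empty);

// char.IsWhiteSpace is true for U+00A0 and the other Unicode space separators.
string cleaned = new string(entry.Where(c => !char.IsWhiteSpace(c) && c != '"').ToArray());

Console.WriteLine(viaReplace.Length); // still contains the separator
Console.WriteLine(cleaned);
```

If the printed code points show something other than U+0020 for the separator, that would explain why the literal works and the file data does not.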

C# parsing out data line by line by character location from txt file

Just looking to see what the best way to approach the following situation would be.
I am trying to make a small job that reads in a txt file which has a thousand or so lines;
Each line is about 40 characters long (mostly numbers, some letter identifiers).
I have used
DataTable txtCache = new DataTable();
txtCache.Columns.Add(new DataColumn("Column1"));
string[] lines = System.IO.File.ReadAllLines(FILEcheck.Properties.Settings.Default.filePath);
foreach (string line in lines)
{
txtCache.Rows.Add(line);
}
However, what I really want to do is a bit confusing and hard to explain, so I'll do my best. An example line is below:
5498494000584454684840}eD44448774V6468465 Z
In the beginning of that long string is a "84", and then a "58" a little bit later. I need to do a comparison on these two numbers. They could be anything, but only a few combinations are acceptable in the file. They will always be in the same spot and same amount of characters (so it will always be 2 numbers and always in the 4-5 location). So I want to have 3 columns. I want the full string in 1 column, and then the 2 individual smaller numbers in columns of themselves. I can then compare them later on, and if there is an issue, I can return the full string which caused the issue.
Is this possible? I am just not sure how to parse out a substring based on character location and then loading it into a datatable.
Any advice would be appreciated. Thank you,
You could create a column for each of the items you are looking to store (whole string, first number, second number), and then add a row for each of the lines in the input file. You could just use the Substring method to parse out the two-digit numbers and store them. To do your analysis, you could parse the numbers out from the strings, or whatever else you need to do.
lines[0].Substring(3,2) will give you "84" in your above example (the start index is zero-based, so 3 is the fourth character). If you want the int, you could use Int32.Parse(lines[0].Substring(3,2))
Substring reference: http://msdn.microsoft.com/en-us/library/aka44szs%28v=vs.110%29.aspx
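Putting that together, a three-column DataTable could be filled as in the sketch below. The column names are invented for illustration, and the start index 10 for the second number is an assumption read off the single sample line (where "58" begins at the eleventh character); adjust both to the real file layout.

```csharp
using System;
using System.Data;

// Stand-in for File.ReadAllLines(...); uses the sample line from the question.
string[] lines = { "5498494000584454684840}eD44448774V6468465 Z" };

var txtCache = new DataTable();
txtCache.Columns.Add("FullLine", typeof(string));   // hypothetical column names
txtCache.Columns.Add("FirstCode", typeof(int));
txtCache.Columns.Add("SecondCode", typeof(int));

foreach (string line in lines)
{
    // Skip lines too short to contain both fixed-position fields.
    if (line.Length < 12)
    {
        continue;
    }

    // Zero-based offsets: 3 for the first code, 10 (assumed) for the second.
    int first = int.Parse(line.Substring(3, 2));
    int second = int.Parse(line.Substring(10, 2));
    txtCache.Rows.Add(line, first, second);
}

foreach (DataRow row in txtCache.Rows)
{
    Console.WriteLine($"{row["FirstCode"]} / {row["SecondCode"]} from {row["FullLine"]}");
}
```

Keeping the full line in the first column means a failed comparison can report the exact line that caused it. If a line may contain non-digits at those positions, int.TryParse would avoid the exception that int.Parse throws.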

Searching strings in C# - ignore x

I wish to create a program for my business.
I will have a set of data, such as
Post to:
FirstName LastName
Their Address
Their Address
Australia
MORE RANDOM WORDS/DATA (such as the item they ordered)
What I wish to do is create a string of everything between "Post to:" and "Australia". How would I go about doing this as I would have maybe 30 customers and that means 30 (Post to:) and (Australia). I wish to take each of these and separate them to eventually copy them to the clipboard.
I will be using a windows form for this.
EDIT: I think creating a method which returns data from an array would do this. How would I do the searching, though?
You could simply split on "Post to:", which gives you an array where each item holds one address, then "Australia", then the data you're not interested in.
As a second step, I'd split each element into words and take words until one matches "Australia" (TakeWhile in LINQ), which gives you what you want, but as a sequence of words.
As a last step you'd recombine those words into one string. All of this is pretty inefficient, but for the small amount of data you mention it will be plenty fast and very easy to write and maintain.
Here's a sketch (yourFile is a placeholder for the path to your data):
var s = File.ReadAllText(yourFile); // needs using System; using System.IO; using System.Linq;
var addresses = s.Split(new[] { "Post to:" }, StringSplitOptions.RemoveEmptyEntries) // split on "Post to:" to get one string per customer
.Select(item => item.Split(new[] { ' ', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries) // split into words so that we can find Australia
.TakeWhile(word => word != "Australia") // take all the words until we find Australia, then stop
.Aggregate("", (a, b) => a + " " + b).Trim()) // rebuild the string by joining the words preceding Australia
.ToList();
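If the line breaks inside each address should survive (for example, to paste it onto a label), an alternative is to search for the two markers with IndexOf and cut out the text between them. This is only a sketch under the assumption that every block starts with "Post to:" and ends with "Australia"; the sample data, names and addresses are invented.

```csharp
using System;
using System.Collections.Generic;

// Invented sample data in the shape described in the question.
string data =
    "Post to:\nJane Doe\n1 Example St\nSydney NSW 2000\nAustralia\nITEM: widget\n" +
    "Post to:\nJohn Roe\n2 Sample Rd\nPerth WA 6000\nAustralia\nITEM: gadget\n";

const string startMarker = "Post to:";
const string endMarker = "Australia";

var addresses = new List<string>();
int position = 0;
while (true)
{
    // Find the next "Post to:" at or after the current position.
    int start = data.IndexOf(startMarker, position, StringComparison.Ordinal);
    if (start < 0) break;
    start += startMarker.Length;

    // Find the "Australia" that closes this block.
    int end = data.IndexOf(endMarker, start, StringComparison.Ordinal);
    if (end < 0) break;

    // Keep the text between the markers, line breaks included.
    addresses.Add(data.Substring(start, end - start).Trim());
    position = end + endMarker.Length;
}

foreach (string address in addresses)
{
    Console.WriteLine(address);
    Console.WriteLine("---");
}
```

To include the country line in the result, extend the Substring length by endMarker.Length.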

C# read from text file and store in variables

I have a text file that reads
1 "601 Cross Street College Station TX 71234"
2 "(another address)"
3 ...
.
.
I wanted to know how to parse this text file into an integer and a string using C#. The integer would hold the S.No and the string the address without the quotes.
I need to do this because later on I have a function that takes these two values from the text file as input and spits out some data. This function has to be executed on each entry in the text file.
If a is an integer and add is the string, the output should be
a=1; add=601 Cross Street College Station TX 71234 //for the first line and so on
As one can observe the address needs to be one string.
This is not a homework question. And what I have been able to accomplish so far is to read out all the lines using
string[] lines = System.IO.File.ReadAllLines(@"C:\Users\KS\Documents\input.txt");
Any help is appreciated.
I would need to see more of your input data to determine the most reliable method.
But one approach would be to split each address into words. You can then loop through the words and find each word that contains only digits. This will be your street number. You could look after the street number and look for S, So, or South but as your example illustrates, there might be no such indicator.
Also, you haven't provided what you want to happen if more than one number is found.
As far as removing the quotes, just remove the first and last characters. I'd recommend checking that they are in fact quotes before removing them.
From your description, every entry has this format:
[space][number][space][quote][address][quote]
Here is some quick and dirty code that will parse this format into an int/string tuple:
using System;
using System.Linq;
static Tuple<int, string> ParseLine(string line)
{
var tokens = line.Split(); // Split by spaces
var number = int.Parse(tokens[1]); // The number is the 2nd token
var address = string.Join(" ", tokens.Skip(2)); // The address is every subsequent token
address = address.Substring(1, address.Length - 2); // ... minus the first and last characters
return Tuple.Create(number, address);
}
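The token-based version above depends on the exact number of leading spaces and on single spaces inside the address. A variant that locates the quotes instead may be more forgiving. This is a sketch under the assumption that each line is a number followed by one quoted address; ParseEntry is a hypothetical helper name, not something from the question.

```csharp
using System;

// Hypothetical helper: parses lines shaped like  1 "601 Cross Street ..."
(int Number, string Address) ParseEntry(string line)
{
    string trimmed = line.Trim();
    int firstQuote = trimmed.IndexOf('"');
    int lastQuote = trimmed.LastIndexOf('"');
    if (firstQuote < 0 || lastQuote <= firstQuote)
    {
        throw new FormatException("Expected a quoted address: " + line);
    }

    // Everything before the first quote is the serial number.
    int number = int.Parse(trimmed.Substring(0, firstQuote).Trim());

    // Everything between the first and last quote is the address, spacing preserved.
    string address = trimmed.Substring(firstQuote + 1, lastQuote - firstQuote - 1);
    return (number, address);
}

var entry = ParseEntry(" 1 \"601 Cross Street College Station TX 71234\"");
Console.WriteLine($"a={entry.Number}; add={entry.Address}");
```

Calling ParseEntry inside a foreach over the lines from File.ReadAllLines gives one number/address pair per entry to pass on to the later function.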

Parsing a String for Special characters in C#

I am getting a string in the following format in the query string:
Arnstung%20Chew(20)
I want to convert it to just Arnstung Chew.
How do I do it?
Also how do I make sure that the user is not passing a script or anything harmful in the query string?
string str = "Arnstung Chew (20)";
string replacedString = str.Substring(0, str.IndexOf("(")).Trim(); // Trim() removes the space before "(" if there is one, so this also works for "Arnstung Chew(20)"
string safeString = System.Web.HttpUtility.HtmlEncode(replacedString);
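If the value arrives still percent-encoded, as in the question, it needs decoding before the suffix is cut off. The sketch below assumes the raw value is exactly the string from the question; in ASP.NET the query-string collection normally hands back already-decoded values, so the decoding step may be unnecessary there. WebUtility.HtmlEncode from System.Net is used here only so the example runs without a reference to System.Web.

```csharp
using System;
using System.Net;

// Raw value as shown in the question (still percent-encoded).
string raw = "Arnstung%20Chew(20)";

// Turn %20 back into a space.
string decoded = Uri.UnescapeDataString(raw);

// Drop everything from the first "(" onwards, if there is one.
int paren = decoded.IndexOf('(');
string name = (paren >= 0 ? decoded.Substring(0, paren) : decoded).Trim();

// Encode before writing the value into HTML so markup is rendered as text.
string safeName = WebUtility.HtmlEncode(name);

Console.WriteLine(safeName);
Console.WriteLine(WebUtility.HtmlEncode("<script>alert(1)</script>"));
```

HTML-encoding on output covers the case of echoing the value into a page; other uses of the value (SQL, file paths) need their own handling, such as parameterized queries.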
It's impossible to provide a comprehensive answer without knowing what variations might appear on your input text. For example, will there always be two words separated by a space followed by a number in parentheses? Or might there be other variations as well?
I have a lot of parsing code on my Black Belt Coder site, including a sscanf() replacement for .NET that may potentially be useful in your case.
