Parsing a text file with a custom format in C# - c#

I have a bunch of text files that has a custom format, looking like this:
App Name
Export Layout
Produced at 24/07/2011 09:53:21
Field Name Length
NAME 100
FULLNAME1 150
ADDR1 80
ADDR2 80
Any whitespaces may be tabs or spaces. The file may contain any number of field names and lengths.
I want to get all the field names and their corresponding field lengths and perhaps store them in a dictionary. This information will be used to process a corresponding fixed width data file having the mentioned field names and field lengths.
I know how to skip lines using ReadLine(). What I don't know is how to say: "When you reach the line that starts with 'Field Name', skip one more line, then starting from the next line, grab all the words on the left column and the numbers on the right column."
I have tried String.Trim() but that doesn't remove the whitespaces in between.
Thanks in advance.

You can use SkipWhile(l => !l.TrimStart().StartsWith("Field Name")).Skip(1):
Dictionary<string, string> allFieldLengths = File.ReadLines("path")
.SkipWhile(l => !l.TrimStart().StartsWith("Field Name")) // skips lines that don't start with "Field Name"
.Skip(1) // go to next line
.SkipWhile(l => string.IsNullOrWhiteSpace(l)) // skip following empty line(s)
.Select(l =>
{ // anonymous method to use "real code"
var line = l.Trim(); // remove spaces or tabs from start and end of line
string[] token = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
return new { line, token }; // return anonymous type from
})
.Where(x => x.token.Length == 2) // ignore all lines with more than two fields (invalid data)
.Select(x => new { FieldName = x.token[0], Length = x.token[1] })
.GroupBy(x => x.FieldName) // groups lines by FieldName, every group contains it's Key + all anonymous types which belong to this group
.ToDictionary(xg => xg.Key, xg => string.Join(",", xg.Select(x => x.Length)));
line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries) will split by space and tabs and ignores all empty spaces. Use GroupBy to ensure that all keys are unique in the dictionary. In the case of duplicate field-names the Length will be joined with comma.
Edit: since you have requested a non-LINQ version, here is it:
Dictionary<string, string> allFieldLengths = new Dictionary<string, string>();
bool headerFound = false;
bool dataFound = false;
foreach (string l in File.ReadLines("path"))
{
string line = l.Trim();
if (!headerFound && line.StartsWith("Field Name"))
{
headerFound = true;
// skip this line:
continue;
}
if (!headerFound)
continue;
if (!dataFound && line.Length > 0)
dataFound = true;
if (!dataFound)
continue;
string[] token = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
if (token.Length != 2)
continue;
string fieldName = token[0];
string length = token[1];
string lengthInDict;
if (allFieldLengths.TryGetValue(fieldName, out lengthInDict))
// append this length
allFieldLengths[fieldName] = lengthInDict + "," + length;
else
allFieldLengths.Add(fieldName, length);
}
I like the LINQ version more because it's much more readable and maintainable (imo).

Based on the assumption that the position of the header line is fixed, we may consider actual key-value pairs to start from the 9th line. Then, using the ReadAllLines method to return a String array from the file, we just start processing from index 8 onwards:
string[] lines = File.ReadAllLines(filepath);
Dictionary<string,int> pairs = new Dictionary<string,int>();
for(int i=8;i<lines.Length;i++)
{
string[] pair = Regex.Replace(lines[i],"(\\s)+",";").Split(';');
pairs.Add(pair[0],int.Parse(pair[1]));
}
This is a skeleton, not accounting for exception handling, but I guess it should get you started.

You can use String.StartsWith() to detect "FieldName". Then String.Split() with a parameter of null to split by whitespace. This will get you your fieldname and length strings.

Related

sort array of strings by a part of these strings

I am attempting to sort xls lines by fourth string in lines.
string[] list_lines = System.IO.File.ReadAllLines(#"E:\VS\WriteLines.xls");
// Display the file contents by using a foreach loop.
System.Console.WriteLine("Contents of Your Database = ");
foreach (var line in list_lines)
{
// Use a tab to indent each line of the file.
Console.WriteLine("\t" + line);
}
I am having problems creating algorithm that will identify the fourth element of each line and list content in alphabetical order.
The words in each line are separated by ' '.
Can anyone put me on a right direction please?
EDIT--------------------------
ok,
foreach (var line in list_lines.OrderBy(line => line.Split(' ')[3]))
sorted the problem. Lines are sorted as I need. Excel changes ' ' spaces with ';'. That's why when compiled it was giving error.
Now, I guess, I need to parse each part of string to int since it sorts by first digit and not by a number.
You can split the lines and then use the third item in an OrderBy:
foreach (var line in list_lines.OrderBy(line => line.Split(' ')[3]))
{
}
Well, just sort the array:
string[] list_lines = ...;
// General case: not all strings have 4 parts
Array.Sort(list_lines, (left, right) => {
String[] partsLeft = left.Split(' ');
String[] partsRight = right.Split(' ');
if (partsLeft.Length < 4)
if (partsRight.Length < 4)
return String.Compare(left, right, StringComparison.OrdinalIgnoreCase)
else
return -1;
else if (partsRight.Length < 4)
return 1;
return String.Compare(partsLeft[3], partsRight[3], StringComparison.OrdinalIgnoreCase);
});
If all the lines guaranteed to have 4 items at least it can be simplfied into
Array.Sort(list_lines, (left, right) =>
String.Compare(left.Split(' ')[3],
right.Split(' ')[3],
StringComparison.OrdinalIgnoreCase));

reading in text file and spliting by comma in c#

I have a text file whose format is like this
Number,Name,Age
I want to read "Number" at the first column of this text file into an array to find duplication. here is the two ways i tried to read in the file.
string[] account = File.ReadAllLines(path);
string readtext = File.ReadAllText(path);
But every time i try to split the array to just get whats to the left of the first comma i fail. Have any ideas? Thanks.
You need to explicitly split the data to access its various parts. How would your program otherwise be able to decide that it is separated by commas?
The easiest approach to access the number that comes to my mind goes something like this:
var lines = File.ReadAllLines(path);
var firstLine = lines[0];
var fields = firstLine.Split(',');
var number = fields[0]; // Voilla!
You could go further by parsing the number as an int or another numeric type (if it really is a number). On the other hand, if you just want to test for uniqueness, this is not really necessary.
If you want all duplicate lines according to the Number:
var numDuplicates = File.ReadLines(path)
.Select(l => l.Trim().Split(','))
.Where(arr => arr.Length >= 3)
.Select(arr => new {
Number = arr[0].Trim(),
Name = arr[1].Trim(),
Age = arr[2].Trim()
})
.GroupBy(x => x.Number)
.Where(g => g.Count() > 1);
foreach(var dupNumGroup in numDuplicates)
Console.WriteLine("Number:{0} Names:{1} Ages:{2}"
, dupNumGroup.Key
, string.Join(",", dupNumGroup.Select(x => x.Name))
, string.Join(",", dupNumGroup.Select(x => x.Age)));
If you are looking specifically for a string.split solution, here is a really simple method of doing what you are looking for:
List<int> importedNumbers = new List<int>();
// Read our file in to an array of strings
var fileContents = System.IO.File.ReadAllLines(path);
// Iterate over the strings and split them in to their respective columns
foreach (string line in fileContents)
{
var fields = line.Split(',');
if (fields.Count() < 3)
throw new Exception("We need at least 3 fields per line."); // You would REALLY do something else here...
// You would probably want to be more careful about your int parsing... (use TryParse)
var number = int.Parse(fields[0]);
var name = fields[1];
var age = int.Parse(fields[2]);
// if we already imported this number, continue on to the next record
if (importedNumbers.Contains(number))
continue; // You might also update the existing record at this point instead of just skipping...
importedNumbers.Add(number); // Keep track of numbers we have imported
}

split a string from a text file into another list

Hi i know the Title might sound a little confusing but im reading in a text file with many lines of data
Example
12345 Test
34567 Test2
i read in the text 1 line at a time and add to a list
using (StreamReader reader = new StreamReader("Test.txt"))
{
string line;
while ((line = reader.ReadLine()) != null)
{
list.Add(line);
}
}
how do i then separate the 1234 from the test so i can pull only the first column of data if i need like list(1).pars[1] would be 12345 and list(2).pars[2] would be test2
i know this sounds foggy but i hope someone out there understands
Maybe something like this:
string test="12345 Test";
var ls= test.Split(' ');
This will get you a array of string. You can get them with ls[0] and ls[1].
If you just what the 12345 then ls[0] is the one to choose.
If you're ok with having a list of string[]'s you can simply do this:
var list = new List<string[]>();
using (StreamReader reader = new StreamReader("Test.txt"))
{
string line;
while ((line = reader.ReadLine()) != null)
{
list.Add(line.Split(' '));
}
}
string firstWord = list[0][0]; //12345
string secondWord = list[0][1]; //Test
When you have a string of text you can use the Split() method to split it in many parts. If you're sure every word (separated by one or more spaces) is a column you can simply write:
string[] columns = line.Split(' ');
There are several overloads of that function, you can specify if blank fields are skipped (you may have, for example columns[1] empty in a line composed by 2 words but separated by two spaces). If you're sure about the number of columns you can fix that limit too (so if any text after the last column will be treated as a single field).
In your case (add to the list only the first column) you may write:
if (String.IsNullOrWhiteSpace(line))
continue;
string[] columns = line.TrimLeft().Split(new char[] { ' ' }, 2);
list.Add(columns[0]);
First check is to skip empty or lines composed just of spaces. The TrimLeft() is to remove spaces from beginning of the line (if any). The first column can't be empty (because the TrimLeft() so yo do not even need to use StringSplitOptions.RemoveEmptyEntries with an additional if (columns.Length > 1). Finally, if the file is small enough you can read it in memory with a single call to File.ReadAllLines() and simplify everything with a little of LINQ:
list.Add(
File.ReadAllLines("test.txt")
.Where(x => !String.IsNullOrWhiteSpace(x))
.Select(x => x.TrimLeft().Split(new char[] { ' ' }, 2)[0]));
Note that with the first parameter you can specify more than one valid separator.
When you have multiple spaces
Regex r = new Regex(" +");
string [] splitString = r.Split(stringWithMultipleSpaces);
var splitted = System.IO.File.ReadAllLines("Test.txt")
.Select(line => line.Split(' ')).ToArray();
var list1 = splitted.Select(split_line => split_line[0]).ToArray();
var list2 = splitted.Select(split_line => split_line[1]).ToArray();

What is the best way to parse this string in C#?

I have a string that I am reading from another system. It's basically a long string that represents a list of key value pairs that are separated by a space in between. It looks like this:
key:value[space]key:value[space]key:value[space]
So I wrote this code to parse it:
string myString = ReadinString();
string[] tokens = myString.split(' ');
foreach (string token in tokens) {
string key = token.split(':')[0];
string value = token.split(':')[1];
. . . .
}
The issue now is that some of the values have spaces in them so my "simplistic" split at the top no longer works. I wanted to see how I could still parse out the list of key value pairs (given space as a separator character) now that I know there also could be spaces in the value field as split doesn't seem like it's going to be able to work anymore.
NOTE: I now confirmed that KEYs will NOT have spaces in them so I only have to worry about the values. Apologies for the confusion.
Use this regular expression:
\w+:[\w\s]+(?![\w+:])
I tested it on
test:testvalue test2:test value test3:testvalue3
It returns three matches:
test:testvalue
test2:test value
test3:testvalue3
You can change \w to any character set that can occur in your input.
Code for testing this:
var regex = new Regex(#"\w+:[\w\s]+(?![\w+:])");
var test = "test:testvalue test2:test value test3:testvalue3";
foreach (Match match in regex.Matches(test))
{
var key = match.Value.Split(':')[0];
var value = match.Value.Split(':')[1];
Console.WriteLine("{0}:{1}", key, value);
}
Console.ReadLine();
As Wonko the Sane pointed out, this regular expression will fail on values with :. If you predict such situation, use \w+:[\w: ]+?(?![\w+:]) as the regular expression. This will still fail when a colon in value is preceded by space though... I'll think about solution to this.
This cannot work without changing your split from a space to something else such as a "|".
Consider this:
Alfred Bester:Alfred Bester Alfred:Alfred Bester
Is this Key "Alfred Bester" & value Alfred" or Key "Alfred" & value "Bester Alfred"?
string input = "foo:Foobarius Maximus Tiberius Kirk bar:Barforama zap:Zip Brannigan";
foreach (Match match in Regex.Matches(input, #"(\w+):([^:]+)(?![\w+:])"))
{
Console.WriteLine("{0} = {1}",
match.Groups[1].Value,
match.Groups[2].Value
);
}
Gives you:
foo = Foobarius Maximus Tiberius Kirk
bar = Barforama
zap = Zip Brannigan
You could try to Url encode the content between the space (The keys and the values not the : symbol) but this would require that you have control over the Input Method.
Or you could simply use another format (Like XML or JSON), but again you will need control over the Input Format.
If you can't control the input format you could always use a Regular expression and that searches for single spaces where a word plus : follows.
Update (Thanks Jon Grant)
It appears that you can have spaces in the key and the value. If this is the case you will need to seriously rethink your strategy as even Regex won't help.
string input = "key1:value key2:value key3:value";
Dictionary<string, string> dic = input.Split(' ').Select(x => x.Split(':')).ToDictionary(x => x[0], x => x[1]);
The first will produce an array:
"key:value", "key:value"
Then an array of arrays:
{ "key", "value" }, { "key", "value" }
And then a dictionary:
"key" => "value", "key" => "value"
Note, that Dictionary<K,V> doesn't allow duplicated keys, it will raise an exception in such a case. If such a scenario is possible, use ToLookup().
Using a regular expression can solve your problem:
private void DoSplit(string str)
{
str += str.Trim() + " ";
string patterns = #"\w+:([\w+\s*])+[^!\w+:]";
var r = new System.Text.RegularExpressions.Regex(patterns);
var ms = r.Matches(str);
foreach (System.Text.RegularExpressions.Match item in ms)
{
string[] s = item.Value.Split(new char[] { ':' });
//Do something
}
}
This code will do it (given the rules below). It parses the keys and values and returns them in a Dictonary<string, string> data structure. I have added some code at the end that assumes given your example that the last value of the entire string/stream will be appended with a [space]:
private Dictionary<string, string> ParseKeyValues(string input)
{
Dictionary<string, string> items = new Dictionary<string, string>();
string[] parts = input.Split(':');
string key = parts[0];
string value;
int currentIndex = 1;
while (currentIndex < parts.Length-1)
{
int indexOfLastSpace=parts[currentIndex].LastIndexOf(' ');
value = parts[currentIndex].Substring(0, indexOfLastSpace);
items.Add(key, value);
key = parts[currentIndex].Substring(indexOfLastSpace + 1);
currentIndex++;
}
value = parts[parts.Length - 1].Substring(0,parts[parts.Length - 1].Length-1);
items.Add(key, parts[parts.Length-1]);
return items;
}
Note: this algorithm assumes the following rules:
No spaces in the values
No colons in the keys
No colons in the values
Without any Regex nor string concat, and as an enumerable (it supposes keys don't have spaces, but values can):
public static IEnumerable<KeyValuePair<string, string>> Split(string text)
{
if (text == null)
yield break;
int keyStart = 0;
int keyEnd = -1;
int lastSpace = -1;
for(int i = 0; i < text.Length; i++)
{
if (text[i] == ' ')
{
lastSpace = i;
continue;
}
if (text[i] == ':')
{
if (lastSpace >= 0)
{
yield return new KeyValuePair<string, string>(text.Substring(keyStart, keyEnd - keyStart), text.Substring(keyEnd + 1, lastSpace - keyEnd - 1));
keyStart = lastSpace + 1;
}
keyEnd = i;
continue;
}
}
if (keyEnd >= 0)
yield return new KeyValuePair<string, string>(text.Substring(keyStart, keyEnd - keyStart), text.Substring(keyEnd + 1));
}
I guess you could take your method and expand upon it slightly to deal with this stuff...
Kind of pseudocode:
List<string> parsedTokens = new List<String>();
string[] tokens = myString.split(' ');
for(int i = 0; i < tokens.Length; i++)
{
// We need to deal with the special case of the last item,
// or if the following item does not contain a colon.
if(i == tokens.Length - 1 || tokens[i+1].IndexOf(':' > -1)
{
parsedTokens.Add(tokens[i]);
}
else
{
// This bit needs to be refined to deal with values with multiple spaces...
parsedTokens.Add(tokens[i] + " " + tokens[i+1]);
}
}
Another approach would be to split on the colon... That way, your first array item would be the name of the first key, second item would be the value of the first key and then name of the second key (can use LastIndexOf to split it out), and so on. This would obviously get very messy if the values can include colons, or the keys can contain spaces, but in that case you'd be pretty much out of luck...

Get parameters out of text file

I have a C# asp.net page that has to get username/password info from a text file.
Could someone please tell me how.
The text file looks as follows: (it is actually a lot larger, I just got a few lines)
DATASOURCEFILE=D:\folder\folder
var1= etc
var2= more
var3 = misc
var4 = stuff
USERID = user1
PASSWORD = pwd1
all I need is the UserID and password out of that file.
Thank you for your help,
Steve
This would work:
var dic = File.ReadAllLines("test.txt")
.Select(l => l.Split(new[] { '=' }))
.ToDictionary( s => s[0].Trim(), s => s[1].Trim());
dic is a dictionary, so you easily extract your values, i.e.:
string myUser = dic["USERID"];
string myPassword = dic["PASSWORD"];
Open the file, split on the newline, split again on the = for each item and then add it to a dictionary.
string contents = String.Empty;
using (FileStream fs = File.Open("path", FileMode.OpenRead))
using (StreamReader reader = new StreamReader(fs))
{
contents = reader.ReadToEnd();
}
if (contents.Length > 0)
{
string[] lines = contents.Split(new char[] { '\n' });
Dictionary<string, string> mysettings = new Dictionary<string, string>();
foreach (string line in lines)
{
string[] keyAndValue = line.Split(new char[] { '=' });
mysettings.Add(keyAndValue[0].Trim(), keyAndValue[1].Trim());
}
string test = mysettings["USERID"]; // example of getting userid
}
You can use Regular expressions to extract each variable. You can read one line at a time, or the entire file into one string. If the latter, you just look for a newline in the expression.
Regards,
Morten
Dictionary is not needed.
Old-fashioned parsing can do more, with less executable code, the same amount of compiled data, and less processing:
public string MyPath1;
public string MyPath2;
...
public void ReadConfig(string sConfigFile)
{
MyPath1 = MyPath2 = ""; // Clear the external values (in case the file does not set every parameter).
using (StreamReader sr = new StreamReader(sConfigFile)) // Open the file for reading (and auto-close).
{
while (!sr.EndOfStream)
{
string sLine = sr.ReadLine().Trim(); // Read the next line. Trim leading and trailing whitespace.
// Treat lines with NO "=" as comments (ignore; no syntax checking).
// Treat lines with "=" as the first character as comments too.
// Treat lines with "=" as the 2nd character or after as parameter lines.
// Side-benefit: Values containing "=" are processed correctly.
int i = sLine.IndexOf("="); // Find the first "=" in the line.
if (i <= 0) // IF the first "=" in the line is the first character (or not present),
continue; // the line is not a parameter line. Ignore it. (Iterate the while.)
string sParameter = sLine.Remove(i).TrimEnd(); // All before the "=" is the parameter name. Trim whitespace.
string sValue = sLine.Substring(i + 1).TrimStart(); // All after the "=" is the value. Trim whitespace.
// Extra characters before a parameter name are usually intended to comment it out. Here, we keep them (with or without whitespace between). That makes an unrecognized parameter name, which is ignored (acts as a comment, as intended).
// Extra characters after a value are usually intended as comments. Here, we trim them only if whitespace separates. (Parsing contiguous comments is too complex: need delimiter(s) and then a way to escape delimiters (when needed) within values.) Side-drawback: Values cannot contain " ".
i = sValue.IndexOfAny(new char[] {' ', '\t'}); // Find the first " " or tab in the value.
if (i > 1) // IF the first " " or tab is the second character or after,
sValue = sValue.Remove(i); // All before the " " or tab is the parameter. (Discard the rest.)
// IF a desired parameter is specified, collect it:
// (Could detect here if any parameter is set more than once.)
if (sParameter == "MyPathOne")
MyPath1 = sValue;
else if (sParameter == "MyPathTwo")
MyPath2 = sValue;
// (Could detect here if an invalid parameter name is specified.)
// (Could exit the loop here if every parameter has been set.)
} // end while
// (Could detect here if the config file set neither parameter or only one parameter.)
} // end using
}

Categories