Parsing tree in C# - c#

I have a [textual] tree like this:
+---step-1
| +---step_2
| | +---step3
| | \---step4
| +---step_2.1
| \---step_2.2
+---step1.2
Tree2
+---step-1
| \---step_2
| | +---step3
| | \---step4
+---step1.2
This is just a small example, tree can be deeper and with more children and etc..
Right now I'm doing this:
for (int i = 0; i < cmdOutList.Count; i++)
{
string s = cmdOutList[i];
String value = Regex.Match(s, #"(?<=\---).*").Value;
value = value.Replace("\r", "");
if (s[1].ToString() == "-")
{
DirectoryNode p = new DirectoryNode { Name = value };
//p.AddChild(f);
directoryList.Add(p);
}
else
{
DirectoryNode f = new DirectoryNode { Name = value };
directoryList[i - 1].AddChild(f);
directoryList.Add(f);
}
}
But this doesn't handle the "step_2.1" and "step_2.2"
I think I'm doing this totally wrong, maybe someone can help me out with this.
EDIT:
Here is the DirectoryNode class to make that a bit more clear..
public class DirectoryNode
{
public DirectoryNode()
{
this.Children = new List<DirectoryNode>();
}
public DirectoryNode ParentObject { get; set; }
public string Name;
public List<DirectoryNode> Children { get; set; }
public void AddChild(DirectoryNode child)
{
child.ParentObject = this;
this.Children.Add(child);
}
}

If your text is that simple (just either +--- or \--- preceded by a series of |), then a regex might be more than you need (and what's tripping you up).
DirectoryNode currentParent = null;
DirectoryNode current = null;
int lastStartIndex = 0;
foreach(string temp in cmdOutList)
{
string line = temp;
int startIndex = Math.Max(line.IndexOf("+"), line.IndexOf(#"\");
line = line.Substring(startIndex);
if(startIndex > lastStartIndex)
{
currentParent = current;
}
else if(startIndex < lastStartIndex)
{
for(int i = 0; i < (lastStartIndex - startIndex) / 4; i++)
{
if(currentParent == null) break;
currentParent = currentParent.ParentObject;
}
}
lastStartIndex = startIndex;
current = new DirectoryNode() { Name = line.Substring(4) };
if(currentParent != null)
{
currentParent.AddChild(current);
}
else
{
directoryList.Add(current);
}
}

Regex definitely looks unnecessary here, since the symbols in your markup language (that's what it is, after all) are both static and few. That is: Although the label names may vary, the tokens you need to look for when trying to parse them into relevant pieces will never be anything other than +---, \---, and ..
From a question I answered yesterday: "Regexes are extremely useful for describing a whole class of needles in a largely unknown haystack, but they're not the right tool for input that's in a very static format."
String manipulation is what you want for parsing this, especially since you're dealing with a recursive markup language, which can't be fully understood by regex anyway. I'd also suggest creating a tree-type data structure to store the data (which, surprisingly, doesn't seem to be included in the framework unless they added it after 2.0).
As an aside, your regex above seems to have an unnecessary \ in it, but that doesn't matter in most regex flavors.

Related

How do you find out if PDF Textbox field is multi-line using ABCpdf?

I can loop through all the fields in a PDF using ABCpdf using the GetFieldNames() collection and get their properties but the one I can't seem to get is whether or not the field is multi-line text field or not. Is there any example out there of how to find this property? My code is below if it's helpful but it's probably unnecessary.
....
foreach (string fieldName in doc.Form.GetFieldNames())
{
WebSupergoo.ABCpdf9.Objects.Field f = doc.Form[fieldName];
dt = GetFieldInstances(dt,f);
}
....
private static DocumentTemplate GetFieldInstances(DocumentTemplate dt, WebSupergoo.ABCpdf9.Objects.Field f)
{
Field field;
Instance inst = new Instance();
int instanceCount = 0;
bool fieldAlreadyExists = dt.Fields.Any(currentField => currentField.Name == f.Name);
if (!fieldAlreadyExists)
{
field = new Field();
field.Name = f.Name;
field.Value = f.Value;
field.Format = f.Format == null ? null : f.Format;
field.PartialName = f.PartialName;
field.TypeID = (int)f.FieldType;
//field.IsMultiline =
//field.IsRequired =
}
else
{
field = (from currentField in dt.Fields where currentField.Name == f.Name select currentField).SingleOrDefault();
instanceCount = field.Instances.Count();
}
if ((Field.FieldTypes)f.FieldType == Field.FieldTypes.Radio || (Field.FieldTypes)f.FieldType == Field.FieldTypes.Checkbox)
{
inst.ExportValue = f.Options[instanceCount];
}
if (f.Kids.Count() > 0)
{
f = f.Kids[instanceCount];
}
inst.Bottom = (int)f.Rect.Bottom;
inst.Height = (int)f.Rect.Height;
inst.Left = (int)f.Rect.Left;
inst.Width = (int)f.Rect.Width;
inst.PageNumber = f.Page.PageNumber;
field.Instances.Add(inst);
if (!fieldAlreadyExists)
{
dt.Fields.Add(field);
}
return dt;
}
I figured it out:
public bool GetIsMultiLine(WebSupergoo.ABCpdf9.Objects.Field field)
{
var flags = Atom.GetInt(Atom.GetItem(field.Atom, "Ff"));
var isMultiLine = GetIsBitFlagSet(flags, 13);
return isMultiLine;
}
public bool GetIsRequired(WebSupergoo.ABCpdf9.Objects.Field field)
{
var flags = Atom.GetInt(Atom.GetItem(field.Atom, "Ff"));
var isRequired = GetIsBitFlagSet(flags, 2);
return isRequired;
}
public bool GetIsReadOnly(WebSupergoo.ABCpdf9.Objects.Field field)
{
var flags = Atom.GetInt(Atom.GetItem(field.Atom, "Ff"));
var isReadOnly = GetIsBitFlagSet(flags, 1);
return isReadOnly;
}
private static bool GetIsBitFlagSet(int b, int pos)
{
return (b & (1 << (pos - 1))) != 0;
}
In case your not familiar with unsigned integer / binary conversions, I found this site really helpful to understand.
For example, let's say that the integer returned for a field's flags equals 4096. If you enter that into the online coversion tool in that website, it will show you that the 13th bit position is turned on (1 instead of a 0 in the 13th position starting from the right).
In the ABCPdf guide, it says that Multi-Line is bit position 13 so you know that field is a Multi-Line field.
Likewise for the 2nd position for Required and the 1st position for Read-only.

Regular expression can't handle rogue square brackets

Thanks to the smarties on here in the past I have this amazing recursive regular expression that helps me to transform custom BBCode-style tags in a block of text.
/// <summary>
/// Static class containing common regular expression strings.
/// </summary>
public static class RegularExpressions
{
/// <summary>
/// Expression to find all root-level BBCode tags. Use this expression recursively to obtain nested tags.
/// </summary>
public static string BBCodeTags
{
get
{
return #"
(?>
\[ (?<tag>[^][/=\s]+) \s*
(?: = \s* (?<val>[^][]*) \s*)?
]
)
(?<content>
(?>
\[(?<innertag>[^][/=\s]+)[^][]*]
|
\[/(?<-innertag>\k<innertag>)]
|
[^][]+
)*
(?(innertag)(?!))
)
\[/\k<tag>]
";
}
}
}
This regex works beautifully, recursively matching on all tags. Like this:
[code]
some code
[b]some text [url=http://www.google.com]some link[/url][/b]
[/code]
The regex does exactly what I want and matches the [code] tag. It breaks it up into three groups: tag, optional value, and content. Tag being the tag name ("code" in this case). Optional value being a value after the equals(=) sign if there is one. And content being everything between the opening and closing tag.
The regex can be used recursively to match nested tags. So after matching on [code] I would run it again against the content group and it would match the [b] tag. If I ran it again on the next content group it would then match the [url] tag.
All of that is wonderful and delicious but it hiccups on one issue. It can't handle rogue square brackets.
[code]This is a successful match.[/code]
[code]This is an [ unsuccessful match.[/code]
[code]This is also an [unsuccessful] match.[/code]
I really suck at regular expressions but if anyone knows how I might tweak this regex to correctly ignore rogue brackets (brackets that do not make up an opening tag and/or do not have a matching closing tag) so that it still matches the surrounding tags, I would be very appreciative :D
Thanks in advance!
Edit
If you are interested in seeing the method where I use this expression you are welcome to.
I did a program that can parse your strings in a debugable, developer-friendly way. It is not a small code like those regexes, but it has a positive side: you can debug it, and fine tune it as you need.
The implementation is a descent recursive parser, but if you need some kind of contextual data, you can place it all inside the ParseContext class.
It is quite long, but I consider it as being better than a a regex based solution.
To test it, create a console application, and replace all the code inside Program.cs with the following code:
using System.Collections.Generic;
namespace q7922337
{
static class Program
{
static void Main(string[] args)
{
var result1 = Match.ParseList<TagsGroup>("[code]This is a successful match.[/code]");
var result2 = Match.ParseList<TagsGroup>("[code]This is an [ unsuccessful match.[/code]");
var result3 = Match.ParseList<TagsGroup>("[code]This is also an [unsuccessful] match.[/code]");
var result4 = Match.ParseList<TagsGroup>(#"
[code]
some code
[b]some text [url=http://www.google.com]some link[/url][/b]
[/code]");
}
class ParseContext
{
public string Source { get; set; }
public int Position { get; set; }
}
abstract class Match
{
public override string ToString()
{
return this.Text;
}
public string Source { get; set; }
public int Start { get; set; }
public int Length { get; set; }
public string Text { get { return this.Source.Substring(this.Start, this.Length); } }
protected abstract bool ParseInternal(ParseContext context);
public bool Parse(ParseContext context)
{
var result = this.ParseInternal(context);
this.Length = context.Position - this.Start;
return result;
}
public bool MarkBeginAndParse(ParseContext context)
{
this.Start = context.Position;
var result = this.ParseInternal(context);
this.Length = context.Position - this.Start;
return result;
}
public static List<T> ParseList<T>(string source)
where T : Match, new()
{
var context = new ParseContext
{
Position = 0,
Source = source
};
var result = new List<T>();
while (true)
{
var item = new T { Source = source, Start = context.Position };
if (!item.Parse(context))
break;
result.Add(item);
}
return result;
}
public static T ParseSingle<T>(string source)
where T : Match, new()
{
var context = new ParseContext
{
Position = 0,
Source = source
};
var result = new T { Source = source, Start = context.Position };
if (result.Parse(context))
return result;
return null;
}
protected List<T> ReadList<T>(ParseContext context)
where T : Match, new()
{
var result = new List<T>();
while (true)
{
var item = new T { Source = this.Source, Start = context.Position };
if (!item.Parse(context))
break;
result.Add(item);
}
return result;
}
protected T ReadSingle<T>(ParseContext context)
where T : Match, new()
{
var result = new T { Source = this.Source, Start = context.Position };
if (result.Parse(context))
return result;
return null;
}
protected int ReadSpaces(ParseContext context)
{
int startPos = context.Position;
int cnt = 0;
while (true)
{
if (startPos + cnt >= context.Source.Length)
break;
if (!char.IsWhiteSpace(context.Source[context.Position + cnt]))
break;
cnt++;
}
context.Position = startPos + cnt;
return cnt;
}
protected bool ReadChar(ParseContext context, char p)
{
int startPos = context.Position;
if (startPos >= context.Source.Length)
return false;
if (context.Source[startPos] == p)
{
context.Position = startPos + 1;
return true;
}
return false;
}
}
class Tag : Match
{
protected override bool ParseInternal(ParseContext context)
{
int startPos = context.Position;
if (!this.ReadChar(context, '['))
return false;
this.ReadSpaces(context);
if (this.ReadChar(context, '/'))
this.IsEndTag = true;
this.ReadSpaces(context);
var validName = this.ReadValidName(context);
if (validName != null)
this.Name = validName;
this.ReadSpaces(context);
if (this.ReadChar(context, ']'))
return true;
context.Position = startPos;
return false;
}
protected string ReadValidName(ParseContext context)
{
int startPos = context.Position;
int endPos = startPos;
while (char.IsLetter(context.Source[endPos]))
endPos++;
if (endPos == startPos) return null;
context.Position = endPos;
return context.Source.Substring(startPos, endPos - startPos);
}
public bool IsEndTag { get; set; }
public string Name { get; set; }
}
class TagsGroup : Match
{
public TagsGroup()
{
}
protected TagsGroup(Tag openTag)
{
this.Start = openTag.Start;
this.Source = openTag.Source;
this.OpenTag = openTag;
}
protected override bool ParseInternal(ParseContext context)
{
var startPos = context.Position;
if (this.OpenTag == null)
{
this.ReadSpaces(context);
this.OpenTag = this.ReadSingle<Tag>(context);
}
if (this.OpenTag != null)
{
int textStart = context.Position;
int textLength = 0;
while (true)
{
Tag tag = new Tag { Source = this.Source, Start = context.Position };
while (!tag.MarkBeginAndParse(context))
{
if (context.Position >= context.Source.Length)
{
context.Position = startPos;
return false;
}
context.Position++;
textLength++;
}
if (!tag.IsEndTag)
{
var tagGrpStart = context.Position;
var tagGrup = new TagsGroup(tag);
if (tagGrup.Parse(context))
{
if (textLength > 0)
{
if (this.Contents == null) this.Contents = new List<Match>();
this.Contents.Add(new Text { Source = this.Source, Start = textStart, Length = textLength });
textStart = context.Position;
textLength = 0;
}
this.Contents.Add(tagGrup);
}
else
{
textLength += tag.Length;
}
}
else
{
if (tag.Name == this.OpenTag.Name)
{
if (textLength > 0)
{
if (this.Contents == null) this.Contents = new List<Match>();
this.Contents.Add(new Text { Source = this.Source, Start = textStart, Length = textLength });
textStart = context.Position;
textLength = 0;
}
this.CloseTag = tag;
return true;
}
else
{
textLength += tag.Length;
}
}
}
}
context.Position = startPos;
return false;
}
public Tag OpenTag { get; set; }
public Tag CloseTag { get; set; }
public List<Match> Contents { get; set; }
}
class Text : Match
{
protected override bool ParseInternal(ParseContext context)
{
return true;
}
}
}
}
If you use this code, and someday find that you need optimizations because the parser has become ambiguous, then try using a dictionary in the ParseContext, take a look here for more info: http://en.wikipedia.org/wiki/Top-down_parsing in the topic Time and space complexity of top-down parsing. I find it very interesting.
The first change is pretty simple - you can get it by changing [^][]+, which is responsible for matching the free text, to .. This seems a little crazy, perhaps, but it's actually safe, because you are using a possessive group (?> ), so all the valid tags will be matched by the first alternation - \[(?<innertag>[^][/=\s]+)[^][]*] - and cannot backtrack and break the tags.
(Remember to enable the Singleline flag, so . matches newlines)
The second requirement, [unsuccessful], seems to go against your goal it. The whole idea from the very start is not to match these unclosed tags. If you allow unclosed tags, all matches of the form \[(.*?)\].*?[/\1] become valid. Not good. At best, you can try to whitelist a few tags which are not allowed to be matched.
An example of both changes:
(?>
\[ (?<tag>[^][/=\s]+) \s*
(?: = \s* (?<val>[^][]*) \s*)?
\]
)
(?<content>
(?>
\[(?:unsuccessful)\] # self closing
|
\[(?<innertag>[^][/=\s]+)[^][]*]
|
\[/(?<-innertag>\k<innertag>)]
|
.
)*
(?(innertag)(?!))
)
\[/\k<tag>\]
Working example on Regex Hero
Ok. Here's another attempt. This one is a little more complicated.
The idea is to match the whole text from start to ext, and parse it to a single Match. While rarely used as such, .Net Balancing Groups allow you to fine tune your captures, remembering all positions and captures exactly the way you want them.
The pattern I came up with is:
\A
(?<StartContentPosition>)
(?:
# Open tag
(?<Content-StartContentPosition>) # capture the content between tags
(?<StartTagPosition>) # Keep the starting postion of the tag
(?>\[(?<TagName>[^][/=\s]+)[^\]\[]*\]) # opening tag
(?<StartContentPosition>) # start another content capture
|
# Close tag
(?<Content-StartContentPosition>) # capture the content in the tag
\[/\k<TagName>\](?<Tag-StartTagPosition>) # closing tag, keep the content in the <tag> group
(?<-TagName>)
(?<StartContentPosition>) # start another content capture
|
. # just match anything. The tags are first, so it should match
# a few if it can. (?(TagName)(?!)) keeps this in line, so
# unmatched tags will not mess with the resul
)*
(?<Content-StartContentPosition>) # capture the content after the last tag
\Z
(?(TagName)(?!))
Remember - the balancing group (?<A-B>) captures into A all text since B was last captured (and pops that position from B's stack).
Now you can parse the string using:
Match match = Regex.Match(sample, pattern, RegexOptions.Singleline |
RegexOptions.IgnorePatternWhitespace);
Your interesting data will be on match.Groups["Tag"].Captures, which contains all tags (some of them are contained in others), and match.Groups["Content"].Captures, which contains tag's contents, and contents between tags. For example, without all blanks, it contains:
some code
some text
This is also an successful match.
This is also an [ unsuccessful match.
This is also an [unsuccessful] match.
This is pretty close to a full parsed document, but you'll still have to play with indices and length to figure out the exact order and structure of the document (though it isn't more complex than sorting all captures)
At this point I'll state what others have said - it may be a good time to write a parser, this pattern isn't pretty...

String Replacements for Word Merge

using asp.net 4
we do a lot of Word merges at work. rather than using the complicated conditional statements of Word i want to embed my own syntax. something like:
Dear Mr. { select lastname from users where userid = 7 },
Your invoice for this quarter is: ${ select amount from invoices where userid = 7 }.
......
ideally, i'd like this to get turned into:
string.Format("Dear Mr. {0}, Your invoice for this quarter is: ${1}", sqlEval[0], sqlEval[1]);
any ideas?
Well, I don't really recommend rolling your own solution for this, however I will answer the question as asked.
First, you need to process the text and extract the SQL statements. For that you'll need a simple parser:
/// <summary>Parses the input string and extracts a unique list of all placeholders.</summary>
/// <remarks>
/// This method does not handle escaping of delimiters
/// </remarks>
public static IList<string> Parse(string input)
{
const char placeholderDelimStart = '{';
const char placeholderDelimEnd = '}';
var characters = input.ToCharArray();
var placeHolders = new List<string>();
string currentPlaceHolder = string.Empty;
bool inPlaceHolder = false;
for (int i = 0; i < characters.Length; i++)
{
var currentChar = characters[i];
// Start of a placeholder
if (!inPlaceHolder && currentChar == placeholderDelimStart)
{
currentPlaceHolder = string.Empty;
inPlaceHolder = true;
continue;
}
// Start of a placeholder when we already have one
if (inPlaceHolder && currentChar == placeholderDelimStart)
throw new InvalidOperationException("Unexpected character detected at position " + i);
// We found the end marker while in a placeholder - we're done with this placeholder
if (inPlaceHolder && currentChar == placeholderDelimEnd)
{
if (!placeHolders.Contains(currentPlaceHolder))
placeHolders.Add(currentPlaceHolder);
inPlaceHolder = false;
continue;
}
// End of a placeholder with no matching start
if (!inPlaceHolder && currentChar == placeholderDelimEnd)
throw new InvalidOperationException("Unexpected character detected at position " + i);
if (inPlaceHolder)
currentPlaceHolder += currentChar;
}
return placeHolders;
}
Okay, so that will get you a list of SQL statements extracted from the input text. You'll probably want to tweak it to use properly typed parser exceptions and some input guards (which I elided for clarity).
Now you just need to replace those placeholders with the results of the evaluated SQL:
// Sample input
var input = "Hello Mr. {select firstname from users where userid=7}";
string output = input;
var extractedStatements = Parse(input);
foreach (var statement in extractedStatements)
{
// Execute the SQL statement
var result = Evaluate(statement);
// Update the output with the result of the SQL statement
output = output.Replace("{" + statement + "}", result);
}
This is obviously not the most efficient way to do this, but I think it sufficiently demonstrates the concept without muddying the waters.
You'll need to define the Evaluate(string) method. This will handle executing the SQL.
I just finished building a proprietary solution like this for a law firm here.
I evaluated a product called Windward reports. It's a tad pricy, esp if you need a lot of copies, but for one user it's not bad.
it can pull from XML or SQL data sources (or more if I remember).
Might be worth a look (and no I don't work for 'em, just evaluated their stuff)
You might want to check out the razor engine project on codeplex
http://razorengine.codeplex.com/
Using SQL etc within your template looks like a bad idea. I'd suggest you make a ViewModel for each template.
The Razor thing is really easy to use. Just add a reference, import the namespace, and call the Parse method like so:
(VB guy so excuse syntax!)
MyViewModel myModel = new MyViewModel("Bob",150.00); //set properties
string myTemplate = "Dear Mr. #Model.FirstName, Your invoice for this quarter is: #Model.InvoiceAmount";
string myOutput = Razor.Parse(myTemplate, myModel);
Your string can come from anywhere - I use this with my templates stored in a database, you could equally load it from files or whatever. It's very powerful as a view engine, you can do conditional stuff, loops, etc etc.
i ended up rolling my own solution but thanks. i really dislike if statements. i'll need to refactor them out. here it is:
var mailingMergeString = new MailingMergeString(input);
var output = mailingMergeString.ParseMailingMergeString();
public class MailingMergeString
{
private string _input;
public MailingMergeString(string input)
{
_input = input;
}
public string ParseMailingMergeString()
{
IList<SqlReplaceCommand> sqlCommands = new List<SqlReplaceCommand>();
var i = 0;
const string openBrace = "{";
const string closeBrace = "}";
while (string.IsNullOrWhiteSpace(_input) == false)
{
var sqlReplaceCommand = new SqlReplaceCommand();
var open = _input.IndexOf(openBrace) + 1;
var close = _input.IndexOf(closeBrace);
var length = close != -1 ? close - open : _input.Length;
var newInput = _input.Substring(close + 1);
var nextClose = newInput.Contains(openBrace) ? newInput.IndexOf(openBrace) : newInput.Length;
if (i == 0 && open > 0)
{
sqlReplaceCommand.Text = _input.Substring(0, open - 1);
_input = _input.Substring(open - 1);
}
else
{
sqlReplaceCommand.Command = _input.Substring(open, length);
sqlReplaceCommand.PlaceHolder = openBrace + i + closeBrace;
sqlReplaceCommand.Text = _input.Substring(close + 1, nextClose);
sqlReplaceCommand.NewInput = _input.Substring(close + 1);
_input = newInput.Contains(openBrace) ? sqlReplaceCommand.NewInput : string.Empty;
}
sqlCommands.Add(sqlReplaceCommand);
i++;
}
return sqlCommands.GetParsedString();
}
internal class SqlReplaceCommand
{
public string Command { get; set; }
public string SqlResult { get; set; }
public string PlaceHolder { get; set; }
public string Text { get; set; }
protected internal string NewInput { get; set; }
}
}
internal static class SqlReplaceExtensions
{
public static string GetParsedString(this IEnumerable<MailingMergeString.SqlReplaceCommand> sqlCommands)
{
return sqlCommands.Aggregate("", (current, replaceCommand) => current + (replaceCommand.PlaceHolder + replaceCommand.Text));
}
}

get div element contents in C#

I have a moderately well-formatted HTML document. It is not XHTML so it's not valid XML. Given a offset of the opening tag I need to obtain contents of this tag, considering that it can have multiple nested tags inside of it.
What is the easiest way to solve this problem with a minimum amount of C# code that doesn't involve using non-standard libraries?
You can strip your html content using following function
public static string StripHTMLTag(string strHTML)
{
return Regex.Replace(strHTML, "<(.|\n)*?>", "");
}
pass your content of outer tag, this will strip all html tags and provide you only content.
Hope this helps
Imran
I ended up writing the following function. It seems to get the job done for my purposes.
I know that it's kind of dirty, but so is the HTML code of most web-pages.
If anyone can point out principal flaws, please do so:
private static readonly Regex rxDivTag = new Regex(
#"<(?<close>/)?div(\s[^>]*?)?(?<selfClose>/)?>",
RegexOptions.Compiled | RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase | RegexOptions.Singleline);
private const string RXCAP_DIVTAG_CLOSE = "close";
private const string RXCAP_DIVTAG_SELFCLOSE = "selfClose";
private static List<string> GetProductDivs(string pageText, int start)
{
bool success = true;
int curr = start + 1;
for (Match matchNextTag = rxDivTag.Match(pageText, curr) ; depth > 0 ; matchNextTag = rxDivTag.Match(pageText, curr))
{
if (matchNextTag == Match.Empty)
{
success = false;
break;
}
if (matchNextTag.Groups[RXCAP_DIVTAG_CLOSE].Success)
{
if (matchNextTag.Groups[RXCAP_DIVTAG_SELFCLOSE].Success)
{
success = false;
break;
}
--depth;
}
else if (!matchNextTag.Groups[RXCAP_DIVTAG_SELFCLOSE].Success)
{
++depth;
}
curr = matchNextTag.Index + matchNextTag.Length;
}
if (success)
{
return pageText.Substring(start, curr - start);
}
else
{
return null;
}
}

Please critique my class

I've taken a few school classes along time ago on and to be honest i never really understood the concept of classes. I recently "got back on the horse" and have been trying to find some real world application for creating a class.
you may have seen that I'm trying to parse a lot of family tree data that is in an very old and antiquated format called gedcom
I created a Gedcom Reader class to read in the file , process it and make it available as two lists that contain the data that i found necessary to use
More importantly to me is i created a class to do it so I would very much like to get the experts here to tell me what i did right and what i could have done better ( I wont say wrong because the thing works and that's good enough for me)
Class:
using System;
using System.Collections.Generic;
using System.Text;
using System.IO;
namespace GedcomReader
{
class Gedcom
{
private string GedcomText = "";
public struct INDI
{
public string ID;
public string Name;
public string Sex;
public string BDay;
public bool Dead;
}
public struct FAM
{
public string FamID;
public string Type;
public string IndiID;
}
public List<INDI> Individuals = new List<INDI>();
public List<FAM> Families = new List<FAM>();
public Gedcom(string fileName)
{
using (StreamReader SR = new StreamReader(fileName))
{
GedcomText = SR.ReadToEnd();
}
ReadGedcom();
}
private void ReadGedcom()
{
string[] Nodes = GedcomText.Replace("0 #", "\u0646").Split('\u0646');
foreach (string Node in Nodes)
{
string[] SubNode = Node.Replace("\r\n", "\r").Split('\r');
if (SubNode[0].Contains("INDI"))
{
Individuals.Add(ExtractINDI(SubNode));
}
else if (SubNode[0].Contains("FAM"))
{
Families.Add(ExtractFAM(SubNode));
}
}
}
private FAM ExtractFAM(string[] Node)
{
string sFID = Node[0].Replace("# FAM", "");
string sID = "";
string sType = "";
foreach (string Line in Node)
{
// If node is HUSB
if (Line.Contains("1 HUSB "))
{
sType = "PAR";
sID = Line.Replace("1 HUSB ", "").Replace("#", "").Trim();
}
//If node for Wife
else if (Line.Contains("1 WIFE "))
{
sType = "PAR";
sID = Line.Replace("1 WIFE ", "").Replace("#", "").Trim();
}
//if node for multi children
else if (Line.Contains("1 CHIL "))
{
sType = "CHIL";
sID = Line.Replace("1 CHIL ", "").Replace("#", "");
}
}
FAM Fam = new FAM();
Fam.FamID = sFID;
Fam.Type = sType;
Fam.IndiID = sID;
return Fam;
}
private INDI ExtractINDI(string[] Node)
{
//If a individual is found
INDI I = new INDI();
if (Node[0].Contains("INDI"))
{
//Create new Structure
//Add the ID number and remove extra formating
I.ID = Node[0].Replace("#", "").Replace(" INDI", "").Trim();
//Find the name remove extra formating for last name
I.Name = Node[FindIndexinArray(Node, "NAME")].Replace("1 NAME", "").Replace("/", "").Trim();
//Find Sex and remove extra formating
I.Sex = Node[FindIndexinArray(Node, "SEX")].Replace("1 SEX ", "").Trim();
//Deterine if there is a brithday -1 means no
if (FindIndexinArray(Node, "1 BIRT ") != -1)
{
// add birthday to Struct
I.BDay = Node[FindIndexinArray(Node, "1 BIRT ") + 1].Replace("2 DATE ", "").Trim();
}
// deterimin if there is a death tag will return -1 if not found
if (FindIndexinArray(Node, "1 DEAT ") != -1)
{
//convert Y or N to true or false ( defaults to False so no need to change unless Y is found.
if (Node[FindIndexinArray(Node, "1 DEAT ")].Replace("1 DEAT ", "").Trim() == "Y")
{
//set death
I.Dead = true;
}
}
}
return I;
}
private int FindIndexinArray(string[] Arr, string search)
{
int Val = -1;
for (int i = 0; i < Arr.Length; i++)
{
if (Arr[i].Contains(search))
{
Val = i;
}
}
return Val;
}
}
}
Implementation:
using System;
using System.Windows.Forms;
using GedcomReader;
namespace WindowsFormsApplication1
{
public partial class Form1 : Form
{
public Form1()
{
InitializeComponent();
}
private void button1_Click(object sender, EventArgs e)
{
string path = #"C:\mostrecent.ged";
string outpath = #"C:\gedcom.txt";
Gedcom GD = new Gedcom(path);
GraphvizWriter GVW = new GraphvizWriter("Family Tree");
foreach(Gedcom.INDI I in GD.Individuals)
{
string color = "pink";
if (I.Sex == "M")
{
color = "blue";
}
GVW.ListNode(I.ID, I.Name, "filled", color, "circle");
if (I.ID == "ind23800")
{MessageBox.Show("stop");}
//"ind23800" [ label="Sarah Mandley",shape="circle",style="filled",color="pink" ];
}
foreach (Gedcom.FAM F in GD.Families)
{
if (F.Type == "par")
{
GVW.ConnNode(F.FamID, F.IndiID);
}
else if (F.Type =="chil")
{
GVW.ConnNode(F.IndiID, F.FamID);
}
}
string x = GVW.SB.ToString();
GVW.SaveFile(outpath);
MessageBox.Show("done");
}
}
I am particularly interested in if anything could be done about the structures i don't know if how i use them in the implementation is the greatest but again it works
Thanks alot
Quick thoughts:
Nested types should not be visible.
ValueTypes (structs) should be immutable.
Fields (class variables) should not be public. Expose them via properties instead.
Check passed arguments for invalid values, like null.
It might be more readable. It's hard to read and understand.
You may study SOLID principles (http://butunclebob.com/ArticleS.UncleBob.PrinciplesOfOod)
Robert C. Martin gave good presentation on Oredev 2008 about clean code (http://www.oredev.org/topmenu/video/agile/robertcmartincleancodeiiifunctions.4.5a2d30d411ee6ffd2888000779.html)
Some recomended books to read about code readability:
Kent Beck "Implemetation patterns"
Robert C Martin "Clean Code" Robert C
Martin "Agile Principles, Patterns
and Practices in C#"
I suggest you check this place out: http://refactormycode.com/.
For some quick things, your naming is the biggest thing I would start to change.
No need to use ALL-CAPS or abbreviated terms.
Also, FxCop will help with a lot of suggested changes. For example, FindIndexinArray would be named FindIndexInArray.
EDIT:
I don't know if this is a bug in your code or by-design, but in FindIndexinArray, you don't break from your loop once you find a match. Do you want the first (break) or last (no break) match in the array?

Categories