I have a flat file that looks like this:
.root:NODE
.root.branch1:NODE
.root.branch1.size:INT
.root.branch1.name:STRING
.root.branch2:NODE
.root.branch2.flavor:NODE
.root.branch2.flavor.cost:INT
.root.branch2.flavor.name:STRING
The file's contents, depth, length, etc. will be different every time, so I can't hardcode anything (though the nodes will always be of datatype 'NODE'). I need to bring it into C# as a data source. I'm not sure of the best way to parse the file and convert it into a structure that looks like
+root
+branch1
size:
name:
+branch2
+flavor
cost:
name:
etc. Ideally, I'd like to dynamically build a TreeView control that the user can use to select the node they'd like to access. These tags are paths to an actual data source, so elsewhere in the code I'm using:
int iVal = somefunction.readvar(".root.branch2.flavor.cost");
EDIT: If it helps, the file I'm trying to parse is a *.SYM file (a symbols file) generated by a TwinCAT 2 program. There's a little documentation here: http://infosys.beckhoff.com/content/1033/tcplccontrol/html/tcplcctrl_componentsoptions.htm#Symbol%20configuration
C# :
private void AddToTreeView()
{
TreeViewItem root = new TreeViewItem();
foreach (string line in ReadLines())
{
var parts = line.Split(new char[] { '.' },StringSplitOptions.RemoveEmptyEntries);
if (parts.Length == 1)
{
root.Header = parts[0].Split(':')[0];
root.Tag = line;
}
else
{
TreeViewItem node = root;
foreach (var part in parts)
{
var header = part.Split(new char[] { ':' }, StringSplitOptions.RemoveEmptyEntries)[0];
if(!IfExists(node, header, ref node))
{
node.Items.Add(new TreeViewItem()
{
Header = header,
Tag = line
});
node = root;
}
}
}
}
treeView.Items.Add(root);
}
private bool IfExists(TreeViewItem itm, string header, ref TreeViewItem which)
{
if (itm.Header as string == header)
{
which = itm;
return true;
}
foreach (TreeViewItem i in itm.Items)
{
if (i.Header as string == header)
{
which = i;
return true;
}
else if (i.HasItems)
{
if (IfExists(i, header, ref which))
return true;
}
}
return false;
}
XAML:
<TreeView x:Name="treeView" Height="100">
</TreeView>
This code works, but it may have some flaws and is not a perfect solution. Even if it doesn't work perfectly, it should help you get started.
EDIT:
The original version sometimes fails, e.g. if the nodes aren't ordered.
I changed this part of the program, so it should no longer have that error:
if (!IfExists(node, header, ref node))
{
var currNode = new TreeViewItem()
{
Header = header,
Tag = line
};
node.Items.Add(currNode);
if (part.Contains(":"))
node = root;
else
node = currNode;
}
The previous code gives the wrong output if the input is:
.root:NODE
.root.branch1:NODE
.root.branch2.flavor.name:STRING
.root.branch1.size:INT
.root.branch1.name:STRING
.root.branch2:NODE
.root.branch2.flavor:NODE
.root.branch2.flavor.cost:INT
But with the changed code, it gives output exactly as it should!
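As a complement, the same parsing can be done without any UI types, which makes it easier to test and to bind to a TreeView afterwards. This is only a minimal sketch: the SymbolNode and SymbolTree names are hypothetical, and it assumes every line has the form ".path.to.item:TYPE". Each path segment is looked up only among the direct children of the current node, so repeated names such as "name" under different branches stay separate, and the order of the lines does not matter.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical node type: one entry per path segment.
sealed class SymbolNode
{
    public string Name { get; }
    public string Path { get; }              // full path, e.g. ".root.branch1.size"
    public string DataType { get; set; }     // stays null until a line for this path is seen
    public Dictionary<string, SymbolNode> Children { get; } =
        new Dictionary<string, SymbolNode>(StringComparer.Ordinal);

    public SymbolNode(string name, string path)
    {
        Name = name;
        Path = path;
    }
}

static class SymbolTree
{
    // Builds a tree from lines such as ".root.branch1.size:INT".
    // Returns an unnamed top node whose children are the root symbols.
    public static SymbolNode Build(IEnumerable<string> lines)
    {
        var top = new SymbolNode("", "");
        foreach (string raw in lines)
        {
            string line = raw.Trim();
            int colon = line.LastIndexOf(':');
            if (colon <= 0) continue;         // skip lines without "path:TYPE"

            string path = line.Substring(0, colon);
            string type = line.Substring(colon + 1);

            SymbolNode node = top;
            foreach (string part in path.Split(new[] { '.' }, StringSplitOptions.RemoveEmptyEntries))
            {
                // Create missing intermediate nodes on the way down.
                if (!node.Children.TryGetValue(part, out SymbolNode child))
                {
                    child = new SymbolNode(part, node.Path + "." + part);
                    node.Children.Add(part, child);
                }
                node = child;
            }
            node.DataType = type;
        }
        return top;
    }
}
```

When filling a TreeView from this structure, each node's Path could be stored in the item's Tag, so the selected item can be passed straight to a call like readvar.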
I need help because I am not really used to working with HTML. I display a web document from my code; the document reads an HTML file containing some images.
Every time, just before the image tag, I see two tags that produce some wrong characters. An example will make this clearer.
<p ><br clear=all> </span>
<img border=0 width=265 height=105 id="Picture 84856"
src="Test_HTML/image272.jpg"></p>
The rendering is partially correct: it shows the images, but also a lot of wrong ÂÂÂÂÂÂÂÂÂ characters.
So I decided to try to cut the tags.
I don't know how to do this. Perhaps I am completely wrong, but I think it is a good start, isn't it?
My attempt to remove these tags from an HTML node is:
public void ShowTag(string tag)
{
string innerHtml = "//div[@id='" + tag + "']";
string inner = "//p";
string brToRemove = "//br";
string spanToRemove = "//span";
var nodes = document.DocumentNode.SelectSingleNode(innerHtml);
bool br_deleted = false;
foreach (HtmlNode nd in nodes.SelectNodes(inner))
{
foreach (HtmlNode child in nd.ChildNodes)
{
if (child.Name == "br")
{
int a = 0;
a++;
child.ParentNode.RemoveChild(child);
br_deleted = true;
}
if(child.Name=="span")
{
int b = 0;
b++;
if (br_deleted == true)
{
//nd.ParentNode.RemoveChild(child);
child.Remove();
br_deleted = false;
}
}
}
}
}
But I cannot remove the child. Do you have any idea?
I found where the problem came from: when selecting the right node, I needed to add the headers so that the encoding could be identified.
string innerHtml = "//div[@id='" + tag + "']";
string inner = "//p";
webbrowser.Navigate("about:blank");
LoadDocument();
HtmlNode nodes = document.DocumentNode.SelectSingleNode(innerHtml);
HtmlNode head = document.DocumentNode.SelectSingleNode("/html/head");
head.AppendChild(nodes);
webbrowser.NavigateToString(head.InnerHtml);
I have been working successfully with the C# OpenXml SDK (Unofficial Microsoft Package 2.5 from NuGet) for some time now, but have recently noticed that the following line of code returns different results depending on what mood Microsoft Word appears to be in when the file gets saved:
var fields = document.Descendants<FieldCode>();
From what I can tell, when creating the document in the first place (using Word 2013 on Windows 8.1) if you use the Insert->QuickParts->Field and choose MergeField from the Field names left hand pane, and then provide a Field name in the field properties and click OK then the field code is correctly saved in the document as I would expect.
Then when using the aforementioned line of code I will receive a field code count of 1 field. If I subsequently edit this document (and even leave this field well alone) the subsequent saving could mean that this field code no longer is returned in my query.
Another case of the same curiousness is when I see the FieldCode nodes split across multiple items. So rather than seeing say:
" MERGEFIELD Author \\* MERGEFORMAT "
As the node name, I will see:
" MERGEFIELD Aut"
"hor \\* MERGEFORMAT"
Split as two FieldCode node values. I have no idea why this would be the case, but it certainly makes my ability to match nodes that much more exciting. Is this expected behaviour? A known bug? I don't really want to have to crack open the raw xml and edit this document to work until I understand what is going on. Many thanks all.
I came across this very problem myself, and found a solution that exists within OpenXML: a utility class called MarkupSimplifier which is part of the PowerTools for Open XML project. Using this class solved all the problems I was having that you describe.
The full article is located here.
Here are some pertinent excerpts:
Perhaps the most useful simplification that this performs is to merge adjacent runs with identical formatting.
It goes on to say:
Open XML applications, including Word, can arbitrarily split runs as necessary. If you, for instance, add a comment to a document, runs will be split at the location of the start and end of the comment. After MarkupSimplifier removes comments, it can merge runs, resulting in simpler markup.
An example of the utility class in use is:
SimplifyMarkupSettings settings = new SimplifyMarkupSettings
{
RemoveComments = true,
RemoveContentControls = true,
RemoveEndAndFootNotes = true,
RemoveFieldCodes = false,
RemoveLastRenderedPageBreak = true,
RemovePermissions = true,
RemoveProof = true,
RemoveRsidInfo = true,
RemoveSmartTags = true,
RemoveSoftHyphens = true,
ReplaceTabsWithSpaces = true,
};
MarkupSimplifier.SimplifyMarkup(wordDoc, settings);
I have used this many times with Word 2010 documents using VS2015 .Net Framework 4.5.2 and it has made my life much, much easier.
Update:
I have revisited this code and have found that it cleans up runs in MERGEFIELDs but not in IF fields that reference merge fields, e.g.
{if {MERGEFIELD When39} = "Y???" "Y" "N" }
I have no idea why this might be so, and examination of the underlying XML offers no hints.
Word will often split a text run into multiple text runs for no reason I've ever understood. Before searching, comparing, tidying, etc., we preprocess the body with a method that combines adjacent runs with identical formatting into a single text run.
/// <summary>
/// Combines the identical runs.
/// </summary>
/// <param name="body">The body.</param>
public static void CombineIdenticalRuns(W.Body body)
{
List<W.Run> runsToRemove = new List<W.Run>();
foreach (W.Paragraph para in body.Descendants<W.Paragraph>())
{
List<W.Run> runs = para.Elements<W.Run>().ToList();
for (int i = runs.Count - 2; i >= 0; i--)
{
W.Text text1 = runs[i].GetFirstChild<W.Text>();
W.Text text2 = runs[i + 1].GetFirstChild<W.Text>();
if (text1 != null && text2 != null)
{
string rPr1 = "";
string rPr2 = "";
if (runs[i].RunProperties != null) rPr1 = runs[i].RunProperties.OuterXml;
if (runs[i + 1].RunProperties != null) rPr2 = runs[i + 1].RunProperties.OuterXml;
if (rPr1 == rPr2)
{
text1.Text += text2.Text;
runsToRemove.Add(runs[i + 1]);
}
}
}
}
foreach (W.Run run in runsToRemove)
{
run.Remove();
}
}
I tried to simplify the document with PowerTools, but the result was a corrupted Word file. I wrote this routine to simplify only the field codes that have specific names. It works in all parts of the document (main document part, headers and footers):
internal static void SimplifyFieldCodes(WordprocessingDocument document)
{
var masks = new string[] { Constants.VAR_MASK, Constants.INP_MASK, Constants.TBL_MASK, Constants.IMG_MASK, Constants.GRF_MASK };
SimplifyFieldCodesInElement(document.MainDocumentPart.RootElement, masks);
foreach (var headerPart in document.MainDocumentPart.HeaderParts)
{
SimplifyFieldCodesInElement(headerPart.Header, masks);
}
foreach (var footerPart in document.MainDocumentPart.FooterParts)
{
SimplifyFieldCodesInElement(footerPart.Footer, masks);
}
}
internal static void SimplifyFieldCodesInElement(OpenXmlElement element, string[] regexpMasks)
{
foreach (var run in element.Descendants<Run>()
.Select(item => (Run)item)
.ToList())
{
var fieldChar = run.Descendants<FieldChar>().FirstOrDefault();
if (fieldChar != null && fieldChar.FieldCharType == FieldCharValues.Begin)
{
string fieldContent = "";
List<Run> runsInFieldCode = new List<Run>();
var currentRun = run.NextSibling();
while ((currentRun is Run) && currentRun.Descendants<FieldCode>().FirstOrDefault() != null)
{
var currentRunFieldCode = currentRun.Descendants<FieldCode>().FirstOrDefault();
fieldContent += currentRunFieldCode.InnerText;
runsInFieldCode.Add((Run)currentRun);
currentRun = currentRun.NextSibling();
}
// If the FieldCode spans more than one Run and is one we must change, put the complete text in the first Run and remove the rest
if (runsInFieldCode.Count > 1)
{
// Check the field code to see whether it's one we must simplify (so as not to change TOC, PAGEREF, etc.)
bool applyTransform = false;
foreach (string regexpMask in regexpMasks)
{
Regex regex = new Regex(regexpMask);
Match match = regex.Match(fieldContent);
if (match.Success)
{
applyTransform = true;
break;
}
}
if (applyTransform)
{
var currentRunFieldCode = runsInFieldCode[0].Descendants<FieldCode>().FirstOrDefault();
currentRunFieldCode.Text = fieldContent;
runsInFieldCode.RemoveAt(0);
foreach (Run runToRemove in runsInFieldCode)
{
runToRemove.Remove();
}
}
}
}
}
}
Hope this helps!!!
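If the goal is only to read the complete field codes rather than to rewrite the document, another option is to join the split instruction text while reading. The following is a minimal sketch under stated assumptions: it works on the raw WordprocessingML of a part loaded as an XElement with System.Xml.Linq (not the Open XML SDK types), the FieldCodeReader name is hypothetical, and it assumes each field has matching "begin" and "end" fldChar markers. A stack keeps nested fields apart, such as a MERGEFIELD inside an IF field.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Xml.Linq;

static class FieldCodeReader
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns one string per field, with all of its w:instrText pieces joined.
    // Inner (nested) fields are returned before the field that contains them.
    public static List<string> ReadFieldCodes(XElement container)
    {
        var open = new Stack<StringBuilder>();   // fields that have begun but not ended
        var codes = new List<string>();

        // Descendants() walks the elements in document order.
        foreach (XElement el in container.Descendants())
        {
            if (el.Name == W + "fldChar")
            {
                string type = (string)el.Attribute(W + "fldCharType");
                if (type == "begin")
                {
                    open.Push(new StringBuilder());
                }
                else if (type == "end" && open.Count > 0)
                {
                    codes.Add(open.Pop().ToString());
                }
            }
            else if (el.Name == W + "instrText" && open.Count > 0)
            {
                // Append this fragment to the innermost open field.
                open.Peek().Append(el.Value);
            }
        }
        return codes;
    }
}
```

Because the fragments are only concatenated in memory, the document itself is left untouched, which avoids the risk of producing a file that Word refuses to open.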
This is what my team and I chose to do for our school project. Actually, we haven't decided how to parse the C# source files yet.
What we are aiming to achieve is to perform a full analysis of a C# source file and produce a report.
The report is going to describe what is happening in the code.
The report only has to contain:
string literals
method names
variable names
field names
etc
I'm in charge of looking into the Irony library. To be honest, I don't know the best way to sort the data into a clean, readable report. I am using the C# grammar class that comes in the zip.
Is there any step where I can properly identify each node's children (e.g. using directives, namespace declaration, class declaration, method body)?
Any help or advice would be very much appreciated. Thanks.
EDIT: Sorry, I forgot to say we need to analyze the method calls too.
Your main goal should be to master the basics of formal languages. A good starting point might be found here. This article describes how to use Irony, using the grammar of a simple numeric calculator as an example.
Suppose you want to parse a file containing C# code, and you know its path:
private void ParseForLongMethods(string path)
{
_parser = new Parser(new CSharpGrammar());
if (_parser == null || !_parser.Language.CanParse()) return;
_parseTree = null;
GC.Collect(); //to avoid disruption of perf times with occasional collections
_parser.Context.SetOption(ParseOptions.TraceParser, true);
try
{
string contents = File.ReadAllText(path);
_parser.Parse(contents);//, "<source>");
}
catch (Exception ex)
{
}
finally
{
_parseTree = _parser.Context.CurrentParseTree;
TraverseParseTree();
}
}
And here is the traversal method itself, which collects some information from the nodes. This code counts the number of statements in every method of the class. If you have any questions, you are always welcome to ask me.
private void TraverseParseTree()
{
if (_parseTree == null) return;
ParseNodeRec(_parseTree.Root);
}
private void ParseNodeRec(ParseTreeNode node)
{
if (node == null) return;
string functionName = "";
if (node.ToString().CompareTo("class_declaration") == 0)
{
ParseTreeNode tmpNode = node.ChildNodes[2];
currentClass = tmpNode.AstNode.ToString();
}
if (node.ToString().CompareTo("method_declaration") == 0)
{
foreach (var child in node.ChildNodes)
{
if (child.ToString().CompareTo("qual_name_with_targs") == 0)
{
ParseTreeNode tmpNode = child.ChildNodes[0];
while (tmpNode.ChildNodes.Count != 0)
{ tmpNode = tmpNode.ChildNodes[0]; }
functionName = tmpNode.AstNode.ToString();
}
if (child.ToString().CompareTo("method_body") == 0) //method_declaration
{
int statementsCount = FindStatements(child);
//Register bad smell
if (statementsCount>(((LongMethodsOptions)this.Options).MaxMethodLength))
{
//function.StartPoint.Line
int functionLine = GetLine(functionName);
foundSmells.Add(new BadSmellRegistry(name, functionLine,currentFile,currentProject,currentSolution,false));
}
}
}
}
foreach (var child in node.ChildNodes)
{ ParseNodeRec(child); }
}
I'm not sure this is what you need, but you could use the CodeDom and CodeDom.Compiler namespaces to compile the C# code, and then analyze the results using Reflection, something like:
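// Create the assembly in memory
CompilerParameters compileParams = new CompilerParameters { GenerateInMemory = true };
CodeSnippetCompileUnit code = new CodeSnippetCompileUnit(classCode);
CSharpCodeProvider provider = new CSharpCodeProvider();
CompilerResults results = provider.CompileAssemblyFromDom(compileParams, code);
// An Assembly is not enumerable itself; iterate over the types it defines
foreach (Type type in results.CompiledAssembly.GetTypes())
{
    // Your analysis goes here
}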
Update: In VS2015 you could use the new C# compiler (AKA Roslyn) to do the same, for example:
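// Parse the source text (classCode, as in the CodeDom example above) into a syntax tree
SyntaxTree tree = CSharpSyntaxTree.ParseText(classCode);
var root = (CompilationUnitSyntax)tree.GetRoot();
var compilation = CSharpCompilation.Create("HelloTDN")
.AddReferences(MetadataReference.CreateFromFile(typeof(object).Assembly.Location))
.AddSyntaxTrees(tree);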
var model = compilation.GetSemanticModel(tree);
var nameInfo = model.GetSymbolInfo(root.Usings[0].Name);
var systemSymbol = (INamespaceSymbol)nameInfo.Symbol;
foreach (var ns in systemSymbol.GetNamespaceMembers())
{
Console.WriteLine(ns.Name);
}
The first half of my function doesn't use HtmlAgilityPack, and I know it works as I want. However, the function finishes without doing anything in the second half and doesn't return any errors. Please help.
void classListHtml()
{
HtmlElementCollection elements = browser.Document.GetElementsByTagName("tr");
html = "<table>";
int i = 0;
foreach (HtmlElement element in elements)
{
if (element.InnerHtml.Contains("Marking Period 2") && i != 0)//will be changed to current assignment reports later
{
html += "" + element.OuterHtml;
}
else if (i == 0)
{
i++;
continue;
}
else
continue;
}
html += "" + "</table>";
myDocumentText(html);
//---------THIS IS WHERE IT STOPS DOING WHAT I WANT-----------
//removing color and other attributes
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.Load(html);
HtmlNodeCollection nodeCollection = doc.DocumentNode.SelectNodes("//tr");//xpath expression for all row nodes
string[] blackListAttributes={"width", "valign","bgcolor","align","class"};
foreach(HtmlNode node in nodeCollection)//for each row node
{
HtmlAttributeCollection rows = node.Attributes;// the attributes of each row node
foreach (HtmlAttribute attribute in rows)//for each attribute
{
if (blackListAttributes.Contains(attribute.Name))//if its attribute name is in the blacklist, remove it.
attribute.Remove();
}
}
html = doc.ToString();
myDocumentText(html);//updating browser with new html
}
HtmlDocument.ToString() does not return the markup unless you changed the original code. Maybe you're looking for HtmlDocument.DocumentNode.OuterHtml or HtmlDocument.Save( ... text ...)?
myDocumentText(html);
What does this method do?
My assumption is that you have an exception being thrown somewhere within this method, and it's either being swallowed, or your debug environment is set to not break on user thrown exceptions.
Can you post the code within this method?
So, I have a Microsoft Word 2007 Document with several Plain Text Format (I have tried Rich Text Format as well) controls which accept input via XML.
For carriage returns, I had the string being passed through XML containing "\r\n" when I wanted a carriage return, but the word document ignored that and just kept wrapping things on the same line. I also tried replacing the \r\n with System.Environment.NewLine in my C# mapper, but that just put in \r\n anyway, which still didn't work.
Note also that on the control itself I have set "Allow Carriage Returns (Multiple Paragraphs)" in the control properties.
This is the XML for the listMapper
<Field id="32" name="32" fieldType="SimpleText">
<DataSelector path="/Data/DB/DebtProduct">
<InputField fieldType=""
path="/Data/DB/Client/strClientFirm"
link="" type=""/>
<InputField fieldType=""
path="strClientRefDebt"
link="" type=""/>
</DataSelector>
<DataMapper formatString="{0} Account Number: {1}"
name="SimpleListMapper" type="">
<MapperData></MapperData>
</DataMapper>
</Field>
This is the listMapper C# where I actually map the list (notice where I try to append System.Environment.NewLine):
namespace DocEngine.Core.DataMappers
{
public class CSimpleListMapper:CBaseDataMapper
{
public override void Fill(DocEngine.Core.Interfaces.Document.IControl control, CDataSelector dataSelector)
{
if (control != null && dataSelector != null)
{
ISimpleTextControl textControl = (ISimpleTextControl)control;
IContent content = textControl.CreateContent();
CInputFieldCollection fileds = dataSelector.Read(Context);
StringBuilder builder = new StringBuilder();
if (fileds != null)
{
foreach (List<string> lst in fileds)
{
if (CanMap(lst) == false) continue;
if (builder.Length > 0 && lst[0].Length > 0)
builder.Append(Environment.NewLine);
if (string.IsNullOrEmpty(FormatString))
builder.Append(lst[0]);
else
builder.Append(string.Format(FormatString, lst.ToArray()));
}
content.Value = builder.ToString();
textControl.Content = content;
applyRules(control, null);
}
}
}
}
}
Does anybody have any clue at all how I can get MS Word 2007 (docx) to quit ignoring my newline characters??
Use a function like this:
private static Run InsertFormatRun(Run run, string[] formatText)
{
    foreach (string text in formatText)
    {
        // Each piece of text is followed by a line break.
        // The Break is a child of the Run itself, not of RunProperties.
        run.AppendChild(new Text(text));
        run.AppendChild(new Break());
    }
    return run;
}
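At the markup level, a line break inside a run is a w:br element placed between w:t elements, while newline characters inside a single w:t do not produce a break, which is consistent with what the question observed. The sketch below builds such a run with System.Xml.Linq only, so the structure is visible without the SDK. The RunMarkup and BuildRun names are hypothetical, and the result is a bare w:r element that would still need to be placed inside a paragraph.

```csharp
using System;
using System.Xml.Linq;

static class RunMarkup
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Builds a w:r element in which every newline in the input
    // becomes a w:br element between two w:t elements.
    public static XElement BuildRun(string text)
    {
        var run = new XElement(W + "r");

        // Normalize Windows line endings, then split into lines.
        string[] lines = text.Replace("\r\n", "\n").Split('\n');

        for (int i = 0; i < lines.Length; i++)
        {
            if (i > 0)
            {
                run.Add(new XElement(W + "br"));
            }
            run.Add(new XElement(W + "t",
                // Keep leading and trailing spaces of each line.
                new XAttribute(XNamespace.Xml + "space", "preserve"),
                lines[i]));
        }
        return run;
    }
}
```

Calling BuildRun on a string with two lines yields a run with three children: a w:t, a w:br and another w:t.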
None of the above answers were any help for me.
However I figured out that the InsertAfter method swaps the \n in the original XML string for \v and when this is passed into the content control it then renders correctly.
contentControl.MultiLine = true
contentControl.Range.InsertAfter(your string)
I had the same problem, but in a table cell.
I had one string with carriage returns (multiple lines) in a Text object that was appended to a paragraph, which was appended to a table cell.
=> The carriage returns were ignored by Word.
The solution was simple:
Create one paragraph per line and add all of these paragraphs to the table cell.
I think this works:
using (WordprocessingDocument _docx = WordprocessingDocument.Create("c:\\Test.docx", WordprocessingDocumentType.Document))
{
    // A newly created package has no main part yet, so add one with an empty body
    MainDocumentPart _part = _docx.AddMainDocumentPart();
    Body _body = new Body();
    _part.Document = new Document(_body);
    string _str = "abc\ndef\ngeh";
    string[] _strArr = _str.Split('\n');
    foreach (string _line in _strArr)
    {
        // One paragraph per line of text, all in the same body
        _body.Append(NewText(_line));
    }
    _part.Document.Save();
}
static Paragraph NewText(string _text)
{
Paragraph _head = new Paragraph();
Run _run = new Run();
Text _line = new Text(_text);
_run.Append(_line);
_head.Append(_run);
return _head;
}