C# Parse text file - c#

I am trying to parse a file in MVC C#, see the format below. Since its not in JSON I cannot use the Javascript serializer to deserialize to an object. The other option is use to LINQ and read line by line and retrieve the desired values. Could any one recommend a more efficient way to do it.
The first field I need to retrieve is the ASSAY NUMBER (for example value 877) from ASSAYS
and then the ASSAY_STATUS field from TEST_REPLICATE which could be multiple nodes. Thanks
LOAD_HEADER
{
EXPERIMENT_FILE_NAME "xyz.json"
EXPERIMENT_START_DATE_TIME 05.21.2012 03:44:01
OPERATOR_ID "Q_SI"
}
ASSAYS
{
ASSAY_NUMBER 877
ASSAY_VERSION 4
ASSAY_CALIBRATION_VERSION 1
}
TEST_REPLICATE
{
REPLICATE_ID 1985
ASSAY_NUMBER 877
ASSAY_VERSION 4
ASSAY_STATUS Research
}
TEST_REPLICATE
{
REPLICATE_ID 1985
ASSAY_NUMBER 877
ASSAY_VERSION 4
ASSAY_STATUS Research
}

You could either hack something together or use a parser generator like ANTLR or Coco/R. Both can generate parsers in C#.

I'm more fond of using a parser-combinator (a tool for constructing parsers using parser building blocks) than parser generators. I've had passable experience with Piglet, which is written with/for C#, and is pretty easy to use, and amazing experience with FParsec, but it's written for F#.
As far as parser generators go, there are those suggested by stmax, and there is also TinyPG, which a member recommended me once.
You can also roll your own parser. I suggest basing it on some sort of state machine model, though in this simple case, like Kirk Woll suggested, you could probably get by with some plain old string manipulation.

I think the answer to this hinges upon whether or not there will ever be more than one ASSAY_NUMBER value in the file. If so, the easiest and surest way I know is to read the file line-by-line and get the data you desire.
If, however, you know that each file is unique to a specific ASSY_NUMBER, you have a much simpler answer: read the file as one string and use REGEX to pull out the information you desire. I am not an expert on REGEX, but there are enough examples online that you should be able to create one that works.

Related

Remove Markdown tags from string

I have a string which has Markdown tags embedded inside it. I do not want to encode the Markdown as anything else, I just want to rip out all of the tags.
How can I do this quickly? I need to do this as part of a batch processing job which processes around 5 million pieces of text, so speed is very important.
I looked at MarkdownSharp, and using Transform, but I'm not sure it's the best way of doing this. I just want plaintext output, with no tags inside. I'm even considering a regex removal, but I'm not sure what the most performant option would be.
You could probably use MarkdownSharp or any other similar library (I recommend Strike, since it is surprisingly fast!) to convert the Markdown to Html and then use HtmlAgilityPack to extract the text.
A faster option, but more work for you, would be to modify an existing Markdown parser to produce plain text instead.
The solution was a bit hard to get from the comments, but this works for .NET 6:
Install Markdown Deep from NuGet. I needed something for .NET 6 so I used the Core version https://www.nuget.org/packages/MarkdownDeep.NET.Core/
Create a Markdown object:
using MarkdownDeep;
var markdownRemover = new Markdown()
{
SummaryLength = -1
};
Strip the markdown from the text:
var plainText = markdownRemover.Transform(mdText);

Matching tags in C#

I'm trying to match tags with C# and I'm having some trouble getting it to work. I have these tags:
<categories=1></categories=1>
The =1 could be really any number. It could be 1, 2, 3 or any other given number. Is there a way to match this tag in C# using IndexOf or RegEx or a better method.
So to give an example of how I want to use it. I would have something like:
if (PUT WORKING CODE HERE ONCE FIGURED OUT)
{
Do Something
}
Is there an easy way to do this?
Thanks!
I would suggest to first make the document valid XML by replacing those equation signs, then use any XML parser.
there is only one valid answer to this need, unless you are doing homeworks and need to learn how to code this yourself...
avoid reinventing things from scratch and use Html Agility Pack
it is called Html but also handles XML files, in case you have to do more complex things, like parsing, and don't want or cannot use pure XPath and XML related .NET Framework classes.
see here for some examples: How to use HTML Agility pack

using linq for this:

i have just started learning linq because i like the sound of it. and so far i think im doing okay at it.
i was wondering if Linq could be used to find the following information in a file, like a group at a time or something:
Control
Text
Location
Color
Font
Control Size
example:
Label
"this is text that will
appear on a label control at runtime"
23, 77
-93006781
Tahoma, 9.0, Bold
240, 75
The above info will be in a plain file and wil have more than one type of control and many different sizes, font properties etc associated with each control listed. is it possible in Linq to parse the info in this txt file and then convert it to an actual control?
i've done this using a regex but regex is too much of a hassle to update/maintain.
thanks heaps
jase
Edit:
Since XML is for structured data, would Linq To XML be appropriate for this task? And would you please share with me any helpful/useful links that you may have? (Other than MSDN, because I am looking at that now. :))
Thank you all
If you are generating this data yourself, then I HIGHLY recommend you store this in an XML file. Then you can use XElement to parse this.
EDIT: This is exactly the type of thing that XML is designed for, structured data.
EDIT EDIT: In response to the second question, Linq to XML is exactly what your looking for:
For an example, here is a couple of links to code I have written that parses XML using XElements. It also creates a XML document.
Example 1 - Loading and Saving: have a look under the FromXML() and ToXML() methods.
Example 2 - Parsing a large XML doc: have a look under the ParseXml method.
Hope these get you going :D
LINQ is good for filtering off rows, selecting relevant columns etc.
Even if you use LINQ for this, you will still need regex to select the relevant text and do the parsing.

Design strategy for a simple code parser

I'm attempting to write an application to extract properties and code from proprietary IDE design files. The file format looks something like this:
HEADING
{
SUBHEADING1
{
PropName1 = PropVal1;
PropName2 = PropVal2;
}
SUBHEADING2
{
{ 1 ; PropVal1 ; PropValue2 }
{ 2 ; PropVal1 ; PropValue2 ; OnEvent1=BEGIN
MESSAGE('Hello, World!');
{ block comments are between braces }
//inline comments are after double-slashes
END;
PropVal3 }
{ 1 ; PropVal1 ; PropVal2; PropVal3 }
}
}
What I am trying to do is extract the contents under the subheading blocks. In the case of SUBHEADING2, I would also separate each token as delimited by the semicolons. I had reasonably good success with just counting the brackets and keeping track of what subheading I'm currently under. The main issue I encountered involves dealing with the code comments.
This language happens to use {} for block comments, which interferes with the brackets in the file format. To make it even more interesting, it also needs to take into account double-slash inline comments and ignore everything up to the end of the line.
What is the best approach to tackling this? I looked at some of the compiler libraries discussed in another article (ANTLR, Doxygen, etc.) but they seem like overkill for solving this specific parsing issue.
I'd suggest writing a tokenizer and parser; this will give you more flexibility. The tokenizer basically does a simple text-wise breakdown of the sourcecode and puts it into more usable data structure; the parser figures out what to do with it, often leveraging recursion.
Terms to google: tokenizer, parser, compiler design, grammars
Math expression evaluator: http://www.codeproject.com/KB/vb/math_expression_evaluator.aspx
(you might be able to take an example like this and hack it apart into what you want)
More info about parsing: http://www.codeproject.com/KB/recipes/TinyPG.aspx
You won't have to go nearly as far as those articles go, but, you're going to want to study a bit on this one first.
You should be able to put something together in a few hours, using regular expressions in combination with some code that uses the results.
Something like this should work:
- Initialize the process by loading the file into a string.
Pull each top-level block from the string, using regex tags to separately identify the block keyword and contents.
If a block is found,
Make a decision based on the keyword
Pass the content to this process recursively.
Following this, you would process HEADING, then the first SUBHEADING, then the second SUBHEADING, then each sub-block. For the sub-block containing the block comment, you would presumably know based on the block's lack of a keyword that any sub-block is a comment, so there is no need to process the sub-blocks.
No matter which solution you will choose, I'm pretty sure the best way is to have 2 parsers/tokenizers. One for the main file structure with {} as grouping characters, and one for the code blocks.

Regex index in matching string where the match failed

I am wondering if it is possible to extract the index position in a given string where a Regex failed when trying to match it?
For example, if my regex was "abc" and I tried to match that with "abd" the match would fail at index 2.
Edit for clarification. The reason I need this is to allow me to simplify the parsing component of my application. The application is an Assmebly language teaching tool which allows students to write, compile, and execute assembly like programs.
Currently I have a tokenizer class which converts input strings into Tokens using regex's. This works very well. For example:
The tokenizer would produce the following tokens given the following input = "INP :x:":
Token.OPCODE, Token.WHITESPACE, Token.LABEL, Token.EOL
These tokens are then analysed to ensure they conform to a syntax for a given statement. Currently this is done using IF statements and is proving cumbersome. The upside of this approach is that I can provide detailed error messages. I.E
if(token[2] != Token.LABEL) { throw new SyntaxError("Expected label");}
I want to use a regular expression to define a syntax instead of the annoying IF statements. But in doing so I lose the ability to return detailed error reports. I therefore would at least like to inform the user of WHERE the error occurred.
I agree with Colin Younger, I don't think it is possible with the existing Regex class. However, I think it is doable if you are willing to sweat a little:
Get the Regex class source code
(e.g.
http://www.codeplex.com/NetMassDownloader
to download the .Net source).
Change the code to have a readonly
property with the failure index.
Make sure your code uses that Regex
rather than Microsoft's.
I guess such an index would only have meaning in some simple case, like in your example.
If you'll take a regex like "ab*c*z" (where by * I mean any character) and a string "abbbcbbcdd", what should be the index, you are talking about?
It will depend on the algorithm used for mathcing...
Could fail on "abbbc..." or on "abbbcbbc..."
I don't believe it's possible, but I am intrigued why you would want it.
In order to do that you would need either callbacks embedded in the regex (which AFAIK C# doesn't support) or preferably hooks into the regex engine. Even then, it's not clear what result you would want if backtracking was involved.
It is not possible to be able to tell where a regex fails. as a result you need to take a different approach. You need to compare strings. Use a regex to remove all the things that could vary and compare it with the string that you know it does not change.
I run into the same problem came up to your answer and had to work out my own solution. Here it is:
https://stackoverflow.com/a/11730035/637142
hope it helps

Categories