What's the best way of parsing strings? [closed]

What's the best way of parsing strings? [closed] - c#

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 9 years ago.
Improve this question
We've got a scenario that requires us to parse lots of e-mail (plain text), each e-mail 'type' is the result of a script being run against various platforms. Some are tab delimited, some are space delimited, some we simply don't know yet.
We'll need to support more 'formats' in the future too.
Do we go for a solution using:
Regex
Simply string searching (using string.IndexOf etc)
Lex/ Yacc
Other
The overall solution will be developed in C# 2.0 (hopefully 3.5)

Regex.
Regex can solve almost everything except for world peace. Well maybe world peace too.

The three solutions you stated each cover very different needs.
Manual parsing (simple text search) is the most flexible and the most adaptable, however, it very quickly becomes a real pain in the ass as the parsing required is more complicated.
Regex are a middle ground, and probably your best bet here. They are powerful, yet flexible as you can yourself add more logic from the code that call the different regex. The main drawback would be speed here.
Lex/Yacc is really only adapted to very complicated, predictable syntaxes and lacks a lot of post compile flexibility. You can't easily change parser in mid parsing, well actually you can but it's just too heavy and you'd be better using regex instead.
I know this is a cliché answer, it all really comes down to what your exact needs are, but from what you said, I would personally probably go with a bag of regex.
As an alternative, as Vaibhav poionted out, if you have several different situations that can arise and that you cna easily detect which one is coming, you could make a plugin system that chooses the right algorithm, and those algorithms could all be very different, one using Lex/Yacc in pointy cases and the other using IndexOf and regex for simpler cases.

You probably should have a pluggable system regardless of which type of string parsing you use. So, this system calls upon the right 'plugin' depending on the type of email to parse it.

You must architect your solution to be updatable, so that you can handle unknown situations when they crop up. Create an interface for parsers that contains not only methods for parsing the emails and returning results in a standard format, but also for examining the email to determine if the parser will execute.
Within your configuration, identify the type of parser you wish to use, set its configuration options, and the configuration for the identifiers which determine if a parser will act or not. Name the parsers by assembly qualified name so that the types can be instantiated at runtime even if there aren't static links to their assemblies.
Identifiers can implement an interface as well, so you can create different types that check for different things. For instance, you might create a regex identifier, which parses the email for a specific pattern. Make sure to make as much information available to the identifier, so that it can make decisions on things like from addresses as well as the content of the email.
When your known parsers can't handle a job, create a new DLL with types that implement the parser and identifier interfaces that can handle the job and drop them in your bin directory.

It depends on what you're parsing. For anything beyond what Regex can handle, I've been using ANTLR. Before you jump into recursive descent parsing for the first time, I would research how they work, before attempting to use a framework like this one. If you subscribe to MSDN Magazine, check the Feb 2008 issue where they have an article on writing one from scratch.
Once you get the understanding, learning ANTLR will be a ton easier. There are other frameworks out there, but ANTLR seems to have the most community support and public documentation. The author has also published The Definitive ANTLR Reference: Building Domain-Specific Languages.

Regex would probably be you bes bet, tried and proven. Plus a regular expression can be compiled.

Your best bet is RegEx because it provides a much greater degree of flexibility than any of the other options.
While you could use IndexOf to handle somethings, you may quickly find yourself writing code that looks like:
if(s.IndexOf("search1")>-1 || s.IndexOf("search2")>-1 ||...
That can be handled in one RegEx statement. Plus, there are a lot of place like RegExLib.com where you can find folks who have shared regular expressions to solve problems.

#Coincoin has covered the bases; I just want to add that with regex it's particularly easy to end up with hard-to-read, hard-to-maintain code. Regex is a powerful and very compact language, so that's how it often goes.
Using whitespace and comments within the regex can go a long way to make it easier to maintain regexes. Eric Gunnerson turned me on to this idea. Here's an example.

Use PCRE. All other answers are just 2nd Best.

With as little information you provided, i would choose Regex.
But what kind of information you want to parse and what you would want to do will change the decision to Lex/Yacc maybe..
But it looks like you've already made your mind up with String search :)

Related

Difficulty of switching from HTML only e-mail to Text only e-mail in .net C#? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 8 years ago.
Improve this question
I am working with a consulting group on a program which currently uses a .net C# script to send e-mails in HTML format at regular intervals.
The e-mail itself aside from being in HTML format although the content is text with some tags and contains less then a page of text.
I would like the consultant to change this to text format replacing the tags with line feed/carriage returns. I have been told that this is a four hour job but that seems excessive to me.
When I look online at a page such as this http://www.mattvanandel.com/771/c-sending-an-email/ it would seem the change could be completed in less than 4 hours including recompiling the .net code into a DLL, testing and uploading the code to a server.
Not all developers are created equal, but assuming that the .Net developer is experienced enough to warrant a $250 per hour salary does this seem reasonable? If it is something less than 4 hours (i.e. more like 4 min) can someone tell me what might have to be done to make the modification. From what I can see its likely 2 lines of code that need to be modified (i.e. the body string and the IsBodyHtml statement). What else may I be missing?

Dependant upon what kind testing would be required to verify that the system is stable after the change, then perhaps 4 hours may or may not be excessive.
For a simple looking change in a tightly coupled system may have massive implications and risk. On the other hand in a loosely coupled system, the risk should be minimal.
So the question is, why 4 hours. If it was me. I'd request a breakdown of what the 4 hours represents. You are after all the customer and if you need a cost breakdown I'd suggest you're within your purview to request it.
However I'd suggest that you ask in a non confrontational way (i.e. don't jump in with all guns blazing) as the there may well be serious implications that the developer knows about but you don't. Maybe just ask for a simple - 'what is involved in implementing this change'.
And don't feel you have to accept the first answer given, you should if you are dissatified, request further clarification from the developer.

It all depends on how the code is written - and on that we can only speculate currently. It may be that they use a really complex 3rd party tool - in which case it might take four hours.
However, if it is done using System.Net.Mail then it could be as simple as setting the IsBodyHtmlproperty on a MailMessage to true, which is a four-second job.

Changing that 'IsBodyHtml' property would make it send text, but you would also need to modify the text to insert the line feeds - on static text this is not totally difficult, but you need to consider when a line feed is proper (what in the html has "block" layout and what is simple in-line styled). Also you do not mention if the text is dynamic or static which adds complications if it IS dynamicly generated.
Time you pay for, but also knowlege. I get someone else to fix things on my car, not because I can't, but because they are better and have the tools I might not have.
Just from a time spent perspective:
Get knowlege/use knowlege already present
Estimate time to communicate with you
Design the change
Code the change
Deploy the change
Test the change/functional test
Solicit feedback on the change/acceptance test (from you?)

There is only one property "IsBodyHtml" of MailMessage Class in .net to switch between Html/Text mail message type.
So you can check yourself, how big is the job excepting removing html tags and pumblishing the updated dll on server.

The mechanics of switching the code itself is as simple as you say above, replacing the HTML body string with the new string and changing the IsBodyHtml property. (Assuming the code uses the built in .NET Framework mailing components).
Remember though, that text based emails will remove all formatting, so you won't be able to have font colours, images, hyperlinks or anything else in the content except as plain text.
If you really want to cut the estimate down, get someone internal to edit the text and all the developer will have to do is switch 2 lines of code and then test/deploy.
I can't comment on the time required to test/deploy as that's entirely dependent on your system.

Developing Abstract Syntax Tree

I've scoured the internet looking for some newbie information on developing a C# Abstract Syntax Trees but I can only find information for people already 'in-the-know'. I am a line-of-business application developer so topics like these are a bit over my head, but this is for my own education so I'm willing to spend the time and learn whatever concepts are necessary.
Generally, I'd like to learn about the techniques behind developing an abstract representation of code from a code string. More specifically, I'd like to be able to use this AST to do C# syntax highlighting. (I realize that syntax highlighting doesn't necessary need an AST, but this seems like a good opportunity to learn some "compiler"-level techniques.)
I apologize if this question is a bit broad, but I'm not sure how else to ask.
Thanks!

First you need to understand what parsing is, and what abstract syntax trees are. For this, you can consult Wikipedia on abstract syntax trees for a first look.
You really need to spend some time with a compiler text book to understand how abstract syntax trees are related to parsing, and can be constructed while parsing; the classic reference is Aho/Ullman/Sethi's "Compilers" book (easily found on the web). You may find the SO answer to Are there any "fun" ways to learn about Languages, Grammars, Parsing and Compilers? instructive.
Once you understand how to build an AST for a simple grammar, you can then turn your attention to something like C#. The issue here is sheer scale; it is one thing to play with a toy language with 20 grammar rules. It is another to work with grammar of several hundred or a thousand rules. Experience will small ones will make it a lot easier to understand how the big ones are put together, and how to live with them.
You probably don't want to build your own C# grammar (or implement the one from the C# standard); its quite a lot of work. You can get available tools that will hand you C# ASTs (Roslyn has already been mentioned; ANTLR has a C# parser, there are many more).
It is true that you might use an AST for syntax highlighting (although that is probably killing a gnat with a sledgehammer). What most people don't think much about (but the compiler books emphasize), is what happens after you have an AST; mostly they aren't useful by themselves. You actually need a lot more machinery to do anything interesting.
Rather than repeat this over and over (I keep seeing the same kind of questions), you can see my discussion on Life After Parsing for more details.

You should probably take a look at this talk by Phil Trelford:
Write your own compiler in 24 hours
This man is a genius, and will leave you fired up to learn about compilers. He explains it literally easily enough for a five year old to understand. The five year old in question is his son, so probably has an unfair advantage, but five is five.

Take a look at Roslyn. I think it could be what you're looking for. It gives you access to the compilers AST, among lots of other amazing things!
http://blogs.msdn.com/b/visualstudio/archive/2011/10/19/introducing-the-microsoft-roslyn-ctp.aspx
Beyond that, I suggest a textbook on compilers.

Rules/guidelines for documenting C# code? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 5 years ago.
Improve this question
I am a relatively new developer and have been assigned the task of documenting code written by an advanced C# developer. My boss told me to look through it, and to document it so that it would be easier to modify and update as needed.
My question is: Is there a standard type of Documentation/Comment structure I should follow? My boss made it sound like everyone knew exactly how to document the code to a certain standard so that anyone could understand it.
I am also curious if anyone has a good method for figuring out unfamiliar code or function uncertainty. Any help would be greatly appreciated.

The standard seems to be XML Doc (MSDN Technet article here).
You can use /// at the beginning of each line of documentation comments. There are standard XML style elements for documenting your code; each should follow the standard <element>Content</element> usage. Here are some of the elements:
<c> Used to differentiate code font from normal text
<c>class Foo</c>
<code>
<example>
<exception>
<para> Used to control formatting of documentation output.
<para>The <c>Foo</c> class...</para>
<param>
<paramref> Used to refer to a previously described <param>
If <paramref name="myFoo" /> is <c>null</c> the method will...
<remarks>
<returns>
<see> Creates a cross-ref to another topic.
The <see cref="System.String" /><paramref name="someString"/>
represents...
<summary> A description (summary) of the code you're documenting.

Sounds like you really did end up getting the short straw.
Unfortunately I think you've stumbled on one of the more controversial subjects of software development in general. Comments can be seen as extremely helpful where necessary, and unnecessary cruft when used wrongly. You'll have to be careful and decide quite carefully what goes where.
As far as commenting practice, it's usually down to the corporation or the developer. A few common rules I like to use are:
Comment logic that isn't clear (and consider a refactor)
Only Xml-Doc methods / properties that could be questioned (or, if you need to give a more detailed overview)
If your comments exceed the length of the containing method / class, you might want to think about comment verbosity, or even consider a refactor.
Try and imagine a new developer coming across this code. What questions would they ask?
It sounds like your boss is referring to commenting logic (most probably so that you can start understanding it) and using xml-doc comments.
If you haven't used xml-doc comments before, check out this link which should give you a little guidance on use and where appropriate.
If your workloadi s looking a little heavy (ie, lots of code to comment), I have some good news for you - there's an excellent plugin for Visual Studio that may help you out for xml-doc comments. GhostDoc can make xml-doc commenting methods / classes etc much easier (but remember to change the default placeholder text it inserts in there!)
Remember, you may want to check with your boss on just what parts of the code he wants documented before you go on a ghostdoc spree.

It's a bit of a worry that the original programmer didn't bother to do one of the most important parts of his job. However, there are lots of terrible "good" programmers out there, so this isn't really all that unusual.
However, getting you to document the code is also a pretty good training mechanism - you have to read and understand the code before you can write down what it does, and as well as gaining knowledge of the systems, you will undoubtedly pick up a few tips and tricks from the good (and bad!) things the other programmer has done.
To help get your documentation done quickly and consistently, you might like to try my add-in for Visual Studio, AtomineerUtils Pro Documentation. This will help with the boring grunt work of creating and updating the comments, make sure the comments are fully formed and in sync with the code, and let you concentrate on the code itself.
As to working out how the code works...
Hopefully the class, method, parameter and variable names will be descriptive. This should give you a pretty good starting point. You can then take one method or class at a time and determine if you believe that the code within it delivers what you think the naming implies. If there are unit tests then these will give a good indication of what the programmer expected the code to do (or handle). Regardless, try to write some (more) unit tests for the code, because thinking of special cases that might break the code, and working out why the code fails some of your tests, will give you a good understanding of what it does and how it does it. Then you can augment the basic documentation you've written with the more useful details (can this parameter be null? what range of values is legal? What will the return value be if you pass in a blank string? etc)
This can be daunting, but if you start with the little classes and methods first (e.g. that Name property that just returns a name string) you will gain familiarity with the surrounding code and be able to gradually work your way up to the more complex classes and methods.
Once you have basic code documentation written for the classes, you should then be in a position to write external overview documentation that describes how the system as a whole functions. And then you'll be ready to work on that part of the codebase because you'll understand how it all fits together.
I'd recommend using XML documentation (see the other answers) as this is immediately picked up by Visual Studio and used for intellisense help. Then anyone writing code that calls your classes will get help in tooltips as they type the code. This is such a major bonus when working with a team or a large codebase, but many companies/programmers just don't realise what they've been missing, banging their (undocumented) rocks together in the dark ages :-)

I suspect your boss is referring to the following XML Documentation Comments.
XML Documentation Comments (C# Programming Guide)

It might be worth asking your boss if he has any examples of code that is already documented so you can see first-hand what he is after.
Mark Needham has written a few blog posts about reading/documenting code (see Archive for the ‘Reading Code’ Category.
I remember reading Reading Code: Rhino Mocks some time ago that talks about diagramming the code to help keep track of where you are and to 'map out' what's going on.
Hope that helps - good luck!

Coding guidelines + Best Practices? [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 10 years ago.
Improve this question
I couldn't find any question that directly applies to my query so I am posting this as a new question. If there is any existing discussion that may help me, please point it out and close the question.
Question:
I am going to do a presentation on C# coding guidelines but it is not supposed to limit to coding standards.
So I have a rough idea but I think I need to address good programing practices. So the contents will be something like this.
Basic coding standards - Casing, Formatting etc.
Good practices - Usage of Hashset over other data structures, String vs String Builder, String's immutability and using them effectively etc
Really I would like to add more good practices (Especially to improve the performance.) So like to hear some more good practices to be used with C#. Any suggestions??? (No need of large descriptions :) Just the idea is sufficient.)

Coding Guidelines for CSharp 3.0 and 4.0
IDesign Coding Standards
Lance Hunt's C# Coding Standards
Brad Abrams' Internal Coding Guidelines
Unsurprisingly, I just found a SO question: C# Coding standard / Best practices

Here are a few tips:
Use FxCop for static analysis.
Use StyleCop for coding style validation.
Because of the different semantics of value types, supply them with an alternative color in the IDE (go to Tools / Options / Environment / Fonts and Colors / Display Items and supply User Types (Enums) and User Types (Value types) with a value like #DF7120 [223, 113, 32]).
Because exceptions tend to show bugs in your code, let the IDE break on all exceptions. (go to Debug / Exceptions... / Common Language Runtime Exceptions and check Throw).
Project settings: Disallow unsafe code.
Project settings: Threat warnings as errors.
Project settings: Check for arithmetic overflow/underflow.
Use variables for a single, well defined goal.
Don't use magic numbers.
Write short methods. A method should only contain one level of abstraction.
A method can never be too small (a method of 20 lines is considered pretty big).
A method should protect itself against bad input.
Consider making a type immutable.
Don't suppress warnings in your code with pragma warning disable.
Don't comment bad code: rewrite it.
Document explicitly in code why you are swallowing an exception.
Note the performance implications of concatenating strings.
Never use goto statements.
Fail early, fail fast.

I'm using Microsoft's Design Guidelines for Developing Class Libraries.
And I think it is quite good to start with.

Basic Coding Standards - Make sure it's consistent. Even if they don't follow the conventions set out in this document on msdn. I think consistency is really key here.
Unit Tests - You cannot go wrong here.
Security - Talk about ensuring that if you are passing sensitive data around that it's secure.
Performance - You know, I personally feel that getting the application right and then looking at performance is what I do. I do have it in the back of my mind when writing code, so it's little fine tunings that come in at the end.

XML Commenting tips for C# programming [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 5 years ago.
Improve this question
Good morning, afternoon, evening or night (depending on your timezone).
This is just a general question about XML commenting within C#. I have never been very big into commenting my programs, I've always been more of a verbose variable/property/method namer and letting the code speak for itself. I do write comments if I'm coding something that is fairly confusing, but for the most part I don't write alot of comments.
I was doing some reading about XML comments in .NET, Sandcastle, and the help file builder on codeplex and it has taken me down the path of wanting to document my code and generate some nice, helpful documentation for those who have to dig into my code when I'm no longer here.
My question is about standards and conventions. Is there a guide to "good" XML commenting? Should you comment EVERY variable and property? EVERY method? I'm just basically looking for tips on how to write good comments that will be compiled by sandcastle into good documentation so other programmers don't curse my name when they end up having to work on my code.
Thank you in advance for your advice and suggestions,
Scott Vercuski

Personally, we make sure that every public and protected method has XML comments. It also will provide you with Intellisense, and not just end-user help documentation. In the past, we also have included it on privately scoped declarations, but do not feel it is 100% required, as long as the methods are short and on-point.
Don't forget that there are tools to make you XML commenting tasks easier:
GhostDoc - Comment inheritance and templating add-in.
Sandcastle Help File Builder - Edits the Sandcastle projects via a GUI, can be run from a command line (for build automation), and can edit MAML for help topics not derived from code. (The 1.8.0.0 alpha version is very stable and very improved. Have been using it for about a month now, over 1.7.0.0)

Comments are very often outdated. This always has been a problem. My rule of thumb : the more you need to work to update a comment, the faster that comment will be obsolete.
XML Comments are great for API development. They works pretty well with Intellisens and they can have you generate an HTML help document in no time.
But this is not free: maintaining them will be hard (look at any non-trivial example, you will understand what I mean), so they will tend to be outdated very fast. As a result, reviewing XML Comments should be added to your code review as a mandatory check and this check should be performed every time a file is updated.
Well, since it is expensive to maintain, since a lot of non private symbols (in non-API development) are used only by 1 or 2 classes, and since these symboles are often self-explanatory, I would never enforce a rule saying that every non-private symbol should be XML commented. This would be overkill and conterproductive. What you will get is what I saw at a lot of places : nearly empty XML Comments adding nothing to the symbole name. And code that is just a little less readable...
I think that the very, very important guide line about comments in normal (non-API) code should not be about HOW they should be written but about WHAT they should contain. A lot of developers still don't know what to write. A description of what should be commented, with examples, would do better for your code than just a plain : "Do use XML Comments on every non-private symbole.".

I document public classes and the Public/Protected Members of those classes.
I don't document private or internal members or internal classes. Hence variables (I think you mean fields) because they are private.
The objective is to create some documentation for a developer who does not have ready access to the source code.
Endeavour to place some examples where usage is not obvious.

I very rarely comment on method variables, and equally rarely fields (since they are usually covered by a property, or simply don't exist if using auto-implemented properties).
Generally I try hard to add meaningful comments to all public/protected members, which is handy, since if you turn on the xml comments during build, you get automatic warnings for missing comments. Depending on the complexity, I might not fill out every detail - i.e. if it is 100% obvious what every parameter has to do (i.e. there is no special logic, and there is only 1 logical way of interpreting the variables), then I might get lazy and not add comments about the parameters.
But I certainly try to describe what methods, types, properties, etc represent/do.

We document the public methods/properties/etc on our libraries. As part of the build process we use NDoc to create an MSDN-like web reference. It's been very helpful for quick reference and lookup.
It's also great for Intellisense, especially with new team members or, like you said, when the original author is gone.
I agree that code, in general, should be self-explanatory. The XML documention, to me, is more about reference and lookup when you don't have the source open.

Personally my opinion is to avoid commenting. Commenting is dangerous. Because in industry we always update code(because business & requirements are always changing), but vary rarely we update our comments. This may misguide the programmers.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.