Handling escape characters on custom language

I'm working on a new feature for a C# application that will process text given by the user. This text can contain any character, but everything between braces ({}) or between brackets ([]) will be treated in a special way: the text inside brackets will be replaced with other text, and braces will mark a subsection of the given text that is processed differently.
I want to let the user include literal braces and brackets in their text. My first thought was to use "{{" to represent "{", and likewise for the other special characters, but this causes problems. If the user wants to open a subsection whose first character is "{", they would write "{{{", but that is exactly what they would write to put a literal "{" right before the subsection. So this is ambiguous.
Now I'm thinking I could use "\" to escape braces and brackets, and "\\" to represent "\". I'm kind of figuring out how to process this, but I have a feeling I'm reinventing the wheel. Is there a known algorithm or library that does what I'm trying to do?

Why don't you use an existing markup convention? There are plenty of lightweight syntaxes to choose from; depending on your user population, some of them might already be familiar with MediaWiki markup and/or BBcode and/or reST and/or Markdown.

Why don't you use XML tags instead of special characters?
<section>
Blah blah blah blah <replace id="some identifier" />
</section>
This approach would let you parse your text using any XML parser in Microsoft .NET or on any other platform. And you'll save time because you don't have to design an escaping scheme of your own: XML already defines one (&lt; for <, &amp; for &).
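
As a rough illustration of the XML route, here is a minimal sketch using System.Xml.Linq. The XmlTemplate class, the Render method and the replacements dictionary are hypothetical names made up for this example; only the <section> and <replace id="..."/> elements come from the answer above.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: swaps every <replace id="..."/> element for text
// looked up by id, then returns the plain text of the whole document.
public static class XmlTemplate
{
    public static string Render(string xml, IDictionary<string, string> replacements)
    {
        XElement root = XElement.Parse(xml, LoadOptions.PreserveWhitespace);

        // ToList() takes a snapshot so the tree can be modified while looping.
        foreach (XElement placeholder in root.Descendants("replace").ToList())
        {
            string id = (string)placeholder.Attribute("id");
            string text = string.Empty; // assumption: unknown ids render as nothing
            if (id != null && replacements.TryGetValue(id, out string found))
            {
                text = found;
            }
            placeholder.ReplaceWith(new XText(text));
        }

        // XElement.Value concatenates all descendant text nodes.
        return root.Value;
    }
}
```

With this approach a user who wants a literal < or & writes &lt; or &amp;, and the parser decodes it.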

I'd recommend using \ to escape { and } characters in the text, and unescaped { and } to surround a subsection. This is how C# handles " characters in a string. Using double braces introduces ambiguities and makes correctly processing the text difficult, if not impossible. Your choice also depends on your target users: developers are comfortable with escape characters, but they can confuse non-developers. You might want to use tags like <sub> and </sub> to indicate a subsection. Either way, you can use a regular expression to parse the user's text into a MatchCollection via Regex.Matches.
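
A single left-to-right pass is enough to handle backslash escapes, without a regex. The sketch below is one possible way to do it, not an established library: TokenKind, EscapeTokenizer and Tokenize are hypothetical names, and it assumes that a backslash makes whatever character follows it literal (so \\ yields \ and \{ yields {).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public enum TokenKind { Text, OpenBrace, CloseBrace, OpenBracket, CloseBracket }

// Hypothetical tokenizer: splits input into literal text and unescaped
// structural characters, resolving backslash escapes along the way.
public static class EscapeTokenizer
{
    public static List<(TokenKind Kind, string Value)> Tokenize(string input)
    {
        var tokens = new List<(TokenKind Kind, string Value)>();
        var literal = new StringBuilder();

        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (c == '\\')
            {
                // Assumption: a backslash escapes any following character.
                if (i + 1 >= input.Length)
                    throw new FormatException("Dangling escape character at end of input.");
                literal.Append(input[++i]);
            }
            else if (TryGetSpecial(c, out TokenKind kind))
            {
                // Emit pending literal text before the structural token.
                if (literal.Length > 0)
                {
                    tokens.Add((TokenKind.Text, literal.ToString()));
                    literal.Clear();
                }
                tokens.Add((kind, c.ToString()));
            }
            else
            {
                literal.Append(c);
            }
        }

        if (literal.Length > 0) tokens.Add((TokenKind.Text, literal.ToString()));
        return tokens;
    }

    private static bool TryGetSpecial(char c, out TokenKind kind)
    {
        switch (c)
        {
            case '{': kind = TokenKind.OpenBrace; return true;
            case '}': kind = TokenKind.CloseBrace; return true;
            case '[': kind = TokenKind.OpenBracket; return true;
            case ']': kind = TokenKind.CloseBracket; return true;
            default: kind = TokenKind.Text; return false;
        }
    }
}
```

With this scheme the two cases that "{{{" could not tell apart become distinct: \{{ is a literal brace followed by a subsection opener, and {\{ is a subsection whose first character is a brace. Matching braces into nested subsections would be a second pass over the token list.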

Related

How to escape variable name when using Roslyn C# Syntax Factory?

So I'm using Roslyn SyntaxFactory to generate C# code.
Is there a way for me to escape variable names when generating a variable name using IdentifierName(string)?
Requirements:
It would be nice if Unicode is supported but I suppose ASCII can suffice
It would be nice if it's reversible
Always same result for same input ("a" is always "a")
Unique result for each input ("a?"->"a_" cannot be same as "a!"->"a_")
Can convert from 1 special character to multiple single ones

The implication from the API docs seems to be that it expects a valid C# identifier here, so Roslyn's not going to provide an escaping mechanism for you. Therefore, it falls to you to define a string transformation such that it achieves what you want.
The way to do this would be to look at how other things already do it. Look at HTML entities, which are always introduced using &. They can always be distinguished easily, and there's a way to encode a literal & as well so that you don't restrict your renderable character set. Or consider how C# strings allow you to include string delimiters and other special characters in the string through the use of \.
You need to pick a character which is valid in C# identifiers to be your 'marker' for a sequence which represents one of the non-identifier characters you want to encode, and a way to allow that character to also be represented. Then make a mapping table for what comes after the marker for each of the encoded characters. If you want to do all of Unicode, the easiest way is probably to just use Unicode codepoint numbers. The resulting identifiers might not be very readable, but maybe that doesn't matter in your use case.
Once you have a suitable system worked out, it should be pretty straightforward to write a string transformation function which implements it.
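
One possible scheme along those lines, sketched under stated assumptions: _ is the marker, __ stands for a literal underscore, and _uXXXX stands for any other UTF-16 code unit written as four hex digits. IdentifierEscaper, Encode and Decode are hypothetical names, not part of Roslyn, and the sketch does not deal with C# keywords (an input such as "class" would still need an @ prefix).

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical reversible encoding of arbitrary strings into ASCII identifiers.
public static class IdentifierEscaper
{
    private const char Marker = '_';

    private static bool IsAsciiLetter(char c) =>
        (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');

    private static bool IsAsciiDigit(char c) => c >= '0' && c <= '9';

    public static string Encode(string raw)
    {
        if (string.IsNullOrEmpty(raw))
            throw new ArgumentException("Input must not be empty.", nameof(raw));

        var sb = new StringBuilder();
        for (int i = 0; i < raw.Length; i++)
        {
            char c = raw[i];
            if (c == Marker)
                sb.Append("__"); // literal marker
            else if (IsAsciiLetter(c) || (IsAsciiDigit(c) && i > 0))
                sb.Append(c); // already valid at this position
            else
                // Everything else, including a leading digit, becomes _uXXXX.
                sb.Append("_u").Append(((int)c).ToString("X4", CultureInfo.InvariantCulture));
        }
        return sb.ToString();
    }

    public static string Decode(string encoded)
    {
        var sb = new StringBuilder();
        int i = 0;
        while (i < encoded.Length)
        {
            char c = encoded[i];
            if (c != Marker) { sb.Append(c); i++; continue; }

            if (i + 1 >= encoded.Length) throw new FormatException("Dangling marker.");
            char next = encoded[i + 1];
            if (next == Marker)
            {
                sb.Append(Marker);
                i += 2;
            }
            else if (next == 'u' && i + 6 <= encoded.Length)
            {
                string hex = encoded.Substring(i + 2, 4);
                sb.Append((char)int.Parse(hex, NumberStyles.AllowHexSpecifier, CultureInfo.InvariantCulture));
                i += 6;
            }
            else
            {
                throw new FormatException("Unknown escape sequence.");
            }
        }
        return sb.ToString();
    }
}
```

Because every escape starts with the marker and has a fixed width, "a?" and "a!" encode to different identifiers (a_u003F and a_u0021), and Decode(Encode(s)) returns s for any non-empty s.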

Simple lexical parser

I want to write a lexical parser for regular text.
So I need to detect the following tokens:
1) Word
2) Number
3) dot and other punctuation
4) "..." "!?" "!!!" and so on
I think it is not trivial to write an "if else" condition for each item.
So are there any finite state machine generators for C#?
I know ANTLR and others, but in the time it would take me to learn those tools I could write my own "if else" FSM.
I hope to find something like:
FiniteStateMachine.AddTokenDefinition(":)","smile");
FiniteStateMachine.AddTokenDefinition(".","dot");
FiniteStateMachine.ParseText(text);

I suggest using regular expressions. Something like @"[a-zA-Z\-]+" will pick up words (a-z and dashes), while @"[0-9]+(\.[0-9]+)?" will pick up numbers (including decimal numbers). Dots and such are similar, @"[!\.\?]+", and you can add whatever punctuation you need inside the square brackets (escaping special regex characters with a \).
Poor man's "lexer" for C# is very close to what you are looking for, in terms of being a lexer. I recommend googling regular expressions for words and numbers, or whatever else you need, to find out exactly which expressions you need.
EDIT:
Or see Justin's answer for the particular regexes.

We need to know specifics on what you consider a word or a number. That being said, I'll assume "word" means "a C#-style identifier," and "number" means "a string of base-10 numerals, possibly including (but not starting or ending with) a decimal point."
Under those definitions, words would be anything matching the following regex:
@"\b(?!\d)\w+\b"
Note that this would also match Unicode letters. Numbers would match the following:
@"\b\d+(?:\.\d+)?\b"
Note again that this doesn't cover hexadecimal, octal, or scientific notation, although you could add that in without too much difficulty. It also doesn't cover numeric literal suffixes.
After matching those, you could probably get away with this for punctuation:
@"[^\w\d\s]+"
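
The three patterns above can be combined into one alternation with named groups, which gives something close to the AddTokenDefinition idea from the question. This is only a sketch: SimpleLexer and Tokenize are hypothetical names, and the punctuation class is written as [^\w\s] because \d is already covered by \w.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical regex-based lexer for words, numbers and punctuation runs.
public static class SimpleLexer
{
    private static readonly string[] Kinds = { "Number", "Word", "Punctuation" };

    private static readonly Regex TokenPattern = new Regex(
        @"(?<Number>\b\d+(?:\.\d+)?\b)|(?<Word>\b(?!\d)\w+\b)|(?<Punctuation>[^\w\s]+)",
        RegexOptions.Compiled);

    public static List<(string Kind, string Text)> Tokenize(string input)
    {
        var tokens = new List<(string Kind, string Text)>();
        foreach (Match m in TokenPattern.Matches(input))
        {
            // Exactly one named group succeeds per match; report which one.
            foreach (string kind in Kinds)
            {
                if (m.Groups[kind].Success)
                {
                    tokens.Add((kind, m.Value));
                    break;
                }
            }
        }
        return tokens;
    }
}
```

Whitespace is skipped because no alternative matches it, and a run such as "..." or "!?" comes back as a single Punctuation token.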

C# How to remove text between BBCode

How to remove all text between BBCode Quotation (including BBCode itself):
[quote date=2011-07-02 14:43:53 user=test link=1]blabla[/quote]
I should add that the text between the tags can contain HTML tags for formatting.
My current attempt looks like:
Regex regex = new Regex(@"[quote+].+?[/\+quote]");
Well, it's almost working.

You may try the following regex:
@"\[quote.*\].*?\[/quote\]"
Note that you have to escape square brackets in a regex.

Since your BBCode blocks contain attributes, a simple + won't suffice to cover everything. + means to repeat the preceding element, in this case e.
Off the top of my head, I'd try something like this:
\[quote([^\[]*)\](.*?)\[\/quote\]
Please bear in mind that I have not tested this in C#, where the syntax might differ slightly. Also note that I've added capture groups so that you can examine what each part matched. As @Howard answered, [ and ] are reserved symbols and consequently need to be escaped.
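
For reference, a small C# sketch of the removal. BbCodeCleaner and RemoveQuotes are hypothetical names. It uses [^\]]* for the attribute part so the opening tag cannot run past its own closing bracket, and RegexOptions.Singleline so that . also matches newlines inside the quote. Nested quote blocks are not handled.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper that strips [quote ...]...[/quote] blocks from a string.
public static class BbCodeCleaner
{
    private static readonly Regex QuoteBlock = new Regex(
        @"\[quote[^\]]*\].*?\[/quote\]",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    public static string RemoveQuotes(string input) =>
        QuoteBlock.Replace(input, string.Empty);
}
```

The lazy .*? stops at the first [/quote], so two separate quote blocks in the same string are removed independently and the text between them survives.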

Remove spam url in text

Input:
dsfdsf www. cnn .com dksfj kdsfjkdjfdf
www.google.com dkfjkdjfk w w w . ya
hoo .co mdfdd
Output:
dsfdsf dksfj kdsfjkdjfdf dkfjkdjfk mdfdd
How do I write a function that does this in C#?

Basically you would have to implement two steps:
Normalization
Filtering
Normalization means that you would remove all whitespace and other noise characters from your input, then you do a transcoding of all diacritics, special characters etc into the basic latin alphabet (this is to map identical- or similar-looking glyphs to one single char, e.g. omicron and o look identical). You would need to retain a one-to-one mapping from the normalized version of the input to the original input.
Then you would search the normalized input for blocked patterns, retrieve the same pattern in the original input and remove it.
Of course, this approach is not fail-safe; you might actually get false positives.
A good answer describing how the simple filtering is doomed can be found here:
How do you implement a good profanity filter?
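
The normalize-then-filter idea can be sketched as follows. This is a simplified illustration, not a robust filter: SpamUrlFilter and Remove are hypothetical names, the blocked pattern (www. plus a host plus .com, .org or .net) is an assumption, and normalization here only drops whitespace, leaving out the transcoding of diacritics and look-alike glyphs.

```csharp
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical filter: finds URLs in a whitespace-free copy of the input,
// then removes the corresponding span from the original text.
public static class SpamUrlFilter
{
    // Assumed blocked pattern; adjust to the URLs you actually want to catch.
    private static readonly Regex Blocked = new Regex(
        @"www\.[a-z0-9-]+\.(?:com|org|net)", RegexOptions.IgnoreCase);

    public static string Remove(string input)
    {
        // Step 1: normalize, remembering where each kept character came from.
        var normalized = new StringBuilder();
        var originalIndex = new List<int>();
        for (int i = 0; i < input.Length; i++)
        {
            if (char.IsWhiteSpace(input[i])) continue;
            normalized.Append(input[i]);
            originalIndex.Add(i);
        }

        // Step 2: match on the normalized text and mark the original span.
        var remove = new bool[input.Length];
        foreach (Match m in Blocked.Matches(normalized.ToString()))
        {
            int start = originalIndex[m.Index];
            int end = originalIndex[m.Index + m.Length - 1];
            for (int i = start; i <= end; i++) remove[i] = true;
        }

        // Step 3: rebuild the original without the marked characters.
        var result = new StringBuilder();
        for (int i = 0; i < input.Length; i++)
        {
            if (!remove[i]) result.Append(input[i]);
        }
        return result.ToString();
    }
}
```

Because matching happens on text with the spaces removed, ordinary words that happen to line up as "www.x.com" across word boundaries are removed too, which is the false-positive risk mentioned above.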

Start with learning about the RegEx (Regular Expression) facilities in C#, then you'll need a good RegEx that matches a URL. You'd need to change this to manage URLs with spaces though.

Regular expression to replace square brackets with angle brackets

I have a string like:
[a b="c" d="e"]Some multi line text[/a]
The part d="e" is optional. I want to convert this kind of string into:
<a b="c" d="e">Some multi line text</a>
The names a, b and d are constant, so I don't need to capture them. I just need the values c and e and the text between the tags, so that I can build an equivalent XML-based expression. How can I do that, given that part of the input is optional?

For HTML tags, please use an HTML parser.
For [a][/a], you can do the following:
// Singleline (not Multiline) makes . match newlines, which multi-line text needs.
Match m = Regex.Match(@"[a b=""c"" d=""e""]Some multi line text[/a]",
@"\[a b=""([^""]+)"" d=""([^""]+)""\](.*?)\[/a\]",
RegexOptions.Singleline);
m.Groups[1].Value
"c"
m.Groups[2].Value
"e"
m.Groups[3].Value
"Some multi line text"
Here is a Regex.Replace version (not my preferred approach, though), with d made optional:
string inputStr = @"[a b=""[[[[c]]]]"" d=""e[]""]Some multi line text[/a]";
string resultStr = Regex.Replace(inputStr,
@"\[a( b=""[^""]+"")( d=""[^""]+"")?\](.*?)\[/a\]",
@"<a$1$2>$3</a>",
RegexOptions.Singleline);

If you are actually thinking of processing (pseudo)-HTML using regexes,
don't
SO is filled with posts where regexes are proposed for HTML/XML and answers pointing out why this is a bad idea.
Suppose your multiline text ("which can be anything") contains
[a b="foo" [a b="bar"]]
a regex cannot detect this.
See the classic answer in:
RegEx match open tags except XHTML self-contained tags
which has:
I think it's time for me to quit the
post of Assistant Don't Parse HTML
With Regex Officer. No matter how many
times we say it, they won't stop
coming every day... every hour even.
It is a lost cause, which someone else
can fight for a bit. So go on, parse
HTML with regex, if you must. It's
only broken code, not life and death.
– bobince
Seriously. Find an XML or HTML DOM and populate it with your data. Then serialize it. That will take care of all the problems you don't even know you have got.

Would the multi-line text include [ and ]? If not, you can just replace [ with < and ] with > using string.Replace; there is no need for a regex.
Update:
If it can be anything but [/a], you can replace
^\[a([^\]]+)](.*?)\[/a]$
with
<a$1>$2</a>
I haven't escaped ] and / in the regex - escape them if necessary to get
^\[a([^\]]+)\](.*?)\[\/a\]$
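
In C#, the escaped pattern above could be applied roughly like this. BracketTagConverter and ToAngleBrackets are hypothetical names. RegexOptions.Singleline is an addition so that .*? can span line breaks, and the sketch assumes attribute values never contain ].

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper: rewrites a whole-string [a ...]...[/a] as <a ...>...</a>.
public static class BracketTagConverter
{
    private static readonly Regex TagPattern = new Regex(
        @"^\[a([^\]]+)\](.*?)\[/a\]$", RegexOptions.Singleline);

    // $1 is the attribute text, $2 is the body between the tags.
    public static string ToAngleBrackets(string input) =>
        TagPattern.Replace(input, "<a$1>$2</a>");
}
```

Because the attributes are captured as one group, the optional d="e" part needs no special handling. Input that does not match the pattern is returned unchanged.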
