Issue creating regex patterns for batch file syntax highlighting

I have the following code to create syntax highlighting for a text editor that I am working on. It uses the FastColoredTextBox component. I can't quite get the regex pattern for highlighting batch file variables correct.
private void batchSyntaxHighlight(FastColoredTextBox fctb)
{
    fctb.LeftBracket = '(';
    fctb.RightBracket = ')';
    fctb.LeftBracket2 = '\x0';
    fctb.RightBracket2 = '\x0';
    Range e = fctb.Range;
    e.ClearStyle(StyleIndex.All);
    //clear style of changed range
    e.ClearStyle(BlueStyle, BoldStyle, GrayStyle, MagentaStyle, GreenStyleItalic, BrownStyleItalic, YellowStyle);
    //variable highlighting
    e.SetStyle(YellowStyle, "(\".+?\"|\'.+?\')", RegexOptions.Singleline);
    //comment highlighting
    e.SetStyle(GreenStyleItalic, @"(REM.*)");
    //attribute highlighting
    e.SetStyle(GrayStyle, @"^\s*(?<range>\[.+?\])\s*$", RegexOptions.Multiline);
    //class name highlighting
    e.SetStyle(BoldStyle, @"(:.*)");
    //symbol highlighting
    e.SetStyle(MagentaStyle, @"(@|%)", RegexOptions.Singleline);
    e.SetStyle(RedStyle, @"(\*)", RegexOptions.Singleline);
    //keyword highlighting
    e.SetStyle(BlueStyle, @"\b(set|SET|echo|Echo|ECHO|FOR|for|PUSHD|pushd|POPD|popd|pause|PAUSE|exit|Exit|EXIT|cd|CD|If|IF|if|ELSE|Else|else|GOTO|goto|DEL|del)");
    //clear folding markers
    e.ClearFoldingMarkers();
    BATCH_HIGHLIGHTING = true;
}
Using this code, I can't seem to highlight strings between two '%' symbols without highlighting almost the entire file, because many lines contain only one '%' symbol or two right next to each other.
I am also having trouble with '::' comments. To highlight the labels, I created a regex pattern that matches any line containing a ':' together with all the characters that follow it.
I want to get the highlighting correct so that labels are highlighted with BoldStyle and '::' comments are highlighted with GreenStyleItalic, without any conflicts. I would also like to highlight strings that lie between two '%' symbols without conflicts (such as a line that contains only one '%').
All of this should only be highlighted when it is not inside a comment.
EDIT: Currently the code only highlights '%' symbols by themselves, as I was unable to highlight the text between them without causing major highlighting problems.

Big thanks to @DougF for helping me find this solution. The answer is:
@"^:[a-zA-Z]+"
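As a rough sketch of how the three cases from the question could be kept apart (this is not the asker's final code): the label pattern is the one from the answer above, while the comment pattern, the variable pattern, the class name BatchPatterns and the helper VariablesOutsideComments are assumptions made up for illustration. RegexOptions.Multiline is assumed so that ^ matches at the start of every line.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class BatchPatterns
{
    // Label pattern from the answer. "::" cannot match it, because a letter
    // must directly follow the first colon.
    public static readonly Regex Label =
        new Regex(@"^:[a-zA-Z]+", RegexOptions.Multiline);

    // Assumed comment pattern: a line that starts with "::" or the word REM.
    public static readonly Regex Comment =
        new Regex(@"^[ \t]*(?:::|REM\b)[^\r\n]*",
                  RegexOptions.Multiline | RegexOptions.IgnoreCase);

    // Assumed variable pattern: %NAME% with at least one character between the
    // percent signs that is neither whitespace nor '%', so a lone "%" or "%%"
    // never matches.
    public static readonly Regex Variable = new Regex(@"%[^%\s]+%");

    // Returns the variable matches that do not start inside a comment match.
    public static List<Match> VariablesOutsideComments(string text)
    {
        List<Match> comments = Comment.Matches(text).Cast<Match>().ToList();
        return Variable.Matches(text).Cast<Match>()
            .Where(v => !comments.Any(c => v.Index >= c.Index && v.Index < c.Index + c.Length))
            .ToList();
    }

    public static void Main()
    {
        string text = ":start\r\n:: uses %PATH%\r\necho %USERNAME% is 100% done\r\ngoto start";
        foreach (Match m in Label.Matches(text)) Console.WriteLine("label: " + m.Value);
        foreach (Match m in Comment.Matches(text)) Console.WriteLine("comment: " + m.Value);
        foreach (Match m in VariablesOutsideComments(text)) Console.WriteLine("variable: " + m.Value);
    }
}
```

With FastColoredTextBox, the same pattern strings could presumably be passed to e.SetStyle in the order comments, labels, variables; whether a later style overrides an earlier one on the same range depends on the component and is not verified here.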

Related

Regex hangs trying to find match

I am trying to match an assignment string in VB code (as in I'm passing in text that is VB code into my program that's written in C#). The assignment string that I'm trying to match is something for example like
CustomClassInitializer(someParameter, anotherParameter, someOtherClassAsParameterWithInitialization()).SomeProperty = 7
and I realize that's rather complex, but it actually isn't far off from some of the real text I'm trying to match.
In order to do so I wrote a Regex. This Regex:
@"[\w,.]+\(([\w,.]*\(*,* *\)*)+ = "
which correctly matches. The problem is it becomes VERY slow (with timeouts), which I've researched and found is probably because of "backtracking". One of the suggested solutions to help with backtracking in general was to add "?>" to the regex, which I think would go in this position:
[\w,.]+\(?>([\w,.]*\(*,* *\)*)+ =
but this no longer matches properly.
I'm fairly new to Regex, so I imagine that there is a much better pattern. What is it please? Or how can I improve my times in general?
Helpful notes:
I'm only interested in position 0 of the string I'm searching for a match in. My code is "if (isMatch && match.index == 0) { ... }". Can I tell it to only check position 0 and, if it's not a match, move on?
The reason I use all the "0 or more" quantifiers is that the match could be as simple as CustomClass() = new CustomClass(), or as complicated as the above, or perhaps a bit worse. I'm trying to cover as many cases as possible.
This Regex is interested in "[\w,.]+(" and then "whatever may be inside the parentheses" (I tried to think of everything that could be inside them, given that it's valid VB code) until you get to the close parenthesis and then " = ". Perhaps I can use a wildcard for literally anything until it gets to ") = " in the string? Like I said, I'm fairly new to Regex.
Thanks in advance!

This seems to do what you want. Normally, I like to be more specific than .*, but it is working correctly. Note that I am using the Multi-line option.
^.*=\s*.+$
Here is a working example on RegExStorm.net.
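For the two specific points raised in the question, here is a hedged sketch rather than a definitive fix. An atomic group has to be written as (?>...); in the attempt `\(?>`, the regex engine reads `\(?` as an optional literal parenthesis followed by a literal `>`, which is why it stopped matching. Anchoring with \A restricts the search to position 0. The class name AssignmentMatcher, the 250 ms timeout and the exact pattern are assumptions; the pattern uses .NET balancing groups to consume nested parentheses and will be confused by parentheses inside string literals.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AssignmentMatcher
{
    // \A            : only try a match at position 0
    // (?> ... )*    : atomic group, so the engine does not re-try alternatives
    //                 inside an iteration it has already consumed
    // (?<open>\()   : push on every nested "("
    // (?<-open>\))  : pop on every nested ")"
    // (?(open)(?!)) : fail if a nested "(" was left unclosed
    private static readonly Regex Assignment = new Regex(
        @"\A[\w.]+\((?>[^()]+|(?<open>\()|(?<-open>\)))*(?(open)(?!))\)(?:\.\w+)* = ",
        RegexOptions.None,
        TimeSpan.FromMilliseconds(250)); // assumed safety timeout

    public static bool StartsWithAssignment(string line)
    {
        try
        {
            return Assignment.IsMatch(line);
        }
        catch (RegexMatchTimeoutException)
        {
            // Treat a timeout as "no match" in this sketch.
            return false;
        }
    }

    public static void Main()
    {
        Console.WriteLine(StartsWithAssignment(
            "CustomClassInitializer(someParameter, anotherParameter, someOtherClassAsParameterWithInitialization()).SomeProperty = 7"));
        Console.WriteLine(StartsWithAssignment("CustomClass() = new CustomClass()"));
        Console.WriteLine(StartsWithAssignment("x = Foo(1)"));
    }
}
```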

Syntax highlighting richtextbox in C# on a single line only

I'm working on my own syntax highlighter using a RichTextBox. It's already working, but I've noticed that typing slows down a lot when there are too many lines of code. This is because my syntax highlighting function recolors all the words in the entire RichTextBox on every change made to it. Here's a minimal example of the function to show how it works:
private void colorCode()
{
    // getting keywords/functions
    string keywords = @"\b(class|function)\b";
    MatchCollection keywordMatches = Regex.Matches(codeBox.Text, keywords);
    // saving the original caret position + forecolor
    int originalIndex = codeBox.SelectionStart;
    int originalLength = codeBox.SelectionLength;
    Color originalColor = Color.Black;
    // focuses a label before highlighting (avoids blinking)
    titleLabel.Focus();
    // removes any previous highlighting (so modified words won't remain highlighted)
    codeBox.SelectionStart = 0;
    codeBox.SelectionLength = codeBox.Text.Length;
    codeBox.SelectionColor = originalColor;
    foreach (Match m in keywordMatches)
    {
        codeBox.SelectionStart = m.Index;
        codeBox.SelectionLength = m.Length;
        codeBox.SelectionColor = Color.Blue;
    }
    // restoring the original colors, for further writing
    codeBox.SelectionStart = originalIndex;
    codeBox.SelectionLength = originalLength;
    codeBox.SelectionColor = originalColor;
    // giving back the focus
    codeBox.Focus();
}
To solve the problem, I want to write a function that doesn't change the entire Richtextbox, but just the line of the cursor position instead. I realise this will still cause the same issue on minified code, but that's not a problem for me. The problem is, I can't seem to get it working. This is what I've got so far:
void changeLine(RichTextBox RTB, int line, Color clr, int curPos)
{
    string testWords = @"\b(test1|test2)\b";
    MatchCollection testwordMatches = Regex.Matches(RTB.Lines[line], testWords);
    foreach (Match m in testwordMatches)
    {
        //RTB.SelectionStart = m.Index;
        //RTB.SelectionLength = m.Length;
        RTB.SelectionColor = Color.Blue;
    }
    RTB.SelectionStart = curPos;
    RTB.SelectionColor = Color.Black;
}
The problem is that it does the coloring when a word in testWords is found, but it colors the entire line instead of just the word. This is because I can't figure out how to set the selections correctly. So I'm hoping you can help me out with this.
Edit:
I'd like to add that I did think about other solutions, like putting the lines in a List or using a StringBuilder. But those turn the lines into plain strings and don't allow color formatting the way the RichTextBox does.
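One way to look at the selection problem described above: m.Index is relative to the single line that was searched, while the selection needs an index relative to the whole text. A minimal sketch of that offset arithmetic follows; the class name LineHighlighter and the method KeywordSpans are made up for illustration, and the sketch leaves out the control itself so that it can run on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LineHighlighter
{
    private static readonly Regex TestWords = new Regex(@"\b(test1|test2)\b");

    // lineStart is the index of the line's first character within the whole text.
    // Returns (absolute start, length) pairs for every keyword found in the line.
    public static List<KeyValuePair<int, int>> KeywordSpans(string lineText, int lineStart)
    {
        var spans = new List<KeyValuePair<int, int>>();
        foreach (Match m in TestWords.Matches(lineText))
        {
            // m.Index is relative to the line, so shift it by the line's start.
            spans.Add(new KeyValuePair<int, int>(lineStart + m.Index, m.Length));
        }
        return spans;
    }

    public static void Main()
    {
        string[] lines = { "first line", "a test1 and test2" };
        int lineStart = lines[0].Length + 1; // +1 for the "\n" between the two lines
        foreach (var span in KeywordSpans(lines[1], lineStart))
        {
            Console.WriteLine("start " + span.Key + ", length " + span.Value);
        }
    }
}
```

In the RichTextBox version, lineStart would presumably come from RTB.GetFirstCharIndexFromLine(line), and each span would be applied with RTB.Select(start, length) before setting RTB.SelectionColor.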

Well, you need a language lexer and parser. This task is not solvable with regex alone; regular expressions are not capable of it because of fundamental limits on the class of grammars they can describe (read about the Chomsky hierarchy of grammars).
What you need is a grammar toolkit. For example, ANTLR4 provides a lexer/parser generator and a set of predefined grammars.
You can find a lot of user-written grammars here (including recent C# syntax): https://github.com/antlr/grammars-v4
Then generate a parser/lexer from the grammar and feed it your string. It will output the full hierarchy with the index and length of each token, and you can colorize the tokens without jumping across the entire rich text box.
Also, consider adding a short delay after user input so that you don't recolorize on every keystroke (keep the color from the previous token for a short time, then recolorize and refresh). This way it will run about as smoothly as it does in Visual Studio.

RegEx for a Glossary Function

I'm working on a web-based help system that will auto-insert links into the explanatory text, taking users to other topics in help. I have hundreds of terms that should be linked, i.e.
"Manuals and labels" (describes these concepts in general)
"Delete Manuals and Labels" (describes this specific action)
"Learn more about adding manuals and labels" (again, more specific action)
I have a RegEx to find / replace whole words (good ol' \b), which works great, except for linked terms found inside other linked terms. Instead of:
<a href="#">Learn more about manuals and labels</a>
I end up with
<a href="#">Learn more about <a href="#">manuals and labels</a></a>
Which makes everyone cry a little. Changing the order in which the terms are replaced (going shortest to longest) means that I'd get:
Learn more about <a href="#">manuals and labels</a>
Without the outer link I really need.
The further complication is that the capitalization of the search terms can vary, and I need to retain the original capitalization. If I could do something like this, I'd be all set:
Regex _regex = new Regex("\\b" + termToFind + "(|s)" + "\\b", RegexOptions.IgnoreCase);
string resultingText = _regex.Replace(textThatNeedsLinksInserted, "<a>" + "$&".Replace(" ", "_") + "</a>");
And then after all the terms are done, remove the "_", that would be perfect. "Learn_more_about_manuals_and_labels" wouldn't match "manuals and labels," and all is well.
It would be hard to have the help authors delimit the terms that need to be replaced when writing the text -- they're not used to coding. Also, this would limit the flexibility to add new terms later, since we'd have to go back and add delimiters to all the previously written text.
Is there a RegEx that would let me replace whitespace with "_" in the original match? Or is there a different solution that's eluding me?

From your examples with nested links it sounds like you're making individual passes over the terms and performing multiple Regex.Replace calls. Since you're using a regex you should let it do the heavy lifting and put a nice pattern together that makes use of alternation.
In other words, you likely want a pattern like this: \b(term1|term2|termN)\b
var input = "Having trouble with your manuals and labels? Learn more about adding manuals and labels. Need to get rid of them? Try to delete manuals and labels.";
var terms = new[]
{
    "Learn more about adding manuals and labels",
    "Delete Manuals and Labels",
    "manuals and labels"
};
var pattern = @"\b(" + String.Join("|", terms) + @")\b";
var replacement = @"<a href=""#"">$1</a>";
var result = Regex.Replace(input, pattern, replacement, RegexOptions.IgnoreCase);
Console.WriteLine(result);
Now, to address the issue of a corresponding href value for each term, you can use a dictionary and change the regex to use a MatchEvaluator that will return the custom format and look up the value from the dictionary. The dictionary also ignores case by passing in StringComparer.OrdinalIgnoreCase. I tweaked the pattern slightly by adding ?: at the start of the group to make it a non-capturing group since I am no longer referring to the captured item as I did in the first example.
var terms = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
{
    { "Learn more about adding manuals and labels", "2.html" },
    { "Delete Manuals and Labels", "3.html" },
    { "manuals and labels", "1.html" }
};
var pattern = @"\b(?:" + String.Join("|", terms.Select(t => t.Key)) + @")\b";
var result = Regex.Replace(input, pattern,
    m => String.Format(@"<a href=""{0}"">{1}</a>", terms[m.Value], m.Value),
    RegexOptions.IgnoreCase);
Console.WriteLine(result);
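The alternation approach above relies on the longer terms being tried before the shorter ones they contain, and on the terms containing no regex metacharacters. A hedged variation that makes both explicit is sketched below; the class name GlossaryLinker and the method InsertLinks are hypothetical, and sorting by length plus Regex.Escape are additions of this sketch, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class GlossaryLinker
{
    // terms maps each glossary term to its target page. It is expected to use a
    // case-insensitive comparer so that the matched text can be looked up directly.
    public static string InsertLinks(string input, IDictionary<string, string> terms)
    {
        // Longest terms first, so "manuals and labels" cannot win over a longer
        // term that starts at the same position; Regex.Escape guards against
        // terms that contain characters such as "." or "(".
        IEnumerable<string> alternatives = terms.Keys
            .OrderByDescending(t => t.Length)
            .Select(t => Regex.Escape(t));
        string pattern = @"\b(?:" + string.Join("|", alternatives) + @")\b";

        // m.Value keeps the capitalization found in the text.
        return Regex.Replace(
            input,
            pattern,
            m => string.Format("<a href=\"{0}\">{1}</a>", terms[m.Value], m.Value),
            RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        var terms = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "manuals and labels", "1.html" },
            { "Learn more about adding manuals and labels", "2.html" },
            { "Delete Manuals and Labels", "3.html" }
        };
        Console.WriteLine(InsertLinks(
            "Learn more about adding manuals and labels. Try to delete manuals and labels.",
            terms));
    }
}
```

Because everything happens in a single pass, text that has already been wrapped in a link is never scanned again, so nested links cannot occur.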

I would use an ordered dictionary like this, making sure the smallest term is last:
using System;
using System.Text.RegularExpressions;
using System.Collections.Specialized;
public class Test
{
    public static void Main()
    {
        OrderedDictionary Links = new OrderedDictionary();
        Links.Add("Learn more about adding manuals and labels", "2");
        Links.Add("Delete Manuals and Labels", "3");
        Links.Add("manuals and labels", "1");
        string text = "Having trouble with your manuals and labels? Learn more about adding manuals and labels. Need to get rid of them? Try to delete manuals and labels.";
        foreach (string termToFind in Links.Keys)
        {
            Regex _regex = new Regex(@"\b" + termToFind + @"s?\b(?![^<>]*</)", RegexOptions.IgnoreCase);
            text = _regex.Replace(text, @"<a href=""" + Links[termToFind] + @""">$&</a>");
        }
        Console.WriteLine(text);
    }
}
ideone demo
The negative lookahead ((?![^<>]*</)) I added prevents the replace of a part you already replaced before which is between anchor tags.

First, you can prevent your regex for "manuals and labels" from matching inside "Learn more about manuals and labels" by using a lookbehind. Modified, your regex looks like this:
(?<!Learn more about )(manuals and labels)
But for your specific request I would suggest a different solution. You should define a rule, a priority list, or both for your regexes. A possible rule could be "always apply the regex that matches the most characters first". This, however, requires that your regexes always match a fixed length. And it does not prevent one regex from consuming and replacing characters that would have been matched by a different regex (maybe even one of the same size).
Of course, you will need to add an additional lookbehind and lookahead to each of your regexes to prevent replacing strings that are inside your replacement elements.

removing #region

I had to take over a C# project. The guy who developed the software in the first place was deeply in love with #region, because he wrapped everything in regions.
It makes me almost crazy and I was looking for a tool or addon to remove all #region from the project. Is there something around?

Just use Visual Studio's built-in "Find and Replace" (or "Replace in Files", which you can open by pressing Ctrl + Shift + H).
To remove #region, you'll need to enable Regular Expression matching; in the "Replace In Files" dialog, check "Use: Regular Expressions". Then, use the following pattern: "\#region .*\n", replacing matches with "" (the empty string).
To remove #endregion, do the same, but use "\#endregion.*\n" as your pattern (without a space before ".*", so that a bare #endregion line matches too). Regular Expressions might be overkill for #endregion, but it wouldn't hurt (in case the previous developer ever left comments on the same line as an #endregion or something).
Note: Others have posted patterns that should work for you as well, they're slightly different than mine but you get the general idea.

Use one regex ^[ \t]*\#[ \t]*(region|endregion).*\n to find both: region and endregion. After replacing by empty string, the whole line with leading spaces will be removed.
[ \t]* - finds leading spaces
\#[ \t]*(region|endregion) - finds #region or #endregion (and also very rare case with spaces after #)
.*\n - finds everything after #region or #endregion (but in the same line)
EDIT: Answer changed to be compatible with old Visual Studio regex syntax. Was: ^[ \t]*\#(end)?region.*\n (question marks do not work for old syntax)
EDIT 2: Added [ \t]* after # to handle very rare case found by @Volkirith

In Find and Replace use {[#]<region[^]*} for Find what: and replace it with empty string.
#EndRegion is simple enough to replace.

Should you have to cooperate with region lovers (and keep regions untouched), then I would recommend the "I hate #Regions" Visual Studio extension. It makes regions tolerable: all regions are expanded by default and #region directives are rendered in a very small font.

For anyone using ReSharper, it's just a simple Alt+Enter on the region line. You will then have the option to remove regions in the file, in the project, or in the solution.
More info on JetBrains.

To remove #region with a newline after it, replace following with empty string:
^(?([^\r\n])\s)*\#region\ ([^\r\n])*\r?\n(?([^\r\n])\s)*\r?\n
To replace #endregion with a leading empty line, replace following with an empty string:
^(?([^\r\n])\s)*\r?\n(?([^\r\n])\s)*\#endregion([^\r\n])*\r?\n

How about writing your own program for it, to replace regions with nothing in all *.cs files in basePath recursively ?
(Hint: Careful with reading files as UTF8 if they aren't.)
public static void StripRegions(string fileName, System.Text.RegularExpressions.Regex re)
{
    string input = System.IO.File.ReadAllText(fileName, System.Text.Encoding.UTF8);
    string output = re.Replace(input, "");
    System.IO.File.WriteAllText(fileName, output, System.Text.Encoding.UTF8);
}
public static void StripRegions(string basePath)
{
    System.Text.RegularExpressions.Regex re = new System.Text.RegularExpressions.Regex(@"(^[ \t]*\#[ \t]*(region|endregion).*)(\r)?\n", System.Text.RegularExpressions.RegexOptions.Multiline);
    foreach (string file in System.IO.Directory.GetFiles(basePath, "*.cs", System.IO.SearchOption.AllDirectories))
    {
        StripRegions(file, re);
    }
}
Usage:
StripRegions(@"C:\sources\TestProject");

You can use the wildcard find/replace:
*\#region *
*\#endregion
And replace with no value. (Note that the # needs to be escaped, as Visual Studio uses it to match "any number".)

How can I optimize this or is there a better way to do it?(HTML Syntax Highlighter)

I have made an HTML syntax highlighter in C# and it works great, but there's one problem. Normally it runs pretty fast because it highlights line by line, but when I paste more than one line of code or open a file, I have to highlight the whole file, which can take up to a minute for a file with only 150 lines of code. I tried highlighting only the visible lines in the RichTextBox, but then when I scroll I can't get it to highlight the newly visible text. Here is my code (note: I need to use regex so I can get the text between the < and > characters):
Highlight Whole File:
public void AllMarkup()
{
    int selectionstart = richTextBox1.SelectionStart;
    Regex rex = new Regex("<html>|</html>|<head.*?>|</head>|<body.*?>|</body>|<div.*?>|</div>|<span.*?>|</span>|<title.*?>|</title>|<style.*?>|</style>|<script.*?>|</script>|<link.*?/>|<meta.*?/>|<base.*?/>|<center.*?>|</center>|<a.*?>|</a>");
    foreach (Match m in rex.Matches(richTextBox1.Text))
    {
        richTextBox1.Select(m.Index, m.Value.Length);
        richTextBox1.SelectionColor = Color.Blue;
        richTextBox1.Select(selectionstart, -1);
        richTextBox1.SelectionColor = Color.Black;
    }
    richTextBox1.SelectionStart = selectionstart;
}
private void pasteToolStripMenuItem_Click(object sender, EventArgs e)
{
    try
    {
        LockWindowUpdate(richTextBox1.Handle); // stops the text from flashing
        richTextBox1.Paste();
        AllMarkup();
    }
    finally { LockWindowUpdate(IntPtr.Zero); }
}
I want to know if there's a better way to highlight this and make it faster or if someone can help me make it highlight only the visible text.
Please help. :)
Thanks, Tanner.
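As a tentative sketch of the "only the visible text" idea from the question: the regex work can be limited to a character range, with the resulting offsets shifted back to positions in the full text. The class name TagFinder, the method TagSpans and the condensed tag pattern are assumptions of this sketch and not the asker's code; the pattern is a simplification and does not replace a real HTML parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagFinder
{
    // One alternation for the tag names instead of one alternative per tag;
    // [^>]* stops at the first ">" and therefore needs no lazy ".*?".
    private static readonly Regex Tag = new Regex(
        @"</?(?:html|head|body|div|span|title|style|script|link|meta|base|center|a)\b[^>]*>",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns (absolute start, length) pairs for the tags found inside
    // text[start .. start + length).
    public static List<KeyValuePair<int, int>> TagSpans(string text, int start, int length)
    {
        var spans = new List<KeyValuePair<int, int>>();
        string slice = text.Substring(start, length);
        foreach (Match m in Tag.Matches(slice))
        {
            // Shift the slice-relative index back into the full text.
            spans.Add(new KeyValuePair<int, int>(start + m.Index, m.Length));
        }
        return spans;
    }

    public static void Main()
    {
        string html = "<p>x</p><div class=\"a\">hi</div><a href='#'>l</a>";
        foreach (var span in TagSpans(html, 0, html.Length))
        {
            Console.WriteLine(html.Substring(span.Key, span.Value));
        }
    }
}
```

In the RichTextBox, the visible range could presumably be derived by calling richTextBox1.GetCharIndexFromPosition for the top-left and bottom-right points of the client area, and TagSpans re-run from a scroll handler. Resetting the selection and color once after the loop, rather than once per match as in AllMarkup, should also reduce the number of selection changes.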

I agree with RCIX - you'll have a hard time overall with combining Regex and HTML parsing :)
If you're going for a high-quality solution that always highlights syntax properly, you're going to need a full-blown parser. You can either use one that's already created, or you can create your own using a tool like ANTLR.
The creators of ANTLR have already created an HTML parser grammar. You can find it here.
If you're looking for a pre-built one, here's a few I've found:
HTML Agility Pack
Majestic 12 HTML Parser
SGML Reader
I'm sure there are others -- this is a pretty common requirement.
Long story short, if this is anything but a simple, disposable project, I'd get a full-blown parser. Otherwise, you can continue to try and hack it with Regex.
