How to parse marked up text in C#


I am trying to make a simple text formatter using MigraDoc for actually typesetting the text.
I'd like to specify formatting by marking up the text. For example, the input might look something like this:
"The \i{quick} brown fox jumps over the lazy dog^{note}"
which would denote "quick" being italicized and "note" being superscript.
To make the splits I have made a dictionary in my TextFormatter:
// A static constructor cannot have an access modifier.
static TextFormatter()
{
    FormatDictionary = new Dictionary<string, TextFormats>()
    {
        {@"^", TextFormats.superscript},
        {@"_", TextFormats.subscript},
        {@"\i", TextFormats.italic}
    };
}
I'm then hoping to split using some regexes that look for the modifier strings and match what is enclosed in braces.
But as multiple formats can exist in a string, I also need to keep track of which regex was matched, e.g. by getting a list of (string, TextFormats) pairs, where the string is the enclosed text, the TextFormats value corresponds to the special sequence that preceded it, and the items are sorted in order of appearance. I could then iterate over that list, applying formatting based on the TextFormats value.
Thank you for any suggestions.

Consider the following Code...
string inputMessage = @"The \i{quick} brown fox jumps over the lazy dog^{note}";
MatchCollection matches = Regex.Matches(inputMessage, @"(?<=(\\i|_|\^)\{)\w*(?=\})");
foreach (Match match in matches)
{
string textformat = match.Groups[1].Value;
string enclosedstring = match.Value;
// Add to Dictionary<string, TextFormats>
}
Good Luck!

Callbacks are available in .NET: Regex.Replace accepts a MatchEvaluator delegate.
If you have strings like "The \i{quick} brown fox jumps over the lazy dog^{note}" and
you just want to do the substitution as you find the markers,
you could use a regex replace with a callback:
# @"(\\i|_|\^){([^}]*)}"
( \\i | _ | \^ ) # (1)
{
( [^}]* ) # (2)
}
Then, in the callback, examine capture group 1 for the format and replace the match with {fmtCodeStart}\2{fmtCodeEnd}.
Or you could use one capture group per format:
# @"(?:(\\i)|(_)|(\^)){([^}]*)}"
(?:
( \\i ) # (1)
| ( _ ) # (2)
| ( \^ ) # (3)
)
{
( [^}]* ) # (4)
}
then in callback
if (match.Groups[1].Success)
// return "{fmtCode1Start}\4{fmtCode1End}"
else if (match.Groups[2].Success)
// return "{fmtCode2Start}\4{fmtCode2End}"
else if (match.Groups[3].Success)
// return "{fmtCode3Start}\4{fmtCode3End}"
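To get the ordered list of (text, format) pairs the question asks for, one possible approach is to walk the matches and emit the unformatted text between them as well. The following is only a sketch under a few assumptions: the `TextFormats` enum is redefined here so the example is self-contained, its `plain` member is hypothetical (the question does not mention one), and `MarkupParser` and `Parse` are illustrative names. Nested markup such as \i{a^{b}} is not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Stand-in for the asker's enum; "plain" is an assumed extra member.
public enum TextFormats { plain, superscript, subscript, italic }

public static class MarkupParser
{
    private static readonly Dictionary<string, TextFormats> FormatDictionary =
        new Dictionary<string, TextFormats>
        {
            { "^", TextFormats.superscript },
            { "_", TextFormats.subscript },
            { @"\i", TextFormats.italic }
        };

    // A marker (\i, _ or ^) followed by a brace-enclosed run without nested braces.
    private static readonly Regex Token =
        new Regex(@"(?<fmt>\\i|_|\^)\{(?<text>[^}]*)\}");

    public static List<KeyValuePair<string, TextFormats>> Parse(string input)
    {
        var runs = new List<KeyValuePair<string, TextFormats>>();
        int pos = 0;
        foreach (Match m in Token.Matches(input))
        {
            // Text between the previous match and this one is unformatted.
            if (m.Index > pos)
                runs.Add(new KeyValuePair<string, TextFormats>(
                    input.Substring(pos, m.Index - pos), TextFormats.plain));

            runs.Add(new KeyValuePair<string, TextFormats>(
                m.Groups["text"].Value, FormatDictionary[m.Groups["fmt"].Value]));
            pos = m.Index + m.Length;
        }
        // Trailing unformatted text, if any.
        if (pos < input.Length)
            runs.Add(new KeyValuePair<string, TextFormats>(
                input.Substring(pos), TextFormats.plain));
        return runs;
    }

    public static void Main()
    {
        foreach (var run in Parse(@"The \i{quick} brown fox jumps over the lazy dog^{note}"))
            Console.WriteLine("[{0}] {1}", run.Value, run.Key);
    }
}
```

Each pair can then be passed to the typesetting code in order, with the enum value selecting the formatting to apply.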

Related

How to Split OData multi-level expand query string in C#?

I have a URL: Expand=User($select=Firstname,Lastname),Organisation,Contract($Expand=MyOrganisation($select=Name,Status),Organisation),List
I need to split this string in the below format:
User($select=Firstname,Lastname)
Organisation
Contract($Expand=MyOrganisation($select=Name,Status),Organisation)
List
How to achieve this functionality in C#?

You not only need to split the string but also keep track of the parentheses while splitting. This is not possible with just plain old regex. See this post.
However, the splitting can be achieved with some advanced RegEx; .NET fortunately supports balancing groups using which you can keep track of the parentheses. This answer was quite helpful in coming up with a solution. For readability, I have split the regex into multiple lines and used RegexOptions.IgnorePatternWhitespace:
string url = "User($select=Firstname,Lastname),Organisation,Contract($Expand=MyOrganisation($select=Name,Status),Organisation),List";
Regex rgx = new Regex(
@"(.+?)
(
(
\(
(?:
[^()]
|
(?'open'\()
|
(?'close-open'\))
)+
(?(open)(?!))
\)
)
,
|
,
|
\b$
)",
RegexOptions.IgnorePatternWhitespace);
foreach (Match match in rgx.Matches(url))
{
Console.WriteLine($"{match.Groups[1]} {match.Groups[3]}");
}
The field will be available as match.Groups[1] and the parameters, if any, will be available as match.Groups[3] (this will be an empty string if there are no parameters). You can access match.Groups[0] to get the entire match.

Regex Breakdown

Regex                                       Description
(.+?)                                       Non-greedily match one or more characters
\( and \)                                   Match an actual ( and )
[^()]                                       Match any character that is not a ( or )
(?'open'\()                                 Create a named group with the name "open" and match a ( character
(?'close-open'\))                           Match a ) character, store the text between the most recent "open" and this ) in the group "close", and remove that "open" capture
(?(open)(?!))                               Fail the match if an "open" capture is still left, i.e. if the parentheses are unbalanced
(?:[^()]|(?'open'\()|(?'close-open'\)))+    Create a non-capturing group and match one or more characters that match one of the expressions between |
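If a regex is not a hard requirement, the same split can be done with a plain loop that counts parenthesis depth and only cuts at commas seen at depth zero. This is a different technique from the balancing-group pattern above, shown here as a minimal sketch; `ExpandSplitter` and `SplitTopLevel` are illustrative names, and the input is assumed to have balanced parentheses.

```csharp
using System;
using System.Collections.Generic;

public static class ExpandSplitter
{
    // Splits on commas that are not inside parentheses.
    public static List<string> SplitTopLevel(string input)
    {
        var parts = new List<string>();
        int depth = 0;   // current parenthesis nesting level
        int start = 0;   // start index of the current item

        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (c == '(') depth++;
            else if (c == ')') depth--;
            else if (c == ',' && depth == 0)
            {
                parts.Add(input.Substring(start, i - start));
                start = i + 1;
            }
        }
        parts.Add(input.Substring(start)); // last item has no trailing comma
        return parts;
    }

    public static void Main()
    {
        string url = "User($select=Firstname,Lastname),Organisation,Contract($Expand=MyOrganisation($select=Name,Status),Organisation),List";
        foreach (string part in SplitTopLevel(url))
            Console.WriteLine(part);
    }
}
```

For the sample string this yields the four items listed in the question, with the nested Contract(...) expression kept in one piece.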

More likely you will want to use ODataLib, which has a built-in URI parser:
Uri requestUri = new Uri("Products?$select=ID&$expand=ProductDetail" +
"&$filter=Categories/any(d:d/ID%20gt%201)&$orderby=ID%20desc" +
"&$top=1&$count=true&$search=tom",
UriKind.Relative);
ODataUriParser parser = new ODataUriParser(model, serviceRoot, requestUri);
SelectExpandClause expand = parser.ParseSelectAndExpand(); // parse $select, $expand
FilterClause filter = parser.ParseFilter(); // parse $filter
OrderByClause orderby = parser.ParseOrderBy(); // parse $orderby
SearchClause search = parser.ParseSearch(); // parse $search
long? top = parser.ParseTop(); // parse $top
long? skip = parser.ParseSkip(); // parse $skip
bool? count = parser.ParseCount(); // parse $count
Adding the RegExp option (the fixed version of what Amal has provided below)
string url = "User($select=Firstname,Lastname),Organisation,Contract($Expand=MyOrganisation($select=Name,Status),Organisation),List";
Regex rgx = new Regex(@"(.+?)(?:(\(.*?\)),|,)");
foreach (Match match in rgx.Matches($"{url},"))
{
Console.WriteLine(match.ToString()[..^1]);
}

Negative lookahead in Regex to exclude two words

I have the following regex:
(?!SELECT|FROM|WHERE|AND|OR|AS|[0-9])(?<= |^|\()([a-zA-Z0-9_]+)
that I'm matching against a string like this:
SELECT Static AS My_alias FROM Table WHERE Id = 400 AND Name = 'Something';
This already does 90% of what I want. What I also like to do is to exclude AS My_alias, where the alias can be any word.
I tried to add this to my regex, but this didn't work:
(?!SELECT|FROM|WHERE|AND|OR|AS [a-zA-Z0-9_]+|[0-9])(?<= |^|\()([a-zA-Z0-9_]+)
^^^^^^^^^^^^^^^^
this is the new part
How can I exclude this part of the string using my regex?
Demo of the regex can be found here

This excludes the AS and gets the tokens you seek. It also handles multiple select values, along with zero to many WHERE conditions.
The idea is to use named explicit captures and tell the regex engine to disregard any unnamed capture groups (a "match but don't capture" feature).
We will also put all the wanted tokens into a single named group, (?<Token> ... ).
var data = "SELECT Static AS My_alias FROM Table WHERE Id = 400 AND Name = 'Something';";
var pattern = @"
^
SELECT\s+
(
(?<Token>[^\s]+)
(\sAS\s[^\s]+)?
[\s,]+
)+ # One to many statements
FROM\s+
(?<Token>[^\s]+) # Table name
(
\s+WHERE\s+
(
(?<Token>[^\s]+)
(.+?AND\s+)?
)+ # One to many conditions
)? # Optional Where
";
var tokens =
Regex.Matches(data, pattern,
RegexOptions.IgnorePatternWhitespace // Lets us space out/comment pattern
| RegexOptions.ExplicitCapture) // Only capture named groups.
.OfType<Match>()
.SelectMany(mt => mt.Groups["Token"].Captures // Get the captures inserted into `Token`
.OfType<Capture>()
.Select(cp => cp.Value.ToString()))
;
tokens is a sequence of these strings: { "Static", "Table", "Id", "Name" }
This should cover most of the cases you will encounter. Use similar logic if you need to process selects with joins; either way, this is a good base to work from.
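If you would rather keep the original token-by-token regex, another option is to leave the negative lookahead for the keywords and add a negative lookbehind that rejects any word directly preceded by "AS ". The sketch below rests on some assumptions: the alias is separated from AS by exactly one space, keywords are upper case as in the question, and `SqlIdentifierDemo` and `Extract` are illustrative names. The \b after the keyword list is an addition so that names such as ORDERS are not rejected for merely starting with OR.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SqlIdentifierDemo
{
    private static readonly Regex Identifier = new Regex(
        @"(?<= |^|\()"                                        // must start after a space, the start, or "("
        + @"(?<!\bAS )"                                       // must not be the alias that follows "AS "
        + @"(?!(?:SELECT|FROM|WHERE|AND|OR|AS)\b|[0-9])"      // must not be a keyword or start with a digit
        + @"[a-zA-Z0-9_]+");

    public static List<string> Extract(string sql)
    {
        var result = new List<string>();
        foreach (Match m in Identifier.Matches(sql))
            result.Add(m.Value);
        return result;
    }

    public static void Main()
    {
        string sql = "SELECT Static AS My_alias FROM Table WHERE Id = 400 AND Name = 'Something';";
        // Prints: Static, Table, Id, Name
        Console.WriteLine(string.Join(", ", Extract(sql)));
    }
}
```

The lookahead cannot express "preceded by AS" because it only inspects text to the right of the current position, which is why the alias check is written as a lookbehind.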

C# regular expression for finding a certain pattern in a text

I'm trying to write a program that can replace bible verses within a document with any desired translation. This is useful for older books that contain a lot of KJV referenced verses. The most difficult part of the process is coming up with a way to extract the verses within a document.
I find that most books that place bible verses within the text use a structure like "N"(BookName chapter#:verse#s), where N is the verse text, the quotations are literal and the parens are also literal. I've been having problems coming up with a regular expression to match these in a text.
The latest regular expression I'm trying to use is this: \"(.+)\"\s*\(([\w. ]+[0-9\s]+[:][\s0-9\-]+.*)\). I'm having trouble where it won't find all the matches.
Here is the regex101 of it with a sample. https://regex101.com/r/eS5oT8/1
Is there any way to solve this using a regular expression? Any help or suggestions would be greatly appreciated.

It's worth mentioning that the site you were using to test this relies on JavaScript regular expressions, which need the g modifier to return every match. In C#, Regex.Matches() returns all matches without any extra flag.
You can adjust your expression slightly and ensure that you escape your double quotes properly:
// Updated expression with escaped double-quotes and other minor changes
var regex = new Regex(@"\""([^""]+)\""\s*\(([\w. ]+[\d\s]+[:][\s\d\-]+[^)]*)\)");
And then use the Regex.Matches() method to find all of the matches in your string :
// Find each of the matches and output them
foreach(Match m in regex.Matches(input))
{
// Output each match here (using Console Example)
Console.WriteLine(m.Value);
}
You can see it in action in this working example.

Use the "g" modifier.
g modifier: global. All matches (don't return on first match)
See the Regex Demo

You can try the example given on MSDN. Here is the link:
https://msdn.microsoft.com/en-us/library/0z2heewz(v=vs.110).aspx
using System;
using System.Text.RegularExpressions;
public class Example
{
public static void Main()
{
string input = "ablaze beagle choral dozen elementary fanatic " +
"glaze hunger inept jazz kitchen lemon minus " +
"night optical pizza quiz restoration stamina " +
"train unrest vertical whiz xray yellow zealous";
string pattern = @"\b\w*z+\w*\b";
Match m = Regex.Match(input, pattern);
while (m.Success) {
Console.WriteLine("'{0}' found at position {1}", m.Value, m.Index);
m = m.NextMatch();
}
}
}
// The example displays the following output:
// 'ablaze' found at position 0
// 'dozen' found at position 21
// 'glaze' found at position 46
// 'jazz' found at position 65
// 'pizza' found at position 104
// 'quiz' found at position 110
// 'whiz' found at position 157
// 'zealous' found at position 174

How about starting with this as a guide:
(?<quote>"".+"") # a series of any characters in quotes
\s+ # followed by spaces
\( # followed by a parenthetical expression
(?<book>\d*[a-z.\s]*) # book name (a-z, . or space) optionally preceded by digits. e.g. '1 Cor.'
(?<chapter>\d+) # chapter e.g. the '1' in 1:2
: # colon
(?<verse>\d+) # verse e.g. the '2' in 1:2
\)
Using the options:
RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline | RegexOptions.IgnoreCase
The expression above will give you named captures of every element in the match for easy parsing (e.g., you'll be able to pick out quote, book, chapter and verse) by looking at, e.g., match.Groups["verse"].
Full code:
var input = @"Jesus said, ""'Love your neighbor as yourself.'
There is no commandment greater than these"" (Mark 12:31).";
var bibleQuotesRegex =
@"(?<quote>"".+"") # a series of any characters in quotes
\s+ # followed by spaces
\( # followed by a parenthetical expression
(?<book>\d*[a-z.\s]*) # book name (a-z, . or space) optionally preceded by digits. e.g. '1 Cor.'
(?<chapter>\d+) # chapter e.g. the '1' in 1:2
: # colon
(?<verse>\d+) # verse e.g. the '2' in 1:2
\)";
foreach(Match match in Regex.Matches(input, bibleQuotesRegex, RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline | RegexOptions.IgnoreCase))
{
var bibleQuote = new
{
Quote = match.Groups["quote"].Value,
Book = match.Groups["book"].Value,
Chapter = int.Parse(match.Groups["chapter"].Value),
Verse = int.Parse(match.Groups["verse"].Value)
};
//do something with it.
}

After you've added "g", also be careful if there are multiple verses without any '\n' character in between, because "(.*)" will treat them as one long match instead of multiple verses. You will want something like "([^"]*)" to prevent that.

Regex removing empty spaces when using replace

My situation is not about removing empty spaces, but about keeping them. I have strings of the form >[database values] that I would like to find. I created a regex to find them and then remove the >, [ and ] characters. The code below takes a string that comes from a document. The first pattern looks for anything of the form >[some stuff]; the second then "removes" the >, [ and ].
string decoded = "document in string format";
string pattern = @">\[[A-z, /, \s]*\]";
string pattern2 = @"[>, \[, \]]";
Regex rgx = new Regex(pattern);
Regex rgx2 = new Regex(pattern2);
foreach (Match match in rgx.Matches(decoded))
{
string replacedValue= rgx2.Replace(match.Value, "");
Console.WriteLine(match.Value);
Console.WriteLine(replacedValue);
}
What I get from my first Console.WriteLine is correct, so I see things like >[123 sesame St]. But my second output shows that the replace removes not just those characters but also the spaces, so I get something like 123sesameSt. I don't see any space being replaced in my regex. Am I forgetting something? Is it perhaps implicit in a replace?

The character classes [A-z, /, \s] and [>, \[, \]] in your patterns also match commas and spaces, because everything between the brackets is taken literally. Just list the characters without delimiting them, like this: [A-Za-z/\s]
string pattern = @">\[[A-Za-z/\s]*\]";
string pattern2 = @"[>\[\]]";
Edit to include Casimir's tip.

After rereading your question, I realize (if I understand it correctly) that your two-step approach is unnecessary. You only need one replacement using a capture group:
string pattern = @">\[([^]]*)]";
Regex rgx = new Regex(pattern);
string result = rgx.Replace(yourtext, "$1");
pattern details:
>\[ # literals: >[
( # open the capture group 1
[^]]* # all that is not a ]
) # close the capture group 1
] # literal ]
the replacement string refers to the capture group 1 with $1
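The single-replacement idea above can be tried with a small self-contained program. This is only a sketch: the sample sentence is made up, `PlaceholderDemo` and `StripMarkers` are illustrative names, and the closing brackets are escaped here purely for readability.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PlaceholderDemo
{
    // Replaces every >[...] with just the text between the brackets.
    public static string StripMarkers(string text)
    {
        return Regex.Replace(text, @">\[([^\]]*)\]", "$1");
    }

    public static void Main()
    {
        // Prints: Ship to 123 sesame St, care of Big Bird
        Console.WriteLine(StripMarkers("Ship to >[123 sesame St], care of >[Big Bird]"));
    }
}
```

Because only the marker characters sit outside the capture group, the spaces inside the brackets are preserved.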

By writing [>, \[, \]] in pattern2 you define a character class consisting of every single character listed between the square brackets: >, the comma, the space, [ and ]. I guess you don't want to match the space and the comma, so leave them out, like this:
string pattern2 = @"[>\[\]]";
Alternatively, you could use
string pattern2 = @"(>\[|\])";
Thereby, you either match >[ or ] which better expresses your intention.

How write a regex with group matching?

Here is the data source, lines stored in a txt file:
servers[i]=["name1", type1, location3];
servers[i]=["name2", type2, location3];
servers[i]=["name3", type1, location7];
Here is my code:
string servers = File.ReadAllText("servers.txt");
string pattern = "^servers[i]=[\"(?<name>.*)\", (.*), (?<location>.*)];$";
Regex reg = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Multiline);
Match m;
for (m = reg.Match(servers); m.Success; m = m.NextMatch()) {
string name = m.Groups["name"].Value;
string location = m.Groups["location"].Value;
}
No lines are matching. What am I doing wrong?

If you don't care about anything except the server name and location, you don't need to specify the rest of the input in your regex. That lets you avoid having to escape the brackets, as Graeme correctly points out. Try something like this (the backslash in \s is doubled because this is a regular C# string literal):
string pattern = "\"(?<name>.+)\".+\\s(?<location>[^ ]+)];$";
That's
\" = quote mark,
(?<name> = start capture group 'name',
.+ = match one or more chars (could use \w+ here for 1+ word chars)
) = end the capture group
\" = ending quote mark
.+\s = one or more chars, ending with a space
(?<location> = start capture group 'location',
[^ ]+ = one or more non-space chars
) = end the capture group
];$ = immediately followed by ]; and the end of the string (or of the line, with RegexOptions.Multiline)
I tested this using your sample data in Rad Software's free Regex Designer, which uses the .NET regex engine.

I don't know if C# regexes are the same as Perl's, but if so, you probably want to escape the [ and ] characters. Also, there are extra characters in there. Try this (the backslashes are doubled because this is a regular C# string literal):
string pattern = "^servers\\[i\\]=\\[\"(?<name>.*)\", (.*), (?<location>.*)\\];$";
Edited to add: After wondering why my answer was downvoted and then looking at Val's answer, I realized that the "extra characters" were there for a reason. They are what perl calls "named capture buffers", which I have never used but the original code fragment does. I have updated my answer to include them.
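For reference, here is a minimal runnable sketch of the escaped pattern written as a verbatim string, so the backslashes do not have to be doubled. Several things are assumptions for the sake of a self-contained example: the file contents are inlined instead of read from servers.txt, the `ServerListDemo` and `Parse` names are illustrative, the `.*` groups are narrowed to negated character classes, and an optional \r is allowed before the end of each line in case the file uses Windows line endings.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ServerListDemo
{
    // In a verbatim string, "" is a literal quote and \[ reaches the regex engine unchanged.
    private const string Pattern =
        @"^servers\[i\]=\[""(?<name>[^""]*)"", (?<type>[^,]*), (?<location>[^\]]*)\];\r?$";

    public static List<KeyValuePair<string, string>> Parse(string text)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in Regex.Matches(text, Pattern, RegexOptions.Multiline))
            result.Add(new KeyValuePair<string, string>(
                m.Groups["name"].Value, m.Groups["location"].Value));
        return result;
    }

    public static void Main()
    {
        string servers =
            "servers[i]=[\"name1\", type1, location3];\n" +
            "servers[i]=[\"name2\", type2, location3];\n" +
            "servers[i]=[\"name3\", type1, location7];";

        foreach (var pair in Parse(servers))
            Console.WriteLine("{0} -> {1}", pair.Key, pair.Value);
    }
}
```

With the sample data this prints one line per server, for example name1 -> location3.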

try this
string pattern = "servers[i]=[\"(?<name>.*)\", (.*), (?<location>.*)];$";
