Superpower: match a string with parser only if it begins a line - c#

When parsing with Superpower, how can I match a string only if it is the first thing on a line?
For example, I need to match the "A:" in "A: Hello Goodbye\n" but not in "Goodbye A: Hello\n".

Using your example here, I would change your ActorParser and NodeParser definitions to this:
public readonly static TokenListParser<Tokens, Node> ActorParser =
from name in NameParser
from colon in Token.EqualTo(Tokens.Colon)
from text in TextParser
select new Node {
Actor = name + colon.ToStringValue(),
Text = text
};
public readonly static TokenListParser<Tokens, Node> NodeParser =
from node in ActorParser.Try()
.Or(TextParser.Select(text => new Node { Text = text }))
select node;
I'm not sure why the first parser in NodeParser needs a Try() when it is chained with Or(), but parsing threw an error without it. It looked like a bug in Superpower to me, although it may simply be that Or() does not fall back to the second parser once the first one has consumed input unless Try() is applied.
Also, your validation when checking input[1] is incorrect (probably just a copy-paste issue). It should be checking against "Goodbye A: Hello" and not "Hello A: Goodbye".

Unless RegexOptions.Multiline is set, ^ matches only at the beginning of the whole string, not at the beginning of each line.
You can probably use the inline (?m) option to turn on multiline mode:
static TextParser<Unit> Actor { get; } =
from start in Span.Regex(@"(?m)^[A-Za-z][A-Za-z0-9_]+:")
select Unit.Value;
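To see what the Multiline option changes, here is a minimal sketch that uses only System.Text.RegularExpressions rather than Superpower. The class and method names (MultilineDemo, CountActors) and the sample names are made up for illustration; the pattern is the one from the answer above, which requires at least two characters before the colon.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MultilineDemo
{
    // Pattern from the answer above, without the inline (?m) flag.
    private const string ActorPattern = @"^[A-Za-z][A-Za-z0-9_]+:";

    // Counts actor prefixes. The multiline flag decides whether ^ also
    // matches right after a newline or only at the start of the input.
    public static int CountActors(string input, bool multiline)
    {
        RegexOptions options = multiline ? RegexOptions.Multiline : RegexOptions.None;
        return Regex.Matches(input, ActorPattern, options).Count;
    }
}
```

With the input "Bob: Hello\nGoodbye Al: Hi\nAl: Bye\n", CountActors returns 1 without the flag and 2 with it; the "Al:" in the middle of the second line is not counted in either case.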

I have actually done something similar, but I do not use a Tokenizer.
private static string _keyPlaceholder;
private static TextParser<MyClass> Actor { get; } =
Span.Regex("^[A-Za-z][A-Za-z0-9_]*:")
.Then(x =>
{
_keyPlaceholder = x.ToStringValue();
return Character.AnyChar.Many();
})
.Select(value => new MyClass { Key = _keyPlaceholder, Value = new string(value) });
I have not tested this; I just wrote it out from memory. For the first example input, the above parser should produce the following:
myClass.Key = "A:"
myClass.Value = " Hello Goodbye"

Related

C# Regex to replace specific hashtags with certain block of text

I am new to C# and I am struggling to write a method that replaces a few specific hashtags in a sample of tweets with certain blocks of text. For example, if the tweet has a hashtag like #StPaulSchool, I want to replace it with the text "St. Paul School", without the '#'.
I have a very small list of the words I need to replace. If there is no match, I would like to remove the hashtag (replace it with an empty string).
I am using the following method to parse the tweet and convert it into a formatted tweet, but I don't know how to extend it to handle the specific hashtags. Could you please tell me how to do that?
Here's the code:
public string ParseTweet(string rawTweet)
{
Regex link = new Regex(@"http(s)?://([\w+?\.\w+])+([a-zA-Z0-9\~\!\@\#\$\%\^\&\*\(\)_\-\=\+\\\/\?\.\:\;\'\,]*)?");
Regex screenName = new Regex(@"@\w+");
Regex hashTag = new Regex(@"#\w+");
var words_to_replace = new string[] { "StPaulSchool", "AzharSchool", "WarwiSchool", "ManMet_School", "BrumSchool"};
var inputWords = new string[] { "St. Paul School", "Azhar School", "Warwick School", "Man Metapolian School", "Brummie School"};
string formattedTweet = link.Replace(rawTweet, delegate (Match m)
{
string val = m.Value;
//return string.Format("URL");
return string.Empty;
});
formattedTweet = screenName.Replace(formattedTweet, delegate (Match m)
{
string val = m.Value.Trim('@');
//return string.Format("USERNAME");
return string.Empty;
});
formattedTweet = hashTag.Replace(formattedTweet, delegate (Match m)
{
string val = m.Value;
//return string.Format("HASHTAG");
return string.Empty;
});
return formattedTweet;
}
The following code works for the hashtags:
static void Main(string[] args)
{
string longTweet = @"Long sentence #With #Some schools like #AzharSchool and spread out
over two #StPaulSchool lines ";
string result = Regex.Replace(longTweet, @"\#\w+", match => ReplaceHashTag(match.Value), RegexOptions.Multiline);
Console.WriteLine(result);
}
private static string ReplaceHashTag(string input)
{
switch (input)
{
case "#StPaulSchool": return "St. Paul School";
case "#AzharSchool": return "Azhar School";
default:
return input; // hashtag not recognized
}
}
If the list of hashtags to convert becomes very long, it would be more succinct to use a Dictionary, e.g.:
private static Dictionary<string, string> _hashtags
= new Dictionary<string, string>
{
{ "#StPaulSchool", "St. Paul School" },
{ "#AzharSchool", "Azhar School" },
};
and rewrite the body of the ReplaceHashTag method like this (with its parameter renamed to hashtag):
if (!_hashtags.ContainsKey(hashtag))
{
return hashtag;
}
return _hashtags[hashtag];
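Putting the dictionary and the regex together, a minimal sketch could look like the following. The names HashtagExpander and Expand are hypothetical, and unlike the snippets above this version removes unrecognized hashtags, which is what the question asked for; treat it as one possible arrangement rather than the definitive one.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HashtagExpander
{
    // Known hashtags and their replacement text (entries taken from the question).
    private static readonly Dictionary<string, string> Known =
        new Dictionary<string, string>
        {
            { "#StPaulSchool", "St. Paul School" },
            { "#AzharSchool", "Azhar School" },
        };

    // Replaces known hashtags with their text and removes all other hashtags.
    public static string Expand(string tweet)
    {
        return Regex.Replace(tweet, @"#\w+", m =>
            Known.TryGetValue(m.Value, out string text) ? text : string.Empty);
    }
}
```

TryGetValue does the lookup once, instead of ContainsKey followed by the indexer. Note that removing a hashtag leaves the surrounding spaces in place, so the result may contain double spaces.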
I believe that using regular expressions makes this code unreadable and difficult to maintain. Moreover, you are using a regular expression to find a very simple pattern: strings that start with the hashtag (#) character.
I suggest a different approach: Break the sentence into words, transform each word according to your business rules, then join the words back together. Although this sounds like a lot of work, and it may be the case in another language, the C# String class makes this quite easy to implement.
Here is a basic example of a console application that does the requested functionality, the business rules are hard-coded, but this should be enough so you could continue:
static void Main(string[] args)
{
string text = "Example #First #Second #NoMatch not a word ! \nSecond row #Second";
string[] wordsInText = text.Split(' ');
IEnumerable<string> transformedWords = wordsInText.Select(selector: word => ReplaceHashTag(word: word));
string transformedText = string.Join(separator: " ", values: transformedWords);
Console.WriteLine(value: transformedText);
}
private static string ReplaceHashTag(string word)
{
if (!word.StartsWith(value: "#"))
{
return word;
}
string wordWithoutHashTag = word.Substring(startIndex: 1);
if (wordWithoutHashTag == "First")
{
return "FirstTransformed";
}
if (wordWithoutHashTag == "Second")
{
return "SecondTransformed";
}
return string.Empty;
}
Note that this approach gives you much more flexibility in chaining your logic, and with small modifications you can make this code a lot more testable and incremental than the regular expression approach.

Regex C# is it possible to use a variable in substitution?

I have a bunch of lines in a text that look something like this:
h1. this is the Header
h3. this one the header too
h111. and this
And I have a function that is supposed to process this text depending on the level at which it is called:
public void ProcessHeadersInText(string inputText, int atLevel = 1)
so when it is called as
ProcessHeadersInText(inputText, 2)
the output should be:
<h3>this is the Header</h3>
<h5>this one the header too</h5>
<h9>and this</h9>
(the last one looks like this because if the value after the letter h is more than 9, it is supposed to be 9 in the output)
So, I started to think about using regex.
Here's the example https://regex101.com/r/spb3Af/1/
(As you can see, I came up with the regex (^(h([\d]+)\.+?)(.+?)$) and tried to use the substitution <h$3>$4</h$3> with it.)
It's almost what I'm looking for, but I need to add some logic for the heading level.
Is it possible to use variables in the substitution?
Or do I need to find another way (extract all the headings first, replace them taking the function arguments and the header value into account, and only then use the regex I wrote)?
The regex you may use is
^h(\d+)\.+\s*(.+)
If you need to make sure the match does not span across lines, you may replace \s with [^\S\r\n]. See the regex demo.
When replacing in C#, parse the Group 1 value to int and add the level to it inside a match evaluator passed to Regex.Replace.
Here is example code that will help you:
using System;
using System.Linq;
using System.Text.RegularExpressions;
using System.IO;
public class Test
{
// Demo: https://regex101.com/r/M9iGUO/2
public static readonly Regex reg = new Regex(@"^h(\d+)\.+\s*(.+)", RegexOptions.Compiled | RegexOptions.Multiline);
public static void Main()
{
var inputText = "h1. Topic 1\r\nblah blah blah, because of bla bla bla\r\nh2. PartA\r\nblah blah blah\r\nh3. Part a\r\nblah blah blah\r\nh2. Part B\r\nblah blah blah\r\nh1. Topic 2\r\nand its cuz blah blah\r\nFIN";
var res = ProcessHeadersInText(inputText, 2);
Console.WriteLine(res);
}
public static string ProcessHeadersInText(string inputText, int atLevel = 1)
{
return reg.Replace(inputText, m =>
// Add the level first, then cap the result at 9
string.Format("<h{0}>{1}</h{0}>",
Math.Min(9, int.Parse(m.Groups[1].Value) + atLevel), m.Groups[2].Value.Trim()));
}
}
See the C# online demo
Note I am using .Trim() on m.Groups[2].Value as . matches \r. You may use TrimEnd('\r') to get rid of this char.
You can use a Regex like the one below to fix your issue.
Regex.Replace(s, @"^(h\d+)\.(.*)$", @"<$1>$2</$1>", RegexOptions.Multiline)
Let me explain what I am doing:
// This will capture the header number which is followed
// by a '.' but ignore the . in the capture
(h\d+)\.
// This will capture the remaining of the string till the end
// of the line (see the multi-line regex option being used)
(.*)$
The parentheses capture each part into a group that can be referenced as "$1" for the first capture and "$2" for the second capture.
Try this:
private static string ProcessHeadersInText(string inputText, int atLevel = 1)
{
// Group 1 = value after 'h'
// Group 2 = Content of header without leading whitespace
string pattern = @"^h(\d+)\.\s*(.*?)\r?$";
return Regex.Replace(inputText, pattern, match => EvaluateHeaderMatch(match, atLevel), RegexOptions.Multiline);
}
private static string EvaluateHeaderMatch(Match m, int atLevel)
{
int hVal = int.Parse(m.Groups[1].Value) + atLevel;
if (hVal > 9) { hVal = 9; }
return $"<h{hVal}>{m.Groups[2].Value}</h{hVal}>";
}
Then just call
ProcessHeadersInText(input, 2);
This uses the Regex.Replace(string, string, MatchEvaluator, RegexOptions) overload with a custom evaluator function.
You could of course streamline this solution into a single function with an inline lambda expression:
public static string ProcessHeadersInText(string inputText, int atLevel = 1)
{
string pattern = @"^h(\d+)\.\s*(.*?)\r?$";
return Regex.Replace(inputText, pattern,
match =>
{
int hVal = int.Parse(match.Groups[1].Value) + atLevel;
if (hVal > 9) { hVal = 9; }
return $"<h{hVal}>{match.Groups[2].Value}</h{hVal}>";
},
RegexOptions.Multiline);
}
There are a lot of good solutions in this thread, but I don't think you really need a regex for your problem. For fun and as a challenge, here is a non-regex solution:
Try it online!
using System;
using System.Linq;
public class Program
{
public static void Main()
{
string extractTitle(string x) => x.Substring(x.IndexOf(". ") + 2);
string extractNumber(string x) => x.Remove(x.IndexOf(". ")).Substring(1);
string build(string n, string t) => $"<h{n}>{t}</h{n}>";
var inputs = new [] {
"h1. this is the Header",
"h3. this one the header too",
"h111. and this" };
foreach (var line in inputs.Select(x => build(extractNumber(x), extractTitle(x))))
{
Console.WriteLine(line);
}
}
}
I use C#7 nested function and C#6 interpolated string. If you want, I can use more legacy C#. The code should be easy to read, I can add comments if needed.
C#5 version
using System;
using System.Linq;
public class Program
{
static string extractTitle(string x)
{
return x.Substring(x.IndexOf(". ") + 2);
}
static string extractNumber(string x)
{
return x.Remove(x.IndexOf(". ")).Substring(1);
}
static string build(string n, string t)
{
return string.Format("<h{0}>{1}</h{0}>", n, t);
}
public static void Main()
{
var inputs = new []{
"h1. this is the Header",
"h3. this one the header too",
"h111. and this"
};
foreach (var line in inputs.Select(x => build(extractNumber(x), extractTitle(x))))
{
Console.WriteLine(line);
}
}
}

How to avoid large switch statements and/or regular expressions when converting code from one language to another

I have to convert a few hundred test cases written in Java to code in C#. At the moment all I could think of is define a set of regular expressions, try to match it on a line and do an action based on which regex matched.
Any better ideas (this still stinks).
An example of from and to:
Java:
Request request = new Request(testRunner)
request.setUsername("userName")
request.setPassword("password")
log.info(request.getRequest())
C#
var request = new LoginRequest(LoginParams);
request.Username = "userName";
request.Password = "password";
var LoginResponse = Account.ExecuteCall(request, pathToApi);
The source I'm trying to convert is from SoapUI and the bits of script involved are within TestSteps of a humongous XML file. Also, most of them are simply forming some sort of request and checking for a specific response so there shouldn't be too many types to implement.
What I ended up doing was to define a base class (Map) that has a Pattern property, a Success indicator and the lines of Code that result from a successful match. In some cases a line can simply be replaced by another one, but in other cases (setUsername) I need to extract content from the original script to put in the C# code. In yet other cases, a single line might be replaced with more than one. The transformation is all defined in the Match function.
public class SetUserName : Map
{
internal override string Pattern { get { return @"request.setUsername\(""(.*)""\)"; } }
public override void Match(string line)
{
Match match = Regex.Match(line, Pattern);
if (match.Success)
{
Success = true;
CodeLines = new Code<CodeLine>
{new CodeLine("request.Username = \"" + match.Groups[1].Value + "\"")};
}
}
}
Then I put the maps in a list ordered by occurrence and loop through each line of script:
foreach (string scriptLine in scriptLines)
{
string line = Strip(scriptLine);
if (string.IsNullOrEmpty(line) || Regex.Match(line, @"^\s+$").Success)
{
continue;
}
Map[] RegExes =
{
new Request(),
new SetUserName(),
new SetPassword(),
new RunRequest()
};
foreach (Map map in RegExes)
{
map.Match(line);
if (map.Success)
{
codeList.AddRange(map.CodeLines);
break;
}
}
}
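The same idea can be sketched without one subclass per pattern by keeping the rules in a table. This is only an illustration under stated assumptions: ScriptConverter, Rule and ConvertLine are hypothetical names, and the two rules shown cover just the setUsername and setPassword lines from the example above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ScriptConverter
{
    // A rule pairs a pattern with a function that builds the C# output lines.
    private sealed class Rule
    {
        public Rule(string pattern, Func<Match, string[]> emit)
        {
            Pattern = new Regex(pattern);
            Emit = emit;
        }

        public Regex Pattern { get; }
        public Func<Match, string[]> Emit { get; }
    }

    // Rules are tried in order; the first one that matches wins.
    private static readonly Rule[] Rules =
    {
        new Rule(@"^request\.setUsername\(""(.*)""\)$",
            m => new[] { $"request.Username = \"{m.Groups[1].Value}\";" }),
        new Rule(@"^request\.setPassword\(""(.*)""\)$",
            m => new[] { $"request.Password = \"{m.Groups[1].Value}\";" }),
    };

    // Returns the converted lines for the first matching rule,
    // or an empty array when no rule matches.
    public static string[] ConvertLine(string line)
    {
        string trimmed = line.Trim();
        foreach (Rule rule in Rules)
        {
            Match match = rule.Pattern.Match(trimmed);
            if (match.Success)
            {
                return rule.Emit(match);
            }
        }
        return new string[0];
    }
}
```

Adding support for another kind of line then means adding one entry to the table. Because each rule returns an array, a single source line can still expand into several output lines.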

How do I extract the *immediate* text from a C# System.Windows.Forms.HtmlElement (i.e. NOT the text in children)

In C#, how do I get the text of a System.Windows.Forms.HtmlElement, not including the text from its children?
If I have
<div>aaa<div>bbb<div>ccc</div><div>ddd</div></div></div>
then the InnerText property of the whole thing is "aaabbbcccddd" and I just want "aaa".
I figure this should be trivial, but I haven't found anything to produce the "immediate" text of an HtmlElement in C#. More ludicrous ideas are "subtracting" the InnerText of the children from the parent, but that's an insane amount of work for something that I'm sure is trivial.
(All I want is access to the Text Node of the HtmlElement.)
I'd certainly appreciate any help (or pointers) that anyone can supply.
Many thanks.
Examples:
<div>aaa<div>bbb<div>ccc</div><div>ddd</div></div></div> -> Produce "aaa"
<div><div>ccc</div><div>ddd</div></div> -> Produce ""
<div>ccc</div> -> Produce "ccc"
Edit
There are a number of ways to skin this particular cat, none of them elegant. However, given my constraints (not my HTML, quite possibly not valid), I think Aleksey Bykov's solution is closest to what I needed (and indeed, I did implement the same solution he suggested in the last comment.)
I've selected his solution and upvoted all the other ones that I think would work, but weren't optimal for me. I'll check back to upvote any other solutions that seem likely to work.
Many thanks.
Maybe it's simpler than that: if you're willing to use XmlDocument instead of HtmlDocument, you can just read the Value property of the element's first child node.
This code gives the output you want for the 3 cases you mentioned:
using System;
using System.Xml;
class Program
{
private static string[] htmlTests = {@"<div>aaa<div>bbb<div>ccc</div><div>ddd</div></div></div>",
@"<div><div>ccc</div><div>ddd</div></div>",
@"<div>ccc</div>" };
static void Main(string[] args)
{
var page = new XmlDocument();
foreach (var test in htmlTests)
{
page.LoadXml(test);
Console.WriteLine(page.DocumentElement.FirstChild.Value);
}
}
}
Output (the second line is empty because the first child of that element is not a text node):
aaa

ccc
I am not sure what you mean by HtmlElement, but with XmlElement you would do it like this:
using System;
using System.Xml;
using System.Linq;
using System.Collections.Generic;
using System.Text;
public static class XmlUtils {
public static IEnumerable<String> GetImmediateTextValues(XmlNode node) {
var values = node.ChildNodes.Cast<XmlNode>().Aggregate(
new List<String>(),
(xs, x) => { if (x.NodeType == XmlNodeType.Text) { xs.Add(x.Value); } return xs; }
);
return values;
}
public static String GetImmediateJoinedTextValues(XmlNode node, String delimiter) {
var values = GetImmediateTextValues(node);
var text = String.Join(delimiter, values.ToArray());
return text;
}
}
EDIT:
Well, if your HtmlElement comes from System.Windows.Forms, then what you need to do is to use its DomElement property trying to cast it to one of the COM interfaces defined in mshtml. So all you need to do is to be able to tell if the element you are looking at is a text node and get its value. First you gotta add a reference to the mshtml COM library. You can do something like this (I cannot verify this code immediately).
public bool IsTextNode(HtmlElement element) {
var result = false;
var nativeNode = element.DomElement as mshtml.IHTMLDOMNode;
if (nativeNode != null) {
var nodeType = nativeNode.nodeType;
result = nodeType == 3; // -- TextNode: http://msdn.microsoft.com/en-us/library/aa704085(v=vs.85).aspx
}
return result;
}
Well, you could do something like this (assuming your input is in a string called `input'):
string pattern = @">.*?<";
Regex rgx = new Regex(pattern, RegexOptions.IgnoreCase);
MatchCollection matches = rgx.Matches(input);
var first_match = matches[0].ToString();
string result = first_match.Substring(1, first_match.Length - 2);
I probably wouldn't do it that way (or would just rely on matching the string for the first <div> and </div>) ... here, for extra credit:
int start = input.IndexOf(">") + 1;
int end = input.IndexOf("<", start);
string result = input.Substring(start, end - start);

Remove BR tag from the beginning and end of a string

How can I use something like
return Regex.Replace("/(^)?(<br\s*\/?>\s*)+$/", "", source);
to replace this cases:
<br>thestringIwant => thestringIwant
<br><br>thestringIwant => thestringIwant
<br>thestringIwant<br> => thestringIwant
<br><br>thestringIwant<br><br> => thestringIwant
thestringIwant<br><br> => thestringIwant
It can have multiple br tags at the beginning or end, but I don't want to remove any br tag in the middle.
A couple of loops would solve the issue and be easier to read and understand (use a regex = tomorrow you look at your own code wondering what the heck is going on)
while(source.StartsWith("<br>"))
source = source.Substring(4);
while(source.EndsWith("<br>"))
source = source.Substring(0,source.Length - 4);
return source;
Looking at your regular expression, it seems that spaces could be allowed within the br tag.
So you can try something like:
string s = Regex.Replace(input,@"\<\s*br\s*\/?\s*\>","");
There is no need to use regular expression for it
you can simply use
yourString.Replace("<br>", "");
This will remove all occurrences of <br> from your string.
EDIT:
To keep the tag present in between the string, just use as follows-
var regex = new Regex(Regex.Escape("<br>"));
var newText = regex.Replace("<br>thestring<br>Iwant<br>", "", 1);
newText = newText.Substring(0, newText.LastIndexOf("<br>"));
Response.Write(newText);
This will remove only the first and last occurrences of <br> from your string.
How about doing it in two goes so ...
result1 = Regex.Replace(source, @"^(<br\s*/?>\s*)+", "");
then feed the result of that into
result2 = Regex.Replace(result1, @"(<br\s*/?>\s*)+$", "");
It's a bit of added overhead I know but simplifies things enormously, and saves trying to counter match everything in the middle that isn't a BR.
Note the subtle difference between those two .. one matching them at start and one matching them at end. Doing it this way keeps the flexibility of keeping a regular expression that allows for the general formatting of BR tags rather than it being too strict.
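The two passes can be wrapped in one helper, roughly as sketched below. BrTrimmer and TrimBr are hypothetical names, and RegexOptions.IgnoreCase is an added assumption so that <BR> is handled as well.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BrTrimmer
{
    // One or more <br>, <br/> or <br /> tags, each optionally followed by whitespace.
    private const string TagRun = @"(?:<br\s*/?>\s*)+";

    public static string TrimBr(string source)
    {
        // Pass 1: strip a run of tags anchored at the start of the string.
        string result = Regex.Replace(source, "^" + TagRun, "", RegexOptions.IgnoreCase);
        // Pass 2: strip a run of tags anchored at the end of the string.
        return Regex.Replace(result, TagRun + "$", "", RegexOptions.IgnoreCase);
    }
}
```

Tags in the middle of the string are left alone because each pattern is anchored to one end of the input.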
if you also want it to work with
<br />
then you could use
return Regex.Replace(source, @"((?:<br\s*/?>)*<br\s*/?>$|^<br\s*/?>(?:<br\s*/?>)*)", "");
EDIT:
Now it should also take care of multiple
<br\s*/?>
in the start and end of the lines
You can write extension methods for this:
public static string TrimStart(this string value, string stringToTrim)
{
if (value.StartsWith(stringToTrim, StringComparison.CurrentCultureIgnoreCase))
{
return value.Substring(stringToTrim.Length);
}
return value;
}
public static string TrimEnd(this string value, string stringToTrim)
{
if (value.EndsWith(stringToTrim, StringComparison.CurrentCultureIgnoreCase))
{
return value.Substring(0, value.Length - stringToTrim.Length);
}
return value;
}
you can call it like
string example = "<br> some <br> test <br>";
example = example.TrimStart("<br>").TrimEnd("<br>"); //output some <br> test
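The extension methods above remove a single occurrence from each end. For the cases in the question, where several tags can be stacked, a looping variant might look like this sketch. TagTrimExtensions and TrimAll are hypothetical names, and the ordinal case-insensitive comparison is an assumption rather than something the question specifies.

```csharp
using System;

public static class TagTrimExtensions
{
    // Removes every leading and trailing occurrence of token, ignoring case.
    // Occurrences in the middle of the string are left untouched.
    public static string TrimAll(this string value, string token)
    {
        if (string.IsNullOrEmpty(value) || string.IsNullOrEmpty(token))
        {
            return value;
        }
        while (value.StartsWith(token, StringComparison.OrdinalIgnoreCase))
        {
            value = value.Substring(token.Length);
        }
        while (value.EndsWith(token, StringComparison.OrdinalIgnoreCase))
        {
            value = value.Substring(0, value.Length - token.Length);
        }
        return value;
    }
}
```

For example, "<br><BR>abc<br>def<br><br>".TrimAll("<br>") yields "abc<br>def". Unlike the regex versions, this only matches the exact token, so "<br />" would need its own call.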
I believe that one should not ignore the power of Regex. If you name the regular expression appropriately then it would not be difficult to maintain it in future.
I have written a sample program which does your task using Regex. It also ignores the character cases and white space at beginning and end. You can try other source string samples you have.
Most importantly, it may also be faster.
using System;
using System.Text.RegularExpressions;
namespace ConsoleDemo
{
class Program
{
static void Main(string[] args)
{
string result;
var source = @"<br><br>thestringIwant<br><br> => thestringIwant<br/> same <br/> <br/> ";
result = RemoveStartEndBrTag(source);
Console.WriteLine(result);
Console.ReadKey();
}
private static string RemoveStartEndBrTag(string source)
{
const string replaceStartEndBrTag = @"(^(<br>[\s]*)+|([\s]*<br[\s]*/>)+[\s]*$)";
return Regex.Replace(source, replaceStartEndBrTag, "", RegexOptions.IgnoreCase);
}
}
}
