Regular expression to replace square brackets with angle brackets - c#

I have a string like:
[a b="c" d="e"]Some multi line text[/a]
The d="e" part is optional. I want to convert this kind of string into:
<a b="c" d="e">Some multi line text</a>
The names a, b and d are constant, so I don't need to capture them. I just need the values c and e and the text between the tags, so that I can build an equivalent XML expression. How can I do that, given that part of the input is optional?

For HTML tags, please use an HTML parser.
For [a]...[/a], you can do the following:
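// The d="..." attribute is optional; Groups[2] is empty when it is absent.
// RegexOptions.Singleline lets . match newlines, so the text may span several lines.
Match m = Regex.Match(@"[a b=""c"" d=""e""]Some multi line text[/a]",
    @"\[a b=""([^""]+)""(?: d=""([^""]+)"")?\](.*?)\[/a\]",
    RegexOptions.Singleline);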
m.Groups[1].Value
"c"
m.Groups[2].Value
"e"
m.Groups[3].Value
"Some multi line text"
Here is a Regex.Replace version (not my preferred approach, though):
string inputStr = @"[a b=""[[[[c]]]]"" d=""e[]""]Some multi line text[/a]";
string resultStr = Regex.Replace(inputStr,
    @"\[a( b=""[^""]+"")( d=""[^""]+"")?\](.*?)\[/a\]",
    @"<a$1$2>$3</a>",
    RegexOptions.Singleline); // Singleline so that . also matches newlines

If you are actually thinking of processing (pseudo-)HTML using regexes, don't.
SO is filled with posts where regexes are proposed for HTML/XML and answers pointing out why this is a bad idea.
Suppose your multiline text ("which can be anything") contains
[a b="foo" [a b="bar"]]
a simple regex like the ones above cannot handle this nesting reliably.
See the classic answer in:
RegEx match open tags except XHTML self-contained tags
which has:
I think it's time for me to quit the
post of Assistant Don't Parse HTML
With Regex Officer. No matter how many
times we say it, they won't stop
coming every day... every hour even.
It is a lost cause, which someone else
can fight for a bit. So go on, parse
HTML with regex, if you must. It's
only broken code, not life and death.
– bobince
Seriously. Find an XML or HTML DOM and populate it with your data. Then serialize it. That will take care of all the problems you don't even know you have got.
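As a rough illustration of the "populate a DOM, then serialize it" advice, here is a minimal C# sketch. It assumes the input holds a single, non-nested [a ...]...[/a] block with the fixed attribute names b and d from the question; the names BracketTagConverter and ToXml are hypothetical. A regex still pulls out the three values, but System.Xml.Linq builds the element, so characters such as < or & in the text are escaped on output.

```csharp
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class BracketTagConverter
{
    // Assumption: attribute names b and d are fixed, and d is optional.
    // Singleline lets . match newlines inside the body text.
    private static readonly Regex TagPattern = new Regex(
        @"\[a b=""([^""]*)""(?: d=""([^""]*)"")?\](.*?)\[/a\]",
        RegexOptions.Singleline);

    public static string ToXml(string input)
    {
        Match m = TagPattern.Match(input);
        if (!m.Success)
        {
            return input; // leave unrecognized input untouched
        }

        var element = new XElement("a", new XAttribute("b", m.Groups[1].Value));
        if (m.Groups[2].Success)
        {
            element.Add(new XAttribute("d", m.Groups[2].Value));
        }
        element.Add(m.Groups[3].Value); // added as text, escaped when serialized

        return element.ToString(SaveOptions.DisableFormatting);
    }
}
```

Calling BracketTagConverter.ToXml on the string from the question should yield the <a b="c" d="e">...</a> form; nested or repeated blocks would need a real parser or a loop over all matches.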

Would the multiline text include [ and ]? If not, you can just replace [ with < and ] with > using String.Replace; there is no need for a regex.
Update:
If it can be anything but [/a], you can replace
^\[a([^\]]+)](.*?)\[/a]$
with
<a$1>$2</a>
I haven't escaped ] and / in the regex - escape them if necessary to get
^\[a([^\]]+)\](.*?)\[\/a\]$

Related

Regex that handles quoted strings and double quote for inches

I am writing a little search for a website's product catalog, and I am using regex to determine if there are any strings like "exact search phrase" included in the text from the search text box. The regex that I am currently using is:
List<string> searchTermList = searchTerm.Trim().ToLower().Split(new Char[] { ' ' }).ToList();
foreach (Match match in Regex.Matches(searchTerm, "\"([^\"]*)\""))
{
//irrelevant code
}
This code works great for me until I search for something like:
8" tortilla "stone ground"
The result I would like as a match would be
"stone ground"
but instead I am getting
" tortilla ".
The other posts I found for similar questions were escaping the double quote for inches, but I don't have any way to reliably escape quotes like those examples. The best option of the other articles I found was to escape it if it follows a number, but users could search for things like "burger 3-1" in quotes, which would be incorrect to escape the last quote in that case.
What I would like is some way to tell if the string inside a set of quotes is preceded by a space or an empty string (if the only search text is a phrase in quotes), but I am inexperienced and struggling with regex, and I feel like it is my best option for tackling something like this. Any help/pointers?
Try this: (updated)
First, use this expression to find all strings of the form "9", "9.9" or "9-9" and replace them (in JavaScript) with "9', "9.9' or "9-9':
\"[0-9.-]*\"
Next replace all
([^a-z,0-9,',"])([\s]*)\"
with just a single ". This will remove all unwanted spaces.
Then take this new formatted string and apply
\"[^\s]([^\"]*)[^\s]\"
This takes care of all the scenarios. Just make sure you copy the original string into a new variable and work on the copy; otherwise you will end up modifying the original value.
Here is the sample string I used to test the above expressions. I did not have the time to write the JavaScript function itself. Please post the function if you get it to work using the above expressions.
8" "bosch grinder" , bosch "8" grinder" , and "bosch grinder " 8" "99" "9.9" "9-7"
A website I use to test out my regular expressions is http://www.regexr.com/
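For comparison, the rule the question asks for (an opening quote only counts at the start of the input or after whitespace) can be sketched directly in C# with lookarounds. This is an alternative to the multi-step replacement above, not the answerer's method; PhraseExtractor and Extract are hypothetical names, and the extra requirement that a closing quote be followed by whitespace or the end of the input is an assumption.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PhraseExtractor
{
    // (?<=^|\s) : the opening quote is at the start of the input or after whitespace.
    // (?=\s|$)  : the closing quote is followed by whitespace or the end of the input.
    private static readonly Regex QuotedPhrase = new Regex(
        @"(?<=^|\s)""([^""]+)""(?=\s|$)");

    public static List<string> Extract(string searchTerm)
    {
        var phrases = new List<string>();
        foreach (Match match in QuotedPhrase.Matches(searchTerm))
        {
            phrases.Add(match.Groups[1].Value); // phrase without the quotes
        }
        return phrases;
    }
}
```

With the input 8" tortilla "stone ground", the inch mark is skipped because it follows a digit, so only stone ground is returned. Inputs such as bosch "8" grinder remain ambiguous under this rule and would still treat "8" as a phrase.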

How to get text "out of" by Regex

I have a small problem. I'm trying to get the text which is outside of HTML elements.
Example input:
I want this text I want this text I want this text <I don't want this text/>
I want this text I wan this text <I don't>want this</text>
Does anybody know how to do this with a regex? I thought I could do it by deleting the element text. Does anybody know another solution for this problem? Please help me.
Instead of regex, which is not suitable for parsing HTML in general (especially malformed HTML), use an HTML parser like the HTML Agility Pack.
What is exactly the Html Agility Pack (HAP)?
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
I agree that anything non-trivial should be done with an HTML parser (the Agility Pack is excellent if you use .NET), but for a small requirement like this it's more than likely overkill.
Then again, an HTML parser knows more about the quirks and edge cases that HTML is full of. Be sure to test well before using a regex.
Here you go
<.*?>.*?<.*?>|<.*?/>
It also correctly ignores
<I don't>want this</text>
and not just the tags
In C# this becomes
string resultString = Regex.Replace(subjectString, "<.*?>.*?<.*?>|<.*?/>", "");
Try this
(?<!<.*?)([^<>]+)
Explanation
@"
(?<! # Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind)
< # Match the character “<” literally
. # Match any single character that is not a line break character
*? # Between zero and unlimited times, as few times as possible, expanding as needed (lazy)
)
( # Match the regular expression below and capture its match into backreference number 1
[^<>] # Match a single character NOT present in the list “<>”
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
"

how to change xml tag format like <xml></xml> to <xml/>

I have an XML file. I need to convert empty elements written as <xml></xml> to the short form <xml/>. Is it possible to change the tags like that?
Thank you...
var xmlString="<xml></xml> <toto></toto>";
var properString=System.Text.RegularExpressions.Regex.Replace(xmlString, "<([^>]+)></[^>]+>", "<$1/>");
EDIT: explanation!
@Neil Knight has already provided, in a comment, a link to Wikipedia explaining the concept of regular expressions. The part specific to .NET is available here: .NET Framework Regular Expressions
A starting XML tag can be matched with the following regular expression: <[^>]+>. The [^>]+ part can be read as: all characters that are not ">", with at least one character (so <> is not matched but <a> is). An ending XML tag can be matched with the same kind of expression: </[^>]+> (note the slash after the first character). So the regular expression <[^>]+></[^>]+> matches empty tags such as <foo></foo> (but be careful, it also matches <foo></bar> which is not valid XML code).
What we need now is to isolate the characters between "<" and ">". For that, we use parentheses: <([^>]+)>. This instructs the regular expression engine to capture the matched characters. Each pair of parentheses can be referred to later in a replacement operation by the "$x" string (where "x" is a number: "$1" for the first pair of parentheses, "$2" for the second one, etc.).
So, with a call to Regex.Replace(xmlString, "<([^>]+)></[^>]+>", "<$1/>"), <foo></foo> will be replaced by <foo/> ("foo" characters are captured, and "$1" is replaced by them). <foo></bar> will also be replaced by <foo/>.
I hope that this explanation is enough for @Felix K. ;o)
(my English is not so good, that's why I did not provide many details)
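The <foo></bar> caveat mentioned above can be avoided by capturing the element name on its own and requiring the closing tag to repeat it with a backreference. The following is a minimal sketch, not part of the original answer; EmptyElementCollapser and Collapse are hypothetical names, and the pattern assumes attribute values do not contain a > character.

```csharp
using System.Text.RegularExpressions;

public static class EmptyElementCollapser
{
    // Group 1: the element name.
    // Group 2: optional attributes, which must start with whitespace so that
    //          group 1 always holds the complete name.
    // \1     : the closing tag must use the same name as the opening tag.
    private static readonly Regex EmptyPair = new Regex(
        @"<([^\s>/]+)((?:\s[^>]*)?)></\1>");

    public static string Collapse(string xml)
    {
        return EmptyPair.Replace(xml, "<$1$2/>");
    }
}
```

With this pattern, <xml></xml> <toto></toto> becomes <xml/> <toto/>, while a mismatched pair such as <foo></bar> is left unchanged.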
if (someElement.innerText == string.Empty)
{
someElement.innerText = null;
}
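If the file is loaded into a System.Xml.XmlDocument, a similar idea can be sketched with the XmlElement.IsEmpty property, which controls whether an element is serialized in the short tag format. This is an illustrative sketch under the assumption that the input is a well-formed document with a single root element; EmptyTagWriter and UseShortTags are hypothetical names.

```csharp
using System.Linq;
using System.Xml;

public static class EmptyTagWriter
{
    public static string UseShortTags(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml); // throws XmlException if the input is not well-formed

        // Copy the elements into a list first so the document is not
        // modified while the XPath result is being enumerated.
        var elements = doc.SelectNodes("//*").Cast<XmlElement>().ToList();

        foreach (XmlElement element in elements)
        {
            // No child nodes means no text and no child elements.
            if (!element.HasChildNodes)
            {
                element.IsEmpty = true; // request the short tag format on output
            }
        }

        return doc.OuterXml;
    }
}
```

Because the document is parsed rather than pattern-matched, elements that contain text or children are left as they are.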

C# How to remove text between BBCode

How to remove all text between BBCode Quotation (including BBCode itself):
[quote date=2011-07-02 14:43:53 user=test link=1]blabla[/quote]
I should add that the text between the tags can contain HTML tags for formatting.
My current attempt looks like:
Regex regex = new Regex(@"[quote+].+?[/\+quote]");
Well it's almost working.
You may try the following regex:
@"\[quote[^\]]*\].*?\[/quote\]"
Note that you have to escape square brackets in a regex.
Since your BBCode blocks contain attributes, a simple + won't suffice to cover everything. Outside a character class, + repeats the preceding element one or more times (here that would be the e); inside unescaped square brackets it is just a literal character.
Off the top of my head, I'd try something like this:
\[quote([^\[]*)\](.*?)\[\/quote\]
Please bear in mind that I have not tested this in C#, where the syntax might differ slightly depending on the regex engine. Also note that I've added capture groups so that you can examine each part of the match. As @Howard answered, [ and ] are reserved symbols and consequently need to be escaped.
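Putting the suggestions together, a minimal C# sketch for removing the quote blocks might look like this. QuoteStripper and RemoveQuotes are hypothetical names; the sketch assumes quotes are not nested, and it adds RegexOptions.Singleline so that a quote body spanning several lines is also removed.

```csharp
using System.Text.RegularExpressions;

public static class QuoteStripper
{
    // \[quote[^\]]*\] : the opening tag with any attributes up to the first ]
    // .*?             : the quoted body, as short as possible
    // \[/quote\]      : the closing tag
    private static readonly Regex QuoteBlock = new Regex(
        @"\[quote[^\]]*\].*?\[/quote\]",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    public static string RemoveQuotes(string post)
    {
        return QuoteBlock.Replace(post, string.Empty);
    }
}
```

With nested quotes the lazy .*? would stop at the first [/quote], so that case would need a different approach.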

regex to strip out [blah: ... ] tag from string

I am using this regex:
[Blah(?:\s*)\]
I want to strip out the tag that looks like:
[Blah:http:..anything goes here so catch all types of characters ]
Any tips on what's wrong with my regex?
A regex of \[Blah[^\]]*\] is the usual way. It means:
literal string [Blah
zero or more:
characters that aren't ]
literal string ]
If you want to handle nesting (e.g. input of the form [a[b[c]]]), then you need something other than regex (this is one reason why trying to use regex to parse HTML doesn't work).
Your regex [Blah(?:\s*)\] starts with an unescaped '[' which is "seen" as the start of a character class. That's what's wrong with your regex (there are probably more errors, but that one is the main reason).
Try changing it to \[Blah[^\]]*\] or \[Blah.*?\]. They should give the same result, but there might be a difference in their performance.
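In C#, stripping the tag with the first of these patterns could be sketched as follows; BlahTagStripper and Strip are hypothetical names used only for illustration.

```csharp
using System.Text.RegularExpressions;

public static class BlahTagStripper
{
    public static string Strip(string input)
    {
        // \[Blah   : the literal text [Blah
        // [^\]]*   : any run of characters that are not ]
        // \]       : the closing ]
        return Regex.Replace(input, @"\[Blah[^\]]*\]", string.Empty);
    }
}
```

Note that this also removes tags whose name merely starts with Blah; if only [Blah:...] should match, the colon can be added to the pattern.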
