Regex that handles quoted strings and double quote for inches - c#

I am writing a little search for a website's product catalog, and I am using regex to determine if there are any strings like "exact search phrase" included in the text from the search text box. The regex that I am currently using is:
List<string> searchTermList = searchTerm.Trim().ToLower().Split(new Char[] { ' ' }).ToList();
foreach (Match match in Regex.Matches(searchTerm, "\"([^\"]*)\""))
{
    //irrelevant code
}
This code works great for me until I search for something like:
8" tortilla "stone ground"
The result I would like as a match would be
"stone ground"
but instead I am getting
" tortilla ".
The other posts I found for similar questions were escaping the double quote for inches, but I don't have any way to reliably escape quotes like those examples. The best option of the other articles I found was to escape it if it follows a number, but users could search for things like "burger 3-1" in quotes, which would be incorrect to escape the last quote in that case.
What I would like is some way to tell if the string inside a set of quotes is preceded by a space or an empty string (if the only search text is a phrase in quotes), but I am inexperienced and struggling with regex, and I feel like it is my best option for tackling something like this. Any help/pointers?

Try this: (updated)
First, use this expression to find every quoted number of the form "9", "9.9", or "9-9" and replace it (in JavaScript) with the form "9', "9.9', or "9-9':
\"[0-9.-]*\"
Next replace all
([^a-z,0-9,',"])([\s]*)\"
with just a single ". This will remove all unwanted spaces.
Then take this new formatted string and apply
\"[^\s]([^\"]*)[^\s]\"
This takes care of all the scenarios. Just make sure you copy the original string into a new variable and work on the copy; otherwise you will end up modifying the original value.
Here is the sample string I used to test the above expressions. I did not have the time to write the javascript function itself. Please post the function if you get it to work using the above expressions.
8" "bosch grinder" , bosch "8" grinder" , and "bosch grinder " 8" "99" "9.9" "9-7"
A website I use to test out my regular expressions is http://www.regexr.com/
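As an alternative to the multi-step replacement, the idea from the question (only treat a quote as an opening quote when it sits at the start of the input or right after whitespace) can be expressed directly with lookarounds. The following is a minimal C# sketch rather than a tested production solution. The names PhraseSearch and ExtractPhrases are made up for illustration, and it assumes that a closing quote is always followed by whitespace or the end of the input.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PhraseSearch
{
    // (?<!\S) : the opening quote is at the start of the input or after whitespace.
    // (?!\S)  : the closing quote is at the end of the input or before whitespace.
    // An inch mark such as 8" fails the first check because a digit precedes it.
    private static readonly Regex QuotedPhrase =
        new Regex(@"(?<!\S)""([^""]+)""(?!\S)");

    public static List<string> ExtractPhrases(string searchTerm)
    {
        var phrases = new List<string>();
        foreach (Match match in QuotedPhrase.Matches(searchTerm))
        {
            phrases.Add(match.Groups[1].Value);
        }
        return phrases;
    }

    public static void Main()
    {
        foreach (string phrase in ExtractPhrases("8\" tortilla \"stone ground\""))
        {
            Console.WriteLine(phrase); // stone ground
        }
    }
}
```

A phrase such as "burger 3-1" still matches, because its opening quote is at the start of the input. A phrase that itself contains an inch mark followed by a space (for example "stone 8" ground") would still be cut short, so this is a heuristic, not a full parser.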

Related

Match regex pattern in a line of text without targeting the text within quotations

Stackoverflow has been very generous with answers to my regex questions so far, but with this one I'm blanking on what to do and just can't seem to find it answered here.
So I'm parsing a string, let's say for example's sake, a line of VB-esque code like either of the following:
Call Function ( "Str ing 1 ", "String 2" , " String 3 ", 1000 ) As Integer
Dim x = "This string should not be affected "
I'm trying to parse the text in order to eliminate all leading spaces, trailing spaces, and extra internal spaces (when two "words/chunks" are separated with two or more space or when there is one or more spaces between a character and a parentheses) using regex in C#. The result after parsing the above should look like:
Call Function("Str ing 1 ", "String 2", " String 3 ", 1000) As Integer
Dim x = "This string should not be affected "
The issue I'm running into is that, I want to parse all of the line except any text contained within quotation marks (i.e. a string). Basically if there are extra spaces or whatever inside a string, I want to assume that it was intended and move on without changing the string at all, but if there are extra spaces in the line text outside of the quotation marks, I want to parse and adjust that accordingly.
So far I have the following regex which does all of the parsing I mentioned above, the only issue is it will affect the contents of strings just like any other part of the line:
var rx = new Regex(@"\A\s+|(?<=\s)\s+|(?<=.)\s+(?=\()|(?<=\()\s+(?=.)|(?<=.)\s+(?=\))|\s+\z");
.
.
.
lineOfText = rx.Replace(lineOfText, String.Empty);
Anyone have any idea how I can approach this, or know of a past question answering this that I couldn't find? Thank you!
Since you are reading the file line by line, you can use the following fix:
("[^"]*(?:""[^"]*)*")|^\s+|(?<=\s)\s+|(?<=\w)\s+(?=\()|(?<=\()\s+(?=\w)|(?<=\w)\s+(?=\))|\s+$
Replace the matched text with $1 to restore the captured string literals that were captured with ("[^"]*(?:""[^"]*)*").
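To make the capture-and-restore trick concrete, here is a small C# sketch of how the pattern and the $1 replacement fit together. The class and method names (WhitespaceCleaner, Clean) and the sample inputs are illustrative assumptions, not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceCleaner
{
    // The first alternative captures a whole string literal ("" inside it counts
    // as an escaped quote); the remaining alternatives match whitespace to drop.
    private static readonly Regex Rx = new Regex(
        @"(""[^""]*(?:""""[^""]*)*"")|^\s+|(?<=\s)\s+|(?<=\w)\s+(?=\()|(?<=\()\s+(?=\w)|(?<=\w)\s+(?=\))|\s+$");

    public static string Clean(string lineOfText)
    {
        // $1 puts a captured literal back unchanged; for the whitespace
        // alternatives group 1 did not participate, so $1 is empty.
        return Rx.Replace(lineOfText, "$1");
    }

    public static void Main()
    {
        Console.WriteLine("[" + Clean("  Dim   x = \"a   b\"  ") + "]"); // [Dim x = "a   b"]
        Console.WriteLine("[" + Clean("Foo ( 1 )") + "]");               // [Foo(1)]
    }
}
```

Because the literal alternative comes first, the regex engine consumes an entire quoted string before any whitespace alternative gets a chance to match inside it.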

Regex for ignoring consecutive quotation marks in string

I have built a parser in Sprache and C# for files using a format I don't control. Using it I can correctly convert:
a = "my string";
into
my string
The parser (for the quoted text only) currently looks like this:
public static readonly Parser<string> QuotedText =
    from open in Parse.Char('"').Token()
    from content in Parse.CharExcept('"').Many().Text().Token()
    from close in Parse.Char('"').Token()
    select content;
However the format I'm working with escapes quotation marks using "double doubles" quotes, e.g.:
a = "a ""string"".";
When attempting to parse this nothing is returned. It should return:
a ""string"".
Additionally
a = "";
should be parsed into a string.Empty or similar.
I've tried regexes unsuccessfully based on answers like this doing things like "(?:[^;])*", or:
public static readonly Parser<string> QuotedText =
from content in Parse.Regex("""(?:[^;])*""").Token()
This doesn't work (i.e. no matches are returned in the above cases). I think my beginners regex skills are getting in the way. Does anybody have any hints?
EDIT: I was testing it here - http://regex101.com/r/eJ9aH1
If I'm understanding you correctly, this is the kind of regex you're looking for:
"(?:""|[^"])*"
1. " matches an opening quote
2. (?:""|[^"])* matches two quotes or any chars that are not a quote (including newlines), repeating
3. " matches the closing quote.
But it's always going to boil down to whether your input is balanced. If not, you'll be getting false positives. And if you have a string such as "string"", which should be matched: "string"", "", or nothing? That's a tough decision, one that, fortunately, you don't have to make if you are sure of your input.
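For reference, here is a minimal C# sketch of using that pattern with plain System.Text.RegularExpressions (not Sprache). A capture group is added around the content so the surrounding quotes can be dropped; the names QuotedTextDemo and TryExtract are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuotedTextDemo
{
    // Same pattern as above, with a capture group around the content.
    private static readonly Regex Quoted = new Regex(@"""((?:""""|[^""])*)""");

    public static bool TryExtract(string input, out string content)
    {
        Match m = Quoted.Match(input);
        content = m.Success ? m.Groups[1].Value : string.Empty;
        return m.Success;
    }

    public static void Main()
    {
        if (TryExtract("a = \"a \"\"string\"\".\";", out string raw))
        {
            Console.WriteLine(raw);                       // a ""string"".
            // Optional extra step: collapse each doubled quote to one.
            Console.WriteLine(raw.Replace("\"\"", "\"")); // a "string".
        }
        if (TryExtract("a = \"\";", out string empty))
        {
            Console.WriteLine(empty.Length);              // 0
        }
    }
}
```

Returning a bool alongside the content keeps "no quoted text found" distinct from "found an empty string", which matters for the a = ""; case.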
You can likely adapt your desired output from this pattern:
"(.+".+")"|(".+?")|("")
example:
http://regex101.com/r/lO1vZ4
If you only want to ignore consecutive double quotes, try this:
("{2,})
This regex "("+) might help you to match extra unwanted double quotes.

C# .NET Regex remove all quotes of quotes excluding one instance in a sentence

I have description field which is:
16" Alloy Upgrade
In CSV format it appears like this:
"16"" Alloy Upgrade "
What would be the best use of regex to maintain the original format? As I'm learning, I would appreciate it being broken down for my understanding.
I'm already using Regex to split some text separating 2 fields which are: code, description. I'm using this:
,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))
My thoughts are to remove the quotes, then remove the delimiter excluding use in sentences.
Thanks in advance.
If you don't want to/can't use a standard CSV parser (which I'd recommend), you can strip all non-doubled quotes using a regex like this:
Regex.Replace(text, @"(?<!"")""(?!"")", string.Empty)
That regex will match every " character not preceded or followed by another ".
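Here is a short C# sketch of what a replacement of that kind (a quote that is neither preceded nor followed by another quote) does to the sample field. Note that it leaves the doubled quote in place; collapsing "" to " afterwards is an extra step added here as an assumption about the desired output, and the name QuoteStripDemo is illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuoteStripDemo
{
    public static string StripLoneQuotes(string text)
    {
        // (?<!"") : not preceded by a quote; (?!"") : not followed by a quote.
        return Regex.Replace(text, @"(?<!"")""(?!"")", string.Empty);
    }

    public static void Main()
    {
        string csvField = "\"16\"\" Alloy Upgrade \"";

        string stripped = StripLoneQuotes(csvField);
        Console.WriteLine("[" + stripped + "]");   // [16"" Alloy Upgrade ]

        // Extra step (assumption): turn the escaped "" back into a single ".
        string unescaped = stripped.Replace("\"\"", "\"");
        Console.WriteLine("[" + unescaped + "]");  // [16" Alloy Upgrade ]
    }
}
```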
I wouldn't use regex here, since regexes are often confusing and it is unclear what they do (like the one in your question, for example). Instead, this method should do the trick:
public string CleanField(string input)
{
    if (input.StartsWith("\"") && input.EndsWith("\""))
    {
        string output = input.Substring(1, input.Length - 2);
        output = output.Replace("\"\"", "\"");
        return output;
    }
    else
    {
        //If it doesn't start and end with quotes then it doesn't look like it's been escaped so just hand it back
        return input;
    }
}
It may need tweaking, but in essence it checks whether the string starts and ends with a quote (which it should if it is an escaped field). If so, it takes the inner part (with Substring) and then replaces each doubled quote ("") with a single quote ("). The code is a bit ugly due to all the escaping, but there is no avoiding that.
The nice thing is this can be used easily with a bit of Linq to take an existing array and convert it.
processedFieldArray = inputfieldArray.Select(CleanField).ToArray();
I'm using arrays here purely because your linked page seems to use them where you are wanting this solution.

Regular expression with backreferences

I am attempting to write a regular expression in a C# application to find "{value}", along with a backreference to the text before it up to "[[", and another backreference to the text after it up to "]]". For example:
This is some text [[backreference one {value}
backreference two]]
Would match "[[backreference one ", "{value}", and "\r\nbackreference two]]".
I have tried modified versions of the following with no luck. I believe I am missing word boundaries, and may be having trouble because of "{" in the text I am trying to find.
\[\[(^[\{value\}]+)\{value\}(^\]\]+)\]\]
I'm not sure if it would be possible with regular expressions, but it would be ideal if it could find the matching closing bracket, for example the following would find "[[backreferenc[[e]] one ", "{value}", and "ba[[ckref[[e]]rence t]]wo]]":
This is some text [[backreferenc[[e]] one {value}
ba[[ckref[[e]]rence t]]wo]]
You need to use a MatchEvaluator with Regex.Replace. It will also make your life easier to break the match up into named capture groups, which simplifies the processing inside the match evaluator. Let me explain.
A MatchEvaluator lets you intercede in the replacement process with a C# delegate: it examines each captured match and returns the text that should replace it. That way you can do your text processing as needed.
Here is a basic example where it handles the sections in a basic way, but the structure is there to add your business logic:
string text = @"This is some text [[Name: {name}]] at [[Address: {address}]].";
Regex.Replace(text,
    @"(?:\[\[)(?<Section>[^\:]+)(?:\:)(?<Data>[^\]]+)(?:\]\])",
    new MatchEvaluator((mtch) =>
    {
        if (mtch.Groups["Section"].Value == "Name")
            return "Jabberwocky";
        return "120 Main";
    }));
The result of Regex Replace is:
This is some text Jabberwocky at 120 Main.
For the first part of your question, try this:
\[\[(.*)({value})(.*)\]\]
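One caveat when applying a pattern like this to the two-line example: in .NET, . does not match a newline unless RegexOptions.Singleline is set. The sketch below is a variant under stated assumptions: it escapes the braces, moves the [[ and ]] inside the groups so they appear in the captured text, and uses lazy quantifiers. The names ValueContextDemo and Split are made up, and it does not handle the nested-bracket case.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ValueContextDemo
{
    // Singleline lets . match \n, so the text after {value} may span lines.
    private static readonly Regex Rx = new Regex(
        @"(\[\[.*?)(\{value\})(.*?\]\])", RegexOptions.Singleline);

    public static string[] Split(string text)
    {
        Match m = Rx.Match(text);
        if (!m.Success) return new string[0];
        return new[] { m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value };
    }

    public static void Main()
    {
        string text = "This is some text [[backreference one {value}\r\nbackreference two]]";
        foreach (string part in Split(text))
        {
            // Show the line break as a visible escape sequence.
            Console.WriteLine("'" + part.Replace("\r\n", "\\r\\n") + "'");
        }
        // '[[backreference one '
        // '{value}'
        // '\r\nbackreference two]]'
    }
}
```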

Regular expression to replace square brackets with angle brackets

I have a string like:
[a b="c" d="e"]Some multi line text[/a]
Now the part d="e" is optional. I want to convert such type of string into:
<a b="c" d="e">Some multi line text</a>
The values of a b and d are constant, so I don't need to catch them. I just need the values of c, e and the text between the tags and create an equivalent xml based expression. So how to do that, because there is some optional part also.
For HTML tags, please use an HTML parser.
For [a][/a], you can do the following:
Match m = Regex.Match(@"[a b=""c"" d=""e""]Some multi line text[/a]",
    @"\[a b=""([^""]+)"" d=""([^""]+)""\](.*?)\[/a\]",
    RegexOptions.Singleline); // Singleline (not Multiline) lets . match newlines
m.Groups[1].Value
"c"
m.Groups[2].Value
"e"
m.Groups[3].Value
"Some multi line text"
Here is the Regex.Replace version (though I prefer it less):
string inputStr = @"[a b=""[[[[c]]]]"" d=""e[]""]Some multi line text[/a]";
string resultStr = Regex.Replace(inputStr,
    @"\[a( b=""[^""]+"")( d=""[^""]+"")?\](.*?)\[/a\]",
    @"<a$1$2>$3</a>",
    RegexOptions.Singleline); // Singleline (not Multiline) lets . match newlines
If you are actually thinking of processing (pseudo)-HTML using regexes,
don't
SO is filled with posts where regexes are proposed for HTML/XML and answers pointing out why this is a bad idea.
Suppose your multiline text ("which can be anything") contains
[a b="foo" [a b="bar"]]
a regex cannot detect this.
See the classic answer in:
RegEx match open tags except XHTML self-contained tags
which has:
I think it's time for me to quit the
post of Assistant Don't Parse HTML
With Regex Officer. No matter how many
times we say it, they won't stop
coming every day... every hour even.
It is a lost cause, which someone else
can fight for a bit. So go on, parse
HTML with regex, if you must. It's
only broken code, not life and death.
– bobince
Seriously. Find an XML or HTML DOM and populate it with your data. Then serialize it. That will take care of all the problems you don't even know you have got.
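As a rough illustration of the "populate a DOM, then serialize it" advice, here is a C# sketch using System.Xml.Linq. It assumes the bracket syntax is exactly as simple as in the question (no nesting), still uses a regex to pull out the values, and lets XElement take care of escaping in the output. The names BracketToXml and ToXml are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class BracketToXml
{
    // The d="..." attribute is optional; Singleline lets the text span lines.
    private static readonly Regex Tag = new Regex(
        @"\[a b=""([^""]+)""(?: d=""([^""]+)"")?\](.*?)\[/a\]",
        RegexOptions.Singleline);

    public static string ToXml(string input)
    {
        Match m = Tag.Match(input);
        if (!m.Success) return input; // hand back anything that does not fit

        // XElement ignores null content, so the optional attribute can be
        // passed as null when it is absent.
        var element = new XElement("a",
            new XAttribute("b", m.Groups[1].Value),
            m.Groups[2].Success ? new XAttribute("d", m.Groups[2].Value) : null,
            m.Groups[3].Value);
        return element.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(ToXml("[a b=\"c\" d=\"e\"]Some multi line text[/a]"));
        // <a b="c" d="e">Some multi line text</a>
        Console.WriteLine(ToXml("[a b=\"c\"]Fish & chips[/a]"));
        // <a b="c">Fish &amp; chips</a>
    }
}
```

The second call shows the point of serializing through a DOM: characters such as & in the text are escaped automatically, which a plain string replacement would miss.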
Would the multiline text include [ and ]? If not, you can just replace [ with < and ] with > using string.Replace; there is no need for regex.
Update:
If it can be anything but [/a], you can replace
^\[a([^\]]+)](.*?)\[/a]$
with
<a$1>$2</a>
I haven't escaped ] and / in the regex - escape them if necessary to get
^\[a([^\]]+)\](.*?)\[\/a\]$
