The C# 6 preview for Visual Studio 2013 supported a primary constructors feature that the team has decided will not make it into the final release. Unfortunately, my team implemented over 200 classes using primary constructors.
We're now looking for the most straightforward path to migrate our source. Since this is a one-time thing, a magical regex replacement string or hacky parser would work.
Before I spend a lot of time writing such a beast, is there anyone out there that's already done this or knows of a better way?
As I suggested in comments, you could use the version of Roslyn which does know about primary constructors to parse the code into a syntax tree, then modify that syntax tree to use a "normal" constructor instead. You'd need to put all the initializers that use primary constructor parameters into the new constructor too, mind you.
I suspect that writing that code would take me at least two or three hours, quite possibly more - whereas I could do the job manually for really quite a lot of classes in the same amount of time. Automation's great, but sometimes the quickest solution really is to do things by hand... even 200 classes may well be faster to do manually, and you could definitely parallelize the work across multiple people.
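For reference, the rewrite of each class is mechanical, whether it is done by hand or by a tool. A minimal sketch, using a hypothetical `Point` class that is not taken from the question's code base; the preview syntax appears only as a comment for comparison:

```csharp
// C# 6 preview syntax (primary constructor), shown for comparison only:
//
//   public class Point(int x, int y)
//   {
//       public int X { get; } = x;
//       public int Y { get; } = y;
//   }

// Equivalent class written with an ordinary constructor.
public class Point
{
    // The primary constructor parameters move to a normal constructor.
    public Point(int x, int y)
    {
        // Initializers that used the parameters become assignments here.
        X = x;
        Y = y;
    }

    // Getter-only auto-properties can still be assigned from a constructor.
    public int X { get; }
    public int Y { get; }
}
```

Initializers that do not reference a primary constructor parameter can stay where they are.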
A few answers: the first with a simple Regex find and replace which you need to repeat a few times:
Regex: A few lines of explanation then the actual regex string and replacement string:
a. In regex, first you match the full string of what you're looking for (in your case a primary constructor). Not hard to do: search for a curly bracket, the word public, then two words and an equals sign, etc. Each piece of text found this way is called a Match.
b. Sometimes there are possible repeated sequences in the text that you are looking for. (In your case: the properties are defined one per line.) For that, you simply mark the expected sequence as a Group by surrounding it with parentheses.
c. You then want to mark different parts of what you found, so you can use them or replace them in your corrected text. These parts are also called "Groups", or more precisely "Capture Groups". Again, simply surround the parts with parentheses.
In your case you'll be retaining the first captured group (the curly bracket) and the name of the property with its assignment to the parameter.
d. Here's the regex:
(\{\s*)(\w*\s*?=\s*?\w*\s*?;\s*?)*?(public\s*\w*\s*)(\w*)(\s*?{\s*?get;\s*?})(\s*?=\s*?\w*;\s*)
1. (
// ---- Capture1 -----
{
// code: \{\s*
// explained: curly bracket followed by possible whitespace
)
2. ( - Capture2 - previously corrected text
// - possible multiple lines of 'corrected' non-primary-constructors
// created during the find-replace process previously,
Propname = paramname; // word, equals-sign, word, semicolon
// code: \w*\s*?=\s*?\w*\s*?;\s*?
// explained: \w - any alphanumeric, \s - any whitespace
// * - zero or more times (greedy), *? - zero or more times (lazy)
)*?
// code: )*?
// explained: this group can be repeated zero or more times
// in other words it may not be found at all.
// These text lines are created during the recursive replacement process...
3. (
// ----Capture 3-----
// The first line of a primary constructor:
public type
// code: public\s*\w*\s*
// explained: the word 'public' and then another word (and [whitespace])
)
4. (
// ----- capture 4 -----
Propname
// code: \w*
// explained: any number of word characters (letters, digits, underscore)
)
5. (
// ---- capture 5 ----
{ get; }
// code: \s*?{\s*?get;\s*?\}
)
6. (
// ---- capture 6 ----
= propname;
// code: \s*?=\s*?\w*;\s*
// explained: by now you should get it.
)
The replacement string is
\1\2\4\6
This leaves:
{
[old corrected code]
[new corrected line]
possible remaining lines to be corrected.
Notepad++ 10 minutes trial-and-error. I guarantee it won't take you more than that.
Visual Studio 2014 refactoring, but:
a. You have to install it on a separate VM or PC. MS warns you not to install it side by side with your existing code.
b. I'm not sure the refactor works the other way. [Here's an article about it][1]
Visual Studio macros. I know I know, they're long gone, but there are at least two plugins that replace them and perhaps more. I read about them on this SO (StackOverflow) discussion. (They give a few other options) Here:
Visual Commander - Free open source Visual Studio macro runner add-on
VSScript - A Visual Studio add-on: costs $50 !!
Try Automatic Regexp by example: you give it several examples of code in which you highlight what IS the expected result, and then the same (or other) code in which you highlight what IS NOT the expected result. You then wait for it to run through the examples and give you some regex code.
// for the following code (from http://odetocode.com/blogs/scott/archive/2014/08/14/c-6-0-features-part-ii-primary-constructors.aspx )
public struct Money(string currency, decimal amount)
{
public string Currency { get; } = currency;
public decimal Amount { get; } = amount;
}
// I get something like: { ++\w\w[^r-u][^_]++|[^{]++(?={ \w++ =)
Play with the regexp on this great site: https://www.regex101.com/
// I first tried: \{\s*((public\s*\w*\s*)\w*(\s*?{\s*?get;\s*?})\s*?=\s*?\w*;\s*)*\}
The repeated sequence of the primary-constructor lines (the "repeated capture group") only captures the last one.
Use C# code with Regex captures (the Captures collection of a group), as explained in another Stack Overflow question (see the accepted answer).
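A minimal sketch of that approach follows. The pattern is a simplified, hypothetical one (it only handles bodies made up of `public <type> <Name> { get; } = <param>;` lines with single-word types), and the class and method names are made up for illustration. The point it shows is that `Group.Captures` keeps every repetition of a repeated group, while `Group.Value` keeps only the last one.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PrimaryCtorCaptures
{
    // Hypothetical pattern: a brace-delimited body consisting only of
    // "public <type> <Name> { get; } = <param>;" lines.
    private static readonly Regex Body = new Regex(
        @"\{\s*(?:public\s+\w+\s+(?<name>\w+)\s*\{\s*get;\s*\}\s*=\s*(?<value>\w+)\s*;\s*)*\}");

    // Returns one "Name = param;" assignment per property line in the body.
    public static List<string> ToAssignments(string source)
    {
        var result = new List<string>();
        Match match = Body.Match(source);
        if (!match.Success)
        {
            return result;
        }

        // Groups["name"].Value would hold only the last repetition;
        // Captures holds every repetition, in order.
        CaptureCollection names = match.Groups["name"].Captures;
        CaptureCollection values = match.Groups["value"].Captures;
        for (int i = 0; i < names.Count; i++)
        {
            result.Add(names[i].Value + " = " + values[i].Value + ";");
        }
        return result;
    }
}
```

The returned assignments are the lines that would go into the body of the new constructor.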
I have lots of code like below:
PlusEnvironment.EnumToBool(Row["block_friends"].ToString())
I need to convert them to something like this.
Row["block_friends"].ToString() == "1"
The value that gets passed to EnumToBool varies from call to call, meaning there is no guarantee that it'll come from a row; it could be a variable, or even a method that returns a string.
I've tried doing this with regex, but it's sort of sketchy and doesn't work 100%.
PlusEnvironment\.EnumToBool\((.*)\)
I need to do this in Visual Studio's find and replace. I'm using VS 17.
If you had only a few places where PlusEnvironment.EnumToBool() was called, I would have done the same thing that @IanMercer suggested: just replace PlusEnvironment.EnumToBool( with an empty string and then fix all the syntax errors.
@IanMercer has also given you a link to super cool, advanced regex usage that will help you.
But if you are skeptical about using such a complex regex on hundreds of files, here is what I would have done:
Define my own PlusEnvironment class with EnumToBool functionality in my own namespace. And then just replace the using Plus; line with using <my own namespace>; in those hundreds of files. That way my changes will be limited to only the using... line, 1 line per file, and it will be simple find and replace, no regex needed.
(Note: I'm assuming that you don't want to use PlusEnvironment, or the complete library and hence you want to do this type of replacement.)
in Find and Replace Window:
Find:
PlusEnvironment\.EnumToBool\((.*)\)
Replace:
$1 == "1"
Make sure "Use Regular Expressions" is selected
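One caveat with a greedy `(.*)` is that it runs to the last `)` on the line, so two calls on the same line, or a call nested inside another expression, can be captured incorrectly. The following is a sketch of a different technique, .NET balancing groups, applied from C# code instead of the Find and Replace window. The class name, method name, and group names are hypothetical, and it assumes the argument contains no unbalanced parentheses inside string literals.

```csharp
using System.Text.RegularExpressions;

public static class EnumToBoolRewriter
{
    // The "arg" group consumes text up to the parenthesis that closes the call:
    // every "(" pushes onto the "depth" stack, every ")" pops it, and
    // (?(depth)(?!)) rejects a match that leaves the stack non-empty.
    private static readonly Regex Call = new Regex(
        @"PlusEnvironment\.EnumToBool\((?<arg>(?>[^()]+|\((?<depth>)|\)(?<-depth>))*(?(depth)(?!)))\)");

    // Replaces each call with: <argument> == "1"
    public static string Rewrite(string code)
    {
        return Call.Replace(code, "${arg} == \"1\"");
    }
}
```

If an argument can be a compound expression (for example a concatenation), the replacement string may need to wrap `${arg}` in parentheses to keep operator precedence intact.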
I'm parsing a text file line by line, and for each line I have a special regex. However, in one case a pattern is matching two lines: one that is a correct match, and another line that matches only partially because a couple of values are optional.
Invalid match:
BNE1010/1000 HKG1955/2005 7/PLD/CLD/YLD
matches a partial string (it shouldn't match this at all):
BNE1010/1000
Correct match (matches the entire string):
RG878A/21AUG15 GIG/BOG 1/RG/AV 3/AV 4/AV 5/RG 6/AV081C/22 7/CDC/YD 9/TP
The regex for this is quite long and contains several optional groups:
^(?<FlightDesignator>([A-Z0-9]{2}[A-Z]?)([0-9]{3,4}))(?<OperationalSuffix>[A-Z])?(?<FlightIdentifierDate>\/(\d{2})([A-Z]{3})?(\d{2})?)?(\s(?<FlightLegsChangeIdentifier>(\/?[A-Z]{3})+)(?=(\s|$)))?(\s1(?<JointOperationAirlineDesignators>(\/.{2}[A-Z]?)+))?(\s3\/(?<AircraftOwner>([A-Z]{2}|.)))?(\s4\/(?<CockpitCrewEmployer>(.+?)(?=(?: \d\/|$))))?(\s5\/(?<CabinCrewEmployer>([A-Z]{2}|.)))?(?<OnwardFlight>\s6\/(([A-Z0-9]{2}[A-Z]?)([0-9]{3,4}))([A-Z])?(\/(\d{2})([A-Z]{3})?(\d{2})?)?)?(\s7\/(?<MealServiceNote>(\/?[A-Z]{0,3})+))?(\s9\/(?<OperatingAirlineDisclosure>(.{2}[A-Z]?)))?
I think there is no need to study the entire regex because it's built dynamically from smaller patterns at runtime and all the parts work correctly. Also, lots of combinations are covered by unit tests and they all work... as long as I try to parse only the line that should be matched by the pattern.
Currently I'm checking if the entire string is matched by
match.Groups[0].Value == line
but I find it quite ugly. I know that in JavaScript the regex engine provides an index property showing where the regex engine stopped. So my idea was to compare the index with the length of the string. Unfortunately, I wasn't able to find such a property in C#.
Another idea would be to modify the regex so that it matches only one line and no partial lines.
Example: https://regex101.com/r/dM5wU4/1
The example contains only two cases because there aren't actually any combinations that would change its behavior. I could remove some parameters but it wouldn't change anything.
EDIT:
I've edited my question. Sorry to everyone for not providing all the information the first time. I won't ask any more questions when writing on the phone :) It wasn't a good idea. Hopefully it won't get closed now.
You asked whether I could simplify the regex. I would do it if I could and knew how. If it was easy I wouldn't have asked. The problem started as the regex and string became bigger during development. Now they are at production length and I can't actually make them shorter even for the sake of the question, sorry.
EDIT-2:
I found the reason why I couldn't find the inherited Index and Length properties of the Match class.
For some strange reason, when selecting the Match class and pressing F1, Visual Studio opened the wrong help page (Match Properties for the Micro Framework) even though I'm not working with the Micro Framework. I didn't notice that, but I was indeed wondering why there was so little information. Thanks to @Jamiec for the correct link. I won't trust Visual Studio anymore when hitting F1.
Disclaimer: I'm going to add this, but I doubt it's the solution. If it's not, this part will get deleted in short order.
You can add a $ at the end of your regular expression. This stops your first example matching but continues to match the second example.
As you've not provided any more than 2 examples, it's unclear if this actually solves all your cases or just that one specific false positive.
My question is whether it is possible to check if a regular expression matched the entire string without checking the first group against the original line?
If you're not averse to comparing the length of the entire match to the length of the string, you can do that too:
var regex = new Regex(@"^(?<FlightDesignator>([A-Z0-9]{2}[A-Z]?)([0-9]{3,4}))(?<OperationalSuffix>[A-Z])?(?<FlightIdentifierDate>\/(\d{2})([A-Z]{3})?(\d{2})?)?(\s(?<FlightLegsChangeIdentifier>(\/?[A-Z]{3})+)(?=(\s|$)))?(\s1(?<JointOperationAirlineDesignators>(\/.{2}[A-Z]?)+))?(\s3\/(?<AircraftOwner>([A-Z]{2}|.)))?(\s4\/(?<CockpitCrewEmployer>(.+?)(?=(?: \d\/|$))))?(\s5\/(?<CabinCrewEmployer>([A-Z]{2}|.)))?(?<OnwardFlight>\s6\/(([A-Z0-9]{2}[A-Z]?)([0-9]{3,4}))([A-Z])?(\/(\d{2})([A-Z]{3})?(\d{2})?)?)?(\s7\/(?<MealServiceNote>(\/?[A-Z]{0,3})+))?(\s9\/(?<OperatingAirlineDisclosure>(.{2}[A-Z]?)))?");
var input1 = @"BNE1010/1000 HKG1955/2005 7/PLD/CLD/YLD";
var input2 = @"RG878A/21AUG15 GIG/BOG 1/RG/AV 3/AV 4/AV 5/RG 6/AV081C/22 7/CDC/YD 9/TP";
var match1 = regex.Match(input1);
var match2 = regex.Match(input2);
Console.WriteLine(match1.Length == input1.Length); // False
Console.WriteLine(match2.Length == input2.Length); // True
Live example: http://rextester.com/NIBE6349
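Both options can be wrapped up as follows. This is a sketch with a hypothetical, much shorter pattern than the one in the question, and the class and member names are made up. It also uses `\z` for the anchored variant, because in .NET `$` still matches just before a trailing newline, whereas `\z` matches only at the very end of the input.

```csharp
using System.Text.RegularExpressions;

public static class FullMatch
{
    // Hypothetical stand-in pattern: three letters, four digits,
    // and an optional "/" followed by four more digits.
    public static readonly Regex Sample =
        new Regex(@"^[A-Z]{3}\d{4}(/\d{4})?");

    // Same pattern, anchored at the end of the input.
    public static readonly Regex SampleAnchored =
        new Regex(@"^[A-Z]{3}\d{4}(/\d{4})?\z");

    // True only when the match starts at index 0 and spans the whole input.
    public static bool Covers(Match match, string input)
    {
        return match.Success && match.Index == 0 && match.Length == input.Length;
    }
}
```

With the anchored pattern a partial line simply fails to match, so no length comparison is needed afterwards.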
I have to create a function GetSourceCodeOfClass("ClassName", FilePath). This function will be called more than 10,000 times to get source code from C# files, and from every source file I have to extract the source code of a complete class, i.e.
" Class someName { everything in the body, including the signature } "
Now this is simple if a file contains a single class, but many source files will contain more than one class. Furthermore, the bigger problem is that there may be nested classes inside a single class.
I want the following:
I want to extract the complete source of a given class.
If the file contains more than one class, I want to extract only the source code of the specified class.
If the file contains more than one class and my specified class has nested classes in it, I want to capture my class's source as well as all nested classes.
I have an algorithm in mind:
1-open file
2-match regex (C# classes signature ) - parameterized
@"(public|private|internal|protected|inline)?[\t ]*(static)?[\t ]class[\t ]" + sOurClassName + @"(([\t ][:][\t ]([a-zA-z]+(([ ])[,]([ ])\w+))+))?\s[\n\r\t\s]?{"
3- If Regex is matched in the source file
4 Start copying at that point until the same regex is matched again but without parameters
regex is:
@" (public|private|internal|protected)?[\t ]*(static)?[\t ]class[\t ]\w+(([\t ][:][\t ]([a-zA-z]+(([ ])[,]([ ])\w+))+))?\s[\n\r\t\s]?{"
(This is where I have no clue and I am stuck. I want to copy everything from the first match up to the second match, or from the first match to the end of the file.)
Copying nested classes is still an issue and I am still thinking about it. If someone has an idea, they can help me with this too.
Note: match.Groups[0] or match.Groups[1] will only copy the signature, but I want the complete source of the class; that's why I am doing it this way.
BTW, I am using C#.
I agree with Nathan's sentiment that you would be better using an existing C#-aware parser. Trying to write a regex for the task is a lot of work, and you are unlikely to get it right on the first try. It may work on your first example code, or even the first few, but eventually you'll find some code that's slightly different than what you expected and the regex will fail to catch something important.
That said, if you are comfortable with that limitation and risk, the general technique you are asking about (if I understand correctly…the question isn't entirely clear) is common enough, and worth understanding if you expect to use regex a lot. The key points to understand are that with a Match object, you can call the NextMatch() method to obtain the next match in the text, and that when calling the Regex.Match() method, you can pass the start (and optionally the length) of the substring you want to check, and it will limit its processing to that substring.
You can use the latter point to switch from one regex to another mid-parse.
In your scenario, I understand it to be that you want to run a regex containing the specific class name, to find that particular class in the file, and then to search the text after the initial match for any subsequent class in the file. If the second search finds something, you want to only return the text from the start of the first match to the start of the second match. If the second search finds nothing, you want to return the text from the start of the first match to the end of the whole file.
If that's correct, then something like this should work:
string ExtractClass(string fileContents, Regex classRegex, Regex nonClassRegex)
{
    Match match1 = classRegex.Match(fileContents);
    if (!match1.Success)
    {
        return null;
    }
    Match match2 = nonClassRegex.Match(fileContents, match1.Index + match1.Length);
    if (!match2.Success)
    {
        return fileContents.Substring(match1.Index);
    }
    return fileContents.Substring(match1.Index, match2.Index - match1.Index);
}
I should note that between two class declarations, or between the end of a lone class declaration and the actual end of the file there can easily be other non-white-space text that isn't part of the class declaration. I assume you have a plan for dealing with that.
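One way to avoid picking up that trailing text, and to keep nested classes together with their outer class, is to count braces from the signature match instead of searching for the next class signature. This is a different technique from the two-regex approach above, sketched here with a hypothetical helper name. It is deliberately naive: it assumes braces are balanced and does not skip braces inside string literals, character literals, or comments.

```csharp
using System.Text.RegularExpressions;

public static class ClassBodyExtractor
{
    // Returns the text from the start of the signature match through the brace
    // that closes the class body, or null if no balanced body is found.
    public static string Extract(string fileContents, Regex classRegex)
    {
        Match signature = classRegex.Match(fileContents);
        if (!signature.Success)
        {
            return null;
        }

        // First opening brace at or after the signature starts the class body.
        int open = fileContents.IndexOf('{', signature.Index);
        if (open < 0)
        {
            return null;
        }

        int depth = 0;
        for (int i = open; i < fileContents.Length; i++)
        {
            char c = fileContents[i];
            if (c == '{')
            {
                depth++;
            }
            else if (c == '}')
            {
                depth--;
                if (depth == 0)
                {
                    // Nested classes are included because their braces
                    // were counted on the way through.
                    return fileContents.Substring(signature.Index, i - signature.Index + 1);
                }
            }
        }
        return null; // unbalanced braces
    }
}
```

A C#-aware parser remains the more robust choice, for the reasons given at the top of this answer.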
If the above doesn't address your need, you should examine your question closely, and edit it both for length and clarity.
I am attempting to parse with regex a series of lines of pseudo-assembly code that have the following format:
optional_label required_instruction optional_parameter, optional_parameter
An actual example looks a bit more like:
PRINTLOOP MOV R6, R7
CMP R6, R9
TRP 1
BLK
Where MOV,CMP,BLK and BRZ are instructions.
Whitespace between tokens can be any number of spaces or tabs, labels must start at the beginning of a line while instructions can either start at the beginning or have any amount of leading spaces or tabs.
I need to get at each bit of it individually so it is important that the regex groups it properly. I am currently trying to use this pattern:
((?<label>[\w]*)[ |\t]+)?(?<operator>[\w]+)[ |\t]+(?<operand1>[\w]+)?(,[ |\t]*(?<openparen>\()?(?<operand2>[-]*[\w]+)(?<closeparen>\))?)?
This pattern has worked fine until now because there was always at least one parameter, but now I have zero parameter instructions which don't fit in nicely to this. I tried to tweak the pattern to be the following:
((?<label>[\w]*)[ |\t]+)?(?<operator>[\w]+)([ |\t]+(?<operand1>[\w]+))?(,[ |\t]*(?<openparen>\()?(?<operand2>[-]*[\w]+)(?<closeparen>\))?)?
So that the space after the instruction(operator) isn't mandatory but I found that this made things ambiguous enough that the instruction is perceived to be the label in many instructions. For example:
LDB R0, lM
Is understood as label: LDB, Instruction: R0 and neither operand is recognized.
Is there a way to force the operator section to be checked first (so that that part of the string is prioritized), are there resources that will explain where I am going wrong in all this, or is there a regex pattern that will do what I am looking for?
Your problem cannot be solved even in theory, because your grammar is ambiguous: when you are looking at
INC R6
your grammar can parse it in the two ways below:
label=INC, Instruction=R6
or
Instruction=INC, Parameter1=R6
Assembly languages that I've worked with and/or implemented solve this problem by requiring a colon after the optional label, like this:
[label:] instruction [parameter] [, optional_parameter]
This would give your regex an additional "anchor" (i.e. the colon :) by which to tell the label+instruction vs. instruction+parameter situation.
Another alternative is to introduce "keywords" for the instructions, and to prohibit the use of these keywords as labels. This would let you avoid introducing a colon, but would make a regex-based solution impractical.
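A minimal sketch of the colon-based grammar as a .NET regex follows. The class name and the exact pattern are assumptions for illustration; the group names follow the question's pattern, and the parenthesized second operand from the question is left out to keep the sketch short.

```csharp
using System.Text.RegularExpressions;

public static class AsmLine
{
    // [label:] instruction [operand1] [, operand2]
    // The colon makes the label unambiguous, and the ^ and $ anchors force
    // the whole line to be accounted for.
    private static readonly Regex Line = new Regex(
        @"^(?:(?<label>\w+):)?[ \t]*(?<operator>\w+)" +
        @"(?:[ \t]+(?<operand1>\w+)(?:[ \t]*,[ \t]*(?<operand2>-?\w+))?)?[ \t]*$");

    // Groups that did not take part in the match have Success == false
    // and an empty Value.
    public static Match Parse(string line)
    {
        return Line.Match(line);
    }
}
```

With this pattern a line such as `LDB R0, lM` can only parse as instruction plus two operands, because `LDB` is not followed by a colon.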
I'm kind of new to C#, and to regular expressions for that matter, but I've searched for a couple of hours to find a solution to this problem, so hopefully this is easy for you guys :)
My application uses a regex to match email addresses in a given string,
then loops through the matches:
String EmailPattern = "\\w+([-+.]\\w+)*@\\w+([-.]\\w+)*\\.\\w+([-.]\\w+)*";
MatchCollection mcemail = Regex.Matches(rawHTML, EmailPattern);
foreach (Match memail in mcemail)
Works fine, but when I downloaded the string from a certain page, http://www.sp.se/sv/index/services/quality/sidor/default.aspx, the MatchCollection (mcemail) object "hangs" the loop. When using a breakpoint and accessing the object, I get "Function evaluation timed out" on everything (.Count etc.).
Update
I've tried my pattern and other email patterns on the same string; everything (regex designers, Python-based web pages, etc.) fails or times out when trying to match this particular string.
How can I detect that the matchcollection obj is not "ready" to use?
If you can post the email that's causing the problem (perhaps anonymized in some way), that will give us more information, but I'm thinking the problem is this little guy right here:
([-.]\\w+)*\\.\\w+([-.]\\w+)*
To understand the problem, let's break that into groups:
([-.]\\w+)*
\\.\\w+
([-.]\\w+)*
The strings that will match \\.\\w+ are a subset of those that will match [-.]\\w+. So if part of your input looks like foo.bar.baz.blah.yadda.com, your regex engine has no way of knowing which group is supposed to match it. Does that make sense? So the first ([-.]\\w+)* could match .bar.baz.blah, then the \\.\\w+ could match .yadda, then the last ([-.]\\w+)* could match .com...
...OR the first clause could match .bar.baz, the second could match .blah, and the last could match .yadda.com. Since it doesn't know which one is right, it will keep trying different combinations. It should stop eventually, but that could still take a long time. This is called "catastrophic backtracking".
This issue is compounded by the fact that you're using capturing groups rather than non-capturing groups; i.e. ([-+.]\\w+) instead of (?:[-+.]\\w+). That causes the engine to try and separate and save whatever matches inside the parentheses for later reference. But as I explained above, it's ambiguous which group each substring belongs in.
You might consider replacing everything after the @ with something like this:
\\w[-\\w]*\\.[-.\\w]+
That could use some refinement to make it more specific, but you get the general idea. Hope I explained all this well enough; grouping and backreferences are kind of tough to describe.
EDIT:
Looking back at your pattern, there's a deeper issue here, still related to the backtracking/ambiguity problem I mentioned. The clause \\w+([-.]\\w+)* is ambiguous all by itself. Splitting it into parts, we have:
\\w+
([-.]\\w+)*
Suppose you have a string like foobar. Where does the \\w+ end and the ([-.]\\w+)* begin? How many repetitions of ([-.]\\w+) are there? Any of the following could work as matches:
f(oobar)
foo(bar)
f(o)(oba)(r)
f(o)(o)(b)(a)(r)
foobar
etc...
The regex engine doesn't know which is important, so it will try them all. This is the same problem I pointed out above, but it means you have it in multiple places in your pattern.
Even worse, ([-.]\\w+)* is also ambiguous, because of the + after the \\w. How many groups are there in blah? I count 16 possible combinations: (blah), (b)(lah), (bl)(ah)...
The amount of different possible combinations is going to be huge, even for a relatively small input, so your engine is going to be in overdrive. I would definitely simplify it if I were you.
I just did a local test and it appears either the sheer document size or something in the ViewState causes the Regex match evaluation to time out. (Edit: I'm pretty sure it's the size, actually. Removing the ViewState just reduces the size significantly.)
An admittedly crude way to solve this would be something like this:
string[] rawHtmlLines = File.ReadAllLines(@"C:\default.aspx");
string filteredHtml = String.Join(Environment.NewLine,
    rawHtmlLines.Where(line => !line.Contains("_VIEWSTATE")).ToArray());
string emailPattern = @"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*";
var emailMatches = Regex.Matches(filteredHtml, emailPattern);
foreach (Match match in emailMatches)
{
    //...
}
Overall I suspect the email pattern is just not well optimised (or intended) to filter out emails in a large string but just used as validation for user input. Generally it might be a good idea to limit the string you search in to just the parts you are actually interested in and keep it as small as possible - for example by leaving out the ViewState which is guaranteed to not contain any readable email addresses.
If performance is important, it's probably also a better idea to create the filtered HTML using a StringBuilder and IndexOf (etc.) instead of splitting lines and LINQing up the result :)
Edit:
To further minimize the length of the string the Regex needs to check, you could include only lines that contain the @ character to begin with, like so:
string filteredHtml = String.Join(Environment.NewLine,
    rawHtmlLines.Where(line => line.IndexOf('@') >= 0 && !line.Contains("_VIEWSTATE")).ToArray());
From "Function evaluation timed out", I'm assuming you're doing this in the debugger. The debugger has some fairly short timeouts with regard to how long a method may take. Not everything happens quickly. I would suggest doing the operation in code, storing the result, then viewing that result in the debugger (i.e. let the call to Matches run and put a breakpoint after it).
Now, with regard to detecting whether the string will make Matches take a long time: that's a bit of a black art. You basically have to perform some sort of input validation. Just because you got some value from the internet doesn't mean that value will work well with Matches. The ultimate validation logic is up to you, but starting with the length of rawHtmlLines might be useful (i.e. if the length is 1000000 bytes, Matches might take a while). But you have to decide what to do if the length is too long, e.g. give an error to the user.
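Besides validating the input up front, the match itself can be given a time budget. This is a sketch that assumes .NET Framework 4.5 or later, where the `Regex` constructor accepts a match timeout and throws `RegexMatchTimeoutException` when it is exceeded. The class and method names are hypothetical, and the pattern is the one from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class EmailScanner
{
    private const string EmailPattern =
        @"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*";

    // Returns the matched addresses, or null if matching exceeded the timeout.
    public static List<string> FindEmails(string text, TimeSpan timeout)
    {
        var regex = new Regex(EmailPattern, RegexOptions.None, timeout);
        var found = new List<string>();
        try
        {
            // Matches() is evaluated lazily, so a timeout surfaces while
            // enumerating the collection, not when Matches() returns.
            foreach (Match m in regex.Matches(text))
            {
                found.Add(m.Value);
            }
        }
        catch (RegexMatchTimeoutException)
        {
            return null;
        }
        return found;
    }
}
```

A caller can treat a null result as "input too expensive to scan" and report an error to the user, as suggested above.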