I am in a dilemma over whether to use Regex.Replace or Regex.Matches when some logic has to run on each match to generate the replacement value.
Scenario: read a file (which can vary in size) and then use a regular expression to replace the matches. The replacement value for each match is different and is generated by some logic.
Approach 1: Read the complete file, find all the matches, and then loop over them with foreach or for and replace them one by one.
Approach 2: Read the complete file, then use the Regex.Replace method with a MatchEvaluator, where the MatchEvaluator function performs the logic and returns the replacement value.
There is an article I would like to link here, which gives me the feeling that I should not use Regex.Replace. Link: https://blogs.msdn.microsoft.com/debuggingtoolbox/2008/04/02/comparing-regex-replace-string-replace-and-stringbuilder-replace-which-has-better-performance/
Approach 1:
This reads the entire file, so watch the memory consumption.
A foreach loop over a large amount of data is more time-consuming.
Approach 2:
This also reads the entire file.
I am pretty sure the MatchEvaluator takes more time.
Approach 3:
Read the file line by line (MSDN link).
Use string.Replace(), as tested in the article you linked.
Append each result to the result file as you go.
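A minimal sketch of how Approach 3 could be combined with a MatchEvaluator when every match needs its own computed replacement. The `{{name}}` token syntax, the `TokenReplacer` class and the `resolve` callback are assumptions made for illustration, and the sketch only works if a match never spans more than one line.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

static class TokenReplacer
{
    // Hypothetical token syntax: {{name}}. Group 1 captures the token name.
    private static readonly Regex Token =
        new Regex(@"\{\{(\w+)\}\}", RegexOptions.Compiled);

    // Streams the input line by line so that the whole file is never held in memory.
    // 'resolve' stands in for the per-match logic described in the question.
    public static void ReplaceTokens(TextReader input, TextWriter output,
                                     Func<string, string> resolve)
    {
        string line;
        while ((line = input.ReadLine()) != null)
        {
            // The lambda is converted to a MatchEvaluator and runs once per match.
            output.WriteLine(Token.Replace(line, m => resolve(m.Groups[1].Value)));
        }
    }
}
```

For files, `File.OpenText(inputPath)` and `new StreamWriter(outputPath)` can be passed as the reader and writer. Note that WriteLine normalizes line endings to the writer's NewLine value.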
I posted a question earlier about the same kind of regex problem. It has given me headaches; I have looked up loads of documentation on how to use regex, but I still could not put my finger on it. I would not want to waste another 6 hours trying to filter (I think) simple expressions.
So basically what I want to do is filter all file types with HTML extensions (the '*' stars come from a WinForms TabControl and signify that the file has been modified). I also need the match to ignore case:
.html, .htm, .shtml, .shtm, .xhtml
.html*, .htm*, .shtml*, .shtm*, .xhtml*
Also filtering some CSS files:
.css
.css*
And some SQL Files:
.sql, .ddl, .dml
.sql*, .ddl*, .dml*
My previous question got an answer to filtering Python files:
.py, .py3, .pyi, .pyx, .pyw
Expression would be: \.py[3ixw]?\*?$
But when I tried to learn from the expression above, I always ended up with a pattern that opens .xhtml files only; the rest are not valid.
For the HTML expression, I currently have this: \.html|.html|.shtml|.shtm|.xhtml\*?$ with RegexOptions.IgnoreCase. But it only accepts .xhtml, whether case sensitive or insensitive. .html files, .htm and the rest do not match. I would really appreciate an explanation of each expression you provide (so I don't have to ask the same question ever again).
Thank you.
For such cases you can start with a simple regex and simplify it step by step into a good regular expression.
In C#, with IgnoreCase, this would basically be
Regex myRegex = new Regex("PATTERN", RegexOptions.IgnoreCase);
Now the pattern. The easiest one simply concatenates all valid endings with OR, escaping where necessary:
\.html|\.htm|\.shtml|\.shtm|\.xhtml|\.html*|\.htm*|\.shtml*|\.shtm*|\.xhtml*
I take .html* to mean .html followed by anything, which is written in regex as .* (any character, zero or more times):
\.html|\.htm|\.shtml|\.shtm|\.xhtml|\.html.*|\.htm.*|\.shtml.*|\.shtm.*|\.xhtml.*
Then you can take all repeating patterns and group them together. All file endings start with a dot and may have an optional tail, and since ending.* also matches the bare ending, the plain alternatives are redundant:
\.(html|htm|shtml|shtm|xhtml).*
Then, I see htm pretty often, so I try to extract that, taking all possible characters before and after htm together (? means 0 or 1 occurrences):
\.(s|x)?(htm)l?.*
And I always check that it is still working in regexstorm for .NET.
In the same way, you can build the regular expressions for the other two file groups and concatenate them all at the end.
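A hedged variant, assuming the trailing star is a literal character (the "modified" marker the question describes) and that the extension must sit at the very end of the name. The class and method names are made up for illustration. It is also worth knowing that | has the lowest precedence in a regex, so in a pattern such as a|b|c\*?$ the \*?$ part belongs only to the last alternative unless the alternatives are grouped.

```csharp
using System.Text.RegularExpressions;

static class HtmlExtensionFilter
{
    // \.        a literal dot
    // (?:s|x)?  an optional "s" or "x" (covers .shtml, .shtm, .xhtml)
    // html?     "htm" followed by an optional "l"
    // \*?       an optional literal star (the modified marker)
    // $         end of the name, so ".html5" or ".htmlx" is rejected
    private static readonly Regex HtmlFile =
        new Regex(@"\.(?:s|x)?html?\*?$", RegexOptions.IgnoreCase);

    public static bool IsHtmlFile(string tabTitle)
    {
        return HtmlFile.IsMatch(tabTitle);
    }
}
```

The CSS and SQL groups could follow the same shape, for example \.css\*?$ and \.(?:sql|ddl|dml)\*?$.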
I am trying to find a regular expression to parse two sections out of the file name for the .resx files in my project. There is one main file called "UiText.resx" and then many translation .resx files with convention "UiText.ja-JP.resx". I need both the "UiText" and the "ja-JP" out of the latter string, as we do have other resx files that don't have to be for UiText (e.g. I have some files named "ExceptionText.resx").
The pattern I'm using right now (which works, it just requires a little extra coding after) is "(?<=\.)((.*?)(?=\.resx))". For the example above, "UiText.ja-JP.resx" gets me a match set in C# of "UiText.", "ja-JP.", "ja-JP.", ".resx"
Of course I am able to just take the first occurrence of "ja-JP." and "UiText." from this set and massage it to what I want, but I'd rather just have a cleaner "UiText" "ja-JP" and be done with it.
I figure I'll probably have to have at least two different patterns for this, so that is OK. Thank you in advance!
Since UiText seems to be constant, you can use this regex to extract just ja-JP into $1:
^UiText\.(.+?)\.resx$
https://regex101.com/r/XKvwHA/1/
If I'm understanding your needs correctly, then the main reason you need "UiText" is not because you have any value for the term itself, but rather because you need to filter your files. The real term you need to play around with is "ja-JP", which changes for the files you need.
If I'm correct, try this regex:
(?<=UiText\.).+(?=\.resx)
Used in C# as follows:
var fileName = "UiText.ja-JP.resx";
var result = new Regex(@"(?<=^UiText\.).+(?=\.resx$)").Match(fileName).Value;
A little explanation:
(?<=^UiText\.) Start of string must begin exactly with "UiText."
.+ Any number of characters (but at least one)
(?=\.resx$) End of string must end with ".resx"
Any file that doesn't meet your criteria will return an empty string for 'result'.
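If both parts are wanted from a single pattern, named groups are one option. This is only a sketch: the group names `name` and `culture` and the `ResxName` helper are invented here, and it assumes that neither the base name nor the culture contains a dot.

```csharp
using System.Text.RegularExpressions;

static class ResxName
{
    // ^(?<name>[^.]+)          base name up to the first dot, e.g. "UiText"
    // (?:\.(?<culture>[^.]+))? optional culture segment, e.g. "ja-JP"
    // \.resx$                  the extension, at the end of the name
    private static readonly Regex Pattern =
        new Regex(@"^(?<name>[^.]+)(?:\.(?<culture>[^.]+))?\.resx$");

    // Returns false when the file name does not follow the convention.
    // 'culture' is an empty string for the main file (e.g. "UiText.resx").
    public static bool TryParse(string fileName, out string name, out string culture)
    {
        Match m = Pattern.Match(fileName);
        name = m.Groups["name"].Value;
        culture = m.Groups["culture"].Value;
        return m.Success;
    }
}
```

Calling `ResxName.TryParse("UiText.ja-JP.resx", out var name, out var culture)` yields "UiText" and "ja-JP"; the same call on "ExceptionText.resx" yields "ExceptionText" and an empty culture.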
I'm parsing a text file line by line, and for each kind of line I have a special regex. However, in one case a pattern matches two lines: one is a correct match, and the other is matched only partially because a couple of values are optional.
Invalid match:
BNE1010/1000 HKG1955/2005 7/PLD/CLD/YLD
matches a partial string (shouldn't match this at all):
BNE1010/1000
Correct match (matches the entire string):
RG878A/21AUG15 GIG/BOG 1/RG/AV 3/AV 4/AV 5/RG 6/AV081C/22 7/CDC/YD 9/TP
The regex for this is quite long and contains several optional groups:
^(?<FlightDesignator>([A-Z0-9]{2}[A-Z]?)([0-9]{3,4}))(?<OperationalSuffix>[A-Z])?(?<FlightIdentifierDate>\/(\d{2})([A-Z]{3})?(\d{2})?)?(\s(?<FlightLegsChangeIdentifier>(\/?[A-Z]{3})+)(?=(\s|$)))?(\s1(?<JointOperationAirlineDesignators>(\/.{2}[A-Z]?)+))?(\s3\/(?<AircraftOwner>([A-Z]{2}|.)))?(\s4\/(?<CockpitCrewEmployer>(.+?)(?=(?: \d\/|$))))?(\s5\/(?<CabinCrewEmployer>([A-Z]{2}|.)))?(?<OnwardFlight>\s6\/(([A-Z0-9]{2}[A-Z]?)([0-9]{3,4}))([A-Z])?(\/(\d{2})([A-Z]{3})?(\d{2})?)?)?(\s7\/(?<MealServiceNote>(\/?[A-Z]{0,3})+))?(\s9\/(?<OperatingAirlineDisclosure>(.{2}[A-Z]?)))?
I think there is no need to study the entire regex because it's built dynamically from smaller patterns at runtime and all the parts work correctly. Also, lots of combinations are covered by unit tests and they all work... as long as I try to parse only the line that should be matched by the pattern.
Currently I'm checking if the entire string is matched by
match.Groups[0].Value == line
but I find it quite ugly. I know that in JavaScript the regex engine provides an index property telling where it stopped. So my idea was to compare that index with the length of the string. Unfortunately I wasn't able to find such a property in C#.
Another idea would be to modify the regex so that it matches only one line and no partial lines.
Example: https://regex101.com/r/dM5wU4/1
The example contains only two cases because there aren't actually any combinations that would change its behavior. I could remove some parameters but it wouldn't change anything.
EDIT:
I've edited my question. Sorry to everyone for not providing all the information the first time. I won't ask any more questions when writing on the phone :) It wasn't a good idea. Hopefully it won't get closed now.
You asked whether I could simplify the regex. I would do it if I could and knew how. If it was easy I wouldn't have asked. The problem started as the regex and string became bigger during development. Now they are at production length and I can't actually make them shorter even for the sake of the question, sorry.
EDIT-2:
I found the reason why I couldn't find the inherited Index and Length properties of the Match class.
For some strange reason, when I selected the Match class and pressed F1, Visual Studio opened the wrong help page (Match Properties for the Micro Framework) even though I'm not working with the Micro Framework. I didn't notice that, but I was indeed wondering why there was so little information. Thanks to @Jamiec for the correct link. I won't trust Visual Studio anymore when hitting F1.
Disclaimer: I'm going to add this, but I doubt it's the solution. If it's not, this part will get deleted in short order.
You can add a $ at the end of your regular expression. This stops your first example from matching but continues to match the second example.
As you've not provided any more than 2 examples, it's unclear whether this actually solves all your cases or just that one specific false positive.
My question is whether it is possible to check if a regular expression matched the entire string without checking the first group against the original line?
If you're not averse to comparing the length of the entire match with the length of the string, you can do that too:
var regex = new Regex(@"^(?<FlightDesignator>([A-Z0-9]{2}[A-Z]?)([0-9]{3,4}))(?<OperationalSuffix>[A-Z])?(?<FlightIdentifierDate>\/(\d{2})([A-Z]{3})?(\d{2})?)?(\s(?<FlightLegsChangeIdentifier>(\/?[A-Z]{3})+)(?=(\s|$)))?(\s1(?<JointOperationAirlineDesignators>(\/.{2}[A-Z]?)+))?(\s3\/(?<AircraftOwner>([A-Z]{2}|.)))?(\s4\/(?<CockpitCrewEmployer>(.+?)(?=(?: \d\/|$))))?(\s5\/(?<CabinCrewEmployer>([A-Z]{2}|.)))?(?<OnwardFlight>\s6\/(([A-Z0-9]{2}[A-Z]?)([0-9]{3,4}))([A-Z])?(\/(\d{2})([A-Z]{3})?(\d{2})?)?)?(\s7\/(?<MealServiceNote>(\/?[A-Z]{0,3})+))?(\s9\/(?<OperatingAirlineDisclosure>(.{2}[A-Z]?)))?");
var input1 = @"BNE1010/1000 HKG1955/2005 7/PLD/CLD/YLD";
var input2 = @"RG878A/21AUG15 GIG/BOG 1/RG/AV 3/AV 4/AV 5/RG 6/AV081C/22 7/CDC/YD 9/TP";
var match1 = regex.Match(input1);
var match2 = regex.Match(input2);
Console.WriteLine(match1.Length == input1.Length); // False
Console.WriteLine(match2.Length == input2.Length); // True
Live example: http://rextester.com/NIBE6349
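The length comparison can be wrapped in a small helper that also checks Success and Index, so that a failed match or a match starting mid-string is not mistaken for a full match. This is a sketch only; the `FullMatch` class is a hypothetical name, not part of the framework.

```csharp
using System.Text.RegularExpressions;

static class FullMatch
{
    // True only if the first match found by the regex covers the whole input.
    public static bool IsFullMatch(Regex regex, string input)
    {
        Match m = regex.Match(input);
        return m.Success && m.Index == 0 && m.Length == input.Length;
    }
}
```

One caveat: this inspects only the first match the engine returns. With a pattern such as a|ab and the input "ab", the engine returns "a", so the helper reports false even though ^(?:a|ab)$ would succeed. Anchoring the pattern with $ lets the engine backtrack to find a full match, which makes it the more robust of the two options.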
I have to create a function GetSourceCodeOfClass("ClassName", FilePath). This function will be called more than 10,000 times to get source code from C# files, and from every source file I have to extract the source code of a complete class, i.e.
"class someName { everything in the body, including the signature }"
Now this is simple if a single file contains a single class, but there will be many source files that contain more than two classes. Furthermore, the bigger problem is that there may be nested classes inside a single class.
I want the following:
I want to extract the complete source of a given class.
If the file contains more than two classes, I want to extract only the source code of the specified class.
If the file contains more than one class and my specified class has nested classes in it, I want to capture my class's source as well as all the nested classes.
I have an algorithm in mind:
1- Open the file.
2- Match a regex (the C# class signature), parameterized with the class name:
@"(public|private|internal|protected|inline)?[\t ]*(static)?[\t ]class[\t ]" + sOurClassName + @"(([\t ][:][\t ]([a-zA-z]+(([ ])[,]([ ])\w+))+))?\s[\n\r\t\s]?{"
3- If the regex matches in the source file,
4- start copying at that point until the same regex matches again, but without the class-name parameter.
That regex is:
@" (public|private|internal|protected)?[\t ]*(static)?[\t ]class[\t ]\w+(([\t ][:][\t ]([a-zA-z]+(([ ])[,]([ ])\w+))+))?\s[\n\r\t\s]?{"
(This is where I have no clue and I am stuck. I want to copy everything from the first match to the second match, or from the first match to the end of the file.)
Copying nested classes is still an issue and I am still thinking about it; if someone has an idea, help with this would be welcome too.
Note: match.Groups[0] or match.Groups[1] will only copy the signature, but I want the complete source of the class; that's why I am doing it this way.
BTW, I am using C#.
I agree with Nathan's sentiment that you would be better using an existing C#-aware parser. Trying to write a regex for the task is a lot of work, and you are unlikely to get it right on the first try. It may work on your first example code, or even the first few, but eventually you'll find some code that's slightly different than what you expected and the regex will fail to catch something important.
That said, if you are comfortable with that limitation and risk, the general technique you are asking about (if I understand correctly; the question isn't entirely clear) is common enough, and worth understanding if you expect to use regex a lot. The key points to understand are that with a Match object, you can call the NextMatch() method to obtain the next match in the text, and that when calling the Regex.Match() method, you can pass the start (and optionally the length) of the substring you want to check, and it will limit its processing to that substring.
You can use the latter point to switch from one regex to another mid-parse.
In your scenario, I understand it to be that you want to run a regex containing the specific class name, to find that particular class in the file, and then to search the text after the initial match for any subsequent class in the file. If the second search finds something, you want to only return the text from the start of the first match to the start of the second match. If the second search finds nothing, you want to return the text from the start of the first match to the end of the whole file.
If that's correct, then something like this should work:
string ExtractClass(string fileContents, Regex classRegex, Regex nonClassRegex)
{
    // Find the declaration of the requested class.
    Match match1 = classRegex.Match(fileContents);
    if (!match1.Success)
    {
        return null;
    }
    // Search for the next class declaration, starting just past the first match.
    Match match2 = nonClassRegex.Match(fileContents, match1.Index + match1.Length);
    if (!match2.Success)
    {
        // No later class: take everything to the end of the file.
        return fileContents.Substring(match1.Index);
    }
    return fileContents.Substring(match1.Index, match2.Index - match1.Index);
}
I should note that between two class declarations, or between the end of a lone class declaration and the actual end of the file there can easily be other non-white-space text that isn't part of the class declaration. I assume you have a plan for dealing with that.
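For the nested-class case raised in the question, one possible alternative is to count braces from the matched signature until the class body closes. This is a rough sketch rather than a parser: `ExtractWithBraces` is a made-up helper, and it assumes that no brace appears inside a string literal, character literal or comment, which real code will eventually violate.

```csharp
using System.Text.RegularExpressions;

static class ClassExtractor
{
    // Returns the text from the start of the signature match through the brace
    // that closes the class body, or null if the class or its end is not found.
    public static string ExtractWithBraces(string fileContents, Regex classRegex)
    {
        Match match = classRegex.Match(fileContents);
        if (!match.Success)
        {
            return null;
        }

        // First opening brace at or after the start of the signature.
        int open = fileContents.IndexOf('{', match.Index);
        if (open < 0)
        {
            return null;
        }

        int depth = 0;
        for (int i = open; i < fileContents.Length; i++)
        {
            char c = fileContents[i];
            if (c == '{')
            {
                depth++;
            }
            else if (c == '}' && --depth == 0)
            {
                // Depth back to zero: this brace closes the class body.
                return fileContents.Substring(match.Index, i - match.Index + 1);
            }
        }
        return null; // unbalanced braces
    }
}
```

Because nested classes only increase the depth, they are carried along with the outer class, and anything that follows the closing brace is left out.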
If the above doesn't address your need, you should examine your question closely, and edit it both for length and clarity.
I have a simple ASCII text file like this:
Madonna is a celebrity
No she's not she's a serious artist
Did you see her book or the movie Truth or Dare
Argument closed
I need a method to get the length of the longest line. In this example the answer would be 47.
I can use StreamReader to open the file and read each line, but it seems that there should be an easier way.
Is there a simple way to solve this problem?
You can do this nicely with File.ReadLines, which has the advantage that it does not read the entire file into memory. As it returns IEnumerable<string>, you can use LINQ on the return value, leading to this rather nice one-liner.
File.ReadLines(fileName).Max(line => line.Length)
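One detail worth guarding against: Max throws an InvalidOperationException when the sequence is empty, which happens for an empty file. A small sketch that returns 0 in that case follows; the `LongestLine` class and the choice of 0 as the fallback are assumptions.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class LongestLine
{
    // Length of the longest line, or 0 when there are no lines at all.
    public static int MaxLineLength(IEnumerable<string> lines)
    {
        return lines.Select(line => line.Length).DefaultIfEmpty(0).Max();
    }

    // File.ReadLines enumerates lazily, so the file is streamed rather than loaded whole.
    public static int MaxLineLength(string fileName)
    {
        return MaxLineLength(File.ReadLines(fileName));
    }
}
```

For the four sample lines in the question, `MaxLineLength` returns 47.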