C# Regex: How to break up plain text string


I have a problem which I am wondering how to solve.
I have a String I read in from a pdf file that has a list of questions.
It's in the format of:
QUESTION NO: 1
xxxxxxx (question text)
A) xxxx (multiple choice) B) xxxx C) xxxx ...
Answer: xxxxx
QUESTION NO: 2
xxxxxxx (question text)
.... (etc)
There are about 200 questions in the list.
I am trying to use Regex to break up the text so each question can be in a separate string.
I've done this before with html and xml documents, but they were easy since there are a lot of identifying tags like double quotes, brackets, and parentheses.
But I am clueless as to how to do this with just text. I've tried a lot of combinations, but it just seems like I can't get the right format:
var questionPattern = @"QUESTION NO:(.*)QUESTION NO:";
var questionMatch = Regex.Matches(pdfText, questionPattern, RegexOptions.Singleline);
I was wondering, is there a way to do:
var questionPattern = @"(?<=QUESTION NO:)[^QUESTION NO:]*";
Where the [^QUESTION NO:]* reads everything after each Question header until it stops when it comes to the next Question header?
Obviously this is the wrong format, but I hope people will understand what I'm trying to get at.
Any help would be greatly appreciated.
Thanks!

This is probably the best you're going to get. It depends on every question ending with an Answer line. Lookaheads would need to be conditional, and would break the entire expression.
(QUESTION NO: \d+[\S\s]*?Answer.*\n*)
Working example: http://regex101.com/r/nC6yA1
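As a side note on the second pattern in the question: [^QUESTION NO:] is a character class, so it excludes the individual characters Q, U, E, S and so on, not the phrase as a whole. A rough sketch of the "stop at the next header" idea, assuming the text "QUESTION NO:" never appears inside a question body, is to split in front of every header with a zero-width lookahead. The class name QuestionSplitter and the sample text are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class QuestionSplitter
{
    // Split in front of every "QUESTION NO:" header. The lookahead is
    // zero-width, so the header stays at the start of its own block.
    public static string[] SplitQuestions(string pdfText)
    {
        return Regex.Split(pdfText, @"(?=QUESTION NO:)")
                    .Select(block => block.Trim())
                    .Where(block => block.Length > 0) // drop the empty piece before the first header
                    .ToArray();
    }

    public static void Main()
    {
        // Hypothetical sample in the format described in the question.
        string pdfText =
            "QUESTION NO: 1\nFirst question text\nA) one B) two\nAnswer: A\n" +
            "QUESTION NO: 2\nSecond question text\nA) one B) two\nAnswer: B\n";

        foreach (string question in SplitQuestions(pdfText))
        {
            Console.WriteLine(question);
            Console.WriteLine("-----");
        }
    }
}
```

Unlike the Answer-based pattern above, this does not need an Answer line to end each block, but it will cut in the wrong place if a question quotes the header text.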

Related

C# Regex.Matches - Quotation Marks Problem

I am new to the community and first want to introduce myself. My name is Ben and I love to code, but I only started about 2 years ago and have not coded much since (maybe 50 hours in total).
So my question is the following. I want to scrape some data from a website and it works almost perfectly, but the problem is that I have a string in the source code like this:
-> "key":"Name","role" and I want to grab only the Name, without any quotation marks.
Now my code looks like the following:
MatchCollection AllChampionName = Regex.Matches(html, @"key\s*(.+?)\s*role", RegexOptions.Singleline);
But the result in my textbox is like the following text: ":"Name","
I know why, but I don't know how I can handle it, because I don't know how regex works.
Can someone please tell me the right code, so that I only get Name without quotation marks, and maybe point me to a source where I can read more about Regex and how its syntax works? I have not found a good one :(
Edit: I am programming in C#.
Thanks a lot!
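The pattern in the question returns ":"Name"," because (.+?) captures everything between key and role, punctuation included. One possible fix, sketched here under the assumption that the value never contains an escaped quote, is to match the quotes and the colon literally and capture only what sits between the value's quotes. The class name KeyExtractor and the sample input are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class KeyExtractor
{
    // Matches "key":"<value>" and captures only <value> in group 1.
    // In a verbatim string (@"..."), a literal double quote is written as "".
    private static readonly Regex KeyPattern =
        new Regex(@"""key""\s*:\s*""([^""]*)""");

    public static List<string> ExtractKeys(string html)
    {
        var names = new List<string>();
        foreach (Match match in KeyPattern.Matches(html))
        {
            // Groups[0] is the whole match, Groups[1] is the first capture.
            names.Add(match.Groups[1].Value);
        }
        return names;
    }

    public static void Main()
    {
        // Hypothetical input in the shape shown in the question.
        string html = "{\"key\":\"Name\",\"role\":\"x\"},{\"key\":\"Other\",\"role\":\"y\"}";
        Console.WriteLine(string.Join(", ", ExtractKeys(html))); // prints: Name, Other
    }
}
```

If the page source is actually JSON, a JSON parser is likely to be more robust than any regex.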

Regex hangs trying to find match

I am trying to match an assignment string in VB code (as in I'm passing in text that is VB code into my program that's written in C#). The assignment string that I'm trying to match is something for example like
CustomClassInitializer(someParameter, anotherParameter, someOtherClassAsParameterWithInitialization()).SomeProperty = 7
and I realize that's rather complex, but it actually isn't far off from some of the real text I'm trying to match.
In order to do so I wrote a Regex. This Regex:
@"[\w,.]+\(([\w,.]*\(*,* *\)*)+ = "
which correctly matches. The problem is it becomes VERY slow (with timeouts), which I've researched and found is probably because of "backtracking". One of the suggested solutions to help with backtracking in general was to add "?>" to the regex, which I think would go in this position:
[\w,.]+\(?>([\w,.]*\(*,* *\)*)+ =
but this no longer matches properly.
I'm fairly new to Regex, so I imagine that there is a much better pattern. What is it please? Or how can I improve my times in general?
Helpful notes:
I'm only interested in position 0 of the string I'm searching for a match in. My code is "if (isMatch && match.Index == 0) { ... }". Can I tell it to only check position 0 and, if it's not a match, move on?
The reason I use all the 0 or more things is the match could be as simple as CustomClass() = new CustomClass(), and as complicated as the above or perhaps a bit worse. I'm trying to get as many cases as possible.
This Regex is interested in "[\w,.]+(" and then "whatever may be inside the parentheses" (I tried to think of what all could be inside them based on the fact that it's valid VB code) until you get to the close parenthesis and then " = ". Perhaps I can use a wildcard for literally anything until it gets to ") = " in the string? Like I said, I'm fairly new to Regex.
Thanks in advance!

This seems to do what you want. Normally, I like to be more specific than .*, but it is working correctly. Note that I am using the Multi-line option.
^.*=\s*.+$
Here is a working example on RegExStorm.net.
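On the narrower point of checking only position 0: a pattern that starts with \A can only succeed at index 0, and the Regex constructor accepts a match timeout as a safety net against runaway backtracking. Also note that an atomic group is written (?>...), with ?> inside the parenthesis, so \(?>( does not create one. The sketch below uses a deliberately simplified pattern that is an assumption, not an equivalent of the one in the question: it avoids nested quantifiers by taking everything up to the first '=', so it will not match if an '=' appears inside the parentheses. The class name AssignmentMatcher and the 250 ms limit are arbitrary choices.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AssignmentMatcher
{
    // \A        : only the very start of the input can match
    // [\w.]+\(  : a dotted name followed by an opening parenthesis
    // [^=]*     : anything up to the first '=' (no nested quantifiers)
    // " = "     : the assignment operator with surrounding spaces
    private static readonly Regex AssignmentAtStart = new Regex(
        @"\A[\w.]+\([^=]* = ",
        RegexOptions.None,
        TimeSpan.FromMilliseconds(250)); // upper bound on a single match attempt

    public static bool StartsWithAssignment(string code)
    {
        try
        {
            return AssignmentAtStart.IsMatch(code);
        }
        catch (RegexMatchTimeoutException)
        {
            // Treat a timeout as "no match" instead of hanging the caller.
            return false;
        }
    }

    public static void Main()
    {
        Console.WriteLine(StartsWithAssignment("CustomClass() = new CustomClass()")); // True
        Console.WriteLine(StartsWithAssignment("x = 5"));                             // False
    }
}
```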

Is there a better way to check if an entire string was matched? [duplicate]

This question already has answers here:
Match exact string
(3 answers)
Closed 3 years ago.
I'm parsing a text file line by line and for each line I have a special regex. However, in one case a pattern is matching two lines: one that is a correct match, and another that matches only partially because a couple of values are optional.
Invalid match:
BNE1010/1000 HKG1955/2005 7/PLD/CLD/YLD
matches a partial string (it shouldn't match this at all):
BNE1010/1000
Correct match (matches the entire string):
RG878A/21AUG15 GIG/BOG 1/RG/AV 3/AV 4/AV 5/RG 6/AV081C/22 7/CDC/YD 9/TP
The regex for this is quite long and contains several optional groups:
^(?<FlightDesignator>([A-Z0-9]{2}[A-Z]?)([0-9]{3,4}))(?<OperationalSuffix>[A-Z])?(?<FlightIdentifierDate>\/(\d{2})([A-Z]{3})?(\d{2})?)?(\s(?<FlightLegsChangeIdentifier>(\/?[A-Z]{3})+)(?=(\s|$)))?(\s1(?<JointOperationAirlineDesignators>(\/.{2}[A-Z]?)+))?(\s3\/(?<AircraftOwner>([A-Z]{2}|.)))?(\s4\/(?<CockpitCrewEmployer>(.+?)(?=(?: \d\/|$))))?(\s5\/(?<CabinCrewEmployer>([A-Z]{2}|.)))?(?<OnwardFlight>\s6\/(([A-Z0-9]{2}[A-Z]?)([0-9]{3,4}))([A-Z])?(\/(\d{2})([A-Z]{3})?(\d{2})?)?)?(\s7\/(?<MealServiceNote>(\/?[A-Z]{0,3})+))?(\s9\/(?<OperatingAirlineDisclosure>(.{2}[A-Z]?)))?
I think there is no need to study the entire regex because it's built dynamically from smaller patterns at runtime and all the parts work correctly. Also, lots of combinations are tested with unit tests and they all work... as long as I try to parse only the line that should be matched by the pattern.
Currently I'm checking if the entire string is matched by
match.Groups[0].Value == line
but I find it quite ugly. I know from JavaScript that the regex engine provides an index property showing where the regex engine stopped. So my idea was to compare the index with the length of the string. Unfortunately I wasn't able to find such a property in C#.
Another idea would be to modify the regex so that it matches only one line and no partial lines.
Example: https://regex101.com/r/dM5wU4/1
The example contains only two cases because there aren't actually any combinations that would change its behavior. I could remove some parameters but it wouldn't change anything.
EDIT:
I've edited my question. Sorry to everyone for not providing all the information the first time. I won't ask any more questions when writing on the phone :) It wasn't a good idea. Hopefully it won't get closed now.
You asked whether I could simplify the regex. I would do it if I could and knew how. If it was easy I wouldn't have asked. The problem started as the regex and string became bigger during development. Now they are at the production length and I can't actually make them shorter even for the sake of the question, sorry.
EDIT-2:
I found the reason why I couldn't find the inherited Index and Length properties of the Match class.
For some strange reason, when selecting the Match class and pressing F1, Visual Studio opened the wrong help page (Match Properties for the Micro Framework) even though I'm not working with the Micro Framework. I didn't notice that, but I was indeed wondering why there was so little information. Thanks to @Jamiec for the correct link. I won't trust Visual Studio anymore when hitting F1.

Disclaimer: I'm going to add this, but I doubt it's the solution. If it's not, this part will get deleted in short order.
You can add a $ at the end of your regular expression. This stops your first example matching but continues to match the second example.
As you've not provided any more than 2 examples, it's unclear if this actually solves all your cases or just that one specific false positive.
My question is whether it is possible to check if a regular expression matched the entire string without checking the first group against the original line?
If you're not averse to comparing the length of the match to the length of the string, you can do that too:
var regex = new Regex(@"^(?<FlightDesignator>([A-Z0-9]{2}[A-Z]?)([0-9]{3,4}))(?<OperationalSuffix>[A-Z])?(?<FlightIdentifierDate>\/(\d{2})([A-Z]{3})?(\d{2})?)?(\s(?<FlightLegsChangeIdentifier>(\/?[A-Z]{3})+)(?=(\s|$)))?(\s1(?<JointOperationAirlineDesignators>(\/.{2}[A-Z]?)+))?(\s3\/(?<AircraftOwner>([A-Z]{2}|.)))?(\s4\/(?<CockpitCrewEmployer>(.+?)(?=(?: \d\/|$))))?(\s5\/(?<CabinCrewEmployer>([A-Z]{2}|.)))?(?<OnwardFlight>\s6\/(([A-Z0-9]{2}[A-Z]?)([0-9]{3,4}))([A-Z])?(\/(\d{2})([A-Z]{3})?(\d{2})?)?)?(\s7\/(?<MealServiceNote>(\/?[A-Z]{0,3})+))?(\s9\/(?<OperatingAirlineDisclosure>(.{2}[A-Z]?)))?");
var input1 = @"BNE1010/1000 HKG1955/2005 7/PLD/CLD/YLD";
var input2 = @"RG878A/21AUG15 GIG/BOG 1/RG/AV 3/AV 4/AV 5/RG 6/AV081C/22 7/CDC/YD 9/TP";
var match1 = regex.Match(input1);
var match2 = regex.Match(input2);
Console.WriteLine(match1.Length == input1.Length); // False
Console.WriteLine(match2.Length == input2.Length); // True
Live example: http://rextester.com/NIBE6349
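The length comparison can be wrapped in a small helper that also checks Success and Index, so a failed match against an empty string is not mistaken for a full match. This is only a sketch: the helper name IsFullMatch and the short stand-in pattern are made up for illustration and are not the pattern from the question. If you go the anchoring route instead, keep in mind that in .NET $ also matches just before a final newline, while \z matches only at the very end of the input.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FullMatch
{
    // True only when the match starts at index 0 and covers every character.
    public static bool IsFullMatch(Regex regex, string input)
    {
        Match match = regex.Match(input);
        return match.Success && match.Index == 0 && match.Length == input.Length;
    }

    public static void Main()
    {
        // Hypothetical, much shorter pattern: three letters, four digits,
        // then an optional "/dddd" part.
        var regex = new Regex(@"^[A-Z]{3}\d{4}(/\d{4})?");

        Console.WriteLine(IsFullMatch(regex, "BNE1010/1000 HKG1955/2005")); // False
        Console.WriteLine(IsFullMatch(regex, "BNE1010/1000"));              // True
    }
}
```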

How to ignore \t character inside double quotes using regex?

I am trying to parse a file using regex split, it works well with the '\t' character but some lines have the '\t' inside a field instead of acting as the delimiter.
Like :
G2226 TEST 1 C 29 Internal Head Office D Head Office ZZZ Unassigned 10910 10/10/2011 11/10/2011 10/10/2011 11/10/2011 "Test call Sort the customer out some data. See the customer again tomorrow to talk about Prod " Mr ABC Mr ABC Mr ABC Mr ABC Credit Requested BDM Call Internal Note 10
This part has 2 tabs I wish were ignored :
"Test call Sort the customer out some data. See the customer again tomorrow to talk about Prod\t\t"
The good thing is, they are included in double quotes, but I cannot work out how to ignore them, any ideas?
Edit:
My goal is to get 36 columns. Some lines produce more columns after a Regex.Split(lineString, "\t") because they contain '\t' characters inside some of the fields, and I would like to ignore those tabs. The line above comes out to 38 columns, which is rejected by my DataTable because the header has only 36 columns. I would like to solve this problem.

If you have a simple CSV file, then regex split is usually the easiest way to process it.
However, if your CSV file contains more complex elements, such as quoted fields that contain separator characters or newlines, then this approach will no longer work. It is not a trivial matter to correctly parse these types of files, so you should use a library when possible.
The answers to this question give several options for C# libraries that can read a CSV file.

Regex is not the right tool for this.
You have basically a CSV format, it is "tab separated", not "comma separated", but it works exactly the same. So, find a CSV parser and use that - the separation character is usually configurable.

If you really need a regular expression, you can try something like this:
(?!\t")\t(?!\t")
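Another regex-only approach, sketched here under the assumption that quotes are always balanced and never escaped inside a field, is to split only on tabs that are followed by an even number of double quotes up to the end of the line. A tab inside a quoted field has an odd number of quotes after it, so it is left alone. The class name TabSplitter and the sample line are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TabSplitter
{
    // \t followed (via lookahead) by zero or more complete "..." pairs and
    // then no further quotes until the end of the line.
    private static readonly Regex TabOutsideQuotes =
        new Regex(@"\t(?=(?:[^""]*""[^""]*"")*[^""]*$)");

    public static string[] SplitLine(string line)
    {
        return TabOutsideQuotes.Split(line);
    }

    public static void Main()
    {
        // Hypothetical line: the middle field contains two tabs inside quotes.
        string line = "G2226\t\"talk about Prod\t\t\"\tMr ABC";
        string[] columns = SplitLine(line);
        Console.WriteLine(columns.Length); // 3
    }
}
```

The lookahead rescans the rest of the line for every tab, so this gets slow on very long lines. A dedicated parser, as the other answers suggest, remains the safer choice.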

Using RegEx to read through a CSV file

I have a CSV file, with the following type of data:
0,'VT,C',0,
0,'C,VT',0,
0,'VT,H',0,
and I desire the following output
0
VT,C
0
0
C,VT
0
0
VT,H
0
So I want to split the string on commas while ignoring commas inside quote marks. At the moment I'm using the following RegEx:
("(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)"
however this gives me the result of:
0
VT
C
0
0
C
VT
0
0
VT
H
0
This shows the RegEx is not reading the quote marks properly. Can anyone suggest some alterations that might help?

Usually when it comes to CSV parsing, people use specific libraries well suited for the programming language they are using to code their application.
Anyway if you are going to use a regular expression to make a really loose(!) parsing you may try using something like this:
'(?<value>[^']*?)'
It will match anything in between single quotes, and assuming the csv file is well formed, it will not miss a field. Of course it doesn't accept embedded quotes but it easily gets the job done. That's what I use when I need to get the job done really quickly. Please don't consider it a complete solution to your problem...it just works in special conditions when the requirements are what you described and the input is well formed.
[EDIT]
I was checking your question again and noticed you also want to include non-quoted fields...well, in that case my expression will not work at all. Anyway, if you think hard about your problem, you'll find it is quite difficult to solve without ambiguity. You need fixed rules, and if you allow both quoted and unquoted fields, the parser will have a hard time telling whether a comma is a separator or part of a quoted value.
Another expression to model such a solution may be:
('[^']+'|[^,]+),?
It will match both quoted and unquoted fields...anyway, I'm not sure whether it needs to assume the csv adheres to strict conditions. As far as I can tell, that will be much safer than a split strategy ... you just need to collect all matches and append matched_value + \r\n to your target string.
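A minimal sketch of that "collect all matches" idea, assuming single-quoted fields never contain an embedded quote. It uses a close variant of the expression above with named groups, so the surrounding quotes can be dropped from the output. The class name QuotedCsv and the group names quoted and plain are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuotedCsv
{
    // Either a single-quoted field (captured without its quotes)
    // or a run of characters that contains no comma.
    private static readonly Regex Field =
        new Regex(@"'(?<quoted>[^']*)'|(?<plain>[^,]+)");

    public static List<string> ReadFields(string line)
    {
        var fields = new List<string>();
        foreach (Match match in Field.Matches(line))
        {
            Group quoted = match.Groups["quoted"];
            fields.Add(quoted.Success ? quoted.Value : match.Groups["plain"].Value);
        }
        return fields;
    }

    public static void Main()
    {
        // Prints 0, then VT,C, then 0, each on its own line.
        foreach (string field in ReadFields("0,'VT,C',0,"))
        {
            Console.WriteLine(field);
        }
    }
}
```

Empty unquoted fields (two commas in a row, or the trailing comma) produce no match and are skipped, which matches the desired output in the question but would not suit a file where empty columns matter.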

This regex is based on the fact that you have 1 digit before and after your 'value'
Regex.Replace(input, @"(?:(?<=\d),|,(?=\d))", "\n");
You can test it out on RegexStorm

foreach(var m in Regex.Matches(s,"(('.*?')|[0-9])"))

I have managed to get the following method to read the file as required:
public List<string> SplitCSV(string input, List<string> line)
{
    Regex csvSplit = new Regex("(([^,^\'])*(\'.*\')*([^,^\'])*)(,|$)", RegexOptions.Compiled);
    foreach (Match match in csvSplit.Matches(input))
    {
        line.Add(match.Value.TrimStart(','));
    }
    return line;
}
Thanks for everyone's help though.
