Is there a conditional split in C#? [closed]

It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 9 years ago.
I came across a problem:
I want to split everything that comes after ". "
For example, if I have the sentences:
"Danny went to school. it was wonderful. "
I want my output to be
Danny went to school.
it was wonderful.
which I can easily solve like this:
string[] list = currentResult.Split(new string[] { ". " }, StringSplitOptions.None);
BUT!
what if I have, for example:
1. Danny went to School. and : 2. James went to school as well.
my output will be:
1.
Danny went to School. and :
2.
James went to school as well
.
I don't want it to split when there is a number before the dot, for example.
Can I solve this somehow?
Thanks!

The problem here is how to deal with oddly formatted data. If you have control over your data, consider using 1) and 2) instead of 1. and 2. If you do not, you will probably have to resort to a regex to decide whether a . is part of a line or the end of one, since that is beyond the capabilities of String.Split.
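As a rough sketch of that regex route (the class name ConditionalSplit and the exact pattern are illustrative assumptions, not something taken from the question), Regex.Split with a lookbehind can split on whitespace only when the preceding period does not follow a digit:

```csharp
using System;
using System.Text.RegularExpressions;

public static class ConditionalSplit
{
    // Hypothetical helper: splits on whitespace that follows "<non-digit>."
    // so numbered markers such as "1." or "2." stay attached to their sentence.
    public static string[] SplitSentences(string text)
    {
        // Trim first so trailing whitespace does not produce an empty last element.
        return Regex.Split(text.Trim(), @"(?<=\D\.)\s+");
    }
}
```

The lookbehind keeps the period with its sentence, which String.Split with a ". " separator cannot do.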

You could always go character by character, and do something like:
NOTE: Untested, but looks right :)
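// Split after each '.' unless the character before it is a digit (as in "1.").
List<string> strings = new List<string>();
int curStart = 0;
for (int index = 1; index < str.Length; index++) {
    if (str[index] == '.' && !char.IsDigit(str[index - 1])) {
        // Keep the period with its sentence.
        strings.Add(str.Substring(curStart, index - curStart + 1).Trim());
        curStart = index + 1;
    }
}
// Keep any trailing text that does not end with a period.
string rest = str.Substring(curStart).Trim();
if (rest.Length > 0) {
    strings.Add(rest);
}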

I thought I'd take a stab at producing an answer that matches what you asked for, although the comments make a lot of sense in the larger scope of what you want.
Find out how to use regex from C# code at http://www.dotnetperls.com/regex-matches
I used http://regexpal.com/ to confirm my regex. Play around with that or a similar page to get a handle on regex. It's worth knowing how to use regex.
Look at http://www.mikesdotnetting.com/Article/46/CSharp-Regular-Expressions-Cheat-Sheet or someplace else for a list of the commands and definitions for regex.
The regex ".*?\D[.:]\s" will turn the string:
1. Danny went to School. and : 2. James went to school as well. Danny went to school. it was wonderful.
into the following matches (separated here by new lines):
1. Danny went to School.
and :
2. James went to school as well.
Danny went to school.
it was wonderful.
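Note that I took the liberty of separating matches on ':' as well, since your example does so. The final sentence is matched only if the string ends with whitespace; replacing the trailing \s with (?:\s|$) removes that requirement.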
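For illustration, here is a minimal sketch of running such a pattern from C# with Regex.Matches. The class and method names are made up for this example, and the pattern uses (?:\s|$) at the end as an assumption so the last sentence matches without trailing whitespace:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SentenceMatcher
{
    // Lazy run of characters, then a non-digit, then '.' or ':',
    // then whitespace or the end of the input.
    private static readonly Regex SentenceEnd =
        new Regex(@".*?\D[.:](?:\s|$)", RegexOptions.Singleline);

    // Hypothetical helper: returns each match with surrounding whitespace removed.
    public static List<string> Split(string text)
    {
        var parts = new List<string>();
        foreach (Match m in SentenceEnd.Matches(text))
        {
            parts.Add(m.Value.Trim());
        }
        return parts;
    }
}
```

Because the character before the terminator must be a non-digit, "1." and "2." never end a match on their own.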

Related

Regular expression for ensuring domain name is English characters only [closed]

Closed 10 years ago.
What's a good, clear regex for matching a domain name that must consist of:
Only English alpha characters, plus numbers
Including spaces or other separator characters that are valid, and reliably handled within a domain name
To clarify, this is for the purposes of validating a domain name. Whilst there are moves in the internet community to support internationalisation of domain names, I've done a fair bit of research into this and to keep my explanation fairly simple, only domain names that include characters that are part of a modern UK English character set (including numbers) are reliably handled by the Domain Name System (DNS). I'm not indicating a desire to prohibit internationalisation - I've done a lot of work during my career doing the opposite!
To answer this, what I was looking for is something like this (tested and works). Sorry the original question wasn't explicit enough about what I was trying to do; however, I've upvoted the suggestions that have helped me provide this answer to the community:
^[\w- .]*$
'\w' = shorthand for [a-zA-Z0-9_] (in Unicode-aware engines such as .NET, \w also matches non-English letters unless an ASCII-only mode like RegexOptions.ECMAScript is used)
'- .' = allow '-', ' ', '.'
asterisk = any of the previous characters zero or more times
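If the check runs in .NET, the difference between \w and explicit ASCII ranges is easy to see. This is only a sketch: the class and method names are invented for the example, and the hyphen is moved to the end of the class purely so that no engine can read it as a range:

```csharp
using System;
using System.Text.RegularExpressions;

public static class DomainCharacterCheck
{
    // Explicit ASCII ranges, so accented or non-Latin letters are rejected.
    private static readonly Regex AsciiOnly = new Regex(@"^[A-Za-z0-9_ .-]*$");

    // The \w form; in .NET this also accepts Unicode letters by default.
    private static readonly Regex WordClass = new Regex(@"^[\w .-]*$");

    public static bool IsAsciiOnly(string name)
    {
        return AsciiOnly.IsMatch(name);
    }

    public static bool MatchesWordClass(string name)
    {
        return WordClass.IsMatch(name);
    }
}
```

A name containing "ü", for instance, passes the \w version but fails the explicit ASCII version.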
You can use this one:
(?i)[a-z0-9\p{Z}]
where \p{Z} is the "All separators" class and (?i) is the ignore-case option.
You may use [a-zA-Z\d\s\p{P}]+ as the simplest solution, or go with a non-Unicode solution:
POSIX defines character classes [:...:], but not every regex engine supports them.
But equivalent explicit sets can be used instead:
[:alnum:] [A-Za-z0-9] Alphanumeric characters
[:space:] [ \t\r\n\v\f] Whitespace characters
[:punct:] [\]\[!"#$%&'()*+,./:;<=>?@\^_`{|}~-] Punctuation characters
So putting them together you will get
^[A-Za-z0-9 \t\r\n\v\f\]\[!"#$%&'()*+,./:;<=>?@\^_`{|}~-]+$
This way you can see what you are going to match and what you are not. Please note that some characters are escaped with \ because without escaping they would have a different meaning.

What algorithm can break text up into its component words? [closed]

Closed 10 years ago.
I was pleasantly surprised to find how easy it is to use iTextSharp to extract the text from a pdf file. By following this article, I was able to get a pdf file converted to text with this simple code:
string pdfFilename = dlg.FileName;
// Show just the file name, without the path
string pdfFileNameOnly = System.IO.Path.GetFileName(pdfFilename);
lblFunnyMammalsFile.Content = pdfFileNameOnly;
string textFilename = String.Format(@"C:\Scrooge\McDuckbilledPlatypus\{0}.txt", pdfFileNameOnly);
PDFParser pdfParser = new PDFParser();
if (!pdfParser.ExtractText(pdfFilename, textFilename))
{
    MessageBox.Show("there was a boo-boo");
}
The problem is that the text file generated contains text like this (i.e. it has no spaces):
IwaspleasantlysurprisedtofindhoweasyitistouseiTextSharptoextractthetextfromatextfile.
Is there an algorithm "out there" that will take text like that and make a best guess as to where the word breaks (AKA "spaces") should go?
Though I agree with Gavin that there's an easy way to solve this problem in this case, the problem itself is an interesting one.
This would require a heuristic algorithm to solve. I will explain why I think so in a bit, but first I'll explain my algorithm.
Store all the dictionary words in a trie. Now take the sentence and walk the trie character by character until you reach a word; the trie marks where each word ends. Once you find a word, insert a space after it in the sentence and start again from the root. This will work for your sentence. But consider these two examples:
He gave me this book
He told me a parable
For the first example the above algorithm works fine, but for the second example it outputs:
He told me a par able
To avoid this, we need to prefer the longest match, but if we do that the output for the first example becomes:
He gave met his book.
So we are stuck, and hence need to add heuristics to the algorithm that can judge that He gave met his book doesn't make sense grammatically.
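The ambiguity described above can be made concrete with a small sketch. It swaps the trie for a plain HashSet lookup for brevity, and instead of committing to the shortest or longest match it enumerates every possible segmentation, leaving the choice to whatever heuristic is applied afterwards. The class name and the tiny dictionaries are assumptions for illustration only:

```csharp
using System;
using System.Collections.Generic;

public static class WordBreaker
{
    // Returns every way to split `text` into words found in `dictionary`.
    public static List<string> Segment(string text, HashSet<string> dictionary)
    {
        return Segment(text, 0, dictionary, new Dictionary<int, List<string>>());
    }

    private static List<string> Segment(string text, int start,
        HashSet<string> dictionary, Dictionary<int, List<string>> memo)
    {
        List<string> cached;
        if (memo.TryGetValue(start, out cached)) return cached;

        var results = new List<string>();
        if (start == text.Length)
        {
            // Reached the end: one valid (empty) continuation.
            results.Add(string.Empty);
        }
        else
        {
            // Try every prefix starting at `start`, shortest first.
            for (int end = start + 1; end <= text.Length; end++)
            {
                string word = text.Substring(start, end - start);
                if (!dictionary.Contains(word)) continue;
                foreach (string rest in Segment(text, end, dictionary, memo))
                {
                    results.Add(rest.Length == 0 ? word : word + " " + rest);
                }
            }
        }
        memo[start] = results;
        return results;
    }
}
```

For "hegavemethisbook" this yields both "he gave me this book" and "he gave met his book", so a ranking step (word frequencies or a grammar check, for example) is still needed to pick one.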

C# code or algorithm to quickly calculate distance between large strings? [closed]

Closed 11 years ago.
Hi and thanks for looking!
Background
I have an XML file that contains 1900 nodes which themselves contain a string of encoded data of about 3400 characters.
As part of a use case for an application I am developing, I need to be able to take a "benchmark" string at runtime, and find the closest match from the XML file.
Please note that XML is not germane to the app and that I may go with SQL moving forward, but for today, I just needed an easy place to store the data and prove the concept.
I am using .NET 4.0, C#, forms app, LINQ, etc.
Question
How do I find the closest match? Hamming? Levenshtein? There are plenty of code samples online, but most are geared towards small string comparisons ("ant" vs. "aunt") or exact match. I will rarely have exact matches; I just need closest match.
Thanks in advance!
Matt
You mentioned using Levenshtein edit distance and that your strings are about 3400 characters long.
I did a quick test, and the dynamic programming version of Levenshtein edit distance seems to be quite fast and causes no issues.
I did this:
final StringBuilder sb1 = new StringBuilder();
final StringBuilder sb2 = new StringBuilder();
final Random r = new Random(42);
final int n = 3400;
for (int i = 0; i < n; i++) {
    sb1.append( (char) ('a' + r.nextInt(26)) );
    sb2.append( (char) ('a' + r.nextInt(26)) );
}
final long t0 = System.currentTimeMillis();
System.out.println("LED: " + getLevenshteinDistance(sb1.toString(), sb2.toString()) );
final long te = System.currentTimeMillis() - t0;
System.out.println("Took: " + te + " ms");
And it's finding the distance in 215 ms on a Core 2 Duo from 2006 or so.
Would that work for you?
(By the way, I'm not sure I can paste the code for the DP Levenshtein edit distance implementation I've got here, so you should probably search the Internet for a Java implementation.)
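Since the question is about C#, here is a minimal sketch of the standard dynamic programming Levenshtein distance in that language. It is not the implementation timed above, which was not posted; the class and method names are illustrative, and it keeps only two rows of the table so memory stays proportional to one string's length:

```csharp
using System;

public static class EditDistance
{
    // Classic dynamic programming Levenshtein distance using two rows.
    public static int Levenshtein(string a, string b)
    {
        int[] previous = new int[b.Length + 1];
        int[] current = new int[b.Length + 1];

        // Distance from the empty prefix of `a` to each prefix of `b`.
        for (int j = 0; j <= b.Length; j++) previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                int insertion = current[j - 1] + 1;
                int deletion = previous[j] + 1;
                int substitution = previous[j - 1] + cost;
                current[j] = Math.Min(Math.Min(insertion, deletion), substitution);
            }
            // Reuse the arrays instead of allocating a new row each time.
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.Length];
    }
}
```

To find the closest of the 1900 stored strings, one would call this once per candidate against the benchmark string and keep the minimum.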

Phone Number Validation in Multiple Countries [closed]

Closed 11 years ago.
I am having trouble finding regex validation patterns for phone numbers in different countries and have little time to try to write my own, so I was hoping a regex guru would be able to help.
I've checked the usual sources like regexlib already, so if anyone can help with any of them I'd be grateful.
I need a separate phone number validation expression for each of the following:
Germany
US
Australia
New Zealand
Canada
Asia
France
The format is here.
Writing the regex is not trivial, but if you specify the rules it would not be difficult.
Instead of making an elaborate regular expression to match how you think your visitor will input their phone number, try a different approach. Take the phone number input and strip out the symbols. Then the regular expression can be simple and just check for 10 numeric digits (US number, for example). Then if you need to save the phone number in a consistent format, build that format off of the 10 digits.
This example validates U.S. phone numbers by looking for 10 numeric digits.
protected bool IsValidPhone(string strPhoneInput)
{
    // Remove symbols (dash, space and parentheses, etc.)
    string strPhone = Regex.Replace(strPhoneInput, @"[- ()\*\!]", String.Empty);
    // Check for exactly 10 numbers left over
    Regex regTenDigits = new Regex(@"^\d{10}$");
    Match matTenDigits = regTenDigits.Match(strPhone);
    return matTenDigits.Success;
}
A phone number is a number; what do you want to validate there?
Here you can see what different numbers look like.
And there is no such country as Asia; it is a continent with many countries.
It's close to impossible to get a single regular expression that will cover all countries.
I'd go with [0-9+][0-9() ]* -- this simply allows any digit to start (or the "+" character), then any combination of digits or parentheses or spaces.
In general, validating any further is not really going to be of much use. If the user of the page wants to be contacted by phone, they'll enter a valid phone number -- if not, then they won't.
A better way to enforce a correct phone number and eliminate most simple miskeying is to require the number to be entered twice -- then the user is likely to at least check it!
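A minimal sketch of that permissive check in C# might look like the following. The character classes come from the pattern above; the ^ and $ anchors, the trimming and the class name are assumptions added so the whole input is tested rather than any substring:

```csharp
using System;
using System.Text.RegularExpressions;

public static class LoosePhoneCheck
{
    // A leading digit or '+', then any mix of digits, parentheses and spaces.
    // The anchors are an addition so the pattern must cover the entire input.
    private static readonly Regex Loose = new Regex(@"^[0-9+][0-9() ]*$");

    public static bool LooksLikePhone(string input)
    {
        return input != null && Loose.IsMatch(input.Trim());
    }
}
```

This accepts most national and international formats and rejects obvious non-numbers, which is about as far as a country-agnostic check can reasonably go.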

How do I capture named groups in C# .NET regex? [closed]

Closed 12 years ago.
I'm trying to use named groups to parse a string.
An example input is:
exitcode: 0; session id is RDP-Tcp#2
and my attempted regex is:
("(exitCode)+(\s)*:(\s)*(?<exitCode>[^;]+)(\s)*;(\s)*(session id is)(\s)*(?<sessionID>[^;]*)(\s)*");
Where is my syntax wrong?
Thanks
In your example:
exitcode: 0; session id is RDP-Tcp#2
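It does not end with a semi-colon. The sessionID group stops at a semi-colon, although [^;]* is equally satisfied by the end of the input:
(?<sessionID>[^;]*)
Also note that the pattern spells exitCode with a capital C while the input has exitcode, so the match fails unless you use RegexOptions.IgnoreCase or change the literal.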
I notice that immediately following both your named groups, you have optional whitespace matches -- perhaps it would help to add whitespace into the character classes, like this:
(?<exitCode>[^;\s]+)
(?<sessionID>[^;\s]*)
Even better, split the string on the semi-colon first, and then perhaps you don't even need a regular expression. You'd have these two substrings after you split on the semi-colon, and the exitcode and sessionID happen to be on the ends of the strings, making it easy to parse them any number of ways:
exitcode: 0
session id is RDP-Tcp#2
Richard's answer really covers it already - either remove or make optional the semicolon at the end and it should work, and definitely consider putting whitespace in the negated classes or just splitting on semi-colon, but a little extra food for thought. :)
Don't bother with \s where it's not necessary - looks like your output is some form of log or something, so it should be more predictable, and if so something simpler can do:
exitcode: (?<exitCode>\d+);\s+session id is\s+(?<sessionID>[^;\s]*);?
For the splitting on semi-colon, you'll get an array of two objects - here's some pseudo-code, assuming exitcode is numeric and sessionid doesn't have spaces in:
splitresult = input.split('\s*;\s*')
exitCode = splitresult[0].match('\d+')
sessionId = splitresult[1].match('\S*$')
Depending on who will be maintaining the code, this might be considered more readable than the above expression.
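To show how the named groups are read back in C#, here is a small sketch built around the simplified expression above. The class and method names are made up for the example, and RegexOptions.IgnoreCase is an added assumption so that both exitcode and exitCode are accepted:

```csharp
using System;
using System.Text.RegularExpressions;

public static class SessionLineParser
{
    // Simplified pattern from the answer; IgnoreCase is an assumption added here.
    private static readonly Regex LinePattern = new Regex(
        @"exitcode: (?<exitCode>\d+);\s+session id is\s+(?<sessionID>[^;\s]*);?",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns false when the line does not match.
    public static bool TryParse(string line, out int exitCode, out string sessionId)
    {
        exitCode = 0;
        sessionId = string.Empty;

        Match m = LinePattern.Match(line);
        if (!m.Success) return false;

        // Named groups are read through the Groups indexer by name.
        if (!int.TryParse(m.Groups["exitCode"].Value, out exitCode)) return false;
        sessionId = m.Groups["sessionID"].Value;
        return true;
    }
}
```

Match.Groups["name"].Value is the usual way to retrieve a named capture; checking Match.Success first avoids reading empty groups from a failed match.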
