How to split at every second quotation mark - c#

I have a string that looks like this
2,"E2002084700801601390870F"
3,"E2002084700801601390870F"
1,"E2002084700801601390870F"
4,"E2002084700801601390870F"
3,"E2002084700801601390870F"
This is one whole string, you can imagine it being on one row.
And I want to split this in the way they stand right now like this
2,"E2002084700801601390870F"
I cannot change the way it is formatted. So my best bet is to split at every second quotation mark. But I haven't found any good ways to do this. I've tried this https://stackoverflow.com/a/17892392/2914876 But I only get an error about invalid arguements.
Another issue is that this project is running .NET 2.0 so most LINQ functions aren't available.
Thank you.

Try this
var regEx = new Regex(#"\d+\,"".*?""");
var lines = regex.Matches(txt).OfType<Match>().Select(m => m.Value).ToArray();
Use foreach instead of LINQ Select on .Net 2
Regex regEx = new Regex(#"\d+\,"".*?""");
foreach(Match m in regex.Matches(txt))
{
var curLine = m.Value;
}

I see three possibilities, none of them are particularly exciting.
As #dvnrrs suggests, if there's no comma where you have line-breaks, you should be in great shape. Replace ," with something novel. Replace the remaining "s with what you need. Replace the "something novel" with ," to restore them. This is probably the most solid--it solves the problem without much room for bugs.
Iterate through the string looking for the index of the next " from the previous index, and maintain a state machine to decide whether to manipulate it or not.
Split the string on "s and rejoin them in whatever way works the best for your application.

I realize regular expressions will handle this but here's a pure 2.0 way to handle as well. It's much more readable and maintainable in my humble opinion.
using System;
using System.Collections.Generic;
namespace ConsoleApplication1
{
internal class Program
{
private static void Main(string[] args)
{
const string data = #"2,""E2002084700801601390870F""3,""E2002084700801601390870F""1,""E2002084700801601390870F""4,""E2002084700801601390870F""3,""E2002084700801601390870F""";
var parsedData = ParseData(data);
foreach (var parsedDatum in parsedData)
{
Console.WriteLine(parsedDatum);
}
Console.ReadLine();
}
private static IEnumerable<string> ParseData(string data)
{
var results = new List<string>();
var split = data.Split(new [] {'"'}, StringSplitOptions.RemoveEmptyEntries);
if (split.Length % 2 != 0)
{
throw new Exception("Data Formatting Error");
}
for (var index = 0; index < split.Length / 2; index += 2)
{
results.Add(string.Format(#"""{0}""{1}""", split[index], split[index + 1]));
}
return results;
}
}
}

Related

how to find text in a string in c#

I am learning Dotnet c# on my own.
how to find whether a given text exists or not in a string and if exists, how to find count of times the word has got repeated in that string. even if the word is misspelled, how to find it and print that the word is misspelled?
we can do this with collections or linq in c# but here i used string class and used contains method but iam struck after that.
if we can do this with help of linq, how?
because linq works with collections, Right?
you need a list in order to play with linq.
but here we are playing with string(paragraph).
how linq can be used find a word in paragraph?
kindly help.
here is what i have tried so far.
string str = "Education is a ray of light in the darkness. It certainly is a hope for a good life. Eudcation is a basic right of every Human on this Planet. To deny this right is evil. Uneducated youth is the worst thing for Humanity. Above all, the governments of all countries must ensure to spread Education";
for(int i = 0; i < i++)
if (str.Contains("Education") == true)
{
Console.WriteLine("found");
}
else
{
Console.WriteLine("not found");
}
You can make a string a string[] by splitting it by a character/string. Then you can use LINQ:
if(str.Split().Contains("makes"))
{
// note that the default Split without arguments also includes tabs and new-lines
}
If you don't care whether it is a word or just a sub-string, you can use str.Contains("makes") directly.
If you want to compare in a case insensitive way, use the overload of Contains:
if(str.Split().Contains("makes", StringComparer.InvariantCultureIgnoreCase)){}
string str = "money makes many makes things";
var strArray = str.Split(" ");
var count = strArray.Count(x => x == "makes");
the simplest way is to use Split extension to split the string into an array of words.
here is an example :
var words = str.Split(' ');
if(words.Length > 0)
{
foreach(var word in words)
{
if(word.IndexOf("makes", StringComparison.InvariantCultureIgnoreCase) != -1)
{
Console.WriteLine("found");
}
else
{
Console.WriteLine("not found");
}
}
}
Now, since you just want the count of number word occurrences, you can use LINQ to do that in a single line like this :
var totalOccurrences = str.Split(' ').Count(x=> x.IndexOf("makes", StringComparison.InvariantCultureIgnoreCase) != -1);
Note that StringComparison.InvariantCultureIgnoreCase is required if you want a case-insensitive comparison.

count occurrences in a string similar to run length encoding c#

Say I have a string like
MyString1 = "ABABABABAB";
MyString2 = "ABCDABCDABCD";
MyString3 = "ABCAABCAABCAABCA";
MyString4 = "ABABACAC";
MyString5 = "AAAAABBBBB";
and I need to get the following output
Output1 = "5(AB)";
Output2 = "3(ABCD)";
Output3 = "4(ABCA)";
Output4 = "2(AB)2(AC)";
Output5 = "5(A)5(B)";
I have been looking at RLE but I can't figure out how to do the above.
The code I have been using is
public static string Encode(string input)
{
return Regex.Replace(input, #"(.)\1*", delegate(Match m)
{
return string.Concat(m.Value.Length, "(", m.Groups[1].Value, ")");
});
}
This works for Output5 but can I do the other Outputs with Regex or should I be using something like Linq?
The purpose of the code is to display MyString in a simple manner as I can get MyString being up to a 1000 characters generally with a pattern to it.
I am not too worried about speed.
Using RLE with single characters is easy, there never is an overlap between matches. If the number of characters to repeat is variable, you'd have a problem:
AAABAB
Could be:
3(A)BAB
Or
AA(2)AB
You'll have to define what rules you want to apply. Do you want the absolute best compression? Does speed matter?
I doubt Regex can look forward and select "the best" combination of matches - So to answer your question I would say "no".
RLE is of no help here - it's just an extremely simple compression where you repeat a single code-point a given number of times. This was quite useful for e.g. game graphics and transparent images ("next, there's 50 transparent pixels"), but is not going to help you with variable-length code-points.
Instead, have a look at Huffman encoding. Expanding it to work with variable-length codewords is not exactly cheap, but it's a start - and it saves a lot of space, if you can afford having the table there.
But the first thing you have to ask yourself is, what are you optimizing for? Are you trying to get the shortest possible string on output? Are you going for speed? Do you want as few code-words as possible, or do you need to balance the repetitions and code-word counts in some way? In other words, what are you actually trying to do? :))
To illustrate this on your "expected" return values, Output4 results in a longer string than MyString4. So it's not the shortest possible representation. You're not trying for the least amounts of code-words either, because then Output5 would be 1(AAAAABBBBB). Least amount of repetitions is of course silly (it would always be 1(...)). You're not optimizing for low overhead either, because that's again broken in Output4.
And whichever of those are you trying to do, I'm thinking it's not going to be possible with regular expressions - those only work for regular languages, and encoding like this doesn't seem all that regular to me. The decoding does, of course; but I'm not so sure about the encoding.
Here is a Non-Regex way given the data that you provided. I'm not sure of any edge cases, right now, that would trip this code up. If so, I'll update accordingly.
string myString1 = "ABABABABAB";
string myString2 = "ABCDABCDABCD";
string myString3 = "ABCAABCAABCAABCA";
string myString4 = "ABABACAC";
string myString5 = "AAAAABBBBB";
CountGroupOccurrences(myString1, "AB");
CountGroupOccurrences(myString2, "ABCD");
CountGroupOccurrences(myString3, "ABCA");
CountGroupOccurrences(myString4, "AB", "AC");
CountGroupOccurrences(myString5, "A", "B");
CountGroupOccurrences() looks like the following:
private static void CountGroupOccurrences(string str, params string[] patterns)
{
string result = string.Empty;
while (str.Length > 0)
{
foreach (string pattern in patterns)
{
int count = 0;
int index = str.IndexOf(pattern);
while (index > -1)
{
count++;
str = str.Remove(index, pattern.Length);
index = str.IndexOf(pattern);
}
result += string.Format("{0}({1})", count, pattern);
}
}
Console.WriteLine(result);
}
Results:
5(AB)
3(ABCD)
4(ABCA)
2(AB)2(AC)
5(A)5(B)
UPDATE
This worked with Regex
private static void CountGroupOccurrences(string str, params string[] patterns)
{
string result = string.Empty;
foreach (string pattern in patterns)
{
result += string.Format("{0}({1})", Regex.Matches(str, pattern).Count, pattern);
}
Console.WriteLine(result);
}

Getting substring between two separators in an arbitrary position

I have following string:
string source = "Test/Company/Business/Department/Logs.tvs/v1";
The / character is the separator between various elements in the string. I need to get the last two elements of the string. I have following code for this purpose. This works fine. Is there any faster/simpler code for this?
CODE
static void Main()
{
string component = String.Empty;
string version = String.Empty;
string source = "Test/Company/Business/Department/Logs.tvs/v1";
if (!String.IsNullOrEmpty(source))
{
String[] partsOfSource = source.Split('/');
if (partsOfSource != null)
{
if (partsOfSource.Length > 2)
{
component = partsOfSource[partsOfSource.Length - 2];
}
if (partsOfSource.Length > 1)
{
version = partsOfSource[partsOfSource.Length - 1];
}
}
}
Console.WriteLine(component);
Console.WriteLine(version);
Console.Read();
}
Why no regular expression? This one is fairly easy:
.*/(?<component>.*)/(?<version>.*)$
You can even label your groups so for your match all you need to do is:
component = myMatch.Groups["component"];
version = myMatch.Groups["version"];
The following should be faster, as it only scans as much of the string as it needs to to find two / and it doesn't bother splitting up the whole string:
string component = "";
string version = "";
string source = "Test/Company/Business/Department/Logs.tvs/v1";
int last = source.LastIndexOf('/');
if (last != -1)
{
int penultimate = source.LastIndexOf('/', last - 1);
version = source.Substring(last + 1);
component = source.Substring(penultimate + 1, last - penultimate - 1);
}
That said, as with all performance questions: profile! Try the two side-by-side with a big list of real-life inputs and see which is fastest.
(Also, this will leave empty strings rather than throw an exception if there is no slash in the input... but throw if source is null, lazy me.)
Your approach is the most suitable one given that your are looking for substrings at a particular index. A LINQ expression to do the same in this case will likely not improve the code or its readability.
For reference, there is some great information from Microsoft here on working with strings and LINQ. In particular see the article here which covers some examples with both LINQ and RegEx.
EDIT: +1 For Matt's named group within RegEx approach... that's the nicest solution I've seen.
Your code mostly looks fine. A couple of points to note:
String.Split() will never return null, so you don't need the null check on it.
If the source string has fewer than two / characters, how would you deal with that? (The Original Post was updated to address this)
Do you really want to just output empty strings if your source string is null or empty (or invalid)? If you have specific expectations about the nature of the input, you may want to consider failing fast when those expectations are not met.
You could try something like this but I doubt it would be much faster. You could do some meassurements with System.Diagnostics.StopWatch to see if you feel the need.
string source = "Test/Company/Business/Department/Logs.tvs/v1";
int index1 = source.LastIndexOf('/');
string last = source.Substring(index1 + 1);
string substring = source.Substring(0, index1);
int index2 = substring.LastIndexOf('/');
string secondLast = substring.Substring(index2 + 1);
I would try
string source = "Test/Company/Business/Department/Logs.tvs/v1";
var components = source.Split('/').Reverse().Take(2);
String last = string.Empty;
var enumerable = components as string[] ?? components.ToArray();
if (enumerable.Count() == 2)
last = enumerable.FirstOrDefault();
var secondLast = enumerable.LastOrDefault();
Hope this will help
you can retrieve the last two words using the process as below:
string source = "Test/Company/Business/Department/Logs.tvs/v1";
String[] partsOfSource = source.Split('/');
if(partsOfSourch.length>2)
for(int i=partsOfSourch.length-2;i<=partsOfSource.length-1;i++)
console.writeline(partsOfSource[i]);

Best way to search for many guids at once in a string?

I have before me the problem of searching for 90,000 GUIDs in a string. I need to get all instances of each GUID. Technically, I need to replace them too, but that's another story.
Currently, I'm searching for each one individually, using a regex. But it occurred to me that I could get better performance by searching for them all together. I've read about tries in the past and have never used one but it occurred to me that I could construct a trie of all 90,000 GUIDs and use that to search.
Alternatively, perhaps there is an existing library in .NET that can do this. It crossed my mind that there is no reason why I shouldn't be able to get good performance with just a giant regex, but this appears not to work.
Beyond that, perhaps there is some clever trick I could use relating to GUID structure to get better results.
This isn't really a crucial problem for me, but I thought I might be able to learn something.
Have a look at the Rabin-Karp string search algorithm. It's well suited for multi-pattern searching in a string:
http://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_string_search_algorithm#Rabin.E2.80.93Karp_and_multiple_pattern_search
You will not get good performance with RegEx because it's performance is inherently poor. Additionally, if all the GUID's share the same format you should only need one RegEx. and regex.Replace(input, replacement); would do it.
If you have the list of guids in memory already the performance would be better by looping over that list and doing calling String.Replace like so
foreach(string guid in guids)
inputString.replace(guid, replacement);
I developed a method for replacing a large number of strings a while back, that may be useful:
A better way to replace many strings - obfuscation in C#
Another option would be to use a regular expression to find all GUIDs in the string, then loop through them and check if each is part of your set of GUIDs.
Basic example, using a Dictionary for fast lookup of the GUIDs:
Dictionary<string, string> guids = new Dictionary<string, string>();
guids.Add("3f74a071-54fc-10de-0476-a6b991f0be76", "(replacement)");
string text = "asdf 3f74a071-54fc-10de-0476-a6b991f0be76 lkaq2hlqwer";
text = Regex.Replace(text, #"[\da-f]{8}-[\da-f]{4}-[\da-f]{4}-[\da-f]{4}-[\da-f]{12}", m => {
string replacement;
if (guids.TryGetValue(m.Value, out replacement)) {
return replacement;
} else {
return m.Value;
}
});
Console.WriteLine(text);
Output:
asdf (replacement) lkaq2hlqwer
OK, this looks good. So to be clear here is the original code, which took 65s to run on the example string:
var unusedGuids = new HashSet<Guid>(oldToNewGuid.Keys);
foreach (var guid in oldToNewGuid) {
var regex = guid.Key.ToString();
if (!Regex.IsMatch(xml, regex))
unusedGuids.Add(guid.Key);
else
xml = Regex.Replace(xml, regex, guid.Value.ToString());
}
The new code is as follows and takes 6.7s:
var unusedGuids = new HashSet<Guid>(oldToNewGuid.Keys);
var guidHashes = new MultiValueDictionary<int, Guid>();
foreach (var guid in oldToNewGuid.Keys) {
guidHashes.Add(guid.ToString().GetHashCode(), guid);
}
var indices = new List<Tuple<int, Guid>>();
const int guidLength = 36;
for (int i = 0; i < xml.Length - guidLength; i++) {
var substring = xml.Substring(i, guidLength);
foreach (var value in guidHashes.GetValues(substring.GetHashCode())) {
if (value.ToString() == substring) {
unusedGuids.Remove(value);
indices.Add(new Tuple<int, Guid>(i, value));
break;
}
}
}
var builder = new StringBuilder();
int start = 0;
for (int i = 0; i < indices.Count; i++) {
var tuple = indices[i];
var substring = xml.Substring(start, tuple.Item1 - start);
builder.Append(substring);
builder.Append(oldToNewGuid[tuple.Item2].ToString());
start = tuple.Item1 + guidLength;
}
builder.Append(xml.Substring(start, xml.Length - start));
xml = builder.ToString();

How to split a string while preserving line endings?

I have a block of text and I want to get its lines without losing the \r and \n at the end. Right now, I have the following (suboptimal code):
string[] lines = tbIn.Text.Split('\n')
.Select(t => t.Replace("\r", "\r\n")).ToArray();
So I'm wondering - is there a better way to do it?
Accepted answer
string[] lines = Regex.Split(tbIn.Text, #"(?<=\r\n)(?!$)");
The following seems to do the job:
string[] lines = Regex.Split(tbIn.Text, #"(?<=\r\n)(?!$)");
(?<=\r\n) uses 'positive lookbehind' to match after \r\n without consuming it.
(?!$) uses negative lookahead to prevent matching at the end of the input and so avoids a final line that is just an empty string.
Something along the lines of using this regular expression:
[^\n\r]*\r\n
Then use Regex.Matches().
The problem is you need Group(1) out of each match and create your string list from that. In Python you'd just use the map() function. Not sure the best way to do it in .NET, you take it from there ;-)
Dmitri, your solution is actually pretty compact and straightforward. The only thing more efficient would be to keep the string-splitting characters in the generated array, but the APIs simply don't allow for that. As a result, every solution will require iterating over the array and performing some kind of modification (which in C# means allocating new strings every time). I think the best you can hope for is to not re-create the array:
string[] lines = tbIn.Text.Split('\n');
for (int i = 0; i < lines.Length; ++i)
{
lines[i] = lines[i].Replace("\r", "\r\n");
}
... but as you can see that looks a lot more cumbersome! If performance matters, this may be a bit better. If it really matters, you should consider manually parsing the string by using IndexOf() to find the '\r's one at a time, and then create the array yourself. This is significantly more code, though, and probably not necessary.
One of the side effects of both your solution and this one is that you won't get a terminating "\r\n" on the last line if there wasn't one already there in the TextBox. Is this what you expect? What about blank lines... do you expect them to show up in 'lines'?
If you are just going to replace the newline (\n) then do something like this:
string[] lines = tbIn.Text.Split('\n')
.Select(t => t + "\r\n").ToArray();
Edit: Regex.Replace allows you to split on a string.
string[] lines = Regex.Split(tbIn.Text, "\r\n")
.Select(t => t + "\r\n").ToArray();
As always, extension method goodies :)
public static class StringExtensions
{
public static IEnumerable<string> SplitAndKeep(this string s, string seperator)
{
string[] obj = s.Split(new string[] { seperator }, StringSplitOptions.None);
for (int i = 0; i < obj.Length; i++)
{
string result = i == obj.Length - 1 ? obj[i] : obj[i] + seperator;
yield return result;
}
}
}
usage:
string text = "One,Two,Three,Four";
foreach (var s in text.SplitAndKeep(","))
{
Console.WriteLine(s);
}
Output:
One,
Two,
Three,
Four
You can achieve this with a regular expression. Here's an extension method with it:
public static string[] SplitAndKeepDelimiter(this string input, string delimiter)
{
MatchCollection matches = Regex.Matches(input, #"[^" + delimiter + "]+(" + delimiter + "|$)", RegexOptions.Multiline);
string[] result = new string[matches.Count];
for (int i = 0; i < matches.Count ; i++)
{
result[i] = matches[i].Value;
}
return result;
}
I'm not sure if this is a better solution. Yours is very compact and simple.

Categories