I have .txt file.
I have to remove all the longest words for each line.
The main method where I'm looking for longest word is :
/// <summary>
/// Finds longest word in line
/// </summary>
/// <param name="eil">Line</param>
/// <param name="skyr">Punctuation</param>
/// <returns>Returns longest word for line</returns>
static string[] RastiIlgZodiEil(string eil, char[] skyr)
{
string[] zodIlg = new string[100];
for (int k = 0; k < 100; k++)
{
zodIlg[k] = " ";
}
int kiek = 0;
string[] parts = eil.Split(skyr,
StringSplitOptions.RemoveEmptyEntries);
int i = 0;
foreach (string zodis in parts)
{
if (zodis.Length > zodIlg[i].Length)
{
zodIlg[kiek] = zodis;
kiek++;
i++;
}
else
{
i++;
}
}
return zodIlg;
}
EDIT : Method that reads the .txt file and uses the previous method to replace line with a new line that is configured (by replacing the word that has to be deleted with an empty string).
/// <summary>
/// Finds longest words for each line and then replaces them with
/// empty string
/// </summary>
/// <param name="fv">File name</param>
/// <param name="skyr">Punctuation</param>
static void RastiIlgZodiFaile(string fv, string fvr, char[] skyr)
{
using (var fr = new StreamWriter(fvr, true,
System.Text.Encoding.GetEncoding(1257)))
{
using (StreamReader reader = new StreamReader(fv,
Encoding.GetEncoding(1257)))
{
int n = 0;
string line;
while (((line = reader.ReadLine()) != null))
{
n++;
if (line.Length > 0)
{
string[] temp = RastiIlgZodiEil(line, skyr);
foreach (string t in temp)
{
line = line.Replace(t, "");
}
fr.WriteLine(line);
}
}
}
}
}
You could remove the longest word(s) from each line with:
static string RemoveLongestWord(string eil, char[] skyr)
{
string[] parts = eil.Split(skyr, StringSplitOptions.RemoveEmptyEntries);
if (parts.Length == 0) return eil; // no words on this line, nothing to remove
int longestLength = parts.Max(s => s.Length);
var longestWords = parts.Where(s => s.Length == longestLength);
foreach(string word in longestWords)
{
eil = eil.Replace(word, "");
}
return eil;
}
Just pass each line to the function and you'll get that line back with the longest word removed.
Here's an approach that more closely resembles what you were doing before:
static string[] RastiIlgZodiEil(string eil, char[] skyr)
{
List<string> zodIlg = new List<string>();
string[] parts = eil.Split(skyr, StringSplitOptions.RemoveEmptyEntries);
int maxLength = -1;
foreach (string zodis in parts)
{
if (zodis.Length > maxLength)
{
maxLength = zodis.Length;
}
}
foreach (string zodis in parts)
{
if (zodis.Length == maxLength)
{
zodIlg.Add(zodis);
}
}
return zodIlg.Distinct().ToArray();
}
The first pass finds the longest length. The second pass adds all words that match that length to a List<string>. Finally, we call Distinct() to remove duplicates from the list and return an array version of it.
Here is another solution. Read all lines from a file and split each line using a loop. Aggregate function is used to get the highest length to filter the data further.
static void Main(string[] args)
{
var data = File.ReadAllLines(Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "test.txt"));
for (int i = 0; i < data.Length; i++)
{
Console.WriteLine(data[i]);
var split = data[i].Split(' ');
int length = split.Aggregate((a, b) => a.Length >= b.Length ? a : b).Length;
data[i] = string.Join(' ', split.Where(w => w.Length < length));
Console.WriteLine(data[i]);
}
Console.Read();
}
I'm working on reversing a sentence. I'm able to do it. But I'm not sure, how to reverse the word without changing the special characters positions. I'm using regex but as soon as it finds the special characters it's stopping the reversal of the word.
Following is the code:
Console.WriteLine("Enter:");
string w = Console.ReadLine();
string rw = String.Empty;
String[] arr = w.Split(' ');
var regexItem = new Regex("^[a-zA-Z0-9]*$");
StringBuilder appendString = new StringBuilder();
for (int i = 0; i < arr.Length; i++)
{
char[] chararray = arr[i].ToCharArray();
for (int j = chararray.Length - 1; j >= 0; j--)
{
if (regexItem.IsMatch(rw))
{
rw = appendString.Append(chararray[j]).ToString();
}
}
sb.Append(' ');
}
Console.WriteLine(rw);
Console.ReadLine();
Example : Input
Marshall! Hello.
Expected output
llahsram! olleh.
A basic solution with regex and LINQ. Try it online.
public static void Main()
{
Console.WriteLine("Marshall! Hello.");
Console.WriteLine(Reverse("Marshall! Hello."));
}
public static string Reverse(string source)
{
// we split by groups to keep delimiters
var parts = Regex.Split(source, @"([^a-zA-Z0-9])");
// if we got a group of valid characters
var results = parts.Select(x => x.All(char.IsLetterOrDigit)
// we reverse it
? new string(x.Reverse().ToArray())
// or we keep the delimiters as is
: x);
// then we concat all of them
return string.Concat(results);
}
The same solution without LINQ. Try it online.
public static void Main()
{
Console.WriteLine("Marshall! Hello.");
Console.WriteLine(Reverse("Marshall! Hello."));
}
public static bool IsLettersOrDigits(string s)
{
foreach (var c in s)
{
if (!char.IsLetterOrDigit(c))
{
return false;
}
}
return true;
}
public static string Reverse(char[] s)
{
Array.Reverse(s);
return new string(s);
}
public static string Reverse(string source)
{
var parts = Regex.Split(source, @"([^a-zA-Z0-9])");
var results = new List<string>();
foreach(var x in parts)
{
results.Add(IsLettersOrDigits(x)
? Reverse(x.ToCharArray())
: x);
}
return string.Concat(results);
}
This is a solution without LINQ. I wasn't sure about what are considered special characters.
string sentence = "Marshall! Hello.";
List<string> words = sentence.Split(' ').ToList();
List<string> reversedWords = new List<string>();
foreach (string word in words)
{
char[] arr = new char[word.Length];
for( int i=0; i<word.Length; i++)
{
if(!Char.IsLetterOrDigit((word[i])))
{
for ( int x=0; x< i; x++)
{
arr[x] = arr[x + 1];
}
arr[i] = word[i];
}
else
{
arr[word.Length - 1 - i] = word[i];
}
}
reversedWords.Add(new string(arr));
}
string reversedSentence = string.Join(" ", reversedWords);
Console.WriteLine(reversedSentence);
And this is the output:
Updated Output = llahsraM! olleH.
Here is a non-regex version that does what you want:
var sentence = "Hello, john!";
var parts = sentence.Split(' ');
var reversed = new StringBuilder();
var charPositions = sentence.Select((c, idx) => new { Char = c, Index = idx })
.Where(_ => !char.IsLetterOrDigit(_.Char));
for (int i = 0; i < parts.Length; i++)
{
var chars = parts[i].ToCharArray();
for (int j = chars.Length - 1; j >= 0; j--)
{
if (char.IsLetterOrDigit(chars[j]))
{
reversed.Append(chars[j]);
}
}
}
foreach (var ch in charPositions)
{
reversed.Insert(ch.Index, ch.Char);
}
// olleH, nhoj!
Console.WriteLine(reversed.ToString());
Basically the trick is to remember the position of special (i.e. non letter or digit) characters and insert them at the end to those positions.
This solution uses neither LINQ nor Regex. It may not be efficient, but it works properly for small strings.
// This will reverse the string and special characters will just stay there.
public string ReverseString(string rString)
{
StringBuilder ss = new StringBuilder(rString);
int y = 0;
// The idea is to swap values. Like swapping first value with last one. It will keep swapping unless it reaches at the middle of the string where no swapping will be needed.
// This first loop is to detect first values.
for(int i=rString.Length-1;i>=0;i--)
{
// This condition is to check if the values is String or not. If it is not string then it is considered as special character which will just stay there at same old position.
if(Char.IsLetter(Convert.ToChar(rString.Substring(i,1))))
{
// This is second loop which is starting from end to swap values from end with first.
for (int k = y; k < rString.Length; k++)
{
// Again checking last values if values are string or not.
if (Char.IsLetter(Convert.ToChar(rString.Substring(k, 1))))
{
// This is swapping. So st1 is First value in that string
// st2 is the last item in that string
char st1 = Convert.ToChar(rString.Substring(k, 1));
char st2 = Convert.ToChar(rString.Substring(i, 1));
//This is swapping. So last item will go to first position and first item will go to last position, To make sure string is reversed.
// Remember when the string value is Special Character, swapping will move forward without swapping.
// Write by index; IndexOf would hit the first occurrence of a repeated letter.
ss[i] = st1;
ss[k] = st2;
y++;
// When the swapping is done for first 2 items. The loop will stop to change the values.
break;
}
else
{
// This is just increment if value was Special character.
y++;
}
}
}
}
return ss.ToString();
}
Thanks!
All string.Split methods seem to return an array of strings (string[]).
I'm wondering if there is a lazy variant that returns an IEnumerable<string>, so that for large strings (or an infinite-length IEnumerable<char>), when one is only interested in the first few subsequences, one saves computational effort as well as memory. It could also be useful if the string is constructed by a device or program (network, terminal, pipes) and the entire string is thus not necessarily fully available right away, so that one can already process the first occurrences.
Is there such method in the .NET framework?
You could easily write one:
public static class StringExtensions
{
public static IEnumerable<string> Split(this string toSplit, params char[] splits)
{
if (string.IsNullOrEmpty(toSplit))
yield break;
StringBuilder sb = new StringBuilder();
foreach (var c in toSplit)
{
if (splits.Contains(c))
{
yield return sb.ToString();
sb.Clear();
}
else
{
sb.Append(c);
}
}
if (sb.Length > 0)
yield return sb.ToString();
}
}
Clearly, I haven't tested it for parity with string.split, but I believe it should work just about the same.
As Servy notes, this doesn't split on strings. That's not as simple, and not as efficient, but it's basically the same pattern.
public static IEnumerable<string> Split(this string toSplit, string[] separators)
{
if (string.IsNullOrEmpty(toSplit))
yield break;
StringBuilder sb = new StringBuilder();
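foreach (var c in toSplit)
{
// Append first, then test, so the character that completes a separator is not lost.
sb.Append(c);
var s = sb.ToString();
var sep = separators.FirstOrDefault(i => !string.IsNullOrEmpty(i) && s.EndsWith(i, StringComparison.Ordinal));
if (sep != null)
{
yield return s.Substring(0, s.Length - sep.Length);
sb.Clear();
}
}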
if (sb.Length > 0)
yield return sb.ToString();
}
There is no such thing built-in. Regex.Matches is lazy if I interpret the decompiled code correctly. Maybe you can make use of that.
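If Regex.Matches is indeed evaluated lazily when enumerated, as suggested above, a token enumerator could be sketched like this. The class and method names are hypothetical, and the pattern describes the tokens to keep rather than the separators:

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexTokens
{
    // Hypothetical helper: yields each match of tokenPattern in order.
    // The caller can stop enumerating early, e.g. after the first few tokens.
    public static IEnumerable<string> Enumerate(string input, string tokenPattern)
    {
        foreach (Match match in Regex.Matches(input, tokenPattern))
        {
            yield return match.Value;
        }
    }
}
```

For whitespace-separated words, the pattern @"\S+" would be a natural choice. Avoid touching MatchCollection.Count in such a helper, since counting requires finding every match.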
Or, you simply write your own split function.
Actually, you could imagine most string functions generalized to arbitrary sequences. Often, even sequences of T, not just char. The BCL does not emphasize that generalization at all. There is no Enumerable.Subsequence, for example.
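To illustrate that generalization, here is a minimal sketch of a split over any IEnumerable<T>. SplitWhen and SequenceSplit are hypothetical names, not BCL members. It follows the StringSplitOptions.None convention, so empty segments are kept:

```csharp
using System;
using System.Collections.Generic;

public static class SequenceSplit
{
    // Lazily splits a sequence into segments, cutting at every element
    // for which isSeparator returns true. Separators are not included.
    public static IEnumerable<List<T>> SplitWhen<T>(this IEnumerable<T> source, Func<T, bool> isSeparator)
    {
        var current = new List<T>();
        foreach (T item in source)
        {
            if (isSeparator(item))
            {
                yield return current;
                current = new List<T>();
            }
            else
            {
                current.Add(item);
            }
        }
        // The last segment is emitted even when it is empty.
        yield return current;
    }
}
```

Because the source is only advanced as far as the consumer asks, this also works on an unbounded IEnumerable<char>, as long as the caller stops after a finite number of segments.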
Nothing built-in, but feel free to rip my Tokenize method:
/// <summary>
/// Splits a string into tokens.
/// </summary>
/// <param name="s">The string to split.</param>
/// <param name="isSeparator">
/// A function testing if a code point at a position
/// in the input string is a separator.
/// </param>
/// <returns>A sequence of tokens.</returns>
IEnumerable<string> Tokenize(string s, Func<string, int, bool> isSeparator = null)
{
if (isSeparator == null) isSeparator = (str, i) => !char.IsLetterOrDigit(str, i);
int startPos = -1;
for (int i = 0; i < s.Length; i += char.IsSurrogatePair(s, i) ? 2 : 1)
{
if (!isSeparator(s, i))
{
if (startPos == -1) startPos = i;
}
else if (startPos != -1)
{
yield return s.Substring(startPos, i - startPos);
startPos = -1;
}
}
if (startPos != -1)
{
yield return s.Substring(startPos);
}
}
There is no built-in method to do this as far as I know. But that doesn't mean you can't write one. Here is a sample to give you an idea:
public static IEnumerable<string> SplitLazy(this string str, params char[] separators)
{
List<char> temp = new List<char>();
foreach (var c in str)
{
if (separators.Contains(c))
{
// Emit the pending chunk, if any; the separator itself is never kept.
if (temp.Any())
{
yield return new string(temp.ToArray());
temp.Clear();
}
}
else
{
temp.Add(c);
}
}
if(temp.Any()) { yield return new string(temp.ToArray()); }
}
Of course, this doesn't handle all cases and can be improved.
I wrote this variant, which also supports StringSplitOptions and count.
It behaves the same as string.Split in all test cases I tried.
The nameof operator is specific to C# 6 and can be replaced by "count".
public static class StringExtensions
{
/// <summary>
/// Splits a string into substrings that are based on the characters in an array.
/// </summary>
/// <param name="value">The string to split.</param>
/// <param name="options"><see cref="StringSplitOptions.RemoveEmptyEntries"/> to omit empty array elements from the array returned; or <see cref="StringSplitOptions.None"/> to include empty array elements in the array returned.</param>
/// <param name="count">The maximum number of substrings to return.</param>
/// <param name="separator">A character array that delimits the substrings in this string, an empty array that contains no delimiters, or null. </param>
/// <returns></returns>
/// <remarks>
/// Delimiter characters are not included in the elements of the returned array.
/// If this instance does not contain any of the characters in separator the returned sequence consists of a single element that contains this instance.
/// If the separator parameter is null or contains no characters, white-space characters are assumed to be the delimiters. White-space characters are defined by the Unicode standard and return true if they are passed to the <see cref="Char.IsWhiteSpace"/> method.
/// </remarks>
public static IEnumerable<string> SplitLazy(this string value, int count = int.MaxValue, StringSplitOptions options = StringSplitOptions.None, params char[] separator)
{
if (count <= 0)
{
if (count < 0) throw new ArgumentOutOfRangeException(nameof(count), "Count cannot be less than zero.");
yield break;
}
Func<char, bool> predicate = char.IsWhiteSpace;
if (separator != null && separator.Length != 0)
predicate = (c) => separator.Contains(c);
if (string.IsNullOrEmpty(value) || count == 1 || !value.Any(predicate))
{
yield return value;
yield break;
}
bool removeEmptyEntries = (options & StringSplitOptions.RemoveEmptyEntries) != 0;
int ct = 0;
var sb = new StringBuilder();
for (int i = 0; i < value.Length; ++i)
{
char c = value[i];
if (!predicate(c))
{
sb.Append(c);
}
else
{
if (sb.Length != 0)
{
yield return sb.ToString();
sb.Clear();
}
else
{
if (removeEmptyEntries)
continue;
yield return string.Empty;
}
if (++ct >= count - 1)
{
if (removeEmptyEntries)
while (++i < value.Length && predicate(value[i]));
else
++i;
if (i < value.Length) // include a remainder that is only one character long
{
sb.Append(value, i, value.Length - i);
yield return sb.ToString();
}
yield break;
}
}
}
if (sb.Length > 0)
yield return sb.ToString();
else if (!removeEmptyEntries && predicate(value[value.Length - 1]))
yield return string.Empty;
}
public static IEnumerable<string> SplitLazy(this string value, params char[] separator)
{
return value.SplitLazy(int.MaxValue, StringSplitOptions.None, separator);
}
public static IEnumerable<string> SplitLazy(this string value, StringSplitOptions options, params char[] separator)
{
return value.SplitLazy(int.MaxValue, options, separator);
}
public static IEnumerable<string> SplitLazy(this string value, int count, params char[] separator)
{
return value.SplitLazy(count, StringSplitOptions.None, separator);
}
}
I wanted the functionality of Regex.Split, but in a lazily evaluated form. The code below just runs through all Matches in the input string, and produces the same results as Regex.Split:
public static IEnumerable<string> Split(string input, string pattern, RegexOptions options = RegexOptions.None)
{
// Always compile - we expect many executions
var regex = new Regex(pattern, options | RegexOptions.Compiled);
int currentSplitStart = 0;
var match = regex.Match(input);
while (match.Success)
{
yield return input.Substring(currentSplitStart, match.Index - currentSplitStart);
currentSplitStart = match.Index + match.Length;
match = match.NextMatch();
}
yield return input.Substring(currentSplitStart);
}
Note that using this with the pattern parameter @"\s" will give you the same results as string.Split().
A lazy split that does not create temporary strings.
Each chunk of the string is copied out using String.Substring from mscorlib.
public static IEnumerable<string> LazySplit(this string source, StringSplitOptions stringSplitOptions, params string[] separators)
{
var sourceLen = source.Length;
bool IsSeparator(int index, string separator)
{
var separatorLen = separator.Length;
if (sourceLen < index + separatorLen)
{
return false;
}
for (var i = 0; i < separatorLen; i++)
{
if (source[index + i] != separator[i])
{
return false;
}
}
return true;
}
var indexOfStartChunk = 0;
for (var i = 0; i < source.Length; i++)
{
foreach (var separator in separators)
{
if (IsSeparator(i, separator))
{
// Skip empty chunks only when RemoveEmptyEntries is requested.
if (indexOfStartChunk != i || stringSplitOptions != StringSplitOptions.RemoveEmptyEntries)
{
yield return source.Substring(indexOfStartChunk, i - indexOfStartChunk);
}
i += separator.Length;
indexOfStartChunk = i--;
break;
}
}
}
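// Emit the final chunk, including the case where no separator was found at all.
if (indexOfStartChunk < sourceLen || stringSplitOptions != StringSplitOptions.RemoveEmptyEntries)
{
yield return source.Substring(indexOfStartChunk, sourceLen - indexOfStartChunk);
}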
}
I was able to count the number of characters in the textbox. But I do not know how to count the number of words and print them to a label.
Assuming words are separated by spaces, try this:
int count = YourtextBoxId.Text
.Split(new char[] {' '},StringSplitOptions.RemoveEmptyEntries)
.Count();
Okay, as suggested by @Tim Schmelter, if you have different kinds of separators apart from a blank space, you can extend this:
int count = YourtextBoxId.Text
.Split(new char[] {' ','.',':'},StringSplitOptions.RemoveEmptyEntries)
.Count();
The first point is the definition of a "word": you should know it before writing any code.
So if a word is defined as a sequence of letters, you can use the following code to count the words:
public int WordsCount(string text)
{
if (string.IsNullOrEmpty(text))
{
return 0;
}
var count = 0;
var word = false;
foreach (char symbol in text)
{
if (!char.IsLetter(symbol))
{
word = false;
continue;
}
if (word)
{
continue;
}
count++;
word = true;
}
return count;
}
You can use Regex.Matches() just like the following example:
using System;
using System.Text.RegularExpressions;
class Program
{
static void Main()
{
const string t1 = "To be or not to be, that is the question.";
Console.WriteLine(WordCounting.CountWords1(t1));
Console.WriteLine(WordCounting.CountWords2(t1));
const string t2 = "Mary had a little lamb.";
Console.WriteLine(WordCounting.CountWords1(t2));
Console.WriteLine(WordCounting.CountWords2(t2));
}
}
/// <summary>
/// Contains methods for counting words.
/// </summary>
public static class WordCounting
{
/// <summary>
/// Count words with Regex.
/// </summary>
public static int CountWords1(string s)
{
MatchCollection collection = Regex.Matches(s, @"[\S]+");
return collection.Count;
}
/// <summary>
/// Count word with loop and character tests.
/// </summary>
public static int CountWords2(string s)
{
int c = 0;
for (int i = 1; i < s.Length; i++)
{
if (char.IsWhiteSpace(s[i - 1]) == true)
{
if (char.IsLetterOrDigit(s[i]) == true ||
char.IsPunctuation(s[i]))
{
c++;
}
}
}
if (s.Length > 2)
{
c++;
}
return c;
}
}
You could try to split the string in the textbox using the spaces.
string[] words = textbox.Text.Split(' '); // <-- Space character
int numberOfWords = words.Length;
label.Text = "Number of words are: " + numberOfWords;
MyTextbox.Text.Split(' ').Count()
This is working for me..
var noOfWords = Regex.Replace(textBox1.Text, "[^a-zA-Z0-9_]+", " ")
.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
.Length;
I have a large string separated by newline characters. This string contains 100 lines. I want to split these lines into smaller chunks, say chunks of 20 lines, also based on the newline character.
Let's say the string variable is like this,
Line1 This is line2 Line3 is here I am Line4
Now I want to split this large string variable into small chunks of 2. The result should be 2 strings as,
Line1 This is line2
Line3 is here I am Line4
Using Split function, I am not getting the expected results. Please help me in achieving this.
Thanks in advance,
Vijay
The simple approach (Split on Environment.NewLine, then loop and append):
public static List<string> GetStringSegments(string originalString, int linesPerSegment)
{
List<string> segments = new List<string>();
string[] allLines = originalString.Split(new string[] {Environment.NewLine}, StringSplitOptions.RemoveEmptyEntries);
StringBuilder sb = new StringBuilder();
int linesProcessed = 0;
for (int i = 0; i < allLines.Length; i++)
{
sb.AppendLine(allLines[i]);
linesProcessed++;
if (linesProcessed == linesPerSegment
|| i == allLines.Length-1)
{
segments.Add(sb.ToString());
sb.Clear();
linesProcessed = 0;
}
}
return segments;
}
The above approach is slightly inefficient since it requires splitting the string first into individual lines, which creates unnecessary strings. A string of 1000 lines will create an array of 1000 strings. We can improve this if we just scan the string and search for \n:
public static List<string> GetStringSegments(string original, int linesPerSegment)
{
List<string> segments = new List<string>();
int startIndex = 0;
int newLinesEncountered = 0;
for (int i = 0; i < original.Length; i++)
{
if (original[i] == '\n')
{
newLinesEncountered++;
}
if (newLinesEncountered == linesPerSegment
|| i == original.Length - 1)
{
segments.Add(original.Substring(startIndex, (i - startIndex + 1)));
startIndex = i + 1;
newLinesEncountered = 0;
}
}
return segments;
}
You can use something like the batch operator from http://www.make-awesome.com/2010/08/batch-or-partition-a-collection-with-linq
string s = "[YOUR DATA]";
var lines = s.Split(new[]{Environment.NewLine}, StringSplitOptions.RemoveEmptyEntries);
foreach(var batch in lines.Batch(20))
{
foreach (var batchLine in batch)
{
Console.WriteLine(batchLine);
}
}
static class LinqEx
{
// from http://www.make-awesome.com/2010/08/batch-or-partition-a-collection-with-linq
public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> collection,
int batchSize)
{
List<T> nextbatch = new List<T>(batchSize);
foreach (T item in collection)
{
nextbatch.Add(item);
if (nextbatch.Count == batchSize)
{
yield return nextbatch;
nextbatch = new List<T>(batchSize);
}
}
if (nextbatch.Count > 0)
yield return nextbatch;
}
}
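If adding a custom Batch operator is undesirable, similar grouping can be sketched with standard LINQ operators only, by grouping lines on their index divided by the chunk size. LineChunker and ChunkLines are hypothetical names. Joining each chunk with "\n" is an assumption about the desired output format:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LineChunker
{
    // Splits text into lines, then joins every linesPerChunk lines into one string.
    public static List<string> ChunkLines(string text, int linesPerChunk)
    {
        if (linesPerChunk <= 0) throw new ArgumentOutOfRangeException(nameof(linesPerChunk));
        return text
            .Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries)
            .Select((line, index) => new { line, index })
            // Lines 0..n-1 get key 0, lines n..2n-1 get key 1, and so on.
            .GroupBy(x => x.index / linesPerChunk, x => x.line)
            .Select(group => string.Join("\n", group))
            .ToList();
    }
}
```

Unlike Batch, this version is not lazy: GroupBy has to read the whole line array before it yields the first group, so it trades streaming for brevity.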
As several people mentioned, using string.Split will split the whole string into memory, which might be an allocation-heavy operation. This is why we have the TextReader class and its descendants, which should provide better memory performance, and might also be clearer, logically:
var newStringCollection = new List<StringBuilder>();
using (var reader = new StringReader(myString))
{
StringWriter newStringWriter = null;
int lineCounter = 0;
string line;
while ((line = reader.ReadLine()) != null)
{
if (string.IsNullOrEmpty(line))
{
continue; // skip blank lines
}
if (lineCounter % 20 == 0)
{
// Start a new chunk every 20 lines.
var newString = new StringBuilder();
newStringWriter = new StringWriter(newString);
newStringCollection.Add(newString);
}
newStringWriter.WriteLine(line);
lineCounter++;
}
}
We're using the StringReader to read our big string, one line at a time. And the corresponding StringWriter writes those lines to the new string, one line at a time. After every 20 lines, we start a new StringBuilder (and the appropriate StringWriter wrapper).
Split the string by newline.
Then merge the desired number of lines together while iterating over them.
string s = "Line1\nThis is line2 \nLine3 is here\nI am Line4";
string[] str = s.Split('\n');
List<String> str1 = new List<String>();
for(int i=0; i<str.Length; i+=2)
{
string ss = str[i];
if(i+1 <str.Length)
ss += '\n' + str[i+1];
str1.Add(ss);
}
str = str1.ToArray();
The if condition is checked inside the loop because the length of str may be odd.
var strArray = myLongString.Split('\n').ToList();
var skip = 0;
var take = 20;
var chunk = strArray.Skip(skip).Take(take).ToList();
while (chunk.Count > 0)
{
foreach (var line in chunk)
{
// use line string
}
skip += take; // advance by a whole chunk, not by a single line
chunk = strArray.Skip(skip).Take(take).ToList();
}
I have a field in my database that holds input from an html input. So I have in my db column data. What I need is to be able to extract this and display a short version of it as an intro. Maybe even the first paragraph if possible.
Something like this maybe?
public string Get(string text, int maxWordCount)
{
int wordCounter = 0;
int stringIndex = 0;
char[] delimiters = new[] { '\n', ' ', ',', '.' };
while (wordCounter < maxWordCount)
{
stringIndex = text.IndexOfAny(delimiters, stringIndex + 1);
if (stringIndex == -1)
return text;
++wordCounter;
}
return text.Substring(0, stringIndex);
}
It's quite simplified and doesn't handle multiple delimiters in a row (for instance ", "). You might just want to use space as the only delimiter.
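One way to cope with consecutive delimiters is to skip each run of delimiters explicitly before counting the next word. The following is only a sketch with the same delimiter set as above. TextIntro and FirstWords are hypothetical names:

```csharp
using System;

public static class TextIntro
{
    private static readonly char[] Delimiters = { '\n', ' ', ',', '.' };

    // Returns the prefix of text that contains at most maxWordCount words.
    // A run of several delimiters counts as a single word boundary.
    public static string FirstWords(string text, int maxWordCount)
    {
        if (maxWordCount <= 0) return string.Empty;
        int words = 0;
        int i = 0;
        while (i < text.Length)
        {
            // Skip any run of delimiters.
            while (i < text.Length && Array.IndexOf(Delimiters, text[i]) >= 0) i++;
            if (i >= text.Length) break;
            // Consume one word.
            while (i < text.Length && Array.IndexOf(Delimiters, text[i]) < 0) i++;
            words++;
            if (words == maxWordCount) return text.Substring(0, i);
        }
        return text; // fewer words than requested
    }
}
```

With this version the cut always falls directly after the last kept word, so trailing punctuation such as a closing period is dropped from the intro.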
If you want to get just the first paragraph, simply search after "\r\n\r\n" <-- two line breaks:
public string GetFirstParagraph(string text)
{
int pos = text.IndexOf("\r\n\r\n");
return pos == -1 ? text : text.Substring(0, pos);
}
Edit:
A very simplistic way to strip HTML:
return Regex.Replace(text, @"<(.|\n)*?>", string.Empty);
The Html Agility Pack is usually the recommended way to strip out the HTML. After that it would just be a matter of doing a String.Substring to get the bit of it that you want.
If you need to get the first 2000 words, I suppose you could call IndexOf in a loop to find whitespace 2000 times, then use the resulting index in the call to Substring.
Edit: Add sample method
public int GetIndex(string str, int numberWanted)
{
int count = 0;
int index = 1;
for (; index < str.Length; index++)
{
if (char.IsWhiteSpace(str[index - 1]) == true)
{
if (char.IsLetterOrDigit(str[index]) == true ||
char.IsPunctuation(str[index]))
{
count++;
if (count >= numberWanted)
break;
}
}
}
return index;
}
And call it like:
string wordList = "This is a list of a lot of words";
int i = GetIndex(wordList, 5);
string result = wordList.Substring(0, i);
Once you have your string you would have to count your words. I assume space is a delimiter for words, so the following code should find the first 2000 words in a string (or break out if there are fewer words).
string myString = "la la la";
int lastPosition = 0;
for (int i = 0; i < 2000; i++)
{
int position = myString.IndexOf(' ', lastPosition + 1);
if (position == -1) break;
lastPosition = position;
}
string firstThousandWords = myString.Substring(0, lastPosition);
You can change IndexOf to IndexOfAny to support more characters as delimiters.
I had the same problem and combined a few Stack Overflow answers into this class. It uses the HtmlAgilityPack which is a better tool for the job. Call:
Words(string html, int n)
To get n words
using HtmlAgilityPack;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace UmbracoUtilities
{
public class Text
{
/// <summary>
/// Return the first n words in the html
/// </summary>
/// <param name="html"></param>
/// <param name="n"></param>
/// <returns></returns>
public static string Words(string html, int n)
{
string words = html, n_words;
words = StripHtml(html);
n_words = GetNWords(words, n);
return n_words;
}
/// <summary>
/// Returns the first n words in text
/// Assumes text is not a html string
/// http://stackoverflow.com/questions/13368345/get-first-250-words-of-a-string
/// </summary>
/// <param name="text"></param>
/// <param name="n"></param>
/// <returns></returns>
public static string GetNWords(string text, int n)
{
StringBuilder builder = new StringBuilder();
//remove multiple spaces
//http://stackoverflow.com/questions/1279859/how-to-replace-multiple-white-spaces-with-one-white-space
string cleanedString = System.Text.RegularExpressions.Regex.Replace(text, @"\s+", " ");
IEnumerable<string> words = cleanedString.Split().Take(n + 1);
foreach (string word in words)
builder.Append(" " + word);
return builder.ToString();
}
/// <summary>
/// Returns a string of html with tags removed
/// </summary>
/// <param name="html"></param>
/// <returns></returns>
public static string StripHtml(string html)
{
HtmlDocument document = new HtmlDocument();
document.LoadHtml(html);
var root = document.DocumentNode;
var stringBuilder = new StringBuilder();
foreach (var node in root.DescendantsAndSelf())
{
if (!node.HasChildNodes)
{
string text = node.InnerText;
if (!string.IsNullOrEmpty(text))
stringBuilder.Append(" " + text.Trim());
}
}
return stringBuilder.ToString();
}
}
}
Merry Christmas!