Read text file word-by-word using LINQ - c#

I am learning LINQ, and I want to read a text file (let's say an e-book) word by word using LINQ.
This is what I could come up with:
static void Main()
{
string[] content = File.ReadAllLines("text.txt");
var query = (from c in content
select content);
foreach (var line in content)
{
Console.Write(line+"\n");
}
}
This reads the file line by line. If I change ReadAllLines to ReadAllText, the file is read letter by letter.
Any ideas?

string[] content = File.ReadAllLines("text.txt");
var words = content.SelectMany(line => line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries));
foreach(string word in words)
{
}
You'll need to add whatever other whitespace characters you need to the separator list. Using StringSplitOptions to deal with consecutive whitespace is cleaner than the Where clause I originally used.
In .NET 4 and later you can use File.ReadLines for lazy evaluation and thus lower RAM usage when working on large files.
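A minimal sketch of the lazy variant, assuming a space and a tab are the only separators you care about. The class and method names (WordReader, SplitWords, ReadWords) are made up for illustration; only File.ReadLines, SelectMany and String.Split come from the framework.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class WordReader
{
    // Separator characters (an assumption; extend to suit your text).
    private static readonly char[] Separators = { ' ', '\t' };

    // Lazily flattens a sequence of lines into a sequence of words.
    public static IEnumerable<string> SplitWords(IEnumerable<string> lines)
    {
        return lines.SelectMany(
            line => line.Split(Separators, StringSplitOptions.RemoveEmptyEntries));
    }

    // File.ReadLines streams the file line by line instead of loading it whole.
    public static IEnumerable<string> ReadWords(string path)
    {
        return SplitWords(File.ReadLines(path));
    }
}
```

Calling WordReader.ReadWords("text.txt") returns a sequence that only touches the file as you enumerate it, so Take(25) stops reading after the first 25 words.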

string str = File.ReadAllText("text.txt"); // ReadAllText needs the path to your file
char[] separators = { '\n', ',', '.', ' ', '"', ' ' }; // add your own
var words = str.Split(separators, StringSplitOptions.RemoveEmptyEntries);

string content = File.ReadAllText("Text.txt");
var words = from word in content.Split(WhiteSpace, StringSplitOptions.RemoveEmptyEntries)
select word;
You will need to define the array of whitespace chars with your own values like so:
char[] WhiteSpace = { '\r', '\n', ' ', '\t' };
This code assumes that punctuation is part of the word (like a comma).

It's probably better to read all the text using ReadAllText() and then use regular expressions to get the words. Using the space character as a delimiter can cause trouble, because the words will come back with punctuation attached (commas, periods, etc.). For example:
Regex re = new Regex("[a-zA-Z0-9_-]+", RegexOptions.Compiled); // You'll need to change the RE to fit your needs
Match m = re.Match(text);
while (m.Success)
{
string word = m.Value; // the pattern has no capture group, so use the whole match
// do your processing here
m = m.NextMatch();
}
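Since the question asks for LINQ, the same regular expression can also feed a query. This is only a sketch: the RegexWords class and Extract method are hypothetical names, and the pattern is the one from the answer above, which you would still need to adapt.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexWords
{
    // Same pattern as in the loop-based version above.
    private static readonly Regex WordPattern =
        new Regex("[a-zA-Z0-9_-]+", RegexOptions.Compiled);

    // Returns every match as a string. Cast<Match>() keeps this working on
    // frameworks where MatchCollection is only a non-generic IEnumerable.
    public static IEnumerable<string> Extract(string text)
    {
        return WordPattern.Matches(text).Cast<Match>().Select(m => m.Value);
    }
}
```

With this pattern an apostrophe ends a word, so "It's" comes back as "It" and "s"; add the apostrophe to the character class if that is not what you want.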

The following uses iterator blocks, and therefore uses deferred loading. Other solutions have you loading the entire file into memory before being able to iterate over the words.
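static IEnumerable<string> GetWords(string path){
foreach (var line in File.ReadLines(path)){
// A null separator array means "split on whitespace".
foreach (var word in line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries)){
yield return word;
}
}
}
(Passing null as the separator splits on whitespace; RemoveEmptyEntries drops the empty strings that consecutive whitespace would otherwise produce.)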
Use it like this:
foreach (var word in GetWords(@"text.txt")){
Console.WriteLine(word);
}
Works with standard LINQ funness too:
GetWords(@"text.txt").Take(25);
GetWords(@"text.txt").Where(w => w.Length > 3);
Of course error handling etc. left out for sake of learning.

You could write content.ToList().ForEach(p => p.Split(' ').ToList().ForEach(Console.WriteLine)) but that's not a lot of LINQ.

Related

How to count paragraphs in a text?

I'm stuck.
I have song text stored in a string.
I need to count the song's houses (the houses are separated by an empty line; the empty line is my delimiter).
In addition, I need access to each word, so I can associate each word with its house.
I would really appreciate your help.
This is my base code:
var paragraphMarker = Environment.NewLine;
var paragraphs = fileText.Split(new[] {paragraphMarker},
StringSplitOptions.RemoveEmptyEntries);
foreach (var paragraph in paragraphs)
{
var words = paragraph.Split(new[] {' '},
StringSplitOptions.RemoveEmptyEntries)
.Select(w => w.Trim());
//do something
}
You should be able to perform Regex.Split on \r\n\r\n, which is two carriage return/line feed pairs (assuming that your empty line is actually empty), and then String.Split each piece by ' ' to get the individual words in each paragraph.
This will break it apart into two sections and then count the words in each. For simplicity I've only got one sentence in each bit.
var poem = "Roses are red, violets are blue\r\n\r\nSomething something darkside";
var verses = System.Text.RegularExpressions.Regex.Split(poem, "\r\n\r\n");
foreach (var verse in verses)
{
var words = verse.Split(' ');
Console.WriteLine(words.Count());
}
You'll need to tidy up more edge cases like punctuation etc, but this should give you a starting point.
String.Split will create an array using your delimiter as a token.
The array's Length property will tell you how many elements it contains.
For example, to find the count of words in this sentence:
var count = @"Hello! This is a naive example.".Split(' ').Length;
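To tie each word to its paragraph, one option is to return the words grouped per paragraph, so the list index doubles as the paragraph number. The sketch below makes a few assumptions that are not in the question: line endings may be \r\n or \n, a "blank" line may contain stray spaces, and the names Paragraphs and SplitIntoParagraphs are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class Paragraphs
{
    // Returns one word array per paragraph, in document order.
    public static List<string[]> SplitIntoParagraphs(string text)
    {
        // Normalize Windows line endings so one pattern covers both styles.
        string normalized = text.Replace("\r\n", "\n");

        return Regex.Split(normalized, @"\n\s*\n")          // blank line = delimiter
            .Where(p => p.Trim().Length > 0)                 // skip empty chunks
            .Select(p => p.Split((char[])null,               // null = any whitespace
                                 StringSplitOptions.RemoveEmptyEntries))
            .ToList();
    }
}
```

The paragraph count is then the Count of the returned list, and result[i] holds the words of paragraph i (punctuation is still attached to the words).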

Fastest way to remove the leading special characters in string in c#

I am using C# and I have strings like
-Xyz
--Xyz
---Xyz
-Xyz-Abc
--Xyz-Abc
I simply want to remove any leading special characters up to the first letter. Note: special characters in the middle of the string must stay as they are. What is the fastest way to do this?
You could use string.TrimStart and pass in the characters you want to remove:
var result = yourString.TrimStart('-', '_');
However, this is only a good idea if the number of special characters you want to remove is well-known and small.
If that's not the case, you can use regular expressions:
var result = Regex.Replace(yourString, "^[^A-Za-z0-9]*", "");
I prefer these two methods:
List<string> strings = new List<string>()
{
"-Xyz",
"--Xyz",
"---Xyz",
"-Xyz-Abc",
"--Xyz-Abc"
};
foreach (var s in strings)
{
string temp;
// String.Trim Method
char[] charsToTrim = { '*', ' ', '\'', '-', '_' }; // Add more
temp = s.TrimStart(charsToTrim);
Console.WriteLine(temp);
// Enumerable.SkipWhile Method
// Char.IsPunctuation Method (see also Char.IsLetter, Char.IsLetterOrDigit, etc.)
temp = new String(s.SkipWhile(x => Char.IsPunctuation(x)).ToArray());
Console.WriteLine(temp);
}
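If the set of characters to strip is open-ended, a plain index scan is another option that avoids both the regex engine and the intermediate array that SkipWhile(...).ToArray() builds. This is a sketch, not a benchmarked answer to "fastest"; it assumes "alphabet" means whatever char.IsLetter accepts, and LeadingTrim/TrimToFirstLetter are illustrative names.

```csharp
using System;

public static class LeadingTrim
{
    // Drops every leading character that is not a letter.
    public static string TrimToFirstLetter(string s)
    {
        int i = 0;
        while (i < s.Length && !char.IsLetter(s[i]))
        {
            i++;
        }
        // Substring(s.Length) is valid and yields an empty string.
        return s.Substring(i);
    }
}
```

Swap char.IsLetter for char.IsLetterOrDigit if a leading digit should also stop the scan.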

String Regex Help C#

I'm trying to create a regex that reads a string, and if the last character is something like !"£$% etc, it ignores the last character, reads the string (to allow my code to look it up in a dictionary class) and then outputs the string, with the character on the end it ignored. Is this actually possible, or do I have to just remove the last character?
So far...
foreach(var line in yourReader)
{
var dict = new Dictionary<string,string>(); // your replacement dictionaries
foreach(var kvp in dict)
{
System.Text.RegularExpressions.Regex.Replace(line,"(\s|,|\.|:|\\t)" + kvp.Key + "(\s|,|\.|:|\\t)","\0" + kvp.Value + "\1");
}
}
I've also been told to try this
var trans = textbox1.Text;
foreach (var kvp in d) //d is my dictionary so use yours
{
trans = trans.Replace(kvp.Key, kvp.Value);
}
textbox2.Text = trans;
but I have literally no idea what it does.
I didn't find any point using Regex, so I hope this will help:
const int ARRAY_OFFSET = 1;
List<char> ForbiddenChars = new List<char>()
{
'!', '@', '#', '$', '%', '^', '&', '*', '£' //Add more if you'd like
};
string myString = "Hello World!&";
foreach (var forbiddenChar in ForbiddenChars)
{
if (myString[myString.Length - ARRAY_OFFSET] == forbiddenChar)
{
myString = myString.Remove(myString.Length - ARRAY_OFFSET);
break;
}
}
Edit:
I checked the old code, and it had a problem: when the string's trailing "forbidden" characters appeared in the same order as in the ForbiddenChars list, it deleted all of them. If your string was "Hello World&!" it would delete both the ! and the &. So I added a break; and it is no longer a problem.
Take a look at Regex.Replace. A regular expression such as [!"£$%]$ should do what you need.
In your case I'd recommend using a regular-expression character range to match the !"£$% etc.
The way you'd want to use this in your case would be something like:
"<the bit you want to capture>(?:[!-%]\\r)"
The (?:[!-%]\\r) bit matches, but doesn't store, a single character in the range !-% which comes right before a carriage return character. Note that the range !-% covers ! " # $ % but not £, which would have to be listed separately.
I also recommend using this handy cheat sheet of regular expressions:
http://www.mikesdotnetting.com/Article/46/CSharp-Regular-Expressions-Cheat-Sheet
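Putting the pieces together, here is one way the "ignore the last character, look the word up, then put the character back" idea could look with two capture groups. It is a sketch under assumptions: the token is a single word, at most one trailing character from the set !"£$% is peeled off, and the names TrailingPunctuation and Translate are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TrailingPunctuation
{
    // Group 1: the word itself. Group 2: an optional trailing character.
    private static readonly Regex Pattern =
        new Regex("^(.*?)([!\"£$%]?)$", RegexOptions.Singleline);

    public static string Translate(string token, IDictionary<string, string> dictionary)
    {
        Match m = Pattern.Match(token);
        string word = m.Groups[1].Value;
        string suffix = m.Groups[2].Value;

        // Fall back to the original word when the dictionary has no entry.
        string replacement;
        if (!dictionary.TryGetValue(word, out replacement))
        {
            replacement = word;
        }
        return replacement + suffix;
    }
}
```

With a dictionary that maps "hello" to "bonjour", Translate("hello!", dict) looks up "hello" and re-attaches the "!" afterwards.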

How to indicate whitespaces while reading from a .txt file

I have a simple .txt file with X,Y-values in it. It is structured like this:
-25.7754 35.87
-22.1233 32.16
-20.361 30.75
etc.
I am able to read single lines or the whole text to the end, with objstream.ReadToEnd(); & objstream.ReadLine().
But here's my question: how can I tell where the string for the first value ends, so I can parse it to a float and then proceed to read the next value?
Here is the read functionality I have so far :)
StreamReader objStream = new StreamReader("C:blablabla\\Text.asc");
textBox1.Text = objStream.ReadLine();
Thanks in advance,
BC++
Use String.Split()
As requested, an example :
string s = "there is a cat";
//
// Split string on spaces.
// ... This will separate all the words.
//
string[] words = s.Split(' ');
foreach (string word in words)
{
Console.WriteLine(word);
}
The output is :
there
is
a
cat
Look at the string.Split methods:
var line1 = objStream.ReadLine();
var lineParts = line1.Split(" ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
textBox1.Text = lineParts[0];
textBox2.Text = lineParts[1];
Note the use of an overload that takes StringSplitOptions.RemoveEmptyEntries: this means that if you have multiple spaces in succession, the result will not contain empty entries.
If you really mean white-space and not space then you have to go this way:
string line = "-25.7754 35.87";
string[] values = line.Split(new char[] { }, StringSplitOptions.RemoveEmptyEntries);
The difference from the other answers is in the splitting character. If it is not defined, white-space characters are assumed to be the delimiters. In other words, you will get the same result for
string line = "-25.7754\t35.87"; // tab instead of spaces.
You will have the flexibility to split correctly fixed length or tab delimited lines using the same code.
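The question also asks about parsing the pieces to float. A minimal sketch, assuming every line holds exactly two values that use '.' as the decimal separator; PointParser and ParseLine are hypothetical names. CultureInfo.InvariantCulture is passed explicitly because float.Parse otherwise follows the machine's regional settings, where ',' may be the decimal separator.

```csharp
using System;
using System.Globalization;

public static class PointParser
{
    // Splits a line on any whitespace and parses the first two fields.
    public static Tuple<float, float> ParseLine(string line)
    {
        string[] parts = line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

        float x = float.Parse(parts[0], CultureInfo.InvariantCulture);
        float y = float.Parse(parts[1], CultureInfo.InvariantCulture);
        return Tuple.Create(x, y);
    }
}
```

In a read loop you would call ParseLine on each non-empty result of objStream.ReadLine(); float.TryParse is the safer choice if the file may contain malformed lines.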

Perform Trim() while using Split()

Today I was wondering if there is a better way to write the following code sample.
string keyword = " abc, foo , bar";
string match = "foo";
string[] split= keyword.Split(new char[] { ',', ';' }, StringSplitOptions.RemoveEmptyEntries);
foreach(string s in split)
{
if (s.Trim() == match)
{
// do something
break;
}
}
Is there a way to perform trim() without manually iterating through each item? I'm looking for something like 'split by the following chars and automatically trim each result'.
Ah, immediately before posting I found
List<string> parts = line.Split(';').Select(p => p.Trim()).ToList();
in How can I split and trim a string into parts all on one line?
Still, I'm curious: might there be a better solution to this? (Or would the compiler probably convert them to the same code output as the LINQ operation?)
Another possible option (that avoids LINQ, for better or worse):
string line = " abc, foo , bar";
string[] parts= Array.ConvertAll(line.Split(','), p => p.Trim());
However, if you just need to know if it is there - perhaps short-circuit?
bool contains = line.Split(',').Any(p => p.Trim() == match);
var parts = line
.Split(';')
.Select(p => p.Trim())
.Where(p => !string.IsNullOrWhiteSpace(p))
.ToArray();
I know this is 10 years too late but you could have just split by ' ' as well:
string[] split= keyword.Split(new char[] { ',', ';', ' ' }, StringSplitOptions.RemoveEmptyEntries);
Because you're also splitting by the space char AND instructing the split to remove the empty entries, you'll have what you need.
If spaces just surround the words in the comma-separated string, this will work:
var keyword = " abc, foo , bar";
var array = keyword.Replace(" ", "").Split(',');
if (array.Contains("foo"))
{
Debug.Print("Match");
}
I would suggest using regular expressions on the original string, looking for the pattern "any number of spaces followed by one of your delimiters followed by one or more spaces" and remove those spaces. Then split.
Try this:
string keyword = " abc, foo , bar";
string match = "foo";
string[] split = Regex.Split(keyword.Trim(), @"\s*[,;]\s*");
if (split.Contains(match))
{
// do stuff
}
You're going to find a lot of different methods of doing this, and the differences in performance and accuracy won't be readily apparent. I'd recommend plugging them all into a testing suite like NUnit in order to find both which one comes out on top AND which ones are accurate.
Use small, medium, and large amounts of text in loops to examine the various situations.
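As a starting point for such a comparison, the sketch below wraps three of the approaches from this thread in methods so they can be run against the same inputs. The class and method names are made up for illustration, and it checks accuracy only, not speed.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SplitApproaches
{
    // Split on the delimiters, then trim each piece with LINQ.
    public static string[] SplitThenTrim(string s)
    {
        return s.Split(new[] { ',', ';' }, StringSplitOptions.RemoveEmptyEntries)
                .Select(p => p.Trim())
                .ToArray();
    }

    // Let the regex swallow the whitespace around each delimiter.
    public static string[] RegexSplit(string s)
    {
        return Regex.Split(s.Trim(), @"\s*[,;]\s*");
    }

    // Treat the space itself as a delimiter.
    public static string[] SplitOnSpaceToo(string s)
    {
        return s.Split(new[] { ',', ';', ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

All three agree on " abc, foo , bar", but they diverge once an entry contains an inner space: for " new york, foo" the first two return "new york" and "foo", while SplitOnSpaceToo returns three entries.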
Starting with .NET 5, there is an easier option:
string[] split= keyword.Split(new char[] { ',', ';' }, StringSplitOptions.TrimEntries);
You can combine it with the option to remove empty entries:
string[] split= keyword.Split(new char[] { ',', ';' }, StringSplitOptions.TrimEntries | StringSplitOptions.RemoveEmptyEntries);
