easiest way to get each word of e-mail (text file) into an array C# - c#

I am trying to build a phishing scanner for a class project, and I am stuck on getting an e-mail saved in a text file to copy properly into an array for later processing. What I want is for each word to be in its own array index.
Here is my sample e-mail:
Subject: Insufficient Funds Notice
Date: September 25, 2013
Insufficient Funds Notice
Unfortunately, on 09/25/2013 your available balance in your Wells Fargo account XXXXXX4653 was insufficient to cover one or more of your checks, Debit Card purchases, or other transactions.
An important notice regarding one or more of your payments is now available in your Messages & Alerts inbox.
To read the message, click here, and first confirm your identity.
Please make deposits to cover your payments, fees, and any other withdrawals or transactions you have initiated. If you have already taken care of this, please disregard this notice.
We appreciate your business and thank you for your prompt attention to this matter.
If you have questions after reading the notice in your inbox, please refer to the contact information in the notice. Please do not reply to this automated email.
Sincerely,
Wells Fargo Online Customer Service
wellsfargo.com | Fraud Information Center
4f57e44c-5d00-4673-8eae-9123909604b6
I don't want any of the punctuation; all I need is the words and numbers.
Here is the code I have written for it so far.
StreamReader sr1 = new StreamReader(lblDisplaySelectedFilePath.Text);
string line = sr1.ReadToEnd();
words = line.Split(' ');
int wordslowercount = 0;
foreach (string word in words)
{
words[wordslowercount] = word.ToLower();
wordslowercount = wordslowercount + 1;
}
The issue with the above code is that I keep getting words that are either strung together and/or have "\r" or "\n" on them in the array. Here is an example of what is in the array that I don't want.
"notice\r\ndate:" don't want the \r, \n, or the :. Also the two words should be in different indexes.

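The regex \W matches any non-word character (anything other than a letter, digit, or underscore), so splitting on it gives you a list of words with the punctuation and line breaks removed. Consecutive separators produce empty entries, which the Where clause filters out (this needs using System.Linq):
Regex.Split(inputString, "\\W").Where(x => !string.IsNullOrWhiteSpace(x));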
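A minimal sketch of how that one-liner could be wrapped into a helper that returns the lower-cased array the question asks for. The class and method names (EmailTokenizer, GetWords) are made up for illustration, and splitting on \W+ rather than \W is a small variation that reduces the number of empty entries.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class EmailTokenizer
{
    // Splits text on runs of non-word characters and lower-cases each token.
    // Note: \w also matches '_', so underscores stay inside tokens.
    public static string[] GetWords(string text)
    {
        return Regex.Split(text, @"\W+")
                    .Where(token => token.Length > 0)        // drop empty leading/trailing pieces
                    .Select(token => token.ToLowerInvariant())
                    .ToArray();
    }
}
```

With this approach a date such as 09/25/2013 becomes three tokens (09, 25, 2013) and wellsfargo.com becomes two, which may or may not be what a phishing scanner wants.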

using System;
using System.Text.RegularExpressions;
public class Example
{
static string CleanInput(string strIn)
{
// Replace invalid characters with empty strings.
try {
return Regex.Replace(strIn, @"[^\w\.@-]", "",
RegexOptions.None, TimeSpan.FromSeconds(1.5));
}
// If we timeout when replacing invalid characters,
// we should return Empty.
catch (RegexMatchTimeoutException) {
return String.Empty;
}
}
}

Using line.Split(null) will split on white-space. From the C# String.Split method documentation:
If the separator parameter is null or contains no characters, white-space characters are assumed to be the delimiters. White-space characters are defined by the Unicode standard and return true if they are passed to the Char.IsWhiteSpace method.
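A hedged sketch of the white-space approach follows. Splitting on white-space fixes the \r and \n problem but leaves punctuation attached to the words, so this version also trims non-alphanumeric characters from both ends of each token. The names WhitespaceSplitter, GetWords and TrimPunctuation are hypothetical, and the (char[]) cast is there because a bare Split(null) may be reported as an ambiguous call on newer .NET versions.

```csharp
using System;
using System.Linq;

public static class WhitespaceSplitter
{
    public static string[] GetWords(string text)
    {
        // A null separator array means "split on any white-space character".
        return text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
                   .Select(token => TrimPunctuation(token).ToLowerInvariant())
                   .Where(token => token.Length > 0)   // tokens such as "&" become empty
                   .ToArray();
    }

    // Removes leading and trailing characters that are not letters or digits.
    private static string TrimPunctuation(string token)
    {
        int start = 0, end = token.Length;
        while (start < end && !char.IsLetterOrDigit(token[start])) start++;
        while (end > start && !char.IsLetterOrDigit(token[end - 1])) end--;
        return token.Substring(start, end - start);
    }
}
```

Unlike the regex split, this keeps inner punctuation, so 09/25/2013 and wellsfargo.com each stay in one array slot.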

Related

C# read from text file and store in variables

I have a text file that reads
1 "601 Cross Street College Station TX 71234"
2 "(another address)"
3 ...
.
.
I wanted to know how to parse this text file into an integer and a string using C#. The integer would hold the S.No and the string the address without the quotes.
I need to do this because later on I have a function that takes these two values from the text file as input and spits out some data. This function has to be executed on each entry in the text file.
If a is the integer and add is the string, the output should be
a=1; add=601 Cross Street College Station TX 71234 //for the first line and so on
As one can observe the address needs to be one string.
This is not a homework question. And what I have been able to accomplish so far is to read out all the lines using
string[] lines = System.IO.File.ReadAllLines(@"C:\Users\KS\Documents\input.txt");
Any help is appreciated.
I would need to see more of your input data to determine the most reliable method.
But one approach would be to split each address into words. You can then loop through the words and find each word that contains only digits. This will be your street number. You could look after the street number and look for S, So, or South but as your example illustrates, there might be no such indicator.
Also, you haven't provided what you want to happen if more than one number is found.
As far as removing the quotes, just remove the first and last characters. I'd recommend checking that they are in fact quotes before removing them.
From your description, every entry has this format:
[space][number][space][quote][address][quote]
Here is some quick and dirty code that will parse this format into an int/string tuple:
using System;
using System.Linq;
static Tuple<int, string> ParseLine(string line)
{
var tokens = line.Split(); // Split by spaces
var number = int.Parse(tokens[1]); // The number is the 2nd token
var address = string.Join(" ", tokens.Skip(2)); // The address is every subsequent token
address = address.Substring(1, address.Length - 2); // ... minus the first and last characters
return Tuple.Create(number, address);
}
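As an alternative sketch, a regular expression can check the format and pull out both parts in one step, which avoids relying on token positions and verifies that the quotes are really there. The names EntryParser and TryParse are illustrative only, and the pattern assumes each line is an integer, white-space, then the address in double quotes (leading white-space optional).

```csharp
using System;
using System.Text.RegularExpressions;

public static class EntryParser
{
    // Optional leading white-space, digits, white-space, then a quoted address.
    private static readonly Regex EntryPattern =
        new Regex(@"^\s*([0-9]+)\s+""(.*)""\s*$");

    public static bool TryParse(string line, out int number, out string address)
    {
        number = 0;
        address = null;

        Match match = EntryPattern.Match(line);
        if (!match.Success)
            return false;                                  // line is not in the expected format
        if (!int.TryParse(match.Groups[1].Value, out number))
            return false;                                  // e.g. the number overflows an int

        address = match.Groups[2].Value;                   // text between the outer quotes
        return true;
    }
}
```

Calling TryParse on each element of the lines array from File.ReadAllLines gives the integer and the unquoted address, and lines that do not fit the format can be skipped or logged.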

Determine POS tagging in English based on database files

I'm a little bit confused how to determine part-of-speech tagging in English. In this case, I assume that one word in English has one type, for example word "book" is recognized as NOUN, not as VERB. I want to recognize English sentences based on tenses. For example, "I sent the book" is recognized as past tense.
Description:
I have a number of database (*.txt) files: NounList.txt, verbList.txt, adjectiveList.txt, adverbList.txt, conjunctionList.txt, prepositionList.txt, articleList.txt. If an input word is present in one of the databases, I assume its type can be concluded. But how should I begin the lookup? For example, for "I sent the book": how do I search the databases for every word, so that "I" is found as a noun, "sent" as a verb, "the" as an article, and "book" as a noun? Is there any better approach than searching for every word in every database? I doubt that every database has unique elements.
I enclose my perspective here.
private List<string> ParseInput(String allInput)
{
List<string> listSentence = new List<string>();
char[] delimiter = ".?!;".ToCharArray();
var sentences = allInput.Split(delimiter, StringSplitOptions.RemoveEmptyEntries).Select(s => s.Trim());
foreach (var s in sentences)
listSentence.Add(s);
return listSentence;
}
private void tenseReviewMenu_Click(object sender, EventArgs e)
{
string allInput = rtbInput.Text;
List<string> listWord = new List<string>();
List<string> listSentence = new List<string>();
HashSet<string> nounList = new HashSet<string>(getDBList("nounList.txt"));
HashSet<string> verbList = new HashSet<string>(getDBList("verbList.txt"));
HashSet<string> adjectiveList = new HashSet<string>(getDBList("adjectiveList.txt"));
HashSet<string> adverbList = new HashSet<string>(getDBList("adverbList.txt"));
char[] separator = new char[] { ' ', '\t', '\n', ',' /* etc. */ };
listSentence = ParseInput(allInput);
foreach (string sentence in listSentence)
{
foreach (string word in sentence.Split(separator))
if (word.Trim() != "")
listWord.Add(word);
}
string testPOS = "";
foreach (string word in listWord)
{
if (nounList.Contains(word.ToLowerInvariant()))
testPOS += "noun ";
else if (verbList.Contains(word.ToLowerInvariant()))
testPOS += "verb ";
else if (adjectiveList.Contains(word.ToLowerInvariant()))
testPOS += "adj ";
else if (adverbList.Contains(word.ToLowerInvariant()))
testPOS += "adv ";
}
tbTest.Text = testPOS;
}
POS tagging is my secondary explanation in my assignment. So I use a simple approach to determine POS tagging that is based on database. But, if there's a simpler approach: easy to use, easy to understand, easy to get pseudocode, easy to design... to determine POS tagging, please let me know.
I hope the pseudocode I present below proves helpful to you. If I find time, I'd also write some code for you.
This problem can be tackled by following the steps below:
Create a dictionary of all the common sentence patterns in the English language. For example, Subject + Verb is an English pattern and all the sentences like I sleep, Dog barked and Ship will arrive match the S-V pattern. You can find a list of the most common English patterns here. Please note that for some time you may need to keep revising this dictionary to enhance the accuracy of your program.
Try to fit the input sentence in one of the patterns in the dictionary you created above, for example, if the input sentence is Snakes, unlike elephants, are venomous., then your code must be able to find a match with the pattern: Subject, unlike AnotherSubject, Verb Object or S-,unlike-S`-, -V-O. To successfully perform this step, you may need to write code that's good at spotting Structure Markers like the word unlike, in this example sentence.
When you have found a match for your input sentence in your pattern dictionary, you can easily assign a tag to each word in the sentence. For example, in our sentence, the word Snakes would be tagged as a subject, just like the word elephants, the word are would be tagged as a verb and finally the word venomous would be tagged as an object.
Once you have assigned a unique tag to each of the words in your sentence, you can go lookup the word in the appropriate text files that you already have and determine whether or not your sentence is valid.
If your sentence doesn't match any sentence pattern, then you have two options:
a) Add the pattern of this unrecognized sentence in your pattern dictionary if it is a valid English sentence.
b) Or, discard the input sentence as an invalid English sentence.
Things like what you're trying to achieve are best solved using machine learning techniques so that the system can learn any new patterns. So, you may want to include a trainer system that would add a new pattern to your pattern dictionary whenever it finds a valid English sentence not matching any of the existing patterns. I haven't thought much about how this can be done, but for now, you may manually revise your Sentence Pattern dictionary.
I'd be glad to hear your opinion about this pseudocode and would be available to brainstorm it further.
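On the narrower question of avoiding a search through every word list for every word, one possible sketch is to merge the lists once into a single dictionary that maps each word to all the tags it appears under. This is not the pattern-matching approach described above, just a lookup structure that could feed it. The PosIndex class and its members are hypothetical, and the word lists are passed in as collections rather than read from the .txt files.

```csharp
using System;
using System.Collections.Generic;

public sealed class PosIndex
{
    // word -> every tag whose list contains it (case-insensitive keys)
    private readonly Dictionary<string, List<string>> tagsByWord =
        new Dictionary<string, List<string>>(StringComparer.OrdinalIgnoreCase);

    // Registers all words of one list (e.g. the contents of nounList.txt) under a tag.
    public void AddList(string tag, IEnumerable<string> words)
    {
        foreach (string word in words)
        {
            List<string> tags;
            if (!tagsByWord.TryGetValue(word, out tags))
            {
                tags = new List<string>();
                tagsByWord[word] = tags;
            }
            if (!tags.Contains(tag))
                tags.Add(tag);
        }
    }

    // Returns the comma-separated tags for a word, or "unknown" if it is in no list.
    public string Lookup(string word)
    {
        List<string> tags;
        return tagsByWord.TryGetValue(word, out tags) ? string.Join(",", tags) : "unknown";
    }
}
```

Each word then costs one dictionary lookup, and a word that sits in several lists (such as "book" as noun and verb) shows up with all its tags, so the ambiguity stays visible for a later step to resolve.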

Dynamic Regex for number range using c#

I'm looking at UK postcodes and trying to work out how I can take data from a database (the first part of a UK postcode) and dynamically create a regexp for them using c#. For example:
AB44-56
I know what I want as an output:
AB([4][4-9]|[5][0-6])+
However, I can't work out how I might be able to do this with logic. Perhaps I need to split the letters from the numbers first, but I can't do that using Split.
I have other combinations too - single range:
AB31 would be AB[3][1]+
Some with just letters:
BT would be BT+
Some with a single letter and 1 or two numbers:
G83 Would be G[8][3]
Any suggestions or guidance on how this might be coded would be very much appreciated.
From Wikipedia, UK postal codes:
This can be generalised as: (one or two letters)(number between 0 and 99)(zero or one letter)(space)(single digit)(two letters)
so
^[A-Za-z]{1,2}\d{1,2}[A-Za-z]?\s\d[A-Za-z]{2}$
might work.
EDIT: Also, if you are trying to restrict the postal codes to, say, those with the same prefix as the ones in the database, you could do this.
var source = "BTasdfweasdf"; //from the database
var input = "BT1A 1BB"; //from the somewhere else
var regex = Regex.Replace(source, @"(^[A-Za-z]{0,2})(.*)", @"$1\d+[A-Za-z]?\s\d[A-Za-z]{2}$");
var match = Regex.Match(input,regex);
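For the original question of turning a stored value such as AB44-56 into a pattern, here is a rough sketch. It separates the letters from the numbers with a regex and then emits one alternative per tens digit. PostcodePattern.Build is a hypothetical name, the numbers are assumed to lie between 0 and 99, and the output omits the trailing + from the question's examples.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PostcodePattern
{
    // Accepts specs such as "BT", "G83", "AB31" or "AB44-56".
    public static string Build(string spec)
    {
        Match m = Regex.Match(spec, @"^([A-Za-z]+)(?:([0-9]{1,2})(?:-([0-9]{1,2}))?)?$");
        if (!m.Success)
            throw new ArgumentException("Unrecognised postcode spec: " + spec);

        string prefix = m.Groups[1].Value;
        if (!m.Groups[2].Success)
            return prefix;                                   // letters only, e.g. "BT"

        int lo = int.Parse(m.Groups[2].Value);
        int hi = m.Groups[3].Success ? int.Parse(m.Groups[3].Value) : lo;
        if (lo > hi)
            throw new ArgumentException("Range start exceeds range end: " + spec);

        // One alternative per tens digit: 44-56 gives "4[4-9]" and "5[0-6]".
        var parts = new List<string>();
        for (int tens = lo / 10; tens <= hi / 10; tens++)
        {
            int first = Math.Max(lo, tens * 10) % 10;
            int last = Math.Min(hi, tens * 10 + 9) % 10;
            string units = first == last ? first.ToString() : "[" + first + "-" + last + "]";
            parts.Add((tens == 0 ? "" : tens.ToString()) + units);
        }
        return parts.Count == 1 ? prefix + parts[0]
                                : prefix + "(" + string.Join("|", parts) + ")";
    }
}
```

Build("AB44-56") returns AB(4[4-9]|5[0-6]). When matching real postcodes it is probably worth anchoring the result with ^ and appending (?!\d), so that a pattern such as G8 does not also match G83.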

Extracting data from text using templates

I'm building a web service which receives emails from a number of CRM-systems. Emails typically contain a text status e.g. "Received" or "Completed" as well as a free text comment.
The formats of the incoming email are different, e.g. some systems call the status "Status: ZZZZZ" and some "Action: ZZZZZ". The free text sometimes appears before the status and sometimes after. Status codes will be mapped to my system's interpretation, and the comment is required too.
Moreover, I'd expect the formats to change over time, so a solution that is configurable, possibly by customers providing their own templates through a web interface, would be ideal.
The service is built using .NET C# MVC 3 but I'd be interested in general strategies as well as any specific libraries/tools/approaches.
I've never quite got my head around RegExp. I'll make a new effort in case it is indeed the way to go. :)
I would go with regex:
First example, if you had only Status: ZZZZZ- like messages:
String status = Regex.Match(emailText, @"(?<=Status: ).*").Value; // emailText (hypothetical name) holds the email body
// Explanation of "(?<=Status: ).*" :
// (?<= Start of the positive look-behind group: it means that the
// following text is required but won't appear in the returned string
// Status: The text defining the email string format
// ) End of the positive look-behind group
// .* Matches the rest of the line
Second example if you had only Status: ZZZZZ and Action: ZZZZZ - like messages:
String status = Regex.Match(emailText, @"(?<=(Status|Action): ).*").Value;
// We added (Status|Action) that allows the positive look-behind text to be
// either 'Status: ', or 'Action: '
Now if you want to give the possibility to the user to provide its own format, you could come up with something like:
String userEntry = GetUserEntry(); // Get the text submitted by the user
String userFormatText = Regex.Escape(userEntry);
String status = Regex.Match(emailText, @"(?<=" + userFormatText + ").*").Value;
That would allow the user to submit its format, like Status:, or Action:, or This is my friggin format, now please read the status -->...
The Regex.Escape(userEntry) part is important to ensure that the user doesn't break your regex by submitting special character like \, ?, *...
To know if the user submits the status value before or after the format text, you have several solutions:
You could ask the user where the status value is, and then build your regex accordingly:
if (statusValueIsAfter) {
// Example: "Status: Closed"
regexPattern = @"(?<=Status: ).*";
} else {
// Example: "Closed:Status"
regexPattern = @".*(?=:Status)"; // We use here a positive look-AHEAD
}
Or you could be smarter and introduce a system of tags for the user entry. For instance, the user submits Status: <value> or <value>=The status and you build the regex by replacing the tags string.
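The tag idea could look roughly like the sketch below. It assumes the placeholder is the literal text <value>, that it appears once in the template, and that the status sits on a single line of the email. TemplateExtractor and Extract are invented names. The literal parts of the template are escaped separately, so the user's text cannot inject regex syntax.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TemplateExtractor
{
    private const string Tag = "<value>";

    // Returns the text found where <value> sits in the template, or null if no line matches.
    public static string Extract(string template, string emailText)
    {
        int tagIndex = template.IndexOf(Tag, StringComparison.Ordinal);
        if (tagIndex < 0)
            throw new ArgumentException("Template must contain " + Tag);

        // Escape the literal text on each side of the tag.
        string before = Regex.Escape(template.Substring(0, tagIndex));
        string after = Regex.Escape(template.Substring(tagIndex + Tag.Length));

        // Multiline: ^ and $ match at line boundaries; \r? tolerates CRLF line endings.
        string pattern = "^" + before + "(?<value>.+?)" + after + @"\r?$";
        Match match = Regex.Match(emailText, pattern, RegexOptions.Multiline);
        return match.Success ? match.Groups["value"].Value : null;
    }
}
```

The same call handles both orderings: a template of Status: <value> reads the text after the label, and <value>=The status reads the text before it, so no separate before/after flag is needed.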

Trouble parsing NMEA data from Serial Port

I'm retrieving NMEA sentences from a serial GPS. The strings are coming across like I would expect. The problem is when parsing a sentence like this:
$GPRMC,040302.663,A,3939.7,N,10506.6,W,0.27,358.86,200804,,*1A
I use a simple bit of code to make sure I have the right sentence:
string[] Words = sBuffer.Split(',');
foreach (string item in Words)
{
if (item == "$GPRMC")
{
return "Correct Sentence";
}
else
{
return "Incorrect Sentence";
}
}
I added the return in that location for the example. I have printed the split results to a text box and have seen that $GPRMC is indeed coming across in the item variable at some point. If the string is coming across, why won't the if statement catch it? Is it the $? How can I troubleshoot this?
It has been a while since I read an NMEA GPS...
Don't you need to compare the substring corresponding to the NMEA data type rather than the entire NMEA buffer elements? The .Split method splits sBuffer on all the commas in the NMEA sentence so that you have each individual element. But then you are testing the substring against the first element in a loop that implies that you want to look at every element. Confusing...
So wouldn't your test seem better as:
string[] Words=sBuffer.Split(',');
if(String.Compare(Words[0],"$GPRMC")==0)
{
return "Correct Sentence";
}
else
{
return "Incorrect Sentence";
}
Is there a possibility that the NMEA stream is outputting sentences other than the Min Data, GPRMC sentence and you need to reread until you have the correct sentence? Also, are you sure that your GPS has the datatype as $GPRMC rather than GPRMC? I do not think there is supposed to be a $ in the datatype.
ie, in pseudo:
do {
buffer=read_NMEA(); //making sure the entire sentence is read...
array=split(buffer,",");
data_type=array[0];
}
while(data_type!="GPRMC" && readcount++<=MAX_NMEA_READS)
To debug your loop, try a console write of the elements:
string[] Words = sBuffer.Split(',');
foreach (string item in Words)
{
Console.WriteLine(item);
}
Are you calculating the checksum? I don't see it.
NMEA Wiki
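For reference, a minimal checksum sketch: the NMEA 0183 checksum is the XOR of every character between the $ and the *, written as two hexadecimal digits after the *. The class and method names below are made up, and the code assumes one complete sentence per string. It trims the input first, because sentences read from a serial port normally end in \r\n, and stray line-ending characters are also a common reason for string comparisons on the fields to fail.

```csharp
using System;
using System.Globalization;

public static class Nmea
{
    // XOR of all characters in the sentence body (the part between '$' and '*').
    public static byte ComputeChecksum(string body)
    {
        byte checksum = 0;
        foreach (char c in body)
            checksum ^= (byte)c;
        return checksum;
    }

    // True if the sentence has the form $<body>*HH and HH matches the computed checksum.
    public static bool HasValidChecksum(string sentence)
    {
        sentence = sentence.Trim();                       // remove trailing \r\n from the port
        int star = sentence.LastIndexOf('*');
        if (!sentence.StartsWith("$", StringComparison.Ordinal) || star < 1 || star + 3 != sentence.Length)
            return false;

        byte expected;
        if (!byte.TryParse(sentence.Substring(star + 1), NumberStyles.AllowHexSpecifier,
                           CultureInfo.InvariantCulture, out expected))
            return false;

        return ComputeChecksum(sentence.Substring(1, star - 1)) == expected;
    }
}
```

Running HasValidChecksum on sBuffer before splitting it on commas would reject truncated or corrupted sentences early.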
EDIT: My answer underneath is no improvement, as commentator mtrw stated, the == is overloaded by the string class. I was wrong.
To my mind your if-Statement is faulty. Using the == operator, you are checking if it is the same reference (which certainly will not be the case). To simply compare if the two strings contain the same value, use String.Equals().
