Extracting data from text using templates

I'm building a web service which receives emails from a number of CRM-systems. Emails typically contain a text status e.g. "Received" or "Completed" as well as a free text comment.
The formats of the incoming emails differ, e.g. some systems write the status as "Status: ZZZZZ" and some as "Action: ZZZZZ". The free text sometimes appears before the status and sometimes after. Status codes will be mapped to my system's interpretation, and the comment is required too.
Moreover, I'd expect the formats to change over time, so a solution that is configurable, possibly by customers providing their own templates through a web interface, would be ideal.
The service is built using .NET C# MVC 3 but I'd be interested in general strategies as well as any specific libraries/tools/approaches.
I've never quite got my head around RegExp. I'll make a new effort in case it is indeed the way to go. :)

I would go with regex:
First example, if you only had "Status: ZZZZZ"-style messages:
String status = Regex.Match(emailText, @"(?<=Status: ).*").Value; // emailText: the email body (name assumed)
// Explanation of "(?<=Status: ).*" :
// (?<= Start of the positive look-behind group: it means that the
// following text is required but won't appear in the returned string
// Status: The text defining the email string format
// ) End of the positive look-behind group
// .* Matches every character up to the end of the line (a trailing \r may need trimming)
Second example, if you had both "Status: ZZZZZ" and "Action: ZZZZZ"-style messages:
String status = Regex.Match(emailText, @"(?<=(Status|Action): ).*").Value;
// We added (Status|Action) that allows the positive look-behind text to be
// either 'Status: ', or 'Action: '
Now if you want to let users provide their own format, you could come up with something like:
String userEntry = GetUserEntry(); // Get the text submitted by the user
String userFormatText = Regex.Escape(userEntry);
String status = Regex.Match(emailText, @"(?<=" + userFormatText + ").*").Value; // emailText: the email body (name assumed)
That would allow users to submit their own format, like Status:, or Action:, or This is my friggin format, now please read the status -->...
The Regex.Escape(userEntry) part is important to ensure that the user doesn't break your regex by submitting special characters like \, ?, *...
To know whether the status value comes before or after the format text, you have several options:
You could ask the user where the status value is, and then build your regex accordingly:
if (statusValueIsAfter) {
// Example: "Status: Closed"
regexPattern = @"(?<=Status: ).*";
} else {
// Example: "Closed:Status"
regexPattern = @".*(?=:Status)"; // We use here a positive look-AHEAD
}
Or you could be smarter and introduce a system of tags for the user entry. For instance, the user submits Status: <value> or <value>=The status and you build the regex by replacing the tag in that string.
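The tag idea can be sketched roughly as follows. This is only one possible shape, not the answerer's code: the `<value>` tag comes from the answer, while the class name `TemplateExtractor`, the methods `BuildRegex` and `TryExtract`, and the rule that the value never spans more than one line are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TemplateExtractor
{
    // Placeholder the customer types where the status value goes.
    private const string ValueTag = "<value>";

    // Turns a template such as "Status: <value>" into a regex with a named group.
    public static Regex BuildRegex(string template)
    {
        int tagIndex = template.IndexOf(ValueTag, StringComparison.Ordinal);
        if (tagIndex < 0)
            throw new ArgumentException("Template must contain " + ValueTag, "template");

        // Escape the literal text on both sides of the tag so that user input
        // cannot inject regex syntax.
        string prefix = Regex.Escape(template.Substring(0, tagIndex));
        string suffix = Regex.Escape(template.Substring(tagIndex + ValueTag.Length));

        // If one side of the tag is empty, anchor that side to the line boundary.
        string start = prefix.Length == 0 ? "^" : prefix;
        string end = suffix.Length == 0 ? @"\r?$" : suffix;

        // Assumption: the value stays on a single line.
        return new Regex(start + @"(?<value>[^\r\n]*?)" + end, RegexOptions.Multiline);
    }

    public static bool TryExtract(string template, string emailText, out string value)
    {
        Match match = BuildRegex(template).Match(emailText);
        value = match.Success ? match.Groups["value"].Value.Trim() : string.Empty;
        return match.Success;
    }
}
```

With this sketch, `TryExtract("Status: <value>", body, out status)` and `TryExtract("<value>=The status", body, out status)` go through the same code path, so each customer only has to store a template string.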

Related

easiest way to get each word of e-mail (text file) into an array C#

I am trying to build a phishing scanner for a class project and I am stuck on trying to get an e-mail saved in a text file to properly copy into an array for later processing. What I want is for each word to be in its own array index.
Here is my sample e-mail:
Subject: Insufficient Funds Notice
Date: September 25, 2013
Insufficient Funds Notice
Unfortunately, on 09/25/2013 your available balance in your Wells Fargo account XXXXXX4653 was insufficient to cover one or more of your checks, Debit Card purchases, or other transactions.
An important notice regarding one or more of your payments is now available in your Messages & Alerts inbox.
To read the message, click here, and first confirm your identity.
Please make deposits to cover your payments, fees, and any other withdrawals or transactions you have initiated. If you have already taken care of this, please disregard this notice.
We appreciate your business and thank you for your prompt attention to this matter.
If you have questions after reading the notice in your inbox, please refer to the contact information in the notice. Please do not reply to this automated email.
Sincerely,
Wells Fargo Online Customer Service
wellsfargo.com | Fraud Information Center
4f57e44c-5d00-4673-8eae-9123909604b6
I don't want any of the punctuation; all I need is the words and numbers.
Here is the code I have written for it so far.
StreamReader sr1 = new StreamReader(lblDisplaySelectedFilePath.Text);
string line = sr1.ReadToEnd();
words = line.Split(' ');
int wordslowercount = 0;
foreach (string word in words)
{
words[wordslowercount] = word.ToLower();
wordslowercount = wordslowercount + 1;
}
The issue with the above code is that I keep getting words that are either strung together and/or have "\r" or "\n" on them in the array. Here is an example of what is in the array that I don't want.
"notice\r\ndate:" don't want the \r, \n, or the :. Also the two words should be in different indexes.

The regex \W matches any non-word character, so splitting on it gives you a list of words with the punctuation, spaces, \r and \n removed. The empty entries left between adjacent separators are filtered out:
Regex.Split(inputString, "\\W").Where(x => !string.IsNullOrWhiteSpace(x));
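As a fuller sketch of the same idea, the helper below also lower-cases each word, which the question's loop was doing. The names `WordTokenizer` and `Tokenize` are made up for this example, and it splits on `\W+` (one or more non-word characters) so that fewer empty entries are produced in the first place.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordTokenizer
{
    // Splits text into lower-case words, dropping punctuation and line breaks.
    public static string[] Tokenize(string text)
    {
        return Regex.Split(text, @"\W+")
            .Where(word => word.Length > 0)        // drop empty leading/trailing entries
            .Select(word => word.ToLowerInvariant())
            .ToArray();
    }
}
```

Called on the text read from the file, e.g. `string[] words = WordTokenizer.Tokenize(line);`, this puts "notice" and "date" into separate array slots. Note that `\w` treats the underscore as a word character, so a token like `a_b` stays in one piece.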

using System;
using System.Text.RegularExpressions;
public class Example
{
static string CleanInput(string strIn)
{
// Replace invalid characters with empty strings.
try {
return Regex.Replace(strIn, @"[^\w\.@-]", "",
RegexOptions.None, TimeSpan.FromSeconds(1.5));
}
// If we timeout when replacing invalid characters,
// we should return Empty.
catch (RegexMatchTimeoutException) {
return String.Empty;
}
}
}

Using line.Split(null) will split on white-space. From the C# String.Split method documentation:
If the separator parameter is null or contains no characters, white-space characters are assumed to be the delimiters. White-space characters are defined by the Unicode standard and return true if they are passed to the Char.IsWhiteSpace method.

How to carriage return C# string without \r\n?

This is my problem.
A user can enter text into a text area in the browser, which is then emailed out to users.
What I want to know is how to handle carriage returns. If I enter \r\n for a carriage return, the email (which is a plain text email) has the actual text \r\n in it.
In other words:
On the SQL server end
Case 1:
if I do this before the email gets sent
(notice the line break after line 1)
update emails
set
body='line 1
line 2'
where
id=100
the email goes out correctly
Case 2:
update emails
set
body='line 1'+char(13) + char(10) +'line 2'
where
id=100
This email also goes out correctly
Case 3:
However if I do this
update emails
set
body='line 1 \r\n line 2'
where
id=100
the email would have the actual text \r\n in it.
How do I simulate case 1/2 through c# ?

SQL literals (at least those in SQL Server) do not support such escape sequences (although you can just hit enter within the string literal so that it spans multiple lines). See this answer for some alternatives if writing it as an SQL string is a requirement.
If running the SQL programmatically from C#, use parameters which will handle this just fine:
sqlCommand.CommandText = "update emails set body=@body where id=@id";
sqlCommand.Parameters.AddWithValue("@body", "line 1 \r\n line2");
sqlCommand.Parameters.AddWithValue("@id", 100);
Note that the handling of the string literal (and conversion of the \r and \n character escape sequences) happens in C# and the value (with CR and LF characters) is passed to SQL.
If the above didn't address the problem, keep reading.
4.10.13 The textarea element:
For historical reasons, the element's value is normalised in three different ways for three different purposes. The raw value is the value as it was originally set. It is not normalized. The API value is the value used in the value IDL attribute. It is normalized so that line breaks use "LF" (U+000A) characters. Finally, there is the form submission value. [Upon form submission the textarea] is normalized so that line breaks use U+000D CARRIAGE RETURN U+000A LINE FEED (CRLF) character pairs, and in addition, if necessary given the element's wrap attribute, additional line breaks are inserted to wrap the text at the given width.
Note that CR and LF represent characters and not the two-character sequence of \ followed by either the r or n characters - this form is often found in string literals. If it appears as such then something is doing the incorrect conversion and putting (or leaving) the \ there. Or, perhaps there is some misguided "add slashes" hack somewhere?
As pointed out, while URL decode is likely wrong, it won't directly do this conversion. However, if the conversion happened previously before being "URL Encoded", then it will (correctly) decode to (incorrect) values.
In either case, it's a bug. So find out where the incorrect data conversion is introduced and fix it (attach a debugger and/or monitor the network traffic for clues) - the required information to isolate where is simply not present in the post.

Use C#'s String.Replace method to replace "\\r\\n" with "\r\n", and that should fix it.
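A minimal sketch of that replacement, assuming the text really does arrive containing a literal backslash followed by r and n. The names `LineBreakFixer` and `Unescape` are invented for this example. It is a workaround for the symptom; the earlier answer's advice to find where the bad conversion happens still applies.

```csharp
public static class LineBreakFixer
{
    // Converts the four literal characters \ r \ n into a real CR LF pair.
    public static string Unescape(string text)
    {
        // In a regular C# literal, "\\r\\n" is backslash, r, backslash, n,
        // while "\r\n" is carriage return followed by line feed.
        return text.Replace("\\r\\n", "\r\n");
    }
}
```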

parsing email text reply/forward

I am creating a web based email client using c# asp.net.
What is confusing is that various email clients seem to add the original text in a lot of different ways when replying by email.
What I was wondering is whether there is some sort of standardized way to disambiguate this process.
Thank you
-Theo

I was thinking:
public String cleanMsgBody(String oBody, out Boolean isReply)
{
isReply = false;
Regex rx1 = new Regex("\n-----");
Regex rx2 = new Regex("\n([^\n]+):([ \t\r\n\v\f]+)>");
Regex rx3 = new Regex("([0-9]+)/([0-9]+)/([0-9]+)([^\n]+)<([^\n]+)>");
String txtBody = oBody;
while (txtBody.Contains("\n\n")) txtBody = txtBody.Replace("\n\n", "\n");
while (new Regex("\n ").IsMatch(txtBody)) txtBody = (new Regex("\n ")).Replace(txtBody, "\n");
while (txtBody.Contains("  ")) txtBody = txtBody.Replace("  ", " "); // collapse runs of spaces into one
if (isReply = (isReply || rx1.IsMatch(txtBody)))
txtBody = rx1.Split(txtBody)[0]; // Maybe a loop through would be better
if (isReply = (isReply || rx2.IsMatch(txtBody)))
txtBody = rx2.Split(txtBody)[0]; // Maybe a loop through would be better
if (isReply = (isReply || rx3.IsMatch(txtBody)))
txtBody = rx3.Split(txtBody)[0]; // Maybe a loop through would be better
return txtBody;
}

There isn't a standardized way, but a sensible heuristic will get you a good distance.
Some algorithms classify lines based on their initial character(s) and by comparing the text to a corpus of marked up text, yielding a statistical probability for each line that it is a) part of the same block as the next/previous one and b) quoted text, a signature, new text, etc.
It'd be worth trying out some of the most popular e-mail clients and creating and comparing some sample messages to see what the differences are. Usenet newsgroups may also help you build a reasonable corpus of messages to work from. HTML e-mail adds an extra level of complexity of course, though most compliant mail clients will include the corresponding plain text as well. Different languages also cause issues, as clients which can parse "Paul wrote:" may fall over at "Pablo ha scritto:".

Not really, no.
The original RFC for Internet Message talks about the in-reply-to header, but doesn't specify the format of the body.
As you've found, different clients add the original text in different ways, implying there's not a standard, coupled with the fact that users will do things differently as well:
Plain text, "rich text", HTML will all have a different way of separating the reply from the original
In Outlook I can choose from the following options when replying to a message:
Do not include
Attach original message
Include original message text
Include and indent original message text
Prefix each line of the original message
On top of that, I often send and receive replies that state "Responses in-line" where my comments are intermingled with the original message, so the original message no longer exists in its original form anyway.

Some heuristics you can try are
-Any number of > characters
-Looking for "wrote: " (be very careful with this one)
Also you can try relating the Message ID field with the In Reply To field
And finally, if you cannot find a good library to do this, it is time to start this project. No more parsing emails the Cthulhu way :)
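The first two heuristics could be sketched like this. It is a rough illustration only: the names `ReplyStripper` and `StripQuoted` are invented, the "wrote:" check is English-only, and treating everything after an attribution line as quoted is an assumption that breaks for in-line replies.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ReplyStripper
{
    // A line that starts with any number of '>' characters (optionally indented).
    private static readonly Regex QuotedLine = new Regex(@"^\s*>");

    // A line that ends in "wrote:", e.g. "On Monday, Paul wrote:".
    private static readonly Regex Attribution =
        new Regex(@"\bwrote:\s*$", RegexOptions.IgnoreCase);

    public static string StripQuoted(string body)
    {
        var kept = new List<string>();
        foreach (string line in body.Replace("\r\n", "\n").Split('\n'))
        {
            if (Attribution.IsMatch(line))
                break;                      // assume the rest is the quoted original
            if (QuotedLine.IsMatch(line))
                continue;                   // skip '>'-prefixed lines
            kept.Add(line);
        }
        return string.Join("\n", kept).Trim();
    }
}
```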

How can I make this regex match correctly?

Given this regex:
^((https?|ftp):(\/{2}))?(((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))|(((([a-zA-Z0-9]+)(\.)*?))(\.)([a-z]{2}
|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum){1})
Reformatted for readability:
@"^((https?|ftp):(\/{2}))?" + // http://, https://, ftp:// - Protocol Optional
@"(" + // Begin URL payload format section
@"((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)" + // IPv4 Address support
@")|(" + // Delimit supported payload types
@"((([a-zA-Z0-9]+)(\.)*?))(\.)([a-z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum){1}" + // FQDNs
@")"; // End URL payload format section
How can I make it fail (i.e. not match) on this "fail" test case?
http://www.google
As I am specifying {1} on the TLD section, I would think it would fail without the extension. Am I wrong?
Edit: These are my PASS conditions:
"http://www.zi255.com?Req=Post&PID=4",
"http://www.zi255.com?Req=Post&ID=4",
"http://www.zi255.com/?Req=Post&PID=4",
"http://www.zi255.com?Req=Post&PostID=4",
"http://www.zi255.com/?Req=Post&ID=4"
"http://www.zi255.com?Req=Post&Post=4",
"http://www.zi255.com?Req=Post&Entry=4",
"http://www.zi255.com?PID=4"
"http://www.zi255.com/Post.aspx?Req=Post&ID=4",
"http://www.zi255.com/Post.aspx?Req=Post&PID=4",
"http://www.zi255.com/Post.aspx?Req=Post&Post=4",
"http://www.zi255.com/Post.aspx?Req=Post&Title=Random%20Post%20Name"
"http://www.zi255.com/?Req=Post&Title=Random%20Post%20Name",
"http://www.zi255.com?Req=Post&Title=Random%20Post%20Name",
"http://www.zi255.com?Req=Post&PostID=4",
"http://www.zi255.com?Req=Post&Post=4",
"http://www.zi255.com?Req=Post&Entry=4",
"http://www.zi255.com?PID=4"
"http://www.zi255.com",
"http://www.damnednice.com"
These are my FAIL conditions:
"http://.com",
"http://.com/",
"http:/www.google.com",
"http:/www.google.com/",
"http://www.google",
"http://www.googlecom",
"http://www.google.c",
".com",
"https://www..."

I'll throw out an alternative suggestion. You may want to use a combination of the parsing of the built-in System.Uri class and a couple targeted regexes (or simple string checks when appropriate).
Example:
string uriString = "...";
Uri uri;
if (!Uri.TryCreate(uriString, UriKind.Absolute, out uri))
{
// Uri is totally invalid!
}
else
{
// validate the scheme
if (!uri.Scheme.Equals("http", StringComparison.OrdinalIgnoreCase))
{
// not http!
}
// validate the authority ('www.blah.com:1234' portion)
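if (!IsAcceptableAuthority(uri.Authority)) // hypothetical helper holding your targeted regex or string check
{
// bad authority!
}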
// ...
}
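Filled in, the approach might look like the sketch below. Several parts are assumptions rather than anything from this answer: the names `UrlChecker` and `IsAcceptable`, the decision to allow http, https and ftp (taken from the question's regex), the explicit check for "://" in the raw input (added because Uri parsing can be lenient about slashes), and the TLD list copied from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlChecker
{
    // One or more dot-separated labels followed by a TLD from the question's list.
    private static readonly Regex HostPattern = new Regex(
        @"^(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+" +
        @"(?:[a-z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum)$",
        RegexOptions.IgnoreCase);

    public static bool IsAcceptable(string uriString)
    {
        Uri uri;
        if (!Uri.TryCreate(uriString, UriKind.Absolute, out uri))
            return false;                                   // not an absolute URI at all

        bool schemeOk = uri.Scheme == Uri.UriSchemeHttp
                     || uri.Scheme == Uri.UriSchemeHttps
                     || uri.Scheme == Uri.UriSchemeFtp;
        if (!schemeOk)
            return false;

        // Insist on "scheme://" in the raw input, whatever the parser tolerated.
        if (!uriString.StartsWith(uri.Scheme + "://", StringComparison.OrdinalIgnoreCase))
            return false;

        if (uri.HostNameType == UriHostNameType.IPv4)
            return true;                                    // dotted-quad address

        return HostPattern.IsMatch(uri.Host);               // named host with a known TLD
    }
}
```

The path and query string are left entirely to `System.Uri`, so the regex only has to describe the host name.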

Sometimes, one catch-all regex is not the best solution, however tempting. While debugging this regex is feasible (see Greg Hewgill's answer), consider doing a couple of tests for different categories of problems, e.g. one test for numerical addresses and one test for named addresses.

You need to force your regex to match up until the end of the string. Add a $ at the very end of it. Otherwise, your regex is probably just matching http://, or something else shorter than your whole string.
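To see what the unanchored pattern actually matches, a small diagnostic like the one below can help. The class and method names are invented for this example; the pattern is the one from the question, joined onto a single line.

```csharp
using System.Text.RegularExpressions;

public static class UnanchoredMatchDemo
{
    // The question's pattern, unchanged. Note that the top-level '|' splits it
    // in two, so the leading '^' only applies to the IPv4 alternative.
    public const string Pattern =
        @"^((https?|ftp):(\/{2}))?(((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))" +
        @"|(((([a-zA-Z0-9]+)(\.)*?))(\.)([a-z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum){1})";

    // Returns the text the regex matched, or an empty string if nothing matched.
    public static string MatchedText(string input)
    {
        Match match = Regex.Match(input, Pattern);
        return match.Success ? match.Value : string.Empty;
    }
}
```

For "http://www.google" this returns "www.go": the host-name alternative is free to start anywhere in the string, and `[a-z]{2}` accepts the first two letters of "google" as a TLD. Anchoring the end helps only if both alternatives are also grouped under the same `^`, and the pattern would then need to cover the path and query of the PASS cases as well.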

The "validate a url" problem has been solved* numerous times. I suggest you use the System.Uri class, it validates more cases than you can shake a stick at.
The code Uri uri = new Uri("http://whatever"); throws a UriFormatException if it fails validation. That is probably what you'd want.
*) Or kind of solved. It's actually pretty tricky to define what is a valid url.

It's all about definitions: a "valid URL" should give you an IP address when you do a DNS lookup. You should be able to connect to that IP, and when a request is sent, you get a reply in the form of HTML that you can use.
So what we are looking for is a "valid URL format", and that is where System.Uri comes in very handy. BUT, if the URL is hidden in a large piece of text, you would first need to find something that looks like a valid URL format.
The thing that distinguishes a URL from any other readable text is a dot not followed by whitespace. "123.com" could validate as a real URL.
Use the regex
[a-z_\.\-0-9]+\.[a-z]+[^ ]*
to find any possible URL in a text, then do a System.Uri check to see if it has a valid URL format, and then do a lookup. Only when the lookup gives you a result do you know the URL is valid.

Newlines escaped unexpectedly in C#/ASP.NET 1.1 code

Can someone explain to me why my code:
string messageBody = "abc\n" + stringFromDatabaseProcedure;
where stringFromDatabaseProcedure is a value from the SQL database entered as
'line1\nline2'
results in the string:
"abc\nline1\\nline2"
This has resulted in me scratching my head somewhat.
I am using ASP.NET 1.1.
To clarify,
I am creating string that I need to go into the body of an email on form submit.
I mention ASP.NET 1.1 as I do not get the same result using .NET 2.0 in a console app.
All I am doing is adding the strings and when I view the messageBody string I can see that it has escaped the value string.
Update
What is not helping me at all is that Outlook is not showing the \n in a text email correctly (unless you reply or forward it).
An online mail viewer (even the Exchange webmail) shows \n as a new line as it should.

I just did a quick test on a test NorthwindDb and put in some junk data with a \n in middle. I then queried the data back using straight up ADO.NET and what do you know, it does in fact escape the backslash for you automatically. It has nothing to do with the n it just sees the backslash and escapes it for you. In fact, I also put this into the db: foo"bar and when it came back in C# it was foo\"bar, it escaped that for me as well. My point is, it's trying to preserve the data as is on the SQL side, so it's escaping what it thinks it needs to escape. I haven't found a setting yet to turn that off, but if I do I'll let you know...

ASP.NET would use <br /> to make linebreaks. \n would work with Console Applications or Windows Forms applications. Are you outputting it to a webpage?
Method #1
string value = "line1<br />line2";
string messageBody = "abc<br />" + value;
If that doesn't work, try:
string value = "line1<br>line2";
string messageBody = "abc<br>" + value;
Method #2
Use System.Environment.NewLine:
string value = "line1"+ System.Environment.NewLine + "line2";
string messageBody = "abc" + System.Environment.NewLine + value;
One of these ways is guaranteed to work. If you're outputting a string to a Webpage (or an email, or a form submit), you'd have to use one of the ways I mentioned. The \n will never work there.

You need to set a watch and see where exactly your database result string gets double escaped.
Adding two strings together will never double escape strings, so it's either happening before that, or after that.

When I get the string out of the database, .NET escapes it automagically. However, the little @ symbol is prepended to the string, which I did not notice.
So it appeared to be non-escaped to my "about to go on holiday" eye inside the IDE.
Therefore when the non-escaped \n was added to the string (as the whole string is no longer escaped), it would remove the @ and show the database portion of the string escaped.
Gah, it was all an illusion.
Perhaps that holiday is overdue.
Thanks for your input.

If the actual string stored in the database is (spaces added for emphasis): "l i n e 1 \ n l i n e 2", then whatever stored it there probably has a bug. But assuming that is the exact string there, then "abc\nline1\\nline2" is what you see when a debugger escapes a string that would print as "abc", a line break, and then "line1\nline2" (the escaping is a convenience, allowing you to copy-paste out of the debugger straight into code without errors).
Short answer: .NET is not escaping the string, your debugger is. The code which writes a literal "\n" into the database has a bug.
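A small sketch can make the difference concrete. The verbatim literal below merely stands in for the value read from the database, and the names `EscapeDemo` and `Build` are invented for this example.

```csharp
public static class EscapeDemo
{
    public static string Build()
    {
        // Verbatim literal: a backslash followed by 'n', two separate characters,
        // standing in for the text as it is stored in the database.
        string fromDatabase = @"line1\nline2";

        // Regular literal: \n here is a single line-feed character.
        return "abc\n" + fromDatabase;
    }
}
```

The result is 16 characters long: index 3 holds a real line feed, while indexes 9 and 10 hold a backslash and the letter n. A debugger that displays the string in escaped form writes that single backslash as `\\`, which is where the apparent double escaping comes from.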
