My previous question was closed because a similar question had supposedly already been asked and answered. I tried to find that question but couldn't, so I will give more details here.
I have a file which I am reading line by line.
In every line I am searching for specific phrase: "see {number}." or "See {number}."
For example:
see 112.
See 4.
etc.
The problem is that I can't hard-code the phrase, because I need to find different strings; all I know is that each starts with "see", followed by a number, and ends with a dot.
At first I thought it would be best to use some kind of regular expression, but I am a little worried about performance.
Once I have settled on how to find the phrase, I need to mark it up in the text
by adding a specific tag before and after the string.
So, for example, when I find "see 112." I want to replace it with <link={found_number}>see 112.</link>, for example: <link=112>see 112.</link>
I would be grateful for any suggestions on how to achieve that.
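The replacement described above can be sketched with a single regular expression substitution. This is only an illustration in Python, not a tested solution for the asker's environment; the function name link_references is made up for the example, and the pattern assumes exactly one space between "see" and the number.

```python
import re

# "see" or "See", one space, one or more digits, then a literal dot.
# The number is captured so it can be reused inside the opening tag.
SEE_PATTERN = re.compile(r"\b[Ss]ee (\d+)\.")

def link_references(line):
    """Wrap every 'see N.' phrase in <link=N>...</link> (hypothetical helper)."""
    # \1 is the captured number; \g<0> is the whole matched phrase.
    return SEE_PATTERN.sub(r"<link=\1>\g<0></link>", line)

print(link_references("For details see 112. Also See 4."))
```

A compiled pattern applied once per line is usually cheap, so the performance concern is worth measuring before ruling regular expressions out.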
I'm attempting to detect all URLs listed in a free text block. I'm using .NET's Regex.Matches call with the following regex: (http|https)://[^\s "']{4,}
Now, I've put in the following text:
here is a link http://somelink.com
here is a link that I didn't space withhttp://nospacelink.com/something?something=&39358235
http://nospacelink.com/something?something=&12233454
here is a link I already handled.
Here is some secret t&cs you're not allowed to know https://somethingbad.com
Just to be a little annoying I've put in a new address thingy capture type of 'http://somethinginspeechmarks.com' and what are you going to do now?
here is a link http://postTextLink.com at then some post text
Here is a link with a full stop http://alinkwithafullstoplink.com. And then some more.
and I get the following output:
http://somelink.com
http://nospacelink.com?something=&39358235
http://nospacelink.com?something=&12233454
http://alreadyhandledlink.com
https://somethingbad.com
http://somethinginspeechmarks.com
http://postTextLink.com
http://alinkwithafullstoplink.com.
Please notice the full stop on the last entry. How can I update my regex to say "If there is a full stop at the end, please ignore it?"
Also, please note that "Getting parts of a URL (Regex)" has nothing to do with my question, as that question is about how to break down a particular URL. I want to extract multiple, complete urls. Please see my input and current outputs for clarification!
I have got a regex already that does most of what I want, but isn't quite right. Could you please explain where my approach might be improved?
I would add something like [^\.] to the end of the pattern.
This says that the last character can't be a full stop.
So (http|https)://[^\s "']{4,}[^\.] will try to match all addresses not ending with a full stop.
Edit:
As suggested in the comments, this class should work better as the final character: [^.\s"']
Updated:
Consider the following minor change to your pattern:
(http|https)://[^\s "']{4,}(?=\.)
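The "last character may not be a full stop" idea from the first answer can be tried out as follows. This is a sketch in Python rather than .NET, and the name extract_urls is invented for the example. It uses {3,} plus one final character so that the minimum length stays the same as the original {4,}.

```python
import re

# Scheme, then at least three URL characters, then one final character
# that is not a full stop, whitespace, or a quote mark.
URL_PATTERN = re.compile(r"""https?://[^\s"']{3,}[^.\s"']""")

def extract_urls(text):
    """Return all URLs in text, without a trailing full stop (hypothetical helper)."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return URL_PATTERN.findall(text)

sample = "Here is a link with a full stop http://alinkwithafullstoplink.com. And then some more."
print(extract_urls(sample))
```

The greedy part first swallows the trailing dot, then backtracks until the final class can match, which leaves the dot out of the result.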
I have a text box for a phone number and I need to validate it. My requirements are:
Accept only numeric input of 10 or more digits
Allow symbols like (, ), - and ,
Can anyone help with this? I tried
^[\d{10,14} +\s +\( +\)-]+$
but it is not working.
You may take a look at the following article, which will help you build such an expression.
You haven't said what is wrong with your regex (why it's not working as expected), but I'm guessing that the issue is that it matches far more than it should. That is, it will match one or more of all the characters in your set (rather than just between 10 and 14).
I think your mistake is that you have put way too much in your character set. You've got the + symbol in there three times, and it looks like you're trying to use quantifiers from within the set as well, which is not allowed. Character sets are the equivalent of single-character alternations. So, [abc] is the equivalent of a|b|c.
I'm assuming that you want the input to contain between 10 and 14 digits while still allowing any number (zero or more) of the following characters:
+()-,
As some others have suggested, you could just put the characters you want in a set and then specify the quantifier after it, like this: ^[0-9()+,-]{10,14}$ (the hyphen goes last so that it is a literal; written as )-, it would define a range that does not include the hyphen itself). This will almost get you there. The only problem is that it will allow between 10 and 14 of any of these characters, so it would successfully match this:
,,,,,++()---
Which clearly you don't want (do you?)
So, in order to better solve this problem, you'll need to be more specific about what is allowed and where in the subject it is allowed. Because I don't know exactly what you want to match, I can't take you much further.
Hopefully the information I've provided here is enough to get you started, and if you have more questions, well, that's what we're all here for, so ask away.
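One way to be more specific is to check the allowed characters and the digit count separately. The sketch below is in Python and rests on assumptions the question does not confirm: that spaces are allowed, and that the limit of 10 to 14 applies to digits only. The name is_valid_phone is invented for the example.

```python
import re

# Digits, parentheses, plus, comma, space and hyphen.
# The hyphen is last in the class so it is a literal, not a range.
ALLOWED_CHARS = re.compile(r"[0-9()+, -]+")

def is_valid_phone(value, min_digits=10, max_digits=14):
    """Accept only allowed characters and 10 to 14 digits (assumed rule)."""
    if not ALLOWED_CHARS.fullmatch(value):
        return False
    # Count digits separately so punctuation does not count towards the length.
    digit_count = sum(ch.isdigit() for ch in value)
    return min_digits <= digit_count <= max_digits

print(is_valid_phone("(555) 123-4567"))
print(is_valid_phone(",,,,,++()---"))
```

Splitting the rule in two keeps the regex simple and rejects the punctuation-only input shown above, because it contains no digits.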
To help you out with learning, below are a few resources you might find useful (this is a small subset of what's available, so do go ahead and search for yourself):
Testing tools
Rubular (ruby)
GSkinner Regex Tester
RegexHero (dotnet)
Helpful info
Regular-Expressions.Info
Codeproject 30 Minute Tutorial
I am wondering if anyone can help me out with parsing out data for key words.
say I am looking for this keyword: My Example Yo (this is one of many keywords)
I have data like this
MY EXAMPLE YO #108
my-example-yo #108
my-example #108
MY Example #108
These are just a few combinations. There could be words or numbers in front of these sentences, they could be in any case, and there may be nothing after them or, as in the examples above, something after them.
A few ideas came to mind.
store all combinations that I can possible think of in my database then use contains
The downside is that I would end up with a huge database table containing every combination of everything I need to find. I would then have to load the data into memory (through NHibernate) and check every combination. I am trying to determine which category to use based on the keyword, and users can upload thousands of rows to check.
Even if I load subsets and look through them, I still expect this to be slow.
Remove all special characters and make single spaces and ignore case and try to use regex to see how much of the keyword matches up.
Not sure what to do if the keyword has special characters like dashes and such.
I know I will not get every combination out there but I want to try get as many as I can.
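The second idea (normalize both sides, then see how much of the keyword matches) could look roughly like this. It is a Python sketch under the assumption that dashes and other special characters should be treated as word separators; both function names are invented for the example.

```python
import re

def normalize(text):
    """Lower-case and turn every run of non-alphanumerics into one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def keyword_match_ratio(keyword, text):
    """Fraction of the keyword's leading words found consecutively in text."""
    keyword_words = normalize(keyword).split()
    text_words = normalize(text).split()
    if not keyword_words:
        return 0.0
    best = 0
    # Try every starting position in the text and count matching words.
    for start in range(len(text_words)):
        matched = 0
        while (matched < len(keyword_words)
               and start + matched < len(text_words)
               and text_words[start + matched] == keyword_words[matched]):
            matched += 1
        best = max(best, matched)
    return best / len(keyword_words)

print(keyword_match_ratio("My Example Yo", "my-example-yo #108"))
print(keyword_match_ratio("My Example Yo", "MY Example #108"))
```

A threshold on the ratio then decides whether a row counts as a match for the keyword; where to put that threshold depends on the data.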
Have you considered Lucene.Net? I haven't used it myself, but I hear it's a great tool for full text searching. It might do well with keyword searching too. I believe that stackoverflow uses Lucene.
In Eclipse, editing Java code, if I type an open-paren, I get a pair of parens. If I then "type through" the second paren, it does not insert an additional paren. How do I get that in emacs?
The Eclipse editor is smart enough to know, when I type the close-paren, that I am just finishing what I started. The cursor advances past the close paren. If I then type a semicolon, same thing: it just overwrites past the semicolon, and I don't get two of them.
In emacs, in java-mode, or csharp-mode if I bind open-paren to skeleton-pair-insert-maybe, I get an open-close paren pair, which is good. But then if I "type through" the close paren, I get two close-parens.
Is there a way to teach emacs to not insert the close paren after an immediately preceding skeleton-pair-insert-maybe? And if that is possible, what about some similar intelligence to avoid doubling the semicolon?
I'm asking about parens, but the same applies to double-quotes, curly braces, square brackets, etc. Anything inserted with skeleton-pair-insert-maybe.
This post shows how to do what you want. As a bonus it also shows how to set it up so that if you immediately backspace after the opening char, it will also delete the closing char after the cursor.
Update:
Since I posted this answer, I've discovered Autopair which is a pretty much perfect system for this use case. I've been using it a lot and loving it.
To summarize what I did, I looked at this post, and took what I wanted out of it. What I ended up with was simpler, because I didn't have the additional requirements he had.
I used these two new definitions:
(defvar cheeso-skeleton-pair-alist
  '((?\) . ?\()
    (?\] . ?\[)
    (?\" . ?\"))
  "Alist mapping each closing character to its opening character.")
(defun cheeso-skeleton-pair-end (arg)
  "Skip the char if it is an ending, otherwise insert it."
  (interactive "*p")
  ;; `last-command-event' replaces the obsolete `last-command-char'.
  (let ((char last-command-event))
    (if (and (assq char cheeso-skeleton-pair-alist)
             (eq char (following-char)))
        (forward-char)
      (self-insert-command (prefix-numeric-value arg)))))
And then in my java-mode-hook, I bound the close-paren and close-bracket this way:
(local-set-key (kbd ")") 'cheeso-skeleton-pair-end)
(local-set-key (kbd "]") 'cheeso-skeleton-pair-end)
I use paredit-mode, which does this and a lot more.
ParEdit sounds like it would handle the parenthesis part of your need, with the caveat that it was designed for Common Lisp and Scheme. Steve Yegge mentions JDEE for emacs Java development, but I can't speak for that from personal experience, and I couldn't find any documentation on it talking about structured editing.
Does anyone have any suggestions as to how I can clean the body of incoming emails? I want to strip out disclaimers, images and maybe any previous email text that may also be present, so that I am left with just the body text content. My guess is it isn't going to be possible in any reliable way, but has anyone tried it? Are there any libraries geared towards this sort of thing?
In email, there are a couple of agreed-upon markers for the kind of content you wish to strip. You can look for these lines using regular expressions. I doubt you can "sanitize" your emails really well, but here are some things you can look for:
Line starting with "> " (greater than then whitespace) marks a quote
Line with "-- " (two hyphens then whitespace then linefeed) marks the beginning of a signature, see Signature block on Wikipedia
Multipart messages, boundaries start with --, beyond that you need to do some searching to separate the message body parts from unwanted parts (like base64 images)
As for an actual C# implementation, I leave that for you or other SOers.
A few obvious things to look at:
If the mail is anything but pure plain text, the message will be multi-part MIME. Any part whose type is "image/*" (image/jpeg, etc.) can probably be dropped. In all likelihood any part whose type is not "text/*" can go.
An HTML message will probably have a part of type "multipart/alternative" (I think), and will have two parts, one "text/plain" and one "text/html". The two parts should be just about equivalent, so you can drop the HTML part. If the only part present is the HTML bit, you may have to do an HTML to plain text conversion.
The usual format for quoted text is to precede the text by a ">" character. You should be able to drop these lines, unless the line starts ">From", in which case the ">" has been inserted to prevent the mail reader from thinking that the "From " is the start of a new mail.
The signature should start with "-- \r\n", though there is a very good chance that the trailing space will be missing.
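The quoting and signature conventions listed above can be applied to a plain-text body with a few lines of code. This is a Python sketch, not a C# implementation; the function name is invented, and it tolerates a signature delimiter whose trailing space is missing.

```python
def strip_quotes_and_signature(body):
    """Drop quoted lines and everything from the signature delimiter on."""
    kept = []
    for line in body.splitlines():
        # "-- " marks the signature; the trailing space is often missing.
        if line.rstrip() == "--":
            break
        # Lines starting with ">" are quotes, except the ">From " escape.
        if line.startswith(">") and not line.startswith(">From "):
            continue
        kept.append(line)
    return "\n".join(kept).strip()

sample = "Hello\r\n> earlier message\r\nSee you soon\r\n-- \r\nJohn"
print(strip_quotes_and_signature(sample))
```

Treating any line that starts with ">" as a quote is a simplification; replies that interleave new text with quotes will still need manual review.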
Version 3 of OSBF-Lua has a mail-parsing library that will handle the MIME and split a message into its MIME parts and so on. I currently have a mess of Lua scripts that do
stuff like ignore most non-text attachments, prefer plain text to HTML, and so on. (I also wrap long lines to 80 characters while trying to preserve quoting.)
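The same preference (plain text over HTML, attachments ignored) can be sketched with the email package in the Python standard library. This is not the OSBF-Lua API, only an illustration of the idea; the function name plain_text_body and the demo message are invented for the example.

```python
import email
from email import policy
from email.message import EmailMessage

def plain_text_body(raw_message):
    """Return the preferred text body of a raw message (hypothetical helper)."""
    msg = email.message_from_string(raw_message, policy=policy.default)
    # Prefer text/plain and fall back to text/html; attachments are skipped.
    part = msg.get_body(preferencelist=("plain", "html"))
    if part is None:
        return ""
    # If only HTML was present, the result still needs HTML-to-text conversion.
    return part.get_content().strip()

# Build a small multipart/alternative message to try it out.
demo = EmailMessage()
demo["Subject"] = "Hello"
demo.set_content("This is the plain text part.")
demo.add_alternative("<p>This is the HTML part.</p>", subtype="html")
print(plain_text_body(demo.as_string()))
```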
As far as removing previously quoted mail, the suggestions above are all good (you must subscribe to some ill-mannered mailing lists).
Removing disclaimers reliably is probably going to be hard. My first cut would be simply to maintain a library of disclaimers that would be stripped off the end of each mail message; I would write a script to make it easy for me to add to the library. For something more sophisticated I would try some kind of machine learning.
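The "library of disclaimers" first cut might start out as simply as the sketch below. It is written in Python, assumes disclaimers appear verbatim at the very end of the body, and uses an invented function name and sample disclaimer.

```python
def strip_trailing_disclaimer(body, disclaimers):
    """Remove one known disclaimer from the end of the body, if present."""
    text = body.rstrip()
    for disclaimer in disclaimers:
        candidate = disclaimer.strip()
        # Only an exact match at the very end of the message is removed.
        if candidate and text.endswith(candidate):
            return text[: -len(candidate)].rstrip()
    return text

# Hypothetical library entry; a real one would be collected from actual mail.
KNOWN_DISCLAIMERS = ["This email is confidential."]
print(strip_trailing_disclaimer("Hi Mark\n\nThis email is confidential.\n", KNOWN_DISCLAIMERS))
```

Exact matching will miss disclaimers that vary in wording or line wrapping, which is where the more sophisticated approaches mentioned above would come in.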
I've been working on spam filtering since Feb 2007 and I've learned that anything to do with email is a mess. A good rule of thumb is that whatever you want to do is a lot harder than you think it is :-(
Given your question "Is it possible to programmatically ‘clean’ emails?", I'd answer "No, not reliably".
The danger you face isn't really a technological one, but a sociological one.
It's easy enough to spot, and filter out, some aspects of the messages - like images. Filtering out signatures and disclaimers is, likewise, possible to achieve (though more of a challenge).
The real problem is the cost of getting it wrong.
What happens if your filter happens to remove a critical piece of the message? Can you trace it back to find the missing piece, or is your filtering destructive? Worse, would you even notice that the piece was missing?
There's a classic comedy sketch I saw years ago that illustrates the point. Two guys working together on a car. One is underneath doing the work, the other sitting nearby reading instructions from a service manual - it's clear that neither guy knows what he's doing, but they're doing their best.
Manual guy, reading aloud: "Undo the bolt in the centre of the oil pan ..." [turns page]
Tool guy: "Ok, it's out."
Manual guy: "... under no circumstances."
If you are creating your own application, I'd look into Regex to find text and replace it. To make the application a little nicer, I'd create a class called Email, and in that class I'd have a property called RAW and a property called Stripped.
Just some hints, you'll gather the rest when you look into regex!
SigParser has an assembly you can use in .NET. It gives you the body back in both HTML and text forms with the rest of the stuff stripped out. If you give it an HTML email it will convert the email to text if you need that.
var parser = new SigParser.EmailParsing.EmailParser();
var result = await parser.GetCleanedBodyAsync(new SigParser.EmailParsing.Models.CleanedBodyInput {
    FromEmailAddress = "john.smith@example.com",
    FromName = "John Smith",
    TextBody = @"Hi Mark,
This is my message.
Thanks
John Smith
888-333-4434"
});
// This would print "Hi Mark,\r\nThis is my message."
Console.WriteLine(result.CleanedBodyPlain);