I need to extract the url inside the string [closed] - c#

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist
Closed 9 years ago.
I need to extract the URLs inside a string.
In my case the HTML text is stored in the database. When I retrieve that text, I need to find all the URLs in it and insert them into another table. Can you give me a way to find the URLs in SQL or C#?

This is a regular expression to find URLs in text:
Regex regx = new Regex("http://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?", RegexOptions.IgnoreCase);
MatchCollection matches = regx.Matches(txt);

One possible way to do it is with regular expressions. The first option is to extract the HTML from the DB, then use a regular expression to find the links directly. The second option is to locate the link tags first, then extract the URL from each of them (again using regular expressions).
Here you can find information about how to use Regular Expressions in C#:
http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.aspx
On the other hand, writing the correct Regular Expression may not be so easy (it depends on how complex the URL is), but you should take a look at this question: regular expression for url
Also, here you can find a lot of information about regular expressions in general (keep in mind that there are some applications like RegexBuddy, that can help you a lot when it comes to testing your regular expressions): http://www.regular-expressions.info/
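To make the second option (locate the link tags, then pull the URL out of each) a bit more concrete, here is a minimal sketch. The class name, method name and pattern are my own and only illustrative; the pattern handles quoted href attributes on <a> tags and will miss unquoted values or unusual markup, so treat it as a starting point rather than a complete solution.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UrlExtractor
{
    // Illustrative pattern: captures the value of href="..." or href='...' inside <a> tags.
    private static readonly Regex HrefPattern = new Regex(
        @"<a\b[^>]*?\bhref\s*=\s*[""']([^""']+)[""']",
        RegexOptions.IgnoreCase);

    // Returns every href value found in the given HTML, in document order.
    public static List<string> ExtractLinkUrls(string html)
    {
        var urls = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            urls.Add(m.Groups[1].Value);
        }
        return urls;
    }
}
```

Each returned string could then be inserted into the second table with whatever data access code you already use.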

Related

How to find __tokens__ in a string without using RegEx in C# [closed]

Closed 5 years ago.
I have the following:
The __animal__ jumped over the __object__
I'd like to write a concise method in C# that returns animal and object. How can this be done without RegEx? A dirty solution would be to iterate through the characters one by one until we find a __ and then build a string until we find the closing __ - but I'm looking for a more elegant approach.
You are going to have to iterate in some form if you don't want to use regex.
string text = "The __animal__ jumped over the __object__";
List<string> words = text.Split(' ').ToList();
// Keep the words wrapped in double underscores, then strip the underscores.
words = words.Where(x => x.StartsWith("__") && x.EndsWith("__")).Select(x => x.Trim('_')).ToList();
You can use the IEnumerable extension methods (System.Linq) to search, as above, but the framework is technically still iterating.
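Another regex-free variation, sketched under the assumption that tokens are always delimited by a matching pair of "__": split on the delimiter itself and keep the odd-numbered pieces. The class and method names are made up for the example. Unlike splitting on spaces, this also copes with tokens that contain spaces or are followed by punctuation.

```csharp
using System;
using System.Collections.Generic;

public static class TokenFinder
{
    // Returns the text found between each opening and closing "__".
    public static List<string> FindTokens(string text)
    {
        string[] parts = text.Split(new[] { "__" }, StringSplitOptions.None);
        var tokens = new List<string>();
        // Odd-indexed pieces sit between an opening and a closing "__".
        // Stopping before the last piece ignores an unmatched trailing "__".
        for (int i = 1; i < parts.Length - 1; i += 2)
        {
            tokens.Add(parts[i]);
        }
        return tokens;
    }
}
```

Calling TokenFinder.FindTokens("The __animal__ jumped over the __object__") yields the two entries "animal" and "object".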

how to parse a search query string like SO [closed]

Closed 9 years ago.
I want to build a search function with a keyword format on top of Entity Framework.
void funcSearch(string keywork)
{
if (keywork == "[tag]")
{
//regex for is tag
//do search tag
}
if (keywork == "user:1234")
{
//regex for userid is 1234
//do search user with 1234
}
...
}
Can I use regex to parse a query string format like SO's, or is there another method? I'd like a function that can analyze all of the cases below and map each one to its keyword:
tags        [tag]
exact       "words here"
author      user:1234
            user:me (yours)
score       score:3 (3+)
            score:0 (none)
answers     answers:3 (3+)
            answers:0 (none)
            isaccepted:yes
            hasaccepted:no
            inquestion:1234
views       views:250
sections    title:apples
            body:"apples oranges"
url         url:"*.example.com"
favorites   infavorites:mine
            infavorites:1234
status      closed:yes
            duplicate:no
            migrated:no
            wiki:no
types       is:question
            is:answer
Thank you for any advice.
Yes, you can. You'd have to create a list of regular expressions to check and loop through them until you find a match. (Make sure to prioritize them correctly.)
For example, to find out if a search query is querying tags, you can use the following regex:
string query = "[tag]";
bool isTag = Regex.IsMatch(query, @"^\[.+?\]$");
Here's another regex matching a user ID:
string query = "user:1234";
var match = Regex.Match(query, @"^user:(\d+)$", RegexOptions.IgnoreCase);
Note that you should trim your query first.
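Putting the idea of a prioritized list of regular expressions together, a rough sketch could look like the following. The rule names, the set of rules and the fallback "text" kind are all assumptions made for the example; only a few of the formats from the question are covered, and the rest would be added the same way.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SearchQueryParser
{
    // Ordered list: earlier entries win. Names and patterns are illustrative only.
    private static readonly List<KeyValuePair<string, Regex>> Rules =
        new List<KeyValuePair<string, Regex>>
        {
            new KeyValuePair<string, Regex>("tag",   new Regex(@"^\[(.+?)\]$")),
            new KeyValuePair<string, Regex>("user",  new Regex(@"^user:(\d+|me)$", RegexOptions.IgnoreCase)),
            new KeyValuePair<string, Regex>("score", new Regex(@"^score:(\d+)$", RegexOptions.IgnoreCase)),
            new KeyValuePair<string, Regex>("exact", new Regex("^\"(.+)\"$")),
        };

    // Returns (kind, value); falls back to ("text", query) when no rule matches.
    public static Tuple<string, string> Classify(string query)
    {
        string trimmed = query.Trim();
        foreach (var rule in Rules)
        {
            Match m = rule.Value.Match(trimmed);
            if (m.Success)
            {
                return Tuple.Create(rule.Key, m.Groups[1].Value);
            }
        }
        return Tuple.Create("text", trimmed);
    }
}
```

The caller can then switch on the returned kind and build the matching Entity Framework query for each case.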

have to extract data from a word file [closed]

Closed 9 years ago.
I have a peculiar problem in that I have to extract information from a Word file. Say, for example, I have a resume and need to extract the name, email address, phone number, address, university, experience, etc.
Every person may have their resume in a different format. So is there any way I can programmatically extract the information I need?
I need this information to fill in a registration form.
Even if at first you might be attracted by the idea of using COM Interop from ASP.NET, don't do it.
http://support.microsoft.com/kb/257757
That said, it's important to know which version of Word we are talking about. The newer formats (.docx) can be treated as a zip archive containing XML files, and there are good, free libraries for them.
http://docx.codeplex.com/
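To illustrate the "zip containing XML files" point without any third-party library, here is a minimal sketch that reads the paragraph text of a .docx using only System.IO.Compression and LINQ to XML. The class and method names are invented for the example. It assumes the main text lives in word/document.xml and ignores headers, footers and formatting; paragraphs nested inside other paragraphs (e.g. in text boxes) may be reported twice.

```csharp
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

public static class DocxText
{
    // WordprocessingML main namespace used by elements such as <w:p> and <w:t>.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns one string per paragraph of the main document part.
    public static string[] ReadParagraphs(Stream docxStream)
    {
        using (var zip = new ZipArchive(docxStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
            {
                throw new InvalidDataException("word/document.xml not found; not a .docx file?");
            }
            using (Stream xml = entry.Open())
            {
                XDocument doc = XDocument.Load(xml);
                // Concatenate the text runs (<w:t>) inside each paragraph (<w:p>).
                return doc.Descendants(W + "p")
                          .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)))
                          .ToArray();
            }
        }
    }
}
```

With a file on disk you would pass File.OpenRead("resume.docx") (a hypothetical file name) and then search the returned paragraphs for the fields you need.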
Convert the Word document to HTML with Aspose.Words for .NET.
Then you can use regular expressions to search the converted Word (and/or PDF) documents.
Or you can use HTMLAgilityPack to parse the generated HTML documents and search for specific sections/paths.
PS:
If you have a regex for email that's shorter than one page, then the regex is incorrect.
Phone should be manageable, as long as you have to support only one country.
As for name and address, good luck with that.
Edit:
Like this
VB.NET:
Dim doc As New Aspose.Words.Document("filename.docORdocx")
doc.Save("filename.html", Aspose.Words.SaveFormat.Html)
C#:
Aspose.Words.Document doc = new Aspose.Words.Document("filename.docORdocx");
doc.Save("filename.html", Aspose.Words.SaveFormat.Html);
The component is here:
http://www.aspose.com/.net/word-component.aspx
To find out what a valid email address is, read RFC 822:
http://www.faqs.org/rfcs/rfc822.html

Differences between C# and JavaScript Regular Expressions? [closed]

Closed 4 years ago.
Are C# and JavaScript Regular Expressions different?
Is there a list of these differences?
Here's a difference we bumped into that I haven't seen documented anywhere, so I'll publish it, and the solution, here in the hope that it will help someone.
We were testing "some but not all" character classes, such as "A to Z but not Q, V or X", using the "[A-Z-[QVX]]" syntax. Don't know where we found it, don't know if it's documented, but it works in .NET.
For example, in PowerShell, using the .NET regex class,
[regex]::ismatch("K", "^[A-Z-[QVX]]$")
returns true. Test the same input and pattern in JavaScript and it returns false, but test "K" against "^[A-Z]$" in JavaScript and it returns true.
You can use the more orthodox approach of negative lookahead to express "A to Z but not Q, V or X", e.g. "^(?![QVX])[A-Z]$", which will work in both PowerShell and (modern) JavaScript.
Given Ben Atkin's point about IE6 and IE7 not supporting lookahead, it may be that the only way to do this in a fool-proof (or IE7-proof) way is to expand the expression out, e.g. "[A-Z-[QVX]]" -> "[ABCDEFGHIJKLMNOPRSTUWYZ]". Ouch.
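For anyone who wants to check on the .NET side that the two forms behave the same, a small sketch in C# follows. The class and method names are invented for the example; the two patterns are the ones from this answer.

```csharp
using System.Text.RegularExpressions;

public static class LetterFilter
{
    // .NET-only character class subtraction: A-Z except Q, V and X.
    public static bool MatchesWithSubtraction(string s)
    {
        return Regex.IsMatch(s, @"^[A-Z-[QVX]]$");
    }

    // Portable form using a negative lookahead.
    public static bool MatchesWithLookahead(string s)
    {
        return Regex.IsMatch(s, @"^(?![QVX])[A-Z]$");
    }
}
```

Both methods accept "K" and reject "Q", so in .NET either form can be used; only the lookahead version is portable to JavaScript.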
First, some resources:
Mozilla Development Center JavaScript Guide: Regular Expressions
.NET Framework Regular Expressions - see the links at the bottom of the page
Here are a few differences:
Lookahead is not supported in IE6 and IE7. (Search for x(?=y) in the MDC guide for examples.)
JavaScript engines older than ES2018 don't support named capture groups. Example: (?<foo>)
The list of metacharacters supported by JavaScript is much shorter.
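For reference, this is roughly what a named capture group looks like on the .NET side. It is a minimal sketch; the class name, method name and the "user:1234" input format are made up for illustration.

```csharp
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // Returns the numeric id from a string such as "user:1234", or null if it does not match.
    public static string ExtractUserId(string input)
    {
        Match m = Regex.Match(input, @"^user:(?<id>\d+)$");
        // The group is read back by name instead of by position.
        return m.Success ? m.Groups["id"].Value : null;
    }
}
```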

What is a good regular expression for catching typos in an email address? [closed]

Closed 4 years ago.
When users create an account on my site, I want server-side validation for emails so that not every input is accepted.
I will send a confirmation email, as a kind of handshake validation.
I am looking for something simple: not the best, but not so simple that it doesn't validate anything. I don't know where the limit should be, since no regular expression can do fully correct validation.
I'm trying to limit the syntax and visual complexity inherent in regular expressions, since none of them will be fully correct anyway.
What regexp can I use to do that?
It's possible to write a regular expression that only accepts email addresses that follow the standards. However, there are some email addresses out there that don't strictly follow the standards but still work.
Here are some simple regular expressions for basic validation:
Contains a @ character:
@
Contains @ and a period somewhere after it:
@.*?\.
Has at least one character before the @, before the period and after it:
.+@.+\..+
Has only one @, at least one character before the @, before the period and after it:
^[^@]+@[^@]+\.[^@]+$
User AmoebaMan17 suggests this modification to eliminate whitespace:
^[^@\s]+@[^@\s]+\.[^@\s]+$
And for accepting only one period [external edit: not recommended, does not match valid email addresses]:
^[^@\s]+@[^@\s\.]+\.[^@\.\s]+$
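As a rough idea of how the whitespace-free pattern from this answer could be wired into server-side code, here is a small C# sketch. The class and method names are hypothetical, and this is only a sanity check before sending the confirmation email, not real validation.

```csharp
using System.Text.RegularExpressions;

public static class EmailSanityCheck
{
    // One "@", no whitespace, and at least one period somewhere after the "@".
    private static readonly Regex Pattern = new Regex(@"^[^@\s]+@[^@\s]+\.[^@\s]+$");

    // Returns true when the input has the rough shape of an email address.
    public static bool LooksLikeEmail(string input)
    {
        return input != null && Pattern.IsMatch(input);
    }
}
```

It is probably worth trimming the input before the check, since surrounding whitespace from copy and paste would otherwise cause a rejection.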
^\S+@\S+$
^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$
Only 1 @
Several domains and subdomains
I think this little tweak to the expression by AmoebaMan17 should stop the address from starting or ending with a dot, and also stop multiple dots next to each other. I'm trying not to make it complex again while eliminating a common issue.
(?!.*\.\.)(^[^\.][^@\s]+@[^@\s]+\.[^@\s\.]+$)
It appears to be working (but I am no RegEx-pert). It fixes my issue with users copying and pasting email addresses from the end of sentences that terminate with a period.
e.g.: Here's my new email address tabby@coolforcats.com.
Take your pick.
Here's the one that complies with RFC 2822 Section 3.4.1 ...
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
Just in case you are curious. :)
