How to extract a date from a string using regex - C#

I'm looking for a regex that can extract the date from the following HTML
<p>British Medical Journal, 29.9.12, pp.37-41.</p>
and convert it to the format 29/09/12.

Match this pattern:
(\d+)[.](\d+)[.](\d+)
and replace it with:
$1/$2/$3
\d matches a digit. Combined with the + quantifier, it matches one or more digits.
In a regex, a dot (.) is a metacharacter that matches any character (except a newline by default). To match a literal period, either escape it (\.) or put it inside a character class ([.]).
This replacement copies the digits as they are, so 9 stays 9. To produce a specific date format, e.g. to convert 9 to 09, you can use a MatchEvaluator:
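// Requires: using System; using System.Globalization; using System.Text.RegularExpressions;
string input = "British Medical Journal, 29.9.12, pp.37-41.";
Regex reg = new Regex(@"(\d+)[.](\d+)[.](\d+)");
string result = reg.Replace(input, delegate(Match m) {
    // Parse the matched text as day.month.two-digit-year, then reformat it with zero padding.
    DateTime date = DateTime.ParseExact(m.Value, "d.M.yy", CultureInfo.InvariantCulture);
    return date.ToString("dd/MM/yy", CultureInfo.InvariantCulture);
});
// result: "British Medical Journal, 29/09/12, pp.37-41."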
You can check whether it works or not.

Use the regex pattern \d{1,2}\.\d{1,2}\.\d{1,2} to find the date.
And here is an example of how to parse the matched string into a DateTime:
DateTime.ParseExact("29.9.12", "d.M.yy", CultureInfo.InvariantCulture);
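The two steps above (find the date with a regex, then parse and reformat it) can be combined roughly as follows. This is a minimal sketch, not the answerers' exact code: the class and method names are made up for illustration, and the pattern assumes a two-digit year.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative only.
public static class DateExtractor
{
    // One or two digits for day and month, exactly two digits for the year (an assumption).
    private static readonly Regex DatePattern =
        new Regex(@"\b\d{1,2}\.\d{1,2}\.\d{2}\b");

    // Returns the first d.M.yy date found in the text as dd/MM/yy,
    // or null if no candidate parses as a real date.
    public static string ExtractDate(string text)
    {
        foreach (Match m in DatePattern.Matches(text))
        {
            DateTime date;
            if (DateTime.TryParseExact(m.Value, "d.M.yy", CultureInfo.InvariantCulture,
                                       DateTimeStyles.None, out date))
            {
                return date.ToString("dd/MM/yy", CultureInfo.InvariantCulture);
            }
        }
        return null;
    }
}
```

Calling DateExtractor.ExtractDate on the HTML line from the question should give 29/09/12. Using TryParseExact rather than ParseExact means that digit groups which are not a valid date are skipped instead of throwing.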

Use the regex (\d{4})[-](\d{2})[-](\d{2}) to pick out dates in the 2017-01-23 format.

Related

Extract value from a string in C# from a specific position

I have a bunch of files in a folder and I am looping through them.
How do I extract the value from the example below? I need the value 0519 only.
DOC 75-20-0519-1.PDF
The code below returns the complete part, including the -1.
Convert.ToInt32(Path.GetFileNameWithoutExtension(objFile).Split('-')[2]);
Appreciate any help.
You can try regular expressions in order to match the value.
pattern:
[0-9]+ - one or more digits
(?=[^0-9][0-9]+$) - followed by a non-digit, then one or more digits, then the end of the string
code:
using System.Text.RegularExpressions;
...
string file = "DOC 75-20-0519-1.PDF";
// "0519"
string result = Regex
    .Match(Path.GetFileNameWithoutExtension(file), @"[0-9]+(?=[^0-9][0-9]+$)")
    .Value;
If Split('-') fails, and you have an entire string as a result, it seems that you have a wrong delimiter. It can be, say, one of the dashes:
"DOC 75–20–0519–1.PDF"; // n-dash
"DOC 75—20—0519—1.PDF"; // m-dash
You can use REGEX for this
Match match = Regex.Match("DOC 75-20-0519-1.PDF", @"DOC\s+\d+\-\d+\-(\d+)\-\d+", RegexOptions.IgnoreCase);
string data = match.Groups[1].Value;
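If the file names may contain an en dash or em dash instead of a plain hyphen, as suggested above, the lookahead can list those characters explicitly. The following is a sketch under that assumption; FileNameParser and ExtractCode are hypothetical names.

```csharp
using System.IO;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative only.
public static class FileNameParser
{
    // Digits that are followed by a hyphen, en dash (U+2013) or em dash (U+2014)
    // and then by the final run of digits in the name.
    private static readonly Regex SecondToLast =
        new Regex(@"[0-9]+(?=[-\u2013\u2014][0-9]+$)");

    // Returns the second-to-last numeric part of the file name, or null if there is none.
    public static string ExtractCode(string fileName)
    {
        string name = Path.GetFileNameWithoutExtension(fileName);
        Match m = SecondToLast.Match(name);
        return m.Success ? m.Value : null;
    }
}
```

The value is returned as a string so that the leading zero in 0519 is kept; converting it with Convert.ToInt32 would give 519.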

test a specific string with regex

I want to test strings of the form dd/mm/yyyy xx-xxxx-x xxx-xxx, held in a string array. I defined the regex below, but I think the format is not declared correctly.
Regex rgx1 = new Regex(@"^d{2}\/\d{2}\/\d{4}\t[A-Z]\d{2}\-\d{4}\-\[A-Z0-9]\d{1}\t[A-Z]\d{3}\-\[A-Z]\d{3}$");
Match FormatS = rgx1.Match(tab[i]);
if (FormatS.Success)
{
    Console.WriteLine(tab[i]);
    Console.ReadLine();
}
Based on your comment with sample input, this works:
Regex rgx1 = new Regex(@"^\d{2}/\d{2}/\d{4}\s[A-Z]{2}-\d{4}-[A-Z0-9]{1}\s[A-Z]{3}-[A-Z]{3}$");
Problems I found:
\[ instead of [ in two places
d instead of \d at the start
\t instead of \s (or just a space would probably be fine, too)
a few unnecessary \d
I also removed a few unnecessarily-escaped tokens, but... those don't matter as much.
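To check the corrected pattern, it can be wrapped in a small helper and tried against a few lines. This is only a sketch: the sample input from the original comment is not shown here, so the test lines mentioned below are hypothetical, and LineFormat and IsValid are made-up names.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative only.
public static class LineFormat
{
    // Same pattern as the corrected regex above ({1} dropped, since it is implied).
    private static readonly Regex Pattern = new Regex(
        @"^\d{2}/\d{2}/\d{4}\s[A-Z]{2}-\d{4}-[A-Z0-9]\s[A-Z]{3}-[A-Z]{3}$");

    // True when the whole line has the expected date/code/code layout.
    public static bool IsValid(string line)
    {
        return Pattern.IsMatch(line);
    }
}
```

With this, a made-up line such as 12/03/2015 AB-1234-X ABC-DEF should pass whether the fields are separated by a space or a tab, because \s accepts both, while a line with lowercase letters in the code fields should be rejected.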

String.Format - query metadata from the format string

Is there any function to query the expected inputs and formats from a format string - i.e. one intended as the first argument to the String.Format function?
e.g. given:
"On {0:yyyyy-MM-dd} do {1} and earn {2:C2}"
I'd like to get back something like:
{"yyyyy-MM-dd", null, "C2"}
I guess a regex is one possibility but is there anything precanned that hooks into the same logic as String.Format?
String.Format itself doesn't parse the format string. It ends up calling the internal StringBuilder.AppendFormatHelper method which treats the format strings only as delimited strings. It doesn't try to parse them. The format is passed directly to each argument type's formatter method. String formatting performance is critical, both for the runtime and applications.
You can use a regular expression to parse the format string. You'd need to take care of escaped braces ({{, }}) and alignment strings.
The regex {(?<index>\d+)(,(?<algn>-?\d+?))?(:(?<fmt>.*?))?} extracts the index, alignment and format segments as named groups. It doesn't take care of escaped braces explicitly. It will skip a bare {{ or }}, but it will still match inside an escaped sequence such as {{2,20:N}}:
var regex = new System.Text.RegularExpressions.Regex(@"{(?<index>\d+)(,(?<algn>-?\d+?))?(:(?<fmt>.*?))?}");
var matches = regex.Matches("asdf{0:d2} {1:yyyy-MM-dd} {2,-20:N2}");
foreach (Match match in matches)
{
    Console.WriteLine("{0,-5} {1,-15} {2,-15}",
        match.Groups["index"].Value,
        match.Groups["algn"].Value,
        match.Groups["fmt"].Value);
}
This will return :
0                     d2
1                     yyyy-MM-dd
2     -20             N2
The (?<name>...) syntax captures a pattern and exposes it as a named group. (?<index>\d+) captures a sequence of digits and exposes it as the group index.
The ? in .*? specifies a non-greedy match. By default a regex is greedy: it captures as many characters matching a pattern as possible. With .*? the regex captures as few characters as possible before the next pattern starts. That's why the optional algn group stops at :.
There is probably no standard way to do that. Use Regex, it's easy:
var args = new List<string>();
var str = "On {0:yyyyy-MM-dd} do {1} and earn {2:C2}";
MatchCollection matches = Regex.Matches(str, @"\{\d+[^\{\}]*\}");
foreach (Match match in matches)
{
    string obj = null;
    var split = match.ToString().Split(':');
    if (split.Length == 2) obj = split.Last().Trim(' ', '}', '{');
    args.Add(obj);
}
// Result: args = {"yyyyy-MM-dd", null, "C2"}
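Neither regex above handles escaped braces explicitly. One rough way around that, shown here only as a sketch, is to strip {{ and }} before matching. This is a simplification and not how String.Format itself parses the string; FormatStringInspector and GetFormats are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative only.
public static class FormatStringInspector
{
    // {index[,alignment][:format]} with the three parts exposed as named groups.
    private static readonly Regex Item = new Regex(
        @"\{(?<index>\d+)(,(?<algn>-?\d+))?(:(?<fmt>[^{}]*))?\}");

    // Returns the format specifier of each item in order of appearance,
    // or null for an item that has no specifier.
    public static List<string> GetFormats(string format)
    {
        // Simplification: remove escaped braces so they cannot form a false item.
        string stripped = format.Replace("{{", "").Replace("}}", "");
        var result = new List<string>();
        foreach (Match m in Item.Matches(stripped))
        {
            Group fmt = m.Groups["fmt"];
            result.Add(fmt.Success ? fmt.Value : null);
        }
        return result;
    }
}
```

For the format string in the question this should return yyyyy-MM-dd, null and C2, and an escaped item such as {{0}} is ignored. Unusual mixes of escaped braces inside a format specifier are not covered by this approach.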

Formatting long datetime string to remove T character

I have a number of XML nodes which output a datetime object as string.
The problem is that when outputting both the time stamp and the date they are bonded together with a T Character.
Here is an example
2016-01-13T23:59:59
Of course all of the nodes in the XML are of a different type, so grouping by name or type is out of the question. I'm thinking my only option is to match a pattern with regex and resolve the problem that way.
Below is an example of how the XML would look. You can see that each element has a different name, but they all follow a similar pattern, where the T between the date and the time must be replaced with a space.
<dates>
<1stDate> 2016-01-13T23:59:59 </1stDate>
<2ndDate> 2017-01-13T23:55:57 </2ndDate>
<3rdDate> 2018-01-13T23:22:19 </3rdDate>
</dates>
Ideal solution to output like this
2016-01-13 23:59:59
2017-01-13 23:55:57
2018-01-13 23:22:19
I haven't had to use Regex before, but I know what it is. I have been trying to decode what this cheat sheet means http://regexlib.com/CheatSheet.aspx?AspxAutoDetectCookieSupport=1 but to no avail.
UPDATE
//How each node is output
foreach (XText node in nodes)
{
    node.Value = node.Value.Replace("T", " "); // Where a date occurs, replace T with space.
}
The <date> elements provided in the example may contain dates in my XML but may not include the word date as a name.
e.g.
<Start> 2017-01-13T23:55:57 </Start>
<End> 2018-01-13T23:22:19 </End>
<FirstDate> 2018-01-13T23:22:19 </FirstDate>
The main reason I would have liked a regex solution is that I need to match the date string against a pattern that can determine whether it's a date or not; then I can apply formatting.
Why not parse that (perfectly valid ISO-8601) date time into a DateTime, and then use the built in string formatting to produce a presentable human readable date time?
if (!string.IsNullOrWhiteSpace(node.Value))
{
    DateTime date;
    // DateTimeStyles.None keeps the parsed value as written; AssumeUniversal would convert it to local time.
    if (DateTime.TryParseExact(node.Value.Trim(),
                               @"yyyy-MM-dd\THH:mm:ss",
                               CultureInfo.InvariantCulture,
                               DateTimeStyles.None,
                               out date))
    {
        node.Value = date.ToString("yyyy-MM-dd HH:mm:ss");
    }
}
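Since the question asks for a way to tell whether a value is a date before reformatting it, the parse-then-format idea can be packaged as a helper that leaves non-date values untouched. This is a minimal sketch: DateNodeFormatter and NormalizeIfDate are hypothetical names, and DateTimeStyles.None is my choice here so that the time is not shifted to the local time zone.

```csharp
using System;
using System.Globalization;

// Hypothetical helper; the names are illustrative only.
public static class DateNodeFormatter
{
    // Returns "yyyy-MM-dd HH:mm:ss" when the value is a valid yyyy-MM-ddTHH:mm:ss
    // timestamp (ignoring surrounding whitespace); otherwise returns the value unchanged.
    public static string NormalizeIfDate(string value)
    {
        if (string.IsNullOrWhiteSpace(value)) return value;

        DateTime date;
        bool isDate = DateTime.TryParseExact(
            value.Trim(), "yyyy-MM-dd'T'HH:mm:ss",
            CultureInfo.InvariantCulture, DateTimeStyles.None, out date);

        return isDate
            ? date.ToString("yyyy-MM-dd HH:mm:ss", CultureInfo.InvariantCulture)
            : value;
    }
}
```

In the loop from the question this could be used as node.Value = DateNodeFormatter.NormalizeIfDate(node.Value). Text that merely contains a capital T, and impossible dates such as 2015-15-10T24:12:10, are left as they are.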
I would use:
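DateTime parsed;
if (DateTime.TryParse(yourString, out parsed))
{
    // Strings are immutable, so assign the result of Replace back.
    yourString = yourString.Replace("T", " ");
}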
EDIT
If you would only like to replace the first instance of the letter "T", as I think you are suggesting in your UPDATE, you could use this extension method:
public static string ReplaceFirst(this string text, string search, string replace)
{
    int pos = text.IndexOf(search);
    if (pos < 0)
    {
        return text;
    }
    return text.Substring(0, pos) + replace + text.Substring(pos + search.Length);
}
and you would use it like:
yourString = yourString.ReplaceFirst("T", " ");
If you still want to do this with regex, the following expression should do the trick (the # comment lines require RegexOptions.IgnorePatternWhitespace):
# Positive lookbehind for date part which consists of numbers and dashes
(?<=[0-9-]+)
# Match the T in between
T
# Positive lookahead for time part which consists of numbers and colons
(?=[0-9:]+)
EDIT
The regex above will NOT check if the string is in date/time format. It is a generic pattern. To impose the format for your strings use this pattern:
# Positive lookbehind for date part
(?<=\d{4}(-\d{2}){2})
# Match the T
T
# Positive lookahead for time part
(?=\d{2}(:\d{2}){2})
Again, this will match exactly the strings you have, but you should not use it to validate date/time values because it will also match invalid dates like 2015-15-10T24:12:10. To validate date/time values, use the DateTime.Parse() or DateTime.TryParse() methods.
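Applied from C#, the stricter lookaround pattern can be passed to Regex.Replace on one line, without the comment lines. The sketch below assumes that form; DateSeparator and ReplaceT are hypothetical names.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative only.
public static class DateSeparator
{
    // A T that has yyyy-MM-dd directly before it and HH:mm:ss directly after it.
    private static readonly Regex TBetweenDateAndTime = new Regex(
        @"(?<=\d{4}(-\d{2}){2})T(?=\d{2}(:\d{2}){2})");

    // Replaces every such T with a space and leaves all other text alone.
    public static string ReplaceT(string text)
    {
        return TBetweenDateAndTime.Replace(text, " ");
    }
}
```

Because only the T itself is matched, other capital T characters in element names or text are untouched. As noted above, the pattern checks the shape only, so a value like 2015-15-10T24:12:10 is rewritten as well.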

C# Regex for retrieving capital string in quotation mark

Given a string, I want to retrieve a string that is in between the quotation marks, and that is fully capitalized.
For example, if a string of
oqr"awr"q q"ASRQ" asd "qIKQWIR"
has been entered, the regex should match only "ASRQ".
What is the best way to approach this?
Edit: Forgot to mention that the string can contain digits as well, e.g. "IO8917AS" is a valid input.
EDIT: If you actually want "one or more characters, and none of the characters is a lower-case letter" then you probably want:
Regex regex = new Regex("\"\\P{Ll}+\"");
That will then allow digits as well... and punctuation. If you want to allow digits and upper case letters but nothing else, you can use:
Regex regex = new Regex("\"[\\p{Lu}\\d]+\"");
Or in verbatim string literal form (makes the quotes more confusing, but the backslashes less so):
Regex regex = new Regex(@"""[\p{Lu}\d]+""");
Original answer (before digits were required)
Sounds like you just want (within the pattern)
"[A-Z]*"
So something like:
Regex regex = new Regex("\"[A-Z]*\"");
Or for full Unicode support, use the Lu Unicode character category:
Regex regex = new Regex("\"\\p{Lu}*\"");
EDIT: As noted, if you don't want to match an empty string in quotes (which is still "a string where everything is upper case") then use + instead of *, e.g.
Regex regex = new Regex("\"\\p{Lu}+\"");
Short but complete example of finding and displaying the first match:
using System;
using System.Text.RegularExpressions;
class Program
{
    public static void Main()
    {
        Regex regex = new Regex("\"\\p{Lu}+\"");
        string text = "oqr\"awr\"q q\"ASRQ\" asd \"qIKQWIR\"";
        Match match = regex.Match(text);
        Console.WriteLine(match.Success); // True
        Console.WriteLine(match.Value); // "ASRQ"
    }
}
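To collect every match rather than only the first, and to allow digits as in the edited question, Regex.Matches can be combined with a capture group so that the quotes are dropped from the result. This is a sketch; QuotedCapitals and FindAll are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative only.
public static class QuotedCapitals
{
    // A quote, one or more uppercase letters or digits (captured), then a quote.
    private static readonly Regex Pattern = new Regex(@"""([\p{Lu}\d]+)""");

    // Returns the text between the quotes for every match, in order of appearance.
    public static List<string> FindAll(string text)
    {
        var found = new List<string>();
        foreach (Match m in Pattern.Matches(text))
        {
            found.Add(m.Groups[1].Value);
        }
        return found;
    }
}
```

For the sample string with "IO8917AS" appended, this should return ASRQ and IO8917AS. Note that the pattern does not track which quotes open and which close, so uppercase text sitting between two separate quoted strings would also match.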
Like this:
"\"[A-Z]+\""
The outermost quotes are not part of the regex, they delimit a C# string.
This requires at least one uppercase character between quotes and works for the English language.
Please try the following:
[\w]*"([A-Z0-9]+)"
