I have hit a snag in some data parsing because the title line of what I am parsing is somewhat complex. It contains the year, the title, and the edition, but not always in that order. The year and the edition can be converted to ints, but the rest cannot. How can I split the year from the rest of the line so it can be parsed to an int if I don't know where it will be in the line every time?
Example data:
2016 Super special regular season, 01 fifteenth tossup
Math problems galore 2013 Round 02 directed problems
FooBar the amazing game part 1 0f 2 round 03 problems 2015
I know that I can't just test the whole line to see if a character is a number, because there are multiple numbers. Nor can I use something like IndexOf, because I don't know the dates ahead of time.
To get all the numbers from a string, use the Regex.Matches() method to get all matches of the regex:
/* \d+ matches runs of one or more digits */
Regex regex = new Regex(@"\d+");
// Loop through all matches
foreach (Match match in regex.Matches("2016 Super special regular season, 01 fifteenth tossup"))
{
Console.WriteLine(match.Value); /* Test output */
int i = Convert.ToInt32(match.Value); /* Convert To Int and do something with it */
}
============ output ===========
2016
01
/* Use \d{4} instead to match exactly four digits at a time */
/* (Example) => 12564568 => (output) : 1256 and 4568 */
/* (Notice!) If you use \d{4} and a run of digits is shorter than four,
that run produces no match. */
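To see the difference between \d+ and \d{4} on the same input, a small helper like the one below can be used. This is only a sketch; the names DigitRuns and Find are made up for the illustration and are not part of the answer above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class DigitRuns
{
    // Returns the text of every match of the pattern in the input, in order.
    public static string[] Find(string input, string pattern)
    {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }
}
```

DigitRuns.Find("12564568", @"\d{4}") returns 1256 and 4568, while DigitRuns.Find("round 03", @"\d{4}") returns an empty array because the only run of digits is two characters long.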
Or, in one line, to get the first number that occurs in the string:
string resultString = Regex.Match(subjectString /* string to test */, @"\d+").Value;
Use Regex:
// \b (word boundary) keeps the two-digit pattern from matching inside the four-digit year
string pattern_Year = @"\b\d{4}\b";
string pattern_Edition = @"\b\d{2}\b";
string search = "2016 Super special regular season, 01 fifteenth tossup";
var year = Regex.Matches(search, pattern_Year);
var edition = Regex.Matches(search, pattern_Edition);
if (year.Count > 0)
    Console.WriteLine(year[0].Value);
if (edition.Count > 0)
    Console.WriteLine(edition[0].Value);
var line = "FooBar the amazing game part 1 0f 2 round 03 problems 2015";
var numbers = line.Split(' ').Where(word => word.All(char.IsDigit)).Select(int.Parse).ToList();
Now you have the ints 1, 2, 3, 2015.
How you find out what the year is is up to you. Maybe check which is between 1900 and 2017?
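The range check suggested above can be combined with a regex so that punctuation next to the number does not matter. The following is a minimal sketch; the helper name FindYear and the default bounds 1900 to 2100 are assumptions chosen for the example, not something the question specifies.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TitleLineParser
{
    // Hypothetical helper: returns the first standalone four-digit number that
    // falls inside [minYear, maxYear], or null if the line has no such number.
    public static int? FindYear(string line, int minYear = 1900, int maxYear = 2100)
    {
        foreach (Match m in Regex.Matches(line, @"\b[0-9]{4}\b"))
        {
            int value = int.Parse(m.Value);
            if (value >= minYear && value <= maxYear)
            {
                return value;
            }
        }
        return null;
    }
}
```

For the three sample lines in the question this returns 2016, 2013 and 2015. A title that itself contains a four-digit number in range would still be ambiguous, so the bounds may need tightening for real data.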
Something like this:
static int GetYearFromTextLine(string s)
{
string [] words = s.Split(' ');
foreach (string w in words)
{
int number = 0;
if (int.TryParse(w, out number))
{
// assume the first number found over "1900" must be a year
// you can modify this test yourself
if (number >= 1900)
{
return number;
}
}
}
return 0;
}
static void Main(string[] args)
{
Console.WriteLine(GetYearFromTextLine("Math problems galore 2013 Round 02 directed problems"));
}
Try this; it should work:
string strValue = "abc123def456";
char[] charArr = strValue.ToCharArray();
List<int> intList = new List<int>();
for (int i = 0; i < charArr.Length; i++)
{
    string tmpInt = "";
    if (char.IsDigit(charArr[i]))
    {
        tmpInt += charArr[i];
        // Keep consuming while the next character is also a digit
        while ((i < charArr.Length - 1) && char.IsDigit(charArr[i + 1]))
        {
            tmpInt += charArr[i + 1];
            i++;
        }
    }
    if (tmpInt != "")
        intList.Add(int.Parse(tmpInt));
}
The advantage of this approach is that it does not matter where the digits are located in the string, and it does not depend on Split or on any pattern.
Related
I'm working on a program for class called Pig Latin. It works for what I need for class, which only asks you to type in a phrase to convert. But I noticed that if I type a sentence with punctuation at the end, it messes up the translation of the last word. I'm trying to figure out the best way to fix this. I'm new to programming, but I think I need a way to check the last character of a word for punctuation, remove it before translation, and then add it back. I'm not sure how to do that. I have been reading about char.IsPunctuation, and I'm also not sure which part of my code that check belongs in.
public static string MakePigLatin(string str)
{
string[] words = str.Split(' ');
str = String.Empty;
for (int i = 0; i < words.Length; i++)
{
if (words[i].Length <= 1) continue;
string pigTrans = new String(words[i].ToCharArray());
pigTrans = pigTrans.Substring(1, pigTrans.Length - 1) + pigTrans.Substring(0, 1) + "ay ";
str += pigTrans;
}
return str.Trim();
}
The following should get you strings of letters for converting while passing through any non-letter characters that follow them.
Splitter based on Splitting a string in C#
public static string MakePigLatin(string str) {
MatchCollection matches = Regex.Matches(str, @"([a-zA-Z]*)([^a-zA-Z]*)");
StringBuilder result = new StringBuilder(str.Length * 2);
for (int i = 0; i < matches.Count; ++i) {
string pigTrans = matches[i].Groups[1].Captures[0].Value ?? string.Empty;
if (pigTrans.Length > 1) {
pigTrans = pigTrans.Substring(1) + pigTrans.Substring(0, 1) + "ay";
}
result.Append(pigTrans).Append(matches[i].Groups[2].Captures[0].Value);
}
return result.ToString();
}
The matches variable should contain all the match collections of 2 groups. The first group will be 0 or more letters to translate followed by a second group of 0 or more non-letters to pass through. The StringBuilder should be more memory efficient than concatenating System.String values. I gave it a starting allocation of double the initial string size just to avoid having to double the allocated space. If memory is tight, maybe 1.25 or 1.5 instead of 2 would be better, but you'd probably have to convert it back to int after. I took the length calculation off your Substring call because leaving it out grabs everything to the end of the string already.
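For comparison, the approach the question describes (strip trailing punctuation with char.IsPunctuation, translate, then put it back) could be sketched as follows. The class and method names are hypothetical, and unlike the original code this version passes single-letter words through instead of dropping them.

```csharp
using System;
using System.Linq;

public static class PigLatin
{
    // Translates one space-delimited token, keeping trailing punctuation in place.
    public static string TranslateWord(string word)
    {
        // Walk back over any punctuation at the end of the token.
        int end = word.Length;
        while (end > 0 && char.IsPunctuation(word[end - 1]))
        {
            end--;
        }

        string core = word.Substring(0, end); // the letters to translate
        string tail = word.Substring(end);    // the punctuation to re-append

        // Cores of length 0 or 1 are left unchanged.
        if (core.Length > 1)
        {
            core = core.Substring(1) + core.Substring(0, 1) + "ay";
        }
        return core + tail;
    }

    public static string Translate(string sentence)
    {
        return string.Join(" ", sentence.Split(' ').Select(w => TranslateWord(w)));
    }
}
```

PigLatin.Translate("Hello, world!") gives "elloHay, orldway!". Punctuation in the middle or at the start of a token is not handled here; the regex version above covers those cases.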
I'm reading a list by line and using regex in c# to capture the fields:
fed line 1: Type: eBook Year: 1990 Title: This is ebook 1 ISBN:15465452 Pages: 100 Authors: Cendric, Paul
fed line 2: Type: Movie Year: 2016 Title: This is movie 1 Authors: Pepe Giron ; Yamasaki Suzuki Length: 4500 Media Type: DVD
string pattern = @"(?:(Type: )(?<type>\w+)) *(?:(Year: )(?<year>\d{4})) *(?:(Title: )(?<title>[^ISBN]*))(?:(ISBN:) *(?<ISBN>\d*))* *(?:(Pages: )(?<pages>\d*))* *(?:(Authors: )(?<author1>[\w ,]*)) *;* *(?<author2>[\w ,]*) *(?:(Length: )(?<length>\d*))* *(?:Media Type: )*(?<discType>[\w ,]*)";
MatchCollection matches = Regex.Matches(line, pattern);
If the line fed has "Length: " I want to stop capturing the surname of the Author excluding the word Length.
If I use (?:(Length: )(?<length>\d*))* Length is added to the surname of the second author for match.Groups["author2"].Value. If I use (?:(Length: )(?<length>\d*))+ I get no matches for the first line.
Can you please give me guidance.
Thank you, Sergio
Using full regexes for something as fuzzy as the format you have is always a way to hurt yourself. As @Kevin wrote, you should look for the keys and extract the values.
My proposal is to look for those keys and split the string before and after them. There is a nifty, quirky (its behavior even changed between .NET 1.1 and .NET 2.0), little-known feature of Regex called Regex.Split(). We could try to use it :-)
string pattern = @"(?<=^| )(Type: |Year: |Title: |ISBN:|Pages: |Authors: |Length: |Media Type: )";
var rx = new Regex(pattern);
string[] parts = rx.Split(line);
Now parts is an array in which every element that holds a key is followed by an element that holds its value. Because the pattern has a capturing group, Regex.Split keeps the keys in the result, and it can add an empty element at the beginning of the array.
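To make the shape of that array concrete, here is a small sketch that turns the split result into a key/value dictionary. FieldSplitter and Parse are names invented for this example. It always starts pairing at index 1, on the assumption that element 0 is whatever text precedes the first key (usually empty).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FieldSplitter
{
    private static readonly Regex KeyPattern = new Regex(
        @"(?<=^| )(Type: |Year: |Title: |ISBN:|Pages: |Authors: |Length: |Media Type: )");

    // Returns a map such as "Year" -> "1990"; keys lose their trailing colon.
    public static Dictionary<string, string> Parse(string line)
    {
        string[] parts = KeyPattern.Split(line);
        var fields = new Dictionary<string, string>();

        // parts[0] is the text before the first key, so keys sit at odd indexes.
        for (int i = 1; i + 1 < parts.Length; i += 2)
        {
            string key = parts[i].Trim().TrimEnd(':');
            fields[key] = parts[i + 1].Trim();
        }
        return fields;
    }
}
```

For the second sample line, the dictionary holds "Authors" -> "Pepe Giron ; Yamasaki Suzuki", "Length" -> "4500" and "Media Type" -> "DVD", so the Length key no longer leaks into the author names.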
string type = null, title = null, mediaType = null;
int? year = null, length = null;
string[] authors = new string[0];
// The parts[0] == string.Empty ? 1 : 0 is caused by the "strangeness" of Regex.Split
// that can add an empty element at the beginning of the string
for (int i = parts[0] == string.Empty ? 1 : 0; i < parts.Length; i += 2)
{
string key = parts[i].TrimEnd();
string value = parts[i + 1].Trim();
Console.WriteLine("[{0}|{1}]", key, value);
switch (key)
{
case "Type:":
type = value;
break;
case "Year:":
{
int temp;
if (int.TryParse(value, out temp))
{
year = temp;
}
}
break;
case "Title:":
title = value;
break;
case "Authors:":
{
authors = value.Split(new[] { " ; " }, StringSplitOptions.None);
}
break;
case "Length:":
{
int temp;
if (int.TryParse(value, out temp))
{
length = temp;
}
}
break;
case "Media Type:":
mediaType = value;
break;
}
}
After all, @xanathos is right. An overcomplicated regex that is hard to maintain and error prone may not serve you well in the long run.
But to answer your question, your regex can be fixed with a tempered greedy token*, e.g. do not allow Length: in the author's pattern:
(?:(?:(?!Length: )[\w ,])*)
* The linked description uses a . in the greedy token but it's useful to limit the range of allowed characters more here.
Arguably, this should be added to the author1 and author2 part.
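A stripped-down illustration of the tempered token on its own may help. This sketch only captures the authors field, and it adds ; to the character class so both authors land in one group; that simplification, and the names AuthorCapture and Authors, are assumptions made for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AuthorCapture
{
    // Each character is consumed only if "Length: " does not start at that position.
    private static readonly Regex Tempered =
        new Regex(@"Authors: (?<authors>(?:(?!Length: )[\w ,;])*)");

    // Returns the authors text, or null if the line has no "Authors: " key.
    public static string Authors(string line)
    {
        Match m = Tempered.Match(line);
        return m.Success ? m.Groups["authors"].Value.Trim() : null;
    }
}
```

Without the (?!Length: ) lookahead, the character class would also swallow the word Length and stop only at the colon, which is the behavior described in the question.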
The final pattern then looks like this:
(?:(Type: )(?<type>\w+)) *(?:(Year: )(?<year>\d{4})) *(?:(Title: )(?<title>[^ISBN]*))(?:(ISBN:) *(?<ISBN>\d*))* *(?:(Pages: )(?<pages>\d*))* *(?:(Authors: )(?<author1>(?:(?:(?!Length: )[\w ,])*) *)) *;* *(?<author2>(?:(?:(?!Length: )[\w ,])*) *)(?:(Length: )(?<length>\d*))* *(?:Media Type: )*(?<discType>[\w ,]*)
Demo
Update: July 26, 2017
I have a string whose values are comma separated. In some cases, however, it contains two consecutive commas (,,). When I use string.Split(','), it returns an array with an empty value at that index. For example:
string str = "12,3,5,,6,54,127,8,,0,98,";
It breaks the string into an array this way:
str2[0] = 12
str2[1] = 3
str2[2] = 5
str2[3] = ""
str2[4] = 6
str2[5] = 54
str2[6] = 127
str2[7] = 8
str2[8] = ""
str2[9] = 0
str2[10] = 98
str2[11] = ""
As you can see, I get an array with one or more empty values. I want to put a 0 in each empty position when I split the string. I have found a way to skip the empty values:
str.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries)
However, I have not found a way to put a default value at the empty indexes. I have gone through the previous questions Q1 and Q2, but they do not solve my problem. I am using C# for a web application on the .NET Framework.
Try the code below.
You can use the LINQ extension method Select on the array that Split returns.
string str = "12,3,5,,6,54,127,8,,0,98";
var strVal = str.Split(',').Select(s => string.IsNullOrWhiteSpace(s) ? "0" : s);
Use the following code to replace each empty string with zero:
string str = "12,3,5,,6,54,127,8,,0,98";
var a= str.Split(',').Select(x=>string.IsNullOrEmpty(x)?"0":x);
While all the suggested solutions work, they all iterate over your input more than once (once for the string split, then again for the string replace, the regex, or the array replace).
Here is a solution that iterates over the input only once:
var input = "12,3,5,,6,54,127,8,,0,98";
var result = new List<int>();
var currentNumeric = string.Empty;
foreach(char c in input)
{
if(c == ',' && String.IsNullOrWhiteSpace(currentNumeric))
{
result.Add(0);
}
else if(c == ',')
{
result.Add(int.Parse(currentNumeric));
currentNumeric = string.Empty;
}
else
{
currentNumeric += c;
}
}
if(!String.IsNullOrWhiteSpace(currentNumeric))
{
result.Add(int.Parse(currentNumeric));
}
else if(input.EndsWith(","))
{
result.Add(0);
}
You can run your string through regex to put zeros into it before going into Split:
Regex.Replace(str, "(?<=(^|,))(?=(,|$))", "0").Split(',')
The regex will insert zeros into the original string in spots when two commas are next to each other, or when a comma is detected at the beginning or at the end of the string (demo).
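As a rough end-to-end sketch of that idea, the helper below fills the gaps and then parses each field. The names CsvDefaults and ParseWithZeros are invented here, and the code assumes every non-empty field is a valid integer with no surrounding whitespace.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class CsvDefaults
{
    // Inserts "0" wherever a field is empty, then parses every field as an int.
    public static int[] ParseWithZeros(string csv)
    {
        // Zero-width match: preceded by start-or-comma AND followed by comma-or-end.
        string filled = Regex.Replace(csv, "(?<=(^|,))(?=(,|$))", "0");
        return filled.Split(',').Select(s => int.Parse(s)).ToArray();
    }
}
```

For the string from the question, "12,3,5,,6,54,127,8,,0,98,", the intermediate string is "12,3,5,0,6,54,127,8,0,0,98,0" and the result has twelve elements.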
I have data files coming in on a server that I need to split into [date time] and [value]. Most of them have a single delimiter between the time and the value, and a space between the date and the time. I already have a program processing the data with a simple Split(char[]), but I have now found data where the delimiter is a space, and I am wondering how best to tackle this.
Most files I have encountered look like this:
18-06-2014 12:00:00|220.6
The delimiters vary, but I handled that with a char[]. Today, though, I ran into a problem with this format:
18-06-2014 12:00:00 220.6
This complicates things a little. Would the easy solution be to just add a space to my split characters and, when I find three parts, combine the first two before processing?
I'm looking for a second opinion on this matter. Also, the date format can change to something like d/m/yy, and the number of lines can run into the millions, so I would like to keep it as efficient as possible.
Yes, I believe the most efficient solution is to add space as a delimiter and then combine the first two parts if you get three. That is going to be more efficient than a regex.
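A minimal sketch of that suggestion is shown below. The delimiter set and the names ReadingParser and TrySplit are assumptions for the example; the real set would be whatever the existing char[] already contains, plus the space.

```csharp
using System;

public static class ReadingParser
{
    // Assumed delimiter set: the space plus a few common separators.
    private static readonly char[] Delimiters = { ' ', '|', ';', '\t' };

    // Splits "date time<delimiter>value" into its date-time part and its value part.
    public static bool TrySplit(string line, out string dateTime, out string value)
    {
        dateTime = null;
        value = null;

        string[] parts = line.Split(Delimiters, StringSplitOptions.RemoveEmptyEntries);
        if (parts.Length != 3)
        {
            return false; // not date, time and value
        }

        dateTime = parts[0] + " " + parts[1]; // recombine date and time
        value = parts[2];
        return true;
    }
}
```

Once the space is in the delimiter set, every well-formed line produces three parts, so recombining the first two becomes the normal path and not a special case.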
You've got a string such as 18-06-2014 12:00:00 220.6, where the first 19 characters are the date, the next character is the separator, and the remaining characters are the value. So:
var test = "18-06-2014 12:00:00|220.6";
var dateString = test.Remove(19);
var val = test.Substring(20);
Added normalization:
static void Main(string[] args) {
var test = "18-06-2014 12:00:00|220.6";
var test2 = "18-6-14 12:00:00|220.6";
var test3 = "8-06-14 12:00:00|220.6";
Console.WriteLine(test);
Console.WriteLine(TryNormalizeImportValue(test));
Console.WriteLine(test2);
Console.WriteLine(TryNormalizeImportValue(test2));
Console.WriteLine(test3);
Console.WriteLine(TryNormalizeImportValue(test3));
}
private static string TryNormalizeImportValue(string value) {
var valueSplittedByDateSeparator = value.Split('-');
if (valueSplittedByDateSeparator.Length < 3) throw new InvalidDataException();
var normalizedDay = NormalizeImportDayValue(valueSplittedByDateSeparator[0]);
var normalizedMonth = NormalizeImportMonthValue(valueSplittedByDateSeparator[1]);
var valueYearPartSplittedByDateTimeSeparator = valueSplittedByDateSeparator[2].Split(' ');
if (valueYearPartSplittedByDateTimeSeparator.Length < 2) throw new InvalidDataException();
var normalizedYear = NormalizeImportYearValue(valueYearPartSplittedByDateTimeSeparator[0]);
var valueTimeAndValuePart = valueYearPartSplittedByDateTimeSeparator[1];
return string.Concat(normalizedDay, '-', normalizedMonth, '-', normalizedYear, ' ', valueTimeAndValuePart);
}
private static string NormalizeImportDayValue(string value) {
return value.Length == 2 ? value : "0" + value;
}
private static string NormalizeImportMonthValue(string value) {
return value.Length == 2 ? value : "0" + value;
}
private static string NormalizeImportYearValue(string value) {
return value.Length == 4 ? value : DateTime.Now.Year.ToString(CultureInfo.InvariantCulture).Remove(2) + value;
}
Well you can use this one to get the date and the value.
(((0[1-9]|[12][0-9]|3[01])-(0[1-9]|1[012])-(19|20)\d\d)\s((\d{2}:?){3})|(\d+\.?\d+))
This will give you 2 matches
1º 18-06-2014 12:00:00
2º 220.6
Example:
http://regexr.com/391d3
This regex matches both kinds of strings, capturing the two tokens to Groups 1 and 2.
Note that we are not using \d because in .NET it can match any Unicode digits such as Thai...
The key is in the [ |] character class, which specifies your two allowable delimiters
Here is the regex:
^([0-9]{2}-[0-9]{2}-[0-9]{4} (?:[0-9]{2}:){2}[0-9]{2})[ |]([0-9]{3}\.[0-9])$
In the demo, please pay attention to the capture Groups in the right pane.
Here is how to retrieve the values:
var myRegex = new Regex(@"^([0-9]{2}-[0-9]{2}-[0-9]{4} (?:[0-9]{2}:){2}[0-9]{2})[ |]([0-9]{3}\.[0-9])$", RegexOptions.IgnoreCase);
string mydate = myRegex.Match(s1).Groups[1].Value;
Console.WriteLine(mydate);
// The value is captured by Group 2, not Group 1
string myvalue = myRegex.Match(s1).Groups[2].Value;
Console.WriteLine(myvalue);
Please let me know if you have questions
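The remark about \d in this answer is easy to check. The sketch below compares \d, [0-9] and \d under RegexOptions.ECMAScript (which restricts \d to ASCII digits) on the same input; the class and method names are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DigitClasses
{
    // \d in .NET matches any Unicode decimal digit by default.
    public static bool MatchesBackslashD(string s)
    {
        return Regex.IsMatch(s, @"^\d+$");
    }

    // An explicit range matches ASCII digits only.
    public static bool MatchesAsciiRange(string s)
    {
        return Regex.IsMatch(s, @"^[0-9]+$");
    }

    // With RegexOptions.ECMAScript, \d behaves like [0-9].
    public static bool MatchesEcmaScript(string s)
    {
        return Regex.IsMatch(s, @"^\d+$", RegexOptions.ECMAScript);
    }
}
```

For the Thai digits "\u0E51\u0E52\u0E53", the first method returns true and the other two return false; for "123", all three return true. The same caveat applies to char.IsDigit, which also accepts non-ASCII decimal digits.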
Given the provided format I'd use something like
char delimiter = ' '; //or whatever the delimiter for the specific file is, this can be set in a previous step
int index = line.LastIndexOf(delimiter);
var date = line.Remove(index);
var value = line.Substring(++index);
If there are that many lines and efficiency matters, you could obtain the delimiter once on the first line, by looping back from the end and find the first index that is not a digit or dot (or comma if the value can contain those) to determine the delimiter, and then use something such as the above.
If each line can contain a different delimiter, you could always track back to the first not value char as described above and still maintain adequate performance.
Edit: for completeness sake, to find the delimiter, you could perform the following once per file (provided that the delimiter stays consistent within the file)
char delimiter = '\0';
for (int i = line.Length - 1; i >= 0; i--)
{
var c= line[i];
if (!char.IsDigit(c) && c != '.')
{
delimiter = c;
break;
}
}
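If the delimiter can differ from line to line, the same backward scan can do the split directly, without storing the delimiter. This is a sketch under the assumption that values contain only digits and dots (so no sign, exponent or thousands separator); TrailingValueSplitter and TrySplit are hypothetical names.

```csharp
using System;

public static class TrailingValueSplitter
{
    // Splits a line at the last character that cannot be part of the value.
    public static bool TrySplit(string line, out string date, out string value)
    {
        date = null;
        value = null;

        // Walk back over the value characters at the end of the line.
        int i = line.Length - 1;
        while (i >= 0 && (char.IsDigit(line[i]) || line[i] == '.'))
        {
            i--;
        }

        // i is now the delimiter index; reject lines with no date or no value.
        if (i <= 0 || i == line.Length - 1)
        {
            return false;
        }

        date = line.Substring(0, i);
        value = line.Substring(i + 1);
        return true;
    }
}
```

Each line is scanned only from its end back to the delimiter, so the cost per line is proportional to the length of the value and not to the length of the line.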
I have a very simple question, and I shouldn't be hung up on this, but I am. Haha!
I have a string that I receive in the following format(s):
123
123456-D53
123455-4D
234234-4
123415
The desired output, post formatting, is:
123-455-444
123-455-55
123-455-5
or
123-455
The format is ultimately dependent upon the total number of characters in the original string..
I have several ideas of how to do this, but I keep thinking there's a better way than string.Replace and concatenation...
Thanks for the suggestions..
Ian
Tanascius is right, but I can't comment or upvote due to my lack of rep. If you want additional info on string.Format, I've found this helpful:
http://blog.stevex.net/string-formatting-in-csharp/
I assume this should not rely on the inputs always being numeric? If so, I'm thinking of something like this:
private string ApplyCustomFormat(string input)
{
StringBuilder builder = new StringBuilder(input.Replace("-", ""));
int index = 3;
while (index < builder.Length)
{
builder.Insert(index, "-");
index += 4;
}
return builder.ToString();
}
Here's a method that uses a combination of regular expressions and LINQ to extract groups of up to three characters at a time and then join them together again. Note: it assumes that the input has already been validated. The validation can also be done with a regular expression.
string s = "123456-D53";
string[] groups = Regex.Matches(s, @"\w{1,3}")
.Cast<Match>()
.Select(match => match.Value)
.ToArray();
string result = string.Join("-", groups);
Result:
123-456-D53
EDIT: See history for old versions.
You could use char.IsDigit() for finding digits, only.
var output = new StringBuilder();
var digitCount = 0;
foreach( var c in input )
{
if( char.IsDigit( c ) )
{
output.Append( c );
digitCount++;
if( digitCount % 3 == 0 )
{
output.Append( "-" );
}
}
}
// Remove possible last -
return output.ToString().TrimEnd('-');
This code should fill from left to right (now I got it, first read, then code) ...
Sorry, I still can't test this right now.
Not the fastest, but easy on the eyes (ed: to read):
string Normalize(string value)
{
if (String.IsNullOrEmpty(value)) return value;
int appended = 0;
var builder = new StringBuilder(value.Length + value.Length/3);
for (int ii = 0; ii < value.Length; ++ii)
{
if (Char.IsLetterOrDigit(value[ii]))
{
builder.Append(value[ii]);
if ((++appended % 3) == 0) builder.Append('-');
}
}
return builder.ToString().TrimEnd('-');
}
Uses a guess to pre-allocate the StringBuilder's length. This will accept any Alphanumeric input with any amount of junk being added by the user, including excess whitespace.