Need to split a string into substrings but can't use split - c#

I have a string that looks like this:
123.45.67.890-1292 connected to EDS via 10.98.765.432-4300.
I need to split it like so:
"123.45.67.890-1292 connected to EDS via 10.98.765.432-4300."
 -----+------- --+-              -+-     -----+------- --+-
      |          |                |           |          |
  ClientIP       |           ServiceName   HostIP        |
                 |                                       |
           ClientSession                            HostSession
I'm converting code from VBScript that uses a lot of complex InStr calls. Is there a way to do this with a regex?

(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})-(\d+) connected to ([A-Z]+) via (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})-(\d+)\.

Why can't you use Split? A regular expression is overkill for a single task like this. If you want one anyway:
([^\-]+)\-(\S+)\s+connected\s+to\s+(\S+)\s+via\s+([^\-]+)\-(\S+)\.
C# code implementation (regular expression):
static void Main(string[] args)
{
    String input = "123.45.67.890-1292 connected to EDS via 10.98.765.432-4300.";
    String pattern = @"([^\-]+)\-(\S+)\s+connected\s+to\s+(\S+)\s+via\s+([^\-]+)\-(\S+)\.";
    Match match = Regex.Match(input, pattern);
    if (match.Success)
    {
        // Group 0 is the whole match; groups 1-5 are the captured fields.
        foreach (var group in match.Groups)
        {
            Console.WriteLine(group);
        }
    }
    Console.ReadKey();
}
C# code implementation (splitting):
public class DTO
{
public string ClientIP { get; set; }
public string ClientSession { get; set; }
public string ServiceName { get; set; }
public string HostIP { get; set; }
public string HostSession { get; set; }
}
static void Main(string[] args)
{
String input = "123.45.67.890-1292 connected to EDS via 10.98.765.432-4300.";
String[] splits = input.Split(new char[] { ' ' });
DTO obj = new DTO();
for (int i = 0; i < splits.Length; ++i)
{
switch (i)
{
// connected
case 1:
// to
case 2:
// via
case 4:
{
break;
}
// 123.45.67.890-1292
case 0:
{
obj.ClientIP = splits[i].Split(new char[] { '-' })[0];
obj.ClientSession = splits[i].Split(new char[] { '-' })[1];
break;
}
// EDS
case 3:
{
obj.ServiceName = splits[i];
break;
}
// 10.98.765.432-4300.
case 5:
{
obj.HostIP = splits[i].Split(new char[] { '-' })[0];
obj.HostSession = splits[i].Split(new char[] { '-' })[1];
break;
}
}
}
Console.ReadKey();
}

(?<ClientIP>\d+\.\d+\.\d+\.\d+)-(?<ClientSession>\d+) connected to (?<ServiceName>.*?) via (?<HostIP>\d+\.\d+\.\d+\.\d+)-(?<HostSession>\d+)\.
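The named groups in this pattern can be read back by name after a match. A minimal sketch, assuming the input always has the shape shown in the question; the class and method names (`ConnectionParser`, `Parse`) are made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class ConnectionParser
{
    // The named-group pattern from the answer above, unchanged.
    private static readonly Regex Pattern = new Regex(
        @"(?<ClientIP>\d+\.\d+\.\d+\.\d+)-(?<ClientSession>\d+) connected to (?<ServiceName>.*?) via (?<HostIP>\d+\.\d+\.\d+\.\d+)-(?<HostSession>\d+)\.");

    // Returns the five fields in order, or null when the line does not match.
    public static string[] Parse(string input)
    {
        Match m = Pattern.Match(input);
        if (!m.Success) return null;
        return new[]
        {
            m.Groups["ClientIP"].Value,
            m.Groups["ClientSession"].Value,
            m.Groups["ServiceName"].Value,
            m.Groups["HostIP"].Value,
            m.Groups["HostSession"].Value
        };
    }

    public static void Main()
    {
        string[] fields = Parse("123.45.67.890-1292 connected to EDS via 10.98.765.432-4300.");
        Console.WriteLine(fields == null ? "no match" : string.Join(" | ", fields));
    }
}
```

Reading groups by name keeps the calling code stable if the pattern is later reordered.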

Here's a RegExp to match/capture that:
([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+)-([0-9]+) connected to ([a-zA-Z]+) via ([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+)-([0-9]+)
implementation:
string pat = @"([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+)-([0-9]+) connected to ([a-zA-Z]+) via ([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+)-([0-9]+)";
Regex r = new Regex(pat, RegexOptions.IgnoreCase);
Match match = r.Match("123.45.67.890-1292 connected to EDS via 10.98.765.432-4300.");
foreach (var str in match.Groups)
    Console.WriteLine(str);
Console.ReadKey();

Since I don't see why you rule out String.Split() :
var parts = test.Split(new string[] {" connected to ", " via "},
StringSplitOptions.None);
gives you
123.45.67.890-1292
EDS
10.98.765.432-4300
Breaking off the -#### session parts would take one extra step, which is also possible with Split().
Or maybe easier:
var parts = test.Split(' ', '-');
and use parts 0, 1, 4, 6 and 7 (part 7 keeps the trailing period, so trim it off).
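Put together, the Split approach might look like the sketch below. It assumes the service name contains no spaces or hyphens; the names `SplitParser` and `Parse` are hypothetical:

```csharp
using System;

public static class SplitParser
{
    // Returns ClientIP, ClientSession, ServiceName, HostIP, HostSession.
    public static string[] Parse(string input)
    {
        // Drop the trailing period first so it does not end up in HostSession.
        string[] parts = input.TrimEnd('.').Split(' ', '-');

        // Expected layout:
        // 0=ClientIP 1=ClientSession 2="connected" 3="to"
        // 4=ServiceName 5="via" 6=HostIP 7=HostSession
        if (parts.Length != 8)
            throw new FormatException("Unexpected input: " + input);

        return new[] { parts[0], parts[1], parts[4], parts[6], parts[7] };
    }

    public static void Main()
    {
        string[] fields = Parse("123.45.67.890-1292 connected to EDS via 10.98.765.432-4300.");
        Console.WriteLine(string.Join(" | ", fields));
    }
}
```

The length check is there because a service name with a space or hyphen would shift every index after it.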


Combine multiple lines into 1 string with stream reader

I have a decently sized file (95K lines) that I need to parse. For the following sample data...
<FIPS>10440<STATE>AL<WFO>BMX
8 32.319 32.316 -86.484 -86.487 32.316 -86.484
32.316 -86.484
102 32.501 31.965 -85.919 -86.497 32.496 -86.248
32.448 -86.181 32.432 -86.189 32.433 -86.125 32.417 -86.116
32.406 -86.049 32.419 -86.023 32.337 -85.991 32.333 -85.969
32.276 -85.919 32.271 -85.986 32.250 -85.999 31.968 -85.995
31.965 -86.302 32.052 -86.307 32.051 -86.406 32.245 -86.410
32.276 -86.484 32.302 -86.491 32.332 -86.475 32.344 -86.497
32.364 -86.492 32.378 -86.463 32.405 -86.460 32.414 -86.396
32.427 -86.398 32.433 -86.350 32.412 -86.310 32.441 -86.325
32.487 -86.314 32.473 -86.288 32.488 -86.260 32.501 -86.263
32.496 -86.248
What I need to do is read from one FIPS to the next FIPS and combine the lines within each group into one giant line like the following...
<FIPS>10440<STATE>AL<WFO>BMX 8 32.319 32.316 -86.484 -86.487 32.316 -86.484 32.316 -86.484...
<FIPS>10440<STATE>AL<WFO>BMX 102 32.501 31.965 -85.919 -86.497 32.496 -86.248 32.448 -86.181...
I currently have the following code (about my 6th variation for the day). What am I missing?
using (var reader = new StreamReader(winterBoundsPath))
{
while (!reader.EndOfStream)
{
var line = reader.ReadLine().Trim();
if (!Char.IsLetter(line[0]))
{
if (line.Contains("<FIPS>"))
{
var lineReplace = line.Replace('<', ' ').Replace('>', ' ');
string[] rawData = lineReplace.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
temp = new WinterJsonModel
{
FIPS = rawData[1],
State = rawData[3],
Center = rawData[5],
polyCoords = new List<polyCoordsJsonData>()
};
}
else
{
string[] rawData2 = line.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
if (rawData2.Count() > 1)
{
allValues.Add(listPointValue);
listPointValue = new List<string>();
}
// Add values to line
foreach (string value in rawData2)
{
listPointValue.Add(value);
}
}
}
}
reader.Close();
}
Judging from the sample you've given, the line breaks are CRLF characters. This means you really only need to know two things.
1. Whether the line contains "FIPS" as a string literal enclosed as a tag.
2. Whether you've reached the end of a line that has a carriage return.
I'm going to ignore the JSON bit for now, because it's not part of your question. I'm assuming this means you have the JSON well-handled and if we get these strings how you want them, you've got it from there.
var x = new List<string>();
while (!reader.EndOfStream)
{
    var line = reader.ReadLine().Trim();
    if (line.Contains("<FIPS>"))
    {
        // A header line starts a new combined entry.
        x.Add(line);
    }
    else
    {
        // ReadLine() already strips the line break, so append with a separating space.
        x[x.Count - 1] = x[x.Count - 1] + " " + line;
    }
}
Much of the point here is to separate the organization of the data away from actually putting it into your object. From here, you can iterate through the list in a foreach, creating new objects based on the results of string.Split() on each string in your List<string>.
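As a rough, self-contained sketch of that two-step idea (first combine the lines, then split each combined string), the version below reads from any TextReader so it can be tried on an in-memory sample. The names `FipsCombiner` and `Combine` are made up, and it assumes every data line belongs to the most recent <FIPS> header:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class FipsCombiner
{
    // Joins every line that follows a <FIPS> header onto that header's entry.
    public static List<string> Combine(TextReader reader)
    {
        var combined = new List<string>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            line = line.Trim();
            if (line.Length == 0) continue; // skip blank lines

            if (line.StartsWith("<FIPS>") || combined.Count == 0)
                combined.Add(line); // start a new entry
            else
                combined[combined.Count - 1] += " " + line; // extend the current entry
        }
        return combined;
    }

    public static void Main()
    {
        string sample =
            "<FIPS>10440<STATE>AL<WFO>BMX\n" +
            "8 32.319 32.316 -86.484 -86.487 32.316 -86.484\n" +
            "32.316 -86.484\n";

        foreach (string entry in Combine(new StringReader(sample)))
        {
            // Second step: split the combined string into tokens.
            string[] tokens = entry.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            Console.WriteLine(tokens.Length + " tokens, header: " + tokens[0]);
        }
    }
}
```

For the real file, passing `new StreamReader(winterBoundsPath)` in place of the StringReader should work the same way.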
I've been parsing text files for over 40 years. The code below is a sample of how I'd do it:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
namespace Oppgave3Lesson1
{
class Program
{
const string FILENAME = @"c:\temp\test.txt";
static void Main(string[] args)
{
WinterJsonModel data = new WinterJsonModel();
data.ParseFile(FILENAME);
}
}
public class WinterJsonModel
{
public static List<WinterJsonModel> samplData = new List<WinterJsonModel>();
public string fips { get; set; }
public string state { get; set; }
public string wfo { get; set; }
public List<Group> groups = new List<Group>();
public void ParseFile(string winterBoundsPath)
{
WinterJsonModel winterJsonModel = null;
Group group = null;
List<KeyValuePair<decimal, decimal>> values = null;
using (var reader = new StreamReader(winterBoundsPath))
{
while (!reader.EndOfStream)
{
var line = reader.ReadLine().Trim();
if (line.Length > 0)
{
if (line.StartsWith("<FIPS>"))
{
winterJsonModel = new WinterJsonModel();
WinterJsonModel.samplData.Add(winterJsonModel);
string[] rawData = line.Split(new char[] { '<', '>' }, StringSplitOptions.RemoveEmptyEntries);
winterJsonModel.fips = rawData[1];
winterJsonModel.state = rawData[3];
winterJsonModel.wfo = rawData[5];
group = null; // very important line
}
else
{
decimal[] rawData = line.Split(new char[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries).Select(x => decimal.Parse(x)).ToArray();
//if odd number of numbers in a line
if (rawData.Count() % 2 == 1)
{
group = new Group();
winterJsonModel.groups.Add(group);
group.id = (int)rawData[0];
//remove group number from raw data
rawData = rawData.Skip(1).ToArray();
}
for (int i = 0; i < rawData.Count(); i += 2)
{
group.values.Add(new KeyValuePair<decimal, decimal>(rawData[i], rawData[i + 1]));
}
}
}
}
}
}
}
public class Group
{
public int id { get; set; }
public List<KeyValuePair<decimal, decimal>> values = new List<KeyValuePair<decimal, decimal>>();
}
}

Regex - Capture every line based on condition

To revisit a solution I had here over a year ago:
/* ----------------- jobnameA ----------------- */
insert_job: jobnameA job_type: CMD
date_conditions: 0
alarm_if_fail: 1
/* ----------------- jobnameB ----------------- */
insert_job: jobnameB job_type: CMD
date_conditions: 1
days_of_week: tu,we,th,fr,sa
condition: s(job1) & s(job2) & (v(variable1) = "Y" | s(job1)) & (v(variable2) = "Y"
alarm_if_fail: 1
job_load: 1
priority: 10
/* ----------------- jobnameC ----------------- */
...
I use the following regex to capture each job that uses a variable v(x) in its condition parameter (only jobnameB matches here):
(?ms)(^[ \t]*/\*[\s-]*([\w-]*)[\s-]*\*/)((?:(?:(?!^[ \t]*/\*[\s-]*[\w-]*[\s-]*\*/).)*?condition\: ([^\n\r]*v\([^\n\r]*)[ \t]*\))+(?:(?!^[ \t]*/\*[\s-]*[\w-]*[\s-]*\*/).)*)
I now need each line caught as parameter and value groups while satisfying the same conditions.
This regex will get each line with parameter and value as separate capture groups, but it won't take into account the presence of variables v(x), so it grabs all jobs:
(?:^([\w_]*\:) ([^\n]+))
And, the following expression will get me as far as the first line (insert_job) of the satisfying jobs, but it ends there instead of grabbing all parameters.
(?:^[ \t]*/\*[\s-]*[\w-]*[\s-]*\*/)(?:(?!^[ \t]*/\*[\s-]*[\w-]*[\s-]*\*/).)*?(?:^([\w_]*\:) ([^\n]+))
Any further help will be appreciated.
I think this would be much easier if you broke it up into steps. I am using LINQ for this:
var jobsWithVx = Regex.Matches(src, @"(?ms)(^[ \t]*/\*[\s-]*([\w-]*)[\s-]*\*/)((?:(?:(?!^[ \t]*/\*[\s-]*[\w-]*[\s-]*\*/).)*?condition\: ([^\n\r]*v\([^\n\r]*)[ \t]*\))+(?:(?!^[ \t]*/\*[\s-]*[\w-]*[\s-]*\*/).)*)").Cast<Match>().Select(m => m.Value);
var jobParameters = jobsWithVx.Select(j => Regex.Matches(j, @"(?ms)^([\w_]+\:) (.+?)$")).Select(m => m.Cast<Match>().Select(am => am.Groups));
Then you can work with the job parameters:
foreach (var aJobsParms in jobParameters) {
foreach (var jobParm in aJobsParms) {
// work with job and parm
}
// alternatively, convert to a Dictionary
var jobDict = aJobsParms.ToDictionary(jpgc => jpgc[1].Value, jpgc => jpgc[2].Value);
// then work with the dictionary
}
Sample that runs in LINQPad:
var src = @"/* ----------------- jobnameA ----------------- */
insert_job: jobnameA job_type: CMD
date_conditions: 0
alarm_if_fail: 1
/* ----------------- jobnameB ----------------- */
insert_job: jobnameB job_type: CMD
date_conditions: 1
days_of_week: tu,we,th,fr,sa
condition: s(job1) & s(job2) & (v(variable1) = ""Y"" | s(job1)) & (v(variable2) = ""Y""
alarm_if_fail: 1
job_load: 1
priority: 10
/* ----------------- jobnameC ----------------- */
";
var jobsWithVx = Regex.Matches(src, @"(?ms)(^[ \t]*/\*[\s-]*([\w-]*)[\s-]*\*/)((?:(?:(?!^[ \t]*/\*[\s-]*[\w-]*[\s-]*\*/).)*?condition\: ([^\n\r]*v\([^\n\r]*)[ \t]*\))+(?:(?!^[ \t]*/\*[\s-]*[\w-]*[\s-]*\*/).)*)").Cast<Match>().Select(m => m.Value);
var jobParameters = jobsWithVx.Select(j => Regex.Matches(j, @"(?ms)^([\w_]+\:) (.+?)$")).Select(m => m.Cast<Match>().Select(am => am.Groups));
jobParameters.Dump();
I've been parsing text files for over 40 years. If I can't do it nobody can. I tried for a while to use Regex to split your 'name: value' inputs but was unsuccessful, so I finally wrote my own method. Take a look at what I did with the days of the week:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
namespace ConsoleApplication1
{
class Program
{
const string FILENAME = @"c:\temp\test.txt";
static void Main(string[] args)
{
Job.Load(FILENAME);
}
}
public class Job
{
public static List<Job> jobs = new List<Job>();
public string name { get;set;}
public string job_type { get;set;}
public int date_conditions { get; set;}
public DayOfWeek[] days_of_week { get; set; }
public string condition { get; set; }
public int alarm_if_fail { get; set; }
public int job_load { get; set; }
public int priority { get; set;}
public static void Load(string filename)
{
Job newJob = null;
StreamReader reader = new StreamReader(filename);
string inputLine = "";
while ((inputLine = reader.ReadLine()) != null)
{
inputLine = inputLine.Trim();
if ((inputLine.Length > 0) && (!inputLine.StartsWith("/*")))
{
List<KeyValuePair<string, string>> groups = GetGroups(inputLine);
foreach (KeyValuePair<string, string> group in groups)
{
switch (group.Key)
{
case "insert_job" :
newJob = new Job();
Job.jobs.Add(newJob);
newJob.name = group.Value;
break;
case "job_type":
newJob.job_type = group.Value;
break;
case "date_conditions":
newJob.date_conditions = int.Parse(group.Value);
break;
case "days_of_week":
List<string> d_of_w = new List<string>() { "su", "mo", "tu", "we", "th", "fr", "sa" };
newJob.days_of_week = group.Value.Split(new char[] { ',' }, StringSplitOptions.RemoveEmptyEntries).Select(x => (DayOfWeek)d_of_w.IndexOf(x)).ToArray();
break;
case "condition":
newJob.condition = group.Value;
break;
case "alarm_if_fail":
newJob.alarm_if_fail = int.Parse(group.Value);
break;
case "job_load":
newJob.job_load = int.Parse(group.Value);
break;
case "priority":
newJob.priority = int.Parse(group.Value);
break;
}
}
}
}
reader.Close();
}
public static List<KeyValuePair<string, string>> GetGroups(string input)
{
List<KeyValuePair<string, string>> groups = new List<KeyValuePair<string, string>>();
string inputLine = input;
while(inputLine.Length > 0)
{
int lastColon = inputLine.LastIndexOf(":");
string value = inputLine.Substring(lastColon + 1).Trim();
int lastWordStart = inputLine.Substring(0, lastColon - 1).LastIndexOf(" ") + 1;
string name = inputLine.Substring(lastWordStart, lastColon - lastWordStart);
groups.Insert(0, new KeyValuePair<string,string>(name,value));
inputLine = inputLine.Substring(0, lastWordStart).Trim();
}
return groups;
}
}
}
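For comparison, a regex split of the 'name: value' pairs can be sketched if one is willing to assume that a value never itself contains a word followed by a colon and whitespace. That assumption holds for the sample lines in the question but may not hold for every job file. `JobLineParser` and `GetPairs` are hypothetical names:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class JobLineParser
{
    // name: value, where the value runs lazily until the next " name: " or the end of the line.
    private static readonly Regex PairPattern =
        new Regex(@"(\w+):\s+(.*?)(?=\s+\w+:\s|$)");

    public static List<KeyValuePair<string, string>> GetPairs(string line)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        foreach (Match m in PairPattern.Matches(line.Trim()))
        {
            pairs.Add(new KeyValuePair<string, string>(m.Groups[1].Value, m.Groups[2].Value));
        }
        return pairs;
    }

    public static void Main()
    {
        foreach (var pair in GetPairs("insert_job: jobnameA job_type: CMD"))
            Console.WriteLine(pair.Key + " => " + pair.Value);
    }
}
```

The lookahead is what lets one line carry two pairs, as the insert_job lines do, without the first value swallowing the second name.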

c# string text, enum type

I am reading a txt file and would like to separate it into parts. Here is an example of my TXT file:
"Part error(1) devic[3].data_type(2)"
"Escape error(3) device[10].data_type(12)"
When the first word of a line is "Part", I would like to map it to an enum value and, in a switch, call a function that works with the whole line. When the first word is "Escape", a different case in the switch should call another function. How can I do this? This is my code so far:
class Row
{
public enum Category { Part, Escape }
public string Error{ get; set; }
public string Data_Type { get; set; }
public string Device{ get; set; }
}
public object HandleRegex(string items)
{
Row sk = new Row();
Regex r = new Regex(@"[.]");
var newStr = r.Replace(items, @" ");
switch(this.category)
{
case Category.Part:
//I want to call here function HandlePart with my line as a parameter
HandlePart(newStr);
break;
case Category.Escape:
//Here I want to call Function HandleEscape for line with "Escape" word
HandleEscape(newStr);
break;
}
}
public object HandleRegex(string items)
{
Regex r = new Regex(@"[.]");
var newStr = r.Replace(items, @" ");
try {
category = (Category) new EnumConverter(typeof(Category)).ConvertFromString(items.Split(new string[]{" "},StringSplitOptions.RemoveEmptyEntries)[0]);
}
catch {
throw new ArgumentException("items doesn't contain valid prefix");
}
switch(category)
{
case Category.Part:
HandlePart(newStr);
break;
case Category.Escape:
HandleEscape(newStr);
break;
}
}
You could use TryParse:
Category outCategory;
// Assumes this.category holds the first word of the line as a string.
if (Enum.TryParse(this.category, out outCategory))
{
    switch (outCategory)
    {
        case Category.Part:
            HandlePart(newStr);
            break;
        case Category.Escape:
            HandleEscape(newStr);
            break;
    }
}
else
{
    // Unknown category: needs to be handled
}
You can create Dictionary<Category, Action<string>> and then use it to call code according to category:
static void Main(string[] args)
{
var input = @"Part error(1) devic[3].data_type(2)
Escape error(3) device[10].data_type(12)";
var functions = new Dictionary<Category, Action<string>>()
{
{ Category.Part, HandlePart},
{ Category.Escape, HandleEscape }
};
foreach (var line in input.Split(new [] {Environment.NewLine }, StringSplitOptions.None))
{
Category category;
if(Enum.TryParse<Category>(line.Substring(0, line.IndexOf(' ')), out category) && functions.ContainsKey(category))
functions[category](line);
}
}
static void HandlePart(string line)
{
Console.WriteLine("Part handler call");
}
static void HandleEscape(string line)
{
Console.WriteLine("Escape handler call");
}
Output of program above:
Part handler call
Escape handler call
If you read the file line by line, you can do:
string str = file.ReadLine();
string firstWord = str.Substring(0, str.IndexOf(' ')).Trim().ToLower();
Now that you have the first word, you can switch on it (the labels are lowercase because of ToLower()):
switch (firstWord)
{
    case "escape":
        // your code
        break;
    case "part":
        // your code
        break;
}

How to split string?

In the following example,
/*----------------------// kvkbl jk//bv klb /* /*gkljbgflkjbncviogf*/
how do I get the strings between /* and */?
Take a look at this tutorial
using System;
class Program
{
    static void Main()
    {
        string s = "/*there*/ is a cat";
        // Skip past the opening "/*" (2 characters), then look for the closing "*/".
        int start = s.IndexOf("/*") + 2;
        int end = s.IndexOf("*/", start);
        string result = s.Substring(start, end - start);
        // result contains "there"
    }
}
string s = "there is a cat";
//
// Split string on spaces.
// ... This will separate all the words.
//
string[] words = s.Split(' ');
foreach (string word in words)
{
Console.WriteLine(word);
}
//Output
there
is
a
cat
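If there can be several /* ... */ pairs in one string, a regex with a lazy quantifier is one way to collect them all. A minimal sketch with hypothetical names (`CommentExtractor`, `Between`); note that it does not understand nesting, so an extra "/*" inside a pair is simply returned as part of the content:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CommentExtractor
{
    // Returns the text found between each "/*" and the next "*/".
    public static List<string> Between(string input)
    {
        var results = new List<string>();
        // Lazy (.*?) stops at the first closing "*/";
        // Singleline lets '.' match across line breaks.
        foreach (Match m in Regex.Matches(input, @"/\*(.*?)\*/", RegexOptions.Singleline))
        {
            results.Add(m.Groups[1].Value);
        }
        return results;
    }

    public static void Main()
    {
        foreach (string part in Between("/*there*/ is a /*cat*/"))
            Console.WriteLine(part);
    }
}
```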

C# Regex for Movie Filename

I have been trying to use a C# Regex unsuccessfully to remove certain strings from a movie name.
Examples of the file names I'm working with are:
EuroTrip (2004) [SD]
Event Horizon (1997) [720]
Fast & Furious (2009) [1080p]
Star Trek (2009) [Unknown]
I'd like to remove anything in square brackets or parenthesis (including the brackets themselves)
So far I'm using:
movieTitleToFetch = Regex.Replace(movieTitleToFetch, "([*\\(\\d{4}\\)])", "");
Which seems to remove the Year and Parenthesis ok, but I just can't figure out how to remove the Square Brackets and content without affecting other parts... I've had miscellaneous results but the closest one has been:
movieTitleToFetch = Regex.Replace(movieTitleToFetch, "([?\\[+A-Z+\\]])", "");
Which left me with:
urorip (2004)
Instead of:
EuroTrip (2004) [SD]
Any whitespace that is left at the ends are ok as I will just perform
movieTitleToFetch = movieTitleToFetch.Trim();
at the end.
Thanks in advance,
Alex
This regex pattern should work, though it may need a bit of tweaking:
"[\[\(].+?[\]\)]"
Regex.Replace(movieTitleToFetch, @"[\[\(].+?[\]\)]", "");
This should match anything from either "[" or "(" until the next occurrence of "]" or ")".
If that does not work, try removing the escape character for the parentheses, like so:
Regex.Replace(movieTitleToFetch, @"[\[(].+?[\])]", "");
@Craigt is pretty much spot on, but it's possibly cleaner to ensure that the brackets are matched.
([\[].*?[\]]|[\(].*?[\)])
I know I'm late to this thread, but I wrote a simple algorithm to sanitize downloaded movie filenames.
It runs these steps:
1. Removes everything in brackets (if it finds a year, it tries to keep that info).
2. Removes a list of commonly used words (720p, bdrip, h264 and so on).
3. Assumes there can be language info in the title and removes it when it sits at the end of the remaining string (before the special words).
4. If a year was not found inside parentheses, looks for one at the end of the remaining string (as for languages).
While doing this it replaces dots and underscores with spaces, so the title is ready to be used, for example, as a query for a search API.
Here's the test in xUnit (I used mostly Italian titles to test it):
using Grappachu.Movideo.Core.Helpers.TitleCleaner;
using SharpTestsEx;
using Xunit;
namespace Grappachu.MoVideo.Test
{
public class TitleCleanerTest
{
[Theory]
[InlineData("Avengers.Confidential.La.Vedova.Nera.E.Punisher.2014.iTALiAN.Bluray.720p.x264 - BG.mkv",
"Avengers Confidential La Vedova Nera E Punisher", 2014)]
[InlineData("Fuck You, Prof! (2013) BDRip 720p HEVC ITA GER AC3 Multi Sub PirateMKV.mkv",
"Fuck You, Prof!", 2013)]
[InlineData("Il Libro della Giungla(2016)(BDrip1080p_H264_AC3 5.1 Ita Eng_Sub Ita Eng)by siste82.avi",
"Il Libro della Giungla", 2016)]
[InlineData("Il primo dei bugiardi (2009) [Mux by Little-Boy]", "Il primo dei bugiardi", 2009)]
[InlineData("Il.Viaggio.Di.Arlo-The.Good.Dinosaur.2015.DTS.ITA.ENG.1080p.BluRay.x264-BLUWORLD",
"il viaggio di arlo", 2015)]
[InlineData("La Mafia Uccide Solo D'estate 2013 .avi",
"La Mafia Uccide Solo D'estate", 2013)]
[InlineData("Ip.Man.3.2015.iTA.AC3.5.1.448.Chi.Aac.BluRay.m1080p.x264.Sub.[scambiofile.info].mkv",
"Ip Man 3", 2015)]
[InlineData("Inferno.2016.BluRay.1080p.AC3.ITA.AC3.ENG.Subs.x264-WGZ.mkv",
"Inferno", 2016)]
[InlineData("Ghostbusters.2016.iTALiAN.BDRiP.EXTENDED.XviD-HDi.mp4",
"Ghostbusters", 2016)]
[InlineData("Transcendence.mkv", "Transcendence", null)]
[InlineData("Being Human (Forsyth, 1994).mkv", "Being Human", 1994)]
public void Clean_should_return_title_and_year_when_possible(string filename, string title, int? year)
{
var res = MovieTitleCleaner.Clean(filename);
res.Title.ToLowerInvariant().Should().Be.EqualTo(title.ToLowerInvariant());
res.Year.Should().Be.EqualTo(year);
}
}
}
and the first version of the code:
using System;
using System.Globalization;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
namespace Grappachu.Movideo.Core.Helpers.TitleCleaner
{
public class MovieTitleCleanerResult
{
public string Title { get; set; }
public int? Year { get; set; }
public string SubTitle { get; set; }
}
public class MovieTitleCleaner
{
private const string SpecialMarker = "§=§";
private static readonly string[] ReservedWords;
private static readonly string[] SpaceChars;
private static readonly string[] Languages;
static MovieTitleCleaner()
{
ReservedWords = new[]
{
SpecialMarker, "hevc", "bdrip", "Bluray", "x264", "h264", "AC3", "DTS", "480p", "720p", "1080p"
};
var cultures = CultureInfo.GetCultures(CultureTypes.AllCultures);
var l = cultures.Select(x => x.EnglishName).ToList();
l.AddRange(cultures.Select(x => x.ThreeLetterISOLanguageName));
Languages = l.Distinct().ToArray();
SpaceChars = new[] {".", "_", " "};
}
public static MovieTitleCleanerResult Clean(string filename)
{
var temp = Path.GetFileNameWithoutExtension(filename);
int? maybeYear = null;
// Remove what's inside brackets trying to keep year info.
temp = RemoveBrackets(temp, '{', '}', ref maybeYear);
temp = RemoveBrackets(temp, '[', ']', ref maybeYear);
temp = RemoveBrackets(temp, '(', ')', ref maybeYear);
// Removes special markers (codec, formats, ecc...)
var tokens = temp.Split(SpaceChars, StringSplitOptions.RemoveEmptyEntries);
var title = string.Empty;
for (var i = 0; i < tokens.Length; i++)
{
var tok = tokens[i];
if (ReservedWords.Any(x => string.Equals(x, tok, StringComparison.OrdinalIgnoreCase)))
{
if (title.Length > 0)
break;
}
else
{
title = string.Join(" ", title, tok).Trim();
}
}
temp = title;
// Remove languages infos when are found before special markers (should not remove "English" if it's inside the title)
tokens = temp.Split(SpaceChars, StringSplitOptions.RemoveEmptyEntries);
for (var i = tokens.Length - 1; i >= 0; i--)
{
var tok = tokens[i];
if (Languages.Any(x => string.Equals(x, tok, StringComparison.OrdinalIgnoreCase)))
tokens[i] = string.Empty;
else
break;
}
title = string.Join(" ", tokens).Trim();
// If year is not found inside parenthesis try to catch at the end, just after the title
if (!maybeYear.HasValue)
{
var resplit = title.Split(SpaceChars, StringSplitOptions.RemoveEmptyEntries);
var last = resplit.Last();
if (LooksLikeYear(last))
{
maybeYear = int.Parse(last);
title = title.Replace(last, string.Empty).Trim();
}
}
// TODO: review this. when there's one dash separates main title from subtitle
var res = new MovieTitleCleanerResult();
res.Year = maybeYear;
if (title.Count(x => x == '-') == 1)
{
var sp = title.Split('-');
res.Title = sp[0];
res.SubTitle = sp[1];
}
else
{
res.Title = title;
}
return res;
}
private static string RemoveBrackets(string inputString, char openChar, char closeChar, ref int? maybeYear)
{
var str = inputString;
while (str.IndexOf(openChar) > 0 && str.IndexOf(closeChar) > 0)
{
var dataGraph = str.GetBetween(openChar.ToString(), closeChar.ToString());
if (LooksLikeYear(dataGraph))
{
maybeYear = int.Parse(dataGraph);
}
else
{
var parts = dataGraph.Split(SpaceChars, StringSplitOptions.RemoveEmptyEntries);
foreach (var part in parts)
if (LooksLikeYear(part))
{
maybeYear = int.Parse(part);
break;
}
}
str = str.ReplaceBetween(openChar, closeChar, string.Format(" {0} ", SpecialMarker));
}
return str;
}
private static bool LooksLikeYear(string dataRound)
{
return Regex.IsMatch(dataRound, "^(19|20)[0-9][0-9]$");
}
}
public static class StringUtils
{
public static string GetBetween(this string src, string a, string b,
StringComparison comparison = StringComparison.Ordinal)
{
var idxStr = src.IndexOf(a, comparison);
var idxEnd = src.IndexOf(b, comparison);
if (idxStr >= 0 && idxEnd > 0)
{
if (idxStr > idxEnd)
Swap(ref idxStr, ref idxEnd);
return src.Substring(idxStr + a.Length, idxEnd - idxStr - a.Length);
}
return src;
}
private static void Swap<T>(ref T idxStr, ref T idxEnd)
{
var temp = idxEnd;
idxEnd = idxStr;
idxStr = temp;
}
public static string ReplaceBetween(this string s, char begin, char end, string replacement = null)
{
var regex = new Regex(string.Format("\\{0}.*?\\{1}", begin, end));
return regex.Replace(s, replacement ?? string.Empty);
}
}
}
This does the trick:
@"(\[[^\]]*\])|(\([^\)]*\))"
It removes anything from "[" to the next "]" and anything from "(" to the next ")".
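Applied to the filenames from the question, that pattern could be wrapped as in the sketch below. `TitleStripper` and `Strip` are made-up names, and the final Trim() takes care of the whitespace left behind, as the question suggests:

```csharp
using System;
using System.Text.RegularExpressions;

public static class TitleStripper
{
    // Matches "[...]" or "(...)" including the brackets themselves.
    private static readonly Regex Bracketed = new Regex(@"(\[[^\]]*\])|(\([^\)]*\))");

    public static string Strip(string name)
    {
        return Bracketed.Replace(name, "").Trim();
    }

    public static void Main()
    {
        string[] names =
        {
            "EuroTrip (2004) [SD]",
            "Event Horizon (1997) [720]",
            "Fast & Furious (2009) [1080p]",
            "Star Trek (2009) [Unknown]"
        };
        foreach (string name in names)
            Console.WriteLine(Strip(name));
    }
}
```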
Can you just use:
string MovieTitle="Star Trek (2009) [Unknown]";
movieTitleToFetch= MovieTitle.IndexOf('(')>MovieTitle.IndexOf('[')?
MovieTitle.Substring(0,MovieTitle.IndexOf('[')):
MovieTitle.Substring(0,MovieTitle.IndexOf('('));
Can't we use this instead:
if(movieTitleToFetch.Contains("("))
movieTitleToFetch=movieTitleToFetch.Substring(0,movieTitleToFetch.IndexOf("("));
The above code returns the movie title for these strings:
EuroTrip (2004) [SD]
Event Horizon (1997) [720]
Fast & Furious (2009) [1080p]
Star Trek (2009) [Unknown]
If a name has no year but only a type, i.e.:
EuroTrip [SD]
Event Horizon [720]
Fast & Furious [1080p]
Star Trek [Unknown]
then use this:
if(movieTitleToFetch.Contains("("))
movieTitleToFetch=movieTitleToFetch.Substring(0,movieTitleToFetch.IndexOf("("));
else if(movieTitleToFetch.Contains("["))
movieTitleToFetch=movieTitleToFetch.Substring(0,movieTitleToFetch.IndexOf("["));
I came up with .+\s(?<year>\(\d{4}\))\s(?<format>\[\w+\]) which matches any of your examples, and contains the year and format as named capture groups to help you replace them.
This pattern translates as:
Any character, one or more repetitions
Whitespace
Literal '(' followed by 4 digits followed by literal ')' (year)
Whitespace
Literal '[' followed by alphanumeric characters, one or more repetitions, followed by literal ']' (format)
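To pull the pieces out rather than just match them, a variant of this pattern can be used. In the sketch below the "title" group, the anchors, and the move of the brackets outside the year and format groups are my own changes, not part of the pattern above; `MovieNameParser` and `TryParse` are hypothetical names:

```csharp
using System;
using System.Text.RegularExpressions;

public static class MovieNameParser
{
    // Variant of the pattern above: adds a "title" group and keeps the
    // brackets outside the year and format groups.
    private static readonly Regex Pattern = new Regex(
        @"^(?<title>.+)\s\((?<year>[0-9]{4})\)\s\[(?<format>\w+)\]$");

    public static bool TryParse(string name, out string title, out int year, out string format)
    {
        title = null;
        year = 0;
        format = null;

        Match m = Pattern.Match(name);
        if (!m.Success) return false;

        title = m.Groups["title"].Value;
        year = int.Parse(m.Groups["year"].Value);
        format = m.Groups["format"].Value;
        return true;
    }

    public static void Main()
    {
        string title, format;
        int year;
        if (TryParse("Fast & Furious (2009) [1080p]", out title, out year, out format))
            Console.WriteLine(title + " / " + year + " / " + format);
    }
}
```

Names without both a year and a format (for example "EuroTrip [SD]") do not match this variant, so the caller needs a fallback for them.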
