C# - Regex subtitle file (.srt) to get text content? - c#

I have a srt file
1
00:00:07,000 --> 00:00:09,000
Time to amaze the world..
create by Hazy
2
00:00:11,000 --> 00:00:12,200
show them
3
00:00:15,000 --> 00:00:16,500
an impossible feat
i want to get text content
Time to amaze the world..
create by Hazy,
show them,
an impossible feat
My regex:
string[] souceSrt = Regex.Split(inputText.Text, #"\n*\d+\n\d\d:\d\d:\d\d,\d\d\d --> \d\d:\d\d:\d\d,\d\d\d\n");
but it's not working. What should i do??

Your approach wasn't bad, I think your pattern doesn't work because of newlines (that are probably CRLF):
(?:\r?\n)*\d+\r?\n\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}\r?\n
Note that your first approach is safer than searching all lines that contains letters (imagine a character that says "how old are you?")

using RegexHero
string strRegex = #"^.*([a-zA-Z]).*$";
Regex myRegex = new Regex(strRegex, RegexOptions.Multiline);
foreach (Match myMatch in myRegex.Matches(strTargetString))
{
if (myMatch.Success)
{
//grab line
}
}
unless there's something I've missed, the lines you don't want will never have an alphabetic character in them.

A solution to parse an SRT without RegEx
Create a class to Deserialize the SRT
public class SrtContent
{
public string Text { get; set; }
public string StartTime { get; set; }
public string EndTime { get; set; }
public string Segment { get; set; }
}
Now here is the method that will parse the SRT
private static void ParseSRT(string srtFilePath)
{
var fileContent = File.ReadAllLines(srtFilePath);
if (fileContent.Length <= 0)
return;
var content = new List<SrtContent>();
var segment = 1;
for (int item = 0; item < fileContent.Length; item++)
{
if (segment.ToString() == fileContent[item])
{
content.Add(new SrtContent
{
Segment = segment.ToString(),
StartTime = fileContent[item + 1].Substring(0, fileContent[item + 1].LastIndexOf("-->")).Trim(),
EndTime = fileContent[item + 1].Substring(fileContent[item + 1].LastIndexOf("-->") + 3).Trim(),
Text = fileContent[item + 2]
});
// The block numbers of SRT like 1, 2, 3, ... and so on
segment++;
// Iterate one block at a time
item += 3;
}
}
}

Related

C# Search Textfile after multiple Datas and fill them into a Datagrid View

I get the datas from an textfile. The File itself is already inserted by ReadAllLines and converted into a string - this works fine for me and I checked the content with a MessageBox.
The Textfile looks like this (This is just 1 line from about thousand):
3016XY1234567891111111ABCDEFGHIJKabcdef+0000001029916XY1111111123456789ABCDEFGHIJKabcdef+00000003801
Now these are 2 records and I need 2 datas from every record.
The "XY Number" - these are the first 16 digits AFTER "16XY" (16XY is always the same value)
Value from the example: XY1234567891111111
The "Price" - that is the 11 digits value after the plus. The last 2 digits specify the amount of Cent.
Value from the example: 102,99$
I Need both of this datas to be in the same row in my Datagrid View and also for all other Datas in this textfile.
All I can imagine is to write a code, which searchs the string after "16XY" and counts the next 16 digits - the same with the Price which searchs for a "plus" and counts the next 11 digits. Just in this case I would need to ignore the first line of the file because there are about 10x"+".
I tried several possibilities to search and count for that values but without any success right now. Im also not sure how to get the datas into the specific Datagrid View.
This is all I have to show at the moment:
List<List<string>> groups = new List<List<string>>();
List<string> current = null;
foreach (var line in File.ReadAllLines(path))
{
if (line.Contains("") && current == null)
current = new List<string>();
else if (line.Contains("") && current != null)
{
groups.Add(current);
current = null;
}
if (current != null)
current.Add(line);
}
//array
string output = string.Join(Environment.NewLine, current.ToArray());
//string
string final = string.Join("", output.ToCharArray());
MessageBox.Show(output);
Thanks in advance!
Create a class or struct to hold data
public class Data
{
String XYValue { set; get; }
Decimal Price { set; get; }
}
Then the reading logic (You might need to add some more checks):
string decimalSeperator = CultureInfo.CurrentCulture.NumberFormat.NumberDecimalSeparator;
List<Data> results = new List<Data>();
foreach(string line in File.ReadAllLines(path).Skip(1))
{
if (line == null)
continue;
int indexOfNextXY = 0;
while (true)
{
int indexOfXY = line.IndexOf("16XY", indexOfNextXY) + "16XY".Length;
int indexOfPlus = line.IndexOf("+", indexOfXY + 16) + "+".Length;
indexOfNextXY = line.IndexOf("16XY", indexOfPlus);
string xyValue = line.Substring(indexOfXY - 2, 18); // -2 to get the XY part
string price = indexOfNextXY < 0 ? line.Substring(indexOfPlus) : line.Substring(indexOfPlus, indexOfNextXY - indexOfPlus);
string intPart = price.Substring(0, price.Length - 2);
string decimalPart = price.Substring(price.Length - 2);
price = intPart + decimalSeperator + decimalPart;
results.Add(new Data (){ XYValue = xyValue, Price = Convert.ToDecimal(price) });
if (indexOfNextXY < 0)
break;
}
}
var regex = new Regex(#"\+(\d+)(\d{2})16(XY\d{16})");
var q =
from e in File.ReadLines("123.txt")
let find = regex.Match(e)
where find.Success
select new
{
price = double.Parse(find.Groups[1].Value) + (double.Parse(find.Groups[2].Value) / 100),
value = find.Groups[3]
};
dataGridView1.DataSource = q.ToList();
If you need the whole text file as string, you can manipulate it with .Split method.
The action will look something like this:
var values = final.Split(new string[] { "16XY" }, StringSplitOptions.RemoveEmptyEntries).ToList();
List <YourModel> models = new List<YourModel>();
foreach (var item in values)
{
if (item.IndexOf('+') > 0)
{
var itemSplit = item.Split('+');
if (itemSplit[0].Length > 15 &&
itemSplit[1].Length > 10)
{
models.Add(new YourModel(itemSplit[0].Substring(0, 16), itemSplit[1].Substring(0, 11)));
}
}
}
And you will need some model
public class YourModel
{
public YourModel(string xy, string price)
{
float forTest = 0;
XYNUMBER = xy;
string addForParse = string.Format("{0}.{1}", price.Substring(0, price.Length - 2), price.Substring(price.Length - 2, 2));
if (float.TryParse(addForParse, out forTest))
{
Price = forTest;
}
}
public string XYNUMBER { get; set; }
public float Price { get; set; }
}
After that you can bind it to your gridview.
Given that the "data pairs" are variable each line (and can get truncated to the next line), it is best to use File.ReadAllText() instead. This will give you a single string to work on, eliminating the truncation issue.
var data = File.ReadAllText(path);
Define a model to contain your data:
public class Item {
public string XYNumber { get; set; }
public double Price { get; set; }
}
You can then use regular expressions to find matches and store them in a list:
var list = List<Item>();
var regex = new Regex(#"(XY\d{16})\w+\+(\d{11})");
var match = regex.Match(data);
while (match.Success) {
var ps = match.Group[1].Captures[0].Value.Insert(9, ".");
list.Add(new Item {
XYNumber = match.Group[0].Captures[0].Value,
Price = Convert.ToDouble(ps)
});
match = match.NextMatch();
}
The list can also be used as a data source to a grid view:
gridView.DataSource = list;
Consider employing the Split method. From the example data, I notice there is "16XY" between each value. So something like this:
var data = "3016XY1234567891111111ABCDEFGHIJKabcdef+0000001029916XY1111111123456789ABCDEFGHIJKabcdef+00000003801";
var records = data.Split(new string[] { "16XY" }, StringSplitOptions.RemoveEmptyEntries);
Given the example data this will return the following array:
[0]: "30"
[1]: "1234567891111111ABCDEFGHIJKabcdef+00000010299"
[2]: "1111111123456789ABCDEFGHIJKabcdef+00000003801"
Now it will be easier to count characters in each string and give them meaning in your code.
So we know valuable data is separated by +. Lets split it further and fill a Dictionary<string, double>.
var parsed = new Dictionary<string, double>(records.Length - 1);
foreach (var pairX in records.Skip(1))
{
var fields = pairX.Split('+');
var cents = double.Parse(fields[1]);
parsed.Add(fields[0], cents / 100);
}
// Now you bind to the GridView
gv.DataSource = parsed;
And your 'GridView` declaration should look like this:
<asp:GridView ID="gv" runat="server" AutoGenerateColumns="false">
<Columns>
<asp:BoundField DataField="Key" HeaderText="ID" />
<asp:BoundField DataField="Value" HeaderText="Price" />
</Columns>
</asp:GridView>

How to read file that contains one row with multiple records- C# [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 7 years ago.
Improve this question
I have this text file that only has one row. Each file contains one customer name but multiple items and descriptions.
Record starting with 00 (Company Name) has a char length of 10
01 (Item#) - char length of 10
02 (Description) - char length of 50
I know how to read a file, but I don't have any idea of how to loop through only one line, find records 00, 01, 02 and grab the text based on the length, finally start at the position of the last records and start the loop again. Can someone please give me an idea of how to read files like this?
output:
companyName 16622 Description
companyName 15522 Description
input text file example
00Init 0115522 02Description 0116622 02Description
This solution assumes that the data is fixed width, and that item number will preceed description (01 before 02). This solution will emit a record every time a description record is encountered, and deals with multiple products for the same company.
First, define a class to hold your data:
public class Record
{
public string CompanyName { get; set; }
public string ItemNumber { get; set; }
public string Description { get; set; }
}
Then, iterate through your string, returning a record when you've got a description:
public static IEnumerable<Record> ReadFile(string input)
{
// Alter these as appropriate
const int RECORDTYPELENGTH = 2;
const int COMPANYNAMELENGTH = 41;
const int ITEMNUMBERLENGTH = 8;
const int DESCRIPTIONLENGTH = 48;
int index = 0;
string companyName = null;
string itemNumber = null;
while (index < input.Length)
{
string recordType = input.Substring(index, RECORDTYPELENGTH);
index += RECORDTYPELENGTH;
if (recordType == "00")
{
companyName = input.Substring(index, COMPANYNAMELENGTH).Trim();
index += COMPANYNAMELENGTH;
}
else if (recordType == "01")
{
itemNumber = input.Substring(index, ITEMNUMBERLENGTH).Trim();
index += ITEMNUMBERLENGTH;
}
else if (recordType == "02")
{
string description = input.Substring(index, DESCRIPTIONLENGTH).Trim();
index += DESCRIPTIONLENGTH;
yield return new Record
{
CompanyName = companyName,
ItemNumber = itemNumber,
Description = description
};
}
else
{
throw new FormatException("Unexpected record type " + recordType);
}
}
}
Note that your field lengths in the question don't match the sample data, so I adjusted them so that the solution worked with the data you provided. You can adjust the field lengths by adjusting the constants.
Use this like the following:
string input = "00CompanyName 0115522 02Description 0116622 02Description ";
foreach (var record in ReadFile(input))
{
Console.WriteLine("{0}\t{1}\t{2}", record.CompanyName, record.ItemNumber, record.Description);
}
If you read the whole file into a string, you have a couple options.
One, it might be useful to use string.split.
Another option would be to use string.indexof. Once you have the index, you could use string.substring
Assuming fixed-width as specified, lets create two simple classes to hold a client and its related data as a list:
// can hold as many items (data) as there are in the line
public class Client
{
public string name;
public List<ClientData> data;
};
// one single item in the client data
public class ClientData
{
public string code;
public string description;
};
To parse a single line (which is assumed to have a single client and a successive list of item/description), we can do this (note: for simplification I'm just creating a static class with a static method in it):
// this parser will read as many itens as there are in the line
// and return a Client instance with those inside.
public static class Parser
{
public static Client ParseData(string line)
{
Client client = new Client ();
client.data = new List<ClientData> ();
client.name = line.Substring (2, 10);
// remove the client name
line = line.Substring (12);
while (line.Length > 0)
{
// create new item
ClientData data = new ClientData ();
data.code = line.Substring (2, 10);
data.description = line.Substring (14, 50);
client.data.Add (data);
// next item
line = line.Substring (64);
}
return client;
}
}
So, in your main loop, just after reading a new line from the file, you can call the above method to receive a new client. Something like this:
// should be from a file but this is just an example
string[] lines = {
"00XXXXXXXXXX01YYYYYYYYYY02XXXXXXXXX.XXXXXXXXX.XXXXXXXXX.XXXXXXXXX.XXXXXXXXXX",
"00XXXXXXXXXX01YYYYYYYYYY02XXXXXXXXX.XXXXXXXXX.XXXXXXXXX.XXXXXXXXX.XXXXXXXXXX01YYYYYYYYYY02XXXXXXXXX.XXXXXXXXX.XXXXXXXXX.XXXXXXXXX.XXXXXXXXXX",
"00XXXXXXXXXX01YYYYYYYYYY02XXXXXXXXX.XXXXXXXXX.XXXXXXXXX.XXXXXXXXX.XXXXXXXXXX",
"00XXXXXXXXXX01YYYYYYYYYY02XXXXXXXXX.XXXXXXXXX.XXXXXXXXX.XXXXXXXXX.XXXXXXXXXX",
"00XXXXXXXXXX01YYYYYYYYYY02XXXXXXXXX.XXXXXXXXX.XXXXXXXXX.XXXXXXXXX.XXXXXXXXXX",
};
// loop through each line
// (lines can have multiple items)
foreach (string line in lines)
{
Client client = Parser.ParseData (line);
Console.WriteLine ("Read: " + client.name);
}
Contents of Sample.txt:
00Company1 0115522 02This is a description for company 1. 00Company2 0115523 02This is a description for company 2. 00Company3 0115524 02This is a description for company 3
Note that in the code below, the fields are 2 characters longer than those specified in the original question. This is because I am including the headings in the length of each field, thus a field of a length of 10is effectively 12 by including the 00 from the heading. If this is undesirable, tweak the offsets of the entries in the fieldLengths array.
String directory = Environment.GetFolderPath(Environment.SpecialFolder.Desktop);
String file = "Sample.txt";
String path = Path.Combine(directory, file);
Int32[] fieldLengths = new Int32[] { 12, 12, 52 };
List<RowData> rows = new List<RowData>();
Byte[] buffer = new Byte[fieldLengths.Sum()];
using (var stream = File.OpenRead(path))
{
while (stream.Read(buffer, 0, buffer.Length) > 0)
{
List<String> fieldValues = new List<String>();
Int32 offset = 0;
for (int i = 0; i < fieldLengths.Length; i++)
{
var value = Encoding.UTF8.GetString(buffer, offset, fieldLengths[i]);
fieldValues.Add(value);
offset += fieldLengths[i];
}
String companyName = fieldValues[0];
String itemNumber = fieldValues[1];
String description = fieldValues[2];
var row = new RowData(companyName, itemNumber, description);
rows.Add(row);
}
}
Class definition for RowData:
public class RowData
{
public String Company { get; set; }
public String Number { get; set; }
public String Description { get; set; }
public RowData(String company, String number, String description)
{
Company = company;
Number = number;
Description = description;
}
}
The results will be in the rows variable.
You would have to split rows based on a delimiter. It would seem that in your case you are using whitespace as a delimiter.
The method you are looking for is String.Split(), it should cover your needs :) Documentation is located at https://msdn.microsoft.com/en-us/library/system.string.split(v=vs.110).aspx - It also includes examples.
I'd do something like this:
string myLineOfText = "MyCompany 12345 The description of my company";
string[] partsOfMyLine = myLineOfText.Split(new string[] { " " }, StringSplitOptions.RemoveEmptyEntries);
Best of luck! :)

Split string into class

I have an array which contains following values:
str[0]= "MeterNr 29202"
str[1]="- 20111101: position 61699 (Previous calculation) "
str[2]="- 20111201: position 68590 (Calculation) consumption 6891 kWh"
str[3]="- 20111101: position 75019 (Previous calculation) "
str[4]="MeterNr 50273"
str[5]="- 20111101: position 18103 (Previous reading) "
str[6]="- 20111201: position 19072 (Calculation) consumption 969 kWh "
I want to split the rows in logical order so that I can store them in following Reading class. I have problems with spliting the values. Everything in brackets () is ItemDescription.
I will be thankful for the quick answer.
public class Reading
{
public string MeterNr { get; set; }
public string ItemDescription { get; set; }
public string Date { get; set; }
public string Position { get; set; }
public string Consumption { get; set; }
}
You should parse the values one by one.
If you have a string, which starts with "MeterNr", you should save it as currentMeterNumber and parse the values further.
Otherwise, you can parse the values with Regex:
var dateRegex = new Regex(#"(?<=-\s)(?<year>\d{4})(?<month>\d{2})(?<day>\d{2})");
var positionRegex = new Regex(#"(?<=position\s+)(\d+)");
var descriptionRegex = new Regex(#"(?<=\()(?<description>[^)]+)(?=\))");
var consuptionRegex = new Regex(#"(?<=consumption\s+)(?<consumption>(?<consumtionValue>\d+)\s(?<consumptionUom>\w+))");
I hope, you would be able to create the final algorithm, as well as understand how each of those expressions works. A final point could be to combine them all into single Regex. You should do it yourself to enhance your skills.
P.S.: There are a lot of tutorials in Internet.
I would just use a for loop and string indexes etc, but then I am a bit simple like that! Not sure of your data (i.e. if things might be missing) but this would work on the data you have posted...
var readings = new List<Reading>();
int meterNrLength = "MeterNr".Length;
int positionLength = "position".Length;
int consumptionLength = "consumption".Length;
string meterNr = null;
foreach(var s in str)
{
int meterNrIndex = s.IndexOf("MeterNr",
StringComparison.OrdinalIgnoreCase);
if (meterNrIndex != -1)
{
meterNr = s.Substring(meterNrIndex + meterNrLength).Trim();
continue;
}
var reading = new Reading {MeterNr = meterNr};
string rest = s.Substring(0, s.IndexOf(':'));
reading.Date = rest.Substring(1).Trim();
rest = s.Substring(s.IndexOf("position") + positionLength);
int bracketIndex = rest.IndexOf('(');
reading.Position = rest.Substring(0, bracketIndex).Trim();
rest = rest.Substring(bracketIndex + 1);
reading.ItemDescription = rest.Substring(0, rest.IndexOf(")"));
int consumptionIndex = rest.IndexOf("consumption",
StringComparison.OrdinalIgnoreCase);
if (consumptionIndex != -1)
{
reading.Consumption = rest.Substring(consumptionIndex + consumptionLength).Trim();
}
readings.Add(reading);
}
public static List<Reading> Parser(this string[] str)
{
List<Reading> result = new List<Reading>();
string meterNr = "";
Reading reading;
foreach (string s in str)
{
MatchCollection mc = Regex.Matches(s, "\\d+|\\((.*?)\\)");
if (mc.Count == 1)
{
meterNr = mc[0].Value;
continue;
}
reading = new Reading()
{
MeterNr = meterNr,
Date = mc[0].Value,
Position = mc[1].Value,
ItemDescription = mc[2].Value.TrimStart('(').TrimEnd(')')
};
if (mc.Count == 4)
reading.Consumption = mc[3].Value;
result.Add(reading);
}
return result;
}

String Replacements for Word Merge

using asp.net 4
we do a lot of Word merges at work. rather than using the complicated conditional statements of Word i want to embed my own syntax. something like:
Dear Mr. { select lastname from users where userid = 7 },
Your invoice for this quarter is: ${ select amount from invoices where userid = 7 }.
......
ideally, i'd like this to get turned into:
string.Format("Dear Mr. {0}, Your invoice for this quarter is: ${1}", sqlEval[0], sqlEval[1]);
any ideas?
Well, I don't really recommend rolling your own solution for this, however I will answer the question as asked.
First, you need to process the text and extract the SQL statements. For that you'll need a simple parser:
/// <summary>Parses the input string and extracts a unique list of all placeholders.</summary>
/// <remarks>
/// This method does not handle escaping of delimiters
/// </remarks>
public static IList<string> Parse(string input)
{
const char placeholderDelimStart = '{';
const char placeholderDelimEnd = '}';
var characters = input.ToCharArray();
var placeHolders = new List<string>();
string currentPlaceHolder = string.Empty;
bool inPlaceHolder = false;
for (int i = 0; i < characters.Length; i++)
{
var currentChar = characters[i];
// Start of a placeholder
if (!inPlaceHolder && currentChar == placeholderDelimStart)
{
currentPlaceHolder = string.Empty;
inPlaceHolder = true;
continue;
}
// Start of a placeholder when we already have one
if (inPlaceHolder && currentChar == placeholderDelimStart)
throw new InvalidOperationException("Unexpected character detected at position " + i);
// We found the end marker while in a placeholder - we're done with this placeholder
if (inPlaceHolder && currentChar == placeholderDelimEnd)
{
if (!placeHolders.Contains(currentPlaceHolder))
placeHolders.Add(currentPlaceHolder);
inPlaceHolder = false;
continue;
}
// End of a placeholder with no matching start
if (!inPlaceHolder && currentChar == placeholderDelimEnd)
throw new InvalidOperationException("Unexpected character detected at position " + i);
if (inPlaceHolder)
currentPlaceHolder += currentChar;
}
return placeHolders;
}
Okay, so that will get you a list of SQL statements extracted from the input text. You'll probably want to tweak it to use properly typed parser exceptions and some input guards (which I elided for clarity).
Now you just need to replace those placeholders with the results of the evaluated SQL:
// Sample input
var input = "Hello Mr. {select firstname from users where userid=7}";
string output = input;
var extractedStatements = Parse(input);
foreach (var statement in extractedStatements)
{
// Execute the SQL statement
var result = Evaluate(statement);
// Update the output with the result of the SQL statement
output = output.Replace("{" + statement + "}", result);
}
This is obviously not the most efficient way to do this, but I think it sufficiently demonstrates the concept without muddying the waters.
You'll need to define the Evaluate(string) method. This will handle executing the SQL.
I just finished building a proprietary solution like this for a law firm here.
I evaluated a product called Windward reports. It's a tad pricy, esp if you need a lot of copies, but for one user it's not bad.
it can pull from XML or SQL data sources (or more if I remember).
Might be worth a look (and no I don't work for 'em, just evaluated their stuff)
You might want to check out the razor engine project on codeplex
http://razorengine.codeplex.com/
Using SQL etc within your template looks like a bad idea. I'd suggest you make a ViewModel for each template.
The Razor thing is really easy to use. Just add a reference, import the namespace, and call the Parse method like so:
(VB guy so excuse syntax!)
MyViewModel myModel = new MyViewModel("Bob",150.00); //set properties
string myTemplate = "Dear Mr. #Model.FirstName, Your invoice for this quarter is: #Model.InvoiceAmount";
string myOutput = Razor.Parse(myTemplate, myModel);
Your string can come from anywhere - I use this with my templates stored in a database, you could equally load it from files or whatever. It's very powerful as a view engine, you can do conditional stuff, loops, etc etc.
i ended up rolling my own solution but thanks. i really dislike if statements. i'll need to refactor them out. here it is:
var mailingMergeString = new MailingMergeString(input);
var output = mailingMergeString.ParseMailingMergeString();
public class MailingMergeString
{
private string _input;
public MailingMergeString(string input)
{
_input = input;
}
public string ParseMailingMergeString()
{
IList<SqlReplaceCommand> sqlCommands = new List<SqlReplaceCommand>();
var i = 0;
const string openBrace = "{";
const string closeBrace = "}";
while (string.IsNullOrWhiteSpace(_input) == false)
{
var sqlReplaceCommand = new SqlReplaceCommand();
var open = _input.IndexOf(openBrace) + 1;
var close = _input.IndexOf(closeBrace);
var length = close != -1 ? close - open : _input.Length;
var newInput = _input.Substring(close + 1);
var nextClose = newInput.Contains(openBrace) ? newInput.IndexOf(openBrace) : newInput.Length;
if (i == 0 && open > 0)
{
sqlReplaceCommand.Text = _input.Substring(0, open - 1);
_input = _input.Substring(open - 1);
}
else
{
sqlReplaceCommand.Command = _input.Substring(open, length);
sqlReplaceCommand.PlaceHolder = openBrace + i + closeBrace;
sqlReplaceCommand.Text = _input.Substring(close + 1, nextClose);
sqlReplaceCommand.NewInput = _input.Substring(close + 1);
_input = newInput.Contains(openBrace) ? sqlReplaceCommand.NewInput : string.Empty;
}
sqlCommands.Add(sqlReplaceCommand);
i++;
}
return sqlCommands.GetParsedString();
}
internal class SqlReplaceCommand
{
public string Command { get; set; }
public string SqlResult { get; set; }
public string PlaceHolder { get; set; }
public string Text { get; set; }
protected internal string NewInput { get; set; }
}
}
internal static class SqlReplaceExtensions
{
public static string GetParsedString(this IEnumerable<MailingMergeString.SqlReplaceCommand> sqlCommands)
{
return sqlCommands.Aggregate("", (current, replaceCommand) => current + (replaceCommand.PlaceHolder + replaceCommand.Text));
}
}

C# Regex for Movie Filename

I have been trying to use a C# Regex unsuccessfully to remove certain strings from a movie name.
Examples of the file names I'm working with are:
EuroTrip (2004) [SD]
Event Horizon (1997) [720]
Fast & Furious (2009) [1080p]
Star Trek (2009) [Unknown]
I'd like to remove anything in square brackets or parenthesis (including the brackets themselves)
So far I'm using:
movieTitleToFetch = Regex.Replace(movieTitleToFetch, "([*\\(\\d{4}\\)])", "");
Which seems to remove the Year and Parenthesis ok, but I just can't figure out how to remove the Square Brackets and content without affecting other parts... I've had miscellaneous results but the closest one has been:
movieTitleToFetch = Regex.Replace(movieTitleToFetch, "([?\\[+A-Z+\\]])", "");
Which left me with:
urorip (2004)
Instead of:
EuroTrip (2004) [SD]
Any whitespace that is left at the ends are ok as I will just perform
movieTitleToFetch = movieTitleToFetch.Trim();
at the end.
Thanks in advance,
Alex
This regex pattern should work ok... maybe needs a bit of tweaking
"[\[\(].+?[\]\)]"
Regex.Replace(movieTitleToFetch, #"[\[\(].+?[\]\)]", "");
This should match anything from either "[" or "(" until the next occurance of "]" or ")"
If that does not work try removing the escape character for the parentheses, like so...
Regex.Replace(movieTitleToFetch, #"[\[(].+?[\])]", "");
#Craigt is pretty much spot on but it's possibly cleaner to ensure that the brackets are matched.
([\[].*?[\]]|[\(].*?[\)])
I'know i'm late on this thread but i wrote a simple algorythm to sanitize the downloaded movies filenames.
This runs these steps:
Removes everything in brackets (if find a year it tries to keep the info)
Removes a list of common used words (720p, bdrip, h264 and so on...)
Assumes that can be languages info in the title and removes them when at the end of remaining string (before special words)
if a year was not found into parenthesis looks at the end of remaining string (as for languages)
Doing this replaces dots and spaces so the title is ready, as example, to be a query for a search api.
Here's the test in XUnit (i used most of italian titles to test it)
using Grappachu.Movideo.Core.Helpers.TitleCleaner;
using SharpTestsEx;
using Xunit;
namespace Grappachu.MoVideo.Test
{
public class TitleCleanerTest
{
[Theory]
[InlineData("Avengers.Confidential.La.Vedova.Nera.E.Punisher.2014.iTALiAN.Bluray.720p.x264 - BG.mkv",
"Avengers Confidential La Vedova Nera E Punisher", 2014)]
[InlineData("Fuck You, Prof! (2013) BDRip 720p HEVC ITA GER AC3 Multi Sub PirateMKV.mkv",
"Fuck You, Prof!", 2013)]
[InlineData("Il Libro della Giungla(2016)(BDrip1080p_H264_AC3 5.1 Ita Eng_Sub Ita Eng)by siste82.avi",
"Il Libro della Giungla", 2016)]
[InlineData("Il primo dei bugiardi (2009) [Mux by Little-Boy]", "Il primo dei bugiardi", 2009)]
[InlineData("Il.Viaggio.Di.Arlo-The.Good.Dinosaur.2015.DTS.ITA.ENG.1080p.BluRay.x264-BLUWORLD",
"il viaggio di arlo", 2015)]
[InlineData("La Mafia Uccide Solo D'estate 2013 .avi",
"La Mafia Uccide Solo D'estate", 2013)]
[InlineData("Ip.Man.3.2015.iTA.AC3.5.1.448.Chi.Aac.BluRay.m1080p.x264.Sub.[scambiofile.info].mkv",
"Ip Man 3", 2015)]
[InlineData("Inferno.2016.BluRay.1080p.AC3.ITA.AC3.ENG.Subs.x264-WGZ.mkv",
"Inferno", 2016)]
[InlineData("Ghostbusters.2016.iTALiAN.BDRiP.EXTENDED.XviD-HDi.mp4",
"Ghostbusters", 2016)]
[InlineData("Transcendence.mkv", "Transcendence", null)]
[InlineData("Being Human (Forsyth, 1994).mkv", "Being Human", 1994)]
public void Clean_should_return_title_and_year_when_possible(string filename, string title, int? year)
{
var res = MovieTitleCleaner.Clean(filename);
res.Title.ToLowerInvariant().Should().Be.EqualTo(title.ToLowerInvariant());
res.Year.Should().Be.EqualTo(year);
}
}
}
and fisrt version of the code
using System;
using System.Globalization;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
namespace Grappachu.Movideo.Core.Helpers.TitleCleaner
{
public class MovieTitleCleanerResult
{
public string Title { get; set; }
public int? Year { get; set; }
public string SubTitle { get; set; }
}
public class MovieTitleCleaner
{
private const string SpecialMarker = "§=§";
private static readonly string[] ReservedWords;
private static readonly string[] SpaceChars;
private static readonly string[] Languages;
static MovieTitleCleaner()
{
ReservedWords = new[]
{
SpecialMarker, "hevc", "bdrip", "Bluray", "x264", "h264", "AC3", "DTS", "480p", "720p", "1080p"
};
var cultures = CultureInfo.GetCultures(CultureTypes.AllCultures);
var l = cultures.Select(x => x.EnglishName).ToList();
l.AddRange(cultures.Select(x => x.ThreeLetterISOLanguageName));
Languages = l.Distinct().ToArray();
SpaceChars = new[] {".", "_", " "};
}
public static MovieTitleCleanerResult Clean(string filename)
{
var temp = Path.GetFileNameWithoutExtension(filename);
int? maybeYear = null;
// Remove what's inside brackets trying to keep year info.
temp = RemoveBrackets(temp, '{', '}', ref maybeYear);
temp = RemoveBrackets(temp, '[', ']', ref maybeYear);
temp = RemoveBrackets(temp, '(', ')', ref maybeYear);
// Removes special markers (codec, formats, ecc...)
var tokens = temp.Split(SpaceChars, StringSplitOptions.RemoveEmptyEntries);
var title = string.Empty;
for (var i = 0; i < tokens.Length; i++)
{
var tok = tokens[i];
if (ReservedWords.Any(x => string.Equals(x, tok, StringComparison.OrdinalIgnoreCase)))
{
if (title.Length > 0)
break;
}
else
{
title = string.Join(" ", title, tok).Trim();
}
}
temp = title;
// Remove languages infos when are found before special markers (should not remove "English" if it's inside the title)
tokens = temp.Split(SpaceChars, StringSplitOptions.RemoveEmptyEntries);
for (var i = tokens.Length - 1; i >= 0; i--)
{
var tok = tokens[i];
if (Languages.Any(x => string.Equals(x, tok, StringComparison.OrdinalIgnoreCase)))
tokens[i] = string.Empty;
else
break;
}
title = string.Join(" ", tokens).Trim();
// If year is not found inside parenthesis try to catch at the end, just after the title
if (!maybeYear.HasValue)
{
var resplit = title.Split(SpaceChars, StringSplitOptions.RemoveEmptyEntries);
var last = resplit.Last();
if (LooksLikeYear(last))
{
maybeYear = int.Parse(last);
title = title.Replace(last, string.Empty).Trim();
}
}
// TODO: review this. when there's one dash separates main title from subtitle
var res = new MovieTitleCleanerResult();
res.Year = maybeYear;
if (title.Count(x => x == '-') == 1)
{
var sp = title.Split('-');
res.Title = sp[0];
res.SubTitle = sp[1];
}
else
{
res.Title = title;
}
return res;
}
private static string RemoveBrackets(string inputString, char openChar, char closeChar, ref int? maybeYear)
{
var str = inputString;
while (str.IndexOf(openChar) > 0 && str.IndexOf(closeChar) > 0)
{
var dataGraph = str.GetBetween(openChar.ToString(), closeChar.ToString());
if (LooksLikeYear(dataGraph))
{
maybeYear = int.Parse(dataGraph);
}
else
{
var parts = dataGraph.Split(SpaceChars, StringSplitOptions.RemoveEmptyEntries);
foreach (var part in parts)
if (LooksLikeYear(part))
{
maybeYear = int.Parse(part);
break;
}
}
str = str.ReplaceBetween(openChar, closeChar, string.Format(" {0} ", SpecialMarker));
}
return str;
}
private static bool LooksLikeYear(string dataRound)
{
return Regex.IsMatch(dataRound, "^(19|20)[0-9][0-9]");
}
}
public static class StringUtils
{
public static string GetBetween(this string src, string a, string b,
StringComparison comparison = StringComparison.Ordinal)
{
var idxStr = src.IndexOf(a, comparison);
var idxEnd = src.IndexOf(b, comparison);
if (idxStr >= 0 && idxEnd > 0)
{
if (idxStr > idxEnd)
Swap(ref idxStr, ref idxEnd);
return src.Substring(idxStr + a.Length, idxEnd - idxStr - a.Length);
}
return src;
}
private static void Swap<T>(ref T idxStr, ref T idxEnd)
{
var temp = idxEnd;
idxEnd = idxStr;
idxStr = temp;
}
public static string ReplaceBetween(this string s, char begin, char end, string replacement = null)
{
var regex = new Regex(string.Format("\\{0}.*?\\{1}", begin, end));
return regex.Replace(s, replacement ?? string.Empty);
}
}
}
This does the trick:
#"(\[[^\]]*\])|(\([^\)]*\))"
It removes anything from "[" to the next "]" and anything from "(" to the next ")".
Can you just use:
string MovieTitle="Star Trek (2009) [Unknown]";
movieTitleToFetch= MovieTitle.IndexOf('(')>MovieTitle.IndexOf('[')?
MovieTitle.Substring(0,MovieTitle.IndexOf('[')):
MovieTitle.Substring(0,MovieTitle.IndexOf('('));
Cant we use this instead:-
if(movieTitleToFetch.Contains("("))
movieTitleToFetch=movieTitleToFetch.Substring(0,movieTitleToFetch.IndexOf("("));
Above code will surely return you the perfect movie titles for these strings:-
EuroTrip (2004) [SD]
Event Horizon (1997) [720]
Fast & Furious (2009) [1080p]
Star Trek (2009) [Unknown]
if there occurs a case where you will not have year but only type i.e :-
EuroTrip [SD]
Event Horizon [720]
Fast & Furious [1080p]
Star Trek [Unknown]
then use this
if(movieTitleToFetch.Contains("("))
movieTitleToFetch=movieTitleToFetch.Substring(0,movieTitleToFetch.IndexOf("("));
else if(movieTitleToFetch.Contains("["))
movieTitleToFetch=movieTitleToFetch.Substring(0,movieTitleToFetch.IndexOf("["));
I came up with .+\s(?<year>\(\d{4}\))\s(?<format>\[\w+\]) which matches any of your examples, and contains the year and format as named capture groups to help you replace them.
This pattern translates as:
Any character, one or more repitions
Whitespace
Literal '(' followed by 4 digits followed by literal ')' (year)
Whitespace
Literal '[' followed by alphanumeric, one or more repitions, followed by literal ']' (format)

Categories