Regular expression can't handle rogue square brackets - c#

Thanks to the smarties on here in the past I have this amazing recursive regular expression that helps me to transform custom BBCode-style tags in a block of text.
/// <summary>
/// Static class containing common regular expression strings.
/// </summary>
public static class RegularExpressions
{
/// <summary>
/// Expression to find all root-level BBCode tags. Use this expression recursively to obtain nested tags.
/// </summary>
public static string BBCodeTags
{
get
{
return @"
(?>
\[ (?<tag>[^][/=\s]+) \s*
(?: = \s* (?<val>[^][]*) \s*)?
]
)
(?<content>
(?>
\[(?<innertag>[^][/=\s]+)[^][]*]
|
\[/(?<-innertag>\k<innertag>)]
|
[^][]+
)*
(?(innertag)(?!))
)
\[/\k<tag>]
";
}
}
}
This regex works beautifully, recursively matching on all tags. Like this:
[code]
some code
[b]some text [url=http://www.google.com]some link[/url][/b]
[/code]
The regex does exactly what I want and matches the [code] tag. It breaks it up into three groups: tag, optional value, and content. Tag being the tag name ("code" in this case). Optional value being a value after the equals(=) sign if there is one. And content being everything between the opening and closing tag.
The regex can be used recursively to match nested tags. So after matching on [code] I would run it again against the content group and it would match the [b] tag. If I ran it again on the next content group it would then match the [url] tag.
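To make the recursive use concrete, here is a minimal sketch that walks nested tags depth-first by re-running the expression on each content group. The class and method names (NestedTagWalker, CollectTagNames) are made up for illustration, and the pattern is the one shown above, compiled with RegexOptions.IgnorePatternWhitespace.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original code.
public static class NestedTagWalker
{
    // Same pattern as in the question; the whitespace in it is ignored by the option below.
    private static readonly Regex TagRegex = new Regex(@"
        (?>
            \[ (?<tag>[^][/=\s]+) \s*
            (?: = \s* (?<val>[^][]*) \s*)?
            ]
        )
        (?<content>
            (?>
                \[(?<innertag>[^][/=\s]+)[^][]*]
                |
                \[/(?<-innertag>\k<innertag>)]
                |
                [^][]+
            )*
            (?(innertag)(?!))
        )
        \[/\k<tag>]",
        RegexOptions.IgnorePatternWhitespace);

    // Returns tag names in depth-first order: each root-level tag,
    // followed by the tags found inside its content group.
    public static List<string> CollectTagNames(string text)
    {
        var names = new List<string>();
        foreach (Match m in TagRegex.Matches(text))
        {
            names.Add(m.Groups["tag"].Value);
            names.AddRange(CollectTagNames(m.Groups["content"].Value));
        }
        return names;
    }
}
```

For the [code] sample above this should yield code, b, url, in that order.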
All of that is wonderful and delicious but it hiccups on one issue. It can't handle rogue square brackets.
[code]This is a successful match.[/code]
[code]This is an [ unsuccessful match.[/code]
[code]This is also an [unsuccessful] match.[/code]
I really suck at regular expressions but if anyone knows how I might tweak this regex to correctly ignore rogue brackets (brackets that do not make up an opening tag and/or do not have a matching closing tag) so that it still matches the surrounding tags, I would be very appreciative :D
Thanks in advance!
Edit
If you are interested in seeing the method where I use this expression you are welcome to.

I wrote a program that can parse your strings in a debuggable, developer-friendly way. It is not as compact as those regexes, but it has an upside: you can step through it and fine-tune it as you need.
The implementation is a recursive descent parser, and if you need some kind of contextual data, you can place it all inside the ParseContext class.
It is quite long, but I consider it better than a regex-based solution.
To test it, create a console application, and replace all the code inside Program.cs with the following code:
using System.Collections.Generic;
namespace q7922337
{
static class Program
{
static void Main(string[] args)
{
var result1 = Match.ParseList<TagsGroup>("[code]This is a successful match.[/code]");
var result2 = Match.ParseList<TagsGroup>("[code]This is an [ unsuccessful match.[/code]");
var result3 = Match.ParseList<TagsGroup>("[code]This is also an [unsuccessful] match.[/code]");
var result4 = Match.ParseList<TagsGroup>(@"
[code]
some code
[b]some text [url=http://www.google.com]some link[/url][/b]
[/code]");
}
class ParseContext
{
public string Source { get; set; }
public int Position { get; set; }
}
abstract class Match
{
public override string ToString()
{
return this.Text;
}
public string Source { get; set; }
public int Start { get; set; }
public int Length { get; set; }
public string Text { get { return this.Source.Substring(this.Start, this.Length); } }
protected abstract bool ParseInternal(ParseContext context);
public bool Parse(ParseContext context)
{
var result = this.ParseInternal(context);
this.Length = context.Position - this.Start;
return result;
}
public bool MarkBeginAndParse(ParseContext context)
{
this.Start = context.Position;
var result = this.ParseInternal(context);
this.Length = context.Position - this.Start;
return result;
}
public static List<T> ParseList<T>(string source)
where T : Match, new()
{
var context = new ParseContext
{
Position = 0,
Source = source
};
var result = new List<T>();
while (true)
{
var item = new T { Source = source, Start = context.Position };
if (!item.Parse(context))
break;
result.Add(item);
}
return result;
}
public static T ParseSingle<T>(string source)
where T : Match, new()
{
var context = new ParseContext
{
Position = 0,
Source = source
};
var result = new T { Source = source, Start = context.Position };
if (result.Parse(context))
return result;
return null;
}
protected List<T> ReadList<T>(ParseContext context)
where T : Match, new()
{
var result = new List<T>();
while (true)
{
var item = new T { Source = this.Source, Start = context.Position };
if (!item.Parse(context))
break;
result.Add(item);
}
return result;
}
protected T ReadSingle<T>(ParseContext context)
where T : Match, new()
{
var result = new T { Source = this.Source, Start = context.Position };
if (result.Parse(context))
return result;
return null;
}
protected int ReadSpaces(ParseContext context)
{
int startPos = context.Position;
int cnt = 0;
while (true)
{
if (startPos + cnt >= context.Source.Length)
break;
if (!char.IsWhiteSpace(context.Source[context.Position + cnt]))
break;
cnt++;
}
context.Position = startPos + cnt;
return cnt;
}
protected bool ReadChar(ParseContext context, char p)
{
int startPos = context.Position;
if (startPos >= context.Source.Length)
return false;
if (context.Source[startPos] == p)
{
context.Position = startPos + 1;
return true;
}
return false;
}
}
class Tag : Match
{
protected override bool ParseInternal(ParseContext context)
{
int startPos = context.Position;
if (!this.ReadChar(context, '['))
return false;
this.ReadSpaces(context);
// Reset the flag on every attempt: the same Tag object is reused after a failed parse.
this.IsEndTag = this.ReadChar(context, '/');
this.ReadSpaces(context);
var validName = this.ReadValidName(context);
if (validName != null)
this.Name = validName;
this.ReadSpaces(context);
if (this.ReadChar(context, ']'))
return true;
context.Position = startPos;
return false;
}
protected string ReadValidName(ParseContext context)
{
int startPos = context.Position;
int endPos = startPos;
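// Stop at the end of the source to avoid an IndexOutOfRangeException.
while (endPos < context.Source.Length && char.IsLetter(context.Source[endPos]))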
endPos++;
if (endPos == startPos) return null;
context.Position = endPos;
return context.Source.Substring(startPos, endPos - startPos);
}
public bool IsEndTag { get; set; }
public string Name { get; set; }
}
class TagsGroup : Match
{
public TagsGroup()
{
}
protected TagsGroup(Tag openTag)
{
this.Start = openTag.Start;
this.Source = openTag.Source;
this.OpenTag = openTag;
}
protected override bool ParseInternal(ParseContext context)
{
var startPos = context.Position;
if (this.OpenTag == null)
{
this.ReadSpaces(context);
this.OpenTag = this.ReadSingle<Tag>(context);
}
if (this.OpenTag != null)
{
int textStart = context.Position;
int textLength = 0;
while (true)
{
Tag tag = new Tag { Source = this.Source, Start = context.Position };
while (!tag.MarkBeginAndParse(context))
{
if (context.Position >= context.Source.Length)
{
context.Position = startPos;
return false;
}
context.Position++;
textLength++;
}
if (!tag.IsEndTag)
{
var tagGrpStart = context.Position;
var tagGrup = new TagsGroup(tag);
if (tagGrup.Parse(context))
{
if (textLength > 0)
{
if (this.Contents == null) this.Contents = new List<Match>();
this.Contents.Add(new Text { Source = this.Source, Start = textStart, Length = textLength });
textStart = context.Position;
textLength = 0;
}
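// Contents is still null when no text preceded this nested group.
if (this.Contents == null) this.Contents = new List<Match>();
this.Contents.Add(tagGrup);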
}
else
{
textLength += tag.Length;
}
}
else
{
if (tag.Name == this.OpenTag.Name)
{
if (textLength > 0)
{
if (this.Contents == null) this.Contents = new List<Match>();
this.Contents.Add(new Text { Source = this.Source, Start = textStart, Length = textLength });
textStart = context.Position;
textLength = 0;
}
this.CloseTag = tag;
return true;
}
else
{
textLength += tag.Length;
}
}
}
}
context.Position = startPos;
return false;
}
public Tag OpenTag { get; set; }
public Tag CloseTag { get; set; }
public List<Match> Contents { get; set; }
}
class Text : Match
{
protected override bool ParseInternal(ParseContext context)
{
return true;
}
}
}
}
If you use this code, and someday find that you need optimizations because the parser has become ambiguous, then try using a dictionary in the ParseContext, take a look here for more info: http://en.wikipedia.org/wiki/Top-down_parsing in the topic Time and space complexity of top-down parsing. I find it very interesting.

The first change is pretty simple - you can get it by changing [^][]+, which is responsible for matching the free text, to a single dot (.). This seems a little crazy, perhaps, but it's actually safe, because you are using an atomic group (?> ), so all the valid tags will be matched by the first alternation - \[(?<innertag>[^][/=\s]+)[^][]*] - and the regex cannot backtrack into them and break the tags.
(Remember to enable the Singleline flag, so . matches newlines)
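A quick way to check the first change is to compile the question's pattern with the free-text alternative swapped for a dot and run it against the three sample lines. This is only a sketch; the class name RogueBracketDemo is made up for illustration.

```csharp
using System.Text.RegularExpressions;

// Hypothetical test harness for the relaxed pattern.
public static class RogueBracketDemo
{
    // The question's pattern with [^][]+ replaced by a dot.
    // Singleline lets the dot match newlines as well.
    private static readonly Regex Relaxed = new Regex(@"
        (?>
            \[ (?<tag>[^][/=\s]+) \s*
            (?: = \s* (?<val>[^][]*) \s*)?
            ]
        )
        (?<content>
            (?>
                \[(?<innertag>[^][/=\s]+)[^][]*]
                |
                \[/(?<-innertag>\k<innertag>)]
                |
                .
            )*
            (?(innertag)(?!))
        )
        \[/\k<tag>]",
        RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline);

    public static bool Matches(string input)
    {
        return Relaxed.IsMatch(input);
    }
}
```

With this pattern the lone [ no longer prevents a match, while a well-formed but unclosed tag such as [unsuccessful] still does, because the atomic group has already committed to reading it as an opening tag.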
The second requirement, [unsuccessful], seems to go against your goal. The whole idea from the very start is not to match these unclosed tags. If you allow unclosed tags, all matches of the form \[(.*?)\].*?\[/\1\] become valid. Not good. At best, you can try to whitelist a few tags that are allowed to appear without a closing tag.
An example of both changes:
(?>
\[ (?<tag>[^][/=\s]+) \s*
(?: = \s* (?<val>[^][]*) \s*)?
\]
)
(?<content>
(?>
\[(?:unsuccessful)\] # self closing
|
\[(?<innertag>[^][/=\s]+)[^][]*]
|
\[/(?<-innertag>\k<innertag>)]
|
.
)*
(?(innertag)(?!))
)
\[/\k<tag>\]
Working example on Regex Hero

Ok. Here's another attempt. This one is a little more complicated.
The idea is to match the whole text from start to end, and parse it into a single Match. While rarely used as such, .NET balancing groups allow you to fine-tune your captures, remembering all positions and captures exactly the way you want them.
The pattern I came up with is:
\A
(?<StartContentPosition>)
(?:
# Open tag
(?<Content-StartContentPosition>) # capture the content between tags
(?<StartTagPosition>) # Keep the starting position of the tag
(?>\[(?<TagName>[^][/=\s]+)[^\]\[]*\]) # opening tag
(?<StartContentPosition>) # start another content capture
|
# Close tag
(?<Content-StartContentPosition>) # capture the content in the tag
\[/\k<TagName>\](?<Tag-StartTagPosition>) # closing tag, keep the content in the <tag> group
(?<-TagName>)
(?<StartContentPosition>) # start another content capture
|
. # just match anything. The tag alternatives come first, so tags are
# matched as tags whenever possible. (?(TagName)(?!)) keeps this in line, so
# unmatched tags will not mess with the result
)*
(?<Content-StartContentPosition>) # capture the content after the last tag
\Z
(?(TagName)(?!))
Remember - the balancing group (?<A-B>) captures into A all text since B was last captured (and pops that position from B's stack).
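If the balancing-group behaviour is new to you, a much smaller pattern may help before reading the big one. This sketch is not part of the pattern above; the group names Start and Body and the class name are made up for illustration.

```csharp
using System.Text.RegularExpressions;

// Hypothetical, minimal illustration of (?<A-B>...).
public static class BalancingGroupDemo
{
    // Start remembers where the opening bracket was matched.
    // (?<Body-Start>\]) pops that Start capture and stores in Body
    // the text between the end of Start and the closing bracket.
    public static string BracketBody(string input)
    {
        Match m = Regex.Match(input, @"(?<Start>\[)[^\[\]]*(?<Body-Start>\])");
        return m.Success ? m.Groups["Body"].Value : null;
    }
}
```

BracketBody("x[abc]y") gives abc, without the brackets, which is the same mechanism the Content and Tag groups above rely on.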
Now you can parse the string using:
Match match = Regex.Match(sample, pattern, RegexOptions.Singleline |
RegexOptions.IgnorePatternWhitespace);
Your interesting data will be on match.Groups["Tag"].Captures, which contains all tags (some of them are contained in others), and match.Groups["Content"].Captures, which contains tag's contents, and contents between tags. For example, without all blanks, it contains:
some code
some text
This is also an successful match.
This is also an [ unsuccessful match.
This is also an [unsuccessful] match.
This is pretty close to a full parsed document, but you'll still have to play with indices and length to figure out the exact order and structure of the document (though it isn't more complex than sorting all captures)
At this point I'll state what others have said - it may be a good time to write a parser, this pattern isn't pretty...

Related

C# Create Acronym from Word

Given any string, I'd like to create an intelligent acronym that represents the string. If any of you have used JIRA, they accomplish this pretty well.
For example, given the word: Phoenix it would generate PHX or given the word Privacy Event Management it would create PEM.
I've got some code that will accomplish the latter:
string.Join(string.Empty, model.Name
.Where(char.IsLetter)
.Where(char.IsUpper))
but it doesn't account for the first case. It also doesn't handle a single word that is all lower case. Any ideas? I'm using C# 4.5
For the Phoenix => PHX, I think you'll need to check the strings against a dictionary of known abbreviations. As for the multiple word/camel-case support, regex is your friend!
var text = "A Big copy DayEnergyFree good"; // abbreviation should be "ABCDEFG"
var pattern = @"((?<=^|\s)(\w{1})|([A-Z]))";
string.Join(string.Empty, Regex.Matches(text, pattern).OfType<Match>().Select(x => x.Value.ToUpper()))
Let me explain what's happening here, starting with the regex pattern, which covers a few cases for matching substrings.
// must be directly after the beginning of the string or line "^" or a whitespace character "\s"
(?<=^|\s)
// match just one letter that is part of a word
(\w{1})
// if the previous requirements are not met
|
// match any upper-case letter
([A-Z])
The Regex.Matches method returns a MatchCollection, which is basically an ICollection, so to use LINQ expressions we call OfType<Match>() to convert the MatchCollection into an IEnumerable<Match>.
Regex.Matches(text, pattern).OfType<Match>()
Then we select only the value of the match (we don't need the other regex matching meta-data) and convert it to upper-case.
Select(x => x.Value.ToUpper())
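Putting the two ideas from this answer together, a rough sketch could check a table of known abbreviations first and fall back to the regex otherwise. The table contents and the names AcronymSketch, Known and Abbreviate are assumptions for illustration; a real table would need to be supplied by you.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper combining a lookup table with the regex above.
public static class AcronymSketch
{
    // Assumed sample data: known abbreviations that no rule could derive.
    private static readonly Dictionary<string, string> Known =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "Phoenix", "PHX" }
        };

    public static string Abbreviate(string text)
    {
        string known;
        if (Known.TryGetValue(text.Trim(), out known))
            return known;

        // First character of every word, plus any other upper-case letter.
        var letters = Regex.Matches(text, @"((?<=^|\s)(\w{1})|([A-Z]))")
            .Cast<Match>()
            .Select(m => m.Value.ToUpperInvariant());
        return string.Concat(letters);
    }
}
```

Abbreviate("Privacy Event Management") goes through the regex path and gives PEM, while Abbreviate("Phoenix") is answered from the table.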
I was able to extract the JIRA key generator and posted it here. It's pretty interesting, and even though it's JavaScript it could easily be converted to C#.
Here is a simple function that generates an acronym. Basically it puts a letter or digit into the acronym when there is a space before that character. If there are no spaces in the string, the string is returned as is. It does not capitalize letters in the acronym, but that is easy to amend.
You can just copy it in your code and start using it.
Results are the following. Just an example:
Deloitte Private Pty Ltd - DPPL
Clearwater Investment Co Pty Ltd (AC & CC Family Trust) - CICPLACFT
ASIC - ASIC
private string Acronym(string value)
{
if (string.IsNullOrWhiteSpace(value))
{
return value;
} else
{
var builder = new StringBuilder();
foreach(char c in value)
{
if (char.IsWhiteSpace(c) || char.IsLetterOrDigit(c))
{
builder.Append(c);
}
}
string trimmedValue = builder.ToString().Trim();
builder.Clear();
if (trimmedValue.Contains(' '))
{
for(int charIndex = 0; charIndex < trimmedValue.Length; charIndex++)
{
if (charIndex == 0)
{
builder.Append(trimmedValue[0]);
} else
{
char currentChar = trimmedValue[charIndex];
char previousChar = trimmedValue[charIndex - 1];
if (char.IsLetterOrDigit(currentChar) && char.IsWhiteSpace(previousChar))
{
builder.Append(trimmedValue[charIndex]);
}
}
}
return builder.ToString();
} else
{
return trimmedValue;
}
}
}
I needed a code that never repeats, so I wrote the following method.
It can be used like this:
HashSet<string> idHashSet = new HashSet<string>();
for (int i = 0; i < 100; i++)
{
var eName = "China National Petroleum";
Console.WriteLine($"count:{i+1},short name:{GetIdentifierCode(eName,ref idHashSet)}");
}
The method is this:
/// <summary>
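/// Derives a short code from an English name: the first letter of each word is preferred, then further letters are taken from each word in turn; if the code still collides, it is padded with the default fill character (A).
/// TODO: when the name is Chinese, use its pinyin as the source for the code.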
/// </summary>
/// <param name="name"></param>
/// <param name="idHashSet"></param>
/// <returns></returns>
public static string GetIdentifierCode(string name, ref HashSet<string> idHashSet)
{
var words = name;
var fillChar = 'A';
if (string.IsNullOrEmpty(words))
{
do
{
words += fillChar.ToString();
} while (idHashSet.Contains(words));
}
//if (IsChinese)
//{
// words = GetPinYin(words);
//}
// Example: 中国石油天然气集团公司 (China National Petroleum)
var sourceWord = new List<string>(words.Split(' '));
var returnWord = sourceWord.Select(c => new List<char>()).ToList();
int index = 0;
do
{
var listAddWord = sourceWord[index];
var addWord = returnWord[index];
// If the code still collides at the end, pad with the default fill character (A)
if (sourceWord.All(c => string.IsNullOrEmpty(c)))
{
returnWord.Last().Add(fillChar);
continue;
}
// Skip this word once all of its characters have been used
else if (string.IsNullOrEmpty(listAddWord))
{
if (index == sourceWord.Count - 1)
index = 0;
else
{
index++;
}
continue;
}
if (addWord == null)
addWord = new List<char>();
string addString = string.Empty;
// If the word is all upper-case, do not split it
if (listAddWord.All(a => char.IsUpper(a)))
{
addWord = listAddWord.ToCharArray().ToList();
returnWord[index] = addWord;
addString = listAddWord;
}
else
{
addString = listAddWord.First().ToString();
addWord.Add(listAddWord.First());
}
listAddWord = listAddWord.Replace(addString, "");
sourceWord[index] = listAddWord;
if (index == sourceWord.Count - 1)
index = 0;
else
{
index++;
}
} while (idHashSet.Contains(string.Concat(returnWord.SelectMany(c => c))));
words = string.Concat(returnWord.SelectMany(c => c));
idHashSet.Add(words);
return words;
}

NEsper issue with regexp

I have been stuck here for a good while and seem to have narrowed the problem down to incorrect NEsper behaviour with regex. I wrote a simple project to reproduce the issue and it is available on GitHub.
In a nutshell, NEsper allows me to pump messages (events) through a set of rules (SQL-like). If an event matches a rule, NEsper fires an alert. In my application I need to use a regular expression and this doesn't seem to work.
Problem
I tried both ways of creating statements, createPattern and createEPL, and neither fires a match event, even though the regular expression matches the input when checked with the .NET Regex class. If instead of the regex ("\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b") I pass the matching value itself ("127.0.0.5") to the statement, the event fires successfully.
INPUT
127.0.0.5
==RULE FAIL==
every (Id123=TestDummy(Value regexp '\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b'))
// and I want this to pass
==RULE PASS==
every (Id123=TestDummy(Value regexp '127.0.0.5'))
Question
Could anyone help me out with a sample of NEsper regular expression matching? Or perhaps point to my dumb mistake in the code.
Code
This is my NEsper demo wrapper class
public class NesperAdapter
{
public MatchEventSubscrtiber Subscriber { get; set; }
internal EPServiceProvider Engine { get; private set; }
public NesperAdapter()
{
//This call internally depend on log4net,
//will throw an error if log4net cannot be loaded
EPServiceProviderManager.PurgeDefaultProvider();
//config
var configuration = new Configuration();
configuration.AddEventType("TestDummy", typeof(TestDummy).FullName);
configuration.EngineDefaults.Threading.IsInternalTimerEnabled = false;
configuration.EngineDefaults.Logging.IsEnableExecutionDebug = false;
configuration.EngineDefaults.Logging.IsEnableTimerDebug = false;
//engine
Engine = EPServiceProviderManager.GetDefaultProvider(configuration);
Engine.EPRuntime.SendEvent(new TimerControlEvent(TimerControlEvent.ClockTypeEnum.CLOCK_EXTERNAL));
Engine.Initialize();
Engine.EPRuntime.UnmatchedEvent += OnUnmatchedEvent;
}
public void AddStatementFromRegExp(string regExp)
{
const string pattern = "any (Id123=TestDummy(Value regexp '{0}'))";
string formattedPattern = String.Format(pattern, regExp);
EPStatement statement = Engine.EPAdministrator.CreatePattern(formattedPattern);
//this is subscription
Subscriber = new MatchEventSubscrtiber();
statement.Subscriber = Subscriber;
}
internal void OnUnmatchedEvent(object sender, UnmatchedEventArgs e)
{
Console.WriteLine(@"Unmatched event");
Console.WriteLine(e.Event);
}
public void SendEvent(object someEvent)
{
Engine.EPRuntime.SendEvent(someEvent);
}
}
Then subscriber and a DummyType
public class MatchEventSubscrtiber
{
public bool HasEventFired { get; set; }
public MatchEventSubscrtiber()
{
HasEventFired = false;
}
public void Update(IDictionary<string, object> rows)
{
Console.WriteLine("Match event fired");
Console.WriteLine(rows);
HasEventFired = true;
}
}
public class TestDummy
{
public string Value { get; set; }
}
And the NUnit test. If one comments out the nesper.AddStatementFromRegExp(regexp); line and uncomments the //nesper.AddStatementFromRegExp(input); line, the test passes. However, I need a regular expression there.
//Match any IP address
[TestFixture(@"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b", "127.0.0.5")]
public class WhenValidRegexpPassedAndRuleCreatedAndPropagated
{
private NesperAdapter nesper;
//Setup
public WhenValidRegexpPassedAndRuleCreatedAndPropagated(string regexp, string input)
{
//check it is valid regexp in .NET
var r = new Regex(regexp);
var match = r.Match(input);
Assert.IsTrue(match.Success, "Regexp validation failed in .NET");
//create and start engine
nesper = new NesperAdapter();
//Add a rule, this fails with a correct regexp and a matching input
//PROBLEM IS HERE
nesper.AddStatementFromRegExp(regexp);
//PROBLEM IS HERE
//This works, but it is just input self-matching
//nesper.AddStatementFromRegExp(input);
var oneEvent = new TestDummy
{
Value = input
};
nesper.SendEvent(oneEvent);
}
[Test]
public void ThenNesperFiresMatchEvent()
{
//wait till nesper process the event
Thread.Sleep(100);
//Check if subscriber has received the event
Assert.IsTrue(nesper.Subscriber.HasEventFired,
"Event didn't fire");
}
}
I was debugging this issue for some time now and found that NEsper incorrectly handles
WHERE regexp 'foobar' statement
So if I have
SELECT * FROM MyType WHERE PropertyA regexp 'some valid regexp'
NEsper performs string formatting and validation with 'some valid regexp' and removes important (and valid) symbols from regexp. This is how I fixed it for myself. Not sure if it is a recommended approach.
File: com.espertech.esper.epl.expression.ExprRegexpNode
Reason: I think it is up to the user how regexp is constructed, this shall not be part of a framework.
// Inside this method
public object Evaluate(EventBean[] eventsPerStream, bool isNewData, ExprEvaluatorContext exprEvaluatorContext){...}
// Find two occurrences of
_pattern = new Regex(String.Format("^{0}$", patternText));
// And change to
_pattern = new Regex(patternText);
File: com.espertech.esper.epl.parse.ASTConstantHelper
Reason: requireUnescape for all strings, but skip regexp, as unescaping breaks a valid regexp and removes some valid symbols from it.
// Inside this method
public static Object Parse(ITree node){...}
// Find one occurrence of
case EsperEPL2GrammarParser.STRING_TYPE:
{
return StringValue.ParseString(node.Text);
}
// And change to
case EsperEPL2GrammarParser.STRING_TYPE:
{
bool requireUnescape = true;
if (node.Parent != null)
{
if (!String.IsNullOrEmpty(node.Parent.Text))
{
if (node.Parent.Text == "regexp")
{
requireUnescape = false;
}
}
}
return StringValue.ParseString(node.Text, requireUnescape);
}
File: com.espertech.esper.type.StringValue
Reason: unescape all strings, but the regexp value.
// Inside this method
public static String ParseString(String value){...}
// Change from
public static String ParseString(String value)
{
if ((value.StartsWith("\"")) & (value.EndsWith("\"")) || (value.StartsWith("'")) & (value.EndsWith("'")))
{
if (value.Length > 1)
{
if (value.IndexOf('\\') != -1)
{
return Unescape(value.Substring(1, value.Length - 2));
}
return value.Substring(1, value.Length - 2);
}
}
throw new ArgumentException("String value of '" + value + "' cannot be parsed");
}
// Change to
public static String ParseString(String value, bool requireUnescape = true)
{
if ((value.StartsWith("\"")) & (value.EndsWith("\"")) || (value.StartsWith("'")) & (value.EndsWith("'")))
{
if (value.Length > 1)
{
if (requireUnescape)
{
if (value.IndexOf('\\') != -1)
{
return Unescape(value.Substring(1, value.Length - 2));
}
}
return value.Substring(1, value.Length - 2);
}
}
throw new ArgumentException("String value of '" + value + "' cannot be parsed");
}
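Both effects described in this answer can be reproduced with the plain .NET Regex class, without NEsper. The sketch below is only an illustration: dropping the backslashes is an assumed stand-in for what the unescape step does to the pattern (NEsper's actual Unescape may behave differently), and the class and method names are made up.

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo; it does not call NEsper at all.
public static class RegexpHandlingDemo
{
    public const string IpPattern = @"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b";

    // The pattern exactly as the user wrote it.
    public static bool MatchesAsWritten(string input)
    {
        return Regex.IsMatch(input, IpPattern);
    }

    // Assumption: simulate the unescape step by removing every backslash.
    public static bool MatchesAfterStrippingBackslashes(string input)
    {
        return Regex.IsMatch(input, IpPattern.Replace(@"\", ""));
    }

    // The "^{0}$" wrapping turns a "contains" match into a whole-string match.
    public static bool MatchesWhenAnchored(string input)
    {
        return Regex.IsMatch(input, string.Format("^{0}$", IpPattern));
    }
}
```

With the backslashes gone, \d becomes a literal d, so 127.0.0.5 no longer matches. The anchoring is harmless for that exact input, but it changes the result for something like "ip 127.0.0.5", which is why the answer argues that anchoring should be left to the user.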

Parsing tree in C#

I have a [textual] tree like this:
+---step-1
| +---step_2
| | +---step3
| | \---step4
| +---step_2.1
| \---step_2.2
+---step1.2
Tree2
+---step-1
| \---step_2
| | +---step3
| | \---step4
+---step1.2
This is just a small example; the tree can be deeper and have more children, etc.
Right now I'm doing this:
for (int i = 0; i < cmdOutList.Count; i++)
{
string s = cmdOutList[i];
String value = Regex.Match(s, @"(?<=\---).*").Value;
value = value.Replace("\r", "");
if (s[1].ToString() == "-")
{
DirectoryNode p = new DirectoryNode { Name = value };
//p.AddChild(f);
directoryList.Add(p);
}
else
{
DirectoryNode f = new DirectoryNode { Name = value };
directoryList[i - 1].AddChild(f);
directoryList.Add(f);
}
}
But this doesn't handle the "step_2.1" and "step_2.2"
I think I'm doing this totally wrong, maybe someone can help me out with this.
EDIT:
Here is the DirectoryNode class to make that a bit more clear..
public class DirectoryNode
{
public DirectoryNode()
{
this.Children = new List<DirectoryNode>();
}
public DirectoryNode ParentObject { get; set; }
public string Name;
public List<DirectoryNode> Children { get; set; }
public void AddChild(DirectoryNode child)
{
child.ParentObject = this;
this.Children.Add(child);
}
}
If your text is that simple (just either +--- or \--- preceded by a series of |), then a regex might be more than you need (and what's tripping you up).
DirectoryNode currentParent = null;
DirectoryNode current = null;
int lastStartIndex = 0;
foreach(string temp in cmdOutList)
{
string line = temp;
int startIndex = Math.Max(line.IndexOf("+"), line.IndexOf(@"\"));
line = line.Substring(startIndex);
if(startIndex > lastStartIndex)
{
currentParent = current;
}
else if(startIndex < lastStartIndex)
{
for(int i = 0; i < (lastStartIndex - startIndex) / 4; i++)
{
if(currentParent == null) break;
currentParent = currentParent.ParentObject;
}
}
lastStartIndex = startIndex;
current = new DirectoryNode() { Name = line.Substring(4) };
if(currentParent != null)
{
currentParent.AddChild(current);
}
else
{
directoryList.Add(current);
}
}
Regex definitely looks unnecessary here, since the symbols in your markup language (that's what it is, after all) are both static and few. That is: Although the label names may vary, the tokens you need to look for when trying to parse them into relevant pieces will never be anything other than +---, \---, and ..
From a question I answered yesterday: "Regexes are extremely useful for describing a whole class of needles in a largely unknown haystack, but they're not the right tool for input that's in a very static format."
String manipulation is what you want for parsing this, especially since you're dealing with a recursive markup language, which can't be fully understood by regex anyway. I'd also suggest creating a tree-type data structure to store the data (which, surprisingly, doesn't seem to be included in the framework unless they added it after 2.0).
As an aside, your regex above seems to have an unnecessary \ in it, but that doesn't matter in most regex flavors.
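As a rough sketch of the string-manipulation approach with a small tree type, the parser below keeps a stack holding the path from the root to the most recent node. It assumes every nesting level is exactly four columns wide, as in the output of the tree command; TreeNode and TreeTextParser are made-up names, not framework types.

```csharp
using System.Collections.Generic;

// Hypothetical minimal tree type.
public class TreeNode
{
    public string Name;
    public List<TreeNode> Children = new List<TreeNode>();
}

public static class TreeTextParser
{
    // Assumption: the +--- or \--- marker of a node at depth d starts at column d * 4.
    public static List<TreeNode> Parse(IEnumerable<string> lines)
    {
        var roots = new List<TreeNode>();
        var path = new Stack<TreeNode>(); // root ... most recently added node

        foreach (string raw in lines)
        {
            string line = raw.TrimEnd('\r');
            int marker = line.IndexOfAny(new[] { '+', '\\' });
            if (marker < 0 || line.Length < marker + 4)
                continue; // not a node line

            int depth = marker / 4;
            var node = new TreeNode { Name = line.Substring(marker + 4) };

            // Drop everything that is not an ancestor of the new node.
            while (path.Count > depth)
                path.Pop();

            if (path.Count == 0)
                roots.Add(node);
            else
                path.Peek().Children.Add(node);

            path.Push(node);
        }
        return roots;
    }
}
```

Because the stack always holds the current ancestors, siblings such as step_2.1 and step_2.2 attach to step-1 again once the deeper step3/step4 branch is finished.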

C# Regex for Movie Filename

I have been trying to use a C# Regex unsuccessfully to remove certain strings from a movie name.
Examples of the file names I'm working with are:
EuroTrip (2004) [SD]
Event Horizon (1997) [720]
Fast & Furious (2009) [1080p]
Star Trek (2009) [Unknown]
I'd like to remove anything in square brackets or parenthesis (including the brackets themselves)
So far I'm using:
movieTitleToFetch = Regex.Replace(movieTitleToFetch, "([*\\(\\d{4}\\)])", "");
Which seems to remove the Year and Parenthesis ok, but I just can't figure out how to remove the Square Brackets and content without affecting other parts... I've had miscellaneous results but the closest one has been:
movieTitleToFetch = Regex.Replace(movieTitleToFetch, "([?\\[+A-Z+\\]])", "");
Which left me with:
urorip (2004)
Instead of:
EuroTrip (2004) [SD]
Any whitespace that is left at the ends are ok as I will just perform
movieTitleToFetch = movieTitleToFetch.Trim();
at the end.
Thanks in advance,
Alex
This regex pattern should work ok... maybe needs a bit of tweaking
"[\[\(].+?[\]\)]"
Regex.Replace(movieTitleToFetch, @"[\[\(].+?[\]\)]", "");
This should match anything from either "[" or "(" until the next occurrence of "]" or ")"
If that does not work try removing the escape character for the parentheses, like so...
Regex.Replace(movieTitleToFetch, @"[\[(].+?[\])]", "");
@Craigt is pretty much spot on, but it's possibly cleaner to ensure that the brackets are matched.
([\[].*?[\]]|[\(].*?[\)])
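Applied to the sample file names, the matched-bracket idea could look like the sketch below. The pattern is a slightly simplified spelling of the one above (same alternation, without the single-character classes), and TitleBrackets and Strip are made-up names.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for the movie-title case.
public static class TitleBrackets
{
    // Removes [...] and (...) groups only when the opening and closing
    // bracket are of the same kind, then trims the leftover whitespace.
    public static string Strip(string title)
    {
        return Regex.Replace(title, @"\[.*?\]|\(.*?\)", "").Trim();
    }
}
```

Strip("EuroTrip (2004) [SD]") gives EuroTrip, and a mismatched pair such as "(2004]" is left alone.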
I know I'm late to this thread, but I wrote a simple algorithm to sanitize the file names of downloaded movies.
It runs these steps:
Removes everything in brackets (if it finds a year there, it tries to keep that info)
Removes a list of commonly used words (720p, bdrip, h264 and so on...)
Assumes there can be language info in the title and removes it when it sits at the end of the remaining string (before the special words)
If a year was not found in parentheses, it looks for one at the end of the remaining string (as for languages)
Along the way it replaces dots and underscores with spaces, so the title is ready to be used, for example, as a query for a search API.
Here's the test in xUnit (I used mostly Italian titles to test it)
using Grappachu.Movideo.Core.Helpers.TitleCleaner;
using SharpTestsEx;
using Xunit;
namespace Grappachu.MoVideo.Test
{
public class TitleCleanerTest
{
[Theory]
[InlineData("Avengers.Confidential.La.Vedova.Nera.E.Punisher.2014.iTALiAN.Bluray.720p.x264 - BG.mkv",
"Avengers Confidential La Vedova Nera E Punisher", 2014)]
[InlineData("Fuck You, Prof! (2013) BDRip 720p HEVC ITA GER AC3 Multi Sub PirateMKV.mkv",
"Fuck You, Prof!", 2013)]
[InlineData("Il Libro della Giungla(2016)(BDrip1080p_H264_AC3 5.1 Ita Eng_Sub Ita Eng)by siste82.avi",
"Il Libro della Giungla", 2016)]
[InlineData("Il primo dei bugiardi (2009) [Mux by Little-Boy]", "Il primo dei bugiardi", 2009)]
[InlineData("Il.Viaggio.Di.Arlo-The.Good.Dinosaur.2015.DTS.ITA.ENG.1080p.BluRay.x264-BLUWORLD",
"il viaggio di arlo", 2015)]
[InlineData("La Mafia Uccide Solo D'estate 2013 .avi",
"La Mafia Uccide Solo D'estate", 2013)]
[InlineData("Ip.Man.3.2015.iTA.AC3.5.1.448.Chi.Aac.BluRay.m1080p.x264.Sub.[scambiofile.info].mkv",
"Ip Man 3", 2015)]
[InlineData("Inferno.2016.BluRay.1080p.AC3.ITA.AC3.ENG.Subs.x264-WGZ.mkv",
"Inferno", 2016)]
[InlineData("Ghostbusters.2016.iTALiAN.BDRiP.EXTENDED.XviD-HDi.mp4",
"Ghostbusters", 2016)]
[InlineData("Transcendence.mkv", "Transcendence", null)]
[InlineData("Being Human (Forsyth, 1994).mkv", "Being Human", 1994)]
public void Clean_should_return_title_and_year_when_possible(string filename, string title, int? year)
{
var res = MovieTitleCleaner.Clean(filename);
res.Title.ToLowerInvariant().Should().Be.EqualTo(title.ToLowerInvariant());
res.Year.Should().Be.EqualTo(year);
}
}
}
and the first version of the code
using System;
using System.Globalization;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
namespace Grappachu.Movideo.Core.Helpers.TitleCleaner
{
public class MovieTitleCleanerResult
{
public string Title { get; set; }
public int? Year { get; set; }
public string SubTitle { get; set; }
}
public class MovieTitleCleaner
{
private const string SpecialMarker = "§=§";
private static readonly string[] ReservedWords;
private static readonly string[] SpaceChars;
private static readonly string[] Languages;
static MovieTitleCleaner()
{
ReservedWords = new[]
{
SpecialMarker, "hevc", "bdrip", "Bluray", "x264", "h264", "AC3", "DTS", "480p", "720p", "1080p"
};
var cultures = CultureInfo.GetCultures(CultureTypes.AllCultures);
var l = cultures.Select(x => x.EnglishName).ToList();
l.AddRange(cultures.Select(x => x.ThreeLetterISOLanguageName));
Languages = l.Distinct().ToArray();
SpaceChars = new[] {".", "_", " "};
}
public static MovieTitleCleanerResult Clean(string filename)
{
var temp = Path.GetFileNameWithoutExtension(filename);
int? maybeYear = null;
// Remove what's inside brackets trying to keep year info.
temp = RemoveBrackets(temp, '{', '}', ref maybeYear);
temp = RemoveBrackets(temp, '[', ']', ref maybeYear);
temp = RemoveBrackets(temp, '(', ')', ref maybeYear);
// Removes special markers (codec, formats, ecc...)
var tokens = temp.Split(SpaceChars, StringSplitOptions.RemoveEmptyEntries);
var title = string.Empty;
for (var i = 0; i < tokens.Length; i++)
{
var tok = tokens[i];
if (ReservedWords.Any(x => string.Equals(x, tok, StringComparison.OrdinalIgnoreCase)))
{
if (title.Length > 0)
break;
}
else
{
title = string.Join(" ", title, tok).Trim();
}
}
temp = title;
// Remove trailing language tokens found just before the special markers (should not remove "English" if it's inside the title)
tokens = temp.Split(SpaceChars, StringSplitOptions.RemoveEmptyEntries);
for (var i = tokens.Length - 1; i >= 0; i--)
{
var tok = tokens[i];
if (Languages.Any(x => string.Equals(x, tok, StringComparison.OrdinalIgnoreCase)))
tokens[i] = string.Empty;
else
break;
}
title = string.Join(" ", tokens).Trim();
// If year is not found inside parenthesis try to catch at the end, just after the title
if (!maybeYear.HasValue)
{
var resplit = title.Split(SpaceChars, StringSplitOptions.RemoveEmptyEntries);
var last = resplit.Last();
if (LooksLikeYear(last))
{
maybeYear = int.Parse(last);
title = title.Replace(last, string.Empty).Trim();
}
}
// TODO: review this. When the title contains exactly one dash, treat it as the separator between main title and subtitle
var res = new MovieTitleCleanerResult();
res.Year = maybeYear;
if (title.Count(x => x == '-') == 1)
{
var sp = title.Split('-');
res.Title = sp[0];
res.SubTitle = sp[1];
}
else
{
res.Title = title;
}
return res;
}
private static string RemoveBrackets(string inputString, char openChar, char closeChar, ref int? maybeYear)
{
var str = inputString;
while (str.IndexOf(openChar) > 0 && str.IndexOf(closeChar) > 0)
{
var dataGraph = str.GetBetween(openChar.ToString(), closeChar.ToString());
if (LooksLikeYear(dataGraph))
{
maybeYear = int.Parse(dataGraph);
}
else
{
var parts = dataGraph.Split(SpaceChars, StringSplitOptions.RemoveEmptyEntries);
foreach (var part in parts)
if (LooksLikeYear(part))
{
maybeYear = int.Parse(part);
break;
}
}
str = str.ReplaceBetween(openChar, closeChar, string.Format(" {0} ", SpecialMarker));
}
return str;
}
private static bool LooksLikeYear(string dataRound)
{
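// Anchor both ends so that only a bare four-digit year matches and int.Parse cannot fail on trailing text.
return Regex.IsMatch(dataRound, "^(19|20)[0-9][0-9]$");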
}
}
public static class StringUtils
{
public static string GetBetween(this string src, string a, string b,
StringComparison comparison = StringComparison.Ordinal)
{
var idxStr = src.IndexOf(a, comparison);
var idxEnd = src.IndexOf(b, comparison);
if (idxStr >= 0 && idxEnd > 0)
{
if (idxStr > idxEnd)
Swap(ref idxStr, ref idxEnd);
return src.Substring(idxStr + a.Length, idxEnd - idxStr - a.Length);
}
return src;
}
private static void Swap<T>(ref T idxStr, ref T idxEnd)
{
var temp = idxEnd;
idxEnd = idxStr;
idxStr = temp;
}
public static string ReplaceBetween(this string s, char begin, char end, string replacement = null)
{
var regex = new Regex(string.Format("\\{0}.*?\\{1}", begin, end));
return regex.Replace(s, replacement ?? string.Empty);
}
}
}
This does the trick:
@"(\[[^\]]*\])|(\([^\)]*\))"
It removes anything from "[" to the next "]" and anything from "(" to the next ")".
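A minimal sketch of how that pattern might be applied with Regex.Replace. The names TitleStripper and StripBracketed are made up for illustration, and the whitespace cleanup after the replace is an assumption about what you want left over, not part of the pattern itself.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TitleStripper
{
    // Same pattern as above, written as a verbatim string.
    private static readonly Regex Bracketed =
        new Regex(@"(\[[^\]]*\])|(\([^\)]*\))");

    // Hypothetical helper: drops every [...] and (...) group, then tidies whitespace.
    public static string StripBracketed(string input)
    {
        string stripped = Bracketed.Replace(input, string.Empty);
        // Collapse the runs of spaces left behind by the removed groups.
        return Regex.Replace(stripped, @"\s+", " ").Trim();
    }
}
```

For example, TitleStripper.StripBracketed("Star Trek (2009) [Unknown]") should give back "Star Trek". Note that a year inside parentheses is thrown away along with everything else, so this only fits if you do not need the year afterwards.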
Can you just use:
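string MovieTitle = "Star Trek (2009) [Unknown]";
// Assumes both '(' and '[' are present: IndexOf returns -1 for a missing character and Substring would then throw.
movieTitleToFetch = MovieTitle.IndexOf('(') > MovieTitle.IndexOf('[') ?
MovieTitle.Substring(0, MovieTitle.IndexOf('[')) :
MovieTitle.Substring(0, MovieTitle.IndexOf('('));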
Can't we use this instead:
if(movieTitleToFetch.Contains("("))
movieTitleToFetch=movieTitleToFetch.Substring(0,movieTitleToFetch.IndexOf("("));
The code above returns the movie title (with a trailing space, which Trim() removes) for these strings:
EuroTrip (2004) [SD]
Event Horizon (1997) [720]
Fast & Furious (2009) [1080p]
Star Trek (2009) [Unknown]
If there can be cases with no year and only the type, i.e.:
EuroTrip [SD]
Event Horizon [720]
Fast & Furious [1080p]
Star Trek [Unknown]
then use this:
if(movieTitleToFetch.Contains("("))
movieTitleToFetch=movieTitleToFetch.Substring(0,movieTitleToFetch.IndexOf("("));
else if(movieTitleToFetch.Contains("["))
movieTitleToFetch=movieTitleToFetch.Substring(0,movieTitleToFetch.IndexOf("["));
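The two branches above can also be folded into one lookup. This is only a sketch: TitleCutter and CutAtFirstBracket are hypothetical names, and the Trim() is an added assumption so the space before the bracket does not end up in the result.

```csharp
using System;

public static class TitleCutter
{
    // Hypothetical helper: keeps everything before the first '(' or '['.
    public static string CutAtFirstBracket(string movieTitle)
    {
        // IndexOfAny returns -1 when neither character is present.
        int cut = movieTitle.IndexOfAny(new[] { '(', '[' });
        return cut < 0 ? movieTitle.Trim() : movieTitle.Substring(0, cut).Trim();
    }
}
```

Because IndexOfAny reports whichever bracket comes first, the order of "(year)" and "[type]" in the file name does not matter, and a title with no brackets at all is returned unchanged.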
I came up with .+\s(?<year>\(\d{4}\))\s(?<format>\[\w+\]) which matches any of your examples, and contains the year and format as named capture groups to help you replace them.
This pattern translates as:
Any character, one or more repetitions
Whitespace
Literal '(' followed by 4 digits followed by literal ')' (year)
Whitespace
Literal '[' followed by alphanumeric, one or more repetitions, followed by literal ']' (format)
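Here is a rough sketch of reading those groups back in C#. One thing in it is an assumption on top of the pattern above: the leading ".+" is wrapped in a named group called title so the title itself can be retrieved. TitleParser and Parse are illustrative names only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TitleParser
{
    // The pattern from above, plus an assumed (?<title>...) group around ".+".
    private static readonly Regex Pattern =
        new Regex(@"(?<title>.+)\s(?<year>\(\d{4}\))\s(?<format>\[\w+\])");

    // Returns { title, year, format }, or null when the input is not
    // shaped like "Title (yyyy) [format]".
    public static string[] Parse(string input)
    {
        Match m = Pattern.Match(input);
        if (!m.Success) return null;
        return new[]
        {
            m.Groups["title"].Value,
            m.Groups["year"].Value,   // includes the parentheses, e.g. "(2004)"
            m.Groups["format"].Value  // includes the brackets, e.g. "[SD]"
        };
    }
}
```

The year and format groups capture the surrounding brackets as written; move the group inside the literal brackets if you only want "2004" and "SD".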

get div element contents in C#

I have a moderately well-formatted HTML document. It is not XHTML, so it's not valid XML. Given the offset of an opening tag, I need to obtain the contents of that tag, considering that it can have multiple nested tags inside it.
What is the easiest way to solve this problem with a minimum amount of C# code that doesn't involve using non-standard libraries?
You can strip the tags from your HTML content using the following function:
public static string StripHTMLTag(string strHTML)
{
return Regex.Replace(strHTML, "<(.|\n)*?>", "");
}
Pass in the content of the outer tag; this strips all HTML tags and leaves only the text content.
Hope this helps
Imran
I ended up writing the following function. It seems to get the job done for my purposes.
I know that it's kind of dirty, but so is the HTML code of most web-pages.
If anyone can point out principal flaws, please do so:
private static readonly Regex rxDivTag = new Regex(
@"<(?<close>/)?div(\s[^>]*?)?(?<selfClose>/)?>",
RegexOptions.Compiled | RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase | RegexOptions.Singleline);
private const string RXCAP_DIVTAG_CLOSE = "close";
private const string RXCAP_DIVTAG_SELFCLOSE = "selfClose";
private static string GetProductDivs(string pageText, int start)
{
bool success = true;
int depth = 1; // assumed: the opening tag at 'start' counts as the first open div
int curr = start + 1;
for (Match matchNextTag = rxDivTag.Match(pageText, curr) ; depth > 0 ; matchNextTag = rxDivTag.Match(pageText, curr))
{
if (!matchNextTag.Success)
{
success = false;
break;
}
if (matchNextTag.Groups[RXCAP_DIVTAG_CLOSE].Success)
{
if (matchNextTag.Groups[RXCAP_DIVTAG_SELFCLOSE].Success)
{
success = false;
break;
}
--depth;
}
else if (!matchNextTag.Groups[RXCAP_DIVTAG_SELFCLOSE].Success)
{
++depth;
}
curr = matchNextTag.Index + matchNextTag.Length;
}
if (success)
{
return pageText.Substring(start, curr - start);
}
else
{
return null;
}
}
