Extracting a string starting with x and ending with y - c#

First of all, I did a search on this and was able to find how to use something like String.Split() to extract the string based on a condition. I wasn't able to find however, how to extract it based on an ending condition as well. For example, I have a file with links to images: http://i594.photobucket.com/albums/tt27/34/444.jpghttp://i594.photobucket.com/albums/as/asfd/ghjk6.jpg
You will notice that all the images start with http:// and end with .jpg. However, .jpg is succeeded by http:// without a space, making this a little more difficult.
So basically I'm trying to find a way (Regex?) to extract a string from a string that starts with http:// and ends with .jpg

Regex is the easiest way to do this. If you're not familiar with regular expressions, you might check out Regex Buddy. It's a relatively cheap little tool that I found extremely useful when I was learning. For your particular case, a possible expression is:
(http://.+?\.jpg)
It probably requires some more refinement, as there are boundary cases that could trip this up, but it would work if the file is a simple list.
You can also do free quick testing of expressions here.
Per your latest comment, if you have links to other non-images as well, then you need to make sure it doesn't start at the http:// for one link and read all the way to the .jpg for the next image. Since URLs are not allowed to have whitespace, you can do it like this:
(http://[^\s]+\.jpg)
This basically says, "match a string starting with http:// and ending with .jpg where there is at least one character between the two and none of those characters are whitespace".

Regex RegexObj = new Regex("http://.+?\\.jpg");
Match MatchResults = RegexObj.Match(subject);
while (MatchResults.Success) {
//Do something with it
MatchResults = MatchResults.NextMatch();
}

In your specific case, you could always split if by ".jpg". You will probably end up with one empty element at the end of the array, and have to append the .jpg at the end of each file if you need that. Apart from that I think it would work.
Tested the following code and it worked fine:
public void SplitTest()
{
string test = "http://i594.photobucket.com/albums/tt27/34/444.jpghttp://i594.photobucket.com/albums/as/asfd/ghjk6.jpg";
string[] items = test.Split(new string[] { ".jpg" }, StringSplitOptions.RemoveEmptyEntries);
}
It even get rid of the empty entry...

The following LINQ will separate by http: and make sure to only get values that end with jpg.
var images = from i in imageList.Split(new[] {"http:"},
StringSplitOptions.RemoveEmptyEntries)
where i.EndsWith(".jpg")
select "http:" + i;

Related

Regex: Find pagenumber from partial matching urls

As we all know, Regex patterns will make your stomache turn the first time you see them (or 10th time since you never went head first and truly learned it. Quilty.). I'm currently reading upon it, but since I'm on a tight deadline I'll check here if I can get a quicker and better answer/explaination meanwhile.
I have some url to a forum thread, and I want to scan through the html and find the last page for the thread.
So say I have one of the following urls identifying the thread in question:
https://www.somesite.com/forum/thread-93912* (absolute url to the
thread)
/forum/thread-93912 (relative url to the thread)
and I want to get all values (integers) that appear directly (next path) after any of the above "partial" match in the html-document.
So from any of the following hrefs located anywhere in the html-document (the doc is represented as a single string):
https://www.somesite.com/forum/thread-93912/34
https://www.somesite.com/forum/thread-93912/34/morestuffhere/whatevs
/forum/thread-93912/34
/forum/thread-93912/34/somethingheretoo
I want to extract the number 34 (only 34), so I can parse it to int.
EDIT
Okay, to make it simpler:
Say I have all the html in htmlString, and in this string I want to find all numbers x that appear after my inputString /forum/thread-93912.
These all appear in the htmlString, and I want to extract the numbers:
thread-93912/34
thread-93912/14
thread-93912/84
thread-93912/64
thread-93912/4
You don't need regex. Just use System.Uri.Segments
Uri url = new Uri("your url here");
Console.WriteLine(url.Segments[4]);
\b(\d+)\b(?=[^\d]*$)
Try this.See demo.grab the capture.
http://regex101.com/r/sU3fA2/55
using System;
using System.Text.RegularExpressions;
class Program
{
static void Main()
{
Regex regex = new Regex(#"\b\d+\b(?=[^\d]*$)");
Match match = regex.Match("/forum/thread-93912/34");
if (match.Success)
{
Console.WriteLine(match.Value);
}
}
}
Since my question was a little hard to explain thuroughly (and since I "changed" my problem a little), I thought I'd add my own answer to get the exact code I went with (which I came up with thanks to the other answers here, so I'll give you all an upvote!).
I'm sure this can be made prettier and more compact, but I went for clearity since I'm new to regex!
First, get all strings matching the url + some number (separated with a slash "/"), then extract that number to a group called "page".
Regex regex = new Regex(urlToThread + #"/(?<page>\d+)");
MatchCollection matches = regex.Matches(htmlString);
Then iterate all matches and extract the "page"-value (garanteed to be an integer), and parse it to an integer. Add all parsed integers to a list and sort when done. The last one will be the greatest (last page).
List<int> pages = new List<int>();
foreach(Match match in matches)
pages.Add(int.Parse(match.Groups["page"].Value));
pages.Sort();
// And here we get the last page
int nrOfPages = pages[pages.Count-1];

Matching a term that contains nested HTML

I have been having trouble finding a solution to this problem.
I am parsing the content of a number of ebooks, finding specific terms and characters, marking the locations and lengths of each term.
A normal case would be something like this (excerpts from A Game of Thrones):
"When he paused to look down, his head swam dizzily and he felt his fingers slipping. Bran cried out and clung for dear life."
If we are searching for the character "Bran", its location is 85 and length is 4. Easy enough.
My issue arises when there is a paragraph like this:
<span height="-0em"><font size="7">D</font></span>aenerys Targaryen wed Khal Drogo
We need to match "Daenerys Targaryn". It is easy enough to strip the HTML and match the string, but in this example the result needs to include the HTML. Thus the expected result would here be would be location = 0, length = 67.
Another situation, caused by random anchor tags scattered throughout:
Did anyone outside the Vale even suspect where Catelyn <a></a>Stark had taken him?
Again, searching for "Catelyn Stark" needs to include the HTML, so location = 47, length = 20.
I have been able to get around it temporarily by adding those specific cases (searching for "Catelyn <a></a>Stark specifically), but clearly I should have a more robust solution, which I cannot seem to get my head around. My attempts have been using RegEx but with limited success.
I have found various questions regarding HTML matching/stripping (and whether or not to use RegEx =)), but this case seems to be somewhat unique.
Stripping the tags isn't an option as the content must be preserved.
This is within a stand-alone C# application.
Any ideas, steps in the right direction, or similar examples should your search go better than mine would be greatly appreciated!
One possible approach would be to insert the following between each letter in your search string:
(?:<[^>]*>)*
So when searching for the character "Bran" your regex would become the following:
(?:<[^>]*>)*B(?:<[^>]*>)*r(?:<[^>]*>)*a(?:<[^>]*>)*n
This will allow your regex to match any number of HTML tags anywhere within the search string. Note that this will only work if your search strings are always something simple like a character's name, and not regular expressions (this method will fail if there is repetition like a* in your search string).
I would create a function that would take "Daenerys Targaryn" as a parameter and then strip the first letter. Then, it would only search for "aenerys Targaryn," and if found, it would search for ">D<" or the first variable letter. Does than make sense?
Example:
public static string searchFor(string str)
{
// strip first letter of search string (in this case "D")
// search for the rest of the string ("aenerys Targaryn")
// if found, search for ">D<"
// if found, search for HTML tags with "D" inside (using regex)
// if found, search for HTML tags with the previous HTML tag in them (using regex)
return result;
}
Well using Javascript or Php you can get the text of elements and the text of documents and search there and then do a regex to return the closest match (containing the html):
Another option:
would be to index the books first using something like Lucene Search Engine (which happens to let you index in different formats (html format being one of them).
You can then use the Lucene api to search your documents a little easier.
In php we have Zend_Search_Lucene which works perfectly for this kind of thing.
Lucene Search can be found at:
http://lucene.apache.org/core/
Have fun!

Returning the regular expression match as part of a split (or equivalent functionality)

I am trying to parse through some log files and put them into a database for analysis. A single line looks something like this:
2012-09-30 17:16:27,213 [39] (boxes) ERROR Assembly.Places [(null)] - Error while displaying a thing
I have made a regular expression that works well for pulling out the date in front and breaking up the lines that way, but I lose the date itself. This is a pretty important bit of data, and I don't want to lose it!
I cannot just do this by \r\n, because some logs are fatal errors that include stack traces for the developers. Those, obviously, use \r\n to make them readable.
My current code looks like this for reference:
var logpath = Directory.GetFiles(#"C:\a\directory", "*.log");
foreach (var log in logpath)
{
var fileStream = new StreamReader(log);
var fileString = fileStream.ReadToEnd();
var records = Regex.Split(fileString, "[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}");
...
}
Split() will always remove the matched delimiter. The trick is not to match any actual text, but rather a position in the string.
This is done through zero-width look-ahead:
var datePattern = "^(?=[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3})";
var datePositions = new Regex(datePattern, RegexOptions.Multiline);
// ...
Regex.Split(fileString, datePositions);
You should match instead of splitting
This is the regex.Use singleLine Mode
([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3})(.*?)((?=[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}|$))
Group 1 contains date
Group 2 contains the required date
NOTE
The regex is conceptually like this.
(yourDate)(.*?yourdata)(?=till the other date|$)
Dont forget to use singlelineMode
Well, I'm not an expert on the subject but I did found this: Regex.Match.
From what I see you can receive the first match of the date format with a Match object
which has all kind of nice properties that put together you can probably cut the parts you want.
p.s. also exists a Regex.Matches which will return all matches in the file, might be easier for use.
Sorry I don't have time for to find a complete code example.
good day

Conditional Regex Replace in C# without MatchEvaluator

So, Im trying to make a program to rename some files. For the most part, I want them to look like this,
[Testing]StupidName - 2[720p].mkv
But, I would like to be able to change the format, if so desired. If I use MatchEvaluators, you would have to recompile every time. Thats why I don't want to use the MatchEvaluator.
The problem I have is that I don't know how, or if its possible, to tell Replace that if a group was found, include this string. The only syntax for this I have ever seen was something like (?<group>:data), but I can't get this to work. Well if anyone has an idea, im all for it.
EDIT:
Current Capture Regexes =
^(\[(?<FanSub>[^\]\)\}]+)\])?[. _]*(?<SeriesTitle>[\w. ]*?)[. _]*\-[. _]*(?<EpisodeNumber>\d+)[. _]*(\-[. _]*(?<EpisodeName>[\w. ]*?)[. _]*)?([\[\(\{](?<MiscInfo>[^\]\)\}]*)[\]\)\}][. _]*)*[\w. ]*(?<Extension>\.[a-zA-Z]+)$
^(?<SeriesTitle>[\w. ]*?)[. _]*[Ss](?<SeasonNumber>\d+)[Ee](?<EpisodeNumber>\d+).*?(?<Extension>\.[a-zA-Z]+)$
^(?<SeriesTitle>[\w. ]*?)[. _]*(?<SeasonNumber>\d)(?<EpisodeNumber>\d{2}).*?(?<Extension>\.[a-zA-Z]+)$
Current Replace Regex = [${FanSub}]${SeriesTitle} - ${EpisodeNumber} [${MiscInfo}]${Extension}
Using Regex.Replace, the file TestFile 101.mkv, I get []TestFile - 1[].mkv. What I want to do is make it so that [] is only included if the group FanSub or MiscInfo was found.
I can solve this with a MatchEvaluator because I actually get to compile a function. But this would not be a easy solution for users of the program. The only other idea I have to solve this is to actually make my own Regex.Replace function that accepts special syntax.
It sounds like you want to be able to specify an arbitrary format dynamically rather than hard-code it into your code.
Perhaps one solution is to break your filename parts into specific groups then pass in a replacement pattern that takes advantage of those group names. This would give you the ability to pass in different replacement patterns which return the desired filename structure using the Regex.Replace method.
Since you didn't explain the categories of your filename I came up with some random groups to demonstrate. Here's a quick example:
string input = "Testing StupidName Number2 720p.mkv";
string pattern = #"^(?<Category>\w+)\s+(?<Name>.+?)\s+Number(?<Number>\d+)\s+(?<Resolution>\d+p)(?<Extension>\.mkv)$";
string[] replacePatterns =
{
"[${Category}]${Name} - ${Number}[${Resolution}]${Extension}",
"${Category} - ${Name} - ${Number} - ${Resolution}${Extension}",
"(${Number}) - [${Resolution}] ${Name} [${Category}]${Extension}"
};
foreach (string replacePattern in replacePatterns)
{
Console.WriteLine(Regex.Replace(input, pattern, replacePattern));
}
As shown in the sample, named groups in the pattern, specified as (?<Name>pattern), are referred to in the replacement pattern by ${Name}.
With this approach you would need to know the group names beforehand and pass these in to rearrange the pattern as needed.

Extract substring from string with Regex

Imagine that users are inserting strings in several computers.
On one computer, the pattern in the configuration will extract some characters of that string, lets say position 4 to 5.
On another computer, the extract pattern will return other characters, for instance, last 3 positions of the string.
These configurations (the Regex patterns) are different for each computer, and should be available for change by the administrator, without having to change the source code.
Some examples:
Original_String Return_Value
User1 - abcd78defg123 78
User2 - abcd78defg123 78g1
User3 - mm127788abcd 12
User4 - 123456pp12asd ppsd
Can it be done with Regex?
Thanks.
Why do you want to use regex for this? What is wrong with:
string foo = s.Substring(4,2);
string bar = s.Substring(s.Length-3,3);
(you can wrap those up to do a bit of bounds-checking on the length easily enough)
If you really want, you could wrap it up in a Func<string,string> to put somewhere - not sure I'd bother, though:
Func<string, string> get4and5 = s => s.Substring(4, 2);
Func<string,string> getLast3 = s => s.Substring(s.Length - 3, 3);
string value = "abcd78defg123";
string foo = getLast3(value);
string bar = get4and5(value);
If you really want to use regex:
^...(..)
And:
.*(...)$
To have a regex capture values for further use you typically use (), depending on the regex compiler it might be () or for microsoft MSVC I think it's []
Example
User4 - 123456pp12asd ppsd
is most interesting in that you have here 2 seperate capture areas. Is there some default rule on how to join them together, or would you then want to be able to specify how to make the result?
Perhaps something like
r/......(..)...(..)/\1\2/ for ppsd
r/......(..)...(..)/\2-\1/ for sd-pp
do you want to run a regex to get the captures and handle them yourself, or do you want to run more advanced manipulation commands?
I'm not sure what you are hoping to get by using RegEx. RegEx is used for pattern matching. If you want to extract based on position, just use substring.
It seems to me that Regex really isn't the solution here. To return a section of a string beginning at position pos (starting at 0) and of length length, you simply call the Substring function as such:
string section = str.Substring(pos, length)
Grouping. You could match on /^.{3}(.{2})/ and then look at group $1 for example.
The question is why? Normal string handling i.e. actual substring methods are going to be faster and clearer in intent.

Categories