Regex: Find pagenumber from partial matching urls - c#

As we all know, regex patterns will make your stomach turn the first time you see them (or the 10th time, since you never dove in head first and truly learned them. Guilty.). I'm currently reading up on them, but since I'm on a tight deadline I'll check here in case I can get a quicker and better answer/explanation in the meantime.
I have some url to a forum thread, and I want to scan through the html and find the last page for the thread.
So say I have one of the following urls identifying the thread in question:
https://www.somesite.com/forum/thread-93912 (absolute url to the thread)
/forum/thread-93912 (relative url to the thread)
and I want to get all values (integers) that appear directly (as the next path segment) after any of the above "partial" matches in the html-document.
So from any of the following hrefs located anywhere in the html-document (the doc is represented as a single string):
https://www.somesite.com/forum/thread-93912/34
https://www.somesite.com/forum/thread-93912/34/morestuffhere/whatevs
/forum/thread-93912/34
/forum/thread-93912/34/somethingheretoo
I want to extract the number 34 (only 34), so I can parse it to int.
EDIT
Okay, to make it simpler:
Say I have all the html in htmlString, and in this string I want to find all numbers x that appear after my inputString /forum/thread-93912.
These all appear in the htmlString, and I want to extract the numbers:
thread-93912/34
thread-93912/14
thread-93912/84
thread-93912/64
thread-93912/4

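You don't need regex. Just use System.Uri.Segments (note that Uri needs an absolute url):
Uri url = new Uri("https://www.somesite.com/forum/thread-93912/34/morestuffhere/whatevs");
// Segments are "/", "forum/", "thread-93912/", "34/", ...; index 3 is the page segment.
Console.WriteLine(url.Segments[3].TrimEnd('/'));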

\b(\d+)\b(?=[^\d]*$)
Try this. See the demo and grab the capture.
http://regex101.com/r/sU3fA2/55
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // Match the last run of digits in the input.
        Regex regex = new Regex(@"\b\d+\b(?=[^\d]*$)");
        Match match = regex.Match("/forum/thread-93912/34");
        if (match.Success)
        {
            Console.WriteLine(match.Value);
        }
    }
}

Since my question was a little hard to explain thoroughly (and since I "changed" my problem a little), I thought I'd add my own answer with the exact code I went with (which I came up with thanks to the other answers here, so I'll give you all an upvote!).
I'm sure this can be made prettier and more compact, but I went for clarity since I'm new to regex!
First, get all strings matching the url + some number (separated with a slash "/"), then extract that number to a group called "page".
Regex regex = new Regex(urlToThread + @"/(?<page>\d+)");
MatchCollection matches = regex.Matches(htmlString);
Then iterate over all matches, extract the "page" value (guaranteed to consist only of digits), and parse it to an integer. Add all parsed integers to a list and sort when done. The last one will be the greatest (the last page).
List<int> pages = new List<int>();
foreach (Match match in matches)
    pages.Add(int.Parse(match.Groups["page"].Value));
pages.Sort();
// And here we get the last page
int nrOfPages = pages[pages.Count - 1];
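For reference, the two steps above can be folded into one helper. This is only a sketch under a few assumptions: the names ThreadPages and LastPage are made up for illustration, Regex.Escape is added so that dots in an absolute url are treated literally, and Max replaces the sort since only the largest value is needed.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class ThreadPages
{
    // Hypothetical helper: returns the highest page number that directly
    // follows urlToThread in html, or 0 when no page link is found.
    public static int LastPage(string html, string urlToThread)
    {
        // Regex.Escape keeps '.' and other metacharacters in the url literal.
        var regex = new Regex(Regex.Escape(urlToThread) + @"/(?<page>\d+)");

        return regex.Matches(html)
                    .Cast<Match>()
                    .Select(m => int.Parse(m.Groups["page"].Value))
                    .DefaultIfEmpty(0)
                    .Max();
    }
}
```

Calling ThreadPages.LastPage(htmlString, "/forum/thread-93912") covers both the relative and the absolute hrefs, because the relative url is a substring of the absolute one.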


regex gets stuck with this call

I'm working on a movie scraper / auto-downloader that iterates over my current movie collection, finds new recommendations, and downloads the new goods.
There is a part where I scrape IMDb for metadata and it seems to get stuck in this one spot and I can't seem to figure out why.... it has run this same code with different imdb pages just fine (this is the 29th iteration of a new page)
I am using c#!
The code:
private string Match(string regex, string html, int i = 1)
{
return new Regex(regex, RegexOptions.Multiline).Match(html).Groups[i].Value.Trim();
}
regex parameter string contents:
<title>.*?\\(.*?(\\d{4}).*?\\).*?</title>
html parameter string contents: too big to paste here, but literally the html string representation of http://www.imdb.com/title/tt4422748/combined
if in chrome, you can view easily with:
view-source:http://www.imdb.com/title/tt4422748/combined
I have paused execution in Visual Studio and stepped forward; it continues to run but just hangs (it doesn't let me step, it just runs). If I hit pause again it returns to the same spot with the same parameter values (and no, I am not calling it in an infinite loop). I'm pretty new to regex, so any help would be appreciated!
Use of .* is like saying I want to match everything, yet nothing. Each use of it causes the parser to backtrack on so many different possibilities it becomes unresponsive and appears to lock up.
Does the person designing the pattern really not know whether there is going to be text in the title? I bet 99% of the time the title has text, so why is .* even used? How about .+ at least?
If you want text between the delimiters, use this
title\>(?<Title>[^<]+)\</title
Then extract the matched text through the named group "Title" instead of group[0]. Group[1] will have the actual match text as well if one loathes named match captures.
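A sketch of how that could look in the scraper, with two additions that are my own assumptions rather than part of the answer: a second small pattern that pulls the four-digit year out of the captured title, and a match timeout (the Regex constructor overload taking a TimeSpan, available since .NET 4.5) so that a pathological page raises an exception instead of hanging. The names TitleScraper and ExtractYear are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TitleScraper
{
    private static readonly TimeSpan Timeout = TimeSpan.FromSeconds(2);

    // Text between <title> and </title>; the negated class cannot run past the next tag.
    private static readonly Regex TitleRegex = new Regex(
        @"<title>(?<Title>[^<]+)</title>", RegexOptions.IgnoreCase, Timeout);

    // A four-digit year somewhere inside one pair of parentheses, e.g. "(2015)".
    private static readonly Regex YearRegex = new Regex(
        @"\([^)]*?(?<Year>\d{4})[^)]*\)", RegexOptions.None, Timeout);

    // Returns the year as text, or null when it is missing or matching timed out.
    public static string ExtractYear(string html)
    {
        try
        {
            Match title = TitleRegex.Match(html);
            if (!title.Success) return null;

            Match year = YearRegex.Match(title.Groups["Title"].Value);
            return year.Success ? year.Groups["Year"].Value : null;
        }
        catch (RegexMatchTimeoutException)
        {
            return null; // give up on this page instead of blocking the scraper
        }
    }
}
```

Splitting the work in two keeps each pattern free of unbounded wildcards, which is the point the answer is making.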
Answer for Regex Haters
Use the HTML agility pack.

Matching a term that contains nested HTML

I have been having trouble finding a solution to this problem.
I am parsing the content of a number of ebooks, finding specific terms and characters, marking the locations and lengths of each term.
A normal case would be something like this (excerpts from A Game of Thrones):
"When he paused to look down, his head swam dizzily and he felt his fingers slipping. Bran cried out and clung for dear life."
If we are searching for the character "Bran", its location is 85 and length is 4. Easy enough.
My issue arises when there is a paragraph like this:
<span height="-0em"><font size="7">D</font></span>aenerys Targaryen wed Khal Drogo
We need to match "Daenerys Targaryen". It is easy enough to strip the HTML and match the string, but in this example the result needs to include the HTML. Thus the expected result here would be location = 0, length = 67.
Another situation, caused by random anchor tags scattered throughout:
Did anyone outside the Vale even suspect where Catelyn <a></a>Stark had taken him?
Again, searching for "Catelyn Stark" needs to include the HTML, so location = 47, length = 20.
I have been able to get around it temporarily by adding those specific cases (searching for "Catelyn <a></a>Stark" specifically), but clearly I should have a more robust solution, which I cannot seem to get my head around. My attempts have been using RegEx but with limited success.
I have found various questions regarding HTML matching/stripping (and whether or not to use RegEx =)), but this case seems to be somewhat unique.
Stripping the tags isn't an option as the content must be preserved.
This is within a stand-alone C# application.
Any ideas, steps in the right direction, or similar examples should your search go better than mine would be greatly appreciated!
One possible approach would be to insert the following between each letter in your search string:
(?:<[^>]*>)*
So when searching for the character "Bran" your regex would become the following:
(?:<[^>]*>)*B(?:<[^>]*>)*r(?:<[^>]*>)*a(?:<[^>]*>)*n
This will allow your regex to match any number of HTML tags anywhere within the search string. Note that this will only work if your search strings are always something simple like a character's name, and not regular expressions (this method will fail if there is repetition like a* in your search string).
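Building that pattern by hand gets tedious, so here is a minimal sketch of generating it from a plain search term. The class and method names (TagTolerantSearch, BuildPattern, Find) are made up for illustration; each character goes through Regex.Escape so that characters such as '.' in a term are matched literally.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TagTolerantSearch
{
    // Zero or more HTML tags, as suggested in the answer above.
    private const string OptionalTags = @"(?:<[^>]*>)*";

    // "Bran" becomes (?:<[^>]*>)*B(?:<[^>]*>)*r(?:<[^>]*>)*a(?:<[^>]*>)*n
    public static string BuildPattern(string term)
    {
        var escapedChars = term.Select(c => Regex.Escape(c.ToString()));
        return OptionalTags + string.Join(OptionalTags, escapedChars);
    }

    // Match.Index and Match.Length give the location and length, tags included.
    public static Match Find(string html, string term)
    {
        return Regex.Match(html, BuildPattern(term));
    }
}
```

Because the pattern also allows tags before the first letter, a match that begins with markup (as in the Daenerys paragraph) starts at the opening tag rather than at the letter itself.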
I would create a function that takes "Daenerys Targaryen" as a parameter and strips the first letter. It would then search only for "aenerys Targaryen" and, if that is found, search for ">D<" (or whatever the first letter is). Does that make sense?
Example:
public static string searchFor(string str)
{
    string result = null; // stays null until one of the steps below finds a match
    // strip first letter of search string (in this case "D")
    // search for the rest of the string ("aenerys Targaryen")
    // if found, search for ">D<"
    // if found, search for HTML tags with "D" inside (using regex)
    // if found, search for HTML tags with the previous HTML tag in them (using regex)
    return result;
}
Well using Javascript or Php you can get the text of elements and the text of documents and search there and then do a regex to return the closest match (containing the html):
Another option would be to index the books first using something like the Lucene search engine, which happens to let you index different formats (HTML being one of them).
You can then use the Lucene api to search your documents a little easier.
In php we have Zend_Search_Lucene which works perfectly for this kind of thing.
Lucene Search can be found at:
http://lucene.apache.org/core/
Have fun!

Returning the regular expression match as part of a split (or equivalent functionality)

I am trying to parse through some log files and put them into a database for analysis. A single line looks something like this:
2012-09-30 17:16:27,213 [39] (boxes) ERROR Assembly.Places [(null)] - Error while displaying a thing
I have made a regular expression that works well for pulling out the date in front and breaking up the lines that way, but I lose the date itself. This is a pretty important bit of data, and I don't want to lose it!
I cannot just do this by \r\n, because some logs are fatal errors that include stack traces for the developers. Those, obviously, use \r\n to make them readable.
My current code looks like this for reference:
var logpath = Directory.GetFiles(@"C:\a\directory", "*.log");
foreach (var log in logpath)
{
    var fileStream = new StreamReader(log);
    var fileString = fileStream.ReadToEnd();
    var records = Regex.Split(fileString, "[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}");
    ...
}
Split() will always remove the matched delimiter. The trick is not to match any actual text, but rather a position in the string.
This is done through zero-width look-ahead:
var datePattern = "^(?=[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3})";
var datePositions = new Regex(datePattern, RegexOptions.Multiline);
// ...
var records = datePositions.Split(fileString);
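Put together as a self-contained sketch (LogSplitter and SplitRecords are hypothetical names): because the look-ahead matches at the very start of the file as well, Split returns an empty first element, which is filtered out here.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LogSplitter
{
    // Zero-width match at the start of every line that begins with a timestamp.
    private static readonly Regex RecordStart = new Regex(
        @"^(?=[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3})",
        RegexOptions.Multiline);

    // Each returned record still starts with its date; stack-trace lines stay attached.
    public static string[] SplitRecords(string logText)
    {
        return RecordStart.Split(logText)
                          .Where(record => record.Length > 0) // drop the empty leading piece
                          .ToArray();
    }
}
```

Since nothing is consumed by the match, every character of the file ends up in exactly one record, including the line breaks.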
You should match instead of splitting.
This is the regex. Use Singleline mode:
([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3})(.*?)((?=[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}|$))
Group 1 contains the date.
Group 2 contains the required data.
NOTE
The regex is conceptually like this:
(yourDate)(.*?yourdata)(?=till the other date|$)
Don't forget to use Singleline mode.
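A minimal sketch of the matching approach, assuming a hypothetical LogMatcher class that returns each record as a date/text pair. One caveat worth keeping in mind: unlike the anchored split, this pattern also ends a record at a timestamp that appears in the middle of a message.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LogMatcher
{
    private const string Date =
        @"[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}";

    // (date)(everything up to the next date or the end of the input)
    private static readonly Regex Record = new Regex(
        "(" + Date + ")(.*?)(?=" + Date + "|$)",
        RegexOptions.Singleline); // lets '.' cross the line breaks in stack traces

    // Key = the date (group 1), Value = the rest of the record (group 2), trimmed.
    public static List<KeyValuePair<string, string>> Parse(string logText)
    {
        var records = new List<KeyValuePair<string, string>>();
        foreach (Match m in Record.Matches(logText))
        {
            records.Add(new KeyValuePair<string, string>(
                m.Groups[1].Value, m.Groups[2].Value.Trim()));
        }
        return records;
    }
}
```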
Well, I'm not an expert on the subject, but I did find this: Regex.Match.
From what I see, you can get the first match of the date format as a Match object, which has all kinds of useful properties (such as Index, Length, and Value) that you can combine to cut out the parts you want.
P.S. There is also Regex.Matches, which returns all matches in the file and might be easier to use.
Sorry, I don't have time to find a complete code example.
Good day.

Using .NET RegEx to retrieve part of a string after the second '-'

This is my first stack message. Hope you can help.
I have several strings I need to break up for use later. Here are a couple of examples of what I mean:
fred-064528-NEEDED
frederic-84728957-NEEDED
sam-028-NEEDED
As you can see above, the string lengths vary greatly, so I believe regex is the only way to achieve what I want. What I need is the rest of the string after the second hyphen ('-').
I am very weak at regex, so any help would be great.
Thanks in advance.
Just to offer an alternative without using regex:
foreach (string s in list)
{
    int x = s.LastIndexOf('-');
    string sub = s.Substring(x + 1);
}
Add validation to taste.
Something like this. It will take anything (except line breaks) after the second '-', including any further '-' signs.
var exp = @"^\w*-\w*-(.*)$";
var match = Regex.Match("frederic-84728957-NEE-DED", exp);
if (match.Success)
{
    var result = match.Groups[1].Value; //Result is NEE-DED
    Console.WriteLine(result);
}
EDIT: I answered another question which relates to this. Except, it asked for a LINQ solution and my answer was the following which I find pretty clear.
Pimp my LINQ: a learning exercise based upon another post
var result = String.Join("-", inputData.Split('-').Skip(2));
or
var result = inputData.Split('-').Skip(2).FirstOrDefault(); //If the last part is NEE-DED then only NEE is returned.
As mentioned in the other SO thread it is not the fastest way of doing this.
If they are part of larger text:
(\w+-){2}(\w+)
If they are presented as whole lines, and you know you don't have other hyphens, you may also use:
[^-]*$
Another option, if you have each line as a string, is to use split (again, depending on whether or not you're expecting extra hyphens, you may omit the count parameter, or use LastIndexOf):
string[] tokens = line.Split("-".ToCharArray(), 3);
string s = tokens.Last();
This should work:
.*?-.*?-(.*)
This should do the trick:
([^\-]+)\-([^\-]+)\-(.*?)$
The regex pattern will be
(?<first>.*)?-(?<second>.*)?-(?<third>.*)?(\s|$)
Then you can read the named group "third" to get the text after the second hyphen.
Alternatively,
you can do a string.Split('-') and take the third item (index 2) from the array.
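To make the difference between these variants concrete, here is a small sketch of the count-limited split, which keeps any further hyphens in the last part. HyphenParts and AfterSecondHyphen are hypothetical names, and returning null for input with fewer than two hyphens is my own choice.

```csharp
using System;

public static class HyphenParts
{
    // Returns everything after the second '-', or null if there are fewer than two.
    public static string AfterSecondHyphen(string input)
    {
        // A count of 3 stops splitting after the second hyphen,
        // so "a-b-NEE-DED" yields "a", "b", "NEE-DED".
        string[] parts = input.Split(new[] { '-' }, 3);
        return parts.Length == 3 ? parts[2] : null;
    }
}
```

LastIndexOf and the [^-]*$ pattern give the same answer only when the wanted part contains no hyphen of its own.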

Extracting a string starting with x and ending with y

First of all, I did a search on this and was able to find how to use something like String.Split() to extract the string based on a condition. I wasn't able to find however, how to extract it based on an ending condition as well. For example, I have a file with links to images: http://i594.photobucket.com/albums/tt27/34/444.jpghttp://i594.photobucket.com/albums/as/asfd/ghjk6.jpg
You will notice that all the images start with http:// and end with .jpg. However, .jpg is followed by http:// without a space, making this a little more difficult.
So basically I'm trying to find a way (Regex?) to extract a string from a string that starts with http:// and ends with .jpg
Regex is the easiest way to do this. If you're not familiar with regular expressions, you might check out Regex Buddy. It's a relatively cheap little tool that I found extremely useful when I was learning. For your particular case, a possible expression is:
(http://.+?\.jpg)
It probably requires some more refinement, as there are boundary cases that could trip this up, but it would work if the file is a simple list.
You can also do free quick testing of expressions here.
Per your latest comment, if you have links to other non-images as well, then you need to make sure it doesn't start at the http:// for one link and read all the way to the .jpg for the next image. Since URLs are not allowed to have whitespace, you can do it like this:
(http://[^\s]+\.jpg)
This basically says, "match a string starting with http:// and ending with .jpg where there is at least one character between the two and none of those characters are whitespace".
Regex RegexObj = new Regex("http://.+?\\.jpg");
Match MatchResults = RegexObj.Match(subject);
while (MatchResults.Success) {
    //Do something with it
    MatchResults = MatchResults.NextMatch();
}
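The two patterns above can be combined, as a sketch: a lazy quantifier stops at the first .jpg, and a non-whitespace class (\S, equivalent to [^\s]) keeps a match from running across separate links. ImageLinks and Extract are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ImageLinks
{
    // Lazy, whitespace-free run from "http://" to the nearest ".jpg".
    private static readonly Regex JpgUrl = new Regex(@"http://\S+?\.jpg");

    public static List<string> Extract(string text)
    {
        var urls = new List<string>();
        foreach (Match m in JpgUrl.Matches(text))
        {
            urls.Add(m.Value);
        }
        return urls;
    }
}
```

On the sample from the question this yields the two image urls separately, even though there is no space between them.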
In your specific case, you could always split it by ".jpg". You will probably end up with one empty element at the end of the array, and have to append the .jpg at the end of each file if you need that. Apart from that I think it would work.
Tested the following code and it worked fine:
public void SplitTest()
{
    string test = "http://i594.photobucket.com/albums/tt27/34/444.jpghttp://i594.photobucket.com/albums/as/asfd/ghjk6.jpg";
    string[] items = test.Split(new string[] { ".jpg" }, StringSplitOptions.RemoveEmptyEntries);
}
It even gets rid of the empty entry...
The following LINQ will separate by http: and make sure to only get values that end with jpg.
var images = from i in imageList.Split(new[] {"http:"},
                                       StringSplitOptions.RemoveEmptyEntries)
             where i.EndsWith(".jpg")
             select "http:" + i;
