Regex: absolute url to relative url (C#) - c#

I need a regex to run against strings like the one below that will convert absolute paths to relative paths under certain conditions.
<p>This website is <strong>really great</strong> and people love it <img alt="" src="http://localhost:1379/Content/js/fckeditor/editor/images/smiley/msn/teeth_smile.gif" /></p>
Rules:
If the url contains "/Content/", I would like to get the relative path.
If the url does not contain "/Content/", it is an external file, and the absolute path should remain.
Regex unfortunately is not my forte, and this is too advanced for me at this point. If anyone can offer some tips, I'd appreciate it.
Thanks in advance.
UPDATE:
To answer questions in the comments:
At the time the regex is applied, all URLs will begin with "http://"
This should be applied to the src attribute of both img and a tags, not to text outside of tags.

You should consider using the Uri.MakeRelativeUri method - your current algorithm depends on external files never containing "/Content/" in their path, which seems risky to me. MakeRelativeUri will determine whether a relative path can be made from the current Uri to the src or href regardless of changes you or the external file store make down the road.
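A minimal C# sketch of the MakeRelativeUri idea follows. The class and method names are hypothetical, and it assumes the site root (for example http://localhost:1379/) is known at the point of conversion. Note that it makes every same-site URL relative, not only those under "/Content/", which is a different rule from the one in the question.

```csharp
using System;

public static class RelativeUrlSketch
{
    // Hypothetical helper: returns a root-relative path when the URL is on the
    // same scheme/host/port as siteRoot, otherwise returns the URL unchanged.
    public static string ToRelativeIfSameSite(Uri siteRoot, string url)
    {
        var target = new Uri(url, UriKind.Absolute);

        // For a different scheme, host, or port, MakeRelativeUri hands back
        // the (absolute) target itself instead of a relative Uri.
        Uri result = siteRoot.MakeRelativeUri(target);

        return result.IsAbsoluteUri ? url : "/" + result.OriginalString;
    }
}
```

Usage would look like `RelativeUrlSketch.ToRelativeIfSameSite(new Uri("http://localhost:1379/"), src)` for each src or href value pulled out of the HTML.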

Unless I'm missing the point here, if you replace
^(.*)([Cc]ontent.*)
with
/$2
you will end up with
/Content/js/fckeditor/editor/images/smiley/msn/teeth_smile.gif
This will only happen if "content" is found, so in case you have a URL such as:
http://localhost:1379/js/fckeditor/editor/images/smiley/msn/teeth_smile.gif
nothing will be replaced.
Hope it helps, and that I didn't miss anything.
UPDATE
Obviously, this assumes you are using an HTML parser to find the URL inside the a href (which you should, in case you're not :-))
Cheers

This is for Perl; I do not know C#:
s#(<(img|a)\s(?:[^>]*?\s)?(src|href)=)(["'])http://[^'"]*?(/Content/[^'"]*?)\4#$1$4$5$4#g
If C# has Perl-like regex, it will be easy to port.
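.NET regular expressions are close enough to Perl's that a port is short. The following is a sketch rather than a tested drop-in: the class and method names are hypothetical, the pattern re-emits the closing quote after the path, and RegexOptions.IgnoreCase is an assumption so that upper-case tags and attributes also match (it also makes "/content/" match).

```csharp
using System;
using System.Text.RegularExpressions;

public static class ContentUrlRewriter
{
    // Group 1: everything from "<img"/"<a" up to and including src=/href=
    // Group 2: the opening quote character
    // Group 3: the path starting at /Content/
    private static readonly Regex ContentUrl = new Regex(
        @"(<(?:img|a)\b[^>]*?\s(?:src|href)=)([""'])http://[^'""]*?(/Content/[^'""]*?)\2",
        RegexOptions.IgnoreCase);

    // Rewrites absolute http:// URLs that contain /Content/ into
    // root-relative paths; every other URL is left untouched.
    public static string MakeContentUrlsRelative(string html)
    {
        return ContentUrl.Replace(html, "$1$2$3$2");
    }
}
```

Because the pattern only matches inside img and a tags, text outside of tags is not affected. An HTML parser remains the more robust way to locate the attributes.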

This function converts all the hyperlinks and image sources inside your HTML to absolute URLs, and you can easily adapt it for CSS and JavaScript files as well:
Imports System.Text.RegularExpressions

Private Function ConvertALLrelativeLinksToAbsoluteUri(ByVal html As String, ByVal PageURL As String) As String
    ' Base address that relative links are resolved against
    Dim MainURL As New Uri(PageURL)
    Dim NewSTR As String = html
    Dim i As Integer
    ' Rewrite every href="..." attribute
    Dim XpHref As New Regex("(href="".*?"")", RegexOptions.IgnoreCase)
    For i = 0 To XpHref.Matches(html).Count - 1
        Application.DoEvents() ' keeps a WinForms UI responsive during long runs
        Dim OldHREF As String = XpHref.Matches(html).Item(i).Value
        ' Strip the attribute name and the quotes to get the bare URL
        Dim Oldurl As String = OldHREF.Replace("href=", "").Replace("HREF=", "").Replace("""", "")
        Dim NEWURL As New Uri(MainURL, Oldurl)
        Dim NewHREF As String = "href=""" & NEWURL.AbsoluteUri & """"
        NewSTR = NewSTR.Replace(OldHREF, NewHREF)
    Next
    html = NewSTR
    ' Rewrite every src="..." attribute
    Dim XpSRC As New Regex("(src="".*?"")", RegexOptions.IgnoreCase)
    For i = 0 To XpSRC.Matches(html).Count - 1
        Application.DoEvents()
        Dim OldSRC As String = XpSRC.Matches(html).Item(i).Value
        Dim Oldurl As String = OldSRC.Replace("src=", "").Replace("SRC=", "").Replace("""", "")
        Dim NEWURL As New Uri(MainURL, Oldurl)
        Dim NewSRC As String = "src=""" & NEWURL.AbsoluteUri & """"
        NewSTR = NewSTR.Replace(OldSRC, NewSRC)
    Next
    Return NewSTR
End Function

Related

Display string after certain words are found

Basically, what I'm trying to do is find the first string that starts with "/Game/Mods", but the problem is: how do I tell the program where to end the string? Here's an example of what a string can look like: string example
As you can see, the string starts with "/Game/Mods" and I want it to end after the word "TamingSedative". The problem is that the ending word (TamingSedative) is different for every file it has to check, for example: example 2
There you can see that the ending word is now "WeapObsidianSword" (instead of TamingSedative), so basically the string has to end when it comes across the "NUL". How do I specify that in C# code?
This is a simple example using Regex.
Dim yourString As String = "/Game/Mods/TamingSedative/PrimalItemConsumable_TamingSedative"
Dim M As System.Text.RegularExpressions.Match = System.Text.RegularExpressions.Regex.Match(yourString, "/Game/Mods/(.+?)/")
MessageBox.Show(M.Groups(0).Value) 'This should show /Game/Mods/TamingSedative/
MessageBox.Show(M.Groups(1).Value) 'This should show TamingSedative
Since you need only the first occurrence, this is the simplest solution I could think of:
EDIT:
In case the existence of a path like this is not guaranteed in the string, you can do an additional check before proceeding to use Substring, like this:
int exists = fullString.IndexOf("/Game/Mods");
if (exists == -1) return null;
Note: I have included "ENDED" in order to see in case any NULL chars have been included (white spaces)
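The code for this answer was posted as an image and is not reproduced here, so the following is only a sketch of the IndexOf/Substring idea it describes. The helper name is hypothetical, and it assumes the path ends at the first NUL character ('\0') or whitespace character, whichever comes first.

```csharp
using System;

public static class ModPathFinder
{
    // Hypothetical helper: returns the first "/Game/Mods..." path found in
    // fullString, or null when no such path exists.
    public static string FindFirstModPath(string fullString)
    {
        const string prefix = "/Game/Mods";
        int start = fullString.IndexOf(prefix, StringComparison.Ordinal);
        if (start == -1) return null;

        // Advance until a NUL or whitespace character (or the end of input).
        int end = start;
        while (end < fullString.Length
               && fullString[end] != '\0'
               && !char.IsWhiteSpace(fullString[end]))
        {
            end++;
        }

        return fullString.Substring(start, end - start);
    }
}
```

Checking for '\0' explicitly matters here because char.IsWhiteSpace does not treat NUL as whitespace.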
From your comments: "the string just has to start at /Game/Mods and end when it reaches the whitespace".
In that case, you can easily get the matches using LINQ, like this (assuming filePath is a string that has the path to your file):
var text = File.ReadAllText(filePath);
// A null char[] separator makes Split break on whitespace characters
var matches = text.Split((char[])null).Where(s => s.StartsWith("/Game/Mods"));
And, if you only need the first occurrence, it would be:
var firstMatch = matches.FirstOrDefault(); // null when there is no match
Check this post.

Remove inline styles from innerHtml using HtmlAgilityPack

I am parsing a web page to return all the unique sentences on the page, each with a minimum of two words. It almost works. The following appears as one sentence in the page however my code is dropping the text in the <b></b> tags. How do I remove the inline style/tags to return the sentence as it appears in the browser with the text in the bold tags or any other inline style like strong tags?
Currently it returns NHL Playoffs as one line of text and then Takeaways: Sharks beat Penguins for first Stanley Cup Final win as the second sentence when it is really just one sentence.
<span class="titletext"><b>NHL Playoffs</b> Takeaways: Sharks beat Penguins for first Stanley Cup Final win</span>
Here is my asp.net vb.net code (c# solution is fine).
Public Shared Function validateIsMoreThanOneWord(input As String, numberWords As Integer) As Boolean
If String.IsNullOrEmpty(input) Then
Return False
End If
Return (input.Split(New Char() {" "c}, StringSplitOptions.RemoveEmptyEntries).Length >= numberWords)
End Function
Private Sub form1_Load(sender As Object, e As EventArgs) Handles form1.Load
Try
Dim html = New HtmlDocument()
html.LoadHtml(New WebClient().DownloadString("http://news.google.ca/nwshp?hl=en&ei=4H1UV7-NNOfCjwTAl4bABw&ved=0EKkuCAkoBw"))
Dim root = html.DocumentNode
Dim myList As New List(Of String)()
For Each node As HtmlNode In root.Descendants().Where(Function(n) n.NodeType = HtmlNodeType.Text AndAlso n.ParentNode.Name <> "script" AndAlso n.ParentNode.Name <> "style" AndAlso n.ParentNode.Name <> "css")
If Not node.HasChildNodes Then
Dim text As String = HttpUtility.HtmlDecode(node.InnerText)
If Not String.IsNullOrEmpty(text) And Not String.IsNullOrWhiteSpace(text) Then
If validateIsMoreThanOneWord(text.Trim(), 2) Then
myList.Add(text.Trim())
End If
End If
End If
Next
'remove dups from array and other stuff
Dim q As String() = myList.Distinct().ToArray()
For i As Integer = 0 To UBound(q)
Response.Write(q(i).Trim() & "<br/>")
Next
Response.Write(q.Count)
Catch ex As Exception
Response.Write(ex.Message)
End Try
End Sub
Hope you can shed some light on a solution. Thanks!
Since you are looping over all root descendant nodes whose parent is not <script>, <style> or css, you will indeed treat every child node of .titletext as a different piece of text.
What you want is to retrieve the InnerText of each .titletext entry.
The following is what I would do in C#; it should give you the idea of what you need to do.
HtmlWeb w = new HtmlWeb();
var htmlDoc = w.Load("http://news.google.ca/nwshp?hl=en&ei=4H1UV7-NNOfCjwTAl4bABw&ved=0EKkuCAkoBw");
var textTitles = htmlDoc.DocumentNode.SelectNodes("//span[@class='titletext']");
//for testing purposes
foreach (var textTitle in textTitles)
Console.WriteLine(textTitle.InnerText);

vb.net dataset.load xml file containing string "&"

I'm using a Dataset.Load statement to load an XML file, and the file has some tags containing the "&" character, which causes an exception. Is there any way to load the XML into the dataset, or to replace the & with another string?
I tried a Replace, but when I use StringVar.Replace("&","e"), for example, and the file contains "ç" or "ã", those characters are replaced with a wrong sequence of characters.
I was trying this
My.Computer.FileSystem.WriteAllText(MyFilePath, My.Computer.FileSystem.ReadAllText(MyFilePath, System.Text.Encoding.UTF8).Replace(" & ", "&"), False, System.Text.Encoding.UTF8)
but it turns out that some files have "A&B" or some other combination of letters before and after the "&".
I'll be glad if anyone can help me.
Thanks
Hello guys, I solved my problem. The problem really was what @Blorgbeard said: the XML file was arriving invalid.
Public Shared Function Decompress(text As String) As String
    ' The payload arrives Base64-encoded and GZip-compressed
    Dim bytes As Byte() = Convert.FromBase64String(text)
    Using msi = New MemoryStream(bytes)
        Using mso = New MemoryStream()
            Using gs = New System.IO.Compression.GZipStream(msi, System.IO.Compression.CompressionMode.Decompress)
                Dim bytesAux As Byte() = New Byte(4095) {}
                ' Read returns 0 once the compressed stream is exhausted
                Dim cnt As Integer = gs.Read(bytesAux, 0, bytesAux.Length)
                While cnt <> 0
                    mso.Write(bytesAux, 0, cnt)
                    cnt = gs.Read(bytesAux, 0, bytesAux.Length)
                End While
            End Using
            ' Rewind and decode the decompressed bytes as UTF-8
            mso.Seek(0, SeekOrigin.Begin)
            Dim streamReader As StreamReader = New StreamReader(mso, System.Text.Encoding.UTF8, True)
            Return streamReader.ReadToEnd()
        End Using
    End Using
End Function
This is what I did to get and return the string containing the correct XML data to be written to the file.
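For files that really do contain bare "&" characters and cannot be fixed at the source, one possible workaround is to escape only the ampersands that do not already start an entity, before handing the text to the dataset. This is a sketch under that assumption: the class and method names are hypothetical, and it does not treat CDATA sections or comments specially.

```csharp
using System.Text.RegularExpressions;

public static class XmlAmpersandFixer
{
    // Matches "&" unless it begins a predefined entity (&amp; &lt; &gt;
    // &quot; &apos;) or a numeric character reference (&#123; / &#x1F;).
    private static readonly Regex BareAmpersand = new Regex(
        @"&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)");

    // Replaces each bare "&" with "&amp;" and leaves all other text,
    // including characters such as "ç" and "ã", unchanged.
    public static string EscapeBareAmpersands(string xml)
    {
        return BareAmpersand.Replace(xml, "&amp;");
    }
}
```

Reading and writing the file with the same encoding (UTF-8 in the question) is what keeps "ç" and "ã" intact; the replacement itself never touches them.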

C# Skipwhile to VB skipWhile

I have code in C#
string fileNameOnly = Path.GetFileNameWithoutExtension(sKey);
string token = fileNameOnly.Remove(fileNameOnly.LastIndexOf('_'));
string number = new string(token.SkipWhile(Char.IsLetter).ToArray());
And I want it in VB
Dim fileNameOnly As String = Path.GetFileNameWithoutExtension(sKey)
Dim token As String = fileNameOnly.Remove(fileNameOnly.LastIndexOf("_"c))
Dim number As New String(token.SkipWhile([Char].IsLetter).ToArray())
I have tried that, but it did not work! Is there something similar I can use?
What it does is look at a file name and keep only the number part, skipping all leading letters and everything after the last _.
You have to use AddressOf in VB.NET:
Dim number As New String(token.SkipWhile(AddressOf Char.IsLetter).ToArray())
You could also use Function:
Dim number As New String(token.SkipWhile(Function(c) Char.IsLetter(c)).ToArray())
In VB.NET I often use multiple lines and combine query and method syntax to avoid the ugly Function/AddressOf keywords.
Dim numberChars = From c In token
                  Skip While Char.IsLetter(c)
Dim numbers = New String(numberChars.ToArray())

how to extract string from certain position

I am struggling to find a solution for a string manipulation problem: I am trying to extract the part of a string that comes after the '=' character. For example:
dim s as string = "/mysite/secondary.aspx?id=1005"
I am trying to get the string after the "=" and grab just the 1005. I tried IndexOf and Split, but I am not sure where I am going wrong. Any help, please?
Here is what i did:
Dim lnk As String = "/mysite/secondary.aspx?id=1005"
Dim id As Long = lnk.IndexOf("=")
Dim part As String = lnk.Substring(id + 1, 4)
Thanks
Try the following
Dim index = s.IndexOf("="C)
Dim value = s.Substring(index + 1)
This will put "1005" into value
Dim tUriPath As String = "/mysite/secondary.aspx?id=1005"
Dim tURI As Uri = New Uri("dummy://example.com" & tUriPath)
Dim tIdValue As String = System.Web.HttpUtility.ParseQueryString(tUri.Query)("id")
Here's a very simple example. Obviously it relies on very specific conditions:
Dim afterEquals As String = s.Split("="c)(1)
You would probably want something slightly more robust (checking to make sure more than one string was returned from Split, etc.).
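A slightly more defensive version of the Split idea, written in C# as a sketch (the class and method names are hypothetical): it splits on the first '=' only and returns null when there is nothing to extract.

```csharp
using System;

public static class QueryValueSketch
{
    // Hypothetical helper: returns the text after the first '=',
    // or null when the input is empty or contains no '='.
    public static string AfterEquals(string s)
    {
        if (string.IsNullOrEmpty(s)) return null;

        // Limit the split to two parts so a value that itself
        // contains '=' is kept whole.
        string[] parts = s.Split(new[] { '=' }, 2);
        return parts.Length == 2 ? parts[1] : null;
    }
}
```

With the string from the question, `QueryValueSketch.AfterEquals("/mysite/secondary.aspx?id=1005")` yields "1005".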
If you try String.Split with '=', you'll get 1005 at index 1 of the array and /mysite/secondary.aspx?id at index 0.
But if this is just a regular URL coming from an HTTP request, you could simply call Request.QueryString("id"), and it will return 1005.
Borrowing code from Boo...
Dim tUriPath As String = "/mysite/secondary.aspx?id=1005"
Dim tURI As Uri = New Uri("dummy://example.com" & tUriPath)
' ParseQueryString returns a NameValueCollection keyed by parameter name
Dim tQuery As System.Collections.Specialized.NameValueCollection = System.Web.HttpUtility.ParseQueryString(tURI.Query)
Dim theIntYouWant As Integer = Integer.Parse(tQuery("id"))
