Cleaning/formatting URLs

If I have the following URLs:
/sites/testsite/subsite/shared%20documents1/projects/project%20-%20csf%20healthcare%20patient%20dining%20development
/sites/testsite/subsite2/healthcare/sd/Documents/Cleaning%20Services
I need to clean these URLs, which I currently do like this:
string webUrl = sd.Key.Substring(0, sd.Key.ToLower().IndexOf("documents") - 1);
This works great for the second link and gives me the following cleaned-up URL:
/sites/testsite/subsite2/healthcare/sd
However, this is not universal: it does not work for the first URL, where I get the following:
/sites/testsite/subsite/Shared%2
Ideally what I would want to get here is
/sites/testsite/subsite
Is there a better way (universal) to ensure that this works for both URLs?

These are percent-encoded (escaped) strings. In JavaScript you can decode them with unescape(), although decodeURIComponent() is the non-deprecated way to do it.
e.g.
unescape('/sites/testsite/subsite/shared%20documents1/projects/project%20-%20csf%20healthcare%20patient%20dining%20development')
// /sites/testsite/subsite/shared documents1/projects/project - csf healthcare patient dining development
In C#, use HttpUtility.UrlDecode (HtmlDecode only handles HTML entities such as &amp; and leaves %20 untouched):
var result = HttpUtility.UrlDecode("/sites/testsite/subsite/shared%20documents1/projects/project%20-%20csf%20healthcare%20patient%20dining%20development");

The best way to do this is by using Uri.UnescapeDataString
Uri.UnescapeDataString(@"/sites/testsite/subsite/shared%20documents1/projects/project%20-%20csf%20healthcare%20patient%20dining%20development")
this will give you
/sites/testsite/subsite/shared documents1/projects/project - csf healthcare patient dining development
Then you can remove the spaces if you want to.
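If it helps, here is a minimal sketch that combines the decoding step with the original goal (everything before the "documents" part) without a regex. The names UrlCleaner and GetWebUrl are made up for illustration, and it assumes the marker is the first path segment that contains "documents" in any casing.

```csharp
using System;
using System.Linq;

public static class UrlCleaner
{
    // Hypothetical helper: returns the decoded path up to (not including)
    // the first segment that contains "documents", ignoring case.
    public static string GetWebUrl(string path)
    {
        string decoded = Uri.UnescapeDataString(path);
        string[] segments = decoded.Split('/');

        int index = Array.FindIndex(
            segments,
            s => s.IndexOf("documents", StringComparison.OrdinalIgnoreCase) >= 0);

        // No "documents" segment found: return the decoded path unchanged.
        if (index < 0)
            return decoded;

        return string.Join("/", segments.Take(index));
    }
}
```

Note that the result is the decoded form, so a %20 in the part that is kept comes back as a space.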

If you are trying to retrieve just "/sites/sitename/subsite" you can do this
var match = Regex.Match("/sites/compass/community/Shared%20documents1/projects/project%20-%20csf%20healthcare%20patient%20dining%20development", "^/sites/.*?/.*?/", RegexOptions.IgnoreCase);
if (match.Success)
Console.WriteLine(match.Value.TrimEnd('/')); // "/sites/compass/community"
Or, to retrieve everything to the left of the segment that ends in "documents":
var source = new [] {"/sites/testsite/subsite/shared%20documents1/projects/project%20-%20csf%20healthcare%20patient%20dining%20development",
"/sites/testsite/subsite2/healthcare/sd/Documents/Cleaning%20Services"};
foreach (var item in source)
{
var match = Regex.Match(item, "^(.*)(?:/.*?Documents)", RegexOptions.IgnoreCase);
if (match.Success)
Console.WriteLine(match.Groups[1].Value);
}
Output:
/sites/testsite/subsite
/sites/testsite/subsite2/healthcare/sd

Related

Replacing html content in a string

I have a string that has html contents such as:
string myMessage = "Please the website for more information (<a class=\"link\" href=\"http://www.africau.edu/images/default/sample.pdf\" target=_blank\" id=\"urlLink\"> easy details given</a>)";
What I need in the end is:
string myMessage = "Please the website for more information http://www.africau.edu/images/default/sample.pdf easy details given";
I can do this by replacing each string with myMessage = myMessage.Replace("string to replace", ""); but then I have to take each string in turn and replace it with an empty string. Is there a better solution?

If I understand you correctly you have a larger text with multiple occurrences of "<a ....>" and actually you want to replace that entire thing by simply only the URL given in the href.
I'm not sure this makes it much easier for you, but you could use Regex.Matches, for example:
var myMessage = "Please the website for more information (<a class=\"link\" href=\"http://www.africau.edu/images/default/sample.pdf\" target=_blank\" id=\"urlLink\"> easy details given</a>)";
var matches = Regex.Matches(myMessage, "(.+?)<a.+?href=\"(.+?)\".+?<\\/a>(.+?)");
var strBuilder = new StringBuilder();
foreach (Match match in matches)
{
var groups = match.Groups;
strBuilder.Append(groups[1]) // Please the website for more information (
.Append(groups[2]) // http://www.africau.edu/images/default/sample.pdf
.Append(groups[3]); // )
}
Debug.Log(strBuilder.ToString());
So what does this do?
(.+?) will create a group for everything before the first encounter of the following <a => groups[1]
<a.+?href=" matches everything starting with <a and ending with href=" => ignored
(.+?) will create a group for everything between href=" and the next " (so the URL) => groups[2]
".+?<\/a> matches everything from the " until the next </a> => ignored
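(.+?) will create a group for the text after the </a> => groups[3]. Because it is lazy and nothing follows it in the pattern, it captures only a single character (here the closing parenthesis).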
and groups[0] is the entire match.
so finally we just want to combine
groups[1] + groups[2] + groups[3]
but in a loop so we find possibly multiple matches within the same string and it is simply more efficient to use a StringBuilder for that.
Result
Please the website for more information (http://www.africau.edu/images/default/sample.pdf)
you can simply adjust this to e.g. also remove the ( ) or include the text between the tags but I figured actually this makes the most sense for now.
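As a shorter variation on the same idea, Regex.Replace can do the substitution in one call. This is only a sketch: the names LinkFlattener and Flatten are invented here, and the pattern assumes double-quoted href values and no nested tags inside the link. The optional [(]? and [)]? swallow parentheses that sit directly around the link, which is what the question's expected output asks for.

```csharp
using System.Text.RegularExpressions;

public static class LinkFlattener
{
    // Group 1 = href value, group 2 = link text.
    private static readonly Regex Anchor = new Regex(
        @"[(]?<a\b[^>]*?href=""([^""]*)""[^>]*>(.*?)</a>[)]?",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Replaces every <a ...>text</a> with "<url><text>".
    public static string Flatten(string html)
    {
        return Anchor.Replace(html, "$1$2");
    }
}
```

Calling LinkFlattener.Flatten(myMessage) on the string from the question should give the text with the URL followed by " easy details given" and no parentheses.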

I personally don't like to rely on the string format always being what I expect as this can lead to errors down the road.
Instead, I offer two ways I can think of doing this:
Use regular expressions:
string myMessage = "Please the website for more information (<a class=\"link\" href=\"http://www.africau.edu/images/default/sample.pdf\" target=_blank\" id=\"urlLink\"> easy details given</a>)";
var capturePattern = @"(.+)\(<a .*href.*?=""(.*?)"".*>(.*)</a>\)";
var regex = new Regex(capturePattern);
var captures = regex.Match(myMessage);
var newString = $"{captures.Groups[1]}{captures.Groups[2]}{captures.Groups[3]}";
Console.WriteLine(myMessage);
Console.WriteLine(newString);
Output:
Please the website for more information (<a class="link" href="http://www.africau.edu/images/default/sample.pdf" target=_blank" id="urlLink"> easy details given</a>)
Please the website for more information http://www.africau.edu/images/default/sample.pdf easy details given
Of course, regular expressions are only as good as the cases you can think of/test. I wrote this up quickly just to illustrate so make sure to verify for other string variations.
The other way is using HTMLAgilityPack:
string myMessage = "Please the website for more information (<a class=\"link\" href=\"http://www.africau.edu/images/default/sample.pdf\" target=_blank\" id=\"urlLink\"> easy details given</a>)";
var doc = new HtmlDocument();
doc.LoadHtml(myMessage);
var prefix = doc.DocumentNode.ChildNodes[0].InnerText;
var url = doc.DocumentNode.SelectNodes("//a[@href]").First().GetAttributeValue("href", string.Empty);
var suffix = doc.DocumentNode.ChildNodes[1].InnerText + doc.DocumentNode.ChildNodes[2].InnerText;
var newString = $"{prefix}{url}{suffix}";
Console.WriteLine(myMessage);
Console.WriteLine(newString);
Output:
Please the website for more information (<a class="link" href="http://www.africau.edu/images/default/sample.pdf" target=_blank" id="urlLink"> easy details given</a>)
Please the website for more information (http://www.africau.edu/images/default/sample.pdf easy details given)
Notice this method preserves the parentheses around the link. This is because, from the agility pack's perspective, the opening parenthesis is part of the text of the first node. You can always remove them with a quick replace.
This method adds a dependency but this library is very mature and has been around for a long time.
It goes without saying that, for both methods, you should add error-handling checks for unexpected conditions.

Using RegEx to split strings after specific character

I've been working on trying to get this string split in a couple different places which I managed to get to work, except if the name had a forward-slash in it, it would throw all of the groups off completely.
The string:
123.45.678.90:00000/98765432109876541/[CLAN]PlayerName joined [windows/12345678901234567]
I essentially need the following:
IP group: 123.45.678.90:00000 (without the following /)
id group: 98765432109876541
name group: [CLAN]PlayerName
id1 group: 12345678901234567
The text "joined" also has to be there. However windows does not.
Here is what I have so far:
(?<ip>.*)\/(?<id>.*)\/(.*\/)?(?<name1>.*)( joined.*)\[(.*\/)?(?<id1>.*)\]
This works like a charm unless the player name contains a "/". How would I go about escaping that?
Any help with this would be much appreciated!

Since you tagged your question with C# and Regex and not only Regex, I will propose an alternative. I am not sure whether it will be more efficient, but I find it easier to read and debug if you simply use String.Split():
public static void Main()
{
string input = "123.45.678.90:00000/98765432109876541/[CLAN]Player/Na/me joined [windows/12345678901234567]";
// we want "123.45.678.90:00000/98765432109876541/[CLAN]Player/Na/me joined" and "12345678901234567]"
// Also, you can remove " joined" by adding it before " [windows/"
var content = input.Split(new string[]{" [windows/"}, StringSplitOptions.None);
// we want ip + groupId + everything else
var tab = content[0].Split('/');
var ip = tab[0];
var groupId = tab[1];
var groupName = String.Join("/", tab.Skip(2)); // merge everything else. We use Linq to skip ip and groupId
var groupId1 = RemoveLast(content[1]); // cut the trailing ']'
Console.WriteLine(groupName);
}
private static string RemoveLast(string s)
{
return s.Remove(s.Length - 1);
}
Output:
[CLAN]Player/Na/me joined
If you are using a class for ip, groupId, etc., and I guess you are, just put all of this into a constructor that accepts a string as its parameter.

You shouldn't be using greedy quantifiers (*) with an open character such as the dot (.). It won't work as intended and will result in a lot of backtracking.
This is slightly more efficient, but not overly strict:
^(?<ip>[^\/\n]+)\/(?<id>[^\/]+)\/(?<name1>\S+)\D+(?<id1>\d+)]$

You basically need to use non-greedy quantifiers (*?). Try this:
(?<ip>.*?)\/(?<id>.*?)\/(?<name1>.*?)( joined )\[(.*?\/)?(?<id1>.*?)\]
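To try the non-greedy pattern from C#, something like the following sketch could be used. The class and method names (JoinLineParser, Parse) are just placeholders; the pattern itself is the one above, unchanged. It relies on the ip and id parts never containing a "/", so only the name is allowed to.

```csharp
using System.Text.RegularExpressions;

public static class JoinLineParser
{
    private static readonly Regex Pattern = new Regex(
        @"(?<ip>.*?)\/(?<id>.*?)\/(?<name1>.*?)( joined )\[(.*?\/)?(?<id1>.*?)\]");

    // Returns { ip, id, name, id1 }, or null when the line does not match.
    public static string[] Parse(string line)
    {
        Match m = Pattern.Match(line);
        if (!m.Success)
            return null;

        return new[]
        {
            m.Groups["ip"].Value,
            m.Groups["id"].Value,
            m.Groups["name1"].Value,
            m.Groups["id1"].Value
        };
    }
}
```

Because the first two groups stop at the first and second "/", any further slashes end up in the name group, and the optional (.*?\/)? takes care of the "windows/" prefix when it is present.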

Match Multiline & IgnoreSome

I'm trying to extract some information from a JCL source using regex in C#
Basically, this is a string I can have:
//JOBNAME0 JOB (BLABLABLA),'SOME TEXT',MSGCLASS=YES,ILIKE=POTATOES, GRMBL
// IALSOLIKE=TOMATOES, ANOTHER GARBAGE
// FINALLY=BYE
//OTHER STUFF
So I need to extract the jobname JOBNAME0, the info (BLABLABLA), the description 'SOME TEXT' and the other parms MSGCLASS=YES ILIKE=POTATOES IALSOLIKE=TOMATOES FINALLY=BYE.
I must ignore everything that is after the space ... like GRMBL or ANOTHER GARBAGE
I must continue to the next line if my last valid character was a , and stop if there was none.
So far, I have managed to get the jobname, the info and the description, which was pretty easy. For the other parms, I'm able to get all of them and split them, but I don't know how to get rid of the garbage.
Here is my code:
var regex = "//([^\\s]*) JOB (\\([^)]*\\))?,?(\\'[^']*\\')?,?([^,]*[,|\\s|$])*";
Match match2 = Regex.Match(test5, regex,RegexOptions.Singleline);
string CarteJob2 = match2.Groups[0].Value;
string JobName2 = match2.Groups[1].Value;
string JobInfo2 = match2.Groups[2].Value;
string JobDesc2 = match2.Groups[3].Value;
IEnumerable<string> parms = match2.Groups[4].Captures.OfType<Capture>().Select(x => x.Value);
string JobParms2 = String.Join("|", parms);
Console.WriteLine(CarteJob2 + "|");
Console.WriteLine(JobName2 + "|");
Console.WriteLine(JobInfo2 + "|");
Console.WriteLine(JobDesc2 + "|");
Console.WriteLine(JobParms2 + "|");
The output I get is this one:
//JOBNAME0 JOB (BLABLABLA),'SOME TEXT',MSGCLASS=YES,ILIKE=POTATOES, GRMBL
// IALSOLIKE=TOMATOES, ANOTHER GARBAGE
// FINALLY=BYE
//OTHER |
JOBNAME0|
(BLABLABLA)|
'SOME TEXT'|
MSGCLASS=YES,|ILIKE=POTATOES,| GRMBL
// IALSOLIKE=TOMATOES,| ANOTHER GARBAGE
// FINALLY=BYE
//OTHER |
The output I would like to see is:
//JOBNAME0 JOB (BLABLABLA),'SOME TEXT',MSGCLASS=YES,ILIKE=POTATOES, GRMBL
// IALSOLIKE=TOMATOES, ANOTHER GARBAGE
// FINALLY=BYE|
JOBNAME0|
(BLABLABLA)|
'SOME TEXT'|
MSGCLASS=YES|ILIKE=POTATOES|IALSOLIKE=TOMATOES|FINALLY=BYE|
Is there a way to get what I want ?

I think I'd try and do this with two Regex expressions.
The first one to get all the starting information from the beginning of the string - job name, info, description.
The second one to get all the parameters, which all seem to have a simple pattern of <param name>=<param value>.
The first Regex might look like this:
^//(?<job>[\d\w]+)[ ]+JOB[ ]+\((?<info>[\d\w]+)\),'(?<description>[\d\w ]+)'
I don't know if rules permit whitespaces to appear in the job name, info or description - adjust as needed. Also, I'm assuming this is the start of the file using the ^ char. Finally, this Regex has groups already defined, so getting values should be easier in C#.
The second Regex might be something like this:
(?<param>[\w\d]+)=(?<value>[\w\d]+)
Again, grouping is added to help get the parameter names and values.
Hope this helps.
EDIT:
A small tip - you can use the @ sign before a string in C# to make it easier to write such Regex patterns. For example:
Regex reg = new Regex(@"(?<param>[\w\d]+)=(?<value>[\w\d]+)");
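Putting the two patterns together might look like the sketch below. JclJobCard and Describe are names I made up, and the sketch runs the parameter pattern over the whole input, so it assumes the text passed in contains only the JOB statement and that the trailing garbage never contains an = sign. It does not implement the "stop when a line does not end with a comma" rule.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class JclJobCard
{
    private static readonly Regex Header = new Regex(
        @"^//(?<job>[\d\w]+)[ ]+JOB[ ]+\((?<info>[\d\w]+)\),'(?<description>[\d\w ]+)'");

    private static readonly Regex Param = new Regex(
        @"(?<param>[\w\d]+)=(?<value>[\w\d]+)");

    // Returns "job|info|description|NAME=VALUE|...", or null if the header does not match.
    public static string Describe(string jcl)
    {
        Match header = Header.Match(jcl);
        if (!header.Success)
            return null;

        var head = new[]
        {
            header.Groups["job"].Value,
            header.Groups["info"].Value,
            header.Groups["description"].Value
        };

        // Every NAME=VALUE pair found anywhere in the text.
        var parms = Param.Matches(jcl).Cast<Match>().Select(m => m.Value);

        return string.Join("|", head.Concat(parms));
    }
}
```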

Regex required for renaming file in C#

I need a regex for renaming files in C#. My file name is 22px-Flag_Of_Sweden.svg.png and I want to rename it to sweden.png.
I need a regex for that. Please help me.
I have more than 300 files like the ones below:
22px-Flag_Of_Sweden.svg.png - should become sweden.png
13px-Flag_Of_UnitedStates.svg.png - unitedstates.png
17px-Flag_Of_India.svg.png - india.png
22px-Flag_Of_Ghana.svg.png - ghana.png
These are actually country flags. I want to extract Countryname.Fileextension. That's all.

var fileNames = new [] {
"22px-Flag_Of_Sweden.svg.png"
,"13px-Flag_Of_UnitedStates.svg.png"
,"17px-Flag_Of_India.svg.png"
,"22px-Flag_Of_Ghana.svg.png"
,"asd.png"
};
var regEx = new Regex(@"^.+Flag_Of_(?<country>.+)\.svg\.png$");
foreach ( var fileName in fileNames )
{
if ( regEx.IsMatch(fileName))
{
var newFileName = regEx.Replace(fileName,"${country}.png").ToLower();
// File.Move(Path.Combine(root, fileName), Path.Combine(root, newFileName)); // root = your folder path
}
}
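For the actual renaming on disk, a rough sketch along the same lines could look like this. FlagRenamer, GetNewName and RenameAll are hypothetical names, the folder path is whatever you pass in, and the pattern is the same one as above. GetNewName is kept separate so the name logic can be checked without touching any files.

```csharp
using System.IO;
using System.Text.RegularExpressions;

public static class FlagRenamer
{
    private static readonly Regex FlagName =
        new Regex(@"^.+Flag_Of_(?<country>.+)\.svg\.png$");

    // "22px-Flag_Of_Sweden.svg.png" -> "sweden.png";
    // returns null when the name does not fit the pattern.
    public static string GetNewName(string fileName)
    {
        Match m = FlagName.Match(fileName);
        return m.Success
            ? m.Groups["country"].Value.ToLowerInvariant() + ".png"
            : null;
    }

    // Renames every matching file directly inside the given folder.
    public static void RenameAll(string folder)
    {
        foreach (string path in Directory.GetFiles(folder))
        {
            string newName = GetNewName(Path.GetFileName(path));
            if (newName != null)
                File.Move(path, Path.Combine(folder, newName)); // throws if the target already exists
        }
    }
}
```

Files such as asd.png that do not match are simply left alone.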

I am not exactly sure how this would look in c# (although the regex is important and not the language), but in Java this would look like this:
String input = "22px-Flag_Of_Sweden.svg.png";
Pattern p = Pattern.compile(".+_(.+?)\\..+?(\\..+?)$");
Matcher m = p.matcher(input);
System.out.println(m.matches());
System.out.println(m.group(1).toLowerCase() + m.group(2));
The relevant part for you is this:
".+_(.+?)\\..+?(\\..+?)$"
Just concat the two groups.
I wish I knew a bit of C# right now :)
Cheers Eugene.

This will return country in the first capture group: ([a-zA-Z]+)\.svg\.png$

I don't know c# but the regex could be:
^.+_(\pL+)\.svg\.png
and the replace part is : $1.png

Find/parse server-side <?abc?>-like tags in html document

I guess I need some regex help. I want to find all tags like <?abc?> so that I can replace it with whatever the results are for the code ran inside. I just need help regexing the tag/code string, not parsing the code inside :p.
<b><?abc print 'test' ?></b> would result in <b>test</b>
Edit: Not specifically but in general, matching (<?[chars] (code group) ?>)

This will build up a new copy of the string source, replacing <?abc code?> with the result of process(code)
Regex abcTagRegex = new Regex(@"\<\?abc(?<code>.*?)\?>");
StringBuilder newSource = new StringBuilder();
int curPos = 0;
foreach (Match abcTagMatch in abcTagRegex.Matches(source)) {
string code = abcTagMatch.Groups["code"].Value;
string result = process(code);
newSource.Append(source.Substring(curPos, abcTagMatch.Index - curPos)); // second argument is a length, not an end index
newSource.Append(result);
curPos = abcTagMatch.Index + abcTagMatch.Length;
}
newSource.Append(source.Substring(curPos));
source = newSource.ToString();
N.B. I've not been able to test this code, so some of the functions may be slightly the wrong name, or there may be some off-by-one errors.
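An alternative to the manual loop is to let Regex.Replace call back for each match, which avoids tracking positions by hand. The sketch below is untested against real templates: TagRenderer and Render are invented names, and ProcessPrint is only a stand-in for the real process(code), understanding nothing but print '...'.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagRenderer
{
    // Singleline lets the code inside a tag span several lines.
    private static readonly Regex AbcTag =
        new Regex(@"<\?abc(?<code>.*?)\?>", RegexOptions.Singleline);

    // Replaces each <?abc code ?> with the result of process(code).
    public static string Render(string source, Func<string, string> process)
    {
        return AbcTag.Replace(source, m => process(m.Groups["code"].Value));
    }

    // Hypothetical stand-in for the real interpreter: handles only  print '...'
    public static string ProcessPrint(string code)
    {
        const string keyword = "print ";
        string trimmed = code.Trim();
        if (!trimmed.StartsWith(keyword, StringComparison.Ordinal))
            return string.Empty;

        return trimmed.Substring(keyword.Length).Trim().Trim('\'');
    }
}
```

With that, TagRenderer.Render("<b><?abc print 'test' ?></b>", TagRenderer.ProcessPrint) should produce <b>test</b>.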

var regex = new Regex(@"<\?(\w+) (\w+) (.+?)\?>");
This will take this source
<b><?abc print 'test' ?></b>
and break it up like this:
Value: <?abc print 'test' ?>
SubMatch: abc
SubMatch: print
SubMatch: 'test'
These can then be sent to a method that can handle it differently depending on what the parts are.
If you need more advanced syntax handling you need to go beyond regex I believe.
I designed a template engine using Antlr, but that's way more complex ;)

exp = new Regex(@"<\?abc print '(.+?)' \?>");
str = exp.Replace(str, "$1");
Something like this should do the trick. Change the regexes how you see fit
