Parsing Links and Receiving Extra Blanks - c#

I am parsing a webpage for http links by first parsing out all the anchor tags, then reading their href attributes, then running a regex to drop every value that isn't an absolute link (such as href="/img/link.php"). The following code works, but it also appends lots of blank lines between the parsed links.
while (parse.ParseNext("a", out tag))
{
string value;
//A REGEX value, this one finds proper http addresses
Regex regexObj = new Regex(@"^http\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$");
if (tag.Attributes.TryGetValue("href", out value))
{
string value2;
//Start finding matches...
Match matchResults = regexObj.Match(value);
value2 = matchResults.Value;
lstPages.AppendText(value2 + "\r\n");
}
}
To fix this, I added the following code and it works to clean up the list:
if (value2 != "")
{
lstPages.AppendText(value2 + "\r\n");
}
However, I
1. don't believe this is the most efficient way to go about this, and
2. still don't understand where the blank ("") lines come from.
My actual question is on both of these but more for issue #2, as I would like to learn why I receive these results, but also if there is a more efficient method for this.

The reason you are getting an empty string in value2 is that matchResults.Value == "" if the regular expression fails to match. Instead of checking if value2 != "", you could directly check matchResults.Success to see if the regular expression matched. You're basically doing that, anyway, since your particular regular expression would never match an empty string, but checking matchResults.Success would be more straightforward.
Another thing to consider is that it's not necessary to create the Regex object every iteration of your loop. Here are the modifications I suggest:
//A REGEX value, this one finds proper http addresses
Regex regexObj = new Regex(@"^http\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$");
while (parse.ParseNext("a", out tag))
{
string value;
if (tag.Attributes.TryGetValue("href", out value))
{
string value2;
//Start finding matches...
Match matchResult = regexObj.Match(value);
if (matchResult.Success)
{
value2 = matchResult.Value;
lstPages.AppendText(value2 + "\r\n");
}
}
}
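To see where the blank lines came from, here is a minimal console sketch. It reuses the pattern from the question; the class name, the helper `AbsoluteLinkOrNull`, and the sample hrefs are made up for illustration. It prints `Success` and `Value` for one href that matches and one that does not.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BlankLineDemo
{
    // Same pattern as in the question, built once rather than per iteration.
    private static readonly Regex HttpLink =
        new Regex(@"^http\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$");

    // Hypothetical helper: returns the matched link, or null when the href
    // is not an absolute http link.
    public static string AbsoluteLinkOrNull(string href)
    {
        Match m = HttpLink.Match(href);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        // Sample values only; the second one is the relative link from the question.
        string[] hrefs = { "http://example.com/page", "/img/link.php" };
        foreach (string href in hrefs)
        {
            Match m = HttpLink.Match(href);
            // A failed match still exposes a Value: the empty string.
            Console.WriteLine("{0} -> Success={1}, Value=\"{2}\"", href, m.Success, m.Value);
        }
    }
}
```

Appending `Value` without checking `Success` is what wrote an empty line for every href that was not an absolute link.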

Use Html Agility Pack instead
static void Main(string[] args)
{
var html = new HtmlDocument();
var request = WebRequest.Create("http://stackoverflow.com/questions/6256982/parsing-links-and-recieving-extra-blanks/6257328#6257328") as HttpWebRequest;
using (var response = request.GetResponse())
using (var responseStream = response.GetResponseStream())
{
html.Load(responseStream);
}
foreach (var absoluteHref in html.DocumentNode.SelectNodes("//a[starts-with(@href, 'http')]"))
{
Console.WriteLine(absoluteHref.Attributes["href"].Value);
}
}

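TryGetValue is a generic method (of type T). If the key is not found, it returns false and sets the out parameter to the default value of the type, which for string is null (not String.Empty or ""). In the code above the if block is skipped in that case, so the blank lines come from the failed regex match rather than from TryGetValue.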

Related

Implement generic Regex.Matches with string tags

I have a function that gets the content inside 2 tags of a string:
string content = string.Empty;
foreach (Match match in Regex.Matches(stringSource, "<tag1>(.*?)</tag1>"))
{
content = match.Groups[1].Value;
}
I need to do this operation many times with different tags. I want to update the method so I can pass in the opening and closing tags, but I can't concatenate the tag parameters with the regular expression. When I pass these values to the new function, the expression does not work:
public string GetContent(string stringSource, string openTag, string closeTag)
{
string content = string.Empty;
foreach (Match match in Regex.Matches(stringSource, $"{openTag}(.*?){closeTag}"))
{
content = match.Groups[1].Value;
}
return content;
}
I want to use the function like this:
string content = GetContent(sourceString, "<tag1>", "</tag1>");
How can I make this work?
Try this:
public IEnumerable<string> GetContent(string stringSource, string tag)
{
foreach (Match match in Regex.Matches(stringSource, $"<{tag}>(.*?)</{tag}>"))
{
yield return match.Groups[1].Value;
}
}
// ...
var content = GetContent(sourceString, "tag1");
Note I also changed the return type. What you had before was the equivalent of calling this function like this: string content = GetContent(sourceString, "tag").LastOrDefault();
Also, Regex is generally a poor choice for handling HTML and XML. There are all kinds of edge cases (nesting, attributes, comments, escaped characters) that regular expressions don't handle well.
You can make it seem to work if you can constrain your input to a subset of the language to limit edge cases, and that might get you by for a while, but usually someone will eventually want to use more of the features of the markup language and you'll start getting weird bugs and errors. You'll really do much better with a dedicated, purpose-built parser!
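If the delimiters really are simple, non-nested strings, one way to keep the original two-parameter signature is to escape them before building the pattern. The following is only a sketch under that assumption; the class name and sample input are illustrative, and it is not a substitute for a real parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagContent
{
    // Yields the text between each openTag/closeTag pair, in document order.
    public static IEnumerable<string> GetContent(string source, string openTag, string closeTag)
    {
        // Regex.Escape keeps characters such as '[' or '.' in the delimiters literal.
        string pattern = Regex.Escape(openTag) + "(.*?)" + Regex.Escape(closeTag);

        // Singleline lets '.' match newlines, so content may span several lines.
        foreach (Match match in Regex.Matches(source, pattern, RegexOptions.Singleline))
        {
            yield return match.Groups[1].Value;
        }
    }

    public static void Main()
    {
        // Sample input, made up for this sketch.
        string source = "<tag1>first</tag1> [b]bold[/b] <tag1>second</tag1>";
        Console.WriteLine(string.Join(", ", GetContent(source, "<tag1>", "</tag1>")));
        Console.WriteLine(string.Join(", ", GetContent(source, "[b]", "[/b]")));
    }
}
```

Without the escaping, a delimiter such as "[b]" would be read as a character class and silently match the wrong text.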

Extract ID and replace everything in `Example HTML`

I'm new to regular expressions. I have the following text in my HTML and would like to replace it with something else.
Example HTML:
{{Object id='foo'}}
Extract the id into a variable like this:
string strId = "foo";
So far I have the following Regular Expression code that will capture the Example HTML:
string strStart = "Object";
string strFind = "{{(" + strStart + ".*?)}}";
Regex regExp = new Regex(strFind, RegexOptions.IgnoreCase);
Match matchRegExp = regExp.Match(html);
while (matchRegExp.Success)
{
//At this point, I have this variable:
//{{Object id='foo'}}
//I can find the id='foo' (see below)
//but not sure how to extract 'foo' and use it
string strFindInner = "id='(.*?)'"; //"{{Slider";
Regex regExpInner = new Regex(strFindInner, RegexOptions.IgnoreCase);
Match matchRegExpInner = regExpInner.Match(matchRegExp.Value.ToString());
//Do something with 'foo'
matchRegExp = matchRegExp.NextMatch();
}
I understand this might be a simple solution, I am hoping to gain more knowledge about Regular Expressions but more importantly, I am hoping to receive a suggestion on how to approach this cleaner and more efficiently.
Thank you
Edit:
Is this an example that I could potentially use: c# regex replace
While I am not solving my initial question with Regular Expressions, I did move into a simpler solution using SubString, IndexOf and string.Split for the time being, I understand that my code needs to be cleaned up but thought I would post the answer that I have thus far.
string html = "<p>Start of Example</p>{{Object id='foo'}}<p>End of example</p>";
string strObject = "Object"; //Example
//When found, this will contain "{{Object id='foo'}}"
string strCode = "";
//ie: "id='foo'"
string strCodeInner = "";
//Tags will be a list, but in this example, only "id='foo'"
string[] tags = { };
//Looking for the following "{{Object "
string strFindStart = "{{" + strObject + " ";
int intFindStart = html.IndexOf(strFindStart);
//Then ending in the following
string strFindEnd = "}}";
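//Search for the end marker only after the start marker
int intFindEnd = intFindStart == -1 ? -1 : html.IndexOf(strFindEnd, intFindStart);
//Must find both Start and End conditions
if (intFindStart != -1 && intFindEnd != -1)
{
//Include the closing "}}" in the extracted code
intFindEnd += strFindEnd.Length;
strCode = html.Substring(intFindStart, intFindEnd - intFindStart);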
//Remove Start and End
strCodeInner = strCode.Replace(strFindStart, "").Replace(strFindEnd, "");
//Split by spaces, this needs to be improved if more than IDs are to be used
//but for proof of concept this is perfect
tags = strCodeInner.Split(new char[] { ' ' });
}
Dictionary<string, string> dictTags = new Dictionary<string, string>();
foreach (string tag in tags)
{
string[] tagSplit = tag.Split(new char[] { '=' });
dictTags.Add(tagSplit[0], tagSplit[1].Replace("'", "").Replace("\"", ""));
}
//At this point, I can replace "{{Object id='foo'}}" with anything I'd like
//What I don't show is that I go into the website's database,
//get the object (ie: Slider) and return the html for slider with the ID of foo
html = html.Replace(strCode, strView);
/*
"html" variable may contain:
<p>Start of Example</p>
<p id="foo">This is the replacement text</p>
<p>End of example</p>
*/
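For comparison, the regex route from the original question can be done in one pass with a named capture group and a match evaluator. This is a sketch only: it assumes the placeholder always has the form {{Object id='...'}}, and the class name, the `render` callback, and the replacement markup are stand-ins for the database lookup that the post does not show.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PlaceholderDemo
{
    // Matches {{Object id='foo'}} or {{Object id="foo"}} and captures foo
    // in the group named "id".
    private static readonly Regex Placeholder = new Regex(
        @"\{\{Object\s+id=['""](?<id>[^'""]*)['""]\s*\}\}",
        RegexOptions.IgnoreCase);

    // Replaces every placeholder with whatever render(id) returns.
    public static string ReplacePlaceholders(string html, Func<string, string> render)
    {
        return Placeholder.Replace(html, m => render(m.Groups["id"].Value));
    }

    public static void Main()
    {
        string html = "<p>Start of Example</p>{{Object id='foo'}}<p>End of example</p>";

        // Hypothetical renderer standing in for the real lookup.
        string result = ReplacePlaceholders(
            html,
            id => "<p id=\"" + id + "\">This is the replacement text</p>");

        Console.WriteLine(result);
    }
}
```

The captured id is available inside the callback, so extraction and replacement happen in the same step and there is no need for a second, inner regex.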

Replacing XElement content with XElement

Is there a way to selectively replace XElement content with other XElements?
I have this XML:
<prompt>
There is something I want to tell you.[pause=3]
You are my favorite caller today.[pause=1]
Have a great day!
</prompt>
And I want to render it as this:
<prompt>
There is something I want to tell you.<break time="3s"/>
You are my favorite caller today.<break time="1s"/>
Have a great day!
</prompt>
I need to replace the placeholders with actual XElements, but when I try to alter the content of an XElement, .NET of course escapes all of the angle brackets. I understand why the content would normally need to be correctly escaped, but I need to bypass that behavior and inject XML directly into content.
Here's my code that would otherwise work.
MatchCollection matches = Regex.Matches(content, @"\[(\w+)=(\d+)]");
foreach (XElement element in voiceXmlDocument.Descendants("prompt"))
{
if (matches[0] == null)
continue;
element.Value = element.Value.Replace(matches[0].Value, @"<break time=""5s""/>");
}
This is a work in progress, so don't worry so much about the validity of the RegEx pattern, as I will work that out later to match several conditions. This is proof of concept code and the focus is on replacing the placeholders as described. I only included the iteration and RegEx code here to illustrate that I need to be able to do this to a whole document that is already populated with content.
You can use XElement.Parse() method:
First, get the outer xml of your XElement, for example,
string outerXml = element.ToString();
Then you have exactly this string to work with:
<prompt>
There is something I want to tell you.[pause=3]
You are my favorite caller today.[pause=1]
Have a great day!
</prompt>
Then you can do your replacement
outerXml = outerXml.Replace(matches[0].Value, @"<break time=""5s""/>");
Then you can parse it back:
XElement repElement = XElement.Parse(outerXml);
And, finally, replace original XElement:
element.ReplaceWith(repElement);
The key to all of this is the XText, which allows you to work with text as an element.
This is the loop:
foreach (XElement prompt in voiceXmlDocument.Descendants("prompt"))
{
string text = prompt.Value;
prompt.RemoveAll();
foreach (string phrase in text.Split('['))
{
string[] parts = phrase.Split(']');
if (parts.Length > 1)
{
string[] pause = parts[0].Split('=');
prompt.Add(new XElement("break", new XAttribute("time", pause[1])));
// add a + "s" if you REALLY want it, but then you have to get rid
// of it later in some other code.
}
prompt.Add(new XText(parts[parts.Length - 1]));
}
}
This is the end result
<prompt>
There is something I want to tell you.<break time="3" />
You are my favorite caller today.<break time="1" />
Have a great day!
</prompt>
class Program
{
static void Main(string[] args)
{
var xml =
@"<prompt>There is something I want to tell you.[pause=3] You are my favorite caller today.[pause=1] Have a great day!</prompt>";
var voiceXmlDocument = XElement.Parse(xml);
var pattern = new Regex(@"\[(\w+)=(\d+)]");
foreach (var element in voiceXmlDocument.DescendantsAndSelf("prompt"))
{
var matches = pattern.Matches(element.Value);
foreach (var match in matches)
{
var matchValue = match.ToString();
var number = Regex.Match(matchValue, @"\d+").Value;
var newValue = string.Format(@"<break time=""{0}""/>", number);
element.Value = element.Value.Replace(matchValue, newValue);
}
}
Console.WriteLine(voiceXmlDocument.ToString());
}
}
Oh, my goodness, you guys were quicker than I expected! So, thanks for that, however in the meantime, I solved it a slightly different way. The code here looks expanded from before because once I got it working, I added some specifics into this particular condition:
foreach (XElement element in voiceXmlDocument.Descendants("prompt").ToArray())
{
// convert the element to a string and check whether there are any instances
// of pause placeholders in it
string elementAsString = element.ToString();
MatchCollection matches = Regex.Matches(elementAsString, @"\[pause=(\d+)]");
// if there were no matches or an empty set, move on to the next one
if (matches == null || matches.Count == 0)
continue;
// iterate through the indexed matches
for (int i = 0; i < matches.Count; i++)
{
int pauseValue = 0; // capture the original pause value specified by the user
int pauseMilliSeconds = 1000; // if things go wrong, use a 1 second default
if (matches[i].Groups.Count == 2) // the value is expected to be in the second group
{
// if the value could be parsed to an integer, convert it from 1/8 seconds to milliseconds
if (int.TryParse(matches[i].Groups[1].Value, out pauseValue))
pauseMilliSeconds = pauseValue * 125;
}
// replace the specific match with the new <break> tag content
elementAsString = elementAsString.Replace(matches[i].Value, string.Format(@"<break time=""{0}ms""/>", pauseMilliSeconds));
}
// finally replace the element by parsing
element.ReplaceWith(XElement.Parse(elementAsString));
}
Oh, my goodness, you guys were quicker than I expected!
Doh! Might as well post my solution anyway!
foreach (var element in xml.Descendants("prompt"))
{
Queue<string> pauses = new Queue<string>(Regex.Matches(element.Value, @"\[pause *= *\d+\]")
.Cast<Match>()
.Select(m => m.Value));
Queue<string> text = new Queue<string>(element.Value.Split(pauses.ToArray(), StringSplitOptions.None));
element.RemoveAll();
while (text.Any())
{
element.Add(new XText(text.Dequeue()));
if (pauses.Any())
element.Add(new XElement("break", new XAttribute("time", Regex.Match(pauses.Dequeue(), @"\d+").Value)));
}
}
For every prompt element, Regex match all your pauses and put them in a queue.
Then use these pauses to delimit the inner text, grab the 'other' text, and put that in a second queue.
Clear all data from the element using RemoveAll and then iterate over your delimited data and re-add it as the appropriate data type. When you are adding in the new attributes you can use Regex to get the number value out of the original match.

How to avoid large switch statements and/or regular expressions when converting code from one language to another

I have to convert a few hundred test cases written in Java to code in C#. At the moment all I could think of is define a set of regular expressions, try to match it on a line and do an action based on which regex matched.
Any better ideas (this still stinks).
An example of from and to:
Java:
Request request = new Request(testRunner)
request.setUsername("userName")
request.setPassword("password")
log.info(request.getRequest())
C#
var request = new LoginRequest(LoginParams);
request.Username = "userName";
request.Password = "password";
var LoginResponse = Account.ExecuteCall(request, pathToApi);
The source I'm trying to convert is from SoapUI and the bits of script involved are within TestSteps of a humongous XML file. Also, most of them are simply forming some sort of request and checking for a specific response so there shouldn't be too many types to implement.
What I ended up doing was define a base class (Map) that has a Pattern property, a Success indicator, and the lines of Code that result from a successful match. In some cases a line can simply be replaced by another one, but in other cases (setUsername) I need to extract content from the original script to put in the C# code. In other cases, a single line might be replaced with more than one. The transformation is all defined in the Match function.
public class SetUserName : Map
{
internal override string Pattern { get { return @"request.setUsername\(""(.*)""\)"; } }
public override void Match(string line)
{
Match match = Regex.Match(line, Pattern);
if (match.Success)
{
Success = true;
CodeLines = new Code<CodeLine>
{new CodeLine("request.Username = \"" + match.Groups[1].Value + "\"")};
}
}
}
Then I put the maps in a list ordered by occurrence and loop through each line of script:
foreach (string scriptLine in scriptLines)
{
string line = Strip(scriptLine);
if (string.IsNullOrEmpty(line) || Regex.Match(line, @"^\s+$").Success)
{
continue;
}
Map[] RegExes =
{
new Request(),
new SetUserName(),
new SetPassword(),
new RunRequest()
};
foreach (Map map in RegExes)
{
map.Match(line);
if (map.Success)
{
codeList.AddRange(map.CodeLines);
break;
}
}
}

Remove BR tag from the beginning and end of a string

How can I use something like
return Regex.Replace("/(^)?(<br\s*\/?>\s*)+$/", "", source);
to replace this cases:
<br>thestringIwant => thestringIwant
<br><br>thestringIwant => thestringIwant
<br>thestringIwant<br> => thestringIwant
<br><br>thestringIwant<br><br> => thestringIwant
thestringIwant<br><br> => thestringIwant
It can have multiple br tags at the beginning or end, but I don't want to remove any br tag in the middle.
A couple of loops would solve the issue and be easier to read and understand (use a regex, and tomorrow you will look at your own code wondering what the heck is going on):
while(source.StartsWith("<br>"))
source = source.Substring(4);
while(source.EndsWith("<br>"))
source = source.Substring(0,source.Length - 4);
return source;
Looking at your regular expression, it seems that spaces could be allowed within the br tag.
So you can try something like:
string s = Regex.Replace(input,@"\<\s*br\s*\/?\s*\>","");
There is no need to use regular expression for it
you can simply use
yourString.Replace("<br>", "");
This will remove all occurrences of <br> from your string.
EDIT:
To keep the tags in the middle of the string, use the following:
var regex = new Regex(Regex.Escape("<br>"));
var newText = regex.Replace("<br>thestring<br>Iwant<br>", "", 1);
newText = newText.Substring(0, newText.LastIndexOf("<br>"));
Response.Write(newText);
This will remove only the first and last occurrences of <br> from your string.
How about doing it in two goes so ...
result1 = Regex.Replace(source, @"^(<br\s*/?>\s*)+", "");
then feed the result of that into
result2 = Regex.Replace(result1, @"(<br\s*/?>\s*)+$", "");
It's a bit of added overhead I know but simplifies things enormously, and saves trying to counter match everything in the middle that isn't a BR.
Note the subtle difference between those two .. one matching them at start and one matching them at end. Doing it this way keeps the flexibility of keeping a regular expression that allows for the general formatting of BR tags rather than it being too strict.
if you also want it to work with
<br />
then you could use
return Regex.Replace(source, @"((?:<br\s*/?>)*<br\s*/?>$|^<br\s*/?>(?:<br\s*/?>)*)", "");
EDIT:
Now it should also take care of multiple
<br\s*/?>
in the start and end of the lines
You can write extension methods for this:
public static string TrimStart(this string value, string stringToTrim)
{
if (value.StartsWith(stringToTrim, StringComparison.CurrentCultureIgnoreCase))
{
return value.Substring(stringToTrim.Length);
}
return value;
}
public static string TrimEnd(this string value, string stringToTrim)
{
if (value.EndsWith(stringToTrim, StringComparison.CurrentCultureIgnoreCase))
{
return value.Substring(0, value.Length - stringToTrim.Length);
}
return value;
}
you can call it like
string example = "<br> some <br> test <br>";
example = example.TrimStart("<br>").TrimEnd("<br>"); //output some <br> test
I believe that one should not ignore the power of Regex. If you name the regular expression appropriately then it would not be difficult to maintain it in future.
I have written a sample program which does your task using Regex. It also ignores the character cases and white space at beginning and end. You can try other source string samples you have.
Most importantly, it would be faster.
using System;
using System.Text.RegularExpressions;
namespace ConsoleDemo
{
class Program
{
static void Main(string[] args)
{
string result;
var source = @"<br><br>thestringIwant<br><br> => thestringIwant<br/> same <br/> <br/> ";
result = RemoveStartEndBrTag(source);
Console.WriteLine(result);
Console.ReadKey();
}
private static string RemoveStartEndBrTag(string source)
{
const string replaceStartEndBrTag = @"(^(<br>[\s]*)+|([\s]*<br[\s]*/>)+[\s]*$)";
return Regex.Replace(source, replaceStartEndBrTag, "", RegexOptions.IgnoreCase);
}
}
}
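The pattern above only recognises `<br>` at the start and `<br/>` at the end. A variant that accepts `<br>`, `<br/>` and `<br />` on both sides, while leaving tags in the middle alone, might look like the sketch below. The class and method names are illustrative, and the whitespace handling (runs of spaces around the edge tags are removed too) is an assumption about what is wanted.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BrTrimmer
{
    // First alternative: a run of br tags anchored at the start of the string.
    // Second alternative: a run of br tags anchored at the end of the string.
    // Both accept <br>, <br/> and <br /> with optional surrounding whitespace.
    private static readonly Regex EdgeBreaks = new Regex(
        @"^(?:\s*<br\s*/?>)+\s*|\s*(?:<br\s*/?>\s*)+$",
        RegexOptions.IgnoreCase);

    // Removes leading and trailing br tags; tags in the middle are kept.
    public static string TrimBreaks(string source)
    {
        return EdgeBreaks.Replace(source, "");
    }

    public static void Main()
    {
        // Sample inputs based on the cases listed in the question.
        string[] samples =
        {
            "<br>thestringIwant",
            "<br><br>thestringIwant<br><br>",
            "thestringIwant<br /><BR/>",
            "<br>the<br>string<br>"
        };
        foreach (string sample in samples)
        {
            Console.WriteLine(TrimBreaks(sample));
        }
    }
}
```

Because neither alternative can match without touching the start or end of the string, a `<br>` between two pieces of text is never removed.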
