C# Regex on XML string handler

I'm experimenting with regex here; this is my first attempt.
I'm trying to extract some figures from the content of an XML tag. The content looks like this:
www.blahblah.se/maps.aspx?isAlert=true&lat=51.958855252721&lon=-0.517657021473527
I need to extract the lat and lon numerical values from each link. They will always be the same number of characters, and the lon may or may not have a "-" sign.
I thought about doing it like this (it's obviously not right, though). The string in question is in the "link" tag:
var document = XDocument.Load(e.Result);
if (document.Root == null)
    return;
var events = from ev in document.Descendants("item1")
             select new
             {
                 Title = (ev.Element("title").Value),
                 Latitude = Regex.xxxxxxx(ev.Element("link").Value, @"lat=(?<Lat>[+-]?\d*\.\d*)", String.Empty),
                 Longitude = Convert.ToDouble(ev.Element("link").Value),
             };
foreach (var ev in events)
{
    // do stuff
}
Many thanks!

Try this:
Regex.Match(ev.Element("link").Value, @"lat=(?<Lat>[+-]?\d*\.\d*)").Groups[1].Value
Example:
string ev = "www.blahblah.se/maps.aspx?isAlert=true&lat=51.958855252721&lon=-0.517657021473527";
string s = Regex.Match(ev, @"lat=(?<Lat>[+-]?\d*\.\d*)").Groups[1].Value;
Result:
"51.958855252721"

If you have a URL with GET parameters, I would be tempted to find a library that can pull that apart for you, rather than using regexps. There may be edge cases that your regexp doesn't handle, but a library built for the specific purpose would handle (it depends on the set of data that you have to parse and how well-bounded it is, of course).
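As a rough illustration of the library route, the sketch below reads lat and lon through HttpUtility.ParseQueryString. It assumes System.Web.HttpUtility is available on your platform (it is not on every .NET profile), and the names MapLink and GetCoordinate are made up for this example.

```csharp
using System;
using System.Globalization;
using System.Web;

public static class MapLink
{
    // Reads one numeric query parameter from a link.
    // Returns null when the parameter is missing or is not a number.
    public static double? GetCoordinate(string link, string name)
    {
        // The sample link has no scheme, so split at '?' instead of building a System.Uri.
        int question = link.IndexOf('?');
        if (question < 0)
            return null;

        string raw = HttpUtility.ParseQueryString(link.Substring(question + 1))[name];

        // InvariantCulture keeps "." as the decimal separator regardless of machine locale.
        double value;
        bool ok = double.TryParse(raw, NumberStyles.Float, CultureInfo.InvariantCulture, out value);
        return ok ? value : (double?)null;
    }
}
```

Calling MapLink.GetCoordinate(link, "lat") and MapLink.GetCoordinate(link, "lon") on the link from the question would give the two values as nullable doubles, so a malformed link can be skipped rather than throwing.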

Related

How to fetch particular text from a string

I have a string "(zoneId==176)&&((startTime==100)&&(endTime==1200))" from which I want to fetch the values of startTime and endTime in C#. How can I do this? I am new to C# programming, which is why I need some pointers.
That doesn't look like a String but a block of code. Assuming that is the value entered into your code, you could do the following:
var input = "(zoneId==176)&&((startTime==100)&&(endTime==1200))";
var time = input.Split(')');
var start = time.FirstOrDefault(s => s.Contains("startTime")).Split('=')[2];
var end = time.FirstOrDefault(e => e.Contains("endTime")).Split('=')[2];
Your output would be as follows: 100 and 1200
The above implementation works, but shouldn't be used for production purposes for an assortment of reasons. You'll want to focus on:
Substring
Split
Remove
Regular Expressions
These are all essential to learning how to parse data and perform other kinds of string manipulation. Hopefully this points you in the right direction.
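Of the techniques listed above, the regular-expression route could look roughly like the sketch below. It assumes the values are always non-negative integers written as name==digits; the class and method names are invented for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ExpressionValues
{
    // Returns the integer that follows "name==", or null when the name does not appear.
    public static int? GetValue(string expression, string name)
    {
        // \b stops "startTime" from matching inside a longer identifier.
        Match match = Regex.Match(expression, @"\b" + Regex.Escape(name) + @"==(\d+)");
        return match.Success ? int.Parse(match.Groups[1].Value) : (int?)null;
    }
}
```

With the input from the question, ExpressionValues.GetValue(input, "startTime") and ExpressionValues.GetValue(input, "endTime") would be the two calls to make.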
Another approach would be:
var input = "(zoneId==176)&&((startTime==100)&&(endTime==1200))";
var section = input.Split('=');
foreach (var region in section)
{
    // Keep only the digits in this piece; pieces without digits yield an empty string.
    var number = new string(region.Where(char.IsDigit).ToArray());
}

Scrape HTML for label then value in separate DIV tags

I am scraping a database of products and I am able to get all the HTML and retrieve most values as they have some unique items. However I am stuck on some areas that have common tags.
Example:
<div class="label">Name:</div><div class="value">John</div>
<div class="label">Age:</div><div class="value">24</div>
Any ideas on how I could get those labels and associated values?
I am using HTMLAgilityPack for the rest if there is something in there that may help.
Use XPath to select the div elements whose class is label and the ones whose class is value:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(yourHtml);
Dictionary<string, string> dict = new Dictionary<string, string>();
// Pair each div with class "label" with the div with class "value" at the same position
int cnt = 1;
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class='label']"))
{
    var val = doc.DocumentNode.SelectSingleNode("//div[@class='value'][" + cnt + "]").InnerText;
    if (!dict.ContainsKey(node.InnerText)) // dictionary keys must be unique
    {
        dict.Add(node.InnerText, val);
    }
    cnt++; // advance for every label so labels and values stay aligned
}
You could try this:
Int32 position = 0;
var Name1 = GetTextBetween(yourHtml, "<div class=\"label\">", "</div>", ref position);
var Value1 = GetTextBetween(yourHtml, "<div class=\"value\">", "</div>", ref position);
var Name2 = GetTextBetween(yourHtml, "<div class=\"label\">", "</div>", ref position);
var Value2 = GetTextBetween(yourHtml, "<div class=\"value\">", "</div>", ref position);
public static String GetTextBetween(String allDataToParse, String startText, String endText, ref Int32 position)
{
    // Search from the current position so repeated calls walk forward through the document.
    var indexOfStartText = allDataToParse.IndexOf(startText, position, StringComparison.Ordinal);
    if (indexOfStartText < 0) return null;
    indexOfStartText += startText.Length;
    // Look for the end marker only after the start marker.
    var indexOfEndText = allDataToParse.IndexOf(endText, indexOfStartText, StringComparison.Ordinal);
    if (indexOfEndText < 0) return null;
    position = indexOfEndText + endText.Length;
    return allDataToParse.Substring(indexOfStartText, indexOfEndText - indexOfStartText);
}
Although XPath always sounds like a great idea, when you're scraping data you can't rely on the HTML to be well formed. Many webpages break their HTML regularly to make scraping harder. Even though Mark's code looks awkward, it's actually more robust in some cases.
As sad as it sounds, you can only rely on consistency in the target document when the provider has proven reliable over a long period of time. Ideally, I'd use a regular expression to search specifically for the tags I want. Here's a good starting point:
Regular expression for extracting tag attributes
Unfortunately, only you know the exact quirks of the document you're working on. A simple solution, like the one Mark proposes, will likely work if the page you're viewing is reliable. And frankly, it's less likely to be fragile and crash unexpectedly.
If you use the HTML document parsing code that HatSoft suggests, your program may work great on most documents, but in my experience websites will throw errors randomly, change their layout unexpectedly, or sometimes your network code will only receive a partial string. Perhaps this is okay, but I'd suggest you try both approaches and see what is more reliable for you.

find string using c#?

I am trying to find a string within the string below.
http://example.com/TIGS/SIM/Lists/Team Discussion/DispForm.aspx?ID=1779
Given the string http://example.com/TIGS/SIM/Lists, how can I get the words Team Discussion from it?
Sometimes the strings will be:
http://example.com/TIGS/SIM/Lists/Team Discussion/DispForm.aspx?ID=1779
I need `Team Discussion`
http://example.com/TIGS/ALIF/Lists/Artifical Lift Discussion Forum 2/DispForm.aspx?ID=8
I need `Artifical Lift Discussion Forum 2`
If you're always following that pattern, I recommend @Justin's answer. However, if you want a more robust method, you can combine System.Uri with Path.GetDirectoryName, then perform a String.Split. Like this example:
String url = @"http://example.com/TIGS/SIM/Lists/Team Discussion/DispForm.aspx?ID=1779";
System.Uri uri = new System.Uri(url);
String dir = Path.GetDirectoryName(uri.AbsolutePath);
String[] parts = dir.Split(new[]{ Path.DirectorySeparatorChar });
Console.WriteLine(parts[parts.Length - 1]);
The only major problem, however, is that you're going to wind up with a path that's been "encoded" (i.e. your space is now going to be represented by a %20). Uri.UnescapeDataString can turn it back into plain text.
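A variation on the same idea, sketched here under the assumption that the folder you want is always the one directly containing the page, is to use Uri.Segments and decode the result. UrlFolders and GetParentFolder are names invented for this example.

```csharp
using System;

public static class UrlFolders
{
    // Returns the decoded name of the folder that contains the last path segment.
    public static string GetParentFolder(string url)
    {
        var uri = new Uri(url);

        // Segments look like "/", "TIGS/", ..., "Team%20Discussion/", "DispForm.aspx".
        string[] segments = uri.Segments;
        if (segments.Length < 2)
            return string.Empty;

        string folder = segments[segments.Length - 2].TrimEnd('/');

        // Undo the percent-encoding that System.Uri applied to the space.
        return Uri.UnescapeDataString(folder);
    }
}
```

This avoids Path.GetDirectoryName entirely, so the result does not depend on the platform's directory separator.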
The following will get you the last directory of your URL regardless of how many directories it contains:
string[] arr = s.Split('/');
string lastPart = arr[arr.Length - 2];
You could combine this solution into one line, however it would require splitting the string twice, once for the values, the second for the length.
If you wanted to see a regular expression example:
string input = "http://example.com/TIGS/SIM/Lists/Team Discussion/DispForm.aspx?ID=1779";
string given = "http://example.com/TIGS/SIM/Lists";
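// Regex.Escape keeps the dots in the given URL from acting as wildcards.
System.Text.RegularExpressions.Regex regex = new System.Text.RegularExpressions.Regex(System.Text.RegularExpressions.Regex.Escape(given) + @"\/(.+)\/");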
System.Text.RegularExpressions.Match match = regex.Match(input);
Console.WriteLine(match.Groups[1]); // Team Discussion
Here's a simple approach, assuming that your URL always has the same number of slashes before the area you want:
var value = url.Split(new[]{'/'}, StringSplitOptions.RemoveEmptyEntries)[5];
Here is another solution that provides the following advantages:
Does not require the use of regular expressions.
Does not require a certain 'count' of slashes to be present (indexing based on a specific number). I consider this a key benefit because it makes the code less likely to fail if some part of the URL changes. Ultimately, it is best to base your parsing logic on whichever part of the text's structure you consider least likely to change.
This method, however, DOES rely on the following assumptions, which I consider to be the least likely to change:
URL must have "/Lists/" right before target text.
URL must have "/" right after target text.
Basically, I just split the string twice, using text that I expect to be surrounding the area I am interested in.
String urlToSearch = "http://example.com/TIGS/SIM/Lists/Team Discussion/DispForm.aspx";
String result = "";
// First, get everything after "/Lists/"
string[] temp1 = urlToSearch.Split(new String[] { "/Lists/" }, StringSplitOptions.RemoveEmptyEntries);
if (temp1.Length > 1)
{
// Next, get everything before the first "/"
string[] temp2 = temp1[1].Split(new String[] { "/" }, StringSplitOptions.RemoveEmptyEntries);
result = temp2[0];
}
Your answer will then be stored in the 'result' variable.

extract query string from a URL string

I am reading from browser history, and when I come across a Google query I want to extract the query string. I am not using Request or HttpUtility since I am simply parsing a string. However, when I come across URLs like this one, my program fails to parse it properly:
http://www.google.com.mt/search?client=firefox-a&rls=org.mozilla%3Aen-US%3Aofficial&channel=s&hl=mt&source=hp&biw=986&bih=663&q=hotmail&meta=&btnG=Fittex+bil-Google
What I was trying to do is get the index of q= and the index of & and take the text in between, but in this case the index of the first & is smaller than the index of q=, and it gives me errors.
Any suggestions?
Thanks for your answers, all seem good :) P.S. It's not that I don't want to use HttpUtility; I couldn't. When I add a reference to System.Web, HttpUtility isn't included! It's only included in an ASP.NET application. Thanks again.
It's not clear why you don't want to use HttpUtility. You could always add a reference to System.Web and use it:
var parsedQuery = HttpUtility.ParseQueryString(input);
Console.WriteLine(parsedQuery["q"]);
If that's not an option then perhaps this approach will help:
var query = input.Split('&')
.Single(s => s.StartsWith("q="))
.Substring(2);
Console.WriteLine(query);
It splits on & and looks for the single split result that begins with "q=" and takes the substring at position 2 to return everything after the = sign. The assumption is that there will be a single match, which seems reasonable for this case, otherwise an exception will be thrown. If that's not the case then replace Single with Where, loop over the results and perform the same substring operation in the loop.
EDIT: to cover the scenario mentioned in the comments this updated version can be used:
int index = input.IndexOf('?');
var query = input.Substring(index + 1)
.Split('&')
.SingleOrDefault(s => s.StartsWith("q="));
if (query != null)
Console.WriteLine(query.Substring(2));
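When System.Web is not available at all, the split-based idea above can be extended into a small hand-rolled parser. This is only a sketch under simple assumptions: pairs are separated by &, the first occurrence of a key wins, and + stands for a space. QueryStrings and Parse are names made up for the example.

```csharp
using System;
using System.Collections.Generic;

public static class QueryStrings
{
    // Parses the part after '?' into a dictionary; the first occurrence of a key wins.
    public static Dictionary<string, string> Parse(string url)
    {
        var result = new Dictionary<string, string>(StringComparer.Ordinal);
        int question = url.IndexOf('?');
        if (question < 0)
            return result;

        // Drop any fragment so it does not end up in the last value.
        string query = url.Substring(question + 1);
        int hash = query.IndexOf('#');
        if (hash >= 0)
            query = query.Substring(0, hash);

        foreach (string pair in query.Split(new[] { '&' }, StringSplitOptions.RemoveEmptyEntries))
        {
            int equals = pair.IndexOf('=');
            string rawKey = equals < 0 ? pair : pair.Substring(0, equals);
            string rawValue = equals < 0 ? string.Empty : pair.Substring(equals + 1);

            // '+' means space in form-encoded queries; convert it before percent-decoding.
            string key = Uri.UnescapeDataString(rawKey.Replace('+', ' '));
            string value = Uri.UnescapeDataString(rawValue.Replace('+', ' '));
            if (!result.ContainsKey(key))
                result[key] = value;
        }
        return result;
    }
}
```

Because the lookup is by exact key, a parameter such as biw= or meta= cannot be mistaken for q=, which is the failure mode of searching the raw string for "q=".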
If you don't want to use System.Web.HttpUtility (thus be able to use the client profile), you can still use Mono HttpUtility.cs which is only an independent .cs file that you can embed in your application. Then you can simply use the ParseQueryString method inside the class to parse the query string properly.
Here is a solution:
string GetQueryString(string url, string key)
{
    var uri = new Uri(url);
    var queryValues = HttpUtility.ParseQueryString(uri.Query);
    // The indexer returns null when the key is missing, so fall back to an empty string.
    return queryValues[key] ?? string.Empty;
}
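Why not write code that returns the string from q= up to the next &?
For example:
int start = historyString.IndexOf("q=") + 2; // skip past "q=" (assumes it is present)
int end = historyString.IndexOf('&', start);
// Take everything to the end when q is the last parameter.
string newString = end < 0 ? historyString.Substring(start) : historyString.Substring(start, end - start);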
Cheers
Use the tools available:
String UrlStr = "http://www.google.com.mt/search?client=firefox-a&rls=org.mozilla%3Aen-US%3Aofficial&channel=s&hl=mt&source=hp&biw=986&bih=663&q=hotmail&meta=&btnG=Fittex+bil-Google";
NameValueCollection Items = HttpUtility.ParseQueryString(new Uri(UrlStr).Query);
String QValue = Items["q"];
If you really need to do the parsing yourself, and are only interested in the value for 'q' then the following would work:
string url = @"http://www.google.com.mt/search?" +
    "client=firefoxa&rls=org.mozilla%3Aen-" +
    "US%3Aofficial&channel=s&hl=mt&source=hp&" +
    "biw=986&bih=663&q=hotmail&meta=&btnG=Fittex+bil-Google";
int question = url.IndexOf("?");
if (question > -1)
{
    int qindex = url.IndexOf("q=", question);
    if (qindex > -1)
    {
        int ampersand = url.IndexOf('&', qindex);
        string token = null;
        if (ampersand > -1)
            token = url.Substring(qindex + 2, ampersand - qindex - 2);
        else
            token = url.Substring(qindex + 2);
        Console.WriteLine(token);
    }
}
But do try to look at using a proper URL parser, it will save you a lot of hassle in the future.
(Amended this answer to include a check for the '?' token and to support 'q' values at the end of the query string, without a trailing '&'.)
And that's why you should use Uri and HttpUtility.ParseQueryString.
HttpUtility is fine for the .NET Framework. However, that class is not available for WinRT apps. If you want to get the parameters from a URL in a Windows Store app, you need to use WwwFormUrlDecoder. You create an object of this class from the query string you want to read; the object is enumerable and can also be queried with lambda expressions.
Here's an example
var stringUrl = "http://localhost/?name=Jonathan&lastName=Morales";
var decoder = new WwwFormUrlDecoder(new Uri(stringUrl).Query); // pass only the query part
//Using GetFirstByName method
string nameValue = decoder.GetFirstByName("name");
//nameValue has "Jonathan"
//Using Lambda Expressions
var parameter = decoder.FirstOrDefault(p => p.Name.Contains("last")); //IWwwFormUrlDecoderEntry variable type
string parameterName = parameter.Name; //lastName
string parameterValue = parameter.Value; //Morales
You can also see http://www.dzhang.com/blog/2012/08/21/parsing-uri-query-strings-in-windows-8-metro-style-apps

Find/parse server-side <?abc?>-like tags in html document

I guess I need some regex help. I want to find all tags like <?abc?> so that I can replace each one with the result of the code run inside it. I just need help with a regex for the tag/code string, not with parsing the code inside :p.
<b><?abc print 'test' ?></b> would result in <b>test</b>
Edit: Not specifically but in general, matching (<?[chars] (code group) ?>)
This will build up a new copy of the string source, replacing <?abc code?> with the result of process(code)
Regex abcTagRegex = new Regex(@"\<\?abc(?<code>.*?)\?>");
StringBuilder newSource = new StringBuilder();
int curPos = 0;
foreach (Match abcTagMatch in abcTagRegex.Matches(source))
{
    string code = abcTagMatch.Groups["code"].Value;
    string result = process(code);
    // Copy the text between the previous tag and this one; the second argument is a length, not an end index.
    newSource.Append(source.Substring(curPos, abcTagMatch.Index - curPos));
    newSource.Append(result);
    curPos = abcTagMatch.Index + abcTagMatch.Length;
}
newSource.Append(source.Substring(curPos));
source = newSource.ToString();
N.B. I've not been able to test this code, so some of the functions may be slightly the wrong name, or there may be some off-by-one errors.
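The same replacement can probably be written more compactly with the Regex.Replace overload that takes a MatchEvaluator. The sketch below assumes the process step can be passed in as a Func<string, string>; TagExpander and Expand are hypothetical names, and RegexOptions.Singleline is an added assumption so that code inside a tag may span several lines.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagExpander
{
    // Singleline lets '.' match newlines, so the code inside a tag may span lines.
    private static readonly Regex AbcTag =
        new Regex(@"<\?abc(?<code>.*?)\?>", RegexOptions.Singleline);

    // Replaces every <?abc ... ?> tag with whatever 'process' returns for its trimmed code.
    public static string Expand(string source, Func<string, string> process)
    {
        return AbcTag.Replace(source, match => process(match.Groups["code"].Value.Trim()));
    }
}
```

Regex.Replace handles the copying of the text between matches, so there is no manual position bookkeeping to get wrong.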
var regex = new Regex(@"<\?(\w+) (\w+) (.+?)\?>");
This will take this source
<b><?abc print 'test' ?></b>
and break it up like this:
Value: <?abc print 'test' ?>
SubMatch: abc
SubMatch: print
SubMatch: 'test'
These can then be sent to a method that can handle it differently depending on what the parts are.
If you need more advanced syntax handling you need to go beyond regex I believe.
I designed a template engine using ANTLR, but that's way more complex ;)
var exp = new Regex(@"<\?abc print '(.+?)' \?>");
str = exp.Replace(str, "$1");
Something like this should do the trick. Change the regexes as you see fit.
