Regex to extract info out of large html source?

Among lots of HTML source I have some elements like this:
<option value=15>Bahrain - Manama</option>
<option value=73>Bangladesh - Dhaka</option>
<option value=46>Barbados - Bridgetown</option>
<option value=285>Belarus - Minsk</option>
<option value=48>Belgium - Brussels</option>
<option value=36>Belize - Belmopan</option>
I also have a dictionary declared like Dictionary<string, int> Places = new Dictionary<string, int>();
What I want to do is extract the city name out of the HTML and use it as the key in Places, and extract the number code and use it as the int value. For the first one I would add Places.Add("Manama", 15); The country name can be ignored. The idea, though, is to scan the HTML source and add the cities automatically.
This is what I have so far:
string[] temp = htmlContent.Split('\n');
List<string> temp2 = new List<string>();
foreach (string s in temp)
{
if (s.Contains("<option value="))
{
string t = s.Replace("option value=", "");
temp2.Add(t);
}
}
This cuts out some of the text but then I more or less get stuck wondering how to extract the relevant parts from the text. It's really bad I know but I'm learning :(

Don't use a regular expression; use HtmlAgilityPack. You can then use LINQ to retrieve your option elements and build up your dictionary in a single statement:
HtmlDocument doc = new HtmlDocument();
// remove the special handling of "option", otherwise its inner text won't be parsed correctly
HtmlNode.ElementsFlags.Remove("option");
doc.Load("test.html");
var Places = doc.DocumentNode
.Descendants("option")
.ToDictionary(x => x.InnerText.Split('-')[1].Trim(),
x => int.Parse(x.Attributes["value"].Value)); // parse the code so the result is a Dictionary<string, int>
To extract the city name from the option's inner text, the above uses string.Split(), splitting on the separating -, taking the second (city) string and trimming any leading or trailing whitespace.

If the only relevant data you are looking for is inside the option elements, you can split the source on the opening tag:
string[] options = Regex.Split(theSource, "<option value="); // splits up the source downloaded from the url; every piece after the first starts with the number
That way you are instantly faced with an array of strings whose first few characters are your int. For each options[x] with x >= 1, read characters until you reach the closing >:
int y = 0; // pointer into options[x]
string theString = "";
while (options[x].Substring(y, 1) != ">") // check to see if the number has finished
{
theString += options[x].Substring(y, 1);
y++;
}
int theInt = int.Parse(theString);
If the numbers are always 10 or more, i.e. at least 2 characters long, you can start with theString = options[x].Substring(0, 2) and y = 2, which saves two iterations of the loop.
Then I would re-use the string theString for the city name:
string[] place = Regex.Split(options[x], " - "); // place[1] starts with the city name
theString = place[1].Substring(0, place[1].IndexOf('<')); // cut it off at the closing </option> tag
And then add them with
Places.Add(theString, theInt);
This should work. If the code doesn't compile straight away, the algorithm still holds; just make sure the spelling is right and that the variables are doing what they should.
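Since the question asked for a regex, here is a minimal sketch of that route, assuming the markup is always as regular as in the sample (one option per element, an unquoted or quoted numeric value, and " - " between country and city). OptionParser and ExtractPlaces are names made up for this illustration, and a real HTML parser remains the safer choice for anything messier.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class OptionParser
{
    // Matches e.g. <option value=15>Bahrain - Manama</option>
    // Group 1: the numeric code. Group 2: the text after the first " - " (the city).
    private static readonly Regex OptionPattern = new Regex(
        @"<option\s+value=""?([0-9]+)""?\s*>[^<]*?\s-\s([^<]+?)\s*</option>",
        RegexOptions.IgnoreCase);

    public static Dictionary<string, int> ExtractPlaces(string html)
    {
        var places = new Dictionary<string, int>();
        foreach (Match m in OptionPattern.Matches(html))
        {
            string city = m.Groups[2].Value;
            int code = int.Parse(m.Groups[1].Value);
            places[city] = code; // the indexer overwrites instead of throwing if a city repeats
        }
        return places;
    }
}
```

Calling OptionParser.ExtractPlaces(htmlContent) would then give a dictionary in which "Manama" maps to 15 for the sample above. Options without a " - " in their text are simply skipped.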

Related

C# creating an object using an array as the basis?

This is more a question of whether this is possible.
I have an input box, 6 items go into the input box, this is an example string that forms the array:
Monday, Tuesday, April, February, Tomorrow, 42
These words can change, but their order is important. They are separated by a tab.
I want the 1st, 3rd, and the 6th word from this array. I would like to place them into an object and, if at all possible, put other items from other sources into that object in a particular order, so that I can then refer back to that object and do not have to write out long sections of code each time I need to output these 3 items.
My current code is unwieldy and I am looking for a better solution.
For reference my current code:
string phrase = value.Text;
string[] words = phrase.Split('\t');
string Word1 = words[1];
string Word2 = words[3];
string Word3 = words[6];
this.Output.Text = Word1 + '\t';
this.Output.Text += TextBox1.Text + '\t';
this.Output.Text += Word2 + '\t';
this.Output.Text += TextBox2.Text + '\t';
this.Output.Text += Word3;
This code works, but I am starting to work with larger arrays, requiring larger outputs and I am finding that I need to refer back to the same output multiple times.
Imagine the above code running to Word12, from an array of 30 adding the information from 6 text boxes, and having to have that output created 15 times in different places in the program. Also, you can see that the length of the code stops making sense.
If I could create an object containing all of that information, I could create it once, and then refer back to it as often as I needed.
Any insight or direction on how to proceed gratefully received.

Based on my understanding, you are looking for the solution below. If I missed something, please let me know.
First, store value.Text in an array of strings by splitting on '\t'.
Create a list of the positions of the words you want to pick.
Using those positions, pick the words and store them in a final word list.
Create a list to store the text of the text boxes.
Loop over the text-box list and insert each entry at alternate positions in the final word list.
Finally, join the word list with '\t' and show it as the output.
Below is the code:
string finalOutput = string.Empty;
List<string> wordsList = new List<string>();
string phrase = value.Text;// "Monday\tTuesday\tApril\tFebruary\tTomorrow\t42";
string[] words = phrase.Split('\t');
List<int> wordsIndexes = new List<int> { 1, 3, 6 }; // 1-based positions, based on your requirement
List<string> textBoxesText = new List<string>() { TextBox1.Text, TextBox2.Text };
wordsIndexes.ForEach(i => wordsList.Add(words[i-1]));
int insertAtIndex = 1;
for (int i = 0; i < textBoxesText.Count; i++)
{
if (wordsList.Count >= insertAtIndex)
{
wordsList.Insert(insertAtIndex, textBoxesText[i]);
insertAtIndex += 2;
}
}
finalOutput = string.Join("\t", wordsList); // the string-separator overload is also available on .NET Framework

Not sure if I understand correctly, but I think you could use a list and add the words to it, using a list of positions like so:
string phrase = value.Text;
string[] words = phrase.Split('\t');
List<int> indexes = new List<int> { 1, 3, 6 }; // 1-based positions; you can add more here...
List<string> wordsList = new List<string>();
indexes.ForEach(i => wordsList.Add(words[i - 1])); // arrays are 0-based, so subtract 1
With this implementation, you have all the words you need in the list, and you can easily change the selection by adding or removing positions. You can refer back to the list whenever you need to.
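To get the "create it once, refer back to it" object the question asks for, the list can be wrapped in a small class. This is only a sketch: SelectedWords is a hypothetical name, and it assumes the extra values (the text box contents) should be interleaved after each picked word, as in the question's output code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical container: built once, then reused wherever the output is needed.
public sealed class SelectedWords
{
    private readonly List<string> _parts = new List<string>();

    // positions are 1-based, matching "1st, 3rd, 6th" in the question.
    // extras are interleaved after the picked words; surplus extras are ignored.
    public SelectedWords(string phrase, IEnumerable<int> positions, IEnumerable<string> extras)
    {
        string[] words = phrase.Split('\t');
        List<string> picked = positions.Select(p => words[p - 1]).ToList();
        List<string> extra = extras.ToList();

        for (int i = 0; i < picked.Count; i++)
        {
            _parts.Add(picked[i]);
            if (i < extra.Count)
            {
                _parts.Add(extra[i]); // word, extra, word, extra, ...
            }
        }
    }

    // The tab-separated output line, built from the stored parts.
    public override string ToString()
    {
        return string.Join("\t", _parts);
    }
}
```

With the example input, new SelectedWords(value.Text, new[] { 1, 3, 6 }, new[] { TextBox1.Text, TextBox2.Text }) can be created once and its ToString() assigned to Output.Text in each of the places that need it.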

auto detect tag within a text

Is there any library or algorithm that can automatically detect tags in a text (ignoring the common words of the chosen language)?
Something like this:
string[] keywords = GetKeyword("Your order is num #0123456789");
and keywords[] would contain "order" and "#0123456789" ...?
Does it exist? Or does the user have to select all the tags of every document by hand every time?

foreach(string keyword in keywords) { // where keywords is a List<string>
if ("Your order is num #0123456789".Contains(keyword)) {
keywordsPresent.Add(keyword); // where keywordsPresent is a List<string>
}
}
return keywordsPresent;
The above does not cater for your #0123456789; for that, add some more logic to find the index of the # or something similar.

Sorry, I misunderstood the question. If you want to look for specific words, the algorithm will depend on your strings. For example, you can use string.Split() to generate an array of words from one string, and then work with that, like this (needs using System.Linq):
string[] words = "Your order is num #0123456789".Split(' ');
string orderNumber = "";
if (words.Contains("order") && words.Any(w => w.StartsWith("#")))
{
orderNumber = words.First(w => w.StartsWith("#"));
}
This will first generate an array of words from "Your order is num #0123456789"; then, if it contains the word "order" and a word that starts with "#", it will select that word.

I think a lot of different algorithms can be used. Some of them are simple, others are very complex. I can suggest the following basic approach:
Split all the text into an array of words.
Remove stop words from the array. (Google "stop words list" to get a full list of stop words.)
Walk through the array and count the occurrences of each word.
Sort the words by their 'weight' (count) in the array.
Choose the required number of tags.
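The steps above can be sketched roughly as follows. TagExtractor and GetKeywords are hypothetical names, and the stop-word list is a tiny made-up sample (it even includes "num", purely so the question's example yields the two tags it asks for); a real list would be much longer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper illustrating the frequency-based approach.
public static class TagExtractor
{
    // Illustrative stop words only; replace with a full list for real use.
    private static readonly HashSet<string> StopWords = new HashSet<string>(
        new[] { "a", "an", "and", "is", "of", "the", "to", "your", "num" },
        StringComparer.OrdinalIgnoreCase);

    public static List<string> GetKeywords(string text, int maxTags)
    {
        char[] separators = { ' ', '\t', '\r', '\n', ',', '.', ';', ':', '!', '?' };

        return text
            .Split(separators, StringSplitOptions.RemoveEmptyEntries) // 1. split into words
            .Where(w => !StopWords.Contains(w))                       // 2. drop stop words
            .GroupBy(w => w, StringComparer.OrdinalIgnoreCase)        // 3. count each word
            .OrderByDescending(g => g.Count())                        // 4. sort by weight
            .Take(maxTags)                                            // 5. keep the top tags
            .Select(g => g.Key)
            .ToList();
    }
}
```

Words with equal counts keep their order of first appearance, because GroupBy preserves it and OrderByDescending is a stable sort in LINQ to Objects.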

Extracting values from a string in C#

I have the following string, from which I would like to retrieve some values:
============================
Control 127232:
map #;-
============================
Control 127235:
map $;NULL
============================
Control 127236:
I want to take only the Control numbers. Is there a way to extract them from the string above into an array like [127232, 127235, 127236]?

One way of achieving this is with regular expressions, which does introduce some complexity but will give the answer you want with a little LINQ for good measure.
Start with a regular expression to capture, within a group, the data you want:
var regex = new Regex(@"Control\s+(\d+):");
This will look for the literal string "Control" followed by one or more whitespace characters, followed by one or more numbers (within a capture group) followed by a literal string ":".
Then capture matches from your input using the regular expression defined above:
var matches = regex.Matches(inputString);
Then, using a bit of LINQ you can turn this to an array
var arr = matches.OfType<Match>()
.Select(m => long.Parse(m.Groups[1].Value))
.ToArray();
Now arr is an array of longs containing just the numbers.
Live example here: http://rextester.com/rundotnet?code=ZCMH97137

try this (assuming your string is named s and each line is made with \n):
List<string> ret = new List<string>();
foreach (string t in s.Split('\n').Where(p => p.StartsWith("Control")))
ret.Add(t.Replace("Control ", "").Replace(":", ""));
The ret.Add(...) part is not elegant, but it works...
EDITED:
If you want an array use string[] arr = ret.ToArray();
SYNOPSIS:
I see you're new to this, so I'll try to explain:
s.Split('\n') creates a string[] (every line in your string)
.Where(...) part extracts from the array only strings starting with Control
foreach part navigates through returned array taking one string at a time
t.Replace(..) cuts unwanted string out
ret.Add(...) finally adds searched items into returning list

Off the top of my head try this (it's quick and dirty), assuming the text you want to search is in the variable 'text':
List<string> numbers = System.Text.RegularExpressions.Regex.Split(text, "[^\\d+]").ToList();
numbers.RemoveAll(item => item == "");
The first line splits out all the numbers into separate items in a list. It also produces lots of empty strings, so the second line removes those, leaving you with a list of the three numbers. If you want to convert that back to an array, just add the following line to the end:
var numberArray = numbers.ToArray();

Yes, there is a way. I can't recall a built-in shortcut for it, but the string can be parsed to extract these values. The algorithm is as follows:
Find the word "Control" in the string, and where it ends
Find the group of digits after the word
Extract the number with int.Parse or TryParse
If not at the end of the string, go back to step one
Implementing this algorithm is almost trivial.
This is the simplest implementation (your string is str); it finds each ':' and walks back over the digits in front of it:
int i, number, index = 0;
while ((index = str.IndexOf(':', index)) != -1)
{
i = index - 1;
while (i >= 0 && char.IsDigit(str[i])) i--;
if (++i < index)
{
number = int.Parse(str.Substring(i, index - i));
Console.WriteLine("Number: " + number);
}
index ++;
}
Using LINQ for such a little operation is doubtful.

C# implementation of Dictionary to count occurrences of words returns duplicate words in output

I recently made a little application to read in a text file of lyrics, then use a Dictionary to calculate how many times each word occurs. However, for some reason I'm finding instances in the output where the same word occurs multiple times with a tally of 1, instead of being added onto the original tally of the word. The code I'm using is as follows:
StreamReader input = new StreamReader(path);
String[] contents = input.ReadToEnd()
.ToLower()
.Replace(",","")
.Replace("(","")
.Replace(")", "")
.Replace(".","")
.Split(' ');
input.Close();
var dict = new Dictionary<string, int>();
foreach (String word in contents)
{
if (dict.ContainsKey(word))
{
dict[word]++;
}else{
dict[word] = 1;
}
}
var ordered = from k in dict.Keys
orderby dict[k] descending
select k;
using (StreamWriter output = new StreamWriter("output.txt"))
{
foreach (String k in ordered)
{
output.WriteLine(String.Format("{0}: {1}", k, dict[k]));
}
output.Close();
timer.Stop();
}
The text file I'm inputting is here: http://pastebin.com/xZBHkjGt (it's the lyrics of the top 15 rap songs, if you're curious)
The output can be found here: http://pastebin.com/DftANNkE
A quick ctrl-F shows that "girl" occurs at least 13 different times in the output. As far as I can tell, it is the exact same word, unless there's some sort of difference in ASCII values. Yes, there are some instances on there with odd characters in place of an apostrophe, but I'll worry about those later. My priority is figuring out why the exact same word is being counted 13 different times as different words. Why is this happening, and how do I fix it? Any help is much appreciated!

Another way is to split on non words.
var lyrics = "I fly with the stars in the skies I am no longer tryin' to survive I believe that life is a prize But to live doesn't mean your alive Don't worry bout me and who I fire I get what I desire, It's my empire And yes I call the shots".ToLower();
var contents = Regex.Split(lyrics, @"[^\w'+]");
Also here's an alternative (and probably more obscure) loop
int value;
foreach (var word in contents)
{
dict[word] = dict.TryGetValue(word, out value) ? ++value : 1;
}
dict.Remove("");

If you notice, the repeat occurrences appear on a line following a word which apparently doesn't have a count.
You're not stripping out newlines, so em\r\ngirl is being treated as a different word.

String[] contents = input.ReadToEnd()
.ToLower()
.Replace(",", "")
.Replace("(", "")
.Replace(")", "")
.Replace(".", "")
.Split("\r\n ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
Works better.
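Putting the fix together with the counting loop, a minimal self-contained sketch might look like this. WordCounter and CountWords are names invented for the example, and it assumes the same punctuation is stripped as in the question; the point is only that '\r' and '\n' are treated as separators too.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper showing the corrected split plus the tally.
public static class WordCounter
{
    public static Dictionary<string, int> CountWords(string text)
    {
        // Line breaks are separators as well, so "em\r\ngirl" becomes two words.
        char[] separators = { ' ', '\t', '\r', '\n', ',', '.', '(', ')' };
        var counts = new Dictionary<string, int>();

        string[] words = text.ToLowerInvariant()
            .Split(separators, StringSplitOptions.RemoveEmptyEntries);

        foreach (string word in words)
        {
            int current;
            counts.TryGetValue(word, out current); // current stays 0 if the word is new
            counts[word] = current + 1;
        }
        return counts;
    }
}
```

Treating the punctuation as separators rather than replacing it with "" also avoids gluing two words together when a comma is not followed by a space.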

Add Trim to each word:
foreach (String word in contents.Select(w => w.Trim()))

How to check if a position in a string is empty in c#

I have strings with space-separated values and I would like to pick out the text from one index to another and save it in a variable. The strings are as follows:
John Doe Villa Grazia 323334I
I managed to store the id card (3rd column) by using:
if (line.Length > 39)
{
idCard = line.Substring(39, 46);
}
However, if I store the name and address (1st and 2nd columns) with Substring there will be trailing spaces, since they are not all of the same length (unlike the ID cards). How can I store these 2 values while removing the unnecessary spaces BUT keeping the spaces between name and surname?

Try this:
string line = "  John Doe        Villa Grazia           323334I";
string name = line.Substring(02, 16).Trim();
string address = line.Substring(18, 23).Trim();
string id = line.Substring(41, 07).Trim();

var values = line.Split(' ');
string name = values[0] + " " + values[1];
string idCard = values[4];
This only works if the earlier columns always contain the same number of spaces. If names or addresses can have any number of words, it is impossible to split them reliably without looking the names up in a database.

Are these actually space separated or are they really fixed-width columns?
By that I mean: do the "columns" start at the same index into the string in each case? From the way you're describing the data it sounds like the latter, i.e. the ID column always starts at index 39 and is 7 characters long.
In which case you need to a) pull the columns using the appropriate Substring calls, as you're already doing, and then b) use "string ".Trim() to cut off the spaces.
If the rows are, as it seems, fixed width then you don't want to use Split at all.

How can you even get the ID like that, when everything in front of it is of variable length? If that was used for my name, "David Hedlund 323334I", the ID would start at pos 14, not 39.
Try this more dynamic approach:
var name = str.Substring(0, str.LastIndexOf(" "));
var id = str.Substring(str.LastIndexOf(" ")+1);

Looks like your parsing strategy will cause you a lot of trouble. You shouldn't count on the string's size in order to parse it.
Why not save the data in CSV format (John Doe, Villa Grazia, 323334I)?
that way, you can assume that each "column" will be separated by a comma which will make your parsing efforts easier.

Possible "DOH!" question, but are you sure they are spaces and not tabs? It looks like it could be a tab-separated file.
Also, for brownie points, you could use String.Empty instead of "" when comparing against an empty string; some find it more readable, though the two are equivalent in practice.

The first approach would be - as already mentioned - a CSV-like structure with a defined token as the field separator.
The second one would be fixed field lengths so you know the first column goes from char 1 to char 20, the second column from char 21 to char 30, and so on.
There is nothing bad about this concept besides that the human readability may be poor if the columns are filled up to their maximum so no spaces remain between them.
You could write a helper function or class which knows about the field lengths and provides an index-based, fault-tolerant access to the particular column. This function would extract the particular string parts and remove the leading and trailing spaces but leave the spaces in between as they are.
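A helper along those lines could look like the sketch below. FixedWidth.Field is a hypothetical name, and the column positions used afterwards (name at 0, address at 16, ID at 39) are assumptions based on the question, not known facts about the file.

```csharp
using System;

// Hypothetical helper for fault-tolerant access to fixed-width columns.
public static class FixedWidth
{
    // Returns the column that starts at index `start` and is at most `length`
    // characters wide, with leading and trailing spaces removed. Inner spaces
    // are kept. Returns "" if the line is too short to contain the column.
    public static string Field(string line, int start, int length)
    {
        if (line == null || start < 0 || length <= 0 || start >= line.Length)
        {
            return string.Empty;
        }
        int available = Math.Min(length, line.Length - start);
        return line.Substring(start, available).Trim();
    }
}
```

Under those assumed positions, FixedWidth.Field(line, 0, 16) would give the name, FixedWidth.Field(line, 16, 23) the address and FixedWidth.Field(line, 39, 7) the ID card, and a short or empty line yields empty strings instead of an exception.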

If your values have fixed width, it is best not to split the string but to use the right indexes into it. Note that the second argument of Substring is a length, not an end index.
const string input = "John Doe        Villa Grazia           323334I";
var name = input.Substring(0, 16).TrimEnd();
var place = input.Substring(16, 23).TrimEnd();
var cardId = input.Substring(39).TrimEnd();
Assuming your values cannot contain two consecutive spaces, we can maybe use "  " (a double space) as a separator.
The following code will split your string on the double space:
const string input = "John Doe        Villa Grazia           323334I";
var entries = input.Split(new[]{"  "}, StringSplitOptions.RemoveEmptyEntries)
.Select(s=>s.Trim()).ToArray();
string name = entries[0];
string place = entries[1];
string idCard = entries[2];
