I made a program that modifies a local html page and replaces a certain value in it as you can see here:
string j = Microsoft.VisualBasic.Interaction.InputBox("Replace " + listView2.SelectedItems[0].Text + "/" + intlstv2.ToString() + " With?");
var links = webBrowser1.Document.GetElementsByTagName("td");
foreach (HtmlElement lnk in links)
{
    if (lnk.GetAttribute("className") == "point" && lnk.InnerText == listView2.SelectedItems[0].Text || lnk.GetAttribute("className") == "point tekort" && lnk.InnerText == listView2.SelectedItems[0].Text)
    {
        MessageBox.Show(lnk.InnerText);
        MessageBox.Show("Replacing with: " + j.ToString());
        System.IO.File.WriteAllText("Fin.html", this.webBrowser1.DocumentText);
        System.IO.File.WriteAllText("Fin.html", System.IO.File.ReadAllText("Fin.html").Replace(lnk.InnerText, j));
    }
}
And in the html file:
<td class="point">14,5</td> <---- Value that I want replaced
<td class="average" title="Med">14,5</td> <---- Value that I want to keep
The value selected in listView2 is 14,5, but the problem I'm having is that 14,5 appears twice in the HTML: once in the td with class point and once in the td with class average (title Med). I would like to replace only the inner text of the point cell without changing the Med cell's inner text.
How would I do this?
You could find better success by traversing the HTML document as an object tree rather than a "blob" of string.
That said, doing this with HtmlDocument will be painful, as it doesn't offer an easy way to query by class name, attribute values, etc. You could still call GetElementsByTagName to fetch all the td elements and filter them by their class attribute. A bit of complexity, but manageable, I guess.
I usually use the HtmlAgilityPack library, which provides many more methods and objects that let you find your HTML elements with greater ease. I strongly recommend it!
I have a plain text file as below,
<body labelR={Right} LabelL={Left}> </body/> Video provides a powerful way to help you prove your point. When you click Online Video, you can paste in the embed code for the video you want to add. You can also type a keyword to search online for the video that best fits your document. <body TestR={TestRight} TestL={TestLeft}> </body/>
It is read into the file system as,
var plainText = File.ReadAllText(@"D:\TestTxt.txt");
I'm trying to figure out whether there is a way to filter out and get a list of a particular set of elements that are written in XML-like syntax. The desired outcome is as below,
A list of 2 items in this case, with,
<body labelR={Right} LabelL={Left}>
</body/>
<body TestR={TestRight} TestL={TestLeft}>
</body/>
Basically the XML elements with <body> </body>
I cannot use LINQ to XML here since this plain text content is not valid XML. I have read that RegEx might work, but I'm not sure of the proper way to use it here.
Any advice is greatly appreciated.
I think the best way to handle this is to turn your txt file into an XML file by adding a little piece of code. Then you can easily read it.
this and this will help you to do that.
using (XmlReader reader = XmlReader.Create(@"YOUFILEPATH.xml"))
{
    while (reader.Read())
    {
        if (reader.IsStartElement())
        {
            // Act only on START tags
            switch (reader.Name)
            {
                case "Key":
                    Console.WriteLine("Element tag name is: " + reader.ReadString());
                    break;
                case "Location":
                    Console.WriteLine("Your Location is : " + reader.ReadString());
                    break;
            }
        }
        Console.WriteLine("");
    }
}
A plain string-based solution could be:
var s = "<body labelR={Right} LabelL={Left}> </body/> Video provides ... your document. <body TestR={TestRight} TestL={TestLeft}> </body/>";
int start = 0;
while ((start = s.IndexOf("<body", start)) >= 0)
{
    var end = s.IndexOf("</body/>", start + "<body".Length) + "</body/>".Length;
    Console.WriteLine(s[start..end]);
    start = end;
}
This finds the next <body starting from the previous "node" (if any). Then it finds the (end of the) next </body/>.
Finally it prints the substring.
Repeat until no start marker was found, so it prints:
<body labelR={Right} LabelL={Left}> </body/>
<body TestR={TestRight} TestL={TestLeft}> </body/>
You may want to add some checks - what if the end marker is missing?
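One possible way to add such a check, as a minimal sketch only. The class and method names (BodyBlockExtractor, ExtractBodyBlocks) are made up for illustration, and the sketch assumes the same literal markers <body and </body/> as above. When an opening marker has no closing marker, it stops instead of producing a wrong substring; reporting an error instead would be an equally reasonable choice.

```csharp
using System;
using System.Collections.Generic;

public static class BodyBlockExtractor
{
    // Hypothetical helper: collects every "<body ... </body/>" block found in the text.
    public static List<string> ExtractBodyBlocks(string s)
    {
        const string startMarker = "<body";
        const string endMarker = "</body/>";
        var blocks = new List<string>();
        int start = 0;

        while ((start = s.IndexOf(startMarker, start, StringComparison.Ordinal)) >= 0)
        {
            int endMarkerIndex = s.IndexOf(endMarker, start + startMarker.Length, StringComparison.Ordinal);
            if (endMarkerIndex < 0)
            {
                // Assumption: an unterminated block is ignored rather than reported.
                break;
            }

            int end = endMarkerIndex + endMarker.Length;
            blocks.Add(s.Substring(start, end - start));
            start = end;
        }

        return blocks;
    }
}
```

Calling BodyBlockExtractor.ExtractBodyBlocks(plainText) returns the blocks as a list, so the caller decides whether to print them or process them further.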
I need to surround every word in loaded html text with a span which will uniquely identify every word. The problem is that some content is not being handled by my regex pattern. My current problems include...
1) Special HTML entities like &rdquo; &ldquo; are treated as words.
2) Currency values. e.g. $2,500 end up as "2" "500" (I need "$2,500")
3) Double hyphened words. e.g. one-legged-man. end up "one-legged" "man"
I'm new to regular expressions and after looking at various other posts have derived the following pattern that seems to work for everything except the above exceptions.
What I have so far is:
string pattern = @"(?<!<[^>]*?)\b('\w+)|(\w+['-]\w+)|(\w+')|(\w+)\b(?![^<]*?>)";
string newText = Regex.Replace(oldText, pattern, delegate(Match m) {
    wordCnt++;
    return "<span data-wordno='" + wordCnt.ToString() + "'>" + m.Value + "</span>";
});
How can I fix/extend the above pattern to cater for these problems, or should I be using a different approach altogether?
A fundamental problem that you're up against here is that html is not a "regular language". This means that html is complex enough that you are always going to be able to come up with valid html that isn't recognized by any regular expression. It isn't a matter of writing a better regular expression; this is a problem that regex can't solve.
What you need is a dedicated html parser. You could try this nuget package. There are many others, but HtmlAgilityPack is quite popular.
Edit: Below is an example program using HtmlAgilityPack. When an HTML document is parsed, the result is a tree (aka the DOM). In the DOM, text is stored inside text nodes. So something like <p>Hello World</p> is parsed into a node that represents the p tag, with a child text node that holds the "Hello World". So what you want to do is find all the text nodes in your document, and then, for each node, split the text into words and surround the words with spans.
You can search for all the text nodes using an xpath query. The xpath I have below is /html/body//*[not(self::script)]/text(), which avoids the html head and any script tags in the body.
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        var doc = new HtmlDocument();
        doc.Load(args[0]);
        var wordCount = 0;
        var nodes = doc.DocumentNode
            .SelectNodes("/html/body//*[not(self::script)]/text()");
        foreach (var node in nodes)
        {
            var words = node.InnerHtml.Split(' ');
            var surroundedWords = words.Select(word =>
            {
                if (String.IsNullOrWhiteSpace(word))
                {
                    return word;
                }
                else
                {
                    return $"<span data-wordno='{wordCount++}'>{word}</span>";
                }
            });
            // Join with a space to restore the separators removed by Split
            var newInnerHtml = String.Join(" ", surroundedWords);
            node.InnerHtml = newInnerHtml;
        }
        Console.WriteLine(doc.DocumentNode.InnerHtml);
    }
}
Fix 1) by adding "negative look-behind assertions" (?<!\&). I believe they are needed at the beginning of the 1st, 3rd, and 4th alternatives in the original pattern above.
Fix 2) by adding a new alternative |(\$?(\d+[,.])+\d+) at the end of pattern. This also handles non-dollar and decimal-pointed numbers at the same time.
Fix 3) by enhancing the (\w+['-]\w+) alternative to read instead ((\w+['-])+\w+).
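As a rough illustration of the three fixes combined, here is a sketch that only tokenizes plain text (for example the content of a single text node); it leaves out the tag-skipping lookarounds and the leading/trailing apostrophe alternatives of the original pattern. WordTokenizer and GetWords are made-up names, and the pattern is one possible variant rather than a drop-in replacement: it uses (?<!&#?) together with \b so that no part of an entity name is matched, and it puts the number alternative first because alternatives are tried left to right.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordTokenizer
{
    // 1) (?<!&#?) plus \b skips entity names such as &rdquo; or &#8221;
    // 2) \$?(?:\d+[,.])+\d+ keeps "$2,500" or "3.14" together
    // 3) (?:\w+['-])+\w+ keeps multi-hyphen words such as "one-legged-man" together
    private static readonly Regex WordPattern = new Regex(
        @"(?<!&#?)(?:\$?(?:\d+[,.])+\d+|\b(?:\w+['-])+\w+|\b\w+)");

    // Hypothetical helper: returns the matched words in document order.
    public static List<string> GetWords(string text)
    {
        var words = new List<string>();
        foreach (Match m in WordPattern.Matches(text))
        {
            words.Add(m.Value);
        }
        return words;
    }
}
```

The same pattern could be passed to Regex.Replace with the span-wrapping delegate from the question once the tag handling is taken care of, for instance by running it per text node.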
I have a method that copies selected image from an OpenFileDialog to a defined location, and I want to check if an image with the same name exists, and if so to change the name on the fly.
Here is my method:
public void SaveImage(IList<AppConfig> AppConfigs, string ImageNameFilter)
{
    string imgPath = AppConfigs[0].ConfigValue.ToString();
    Int32 i = 0;
    StringBuilder sb = new StringBuilder(selectedFileName);
    while (File.Exists(imgPath + "\\" + ImageNameFilter + selectedFileName))
    {
        sb.Insert(i, 0);
        i++;
        //ImageNameFilter += (i++).ToString();
    }
    File.Copy(selectedFile, imgPath + "\\" + ImageNameFilter + selectedFileName);
}
ImageNameFilter is a custom filter that is added at the beginning of each image and the users need this prefix to be able to recognize what the image is used for, only by seeing the prefix. selectedFileName is the name of the image taken with SafeFileName, which means it looks like this - imageName.jpeg.
There are several problems that I have with this code. Firstly, I wanted to change the name like this - imageName1.jpeg, imageName2.jpeg, imageName3.jpeg...imageName14.jpeg.., but if I'm using selectedFileName with the += everything is added, even after the .jpeg, which is not what I want. The only solution that I can think of is to use regex, but I really want to find another way.
Also, incrementing i++ and adding it with += leads to unwanted result which is :
imageName1.jpeg, imageName12.jpeg, imageName123.jpeg...imageName1234567.jpeg.
So, how can I get the result I want? The compromise I see here is to add an underscore _ right after the ImageNameFilter and then add i at the beginning of selectedFileName rather than at the end, as it is by default. But adding something to the beginning of a string is also something I don't know how to do. As you can see, I tried StringBuilder with Insert, but I don't get the expected result.
Basically you need to separate the base file name from the extension (use the helpful methods on Path to do this) before starting the loop, and then keep producing filenames. Each candidate will not be produced based on the last one (it's just based on fixed information and the current iteration count), so you don't need to involve a StringBuilder at all.
Here's one neat way to do it in two steps. First, set up the bookkeeping:
var canonicalFileName = ImageNameFilter + selectedFileName;
var baseFileName = Path.GetFileNameWithoutExtension(canonicalFileName);
var extension = Path.GetExtension(canonicalFileName);
Then do the loop -- here I'm using LINQ instead of a loop statement because I can, but there's no essential difference from a stock while loop:
var targetFileName = Enumerable.Range(1, int.MaxValue - 1)
.Select(i => Path.Combine(imgPath, baseFileName + i + extension))
.First(file => !File.Exists(file));
File.Copy(selectedFile, targetFileName);
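For comparison, the same idea written as a stock while loop might look like the sketch below. FileNamePicker, GetFreeFileName and the fileExists parameter are made-up names; the existence check is passed in as a delegate only so the logic can be tried without touching the disk (pass File.Exists in real code). Unlike the LINQ version, this sketch assumes the unnumbered name should be used when it is still free.

```csharp
using System;
using System.IO;

public static class FileNamePicker
{
    // Hypothetical helper: returns the first path in the directory that is not taken yet.
    public static string GetFreeFileName(string directory, string fileName, Func<string, bool> fileExists)
    {
        string baseName = Path.GetFileNameWithoutExtension(fileName);
        string extension = Path.GetExtension(fileName);

        // Assumption: keep the original name if nothing with that name exists.
        string candidate = Path.Combine(directory, fileName);
        int counter = 1;

        while (fileExists(candidate))
        {
            // Each candidate is built from the fixed base name and the current counter,
            // so the numbers never pile up (imageName1, imageName2, ...).
            candidate = Path.Combine(directory, baseName + counter + extension);
            counter++;
        }

        return candidate;
    }
}
```

In the SaveImage method this could be used as File.Copy(selectedFile, FileNamePicker.GetFreeFileName(imgPath, ImageNameFilter + selectedFileName, File.Exists)).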
Use Path.GetFileNameWithoutExtension to get the file name without its extension. You can use Path.GetExtension or FileInfo.Extension to get the extension as well.
I am scraping a database of products and I am able to get all the HTML and retrieve most values as they have some unique items. However I am stuck on some areas that have common tags.
Example:
<div class="label">Name:</div><div class="value">John</div>
<div class="label">Age:</div><div class="value">24</div>
Any ideas on how I could get those labels and associated values?
I am using HTMLAgilityPack for the rest if there is something in there that may help.
Use XPath to get the divs with class label, then read the value div that follows each one:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(yourHtml);
Dictionary<string, string> dict = new Dictionary<string, string>();
// Collect each label div and the value div that follows it into the dictionary
var labels = doc.DocumentNode.SelectNodes("//div[@class='label']");
if (labels != null) // SelectNodes returns null when nothing matches
{
    foreach (HtmlNode node in labels)
    {
        // Take the first following sibling with class "value", so pairs stay aligned
        var val = node.SelectSingleNode("following-sibling::div[@class='value'][1]");
        if (val != null && !dict.ContainsKey(node.InnerText)) // dictionary keys must be unique
        {
            dict.Add(node.InnerText, val.InnerText);
        }
    }
}
You could try this:
int position = 0;
var Name1 = GetTextBetween(yourHtml, "<div class=\"label\">", "</div>", ref position);
var Value1 = GetTextBetween(yourHtml, "<div class=\"value\">", "</div>", ref position);
var Name2 = GetTextBetween(yourHtml, "<div class=\"label\">", "</div>", ref position);
var Value2 = GetTextBetween(yourHtml, "<div class=\"value\">", "</div>", ref position);

// Returns the text between startText and endText, searching from position.
// On success, position is moved just past endText; returns null when a marker is missing.
public static string GetTextBetween(string allDataToParse, string startText, string endText, ref int position)
{
    var indexOfStartText = allDataToParse.IndexOf(startText, position);
    if (indexOfStartText < 0) return null;
    var contentStart = indexOfStartText + startText.Length;
    var indexOfEndText = allDataToParse.IndexOf(endText, contentStart);
    if (indexOfEndText < 0) return null;
    position = indexOfEndText + endText.Length;
    return allDataToParse.Substring(contentStart, indexOfEndText - contentStart);
}
Although XPath always sounds like a great idea, when you're scraping data you can't rely on the HTML to be well formed. Many webpages break their HTML regularly to make scraping harder. Even though Mark's code looks awkward, it's actually more robust in some cases.
As sad as it sounds, you can only rely on consistency in the target document when the provider has proven reliable over a long period of time. Ideally, I'd use a regular expression to search specifically for the tags I want. Here's a good starting point:
Regular expression for extracting tag attributes
Unfortunately, only you know the exact quirks of the document you're working on. A simple solution, like the one Mark proposes, will likely work if the page you're viewing is reliable. And frankly, it's less likely to be fragile and crash unexpectedly.
If you use the HTML document parsing code that HatSoft suggests, your program may work great on most documents, but in my experience websites will throw errors randomly, change their layout unexpectedly, or sometimes your network code will only receive a partial string. Perhaps this is okay, but I'd suggest you try both approaches and see what is more reliable for you.
I want to parse the following XML
XmlElement costCenterElement2 = doc.CreateElement("CostCenter");
costCenterElement2.InnerXml =
"<CostCenterNumber>2</CostCenterNumber> <CostCenter>" +
"G&A: Fin & Acctng" +
"</CostCenter>";
but I get an XmlException:
An error occurred while parsing EntityName.
Yeah, a bare & is not valid in XML and needs to be escaped as &amp;.
The other special characters and their escapes:
< - &lt;
> - &gt;
" - &quot;
' - &apos;
The following should work:
XmlElement costCenterElement2 = doc.CreateElement("CostCenter");
costCenterElement2.InnerXml =
    "<CostCenterNumber>2</CostCenterNumber> <CostCenter>" +
    "G&amp;A: Fin &amp; Acctng" +
    "</CostCenter>";
However, you really should be creating the CostCenterNumber and CostCenter as elements and not as InnerXml.
private string SanitizeXml(string source)
{
    if (string.IsNullOrEmpty(source))
    {
        return source;
    }
    if (source.IndexOf('&') < 0)
    {
        return source;
    }
    StringBuilder result = new StringBuilder(source);
    // Protect entities that are already escaped with temporary placeholders
    result = result.Replace("&lt;", "<>lt;")
                   .Replace("&gt;", "<>gt;")
                   .Replace("&amp;", "<>amp;")
                   .Replace("&apos;", "<>apos;")
                   .Replace("&quot;", "<>quot;");
    // Escape the remaining bare ampersands
    result = result.Replace("&", "&amp;");
    // Restore the protected entities
    result = result.Replace("<>lt;", "&lt;")
                   .Replace("<>gt;", "&gt;")
                   .Replace("<>amp;", "&amp;")
                   .Replace("<>apos;", "&apos;")
                   .Replace("<>quot;", "&quot;");
    return result.ToString();
}
Updated:
@thabet, if the string "<CostCenterNumber>...G&A: Fin & Acctng</CostCenter>" is coming in as a parameter, and it's supposed to represent XML to be parsed, then it has to be well-formed XML to start with. In the example you gave, it isn't. & signals the start of an entity reference, is followed by an entity name, and is terminated by ;, which never appears in the string above.
If you are given that whole string as a parameter, some of which is markup that must be parsed (i.e. the start/end tags), and some of which may contain markup that should not be parsed (i.e. the &), there is no clean and reliable way to "escape" the latter and not escape the former. You could replace all & characters with &amp;, but in doing so you might accidentally turn an existing entity reference such as &lt; into &amp;lt;, and your resulting content would be wrong. If this is your situation, that you are receiving input "XML" where markup is mixed with unparseable text, the best recourse is to tell the person from whom you are getting the XML that it's not well-formed and they need to fix their output. There are ways for them to do that that are not difficult with standard XML tools.
If on the other hand you have
<CostCenterNumber>2</CostCenterNumber>
<CostCenter>...</CostCenter>
separately from the passed string, and you need to plug in the passed string as the text content of the child <CostCenter>, and you know it is not to be parsed (does not contain elements), then you can do this:
create <CostCenterNumber> and <CostCenter> as elements
make them children of the parent <CostCenter>
set CostCenterNumber's text content using InnerXml, assuming there is no risk of markup in there: eltCCN.InnerXml = "2";
create for the child CostCenter element a Text node child whose value is the passed string: textCC = doc.CreateTextNode(argStr);
assign that text node as a child of the child CostCenter element: eltCC.AppendChild(textCC);
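The steps above can be sketched roughly as follows. CostCenterBuilder and Build are made-up names, the variable names follow the ones used in the steps, and for simplicity the sketch uses a text node for the number as well instead of InnerXml.

```csharp
using System;
using System.Xml;

public static class CostCenterBuilder
{
    // Hypothetical helper: builds <CostCenter><CostCenterNumber/><CostCenter/></CostCenter>
    // from plain strings, without parsing them as markup.
    public static XmlElement Build(XmlDocument doc, string number, string argStr)
    {
        XmlElement parent = doc.CreateElement("CostCenter");

        XmlElement eltCCN = doc.CreateElement("CostCenterNumber");
        eltCCN.AppendChild(doc.CreateTextNode(number));
        parent.AppendChild(eltCCN);

        XmlElement eltCC = doc.CreateElement("CostCenter");
        // The text node stores the raw string; escaping happens when the XML is serialized.
        XmlText textCC = doc.CreateTextNode(argStr);
        eltCC.AppendChild(textCC);
        parent.AppendChild(eltCC);

        return parent;
    }
}
```

With var elt = CostCenterBuilder.Build(new XmlDocument(), "2", "G&A: Fin & Acctng"), the child's InnerText is still the original string, while elt.OuterXml contains the ampersands written as &amp;.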