String not getting decoded - c#

I have a DevExpress report, and the data source has a field whose value comes through like this:
PRODUCT - APPLE<BR/>ITEM NUMBER - 23454</BR>LOT NUMBER 3343 <BR/>
That is exactly how it shows up in the cell, so I decided to decode it, but nothing works. I tried HttpUtility.HtmlDecode, and here I am trying WebUtility.HtmlDecode:
private void xrTableCell9_BeforePrint(object sender, System.Drawing.Printing.PrintEventArgs e)
{
    XRTableCell cell = sender as XRTableCell;
    string _description = WebUtility.HtmlDecode(Convert.ToString(GetCurrentColumnValue("Description")));
    cell.Text = _description;
}
How can I decode the value of this column in the data source?
Thank you

If you also need to show the description with the < /> tags visible, use HtmlEncode.
If you need to extract the text from that HTML:
public static string ExtractTextFromHtml(this string text)
{
    if (String.IsNullOrEmpty(text))
        return text;
    var sb = new StringBuilder();
    var doc = new HtmlDocument();
    doc.LoadHtml(text);
    // SelectNodes returns null when no node matches the XPath
    var nodes = doc.DocumentNode.SelectNodes("//text()");
    if (nodes == null)
        return String.Empty;
    foreach (HtmlNode node in nodes)
    {
        if (!String.IsNullOrWhiteSpace(node.InnerText))
            sb.Append(HtmlEntity.DeEntitize(node.InnerText.Trim()) + " ");
    }
    return sb.ToString();
}
This needs HtmlAgilityPack.
To remove the br tags:
var str = Convert.ToString(GetCurrentColumnValue("Description"));
// Regex.Replace returns a new string, so assign the result
str = Regex.Replace(str, @"</?\s?br\s?/?>", System.Environment.NewLine, RegexOptions.IgnoreCase);
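The value in the question contains real tags rather than HTML entities, so HtmlDecode has nothing to decode, which is why the replacement above is needed. A minimal sketch of that replacement as a helper, assuming the field holds the literal markup shown in the question; the names BreakTagCleaner and ReplaceBreakTags are made up for illustration and are not part of DevExpress or .NET:

```csharp
using System;
using System.Text.RegularExpressions;

public static class BreakTagCleaner
{
    // Hypothetical helper: turns <br>, <br/>, <BR /> and the malformed </BR>
    // into line breaks, then trims leading/trailing whitespace.
    public static string ReplaceBreakTags(string input)
    {
        if (string.IsNullOrEmpty(input))
            return input;

        string withNewLines = Regex.Replace(
            input,
            @"</?\s*br\s*/?>",
            Environment.NewLine,
            RegexOptions.IgnoreCase);

        return withNewLines.Trim();
    }
}
```

In the BeforePrint handler this could be used as cell.Text = BreakTagCleaner.ReplaceBreakTags(Convert.ToString(GetCurrentColumnValue("Description")));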

Related

Remove accents from a text file

I have issues with removing accents from a text file: the program replaces characters that have diacritics with ?. Here is my code:
private void button3_Click(object sender, EventArgs e)
{
if (radioButton3.Checked)
{
byte[] tmp;
tmp = System.Text.Encoding.GetEncoding("ISO-8859-1").GetBytes(richTextBox1.Text);
richTextBox2.Text = System.Text.Encoding.UTF8.GetString(tmp);
}
}
Taken from here: https://stackoverflow.com/a/249126/3047078
static string RemoveDiacritics(string text)
{
var normalizedString = text.Normalize(NormalizationForm.FormD);
var stringBuilder = new StringBuilder();
foreach (var c in normalizedString)
{
var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
if (unicodeCategory != UnicodeCategory.NonSpacingMark)
{
stringBuilder.Append(c);
}
}
return stringBuilder.ToString().Normalize(NormalizationForm.FormC);
}
usage:
string result = RemoveDiacritics("včľťšľžšžščýščýťčáčáčťáčáťýčťž");
results in vcltslzszscyscytcacactacatyctz
richTextBox1.Text = "včľťšľžšžščýščýťčáčáčťáčáťýčťž";
string text1 = richTextBox1.Text.Normalize(NormalizationForm.FormD);
string pattern = @"\p{M}";
string text2 = Regex.Replace(text1, pattern, "");
richTextBox2.Text = text2;
First normalize the string.
Then use a regular expression to remove all diacritics. The pattern \p{M} matches the Unicode category of all combining (diacritic) marks.
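The two steps can be wrapped in one method. This is only a sketch; DiacriticStripper and Strip are illustrative names, and the final recomposition to FormC is an addition borrowed from the RemoveDiacritics answer earlier in this thread:

```csharp
using System.Text;
using System.Text.RegularExpressions;

public static class DiacriticStripper
{
    // Hypothetical helper: decomposes the text, drops combining marks,
    // then recomposes whatever is left.
    public static string Strip(string text)
    {
        if (string.IsNullOrEmpty(text))
            return text;

        // FormD splits a letter such as "č" into "c" plus a combining caron.
        string decomposed = text.Normalize(NormalizationForm.FormD);

        // \p{M} matches every combining mark; replace each with nothing.
        string stripped = Regex.Replace(decomposed, @"\p{M}", string.Empty);

        return stripped.Normalize(NormalizationForm.FormC);
    }
}
```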

Reading a txt file line by line, then writing all lines to the RichTextBox

When I click the button, it should take these links from the c:\text.txt file and write them into my RichTextBox.
In my text.txt:
en.wikipedia.org/wiki/Extreme_programming
en.wikipedia.org/wiki/Boolean_algebra
en.wikipedia.org/wiki/Microsoft_Visual_Studio
en.wikipedia.org/wiki/Web_crawler
(there are no empty rows between the links)
After that I want to pass those links, line by line, to my other button's handler, which parses their HTML and writes it to another RichTextBox.
Here is my parse button code:
private void button2_Click(object sender, EventArgs e)
{
string s = KaynakKodunuCek("http://tr.wikipedia.org/wiki/Lale");
// Gets everything between the <p ... > </p> tags (tags included).
Regex regex = new Regex("<p[^>]*>.*?</p>");
string gelen = s;
string inside = null;
Match match = regex.Match(gelen);
if (match.Success)
{
inside = match.Value;
richTextBox3.Text = inside;
}
string outputStr = "";
foreach (Match ItemMatch in regex.Matches(gelen))
{
Console.WriteLine(ItemMatch);
inside = ItemMatch.Value;
// leave a gap and continue on the next line
outputStr += inside + "\r\n";
}
richTextBox3.Text = outputStr;
}
Here is where I want to pass the links: string s = KaynakKodunuCek("here");
Or should I use a ListBox instead of a RichTextBox?
Yes, a ListBox would be better. But if you want to do it with a RichTextBox, you can use this:
string[] links = richTextBox2.Text.Split(new [] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
for(int i=0;i<links.Length;i++)
{
string s = KaynakKodunuCek(links[i]);
...
}
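For the first half of the question (loading the links from the file), a sketch that assumes one link per line. LinkFileReader and ReadLinks are illustrative names, not something from the question:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class LinkFileReader
{
    // Hypothetical helper: reads the file line by line and keeps
    // the trimmed, non-empty lines in their original order.
    public static List<string> ReadLinks(string path)
    {
        return File.ReadLines(path)
                   .Select(line => line.Trim())
                   .Where(line => line.Length > 0)
                   .ToList();
    }
}
```

The returned list can be assigned to a text box with string.Join(Environment.NewLine, links), or iterated directly and passed to KaynakKodunuCek without going through a control at all.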

Web scraper using HtmlAgilityPack

I am new to C#, so this might be very obvious or way too complex for me, but I am trying to set up and scrape a web page using the HtmlAgilityPack. Currently my code compiles, but when I write the string I only get one result, and it happens to be the last li in the ul. The reason for the string split is so I can eventually output the title and description strings into a .csv for further use. I am unsure what to do next, which is why I am asking for any help, ideas or suggestions. Thank you!
private void button1_Click(object sender, EventArgs e)
{
List<string> cities = new List<string>();
//var xpath = "//h2[span/@id='Cities']";
var xpath = "//h2[span/@id='Cities']" + "/following-sibling::ul[1]" + "/li";
WebClient web = new WebClient();
String html = web.DownloadString("http://wikitravel.org/en/Vietnam");
hap.HtmlDocument doc = new hap.HtmlDocument();
doc.LoadHtml(html);
foreach (hap.HtmlNode node in doc.DocumentNode.SelectNodes(xpath))
{
string all = node.InnerText;
//splits text between '—', '-' or ' ' into 2 parts
string[] split = all.Split(new char[] { '—', ' ', '-' }, StringSplitOptions.None);
string title;
string description;
int nodeCount;
nodeCount = node.ChildNodes.Count;
if (nodeCount == 2)
{
title = node.ChildNodes[0].InnerText;
description = node.ChildNodes[1].InnerText;
}
else if (nodeCount == 4)
{
title = node.ChildNodes[0].InnerText;
description = node.ChildNodes[1].InnerText + node.ChildNodes[2].InnerText;
}
else
{
title = "Error";
description = "The node count was not 2 or 4. Check the div section.";
}
System.IO.StreamWriter write = new System.IO.StreamWriter(@"C:\Users\cbrannin\Desktop\textTest\testText.txt");
write.WriteLine(all);
write.Close();
}
}
}
One problem is that you're overwriting the output file each time through the loop. You probably want to do this:
// Open the file once, outside the loop, so each line is appended
using (StreamWriter write = new StreamWriter(@"filename"))
{
    foreach (hap.HtmlNode node in doc.DocumentNode.SelectNodes(xpath))
    {
        // do your thing
        write.WriteLine(all);
    }
}
Also, have you single-stepped this to see if you're getting more than one HtmlNode from your SelectNodes call?
Finally, I don't see where you're doing anything with the title or description. Were you planning to use those for something else?

Remove the Text in C#.net

I have a string like,
string str;
str = "This is my new string. \"<script>\" Hi this is XYZ \"</script>\"";
Now I want to remove the text, from "<script>" to "</script>" including the tags by using C#.net code.
Thanks,
You should look at Regex.
With it you can locate the block and then delete it.
This should match everything between the script tags, as long as the content contains no < character: "<script>[^<]+</script>"
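A sketch of how such a pattern could be applied with Regex.Replace. It swaps in a slightly more permissive pattern than the one above (attributes on the opening tag, any content including < and line breaks), and the names ScriptTagRemover and RemoveScriptBlocks are invented for the example. A regex is not a full HTML parser, so treat this as adequate for simple, well-formed input only:

```csharp
using System.Text.RegularExpressions;

public static class ScriptTagRemover
{
    // Hypothetical helper: removes every <script ...>...</script> block,
    // tags included. Singleline lets '.' match line breaks, and the lazy
    // '.*?' stops at the first closing tag instead of the last one.
    public static string RemoveScriptBlocks(string input)
    {
        if (string.IsNullOrEmpty(input))
            return input;

        return Regex.Replace(
            input,
            @"<script\b[^>]*>.*?</script\s*>",
            string.Empty,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }
}
```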
This is what I use to remove HTML tags from a string:
public static string ClearHtmlTags(string html)
{
if (string.IsNullOrWhiteSpace(html))
return html;
html = html.Trim();
string[] hs = html.Split("<>".ToArray());
bool skip = false;
StringBuilder sb = new StringBuilder();
foreach (string s in hs)
{
if (!skip)
sb.Append(s);
skip = !skip;
}
return sb.ToString();
}
With a simple modification you get the method you need:
public static string ClearHtmlTags(string html)
{
if (string.IsNullOrWhiteSpace(html))
return html;
html = html.Trim();
string[] hs = html.Split("<>".ToArray());
bool skip = false;
bool skipTag = false;
StringBuilder sb = new StringBuilder();
foreach (string s in hs)
{
if (!skip)
{
if (!skipTag)
sb.Append(s);
}
else
{
skipTag = s == "script";
}
skip = !skip;
}
return sb.ToString();
}
You can use something like:
Regex.Replace(inputString, "<script>([a-z]|[A-Z])*</script>", "");
Note that this only matches when the text inside the script tags is purely alphabetic (no digits, spaces or punctuation).
If you want to remove a specific piece of text, e.g. from
"my name is testing"
you want to remove "is",
use IndexOf to find it, then Substring (or Remove) to replace that part with an empty string or something else.
In C# you can filter your string like this, or use a regex before the data is entered.
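A small sketch of the IndexOf approach, removing only the first occurrence of a given piece of text. SubstringRemover and RemoveFirst are hypothetical names chosen for the example:

```csharp
using System;

public static class SubstringRemover
{
    // Hypothetical helper: removes the first occurrence of 'word' from 'text'
    // and returns the text unchanged when 'word' is not found.
    public static string RemoveFirst(string text, string word)
    {
        if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(word))
            return text;

        int index = text.IndexOf(word, StringComparison.Ordinal);
        if (index < 0)
            return text;

        return text.Remove(index, word.Length);
    }
}
```

Passing "is " (with the trailing space) for the sample string avoids leaving a double space behind.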

Remove words from string c#

I am working on an ASP.NET 4.0 web application. Its main goal is to go to the URL in the MyURL variable, read it from top to bottom, search for all lines that start with "description", and keep only those while removing all HTML tags. What I want to do next is remove the "description" text from the results afterwards, so that only my device names are left. How would I do this?
protected void parseButton_Click(object sender, EventArgs e)
{
MyURL = deviceCombo.Text;
WebRequest objRequest = HttpWebRequest.Create(MyURL);
objRequest.Credentials = CredentialCache.DefaultCredentials;
using (StreamReader objReader = new StreamReader(objRequest.GetResponse().GetResponseStream()))
{
originalText.Text = objReader.ReadToEnd();
}
//Read all lines of file
String[] crString = { "<BR> " };
String[] aLines = originalText.Text.Split(crString, StringSplitOptions.RemoveEmptyEntries);
String noHtml = String.Empty;
for (int x = 0; x < aLines.Length; x++)
{
if (aLines[x].Contains(filterCombo.SelectedValue))
{
noHtml += (RemoveHTML(aLines[x]) + "\r\n");
}
}
//Print results to textbox
resultsBox.Text = String.Join(Environment.NewLine, noHtml);
}
public static string RemoveHTML(string text)
{
text = text.Replace("&nbsp;", " ").Replace("<br>", "\n");
var oRegEx = new System.Text.RegularExpressions.Regex("<[^>]+>");
return oRegEx.Replace(text, string.Empty);
}
Ok so I figured out how to remove the words through one of my existing functions:
public static string RemoveHTML(string text)
{
text = text.Replace("&nbsp;", " ").Replace("<br>", "\n").Replace("description", "").Replace("INFRA:CORE:", "")
.Replace("RESERVED", "")
.Replace(":", "")
.Replace(";", "")
.Replace("-0/3/0", "");
var oRegEx = new System.Text.RegularExpressions.Regex("<[^>]+>");
return oRegEx.Replace(text, string.Empty);
}
public static void Main(String[] args)
{
string str = "He is driving a red car.";
Console.WriteLine(str.Replace("red", "").Replace("  ", " "));
}
Output:
He is driving a car.
Note: the first argument of the second Replace is a double space.
Link : https://i.stack.imgur.com/rbluf.png
Try this. It will remove all occurrences of the word you want to remove.
Try something like this, using LINQ:
List<string> lines = new List<string>{
"Hello world",
"Description: foo",
"Garbage:baz",
"description purple"};
//now add all your lines from your html doc.
if (aLines[x].Contains(filterCombo.SelectedValue))
{
lines.Add(RemoveHTML(aLines[x]) + "\r\n");
}
var myDescriptions = lines.Where(x=>x.ToLower().StartsWith("description"))
.Select(x=> x.ToLower().Replace("description",string.Empty)
.Trim());
// you now have "foo" and "purple", and anything else.
You may have to adjust for colons, etc.
void Main()
{
string test = "<html>wowzers description: none <div>description:a1fj391</div></html>";
IEnumerable<string> results = getDescriptions(test);
foreach (string result in results)
{
Console.WriteLine(result);
}
//result: none
// a1fj391
}
static Regex MyRegex = new Regex(
"description:\\s*(?<value>[\\d\\w]+)",
RegexOptions.Compiled);
IEnumerable<string> getDescriptions(string html)
{
foreach(Match match in MyRegex.Matches(html))
{
yield return match.Groups["value"].Value;
}
}
Adapted From Code Project
string value = "ABC - UPDATED";
int index = value.IndexOf(" - UPDATED");
if (index != -1)
{
value = value.Remove(index);
}
This leaves value as ABC, without - UPDATED.
