Regex to remove and replace characters - c#

I have the following
<option value="Abercrombie">Abercrombie</option>
My file has about 2000 rows in it, and each row has a different location. I'm trying to understand regex, but unfortunately nothing I learn sticks, and I'm unsure whether this is possible.
What I want to do is run a regex that strips the above HTML, leaving the following
Abercrombie
I then want to prefix a particular number, so the result would be, for example
2,Abercrombie
Is this possible?

Don't use a regular expression, since HTML is not a regular language. You can use LINQ to XML (XDocument) instead, provided the markup is well-formed XML. If you want to process the entire file, you can replace the elements inline:
int myNumber = 2;
var html = @"<html><body><option value=""Abercrombie"">Abercrombie</option><div><option value=""Forever21"">Forever21</option></div></body></html>";
var doc = XDocument.Load(new StringReader(html));
var options = doc.Descendants().Where(o => o.Name == "option").ToList();
foreach (var element in options)
{
element.ReplaceWith(string.Format("{0},{1}", myNumber, element.Value));
}
var result = doc.ToString();
This gives:
<html>
<body>2,Abercrombie<div>2,Forever21</div></body>
</html>
If you just want to grab the value of a specific tag (here, its value attribute), you can use the following:
int myNumber = 2;
var html = @"<option value=""Abercrombie"">Abercrombie</option>";
var doc = XDocument.Load(new StringReader(html));
var element = doc.Descendants().FirstOrDefault(o => o.Name == "option");
var attribute = element.Attribute("value").Value;
var result = string.Format("{0},{1}", myNumber, attribute);
//result == "2,Abercrombie"
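For the file described in the question (one option element per row), a minimal sketch could parse each row on its own. It assumes every row is a single well-formed element, so rows containing HTML-only entities such as &nbsp; would need a real HTML parser instead. The class and method names are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class OptionLineConverter
{
    // Turns each "<option ...>text</option>" row into "prefix,text".
    // Blank rows are skipped; a malformed row makes XElement.Parse throw an XmlException.
    public static List<string> Convert(IEnumerable<string> lines, int prefix)
    {
        return lines
            .Where(line => !string.IsNullOrWhiteSpace(line))
            .Select(line => XElement.Parse(line.Trim()))
            .Select(option => $"{prefix},{option.Value}")
            .ToList();
    }
}
```

Calling OptionLineConverter.Convert(File.ReadLines("locations.txt"), 2) would then yield rows such as 2,Abercrombie, where locations.txt is a hypothetical file name.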


How can I change a tag name in AngleSharp?

Is it possible to change the name of a tag from code? Something like this:
var tag = doc.QuerySelector("i");
tag.TagName = "em";
This won't work, because TagName is read-only.
But, what are my options for getting to the same end? Would I have to construct an entirely new tag and set the InnerHtml to the contents of the old tag, then delete and swap? Is this even possible?
If you mean to replace the elements in an HTML string, then it can be done this way:
private static string RefineImageElement(string htmlContent)
{
var parser = new HtmlParser();
var document = parser.ParseDocument(htmlContent);
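// Snapshot the collection first, since elements are removed inside the loop (needs using System.Linq).
foreach (var element in document.All.ToList())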
{
if (element.LocalName == "img")
{
var newElement = document.CreateElement("v-img");
newElement.SetAttribute("src", element.Attributes["src"] == null ? "" :
element.Attributes["src"].Value);
newElement.SetAttribute("alt", "Article Image");
element.Insert(AdjacentPosition.BeforeBegin, newElement.OuterHtml);
element.Remove();
}
}
return document.FirstElementChild.OuterHtml;
}
To change an element's name, you can replace the OuterHtml of the initial element with a combination of the new element's opening tag, the initial element's InnerHtml, and the new element's closing tag.
Here is an example:
var raw = @"<div>
<i>foo</i>
</div>";
var parser = new AngleSharp.Parser.Html.HtmlParser();
var doc = parser.Parse(raw);
var tag = doc.QuerySelector("i");
tag.OuterHtml = $"<em>{tag.InnerHtml}</em>";
Console.WriteLine(doc.DocumentElement.OuterHtml);
Output :
<html><head></head><body><div>
<em>foo</em>
</div></body></html>
I had the same question the other day. The only solution I found was to create another element, copy every attribute from the original element to it, and then remove the original element.

strip all tags from string except anchor have class videoLink c#

I am trying to strip all tags from a paragraph string, except anchor tags that have the class videoLink, using Regex.Replace. Can anybody help me out? Thanks in advance. The text is in Urdu.
Until now I have been using this function, but it deletes all tags:
public string ScrubHtml(string value)
{
var step1 = System.Text.RegularExpressions.Regex.Replace(value, @"<[^>]+>| ", "").Trim();
var Message_ = System.Text.RegularExpressions.Regex.Replace(step1, @"\s{2,}", " ");
return Message_;
}
Use a real HTML parser like HtmlAgilityPack instead of Regex.
Here is an example that gets all links from a site:
HttpClient client = new HttpClient();
var html = await client.GetStringAsync("http://google.com");
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var links = doc.DocumentNode.Descendants()
.Where(x => x.Name == "a")
.Select(x => x.GetAttributeValue("href", "")) // yields "" for anchors without an href instead of throwing
.ToList();

Insert XElement into value of another XElement

I have an XDocument object which contains XHTML and I am looking to add ABBR elements into the string. I have a List that I am looping through to look for values which need to be wrapped in ABBR elements.
Let's say I have an XElement which contains XHTML like so:
<p>Some text will go here</p>
I need to adjust the value of the XElement to look like this:
<p>Some text <abbr title="Will Description">will</abbr> go here</p>
How do I do this?
UPDATE:
I am wrapping the value "will" with the HTML element ABBR.
This is what I have so far:
// Loop through them
foreach (XElement xhtmlElement in allElements)
{
// Don't process this element if it has child elements as they
// will also be processed through here.
if (!xhtmlElement.Elements().Any())
{
string innerText = GetInnerText(xhtmlElement);
foreach (var abbrItem in AbbreviationItems)
{
if (innerText.ToLower().Contains(abbrItem.Description.ToLower()))
{
var abbrElement = new XElement("abbr",
new XAttribute("title", abbrItem.Abbreviation),
abbrItem.Description);
innerText = Regex.Replace(innerText, abbrItem.Description, abbrElement.ToString(),
RegexOptions.IgnoreCase);
xhtmlElement.Value = innerText;
}
}
}
}
The problem with this approach is that when I set the XElement Value property, it is encoding the XML tags (correctly treating it as a string rather than XML).
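If innerText contains well-formed XML, parse it and replace the element's child nodes rather than assigning to Value, which only accepts a string and escapes any markup. Because innerText is a fragment without a single root, wrap it in a temporary element first:
xhtmlElement.ReplaceNodes(XElement.Parse("<wrapper>" + innerText + "</wrapper>").Nodes());
instead of
xhtmlElement.Value = innerText;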
Alternatively, you can convert the element to a string, insert the new XML tag into that string, and then replace the old element with the parsed result.
This might be what you're looking for:
var element = new XElement("div");
var xml = "<p>Some text will go here</p>";
element.Add(XElement.Parse(xml));
//Element to replace/rewrite
XElement p = element.Element("p");
var value = p.ToString();
var newValue = value.Replace("will", "<abbr title='Will Description'>will</abbr>");
p.ReplaceWith(XElement.Parse(newValue));
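The string-based answers above re-parse markup after a text replacement. A possible alternative, sketched here under the assumption that the terms only need to be matched inside text nodes, is to split each text node and insert a real abbr element between the pieces, so nothing is escaped or re-parsed. The class, method, and parameter names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class AbbrWrapper
{
    // Wraps every case-insensitive occurrence of `term` in the text nodes under `root`
    // in an <abbr title="..."> element, keeping the original casing of the matched text.
    public static void Wrap(XElement root, string term, string title)
    {
        if (string.IsNullOrEmpty(term)) return;

        // Snapshot the text nodes first, because the tree is modified inside the loop.
        var textNodes = root.DescendantNodes().OfType<XText>().ToList();
        foreach (var text in textNodes)
        {
            var parent = text.Parent;
            // Skip text that already sits inside an <abbr>.
            if (parent == null || parent.Name.LocalName == "abbr") continue;

            string value = text.Value;
            var pieces = new List<object>();
            int start = 0, hit;
            while ((hit = value.IndexOf(term, start, StringComparison.OrdinalIgnoreCase)) >= 0)
            {
                if (hit > start) pieces.Add(new XText(value.Substring(start, hit - start)));
                // Reuse the parent's namespace so XHTML documents do not get xmlns="" on the new element.
                pieces.Add(new XElement(parent.Name.Namespace + "abbr",
                    new XAttribute("title", title),
                    value.Substring(hit, term.Length)));
                start = hit + term.Length;
            }
            if (start == 0) continue; // no occurrence in this text node
            if (start < value.Length) pieces.Add(new XText(value.Substring(start)));
            text.ReplaceWith(pieces.ToArray());
        }
    }
}
```

With p parsed from <p>Some text will go here</p>, calling AbbrWrapper.Wrap(p, "will", "Will Description") turns it into <p>Some text <abbr title="Will Description">will</abbr> go here</p>.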

HtmlAgilityPack scraping "href"

I wrote this code:
Warning: the link points to an adult site!
var getHtmlWeb = new HtmlWeb();
var document = getHtmlWeb.Load("http://xhamster.com/movies/2808613/jewel_is_a_sexy_cougar_who_loves_to_fuck_lucky_younger_guys.html");
var aTags = document.DocumentNode.SelectNodes("//div[contains(@class,'noFlash')]");
if (aTags != null)
foreach (var aTag in aTags)
{
var href = aTag.Attributes["href"].Value;
textBox2.Text = href;
}
I get an error when I try to run this program.
If I assign something else to "var href", for example:
var href = aTag.InnerHtml
I get the inner HTML, and I can see the "href=" link in it along with some other data.
But I only need the link after the href!
You are selecting div elements. A div element can't have an href attribute. If you want to get the hrefs of the anchor tags inside them, you can use:
var hrefs = aTags.Descendants("a")
.Select(node => node.GetAttributeValue("href",""))
.ToList();

How to replace all "values" in an XML document with "0.0" using C# (preferably LINQ)?

This is not a homework; I need this for my unit tests.
Sample input: <rows><row><a>1234</a><b>Hello</b>...</row><row>...</rows>.
Sample output: <rows><row><a>0.0</a><b>0.0</b>...</row><row>...</rows>.
You may assume that the document starts with <rows> and that parent node has children named <row>. You do not know the name of nodes a, b, etc.
For extra credit: how to make this work with an arbitrary well-formed, "free-form" XML?
I have tried this with a regex :) without luck. I could make it "non-greedy on the right", but not on the left. Thanks for your help.
EDIT: Here is what I tried:
private static string ReplaceValuesWithZeroes(string gridXml)
{
Assert.IsTrue(gridXml.StartsWith("<row>"), "Xml representation must start with '<row>'.");
Assert.IsTrue(gridXml.EndsWith("</row>"), "Xml representation must end with '</row>'.");
gridXml = "<deleteme>" + gridXml.Trim() + "</deleteme>"; // Fake parent.
var xmlDoc = XDocument.Parse(gridXml);
var descendants = xmlDoc.Root.Descendants("row");
int rowCount = descendants.Count();
for (int rowNumber = 0; rowNumber < rowCount; rowNumber++)
{
var row = descendants.ElementAt(0);
Assert.AreEqual<string>(row.Value /* Does not work */, String.Empty, "There should be nothing between <row> and </row>!");
Assert.AreEqual<string>(row.Name.ToString(), "row");
var rowChildren = row.Descendants();
foreach (var child in rowChildren)
{
child.Value = "0.0"; // Does not work.
}
}
// Not the most efficient but still fast enough.
return xmlDoc.ToString().Replace("<deleteme>", String.Empty).Replace("</deleteme>", String.Empty);
}
XmlDocument doc = new XmlDocument();
doc.LoadXml(xml);
foreach (XmlElement el in doc.SelectNodes("//*[not(*)]"))
el.InnerText = "0.0";
xml = doc.OuterXml;
or to be more selective about non-empty text nodes:
foreach (XmlText el in doc.SelectNodes("//text()[.!='']"))
el.InnerText = "0.0";
XDocument xml = XDocument.Load(myXmlFile);
foreach (var element in xml.Descendants("row").SelectMany(r => r.Elements()))
{
element.Value = "0.0";
}
Note that this general search for Descendants("row") is not very efficient, but it satisfies the "arbitrary format" requirement.
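For the "extra credit" case where neither the row name nor the nesting depth is known, one possible LINQ to XML sketch is to overwrite every element that has no child elements. Note that this also fills empty leaf elements with the replacement text. The class and method names are hypothetical.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class XmlValueZeroer
{
    // Replaces the text of every leaf element (one without child elements) with `replacement`.
    public static string ReplaceLeafValues(string xml, string replacement = "0.0")
    {
        var doc = XDocument.Parse(xml);

        // Materialize the list before changing any values.
        var leaves = doc.Descendants().Where(e => !e.HasElements).ToList();
        foreach (var leaf in leaves)
        {
            leaf.Value = replacement;
        }

        // DisableFormatting keeps the output on one line, like the sample in the question.
        return doc.ToString(SaveOptions.DisableFormatting);
    }
}
```

Given <rows><row><a>1234</a><b>Hello</b></row></rows>, this returns <rows><row><a>0.0</a><b>0.0</b></row></rows>.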
You should take a look at HTML Agility Pack. It lets you work with HTML documents much as you would with well-formed XML, so you can parse them and change values.
I think you can use the Regex.Replace method in C#. I used the regex below to replace all the XML element values:
[>]+[a-zA-Z0-9]+[<]+
This basically matches a '>', followed by one or more letters or digits, followed by a '<'.
I was able to use this successfully in Notepad++. You could also write a small program that uses it.
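As a rough sketch of such a small program, the same pattern can be passed to Regex.Replace. The class and method names are hypothetical, and the pattern only matches values made up entirely of ASCII letters and digits, so values containing spaces, punctuation, or other characters are left unchanged.

```csharp
using System.Text.RegularExpressions;

public static class RegexValueZeroer
{
    // The pattern from the answer: '>' then letters/digits then '<'.
    private static readonly Regex ValuePattern = new Regex("[>]+[a-zA-Z0-9]+[<]+");

    // Replaces each matched ">value<" with ">0.0<".
    public static string Replace(string xml)
    {
        return ValuePattern.Replace(xml, ">0.0<");
    }
}
```

For <rows><row><a>1234</a><b>Hello</b></row></rows> this produces <rows><row><a>0.0</a><b>0.0</b></row></rows>, while a value such as "Hello world" would not be replaced.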
