How to strip HTML code from a string in C#

I'm coding an app for Windows Phone in C#.
The program creates an HTML file, and while it runs I add a lot of HTML tags.
Now I need to strip those tags from a string when needed.
All my searches show that I can split a string into an array and put it back together minus any words I don't want. That is handy, but it won't work for my needs. I have no idea where to start, or even whether it is possible.
Here is an example of the strings I need to clean:
string testString = "AnotherTest<br>";
And this is the list of parts I need to remove:
List<string> partsToRemove = new List<string> { "</a>", "\">", "<br>", "<a", "href=\"#" };
So how do I take "AnotherTest<br>" and remove all the parts included in partsToRemove?
To clarify:
I will only be removing HTML from small strings as needed, not from a whole HTML file.
To give a working concept: my program creates a background for a roleplay character, and part of that process uses a "gang" generator. The gang generator provides the strings with HTML tags ready for placement (adding them on the fly is not possible without radical alteration to my whole program). This is fine for the end result, BUT I give users access to the generator itself, so if they just want a gang they can use what I have created. The result is then displayed in a textbox (I could easily change that to another web box) and, if enabled, the phone reads it out. So here I would take the string created for the gang and feed it through a method that strips the HTML code and returns a "clean" string.
Before posting I searched for a solution, but all I came across was how to remove words, whole words.

You can try to use regex to do this:
Remove all html tags:
String result = Regex.Replace(htmlDocument, @"<[^>]*>", String.Empty);

For the case that you've shown, you can use this regex: /(<a|href=\\"#|">|</a>|<br>|\\)/gm
But since you might have many different types of fragments, it is best to keep a list of patterns, or to figure out a single pattern that matches all the combinations you have. It might be more suitable to split the document and run a regex several times, to keep each regex as simple as possible.
Hope I've answered your question.
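To make the two ideas above concrete, here is a minimal sketch. The class and method names (HtmlStripper, StripTags, RemoveParts) are made up for illustration; only Regex.Replace and String.Replace are real framework calls. StripTags applies the generic tag pattern from this answer, and RemoveParts removes each entry of the asker's partsToRemove list literally, without any regex.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of any library.
public static class HtmlStripper
{
    // Removes anything shaped like a tag: '<', any run of non-'>' characters, then '>'.
    public static string StripTags(string input)
    {
        return Regex.Replace(input, @"<[^>]*>", string.Empty);
    }

    // Removes each listed fragment literally, in list order.
    public static string RemoveParts(string input, IEnumerable<string> partsToRemove)
    {
        foreach (string part in partsToRemove)
        {
            input = input.Replace(part, string.Empty);
        }
        return input;
    }
}
```

For small generated strings, StripTags is probably the safer of the two: literal removal of fragments such as `<a` and `href="#` leaves the attribute value (the anchor name) behind in the text, whereas the tag pattern removes the whole tag at once.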

Related

regex gets stuck with this call

I'm working on a movie scraper / auto-downloader that iterates over my current movie collection, finds new recommendations, and downloads the new goods.
There is a part where I scrape IMDb for metadata and it seems to get stuck in this one spot and I can't seem to figure out why.... it has run this same code with different imdb pages just fine (this is the 29th iteration of a new page)
I am using c#!
The code:
private string Match(string regex, string html, int i = 1)
{
return new Regex(regex, RegexOptions.Multiline).Match(html).Groups[i].Value.Trim();
}
regex parameter string contents:
<title>.*?\\(.*?(\\d{4}).*?\\).*?</title>
html parameter string contents: too big to paste here, but literally the html string representation of http://www.imdb.com/title/tt4422748/combined
if in chrome, you can view easily with:
view-source:http://www.imdb.com/title/tt4422748/combined
I have paused execution in Visual Studio and stepped forward; it continues to run but just hangs (it doesn't let me step, it just runs). If I hit pause again, it returns to the same spot with the same parameter values (and no, I am not calling it in an infinite loop). I'm pretty new to regex, so any help would be appreciated!
Using .* is like saying "I want to match everything, yet nothing." Each use of it makes the regex engine backtrack through so many different possibilities that it becomes unresponsive and appears to lock up.
Does the person designing the pattern really not know whether there is going to be text in the title? I bet 99% of the time the title has text, so why is .* even used? How about .+ at least?
If you want the text between the delimiters, use this:
title\>(?<Title>[^<]+)\</title
Then extract the matched text through the named group "Title" instead of Groups[0], which holds the whole match. Groups[1] will have the captured text as well, if one loathes named match captures.
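A rough sketch of how that could look in the asker's scraper. TitleExtractor, ExtractTitle and ExtractYear are hypothetical names, and the two-step approach (grab the title first, then look for the year inside that short string) is my assumption about what the scraper needs, not something stated in the question. The match timeout uses the Regex constructor overload that takes a TimeSpan (available from .NET 4.5), so a pathological page throws instead of hanging.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class for illustration.
public static class TitleExtractor
{
    // [^<]+ cannot run past the closing tag, so there is very little to backtrack over.
    private static readonly Regex TitleRegex = new Regex(
        @"<title>(?<Title>[^<]+)</title>",
        RegexOptions.IgnoreCase,
        TimeSpan.FromSeconds(2)); // give up instead of hanging

    // Returns the trimmed title text, or an empty string if none is found.
    public static string ExtractTitle(string html)
    {
        try
        {
            Match m = TitleRegex.Match(html);
            return m.Success ? m.Groups["Title"].Value.Trim() : string.Empty;
        }
        catch (RegexMatchTimeoutException)
        {
            return string.Empty;
        }
    }

    // Looks for a four-digit year inside parentheses in the (short) title string.
    public static string ExtractYear(string title)
    {
        Match m = Regex.Match(title, @"\([^)\d]*(\d{4})[^)]*\)");
        return m.Success ? m.Groups[1].Value : string.Empty;
    }
}
```

Running the year pattern against the extracted title rather than the whole page keeps the input to the second regex tiny, which is the same "keep the regex simple" idea as above.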
Answer for Regex Haters
Use the HTML agility pack.

Remove everything except src in Image Tag using Regex

I want to remove everything except src in an image tag using regex. I am using C#, but I don't want to use HTMLAgilityPack; I want to do it using regex only.
How can I do this?
If the string is <img id="image" class="header" src="test.png">, then it should return <img src="test.png">
Image tag may contain many other extra properties.
To clarify my comments: Normally I wouldn't recommend parsing HTML Using Regex. however, this is one of the few times when it's possible without ending up with a disastrously complicated regex string, because here you have a single node, with 1 pair of matching angle brackets. In addition, the OP only needs a single tag from this string. If he needed to do anything more complicated, I'd agree that he should use HTMLAgilityPack, but this is perfectly doable.
What you do is extract the src attribute from the string using this regex: (src=['\"].+?['\"]). Then you take what you extracted from the string and paste it into a new string:
String newImgTag = String.Format("<img {0}>", srcMatch);
Again, if this were any more complicated (or if I had to do other HTML manipulation), I would just skip the regex and go for the established solutions like the aforementioned HTMLAgilityPack, because it offers far more support for HTML manipulation.
However, I don't view this as HTML manipulation, because you have a single tag without even a matching closing tag. This is more like basic string manipulation. It's similar to raising a number to the second power: I doubt anyone would import the entire math library just for that, they'd just do N * N.
I fully expect and accept that people will downvote me for even considering to use Regex for this. Before you do so, however, read the post and think about it. This is one of those borderline cases where HTMLAgilityPack would make the project far more complicated without actually adding anything except that you're not using Regex. Regex has its uses, it's only when you abuse it that it becomes a monster to work with.
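Put together, the approach from this answer might look like the sketch below. ImgTagReducer and KeepOnlySrc are invented names. The pattern is a slight variation on the one above, and both changes are my own assumptions: a backreference (\1) so that the closing quote has to be the same kind as the opening one, and a lookbehind so that an attribute such as data-src is not mistaken for src.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class for illustration.
public static class ImgTagReducer
{
    // src, optional whitespace around '=', a quote, the value, then the same quote again.
    // The lookbehind rejects names that merely end in "src", such as data-src.
    private static readonly Regex SrcAttribute = new Regex(
        @"(?<![\w-])src\s*=\s*(['""]).+?\1",
        RegexOptions.IgnoreCase);

    // Rebuilds the tag with only its src attribute; returns the input unchanged if no src is found.
    public static string KeepOnlySrc(string imgTag)
    {
        Match srcMatch = SrcAttribute.Match(imgTag);
        return srcMatch.Success
            ? string.Format("<img {0}>", srcMatch.Value)
            : imgTag;
    }
}
```

This only makes sense when the input is known to be a single img tag, as in the question; for whole documents the caveats above about regex and HTML still apply.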

Matching a term that contains nested HTML

I have been having trouble finding a solution to this problem.
I am parsing the content of a number of ebooks, finding specific terms and characters, marking the locations and lengths of each term.
A normal case would be something like this (excerpts from A Game of Thrones):
"When he paused to look down, his head swam dizzily and he felt his fingers slipping. Bran cried out and clung for dear life."
If we are searching for the character "Bran", its location is 85 and length is 4. Easy enough.
My issue arises when there is a paragraph like this:
<span height="-0em"><font size="7">D</font></span>aenerys Targaryen wed Khal Drogo
We need to match "Daenerys Targaryen". It is easy enough to strip the HTML and match the string, but in this example the result needs to include the HTML. Thus the expected result here would be location = 0, length = 67.
Another situation, caused by random anchor tags scattered throughout:
Did anyone outside the Vale even suspect where Catelyn <a></a>Stark had taken him?
Again, searching for "Catelyn Stark" needs to include the HTML, so location = 47, length = 20.
I have been able to get around it temporarily by adding those specific cases (searching for "Catelyn <a></a>Stark" specifically), but clearly I should have a more robust solution, which I cannot seem to get my head around. My attempts have used RegEx, but with limited success.
I have found various questions regarding HTML matching/stripping (and whether or not to use RegEx =)), but this case seems to be somewhat unique.
Stripping the tags isn't an option as the content must be preserved.
This is within a stand-alone C# application.
Any ideas, steps in the right direction, or similar examples should your search go better than mine would be greatly appreciated!
One possible approach would be to insert the following between each letter in your search string:
(?:<[^>]*>)*
So when searching for the character "Bran" your regex would become the following:
(?:<[^>]*>)*B(?:<[^>]*>)*r(?:<[^>]*>)*a(?:<[^>]*>)*n
This will allow your regex to match any number of HTML tags anywhere within the search string. Note that this will only work if your search strings are always something simple like a character's name, and not regular expressions (this method will fail if there is repetition like a* in your search string).
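Building that pattern by hand gets tedious, so here is a small sketch that generates it from a plain search term. TagTolerantSearch, BuildPattern and Find are hypothetical names; each character is passed through Regex.Escape so that names containing characters like '.' are still treated literally.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class for illustration.
public static class TagTolerantSearch
{
    // Matches zero or more consecutive tags.
    private const string TagGap = @"(?:<[^>]*>)*";

    // Puts the tag gap in front of every (escaped) character of the term.
    public static string BuildPattern(string term)
    {
        return string.Concat(term.Select(c => TagGap + Regex.Escape(c.ToString())));
    }

    // Returns the first match; Match.Index and Match.Length give location and length.
    public static Match Find(string html, string term)
    {
        return Regex.Match(html, BuildPattern(term));
    }
}
```

On the two excerpts from the question this should give location 0, length 67 for "Daenerys Targaryen" and location 47, length 20 for "Catelyn Stark". Note that the leading gap means any tags directly in front of the name are counted as part of the match, which is what the Daenerys example asks for.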
I would create a function that takes "Daenerys Targaryen" as a parameter and strips the first letter. It would then search only for "aenerys Targaryen" and, if found, search for ">D<" or whatever the first letter is. Does that make sense?
Example:
public static string searchFor(string str)
{
    string result = null; // placeholder until the steps below are implemented
    // strip first letter of search string (in this case "D")
    // search for the rest of the string ("aenerys Targaryen")
    // if found, search for ">D<"
    // if found, search for HTML tags with "D" inside (using regex)
    // if found, search for HTML tags with the previous HTML tag in them (using regex)
    return result;
}
Well, using JavaScript or PHP you can get the text of elements and the text of documents, search there, and then run a regex to return the closest match (containing the HTML).
Another option would be to index the books first using something like the Lucene search engine, which lets you index several formats, HTML being one of them.
You can then use the Lucene API to search your documents a little more easily.
In PHP we have Zend_Search_Lucene, which works perfectly for this kind of thing.
Lucene Search can be found at:
http://lucene.apache.org/core/
Have fun!

Using C# regex to select text based on custom tags

I have a string in C# containing some data I need to extract based on certain conditions.
The string contains many tenders in the following form:
<TENDER> some words, don't know how many, may contain numbers and things like slashes (/) or whatever <DESCRIPTION> some more words and possibly other things like numbers or whatever describing the tender here </DESCRIPTION> some more words and possibly numbers and weird things </TENDER>
This string doesn't contain any nested <TENDER> tags; it's flat. The <DESCRIPTION> tags occur only once within each pair of <TENDER> tags.
I'm using <TENDER>(.+?)</TENDER> as the regex to split up the tenders, and it works fine. If this is wrong or stupid and you know a better way to write it, please let me know, as I have discovered I suck at regex.
My problem is that I now need to select a tender only if its description contains any word from a list of keywords (let's say for now I want to select a tender only if it contains either "concrete" or "brick" in the description).
So far the regex I have come up with looks like this, but I don't know what to put in the middle. I also have a vague suspicion that it might return some false positives.
<TENDER>(.+?)<DESCRIPTION>have no idea what to do here</DESCRIPTION>(.+?)</TENDER>
If any of you regex gurus could point me in the right direction, I would be most appreciative.
Use
<TENDER>([^<>]+?)<DESCRIPTION>[^<>]*?(brick|concrete)[^<>]*?</DESCRIPTION>([^<>]+?)</TENDER>
I am using [^<>] instead of . to avoid leaving the tags.
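Since the question mentions a list of keywords, the alternation can be built from that list instead of being typed by hand. This is only a sketch: TenderFilter and SelectTenders are made-up names, and the case-insensitive matching is an assumption on my part. It relies on the same [^<>] trick, so it also assumes the text inside a tender never contains '<' or '>'.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class for illustration.
public static class TenderFilter
{
    // Returns every complete <TENDER>...</TENDER> block whose description contains a keyword.
    public static List<string> SelectTenders(string input, IEnumerable<string> keywords)
    {
        // Escape each keyword so it is matched literally, then join with '|'.
        string alternation = string.Join("|", keywords.Select(Regex.Escape));

        string pattern =
            "<TENDER>[^<>]*<DESCRIPTION>[^<>]*?(?:" + alternation + ")[^<>]*</DESCRIPTION>[^<>]*</TENDER>";

        return Regex.Matches(input, pattern, RegexOptions.IgnoreCase)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToList();
    }
}
```

Because [^<>] also matches line breaks, tenders that span several lines are handled without extra options. An empty keyword list would produce an empty alternation that matches every tender, so callers may want to check for that first.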
Use RegexOptions.IgnorePatternWhitespace because I have commented the pattern. It does not affect the data processing; it just allows one to break the pattern across lines and comment it.
string pattern = @"
(?<=<TENDER>) # Look behind for TENDER
(?<TenderBefore>.*?) # Put the data into the TenderBefore named match capture group (NMCG)
(?:<DESCRIPTION>)
(?=[^<>]*?(?:brick|concrete)) # Look ahead for either keyword, staying inside the description
(?<Description>.*?) # Put the data into the Description NMCG
(?:</DESCRIPTION>)
(?<TenderAfter>.*?) # Put the text into the TenderAfter NMCG
(?=<\/TENDER>) # Tender look ahead.";
After processing the matches, extract the data out of each match such as
string Tender = string.Format("{0}<DESCRIPTION>{1}</DESCRIPTION>{2}",
myMatch.Groups["TenderBefore"].Value,
myMatch.Groups["Description"].Value,
myMatch.Groups["TenderAfter"].Value);
HTH
Instead of regex, try using a proper DOM parsing library, such as the Html Agility Pack. It should work with any tags, even custom ones.

Complex String Processing - well complex to me

I am calling a web service and all I get back is a giant blob of text, which I am left to process myself. The problem is that not all lines have the same shape. Each has 2 or 3 sections, and the lines are similar to one another. Here are the most common examples:
text1 [text2] /text3/
text1/text3
text1[text2]/text3
text1 [text2] /text /3 here/
I am not exactly sure how to approach this problem. I am not too good at anything advanced when it comes to manipulating strings.
I was thinking a regular expression might work, but I'm not too sure about that either. If I can get each of these 3 sections broken up, it is easier from there to do the rest. It's just that there doesn't seem to be any uniformity to the 3 main sections that I know how to work with.
EDIT: Thanks for mentioning that I didn't actually say what I wanted to do.
Basically, I want to split these 3 sections of text into their own separate strings, so take it from one single string to an array of 3 strings.
string[0] = text1
string[1] = text2
string[2] = text3
Here is some of the text I get back from a call as an example
スルホ基 [スルホき] /(n) sulfo group/
鋭いナイフ [するどいナイフ] /(n) sharp knife/
鋭い批判 [するどいひはん] /(n) sharp criticism/
スルナーイ /(n) (See ズルナ) (obsc) surnay (Anatolian woodwind instrument) (per:)/zurna/
スルピリン /(n) sulpyrine/
スルファミン /(n) sulfamine/
剃る [そる(P);する] /(v5r,vt) to shave/(P)/
Taking the first line as an example, I want to pull it out into an array:
string[0] = スルホ基
string[1] = [スルホき]
string[2] = /(n) sulfo group/
Those examples seem a bit random. There has to be some kind of order; isn't there a spec for the service? If not, I suggest posting more examples so that we can understand the rules.
Read up on finite state machines, and see if you can use some of the concepts on your input parsing problem.
If there is some order to the groups on each line, then maybe you can use a regex to separate the groups out.
Edit: after seeing your samples, you may get by with a regex, breaking on some of those specific delimiters. It will take maybe half an hour to test the theory: pick up a free regex tester, write a regex that isolates just one of those groups, and pump a few sample lines through. If it performs reliably on the real data that you have, then expand it and see if you can also isolate the other groups.
I should mention, though, that your regexes will break or just become a nightmare if there are any vagaries in your data (and frequently there are). So test long and hard before settling on them. If you find you start to have exceptions in your data, then you will need to choose some sort of parsing algorithm (the FSM I mentioned above is a pattern you can follow if you implement a parsing mechanism).
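As a starting point for that experiment, here is one regex that copes with the sample lines posted in the question. It is a sketch under assumptions read off those samples only: the first section never contains '[' or '/', the bracketed section is optional, and the last section starts at the first '/' after it. DictionaryLineParser and Split are invented names, and the brackets and outer slashes are dropped from the results.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class for illustration.
public static class DictionaryLineParser
{
    private static readonly Regex LinePattern = new Regex(
        @"^(?<head>[^\[/]+?)\s*(?:\[(?<reading>[^\]]*)\]\s*)?/(?<gloss>.*?)/?$");

    // Returns { text1, text2, text3 }; text2 is empty when the line has no [..] section.
    // Returns null when the line does not fit the assumed shape.
    public static string[] Split(string line)
    {
        Match m = LinePattern.Match(line.Trim());
        if (!m.Success)
        {
            return null;
        }
        return new[]
        {
            m.Groups["head"].Value,
            m.Groups["reading"].Value,
            m.Groups["gloss"].Value
        };
    }
}
```

Lines that return null are the "exceptions in your data" mentioned above; logging them while running over real output is a cheap way to find out whether a regex is enough or a proper parser is needed.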
The most stupid answer is "Use regex". But more information is needed for a better one.
