I have a string like this:
string str;
str = "This is my new string. <script> Hi this is XYZ </script>";
Now I want to remove the text from "<script>" to "</script>", including the tags, using C# code.
Thanks,
You should take a look at Regex.
With it you can locate the block and then delete it.
This pattern should match everything between the script tags, as long as the content itself contains no "<": "<script>[^<]+</script>"
This is what I use to remove HTML tags from a string:
public static string ClearHtmlTags(string html)
{
if (string.IsNullOrWhiteSpace(html))
return html;
html = html.Trim();
string[] hs = html.Split("<>".ToArray());
bool skip = false;
StringBuilder sb = new StringBuilder();
foreach (string s in hs)
{
if (!skip)
sb.Append(s);
skip = !skip;
}
return sb.ToString();
}
With a small modification, you get a method that also skips the text between the script tags:
public static string ClearHtmlTags(string html)
{
if (string.IsNullOrWhiteSpace(html))
return html;
html = html.Trim();
string[] hs = html.Split("<>".ToArray());
bool skip = false;
bool skipTag = false;
StringBuilder sb = new StringBuilder();
foreach (string s in hs)
{
if (!skip)
{
if (!skipTag)
sb.Append(s);
}
else
{
skipTag = s == "script";
}
skip = !skip;
}
return sb.ToString();
}
You can use something like:
Regex.Replace(inputString, "<script>([a-z]|[A-Z])*</script>", "");
Note that this pattern only matches letters between the script tags; it will not match if the content contains spaces, digits or punctuation.
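As a sketch of a broader pattern (my own assumption, not taken from the answers above): a lazy match combined with RegexOptions.Singleline and RegexOptions.IgnoreCase also covers spaces, digits, line breaks and attributes on the opening tag. The class and method names are hypothetical. For untrusted or deeply nested HTML, an HTML parser is likely more reliable than a regex.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ScriptStripper
{
    // Removes every <script ...>...</script> block, tags included.
    // Singleline lets '.' match line breaks; '.*?' is lazy, so each
    // block is matched on its own instead of spanning several blocks.
    public static string RemoveScriptBlocks(string input)
    {
        if (string.IsNullOrEmpty(input))
            return input;

        return Regex.Replace(
            input,
            @"<script\b[^>]*>.*?</script\s*>",
            string.Empty,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }

    public static void Main()
    {
        string str = "This is my new string. <script> Hi this is XYZ </script>";
        Console.WriteLine(RemoveScriptBlocks(str)); // prints "This is my new string. "
    }
}
```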
If you want to remove a known piece of text, e.g. the word "is" from
"my name is testing"
then "is" is the part you want to remove.
Use the IndexOf function to find where it starts, and then use Substring (or Remove) to cut it out, or Replace to swap it for an empty string or something else.
In C# you can filter your string like this, or use a regex, before the data is entered.
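A minimal sketch of the IndexOf approach, applied both to the "is" example and to the script tags from the question. The helper names RemoveFirst and RemoveBetween are hypothetical, and only the first occurrence is handled.

```csharp
using System;

public static class SubstringRemover
{
    // Removes the first occurrence of 'toRemove' from 'source'.
    public static string RemoveFirst(string source, string toRemove)
    {
        if (string.IsNullOrEmpty(source) || string.IsNullOrEmpty(toRemove))
            return source;

        int index = source.IndexOf(toRemove, StringComparison.Ordinal);
        if (index < 0)
            return source; // nothing to remove

        return source.Remove(index, toRemove.Length);
    }

    // Removes everything from the first 'startTag' up to and including
    // the first 'endTag' that follows it.
    public static string RemoveBetween(string source, string startTag, string endTag)
    {
        if (string.IsNullOrEmpty(source))
            return source;

        int start = source.IndexOf(startTag, StringComparison.OrdinalIgnoreCase);
        if (start < 0)
            return source;

        int end = source.IndexOf(endTag, start + startTag.Length, StringComparison.OrdinalIgnoreCase);
        if (end < 0)
            return source;

        return source.Remove(start, end + endTag.Length - start);
    }

    public static void Main()
    {
        Console.WriteLine(RemoveFirst("my name is testing", "is ")); // prints "my name testing"
        Console.WriteLine(RemoveBetween(
            "This is my new string. <script> Hi this is XYZ </script>",
            "<script>", "</script>")); // prints "This is my new string. "
    }
}
```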
Document example (Opens correctly in MS Office)
I have a Word document (see the example above) in which I need to replace all <> tags in the text with my own values. Using the interop, I got access to the main text and the text of the headers, which I search for matches with the Regex class:
static string GetContent(Word.Document document)
{
return document.Content.Text;
}
static string GetHeaderFooterText(Word.Document document)
{
StringBuilder sb = new StringBuilder();
foreach (Word.Section section in document.Sections)
{
foreach (Word.HeaderFooter hf in section.Headers)
{
if (!hf.Exists)
continue;
Word.Range range = hf.Range;
sb.AppendLine(range.Text);
foreach (Word.Shape shape in hf.Shapes)
{
sb.AppendLine(shape.TextFrame.TextRange.Text);
}
}
}
return sb.ToString();
}
public string[] GetMatches(string pattern, string text)
{
WordReader reader = new WordReader(Word);
Regex regex = new Regex(pattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);
HashSet<string> matches = new HashSet<string>();
foreach(Match match in regex.Matches(text))
matches.Add(match.Value);
return matches.ToArray();
}
but I can’t get the text from the tables located in the Headers. The value of the number of tables property of the HeaderFooter class is 0.
Replacing text with the Find class also does not replace tags in these tables.
private void ReplaceWords(Word.Application app, Dictionary<string, string> keyValuePairs)
{
foreach (var pair in keyValuePairs)
{
Word.Find find = app.Selection.Find;
find.ClearFormatting();
find.Replacement.ClearFormatting();
find.Text = pair.Key;
find.Replacement.Text = pair.Value;
find.MatchAllWordForms = false;
find.Forward = true;
find.Wrap = Word.WdFindWrap.wdFindContinue;
find.Forward = false;
find.MatchCase = false;
find.MatchWholeWord = false;
find.MatchWildcards = false;
find.MatchSoundsLike = false;
find.Execute(Replace: Word.WdReplace.wdReplaceAll);
}
}
Is there a way to access these tables and how to replace text in the document globally at all levels?
I found a working way to access all document content:
foreach (Word.Range range in wordApp.ActiveDocument.StoryRanges)
{
//Get string range.Text or use range.Find to find and replace text in document
}
I have a text file containing the following lines:
<TestInfo."Content">
{
<Label> "Content"
<Visible> "true"
"This is the text I want to get"
}
<TestInfo."Content2">
{
<Label> "Content2"
<Visible> "true"
"I don't want e.g. this"
}
I want to extract This is the text I want to get.
I tried e.g. the following:
string tmp = File.ReadAllText(textfile);
string result = Regex.Match(tmp, @"<Label> ""Content"" \n\s+ <Visible> ""true"" \n\s+ ""(.+?)""", RegexOptions.Singleline).Groups[1].Value;
However, in this case I get only the first word.
So, my output is: This
And I have no idea why...
I would appreciate any help. Thanks!
If you want the entire line after the line that starts with <Visible>, you'd better read the file line by line instead of using File.ReadAllText and a regular expression:
string result = null; // stays null if no <Visible> line is found
using (StreamReader sr = new StreamReader(textfile))
{
while (sr.Peek() >= 0)
{
string line = sr.ReadLine();
if (line.TrimStart().StartsWith("<Visible>")) // TrimStart in case the line is indented
{
result = sr.ReadLine();
break;
}
}
}
Try this:
var tmp = File.ReadAllText("TextFile1.txt");
var result = Regex.Match(tmp, "This is the text I want to get", RegexOptions.Multiline);
if (result.Success) // Groups.Count is always at least 1, so test Success instead
for (int i = 0; i < result.Groups.Count; i++)
Console.WriteLine(result.Groups[i].Value);
else
Console.WriteLine("string not found.");
Regards,
//jafc
You could change your regex this way:
var result = Regex.Match(tmp, @"<Visible> ""true""\s*""([\S ]+)""", RegexOptions.Singleline).Groups[1].Value;
If you want to get all the matches, not only the first one, you could use Regex.Matches.
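A minimal sketch of the Regex.Matches variant, assuming the file layout shown in the question (a single space between <Visible> and "true"). The class name, method name and inline sample text are hypothetical; in real code the content would come from File.ReadAllText.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class VisibleTextExtractor
{
    // Returns the quoted text that follows each <Visible> "true" line.
    public static List<string> ExtractAll(string content)
    {
        var results = new List<string>();
        var regex = new Regex(@"<Visible> ""true""\s*""([\S ]+)""");

        foreach (Match match in regex.Matches(content))
            results.Add(match.Groups[1].Value);

        return results;
    }

    public static void Main()
    {
        // Sample input mirroring the file from the question.
        string content =
            "<TestInfo.\"Content\">\r\n{\r\n<Label> \"Content\"\r\n<Visible> \"true\"\r\n" +
            "\"This is the text I want to get\"\r\n}\r\n" +
            "<TestInfo.\"Content2\">\r\n{\r\n<Label> \"Content2\"\r\n<Visible> \"true\"\r\n" +
            "\"I don't want e.g. this\"\r\n}\r\n";

        foreach (string text in ExtractAll(content))
            Console.WriteLine(text);
    }
}
```

Because [\S ] excludes line breaks, each match stops at the end of its own line, so the first element of the list is the text for "Content" and the second is the text for "Content2".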
Thanks a lot for your input! This helped me to find a final solution:
First, I extracted only a small part containing the string I want to extract to avoid ambiguities:
string[] tmp = File.ReadAllLines(textfile);
List<string> Content = new List<string>();
bool dumpA = false;
Regex regBEGIN = new Regex(@"<TestInfo\.""Content"">");
Regex regEND = new Regex(@"<TestInfo\.""Content2"">");
foreach (string line in tmp)
{
if (dumpA)
Content.Add(line.Trim());
if (regBEGIN.IsMatch(line))
dumpA = true;
if (regEND.IsMatch(line)) break;
}
Then I can extract the (now only once existing) line starting with '"':
string result = "";
foreach (string line in Content)
{
if (line.StartsWith("\""))
{
result = line;
result = result.Replace("\"", "");
result = result.Trim();
}
}
Please help me remove all the additional Facebook information from links like the one below, using the C# .NET Regex.Replace method.
Example
http://on.fb.me/OE6gnBsomehtml
Output
somehtml on.fb.me/OE6gnB somehtml
I tried the following regex, but it didn't work for me:
searchPattern = "<a([.]*)?/l.php([.]*)?(\">)?([.]*)?(</a>)?";
replacePattern = "$3";
Thanks
I managed to do this using a regex with the following code:
searchPattern = "<a(.*?)href=\"/l.php...(.*?)&?(.*?)>(.*?)</a>";
string html1 = Regex.Replace(html, searchPattern, delegate(Match oMatch)
{
return string.Format("{1}", HttpUtility.UrlDecode(oMatch.Groups[2].Value), oMatch.Groups[4].Value);
});
You can try this (System.Web has to be added to use System.Web.HttpUtility):
string input = @"http://on.fb.me/OE6gnBsomehtml";
string rootedInput = String.Format("<root>{0}</root>", input);
XDocument doc = XDocument.Parse(rootedInput, LoadOptions.PreserveWhitespace);
string href;
var anchors = doc.Descendants("a").ToArray();
for (int i = anchors.Count() - 1; i >= 0; i--)
{
href = HttpUtility.ParseQueryString(anchors[i].Attribute("href").Value)[0];
XElement newAnchor = new XElement("a");
newAnchor.SetAttributeValue("href", href);
newAnchor.SetValue(href.Replace(@"http://", String.Empty));
anchors[i].ReplaceWith(newAnchor);
}
string output = doc.Root.ToString(SaveOptions.DisableFormatting)
.Replace("<root>", String.Empty)
.Replace("</root>", String.Empty);
Here is my problem: I'm trying to read the content of a text file into a string and then parse it. What I want is an array containing each word and only the words (no blanks, no whitespace, no \n, ...). I use a function, LireFichier, that returns a string containing the text from the file (this works fine, because the string is displayed correctly), but when I try to parse it, the parsing fails and seems to concatenate parts of my string at random, and I don't understand why.
Here is the content of the text file I'm using :
truc,
ohoh,
toto, tata, titi, tutu,
tete,
and here's my final string :
;tete;;titi;;tata;;titi;;tutu;
which should be:
truc;ohoh;toto;tata;titi;tutu;tete;
Here is the code I wrote (all using are ok):
namespace ConsoleApplication1{
class Program
{
static void Main(string[] args)
{
string chemin = "MYPATH";
string res = LireFichier(chemin);
Console.WriteLine("End of reading...");
Console.WriteLine("{0}",res);// The result at this point is good
Console.WriteLine("...starting parsing");
res = parseString(res);
Console.WriteLine("Chaine finale : {0}", res);//The result here is awful
Console.ReadLine();//pause
}
public static string LireFichier(string FilePath) //Read the file, send back a string with the text
{
StreamReader streamReader = new StreamReader(FilePath);
string text = streamReader.ReadToEnd();
streamReader.Close();
return text;
}
public static string parseString(string phrase)//is supposed to parse the string
{
string fin="\n";
char[] delimiterChars = { ' ','\n',',','\0'};
string[] words = phrase.Split(delimiterChars);
TabToString(words);//I check the content of my tab
for(int i=0;i<words.Length;i++)
{
if (words[i] != null)
{
fin += words[i] +";";
Console.WriteLine(fin);//help for debug
}
}
return fin;
}
public static void TabToString(string[] montab)//display the content of my tab
{
foreach(string s in montab)
{
Console.WriteLine(s);
}
}
}//Fin de la class Program
}
I think your main issue is that Split keeps the empty entries; pass StringSplitOptions.RemoveEmptyEntries to drop them:
string[] words = phrase.Split(delimiterChars, StringSplitOptions.RemoveEmptyEntries);
You could try using the string splitting option to remove empty entries for you:
string[] words = phrase.Split(delimiterChars, StringSplitOptions.RemoveEmptyEntries);
See the documentation for StringSplitOptions.
Try this:
class Program
{
static void Main(string[] args)
{
var inString = LireFichier(@"C:\temp\file.txt");
Console.WriteLine(ParseString(inString));
Console.ReadKey();
}
public static string LireFichier(string FilePath) //Read the file, send back a string with the text
{
using (StreamReader streamReader = new StreamReader(FilePath))
{
string text = streamReader.ReadToEnd(); // no explicit Close needed: the using block disposes the reader
return text;
}
}
public static string ParseString(string input)
{
input = input.Replace(Environment.NewLine,string.Empty);
input = input.Replace(" ", string.Empty);
string[] chunks = input.Split(',');
StringBuilder sb = new StringBuilder();
foreach (string s in chunks)
{
sb.Append(s);
sb.Append(";");
}
return sb.ToString(0, sb.Length - 1); // drop the last ';' without building the string twice
}
}
Or this:
public static string ParseFile(string FilePath)
{
using (var streamReader = new StreamReader(FilePath))
{
return streamReader.ReadToEnd().Replace(Environment.NewLine, string.Empty).Replace(" ", string.Empty).Replace(',', ';');
}
}
Your main problem is that you are splitting on \n, but the linebreaks read from your file are \r\n.
Your output string does contain all of your items, but the \r characters left in it cause later "lines" to overwrite earlier "lines" on the console.
(\r is a "return to start of line" instruction; without the \n "move to the next line" instruction your words from line 1 are being overwritten by those in line 2, then line 3 and line 4.)
Besides splitting on \r in addition to \n, you need to check that a string is not null or empty before adding it to your output (or, preferably, use StringSplitOptions.RemoveEmptyEntries as others have mentioned).
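Putting both points together, a minimal sketch could look like this. The class name is hypothetical, the '\t' delimiter is my own addition, and the sample text stands in for the file content with Windows line endings.

```csharp
using System;

public static class WordParser
{
    // Splits on spaces, commas, tabs and both line-break characters,
    // drops empty entries, and joins the words with ';'.
    public static string ParseString(string phrase)
    {
        char[] delimiterChars = { ' ', '\r', '\n', '\t', ',' };
        string[] words = phrase.Split(delimiterChars, StringSplitOptions.RemoveEmptyEntries);

        if (words.Length == 0)
            return string.Empty;

        return string.Join(";", words) + ";";
    }

    public static void Main()
    {
        string text = "truc,\r\nohoh,\r\ntoto, tata, titi, tutu,\r\ntete,\r\n";
        Console.WriteLine(ParseString(text)); // prints "truc;ohoh;toto;tata;titi;tutu;tete;"
    }
}
```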
string ParseString(string filename) {
return string.Join(";", System.IO.File.ReadAllLines(filename)
    .Where(x => x.Length > 0)
    .Select(x => string.Join(";", x.Split(",".ToCharArray(), StringSplitOptions.RemoveEmptyEntries).Select(y => y.Trim())))
    .Select(z => z.Trim())) + ";";
}
I have a DevExpress report, and the data source has a field whose data comes in looking something like this:
PRODUCT - APPLE<BR/>ITEM NUMBER - 23454</BR>LOT NUMBER 3343 <BR/>
That is exactly how it shows up in a cell, so I decided to decode it, but nothing works. I tried HttpUtility.HtmlDecode, and here I am trying WebUtility.HtmlDecode:
private void xrTableCell9_BeforePrint(object sender, System.Drawing.Printing.PrintEventArgs e)
{
XRTableCell cell = sender as XRTableCell;
string _description = WebUtility.HtmlDecode(Convert.ToString(GetCurrentColumnValue("Description")));
cell.Text = _description;
}
How can I decode the value of this column from the data source?
Thank you
If you need to show the description including the < /> tags, you need to use HtmlEncode.
If you need to extract the text from that HTML:
public static string ExtractTextFromHtml(this string text)
{
if (String.IsNullOrEmpty(text))
return text;
var sb = new StringBuilder();
var doc = new HtmlDocument();
doc.LoadHtml(text);
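var nodes = doc.DocumentNode.SelectNodes("//text()"); // SelectNodes returns null when nothing matches
if (nodes == null)
return sb.ToString();
foreach (HtmlNode node in nodes)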
{
if (!String.IsNullOrWhiteSpace(node.InnerText))
sb.Append(HtmlEntity.DeEntitize(node.InnerText.Trim()) + " ");
}
return sb.ToString();
}
This requires the HtmlAgilityPack library.
To remove the br tags:
var str = Convert.ToString(GetCurrentColumnValue("Description"));
str = Regex.Replace(str, @"</?\s?br\s?/?>", System.Environment.NewLine, RegexOptions.IgnoreCase); // Regex.Replace returns a new string, so assign the result
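As a self-contained sketch of the same idea, using the sample value from the question: the helper name ReplaceBrTags is hypothetical, and the pattern uses \s* instead of \s? so that several spaces inside the tag are tolerated (my own variation). The replacement text is inserted literally through a match evaluator.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BrTagReplacer
{
    // Replaces <br>, <br/>, <br /> and </br> (any casing) with 'replacement'.
    public static string ReplaceBrTags(string html, string replacement)
    {
        if (string.IsNullOrEmpty(html))
            return html;

        // The lambda returns the replacement as-is, so characters such as '$'
        // are not interpreted as substitution patterns.
        return Regex.Replace(html, @"</?\s*br\s*/?>", m => replacement, RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        string description = "PRODUCT - APPLE<BR/>ITEM NUMBER - 23454</BR>LOT NUMBER 3343 <BR/>";
        Console.WriteLine(ReplaceBrTags(description, Environment.NewLine).TrimEnd());
    }
}
```

In the report handler, the result would then be assigned to cell.Text instead of the decoded value.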