unable to find a substring in html after decode/normalize

I have a snippet of html held as a string "s", it's user generated and may come from multiple sources, so I can't control the encoding of characters etc.
I have a simple string "comparison", and I need to check if comparison exists as a substring of "s". "comparison" does not have any html tags or encoding.
I am decoding, normalizing, and using a regex to strip out html tags, but am still unable to find the substring even when I know it is there...
string s = "<p>this is my string.</p><p>my string is html with tags and <a href=\"someurl\">links</a> and encoding.</p><p>i want to&nbsp;find&nbsp;a&nbsp;substring but my comparison might not have tags &amp; encoding.";
string comparison = "i want to find a substring";
string decode = HttpUtility.HtmlDecode(s);
string tagsreplaced = Regex.Replace(decode, "<.*?>", " ");
string normalized = tagsreplaced.Normalize();
Literal1.Text = normalized;
if (normalized.IndexOf(comparison) != -1)
{
Label1.Text = "substring found";
}
else
{
Label1.Text = "substring not found";
}
This is returning "substring not found". I can see by clicking view source that the string sent to the Literal absolutely includes the comparison string exactly as provided, so why isn't it being found?
Is there another way to achieve this?

The answer is that HTML entity decoding turns your &nbsp; into the non-breaking space character U+00A0 (0xC2 0xA0 in UTF-8), which is not the normal space character ' ' (0x20). Verify this with the following program:
using System;
using System.Text;
using System.Text.RegularExpressions;
using System.Web;
namespace TestStuff
{
class Program
{
static void Main(string[] args)
{
string s = "<p>this is my string.</p><p>my string is html with tags and <a href=\"someurl\">links</a> and encoding.</p><p>i want to&nbsp;find&nbsp;a&nbsp;substring but my comparison might not have tags &amp; encoding.";
s = "i want to&nbsp;find&nbsp;a&nbsp;substring";
string comparison = "i want to find a substring";
string decode = HttpUtility.HtmlDecode(s);
string tagsreplaced = Regex.Replace(decode, "<.*?>", " ");
string normalized = tagsreplaced.Normalize();
Console.WriteLine("Dumping first string");
Console.WriteLine(normalized);
Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(normalized)));
Console.WriteLine("Dumping second string");
Console.WriteLine(comparison);
Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(comparison)));
if (normalized.IndexOf(comparison) != -1)
Console.WriteLine("substring found");
else
Console.WriteLine("substring not found");
Console.ReadLine();
return;
}
}
}
It dumps the UTF8 encodings of the two strings for you. You'll see as output:
Dumping first string
i want to find a substring
69-20-77-61-6E-74-20-74-6F-C2-A0-66-69-6E-64-C2-A0-61-C2-A0-73-75-62-73-74-72-69-6E-67
Dumping second string
i want to find a substring
69-20-77-61-6E-74-20-74-6F-20-66-69-6E-64-20-61-20-73-75-62-73-74-72-69-6E-67
substring not found
You can see that the byte arrays do not match, so the strings are not equal, and .IndexOf() is right to tell you that nothing was found.
So the problem lies within the HTML itself: it contains non-breaking space characters, and decoding does not turn them into normal spaces. You can work around it by replacing the non-breaking space "\u00A0" with a normal space " " using String.Replace().
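A minimal sketch of that workaround. It assumes System.Net.WebUtility.HtmlDecode as a stand-in for HttpUtility.HtmlDecode (so it runs without a reference to System.Web), and the helper name ToPlainText is hypothetical:

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class NbspDemo
{
    // Decode entities, strip tags, then map non-breaking spaces to normal spaces.
    public static string ToPlainText(string html)
    {
        string decoded = WebUtility.HtmlDecode(html);
        string noTags = Regex.Replace(decoded, "<.*?>", " ");
        return noTags.Replace('\u00A0', ' ');
    }

    public static void Main()
    {
        string s = "<p>i want to&nbsp;find&nbsp;a&nbsp;substring</p>";
        string comparison = "i want to find a substring";

        // Ordinal comparison avoids culture-specific matching surprises.
        bool found = ToPlainText(s).IndexOf(comparison, StringComparison.Ordinal) >= 0;
        Console.WriteLine(found); // True
    }
}
```

Only U+00A0 is handled here; other whitespace-like characters that user-generated HTML may contain would need the same treatment.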

Related

How can I extract a dynamic length string from multiline string?

I am using "nslookup" to get machine name from IP.
nslookup 1.2.3.4
The output is multiline and the machine name has a variable length. How can I extract "DynamicLengthString" from the whole output? All suggestions IndexOf and Split, but when I try to do like that, I was not a good solution for me. Any advice?
Server: volvo.toyota.opel.tata
Address: 5.6.7.8
Name: DynamicLengthString.toyota.opel.tata
Address: 1.2.3.4

I did it the good old C# way, without regex.
string input = @"Server: volvo.toyota.opel.tata
Address: 5.6.7.8
Name: DynamicLengtdfdfhString.toyota.opel.tata
Address: 1.2.3.4";
string targetLineStart = "Name:";
string[] allLines = input.Split(new char[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
string targetLine = String.Empty;
foreach (string line in allLines)
if (line.StartsWith(targetLineStart))
{
targetLine = line;
}
System.Console.WriteLine(targetLine);
string dynamicLengthString = targetLine.Remove(0, targetLineStart.Length).Split('.')[0].Trim();
System.Console.WriteLine("<<" + dynamicLengthString + ">>");
System.Console.ReadKey();
This extracts "DynamicLengtdfdfhString" from the given input, no matter where the Name-Line is and no matter what comes afterwards.
This is the console version to test & verify it.

You can use Regex
using System;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
string Content = "Server: volvo.toyota.opel.tata \rAddress: 5.6.7.8 \rName: DynamicLengthString.toyota.opel.tata \rAddress: 1.2.3.4";
string Pattern = "(?<=DynamicLengthString)(?s)(.*$)";
//string Pattern = @"/^Dy*$/";
MatchCollection matchList = Regex.Matches(Content, Pattern);
Console.WriteLine("Running");
foreach(Match match in matchList)
{
Console.WriteLine(match.Value);
}
}
}
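The pattern above matches everything after the literal text DynamicLengthString, so it presupposes the value being looked for. A hedged alternative sketch, assuming the output lines are separated by \n or \r\n and the host name itself contains no periods; the helper name ExtractHostName is hypothetical:

```csharp
using System;
using System.Text.RegularExpressions;

public static class NslookupName
{
    // Returns the first label after "Name:" or null when no such line exists.
    public static string ExtractHostName(string output)
    {
        // Multiline makes ^ match at the start of every line, not just the input.
        Match m = Regex.Match(output, @"^Name:\s*([^.\s]+)", RegexOptions.Multiline);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        string output = "Server: volvo.toyota.opel.tata\nAddress: 5.6.7.8\n\n"
                      + "Name:    DynamicLengthString.toyota.opel.tata\nAddress: 1.2.3.4";
        Console.WriteLine(ExtractHostName(output)); // DynamicLengthString
    }
}
```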

I'm going to assume your output is exactly like you put it.
string output = ExactlyAsInTheQuestion();
var fourthLine = output.Split(Environment.NewLine)[3];
var nameValue = fourthLine.Substring(9); //skips over "Name: "
var firstPartBeforePeriod = nameValue.Split('.')[0];
//firstPartBeforePeriod should equal "DynamicLengthString"
Note that this is a barebones example:
Either check all array indexes before you access them, or be prepared to catch IndexOutOfRangeExceptions.
I've assumed that the four spaces between "Name:" and "DynamicLengthString" are four spaces. If they are a tab character, you'll need to adjust the Substring(9) method to Substring(6).
If "DynamicLengthString" is supposed to also have periods in its value, then my answer does not apply. You'll need to use a regex in that case.
Note: I'm aware that you dismissed Split:
All suggestions IndexOf and Split, but when I try to do like that, I was not a good solution for me.
But based on only this description, it's impossible to know if the issue was in getting Split to work, or it actually being unusable for your situation.

why regex split add to pattern \r\n

I want to split the body of an article by HTML div tags, so I have a pattern that searches for divs.
The problem is that the pattern also seems to split on \r\n.
string pattern = @"<div[^<>]*>(.*?)</div>";
string[] bodyParagraphsnew = Regex.Split(body, pattern,RegexOptions.None);
Response.Write("num of paragraph =" + bodyParagraphsnew.Length);
for (int i = 0; i < bodyParagraphsnew.Length; i++)
{
Response.Write("bodyParagraphs" + i + "= " + bodyParagraphsnew[i]+ Environment.NewLine);
}
When I debug this code I see a lot of "\r\n" entries in the array bodyParagraphsnew.
It looks as if the pattern also splits on the string "\r\n".
I tried replacing \r\n with an empty string, hoping the length of bodyParagraphsnew would change, but it did not. Instead of an item containing \r\n, the array now contains "".
Why?
Here is a link to an image that explains the problem: http://i.stack.imgur.com/Hxqki.gif

What you are seeing is the text that is between the end of the first </div> tag and the start of the next <div> tag. This is what Split does, it finds the text between the Regular Expression matches.
What is curious here though is that you are also going to get the text between the open and close tags because you put brackets in your string forming a capturing group. Consider the following program:
using System;
using System.Text.RegularExpressions;
class Program
{
static void Main(string[] args)
{
string body = "<div>some text</div>\r\n<div>some more text</div>";
string pattern = @"<div[^>]*?>(.*?)</div>";
string[] bodyParagraphsnew = Regex.Split(body, pattern, RegexOptions.None);
Console.WriteLine("num of paragraph =" + bodyParagraphsnew.Length);
for (int i = 0; i < bodyParagraphsnew.Length; i++)
{
Console.WriteLine("bodyParagraphs {0}: '{1}'", i, bodyParagraphsnew[i]);
}
}
}
What you will get from this is:
"" - An empty string taken from before the first <div>.
"some text" - The contents of the first <div>, because of the capturing group.
"\r\n" - The text between the end of the first </div> and the start of the last <div>.
"some more text" - The contents of the second div, again because of the capturing group.
"" - An empty string taken from after the last </div>.
What you are probably after is the contents of the div tags. This can kind of be achieved using this code:
using System;
using System.Text.RegularExpressions;
class Program
{
static void Main(string[] args)
{
string body = "<div>some text</div>\r\n<div>some more text</div>";
string pattern = @"<div[^>]*?>(.*?)</div>";
MatchCollection bodyParagraphsnew = Regex.Matches(body, pattern, RegexOptions.None);
Console.WriteLine("num of paragraph =" + bodyParagraphsnew.Count);
for (int i = 0; i < bodyParagraphsnew.Count; i++)
{
Console.WriteLine("bodyParagraphs {0}: '{1}'", i, bodyParagraphsnew[i].Groups[1].Value);
}
}
}
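To see the effect of the capturing group in isolation, the following sketch runs Regex.Split on the same body twice, once with the parentheses and once without. The class name SplitGroupDemo is only illustrative:

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitGroupDemo
{
    public static void Main()
    {
        string body = "<div>some text</div>\r\n<div>some more text</div>";

        // With a capturing group, Split also returns the captured div contents.
        string[] withGroup = Regex.Split(body, @"<div[^>]*?>(.*?)</div>");

        // Without the group, only the text between matches is returned.
        string[] withoutGroup = Regex.Split(body, @"<div[^>]*?>.*?</div>");

        Console.WriteLine(withGroup.Length);    // 5
        Console.WriteLine(withoutGroup.Length); // 3
    }
}
```

In both cases the "\r\n" entry remains, because it is genuinely the text between the two matches.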
Note however that in HTML, div tags can be nested within each other. For example, the following is a valid HTML string:
string test = "<div>Outer div<div>inner div</div>outer div again</div>";
In this kind of situation regular expressions are not going to work! This is largely because HTML is not a regular language. To deal with it you would need to write a parser (of which regular expressions are only a small part). Personally I wouldn't bother, as there are plenty of open source HTML parsers already available, HTML Agility Pack for example.

Two possibilities:
Use a List instead of an array and call list.Remove.
Filter the array and drop every entry that is just \r\n:
bodyParagraphsnew = bodyParagraphsnew.Where(w => w != "\r\n").ToArray();
Not very nice, but maybe it is what you were looking for.

Extract sub-string between two certain words right to left side

Example String
This is an important example about regex for my work.
I can extract important example about regex with the pattern (?<=an).*?(?=for).
But I would like to extract the string from right to left. For this example, the first position must be (for) and the second position must be (an).
I mean the extraction should work backwards.
I tried to do this in the else if case of the code below, but it doesn't work.
public string FnExtractString(string _QsString, string _QsStart, string _QsEnd, string _QsWay = "LR")
{
if (_QsWay == "LR")
return Regex.Match(_QsString, @"(?<=" + _QsStart + ").*?(?=" + _QsEnd + ")").Value;
else if (_QsWay == "RL")
return Regex.Match(_QsString, @"(?=" + _QsStart + ").*?(<=" + _QsEnd + ")").Value;
else
return _QsString;
}
Thanks in advance.
EDIT
My real example as below
#Var|First String|ID_303#Var|Second String|ID_304#Var|Third String|DI_t55
When I pass two strings to my method (for example "|ID_304" and "#Var|") I would like to extract "Second String", but this example is only a small piece of my real string, and my string can change.

No need for lookahead or lookbehind! You could just use:
\san\s(.*)\sfor\s
The \s demands whitespace, so you don't match the "an" inside "important". The text you want ends up in capture group 1.

One potential problem in your current solution is that the string passed in contains special characters, which needs to be escaped with Regex.Escape before concatenation:
return Regex.Match(_QsString, @"(?<=" + Regex.Escape(_QsStart) + ").*?(?=" + Regex.Escape(_QsEnd) + ")").Value;
For your other requirement of matching RL, I don't understand your requirement.
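One possible reading of the right-to-left requirement, based on the example in the question's edit: find the end marker first, then search backwards from it for the nearest start marker. The sketch below uses plain IndexOf and LastIndexOf instead of regex; the helper name ExtractBefore and its parameter order are hypothetical:

```csharp
using System;

public static class RightToLeftExtract
{
    // Returns the text between the nearest 'start' that precedes the first 'end',
    // or null when either marker cannot be found.
    public static string ExtractBefore(string text, string end, string start)
    {
        int endIdx = text.IndexOf(end, StringComparison.Ordinal);
        if (endIdx <= 0)
            return null;

        // LastIndexOf searches backwards from endIdx - 1 toward the beginning.
        int startIdx = text.LastIndexOf(start, endIdx - 1, StringComparison.Ordinal);
        if (startIdx < 0)
            return null;

        int from = startIdx + start.Length;
        return text.Substring(from, endIdx - from);
    }

    public static void Main()
    {
        string s = "#Var|First String|ID_303#Var|Second String|ID_304#Var|Third String|DI_t55";
        Console.WriteLine(ExtractBefore(s, "|ID_304", "#Var|")); // Second String
    }
}
```

Because no regex is involved, special characters such as | in the markers need no escaping.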

Need help inserting commas after each character in specific part of string

In the program I'm working on, I need to strip the tags around certain parts of a string, and then insert a comma after each character WITHIN the tag (but not after any other characters in the string). In case this doesn't make sense, here's an example of what needs to happen -
This is a string with a < a > tag < /a > (please ignore the spaces within the tag)
(needs to become)
This is a string with a t,a,g,.
Can anyone help me with this? I've managed to strip the tags using RegEx, but I can't figure out how to insert the commas only after the characters contained within the tag. If someone could help that would be great.
@Dour High Arch I'll elaborate a little bit. The code is for a text-to-speech app that won't recognize SSML tags. When the user enters a message for the text-to-speech app, they have the option of enclosing a word in a < a > tag to make the speaker say the word as an acronym. Because the acronym SSML tag won't work, I want to remove the < a > tag whenever present, and place commas after each character contained in the tag to fake it out (ex: < a > test< /a > becomes t,e,s,t,). All the non-tagged words in the string do not need commas after them, just those enclosed in tags (see my first example if need be).

If you have figured out the regex, I would imagine it would be simple to capture the inner text of the tag. Then it's a really simple operation to insert the commas:
var commaString = string.Join(",", capturedString.ToList());
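As a sketch of how the capture and the comma insertion could be combined in one step, Regex.Replace accepts a callback that receives each match. This assumes the tag is written exactly as <a>...</a> with no attributes or nesting; the helper name Expand is hypothetical:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class AcronymCommas
{
    // Replaces every <a>...</a> with its inner text, a comma after each character.
    public static string Expand(string input)
    {
        return Regex.Replace(input, @"<a>(.*?)</a>",
            m => string.Concat(m.Groups[1].Value.Select(c => c + ",")));
    }

    public static void Main()
    {
        Console.WriteLine(Expand("This is a string with a <a>tag</a>."));
        // This is a string with a t,a,g,.
    }
}
```

Unlike string.Join, which only puts commas between characters, this variant also appends a comma after the last character, which matches the output requested in the question.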

Assuming you have your target string already parsed via your RegEx i.e. no tags around it...
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace ConsoleApplication32
{
class Program
{
static void Main(string[] args)
{
// setup a test string
string stringToProcess = "Test";
// actual solution here
string result = String.Concat(stringToProcess.Select(c => c + ","));
// results: T,e,s,t,
Console.WriteLine(result);
}
}
}

Parsing XML is very problematic because you may have to deal with things like CDATA sections, nested elements, entities, surrogate characters, and on and on. I would use a state-based parser like ANTLR.
However, if you are just starting out with C# it is instructive to solve this using the built-in .Net string and array classes. No ANTLR, LINQ, or regular expressions needed:
using System;
class ReplaceAContentsWithCommaSeparatedChars
{
static readonly string acroStartTag = "<a>";
static readonly string acroEndTag = "</a>";
static void Main(string[] args)
{
string s = "Alpha <a>Beta</a> Gamma <a>Delta</a>";
while (true)
{
int start = s.IndexOf(acroStartTag);
if (start < 0)
break;
int end = s.IndexOf(acroEndTag, start + acroStartTag.Length);
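bool hasEndTag = end >= 0;
if (!hasEndTag)
end = s.Length; // unterminated tag: treat the rest of the string as its contents
string contents = s.Substring(start + acroStartTag.Length, end - start - acroStartTag.Length);
string[] chars = Array.ConvertAll<char, string>(contents.ToCharArray(), c => c.ToString());
s = s.Substring(0, start)
+ string.Join(",", chars)
+ (hasEndTag ? s.Substring(end + acroEndTag.Length) : string.Empty); // nothing follows an unterminated tag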
}
Console.WriteLine(s);
}
}
Please be aware this does not deal with any of the issues I mentioned. But then, none of the other suggestions do either.

How do I replace all the spaces with %20 in C#?

I want to make a string into a URL using C#. There must be something in the .NET framework that should help, right?

Another way of doing this is using Uri.EscapeUriString(stringToEscape).

I believe you're looking for HttpServerUtility.UrlEncode.
System.Web.HttpUtility.UrlEncode(string url)

I found useful System.Web.HttpUtility.UrlPathEncode(string str);
It replaces spaces with %20 and not with +.

To properly escape spaces as well as the rest of the special characters, use System.Uri.EscapeDataString(string stringToEscape).
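Because Uri.EscapeDataString also escapes / characters, one way to use it on a path is to escape each segment separately and rejoin them. A small sketch; the helper name EscapePath is hypothetical and it assumes the input is a path only, without scheme or query string:

```csharp
using System;
using System.Linq;

public static class PathEscaper
{
    // Escapes every path segment on its own so the '/' separators survive.
    public static string EscapePath(string path)
    {
        return string.Join("/", path.Split('/').Select(seg => Uri.EscapeDataString(seg)));
    }

    public static void Main()
    {
        Console.WriteLine(EscapePath("api/get me this file.jpg"));
        // api/get%20me%20this%20file.jpg
    }
}
```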

As noted in the comments on the accepted answer, the HttpServerUtility.UrlEncode method replaces spaces with + instead of %20.
Use one of these two methods instead: Uri.EscapeUriString() or Uri.EscapeDataString()
Sample code:
HttpUtility.UrlEncode("https://mywebsite.com/api/get me this file.jpg")
//Output: "https%3a%2f%2fmywebsite.com%2fapi%2fget+me+this+file.jpg"
Uri.EscapeUriString("https://mywebsite.com/api/get me this file.jpg");
//Output: "https://mywebsite.com/api/get%20me%20this%20file.jpg"
Uri.EscapeDataString("https://mywebsite.com/api/get me this file.jpg");
//Output: "https%3A%2F%2Fmywebsite.com%2Fapi%2Fget%20me%20this%20file.jpg"
//When your url has a query string:
Uri.EscapeUriString("https://mywebsite.com/api/get?id=123&name=get me this file.jpg");
//Output: "https://mywebsite.com/api/get?id=123&name=get%20me%20this%20file.jpg"
Uri.EscapeDataString("https://mywebsite.com/api/get?id=123&name=get me this file.jpg");
//Output: "https%3A%2F%2Fmywebsite.com%2Fapi%2Fget%3Fid%3D123%26name%3Dget%20me%20this%20file.jpg"

I needed to do this too, found this question from years ago but question title and text don't quite match up, and using Uri.EscapeDataString or UrlEncode (don't use that one please!) doesn't usually make sense unless we are talking about passing URLs as parameters to other URLs.
(For example, passing a callback URL when doing open ID authentication, Azure AD, etc.)
Hoping this is more pragmatic answer to the question: I want to make a string into a URL using C#, there must be something in the .NET framework that should help, right?
Yes - two functions are helpful for making URL strings in C#
String.Format for formatting the URL
Uri.EscapeDataString for escaping any parameters in the URL
This code
String.Format("https://site/app/?q={0}&redirectUrl={1}",
Uri.EscapeDataString("search for cats"),
Uri.EscapeDataString("https://mysite/myapp/?state=from idp"))
produces this result
https://site/app/?q=search%20for%20cats&redirectUrl=https%3A%2F%2Fmysite%2Fmyapp%2F%3Fstate%3Dfrom%20idp
This can be safely copied and pasted into a browser's address bar, or the href attribute of an HTML A tag, or used with curl, or encoded into a QR code, etc.

Use HttpServerUtility.UrlEncode

HttpUtility.UrlDecode works for me:
var str = "name=John%20Doe";
var str2 = HttpUtility.UrlDecode(str);
// str2 is now "name=John Doe"

HttpUtility.UrlEncode Method (String)

The code below replaces each run of repeated spaces with a single %20 sequence.
Example:
Input is:
Code by Hitesh Jain
Output:
Code%20by%20Hitesh%20Jain
Code
static void Main(string[] args)
{
Console.WriteLine("Enter a string");
string str = Console.ReadLine();
string replacedStr = "";
// This loop collapses each run of repeated spaces into a single space
for (int i = 0; i < str.Length - 1; i++)
{
if (!(Convert.ToString(str[i]) == " " &&
Convert.ToString(str[i + 1]) == " "))
{
replacedStr = replacedStr + str[i];
}
}
replacedStr = replacedStr + str[str.Length-1]; // Append last character
replacedStr = replacedStr.Replace(" ", "%20");
Console.WriteLine(replacedStr);
Console.ReadLine();
}

HttpServerUtility.HtmlEncode
From the documentation:
String TestString = "This is a <Test String>.";
String EncodedString = Server.HtmlEncode(TestString);
But this actually encodes HTML, not URLs. Instead use UrlEncode(TestString).
