Can't get Regex group value - c#

The string I'm trying to parse from looks something like this:
{"F4q6i9xe":{"Hhgi79M1":"cTZ3W2JG","j0Uszek2":"0"},"a3vSYuq2":{"Kn51uR4Y":"c6QUklQzv+kRLnmvS0zsheqDsRCXbWFVpYhq2rsElKyFQnla01E6P3qQ26b+xWSrscFJCi2qsh6WjJKSW5FN9EwxRAWc3rfyToaEhRngI2WPu3W/b1/hkkS2tEEk7LEpS2ItxLYKjEQGneO5E9rcGzbtfSOikdIjhxpD5m9HsKayo2ZSc2EYd/cN9yfYrNfLOCx+xeGqGcPmmImAzOZM0Q1IXIDQZ5r70vUS1aUOMOFnN1fVxmQ8ISofEKNEpHFfwWW9QUl9eVDgpQ2HES13s20kvVH2FOlE5uahIJBnyTLWYzViAWYyX13VK2PgrQcZ0oLuV84bSbjHGZf+JAjvImuyYhkKDhtTWAGkQ8LZ6r07duo71Gbz3glOUZZyz+FFiH5KrwafBpzGhRlI8El6lLZASN9Z5iZNTKs+sOti1fzsuOBzUXEjFURJGa93GMqPC+OyTw6YxKvgapz7go1XL7EAf50UXpMHd9xAJzsuIb6/lB/6v+XjD70wc74D2OosxtR6DIbQj3gbaBUBABQsJHQZLNnHWqikh+HLCASVbkqw6YPm5NecGaOQxonDkh3ZVQSF15WCEsgNdoWG94OuLF8DSw1t1Kt8sMFkvBJgB7LJS3pAw/Tcg/rvR5qwZ/n31rtOzaJgCdVc6XeIK6Rttj4KvNtgtTwkJIjb97FgIS0CXrR0UCp3BsuLwQg3CRTnjQqsjGgJeinLXX2lq6vNOkxRzgXlWnap96kheYj7sIE/IxbJJWEIInMbH9HU/w0bS0MtIgofpIQAZ/m6XKhu9LpPZBgsBf9KX5chUPeuRfs3MfHFzPYPlCCd2dkE8BHKOTmfS3okqhkzIZvFN3Ni+7enktFdgfpQ0toQbjhpn/TszP34pBaIy77me+GvePNO5bAFECLWSpGvsmW16rLCP0H+xIS/lNf9hK98jQqGU7eqpQdai7WFie2yN75Up7MrgBMp5w9Z5C7qpG2iYGiqynpqTCEnfQ3IZumbL+YvGTiuI2c320MGjKzOdO45MIU5fxUNZQSIfCSyIq5G/XGIeXCG3KETyKDZygXUTgEWbJWNTADE0AhXvP7HMtsuvstyuHvlTZGcfKS4oLnDPFiV1ndIV7+W74Ytv9bAdDIVl36xTzA2PV6waqXBfSPUCTx3jVrAeXcHGFjZtxbk3pFmuqPqgVcxeX/aDbK0NHkR6phQcFEREBjfdCOLAAbCkWiRF0JAA=="}}
And this is my regular expression:
Hhgi79M1":"(?<encodeKeyID>.*?)",.*Kn51uR4Y":"(?<encodedBody>.*?)"}
And this is the code I'm using in my C# application:
string responsePattern = "Hhgi79M1\":\"(?<encodeKeyID>.*?)\",.*Kn51uR4Y\":\"(?<encodedBody>.*?)\"}";
if (Regex.IsMatch(body, responsePattern))
{
var match = Regex.Match(body, responsePattern);
string encodeKeyID = match.Groups["encodeKeyID"].Value;
string encodedBody = match.Groups["encodedBody"].Value;
Now it works, but it doesn't get the value of "encodedBody". I tested my expression with the data on https://regex101.com/ and it seems to work fine on there. However, when getting the value in my program it's just an empty string.

I sense the issue is that your body string is not escaped properly, as your pattern works fine in the following code:
string body = "{\"F4q6i9xe\":{\"Hhgi79M1\":\"cTZ3W2JG\",\"j0Uszek2\":\"0\"},\"a3vSYuq2\":{\"Kn51uR4Y\":\"c6QUklQzv+kRLnmvS0zsheqDsRCXbWFVpYhq2rsElKyFQnla01E6P3qQ26b+xWSrscFJCi2qsh6WjJKSW5FN9EwxRAWc3rfyToaEhRngI2WPu3W/b1/hkkS2tEEk7LEpS2ItxLYKjEQGneO5E9rcGzbtfSOikdIjhxpD5m9HsKayo2ZSc2EYd/cN9yfYrNfLOCx+xeGqGcPmmImAzOZM0Q1IXIDQZ5r70vUS1aUOMOFnN1fVxmQ8ISofEKNEpHFfwWW9QUl9eVDgpQ2HES13s20kvVH2FOlE5uahIJBnyTLWYzViAWYyX13VK2PgrQcZ0oLuV84bSbjHGZf+JAjvImuyYhkKDhtTWAGkQ8LZ6r07duo71Gbz3glOUZZyz+FFiH5KrwafBpzGhRlI8El6lLZASN9Z5iZNTKs+sOti1fzsuOBzUXEjFURJGa93GMqPC+OyTw6YxKvgapz7go1XL7EAf50UXpMHd9xAJzsuIb6/lB/6v+XjD70wc74D2OosxtR6DIbQj3gbaBUBABQsJHQZLNnHWqikh+HLCASVbkqw6YPm5NecGaOQxonDkh3ZVQSF15WCEsgNdoWG94OuLF8DSw1t1Kt8sMFkvBJgB7LJS3pAw/Tcg/rvR5qwZ/n31rtOzaJgCdVc6XeIK6Rttj4KvNtgtTwkJIjb97FgIS0CXrR0UCp3BsuLwQg3CRTnjQqsjGgJeinLXX2lq6vNOkxRzgXlWnap96kheYj7sIE/IxbJJWEIInMbH9HU/w0bS0MtIgofpIQAZ/m6XKhu9LpPZBgsBf9KX5chUPeuRfs3MfHFzPYPlCCd2dkE8BHKOTmfS3okqhkzIZvFN3Ni+7enktFdgfpQ0toQbjhpn/TszP34pBaIy77me+GvePNO5bAFECLWSpGvsmW16rLCP0H+xIS/lNf9hK98jQqGU7eqpQdai7WFie2yN75Up7MrgBMp5w9Z5C7qpG2iYGiqynpqTCEnfQ3IZumbL+YvGTiuI2c320MGjKzOdO45MIU5fxUNZQSIfCSyIq5G/XGIeXCG3KETyKDZygXUTgEWbJWNTADE0AhXvP7HMtsuvstyuHvlTZGcfKS4oLnDPFiV1ndIV7+W74Ytv9bAdDIVl36xTzA2PV6waqXBfSPUCTx3jVrAeXcHGFjZtxbk3pFmuqPqgVcxeX/aDbK0NHkR6phQcFEREBjfdCOLAAbCkWiRF0JAA==\"}}\n" +
"";
string responsePattern = "Hhgi79M1\":\"(?<encodeKeyID>.*?)\",.*Kn51uR4Y\":\"(?<encodedBody>.*?)\"}";
if (Regex.IsMatch(body, responsePattern))
{
var match = Regex.Match(body, responsePattern);
string encodeKeyID = match.Groups["encodeKeyID"].Value;
string encodedBody = match.Groups["encodedBody"].Value;
string msg = String.Format("encodeKeyID: {0}\nencodedBody: {1}", encodeKeyID, encodedBody);
//show in message box
MessageBox.Show(msg, "Pattern Match Result");
}
Output:

Try it this way.
string sTarget = #"
{""F4q6i9xe"":{""Hhgi79M1"":""cTZ3W2JG"",""j0Uszek2"":""0""},""a3vSYuq2"":{""Kn51uR4Y"":""c6QUklQzv+kRLnmvS0zsheqDsRCXbWFVpYhq2rsElKyFQnla01E6P3qQ26b+xWSrscFJCi2qsh6WjJKSW5FN9EwxRAWc3rfyToaEhRngI2WPu3W\/b1\/hkkS2tEEk7LEpS2ItxLYKjEQGneO5E9rcGzbtfSOikdIjhxpD5m9HsKayo2ZSc2EYd\/cN9yfYrNfLOCx+xeGqGcPmmImAzOZM0Q1IXIDQZ5r70vUS1aUOMOFnN1fVxmQ8ISofEKNEpHFfwWW9QUl9eVDgpQ2HES13s20kvVH2FOlE5uahIJBnyTLWYzViAWYyX13VK2PgrQcZ0oLuV84bSbjHGZf+JAjvImuyYhkKDhtTWAGkQ8LZ6r07duo71Gbz3glOUZZyz+FFiH5KrwafBpzGhRlI8El6lLZASN9Z5iZNTKs+sOti1fzsuOBzUXEjFURJGa93GMqPC+OyTw6YxKvgapz7go1XL7EAf50UXpMHd9xAJzsuIb6\/lB\/6v+XjD70wc74D2OosxtR6DIbQj3gbaBUBABQsJHQZLNnHWqikh+HLCASVbkqw6YPm5NecGaOQxonDkh3ZVQSF15WCEsgNdoWG94OuLF8DSw1t1Kt8sMFkvBJgB7LJS3pAw\/Tcg\/rvR5qwZ\/n31rtOzaJgCdVc6XeIK6Rttj4KvNtgtTwkJIjb97FgIS0CXrR0UCp3BsuLwQg3CRTnjQqsjGgJeinLXX2lq6vNOkxRzgXlWnap96kheYj7sIE\/IxbJJWEIInMbH9HU\/w0bS0MtIgofpIQAZ\/m6XKhu9LpPZBgsBf9KX5chUPeuRfs3MfHFzPYPlCCd2dkE8BHKOTmfS3okqhkzIZvFN3Ni+7enktFdgfpQ0toQbjhpn\/TszP34pBaIy77me+GvePNO5bAFECLWSpGvsmW16rLCP0H+xIS\/lNf9hK98jQqGU7eqpQdai7WFie2yN75Up7MrgBMp5w9Z5C7qpG2iYGiqynpqTCEnfQ3IZumbL+YvGTiuI2c320MGjKzOdO45MIU5fxUNZQSIfCSyIq5G\/XGIeXCG3KETyKDZygXUTgEWbJWNTADE0AhXvP7HMtsuvstyuHvlTZGcfKS4oLnDPFiV1ndIV7+W74Ytv9bAdDIVl36xTzA2PV6waqXBfSPUCTx3jVrAeXcHGFjZtxbk3pFmuqPqgVcxeX\/aDbK0NHkR6phQcFEREBjfdCOLAAbCkWiRF0JAA==""}}
";
Regex responseRx = new Regex(#"Hhgi79M1"":""(?<encodeKeyID>.*?)"",.*Kn51uR4Y"":""(?<encodedBody>.*?)""}");
Match responseMatch = responseRx.Match(sTarget);
if (responseMatch.Success)
{
Console.WriteLine("ID = {0}", responseMatch.Groups["encodeKeyID"].Value);
Console.WriteLine("Body = {0}", responseMatch.Groups["encodedBody"].Value);
}
Output
ID = cTZ3W2JG
Body = c6QUklQzv+kRLnmvS0zsheqDsRCXbWFVpYhq2rsElKyFQnla01E6P3qQ26b+xWSrscFJCi2qs
h6WjJKSW5FN9EwxRAWc3rfyToaEhRngI2WPu3W\/b1\/hkkS2tEEk7LEpS2ItxLYKjEQGneO5E9rcGzb
tfSOikdIjhxpD5m9HsKayo2ZSc2EYd\/cN9yfYrNfLOCx+xeGqGcPmmImAzOZM0Q1IXIDQZ5r70vUS1a
UOMOFnN1fVxmQ8ISofEKNEpHFfwWW9QUl9eVDgpQ2HES13s20kvVH2FOlE5uahIJBnyTLWYzViAWYyX1
3VK2PgrQcZ0oLuV84bSbjHGZf+JAjvImuyYhkKDhtTWAGkQ8LZ6r07duo71Gbz3glOUZZyz+FFiH5Krw
afBpzGhRlI8El6lLZASN9Z5iZNTKs+sOti1fzsuOBzUXEjFURJGa93GMqPC+OyTw6YxKvgapz7go1XL7
EAf50UXpMHd9xAJzsuIb6\/lB\/6v+XjD70wc74D2OosxtR6DIbQj3gbaBUBABQsJHQZLNnHWqikh+HL
CASVbkqw6YPm5NecGaOQxonDkh3ZVQSF15WCEsgNdoWG94OuLF8DSw1t1Kt8sMFkvBJgB7LJS3pAw\/T
cg\/rvR5qwZ\/n31rtOzaJgCdVc6XeIK6Rttj4KvNtgtTwkJIjb97FgIS0CXrR0UCp3BsuLwQg3CRTnj
QqsjGgJeinLXX2lq6vNOkxRzgXlWnap96kheYj7sIE\/IxbJJWEIInMbH9HU\/w0bS0MtIgofpIQAZ\/
m6XKhu9LpPZBgsBf9KX5chUPeuRfs3MfHFzPYPlCCd2dkE8BHKOTmfS3okqhkzIZvFN3Ni+7enktFdgf
pQ0toQbjhpn\/TszP34pBaIy77me+GvePNO5bAFECLWSpGvsmW16rLCP0H+xIS\/lNf9hK98jQqGU7eq
pQdai7WFie2yN75Up7MrgBMp5w9Z5C7qpG2iYGiqynpqTCEnfQ3IZumbL+YvGTiuI2c320MGjKzOdO45
MIU5fxUNZQSIfCSyIq5G\/XGIeXCG3KETyKDZygXUTgEWbJWNTADE0AhXvP7HMtsuvstyuHvlTZGcfKS
4oLnDPFiV1ndIV7+W74Ytv9bAdDIVl36xTzA2PV6waqXBfSPUCTx3jVrAeXcHGFjZtxbk3pFmuqPqgVc
xeX\/aDbK0NHkR6phQcFEREBjfdCOLAAbCkWiRF0JAA==

Related

How can I pick out a number from a string in C#

I have this string:
http://www.edrdg.org/jmdictdb/cgi-bin/edform.py?svc=jmdict&sid=&q=1007040&a=2
How can I pick out the number between "q=" and "&amp" as an integer?
So in this case I want to get the number: 1007040
What you're actually doing is parsing a URI - so you can use the .Net library to do this properly as follows:
var str = "http://www.edrdg.org/jmdictdb/cgi-bin/edform.py?svc=jmdict&sid=&q=1007040&a=2";
var uri = new Uri(str);
var query = uri.Query;
var dict = System.Web.HttpUtility.ParseQueryString(query);
Console.WriteLine(dict["amp;q"]); // Outputs 1007040
If you want the numeric string as an integer then you'd need to parse it:
int number = int.Parse(dict["amp;q"]);
Consider using regular expressions
String str = "http://www.edrdg.org/jmdictdb/cgi-bin/edform.py?svc=jmdict&sid=&q=1007040&a=2";
Match match = Regex.Match(str, #"q=\d+&amp");
if (match.Success)
{
string resultStr = match.Value.Replace("q=", String.Empty).Replace("&amp", String.Empty);
int.TryParse(resultStr, out int result); // result = 1007040
}
Seems like you want a query parameter for a uri that's html encoded. You could do:
Uri uri = new Uri(HttpUtility.HtmlDecode("http://www.edrdg.org/jmdictdb/cgi-bin/edform.py?svc=jmdict&sid=&q=1007040&a=2"));
string q = HttpUtility.ParseQueryString(uri.Query).Get("q");
int qint = int.Parse(q);
A regex approach using groups:
public int GetInt(string str)
{
var match = Regex.Match(str,#"q=(\d*)&amp");
return int.Parse(match.Groups[1].Value);
}
Absolutely no error checking in that!

Get the remaining text after subtring

how can I get the remaining text/values of ImageString and return it in a variable? In my current code below i got an error.
Note: ImageString = data:image/jpeg;base64,/9j/4AAQSkZJRgABAQEBLAEsAAD/2wBDAAUDBAQEAwUEBAQFBQUGBwwIBwcHBw8LCwkMEQ8SEhEPERETFhwXExQaFRERGCEYGh0dHx8fExciJCIeJBweHx7/2wBDAQUFBQcGBw4ICA4eFBEUHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh7/wAARCAOEAlgDASIAAhEBAxEB/8QAHQABAAICAwEBAAAAAAAAAAAAAAUGAwQBAgcICf/EAEkQAQABAwMCAgcCDAMFBgcAAAABAgMEBREhMTIGEgciQVFhcYETkQgUIzNCUnKCobHB0RVikkNjc7LwFhckNKL
var _strings= { response = "True", name = "John", ImageString = "(random text here (1-1000 of characters maybe?))"};
var _split = _strings.Substring(_strings.IndexOf("ImageString"), _strings.Length);
"Index and length must refer to a location within the string.
Expected Out : I want to get all the text from the "ImageString"
In your particular case you can use JSON serializer:
public class Data
{
public string ImageString { get; set; }
}
var data = Newtonsoft.Json.JsonConvert.Deserialize<Data>(_strings);
var myImageString = data.ImageString;
var _strings= { response = "True", name = "John", ImageString = "(random text here (1-1000 of characters maybe?))"};
The above looks like an anonymous object but i am assumming you mean the structure of your text by it.
var imageStringIndex= _strings.IndexOf("ImageString");
var _split = _strings.Substring(imageStringIndex<0?0:imageStringIndex, _strings.Length - 1);
or
var _split = _strings.Split("ImageString".ToCharArray()).Last();
both you work for your case
If I've understood you right, and you insist on string manipulations (not Json parsing)
string _strings =
"{ response = \"True\", name = \"John\", ImageString = \"bla-bla-bla\" };";
string toFind = "ImageString";
int p = _strings.IndexOf(toFind);
// bla-bla-bla
string remain = p < 0
? "" //TODO: Not found at all (shall we return an empty or entire string?)
: _strings
.Substring(p + toFind.Length)
.TrimStart(' ', '=') // to be on the safe side: what if data starts from =
.TrimStart('"', ' ')
.TrimEnd('}', ';', ' ')
.TrimEnd('"', ' ');
First problem I'm seeing is that
var _strings= { response = "True", name = "John", ImageString = "(random text here (1-1000 of characters maybe?))"};
Is not a string, rather an attempt at an anonymous object?
If it is, you can declare the object using:
var _strings = new { response = "True", name = "John", ImageString = "(random text here (1-1000 of characters maybe?))" };
Then access any property in the object using standard c# syntax:
var imageString = _strings.ImageString;
The answer to your question depends very much on where the information is coming from as you can declare it and manipulate it in a thousand different ways if you have control over it.

Android indexoutofboundsexception

helo everyone i working Regex. i have indexoutofboundsexception
this is a my source
public String ExtractYoutubeURLFromHTML(String html) throws UnsupportedEncodingException
{
Pattern pattern = Pattern.compile(this.extractionExpression);
Matcher matcher = pattern.matcher(html);
List<String> matches = new ArrayList<String>();
while(matcher.find()){
matches.add(matcher.group());
}
String vid0 = matches.get(0).toString();
vid0 = vid0.replace ("\\\\u0026", "&");
vid0 = vid0.replace ("\\\\\\", "");
vid0=URLDecoder.decode(vid0,"UTF-8");
return vid0;
}
i debuged and i have indexoutofboundsexception invalid index exception
error is in this part of code
List<String> matches = new ArrayList<String>();
while(matcher.find()){
matches.add(matcher.group());
}
String vid0 = matches.get(0).toString();
also i have woring code in C# and i want to rewrite C# code in java Code
this is a working code in C#
public string ExtractYoutubeURLFromHTML(string html)
{
Regex rx = new Regex (this.extractionExpression);
var video = rx.Matches (html);
var vid0 = video [0].ToString ();
vid0 = vid0.Replace ("\\\\u0026", "&");
vid0 = vid0.Replace ("\\\\\\", "");
vid0 = System.Net.WebUtility.UrlDecode (vid0);
return vid0;
}
If matches is empty then matches.get(0).toString(); will give an ArrayIndexOutOfBoundsException.
Better check like
if(matches.size()>0)
String vid0 = matches.get(0).toString();
Replace this one :
String vid0 = matches.get(0).toString();
by
if(matches.size() > 0){
String vid0 = matches.get(0).toString();
} else {
return "" // or you can also return null;
}

Working with Regex to get 2 strings out of a source code [duplicate]

I am using webrequest to download a source from a page and then I need to use Regex to grab the string and store it in a string:
U_nQgAjU_tdUnfcA7lT5opoTLyLdslWDTpiNzcdkLoHlobS_HbujMw..
also need:
bpvsid=nvnN2JFJqJc.&dcz=1
Both out of:
<td style="cursor:pointer;" class="" onclick="NewWindow('U_nQgAjU_tdUnfcA7lT5opoTLyLdslWDTpiNzcdkLoHlobS_HbujMw..', 'bpvsid=nvnN2JFJqJc.&dcz=1', 'bpvstage_edit', '1200', '800')" onmouseout="HideHover();"><img src="gfx/info.gif" alt="" tipwidth="450" ajaxtip="openajax.php?target=modules/bpv/bpvstage_hover_info.php&rid=&oid=&bpvsid=&bpvname=" /></td>
It keep giving me errors like not enough )'s?
Thanks in advance.
Current code, probably wrong in every way. Really new to this:
Regex rx = new Regex("(?<=class=\"\" onclick=\"NewWindow(').*(?=')");
longId = (rx.Match(textBox2.Text).Value);
textBox1.Text = longId;
var match = Regex.Match(s, #"onclick=""NewWindow\('([^']*)',\s*'([^']*)',.*");
if (match.Success)
{
string longId = match.Groups[1].Value;
string other = match.Groups[2].Value;
}
That will give you two groups with values:
U_nQgAjU_tdUnfcA7lT5opoTLyLdslWDTpiNzcdkLoHlobS_HbujMw..
bpvsid=nvnN2JFJqJc.&dcz=1
The regex NewWindow\('([^']*)', '([^']*) will match what you require. The two strings required will be in Groups[1] and Groups[2].
var match = Regex.Match(textBox2.Text, "NewWindow\('([^']*)', '([^']*)");
var id1 = match.Groups[1].Value;
var id2 = match.Groups[2].Value;
Note that you could also use simply string functions instead of a regex:
var s = "<td style=\"cursor:pointer;\" class=\"\" onclick=\"NewWindow('U_nQgAjU_tdUnfcA7lT5opoTLyLdslWDTpiNzcdkLoHlobS_HbujMw..', 'bpvsid=nvnN2JFJqJc.&dcz=1', 'bpvstage_edit', '1200', '800')\" onmouseout=\"HideHover();\"><img src=\"gfx/info.gif\" alt=\"\" tipwidth=\"450\" ajaxtip=\"openajax.php?target=modules/bpv/bpvstage_hover_info.php&rid=&oid=&bpvsid=&bpvname=\" /></td>";
var tmp = s.Substring(s.IndexOf("NewWindow('")).Split('\'');
var value1 = tmp[1]; // U_nQgAjU_tdUnfcA7lT5opoTLyLdslWDTpiNzcdkLoHlobS_HbujMw..
var value2 = tmp[3]; // bpvsid=nvnN2JFJqJc.&dcz=1
I would use HtmlAgilityPack to parse HTML, then this non-regex approach works:
string html = // get your html ...
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html); // doc.Load can also consume a response-stream directly
var result = Enumerable.Empty<string>();
var firstTD = doc.DocumentNode.SelectNodes("//td").FirstOrDefault();
if (firstTD != null)
{
if (firstTD.Attributes.Contains("onclick"))
{
string onclick = firstTD.Attributes["onclick"].Value;
int newWindowIndex = onclick.IndexOf("newWindow(", StringComparison.OrdinalIgnoreCase);
if (newWindowIndex >= 0)
{
string functionBody = onclick.Substring(newWindowIndex + "newWindow(".Length);
string[] tokens = functionBody.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries);
result = tokens.Take(2).Select(s => s.Trim(' ', '\''));
}
}
}

Formatting Twitter text (TweetText) with C#

Is there a better way to format text from Twitter to link the hyperlinks, username and hashtags? What I have is working but I know this could be done better. I am interested in alternative techniques. I am setting this up as a HTML Helper for ASP.NET MVC.
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Web;
using System.Web.Mvc;
namespace Acme.Mvc.Extensions
{
public static class MvcExtensions
{
const string ScreenNamePattern = #"#([A-Za-z0-9\-_&;]+)";
const string HashTagPattern = #"#([A-Za-z0-9\-_&;]+)";
const string HyperLinkPattern = #"(http://\S+)\s?";
public static string TweetText(this HtmlHelper helper, string text)
{
return FormatTweetText(text);
}
public static string FormatTweetText(string text)
{
string result = text;
if (result.Contains("http://"))
{
var links = new List<string>();
foreach (Match match in Regex.Matches(result, HyperLinkPattern))
{
var url = match.Groups[1].Value;
if (!links.Contains(url))
{
links.Add(url);
result = result.Replace(url, String.Format("{0}", url));
}
}
}
if (result.Contains("#"))
{
var names = new List<string>();
foreach (Match match in Regex.Matches(result, ScreenNamePattern))
{
var screenName = match.Groups[1].Value;
if (!names.Contains(screenName))
{
names.Add(screenName);
result = result.Replace("#" + screenName,
String.Format("#{0}", screenName));
}
}
}
if (result.Contains("#"))
{
var names = new List<string>();
foreach (Match match in Regex.Matches(result, HashTagPattern))
{
var hashTag = match.Groups[1].Value;
if (!names.Contains(hashTag))
{
names.Add(hashTag);
result = result.Replace("#" + hashTag,
String.Format("#{1}",
HttpUtility.UrlEncode("#" + hashTag), hashTag));
}
}
}
return result;
}
}
}
That is remarkably similar to the code I wrote that displays my Twitter status on my blog. The only further things I do that I do are
1) looking up #name and replacing it with Real Name;
2) multiple #name's in a row get commas, if they don't have them;
3) Tweets that start with #name(s) are formatted "To #name:".
I don't see any reason this can't be an effective way to parse a tweet - they are a very consistent format (good for regex) and in most situations the speed (milliseconds) is more than acceptable.
Edit:
Here is the code for my Tweet parser. It's a bit too long to put in a Stack Overflow answer. It takes a tweet like:
#user1 #user2 check out this cool link I got from #user3: http://url.com/page.htm#anchor #coollinks
And turns it into:
<span class="salutation">
To Real Name,
Real Name:
</span> check out this cool link I got from
<span class="salutation">
Real Name
</span>:
http://site.com/...
#coollinks
It also wraps all that markup in a little JavaScript:
document.getElementById('twitter').innerHTML = '{markup}';
This is so the tweet fetcher can run asynchronously as a JS and if Twitter is down or slow it won't affect my site's page load time.
I created helper method to shorten text to 140 chars with url included. You can set share length to 0 to exclude url from tweet.
public static string FormatTwitterText(this string text, string shareurl)
{
if (string.IsNullOrEmpty(text))
return string.Empty;
string finaltext = string.Empty;
string sharepath = string.Format("http://url.com/{0}", shareurl);
//list of all words, trimmed and new space removed
List<string> textlist = text.Split(' ').Select(txt => Regex.Replace(txt, #"\n", "").Trim())
.Where(formatedtxt => !string.IsNullOrEmpty(formatedtxt))
.ToList();
int extraChars = 3; //to account for the two dots ".."
int finalLength = 140 - sharepath.Length - extraChars;
int runningLengthCount = 0;
int collectionCount = textlist.Count;
int count = 0;
foreach (string eachwordformated in textlist
.Select(eachword => string.Format("{0} ", eachword)))
{
count++;
int textlength = eachwordformated.Length;
runningLengthCount += textlength;
int nextcount = count + 1;
var nextTextlength = nextcount < collectionCount ?
textlist[nextcount].Length :
0;
if (runningLengthCount + nextTextlength < finalLength)
finaltext += eachwordformated;
}
return runningLengthCount > finalLength ? finaltext.Trim() + ".." : finaltext.Trim();
}
There is a good resource for parsing Twitter messages this link, worked for me:
How to Parse Twitter Usernames, Hashtags and URLs in C# 3.0
http://jes.al/2009/05/how-to-parse-twitter-usernames-hashtags-and-urls-in-c-30/
It contains support for:
Urls
#hashtags
#usernames
BTW: Regex in the ParseURL() method needs reviewing, it parses stock symbols (BARC.L) into links.

Categories