So, I'm fairly new to coding, but I've never had a problem with IndexOf until now. I'm trying to search through an HTML string which looks like:
" data-pid=\"6598160343\">\n\n https://minneapolis.craigslist.org/dak/fuo/d/executive-desk-3-piece-set/6598160343.html\"
class=\"result-image gallery\"
data-ids=\"1:00B0B_hkRi5TEyM9Q,1:00z0z_jTtBxHxlxAZ,1:00p0p_2GU15WOHDEB,1:00909_eKQVd7O1pfE\">\n
$1500\n \n\n \n \n favorite this post\n
\n\n Jun
4\n\n\n https://minneapolis.craigslist.org/dak/fuo/d/executive-desk-3-piece-set/6598160343.html\"
data-id=\"6598160343\" class=\"result-title hdrlnk\">Executive Desk (3
piece set)\n\n\n \n
$1500\n\n\n\n \n pic\n
map\n
\n\n \n hide this posting\n
\n\n \n \n restore\n restore this posting\n
\n\n \n \n\n " string
I'm trying to find the index of specific elements so that I can grab the data later. Here's what I have to find the indexes of the positions on either side of the data I want:
DataBookends bkEnds = new DataBookends
{
    PIDFrom = (post.IndexOf(@"pid=\""")) + (@"pid=\""".Length),
    URLFrom = (post.IndexOf(@"<a href=\")) + (@"<a href=\".Length),
    PriceFrom = (post.IndexOf(@"result-price\"">$")) + (@"result-price\"">$".Length),
    DateFrom = (post.IndexOf(@"datetime=\""")) + (@"datetime=\""".Length),
    TitleFrom = (post.IndexOf(@"result-title hdrlnk\"">")) + (@"result-title hdrlnk\"">".Length),
    LocationFrom = (post.IndexOf(@"result-hood\""> (")) + (@"result-hood\""> (".Length)
};
bkEnds.PIDTo = post.IndexOf(@"\""", bkEnds.PIDFrom);
bkEnds.URLTo = post.IndexOf(@"\", bkEnds.URLFrom);
bkEnds.PriceTo = post.IndexOf(@"</span>", bkEnds.PriceFrom);
bkEnds.DateTo = post.IndexOf(@"\", bkEnds.DateFrom);
bkEnds.TitleTo = post.IndexOf(@"</a>", bkEnds.TitleTo);
bkEnds.LocationTo = post.IndexOf(@"\", bkEnds.LocationFrom);
return bkEnds;
However, whenever I try to run it, it either doesn't find anything, or the index values are incorrect. I know I'm missing something simple but I can't figure it out and I feel like a moron. Is it something to do with escape characters I'm not seeing or something with how my string is formatted?
Help please?
EDIT:
I initially tried using the HTML Agility Pack, but I was having trouble understanding how to extract the data I needed so I thought using string.substring() would've been more straightforward.
The index values I'm getting are entirely wrong, even before I tried adding the backslashes. I'll be getting rid of those.
I'll write this answer but really it was CraigW in the comments who spotted your error. I think it could still use some explaining as you missed it. Also, the other comments are right that a parser might be the way to go. I still think you should understand the mistake you made as it's generally useful.
You said the variable has this string
" data-pid=\"6598160343\">\n\n https://minneapolis.craigslist.org/dak/fuo/d/executive-desk-3-piece-set/6598160343.html\" class=\"result-image gallery\" data-ids=\"1:00B0B_hkRi5TEyM9Q,1:00z0z_jTtBxHxlxAZ,1:00p0p_2GU15WOHDEB,1:00909_eKQVd7O1pfE\">\n $1500\n \n\n \n \n favorite this post\n
\n\n Jun 4\n\n\n https://minneapolis.craigslist.org/dak/fuo/d/executive-desk-3-piece-set/6598160343.html\" data-id=\"6598160343\" class=\"result-title hdrlnk\">Executive Desk (3 piece set)\n\n\n \n
$1500\n\n\n\n \n pic\n
map\n
\n\n \n hide this posting\n
\n\n \n \n restore\n restore this posting\n
\n\n \n
\n\n " string
which seems to have come from the debugger. You're searching with
post.IndexOf(@"pid=\""")
this won't find a hit, because it is literally looking for pid=\" which is not in your variable. Your variable actually contains
data-pid="6598160343">
https://minneap....
The debugger showed it as
data-pid=\"6598160343\">\n\n https://minneap
because it always 'escapes' quotes (i.e. a " in the variable shows in the watch window as \") and similarly newlines appear as \n. If you click the magnifying glass icon you will see the string as it really is, without the escapes.
Hope that clears up your confusion. If it does, you will now realise that this code would work:
post.IndexOf(@"pid=""")
Also, for your interest, note that if you don't use @ before a string then you escape the " with a backslash, e.g.
post.IndexOf("pid=\"")
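To make the difference between the spellings concrete, here is a minimal sketch. The class name `LiteralDemo`, its constants, and the shortened sample string (including the `<li` tag) are made up for illustration; only the three search strings come from the discussion above.

```csharp
using System;

public static class LiteralDemo
{
    // Verbatim literal: "" inside @"..." is one quote character.
    public const string Verbatim = @"pid=""";
    // Regular literal: \" is one quote character.
    public const string Escaped = "pid=\"";
    // Verbatim literal from the question: the backslash is a real character here.
    public const string WithBackslash = @"pid=\""";

    public static void Main()
    {
        // Hypothetical sample of what the variable really holds: plain quotes, no backslashes.
        string post = "<li data-pid=\"6598160343\">";

        Console.WriteLine(Verbatim == Escaped);        // True: same five characters
        Console.WriteLine(WithBackslash.Length);       // 6: p i d = \ "
        Console.WriteLine(post.IndexOf(WithBackslash, StringComparison.Ordinal)); // -1
        Console.WriteLine(post.IndexOf(Verbatim, StringComparison.Ordinal));      // 9
    }
}
```

The search only fails for the spelling that smuggles a backslash into the pattern, which is exactly the case in the question's code.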
I think you can change your code a little, because as written it's really hard to debug. See my code below to get the idea. You can copy and paste the method ExtractData (and the class as well) into your code, but you need to add some checks that the start and end patterns can actually be found in the content.
using System;
public static class StringFinder
{
    // Returns the text between the first patternStart and the next patternEnd.
    // No validation: IndexOf returns -1 when a pattern is missing.
    public static string ExtractData(this string content, string patternStart, string patternEnd)
    {
        var indexStart = content.IndexOf(patternStart) + patternStart.Length;
        var indexEnd = content.IndexOf(patternEnd, indexStart);
        return content.Substring(indexStart, indexEnd - indexStart);
    }
}
public class Program
{
    public static void Main()
    {
        // Note: this verbatim literal contains real backslashes (\") and the two characters \n, not newlines.
        var data = @" data-pid=\""6598160343\"">\n\n https://minneapolis.craigslist.org/dak/fuo/d/executive-desk-3";
        Console.WriteLine(data.ExtractData(@"data-pid=\""", @"\"">"));
    }
}
Result 6598160343
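The missing verification could look something like the sketch below. `SafeStringFinder` and `TryExtractData` are hypothetical names, not part of the answer's code, and the sample string assumes the data holds plain quotes with no backslashes.

```csharp
using System;

public static class SafeStringFinder
{
    // Returns false instead of throwing (or returning garbage) when either pattern is missing.
    public static bool TryExtractData(this string content, string patternStart,
                                      string patternEnd, out string result)
    {
        result = string.Empty;

        int start = content.IndexOf(patternStart, StringComparison.Ordinal);
        if (start < 0) return false;              // start pattern not found
        start += patternStart.Length;

        int end = content.IndexOf(patternEnd, start, StringComparison.Ordinal);
        if (end < 0) return false;                // end pattern not found after the start

        result = content.Substring(start, end - start);
        return true;
    }
}

public static class SafeStringFinderDemo
{
    public static void Main()
    {
        string html = "<li data-pid=\"6598160343\">";   // hypothetical sample
        if (html.TryExtractData("data-pid=\"", "\"", out string pid))
            Console.WriteLine(pid);
        else
            Console.WriteLine("pattern not found");
    }
}
```

Without the `start < 0` check, a missing start pattern silently yields `-1 + patternStart.Length`, which is one way to end up with index values that look plausible but are wrong.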
So I figured it out, I ended up going with HTML Agility Pack as was suggested by Jeremy. I wasn't able to figure out what exactly was wrong with how I was searching through it with IndexOf and Substring (for example: it would skip "" and continue on until a point that didn't contain any of those characters), but I'm not going to try web-scraping that way again.
For the future, HTML Agility Pack is the way to go!
Related
Hi, I'm not exactly sure the title does this question justice; I'm not too good at phrasing, sorry.
What I'm trying to do is something like this:
String joggingResults = ",Distance: 2.4km, Duration: 14minutes,";
Ideally, I would like to search joggingResults for "," and output the words beside it, stopping when it finds another ",". Does this make any sense? haha
My expected result would be something like this, with each line as a separate string:
Distance: 2.4km
Duration: 14minutes
I hope someone helps me out tysm
You can split using ',' and then loop through the array and display the results.
var results = joggingResults.Split(',');
foreach (var item in results)
{
    Console.WriteLine(item);
}
Note: this assumes a console application. Display the items however suits your type of application.
joggingResults.Split(',')
Will give you a collection of strings split where the commas are.
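One wrinkle with a plain Split(',') on this particular input: the leading and trailing commas produce empty entries, and the second field keeps its leading space. A minimal sketch that tidies both up (the names `JoggingDemo` and `SplitFields` are made up for illustration):

```csharp
using System;
using System.Linq;

public static class JoggingDemo
{
    // Splits on commas, drops empty pieces, and trims surrounding whitespace.
    public static string[] SplitFields(string input)
    {
        return input
            .Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(part => part.Trim())
            .Where(part => part.Length > 0)
            .ToArray();
    }

    public static void Main()
    {
        string joggingResults = ",Distance: 2.4km, Duration: 14minutes,";
        foreach (string field in SplitFields(joggingResults))
        {
            Console.WriteLine(field);
        }
    }
}
```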
Just starting to muddle my way through C# and I have a question which may be really simple (once somebody explains it to me).
I have a text box asking for the user's National Insurance Number (this program doesn't do anything, it's just me trying to figure out the formatting sequences), but I'm pulling my hair out trying to work out how to display this back to the label.
At the moment I have the following:
string result = String.Format("Thank you, {0}"+
" for your business. You NI # is {1:???}",
nameTextBox.Text,
socialTextBox.Text);
resultLabel.Text = result;
I don't know what to replace the ? with.. Any help would be really appreciated.
Many Thanks
I was looking for something like BN-201285-T
You could make your own function that formats the string to the desired format:
private string CustomFormat(string input) {
    return string.Format("BN-{0}-T", input);
}
Then pass the formatted string to the String.Format call:
string result = String.Format("Thank you, {0}" +
    " for your business. You NI # is {1}",
    nameTextBox.Text,
    CustomFormat(socialTextBox.Text));
resultLabel.Text = result;
I'm trying to remove new lines from a text file. Opening the text file in Notepad doesn't reveal the line breaks I'm trying to remove (it looks like one big wall of text); however, when I open the file in Sublime, I can see them.
In Sublime, I can remove the pattern '\n\n' and then the pattern '\n(?!AAD)' no problem. However, when I run the following code, the resulting text file is unchanged:
public void Format(string fileloc)
{
    string str = File.ReadAllText(fileloc);
    File.WriteAllText(fileloc + "formatted", Regex.Replace(Regex.Replace(str, "\n\n", ""), "\n(?!AAD)", ""));
}
What am I doing wrong?
If you do not want to spend hours trying to re-adjust the code for various types of linebreaks, here is a generic solution:
string str = File.ReadAllText(fileloc);
File.WriteAllText(fileloc + "formatted",
Regex.Replace(Regex.Replace(str, "(?:\r?\n|\r){2}", ""), "(?:\r?\n|\r)(?!AAD)", "")
);
Details:
A line break can be matched with (?:\r?\n|\r): either an optional CR followed by a mandatory LF, or a lone CR. To match two consecutive line breaks, append a limiting quantifier: (?:\r?\n|\r){2}.
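Here is a minimal, self-contained sketch of the same two-pass idea. It deliberately swaps in a slightly different pattern from the one above: an atomic group, (?>\r\n|\r|\n). With the non-atomic version the engine can backtrack and treat the CR and the LF of a single CRLF as two separate line breaks, so on a CRLF file one break can satisfy {2}. The atomic group stops that. The names `LineBreakDemo` and `Strip` are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LineBreakDemo
{
    // One line break of any flavour; atomic so a CRLF is never split into CR + LF.
    private const string LineBreak = @"(?>\r\n|\r|\n)";

    // Removes doubled line breaks, then any remaining break not followed by "AAD".
    public static string Strip(string text)
    {
        string withoutDoubles = Regex.Replace(text, LineBreak + "{2}", "");
        return Regex.Replace(withoutDoubles, LineBreak + "(?!AAD)", "");
    }

    public static void Main()
    {
        string sample = "one\r\n\r\ntwo\r\nAAD three\rfour";
        // Show the result with the control characters made visible.
        Console.WriteLine(Strip(sample).Replace("\r", "\\r").Replace("\n", "\\n"));
    }
}
```

In this sample the doubled CRLF and the lone CR before "four" are removed, while the CRLF in front of "AAD" survives intact.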
An empirical solution. Opening your sample file in binary mode revealed that it contains 0x0D characters, which are carriage returns \r. So I came up with this (multiple lines for easier debugging):
public void Format(string fileloc)
{
    var str = File.ReadAllText(fileloc);
    var firstround = Regex.Replace(str, @"\r\r", "");
    var secondround = Regex.Replace(firstround, @"\r(?!AAD)", "");
    File.WriteAllText(fileloc + "formatted", secondround);
}
Is this possibly a Windows/Linux line-ending mismatch? Try replacing '\r\n' instead.
I have a big file with a bunch of data in it, but I only want to grab parts of it. Let me explain which parts I'm interested in:
(imagine "x" as an IP address)
(imagine "?" as any alphanumeric string of any length)
(imagine "MD5" as an MD5 hash)
(The text file looks like the following, though not literally)
'xxx.xxx.xxx.xxx'
xxxxxxxxxx
'?'
'?'
'MD5'
Now my question is: how could I identify the line
'xxx.xxx.xxx.xxx'
anywhere in a file, scanning from the beginning, and then automatically write both of the '?' entries and the 'MD5' entry to another file for each IP address found?
So in a nutshell, the program should start at the beginning of the file, read the contents, if it hits an IP Address (Regex: '\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b' works fine for me), skip one line below, then start copying the other data to another file until it hits the MD5 entry (Regex: '[a-f0-9]{32}' works fine for me), then iterate again from that point and so on looking for another instance of an IP Address etc, etc. It should keep doing that until it reaches the end of the file.
I'm trying to do this myself but I don't even know where to start, or methods of doing it at all.
You can use the following to match the content you are looking for and copy it to the desired location or file:
('\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b')(\s*.+\s*)([\s\S]*?)('\b[a-f0-9]{32}\b')
And extract $1$3$4
Code:
String regex = "('\\b\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\b')(\\s*.+\\s*)([\\s\\S]*?)('\\b[a-f0-9]{32}\\b')";
Pattern pattern = Pattern.compile(regex);
Matcher m = pattern.matcher(myString); // myString holds the file contents
while (m.find()) {
    System.out.println("group(1): " + m.group(1)); // the quoted IP address
    //System.out.println("group(2): " + m.group(2)); // the line to skip
    System.out.println("group(3): " + m.group(3)); // everything up to the MD5 entry
    System.out.println("group(4): " + m.group(4)); // the quoted MD5 hash
}
Given the fact that your file is machine generated and that the overall pattern is pretty specific, I don't think it's necessary to be overly specific with the IP address.
Matching it as "a bunch of digits and dots in single quotes" is probably enough, in the context of the rest of the pattern (*).
Here is an expression that matches your entire requirement into named groups:
^'(?<IP>[\d.]+)'\s+
^(?<ID>\w*)\s+
^'(?<line1>\w*)'\s+
^'(?<line2>\w*)'\s+
^'(?<MD5>[A-Fa-f0-9]{32})'
Use it with the Multiline and IgnorePatternWhitespace regex options (the latter means you can keep the regex layout for better readability).
(*) Besides, regex patterns for IP addresses are literally all over the place, in countless examples. Of course you can use something more sophisticated than '[\d.]+' if you think it's necessary.
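For what it's worth, here is a rough sketch of how that expression could be used from C#. The class and method names (`RecordParser`, `Extract`) and the sample records are invented for illustration; writing the results to a file is left out, so the sketch only returns the captured values.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RecordParser
{
    // The pattern keeps its multi-line layout thanks to IgnorePatternWhitespace;
    // Multiline makes ^ match at the start of every line.
    private static readonly Regex RecordPattern = new Regex(@"
        ^'(?<IP>[\d.]+)'\s+
        ^(?<ID>\w*)\s+
        ^'(?<line1>\w*)'\s+
        ^'(?<line2>\w*)'\s+
        ^'(?<MD5>[A-Fa-f0-9]{32})'",
        RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace);

    // Returns one "line1 line2 MD5" string per record found in the text.
    public static List<string> Extract(string text)
    {
        var results = new List<string>();
        foreach (Match m in RecordPattern.Matches(text))
        {
            results.Add(m.Groups["line1"].Value + " " +
                        m.Groups["line2"].Value + " " +
                        m.Groups["MD5"].Value);
        }
        return results;
    }

    public static void Main()
    {
        // Hypothetical input in the layout described in the question.
        string text =
            "'192.168.0.1'\n1234567890\n'alpha'\n'beta'\n'd41d8cd98f00b204e9800998ecf8427e'\n" +
            "'10.0.0.7'\n42\n'gamma'\n'delta'\n'0123456789abcdef0123456789ABCDEF'\n";

        foreach (string record in Extract(text))
        {
            Console.WriteLine(record);
        }
    }
}
```

The ID line is matched but never read, which takes care of the "skip one line below the IP address" part of the requirement.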
I tried this out in Java, as below.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestRegex
{
    /**
     * @param args unused
     */
    public static void main(String[] args)
    {
        String input = "assasasa 123.234.223.223 333 aad sddsf 343sdd sds23343 ssdfs33344 MD5=aas jjsjjdjd 143.234.223.223 333 aad sddsf 343sdd sds23343 ssdfs33344 MD5=asas";
        String regexPattern = "(\\b[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\b).*?([A-Z a-z]+[0-9]+=.*?\\s)";
        Pattern pattern = Pattern.compile(regexPattern);
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            System.out.println("start(): " + m.start());
            System.out.println("end(): " + m.end());
            System.out.println("matcher: " + m.toString());
            System.out.println("group(1): " + m.group(1)); // the IP address
            System.out.println("group(2): " + m.group(2)); // the MD5=... entry
        }
    }
}
Example String
This is an important example about regex for my work.
I can extract "important example about regex" with this snippet: (?<=an).*?(?=for)
But I would like to extract the string from right to left. In this example, the first position must be (for) and the second position must be (an).
I mean that the extraction should work backwards.
I tried to do this in the else-if case of the code below, but it doesn't work.
public string FnExtractString(string _QsString, string _QsStart, string _QsEnd, string _QsWay = "LR")
{
    if (_QsWay == "LR")
        return Regex.Match(_QsString, @"(?<=" + _QsStart + ").*?(?=" + _QsEnd + ")").Value;
    else if (_QsWay == "RL")
        return Regex.Match(_QsString, @"(?=" + _QsStart + ").*?(<=" + _QsEnd + ")").Value;
    else
        return _QsString;
}
Thanks in advance.
EDIT
My real example is below:
#Var|First String|ID_303#Var|Second String|ID_304#Var|Third String|DI_t55
When I pass two strings to my method (for example "|ID_304" and "#Var|"), I would like to extract "Second String". This example is only a small piece of my real string, and the string's contents vary.
No need for lookahead or lookbehind! You could just use:
(.*)\san\s.*\sfor\s
The \s demands whitespace, so you don't match an import*an*t.
One potential problem in your current solution is that the strings passed in may contain special characters, which need to be escaped with Regex.Escape before concatenation:
return Regex.Match(_QsString, @"(?<=" + Regex.Escape(_QsStart) + ").*?(?=" + Regex.Escape(_QsEnd) + ")").Value;
For your other requirement of matching RL, I don't understand your requirement.
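One possible reading of the RL requirement is "find the end marker, then take the text back to the nearest start marker before it". Under that assumption (it is a guess at the intent, not something the question confirms), a sketch could look like the following. It does not use RightToLeft matching; instead it uses a negative lookahead so the captured text can never run across another start marker. `DelimiterExtractor` and `Between` are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DelimiterExtractor
{
    // Returns the text between `start` and `end`, where `start` is the closest
    // start marker preceding `end`. Returns an empty string when nothing matches.
    public static string Between(string input, string start, string end)
    {
        string s = Regex.Escape(start);   // "#Var|" contains # and |, both special in a regex
        string e = Regex.Escape(end);

        // start, then characters that do not begin another start marker, then end
        string pattern = s + "((?:(?!" + s + ").)*?)" + e;

        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        string real = "#Var|First String|ID_303#Var|Second String|ID_304#Var|Third String|DI_t55";
        Console.WriteLine(Between(real, "#Var|", "|ID_304"));   // Second String

        string sentence = "This is an important example about regex for my work.";
        Console.WriteLine(Between(sentence, " an ", " for "));  // important example about regex
    }
}
```

Without the lookahead, the lazy `.*?` would start at the first "#Var|" and return "First String|ID_303#Var|Second String", which is presumably the behaviour the question is trying to avoid.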