Regex pattern for file extraction via url?

Regex pattern for file extraction via url? - c#

So, the html data I'm looking at is:
Action.log<br> 6/8/2015 3:45 PM
From this I need to extract either instances of Action.log,
My problem is I've been over a ton of regex tutorials and I still can't seem to brain up a pattern to extract it. I guess I'm lacking some fundamental understanding of regex, but any help would be appreciated.
Edit:
internal string[] ParseFolderIndex_Alpha(string url, WebDirectory directory)
{
try
{
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.Timeout = 3 * 60 * 1000;
request.KeepAlive = true;
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
if (response.StatusCode == HttpStatusCode.OK)
{
List<string> fileLocations = new List<string>(); string line;
using (StreamReader reader = new StreamReader(response.GetResponseStream()))
{
while ((line = reader.ReadLine()) != null)
{
int index = line.IndexOf("<a href=");
if (index >= 0)
{
string[] segments = line.Substring(index).Split('\"');
///Can Parse File Size Here: Add todo
if (!segments[1].Contains("/"))
{
fileLocations.Add(segments[1]);
UI.UpdatePatchNotes("Web File Found: " + segments[1]);
UI.UpdateProgressBar();
}
else
{
if (segments[1] != #"../")
{
directory.SubDirectories.Add(new WebDirectory(url + segments[1], this));
UI.UpdatePatchNotes("Web Directory Found: " + segments[1].Replace("/", string.Empty));
}
}
}
else if (line.Contains("</pre")) break;
}
}
response.Dispose(); /// After ((line = reader.ReadLine()) != null)
return fileLocations.ToArray<string>();
}
else return new string[0]; /// !(HttpStatusCode.OK)
}
catch (Exception e)
{
LogHandler.LogErrors(e.ToString(), this);
LogHandler.LogErrors(url, this);
return null;
}
}
That's what I was doing, the problem is I changed servers and the html IIS is displaying is different so I have to make new logic.
Edit / Conclusion:
First of all, I'm sorry I even mentions regex :P Secondly each platform will have to be handled individually depending on environment.
This is how I'm currently gathering the file names.
internal string[] ParseFolderIndex(string url, WebDirectory directory)
{
try
{
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.Timeout = 3 * 60 * 1000;
request.KeepAlive = true;
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
bool endMet = false;
if (response.StatusCode == HttpStatusCode.OK)
{
List<string> fileLocations = new List<string>(); string line;
using (StreamReader reader = new StreamReader(response.GetResponseStream()))
{
while (!endMet)
{
line = reader.ReadLine();
if (line != null && line != "" && line.IndexOf("</A>") >= 0)
{
if (line.Contains("</html>")) endMet = true;
string[] segments = line.Replace("\\", "").Split('\"');
List<string> paths = new List<string>();
List<string> files = new List<string>();
for (int i = 0; i < segments.Length; i++)
{
if (!segments[i].Contains('<'))
paths.Add(segments[i]);
}
paths.RemoveAt(0);
foreach (String s in paths)
{
string[] secondarySegments = s.Split('/');
if (s.Contains(".") || s.Contains("Verinfo"))
files.Add(secondarySegments[secondarySegments.Length - 1]);
else
{
directory.SubDirectories.Add(new WebDirectory
(url + "/" + secondarySegments[secondarySegments.Length - 2], this));
UI.UpdatePatchNotes("Web Directory Found: " + secondarySegments[secondarySegments.Length - 2]);
}
}
foreach (String s in files)
{
if (!String.IsNullOrEmpty(s) && !s.Contains('%'))
{
fileLocations.Add(s);
UI.UpdatePatchNotes("Web File Found: " + s);
UI.UpdateProgressBar();
}
}
if (line.Contains("</pre")) break;
}
}
}
response.Dispose(); /// After ((line = reader.ReadLine()) != null)
return fileLocations.ToArray<string>();
}
else return new string[0]; /// !(HttpStatusCode.OK)
}
catch (Exception e)
{
LogHandler.LogErrors(e.ToString(), this);
LogHandler.LogErrors(url, this);
return null;
}
}

Regex for this is overkill.
It's too heavy, and considering the format of the string will always be the same, you're going to find it easier to debug and maintain using splitting and substrings.
class Program {
static void Main(string[] args) {
String s = "Action.log<br> 6/8/2015 3:45 PM ";
String[] t = s.Split('"');
String fileName = String.Empty;
//To get the entire file name and path....
fileName = t[1].Substring(0, (t[1].Length));
//To get just the file name (Action.log in this case)....
fileName = t[1].Substring(0, (t[1].Length)).Split('/').Last();
}
}

Try matching the following pattern:
<A HREF="(?<url>.*)">
Then get the group called url from the match results.
Working example: https://regex101.com/r/hW8iH6/1

string text = #"Action.log<br> 6/8/2015 3:45 PM";
var match = Regex.Match(text, #"^(.*).*$");
var result = match.Groups[1].Value;
Try http://regexr.com/ or Regexbuddy!

Related

Search raw text from url, using keyword from textbox

Me and my buddy Xylophone have been at this for hours and cant figure this out, any help would be appreciated. I'm basically trying to read all the text from that URL and search for a keyword.
if (comboBoxEdit1.Text == "Hello")
{
label2.Text = "Current Status: Searching...";
this.dataGridView3.ScrollBars = ScrollBars.None;
this.dataGridView3.MouseWheel += new MouseEventHandler(mousewheel);
dataGridView3.Rows.Clear();
string line;
int row = 0;
List<String> LinesFound = new List<string>();
StreamReader file = new StreamReader("https://pastebin.com/raw/fWxKdRjN");
while ((line = file.ReadLine()) != null)
{
if (line.Contains(textEdit1.Text))
{
string[] Columns = line.Split(':');
dataGridView3.Rows.Add(line);
for (int i = 0; i < Columns.Length; i++)
{
dataGridView3[i, row].Value = Columns[i];
}
row++;
label2.Text = "Current Status: " + dataGridView3.Rows.Count + " Matche(s) Found";
}
else if (dataGridView3.RowCount == 0)
{
label2.Text = "Current Status: No Matche(s) Found";
}
}
}

You are doing it all wrong, if you want to read and pars the html content of the web page, you need to fetch the page using httpClient, or better take look at this library https://html-agility-pack.net/

We can use regular expression to check if 'raw' exist in the URL.
Regex.Matches() function will return an array with all occurrence of the match.
We can then use count property to find the no of occurence.
Regular expression to match raw in a url:(raw)
Below is the working code snippet:
public static void Main()
{
string pattern = #"(raw)";
Regex rgx = new Regex(pattern);
string url = "https://pastebin.com/raw/fWxKdRjN";
if (rgx.Matches(url).Count>0){
Console.WriteLine(Current Status: " + rgx.Matches(url).Count + " Matche(s) Found");
}
else {
Console.WriteLine("Current Status: No Matche(s) Found");
}
}

Read the flat file,group and write to file(Add special Characters as '*' in Empty Space)

E2739158012008-10-01O9918107NPF7547379999010012008-10-0100125000000
E2739158PU0000-00-00 010012008-10-0100081625219
E3180826011985-01-14L9918007NPM4927359999010011985-01-1400005620000
E3180826PU0000-00-00 020011985-01-14000110443500021997-01-1400000518799
E3292015011985-01-16L9918007NPM4927349999010011985-01-1600003623300
I have this flat file and I need to group this based on the 2nd position to 8th position
example(2739158/3180826/3292015) and write to another flat file.
So the data Starting with 'E' should Repeat in the single line along with that group field(2nd to 8th Position in the start) and I should take the 9th Position after 'E'
Also I need to replace Empty space with ('*' star)
For example
1st Line
2739158**E**012008-10-01O9918107NPF7547379999010012008-10-0100125000000*****E**012008-10-01O9918107NPF7547379999010012008-10-0100125000000
2nd Line
3180826**E**011985-01-14L9918007NPM4927359999010011985-01-1400005620000**E**011985-01-14L9918007NPM4927359999010011985-01-140000562000**E**011985-01-14L9918007NPM4927359999010011985-01-140000562000***
3rd Line
3292015**E**011985-01-16L9918007NPM4927349999010011985-01-1600003623300****
Can we do this in Stream reader c#, please?
Any help would be highly appreciated.The file size is more than 285 MB so it it good to read through Stream Reader?
Thanks

#jdweng: thanks very much for your input. i tried somehow without grouping and it works as expected.Thanks everyone who tried to solve the issue.
string sTest= string.Empty; List<SortLines> lines = new List<SortLines>();
List<String> FinalLines = new List<String>();
using (StreamReader sr = new StreamReader(#"C:\data\Input1.txt))
{
sr.ReadLine();
string line = "";
while (!sr.EndOfStream)
{
line = sr.ReadLine();
//line = line.Trim();
if (line.Length > 0)
{
line = line.Replace(" ", "*");
SortLines newLine = new SortLines()
{
key = line.Substring(1, 7),
line = line
};
if (sTest != newLine.key)
{
//Add the Line Items to String List
sOuterLine = sTest + sOneLine;
FinalLines.Add(sOuterLine);
string sFinalLine = newLine.line.Remove(1, 7);
string snewLine = newLine.key + sFinalLine;
sTest = snewLine.Substring(0, 7);
//To hold the data for the 1st occurence
sOtherLine = snewLine.Remove(0, 7);
bOtherLine = true;
string sKey = newLine.key;
lines.Add(newLine);
}
else if (sTest == newLine.key)
{
string sConcatLine = String.Empty;
string sFinalLine = newLine.line.Remove(1, 7);
//Check if 1st Set
if (bOtherLine == true)
{
sOneLine = sOtherLine + sFinalLine;
bOtherLine = false;
}
//If not add subsequent data
else
{
sOneLine = sOneLine + sFinalLine;
}
//Check for the last line in the flat file
if (sr.Peek() == -1)
{
sOuterLine = sTest + sOneLine;
FinalLines.Add(sOuterLine);
}
}
}
}
}
//Remove the Empty List
FinalLines.RemoveAll(x => x == "");
StreamWriter srWriter = new StreamWriter(#"C:\data\test.txt);
foreach (var group in FinalLines)
{
srWriter.WriteLine(group);
}
srWriter.Flush();
srWriter.Close();

Try code below :
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
namespace ConsoleApplication1
{
class Program
{
const string INPUT_FILENAME = #"c:\temp\test.txt";
const string OUTPUT_FILENAME = #"c:\temp\test1.txt";
static void Main(string[] args)
{
List<SortLines> lines = new List<SortLines>();
StreamReader reader = new StreamReader(INPUT_FILENAME);
string line = "";
while ((line = reader.ReadLine()) != null)
{
line = line.Trim();
if (line.Length > 0)
{
line = line.Replace(" ", "*");
SortLines newLine = new SortLines() { key = line.Substring(2, 7), line = line };
lines.Add(newLine);
}
}
reader.Close();
var groups = lines.GroupBy(x => x.key);
StreamWriter writer = new StreamWriter(OUTPUT_FILENAME);
foreach (var group in groups)
{
foreach (SortLines sortLine in group)
{
writer.WriteLine(sortLine.line);
}
}
writer.Flush();
writer.Close();
}
}
public class SortLines : IComparable<SortLines>
{
public string line { get; set; }
public string key { get; set; }
public int CompareTo(SortLines other)
{
return key.CompareTo(other);
}
}
}

Create multi-file zip from stream and download on the fly

I'm using DotNetZip library to make multi-file zip archive and download it on the fly (no need to wait for download to start). However I can't make it to download instantly. From browser dev console, in network tab I noticed that first zip file is "transferred" and after being fully transferred it starts downloading.
Here is code fragment:
using (var vZipArchive = new ZipFile())
{
if (vFilesTable.Rows.Count > 0)
{
string vPathFormat = null;
string vKeyName = null;
string vPrimKey = null;
string vFileName = null;
vZipArchive.CompressionLevel = CompressionLevel.BestSpeed;
vZipArchive.CompressionMethod = CompressionMethod.Deflate;
vZipArchive.Comment = "Document Archive";
foreach (DataRow vFile in vFilesTable.Rows)
{
if (vKeyFieldName != null && !string.IsNullOrEmpty(vKeyFieldName))
{
vPathFormat = "{0}/";
}
else
{
vPathFormat = "/";
}
if (vFile.Table.Columns.Contains("FileName") && vFile.Table.Columns.Contains("PrimKey"))
{
vPrimKey = vFile["PrimKey"].ToString();
vFileName = vFile["FileName"].ToString();
if (vKeyFieldName != null)
{
vKeyName = vFile[vKeyFieldName].ToString();
}
string vPath = null;
if (vKeyFieldName != null)
{
vPath = string.Format(vKeyFieldName + " " + vPathFormat + "{1}", vKeyName, vFileName);
}
else
{
vPath = string.Format(vPathFormat + vFileName);
}
using (var vFileStream = vUserContext.GetFileStream(vRecordSource.ViewName, new Guid(vPrimKey)))
{
vFileStream.Position = 0;
vZipArchive.AddEntry(vPath, vFileStream);
}
}
else
{
throw new Exception("Select 'FileName' and 'PrimKey' fields in underlying DataSource");
}
} //end loop
pContext.Response.Clear();
pContext.Response.BufferOutput = false;
pContext.Response.ContentType = "application/zip";
pContext.Response.AddHeader("Content-Disposition", string.Format("attachment; filename=\"{0}.zip\"", vZipName));
vZipArchive.Save(pContext.Response.OutputStream);
}
else
{
return;
}
}
I also tried using ZipOutputStream and SharpCompress.dll library.
So, what I am missing to make it work? Or its impossible?

Retrieve Specific Files from Directory Using FTP

I'm creating a C# application that needs to FTP to a directory to retrieve a file list. The following code works just fine. However, the folder that I'm FTPing to contains around 92,000 files. This code will not work in the way that I want it to for a file list of that size.
I'm looking only for files that begin with the string "c-". After doing some research, I'm not even sure how to begin trying to solve this issue. Is there any way I can modify this existing code for it to retrieve only those files?
public string[] getFileList() {
string[] downloadFiles;
StringBuilder result = new StringBuilder();
FtpWebRequest reqFTP;
try {
reqFTP = (FtpWebRequest)FtpWebRequest.Create(new Uri(ftpHost));
reqFTP.UseBinary = true;
reqFTP.Credentials = new NetworkCredential(ftpUser, ftpPass);
reqFTP.Method = WebRequestMethods.Ftp.ListDirectory;
WebResponse response = reqFTP.GetResponse();
StreamReader reader = new StreamReader(response.GetResponseStream());
string line = reader.ReadLine();
while (line != null) {
result.Append(line);
result.Append("\n");
line = reader.ReadLine();
}
// to remove the trailing '\n'
result.Remove(result.ToString().LastIndexOf('\n'), 1);
reader.Close();
response.Close();
return result.ToString().Split('\n');
}
catch (Exception ex) {
System.Windows.Forms.MessageBox.Show(ex.Message);
downloadFiles = null;
return downloadFiles;
}
}

I think the LIST doesn't support wildcard search and in fact it might be vary from different FTP platform and depend the COMMANDS support
you will need to download all the files name in the FTP directory using LIST , probably in the async way.

Here is an alternative implementation along the similar lines. I've tested this with as many as 1000 ftp files, it might work for you. Complete source code can be found here.
public List<ftpinfo> browse(string path) //eg: "ftp.xyz.org", "ftp.xyz.org/ftproot/etc"
{
FtpWebRequest request=(FtpWebRequest)FtpWebRequest.Create(path);
request.Method=WebRequestMethods.Ftp.ListDirectoryDetails;
List<ftpinfo> files=new List<ftpinfo>();
//request.Proxy = System.Net.WebProxy.GetDefaultProxy();
//request.Proxy.Credentials = CredentialCache.DefaultNetworkCredentials;
request.Credentials = new NetworkCredential(_username, _password);
Stream rs=(Stream)request.GetResponse().GetResponseStream();
OnStatusChange("CONNECTED: " + path, 0, 0);
StreamReader sr = new StreamReader(rs);
string strList = sr.ReadToEnd();
string[] lines=null;
if (strList.Contains("\r\n"))
{
lines=strList.Split(new string[] {"\r\n"},StringSplitOptions.None);
}
else if (strList.Contains("\n"))
{
lines=strList.Split(new string[] {"\n"},StringSplitOptions.None);
}
//now decode this string array
if (lines==null || lines.Length == 0)
return null;
foreach(string line in lines)
{
if (line.Length==0)
continue;
//parse line
Match m= GetMatchingRegex(line);
if (m==null) {
//failed
throw new ApplicationException("Unable to parse line: " + line);
}
ftpinfo item=new ftpinfo();
item.filename = m.Groups["name"].Value.Trim('\r');
item.path = path;
item.size = Convert.ToInt64(m.Groups["size"].Value);
item.permission = m.Groups["permission"].Value;
string _dir = m.Groups["dir"].Value;
if(_dir.Length>0 && _dir != "-")
{
item.fileType = directoryEntryTypes.directory;
}
else
{
item.fileType = directoryEntryTypes.file;
}
try
{
item.fileDateTime = DateTime.Parse(m.Groups["timestamp"].Value);
}
catch
{
item.fileDateTime = DateTime.MinValue; //null;
}
files.Add(item);
}
return files;
}

c# Remove rows from csv

I have two csv files. In the first file i have a list of users, and in the second file i have a list of duplicate users. Im trying to remove the rows in the first file that are equal to the second file.
Heres the code i have so far:
StreamWriter sw = new StreamWriter(path3);
StreamReader sr = new StreamReader(path2);
string[] lines = File.ReadAllLines(path);
foreach (string line in lines)
{
string user = sr.ReadLine();
if (line != user)
{
sw.WriteLine(line);
}
File 1 example:
Modify,ABAMA3C,Allpay - Free State - HO,09072701
Modify,ABCG327,Processing Centre,09085980
File 2 Example:
Modify,ABAA323,Group HR Credit Risk & Finance
Modify,ABAB959,Channel Sales & Service,09071036
Any suggestions?
Thanks.

All you'd have to do is change the following file paths in the code below and you will get a file back (file one) without the duplicate users from file 2. This code was written with the idea in mind that you want something that is easy to understand. Sure there are other more elegant solutions, but I wanted to make it as basic as possible for you:
(Paste this in the main method of your program)
string line;
StreamReader sr = new StreamReader(#"C:\Users\J\Desktop\texts\First.txt");
StreamReader sr2 = new StreamReader(#"C:\Users\J\Desktop\texts\Second.txt");
List<String> fileOne = new List<string>();
List<String> fileTwo = new List<string>();
while (sr.Peek() >= 0)
{
line = sr.ReadLine();
if(line != "")
{
fileOne.Add(line);
}
}
sr.Close();
while (sr2.Peek() >= 0)
{
line = sr2.ReadLine();
if (line != "")
{
fileTwo.Add(line);
}
}
sr2.Close();
var t = fileOne.Except(fileTwo);
StreamWriter sw = new StreamWriter(#"C:\Users\justin\Desktop\texts\First.txt");
foreach(var z in t)
{
sw.WriteLine(z);
}
sw.Flush();

If this is not homework, but a production thing, and you can install assemblies, you'll save 3 hours of your life if you swallow your pride and use a piece of the VB library:
There are many exceptions (CR/LF between commas=legal in quotes; different types of quotes; etc.) This will handle anything excel will export/import.
Sample code to load a 'Person' class pulled from a program I used it in:
Using Reader As New Microsoft.VisualBasic.FileIO.TextFieldParser(CSVPath)
Reader.TextFieldType = Microsoft.VisualBasic.FileIO.FieldType.Delimited
Reader.Delimiters = New String() {","}
Reader.TrimWhiteSpace = True
Reader.HasFieldsEnclosedInQuotes = True
While Not Reader.EndOfData
Try
Dim st2 As New List(Of String)
st2.addrange(Reader.ReadFields())
If iCount > 0 Then ' ignore first row = field names
Dim p As New Person
p.CSVLine = st2
p.FirstName = st2(1).Trim
If st2.Count > 2 Then
p.MiddleName = st2(2).Trim
Else
p.MiddleName = ""
End If
p.LastNameSuffix = st2(0).Trim
If st2.Count >= 5 Then
p.TestCase = st2(5).Trim
End If
If st2(3) > "" Then
p.AccountNumbersFromCase.Add(st2(3))
End If
While p.CSVLine.Count < 15
p.CSVLine.Add("")
End While
cases.Add(p)
End If
Catch ex As Microsoft.VisualBasic.FileIO.MalformedLineException
MsgBox("Line " & ex.Message & " is not valid and will be skipped.")
End Try
iCount += 1
End While
End Using

this to close the streams properly:
using(var sw = new StreamWriter(path3))
using(var sr = new StreamReader(path2))
{
string[] lines = File.ReadAllLines(path);
foreach (string line in lines)
{
string user = sr.ReadLine();
if (line != user)
{
sw.WriteLine(line);
}
}
}
for help on the real logic of removal or compare, answer the comment of El Ronnoco above...

You need to close the streams or utilize using clause
sw.Close();
using(StreamWriter sw = new StreamWriter(#"c:\test3.txt"))

You can use LINQ...
class Program
{
static void Main(string[] args)
{
var fullList = "TextFile1.txt".ReadAsLines();
var removeThese = "TextFile2.txt".ReadAsLines();
//Change this line if you need to change the filter results.
//Note: this assume you are wanting to remove results from the first
// list when the entire record matches. If you want to match on
// only part of the list you will need to split/parse the records
// and then filter your results.
var cleanedList = fullList.Except(removeThese);
cleanedList.WriteAsLinesTo("result.txt");
}
}
public static class Tools
{
public static IEnumerable<string> ReadAsLines(this string filename)
{
using (var reader = new StreamReader(filename))
while (!reader.EndOfStream)
yield return reader.ReadLine();
}
public static void WriteAsLinesTo(this IEnumerable<string> lines, string filename)
{
using (var writer = new StreamWriter(filename) { AutoFlush = true, })
foreach (var line in lines)
writer.WriteLine(line);
}
}

using(var sw = new StreamWriter(path3))
using(var sr = new StreamReader(path))
{
string []arrRemove = File.ReadAllLines(path2);
HashSet<string> listRemove = new HashSet<string>(arrRemove.Count);
foreach(string s in arrRemove)
{
string []sa = s.Split(',');
if( sa.Count < 2 ) continue;
listRemove.Add(sa[1].toUpperCase());
}
string line = sr.ReadLine();
while( line != null )
{
string []sa = line.Split(',');
if( sa.Count < 2 )
sw.WriteLine(line);
else if( !listRemove.contains(sa[1].toUpperCase()) )
sw.WriteLine(line);
line = sr.ReadLine();
}
}

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Regex pattern for file extraction via url? - c#

Try matching the following pattern: <A HREF="(?<url>.*)"> Then get the group called url from the match results. Working example: https://regex101.com/r/hW8iH6/1

string text = #"Action.log<br> 6/8/2015 3:45 PM"; var match = Regex.Match(text, #"^(.).$"); var result = match.Groups[1].Value; Try http://regexr.com/ or Regexbuddy!

Related

Search raw text from url, using keyword from textbox

Read the flat file,group and write to file(Add special Characters as '*' in Empty Space)

Create multi-file zip from stream and download on the fly

Retrieve Specific Files from Directory Using FTP

c# Remove rows from csv

Categories

Resources

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Regex pattern for file extraction via url? - c#

Try matching the following pattern: <A HREF="(?<url>.*)"> Then get the group called url from the match results. Working example: https://regex101.com/r/hW8iH6/1

string text = #"Action.log<br> 6/8/2015 3:45 PM"; var match = Regex.Match(text, #"^(.*).*$"); var result = match.Groups[1].Value; Try http://regexr.com/ or Regexbuddy!

Related

Search raw text from url, using keyword from textbox

Read the flat file,group and write to file(Add special Characters as '*' in Empty Space)

Create multi-file zip from stream and download on the fly

Retrieve Specific Files from Directory Using FTP

c# Remove rows from csv

Categories

Resources

string text = #"Action.log<br> 6/8/2015 3:45 PM"; var match = Regex.Match(text, #"^(.).$"); var result = match.Groups[1].Value; Try http://regexr.com/ or Regexbuddy!