I'm trying to get data from a CSV file from a web service.
If I paste the URL into my browser, the CSV is downloaded and looks like the following example:
"ID","ProductName","Company"
"1","Apples","Alfreds futterkiste"
"2","Oranges","Alfreds futterkiste"
"3","Bananas","Alfreds futterkiste"
"4","Salad","Alfreds futterkiste"
...next 96 rows
However, I don't want to download the CSV file first and then extract data from it afterwards.
The web service uses pagination and returns 100 rows per request (determined by the &number parameter, with a max of 100). After the first request I can use the &next parameter to fetch the next 100 rows based on ID. For instance, the URL
http://testWebservice123.com/Example.csv?auth=abc&number=100&next=100
will get me the rows with IDs 101 to 200. So if there are a lot of rows, I would end up downloading a lot of CSV files and saving them to the hard drive. Instead of downloading and saving the CSV files first, I want to read the data directly from the web service so I can write it straight to a database without ever saving the CSV files.
After a bit of searching I came up with the following solution:
static void Main(string[] args)
{
    string startUrl = "http://testWebservice123.com/Example.csv?auth=abc&number=100";
    string url = "";
    string deltaRequestParameter = "";
    string lastLine;
    int numberOfLines = 0;
    do
    {
        url = startUrl + deltaRequestParameter;
        using (WebClient myWebClient = new WebClient()) // dispose the client as well
        using (Stream myStream = myWebClient.OpenRead(url))
        using (StreamReader sr = new StreamReader(myStream))
        {
            numberOfLines = 0;
            while (!sr.EndOfStream)
            {
                var row = sr.ReadLine();
                var values = row.Split(',');
                // do whatever with the rows by now - i.e. write to console
                Console.WriteLine(values[0] + " " + values[1]);
                lastLine = values[0].Replace("\"", ""); // last line in the loop - get the last ID
                numberOfLines++;
                deltaRequestParameter = "&next=" + lastLine;
            }
        }
    } while (numberOfLines == 101); // since the header is returned each time, the number of rows will be 101 until we get to the last request
}
but I'm not sure if this is an up-to-date way of doing this, or if there is a better (easier/simpler) way. In other words, I'm unsure whether using WebClient and StreamReader is the right way to go.
In this thread: how to read a csv file from a url?
WebClient.DownloadString is mentioned, as well as WebRequest. But if I want to write to a database without saving the CSV to disk, which is the best option?
Furthermore, will the approach I have taken save data to temporary disk storage behind the scenes, or will all data be read into memory and then disposed when the loop completes?
I have read the following documentation but can't seem to find out what these classes do behind the scenes:
StreamReader: https://learn.microsoft.com/en-us/dotnet/api/system.io.streamreader?view=netframework-4.7.2
Stream: https://learn.microsoft.com/en-us/dotnet/api/system.io.stream?view=netframework-4.7.2
Edit:
I guess I could also use the following TextFieldParser... but my question is really still the same:
(using the Microsoft.VisualBasic assembly)
using (Stream myStream = myWebClient.OpenRead(url))
using (TextFieldParser parser = new TextFieldParser(myStream))
{
    numberOfLines = 0;
    parser.TrimWhiteSpace = true; // if you want
    parser.Delimiters = new[] { "," };
    parser.HasFieldsEnclosedInQuotes = true;
    while (!parser.EndOfData)
    {
        string[] line = parser.ReadFields();
        Console.WriteLine(line[0] + " " + line[1]);
        numberOfLines++;
        deltaRequestParameter = "&next=" + line[0];
    }
}
The HttpClient class in System.Net.Http is available as of .NET 4.5. You have to work with async code, but it's not a bad idea to get into it if you're dealing with the web.
As sample data, I'll use jsonplaceholder's "todo" list. It provides JSON data, not CSV data, but its structure is simple enough to serve our purpose in the example below.
This is the core function, which fetches from jsonplaceholder in a similar way to your "testWebservice123" site, although I'm just getting the first 3 todos, as opposed to testing for when I've hit the last page (you would probably keep your do-while logic for that).
async void DownloadPagesAsync()
{
    for (var i = 1; i <= 3; i++)
    {
        var pageToGet = $"https://jsonplaceholder.typicode.com/todos/{i}";
        using (var client = new HttpClient())
        using (HttpResponseMessage response = await client.GetAsync(pageToGet))
        using (HttpContent content = response.Content)
        using (var stream = (MemoryStream)await content.ReadAsStreamAsync())
        using (var sr = new StreamReader(stream))
        {
            while (!sr.EndOfStream)
            {
                // strip the quotes and commas so each line is a plain "key: value" pair
                var row = sr.ReadLine()
                            .Replace(@"""", "")
                            .Replace(",", "");
                if (row.IndexOf(":") == -1)
                    continue;
                var values = row.Split(':');
                Console.WriteLine($"{values[0]}, {values[1]}");
            }
        }
    }
}
This is how you would call the function, such as you would in a Main() method:
Task t = new Task(DownloadPagesAsync);
t.Start();
The new task here takes in an "action", or in other words a function that returns void, as a parameter. Then you start the task. Be careful: it is asynchronous, so any code you have after t.Start() may very well run before your task completes.
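If you need Main() to block until the work finishes, one option (a sketch, assuming you change the method's return type from void to Task) is to wait on the returned task:

// Sketch: assumes the signature is changed to "async Task DownloadPagesAsync()"
Task t = DownloadPagesAsync();
t.Wait(); // blocks the calling thread until the download loop completes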
As to your question of whether the stream reads "in memory" or not: running GetType() on "stream" in the code resulted in a "MemoryStream" type, though it is only recognized as a "Stream" object at compile time. A MemoryStream is definitely in-memory. I'm not really sure whether any of the other kinds of stream objects save temporary files behind the scenes, but I'm leaning towards not.
But looking into the inner workings of a class, though commendable, is not usually necessary to settle your concern about disposing. For any class, just see if it implements IDisposable. If it does, put it in a "using" statement, as you have done in your code. Whether the program terminates as expected or via an error, the proper disposal will happen once control has passed out of the "using" block.
HttpClient is in fact the newer approach. From what I understand, it does not replace all of the functionality of WebClient, but it is stronger in many respects. See this SO post for more details comparing the two classes.
Also, something to know about WebClient is that it can be simple, but limiting. If you run into issues, you will need to look into the HttpWebRequest class, which is a "lower level" class that gives you greater access to the nuts and bolts of things (such as working with cookies).
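For instance, here is a rough sketch of HttpWebRequest with a cookie container (the URL is just a placeholder; requires using System.Net and System.IO):

var cookies = new CookieContainer();
var request = (HttpWebRequest)WebRequest.Create("http://example.com/login");
request.CookieContainer = cookies; // cookies set by the server are captured here
using (var response = (HttpWebResponse)request.GetResponse())
using (var reader = new StreamReader(response.GetResponseStream()))
{
    Console.WriteLine(reader.ReadToEnd());
}
// later requests that reuse the same CookieContainer send the cookies back automatically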
Related
Fairly new to C#, sitting here practicing. I have a single file with 10 million passwords that I downloaded to practice with.
I want to break the file down into lists of 99: stop at 99, do something, then pick up where it left off and repeat the "do something" with the next 99 until it reaches the last item in the file.
I can do the counting part fine; it's stopping at 99 and continuing where I left off that I'm having trouble with. Anything I find online is not close to what I am trying to do, and anything I add to this code on my own does not work.
I am more than happy to share more information if I am not clear. Just ask and I will respond; however, I might not be able to respond until tomorrow depending on what time it is.
Here is the code I have started:
using System;
using System.IO;

namespace lists01
{
    class Program
    {
        static void Main(string[] args)
        {
            int count = 0;
            var f1 = @"c:\tmp\10-million-password-list-top-1000000.txt";
            var content = File.ReadAllLines(f1);
            foreach (var v2 in content)
            {
                count++;
                Console.WriteLine(v2 + "\t" + count);
            }
        }
    }
}
My end goal is to do this with any list of items from files I have. I am only using this password list because it is sizable and I thought it would be good for this exercise.
Thank you
Keith
Here are a couple of different ways to approach this. Normally, I would suggest the ReadAllLines function that you already have in your code. The trade-off is that you load the entire file into memory at once and then operate on it.
Using ReadAllLines in concert with LINQ's Skip() and Take() methods, you can chop the lines up into groups like this:
var lines = File.ReadAllLines(fileName);
int linesAtATime = 99;
for (int i = 0; i < lines.Length; i += linesAtATime)
{
    List<string> currentLinesGroup = lines.Skip(i).Take(linesAtATime).ToList();
    DoSomethingWithLines(currentLinesGroup);
}
But, if you are working with a really large file, it might not be practical to load the entire file into memory. Plus, you might not want to leave the file open while you are working on the lines. This option gives you more control over how you move through the file. It just loads the part it needs into memory, and closes the file while you are working on the current set of lines.
List<string> lines = new List<string>();
int maxLines = 99;
long seekPosition = 0;
bool fileLoaded = false;
string line;
while (!fileLoaded)
{
    using (Stream stream = File.Open(fileName, FileMode.Open))
    {
        // Jump back to the previous position
        stream.Seek(seekPosition, SeekOrigin.Begin);
        using (StreamReader reader = new StreamReader(stream))
        {
            while (!reader.EndOfStream && lines.Count < maxLines)
            {
                line = reader.ReadLine();
                // Tracks how much data has been read (assumes CRLF line endings and a single-byte encoding).
                seekPosition += (line.Length + 2);
                lines.Add(line);
            }
            fileLoaded = reader.EndOfStream;
        }
    }
    DoSomethingWithLines(lines);
    lines.Clear();
}
In this case, I used Stream because it has the ability to seek to a specific position in the file, and then wrapped it in a StreamReader because it has the ReadLine() method.
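For completeness, DoSomethingWithLines above is just a placeholder for your own processing; a minimal stub might look like:

static void DoSomethingWithLines(List<string> lines)
{
    // placeholder: replace with whatever work you need to do on each batch of 99 lines
    Console.WriteLine("Processing a batch of " + lines.Count + " lines");
}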
I am trying to modify a file stream in place, as the file has the potential to be very large and I don't want to load it into memory. The piece of information I'm editing will always be the same length, so in theory I can just swap the content out, but it doesn't seem to be writing to the correct place.
I have created a section of code that uses a StreamReader to read line by line until it finds a regex match, and then attempts to swap the bytes out with the edited line. The code is as follows:
private void UpdateFile(string newValue, string path, string pattern)
{
    var regex = new Regex(pattern, RegexOptions.IgnoreCase);
    int index = 0;
    string line = "";
    using (var fileStream = File.OpenRead(path))
    using (var streamReader = new StreamReader(fileStream, Encoding.Default, true, 128))
    {
        while ((line = streamReader.ReadLine()) != null)
        {
            if (regex.Match(line).Success)
            {
                break;
            }
            index += Encoding.Default.GetBytes(line).Length;
        }
    }
    if (line != null)
    {
        using (Stream stream = File.Open(path, FileMode.Open))
        {
            stream.Position = index + 1;
            var newLine = regex.Replace(line, newValue);
            var oldBytes = Encoding.Default.GetBytes(line);
            var newBytes = Encoding.Default.GetBytes("\n" + newLine);
            stream.Write(newBytes, 0, newBytes.Length);
        }
    }
}
The code almost works as expected: it inserts the updated line, but always a little early, and just how early varies slightly based on the file I'm editing. I expect it has something to do with the way I am managing the stream position, but I don't know the correct way to approach this.
Unfortunately the exact files I'm working on are under NDA.
The structure is as follows though:
A file will have an unknown amount of data followed by a line of a known format, for example:
Description: ABCDEF
I know the portion that follows "Description: " will always be 6 characters, so I do a replace on the line to swap it out with, for example, UVWXYZ.
The problem is that if a file reads as, for example,
'...
UNIMPORTANT UNKNOWN DATA
DESCRIPTION: ABCDEF
MORE DATA
...'
it will come out as something like
'...
UNIMPORTANT UNKNOWN DDESCRIPTION: UVWXYZDEF
MORE DATA
...'
I think the problem here is that you are not considering the line feed ("\n") of each line you read, and therefore your index sets the position of your stream incorrectly. Try the following code:
private void UpdateFile(string newValue, string path, string pattern)
{
    var regex = new Regex(pattern, RegexOptions.IgnoreCase);
    int index = 0;
    string line = "";
    using (var fileStream = File.OpenRead(path))
    using (var streamReader = new StreamReader(fileStream, Encoding.Default, true, 128))
    {
        while ((line = streamReader.ReadLine()) != null)
        {
            if (regex.Match(line).Success)
            {
                break;
            }
            // count the line feed that ReadLine strips off
            index += Encoding.Default.GetBytes(line + "\n").Length;
        }
    }
    if (line != null)
    {
        using (Stream stream = File.Open(path, FileMode.Open))
        {
            stream.Position = index;
            var newBytes = Encoding.Default.GetBytes(regex.Replace(line + "\n", newValue));
            stream.Write(newBytes, 0, newBytes.Length);
        }
    }
}
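As a usage sketch (the path and pattern here are purely illustrative): because the method overwrites bytes in place, the pattern should match exactly the six characters being replaced, so the line length stays the same:

// Illustrative call only: the lookbehind matches just the 6 characters after
// "Description: ", so the in-place overwrite keeps the line length unchanged.
UpdateFile("UVWXYZ", @"c:\tmp\data.txt", @"(?<=Description: )\w{6}");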
In your example, you are "off" by 4 characters. Not quite the common "off by one" error, but close. But maybe a different pattern would help the most?
Programs nowadays rarely work "on the file" like that. There is just too much that can go wrong, all the way up to a power loss mid-process. Instead they:
create an empty new file at the same location, often temporarily named and hidden;
write the output to the new file;
once everything is done and good - all the caches are flushed and everything is on disk (done by Stream.Close() or Dispose()) - replace the old file with the new file using the OS move operation.
The advantage is that data loss becomes impossible. Even if the computer loses power mid-operation, at worst the temporary file is messed up. You still have the original file, and you can just delete the temporary file and restart the work from scratch if you need to. Indeed, recovery only makes sense in rare cases (word processors).
The replacement of the old file by the new file is done with a move operation. If they are on the same partition, that is literally just a rename operation in the filesystem. And as modern filesystems are basically designed like top-of-the-line, robust relational databases, there is no danger in this.
You can find this pattern in everything from your word processor of choice, to backup programs, Firefox's download manager (as you might be overwriting a file that was there before), and even zip programs. Any time you have a long writing phase and want to minimize the danger, this is the go-to pattern.
And as you can process the data sequentially without having to jump around with the read/write position, it will get around your issue too.
Edit: I made some source code for it from memory/documentation. Might contain syntax errors
string sourcepath; // contains the source file path, set by other code
string temppath;   // contains the path of the temp file; should be in the same folder, and thus the same partition

// Open both streams. Suppressing any buffering on the output
// (FileOptions.WriteThrough) is optional and will be detrimental to performance.
using (var streamReader = new StreamReader(File.OpenRead(sourcepath)))
using (var streamWriter = new StreamWriter(new FileStream(temppath, FileMode.Create, FileAccess.Write, FileShare.None, 4096, FileOptions.WriteThrough)))
{
    string line;
    // iterate over the input
    while ((line = streamReader.ReadLine()) != null)
    {
        // do processing on line here
        streamWriter.WriteLine(line);
    }
}
// Replace the old file with the new one. Note that File.Move will not overwrite
// an existing file, so delete the original first (or use File.Replace).
File.Delete(sourcepath);
File.Move(temppath, sourcepath);
I have a problem with WebClient.
Simply put, I check for missing files in a folder. If I don't have a file, I use WebClient to navigate to a web page and send a value that executes a query and stores the value in the database.
My problem:
I have a List of, for example, 1500 elements.
But after the first element, the for loop stops (maybe) or doesn't navigate again.
My code:
List<string> fileneed = new List<string>();
In the thread:
// Distinct
fileneed = fileneed.Distinct().ToList<string>();
for (int i = 0; i < fileneed.Count; i++)
{
    if (fileneed[i].Contains("."))
    {
        w = new WebClient();
        w.OpenRead("http://mywebsite.org/collab/files.php?act=need&user=" + Properties.Settings.Default.user + "&file=" + fileneed[i]);
        fileneed.RemoveAt(i);
    }
}
After the thread executes, I go to phpMyAdmin and see only one file.
The other files in the list don't show up; for some strange reason, my code seems to execute the loop only once.
There are a few things wrong with the example code:
1st: Because it removes items from the fileneed list at the same time as it reads from the list, it is going to skip files. When you remove an item, the index of every following item shifts down by one. We can get around this by iterating over the list from the end to the start.
2nd: Though the code reads a file from the server, it does nothing to write the file out to disk, so the file is simply lost. This can be fixed by opening a file stream and copying to it.
3rd: WebClient and the Stream returned from OpenRead need to be disposed. Otherwise the resources they use will not be cleaned up, and your program will become a memory/connection hog. This is fixed with the using statement.
With these three fixes the resulting code looks like this:
fileneed = fileneed.Distinct().ToList<string>();
for (int i = fileneed.Count - 1; i >= 0; i--)
{
    if (fileneed[i].Contains("."))
    {
        using (var w = new WebClient())
        using (var webFile = w.OpenRead("http://mywebsite.org/collab/files.php?act=need&user=" + Properties.Settings.Default.user + "&file=" + fileneed[i]))
        using (var diskFile = File.OpenWrite(fileneed[i]))
        {
            webFile.CopyTo(diskFile);
        }
        fileneed.RemoveAt(i);
    }
}
You are opening a 'connection' to that file, but you aren't reading it or storing it anywhere. You need to create a new file, read from the remote stream, and write to the local file stream:
using (var remoteStream = w.OpenRead("http://mywebsite.org/collab/files.php?act=need&user=" + Properties.Settings.Default.user + "&file=" + fileneed[i]))
using (var myFile = File.OpenWrite(fileneed[i]))
{
    remoteStream.CopyTo(myFile); // copy the remote stream to the local file
}
See this page for details
I don't know exactly what http://mywebsite.org/collab/files.php does, but you should remove this line:
fileneed.RemoveAt(i);
On every iteration you remove an element, so Count changes and items get skipped. If you want to remove processed items, you could store them in another list and use Except() against the original list, as sketched below.
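A sketch of that idea (the names are illustrative; Except() requires System.Linq, which your Distinct() call already uses):

var processed = new List<string>();
foreach (var file in fileneed)
{
    if (file.Contains("."))
    {
        // ... download with WebClient here ...
        processed.Add(file);
    }
}
// remove the processed items in one pass, after the loop has finished
fileneed = fileneed.Except(processed).ToList();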
I have been pretty much stuck on a problem for the last few days. I have a file located on a remote server that can be accessed using a user ID and password. Accessing it is no problem.
The problem is that I have around 150 of them, and each is of variable size: minimum 2 MB, maximum 3 MB.
I have to read them one by one and get the last row/line of data from each. I am doing that in my current code.
The main problem is that it takes too much time, since it reads each file from top to bottom.
public bool TEst(string ControlId, string FileName, long offset)
{
    // The serverUri parameter should use the ftp:// scheme.
    // It identifies the server file that is to be downloaded.
    // Example: ftp://contoso.com/someFile.txt.
    // The fileName parameter identifies the local file.
    // The serverUri parameter identifies the remote file.
    // The offset parameter specifies where in the server file to start reading data.
    Uri serverUri;
    String ftpserver = "ftp://xxx.xxx.xx.xxx/" + FileName;
    serverUri = new Uri(ftpserver);
    if (serverUri.Scheme != Uri.UriSchemeFtp)
    {
        return false;
    }
    // Get the object used to communicate with the server.
    FtpWebRequest request = (FtpWebRequest)WebRequest.Create(serverUri);
    request.Credentials = new NetworkCredential("test", "test");
    request.Method = WebRequestMethods.Ftp.DownloadFile;
    request.ContentOffset = offset;
    FtpWebResponse response = null;
    try
    {
        response = (FtpWebResponse)request.GetResponse();
        // long Size = response.ContentLength;
    }
    catch (WebException e)
    {
        Console.WriteLine(e.Status);
        Console.WriteLine(e.Message);
        return false;
    }
    // Get the data stream from the response.
    Stream newFile = response.GetResponseStream();
    // Use a StreamReader to simplify reading the response data.
    StreamReader reader = new StreamReader(newFile);
    string newFileData = reader.ReadToEnd();
    // Split the response data and pick the fields relative to the end.
    string[] parser = newFileData.Split('\t');
    string strID = parser[parser.Length - 5];
    string strName = parser[parser.Length - 3];
    string strStatus = parser[parser.Length - 1];
    if (strStatus.Trim().ToLower() != "suspect")
    {
        HtmlTableCell control = (HtmlTableCell)this.FindControl(ControlId);
        control.InnerHtml = strName.Split('.')[0];
    }
    else
    {
        HtmlTableCell control = (HtmlTableCell)this.FindControl(ControlId);
        control.InnerHtml = "S";
    }
    // Cleanup.
    reader.Close();
    response.Close();
    //Console.WriteLine("Download restart - status: {0}", response.StatusDescription);
    return true;
}
Threading:
protected void Page_Load(object sender, EventArgs e)
{
    new Task(() => this.TEst("controlid1", "file1.tsv", 261454)).Start();
    new Task(() => this.TEst1("controlid2", "file2.tsv", 261454)).Start();
}
FTP is not capable of seeking within a file to read only the last few lines (reference: FTP commands). You'll have to coordinate with the developers and owners of the remote FTP server and ask them to make an additional file containing the data you need.
Example: ask the owners of the remote FTP server to create, for each of the files, a [filename]_lastrow file that contains the last row of the file. Your program would then operate on the [filename]_lastrow files. You'll probably be pleasantly surprised with an accommodating answer of "OK, we can do that for you".
If the FTP server can't be changed, ask for a database connection.
You can also download all your files in parallel and pop them into a queue for parsing as they complete, rather than doing this process synchronously. If the FTP server can handle more connections, use as many as would be reasonable for the scenario. Parsing can be done in parallel too; a rough sketch follows below.
More reading: System.Threading.Tasks
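As a rough sketch of the parallel idea (the server address and credentials are copied from your code; fileNames and ParseLastLine are placeholders for your own data and logic; requires System.Linq):

var downloads = fileNames.Select(name => Task.Run(() =>
{
    var request = (FtpWebRequest)WebRequest.Create("ftp://xxx.xxx.xx.xxx/" + name);
    request.Credentials = new NetworkCredential("test", "test");
    request.Method = WebRequestMethods.Ftp.DownloadFile;
    using (var response = (FtpWebResponse)request.GetResponse())
    using (var reader = new StreamReader(response.GetResponseStream()))
    {
        return reader.ReadToEnd();
    }
})).ToArray();

Task.WaitAll(downloads); // or await Task.WhenAll(downloads) in async code
foreach (var t in downloads)
    ParseLastLine(t.Result); // parsing can run here, or in parallel as well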
It's kinda buried, but I placed a comment in your original answer. This SO question leads to this blog post which has some awesome code you can draw from.
Rather than your while loop, you can skip directly to the end of the Stream by using Seek. You then work your way backwards through the stream until you find the first newline character. This post should give you everything you need to know:
Get last 10 lines of very large text file > 10GB
FtpWebRequest includes the ContentOffset property. Find/choose a way to keep the offset of the last line (locally or remotely, e.g. by uploading a 4-byte file to the FTP server). This is the fastest way to do it and the most optimal for network traffic.
More information about FtpWebRequest can be found on MSDN.
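A sketch of how the two requests could fit together (the address, credentials, and chunk size are illustrative): first ask the server for the file size, then download only the tail:

// Sketch: read only the last 4 KB of the remote file via ContentOffset.
var sizeRequest = (FtpWebRequest)WebRequest.Create("ftp://xxx.xxx.xx.xxx/file1.tsv");
sizeRequest.Credentials = new NetworkCredential("test", "test");
sizeRequest.Method = WebRequestMethods.Ftp.GetFileSize;
long size;
using (var sizeResponse = (FtpWebResponse)sizeRequest.GetResponse())
    size = sizeResponse.ContentLength;

var tailRequest = (FtpWebRequest)WebRequest.Create("ftp://xxx.xxx.xx.xxx/file1.tsv");
tailRequest.Credentials = new NetworkCredential("test", "test");
tailRequest.Method = WebRequestMethods.Ftp.DownloadFile;
tailRequest.ContentOffset = Math.Max(0, size - 4096); // the last 4 KB should contain the last line
using (var response = (FtpWebResponse)tailRequest.GetResponse())
using (var reader = new StreamReader(response.GetResponseStream()))
{
    string tail = reader.ReadToEnd(); // the last row sits at the end of this chunk
}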
I am sending mails (in ASP.NET, C#) using a template in a text file (.txt) like the one below:
User Name :<User Name>
Address : <Address>.
I replace the words within the angle brackets in the text file using the code below:
StreamReader sr;
sr = File.OpenText(HttpContext.Current.Server.MapPath(txt));
copy = sr.ReadToEnd();
sr.Close(); //close the reader
copy = copy.Replace(word.ToUpper(),"#" + word.ToUpper()); //remove the word specified UC
//save new copy into existing text file
FileInfo newText = new FileInfo(HttpContext.Current.Server.MapPath(txt));
StreamWriter newCopy = newText.CreateText();
newCopy.WriteLine(copy);
newCopy.Write(newCopy.NewLine);
newCopy.Close();
Now I have a new problem,
the user will be adding new words within angle brackets; say, for example, they will add <Salary>.
In that case I have to read the file and find the word <Salary>.
In other words, I have to find all the words that are located within angle brackets (<>).
How do I do that?
Having a stream for your file, you can build something similar to a typical tokenizer.
In general terms, this works as a finite state machine: you need an enumeration for the states (in this case it could be simplified down to a boolean, but I'll give you the general approach so you can reuse it on similar tasks) and a function implementing the logic. C#'s iterators are quite a fit for this problem, so I'll be using them in the snippet below. Your function will take the reader as an argument, will use an enumerated value and a string buffer internally, and will yield the strings one by one. You'll need these near the start of your code file:
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
And then, inside your class, something like this:
enum States {
    OUT,
    IN,
}

IEnumerable<string> GetStrings(TextReader reader) {
    States state = States.OUT;
    StringBuilder buffer = null; // initialized when we enter a '<...>' section
    int ch;
    while ((ch = reader.Read()) >= 0) {
        switch (state) {
            case States.OUT:
                if (ch == '<') {
                    state = States.IN;
                    buffer = new StringBuilder();
                }
                break;
            case States.IN:
                if (ch == '>') {
                    state = States.OUT;
                    yield return buffer.ToString();
                } else {
                    buffer.Append(Char.ConvertFromUtf32(ch));
                }
                break;
        }
    }
}
The finite-state machine model always has the same layout: while(READ_INPUT) { switch(STATE) {...} }: inside each case of the switch, you may be producing output and/or altering the state. Beyond that, the algorithm is defined in terms of states and state changes: for any given state and input combination, there is exactly one new state and output combination (the output can be "nothing" for transitions that produce no output, and the state may remain the same if no state change is triggered).
Hope this helps.
EDIT: forgot to mention a couple of things:
1) You get a TextReader to pass to the function by creating a StreamReader for a file, or a StringReader if you already have the file contents in a string.
2) The memory and time costs of this approach are O(n), with n being the length of the file. They seem quite reasonable for this kind of task.
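To tie it together, a minimal usage sketch (the path expression comes from the question's own code):

using (var reader = new StreamReader(HttpContext.Current.Server.MapPath(txt)))
{
    foreach (string word in GetStrings(reader))
        Console.WriteLine(word); // e.g. "User Name", "Address", "Salary"
}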
Using regex.
var matches = Regex.Matches(text, "<(.*?)>");
List<string> words = new List<string>();
for (int i = 0; i < matches.Count; i++)
{
    words.Add(matches[i].Groups[1].Value);
}
Of course, this assumes you already have the file's text in a variable. Since you have to read the entire file to achieve that, you could look for the words as you are reading the stream instead, but I don't know what the performance trade-off would be.
This is not an answer, but comments can't do this:
You should place some of your objects into using blocks. Something like this:
using(StreamReader sr = File.OpenText(HttpContext.Current.Server.MapPath(txt)))
{
copy = sr.ReadToEnd();
} // reader is closed by the end of the using block
//remove the word specified UC
copy = copy.Replace(word.ToUpper(), "#" + word.ToUpper());
//save new copy into existing text file
FileInfo newText = new FileInfo(HttpContext.Current.Server.MapPath(txt));
using(var newCopy = newText.CreateText())
{
newCopy.WriteLine(copy);
newCopy.Write(newCopy.NewLine);
}
The using block ensures that resources are cleaned up even if an exception is thrown.