I have a problem with WebClient.
Simply put: I check whether a file is missing from a folder. If I don't have the file, I use WebClient to call a web page, passing a value that executes a query and stores the value in the database.
My problem:
I have a List of 1500 elements, for example, but after the first element the for loop stops (or at least never navigates again).
My code:
List<string> fileneed = new List<string>();
In the thread:
//Distinct
fileneed = fileneed.Distinct().ToList<string>();
for (int i = 0; i < fileneed.Count; i++)
{
    if (fileneed[i].Contains("."))
    {
        w = new WebClient();
        w.OpenRead("http://mywebsite.org/collab/files.php?act=need&user=" + Properties.Settings.Default.user + "&file=" + fileneed[i]);
        fileneed.RemoveAt(i);
    }
}
After the thread has run, I go to phpMyAdmin and see only one file. The other files in the list never show up; for some strange reason my code only executes the body of the loop once.
There are a few things wrong with the example code:
1st: Because the code removes items from the fileneed list while it is still iterating over it, it will skip files. When you remove an item, the index of every following item shrinks by one. We can get around this by iterating over the list from the end to the start.
2nd: Though the code reads a file from the server, it never does anything with it, such as writing it out to disk, so the file is simply lost. This can be fixed by opening a file stream and copying to it.
3rd: WebClient and the Stream returned from OpenRead need to be Disposed. Otherwise the resources they use will not be cleaned up and your program will become a memory/connection hog. This is fixed by using the using statement.
With these three fixes the resulting code looks like this:
fileneed = fileneed.Distinct().ToList<string>();

for (int i = fileneed.Count - 1; i >= 0; i--)
{
    if (fileneed[i].Contains("."))
    {
        using (var w = new WebClient())
        using (var webFile = w.OpenRead("http://mywebsite.org/collab/files.php?act=need&user=" + Properties.Settings.Default.user + "&file=" + fileneed[i]))
        using (var diskFile = File.OpenWrite(fileneed[i]))
        {
            webFile.CopyTo(diskFile);
        }

        fileneed.RemoveAt(i);
    }
}
You are opening a 'connection' to that file, but you aren't reading it or storing it anywhere. You need to create a new file, then read from the remote stream and write to the local file stream:
using (var remote = w.OpenRead("http://mywebsite.org/collab/files.php?act=need&user=" + Properties.Settings.Default.user + "&file=" + fileneed[i]))
using (var myFile = File.OpenWrite(fileneed[i]))
{
    remote.CopyTo(myFile);
}
See this page for details
http://mywebsite.org/collab/files.php
I don't know exactly what this page does, but you should remove this line:
fileneed.RemoveAt(i);
On every iteration you are removing an element, so Count changes. If you want to remove processed items, you could store them in another list and then use Except against the original string list.
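For example, a rough sketch of that idea (assuming the same fileneed list, URL, and Properties.Settings value from the question, plus System.Linq and System.Net usings; not tested against the poster's project):
var processed = new List<string>();
foreach (var name in fileneed)
{
    if (name.Contains("."))
    {
        using (var w = new WebClient())
        using (var response = w.OpenRead("http://mywebsite.org/collab/files.php?act=need&user=" + Properties.Settings.Default.user + "&file=" + name))
        {
            // the request itself triggers the server-side query; nothing needs to be read here
        }
        processed.Add(name);
    }
}
// remove the handled items afterwards, without disturbing the loop
fileneed = fileneed.Except(processed).ToList();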
Related
I have 369 files that need to be formatted and consolidated into 5-8 files before being submitted to the server. I can't submit the 369 files because that would overwhelm the metadata tables in our database (they can handle it, but it'd be 369 rows for what was essentially one file, which would make querying and utilizing those tables a nightmare) and I can't handle it as one file because the total of 3.6 GB is too much for SSIS to handle on our servers.
I wrote the following script to fix the issue:
static void PrepPAIDCLAIMSFiles()
{
    const string HEADER = "some long header text, trimmed for SO question";
    const string FOOTER = "some long footer text, trimmed for SO question";

    //path is defined as a static member of the containing class
    string[] files = Directory.GetFiles(path + @"split\");
    int splitFileCount = 0, finalFileCount = 0;
    List<string> newFileContents = new List<string>();

    foreach (string file in files)
    {
        try
        {
            var contents = File.ReadAllLines(file).ToList();
            var fs = File.OpenRead(file);

            if (splitFileCount == 0)
            {
                //Grab everything except the header
                contents = contents.GetRange(1, contents.Count - 1);
            }
            else if (splitFileCount == files.Length - 1)
            {
                //Grab everything except the footer
                contents = contents.GetRange(0, contents.Count - 1);
            }

            if (!Directory.Exists(path + @"split\formatted"))
            {
                Directory.CreateDirectory(path + @"split\formatted");
            }

            newFileContents.AddRange(contents);

            if (splitFileCount % 50 == 0 || splitFileCount >= files.Length)
            {
                Console.WriteLine($"{splitFileCount} {finalFileCount}");

                var sb = new StringBuilder(HEADER);
                foreach (var row in newFileContents)
                {
                    sb.Append(row);
                }
                sb.Append(FOOTER);

                newFileContents = new List<string>();
                GC.Collect();

                string fileName = file.Split('\\').Last();
                string baseFileName = fileName.Split('.')[0];
                DateTime currentTime = DateTime.Now;
                baseFileName += "." + COMPANY_NAME_SetHHMMSS(currentTime, finalFileCount) + ".TXT";

                File.WriteAllText(path + @"split\formatted\" + baseFileName, sb.ToString());
                finalFileCount += 1;
            }

            splitFileCount += 1;
        }
        catch (OutOfMemoryException OOM)
        {
            Console.WriteLine(file);
            Console.WriteLine(OOM.Message);
            break;
        }
    }
}
The way this works: it reads each split file and puts its rows into a string builder, and every time it gets to a multiple of 50 files it writes the string builder out to a new file and starts over. The COMPANY_NAME_SetHHMMSS() method ensures the file has a unique name, so it's not writing to the same file over and over (and I can verify this from the output; it writes two files before exploding).
It breaks when it gets to the 81st file, with a System.OutOfMemoryException on var contents = File.ReadAllLines(file).ToList();. There's nothing special about the 81st file; it's exactly the same size as all the others (~10MB). The files this function delivers are about ~500MB. It also has no trouble reading and processing all the files up to, but not including, the 81st, so I don't think it's running out of memory reading the file, but rather running out of memory doing something else, and the 81st file is simply where memory runs out.
The newFileContents list should be getting emptied by overwriting it with a new list, right? It shouldn't be growing with every iteration of this function. GC.Collect() was sort of a last-ditch effort.
The original file that the 369 splits come from has been a headache for a few days now, causing UltraEdit to crash, SSIS to crash, C# to crash, etc. Splitting it via 7zip seemed to be the only option that worked, and splitting it to 369 files seemed to be the only option 7zip had that didn't also reformat or somehow compress the file in an undesirable way.
Is there something that I'm missing? Something in my code that keeps growing in memory? I know File.ReadAllLines() opens and closes the file, so it should be disposed after it's called, right? newFileContents gets overwritten every 50th file, as does the string builder. What else could I be doing?
One thing that jumps out at me is that you are opening a FileStream, never using it, and never disposing of it. With 300+ file streams this may be contributing to your issue.
var fs = File.OpenRead(file);
Another thing that caught my attention is that you said 3.6GB. Make sure you are compiling for a 64-bit architecture.
Finally, stuffing gigabytes into a StringBuilder may cause you grief. Maybe create a staging file: every time you open a new input file, write its contents out to the staging file and close the input, rather than depending on stuffing everything into memory.
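A rough sketch of that staging-file idea (this is not the poster's code; HEADER, FOOTER, and path are assumed from the question, the part###.TXT naming is just a placeholder, and the header/footer trimming of the first and last split files is left out for brevity):
int fileIndex = 0, outputIndex = 0;
StreamWriter staging = null;
Directory.CreateDirectory(path + @"split\formatted");

foreach (string file in Directory.GetFiles(path + @"split\"))
{
    // every 50 input files, close the current output and start a new one
    if (fileIndex % 50 == 0)
    {
        if (staging != null) { staging.WriteLine(FOOTER); staging.Dispose(); }
        staging = new StreamWriter(path + @"split\formatted\part" + outputIndex++ + ".TXT");
        staging.WriteLine(HEADER);
    }

    // stream the input line by line; the whole file is never held in memory
    foreach (string line in File.ReadLines(file))
        staging.WriteLine(line);

    fileIndex++;
}

if (staging != null) { staging.WriteLine(FOOTER); staging.Dispose(); }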
You should just be looping over the rows in your source files and appending them to a new file. You're holding the contents of up to 50 10MB files in memory at once, plus anything else you're doing. This may be because you're compiling for x86 instead of x64, but there isn't any reason this should use anywhere near that memory. Something like the following:
var files = Directory.GetFiles(System.IO.Path.Combine(path, "split")).ToList();

//since you were skipping the first and last file
files.Remove(files.FirstOrDefault());
files.Remove(files.LastOrDefault());

string combined_file_path = "<whatever you want to call this>";
System.IO.StreamWriter combined_file_writer = null;

try
{
    foreach (var file in files)
    {
        //if multiple of 50, write footer, dispose of stream, and make a new stream
        if ((files.IndexOf(file)) % 50 == 0)
        {
            combined_file_writer?.WriteLine(FOOTER);
            combined_file_writer?.Dispose();

            combined_file_writer = new System.IO.StreamWriter(combined_file_path + "_1"); //increment the name somehow
            combined_file_writer.WriteLine(HEADER);
        }

        using (var file_reader = new System.IO.StreamReader(file))
        {
            while (!file_reader.EndOfStream)
            {
                combined_file_writer.WriteLine(file_reader.ReadLine());
            }
        }
    }

    //finish out the last file
    combined_file_writer?.WriteLine(FOOTER);
}
finally
{
    //dispose of last file
    combined_file_writer?.Dispose();
}
Fairly new to C#, just sitting here practicing. I have a single file with 10 million passwords that I downloaded to practice with.
I want to break the file down into lists of 99: stop at 99, do something, then pick up where it left off and repeat the "do something" with the next 99, until it reaches the last item in the file.
I can do the counting part fine; it's the "stop at 99 and continue where I left off" part that I'm having trouble with. Anything I find online is not close to what I'm trying to do, and anything I add to this code on my own doesn't work.
I am more than happy to share more information if I am not clear. Just ask and will respond however, I might not be able to respond until tomorrow depending on what time it is.
Here is the code I have started:
using System;
using System.IO;

namespace lists01
{
    class Program
    {
        static void Main(string[] args)
        {
            int count = 0;
            var f1 = @"c:\tmp\10-million-password-list-top-1000000.txt";

            {
                var content = File.ReadAllLines(f1);

                foreach (var v2 in content)
                {
                    count++;
                    Console.WriteLine(v2 + "\t" + count);
                }
            }
        }
    }
}
My end goal is to do this with any list of items from files I have. I am only using this password list because it was sizable and thought it would be good for this exercise.
Thank you
Keith
Here are a couple of different ways to approach this. Normally, I would suggest the ReadAllLines function that you have in your code. The trade-off is that you load the entire file into memory at once, and then operate on it.
Using ReadAllLines in concert with LINQ's Skip() and Take() methods, you can chop the lines up into groups like this:
var lines = File.ReadAllLines(fileName);
int linesAtATime = 99;

for (int i = 0; i < lines.Length; i = i + linesAtATime)
{
    List<string> currentLinesGroup = lines.Skip(i).Take(linesAtATime).ToList();
    DoSomethingWithLines(currentLinesGroup);
}
But, if you are working with a really large file, it might not be practical to load the entire file into memory. Plus, you might not want to leave the file open while you are working on the lines. This option gives you more control over how you move through the file. It just loads the part it needs into memory, and closes the file while you are working on the current set of lines.
List<string> lines = new List<string>();
int maxLines = 99;
long seekPosition = 0;
bool fileLoaded = false;
string line;

while (!fileLoaded)
{
    using (Stream stream = File.Open(fileName, FileMode.Open))
    {
        //Jump back to the previous position
        stream.Seek(seekPosition, SeekOrigin.Begin);

        using (StreamReader reader = new StreamReader(stream))
        {
            while (!reader.EndOfStream && lines.Count < maxLines)
            {
                line = reader.ReadLine();
                seekPosition += (line.Length + 2); //Tracks how much data has been read.
                lines.Add(line);
            }

            fileLoaded = reader.EndOfStream;
        }
    }

    DoSomethingWithLines(lines);
    lines.Clear();
}
In this case, I used Stream because it has the ability to seek to a specific position in the file. But then I used StreamReader because it has the ReadLine() method.
I'm trying to get data from a CSV file from a web service.
If I paste the URL into my browser, the CSV is downloaded and looks like the following example:
"ID","ProductName","Company"
"1","Apples","Alfreds futterkiste"
"2","Oranges","Alfreds futterkiste"
"3","Bananas","Alfreds futterkiste"
"4","Salad","Alfreds futterkiste"
...next 96 rows
However I don't want to download the csv-file first and then extract data from it afterwards.
The web service uses pagination and returns 100 rows (determined by the &num parameter, with a max of 100). After the first request I can use the &next parameter to fetch the next 100 rows, based on ID. For instance the URL
http://testWebservice123.com/Example.csv?auth=abc&number=100&next=100
will get me rows with IDs 101 to 200. So if there are a lot of rows I would end up downloading a lot of CSV files and saving them to the hard drive. Instead of downloading the CSV files and saving them to the HDD first, I want to get the data directly from the web service so I can write it straight to a database without saving the CSV files.
After a bit of searching I came up with the following solution:
static void Main(string[] args)
{
    string startUrl = "http://testWebservice123.com/Example.csv?auth=abc&number=100";
    string url = "";
    string deltaRequestParameter = "";
    string lastLine;
    int numberOfLines = 0;

    do
    {
        url = startUrl + deltaRequestParameter;
        WebClient myWebClient = new WebClient();

        using (Stream myStream = myWebClient.OpenRead(url))
        {
            using (StreamReader sr = new StreamReader(myStream))
            {
                numberOfLines = 0;

                while (!sr.EndOfStream)
                {
                    var row = sr.ReadLine();
                    var values = row.Split(',');

                    //do whatever with the rows by now - i.e. write to console
                    Console.WriteLine(values[0] + " " + values[1]);

                    lastLine = values[0].Replace("\"", ""); //last line in the loop - get the last ID.
                    numberOfLines++;
                    deltaRequestParameter = "&next=" + lastLine;
                }
            }
        }
    } while (numberOfLines == 101); //since the header is returned each time, the number of rows will be 101 until we get to the last request
}
but I'm not sure if this is an "up to date" way of doing this, or if there is a better (easier/simpler) way. In other words, I'm unsure about whether using WebClient and StreamReader is the right way to go.
In this thread: how to read a csv file from a url?
WebClient.DownloadString is mentioned, as well as WebRequest. But if I want to write to a database without saving the CSV to the HDD, which is the best option?
Furthermore, will the approach I have taken save data to temporary disk storage behind the scenes, or will all the data be read into memory and then disposed of when the loop completes?
I have read the following documentation but can't seem to find out what it does behind the scenes:
StreamReader: https://learn.microsoft.com/en-us/dotnet/api/system.io.streamreader?view=netframework-4.7.2
Stream: https://learn.microsoft.com/en-us/dotnet/api/system.io.stream?view=netframework-4.7.2
Edit:
I guess I could also be using the following TextFieldParser...but my question is really still the same:
(using the Assembly Microsoft.VisualBasic)
using (Stream myStream = myWebClient.OpenRead(url))
{
    using (TextFieldParser parser = new TextFieldParser(myStream))
    {
        numberOfLines = 0;
        parser.TrimWhiteSpace = true; // if you want
        parser.Delimiters = new[] { "," };
        parser.HasFieldsEnclosedInQuotes = true;

        while (!parser.EndOfData)
        {
            string[] line = parser.ReadFields();
            Console.WriteLine(line[0].ToString() + " " + line[1].ToString());

            numberOfLines++;
            deltaRequestParameter = "&next=" + line[0].ToString();
        }
    }
}
The HttpClient class in System.Net.Http is available as of .NET 4.5. You have to work with async code, but it's not a bad idea to get into it if you're dealing with the web.
As sample data, I'll use jsonplaceholder's "todo" list. It provides JSON data, not CSV data, but it gives a simple enough structure to serve our purpose in the example below.
This is the core function, which fetches from jsonplaceholder in a similar way to your "testWebService123" site, although I'm just getting the first 3 todos, as opposed to testing for when I've hit the last page (you would probably keep your do-while logic for that).
async void DownloadPagesAsync() {
    // fetch the first 3 todos
    for (var i = 1; i <= 3; i++) {
        var pageToGet = $"https://jsonplaceholder.typicode.com/todos/{i}";

        using (var client = new HttpClient())
        using (HttpResponseMessage response = await client.GetAsync(pageToGet))
        using (HttpContent content = response.Content)
        using (var stream = (MemoryStream) await content.ReadAsStreamAsync())
        using (var sr = new StreamReader(stream))
            while (!sr.EndOfStream) {
                var row =
                    sr.ReadLine()
                      .Replace(@"""", "")
                      .Replace(",", "");

                if (row.IndexOf(":") == -1)
                    continue;

                var values = row.Split(':');
                Console.WriteLine($"{values[0]}, {values[1]}");
            }
    }
}
This is how you would call the function, such as you would in a Main() method:
Task t = new Task(DownloadPagesAsync);
t.Start();
The new task here takes an "action" as a parameter, or in other words a function that returns void. Then you start the task. Be careful: it is asynchronous, so any code you have after t.Start() may very well run before your task completes.
As to your question about whether the stream reads "in memory" or not, running GetType() on "stream" in the code resulted in a "MemoryStream" type, though it is only recognized as a "Stream" object at compile time. A MemoryStream is definitely in-memory. I'm not really sure whether any of the other kinds of stream objects save temporary files behind the scenes, but I'm leaning towards not.
But looking into the inner workings of a class, though commendable, is not usually required to settle your anxiety about disposing. For any class, just see if it implements IDisposable. If it does, then put it in a "using" statement, as you have done in your code. Whether the block exits normally or via an error, the proper disposal is performed once control has passed out of the "using" block.
HttpClient is in fact the newer approach. From what I understand, it does not replace all of the functionality of WebClient, but it is stronger in many respects. See this SO post for more details comparing the two classes.
Also, something to know about WebClient is that it can be simple, but limiting. If you run into issues, you will need to look into the HttpWebRequest class, which is a "lower level" class that gives you greater access to the nuts and bolts of things (such as working with cookies).
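For example, a minimal HttpWebRequest sketch (the URL is the placeholder from the question, and the user-agent value is made up) just to show the kind of lower-level control you get:
var request = (HttpWebRequest)WebRequest.Create("http://testWebservice123.com/Example.csv?auth=abc&number=100");
request.Method = "GET";
request.CookieContainer = new CookieContainer(); // cookies are handled explicitly here
request.UserAgent = "my-csv-importer";           // hypothetical value

using (var response = (HttpWebResponse)request.GetResponse())
using (var reader = new StreamReader(response.GetResponseStream()))
{
    while (!reader.EndOfStream)
    {
        Console.WriteLine(reader.ReadLine());
    }
}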
I need help figuring out the fastest way to read through about 80 files with over 500,000 lines in each file, and write to one master file with each input file's line as a column in the master. The master file must be plain text that something like Notepad can open, not a Microsoft Office format, because those can't handle the number of lines.
For example, the master file should look something like this:
File1_Row1,File2_Row1,File3_Row1,...
File1_Row2,File2_Row2,File3_Row2,...
File1_Row3,File2_Row3,File3_Row3,...
etc.
I've tried 2 solutions so far:
Create a jagged array to hold each file's contents, and then, once all lines in all files have been read, write the master file. The issue with this solution is that Windows throws an error that too much virtual memory is being used.
Dynamically create a reader thread for each of the 80 files that reads a specific line number, and once all threads finish reading a line, combine those values and write to file, and repeat for each line in all files. The issue with this solution is that it is very very slow.
Does anybody have a better solution for reading so many large files in a fast way?
The best way is going to be to open each input file with its own StreamReader and open a StreamWriter for the output file. Then you loop through each reader, read a single line, and write it to the master file. This way you are only loading one line at a time, so there should be minimal memory pressure. I was able to copy 80 files of ~500,000 lines each in 37 seconds. An example:
using System;
using System.Collections.Generic;
using System.IO;
using System.Diagnostics;
using System.Linq;

class MainClass
{
    static string[] fileNames = Enumerable.Range(1, 80).Select(i => string.Format("file{0}.txt", i)).ToArray();

    public static void Main(string[] args)
    {
        var stopwatch = Stopwatch.StartNew();
        List<StreamReader> readers = fileNames.Select(f => new StreamReader(f)).ToList();

        try
        {
            using (StreamWriter writer = new StreamWriter("master.txt"))
            {
                string line = null;
                do
                {
                    for (int i = 0; i < readers.Count; i++)
                    {
                        if ((line = readers[i].ReadLine()) != null)
                        {
                            writer.Write(line);
                        }
                        if (i < readers.Count - 1)
                            writer.Write(",");
                    }
                    writer.WriteLine();
                } while (line != null);
            }
        }
        finally
        {
            foreach (var reader in readers)
            {
                reader.Close();
            }
        }

        Console.WriteLine("Elapsed {0} ms", stopwatch.ElapsedMilliseconds);
    }
}
I've assumed that all the input files have the same number of lines, but you should add logic to keep reading as long as at least one file has given you data, as sketched below.
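For example, the inner loop could be adjusted like this (a sketch against the readers and writer variables above, not benchmarked):
bool anyData;
do
{
    anyData = false;
    var fields = new List<string>();

    foreach (var reader in readers)
    {
        string line = reader.ReadLine();
        if (line != null)
            anyData = true;        // at least one file still had a row
        fields.Add(line ?? "");    // exhausted files contribute an empty column
    }

    if (anyData)
        writer.WriteLine(string.Join(",", fields));
} while (anyData);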
Memory-mapped files seem to be what's suitable for you: something that doesn't put pressure on your app's memory while still maintaining good performance in I/O operations.
Here is the complete documentation: Memory-Mapped Files
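A minimal sketch of what that could look like for one of the input files (the file name is hypothetical, and this needs a using System.IO.MemoryMappedFiles; directive):
// map the file and read it through a view stream instead of loading it all into the managed heap
using (var mmf = MemoryMappedFile.CreateFromFile("file1.txt", FileMode.Open))
using (var viewStream = mmf.CreateViewStream())
using (var reader = new StreamReader(viewStream))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // append the line to the master file here
    }
}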
If you have enough memory on the computer, I would use the Parallel.Invoke construct and read each file into a pre-allocated array such as:
string[] file1lines = new string[some value];
string[] file2lines = new string[some value];
string[] file3lines = new string[some value];
Parallel.Invoke(
    () =>
    {
        ReadMyFile(file1, file1lines);
    },
    () =>
    {
        ReadMyFile(file2, file2lines);
    },
    () =>
    {
        ReadMyFile(file3, file3lines);
    }
);
Each ReadMyFile method should just use the following sample code which, according to these benchmarks, is the fastest way to read a text file:
int x = 0;

using (StreamReader sr = File.OpenText(fileName))
{
    while ((file1lines[x] = sr.ReadLine()) != null)
    {
        x += 1;
    }
}
If you need to manipulate the data from each file before writing your final output, read this article on the fastest way to do that.
Then you just need one method to write the contents of each string[] to the output as you desire.
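For instance, the final write step might look something like this (a sketch assuming the three arrays above were filled to the same length; "master.txt" is just a placeholder name, and the real case would loop over all 80 arrays):
using (var writer = new StreamWriter("master.txt"))
{
    for (int row = 0; row < file1lines.Length; row++)
    {
        // one output line per row, one column per source file
        writer.WriteLine(string.Join(",", file1lines[row], file2lines[row], file3lines[row]));
    }
}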
Have an array of open file handles. Loop through this array and read a line from each file into a string array, then combine that array into a line of the master file, appending a newline at the end.
This differs from your second approach in that it is single-threaded and doesn't read a specific line number, but always the next one.
Of course, you need to be error-proof in case some files have fewer lines than others.
I am sending mails (in ASP.NET, C#), using a template in a text file (.txt) like the one below:
User Name :<User Name>
Address : <Address>.
I replace the words within the angle brackets in the text file using the code below:
StreamReader sr;
sr = File.OpenText(HttpContext.Current.Server.MapPath(txt));
copy = sr.ReadToEnd();
sr.Close(); //close the reader
copy = copy.Replace(word.ToUpper(),"#" + word.ToUpper()); //remove the word specified UC
//save new copy into existing text file
FileInfo newText = new FileInfo(HttpContext.Current.Server.MapPath(txt));
StreamWriter newCopy = newText.CreateText();
newCopy.WriteLine(copy);
newCopy.Write(newCopy.NewLine);
newCopy.Close();
Now I have a new problem,
the user will be adding new words within angle brackets; for example, they might add <Salary>.
In that case I have to read the file and find the word <Salary>.
In other words, I have to find all the words located within angle brackets (<>).
How do I do that?
Having a stream for your file, you can build something similar to a typical tokenizer.
In general terms, this works as a finite state machine: you need an enumeration for the states (in this case it could be simplified down to a boolean, but I'll give you the general approach so you can reuse it for similar tasks) and a function implementing the logic. C#'s iterators are quite a good fit for this problem, so I'll be using them in the snippet below. Your function will take the reader as an argument, will use an enumerated value and a char buffer internally, and will yield the strings one by one. You'll need this near the start of your code file:
using System.Collections.Generic;
using System.IO;
using System.Text;
And then, inside your class, something like this:
enum States {
    OUT,
    IN,
}

IEnumerable<string> GetStrings(TextReader reader) {
    States state = States.OUT;
    StringBuilder buffer = null;
    int ch;

    while((ch = reader.Read()) >= 0) {
        switch(state) {
            case States.OUT:
                if(ch == '<') {
                    state = States.IN;
                    buffer = new StringBuilder();
                }
                break;
            case States.IN:
                if(ch == '>') {
                    state = States.OUT;
                    yield return buffer.ToString();
                } else {
                    buffer.Append(Char.ConvertFromUtf32(ch));
                }
                break;
        }
    }
}
The finite-state machine model always has the same layout: while(READ_INPUT) { switch(STATE) {...}}: inside each case of the switch, you may be producing output and/or altering the state. Beyond that, the algorithm is defined in terms of states and state changes: for any given state and input combination, there is an exact new state and output combination (the output can be "nothing" on those states that trigger no output; and the state may be the same old state if no state change is triggered).
Hope this helps.
EDIT: forgot to mention a couple of things:
1) You get a TextReader to pass to the function by creating a StreamReader for a file, or a StringReader if you already have the file contents in a string.
2) The memory and time costs of this approach are O(n), with n being the length of the file. They seem quite reasonable for this kind of task.
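For example, a hypothetical caller using the same template path as in the question:
// stream the template and print every placeholder found between < and >
using (var reader = new StreamReader(HttpContext.Current.Server.MapPath(txt)))
{
    foreach (string placeholder in GetStrings(reader))
    {
        Console.WriteLine(placeholder); // e.g. "User Name", "Address", "Salary"
    }
}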
Using a regex:
var matches = Regex.Matches(text, "<(.*?)>");

List<string> words = new List<string>();
for (int i = 0; i < matches.Count; i++)
{
    words.Add(matches[i].Groups[1].Value);
}
Of course, this assumes you already have the file's text in a variable. Since you have to read the entire file to achieve that, you could instead look for the words as you are reading the stream, but I don't know what the performance trade-off would be.
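If you did want to scan while reading, a sketch could look like this (same path expression as in the question; note that a placeholder split across a line break would be missed, and the performance is untested):
var words = new List<string>();
using (var sr = File.OpenText(HttpContext.Current.Server.MapPath(txt)))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        // collect every <...> occurrence on this line
        foreach (Match m in Regex.Matches(line, "<(.*?)>"))
        {
            words.Add(m.Groups[1].Value);
        }
    }
}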
This is not an answer, but comments can't do this:
You should place some of your objects into using blocks. Something like this:
using (StreamReader sr = File.OpenText(HttpContext.Current.Server.MapPath(txt)))
{
    copy = sr.ReadToEnd();
} // reader is closed by the end of the using block

//remove the word specified UC
copy = copy.Replace(word.ToUpper(), "#" + word.ToUpper());

//save new copy into existing text file
FileInfo newText = new FileInfo(HttpContext.Current.Server.MapPath(txt));

using (var newCopy = newText.CreateText())
{
    newCopy.WriteLine(copy);
    newCopy.Write(newCopy.NewLine);
}
The using block ensures that resources are cleaned up even if an exception is thrown.