I have an application that needs to read very big .CSV files on application start and convert each row to an object. These are the methods that read the files:
public List<Aobject> GetAobject()
{
    List<Aobject> Aobjects = new List<Aobject>();
    using (StreamReader sr = new StreamReader(pathA, Encoding.GetEncoding("Windows-1255")))
    {
        string line;
        while ((line = sr.ReadLine()) != null)
        {
            string[] spl = line.Split(',');
            Aobject p = new Aobject { Aprop = spl[0].Trim(), Bprop = spl[1].Trim(), Cprop = spl[2].Trim() };
            Aobjects.Add(p);
        }
    }
    return Aobjects;
}
public List<Bobject> GetBobject()
{
    List<Bobject> Bobjects = new List<Bobject>();
    using (StreamReader sr =
        new StreamReader(pathB, Encoding.GetEncoding("Windows-1255")))
    {
        //parts.Clear();
        string line;
        while ((line = sr.ReadLine()) != null)
        {
            string[] spl = line.Split(',');
            Bobject p = new Bobject();
            p.Cat = spl[0];
            p.Name = spl[1];
            p.Serial1 = spl[3].ToUpper().Contains("1");
            if (spl[4].StartsWith("1"))
                p.Technical = 1;
            else if (spl[4].StartsWith("2"))
                p.Technical = 2;
            else
                p.Technical = 0;
            Bobjects.Add(p);
        }
    }
    return Bobjects;
}
This was blocking my UI for a few seconds, so I tried to make it multi-threaded. However, all my tests show that the un-threaded scenario is faster. This is how I tested it:
Stopwatch sw = new Stopwatch();
sw.Start();
for (int i = 0; i < 1000; i++)
{
    Dal dal = new Dal();
    Thread a = new Thread(() => { ThreadedAobjects = dal.GetAobject(); });
    Thread b = new Thread(() => { ThreadedBobjects = dal.GetBobject(); });
    a.Start();
    b.Start();
    b.Join();
    a.Join();
}
sw.Stop();
txtThreaded.Text = sw.Elapsed.ToString();

Stopwatch sw2 = new Stopwatch();
sw2.Start();
for (int i = 0; i < 1000; i++)
{
    Dal dal2 = new Dal();
    NonThreadedAobjects = dal2.GetAobject();
    NonThreadedBobjects = dal2.GetBobject();
}
sw2.Stop();
txtUnThreaded.Text = sw2.Elapsed.ToString();
The results:
Threaded run: 00:01:55.1378686
Un-threaded run: 00:01:37.1197840
Compiled for .NET 4.0 (but it should also work under .NET 3.5), in release mode.
Could someone please explain why this happens and how I can improve it?
You are ignoring the cost associated with creating and starting up a thread. Instead of creating new threads, try using the thread pool:
ThreadPool.QueueUserWorkItem(_ => { ThreadedAobjects = dal.GetAobject(); });
You'll also need to keep a count of how many operations you have completed in order to properly calculate your total time. Have a look at this link: http://msdn.microsoft.com/en-us/library/3dasc8as.aspx
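A minimal sketch of that counting pattern (it reuses the question's Dal and the ThreadedAobjects/ThreadedBobjects fields; the pending counter and done event are names I'm introducing here):

int pending = 2;
ManualResetEvent done = new ManualResetEvent(false);
Dal dal = new Dal();
ThreadPool.QueueUserWorkItem(_ =>
{
    ThreadedAobjects = dal.GetAobject();
    if (Interlocked.Decrement(ref pending) == 0) done.Set();
});
ThreadPool.QueueUserWorkItem(_ =>
{
    ThreadedBobjects = dal.GetBobject();
    if (Interlocked.Decrement(ref pending) == 0) done.Set();
});
done.WaitOne(); // block only at the point where the results are actually needed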
I would suggest a single thread that calls GetAobject and then calls GetBobject. Your task is almost certainly I/O bound, and if those two files are very large and on the same drive, then trying to access them concurrently will cause a lot of unnecessary disk seeks. So your code becomes:
ThreadPool.QueueUserWorkItem(_ =>
{
    AObjects = GetAobject();
    BObjects = GetBobject();
});
That also simplifies your code because you only have to synchronize on one ManualResetEvent.
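For completeness, a sketch of that single-event synchronization (loaded is a placeholder field name, not from the question):

ManualResetEvent loaded = new ManualResetEvent(false);
ThreadPool.QueueUserWorkItem(_ =>
{
    AObjects = GetAobject();
    BObjects = GetBobject();
    loaded.Set(); // signal once, after both lists are ready
});
// ... later, before the data is first used:
loaded.WaitOne();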
If you run this test several times you will get slightly different results each time. How long things take is influenced by everything else happening on the computer while the test runs, for example other processes, the GC, etc.
Your results are still reasonable, though, because having another thread means the processor has to do more context switching, and every context switch takes time.
You can read more about context switches at:
http://en.wikipedia.org/wiki/Context_switch
Adding to Slugart's correct answer: your parallelisation is ineffective in a number of ways. You wait for the first thread to complete, while the second one may have finished sooner and then sits idle for some time (look into the Task Parallel Library and PLINQ).
Also, your operations are I/O bound, which means any benefit from parallelism depends on the I/O device (some devices perform better when accessed sequentially, and trying to do multiple reads at once will slow the overall operation down).
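For reference, a hedged sketch of what the TPL version of the first point could look like on .NET 4.0 (same Dal and result fields as in the question); whether running the two reads concurrently helps at all still depends on the I/O device:

Dal dal = new Dal();
Task<List<Aobject>> taskA = Task.Factory.StartNew(() => dal.GetAobject());
Task<List<Bobject>> taskB = Task.Factory.StartNew(() => dal.GetBobject());
Task.WaitAll(taskA, taskB); // on .NET 4.5+ you could await Task.WhenAll instead
ThreadedAobjects = taskA.Result;
ThreadedBobjects = taskB.Result;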
Related
I have 3 files, each 1 million rows long, and I'm reading them line by line. No processing, just reading, as I'm just trialling things out.
If I do this synchronously it takes 1 second. If I switch to using Threads, one for each file, it is slightly quicker (code not shown below, but I simply created a new Thread and started it for each file).
When I change to async, it takes 40 times as long, at 40 seconds. If I add in any work to do actual processing, I cannot see how I'd ever use async over synchronous, or over Threads if I wanted a responsive application.
Or am I doing something fundamentally wrong with this code, and not using async as it was intended?
Thanks.
class AsyncTestIOBound
{
Stopwatch sw = new Stopwatch();
internal void Tests()
{
DoSynchronous();
DoASynchronous();
}
#region sync
private void DoSynchronous()
{
sw.Restart();
var start = sw.ElapsedMilliseconds;
Console.WriteLine($"Starting Sync Test");
DoSync("Addresses", "SampleLargeFile1.txt");
DoSync("routes ", "SampleLargeFile2.txt");
DoSync("Equipment", "SampleLargeFile3.txt");
sw.Stop();
Console.WriteLine($"Ended Sync Test. Took {(sw.ElapsedMilliseconds - start)} mseconds");
Console.ReadKey();
}
private long DoSync(string v, string filename)
{
string line;
long counter = 0;
using (StreamReader file = new StreamReader(filename))
{
while ((line = file.ReadLine()) != null)
{
counter++;
}
}
Console.WriteLine($"{v}: T{Thread.CurrentThread.ManagedThreadId}: Lines: {counter}");
return counter;
}
#endregion
#region async
private void DoASynchronous()
{
sw.Restart();
var start = sw.ElapsedMilliseconds;
Console.WriteLine($"Starting Sync Test");
Task a=DoASync("Addresses", "SampleLargeFile1.txt");
Task b=DoASync("routes ", "SampleLargeFile2.txt");
Task c=DoASync("Equipment", "SampleLargeFile3.txt");
Task.WaitAll(a, b, c);
sw.Stop();
Console.WriteLine($"Ended Sync Test. Took {(sw.ElapsedMilliseconds - start)} mseconds");
Console.ReadKey();
}
private async Task<long> DoASync(string v, string filename)
{
string line;
long counter = 0;
using (StreamReader file = new StreamReader(filename))
{
while ((line = await file.ReadLineAsync()) != null)
{
counter++;
}
}
Console.WriteLine($"{v}: T{Thread.CurrentThread.ManagedThreadId}: Lines: {counter}");
return counter;
}
#endregion
}
Since you are using await many times inside a big loop (in your case, looping through each line of a "SampleLargeFile"), you are doing a lot of context switching, and the overhead can be really bad.
For each line, your code may be switching between the files. If your computer uses a hard drive, this gets even worse: imagine the drive head jumping back and forth.
When you use normal threads, you are not switching context for each line.
To solve this, read each file in a single pass. You can still use async/await (ReadToEndAsync()) and get good performance.
EDIT
So, you are trying to count the lines in the text file using async, right?
Try this (no need to load the entire file into memory):
private async Task<int> CountLines(string path)
{
    int count = 0;
    await Task.Run(() =>
    {
        using (FileStream fs = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        using (BufferedStream bs = new BufferedStream(fs))
        using (StreamReader sr = new StreamReader(bs))
        {
            while (sr.ReadLine() != null)
            {
                count++;
            }
        }
    });
    return count;
}
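Calling it for the asker's three files might look like this (a sketch; it has to run inside an async method, and the file names are taken from the question):

Task<int> a = CountLines("SampleLargeFile1.txt");
Task<int> b = CountLines("SampleLargeFile2.txt");
Task<int> c = CountLines("SampleLargeFile3.txt");
await Task.WhenAll(a, b, c);
Console.WriteLine($"Lines: {a.Result} / {b.Result} / {c.Result}");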
A few things. First, I would read the whole file at once in the async method, so that you only await once (instead of once per line).
private async Task<long> DoASync(string v, string filename)
{
    string lines;
    using (StreamReader file = new StreamReader(filename))
    {
        lines = await file.ReadToEndAsync();
    }
    long counter = lines.Split('\n').Length;
    Console.WriteLine($"{v}: T{Thread.CurrentThread.ManagedThreadId}: Lines: {counter}");
    return counter;
}
Next, you can also await each Task individually. This keeps the CPU focused on one file at a time, instead of possibly switching between the 3, which causes more overhead.
private async void DoASynchronous()
{
    sw.Restart();
    var start = sw.ElapsedMilliseconds;
    Console.WriteLine($"Starting Sync Test");
    await DoASync("Addresses", "SampleLargeFile1.txt");
    await DoASync("routes ", "SampleLargeFile2.txt");
    await DoASync("Equipment", "SampleLargeFile3.txt");
    sw.Stop();
    Console.WriteLine($"Ended Sync Test. Took {(sw.ElapsedMilliseconds - start)} mseconds");
    Console.ReadKey();
}
The reason you are seeing slower performance is how await interacts with this workload: awaiting once per line drives CPU usage up, because the async machinery adds extra processing, allocations and synchronization. We also need to transition to kernel mode twice instead of once (first to initiate the I/O, then to dequeue the I/O completion notification).
For more info, see: Does async await increases Context switching
I need to process a very large text file (6-8 GB). I wrote the code attached below. Unfortunately, every time the output file (created next to the source file) reaches ~2 GB, I observe a sudden jump in memory consumption (from ~100 MB to a few GB) and, as a result, an out-of-memory exception.
The debugger indicates that the OOM occurs at while ((tempLine = streamReader.ReadLine()) != null)
I am targeting .NET 4.7 and x64 architecture only.
A single line is at most 50 characters long.
I could work around this by splitting the original file into smaller parts so the problem doesn't occur during processing, and then merging the results back into one file at the end, but I would prefer not to.
Code:
public async Task PerformDecodeAsync(string sourcePath, string targetPath)
{
var allLines = CountLines(sourcePath);
long processedlines = default;
using (File.Create(targetPath));
var streamWriter = File.AppendText(targetPath);
var decoderBlockingCollection = new BlockingCollection<string>(1000);
var writerBlockingCollection = new BlockingCollection<string>(1000);
var producer = Task.Factory.StartNew(() =>
{
using (var streamReader = new StreamReader(File.OpenRead(sourcePath), Encoding.Default, true))
{
string tempLine;
while ((tempLine = streamReader.ReadLine()) != null)
{
decoderBlockingCollection.Add(tempLine);
}
decoderBlockingCollection.CompleteAdding();
}
});
var consumer1 = Task.Factory.StartNew(() =>
{
foreach (var line in decoderBlockingCollection.GetConsumingEnumerable())
{
short decodeCounter = 0;
StringBuilder builder = new StringBuilder();
foreach (var singleChar in line)
{
var positionInDecodeKey = decodingKeysList[decodeCounter].IndexOf(singleChar);
if (positionInDecodeKey > 0)
builder.Append(model.Substring(positionInDecodeKey, 1));
else
builder.Append(singleChar);
if (decodeCounter > 18)
decodeCounter = 0;
else ++decodeCounter;
}
writerBlockingCollection.TryAdd(builder.ToString());
Interlocked.Increment(ref processedlines);
if (processedlines == (long)allLines)
writerBlockingCollection.CompleteAdding();
}
});
var writer = Task.Factory.StartNew(() =>
{
foreach (var line in writerBlockingCollection.GetConsumingEnumerable())
{
streamWriter.WriteLine(line);
}
});
Task.WaitAll(producer, consumer1, writer);
}
Solutions, as well as advice on how to optimize it a little more, are greatly appreciated.
Like I said, I'd probably go for something simpler first, unless or until it's demonstrated that it's not performing well. As Adi said in their answer, this work appears to be I/O bound, so there seems little benefit in creating multiple tasks for it.
public void PerformDecode(string sourcePath, string targetPath)
{
    File.WriteAllLines(targetPath, File.ReadLines(sourcePath).Select(line =>
    {
        short decodeCounter = 0;
        StringBuilder builder = new StringBuilder();
        foreach (var singleChar in line)
        {
            var positionInDecodeKey = decodingKeysList[decodeCounter].IndexOf(singleChar);
            if (positionInDecodeKey > 0)
                builder.Append(model.Substring(positionInDecodeKey, 1));
            else
                builder.Append(singleChar);
            if (decodeCounter > 18)
                decodeCounter = 0;
            else
                ++decodeCounter;
        }
        return builder.ToString();
    }));
}
Now, of course, this code actually blocks until it's done, which is why I've not marked it async. But then, so did yours, and the compiler should have been warning you about that already.
(You could try using PLINQ instead of LINQ for the Select portion, but honestly, the amount of processing we're doing here looks trivial; profile first before applying any such change.)
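If you did try it, the change would be confined to that one pipeline. A sketch, where Decode(line) is a hypothetical helper holding the same per-line transform shown above, and AsOrdered keeps the output lines in their original order:

File.WriteAllLines(targetPath,
    File.ReadLines(sourcePath)
        .AsParallel()
        .AsOrdered()
        .Select(line => Decode(line)));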
As the work you are doing is mostly I/O bound, you aren't really gaining anything from parallelization. It also looks to me (correct me if I'm wrong) like your transformation algorithm doesn't depend on reading the file line by line, so I would recommend instead doing something like this:
void Main()
{
    //Setup streams for testing
    using (var inputStream = new MemoryStream())
    using (var outputStream = new MemoryStream())
    using (var inputWriter = new StreamWriter(inputStream))
    using (var outputReader = new StreamReader(outputStream))
    {
        //Write test string and rewind stream
        inputWriter.Write("abcdefghijklmnop");
        inputWriter.Flush();
        inputStream.Seek(0, SeekOrigin.Begin);

        var inputBuffer = new byte[5];
        var outputBuffer = new byte[5];
        int inputLength;
        while ((inputLength = inputStream.Read(inputBuffer, 0, inputBuffer.Length)) > 0)
        {
            for (var i = 0; i < inputLength; i++)
            {
                //transform each character
                outputBuffer[i] = ++inputBuffer[i];
            }
            //Write to output
            outputStream.Write(outputBuffer, 0, inputLength);
        }

        //Read for testing
        outputStream.Seek(0, SeekOrigin.Begin);
        var output = outputReader.ReadToEnd();
        Console.WriteLine(output);
        //Outputs: "bcdefghijklmnopq"
    }
}
Obviously, you would be using FileStreams instead of MemoryStreams, and you can increase the buffer length to something much larger (this was just a demonstrative example). Also, since your original method is async, you could use the async variants of Stream.Write and Stream.Read.
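For instance, inside an async method the copy loop above could become (a sketch; same buffers and per-byte transform as in the example):

int inputLength;
while ((inputLength = await inputStream.ReadAsync(inputBuffer, 0, inputBuffer.Length)) > 0)
{
    for (var i = 0; i < inputLength; i++)
    {
        outputBuffer[i] = ++inputBuffer[i]; // same per-byte transform
    }
    await outputStream.WriteAsync(outputBuffer, 0, inputLength);
}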
I'm having a really difficult time wrapping my head around asynchronous code. I've tried reading numerous questions/answers, forums, etc.
I have to make about 50 calls to the database (I can't make a change to the database to return everything in one call). So the answer is to use asynchronous calls so that I only have to wait for essentially the longest call. I am trying to do a simple logging of the time for 10 round trips.
My theory is that because I'm using two asynchronous tasks, one in one method and one in the calling method, the tasks look like they are completing but have not actually completed yet. Here is my code.
I'm just logging the time it takes to complete the round trips for now, but the idea is to return the List<ResultList> in the future.
protected async void Page_Load(object sender, EventArgs e)
{
List<Buildings> bldgList = new List<Buildings>();
//lots of buildings here now.
Stopwatch GetByOrg = new Stopwatch();
lblorg.Text = await RunOrg(GetByOrg, bldgList);
}
async Task<string> RunOrg(Stopwatch getByOrg, List<Buildings> retVal)
{
getByOrg.Start();
List<Task> tasks = new List<Task>();
for (int i = 0; i < 10; i++)
{
foreach (Buildings b in bldgList)
{
tasks.Add(Task.Run(() => ExecuteOrg(b)));
}
// var tasky = new Task()
}
Task.WaitAll(tasks.ToArray());
getByOrg.Stop();
return String.Format("{0:00}:{1:00}:{2:00}.{3:00}", getByOrg.Elapsed.Hours, getByOrg.Elapsed.Minutes, getByOrg.Elapsed.Seconds, getByOrg.Elapsed.Milliseconds / 10);
}
async Task<List<ResultSet>> ExecuteOrg(Buildings b)
{
List<ResultSet> resulty = new List<ResultSet>();
var asyncconn = new SqlConnectionStringBuilder("That Server Connection") { AsynchronousProcessing = true }.ToString();
using (SqlConnection conn = new SqlConnection(asyncconn))
{
using (SqlCommand cmd = new SqlCommand("rptTotalConcentrators", conn))
{
cmd.CommandType = System.Data.CommandType.StoredProcedure;
cmd.Parameters.Add("#school_yr", System.Data.SqlDbType.SmallInt).Value = 2016;
conn.Open();
using (var reader = await cmd.ExecuteReaderAsync())
{
while (await reader.ReadAsync())
{
ResultSet testy = new ResultSet();
testy.bldg_no = reader["bldg_no"].ToString();
testy.bldg_name = reader["bldg_name"].ToString();
testy.school_year = reader["school_year"].ToString();
testy.concentrator = reader["conc"].ToString();
resulty.Add(testy);
}
}
}
}
return resulty;
}
The problem is that lblorg.Text is showing about .06 for the completed milliseconds. Either .WhenAll doesn't work the way I think it does, or something else is wrong, because that should take a lot longer.
I wrote a method to download data from the internet and save it to my database. I wrote it using PLINQ to take advantage of my multi-core processor, and because it downloads thousands of different files in a very short period of time. I have added comments below in my code to show where it stops, but the program just sits there, and after a while I get an out-of-memory exception. This being my first time using the TPL and PLINQ, I'm extremely confused, so I could really use some advice on what to do to fix this.
UPDATE: I found out that I was constantly getting a WebException because the WebClient was timing out. I fixed this by increasing the max number of connections according to this answer here. I was then getting exceptions for the connection not opening, and I fixed that by using this answer here. I'm now getting connection timeout errors for the database even though it is a local SQL Server. I still haven't been able to get any of my code to run, so I could totally use some advice.
static void Main(string[] args)
{
try
{
while (true)
{
// start the download process for market info
startDownload();
}
}
catch (Exception ex)
{
Console.WriteLine(ex.Message);
Console.WriteLine(ex.StackTrace);
}
}
public static void startDownload()
{
DateTime currentDay = DateTime.Now;
List<Task> taskList = new List<Task>();
if (Helper.holidays.Contains(currentDay) == false)
{
List<string> markets = new List<string>() { "amex", "nasdaq", "nyse", "global" };
Parallel.ForEach(markets, market =>
{
Downloads.startInitialMarketSymbolsDownload(market);
}
);
Console.WriteLine("All downloads finished!");
}
// wait 24 hours before you do this again
Task.Delay(TimeSpan.FromHours(24)).Wait();
}
public static void startInitialMarketSymbolsDownload(string market)
{
try
{
List<string> symbolList = new List<string>();
symbolList = Helper.getStockSymbols(market);
var historicalGroups = symbolList.AsParallel().Select((x, i) => new { x, i })
.GroupBy(x => x.i / 100)
.Select(g => g.Select(x => x.x).ToArray());
historicalGroups.AsParallel().ForAll(g => getHistoricalStockData(g, market));
}
catch (Exception ex)
{
Console.WriteLine(ex.Message);
Console.WriteLine(ex.StackTrace);
}
}
public static void getHistoricalStockData(string[] symbols, string market)
{
// download data for list of symbols and then upload to db tables
Uri uri;
string url, line;
decimal open = 0, high = 0, low = 0, close = 0, adjClose = 0;
DateTime date;
Int64 volume = 0;
string[] lineArray;
List<string> symbolError = new List<string>();
Dictionary<string, string> badNameError = new Dictionary<string, string>();
Parallel.ForEach(symbols, symbol =>
{
url = "http://ichart.finance.yahoo.com/table.csv?s=" + symbol + "&a=00&b=1&c=1900&d=" + (DateTime.Now.Month - 1) + "&e=" + DateTime.Now.Day + "&f=" + DateTime.Now.Year + "&g=d&ignore=.csv";
uri = new Uri(url);
using (dbEntities entity = new dbEntities())
using (WebClient client = new WebClient())
using (Stream stream = client.OpenRead(uri))
using (StreamReader reader = new StreamReader(stream))
{
while (reader.EndOfStream == false)
{
line = reader.ReadLine();
lineArray = line.Split(',');
// if it isn't the very first line
if (lineArray[0] != "Date")
{
// set the data for each array here
date = Helper.parseDateTime(lineArray[0]);
open = Helper.parseDecimal(lineArray[1]);
high = Helper.parseDecimal(lineArray[2]);
low = Helper.parseDecimal(lineArray[3]);
close = Helper.parseDecimal(lineArray[4]);
volume = Helper.parseInt(lineArray[5]);
adjClose = Helper.parseDecimal(lineArray[6]);
switch (market)
{
case "nasdaq":
DailyNasdaqData nasdaqData = new DailyNasdaqData();
var nasdaqQuery = from r in entity.DailyNasdaqDatas.AsParallel().AsEnumerable()
where r.Date == date
select new StockData { Close = r.AdjustedClose };
List<StockData> nasdaqResult = nasdaqQuery.AsParallel().ToList(); // hits this line
break;
default:
break;
}
}
}
// now save everything
entity.SaveChanges();
}
}
);
}
Async lambdas work like async methods in one regard: They do not complete synchronously but they return a Task. In your parallel loop you are simply generating tasks as fast as you can. Those tasks hold onto memory and other resources such as DB connections.
The simplest fix is probably to just use synchronous database commits. This will not result in a loss of throughput, because the database cannot deal with large amounts of concurrent DML anyway.
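To make the first point concrete, this is roughly the shape of the problem being described and the synchronous alternative (illustrative only; DownloadAndSaveAsync and DownloadAndSave are hypothetical helpers, not the asker's code):

// An async delegate returns to Parallel.ForEach as soon as it hits its first await,
// so the loop keeps pulling new items while the unfinished work piles up in memory.
Parallel.ForEach(symbols, async symbol =>
{
    await DownloadAndSaveAsync(symbol); // fire-and-forget from the loop's point of view
});

// Doing the download and the database commit synchronously keeps the work bounded:
Parallel.ForEach(symbols, symbol =>
{
    DownloadAndSave(symbol);
});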
I made a WPF application that opens a CSV file and does some operations, including web scraping, and gets some values of type long (0-10000000).
Now the issue is that when a large list of about 2000 rows is opened, the memory usage of the software rises above 700 MB, in some cases 1 GB.
I am shocked to see this.
Some things I think are:
Each entry of the CSV file has long values associated with it, which takes a lot of memory; a single entry has approximately 10-12 columns, each of type long. When the row count is huge, the memory shoots up.
There are certain places in the code with a loop (over all CSV rows) that creates an instance of a custom class. I thought of adding a destructor, but then learned that .NET manages memory automatically.
Here is the code for loading the CSV:
try
{
StreamReader sr = new StreamReader(path,Encoding.Default);
labelRankCheckStatus.Dispatcher.Invoke(DispatcherPriority.Normal, new Action(delegate()
{
labelRankCheckStatus.Content = "Loading Data";
}));
string strline = "";
string[] _values = null;
int x = 0;
while (!sr.EndOfStream)
{
x++;
strline = sr.ReadLine();
_values = strline.Split(',');
if (x == 1)
{
textBoxKw1.Text = _values[12];
textBoxKw2.Text = _values[14];
textBoxKw3.Text = _values[16];
textBoxKw4.Text = _values[18];
}
else if (x != 1)
{
if (_values[0] != "")
{
Url info = new Url();
srNo++;
info.URL = idn.GetAscii(_values[0].ToString().Trim());
info.IsChecked = true;
info.TestResults = int.Parse(_values[1].Replace("%","").TrimEnd().TrimStart());
info.PageRank= int.Parse(_values[2]);
info.RelPageRank = int.Parse(_values[3].Replace("%","").TrimEnd().TrimStart());
info.Alexa= long.Parse(_values[4]);
info.RelAlexa = long.Parse(_values[5].Replace("%","").TrimEnd().TrimStart());
info.Links= long.Parse(_values[6]);
info.RelLinks = long.Parse(_values[7].Replace("%","").TrimEnd().TrimStart());
info.GIW= long.Parse(_values[8]);
info.RelGIW = long.Parse(_values[9].Replace("%","").TrimEnd().TrimStart());
info.GIN= long.Parse(_values[10]);
info.RelGIN = long.Parse(_values[11].Replace("%","").TrimEnd().TrimStart());
info.Kw1Indexed= long.Parse(_values[12]);
info.RelKw1Indexed = long.Parse(_values[13].Replace("%","").TrimEnd().TrimStart());
info.Kw2Indexed= long.Parse(_values[14]);
info.RelKw2Indexed = long.Parse(_values[15].Replace("%","").TrimEnd().TrimStart());
info.Kw3Indexed= long.Parse(_values[16]);
info.RelKw3Indexed = long.Parse(_values[17].Replace("%","").TrimEnd().TrimStart());
info.Kw4Indexed= long.Parse(_values[18]);
info.RelKw4Indexed = long.Parse(_values[19].Replace("%","").TrimEnd().TrimStart());
info.DKwIndexed= long.Parse(_values[20]);
info.RelDKwIndexed = long.Parse(_values[21].Replace("%","").TrimEnd().TrimStart());
info.Info= _values[22];
info.srNo = srNo;
url.Add(info);
}
}
dataGrid1.Dispatcher.Invoke(DispatcherPriority.Normal, new Action(delegate()
{
dataGrid1.Columns[2].Header = "URL ( " + url.Count + " )";
try
{
if (dataGrid1.ItemsSource == null)
dataGrid1.ItemsSource = url;
else
dataGrid1.Items.Refresh();
}
catch (Exception)
{
}
labelRankCheckStatus.Dispatcher.Invoke(DispatcherPriority.Normal, new Action(delegate()
{
labelRankCheckStatus.Content = "Done";
}));
}));
}
sr.Close();
labelRankCheckStatus.Dispatcher.Invoke(DispatcherPriority.Normal, new Action(delegate()
{
labelRankCheckStatus.Content = "Complete ";
}));
}
catch (Exception c)
{
MessageBox.Show(c.Message);
}
Instead of building in-memory copies of your large objects, consider a more functional approach where you stream data in, process it, and output it to your database of choice. If you need to do operations on the old data, you can use an SQL database like SQLite.
Creating managed objects for every single entity in your system is beyond wasteful, you won't need most of them.
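A rough sketch of that streaming shape, assuming the Microsoft.Data.Sqlite package and a hypothetical two-column table (the column indexes and table layout are placeholders, not the asker's real schema):

using (var connection = new SqliteConnection("Data Source=results.db"))
{
    connection.Open();
    using (var create = connection.CreateCommand())
    {
        create.CommandText = "CREATE TABLE IF NOT EXISTS Urls (Url TEXT, Alexa INTEGER)";
        create.ExecuteNonQuery();
    }
    using (var transaction = connection.BeginTransaction())
    using (var insert = connection.CreateCommand())
    {
        insert.Transaction = transaction;
        insert.CommandText = "INSERT INTO Urls (Url, Alexa) VALUES ($url, $alexa)";
        var url = insert.CreateParameter();
        url.ParameterName = "$url";
        insert.Parameters.Add(url);
        var alexa = insert.CreateParameter();
        alexa.ParameterName = "$alexa";
        insert.Parameters.Add(alexa);

        // File.ReadLines streams one row at a time instead of materializing a big list
        foreach (string line in File.ReadLines(path, Encoding.Default).Skip(1))
        {
            string[] values = line.Split(',');
            url.Value = values[0].Trim();
            alexa.Value = long.Parse(values[4]);
            insert.ExecuteNonQuery();
        }
        transaction.Commit();
    }
}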
Of course, if you have a lot of RAM, it might simply be that the GC isn't yet bothering to collect all your garbage because the memory isn't actively needed by anything. It's more likely that you're holding references to it though.