I made a WPF application that opens a CSV file and does some operations, including web scraping, that produce values of type long (0-10000000).
The issue is that when a large list of about 2000 rows is opened, the memory usage of the software rises above 700 MB, in some cases 1 GB.
I am shocked to see this.
Some things I think might explain it:
If each entry of the CSV file has a long value associated with it, that will take a lot of memory, and a single entry has approximately 10-12 columns, each of type long; when the row count is huge, memory shoots up.
There are certain places in the code with a loop (over all CSV rows) that creates an instance of a custom class. I thought of writing a destructor, then learned that .NET manages memory automatically.
Here is the code for loading the CSV:
try
{
StreamReader sr = new StreamReader(path,Encoding.Default);
labelRankCheckStatus.Dispatcher.Invoke(DispatcherPriority.Normal, new Action(delegate()
{
labelRankCheckStatus.Content = "Loading Data";
}));
string strline = "";
string[] _values = null;
int x = 0;
while (!sr.EndOfStream)
{
x++;
strline = sr.ReadLine();
_values = strline.Split(',');
if (x == 1)
{
textBoxKw1.Text = _values[12];
textBoxKw2.Text = _values[14];
textBoxKw3.Text = _values[16];
textBoxKw4.Text = _values[18];
}
else if (x != 1)
{
if (_values[0] != "")
{
Url info = new Url();
srNo++;
info.URL = idn.GetAscii(_values[0].ToString().Trim());
info.IsChecked = true;
info.TestResults = int.Parse(_values[1].Replace("%","").TrimEnd().TrimStart());
info.PageRank= int.Parse(_values[2]);
info.RelPageRank = int.Parse(_values[3].Replace("%","").TrimEnd().TrimStart());
info.Alexa= long.Parse(_values[4]);
info.RelAlexa = long.Parse(_values[5].Replace("%","").TrimEnd().TrimStart());
info.Links= long.Parse(_values[6]);
info.RelLinks = long.Parse(_values[7].Replace("%","").TrimEnd().TrimStart());
info.GIW= long.Parse(_values[8]);
info.RelGIW = long.Parse(_values[9].Replace("%","").TrimEnd().TrimStart());
info.GIN= long.Parse(_values[10]);
info.RelGIN = long.Parse(_values[11].Replace("%","").TrimEnd().TrimStart());
info.Kw1Indexed= long.Parse(_values[12]);
info.RelKw1Indexed = long.Parse(_values[13].Replace("%","").TrimEnd().TrimStart());
info.Kw2Indexed= long.Parse(_values[14]);
info.RelKw2Indexed = long.Parse(_values[15].Replace("%","").TrimEnd().TrimStart());
info.Kw3Indexed= long.Parse(_values[16]);
info.RelKw3Indexed = long.Parse(_values[17].Replace("%","").TrimEnd().TrimStart());
info.Kw4Indexed= long.Parse(_values[18]);
info.RelKw4Indexed = long.Parse(_values[19].Replace("%","").TrimEnd().TrimStart());
info.DKwIndexed= long.Parse(_values[20]);
info.RelDKwIndexed = long.Parse(_values[21].Replace("%","").TrimEnd().TrimStart());
info.Info= _values[22];
info.srNo = srNo;
url.Add(info);
}
}
dataGrid1.Dispatcher.Invoke(DispatcherPriority.Normal, new Action(delegate()
{
dataGrid1.Columns[2].Header = "URL ( " + url.Count + " )";
try
{
if (dataGrid1.ItemsSource == null)
dataGrid1.ItemsSource = url;
else
dataGrid1.Items.Refresh();
}
catch (Exception)
{
}
labelRankCheckStatus.Dispatcher.Invoke(DispatcherPriority.Normal, new Action(delegate()
{
labelRankCheckStatus.Content = "Done";
}));
}));
}
sr.Close();
labelRankCheckStatus.Dispatcher.Invoke(DispatcherPriority.Normal, new Action(delegate()
{
labelRankCheckStatus.Content = "Complete ";
}));
}
catch (Exception c)
{
MessageBox.Show(c.Message);
}
Instead of building in-memory copies of your large objects, consider a more functional approach where you stream data in, process it, and output it to your database of choice. If you need to do operations on the old data, you can use an SQL database like SQLite.
Creating managed objects for every single entity in your system is beyond wasteful; you won't need most of them.
Of course, if you have a lot of RAM, it might simply be that the GC isn't yet bothering to collect all your garbage because the memory isn't actively needed by anything. It's more likely that you're holding references to it, though.
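To make that concrete, here is a minimal sketch of the streaming approach, assuming the Microsoft.Data.Sqlite NuGet package is available; the table name, the three columns shown, and the header-row handling are placeholders for your real schema, not code from the question:
// Sketch: stream CSV rows straight into SQLite instead of keeping every row
// in a List in memory. Only a handful of columns are shown for brevity.
using System.IO;
using Microsoft.Data.Sqlite;
static void ImportCsv(string csvPath, string dbPath)
{
    using (var connection = new SqliteConnection("Data Source=" + dbPath))
    {
        connection.Open();
        using (var create = connection.CreateCommand())
        {
            create.CommandText = "CREATE TABLE IF NOT EXISTS Urls (Url TEXT, Alexa INTEGER, Links INTEGER)";
            create.ExecuteNonQuery();
        }
        using (var transaction = connection.BeginTransaction())
        using (var insert = connection.CreateCommand())
        using (var reader = new StreamReader(csvPath))
        {
            insert.Transaction = transaction;
            insert.CommandText = "INSERT INTO Urls (Url, Alexa, Links) VALUES ($url, $alexa, $links)";
            var pUrl = insert.Parameters.Add("$url", SqliteType.Text);
            var pAlexa = insert.Parameters.Add("$alexa", SqliteType.Integer);
            var pLinks = insert.Parameters.Add("$links", SqliteType.Integer);
            reader.ReadLine(); // skip the header row
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                var values = line.Split(',');
                if (values[0] == "") continue;
                pUrl.Value = values[0].Trim();
                pAlexa.Value = long.Parse(values[4]);
                pLinks.Value = long.Parse(values[6]);
                insert.ExecuteNonQuery(); // one row at a time, nothing is retained in memory
            }
            transaction.Commit();
        }
    }
}
The grid (or any later query) can then read only the rows it needs from the database instead of holding all of them as managed objects.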
I have been struggling for a few days with growing memory consumption in a .NET Core 2.2 console application, and I have just run out of ideas about what else I could improve.
In my application I have a constructor that triggers the StartUpdatingAsync method:
public MenuViewModel()
{
if (File.Exists(_logFile))
File.Delete(_logFile);
try
{
StartUpdatingAsync("basic").GetAwaiter().GetResult();
}
catch (ArgumentException aex)
{
Console.WriteLine($"Caught ArgumentException: {aex.Message}");
}
Console.ReadKey();
}
StartUpdatingAsync creates a repository, and that instance gets a list of objects to be updated (around 200k) from the DB:
private async Task StartUpdatingAsync(string dataType)
{
_repo = new DataRepository();
List<SomeModel> some_list = new List<SomeModel>();
some_list = _repo.GetAllToBeUpdated();
await IterateStepsAsync(some_list, _step, dataType);
}
And now, within IterateStepsAsync we get updates, parse them against existing data, and update the DB. Inside each while iteration I was creating new instances of all the classes and lists, to be sure that the old ones release their memory, but it didn't help. I also called GC.Collect() at the end of the method, which is not helping either. Please note that the method below triggers lots of parallel Tasks, but they are supposed to be disposed within it, am I right?
private async Task IterateStepsAsync(List<SomeModel> some_list, int step, string dataType)
{
List<Area> areas = _repo.GetAreas();
int counter = 0;
while (counter < some_list.Count)
{
_repo = new DataRepository();
_updates = new HttpUpdates();
List<Task> tasks = new List<Task>();
List<VesselModel> vessels = new List<VesselModel>();
SemaphoreSlim throttler = new SemaphoreSlim(_degreeOfParallelism);
for (int i = counter; i < step; i++)
{
int iteration = i;
bool skip = false;
if (dataType == "basic" && (some_list[iteration].Mmsi == 0 || !some_list[iteration].Speed.HasValue)) //if could not be parsed with "full"
skip = true;
tasks.Add(Task.Run(async () =>
{
string updated= "";
await throttler.WaitAsync();
try
{
if (!skip)
{
Model model= await _updates.ScrapeSingleModelAsync(some_list[iteration].Mmsi);
while (Updating)
{
await Task.Delay(1000);
}
if (model != null)
{
lock (((ICollection)vessels).SyncRoot)
{
vessels.Add(model);
scraped = BuildData(model);
}
}
}
else
{
//do nothing
}
}
catch (Exception ex)
{
Log("Scrape error: " + ex.Message);
}
finally
{
while (Updating)
{
await Task.Delay(1000);
}
Console.WriteLine("Updates for " + counter++ + " of " + some_list.Count + scraped);
throttler.Release();
}
}));
}
try
{
await Task.WhenAll(tasks);
}
catch (Exception ex)
{
Log("Critical error: " + ex.Message);
}
finally
{
_repo.UpdateModels(vessels, dataType, counter, some_list.Count, _step);
step = step + _step;
GC.Collect();
}
}
}
Inside the method above we call _repo.UpdateModels, which updates the DB. I tried two approaches, using EF Core and SqlConnection, both with similar results. You can find both of them below.
EF Core
internal List<VesselModel> UpdateModels(List<Model> vessels, string dataType, int counter, int total, int _step)
{
for (int i = 0; i < vessels.Count; i++)
{
Console.WriteLine("Parsing " + i + " of " + vessels.Count);
Model existing = _context.Vessels.Where(v => v.id == vessels[i].Id).FirstOrDefault();
if (vessels[i].LatestActivity.HasValue)
{
existing.LatestActivity = vessels[i].LatestActivity;
}
//and similar parsing several times, as above
}
Console.WriteLine("Saving ...");
_context.SaveChanges();
return new List<Model>(_step);
}
SqlConnection
internal List<VesselModel> UpdateModels(List<Model> vessels, string dataType, int counter, int total, int _step)
{
if (vessels.Count > 0)
{
using (SqlConnection connection = GetConnection(_connectionString))
using (SqlCommand command = connection.CreateCommand())
{
connection.Open();
StringBuilder querySb = new StringBuilder();
for (int i = 0; i < vessels.Count; i++)
{
Console.WriteLine("Updating " + i + " of " + vessels.Count);
//PARSE
VesselAisUpdateModel existing = new VesselAisUpdateModel();
if (vessels[i].Id > 0)
{
//find existing
}
if (existing != null)
{
//update for basic data
querySb.Append("UPDATE dbo." + _vesselsTableName + " SET Id = '" + vessels[i].Id+ "'");
if (existing.Mmsi == 0)
{
if (vessels[i].MMSI.HasValue)
{
querySb.Append(" , MMSI = '" + vessels[i].MMSI + "'");
}
}
//and similar parsing several times, as above
querySb.Append(" WHERE Id= " + existing.Id+ "; ");
querySb.AppendLine();
}
}
try
{
Console.WriteLine("Sending SQL query to " + counter);
command.CommandTimeout = 3000;
command.CommandType = CommandType.Text;
command.CommandText = querySb.ToString();
command.ExecuteNonQuery();
}
catch (Exception ex)
{
Console.WriteLine(ex.Message);
}
finally
{
connection.Close();
}
}
}
return new List<Model>(_step);
}
The main problem is that after tens or hundreds of thousands of updated models, my console application's memory consumption increases continuously, and I have no idea why.
SOLUTION: my problem was inside the ScrapeSingleModelAsync method, where I was using HtmlAgilityPack incorrectly, which I was able to debug thanks to cassandrad.
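As an aside (this was not the eventual cause here, but it is a common one with the EF Core pattern above): a single long-lived _context keeps tracking every entity it has ever loaded, so its change tracker grows with each batch. A rough sketch of a batch-scoped alternative, where VesselContext is a placeholder name for the real DbContext type:
using System.Collections.Generic;
using System.Linq;
internal void UpdateModelsScoped(List<Model> vessels)
{
    // A fresh context per batch: everything it tracks is released on Dispose.
    using (var context = new VesselContext())
    {
        foreach (var vessel in vessels)
        {
            var existing = context.Vessels.FirstOrDefault(v => v.Id == vessel.Id);
            if (existing != null && vessel.LatestActivity.HasValue)
            {
                existing.LatestActivity = vessel.LatestActivity;
            }
            // ...and similar parsing, as in the original method
        }
        context.SaveChanges();
    } // disposing the context drops all tracked entities
}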
Your code is messy, with a huge number of different objects with unknown lifetimes. It's hardly possible to figure out the problem just by looking at it.
Consider using profiling tools, for example Visual Studio's Diagnostic Tools; they will help you to find which objects are living too long in the heap. Here is an overview of its functions related to memory profiling. Highly recommended reading.
In short, you need to take two snapshots and look at which objects are taking the most memory. Let's look at a simple example.
int[] first = new int[10000];
Console.WriteLine(first.Length);
int[] second = new int[9999];
Console.WriteLine(second.Length);
Console.ReadKey();
Take the first snapshot once your function has run at least once. In my case, I took the snapshot when the first huge block had been allocated.
After that, let your app work for some time so that the difference in memory usage becomes noticeable, then take the second memory snapshot.
You'll notice that another snapshot is added, with info about how big the difference is. To get more specific info, click on one or the other blue label of the latest snapshot to open the snapshot comparison.
Following my example, we can see that there is a change in the count of int arrays. By default int[] wasn't visible in the table, so I had to uncheck Just My Code in the filtering options.
So, this is what needs to be done. After you figure out which objects increase in count or size over time, you can locate where these objects are created and optimize that operation.
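If attaching the profiler every time is impractical, a rough in-code log can at least confirm whether the managed heap really keeps growing between batches. This is only a sketch to complement the snapshots, not replace them; where to call it (for example at the end of each while iteration in IterateStepsAsync) is up to you:
using System;
static class MemoryLog
{
    // Logs the managed heap size and the number of gen-2 collections so far.
    // Forcing a full collection is expensive; call this sparingly.
    public static void Report(string label)
    {
        long managedBytes = GC.GetTotalMemory(forceFullCollection: true);
        Console.WriteLine(label + ": " + (managedBytes / (1024 * 1024)) + " MB managed, gen 2 collections: " + GC.CollectionCount(2));
    }
}
// usage, e.g.: MemoryLog.Report("after batch " + counter);
If the number reported here stays flat while the process working set keeps growing, the growth is more likely in unmanaged resources than in managed objects.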
I don't know if it is a problem but...
This is the main call inside a Windows service.
The lower level logs into a WebService, retrieves information and passes it up.
A timer is hitting the main call 4 times a minute.
The Gen 2 heap size gradually increases. The GC does its best, so you end up with a sort of saw-tooth memory consumption, but the gradient of the saw tooth is slowly upwards. Working Set and Private Bytes also increase.
The LOH size, however, is consistent.
The culprit appears to be System.Byte[].
ANTS Memory Profiler shows no linkage to any program objects, i.e. the retention graphs only mention System objects.
The profiler shows 60% being held by System.Collections.Concurrent.ConcurrentStack<T>+Node. The other 40% is divided amongst other system classes.
I have refactored until I am blue in the face. Behavior seems consistent, making me think that perhaps it is not a problem but expected behavior.
If I return before the marked assignment, the memory consumption is 'correct'; likewise, if I return immediately after the assignment, the consumption is 'faulty'.
So...
Is it a real problem? If so, any suggestions on what to do would be welcome.
public void GetItsValues()
{
using (Dt dt = new Dt())
{
List<ItsSite> siteList = dt.GetCurrentItsSites().ToList();
IEnumerator<ItsSite> enSite = siteList.GetEnumerator();
while (enSite.MoveNext())
{
using (ClsItsParking its = new ClsItsParking(enSite.Current))
{
List<ClsPbpRegistration> resultList;
//ClsPbpRegistration[] reglist = its.PbpRegistrationsBySite(enSite.Current).ToArray();
resultList = its.PbpRegistrationsBySite(enSite.Current);//***THIS Assignment causes the problem
if (resultList.Any())
{
using (Dt dtSite = new Dt(enSite.Current.ConnectionString))
{
// var result = dtSite.WriteItsPurchases(reglist);
var result = dtSite.WriteItsPurchases(resultList);
if (!result)
ClsLogger.WriteErrorLog("dt.WritePurchases returned false for site " + enSite.Current.CompanyName);
}
}
}
}
}
}
Sorry, I am not much good at posting code.
The dt.GetItsPurchases and GetCurrentItsSites signatures, and the PbpRegistrationsBySite body:
public IEnumerable<ItsSite> GetCurrentItsSites()
public ClsPbpRegistration[] GetItsPurchases(DateTime purchaseDate)
public List<ClsPbpRegistration> PbpRegistrationsBySite(ItsSite site)
{
string regUrl = null;
try
{
if (!Logon(site))
ClsLogger.WriteErrorLog("ClsItsParking. Logon returned false");
/**** Working Code*****/
TimeSpan tWindow = new TimeSpan(0, TimeWindowMinutes, 0);
DateTime startDate = DateTime.Now - tWindow;
string startHour = startDate.Hour.ToString("00");
string startMinute = startDate.Minute.ToString("00");
DateTime endDate = DateTime.Now;
string endHour = endDate.Hour.ToString("00");
string endMinute = endDate.Minute.ToString("00");
string toDaysPurchases = "?searchFromDate=" + startDate.ToString(DateFormat) +
"&searchFromTimeHour=" + startHour + "&searchFromTimeMinute=" + startMinute + "&searchToDate=" + endDate.ToString(DateFormat)
+ "&searchToTimeHour=" + endHour + "&searchToTimeMinute=" + endMinute + "&searchBy=modified";
regUrl = site.UrlBase + site.RegistrationListExtension + toDaysPurchases;
string registrationResponseString;
registrationResponseString = HttpGetRequest(regUrl, Cookies, ItsUserAgent);
{
Logoff();
if (!string.IsNullOrEmpty(registrationResponseString))
{
XmlDocument motherShipRegistrations = new XmlDocument();
motherShipRegistrations.LoadXml(registrationResponseString);
//listRegistrations = GetNodeList(motherShipRegistrations);
//ClsPbpRegistration[] listRegistrations = GetRegistrationList(motherShipRegistrations);
//return null;
//return listRegistrations;
return GetRegistrationList(motherShipRegistrations);
}
return null;
}
}
catch (Exception ex)
{
ClsLogger.WriteErrorLog("PbpRegistrationsBySite " + ex.Message + " Sent url was " + regUrl);
return null;
}
}
I wrote a method to download data from the internet and save it to my database. I wrote it using PLINQ to take advantage of my multi-core processor and because it is downloading thousands of different files in a very short period of time. I have added comments below in my code to show where it stops, but the program just sits there and, after a while, I get an out-of-memory exception. This being my first time using TPL and PLINQ, I'm extremely confused, so I could really use some advice on what to do to fix this.
UPDATE: I found out that I was constantly getting a WebException because the WebClient was timing out. I fixed this by increasing the maximum number of connections, according to this answer here. I was then getting exceptions for the connection not opening, and I fixed that by using this answer here. I'm now getting connection timeout errors for the database even though it is a local SQL Server. I still haven't been able to get any of my code to run, so I could totally use some advice.
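(For reference, raising the connection limit is usually a one-line change along the lines of the sketch below; the value 100 is only an illustration, not the figure from the linked answer.)
using System.Net;
// Run once at startup, before any WebClient/HttpWebRequest is created.
// The .NET Framework default is 2 concurrent connections per host, which
// makes heavily parallel downloads queue up and eventually time out.
ServicePointManager.DefaultConnectionLimit = 100;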
static void Main(string[] args)
{
try
{
while (true)
{
// start the download process for market info
startDownload();
}
}
catch (Exception ex)
{
Console.WriteLine(ex.Message);
Console.WriteLine(ex.StackTrace);
}
}
public static void startDownload()
{
DateTime currentDay = DateTime.Now;
List<Task> taskList = new List<Task>();
if (Helper.holidays.Contains(currentDay) == false)
{
List<string> markets = new List<string>() { "amex", "nasdaq", "nyse", "global" };
Parallel.ForEach(markets, market =>
{
Downloads.startInitialMarketSymbolsDownload(market);
}
);
Console.WriteLine("All downloads finished!");
}
// wait 24 hours before you do this again
Task.Delay(TimeSpan.FromHours(24)).Wait();
}
public static void startInitialMarketSymbolsDownload(string market)
{
try
{
List<string> symbolList = new List<string>();
symbolList = Helper.getStockSymbols(market);
var historicalGroups = symbolList.AsParallel().Select((x, i) => new { x, i })
.GroupBy(x => x.i / 100)
.Select(g => g.Select(x => x.x).ToArray());
historicalGroups.AsParallel().ForAll(g => getHistoricalStockData(g, market));
}
catch (Exception ex)
{
Console.WriteLine(ex.Message);
Console.WriteLine(ex.StackTrace);
}
}
public static void getHistoricalStockData(string[] symbols, string market)
{
// download data for list of symbols and then upload to db tables
Uri uri;
string url, line;
decimal open = 0, high = 0, low = 0, close = 0, adjClose = 0;
DateTime date;
Int64 volume = 0;
string[] lineArray;
List<string> symbolError = new List<string>();
Dictionary<string, string> badNameError = new Dictionary<string, string>();
Parallel.ForEach(symbols, symbol =>
{
url = "http://ichart.finance.yahoo.com/table.csv?s=" + symbol + "&a=00&b=1&c=1900&d=" + (DateTime.Now.Month - 1) + "&e=" + DateTime.Now.Day + "&f=" + DateTime.Now.Year + "&g=d&ignore=.csv";
uri = new Uri(url);
using (dbEntities entity = new dbEntities())
using (WebClient client = new WebClient())
using (Stream stream = client.OpenRead(uri))
using (StreamReader reader = new StreamReader(stream))
{
while (reader.EndOfStream == false)
{
line = reader.ReadLine();
lineArray = line.Split(',');
// if it isn't the very first line
if (lineArray[0] != "Date")
{
// set the data for each array here
date = Helper.parseDateTime(lineArray[0]);
open = Helper.parseDecimal(lineArray[1]);
high = Helper.parseDecimal(lineArray[2]);
low = Helper.parseDecimal(lineArray[3]);
close = Helper.parseDecimal(lineArray[4]);
volume = Helper.parseInt(lineArray[5]);
adjClose = Helper.parseDecimal(lineArray[6]);
switch (market)
{
case "nasdaq":
DailyNasdaqData nasdaqData = new DailyNasdaqData();
var nasdaqQuery = from r in entity.DailyNasdaqDatas.AsParallel().AsEnumerable()
where r.Date == date
select new StockData { Close = r.AdjustedClose };
List<StockData> nasdaqResult = nasdaqQuery.AsParallel().ToList(); // hits this line
break;
default:
break;
}
}
}
// now save everything
entity.SaveChanges();
}
}
);
}
Async lambdas work like async methods in one regard: They do not complete synchronously but they return a Task. In your parallel loop you are simply generating tasks as fast as you can. Those tasks hold onto memory and other resources such as DB connections.
The simplest fix is probably to just use synchronous database commits. This will not result in a loss of throughput because the database cannot deal with high amounts of concurrent DML anyway.
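A minimal sketch of what that can look like, reusing the existing getHistoricalStockData method from the question; the degree of parallelism is an arbitrary example, and the point is simply that each body downloads, parses, and commits synchronously before anything else is queued:
using System.Threading.Tasks;
public static void DownloadAllBounded(string[] symbols, string market)
{
    // A fixed ceiling on concurrent bodies. Nothing is queued beyond this,
    // so no backlog of pending tasks or open DB connections can build up.
    var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };
    Parallel.ForEach(symbols, options, symbol =>
    {
        // Existing synchronous method: download, parse, entity.SaveChanges().
        getHistoricalStockData(new[] { symbol }, market);
    });
}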
I have the following code:
var lstMusicInfo = new List<MediaFile>();
var LocalMusic = Directory.EnumerateFiles(AppSettings.Default.ComputerMusicFolder, "*.*", SearchOption.AllDirectories).AsParallel().ToList<string>();
LocalMusic = (from a in LocalMusic.AsParallel()
where a.EndsWith(".mp3") || a.EndsWith(".wma")
select a).ToList<string>();
var DeviceMusic = adb.SyncMedia(this.dev, AppSettings.Default.ComputerMusicFolder, 1);
Parallel.ForEach(LocalMusic, new Action<string>(item =>
{
try
{
UltraID3 mFile = new UltraID3();
FileInfo fInfo;
mFile.Read(item);
fInfo = new FileInfo(item);
bool onDevice = true;
if (DeviceMusic.Contains(item))
{
onDevice = false;
}
// My Problem starts here
lstMusicInfo.Add(new MediaFile()
{
Title = mFile.Title,
Album = mFile.Album,
Year = mFile.Year.ToString(),
ComDirectory = fInfo.Directory.FullName,
FileFullName = fInfo.FullName,
Artist = mFile.Artist,
OnDevice = onDevice,
PTDevice = false
});
//Ends here.
}
catch (Exception) { }
}));
this.Dispatcher.BeginInvoke(new Action(() =>
{
lstViewMusicFiles.ItemsSource = lstMusicInfo;
blkMusicStatus.Text = "";
doneLoading = true;
}));
#endregion
}));
The first part of the code gives me an almost instant result containing:
The addresses on the computer of 5780 files.
A list of all the music files on an Android device, compared with those 5780 files, returning a list of the files found on the computer but not on the device (in my case it returns a list with 5118 items).
The block of code below is my problem: I am filling data into a class and then adding that class into a List<T>. Doing this 5780 times takes 60 seconds. How can I improve it?
// My Problem starts here
lstMusicInfo.Add(new MediaFile
{
Title = mFile.Title,
Album = mFile.Album,
Year = mFile.Year.ToString(),
ComDirectory = fInfo.Directory.FullName,
FileFullName = fInfo.FullName,
Artist = mFile.Artist,
OnDevice = onDevice,
PTDevice = false
});
//Ends here.
Update:
Here is the profiling result and I see it's obvious why it's slowing down >_>
I suppose I should look for a different library that reads music file information.
One way to avoid loading everything at once, up front, would be to lazy-load the ID3 information as necessary.
You'd construct your MediaFile instances thus...
new MediaFile(filePath)
...and MediaFile would look something like the following.
internal sealed class MediaFile
{
private readonly Lazy<UltraID3> _lazyFile;
public MediaFile(string filePath)
{
_lazyFile = new Lazy<UltraID3>(() =>
{
var file = new UltraID3();
file.Read(filePath);
return file;
});
}
public string Title
{
get { return _lazyFile.Value.Title; }
}
// ...
}
This is possibly less ideal than loading them as fast as you can in the background: if you do something like MediaFiles.OrderBy(x => x.Title).ToList() and nothing has been lazy-loaded yet, then you'll have to wait for every file to load.
Loading them in the background would make them available for use immediately after the background loading has finished. But you might have to concern yourself with not accessing some items until the background loading has finished.
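A rough sketch of that background variant, reusing the lazy-loading MediaFile from above: touching each property on a background task forces the load, and the returned Task tells the caller when everything is ready. The helper class and method names here are made up for illustration.
using System.Collections.Generic;
using System.Threading.Tasks;
internal static class MediaLibraryLoader
{
    // Forces the Lazy<UltraID3> inside every MediaFile to load on the thread
    // pool, and returns a Task the UI can await before sorting or filtering.
    public static Task PreloadAsync(IEnumerable<MediaFile> files)
    {
        return Task.Run(() =>
        {
            foreach (var file in files)
            {
                var title = file.Title; // touching a property triggers the lazy read
            }
        });
    }
}
// usage: await MediaLibraryLoader.PreloadAsync(mediaFiles);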
Your biggest bottleneck is new FileInfo(item), but you don't need FileInfo just to get the directory and file names. You can use Path.GetDirectoryName and Path.GetFileName, which are much faster since no I/O is involved.
UltraID3 mFile = new UltraID3();
//FileInfo fInfo;
mFile.Read(item);
//fInfo = new FileInfo(item);
bool onDevice = true;
if (DeviceMusic.Contains(item))
{
onDevice = false;
}
// My Problem starts here
lstMusicInfo.Add(new MediaFile()
{
Title = mFile.Title,
Album = mFile.Album,
Year = mFile.Year.ToString(),
ComDirectory = Path.GetDirectoryName(item), // fInfo.Directory.FullName,
FileFullName = Path.GetFileName(item), //fInfo.FullName,
Artist = mFile.Artist,
OnDevice = onDevice,
PTDevice = false
});
//Ends here.
I wrote a little C# application that indexes a book and executes a boolean text-retrieval algorithm on the index. The class at the end of the post shows the implementation of both building the index and executing the algorithm on it.
The code is called via a GUI-Button in the following way:
private void Execute_Click(object sender, EventArgs e)
{
Stopwatch s;
String output = "-----------------------\r\n";
String sr = algoChoice.SelectedItem != null ? algoChoice.SelectedItem.ToString() : "";
switch(sr){
case "Naive search":
output += "Naive search\r\n";
algo = NaiveSearch.GetInstance();
break;
case "Boolean retrieval":
output += "boolean retrieval\r\n";
algo = BooleanRetrieval.GetInstance();
break;
default:
outputTextbox.Text = outputTextbox.Text + "Choose retrieval-algorithm!\r\n";
return;
}
output += algo.BuildIndex("../../DocumentCollection/PilzFuehrer.txt") + "\r\n";
postIndexMemory = GC.GetTotalMemory(true);
s = Stopwatch.StartNew();
output += algo.Start("../../DocumentCollection/PilzFuehrer.txt", new String[] { "Pilz", "blau", "giftig", "Pilze" });
s.Stop();
postQueryMemory = GC.GetTotalMemory(true);
output += "\r\nTime elapsed:" + s.ElapsedTicks/(double)Stopwatch.Frequency + "\r\n";
outputTextbox.Text = output + outputTextbox.Text;
}
The first execution of Start(...) takes about 700 µs; every rerun takes only <10 µs.
The application is compiled with Visual Studio 2010 and the default 'Debug' build configuration.
I experimented a lot to find the reason for this, including profiling and different implementations, but the effect always stays the same.
I'd be happy if anyone could give me some new ideas about what to try, or even an explanation.
class BooleanRetrieval:RetrievalAlgorithm
{
protected static RetrievalAlgorithm theInstance;
List<String> documentCollection;
Dictionary<String, BitArray> index;
private BooleanRetrieval()
: base("BooleanRetrieval")
{
}
public override String BuildIndex(string filepath)
{
documentCollection = new List<string>();
index = new Dictionary<string, BitArray>();
documentCollection.Add(filepath);
for(int i=0; i<documentCollection.Count; ++i)
{
StreamReader input = new StreamReader(documentCollection[i]);
var text = Regex.Split(input.ReadToEnd(), @"\W+").Distinct().ToArray();
foreach (String wordToIndex in text)
{
if (!index.ContainsKey(wordToIndex))
{
index.Add(wordToIndex, new BitArray(documentCollection.Count, false));
}
index[wordToIndex][i] = true;
}
}
return "Index " + index.Keys.Count + "words.";
}
public override String Start(String filepath, String[] search)
{
BitArray tempDecision = new BitArray(documentCollection.Count, true);
List<String> res = new List<string>();
foreach(String searchWord in search)
{
if (!index.ContainsKey(searchWord))
return "No documents found!";
tempDecision.And(index[searchWord]);
}
for (int i = 0; i < tempDecision.Count; ++i )
{
if (tempDecision[i] == true)
{
res.Add(documentCollection[i]);
}
}
return res.Count>0 ? res[0]: "Empty!";
}
public static RetrievalAlgorithm GetInstance()
{
Contract.Ensures(Contract.Result<RetrievalAlgorithm>() != null, "result is null.");
if (theInstance == null)
theInstance = new BooleanRetrieval();
theInstance.Executions++;
return theInstance;
}
}
Cold/warm start of a .NET application is usually impacted by JIT time and by the disk access time needed to load assemblies.
For an application that does a lot of disk I/O, the very first access to data on disk will be much slower than a re-run over the same data, because the content gets cached (this also applies to assembly loading), provided the data is small enough to fit in the OS disk cache.
The first run of the task is impacted by disk I/O for assemblies and data, plus JIT time.
A second run of the same task, without restarting the application, just reads data from the OS memory cache.
A second run of the application reads assemblies from the OS memory cache but has to JIT again.
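In practice that means the first measurement should be treated as a warm-up and discarded. A small sketch of how the handler above could do that; the helper name and the number of runs are arbitrary:
using System;
using System.Diagnostics;
static class MicroBench
{
    // Runs the same action several times and prints each duration, so the
    // one-off JIT and file-cache cost of the first run is visible, not hidden.
    public static void Measure(string label, Action action, int runs)
    {
        for (int i = 0; i < runs; i++)
        {
            Stopwatch sw = Stopwatch.StartNew();
            action();
            sw.Stop();
            Console.WriteLine(label + " run " + (i + 1) + ": "
                + (sw.ElapsedTicks / (double)Stopwatch.Frequency * 1e6) + " µs");
        }
    }
}
// usage inside Execute_Click, after BuildIndex:
// MicroBench.Measure("boolean retrieval", () =>
//     algo.Start("../../DocumentCollection/PilzFuehrer.txt",
//                new String[] { "Pilz", "blau", "giftig", "Pilze" }), 5);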