I am currently writing a WinForms C# application that will allow users to cleanse text/log files. At present the app is working, but if the file is massive in size, i.e. 10MB, it takes an age!
The first cleanse it does is for users' Windows Auth, i.e. who was logged in at the time. I have a text file of all users in our organisation, roughly 10,000.
I load this into a
List<string> loggedUsers = new List<string>();
string[] userList = System.IO.File.ReadAllLines(@"C:\temp\logcleaner\users.txt");
foreach (string line in userList)
{
    loggedUsers.Add(line);
}
Next I take a text file and show it in a RichTextBox (rtbOrgFile), allowing the user to see what information is currently there. The user then clicks a button which does the following:
foreach (var item in loggedUsers)
{
    if (rtbOrgFile.Text.Contains(item))
    {
        if (!foundUsers.Items.Contains(item))
        {
            foundUsers.Items.Add(item);
        }
    }
}
My question is, is this the most efficient way, or is there a far better way to go about this? The code is working fine, but as you start to get into big files it becomes incredibly slow.
First, I would advise the following for loading your List:
List<string> loggedUsers = System.IO.File.ReadAllLines("[...]users.txt").ToList();
You didn't specify how large the text file that you load into the RichTextBox is, but I assume it is quite large, since it takes so long.
This answer suggests the Lucene.NET search engine, but it also provides a simple way to multi-thread the search without that engine, making it faster.
I would translate the example to:
var foundUsers = loggedUsers.AsParallel().Where(user => rtbOrgFile.Text.Contains(user)).ToList();
This way, it checks for multiple logged users at once.
You need at least .NET 4.0 for Parallel LINQ (which this example uses), as far as I know. If you don't have access to .NET 4.0, you could try to manually create one or two Threads and let each one handle an equal part of loggedUsers to check. They would each make a separate foundUsers list and then report it back to you, where you would merge them to a single list using List<T>.AddRange(anotherList).
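For illustration, a rough sketch of that manual two-thread fallback (assuming System.Threading and System.Linq are available, and that the RichTextBox text is copied into a plain string up front so the worker threads never touch the UI control):

// Copy the text once on the UI thread; worker threads must not touch the control itself.
string fileText = rtbOrgFile.Text;

// Split the user list into two halves, one per thread.
List<string> firstHalf = loggedUsers.Take(loggedUsers.Count / 2).ToList();
List<string> secondHalf = loggedUsers.Skip(loggedUsers.Count / 2).ToList();

List<string> found1 = new List<string>();
List<string> found2 = new List<string>();

Thread t1 = new Thread(() => found1.AddRange(firstHalf.Where(u => fileText.Contains(u))));
Thread t2 = new Thread(() => found2.AddRange(secondHalf.Where(u => fileText.Contains(u))));

t1.Start();
t2.Start();
t1.Join();
t2.Join();

// Merge the partial results into a single list.
List<string> foundUsers = new List<string>();
foundUsers.AddRange(found1);
foundUsers.AddRange(found2);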
Related
I have a situation wherein a List object is built from values pulled from an MSSQL database. However, this particular table is mysteriously getting an errant record or two tossed in. Removing the records causes trouble even though they have no referential links to any other tables, and they still get recreated without any known user action. This causes some trouble as it puts unwanted values on display that add a little bit of confusion. The specific issue is that this is a platform that allows users to run a search for quotes, and the filtering allows for sales rep selection. The select/dropdown field is showing these errant values, and they need to be removed.
Given that deleting the offending table rows does not provide a desirable result, I was thinking that maybe the best course of action was to modify the code where the List object is created and either filter the values out or remove them after the object is populated. I'd like to do this in a clean, scalable fashion by providing some kind of appendable data object where I could just add in a new string value if something else cropped up, as opposed to doing something clunky that adds new code to find the value and remove it each time.
My thought was to create a string array, and somehow loop through that to remove bad List values, but I wasn't entirely certain that was the best way to approach this, and I could not for the life of me think of a clean approach for this. I would think that the best way would be to add a filter within the Find arguments, but I don't know how to add in an array or list that way. Otherwise I figured to loop through the values either before or after the sorting of the List and remove any matches that way, but I wasn't sure that was the best choice of actions.
I have attached the current code, and would appreciate any suggestions.
int licenseeID = Helper.GetLicenseeIdByLicenseeShortName(Membership.ApplicationName);
List<User> listUsers;
if (Roles.IsUserInRole("Admin"))
{
    //get all users
    listUsers = User.Find(x => x.LicenseeID == licenseeID).ToList();
}
else
{
    //get only the current user
    listUsers = User.Find(x => (x.LicenseeID == licenseeID && x.EmailAddress == Membership.GetUser().Email)).ToList();
}
listUsers.Sort((x, y) => string.Compare(x.FirstName, y.FirstName));
-- EDIT --
I neglected to mention that I did not develop this; I merely inherited its maintenance after the original developer(s) disappeared, and my coworker who was assigned to it left the company. I'm not really skilled at handling ASP.NET sites. Many object sources are hidden and unavailable for edit, I assume because they are defined in a DLL somewhere. So, for any of these objects that are sourced from database tables, altering the tables will not help, since I would not be able to get the new data anyway.
However, I did try the following to filter out the undesirable data:
List<String> exclude = new List<String>(new String[] { "value1" , "value2" });
listUsers = User.Find(x => x.LicenseeID == licenseeID && !exclude.Contains(x.FirstName)).ToList();
Unfortunately it only resulted in an error being displayed to the page.
-- EDIT #2 --
I got the server setup to accept a new event viewer source so I could write info to the Application log to see what was happening. Looks like this installation of ASP.NET does not accept "Contains" as an action on a List object. An error gets kicked out stating that the method is not available.
I will probably add a bit column to the table to flag errant rows and then skip them when I query the table, something like
&& !ErrantData
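For illustration, that might end up looking something like the following (ErrantData here is a hypothetical bit column surfaced as a boolean property on the User object, not something that exists in the current schema):

// Hypothetical ErrantData flag: skip rows that have been marked as errant.
listUsers = User.Find(x => x.LicenseeID == licenseeID && !x.ErrantData).ToList();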
Another way, which requires a bit more upkeep but no database change, would be to keep a periodically updated text file, read it, and remove users from the list based on it.
The bigger issue is unknown rows creeping into your database. Changing user credentials and adding creation timestamps may help you narrow down the search scope.
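A rough sketch of the text-file approach, assuming a hypothetical exclusions.txt with one first name per line, and filtering the already-materialized list in memory (which sidesteps the Contains limitation inside Find mentioned in the question's edit):

// Hypothetical exclusion file: one first name per line.
List<string> exclude = System.IO.File.ReadAllLines(@"C:\config\exclusions.txt").ToList();

// Filter in memory after Find has returned, so Contains never reaches the data layer.
listUsers.RemoveAll(u => exclude.Contains(u.FirstName));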
Essentially, my program needs to consume about 100 (and this number will expand) WebServices, pull a piece of data from each, store it, parse it, and then display it. I've written the code for storing, parsing, and displaying.
My problem is this: I can't find any tutorial online about how to loop through a list of WebReferences and query each one. Am I doomed to writing 100 WebReferences and manually writing queries for each one, or is it possible to store a List or Array of the URLs (or something) and loop through it? Or is there another, better way of doing this?
I've specifically done research on this and I haven't found anything, I've done my due diligence. I'm not asking about how to consume a WebService, there's plenty of information on that and it's not that hard.
Current foreach loop (not sufficient, as I need to pass login credentials and get a response):
//Retrieve the XMLString from the server
//The ServerURLList is just a giant list of URLS, I didn't include it
var client = new WebClient { Credentials = new NetworkCredential("LoginCredentials", "LoginCredentialsPass") };
var XMLStringFromServer = client.DownloadString((String)(dr[0]));
//Notice it takes the string URL from the DataTable provided, so that it can do all 100 customers while parsing the response
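As a rough sketch of the loop idea (here the URLs sit in a plain List<string> called serverUrlList and the same credentials are assumed to work for every endpoint; both are illustrative assumptions rather than the original code):

// Hypothetical list of service URLs; in practice these could come from the DataTable or a config file.
List<string> serverUrlList = new List<string>
{
    "https://example.com/service1.asmx",
    "https://example.com/service2.asmx"
};

var responses = new List<string>();
using (var client = new WebClient { Credentials = new NetworkCredential("LoginCredentials", "LoginCredentialsPass") })
{
    foreach (string url in serverUrlList)
    {
        // Download the raw XML from each service and hand it to the existing parse/store/display code.
        responses.Add(client.DownloadString(url));
    }
}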
I'm working on a c# application that extracts metadata from all the documents in a Lotus Notes database (.nsf file, no Domino server) thusly:
NotesDocumentCollection documents = _notesDatabase.AllDocuments;
if (documents.Count > 0)
{
    NotesDocument document = documents.GetFirstDocument();
    while (document != null)
    {
        //processing
        document = documents.GetNextDocument(document);
    }
}
This works well, except we also need to record all the views that a document appears in. We iterate over all the views like this:
foreach (var viewName in _notesDatabase.Views)
{
    NotesView view = _notesDatabase.GetView(viewName);
    if (view != null)
    {
        if (view.AllEntries.Count > 0)
        {
            folderCount = view.AllEntries.Count;
            NotesDocument document = view.GetFirstDocument();
            while (document != null)
            {
                //record the document/view cross reference
                document = view.GetNextDocument(document);
            }
        }
        Marshal.ReleaseComObject(view);
        view = null;
    }
}
Here are my problems and questions:
We fairly regularly encounter documents in a view that were not found in NotesDatabase.AllDocuments collection. How is that possible? Is there a better way to get all the documents in a notes database?
Is there a way to find out all the views a document is in without looping through all the views and documents? This part of the process can be very slow, especially on large nsf files (35 GB!). I'd love to find a way to get just a list of view name and Document.UniversalID.
If there is not a more efficient way to find all the document + view information, is it possible to do this in parallel, with a separate thread/worker/whatever processing each view?
Thanks!
Answering questions in the same order:
I'm not sure how this is possible either unless perhaps there's a special type of document that doesn't get returned by that AllDocuments property. Maybe replication conflicts are excluded?
Unfortunately there's no better way. Views are really just a saved query into the database that return a list of matching documents. There's no list of views directly associated with a document.
You may be able to do this in parallel by processing each view on its own thread, but the bottleneck may be the Domino server that needs to refresh the views and thus it might not gain much.
One other note, the "AllEntries" in a view is different than all the documents in the view. Entries can include things like the category row, which is just a grouping and isn't backed by an actual document. In other words, the count of AllEntries might be more than the count of all documents.
Well, first of all, it's possible that documents are being created while your job runs. It takes time to cycle through AllDocuments, and then it takes time to cycle through all the views. Unless you are working on a copy or replica of the database that is isolated from all other possible users, you can easily run into a case where a document was created after you loaded AllDocuments but before you accessed one of the views.
Also, it may be possible that some of the objects returned by the view.GetXXXDocument() methods are deleted documents. You should probably be checking document.IsValid to avoid trying to process them.
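Inside the existing view loop that check might look roughly like this (assuming the C# interop exposes the flag as the IsValid property):

while (document != null)
{
    // Skip entries that point at deleted or otherwise invalid documents.
    if (document.IsValid)
    {
        //record the document/view cross reference
    }
    document = view.GetNextDocument(document);
}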
I'm going to suggest using the NotesNoteCollection as a check on AllDocuments. If AllDocuments were returning the full set of documents, or if NotesNoteCollection does (after selecting documents and building the collection), then there is a way to do this that is going to be faster than iterating each view.
(1) Read all the selection formulas from the views, removing the word 'SELECT' and saving them in a list of pairs of {view name, formula}.
(2) Iterate through the documents (from the NotesNoteCollection or AllDocuments) and for each doc you can use foreach to iterate through the list of view/formula pairs. Use the NotesSession.Evaluate method on each formula, passing the current document in for the context. A return of True from any evaluated formula tells you the document is in the view corresponding to the formula.
It's still brute force, but it's got to be faster than iterating all views.
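A rough sketch of that formula-evaluation idea, assuming the Interop.Domino COM API where NotesView exposes a SelectionFormula property, NotesDatabase.Views yields NotesView objects, and NotesSession.Evaluate accepts a document as context and returns an array whose first element is 1 on a match (the _notesSession variable and the exact shape of the returned array are assumptions):

// (1) Collect {view name, selection formula} pairs, stripping the leading SELECT keyword.
var viewFormulas = new List<KeyValuePair<string, string>>();
foreach (NotesView view in _notesDatabase.Views)
{
    string formula = view.SelectionFormula.Trim();
    if (formula.StartsWith("SELECT ", StringComparison.OrdinalIgnoreCase))
        formula = formula.Substring("SELECT ".Length);
    viewFormulas.Add(new KeyValuePair<string, string>(view.Name, formula));
}

// (2) Evaluate every formula against each document instead of walking every view.
NotesDocument document = documents.GetFirstDocument();
while (document != null)
{
    foreach (var pair in viewFormulas)
    {
        object[] result = _notesSession.Evaluate(pair.Value, document) as object[];
        if (result != null && result.Length > 0 && Convert.ToInt32(result[0]) == 1)
        {
            // document.UniversalID appears in the view named pair.Key
        }
    }
    document = documents.GetNextDocument(document);
}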
I'm currently writing a simple text analysis program in C#. Currently it takes simple statistics from the text and prints them out. However, I need to get it to the point where in input mode you input sample text, specifying an author, and it writes the statistics to a database entry of that specific author. Then in a different mode the program will take text, and see if it can accurately identify the author by pulling averages from the DB files and comparing the text's statistics to sample statistics. What I need help with is figuring out the best way to make a database out of text statistics. Is there some library I could use for this? Or should I simply do simple reading and writing from text files that I'll store the information in? Any and all ideas are welcome, as I'm struggling to come up with a solution to this problem.
Thanks,
PardonMyRhetoric
You can use an XmlSerializer to persist your data to a file really easily. There are numerous tutorials you can find on Google that will teach you how in just a few minutes. However, most of them want to show you how to add attributes to your properties to customize the way it serializes, so I'll just point out that those aren't really necessary. As long as your class is public with public properties (XmlSerializer doesn't even require the [Serializable] attribute), all you need is something that looks like this to save:
void Save()
{
    using (var sw = new StreamWriter("somefile.xml"))
        (new XmlSerializer(typeof(MyClass))).Serialize(sw, this);
}
and something like this in a function to read it:
MyClass Load()
{
    XmlSerializer xSer = new XmlSerializer(typeof(MyClass));
    using (var sr = new StreamReader("somefile.xml"))
        return (MyClass)xSer.Deserialize(sr);
}
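Tying it to the question, a minimal sketch of what the serialized class might look like (AuthorStats and its properties are placeholders, not from the original post):

// Placeholder class holding the per-author statistics you want to persist.
public class AuthorStats
{
    public string Author { get; set; }
    public double AverageWordLength { get; set; }
    public double AverageSentenceLength { get; set; }

    public void Save(string path)
    {
        using (var sw = new StreamWriter(path))
            new XmlSerializer(typeof(AuthorStats)).Serialize(sw, this);
    }

    public static AuthorStats Load(string path)
    {
        using (var sr = new StreamReader(path))
            return (AuthorStats)new XmlSerializer(typeof(AuthorStats)).Deserialize(sr);
    }
}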
I don't think you'll need a database at this stage. Try to select appropriate data structures from the .NET Framework itself. Use a dictionary or lists rather than arrays for this, and the methods you write will become simpler. Try to learn LINQ - it's like writing queries against a database, but over regular data structures. Once you've got that working and the project grows, then add a database.
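As a small sketch of that idea (the author names, statistic values, and matching rule here are purely illustrative):

// Illustrative only: per-author samples of a single statistic kept in an in-memory dictionary.
var statsByAuthor = new Dictionary<string, List<double>>();

// Record a sample's average word length for an author.
if (!statsByAuthor.ContainsKey("Jane Austen"))
    statsByAuthor["Jane Austen"] = new List<double>();
statsByAuthor["Jane Austen"].Add(4.4);

// LINQ: find the author whose average is closest to a new sample's statistic.
double sample = 4.5;
string bestMatch = statsByAuthor
    .OrderBy(kvp => Math.Abs(kvp.Value.Average() - sample))
    .First().Key;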
I have a simple ListView with 3-4 columns which displays a list of clients. Below it there is a TextBox used to search SQL Server and display related results (basically it executes a SQL query for every single letter typed). It was working fine with few clients, but with over 1000, typing one letter freezes the display for a few seconds and shows a lot of records, then another letter makes it a bit quicker, and then another...
So I thought about couple of possible fixes to this:
Start searching after typing 3 letters (the name has at least 3 chars), do nothing for 1st/2nd letter and display everything for 0 (still there's a delay when getting back from search)
Load the list once into a List<string> or create some kind of object to cover this, but I would need to keep it synced with any changes made by other users (adding new clients, updating names, etc.) from other workstations and always update the list with the proper information. Keeping it database-driven seems like the easier idea.
Other ideas? Maybe combination of both?
Here's code sample:
private void klienciSearchBoxTextChanged(object sender, EventArgs e)
{
    string varSzukaj = klienciSearchBox.Text.Trim();
    if (varSzukaj.Length >= 3)
    {
        pobierzDaneSqlKlientaOgolne(listViewKlienci, lvwColumnSorterKlienci, varSzukaj, radioButtonWyszukajPoPortfelu.Checked ? 1 : 0);
    }
    else if (varSzukaj.Length > 0 && varSzukaj.Length < 3)
    {
        // do nothing
    }
    else
    {
        pobierzDaneSqlKlientaOgolne(listViewKlienci, lvwColumnSorterKlienci, varSzukaj, radioButtonWyszukajPoPortfelu.Checked ? 1 : 0);
    }
}
Is the 1st or 2nd idea good enough, or can someone propose another implementation?
In a similar situation we handled this by limiting the number of results the user could get back. We used a limit of 500 to keep things snappy and ran a count query prior to running the select query to check if it's about to pause the software.
It also depends if the problem is an unresponsive GUI or a wait for the user. Since an unresponsive GUI can be fixed by running the query on a separate thread, you could then check if the query is running and cancel it when the user types the next letter. Another option to prevent user waiting would be to display partial results.
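A rough sketch of that cancel-on-next-keystroke idea using async/await (RunSearch and DisplayResults are placeholders standing in for the existing query and ListView update, not methods from the original code):

private CancellationTokenSource _searchCts;

private async void klienciSearchBoxTextChanged(object sender, EventArgs e)
{
    // Cancel whatever search is still running from the previous keystroke.
    _searchCts?.Cancel();
    _searchCts = new CancellationTokenSource();
    var token = _searchCts.Token;

    string text = klienciSearchBox.Text.Trim();
    if (text.Length > 0 && text.Length < 3)
        return; // too short to search yet

    try
    {
        // Run the query off the UI thread and pass the token so it can be abandoned mid-flight.
        var results = await Task.Run(() => RunSearch(text, token), token);
        if (!token.IsCancellationRequested)
            DisplayResults(results); // update the ListView back on the UI thread
    }
    catch (OperationCanceledException)
    {
        // superseded by a newer keystroke; ignore
    }
}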
The type of interface design that you currently have for searching is better suited to data that updates infrequently. For example, say you have a list of 10,000 products that updates once a week; in that case, cache the data locally and then pull from the cache instead of the database for every letter typed. That way it is one query to the DB and then many queries to the local cache.
In your case, data updates are more common, so I would change the interface to let users type in some letters and then press a search button when they are ready to retrieve results. As JamesB has noted, limiting the results returned would also help, but you are still hitting the database with a lot of queries. If the users can live with some data latency, caching is an option. There is a lot of needless searching going to the database for "M", then "Ma", then "Mad", then "Madb", and so on...
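As a small sketch of the local-cache idea (the client list, LoadAllClientsFromDb call, and ten-minute refresh interval are illustrative assumptions, not part of the original code):

// Illustrative cache: load once (or when stale), then filter in memory on every keystroke.
private List<string> _clientCache;
private DateTime _cacheLoadedAt;

private IEnumerable<string> SearchClients(string term)
{
    // Reload from the database only when the cache is older than ten minutes.
    if (_clientCache == null || (DateTime.Now - _cacheLoadedAt) > TimeSpan.FromMinutes(10))
    {
        _clientCache = LoadAllClientsFromDb(); // placeholder for the existing SQL query
        _cacheLoadedAt = DateTime.Now;
    }

    // Each keystroke now filters the in-memory list instead of hitting SQL Server.
    return _clientCache.Where(c => c.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0);
}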
In your case, data updates are more common, so I would change the interface to allow users to type in some letters, and then press a search button when they are ready to retrieve results. As JamesB has noted, limiting the results back would also help but you are still hitting the database with a lot of queries. If the users can live with some data latency, caching can be an option. There is a lot of needless searching going to the database for "M" then "Ma" then "Mad" then "Madb" and so on...