I've probably gone through numerous S.O. posts on this issue, but I'm at a loss and can't figure out what the problem is.
I can add and update docs to the index, but I cannot seem to successfully delete them.
I'm using Lucene.NET v3.0.3
I read one suggestion was to do a query using the same conditions and ensure I'm getting a result back. Well, I did so:
First, I have a method that returns items in my database that have been marked as deleted
var deletedItems = VehicleController.GetDeleted(lastCheck); // lastCheck is the DateTime of the last sync
Right now during testing, this includes a single item. I then iterate:
// This method returns my writer
var indexWriter = LuceneController.GetWriter();
// And my searcher
var searcher = new IndexSearcher(indexWriter.GetReader());
// And iterate over my items (just one for testing)
foreach (var c in deletedItems)
{
    // Here I'm testing by doing a query
    var query = new BooleanQuery();
    query.Add(new TermQuery(new Term("key", c.Guid.ToString())), Occur.MUST);
    // Let's see if it can find the record based on this
    var docs = searcher.Search(query, 1);
    var foundDoc = docs.ScoreDocs.FirstOrDefault();
    // Yep, we have one... let's get the full doc to be sure
    var actualDoc = searcher.Doc(foundDoc.Doc);
    // If I inspect actualDoc, it's the right one... I want to delete it.
    indexWriter.DeleteDocuments(query);
    indexWriter.Commit();
}
I've condensed the logic above so it's easier to read, but I've also tried all kinds of other calls...
indexWriter.Optimize();
indexWriter.Flush(true, true, true);
If I watch the actual folder where everything is being stored, I can see filenames like 0_1.del and the like pop up, which seems promising.
I then read somewhere about a merge policy, but isn't that what Flush is supposed to do?
Then I read to try limiting Optimize to 1 segment max, and that still didn't work (i.e. indexWriter.Optimize(1)).
So using the same query to fetch works, but deleting does not. Why? What else can I check? Does delete actually remove the item permanently or does it live on in some other manner until I completely delete the directory that's being used? Not understanding.
Index segment files in Lucene are immutable: they never change once written. So when a deletion is recorded, the deleted record is not actually removed from the index files immediately; it is simply marked as deleted. The record will eventually be removed from the index once that segment is merged to produce a new segment, i.e. the deleted record won't be in the new segment that results from the merge.
Theoretically, once Commit is called, the deleted document should disappear from the reader's view, since you are getting the reader from the writer (i.e. it's a real-time reader). This is documented here:
Note that flushing just moves the internal buffered state in IndexWriter into the index, but these changes are not visible to IndexReader until either commit() or close() is called.
source: https://lucene.apache.org/core/3_0_3/api/core/org/apache/lucene/index/IndexWriter.html
But you might want to try closing the reader after the deletion takes place and then getting a new reader from the writer to see if that new reader now has the record removed from visibility.
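As a minimal sketch of that approach (reusing the question's LuceneController.GetWriter() helper and deletedItems, and assuming the "key" field was indexed as a single un-analyzed term), delete by term, commit, and then open a fresh reader from the writer before searching again:

// Sketch only: assumes LuceneController.GetWriter() and deletedItems from the question.
var indexWriter = LuceneController.GetWriter();

foreach (var c in deletedItems)
{
    // Deleting by the unique term is enough here; no BooleanQuery needed.
    indexWriter.DeleteDocuments(new Term("key", c.Guid.ToString()));
}
indexWriter.Commit();

// Readers opened before the commit keep their old point-in-time view,
// so open a new one from the writer to observe the deletions.
using (var reader = indexWriter.GetReader())
using (var newSearcher = new IndexSearcher(reader))
{
    var check = newSearcher.Search(
        new TermQuery(new Term("key", deletedItems.First().Guid.ToString())), 1);
    Console.WriteLine(check.TotalHits); // expect 0 after the commit
}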
I'm very new to the Azure Search Service and I'm following along with the guides that I found on the web. In one of these guides they have a method such as this:
static void Main(string[] args)
{
    IConfigurationBuilder builder = new ConfigurationBuilder().AddJsonFile("appsettings.json");
    IConfigurationRoot configuration = builder.Build();

    SearchServiceClient serviceClient = CreateSearchServiceClient(configuration);
    var test = serviceClient.Indexes.GetClient("testindex");

    Console.WriteLine("{0}", "Deleting index...\n");
    DeleteHotelsIndexIfExists(serviceClient);

    Console.WriteLine("{0}", "Creating index...\n");
    CreateHotelsIndex(serviceClient);

    ISearchIndexClient indexClient = serviceClient.Indexes.GetClient("hotels");

    Console.WriteLine("{0}", "Uploading documents...\n");
    UploadDocuments(indexClient);

    ISearchIndexClient indexClientForQueries = CreateSearchIndexClient(configuration);
    RunQueries(indexClientForQueries);

    Console.WriteLine("{0}", "Complete. Press any key to end application...\n");
    Console.ReadKey();
}
So the idea above is to delete an index if it exists, then create a new index, then generate and upload documents, and finally run a search. This all makes sense to me, but one thing that troubles me is the part where the index is deleted. Essentially, they appear to be suggesting that you always delete the index and then recreate it, which concerns me.
If this is a live index, then by deleting it even for a split second, wouldn't services calling search on that index fail? The other thing that concerns me is that in most cases my data will be 90% the same; I'll have some updates and newer records, maybe a few deleted. But if my database has a million records, it just seems foolish to delete it all and then create it all over again.
Is there a better approach? Is there a way for me to just update the documents in the index as opposed to deleting it?
So I guess this is a 2-part question. 1. If I delete the index, will it stop the searching capability while the new one is being built? 2. Is there a better approach than just deleting the index? If there is a way to update, how do you handle scenarios where the document structure has changed, for example when a new field is added?
That sample is meant to demonstrate how to call Azure Search via the .NET SDK. It should not be taken as an example of best practices.
To answer your specific questions:
1. Deleting the index will immediately cause all requests targeting that index to fail. You should not delete a live index in production. Also, there is not yet a backup/restore capability for indexes, so even if an index is not live, you shouldn't delete it unless you can restore the data from elsewhere.
2. If you need to update documents without updating the index schema, you can achieve this with the "merge" action of the Index API. There is a tutorial that covers this here. If you need to add a new field, you can do this without deleting and re-creating the index; just use one of the Indexes.CreateOrUpdate methods of the .NET SDK. By default this operation does not cause any index downtime. However, if you need to delete, rename, or change the type of an existing field, or enable new behaviors on a field (e.g. make it facetable, sortable, or part of a suggester), you'll need to create a new index. For this reason, it is recommended that your application implement an extra layer of indirection over index names so that you can swap and re-build indexes when needed.
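As a rough sketch of the merge path (not the official sample; it assumes the tutorial's Hotel model and the indexClient from the code above), you can push just the changed documents in a batch instead of rebuilding the index:

var changedDocs = new[]
{
    new Hotel { HotelId = "1", Description = "Updated description" }, // existing doc: merged by key
    new Hotel { HotelId = "42", HotelName = "Brand New Hotel" }       // new doc: uploaded
};

// MergeOrUpload updates existing documents by key and inserts new ones,
// so the index never has to be deleted for a routine data refresh.
var batch = IndexBatch.MergeOrUpload(changedDocs);

try
{
    indexClient.Documents.Index(batch);
}
catch (IndexBatchException e)
{
    // Individual actions can fail; retry just those document keys.
    var failedKeys = e.IndexingResults.Where(r => !r.Succeeded).Select(r => r.Key);
    Console.WriteLine("Failed to index: {0}", string.Join(", ", failedKeys));
}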
The following block works when I stage at least one tracked file. But when I stage only untracked files, repo.RetrieveStatus().Staged.Count equals zero (I expect it to be incremented by the number of files staged), so the if condition isn't satisfied and no commit happens.
using (var repo = new LibGit2Sharp.Repository(RepoPath))
{
    Commands.Stage(repo, "*");

    Signature author = new Signature(username, email, DateTime.Now);
    Signature committer = author;

    if (repo.RetrieveStatus().Staged.Any())
    {
        Commit commit = repo.Commit(CommitMessage, author, committer);
        Console.WriteLine(commit.ToString());
    }
}
Is this a bug, or am I misusing it?
The definition of the Staged collection is:
List of files added to the index, which are already in the current commit with different content
i.e., these correspond to the "modified" entries in the "Changes to be committed" section of git status, or the M status in git status --short.
These do not, by definition, include newly added files that are staged. For that, you want to also examine the Added collection, which is:
List of files added to the index, which are not in the current commit
i.e., these correspond to the "new file" entries in the "Changes to be committed" section of git status, or the A status in git status --short.
However, you probably want to also consider all staged changes, which I think is what you're trying to do. You would want to look at these collections on the status:
Added: List of files added to the index, which are not in the current commit
Staged: List of files added to the index, which are already in the current commit with different content
Removed: List of files removed from the index but are existent in the current commit
RenamedInIndex: List of files that were renamed and staged (if you requested renames).
At the moment there is no function that will return a list of all staged changes, which seems like an oversight (and one that would be easy to correct, if you wanted to submit a pull request to LibGit2Sharp!)
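In the meantime, a minimal sketch of that check (reusing the question's RepoPath, username, email, and CommitMessage variables) might look like this:

using (var repo = new LibGit2Sharp.Repository(RepoPath))
{
    Commands.Stage(repo, "*");

    var status = repo.RetrieveStatus();

    // Commit if anything at all is staged: new, modified, removed, or renamed files.
    bool anythingStaged = status.Added.Any()
                       || status.Staged.Any()
                       || status.Removed.Any()
                       || status.RenamedInIndex.Any();

    if (anythingStaged)
    {
        var author = new Signature(username, email, DateTime.Now);
        Commit commit = repo.Commit(CommitMessage, author, author);
        Console.WriteLine(commit.Sha);
    }
}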
I have a .NET application written in C#, and use Mongo for my database backend. One of my collections, UserSearchTerms, repeatedly (and unintentionally) has duplicate documents created.
I've teased out the problem to an update function that gets called asynchronously, and can be called multiple times simultaneously. In order to avoid problems with concurrent runs, I've implemented this code using an update which I trigger on any documents that match a specific query (unique on user and program), upserting if no documents are found.
Initially, I can guarantee that no duplicates exist and so expect that only the following two cases can occur:
No matching documents exist, triggering an upsert to add a new document
One matching document exists, and so an update is triggered only on that one document
Given these two cases, I expect that there would be no way for duplicate documents to be inserted through this function - the only time a new document should be inserted is if there are none to begin with. Yet over an hour or so, I've found that even though documents for a particular user/program pair exist, new documents for them are created.
Am I implementing this update correctly to guarantee that duplicate documents will not be created? If not, what is the proper way to implement an update in order to assure this?
This is the function in question:
public int UpdateSearchTerm(UserSearchTerm item)
{
    _userSearches = _uow.Db.GetCollection<UserSearchTerm>("UserSearchTerms");

    var query = Query.And(
        Query<UserSearchTerm>.EQ(ust => ust.UserId, item.UserId),
        Query<UserSearchTerm>.EQ(ust => ust.ProgramId, item.ProgramId));

    _userSearches.Update(query, Update<UserSearchTerm>.Replace(item),
        new MongoUpdateOptions { Flags = UpdateFlags.Upsert });

    return (int)_userSearches.Count(query);
}
Additional Information:
I'm using mongod version 2.6.5
The mongocsharpdriver version I'm using is 1.9.2
I'm running .NET 4.5
UserSearchTerms is the collection I store these documents in.
The query is intended to match users on both userId AND programId - my definition of a 'unique' document.
I return a count after the fact for debugging purposes.
You could add a unique compound index on userId and programId to ensure that no duplicates can be inserted. Without one, two concurrent upserts can both find no matching document and both insert a new one; with the unique index, the server rejects the second insert.
Doc : https://docs.mongodb.org/v2.4/tutorial/create-a-unique-index/
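A minimal sketch with the 1.9.x legacy driver (assuming the question's _uow.Db handle; note the index build will fail if duplicate pairs already exist, so clean those up first):

var userSearches = _uow.Db.GetCollection<UserSearchTerm>("UserSearchTerms");

// Unique compound index on the pair that defines a 'unique' document.
userSearches.CreateIndex(
    IndexKeys<UserSearchTerm>.Ascending(ust => ust.UserId, ust => ust.ProgramId),
    IndexOptions.SetUnique(true));

With the index in place, the losing side of a race gets a duplicate-key error from the upsert, which you can catch and retry as a plain update.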
I'm updating a document in MongoDB with the C# driver. I've verified the update completes successfully, but if I query the collection containing the updated document right after the update, I don't see the new values. If I put a breakpoint in my code after the update but before the select, I do see the new values in the results of the select. If I let the code run straight through, I get the old values in my names collection. I tried changing the write concern, but I don't think that's it. Is there some way to chain these two operations together so the select won't happen until the update has completed?
var qry = Query.EQ("_id", new ObjectId(id));
var upd = Update.Set("age", BsonValue.Create(newAge));
db.GetCollection<MongoTest>("mongotest").Update(qry, upd);

// ... would like to pause here until update is complete ...

var names = db.GetCollection<MongoTest>("mongotest")
    .FindAll()
    .SetSortOrder(SortBy.Ascending("name"))
    .ToList<MongoTest>();

if (names.Count() > 0)
{
    return View(names);
}
One clarification: the official MongoDB .NET driver defaults to acknowledged writes (write concern 1) when you connect using MongoClient. If you start with the old style of using MongoServer.Create (now obsolete), the default is unacknowledged.
Also, ensure that you are not using a read preference that could route your reads to a secondary.
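As a sketch (reusing the question's db, id, and newAge), you can make the acknowledgement explicit, inspect the write result, and pin the read to the primary:

var collection = db.GetCollection<MongoTest>("mongotest");

var qry = Query.EQ("_id", new ObjectId(id));
var upd = Update.Set("age", BsonValue.Create(newAge));

// Acknowledged is the MongoClient default, but being explicit documents the intent.
WriteConcernResult result = collection.Update(qry, upd, WriteConcern.Acknowledged);
Console.WriteLine("Documents affected: {0}", result.DocumentsAffected);

// Read from the primary so the query cannot be served by a lagging secondary.
var names = collection.FindAll()
    .SetReadPreference(ReadPreference.Primary)
    .SetSortOrder(SortBy.Ascending("name"))
    .ToList();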
I am saving each line in a list to the end of a file...
But what I would like to do is check whether that file already contains the line, so it does not save the same line twice.
So before using StreamWriter to write the file I want to check each item in the list to see if it exists in the file. If it does, I want to remove it from the list before using StreamWriter.
..... Unless of course there is a better way to go about doing this?
Assuming your files are small, you are limited to flat files, a database table is not an option, etc., you could just read the existing items into a list and make the write operation conditional on examining that list. Again, I would try another method if at all possible (a db table, etc.), but this is the most direct answer to your question...
string line = "your line to append";

// Read existing lines into a list
List<string> existItems = new List<string>();
using (StreamReader sr = new StreamReader(path))
{
    while (!sr.EndOfStream)
        existItems.Add(sr.ReadLine());
}

// Append the new line only if it isn't already in the file
if (!existItems.Contains(line))
{
    using (StreamWriter sw = new StreamWriter(path, true)) // true = append
        sw.WriteLine(line);
}
I guess what you could do is initialize the list from the file, adding each line as a new entry to the list.
Then, as you add to the list, check to see if it contains the line already.
List<string> l = new List<string> { "A", "B", "C" }; // This would be initialized from the file.
string s = "D"; // The line you are about to add.
if (!l.Contains(s))
    l.Add(s);
When you are ready to save the file, just write out what is in the list.
This will be slow, especially if you have a lot of data.
If possible, can you store all the lines in a database table with a primary key on the text column? Then add if the column value does not exist, and when you're done, dump the table to a text file? I think that's what I'd do.
I'd like to point out I don't think this is ideal, but it should be fairly performant (using MS SQL syntax):
create table foo (
rowdata varchar(1000) primary key
);
-- for insertion (where @rowdata is the new text line):
insert into foo (rowdata)
select @rowdata
where not exists (select 1 from foo where rowdata = @rowdata);
-- for output
select rowdata from foo;
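From the C# side, the conditional insert can be run with plain ADO.NET; here is a sketch, where connectionString and line are whatever your application already has:

// using System.Data.SqlClient;
const string sql = @"
    insert into foo (rowdata)
    select @rowdata
    where not exists (select 1 from foo where rowdata = @rowdata);";

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(sql, conn))
{
    cmd.Parameters.AddWithValue("@rowdata", line);
    conn.Open();
    int inserted = cmd.ExecuteNonQuery(); // 0 means the line was already there
}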
If you can sort the file every time you save it, it would be much faster to determine whether a particular entry exists.
Also, a database table would be a good idea, as mentioned earlier: you can search the table for the entry to be added and, if it does not exist, add it.
It depends on whether you are after speed (db), fast implementation (file access), or don't care (use in-memory lists until the file gets too big and it all burns and crashes).
A similar case can be found here