Checking a File for matches - c#

I am saving each line in a list to the end of a file...
But what I would like to do is check whether that file already contains the line, so the same line does not get saved twice.
So before using StreamWriter to write the file I want to check each item in the list to see if it exists in the file. If it does, I want to remove it from the list before using StreamWriter.
..... Unless of course there is a better way to go about doing this?

Assuming your files are small and you are limited to flat files (a database table is not an option, etc.), you could read the existing items into a list and make the write operation conditional on examining that list... Again, I would try another method if at all possible (a db table, etc.), but here is the most direct answer to your question:
string line = "your line to append";

// Read existing lines into a list
List<string> existItems = new List<string>();
using (StreamReader sr = new StreamReader(path))
    while (!sr.EndOfStream)
        existItems.Add(sr.ReadLine());

// Only append the new line if it is not already in the file
if (!existItems.Contains(line))
    using (StreamWriter sw = new StreamWriter(path, true)) // true = append rather than overwrite
        sw.WriteLine(line);
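As a side note (my own suggestion, not part of the original answer): List<string>.Contains is a linear scan on every check, so for larger files a HashSet<string> is a cheap improvement. A minimal sketch under the same path/line assumptions:
HashSet<string> existItems = new HashSet<string>(File.ReadAllLines(path));
if (existItems.Add(line)) // Add returns false if the line was already present
    File.AppendAllText(path, line + Environment.NewLine);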

I guess what you could do is initialize the list from the file, adding each line as a new entry to the list.
Then, as you add to the list, check to see if it contains the line already.
List<string> l = new List<string> { "A", "B", "C" }; // This would be initialized from the file.
string s = "D"; // the candidate line to add
if (!l.Contains(s))
    l.Add(s);
When you are ready to save the file, just write out what is in the list.
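For that last step, something like File.WriteAllLines should do (a one-line sketch, assuming the same list l and a path variable):
File.WriteAllLines(path, l); // overwrites the file with the de-duplicated list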

This will be slow, especially if you have a lot of data going into the file.
If possible, can you store all the lines in a database table with a primary key on the text column? Then add if the column value does not exist, and when you're done, dump the table to a text file? I think that's what I'd do.
I'd like to point out that I don't think this is ideal, but it should be fairly performant (using MS SQL syntax):
create table foo (
    rowdata varchar(1000) primary key
);

-- for insertion (where @rowdata is the new text line):
insert into foo (rowdata)
select @rowdata
where not exists (select 1 from foo where rowdata = @rowdata);

-- for output:
select rowdata from foo;
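To round this out, here is a minimal sketch of calling that insert from C# with ADO.NET; the foo table is carried over from the SQL above, while the connectionString parameter and method name are assumptions for illustration:
using System.Data.SqlClient;

static void AppendIfMissing(string connectionString, string line)
{
    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(
        @"insert into foo (rowdata)
          select @rowdata
          where not exists (select 1 from foo where rowdata = @rowdata)", conn))
    {
        cmd.Parameters.AddWithValue("@rowdata", line); // parameterized, so no injection risk
        conn.Open();
        cmd.ExecuteNonQuery(); // no-op when the row already exists
    }
}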

If you keep the file sorted every time you save it, determining whether a particular entry exists becomes much faster (a binary search rather than a full scan; see the sketch below).
Also, a database table would be a good idea, as mentioned earlier: you can search the table for the entry to be added and insert it only if it does not exist.
It depends on whether you are after speed (db), fast implementation (file access), or don't care (use in-memory lists until the file gets too big and crashes and burns).
A similar case can be found here
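A minimal sketch of the sorted-file idea (assuming a path variable and the usual System.IO/System.Collections.Generic usings):
var lines = new List<string>(File.ReadAllLines(path));
lines.Sort(StringComparer.Ordinal);

string candidate = "your line to append";
int index = lines.BinarySearch(candidate, StringComparer.Ordinal);
if (index < 0) // negative means not found; ~index is the insertion point
{
    lines.Insert(~index, candidate);
    File.WriteAllLines(path, lines); // re-write the file, still sorted
}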

How do I read only part of a column from a Parquet file using Parquet.net?

I am using Parquet.Net to read parquet files, but the only option to read from the parquet file is:
//get the first group
Parquet.ParquetRowGroupReader rowGroup = myParquet.OpenRowGroupReader(0);
//gets the first column
Parquet.Data.DataColumn col1 = rowGroup.ReadColumn(myParquet.Schema.GetDataFields()[0]);
This allows me to get the first column from the first rowGroup, but the problem is, the first rowGroup can be something like 4 million rows, and ReadColumn will read all 4 million values.
How do I tell ReadColumn that I only want it to read, say, the first 100 rows? Reading all 4 million rows wastes memory and file read time.
I actually got a memory error until I changed my code to resize that 4-million-value array down to my 100 after reading each column.
I don't necessarily need row based access, I can work with columns, I just don't need a whole rowGroup worth of values in each column. Is this possible? If row based access is better, how does one use it? The Parquet.Net project site doesn't give any examples, and just talks about tables.
According to the source code this capability exists in DataColumnReader but this is an internal class and thus not directly usable.
ParquetRowGroupReader uses it inside its ReadColumn method, but exposes no such options.
What can be done in practice is copying the whole DataColumnReader class and using it directly, but this could breed future compatibility issues.
If the problem can wait for some time, I'd recommend copying the class and then opening an issue + pull request to the library with the enhanced class, so the copied class can eventually be removed.
If you look at the parquet-dotnet documentation, they do not recommend writing more than 5,000 records into one row group for performance reasons, though at the bottom of the page they say row groups are designed to hold 50,000 rows on average:
It's not recommended to have more than 5'000 rows in a single row
group for performance reasons
My team works with 100,000 rows per row group; overall it may depend on what you are storing, but 4,000,000 records in one row group inside a column does sound like too much.
So to answer your question: to read only part of the column, make the row groups inside the column smaller and then read only as many row groups as you wish. If you want to read only 100 records, read in the first row group and take the first 100 from it; reasonably sized row groups are very fast to read. A sketch follows.
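A minimal sketch of that approach with Parquet.Net v3 (the path variable is an assumption; the API calls are the same ones used in the question):
using System.IO;
using System.Linq;
using Parquet;

using (var stream = File.OpenRead(path))
using (var myParquet = new ParquetReader(stream))
using (var rowGroup = myParquet.OpenRowGroupReader(0))
{
    var field = myParquet.Schema.GetDataFields()[0];
    var column = rowGroup.ReadColumn(field); // still reads the whole (now small) row group
    var first100 = column.Data.Cast<object>().Take(100).ToArray();
}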
ParquetSharp should be able to do that. It's a wrapper around the Apache Parquet C++ library, and it supports Windows, Linux and macOS.
using System;
using ParquetSharp;

using (var reader = new ParquetFileReader(path))
using (var group = reader.RowGroup(0))
{
    // You can use the logical reader for automatic conversion to a fitting CLR type,
    // here `double` as an example
    // (unfortunately this does not work well with complex schemas IME)
    const int batchSize = 4000;
    Span<double> buffer = new double[batchSize];
    var howManyRead = group.Column(0).LogicalReader<double>().ReadBatch(buffer);

    // or if you want raw Parquet (with Dremel data and physical type)
    var resultObject = group.Column(0).Apply(new Visitor());
}
class Visitor : IColumnReaderVisitor<object>
{
    public object OnColumnReader<TValue>(ColumnReader<TValue> columnReader)
        where TValue : unmanaged
    {
        // TValue will be the physical Parquet type
        const int batchSize = 200000;
        var buffer = new TValue[batchSize];
        var definitionLevels = new short[batchSize];
        var repetitionLevels = new short[batchSize];
        long valuesRead;
        var levelsRead = columnReader.ReadBatch(batchSize,
                                                definitionLevels, repetitionLevels,
                                                buffer, out valuesRead);
        // Return stuff you are interested in here; it will be `resultObject` above
        return new object();
    }
}

Lucene.NET is not deleting docs?

I've probably gone through numerous S.O. posts on this issue, but I'm at a loss and can't figure out what the problem is.
I can add and update docs to the index, but I cannot seem to successfully delete them.
I'm using Lucene.NET v3.0.3
One suggestion I read was to run a query using the same conditions and ensure I'm getting a result back. Well, I did so:
First, I have a method that returns items in my database that have been marked as deleted
var deletedItems = VehicleController.GetDeleted(lastCheck);
Right now during testing, this includes a single item. I then iterate:
// This method returns my writer
var indexWriter = LuceneController.GetWriter();
// And my searcher
var searcher = new IndexSearcher(indexWriter.GetReader());
// And iterate over my items (just one for testing)
foreach (var c in deletedItems)
{
    // Here I'm testing by doing a query
    var query = new BooleanQuery();
    query.Add(new TermQuery(new Term("key", c.Guid.ToString())), Occur.MUST);
    // Let's see if it can find the record based on this
    var docs = searcher.Search(query, 1);
    var foundDoc = docs.ScoreDocs.FirstOrDefault();
    // Yep, we have one... let's get the full doc to be sure
    var actualDoc = searcher.Doc(foundDoc.Doc);
    // If I inspect actualDoc, it's the right one... I want to delete it.
    indexWriter.DeleteDocuments(query);
    indexWriter.Commit();
}
I've tried to smash all the logic above so it's easier to read, but I've tried all kinds of methods...
indexWriter.Optimize();
indexWriter.Flush(true, true, true);
If I watch the actual folder where everything is being stored, I can see filenames like 0_1.del and stuff like that pop up, which seems promising.
I then read somewhere about a merge policy, but isn't that what Flush is supposed to do?
Then I read about setting the optimize method to 1 max segment, and that still didn't work (i.e. indexWriter.Optimize(1)).
So using the same query to fetch works, but deleting does not. Why? What else can I check? Does delete actually remove the item permanently, or does it live on in some other manner until I completely delete the directory that's being used? I'm not understanding.
Index segment files in Lucene are immutable; they never change once written. So when a deletion is recorded, the deleted record is not actually removed from the index files immediately; it is simply marked as deleted. The record will eventually be removed from the index once that segment is merged to produce a new segment, i.e. the deleted record won't be in the new segment that results from the merge.
Theoretically, once Commit is called, the deleted document should disappear from the reader's view, since you are getting the reader from the writer (i.e. it's a real-time reader). This is documented here:
Note that flushing just moves the internal buffered state in IndexWriter into the index, but these changes are not visible to IndexReader until either commit() or close() is called.
source: https://lucene.apache.org/core/3_0_3/api/core/org/apache/lucene/index/IndexWriter.html
But you might want to try closing the reader after the deletion takes place and then getting a new reader from the writer to see if that new reader now has the record removed from visibility.
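For instance (a minimal sketch reusing the writer and searcher variables from the question; method names are as I recall them from the Lucene.NET 3.0.3 API):
indexWriter.DeleteDocuments(query);
indexWriter.Commit();

// Drop the stale point-in-time reader and open a fresh one from the writer
searcher.IndexReader.Dispose();
searcher = new IndexSearcher(indexWriter.GetReader());

var docsAfter = searcher.Search(query, 1);
// docsAfter.TotalHits should now be 0 for the deleted document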

linq to update or insert on a csv file

I successfully made a C# class that uses Jet to execute a SELECT string on a csv file. However, I really need an UPDATE and/or INSERT statement, and Jet apparently won't allow it. I've been looking into using LINQ, but can't seem to find any examples of an UPDATE clause for LINQ either. Does anyone know about this, or perhaps something other than LINQ that could accomplish it?
Basically, I want to read a csv file into memory and query on it (select columns, or distinct, etc), which is fine with Jet, but I also want to update rows and modify the text file.
basically:
UPDATE table
SET col3 = 42
WHERE col1 = 'mouse' AND col2 = 'dolphins';
and have that take effect on the data read from the csv.
Also, I can't figure out how to access columns by name with LINQ. Any advice?
so far, a constructor for my class seems to parse the file ok (I can see it in the watch and immediate windows), but I don't know how to move on from here:
string Method = this.ToString();
this.strFilePath = filePath;
this.strDelim = delim;
Logger.Debug("Starting", Method);
try
{
    fileLines = File.ReadAllLines(filePath);
}
catch (Exception Ex)
{
    Logger.Error(Ex, Method);
}
this.fileTable = fileLines.Select(l => l.Split(delim));
Ignore the 'Logger', this is an in-house class that writes things to a log for our own purposes.
What you're asking can't easily be done, due to the way text files are organized.
In general, you can't arbitrarily update a text file like you can a file of records. Consider, for example, the following text file.
line1,43,27,Jim
line2,29,32,Keith
Now, you want to change the 43 on line 1 to 4300. That involves adding two characters to the file, so you end up having to move everything after the 43 down two characters and then insert the 00. But moving everything down two characters requires extending the file and rewriting all of the text that follows.
Text files are typically used for sequential access: reading and appending. Insertion or deletion affects the entire file beyond the point of modification. Unless you're going to re-write the entire file on every change (sketched below), you simply do not want to use a text file for holding data that changes frequently.
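To make that concrete, here is a minimal sketch of the re-write-everything approach applied to the asker's UPDATE example (the path variable and column positions are my assumptions):
var updated = File.ReadAllLines(path)
    .Select(line => line.Split(','))
    .Select(cols =>
    {
        if (cols[0] == "mouse" && cols[1] == "dolphins") // WHERE col1='mouse' AND col2='dolphins'
            cols[2] = "42";                              // SET col3 = 42
        return string.Join(",", cols);
    })
    .ToArray();
File.WriteAllLines(path, updated); // the whole file is re-written in one go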

Upload text file to database

I have a text file with the following structure:
id=123
name=value
year=2013
The first part (id, name, year) is the column name, and the part after the '=' is the data I need to put in that column. I have no idea how to do it.
1. I am reading files line by line
2. ??
My only idea is to replace the '=' to build a query and try to run it, but that looks like a bad idea...
I also need to check whether the data is already present in the DB.
Upload your text file
Read it line by line
Update DB
How I would do it:
Upload the text file.
Create a List of objects that holds those values and implements validation.
Read every line, and every three lines try to construct a valid object of that type (see the parsing sketch below); I'd use a regex or something similar to distinguish key from value and to catch errors.
Update the database via LINQ to SQL using my list of objects.
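A minimal sketch of the parsing steps (the Record class, path variable, and the assumption that entries always come as id/name/year triples are mine, for illustration):
class Record
{
    public int Id { get; set; }
    public string Name { get; set; }
    public int Year { get; set; }
}

List<Record> Parse(string path)
{
    var records = new List<Record>();
    var lines = File.ReadAllLines(path);
    for (int i = 0; i + 2 < lines.Length; i += 3)
    {
        // Each line looks like "key=value"; split on the first '=' only
        var fields = lines.Skip(i).Take(3)
            .Select(l => l.Split(new[] { '=' }, 2))
            .ToDictionary(p => p[0].Trim(), p => p[1].Trim());

        records.Add(new Record
        {
            Id = int.Parse(fields["id"]),
            Name = fields["name"],
            Year = int.Parse(fields["year"])
        });
    }
    return records;
}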

C# and the CSV file

I formatted this data using C# with StreamReader and StreamWriter and created this csv file. Here is the sample data:
A------B-C---D----------E------------F------G------H-------I
NEW, C,A123 ,08/24/2011,08/24/2011 ,100.00,100.00,X123456,276135
NEW, C,A125 ,08/24/2011,08/24/2011 ,200.00,100.00,X123456,276135
NEW, C,A127 ,08/24/2011,08/24/2011 , 50.00,100.00,X123456,276135
NEW, T,A122 ,08/24/2011,08/24/2011 , 5.00,100.00,X225511,276136
NEW, T,A124 ,08/24/2011,08/24/2011 , 10.00,100.00,X225511,276136
NEW, T,A133 ,08/24/2011,08/24/2011 ,500.00,100.00,X444556,276137
I would like the following output:
NEW, C,A123 ,08/24/2011,08/24/2011 ,100.00,100.00,X123456,276135
NEW, C,A125 ,08/24/2011,08/24/2011 ,200.00,100.00,X123456,276135
NEW, C,A127 ,08/24/2011,08/24/2011 , 50.00,100.00,X123456,276135
NEW, C,A001 ,08/24/2011,08/24/2011 ,350.00,100.00,X123456,276135
NEW, T,A122 ,08/24/2011,08/24/2011 , 5.00,100.00,X225511,276136
NEW, T,A124 ,08/24/2011,08/24/2011 , 10.00,100.00,X225511,276136
NEW, T,A001 ,08/24/2011,08/24/2011 , 15.00,100.00,X225511,276136
NEW, T,A133 ,08/24/2011,08/24/2011 ,500.00,100.00,X444556,276137
NEW, T,A001 ,08/24/2011,08/24/2011 ,500.00,100.00,X225511,276137
With each change in field I, I would like to add a line: sum column F, put "A001" in column C, and copy the contents of the other fields into that newly added line.
The letters on the columns are for illustrative purposes only. There are no headers.
What should I do first? How do I sum column F, copy the contents of all the fields, and put "A001" in C? How do I add a line and copy the fields with each change in I?
From your questions it doesn't sound like your data fits a flat file format very well, or at least not a CSV (the analogy to a single DB table). Wanting to 'copy the contents of the other fields into that newly added line' implies to me that there is possibly a relationship that might be better expressed referentially rather than by copying the data to a new row.
Also keep in mind that the requirement to sum column F suggests you will need to iterate over every row in a group in order to calculate the sum.
If you decide to go a route other than a CSV, you might try a lightweight database solution such as SQLite. An alternative might be to look at XmlSerializer or DataContractSerializer and just work with objects in your code. The objects can then be serialized to disk when you're done with them.
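A minimal sketch of the XmlSerializer alternative (the Row class and file name are illustrative assumptions, not from the question):
using System.Xml.Serialization;

public class Row
{
    public string Status { get; set; }  // column A
    public string Code { get; set; }    // column C
    public decimal Amount { get; set; } // column F
    public int GroupId { get; set; }    // column I
}

// ... populate from your data, then persist when done:
var rows = new List<Row>();
var serializer = new XmlSerializer(typeof(List<Row>));
using (var stream = File.Create("rows.xml"))
    serializer.Serialize(stream, rows);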
You could use a custom iterator block. Just as an example, this shows how to add the new line that carries "A001" in your C column and the summed F value; keep in mind that you should keep the number of columns in each row the same.
public IEnumerable<string> GetUpdatedLines(string fileName)
{
    int? lastValue = null;
    string[] lastValues = null;
    decimal sum = 0;
    foreach (string line in File.ReadLines(fileName))
    {
        var values = line.Split(',');
        int myIValue = Convert.ToInt32(values[8]); // column I is the 9th field
        if (lastValue.HasValue && myIValue != lastValue)
        {
            // Group changed: emit the summary line built from the previous group's fields
            lastValues[2] = "A001";
            lastValues[5] = sum.ToString("0.00"); // the summed F column
            yield return string.Join(",", lastValues);
            sum = 0;
        }
        sum += decimal.Parse(values[5]); // accumulate column F within the group
        lastValue = myIValue;
        lastValues = values;
        yield return line;
    }
    if (lastValues != null)
    {
        // Don't forget the summary line for the final group
        lastValues[2] = "A001";
        lastValues[5] = sum.ToString("0.00");
        yield return string.Join(",", lastValues);
    }
}
