I have a very large CSV file (Millions of records)
I have developed a smart search algorithm to locate specific line ranges in the file to avoid parsing the whole file.
Now I am facing a trickier issue: I am only interested in the content of a specific column.
Is there a smart way to avoid looping line by line through a 200MB file and retrieve only the content of a specific column?
I'd use an existing library, as codeulike has suggested. For a very good reason why, read this article:
Stop Rolling Your Own CSV Parser!
You mean get every value from every row for a specific column?
You're probably going to have to visit every row to do that.
This C# CSV Reading library is very quick so you might be able to use it:
LumenWorks.Framework.IO.Csv by Sebastien Lorien
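For illustration, a minimal sketch of reading a single column with LumenWorks' CsvReader might look like this (the file name and column index are placeholders, not from the question):

using System;
using System.IO;
using LumenWorks.Framework.IO.Csv;

class ReadOneColumn
{
    static void Main()
    {
        const string path = "data.csv";   // placeholder path
        const int columnIndex = 3;        // placeholder column

        // Second constructor argument: the file has a header row.
        using (var csv = new CsvReader(new StreamReader(path), true))
        {
            while (csv.ReadNextRecord())
            {
                // The indexer returns the field of the current record.
                Console.WriteLine(csv[columnIndex]);
            }
        }
    }
}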
Unless every CSV field has a fixed width (so that even an empty field still occupies n bytes of blank space between its surrounding separators), no.
If yes
Then each row, in turn, also has a fixed length, and therefore you can skip straight to the first value for that column and, once you've read it, immediately advance to the next row's value for the same field, without having to read any intermediate values.
I think this is pretty simple - but I'm on a roll at the moment (and at lunch), so I'm going to finish it anyway :)
To do this, we first want to know how long each row is in characters (adjust for bytes according to Unicode, UTF8 etc):
row_len = sum(widths[0..n-1]) + n-1 + row_sep_length
Where n is the total number of columns on each row - this is a constant for the whole file. We add an extra n-1 to it to account for the separators between column values.
And row_sep_length is the length of the separator between two rows - usually a newline, or potentially a [carriage-return & line-feed] pair.
The value for a column row[r]col[i] will be offset characters from the start of row[r], where offset is defined as:
offset = i > 0 ? sum(widths[0..i-1]) + i : 0;
//or sum of widths of all columns before col[i]
//plus one character for each separator between adjacent columns
And then, assuming you've read the whole column value up to the next separator, the offset to the starting character of the next column value row[r+1]col[i] is calculated by subtracting the width of your column from the row length. This is yet another constant for the file:
next-field-offset = row_len - widths[i];
//widths[i] is the width of the field you are actually reading.
Throughout, i is zero-based in this pseudo-code, as is the indexing of the vectors/arrays.
To read, then, you first advance the file pointer by offset characters - taking you to the first value you want. You read the value (taking you to the next separator) and then simply advance the file pointer by next-field-offset characters. If you reach EOF at this point, you're done.
I might have missed a character either way in this - so if it's applicable - do check it!
This only works if you can guarantee that all field values - even nulls - for all rows will be the same length, that the separators are always the same length, and that all row separators are the same length. If not, then this approach won't work.
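To make the fixed-width case concrete, here is a rough sketch (the widths, the single-character separator, the absence of a header row and the single-byte encoding are all assumptions; seeking works in bytes, so this only holds for something like ASCII):

using System;
using System.IO;
using System.Text;

class FixedWidthColumn
{
    static void Main()
    {
        int[] widths = { 10, 8, 12, 6 };            // assumed fixed column widths
        const int col = 2;                          // zero-based column to read
        int n = widths.Length;
        int rowSepLen = 2;                          // "\r\n"; use 1 for "\n"

        int rowLen = 0;
        foreach (int w in widths) rowLen += w;
        rowLen += (n - 1) + rowSepLen;              // column separators + row separator

        int offset = 0;
        for (int i = 0; i < col; i++) offset += widths[i] + 1;   // widths plus one separator each

        int nextFieldOffset = rowLen - widths[col]; // constant skip after each read

        using (var fs = new FileStream("data.csv", FileMode.Open, FileAccess.Read))
        {
            fs.Seek(offset, SeekOrigin.Begin);
            var buffer = new byte[widths[col]];
            while (fs.Read(buffer, 0, buffer.Length) == buffer.Length)
            {
                Console.WriteLine(Encoding.ASCII.GetString(buffer).Trim());
                fs.Seek(nextFieldOffset, SeekOrigin.Current);   // jump to the same column in the next row
            }
        }
    }
}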
If not
You'll have to do it the slow way - find the column in each line and do whatever it is you need to do.
If you're doing a significant amount of work on each column value, one optimisation is to pull all the column values out first into a list (sized with a known initial capacity), batching at, say, 100,000 values at a time, and then iterate through those.
If you keep each loop focused on a single task, that should be more efficient than one big loop.
Equally, once you've batched 100,000 column values, you could use Parallel LINQ to distribute the second loop (not the first, since there's no point parallelising reading from a file).
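A rough sketch of that batch-then-parallelise idea (ExtractColumn and Process stand in for your own parsing and per-value work; they are not from the question):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class ColumnBatcher
{
    const int BatchSize = 100000;

    public static void Run(string path, int columnIndex)
    {
        var batch = new List<string>(BatchSize);    // known initial capacity
        foreach (string line in File.ReadLines(path))
        {
            batch.Add(ExtractColumn(line, columnIndex));
            if (batch.Count == BatchSize)
            {
                ProcessBatch(batch);
                batch.Clear();
            }
        }
        if (batch.Count > 0) ProcessBatch(batch);
    }

    static void ProcessBatch(List<string> values)
    {
        // The second loop no longer touches the file, so it is safe to parallelise.
        var results = values.AsParallel().Select(Process).ToList();
        // ... do something with results ...
    }

    // Naive split - assumes no quoted fields containing the separator.
    static string ExtractColumn(string line, int i) => line.Split(',')[i];

    // Placeholder for whatever work you do per value.
    static string Process(string value) => value.ToUpperInvariant();
}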
There are only shortcuts if you can pose specific limitations on the data.
For example, you can only read the file line by line if you know that there are no values in the file that contain line breaks. If you don't know this, you have to parse the file record by record as a stream, and each record ends where there is a line break that is not inside a value.
However, unless you know that each line takes up exactly the same number of bytes, there is no other way to read the file than line by line. A line break is just another character (or pair of characters) in the file; there is no way to locate a line in a text file other than reading all the lines that come before it.
You can apply similar shortcuts when reading a record if you can pose limitations on the fields in the records. If, for example, you know that the fields to the left of the one you are interested in are all numerical, you can use a simpler parsing method to find the start of that field.
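For example, if you know the fields before the one you want never contain the separator (say they are plain numbers), you can just count separators instead of fully parsing each record; a sketch:

static class CsvFields
{
    // Sketch: returns field `col` (zero-based), assuming the fields before it
    // never contain an embedded separator, quotes or line breaks.
    public static string GetField(string line, int col, char sep)
    {
        int start = 0;
        for (int i = 0; i < col; i++)
            start = line.IndexOf(sep, start) + 1;   // jump just past the next separator

        int end = line.IndexOf(sep, start);
        return end < 0 ? line.Substring(start) : line.Substring(start, end - start);
    }
}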
I am trying to read the data stored in an ICMT tag on a WAV file generated by a noise monitoring device.
The RIFF parsing code all seems to work fine, except for the fact that the ICMT tag seems to have data after the declared size. As luck would have it, it's the timestamp, which is the one absolutely critical piece of info for my application.
SYN is hex 16, which gives a size of 22, which is up to and including the NUL before the timestamp. The monitor documentation is no help; it says that the tag includes the time, but their example also has the same issue.
It is the last tag in the enclosing list, and the size of the list does include it - does that mean it doesn't need a chunk ID? I'm struggling to find decent RIFF docs, but I can't find anything that suggests that's the case; also I can't see how it'd be possible to determine that it was the last chunk and so know to read it with no chunk ID.
Alternatively, the ICMT comment chunk is the last thing in the file - is that a special case? Can I just get the time by reading everything from the end of the declared length ICMT to the end of the file and assume that will always work?
The current parser behaviour is that the timestamp is read, after the channel / dB information, as a chunk ID + size, and the parser then complains that there was not enough data left in the file to fulfil the request.
No, it would still need its own ID. No, being the last thing in the file is no special case either. What you're showing here is malformed.
Your current parser errors correctly, as the next thing to be expected again is a 4 byte ID followed by 4 bytes for the length. The potential ID _10: is unknown and would be skipped, but interpreting 51:4 as DWORD for the length of course asks for trouble.
The device is the culprit. Do you have other INFO fields which use NUL bytes? If not, then I assume the device is naive enough to consider a NUL the end of a string, despite itself producing strings with multiple NULs.
Since I have encountered countless files not sticking to standards, I can only say your parser is too naive as well: it knows how long the encapsulating list is and thus could easily detect field lengths that no longer fit. It could then ignore garbage like that, or, in your case, offer the very specific option "add to last field".
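A sketch of that kind of defensive check while walking the sub-chunks of a LIST (the names here are mine, not from any particular RIFF library):

using System;
using System.IO;
using System.Text;

static class RiffWalker
{
    // Reads sub-chunks of a LIST and stops when a declared size no longer
    // fits inside the enclosing list, instead of blindly trusting it.
    public static void ReadListChunk(BinaryReader reader, long listDataSize)
    {
        long end = reader.BaseStream.Position + listDataSize;

        while (reader.BaseStream.Position + 8 <= end)
        {
            string id = Encoding.ASCII.GetString(reader.ReadBytes(4));
            uint size = reader.ReadUInt32();

            if (reader.BaseStream.Position + size > end)
            {
                // Declared size doesn't fit: treat the rest as garbage, or
                // append it to the previous field if you know your device
                // writes trailing data like this.
                break;
            }

            byte[] data = reader.ReadBytes((int)size);
            // ... handle the chunk (id, data) ...

            if ((size & 1) == 1) reader.ReadByte();   // chunks are word-aligned: skip the pad byte
        }
    }
}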
I'm currently looking for a way to realize a partial word pattern algorithm in C#. The situation I'm in looks like follows:
I got a textfield for the search pattern. Every time the user enters or deletes a char in this field, an event triggers which re-runs the search algorithm. So in case I want to search for the word "face" in strings like
"Facebook", "Facelifting", ""Faceless Face" (whatever that should be) or in generally ANY real life sentences as strings,
the algorithm would first start running when typing "f" in the field. It then show the most relevant String on top of a list the strings are in. The second time it runs when "fa" is typed, and the list is sorted again. This goes on until "face" is completely typed in the textfield and the list is sorted again.
However, I don't know what algorithm could be used. I tried the answer from Alain (Getting the closest string match), a simple Levenshtein-distance algorithm, as well as a self-made algorithm, which calculates the priority via
priority = (length_of_typed_pattern) * (amount_of_substr_matches)
In C#, the latter looks like this:
count = Regex.Matches(title, Regex.Escape(pattern)).Count;
priority = pattern.Length * count;
The pattern as well as the title are composed of only lowercase letters.
My conclusions so far:
Hamming distance won't make any sense since the strings are not the same length most of the time
The answer from Alain works fine, but only if at least one word completely matches (you only find the most relevant string/sentence when at least one word equals the pattern), so if "face" is typed and there's a string containing the word "facebook", that string is almost never a top priority
What other ideas could I try? The goal would be to sort the list of strings the best possible way in the earliest moment (with the fewest letters).
You can look at my implementations in the search-* branches of my repository at http://github.com/croemheld/sprung, in Sprung/WindowMatcher.cs and Sprung/Window.cs.
Thanks for your help.
First of all, you need to store a frequency for each string (the number of times that particular string has been searched) somewhere, so that you can show the most relevant ones. If you need to show, say, the k most relevant entries, a min-heap of size k can be used.
Case 1 - a letter is pressed for the first time:
Step (a): Read all the strings from a database or dictionary and store them in some data structure (say DS1), each with a FLAG_VALID (set to 1 initially) which shows it is a valid string for the present search characters (for the first letter, all strings are valid).
As you read the strings, fill the min-heap according to their frequency; once the heap holds k elements, a new element is inserted only when its frequency is greater than the minimum one (i.e. the first element of the min-heap).
Step (b) (this step is the same in every case): to show results you output the elements in reverse heap order, i.e. the first element of the min-heap has the least priority, so delete the elements one by one and display them from last to first.
NOTE: the min-heap holds references to the strings, so a string and its frequency can be accessed together.
Case 2 - entering further letters in the search box:
Step (a): Search through DS1, in which all the strings are present, and check FLAG_VALID first. If it is still a valid string, compare the string from the search box with the string from DS1, set the flag accordingly (1 for a match, 0 otherwise), and fill the k-element min-heap (emptied after the last search) as in Case 1.
Step (b) is as usual.
Case 3 - deleting a letter in the search box:
This is similar to the cases above, but this time you also need to search those strings whose FLAG_VALID is 0 (i.e. strings which are currently invalid).
This is a crude searching method and can be improved by using better data structures and tweaking the algorithm.
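If you are on .NET 6 or later, PriorityQueue can play the role of the min-heap; a sketch of the top-k idea (the candidate list, frequencies and prefix matching are assumptions on my part):

using System;
using System.Collections.Generic;

static class TopK
{
    // Keeps the k most frequently searched strings that still match the prefix.
    public static List<string> Matches(
        IEnumerable<(string Text, int Frequency)> candidates, string prefix, int k)
    {
        // Min-heap keyed on frequency: the least frequent of the current
        // top k sits at the front and is evicted first.
        var heap = new PriorityQueue<string, int>();

        foreach (var (text, freq) in candidates)
        {
            if (!text.StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
                continue;                                     // the FLAG_VALID = 0 case

            if (heap.Count < k)
            {
                heap.Enqueue(text, freq);
            }
            else if (heap.TryPeek(out _, out int minFreq) && freq > minFreq)
            {
                heap.Dequeue();                               // drop the least frequent
                heap.Enqueue(text, freq);
            }
        }

        // Dequeuing yields least frequent first, so reverse for display.
        var result = new List<string>();
        while (heap.Count > 0) result.Add(heap.Dequeue());
        result.Reverse();
        return result;
    }
}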
First of all, let me explain what I'm trying to achieve. The application that I'm making should have the ability to compare two columns of two different tables with each other. Every cell of the column from the first table should be linked to the best-matching cell from the column of the second table. So you would get something like this:
[Image of the two matched columns] (source: modelbouwforum.nl)
This can easily be achieved by using the Levenshtein algorithm. So I wrote a test program in C# to see if I can recreate the same results the image shows. I made two arrays, one containing the first column of the image and one containing the second column. Every cell of the first column is compared to every cell of the second column, so I get 4 iterations per cell (16 in total). The highest match (the one with the lowest Levenshtein distance) in the second column is then linked to the cell of the first column.
The problem:
Let's say we have two large columns with 100K rows each; this will run into serious performance issues. Because every cell of the first column needs to be matched against every cell of the second column to find the best possible match, you have to iterate 100K * 100K = 10 billion times. So I have to find something that avoids iterating 10 billion times.
I did some research about where levenshtein could be used and came across this: http://www.slideshare.net/fullscreen/VasileTopac/fuzzy-hash-map/4. I'm wondering if I am able to create something like the guy did in the link?
Some things to consider:
In such large columns there could be multiple matches for a single cell (the user needs to choose the right one). That means you can't exclude previously matched cells from the current search in order to bring the iteration count down.
In the example the matching/comparison is only done on two columns; however, in the future I'd like to compare a single column from table 1 to all the columns of table 2 (less work for the user). That will be even more expensive, as you can imagine.
NOTE:
I've only been using C# for 4 months, so I hope someone can give me a good starting point (I'd prefer not to get a fully working answer; I'd rather do some research myself and learn from it). Thanks for understanding. English is not my native language, so please feel free to edit my post.
Try to come up with some assumption that always holds true about the matching that can segment it into smaller chunks like:
The first capital alpha character in table 1 must match the first capital alpha character in table 2
You may be able to find some valid assumption that will allow you to pre-process the values into another column:
FirstAlpha1 FirstAlpha2
=========== ===========
P C
S F
C P
F S
Then you could do a simple sort and join (exact match) on this extra value to divide the solution into smaller chunks.
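A sketch of that bucketing idea: pre-process both columns into groups keyed on the assumption, then only run Levenshtein within matching groups (GetKey and the Levenshtein placeholder are mine; plug in your own):

using System;
using System.Collections.Generic;
using System.Linq;

static class ChunkedMatcher
{
    // The key is whatever assumption holds for your data, e.g. the first
    // alphabetic character, upper-cased.
    static string GetKey(string value) =>
        char.ToUpperInvariant(value.FirstOrDefault(char.IsLetter)).ToString();

    public static void Match(IEnumerable<string> column1, IEnumerable<string> column2)
    {
        // Pre-process column 2 once into buckets keyed on the assumption.
        var buckets = column2.GroupBy(GetKey).ToDictionary(g => g.Key, g => g.ToList());

        foreach (string left in column1)
        {
            if (!buckets.TryGetValue(GetKey(left), out var candidates))
                continue;                                     // no bucket, no candidate match

            // Only compare against the (much smaller) matching bucket.
            string best = candidates.OrderBy(right => Levenshtein(left, right)).First();
            Console.WriteLine("{0} -> {1}", left, best);
        }
    }

    // Placeholder: substitute your real Levenshtein implementation here.
    static int Levenshtein(string a, string b) => Math.Abs(a.Length - b.Length);
}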
Is there a library that I can use to perform binary search in a very big text file (can be 10GB).
The file is a sort of a log file - every row starts with a date and time. Therefore rows are ordered.
I started to write the pseudo-code on how to do it, but I gave up since it may seem condescending. You probably know how to write a binary search, it's really not complicated.
You won't find it in a library, for two reasons:
It's not really "binary search" - the line sizes are different, so you need to adapt the algorithm (e.g. look for the middle of the file, then look for the next "newline" and consider that to be the "middle").
Your datetime log format is most likely non-standard (ok, it may look "standard", but think a bit.... you probably use '[]' or something to separate the date from the log message, something like [10/02/2001 10:35:02] My message ).
In summary: I think your need is too specific, and too simple to implement in custom code, for someone to bother writing a library for it :)
As the line lengths are not guaranteed to be the same length, you're going to need some form of recognisable line delimiter e.g. carriage return or line feed.
The binary search pattern can then be pretty much your traditional algorithm. Seek to the 'middle' of the file (by length), seek backwards (byte by byte) to the start of the line you happen to land in, as identified by the line delimiter sequence, read that record and make your comparison. Depending on the comparison, seek halfway up or down (in bytes) and repeat.
When you identify the start index of a record, check whether it was the same as the last seek. You may find that, as you dial in on your target record, moving halfway won't get you to a different record. e.g. you have adjacent records of 100 bytes and 50 bytes respectively, so jumping in at 75 bytes always takes you back to the start of the first record. If that happens, read on to the next record before making your comparison.
You should find that you will reach your target pretty quickly.
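A sketch of that seek-and-scan-back approach (comparing the whole line ordinally stands in for comparing your actual timestamp prefix, which only works if the timestamp format sorts lexicographically; a single-byte encoding is also assumed):

using System;
using System.IO;
using System.Text;

static class LogSearch
{
    // Returns the byte offset of the first line whose text is >= target.
    // Assumes sorted, newline-terminated lines in a single-byte encoding.
    public static long FindFirst(string path, string target)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            long lo = 0, hi = fs.Length;
            while (lo < hi)
            {
                long mid = lo + (hi - lo) / 2;
                long lineStart = SeekBackToLineStart(fs, mid);
                string raw = ReadLineAt(fs, lineStart);          // excludes '\n'
                string key = raw.TrimEnd('\r');

                if (string.Compare(key, target, StringComparison.Ordinal) < 0)
                    lo = lineStart + raw.Length + 1;             // start of the next line
                else
                    hi = lineStart;
            }
            return lo;
        }
    }

    static long SeekBackToLineStart(FileStream fs, long pos)
    {
        while (pos > 0)
        {
            fs.Position = pos - 1;
            if (fs.ReadByte() == '\n') break;                    // previous line ends here
            pos--;
        }
        return pos;
    }

    static string ReadLineAt(FileStream fs, long pos)
    {
        fs.Position = pos;
        var sb = new StringBuilder();
        int b;
        while ((b = fs.ReadByte()) != -1 && b != '\n')
            sb.Append((char)b);
        return sb.ToString();
    }
}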
You would need to be able to stream the file, but you would also need random access. I'm not sure how you accomplish this short of a guarantee that each line of the file contains the same number of bytes. If you had that, you could get a Stream of the object and use the Seek method to move around in the file, and from there you could conduct your binary search by reading in the number of bytes that constitute a line. But again, this is only valid if the lines are the same number of bytes. Otherwise, you would jump in and out of the middle of lines.
Something like
// Assumes every line is exactly lineLength bytes, including the line terminator.
byte[] buffer = new byte[lineLength];
stream.Seek((long)lineLength * searchPosition, SeekOrigin.Begin);   // cast avoids int overflow on huge files
stream.Read(buffer, 0, lineLength);
string line = Encoding.Default.GetString(buffer);
This shouldn't be too bad under the constraint that you hold an Int64 in memory for every line-feed in the file. That really depends upon how long the lines of text are on average; given 1000 bytes per line you'd be looking at around (10,000,000,000 / 1000 * 8 bytes) = 80 MB. Very big, but possible.
So try this:
Scan the file and store the ordinal offset of each line-feed in a List
Binary search the List with a custom comparer that seeks to the stored file offset and reads the line there (see the sketch below).
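A sketch of that two-step approach; I've hand-rolled the binary search over the offset list rather than using List<T>.BinarySearch, because the comparison needs the target key as well as the probed offset (comparing the whole line works if the timestamp prefix sorts lexicographically):

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

class OffsetIndex
{
    readonly List<long> _lineStarts = new List<long>();   // offset of the start of each line
    readonly string _path;

    public OffsetIndex(string path)
    {
        _path = path;
        _lineStarts.Add(0);
        using (var fs = File.OpenRead(path))
        {
            int b;
            long pos = 0;
            while ((b = fs.ReadByte()) != -1)
            {
                pos++;
                if (b == '\n' && pos < fs.Length)
                    _lineStarts.Add(pos);                  // the next line starts here
            }
        }
    }

    // Returns the index of the first line whose text is >= target.
    public int LowerBound(string target)
    {
        using (var fs = File.OpenRead(_path))
        {
            int lo = 0, hi = _lineStarts.Count;
            while (lo < hi)
            {
                int mid = (lo + hi) / 2;
                string line = ReadLineAt(fs, _lineStarts[mid]);
                if (string.Compare(line, target, StringComparison.Ordinal) < 0)
                    lo = mid + 1;
                else
                    hi = mid;
            }
            return lo;
        }
    }

    static string ReadLineAt(FileStream fs, long offset)
    {
        fs.Position = offset;
        var sb = new StringBuilder();
        int b;
        while ((b = fs.ReadByte()) != -1 && b != '\n' && b != '\r')
            sb.Append((char)b);
        return sb.ToString();
    }
}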
If your file is static (or changes rarely) and you have to run "enough" queries against it, I believe the best approach will be creating "index" file:
Scan the initial file and take the datetime part of each line plus its position in the original file (this is why the file has to be pretty static), and encode them somehow, for example: Unix time (full 10 digits) + nanoseconds (zero-filled 4 digits) + line position (zero-filled 10 digits). This way you will have an index file with consistent "lines".
Perform a binary search on that index file (you may need to be a bit creative in order to achieve a range search) and get the relevant location(s) in the original file.
Read directly from the original file, starting from the given location / reading the given range.
You've got range search with O(log(n)) run-time :) (and you've created primitive DB functionality)
Needless to say, if the data file is updated "too" frequently, or you don't run "enough" queries against the index file, you may end up spending more time creating the index file than you save on the queries.
Btw, working with this index file doesn't require the data file to be sorted. As log files tend to be append-only, and sorted, you may speed up the whole thing by simply creating an index file that only holds the locations of the EOL marks (zero-filled 10 digits) in the data file. This way you can perform the binary search directly on the data file (using the index file to determine the seek positions in the original file), and if lines are appended to the log file you can simply append their EOL positions to the index file.
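A sketch of that index-file layout (the 10-digit zero-padded record format is the assumption from above; widen it if your data file can exceed ~9.9 GB):

using System;
using System.IO;
using System.Text;

static class IndexFile
{
    const int RecordWidth = 10 + 1;                    // "0000000000\n"

    public static void Build(string dataPath, string indexPath)
    {
        using (var data = File.OpenRead(dataPath))
        using (var index = new StreamWriter(indexPath))
        {
            long pos = 0;
            index.Write("{0:D10}\n", pos);             // first line starts at 0
            int b;
            while ((b = data.ReadByte()) != -1)
            {
                pos++;
                if (b == '\n' && pos < data.Length)
                    index.Write("{0:D10}\n", pos);
            }
        }
    }

    // Because records are fixed width, the i-th offset can be read directly,
    // so the binary search can probe the index file itself without loading it.
    public static long ReadOffset(FileStream index, long i)
    {
        var buffer = new byte[10];
        index.Position = i * RecordWidth;
        index.Read(buffer, 0, buffer.Length);
        return long.Parse(Encoding.ASCII.GetString(buffer));
    }
}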
The List<T> object has a BinarySearch method.
http://msdn.microsoft.com/en-us/library/w4e7fxsh%28VS.80%29.aspx
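For reference, a minimal usage example (this presumes the keys, e.g. the timestamp prefix of each line, are already loaded into memory and sorted; the values are made up):

using System;
using System.Collections.Generic;

class BinarySearchExample
{
    static void Main()
    {
        // Hypothetical, already-sorted keys.
        var keys = new List<string> { "2010-01-01 00:00", "2010-01-02 12:30", "2010-01-03 08:15" };

        int i = keys.BinarySearch("2010-01-02 12:30", StringComparer.Ordinal);

        // A negative result is the bitwise complement of the insertion point.
        if (i < 0) i = ~i;

        Console.WriteLine(keys[i]);
    }
}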
I have a well-defined Excel range, let's say "A5:I9" for example. I would like to multiply the complete rows of this range via C#. "Multiply" means copying the range several times below itself, shifting the rest of the document down. Any hint how to do that?
I've been fighting with the Range.Insert and Range.Copy methods for quite some time now, in various combinations, but they never behave like I would expect according to the documentation!?
cheers,
Achim
To shift the rest of the document down, I guess you would need to insert the expected number of rows (five in your example) where you want to paste, before each copy:
// First copy paste with static range values
Range destination = yourWorksheet.get_Range("A10", Type.Missing);
yourWorksheet.get_Range("A5", "I9").Copy(destination);
Then loop on it while keeping track of the last "written" line.
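Putting that together, a sketch of the whole loop might look like this, assuming the Microsoft.Office.Interop.Excel object model (the range addresses and the number of copies are placeholders):

using System;
using Excel = Microsoft.Office.Interop.Excel;

static class RangeMultiplier
{
    // Copies A5:I9 'copies' times directly below itself, shifting
    // everything underneath down each time.
    public static void MultiplyRange(Excel.Worksheet sheet, int copies)
    {
        const int firstRow = 5, lastRow = 9;
        const int rowCount = lastRow - firstRow + 1;
        Excel.Range source = sheet.get_Range("A" + firstRow, "I" + lastRow);

        int nextRow = lastRow + 1;                   // first row below the block
        for (int i = 0; i < copies; i++)
        {
            // Insert empty rows first so existing content is pushed down.
            Excel.Range insertAt = sheet.get_Range(
                nextRow + ":" + (nextRow + rowCount - 1), Type.Missing);
            insertAt.Insert(Excel.XlInsertShiftDirection.xlShiftDown, Type.Missing);

            // Then paste the source block into the freshly inserted rows.
            Excel.Range destination = sheet.get_Range("A" + nextRow, Type.Missing);
            source.Copy(destination);

            nextRow += rowCount;                     // keep track of the last written line
        }
    }
}

Inserting whole rows with xlShiftDown before each Copy is what pushes the rest of the document down; the source range itself is unaffected because the insertion always happens below it.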