Duplicate range in Excel via C#

I have a well-defined Excel range, let's say "A5:I9" for example. I would like to multiply the complete rows of this range via C#. "Multiply" means copying the range several times below itself, shifting the rest of the document down. Any hint on how to do that?
I've been fighting with the Range.Insert and Range.Copy methods for quite some time now, in various combinations, but they never behave the way I would expect according to the documentation!?
cheers,
Achim

To shift the rest of the document down, I guess you would need to insert the expected number of rows (five in your example) where you want to paste, before each copy:
// First copy/paste with static range values
Range destination = yourWorksheet.get_Range("A10", Type.Missing);
yourWorksheet.get_Range("A5", "I9").Copy(destination);
Then loop on it while keeping track of the last "written" line.
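A minimal sketch of that loop, reusing the yourWorksheet name from the snippet above (copies is a hypothetical count of duplicates, and XlInsertShiftDirection comes from Microsoft.Office.Interop.Excel):
// Sketch only: duplicate A5:I9 `copies` times directly below itself,
// pushing everything underneath down each time.
Range source = yourWorksheet.get_Range("A5", "I9");
int nextRow = 10; // first row below the source block
for (int i = 0; i < copies; i++)
{
    // Insert five blank rows first so the rest of the sheet shifts down
    // instead of being overwritten.
    Range target = yourWorksheet.get_Range("A" + nextRow, "I" + (nextRow + 4));
    target.EntireRow.Insert(XlInsertShiftDirection.xlShiftDown, Type.Missing);

    // Paste the source block into the freshly inserted rows.
    source.Copy(yourWorksheet.get_Range("A" + nextRow, Type.Missing));

    nextRow += 5; // remember the last "written" line
}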

Related

How to programmatically recreate Microsoft Excel's cell dimensions autofit() function for merged cells?

Currently I am creating a script capable of transferring large documents from Word into Excel using the Microsoft.Office.Interop namespace. I've managed to put enough code together to get everything operating correctly, but I ran into a small issue: I've learned that Microsoft Excel doesn't offer an autofit function for merged cells. As my program relies on merging certain rows to allow for viewable data tables, I'd need to keep my cells merged unless there is some way to split cells, which I've never heard of.
Currently I've been attempting to recreate something close to the AutoFit() function, but without knowing Microsoft's process I've been unable to reproduce it. So far I've been doing a word count and adjusting the row height depending on the number of words transferred from a given paragraph; I've been considering changing this to counting each individual character, but that would multiply my processing time tenfold.
Current AutoFit() recreation (where count is the number of words in the current paragraph):
wks.Cells[r, c].RowHeight = (count / 15) * 16;
Much slower AutoFit() recreation concept (where count is the number of characters and x is the average number of characters per line):
wks.Cells[r, c].RowHeight = (count / x) * 16;
I can attach more code if it would make any responses more accurate, but I expect it would just be a distraction.
I guess my question is: is there a way to automatically adjust row height in a fashion similar to Microsoft Excel's AutoFit function? If anyone has any recommendations or insight into Microsoft's process for autofitting cells, I'd love to receive any feedback.
Update 2020-11-04: I still haven't been able to find any explanation of the function, nor a method to check line length in pixels that could then be applied to row height. I'm still sure there has to be a way, though, as Excel itself does it programmatically.
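One hedged direction for the pixel-measurement part, not from this thread: measure the wrapped text with System.Windows.Forms.TextRenderer and convert the pixel height back to points for RowHeight. The font name, size, and DPI below are assumptions:
// Sketch only: estimate the row height a merged cell needs for wrapped text.
// Assumes a WinForms reference and that the cell uses 11pt Calibri at 96 DPI.
double AutofitMergedRowHeight(string text, double mergedWidthPoints)
{
    const double dpi = 96.0;                              // assumed screen DPI
    int widthPx = (int)(mergedWidthPoints * dpi / 72.0);  // points -> pixels
    using (var font = new System.Drawing.Font("Calibri", 11))
    {
        var proposed = new System.Drawing.Size(widthPx, int.MaxValue);
        var needed = System.Windows.Forms.TextRenderer.MeasureText(
            text, font, proposed, System.Windows.Forms.TextFormatFlags.WordBreak);
        return needed.Height * 72.0 / dpi;                // pixels -> points
    }
}
// Usage idea: wks.Cells[r, c].RowHeight = AutofitMergedRowHeight(text, range.Width);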

Compare/match two columns using approximate string matching (fuzzy string matching, Levenshtein)

First off, let me explain what I'm trying to achieve. The application that I'm making should have the ability to compare two columns of two different tables with each other. Every cell of the column from the first table should be linked to the best-matching cell from the column of the second table, so you would get something like this:
(source: modelbouwforum.nl)
This can easily be achieved by using Levenshtein's algorithm. So I wrote a test program in C# to see if I could recreate the same results the image shows. I made two arrays, one containing the first column of the image and one containing the second column of the image. Every cell of the first column is compared to every cell of the second column, which means I get 4 iterations per cell (16 in total). The highest match (the one with the lowest Levenshtein distance) in the second column is then linked to the cell of the first column.
The problem:
Let's say we have two large columns with 100K rows each; this will cause some serious performance issues, because every cell from the first column needs to be matched against every cell of the second column to find the best possible match. That means 100K * 100K = 10 billion iterations, so I have to come up with something to avoid iterating 10 billion times.
I did some research on where Levenshtein can be used and came across this: http://www.slideshare.net/fullscreen/VasileTopac/fuzzy-hash-map/4. I'm wondering if I can create something like what the author did in that link.
Some things to consider:
In such large columns there could be multiple matches for a single cell (the user needs to choose the right one). That means you can't exclude previously matched cells from the current search in order to bring down the number of iterations.
In the example, the matching/comparison is only done on two columns; however, in the future I'd like to compare a single column from table 1 to all the columns from table 2 (less work for the user). As you can imagine, this will be even more expensive.
NOTE:
I've only been using C# for 4 months, so I hope someone can give me a good starting point (I'd prefer not to get a fully working answer; I'd rather do some research myself so I can learn from it as well). Thanks for understanding. English is not my native language, so please feel free to edit my post.
Try to come up with some assumption that always holds true for the matching and that can segment it into smaller chunks, like:
The first capital alpha character in table 1 must match the first capital alpha character in table 2
You may be able to find some valid assumption that will allow you to pre-process the values into another column:
FirstAlpha1  FirstAlpha2
===========  ===========
P            C
S            F
C            P
F            S
Then you could do a simple sort and join (exact match) on this extra value to divide the solution into smaller chunks.
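A minimal sketch of that divide-and-conquer idea, assuming column1 and column2 are plain string arrays (both hypothetical names; needs System.Linq) and the classic dynamic-programming Levenshtein:
// Sketch only: bucket the second column by first letter, then run
// Levenshtein only within the matching bucket instead of over everything.
static int Levenshtein(string a, string b)
{
    var d = new int[a.Length + 1, b.Length + 1];
    for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
    for (int j = 0; j <= b.Length; j++) d[0, j] = j;
    for (int i = 1; i <= a.Length; i++)
        for (int j = 1; j <= b.Length; j++)
            d[i, j] = Math.Min(
                Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                d[i - 1, j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1));
    return d[a.Length, b.Length];
}

var buckets = column2.ToLookup(s => char.ToUpperInvariant(s.FirstOrDefault(char.IsLetter)));
foreach (string cell in column1)
{
    char key = char.ToUpperInvariant(cell.FirstOrDefault(char.IsLetter));
    string best = buckets[key]
        .OrderBy(c => Levenshtein(cell, c))
        .FirstOrDefault(); // null when the bucket is empty
    // link `cell` to `best` here
}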

Replace values located close to themselves with the mean value

I'm not sure if SO is the proper place to ask this; if it's not, I will remove the question and try it somewhere else. That said, I'm trying to do the following:
I have a List<double> and want to replace each block of values that are situated very close together (within 0.75 in this example) with a single value representing the mean of the replaced values.
The values that are isolated, or alone, should not be modified.
Also, a replaced block can't be longer than 5 values.
Computing the mean value for each fixed interval (0, 5, 10, ...) would not produce the expected results.
LINQ's power has pleasantly surprised me many times, and I would be happy if someone could guide me in creating this little method.
What I've thought of is to first find the closest value for each one and calculate the distance; if the distance is less than the minimum (0.75 in this example), assign those values to the same block.
When all values are assigned to their blocks, run a second loop that replaces each block (whether it holds one value or many) with its mean.
The problem I have with this approach is assigning the "block": if several values are together, I need to check whether the value being evaluated is already contained in another block, and if so, the new value should go into that block too.
I don't know if this is the right way of doing this or whether I'm overcomplicating it.
EDIT: the expected result:
Although you see two axes, only one is used; the List is 1D, so I should have drawn only the X axis.
The length of the drawn lines is irrelevant; it's just to mark where each value sits on the axis.
It turns out that MSDN has already done this, and provided an in-depth example application with code:
Data Clustering - Detecting Abnormal Data Using k-Means Clustering
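For the simpler sequential block-and-mean approach described in the question, a minimal sketch might look like this (sorting first, and the parameter defaults, are assumptions; needs System.Linq):
// Sketch only: walk the sorted values, start a new block whenever the gap to
// the previous value exceeds the threshold or the block is already full,
// then emit each block's mean (lone values average to themselves).
static List<double> CollapseCloseValues(
    List<double> values, double threshold = 0.75, int maxBlock = 5)
{
    var result = new List<double>();
    var block = new List<double>();
    foreach (double v in values.OrderBy(x => x))
    {
        if (block.Count > 0 &&
            (v - block[block.Count - 1] > threshold || block.Count == maxBlock))
        {
            result.Add(block.Average());
            block.Clear();
        }
        block.Add(v);
    }
    if (block.Count > 0)
        result.Add(block.Average());
    return result;
}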

In C#, how do I use the Excel Interop to speed up writing several cell values

I have a piece of hardware from which I am getting 30 data points. Each of these points is recorded in a spreadsheet in several different places before the sheet is made visible; then another program takes over the Excel spreadsheet. All of these values must be written to the spreadsheet before the other program takes over. If I write each cell individually, the writes take approximately 50 ms each, which comes to about 1.25 seconds to complete the data acquisition.
If I could write all the values to the spreadsheet at one time, I feel this would significantly speed up writing these cells. The problem I see is that Ranges work very well for updating contiguous cells, whereas my data isn't contiguous. Essentially, this is an example of what I want to write:
A1 = 1
B23 = a
F8 = 2012/12/25
D53 = 4.1235
B2 = 5
I have tried creating a range of "A1,B23,F8,D53,B2" and then setting the values using an array. I tried 3 different arrays: object[5], object[1,5], and object[5,1]. In all cases, these set every specified cell in the range to the first element of the array.
Is there a way to update these 30 cells' data without iterating through the cells one at a time?
Thanks,
Tom
If your architecture permits, another idea is to use a hidden sheet with a contiguous rectangular range, assign names to its parts, and use those names on all the other sheets.
I would define a rectangular range object that includes all the cells whose values you want to modify. Get a rectangular object[,] array from that range's Value property, write the new values into the array, and then set the range's Value using the modified array.
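Here's a minimal sketch of that idea with the sample cells from the question; wks is an assumed Worksheet reference, and A1:F53 is the bounding box of the five targets:
// Sketch only: one read, in-memory edits, one write, instead of five
// separate COM round-trips. Value2 arrays are 1-based in both dimensions.
Range box = wks.get_Range("A1", "F53");
object[,] values = (object[,])box.Value2;

values[1, 1]  = 1;                                     // A1
values[23, 2] = "a";                                   // B23
values[8, 6]  = new DateTime(2012, 12, 25).ToOADate(); // F8 (Value2 stores dates as OLE doubles)
values[53, 4] = 4.1235;                                // D53
values[2, 2]  = 5;                                     // B2

box.Value2 = values; // single bulk write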
You could write the values to contiguous cells that are somewhere out of the way, say in column X, and have formulae in the target cells that refer to these updated cells. So cell A1 would have the formula "=X1", cell B23 "=X2" and so on.
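And a quick sketch of this formula-indirection variant, reusing the assumed wks reference; the scattered targets are presumed to already hold their "=X1"-style formulas:
// Sketch only: one contiguous write to column X updates all the scattered
// target cells through their "=X1", "=X2", ... formulas.
Range staging = wks.get_Range("X1", "X5");
staging.Value2 = new object[,] { { 1 }, { "a" }, { "2012/12/25" }, { 4.1235 }, { 5 } };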

C#: Fastest way for specific columns in CSV files

I have a very large CSV file (millions of records).
I have developed a smart search algorithm to locate specific line ranges in the file and avoid parsing the whole file.
Now I am facing a trickier issue: I am only interested in the content of a specific column.
Is there a smart way to avoid looping line by line through a 200 MB file, retrieving only the content of a specific column?
I'd use an existing library, as codeulike has suggested, and for a very good reason why, read this article:
Stop Rolling Your Own CSV Parser!
You mean get every value from every row for a specific column?
You're probably going to have to visit every row to do that.
This C# CSV reading library is very quick, so you might be able to use it:
LumenWorks.Framework.IO.Csv by Sebastien Lorien
Unless all CSV fields have a fixed width (so that even an empty field still leaves n bytes of blank space between the separators surrounding it), no.
If yes
Then each row, in turn, also has a fixed length, and therefore you can skip straight to the first value for that column; once you've read it, you immediately advance to the next row's value for the same field, without having to read any intermediate values.
I think this is pretty simple - but I'm on a roll at the moment (and at lunch), so I'm going to finish it anyway :)
To do this, we first want to know how long each row is in characters (adjust for bytes according to the encoding: Unicode, UTF-8, etc.):
row_len = sum(widths[0..n-1]) + n-1 + row_sep_length
Where n is the total number of columns on each row - this is a constant for the whole file. We add an extra n-1 to it to account for the separators between column values.
And row_sep_length is the length of the separator between two rows - usually a newline, or potentially a [carriage-return & line-feed] pair.
The value for column row[r]col[i] will be offset characters from the start of row[r], where offset is defined as:
offset = i > 0 ? sum(widths[0..i-1]) + i : 0;
// i.e. the sum of the widths of all columns before col[i],
// plus one character for each separator between adjacent columns
And then, assuming you've read the whole column value (up to the next separator), the offset to the starting character of the next row's value row[r+1]col[i] is calculated by subtracting the width of your column from the row length. This is yet another constant for the file:
next-field-offset = row_len - widths[i];
//widths[i] is the width of the field you are actually reading.
Throughout, i is zero-based in this pseudocode, as is the indexing of the vectors/arrays.
To read, then, you first advance the file pointer by offset characters - taking you to the first value you want. You read the value (taking you to the next separator) and then simply advance the file pointer by next-field-offset characters. If you reach EOF at this point, you're done.
I might be off by a character either way in this, so if it's applicable, do check it!
This only works if you can guarantee that all field values, even nulls, are the same length for all rows, that the column separators are always the same length, and that all row separators are the same length. If not, this approach won't work.
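Here's a sketch of a seek-based reader built from the formulas above, assuming a single-byte encoding and one-character column separators (path, widths, and col are hypothetical parameters):
// Sketch only: read every value of the zero-based column `col` from a
// fixed-width CSV by seeking past everything else.
// Requires System.Collections.Generic, System.IO, System.Linq, System.Text.
static IEnumerable<string> ReadFixedWidthColumn(
    string path, int[] widths, int col, int rowSepLength)
{
    int n = widths.Length;
    long rowLen = widths.Sum() + (n - 1) + rowSepLength; // row_len from above
    long offset = widths.Take(col).Sum() + col;          // offset of col in row 0
    long skip   = rowLen - widths[col];                  // next-field-offset

    using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
    {
        fs.Seek(offset, SeekOrigin.Begin);
        var buffer = new byte[widths[col]];
        while (fs.Read(buffer, 0, buffer.Length) == buffer.Length)
        {
            yield return Encoding.ASCII.GetString(buffer);
            fs.Seek(skip, SeekOrigin.Current); // jump to the same field, next row
        }
    }
}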
If not
You'll have to do it the slow way - find the column in each line and do whatever it is you need to do.
If you're doing a significant amount of work on each column value, one optimisation is to pull all the column values out into a list first (created with a known initial capacity, batching at 100,000 values at a time or so), then iterate through those.
If you keep each loop focused on a single task, that should be more efficient than one big loop.
Equally, once you've batched 100,000 column values you could use Parallel LINQ to distribute the second loop (not the first, since there's no point parallelising reading from a file).
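A rough sketch of that batch-then-parallelise pattern; columnIndex and ProcessValue are placeholders, and the naive Split assumes no quoted commas in the data:
// Sketch only: batch 100,000 column values, then let PLINQ spread the
// per-value work across cores; reading the file itself stays sequential.
const int BatchSize = 100_000;
const int columnIndex = 2;                     // hypothetical target column
static int ProcessValue(string s) => s.Length; // placeholder for the real work

var batch = new List<string>(BatchSize);
foreach (string line in File.ReadLines("data.csv")) // assumes no embedded newlines
{
    batch.Add(line.Split(',')[columnIndex]);   // naive split; quoted commas need a real parser
    if (batch.Count == BatchSize)
    {
        var results = batch.AsParallel().Select(ProcessValue).ToList();
        // ...consume results, then reuse the batch...
        batch.Clear();
    }
}
// (handle any final partial batch the same way)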
There are only shortcuts if you can impose specific limitations on the data.
For example, you can only read the file line by line if you know that there are no values in the file that contain line breaks. If you don't know this, you have to parse the file record by record as a stream, where each record ends at a line break that is not inside a value.
However, unless you know that each line takes up exactly the same number of bytes, there is no way to read the file other than line by line. A line break in a file is just another character (or pair of characters); there is no way to locate a line in a text file other than reading all the lines that come before it.
You can take similar shortcuts when reading a record if you can impose limitations on the fields in the records. If, for example, you know that the fields to the left of the one you are interested in are all numerical, you can use a simpler parsing method to find the start of the field.
