I have a DataTable that I have populated via the DataSet.ReadXml() method. I am trying to programmatically determine the maximum field lengths for each column in the DataTable. The MaxLength value for all of my columns is always at the default of -1.
Any thoughts or examples on how to determine the proper max length, maybe based on the actual data in the table? (The DataTable can be in the 25-column by 200,000+ row range.)
C# 2.0
I don't quite understand what your objectives are - what are you trying to find out? The number of bytes a given column will use (e.g. 4 for INT, 8 for BIGINT and so forth), or the actual current maximum length of, say, all strings in column "ColA"?
As for the INT and other numerical and boolean types - those have a fixed system-given length - no problems there.
Unless your XML has a schema (XSD file) which limits the string lengths, the string fields from an XML can be any length, really, so after reading them in, your DataTable can't really know what the defined max length can be.
All you can do is loop over all rows in your DataTable and determine the current length of the strings, and get the maximum of those current lengths, as your frame of reference.
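Something along these lines would do it - a minimal sketch, where table stands in for your DataTable (e.g. ds.Tables[0] after ReadXml):

int[] maxLengths = new int[table.Columns.Count];

foreach (DataRow row in table.Rows)
{
    for (int i = 0; i < table.Columns.Count; i++)
    {
        // Skip nulls; measure everything else via its string representation.
        if (row[i] != DBNull.Value)
        {
            int len = row[i].ToString().Length;
            if (len > maxLengths[i])
                maxLengths[i] = len;
        }
    }
}

That is a single pass over the data, so even at 25 columns by 200,000+ rows it stays manageable.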
Does that help at all?
Marc
I believe it would be dataSet.Tables[0].Columns[0].MaxLength
I have a list of IDs in a matrix 'UserID'. I want to create an xls or csv file with these UserIDs as its row headers. The number of rows is 2,200,000 and the number of columns is 11. The column labels are the years 1996 - 2006. I read this page:
https://www.mathworks.com/matlabcentral/answers/101309-how-do-i-use-xlswrite-to-add-row-and-column-labels-to-my-matlab-matrix-when-i-write-it-to-excel-in-m
but this code gives me an error. It sometimes works when the number of rows is smaller, and sometimes it simply does not respond. Can anyone suggest a program that will do this (with MATLAB or even C# code)?
I wrote this code:
data=zeros(2200000,11);
data_cells=num2cell(data);
col_header={'1996','1997','1998','1999','2000','2001','2002','2003','2004','2005','2006'};
row_header(1:2200000,1)=UserID;
output_matrix=[{' '} col_header; row_header data_cells];
xlswrite('My_file.xls',output_matrix);
and I get this error:
The specified data range is invalid or too large to write to the specified file format. Try writing to an XLSX file and use Excel A1 notation for the range argument, for example, ‘A1:D4’.
When you use xlswrite you are limited to the number of rows that your Excel version permits:
The maximum size of array A depends on the associated Excel version.
In Excel 2013 the maximum is 1,048,576, which is 1,151,424 fewer rows than your 2,200,000-by-11 matrix.
You'd better use csvwrite to export your data, and refer also to the tip in its documentation:
csvwrite does not accept cell arrays for the input matrix M. To export a cell array that contains only numeric data, use cell2mat to convert the cell array to a numeric matrix before calling csvwrite. To export cell arrays with mixed alphabetic and numeric data... you must use low-level export functions to write your data.
EDIT:
In your case, you should at least change these parts of the code:
col_header = 1996:2006;
output_matrix=[0, col_header; row_header data];
and you don't need to define output_matrix as a cell array (and you don't need data_cells). However, you may also have to convert UserID to a numeric variable.
I'm reading some data in from a CSV file into a frame and I want to replace the blanks in a certain column with zeros. However, when I do FillMissing(0), the series returned still shows the values as blanks. I'm guessing it's because Deedle inferred the type of the column to be int and not int? and thus a zero is equivalent to missing.
Is there a way to either use FillMissing to do what I want, or alternatively, override the type inference so it treats this column as an int??
The FillMissing method will fill all missing values in columns that have the same type as the value provided. This is a bit confusing and we're looking for better ideas how to do this!
This means that FillMissing(0) will only fill columns with integers. You can try calling FillMissing(0.0) to handle floating point columns or FillMissing(0.0M) to handle decimals.
Whether or not a value is nullable does not matter - Deedle handles missing values directly, so a column loaded from a CSV will never have the type int?
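For reference, a minimal C# sketch of the calls described above, assuming the Deedle C# API (Frame.ReadCsv, GetColumn, FillMissing) and a hypothetical column name "Col":

var frame = Frame.ReadCsv("data.csv");  // Deedle infers the column types from the data

// Read the column as doubles and fill its missing values with 0.0.
// Use the constant whose type matches the inferred column type.
var filled = frame.GetColumn<double>("Col").FillMissing(0.0);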
I have a very large CSV file (millions of records).
I have developed a smart search algorithm to locate specific line ranges in the file to avoid parsing the whole file.
Now I am facing a trickier issue: I am only interested in the content of a specific column.
Is there a smart way to avoid looping line by line through a 200 MB file and retrieve only the content of a specific column?
I'd use an existing library, as codeulike has suggested; for a very good explanation of why, read this article:
Stop Rolling Your Own CSV Parser!
You mean get every value from every row for a specific column?
You're probably going to have to visit every row to do that.
This C# CSV reading library is very quick, so you might be able to use it:
LumenWorks.Framework.IO.Csv by Sebastien Lorien
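A minimal sketch of how that library is typically used to pull out just one column (the file path and column name are hypothetical; requires the LumenWorks.Framework.IO.Csv and System.IO namespaces):

using (CsvReader csv = new CsvReader(new StreamReader("data.csv"), true)) // true = the file has a header row
{
    while (csv.ReadNextRecord())
    {
        string value = csv["MyColumn"]; // only touch the field you care about
        // ... process value ...
    }
}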
Unless all CSV fields have a fixed width (so that even an empty value still occupies n bytes of blank space between the separators surrounding it), no.
If yes
Then each row, in turn, also has a fixed length and therefore you can skip straight to the first value for that column and, once you've read it, you immediately advance to the next row's value for the same field, without having to read any intermediate values.
I think this is pretty simple - but I'm on a roll at the moment (and at lunch), so I'm going to finish it anyway :)
To do this, we first want to know how long each row is in characters (adjust for bytes according to Unicode, UTF8 etc):
row_len = sum(widths[0..n-1]) + n-1 + row_sep_length
Where n is the total number of columns on each row - this is a constant for the whole file. We add an extra n-1 to it to account for the separators between column values.
And row_sep_length is the length of the separator between two rows - usually a newline, or potentially a [carriage-return & line-feed] pair.
The value for a column row[r]col[i] will be offset characters from the start of row[r], where offset is defined as:
offset = i > 0 ? (sum(widths[0..i-1]) + i) : 0;
//or sum of widths of all columns before col[i]
//plus one character for each separator between adjacent columns
And then, assuming you've read the whole column value, up to the next separator, the offset to the starting character of the next column value row[r+1]col[i] is calculated by subtracting the width of your column from the row length. This is yet another constant for the file:
next-field-offset = row_len - widths[i];
//widths[i] is the width of the field you are actually reading.
Throughout, i is zero-based in this pseudocode, as is the indexing of the vectors/arrays.
To read, then, you first advance the file pointer by offset characters - taking you to the first value you want. You read the value (taking you to the next separator) and then simply advance the file pointer by next-field-offset characters. If you reach EOF at this point, you're done.
I might have missed a character either way in this - so if it's applicable - do check it!
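For what it's worth, here is a minimal C# sketch of the arithmetic above, assuming a single-byte encoding so that characters and bytes coincide (the method name and parameters are illustrative; uses System.IO, System.Text and System.Collections.Generic):

// widths[i] is the fixed character width of column i; rowSepLength is e.g. 2 for "\r\n".
static List<string> ReadFixedWidthColumn(string path, int[] widths, int colIndex, int rowSepLength)
{
    int n = widths.Length;
    int rowLen = 0;
    for (int i = 0; i < n; i++) rowLen += widths[i];
    rowLen += (n - 1) + rowSepLength;                // row_len = sum(widths) + column separators + row separator

    int offset = 0;                                  // offset of col[colIndex] from the start of a row
    for (int i = 0; i < colIndex; i++) offset += widths[i] + 1;

    int nextFieldOffset = rowLen - widths[colIndex]; // from the end of this field to the start of the next row's field

    List<string> values = new List<string>();
    byte[] buffer = new byte[widths[colIndex]];
    using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read))
    {
        fs.Seek(offset, SeekOrigin.Begin);
        while (fs.Read(buffer, 0, buffer.Length) == buffer.Length)
        {
            values.Add(Encoding.ASCII.GetString(buffer));
            fs.Seek(nextFieldOffset, SeekOrigin.Current);
        }
    }
    return values;
}

On the last row the final Seek may run past the end of the file; the next Read then returns less than a full buffer and the loop ends.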
This only works if you can guarantee that all field values - even nulls - for all rows will be the same length, that the column separators are always the same length, and that all row separators are the same length. If not - then this approach won't work.
If not
You'll have to do it the slow way - find the column in each line and do whatever it is you need to do.
If you're doing a significant amount of work on the column value each time, one optimisation would be to pull out all the column values first into a list (created with a known initial capacity), batching at 100,000 at a time or so, and then iterate through those.
If you keep each loop focused on a single task, that should be more efficient than one big loop.
Equally, once you've batched 100,000 column values you could use Parallel LINQ to distribute the second loop (not the first, since there's no point parallelising reading from a file).
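A rough sketch of that batching shape (the columnValues source and the process delegate are hypothetical stand-ins; AsParallel comes from PLINQ in System.Linq, .NET 4+):

static void ProcessColumnInBatches(IEnumerable<string> columnValues, Action<string> process)
{
    List<string> batch = new List<string>(100000);
    foreach (string value in columnValues)      // e.g. the values pulled out by the first loop
    {
        batch.Add(value);
        if (batch.Count == 100000)
        {
            batch.AsParallel().ForAll(process); // distribute the second loop across cores
            batch.Clear();
        }
    }
    if (batch.Count > 0)
        batch.AsParallel().ForAll(process);     // flush the final partial batch
}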
There are only shortcuts if you can pose specific limitations on the data.
For example, you can only read the file line by line if you know that there are no values in the file that contain line breaks. If you don't know this, you have to parse the file record by record as a stream, and each record ends where there is a line break that is not inside a value.
However, unless you know that each line takes up exactly the same number of bytes, there is no other way to read the file than to read it line by line. A line break in a file is just another character (or pair of characters); there is no way to locate a line in a text file other than reading all the lines that come before it.
You can take similar shortcuts when reading a record if you can place limitations on the fields in the records. If, for example, you know that the fields to the left of the one that you are interested in are all numerical, you can use a simpler parsing method to find the start of the field.
What's the best ESE column type to XmlSerialize an object to my ESE DB?
Both "long binary" and "long ASCII text" work OK.
Reason for long binary: absolutely sure there's no character conversion.
Reason for long text: the XML is text.
It seems MSDN says the 2 types only differ when sorting and searching. Obviously I'm not going to create any indices over that column; fields that need to be searchable and/or sortable are stored in separate columns of appropriate types.
Is it safe to assume any UTF8 text, less then 2GB in size, can be saved to and loaded from the ESE "long ASCII text" column value?
Yes you can put up to 2GB of data of UTF8 text into any long text/binary column. The only difference between long binary and long text is the way that the data is normalized when creating an index over the column. Other than that ESE simply stores the provided bytes in the column with no conversion. ESE can only index ASCII or UTF16 data and it is the application's responsibility to make sure the data is in the correct format so it would seem to be more correct to put the data into a long binary column. As you aren't creating an index there won't actually be any difference.
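As an aside, a minimal sketch of producing the UTF-8 XML bytes to store (standard XmlSerializer usage via System.Xml.Serialization, System.IO and System.Text; writing the resulting byte[] to the long binary column is then up to your ESE wrapper, e.g. ManagedEsent's Api.SetColumn accepts a byte[]):

// Serialize an object to UTF-8 XML bytes suitable for a long binary column.
static byte[] SerializeToUtf8Xml<T>(T value)
{
    XmlSerializer serializer = new XmlSerializer(typeof(T));
    using (MemoryStream ms = new MemoryStream())
    {
        using (StreamWriter writer = new StreamWriter(ms, new UTF8Encoding(false))) // no BOM
        {
            serializer.Serialize(writer, value);
        }
        return ms.ToArray(); // ToArray still works after the writer closes the stream
    }
}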
If you are running on Windows 7 or Windows Server 2008 R2 you should investigate column compression. For XML data you might get significant savings simply by turning compression on.
I have a database table with a large number of rows and one numeric column, and I want to represent this data in memory. I could just use one big integer array and this would be very fast, but the number of rows could be too large for this.
Most of the rows (more than 99%) have a value of zero. Is there an effective data structure I could use that would only allocate memory for rows with non-zero values and would be nearly as fast as an array?
Update: as an example, one thing I tried was a Hashtable, reading the original table and adding any non-zero values, keyed by the row number in the original table. I got the value with a function that returned 0 if the requested index wasn't found, or else the value in the Hashtable. This works but is slow as dirt compared to a regular array - I might not be doing it right.
Update 2: here is sample code.
private Hashtable _rowStates;

private void SetRowState(int rowIndex, int state)
{
    if (_rowStates.ContainsKey(rowIndex))
    {
        if (state == 0)
        {
            _rowStates.Remove(rowIndex);
        }
        else
        {
            _rowStates[rowIndex] = state;
        }
    }
    else
    {
        if (state != 0)
        {
            _rowStates.Add(rowIndex, state);
        }
    }
}

private int GetRowState(int rowIndex)
{
    if (_rowStates.ContainsKey(rowIndex))
    {
        return (int)_rowStates[rowIndex];
    }
    else
    {
        return 0;
    }
}
This is an example of a sparse data structure and there are multiple ways to implement such sparse arrays (or matrices) - it all depends on how you intend to use it. Two possible strategies are:
Store only the non-zero values. For each element different from zero, store a pair (index, value); all other values are known to be zero by default. You would also need to store the total number of elements.
Compress consecutive zero values. Store a number of (count, value) pairs. For example if you have 12 zeros in a row followed by 200 and another 22 zeros, then store (12, 0), (1, 200), (22, 0).
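A minimal C# sketch of the second strategy, collapsing runs of equal values into (count, value) pairs (the Compress helper is illustrative):

static List<KeyValuePair<int, int>> Compress(int[] values)
{
    List<KeyValuePair<int, int>> runs = new List<KeyValuePair<int, int>>();
    int i = 0;
    while (i < values.Length)
    {
        int j = i;
        while (j < values.Length && values[j] == values[i])
            j++;
        runs.Add(new KeyValuePair<int, int>(j - i, values[i])); // (count, value)
        i = j;
    }
    return runs;
}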
I would expect that the map/dictionary/hashtable of the non-zero values should be a fast and economical solution.
In Java, using the Hashtable class would introduce locking because it is supposed to be thread-safe. Perhaps something similar has slowed down your implementation.
--- update: using Google-fu suggests that C# Hashtable does incur an overhead for thread safety. Try a Dictionary instead.
How exactly you want to implement it depends on what your requirements are; it's a tradeoff between memory and speed. A pure integer array is the fastest, with constant-time lookups.
Using a hash-based collection such as Hashtable or Dictionary (Hashtable seems to be slower but thread-safe - as others have pointed out) will give you a very low memory usage for a sparse data structure as yours but can be somewhat more expensive when performing lookups. You store a key-value pair for each index and non-zero value.
You can use ContainsKey to find out whether the key exists but it is significantly faster to use TryGetValue to make the check and fetch the data in one go. For dense data it can be worth it to catch exceptions for missing elements as this will only incur a cost in the exceptional case and not each lookup.
Edited again as I got myself confused - that'll teach me to post when I ought to be sleeping.
You're paying a boxing penalty by using Hashtable. Try switching to a Dictionary<int, int>. Also, how many rows are we talking about - and how fast do you need it?
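For illustration, here is the sample Get/Set logic rewritten against a Dictionary<int, int>; TryGetValue does the existence check and the fetch in one call, and no boxing occurs:

private Dictionary<int, int> _rowStates = new Dictionary<int, int>();

private void SetRowState(int rowIndex, int state)
{
    if (state == 0)
        _rowStates.Remove(rowIndex);   // zero is the implicit default, so drop the entry
    else
        _rowStates[rowIndex] = state;  // adds or overwrites in one call
}

private int GetRowState(int rowIndex)
{
    int state;
    return _rowStates.TryGetValue(rowIndex, out state) ? state : 0;
}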
Create an integer array for the non-zero values and a bit array holding indicators of whether a particular row contains a non-zero value.
You can then find the necessary element in the first array by summing up the bits in the second array from 0 up to the row index position.
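A minimal sketch of that idea; a real implementation would precompute running bit counts rather than recounting on every lookup (BitArray lives in System.Collections):

BitArray flags;   // flags[i] is true if row i has a non-zero value
int[] nonZero;    // the non-zero values, in increasing row order

int GetValue(int rowIndex)
{
    if (!flags[rowIndex])
        return 0;

    // Rank query: count the set bits before rowIndex to find the position in nonZero.
    int rank = 0;
    for (int i = 0; i < rowIndex; i++)
        if (flags[i]) rank++;

    return nonZero[rank];
}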
I am not sure about the efficiency of this solution, but you can try it. It depends on the scenario in which you will use it, but I will describe two that I have in mind. The first solution: if you have just one field of integers, you can simply use a generic list of integers:
List<int> myList = new List<int>();
The second one is almost the same, but you create a list of your own type. For example, if you have two fields, count and non-zero value, you can create a class with two properties and then create a list of that class to store the information. You could also try a generic linked list. The code for the second solution could look like this:
public class MyDbFields
{
    public MyDbFields(int count, int nonzero)
    {
        Count = count;
        NonZero = nonzero;
    }

    public int Count { get; set; }
    public int NonZero { get; set; }
}
Then you can create a list like this:
List<MyDbFields> fields_list = new List<MyDbFields>();
and then fill it with data:
fields_list.Add(new MyDbFields(100, 11));
I am not sure if this will fully help you solve your problem, but just my suggestion.
If I understand correctly, you cannot just select non-zero rows, because for each row index (aka PK value) your Data Structure will have to be able to report not only the value, but also whether or not it is there at all. So assuming 0 if you don't find it in your Data Structure might not be a good idea.
Just to make sure - exactly how many rows are we talking about here? Millions? A million integers would take up only 4MB RAM as an array. Not much really. I guess it must be at least 100'000'000 rows.
Basically I would suggest a sorted array of integer-pairs for storing non-zero values. The first element in each pair would be the PK value, and this is what the array would be sorted by. The second element would be the value. You can make a DB select that returns only these non-zero values, of course. Since the array will be sorted, you'll be able to use binary search to find your values.
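In code, the lookup over such a sorted pair of arrays is just a binary search (keys and values here stand for the two arrays, filled from a query that selects only the non-zero rows ordered by PK):

int[] keys;    // PK values that have non-zero entries, sorted ascending
int[] values;  // values[i] is the value stored for keys[i]

int GetValue(int pk)
{
    int pos = Array.BinarySearch(keys, pk);
    return pos >= 0 ? values[pos] : 0;
}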
If there are no "holes" in the PK values, then the only thing you would need besides this would be the minimum and maximum PK values so that you can determine whether a given index belongs to your data set.
If there are unused PK values between the used ones, then you need some other mechanism to determine which PK values are valid. Perhaps a bitmask or another array of valid (or invalid, whichever are fewer) PK values.
If you choose the bitmask way, there is another idea. Use two bits for every PK value. First bit will show if the PK value is valid or not. Second bit will show if it is zero or not. Store all non-zero values in another array. This however will have the drawback that you won't know which array item corresponds to which bitmask entry. You'd have to count all the way from the start to find out. This can be mitigated with some indexes. Say, for every 1000 entries in the value array you store another integer which tells you where this entry is in the bitmask.
Perhaps you are looking in the wrong area - all you are storing for each value is the row number of the database row, which suggests that perhaps you are just using this to retrieve the row?
Why not try indexing your table on the numeric column - this will provide lightning fast access to the table rows for any given numeric value (which appears to be the ultimate objective here?) If it is still too slow you can move the index itself into memory etc.
My point here is that your database may solve this problem more elegantly than you can.