I have multiple files coming from mainframe systems, basically EBCDIC data. Some of these files contain data from multiple modules appended into one single file. For example, let's say I have a file CISA which holds data from multiple sub-modules. All of these modules have a row length of 1000 bytes but different data structures, so to read the file I need to apply a different layout per module. To do that I need to split the parent file into multiple files based on a key value stored at a given location, let's say byte range 20-23.
For the first row the value in byte range 20-23 may be 0001, for the next row 0002, and so on, so I need to split the file into multiple files based on the value in that byte range.
In my current C# implementation, I read the data as a byte stream, one row at a time. I use a DataTable with two columns: the first column stores a file name generated from the value in byte range 20-23, and the second column stores the byte stream I just read.
Once the entire file is read I have a DataTable that gives me a list of file names and the byte streams for those files. I loop through the DataTable and write each row out to the file named in the first column.
This solution works, but performance is really slow because of the high I/O from writing the DataTable row by row. So is there an option that lets me skip writing the data for each row and instead save an entire partition in one shot?
Firstly, I'd completely forget about DataTable here - that seems a terrible idea. How big are the files? If they're small: just load all the data (File.ReadAllBytes) and use an ArraySegment<byte> for each row (maybe a List<ArraySegment<byte>>) - or if you're OK using preview bits: this would be a great use of Span<byte> (similar to ArraySegment<byte>, but more ... just more).
If the file is large, I'd look at MemoryMappedFile here; it seems a great fit.
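To illustrate the load-it-all idea, here's a minimal sketch (not tested against real EBCDIC data). It buffers each partition in a MemoryStream rather than tracking ArraySegments, purely to keep the example short; the file names, and the assumption that "byte range 20-23" means a 4-byte key at zero-based offset 19, are mine:

using System.Collections.Generic;
using System.IO;
using System.Text;

const int RecordLength = 1000;
const int KeyOffset = 19;                        // assumption: "byte range 20-23" is 1-based
const int KeyLength = 4;

byte[] data = File.ReadAllBytes("CISA.dat");     // hypothetical input file name
Encoding ebcdic = Encoding.GetEncoding(37);      // IBM EBCDIC (US/Canada); on .NET Core, register
                                                 // CodePagesEncodingProvider first, and adjust the code page
var partitions = new Dictionary<string, MemoryStream>();

for (int pos = 0; pos + RecordLength <= data.Length; pos += RecordLength)
{
    string key = ebcdic.GetString(data, pos + KeyOffset, KeyLength);
    MemoryStream buffer;
    if (!partitions.TryGetValue(key, out buffer))
        partitions[key] = buffer = new MemoryStream();
    buffer.Write(data, pos, RecordLength);       // keep the raw EBCDIC bytes untouched
}

foreach (var pair in partitions)                 // one write per partition instead of one per row
    File.WriteAllBytes("CISA_" + pair.Key + ".dat", pair.Value.ToArray());

Grouping in memory trades RAM for I/O; if the file is too big for that, the same loop can run over a MemoryMappedViewAccessor instead of a byte[].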
I have a list of IDs in a matrix 'UserID'. I want to create an xls or csv file that uses these UserIDs as its row headers. The number of rows is 2200000 and the number of columns is 11. The column labels are the years 1996 - 2006. I read this page:
https://www.mathworks.com/matlabcentral/answers/101309-how-do-i-use-xlswrite-to-add-row-and-column-labels-to-my-matlab-matrix-when-i-write-it-to-excel-in-m
but this code gives me an error. (With fewer rows it sometimes works and sometimes still doesn't respond.) Can anyone suggest a program that will do this (with MATLAB or even C# code)?
I wrote this code:
data=zeros(2200000,11);
data_cells=num2cell(data);
col_header={'1996','1997','1998','1999','2000','2001','2002','2003','2004','2005','2006'};
row_header(1:2200000,1)=UserID;
output_matrix=[{' '} col_header; row_header data_cells];
xlswrite('My_file.xls',output_matrix);
and I get this error:
The specified data range is invalid or too large to write to the specified file format. Try writing to an XLSX file and use Excel A1 notation for the range argument, for example, ‘A1:D4’.
When you use xlswrite you are limited to the number of rows that your Excel version permits:
The maximum size of array A depends on the associated Excel version.
In Excel 2013 the maximum is 1048576, which is 1151424 fewer rows than your 2200000x11 matrix.
You'd better use csvwrite to export your data; also refer to the tip therein:
csvwrite does not accept cell arrays for the input matrix M. To export a cell array that contains only numeric data, use cell2mat to convert the cell array to a numeric matrix before calling csvwrite. To export cell arrays with mixed alphabetic and numeric data... you must use low-level export functions to write your data.
EDIT:
In your case, you should at least change these parts of the code:
col_header = 1996:2006;
output_matrix=[0, col_header; row_header data];
and you don't need to define output_matrix as a cell array (and you don't need data_cells). However, you may also have to convert UserID to a numeric variable.
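If you go the C# route the question also allows, a rough sketch with StreamWriter would look like this; the userIds and data variables are placeholders, not from the question:

using System.Globalization;
using System.IO;
using System.Linq;

int[] userIds = new int[2200000];             // placeholder: fill from your data source
double[,] data = new double[2200000, 11];     // placeholder: fill from your data source

using (var sw = new StreamWriter("My_file.csv"))
{
    // header row: blank corner cell replaced by a "UserID" label, then the years
    sw.WriteLine("UserID," + string.Join(",", Enumerable.Range(1996, 11)));
    for (int i = 0; i < userIds.Length; i++)
    {
        var cells = new string[data.GetLength(1)];
        for (int j = 0; j < cells.Length; j++)
            cells[j] = data[i, j].ToString(CultureInfo.InvariantCulture);
        sw.WriteLine(userIds[i] + "," + string.Join(",", cells));
    }
}

A CSV sidesteps the xls row limit entirely; just be aware that Excel itself will still only display the first 1048576 rows if you open the result there.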
I need to perform calculations and manipulation on an extremely large table or matrix, which will have roughly 7500 rows and 30000 columns.
The matrix data will look like this:
Document ID | word1 | word2 | word3 | ... | word30000 | Document Class
0032 1 0 0 1 P
In other words, the vast majority of the cells will contain boolean values (0s and 1s).
The calculations that need to be done involve word stemming or feature selection (reducing the number of words by using reduction techniques), as well as calculations per class or per word, etc.
What I have in mind is designing an OOP model for representing the matrix, and then subsequently serializing the objects to disk so I may reuse them later on. For instance, I will have an object for each row or each column, or perhaps an object for each intersection that is contained within another class.
I have thought about representing it in XML, but file sizes may prove problematic.
I may be completely off the mark with my approach here -
Am I on the right path, or would there be a better-performing approach to manipulating such large data collections?
Key issues here will be performance (response time etc.), as well as redundancy and integrity of the data, and obviously I would need to save the data to disk.
You haven't explained the nature of the calculations you're needing to do on the table/matrix, so I'm having to make assumptions there, but if I read your question correctly, this may be a poster-child case for the use of a relational database -- even if you don't have any actual relations in your database. If you can't use a full server, use SQL Server Compact Edition as an embedded database, which would allow you to control the .SDF file programmatically if you chose.
Edit:
After a second consideration, I withdraw my suggestion for a database. This is entirely because of the number of columns in the table: any relational database you use will have hard limits on this, and I don't see a way around it that isn't amazingly complicated.
Based on your edit, I would say that there are three things you are interested in:
A way to analyze the presence of words in documents. This is the bulk of your sample data file, primarily being boolean values indicating the presence or lack of a word in a document.
The words themselves. This is primarily contained in the first row of your sample data file.
A means of identifying documents and their classification. This is the first and last column of your data file.
After thinking about it for a little bit, this is how I would model your data:
With the case of word presence, I feel it's best to avoid a complex object model. You're wanting to do pure calculation in both directions (by column and by row), and the most flexible and potentially performant structure for that in my opinion is a simple two-dimensional array of bool fields, like so:
var wordMatrix = new bool[numDocuments,numWords];
The words themselves should be in an array or list of strings that are index-linked to the second dimension of the word matrix -- the one defined by numWords in the example above. If you ever needed to quickly search for a particular word, you could use a Dictionary<string, int>, with the key as the word and the value as the index, to quickly find the index of a particular word.
The document identification would similarly be in an array or list of ints index-linked to the first column. I'm assuming the document ids are integer values there. The classification would be a similar array or list, although I'd use a list of enums representing each possible value of the classification. As with the word search, if you needed to search for documents by id, you could have a Dictionary<int, int> act as your search index.
I've made several assumptions with this model, particularly that you want to do pure calculation on the word presence in all directions rather than "per document". If I'm wrong, a simpler approach might be to drop the two-dimensional array and model by document, i.e. a single C# Document class with DocumentId and DocumentClassification fields as well as a simple array of booleans that is index-linked to the word list. You could then work with a list of these Document objects along with a separate list of words.
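A bare-bones sketch of that per-document alternative (all names here are mine, and the classification values are guesses based on the 'P' in the sample):

// Hypothetical model for the "per document" alternative described above.
enum DocumentClassification { P, N }      // assumption: single-letter class codes like the sample's 'P'

class Document
{
    public int DocumentId;
    public DocumentClassification Classification;
    public bool[] WordPresence;           // index-linked to a shared List<string> of the 30000 words
}

You'd then hold a List<Document> plus one shared List<string> of words, and build the Dictionary lookups on top of those only if you need fast search.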
Once you have a data model you like, saving it to disk is the easiest part. Just use C# serialization. You can save it via XML or binary, your choice. Binary would give you the smallest file size, naturally (I figure a little more than 200MB plus the size of a list of 30000 words). If you include the Dictionary lookup indexes, perhaps an additional 120kB.
I'm using C# and I write my data into CSV files (for further use). However, my files have grown to a large scale and I have to transpose them. What's the easiest way to do that, in any program?
Gil
In increasing order of complexity (and also increasing order of ability to handle large files):
Read the whole thing into a 2-D array (or a jagged array, aka array-of-arrays); there's a short sketch of this option after the list.
Memory required: equal to size of file
Track the file offset within each row. Start by finding each (non-quoted) newline, storing the current position into a List<Int64>. Then iterate across all rows, for each row: seek to the saved position, copy one cell to the output, save the new position. Repeat until you run out of columns (all rows reach a newline).
Memory required: eight bytes per row
Frequent file seeks scattered across a file much larger than the disk cache result in disk thrashing and miserable performance, but it won't crash.
Like above, but working on blocks of e.g. 8k rows. This will create a set of files each with 8k columns. The input block and output all fit within disk cache, so no thrashing occurs. After building the stripe files, iterate across the stripes, reading one row from each and appending to the output. Repeat for all rows. This results in sequential scan on each file, which also has very reasonable cache behavior.
Memory required: 64k for first pass, (column count/8k) file descriptors for second pass.
Good performance for tables of up to several million in each dimension. For even larger data sets, combine just a few (e.g. 1k) of the stripe files together, making a smaller set of larger stripes, repeat until you have only a single stripe with all data in one file.
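For completeness, here's a minimal sketch of the first (everything-in-memory) option; it assumes a plain CSV with no quoted commas, and the file names are placeholders:

using System.IO;
using System.Linq;

// Load every row, split into cells, then emit column c of every row as output line c.
string[][] rows = File.ReadLines("input.csv")
                      .Select(line => line.Split(','))
                      .ToArray();

using (var writer = new StreamWriter("transposed.csv"))
{
    int columns = rows[0].Length;
    for (int c = 0; c < columns; c++)
        writer.WriteLine(string.Join(",", rows.Select(r => r[c])));
}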
Final comment: You might squeeze out more performance by using C++ (or any language with proper pointer support), memory-mapped files, and pointers instead of file offsets.
It really depends. Are you getting these out of a database? Then you could use a MySQL import statement: http://dev.mysql.com/doc/refman/5.1/en/load-data.html
Or you could loop through the data and add it to a file stream using a StreamWriter object.
using (StreamWriter sw = new StreamWriter("pathtofile"))
{
    foreach (String[] value in lstValueList)
    {
        // join every field of the row (C# arrays are zero-indexed)
        sw.WriteLine(String.Join(",", value));
    }
}
I wrote a little proof-of-concept script here in Python. I admit it's buggy and there are likely some performance improvements to be made, but it will do the job. I ran it against a 40x40 file and got the desired result. I started to run it against something more like your example data set and it took too long for me to wait.
import csv
from glob import glob
from os.path import join
from shutil import rmtree
from tempfile import mkdtemp

path = mkdtemp()
try:
    with open('/home/user/big-csv', 'rb') as instream:
        reader = csv.reader(instream)
        for i, row in enumerate(reader):
            for j, field in enumerate(row):
                # append this cell to the temp file that accumulates output row j
                with open(join(path, 'new row {0:0>2}'.format(j)), 'ab') as new_row_stream:
                    contents = ['{0},'.format(field)]
                    new_row_stream.writelines(contents)
            print 'read row {0:0>2}'.format(i)
    with open('/home/user/transpose-csv', 'wb') as outstream:
        files = glob(join(path, '*'))
        files.sort()
        for filename in files:
            # each temp file is one (comma-terminated) row of the transposed output
            with open(filename, 'rb') as row_file:
                contents = row_file.readlines()
                outstream.writelines(contents + ['\n'])
finally:
    print "done"
    rmtree(path)
Is there a library that I can use to perform a binary search in a very big text file (it can be 10GB)?
The file is a sort of log file - every row starts with a date and time, therefore the rows are ordered.
I started to write the pseudo-code on how to do it, but I gave up since it may seem condescending. You probably know how to write a binary search, it's really not complicated.
You won't find it in a library, for two reasons:
It's not really "binary search" - the line sizes are different, so you need to adapt the algorithm (e.g. look for the middle of the file, then look for the next "newline" and consider that to be the "middle").
Your datetime log format is most likely non-standard (ok, it may look "standard", but think a bit.... you probably use '[]' or something to separate the date from the log message, something like [10/02/2001 10:35:02] My message ).
In summary - I think your need is too specific, and too simple to implement in custom code, for anyone to bother writing a library :)
As the line lengths are not guaranteed to be the same length, you're going to need some form of recognisable line delimiter e.g. carriage return or line feed.
The binary search pattern can then be pretty much your traditional algorithm. Seek to the 'middle' of the file (by length), seek backwards (byte by byte) to the start of the line you happen to land in, as identified by the line delimiter sequence, read that record and make your comparison. Depending on the comparison, seek halfway up or down (in bytes) and repeat.
When you identify the start index of a record, check whether it was the same as the last seek. You may find that, as you dial in on your target record, moving halfway won't get you to a different record. e.g. you have adjacent records of 100 bytes and 50 bytes respectively, so jumping in at 75 bytes always takes you back to the start of the first record. If that happens, read on to the next record before making your comparison.
You should find that you will reach your target pretty quickly.
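A rough sketch of that seek-and-scan loop, assuming '\n' line endings, ASCII text, and a caller-supplied keyOfLine delegate that extracts the comparable timestamp prefix (those are assumptions of the sketch, not details from the question):

using System;
using System.IO;
using System.Text;

// Returns the byte offset of the first line whose key is >= target; lo is always kept at a line start.
static long FindFirstLineAtOrAfter(Stream stream, string target, Func<string, string> keyOfLine)
{
    long lo = 0, hi = stream.Length;
    while (lo < hi)
    {
        long mid = lo + (hi - lo) / 2;

        long lineStart = mid;                    // walk backwards to the start of the line containing mid
        while (lineStart > 0)
        {
            stream.Seek(lineStart - 1, SeekOrigin.Begin);
            if (stream.ReadByte() == '\n') break;
            lineStart--;
        }

        stream.Seek(lineStart, SeekOrigin.Begin);
        var sb = new StringBuilder();            // read the record up to the next newline
        int b;
        while ((b = stream.ReadByte()) != -1 && b != '\n')
            sb.Append((char)b);

        if (string.CompareOrdinal(keyOfLine(sb.ToString()), target) < 0)
            lo = lineStart + sb.Length + 1;      // too early: continue after this line
        else
            hi = lineStart;                      // candidate: keep searching the first half
    }
    return lo;
}

The byte-by-byte backwards walk is the "seek backwards to the start of the line" step described above; buffering it would help, but this keeps the sketch short.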
You would need to be able to stream the file, but you would also need random access. I'm not sure how you accomplish this short of a guarantee that each line of the file contains the same number of bytes. If you had that, you could get a Stream of the object and use the Seek method to move around in the file, and from there you could conduct your binary search by reading in the number of bytes that constitute a line. But again, this is only valid if the lines are the same number of bytes. Otherwise, you would jump in and out of the middle of lines.
Something like
byte[] buffer = new byte[lineLength];
stream.Seek(lineLength * searchPosition, SeekOrigin.Begin);
stream.Read(buffer, 0, lineLength);
string line = Encoding.Default.GetString(buffer);
This shouldn't be too bad under the constraint that you hold an Int64 in memory for every line-feed in the file. That really depends upon how long the lines of text are on average; given 1000 bytes per line and 8 bytes per Int64, you'd be looking at around (10,000,000,000 / 1000 * 8) = 80 MB. Very big, but possible.
So try this:
Scan the file and store the ordinal offset of each line-feed in a List
Binary search the List with a custom comparer that scans to the file offset and reads the data.
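A sketch of those two steps, with the binary search written out by hand rather than via List<T>.BinarySearch; the file path, the target value, and the keyOfLine helper are assumptions of the sketch:

using System;
using System.Collections.Generic;
using System.IO;

string target = "2001-02-10 10:35:02";                       // placeholder search key
Func<string, string> keyOfLine = line => line.Substring(0, target.Length);  // placeholder key extractor

var offsets = new List<long> { 0 };                          // pass 1: start offset of every line, 8 bytes each
using (var fs = File.OpenRead("huge.log"))
using (var reader = new StreamReader(fs))
{
    int b;
    while ((b = fs.ReadByte()) != -1)
        if (b == '\n' && fs.Position < fs.Length)
            offsets.Add(fs.Position);

    Func<long, string> readLineAt = pos =>                   // dereference an offset and read that line
    {
        fs.Seek(pos, SeekOrigin.Begin);
        reader.DiscardBufferedData();
        return reader.ReadLine();
    };

    // pass 2: ordinary binary search over the offsets, one line read per probe
    int lo = 0, hi = offsets.Count;
    while (lo < hi)
    {
        int mid = (lo + hi) / 2;
        if (string.CompareOrdinal(keyOfLine(readLineAt(offsets[mid])), target) < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    // if lo == offsets.Count every line is earlier than target;
    // otherwise offsets[lo] is the first line whose key is not less than target
}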
If your file is static (or changes rarely) and you have to run "enough" queries against it, I believe the best approach will be creating an "index" file:
Scan the initial file and take the datetime part of each line plus its position in the original file (this is why the file has to be pretty static). Encode them somehow, for example: Unix time (full 10 digits) + nanoseconds (zero-filled 4 digits) + line position (zero-filled 10 digits). This way you will have an index file with consistent fixed-width "lines".
Perform a binary search on that file (you may need to be a bit creative in order to achieve a range search) and get the relevant location(s) in the original file.
Read directly from the original file starting from the given location / read the given range.
You've got range search with O(log(n)) run-time :) (and you've created primitive DB functionality)
Needless to say, if the data file is updated "too" frequently or you don't run "enough" queries against the index file, you may end up spending more time creating the index file than you save on the queries.
Btw, working with this index file doesn't require the data file to be sorted. As log files tend to be append-only, and already sorted, you may speed up the whole thing by simply creating an index file that only holds the locations of the EOL marks (zero-filled 10 digits) in the data file - this way you can perform the binary search directly on the data file (using the index file to determine the seek positions in the original file), and if lines are appended to the log file you can simply append their EOL positions to the index file.
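A tiny sketch of building that EOL-index file (file names are placeholders); because every index record has the same width, you can seek to record k of the index with plain arithmetic and binary search by record number:

using System.IO;

using (var log = File.OpenRead("data.log"))
using (var index = new StreamWriter("data.log.idx"))
{
    long pos = 0;
    int b;
    while ((b = log.ReadByte()) != -1)
    {
        if (b == '\n')
            index.WriteLine(pos.ToString("D10"));   // zero-filled 10 digits => fixed-width, seekable records
        pos++;
    }
}

Appending new lines to the log just means appending their EOL offsets to the index in the same format.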
The List<T> object has a BinarySearch method.
http://msdn.microsoft.com/en-us/library/w4e7fxsh%28VS.80%29.aspx
What's the best ESE column type to XmlSerialize an object to my ESE DB?
Both "long binary" and "long ASCII text" work OK.
Reason for long binary: absolutely sure there's no character conversion.
Reason for long text: the XML is text.
It seems MSDN says the 2 types only differ when sorting and searching. Obviously I'm not going to create any indices over that column; fields that need to be searchable and/or sortable are stored in separate columns of appropriate types.
Is it safe to assume any UTF8 text, less then 2GB in size, can be saved to and loaded from the ESE "long ASCII text" column value?
Yes you can put up to 2GB of data of UTF8 text into any long text/binary column. The only difference between long binary and long text is the way that the data is normalized when creating an index over the column. Other than that ESE simply stores the provided bytes in the column with no conversion. ESE can only index ASCII or UTF16 data and it is the application's responsibility to make sure the data is in the correct format so it would seem to be more correct to put the data into a long binary column. As you aren't creating an index there won't actually be any difference.
If you are running on Windows 7 or Windows Server 2008 R2 you should investigate column compression. For XML data you might get significant savings simply by turning compression on.
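For reference, a hedged sketch of adding such a column with ManagedEsent (Microsoft.Isam.Esent.Interop). The table and column names are made up, an open session (sesid) and table (tableid) are assumed, and you should verify the exact compression grbit name against the ManagedEsent documentation for your Windows version:

using Microsoft.Isam.Esent.Interop;
using Microsoft.Isam.Esent.Interop.Windows7;

// "XmlPayload" is a made-up column name; sesid and tableid come from an already-open session/table.
var columndef = new JET_COLUMNDEF
{
    coltyp = JET_coltyp.LongBinary,              // raw bytes, no normalization or conversion
    grbit = Windows7Grbits.ColumnCompressed      // assumption: Windows 7 / Server 2008 R2+ compression flag
};

JET_COLUMNID columnid;
Api.JetAddColumn(sesid, tableid, "XmlPayload", columndef, null, 0, out columnid);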