MongoDB GridFS bucket? - C#

I work with the MongoDB C# Samus driver.
One of the constructors of the class MongoDB.GridFS.GridFile has a "bucket" parameter. When I create a GridFile as in the example, I cannot set this "bucket", but in Java I can set it when creating a GridFS object (see the Java documentation). I'm confused!
My question:
What is "bucket"? For what? Tell please some use cases;)

The bucket is the base name for the files and chunks collections. By default the bucket is 'fs', so you will have two collections:
fs.files stores file properties such as id, name, size, chunk size, MD5 checksum, etc.
fs.chunks stores the actual binary data split into chunks, one per document.
Using the GridFS class constructor argument you can set an arbitrary bucket name.
Different buckets can be useful if you need separate collections for different types of files, so you can apply different indexes, sharding, etc.
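For illustration, here is a minimal sketch of setting a custom bucket name with the official MongoDB .NET driver (MongoDB.Driver / MongoDB.Driver.GridFS), not the Samus driver from the question; the Samus GridFile constructor takes the bucket name in a similar spirit, but its exact signature differs, and the "photos" name below is an arbitrary example:
// Sketch using the official MongoDB .NET driver; bucket name is arbitrary.
var client = new MongoClient("mongodb://localhost:27017");
var database = client.GetDatabase("test");
// Default bucket: collections fs.files and fs.chunks
var defaultBucket = new GridFSBucket(database);
// Custom bucket: collections photos.files and photos.chunks
var photoBucket = new GridFSBucket(database, new GridFSBucketOptions { BucketName = "photos" });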

Related

How to store protocol / specification structures

Implementing an ISO creator (mostly as an intellectual exercise), I need to store CD data structure values.
For example, from page 47 of the specification https://www.cs.cmu.edu/~varun/cs315p/iso9660.pdf, in order to store a directory in ISO9660 format I need to store information about various byte fields.
There are other tables, such as path tables, that also store fields, and the total number of fields would be in the hundreds.
So I could either:
1 - Store these in .cs files, essentially 'hard coding' them:
public byte Length = 1;
public byte ExtendedAttributeLength = 0;
// and so on
2 - Store these as constants in .cs files
3 - Store these in XML files
4 - Even store these in a database table
Considering that it is unlikely, but not impossible, for these values to change, I'm not sure which way I should store the values.
Thank you
I think you can define all those specs as structs with explicit layout. This way you can specify the offset for each field:
[StructLayout(LayoutKind.Explicit, Size = 16, CharSet = CharSet.Ansi)]
public struct DirectoryRecord
{
    [FieldOffset(0)]
    public byte RecordLength;
    [FieldOffset(1)]
    public byte ExtendedAttrRecordLength;
    // ... remaining fields at their respective offsets
}
You can then serialize these to byte arrays and presumably save them as parts of the ISO image header or whatever their place is in there.
However, if you look at one of the existing C# ISO9660 implementations, such as .NET DiscUtils (on CodePlex), you will see that they do things a bit differently.
For DirectoryRecord they have defined a class with ReadFrom and WriteTo methods, which takes care of reading from the appropriate offsets in an input stream. This is one option. On top of that they have some other component that reads the file and delegates reading to sub-components such as this one.
So, you could do it like them. Or you could do it with structs as I mentioned before and have them behave as plain POCOs only, with no extra reading and writing logic; you'll have to put that somewhere else.
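For the struct approach, a rough sketch of turning such a struct into bytes with System.Runtime.InteropServices.Marshal might look like this (DirectoryRecord is the struct sketched above; error handling is omitted):
// Minimal sketch: copy an explicitly laid out struct into a byte array.
static byte[] ToBytes(DirectoryRecord record)
{
    int size = Marshal.SizeOf(typeof(DirectoryRecord));
    var buffer = new byte[size];
    IntPtr ptr = Marshal.AllocHGlobal(size);
    try
    {
        Marshal.StructureToPtr(record, ptr, false);   // copy struct to unmanaged memory
        Marshal.Copy(ptr, buffer, 0, size);           // copy unmanaged memory to byte[]
    }
    finally
    {
        Marshal.FreeHGlobal(ptr);
    }
    return buffer;
}
You could then write the resulting bytes at whatever offset the ISO image layout requires.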

Writing disparate objects to a single file

This is an offshoot of my question found at Pulling objects from binary file and putting in List<T> and the question posed by David Torrey at Serializing and Deserializing Multiple Objects.
I don't know if this is even possible or not. Say that I have an application that uses five different classes to contain the necessary information to do whatever it is that the application does. The classes do not refer to each other and have various numbers of internal variables and the like. I would want to save the collection of objects spawned from these classes into a single save file and then be able to read them back out. Since the objects are created from different classes they can't go into a list to be sent to disk. I originally thought that using something like sizeof(this) to record each object's size in a table saved at the beginning of the file, together with a common GetObjectType() that returns an actionable value indicating the kind of object, would have worked, but apparently sizeof doesn't work the way I thought it would, so now I'm back at square one.
Several options: wrap all of your objects in a larger object and serialize that. The problem is that you can't deserialize just one object; you have to load all of them. If you have 10 objects, each of which is several megabytes, that isn't a good idea. You want random access to any of the objects, but you don't know the offsets in the file. The "wrapper" object can be something as simple as a List&lt;object&gt;, but I'd use my own object for better type safety.
Second option: use a fixed-size file header/footer. Serialize each object to a MemoryStream, then dump the memory streams for the individual objects into the file, remembering the number of bytes in each. Finally, add a block at the start or end of the file that records where the individual objects begin. In the example below, the header of the file contains first the number of objects in the file (as an integer), then one integer per object giving its size.
// Sketch: serialize two objects into the same file.
// Serialize() is a placeholder for whatever serialization mechanism you use.
byte[] bytes1 = object1.Serialize();
byte[] bytes2 = object2.Serialize();
using (var file = new BinaryWriter(File.Create("objects.bin")))
{
    // Write header:
    file.Write(2);              // number of objects
    file.Write(bytes1.Length);  // size of first object
    file.Write(bytes2.Length);  // size of second object
    // Write data:
    file.Write(bytes1);
    file.Write(bytes2);
}
The offset of object N is the size of the header PLUS the sum of all sizes up to N. So in this example, to read the file, you read the header like so (again a sketch):
int numObjs = file.ReadInt32();
var sizes = new int[numObjs];
for (int i = 0; i < numObjs; i++)
    sizes[i] = file.ReadInt32();
After which you can compute and seek to the correct location for object N.
Third option: Use a format that does option #2 for you, for example zip. In .NET you can use System.IO.Packaging to create a structured zip (OPC format), or you can use a third party zip library if you want to roll your own zip format.
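As a rough sketch of the third option, System.IO.Packaging lets you store each serialized object as a named part inside an OPC (zip) container; the file name, part name, and content type below are arbitrary choices for illustration:
// Sketch: store each object's bytes as a separate part in an OPC (zip) package.
using (Package package = Package.Open("objects.pack", FileMode.Create))
{
    var part = package.CreatePart(new Uri("/object1.bin", UriKind.Relative),
                                  "application/octet-stream");
    using (Stream stream = part.GetStream())
    {
        stream.Write(bytes1, 0, bytes1.Length);
    }
    // Repeat for the other objects, each under its own part name.
}
Reading a single object back is then a matter of opening the package and calling GetPart with the corresponding URI, without touching the other parts.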

Representing a giant matrix/table

I need to perform calculations and manipulation on an extremely large table or matrix, which will have roughly 7500 rows and 30000 columns.
The matrix data will look like this:
Document ID | word1 | word2 | word3 | ... | word30000 | Document Class
0032        |   1   |   0   |   0   | ... |     1     |       P
In other words, the vast majority of the cells will contain boolean values (0s and 1s).
The calculations that need to be done would involve word stemming or feature selection (reducing the number of words using reduction techniques), as well as calculations per class or per word, etc.
What I have in mind is designing an OOP model for representing the matrix, and then subsequently serializing the objects to disk so I may reuse them later on. For instance I will have an object for each row or each column, or perhaps an object for each intersection that is contained within another class.
I have thought about representing it in XML, but file sizes may prove problematic.
I may be missing the mark entirely with my approach here -
Am I on the right path, or would there be better-performing approaches to manipulating such large data collections?
Key issues here will be performance (reaction time etc.), as well as redundancy and integrity of the data, and obviously I would need to save the data to disk.
You haven't explained the nature of the calculations you're needing to do on the table/matrix, so I'm having to make assumptions there, but if I read your question correctly, this may be a poster-child case for the use of a relational database -- even if you don't have any actual relations in your database. If you can't use a full server, use SQL Server Compact Edition as an embedded database, which would allow you to control the .SDF file programmatically if you chose.
Edit:
On second thought, I withdraw my suggestion of a database. This is entirely because of the number of columns in the table: any relational database you use will have hard limits on this, and I don't see a way around it that isn't amazingly complicated.
Based on your edit, I would say that there are three things you are interested in:
A way to analyze the presence of words in documents. This is the bulk of your sample data file, primarily being boolean values indicating the presence or lack of a word in a document.
The words themselves. This is primarily contained in the first row of your sample data file.
A means of identifying documents and their classification. This is the first and last column of your data file.
After thinking about it for a little bit, this is how I would model your data:
With the case of word presence, I feel it's best to avoid a complex object model. You're wanting to do pure calculation in both directions (by column and by row), and the most flexible and potentially performant structure for that in my opinion is a simple two-dimensional array of bool fields, like so:
var wordMatrix = new bool[numDocuments, numWords];
The words themselves should be in an array or list of strings that are index-linked to the second dimension of the word matrix -- the one defined by numWords in the example above. If you ever needed to quickly search for a particular word, you could use a Dictionary&lt;string, int&gt;, with the word as the key and its index as the value, to quickly find the index of a particular word.
The document identification would similarly be in an array or list of ints index-linked to the first dimension. I'm assuming the document IDs are integer values there. The classification would be a similar array or list, although I'd use a list of enums representing each possible value of the classification. As with the word search, if you needed to search for documents by ID, you could have a Dictionary&lt;int, int&gt; act as your search index.
I've made several assumptions with this model, particularly that you want to do pure calculation on the word presence in all directions rather than "per document". If I'm wrong, a simpler approach might be to drop the two-dimensional array and model by document, i.e. a single C# Document class with a DocumentId and DocumentClassification field, as well as a simple array of booleans that are index-linked to the word list. You could then work with a list of these Document objects along with a separate list of words.
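A minimal sketch of that per-document alternative might look like this (the class, field, and enum names are illustrative assumptions, not something from your data):
// Illustrative sketch of the per-document model; names are assumptions.
public enum DocumentClassification { P, N }   // extend with the real classes

public class Document
{
    public int DocumentId;
    public DocumentClassification Classification;
    // Index-linked to the shared word list: WordPresence[i] is true
    // if the word at index i occurs in this document.
    public bool[] WordPresence;
}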
Once you have a data model you like, saving it to disk is the easiest part. Just use C# serialization. You can save it via XML or binary, your choice. Binary would give you the smallest file size, naturally (I figure a little more than 200MB plus the size of a list of 30000 words). If you include the Dictionary lookup indexes, perhaps an additional 120kB.

C#: What is the best collection class to store very similar string items for efficient serialization to a file

I would like to store a list of entityIDs of outlook emails to a file. The entityIDs are strings like:
"000000005F776F08B736B442BCF7B6A7060B509A64002000"
"000000005F776F08B736B442BCF7B6A7060B509A84002000"
"000000005F776F08B736B442BCF7B6A7060B509AA4002000"
As you can see, the strings are very similar. I would like to save these strings in a collection class that would be stored as efficiently as possible when I serialize it to a file. Do you know of any collection class that could be used for this?
Thank you in advance for any information...
Gregor
No pre-existing collection class from the framework will suit your needs, because these are generic: by definition, they know nothing of the type they are storing (e.g. string) so they cannot do anything with it.
If efficient serialization is your only concern, I suggest that you simply compress the serialized file. Data like this are a feast for compression algorithms. .NET offers gzip and deflate algorithms in System.IO.Compression; better algorithms (if you need them) can easily be found through Google.
If in-memory efficiency is also an issue, you could store your strings in a trie or a radix tree.
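A minimal sketch of the compression approach, using GZipStream from System.IO.Compression (the file name and the entityIds variable are arbitrary placeholders):
// Sketch: write the entity IDs to a gzip-compressed text file.
using (var file = File.Create("entityIds.txt.gz"))
using (var gzip = new GZipStream(file, CompressionMode.Compress))
using (var writer = new StreamWriter(gzip))
{
    foreach (string id in entityIds)
        writer.WriteLine(id);
}
Because the strings share long common prefixes, the deflate dictionary should compress them very well.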
You may want to take a look at the Radix Trie data-structure, as this would be able to efficiently store your keys.
As far as serialising to a file, you could, perhaps, walk the trie and write down each node. (In the following example I have used indentation to signify the level in the tree, but you could come up with something a bit more efficient, such as using control characters to signify a descent or ascent.)
00000000
  5F776F08B736B442BCF7B6A7060B509A
    64002000
    84002000
    A4002000
  6F776F08B736B442BCF7B6A7060B509A
    32100000
The example above is the set of:
000000005F776F08B736B442BCF7B6A7060B509A64002000
000000005F776F08B736B442BCF7B6A7060B509A84002000
000000005F776F08B736B442BCF7B6A7060B509AA4002000
000000006F776F08B736B442BCF7B6A7060B509A32100000
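As a rough illustration (not a full radix trie), output of that indented shape could be produced by grouping the sorted IDs on a shared prefix; the 40-character split and the file name below are assumptions based on the sample data:
// Rough sketch: group sorted IDs by a fixed 40-character prefix and write the
// prefix once, followed by the indented suffixes. A real radix trie would
// discover the shared prefixes itself rather than assume a fixed length.
var ids = entityIds.OrderBy(id => id);
using (var writer = new StreamWriter("entityIds.trie.txt"))
{
    foreach (var group in ids.GroupBy(id => id.Substring(0, 40)))
    {
        writer.WriteLine(group.Key);                     // shared prefix
        foreach (string id in group)
            writer.WriteLine("  " + id.Substring(40));   // differing suffix
    }
}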
Why is efficiency an issue? Do you want to use as little HD space as possible? (HD space is cheap.)
In C# the two most commonly used serializers are:
Binary
or XML
If you want the user to be able to adjust the file with Notepad, for example, use XML. If not, use binary.
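For instance, a minimal sketch of the XML option with XmlSerializer (the file name and the entityIds list are placeholders):
// Sketch: XML serialization of the ID list.
var serializer = new XmlSerializer(typeof(List<string>));
using (var file = File.Create("entityIds.xml"))
{
    serializer.Serialize(file, entityIds.ToList());
}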

Import textfile with weird format in C#

I have this exported file in some weird format (standard for this industry!), which I need to import into
our database. The file basically looks like this:
DATRKAKT-START
KAKT_LKZ "D"
KAKT_DAT_STAMM "1042665"
DATRKAIB-START
KAIB_AZ "18831025"
KAIB_STATUS_FM 2
KAIB_KZ_WAE "E"
DATRKAIB-END
DATRKARP-START
KARP_MELD "831025"
KARP_ST_MELD "G"
DATRKARP-END
...
DATRKAKT-END
There are 56 sections with a total of 1963 different rows, so I'm really not keen on
creating 56 classes with 1963 properties... How would you handle this file
so that you can access a property as if it were an object?
Datrkaib.Kaib_Status_Fm
Datrkarp.karp_St_Meld
Unless your programming language lets you add methods to classes at runtime or lets classes respond to calls to undefined methods, there's no way you can do this. The thing is, even if C# did let you do this, you would lose type safety and Intellisense help anyway (presumably among the reasons for wanting it to work like that), so why not just go ahead and read it into some data structure? My inclination would be a hash which can contain values or other hashes, so you'd get calls like (VB):
Datrkakt("Lkz")
Datrkakt("Dat_Stam")
Datrkakt("Kaib")("Az")
Datrkakt("Kaib")("Status_Fm")
Or if you know all the data items are uniquely named as in your example, just use one hash:
Datr("Kakt_Lkz")
Datr("Kakt_Dat_Stam")
Datr("Kaib_Az")
Datr("Kaib_Status_Fm")
You can get back Intellisense help by creating an enum of all the data item names and getting:
Datr(DatrItem.KAKT_LKZ)
Datr(DatrItem.KAIB_STATUS_FM)
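A rough C# sketch of the "one hash" approach might look like this; the parsing rules and file name are assumptions based on the sample above (SECTION-START / SECTION-END markers, "NAME value" pairs, string values in double quotes):
// Rough sketch; in this simple version a repeated field name overwrites the earlier value.
var datr = new Dictionary<string, string>();
foreach (string rawLine in File.ReadLines("export.datr"))
{
    string line = rawLine.Trim();
    if (line.Length == 0 || line.EndsWith("-START") || line.EndsWith("-END"))
        continue;                               // skip blank lines and section markers
    int space = line.IndexOf(' ');
    string name = line.Substring(0, space);
    string value = line.Substring(space + 1).Trim().Trim('"');
    datr[name] = value;                         // e.g. datr["KAIB_STATUS_FM"] == "2"
}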
It looks like structured data, so I'd run search and replace to convert it to simple XML, and then import that.
If you want to generate a code file from it, consider CodeSmith; I think it can do this.
I'd go with a list of &lt;name, item&gt; pairs, where each item can be either a &lt;name, value&gt; tuple or a named list of further objects.
There isn't anything that will automatically do this for you.
I would create a class containing all the appropriate properties (say DatrDocument), and create a DatrReader class (similar idea to the XmlDocument/XmlReader classes).
The DatrReader will need to read the contents of the file or stream, and parse it into a DatrDocument.
You may also want to write a DatrWriter class which will take a DatrDocument and write it to a stream.
