In SQLite what is the maximum capacity of TEXT? - c#

According to this answer, TEXT has a maximum capacity of 65,535 characters (or 64 KB).
However, I just built a test in which I stored a JSON string taken from a JSON file that is 305 KB into a TEXT column without problems.
I am wondering if there is some property of TEXT that allows this.

The limit, by default, is 1 billion bytes rather than characters, so the encoding has to be considered. However, it can be changed.
The complete section regarding this is:
Maximum length of a string or BLOB
The maximum number of bytes in a string or BLOB in SQLite is defined
by the preprocessor macro SQLITE_MAX_LENGTH. The default value of this
macro is 1 billion (1 thousand million or 1,000,000,000). You can
raise or lower this value at compile-time using a command-line option
like this:
-DSQLITE_MAX_LENGTH=123456789
The current implementation will only support a string or BLOB length up to 2^31-1 or 2147483647. And some
built-in functions such as hex() might fail well before that point. In
security-sensitive applications it is best not to try to increase the
maximum string and blob length. In fact, you might do well to lower
the maximum string and blob length to something more in the range of a
few million if that is possible.
During part of SQLite's INSERT and SELECT processing, the complete
content of each row in the database is encoded as a single BLOB. So
the SQLITE_MAX_LENGTH parameter also determines the maximum number of
bytes in a row.
The maximum string or BLOB length can be lowered at run-time using the
sqlite3_limit(db,SQLITE_LIMIT_LENGTH,size) interface.
Limits In SQLite
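
Not part of the original post, but a minimal C# sketch of the experiment described in the question, assuming the Microsoft.Data.Sqlite provider; the database, table and file names are made up:

using System;
using System.IO;
using Microsoft.Data.Sqlite;

// A 305 KB JSON string fits in a TEXT column without any special setting,
// because the default per-value limit is 1,000,000,000 bytes.
using var connection = new SqliteConnection("Data Source=limits-test.db");
connection.Open();

using (var create = connection.CreateCommand())
{
    create.CommandText = "CREATE TABLE IF NOT EXISTS docs (body TEXT)";
    create.ExecuteNonQuery();
}

string json = File.ReadAllText("large.json"); // ~305 KB in the question's test

using (var insert = connection.CreateCommand())
{
    insert.CommandText = "INSERT INTO docs (body) VALUES ($json)";
    insert.Parameters.AddWithValue("$json", json);
    insert.ExecuteNonQuery();
}

using (var check = connection.CreateCommand())
{
    check.CommandText = "SELECT length(body) FROM docs"; // character count, far below the limit
    Console.WriteLine(check.ExecuteScalar());
}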

Related

Replace bytes in a file based on positions

In C# I want to replace, for example, bytes 200-5000 of a file with an array/stream of 30,000 bytes. The numbers are just an example and the sizes may vary; I could be replacing a section with either more or fewer bytes than the original section.
I have checked this question:
Replace sequence of bytes in binary file
It only explains how to overwrite a fixed-size section: it will clobber the data that follows if the new byte array is bigger than the original section, and it will leave old bytes in place if the section being replaced is bigger than the new bytes.
Higher performance and not having to read the entire file into memory would be preferred, but anything goes.
I would need this to work on both Windows and Linux.
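
No answer is included above; the following is just one possible approach (my own sketch, not from the question or the linked answer): stream through a temporary file, copying the unchanged prefix, writing the replacement bytes, skipping the old section and copying the rest, so the whole file is never held in memory and the new section may be larger or smaller than the old one.

using System;
using System.IO;

static class ByteRangeReplacer
{
    // Replaces the bytes in [start, start + oldLength) of `path` with `replacement`.
    // Names and the temp-file strategy are my own choices for the sketch.
    public static void ReplaceRange(string path, long start, long oldLength, byte[] replacement)
    {
        string tempPath = path + ".tmp";

        using (var input = File.OpenRead(path))
        using (var output = File.Create(tempPath))
        {
            CopyBytes(input, output, start);                  // unchanged prefix
            output.Write(replacement, 0, replacement.Length); // new section, any size
            input.Seek(oldLength, SeekOrigin.Current);        // skip the replaced section
            input.CopyTo(output);                             // unchanged suffix
        }

        File.Delete(path);
        File.Move(tempPath, path); // works on both Windows and Linux under .NET Core/.NET 5+
    }

    static void CopyBytes(Stream source, Stream destination, long count)
    {
        var buffer = new byte[81920];
        while (count > 0)
        {
            int read = source.Read(buffer, 0, (int)Math.Min(buffer.Length, count));
            if (read == 0) throw new EndOfStreamException();
            destination.Write(buffer, 0, read);
            count -= read;
        }
    }
}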

Constructing a Base64 string up to a given length and of the same size

I was reading through this tutorial https://www.simple-talk.com/cloud/platform-as-a-service/azure-blob-storage-part-4-uploading-large-blobs/ for implementing an Azure method described here https://learn.microsoft.com/en-us/rest/api/storageservices/fileservices/put-block.
In order to implement this method we require a block id, which is:
A valid Base64 string value that identifies the block.
Prior to encoding, the string must be less than or equal to 64 bytes in size.
For a given blob, the length of the value specified for the blockid parameter must be the same size for each block.
Note that the Base64 string must be URL-encoded.
So in order to achieve that, the author says:
"I usually just number them from 1 to whatever, using a block ID that
is formatted to a 7-character string. So for 1, I’ll get “0000001”.
Note that block id’s have to be a base 64 string."
and uses this code:
string blockId = Convert.ToBase64String(ASCIIEncoding.ASCII.GetBytes(string.Format("BlockId{0}",blockNumber.ToString("0000000"))));
Now, this is Base64, no doubt, but how is she fulfilling conditions 2 and 3? Formatting with "0000000" means 23 becomes "0000023", but a number with more than 7 digits stays as it is, e.g. "999888777", which violates condition 3. And with only 7 digits, how is she able to achieve a 64-byte string to fulfill condition 2?
If you look at #3, the block ids must be of the same length. Thus if you use:
string blockId = Convert.ToBase64String(ASCIIEncoding.ASCII.GetBytes(string.Format("BlockId{0}",blockNumber.ToString("0000000"))));
What you're essentially saying is that the maximum block id (or block number in your case) would be 9999999. If you think you would need block ids of more than 7 digits (say 9 digits, starting from 100000000), then you would use code like the following:
string blockId = Convert.ToBase64String(ASCIIEncoding.ASCII.GetBytes(string.Format("BlockId{0}",blockNumber.ToString("000000000"))));
Then all the block ids will be of the same length.
Whatever sequence you choose, you just have to ensure that when you convert any number in that sequence to a string, all of the results are the same length.
A few other things I would like to mention are:
There can be a maximum of 50,000 blocks per blob, i.e. you can't split a file into more than 50,000 chunks (blocks) to upload as blocks.
You can upload the blocks in any order, i.e. you can upload block #999 first and then block #0. What matters is the payload of the commit block list: the final blob that gets constructed and saved in blob storage is based on the order of the block ids specified in the commit block list.
What works for me is the following code (assuming the block id numbers are sequential numeric starting from 0):
string blockId = Convert.ToBase64String(ASCIIEncoding.ASCII.GetBytes(string.Format("BlockId{0}",blockNumber.ToString("d6"))));
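
Not from the original answer, but a short sketch that builds block ids the same way as the snippets above and prints them, showing that for a fixed digit count every id comes out the same length and well under the 64-byte pre-encoding limit:

using System;
using System.Text;

class BlockIdDemo
{
    // Same construction as the answer's one-liner, just using Encoding.ASCII directly.
    static string MakeBlockId(int blockNumber) =>
        Convert.ToBase64String(Encoding.ASCII.GetBytes("BlockId" + blockNumber.ToString("0000000")));

    static void Main()
    {
        foreach (int n in new[] { 1, 23, 9999999 })
        {
            string id = MakeBlockId(n);
            // "BlockId" + 7 digits = 14 bytes before encoding, so the Base64 form
            // is always 20 characters - the same length for every block.
            Console.WriteLine($"{n,9} -> {id} (length {id.Length})");
        }
    }
}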

What is the max number of indexes lucene.net can handle in a document

Lucene does not document the limitations of the storage engine. Does anyone know the max number of indexes allowed per document?
When referring to term numbers, Lucene's current implementation uses a Java int to hold the term index, which means the maximum number of unique terms in any single index segment is ~2.1 billion times the term index interval (default 128) = ~274 billion. This is technically not a limitation of the index file format, just of Lucene's current implementation.
Similarly, Lucene uses a Java int to refer to document numbers, and the index file format uses an Int32 on-disk to store document numbers. This is a limitation of both the index file format and the current implementation. Eventually these should be replaced with either UInt64 values, or better yet, VInt values which have no limit.
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/codecs/lucene40/package-summary.html#Limitations
As with all types of indexes (Lucene, RDBMS, or otherwise), indexing the lowest possible number of fields is recommended because it keeps your index size small and reduces the run-time overhead of reading from the index.
That said, the field count is limited mainly by your system resources. Fields are identified by their name (case-sensitive) rather than by an arbitrary numeric ID, which typically becomes the limiting factor in these sorts of systems. Theoretical field count limits are also hard to predict in a system like Lucene that has no strict maximum field name length.
I've personally used more than 200 analyzed fields across more than 2 billion documents without issue. At the same time, performance for that index was not what I have come to expect from smaller indexes on a medium-sized Azure VM.
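
For illustration only (not from the answer): a minimal Lucene.NET 4.8 sketch that indexes a document with a couple of hundred analyzed fields; the field names and index path are made up.

using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Util;

var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48);
var config = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer);

using var dir = FSDirectory.Open(new DirectoryInfo("wide-field-index"));
using var writer = new IndexWriter(dir, config);

var doc = new Document();
for (int i = 0; i < 200; i++)
{
    // Fields are identified by name; there is no hard per-document field cap,
    // only the practical cost of indexing and reading them back.
    doc.Add(new TextField($"field_{i}", $"some analyzed text for field {i}", Field.Store.NO));
}

writer.AddDocument(doc);
writer.Commit();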

Data in a condensed format

I need a library which would help me to save and query data in a condensed format (a mini DSL in essence); here's a sample of what I want:
Update 1 - Please note that the figures in the samples are kept small just to make it easier to follow the logic; the real figures are only limited by the capacity of the C# long type, e.g.:
1,18,28,29,39,18456789,18456790,18456792,184567896.
Sample Raw Data set: 1,2,3,8,11,12,13,14
Condensed Sample Data set:
1..3,8,11..14
What would be absolutely nice to have is the ability to present 1,2,4,5,6,7,8,9,10 as 1..10-3.
Querying Sample Data set:
Query 1 (get range):
1..5 -> 1..3
Query 2 (check if the value exists)
?2 -> true
Query 3 (get multiple ranges and scalar values):
1..5,11..12,14 -> 1..3,11..12,14
I don't want to develop it from scratch and would highly prefer to use something which already exists.
Here are some ideas I've had over the days since I read your question. I can't be sure any of them really apply to your use case but I hope you'll find something useful here.
Storing your data compressed
Steps you can take to reduce the amount of space your numbers take up on disk:
If your values are between 1 and ~10M, don't use a long, use a uint. (4 bytes per number.)
Actually, don't use a uint. Store your numbers 7 bits to a byte, with the remaining bit used to say "there are more bytes in this number". (Then 1-127 will fit in 1 byte, 128-~16k in 2 bytes, ~16k-~2M in 3 bytes, ~2M-~270M in 4 bytes.)
This should reduce your storage from 8 bytes per number (if you were originally storing them as longs) to, say, on average 3 bytes. Also, if you end up needing bigger numbers, the variable-byte storage will be able to hold them.
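
A minimal sketch (my own, not from the answer) of the 7-bits-per-byte encoding described above; the class and method names are made up:

using System;
using System.IO;

static class VarIntCodec
{
    // Low 7 bits carry data; the high bit says "another byte follows".
    public static void WriteVarUInt(Stream output, ulong value)
    {
        while (value >= 0x80)
        {
            output.WriteByte((byte)(value | 0x80));
            value >>= 7;
        }
        output.WriteByte((byte)value);
    }

    public static ulong ReadVarUInt(Stream input)
    {
        ulong value = 0;
        int shift = 0;
        while (true)
        {
            int b = input.ReadByte();
            if (b < 0) throw new EndOfStreamException();
            value |= (ulong)(b & 0x7F) << shift;
            if ((b & 0x80) == 0) return value;
            shift += 7;
        }
    }
}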
Then I can think of a couple of ways to reduce it further, given you know the numbers are always increasing and may contain lots of runs. Which works best for you only you can know by trying it on your actual data.
For each of your actual numbers, store two numbers: the number itself, followed by the number of numbers contiguous after it (e.g. 2,3,4,5,6 => 2,4). You'll have to store lone numbers as e.g. 8,0 so will increase storage for those, but if your data has lots of runs (especially long ones) this should reduce storage on average. You could further store "single gaps" in runs as e.g. 1,2,3,5,6,7 => 1,6,4 (unambiguous as 4 is too small to be the start of the next run) but this will make processing more complex and won't save much space so I wouldn't bother.
Or, rather than storing the numbers themselves, store the deltas (so 3,4,5,7,8,9 => 3,1,1,2,1,1). This will reduce the number of bytes used for storing larger numbers (e.g. 15000,15005 (4 bytes) => 15000,5 (3 bytes)). Further, if the data contains a lot of runs (e.g. lots of 1 bytes), it will then compress (e.g. zip) nicely.
Handling in code
I'd simply advise you to write a couple of methods that stream a file from disk into an IEnumerable<uint> (or ulong if you end up with bigger numbers), and do the reverse, while handling whatever you've implemented from the above.
If you do this in a lazy fashion - using yield return to hand back the numbers as you read and decode them from disk, and streaming numbers to disk rather than building them all up in memory and writing them at once - you can keep your memory usage down whatever the size of the stored data.
(I think, but I'm not sure, that even the GZipStream and other compression streams will let you stream your data without having it all in memory.)
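
To make that concrete, here is a sketch (mine, under the same assumptions) that combines delta encoding with lazy streaming, reusing the VarIntCodec sketch from earlier:

using System.Collections.Generic;
using System.IO;

static class DeltaStream
{
    // Lazily writes an increasing sequence as deltas, one varint per number.
    public static void Save(string path, IEnumerable<ulong> numbers)
    {
        using var output = File.Create(path);
        ulong previous = 0;
        foreach (ulong n in numbers)
        {
            VarIntCodec.WriteVarUInt(output, n - previous); // small deltas -> few bytes
            previous = n;
        }
    }

    // Lazily reads the deltas back and reconstructs the original numbers.
    public static IEnumerable<ulong> Load(string path)
    {
        using var input = File.OpenRead(path);
        ulong current = 0;
        while (input.Position < input.Length)
        {
            current += VarIntCodec.ReadVarUInt(input);
            yield return current; // only the current value is held in memory
        }
    }
}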
Querying
If you're comparing two of your big data sets, I wouldn't advise using LINQ's Intersect method as it requires reading one of the sources completely into memory. However, as you know both sequences are increasing, you can write a similar method that needs only hold an enumerator for each sequence.
If you're querying one of your data sets against a user-input, small list of numbers, you can happily use LINQ's Intersect method as it is currently implemented, as it only needs the second sequence to be entirely in memory.
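
A sketch of such a method (not from the answer): it walks both increasing sequences with one enumerator each, so neither needs to be fully materialised.

using System.Collections.Generic;

static class SortedSequences
{
    // Intersects two increasing sequences while holding only one enumerator
    // (and one current value) per sequence.
    public static IEnumerable<ulong> IntersectSorted(IEnumerable<ulong> first, IEnumerable<ulong> second)
    {
        using var a = first.GetEnumerator();
        using var b = second.GetEnumerator();
        if (!a.MoveNext() || !b.MoveNext()) yield break;

        while (true)
        {
            if (a.Current == b.Current)
            {
                yield return a.Current;
                if (!a.MoveNext() || !b.MoveNext()) yield break;
            }
            else if (a.Current < b.Current)
            {
                if (!a.MoveNext()) yield break; // advance the smaller side
            }
            else
            {
                if (!b.MoveNext()) yield break;
            }
        }
    }
}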
I'm not aware of any off-the-shelf library that does quite what you want, but I'm not sure you need one.
I suggest you consider using the existing BitArray class. If, as your example suggests, you're interested in compressing sets of small integers then a single BitArray with, say 256 bits, could represent any set of integers in the range [0..255]. Of course, if your typical set has only 5 integers in it then this approach would actually expand your storage requirements; you'll have to figure out the right size of such arrays from your own knowledge of your sets.
I'd suggest also looking at your data as sets of integers, so your example 1,2,3,8,11,12,13,14 would be represented by setting on the corresponding bits in a BitArray. Your query operations then reduce to intersection between a test BitArray and your data BitArray.
Incidentally, I think your example 2, which transforms 2 -> true, would be better staying in the domain of functions that map sets of integers to sets of integers, ie it should transform 2 -> 2. If you want to, write a different method which returns a boolean.
I guess you'd need to write code to pack integers into BitArrays and to unpack BitArrays into integers, but that's part of the cost of compression.
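
A small sketch (mine, not the answer's code) of the BitArray idea, using the sample set from the question and the query 1..5; the 256-bit size is an arbitrary choice:

using System;
using System.Collections;

class BitSetDemo
{
    // Packs a set of small integers into a BitArray of the given size.
    static BitArray ToBits(int size, params int[] values)
    {
        var bits = new BitArray(size);
        foreach (int v in values) bits[v] = true;
        return bits;
    }

    static void Main()
    {
        var data = ToBits(256, 1, 2, 3, 8, 11, 12, 13, 14);
        var query = ToBits(256, 1, 2, 3, 4, 5);     // the query 1..5

        var result = new BitArray(data).And(query); // And() mutates, so intersect a copy

        for (int i = 0; i < result.Length; i++)
            if (result[i]) Console.Write(i + " ");  // prints: 1 2 3
    }
}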

Azure table storage: maximum variable size?

I will be using table storage to store a lot of blob names, in a single string, appended to each other using some special character. This string will skyrocket in size pretty soon. But is there a maximum size for the length of a property of a particular entity? In my case, the string?
The maximum string size for a single property is 64 KB. If you take the Fat Entity approach as defined by Lokad.Cloud, then you can have a 1 MB property instead (leveraging the maximum entity size).
The maximum string size is 64 KB; an individual entity cannot exceed 1 MB.
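
Not from either answer, but a sketch of the general workaround idea (not Lokad.Cloud's actual implementation): split the long string into chunks that each stay under the 64 KB per-property limit and store them in several properties of the same entity.

using System;
using System.Collections.Generic;
using System.Linq;

static class FatStringSplitter
{
    // Table string properties are UTF-16 and capped at 64 KB, i.e. 32K characters;
    // 32,000 is used here as a conservative chunk size.
    const int MaxCharsPerProperty = 32_000;

    // Splits a long string into property-sized chunks (e.g. Names_0, Names_1, ...).
    public static IReadOnlyList<string> Split(string value) =>
        Enumerable.Range(0, (value.Length + MaxCharsPerProperty - 1) / MaxCharsPerProperty)
                  .Select(i => value.Substring(i * MaxCharsPerProperty,
                                               Math.Min(MaxCharsPerProperty, value.Length - i * MaxCharsPerProperty)))
                  .ToList();

    // Reassembles the original string from its chunks.
    public static string Join(IEnumerable<string> chunks) => string.Concat(chunks);
}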
