I have an application that keeps ~1,000,000 strings in memory for performance reasons. My application consumes ~200 MB of RAM.
I want to reduce the amount of memory consumed by the strings.
I know .NET represents strings in UTF-16 encoding (2 bytes per char). Most strings in my application contain plain English characters, so storing them in UTF-8 would take roughly half the memory of UTF-16.
Is there a way to store a string in memory in UTF-8 encoding while allowing standard string functions? (My needs mostly involve IndexOf with StringComparison.OrdinalIgnoreCase.)
Unfortunately, you can't change the .NET internal representation of string. My guess is that the CLR is optimized around its two-byte (UTF-16) representation.
What you are dealing with is the classic space-time tradeoff: to save memory you generally have to spend more processor time, and to save processor time you usually spend more memory.
That said, take a look at some considerations here. If I were you, once you've established that the memory gain will be worth it, try writing your own "string" class that uses ASCII encoding. That will probably suffice.
UPDATE:
More to the point, you should check the post "Of memory and strings" by Stack Overflow legend Jon Skeet, which deals with exactly the problem you are facing. Sorry I didn't mention it right away; it took me some time to find the exact post from Jon.
Is there a way to store a string in memory in UTF-8 encoding while allowing standard string functions? (My needs mostly involve IndexOf with StringComparison.OrdinalIgnoreCase.)
You could store them as byte arrays and provide your own IndexOf implementation (since converting back to string just to call IndexOf would likely be a huge performance hit). Use the System.Text.Encoding functions for that (your best bet would be a build step that converts the strings to bytes, then read the byte arrays from disk, only converting back to string for display, if needed).
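A rough sketch of that idea (the type and method names here are purely illustrative, not an existing API). It stores the UTF-8 bytes and searches them directly, folding ASCII case on the fly, which only gives OrdinalIgnoreCase semantics for the ASCII range - consistent with the "mostly English" assumption:

using System;
using System.Text;

// Illustrative sketch: stores a string as UTF-8 bytes and searches it
// with ASCII-only ordinal-ignore-case semantics.
public sealed class Utf8Text
{
    private readonly byte[] _bytes;

    public Utf8Text(string value)
    {
        _bytes = Encoding.UTF8.GetBytes(value);
    }

    public override string ToString() => Encoding.UTF8.GetString(_bytes);

    // Ordinal, ASCII-case-insensitive IndexOf over the raw bytes.
    public int IndexOfOrdinalIgnoreCase(string needle)
    {
        byte[] n = Encoding.UTF8.GetBytes(needle);
        if (n.Length == 0) return 0;

        for (int i = 0; i <= _bytes.Length - n.Length; i++)
        {
            int j = 0;
            while (j < n.Length && FoldAscii(_bytes[i + j]) == FoldAscii(n[j]))
                j++;
            if (j == n.Length) return i;   // byte offset, not char index
        }
        return -1;
    }

    // Folds A-Z to a-z; all other bytes (including UTF-8 continuation bytes) pass through.
    private static byte FoldAscii(byte b) =>
        (b >= (byte)'A' && b <= (byte)'Z') ? (byte)(b + 32) : b;
}

Note that the result is a byte offset rather than a UTF-16 char index; for pure-ASCII data the two coincide, but for non-ASCII content you would need to map between them.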
You could store them in a C/C++ library, letting you use single-byte strings. You probably wouldn't want to marshal them back, but you could possibly marshal just the results (I assume there's some sort of searching going on here) without too much of a perf hit. C++/CLI may make this easier (you could write the searching code in C++/CLI but keep the string "database" in native C++).
Or, you could revisit the initial performance requirement that keeps all of the strings in memory. An embedded database, indexing, etc. may both speed things up and reduce memory usage - and be more maintainable.
What if you store them as byte arrays? Just restore each one to a string when you need to operate on it. I'd make a class for setting and getting the strings that internally stores them as byte arrays.
to bytearray:
string s = "whatever";
byte[] b = System.Text.Encoding.UTF8.GetBytes(s);
to string:
string s = System.Text.Encoding.UTF8.GetString(b);
Try using an in-memory database as "storage" and SQL to interact with the data. For example, SQLite can be deployed as part of your application (it consists of just one or two DLLs that can be placed in the same folder as your application).
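A minimal sketch of that approach, assuming the Microsoft.Data.Sqlite package (the table and column names are made up for illustration):

using System;
using Microsoft.Data.Sqlite;

// In-memory SQLite database holding the strings; SQL LIKE does the searching.
using var connection = new SqliteConnection("Data Source=:memory:");
connection.Open();

using (var create = connection.CreateCommand())
{
    create.CommandText = "CREATE TABLE Items (Id INTEGER PRIMARY KEY, Text TEXT)";
    create.ExecuteNonQuery();
}

using (var insert = connection.CreateCommand())
{
    insert.CommandText = "INSERT INTO Items (Text) VALUES ($text)";
    insert.Parameters.AddWithValue("$text", "some example string");
    insert.ExecuteNonQuery();
}

using (var query = connection.CreateCommand())
{
    query.CommandText = "SELECT Id FROM Items WHERE Text LIKE $pattern";
    query.Parameters.AddWithValue("$pattern", "%example%");   // LIKE is ASCII-case-insensitive by default
    using var reader = query.ExecuteReader();
    while (reader.Read())
        Console.WriteLine(reader.GetInt64(0));
}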
What if you create your own UTF-8 string class (UTF8String?) and supply an implicit cast to String? You'll be sacrificing some speed for the sake of memory, but that might be what you're looking for.
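A bare-bones illustration of what that could look like (just a sketch; a real class would add whichever string operations you actually need):

using System.Text;

// Minimal UTF8String sketch: stores UTF-8 bytes, converts implicitly at the boundary.
public readonly struct UTF8String
{
    private readonly byte[] _bytes;

    private UTF8String(byte[] bytes) => _bytes = bytes;

    public static implicit operator UTF8String(string s) =>
        new UTF8String(Encoding.UTF8.GetBytes(s));

    public static implicit operator string(UTF8String s) =>
        Encoding.UTF8.GetString(s._bytes);
}

Usage would be as simple as UTF8String compact = "hello"; string back = compact; - but note that every conversion allocates and re-encodes, which is exactly the speed-for-memory trade mentioned above.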
Related
Quick intro: the question is about UTF-8 vs. UTF-16.
I tried my best to keep it as short and specific as possible, so please bear with me.
I know there are countless variations of this specific UTF-8/UTF-16 issue, not to mention the broader encoding subject (ANSI vs. Unicode), which is where my questioning started, and I guess this isn't only my quest, as it could serve many other performance-motivated beginners in C++.
Being more specific - to the point:
Given the following environment parameters:
Windows platform
C++ and C#
some English/Russian/Hebrew text (let's say this is a constant)
can I use UTF-8 (half the size of UTF-16) and "get away with it", saving space and time?
TLDR
I have recently moved to using C++. Over the last few days I have tried to decide how to handle strings, which are one of the most expensive data types to process. I have followed almost every famous and less-famous article on the encoding issue, but the more I kept searching, the more confused I became about compatibility, all while trying to keep a high-performance application without crossing the boundaries of the *framework.
(I have used the term "framework" although I am planning to do most of the I/O via native C++.)
Can I use UTF-8? Do I even want UTF-8? I know one thing:
Windows' "blood type" is UTF-16, although I think low-level I/O and also HTTP use/default to/prefer/benefit from UTF-8,
but I am on Windows and still working with .NET.
What can I use to max out my app's performance when querying, manipulating, and saving to a database?
a point I have read in a less famous [article]
A bit of research
This is a compilation of research I did to answer your problem:
Hebrew and Cyrillic in Unicode
According to Wikipedia, the Unicode Hebrew block extends from U+0590 to U+05FF and from U+FB1D to U+FB4F (I don't know in what proportions each block is actually used):
https://en.wikipedia.org/wiki/Unicode_and_HTML_for_the_Hebrew_alphabet
According to Wikipedia, again, Cyrillic can be found in the following blocks: U+0400–U+04FF, U+0500–U+052F, U+2DE0–U+2DFF, U+A640–U+A69F, U+1D2B, U+1D78, U+FE2E–U+FE2F
https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode
UTF-8 vs. UTF-16
https://en.wikipedia.org/wiki/UTF-8
https://en.wikipedia.org/wiki/UTF-16
UTF-16 can represent the following code points with two bytes: U+0000 to U+D7FF and U+E000 to U+FFFF, which means all the characters above will be represented with two bytes (a wchar_t on Windows).
To represent Hebrew and Cyrillic, UTF-8 will always need at least two bytes, and possibly three (a quick way to measure the actual sizes for your own data is sketched after the list below):
U+0000 - U+007F : 1 byte
U+0080 - U+07FF : 2 bytes
U+0800 - U+FFFF : 3 bytes
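You can check the actual sizes for your own data quite easily; for example, in C# (the sample strings are obviously placeholders for your real data):

using System;
using System.Text;

string english = "Hello, world";
string hebrew  = "שלום עולם";
string russian = "Привет, мир";

foreach (var s in new[] { english, hebrew, russian })
{
    // Compare the encoded sizes of the same text in UTF-8 and UTF-16.
    Console.WriteLine($"'{s}': UTF-8 = {Encoding.UTF8.GetByteCount(s)} bytes, " +
                      $"UTF-16 = {Encoding.Unicode.GetByteCount(s)} bytes");
}

For the main Hebrew and Cyrillic blocks both encodings need two bytes per letter, so in mixed text most of the UTF-8 savings come from the ASCII characters.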
Windows
You said it yourself: Windows's DNA is UTF-16. No matter what delusional websites claim, the WinAPI won't change to UTF-8 because that makes no sense from Microsoft's viewpoint (breaking compatibility with existing Windows applications just to make Linux lovers happy? Seriously?).
When you develop under Windows, everything Unicode there is optimized/designed for UTF-16.
Even the "char" API from the WinAPI is just a wrapper that will convert your char strings into wchar_t strings before calling the UTF-16 you should have been calling directly anyway.
Test!
As your problem seems to be mainly I/O, you should experiment to see if there is a meaningful difference between reading/writing/sending/receiving UTF-16 vs. UTF-8 with sample data.
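A very crude way to run such an experiment in C# (the file name and sample data are placeholders; real measurements should use your own data and I/O paths):

using System;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Text;

string sample = string.Concat(Enumerable.Repeat("Hello שלום Привет ", 100_000));

Measure("UTF-8", new UTF8Encoding(false), sample);
Measure("UTF-16", new UnicodeEncoding(false, false), sample);

static void Measure(string name, Encoding enc, string text)
{
    var sw = Stopwatch.StartNew();
    File.WriteAllText("sample.txt", text, enc);      // write with the encoding under test
    _ = File.ReadAllText("sample.txt", enc);         // and read it back
    sw.Stop();
    Console.WriteLine($"{name}: {new FileInfo("sample.txt").Length} bytes on disk, {sw.ElapsedMilliseconds} ms");
}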
Conclusion
From every fact above, I see either a neutral choice between UTF-8 and UTF-16 as far as the Hebrew and Cyrillic glyphs are concerned (*), or a choice leading to UTF-16 (Windows).
So, my own conclusion, unless your tests show otherwise, would be to stick to UTF-16 on Windows.
(*) You could sample a few strings in all the languages you are using and gather statistics on how often the most common characters occur.
Bonus?
Now, in your stead, I would avoid using directly wchar_t on Windows.
Instead, I would use the _T(), TCHAR and <tchar.h> macro/typedef/include machinery offered by Windows: With but a few macros defined (UNICODE and _UNICODE if memory serves), as well as a few smart overloads, you can:
use wchar_t and UTF-16 on Windows
use UTF-8 on Linux
Which will make your code more portable should you switch to another OS.
Please read this article
http://www.joelonsoftware.com/articles/Unicode.html
Please read it carefully.
Now, regarding performance, I very much doubt that you'll see any difference.
You choose your encoding based on what your program is supposed to do.
Is it supposed to communicate with other programs?
Are you storing information in database that will be accessed by other people?
Performance and disk space are not your first priorities when deciding which encoding to use.
Is there a good reason that .NET provides string functions (like search, substring extraction, splitting, etc) only for UTF-16 and not for byte arrays? I see many cases when it would be easier and much more efficient to work with 8-bit chars instead of 16-bit.
Let's take the MIME (.EML) format for example. It's basically an 8-bit text file. You cannot read it properly using any single encoding (because the encoding info is contained within the file; moreover, different parts can have different encodings).
So you're basically better off reading a MIME file as bytes, determining its structure (ideally with 8-bit string parsing tools), and, after finding the encodings of all encoding-dependent data blocks, applying encoding.GetString(data) to get a normal UTF-16 representation of them.
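A hedged sketch of that byte-first approach (the file name is hypothetical and the charset detection is deliberately naive; real code would parse the Content-Type header properly):

using System;
using System.IO;
using System.Text;

byte[] raw = File.ReadAllBytes("message.eml");   // hypothetical file name

// Find the blank line (CRLF CRLF) separating headers from body, at the byte level.
int split = -1;
for (int i = 0; i + 3 < raw.Length; i++)
{
    if (raw[i] == 13 && raw[i + 1] == 10 && raw[i + 2] == 13 && raw[i + 3] == 10)
    {
        split = i + 4;
        break;
    }
}

// Headers are required to be ASCII-safe, so decoding them as ASCII is fine.
string headers = Encoding.ASCII.GetString(raw, 0, split < 0 ? raw.Length : split);

// Naive charset sniffing just for the sketch. (Legacy code pages on .NET Core also
// need Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).)
string charset = headers.IndexOf("charset=utf-8", StringComparison.OrdinalIgnoreCase) >= 0
    ? "utf-8" : "iso-8859-1";

// Only now decode the body, using the encoding the message itself declared.
string body = split < 0
    ? string.Empty
    : Encoding.GetEncoding(charset).GetString(raw, split, raw.Length - split);
Console.WriteLine(body.Length);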
Another issue is base64 data blocks (base64 is just an example; there are also UUE and others). Currently .NET expects you to have the base64 data as a 16-bit string, but it's not efficient to read data at double the size and do all the conversions from bytes to string just to decode it. When dealing with megabytes of data, this becomes important.
The missing byte-string manipulation functions force you to write them manually, but such implementations are obviously less efficient than the framework's native implementations of the string functions.
I'm not saying they need to be called 8-bit chars; let's keep them as bytes. Just have a set of native methods that mirror most of the string manipulation routines, but work on byte arrays. Is this needed only by me, or am I missing something important about the overall .NET architecture?
Let's take the MIME (.EML) format for example. It's basically an 8-bit text file. You cannot read it properly using any single encoding (because the encoding info is contained within the file; moreover, different parts can have different encodings).
So, you're talking about a case where general-purpose byte-string methods aren't very useful, and you'd need to specialise.
And then for other cases, you'd need to specialise again.
And again.
I actually think byte-string methods would be more useful than your example suggests, but it remains that a lot of cases for them have specialised needs that differ from other uses in incompatible ways.
Which suggests it may not be well-suited for the base library. It's not like you can't make your own that do fit those specialised needs.
Code that deals with mixed-encoding string manipulation is unnecessarily hard and much harder to explain/get right. With the handling you suggest, every "string" would need to carry its encoding information, and the framework would have to provide implementations for all possible combinations of encodings.
The standard solution to such a problem is to provide a well-defined way to convert all types to/from a single "canonical" representation and to perform most operations on that canonical type. You see this more clearly in image/video processing, where arbitrary incoming formats are converted into the one format the tool knows about, processed, and then converted back to the original (or any other) format.
.NET strings are almost there, with a "canonical" way to represent Unicode strings. There are still many ways to represent what is, from the user's point of view, the same string using different char elements. Even regular string comparison is a huge problem (since, in addition to encoding, there are frequently locale differences).
Notes
There are already plenty of APIs that deal with byte arrays for comparing/slicing - both in the Array/List classes and as LINQ helpers. The only real missing part is regex-like matching.
Even dealing with a single string encoding (UTF-16 in .NET, UTF-8 in many other systems) is hard enough - even getting the "string length" is a problem (do you count surrogate pairs only, include all combining characters, or is plain .Length enough?).
It is a good idea to try writing the code yourself to see where the complexity comes from and whether a particular framework decision makes sense. Try implementing 10-15 common string functions that support several encodings - e.g. UTF-8, UTF-16, and one of the 8-bit encodings.
I read this question today about safe and unsafe code, and then read about it on MSDN, but I still don't understand it. Why would you want to use pointers in C#? Is this purely for speed?
There are three reasons to use unsafe code:
APIs (as noted by John)
Getting actual memory address of data (e.g. access memory-mapped hardware)
Most efficient way to access and modify data (time-critical performance requirements)
Sometimes you'll need pointers to interface your C# to the underlying operating system or other native code. You're strongly discouraged from doing so, as it is "unsafe" (natch).
There will be some very rare occasions where your performance is so CPU-bound that you need that minuscule extra bit of performance. My recommendation would be to write those CPU-intensive pieces in a separate module in assembler or C/C++, export an API, and have your .NET code call that API. A possible additional benefit is that you can put platform-specific code in the unmanaged module and keep the .NET side platform-agnostic.
I tend to avoid it, but there are some times when it is very helpful:
for performance working with raw buffers (graphics, etc)
needed for some unmanaged APIs (also pretty rare for me)
for cheating with data
As an example of the last, I maintain some serialization code. Writing a float to a stream without having to use BitConverter.GetBytes (which creates a new array each time) is painful - but I can cheat:
float f = ...;          // some float value already in hand
int i = *(int*)&f;      // reinterpret the bits (requires an unsafe context)
Now I can use shifts (>>) etc. to write i much more easily than I could write f (the bytes will be identical to what BitConverter.GetBytes would have produced, plus I now control the endianness by how I apply the shifts).
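For instance, continuing from the snippet above (a little-endian layout is chosen here just for illustration):

// Write the reinterpreted bits into a buffer, least-significant byte first.
byte[] buffer = new byte[4];
buffer[0] = (byte)i;
buffer[1] = (byte)(i >> 8);
buffer[2] = (byte)(i >> 16);
buffer[3] = (byte)(i >> 24);
// stream.Write(buffer, 0, 4);   // then push the buffer to the stream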
There is at least one managed .Net API that often makes using pointers unavoidable. See SecureString and Marshal.SecureStringToGlobalAllocUnicode.
The only way to get the plain text value of a SecureString is to use one of the Marshal methods to copy it to unmanaged memory.
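For example, the usual pattern looks roughly like this (wrapped in try/finally so the unmanaged copy is always zeroed and freed):

using System;
using System.Runtime.InteropServices;
using System.Security;

static string RevealSecureString(SecureString secure)
{
    IntPtr unmanaged = IntPtr.Zero;
    try
    {
        // Copies the decrypted contents into unmanaged memory.
        unmanaged = Marshal.SecureStringToGlobalAllocUnicode(secure);
        return Marshal.PtrToStringUni(unmanaged);
    }
    finally
    {
        // Always zero and free the unmanaged copy.
        Marshal.ZeroFreeGlobalAllocUnicode(unmanaged);
    }
}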
This might have been asked before, but I can't find any such posts. Is there a class to work with ASCII Strings? The benefits are numerous:
Comparison should be faster since it's just byte-for-byte (instead of a variable-width encoding like UTF-8)
Memory efficient; it should use about half the memory for large strings
Faster versions of ToUpper()/ToLower() that use a language-invariant lookup table
Jon Skeet wrote a basic AsciiString implementation and proved #2, but I'm wondering if anyone has taken this further and completed such a class. I'm sure there would be uses, although no one would typically take such a route, since all the existing String functions would have to be re-implemented by hand. And conversions between String <-> AsciiString would be scattered everywhere, complicating an otherwise simple program.
Is there such a class? Where?
I thought I would post the outcome of my efforts to implement a system as described with as much string support and compatibility as I could. It's possibly not perfect but it should give you a decent base to improve on if needed.
The ASCIIChar struct and the ASCIIString type implicitly convert to their native counterparts for ease of use.
The OP's suggested replacements for ToUpper/ToLower etc. have been implemented in a much quicker way than a lookup table (a rough illustration of the general idea appears after the class descriptions below), and all the operations are as fast and memory-friendly as I could make them.
Sorry, I couldn't post the source here; it was too long. See the links below.
ASCIIChar - Replaces char, stores the value in a byte instead of an int, and provides support methods and compatibility for the string class. Implements virtually all methods and properties available on char.
ASCIIChars - Provides static properties for each of the valid ASCII characters for ease of use.
ASCIIString - Replaces string, stores characters in a byte array and implements virtually all methods and properties available for string.
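I don't know exactly how the linked code does it, but as an illustration of how ASCII case mapping can avoid a lookup table entirely (this is an assumption about the approach, not the linked implementation):

// ASCII-only case mapping: upper- and lower-case letters differ only in bit 0x20,
// so no table is needed.
static byte ToUpperAscii(byte b) =>
    (b >= (byte)'a' && b <= (byte)'z') ? (byte)(b & ~0x20) : b;

static byte ToLowerAscii(byte b) =>
    (b >= (byte)'A' && b <= (byte)'Z') ? (byte)(b | 0x20) : b;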
.NET has no direct ASCII string support. Strings are UTF-16 because the Windows API works only with one-byte-per-char ("ANSI") strings or UTF-16 strings. UTF-8 might well be the better fit (as many other platforms use it), but .NET does not use it as its internal string format because Windows doesn't.
The Windows API can convert between charsets, but it only works with 1-byte or 2-byte chars, so if you use UTF-8 strings in .NET you have to convert them every time, which has a performance impact. .NET can still read and write UTF-8 and other encodings via BinaryWriter/BinaryReader or a simple StreamWriter/StreamReader.
I'm trying to write a simple reader for AutoCAD's DWG files in .NET. I don't actually need to access all data in the file so the complexity that would otherwise be involved in writing a reader/writer for the whole file format is not an issue.
I've managed to read in the basics, such as the version, all the header data, the section locator records, but am having problems with reading the actual sections.
The problem seems to stem from the fact that the format uses a custom method of storing some data types. I'm going by the specs here:
http://www.opendesign.com/files/guestdownloads/OpenDesign_Specification_for_.dwg_files.pdf
Specifically, the types that depend on reading individual bits are the ones I'm struggling with. A large part of the problem seems to be that C#'s BinaryReader only lets you read whole bytes at a time, when in fact I believe I need the ability to read individual bits, not simply 8 bits or a multiple thereof at a time.
It could be that I'm misunderstanding the spec and how to interpret it, but if anyone could clarify how I might go about reading individual bits from a stream, or even how to read some of the variable types in the above spec that require more complex bit manipulation than simply reading full bytes, that would be excellent.
I do realise there are commercial libraries out there for this, but the price is simply too high on all of them to be justifiable for the task at hand.
Any help much appreciated.
You can always use the BitArray class for bit-wise manipulation: read the bytes from the file, load them into a BitArray, and then access the individual bits.
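A rough sketch of that idea (the file name is a placeholder; note that BitArray exposes the least-significant bit of each byte first, while bit-packed formats such as DWG typically number bits from the most significant end, so the order is flipped here - check the spec for the exact convention):

using System;
using System.Collections;
using System.IO;

// Illustrative bit reader over a byte buffer loaded from the file.
byte[] data = File.ReadAllBytes("drawing.dwg");   // hypothetical file name
var bits = new BitArray(data);                    // bit 0 = LSB of data[0]

int position = 0;                                 // current bit offset into the buffer

// Reads 'count' bits (MSB-first within each byte) and returns them as an int.
int ReadBits(int count)
{
    int value = 0;
    for (int k = 0; k < count; k++)
    {
        int byteIndex = position / 8;
        int bitInByte = 7 - (position % 8);       // flip to MSB-first ordering
        bool bit = bits[byteIndex * 8 + bitInByte];
        value = (value << 1) | (bit ? 1 : 0);
        position++;
    }
    return value;
}

int twoBitCode = ReadBits(2);   // e.g. a DWG "bit short" begins with a 2-bit code
Console.WriteLine(twoBitCode);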
For the price of any of those libraries, you definitely cannot develop something stable yourself. How much time have you spent on it so far?