Optimizing several million char* to string conversions


I have an application that needs to take in several million char*'s as input parameters (typically strings of fewer than 512 characters, in Unicode), and convert and store them as .NET strings.
It is turning out to be a real bottleneck in the performance of my application. I'm wondering if there's some design pattern or idea that would make it more efficient.
There is a key part that makes me feel like it can be improved: There are a LOT of duplicates. Say 1 million objects are coming in, there might only be like 50 unique char* patterns.
For the record, here is the algorithm I'm using to convert char* to string (this algorithm is in C++, but the rest of the project is in C#):
String ^StringTools::MbCharToStr ( const char *Source )
{
String ^str;
if( (Source == NULL) || (Source[0] == '\0') )
{
str = gcnew String("");
}
else
{
// Find the number of UTF-16 characters needed to hold the
// converted UTF-8 string, and allocate a buffer for them.
const size_t max_strsize = 2048;
int wstr_size = MultiByteToWideChar (CP_UTF8, 0L, Source, -1, NULL, 0);
if (wstr_size < max_strsize)
{
// Save the malloc/free overhead if it's a reasonable size.
// Plus, KJN was having fits with exceptions within exception logging due
// to a corrupted heap.
wchar_t wstr[max_strsize];
(void) MultiByteToWideChar (CP_UTF8, 0L, Source, -1, wstr, (int) wstr_size);
str = gcnew String (wstr);
}
else
{
wchar_t *wstr = (wchar_t *)calloc (wstr_size, sizeof(wchar_t));
if (wstr == NULL)
throw gcnew PCSException (__FILE__, __LINE__, PCS_INSUF_MEMORY, MSG_SEVERE);
// Convert the UTF-8 string into the UTF-16 buffer, construct the
// result String from the UTF-16 buffer, and then free the buffer.
(void) MultiByteToWideChar (CP_UTF8, 0L, Source, -1, wstr, (int) wstr_size);
str = gcnew String ( wstr );
free (wstr);
}
}
return str;
}

You could use each character from the input string to feed a trie structure. At the leaves, have a single .NET string object. Then, when a char* comes in that you've seen previously, you can quickly find the existing .NET version without allocating any memory.
Pseudo-code:
start with an empty trie,
process a char* by searching the trie until you can go no further
add nodes until your entire char* has been encoded as nodes
at the leaf, attach an actual .NET string
The answer to this other SO question should get you started: How to create a trie in c#
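The pseudo-code above can be sketched in C# roughly as follows. This is a minimal illustration rather than a tuned implementation: the names Utf8StringTrie and GetOrAdd are hypothetical, the input is assumed to arrive as UTF-8 bytes already copied into a byte[], and the per-node Dictionary is only one of several possible child layouts. Note that a cached string can sit on an interior node when one input is a prefix of another.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: caches one .NET string per distinct UTF-8 byte sequence.
public sealed class Utf8StringTrie
{
    private sealed class Node
    {
        public readonly Dictionary<byte, Node> Children = new Dictionary<byte, Node>();
        public string Value; // non-null only where a previously seen input ends
    }

    private readonly Node root = new Node();

    // Walks the trie byte by byte, adding nodes as needed, and decodes
    // the bytes into a string only the first time this sequence is seen.
    public string GetOrAdd(byte[] utf8, int count)
    {
        Node node = root;
        for (int i = 0; i < count; i++)
        {
            Node next;
            if (!node.Children.TryGetValue(utf8[i], out next))
            {
                next = new Node();
                node.Children.Add(utf8[i], next);
            }
            node = next;
        }

        if (node.Value == null)
        {
            node.Value = Encoding.UTF8.GetString(utf8, 0, count);
        }
        return node.Value;
    }
}
```

With about 50 unique patterns, the trie stays small and repeated inputs return the same string instance without a new allocation. Whether the walk is cheaper than simply converting again should be measured on real data.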

There is a key part that makes me feel like it can be improved: There are a LOT of duplicates. Say 1 million objects are coming in, there might only be like 50 unique char* patterns.
If this is the case, you may want to consider storing the "found" patterns within a map (such as a std::map<const char*, gcroot<String^>>, though you'll need a comparer for the const char*), and use that to return the previously converted value.
There is an overhead to storing the map, doing the comparison, etc. However, this may be mitigated by the dramatically reduced memory usage (you can reuse the managed string instances), as well as saving the memory allocations (calloc/free). Also, using malloc instead of calloc would likely be a (very small) improvement, as you don't need to zero out the memory prior to calling MultiByteToWideChar.
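The answer describes the map on the native side. The same idea can be sketched in C# as below, assuming the raw bytes are available as a byte[]. ByteArrayComparer and ConvertedStringCache are hypothetical names, and the FNV-1a hash is just one simple choice for the content-based comparer the answer says you would need.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Compares byte[] keys by content rather than by reference.
public sealed class ByteArrayComparer : IEqualityComparer<byte[]>
{
    public bool Equals(byte[] x, byte[] y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null || x.Length != y.Length) return false;
        for (int i = 0; i < x.Length; i++)
        {
            if (x[i] != y[i]) return false;
        }
        return true;
    }

    public int GetHashCode(byte[] bytes)
    {
        // FNV-1a: a simple non-cryptographic hash over the key bytes.
        unchecked
        {
            uint hash = 2166136261u;
            foreach (byte b in bytes)
            {
                hash ^= b;
                hash *= 16777619u;
            }
            return (int)hash;
        }
    }
}

// Hypothetical cache: returns the previously converted string for known input.
public sealed class ConvertedStringCache
{
    private readonly Dictionary<byte[], string> cache =
        new Dictionary<byte[], string>(new ByteArrayComparer());

    public int Count { get { return cache.Count; } }

    public string Convert(byte[] utf8)
    {
        string result;
        if (!cache.TryGetValue(utf8, out result))
        {
            result = Encoding.UTF8.GetString(utf8);
            // Store a private copy so later changes to the caller's array
            // cannot corrupt the key.
            cache.Add((byte[])utf8.Clone(), result);
        }
        return result;
    }
}
```

Each lookup still hashes and compares the input bytes, which is the overhead mentioned above. The saving comes from skipping the decode and the string allocation on a hit.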

I think the first optimization you could make here would be to make your first try calling MultiByteToWideChar start with a buffer instead of a null pointer. Because you specified CP_UTF8, MultiByteToWideChar must walk over the whole string to determine the expected length. If there is some length which is longer than the vast majority of your strings, you might consider optimistically allocating a buffer of that size on the stack; and if that fails, then going to dynamic allocation. That is, move the first branch of your if/else block outside of the if/else.
You might also save some time by calculating the length of the source string once and passing it in explicitly -- that way MultiByteToWideChar doesn't have to do a strlen every time you call it.
That said, it sounds like if the rest of your project is C#, you should use the .NET BCL class libraries designed to do this rather than having a side by side assembly in C++/CLI for the sole purpose of converting strings. That's what System.Text.Encoding is for.
I doubt any kind of caching data structure you could use here is going to make any significant difference.
Oh, and don't ignore the result of MultiByteToWideChar -- not only should you never cast anything to void, you've got undefined behavior in the event MultiByteToWideChar fails.
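As a rough sketch of the System.Text.Encoding route mentioned above, the conversion could be done from C# like this. It assumes the native side hands over a pointer to a null-terminated UTF-8 buffer as an IntPtr. Utf8Interop and FromUtf8 are hypothetical names, and this version favors clarity over speed (it copies the bytes into a managed array before decoding).

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

public static class Utf8Interop
{
    // Converts a null-terminated UTF-8 buffer into a .NET string.
    public static string FromUtf8(IntPtr source)
    {
        if (source == IntPtr.Zero) return string.Empty;

        // Find the terminator once, so the length is known up front.
        int length = 0;
        while (Marshal.ReadByte(source, length) != 0)
        {
            length++;
        }
        if (length == 0) return string.Empty;

        // Copy the bytes into managed memory and let the BCL decode them.
        byte[] buffer = new byte[length];
        Marshal.Copy(source, buffer, 0, length);
        return Encoding.UTF8.GetString(buffer);
    }
}
```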

I would probably use a cache based on a ternary tree structure, or similar, and look up the input string to see if it's already converted before even converting a single character to .NET representation.

Related

Does primitive array expects integer as index

Should primitive array content be accessed by int for best performance?
Here's an example
int[] arr = new int[]{1,2,3,4,5};
The array is only 5 elements in length, so the index doesn't have to be int, but short or byte, that would save useless 3 byte memory allocation if byte is used instead of int. Of course, that only holds if I know the array won't grow beyond a size of 255.
byte index = 1;
int value = arr[index];
But does this work as well as it sounds?
I'm worried about how this is executed at a lower level: does index gets casted to int or other operations which would actually slow down the whole process instead of this optimizing it?

In C and C++, arr[index] is formally equivalent to *(arr + index). Your concerns about casting should be answerable in terms of the simpler question of what the machine will do when it needs to add an integer offset to a pointer.
I think it's safe to say that on most modern machines, when you add a "byte" to a pointer, it's going to use the same instruction as it would if you added a 32-bit integer to a pointer. And indeed it's still going to represent that byte using the machine word size, padded with some unused space. So this isn't going to make using the array faster.
Your optimization might make a difference if you need to store millions of these indices in a table; then using byte instead of int would use a quarter of the memory and take less time to move that memory around. If the array you are indexing is huge, and the index needs to be larger than the machine word size, then that's a different consideration. But I think it's safe to say that in most normal situations this optimization doesn't really make sense, and size_t is probably the most appropriate generic type for array indices, all things being equal (since it corresponds exactly to the machine word size on the majority of architectures).

does index gets casted to int or other operations which would actually slow down the whole process instead of this optimizing it
No, but
that would save useless 3 byte memory allocation
You don't gain anything by saving 3 bytes.
Only if you are storing a huge array of those indices then the amount of space you would save might make it a worthwhile investment.
Otherwise stick with a plain int, it's the processor's native word size and thus the fastest.
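To make the two cases concrete, here is a small sketch (IndexDemo and its members are hypothetical names). The first method shows that a byte index compiles and is implicitly widened to int for the element access. The second shows where the narrower type can pay off, namely when many indices are stored in a table.

```csharp
using System;

public static class IndexDemo
{
    // A byte index is implicitly converted to int before the element access,
    // so no explicit cast is needed and nothing is saved at the access itself.
    public static int ReadWithByteIndex(int[] arr, byte index)
    {
        return arr[index];
    }

    // Bytes of element data needed to store `count` indices in each type
    // (array object overhead is not included).
    public static long IndexTableBytes(int count, bool useByte)
    {
        return (long)count * (useByte ? sizeof(byte) : sizeof(int));
    }
}
```

For a single local index the choice makes no practical difference. For a table of a million indices, the element data is about 1 MB with byte versus about 4 MB with int.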

How to concatenate and hash a username and password (stored in a secure string) in unsafe code

I'm trying to persist whether a username and password combination was valid the last time a program executed, but without storing the username and password themselves. The goal isn't validation, just to prevent needless attempts to use invalid credentials that could get a user locked out of a service (in this case, SharePoint, but that's not pertinent here).
My approach is to concatenate the username and password and take an MD5 hash (it's fast, and it'll validate against a provided username/password combination).
This turns out to require a bunch of stuff I don't know. Please see below for my current (not working) approach, and if anyone can provide guidance as to what I should be doing, it would be very useful.
unsafe
{
byte[] usernamePart = Encoding.Unicode.GetBytes(this.Username);
IntPtr unmanagedPwd = IntPtr.Zero;
unmanagedPwd = Marshal.SecureStringToGlobalAllocUnicode(this.Password);
// Question 1: How many bytes do I need to copy?
int lenPasswordArray = somemethod(this.Password);
IntPtr unsafeBuffer = Marshal.AllocHGlobal(usernamePart.Length + lenPasswordArray);
Marshal.Copy(usernamePart, 0, unsafeBuffer, usernamePart.Length);
// Question 2: Marshal.Copy takes a byte[]; I have an IntPtr. How to copy after the username
Marshal.Copy(unmanagedPwd, 0, IntPtr.Add(unsafeBuffer, lenPasswordArray), lenPasswordArray);
var provider = new System.Security.Cryptography.MD5CryptoServiceProvider();
//Question 3: I now have an IntPtr with username and password together. But
// provider takes a byte[]... I don't want to convert to byte[], because it'll end up
// with the same System.String problem
var targetHash = provider.ComputeHash(unsafeBuffer);
// Question 4: How do I clean up safely?
Marshal.ZeroFreeGlobalAllocUnicode(unmanagedPwd);
Marshal.Copy(new byte[usernamePart.Length + lenPasswordArray], 0, unsafeBuffer, usernamePart.Length + lenPasswordArray);
Marshal.FreeHGlobal(unsafeBuffer);
}
As in the comments, there are four things I need to know:
How to work out the number of bytes allocated by SecureStringToGlobalAllocUnicode
The appropriate function to use when I need n bytes after an IntPtr and don't want to allocate a managed byte[] and use Marshal.Copy
How to encrypt those bytes
How to reliably zero out and free anything I've allocated (I'm very new to unsafe code)
Edit: For clarity, what I want is the secure version of:
byte[] usernamePart = Encoding.Unicode.GetBytes(this.Username);
byte[] passwordPart = Encoding.Unicode.GetBytes(this.Password.ConvertToUnsecureString());
byte[] all = usernamePart.Concat(passwordPart).ToArray();
var provider = new System.Security.Cryptography.MD5CryptoServiceProvider();
return provider.ComputeHash(all).ToString();

Unfortunately, without more details, it would be difficult to know what the best answer is. One particular detail that's missing here is where the SecureString object comes from. Do you create it for the purpose of performing this hash? Or is the password already represented by the SecureString object, which you are passing to other APIs?
If the former, then it suggests that you already have an unencrypted, non-deterministic-lifetime string in your process containing the password. If the latter, then while the lifetime of the unencrypted version(s) of the password may be deterministic, note that the password still winds up decrypted at various points of execution.
That said, in terms of your specific questions:
How to work out the number of bytes allocated by SecureStringToGlobalAllocUnicode
It seems to me that you should be able to trust that doubling the original text's length would be reliable. The SecureString.Length property returns the number of char objects composing the string, i.e. the number of 16-bit UTF16 values, so the bytes are just twice that. The Length property isn't taking into account Unicode code points that take two 16-bit values (i.e. low and high surrogate), so it should be accurate for byte-length computations.
That said, if you don't trust that…the allocated string should be null terminated, so you can just do a normal scan of the string. Note that if you use the BSTR method for the string, the string is prefixed with a 32-bit byte count (not character count) representing the string, not counting its null terminator; you can retrieve that by subtracting 4 from the IntPtr returned, getting the four bytes there, and converting that back to an int value.
The appropriate function to use when I need n bytes after an IntPtr and don't want to allocate a managed byte[] and use Marshal.Copy
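There are lots of ways to do this. I think one of the simpler approaches is to p/invoke the Windows CopyMemory() function. CopyMemory is a macro in the Windows headers rather than an exported function, so the import binds to the RtlMoveMemory entry point that kernel32.dll exports:
[DllImport("kernel32.dll", EntryPoint = "RtlMoveMemory")]
unsafe extern static void CopyMemory(void* destination, void* source, IntPtr size);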
Just pass the appropriate IntPtr values to the method, using either the IntPtr.ToPointer() method or the explicit conversion to void* that's available. Used like this:
unsafe
{
CopyMemory(IntPtr.Add(unsafeBuffer, usernamePart.Length).ToPointer(),
unmanagedPwd.ToPointer(), new IntPtr(lenPasswordArray));
}
In .NET 4.6 (according to MSDN...I haven't used this myself...still stuck on 4.5), you can (will be able to) use the Buffer.MemoryCopy() method. E.g.:
Buffer.MemoryCopy(unmanagedPwd.ToPointer(),
IntPtr.Add(unsafeBuffer, usernamePart.Length).ToPointer(),
lenPasswordArray,
lenPasswordArray);
(Note that I think you had a typo in your original example; you are adding lenPasswordArray to the unsafeBuffer pointer to determine the location to which to copy the password data. I've corrected that in the above examples, using the user name length instead, since you seem to want to copy the password data immediately after the user name data that has already been copied.)
How to encrypt those bytes
What do you mean by that? Are you asking how to hash the bytes? I.e. run the MD5 hash algorithm on them? Note that that's not encryption; there's no practical way to decrypt the value (MD5 security flaws notwithstanding).
If you simply mean to hash the bytes, you would need an MD5 implementation that could operate on unmanaged memory. I'm not sure whether Windows has an unmanaged MD5 API, but it does have cryptography in general. So you could p/invoke to access those functions. See Cryptographic Service Providers for more details.
I will note that at this point, you now have the unencrypted data in memory, in two different places: the originally decrypted memory block from the call to SecureStringToGlobalAllocUnicode(), and of course the new copy you made copying to the unsafeBuffer. You can control the lifetime of these buffers more closely than you can a System.String object, but other than that you have the same risk during that lifetime of malicious code inspecting your process and recovering the plaintext.
If you mean something other than hashing, please be more specific about how and why you want to "encrypt those bytes".
How to reliably zero out and free anything I've allocated (I'm very new to unsafe code)
I don't know what unsafe has to do with the question. Indeed, except for the places where you need to use void*, your code example itself doesn't need unsafe.
As for zeroing out the memory buffers, the code you have seems to be fine to me. If you want something slightly more efficient than allocating a whole new byte[] buffer just for the purpose of setting another memory location to all zeroes, you can p/invoke the SecureZeroMemory() Windows function instead (similar to the CopyMemory() example above).
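If you would rather stay in managed code and still avoid the temporary byte[], one option is to overwrite the unmanaged block byte by byte before freeing it. The sketch below assumes the buffer came from Marshal.AllocHGlobal. UnmanagedBuffer, Zero and ZeroAndFree are hypothetical names, and the loop is simple rather than fast.

```csharp
using System;
using System.Runtime.InteropServices;

public static class UnmanagedBuffer
{
    // Overwrites byteCount bytes starting at buffer with zeroes.
    public static void Zero(IntPtr buffer, int byteCount)
    {
        for (int i = 0; i < byteCount; i++)
        {
            Marshal.WriteByte(buffer, i, 0);
        }
    }

    // Zeroes the block, frees it, and clears the caller's pointer so it
    // cannot be freed twice by accident.
    public static void ZeroAndFree(ref IntPtr buffer, int byteCount)
    {
        if (buffer == IntPtr.Zero) return;
        try
        {
            Zero(buffer, byteCount);
        }
        finally
        {
            Marshal.FreeHGlobal(buffer);
            buffer = IntPtr.Zero;
        }
    }
}
```

Calling ZeroAndFree from a finally block keeps the cleanup on the exception path as well.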
Now, all of the above said, as I mentioned in the comments, it seems to me that there are ways to do this in managed, safe code, simply by controlling the lifetime of the intermediate objects explicitly yourself. For example:
static string SecureComputeHash(string username, SecureString password)
{
byte[] textBytes = null;
IntPtr textChars = IntPtr.Zero;
try
{
byte[] userNameBytes = Encoding.Unicode.GetBytes(username);
textChars = Marshal.SecureStringToGlobalAllocUnicode(password);
int passwordByteLength = password.Length * 2;
textBytes = new byte[userNameBytes.Length + passwordByteLength];
userNameBytes.CopyTo(textBytes, 0);
Marshal.Copy(textChars, textBytes, userNameBytes.Length, passwordByteLength);
using (MD5CryptoServiceProvider provider = new MD5CryptoServiceProvider())
{
return Convert.ToBase64String(provider.ComputeHash(textBytes));
}
}
finally
{
// Clean up temporary buffers
if (textChars != IntPtr.Zero)
{
Marshal.ZeroFreeGlobalAllocUnicode(textChars);
}
if (textBytes != null)
{
for (int i = 0; i < textBytes.Length; i++)
{
textBytes[i] = 0;
}
}
}
}
(I used base64 encoding to convert your hashed byte[] result to a string. The simple call to ToString() you showed in your example won't do anything useful, as it just returns the type name for a byte[] object. I think base64 is the most efficient, useful way to store the hashed data, but you can of course use any representation you find useful).
The above assumes that your password is already in a SecureString object. Of course, if you are simply initializing a SecureString object from some other non-encrypted object, you could do the above differently, such as creating a char[] directly from the non-encrypted object (which could be e.g. string or StringBuilder).
I don't see how your unmanaged approach would be noticeably better than the above.
The only exception I can see is if you are worried that the MD5CryptoServiceProvider class might leave some copy of your data in its own internal data structures. That could be a valid concern, but then you would also have that concern for your unmanaged approach too, since you haven't shown what MD5 implementation you would actually use there (you would have to make sure whatever implementation you use is careful about not leaving copies of your data).
Personally, I suspect (but don't know for sure) that given the word "crypto" in the MD5CryptoServiceProvider class name, that class is careful to clear temporary in-memory buffers.
Other than that possible concern, the entirely managed approach accomplishes the same thing, with IMHO less fuss.

How are String and Char types stored in memory in .NET?

I'd need to store a language code string, such as "en", which will always contains 2 characters.
Is it better to define the type as "String" or "Char"?
private string languageCode;
vs
private char[] languageCode;
Or is there another, better option?
How are these 2 stored in memory? How many bytes or bits will be allocated to them when values are assigned?

How They Are Stored
Both the string and the char[] are stored on the heap - so storage is the same. Internally I would assume a string simply is a cover for char[] with lots of extra code to make it useful for you.
Also if you have lots of repeating strings, you can make use of Interning to reduce the memory footprint of those strings.
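As a small illustration of interning (InternDemo is a hypothetical name), two separately built strings with the same content are distinct objects until they are passed through string.Intern, which returns one shared instance per distinct content. The sketch assumes a non-empty input.

```csharp
using System;

public static class InternDemo
{
    // Returns true when two equal-content strings are separate objects
    // before interning and the same object afterwards.
    public static bool SharesInstanceAfterIntern(string text)
    {
        // Build two distinct string objects with identical content.
        string a = new string(text.ToCharArray());
        string b = new string(text.ToCharArray());
        bool distinctBefore = !ReferenceEquals(a, b);

        // string.Intern returns the single pooled instance for that content.
        bool sameAfter = ReferenceEquals(string.Intern(a), string.Intern(b));
        return distinctBefore && sameAfter;
    }
}
```

Keep in mind that interned strings stay in the pool for a long time, so interning suits a small fixed vocabulary such as language codes better than arbitrary input.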
The Better Option
I would favour string - it is immediately more apparent what the data type is and how you intend to use it. People are also more accustomed to using strings so maintainability won't suffer. You will also benefit greatly from all the boilerplate code that has been done for you. Microsoft have also put a lot of effort in to make sure the string type is not a performance hog.
The Allocation Size
I don't know exactly how much is allocated, but I believe strings are quite efficient in that they only allocate enough to store the Unicode characters; as they are immutable, it is safe to do this. Arrays also cannot be resized without allocating the space in a new array, so I'd again assume they grab only what they need.
See also: Overhead of a .NET array?
Alternatives
Based on your information that there are only 20 language codes and performance is key, you could declare your own enum in order to reduce the size required to represent the codes:
enum LanguageCode : byte
{
en = 0,
}
This will only take 1 byte as opposed to 4+ for two char (in an array), but it does limit the range of available LanguageCode values to the range of byte - which is more than big enough for 20 items.
You can see the size of value types using the sizeof() operator: sizeof(LanguageCode). Enums are nothing but the underlying type under the hood, they default to int, but as you can see in my code sample you can change that by "inheriting" a new type.
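A slightly fuller sketch of the enum approach follows. The extra members fr and de and the helper class LanguageCodes are made up for illustration. It shows the size check mentioned above and one way to convert between the enum and the two-letter text.

```csharp
using System;

// Byte-backed enum: one byte per value instead of a string reference.
public enum LanguageCode : byte
{
    en = 0,
    fr = 1, // illustrative extra entries
    de = 2,
}

public static class LanguageCodes
{
    // Size of the enum in bytes; equals the size of its underlying type.
    public static int SizeInBytes()
    {
        return sizeof(LanguageCode);
    }

    // Converts "en", "fr", ... into the enum value (throws on unknown names).
    public static LanguageCode Parse(string code)
    {
        return (LanguageCode)Enum.Parse(typeof(LanguageCode), code);
    }

    // Converts the enum value back into its two-letter name.
    public static string ToCode(LanguageCode code)
    {
        return code.ToString();
    }
}
```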

Short answer: Use string
Long answer:
private string languageCode;
AFAIK strings are stored as a length-prefixed array of chars. A String object is instantiated on the heap to maintain this raw array. But a String object is much more than a simple array: it enables basic string operations like comparison, concatenation, substring extraction, search, etc.
While
private char[] languageCode;
will be stored as an Array of chars, i.e. an Array object will be created on the heap and then used to manage your characters. But it still has a length attribute which is stored internally, so there are no apparent savings in memory when compared to a string. Presumably an Array is simpler than a String and may have fewer internal variables, thus offering a lower memory footprint (this needs to be verified).
But OTOH you lose the ability to perform string operations on this char array. Even operations like string comparison become cumbersome. So, long story short, use a string!

How are these 2 stored in memory? How many bytes or bits will be allocated to them when values are assigned?
Every instance in .NET is stored as follows: one IntPtr-sized field for the type identifier; one more for locking on the instance; the remainder is instance field data rounded up to an IntPtr-sized amount. Hence, on a 32-bit platform every instance occupies 8 bytes + field data.
This applies to both a string and a char[]. Both of these also store the length of the data as an IntPtr-sized integer, followed by the actual data. Thus, a two-character string and a two-character char[], on a 32-bit platform, will occupy 8+4+4 = 16 bytes.
The only way to reduce this when storing exactly two characters is to store the actual characters, or a struct containing the characters, in a field or an array. All of these would consume only 4 bytes for the characters:
// Option 1
class MyClass
{
char Char1, Char2;
}
// Option 2
class MyClass
{
CharStruct chars;
}
...
struct CharStruct { public char Char1; public char Char2; }
MyClass will end up using 8 bytes (on a 32-bit machine) per instance plus the 4 bytes for the chars.
// Option 3
class MyClass
{
CharStruct[] chars;
}
This will use 8 bytes for the MyClass overhead, plus 4 bytes for the chars reference, plus 12 bytes for the array's overhead, plus 4 bytes per CharStruct in the array.

If you want to store exactly 2 chars, and do it most efficiently, use a struct:
struct Char2
{
public char C1, C2;
}
Using this struct will generally not cause new heap allocations. It will just upsize an existing object (by the minimum possible amount) or consume stack space which is very cheap.
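To make such a struct convenient to use, it could carry its own conversions to and from string. The sketch below is one possible shape: the constructor and ToString are additions for illustration, and ToString does allocate a new string each time it is called.

```csharp
using System;

public struct Char2
{
    public char C1, C2;

    // Builds the pair from a two-character string such as "en".
    public Char2(string code)
    {
        if (code == null || code.Length != 2)
        {
            throw new ArgumentException("Expected exactly two characters.", "code");
        }
        C1 = code[0];
        C2 = code[1];
    }

    // Allocates a string only when text is actually needed.
    public override string ToString()
    {
        return new string(new[] { C1, C2 });
    }
}
```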

Strings indeed have a size overhead of one pointer length, i.e. 4 bytes for a 32-bit process, 8 bytes for a 64-bit process. But then again, strings offer so much more in return than char arrays.
If your application uses many short strings and you don't need to use their string properties and methods that often, you could probably save a few bytes of memory. But if you want to use any of them as a string, you will first have to create a new string instance. I can't see how this will help you save enough memory to be worth the trouble.

String implements an indexer of type char internally, and we can say that a string is roughly equivalent to a char[] with lots of extra code to make it useful for you; hence, like an array, it is always stored on the heap.
An array cannot be resized without allocating new space for it; the same is true of a string, which in addition is immutable.
String implements IEnumerable<char>.
A point worth noting: when you pass a string to a function, the reference to it is passed by value unless ref is used.

Unsafe string creation from char[]

I'm working on high-performance code in which this construct is part of the performance-critical section.
This is what happens in some section:
A string is 'scanned' and metadata is stored efficiently.
Based upon this metadata chunks of the main string are separated into a char[][].
That char[][] should be transferred into a string[].
Now, I know you can just call new string(char[]) but then the result would have to be copied.
To avoid this extra copy step, I guess it must be possible to write directly to the string's internal buffer, even though this would be an unsafe operation (and I know this brings lots of implications, like overflow and forward compatibility).
I've seen several ways of achieving this, but none I'm really satisfied with.
Does anyone have true suggestions as to how to achieve this?
Extra information:
The actual process doesn't include converting to char[] necessarily, it's practically a 'multi-substring' operation. Like 3 indexes and their lengths appended.
The StringBuilder has too much overhead for the small number of concats.
EDIT:
Due to some vague aspects of what it is exactly that I'm asking, let me reformulate it.
This is what happens:
Main string is indexed.
Parts of the main string are copied to a char[].
The char[] is converted to a string.
What I'd like to do is merge step 2 and 3, resulting in:
Main string is indexed.
Parts of the main string are copied to a string (and the GC can keep its hands off of it during the process by proper use of the fixed keyword?).
And a note is that I cannot change the output type from string[], since this is an external library, and projects depend on it (backward compatibility).

I think that what you are asking to do is to 'carve up' an existing string in-place into multiple smaller strings without re-allocating character arrays for the smaller strings. This won't work in the managed world.
For one reason why, consider what happens when the garbage collector comes by and collects or moves the original string during a compaction- all of those other strings 'inside' of it are now pointing at some arbitrary other memory, not the original string you carved them out of.
EDIT: In contrast to the character-poking involved in Ben's answer (which is clever but IMHO a bit scary), you can allocate a StringBuilder with a pre-defined capacity, which eliminates the need to re-allocate the internal arrays. See http://msdn.microsoft.com/en-us/library/h1h0a5sy.aspx.
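A minimal sketch of the pre-sized StringBuilder idea, assuming the slices are described by parallel arrays of start indexes and lengths (SliceJoiner and its parameter names are hypothetical):

```csharp
using System;
using System.Text;

public static class SliceJoiner
{
    // starts[i] and lengths[i] describe the i-th slice of source.
    public static string Join(string source, int[] starts, int[] lengths)
    {
        // Add up the final length first so the builder is sized exactly once.
        int total = 0;
        for (int i = 0; i < lengths.Length; i++)
        {
            total += lengths[i];
        }

        StringBuilder sb = new StringBuilder(total);
        for (int i = 0; i < starts.Length; i++)
        {
            // Appends the slice directly, without an intermediate substring.
            sb.Append(source, starts[i], lengths[i]);
        }
        return sb.ToString();
    }
}
```

This still copies the characters once into the builder and once into the final string, so whether it beats new string(char[]) for a handful of slices is something to confirm with a profiler.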

What happens if you do:
string s = GetBuffer();
fixed (char* pch = s) {
pch[0] = 'R';
pch[1] = 'e';
pch[2] = 's';
pch[3] = 'u';
pch[4] = 'l';
pch[5] = 't';
}
I think the world will come to an end (Or at least the .NET managed portion of it), but that's very close to what StringBuilder does.
Do you have profiler data to show that StringBuilder isn't fast enough for your purposes, or is that an assumption?

Just create your own addressing system instead of trying to use unsafe code to map to an internal data structure.
Mapping a string (which is also readable as a char[]) to an array of smaller strings is no different from building a list of address information (index & length of each substring). So make a new List<Tuple<int,int>> instead of a string[] and use that data to return the correct string from your original, unaltered data structure. This could easily be encapsulated into something that exposed string[].
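One way this addressing scheme might look is sketched below. SliceIndex is a hypothetical wrapper: it records (index, length) pairs against the unaltered source and only materializes a substring when one is requested, while ToArray provides the string[] that existing callers expect.

```csharp
using System;
using System.Collections.Generic;

public sealed class SliceIndex
{
    private readonly string source;
    private readonly List<Tuple<int, int>> slices = new List<Tuple<int, int>>();

    public SliceIndex(string source)
    {
        this.source = source;
    }

    // Records where a substring lives without copying any characters.
    public void Add(int start, int length)
    {
        slices.Add(Tuple.Create(start, length));
    }

    public int Count { get { return slices.Count; } }

    // Materializes the i-th substring on demand.
    public string this[int i]
    {
        get
        {
            Tuple<int, int> slice = slices[i];
            return source.Substring(slice.Item1, slice.Item2);
        }
    }

    // Produces the string[] shape required for backward compatibility.
    public string[] ToArray()
    {
        string[] result = new string[slices.Count];
        for (int i = 0; i < result.Length; i++)
        {
            result[i] = this[i];
        }
        return result;
    }
}
```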

In .NET, there is no way to create an instance of String which shares data with another string. Some discussion on why that is appears in this comment from Eric Lippert.

Danger of C# Substring method?

Recently I have been reading up on some of the flaws with the Java substring method - specifically relating to memory, and how java keeps a reference to the original string. Ironically I am also developing a server application that uses C# .Net's implementation of substring many tens of times in a second. That got me thinking...
Are there memory issues with the C# (.Net) string.Substring?
What is the performance like on string.Substring? Is there a faster way to split a string based on start/end position?

Looking at .NET's implementation of String.Substring, a substring does not share memory with the original.
private unsafe string InternalSubString(int startIndex, int length, bool fAlwaysCopy)
{
if (((startIndex == 0) && (length == this.Length)) && !fAlwaysCopy)
{
return this;
}
// Allocate new (separate) string
string str = FastAllocateString(length);
// Copy chars from old string to new string
fixed (char* chRef = &str.m_firstChar)
{
fixed (char* chRef2 = &this.m_firstChar)
{
wstrcpy(chRef, chRef2 + startIndex, length);
}
}
return str;
}

Every time you use substring you create a new string instance: it has to copy the characters from the old string to the new one, along with the associated new memory allocation, and don't forget that these are Unicode characters. This may or may not be a bad thing; at some point you want to use these characters somewhere anyway. Depending on what you're doing, you might want your own method that merely finds the proper indexes within the string that you can then use later.

Just to add another perspective on this.
Out of memory (most times) does not mean you've used up all the memory. It means that your memory has been fragmented and the next time you want to allocate a chunk the system is unable to find a contiguous chunk of memory to fit your needs.
Frequent allocations/deallocations will cause memory fragmentation. The GC may not be in a position to de-fragment in time due to the kinds of operations you do. I know the Server GC in .NET is pretty good about de-fragmenting memory, but you could always starve the system (preventing the GC from doing a collect) by writing bad code.

It is always good to try it out and measure the elapsed milliseconds.
Stopwatch watch = new Stopwatch();
watch.Start();
// run string.Substring code
watch.Stop();
// ElapsedMilliseconds is a property, not a method
long elapsedMs = watch.ElapsedMilliseconds;
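A fuller version of the timing idea, as a minimal sketch (SubstringTiming and its parameters are hypothetical, and a single Stopwatch run like this gives only a rough figure, not a rigorous benchmark):

```csharp
using System;
using System.Diagnostics;

public static class SubstringTiming
{
    // Times `iterations` calls to Substring and returns elapsed milliseconds.
    public static long MeasureMilliseconds(string text, int start, int length, int iterations)
    {
        long checksum = 0; // consume each result so the work is not trivially dead

        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            checksum += text.Substring(start, length).Length;
        }
        watch.Stop();

        if (checksum != (long)length * iterations)
        {
            throw new InvalidOperationException("Unexpected substring length.");
        }
        return watch.ElapsedMilliseconds;
    }
}
```

Running the measurement several times and discarding the first run helps reduce the effect of JIT compilation on the numbers.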

In the case of the Java memory leak one may experience when using subString, it's easily fixed by instantiating a new String object with the copy constructor (that is a call of the form "new String(String)"). By using that you can discard all references to the original (and in the case that this is actually an issue, rather large) String, and maintain only the parts of it you need in memory.
Not ideal, in theory the JVM could be more clever and compress the String object (as was suggested above), but this gets the job done with what we have now.
As for C#, as has been said, this problem doesn't exist.

The CLR (hence C#'s) implementation of Substring does not retain a reference to the source string, so it does not have the "memory leak" problem of Java strings.

Most of these types of string issues arise because String is immutable. The StringBuilder class is intended for when you are doing a lot of string manipulations:
http://msdn.microsoft.com/en-us/library/2839d5h5(VS.71).aspx
Note that the real issue is memory allocation rather than CPU, although excessive memory alloc does take CPU...

I seem to recall that the strings in Java were stored as the actual characters along with a start and length.
This means that a substring string can share the same characters (since they're immutable) and only have to maintain a separate start and length.
So I'm not entirely certain what your memory issues are with the Java strings.
Regarding that article posted in your edit, it seems a bit of a non-issue to me.
Unless you're in the habit of making huge strings, then taking a small substring of them and leaving those lying around, this will have near-zero impact on memory.
Even if you had a 10M string and you made 400 substrings, you're only using that 10M for the underlying char array - it's not making 400 copies of that substring. The only memory impact is the start/length bit of each substring object.
The author seems to be complaining that they read a huge string into memory then only wanted a bit of it, but the entire thing was kept - my suggestion would be that they might want to rethink how they process their data :-)
To call this a Java bug is a huge stretch as well. A bug is something that doesn't work to specification. This was a deliberate design decision to improve performance, running out of memory because you don't understand how things work is not a bug, IMNSHO. And it's definitely not a memory leak.
There was one possible good suggestion in the comments to that article, that the GC could more aggressively recover bits of unused strings by compressing them.
This is not something you'd want to do on a first pass GC since it would be relatively expensive. However, where every other GC operation had failed to reclaim enough space, you could do it.
Unfortunately it would almost certainly mean that the underlying char array would need to keep a record of all the string objects that referenced it, so it could both figure out what bits were unused and modify all the string object start and length fields.
This in itself may introduce unacceptable performance impacts and, on top of that, if your memory is so short for this to be a problem, you may not even be able to allocate enough space for a smaller version of the string.
I think, if the memory's running out, I'd probably prefer not to be maintaining this char-array-to-string mapping to make this level of GC possible, instead I would prefer that memory to be used for my strings.
Since there is a perfectly acceptable workaround, and good coders should know about the foibles of their language of choice, I suspect the author is right - it won't be fixed.
Not because the Java developers are too lazy, but because it's not a problem.
You're free to implement your own string methods which match the C# ones (which don't share the underlying data except in certain limited scenarios). This will fix your memory problems but at the cost of a performance hit, since you have to copy the data every time you call substring. As with most things in IT (and life), it's a trade-off.

For profiling memory while developing you can use this code:
bool forceFullCollection = false;
Int64 valTotalMemoryBefore = System.GC.GetTotalMemory(forceFullCollection);
//call String.Substring
Int64 valTotalMemoryAfter = System.GC.GetTotalMemory(forceFullCollection);
Int64 valDifferenceMemorySize = valTotalMemoryAfter - valTotalMemoryBefore;
About parameter forceFullCollection: "If the forceFullCollection parameter is true, this method waits a short interval before returning while the system collects garbage and finalizes objects. The duration of the interval is an internally specified limit determined by the number of garbage collection cycles completed and the change in the amount of memory recovered between cycles. The garbage collector does not guarantee that all inaccessible memory is collected." GC.GetTotalMemory Method
Good luck!;)
