Unsafe string creation from char[] - c#

I'm working on high-performance code in which this construct is part of the performance-critical section.
This is what happens in some section:
A string is 'scanned' and metadata is stored efficiently.
Based upon this metadata chunks of the main string are separated into a char[][].
That char[][] should be transferred into a string[].
Now, I know you can just call new string(char[]) but then the result would have to be copied.
To avoid this extra copy step I guess it must be possible to write directly to the string's internal buffer, even though this would be an unsafe operation (and I know this brings lots of implications, like buffer overflow and forward compatibility).
I've seen several ways of achieving this, but none I'm really satisfied with.
Does anyone have concrete suggestions as to how to achieve this?
Extra information:
The actual process doesn't necessarily involve converting to char[]; it's practically a 'multi-substring' operation, like taking 3 indexes with their lengths and appending the results.
The StringBuilder has too much overhead for the small number of concats.
EDIT:
Due to some vague aspects of what it is exactly that I'm asking, let me reformulate it.
This is what happens:
Main string is indexed.
Parts of the main string are copied to a char[].
The char[] is converted to a string.
What I'd like to do is merge step 2 and 3, resulting in:
Main string is indexed.
Parts of the main string are copied to a string (and the GC can keep its hands off of it during the process by proper use of the fixed keyword?).
And a note is that I cannot change the output type from string[], since this is an external library, and projects depend on it (backward compatibility).

I think that what you are asking to do is to 'carve up' an existing string in-place into multiple smaller strings without re-allocating character arrays for the smaller strings. This won't work in the managed world.
For one reason why, consider what happens when the garbage collector comes by and collects or moves the original string during a compaction- all of those other strings 'inside' of it are now pointing at some arbitrary other memory, not the original string you carved them out of.
EDIT: In contrast to the character-poking involved in Ben's answer (which is clever but IMHO a bit scary), you can allocate a StringBuilder with a pre-defined capacity, which eliminates the need to re-allocate the internal arrays. See http://msdn.microsoft.com/en-us/library/h1h0a5sy.aspx.
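A minimal sketch of that pre-sized StringBuilder approach (the capacity value here is illustrative):

```csharp
using System;
using System.Text;

class CapacityDemo
{
    static void Main()
    {
        // Pre-size the builder so the internal buffer is never re-allocated
        // while appending (16 is just an illustrative capacity).
        var sb = new StringBuilder(capacity: 16);
        sb.Append("Res");
        sb.Append("ult");
        Console.WriteLine(sb.ToString()); // Result
    }
}
```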

What happens if you do:
string s = GetBuffer();
fixed (char* pch = s) {
    pch[0] = 'R';
    pch[1] = 'e';
    pch[2] = 's';
    pch[3] = 'u';
    pch[4] = 'l';
    pch[5] = 't';
}
I think the world will come to an end (or at least the .NET managed portion of it), but that's very close to what StringBuilder does.
Do you have profiler data to show that StringBuilder isn't fast enough for your purposes, or is that an assumption?

Just create your own addressing system instead of trying to use unsafe code to map to an internal data structure.
Mapping a string (which is also readable as a char[]) to an array of smaller strings is no different from building a list of address information (index & length of each substring). So make a new List<Tuple<int,int>> instead of a string[] and use that data to return the correct string from your original, unaltered data structure. This could easily be encapsulated into something that exposed string[].
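A minimal sketch of this encapsulation, assuming a simple Tuple-based range list (the class and method names are illustrative):

```csharp
using System;
using System.Collections.Generic;

// Store (index, length) pairs against the original string and only
// materialize actual substrings when a caller needs them.
class SubstringMap
{
    private readonly string _source;
    private readonly List<Tuple<int, int>> _ranges = new List<Tuple<int, int>>();

    public SubstringMap(string source) { _source = source; }

    public void Add(int index, int length) =>
        _ranges.Add(Tuple.Create(index, length));

    // Materializes a string[] on demand, preserving the existing
    // string[] contract of the external library.
    public string[] ToStringArray()
    {
        var result = new string[_ranges.Count];
        for (int i = 0; i < _ranges.Count; i++)
            result[i] = _source.Substring(_ranges[i].Item1, _ranges[i].Item2);
        return result;
    }
}

class Program
{
    static void Main()
    {
        var map = new SubstringMap("HelloWorld");
        map.Add(0, 5);
        map.Add(5, 5);
        Console.WriteLine(string.Join(",", map.ToStringArray())); // Hello,World
    }
}
```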

In .NET, there is no way to create an instance of String which shares data with another string. Some discussion on why that is appears in this comment from Eric Lippert.

Related

Is it faster to access char in string via [] operator or faster to access char in char[] via [] operator?

I have a simple question. I'm working with a gigantic string in C# and I'm repeatedly going through it and accessing individual characters via the [] operator. Is it faster to turn the string into a char[] and use the [] operator on the char array or is my approach faster? Or are they both the same? I want my program to be bleeding edge fast.
If you need the absolute fastest implementation, then you can drop to unsafe and use pointers:
string s = ...
int len = s.Length;
fixed (char* ptr = s)
{
    // talk to ptr[0] etc; DO NOT go outside of ptr[0] <---> ptr[len-1]
}
that then avoids range checking but:
requires unsafe
you are to blame if you go outside bounds
First of all: you should measure. No one can answer this question definitively without measuring.
In this particular case everything indicates that accessing the string directly is better: strings are implemented as arrays of characters in pretty much every language, and converting the string to a char[] manually would require an immediate allocation of another gigantic amount of memory. That can't be good.
But I wouldn't even take my own word for it. Measure.
One obvious drawback to converting a string to char[] is the need to copy: since strings in C# are immutable while arrays are mutable, your code would end up duplicating your gigantic string in its entirety! This will almost certainly dwarf the potential speed gains, if any, to be had after the conversion.
A string is essentially a char[] under the hood, so in terms of raw element access there is little difference.
The bigger problem here is that trying to optimize this far down in the stack is probably wasted time, there are probably better improvements to be made elsewhere.

System.String underlying implementation

I was recently trying to do the following in c#
string str = "u r awesome";
str[0]="i";
And it wouldn't work because apparently str[i] is only a get not a set, so I was wondering what the underlying implementation of string is that would force str[i] to only be a get.
Isn't it just a managed wrapper for a char *? So then why can't I set str[i]?
You can't set characters of a string because the .NET String class is immutable -- that means that its contents cannot be changed after it is created. This allows the same string instance to be used many times safely, without one object worrying that another object is going to stomp on its strings.
If you need a mutable class that lets you manipulate a string of characters, consider using StringBuilder instead.
If you want to compare to C, the String type is like const char * except that you cannot just cast away the constness. StringBuilder is more like a char * (with automatic allocation resizing) and with a method (ToString()) to create a new, independent String instance from its contents.
The answers the others gave concerning immutability are of course correct and are the "actual" cause of the issue you're having.
Since you specifically asked about the underlying implementation (and if just out of curiosity), and as a reference to others that might stumble upon this question, here is some more information about that topic from Eric Lippert:
"In the .NET CLR, strings are laid out in memory pretty much the same
way that BSTRs were implemented in OLE Automation: as a word-aligned
memory buffer consisting of a four-byte integer giving the length of
the string, followed by the characters of the string in two-byte
chunks of UTF-16 data, followed by two zero bytes."
Note the "pretty much" part here, however BSTR themselves are also explained in Eric's blog.
Mind you, that all of this should be considered an implementation detail. And even though it shouldn't really concern most of us, it might help though during debugging interop issues or in general understanding.
As cdhowie answered, a C# string is not the same as the concept of a string in C/C++.
If you want to do the above, as a suggestion you can mimic mutability through a container such as the one below:
List<char> str = new List<char>("u r awesome");
str[0] = 'i';
str[2] = 'm';
Console.WriteLine(str.ToArray());

C# better way to do this?

Hi I have this code below and am looking for a prettier/faster way to do this.
Thanks!
string value = "HelloGoodByeSeeYouLater";
string[] y = new string[]{"Hello", "You"};
foreach (string x in y)
{
    value = value.Replace(x, "");
}
You could do:
y.ToList().ForEach(x => value = value.Replace(x, ""));
Although I think your variant is more readable.
Forgive me, but someone's gotta say it,
value = Regex.Replace( value, string.Join("|", y.Select(Regex.Escape)), "" );
Possibly faster, since it creates fewer strings.
EDIT: Credit to Gabe and lasseespeholt for Escape and Select.
While not any prettier, there are other ways to express the same thing.
In LINQ:
value = y.Aggregate(value, (acc, x) => acc.Replace(x, ""));
With String methods:
value = String.Join("", value.Split(y, StringSplitOptions.None));
I don't think anything is going to be faster in managed code than a simple Replace in a foreach though.
It depends on the size of the string you are searching. The foreach example is perfectly fine for small operations but creates a new instance of the string each time it operates because the string is immutable. It also requires searching the whole string over and over again in a linear fashion.
The basic solutions have all been proposed. The Linq examples provided are good if you are comfortable with that syntax; I also liked the suggestion of an extension method, although that is probably the slowest of the proposed solutions. I would avoid a Regex unless you have an extremely specific need.
So let's explore more elaborate solutions and assume you needed to handle a string that was thousands of characters in length and had many possible words to be replaced. If this doesn't apply to the OP's need, maybe it will help someone else.
Method #1 is geared towards large strings with few possible matches.
Method #2 is geared towards short strings with numerous matches.
Method #1
I have handled large-scale parsing in C# using char arrays and pointer math with intelligent seek operations that are optimized for the length and potential frequency of the term being searched for. It follows the methodology of:
Extremely cheap Peeks one character at a time
Only investigate potential matches
Modify output when match is found
For example, you might read through the whole source array and only add words to the output when they are NOT found. This would remove the need to keep redimensioning strings.
A simple example of this technique is looking for a closing HTML tag in a DOM parser. For example, I may read an opening STYLE tag and want to skip through (or buffer) thousands of characters until I find a closing STYLE tag.
This approach provides incredibly high performance, but it's also incredibly complicated if you don't need it (plus you need to be well-versed in memory manipulation/management or you will create all sorts of bugs and instability).
I should note that the .Net string libraries are already incredibly efficient but you can optimize this approach for your own specific needs and achieve better performance (and I have validated this firsthand).
Method #2
Another alternative involves storing search terms in a Dictionary containing Lists of strings. Basically, you decide how long your search prefix needs to be, and read characters from the source string into a buffer until you meet that length. Then, you search your dictionary for all terms that match that string. If a match is found, you explore further by iterating through that List, if not, you know that you can discard the buffer and continue.
Because the Dictionary matches strings based on hash, the search is non-linear and ideal for handling a large number of possible matches.
I'm using this methodology to allow instantaneous (<1ms) searching of every airfield in the US by name, state, city, FAA code, etc. There are 13K airfields in the US, and I've created a map of about 300K permutations (again, a Dictionary with prefixes of varying lengths, each corresponding to a list of matches).
For example, Phoenix, Arizona's main airfield is called Sky Harbor with the short ID of KPHX. I store:
KP
KPH
KPHX
Ph
Pho
Phoe
Ar
Ari
Ariz
Sk
Sky
Ha
Har
Harb
There is a cost in terms of memory usage, but string interning probably reduces this somewhat and the resulting speed justifies the memory usage on data sets of this size. Searching happens as the user types and is so fast that I have actually introduced an artificial delay to smooth out the experience.
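A minimal sketch of this prefix-dictionary idea (the data and the minimum prefix length of 2 are illustrative assumptions, not the author's actual implementation):

```csharp
using System;
using System.Collections.Generic;

// Map every prefix of each search term to the records that match it,
// so lookups are hash-based rather than linear scans of the source.
class PrefixIndex
{
    private readonly Dictionary<string, List<string>> _index =
        new Dictionary<string, List<string>>(StringComparer.OrdinalIgnoreCase);

    public void Add(string term, string record)
    {
        // Index every prefix of the term, from length 2 upward.
        for (int len = 2; len <= term.Length; len++)
        {
            string prefix = term.Substring(0, len);
            if (!_index.TryGetValue(prefix, out var bucket))
                _index[prefix] = bucket = new List<string>();
            bucket.Add(record);
        }
    }

    public List<string> Search(string query) =>
        _index.TryGetValue(query, out var bucket) ? bucket : new List<string>();
}

class Program
{
    static void Main()
    {
        var index = new PrefixIndex();
        index.Add("KPHX", "Sky Harbor");
        index.Add("Phoenix", "Sky Harbor");
        Console.WriteLine(index.Search("KPH")[0]); // Sky Harbor
    }
}
```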
Send me a message if you have the need to dig into these methodologies.
Extension method for elegance
(arguably "prettier" at the call level)
I'll implement an extension method that allows you to call your implementation directly on the original string as seen here.
value = value.Remove(y);
// or
value = value.Remove("Hello", "You");
// effectively
string value = "HelloGoodByeSeeYouLater".Remove("Hello", "You");
The extension method is callable on any string value in fact, and therefore easily reusable.
Implementation of Extension method:
I'm going to wrap your own implementation (shown in your question) in an extension method for pretty or elegant points and also employ the params keyword to provide some flexibility in passing the arguments. You can substitute somebody else's faster implementation body into this method.
static class EXTENSIONS
{
    public static string Remove(this string thisString, params string[] arrItems)
    {
        // Whatever implementation you like:
        if (thisString == null)
            return null;
        var temp = thisString;
        foreach (string x in arrItems)
            temp = temp.Replace(x, "");
        return temp;
    }
}
That's the brightest idea I can come up with right now that nobody else has touched on.

Danger of C# Substring method?

Recently I have been reading up on some of the flaws with the Java substring method - specifically relating to memory, and how java keeps a reference to the original string. Ironically I am also developing a server application that uses C# .Net's implementation of substring many tens of times in a second. That got me thinking...
Are there memory issues with the C# (.Net) string.Substring?
What is the performance like on string.Substring? Is there a faster way to split a string based on start/end position?
Looking at .NET's implementation of String.Substring, a substring does not share memory with the original.
private unsafe string InternalSubString(int startIndex, int length, bool fAlwaysCopy)
{
    if (((startIndex == 0) && (length == this.Length)) && !fAlwaysCopy)
    {
        return this;
    }
    // Allocate new (separate) string
    string str = FastAllocateString(length);
    // Copy chars from old string to new string
    fixed (char* chRef = &str.m_firstChar)
    {
        fixed (char* chRef2 = &this.m_firstChar)
        {
            wstrcpy(chRef, chRef2 + startIndex, length);
        }
    }
    return str;
}
Every time you use Substring you create a new string instance - it has to copy the characters from the old string to the new one, along with the associated new memory allocation - and don't forget that these are Unicode characters. This may or may not be a bad thing - at some point you want to use those characters somewhere anyway. Depending on what you're doing, you might want your own method that merely finds the proper indexes within the string that you can then use later.
Just to add another perspective on this.
Out of memory (most times) does not mean you've used up all the memory. It means that your memory has been fragmented and the next time you want to allocate a chunk the system is unable to find a contiguous chunk of memory to fit your needs.
Frequent allocations/deallocations will cause memory fragmentation. The GC may not be in a position to de-fragment in time due to the kinds of operations you do. I know the Server GC in .NET is pretty good about de-fragmenting memory, but you could always starve the system (preventing the GC from doing a collect) by writing bad code.
It is always good to try it out and measure the elapsed milliseconds.
Stopwatch watch = new Stopwatch();
watch.Start();
// run the string.Substring code
watch.Stop();
long elapsed = watch.ElapsedMilliseconds; // ElapsedMilliseconds is a property, not a method
In the case of the Java memory leak one may experience when using subString, it's easily fixed by instantiating a new String object with the copy constructor (that is a call of the form "new String(String)"). By using that you can discard all references to the original (and in the case that this is actually an issue, rather large) String, and maintain only the parts of it you need in memory.
Not ideal, in theory the JVM could be more clever and compress the String object (as was suggested above), but this gets the job done with what we have now.
As for C#, as has been said, this problem doesn't exist.
The CLR (hence C#'s) implementation of Substring does not retain a reference to the source string, so it does not have the "memory leak" problem of Java strings.
most of these type of string issues are because String is immutable. The StringBuilder class is intended for when you are doing a lot of string manipulations:
http://msdn.microsoft.com/en-us/library/2839d5h5(VS.71).aspx
Note that the real issue is memory allocation rather than CPU, although excessive memory alloc does take CPU...
I seem to recall that the strings in Java were stored as the actual characters along with a start and length.
This means that a substring string can share the same characters (since they're immutable) and only have to maintain a separate start and length.
So I'm not entirely certain what your memory issues are with the Java strings.
Regarding that article posted in your edit, it seems a bit of a non-issue to me.
Unless you're in the habit of making huge strings, then taking a small substring of them and leaving those lying around, this will have near-zero impact on memory.
Even if you had a 10M string and you made 400 substrings, you're only using that 10M for the underlying char array - it's not making 400 copies of that substring. The only memory impact is the start/length bit of each substring object.
The author seems to be complaining that they read a huge string into memory then only wanted a bit of it, but the entire thing was kept - my suggestion would be that they might want to rethink how they process their data :-)
To call this a Java bug is a huge stretch as well. A bug is something that doesn't work to specification. This was a deliberate design decision to improve performance, running out of memory because you don't understand how things work is not a bug, IMNSHO. And it's definitely not a memory leak.
There was one possible good suggestion in the comments to that article, that the GC could more aggressively recover bits of unused strings by compressing them.
This is not something you'd want to do on a first pass GC since it would be relatively expensive. However, where every other GC operation had failed to reclaim enough space, you could do it.
Unfortunately it would almost certainly mean that the underlying char array would need to keep a record of all the string objects that referenced it, so it could both figure out what bits were unused and modify all the string object start and length fields.
This in itself may introduce unacceptable performance impacts and, on top of that, if your memory is so short for this to be a problem, you may not even be able to allocate enough space for a smaller version of the string.
I think, if the memory's running out, I'd probably prefer not to be maintaining this char-array-to-string mapping to make this level of GC possible, instead I would prefer that memory to be used for my strings.
Since there is a perfectly acceptable workaround, and good coders should know about the foibles of their language of choice, I suspect the author is right - it won't be fixed.
Not because the Java developers are too lazy, but because it's not a problem.
You're free to implement your own string methods which match the C# ones (which don't share the underlying data except in certain limited scenarios). This will fix your memory problems but at the cost of a performance hit, since you have to copy the data every time you call substring. As with most things in IT (and life), it's a trade-off.
For profiling memory while developing you can use this code:
bool forceFullCollection = false;
Int64 valTotalMemoryBefore = System.GC.GetTotalMemory(forceFullCollection);
//call String.Substring
Int64 valTotalMemoryAfter = System.GC.GetTotalMemory(forceFullCollection);
Int64 valDifferenceMemorySize = valTotalMemoryAfter - valTotalMemoryBefore;
About parameter forceFullCollection: "If the forceFullCollection parameter is true, this method waits a short interval before returning while the system collects garbage and finalizes objects. The duration of the interval is an internally specified limit determined by the number of garbage collection cycles completed and the change in the amount of memory recovered between cycles. The garbage collector does not guarantee that all inaccessible memory is collected." GC.GetTotalMemory Method
Good luck!;)

Are C# Strings (and other .NET API's) limited to 2GB in size?

Today I noticed that C#'s String class returns the length of a string as an int. Since an int is always 32 bits, no matter what the architecture, does this mean that a string can only be 2GB or less in length?
A 2GB string would be very unusual, and present many problems along with it. However, most .NET api's seem to use 'int' to convey values such as length and count. Does this mean we are forever limited to collection sizes which fit in 32-bits?
Seems like a fundamental problem with the .NET APIs. I would have expected things like count and length to be returned via the equivalent of 'size_t'.
Seems like a fundamental problem with
the .NET API...
I don't know if I'd go that far.
Consider almost any collection class in .NET. Chances are it has a Count property that returns an int. So this suggests the class is bounded at a size of int.MaxValue (2147483647). That's not really a problem; it's a limitation -- and a perfectly reasonable one, in the vast majority of scenarios.
Anyway, what would the alternative be? There's uint -- but that's not CLS-compliant. Then there's long...
What if Length returned a long?
An additional 32 bits of memory would be required anywhere you wanted to know the length of a string.
The benefit would be: we could have strings taking up billions of gigabytes of RAM. Hooray.
Try to imagine the mind-boggling cost of some code like this:
// Lord knows how many characters
string ulysses = GetUlyssesText();
// allocate an entirely new string of roughly equivalent size
string schmulysses = ulysses.Replace("Ulysses", "Schmulysses");
Basically, if you're thinking of string as a data structure meant to store an unlimited quantity of text, you've got unrealistic expectations. When it comes to objects of this size, it becomes questionable whether you have any need to hold them in memory at all (as opposed to hard disk).
Correct, the maximum length would be the size of Int32, however you'll likely run into other memory issues if you're dealing with strings larger than that anyway.
At some value of String.Length, probably about 5 MB, it's not really practical to use String anymore. String is optimised for short bits of text.
Think about what happens when you do
myString += " more chars"
Something like:
System calculates length of myString plus length of " more chars"
System allocates that amount of memory
System copies myString to new memory location
System copies " more chars" to new memory location after last copied myString char
The original myString is left to the mercy of the garbage collector.
While this is nice and neat for small bits of text, it's a nightmare for large strings; just finding 2GB of contiguous memory is probably a showstopper.
So if you know you are handling more than a very few MB of characters use one of the *Buffer classes.
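The allocation-and-copy cycle described above is why a growable buffer helps; a minimal sketch contrasting the two approaches (the loop count is illustrative):

```csharp
using System;
using System.Text;

class Program
{
    static void Main()
    {
        // Naive concatenation: each += allocates a new string and copies
        // the old contents, so n appends cost O(n^2) character copies.
        string s = "";
        for (int i = 0; i < 1000; i++)
            s += "x";

        // StringBuilder appends into a growable internal buffer instead,
        // so the same work is roughly O(n) amortized.
        var sb = new StringBuilder();
        for (int i = 0; i < 1000; i++)
            sb.Append('x');

        Console.WriteLine(s.Length == sb.Length); // True
    }
}
```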
It's pretty unlikely that you'll need to store more than two billion objects in a single collection. You're going to incur some pretty serious performance penalties when doing enumerations and lookups, which are the two primary purposes of collections. If you're dealing with a data set that large, There is almost assuredly some other route you can take, such as splitting up your single collection into many smaller collections that contain portions of the entire set of data you're working with.
Heeeey, wait a sec.... we already have this concept -- it's called a dictionary!
If you need to store, say, 5 billion English strings, use this type:
Dictionary<string, List<string>> bigStringContainer;
Let's make the key string represent, say, the first two characters of the string. Then write an extension method like this:
public static string BigStringIndex(this string s)
{
return String.Concat(s[0], s[1]);
}
and then add items to bigStringContainer like this:
bigStringContainer[item.BigStringIndex()].Add(item);
and call it a day. (There are obviously more efficient ways you could do that, but this is just an example)
Oh, and if you really really really do need to be able to look up any arbitrary object by absolute index, use an Array instead of a collection. Okay yeah, you lose some type safety, but you can index array elements with a long.
The fact that the framework uses Int32 for Count/Length properties, indexers etc is a bit of a red herring. The real problem is that the CLR currently has a max object size restriction of 2GB.
So a string -- or any other single object -- can never be larger than 2GB.
Changing the Length property of the string type to return long, ulong or even BigInteger would be pointless since you could never have more than approx 2^30 characters anyway (2GB max size and 2 bytes per character.)
Similarly, because of the 2GB limit, the only arrays that could even approach having 2^31 elements would be bool[] or byte[] arrays that only use 1 byte per element.
Of course, there's nothing to stop you creating your own composite types to workaround the 2GB restriction.
(Note that the above observations apply to Microsoft's current implementation, and could very well change in future releases. I'm not sure whether Mono has similar limits.)
In versions of .NET prior to 4.5, the maximum object size is 2GB. From 4.5 onwards you can allocate larger objects if gcAllowVeryLargeObjects is enabled. Note that the limit for string is not affected, but "arrays" should cover "lists" too, since lists are backed by arrays.
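If I recall correctly, the opt-in is a runtime configuration element in the application's App.config (a sketch; check the gcAllowVeryLargeObjects documentation for your exact framework version):

```xml
<configuration>
  <runtime>
    <!-- Opt in to arrays larger than 2GB on 64-bit platforms.
         Strings are still limited regardless of this setting. -->
    <gcAllowVeryLargeObjects enabled="true" />
  </runtime>
</configuration>
```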
Even in x64 versions of Windows I got hit by .Net limiting each object to 2GB.
2GB is pretty small for a medical image. 2GB is even small for a Visual Studio download image.
If you are working with a file that is 2GB, that means you're likely going to be using a lot of RAM, and you're seeing very slow performance.
Instead, for very large files, consider using a MemoryMappedFile (see: http://msdn.microsoft.com/en-us/library/system.io.memorymappedfiles.memorymappedfile.aspx). Using this method, you can work with a file of nearly unlimited size, without having to load the whole thing in memory.
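A minimal sketch of reading from a file through MemoryMappedFile without loading the whole thing into memory (the temp-file setup is just for illustration):

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

class Program
{
    static void Main()
    {
        // The file path is illustrative; substitute your own large file.
        string path = Path.GetTempFileName();
        File.WriteAllText(path, "hello mapped world");

        using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        using (var accessor = mmf.CreateViewAccessor())
        {
            // Read a single byte at an offset without loading the file.
            byte b = accessor.ReadByte(0);
            Console.WriteLine((char)b); // h
        }
        File.Delete(path);
    }
}
```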
