Clearing Regex cache in C# - c#

I am using the Regex method Regex.Replace for large strings. Since these strings are cached, it is consuming a lot of memory.
I want to clear these Regex cache once a particulate operation is completed, so that the strings are garbage collected.
I can set the Regex cache size using Regex.CacheSize property, but how can I keep the cache size and clear the cache? Setting the cache size to zero will impact performance since I'm using this method multiple times for the same strings.
If I set the cache size to zero and reset it back to the old value, will the cached objects be discarded and garbage collected?
Code:
// languageDetails is a xml string holding, xml comments, name space etc.
// Need to remove the comments.
string pattern = "(<!--.*?-->)";
string languageDetails = Regex.Replace(
languageDetails,
pattern,
string.Empty,
RegexOptions.Singleline);

... for large strings. Since these strings are cached , it is consuming a lot of memory.
The input/output strings are not cached in any way.
The cache stores compiled regular expressions. So unless you have very many very large patterns, your memory problem is not caused by the cache.

Look at thesource code:
public static int CacheSize
{
[__DynamicallyInvokable] get
{
return Regex.cacheSize;
}
[__DynamicallyInvokable] set
{
if (value < 0)
throw new ArgumentOutOfRangeException(nameof (value));
Regex.cacheSize = value;
if (Regex.livecode.Count <= Regex.cacheSize)
return;
lock (Regex.livecode)
{
while (Regex.livecode.Count > Regex.cacheSize)
Regex.livecode.RemoveLast();
}
}
}
As you can see, setting the value to 0 will call Regex.livecode.RemoveLast();
So, it will clean the livecode list.

Related

C# big string array to string

I have a string array of about 20,000,000 values.
And i need to convert it to a string
I've tried:
string data = "";
foreach (var i in tm)
{
data = data + i;
}
But that takes too long time
does someone know a faster way?
Try StringBuilder:
StringBuilder sb = new StringBuilder();
foreach (var i in tm)
{
sb.Append(i);
}
To get the resulting String use ToString():
string result = sb.ToString();
The answer is going to depend on the size of the output string and the amount of memory you have available and usable. The hard limit on string length appears to be 2^31-1 (int.MaxValue) characters, occupying just over 4GB of memory. Whether you can actually allocate that is dependent on your framework version, etc. If you're going to be producing a larger output then you can't put it into a single string anyway.
You've already discovered that naive concatenation is going to be tragically slow. The problem is that every pass through the loop creates a new string, then immediately discards it on the next iteration. This is going to fill up memory pretty quickly, forcing the Garbage Collector to work overtime finding old strings to clear out of memory, not to mention the amount of memory fragmentation and all that stuff that modern programmers don't pay much attention to.
A StringBuiler, is a reasonable solution. Internally it allocates blocks of characters that it then stitches together at the end using pointers and memory copies. Saves a lot of hassles that way and is quite speedy.
As for String.Join... it uses a StringBuilder. So does String.Concat although it is certainly quicker when not inserting separator characters.
For simplicity I would use String.Concat and be done with it.
But then I'm not much for simplicity.
Here's an untested and possibly horribly slow answer using LINQ. When I get time I'll test it and see how it performs, but for now:
string result = new String(lines.SelectMany(l => (IEnumerable<char>)l).ToArray());
Obviously there is a potential overflow here since the ToArray call can potentially create an array larger than the String constructor can handle. Try it out and see if it's as quick as String.Concat.
So you can do it in LINQ, like such.
string data = tm.Aggregate("", (current, i) => current + i);
Or you can use the string.Join function
string data = string.Join("", tm);
Cant check it right now but I'm curious on how this option would perform:
var data = String.Join(string.Empty, tm);
Is Join optimized and ignores concatenation a with String.Empty?
For this big data unfortunately memory based methods will fail and this will be a real headache for GC. For this operation create a file and put every string in it. Like this:
using (StreamWriter sw = new StreamWriter("some_file_to_write.txt")){
for (int i=0; i<tm.Length;i++)
sw.Write(tm[i]);
}
Try to avoid using "var" on this performance demanding approach. Correction: "var" does not effect perfomance. "dynamic" does.

Ensuring thread safety with ASP.Net cache - a tale of two strategies

I have code that is not currently thread safe:
public byte[] GetImageByteArray(string filepath, string contentType, RImgOptions options)
{
//Our unique cache keys will be composed of both the image's filepath and the requested width
var cacheKey = filepath + options.Width.ToString();
var image = HttpContext.Current.Cache[cacheKey];
//If there is nothing in the cache, we need to generate the image, insert it into the cache, and return it
if (image == null)
{
RImgGenerator generator = new RImgGenerator();
byte[] bytes = generator.GenerateImage(filepath, contentType, options);
CacheItem(cacheKey, bytes);
return bytes;
}
//Image already exists in cache, serve it up!
else
{
return (byte[])image;
}
}
My CacheItem() method checks to see if its max cache size has been reached, and if it has, it will start removing cached items:
//If the cache exceeds its max allotment, we will remove items until it falls below the max
while ((int)cache[CACHE_SIZE] > RImgConfig.Settings.Profile.CacheSize * 1000 * 1000)
{
var entries = (Dictionary<string, DateTime>)cache[CACHE_ENTRIES];
var earliestCacheItem = entries.SingleOrDefault(kvp => kvp.Value == entries.Min(d => d.Value));
int length = ((byte[])cache[earliestCacheItem.Key]).Length;
cache.Remove(earliestCacheItem.Key);
cache[CACHE_SIZE] = (int)cache[CACHE_SIZE] - length;
}
Since one thread could remove an item from the cache as another thread is referencing it, I can think of two options:
Option 1: A lock
lock (myLockObject)
{
if(image == null){ **SNIP** }
}
Option 2: Assign a shallow copy to a local variable
var image = HttpContext.Current.Cache[cacheKey] != null ? HttpContext.Current.Cache[cacheKey].MemberwiseClone() : null;
Both of these options have overhead. The first forces threads to enter that code block one at a time. The second necessitates creating a new object in memory which could be of non-trivial size.
Are there any other strategies I could employ here?
To provide pure consistency of your cache solution you should lock your resource while slowing down the application.
In general, you should try to provide caching strategy, based on your application logic.
Check sliding window caching: while item which is older when some
time span - will reduce locking of different threads - good when you
have large spread of different cached items which not for sure will
be used again.
Consider using least frequently used strategy: the item that is least used
should be removed while reached max cache size - best serves while
you have multiple client hitting frequently same part of cached
content.
Just check which one more suites better your type of BL and use it. It will not remove the locking issue at all, but right choice will significantly remove racing conditions.
In order to reduce shared resource between different threads, use read and write locks on each item and not on entire collection. This will boost your performance as well.
Another point of consideration that should be kept in mind - what if content of image with the same path is changed physically on the disk (different image was saved with the same name) while having this image already cached there is a chance of mistakenly provide not relevant data.
Hope it helped.

fast custom string splitting

I am writing a custom string split. It will split on a dot(.) that is not preceded by an odd number of backslashes (\).
«string» -> «IEnemerable<string>»
"hello.world" -> "hello", "world"
"abc\.123" -> "abc\.123"
"aoeui\\.dhtns" -> "aoeui\\","dhtns"
I would like to know if there is a substring that will reuse the original string (for speed), or is there an existing split that can do this fast?
This is what I have but is 2—3 times slower than input.Split('.') //where input is a string. (I know it is a (slightly more complex problem, but not that much)
public IEnumerable<string> HandMadeSplit(string input)
{
var Result = new LinkedList<string>();
var word = new StringBuilder();
foreach (var ch in input)
{
if (ch == '.')
{
Result.AddLast(word.ToString());
word.Length = 0;
}
else
{
word.Append(ch);
}
}
Result.AddLast(word.ToString());
return Result;
}
It now uses List instead of LinkedList, and record beginning and end of substring and use string.substring to create the new substrings. This does a lot and is nearly as fast as string.split but I have added my adjustments. (will add code)
The loop that you show is the right approach if you need performance. (Regex wouldn't be).
Switch to an index-based for-loop. Remember the index of the start of the match. Don't append individual chars. Instead, remember the range of characters to copy out and do that with a single Substring call per item.
Also, don't use a LinkedList. It is slower than a List for almost all cases except random-access mutations.
You might also switch from List to a normal array that you resize with Array.Resize. This results in slightly tedious code (because you have inlined a part of the List class into your method) but it cuts out some small overheads.
Next, don't return an IEnumerable because that forces the caller through indirection when accessing its items. Return a List or an array.
This is the one I eventually settled on. It is not as fast a string.split, but good enough and can be modified, to do what I want.
private IEnumerable<string> HandMadeSplit2b(string input)
{
//this one is margenaly better that the second best 2, but makes the resolver (its client much faster), nealy as fast as original.
var Result = new List<string>();
var begining = 0;
var len = input.Length;
for (var index=0;index<len;index++)
{
if (input[index] == '.')
{
Result.Add(input.Substring(begining,index-begining));
begining = index+1;
}
}
Result.Add(input.Substring(begining));
return Result;
}
You shouldn't try to use string.Split for that.
If you need help to implement it, a simple way to solve this is to have loop that scans the string, keeping track of the last place where you found a qualifying dot. When you find a new qualifying dot (or reach the end of the input string), just yield return the current substring.
Edit: about returning a list or an array vs. using yield
If in your application, the most important thing is the time spent by the caller on iterating the substrings, then you should populate a list or an array and return that, as suggested in the accepted question. I would not use a resizable array while collecting the substrings because this would be slow.
On the other hand, if you care about the overall performance and about memory, and if sometimes the caller doesn't have to iterate over the entire list, you should use yield return. When you use yield return, you have the advantage that no code at all is executing until the caller has called MoveNext (directly or indirectly through a foreach). This means that you save the memory for allocating the array/list, and you save the time spent on allocating/resizing/populating the list. You will be spending time almost only on the logic of finding the substrings, and this will be done lazily, that is - only when actually needed because the caller continues to iterate the substrings.

String interning in .Net Framework - What are the benefits and when to use interning

I want to know the process and internals of string interning specific to .Net framework. Would also like to know the benefits of using interning and the scenarios/situations where we should use string interning to improve the performance. Though I have studied interning from the Jeffery Richter's CLR book but I am still confused and would like to know it in more detail.
[Editing] to ask a specific question with a sample code as below:
private void MethodA()
{
string s = "String"; // line 1 - interned literal as explained in the answer
//s.intern(); // line 2 - what would happen in line 3 if we uncomment this line, will it make any difference?
}
private bool MethodB(string compareThis)
{
if (compareThis == "String") // line 3 - will this line use interning (with and without uncommenting line 2 above)?
{
return true;
}
return false;
}
In general, interning is something that just happens, automatically, when you use literal string values. Interning provides the benefit of only having one copy of the literal in memory, no matter how often it's used.
That being said, it's rare that there is a reason to intern your own strings that are generated at runtime, or ever even think about string interning for normal development.
There are potentially some benefits if you're going to be doing a lot of work with comparisons of potentially identical runtime generated strings (as interning can speed up comparisons via ReferenceEquals). However, this is a highly specialized usage, and would require a fair amount of profiling and testing, and wouldn't be an optimization I'd consider unless there was a measured problem in place.
This is an "old" question, but I have a different angle on it.
If you're going to have a lot of long-lived strings from a small pool, interning can improve memory efficiency.
In my case, I was interning another type of object in a static dictionary because they were reused frequently, and this served as a fast cache before persisting them to disk.
Most of the fields in these objects are strings, and the pool of values is fairly small (much smaller than the number of instances, anyway).
If these were transient objects, it wouldn't matter because the string fields would be garbage collected often. But because references to them were being held, their memory usage started to accumulate (even when no new unique values were being added).
So interning the objects reduced the memory usage substantially, and so did interning their string values while they were being interned.
Interning is an internal implementation detail. Unlike boxing, I do not think there is any benefit in knowing more than what you have read in Richter's book.
Micro-optimisation benefits of interning strings manually are minimal hence is generally not recommended.
This probably describes it:
class Program
{
const string SomeString = "Some String"; // gets interned
static void Main(string[] args)
{
var s1 = SomeString; // use interned string
var s2 = SomeString; // use interned string
var s = "String";
var s3 = "Some " + s; // no interning
Console.WriteLine(s1 == s2); // uses interning comparison
Console.WriteLine(s1 == s3); // do NOT use interning comparison
}
}
Interned strings have the following characteristics:
Two interned strings that are identical will have the same address in memory.
Memory occupied by interned strings is not freed until your application terminates.
Interning a string involves calculating a hash and looking it up in a dictionary which consumes CPU cycles.
If multiple threads intern strings at the same time they will block each other because accesses to the dictionary of interned strings are serialized.
The consequences of these characteristics are:
You can test two interned strings for equality by just comparing the address pointer which is a lot faster than comparing each character in the string. This is especially true if the strings are very long and start with the same characters. You can compare interned strings with the Object.ReferenceEquals method, but it is safer to use the string == operator because it checks to see if the strings are interned first.
If you use the same string many times in your application, your application will only store one copy of the string in memory reducing the memory required to run your application.
If you intern many different strings this will allocate memory for those strings that will never be freed, and your application will consume ever increasing amounts of memory.
If you have a very large number of interned strings, string interning can become slow, and threads will block each other when accessing the interned string dictionary.
You should use string interning only if:
The set of strings you are interning is fairly small.
You compare these strings many times for each time that you intern them.
You really care about minute performance optimizations.
You don't have many threads aggressively interning strings.
Internalization of strings affects memory consumption.
For example if you read strings and keep them it in a list for caching; and the exact same string occurs 10 times, the string is actually stored only once in memory if string.Intern is used. If not, the string is stored 10 times.
In the example below, the string.Intern variant consumes about 44 MB and the without-version (uncommented) consumes 1195 MB.
static void Main(string[] args)
{
var list = new List<string>();
for (int i = 0; i < 5 * 1000 * 1000; i++)
{
var s = ReadFromDb();
list.Add(string.Intern(s));
//list.Add(s);
}
Console.WriteLine(Process.GetCurrentProcess().PrivateMemorySize64 / 1024 / 1024 + " MB");
}
private static string ReadFromDb()
{
return "abcdefghijklmnopqrstuvyxz0123456789abcdefghijklmnopqrstuvyxz0123456789abcdefghijklmnopqrstuvyxz0123456789" + 1;
}
Internalization also improves performance for equals-compare. The example below the intern version takes about 1 time units while the non-intern takes 7 time units.
static void Main(string[] args)
{
var a = string.Intern(ReadFromDb());
var b = string.Intern(ReadFromDb());
//var a = ReadFromDb();
//var b = ReadFromDb();
int equals = 0;
var stopwatch = Stopwatch.StartNew();
for (int i = 0; i < 250 * 1000 * 1000; i++)
{
if (a == b) equals++;
}
stopwatch.Stop();
Console.WriteLine(stopwatch.Elapsed + ", equals: " + equals);
}

Danger of C# Substring method?

Recently I have been reading up on some of the flaws with the Java substring method - specifically relating to memory, and how java keeps a reference to the original string. Ironically I am also developing a server application that uses C# .Net's implementation of substring many tens of times in a second. That got me thinking...
Are there memory issues with the C# (.Net) string.Substring?
What is the performance like on string.Substring? Is there a faster way to split a string based on start/end position?
Looking at .NET's implementation of String.Substring, a substring does not share memory with the original.
private unsafe string InternalSubString(int startIndex, int length, bool fAlwaysCopy)
{
if (((startIndex == 0) && (length == this.Length)) && !fAlwaysCopy)
{
return this;
}
// Allocate new (separate) string
string str = FastAllocateString(length);
// Copy chars from old string to new string
fixed (char* chRef = &str.m_firstChar)
{
fixed (char* chRef2 = &this.m_firstChar)
{
wstrcpy(chRef, chRef2 + startIndex, length);
}
}
return str;
}
Every time you use substring you create a new string instance - it has to copy the character from the old string to the new, along with the associated new memory allocation — and don't forget that these are unicode characters. This may or not be a bad thing - at some point you want to use these characters somewhere anyway. Depending on what you're doing, you might want your own method that merely finds the proper indexes within the string that you can then use later.
Just to add another perspective on this.
Out of memory (most times) does not mean you've used up all the memory. It means that your memory has been fragmented and the next time you want to allocate a chunk the system is unable to find a contiguous chunk of memory to fit your needs.
Frequent allocations/deallocations will cause memory fragmentation. The GC may not be in a position to de-fragment in time sue to the kinds of operations you do. I know the Server GC in .NET is pretty good about de-fragmenting memory but you could always starve (preventing the GC from doing a collect) the system by writing bad code.
it is always good to try it out & measure the elapsed milliseconds.
Stopwatch watch = new Stopwatch();
watch.Start();
// run string.Substirng code
watch.Stop();
watch.ElapsedMilliseconds();
In the case of the Java memory leak one may experience when using subString, it's easily fixed by instantiating a new String object with the copy constructor (that is a call of the form "new String(String)"). By using that you can discard all references to the original (and in the case that this is actually an issue, rather large) String, and maintain only the parts of it you need in memory.
Not ideal, in theory the JVM could be more clever and compress the String object (as was suggested above), but this gets the job done with what we have now.
As for C#, as has been said, this problem doesn't exist.
The CLR (hence C#'s) implementation of Substring does not retain a reference to the source string, so it does not have the "memory leak" problem of Java strings.
most of these type of string issues are because String is immutable. The StringBuilder class is intended for when you are doing a lot of string manipulations:
http://msdn.microsoft.com/en-us/library/2839d5h5(VS.71).aspx
Note that the real issue is memory allocation rather than CPU, although excessive memory alloc does take CPU...
I seem to recall that the strings in Java were stored as the actual characters along with a start and length.
This means that a substring string can share the same characters (since they're immutable) and only have to maintain a separate start and length.
So I'm not entirely certain what your memory issues are with the Java strings.
Regarding that article posted in your edit, it seems a bit of a non-issue to me.
Unless you're in the habit of making huge strings, then taking a small substring of them and leaving those lying around, this will have near-zero impact on memory.
Even if you had a 10M string and you made 400 substrings, you're only using that 10M for the underlying char array - it's not making 400 copies of that substring. The only memory impact is the start/length bit of each substring object.
The author seems to be complaining that they read a huge string into memory then only wanted a bit of it, but the entire thing was kept - my suggestion would be they they might want to rethink how they process their data :-)
To call this a Java bug is a huge stretch as well. A bug is something that doesn't work to specification. This was a deliberate design decision to improve performance, running out of memory because you don't understand how things work is not a bug, IMNSHO. And it's definitely not a memory leak.
There was one possible good suggestion in the comments to that article, that the GC could more aggressively recover bits of unused strings by compressing them.
This is not something you'd want to do on a first pass GC since it would be relatively expensive. However, where every other GC operation had failed to reclaim enough space, you could do it.
Unfortunately it would almost certainly mean that the underlying char array would need to keep a record of all the string objects that referenced it, so it could both figure out what bits were unused and modify all the string object start and length fields.
This in itself may introduce unacceptable performance impacts and, on top of that, if your memory is so short for this to be a problem, you may not even be able to allocate enough space for a smaller version of the string.
I think, if the memory's running out, I'd probably prefer not to be maintaining this char-array-to-string mapping to make this level of GC possible, instead I would prefer that memory to be used for my strings.
Since there is a perfectly acceptable workaround, and good coders should know about the foibles of their language of choice, I suspect the author is right - it won't be fixed.
Not because the Java developers are too lazy, but because it's not a problem.
You're free to implement your own string methods which match the C# ones (which don't share the underlying data except in certain limited scenarios). This will fix your memory problems but at the cost of a performance hit, since you have to copy the data every time you call substring. As with most things in IT (and life), it's a trade-off.
For profiling memory while developing you can use this code:
bool forceFullCollection = false;
Int64 valTotalMemoryBefore = System.GC.GetTotalMemory(forceFullCollection);
//call String.Substring
Int64 valTotalMemoryAfter = System.GC.GetTotalMemory(forceFullCollection);
Int64 valDifferenceMemorySize = valTotalMemoryAfter - valTotalMemoryBefore;
About parameter forceFullCollection: "If the forceFullCollection parameter is true, this method waits a short interval before returning while the system collects garbage and finalizes objects. The duration of the interval is an internally specified limit determined by the number of garbage collection cycles completed and the change in the amount of memory recovered between cycles. The garbage collector does not guarantee that all inaccessible memory is collected." GC.GetTotalMemory Method
Good luck!;)

Categories