Enhancing performance of streamwriter in c#

Enhancing performance of streamwriter in c# - c#

in my program i need to write large text files (~300 mb), the text files contains numbers seperated by spaces, i'm using this code :
TextWriter guessesWriter = TextWriter.Synchronized(new StreamWriter("guesses.txt"));
private void QueueStart()
{
while (true)
{
if (writeQueue.Count > 0)
{
guessesWriter.WriteLine(writeQueue[0]);
writeQueue.Remove(writeQueue[0]);
}
}
}
private static void Check()
{
TextReader tr = new StreamReader("data.txt");
string guess = tr.ReadLine();
b = 0;
List<Thread> threads = new List<Thread>();
while (guess != null) // Reading each row and analyze it
{
string[] guessNumbers = guess.Split(' ');
List<int> numbers = new List<int>();
foreach (string s in guessNumbers) // Converting each guess to a list of numbers
numbers.Add(int.Parse(s));
threads.Add(new Thread(GuessCheck));
threads[b].Start(numbers);
b++;
guess = tr.ReadLine();
}
}
private static void GuessCheck(object listNums)
{
List<int> numbers = (List<int>) listNums;
if (!CloseNumbersCheck(numbers))
{
writeQueue.Add(numbers[0] + " " + numbers[1] + " " + numbers[2] + " " + numbers[3] + " " + numbers[4] + " " + numbers[5] + " " + numbers[6]);
}
}
private static bool CloseNumbersCheck(List<int> numbers)
{
int divideResult = numbers[0]/10;
for (int i = 1; i < 6; i++)
{
if (numbers[i]/10 != divideResult)
return false;
}
return true;
}
the file data.txt contains data in this format : (dots mean more numbers following the same logic)
1 2 3 4 5 6 1
1 2 3 4 5 6 2
1 2 3 4 5 6 3
.
.
.
1 2 3 4 5 6 8
1 2 3 4 5 7 1
.
.
.
i know this is not very efficient and i was looking for some advice on how to make it quicker.
if you night know how to save LARGE amount of numbers more efficiently than a .txt i would appreciate it.

One way to improve efficiency is with a larger buffer on your output stream. You are using the defaults, which give you probably a 1k buffer, but you won't see maximum performance with less than a 64k buffer. Open your file like this:
new StreamWriter("guesses.txt", new UTF8Encoding(false, true), 65536)

Instead of reading and writing line by line (ReadLine and WriteLine), you should read and write big block of data (ReadBlock and Write). This way you will access disk alot less and have a big performance boost. But you will need to manage the end of each line (look at Environment.NewLine).

The effeciency could be improved by using BinaryWriter. Then you could just write out integers directly. This would allow you to skip the parsing step on the read and the ToString conversion on the write.
It also looks like you are creating a bunch of threads in there. Additional threads will slow down your performance. You should do all of the work on a single thread, since threads are very heavyweight objects.
Here is a more-or-less direct conversion of your code to use a BinaryWriter. (This does not address the thread problem.)
BinaryWriter guessesWriter = new BinaryWriter(new StreamWriter("guesses.dat"));
private void QueueStart()
{
while (true)
{
if (writeQueue.Count > 0)
{
lock (guessesWriter)
{
guessesWriter.Write(writeQueue[0]);
}
writeQueue.Remove(writeQueue[0]);
}
}
}
private const int numbersPerThread = 6;
private static void Check()
{
BinaryReader tr = new BinaryReader(new StreamReader("data.txt"));
b = 0;
List<Thread> threads = new List<Thread>();
while (tr.BaseStream.Position < tr.BaseStream.Length)
{
List<int> numbers = new List<int>(numbersPerThread);
for (int index = 0; index < numbersPerThread; index++)
{
numbers.Add(tr.ReadInt32());
}
threads.Add(new Thread(GuessCheck));
threads[b].Start(numbers);
b++;
}
}

Try using a bufferi n between. There is a BGufferdSTream. Right now you use very inefficient disc access patterns.

Related

Fast comparison of two doubles ignoring everything that comes after 6 decimal digits

I try to optimize the performance of some calculation process.
Decent amount of time is wasted on calculations like the following:
var isBigger = Math.Abs((long) (a * 1e6) / 1e6D) > ((long) ((b + c) * 1e6)) / 1e6D;
where "a","b" and "c" are doubles, "b" and "c" are positive, "a" might be negative.
isBigger should be true only if absolute value of "a" is bigger than "b+c" disregarding anything after the 6th decimal digit.
So I look at this expression, I understand what it does, but it seems hugely inefficient to me, since it multiplies and divides compared numbers by million just to get rig of anything after 6 decimal places.
Below is the program I used to try and create a better solution. So far I failed.
Can someone help me?
class Program
{
static void Main(string[] args)
{
var arrLength = 1000000;
var arr1 = GetArrayOf_A(arrLength);
var arr2 = GetArrayOf_B(arrLength);
var arr3 = GetArrayOf_C(arrLength);
var result1 = new bool[arrLength];
var result2 = new bool[arrLength];
var sw = new Stopwatch();
sw.Start();
for (var i = 0; i < arrLength; i++)
{
result1[i] = Math.Abs((long) (arr1[i] * 1e6) / 1e6D)
>
(long) ((arr2[i] + arr3[i]) * 1e6) / 1e6D;
}
sw.Stop();
var t1 = sw.Elapsed.TotalMilliseconds;
sw.Restart();
for (var i = 0; i < arrLength; i++)
{
//result2[i] = Math.Round(Math.Abs(arr1[i]) - (arr2[i] + arr3[i]),6) > 0; // Incorrect, example by index = 0
//result2[i] = Math.Abs(arr1[i]) - (arr2[i] + arr3[i]) > 0.000001; // Incorrect, example by index = 1
//result2[i] = Math.Abs(arr1[i]) - (arr2[i] + arr3[i]) > 0.0000001; // Incorrect, example by index = 2
result2[i] = Math.Abs(arr1[i]) - (arr2[i] + arr3[i]) > 0.00000001; // Incorrect, example by index = 3
}
sw.Stop();
var t2 = sw.Elapsed.TotalMilliseconds;
var areEquivalent = true;
for (var i = 0; i < arrLength; i++)
{
if (result1[i] == result2[i]) continue;
areEquivalent = false;
break;
}
Console.WriteLine($"Functions are equivalent : {areEquivalent}");
if (areEquivalent)
{
Console.WriteLine($"Current function total time: {t1}ms");
Console.WriteLine($"Equivalent function total time: {t2}ms");
}
Console.WriteLine("Press ANY key to quit . . .");
Console.ReadKey();
}
private static readonly Random _rand = new Random(DateTime.Now.Millisecond);
private const int NumberOfRepresentativeExamples = 4;
private static double[] GetArrayOf_A(int arrLength)
{
if(arrLength<=NumberOfRepresentativeExamples)
throw new ArgumentException($"{nameof(arrLength)} should be bigger than {NumberOfRepresentativeExamples}");
var arr = new double[arrLength];
// Representative numbers
arr[0] = 2.4486382579120365;
arr[1] = -1.1716818990000011;
arr[2] = 5.996414627393257;
arr[3] = 6.0740085822069;
// the rest is to build time statistics
FillTheRestOfArray(arr);
return arr;
}
private static double[] GetArrayOf_B(int arrLength)
{
if(arrLength<=NumberOfRepresentativeExamples)
throw new ArgumentException($"{nameof(arrLength)} should be bigger than {NumberOfRepresentativeExamples}");
var arr = new double[arrLength];
// Representative numbers
arr[0] = 2.057823225;
arr[1] = 0;
arr[2] = 2.057823225;
arr[3] = 2.060649901;
// the rest is to build time statistics
FillTheRestOfArray(arr);
return arr;
}
private static double[] GetArrayOf_C(int arrLength)
{
if(arrLength<=NumberOfRepresentativeExamples)
throw new ArgumentException($"{nameof(arrLength)} should be bigger than {NumberOfRepresentativeExamples}");
var arr = new double[arrLength];
// Representative numbers
arr[0] = 0.3908145999796302;
arr[1] = 1.1716809269999997;
arr[2] = 3.9385910820740282;
arr[3] = 4.0133582670728858;
// the rest is to build time statistics
FillTheRestOfArray(arr);
return arr;
}
private static void FillTheRestOfArray(double[] arr)
{
for (var i = NumberOfRepresentativeExamples; i < arr.Length; i++)
{
arr[i] = _rand.Next(0, 10) + _rand.NextDouble();
}
}
}

You don't need the division since if (x/100) < (y/100) that means that x<y.
for(var i = 0; i < arrLength; i++)
{
result2[i] = Math.Abs((long)(arr1[i] * 1e6))
> (long)((arr2[i] + arr3[i]) * 1e6);
}
with the results for me:
Arrays have 1000000 elements.
Functions are equivalent : True
Current function total time: 40.10ms 24.94 kflop
Equivalent function total time: 22.42ms 44.60 kflop
A speedup of 78.83 %
PS. Make sure you compare RELEASE versions of the binary which includes math optimizations.
PS2. The display code is
Console.WriteLine($"Arrays have {arrLength} elements.");
Console.WriteLine($"Functions are equivalent : {areEquivalent}");
Console.WriteLine($" Current function total time: {t1:F2}ms {arrLength/t1/1e3:F2} kflop");
Console.WriteLine($"Equivalent function total time: {t2:F2}ms {arrLength/t2/1e3:F2} kflop");
Console.WriteLine($"An speedup of {t1/t2-1:P2}");

Overall your question goes into the Area of Realtime Programming. Not nessesarily realtime constraint, but it goes into teh same optimisation territory. The kind where every last nanosecond has be shaved off.
.NET is not the ideal scenario for this kind of operation. Usually that thing is done in dedicated lanagauges. The next best thing is doing it in Assembler, C or native C++. .NET has additional features like the Garbage Collector and Just In Time compiler that make even getting reliable benchmark results tricky. Much less reliale runtime performance.
For the datatypes, Float should be about the fastest operation there is. For historical reasons float opeations have been optimized.
One of your comment mentions physics and you do have an array. And I see stuff like array[i] = array2[i] + array3[i]. So maybe this should be a matrix operation you run on the GPU instead? This kind of "huge paralellized array opeartions" is exactly what the GPU is good at. Exactly what drawing on the screen is at it's core.
Unless you tell us what you are actually doing here as sa operation, that is about the best answer I can give.

Is this what you're looking for?
Math.Abs(a) - (b + c) > 0.000001
or if you want to know if the difference is bigger (difference either way):
Math.Abs(Math.Abs(a) - (b + c)) > 0.000001
(I'm assuming you're not limiting to this precision because of speed but because of inherent floating point limited precision.)

In addition to asking this question on this site, I also asked a good friend of mine, and so far he provided the best answer. Here it is:
result2[i] = Math.Abs(arr1[i]) - (arr2[i] + arr3[i]) > 0.000001 ||
Math.Abs((long)(arr1[i] * 1e6)) > (long)((arr2[i] + arr3[i])*1e6);
I am happy to have such friends :)

Different program outputs when different capacity arguments are passed to List constructor, C#

I'm implementing a slightly fancier version of counting sort in C#. The "slightly fancier" part is that I replace some elements in the sorted output with "-" rather than the original value. Here is a sample input/ output pair (the range of possible integer values are between 0 and 99):
IN
20
0 ab
6 cd
0 ef
6 gh
4 ij
0 ab
6 cd
0 ef
6 gh
0 ij
4 that
3 be
0 to
1 be
5 question
1 or
2 not
4 is
2 to
4 the
OUT
- - - - - to be or not to be - that is the question - - - -
And here is my implementation:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
class Solution
{
static void Main(String[] args)
{
int n = Convert.ToInt32(Console.ReadLine());
List<List<string>> rsltLists = new List<List<string>>(100);
for(int i=0; i<n; i++)
{
rsltLists.Add(new List<String>()); // PROBLEM IS HERE
}
for(int a0 = 0; a0 < n; a0++)
{
string[] tokens_x = Console.ReadLine().Split(' ');
int x = Convert.ToInt32(tokens_x[0]);
string s = tokens_x[1];
if(a0 < n/2)
{
// Replace string with '-'
rsltLists[x].Add("-");
}
else
{
rsltLists[x].Add(s);
}
}
foreach(List<string> rsltList in rsltLists)
{
foreach(string rslt in rsltList)
{
Console.Write(rslt + " ");
}
}
}
}
I'm submitting my code as the solution to a problem on Hackerrank. The problem is that for the 5th test case, my solution times out (the test case contains an enormous number of lines so I'm not including it here). To speed my solution up, I replaced the //PROBLEM IS HERE line with rsltLists.Add(new List<String>(100)). This causes the 5th test case to fail rather than time out (test cases 1-4 still passed). When I replaced the problem line with rsltLists.Add(new List<String>(10000)) the 5th test case and several other test cases failed (though not all of the test cases failed). Why would changing the amount of space I reserve for each List<String> cause this inconsistent behavior? I would expected the fifth test case to fail (maybe), but I wouldn't have expected test cases that were passing previously to start failing.

Why are you creating n rsltLists? That is not the requirement. There are 100 possible values and array is better for that. You should NOT be using n here. x is 100.
for(int i=0; i<n; i++) // no, problem is here
{
rsltLists.Add(new List<String>()); // PROBLEM IS HERE
}
This should be pretty fast
public static string HackerSort()
{
List<string> input = new List<string>() {"20"
, "0 ab"
, "6 cd"
, "0 ef"
, "6 gh"
, "4 ij"
, "0 ab"
, "6 cd"
, "0 ef"
, "6 gh"
, "0 ij"
, "4 that"
, "3 be"
, "0 to"
, "1 be"
, "5 question"
, "1 or"
, "2 not"
, "4 is"
, "2 to"
, "4 the" };
List<string>[] wl = new List<string>[100];
int n = int.Parse(input[0]);
int half = n/2;
char split = ' ';
for (int i = 0; i < n; i++)
{
string s = input[i + 1];
string[] ss = s.Split(split);
//Debug.WriteLine(ss[0]);
int row = int.Parse(ss[0]);
if(wl[row] == null)
{
wl[row] = new List<string>((n / 100) + 1);
}
wl[row].Add(i < half ? "-" : ss[1]);
}
StringBuilder sb = new StringBuilder();
foreach(List<string> ls in wl.Where(x => x != null))
{
sb.Append(string.Join(" ", ls) + ' ');
}
Debug.WriteLine(sb.ToString().TrimEnd());
return sb.ToString().TrimEnd();
}

Couple thoughts on this:
You're creating a list for each option... but many aren't used. How about only instantiating the lists that you'll actually use?
Compounded with the one above, you're creating 100 lists, each with a capacity of 100... that's a lot of memory to set aside that you won't be using
One solution:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
class Solution
{
static void Main(String[] args)
{
int n = Convert.ToInt32(Console.ReadLine());
int threshold = n / 2;
List<string>[] stringMap = new List<string>[100];
for(int a0 = 0; a0 < n; a0++){
string[] tokens_x = Console.ReadLine().Split(' ');
int x = Convert.ToInt32(tokens_x[0]);
if(stringMap[x] == null)
{
stringMap[x] = new List<string>();
}
stringMap[x].Add((a0 >= threshold ? tokens_x[1] : "-"));
}
List<string> output = new List<string>();
for(int i = 0; i < stringMap.Length; i++)
{
if(stringMap[i] == null)
{
continue;
}
output.AddRange(stringMap[i]);
}
Console.WriteLine(string.Join(" ", output));
}
}

The "inner" List will always have exactly two elements, one of which you want to treat as a number rather than a string. Better to use a small class or even a tuple here, rather than nested lists.
I'm at work with only VS2015 with no tuple support, so this code is unchecked and likely has a mistake or two:
static void Main(String[] args)
{
int n = int.Parse(Console.ReadLine());
var data = new List<(int, string)>(n);
for(int a0 = 0; a0 < n; a0++)
{
var tokens = Console.ReadLine().Split(' ');
int x = int.Parse(tokens[0]);
if(a0 < n/2) tokens[1] = "-";
data.Add( (val: x, str: tokens[1]) )
}
foreach(var item in data.OrderBy(i => i.val))
{
Console.Write(item.str + " ");
}
}

One way to solve the memory-hogging / long processing times would be to store the input in a SortedDictionary<int, List<string>> instead. The Key would be the integer portion of the input, and the Value would be a List<string> containing the other part of the input (one item per input that matches the key).
Then, when we have the dictionary populated, we can just output each List<string> of data in order (the SortedDictionary will already be sorted by Key).
In this way, we're only creating as many lists as we actually need, and each list is only as long as it needs to be (both of which I believe were causes for the errors in your original code, but I don't know where to find the actual test case code to verify).
private static void Main()
{
var length = Convert.ToInt32(Console.ReadLine());
var halfway = length / 2;
var items = new SortedDictionary<int, List<string>>();
for (int inputLine = 0; inputLine < length; inputLine++)
{
var input = Console.ReadLine().Split();
var sortIndex = Convert.ToInt32(input[0]);
var value = inputLine < halfway ? "-" : input[1];
if (items.ContainsKey(sortIndex)
{
items[sortIndex].Add(value);
}
else
{
items.Add(sortIndex, new List<string> {value});
}
}
Console.WriteLine(string.Join(" ", items.SelectMany(i => i.Value)));
// Not submitted to website, but for local testing:
Console.Write("\n\nPress any key to exit...");
Console.ReadKey();
}
Output

Increase chances of name being picked from Array

For my program, I've prompted the user to put 20 names into an array (the array size is 5 for testing for now), this array is then sent to a text document. I need to make it so that it will randomly pick a name from the list and display it (which I have done). But I now need to make it increase the chances of a name being picked, how would I go about doing this?
Eg. I want to increase the chances of the name 'Jim' being picked from the array.
class Programt
{
static void readFile()
{
}
static void Main(string[] args)
{
string winner;
string file = #"C:\names.txt";
string[] classNames = new string[5];
Random RandString = new Random();
Console.ForegroundColor = ConsoleColor.White;
if (File.Exists(file))
{
Console.WriteLine("Names in the text document are: ");
foreach (var displayFile in File.ReadAllLines(file))
Console.WriteLine(displayFile);
Console.ReadKey();
}
else
{
Console.WriteLine("Please enter 5 names:");
for (int i = 0; i < 5; i++)
classNames[i] = Console.ReadLine();
File.Create(file).Close();
File.WriteAllLines(file, classNames);
Console.WriteLine("Writing names to file...");
winner = classNames[RandString.Next(0, classNames.Length)];
Console.ForegroundColor = ConsoleColor.Yellow;
Console.WriteLine("\nThe winner of the randomiser is: {0} Congratulations! ", winner);
Thread.Sleep(3000);
Console.Write("Completed");
Thread.Sleep(1000);
}
}
}

There's two ways of doing this. You can either produce a RNG with a normal distribution targeting one number.
Or the simpler way is a translational step. Generate in the range 0-100 and then produce code which translates to the answer in a biased way e.g.
0-5 : Answer 1
6-10: Answer 2
11-90: Answer 3
91-95: Answer 4
96-100: Answer 5
This gives an 80% chance of picking Answer 3, the others only get a 5% chance
So where you currently have RandString.Next(0, classNames.Length) you can replace that with a function something like GetBiasedIndex(0, classNames.Length, 3)
The function would look something like this (with test code):
public Form1()
{
InitializeComponent();
int[] results = new int[5];
Random RandString = new Random();
for (int i = 0; i < 1000; i++)
{
var output = GetBiasedIndex(RandString, 0, 4, 3);
results[output]++;
}
StringBuilder builder = new StringBuilder();
for (int i = 0; i < 5; i++)
{
builder.AppendLine(results[i].ToString());
}
label1.Text = builder.ToString();
}
private int GetBiasedIndex(Random rng, int start, int end, int target)
{
//Number between 0 and 100 (Helps to think percentage)
var check = rng.Next(0, 100);
//There's a few ways to do this next bit, but I'll try to keep it simple
//Allocate x% to the target and split the remaining y% among all the others
int x = 80;//80% chance of target
int remaining = 100 - x;//20% chance of something else
//Take the check for the target out of the last x% (we can take it out of any x% chunk but this makes it simpler
if (check > (100 - x))
{
return target;
}
else
{
//20% left if there's 4 names remaining that's 5% each
var perOther = (100 - x) / ((end - start) - 1);
//result is now in the range 0..4
var result = check / perOther;
//avoid hitting the target in this section
if (result >= target)
{
//adjust the index we are returning since we have already accounted for the target
result++;
}
//return the index;
return result;
}
}
and the output:
52
68
55
786
39
If you're going to call this function repeatedly you'll need to pass in the instance of the RNG so that you don't reset the seed each call.
If you want to target a name instead of an index you just need to look up that name first and have an else condition for when that name isn't found

Convert pdf pages to images more than 1000

I could convert the pdf pages into images. if it is less than 50 pages its working fast...
if any pdf large than 1000 pages... it acquires lot of time to complete.
any one can review this code and make it work for large file size...
i have used PdfLibNet dll(will not work in 4.0) in .NET3.5
here is my sample code:
public void ConverIMG(string filename)
{
PDFWrapper wrapper = new PDFWrapper();
wrapper.RenderDPI = Dpi;
wrapper.LoadPDF(filename);
int count = wrapper.PageCount;
for (int i = 1; i <= wrapper.PageCount; i++)
{
string fileName = AppDomain.CurrentDomain.BaseDirectory + #"IMG\" + i.ToString() + ".png";
wrapper.ExportJpg(fileName, i, i, (double)100, 100);
while (wrapper.IsJpgBusy)
{
Thread.Sleep(50);
}
}
wrapper.Dispose();
}
PS:
we need to split pages and convert to images parallely and we need to get completed status.

If PDFWrapper performance degrades for documents bigger then 50 pages it suggests it is not very well written. To overcome this you could do conversion in 50 page batches and recreate PDFWrapper after each batch. There is an assumption that ExportJpg() gets slower with number of calls and its initial speed does not depend on the size of PDF.
This is only a workaround for apparent problems in PDFWrapper and a proper solution would be to use a fixed library. Also I would suggest Thread.Sleep(1) if you really need to wait with yielding.
public void ConverIMG(string filename)
{
PDFWrapper wrapper = new PDFWrapper();
wrapper.RenderDPI = Dpi;
wrapper.LoadPDF(filename);
int count = wrapper.PageCount;
for (int i = 1; i <= count; i++)
{
string fileName = AppDomain.CurrentDomain.BaseDirectory + #"IMG\" + i.ToString() + ".png";
wrapper.ExportJpg(fileName, i, i, (double) 100, 100);
while (wrapper.IsJpgBusy)
{
Thread.Sleep(1);
}
if (i % 50 == 0)
{
wrapper.Dispose();
wrapper = new PDFWrapper();
wrapper.RenderDPI = Dpi;
wrapper.LoadPDF(filename);
}
}
wrapper.Dispose();
}

Taking large (1 million) number of substring (100 character wide) reads from a long string (3 million characters)

How can I take 1 million substring from a string with more than 3 million characters efficiently in C#? I have written a program which involves reading random DNA reads (substrings from random position) of length say 100 from a string with 3 million characters. There are 1 million such reads. Currently i run a while loop that runs 1 million times and read a substring of 100 character length from the string with 3 million character. This is taking a long time. What can i do to complete this faster?
heres my code, len is the length of the original string, 3 million in this case, it may be as low as 50 thats why the check in the while loop.
while(i < 1000000 && len-100> 0) //len is 3000000
{
int randomPos = _random.Next()%(len - ReadLength);
readString += all.Substring(randomPos, ReadLength) + Environment.NewLine;
i++;
}

Using a StringBuilder to assemble the string will get you a 600 times increase in processing (as it avoids repeated object creation everytime you append to the string.
before loop (initialising capacity avoids recreating the backing array in StringBuilder):
StringBuilder sb = new StringBuilder(1000000 * ReadLength);
in loop:
sb.Append(all.Substring(randomPos, ReadLength) + Environment.NewLine);
after loop:
readString = sb.ToString();
Using a char array instead of a string to extract the values yeilds another 30% improvement as you avoid object creation incurred when calling Substring():
before loop:
char[] chars = all.ToCharArray();
in loop:
sb.Append(chars, randomPos, ReadLength);
sb.AppendLine();
Edit (final version which does not use StringBuilder and executes in 300ms):
char[] chars = all.ToCharArray();
var iterations = 1000000;
char[] results = new char[iterations * (ReadLength + 1)];
GetRandomStrings(len, iterations, ReadLength, chars, results, 0);
string s = new string(results);
private static void GetRandomStrings(int len, int iterations, int ReadLength, char[] chars, char[] result, int resultIndex)
{
Random random = new Random();
int i = 0, index = resultIndex;
while (i < iterations && len - 100 > 0) //len is 3000000
{
var i1 = len - ReadLength;
int randomPos = random.Next() % i1;
Array.Copy(chars, randomPos, result, index, ReadLength);
index += ReadLength;
result[index] = Environment.NewLine[0];
index++;
i++;
}
}

I think better solutions will come, but .NET StringBuilder class instances are faster than String class instances because it handles data as a Stream.
You can split the data in pieces and use .NET Task Parallel Library for Multithreading and Parallelism
Edit: Assign fixed values to a variable out of the loop to avoid recalculation;
int x = len-100
int y = len-ReadLength
use
StringBuilder readString= new StringBuilder(ReadLength * numberOfSubStrings);
readString.AppendLine(all.Substring(randomPos, ReadLength));
for Parallelism you should split your input to pieces. Then run these operations on pieces in seperate threads. Then combine the results.
Important: As my previous experiences showed these operations run faster with .NET v2.0 rather than v4.0, so you should change your projects target framework version; but you can't use Task Parallel Library with .NET v2.0 so you should use multithreading in oldschool way like
Thread newThread ......

How long is a long time ? It shouldn't be that long.
var file = new StreamReader(#"E:\Temp\temp.txt");
var s = file.ReadToEnd();
var r = new Random();
var sw = new Stopwatch();
sw.Start();
var range = Enumerable.Range(0,1000000);
var results = range.Select( i => s.Substring(r.Next(s.Length - 100),100)).ToList();
sw.Stop();
sw.ElapsedMilliseconds.Dump();
s.Length.Dump();
So on my machine the results were 807ms and the string is 4,055,442 chars.
Edit: I just noticed that you want a string as a result, so my above solution just changes to...
var results = string.Join(Environment.NewLine,range.Select( i => s.Substring(r.Next(s.Length - 100),100)).ToArray());
And adds about 100ms, so still under a second in total.

Edit: I abandoned the idea to use memcpy, and I think the result is super great.
I've broken a 3m length string into 30k strings of 100 length each in 43 milliseconds.
private static unsafe string[] Scan(string hugeString, int subStringSize)
{
var results = new string[hugeString.Length / subStringSize];
var gcHandle = GCHandle.Alloc(hugeString, GCHandleType.Pinned);
var currAddress = (char*)gcHandle.AddrOfPinnedObject();
for (var i = 0; i < results.Length; i++)
{
results[i] = new string(currAddress, 0, subStringSize);
currAddress += subStringSize;
}
return results;
}
To use the method for the case shown in the question:
const int size = 3000000;
const int subSize = 100;
var stringBuilder = new StringBuilder(size);
var random = new Random();
for (var i = 0; i < size; i++)
{
stringBuilder.Append((char)random.Next(30, 80));
}
var hugeString = stringBuilder.ToString();
var stopwatch = Stopwatch.StartNew();
for (int i = 0; i < 1000; i++)
{
var strings = Scan(hugeString, subSize);
}
stopwatch.Stop();
Console.WriteLine(stopwatch.ElapsedMilliseconds / 1000); // 43

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Enhancing performance of streamwriter in c# - c#

Instead of reading and writing line by line (ReadLine and WriteLine), you should read and write big block of data (ReadBlock and Write). This way you will access disk alot less and have a big performance boost. But you will need to manage the end of each line (look at Environment.NewLine).

Try using a bufferi n between. There is a BGufferdSTream. Right now you use very inefficient disc access patterns.

Related

Fast comparison of two doubles ignoring everything that comes after 6 decimal digits

Different program outputs when different capacity arguments are passed to List constructor, C#

Increase chances of name being picked from Array

Convert pdf pages to images more than 1000

Taking large (1 million) number of substring (100 character wide) reads from a long string (3 million characters)

Categories

Resources