Generate multiple unique strings in C#

For my project, I have to generate a list of unique strings.
Everything works, but generation becomes very slow toward the end.
I tried parallel loops, but the ConcurrentBag<T> I was using turned out to be slow as well.
Now I'm using a simple for loop and a List<T>, which is a little faster but still really slow.
Here's my code:
private List<string> Generate(int start, int end, bool allowDupes)
{
var list = new List<string>();
var generator = new StringGenerator(LowerCase, UpperCase, Digits, NumberOfCharacters);
for (var i = start; i < end; i++)
{
StringBuilder sb;
while (true)
{
sb = new StringBuilder();
for (var j = 0; j < NumberOfSegments; j++)
{
sb.Append(generator.GenerateRandomString());
if (j < NumberOfSegments - 1)
{
sb.Append(Delimiter);
}
}
if (!allowDupes)
{
if (list.Contains(sb.ToString()))
{
continue;
}
}
break;
}
list.Add(sb.ToString());
GeneratedStringCount = i + 1;
}
return new List<string>(list);
}
I've also talked to my teacher, and he would use the same algorithm to generate these strings.
Do you know a better solution? (The GenerateRandomString() method in StringGenerator is simple and cheap. According to the Performance Analysis in Visual Studio, list.Contains(xy) is consuming a lot of resources.)

List<T>.Contains is slow because it scans the whole list on every call. Use a HashSet<string> for the duplicate check instead.
private List<string> Generate(int start, int end, bool allowDupes)
{
var strings = new HashSet<string>();
var list = new List<string>();
var generator = new StringGenerator(LowerCase, UpperCase, Digits, NumberOfCharacters);
for (var i = start; i < end; i++)
{
while (true)
{
string randomString = GetRandomString(generator);
if (allowDupes || strings.Add(randomString))
{
list.Add(randomString);
break;
}
}
GeneratedStringCount = i + 1;
}
return list;
}
private string GetRandomString(StringGenerator generator)
{
// Join NumberOfSegments random segments with the delimiter.
var segments = Enumerable.Range(1, NumberOfSegments)
.Select(_ => generator.GenerateRandomString());
return string.Join(Delimiter, segments);
}
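This can still be slow when the number of requested strings approaches the number of strings the generator can produce, because most candidates are then rejected as duplicates. You could remedy that with a smarter GenerateRandomString function.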
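As a self-contained illustration of the HashSet idea, here is a minimal sketch. It does not use the StringGenerator class from the question; UniqueStringSketch, NextString and GenerateUnique are hypothetical names, and the alphabet is assumed to contain no repeated characters. It also refuses requests for more unique strings than can exist, which would otherwise loop forever.

```csharp
using System;
using System.Collections.Generic;

public static class UniqueStringSketch
{
    // Hypothetical helper: builds one random string of the given length from the alphabet.
    private static string NextString(Random rng, string alphabet, int length)
    {
        var chars = new char[length];
        for (var i = 0; i < length; i++)
        {
            chars[i] = alphabet[rng.Next(alphabet.Length)];
        }
        return new string(chars);
    }

    // Returns 'count' distinct strings; throws if more are requested than can exist.
    public static List<string> GenerateUnique(int count, string alphabet, int length, Random rng)
    {
        // alphabet.Length^length strings are possible (assumes no repeated characters in the alphabet).
        var capacity = Math.Pow(alphabet.Length, length);
        if (count > capacity)
        {
            throw new ArgumentException("Cannot generate that many unique strings.", nameof(count));
        }

        var seen = new HashSet<string>();
        var result = new List<string>(count);
        while (result.Count < count)
        {
            var candidate = NextString(rng, alphabet, length);
            // HashSet<T>.Add returns false when the value is already present.
            if (seen.Add(candidate))
            {
                result.Add(candidate);
            }
        }
        return result;
    }
}
```

A call such as UniqueStringSketch.GenerateUnique(1000, "abcdefghijklmnopqrstuvwxyz", 8, new Random()) stays fast because each duplicate check is a hash lookup rather than a scan of the list.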

public static String GenerateEightCode( int codeLenght, Boolean isCaseSensitive)
{
char[] chars = GetCharsForCode(isCaseSensitive);
byte[] data = new byte[1];
RNGCryptoServiceProvider crypto = new RNGCryptoServiceProvider();
crypto.GetNonZeroBytes(data);
data = new byte[codeLenght];
crypto.GetNonZeroBytes(data);
StringBuilder sb = new StringBuilder(codeLenght);
foreach (byte b in data)
{
sb.Append(chars[b % (chars.Length)]);
}
string key = sb.ToString();
if (codeLenght == 8)
key = key.Substring(0, 4) + "-" + key.Substring(4, 4);
else if (codeLenght == 16)
key = key.Substring(0, 4) + "-" + key.Substring(4, 4) + "-" + key.Substring(8, 4) + "-" + key.Substring(12, 4);
return key.ToString();
}
private static char[] GetCharsForCode(Boolean isCaseSensitive)
{
// Full set would be abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890;
// some look-alike characters (such as O and 0) are left out.
char[] chars;
if (isCaseSensitive)
{
chars = "abcdefghijkmnopqrstuvwxyzABCDEFGHJKLMNPQRSTUVWXYZ123456789".ToCharArray(); // number of unique combinations: 4 - 424 270, 8 - 1 916 797 311, 16 - 7.99601828013E+13
}
else
{
chars = "ABCDEFGHIJKLMNPQRSTUVWXYZ123456789".ToCharArray(); // number of unique combinations: 4 - 52 360, 8 - 23 535 820, 16 - 4 059 928 950
}
return chars;
}
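One thing to be aware of in the code above: mapping a random byte to a character with b % chars.Length favors the first characters of the alphabet whenever the number of possible byte values is not a multiple of the alphabet length. A minimal sketch of one common remedy, rejection sampling, follows. UnbiasedCodeSketch and Generate are hypothetical names, and the sketch uses RandomNumberGenerator.Create() with GetBytes rather than the GetNonZeroBytes call from the answer.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class UnbiasedCodeSketch
{
    // Builds a code of the given length, picking every alphabet character with equal probability.
    public static string Generate(int length, string alphabet)
    {
        if (alphabet.Length == 0 || alphabet.Length > 256)
        {
            throw new ArgumentException("Alphabet must contain 1 to 256 characters.", nameof(alphabet));
        }

        // Largest multiple of alphabet.Length that is <= 256; bytes at or above it are rejected.
        int limit = 256 - (256 % alphabet.Length);
        var sb = new StringBuilder(length);
        var buffer = new byte[1];
        using (var rng = RandomNumberGenerator.Create())
        {
            while (sb.Length < length)
            {
                rng.GetBytes(buffer);
                if (buffer[0] < limit)
                {
                    sb.Append(alphabet[buffer[0] % alphabet.Length]);
                }
            }
        }
        return sb.ToString();
    }
}
```

For the 58-character alphabet the limit is 232, so roughly one byte in eleven is discarded and drawn again.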

Related

Thousands separator after the decimal point [duplicate]

I wonder what would be the best way to format numbers so that the NumberGroupSeparator applies not only to the integer part, left of the decimal separator, but also to the fractional part on its right.
Math.PI.ToString("###,###,##0.0##,###,###,###") // As documented ..
// ..this doesn't work
3.14159265358979 // result
3.141,592,653,589,79 // desired result
As documented on MSDN, the NumberGroupSeparator works only to the left of the decimal separator. I wonder why?
A little clunky, and it won't work for scientific numbers but here is a try:
class Program
{
static void Main(string[] args)
{
var π=Math.PI*10000;
Debug.WriteLine(Display(π));
// 31,415.926,535,897,931,899
}
static string Display(double x)
{
int s=Math.Sign(x);
x=Math.Abs(x);
StringBuilder text=new StringBuilder();
var y=Math.Truncate(x);
text.Append((s*y).ToString("#,#"));
x-=y;
if (x>0)
{
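// 15 decimal places is max reasonable precision
y=Math.Truncate(x*Math.Pow(10, 15));
text.Append(".");
// Zero-pad to 15 digits so leading zeros of the fraction survive, then drop trailing zeros and separators
text.Append(y.ToString("000,000,000,000,000").TrimEnd('0', ','));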
}
return text.ToString();
}
}
It might be best to work with the string generated by your .ToString():
class Program
{
static string InsertSeparators(string s)
{
string decSeparator = System.Threading.Thread.CurrentThread.CurrentCulture.NumberFormat.NumberDecimalSeparator;
int separatorPos = s.IndexOf(decSeparator);
if (separatorPos >= 0)
{
string decPart = s.Substring(separatorPos + decSeparator.Length);
// split the string into parts of 3 or less characters
List<String> parts = new List<String>();
for (int i = 0; i < decPart.Length; i += 3)
{
string part = "";
for (int j = 0; (j < 3) && (i + j < decPart.Length); j++)
{
part += decPart[i + j];
}
parts.Add(part);
}
string groupSeparator = System.Threading.Thread.CurrentThread.CurrentCulture.NumberFormat.NumberGroupSeparator;
s = s.Substring(0, separatorPos) + decSeparator + String.Join(groupSeparator, parts);
}
return s;
}
static void Main(string[] args)
{
for (int n = 0; n < 15; n++)
{
string s = Math.PI.ToString("0." + new string('#', n));
Console.WriteLine(InsertSeparators(s));
}
Console.ReadLine();
}
}
Outputs:
3
3.1
3.14
3.142
3.141,6
3.141,59
3.141,593
3.141,592,7
3.141,592,65
3.141,592,654
3.141,592,653,6
3.141,592,653,59
3.141,592,653,59
3.141,592,653,589,8
3.141,592,653,589,79
OK, this is not my strong suit, but I guess this may be my best bet:
string input = Math.PI.ToString();
string groupSeparator = System.Threading.Thread.CurrentThread
.CurrentCulture.NumberFormat.NumberGroupSeparator;
Regex RX = new Regex(@"([0-9]{3})");
string result = RX.Replace(input, @"$1" + groupSeparator);
Thanks for listening.

Importing and removing duplicates from a massive amount of text files using C# and Redis

This is a bit of a doozy and it's been a while since I worked with C#, so bear with me:
I'm running a JRuby script to iterate through 900 files (5 MB to 1500 MB in size) to figure out how many dupes STILL exist within these (already uniq'd) files. I had little luck with awk.
My latest idea was to insert them into a local MongoDB instance like so:
db.collection('hashes').update({ :_id => hash}, { $inc: { count: 1} }, { upsert: true })
... so that later I could just query it like db.collection.where({ count: { $gt: 1 } }) to get all the dupes.
This is working great except it's been over 24 hours and at the time of writing I'm at 72,532,927 Mongo entries and growing.
I think Ruby's .each_line is bottlenecking the IO hardcore.
So what I'm thinking now is compiling a C# program which fires up one thread per file and inserts each line (as an md5 hash) into a Redis list.
From there, I could have another compiled C# program simply pop the values off and skip saving any whose count is 1.
So the questions are:
Will using a compiled file reader and multithreading the file reads significantly improve performance?
Is using Redis even necessary? With a tremendous amount of AWS memory, could I not just use the threads to fill some sort of a list atomically and proceed from there?
Thanks in advance.
Updated: new solution. The main idea is to calculate a dummy hash (just the sum of all characters in the line) for each line and store it in a Dictionary<ulong, List<LinePosition>> called _hash2LinePositions. Several lines in the same stream can share a hash, which is why each dictionary value is a List. When two hashes are equal, we read the corresponding lines from the streams and compare them. LinePosition stores where a line starts in the stream and how long it is. I don't have files as huge as yours, but my tests show that it works. Here is the full code:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
public class Solution
{
struct LinePosition
{
public long Start;
public long Length;
public LinePosition(long start, long count)
{
Start = start;
Length = count;
}
public override string ToString()
{
return string.Format("Start: {0}, Length: {1}", Start, Length);
}
}
class TextFileHasher : IDisposable
{
readonly Dictionary<ulong, List<LinePosition>> _hash2LinePositions;
readonly Stream _stream;
bool _isDisposed;
public HashSet<ulong> Hashes { get; private set; }
public string Name { get; private set; }
public TextFileHasher(string name, Stream stream)
{
Name = name;
_stream = stream;
_hash2LinePositions = new Dictionary<ulong, List<LinePosition>>();
Hashes = new HashSet<ulong>();
}
public override string ToString()
{
return Name;
}
public void CalculateFileHash()
{
int readByte = -1;
ulong dummyLineHash = 0;
// Line start position in file
long startPosition = 0;
while ((readByte = _stream.ReadByte()) != -1) {
// Read until new line
if (readByte == '\r' || readByte == '\n') {
// If there was data
if (dummyLineHash != 0) {
// Add line hash and line position to the dict
AddToDictAndHash(dummyLineHash, startPosition, _stream.Position - 1 - startPosition);
// Reset line hash
dummyLineHash = 0;
}
}
else {
// Was it new line ?
if (dummyLineHash == 0)
startPosition = _stream.Position - 1;
// Calculate dummy hash
dummyLineHash += (uint)readByte;
}
}
if (dummyLineHash != 0) {
// Add line hash and line position to the dict
AddToDictAndHash(dummyLineHash, startPosition, _stream.Position - startPosition);
// Reset line hash
dummyLineHash = 0;
}
}
public List<LinePosition> GetLinePositions(ulong hash)
{
return _hash2LinePositions[hash];
}
public List<string> GetDuplicates()
{
List<string> duplicates = new List<string>();
foreach (var key in _hash2LinePositions.Keys) {
List<LinePosition> linesPos = _hash2LinePositions[key];
if (linesPos.Count > 1) {
duplicates.AddRange(FindExactDuplicates(linesPos));
}
}
return duplicates;
}
public void Dispose()
{
if (_isDisposed)
return;
_stream.Dispose();
_isDisposed = true;
}
private void AddToDictAndHash(ulong hash, long start, long count)
{
List<LinePosition> linesPosition;
if (!_hash2LinePositions.TryGetValue(hash, out linesPosition)) {
linesPosition = new List<LinePosition>() { new LinePosition(start, count) };
_hash2LinePositions.Add(hash, linesPosition);
}
else {
linesPosition.Add(new LinePosition(start, count));
}
Hashes.Add(hash);
}
public byte[] GetLineAsByteArray(LinePosition prevPos)
{
long len = prevPos.Length;
byte[] lineBytes = new byte[len];
_stream.Seek(prevPos.Start, SeekOrigin.Begin);
_stream.Read(lineBytes, 0, (int)len);
return lineBytes;
}
private List<string> FindExactDuplicates(List<LinePosition> linesPos)
{
List<string> duplicates = new List<string>();
linesPos.Sort((x, y) => x.Length.CompareTo(y.Length));
LinePosition prevPos = linesPos[0];
for (int i = 1; i < linesPos.Count; i++) {
if (prevPos.Length == linesPos[i].Length) {
var prevLineArray = GetLineAsByteArray(prevPos);
var thisLineArray = GetLineAsByteArray(linesPos[i]);
if (prevLineArray.SequenceEqual(thisLineArray)) {
var line = System.Text.Encoding.Default.GetString(prevLineArray);
duplicates.Add(line);
}
#if false
string prevLine = System.Text.Encoding.Default.GetString(prevLineArray);
string thisLine = System.Text.Encoding.Default.GetString(thisLineArray);
Console.WriteLine("PrevLine: {0}\r\nThisLine: {1}", prevLine, thisLine);
StringBuilder sb = new StringBuilder();
sb.Append(prevPos);
sb.Append(" is '");
sb.Append(prevLine);
sb.Append("'. ");
sb.AppendLine();
sb.Append(linesPos[i]);
sb.Append(" is '");
sb.Append(thisLine);
sb.AppendLine("'. ");
sb.Append("Equals => ");
sb.Append(prevLine.CompareTo(thisLine) == 0);
Console.WriteLine(sb.ToString());
#endif
}
else {
prevPos = linesPos[i];
}
}
return duplicates;
}
}
public static void Main(String[] args)
{
List<TextFileHasher> textFileHashers = new List<TextFileHasher>();
string text1 = "abc\r\ncba\r\nabc";
TextFileHasher tfh1 = new TextFileHasher("Text1", new MemoryStream(System.Text.Encoding.Default.GetBytes(text1)));
tfh1.CalculateFileHash();
textFileHashers.Add(tfh1);
string text2 = "def\r\ncba\r\nwet";
TextFileHasher tfh2 = new TextFileHasher("Text2", new MemoryStream(System.Text.Encoding.Default.GetBytes(text2)));
tfh2.CalculateFileHash();
textFileHashers.Add(tfh2);
string text3 = "def\r\nbla\r\nwat";
TextFileHasher tfh3 = new TextFileHasher("Text3", new MemoryStream(System.Text.Encoding.Default.GetBytes(text3)));
tfh3.CalculateFileHash();
textFileHashers.Add(tfh3);
List<string> totalDuplicates = new List<string>();
Dictionary<ulong, Dictionary<TextFileHasher, List<LinePosition>>> totalHashes = new Dictionary<ulong, Dictionary<TextFileHasher, List<LinePosition>>>();
textFileHashers.ForEach(tfh => {
foreach(var dummyHash in tfh.Hashes) {
Dictionary<TextFileHasher, List<LinePosition>> tfh2LinePositions = null;
if (!totalHashes.TryGetValue(dummyHash, out tfh2LinePositions))
totalHashes[dummyHash] = new Dictionary<TextFileHasher, List<LinePosition>>() { { tfh, tfh.GetLinePositions(dummyHash) } };
else {
List<LinePosition> linePositions = null;
if (!tfh2LinePositions.TryGetValue(tfh, out linePositions))
tfh2LinePositions[tfh] = tfh.GetLinePositions(dummyHash);
else
linePositions.AddRange(tfh.GetLinePositions(dummyHash));
}
}
});
HashSet<TextFileHasher> alreadyGotDuplicates = new HashSet<TextFileHasher>();
foreach(var hash in totalHashes.Keys) {
var tfh2LinePositions = totalHashes[hash];
var tfh = tfh2LinePositions.Keys.FirstOrDefault();
// Get duplicates in the TextFileHasher itself
if (tfh != null && !alreadyGotDuplicates.Contains(tfh)) {
totalDuplicates.AddRange(tfh.GetDuplicates());
alreadyGotDuplicates.Add(tfh);
}
if (tfh2LinePositions.Count <= 1) {
continue;
}
// Algo to get duplicates in more than 1 TextFileHashers
var tfhs = tfh2LinePositions.Keys.ToArray();
for (int i = 0; i < tfhs.Length; i++) {
var tfh1Positions = tfhs[i].GetLinePositions(hash);
for (int j = i + 1; j < tfhs.Length; j++) {
var tfh2Positions = tfhs[j].GetLinePositions(hash);
for (int k = 0; k < tfh1Positions.Count; k++) {
var tfh1Pos = tfh1Positions[k];
var tfh1ByteArray = tfhs[i].GetLineAsByteArray(tfh1Pos);
for (int m = 0; m < tfh2Positions.Count; m++) {
var tfh2Pos = tfh2Positions[m];
if (tfh1Pos.Length != tfh2Pos.Length)
continue;
var tfh2ByteArray = tfhs[j].GetLineAsByteArray(tfh2Pos);
if (tfh1ByteArray.SequenceEqual(tfh2ByteArray)) {
var line = System.Text.Encoding.Default.GetString(tfh1ByteArray);
totalDuplicates.Add(line);
}
}
}
}
}
}
Console.WriteLine();
if (totalDuplicates.Count > 0) {
Console.WriteLine("Total number of duplicates: {0}", totalDuplicates.Count);
Console.WriteLine("#######################");
totalDuplicates.ForEach(x => Console.WriteLine("{0}", x));
Console.WriteLine("#######################");
}
// Free resources
foreach (var tfh in textFileHashers)
tfh.Dispose();
}
}
If you have tons of RAM, you guys are overthinking it:
var fileLines = File.ReadAllLines(@"c:\file.csv").Distinct();
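Note that Distinct() yields the unique lines, not the ones that occur more than once. If the goal is to list the duplicates themselves, a HashSet can do that in one pass. The following is a minimal sketch; DuplicateLineSketch and FindDuplicates are hypothetical names, and it assumes all distinct lines fit in memory.

```csharp
using System.Collections.Generic;

public static class DuplicateLineSketch
{
    // Returns each line that occurs more than once, reported once, in order of first repetition.
    public static List<string> FindDuplicates(IEnumerable<string> lines)
    {
        var seen = new HashSet<string>();
        var reported = new HashSet<string>();
        var duplicates = new List<string>();
        foreach (var line in lines)
        {
            // Add returns false for a line that was seen before;
            // the second set keeps a line from being reported twice.
            if (!seen.Add(line) && reported.Add(line))
            {
                duplicates.Add(line);
            }
        }
        return duplicates;
    }
}
```

Passing File.ReadLines(path) instead of File.ReadAllLines(path) streams the file line by line, so only the sets of distinct lines are held in memory rather than the whole file as well.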

C# How can I compare two word strings and indicate which parts are different

For example if I have...
string a = "personil";
string b = "personal";
I would like to get...
string c = "person[i]l";
However, it is not necessarily a single character. It could be like this too...
string a = "disfuncshunal";
string b = "dysfunctional";
For this case I would want to get...
string c = "d[isfuncshu]nal";
Another example would be... (Notice that the lengths of the two words are different.)
string a = "parralele";
string b = "parallel";
string c = "par[ralele]";
Another example would be...
string a = "ato";
string b = "auto";
string c = "a[]to";
How would I go about doing this?
Edit: The length of the two strings can be different.
Edit: Added additional examples. Credit goes to user Nenad for asking.
I must be very bored today, but I actually wrote a unit test that passes all 4 cases (unless you have added more in the meantime).
Edit: Added 2 edge cases and fix for them.
Edit2: letters that repeat multiple times (and error on those letters)
[Test]
[TestCase("parralele", "parallel", "par[ralele]")]
[TestCase("personil", "personal", "person[i]l")]
[TestCase("disfuncshunal", "dysfunctional", "d[isfuncshu]nal")]
[TestCase("ato", "auto", "a[]to")]
[TestCase("inactioned", "inaction", "inaction[ed]")]
[TestCase("refraction", "fraction", "[re]fraction")]
[TestCase("adiction", "addiction", "ad[]iction")]
public void CompareStringsTest(string attempted, string correct, string expectedResult)
{
int first = -1, last = -1;
string result = null;
int shorterLength = (attempted.Length < correct.Length ? attempted.Length : correct.Length);
// First - [
for (int i = 0; i < shorterLength; i++)
{
if (correct[i] != attempted[i])
{
first = i;
break;
}
}
// Last - ]
var a = correct.Reverse().ToArray();
var b = attempted.Reverse().ToArray();
for (int i = 0; i < shorterLength; i++)
{
if (a[i] != b[i])
{
last = i;
break;
}
}
if (first == -1 && last == -1)
result = attempted;
else
{
var sb = new StringBuilder();
if (first == -1)
first = shorterLength;
if (last == -1)
last = shorterLength;
// If same letter repeats multiple times (ex: addition)
// and error is on that letter, we have to trim trail.
if (first + last > shorterLength)
last = shorterLength - first;
if (first > 0)
sb.Append(attempted.Substring(0, first));
sb.Append("[");
if (last > -1 && last + first < attempted.Length)
sb.Append(attempted.Substring(first, attempted.Length - last - first));
sb.Append("]");
if (last > 0)
sb.Append(attempted.Substring(attempted.Length - last, last));
result = sb.ToString();
}
Assert.AreEqual(expectedResult, result);
}
Have you tried my DiffLib?
With that library, and the following code (running in LINQPad):
void Main()
{
string a = "disfuncshunal";
string b = "dysfunctional";
var diff = new Diff<char>(a, b);
var result = new StringBuilder();
int index1 = 0;
int index2 = 0;
foreach (var part in diff)
{
if (part.Equal)
result.Append(a.Substring(index1, part.Length1));
else
result.Append("[" + a.Substring(index1, part.Length1) + "]");
index1 += part.Length1;
index2 += part.Length2;
}
result.ToString().Dump();
}
You get this output:
d[i]sfunc[shu]nal
To be honest I don't understand what this gives you, as you seem to completely ignore the changed parts in the b string, only dumping the relevant portions of the a string.
Here is a complete and working console application that will work for both examples you gave:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace ConsoleApplication2
{
class Program
{
static void Main(string[] args)
{
string a = "disfuncshunal";
string b = "dysfunctional";
StringBuilder sb = new StringBuilder();
for (int i = 0; i < a.Length; i++)
{
if (a[i] != b[i])
{
sb.Append("[");
sb.Append(a[i]);
sb.Append("]");
continue;
}
sb.Append(a[i]);
}
var str = sb.ToString();
var startIndex = str.IndexOf("[");
var endIndex = str.LastIndexOf("]");
var start = str.Substring(0, startIndex + 1);
var mid = str.Substring(startIndex + 1, endIndex - startIndex - 1); // second argument is a length, not an end index
var end = str.Substring(endIndex);
Console.WriteLine(start + mid.Replace("[", "").Replace("]", "") + end);
}
}
}
It assumes that b is at least as long as a and that the two strings differ somewhere, and it will not work if you want to mark more than one mismatched section separately.
You did not specify what to do if the strings were of different lengths, but here is a solution to the problem when the strings are of equal length:
private string Compare(string string1, string string2) {
//This only works if the two strings are the same length..
string output = "";
bool mismatch = false;
for (int i = 0; i < string1.Length; i++) {
char c1 = string1[i];
char c2 = string2[i];
if (c1 == c2) {
if (mismatch) {
output += "]" + c1;
mismatch = false;
} else {
output += c1;
}
} else {
if (mismatch) {
output += c1;
} else {
output += "[" + c1;
mismatch = true;
}
}
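}
// Close the bracket if the mismatch runs to the end of the string.
if (mismatch) {
output += "]";
}
return output;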
}
Not really a good approach, but as an exercise in using LINQ: the task seems to be to find the matching prefix and suffix of the 2 strings and return prefix + "[" + middle of first string + "]" + suffix.
So you can match the prefix (Zip + TakeWhile(a==b)), then repeat the same for the suffix by reversing both strings and reversing the result.
var first = "disfuncshunal";
var second = "dysfunctional";
// Prefix
var zipped = first.ToCharArray().Zip(second.ToCharArray(), (f,s)=> new {f,s});
var prefix = string.Join("",
zipped.TakeWhile(c => c.f==c.s).Select(c => c.f));
// Suffix
var zippedReverse = first.ToCharArray().Reverse()
.Zip(second.ToCharArray().Reverse(), (f,s)=> new {f,s});
var suffix = string.Join("",
zippedReverse.TakeWhile(c => c.f==c.s).Reverse().Select(c => c.f));
// Cut and combine.
var middle = first.Substring(prefix.Length,
first.Length - prefix.Length - suffix.Length);
var result = prefix + "[" + middle + "]" + suffix;
A much easier and faster approach is to use 2 for loops (one from start to end, and one from end to start).
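A minimal sketch of that two-loop idea, under the assumption that only one bracketed section is wanted, could look like this. DiffSketch and MarkDifference are hypothetical names. The suffix scan is capped so that it never overlaps the prefix, which matters for repeated letters such as "adiction" versus "addiction".

```csharp
using System;

public static class DiffSketch
{
    // Wraps the part of 'attempted' that differs from 'correct' in square brackets.
    public static string MarkDifference(string attempted, string correct)
    {
        if (attempted == correct)
        {
            return attempted;
        }

        int shorter = Math.Min(attempted.Length, correct.Length);

        // Loop 1: length of the common prefix, scanning from the start.
        int prefix = 0;
        while (prefix < shorter && attempted[prefix] == correct[prefix])
        {
            prefix++;
        }

        // Loop 2: length of the common suffix, scanning from the end,
        // capped so it cannot run into the prefix already matched.
        int suffix = 0;
        while (suffix < shorter - prefix
            && attempted[attempted.Length - 1 - suffix] == correct[correct.Length - 1 - suffix])
        {
            suffix++;
        }

        string middle = attempted.Substring(prefix, attempted.Length - prefix - suffix);
        return attempted.Substring(0, prefix) + "[" + middle + "]"
            + attempted.Substring(attempted.Length - suffix);
    }
}
```

For example, DiffSketch.MarkDifference("disfuncshunal", "dysfunctional") returns "d[isfuncshu]nal", and DiffSketch.MarkDifference("ato", "auto") returns "a[]to".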

DateTime.Now.Ticks is too slow

I am making a password generator.
var listOfCharacters = "abcdefghijklmnopqrstuvwxyz"; // the characters to choose from
var chars = listOfCharacters.ToCharArray();
string password = string.Empty;
for (int i = 0; i < length; i++)
{
int x = random.Next(0, chars.Length); // pick a random index from 0 to 25 (a - z); the upper bound is exclusive
password += chars.GetValue(x); // append the picked character to the new password
}
if (length < password.Length) password = password.Substring(0, length); // trim the password if it is longer than requested
return password;
My random:
random = new Random((int)DateTime.Now.Ticks);
I am looking for a faster way to seed the generator than using Ticks, because it's not fast enough for me. I am looking for simple code that I can easily drop into the code above, so that I can still use int x = random.Next(0, chars.Length); but with something faster than Random.Next. I am just a beginner in C#.
EDIT:
When I want to generate two passwords within a short time, .Ticks is too slow.
My test code:
[TestMethod]
public void PasswordGeneratorShouldRenderUniqueNextPassword()
{
// Create an instance, and generate two passwords
var generator = new PasswordGenerator();
var firstPassword = generator.Generate(8); //8 is the length of the password
var secondPassword = generator.Generate(8);
// Verify that both passwords are unique
Assert.AreNotEqual(firstPassword, secondPassword);
}
You could use the hashcode of a Guid as seed value for your Random instance. It should be random enough for your case.
random = new Random(Guid.NewGuid().GetHashCode());
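To see why the seed matters here: two Random instances built from the same seed produce exactly the same sequence, which is what happens when two generators are seeded from DateTime.Now.Ticks before the clock value changes. The sketch below makes that visible by taking the Random as a constructor argument. PasswordGeneratorSketch is a hypothetical class, not the PasswordGenerator from the question.

```csharp
using System;

public sealed class PasswordGeneratorSketch
{
    private const string Alphabet = "abcdefghijklmnopqrstuvwxyz";
    private readonly Random _random;

    // The caller supplies the Random, so one instance can be shared across all calls.
    public PasswordGeneratorSketch(Random random)
    {
        _random = random;
    }

    public string Generate(int length)
    {
        var chars = new char[length];
        for (var i = 0; i < length; i++)
        {
            // Next(n) returns a value in [0, n), so every index is valid.
            chars[i] = Alphabet[_random.Next(Alphabet.Length)];
        }
        return new string(chars);
    }
}
```

Two generators created as new PasswordGeneratorSketch(new Random(12345)) return the same first password, while repeated calls on a single generator keep advancing one sequence instead of restarting it.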
Either create the Random instance once at startup or use the RNGCryptoServiceProvider.
// Create the random instance only once.
private static Random _Random = new Random();
static void Main(string[] args)
{
var listOfCharacters = "abcdefghijklmnopqrstuvwxyz".ToList();
var result = new StringBuilder();
for (int i = 0; i < 20; i++)
{
// Consider creating the provider only once!
var provider = new RNGCryptoServiceProvider();
// The same is true for the byte array.
var bytes = new byte[4];
provider.GetBytes(bytes);
var number = BitConverter.ToInt32(bytes, 0);
var index = Math.Abs(number % listOfCharacters.Count);
result.Append(listOfCharacters[index]);
}
Console.WriteLine(result.ToString());
Console.ReadKey();
}
Bias testing
static void Main(string[] args)
{
var listOfCharacters = "abcdefghijklmnopqrstuvwxyz".ToList();
var occurences = new Dictionary<char, int>();
foreach (var character in listOfCharacters)
{
occurences.Add(character, 0);
}
var provider = new RNGCryptoServiceProvider();
var bytes = new byte[4];
for (int i = 0; i < 1000000; i++)
{
provider.GetBytes(bytes);
var number = BitConverter.ToInt32(bytes, 0);
var index = Math.Abs(number % listOfCharacters.Count);
occurences[listOfCharacters[index]]++;
}
var orderedOccurences = occurences.OrderBy(kvp => kvp.Value);
var minKvp = orderedOccurences.First();
var maxKvp = orderedOccurences.Last();
Console.WriteLine("Min occurence: " + minKvp.Key + " Times: " + minKvp.Value);
Console.WriteLine("Max occurence: " + maxKvp.Key + " Times: " + maxKvp.Value);
Console.WriteLine("Difference: " + (maxKvp.Value - minKvp.Value));
Console.ReadKey();
}
The result is that the difference between the highest and the lowest occurrence is somewhere between 700 and 800, which means the bias is around 0.08%, and the two characters with the maximum difference are different on every run. So I really can't see any bias.
The following program generates ~500 passwords / millisecond on my computer:
class Program
{
static void Main(string[] args)
{
var g = new Generator();
IEnumerable<string> passwords = new List<string>();
var startTime = DateTime.Now;
passwords = g.GetPassword().ToList();
}
}
class Generator
{
Random r = new Random(Guid.NewGuid().GetHashCode());
string randomCharsList;
const int length = 8;
const int randomLength = 8000;
const string listOfCharacters = "abcdefghijklmnopqrstuvwxyz";
public Generator()
{
CreateRandom();
}
private void CreateRandom()
{
var randomChars = new StringBuilder();
string password = string.Empty;
for (int i = 0; i < randomLength + length; i++)
{
var random = new Random(i * Guid.NewGuid().ToByteArray().First());
int x = random.Next(0, listOfCharacters.Length);
randomChars.Append(listOfCharacters[x]);
}
randomCharsList = randomChars.ToString();
}
public IEnumerable<string> GetPassword()
{
int pos;
var startTime = DateTime.Now;
while ((DateTime.Now - startTime).Milliseconds < 1)
{
pos = r.Next(randomLength);
yield return randomCharsList.Substring(pos, length);
}
}
}
Just wanted to add a note here about thread safety when generating random numbers (we ran into this issue on a high-volume web server).
Essentially, the Random class is NOT thread safe, and if a collision occurs it will return 0 (not what I expected), which, needless to say, can really wreak havoc on your logic :) So in a multi-threaded environment, make sure to protect access to any shared Random objects.
See the section 'The System.Random class and thread safety' at https://msdn.microsoft.com/en-us/library/system.random(v=vs.110).aspx for more details.
Hope this helps someone.
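One simple way to protect a shared Random is to put a lock around every call. This is only a minimal sketch; LockedRandomSketch is a hypothetical wrapper, and a per-thread instance would be another option.

```csharp
using System;

public sealed class LockedRandomSketch
{
    private readonly Random _random = new Random();
    private readonly object _gate = new object();

    // Serializes access so concurrent callers cannot corrupt the generator's internal state.
    public int Next(int minValue, int maxValue)
    {
        lock (_gate)
        {
            return _random.Next(minValue, maxValue);
        }
    }
}
```

A single LockedRandomSketch can then be stored in a static field and called from any thread. The lock adds some contention under heavy load, which is the trade-off for keeping one shared sequence.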

C# - A faster alternative to Convert.ToSingle()

I'm working on a program which reads millions of floating point numbers from a text file. This program runs inside of a game that I'm designing, so I need it to be fast (I'm loading an obj file). So far, loading a relatively small file takes about a minute (without precompilation) because of the slow speed of Convert.ToSingle(). Is there a faster way to do this?
EDIT: Here's the code I use to parse the Obj file
http://pastebin.com/TfgEge9J
using System;
using System.IO;
using System.Collections.Generic;
using OpenTK.Math;
using System.Drawing;
using PlatformLib;
public class ObjMeshLoader
{
public static StreamReader[] LoadMeshes(string fileName)
{
StreamReader mreader = new StreamReader(PlatformLib.Platform.openFile(fileName));
MemoryStream current = null;
List<MemoryStream> mstreams = new List<MemoryStream>();
StreamWriter mwriter = null;
if (!mreader.ReadLine().Contains("#"))
{
mreader.BaseStream.Close();
throw new Exception("Invalid header");
}
while (!mreader.EndOfStream)
{
string cmd = mreader.ReadLine();
string line = cmd;
line = line.Trim(splitCharacters);
line = line.Replace("  ", " ");
string[] parameters = line.Split(splitCharacters);
if (parameters[0] == "mtllib")
{
loadMaterials(parameters[1]);
}
if (parameters[0] == "o")
{
if (mwriter != null)
{
mwriter.Flush();
current.Position = 0;
}
current = new MemoryStream();
mwriter = new StreamWriter(current);
mwriter.WriteLine(parameters[1]);
mstreams.Add(current);
}
else
{
if (mwriter != null)
{
mwriter.WriteLine(cmd);
mwriter.Flush();
}
}
}
mwriter.Flush();
current.Position = 0;
List<StreamReader> readers = new List<StreamReader>();
foreach (MemoryStream e in mstreams)
{
e.Position = 0;
StreamReader sreader = new StreamReader(e);
readers.Add(sreader);
}
return readers.ToArray();
}
public static bool Load(ObjMesh mesh, string fileName)
{
try
{
using (StreamReader streamReader = new StreamReader(Platform.openFile(fileName)))
{
Load(mesh, streamReader);
streamReader.Close();
return true;
}
}
catch { return false; }
}
public static bool Load2(ObjMesh mesh, StreamReader streamReader, ObjMesh prevmesh)
{
if (prevmesh != null)
{
//mesh.Vertices = prevmesh.Vertices;
}
try
{
//streamReader.BaseStream.Position = 0;
Load(mesh, streamReader);
streamReader.Close();
#if DEBUG
Console.WriteLine("Loaded "+mesh.Triangles.Length.ToString()+" triangles and "+mesh.Quads.Length.ToString()+" quadrilaterals parsed, with a grand total of "+mesh.Vertices.Length.ToString()+" vertices.");
#endif
return true;
}
catch (Exception er) { Console.WriteLine(er); return false; }
}
static char[] splitCharacters = new char[] { ' ' };
static List<Vector3> vertices;
static List<Vector3> normals;
static List<Vector2> texCoords;
static Dictionary<ObjMesh.ObjVertex, int> objVerticesIndexDictionary;
static List<ObjMesh.ObjVertex> objVertices;
static List<ObjMesh.ObjTriangle> objTriangles;
static List<ObjMesh.ObjQuad> objQuads;
static Dictionary<string, Bitmap> materials = new Dictionary<string, Bitmap>();
static void loadMaterials(string path)
{
StreamReader mreader = new StreamReader(Platform.openFile(path));
string current = "";
bool isfound = false;
while (!mreader.EndOfStream)
{
string line = mreader.ReadLine();
line = line.Trim(splitCharacters);
line = line.Replace("  ", " ");
string[] parameters = line.Split(splitCharacters);
if (parameters[0] == "newmtl")
{
if (materials.ContainsKey(parameters[1]))
{
isfound = true;
}
else
{
current = parameters[1];
}
}
if (parameters[0] == "map_Kd")
{
if (!isfound)
{
string filename = "";
for (int i = 1; i < parameters.Length; i++)
{
filename += parameters[i];
}
string searcher = "\\" + "\\";
filename = filename.Replace(searcher, "\\"); // strings are immutable, so the result must be assigned
Bitmap mymap = new Bitmap(filename);
materials.Add(current, mymap);
isfound = false;
}
}
}
}
static float parsefloat(string val)
{
return Convert.ToSingle(val);
}
int remaining = 0;
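static string GetLine(string text, ref int pos)
{
// Returns the next line starting at pos, or null once the text is exhausted.
if (pos >= text.Length) return null;
int end = text.IndexOf('\n', pos);
if (end < 0) end = text.Length;
string retval = text.Substring(pos, end - pos).TrimEnd('\r');
pos = end + 1; // continue after the line break
return retval;
}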
static void Load(ObjMesh mesh, StreamReader textReader)
{
//try {
//vertices = null;
//objVertices = null;
if (vertices == null)
{
vertices = new List<Vector3>();
}
if (normals == null)
{
normals = new List<Vector3>();
}
if (texCoords == null)
{
texCoords = new List<Vector2>();
}
if (objVerticesIndexDictionary == null)
{
objVerticesIndexDictionary = new Dictionary<ObjMesh.ObjVertex, int>();
}
if (objVertices == null)
{
objVertices = new List<ObjMesh.ObjVertex>();
}
objTriangles = new List<ObjMesh.ObjTriangle>();
objQuads = new List<ObjMesh.ObjQuad>();
mesh.vertexPositionOffset = vertices.Count;
string line;
string alltext = textReader.ReadToEnd();
int pos = 0;
while ((line = GetLine(alltext, ref pos)) != null)
{
if (line.Length < 2)
{
break;
}
//line = line.Trim(splitCharacters);
//line = line.Replace(" ", " ");
string[] parameters = line.Split(splitCharacters);
switch (parameters[0])
{
case "usemtl":
//Material specification
try
{
mesh.Material = materials[parameters[1]];
}
catch (KeyNotFoundException)
{
Console.WriteLine("WARNING: Texture parse failure: " + parameters[1]);
}
break;
case "p": // Point
break;
case "v": // Vertex
float x = parsefloat(parameters[1]);
float y = parsefloat(parameters[2]);
float z = parsefloat(parameters[3]);
vertices.Add(new Vector3(x, y, z));
break;
case "vt": // TexCoord
float u = parsefloat(parameters[1]);
float v = parsefloat(parameters[2]);
texCoords.Add(new Vector2(u, v));
break;
case "vn": // Normal
float nx = parsefloat(parameters[1]);
float ny = parsefloat(parameters[2]);
float nz = parsefloat(parameters[3]);
normals.Add(new Vector3(nx, ny, nz));
break;
case "f":
switch (parameters.Length)
{
case 4:
ObjMesh.ObjTriangle objTriangle = new ObjMesh.ObjTriangle();
objTriangle.Index0 = ParseFaceParameter(parameters[1]);
objTriangle.Index1 = ParseFaceParameter(parameters[2]);
objTriangle.Index2 = ParseFaceParameter(parameters[3]);
objTriangles.Add(objTriangle);
break;
case 5:
ObjMesh.ObjQuad objQuad = new ObjMesh.ObjQuad();
objQuad.Index0 = ParseFaceParameter(parameters[1]);
objQuad.Index1 = ParseFaceParameter(parameters[2]);
objQuad.Index2 = ParseFaceParameter(parameters[3]);
objQuad.Index3 = ParseFaceParameter(parameters[4]);
objQuads.Add(objQuad);
break;
}
break;
}
}
//}catch(Exception er) {
// Console.WriteLine(er);
// Console.WriteLine("Successfully recovered. Bounds/Collision checking may fail though");
//}
mesh.Vertices = objVertices.ToArray();
mesh.Triangles = objTriangles.ToArray();
mesh.Quads = objQuads.ToArray();
textReader.BaseStream.Close();
}
public static void Clear()
{
objVerticesIndexDictionary = null;
vertices = null;
normals = null;
texCoords = null;
objVertices = null;
objTriangles = null;
objQuads = null;
}
static char[] faceParamaterSplitter = new char[] { '/' };
static int ParseFaceParameter(string faceParameter)
{
Vector3 vertex = new Vector3();
Vector2 texCoord = new Vector2();
Vector3 normal = new Vector3();
string[] parameters = faceParameter.Split(faceParamaterSplitter);
int vertexIndex = Convert.ToInt32(parameters[0]);
if (vertexIndex < 0) vertexIndex = vertices.Count + vertexIndex;
else vertexIndex = vertexIndex - 1;
//Hmm. This seems to be broken.
try
{
vertex = vertices[vertexIndex];
}
catch (Exception)
{
throw new Exception("Vertex recognition failure at " + vertexIndex.ToString());
}
if (parameters.Length > 1)
{
int texCoordIndex = Convert.ToInt32(parameters[1]);
if (texCoordIndex < 0) texCoordIndex = texCoords.Count + texCoordIndex;
else texCoordIndex = texCoordIndex - 1;
try
{
texCoord = texCoords[texCoordIndex];
}
catch (Exception)
{
Console.WriteLine("ERR: Vertex " + vertexIndex + " not found. ");
throw new DllNotFoundException(vertexIndex.ToString());
}
}
if (parameters.Length > 2)
{
int normalIndex = Convert.ToInt32(parameters[2]);
if (normalIndex < 0) normalIndex = normals.Count + normalIndex;
else normalIndex = normalIndex - 1;
normal = normals[normalIndex];
}
return FindOrAddObjVertex(ref vertex, ref texCoord, ref normal);
}
static int FindOrAddObjVertex(ref Vector3 vertex, ref Vector2 texCoord, ref Vector3 normal)
{
ObjMesh.ObjVertex newObjVertex = new ObjMesh.ObjVertex();
newObjVertex.Vertex = vertex;
newObjVertex.TexCoord = texCoord;
newObjVertex.Normal = normal;
int index;
if (objVerticesIndexDictionary.TryGetValue(newObjVertex, out index))
{
return index;
}
else
{
objVertices.Add(newObjVertex);
objVerticesIndexDictionary[newObjVertex] = objVertices.Count - 1;
return objVertices.Count - 1;
}
}
}
Based on your description and the code you've posted, I'm going to bet that your problem isn't with the reading, the parsing, or the way you're adding things to your collections. The most likely problem is that your ObjMesh.ObjVertex structure doesn't override GetHashCode. (I'm assuming that you're using code similar to http://www.opentk.com/files/ObjMesh.cs.)
If you're not overriding GetHashCode, then your objVerticesIndexDictionary is going to perform very much like a linear list. That would account for the performance problem that you're experiencing.
I suggest that you look into providing a good GetHashCode method for your ObjMesh.ObjVertex structure.
See Why is ValueType.GetHashCode() implemented like it is? for information about the default GetHashCode implementation for value types and why it's not suitable for use in a hash table or dictionary.
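A minimal sketch of such an override, assuming the structure holds the three fields assigned in FindOrAddObjVertex (Vertex, TexCoord, Normal). The System.Numerics vectors are stand-ins for the OpenTK types so the sketch compiles on its own, and HashCode.Combine assumes .NET Core 2.1 or later; neither comes from the original code.

```csharp
using System;
using System.Numerics;

// Hypothetical stand-in for ObjMesh.ObjVertex. System.Numerics vectors
// replace the OpenTK vector types (an assumption for self-containment).
public struct ObjVertex : IEquatable<ObjVertex>
{
    public Vector3 Vertex;
    public Vector2 TexCoord;
    public Vector3 Normal;

    // Field-by-field equality, so the dictionary never falls back to
    // the reflection-based ValueType.Equals.
    public bool Equals(ObjVertex other) =>
        Vertex.Equals(other.Vertex) &&
        TexCoord.Equals(other.TexCoord) &&
        Normal.Equals(other.Normal);

    public override bool Equals(object obj) =>
        obj is ObjVertex other && Equals(other);

    // Mixes all three fields so that distinct vertices rarely share a bucket.
    public override int GetHashCode() =>
        HashCode.Combine(Vertex, TexCoord, Normal);
}
```

Implementing IEquatable<ObjVertex> alongside GetHashCode also lets Dictionary<ObjVertex, int> compare keys without boxing them.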
Edit 3: The problem is NOT with the parsing.
It's with how you read the file. If you read it properly, it would be faster; however, it seems like your reading is unusually slow. My original suspicion was that it was because of excess allocations, but it seems like there might be other problems with your code too, since that doesn't explain the entire slowdown.
Nevertheless, here's a piece of code I made that completely avoids all object allocations:
static void Main(string[] args)
{
long counter = 0;
var sw = Stopwatch.StartNew();
var sb = new StringBuilder();
var text = File.ReadAllText("spacestation.obj");
for (int i = 0; i < text.Length; i++)
{
int start = i;
while (i < text.Length &&
(char.IsDigit(text[i]) || text[i] == '-' || text[i] == '.'))
{ i++; }
if (i > start)
{
sb.Append(text, start, i - start); //Copy data to the buffer
float value = Parse(sb); //Parse the data
sb.Remove(0, sb.Length); //Clear the buffer
counter++;
}
}
sw.Stop();
Console.WriteLine("{0:N0} ms", sw.Elapsed.TotalMilliseconds); //Only a few ms
}
with this parser:
const int MIN_POW_10 = -16, MAX_POW_10 = 16,
NUM_POWS_10 = MAX_POW_10 - MIN_POW_10 + 1;
static readonly float[] pow10 = GenerateLookupTable();
static float[] GenerateLookupTable()
{
var result = new float[NUM_POWS_10 * 10]; // one block of NUM_POWS_10 powers for each digit 0-9
for (int i = 0; i < result.Length; i++)
result[i] = (float)((i / NUM_POWS_10) *
Math.Pow(10, i % NUM_POWS_10 + MIN_POW_10));
return result;
}
static float Parse(StringBuilder str)
{
float result = 0;
bool negate = false;
int len = str.Length;
int decimalIndex = str.Length;
for (int i = len - 1; i >= 0; i--)
if (str[i] == '.')
{ decimalIndex = i; break; }
int offset = -MIN_POW_10 + decimalIndex;
for (int i = 0; i < decimalIndex; i++)
if (i != decimalIndex && str[i] != '-')
result += pow10[(str[i] - '0') * NUM_POWS_10 + offset - i - 1];
else if (str[i] == '-')
negate = true;
for (int i = decimalIndex + 1; i < len; i++)
if (i != decimalIndex)
result += pow10[(str[i] - '0') * NUM_POWS_10 + offset - i];
if (negate)
result = -result;
return result;
}
it happens in a small fraction of a second.
Of course, this parser is poorly tested and has these current restrictions (and more):
Don't try parsing more digits (decimal and whole) than provided for in the array.
No error handling whatsoever.
Only parses decimals, not exponents! i.e. it can parse 1234.56 but not 1.23456E3.
Doesn't care about globalization/localization. Your file is only in a single format, so there's no point caring about that kind of stuff because you're probably using English to store it anyway.
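If the exponent or localization restrictions do matter for your files, a culture-independent fallback can be sketched with the framework's own parser. The ObjNumber class and its TryParse wrapper are hypothetical names for illustration; the float.TryParse overload it calls is part of the base class library.

```csharp
using System.Globalization;

// Hypothetical helper: parses a token with the invariant culture so that
// '.' is always the decimal separator, whatever the machine's locale is.
public static class ObjNumber
{
    public static bool TryParse(string token, out float value)
    {
        // NumberStyles.Float accepts a leading sign, a decimal point and an
        // exponent such as "1.23456E3". Returns false instead of throwing.
        return float.TryParse(token, NumberStyles.Float,
                              CultureInfo.InvariantCulture, out value);
    }
}
```

This is slower than the lookup-table parser above, but it handles exponents and reports malformed input through its return value.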
It seems like you won't necessarily need this much overkill, but take a look at your code and try to figure out the bottleneck. It seems to be neither the reading nor the parsing.
Have you measured that the speed problem is really caused by Convert.ToSingle?
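One way to check is to time the conversion in isolation. The sketch below is only an illustration: ParseTiming and TimeConvertToSingle are hypothetical names, and the input string and iteration count are up to you.

```csharp
using System;
using System.Diagnostics;
using System.Globalization;

// Hypothetical micro-benchmark: times repeated Convert.ToSingle calls.
public static class ParseTiming
{
    public static TimeSpan TimeConvertToSingle(string input, int iterations, out float checksum)
    {
        checksum = 0;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            // Accumulate the results so the calls cannot be optimized away.
            checksum += Convert.ToSingle(input, CultureInfo.InvariantCulture);
        }
        sw.Stop();
        return sw.Elapsed;
    }
}
```

Calling it with the number of floats in your file (roughly three per vertex line) and comparing the result against the total load time shows whether the conversion is a significant share of it.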
In the code you included, I see you create lists and dictionaries like this:
normals = new List<Vector3>();
texCoords = new List<Vector2>();
objVerticesIndexDictionary = new Dictionary<ObjMesh.ObjVertex, int>();
And then when you read the file, you add in the collection one item at a time.
One possible optimization would be to store the total number of normals, texCoords, indexes and so on at the start of the file, and then initialize these collections with those counts. This pre-allocates the buffers used by the collections, so adding items to them never triggers a resize.
So the collection creation should look like this:
// These values should be stored at the beginning of the file
int totalNormals = Convert.ToInt32(textReader.ReadLine());
int totalTexCoords = Convert.ToInt32(textReader.ReadLine());
int totalIndexes = Convert.ToInt32(textReader.ReadLine());
normals = new List<Vector3>(totalNormals);
texCoords = new List<Vector2>(totalTexCoords);
objVerticesIndexDictionary = new Dictionary<ObjMesh.ObjVertex, int>(totalIndexes);
See List<T> Constructor (Int32) and Dictionary<TKey, TValue> Constructor (Int32).
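Standard .obj files do not carry such counts, so if you cannot change the file format, a cheap counting pass over the text can supply them instead. This is a sketch under that assumption; ObjCounts and Count are hypothetical names, and the text is assumed to be already in memory (as it is after ReadToEnd in the question's code).

```csharp
using System;
using System.IO;

// Hypothetical pre-pass: counts "v", "vt" and "vn" lines so the
// collections can be created with the right capacity up front.
public static class ObjCounts
{
    public static void Count(string objText, out int vertices, out int texCoords, out int normals)
    {
        vertices = 0;
        texCoords = 0;
        normals = 0;
        using (var reader = new StringReader(objText))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Ordinal comparison avoids culture-sensitive matching costs.
                if (line.StartsWith("vn ", StringComparison.Ordinal)) normals++;
                else if (line.StartsWith("vt ", StringComparison.Ordinal)) texCoords++;
                else if (line.StartsWith("v ", StringComparison.Ordinal)) vertices++;
            }
        }
    }
}
```

The resulting counts can then be passed to the List<T> and Dictionary<TKey, TValue> constructors in place of the values read from the file header.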
This related question is for C++, but is definitely worth a read.
For reading as fast as possible, you're probably going to want to map the file into memory and then parse using some custom floating point parser, especially if you know the numbers are always in a specific format (i.e. you're the one generating the input files in the first place).
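In .NET, mapping a file into memory can be sketched with System.IO.MemoryMappedFiles. The example below only counts line breaks through a read-only view to show the mechanics; MappedReader and CountLines are hypothetical names, and a real loader would run its number parser over the same bytes.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

// Hypothetical example: scans a file through a read-only memory-mapped view.
public static class MappedReader
{
    public static long CountLines(string path)
    {
        long length = new FileInfo(path).Length;
        if (length == 0) return 0; // an empty file cannot be mapped

        using (var mmf = MemoryMappedFile.CreateFromFile(
                   path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        using (var view = mmf.CreateViewStream(0, length, MemoryMappedFileAccess.Read))
        {
            long lines = 0;
            var buffer = new byte[64 * 1024];
            int read;
            // Read the view in chunks and count '\n' bytes.
            while ((read = view.Read(buffer, 0, buffer.Length)) > 0)
            {
                for (int i = 0; i < read; i++)
                {
                    if (buffer[i] == (byte)'\n') lines++;
                }
            }
            return lines;
        }
    }
}
```

The view is created with the exact file length so that page-alignment padding past the end of the file is never read.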
I tested .NET string parsing once, and the fastest function for parsing text was the old VB Val() function. You could pull the relevant parts out of Microsoft.VisualBasic.Conversion.Val(string).
Converting String to numbers
Comparison of relative test times (ms / 100000 conversions)
Double  Single  Integer  Int (w/ decimal point)  Method
14      13      6        16                      Val(Str)
14      14      6        16                      Cxx(Val(Str)), e.g. CSng(Val(str))
22      21      17       e!                      Convert.To(str)
23      21      16       e!                      XX.Parse(str), e.g. Single.Parse()
30      31      31       32                      Cxx(str)
Val: fastest, part of VisualBasic dll, skips non-numeric characters
ConvertTo and Parse: slower, part of core, exception on bad format (including decimal point)
Cxx: slowest (for strings), part of core, consistent times across formats
