This is a university assignment. I have to radix sort car registration plates (ABC 123) in two ways: 1) array, 2) linked list. The catch is that the sorting MUST be done in the file itself. From here on I will talk only about the array version. I generate the car numbers and put them in an array, then with a binary write I write all the generated registration plates to a file. After that I hand the newly generated file to the radix sort, which has to do the magic. I will show you the code I have at the moment, but it is not actually a 'real' radix sort, because I cannot figure out how to implement a radix sort inside a file. (I have implemented radix sort for a normal array and a linked list, but doing it INSIDE a file is another matter.) I just wanted to ask if any of you have tips or ideas on how I could improve the sorting algorithm, because it is extremely slow. Thank you.
PROGRAM.CS
public static void CountingSort(DataArray items, int exp)
{
    UTF8Encoding encoder = new UTF8Encoding();
    byte[] forChange = new byte[16];
    double first, second;
    int i, j;
    NumberPlate plate1;
    NumberPlate plate2;

    for (int z = 0; z < items.Length; z++)
    {
        i = 0;
        j = 1;
        while (j < items.Length)
        {
            BitConverter.GetBytes(items[i]).CopyTo(forChange, 0);
            BitConverter.GetBytes(items[j]).CopyTo(forChange, 8);

            string firstPlate = encoder.GetString(forChange, 1, 7);
            string[] partsFirst = firstPlate.Split(' ');
            plate1 = new NumberPlate(partsFirst[0], partsFirst[1]);

            string secondPlate = encoder.GetString(forChange, 9, 7);
            string[] partsSecond = secondPlate.Split(' ');
            plate2 = new NumberPlate(partsSecond[0], partsSecond[1]);

            first = plate1.GetPlateCode() / exp % 10;
            second = plate2.GetPlateCode() / exp % 10;

            if (first > second)
            {
                items.Swap(j, BitConverter.ToDouble(forChange, 0), BitConverter.ToDouble(forChange, 8));
            }
            i++;
            j++;
        }
    }
}
public static void Radix_Sort(DataArray items)
{
    for (int exp = 1; exp < Math.Pow(10, 9); exp *= 10)
    {
        CountingSort(items, exp);
    }
}
public static void Test_File_Array_List(int seed)
{
    int n = 5;
    string filename;
    filename = @"mydataarray.txt";
    //filename = @"mydataarray.dat";

    MyFileArray myfilearray = new MyFileArray(filename, n);
    using (myfilearray.fs = new FileStream(filename, FileMode.Open, FileAccess.ReadWrite))
    {
        Console.WriteLine("\n FILE ARRAY \n");
        myfilearray.Print(n);
        Radix_Sort(myfilearray);
        myfilearray.Print(n);
    }
}
ARRAY.CS
public override double this[int index]
{
    get
    {
        byte[] data = new byte[8];
        fs.Seek(8 * index, SeekOrigin.Begin);
        fs.Read(data, 0, 8);
        double result = BitConverter.ToDouble(data, 0);
        return result;
    }
}

public override void Swap(int j, double a, double b)
{
    byte[] data = new byte[16];
    BitConverter.GetBytes(b).CopyTo(data, 0);
    BitConverter.GetBytes(a).CopyTo(data, 8);
    fs.Seek(8 * (j - 1), SeekOrigin.Begin);
    fs.Write(data, 0, 16);
}
If the assignment mentions an array and a linked list, then it would seem that the file is only used to read the data into the array or linked list, then the sort is done, and the result is written to a file.
For a file-based radix sort: for each digit position (right to left), 10 temp files are created, the data is read and each record is appended to the temp file matching its digit, then the 10 temp files are closed and concatenated into a single working file for the next radix sort pass. For each letter position, 26 temp files would be used. One pass is sketched below.
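To make that concrete, here is a minimal sketch of one distribution pass (my own illustration, not taken from the question's code). It assumes the file holds fixed-size 8-byte records, i.e. the doubles the DataArray stores, and that the caller supplies a digitOf function that extracts the current digit (0-9) or letter index (0-25) from a record:

using System;
using System.IO;

// One LSD radix pass over a file of fixed-size 8-byte records.
static void RadixPassOnFile(string workFile, Func<double, int> digitOf, int bucketCount)
{
    // Open one temp file per bucket (10 for a digit position, 26 for a letter position).
    var bucketNames = new string[bucketCount];
    var buckets = new BinaryWriter[bucketCount];
    for (int b = 0; b < bucketCount; b++)
    {
        bucketNames[b] = Path.GetTempFileName();
        buckets[b] = new BinaryWriter(File.Open(bucketNames[b], FileMode.Create));
    }

    // Distribute: read each record once and append it to its bucket file.
    using (var reader = new BinaryReader(File.OpenRead(workFile)))
    {
        while (reader.BaseStream.Position < reader.BaseStream.Length)
        {
            double record = reader.ReadDouble();
            buckets[digitOf(record)].Write(record);
        }
    }
    foreach (var w in buckets) w.Dispose();

    // Collect: concatenate buckets 0..bucketCount-1 back into the working file.
    using (var output = File.Open(workFile, FileMode.Create))
    {
        for (int b = 0; b < bucketCount; b++)
        {
            using (var input = File.OpenRead(bucketNames[b]))
                input.CopyTo(output);
            File.Delete(bucketNames[b]);
        }
    }
}

You would call this once per character position, least significant first: three passes with 10 buckets for the digits and three with 26 buckets for the letters. Because each distribution pass is stable, the ordering established by earlier passes is preserved.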
I have to create a utility that searches through 40 to 60 GiB of text files as quickly as possible.
Each file has around 50 MB of data consisting of log lines (about 630,000 lines per file).
A NoSQL document database is unfortunately not an option...
As of now I am using the Aho-Corasick algorithm for the search, which I stole from Tomas Petricek's blog. It works very well.
I process the files in Tasks. Each file is loaded into memory by simply calling File.ReadAllLines(path). The lines are then fed into the Aho-Corasick search one by one, so each file causes around 600,000 calls to the algorithm (I need the line number in my results).
This takes a lot of time and requires a lot of memory and CPU.
I have very little expertise in this field, as I usually work in image processing.
Can you recommend algorithms and approaches that could speed up the processing?
Below is a more detailed view of the Task creation and file loading, which is pretty standard. For more information on Aho-Corasick, please visit the linked blog page above.
private KeyValuePair<string, StringSearchResult[]> FindInternal(
    IStringSearchAlgorithm algo,
    string file)
{
    List<StringSearchResult> result = new List<StringSearchResult>();
    string[] lines = File.ReadAllLines(file);
    for (int i = 0; i < lines.Length; i++)
    {
        var results = algo.FindAll(lines[i]);
        for (int j = 0; j < results.Length; j++)
        {
            results[j].Row = i;
        }
        result.AddRange(results);
    }
    return new KeyValuePair<string, StringSearchResult[]>(
        file, result.ToArray());
}
public Dictionary<string, StringSearchResult[]> Find(
params string[] search)
{
IStringSearchAlgorithm algo = new StringSearch();
algo.Keywords = search;
Task<KeyValuePair<string, StringSearchResult[]>>[] findTasks
= new Task<KeyValuePair<string, StringSearchResult[]>>[_files.Count];
Parallel.For(0, _files.Count, i => {
findTasks[i] = Task.Factory.StartNew(
() => FindInternal(algo, _files[i])
);
});
Task.WaitAll(findTasks);
return findTasks.Select(t => t.Result)
.ToDictionary(x => x.Key, x => x.Value);
}
EDIT
See section Initial Answer for the original Answer.
I further optimized my code by doing the following:
Added paging to prevent memory overflow / crashes due to the large amount of result data.
I offload the search results into local files as soon as they exceed a certain buffer size (64 KB in my case).
Offloading the results required me to convert my SearchData struct to binary and back (a sketch of this follows the list).
Splicing the array of files to be processed and running the slices in Tasks greatly increased performance (from 35 sec to 9 sec when processing about 25 GiB of search data).
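For reference, a minimal sketch of what such binary offloading could look like (illustration only, not the actual code; it assumes a SearchData with the Match, Index and Line members used in the code further below, and the spill-file handling is a placeholder):

using System.Collections.Generic;
using System.IO;

// Flush buffered results to a local spill file as length-prefixed binary records.
static void OffloadResults(List<SearchData> buffer, string spillFile)
{
    using (var writer = new BinaryWriter(File.Open(spillFile, FileMode.Append)))
    {
        foreach (var sd in buffer)
        {
            writer.Write(sd.Index);       // int
            writer.Write(sd.Match ?? ""); // length-prefixed string
            writer.Write(sd.Line ?? "");  // length-prefixed string
        }
    }
    buffer.Clear();
}

// Read the records back once the search has finished.
static IEnumerable<SearchData> ReadOffloadedResults(string spillFile)
{
    using (var reader = new BinaryReader(File.OpenRead(spillFile)))
    {
        while (reader.BaseStream.Position < reader.BaseStream.Length)
        {
            var sd = new SearchData();
            sd.Index = reader.ReadInt32();
            sd.Match = reader.ReadString();
            sd.Line = reader.ReadString();
            yield return sd;
        }
    }
}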
Splicing / scaling the file array
The code below gives a scaled/normalized value for T_min and T_max.
This value can then be used to determine the size of each array holding n-amount of file paths.
private int ScalePartition(int T_min, int T_max)
{
// Scale m to range.
int m = T_max / 2;
int t_min = 4;
int t_max = Math.Max(T_max / 16, T_min);
m = ((T_min - m) / (T_max - T_min)) * (t_max - t_min) + t_max;
return m;
}
This code shows the implementation of the scaling and splicing.
// Get size of file array portion.
int scale = ScalePartition(1, _files.Count);
// Iterator.
int n = 0;
// List containing tasks.
List<Task<SearchData[]>> searchTasks = new List<Task<SearchData[]>>();
// Loop through files.
while (n < _files.Count) {
// Local instance of n.
// You will get an AggregateException if you use n
// as n changes during runtime.
int num = n;
// The amount of items to take.
// This needs to be calculated as there might be an
// odd number of elements in the file array.
int cnt = n + scale > _files.Count ? _files.Count - n : scale;
// Run the Find(int, int, Regex[]) method and add as task.
searchTasks.Add(Task.Run(() => Find(num, cnt, regexes)));
// Increment iterator by the amount of files stored in scale.
n += scale;
}
Initial Answer
I had the best results so far after switching to MemoryMappedFile and moving from Aho-Corasick back to Regex (pattern matching was demanded as a must-have).
There are still parts that can be optimized or changed, and I'm sure this is not the fastest or best solution, but for now it's alright.
Here is the code which returns the results in 30 seconds for 25 GiB worth of data:
// GNU coreutil wc defined buffer size.
// Had best performance with this buffer size.
//
// Definition in wc.c:
// -------------------
// /* Size of atomic reads. */
// #define BUFFER_SIZE (16 * 1024)
//
private const int BUFFER_SIZE = 16 * 1024;
private KeyValuePair<string, SearchData[]> FindInternal(Regex[] rgx, string file)
{
// Buffer for data segmentation.
byte[] buffer = new byte[BUFFER_SIZE];
// Get size of file.
FileInfo fInfo = new FileInfo(file);
long fSize = fInfo.Length;
fInfo = null;
// List of results.
List<SearchData> results = new List<SearchData>();
// Create MemoryMappedFile.
string name = "mmf_" + Path.GetFileNameWithoutExtension(file);
using (var mmf = MemoryMappedFile.CreateFromFile(
file, FileMode.Open, name))
{
// Create read-only in-memory access to file data.
using (var accessor = mmf.CreateViewStream(
0, fSize,
MemoryMappedFileAccess.Read))
{
// Store current position.
int pos = (int)accessor.Position;
// Check if the file size is less than the
// default buffer size.
int cnt = (int)(fSize - BUFFER_SIZE > 0
    ? BUFFER_SIZE
    : fSize);
// Iterate through file until end of file is reached.
while (accessor.Position < fSize)
{
// Write data to buffer.
accessor.Read(buffer, 0, cnt);
// Update position.
pos = (int)accessor.Position;
// Update next buffer size.
cnt = (int)(fSize - pos >= BUFFER_SIZE
? BUFFER_SIZE
: fSize - pos);
// Convert buffer data to string for Regex search.
string s = Encoding.UTF8.GetString(buffer);
// Run regex against extracted data.
foreach (Regex r in rgx) {
// Get matches.
MatchCollection matches = r.Matches(s);
// Create SearchData struct to reduce memory
// impact and only keep relevant data.
foreach (Match m in matches) {
SearchData sd = new SearchData();
// The actual matched string.
sd.Match = m.Value;
// The index in the file.
sd.Index = m.Index + pos;
// Index to find beginning of line.
int nFirst = m.Index;
// Index to find end of line.
int nLast = m.Index;
// Go back in the line until the end of the
// preceding line has been found.
while (s[nFirst] != '\n' && nFirst > 0) {
nFirst--;
}
// Append length of \r\n (new line).
// Change this to 1 if you work on Unix system.
nFirst+=2;
// Go forth in line until the end of the
// current line has been found.
while (s[nLast] != '\n' && nLast < s.Length-1) {
nLast++;
}
// Remove length of \r\n (new line).
// Change this to 1 if you work on Unix system.
nLast-=2;
// Store whole line in SearchData struct.
sd.Line = s.Substring(nFirst, nLast - nFirst);
// Add result.
results.Add(sd);
}
}
}
}
}
return new KeyValuePair<string, SearchData[]>(file, results.ToArray());
}
public List<KeyValuePair<string, SearchData[]>> Find(params string[] search)
{
var results = new List<KeyValuePair<string, SearchData[]>>();
// Prepare regex objects.
Regex[] regexes = new Regex[search.Length];
for (int i=0; i<regexes.Length; i++) {
regexes[i] = new Regex(search[i], RegexOptions.Compiled);
}
// Get all search results.
// Creating the Regex once and passing it
// to the sub-routine is best as the regex
// engine adds a lot of overhead.
foreach (var file in _files) {
var data = FindInternal(regexes, file);
results.Add(data);
}
return results;
}
I had a stupid idea yesterday where I thought it might work out to convert the file data to a bitmap and look for the input within the pixels, as pixel checking is quite fast.
Just for the giggles... here is the non-optimized test code for that stupid idea:
public struct SearchData
{
public string Line;
public string Search;
public int Row;
public SearchData(string l, string s, int r) {
Line = l;
Search = s;
Row = r;
}
}
internal static class FileToImage
{
public static unsafe SearchData[] FindText(string search, Bitmap bmp)
{
byte[] buffer = Encoding.ASCII.GetBytes(search);
BitmapData data = bmp.LockBits(
new Rectangle(0, 0, bmp.Width, bmp.Height),
ImageLockMode.ReadOnly, bmp.PixelFormat);
List<SearchData> results = new List<SearchData>();
int bpp = Bitmap.GetPixelFormatSize(bmp.PixelFormat) / 8;
byte* ptFirst = (byte*)data.Scan0;
byte firstHit = buffer[0];
bool isFound = false;
for (int y=0; y<data.Height; y++) {
byte* ptStride = ptFirst + (y * data.Stride);
for (int x=0; x<data.Stride; x++) {
if (firstHit == ptStride[x]) {
byte[] temp = new byte[buffer.Length];
if (buffer.Length < data.Stride-x) {
int ret = 0;
for (int n=0, xx=x; n<buffer.Length; n++, xx++) {
if (ptStride[xx] != buffer[n]) {
break;
}
ret++;
}
if (ret == buffer.Length) {
int lineLength = 0;
for (int n = 0; n<data.Stride; n+=bpp) {
if (ptStride[n+2] == 255 &&
ptStride[n+1] == 255 &&
ptStride[n+0] == 255)
{
lineLength=n;
}
}
SearchData sd = new SearchData();
byte[] lineBytes = new byte[lineLength];
Marshal.Copy((IntPtr)ptStride, lineBytes, 0, lineLength);
sd.Search = search;
sd.Line = Encoding.ASCII.GetString(lineBytes);
sd.Row = y;
results.Add(sd);
}
}
}
}
}
bmp.UnlockBits(data);
return results.ToArray();
}
private static unsafe Bitmap GetBitmapInternal(string[] lines, int startIndex, Bitmap bmp)
{
int bpp = Bitmap.GetPixelFormatSize(bmp.PixelFormat) / 8;
BitmapData data = bmp.LockBits(
new Rectangle(0, 0, bmp.Width, bmp.Height),
ImageLockMode.ReadWrite,
bmp.PixelFormat);
int index = startIndex;
byte* ptFirst = (byte*)data.Scan0;
int maxHeight = bmp.Height;
if (lines.Length - startIndex < maxHeight) {
maxHeight = lines.Length - startIndex -1;
}
for (int y = 0; y < maxHeight; y++) {
byte* ptStride = ptFirst + (y * data.Stride);
index++;
int max = lines[index].Length;
max += (max % bpp);
lines[index] += new string('\0', max % bpp);
max = lines[index].Length;
for (int x=0; x+2<max; x+=bpp) {
ptStride[x+0] = (byte)lines[index][x+0];
ptStride[x+1] = (byte)lines[index][x+1];
ptStride[x+2] = (byte)lines[index][x+2];
}
ptStride[max+2] = 255;
ptStride[max+1] = 255;
ptStride[max+0] = 255;
for (int x = max + bpp; x < data.Stride; x += bpp) {
ptStride[x+2] = 0;
ptStride[x+1] = 0;
ptStride[x+0] = 0;
}
}
bmp.UnlockBits(data);
return bmp;
}
public static unsafe Bitmap[] GetBitmap(string filePath)
{
int bpp = Bitmap.GetPixelFormatSize(PixelFormat.Format24bppRgb) / 8;
var lines = System.IO.File.ReadAllLines(filePath);
int y = 0x800; //lines.Length / 0x800;
int x = lines.Max(l => l.Length) / bpp;
int cnt = (int)Math.Ceiling((float)lines.Length / (float)y);
Bitmap[] results = new Bitmap[cnt];
for (int i = 0; i < results.Length; i++) {
results[i] = new Bitmap(x, y, PixelFormat.Format24bppRgb);
results[i] = GetBitmapInternal(lines, i * 0x800, results[i]);
}
return results;
}
}
You can split the file into partitions and regex-search each partition in parallel, then join the results. There are some sharp edges in the details, like handling values that span two partitions. Gigantor is a C# library I have created that does this very thing. Feel free to try it or have a look at the source code. A rough sketch of the general idea follows.
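For illustration only, here is a generic sketch of that partition-with-overlap idea (this is not Gigantor's API; names and sizes are placeholders, and it assumes single-byte text so char indexes line up with byte offsets):

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

static class PartitionedSearch
{
    public static (long Offset, string Value)[] Find(string path, Regex pattern,
        int partitionSize = 16 * 1024 * 1024, int overlap = 4 * 1024)
    {
        long fileSize = new FileInfo(path).Length;
        var matches = new ConcurrentBag<(long, string)>();
        int partitionCount = (int)((fileSize + partitionSize - 1) / partitionSize);

        Parallel.For(0, partitionCount, i =>
        {
            long start = (long)i * partitionSize;
            // Read the partition plus an overlap; the overlap must be at least
            // as long as the longest possible match.
            int length = (int)Math.Min(partitionSize + overlap, fileSize - start);
            byte[] buffer = new byte[length];
            using (var fs = File.OpenRead(path))
            {
                fs.Seek(start, SeekOrigin.Begin);
                fs.Read(buffer, 0, length); // a robust version would loop until 'length' bytes are read
            }
            string text = Encoding.UTF8.GetString(buffer);
            foreach (Match m in pattern.Matches(text))
            {
                // Only report matches that begin inside this partition proper,
                // so matches found in the overlap are not counted twice.
                if (m.Index < partitionSize)
                    matches.Add((start + m.Index, m.Value));
            }
        });
        return matches.OrderBy(r => r.Item1).ToArray();
    }
}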
When I run this code, the array has a new size afterwards. Is there anything wrong or bad about it?
static int[] ExpandArray(int[] input, int add_size)
{
for (int i = 0; i < add_size; i++)
{
int[] temp = input;
input = new int[input.Length + 1];
for (var j = 0; j < temp.Length; j++)
{
input[j] = temp[j];
}
}
return input;
}
static void Main(string[] args)
{
int[] ovride = new int[3] { 1, 2, 3 };
ovride = ExpandArray(ovride, 10);
ovride = ExpandArray(ovride, 10);
Console.WriteLine(ovride.Length);
}
is there anything wrong or bad about it?
This isn't code review, but:
Yes. You should not resize arrays. This involves a new allocation and a copy of all elements. As does Array.Resize(), by the way.
Hey, there is a method that already does this: Array.Resize(). Don't reinvent the wheel.
You definitely should not do the resize in a loop.
So to clean up the code a little:
static int[] ExpandArray(int[] input, int sizeToAdd)
{
// Create a new array with the desired size
var output = new int[input.Length + sizeToAdd];
// Copy all elements from input to output
for (int i = 0; i < input.Length; i++)
{
output[i] = input[i];
}
// Return the new array, having the remaining
// items set to their default (0 for int)
return output;
}
You'd actually want input to be updatable by ref, and then end with input = output.
Ultimately, just use a List<int>, as that allows for more efficient resizing, and does so automatically when necessary.
You can use Array.Resize which:
Changes the number of elements of a one-dimensional array to the specified new size.
int[] ovride = new int[3] { 1, 2, 3 };
Array.Resize(ref ovride, ovride.Length + 10);
Array.Resize(ref ovride, ovride.Length + 10);
Console.WriteLine(ovride.Length); // prints 23
But if you expect the collection size to change, List<T> can be a more suitable option for your goal.
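For example, a List<int> grows on demand:

using System;
using System.Collections.Generic;
using System.Linq;

var numbers = new List<int> { 1, 2, 3 };
numbers.AddRange(Enumerable.Range(4, 10)); // appends the values 4..13
Console.WriteLine(numbers.Count);          // prints 13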
Hi, I am trying to write and then read a 3D array of strings to a file. The array is declared as theatre[5, 5, 9]. I have been looking but can't find anything I understand. It is basically just to switch between pages in a WP8 app.
How can I do this?
Any help is much appreciated.
Thanks.
Edit: It seems you can simply use BinaryFormatter.Serialize() directly on your array as-is. It goes something like this:
using System.Runtime.Serialization.Formatters.Binary;
...
// writing
using (FileStream fs = File.Open("...", FileMode.Create))
{
    BinaryFormatter bf = new BinaryFormatter();
    bf.Serialize(fs, theArray);
}

// reading
string[,,] theArray;
using (FileStream fs = File.Open("...", FileMode.Open))
{
    BinaryFormatter bf = new BinaryFormatter();
    theArray = (string[,,])bf.Deserialize(fs);
}
First solution (try this if BinaryFormatter fails):
You can translate between 3D and 1D as follows:
struct Vector {
public int x;
public int y;
public int z;
public Vector(int x, int y, int z)
{
this.x = x;
this.y = y;
this.z = z;
}
}
Vector GetIndices3d(int i, Vector bounds)
{
Vector indices = new Vector();
int zSize = bounds.x * bounds.y;
indices.x = (i % zSize) % bounds.x;
indices.y = (i % zSize) / bounds.x;
indices.z = i / zSize;
return indices;
}
int GetIndex1d(Vector indices, Vector bounds)
{
return (indices.z * (bounds.x * bounds.y)) +
(indices.y * bounds.x) +
indices.x;
}
Then it's just a matter of turning the 3D array into a 1D array and serializing it to a file. Do the opposite for reading.
string[] Get1dFrom3d(string[,,] data)
{
Vector bounds = new Vector(data.GetLength(0), data.GetLength(1), data.GetLength(2));
string[] result = new string[data.Length];
Vector v;
for (int i = 0; i < data.Length; ++i)
{
v = GetIndices3d(i, bounds);
result[i] = data[v.x, v.y, v.z];
}
return result;
}
string[,,] Get3dFrom1d(string[] data, Vector bounds)
{
string[,,] result = new string[bounds.x, bounds.y, bounds.z];
Vector v;
for (int i = 0; i < data.Length; ++i)
{
v = GetIndices3d(i, bounds);
result[v.x, v.y, v.z] = data[i];
}
return result;
}
Serializing the data to a file depends on the content of the data. You can choose a separator character that does not appear in any of the data, and concatenate the strings using the separator.
If it's not possible to determine a distinct separator character, you can choose one at your own convenience and preprocess the strings so that the separator is escaped where it naturally appears in the data. This is usually done by doubling the separator where it appears in the string, so that it appears twice. Then handle this when reading the file (i.e. pairs of separators = natural occurrence of character data). A small sketch of this separator approach follows below.
Another approach would be to turn everything into hexadecimal and use some arbitrary separator. This will more or less double the file size.
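As a rough sketch of the separator approach (the '|' separator and the method names are my own choices, not from the answer above):

using System.Collections.Generic;
using System.Linq;
using System.Text;

const char Separator = '|';

// Escape the separator by doubling it, then join the flattened strings.
static string JoinWithEscaping(IEnumerable<string> items)
{
    return string.Join(Separator.ToString(),
        items.Select(s => s.Replace(Separator.ToString(), new string(Separator, 2))));
}

// Reverse the process: a doubled separator is a literal character,
// a single separator is an item boundary.
static string[] SplitWithEscaping(string data)
{
    var items = new List<string>();
    var current = new StringBuilder();
    for (int i = 0; i < data.Length; i++)
    {
        if (data[i] == Separator)
        {
            if (i + 1 < data.Length && data[i + 1] == Separator)
            {
                current.Append(Separator); // doubled separator = literal character
                i++;
            }
            else
            {
                items.Add(current.ToString()); // single separator = item boundary
                current.Clear();
            }
        }
        else
        {
            current.Append(data[i]);
        }
    }
    items.Add(current.ToString());
    return items.ToArray();
}

Combined with Get1dFrom3d / Get3dFrom1d above, you could write JoinWithEscaping(Get1dFrom3d(theatre)) to a file with File.WriteAllText, and rebuild the array from SplitWithEscaping when reading.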
From your last comment, it seems like you don't need a 3D array; even with the quickest/dirtiest approach, you can/should use a 2D one. But you can avoid the second dimension by creating a custom class with the properties you want. Sample code:
seat[] theatre = new seat[5]; //5 seats
int count0 = -1;
do
{
count0 = count0 + 1; //Seat number
theatre[count0] = new seat();
theatre[count0].X = 1; //X value for Seat count0
theatre[count0].Y = 2; //Y value for Seat count0
theatre[count0].Prop1 = "prop 1"; //Prop1 for Seat count0
theatre[count0].Prop2 = "prop 2"; //Prop2 for Seat count0
theatre[count0].Prop3 = "prop 3"; //Prop3 for Seat count0
} while (count0 < theatre.GetLength(0) - 1);
Where seat is defined by the following code:
class seat
{
public int X { get; set; }
public int Y { get; set; }
public string Prop1 { get; set; }
public string Prop2 { get; set; }
public string Prop3 { get; set; }
}
Let's say I have the array
1,2,3,4,5,6,7,8,9,10,11,12
if my chunk size = 4
then I want to be able to have a method that will output an array of ints int[] a =
a[0] = 1
a[1] = 3
a[2] = 6
a[3] = 10
a[4] = 14
a[5] = 18
a[6] = 22
a[7] = 26
a[8] = 30
a[9] = 34
a[10] = 38
a[11] = 42
note that each a[n] is the sum of the current item and the previous 3 items of the original array, because the chunk size is 4, so I sum the last 4 items
I need to have the method without a nested loop
for(int i=0; i<12; i++)
{
for(int k = i; k>=0 ;k--)
{
// do sumation
counter++;
if(counter==4)
break;
}
}
for example I don't want to have something like that, in order to make the code more efficient
also the chunk size may change, so I cannot do:
a[3] = a[0] + a[1] + a[2] + a[3]
edit
The reason why I asked this question is because I need to implement rolling checksums for my data structures class. I basically open a file for reading and get a byte array. Then I perform a hash function on parts of the file. Let's say the file is 100 bytes: I split it into chunks of 10 bytes and perform a hash function on each chunk, so I get 10 hashes. Then I need to compare those hashes with a second file that is similar. Let's say the second file has the same 100 bytes but with an additional 5, so it contains a total of 105 bytes. Because those extra bytes may have been in the middle of the file, if I perform the same algorithm that I did on the first file it is not going to work. Hope I explained myself correctly. And because some files are large, it is not efficient to have a nested loop in my algorithm.
Also, the real rolling hash functions are very complex. Most of them are in C++ and I have a hard time understanding them. That's why I want to create my own, very simple hashing function, just to demonstrate how checksum rolling works...
Edit 2
int chunckSize = 4;
int[] a = new int[] { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 }; // the bytes of the file
int[] b = new int[a.Length]; // array where we will place the checksums
int[] sum = new int[a.Length]; // array needed to avoid nested loop
for (int i = 0; i < a.Length; i++)
{
int temp = 0;
if (i == 0)
{
temp = 1;
}
sum[i] += a[i] + sum[i-1+temp];
if (i < chunckSize)
{
b[i] = sum[i];
}
else
{
b[i] = sum[i] - sum[i - chunckSize];
}
}
the problem with this algorithm is that with large files the sum will at some point become larger than int.MaxValue, so it is not going to work....
but at least now it is more efficient. Getting rid of that nested loop helped a lot!
edit 3
Based on edit two I have worked this out. It does not work with large files and the checksum algorithm is very bad, but at least I think it explains the rolling hashing that I am trying to demonstrate...
Part1(#"A:\fileA.txt");
Part2(#"A:\fileB.txt", null);
.....
// split the file into chunks and return the checksums of the chunks
private static UInt64[] Part1(string file)
{
UInt64[] hashes = new UInt64[(int)Math.Pow(2, 20)];
var stream = File.OpenRead(file);
int chunckSize = (int)Math.Pow(2, 22); // 2^10 => kilobyte, 2^20 => megabyte, 2^30 => gigabyte, etc.
byte[] buffer = new byte[chunckSize];
int bytesRead; // how many bytes were read
int counter = 0; // counter
while ( // while bytesRead > 0
(bytesRead =
(stream.Read(buffer, 0, buffer.Length)) // returns the number of bytes read or 0 if no bytes read
) > 0)
{
hashes[counter] = 0;
for (int i = 0; i < bytesRead; i++)
{
hashes[counter] = hashes[counter] + buffer[i]; // simple algorithm not realistic to perform check sum of file
}
counter++;
}// end while loop
return hashes;
}
// split the file into chunks, rolling the hash. In reality this file will be on a different computer..
private static void Part2(string file, UInt64[] hash)
{
UInt64[] hashes = new UInt64[(int)Math.Pow(2, 20)];
var stream = File.OpenRead(file);
int chunckSize = (int)Math.Pow(2, 22); // chunks must be as big as in the previous method
byte[] buffer = new byte[chunckSize];
int bytesRead; // how many bytes were read
int counter = 0; // counter
UInt64[] sum = new UInt64[(int)Math.Pow(2, 20)];
while ( // while bytesRead > 0
(bytesRead =
(stream.Read(buffer, 0, buffer.Length)) // returns the number of bytes read or 0 if no bytes read
) > 0)
{
for (int i = 0; i < bytesRead; i++)
{
int temp = 0;
if (counter == 0)
temp = 1;
sum[counter] += (UInt64)buffer[i] + sum[counter - 1 + temp];
if (counter < chunckSize)
{
hashes[counter] = (UInt64)sum[counter];
}else
{
hashes[counter] = (UInt64)sum[counter] - (UInt64)sum[counter - chunckSize];
}
counter++;
}
}// end while loop
// missing: compare the hashes arrays
}
Add an array r for the result, and initialize its first chunk members using a loop from 0 to chunk-1. Now observe that to get r[i+1] you can add a[i+1] to r[i], and subtract a[i-chunk+1]. Now you can do the rest of the items in a single non-nested loop:
for (int i = chunk - 1; i < N - 1; i++) {
    r[i+1] = a[i+1] + r[i] - a[i-chunk+1];
}
You can get this down to a single for loop, though that may not be good enough. To do that, just note that c[i+1] = c[i]-a[i-k+1]+a[i+1]; where a is the original array, c is the chunky array, and k is the size of the chunks.
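Putting those two answers together, here is a self-contained sketch (my own code; it uses long sums to sidestep the int overflow mentioned in the question's edits):

// Rolling sums with window size k over input a, in a single non-nested loop.
static long[] RollingSums(int[] a, int k)
{
    long[] r = new long[a.Length];
    long running = 0;
    for (int i = 0; i < a.Length; i++)
    {
        running += a[i];         // add the newest item
        if (i >= k)
            running -= a[i - k]; // drop the item that just left the window
        r[i] = running;          // for i < k this is simply the prefix sum
    }
    return r;
}

For the example input 1..12 with k = 4 this yields 1, 3, 6, 10, 14, 18, 22, 26, 30, 34, 38, 42.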
I understand that you want to compute a rolling hash function to hash every n-gram (where n is what you call the "chunk size"). Rolling hashing is sometimes called "recursive hashing". There is a wikipedia entry on the topic:
http://en.wikipedia.org/wiki/Rolling_hash
A common algorithm to solve this problem is Karp-Rabin. Here is some pseudo-code which you should be able to easily implement in C#:
B ← 37
s ← empty First-In-First-Out (FIFO) structure (e.g., a linked list)
x ← 0 (L-bit integer)
z ← 0 (L-bit integer)
for each character c do
    append c to s
    x ← (B·x − B^n·z + c) mod 2^L
    yield x
    if length(s) = n then
        remove oldest character y from s
        z ← y
    end if
end for
Note that because B^n is a constant, the main loop only does two multiplications, one subtraction and one addition. The "mod 2^L" operation can be done very fast (use a mask, or unsigned integers with L=32 or L=64, for example).
Specifically, your C# code might look like this, where n is the "chunk" size (just set B = 37 and Btothen = 37 raised to the power n):
r[0] = 0;
for (int i = 1; i < N; i++) {
    r[i] = a[i] + B * r[i-1] - Btothen * a[i-n];
}
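For completeness, here is a more fleshed-out C# sketch of the same recurrence (my own adaptation of the pseudo-code above, not the paper's reference code). It uses ulong arithmetic so the "mod 2^L" happens for free through 64-bit overflow, and it only starts subtracting the outgoing byte once the window is full:

// Rolling (recursive) hash of every n-byte window of data, Karp-Rabin style.
// All arithmetic is modulo 2^64 via unsigned overflow.
static ulong[] RollingHashes(byte[] data, int n, ulong B = 37)
{
    // Precompute B^n (mod 2^64).
    ulong bToTheN = 1;
    for (int k = 0; k < n; k++) bToTheN *= B;

    var hashes = new ulong[data.Length];
    ulong x = 0;
    for (int i = 0; i < data.Length; i++)
    {
        x = B * x + data[i];            // bring in the new byte
        if (i >= n)
            x -= bToTheN * data[i - n]; // remove the byte that left the window
        hashes[i] = x;                  // for i >= n-1, the hash of data[i-n+1 .. i]
    }
    return hashes;
}

Comparing these window hashes against the per-chunk hashes of the first file lets you spot chunks that reappear at any byte offset in the second file.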
Karp-Rabin is not ideal however. I wrote a paper where better solutions are discussed:
Daniel Lemire and Owen Kaser, Recursive n-gram hashing is pairwise independent, at best, Computer Speech & Language 24 (4), pages 698-710, 2010.
http://arxiv.org/abs/0705.4676
I also published the source code (Java and C++, alas no C# but it should not be hard to go from Java to C#):
https://github.com/lemire/rollinghashjava
https://github.com/lemire/rollinghashcpp
How about storing off the last chunk_size values as you step through?
Allocate an array of size chunk_size, set all of its elements to zero, then at each iteration i overwrite the element at i % chunk_size with your current element, and add up all the values?
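A minimal sketch of that circular-buffer idea (my interpretation; it keeps a running total instead of re-summing the buffer, which keeps the loop non-nested):

// Rolling sum using a circular buffer of the last chunkSize values.
static long[] RollingSumCircular(int[] a, int chunkSize)
{
    int[] window = new int[chunkSize]; // starts out all zeros
    long[] result = new long[a.Length];
    long total = 0;
    for (int i = 0; i < a.Length; i++)
    {
        int slot = i % chunkSize;
        total -= window[slot]; // value falling out of the window (0 at the start)
        window[slot] = a[i];   // overwrite it with the current element
        total += a[i];
        result[i] = total;
    }
    return result;
}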
using System;
class Sample {
static void Main(){
int chunckSize = 4;
int[] a = new int[] { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 };
int[] b = new int[a.Length];
int sum = 0;
int d = chunckSize*(chunckSize-1)/2;
foreach(var i in a){
if(i < chunckSize){
sum += i;
b[i-1]=sum;
} else {
b[i-1]=chunckSize*i -d;
}
}
Console.WriteLine(String.Join(",", b));//1,3,6,10,14,18,22,26,30,34,38,42
}
}
I have this code
Open WritingPath & "\FplDb.txt" For Random As #1 Len = Len(WpRec)
For i = 1 To 99
WpRec.WpIndex = FplDB(i, 1)
WpRec.WpName = FplDB(i, 2)
WpRec.WpLat = FplDB(i, 3)
WpRec.WpLon = FplDB(i, 4)
WpRec.WpLatDir = FplDB(i, 5)
WpRec.WpLonDir = FplDB(i, 6)
Put #1, i, WpRec
Next i
Close #1
SaveOk = 1
FplSave = SaveOk
Exit Function
This function does a binary serialization of a matrix of 99 structs (WpRec) to a file, using the "Open" and "Put" statements. But I don't understand how it is encoded... It is important to me because I need to rewrite the same serialization in C#, and for that I need to know what encoding method is used so I can do the same in C#....
The tricky bit in VB6 was that you were allowed to declare structures with fixed length strings so that you could write records containing strings that didn't need a length prefix. The length of the string buffer was encoded into the type instead of needing to be written out with the record. This allowed for fixed size records. In .NET, this has kind of been left behind in the sense that VB.NET has a mechanism to support it for backward compatibility, but it's not really intended for C# as far as I can tell: How to declare a fixed-length string in VB.NET?.
.NET seems to have a preference for generally writing out strings with a length prefix, meaning that records are generally variable-length. This is suggested by the implementation of BinaryReader.ReadString.
However, you can use System.BitConverter to get finer control over how records are serialized and de-serialized as bytes (System.IO.BinaryReader and System.IO.BinaryWriter are probably not useful since they make assumptions that strings have a length prefix). Keep in mind that a VB6 Integer maps to a .NET Int16 and a VB6 Long is a .Net Int32. I don't know exactly how you have defined your VB6 structure, but here's one possible implementation as an example:
class Program
{
static void Main(string[] args)
{
WpRecType[] WpRec = new WpRecType[3];
WpRec[0] = new WpRecType();
WpRec[0].WpIndex = 0;
WpRec[0].WpName = "New York";
WpRec[0].WpLat = 40.783f;
WpRec[0].WpLon = 73.967f;
WpRec[0].WpLatDir = 1;
WpRec[0].WpLonDir = 1;
WpRec[1] = new WpRecType();
WpRec[1].WpIndex = 1;
WpRec[1].WpName = "Minneapolis";
WpRec[1].WpLat = 44.983f;
WpRec[1].WpLon = 93.233f;
WpRec[1].WpLatDir = 1;
WpRec[1].WpLonDir = 1;
WpRec[2] = new WpRecType();
WpRec[2].WpIndex = 2;
WpRec[2].WpName = "Moscow";
WpRec[2].WpLat = 55.75f;
WpRec[2].WpLon = 37.6f;
WpRec[2].WpLatDir = 1;
WpRec[2].WpLonDir = 2;
byte[] buffer = new byte[WpRecType.RecordSize];
using (System.IO.FileStream stm =
new System.IO.FileStream(#"C:\Users\Public\Documents\FplDb.dat",
System.IO.FileMode.OpenOrCreate, System.IO.FileAccess.ReadWrite))
{
WpRec[0].SerializeInto(buffer);
stm.Write(buffer, 0, buffer.Length);
WpRec[1].SerializeInto(buffer);
stm.Write(buffer, 0, buffer.Length);
WpRec[2].SerializeInto(buffer);
stm.Write(buffer, 0, buffer.Length);
// Seek to record #1, load and display it
stm.Seek(WpRecType.RecordSize * 1, System.IO.SeekOrigin.Begin);
stm.Read(buffer, 0, WpRecType.RecordSize);
WpRecType rec = new WpRecType(buffer);
Console.WriteLine("[{0}] {1}: {2} {3}, {4} {5}", rec.WpIndex, rec.WpName,
rec.WpLat, (rec.WpLatDir == 1) ? "N" : "S",
rec.WpLon, (rec.WpLonDir == 1) ? "W" : "E");
}
}
}
class WpRecType
{
public short WpIndex;
public string WpName;
public Single WpLat;
public Single WpLon;
public byte WpLatDir;
public byte WpLonDir;
const int WpNameBytes = 40; // 20 unicode characters
public const int RecordSize = WpNameBytes + 12;
public void SerializeInto(byte[] target)
{
int position = 0;
target.Initialize();
BitConverter.GetBytes(WpIndex).CopyTo(target, position);
position += 2;
System.Text.Encoding.Unicode.GetBytes(WpName).CopyTo(target, position);
position += WpNameBytes;
BitConverter.GetBytes(WpLat).CopyTo(target, position);
position += 4;
BitConverter.GetBytes(WpLon).CopyTo(target, position);
position += 4;
target[position++] = WpLatDir;
target[position++] = WpLonDir;
}
public void Deserialize(byte[] source)
{
int position = 0;
WpIndex = BitConverter.ToInt16(source, position);
position += 2;
WpName = System.Text.Encoding.Unicode.GetString(source, position, WpNameBytes);
position += WpNameBytes;
WpLat = BitConverter.ToSingle(source, position);
position += 4;
WpLon = BitConverter.ToSingle(source, position);
position += 4;
WpLatDir = source[position++];
WpLonDir = source[position++];
}
public WpRecType()
{
}
public WpRecType(byte[] source)
{
Deserialize(source);
}
}
Add a reference to Microsoft.VisualBasic and use FilePut
It is designed to assist with compatibility with VB6
The VB6 code in your question would be something like this in C# (I haven't compiled this)
Microsoft.VisualBasic.FileSystem.FileOpen(1, WritingPath + @"\FplDb.txt", OpenMode.Random,
    RecordLength: Marshal.SizeOf(WpRec));
for (int i = 1; i < 100; i++) {
    WpRec.WpIndex = FplDB[i, 1];
    WpRec.WpName = FplDB[i, 2];
    WpRec.WpLat = FplDB[i, 3];
    WpRec.WpLon = FplDB[i, 4];
    WpRec.WpLatDir = FplDB[i, 5];
    WpRec.WpLonDir = FplDB[i, 6];
    Microsoft.VisualBasic.FileSystem.FilePut(1, WpRec, i);
}
Microsoft.VisualBasic.FileSystem.FileClose(1);
I think Marshal.SizeOf(WpRec) returns the same value that Len(WpRec) will return in VB6 - do check this though.
The put statement in VB6 does not do any encoding. It saves a structure just as it is stored internally in memory. For example, put saves a double as a 64-bit floating point value, just as it is represented in memory. In your example, the members of WpRec are stored by the put statement just as WpRec is stored in memory.