Efficient way of finding repeating sequences of characters in a string - C#

I am trying to do the following:
1. Read the file contents into a byte array
2. Convert the byte array into a Base64 string
3. Find all sequences of repeating characters that are longer than 8 in length
4. Place the found repeating patterns in a list
Here is where I am currently having some issues... I am currently reading a 1MB file using this loop:
void bkg_DoWork(object sender, DoWorkEventArgs e)
{
    try
    {
        Byte[] bytes = File.ReadAllBytes(this.txt_Filename.Text);
        string file = Convert.ToBase64String(bytes);
        char lastchar = '\0';
        int count = 0;
        List<RepeatingPattern> patterns = new List<RepeatingPattern>();
        this.Invoke((MethodInvoker)delegate
        {
            this.pb_Progress.Maximum = file.Length;
            this.pb_Progress.Value = 0;
            this.lbl_Progress.Text = "Progress: File contents read... Looking for patterns! 0% Done...";
        });
        for (int i = 0; i < file.Length; i++)
        {
            this.Invoke((MethodInvoker)delegate
            {
                this.pb_Progress.Value += 1;
                this.lbl_Progress.Text = "Progress: Looking for patterns! " + (int)Decimal.Truncate((decimal)((double)i / file.Length) * 100) + "% Done...";
            });
            if (file[i] == lastchar)
                count += 1;
            else
            {
                // Create a pattern if the count is longer than what a pattern's
                // compressed form takes up... 8 chars: [$a,#$]
                if (count > 8)
                {
                    // Create and add a pattern to the list if necessary.
                    RepeatingPattern ptn = new RepeatingPattern(lastchar, count);
                    if (!patterns.Contains(ptn))
                        patterns.Add(ptn);
                }
                count = 0;
                lastchar = file[i];
            }
        }
        e.Result = patterns;
    }
    catch (Exception ex)
    {
        e.Result = ex;
    }
}
However, this loop is VERY slow: the 1MB file takes about a minute to process. In this day and age, that feels like a long time for such a small file. Is there a more efficient way to find the repeating patterns?
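The dominant cost here is almost certainly not the scan itself but the this.Invoke(...) issued for every single character, each of which marshals a message to the UI thread and waits for it; the linear patterns.Contains lookup also hurts (a HashSet<RepeatingPattern>, with Equals and GetHashCode overridden, would make that check O(1)). A minimal sketch of the same loop with throttled progress reporting, reusing the question's controls (the 10000-character interval is an arbitrary choice):

for (int i = 0; i < file.Length; i++)
{
    // Only touch the UI occasionally; per-character Invoke calls dominate the runtime.
    if (i % 10000 == 0)
    {
        int percent = (int)(100L * i / file.Length);
        this.Invoke((MethodInvoker)delegate
        {
            this.pb_Progress.Value = i;
            this.lbl_Progress.Text = "Progress: Looking for patterns! " + percent + "% Done...";
        });
    }
    // ... unchanged pattern-detection logic from the question ...
}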

Related

Is there an equivalent to mmap.mmap.rfind in C#?

While looking at memory-mapped files in C#, I had some difficulty identifying how to search a file quickly, forward and in reverse. My goal is to rewrite the following function in C#, but I could find nothing like the find and rfind methods used below. Is there a way in C# to quickly search a memory-mapped file using a particular substring?
#! /usr/bin/env python3
import mmap
import pathlib


# noinspection PyUnboundLocalVariable
def drop_last_line(path):
    with path.open('r+b') as file:
        with mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ) as search:
            for next_line in b'\r\n', b'\r', b'\n':
                if search.find(next_line) >= 0:
                    break
            else:
                raise ValueError('cannot find any line delimiters')
            end_1st = search.rfind(next_line)
            end_2nd = search.rfind(next_line, 0, end_1st - 1)
        file.truncate(0 if end_2nd < 0 else end_2nd + len(next_line))
Do you know of any way to memory-map an entire file in C# and then treat it as a byte array?
Yes, it's quite easy to map an entire file into a view and then read it into a single byte array, as the following code shows:
static void Main(string[] args)
{
    var sourceFile = new FileInfo(@"C:\Users\Micky\Downloads\20180112.zip");
    int length = (int)sourceFile.Length; // length of target file

    // Create the memory-mapped file.
    using (var mmf = MemoryMappedFile.CreateFromFile(sourceFile.FullName,
                                                     FileMode.Open,
                                                     "ImgA"))
    {
        var buffer = new byte[length]; // allocate a buffer the same size as the file
        using (var accessor = mmf.CreateViewAccessor())
        {
            var read = accessor.ReadArray(0, buffer, 0, length); // read the whole thing
        }

        // Let's try searching for a known byte sequence. Change this to suit your file.
        var target = new byte[] { 71, 213, 62, 204, 231 };
        var foundAt = IndexOf(buffer, target);
    }
}
I couldn't seem to find any byte-searching method in Marshal or Array, but you can use this search algorithm, courtesy of Social MSDN, as a start:
private static int IndexOf2(byte[] input, byte[] pattern)
{
    byte firstByte = pattern[0];
    int index = Array.IndexOf(input, firstByte);
    while (index >= 0 && index + pattern.Length <= input.Length)
    {
        bool match = true;
        for (int i = 1; i < pattern.Length; i++)
        {
            if (pattern[i] != input[index + i])
            {
                match = false;
                break;
            }
        }
        if (match)
            return index;
        // Keep scanning from the next occurrence of the first byte.
        index = Array.IndexOf(input, firstByte, index + 1);
    }
    return -1;
}
...or even this more verbose example (also courtesy of Social MSDN, same link):
public static int IndexOf(byte[] arrayToSearchThrough, byte[] patternToFind)
{
    if (patternToFind.Length > arrayToSearchThrough.Length)
        return -1;
    // <= so the pattern can also match at the very end of the array.
    for (int i = 0; i <= arrayToSearchThrough.Length - patternToFind.Length; i++)
    {
        bool found = true;
        for (int j = 0; j < patternToFind.Length; j++)
        {
            if (arrayToSearchThrough[i + j] != patternToFind[j])
            {
                found = false;
                break;
            }
        }
        if (found)
        {
            return i;
        }
    }
    return -1;
}
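For what it's worth, on .NET Core 2.1 or later there is also a built-in, vectorized subsequence search over spans, which avoids hand-rolling the scan entirely; a minimal sketch:

using System;

byte[] buffer = { 1, 2, 71, 213, 62, 204, 231, 9 };
byte[] target = { 71, 213, 62, 204, 231 };

// MemoryExtensions.IndexOf performs an optimized search for the
// subsequence and returns -1 when it is absent.
int foundAt = buffer.AsSpan().IndexOf(target);
Console.WriteLine(foundAt); // prints 2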

String compression for repeated chars or substrings [closed]

I have a string that may contain any letter
string uncompressed = "dacacacd";
I need to compress this string in the format
string compressed = "d3(ac)d";
But it should also compress any repeating substring where possible, e.g.:
string uncompressed = "dabcabcdabcabc";
string compressed = "2(d2(abc))";
Is there a way to achieve this without any third party library?
Here's an example that does the compression based on the longest substrings first. It may not be the most efficient, or give the best compression for something like "abababcabc", but it should at least get you started.
public class CompressedString
{
    private class Segment
    {
        public Segment(int count, CompressedString value)
        {
            Count = count;
            Value = value;
        }

        public int Count { get; set; }
        public CompressedString Value { get; set; }
    }

    private List<Segment> segments = new List<Segment>();
    private string uncompressible;

    private CompressedString() { }

    public static CompressedString Compress(string s)
    {
        var compressed = new CompressedString();
        // The longest possible repeating substring is half the length of the
        // string, so try that first and work down to shorter lengths.
        for (int len = s.Length / 2; len > 0; len--)
        {
            // Look for the substring at each index that still leaves room
            // for at least one full repeat after it.
            for (int i = 0; i <= s.Length - 2 * len; i++)
            {
                var sub = s.Substring(i, len);
                int count = 1;
                // Look for repeats of the substring immediately after it.
                for (int j = i + len; j <= s.Length - len; j += len)
                {
                    // Increase the count of times the substring is found,
                    // or stop looking when it doesn't match.
                    if (sub == s.Substring(j, len))
                    {
                        count++;
                    }
                    else
                    {
                        break;
                    }
                }
                // If we found repeats then handle the substring before the
                // repeats, the repeats themselves, and everything after.
                if (count > 1)
                {
                    // If anything is before the repeats then add it to the
                    // segments with a count of one and compress it.
                    if (i > 0)
                    {
                        compressed.segments.Add(new Segment(1, Compress(s.Substring(0, i))));
                    }
                    // Add the repeats to the segments with the found count
                    // and compress them.
                    compressed.segments.Add(new Segment(count, Compress(sub)));
                    // If anything is after the repeats then add it to the
                    // segments with a count of one and compress it.
                    if (s.Length - (count * len) > i)
                    {
                        compressed.segments.Add(new Segment(1, Compress(s.Substring(i + (count * len)))));
                    }
                    // We're done compressing, so break this loop and the
                    // outer one by setting len to 0.
                    len = 0;
                    break;
                }
            }
        }
        // If we failed to find any repeating substrings then we just have
        // a single uncompressible string.
        if (!compressed.segments.Any())
        {
            compressed.uncompressible = s;
        }
        // Reduce the compression for something like "2(2(ab))" to "4(ab)".
        compressed.Reduce();
        return compressed;
    }

    private void Reduce()
    {
        // Attempt to reduce each segment.
        foreach (var seg in segments)
        {
            seg.Value.Reduce();
            // If there is only one sub-segment then we can fold it into
            // this segment by multiplying the counts.
            if (seg.Value.segments.Count == 1)
            {
                var subSeg = seg.Value.segments[0];
                seg.Value = subSeg.Value;
                seg.Count *= subSeg.Count;
            }
        }
    }

    public override string ToString()
    {
        if (segments.Any())
        {
            StringBuilder builder = new StringBuilder();
            foreach (var seg in segments)
            {
                if (seg.Count == 1)
                    builder.Append(seg.Value.ToString());
                else
                {
                    builder.Append(seg.Count).Append("(").Append(seg.Value.ToString()).Append(")");
                }
            }
            return builder.ToString();
        }
        return uncompressible;
    }
}
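A quick usage sketch against the examples from the question (assuming the class above is in scope):

Console.WriteLine(CompressedString.Compress("dacacacd"));       // d3(ac)d
Console.WriteLine(CompressedString.Compress("dabcabcdabcabc")); // 2(d2(abc))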

C# Concatenate strings or array of chars

I'm facing a problem while developing an application.
Basically,
I have a fixed string, let's say "IHaveADream"
I now want the user to insert another string (for my purposes, of a fixed length) and then interleave every character of the fixed string with every character of the string inserted by the user.
e.g.
The user inserts "ByeBye"
then the output would be:
"IBHyaevBeyAeDream".
How to accomplish this?
I have tried with String.Concat and String.Join, inside a for statement, with no luck.
One memory-efficient option is to use a string builder, since both the original string and the user input could potentially be rather large. As mentioned by Kris, you can initialize your StringBuilder capacity to the combined length of both strings.
void Main()
{
    var start = "IHaveADream";
    var input = "ByeBye";
    var sb = new StringBuilder(start.Length + input.Length);
    for (int i = 0; i < start.Length; i++)
    {
        sb.Append(start[i]);
        if (input.Length >= i + 1)
            sb.Append(input[i]);
    }
    sb.ToString().Dump();
}
This only safely accounts for the input string being shorter than or equal in length to the starting string. If the input string could be longer, you'd want to use the longer of the two lengths as the end point of the loop and check that each index is in bounds:
void Main()
{
    var start = "IHaveADream";
    var input = "ByeByeByeByeBye";
    var sb = new StringBuilder(start.Length + input.Length);
    var length = start.Length >= input.Length ? start.Length : input.Length;
    for (int i = 0; i < length; i++)
    {
        if (start.Length >= i + 1)
            sb.Append(start[i]);
        if (input.Length >= i + 1)
            sb.Append(input[i]);
    }
    sb.ToString().Dump();
}
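A small aside, not in the original answer: .Dump() is a LINQPad extension method, so outside LINQPad you would replace sb.ToString().Dump() with Console.WriteLine(sb.ToString()).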
You can create an array of characters and then re-combine them in the order you want.
char[] chars1 = "IHaveADream".ToCharArray();
char[] chars2 = "ByeBye".ToCharArray();
// you can create a custom algorithm to handle
// different size strings.
char[] c = new char[17];
c[0] = chars1[0];
c[1] = chars2[0];
...
c[16] = chars1[10];
string s = new string(c);
var firstString = "Ihaveadream";
var secondString = "ByeBye";
var stringBuilder = new StringBuilder();
for (int i = 0; i < firstString.Length; i++)
{
    stringBuilder.Append(firstString[i]);
    if (i < secondString.Length)
    {
        stringBuilder.Append(secondString[i]);
    }
}
var result = stringBuilder.ToString();
If you don't care much about memory usage or performance, you can just use:
public static string ConcatStrings(string value, string value2)
{
    string result = "";
    for (int i = 0; i < Math.Max(value.Length, value2.Length); i++)
    {
        if (i < value.Length) result += value[i].ToString();
        if (i < value2.Length) result += value2[i].ToString();
    }
    return result;
}
Usage:
string conststr = "IHaveADream";
string input = "ByeBye";
var result = ConcatStrings(conststr, input);
Console.WriteLine(result);
Output: IBHyaevBeyAeDream
P.S.
I just benchmarked both methods (StringBuilder and simple concatenation), and for a single operation they take about the same time to execute. The main reason is that the StringBuilder takes considerable time to initialize, which plain concatenation avoids.
But if you have to process something like 1,500 strings, it's a different story and StringBuilder is the better option.
For 100,000 method executions, the timings were 85 ms (StringBuilder) vs. 22 ms (concatenation).

How do I let StreamReader.ReadLine() differentiate "\r\n" from "\n"?

I created a function that receives chunked HTTP packages using StreamReader and TcpClient.
Here's what I've created:
private string recv()
{
    Thread.Sleep(Config.ApplicationClient.WAIT_INTERVAL);
    string result = String.Empty;
    string line = reader.ReadLine();
    result += line + "\n";
    // read the headers until the blank line
    while (line.Length > 0)
    {
        line = reader.ReadLine();
        result += line + "\n";
    }
    // read each chunk; its size arrives as a hex line
    for (int size = -1, total = 0; size != 0; total = 0)
    {
        line = reader.ReadLine();
        size = PacketAnalyzer.parseHex(line);
        while (total < size)
        {
            line = reader.ReadLine();
            result += line + "\n";
            int i = encoding.GetBytes(line).Length;
            total += i + 2; // assumes the line break was "\r\n", which is not always the case
        }
    }
    reader.DiscardBufferedData();
    return result;
}
For each line it reads, it adds 2 to total, assuming the line break was "\r\n". This works in almost all cases, except when the data contains a bare '\n', which I have no way to distinguish from "\r\n". In those cases the function thinks it has read more than it actually has, short-reading the chunk and exposing PacketAnalyzer.parseHex() to errors.
(Answered in a question edit; converted to a community wiki answer.)
The OP wrote:
SOLVED: I made the following two line-reading and stream-emptying functions and I'm back on track again.
NetworkStream ns;
byte[] buffer; // field, assumed to be allocated elsewhere and large enough for the longest line
//.....

private void emptyStreamBuffer()
{
    while (ns.DataAvailable)
        ns.ReadByte();
}

private string readLine()
{
    int i = 0;
    for (byte b = (byte)ns.ReadByte(); b != '\n'; b = (byte)ns.ReadByte())
        buffer[i++] = b;
    return encoding.GetString(buffer, 0, i);
}
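A note on why this resolves the original "\r\n"-versus-"\n" ambiguity: readLine() stops at '\n' but does not strip a preceding '\r', so the caller can check whether the returned line ends in '\r' to tell a "\r\n" break apart from a bare "\n", and byte counts come from the bytes actually read rather than an assumed two-byte terminator.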

How to read last "n" lines of log file [duplicate]

This question already has answers here:
Get last 10 lines of very large text file > 10GB
(21 answers)
I need a snippet of code that reads the last n lines of a log file. I came up with the following code from the net. I am kinda new to C#. Since the log file might be quite large, I want to avoid the overhead of reading the entire file. Can someone suggest any performance enhancement? I do not really want to read each character and change position.
var reader = new StreamReader(filePath, Encoding.ASCII);
reader.BaseStream.Seek(0, SeekOrigin.End);
var count = 0;
while (count <= tailCount)
{
    if (reader.BaseStream.Position <= 0) break;
    reader.BaseStream.Position--;
    int c = reader.Read();
    if (reader.BaseStream.Position <= 0) break;
    reader.BaseStream.Position--;
    if (c == '\n')
    {
        ++count;
    }
}
var str = reader.ReadToEnd();
Your code will perform very poorly, since you aren't allowing any caching to happen.
In addition, it will not work at all for Unicode.
I wrote the following implementation:
///<summary>Returns the end of a text reader.</summary>
///<param name="reader">The reader to read from.</param>
///<param name="lineCount">The number of lines to return.</param>
///<returns>The last lineCount lines from the reader.</returns>
public static string[] Tail(this TextReader reader, int lineCount)
{
    var buffer = new List<string>(lineCount);
    string line;
    for (int i = 0; i < lineCount; i++)
    {
        line = reader.ReadLine();
        if (line == null) return buffer.ToArray();
        buffer.Add(line);
    }

    // The index of the last line read into the buffer. Everything > this
    // index was read earlier than everything <= this index.
    int lastLine = lineCount - 1;

    while (null != (line = reader.ReadLine()))
    {
        lastLine++;
        if (lastLine == lineCount) lastLine = 0;
        buffer[lastLine] = line;
    }

    if (lastLine == lineCount - 1) return buffer.ToArray();
    var retVal = new string[lineCount];
    buffer.CopyTo(lastLine + 1, retVal, 0, lineCount - lastLine - 1);
    buffer.CopyTo(0, retVal, lineCount - lastLine - 1, lastLine + 1);
    return retVal;
}
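A possible usage sketch, assuming the Tail extension method above is declared in a static class that is in scope (the path is illustrative):

using (var reader = File.OpenText(@"c:\app.log")) // illustrative path
{
    foreach (var line in reader.Tail(10))
        Console.WriteLine(line);
}

Note that, by design, this reads the whole stream forward with buffered sequential I/O instead of seeking backwards byte by byte, which is exactly the caching point made above.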
I had trouble with your code; this is my version. Since it's a log file, something might be writing to it, so it's best to make sure you're not locking it.
You go to the end, read backwards until you reach n lines, then read everything from there on.
int n = 5; // or any arbitrary number
int count = 0;
string content;
byte[] buffer = new byte[1];
using (FileStream fs = new FileStream("text.txt", FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
{
    // seek to the end.
    fs.Seek(0, SeekOrigin.End);
    // read backwards 'n' lines
    while (count < n)
    {
        fs.Seek(-1, SeekOrigin.Current);
        fs.Read(buffer, 0, 1);
        if (buffer[0] == '\n')
        {
            count++;
        }
        fs.Seek(-1, SeekOrigin.Current); // fs.Read(...) advances the position, so we need to go back again
    }
    fs.Seek(1, SeekOrigin.Current); // go past the last '\n'

    // read the last n lines
    using (StreamReader sr = new StreamReader(fs))
    {
        content = sr.ReadToEnd();
    }
}
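Two caveats worth noting with this approach: if the file contains fewer than n newlines, the Seek(-1, SeekOrigin.Current) calls will eventually try to move before the start of the stream and throw an IOException, so a robust version should also stop when fs.Position reaches 0; and the byte-wise '\n' scan is fine for ASCII and UTF-8 (where 0x0A never occurs inside a multi-byte sequence) but not for UTF-16 logs.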
A friend of mine uses this method (BackwardReader can be found here):
public static IList<string> GetLogTail(string logname, string numrows)
{
    int lineCnt = 1;
    List<string> lines = new List<string>();
    int maxLines;
    if (!int.TryParse(numrows, out maxLines))
    {
        maxLines = 100;
    }
    string logFile = HttpContext.Current.Server.MapPath("~/" + logname);
    BackwardReader br = new BackwardReader(logFile);
    while (!br.SOF)
    {
        string line = br.Readline();
        lines.Add(line + System.Environment.NewLine);
        if (lineCnt == maxLines) break;
        lineCnt++;
    }
    lines.Reverse();
    return lines;
}
Does your log have lines of similar length? If yes, then you can calculate the average length of a line, then do the following:
1. Seek to end_of_file - lines_needed * avg_line_length (call this previous_point)
2. Read everything up to the end
3. If you grabbed enough lines, that's fine. If not, seek to previous_point - lines_needed * avg_line_length
4. Read everything up to previous_point
5. Go to step 3
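A rough sketch of that loop (a hypothetical helper, not from the original answer; it assumes UTF-8 text and simply doubles the window on a miss rather than stepping back one window at a time):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

static string[] TailByAvgLineLength(string path, int linesNeeded, int avgLineLength)
{
    long offset = (long)linesNeeded * avgLineLength;
    while (true)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        {
            long start = Math.Max(0, fs.Length - offset);
            fs.Seek(start, SeekOrigin.Begin);
            var lines = new List<string>();
            using (var sr = new StreamReader(fs, Encoding.UTF8))
            {
                string line;
                while ((line = sr.ReadLine()) != null) lines.Add(line);
            }
            int skip = start == 0 ? 0 : 1; // the first line is likely partial unless we started at 0
            if (lines.Count - skip >= linesNeeded || start == 0)
                return lines.Skip(Math.Max(skip, lines.Count - linesNeeded)).ToArray();
        }
        offset *= 2; // not enough lines in the window; widen it and retry
    }
}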
A memory-mapped file is also a good method: map the tail of the file, count lines, map the previous block, count lines, and so on until you reach the number of lines needed.
Here is my answer:
private string StatisticsFile = @"c:\yourfilename.txt";

// Read last lines of a file....
public IList<string> ReadLastLines(int nFromLine, int nNoLines, out bool bMore)
{
    // Initialise more
    bMore = false;
    try
    {
        char[] buffer = null;
        //lock (strMessages) Lock something if you need to....
        {
            if (File.Exists(StatisticsFile))
            {
                // Open file
                using (StreamReader sr = new StreamReader(StatisticsFile))
                {
                    long FileLength = sr.BaseStream.Length;
                    int c, linescount = 0;
                    long pos = FileLength - 1;
                    long PreviousReturn = FileLength;
                    // Process file
                    while (pos >= 0 && linescount < nFromLine + nNoLines) // Until found correct place
                    {
                        // Read a character from the end
                        c = BufferedGetCharBackwards(sr, pos);
                        if (c == Convert.ToInt32('\n'))
                        {
                            // Found return character
                            if (++linescount == nFromLine)
                                // Found last place
                                PreviousReturn = pos + 1; // Read to here
                        }
                        // Previous char
                        pos--;
                    }
                    pos++;
                    // Create buffer
                    buffer = new char[PreviousReturn - pos];
                    sr.DiscardBufferedData();
                    // Read all our chars
                    sr.BaseStream.Seek(pos, SeekOrigin.Begin);
                    sr.Read(buffer, (int)0, (int)(PreviousReturn - pos));
                    sr.Close();
                    // Store if more lines available
                    if (pos > 0)
                        // Is there more?
                        bMore = true;
                }
                if (buffer != null)
                {
                    // Get data
                    string strResult = new string(buffer);
                    strResult = strResult.Replace("\r", "");
                    // Store in List
                    List<string> strSort = new List<string>(strResult.Split('\n'));
                    // Reverse order
                    strSort.Reverse();
                    return strSort;
                }
            }
        }
    }
    catch (Exception ex)
    {
        System.Diagnostics.Debug.WriteLine("ReadLastLines Exception:" + ex.ToString());
    }
    // Let's return a list with no entries
    return new List<string>();
}
const int CACHE_BUFFER_SIZE = 1024;
private long ncachestartbuffer = -1;
private char[] cachebuffer = null;

// Cache the file....
private int BufferedGetCharBackwards(StreamReader sr, long iPosFromBegin)
{
    // Check for error
    if (iPosFromBegin < 0 || iPosFromBegin >= sr.BaseStream.Length)
        return -1;
    // See if we have the character already
    if (ncachestartbuffer >= 0 && ncachestartbuffer <= iPosFromBegin && ncachestartbuffer + cachebuffer.Length > iPosFromBegin)
    {
        return cachebuffer[iPosFromBegin - ncachestartbuffer];
    }
    // Load into cache
    ncachestartbuffer = (int)Math.Max(0, iPosFromBegin - CACHE_BUFFER_SIZE + 1);
    int nLength = (int)Math.Min(CACHE_BUFFER_SIZE, sr.BaseStream.Length - ncachestartbuffer);
    cachebuffer = new char[nLength];
    sr.DiscardBufferedData();
    sr.BaseStream.Seek(ncachestartbuffer, SeekOrigin.Begin);
    sr.Read(cachebuffer, (int)0, (int)nLength);
    return BufferedGetCharBackwards(sr, iPosFromBegin);
}
Notes:
- Call ReadLastLines with nFromLine starting at 0 for the last line and nNoLines as the number of lines to read back from there.
- It reverses the list, so the first entry is the last line in the file.
- bMore returns true if there are more lines to read.
- It caches the data in 1024-char chunks, so it is fast; you may want to increase this size for very large files.
Enjoy!
This is in no way optimal but for quick and dirty checks with small log files I've been using something like this:
List<string> mostRecentLines = File.ReadLines(filePath)
    // .Where(....)
    // .Distinct()
    .Reverse()
    .Take(10)
    .ToList();
Something you can now do very easily in .NET 4.0 (and with just a tiny bit of effort in earlier versions) is use memory-mapped files for this type of operation. It's ideal for large files because you can map just a portion of the file, then access it as virtual memory.
There is a good example here.
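As a rough sketch of that idea (a hypothetical helper, not from the linked example): map only the final region of the file, copy it out, and let the caller split off the last n lines. Assumes UTF-8 text.

using System;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Text;

static string ReadTailViaMmf(string path, int tailBytes)
{
    long fileLength = new FileInfo(path).Length;
    long offset = Math.Max(0, fileLength - tailBytes);
    long size = fileLength - offset;
    // Map the file read-only and create a view over just the tail region.
    using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
    using (var accessor = mmf.CreateViewAccessor(offset, size, MemoryMappedFileAccess.Read))
    {
        var bytes = new byte[size];
        accessor.ReadArray(0, bytes, 0, (int)size);
        // The caller can now split on '\n' and keep the last n entries.
        return Encoding.UTF8.GetString(bytes);
    }
}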
As #EugeneMayevski stated above, if you just need an approximate number of lines returned, each line has roughly the same line length and you're more concerned with performance especially for large files, this is a better implementation:
internal static StringBuilder ReadApproxLastNLines(string filePath, int approxLinesToRead, int approxLengthPerLine)
{
//If each line is more or less of the same length and you don't really care if you get back exactly the last n
using (FileStream fs = new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
{
var totalCharsToRead = approxLengthPerLine * approxLinesToRead;
var buffer = new byte[1];
//read approx chars to read backwards from end
fs.Seek(totalCharsToRead > fs.Length ? -fs.Length : -totalCharsToRead, SeekOrigin.End);
while (buffer[0] != '\n' && fs.Position > 0) //find new line char
{
fs.Read(buffer, 0, 1);
}
var returnStringBuilder = new StringBuilder();
using (StreamReader sr = new StreamReader(fs))
{
returnStringBuilder.Append(sr.ReadToEnd());
}
return returnStringBuilder;
}
}
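A hypothetical call, assuming roughly 120-character lines and that the last ~50 lines are wanted:

string tail = ReadApproxLastNLines(@"c:\app.log", 50, 120).ToString();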
Most log files have a DateTime stamp. Although it can be improved, the code below works well if you want the log messages from the last N days.
/// <summary>
/// Returns list of entries from the last N days.
/// </summary>
/// <param name="N"></param>
/// <param name="cSEP">field separator, default is TAB</param>
/// <param name="indexOfDateColumn">default is 0; change if it is not the first item in each line</param>
/// <param name="bFileHasHeaderRow">if true, it will not include the header row</param>
/// <returns></returns>
public List<string> ReadMessagesFromLastNDays(int N, char cSEP = '\t', int indexOfDateColumn = 0, bool bFileHasHeaderRow = true)
{
    List<string> listRet = new List<string>();
    //--- replace msFileName with the name (incl. path if appropriate)
    string[] lines = File.ReadAllLines(msFileName);
    if (lines.Length > 0)
    {
        DateTime dtm = DateTime.Now.AddDays(-N);
        string sCheckDate = GetTimeStamp(dtm);
        //--- process lines in reverse
        int iMin = bFileHasHeaderRow ? 1 : 0;
        for (int i = lines.Length - 1; i >= iMin; i--) // skip the header in line 0, if any
        {
            if (lines[i].Length > 0) // skip empty lines
            {
                string[] s = lines[i].Split(cSEP);
                //--- s[indexOfDateColumn] contains the DateTime stamp in the log file
                if (string.Compare(s[indexOfDateColumn], sCheckDate) >= 0)
                {
                    //--- insert at top of list or they'd be in reverse chronological order
                    listRet.Insert(0, s[1]);
                }
                else
                {
                    break; // out of loop
                }
            }
        }
    }
    return listRet;
}

/// <summary>
/// Returns DateTime Stamp as formatted in the log file
/// </summary>
/// <param name="dtm">DateTime value</param>
/// <returns></returns>
private string GetTimeStamp(DateTime dtm)
{
    // adjust format string to match what you use
    return dtm.ToString("u");
}
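A note on why the plain string comparison is enough here: the "u" format produces a sortable timestamp ("yyyy-MM-dd HH:mm:ssZ"), so lexicographic order matches chronological order and no per-line DateTime parsing is needed.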
