Is there an equivalent to mmap.mmap.rfind in C#?

While looking at memory-mapped files in C#, I had difficulty finding how to search a file quickly forward and in reverse. My goal is to rewrite the following Python function in C#, but I could not find anything like the find and rfind methods used below. Is there a way in C# to quickly search a memory-mapped file using a particular substring?
#! /usr/bin/env python3
import mmap
import pathlib


# noinspection PyUnboundLocalVariable
def drop_last_line(path):
    with path.open('r+b') as file:
        with mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ) as search:
            for next_line in b'\r\n', b'\r', b'\n':
                if search.find(next_line) >= 0:
                    break
            else:
                raise ValueError('cannot find any line delimiters')
            end_1st = search.rfind(next_line)
            end_2nd = search.rfind(next_line, 0, end_1st - 1)
        file.truncate(0 if end_2nd < 0 else end_2nd + len(next_line))

Do you know of any way to memory-map an entire file in C# and then treat it as a byte array?
Yes, it's quite easy to map an entire file into a view and then read it into a single byte array, as the following code shows:
static void Main(string[] args)
{
    var sourceFile = new FileInfo(@"C:\Users\Micky\Downloads\20180112.zip");
    int length = (int)sourceFile.Length; // length of target file

    // Create the memory-mapped file.
    using (var mmf = MemoryMappedFile.CreateFromFile(sourceFile.FullName,
                                                     FileMode.Open,
                                                     "ImgA"))
    {
        var buffer = new byte[length]; // allocate a buffer the same size as the file
        using (var accessor = mmf.CreateViewAccessor())
        {
            var read = accessor.ReadArray(0, buffer, 0, length); // read the whole thing
        }

        // let's try searching for a known byte sequence; change this to suit your file
        var target = new byte[] { 71, 213, 62, 204, 231 };
        var foundAt = IndexOf(buffer, target);
    }
}
I couldn't find any byte-searching method in Marshal or Array, but you can use this search algorithm (courtesy of Social MSDN) as a start:
private static int IndexOf2(byte[] input, byte[] pattern)
{
    byte firstByte = pattern[0];
    // Check every occurrence of the first byte, not just the first one
    // (the original Social MSDN version gave up after the first candidate).
    int index = Array.IndexOf(input, firstByte);
    while (index >= 0 && index <= input.Length - pattern.Length)
    {
        bool match = true;
        for (int i = 1; i < pattern.Length; i++)
        {
            if (pattern[i] != input[index + i])
            {
                match = false;
                break;
            }
        }
        if (match) return index;
        index = Array.IndexOf(input, firstByte, index + 1);
    }
    return -1;
}
...or even this more verbose example (also courtesy of Social MSDN, same link):
public static int IndexOf(byte[] arrayToSearchThrough, byte[] patternToFind)
{
    if (patternToFind.Length > arrayToSearchThrough.Length)
        return -1;
    // Note: <= so that a pattern ending exactly at the end of the array is still found.
    for (int i = 0; i <= arrayToSearchThrough.Length - patternToFind.Length; i++)
    {
        bool found = true;
        for (int j = 0; j < patternToFind.Length; j++)
        {
            if (arrayToSearchThrough[i + j] != patternToFind[j])
            {
                found = false;
                break;
            }
        }
        if (found)
        {
            return i;
        }
    }
    return -1;
}
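The question also asked for the reverse direction (Python's rfind), which neither snippet above covers. A minimal sketch of a reverse scan (my own addition, not from Social MSDN), symmetric to the forward version:

public static int LastIndexOf(byte[] arrayToSearchThrough, byte[] patternToFind)
{
    // Walk candidate start positions from the end of the array toward the front;
    // the first hit is therefore the right-most match, like Python's rfind.
    for (int i = arrayToSearchThrough.Length - patternToFind.Length; i >= 0; i--)
    {
        bool found = true;
        for (int j = 0; j < patternToFind.Length; j++)
        {
            if (arrayToSearchThrough[i + j] != patternToFind[j])
            {
                found = false;
                break;
            }
        }
        if (found)
        {
            return i;
        }
    }
    return -1;
}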

Related

Efficient way of finding repeating sequences of characters in string

I am trying to do the following:
Read the file contents into a byte array
Convert the byte array into a Base64 string
Find all sequences of repeating characters that are longer than 8 in length
Place the found repeating patterns in a list
Here is where I am having some issues... I am currently reading a 1 MB file using this loop:
void bkg_DoWork(object sender, DoWorkEventArgs e)
{
    try
    {
        Byte[] bytes = File.ReadAllBytes(this.txt_Filename.Text);
        string file = Convert.ToBase64String(bytes);
        char lastchar = '\0';
        int count = 0;
        List<RepeatingPattern> patterns = new List<RepeatingPattern>();
        this.Invoke((MethodInvoker)delegate
        {
            this.pb_Progress.Maximum = file.Length;
            this.pb_Progress.Value = 0;
            this.lbl_Progress.Text = "Progress: File contents read... Looking for patterns! 0% Done...";
        });
        for (int i = 0; i < file.Length; i++)
        {
            this.Invoke((MethodInvoker)delegate
            {
                this.pb_Progress.Value += 1;
                this.lbl_Progress.Text = "Progress: Looking for patterns! " + (int)Decimal.Truncate((decimal)((double)i / file.Length) * 100) + "% Done...";
            });
            if (file[i] == lastchar)
                count += 1;
            else
            {
                // create a pattern if the count is more than what a pattern's
                // compressed representation takes to save space... 8 chars: [$a,#$]
                if (count > 8)
                {
                    // create and add a pattern to the list if necessary.
                    RepeatingPattern ptn = new RepeatingPattern(lastchar, count);
                    if (!patterns.Contains(ptn))
                        patterns.Add(ptn);
                }
                count = 0;
                lastchar = file[i];
            }
        }
        e.Result = patterns;
    }
    catch (Exception ex)
    {
        e.Result = ex;
    }
}
However, when using this loop, I find that the process is VERY long... for example, this 1 MB file takes about a minute to loop through... in this day and age, that feels like a long time for such a small file. Is there a more efficient way to do what I want to do and find the repeating patterns?
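For what it's worth, the dominant cost in the loop above is almost certainly the per-character Invoke, which marshals to the UI thread roughly a million times for a 1 MB file. A hedged sketch of the same loop, updating progress only occasionally (the 10,000 interval is an arbitrary choice):

for (int i = 0; i < file.Length; i++)
{
    // report progress every 10,000 characters instead of every character
    if (i % 10000 == 0)
    {
        int done = i; // capture a copy; 'i' keeps changing while the UI catches up
        this.Invoke((MethodInvoker)delegate
        {
            this.pb_Progress.Value = done;
            this.lbl_Progress.Text = "Progress: Looking for patterns! "
                + (int)((long)done * 100 / file.Length) + "% Done...";
        });
    }
    // ... the pattern-detection body stays exactly as above ...
}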

C# Streamreader - Break on {CR}{LF} only

I am trying to count the number of rows in a text file (to compare to a control file) before performing a complex SSIS insert package.
Currently I am using a StreamReader and it is breaking a line with a {LF} embedded into a new line, whereas SSIS is using {CR}{LF} (correctly), so the counts are not tallying up.
Does anyone know an alternate method of doing this where I can count the number of lines in the file based on {CR}{LF} Line breaks only?
Thanks in advance
Iterate through the file and count number of CRLFs.
Pretty straightforward implementation:
public int CountLines(Stream stream, Encoding encoding)
{
    int cur, prev = -1, lines = 0;
    using (var sr = new StreamReader(stream, encoding, false, 4096, true))
    {
        while ((cur = sr.Read()) != -1)
        {
            if (prev == '\r' && cur == '\n')
                lines++;
            prev = cur;
        }
    }
    // An empty stream yields 0 lines; any content yields at least one line.
    if (prev != -1)
        lines++;
    return lines;
}
Example usage:
using (var s = File.OpenRead(@"<your_file_path>"))
    Console.WriteLine("Found {0} lines", CountLines(s, Encoding.Default));
Actually it's a find-substring-in-string task, so more generic algorithms can be used.
{CR}{LF} is just what's desired here; you can't really say one convention is more correct than the other.
Since ReadLine strips the line ending, you can't tell which delimiter it was.
Use the StreamReader.Read() method, which returns an int, and look for 13 followed by 10.
Here's a pretty lazy way... this will read the entire file into memory.
var cnt = File.ReadAllText("yourfile.txt")
              .Split(new[] { "\r\n" }, StringSplitOptions.None)
              .Length;
Here is an extension method that reads lines with the line separator {CR}{LF} only, and not {LF}. You could do a Count() on it:
var count = new StreamReader(@"D:\Test.txt").ReadLinesCrLf().Count();
But you could also use it for reading files, which is sometimes useful since the normal StreamReader.ReadLine breaks on both {CR}{LF} and {LF}. It can be used on any TextReader and works streaming (file size is not an issue).
public static IEnumerable<string> ReadLinesCrLf(this TextReader reader, int bufferSize = 4096)
{
    StringBuilder lineBuffer = null;
    // read buffer
    char[] buffer = new char[bufferSize];
    int charsRead;
    var previousIsCr = false; // renamed from previousIsLf: it tracks '\r'
    while ((charsRead = reader.Read(buffer, 0, bufferSize)) != 0)
    {
        int bufferIndex = 0;
        int writeIdx = 0;
        do
        {
            var currentChar = buffer[bufferIndex];
            switch (currentChar)
            {
                case '\n':
                    if (previousIsCr)
                    {
                        if (lineBuffer == null)
                        {
                            // return from the current buffer; writeIdx can be higher
                            // than 0 when multiple rows are in the buffer
                            yield return new string(buffer, writeIdx, bufferIndex - writeIdx - 1);
                            // shift the write index to the next character that will be read
                            writeIdx = bufferIndex + 1;
                        }
                        else
                        {
                            Debug.Assert(writeIdx == 0, "Write index should be 0 when lineBuffer != null");
                            lineBuffer.Append(buffer, writeIdx, bufferIndex - writeIdx);
                            Debug.Assert(lineBuffer.ToString().Last() == '\r', "Last character in lineBuffer should be a carriage return now");
                            lineBuffer.Length--;
                            // shift the write index to the next character that will be read
                            writeIdx = bufferIndex + 1;
                            yield return lineBuffer.ToString();
                            lineBuffer = null;
                        }
                    }
                    previousIsCr = false;
                    break;
                case '\r':
                    previousIsCr = true;
                    break;
                default:
                    previousIsCr = false;
                    break;
            }
            bufferIndex++;
        } while (bufferIndex < charsRead);
        if (writeIdx < bufferIndex)
        {
            if (lineBuffer == null) lineBuffer = new StringBuilder();
            lineBuffer.Append(buffer, writeIdx, bufferIndex - writeIdx);
        }
    }
    // return the last row
    if (lineBuffer != null && lineBuffer.Length > 0) yield return lineBuffer.ToString();
}

Improve string parse performance

Before we start, I am aware of the term "premature optimization". However, the following snippets have proven to be an area where improvements can be made.
Alright. We currently have some network code that works with string-based packets. I am aware that using strings for packets is stupid, crazy, and slow. Sadly, we don't have any control over the client, so we have to use strings.
Each packet is terminated by \0\r\n, and we currently use a StreamReader/Writer to read individual packets from the stream. Our main bottleneck comes from two places.
Firstly: we need to trim that nasty little null byte off the end of the string. We currently use code like the following:
line = await reader.ReadLineAsync();
line = line.Replace("\0", ""); // PERF this allocates a new string
if (string.IsNullOrWhiteSpace(line))
return null;
var packet = ClientPacket.Parse(line, cl.Client.RemoteEndPoint);
As you can see from that cute little comment, we have a GC performance issue when trimming the '\0'. There are numerous ways to trim a '\0' off the end of a string, but all of them hammer the GC the same way: because strings are immutable, every such operation creates a new string object. As our server handles 1000+ connections, each communicating at around 25-40 packets per second (it's a game server), this GC churn is becoming a real issue. So here comes my first question: what is a more efficient way of trimming that '\0' off the end of our string? By efficient I don't only mean speed, but also GC-wise (ultimately I'd like a way to get rid of it without creating a new string object at all!).
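For what it's worth, one allocation-conscious option (my suggestion, not from the original post): scan for the trailing '\0' yourself and take a single Substring only when a null byte is actually present, instead of letting Replace scan and copy unconditionally:

line = await reader.ReadLineAsync();
if (line == null)
    return null;

// Find where the real content ends; allocate only if a '\0' is actually there.
int end = line.Length;
while (end > 0 && line[end - 1] == '\0')
    end--;
if (end != line.Length)
    line = line.Substring(0, end); // one allocation, only for packets that need it

if (string.IsNullOrWhiteSpace(line))
    return null;
var packet = ClientPacket.Parse(line, cl.Client.RemoteEndPoint);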
Our second issue also stems from GC land. Our code looks somewhat like the following:
private static string[] emptyStringArray = new string[] { }; // so we don't allocate this each time

public static ClientPacket Parse(string line, EndPoint from)
{
    const char seperator = '|';
    var first_seperator_pos = line.IndexOf(seperator);
    if (first_seperator_pos < 1)
    {
        return new ClientPacket(NetworkStringToClientPacketType(line), emptyStringArray, from);
    }
    var name = line.Substring(0, first_seperator_pos);
    var type = NetworkStringToClientPacketType(name);
    if (line.IndexOf(seperator, first_seperator_pos + 1) < 1)
        return new ClientPacket(type, new string[] { line.Substring(first_seperator_pos + 1) }, from);
    return new ClientPacket(type, line.Substring(first_seperator_pos + 1).Split(seperator), from);
}
(Where NetworkStringToClientPacketType is simply a big switch-case block)
As you can see, we already do a few things to handle GC. We reuse a static empty string array and we check for packets with no parameters. My only issue here is that we use Substring a lot, and even chain a Split onto the end of a Substring. This leads to almost 20 new string objects being created and 12 being disposed of for EACH average packet. This causes a lot of performance issues when load increases to anything over 400 users (we gotz fast ram :3)
Has anyone had any experience with this sort of thing before, or could you give us some pointers on what to look into next? Maybe some magical classes or some nifty pointer magic?
(PS. StringBuilder doesn't help as we aren't building strings, we are generally splitting them.)
We currently have some ideas based on an index based system where we store the index and length of each parameter rather than splitting them. Thoughts?
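A minimal sketch of that index-based idea (hypothetical names, not your code): record (start, length) pairs into a reusable array and only materialize a string when a field is actually consumed:

public struct Field
{
    public int Start;
    public int Length;
}

// Collect field positions without allocating any substrings.
// Returns the number of fields written into the caller-owned array.
public static int Tokenize(string line, char separator, Field[] fields)
{
    int count = 0, start = 0;
    for (int i = 0; i <= line.Length; i++)
    {
        if (i == line.Length || line[i] == separator)
        {
            if (count == fields.Length)
                break; // array sized by the caller for the protocol's maximum
            fields[count++] = new Field { Start = start, Length = i - start };
            start = i + 1;
        }
    }
    return count;
}

// Materialize a single field only when it is actually needed.
public static string GetField(string line, Field f)
{
    return line.Substring(f.Start, f.Length);
}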
A few other things: decompiling mscorlib and browsing the string class code, it looks to me like IndexOf calls end up going through P/Invoke, which would mean added overhead for each call; correct me if I'm wrong. Would it not be faster to implement an IndexOf manually over a char[] array?
public int IndexOf(string value, int startIndex, int count, StringComparison comparisonType)
{
    ...
    return TextInfo.IndexOfStringOrdinalIgnoreCase(this, value, startIndex, count);
    ...
}

internal static int IndexOfStringOrdinalIgnoreCase(string source, string value, int startIndex, int count)
{
    ...
    if (TextInfo.TryFastFindStringOrdinalIgnoreCase(4194304, source, startIndex, value, count, ref result))
    {
        return result;
    }
    ...
}

...

[DllImport("QCall", CharSet = CharSet.Unicode)]
[return: MarshalAs(UnmanagedType.Bool)]
private static extern bool InternalTryFindStringOrdinalIgnoreCase(int searchFlags, string source, int sourceCount, int startIndex, string target, int targetCount, ref int foundIndex);
Then we get to String.Split which ends up calling Substring itself (somewhere along the line):
// string
private string[] InternalSplitOmitEmptyEntries(int[] sepList, int[] lengthList, int numReplaces, int count)
{
    int num = (numReplaces < count) ? (numReplaces + 1) : count;
    string[] array = new string[num];
    int num2 = 0;
    int num3 = 0;
    int i = 0;
    while (i < numReplaces && num2 < this.Length)
    {
        if (sepList[i] - num2 > 0)
        {
            array[num3++] = this.Substring(num2, sepList[i] - num2);
        }
        num2 = sepList[i] + ((lengthList == null) ? 1 : lengthList[i]);
        if (num3 == count - 1)
        {
            while (i < numReplaces - 1)
            {
                if (num2 != sepList[++i])
                {
                    break;
                }
                num2 += ((lengthList == null) ? 1 : lengthList[i]);
            }
            break;
        }
        i++;
    }
    if (num2 < this.Length)
    {
        array[num3++] = this.Substring(num2);
    }
    string[] array2 = array;
    if (num3 != num)
    {
        array2 = new string[num3];
        for (int j = 0; j < num3; j++)
        {
            array2[j] = array[j];
        }
    }
    return array2;
}
Thankfully Substring looks fast (and efficient!):
private unsafe string InternalSubString(int startIndex, int length, bool fAlwaysCopy)
{
    if (startIndex == 0 && length == this.Length && !fAlwaysCopy)
    {
        return this;
    }
    string text = string.FastAllocateString(length);
    fixed (char* ptr = &text.m_firstChar)
    {
        fixed (char* ptr2 = &this.m_firstChar)
        {
            string.wstrcpy(ptr, ptr2 + (IntPtr)startIndex, length);
        }
    }
    return text;
}
After reading this answer here, I'm thinking a pointer-based solution could be found... Thoughts?
Thanks.
You could "cheat" and work at the Encoder level...
public class UTF8NoZero : UTF8Encoding
{
    public override Decoder GetDecoder()
    {
        return new MyDecoder();
    }
}

public class MyDecoder : Decoder
{
    public Encoding UTF8 = new UTF8Encoding();

    public override int GetCharCount(byte[] bytes, int index, int count)
    {
        return UTF8.GetCharCount(bytes, index, count);
    }

    public override int GetChars(byte[] bytes, int byteIndex, int byteCount, char[] chars, int charIndex)
    {
        int count2 = UTF8.GetChars(bytes, byteIndex, byteCount, chars, charIndex);
        int i, j;
        // compact the buffer, skipping '\0' characters
        for (i = charIndex, j = charIndex; i < charIndex + count2; i++)
        {
            if (chars[i] != '\0')
            {
                chars[j] = chars[i];
                j++;
            }
        }
        // blank out the tail left over after compacting
        for (int k = j; k < charIndex + count2; k++)
        {
            chars[k] = '\0';
        }
        // i - j is the number of '\0' characters removed, so subtract it
        // (the original posted code added it, over-reporting the count)
        return count2 - (i - j);
    }
}
Note that this cheat relies on the fact that StreamReader.ReadLineAsync only uses GetChars(). We remove the '\0' in the temporary char[] buffer used by StreamReader.ReadLineAsync.
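Hypothetical usage (assuming the wire format really is UTF-8): pass the custom encoding when constructing the reader, and the '\0' never reaches the returned line:

// networkStream is whatever Stream the packets arrive on
var reader = new StreamReader(networkStream, new UTF8NoZero());
string line = await reader.ReadLineAsync(); // already free of '\0', no Replace needed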

How to read last "n" lines of log file [duplicate]

This question already has answers here:
Get last 10 lines of very large text file > 10GB
(21 answers)
Closed 1 year ago.
I need a snippet of code that reads out the last "n lines" of a log file. I came up with the following code from the net. I am kinda new to C#. Since the log file might be quite large, I want to avoid the overhead of reading the entire file. Can someone suggest a performance enhancement? I do not really want to read each character and change position.
var reader = new StreamReader(filePath, Encoding.ASCII);
reader.BaseStream.Seek(0, SeekOrigin.End);
var count = 0;
while (count <= tailCount)
{
    if (reader.BaseStream.Position <= 0) break;
    reader.BaseStream.Position--;
    int c = reader.Read();
    if (reader.BaseStream.Position <= 0) break;
    reader.BaseStream.Position--;
    if (c == '\n')
    {
        ++count;
    }
}
var str = reader.ReadToEnd();
Your code will perform very poorly, since you aren't allowing any caching to happen.
In addition, it will not work at all for Unicode.
I wrote the following implementation:
///<summary>Returns the end of a text reader.</summary>
///<param name="reader">The reader to read from.</param>
///<param name="lineCount">The number of lines to return.</param>
///<returns>The last lineCount lines from the reader.</returns>
public static string[] Tail(this TextReader reader, int lineCount)
{
    var buffer = new List<string>(lineCount);
    string line;
    for (int i = 0; i < lineCount; i++)
    {
        line = reader.ReadLine();
        if (line == null) return buffer.ToArray();
        buffer.Add(line);
    }

    // The index of the last line read into the buffer. Everything > this index
    // was read earlier than everything <= this index.
    int lastLine = lineCount - 1;

    while (null != (line = reader.ReadLine()))
    {
        lastLine++;
        if (lastLine == lineCount) lastLine = 0;
        buffer[lastLine] = line;
    }

    if (lastLine == lineCount - 1) return buffer.ToArray();
    var retVal = new string[lineCount];
    buffer.CopyTo(lastLine + 1, retVal, 0, lineCount - lastLine - 1);
    buffer.CopyTo(0, retVal, lineCount - lastLine - 1, lastLine + 1);
    return retVal;
}
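Hypothetical usage of the extension method above (the path is made up for the example):

using (var reader = new StreamReader(@"C:\logs\app.log"))
{
    foreach (var line in reader.Tail(10))
        Console.WriteLine(line);
}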
I had trouble with your code, so this is my version. Since it's a log file, something might be writing to it, so it's best to make sure you're not locking it.
You go to the end and read backwards until you reach n lines, then read everything from there on.
int n = 5; // or any arbitrary number
int count = 0;
string content;
byte[] buffer = new byte[1];
using (FileStream fs = new FileStream("text.txt", FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
{
    // start at the end.
    fs.Seek(0, SeekOrigin.End);
    // read backwards until 'n' line breaks are found (or the start of the file is hit,
    // which the original version did not guard against)
    while (count < n && fs.Position > 0)
    {
        fs.Seek(-1, SeekOrigin.Current);
        fs.Read(buffer, 0, 1);
        if (buffer[0] == '\n')
        {
            count++;
        }
        fs.Seek(-1, SeekOrigin.Current); // fs.Read(...) advances the position, so we need to go back again
    }
    if (count == n)
        fs.Seek(1, SeekOrigin.Current); // go past the last '\n'
    // read the last n lines
    using (StreamReader sr = new StreamReader(fs))
    {
        content = sr.ReadToEnd();
    }
}
A friend of mine uses this method (BackwardReader can be found here):
public static IList<string> GetLogTail(string logname, string numrows)
{
    int lineCnt = 1;
    List<string> lines = new List<string>();
    int maxLines;
    if (!int.TryParse(numrows, out maxLines))
    {
        maxLines = 100;
    }
    string logFile = HttpContext.Current.Server.MapPath("~/" + logname);
    BackwardReader br = new BackwardReader(logFile);
    while (!br.SOF)
    {
        string line = br.Readline();
        lines.Add(line + System.Environment.NewLine);
        if (lineCnt == maxLines) break;
        lineCnt++;
    }
    lines.Reverse();
    return lines;
}
Does your log have lines of similar length? If yes, you can calculate the average length of a line and then do the following (a simplified sketch follows below):
1. Seek to end_of_file - lines_needed * avg_line_length (call this previous_point).
2. Read everything up to the end.
3. If you grabbed enough lines, that's fine. If not, seek to previous_point - lines_needed * avg_line_length.
4. Read everything up to previous_point.
5. Go to step 3.
A memory-mapped file is also a good method: map the tail of the file, count the lines, map the previous block, count the lines, etc. until you get the number of lines needed.
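A simplified sketch of that heuristic, assuming a single-byte encoding; instead of the retry loop it over-seeks with a 2x safety margin (both the margin and the helper name are my own choices):

static string[] TailByEstimate(string path, int linesNeeded, int avgLineLength)
{
    using (var fs = File.OpenRead(path))
    {
        // Over-estimate how far back to seek; clamp at the start of the file.
        long guess = (long)linesNeeded * avgLineLength * 2;
        fs.Seek(Math.Max(0, fs.Length - guess), SeekOrigin.Begin);

        var lines = new List<string>();
        using (var sr = new StreamReader(fs))
        {
            string line;
            while ((line = sr.ReadLine()) != null)
                lines.Add(line);
        }

        // Keep only the last linesNeeded lines of what was read.
        int skip = Math.Max(0, lines.Count - linesNeeded);
        return lines.GetRange(skip, lines.Count - skip).ToArray();
    }
}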
Here is my answer:
private string StatisticsFile = @"c:\yourfilename.txt";

// Read last lines of a file....
public IList<string> ReadLastLines(int nFromLine, int nNoLines, out bool bMore)
{
    // Initialise more
    bMore = false;
    try
    {
        char[] buffer = null;
        //lock (strMessages) Lock something if you need to....
        {
            if (File.Exists(StatisticsFile))
            {
                // Open file
                using (StreamReader sr = new StreamReader(StatisticsFile))
                {
                    long FileLength = sr.BaseStream.Length;
                    int c, linescount = 0;
                    long pos = FileLength - 1;
                    long PreviousReturn = FileLength;
                    // Process file
                    while (pos >= 0 && linescount < nFromLine + nNoLines) // Until found correct place
                    {
                        // Read a character from the end
                        c = BufferedGetCharBackwards(sr, pos);
                        if (c == Convert.ToInt32('\n'))
                        {
                            // Found return character
                            if (++linescount == nFromLine)
                                // Found last place
                                PreviousReturn = pos + 1; // Read to here
                        }
                        // Previous char
                        pos--;
                    }
                    pos++;
                    // Create buffer
                    buffer = new char[PreviousReturn - pos];
                    sr.DiscardBufferedData();
                    // Read all our chars
                    sr.BaseStream.Seek(pos, SeekOrigin.Begin);
                    sr.Read(buffer, (int)0, (int)(PreviousReturn - pos));
                    sr.Close();
                    // Store if more lines available
                    if (pos > 0)
                        // Is there more?
                        bMore = true;
                }
                if (buffer != null)
                {
                    // Get data
                    string strResult = new string(buffer);
                    strResult = strResult.Replace("\r", "");
                    // Store in List
                    List<string> strSort = new List<string>(strResult.Split('\n'));
                    // Reverse order
                    strSort.Reverse();
                    return strSort;
                }
            }
        }
    }
    catch (Exception ex)
    {
        System.Diagnostics.Debug.WriteLine("ReadLastLines Exception:" + ex.ToString());
    }
    // Let's return a list with no entries
    return new List<string>();
}
const int CACHE_BUFFER_SIZE = 1024;
private long ncachestartbuffer = -1;
private char[] cachebuffer = null;

// Cache the file....
private int BufferedGetCharBackwards(StreamReader sr, long iPosFromBegin)
{
    // Check for error
    if (iPosFromBegin < 0 || iPosFromBegin >= sr.BaseStream.Length)
        return -1;
    // See if we have the character already
    if (ncachestartbuffer >= 0 && ncachestartbuffer <= iPosFromBegin && ncachestartbuffer + cachebuffer.Length > iPosFromBegin)
    {
        return cachebuffer[iPosFromBegin - ncachestartbuffer];
    }
    // Load into cache
    ncachestartbuffer = (int)Math.Max(0, iPosFromBegin - CACHE_BUFFER_SIZE + 1);
    int nLength = (int)Math.Min(CACHE_BUFFER_SIZE, sr.BaseStream.Length - ncachestartbuffer);
    cachebuffer = new char[nLength];
    sr.DiscardBufferedData();
    sr.BaseStream.Seek(ncachestartbuffer, SeekOrigin.Begin);
    sr.Read(cachebuffer, (int)0, (int)nLength);
    return BufferedGetCharBackwards(sr, iPosFromBegin);
}
Notes:
Call ReadLastLines with nFromLine starting at 0 for the last line and nNoLines as the number of lines to read back from there.
It reverses the list, so the first entry is the last line in the file.
bMore returns true if there are more lines to read.
It caches the data in 1024-char chunks, so it is fast; you may want to increase CACHE_BUFFER_SIZE for very large files.
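Hypothetical usage, paging backwards through the file 25 lines at a time:

bool bMore;
// the last 25 lines of the file (nFromLine = 0 means "start from the very end")
IList<string> newest = ReadLastLines(0, 25, out bMore);
if (bMore)
{
    // the 25 lines before those
    IList<string> older = ReadLastLines(25, 25, out bMore);
}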
Enjoy!
This is in no way optimal, but for quick-and-dirty checks with small log files I've been using something like this:
List<string> mostRecentLines = File.ReadLines(filePath)
    // .Where(....)
    // .Distinct()
    .Reverse()
    .Take(10)
    .ToList();
Something that you can now do very easily in C# 4.0 (and with just a tiny bit of effort in earlier versions) is use memory-mapped files for this type of operation. It's ideal for large files because you can map just a portion of the file, then access it as virtual memory.
There is a good example here.
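A minimal sketch along those lines (my own, not from the linked example; the 64 KB window size is an arbitrary assumption): map only a fixed-size window at the end of the file and decode just that, leaving the rest of the file untouched:

using System;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Text;

static string ReadTailWindow(string path, int windowSize = 64 * 1024)
{
    long fileLength = new FileInfo(path).Length;
    long offset = Math.Max(0, fileLength - windowSize);
    long size = fileLength - offset;

    using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
    using (var view = mmf.CreateViewAccessor(offset, size, MemoryMappedFileAccess.Read))
    {
        var bytes = new byte[size];
        view.ReadArray(0, bytes, 0, (int)size);
        // Assumes a single-byte encoding (e.g. ASCII logs) so the window
        // boundary cannot split a character.
        return Encoding.ASCII.GetString(bytes);
    }
}

Counting '\n' characters in the returned window (and mapping the previous block when there are not enough) then follows the same pattern as the seek-based answers above.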
As #EugeneMayevski stated above, if you just need an approximate number of lines returned, each line has roughly the same line length and you're more concerned with performance especially for large files, this is a better implementation:
internal static StringBuilder ReadApproxLastNLines(string filePath, int approxLinesToRead, int approxLengthPerLine)
{
    // If each line is more or less the same length and you don't really care
    // whether you get back exactly the last n lines
    using (FileStream fs = new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
    {
        var totalCharsToRead = approxLengthPerLine * approxLinesToRead;
        var buffer = new byte[1];
        // seek approximately that many chars back from the end
        fs.Seek(totalCharsToRead > fs.Length ? -fs.Length : -totalCharsToRead, SeekOrigin.End);
        // find a new-line char; the extra Position < Length check stops the
        // scan at end-of-file instead of spinning when no '\n' remains
        while (buffer[0] != '\n' && fs.Position > 0 && fs.Position < fs.Length)
        {
            fs.Read(buffer, 0, 1);
        }
        var returnStringBuilder = new StringBuilder();
        using (StreamReader sr = new StreamReader(fs))
        {
            returnStringBuilder.Append(sr.ReadToEnd());
        }
        return returnStringBuilder;
    }
}
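Hypothetical usage of the sketch above, asking for roughly the last 50 lines of a log and assuming ~120 characters per line:

var tail = ReadApproxLastNLines(@"c:\logs\app.log", 50, 120);
Console.WriteLine(tail.ToString());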
Most log files have a DateTime stamp. Although it can be improved, the code below works well if you want the log messages from the last N days.
/// <summary>
/// Returns list of entries from the last N days.
/// </summary>
/// <param name="N"></param>
/// <param name="cSEP">field separator, default is TAB</param>
/// <param name="indexOfDateColumn">default is 0; change if it is not the first item in each line</param>
/// <param name="bFileHasHeaderRow">if true, it will not include the header row</param>
/// <returns></returns>
public List<string> ReadMessagesFromLastNDays(int N, char cSEP = '\t', int indexOfDateColumn = 0, bool bFileHasHeaderRow = true)
{
    List<string> listRet = new List<string>();
    //--- replace msFileName with the name (incl. path if appropriate)
    string[] lines = File.ReadAllLines(msFileName);
    if (lines.Length > 0)
    {
        DateTime dtm = DateTime.Now.AddDays(-N);
        string sCheckDate = GetTimeStamp(dtm);
        //--- process lines in reverse
        int iMin = bFileHasHeaderRow ? 1 : 0;
        for (int i = lines.Length - 1; i >= iMin; i--) // skip the header in line 0, if any
        {
            if (lines[i].Length > 0) // skip empty lines
            {
                string[] s = lines[i].Split(cSEP);
                //--- s[indexOfDateColumn] contains the DateTime stamp in the log file
                if (string.Compare(s[indexOfDateColumn], sCheckDate) >= 0)
                {
                    //--- insert at the top of the list or they'd be in reverse chronological order
                    listRet.Insert(0, s[1]);
                }
                else
                {
                    break; // out of the loop
                }
            }
        }
    }
    return listRet;
}

/// <summary>
/// Returns DateTime stamp as formatted in the log file
/// </summary>
/// <param name="dtm">DateTime value</param>
/// <returns></returns>
private string GetTimeStamp(DateTime dtm)
{
    // adjust the format string to match what you use
    return dtm.ToString("u");
}

Search byte[] for pattern C#

_documentContent contains the whole document as HTML view source.
patternToFind contains the text to be searched for in _documentContent.
The code snippet below works fine if the language is English.
The same code, however, doesn't work at all when it encounters a language like Korean.
Sample Document
Present Tense
The present tense is just as you have learned. You take the dictionary form of a verb, drop the 다, add the appropriate ending.
먹다 - 먹 + 어요 = 먹어요
마시다 - 마시 + 어요 - 마시어요 - 마셔요.
This tense is used to represent what happens in the present. I eat. I drink. It is a general term for the present.
When I try to find 먹, the code below fails.
Can someone please suggest a solution?
using System;
using System.Collections.Generic;
using System.Text;

namespace MultiByteStringHandling
{
    class Program
    {
        static void Main(string[] args)
        {
            string _documentContent = @"먹다 - 먹 + 어요 = 먹어요";
            byte[] patternToFind = Encoding.UTF8.GetBytes("먹");
            byte[] DocumentBytes = Encoding.UTF8.GetBytes(_documentContent);
            int intByteOffset = indexOf(DocumentBytes, patternToFind);
            Console.WriteLine(intByteOffset.ToString());
        }

        // made static so it can actually be called from the static Main above
        public static int indexOf(byte[] data, byte[] pattern)
        {
            int[] failure = computeFailure(pattern);
            int j = 0;
            if (data.Length == 0) return 0;

            for (int i = 0; i < data.Length; i++)
            {
                while (j > 0 && pattern[j] != data[i])
                {
                    j = failure[j - 1];
                }
                if (pattern[j] == data[i])
                {
                    j++;
                }
                if (j == pattern.Length)
                {
                    return i - pattern.Length + 1;
                }
            }
            return -1;
        }

        /**
         * Computes the failure function using a boot-strapping process,
         * where the pattern is matched against itself.
         */
        private static int[] computeFailure(byte[] pattern)
        {
            int[] failure = new int[pattern.Length];
            int j = 0;
            for (int i = 1; i < pattern.Length; i++)
            {
                while (j > 0 && pattern[j] != pattern[i])
                {
                    j = failure[j - 1];
                }
                if (pattern[j] == pattern[i])
                {
                    j++;
                }
                failure[i] = j;
            }
            return failure;
        }
    }
}
Seriously, why not just do the following?
var indexFound = documentContent.IndexOf("data");
Converting strings into byte arrays and then searching those doesn't make much sense to me when your original data is text. You can always find the byte position afterwards if you wish.
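For example (a minimal sketch of that suggestion, not from the original answer): search the string with an ordinal comparison, then recover the UTF-8 byte offset of the match afterwards if a byte position is really needed:

string documentContent = "먹다 - 먹 + 어요 = 먹어요";

// Ordinal search on the string itself; works the same for Korean as for English.
int charIndex = documentContent.IndexOf("먹", StringComparison.Ordinal);

if (charIndex >= 0)
{
    // Convert the prefix length to get the byte offset into the UTF-8 form.
    int byteOffset = Encoding.UTF8.GetByteCount(documentContent.Substring(0, charIndex));
    Console.WriteLine("char index {0}, UTF-8 byte offset {1}", charIndex, byteOffset);
}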
UTF-8 is a variable-length, multi-byte format, and searching raw bytes directly is easy to get wrong once you leave ASCII. If you are scanning text you would be much better off using .IndexOf(pattern) [as Noldorin pointed out] or .Contains(pattern).
