I am trying to do the following:
1. Read the file contents into a byte array
2. Convert the byte array into a Base64 string
3. Find all sequences of repeating characters that are longer than 8 in length
4. Place the found repeating patterns in a list
Here is where I am currently having some issues... I am currently reading a 1MB file using this loop:
void bkg_DoWork(object sender, DoWorkEventArgs e)
{
try
{
Byte[] bytes = File.ReadAllBytes(this.txt_Filename.Text);
string file = Convert.ToBase64String(bytes);
char lastchar = '\0';
int count = 0;
List<RepeatingPattern> patterns = new List<RepeatingPattern>();
this.Invoke((MethodInvoker)delegate
{
this.pb_Progress.Maximum = file.Length;
this.pb_Progress.Value = 0;
this.lbl_Progress.Text = "Progress: Read file contents read... Looking for patterns! 0% Done...";
});
for (int i = 0; i < file.Length; i++)
{
this.Invoke((MethodInvoker)delegate
{
this.pb_Progress.Value += 1;
this.lbl_Progress.Text = "Progress: Looking for patterns! " + (int)Decimal.Truncate((decimal)((double)i / file.Length) * 100) + "% Done...";
});
if (file[i] == lastchar)
count += 1;
else
{
//create a pattern, if the count is more than what a pattern's compressed pattern looks like to save space... 8 chars
//[$a,#$]
if (count > 8)
{
//create and add a pattern to the list if necessary.
RepeatingPattern ptn = new RepeatingPattern(lastchar, count);
if (!patterns.Contains(ptn))
patterns.Add(ptn);
}
count = 0;
lastchar = file[i];
}
}
e.Result = patterns;
}
catch (Exception ex)
{
e.Result = ex;
}
}
However, this loop is very slow: the 1 MB file takes about a minute to process, which feels like a long time for such a small file in this day and age. Is there a more efficient way to find the repeating patterns?
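A likely culprit is the this.Invoke call inside the loop: Base64 of a 1 MB file is roughly 1.4 million characters, and each one triggers a synchronous round trip to the UI thread. The sketch below is not the original code; it assumes a hypothetical RunScanner helper that scans for runs in a single pass and reports progress only when the whole-number percentage changes. Unlike the loop above, it returns the full run length, also records a run that ends at the end of the text, and does not de-duplicate patterns.

```csharp
using System;
using System.Collections.Generic;

public static class RunScanner
{
    // Returns every run of one repeated character that is longer than minLength.
    // onProgress is a hypothetical callback; it is invoked at most 101 times
    // (once per whole-percent change) instead of once per character.
    public static List<KeyValuePair<char, int>> FindRuns(
        string text, int minLength, Action<int> onProgress = null)
    {
        var runs = new List<KeyValuePair<char, int>>();
        if (string.IsNullOrEmpty(text)) return runs;

        int runStart = 0;
        int lastPercent = -1;
        for (int i = 1; i <= text.Length; i++)
        {
            // A run ends at the end of the text or when the character changes.
            if (i == text.Length || text[i] != text[runStart])
            {
                int length = i - runStart;
                if (length > minLength)
                    runs.Add(new KeyValuePair<char, int>(text[runStart], length));
                runStart = i;
            }

            int percent = (int)((long)i * 100 / text.Length);
            if (onProgress != null && percent != lastPercent)
            {
                lastPercent = percent;
                onProgress(percent);
            }
        }
        return runs;
    }
}
```

Inside the DoWork handler, the callback could forward to BackgroundWorker.ReportProgress (with WorkerReportsProgress set to true), so the progress bar and label are updated from the ProgressChanged event rather than through Invoke.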
I have a text file that is divided up into many sections, each about 10 or so lines long. I'm reading in the file using File.ReadAllLines into an array, one line per element of the array, and I'm then trying to parse each section of the file to bring back just some of the data. I'm storing the results in a list, and hoping to export the list to CSV ultimately.
My for loop is giving me trouble: it loops the right number of times, but it pulls the data from the first section of the text file each time, rather than pulling the data from the first section and then moving on to the next section. I'm sure I'm doing something wrong in either my for loop or my foreach loop. Any clues to help me solve this would be much appreciated! Thanks
David
My code so far:
namespace ParseAndExport
{
class Program
{
static readonly string sourcefile = @"Path";
static void Main(string[] args)
{
string[] readInLines = File.ReadAllLines(sourcefile);
int counter = 0;
int holderCPStart = counter + 3;//Changed Paths will be a different number of lines each time, but will always start 3 lines after the startDiv
/*Need to find the start of the section and the end of the section and parse the bit in between.
* Also need to identify the blank line that occurs in each section as it is essentially a divider too.*/
int startDiv = Array.FindIndex(readInLines, counter, hyphens72);
int blankLine = Array.FindIndex(readInLines, startDiv, emptyElement);
int endDiv = Array.FindIndex(readInLines, counter + 1, hyphens72);
List<string> results = new List<string>();
//Test to see if FindIndexes work. Results should be 0, 7, 9 for 1st section of sourcefile
/*Console.WriteLine(startDiv);
Console.WriteLine(blankLine);
Console.WriteLine(endDiv);*/
//Check how long the file is so that for testing we know how long the while loop should run for
//Console.WriteLine(readInLines.Length);
//sourcefile has 5255 lines (elements) in the array
for (int i = 0; i <= readInLines.Length; i++)
{
if (i == startDiv)
{
results = (readInLines[i + 1].Split('|').Select(p => p.Trim()).ToList());
string holderCP = string.Join(Environment.NewLine, readInLines, holderCPStart, (blankLine - holderCPStart - 1)).Trim();
results.Add(holderCP);
string comment = string.Join(" ", readInLines, blankLine + 1, (endDiv - (blankLine + 1)));//in case the comment is more than one line long
results.Add(comment);
i = i + 1;
}
else
{
i = i + 1;
}
foreach (string result in results)
{
Console.WriteLine(result);
}
//csvcontent.AppendLine("Revision Number, Author, Date, Time, Count of Lines, Changed Paths, Comments");
/* foreach (string result in results)
{
for (int x = 0; x <= results.Count(); x++)
{
StringBuilder csvcontent = new StringBuilder();
csvcontent.AppendLine(results[x] + "," + results[x + 1] + "," + results[x + 2] + "," + results[x + 3] + "," + results[x + 4] + "," + results[x + 5]);
x = x + 6;
string csvpath = @"addressforcsvfile";
File.AppendAllText(csvpath, csvcontent.ToString());
}
}*/
}
Console.ReadKey();
}
private static bool hyphens72(String h)
{
if (h == "------------------------------------------------------------------------")
{
return true;
}
else
{
return false;
}
}
private static bool emptyElement(String ee)
{
if (ee == "")
{
return true;
}
else
{
return false;
}
}
}
}
It looks like you are trying to grab all of the lines in a file that are not "------" and put them into a list of strings.
You can try this:
var lineswithoutdashes = readInLines.Where(x => !hyphens72(x)).ToList();
Now you can take this list and split each line on '|' to extract the fields you wanted.
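A minimal sketch of that split step follows. It assumes that only the header line of each section contains a '|' character, which the question does not confirm; HeaderLines and SplitHeaderFields are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HeaderLines
{
    // Keeps only the lines that contain '|' (assumed to be section headers)
    // and splits each of them into trimmed fields.
    public static List<string[]> SplitHeaderFields(IEnumerable<string> lines)
    {
        return lines
            .Where(line => line.IndexOf('|') >= 0)
            .Select(line => line.Split('|').Select(p => p.Trim()).ToArray())
            .ToList();
    }
}
```

Lines without a '|' (changed paths, blank lines, comments) are dropped here, so this only recovers the header fields, not the whole section.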
The logic seems wrong, and there are also issues with the code itself. I am not sure precisely what you are trying to do, but here are a few hints that I hope will help:
1. The if (i == startDiv) checks whether i equals startDiv. I assume the logic that runs when this condition is met is what you refer to as "pulls the data from the first section". That is exactly what the code asks for, given that you only run this code when i equals startDiv and startDiv is calculated once, before the loop.
2. You increase the counter i inside the body of the for loop, and the for statement itself also increases i, so i advances by two on every iteration.
3. Even if the issue in 2. didn't exist, I'd suggest not performing the same operation "i = i + 1" in both the true and false branches of the if (i == startDiv).
4. Given that I assume this file might actually be massive, it's probably a good idea not to store it all in memory, but to read and process the file line by line. There's currently no obvious reason to consume that much memory, other than the convenience of the File.ReadAllLines(sourcefile) API. I wouldn't be too scared to read the file like this:
using (var reader = new StreamReader(sourcefile))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // process the line.
    }
}
You can skip lines until you've passed the line that matches hyphens72.
Then, for each line, you process it with the code you provided in the true case of (i == startDiv). At least, from what you described, that is what I assume you are trying to do.
startDiv holds the index of the first line that matches hyphens72.
So your current for loop only copies data into results for the single line that matches that calculated line number.
I guess you want to search for the position of the divider within the current line?
string hyphens72 = new string('-', 72);
// loop over lines
for (var lineNumber = 0; lineNumber < readInLines.Length; lineNumber++) {
    string currentLine = readInLines[lineNumber];
    int startDiv = currentLine.IndexOf(hyphens72);
    // loop over characters in line
    for (var charIndex = 0; charIndex < currentLine.Length; charIndex++) {
        if (charIndex == startDiv) {
            var currentCharacter = currentLine[charIndex];
            // write to result ...
        }
        else {
            continue; // skip this character
        }
    }
}
There are several things which could be improved.
I would use File.ReadLines over File.ReadAllLines, because ReadAllLines reads all the lines at once, whereas ReadLines streams them.
With the line results = (readInLines[i + 1].Split('|').Select(p => p.Trim()).ToList()); you're overwriting the previous results list. You'd better use results.AddRange() to add new results.
for (int i = 0; i <= readInLines.Length; i++) means that when the length is 10 it will do 11 iterations, which is one too many (remove the =).
Array.FindIndex(readInLines, counter, hyphens72); will do a scan. On large files it will take ages to read them completely and search in them. Try to touch each line only once.
I cannot test what you are doing, but here's a hint:
IEnumerable<string> readInLines = File.ReadLines(sourcefile);
bool started = false;
List<string> results = new List<string>();
foreach(var line in readInLines)
{
// skip empty lines
if(emptyElement(line))
continue;
// when dashes are found, flip a boolean to activate the reading mode.
if(hyphens72(line))
{
// flip state.. (start/end)
started = !started;
}
if(started)
{
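// I don't know what you are doing here precisely, do what you gotta do. ;-)
// Note: holderCPStart, blankLine and endDiv come from the question's code, and the
// index-based string.Join overload used below needs a string[] rather than an IEnumerable<string>.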
results.AddRange((line.Split('|').Select(p => p.Trim()).ToList()));
string holderCP = string.Join(Environment.NewLine, readInLines, holderCPStart, (blankLine - holderCPStart - 1)).Trim();
results.Add(holderCP);
string comment = string.Join(" ", readInLines, blankLine + 1, (endDiv - (blankLine + 1)));//in case the comment is more than one line long
results.Add(comment);
}
}
foreach (string result in results)
{
Console.WriteLine(result);
}
You might want to start with a class like this. I don't know whether each section begins with a row of hyphens, or if it's just in between. This should handle either scenario.
What this is going to do is take your giant list of strings (the lines in the file) and break it into chunks - each chunk is a set of lines (10 or so lines, according to your OP.)
The reason is that it's unnecessarily complicated to try to read the file, looking for the hyphens, and process the contents of the file at the same time. Instead, one class takes the input and breaks it into chunks. That's all it does.
Another class might read the file and pass its contents to this class to break them up. Then the output is the individual chunks of text.
Another class can then process those individual sections of 10 or so lines without having to worry about hyphens or what separates one chunk from another.
Now that each of these classes is doing its own thing, it's easier to write unit tests for each of them separately. You can test that your "processing" class receives an array of 10 or so lines and does whatever it's supposed to do with them.
public class TextSectionsParser
{
private readonly string _delimiter;
public TextSectionsParser(string delimiter)
{
_delimiter = delimiter;
}
public IEnumerable<IEnumerable<string>> ParseSections(IEnumerable<string> lines)
{
var result = new List<List<string>>();
var currentList = new List<string>();
foreach (var line in lines)
{
if (line == _delimiter)
{
if(currentList.Any())
result.Add(currentList);
currentList = new List<string>();
}
else
{
currentList.Add(line);
}
}
if (currentList.Any() && !result.Contains(currentList))
{
result.Add(currentList);
}
return result;
}
}
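To illustrate the split between "chunking" and "processing", here is a rough sketch. SectionDemo, SplitSections and ParseSection are hypothetical names, and the section layout (a header line with '|' separated fields, a caption line, the changed paths, a blank line, then the comment) is an assumption inferred from the question, not something the original confirms. SplitSections condenses the same chunking idea into one method so the example stands alone.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SectionDemo
{
    // Same chunking idea as the parser class, condensed for this sketch.
    public static List<List<string>> SplitSections(IEnumerable<string> lines, string delimiter)
    {
        var sections = new List<List<string>>();
        var current = new List<string>();
        foreach (var line in lines)
        {
            if (line == delimiter)
            {
                if (current.Count > 0) sections.Add(current);
                current = new List<string>();
            }
            else
            {
                current.Add(line);
            }
        }
        if (current.Count > 0) sections.Add(current);
        return sections;
    }

    // Hypothetical processing step: turns one section into a row of values.
    public static List<string> ParseSection(List<string> section)
    {
        // Header fields, e.g. revision | author | date | line count (assumed).
        var row = section[0].Split('|').Select(p => p.Trim()).ToList();

        int blank = section.IndexOf("");
        if (blank < 0) blank = section.Count;

        // Index 1 is assumed to be a caption line, so paths start at index 2.
        var paths = section.Skip(2).Take(Math.Max(0, blank - 2)).Select(p => p.Trim());
        row.Add(string.Join(";", paths));

        // Everything after the blank line is the comment, possibly multi-line.
        row.Add(string.Join(" ", section.Skip(blank + 1)));
        return row;
    }
}
```

Each row returned by ParseSection can then be joined with commas and appended to the CSV output, one row per section, and both methods can be unit tested with a handful of in-memory lines.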
I am working on an application that was implemented using a SQLite database. I am currently in the process of adding the ability to use MSSQL as well. The complicated part is that it will need to be able to use either engine depending on the needs. There are a handful of different syntax differences that have to be accounted for. The biggest problem child I have come across is LIMIT vs TOP. I have written some logic to convert the SQLite statements into the proper format for MSSQL. My function for converting the LIMIT to TOP seems to be working, but it ended up being pretty ugly. I wanted to post it here and see if anyone had ideas for a cleaner method of completing this. I also wanted to see if anyone noticed any glaring issues that I have missed. The biggest problem I ran into is the possibility of nested select statements with possible LIMIT statements on them as well. I ended up pulling the statement apart into its individual parts, changing them from LIMIT to TOP, and then rebuilding the statement. There might even be an overall better way to do this that I am missing. Thanks ahead of time if you spend the time to take a look.
private static string ConvertLimitToTop(string commandText)
{
string processCommand = commandText;
int start = -1;
List<string> commandParts = new List<string>();
//Running through the string looking for nested statements starting with (
for (int i = 0; i < processCommand.Length; i++)
{
//Any time we find a new open ( we want to start there
if (processCommand[i] == '(')
start = i;
//If we find a close ) we will grab the nested statement and replace it
if (processCommand[i] == ')')
{
//Grab the 3 parts of the string
string preString = processCommand.Substring(0, start);
string nestedCommand = processCommand.Substring(start, i - start + 1);
string postString = processCommand.Substring(i + 1);
//Add the nested command to the list
commandParts.Add(nestedCommand);
//Update the commandText replacing the nested command we removed with its index in the list
processCommand = preString + "{" + (commandParts.Count - 1) + "}" + postString;
//Go back to the beginning of the command and look for the next nested command
i = 0;
start = -1;
}
}
//If start isnt -1 that means we found an open ( without a closing )
if (start == -1)
{
//We want to add the final command to the list for processing too
commandParts.Add(processCommand);
//We're going to go through the command parts and replace the LIMIT
for (int i = 0; i < commandParts.Count; i++)
{
string command = commandParts[i];
Console.WriteLine(command);
//We need to find where the LIMIT is and extract the number
int limitIndex = command.IndexOf("LIMIT");
if (limitIndex != -1)
{
int startIndex = limitIndex + 6;
//Assuming after the limit will be ), a space, or the end of the string
int endIndex = command.IndexOf(')', startIndex);
if (endIndex == -1)
endIndex = command.IndexOf(' ', startIndex);
if (endIndex == -1)
endIndex = command.Length - 1;
Console.WriteLine(startIndex);
Console.WriteLine(endIndex);
//Extract the number
string limitNumber = command.Substring(startIndex, endIndex - startIndex);
//Remove the LIMIT command. There should always be a space before so take that out too.
command = command.Remove(limitIndex - 1, endIndex - limitIndex + 1);
//Insert the top command with the number
command = command.Replace("SELECT", "SELECT TOP " + limitNumber);
//Update the list
commandParts[i] = command;
}
}
start = -1;
//We need to go through the commands in reverse order and reassemble the complete command
for (int i = 0; i < processCommand.Length; i++)
{
//If we find a { its a part of the command that needs to be replaced
if (processCommand[i] == '{')
start = i;
if (processCommand[i] == '}')
{
string startString = processCommand.Substring(0, start);
string midString = processCommand.Substring(start, i - start + 1);
string endString = processCommand.Substring(i + 1);
//Get the index of the command we need from the list
int strIndex = Int32.Parse(midString.Substring(1, midString.Length - 2));
processCommand = processCommand.Replace(midString, commandParts[strIndex]);
//Go back to the start and look for the next
i = 0;
start = -1;
}
}
commandText = processCommand;
}
else
{
LogManager.Write(LogLevel.Error, "Unmatched parentheses were found while processing a SQL command. Command: " + commandText);
}
return commandText;
}
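One possible cleaner shape for the same idea is a recursive pass: convert the contents of each parenthesised group first, then handle the LIMIT that belongs to the current nesting level. The sketch below is only an illustration under stated assumptions: LimitToTop and Rewrite are hypothetical names, LIMIT is assumed to be the last clause at its level (no OFFSET), and parentheses inside string literals, SELECT DISTINCT and UNION are not handled.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class LimitToTop
{
    private static readonly Regex TrailingLimit =
        new Regex(@"\s+LIMIT\s+(\d+)\s*$", RegexOptions.IgnoreCase);
    private static readonly Regex FirstSelect =
        new Regex(@"\bSELECT\b", RegexOptions.IgnoreCase);

    public static string Rewrite(string sql)
    {
        var rebuilt = new StringBuilder();
        int depth = 0;
        int groupStart = -1;

        for (int i = 0; i < sql.Length; i++)
        {
            char c = sql[i];
            if (c == '(')
            {
                if (depth == 0) groupStart = i + 1;
                depth++;
            }
            else if (c == ')')
            {
                depth--;
                if (depth < 0) throw new FormatException("Unmatched ')' in: " + sql);
                if (depth == 0)
                {
                    // Convert the whole group (including deeper nesting) recursively.
                    string inner = sql.Substring(groupStart, i - groupStart);
                    rebuilt.Append('(').Append(Rewrite(inner)).Append(')');
                }
            }
            else if (depth == 0)
            {
                rebuilt.Append(c);
            }
        }
        if (depth != 0) throw new FormatException("Unmatched '(' in: " + sql);

        // Nested groups no longer contain LIMIT, so a trailing LIMIT found
        // here belongs to this level and to this level's first SELECT.
        string level = rebuilt.ToString();
        Match limit = TrailingLimit.Match(level);
        if (!limit.Success) return level;

        string withoutLimit = level.Substring(0, limit.Index);
        return FirstSelect.Replace(withoutLimit, "SELECT TOP " + limit.Groups[1].Value, 1);
    }
}
```

Because every level is finished before its parent is examined, no placeholder list or restart of the scan is needed, and unmatched parentheses surface as an exception that the caller can log.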
Today I ran into a problem while reading a text file line by line using C# ReadLine().
Problem:
Assume there are 3 lines in the text file, each with a length of 400 characters (counted manually).
While reading the lines with C# ReadLine() and checking the length with
Console.WriteLine(str.Length);
I found that it prints:
Line 1 => 400
Line 2 => 362
Line 3 => 38
Line 4 => 400
I was confused: the text file has only 3 lines, so why is it printing 4, and with different lengths? I quickly checked for "\n", "\r" or the combination "\r\n" but didn't find any. What I did find was two double quotes (e.g. "abcd") in the second line.
Then I changed my code to print the lines themselves and, to my amazement, got output like this in the console:
Line 1 > blahblahblabablabhlabhlabhlbhaabahbbhabhblablahblhablhablahb
Line 2 > blablabbablablababalbalbablabal"blabablhabh
Line 3 > "albhalbahblablab
Line 4 > blahblahblabablabhlabhlabhlbhaabahbbhabhblablahblhablhablahb
I then tried removing the double quotes using the Replace function, but I got the same 4 lines, just without the double quotes.
Please let me know of any solution, other than editing the files manually, to overcome this.
Here is my simple code:
static void Main(string[] args)
{
FileStream fin;
string s;
string fileIn = @"D:\Testing\CursedtextFile\testfile.txt";
try
{
fin = new FileStream(fileIn, FileMode.Open);
}
catch (FileNotFoundException exc)
{
Console.WriteLine(exc.Message + "Cannot open file.");
return;
}
StreamReader fstr_in = new StreamReader(fin, Encoding.Default, true);
int cnt = 0;
while ((s = fstr_in.ReadLine()) != null)
{
s = s.Replace("\""," ");
cnt = cnt + 1;
//Console.WriteLine("Line "+cnt+" => "+s.Length);
Console.WriteLine("Line " + cnt + " => " + s);
}
Console.ReadLine();
fstr_in.Close();
fin.Close();
}
Note: I am trying to read and upload 37 text files of 500 MB each from the finance domain, where I always face this issue and have to make the changes manually. :(
If the problem is that:
Proper line breaks should be a combination of newline (10) and carriage return (13)
Lone newlines and/or carriage returns are incorrectly being interpreted as line breaks
Then you can fix this, but the best and probably most correct way is to go to the source and fix the program that writes out this incorrectly formatted file in the first place.
However, here's a LINQPad program that replaces lone newlines or carriage returns with spaces:
void Main()
{
string input = "this\ris\non\ra\nsingle\rline\r\nThis is on the next line";
string output = ReplaceLoneLineBreaks(input);
output.Dump();
}
public static string ReplaceLoneLineBreaks(string input)
{
if (string.IsNullOrEmpty(input))
return input;
var result = new StringBuilder();
int index = 0;
while (index < input.Length)
{
switch (input[index])
{
case '\n':
if (index == input.Length - 1 || input[index+1] != '\r')
{
result.Append(' ');
index++;
}
else
{
result.Append(input[index]);
result.Append(input[index + 1]);
index += 2;
}
break;
case '\r':
if (index == input.Length - 1 || input[index+1] != '\n')
{
result.Append(' ');
index++;
}
else
{
result.Append(input[index]);
result.Append(input[index + 1]);
index += 2;
}
break;
default:
result.Append(input[index]);
index++;
break;
}
}
return result.ToString();
}
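Since the files in question are around 500 MB each, loading one into a single string may not be practical. A streaming variation of the same idea is sketched below; it assumes that a genuine line break is always CR followed by LF, and CrLfReader is a hypothetical helper name. It reads one character at a time, ends a line only on a CR LF pair, and turns any lone CR or LF into a space.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class CrLfReader
{
    // Yields lines terminated by CR LF only; lone CR or LF becomes a space.
    public static IEnumerable<string> ReadLines(TextReader reader)
    {
        var line = new StringBuilder();
        bool pendingCr = false;
        int c;
        while ((c = reader.Read()) != -1)
        {
            if (pendingCr)
            {
                pendingCr = false;
                if (c == '\n')
                {
                    // CR LF pair: this is a real end of line.
                    yield return line.ToString();
                    line.Clear();
                    continue;
                }
                line.Append(' '); // the previous CR was a lone one
            }

            if (c == '\r') pendingCr = true;
            else if (c == '\n') line.Append(' '); // lone LF
            else line.Append((char)c);
        }
        if (pendingCr) line.Append(' ');
        if (line.Length > 0) yield return line.ToString();
    }
}
```

It could be used in place of the ReadLine() loop, for example foreach (string s in CrLfReader.ReadLines(fstr_in)), because StreamReader derives from TextReader.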
If the lines are all of the same length, split the lines by their length instead of watching for end of lines.
const int EndOfLine = 2; // CR LF or = 1 if only LF.
const int LineLength = 400;
string text = File.ReadAllText(path);
for (int i = 0; i < text.Length - EndOfLine; i += LineLength + EndOfLine) {
string line = text.Substring(i, Math.Min(LineLength, text.Length - i - EndOfLine));
// TODO Process line
}
If the last line is not terminated by end of line characters, remove the two - EndOfLine.
Also the Math.Min part is only a safety measure. It might not be necessary if no line is shorter than 400.
Given this code:
using (StreamWriter sw = File.CreateText(file))
{
for (int r = 0; r < originalDataTable.Rows.Count; r++)
{
for (int c = 0; c < originalDataTable.Columns.Count; c++)
{
var rowValueAtColumn = originalDataTable.Rows[r][c].ToString();
var valueToWrite = string.Format(@"{0}{1}", rowValueAtColumn, "\t");
if (c != originalDataTable.Columns.Count)
sw.Write(valueToWrite);
else
sw.Write(valueToWrite + @"\n");
}
}
}
I am trying to write the DataTable back to a file one row at a time; however, the resulting file is one run-on line, with all the data written on a single line. There should be 590 individual lines, not just one.
What do I need to add to the code above so that my lines are broken out as they are in the data table? My code just doesn't seem to be working.
sw.Write(valueToWrite + @"\n"); is wrong. Because of the @, the string is a verbatim literal, so it does not contain a newline: you are writing the character \ followed by the character n.
You want to do either sw.Write(valueToWrite + "\n"); or have the program put a new line in for you by doing sw.WriteLine(valueToWrite), however that will enter Environment.NewLine which is \r\n on windows.
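The difference between the two literals can be checked directly. This is a small illustrative sketch; NewlineLiterals and Describe are hypothetical names.

```csharp
using System;

public static class NewlineLiterals
{
    // Verbatim literal: the backslash is an ordinary character, so this is '\' followed by 'n'.
    public const string Verbatim = @"\n";

    // Regular literal: \n is an escape sequence for a single line-feed character.
    public const string Escaped = "\n";

    // Summarises the length and the code of the first character of a string.
    public static string Describe(string s)
    {
        return s.Length + " char(s), first code point " + (int)s[0];
    }
}
```

Describe(NewlineLiterals.Verbatim) reports 2 characters starting with code 92 (the backslash), while Describe(NewlineLiterals.Escaped) reports 1 character with code 10 (line feed).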
However, you can make your code even simpler by writing the row separator outside of the column for loop. I also defined the two separators at the top in case you ever want to change them (what will the program you are sending this to do when you hit some data that has a \t or a \n in the text itself?), and made a few other small tweaks to make the code easier to read.
string colSeperator = "\t";
string rowSeperator = "\n";
using (StreamWriter sw = File.CreateText(file))
{
for (int r = 0; r < originalDataTable.Rows.Count; r++)
{
for (int c = 0; c < originalDataTable.Columns.Count; c++)
{
sw.Write(originalDataTable.Rows[r][c]);
sw.Write(colSeperator);
}
sw.Write(rowSeperator);
}
}
Here is another simplification, just to show other ways to do it (now that you don't need to check originalDataTable.Columns.Count):
string colSeperator = "\t";
string rowSeperator = "\n";
using (StreamWriter sw = File.CreateText(file))
{
foreach (DataRow row in originalDataTable.Rows)
{
foreach (object value in row.ItemArray)
{
sw.Write(value);
sw.Write(colSeperator);
}
sw.Write(rowSeperator);
}
}
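On the question raised above about data that itself contains a \t or a \n: one simple option, sketched here as an assumption rather than a recommendation for every consumer, is to replace those characters with a space before writing. TsvWriter, Clean and Write are hypothetical names. Unlike the loops above, this version does not emit a trailing separator after the last column.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class TsvWriter
{
    // Replaces tab, CR and LF inside a value with a space so that the value
    // cannot be mistaken for a column or row separator.
    public static string Clean(object value)
    {
        string text = Convert.ToString(value) ?? string.Empty;
        return text.Replace('\t', ' ').Replace('\r', ' ').Replace('\n', ' ');
    }

    // Writes one line per row; string.Join places a tab only between columns.
    public static void Write(TextWriter writer, IEnumerable<object[]> rows)
    {
        foreach (object[] row in rows)
        {
            writer.Write(string.Join("\t", row.Select(Clean)));
            writer.Write("\n");
        }
    }
}
```

With the DataTable from the question, the rows could be supplied as originalDataTable.Rows.Cast<DataRow>().Select(r => r.ItemArray), since DataRow.ItemArray is an object[].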
Change sw.Write() to sw.WriteLine()
http://msdn.microsoft.com/en-us/library/system.io.streamwriter.writeline.aspx