I have a text file like the following (it has more than a hundred thousand lines):
Header
AGROUP1
ADATA1|0000
ADATA2|0001
ADATA3|0002
D0000|TNE
D0001|TNE
D0002|TNE
AGROUP2
ADATA1|0000
ADATA2|0001
ADATA3|0002
D0000|TNE
D0001|TNE
D0002|TNE
AGROUP3
ADATA1|0000
ADATA2|0001
ADATA3|0002
D0000|TNE
D0001|TNE
D0002|TNE
In fact, it has more than a hundred thousand lines.
I need to read data based on the group.
For example, in a method:
public void ReadData(string strGroup)
{
    if (strGroup == "AGROUP2")
    {
        // Read from the text file starting from line "AGROUP2" up to "AGROUP3" (i.e. the lines under AGROUP2)
    }
}
What I have tried is:
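// an iterator method (one that uses yield return) must return IEnumerable<string>, not void
public IEnumerable<string> ReadData(string strGroup)
{
    bool start = false;
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        if (line == strGroup && line.Length == 5)
            start = true;
        else if (line.Length == 5)
            start = false;
        if (start)
            yield return line;
    }
}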
It works fine, but performance-wise it takes a long time, since my text file is very large and there is an if condition on every line in the method.
Is there a better way to do this?
If you know anything about the structure of the file that might help, you can use that:
if the list is sorted, you might know when to stop parsing
if the list contains jump tables or an index, you can skip lines
if the groups have a specific number of lines, you can skip those
If not, you're destined to search from top to bottom, and you will only be able to increase the speed using technical tricks:
read batches of lines instead of single lines
try to avoid creating many tiny objects (strings) in your code that might choke the garbage collector
if you need to do a lot of random access (going back and forth throughout the file), consider indexing or splitting the file first
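As a rough illustration of the "know when to stop parsing" idea: if every group header can be recognized, the scan can end as soon as the next header appears instead of running to the end of the file. This is only a sketch. The GroupReader class and the IsGroupHeader rule (a header is any line starting with "AGROUP") are assumptions based on the sample data, not something stated in the question.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class GroupReader
{
    // Assumption: a group header is any line that starts with "AGROUP".
    static bool IsGroupHeader(string line) =>
        line.StartsWith("AGROUP", StringComparison.Ordinal);

    // Lazily yields the lines under the given group header.
    // File.ReadLines streams the file, and TakeWhile stops reading
    // as soon as the next group header is reached.
    public static IEnumerable<string> ReadGroup(string path, string group)
    {
        return File.ReadLines(path)
            .SkipWhile(line => line != group)   // skip everything before the header
            .Skip(1)                            // skip the header itself
            .TakeWhile(line => !IsGroupHeader(line));
    }
}
```

This still scans from the top down to the requested group, so it only helps when the group is not near the end of the file. For repeated lookups, building an index once is likely the better option.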
What if you use a bash command to cut the huge file into smaller ones, each with AGROUP# as the first line? I think bash commands are more optimized.
I am attempting to use the StreamReader to break data from a text file into multiple listboxes. Thus far, I've been able to get all the data into one listbox, but the next step of my project requires the data to be split, and I think I understand listboxes better than arrays. I have made an effort to search for a similar problem, but because I am a beginner, most of what I have found perplexes me further. I have only been able to accomplish the following successfully:
StreamReader file = new StreamReader(openFileDialog1.FileName);
string data;
while (!file.EndOfStream)
{
data = file.ReadLine();
listBox1.Items.Add(data);
}
file.Close();
My data in my .txt file comes in blocks of three like so:
blue
david
8042
red
joseph
7042
I am unable to change the data format, so I have been trying to code it out in such a way that
if (blue)
listBox1.Items.Add(david);
listBox2.Items.Add(8042);
else if (red)
listBox3.Items.Add(joseph);
listBox4.Items.Add(7042);
etc. I only have two colors to work with, but lots of data for each of those colors. My problem is that I am new to coding and failing to put into place the basics I've learned to do such a thing.
What are the lines of code I'm missing to add a line below a line to a listbox while it StreamReads? Do I need to use an
int counter = 0;
and increase it by 1 or 2 to get those lines, or am I thinking too basically?
Thank you very much for any help. I feel like I'm missing something very simple I have yet to grasp.
One possible solution is to read three lines (i.e. an entire block) at a time instead of one:
using (StreamReader file = new StreamReader(openFileDialog1.FileName)) {
while (!file.EndOfStream) {
string color = file.ReadLine();
string name = file.ReadLine();
string number = file.ReadLine();
if (color == "blue") {
listBox1.Items.Add(name);
listBox2.Items.Add(number);
}
else if (color == "red") {
listBox3.Items.Add(name);
listBox4.Items.Add(number);
}
}
}
How can I sort a large CSV file with 10 columns?
The sorting should be based on the data type, for example string, date, integer, etc.
Assume we need to sort on the 5th column (the Period column).
As it is a large CSV file, we have to do this without loading the whole file into memory.
I tried using logparser, but beyond a certain size it throws an error saying
"log parser tool has stopped working"
So please suggest any algorithm which I can implement in C#, or any other component or code which can help me.
Thanks in advance
Do know that running a program without memory is hard, especially if you have an algorithm that by its nature requires memory allocation.
I've looked at the external sort method mentioned by Jim Menschel, and this is my implementation.
I didn't implement sorting on the fifth field, but I left some hints in the code so you can add that yourself.
This code reads a file line by line and creates a new file in a temporary directory for each line. Then we open two of those files and create a new target file. After reading a line from each of the two open files, we can compare them (or their fields). Based on the comparison, we write the smallest one to the target file and read the next line from the file it came from.
Although this doesn't keep many strings in memory, it is hard on the disk drive. I checked the NTFS limits, and 50,000,000 files is within the specs.
Here are the main methods of the class:
Main entry point
This takes the file to be sorted.
public void Sort(string file)
{
Directory.CreateDirectory(sortdir);
Split(file);
var sortedFile = SortAndCombine();
// if you feel confident you can override the original file
File.Move(sortedFile, file + ".sorted");
Directory.Delete(sortdir);
}
Split file
Split the file into a new file for each line.
Yes, that will be a lot of files, but it guarantees the least amount of memory used. It is easy to optimize, though: read a couple of lines, sort those, and write them to a file.
void Split(string file)
{
using (var sr = new StreamReader(file, Encoding.UTF8))
{
var line = sr.ReadLine();
while (!String.IsNullOrEmpty(line))
{
// whatever you do, make sure the file you write
// is ordered; just writing a single line is the easiest
using (var sw = new StreamWriter(CreateUniqueFilename()))
{
sw.WriteLine(line);
}
line = sr.ReadLine();
}
}
}
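The optimization mentioned above (read a couple of lines, sort them, write them to one file) could look roughly like the sketch below. ChunkSplitter, SplitSorted and the chunkSize parameter are hypothetical names introduced only for illustration. The directory is passed in rather than taken from the shared sortdir member so that the snippet stands on its own.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class ChunkSplitter
{
    // Splits 'file' into sorted chunk files of at most 'chunkSize' lines each
    // and returns the names of the chunk files that were written.
    public static List<string> SplitSorted(string file, string targetDir, int chunkSize)
    {
        var chunkFiles = new List<string>();
        var chunk = new List<string>(chunkSize);
        foreach (var line in File.ReadLines(file))   // streams the file line by line
        {
            chunk.Add(line);
            if (chunk.Count == chunkSize)
                Flush(chunk, targetDir, chunkFiles);
        }
        if (chunk.Count > 0)
            Flush(chunk, targetDir, chunkFiles);     // last, partially filled chunk
        return chunkFiles;
    }

    static void Flush(List<string> chunk, string targetDir, List<string> chunkFiles)
    {
        // The default string comparison is the same one String.CompareTo uses,
        // so the chunks are ordered consistently with a CompareTo-based merge step.
        chunk.Sort();
        var name = Path.Combine(targetDir, Guid.NewGuid().ToString("N"));
        File.WriteAllLines(name, chunk);
        chunkFiles.Add(name);
        chunk.Clear();
    }
}
```

Only chunkSize lines are held in memory at a time, so the value trades memory for the number of temporary files. A suitable value would have to be found by measuring.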
Combine the files
Iterate over all files and take one and the next one, merge those files
string SortAndCombine()
{
long processed; // keep track of how much we processed
do
{
// iterate the folder
var files = Directory.EnumerateFiles(sortdir).GetEnumerator();
bool hasnext = files.MoveNext();
processed = 0;
while (hasnext)
{
processed++;
// have one
string fileOne = files.Current;
hasnext = files.MoveNext();
if (hasnext)
{
// we have number two
string fileTwo = files.Current;
// do the work
MergeSort(fileOne, fileTwo);
hasnext = files.MoveNext();
}
}
} while (processed > 1);
var lastfile = Directory.EnumerateFiles(sortdir).GetEnumerator();
lastfile.MoveNext();
return lastfile.Current; // by magic is the name of the last file
}
Merge and Sort
Open two files and create one target file. Read a line from each of them and write the smallest of the two to the target file.
Keep doing that until both lines are null.
void MergeSort(string fileOne, string fileTwo)
{
string result = CreateUniqueFilename();
using(var srOne = new StreamReader(fileOne, Encoding.UTF8))
{
using(var srTwo = new StreamReader(fileTwo, Encoding.UTF8))
{
// I left the actual field parsing as an exercise for the reader
string lineOne, lineTwo; // fieldOne, fieldTwo;
using(var target = new StreamWriter(result))
{
lineOne = srOne.ReadLine();
lineTwo = srTwo.ReadLine();
// naive field parsing
// fieldOne = lineOne.Split(';')[4];
// fieldTwo = lineTwo.Split(';')[4];
while(
!String.IsNullOrEmpty(lineOne) ||
!String.IsNullOrEmpty(lineTwo))
{
// use your parsed fieldValues here
if (lineOne != null && (lineOne.CompareTo(lineTwo) < 0 || lineTwo==null))
{
target.WriteLine(lineOne);
lineOne = srOne.ReadLine();
// fieldOne = lineOne.Split(';')[4];
}
else
{
if (lineTwo!=null)
{
target.WriteLine(lineTwo);
lineTwo = srTwo.ReadLine();
// fieldTwo = lineTwo.Split(';')[4];
}
}
}
}
}
}
// all is processed, remove the input files.
File.Delete(fileOne);
File.Delete(fileTwo);
}
Helper variable and method
There is one shared member for the temporary directory and a method for generating temporary unique filenames.
private string sortdir = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N"));
string CreateUniqueFilename()
{
return Path.Combine(sortdir, Guid.NewGuid().ToString("N"));
}
Memory analysis
I've created a small file with 5000 lines in it with the following code:
using(var sw= new StreamWriter("c:\\temp\\test1.txt"))
{
for(int line=0; line<5000; line++)
{
sw.WriteLine(Guid.NewGuid().ToString());
}
}
I then ran the sorting code with the memory profiler. This is what the summary looked like on my box with Windows 10, 4GB RAM and a spinning disk:
The object lifetime shows as expected a lot of String, char[] and byte[] allocations, but none of those have survived a Gen 0 collection, which means they are all short lived and I don't expect this to be a problem if the number of lines to sort increases.
This is the simplest solution that works for me. From here, easy alterations and improvements are possible, leading to even less memory consumption, fewer allocations, or higher speed. Make sure to measure, select the area where you can make the biggest impact, and compare successive results. That should give you the optimum between memory usage and performance.
Instead of reading CSV completely you can simply index it:
Read unsorted CSV line by line and remember 5th element (column) value and something to identify this line later: line number or offset of this line from beginning of the file and size.
You will have some kind of List<Tuple<string, ...>>. Sort that:
var sortedList = unsortedList.OrderBy(item => item.Item1);
Now you can create sorted CSV by enumerating sorted list, reading line from source file and appending it to new CSV:
using (var sortedCSV = File.AppendText(newCSVFileName))
foreach(var item in sortedList)
{
... // read line from unsorted csv using item.Item2, etc.
sortedCSV.WriteLine(...);
}
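The indexing idea above, filled in as a minimal sketch. Several things here are assumptions and not part of the answer: the file is UTF-8 without a byte order mark, lines end in '\n' (a trailing '\r' is trimmed), fields are separated by ',' with no quoting, there is no header row, and keys are compared as plain strings. CsvIndexSort, SortByColumn and AddEntry are made-up names. Sorting by date or integer would need the key parsed before comparing.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

static class CsvIndexSort
{
    public static void SortByColumn(string source, string target, int column)
    {
        var index = new List<(string Key, long Offset, int Length)>();
        using (var input = new FileStream(source, FileMode.Open, FileAccess.Read))
        {
            // Pass 1: remember the key, byte offset and byte length of every line.
            var lineBytes = new List<byte>();
            long lineStart = 0, position = 0;
            int b;
            while ((b = input.ReadByte()) != -1)
            {
                position++;
                if (b != '\n') { lineBytes.Add((byte)b); continue; }
                AddEntry(index, lineBytes, lineStart, column);
                lineBytes.Clear();
                lineStart = position;
            }
            if (lineBytes.Count > 0) AddEntry(index, lineBytes, lineStart, column);

            // Pass 2: seek to each line in key order and copy it to the target.
            using (var output = new StreamWriter(target, false, new UTF8Encoding(false)))
            {
                foreach (var entry in index.OrderBy(e => e.Key, StringComparer.Ordinal))
                {
                    var bytes = new byte[entry.Length];
                    input.Seek(entry.Offset, SeekOrigin.Begin);
                    int read = 0;
                    while (read < bytes.Length)
                    {
                        int n = input.Read(bytes, read, bytes.Length - read);
                        if (n == 0) break;   // unexpected end of file
                        read += n;
                    }
                    output.WriteLine(Encoding.UTF8.GetString(bytes, 0, read));
                }
            }
        }
    }

    static void AddEntry(List<(string Key, long Offset, int Length)> index,
                         List<byte> lineBytes, long offset, int column)
    {
        string text = Encoding.UTF8.GetString(lineBytes.ToArray()).TrimEnd('\r');
        string[] fields = text.Split(',');
        string key = column < fields.Length ? fields[column] : "";
        index.Add((key, offset, Encoding.UTF8.GetByteCount(text)));
    }
}
```

A call such as CsvIndexSort.SortByColumn("in.csv", "out.csv", 4) would sort on the 5th column. Note that the index itself still lives in memory (one key plus an offset per line), and the second pass does one seek per line, which may be slow on a spinning disk.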
I have a task of extracting a few hundred thousand rows from CSV files where the row contains a specified ID. So I have about 300,000 IDs stored in a string List and need to extract any row in the CSV that contains any of these IDs.
At the minute I am using a Linq statement to see if each row contains any of the IDs in the List:
using (StreamReader sr = new StreamReader(csvFile))
{
    string inLine;
    // read every row once and test the row that was just read
    while ((inLine = sr.ReadLine()) != null)
    {
        if (searchStrings.Any(inLine.Contains))
        {
            streamWriter.WriteLine(inLine);
        }
    }
}
This kind of works OK, but it is very slow, since there are 300,000 values in the searchStrings List and a few million rows in the CSVs that I need to search.
Does anyone know how to make this search more efficient?
Or an alternative method for extracting the required rows?
Thanks
I've faced a similar-ish problem before: I had to iterate through a .csv file of several hundred thousand lines and parse each row.
I went with a threaded approach where I tried to do the reading and parsing simultaneously in batches.
Here's roughly how I did it:
using System.Collections.Concurrent; using System.Threading;
private static ConcurrentBag<String> items = new ConcurrentBag<String>();
private static List<String> searchStrings;
static void Main(string[] args)
{
    var threads = new List<Thread>();
    using (StreamReader sr = new StreamReader(csvFile))
    {
        const int buffer_size = 10000;
        string[] buffer = new string[buffer_size];
        int count = 0;
        String line = null;
        while ((line = sr.ReadLine()) != null)
        {
            buffer[count] = line;
            count++;
            if (count == buffer_size)
            {
                // hand the full batch to the thread through a local variable,
                // so the thread never sees the replacement buffer assigned below
                string[] batch = buffer;
                var thread = new Thread(() => find(batch));
                threads.Add(thread);
                thread.Start();
                buffer = new String[buffer_size];
                count = 0;
            }
        }
        if (count > 0)
        {
            // the last batch is only partially filled; the unused slots are null
            find(buffer);
        }
    }
    // make sure all the threads have finished executing
    foreach (var thread in threads)
        thread.Join();
    // write out the matching lines collected by find
    foreach (var str in items)
        streamWriter.WriteLine(str);
}
private static void find(string[] buffer)
{
//do your search algorithm on the array of strings
//add to the concurrentbag if they match
}
I just quickly threw this code together from what I remember doing before so it might not be entirely correct. Doing it like this certainly speeds things up though (with very large files at least).
The idea is to always be reading from the hdd as string parsing can be pretty expensive, and thus batching the work on multiple cores can make it significantly faster.
With this, I was able to parse (splitting each line into about 50 items and parsing the key/value pairs and building objects in memory from them - by far the most time consuming part) around 250k lines in just over 7s.
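The body of find is left open in the outline. One way it might be filled in is sketched below, reusing the substring test from the question. BatchSearch and its parameters are hypothetical names, and the search strings and the result bag are passed in explicitly so the snippet is self-contained.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

static class BatchSearch
{
    // Adds every line of the batch that contains at least one search string
    // to the shared bag. ConcurrentBag is safe to add to from several threads.
    public static void Find(string[] batch,
                            IReadOnlyCollection<string> searchStrings,
                            ConcurrentBag<string> matches)
    {
        foreach (var line in batch)
        {
            if (line == null) continue;   // unused slots of a partially filled batch
            if (searchStrings.Any(s => line.Contains(s)))
                matches.Add(line);
        }
    }
}
```

This only spreads the same per-line work over more cores. Each line is still compared against every search string, so the gain is bounded by the number of cores.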
Just throwing this out there, it's not specifically relevant to any of the tags on your question but the *nix "grep -f" functionality would work here. Essentially, you'd have a file with the list of strings you want to match (e.g., StringsToFind.txt) and you'd have your csv input file (e.g., input.csv) and the following command would output the matching lines to output.csv
grep -f StringsToFind.txt input.csv > output.csv
See grep man page for more details.
I need a fast method to work with a big text file.
I have 2 files:
a big text file (~20 GB)
and another text file that contains a list of ~12 million combo words
I want to find all combo words in the first text file and replace each with another combo word (the same combo word with an underscore)
example: "Computer Information" > replace with > "Computer_Information"
I use this code, but performance is very poor (I tested on an HP G7 server with 16 GB RAM and 16 cores)
public partial class Form1 : Form
{
HashSet<string> wordlist = new HashSet<string>();
private void loadComboWords()
{
using (StreamReader ff = new StreamReader(txtComboWords.Text))
{
string line;
while ((line = ff.ReadLine()) != null)
{
wordlist.Add(line);
}
}
}
private void replacewords(ref string str)
{
foreach (string wd in wordlist)
{
// ReplaceEx(ref str,wd,wd.Replace(" ","_"));
if (str.IndexOf(wd) > -1)
str = str.Replace(wd, wd.Replace(" ", "_")); // strings are immutable, so the result must be assigned back
}
}
private void button3_Click(object sender, EventArgs e)
{
string line;
using (StreamReader fread = new StreamReader(txtFirstFile.Text))
{
string writefile = Path.GetFullPath(txtFirstFile.Text) + Path.GetFileNameWithoutExtension(txtFirstFile.Text) + "_ReplaceComboWords.txt";
StreamWriter sw = new StreamWriter(writefile);
long intPercent;
label3.Text = "initialing";
loadComboWords();
while ((line = fread.ReadLine()) != null)
{
replacewords(ref line);
sw.WriteLine(line);
intPercent = (fread.BaseStream.Position * 100) / fread.BaseStream.Length;
Application.DoEvents();
label3.Text = intPercent.ToString();
}
sw.Close();
fread.Close();
label3.Text = "Finished";
}
}
}
Any ideas on how to do this job in a reasonable time?
Thanks
At first glance the approach you've taken looks fine - it should work OK, and there's nothing obvious that will cause e.g. lots of garbage collection.
The main thing I think is that you'll only be using one of those sixteen cores: there's nothing in place to share the load across the other fifteen.
I think the easiest way to do this is to split the large 20Gb file into sixteen chunks, then analyse each of the chunks together, then merge the chunks back together again. The extra time taken splitting and reassembling the file should be minimal compared to the ~16 times gain involved in scanning these sixteen chunks together.
In outline, one way to do this might be:
private List<string> SplitFileIntoChunks(string baseFile)
{
// Split the file into chunks, and return a list of the filenames.
}
private void AnalyseChunk(string filename)
{
// Analyses the file and performs replacements,
// perhaps writing to the same filename with a different
// file extension
}
private void CreateOutputFileFromChunks(string outputFile, List<string> splitFileNames)
{
// Combines the rewritten chunks created by AnalyseChunk back into
// one large file, outputFile.
}
public void AnalyseFile(string inputFile, string outputFile)
{
List<string> splitFileNames = SplitFileIntoChunks(inputFile);
var tasks = new List<Task>();
foreach (string chunkName in splitFileNames)
{
var task = Task.Factory.StartNew(() => AnalyseChunk(chunkName));
tasks.Add(task);
}
Task.WaitAll(tasks.ToArray());
CreateOutputFileFromChunks(outputFile, splitFileNames);
}
One tiny nit: move the calculation of the length of the stream out of the loop, you only need to get that once.
EDIT: also, include @Pavel Gatilov's idea to invert the logic of the inner loop and search for each word in the line in the 12 million list.
Several ideas:
I think it will be more efficient to split each line into words and look if each of several words appears in your word list. 10 lookups in a hashset is better than millions of searches of a substring. If you have composite keywords, make appropriate indexes: one that contains all single words that occur in the real keywords and another that contains all the real keywords.
Perhaps, loading strings into StringBuilder is better for replacing.
Update progress after, say 10000 lines processed, not after each one.
Process in background threads. It won't make it much faster, but the app will be responsive.
Parallelize the code, as Jeremy has suggested.
UPDATE
Here is some sample code that demonstrates the by-word index idea:
static void ReplaceWords()
{
string inputFileName = null;
string outputFileName = null;
// this dictionary maps each single word that can be found
// in any keyphrase to a list of the keyphrases that contain it.
IDictionary<string, IList<string>> singleWordMap = null;
using (var source = new StreamReader(inputFileName))
{
using (var target = new StreamWriter(outputFileName))
{
string line;
while ((line = source.ReadLine()) != null)
{
// first, we split each line into single words - the units of search
var singleWords = SplitIntoWords(line);
var result = new StringBuilder(line);
// for each single word in the line
foreach (var singleWord in singleWords)
{
// check if the word exists in any keyphrase we should replace
// and if so, get the list of the related original keyphrases
IList<string> interestingKeyPhrases;
if (!singleWordMap.TryGetValue(singleWord, out interestingKeyPhrases))
continue;
Debug.Assert(interestingKeyPhrases != null && interestingKeyPhrases.Count > 0);
// then process each of the keyphrases
foreach (var interestingKeyphrase in interestingKeyPhrases)
{
// and replace it in the processed line if it exists
result.Replace(interestingKeyphrase, GetTargetValue(interestingKeyphrase));
}
}
// now, save the processed line
target.WriteLine(result);
}
}
}
}
private static string GetTargetValue(string interestingKeyword)
{
throw new NotImplementedException();
}
static IEnumerable<string> SplitIntoWords(string keyphrase)
{
throw new NotImplementedException();
}
The code shows the basic ideas:
We split both keyphrases and processed lines into equivalent units which may be efficiently compared: the words.
We store a dictionary that for any word quickly gives us references to all keyphrases that contain the word.
Then we apply your original logic. However, we do not do it for all 12 million keyphrases, but rather for a very small subset of keyphrases that have at least a single-word intersection with the processed line.
I'll leave the rest of the implementation to you.
The code however has several issues:
The SplitIntoWords must actually normalize the words to some canonical form. It depends on the required logic. In the simplest case you'll probably be fine with whitespace-character splitting and lowercasing. But it may happen that you'll need a morphological matching - that would be harder (it's very close to full-text search tasks).
For the sake of speed, it's likely to be better if the GetTargetValue method is called once for each keyphrase before processing the input.
If a lot of your keyphrases have coinciding words, you'll still have a significant amount of extra work. In that case you'll need to keep the positions of keywords in the keyphrases in order to use word distance calculation to exclude irrelevant keyphrases while processing an input line.
Also, I'm not sure if StringBuilder is actually faster in this particular case. You should experiment with both StringBuilder and string to find out the truth.
It's a sample after all. The design is not very good. I'd consider extracting some classes with consistent interfaces (e.g. KeywordsIndex).
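For the simplest case described above (whitespace splitting plus lowercasing), SplitIntoWords and the construction of the single-word map might look like the sketch below. KeyphraseIndex and Build are hypothetical names, and the separator set is an assumption. Real text would probably also need punctuation handling.

```csharp
using System;
using System.Collections.Generic;

static class KeyphraseIndex
{
    static readonly char[] Separators = { ' ', '\t' };

    // Simplest normalization: lowercase, then split on whitespace.
    public static IEnumerable<string> SplitIntoWords(string text)
    {
        return text.ToLowerInvariant()
                   .Split(Separators, StringSplitOptions.RemoveEmptyEntries);
    }

    // Maps each single word to the list of keyphrases that contain it.
    public static IDictionary<string, IList<string>> Build(IEnumerable<string> keyphrases)
    {
        var map = new Dictionary<string, IList<string>>();
        foreach (var phrase in keyphrases)
        {
            foreach (var word in SplitIntoWords(phrase))
            {
                if (!map.TryGetValue(word, out var phrases))
                {
                    phrases = new List<string>();
                    map[word] = phrases;
                }
                // a phrase that repeats a word is still listed only once
                if (!phrases.Contains(phrase))
                    phrases.Add(phrase);
            }
        }
        return map;
    }
}
```

The lowercasing only affects the lookup key. The replacement step in the sample still matches the original keyphrase text case-sensitively, so whether that is acceptable depends on the required logic.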
Is it possible to load a file with 3 or 4 million lines in less than 1 second (1.000000)? One line contains one word. Words range in length from 1 - 17 (does that matter?).
My code is now:
List<string> LoadDictionary(string filename)
{
List<string> wordsDictionary = new List<string>();
Encoding enc = Encoding.GetEncoding(1250);//I need ę ą ć ł etc.
using (StreamReader r = new StreamReader(filename, enc))
{
string line = "";
while ((line = r.ReadLine()) != null)
{
if (line.Length > 2)
{
wordsDictionary.Add(line);
}
}
}
return wordsDictionary;
}
Results of timed execution:
How can I force the method to make it execute in half the time?
If you know that your list will be large, you should set a good starting capacity.
List<string> wordsDictionary = new List<string>( 100000 );
If you don't do this, the list will need to keep increasing its capacity, which takes a bit of time. It likely won't cut the time in half, but it's a start.
How does File.ReadAllLines() and some LINQ perform?
public List<string> LoadDictionary(string filename)
{
List<string> wordsDictionary = new List<string>();
Encoding enc = Encoding.GetEncoding(1250);
string[] lines = File.ReadAllLines(filename,enc);
wordsDictionary.AddRange(lines.Where(x => x.Length > 2));
return wordsDictionary;
}
Your biggest performance hit at this point is probably just from pulling data off the hard drive and into memory. It's unlikely that you can do anything to get it to go much faster, short of getting better hardware.
Profile. Profile. Profile.
We can all guess at where the time is spent and then propose other methods that may be faster. Some of us may even have a good intuition or get lucky and stumble on the right answer. But it's going to be much more productive to measure, iterate, and measure again.
Raymond Chen did an interesting series on loading a Chinese/English dictionary and getting the load time fast. It's not exactly the same (his does character conversion and some simple parsing, and the dictionary was a bit smaller) and it's in a different language. But I recommend the series anyway, because it shows the right way to optimize something like this: profile, profile, profile.
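Short of a real profiler, a first step toward "measure, iterate, measure again" is simply timing each candidate with Stopwatch. The helper below is a minimal sketch with made-up names (Timing, Measure). It gives coarse wall-clock numbers only and does not replace a profiler.

```csharp
using System;
using System.Diagnostics;

static class Timing
{
    // Runs the action once as a warm-up, then 'runs' more times,
    // and returns the fastest observed wall-clock time.
    public static TimeSpan Measure(Action action, int runs)
    {
        action();   // warm-up: JIT compilation and a cold file cache are not counted
        var best = TimeSpan.MaxValue;
        for (int i = 0; i < runs; i++)
        {
            var watch = Stopwatch.StartNew();
            action();
            watch.Stop();
            if (watch.Elapsed < best)
                best = watch.Elapsed;
        }
        return best;
    }
}
```

For example, Timing.Measure(() => LoadDictionary(filename), 5) could be run before and after each change. Because of the warm-up run, the file will usually be served from the operating system's cache, so this measures parsing and allocation more than raw disk speed.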