Best solution to load a big file into a Dictionary - C#

I have a text file with 457379 lines and this structure:
Key1\t\tValue1
Key2\t\tValue2
I'm using this code to load it into a Dictionary<string,string>:
private void StartScan()
{
using (StreamReader sr = new StreamReader("fh.txt"))
{
while (!sr.EndOfStream)
{
scaned++;
label4.Text = scaned.ToString();
var read = sr.ReadLine().Split(new string[] { "\t\t" }, StringSplitOptions.None);
fh.Add(read[0], read[1]);
}
}
}
but it takes more than 6 minutes to load the data.
Is there any better solution for loading the data?

The problem is that you're updating a UI element (label4) every time you read a line.
This can be very expensive, so I suggest either removing the line:
label4.Text = scaned.ToString();
or updating it less frequently, e.g. once every 100 lines read.

Try:
private void StartScan()
{
int lastUpdate = 0;
...
if(lastUpdate + 100 < scaned)
{
label4.Text = scaned.ToString();
lastUpdate = scaned;
}
...
It might improve quite a bit... I'd guess the label update is one of the most expensive operations in your code.
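Putting it together, a minimal complete sketch (assuming the question's scaned, fh, and label4 members):
private void StartScan()
{
    int lastUpdate = 0;
    using (StreamReader sr = new StreamReader("fh.txt"))
    {
        while (!sr.EndOfStream)
        {
            scaned++;
            var read = sr.ReadLine().Split(new string[] { "\t\t" }, StringSplitOptions.None);
            fh.Add(read[0], read[1]);
            // only touch the UI every 100 lines instead of on every read
            if (lastUpdate + 100 < scaned)
            {
                label4.Text = scaned.ToString();
                lastUpdate = scaned;
            }
        }
    }
    label4.Text = scaned.ToString(); // show the final count once done
}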

I find File.ReadLines to be the easiest/quickest way to process files line-by-line:
var dictionary = File.ReadLines("C:\\file.txt")
.Select(s => s.Split(new string[] { "\t\t" }, StringSplitOptions.None))
.ToDictionary(k => k[0], v => v[1]);
Having said that, there's not much functional difference between the code above and what you already have other than the fact it's slightly less verbose.

One thing you can do is use a BufferedStream.
using (FileStream fs = File.Open(file, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
using (BufferedStream bs = new BufferedStream(fs))
using (StreamReader sr = new StreamReader(bs))
{
string line;
while ((line = sr.ReadLine()) != null)
{
//Do the add
}
}
You will see an improvement. Also, do you need a Dictionary? If you don't need a key mapped to each value, use a HashSet<string>; adding to it is slightly faster. Just a bit, but it might make a difference in the long run.
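Put together with the question's parsing, the buffered read loop might look like this (a sketch; fh is the question's Dictionary<string, string>):
using (FileStream fs = File.Open("fh.txt", FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
using (BufferedStream bs = new BufferedStream(fs))
using (StreamReader sr = new StreamReader(bs))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        var parts = line.Split(new string[] { "\t\t" }, StringSplitOptions.None);
        fh.Add(parts[0], parts[1]); // or add to a HashSet<string> if only the keys matter
    }
}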

asp.net MVC Seed a database from a .TXT file with code first (over 10000 words) [duplicate]

I am using a list to limit the file size, since the target machine is limited in disk and RAM.
This is what I am doing now, but is there a more efficient way?
readonly List<string> LogList = new List<string>();
...
var logFile = File.ReadAllLines(LOG_PATH);
foreach (var s in logFile) LogList.Add(s);
var logFile = File.ReadAllLines(LOG_PATH);
var logList = new List<string>(logFile);
Since logFile is an array, you can pass it to the List<T> constructor. This eliminates unnecessary overhead when iterating over the array, or using other IO classes.
Actual constructor implementation:
public List(IEnumerable<T> collection)
{
...
ICollection<T> c = collection as ICollection<T>;
if( c != null) {
int count = c.Count;
if (count == 0)
{
_items = _emptyArray;
}
else {
_items = new T[count];
c.CopyTo(_items, 0);
_size = count;
}
}
...
}
A little update to Evan Mulawski's answer to make it shorter:
List<string> allLinesText = File.ReadAllLines(fileName).ToList();
Why not use a generator instead?
private IEnumerable<string> ReadLogLines(string logPath) {
using(StreamReader reader = File.OpenText(logPath)) {
string line = "";
while((line = reader.ReadLine()) != null) {
yield return line;
}
}
}
Then you can use it like you would use the list:
var logFile = ReadLogLines(LOG_PATH);
foreach(var s in logFile) {
// Do whatever you need
}
Of course, if you need to have a List<string>, then you will need to keep the entire file contents in memory. There's really no way around that.
You can simply read it this way:
List<string> lines = System.IO.File.ReadLines(completePath).ToList();
[Edit]
If you are doing this to trim the beginning of a log file, you can avoid loading the entire file by doing something like this:
// count the number of lines in the file
int count = 0;
using (var sr = new StreamReader("file.txt"))
{
while (sr.ReadLine() != null)
count++;
}
// we want to keep only the last LOG_MAX lines, so skip the first (count - LOG_MAX)
count = count - LOG_MAX;
using (var sr = new StreamReader("file.txt"))
using (var sw = new StreamWriter("output.txt"))
{
// skip several lines
while (count > 0 && sr.ReadLine() != null)
count--;
// continue copying
string line = "";
while ((line = sr.ReadLine()) != null)
sw.WriteLine(line);
}
First of all, since File.ReadAllLines loads the entire file into a string array (string[]), copying to a list is redundant.
Second, you must understand that a List is implemented using a dynamic array under the hood. This means that CLR will need to allocate and copy several arrays until it can accommodate the entire file. Since the file is already on disk, you might consider trading speed for memory and working on disk data directly, or processing it in smaller chunks.
If you need to load it entirely into memory, at least try to leave it in an array:
string[] lines = File.ReadAllLines("file.txt");
If it really needs to be a List, load lines one by one:
List<string> lines = new List<string>();
using (var sr = new StreamReader("file.txt"))
{
while (sr.Peek() >= 0)
lines.Add(sr.ReadLine());
}
Note: List<T> has a constructor which accepts a capacity parameter. If you know the number of lines in advance, you can avoid repeated reallocations by preallocating the internal array:
List<string> lines = new List<string>(NUMBER_OF_LINES);
Even better, avoid storing the entire file in memory and process it "on the fly":
using (var sr = new StreamReader("file.txt"))
{
string line;
while ((line = sr.ReadLine()) != null)
{
// process the file line by line
}
}
Don't store it if possible. Just read through it if you are memory constrained. You can use a StreamReader:
using (var reader = new StreamReader("file.txt"))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // process line here
    }
}
This can be wrapped in a method which yields strings per line read if you want to use LINQ.
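For instance, a quick sketch (File.ReadLines in .NET 4 is exactly such a wrapper; the "ERROR" filter is just a made-up example):
// requires: using System.Linq;
// lines are streamed one at a time - nothing beyond the current line stays in memory
var errorLines = File.ReadLines("file.txt").Where(line => line.Contains("ERROR"));
foreach (var line in errorLines)
{
    Console.WriteLine(line);
}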
//this is only good in .NET 4
//read your file:
List<string> ReadFile = File.ReadAllLines(@"C:\TEMP\FILE.TXT").ToList();
//manipulate data here
foreach(string line in ReadFile)
{
//do something here
}
//write back to your file:
File.WriteAllLines(@"C:\TEMP\FILE2.TXT", ReadFile);
List<string> lines = new List<string>();
using (var sr = new StreamReader("file.txt"))
{
while (sr.Peek() >= 0)
lines.Add(sr.ReadLine());
}
Instead of the above (from Groo's answer), I would suggest this:
string inLine = reader.ReadToEnd();
myList = inLine.Split(new string[] { "\r\n" }, StringSplitOptions.None).ToList();
I have also used Environment.NewLine.ToCharArray(), but found that it didn't work on a couple of files that did end in \r\n. Try either one and I hope it works well for you.
This answer misses the original point, which was that they were getting an OutOfMemory error. If you proceed with the above version, you are sure to hit it if your system does not have enough contiguous RAM available to load the file.
You simply must break the file into parts and process it piecewise, whether you then store the parts as a List or a string[].
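As a rough sketch of what breaking it into parts could look like (the batch size and the ProcessBatch handler are made-up placeholders):
const int BatchSize = 10000; // arbitrary; tune to your memory budget
var batch = new List<string>(BatchSize);
using (var sr = new StreamReader("file.txt"))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        batch.Add(line);
        if (batch.Count == BatchSize)
        {
            ProcessBatch(batch); // hypothetical per-chunk handler
            batch.Clear();
        }
    }
}
if (batch.Count > 0)
    ProcessBatch(batch); // flush the final partial batch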

Asynchronous Call using Delegate

I want separate async threads of the SplitFile method to run so that the task finishes faster, but the code below is not working. When I debug, it reaches the line RecCnt = File.ReadAllLines(SourceFile).Length - 1; and then returns. Please help.
public delegate void SplitFile_Delegate(FileInfo file);
static void Main(string[] args)
{
DirectoryInfo d = new DirectoryInfo(@"D:\test\Perf testing Splitter"); //Assuming Test is your Folder
FileInfo[] Files = d.GetFiles("*.txt"); //Getting Text files
foreach (FileInfo file in Files)
{
SplitFile_Delegate LocalDelegate = new SplitFile_Delegate(SplitFile);
IAsyncResult R = LocalDelegate.BeginInvoke(file, null, null); //invoking the method
LocalDelegate.EndInvoke(R);
}
}
private static void SplitFile(FileInfo file)
{
try
{
String fname;
//int FileLength;
int RecCnt;
int fileCount;
fname = file.Name;
String SourceFile = @"D:\test\Perf testing Splitter\" + file.Name;
RecCnt = File.ReadAllLines(SourceFile).Length - 1;
fileCount = RecCnt / 10000;
FileStream fs = new FileStream(SourceFile, FileMode.Open);
using (StreamReader sr = new StreamReader(fs))
{
while (!sr.EndOfStream)
{
String dataLine = sr.ReadLine();
for (int x = 0; x < (fileCount + 1); x++)
{
String Filename = @"D:\test\Perf testing Splitter\Destination Files\" + fname + "_" + x + "by" + (fileCount + 1) + ".txt"; //test0by4
using (StreamWriter Writer = File.AppendText(Filename))
{
for (int y = 0; y < 10000; y++)
{
Writer.WriteLine(dataLine);
dataLine = sr.ReadLine();
}
Writer.Close();
}
}
}
}
}
catch (Exception ex)
{
Console.WriteLine(ex.Message);
}
}
Your code doesn't really need any multi-threading. It doesn't even need asynchronous processing all that much - you're most likely saturating the I/O, and unless you've got multiple drives as data sources, you're not going to improve that by adding parallelism.
On the other hand, your code is reading each file twice. For no reason, wasting memory, time and even CPU. Instead, just do this:
FileStream fs = new FileStream(SourceFile, FileMode.Open);
using (StreamReader sr = new StreamReader(fs))
{
string line;
string fileName = null;
StreamWriter outputFile = null;
int lineCounter = 0;
int outputFileIndex = 0;
while ((line = sr.ReadLine()) != null)
{
if (fileName == null || lineCounter >= 10000)
{
lineCounter = 0;
outputFileIndex++;
fileName = @"D:\Output\" + fname + "_" + outputFileIndex + ".txt";
if (outputFile != null) outputFile.Dispose();
outputFile = File.AppendText(fileName);
}
outputFile.WriteLine(line);
lineCounter++;
}
if (outputFile != null) outputFile.Dispose(); // close the last output file
}
If you really need to have the filename in format XOutOfY, you can just rename them afterwards - it's a lot cheaper than reading the source file twice, line after line. Or, if you don't care about keeping the whole file in memory at once, just use the array you got from ReadAllLines and iterate over that, rather than doing the reading all over again.
To make this even easier, you can also use foreach (var line in File.ReadLines(fileName)).
If you really want to make this asynchronous, the way to handle that is by using asynchronous I/O, not just by spawning new threads. So you can use await with StreamReader.ReadLineAsync etc.
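For example, a minimal sketch with StreamReader.ReadLineAsync (requires .NET 4.5 or later and using System.Threading.Tasks; the chunk-writing logic from above is elided):
private static async Task SplitFileAsync(FileInfo file)
{
    using (var sr = new StreamReader(file.FullName))
    {
        string line;
        while ((line = await sr.ReadLineAsync()) != null)
        {
            // write the line to the current output chunk,
            // exactly as in the synchronous version above
        }
    }
}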
You are not required to call EndInvoke and really all EndInvoke does is wait on the return value for you. Since SplitFile returns void, my guess is there's an optimization that kicks in because you don't need to wait on anything and it simply ignores the wait. For more details: C# Asynchronous call without EndInvoke?
That being said, your usage of Begin/EndInvoke will likely not be faster than serial programming (and will likely be marginally slower), as your loop is still serialized: each BeginInvoke is immediately followed by a blocking EndInvoke. All that has changed is that you're using a delegate where it looks like one isn't necessary.
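If you do stay with delegates, the calls only overlap when you start them all before waiting on any of them. A sketch against the question's SplitFile_Delegate:
var pending = new List<Tuple<SplitFile_Delegate, IAsyncResult>>();
foreach (FileInfo file in Files)
{
    var d = new SplitFile_Delegate(SplitFile);
    // start the call, but do not block on it yet
    pending.Add(Tuple.Create(d, d.BeginInvoke(file, null, null)));
}
// wait for all invocations only after every one has been started
foreach (var p in pending)
    p.Item1.EndInvoke(p.Item2);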
It's possible that what you meant to use was Parallel.ForEach (MSDN: https://msdn.microsoft.com/en-us/library/dd992001(v=vs.110).aspx) which will potentially run iterations in parallel.
Edit: As someone else has mentioned, having multiple threads engage in file operations will likely not improve performance as your file ops are probably disk bound. The main benefit you would get from an async file read/write would probably be unblocking the main thread for a UI update. You will need to specify what you want with "performance" if you want a better answer.
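For completeness, the Parallel.ForEach version is a one-liner (subject to the same disk-bound caveat):
// requires: using System.Threading.Tasks;
// runs SplitFile for several files concurrently - the disk is still the bottleneck
Parallel.ForEach(Files, file => SplitFile(file));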

How To Make this Code work

static void Main()
{
FileStream fs = new FileStream("Scheduler.txt",FileMode.Open, FileAccess.Read);
StreamReader filereader = new StreamReader(fs);
string linevalue = "";
ArrayList items = new ArrayList();
while ((linevalue = filereader.ReadLine()) != null)
{
items.Add(linevalue);
}
filereader.Close();
items.Sort();
IEnumerator myEnumerator = items.GetEnumerator();
while (myEnumerator.MoveNext())
{
Console.WriteLine(myEnumerator.Current);
}
}
My program needs troubleshooting. I actually got this program from a brilliant guy on SO, but I am not able to trace it back. I don't know what's wrong. I want everything that is stored in my text file to be stored and displayed through the ArrayList. Any help would be appreciated.
It gets displayed, but incorrectly.
My Text File has got the following details
Names Date Time
Leon 13/10/2013 10:00AM
Jyothika 18/10/2013 12:18PM
Angelina 21/09/2000 01:45AM
Instead of displaying it in the same manner, it displays like this:
Angelina 21/09/2000 01:45AM
Names Dates Time
Leon 13/10/2013 10:00AM
Jyothika 18/10/2013 12:18PM
The problem is that you're reading linevalue again before adding it to items.
You can do it in an easier way:
var lines = File.ReadAllLines("Scheduler.txt").ToList();
lines.Sort();
foreach(var line in lines) Console.WriteLine( line );
Try this:
while ((linevalue = filereader.ReadLine()) != null)
{
//linevalue = filereader.ReadLine();
items.Add(linevalue);
}
An easier way:
// store your items
ArrayList items = new ArrayList(File.ReadAllLines("Scheduler.txt"));
// output them, eventually
foreach (var item in items)
{
    Console.WriteLine(item);
}
static void Main()
{
FileStream fs = new FileStream("Scheduler.txt",FileMode.Open, FileAccess.Read);
StreamReader filereader = new StreamReader(fs);
string linevalue = "";
ArrayList items = new ArrayList();
while ((linevalue = filereader.ReadLine()) != null)
{
//linevalue = filereader.ReadLine();
items.Add(linevalue);
}
filereader.Close();
items.Sort();
IEnumerator myEnumerator = items.GetEnumerator();
while (myEnumerator.MoveNext())
{
Console.WriteLine(myEnumerator.Current);
}
}
The problem is that the title line is just another line, so it gets sorted together with the data lines. There are various ways to solve this: read the title line separately into another variable and don't add it to the ArrayList, remove the first row from the ArrayList before sorting, or sort only the rows after the first. I chose the last one and wrote some variants that show how the code could be written.
First variant: don't sort the first row, and use the using keyword instead of manually closing the file.
ArrayList items = new ArrayList();
using (FileStream fs = new FileStream("Scheduler.txt", FileMode.Open, FileAccess.Read))
using (StreamReader filereader = new StreamReader(fs))
{
string linevalue = string.Empty;
while ((linevalue = filereader.ReadLine()) != null)
{
items.Add(linevalue);
}
}
Console.WriteLine(items[0]);
items.Sort(1, items.Count - 1, StringComparer.CurrentCulture);
IEnumerator myEnumerator = items.GetEnumerator();
myEnumerator.MoveNext(); // skip the first row
while (myEnumerator.MoveNext())
{
Console.WriteLine(myEnumerator.Current);
}
Second variant: please forget about ArrayList... it's a relic of the pre-generics world, and I hope one day we will be able to forget that world. And using GetEnumerator manually? There is foreach (or for, since the collection is an ArrayList/List<string>) for these things. The for is better here, because we must skip the first row.
var items = new List<string>();
using (FileStream fs = new FileStream("Scheduler.txt", FileMode.Open, FileAccess.Read))
using (StreamReader filereader = new StreamReader(fs))
{
string linevalue = string.Empty;
while ((linevalue = filereader.ReadLine()) != null)
{
items.Add(linevalue);
}
}
Console.WriteLine(items[0]);
items.Sort(1, items.Count - 1, StringComparer.CurrentCulture);
for (int i = 1; i < items.Count; i++)
{
Console.WriteLine(items[i]);
}
Then we could remove the FileStream, because the StreamReader has a good-enough constructor...
using (StreamReader filereader = new StreamReader("Scheduler.txt"))
Or we could directly use File.ReadAllLines, which returns a string[].
string[] items = File.ReadAllLines("Scheduler.txt");
Console.WriteLine(items[0]);
Array.Sort(items, 1, items.Length - 1, StringComparer.CurrentCulture);
for (int i = 1; i < items.Length; i++)
{
Console.WriteLine(items[i]);
}
or, instead of the for, we could use a little LINQ:
foreach (string item in items.Skip(1))
{
Console.WriteLine(item);
}
and use the Skip method to skip the first row (this LINQ works with all the variants above that don't use ArrayList)

How to split a large text file (32 GB) using C#

I tried to split a file of about 32 GB using the code below, but I got an out-of-memory exception.
Please suggest how to split the file using C#.
string[] splitFile = File.ReadAllLines(@"E:\\JKS\\ImportGenius\\0.txt");
int cycle = 1;
int splitSize = Convert.ToInt32(txtNoOfLines.Text);
var chunk = splitFile.Take(splitSize);
var rem = splitFile.Skip(splitSize);
while (chunk.Take(1).Count() > 0)
{
string filename = "file" + cycle.ToString() + ".txt";
using (StreamWriter sw = new StreamWriter(filename))
{
foreach (string line in chunk)
{
sw.WriteLine(line);
}
}
chunk = rem.Take(splitSize);
rem = rem.Skip(splitSize);
cycle++;
}
Well, to start with you need to use File.ReadLines (assuming you're using .NET 4) so that it doesn't try to read the whole thing into memory. Then I'd just keep calling a method to spit the "next" however many lines to a new file:
int splitSize = Convert.ToInt32(txtNoOfLines.Text);
using (var lineIterator = File.ReadLines(...).GetEnumerator())
{
bool stillGoing = true;
for (int chunk = 0; stillGoing; chunk++)
{
stillGoing = WriteChunk(lineIterator, splitSize, chunk);
}
}
...
private static bool WriteChunk(IEnumerator<string> lineIterator,
int splitSize, int chunk)
{
using (var writer = File.CreateText("file " + chunk + ".txt"))
{
for (int i = 0; i < splitSize; i++)
{
if (!lineIterator.MoveNext())
{
return false;
}
writer.WriteLine(lineIterator.Current);
}
}
return true;
}
Do not read all the lines into an array at once; use the StreamReader.ReadLine method instead, like:
using (StreamReader sr = new StreamReader(@"E:\\JKS\\ImportGenius\\0.txt"))
{
while (sr.Peek() >= 0)
{
var fileLine = sr.ReadLine();
//do something with line
}
}
File.ReadAllLines
That will read the whole file into memory.
To work with large files you need to read into memory only what you need right now, and throw it away as soon as you have finished with it.
A better option would be File.ReadLines, which returns a lazy enumerator; data is only read into memory as you get the next line from the enumerator. Provided you avoid multiple enumerations (e.g. don't use Count()), only parts of the file will be in memory at any time.
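To illustrate the multiple-enumeration pitfall (a made-up example; each pass re-reads the file from disk because the enumerator is lazy):
// requires: using System.Linq;
var lines = File.ReadLines(@"E:\\JKS\\ImportGenius\\0.txt");
int total = lines.Count();      // first pass: reads the whole file just to count lines
foreach (var line in lines)     // second pass: reads the whole file again
{
    // ...
}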
Instead of reading all the file at once using File.ReadAllLines, use File.ReadLines in a foreach loop to read the lines as needed.
foreach (var line in File.ReadLines(@"E:\\JKS\\ImportGenius\\0.txt"))
{
// Do something
}
Edit: On an unrelated note, you don't have to escape your backslashes when prefixing the string with an '@'. So either write "E:\\JKS\\ImportGenius\\0.txt" or @"E:\JKS\ImportGenius\0.txt", but @"E:\\JKS\\ImportGenius\\0.txt" is redundant.
The problem here is that you are reading the entire file's content into memory at once with File.ReadAllLines(). What you need to do is open a FileStream with File.OpenRead() and read/write smaller chunks.
Edit: Actually for your case ReadLine is obviously better. See other answers. :)
Use a StreamReader to read the file, write with a StreamWriter.

Fastest way to find strings in a file

I have a log file that is no more than 10 KB (the file size can go up to 2 MB max), and I want to find out whether at least one group of these strings occurs in the file. The strings will be on different lines, like:
ACTION:.......
INPUT:...........
RESULT:..........
I need to know whether at least one such group exists in the file. I have to do this about 100 times per test (each time the log is different, so I have to reload and read it), so I am looking for the fastest and best way to do this.
I looked in the forums for the fastest approaches, but I don't think my file is big enough for those solutions.
Thanks for looking.
I would read it line by line and check the conditions. Once you have seen a group you can quit. This way you don't need to read the whole file into memory. Like this:
public bool ContainsGroup(string file)
{
using (var reader = new StreamReader(file))
{
var hasAction = false;
var hasInput = false;
var hasResult = false;
while (!reader.EndOfStream)
{
var line = reader.ReadLine();
if (!hasAction)
{
if (line.StartsWith("ACTION:"))
hasAction = true;
}
else if (!hasInput)
{
if (line.StartsWith("INPUT:"))
hasInput = true;
}
else if (!hasResult)
{
if (line.StartsWith("RESULT:"))
hasResult = true;
}
if (hasAction && hasInput && hasResult)
return true;
}
return false;
}
}
This code checks whether there is a line starting with ACTION, then one with INPUT, and then one with RESULT. If the order is not important, you can omit the if () else if () nesting, as in the sketch below. If the lines do not start with those strings, replace StartsWith with Contains.
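If the order doesn't matter, the loop body of ContainsGroup reduces to three independent checks, e.g.:
while (!reader.EndOfStream)
{
    var line = reader.ReadLine();
    // no ordering between the three markers is enforced here
    if (line.StartsWith("ACTION:")) hasAction = true;
    else if (line.StartsWith("INPUT:")) hasInput = true;
    else if (line.StartsWith("RESULT:")) hasResult = true;
    if (hasAction && hasInput && hasResult)
        return true;
}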
Here's one possible way to do it:
string fileContents;
string[] logFiles = Directory.GetFiles(@"C:\Logs");
foreach (string file in logFiles)
{
    using (StreamReader sr = new StreamReader(file))
    {
        fileContents = sr.ReadToEnd();
        if (fileContents.Contains("ACTION:") || fileContents.Contains("INPUT:") || fileContents.Contains("RESULT:"))
        {
            // Do what you need to here
        }
    }
}
You may need to do some variation based on your exact implementation needs - for example, what if the word spans two lines, does the line need to start with the word, etc.
Added
Alternate line-by-line check:
string[] lines;
string[] logFiles = Directory.GetFiles(@"C:\Logs");
foreach (string file in logFiles)
{
    lines = File.ReadAllLines(file);
    foreach (string line in lines)
    {
        if (line.Contains("ACTION:") || line.Contains("INPUT:") || line.Contains("RESULT:"))
        {
            // Do what you need to here
        }
    }
}
Take a look at How to Read Text From a File. You might also want to take a look at the String.Contains() method.
Basically you will loop through all the files. For each file, read line by line and see if any of the lines contains one of your special "sections".
You don't have much of a choice with text files when it comes to efficiency. The easiest way would definitely be to loop through each line of data. When you grab a line in a string, split it on the spaces. Then match those words to your words until you find a match. Then do whatever you need.
I don't know how to do it in C#, but in VB it would be something like...
Dim yourString As String
Dim words As String()
Do While objReader.Peek() <> -1
    yourString = objReader.ReadLine()
    words = yourString.Split(" "c)
    For Each word In words
        If Myword = word Then
            ' do stuff
        End If
    Next
Loop
Hope that helps
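For reference, a rough C# equivalent of that VB sketch (file and Myword are the same placeholders as in the VB above):
using (var reader = new StreamReader(file))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // split the line on spaces and compare each word to the search term
        foreach (var word in line.Split(' '))
        {
            if (word == Myword)
            {
                // do stuff
            }
        }
    }
}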
This code sample searches for strings in a large text file. The words are contained in a HashSet. It writes the found lines to a temp file.
if (File.Exists(@"temp.txt")) File.Delete(@"temp.txt");
String line;
String oldLine = "";
using (var fs = File.OpenRead(largeFileName))
using (var sr = new StreamReader(fs, Encoding.UTF8, true))
{
HashSet<String> hash = new HashSet<String>();
hash.Add("house");
using (var sw = new StreamWriter(@"temp.txt"))
{
while ((line = sr.ReadLine()) != null)
{
foreach (String str in hash)
{
if (oldLine.Contains(str))
{
sw.WriteLine(oldLine);
// write the next line as well (optional)
sw.WriteLine(line + "\r\n");
}
}
oldLine = line;
}
}
}
