C# text file deduping based on split - c#

What I want to do is de-dupe a text file (against itself) based on a split. Once the de-dupe is complete, write the result out to a new file, keeping the first occurrence of each key. I guess the question is: how do you de-dupe a text file in C# based on a string split? A basic example:
File 1:
Apple|Turnip3234
Apple|Tunip22
Fox|dsa34
Turtle|3423
Hamster|d34
Fox|sdw2
Result:
Apple|Turnip3234
Fox|dsa34
Turtle|3423
Hamster|d34

string inputFile; // = ...
string outputFile; // = ...
HashSet<string> keys = new HashSet<string>();
using (StreamReader reader = new StreamReader(inputFile))
using (StreamWriter writer = new StreamWriter(outputFile))
{
string line = reader.ReadLine();
while (line != null)
{
string candidate = line.Split('|')[0];
if (keys.Add(candidate))
writer.WriteLine(line);
line = reader.ReadLine();
}
}

Use a HashSet<string> and store the left part of each line (everything preceding |) in it.
For each line, call hashset.Contains(leftpart) to test whether that line is a "dupe".

You can create Dictionary<string,string> where key is your first word and value is the second one. Then you can just go through all your lines, split them and check if first word occurs in Keys, and add this pair if it does not.

This will always use the first value encountered (and it's untested, but the concepts are correct).
Dictionary<String, String> dupeMap = new Dictionary<String, String>();
foreach (string line in File.ReadLines("foo.txt")) {
string key = line.Split('|')[0];
if (!dupeMap.ContainsKey(key)) {
dupeMap.Add(key, line);
}
}
Then you can write them all back by iterating over the Dictionary, though this is not stable because you can't be certain to get the lines back in order.
using (TextWriter tw = new StreamWriter("foo.txt")) {
foreach (string key in dupeMap.Keys) {
tw.WriteLine(dupeMap[key]);
}
}
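If the output order matters, one option is to pair a HashSet for the keys already seen with a List that keeps the lines in the order they were read. The following is only a minimal sketch and not part of the answer above; the names LineDeduper and DedupeByKey are hypothetical, and it assumes the key is everything before the first separator.

```csharp
using System;
using System.Collections.Generic;

public static class LineDeduper
{
    // Returns the lines whose key (the text before the first separator)
    // has not been seen before, in their original order.
    public static List<string> DedupeByKey(IEnumerable<string> lines, char separator = '|')
    {
        var seen = new HashSet<string>();
        var result = new List<string>();
        foreach (string line in lines)
        {
            string key = line.Split(separator)[0];
            // HashSet.Add returns false when the key is already present.
            if (seen.Add(key))
                result.Add(line);
        }
        return result;
    }
}
```

It could be called as File.WriteAllLines("out.txt", LineDeduper.DedupeByKey(File.ReadLines("foo.txt"))), where "out.txt" is a placeholder output path.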

An easy solution is to only add values you haven't met yet.
var allLines = File.ReadAllLines(@"c:\test.txt");
Dictionary<string, string> allUniques = new Dictionary<string, string>();
foreach(string s in allLines)
{
var chunks = s.Split('|');
if (!allUniques.ContainsKey(chunks[0]))
{
allUniques.Add(chunks[0], s);
}
}
File.WriteAllLines(@"c:\test2.txt", allUniques.Values.ToArray());

Related

C#: add data to dictionary from datafile [duplicate]

So I have a generic number check that I am trying to implement:
public static bool isNumberValid(string Number)
{
}
And I want to read the contents of a textfile (only contains numbers) and check each line for the number and verify it is the valid number using isNumberValid. Then I want to output the results to a new textfile, I got this far:
private void button2_Click(object sender, EventArgs e)
{
int size = -1;
DialogResult result = openFileDialog1.ShowDialog(); // Show the dialog.
if (result == DialogResult.OK) // Test result.
{
string file = openFileDialog1.FileName;
try
{
string text = File.ReadAllText(file);
size = text.Length;
using (StringReader reader = new StringReader(text))
{
foreach (int number in text)
{
// check against isNumberValid
// write the results to a new textfile
}
}
}
catch (IOException)
{
}
}
}
Kind of stuck from here if anyone can help?
The textfile contains several numbers in a list:
4564
4565
4455
etc.
The new textfile I want to write would just be the numbers with true or false appended to the end:
4564 true
You don't need to read the entire file into memory all at once. You can write:
using (var writer = new StreamWriter(outputPath))
{
foreach (var line in File.ReadLines(filename))
{
foreach (var num in line.Split(','))
{
writer.Write(num + " ");
writer.WriteLine(IsNumberValid(num));
}
}
}
The primary advantage here is a much smaller memory footprint, as it only loads a small part of the file at a time.
You could try this to keep with the pattern you were initially following...
private void button1_Click(object sender, EventArgs e)
{
DialogResult result = openFileDialog1.ShowDialog(); // Show the dialog.
if (result == DialogResult.OK) // Test result.
{
string file = openFileDialog1.FileName;
try
{
using (var reader = new StreamReader(file))
{
using (var writer = new StreamWriter("results.txt"))
{
string currentNumber;
while ((currentNumber = reader.ReadLine()) != null)
{
writer.WriteLine(String.Format("{0} {1}", currentNumber, IsNumberValid(currentNumber) ? "true" : "false"));
}
}
}
}
catch (IOException)
{
}
}
}
public bool IsNumberValid(string number)
{
//Whatever code you use to check your number
}
You need to replace your loop to look like this:
string[] lines = File.ReadAllLines(file);
foreach (var s in lines)
{
int number = int.Parse(s);
...
}
This reads each line of the file, assuming that there is only one number per line
and that lines are separated by CR, LF, or CRLF. It then parses each number as an integer, assuming that the value is no greater than 2,147,483,647 and no less than -2,147,483,648, and that it is formatted according to your locale settings, without group separators.
If any line is empty or contains a non-integer, the code will throw an exception.
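If throwing on a bad line is not wanted, int.TryParse can be used instead of int.Parse. The sketch below is an illustration only: the names NumberLines and ParseValid are hypothetical, and it assumes invalid lines should simply be skipped and that the numbers use invariant-culture formatting rather than the current locale.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class NumberLines
{
    // Parses each line as an Int32. Lines that are empty, non-numeric,
    // or out of the Int32 range are skipped instead of throwing.
    public static List<int> ParseValid(IEnumerable<string> lines)
    {
        var numbers = new List<int>();
        foreach (string s in lines)
        {
            int number;
            // NumberStyles.Integer allows surrounding whitespace and a leading sign,
            // but no group separators.
            if (int.TryParse(s, NumberStyles.Integer, CultureInfo.InvariantCulture, out number))
                numbers.Add(number);
        }
        return numbers;
    }
}
```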
You could try something like this:
FileStream fsIn = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read);
using (StreamReader sr = new StreamReader(fsIn))
{
string line = sr.ReadLine();
while (!String.IsNullOrEmpty(line))
{
//call isNumberValid on each line, store results to list
line = sr.ReadLine();
}
}
Then print the list using FileStream.
As other people have mentioned, your isNumberValid method could make use of the Int32.TryParse method, but since you said your text file only contains numbers this may not be necessary. If you're just trying to match the number exactly, you can use number == line.
First, load all lines of the input file into a string array,
then open the output file and loop over the array of strings.
Split each line at the space separator and pass every part to your static method.
The static method uses Int32.TryParse to determine whether you have a valid integer, without throwing an exception if the input text is not a valid Int32 number.
Based on the result of the method, write the desired text to the output file.
// Read all lines in memory (Could be optimized, but for this example let's go with a small file)
string[] lines = File.ReadAllLines(file);
// Open the output file
using (StreamWriter writer = new StreamWriter(outputFile))
{
// Loop on every line loaded from the input file
// Example "1234 ABCD 456 ZZZZ 98989"
foreach (string line in lines)
{
// Split the current line in the wannabe numbers
string[] numParts = line.Split(' ');
// Loop on every part and pass to the validation
foreach(string number in numParts)
{
// Write the result to the output file
if(isNumberValid(number))
writer.WriteLine(number + " True");
else
writer.WriteLine(number + " False");
}
}
}
// Receives a string and test if it is a Int32 number
public static bool isNumberValid(string Number)
{
int result;
return Int32.TryParse(Number, out result);
}
Of course, this works only if your definition of 'number' matches the values allowed for an Int32.

Merge 2 lines in .CSV file using StreamReader

I am currently trying to merge some lines in a .csv file. The file follows a specific format in which fields are separated by "," and the last element contains \n (ASCII newline) characters. This means the last element gets split onto new lines, and for those lines I get back an array with only one element. I am looking to merge this element with the line above it.
So my line would be:
192.168.60.24, ACD_test1,86.33352, 07/12/2014 13:33:13, False, Annotated, True,"Attribute1
Attribute 2
Attribute 3"
192.168.60.24, ACD_test1,87.33352, 07/12/2014 13:33:13, False, Annotated, True
Is it possible to merge/join the new line attributes with the line above?
My code is shown below:
var reader = new StreamReader(File.OpenRead(@path));
string line1 = reader.ReadLine();
if (line1.Contains("Server, Tagname, Value, Timestamp, Questionable, Annotated, Substituted"))
{
while (!reader.EndOfStream)
{
List<string> listPointValue = new List<string>();
var line = reader.ReadLine();
var values = line.Split(',');
if (values.Count() < 2)
{
//*****Trying to Add Attribute to listPointValue.ElememtAt(0) here******
}
else
{
foreach (string value in values)
{
listPointValue.Add(value);
}
allValues.Add(listPointValue);
}
}
// allValues.RemoveAt(0);
return allValues;
}
I think you want to read the next line before you do the allValues.Add. That way you can decide whether to add the previous line to allValues (starting a new line). This gives you an idea of what I mean:
var reader = new StreamReader(File.OpenRead(@path));
string line1 = reader.ReadLine();
if (line1.Contains("Server, Tagname, Value, Timestamp, Questionable, Annotated, Substituted"))
{
List<string> listPointValue = new List<string>();
// Add first line to listPointValue
var line = reader.ReadLine();
var values = line.Split(',');
foreach (string value in values)
{
listPointValue.Add(value);
}
while (!reader.EndOfStream)
{
// Read next line
line = reader.ReadLine();
values = line.Split(',');
// If next line is a full line, add the previous line and create a new line
if (values.Count() > 1)
{
allValues.Add(listPointValue);
listPointValue = new List<string>();
}
// Add values to line
foreach (string value in values)
{
listPointValue.Add(value);
}
}
allValues.Add(listPointValue);
}
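To fold a continuation line into the last field of the previous record, rather than adding it as an extra element, something like the following could work. This is a sketch under stated assumptions, not the answerer's code: the names CsvLineMerger and MergeRows are hypothetical, a line without any comma is treated as a continuation, the header line is assumed to have been skipped by the caller, and commas inside quoted fields are not handled.

```csharp
using System;
using System.Collections.Generic;

public static class CsvLineMerger
{
    // Treats any line without a comma as a continuation of the previous
    // record and appends it to that record's last field.
    public static List<List<string>> MergeRows(IEnumerable<string> lines)
    {
        var rows = new List<List<string>>();
        foreach (string line in lines)
        {
            string[] values = line.Split(',');
            if (values.Length < 2 && rows.Count > 0)
            {
                // Continuation line: rejoin it with the last field of the previous row.
                List<string> last = rows[rows.Count - 1];
                last[last.Count - 1] += "\n" + line;
            }
            else
            {
                rows.Add(new List<string>(values));
            }
        }
        return rows;
    }
}
```

For files where quoted fields can themselves contain commas, a dedicated CSV parser is likely the safer choice.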

Fastest way to find strings in a file

I have a log file that is usually not more than 10 KB (the file size can go up to 2 MB at most), and I want to find out whether at least one group of these strings occurs in the file. These strings will be on different lines, like:
ACTION:.......
INPUT:...........
RESULT:..........
I need to know whether at least one group of the above exists in the file. I have to do this about 100 times for a test (each time the log is different, so I have to reload and read the log), so I am looking for the fastest and best way to do this.
I looked in the forums for the fastest way, but I don't think my file is big enough to need those solutions.
Thanks for looking.
I would read it line by line and check the conditions. Once you have seen a group you can quit. This way you don't need to read the whole file into memory. Like this:
public bool ContainsGroup(string file)
{
using (var reader = new StreamReader(file))
{
var hasAction = false;
var hasInput = false;
var hasResult = false;
while (!reader.EndOfStream)
{
var line = reader.ReadLine();
if (!hasAction)
{
if (line.StartsWith("ACTION:"))
hasAction = true;
}
else if (!hasInput)
{
if (line.StartsWith("INPUT:"))
hasInput = true;
}
else if (!hasResult)
{
if (line.StartsWith("RESULT:"))
hasResult = true;
}
if (hasAction && hasInput && hasResult)
return true;
}
return false;
}
}
This code checks if there is a line starting with ACTION then one with INPUT and then one with RESULT. If the order of those is not important then you can omit the if () else if () checks. In case the line does not start with the strings replace StartsWith with Contains.
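For the case where the order is not important, an order-independent variant might look like the sketch below. The names LogGroupChecker and ContainsGroupAnyOrder are hypothetical, and taking the lines as an IEnumerable<string> (so that File.ReadLines can be passed in) is an assumption made for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class LogGroupChecker
{
    // Returns true as soon as lines starting with ACTION:, INPUT: and
    // RESULT: have all been seen, in any order.
    public static bool ContainsGroupAnyOrder(IEnumerable<string> lines)
    {
        bool hasAction = false, hasInput = false, hasResult = false;
        foreach (string line in lines)
        {
            if (line.StartsWith("ACTION:", StringComparison.Ordinal)) hasAction = true;
            else if (line.StartsWith("INPUT:", StringComparison.Ordinal)) hasInput = true;
            else if (line.StartsWith("RESULT:", StringComparison.Ordinal)) hasResult = true;

            // Stop early once every marker has been found.
            if (hasAction && hasInput && hasResult)
                return true;
        }
        return false;
    }
}
```

Because File.ReadLines is lazy, calling LogGroupChecker.ContainsGroupAnyOrder(File.ReadLines(file)) stops reading once the group is complete.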
Here's one possible way to do it:
string fileContents;
string[] logFiles = Directory.GetFiles(@"C:\Logs");
foreach (string file in logFiles)
{
using (StreamReader sr = new StreamReader(file))
{
fileContents = sr.ReadToEnd();
if (fileContents.Contains("ACTION:") || fileContents.Contains("INPUT:") || fileContents.Contains("RESULT:"))
{
// Do what you need to here
}
}
}
You may need to do some variation based on your exact implementation needs - for example, what if the word spans two lines, does the line need to start with the word, etc.
Added
Alternate line-by-line check:
string[] logFiles = Directory.GetFiles(@"C:\Logs");
foreach (string file in logFiles)
{
// File.ReadLines streams the file one line at a time
foreach (string line in File.ReadLines(file))
{
if (line.Contains("ACTION:") || line.Contains("INPUT:") || line.Contains("RESULT:"))
{
// Do what you need to here
}
}
}
Take a look at How to Read Text From a File. You might also want to take a look at the String.Contains() method.
Basically you will loop through all the files. For each file read line-by-line and see if any of the lines contains 1 of your special "Sections".
You don't have much of a choice with text files when it comes to efficiency. The easiest way would definitely be to loop through each line of data. When you grab a line in a string, split it on the spaces. Then match those words to your words until you find a match. Then do whatever you need.
I don't know how to do it in c# but in vb it would be something like...
Dim yourString as string
Dim words as string()
Do While objReader.Peek() <> -1
yourString = objReader.ReadLine()
words = yourString.split(" ")
For Each word in words()
If Myword = word Then
do stuff
End If
Next
Loop
Hope that helps
This code sample searches for strings in a large text file. The words are contained in a HashSet. It writes the found lines in a temp file.
if (File.Exists(@"temp.txt")) File.Delete(@"temp.txt");
String line;
String oldLine = "";
using (var fs = File.OpenRead(largeFileName))
using (var sr = new StreamReader(fs, Encoding.UTF8, true))
{
HashSet<String> hash = new HashSet<String>();
hash.Add("house");
using (var sw = new StreamWriter(@"temp.txt"))
{
while ((line = sr.ReadLine()) != null)
{
foreach (String str in hash)
{
if (oldLine.Contains(str))
{
sw.WriteLine(oldLine);
// write the next line as well (optional)
sw.WriteLine(line + "\r\n");
}
}
oldLine = line;
}
}
}

Parsing individual lines in a robots.txt file with C#

I'm working on an application to parse robots.txt. I wrote a method that pulls the file from a web server and puts the output into a textbox. I would like the output to display one line of text for every line in the file, just as it would appear if you were viewing robots.txt normally; however, the output in my textbox is all of the lines of text without carriage returns or line breaks. So I thought I'd be crafty: make a string[] for all the lines, add a foreach loop, and all would be well. Alas, that did not work, so then I tried System.Environment.NewLine, which still did not work. Here's the code as it stands now. How can I change this so I get all the individual lines of robots.txt as opposed to a bunch of text cobbled together?
public void getRobots()
{
WebClient wClient = new WebClient();
string url = String.Format("http://{0}/robots.txt", urlBox.Text);
try
{
Stream data = wClient.OpenRead(url);
StreamReader read = new StreamReader(data);
string[] lines = new string[] { read.ReadToEnd() };
foreach (string line in lines)
{
textBox1.AppendText(line + System.Environment.NewLine);
}
}
catch (WebException ex)
{
MessageBox.Show(ex.Message, null, MessageBoxButtons.OK);
}
}
You are reading the entire file into the first element of the lines array:
string[] lines = new string[] {read.ReadToEnd()};
So all your loop is doing is adding the whole contents of the file into the TextBox, followed by a newline character. Replace that line with these:
string content = read.ReadToEnd();
string[] lines = content.Split(new string[] { "\r\n", "\n" }, StringSplitOptions.None);
And see if that works.
Edit: an alternative and perhaps more efficient way, as per Fish's comment below about reading line by line—replace the code within the try block with this:
Stream data = wClient.OpenRead(url);
StreamReader read = new StreamReader(data);
while (read.Peek() >= 0)
{
textBox1.AppendText(read.ReadLine() + System.Environment.NewLine);
}
You need to make the textBox1 multiline. Then I think you can simply go
textBox1.Lines = lines;
but let me check that
Try
public void getRobots()
{
WebClient wClient = new WebClient();
string robotText;
string[] robotLines;
System.Text.StringBuilder robotStringBuilder;
robotText = wClient.DownloadString(String.Format("http://{0}/robots.txt", urlBox.Text));
robotLines = robotText.Split(new string[] { "\r\n", "\n" }, StringSplitOptions.None);
robotStringBuilder = new StringBuilder();
foreach (string line in robotLines)
{
robotStringBuilder.Append(line);
robotStringBuilder.Append(Environment.NewLine);
}
textBox1.Text = robotStringBuilder.ToString();
}
Try using .ReadLine() in a while loop instead of .ReadToEnd() - I think you're just getting the entire file as one line in your lines array. Debug and check the count of lines[] to verify this.
Edit: Here's a bit of sample code. Haven't tested it, but I think it should work OK;
Stream data = wClient.OpenRead(url);
StreamReader read = new StreamReader(data);
List<string> lines = new List<string>();
string nextLine = read.ReadLine();
while (nextLine != null)
{
lines.Add(nextLine);
nextLine = read.ReadLine();
}
textBox1.Lines = lines.ToArray();

Delete specific line from a text file?

I need to delete an exact line from a text file, but I cannot for the life of me work out how to go about doing this.
Any suggestions or examples would be greatly appreciated?
Related Questions
Efficient way to delete a line from a text file (C#)
If the line you want to delete is based on the content of the line:
string line = null;
string line_to_delete = "the line i want to delete";
using (StreamReader reader = new StreamReader("C:\\input")) {
using (StreamWriter writer = new StreamWriter("C:\\output")) {
while ((line = reader.ReadLine()) != null) {
if (String.Compare(line, line_to_delete) == 0)
continue;
writer.WriteLine(line);
}
}
}
Or if it is based on line number:
string line = null;
int line_number = 0;
int line_to_delete = 12;
using (StreamReader reader = new StreamReader("C:\\input")) {
using (StreamWriter writer = new StreamWriter("C:\\output")) {
while ((line = reader.ReadLine()) != null) {
line_number++;
if (line_number == line_to_delete)
continue;
writer.WriteLine(line);
}
}
}
The best way to do this is to open the file in text mode, read each line with ReadLine(), and then write it to a new file with WriteLine(), skipping the one line you want to delete.
There is no generic delete-a-line-from-file function, as far as I know.
One way to do it if the file is not very big is to load all the lines into an array:
string[] lines = File.ReadAllLines("filename.txt");
string[] newLines = RemoveUnnecessaryLine(lines);
File.WriteAllLines("filename.txt", newLines);
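RemoveUnnecessaryLine is not defined in the answer above. One possible shape for it, assuming the line to drop is identified by its exact content, is sketched below; the class name LineRemover and the extra lineToRemove parameter are hypothetical additions for illustration.

```csharp
using System;
using System.Linq;

public static class LineRemover
{
    // Hypothetical helper: returns a new array without any line that
    // exactly matches lineToRemove (ordinal, case-sensitive comparison).
    public static string[] RemoveUnnecessaryLine(string[] lines, string lineToRemove)
    {
        return lines
            .Where(l => !string.Equals(l, lineToRemove, StringComparison.Ordinal))
            .ToArray();
    }
}
```

Note that this removes every matching line, not only the first one.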
Hope this simple and short code will help.
List<string> linesList = File.ReadAllLines("myFile.txt").ToList();
linesList.RemoveAt(0);
File.WriteAllLines("myFile.txt", linesList.ToArray());
OR use this
public void DeleteLinesFromFile(string strLineToDelete)
{
string strFilePath = "Provide the path of the text file";
string strSearchText = strLineToDelete;
string strOldText;
string n = "";
StreamReader sr = File.OpenText(strFilePath);
while ((strOldText = sr.ReadLine()) != null)
{
if (!strOldText.Contains(strSearchText))
{
n += strOldText + Environment.NewLine;
}
}
sr.Close();
File.WriteAllText(strFilePath, n);
}
You can actually use C# generics for this to make it real easy:
var file = new List<string>(System.IO.File.ReadAllLines("C:\\path"));
file.RemoveAt(12);
File.WriteAllLines("C:\\path", file.ToArray());
This can be done in three steps:
// 1. Read the content of the file
string[] readText = File.ReadAllLines(path);
// 2. Empty the file
File.WriteAllText(path, String.Empty);
// 3. Fill up again, but without the deleted line
using (StreamWriter writer = new StreamWriter(path))
{
foreach (string s in readText)
{
if (!s.Equals(lineToBeRemoved))
{
writer.WriteLine(s);
}
}
}
Read and remember each line
Identify the one you want to get rid of
Forget that one
Write the rest back over the top of the file
I cared about the file's original end-of-line characters ("\n" or "\r\n") and wanted to maintain them in the output file (not overwrite them with whatever the current environment's characters are, as the other answers appear to do). So I wrote my own method to read a line without removing the end-of-line characters, then used it in my DeleteLines method (I wanted the option to delete multiple lines, hence the use of a collection of line numbers to delete).
DeleteLines was implemented as a FileInfo extension and ReadLineKeepNewLineChars a StreamReader extension (but obviously you don't have to keep it that way).
public static class FileInfoExtensions
{
public static FileInfo DeleteLines(this FileInfo source, ICollection<int> lineNumbers, string targetFilePath)
{
var lineCount = 1;
using (var streamReader = new StreamReader(source.FullName))
{
using (var streamWriter = new StreamWriter(targetFilePath))
{
string line;
while ((line = streamReader.ReadLineKeepNewLineChars()) != null)
{
if (!lineNumbers.Contains(lineCount))
{
streamWriter.Write(line);
}
lineCount++;
}
}
}
return new FileInfo(targetFilePath);
}
}
public static class StreamReaderExtensions
{
private const char EndOfFile = '\uffff';
/// <summary>
/// Reads a line, similar to ReadLine method, but keeps any
/// new line characters (e.g. "\r\n" or "\n").
/// </summary>
public static string ReadLineKeepNewLineChars(this StreamReader source)
{
if (source == null)
throw new ArgumentNullException(nameof(source));
char ch = (char)source.Read();
if (ch == EndOfFile)
return null;
var sb = new StringBuilder();
while (ch != EndOfFile)
{
sb.Append(ch);
if (ch == '\n')
break;
ch = (char)source.Read();
}
return sb.ToString();
}
}
Are you on a Unix operating system?
You can do this with the "sed" stream editor. Read the man page for "sed"
What?
Open the file, seek to the position of the line, then erase the line through the stream by writing null over it.
Got it? It is simple and fast, it streams, and it needs no array that eats memory.
This works in VB. For example, search for the line culture=id, where culture is the name and id is the value, and change it to culture=en.
Fileopen(1, "text.ini")
dim line as string
dim currentpos as long
while true
line = lineinput(1)
dim namevalue() as string = split(line, "=")
if namevalue(0) = "line name value that i want to edit" then
currentpos = seek(1)
fileclose()
dim fs as filestream("test.ini", filemode.open)
dim sw as streamwriter(fs)
fs.seek(currentpos, seekorigin.begin)
sw.write(null)
sw.write(namevalue + "=" + newvalue)
sw.close()
fs.close()
exit while
end if
msgbox("org ternate jua bisa, no line found")
end while
that's all..use #d
