I have the PAF raw data in several files (list of all addresses in the UK).
My goal is to create a PostCode lookup in our software.
I have created a new database but there is no need to understand it for the moment.
Let's take one file. Its extension is ".c01" and it can be opened with a text editor. The data in this file is in the following format:
0000000123A
According to the developer guide, the KEY is 8 characters and the NAME is 50 characters.
This file contains 2,449,652 rows (it's a small one!).
I created a parsing class for this:
private class SerializedBuilding
{
public int Key
{
get; set;
}
public string Name
{
get; set;
}
public bool isValid = false;
public Building ToBuilding()
{
Building b = new Building();
b.BuildingKey = Key;
b.BuildingName = Name;
return b;
}
private readonly int KEYLENGTH = 8;
private readonly int NAMELENGTH = 50;
public SerializedBuilding(String line)
{
string KeyStr = null;
string Name = null;
try
{
KeyStr = line.Substring(0, KEYLENGTH);
}
catch (Exception e)
{
Console.WriteLine("error parsing key line " + line);
return;
}
try
{
Name = line.Substring(KEYLENGTH, NAMELENGTH);
}
catch (Exception e)
{
Console.WriteLine("error parsing name line " + line);
return;
}
int value;
if (!Int32.TryParse(KeyStr, out value))
return;
if (value == 0 || value == 99999999)
return;
this.Name = Name;
this.Key = value;
this.isValid = true;
}
}
I use this method to read the file
public void start()
{
AddressDataContext d = new AddressDataContext();
Count = 0;
string line;
// Read the file and display it line by line.
System.IO.StreamReader file =
new System.IO.StreamReader(filename);
SerializedBuilding sb = null;
Console.WriteLine("Number of line detected : " + File.ReadLines(filename).Count());
while ((line = file.ReadLine()) != null)
{
sb = new SerializedBuilding(line);
if (sb.isValid)
{
d.Buildings.InsertOnSubmit(sb.ToBuilding());
if (Count % 100 == 0)
d.SubmitChanges();
}
Count++;
}
d.SubmitChanges();
file.Close();
Console.WriteLine("building added");
}
I use LINQ to SQL classes to insert data into my database. The connection string is the default one.
This seems to work: I have added 67,200 lines. It just crashed, but my questions are not about that.
My estimates:
33,647,015 rows to parse
Time needed for execution: 13 hours
It's a one-time job (it just needs to be run on my SQL database and on the client's server later), so I don't really care about performance, but I think it would be interesting to know how it could be improved.
My questions are:
Are ReadLine() and Substring() the most efficient ways to read these huge files?
Can the performance be improved by modifying the connection string?
I am writing a program for an assignment that is meant to read two text files and use their data to write to a third text file. I was instructed to pass the contents of one file to a list. I have done something similar before, passing the contents to an array (see below), but I can't seem to get it to work with a list.
Here is what I have done in the past with arrays:
StreamReader f1 = new StreamReader(args[0]);
StreamReader f2 = new StreamReader(args[1]);
StreamWriter p = new StreamWriter(args[2]);
double[] array1 = new double[20];
double[] array2 = new double[20];
double[] array3 = new double[20];
string line;
int index;
double value;
while ((line = f1.ReadLine()) != null)
{
string[] currentLine = line.Split('|');
index = Convert.ToInt16(currentLine[0]);
value = Convert.ToDouble(currentLine[1]);
array1[index] = value;
}
If it is of any interest, this is my current setup:
static void Main(String[] args)
{
// Create variables to hold the 3 elements of each item that you will read from the file
// Create variables for all 3 files (2 for READ, 1 for WRITE)
int ID;
string InvName;
int Number;
string IDString;
string NumberString;
string line;
List<InventoryNode> Inventory = new List<InventoryNode>();
InventoryNode Item = null;
StreamReader f1 = new StreamReader(args[0]);
StreamReader f2 = new StreamReader(args[1]);
StreamWriter p = new StreamWriter(args[2]);
// Read each item from the Update File and process the data
//Data is separated by pipe |
To use a List instead of an array, call Add (or Insert) for each item.
In your code, that is Inventory.Add(Item):
while ((line = f1.ReadLine()) != null)
{
string[] currentLine = line.Split('|');
Item = new InventoryNode {
Index = Convert.ToInt16(currentLine[0]),
Value = Convert.ToDouble(currentLine[1])
};
Inventory.Add(Item);
}
like this.
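For reference, a self-contained version of the same idea might look like the sketch below. The InventoryNode class here is a hypothetical stand-in with only Index and Value fields (the real class is not shown in the question), and the reader is passed in as a TextReader so the same method also works with a StreamReader.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

// Hypothetical stand-in for the asker's InventoryNode; only two fields are assumed.
public sealed class InventoryNode
{
    public int Index;
    public double Value;
}

public static class InventoryReader
{
    // Reads "index|value" lines from any TextReader into a list.
    public static List<InventoryNode> ReadAll(TextReader reader)
    {
        var inventory = new List<InventoryNode>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            string[] parts = line.Split('|');
            if (parts.Length < 2) continue;   // skip blank or malformed lines
            inventory.Add(new InventoryNode
            {
                Index = int.Parse(parts[0], CultureInfo.InvariantCulture),
                Value = double.Parse(parts[1], CultureInfo.InvariantCulture)
            });
        }
        return inventory;
    }
}
```

With the asker's setup, this could be called as InventoryReader.ReadAll(f1), where f1 is the StreamReader opened on args[0].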
If I understand correctly, all you want to do is read two input files, parse their data in a particular format (in this case int|double), and write the result to a new file. If that is the requirement, try the following code. Since it is not clear how you want the data presented in the third file, I have kept the same format (i.e. int|double).
static void Main(string[] args)
{
if (args == null || args.Length < 3)
{
Console.WriteLine("Wrong Input");
return;
}
if (!ValidateFilePath(args[0]) || !ValidateFilePath(args[1]))
{
return;
}
Dictionary<int, double> parsedFileData = new Dictionary<int, double>();
//Read the first file
ReadFileData(args[0], parsedFileData);
//Read second file
ReadFileData(args[1], parsedFileData);
//Write to third file
WriteFileData(args[2], parsedFileData);
}
private static bool ValidateFilePath(string filePath)
{
try
{
return File.Exists(filePath);
}
catch (Exception)
{
Console.WriteLine($"Failed to read file : {filePath}");
return false;
}
}
private static void ReadFileData(string filePath, Dictionary<int, double> parsedFileData)
{
try
{
using (StreamReader fileStream = new StreamReader(filePath))
{
string line;
while ((line = fileStream.ReadLine()) != null)
{
string[] currentLine = line.Split('|');
int index = Convert.ToInt16(currentLine[0]);
double value = Convert.ToDouble(currentLine[1]);
parsedFileData.Add(index, value);
}
}
}
catch (Exception ex)
{
Console.WriteLine($"Exception : {ex.Message}");
}
}
private static void WriteFileData(string filePath, Dictionary<int, double> parsedFileData)
{
try
{
using (StreamWriter fileStream = new StreamWriter(filePath))
{
foreach (var parsedLine in parsedFileData)
{
var line = parsedLine.Key + "|" + parsedLine.Value;
fileStream.WriteLine(line);
}
}
}
catch (Exception ex)
{
Console.WriteLine($"Exception : {ex.Message}");
}
}
There are a few things you should always remember when writing C# code:
1) Validate command-line inputs before using them.
2) Look out for any class that has a Dispose method and instantiate it inside a using block.
3) Add proper exception handling; otherwise your program will crash at runtime on invalid inputs or inputs that you could not validate.
I am getting this error. There are other posts about it, but none of them gives a proper solution for my problem.
The debugger points to this statement:
id = Convert.ToInt32(s);
It worked fine at the beginning, but now it generates an error. The complete function follows. As a side note, I am following an N-tier architecture in Visual Studio 2013.
public List<ATMBO> GetDataFromFile() // get data from file and store it into object and pass to BLL !!!!
{
List<ATMBO> l = new List<ATMBO>();
// opening stream !!!
FileStream f = new FileStream("BankClient.txt", FileMode.Open);
StreamReader sr = new StreamReader(f);
if (!File.Exists("BankClient.txt"))
{
Console.WriteLine("{0} does not exist.", "BankClient.txt");
}
// Start reading from file
string record=sr.ReadLine();
//sr.ReadLine();
while((record = sr.ReadLine()) != null)
{
//record = sr.ReadLine();
// storing data from file to object!!!!
string [] data = record.Split(':');
//Console.WriteLine(data[0]);
ATMBO bo = new ATMBO();
string s = (data[0]);
int id = 0;
try
{
id = Convert.ToInt32(s);
}
catch (FormatException e)
{
Console.WriteLine("Input string is not a sequence of digits.");
}
catch (OverflowException e)
{
Console.WriteLine("The number cannot fit in an Int32.");
}
bo.ID1 = id;
bo.Login = data[1];
bo.Type = data[2];
string ss = (data[3]);
int blnc = Convert.ToInt32(ss);
bo.Balance = blnc;
bo.Status = data[4];
bo.Date = data[5];
bo.Pin = data[6];
l.Add(bo);
}
sr.Close();
f.Close();
return l;
}
Contents of my BankClient.txt file:
ID:Name:Type:Balance:Status:Date:Pin
00:Admin:Savings:500:Active:1/11/2014:111
01:Nabeel:Savings:0:Active:1/11/2014:222
02:Asad:Current:600:Active:2/11/2014:333
03:Aqsa:Current:-300:Active:3/11/2014:ABC
04:Umer:Savings:1000:Active:4/11/2014:444
05:Ali:Savings:1000:Active:4/11/2014:555
You need to add some error handling to your code to make sure there are actual values you can work with, such as
string [] data = record.Split(':');
if(data.Length < 7)
Console.WriteLine("Data doesn't contain what was expected");
Better yet, instead of Convert.ToInt32 you can use TryParse
int id;
if(!int.TryParse(s, out id))
Console.WriteLine("Not a valid id");
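Putting both checks together, a record parser could be sketched as follows. ClientRecord and ClientParser are hypothetical names used only for illustration (they are not the asker's ATMBO class), and the sketch assumes the seven colon-separated fields shown in BankClient.txt. It returns false instead of throwing, so the caller can decide whether to skip or log a bad line.

```csharp
using System;
using System.Globalization;

// Hypothetical record type; field names are illustrative only.
public sealed class ClientRecord
{
    public int Id;
    public string Login;
    public int Balance;
    public string Pin;
}

public static class ClientParser
{
    // Returns false (instead of throwing) when the line is malformed.
    public static bool TryParse(string record, out ClientRecord result)
    {
        result = null;
        if (string.IsNullOrWhiteSpace(record)) return false;

        string[] data = record.Split(':');
        if (data.Length < 7) return false;   // expected 7 colon-separated fields

        int id, balance;
        // A header line such as "ID:Name:..." fails here because "ID" is not a number.
        if (!int.TryParse(data[0], NumberStyles.Integer, CultureInfo.InvariantCulture, out id))
            return false;
        if (!int.TryParse(data[3], NumberStyles.Integer, CultureInfo.InvariantCulture, out balance))
            return false;

        result = new ClientRecord { Id = id, Login = data[1], Balance = balance, Pin = data[6] };
        return true;
    }
}
```

Note that the Pin is kept as a string here, so a value like "ABC" does not cause a conversion error.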
I have a problem here. This is what I want to do:
I have a program that saves information about user progress (e.g. calls, answered calls). The user runs this program every day and saves the information to a text file. The problem is that every time the user hits the Save button, it adds new stats for that day. Instead, I want the existing data to be modified if the user saves twice on the same day.
What I want to do is create a second file that stores the date of the last save. If the dates differ, append to the stats file; otherwise, modify the existing entry for that day.
What I have done so far is:
string input3 = string.Format("{0:yyyy-MM-dd}", DateTime.Now);
StreamWriter t,tw;
if(File.Exists(filename))
{
tw=File.AppendText(filename);
t = new StreamWriter("lasttimesaved.txt");
t.WriteLine(input3);
}
else
{
tw=new StreamWriter(filename);
t = new StreamWriter("lasttimesaved.txt");
t.WriteLine(input3);
}
tw.WriteLine();
tw.Write("Stats for Name ");
tw.Write(input);
tw.Write("_");
tw.WriteLine(input3);
tw.WriteLine();
tw.Write("Total Calls: "); tw.WriteLine(calls);
tw.Write("Total Answered: "); tw.WriteLine(answ);
tw.Close();
The only thing I don't know how to do is add, before all of that, a check to see whether the user has already saved today's info and, if so, modify the existing data.
Something like:
try
{
using (StreamReader sr = new StreamReader("lasttimesaved.txt"))
{
String line = sr.ReadToEnd();
}
}
catch (Exception e)
{
Console.WriteLine("The file could not be read:");
Console.WriteLine(e.Message);
}
if(String.Compare(input3,line) == 0)
{
// that's where I need to modify the existing data.
}
else
{
// do the code above
}
Can anyone help me modify the currently recorded data without losing previous records?
The text file contains:
Stats for Name_2013-11-26
Total Calls: 25
Total Answered: 17
Stats for Name_2013-11-27
Total Calls: 32
Total Answered: 15
Stats for Name_2013-11-28
Total Calls: 27
Total Answered: 13
I would suggest using XML. It is still readable and editable by hand, and it gives you a neat way to modify the file from code.
With XML you can easily query the file to see whether today's date is already present; if so, you can edit that node, and if not, you can append a new one.
To append nodes to an XML file, I would look at this link:
C#, XML, adding new nodes
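As a rough illustration of the edit-or-append idea, here is a minimal sketch using LINQ to XML. The element and attribute names (stats, day, date, calls, answered) and the StatsXml class are assumptions made up for this example; they are not part of the asker's file format.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Xml.Linq;

public static class StatsXml
{
    // Adds or updates the <day> element for the given date and returns the document.
    // All element and attribute names here are assumptions for illustration.
    public static XDocument Upsert(XDocument doc, DateTime date, int calls, int answered)
    {
        string key = date.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture);

        // Look for an existing entry for this date.
        XElement day = doc.Root.Elements("day")
            .FirstOrDefault(e => (string)e.Attribute("date") == key);

        if (day == null)
        {
            // No entry yet: append a new node.
            day = new XElement("day", new XAttribute("date", key));
            doc.Root.Add(day);
        }

        // SetElementValue creates the child if missing, otherwise overwrites it.
        day.SetElementValue("calls", calls);
        day.SetElementValue("answered", answered);
        return doc;
    }
}
```

In practice the document would be read with XDocument.Load(path) (or created as new XDocument(new XElement("stats")) when the file does not exist yet) and written back with doc.Save(path).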
Hope this helps. Use it like this:
void main()
{
var uw = new UserInformationWriter(@"C:\temp\stats.txt");
var user = new UserInfomration { Calls = "111", Answered = "110" };
uw.Save(user);
}
Here are the classes:
public class UserInformationWriter
{
public string CentralFile { get; set; }
public UserInformationWriter(string centraFile)
{
CentralFile = centraFile;
}
public void Save(UserInfomration newUserInformation)
{
try
{
var streamReader = new StreamReader(CentralFile);
var sourceInformation = streamReader.ReadToEnd();
streamReader.Close();
var userCollection = (List<UserInfomration>)(sourceInformation.ToUserInfomation());
var checkItem = ShouldModify(userCollection);
if (checkItem.Item1)
userCollection.Remove(checkItem.Item2);
newUserInformation.DateTime = DateTime.Today;
userCollection.Add(newUserInformation);
File.Delete(CentralFile);
foreach (var userInfomration in userCollection)
WriteToFile(userInfomration);
}
catch (Exception) { }
}
private Tuple<bool, UserInfomration> ShouldModify(IEnumerable<UserInfomration> userInfomations)
{
try
{
foreach (var userInfomration in userInfomations)
if (userInfomration.DateTime == DateTime.Today)
return new Tuple<bool, UserInfomration>(true, userInfomration);
}
catch (Exception) { }
return new Tuple<bool, UserInfomration>(false, null);
}
private void WriteToFile(UserInfomration newUserInformation)
{
using (var tw = new StreamWriter(CentralFile, true))
{
tw.WriteLine("*Stats for Name_{0}", newUserInformation.DateTime.ToString("dd.MM.yyyy", CultureInfo.InvariantCulture));
tw.WriteLine();
tw.WriteLine("*Total Calls: {0}", newUserInformation.Calls);
tw.WriteLine("*Total Answered: {0}#", newUserInformation.Answered);
tw.WriteLine();
}
}
}
public class UserInfomration
{
public DateTime DateTime { get; set; }
public string Calls { get; set; }
public string Answered { get; set; }
}
public static class StringExtension
{
private const string CallText = "TotalCalls:";
private const string AnsweredText = "TotalAnswered:";
private const string StatsForName = "StatsforName_";
private const char ClassSeperator = '#';
private const char ItemSeperator = '*';
public static IEnumerable<UserInfomration> ToUserInfomation(this string input)
{
var splited = input.RemoveUnneededStuff().Split(ClassSeperator);
splited = splited.Where(x => !string.IsNullOrEmpty(x)).ToArray();
var userInformationResult = new List<UserInfomration>();
foreach (var item in splited)
{
if (string.IsNullOrEmpty(item)) continue;
var splitedInformation = item.Split(ItemSeperator);
splitedInformation = splitedInformation.Where(x => !string.IsNullOrEmpty(x)).ToArray();
var userInformation = new UserInfomration
{
DateTime = ConvertStringToDateTime(splitedInformation[0]),
Calls = splitedInformation[1].Substring(CallText.Length),
Answered = splitedInformation[2].Substring(AnsweredText.Length)
};
userInformationResult.Add(userInformation);
}
return userInformationResult;
}
private static DateTime ConvertStringToDateTime(string input)
{
var date = input.Substring(StatsForName.Length);
return DateTime.ParseExact(date, "dd.MM.yyyy", CultureInfo.InvariantCulture);
}
private static string RemoveUnneededStuff(this string input)
{
input = input.Replace("\n", String.Empty);
input = input.Replace("\r", String.Empty);
input = input.Replace("\t", String.Empty);
return input.Replace(" ", string.Empty);
}
}
Let me know if you need help or if I misunderstood you.
I've made a program and I want to save the data. Saving is working, but "Loading" doesn't work.
public void Save(StreamWriter sw)
{
for (int i = 0; i < buecher.Count; i++)
{
Buch b = (Buch)buecher[i];
if (i == 0)
sw.WriteLine("ISDN ; Autor ; Titel");
sw.WriteLine(b.ISDN + ";" + b.Autor + ";" + b.Titel);
}
}
public void Load(StreamReader sr)
{
int isd;
string aut;
string tit;
while (sr.ReadLine() != "")
{
string[] teile = sr.ReadLine().Split(';');
try
{
isd = Convert.ToInt32(teile[0]);
aut = teile[1];
tit = teile[2];
}
catch
{
throw new Exception("conversion failed");
}
Buch b = new Buch(isd, aut, tit);
buecher.Add(b);
}
}
If I put a break after buecher.Add(b); everything is fine, but it obviously only shows me one book. If I don't use the break, it gives me a "nullreference.." error.
It would be awesome if someone could help me.
best regards
Ramon
The problem is that you are reading two lines in each iteration of the loop (and throwing away the first one). If there is an odd number of lines in the file, the second call to ReadLine will return null.
Read the line into a variable in the condition, and use that variable in the loop:
public void Load(StreamReader sr) {
int isd;
string aut;
string tit;
// skip header
sr.ReadLine();
string line;
while ((line = sr.ReadLine()) != null) {
if (line.Length > 0) {
string[] teile = line.Split(';');
try {
isd = Convert.ToInt32(teile[0]);
aut = teile[1];
tit = teile[2];
} catch {
throw new Exception("conversion failed");
}
Buch b = new Buch(isd, aut, tit);
buecher.Add(b);
}
}
}
You are calling sr.ReadLine() twice for every line, once in the while() and once right after. You are hitting the end of the file, which returns a null.
Here is a different approach, which I suggest because it's simpler:
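public void Load(string filepath)
{
try
{
// Skip(1) skips the header line written by Save; empty lines are ignored.
List<Buch> buches = File.ReadAllLines(filepath)
.Skip(1)
.Where(x => x.Length > 0)
.Select(x => x.Split(';'))
.Select(t => new Buch(int.Parse(t[0]), t[1], t[2]))
.ToList();
buecher.AddRange(buches);
}
catch
{
throw new Exception("conversion failed");
}
}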
You could spread it over more lines if you find that more readable, but I've come to prefer File.ReadAllText and File.ReadAllLines to the StreamReader approach of reading files.
Instead of using the LINQ statement, you could also do:
public void Load(string filepath)
{
try
{
string[] lines = File.ReadAllLines(filepath);
foreach (string line in lines)
{
string[] tokens = line.Split(';');
if (tokens.Length != 3)
continue; // error: wrong number of fields
int isd;
if (!int.TryParse(tokens[0], out isd))
continue; // error: wasn't an int (this also skips the header line)
buecher.Add(new Buch(isd, tokens[1], tokens[2]));
}
}
catch
{
throw new Exception("conversion failed");
}
}
Given this log file, how can I read a message that contains multiple newlines (\n) with a StreamReader?
The ReadLine method returns each line separately, but a message may span more than one line.
Here is what I have so far:
using (var sr = new StreamReader(filePath))
using (var store = new DocumentStore {ConnectionStringName = "RavenDB"}.Initialize())
{
IndexCreation.CreateIndexes(typeof(Logs_Search).Assembly, store);
using (var bulkInsert = store.BulkInsert())
{
const char columnDelimeter = '|';
const string quote = @"~";
string line;
while ((line = sr.ReadLine()) != null)
{
batch++;
List<string> columns = null;
try
{
columns = line.Split(columnDelimeter)
.Select(item => item.Replace(quote, string.Empty))
.ToList();
if (columns.Count != 5)
{
batch--;
Log.Error(string.Join(",", columns.ToArray()));
continue;
}
bulkInsert.Store(LogParser.Log.FromStringList(columns));
/* Give some feedback */
if (batch % 100000 == 0)
{
Log.Debug("batch: {0}", batch);
}
/* Use sparingly */
if (ThrottleEnabled && batch % ThrottleBatchSize == 0)
{
Thread.Sleep(ThrottleThreadWait);
}
}
catch (FormatException)
{
if (columns != null) Log.Error(string.Join(",", columns.ToArray()));
}
catch (Exception exception)
{
Log.Error(exception);
}
}
}
}
And the Model
public class Log
{
public string Component { get; set; }
public string DateTime { get; set; }
public string Logger { get; set; }
public string Level { get; set; }
public string ThreadId { get; set; }
public string Message { get; set; }
public string Terms { get; set; }
public static Log FromStringList(List<string> row)
{
Log log = new Log();
/*log.Component = row[0] == string.Empty ? null : row[0];*/
log.DateTime = row[0] == string.Empty ? null : row[0].ToLower();
log.Logger = row[1] == string.Empty ? null : row[1].ToLower();
log.Level = row[2] == string.Empty ? null : row[2].ToLower();
log.ThreadId = row[3] == string.Empty ? null : row[3].ToLower();
log.Message = row[4] == string.Empty ? null : row[4].ToLower();
return log;
}
}
I would use Regex.Split and break the file up on anything that matches the date pattern (e.g. 2013-06-19) at the beginning of each entry.
If you can read the entire file into memory (i.e. File.ReadAllText), then you can treat it as a single string and use regular expressions to split on the date, or some such.
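A minimal sketch of the regular-expression approach is shown below. It assumes every message starts with a yyyy-MM-dd date at the beginning of a line, as in the 2013-06-19 example; LogSplitter and SplitEntries are hypothetical names. A zero-width lookahead is used so that the date stays attached to its message instead of being consumed by the split.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LogSplitter
{
    // Zero-width lookahead: split *before* each line that starts with yyyy-MM-dd.
    // RegexOptions.Multiline makes ^ match at the start of every line.
    private static readonly Regex EntryStart =
        new Regex(@"(?=^\d{4}-\d{2}-\d{2})", RegexOptions.Multiline);

    // Returns one string per log message, continuation lines included.
    public static string[] SplitEntries(string text)
    {
        return EntryStart.Split(text)
            .Where(entry => entry.Length > 0)   // drop the empty piece before the first date
            .ToArray();
    }
}
```

It could be called as LogSplitter.SplitEntries(File.ReadAllText(filePath)), after which each entry can be split on the column delimiter as in the question.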
A more general solution that takes less memory would be to read the file line-by-line. Append lines to a buffer until you get the next line that starts with the desired value (in your case, a date/time stamp). Then process that buffer. For example:
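StringBuilder sb = new StringBuilder();
foreach (var line in File.ReadLines(logfileName))
{
// A line starting with the date stamp begins a new message,
// so flush the message collected so far.
if (line.StartsWith("2013-06-19"))
{
if (sb.Length > 0)
{
ProcessMessage(sb.ToString());
sb.Clear();
}
}
// Every line, including continuation lines, belongs to the current message.
sb.AppendLine(line);
}
// be sure to process the last message
if (sb.Length > 0)
{
ProcessMessage(sb.ToString());
}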
It is hard to tell without seeing your file, but I would read it line by line and append each line to a variable.
Check for the end of the message. When you see it, do whatever you need with the message in that variable (insert it into the DB, etc.), then reset the variable and keep reading the next message.
Pseudo code
read the line
variable a = a + new line
if end of message
insert into DB
reset the variable
continue reading the message.....