I'm creating an application that loads data from an SQL database once a day and saves it into a text file.
The main table is a "Transactions" table, which holds data about all transactions made on that day. One of the columns represents a middle-man call sign.
My program saves the data in a DataTable first and then with a StringBuilder I give it the proper form and finally save it into a text file with StreamWriter.
My question is: how, or at which stage of the process, can I distinguish one table entry from another? I want to create two files: one with transactions made by middle-man A and one with transactions made by middle-man B.
This is my code so far:
// Query for data (row is a SqlDataAdapter, Row a DataTable, con an open SqlConnection)
row = new SqlDataAdapter("SELECT [MSISDN], [Amount], [Transaction_ID], POS.[Name], MNO.[Call Sign] "
    + "FROM [Transactions] "
    + "JOIN [POS] ON Transactions.POS_ID = POS.idPOS "
    + "JOIN [MNO] ON Transactions.MNO_ID = MNO.idMNO "
    + "WHERE [Status] = '1'", con);
row.Fill(Row);

// Save data in a StringBuilder
for (int i = 0; i < Row.Rows.Count; i++)
{
    sb.Append(Row.Rows[i].ItemArray[0].ToString()).Append(",");
    double amount = Convert.ToDouble(Row.Rows[i].ItemArray[1].ToString());
    sb.Append(Math.Round(amount, 2).ToString().Replace(",", ".")).Append(",");
    sb.Append(Row.Rows[i].ItemArray[2].ToString()).Append(",");
    sb.Append(Row.Rows[i].ItemArray[3].ToString()).Append(",");
    sb.Append(Row.Rows[i].ItemArray[4].ToString()).Append(",").Append(Environment.NewLine);
}

// Create a file from the StringBuilder
mydocpath = @"C:\Transactions\" + fileDate.ToString(format) + ".txt";
FileStream fsOverwrite = new FileStream(mydocpath, FileMode.Create);
using (StreamWriter outfile = new StreamWriter(fsOverwrite))
{
    outfile.Write(sb.ToString()); // synchronous write; WriteAsync without await could dispose the writer too early
}
Hope I was clear enough. English isn't my strong suit. Neither is coding, it seems...
One option.
Put all your data into a DataSet and then do XSL transformations against ds.GetXml().
Here is kind of an example:
http://granadacoder.wordpress.com/2007/05/15/xml-to-xml-conversion/
But what I would do is eliminate the DataTable altogether. Use an IDataReader.
Loop over the data. Maybe do the original query as "Order By Middle-Man-Identifier", and then when the middleManIdentifier "makes a jump" (changes value), close the previous file and start writing a new one.
Something like that.
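For example, here is a rough sketch of that idea, reusing the query from the question ordered by the call sign. It uses con, fileDate and format from the original code; the output folder and the per-call-sign file naming are assumptions.
// Sketch only: needs using System.Data.SqlClient, System.IO and System.Globalization.
string sql =
    "SELECT [MSISDN], [Amount], [Transaction_ID], POS.[Name], MNO.[Call Sign] " +
    "FROM [Transactions] " +
    "JOIN [POS] ON Transactions.POS_ID = POS.idPOS " +
    "JOIN [MNO] ON Transactions.MNO_ID = MNO.idMNO " +
    "WHERE [Status] = '1' " +
    "ORDER BY MNO.[Call Sign]";

using (SqlCommand cmd = new SqlCommand(sql, con))
using (SqlDataReader reader = cmd.ExecuteReader())
{
    string currentCallSign = null;
    StreamWriter writer = null;
    try
    {
        while (reader.Read())
        {
            string callSign = reader["Call Sign"].ToString();
            // The call sign "makes a jump": close the previous file and start a new one.
            if (callSign != currentCallSign)
            {
                if (writer != null) writer.Dispose();
                writer = new StreamWriter(@"C:\Transactions\" + callSign + "_" + fileDate.ToString(format) + ".txt");
                currentCallSign = callSign;
            }
            double amount = Convert.ToDouble(reader["Amount"]);
            writer.WriteLine(string.Join(",",
                reader["MSISDN"],
                Math.Round(amount, 2).ToString(CultureInfo.InvariantCulture),
                reader["Transaction_ID"],
                reader["Name"],
                callSign));
        }
    }
    finally
    {
        if (writer != null) writer.Dispose();
    }
}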
You may be able to learn something from this demo:
http://granadacoder.wordpress.com/2009/01/27/bulk-insert-example-using-an-idatareader-to-strong-dataset-to-sql-server-xml/
Here are a couple of IDataReader helpers:
http://kalit-codesnippetsofnettechnology.blogspot.com/2009/05/write-textfile-from-sqldatareader.html
and
How to efficiently write to file from SQL datareader in c#?
This question is an extension to another I asked Here
I have a WinForm that has checkbox controls on it. The names of the checkboxes match the column names of a table. I cannot normalize the tables because of the huge amount of data involved, already received for the live project, so everything stays as it is.
I get the selected checkbox names as a CSV string, col1,col2,col3, which I later concatenate into the SQL string (no stored procedures, as it's a SQL Compact 3.5 .sdf database).
In the GetData() method of my DataAccess class I build the SQL string. But to avoid SQL injection, how can I ensure that the column names passed in are validated?
// Get data
// selectedMPs: CSV string generated from the list of selected posts (checkboxes) in the UI,
// forming the column names of the SELECT
public static DataTable GetDataPostsCars(string selectedMPs, DateTime fromDateTime, DateTime toDateTime)
{
    DataTable dt;
    //string[] cols = selectedMPs.Split(','); // converts to array
    //object[] cols2 = cols;                  // gets as object array
    //=== using cols or cols2 in String.Format does not help

    // This WORKS, but as far as I'm aware it's prone to injection.
    string sql =
        "SELECT " + selectedMPs + " " +
        "FROM GdRateFixedPosts " +
        "WHERE MonitorDateTime BETWEEN '" + fromDateTime + "' AND '" + toDateTime + "'";

    using (cmd = new SqlCeCommand(sql, conn))
    {
        cmd.CommandType = CommandType.Text; //cmd.Parameters.Add("@toDateTime", DbType.DateTime);
        dt = ExecuteSelectCommand(cmd);
    }
    return dt;
}
This works, but as far as I'm aware it's prone to injection. So how can I validate that the columns in "selectedMPs" come from a list, a dictionary, or something similar? I'm not experienced with that. I would really appreciate your help. Thanks in advance.
This is the only possible approach, and there is no risk of injection with SQL Server Compact, as that database engine only executes a single statement per batch.
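That said, if you still want to validate the column names against a whitelist before building the SQL (which is what the question asks about), a minimal sketch could look like this; the allowed set of names is an assumption and should list the real columns of GdRateFixedPosts:
// Hypothetical whitelist; fill it with the actual column names of GdRateFixedPosts.
private static readonly HashSet<string> AllowedColumns =
    new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "col1", "col2", "col3" };

private static string ValidateColumnList(string selectedMPs)
{
    string[] cols = selectedMPs.Split(',');
    foreach (string col in cols)
    {
        if (!AllowedColumns.Contains(col.Trim()))
            throw new ArgumentException("Unexpected column name: " + col);
    }
    return string.Join(",", cols);
}
You could then call ValidateColumnList(selectedMPs) before concatenating, and pass the two dates as command parameters rather than string literals.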
I'm reading a CSV file a few times a day. It's about 300 MB, and each time I have to read it through, compare it with the existing data in the database, add new rows, hide old ones, and update existing ones. There is also a bunch of data that doesn't get touched at all.
I have access to all files, both old and new, and I'd like to compare the new one with the previous one and only update what's changed in the file. I have no idea what to do, and I'm using C# for all my work. The thing that might be most problematic is that a row from the previous file might be in a different location in the new feed even if it hasn't been updated at all. I want to avoid that problem as well, if possible.
Any idea would help.
Use one of the existing CSV parsers
Parse each row into a mapped class object
Override Equals and GetHashCode for your object (see the sketch after this list)
Keep a List<T> or HashSet<T> in memory; at the first step, initialize it with no contents.
On reading each line from the CSV file, check whether it exists in your in-memory collection (List, HashSet)
If the object doesn't exist in your in-memory collection, add it to the collection and insert it into the database.
If the object exists in your in-memory collection, ignore it. The check is based on your Equals and GetHashCode implementation, so it is as simple as if (inMemoryCollection.Contains(currentRowObject))
I guess you have a Windows service reading CSV files periodically from a file location. You can repeat the above process every time you read a new CSV file. This way you will be able to maintain an in-memory collection of the previously inserted objects and ignore them, irrespective of their place in the CSV file.
If you have a primary key defined for your data, you can use a Dictionary<TKey, TValue> where the key is the unique field. This will give you better comparison performance, and you can skip the Equals and GetHashCode implementation.
As a backup to this process, your DB writing routine/stored procedure should be defined so that it first checks whether the record already exists in the table; if it does, UPDATE it, otherwise INSERT a new record. This would be an UPSERT.
Remember, if you end up maintaining an in-memory collection, keep clearing it periodically, otherwise you could end up with an out-of-memory exception.
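A minimal sketch of the Equals/GetHashCode idea, assuming each CSV row maps to a class like the one below. The property names, ParseCsv and UpsertIntoDatabase are placeholders for your own columns, CSV parser and insert-or-update routine:
public class FeedRow
{
    public string Id { get; set; }
    public string Name { get; set; }
    public decimal Price { get; set; }

    public override bool Equals(object obj)
    {
        var other = obj as FeedRow;
        if (other == null) return false;
        return Id == other.Id && Name == other.Name && Price == other.Price;
    }

    public override int GetHashCode()
    {
        // Combine the fields that define row identity.
        unchecked
        {
            int hash = 17;
            hash = hash * 31 + (Id ?? "").GetHashCode();
            hash = hash * 31 + (Name ?? "").GetHashCode();
            hash = hash * 31 + Price.GetHashCode();
            return hash;
        }
    }
}

public void ProcessFile(string newFilePath)
{
    var seen = new HashSet<FeedRow>();
    foreach (FeedRow row in ParseCsv(newFilePath))   // ParseCsv: whichever CSV parser you pick
    {
        if (seen.Contains(row))
            continue;                 // already seen / unchanged: ignore
        seen.Add(row);
        UpsertIntoDatabase(row);      // your INSERT-or-UPDATE routine
    }
}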
Just curious, why do you have to compare the old file with the new file? Isn't the data from the old file in SQL Server already? (When you say database, you mean SQL Server, right? I'm assuming SQL Server because you use C#/.NET.)
My approach is simple:
Load new CSV file into a staging table
Use stored procs to insert, update, and set records inactive
public static void ProcessCSV(FileInfo file)
{
    foreach (string line in ReturnLines(file))
    {
        // Break the line up and parse the values into parameters
        string[] fields = line.Split(',');

        using (SqlConnection conn = new SqlConnection(connectionString))
        using (SqlCommand command = conn.CreateCommand())
        {
            command.CommandType = CommandType.StoredProcedure;
            command.CommandText = "[dbo].sp_InsertToStaging";
            // Values parsed from the current line; adjust the indexes to your file layout
            command.Parameters.Add("@id", SqlDbType.BigInt).Value = long.Parse(fields[0]);
            command.Parameters.Add("@SomethingElse", SqlDbType.VarChar).Value = fields[1];

            // Execute
            if (conn.State != ConnectionState.Open)
                conn.Open();
            try
            {
                command.ExecuteNonQuery();
            }
            catch (SqlException exc)
            {
                // throw or do something
            }
        }
    }
}
public static IEnumerable<string> ReturnLines(FileInfo file)
{
    using (FileStream stream = File.Open(file.FullName, FileMode.Open, FileAccess.Read, FileShare.Read))
    using (StreamReader reader = new StreamReader(stream))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            yield return line;
        }
    }
}
Now you write stored procs to insert, update, and set records inactive based on their IDs. You'll know a row is updated if Field_x(main_table) != Field_x(staging_table) for a particular Id, and so on.
Here's how you detect changes and updates between your main table and staging table.
/* SECTION: SET INACTIVE */
UPDATE main_table
SET IsActiveTag = 0
WHERE unique_identifier IN
(
SELECT a.unique_identifier
FROM main_table AS a INNER JOIN staging_table AS b
--inner join because you only want existing records
ON a.unique_identifier = b.unique_identifier
--detect any updates
WHERE a.field1 <> b.field1
OR a.field2 <> b.field2
OR a.field3 <> b.field3
--etc
)
/* SECTION: INSERT UPDATED AND NEW */
INSERT INTO main_table
SELECT *
FROM staging_table AS b
LEFT JOIN
(SELECT *
FROM main_table
--only get active records
WHERE IsActiveTag = 1) AS a
ON b.unique_identifier = a.unique_identifier
--select only records available in staging table
WHERE a.unique_identifier IS NULL
How big is the CSV file? If it's small, try the following:
string[] File1Lines = File.ReadAllLines(pathOfFileA);
string[] File2Lines = File.ReadAllLines(pathOfFileB);
List<string> NewLines = new List<string>();
// Assumes both files have the same number of lines; add a bounds check otherwise.
for (int lineNo = 0; lineNo < File1Lines.Length; lineNo++)
{
    if (!String.IsNullOrEmpty(File1Lines[lineNo]) &&
        !String.IsNullOrEmpty(File2Lines[lineNo]))
    {
        if (String.Compare(File1Lines[lineNo], File2Lines[lineNo]) != 0)
            NewLines.Add(File2Lines[lineNo]);
    }
    else if (!String.IsNullOrEmpty(File1Lines[lineNo]))
    {
        // line exists only in the old file: nothing to add
    }
    else
    {
        NewLines.Add(File2Lines[lineNo]);
    }
}
if (NewLines.Count > 0)
{
    File.WriteAllLines(newfilepath, NewLines);
}
I'm stuck again. Is there any way to "connect" a graph on a form to a MySQL database? I've made the connection (form -> database). I want that graph to automatically load the columns and the values from my database. Thanks a lot!
OK, I assume you want to show a Chart?
It is easy to use, just drop it on the form, define its series and load points.
Some or all of the information you are looking for is in the metadata tables of each DBMS; in MySQL these are the various information_schema tables. Below is an example that loads all table names and the count of their columns into a chart named ch_tables, which contains one series called columns. It assumes you have an open Connection called DBC and a database name provided in the parameter.
public void loadChart(string schema)
{
    try
    {
        string allTables =
              " SELECT table_name, count(COLUMN_NAME) "
            + " FROM information_schema.COLUMNS "
            + " where table_schema = '" + schema + "' group by table_name";
        MySqlCommand cmd = new MySqlCommand(allTables, DBC);
        MySqlDataReader rdr = cmd.ExecuteReader();
        ch_tables.Series["columns"].Points.Clear();
        while (rdr.Read())
        {
            ch_tables.Series["columns"].Points.AddXY(rdr[0], rdr[1]);
        }
        rdr.Close();
    }
    catch (MySqlException ex) { /* error handling */ }
}
I hope this gives you a start to expand it and to create other charts.
The information_schema tables contain a lot of interesting stuff; for example, you can find the number of rows as TABLE_ROWS in information_schema.TABLES.
When asking a question it is always a good idea to give both us and yourself as many details as possible. Creating a dummy chart in Excel would have been a good way to show just what you mean.
EDIT: I had accidentally hard-coded the schema name, even though the function was prepared to read from a dynamically chosen database.
I have an Excel document with about 250,000 rows, which takes forever to import. I have done many variations of this import, however there are a few requirements:
- Need to validate the data in each cell
- Must check if a duplicate exists in the database
- If a duplicate exists, update the entry
- If no entry exists, insert a new one
I have used parallelization as much as possible however I am sure that there must be some way to get this import to run much faster. Any assistance or ideas would be greatly appreciated.
Note that the database is on a LAN, and yes I know I haven't used parameterized sql commands (yet).
public string BulkUserInsertAndUpdate()
{
DateTime startTime = DateTime.Now;
try
{
ProcessInParallel();
Debug.WriteLine("Time taken: " + (DateTime.Now - startTime));
}
catch (Exception ex)
{
return ex.Message;
}
return "";
}
private IEnumerable<Row> ReadDocument()
{
using (SpreadsheetDocument spreadSheetDocument = SpreadsheetDocument.Open(_fileName, false))
{
WorkbookPart workbookPart = spreadSheetDocument.WorkbookPart;
Sheet ss = workbookPart.Workbook.Descendants<Sheet>().SingleOrDefault(s => s.Name == "User");
if (ss == null)
throw new Exception("There was a problem trying to import the file. Please insure that the Sheet's name is: User");
WorksheetPart worksheetPart = (WorksheetPart)workbookPart.GetPartById(ss.Id);
OpenXmlReader reader = OpenXmlReader.Create(worksheetPart);
StringTablePart = workbookPart.SharedStringTablePart;
while (reader.Read())
{
if (reader.ElementType == typeof(Row))
{
do
{
if (reader.HasAttributes)
{
var rowNum = int.Parse(reader.Attributes.First(a => a.LocalName == "r").Value);
if (rowNum == 1)
continue;
var row = (Row)reader.LoadCurrentElement();
yield return row;
}
} while (reader.ReadNextSibling()); // Skip to the next row
break; // We just looped through all the rows so no need to continue reading the worksheet
}
}
}
}
private void ProcessInParallel()
{
// Use ConcurrentQueue to enable safe enqueueing from multiple threads.
var exceptions = new ConcurrentQueue<Exception>();
Parallel.ForEach(ReadDocument(), (row, loopState) =>
{
List<Cell> cells = row.Descendants<Cell>().ToList();
if (string.IsNullOrEmpty(GetCellValue(cells[0], StringTablePart)))
return;
// validation code goes here....
try
{
using (SqlConnection connection = new SqlConnection("user id=sa;password=D3vAdm!n#;server=196.30.181.143;database=TheUnlimitedUSSD;MultipleActiveResultSets=True"))
{
connection.Open();
SqlCommand command = new SqlCommand("SELECT count(*) FROM dbo.[User] WHERE MobileNumber = '" + mobileNumber + "'", connection);
var userCount = (int) command.ExecuteScalar();
if (userCount > 0)
{
// update
command = new SqlCommand("UPDATE [user] SET NewMenu = " + (newMenuIndicator ? "1" : "0") + ", PolicyNumber = '" + policyNumber + "', Status = '" + status + "' WHERE MobileNumber = '" + mobileNumber + "'", connection);
command.ExecuteScalar();
Debug.WriteLine("Update cmd");
}
else
{
// insert
command = new SqlCommand("INSERT INTO dbo.[User] ( MobileNumber , Status , PolicyNumber , NewMenu ) VALUES ( '" + mobileNumber + "' , '" + status + "' , '" + policyNumber + "' , " + (newMenuIndicator ? "1" : "0") + " )", connection);
command.ExecuteScalar();
Debug.WriteLine("Insert cmd");
}
}
}
catch (Exception ex)
{
exceptions.Enqueue(ex);
Debug.WriteLine(ex.Message);
loopState.Break();
}
});
// Throw the exceptions here after the loop completes.
if (exceptions.Count > 0)
throw new AggregateException(exceptions);
}
I would have suggested that you do a bulk import WITHOUT any validation into an intermediary table, and only then do all the validation via SQL. Your spreadsheet's data will then be in a similar structure to a SQL table.
This is what I have done with industrial-strength imports of 3 million+ rows from Excel and CSV, with great success.
Mostly I'd suggest you check that your parallelism is optimal. Since your bottlenecks are likely to be disk IO on the Excel file and IO to the SQL server, I'd suggest that it may not be. You've parallelised across those two processes (so each of them is reduced to the speed of the slowest); your parallel threads will be fighting over the database and potentially slowing each other down. There's no point having (say) eight threads if your hard disk can't keep up with one - it just creates overhead.
Two things I'd suggest. First: take out all the parallelism and see if it's actually helping. If you single-threadedly parse the whole file into a single Queue in memory, then run the whole thing into the database, you might find it's faster.
Then, I'd try splitting it to just two threads: one to process the incoming file to the Queue, and one to take the items from the Queue and push them into the database. This way you have one thread per slow resource that you're handling - so you minimise contention - and each thread is blocked by only one resource - so you're handling that resource as optimally as possible.
This is the real trick of multithreaded programming. Throwing extra threads at a problem doesn't necessarily improve performance. What you're trying to do is minimise the time that your program is waiting idly for something external (such as disk or network IO) to complete. If one thread only waits on the Excel file, and one thread only waits on the SQL server, and what they do in between is minimal (which, in your case, it is), you'll find your code will run as fast as those external resources will allow it to.
Also, you mention it yourself, but using parameterised Sql isn't just a cool thing to point out: it will increase your performance. At the moment, you're creating a new SqlCommand for every insert, which has overhead. If you switch to a parameterised command, you can keep the same command throughout and just change the parameter values, which will save you some time. I don't think this is possible in a parallel ForEach (I doubt you can reuse the SqlCommand across threads), but it'd work fine with either of the approaches above.
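For what it's worth, here is a minimal sketch of that two-thread split, using a BlockingCollection as the in-memory queue and a single reusable parameterised command on the database side. ReadDocument() is the method from the question; UserRow, ParseRow, the parameter sizes and connectionString are placeholders for your own row type, cell-parsing logic and connection string.
// Needs using System.Collections.Concurrent, System.Data, System.Data.SqlClient, System.Threading.Tasks.
var queue = new BlockingCollection<UserRow>(boundedCapacity: 1000);

// Producer: parse the spreadsheet on one thread.
var producer = Task.Run(() =>
{
    foreach (var row in ReadDocument())
    {
        UserRow parsed = ParseRow(row);   // hypothetical: cells -> UserRow, plus validation
        if (parsed != null) queue.Add(parsed);
    }
    queue.CompleteAdding();
});

// Consumer: one connection, one parameterised command, reused for every row.
var consumer = Task.Run(() =>
{
    using (var connection = new SqlConnection(connectionString))
    using (var command = new SqlCommand(
        "UPDATE dbo.[User] SET NewMenu = @newMenu, PolicyNumber = @policy, Status = @status " +
        "WHERE MobileNumber = @mobile; " +
        "IF @@ROWCOUNT = 0 " +
        "INSERT INTO dbo.[User] (MobileNumber, Status, PolicyNumber, NewMenu) " +
        "VALUES (@mobile, @status, @policy, @newMenu);", connection))
    {
        command.Parameters.Add("@mobile", SqlDbType.VarChar, 20);
        command.Parameters.Add("@status", SqlDbType.VarChar, 20);
        command.Parameters.Add("@policy", SqlDbType.VarChar, 50);
        command.Parameters.Add("@newMenu", SqlDbType.Bit);
        connection.Open();

        foreach (var user in queue.GetConsumingEnumerable())
        {
            command.Parameters["@mobile"].Value = user.MobileNumber;
            command.Parameters["@status"].Value = user.Status;
            command.Parameters["@policy"].Value = user.PolicyNumber;
            command.Parameters["@newMenu"].Value = user.NewMenu;
            command.ExecuteNonQuery();
        }
    }
});

Task.WaitAll(producer, consumer);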
Some tips for enhanced processing (as I believe this is what you need, not really a code fix).
Have Excel check for duplicate rows beforehand. It's a really decent tool for weeding out obsolete entries. If A and B were duplicates, you'd create A and then update it with B's data. This way, you can weed out A and only create B.
Don't process it as an .xls(x) file, convert it to a CSV. (if you haven't already).
Create some stored procedures on your database. I generally dislike stored procedures when used in projects for simple data retrieval, but they work wonders for automated scripts that need to run efficiently. Just add a Create function (I assume the Update function will be unnecessary after you've weeded out the duplicates in tip 1).
Some tips I'm not sure will help your specific situation:
Use LINQ instead of creating command strings. LINQ automatically fine-tunes your queries. However, suddenly switching to LINQ is not something you can do in the blink of an eye, so you'll need to weigh the effort against how much you need it.
I know you said there is no Excel on the database server, but you can have the database process .csv files instead; no installed software is needed for CSV files. You can look into the following: http://dev.mysql.com/doc/refman/5.1/en/load-data.html
I'm building a system that reads 5 CSV files each month. These files are supposed to follow a certain format and ordering. I have one master table and 5 temporary tables. Each CSV file is read first and then bulk inserted into its corresponding temporary table. After bulk inserting the 5 csv files into their respective temporary tables I once again insert all the records from the temporary table to the master table. This makes sure that all files are uploaded first before inserting the data to the master table.
I built this system using ASP.net and during debugging and testing everything went fine. The problem occurs whenever I deploy the application to a production server. After I deployed the application I used the same csv files I uploaded during development and testing and the system shows a data conversion error from string to date time format.
I tried many things to fix this but it seems the problem still persist. I tried changing the collation of the production database to the same one I used during development. I also tried changing some regional settings in the production server but it still doesn't work.
I thought maybe I can handle this programmatically and instead of bulk inserting from the temporary tables to the master table I would write some kind of a for loop that would insert each record manually to the master table, but then I suppose it would create a performance issue since I'll be inserting around 100,000 records each time.
I wonder if anyone has faced a similar issue during deployment. It still seems weird to me that the behaviour of the application changed after deployment.
Following is the portion of the code that uploads the inventory.csv file to the server, bulk inserts the CSV into a temporary table TB_TEMP_INVENTORY, and then inserts the records from the temp table into the master table TB_CATTLE. This is done for 4 other files as well and is almost identical to this.
OleDbConnection conn = new OleDbConnection(ConfigurationManager.AppSettings["LivestockConnectionString"]);
OleDbCommand comm;
OleDbDataAdapter adapter;
DataTable table = new DataTable();
string file = string.Empty;
string content = string.Empty;
StreamReader reader;
StreamWriter writer;
string month = monthDropDownList.SelectedValue;
string year = yearDropDownList.SelectedItem.Text;
// upload inventory file
file = System.IO.Path.GetFileName(inventoryFileUpload.PostedFile.FileName);
inventoryFileUpload.PostedFile.SaveAs("C://LivestockCSV//" + file);
// clean inventory file
file = "C://LivestockCSV//" + file;
reader = new StreamReader(file);
content = reader.ReadToEnd();
reader.Close();
// apply all replacements to the same content, then write it back once
content = content.Replace("\"", "");        // remove quotation marks
content = content.Replace(",NULL,", ",,");  // remove NULLs
content = content.Replace(",0,", ",,");     // remove 0 dates
content = content.Replace(",0", ",");       // remove 0 dates at end of line
writer = new StreamWriter(file);
writer.Write(content);
writer.Close();
try
{
conn.Open();
comm = new OleDbCommand("TRUNCATE TABLE TB_TEMP_INVENTORY", conn); // clear temp table
comm.ExecuteNonQuery();
// bulk insert from csv to temp table
comm = new OleDbCommand(@"SET DATEFORMAT DMY;
    BULK INSERT TB_TEMP_INVENTORY
    FROM '" + file + "'" +
    @" WITH
    (
        FIELDTERMINATOR = ',',
        ROWTERMINATOR = '\n'
    )", conn);
comm.ExecuteNonQuery();
// check if data for same month exists in cattle table
comm = new OleDbCommand(@"SELECT *
    FROM TB_CATTLE
    WHERE Report='Inventory' AND Month=" + month + " AND Year=" + year, conn);
if (comm.ExecuteScalar() != null)
{
comm = new OleDbCommand(@"DELETE
    FROM TB_CATTLE
    WHERE Report='Inventory' AND Month=" + month + " AND Year=" + year, conn);
comm.ExecuteNonQuery();
}
// insert into master cattle table
comm = new OleDbCommand(@"SET DATEFORMAT MDY;
    INSERT INTO TB_CATTLE(ID, Sex, BirthDate, FirstCalveDate, CurrentUnit, OriginalCost, AccumulatedDepreciation, WrittenDownValue, NetRealizableValue, CapitalGainLoss, Month, Year, Report, Locked, UploadedBy, UploadedAt)
    SELECT DISTINCT ID, Sex, BirthDate, FirstCalveDate, CurrentUnit, 0, 0, 0, 0, 0, " + month + ", " + year + @", 'Inventory', 0, 'Admin', '" + DateTime.Now + @"'
    FROM TB_TEMP_INVENTORY", conn);
comm.ExecuteNonQuery();
conn.Close();
}
catch (Exception ex)
{
ClientScript.RegisterStartupScript(typeof(string), "key", "<script>alert('" + ex.Message + "');</script>");
return;
}
You don't specify how you are doing the insert, but a reasonable option here would be something like SqlBulkCopy, which can take either a DataTable or an IDataReader as input; this would give you ample opportunity to massage the data - either in-memory (DataTable), or via the streaming API (IDataReader), while still using an efficient import. CsvReader is a good option for loading the CSV.
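For illustration, a very rough SqlBulkCopy sketch that parses the cleaned CSV into a DataTable and streams it into the staging table; the column list, the "dd/MM/yyyy" date format and connectionString are assumptions, and parsing the dates explicitly in C# takes the server's DATEFORMAT/locale out of the picture:
// Sketch: needs using System.Data, System.Data.SqlClient, System.Globalization, System.IO.
var table = new DataTable();
table.Columns.Add("ID", typeof(string));
table.Columns.Add("Sex", typeof(string));
table.Columns.Add("BirthDate", typeof(DateTime));
table.Columns.Add("FirstCalveDate", typeof(DateTime));
table.Columns.Add("CurrentUnit", typeof(string));

foreach (string line in File.ReadLines(file))
{
    string[] parts = line.Split(',');
    DataRow row = table.NewRow();
    row["ID"] = parts[0];
    row["Sex"] = parts[1];
    // Parse dates explicitly so server settings no longer matter.
    row["BirthDate"] = DateTime.ParseExact(parts[2], "dd/MM/yyyy", CultureInfo.InvariantCulture);
    row["FirstCalveDate"] = string.IsNullOrEmpty(parts[3])
        ? (object)DBNull.Value
        : DateTime.ParseExact(parts[3], "dd/MM/yyyy", CultureInfo.InvariantCulture);
    row["CurrentUnit"] = parts[4];
    table.Rows.Add(row);
}

using (var conn = new SqlConnection(connectionString))
using (var bulk = new SqlBulkCopy(conn))
{
    bulk.DestinationTableName = "TB_TEMP_INVENTORY";
    // Add bulk.ColumnMappings entries if the column order differs from the table.
    conn.Open();
    bulk.WriteToServer(table);
}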
The other option is to use a very basic insert into the staging table, and massage the data via TSQL code.
Regarding why it has changed between dev and production, the most likely answers are:
the data you used in dev was not representative
there is an environmental/configuration difference between the two
1) Check SQL Server LANGUAGE and DATEFORMAT settings for dev/testing & production env.:
DBCC USEROPTIONS
2) What date format is used in CSV files (source) ?
3) What data type is used for date/time field (destination) ?
DECLARE @v VARCHAR(10) = '2010-08-23';
SET DATEFORMAT mdy;
SELECT CAST(@v AS DATETIME)
    ,CAST(@v AS DATE)
    ,YEAR(CAST(@v AS DATETIME))
    ,MONTH(CAST(@v AS DATETIME))
    ,DAY(CAST(@v AS DATETIME));
SET DATEFORMAT dmy;
SELECT CAST(@v AS DATETIME)
    ,CAST(@v AS DATE);