Using WinForms, C#, SQLite.
My current app takes in data from text files and stores it in three respective tables. It then uses these tables to produce a variety of output based on the user's selection.
Currently this app only deals with one text file, but I need to make it process hundreds of text files, i.e., read each text file's data, store it in tables, etc.
... Then I will have 3 tables multiplied by the hundreds of text files (3 tables for each file).
1) Is it possible to maintain this many tables in SQLite?
2) How do I ensure my tables don't just get overwritten by the next file's values? Can someone post sample code for how they would approach this?
SQLite has no limit on the number of tables.
Each table must have a unique name.
However, it would be a better idea to normalize your database, i.e., use a single table with an additional column that specifies the original file of the record.
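A minimal sketch of that normalized design, assuming the Microsoft.Data.Sqlite package (the database file, table, and column names here are illustrative, not from your app):

```csharp
using System.IO;
using Microsoft.Data.Sqlite;

string inputDir = "input";   // folder holding the text files (assumption)

using var conn = new SqliteConnection("Data Source=app.db");
conn.Open();

// One table for every file; SourceFile records where each row came from,
// so a new import never overwrites an earlier one.
using (var create = conn.CreateCommand())
{
    create.CommandText = @"
        CREATE TABLE IF NOT EXISTS Readings (
            Id         INTEGER PRIMARY KEY AUTOINCREMENT,
            SourceFile TEXT NOT NULL,
            Value      TEXT NOT NULL
        )";
    create.ExecuteNonQuery();
}

foreach (string path in Directory.GetFiles(inputDir, "*.txt"))
{
    using var tx = conn.BeginTransaction();   // one transaction per file keeps it fast
    using var insert = conn.CreateCommand();
    insert.Transaction = tx;
    insert.CommandText =
        "INSERT INTO Readings (SourceFile, Value) VALUES ($file, $value)";
    insert.Parameters.AddWithValue("$file", Path.GetFileName(path));
    var valueParam = insert.Parameters.AddWithValue("$value", "");
    foreach (string line in File.ReadLines(path))
    {
        valueParam.Value = line;   // reuse the prepared command per line
        insert.ExecuteNonQuery();
    }
    tx.Commit();
}
```

Queries for a single file's data then become `SELECT ... WHERE SourceFile = $file`, and nothing is overwritten by the next import.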
How can we upload a CSV file through the web (ASP.NET MVC, C#) and map the columns in the CSV to our tables in the database?
Example:
CSV File:
Username,Address,Roles,Date
How do I add all the values in the 'Username' column to the User table, Name column?
The values in the 'Address' column to the AddrDet table, Address column?
The values in the 'Roles' column to the RolesDet table, Roles column?
And how do I choose which CSV columns get added to the database? (So not every column in the CSV will be taken.)
Using ASP.NET MVC, C#.
All I know is that when the CSV is uploaded, a DataTable is created specially for it and every column in the CSV is uploaded.
Thank You
I'm using MVC and EF DB FIRST
This question is being marked as a duplicate of Upload CSV file to SQL server.
I don't feel (and don't think) that the question is related to, or covers exactly the same topic as, that one, so I'm answering it. I have myself flagged the question as too broad, as there is too much to explain.
I will also add some links to the answer; they are not there to pad it out, only to give the OP an idea of what questions/topics to look for himself.
Short explanation:
Usually when you want to import data (a CSV file) into a database, you already have the structure and schema of the data (and of the database): there is an existing TableA and TableB with certain columns. Dynamically creating new columns or updating the DB schema based on a CSV file is hard work and is not normally done.
A C#/ASP.NET application works by taking an input (a user's click, a data load, a task scheduler passing a time checkpoint) and then doing the work.
A typical job looks like: "We get data in this format; the app has to convert it to its internal representation (classes) and then insert it into the server." So you have to write an ASP.NET page where you allow the user to paste/upload the file. E.g.: File Upload ASP.NET MVC 3.0
Once you have loaded the file, you need to convert the CSV format (the format of the stored data) into your internal representation. That means creating your own class with some properties and converting (transforming) the CSV into instances of it. E.g.: Importing CSV data into C# classes
Once you have this data inside classes (objects, i.e. instances of classes), you can work with it and carry out the internal work. Here that means CRUD (Create/Read/Update/Delete) operations against the SQL database: first connect to the SQL server, choose the database, and then run the queries. E.g.: https://www.codeproject.com/Articles/837599/Using-Csharp-to-connect-to-and-query-from-a-SQL-da
Plenty of developers prefer not to write the queries themselves and like a more object-oriented approach to this sort of problem. They use an ORM (object-relational mapping), which lets them keep the same class/object schema in the database and in the application. One example for all is Entity Framework (EF). E.g.: http://www.entityframeworktutorial.net/
As you can see, this topic is not easy and requires knowledge of several areas of programming.
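As a rough sketch of the conversion step above (the class and column layout are assumptions based on the CSV header in the question):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Internal representation of one CSV row. Only the columns we care about
// (Username, Address, Roles) are kept; Date is deliberately ignored, which
// is how you take some CSV columns and not others.
public class UserRow
{
    public string Username { get; set; }
    public string Address  { get; set; }
    public string Roles    { get; set; }
}

public static class CsvImport
{
    // Naive parser for the "Username,Address,Roles,Date" layout in the question.
    // A real importer should use a CSV library to handle quoted commas.
    public static List<UserRow> Parse(string csvText)
    {
        return csvText
            .Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
            .Skip(1)                                   // skip the header row
            .Select(line => line.Split(','))
            .Where(parts => parts.Length >= 3)
            .Select(parts => new UserRow
            {
                Username = parts[0].Trim(),
                Address  = parts[1].Trim(),
                Roles    = parts[2].Trim()
            })
            .ToList();
    }
}
```

With EF (DB first), each `UserRow` can then be copied into the mapped entities for the User, AddrDet, and RolesDet tables and saved with `SaveChanges()`; the exact entity names depend on your generated model.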
I'm migrating data from one system to another and will be receiving a CSV file with the data to import. The file could contain up to a million records to import. I need to get each line in the file, validate it and put the data into the relevant tables. For example, the CSV would be like:
Mr,Bob,Smith,1 high street,London,ec1,012345789,work (this needs to be looked up in another table to get the ID)
There's a lot more data than this example in the real files.
So, the SQL would be something like this:
Declare @UserID int

Insert into [User]
Values ('Mr', 'Bob', 'Smith', '0123456789')

Set @UserID = SCOPE_IDENTITY()

Insert into Address
Values (@UserID, '1 high street', 'London', 'ec1',
        (select ID from AddressType where AddressTypeName = 'work'))
I was thinking of iterating over each row and calling a stored procedure, with the parameters from the file, that contains the SQL above. Would this be the best way of tackling this? It's not time-critical, as this will just be run once when updating the site.
I'm using C# and SQL Server 2008 R2.
What about loading it into a temporary table (note that this may be logically temporary, not necessarily technically) as staging, then processing it from there? This is standard ETL behaviour (and a million rows is tiny for ETL): you first stage the data, then clean it, then put it in its final place.
When performing tasks of this nature, you do not think in terms of looping through each record individually, as that would be a huge performance problem. Instead, you bulk insert the records into a staging table, or use the import wizard to load them into a staging table (watch out for the default of 50 characters, especially in the address field).

Then you write set-based code to do any clean-up you need: removing bad telephone numbers, zip codes, or email addresses; dropping records missing data in fields that are required in your database; or transforming data using lookup tables. Suppose you have a table with certain required values; those are likely not the same values you will find in this file, and you need to convert them. We use doctor specialties a lot, so our system might store a specialty as 'GP' while the file gives us 'General Practitioner'. You need to look at all the non-matching values for the field and then decide whether you can map them to existing values, whether you need to throw the record out, or whether you need to add more values to your lookup table.

Once you have got rid of the records you don't want and cleaned up those you can in your staging table, you import into the production tables. Inserts should be written using the SELECT form of INSERT, not the VALUES clause, when you are writing more than one or two records.
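A hedged sketch of that flow in C#, assuming Microsoft.Data.SqlClient (the table names, the mapping table, and the connection string are all assumptions): bulk insert into staging with SqlBulkCopy, then one set-based INSERT ... SELECT that applies the lookup-table translation:

```csharp
using System.Data;
using Microsoft.Data.SqlClient;

string connectionString = "Server=.;Database=Import;Integrated Security=true";

// 1) Bulk-load the parsed file into a staging table (all names illustrative).
var staging = new DataTable();
staging.Columns.Add("Specialty", typeof(string));
staging.Columns.Add("Phone", typeof(string));
staging.Rows.Add("General Practitioner", "0123456789");   // ...one row per CSV line...

using var conn = new SqlConnection(connectionString);
conn.Open();

using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "dbo.StagingDoctors" })
{
    bulk.WriteToServer(staging);
}

// 2) Clean and import with one set-based statement, not row-by-row calls.
//    The join against a mapping table converts file values
//    ('General Practitioner') to the system's values ('GP').
using var cmd = new SqlCommand(@"
    INSERT INTO dbo.Doctors (Specialty, Phone)
    SELECT m.SystemValue, s.Phone
    FROM dbo.StagingDoctors s
    JOIN dbo.SpecialtyMap m ON m.FileValue = s.Specialty
    WHERE s.Phone IS NOT NULL", conn);
cmd.ExecuteNonQuery();
```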
I am writing an internal application and one of the functions will be importing data from a remote system. The data from the remote system comes over as a CSV file. I need to compare the data in my system with that of the CSV file.
I need to apply any changes to my system (Adds and Changes). I need to track each field that is changed.
My database is normalized, so I'm dealing with about 10 tables that correspond to the data in the CSV file. What is the best way to implement this? Each CSV file has about 500,000 records that are processed daily. I started by querying row by row from my SQL database using a lookup ID, then using C# to do a field-by-field compare, updating or inserting as necessary; however, this takes way too long.
Any suggestions?
You can do the following:
Load the CSV file into a staging table in your db;
Perform validation and clean-up routines on it (if necessary);
Perform your comparisons and updates on your live data;
Wipe out all data from the staging table.
With that approach you can implement almost all of the clean-up, validation, and update logic using your RDBMS's functionality.
If your RDBMS is SQL Server you can leverage SQL Server Integration Services.
If you have anything that serves as a unique key, you can do the following:
Create a new table, Hashes, that contains a unique key and a hash of all fields associated with that key. (Do not use .NET's object.GetHashCode(), as the value it returns changes from time to time by design. I personally use Google's CityHash, which I ported to C#.)
When you get a new CSV file, compute the hash value for each key.
Check the Hashes table for each row in the CSV file.
If there is no entry for the unique key, create one and insert the row.
If there is an entry, see if the hash has changed.
If it has, update the hash in the Hashes table and update data.
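A sketch of the hashing step, using SHA-256 as a stand-in for the CityHash port mentioned above (any hash that is stable across runs works; this assumes .NET 5+ for `SHA256.HashData` and `Convert.ToHexString`):

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class RecordHasher
{
    // Stable hash of a record's fields. Unlike object.GetHashCode(), the
    // result never changes between runs, so it is safe to persist in the
    // Hashes table and compare against on later imports.
    public static string RowHash(params string[] fields)
    {
        // The unit separator prevents collisions like ("ab","c") vs ("a","bc").
        string joined = string.Join("\u001f", fields);
        byte[] digest = SHA256.HashData(Encoding.UTF8.GetBytes(joined));
        return Convert.ToHexString(digest);
    }
}
```

Per CSV row, look up the stored hash by the unique key: insert if absent, update both the data and the stored hash if the values differ, and skip the row entirely if they match.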
Expanding on the first comment to your question.
Create an appropriately indexed table that matches the format of your CSV file and dump the data straight into it.
Have a stored procedure with appropriate queries to update/delete/insert to the active tables.
Get rid of the temporary table.
I'm currently working on a project where we have a large data warehouse which imports several GB of data on a daily basis from a number of different sources. We have a lot of files with different formats and structures all being imported into a couple of base tables which we then transpose/pivot through stored procs. This part works fine. The initial import however, is awfully slow.
We can't use SSIS File Connection Managers as the columns can be totally different from file to file so we have a custom object model in C# which transposes rows and columns of data into two base tables; one for column names, and another for the actual data in each cell, which is related to a record in the attribute table.
Example - Data Files:
Example - DB tables:
The SQL insert is performed currently by looping through all the data rows and appending the values to a SQL string. This constructs a large dynamic string which is then executed at the end via SqlCommand.
The problem is that even a 1MB file takes about a minute to run, so large files (200MB etc.) take hours to process. I'm looking for suggestions on other ways to approach the insert that will improve performance and speed up the process.
There are a few things I can do with the structure of the loop to cut down on the string size and number of SQL commands present in the string but ideally I'm looking for a cleaner, more robust approach. Apologies if I haven't explained myself well, I'll try and provide more detail if required.
Any ideas on how to speed up this process?
The dynamic string is going to be SLOW. Each SQLCommand is a separate call to the database. You are much better off streaming the output as a bulk insertion operation.
I understand that all your files are different formats, so you are having to parse and unpivot in code to get it into your EAV database form.
However, because the output is in a consistent schema you would be better off either using separate connection managers and the built-in unpivot operator, or in a script task adding multiple rows to the data flow in the common output (just like you are currently doing in building your SQL INSERT...INSERT...INSERT for each input row) and then letting it all stream into a destination.
i.e. Read your data and in the script source, assign the FileID, RowId, AttributeName and Value to multiple rows (so this is doing the unpivot in code, but instead of generating a varying number of inserts, you are just inserting a varying number of rows into the dataflow based on the input row).
Then pass that through a lookup to get from AttributeName to AttributeID (erroring the rows with invalid attributes).
Stream straight into an OLEDB destination, and it should be a lot quicker.
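As a sketch of that unpivot-and-stream idea without full SSIS (the table and column names are assumptions), the C# side can collect the EAV rows in a DataTable and hand the whole batch to SqlBulkCopy instead of concatenating INSERT statements:

```csharp
using System.Data;
using Microsoft.Data.SqlClient;

string connectionString = "Server=.;Database=Warehouse;Integrated Security=true";

// Unpivot each parsed input row into (FileId, RowId, AttributeId, Value)
// rows in memory first.
var cells = new DataTable();
cells.Columns.Add("FileId", typeof(int));
cells.Columns.Add("RowId", typeof(int));
cells.Columns.Add("AttributeId", typeof(int));
cells.Columns.Add("Value", typeof(string));

cells.Rows.Add(1, 1, 42, "some cell value");   // ...one Add per cell of each data row...

using var conn = new SqlConnection(connectionString);
conn.Open();

using var bulk = new SqlBulkCopy(conn)
{
    DestinationTableName = "dbo.CellData",
    BatchSize = 10_000          // stream in chunks instead of one giant command
};
bulk.WriteToServer(cells);
```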
One thought - are you repeatedly going back to the database to find the appropriate attribute value? If so, switching the repeated queries to a query against a recordset that you keep at the clientside will speed things up enormously.
This is something I have done before - 4 reference tables involved. Creating a local recordset and filtering that as appropriate caused a speed up of a process from 2.5 hours to about 3 minutes.
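That client-side recordset can be as simple as a dictionary loaded once per run (the connection string and table/column names here are illustrative):

```csharp
using System.Collections.Generic;
using Microsoft.Data.SqlClient;

string connectionString = "Server=.;Database=Warehouse;Integrated Security=true";

using var conn = new SqlConnection(connectionString);
conn.Open();

// Load the reference table once per run...
var attributeIds = new Dictionary<string, int>();
using (var cmd = new SqlCommand("SELECT Name, Id FROM dbo.Attributes", conn))
using (var reader = cmd.ExecuteReader())
{
    while (reader.Read())
        attributeIds[reader.GetString(0)] = reader.GetInt32(1);
}

// ...then resolve each input row with a dictionary lookup instead of a
// database round trip.
string rawName = "General Practitioner";   // example value from an input row
if (!attributeIds.TryGetValue(rawName, out int attrId))
{
    // unknown attribute: log it, skip the row, or extend the lookup table
}
```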
Why not store whatever reference tables are needed within each database and perform all lookups on the database end? Or it may even be better to pass a table type into each database where keys are needed, store all reference data in one central database and then perform your lookups there.
I'm building an application to import data into a sql server 2008 Express db.
This database is being used by an application that is currently in production.
The data that needs to be imported comes from various sources, mostly excel sheets and xml files.
The database has the following tables:
tools
powertools
strikingtools
owners
Each row, or xml tag in the source files has information about 1 tool:
name, tooltype, weight, wattage, owner, material, etc...
Each of these rows has the name of the tool's owner; this name has to be inserted into the owners table, but only if it isn't already there.
For each of these rows a new row needs to be inserted in the tools table.
The tools table has a field owner_id with a foreign key to the owners table where the primary key of the corresponding row in the owners table needs to be set
Depending on the tooltype a new row must be created in either the powertools table or the strikingtools table. These 2 tables also have a tool_id field with a foreign key to the tools table that must be filled in.
The tools table has a tool_owner_id field with a foreign key to the owners table that must be filled in.
If any of the rows in the importfile fails to import for some reason, the entire import needs to be rolled back
Currently I'm using a DataSet to do this, but for some large files (over 200,000 tools) this requires quite a lot of memory. Can anybody think of a better approach?
There are two main issues to be solved:
Parsing a large XML document efficiently.
Adding a large amount of records to the database.
XML Parsing
Although the DataSet approach works, the whole XML document is loaded into memory. To improve the efficiency of working with large XML documents, you might want to look at the XmlReader class. The API is slightly more difficult to use than what DataSet provides, but you get the benefit of not loading the whole DOM into memory at once.
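For example, a streaming pass over the file might look like this (the `<tool>` element and its attribute names are guesses at the file layout; the real file may use child elements instead):

```csharp
using System.Xml;

// XmlReader streams the document, so only the current element is in
// memory rather than the whole DOM.
using var reader = XmlReader.Create("tools.xml");
while (reader.Read())
{
    if (reader.NodeType == XmlNodeType.Element && reader.Name == "tool")
    {
        string name  = reader.GetAttribute("name");
        string owner = reader.GetAttribute("owner");
        string type  = reader.GetAttribute("tooltype");
        // hand one tool at a time to the insert/batching logic
    }
}
```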
Inserting records to the DB
To satisfy your atomicity requirement you can use a single database transaction, but the large number of records you are dealing with makes a single transaction less than ideal. You will most likely run into issues like:
Database having to deal with a large number of locks
Database locks that might escalate from row locks to page locks and even table locks.
Concurrent use of the database will be severely affected during the import.
I would recommend the following instead of a single DB transaction:
See if it is possible to create smaller transaction batches, maybe 100 records at a time. Perhaps it is possible to logically load sections of the XML file together, where it would be acceptable to load a subset of the data as a unit into the system.
Validate as much of your data upfront. E.g. Check that required fields are filled or that FK's are correct.
Make the upload repeatable. Skip over existing data.
Provide a manual undo strategy. I know this is easier said than done, but might even be required as an additional business rule. For example the upload was successful but someone realises a couple of hours later that the wrong file was uploaded.
It might be useful to upload your data to an initial staging area in your DB to perform validations and to mark which records have been processed.
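The batching recommendation above can be sketched like this (`Tool`, `InsertBatch`, `ReadToolsStreaming`, and the batch size are all placeholders for your own types and insert logic):

```csharp
using System.Collections.Generic;

public record Tool(string Name, string Owner, string ToolType);  // placeholder record type

public static class ToolImporter
{
    const int BatchSize = 100;   // commit in small batches, not one huge transaction

    public static void Import(IEnumerable<Tool> tools)
    {
        var batch = new List<Tool>(BatchSize);
        foreach (var tool in tools)          // e.g. streamed from an XmlReader
        {
            batch.Add(tool);
            if (batch.Count == BatchSize)
            {
                InsertBatch(batch);          // one transaction around these rows
                batch.Clear();
            }
        }
        if (batch.Count > 0)
            InsertBatch(batch);              // commit the final partial batch
    }

    static void InsertBatch(List<Tool> batch)
    {
        // hypothetical: open a transaction, insert owner (if new), tool, and
        // powertools/strikingtools rows, then commit; roll back on failure
        // and record the batch so the upload is repeatable.
    }
}
```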
Use SSIS, and create an ETL package.
Use transactions for the rollback feature, and stored procedures that handle creating/checking the foreign keys.