I am doing some reading and came across advice to avoid an internalStore if my application does not need to massage the data before it is sent to SQL. What is a data massage?
Manipulate, process, alter, recalculate. In short, if you are just moving the data in raw form then there is no need for an internalStore, but if you're doing anything to it prior to storage, then you might want an internalStore.
Sometimes the whole process of moving data is referred to as "ETL" meaning "Extract, Transform, Load". Massaging the data is the "transform" step, but it implies ad-hoc fixes that you have to do to smooth out problems that you have encountered (like a massage does to your muscles) rather than transformations between well-known formats.
Things that you might do to "massage" data include the following (a rough sketch of a few of these follows the list):
Change formats from what the source system emits to what the target system expects, e.g. change date format from d/m/y to m/d/y.
Replace missing values with defaults, e.g. supply "0" when a quantity is not given.
Filter out records that are not needed in the target system.
Check validity of records, and ignore or report on rows that would cause an error if you tried to insert them.
Normalise data to remove variations that should be the same, e.g. replace upper case with lower case, replace "01" with "1".
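Purely as illustration, here is a rough C# sketch of a few of those massaging steps applied to a single record. The RawOrder/CleanOrder shapes, field names, and date formats are invented for this example, not taken from any particular system.

using System;
using System.Globalization;

// Hypothetical shapes for a raw source record and its cleaned form.
record RawOrder(string OrderDate, string Quantity, string ProductCode);
record CleanOrder(string OrderDate, int Quantity, string ProductCode);

static class Massager
{
    // Massage one record into the shape the target system expects.
    // Returns null for rows that should be filtered out or that fail validation.
    public static CleanOrder? Massage(RawOrder raw)
    {
        // Change format: the source emits d/M/yyyy, the target expects M/d/yyyy.
        if (!DateTime.TryParseExact(raw.OrderDate, "d/M/yyyy",
                CultureInfo.InvariantCulture, DateTimeStyles.None, out var date))
            return null;                                  // invalid row: ignore or report it

        // Replace a missing value with a default.
        int quantity = int.TryParse(raw.Quantity, out var q) ? q : 0;

        // Normalise: lower-case the code and turn "01" into "1".
        string code = raw.ProductCode.Trim().ToLowerInvariant().TrimStart('0');

        // Filter: records with no product code are not needed in the target system.
        if (code.Length == 0) return null;

        return new CleanOrder(date.ToString("M/d/yyyy"), quantity, code);
    }
}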
Clean up, normalization, filtering, ... Just changing the data somehow from the original input into a form that is better suited to your use.
And finally there is the less savory practice of massaging the data by throwing out data points (or adjusting the numbers) when they don't give you the answer you want. Unfortunately, people doing statistical analysis often massage the data to get rid of those pesky outliers which disprove their theory. Because of this practice, referring to data cleaning as massaging the data can be inappropriate.
Cleaning the data to make it something that can go into your system (getting rid of meaningless dates like 02/30/2009 because someone else stored them in varchar instead of as dates, separating first and last names into separate fields, fixing all-uppercase data, adding default values for required fields when the supplied data isn't given, etc.) is one thing; massaging the data implies adjusting it inappropriately.
Also, to comment on the idea that it is bad to have an internal store if you are not changing any data: I strongly disagree with this (and I have loaded thousands of files from hundreds of sources through the years). In the first place, there is virtually no data that doesn't need to at least be examined for cleaning. And the fact that it was OK on the first run doesn't guarantee that a year later it won't be putting garbage into your system. Loading any file without first putting it into a staging table and cleaning it is simply irresponsible.
Also, we find it easier to research issues with data if we can easily see the contents of the file we loaded in a staging table. Then we can pinpoint exactly which file/source gave us the data in question, and that resolves many issues where the customer thinks we are loading bad information that they actually sent us to load. In fact, we always use two staging tables: one for the raw data as it came in from the file and one for the data after cleaning but before loading to the production tables. As a result I can resolve issues in seconds or minutes that would take hours if I had to go back and search through the original files. One thing you can guarantee is that if you are importing data, there will be times when the content of that data is questioned.
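To make the two-staging-table idea concrete, here is a rough C# sketch using SqlBulkCopy. The connection string, file path, table and column names are all invented, and the "cleaning" step is deliberately trivial; real cleaning would do the kind of massaging described above.

using System.Data;
using System.Data.SqlClient;
using System.IO;

// Bulk-load the file verbatim into a raw staging table, exactly as it arrived.
var raw = new DataTable("Staging_Raw");
raw.Columns.Add("SourceFile", typeof(string));
raw.Columns.Add("RawLine", typeof(string));

foreach (var line in File.ReadLines(@"C:\inbound\orders_20240101.csv"))
    raw.Rows.Add("orders_20240101.csv", line);

using var conn = new SqlConnection("Server=.;Database=Imports;Integrated Security=true");
conn.Open();

using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "Staging_Raw" })
{
    bulk.ColumnMappings.Add("SourceFile", "SourceFile");
    bulk.ColumnMappings.Add("RawLine", "RawLine");
    bulk.WriteToServer(raw);
}

// Cleaning happens in SQL, where it is easy to inspect; rejected rows stay behind in Staging_Raw.
using var clean = new SqlCommand(@"
    INSERT INTO Staging_Clean (SourceFile, RawLine)
    SELECT SourceFile, LTRIM(RTRIM(RawLine))
    FROM Staging_Raw
    WHERE RawLine IS NOT NULL AND RawLine <> '';", conn);
clean.ExecuteNonQuery();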
Related
I have a C# tool that parses a collection of csv files to construct a List&lt;MyObject&gt;. This collection can be small, limited to 20 files, or can be as large as 10,000+ files. MyObject itself has about 20 properties, most of them strings. Each file can create sometimes up to 4 items in the list and sometimes as many as 300.
After the parsing is done, I first save the list to a csv file so I don't have to reparse the data again later. I then summarize the data by one pivot of the dataset, and there are multiple pivots of the dataset the user can choose. The data is presented in WPF, and the user acts on the data and annotates it with some additional information that then gets added to the MyObject. Finally the user can save all of this information to another csv file.
I ran into OOMs when the files got large and have optimized some of my code. First I realized I was storing one parameter, i.e. the path to the csv file, which was sometimes close to 255 characters; I changed it to only save the filename and things improved slightly. I then discovered a suggestion to compile to x64, which would give me 4 GB of memory instead of 2 GB.
Even with this, I obviously hit OOMs as more and more files are added to this data set.
Some of the options I've considered are:
When parsing the files, save to the intermediate .csv file after each file is parsed and don't keep the list in memory. This will help me avoid hitting an OOM before I even get to save the intermediate .csv file.
Problem with this approach is I still have to load back the intermediate file into memory once the parsing is all done.
Some of the properties on MyObject are the same for a collection of files, so I've considered refactoring the single object into multiple objects, which could reduce the number of items in the List. Essentially it would be refactoring to a List&lt;MyTopLevelDetailsObject&gt;, with each MyTopLevelDetailsObject containing a List of the detail items. The memory footprint should theoretically be reduced. I can then output this to csv by doing some translation to make it appear like a single object.
Move the data to a database like MongoDB internally and push the summarization logic down to the database.
Use DataTables instead.
Options 2 and 3 would be a significant redesign, with 3 also requiring me to learn MongoDB. :)
I'm looking for some guidance and helpful tips on how large data sets have been handled.
Regards,
LW
If, after optimizations, the data can't fit in memory, almost by definition you need it to hit the disk.
Rather than reinvent the wheel and create a custom data format, it's generally best to use one of the well-vetted solutions. MongoDB is a good choice here, as are other database solutions. I'm fond of SQLite, which, despite the name, can handle large amounts of data and doesn't require a local server.
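As a rough illustration of the SQLite route (a sketch, not a drop-in solution): the snippet below assumes the Microsoft.Data.Sqlite package and invents the input folder, table, and columns. The point is simply that each parsed row goes straight to disk instead of into an ever-growing List&lt;MyObject&gt;.

using System.IO;
using Microsoft.Data.Sqlite;

using var connection = new SqliteConnection("Data Source=parsed_items.db");
connection.Open();

using (var create = connection.CreateCommand())
{
    create.CommandText = "CREATE TABLE IF NOT EXISTS Items (SourceFile TEXT, RawLine TEXT)";
    create.ExecuteNonQuery();
}

using var transaction = connection.BeginTransaction();
using var insert = connection.CreateCommand();
insert.Transaction = transaction;
insert.CommandText = "INSERT INTO Items (SourceFile, RawLine) VALUES ($file, $line)";
var fileParam = insert.Parameters.Add("$file", SqliteType.Text);
var lineParam = insert.Parameters.Add("$line", SqliteType.Text);

foreach (string path in Directory.EnumerateFiles(@"C:\input", "*.csv"))
{
    foreach (string line in File.ReadLines(path))     // streamed; never all in memory at once
    {
        fileParam.Value = path;
        lineParam.Value = line;
        insert.ExecuteNonQuery();                      // each row goes to disk, not into a list
    }
}
transaction.Commit();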
If you ever get to the point where fitting the data on a local disk is a problem, you might consider moving on to large data solutions like Hadoop. That's a bigger topic, though.
Options two and four probably can't help you because (as I see it) they won't reduce the total amount of information in memory.
Also consider the option of loading data dynamically. I mean, the user probably can't see all the data at any one moment. So you may load a part of the .csv into memory and show it to the user; then, if the user makes some annotations/edits, you may save that chunk of data to a separate file. If the user scrolls through the data, you load it on the fly. When the user wants to save the final .csv, you combine it from the original one and your little saved chunks.
This is a common practice when creating C# desktop applications that access large amounts of data. For example, I adopted loading data in chunks on the fly when I needed to create WinForms software to operate on a huge database (tables with more than 10m rows; they won't fit into a mediocre office PC's memory).
And yes, it's too much work to do this with .csv manually. It's easier to use some database to handle the saving/loading of edited parts and the composition of the final output.
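A very rough sketch of the chunked-loading idea in C#. The MyObject stand-in, the split logic, and the page size are invented (the real MyObject has ~20 properties); the key point is that File.ReadLines streams the intermediate .csv lazily, so only the page the user is looking at is ever materialised.

using System.Collections.Generic;
using System.IO;
using System.Linq;

// Stand-in for the poster's ~20-property class.
record MyObject(string FileName, string Payload);

static class CsvPager
{
    // ReadLines streams the file; nothing beyond the current line is kept alive.
    // Real parsing would be more defensive about malformed lines.
    public static IEnumerable<MyObject> ReadItems(string path) =>
        File.ReadLines(path).Select(line =>
        {
            var parts = line.Split(',');
            return new MyObject(parts[0], parts[1]);
        });

    // Only one page of rows is materialised for the UI at a time.
    public static List<MyObject> GetPage(string path, int pageIndex, int pageSize) =>
        ReadItems(path).Skip(pageIndex * pageSize).Take(pageSize).ToList();
}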
I'm working on an application that imports video files and lets the user browse them and filter them based on various conditions. By importing I mean creating instances of my VideoFile model class and storing them in a DB table. Once hundreds of files are there, the user wants to browse them.
Now, the first choice they have in the UI is to select a DateRecorded, which calls a GetFilesByDate(Date date) method on my data access class. This method will query the SQL database, asking only for files with the given date.
On top of that, I need to filter files by, let's say, FrameRate, Resolution or UserRating. This would place additional criteria on the files already filtered by their date. I'm deciding which road to take:
Only query the DB for a new set of files when the desired DateRecorded changes. Handle all subsequent filtering manually in C# code, by iterating over the stored collection of _filesForSelectedDay and testing them against current additional rules.
Query the DB each time any little filter changes, asking for a smaller and very specific set of files more often.
Which one would you choose, or even better, any thoughts on pros and cons of either of those?
Some additional points:
A query in GetFilesByDate is expected to return tens of items, so it's not very expensive to store the result in a collection always sitting in memory.
Later down the road I might want to select files not just for a specific day, but let's say for the entire month. This may give hundreds or thousands of items. This actually makes me lean towards option two.
The data access layer is not yet implemented. I just have a dummy class implementing the required interface, but storing the data in an in-memory collection instead of working with any kind of DB.
Once I'm there, I'll almost certainly use SQLite and store the database in a local file.
Personally, I'd always go to the DB every time until it proves impractical. If it's a small amount of data, then the overhead should also be small; when it gets larger, the DB comes into its own. It's unlikely you will be able to write code better than the DB, although the round trip can cost you. With the DB, your data will always be consistent and up to date.
If you find you are hitting the DB too hard, then you can try caching your data and working out whether you already have some or all of the data being requested, to save time. However, then you have ageing and consistency problems to deal with. You also then have servers with memory stuffed full of data that could be used for other things!
Basically, until it becomes an issue, just use the DB and use your energy on the actual problems you encounter, not the maybes.
If you've already fetched a bunch of data to begin with, there's no need to query the DB again for a subset of that set. Just store it in an object which you can query as the user refines the search.
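For what option one might look like in code, here is a hedged sketch that refines the already-fetched day's files in memory. The VideoFile record is a stand-in carrying only the properties mentioned in the question, and the nullable filter arguments are illustrative, not a prescribed interface.

using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in with only the fields mentioned in the question.
record VideoFile(DateTime DateRecorded, double FrameRate, string Resolution, int UserRating);

static class FileFilter
{
    // filesForSelectedDay comes from one GetFilesByDate call; everything else is in-memory LINQ.
    public static IEnumerable<VideoFile> Refine(IEnumerable<VideoFile> filesForSelectedDay,
                                                double? frameRate, string? resolution, int? minRating)
    {
        var result = filesForSelectedDay;
        if (frameRate is not null)             result = result.Where(f => f.FrameRate == frameRate.Value);
        if (!string.IsNullOrEmpty(resolution)) result = result.Where(f => f.Resolution == resolution);
        if (minRating is not null)             result = result.Where(f => f.UserRating >= minRating.Value);
        return result;
    }
}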
I'm currently loading a lot of XML files (thousands of files, from 1KB to 6MB) into destination databases. Currently, I'm using the SQLXMLBULKLOAD COM object. One of the biggest problems I'm having is that the COM object doesn't always play nicely within our transactional environment. There are other problems too, such as performance; the process really begins choking on files approaching ~2MB, taking several minutes, if not longer in some cases (hours), to load into the tables.
So now I'm looking for an alternative, of which there seems to be two flavors:
1) Something like OPENXML, where XML is inserted as XML data into SQL Server
or
2) Solutions that parse the XML in memory, and load as rowsets into the database.
There are drawbacks to either approach, and I know I'm going to have to start doing some benchmarking of prototype solutions before I jump to any conclusions. The OPENXML approach is very attractive IMO, mainly because it promises some really good performance numbers (others claim to have gone from hours to milliseconds). But it has the drawback of storing data as XML, which is not ideal in my particular scenario since the destination tables already exist, and many other processes rely on queries and SPROCs that query these tables as normal rowset data.
Whatever solution I choose, I must meet the following requirements:
1) Must accept any XML file. Clients (in a business sense) need only provide an XSD, and an appropriate destination database/table(s) for the data.
2) Individual files (never larger than ~6MB) must be processed in under 1 minute (ideally even much quicker than that).
3) Inserted data must be able to accommodate existing queries and SPROCs (i.e., it must ultimately end up as normal rowset data).
So my question is, do you have any experience in this situation, and what are your thoughts and insights?
I am not completely opposed to an OPENXML-like solution, as long as the data can end up as normal rowset data at some point. I am also interested in 3rd-party solutions you may have experience with; this is an important part of our process, and we are willing to spend some $ if it gives us the best and most stable solution.
I'm also not opposed to "roll-your-own" suggestions, or things on CodePlex, etc. I came across the LINQ to XSD project but couldn't find much documentation about what its capabilities are (just as an example of the kind of thing I'm interested in).
I would revisit the performance issues you are having with the SQLXMLBULKLOAD COM object. I have used this component to load 500MB XML files before. Can you post the code you are using to invoke the component?
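For reference, and only as a sketch of what such an invocation typically looks like (not your code): the ProgID shown is for SQLXML 4.0, the connection string and file paths are illustrative, and the XSD is assumed to carry the usual sql:relation/sql:field mappings to the existing tables.

using System;

// Late-bound COM so no interop assembly reference is needed.
var comType = Type.GetTypeFromProgID("SQLXMLBulkLoad.SQLXMLBulkload.4.0");
dynamic bulkLoad = Activator.CreateInstance(comType);

bulkLoad.ConnectionString =
    "Provider=SQLOLEDB;Data Source=(local);Initial Catalog=TargetDb;Integrated Security=SSPI";
bulkLoad.ErrorLogFile = @"C:\temp\bulkload_errors.xml";
bulkLoad.Transaction  = false;   // worth checking: this flag is one place the COM object and ambient transactions clash

// Mapping schema (XSD) plus the data file.
bulkLoad.Execute(@"C:\schemas\client_mapping.xsd", @"C:\incoming\client_data.xml");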
I need to analyze tens of thousands of lines of data. The data is imported from a text file. Each line of data has eight variables. Currently, I use a class to define the data structure. As I read through the text file, I store each line's object in a generic List&lt;T&gt;.
I am wondering if I should switch to using a relational database (SQL), as I will need to analyze the data in each line of text, trying to relate it to definition terms which I also currently store in generic lists (List&lt;T&gt;).
The goal is to translate a large amount of data using definitions. I want the defined data to be filterable, searchable, etc. Using a database makes more sense the more I think about it, but I would like to confirm with more experienced developers before I make the changes, yet again (I was using structs and arraylists at first).
The only drawback I can think of, is that the data does not need to be retained after it has been translated and viewed by the user. There is no need for permanent storage of data, therefore using a database might be a little overkill.
It is not absolutely necessary to go to a database. It depends on the actual size of the data and the processing you need to do. If you are loading the data into a List&lt;T&gt; with a custom class, why not use LINQ to do your querying and filtering? Something like:
// fooList is the List&lt;Foo&gt; the data was loaded into
var query = from foo in fooList
            where foo.Prop == criteriaVar
            select foo;
The real question is whether the data is so large that it cannot be loaded into memory comfortably. If that is the case, then yes, a database would be much simpler.
This is not a large amount of data. I don't see any reason to involve a database in your analysis.
There IS a query language built into C# -- LINQ. The original poster currently uses a list of objects, so there is really nothing left to do. It seems to me that a database in this situation would add far more heat than light.
It sounds like what you want is a database. Sqlite supports in-memory databases (use ":memory:" as the filename). I suspect others may have an in-memory mode as well.
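A minimal sketch of the ":memory:" idea, assuming the Microsoft.Data.Sqlite package (table and column names are made up). The database lives only as long as the connection, which fits the "no permanent storage" requirement in the question.

using System;
using Microsoft.Data.Sqlite;

using var conn = new SqliteConnection("Data Source=:memory:");
conn.Open();                                       // the database exists only while conn is open

using (var create = conn.CreateCommand())
{
    create.CommandText = "CREATE TABLE Lines (Term TEXT, Value REAL)";
    create.ExecuteNonQuery();
}

// ... insert the parsed lines here, then filter/search with plain SQL:
using (var query = conn.CreateCommand())
{
    query.CommandText = "SELECT Term, COUNT(*) FROM Lines GROUP BY Term";
    using var reader = query.ExecuteReader();
    while (reader.Read())
        Console.WriteLine($"{reader.GetString(0)}: {reader.GetInt64(1)}");
}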
I faced the same problem you're facing now while I was working at my previous company. I was looking for a solid, concrete solution for a lot of bar-code-generated files; the bar code reader generates a text file with thousands of records in a single file. Manipulating and presenting the data was very difficult for me at first. Based on the records, what I programmed was a class that reads the file, loads the data into a data table, and saves it to a database. The database I used was SQL Server 2005. Then I was able to manage the saved data easily and present it whichever way I liked. The main point is to read the data from the file and save it to the database. If you do so, you will have a lot of options to manipulate and present it the way you like.
If you do not mind using Access, here is what you can do:
Attach a blank Access db as a resource
When needed, write the db out to file.
Run a CREATE TABLE statement that handles the columns of your data
Import the data into the new table
Use sql to run your calculations
On close, delete that Access db.
You can use a program like Resourcer to load the db into a resx file
// requires "using System.Resources;" at the top of the file
ResourceManager res = new ResourceManager("MyProject.blank_db", this.GetType().Assembly);
byte[] b = (byte[])res.GetObject("access.blank");
Then use the code above to pull the resource out of the project. Take the byte array and save it to the temp location with the temp filename.
"MyProject.blank_db" is the location and name of the resource file
"access.blank" is the tab given to the resource to save
If the only thing you need to do is search and replace, you may consider using sed and awk, and you can do searches using grep. Of course, that's on a Unix platform.
From your description, I think linux command line tools can handle your data very well. Using a database may unnecessarily complicate your work. If you are using windows, these tools are also available by different ways. I would recommend cygwin. The following tools may cover your task: sort, grep, cut, awk, sed, join, paste.
These Unix/Linux command line tools may look scary to a Windows person, but there are reasons why people love them. The following are my reasons for loving them:
They allow your skill to accumulate - your knowledge of a particular tool can be helpful in different future tasks.
They allow your efforts to accumulate - the command line (or scripts) you used to finish the task can be repeated as many times as needed with different data, without human interaction.
They usually outperform the same tool you could write yourself. If you don't believe it, try to beat sort with your own version on terabyte files.
Background:
I have one Access database (.mdb) file with half a dozen tables in it. This file is ~300MB, so not huge, but big enough that I want to be efficient. In it there is one major table, a client table. The other tables store data like consultations made, a few extra many-to-one related fields, that sort of thing.
Task:
I have to write a program to convert this Access database to a set of XML files, one per client. This is a database conversion application.
Options:
(As I see it)
1) Load the entire Access database into memory in the form of Lists of immutable objects, then use LINQ to do lookups in these lists for the associated data I need.
Benefits:
Easily parallelised. Start up a ThreadPool thread for each client. Because all the objects are immutable, they can be freely shared between the threads, which means all threads have access to all data at all times, and it is all loaded exactly once.
(Possible) Cons:
May use extra memory, loading orphaned items, items that aren't needed anymore, etc.
2) Use Jet to run queries on the database to extract data as needed.
Benefits:
Potentially lighter weight. Only loads data that is needed, and as it is needed.
(Possible) Cons:
Potentially heavier! May load items more than once and hence use more memory.
Possibly hard to parallelise, unless Jet/OleDb supports concurrent queries (can someone confirm/deny this?).
Some other idea?
What are StackOverflow's thoughts on the best way to approach this problem?
Generate XML parts from SQL. Store each fetched record in the file as you fetch it.
Sample:
SELECT '<Node><Column1>' + Column1 + '</Column1><Column2>' + Column2 + '</Column2></Node>' FROM MyTable
If your objective is to convert your database to xml files, you can then:
connect to your database through an ADO/OLEDB connection
successively open each of your tables as ADO recordsets
Save each of your recordsets as an XML file:
myRecordset.save myXMLFile, adPersistXML
If you are working from within the Access file itself, use CurrentProject.AccessConnection as your ADO connection.
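If you are doing this from C# rather than VBA, an ADO.NET take on the same idea (a sketch, not the classic-ADO call above) is to read each table into a DataSet and let WriteXml persist it. Paths and the table name are illustrative.

using System.Data;
using System.Data.OleDb;

using var conn = new OleDbConnection(
    @"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\data\clients.mdb");
var adapter = new OleDbDataAdapter("SELECT * FROM Clients", conn);

// Fill pulls the table into memory; WriteSchema keeps an inline XSD alongside the rows.
var ds = new DataSet("Clients");
adapter.Fill(ds, "Clients");
ds.WriteXml(@"C:\export\clients.xml", XmlWriteMode.WriteSchema);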
From the sounds of this, it would be a one-time operation. I strongly discourage loading the entire database into memory; that just does not seem like an efficient way of doing this at all.
Also, depending on your needs, you might be able to extract directly from Access -> XML if that is your true end game.
Regardless, with a database that small, doing them one at a time with a few specifically written queries would, in my opinion, be easier to manage, faster to write, and less error-prone.
I would lean towards Jet, since you can be more specific about what data you want to pull.
Also, I noticed the large file size; this is a problem I recently came across at work. Is this an Access 95 or 97 db? If so, converting the DB to 2000 or 2003 and then back to 97 will reduce the size; it seems to be a bug in some cases. The DB I was dealing with claimed to be 70MB; after I converted it to 2000 and back again, it was 8MB.