At my work we load Excel files and save them in the database.
This is basically the flow:
We import data from an Excel file into a DataSet, where each sheet is loaded into its own DataTable. After populating the DataSet, I want to validate the data inside it, let's say the first DataTable. I get XML from the DataTable using its WriteXml() method and load that XML into an XDocument. I then call the XDocument's Validate() method with a predefined XSD, which is loaded into an XmlSchemaSet object.
The problem is that the dates in the Excel files are stored in a format different from the xs:dateTime format in the XSD.
We get Excel files with datetime columns formatted like this: '12/01/2015 12:44:45', whereas the xs:dateTime format in the XSD should be like this: '2015-01-12T12:44:45'.
Is it possible to define custom dateTime format in an xsd file?
For example, instead of '2015-01-12T12:44:45', I would like it to be '12/01/2015 12:44:45', so my xml element would look like this:
<createDate>12/01/2015 12:44:45</createDate>
In addition, I wouldn't mind if the time part were ignored altogether.
Another custom format I need to validate is a grouped decimal like this: 378,216.00
Is it possible to define that in my XSD file?
Here is the code where we validate the XML retrieved from the DataTable:
public string[] ValidateExcelFromXsdFile(string schemaUri)
{
    _validationErrors.Clear();

    var schemas = new XmlSchemaSet();
    schemas.Add("", schemaUri);

    var doc = XDocument.Parse(GetXml(_dataSetFromExcel.Tables[0]));
    doc.Validate(schemas, (sender, args) => _validationErrors.Add(args.Message));

    return _validationErrors.ToArray();
}
You can define a pattern for strings in the format 'dd/MM/yyyy hh:mm:ss' using a regular expression, but the resulting value won't be an xs:dateTime, and checking for full calendar validity (leap years etc.) is a bit of a nightmare: it can be done, but it leads to a regular expression that's about a mile long.
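For reference, a pattern-based restriction for the two custom formats might look like the sketch below. The type names are invented for illustration, and the regexes only check the shape of the value, not calendar validity (leap years, days per month):

```xml
<!-- Sketch only: shape checks, not full date validation -->
<xs:simpleType name="slashDateTime">
  <xs:restriction base="xs:string">
    <xs:pattern value="\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}"/>
  </xs:restriction>
</xs:simpleType>

<!-- Matches grouped decimals such as 378,216.00 -->
<xs:simpleType name="groupedDecimal">
  <xs:restriction base="xs:string">
    <xs:pattern value="\d{1,3}(,\d{3})*\.\d{2}"/>
  </xs:restriction>
</xs:simpleType>
```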
A better solution here might be the transform-then-validate pattern, where you preprocess the input document into the standard XSD lexical format before validating it. You can even do some of the validation during the preprocessing phase if you choose.
The Saxon schema processor has a preprocess facet which allows you to declare some rearrangement of a value prior to schema processing, which is exactly what you need here (for both of your use cases), but unfortunately it's not standard.
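A minimal sketch of the transform-then-validate idea in C#, assuming the element is named createDate as in the question, and that the incoming format is dd/MM/yyyy (so '12/01/2015' means 12 January 2015):

```csharp
using System;
using System.Globalization;
using System.Xml.Linq;

static class DatePreprocessor
{
    // Rewrites every <createDate> value from "dd/MM/yyyy HH:mm:ss" to the
    // ISO 8601 form "yyyy-MM-ddTHH:mm:ss", so a plain xs:dateTime validates it.
    public static void NormalizeDates(XDocument doc)
    {
        foreach (var el in doc.Descendants("createDate"))
        {
            if (DateTime.TryParseExact(el.Value, "dd/MM/yyyy HH:mm:ss",
                    CultureInfo.InvariantCulture, DateTimeStyles.None, out var dt))
            {
                el.Value = dt.ToString("yyyy-MM-ddTHH:mm:ss");
            }
            // else: leave the value alone and let schema validation report it
        }
    }
}
```

Calling DatePreprocessor.NormalizeDates(doc) just before doc.Validate(...) lets the XSD keep an ordinary xs:dateTime type instead of a pattern-restricted string.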
Check this site; it covers the dateTime data type: http://www.w3schools.com/schema/schema_dtypes_date.asp
The "Date and Time Data Types" section also shows how to adjust the rules to your needs.
Related
How to display a DataTable as output in a REST service in XML format
DataTable has a WriteXml method with many overloads. One of them is DataTable.WriteXml(string), which you can use to save the contents of the DataTable as an XML file. Another option is the overload that writes to a Stream object; you can use a FileStream, MemoryStream, or similar.
Several of the overloads take an XmlWriteMode argument that lets you specify whether to write the schema or omit it.
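As a quick sketch (file and table names here are made up for illustration), the plain and schema-aware overloads look like this:

```csharp
using System;
using System.Data;
using System.IO;

var table = new DataTable("People");
table.Columns.Add("Name", typeof(string));
table.Rows.Add("Ada");

// Data only:
table.WriteXml("people.xml");

// Data plus an inline schema describing the table structure:
table.WriteXml("people_with_schema.xml", XmlWriteMode.WriteSchema);

Console.WriteLine(File.ReadAllText("people.xml"));
```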
I have to create an XML document based on a certain XML Schema document. Since my data is in a DataSet, I need to find the best way to start.
I have a couple of ideas on how to start:
manually create nodes, elements, attributes that would match XSD
transform DataSet into a class that would match the schema document and serialize it
something else?
What is the right way to get XML output from a DataSet that matches an XSD schema?
Maybe you should give XMLBeans a try. It's a diverse framework for working with compiled XSD schemas. "Compiled" in this context means you create Java classes from your XSD files.
Compilation example (as can be seen here): scomp -out purchaseorder.jar purchaseorder.xsd
With this jar on your classpath, you could create new, a priori valid instances of your schema with something like:
public PurchaseOrderDocument createPO() {
    PurchaseOrderDocument newPODoc = PurchaseOrderDocument.Factory.newInstance();
    PurchaseOrder newPO = newPODoc.addNewPurchaseOrder();
    Customer newCustomer = newPO.addNewCustomer();
    newCustomer.setName("Doris Kravitz");
    newCustomer.setAddress("Bellflower, CA");
    return newPODoc;
}
You can find the whole example at: XMLBeans Tutorial under the heading "Creating New XML Instances from Schema".
I have a string-to-date conversion problem using SqlBulkCopy in ASP.NET 3.5 with C#.
I read a large CSV file (with a CSV reader). One of the strings read should be loaded into a SQL Server 2008 Date column.
If the text file contains, for example, the string '2010-12-31', SqlBulkCopy loads it into the Date column without any problems.
However, if the string is '20101231', I get an error:
The given value of type String from the data source cannot be converted to type date of the specified target column
The file contains 80 million records, so I cannot create a DataTable.
The SqlBulkCopy ColumnMappings etc. are all OK. Changing the column type to DateTime does not help either.
I tried
SET DATEFORMAT ymd;
But that does not help.
Any ideas how to tell SQL Server to accept this format? Otherwise I will create a custom fix in the CSV reader, but I would prefer something on the SQL side.
Update
Following up on the two answers, I am using SqlBulkCopy like this (as proposed in another question on Stack Overflow):
The CSV reader (see the CodeProject link above) returns string values (not strongly typed). The CsvReader implements System.Data.IDataReader, so I can do something like this:
using (CsvReader reader = new CsvReader(path))
using (SqlBulkCopy bcp = new SqlBulkCopy(CONNECTION_STRING))
{
    bcp.DestinationTableName = "SomeTable";
    // column mappings
    bcp.WriteToServer(reader);
}
All the fields coming from the IDataReader are strings, so I cannot use the C# approach unless I change quite a bit in the CsvReader.
My question is therefore not about how to fix it in C#; I can do that, but I want to avoid it.
It is strange, because if you run something like this in SQL
update set [somedatefield] = '20101231'
it also works, just not with bulk copy.
Any idea why?
Thanks for any advice,
Pleun
Older issue, but I wanted to add an alternative approach.
I had the same issue with SqlBulkCopy not allowing DataType/culture specifications for columns when streaming from an IDataReader.
To reduce the speed overhead of constructing DataRows locally, and instead have the parsing occur on the target, a simple method I used was to temporarily set the thread culture to the culture that defines the format in use; in this case, US-format dates.
For my problem (en-US dates in the input, in PowerShell):
[System.Threading.Thread]::CurrentThread.CurrentCulture = 'en-US'
<call SQLBulkCopy>
For your problem you could do the same, but since the date format is not culture specific, create a modified culture object (untested):
CultureInfo newCulture = (CultureInfo)Thread.CurrentThread.CurrentCulture.Clone();
newCulture.DateTimeFormat.ShortDatePattern = "yyyyMMdd";
Thread.CurrentThread.CurrentCulture = newCulture;
I found that letting the database server perform the type conversions once the values have gotten through the SqlBulkCopy interface is considerably faster than parsing locally, particularly in a scripting language.
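One caveat with the snippet above is that the culture change persists on the thread. A sketch of the same idea in C# that restores the original culture afterwards (the commented-out WriteToServer call stands in for the bulk copy from the question):

```csharp
using System;
using System.Globalization;
using System.Threading;

// Swap the thread culture only for the duration of the bulk copy,
// then restore it, so the change cannot leak into unrelated code.
var original = Thread.CurrentThread.CurrentCulture;
var newCulture = (CultureInfo)original.Clone();
newCulture.DateTimeFormat.ShortDatePattern = "yyyyMMdd";
try
{
    Thread.CurrentThread.CurrentCulture = newCulture;
    // bcp.WriteToServer(reader);  // the SqlBulkCopy call from the question
    Console.WriteLine(Thread.CurrentThread.CurrentCulture.DateTimeFormat.ShortDatePattern);
}
finally
{
    Thread.CurrentThread.CurrentCulture = original;
}
```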
If you can handle it in C# itself, then this code will help you get the date in the string as a DateTime object, which you can pass directly:
//datestring is the string read from CSV
DateTime thedate = DateTime.ParseExact(dateString, "yyyyMMdd", null);
If you want it to be formatted as string then:
string thedate = DateTime.ParseExact(dateString, "yyyyMMdd", null).ToString("yyyy-MM-dd");
Good luck.
Update
In your scenario I don't know why the date is not converted automatically, but from C# you need to step in and interfere in the process of passing the data to the WriteToServer() method. The best you can do (keeping performance in mind) is to keep a buffer of DataRow items and pass them to the WriteToServer() method in batches. I will just write the sample code in a minute...
//A sample code.. polish it before implementation
//Buffer rows in batches of 50 records
const int BatchSize = 50;
var buffer = new List<DataRow>(BatchSize);

while (reader.Read())
{
    //Code to initialize each row with the data in the reader
    //(someTable is a DataTable whose schema matches the destination)
    DataRow row = someTable.NewRow();
    //.....
    //Fill the column data with the date properly formatted
    buffer.Add(row);

    if (buffer.Count == BatchSize)
    {
        bcp.WriteToServer(buffer.ToArray());
        buffer.Clear();
    }
}

if (buffer.Count > 0)
    bcp.WriteToServer(buffer.ToArray()); //flush the last partial batch

It's not full code, but I think you can work it out...
It's not entirely clear how you're using SqlBulkCopy, but ideally you shouldn't be uploading the data to SQL Server in string format at all: parse it to a DateTime or DateTimeOffset in your CSV reader (or on the output of your CSV reader), and upload it that way. Then you don't need to worry about string formats.
This code
XmlDataDocument xmlDataDocument = new XmlDataDocument(ds);
does not work for me, because the node names are derived from the columns' encoded ColumnName property and look like "last_x20_name", for instance. This I cannot use in the resulting Excel spreadsheet. To give the column names something more friendly, I need to generate the XML myself.
I like LINQ to XML, and one of the responses to this question contained the following snippets:
XDocument doc = new XDocument(new XDeclaration("1.0", "UTF-8", "yes"),
    new XElement("products",
        from p in collection
        select new XElement("product",
            new XAttribute("guid", p.ProductId),
            new XAttribute("title", p.Title),
            new XAttribute("version", p.Version))));
The entire goal is to dynamically derive the column names from the dataset, so hardcoding them is not an option. Can this be done with Linq and without making the code much longer?
It ought to be possible.
To use your DataSet as a source, you need LINQ to DataSet.
Then you would need a nested query:
// untested
var data = new XElement("products",
    from row in ds.Tables["ProductsTable"].AsEnumerable()
    select new XElement("product",
        from DataColumn column in ds.Tables["ProductsTable"].Columns // not sure about this
        select new XElement(column.ColumnName, row[column])));
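Fleshed out into a self-contained form (the table and column names here are invented for illustration), the nested query pattern looks like this:

```csharp
using System;
using System.Data;
using System.Linq;
using System.Xml.Linq;

var table = new DataTable("ProductsTable");
table.Columns.Add("guid", typeof(string));
table.Columns.Add("title", typeof(string));
table.Rows.Add("p-1", "Widget");

// One <product> element per row, one child element per column,
// with element names taken dynamically from the DataColumn objects.
var data = new XElement("products",
    from DataRow row in table.Rows
    select new XElement("product",
        from DataColumn col in table.Columns
        select new XElement(col.ColumnName, row[col])));

Console.WriteLine(data);
```

Note that this only works as long as every ColumnName is a legal XML element name; names with spaces or symbols would still need encoding.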
I appreciate the answers, but I had to abandon this approach altogether. I did manage to produce the XML that I wanted (albeit not with LINQ), but of course there is a reason why the default implementation of the XmlDataDocument constructor uses the encoded column name: special characters are not allowed in XML element names. But since I wanted to use the XML to convert what used to be a simple CSV file into the XML Spreadsheet format via XSLT (the customer complains about losing leading 0's in ZIP codes etc. when loading the original CSV into Excel), I had to look into ways to preserve the data in Excel.
But the ultimate goal of this is to produce a CSV file for upload to the payroll processor, and they mandate the column names to be something that is not XML-compliant (e.g. "File #"). The data is reviewed by humans before the upload, and they use Excel.
I resorted to hard-coding the column names in the XSLT after all.
We are communicating with a 3rd party service via an XML file based on standards that this 3rd party developed. They give us an XML template for each "transaction"; we read it into a DataSet using System.Data.DataSet.ReadXml, set the proper values in the DataSet, and then write the XML back using System.Data.DataSet.WriteXml. This process has worked for several different files. However, I am now adding an additional file which requires that an integer data type be set on one of the fields. Here is a scaled-down version:
<EngineDocList>
<DocVersion>1.0</DocVersion>
<EngineDoc>
<MyData>
<FieldA></FieldA>
<FieldB></FieldB>
<ClientID DataType="S32"></ClientID>
</MyData>
</EngineDoc>
</EngineDocList>
When I look at the DataSet created by my call to ReadXml on this file, the MyData table has columns FieldA, FieldB, and MyData_ID. If I then set the value of MyData_ID and call WriteXml, the exported XML has no value for ClientID. On the other hand, if I take away the DataType attribute, I do get a ClientID column, I can set it properly, and the exported XML has the proper value. However, the third party requires that this data type be defined.
Any thoughts on why ReadXml renames this tag, or how I could otherwise get this process to work? Alternatively, I could revamp the way we read and write the XML, but I would obviously rather not go down that path, although such suggestions would also be welcome. Thanks.
I would not do this with a DataSet. It has a specific focus on simulating a relational model, and much XML will not follow that model. When the DataSet sees things that don't match its idea of the world, it either ignores them or changes them; neither is a good thing.
I'd use an XmlDocument for this.
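A minimal sketch of the XmlDocument route, using the template from the question (the ClientID value is made up): load the third party's XML as-is, fill in the text of the nodes you care about, and write it back out. Because XmlDocument does not reinterpret the document, the DataType="S32" attribute survives untouched.

```csharp
using System;
using System.Xml;

var doc = new XmlDocument();
doc.LoadXml(@"<EngineDocList>
  <DocVersion>1.0</DocVersion>
  <EngineDoc>
    <MyData>
      <FieldA></FieldA>
      <FieldB></FieldB>
      <ClientID DataType=""S32""></ClientID>
    </MyData>
  </EngineDoc>
</EngineDocList>");

// Fill in the value; the DataType attribute is left exactly as it was.
XmlNode clientId = doc.SelectSingleNode("//MyData/ClientID");
clientId.InnerText = "12345";

Console.WriteLine(doc.OuterXml);
```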