xml xsd validation and malformed xml

I wrote a quick class to validate an XML file at a given path against an XSD with .NET (see below).
I have large volumes of data files being generated by another machine on the LAN, but the files are not true XML. They are malformed, but in the same way every time, so based on their structure I can make some global replacements on the content of each file to correct it before testing it against the XSD. For example, I have to replace <\ with </ and so on. All the replacements are listed in the code.
When I pointed this at the network share of the machine generating the files, with a list of about 50k files, it took about 15 minutes to complete. I'm wondering whether this is just I/O capped by the LAN, or whether there's a better (quicker) way to correct the malformed XML than the replacements I do here.
class VCheck
{
private static XmlReaderSettings settings = new XmlReaderSettings();
private bool valid;
string message;
public string Message { get { return message; } }
public VCheck()
{
settings.ValidationType = ValidationType.Schema;
settings.ValidationFlags |= XmlSchemaValidationFlags.ReportValidationWarnings;
settings.ValidationEventHandler += new ValidationEventHandler(ValidationCallBack);
settings.Schemas.Add(null, "schema.xsd");
}
public bool CheckFile(string FileFullPath)
{
StreamReader file = new StreamReader(FileFullPath);
valid = true;
message = null;
try
{ //setup xml reader with settings
XmlReader xml = XmlReader.Create(new StringReader(@"<?xml version='1.0'?><root xmlns=""MYE"">" +
file.ReadToEnd().Replace(@"<\", @"</").Replace("&", "&amp;").Replace("\"", "&quot;").Replace("'", "&apos;") + "</root>"),
settings);
while (xml.Read()) ; //read in all xml, validating against xsd
}
catch
{
//problem reading the xml file in, bad path, disk error etc.
return false;
}
return valid;
}
void ValidationCallBack(object sender, ValidationEventArgs e) //called on failed validations
{
valid = false;
message = e.Message;
switch (e.Severity)
{
case XmlSeverityType.Error:
//Do stuff on validation error
break;
case XmlSeverityType.Warning:
//Do stuff on validation warning
break;
}
}
}
I'd call it from main like this:
static void Main(string[] args)
{
VCheck checker = new VCheck();
foreach (string file in files) //files is a List<string> of file paths/names
{
if (!checker.CheckFile(file))
{
//To do stuff if not valid
}
}
}

I don't think reading it all into memory - ReadToEnd - and performing String.Replace on the contents is a good choice, with regard to your performance concerns.
If I were you, I'd rather rewrite those files "piece by piece" - that is, buffering and replacing data on the fly.
Just create a new file, load some of the malformed file into the buffer (say 4 kb), do the replacements, flush the results into your newly created file; rinse and repeat.
Beware: one buffer can end with < while the next one starts with \. If you don't want to miss any <\ sequences (and the like), you need to handle such cases as well.
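As a rough illustration of the buffered rewrite and the boundary case described above, here is a minimal sketch. The class and method names (ChunkedFixer, FixBackslashTags) are made up for this example, and it only handles the <\ to </ replacement; the single-character replacements from the question cannot be split across buffers, so they would not need the carry-over.

```csharp
using System.IO;
using System.Text;

static class ChunkedFixer
{
    // Copies text from reader to writer, replacing every "<\" with "</".
    // A '<' at the very end of a chunk is held back and prepended to the
    // next chunk, so a pair split across two chunks is still caught.
    public static void FixBackslashTags(TextReader reader, TextWriter writer, int chunkSize = 4096)
    {
        char[] buffer = new char[chunkSize];
        bool pendingLessThan = false;
        int read;
        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            var sb = new StringBuilder(read + 1);
            if (pendingLessThan) sb.Append('<');
            sb.Append(buffer, 0, read);

            // Hold back a trailing '<' until we have seen what follows it.
            pendingLessThan = sb[sb.Length - 1] == '<';
            if (pendingLessThan) sb.Length--;

            sb.Replace("<\\", "</");
            writer.Write(sb.ToString());
        }
        // The input ended with a lone '<'; emit it unchanged.
        if (pendingLessThan) writer.Write('<');
    }
}
```

For files, the reader and writer would be a StreamReader over the malformed file and a StreamWriter over the new one, each wrapped in a using block.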
Another possible solution is to create your own implementation of a "more tolerant" XmlReader (this class is not sealed, so you can derive from it), although personally I haven't done it and I'm not sure this would be a good approach. Rewriting the files will at least leave you with syntactically valid XML, which may come in useful at some point.
PS. On a side note:
catch
{
//problem reading the xml file in, bad path, disk error etc.
return false;
}
I wouldn't do that. It leaves the caller with no idea whatsoever as to why the operation failed.
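One way to keep the reason for a failure, sketched under the assumption that a descriptive string is enough for the caller. SafeXmlCheck and DescribeFailure are hypothetical names, and this sketch only checks well-formedness, not the XSD.

```csharp
using System.IO;
using System.Xml;

static class SafeXmlCheck
{
    // Returns null when the text is well-formed XML, otherwise a short
    // description of what went wrong, so the caller can log or display it.
    public static string DescribeFailure(TextReader source)
    {
        try
        {
            using (XmlReader xml = XmlReader.Create(source))
            {
                while (xml.Read()) { } // read to the end, parsing as we go
            }
            return null;
        }
        catch (XmlException ex)
        {
            return "Malformed XML at line " + ex.LineNumber + ": " + ex.Message;
        }
        catch (IOException ex)
        {
            return "I/O failure: " + ex.Message;
        }
    }
}
```

Catching only the exception types you expect also means that genuinely unexpected errors still surface instead of being reported as "invalid file".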

The quickest processes are those which do not need to be performed at all. So I commend Michael Kay's comments on dealing with "non-well-formed XML" to your attention.
If the non-XML data you'd like to handle as XML is being generated by a machine, there's no reason that that machine could not be generating XML data instead of the non-XML data you're currently trying to fix. Worse, every minute of effort you put into dealing with the errors in the data-producing process is a minute you've put into persuading those responsible for that process that they are producing correct, well-formed XML. So it's not only yourself you're hurting here.

Related

File Copy Program Doesn't Properly Copy File

Hello
I've been working on a terminal-like application to get better at programming in C#, just something to help me learn. I've decided to add a feature that will copy a file exactly as it is to a new file. It seems to work almost perfectly. When opened in Notepad++, the files are only a few lines apart in length, and very, very close to the same as far as actual file size goes. However, the duplicated copy of the file never runs. It says the file is corrupt. I have a feeling the problem is within the methods I created for reading and rewriting binary data. The code is as follows; thanks for the help. Sorry for the spaghetti code too, I get a bit sloppy when I'm messing around with new ideas.
Class that handles the file copying/writing
using System;
using System.IO;
//using System.Collections.Generic;
namespace ConsoleFileExplorer
{
class FileTransfer
{
private BinaryWriter writer;
private BinaryReader reader;
private FileStream fsc; // file to be duplicated
private FileStream fsn; // new location of file
int[] fileData;
private string _file;
public FileTransfer(String file)
{
_file = file;
fsc = new FileStream(file, FileMode.Open);
reader = new BinaryReader(fsc);
}
// Reads all the original files data to an array of bytes
public byte[] ReadAllDataToArray()
{
byte[] bytes = reader.ReadBytes((int)fsc.Length); // reading bytes from the original file
return bytes;
}
// writes the array of original byte data to a new file
public void WriteDataFromArray(byte[] fileData, string path) // got a feeling this is the problem :p
{
fsn = new FileStream(path, FileMode.Create);
writer = new BinaryWriter(fsn);
int i = 0;
while(i < fileData.Length)
{
writer.Write(fileData[i]);
i++;
}
}
}
}
Code that interacts with this class
(the Sleep(5000) calls are there because I was expecting an error on the first attempt):
case '3':
Console.Write("Enter source file: ");
string sourceFile = Console.ReadLine();
if (sourceFile == "")
{
Console.Clear();
Console.ForegroundColor = ConsoleColor.DarkRed;
Console.Error.WriteLine("Must input a proper file path.\n");
Console.ForegroundColor = ConsoleColor.White;
Menu();
} else {
Console.WriteLine("Copying Data"); System.Threading.Thread.Sleep(5000);
FileTransfer trans = new FileTransfer(sourceFile);
//copying the original files data
byte[] data = trans.ReadAllDataToArray();
Console.Write("Enter Location to store data: ");
string newPath = Console.ReadLine();
// Just for me to make sure it doesnt exit if i forget
if(newPath == "")
{
Console.Clear();
Console.ForegroundColor = ConsoleColor.DarkRed;
Console.Error.WriteLine("Cannot have empty path.");
Console.ForegroundColor = ConsoleColor.White;
Menu();
} else
{
Console.WriteLine("Writing data to file"); System.Threading.Thread.Sleep(5000);
trans.WriteDataFromArray(data, newPath);
Console.WriteLine("File stored.");
Console.ReadLine();
Console.Clear();
Menu();
}
}
break;
File compared to new file
[Images: Original File | New File]

You're not properly disposing the file streams and the binary writer. Both tend to buffer data (which is a good thing, especially when you're writing one byte at a time). Use using, and your problem should disappear. Unless somebody is editing the file while you're reading it, of course.
BinaryReader and BinaryWriter do not just write "raw data". They also add metadata as needed; they're designed for serialization and deserialization, rather than reading and writing bytes. Now, in the particular case of ReadBytes and Write(byte[]), those really are just raw bytes, but there's not much point in using these classes just for that. Reading and writing bytes is the thing every Stream gives you, and that includes FileStreams. There's no reason to use BinaryReader/BinaryWriter here whatsoever; the file streams give you everything you need.
A better approach would be to simply use
using (var fsn = ...)
{
fsn.Write(fileData, 0, fileData.Length);
}
or even just
File.WriteAllBytes(fileName, fileData);
Maybe you're thinking that writing a byte at a time is closer to "the metal", but that simply isn't the case. At no point during this does the CPU pass a byte at a time to the hard drive. Instead, the hard drive copies data directly from RAM, with no intervention from the CPU. And most hard drives still can't write (or read) arbitrary amounts of data from the physical media - instead, you're reading and writing whole sectors. If the system really did write a byte at a time, you'd just keep rewriting the same sector over and over again, just to write one more byte.
An even better approach would be to use the fact that you've got file streams open, and stream the files from source to destination rather than first reading everything into memory, and then writing it back to disk.
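A minimal sketch of that streaming approach, assuming .NET 4 or later (where Stream.CopyTo is available). FileCopier and CopyFile are hypothetical names for this example.

```csharp
using System.IO;

static class FileCopier
{
    // Streams the source file into the destination in chunks, so the whole
    // file never has to sit in memory. The using blocks flush and close
    // both streams even if an exception is thrown part-way through.
    public static void CopyFile(string sourcePath, string destinationPath)
    {
        using (FileStream source = File.OpenRead(sourcePath))
        using (FileStream destination = File.Create(destinationPath))
        {
            source.CopyTo(destination);
        }
    }
}
```

Because the destination stream is disposed before the method returns, any buffered bytes are written out, which addresses the missing-data symptom described above.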

There is a File.Copy() method in C#; you can see it here: https://msdn.microsoft.com/ru-ru/library/c6cfw35a(v=vs.110).aspx
If you want to implement it yourself, try placing a breakpoint inside your methods and stepping through them in the debugger. It is like the story about giving a fisherman a rod rather than a fish: the point is to learn how to catch it yourself.
Also, look at your int[] fileData field and the byte[] fileData parameter of the last method; maybe this is the problem.

How Can I Handle This Xml Parsing Error?

Consider the following C# code:
using System.Xml.Linq;
namespace TestXmlParse
{
class Program
{
static void Main(string[] args)
{
var testxml =
@"<base>
<elem1 number='1'>
<elem2>yyy</elem2>
<elem3>xxx <yyy zzz aaa</elem3>
</elem1>
</base>";
XDocument.Parse(testxml);
}
}
}
I get a System.Xml.XmlException on the parse, of course, complaining about elem3. The error message is this:
System.Xml.XmlException was unhandled
Message='aaa' is an unexpected token. The expected token is '='. Line 4, position 59.
Source=System.Xml
LineNumber=4
LinePosition=59
Obviously this is not the real Xml (we get the xml from a third party) and while the best answer would be for the third party to clean up their xml before they send it to us, is there any other way I might fix this xml before I hand it off to the parser? I've devised a hacky way to fix this; catch the exception and use that to tell me where I need to look for characters which should be escaped. I was hoping for something a bit more elegant and comprehensive.
Any suggestions are welcome.
If this is a dupe, please point me to the other questions; I'll close this myself. I am more interested in an answer than any karma gain.
EDIT:
I guess I didn't make my question as clear as I had hoped. I know the "<" in elem3 is incorrect; I'm trying to find an elegant way to detect (and correct) any badly formed xml of that sort before I attempt the parse. As I say, I get this xml from a third-party and I can't control what they give me.

I would recommend that you do not manipulate the data you receive. If it is invalid it's your client's problem.
Editing the input so it is valid xml can cause serious problems, e.g. instead of throwing an error you may end up processing wrong data (because you tried your best to make the xml valid, but this may lead to different data).
[EDIT]
I still think it's not a good idea, but sometimes you have to do what you have to do.
Here is a very simple class that parses the input and replaces the invalid opening bracket. You could do this with a regex (which I am not good at), and this solution is not complete: depending on your requirements (or, let's say, the bad XML you get) you will have to adapt it (e.g. scan for complete XML elements instead of only the "<" and ">" brackets, put CDATA around the inner text of a node and so on).
I just wanted to illustrate how you could do it, so please don't complain if it is slow/has bugs (as I mentioned, I would not do it).
// needs: using System.IO; using System.Text;
class XmlCleaner
{
    public void Clean(Stream sourceStream, Stream targetStream)
    {
        const char openingIndicator = '<';
        const char closingIndicator = '>';
        const int bufferSize = 1024;
        char[] buffer = new char[bufferSize];
        bool startTagFound = false;
        StringBuilder writeBuffer = new StringBuilder();
        using (var reader = new StreamReader(sourceStream))
        {
            var writer = new StreamWriter(targetStream);
            try
            {
                int charsRead;
                while ((charsRead = reader.Read(buffer, 0, bufferSize)) > 0)
                {
                    // only look at the characters actually read in this pass
                    for (int i = 0; i < charsRead; i++)
                    {
                        char c = buffer[i];
                        if (c == openingIndicator)
                        {
                            if (startTagFound)
                            {
                                // we have 2 consecutive opening brackets without a closing one:
                                // escape the first one
                                writeBuffer = writeBuffer.Replace("<", "&lt;");
                                // append the new one
                                writeBuffer.Append(c);
                            }
                            else
                            {
                                startTagFound = true;
                                writeBuffer.Append(c);
                            }
                        }
                        else if (c == closingIndicator)
                        {
                            startTagFound = false;
                            // write writebuffer...
                            writeBuffer.Append(c);
                            writer.Write(writeBuffer.ToString());
                            writeBuffer.Clear();
                        }
                        else
                        {
                            writeBuffer.Append(c);
                        }
                    }
                }
                // write whatever follows the last closing bracket
                writer.Write(writeBuffer.ToString());
            }
            finally
            {
                // unfortunately the StreamWriter's Dispose method closes the underlying stream, so we just flush it
                writer.Flush();
            }
        }
    }
}
To test it:
var testxml =
@"<base>
<elem1 number='1'>
<elem2>yyy</elem2>
<elem3>xxx <yyy zzz aaa</elem3>
</elem1>
</base>";
string result;
using (var source = new MemoryStream(Encoding.ASCII.GetBytes(testxml)))
using(var target = new MemoryStream()) {
XmlCleaner cleaner = new XmlCleaner();
cleaner.Clean(source, target);
target.Position = 0;
using (var reader = new StreamReader(target))
{
result = reader.ReadToEnd();
}
}
XDocument.Parse(result);
var expectedResult =
@"<base>
<elem1 number='1'>
<elem2>yyy</elem2>
<elem3>xxx &lt;yyy zzz aaa</elem3>
</elem1>
</base>";
Debug.Assert(result == expectedResult);

How do I locate a particular word in a text file using .NET

I am sending mails (in ASP.NET, C#) using a template stored in a text file (.txt) like the one below:
User Name :<User Name>
Address : <Address>.
I used to replace the words within the angle brackets in the text file using the code below:
StreamReader sr;
sr = File.OpenText(HttpContext.Current.Server.MapPath(txt));
copy = sr.ReadToEnd();
sr.Close(); //close the reader
copy = copy.Replace(word.ToUpper(),"#" + word.ToUpper()); //remove the word specified UC
//save new copy into existing text file
FileInfo newText = new FileInfo(HttpContext.Current.Server.MapPath(txt));
StreamWriter newCopy = newText.CreateText();
newCopy.WriteLine(copy);
newCopy.Write(newCopy.NewLine);
newCopy.Close();
Now I have a new problem: users will be adding new words within angle brackets; for example, they might add <Salary>.
In that case I have to read the file and find the word <Salary>.
In other words, I have to find all the words that are enclosed in angle brackets (<>).
How do I do that?

Having a stream for your file, you can build something similar to a typical tokenizer.
In general terms, this works as a finite state machine: you need an enumeration for the states (in this case could be simplified down to a boolean, but I'll give you the general approach so you can reuse it on similar tasks); and a function implementing the logic. C#'s iterators are quite a fit for this problem, so I'll be using them on the snippet below. Your function will take the stream as an argument, will use an enumerated value and a char buffer internally, and will yield the strings one by one. You'll need this near the start of your code file:
using System.Collections.Generic;
using System.IO;
using System.Text;
And then, inside your class, something like this:
enum States {
    OUT,
    IN,
}
IEnumerable<string> GetStrings(TextReader reader) {
    States state = States.OUT;
    // initialized up front so the compiler can prove it is assigned before use
    StringBuilder buffer = new StringBuilder();
    int ch;
    while ((ch = reader.Read()) >= 0) {
        switch (state) {
            case States.OUT:
                if (ch == '<') {
                    state = States.IN;
                    buffer = new StringBuilder();
                }
                break;
            case States.IN:
                if (ch == '>') {
                    state = States.OUT;
                    yield return buffer.ToString();
                } else {
                    // Read() returns one UTF-16 code unit, so a plain cast is enough
                    buffer.Append((char)ch);
                }
                break;
        }
    }
}
The finite-state machine model always has the same layout: while(READ_INPUT) { switch(STATE) {...}}: inside each case of the switch, you may be producing output and/or altering the state. Beyond that, the algorithm is defined in terms of states and state changes: for any given state and input combination, there is an exact new state and output combination (the output can be "nothing" on those states that trigger no output; and the state may be the same old state if no state change is triggered).
Hope this helps.
EDIT: forgot to mention a couple of things:
1) You get a TextReader to pass to the function by creating a StreamReader for a file, or a StringReader if you already have the file on a string.
2) The memory and time costs of this approach are O(n), with n being the length of the file. They seem quite reasonable for this kind of task.

Using regex.
var matches = Regex.Matches(text, "<(.*?)>");
List<string> words = new List<string>();
for (int i = 0; i < matches.Count; i++)
{
words.Add(matches[i].Groups[1].Value);
}
Of course, this assumes you already have the file's text in a variable. Since you have to read the entire file to achieve that, you could look for the words as you are reading the stream, but I don't know what the performance trade off would be.
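For the mail-template case, the regex idea could be wrapped up roughly like this. TemplateFields and Find are hypothetical names, and the pattern is a slight variation on the one above: it uses [^<>]+ instead of .*? so that a stray bracket cannot be swallowed into a name. Duplicates are dropped, which is an assumption about what the caller wants.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TemplateFields
{
    // Returns each distinct placeholder name found between angle brackets,
    // in order of first appearance.
    public static List<string> Find(string template)
    {
        var names = new List<string>();
        foreach (Match m in Regex.Matches(template, "<([^<>]+)>"))
        {
            string name = m.Groups[1].Value.Trim();
            if (!names.Contains(name)) names.Add(name);
        }
        return names;
    }
}
```

Calling Find on the template text from the question plus a <Salary> line would yield "User Name", "Address" and "Salary".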

This is not an answer, but comments can't do this:
You should place some of your objects into using blocks. Something like this:
using(StreamReader sr = File.OpenText(HttpContext.Current.Server.MapPath(txt)))
{
copy = sr.ReadToEnd();
} // reader is closed by the end of the using block
//remove the word specified UC
copy = copy.Replace(word.ToUpper(), "#" + word.ToUpper());
//save new copy into existing text file
FileInfo newText = new FileInfo(HttpContext.Current.Server.MapPath(txt));
using(var newCopy = newText.CreateText())
{
newCopy.WriteLine(copy);
newCopy.Write(newCopy.NewLine);
}
The using block ensures that resources are cleaned up even if an exception is thrown.

How to save a human readable file

Currently I have an application that reads and writes several properties from one or two basic classes to a .txt file using the BinaryFormatter.
I've opened up the .txt file in Notepad, and as it's formatted for the application it's not very readable to the human eye, not for me anyway =D
I've heard of using XML, but most of my searches seem to overcomplicate things.
The kind of data I'm trying to save is simply a collection of "Person.cs" objects, nothing more than a name and address, all private strings but with properties and marked as Serializable.
What would be the best way to save my data so that it can be easily read by a person? It would also make it easier to make small changes to the application's data directly in the file instead of having to load it, change it and save it.
Edit:
I have added the current way I am saving and loading my data; my _userCollection is as the name suggests, and nUser/nMember are integers.
#region I/O Operations
public bool SaveData()
{
try
{
//Open the stream using the Data.txt file
using (Stream stream = File.Open("Data.txt", FileMode.Create))
{
//Create a new formatter
BinaryFormatter bin = new BinaryFormatter();
//Copy data in collection to the file specified earlier
bin.Serialize(stream, _userCollection);
bin.Serialize(stream, nMember);
bin.Serialize(stream, nUser);
//Close stream to release any resources used
stream.Close();
}
return true;
}
catch (IOException ex)
{
throw new ArgumentException(ex.ToString());
}
}
public bool LoadData()
{
//Check if file exsists, otherwise skip
if (File.Exists("Data.txt"))
{
try
{
using (Stream stream = File.Open("Data.txt", FileMode.Open))
{
BinaryFormatter bin = new BinaryFormatter();
//Copy data back into collection fields
_userCollection = (List<User>)bin.Deserialize(stream);
nMember = (int)bin.Deserialize(stream);
nUser = (int)bin.Deserialize(stream);
stream.Close();
//Sort data to ensure it is ordered correctly after being loaded
_userCollection.Sort();
return true;
}
}
catch (IOException ex)
{
throw new ArgumentException(ex.ToString());
}
}
else
{
//Console.WriteLine present for testing purposes
Console.WriteLine("\nLoad failed, Data.txt not found");
return false;
}
}

Replace your BinaryFormatter with XmlSerializer and run otherwise the same code.
The only change you need to make is that BinaryFormatter has a parameterless constructor, while for XmlSerializer you need to pass the type to the constructor:
XmlSerializer serializer = new XmlSerializer(typeof(Person));
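To make that concrete, here is a small round-trip sketch. The Person class below is a stand-in with just Name and Address (the real class is not shown in the question), and PersonStore is a hypothetical helper. Note that XmlSerializer only handles public types with a public parameterless constructor, and only their public read/write members.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

// Stand-in for the question's Person class (assumed shape).
public class Person
{
    public string Name { get; set; }
    public string Address { get; set; }
}

public static class PersonStore
{
    // The serializer is told the exact type it will read and write.
    private static readonly XmlSerializer Serializer =
        new XmlSerializer(typeof(List<Person>));

    public static void Save(TextWriter writer, List<Person> people)
    {
        Serializer.Serialize(writer, people);
    }

    public static List<Person> Load(TextReader reader)
    {
        return (List<Person>)Serializer.Deserialize(reader);
    }
}
```

With a StreamWriter over Data.txt as the writer, the resulting file contains one element per person with the name and address as readable child elements, which can be edited by hand and loaded again.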

Using XmlSerializer is not really complicated. Have a look at this MSDN page for an example: http://msdn.microsoft.com/en-us/library/system.xml.serialization.xmlserializer.aspx

You could implement your own PersonsWriter that takes a StreamWriter as a constructor argument and has a Write method that takes an IList<Person> as input and writes out a nice text representation.
For example:
// needs: using System; using System.Collections.Generic; using System.IO;
public class PersonsWriter : IDisposable
{
    private StreamWriter _wr;
    public PersonsWriter(StreamWriter writer)
    {
        this._wr = writer;
    }
    public void Write(IList<Person> people)
    {
        foreach (Person dude in people)
        {
            // regular (non-verbatim) string, so \n is written as a line break
            _wr.Write("{0} {1}\n{2}\n{3} {4}\n\n",
                dude.FirstName,
                dude.LastName,
                dude.StreetAddress,
                dude.ZipCode,
                dude.City);
        }
    }
    public void Dispose()
    {
        _wr.Flush();
        _wr.Dispose();
    }
}

YAML is another option for human-readable markup that is also easy to parse. There are libraries available for C# as well as almost all other popular languages. Here's a sample of what YAML looks like:
invoice: 34843
date   : 2001-01-23
bill-to: &id001
    given  : Chris
    family : Dumars
    address:
        lines: |
            458 Walkman Dr.
            Suite #292
        city    : Royal Oak
        state   : MI
        postal  : 48046

Frankly, as a human, I don't find XML to be all that readable. In fact, it's not really designed to be read by humans.
If you want a human readable format, then you have to build it.
Say, you have a Person class that has a First Name, a last Name and a SSN as properties. Create your file, and have it write out 3 lines, with a description of the field in the first fifty (random number from my head) and then with character 51 have the value start being written.
This will produce a file that looks like:
First Name-------Stephen
Last Name -------Wrighton
SSN -------------XXX-XX-XXXX
Then, reading it back in, your program would know where the data begins on each line, and what each line is for (the program would know that Line 3 is the SSN value).
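A minimal sketch of that fixed-column layout, assuming labels are shorter than 50 characters and using '-' as the filler as in the sample above. LabelledLines, Format and ReadValue are hypothetical names.

```csharp
static class LabelledLines
{
    // The value starts at character 51, i.e. zero-based index 50.
    const int ValueColumn = 50;

    // Pads the label with dashes up to the value column, then appends the value.
    // Assumes label.Length < ValueColumn; longer labels would break the alignment.
    public static string Format(string label, string value)
    {
        return label.PadRight(ValueColumn, '-') + value;
    }

    // Reads the value back by skipping the fixed-width label area.
    public static string ReadValue(string line)
    {
        return line.Length > ValueColumn ? line.Substring(ValueColumn) : string.Empty;
    }
}
```

Writing a person then becomes three Format calls (first name, last name, SSN), and reading it back is three ReadValue calls in the same order.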
But remember, to truly gain human readability, you sacrifice data portability.

Try the DataContractSerializer
It serializes objects to XML and is very easy to use
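A short round-trip sketch with DataContractSerializer. The Contact class and the ContactXml helper are made up for illustration, and the indentation setting is just there to make the output easier on the eye.

```csharp
using System.IO;
using System.Runtime.Serialization;
using System.Text;
using System.Xml;

// Hypothetical data class; only members marked [DataMember] are written.
[DataContract]
public class Contact
{
    [DataMember] public string Name { get; set; }
    [DataMember] public string Address { get; set; }
}

public static class ContactXml
{
    public static string ToXml(Contact contact)
    {
        var serializer = new DataContractSerializer(typeof(Contact));
        var settings = new XmlWriterSettings { Indent = true, OmitXmlDeclaration = true };
        var sb = new StringBuilder();
        using (XmlWriter writer = XmlWriter.Create(sb, settings))
        {
            serializer.WriteObject(writer, contact);
        }
        return sb.ToString();
    }

    public static Contact FromXml(string xml)
    {
        var serializer = new DataContractSerializer(typeof(Contact));
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            return (Contact)serializer.ReadObject(reader);
        }
    }
}
```

Unlike XmlSerializer, this opt-in model can also serialize private members if they carry the [DataMember] attribute.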

Write a CSV reader writer if you want a good compromise between human and machine readable in a Windows environment
Loads into Excel too.
There's a discussion about it here:
http://knab.ws/blog/index.php?/archives/3-CSV-file-parser-and-writer-in-C-Part-1.html
EDIT
That is a C# article... it just confusingly has "C" in the URL.

I really think you should go with XML (look into DataContractSerializer). It's not that complicated. You could probably even just replace BinaryFormatter with XmlSerializer and go.
If you still don't want to do that, though, you can write a delimited text file. Then you'll have to write your own reader method (although, it could almost just use the split method).
//Inside the Person class (needs using System, System.Collections.Generic and System.Reflection):
public override string ToString()
{
    List<String> propValues = new List<String>();
    // Get the type.
    Type t = this.GetType();
    // Cycle through the properties.
    foreach (PropertyInfo p in t.GetProperties())
    {
        propValues.Add(String.Format("{0}:={1}", p.Name, p.GetValue(this, null)));
    }
    return String.Join(",", propValues.ToArray());
}
//Elsewhere, to write a person out:
using (System.IO.TextWriter tw = new System.IO.StreamWriter("output.txt"))
{
    tw.WriteLine(person.ToString());
}

C#: Check if a file is not locked and writable

I want to check whether any file in a list is in use or not writable before I start replacing files.
Sure, I know that between the file check and the file copy there is a chance that one or more files will be locked by someone else, but I handle those exceptions. I want to run this test before the file copy because the complete list of files has a better chance of succeeding than if a file in the middle of the operation fails to be replaced.
Does anyone have an example or a hint in the right direction?

There is no guarantee that the list you get, at any point of time, is going to stay the same the next second as somebody else might take control of the file by the time you come back to them.
I see one way though - "LOCK" the files that you want to replace by getting their corresponding FileStream objects. This way you are sure that you have locked all "available" files by opening them and then you can replace them the way you want.
public void TestGivenFiles(List<string> listFiles)
{
List<FileStream> replaceAbleFileStreams = GetFileStreams(listFiles);
Console.WriteLine("files Received = " + replaceAbleFileStreams.Count);
foreach (FileStream fileStream in replaceAbleFileStreams)
{
// Replace the files the way you want to.
fileStream.Close();
}
}
public List<FileStream> GetFileStreams(List<string> listFilesToReplace)
{
List<FileStream> replaceableFiles = new List<FileStream>();
foreach (string sFileLocation in listFilesToReplace)
{
FileAttributes fileAttributes = File.GetAttributes(sFileLocation);
if ((fileAttributes & FileAttributes.ReadOnly) != FileAttributes.ReadOnly)
{ // Make sure that the file is NOT read-only
try
{
FileStream currentWriteableFile = File.OpenWrite(sFileLocation);
replaceableFiles.Add(currentWriteableFile);
}
catch
{
Console.WriteLine("Could not get Stream for '" + sFileLocation+ "'. Possibly in use");
}
}
}
return replaceableFiles;
}
That said, you are better off trying to replace them one by one and ignoring the ones that you can't.

You must open each file for writing in order to test this.
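A rough sketch of such a probe, assuming that "usable" means the file can be opened for writing with no sharing. FileAccessProbe and CanOpenForWrite are hypothetical names, and the answer is only valid for the instant of the check, since another process may grab the file immediately afterwards.

```csharp
using System;
using System.IO;

static class FileAccessProbe
{
    // Tries to open the file for exclusive write access and closes it again.
    // Returns false if the file is missing, locked, read-only or otherwise
    // not writable by the current user.
    public static bool CanOpenForWrite(string path)
    {
        try
        {
            using (new FileStream(path, FileMode.Open, FileAccess.Write, FileShare.None))
            {
                return true;
            }
        }
        catch (IOException)
        {
            return false; // in use by someone else, or not found
        }
        catch (UnauthorizedAccessException)
        {
            return false; // read-only file or insufficient permissions
        }
    }
}
```

Running this over the whole list first gives an early warning, but the copy step still needs its own exception handling for the reasons given above.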

Duplicate of:
How to check For File Lock in C# ?
Can I simply ‘read’ a file that is in use?

Read one byte, write same byte?
