Is there a better Regex for parsing DTD


I've got the DTD for OFX 1.03 (their latest version despite having developed and released 1.60, but I digress...)
I would like to use regex groups to split an entity, element, or other tag into its parts for further processing, so that I could take a tag like this:
<!ENTITY % ACCTTOMACRO "(BANKACCTTO | CCACCTTO | INVACCTTO)">
And create an object like this:
new EntityTag { Name = "%ACCTTOMACRO", ChildTypes = new string[] { "BANKACCTTO", "CCACCTTO", "INVACCTTO" } };
I've got a regular expression that looks like this:
Regex re = new Regex(@"<!(\b)+([\s\S])?[^>]+>");
Admittedly, I'm new to regex, but I've gotten this far: it gives me a match collection over the DTD with one match per tag, excluding comments.
I would like to leverage grouping to facilitate creation of the previously mentioned object.
If I'm on the totally wrong path, please tell me. However, if you do download this document, I think you may find it's not standard. (Visual Studio throws up some red flags about the way this document is formatted.)
I don't expect anyone to go to the trouble, but for the curious here is the link to download the specs.

It looks like they've got a schema available as well. Why not download the schema instead and parse that with an XML parser (for instance, LINQ-to-XML)?
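If the regex route is still wanted for the DTD itself, named groups can pull the parts of an ENTITY declaration apart. The following is only a minimal sketch: the EntityTag class shape and the DtdEntityParser name are assumptions based on the question, and the pattern handles only flat, double-quoted values like the example, not nested groups, occurrence indicators (?, *, +) or single-quoted values.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical shape, modeled on the object described in the question.
public sealed class EntityTag
{
    public string Name { get; set; } = "";
    public string[] ChildTypes { get; set; } = Array.Empty<string>();
}

public static class DtdEntityParser
{
    // "pct" captures the optional parameter-entity marker, "name" the entity
    // name, and "value" the text between the double quotes.
    private static readonly Regex EntityPattern = new Regex(
        @"<!ENTITY\s+(?<pct>%\s+)?(?<name>[^\s""]+)\s+""(?<value>[^""]*)""\s*>");

    public static List<EntityTag> Parse(string dtd)
    {
        var tags = new List<EntityTag>();
        foreach (Match m in EntityPattern.Matches(dtd))
        {
            string prefix = m.Groups["pct"].Success ? "%" : "";

            // Strip the outer parentheses, then split on the choice/sequence separators.
            string[] children = m.Groups["value"].Value
                .Trim('(', ')', ' ')
                .Split(new[] { '|', ',' }, StringSplitOptions.RemoveEmptyEntries)
                .Select(s => s.Trim())
                .Where(s => s.Length > 0)
                .ToArray();

            tags.Add(new EntityTag
            {
                Name = prefix + m.Groups["name"].Value,
                ChildTypes = children
            });
        }
        return tags;
    }
}
```

Calling DtdEntityParser.Parse on the whole DTD text returns one EntityTag per matching declaration; ELEMENT and ATTLIST declarations would need their own patterns.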

Related

Matching tags in C#

I'm trying to match tags with C# and I'm having some trouble getting it to work. I have these tags:
<categories=1></categories=1>
The =1 could really be any number: 1, 2, 3 or anything else. Is there a way to match this tag in C# using IndexOf, Regex, or a better method?
To give an example of how I want to use it, I would have something like:
if (PUT WORKING CODE HERE ONCE FIGURED OUT)
{
Do Something
}
Is there an easy way to do this?
Thanks!

I would suggest first making the document valid XML by replacing those equals signs, then using any XML parser.
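Both ideas can be sketched in a few lines. This is a rough illustration, not a definitive implementation: the CategoryTags class, its method names and the "id" attribute are all made up for the example, and the patterns assume the tags look exactly like the ones in the question.

```csharp
using System.Text.RegularExpressions;

public static class CategoryTags
{
    // \1 is a backreference, so the closing tag must carry the same number
    // as the opening tag.
    private static readonly Regex Pair = new Regex(
        @"<categories=(\d+)>(.*?)</categories=\1>", RegexOptions.Singleline);

    // Returns true and the tag's number when a matching pair is present.
    public static bool TryGetCategory(string text, out int number)
    {
        number = 0;
        Match m = Pair.Match(text);
        return m.Success && int.TryParse(m.Groups[1].Value, out number);
    }

    // Rewrites <categories=N> as <categories id="N"> so that an XML parser
    // can read the result. The attribute name "id" is an arbitrary choice.
    public static string ToXml(string text)
    {
        string opened = Regex.Replace(text, @"<categories=(\d+)>", "<categories id=\"$1\">");
        return Regex.Replace(opened, @"</categories=\d+>", "</categories>");
    }
}
```

In the question's terms, the condition would be something like if (CategoryTags.TryGetCategory(input, out int n)) { ... }.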

There is only one valid answer to this need, unless you are doing homework and need to learn how to code this yourself:
avoid reinventing things from scratch and use the Html Agility Pack.
It is called Html but it also handles XML files, in case you have to do more complex things, like parsing, and don't want to or cannot use pure XPath and the XML-related .NET Framework classes.
See here for some examples: How to use HTML Agility pack

Loading XML Document - Name cannot begin with the zero character

I am trying to load something which claims to be an XML document into any type of .net XML object: XElement, XmlDocument, or XmlTextReader. All of them throw an exception :
Name cannot begin with the '0' character, hexadecimal value 0x30
The error relates to this bit of 'XML':
<chart_value
color="ff4400"
alpha="100"
size="12"
position="cursor"
decimal_char="."
0=""
/>
I believe the problem is that the author should not have named an attribute 0.
If I could change this I would, but I do not have control of this feed. I suppose those who use it are using more permissive tools. Is there any way I can load this as XML without throwing an error?
There is no XML declaration either, nor a namespace or contract definition. I was thinking I might have to turn it into a string and do a replace, but this is not very elegant. I was wondering if there were any other options.

As many have said, this is not XML.
Having said that, it's almost XML and WANTS to be XML, so I don't think you should use a regex to screw around inside of it (here's why).
Wherever you're getting the stream, dump it into a string, change 0= to something like zero= and try parsing it.
Don't forget to reverse the operation if you have to return-to-sender.
If you're reading from a file, you can do something like this:
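// requires: using System.IO; using System.Xml;
var txt = File.ReadAllText(@"\path\to\wannabe.xml");
// Note: this replaces every occurrence of "0=" in the text, not only the attribute name
var clean = txt.Replace("0=", "zero=");
var doc = new XmlDocument();
doc.LoadXml(clean);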
This is not guaranteed to remove all potential XML problems -- but it should remove the one you have.

Just prefix the numeric name with '_'.
Example: replace "0=" with "_0=".
I hope that will fix the problem, thanks.

It might claim to be an XML document, but the claim is clearly false, so you should reject the document.
The only good way to deal with bad XML is to find out what bit of software is producing it, and either fix it or throw it away. All the benefits of XML go out of the window if people start tolerating stuff that's nearly XML but not quite.

The 0="" obviously uses an invalid attribute name 0. You'd probably have to do a find/replace to try and fix the XML if you cannot fix it at the source that created it. You might be able to use RegEx to try to do more efficient manipulation of the XML string.
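As a rough sketch of that find/replace, a regex can target only names that start with a digit and sit directly before an equals sign and a quote, which is a little narrower than a plain string replace. The AttributeNameFixer class and the "_" prefix are assumptions for illustration. The pattern is not a real parser: text inside an attribute value or element content that happens to look like ` 0="` would be rewritten too.

```csharp
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class AttributeNameFixer
{
    // Whitespace (lookbehind), then a name starting with a digit, then an
    // optional gap, '=' and an opening quote (lookahead).
    private static readonly Regex DigitLedName = new Regex(
        @"(?<=\s)(\d[\w.-]*)(?=\s*=\s*[""'])");

    // Prefixes each digit-led attribute name with an underscore, e.g. 0="" becomes _0="".
    public static string PrefixDigitLedNames(string markup)
    {
        return DigitLedName.Replace(markup, "_$1");
    }

    // Cleans the markup and parses it; still throws if other problems remain.
    public static XElement Load(string markup)
    {
        return XElement.Parse(PrefixDigitLedNames(markup));
    }
}
```

If the data ever has to go back to the sender, the same prefix would need to be stripped again on the way out.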

Parsing XML-ish data

Yes, I really am going to ask about parsing XML with regexes... here goes.
I have some XML-ish data, and I need to parse it. I can't do it completely with an XMLDocument or similar because it's not proper XML, and I'm not sure I can (or want to) change the format. The main problem is tags which have special meaning, and look like this:
<$ something_here $>
C#'s XmlDocument falls over parsing that, and I assume other methods will too. I could, with a lot of work, change the above to something like
<some_special_tag><![CDATA[ something_here ]]></some_special_tag>
But that's ugly, and I don't really want to. The reason it would be time consuming to change is that I have hundreds, maybe thousands of XML documents which would need to be changed.
At the moment, I'm parsing the document with regexes. I only need to pick out a couple of specific tags (not the ones above), and it seems to be working, but I'm uncomfortable with it. I'm doing something like this at the moment:
...
MatchCollection mc = Regex.Matches(Template, "<tagname.*?/tagname>"); // or similar
foreach (Match m in mc) {
try {
XmlDocument xd = new XmlDocument();
xd.LoadXml(m.Value);
...
This at least means I'm not using regexes exclusively :)
Can anyone think of a better way? Is there some way of getting XmlDocument to politely ignore the $ character that causes it to fall over? It doesn't seem likely, but I thought I should at least get some opinions.

No, there is no way to get XmlDocument to parse a document which isn't XML, no matter how close to XML it might look!
If it's possible to do, then I would definitely recommend that you convert your documents to actual XML (or at least some recognised document format). Trying to create and maintain a reliable working parser for any format is quite a lot of work, let alone for a format that doesn't appear to be rigorously defined.
Using a some_special_tag element to identify special sections seems like a good idea to me. If necessary you can use a different namespace to ensure no clashes with other elements in your document - this is in fact exactly the way that xslt works ("special" tags are used to mean special things, like templates or nodes that should be replaced) and exactly what xml was designed to support.
Also, I don't understand why you would need to place the something_here bit in a CDATA section. All characters that "break" XML can be escaped fairly easily (for example by writing < as &lt;). CDATA sections are generally only used when the contents of a node need so much escaping that it's easier and less messy just to use a CDATA section instead.
Update: Regarding migration to a new format, can you not use both methods? Attempt to parse the document as an XML document (or if there are performance concerns then perform some other test to quickly determine if the document is in the "old" or "new" format such as checking for a version attribute in the root element) - if it doesn't work then fall back to the old method.
This way, as long as everything is working fine (which it will be as long as nothing changes), users don't need to modify their documents. However, if they run into problems or want to use any new features, explain to them that they must update their documents to the new format.
Depending on how well your current "parser" works, you may even be able to provide an upgrade utility that automatically performs the conversion (as best it can).
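The "try XML first, fall back to the old method" idea could look roughly like this. It is only a sketch: TemplateLoader and TryLoadXml are made-up names, and the fallback to the existing regex code is left to the caller.

```csharp
using System.Xml;

public static class TemplateLoader
{
    // Returns true and the parsed document when the text is well-formed XML;
    // returns false (and a null document) when the parser rejects it.
    public static bool TryLoadXml(string text, out XmlDocument doc)
    {
        var candidate = new XmlDocument();
        try
        {
            candidate.LoadXml(text);
            doc = candidate;
            return true;
        }
        catch (XmlException)
        {
            doc = null;
            return false;
        }
    }
}
```

A caller would use the document when TryLoadXml succeeds and hand the raw text to the old regex-based code otherwise.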

Can't you replace <$ something_here $> with that big CDATA section at run-time and then load the XML document as usual?
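A minimal sketch of that run-time replacement follows. The TemplatePreprocessor class is hypothetical, and some_special_tag is simply the placeholder name used in the question. It assumes the special tags only appear where an element is allowed, so a <$ ... $> inside an attribute value would still break the parse.

```csharp
using System.Text.RegularExpressions;
using System.Xml;

public static class TemplatePreprocessor
{
    // Lazily matches everything between "<$" and the next "$>", across lines.
    private static readonly Regex SpecialTag = new Regex(
        @"<\$(.*?)\$>", RegexOptions.Singleline);

    // Wraps the body of each special tag in a CDATA section inside a normal
    // element. A literal "]]>" in the body is split so it cannot end the CDATA early.
    public static string ToWellFormed(string template)
    {
        return SpecialTag.Replace(template, m =>
            "<some_special_tag><![CDATA["
            + m.Groups[1].Value.Replace("]]>", "]]]]><![CDATA[>")
            + "]]></some_special_tag>");
    }

    // Converts in memory and loads the result; the files on disk stay untouched.
    public static XmlDocument Load(string template)
    {
        var doc = new XmlDocument();
        doc.LoadXml(ToWellFormed(template));
        return doc;
    }
}
```

Because the conversion happens in memory, the hundreds of existing documents would not need to be edited.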

converting Xml to txt

I am currently working with an XML file that keeps race information in XML format like so
<Row xmlns="Practice2a">
<RecordType>Qualifying Classification</RecordType>
<_x0030_02150Position>3</_x0030_02150Position>
<Class>250</Class>
<_x0030_02150MachineNo>11</_x0030_02150MachineNo>
<RiderName>Kevin James</RiderName>
<Machine>Honda</Machine>
<_x0030_02150ToDBehind>29.680</_x0030_02150ToDBehind>
<_x0030_02150BestLapSpeed>97.1415157615475</_x0030_02150BestLapSpeed>
<_x0030_02150ToDBestLapTime>5:32.274</_x0030_02150ToDBestLapTime>
<_x0030_02150BestOnLap>7</_x0030_02150BestOnLap>
</Row>
I want to create a plain text file with just some of the information, in a kind of table format, e.g.
pos Name racetime and BestLaptime
I have attempted to remove the tags from the file and create a text file.
I added a line count to possibly use as a delimiter for extracting the right fields, so now I get:
139 Qualifying Classification
140 3
141 250
142 11
Driver Name: Machine Type: Kevin James
145 Honda
146 29.680
147 97.1415157615475
148 5:32.274
My code is getting quite out of hand, and I am wondering if there is a much better way to achieve this than adding 14 to the count each time (that's how I am displaying "Driver Name:" instead of a number).
Any pointers as to how you would go about this would be much appreciated.

A quick solution would be to read your XML into an XmlDocument (or, even simpler, into a DataSet), and generate the text file in your C# code.
See:
Walkthrough: Reading XML Data into a Dataset
Read XML Attribute using XmlDocument
An alternative approach would be to define an XSLT stylesheet to reformat your XML into the layout of your choice. It's normally the preferred approach for generating HTML docs from XML data, though it can also be used to produce plain-text reports. You can read more about it at
W3School- XSLT
XSLT Basics

You can parse and format it using LinqToXml:
using System;
using System.Linq;
using System.Xml.Linq;
// [...]
// Load the XML from a string...
var doc = XDocument.Parse(xmlString);
// ...or, instead, from a file path or URL:
// var doc = XDocument.Load(@"C:\myFile.xml");
var result = String.Empty;
foreach (var el in doc.Descendants())
{
    // do something with it and format the data to your liking... e.g.
    result += FormatElement(el);
}
// or, more compactly (use this instead of the loop above, not in addition to it):
// doc.Descendants().ToList().ForEach(el => result += FormatElement(el));
// [...]
private string FormatElement(XElement el)
{
    return String.Format("{0}: {1}", el.Name, el.Value);
}
Of course you need to adapt the FormatElement method to your needs, but this scheme should work.
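For the sample Row in the question, an adapted version might look something like the sketch below. The RaceReport class, the column widths and the choice of columns are assumptions. In particular, it is a guess that ToDBehind and ToDBestLapTime are the timing fields wanted. Names such as _x0030_02150Position are XML-encoded, so the sketch decodes them with XmlConvert.DecodeName and matches on the end of the name.

```csharp
using System;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;

public static class RaceReport
{
    // Returns the value of the first child whose decoded local name ends
    // with the given suffix, or an empty string when there is none.
    private static string Field(XElement row, string suffix)
    {
        XElement el = row.Elements().FirstOrDefault(e =>
            XmlConvert.DecodeName(e.Name.LocalName)
                      .EndsWith(suffix, StringComparison.Ordinal));
        return el == null ? "" : el.Value;
    }

    // Builds one header line plus one fixed-width line per Row element.
    public static string ToText(XDocument doc)
    {
        const string layout = "{0,-5}{1,-20}{2,-10}{3}";
        var sb = new StringBuilder();
        sb.AppendLine(string.Format(layout, "Pos", "Name", "Behind", "BestLap"));

        // Match on the local name so the xmlns value does not matter.
        foreach (XElement row in doc.Descendants().Where(e => e.Name.LocalName == "Row"))
        {
            sb.AppendLine(string.Format(layout,
                Field(row, "Position"),
                Field(row, "RiderName"),
                Field(row, "ToDBehind"),
                Field(row, "ToDBestLapTime")));
        }
        return sb.ToString();
    }
}
```

The result can be written out with File.WriteAllText, and no line counting is needed because each value is looked up by element name.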

XML is designed so that some features are required and may be depended on, while other things are the choice of the author of the document. Your scheme seems to get those features exactly backwards! Which line something appears on is not guaranteed by the standard. An entire legal XML file may occupy a single line.
The whole point of XML is that the use of a standard format allows for the use of common tools. The .NET Framework has (several) XML parsing components built in to it that can read this file and give you exactly the information you are looking for. You can then output that information as text in whatever format you like.
There is no reason to parse it yourself.
And remember, if your solution includes RegEx, then you've already lost.
(I'm kidding about that last part. Sort of.)

Finding "Keywords" with potentially damaged HTML Files and Counting Hits

I'm trying to create a master index file for a bunch of HTML files sitting in a directory. There could be anywhere from 5 to 5000. These files aren't clean or nice, so some of the libs I looked at don't seem like they would play nice. Many of these files come from the temp directory or are carved out of the file slack (ergo incomplete files in many cases). Plus, sometimes people just write sloppy HTML.
I've basically decided to enumerate through the directory and use something like
string[] FileEntries = Directory.GetFiles(WhichDirectory);
foreach (string FileName in FileEntries)
{
using (StreamReader sr = new StreamReader(FileName))
{
HTMLContents = sr.ReadToEnd();
}
}
I'm hoping that the StreamReader can dump the contents into a character array the same way it would a text file.
Anyway, given that this might not be the cleanest HTML in the world, there are a few things I'd like to parse out of the array:
Any instance of a date in ANY format (e.g. 1/1/11, January 1st, 2011, 1-1-11, Jan-1-2011, etc.), dumped into a string to be read back later. Hopefully there is a lib or something for finding "instances" of dates.
Read a text file line by line with various "keywords" to look for in the mess of HTML. Things like "Bob Evans" or "Sausage Factory Ltd" etc. I then want to count the number of times each "keyword" shows up. The problem is I don't want the user to have to know regular expressions.
So, the desired output would be something like this:
BobEvans9304902.html
Title: Bob Evans Secret Sausage Recipe
Dates Found: "October 2nd, 2009" , "7/22/09"
"Bob Evans Sausage" : 30 hits
"Paprika" : 2 hits
"Don't overwork it" : 5 hits
All the solutions I have seen so far seem like they only work for single characters or words (LINQ) or split a "neat" sentence into words. I'm hoping I won't have to create a new copy of the string and strip out all the HTML tags, since it's not always going to be neat and I don't want to add another step to mass file processing. If that's the only way to do it, though, so be it.

You probably want to investigate an HTML parser that tolerates poorly formed markup, such as the Html Agility Pack. Then you can focus on the content and use XPath queries to search for and count keywords. I expect you'll probably still need regex to handle the dates, though.
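For the counting part on its own, plain string searching is enough, so users never have to see a regex. The sketch below is a rough illustration with made-up names (KeywordCounter, CountHits, CountAll). It counts non-overlapping, case-insensitive occurrences in whatever text it is given, whether that is raw HTML or text already extracted by a parser.

```csharp
using System;
using System.Collections.Generic;

public static class KeywordCounter
{
    // Counts non-overlapping, case-insensitive occurrences of keyword in text.
    public static int CountHits(string text, string keyword)
    {
        if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(keyword))
        {
            return 0;
        }

        int count = 0;
        int index = 0;
        while ((index = text.IndexOf(keyword, index, StringComparison.OrdinalIgnoreCase)) >= 0)
        {
            count++;
            index += keyword.Length; // continue searching after this hit
        }
        return count;
    }

    // Maps each keyword (one per line of the user's keyword file) to its hit count.
    public static Dictionary<string, int> CountAll(string text, IEnumerable<string> keywords)
    {
        var hits = new Dictionary<string, int>();
        foreach (string keyword in keywords)
        {
            hits[keyword] = CountHits(text, keyword);
        }
        return hits;
    }
}
```

The keyword list could come from File.ReadAllLines on the user's keyword file. Counting on raw HTML will miss phrases that are split by tags, which is one reason to extract the text with a tolerant parser first.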
