Consider the following C# code:
using System.Xml.Linq;
namespace TestXmlParse
{
class Program
{
static void Main(string[] args)
{
var testxml =
@"<base>
<elem1 number='1'>
<elem2>yyy</elem2>
<elem3>xxx <yyy zzz aaa</elem3>
</elem1>
</base>";
XDocument.Parse(testxml);
}
}
}
I get a System.Xml.XmlException on the parse, of course, complaining about elem3. The error message is this:
System.Xml.XmlException was unhandled
Message='aaa' is an unexpected token. The expected token is '='. Line 4, position 59.
Source=System.Xml
LineNumber=4
LinePosition=59
Obviously this is not the real Xml (we get the xml from a third party) and while the best answer would be for the third party to clean up their xml before they send it to us, is there any other way I might fix this xml before I hand it off to the parser? I've devised a hacky way to fix this; catch the exception and use that to tell me where I need to look for characters which should be escaped. I was hoping for something a bit more elegant and comprehensive.
Any suggestions are welcome.
If this is a dupe, please point me to the other questions; I'll close this myself. I am more interested in an answer than any karma gain.
EDIT:
I guess I didn't make my question as clear as I had hoped. I know the "<" in elem3 is incorrect; I'm trying to find an elegant way to detect (and correct) any badly formed xml of that sort before I attempt the parse. As I say, I get this xml from a third-party and I can't control what they give me.
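For reference, the "hacky" approach described above (let the parser fail, then use the exception to find the offending spot) might look roughly like the sketch below. `XmlProbe` and `FindFirstError` are made-up names for illustration; only `XDocument.Parse` and the `LineNumber`/`LinePosition` properties of `XmlException` come from the framework.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

static class XmlProbe
{
    // Hypothetical helper: returns null when the text parses cleanly,
    // otherwise the (line, position) pair reported by the parser.
    public static Tuple<int, int> FindFirstError(string xml)
    {
        try
        {
            XDocument.Parse(xml);
            return null;
        }
        catch (XmlException ex)
        {
            // The parser reports where it gave up, which is usually
            // shortly after the character that should have been escaped.
            return Tuple.Create(ex.LineNumber, ex.LinePosition);
        }
    }
}
```

Note that the reported position is where the parser noticed the problem (the `aaa` token in my sample), not where the stray "<" actually is, which is part of why this feels fragile.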
I would recommend that you do not manipulate the data you receive. If it is invalid it's your client's problem.
Editing the input so it is valid xml can cause serious problems, e.g. instead of throwing an error you may end up processing wrong data (because you tried your best to make the xml valid, but this may lead to different data).
[EDIT]
I still think it's not a good idea, but sometimes you have to do what you have to do.
Here is a very simple class that parses the input and replaces the invalid opening tag. You could do this with a regex (which I am not good at). This solution is not complete: depending on your requirements (or, let's say, on the bad xml you get) you will have to adapt it (e.g. scan for complete xml elements instead of only the "<" and ">" brackets, put CDATA around the inner text of a node and so on).
I just wanted to illustrate how you could do it, so please don't complain if it is slow/has bugs (as I mentioned, I would not do it).
class XmlCleaner
{
public void Clean(Stream sourceStream, Stream targetStream)
{
const char openingIndicator = '<';
const char closingIndicator = '>';
const int bufferSize = 1024;
long length = sourceStream.Length;
char[] buffer = new char[bufferSize];
bool startTagFound = false;
StringBuilder writeBuffer = new StringBuilder();
using(var reader = new StreamReader(sourceStream))
{
var writer = new StreamWriter(targetStream);
try
{
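int charsRead;
while ((charsRead = reader.Read(buffer, 0, bufferSize)) > 0)
{
// only the first charsRead entries of the buffer hold fresh data
for (int i = 0; i < charsRead; i++)
{
char c = buffer[i];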
if (c == openingIndicator)
{
if (startTagFound)
{
// we have 2 following opening tags without a closing one
// just replace the first one
writeBuffer = writeBuffer.Replace("<", "&lt;");
// append the new one
writeBuffer.Append(c);
}
else
{
startTagFound = true;
writeBuffer.Append(c);
}
}
else if (c == closingIndicator)
{
startTagFound = false;
// write writebuffer...
writeBuffer.Append(c);
writer.Write(writeBuffer.ToString());
writeBuffer.Clear();
}
else
{
writeBuffer.Append(c);
}
}
}
}
finally
{
// unfortunately the StreamWriter's Dispose method closes the underlying stream, so we just flush it
writer.Flush();
}
}
}
}
To test it:
var testxml =
@"<base>
<elem1 number='1'>
<elem2>yyy</elem2>
<elem3>xxx <yyy zzz aaa</elem3>
</elem1>
</base>";
string result;
using (var source = new MemoryStream(Encoding.ASCII.GetBytes(testxml)))
using(var target = new MemoryStream()) {
XmlCleaner cleaner = new XmlCleaner();
cleaner.Clean(source, target);
target.Position = 0;
using (var reader = new StreamReader(target))
{
result = reader.ReadToEnd();
}
}
XDocument.Parse(result);
var expectedResult =
@"<base>
<elem1 number='1'>
<elem2>yyy</elem2>
<elem3>xxx &lt;yyy zzz aaa</elem3>
</elem1>
</base>";
Debug.Assert(result == expectedResult);
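As for the regex alternative mentioned above, a rough sketch could look like this. It applies the same heuristic as the class (a "<" that is followed by another "<", or by the end of the input, before any ">" cannot open a tag) and has the same limits: it knows nothing about comments, CDATA sections or a stray ">" in text. `XmlRegexCleaner` and `EscapeStrayOpeningBrackets` are names I made up.

```csharp
using System.Text.RegularExpressions;

static class XmlRegexCleaner
{
    // Matches a '<' that is NOT followed by "anything but brackets, then '>'",
    // i.e. a '<' that never gets closed before the next '<' or end of input.
    private static readonly Regex StrayOpen = new Regex(@"<(?![^<>]*>)");

    public static string EscapeStrayOpeningBrackets(string xml)
    {
        return StrayOpen.Replace(xml, "&lt;");
    }
}
```

Well-formed tags are left alone, so running it over already valid xml of this simple kind should be a no-op.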
Here's what I want to achieve:
The user copies a cell (or a range), say A3, and - when she hits a button - I need to get access to the address of the cell (to create a link) programmatically.
Accessing the clipboard in text format is easy:
string clip;
if (Clipboard.ContainsText()) clip = Clipboard.GetText();
I also found that it is possible to access the clipboard in different formats, like this
var dataObj = Clipboard.GetDataObject();
var format = DataFormats.CommaSeparatedValue;
if (dataObj != null && dataObj.GetDataPresent(format))
{
var csvData = dataObj.GetData(format);
//...
}
but I couldn't for the life of me find which format contains the link and how to get it. (I cycled through all formats offered by Clipboard.GetDataObject().GetFormats(), but some returned inscrutable streams I couldn't make sense of.)
Background info:
A. The Link must be there, because I can use "paste link" which creates an absolute reference
B. I'm using Excel 2010 and VS2010 - C# under Win7
C. The code runs in a custom task pane
Any help appreciated!
So, thanks to everybody who managed to read this far. I finally figured it out. My solution is still awkward, as I can't get my head around the actual structure of the stream from the clipboard, but I find what I am looking for:
protected override void WndProc(ref Message m)
{
const int WM_PASTE = 0x0302;
var enc = new System.Text.UTF7Encoding();
string buffer, rangeAddress;
if (m.Msg == WM_PASTE)
{
if (Clipboard.ContainsText())
{
string clip = Clipboard.GetText();
var dataObject = Clipboard.GetDataObject();
var mstream = (MemoryStream)dataObject.GetData("Link Source", true);
if(mstream == null) return;
var rdr = new System.IO.StreamReader(mstream, enc, true);
buffer = rdr.ReadToEnd();
buffer = StripWeirdChars(buffer);
int IndexExcl = buffer.IndexOf("!");
if (IndexExcl >= 0)
{
rangeAddress = buffer.Substring(IndexExcl + 1, buffer.Length - IndexExcl - 4);
// do whatever you want to do with it, e.g.:Globals.ThisAddIn.Application.ActiveCell.Value = rangeAddress;
return;
}
}
}
base.WndProc(ref m);
}
}
The key here is obviously the particular format for the GetData: "Link Source".
Reading the resulting stream produces a string with a lot of weird characters, but also the name of the sheet and coordinates of the copied range. I strip the weird chars using a straightforward
public static string StripWeirdChars(string source)
{
string res = "";
foreach (char c in source) if ((int)c >= 32) res += c;
return res;
}
There still are some odd chars I can't make sense of, but the good news is that after the first exclamation mark you'll find the address of the range (with some trailing rubbish of fixed length). This works even when the range was copied from another worksheet and even if this worksheet has odd chars (like German umlauts) in its name.
There certainly is a much neater solution out there
Getting the Excel Range object from the Clipboard through the IStream interface
but the code I found there is not complete (the "obvious" parts are left out) and I couldn't get the thing to work due to my incompetence and missing experience with IStream to begin with.
Any help in using this to get a neat solution is appreciated, but I am content with what I have for the time being. Thanks, guys.
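To make the string handling easier to try without Excel or a clipboard, the two steps (strip control characters, then take what follows the first "!") can be pulled into pure functions. This is only a sketch: `LinkSourceParser` and its members are names I made up, and the default of 3 trailing characters simply mirrors the `- 4` arithmetic in the code above rather than anything I know about the stream layout.

```csharp
using System.Text;

static class LinkSourceParser
{
    // Same filter as StripWeirdChars, but with a StringBuilder
    // instead of repeated string concatenation.
    public static string StripControlChars(string source)
    {
        var sb = new StringBuilder(source.Length);
        foreach (char c in source)
        {
            if (c >= 32) sb.Append(c);
        }
        return sb.ToString();
    }

    // Returns the text after the first '!' minus the assumed trailing
    // rubbish, or null when there is no '!' or nothing is left.
    public static string ExtractRangeAddress(string cleaned, int trailingLength = 3)
    {
        int excl = cleaned.IndexOf('!');
        if (excl < 0) return null;
        int length = cleaned.Length - excl - 1 - trailingLength;
        return length > 0 ? cleaned.Substring(excl + 1, length) : null;
    }
}
```

Unlike the original Substring call, this returns null instead of throwing when the string after the "!" is shorter than expected.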
This is just the answer from @drdhk (see above), reposted because one of the comments asked for the question to be shown as answered. The article it refers to is at http://www.codeproject.com/Articles/149009/Getting-the-Excel-Range-object-from-the-Clipboard
There is one limitation with your approach: The result is always a rectangular range. If you have copied, let's say A1:B2 and A4:B5 into the clipboard, your approach would return A1:B5, which means that row 3 is included in the result range, where in fact, it was not part of the range in the clipboard.
There are more clipboard data types that contain fewer weird characters, like ObjectLink or Link, but they come with the same limitation for non-rectangular ranges.
Non-rectangular ranges can be selected by pressing Ctrl while selecting ranges.
I need to be able to take a text file with unknown encoding (e.g., UTF-8, UTF-16, ...) and copy it line by line, making specific changes as I go. In this example, I am changing the encoding, however there are other uses for this kind of processing.
What I can't figure out is how to determine if the last line has a newline! Some programs care about the difference between a file with these records:
Rec1<newline>
Rec2<newline>
And a file with these:
Rec1<newline>
Rec2
How can I tell the difference in my code so that I can take appropriate action?
using (StreamReader reader = new StreamReader(sourcePath))
using (StreamWriter writer = new StreamWriter(destinationPath, false, outputEncoding))
{
bool isFirstLine = true;
while (!reader.EndOfStream)
{
string line = reader.ReadLine();
if (isFirstLine)
{
writer.Write(line);
isFirstLine = false;
}
else
{
writer.Write("\r\n" + line);
}
}
//if (LastLineHasNewline)
//{
// writer.Write("\n");
//}
writer.Flush();
}
The commented out code is what I want to be able to do, but I can't figure out how to set the condition LastLineHasNewline! Remember, I have no a priori knowledge of the input file encoding.
"Remember, I have no a priori knowledge of the input file encoding."
That's the fundamental problem to solve.
If the file could be using any encoding, then there is no concept of reading "line by line" as you can't possibly tell what the line ending is.
I suggest you first address this part, and the rest will be easy. Now, without knowing the context it's hard to say whether that means you should be asking the user for the encoding, or detecting it heuristically, or something else - but I wouldn't start trying to use the data before you can fully understand it.
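As one example of the heuristic route, StreamReader can sniff a byte order mark when asked to. The sketch below is an illustration under that assumption only: `EncodingProbe` and `DetectFromBom` are made-up names, and a file without a BOM simply yields the fallback you pass in, which may well be wrong.

```csharp
using System.IO;
using System.Text;

static class EncodingProbe
{
    // Returns the encoding implied by a BOM at the start of the stream,
    // or the fallback when no BOM is present. The stream is left open,
    // but its position has moved, so rewind it before reading for real.
    public static Encoding DetectFromBom(Stream stream, Encoding fallback)
    {
        using (var reader = new StreamReader(stream, fallback, true, 1024, true))
        {
            // Detection only happens once the reader has filled its buffer.
            reader.Peek();
            return reader.CurrentEncoding;
        }
    }
}
```

This only answers "is there a BOM, and which one"; it does nothing for BOM-less UTF-16 or legacy code pages.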
As often happens, the moment you go to ask for help, the answer comes to the surface. The commented out code becomes:
if (LastLineHasNewline(reader))
{
writer.Write("\n");
}
And the function looks like this:
private static bool LastLineHasNewline(StreamReader reader)
{
byte[] newlineBytes = reader.CurrentEncoding.GetBytes("\n");
int newlineByteCount = newlineBytes.Length;
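// a stream shorter than one encoded newline cannot end with one
if (reader.BaseStream.Length < newlineByteCount)
return false;
// jump to the last few bytes and compare them with the encoded newline
reader.BaseStream.Seek(-newlineByteCount, SeekOrigin.End);
byte[] inputBytes = new byte[newlineByteCount];
if (reader.BaseStream.Read(inputBytes, 0, newlineByteCount) != newlineByteCount)
return false;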
for (int i = 0; i < newlineByteCount; i++)
{
if (newlineBytes[i] != inputBytes[i])
return false;
}
return true;
}
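An alternative that avoids touching the underlying stream at all is to let the reader do the decoding and simply remember the last character seen. The sketch below copies character by character rather than line by line, so it is not a drop-in replacement for the loop above; `NewlinePreservingCopy` is a made-up name.

```csharp
using System.IO;

static class NewlinePreservingCopy
{
    // Copies every character from reader to writer and reports
    // whether the last character of the input was '\n'.
    public static bool Copy(TextReader reader, TextWriter writer)
    {
        int last = -1;
        int ch;
        while ((ch = reader.Read()) >= 0)
        {
            writer.Write((char)ch);
            last = ch;
        }
        return last == '\n';
    }
}
```

Because it works on decoded characters, it does not need a seekable stream and does not care how many bytes a newline takes in the input encoding.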
So I'm doing a project where I am reading in a config file. The config file is just a list of string like "D 1 1", "C 2 2", etc. Now I haven't ever done a read/write in C# so I looked it up online expecting to find some sort of rendition of C/C++ .eof(). I couldn't find one.
So what I have is...
TextReader tr = new StreamReader("/mypath");
Of all the examples online of how I found to read to the end of a file the two examples that kept occurring were
while ((line = tr.ReadLine()) != null)
or
while (tr.Peek() >= 0)
I noticed that StreamReader has a bool EndOfStream but no one was suggesting it which led me to believe something was wrong with that solution. I ended up trying it like this...
while (!(tr as StreamReader).EndOfStream)
and it seems to work just fine.
So I guess my question is would I experience issues with casting a TextReader as a StreamReader and checking EndOfStream?
One obvious downside is that it makes your code StreamReader specific. Given that you can easily write the code using just TextReader, why not do so? That way if you need to use a StringReader (or something similar) for unit tests etc, there won't be any difficulties.
Personally I always use the "read a line until it's null" approach - sometimes via an extension method so that I can use
foreach (string line in reader.EnumerateLines())
{
}
EnumerateLines would then be an extension method on TextReader using an iterator block. (This means you can also use it for LINQ etc easily.)
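A minimal sketch of such an extension method is below. The class name `TextReaderExtensions` is arbitrary; the only framework call used is `TextReader.ReadLine`.

```csharp
using System.Collections.Generic;
using System.IO;

static class TextReaderExtensions
{
    // Lazily yields each line until ReadLine returns null (end of input).
    public static IEnumerable<string> EnumerateLines(this TextReader reader)
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            yield return line;
        }
    }
}
```

Since it is written against TextReader, the same loop works for a StreamReader over the config file and for a StringReader in a unit test.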
Or you could use File.ReadAllLines to simplify your code:
http://msdn.microsoft.com/en-us/library/s2tte0y1.aspx
This way, you let .NET take care of all the EOF/EOL management, and you focus on your content.
No, you won't experience any issues. If you look at the implementation of EndOfStream, you'll find that it just checks whether there is still data in the buffer and, if not, whether it can read more data from the underlying stream:
public bool EndOfStream
{
get
{
if (this.stream == null)
{
__Error.ReaderClosed();
}
if (this.charPos < this.charLen)
{
return false;
}
int num = this.ReadBuffer();
return num == 0;
}
}
Of course, casting in your code like that makes it dependent on StreamReader being the actual type of your reader, which isn't pretty to begin with.
Maybe read it all into a string and then parse it: StreamReader.ReadToEnd()
using (StreamReader sr = new StreamReader(path))
{
//This allows you to do one Read operation.
string contents = sr.ReadToEnd();
}
Well, StreamReader is a specialisation of TextReader, in the sense that StreamReader inherits from TextReader. So there shouldn't be a problem. :)
var arpStream = ExecuteCommandLine(cmd, arg);
arpStream.ReadLine(); // Read entries
while (!arpStream.EndOfStream)
{
var line1 = arpStream.ReadLine().Trim();
// TeststandInt.SendLogPrint(line, true);
}
I have a use-case where I'm required to read in some information from an XML file and act on it accordingly. The problem is, this XML file is technically allowed to be empty or full of whitespace and this means "there's no info, do nothing", any other error should fail hard.
I'm currently thinking about something along the lines of:
public void Load (string fileName)
{
XElement xml;
try {
xml = XElement.Load (fileName);
}
catch (XmlException e) {
// Check if the file contains only whitespace here
// if not, re-throw the exception
}
if (xml != null) {
// Do this only if there wasn't an exception
doStuff (xml);
}
// Run this irrespective if there was any xml or not
tidyUp ();
}
Does this pattern seem ok? If so, how do people recommend implementing the check for if the file contained only whitespace inside the catch block? Google only throws up checks for if a string is whitespace...
Cheers muchly,
Graham
Well, the easiest way is probably to make sure it isn't whitespace in the first place, by reading the entire file into a string first (I'm assuming it isn't too huge):
public void Load (string fileName)
{
string xmlString;
// dispose the stream and reader as soon as the text has been read
using (var stream = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
using (var reader = new StreamReader(stream, Encoding.UTF8, true))
{
xmlString = reader.ReadToEnd();
}
if (!string.IsNullOrWhiteSpace(xmlString)) { // On .NET < 4, test (xmlString.Trim().Length != 0) instead
var xml = XElement.Parse(xmlString); // Exceptions will bubble up
doStuff(xml);
}
tidyUp();
}
I am sending mails (in asp.net ,c#), having a template in text file (.txt) like below
User Name :<User Name>
Address : <Address>.
I used to replace the words within the angle brackets in the text file using the below code
StreamReader sr;
sr = File.OpenText(HttpContext.Current.Server.MapPath(txt));
copy = sr.ReadToEnd();
sr.Close(); //close the reader
copy = copy.Replace(word.ToUpper(),"#" + word.ToUpper()); //remove the word specified UC
//save new copy into existing text file
FileInfo newText = new FileInfo(HttpContext.Current.Server.MapPath(txt));
StreamWriter newCopy = newText.CreateText();
newCopy.WriteLine(copy);
newCopy.Write(newCopy.NewLine);
newCopy.Close();
Now I have a new problem: the user will be adding new words within angle brackets, for example <Salary>.
In that case I have to read the file and find the word <Salary>.
In other words, I have to find all the words that are enclosed in angle brackets (<>).
How do I do that?
Having a stream for your file, you can build something similar to a typical tokenizer.
In general terms, this works as a finite state machine: you need an enumeration for the states (in this case it could be simplified down to a boolean, but I'll give you the general approach so you can reuse it on similar tasks) and a function implementing the logic. C#'s iterators are a good fit for this problem, so I'll be using them in the snippet below. Your function will take the reader as an argument, will use an enumerated value and a char buffer internally, and will yield the strings one by one. You'll need this near the start of your code file:
using System.Collections.Generic;
using System.IO;
using System.Text;
And then, inside your class, something like this:
enum States {
OUT,
IN,
}
IEnumerable<string> GetStrings(TextReader reader) {
States state=States.OUT;
StringBuilder buffer=new StringBuilder(); // must be assigned up front, or the compiler rejects its use in States.IN
int ch;
while((ch=reader.Read())>=0) {
switch(state) {
case States.OUT:
if(ch=='<') {
state=States.IN;
buffer=new StringBuilder();
}
break;
case States.IN:
if(ch=='>') {
state=States.OUT;
yield return buffer.ToString();
} else {
buffer.Append((char)ch); // Read() returns a UTF-16 code unit; ConvertFromUtf32 would throw on surrogates
}
break;
}
}
}
The finite-state machine model always has the same layout: while(READ_INPUT) { switch(STATE) {...}}: inside each case of the switch, you may be producing output and/or altering the state. Beyond that, the algorithm is defined in terms of states and state changes: for any given state and input combination, there is an exact new state and output combination (the output can be "nothing" on those states that trigger no output; and the state may be the same old state if no state change is triggered).
Hope this helps.
EDIT: forgot to mention a couple of things:
1) You get a TextReader to pass to the function by creating a StreamReader for a file, or a StringReader if you already have the file in a string.
2) The memory and time costs of this approach are O(n), with n being the length of the file. They seem quite reasonable for this kind of task.
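For completeness, here is roughly what the boolean simplification mentioned earlier could look like, together with the StringReader case from point 1. It is a sketch only; `PlaceholderScanner` and `GetPlaceholders` are names chosen for this illustration.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

static class PlaceholderScanner
{
    // Two states only: outside brackets (inside == false) or inside them.
    public static IEnumerable<string> GetPlaceholders(TextReader reader)
    {
        bool inside = false;
        var buffer = new StringBuilder();
        int ch;
        while ((ch = reader.Read()) >= 0)
        {
            if (!inside)
            {
                // Outside: only an opening bracket changes anything.
                if (ch == '<') { inside = true; buffer.Clear(); }
            }
            else if (ch == '>')
            {
                // Closing bracket: emit what was collected, go back outside.
                inside = false;
                yield return buffer.ToString();
            }
            else
            {
                buffer.Append((char)ch);
            }
        }
    }
}
```

Calling `PlaceholderScanner.GetPlaceholders(new StringReader(text))` with the template text from the question would yield the words between each pair of brackets, in order.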
Using regex.
var matches = Regex.Matches(text, "<(.*?)>");
List<string> words = new List<string>();
for (int i = 0; i < matches.Count; i++)
{
words.Add(matches[i].Groups[1].Value);
}
Of course, this assumes you already have the file's text in a variable. Since you have to read the entire file to achieve that, you could look for the words as you are reading the stream, but I don't know what the performance trade off would be.
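If the end goal is to fill the template rather than just list the words, the same pattern can drive the replacement directly. The following is a sketch under that assumption; `TemplateFiller`, `Fill` and the dictionary of values are hypothetical, and unknown placeholders are deliberately left untouched.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TemplateFiller
{
    // Replaces each <Word> with values["Word"] when present;
    // placeholders without a value stay as they are.
    public static string Fill(string template, IDictionary<string, string> values)
    {
        return Regex.Replace(template, "<(.*?)>", m =>
        {
            string value;
            return values.TryGetValue(m.Groups[1].Value, out value) ? value : m.Value;
        });
    }
}
```

With this shape, a newly added word such as <Salary> needs no code change, only a new entry in the dictionary.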
This is not an answer, but comments can't do this:
You should place some of your objects into using blocks. Something like this:
using(StreamReader sr = File.OpenText(HttpContext.Current.Server.MapPath(txt)))
{
copy = sr.ReadToEnd();
} // reader is closed by the end of the using block
//remove the word specified UC
copy = copy.Replace(word.ToUpper(), "#" + word.ToUpper());
//save new copy into existing text file
FileInfo newText = new FileInfo(HttpContext.Current.Server.MapPath(txt));
using(var newCopy = newText.CreateText())
{
newCopy.WriteLine(copy);
newCopy.Write(newCopy.NewLine);
}
The using block ensures that resources are cleaned up even if an exception is thrown.
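A small way to convince yourself of that is a disposable type that records whether Dispose ran. Everything here (`TrackedResource`, `UsingDemo`) is made up for the demonstration and stands in for the StreamReader/StreamWriter above.

```csharp
using System;

sealed class TrackedResource : IDisposable
{
    public bool Disposed { get; private set; }
    public void Dispose() { Disposed = true; }
}

static class UsingDemo
{
    // Throws inside a using block and reports whether Dispose still ran.
    public static bool DisposedDespiteException()
    {
        var resource = new TrackedResource();
        try
        {
            using (resource)
            {
                throw new InvalidOperationException("simulated failure");
            }
        }
        catch (InvalidOperationException)
        {
            // swallowed only so the demo can inspect the resource afterwards
        }
        return resource.Disposed;
    }
}
```

The compiler expands a using block into a try/finally that calls Dispose, which is why the flag is set even though the body never completes normally.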