Why does XDocument.Parse throw NotSupportedException? - c#

I am trying to parse XML data using XDocument.Parse, which throws NotSupportedException, just like in the topic "Is XDocument.Parse different in Windows Phone 7?". I updated my code according to the advice posted there, but it still doesn't help. Some time ago I parsed RSS using a similar (but simpler) method and that worked just fine.
public void sList()
{
WebClient client = new WebClient();
client.Encoding = Encoding.UTF8;
string url = "http://eztv.it";
Uri u = new Uri(url);
client.DownloadStringAsync(u);
client.DownloadStringCompleted += new DownloadStringCompletedEventHandler(client_DownloadStringCompleted);
}
private void client_DownloadStringCompleted(object sender, DownloadStringCompletedEventArgs e)
{
try
{
string s = e.Result;
s = cut(s);
XmlReaderSettings settings = new XmlReaderSettings();
settings.DtdProcessing = DtdProcessing.Ignore;
XDocument document = null;// XDocument.Parse(s);//Load(s);
using (XmlReader reader = XmlReader.Create(new StringReader(e.Result), settings))
{
document = XDocument.Load(reader); // error thrown here
}
// ... rest of code
}
catch (Exception ex)
{
MessageBox.Show( ex.Message);
}
}
string cut(string s)
{
int iod = s.IndexOf("<select name=\"SearchString\">");
int ido = s.LastIndexOf("</select>");
s = s.Substring(iod, ido - iod + 9);
return s;
}
When I replace string s with the hard-coded value
//string s = "<select name=\"SearchString\"><option value=\"308\">10 Things I Hate About You</option><option value=\"539\">2 Broke Girls</option></select>";
everything works and no exception is thrown, so what am I doing wrong?

e.Result contains special characters such as '&'.
I tried encoding every character except '<', '>' and '"' with HttpUtility.HtmlEncode(), and XDocument then parsed the result.
UPD:
I didn't want to show my code, but you left me no chance :)
string y = "";
for (int i = 0; i < s.Length; i++)
{
if (s[i] == '<' || s[i] == '>' || s[i] == '"')
{
y += s[i];
}
else
{
y += HttpUtility.HtmlEncode(s[i].ToString());
}
}
XDocument document = XDocument.Parse(y);
var options = (from option in document.Descendants("option")
select option.Value).ToList();
It works for me on WP7. Please do not use this code for HTML conversion; I wrote it quickly, just for test purposes.
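The point about '&' can be checked in isolation. The following is a minimal sketch, assuming desktop .NET (where the failure surfaces as an XmlException, which may differ from what WP7 reports); the class name, the helper EscapeBareAmpersands and the sample markup are made up for illustration. It escapes only ampersands that do not already start an entity or character reference:

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.Linq;

static class AmpersandDemo
{
    // Hypothetical helper: escapes '&' unless it already begins something
    // shaped like an entity (&name;) or a character reference (&#10; / &#xA;).
    public static string EscapeBareAmpersands(string markup)
    {
        return Regex.Replace(
            markup,
            @"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)",
            "&amp;");
    }

    static void Main()
    {
        // Made-up sample in the shape of the scraped <select> block.
        string raw = "<select><option value=\"1\">Law & Order</option></select>";

        try
        {
            XDocument.Parse(raw);
        }
        catch (XmlException ex)
        {
            // A bare '&' is not well-formed XML, so the parser rejects it.
            Console.WriteLine("Raw input rejected: " + ex.Message);
        }

        XDocument doc = XDocument.Parse(EscapeBareAmpersands(raw));
        Console.WriteLine(doc.Root.Element("option").Value);
    }
}
```

Note that HTML-only entities such as &nbsp; are left untouched by this helper and would still be rejected by an XML parser, so real-world HTML is usually better handled by an HTML parser.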

Related

How to loop XmlTextReader properly (C#)?

Below is a sample of the type of XML file I am trying to handle. If I have only one part along with an accompanying number/character I can process the data extraction without the necessity of the 'if (!reader.EOF)' control structure. However when I try to include this structure so that I can loop back to checking for another part, number, and character group, it deadlocks.
Any advice as to how to do this properly? This was the most efficient idea that popped into my head. I am new to reading data from XMLs.
Sample Xml:
<?xml version="1.0" encoding="UTF-8"?>
<note>
<part>100B</part>
<number>45</number>
<character>a</character>
<part>100C</part>
<number>55</number>
<character>b</character>
</note>
Code:
String part = "part";
String number = "number";
String character = "character";
String appendString = "";
StringBuilder sb = new StringBuilder();
try
{
XmlTextReader reader = new XmlTextReader("myPath");
while (reader.Read())
{
switch (reader.NodeType)
{
case XmlNodeType.Element: // The node is an element.
myLabel:
if (reader.Name == part)
{
part = reader.ReadInnerXml();
}
if (reader.Name == number)
{
number = reader.ReadInnerXml();
number = double.Parse(number).ToString("F2"); //format num
}
if (reader.Name == character)
{
character = reader.ReadInnerXml();
}
//new string
appendString = ("Part: " + part + "\nNumber: " + number +
"\nCharacter: " + character + "\n");
//concatenate
sb.AppendLine(appendString);
if (reader.EOF != true)
{
Debug.Log("!eof");
part = "part";
number = "number";
character = "character";
goto myLabel;
}
//print fully concatenated result
sb.ToString();
//reset string builder
sb.Length = 0;
break;
}
}
}
catch (XmlException e)
{
// Write error.
Debug.Log(e.Message);
}
catch (FileNotFoundException e)
{
// Write error.
Debug.Log(e);
}
catch(ArgumentException e)
{
// Write error.
Debug.Log(e);
}
The XmlReader class has many useful methods; use them instead of a hand-written loop.
For example:
var sb = new StringBuilder();
using (var reader = XmlReader.Create("test.xml"))
{
while (reader.ReadToFollowing("part"))
{
var part = reader.ReadElementContentAsString();
sb.Append("Part: ").AppendLine(part);
reader.ReadToFollowing("number");
var number = reader.ReadElementContentAsDouble();
sb.Append("Number: ").Append(number).AppendLine();
reader.ReadToFollowing("character");
var character = reader.ReadElementContentAsString();
sb.Append("Character: ").AppendLine(character);
}
}
Console.WriteLine(sb);
Alexander's answer is fine; I just want to add a sample using XDocument, following Jon Skeet's comments:
var sb = new StringBuilder();
var note = XDocument.Load("test.xml").Root.Descendants();
foreach (var el in note)
{
sb.Append(el.Name).Append(": ").AppendLine(el.Value);
}
Console.WriteLine(sb);
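If the goal is to keep each part, number and character together as one group, the XDocument approach can be extended slightly. This is only a sketch under the assumption that the three element kinds always appear in matching order and equal counts, as in the sample XML; the class and method names are invented for illustration, and Enumerable.Zip requires .NET 4 or later:

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Xml.Linq;

static class NoteGroups
{
    // Pairs the n-th <part>, <number> and <character> elements into one line each.
    public static string[] Describe(string xml)
    {
        XElement note = XDocument.Parse(xml).Root;

        var parts = note.Elements("part").Select(e => e.Value);
        // The explicit XElement-to-double conversion parses the element text.
        var numbers = note.Elements("number")
            .Select(e => ((double)e).ToString("F2", CultureInfo.InvariantCulture));
        var characters = note.Elements("character").Select(e => e.Value);

        return parts
            .Zip(numbers, (p, n) => new { Part = p, Number = n })
            .Zip(characters, (pn, c) =>
                "Part: " + pn.Part + ", Number: " + pn.Number + ", Character: " + c)
            .ToArray();
    }

    static void Main()
    {
        string xml =
            "<note><part>100B</part><number>45</number><character>a</character>" +
            "<part>100C</part><number>55</number><character>b</character></note>";

        foreach (string line in Describe(xml))
        {
            Console.WriteLine(line);
        }
    }
}
```

Wrapping each group in its own parent element in the XML would make the grouping explicit and remove the need for the ordering assumption.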

How to access and replace text in certain paragraphs using OPENXML powertools case by case

I am trying to redact some Word files using C# and Open XML. I need to do a controlled replacement of numbers with a certain phrase. Each Word file contains a different amount of information. I want to use the OpenXML PowerTools for this purpose.
I used the normal Open XML method to replace, but it is very unreliable and produces random errors such as a zero-length error. I used a regex replace and that seems to work, but it replaces throughout the whole document, which is highly undesirable.
Here is a snippet of the code:
private void redact_Replaceall(string wfile)
{
try
{
using (WordprocessingDocument doc = WordprocessingDocument.Open(wfile, true))
{
var ydoc = doc.MainDocumentPart.GetXDocument();
IEnumerable<XElement> content = ydoc.Descendants(W.body);
Regex regex = new Regex(@"\d+\.\d{2,3}");
int count1 = OpenXmlPowerTools.OpenXmlRegex.Match(content, regex);
int count2 = OpenXmlPowerTools.OpenXmlRegex.Replace(content, regex, replace_text, null);
statusBar1.Text = "Try 1: Found: " + count1 + ", Replaced: " + count2;
doc.MainDocumentPart.PutXDocument();
}
}
catch(Exception e)
{
MessageBox.Show("Replace all experienced error: " + e.Message);
}
}
Basically, I want to do this redaction based on the content of each paragraph. I am able to get the paragraphs, but not their IDs, using:
IEnumerable<XElement> content = ydoc.Descendants(W.p);
Here is my approach using the normal Open XML method, but I get a lot of errors depending on the file.
foreach (DocumentFormat.OpenXml.Wordprocessing.Paragraph para in bod.Descendants<DocumentFormat.OpenXml.Wordprocessing.Paragraph>())
{
foreach (var run in para.Elements<Run>())
{
foreach (var text in run.Elements<Text>())
{
string temp = text.Text;
int firstlength = first.Length + 1;
int secondlength = second.Length + 1;
if (text.Text.Contains(first) && !(temp.Length > firstlength))
{
text.Text = text.Text.Replace(first, "DELETED");
}
if (text.Text.Contains(second) && !(temp.Length > secondlength))
{
text.Text = text.Text.Replace(second, "DELETED");
}
}
}
}
Here is the last new approach but I am stuck on it
private void redact_Replacebadones(string wfile)
{
try
{
using (WordprocessingDocument doc = WordprocessingDocument.Open(wfile, true))
{
var ydoc = doc.MainDocumentPart.GetXDocument();
/* from XElement xele in ydoc.Root.Elements();
List<string> lhsElements = xele.Elements("lhs")
.Select(el => el.Attribute("id").Value)
.ToList();
*/
/// XElement
IEnumerable<XElement> content = ydoc.Descendants(W.p);
foreach (var p in content )
{
if (p.Value.Contains("each") && !p.Value.Contains("DELETED"))
{
string to_overwrite = p.Value;
Regex regexop = new Regex(@"\d+\.\d{2,3}");
regexop.Replace(to_overwrite, "Deleted");
p.SetValue(to_overwrite);
MessageBox.Show("NAME :" + p.GetParagraphInfo() +" VValue:"+to_overwrite);
}
}
doc.MainDocumentPart.PutXDocument();
}
}
catch (Exception e)
{
MessageBox.Show("Replace each experienced error: " + e.Message);
}
}
This may be a bit late, but OpenXML Power Tools by Eric White has a SearchAndReplace function that replaces text content, so you don't have to handle it with a regex.
This function also handles text that is split across runs. (When you edit a document, a single word can be split into several runs, so you don't find the search phrase directly.)
Maybe this helps somebody.

How can I read a Lync conversation file containing HTML?

I'm having trouble reading a local file into a string in C#.
Here's what I have come up with so far:
string file = @"C:\script_test\{5461EC8C-89E6-40D1-8525-774340083829}.html";
using (StreamReader reader = new StreamReader(file))
{
string line = "";
while ((line = reader.ReadLine()) != null)
{
textBox1.Text += line.ToString();
}
}
And it's the only solution that seems to work.
I've tried some other suggested methods for reading a file, such as:
string file = @"C:\script_test\{5461EC8C-89E6-40D1-8525-774340083829}.html";
string html = File.ReadAllText(file).ToString();
textBox1.Text += html;
Yet it does not work as expected.
Here are the first few lines of the file I'm trying to read:
As you can see, it has some funky characters; honestly, I don't know if that's the cause of this weird behavior.
In the first case, though, the code seems to skip those lines, printing only "Document generated by Office Communicator..."
Your task would be easier if you could use an API or SDK, or at least had a description of the format you are trying to read. However, the binary format does not look that complicated, and with a hex viewer installed I got far enough to extract the HTML from the example you provided.
To parse non-text files you fall back to BinaryReader and use one of its Read methods to read the correct type from the byte stream. I used ReadByte and ReadInt32. Note that the documentation of each method states how many bytes it reads; that comes in handy when you try to decipher your file.
private string ParseHist(string file)
{
using (var f = File.Open(file, FileMode.Open))
{
using (var br = new BinaryReader(f))
{
// read 4 bytes as an int
var first = br.ReadInt32();
// read integer / zero ended byte arrays as string
var lead = br.ReadInt32();
// until we have 4 zero bytes
while (lead != 0)
{
var user = ParseString(br);
Trace.Write(lead);
Trace.Write(":");
Trace.Write(user.Length);
Trace.Write(":");
Trace.WriteLine(user);
lead = br.ReadInt32();
// weird special case
if (lead == 2)
{
lead = br.ReadInt32();
}
}
// at the start of the html block
var htmllen = br.ReadInt32();
Trace.WriteLine(htmllen);
// parse the html
var html = ParseString(br);
Trace.Write(htmllen);
Trace.Write(":");
Trace.Write(html.Length);
Trace.Write(":");
Trace.WriteLine(html);
// other structures follow, left unparsed
return html.ToString();
}
}
}
// a string seems to be ascii encoded and ends with a zero byte.
private static string ParseString(BinaryReader br)
{
var ch = br.ReadByte();
var sb = new StringBuilder();
while (ch != 0)
{
sb.Append((char)ch);
ch = br.ReadByte();
}
return sb.ToString();
}
You could use the simple parsing logic in a winform application as follows:
private void button1_Click(object sender, EventArgs e)
{
webBrowser1.DocumentText = ParseHist(@"5461EC8C-89E6-40D1-8525-774340083829-Copia.html");
}
Keep in mind that this is neither bulletproof nor the recommended way, but it should get you started. For files that don't parse well, you'll need to go back to the hex viewer and work out which byte structures are new or different from what you already handled. That is not something I intend to help with; it is left as an exercise for you.
I don't know if it's the right way to answer this, but here's what I've managed to do so far:
string file = @"C:\script_test\{1C0365BC-54C6-4D31-A1C1-586C4575F9EA}.hist";
string outText = "";
//Encoding iso = Encoding.GetEncoding("ISO-8859-1");
Encoding utf8 = Encoding.UTF8;
StreamReader reader = new StreamReader(file, utf8);
char[] text = reader.ReadToEnd().ToCharArray();
//skip first n chars
/*
for (int i = 250; i < text.Length; i++)
{
outText += text[i];
}
*/
for (int i = 0; i < text.Length; i++)
{
//skips non printable characters
if (!Char.IsControl(text[i]))
{
outText += text[i];
}
}
string source = "";
source = WebUtility.HtmlDecode(outText);
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.LoadHtml(source);
string html = "<html><style>";
foreach (HtmlNode node in htmlDoc.DocumentNode.SelectNodes("//style"))
{
html += node.InnerHtml+ Environment.NewLine;
}
html += "</style><body>";
foreach (HtmlNode node in htmlDoc.DocumentNode.SelectNodes("//body"))
{
html += node.InnerHtml + Environment.NewLine;
}
html += "</body></html>";
richTextBox1.Text += html+Environment.NewLine;
webBrowser1.DocumentText = html;
The conversation displays correctly, both style and encoding.
So it's a start for me.
Thank you all for the support!
EDIT
Char.IsControl(char)
skips non printable characters :)

XML Parsing for Mediawiki link

I have this link http://en.wikipedia.org/w/api.php?format=xml&action=query&titles=panadol&prop=revisions&rvprop=content
I need to get the content inside the rev tag, so I used this code:
private void HttpsCompleted(object sender, DownloadStringCompletedEventArgs e)
{
WebClient wwc = new WebClient();
String xmlStr = "http://en.wikipedia.org/w/api.php?format=xml&action=query&titles=" + medName + "&prop=revisions&rvprop=content";
wwc.DownloadStringCompleted += wwc_DownloadStringCompleted;
wwc.DownloadStringAsync(new Uri(xmlStr));
}
else
{
MessageBox.Show("Couldn't search for medicine!\nCheck the internet connection.");
}
}
catch (Exception)
{
// do nothing
}
}
I am also calling this method:
XNamespace ns = "http://www.w3.org/2005/Atom";
var entry = XDocument.Parse(e.Result);
var xmlData = new xmlWiki();
var g = entry.Element(ns + "rev").Value.ToString();
}
}
catch (Exception f)
{
MessageBox.Show(f.ToString());
}
}
But I get a NullReferenceException when the code executes "var g = entry.Element(ns + "rev").Value.ToString();".
Any help would be appreciated. Thank you in advance.
rev is not a direct child of the root of the tree. This is the path to it:
api
query
pages
page
revisions
rev
You can use .Descendants() to reach it.
var entry = XDocument.Parse(html);
var g = entry.Descendants("rev").First().Value;
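The difference between the two lookups can be seen on a small document. This is a sketch only: the XML string is a hypothetical, trimmed-down stand-in with the same nesting as the path above, not a real API response, and the class and method names are invented for illustration. FirstOrDefault is used instead of First so that a missing rev yields null rather than an exception:

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class RevLookup
{
    // Returns the text of the first <rev> element at any depth, or null if there is none.
    public static string FirstRevText(string xml)
    {
        XElement rev = XDocument.Parse(xml).Descendants("rev").FirstOrDefault();
        return rev == null ? null : rev.Value;
    }

    static void Main()
    {
        // Hypothetical stand-in for the response, mirroring the element path above.
        string xml =
            "<api><query><pages><page><revisions>" +
            "<rev>wikitext here</rev>" +
            "</revisions></page></pages></query></api>";

        // Element() only looks at direct children, so it finds nothing here.
        Console.WriteLine(XDocument.Parse(xml).Element("rev") == null);

        // Descendants() searches the whole tree.
        Console.WriteLine(FirstRevText(xml));
    }
}
```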

Why is XmlReader / XmlSerializer messing up line jumps in text when deserializing?

My object template, which is deserialized from a hand made XML file contains mixed types and the text can contain line jumps. When I look at the text I can see line jumps are \r\n, but in my deserialized template object, line jumps are \n. How can I keep line jumps as \r\n?
XmlReaderSettings settings = new XmlReaderSettings();
settings.CloseInput = true;
//settings.ValidationEventHandler += ValidationEventHandler;
settings.ValidationType = ValidationType.Schema;
settings.Schemas.Add(schema);
StringReader r = new StringReader(syntaxEdit.Text);
Schema.template rawTemplate = null;
using (XmlReader validatingReader = XmlReader.Create(r, settings))
{
try
{
XmlSerializer serializer = new XmlSerializer(typeof(Schema.template));
rawTemplate = serializer.Deserialize(validatingReader) as Schema.template;
}
catch (Exception ex)
{
rawTemplate = null;
string floro = ex.Message + (null != ex.InnerException ? ":\n" + ex.InnerException.Message : "");
MessageBox.Show(floro);
}
}
It seems that this is required behavior by the XML specification and is a "feature" in Microsoft's implementation of the XmlReader (see this answer).
Probably the easiest thing for you to do would be to replace \n with \r\n in your result.
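A plain Replace("\n", "\r\n") would double the \r wherever the text already contains \r\n, so a slightly more careful replacement may be safer. The following is a minimal sketch; the class and method names are invented for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

static class LineEndings
{
    // Rewrites every "\n" and every existing "\r\n" as exactly one "\r\n".
    // A lone "\r" that is not followed by "\n" is left unchanged.
    public static string ToCrLf(string text)
    {
        return Regex.Replace(text, @"\r?\n", "\r\n");
    }

    static void Main()
    {
        // Stand-in for a deserialized value whose line breaks were normalized to "\n".
        string deserialized = "Hello\nworld";
        Console.WriteLine(ToCrLf(deserialized) == "Hello\r\nworld");
    }
}
```

This would be applied to each string property after deserialization, which is only practical when the number of affected properties is small.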
That's the behavior mandated by the XML specification: every \r\n, \r or \n MUST be interpreted as a single \n character. If you want to maintain the \r in your output, you have to change it to a character reference (&#xD;) as shown below.
public class StackOverflow_7374609
{
[XmlRoot(ElementName = "MyType", Namespace = "")]
public class MyType
{
[XmlText]
public string Value;
}
static void PrintChars(string str)
{
string toEscape = "\r\n\t\b";
string escapeChar = "rntb";
foreach (char c in str)
{
if (' ' <= c && c <= '~')
{
Console.WriteLine(c);
}
else
{
int escapeIndex = toEscape.IndexOf(c);
if (escapeIndex >= 0)
{
Console.WriteLine("\\{0}", escapeChar[escapeIndex]);
}
else
{
Console.WriteLine("\\u{0:X4}", (int)c);
}
}
}
Console.WriteLine();
}
public static void Test()
{
string serialized = "<MyType>Hello\r\nworld</MyType>";
MemoryStream ms = new MemoryStream(Encoding.UTF8.GetBytes(serialized));
XmlSerializer xs = new XmlSerializer(typeof(MyType));
MyType obj = (MyType)xs.Deserialize(ms);
Console.WriteLine("Without the replacement");
PrintChars(obj.Value);
serialized = serialized.Replace("\r", "&#xD;");
ms = new MemoryStream(Encoding.UTF8.GetBytes(serialized));
obj = (MyType)xs.Deserialize(ms);
Console.WriteLine("With the replacement");
PrintChars(obj.Value);
}
}
