Reading very large .xml.bz2 files - c#

I'd like to parse Wikimedia's .xml.bz2 dumps without extracting the entire file or performing any XML validation:
var filename = "enwiki-20160820-pages-articles.xml.bz2";
var settings = new XmlReaderSettings()
{
ValidationType = ValidationType.None,
ConformanceLevel = ConformanceLevel.Auto // Fragment ?
};
using (var stream = File.Open(filename, FileMode.Open))
using (var bz2 = new BZip2InputStream(stream))
using (var xml = XmlTextReader.Create(bz2, settings))
{
xml.ReadToFollowing("page");
// ...
}
The BZip2InputStream works - if I use a StreamReader, I can read XML line by line. But when I use XmlTextReader, it fails when I try to perform the read:
System.Xml.XmlException: 'Unexpected end of file has occurred. The following elements are not closed: mediawiki. Line 58, position 1.'
The bzip stream is not at EOF. Is it possible to open an XmlTextReader on top of a BZip2 stream? Or is there some other means to do this?

This should work. I used a combination of XmlReader and LINQ to XML. You can parse each XElement doc as needed.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
namespace ConsoleApplication29
{
class Program
{
const string URL = @"https://dumps.wikimedia.org/enwiki/20160820/enwiki-20160820-abstract26.xml";
static void Main(string[] args)
{
XmlReader reader = XmlReader.Create(URL);
while (!reader.EOF)
{
if (reader.Name != "doc")
{
reader.ReadToFollowing("doc");
}
if (!reader.EOF)
{
XElement doc = (XElement)XElement.ReadFrom(reader);
}
}
}
}
}
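The same read-one-element-at-a-time loop can be pointed at any Stream rather than a URL. The following is a minimal sketch, not a fix for the exception in the question: the names PageStreamer and StreamElements are hypothetical, and the in-memory sample stands in for the decompressed dump. It only streams what the underlying stream delivers, so it assumes the decompression stream yields the complete document.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

static class PageStreamer
{
    // Yields each element with the given local name, one at a time,
    // so only a single element is materialized in memory.
    // Hypothetical helper for illustration; not part of any library.
    public static IEnumerable<XElement> StreamElements(Stream input, string elementName)
    {
        using (XmlReader reader = XmlReader.Create(input))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.LocalName == elementName)
                {
                    // ReadFrom consumes the whole element and leaves the reader
                    // on the node after it, so no extra Read() is needed here.
                    yield return (XElement)XNode.ReadFrom(reader);
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }

    static void Main()
    {
        // Stand-in for the decompressed dump (assumption: no XML namespace).
        string sample = "<mediawiki><siteinfo/>" +
                        "<page><title>A</title></page>" +
                        "<page><title>B</title></page></mediawiki>";
        using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(sample)))
        {
            foreach (XElement page in StreamElements(stream, "page"))
            {
                Console.WriteLine((string)page.Element("title"));
            }
        }
    }
}
```

Matching on LocalName keeps the loop working when the document declares a default namespace. Child lookups such as page.Element("title") would then need the namespace as well.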

Related

sending an xml message to port

Apologies for the vague question. I am trying to write a c# console application which sends an xml message to a port on which a 3rd party application is listening. The application then sends back another xml message so I need to read that too. Any suggestions would be greatly appreciated.
This link kind of shows what I'm trying to do.
If you're not hugely familiar with raw sockets, I'd do something like:
using (var client = new TcpClient())
{
client.Connect("host", 2324);
using (var ns = client.GetStream())
using (var writer = new StreamWriter(ns))
{
writer.Write(xml);
writer.Write("\r\n\r\n");
writer.Flush();
}
client.Close();
}
For less abstraction, you'd just use a Socket instance directly and deal with all the encoding etc manually, just giving Socket.Send a byte[].
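Since the question also asks about reading the reply, here is a rough sketch of a send-then-receive round trip. It rests on an assumption the question does not confirm: that the peer reads until the sender stops sending, writes its reply, and then closes the connection. If the real application uses a terminator or a length prefix instead, the reading side has to follow that protocol. XmlTcpClient, SendAndReceive and RunLoopbackDemo are hypothetical names, and the loopback listener only stands in for the third-party application.

```csharp
using System;
using System.IO;
using System.Net;
using System.Net.Sockets;
using System.Text;
using System.Threading.Tasks;

static class XmlTcpClient
{
    // Sends one XML message and returns everything the peer writes back.
    // Assumption: the peer replies after we stop sending, then closes.
    public static string SendAndReceive(string host, int port, string xml)
    {
        using (var client = new TcpClient())
        {
            client.Connect(host, port);
            using (NetworkStream ns = client.GetStream())
            {
                byte[] payload = Encoding.UTF8.GetBytes(xml);
                ns.Write(payload, 0, payload.Length);
                // Half-close: signal "no more data" but keep receiving.
                client.Client.Shutdown(SocketShutdown.Send);
                using (var reader = new StreamReader(ns, Encoding.UTF8))
                {
                    return reader.ReadToEnd();
                }
            }
        }
    }

    // Starts a throwaway loopback server and performs one round trip against it.
    public static string RunLoopbackDemo()
    {
        var listener = new TcpListener(IPAddress.Loopback, 0); // port 0: OS picks a free port
        listener.Start();
        int port = ((IPEndPoint)listener.LocalEndpoint).Port;
        Task server = Task.Run(() =>
        {
            using (TcpClient peer = listener.AcceptTcpClient())
            using (NetworkStream s = peer.GetStream())
            using (var r = new StreamReader(s, Encoding.UTF8))
            {
                r.ReadToEnd(); // returns once the client half-closes
                byte[] reply = Encoding.UTF8.GetBytes("<ack/>");
                s.Write(reply, 0, reply.Length);
            }
        });
        string response = SendAndReceive("127.0.0.1", port, "<ping/>");
        server.Wait();
        listener.Stop();
        return response;
    }

    static void Main()
    {
        Console.WriteLine(RunLoopbackDemo());
    }
}
```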
Use LINQ to XML:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Data;
using System.Xml;
using System.Xml.Linq;
namespace ConsoleApplication50
{
class Program
{
static void Main(string[] args)
{
//<?xml version="1.0"?>
//<active_conditions>
// <condition id="12323" name="Sunny"/>
// <condition id="13323" name="Warm"/>
//</active_conditions>
string header = "<?xml version=\"1.0\"?><active_conditions></active_conditions>";
XDocument doc = XDocument.Parse(header);
XElement activeCondition = (XElement)doc.FirstNode;
activeCondition.Add(new object[] {
new XElement("condition", new object[] {
new XAttribute("id", 12323),
new XAttribute("name", "Sunny")
}),
new XElement("condition", new object[] {
new XAttribute("id", 13323),
new XAttribute("name", "Warm")
})
});
string xml = doc.ToString();
XDocument doc2 = XDocument.Parse(xml);
var results = doc2.Descendants("condition").Select(x => new
{
id = x.Attribute("id"),
name = x.Attribute("name")
}).ToList();
}
}
}

Read the content of an xml file within a zip package

I need to read the contents of an .xml file through a Stream; the XML file sits inside a zip package. In the code below I need to determine the file path at run time (I have hardcoded the path here for reference). Please let me know how to obtain the file path at run time.
I have tried string s = entry.FullName.ToString(); but I get the error "Could not find the Path". I have also tried hardcoding the path as shown below, but I get the same FileNotFound error.
string metaDataContents;
using (var zipStream = new FileStream(@"C:\OB10LinuxShare\TEST1\Temp" + "\\"+zipFileName+".zip", FileMode.Open))
using (var archive = new ZipArchive(zipStream, ZipArchiveMode.Read))
{
foreach (var entry in archive.Entries)
{
if (entry.Name.EndsWith(".xml"))
{
FileInfo metadataFileInfo = new FileInfo(entry.Name);
string metadataFileName = metadataFileInfo.Name.Replace(metadataFileInfo.Extension, String.Empty);
if (String.Compare(zipFileName, metadataFileName, true) == 0)
{
using (var stream = entry.Open())
using (var reader = new StreamReader(stream))
{
metaDataContents = reader.ReadToEnd();
clientProcessLogWriter.WriteToLog(LogWriter.LogLevel.DEBUG, "metaDataContents : " + metaDataContents);
}
}
}
}
}
I have also tried to get the contents of the .xml file using the Stream object as shown below. But here I get the error "Stream was not readable".
Stream metaDataStream = null;
string metaDataContent = string.Empty;
using (Stream stream = entry.Open())
{
metaDataStream = stream;
}
using (var reader = new StreamReader(metaDataStream))
{
metaDataContent = reader.ReadToEnd();
}
Kindly suggest how to read the contents of the XML file within a zip file using Stream and StreamReader, specifying the file path at run time.
Your second code snippet is failing because when you reach the end of the first using statement:
using (Stream stream = entry.Open())
{
metaDataStream = stream;
}
... the stream will be disposed. That's the point of a using statement. You should be fine with this sort of code, but load the XML file while the stream is open:
XDocument doc;
using (Stream stream = entry.Open())
{
doc = XDocument.Load(stream);
}
That's to load it as XML... if you really just want the text, you could use:
string text;
using (Stream stream = entry.Open())
{
using (StreamReader reader = new StreamReader(stream))
{
text = reader.ReadToEnd();
}
}
Again, note how this is reading before it hits the end of either using statement.
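On the "file path at run time" part: an entry inside a zip is not a file on disk, so FileInfo and file-system paths do not apply to it. The entry's name is only a key into the archive. Below is a small sketch that picks the entry by name at run time and parses it while its stream is still open. ZipXmlReader, LoadXmlEntry and BuildSampleZip are hypothetical names, and the in-memory archive stands in for the real zip file.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text;
using System.Xml.Linq;

static class ZipXmlReader
{
    // Loads the first entry whose file name equals xmlName (ignoring case),
    // or returns null when the archive has no such entry.
    public static XDocument LoadXmlEntry(Stream zipStream, string xmlName)
    {
        using (var archive = new ZipArchive(zipStream, ZipArchiveMode.Read))
        {
            ZipArchiveEntry entry = archive.Entries.FirstOrDefault(
                e => string.Equals(e.Name, xmlName, StringComparison.OrdinalIgnoreCase));
            if (entry == null)
            {
                return null;
            }
            using (Stream s = entry.Open())
            {
                return XDocument.Load(s); // parsed before the stream is disposed
            }
        }
    }

    // Builds a tiny archive in memory so the sketch runs without any files.
    public static byte[] BuildSampleZip()
    {
        using (var ms = new MemoryStream())
        {
            using (var archive = new ZipArchive(ms, ZipArchiveMode.Create, leaveOpen: true))
            {
                ZipArchiveEntry entry = archive.CreateEntry("meta/TEST1.xml");
                using (var w = new StreamWriter(entry.Open(), Encoding.UTF8))
                {
                    w.Write("<meta><id>42</id></meta>");
                }
            }
            return ms.ToArray(); // archive is disposed, so the zip is complete
        }
    }

    static void Main()
    {
        using (var zip = new MemoryStream(BuildSampleZip()))
        {
            XDocument doc = LoadXmlEntry(zip, "test1.xml");
            Console.WriteLine(doc.Root.Element("id").Value);
        }
    }
}
```

Here entry.Name is the file name without any folder part, while entry.FullName is the full path inside the archive. Neither can be passed to File or FileInfo APIs.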
Here is a sample of how to read a zip file using .NET 4.5:
private void readZipFile(String filePath)
{
String fileContents = "";
if (System.IO.File.Exists(filePath))
{
// Dispose the archive when done so the file handle is released
using (System.IO.Compression.ZipArchive apcZipFile = System.IO.Compression.ZipFile.Open(filePath, System.IO.Compression.ZipArchiveMode.Read))
{
foreach (System.IO.Compression.ZipArchiveEntry entry in apcZipFile.Entries)
{
if (entry.Name.ToUpper().EndsWith(".XML"))
{
// Open the entry directly; GetEntry expects the full path inside the archive, not just the file name
using (System.IO.StreamReader sr = new System.IO.StreamReader(entry.Open()))
{
//read the contents into a string
fileContents = sr.ReadToEnd();
}
}
}
}
}
}

Parsing an XML file using XmlDocument fails with an exception

I am trying to parse an XML file but get an error saying that the file output.dat could not be found.
First I read all the XML into a string and then load it into an XmlDocument object, but I get an exception that output.dat was not found.
Here is my code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Xml;
using System.Xml.XPath;
using System.IO;
namespace XML
{
class Program
{
static void Main(string[] args)
{
try
{
StreamReader sr = new StreamReader("output.xml");
String xml = "";
String line = sr.ReadLine();
while (line != null)
{
xml += line;
line = sr.ReadLine();
}
sr.Close();
XmlDocument xdoc = new XmlDocument();
xdoc.LoadXml(xml);
}
catch (Exception ex)
{
Console.WriteLine(ex.Message);
}
}
}
}
Kindly tell me where the error is.
Consider replacing
XmlDocument xdoc = new XmlDocument();
xdoc.LoadXml(xml);
with
XDocument xdoc = XDocument.Parse(xml); // requires using System.Xml.Linq;

Save modified WordprocessingDocument to new file

I'm attempting to open a Word document, change some text and then save the changes to a new document. I can get the first bit done using the code below but I can't figure out how to save the changes to a NEW document (specifying the path and file name).
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;
using DocumentFormat.OpenXml.Packaging;
using System.IO;
namespace WordTest
{
class Program
{
static void Main(string[] args)
{
string template = @"c:\data\hello.docx";
string documentText;
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(template, true))
{
using (StreamReader reader = new StreamReader(wordDoc.MainDocumentPart.GetStream()))
{
documentText = reader.ReadToEnd();
}
documentText = documentText.Replace("##Name##", "Paul");
documentText = documentText.Replace("##Make##", "Samsung");
using (StreamWriter writer = new StreamWriter(wordDoc.MainDocumentPart.GetStream(FileMode.Create)))
{
writer.Write(documentText);
}
}
}
}
}
I'm a complete beginner at this, so forgive the basic question!
If you use a MemoryStream you can save the changes to a new file like this:
byte[] byteArray = File.ReadAllBytes("c:\\data\\hello.docx");
using (MemoryStream stream = new MemoryStream())
{
stream.Write(byteArray, 0, (int)byteArray.Length);
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(stream, true))
{
// Do work here
}
// Save the file with the new name
File.WriteAllBytes("C:\\data\\newFileName.docx", stream.ToArray());
}
In Open XML SDK 2.5:
File.Copy(originalFilePath, modifiedFilePath);
using (var wordprocessingDocument = WordprocessingDocument.Open(modifiedFilePath, isEditable: true))
{
// Do changes here...
}
wordprocessingDocument.AutoSave is true by default so Close and Dispose will save changes.
wordprocessingDocument.Close is not needed explicitly because the using block will call it.
This approach doesn't require entire file content to be loaded into memory like in accepted answer. It isn't a problem for small files, but in my case I have to process more docx files with embedded xlsx and pdf content at the same time so the memory usage would be quite high.
Simply copy the source file to the destination and make changes from there.
File.Copy(source, destination);
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(destination, true))
{
// Make changes to the document and save it.
wordDoc.MainDocumentPart.Document.Save();
wordDoc.Close();
}
Hope this works.
This approach allows you to buffer the "template" file without batching the whole thing into a byte[], perhaps allowing it to be less resource intensive.
var templatePath = @"c:\data\hello.docx";
var documentPath = @"c:\data\newFilename.docx";
using (var template = File.OpenRead(templatePath))
// FileMode.Create truncates an existing file, so no stale bytes remain after the copy
using (var documentStream = File.Open(documentPath, FileMode.Create))
{
template.CopyTo(documentStream);
using (var document = WordprocessingDocument.Open(documentStream, true))
{
//do your work here
document.MainDocumentPart.Document.Save();
}
}
For me this worked fine:
// To search and replace content in a document part.
public static void SearchAndReplace(string document)
{
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(document, true))
{
string docText = null;
using (StreamReader sr = new StreamReader(wordDoc.MainDocumentPart.GetStream()))
{
docText = sr.ReadToEnd();
}
Regex regexText = new Regex("Hello world!");
docText = regexText.Replace(docText, "Hi Everyone!");
using (StreamWriter sw = new StreamWriter(wordDoc.MainDocumentPart.GetStream(FileMode.Create)))
{
sw.Write(docText);
}
}
}

How to get ordered list into pdf using itext?

I have to get an ordered list into a PDF. The data is stored in HTML format. When exporting to PDF using iTextSharp, the ol/li tags should be rendered as an ordered list.
You'll want to use iTextSharp's iTextSharp.text.html.simpleparser.HTMLWorker.ParseToList() method. Below is a full working sample WinForms app targeting iTextSharp 5.1.1.0 that does what you're looking for. See the inline comments for what's going on.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Windows.Forms;
using System.IO;
using iTextSharp.text.pdf;
using iTextSharp.text;
namespace WindowsFormsApplication1
{
public partial class Form1 : Form
{
public Form1()
{
InitializeComponent();
}
private void Form1_Load(object sender, EventArgs e)
{
//File to export to
string exportFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "HTML.pdf");
//Create our PDF document
using (Document doc = new Document(PageSize.LETTER)){
using (FileStream fs = new FileStream(exportFile, FileMode.Create, FileAccess.Write, FileShare.Read)){
using (PdfWriter writer = PdfWriter.GetInstance(doc, fs)){
//Open the doc for writing
doc.Open();
//Insert a page
doc.NewPage();
//This is our sample HTML
String HTML = "<ol><li>Row 1</li><li>Row 2</li></ol>";
//Create a StringReader to parse our text
using (StringReader sr = new StringReader(HTML))
{
//Pass our StringReader into iTextSharp's HTML parser, get back a list of iTextSharp elements
List<IElement> ies = iTextSharp.text.html.simpleparser.HTMLWorker.ParseToList(sr, null);
//Loop through each element and add to the document
foreach (IElement ie in ies)
{
doc.Add(ie);
}
}
//Close our document
doc.Close();
}
}
}
}
}
}
