I would like to efficiently extract a substring from a MemoryStream (which originally comes from an XML file in a zip). Currently, I read the entire MemoryStream into a string and then search for the start and end tags of the XML node I want. This works fine, but the file may be very large, so I would like to avoid converting the entire MemoryStream into a string and instead extract just the desired section of XML text directly from the stream.
What is the best way to go about this?
string xmlText;
using (var zip = ZipFile.Read(zipFileName))
{
var ze = zip[zipPath];
using (var ms = new MemoryStream())
{
ze.Extract(ms);
ms.Position = 0;
using(var sr = new StreamReader(ms))
{
xmlText = sr.ReadToEnd();
}
}
}
string startTag = "<someTag>";
string endTag = "</someTag>";
int startIndex = xmlText.IndexOf(startTag, StringComparison.Ordinal);
int endIndex = xmlText.IndexOf(endTag, startIndex, StringComparison.Ordinal) + endTag.Length - 1;
xmlText = xmlText.Substring(startIndex, endIndex - startIndex + 1);
If your file is valid XML, you should be able to use an XmlReader to avoid converting the entire stream into a string:
string xmlText;
using (var zip = ZipFile.Read(zipFileName))
{
var ze = zip[zipPath];
using (var ms = new MemoryStream())
{
ze.Extract(ms);
ms.Position = 0;
using (var xml = XmlReader.Create(ms))
{
if(xml.ReadToFollowing("someTag"))
{
xmlText = xml.ReadInnerXml();
}
else
{
// <someTag> not found
}
}
}
}
You'll likely want to catch potential exceptions if the file is not valid xml.
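One detail worth noting about the answer above: ReadInnerXml returns the element's content without its own tags, whereas the code in the question kept `<someTag>` and `</someTag>` in the result. The sketch below uses ReadOuterXml to keep them. It is only an illustration under assumptions: the class name, the helper `ReadFirstElement`, and the sample XML are made up, and a MemoryStream stands in for the extracted zip entry.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class XmlFragmentDemo
{
    // Hypothetical helper: returns the first <elementName> element including
    // its own start and end tags, or null when no such element exists.
    public static string ReadFirstElement(Stream input, string elementName)
    {
        using (var xml = XmlReader.Create(input))
        {
            return xml.ReadToFollowing(elementName) ? xml.ReadOuterXml() : null;
        }
    }

    public static void Main()
    {
        // Sample data standing in for the extracted zip entry.
        byte[] bytes = Encoding.UTF8.GetBytes(
            "<root><other>1</other><someTag><a>x</a></someTag></root>");
        using (var ms = new MemoryStream(bytes))
        {
            Console.WriteLine(ReadFirstElement(ms, "someTag"));
        }
    }
}
```

The reader only advances as far as the requested element, so nothing after it is parsed.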
Assuming that, since it is XML, it will have line breaks, it would probably be best to use StreamReader.ReadLine and search for your tags in each line. (Also note that your StreamReader should be in a using block as well.)
Something like
using (var ms = new MemoryStream())
{
ze.Extract(ms);
ms.Position = 0;
using (var sr = new StreamReader(ms))
{
bool adding = false;
string startTag = "<someTag>";
string endTag = "</someTag>";
StringBuilder text = new StringBuilder();
while (sr.Peek() >= 0)
{
string tmp = sr.ReadLine();
if (!adding && tmp.Contains(startTag))
{
adding = true;
}
if (adding)
{
text.AppendLine(tmp); // ReadLine strips the line break, so add it back
}
if (tmp.Contains(endTag))
break;
}
xmlText = text.ToString();
}
}
This assumes that the start and end tags are on a line by themselves. If not, you could clean up the resulting text string by getting the index of start and end again like you originally did.
I am trying to serialize a simple object (5 string properties) into XML to save to a DB Image field. Then I need to DeSerialize it back into a string later in the program.
However, I am getting errors, apparently because the XML is saved declaring itself as UTF-16, while the string I load back from the DB is treated as UTF-8.
The error I get is
InnerException {"There is no Unicode byte order mark. Cannot switch to Unicode."} System.Exception {System.Xml.XmlException}
-- Message "There is an error in XML document (0, 0)." string
Is this happening because of the two different ways I save and load the string to/from the DB? On the save I am using a StringBuilder - but on the load from DB I am using just a String.
Thoughts?
Serialize and Save to DB
// Now Save the OBject XML to the Query Tables
var serializer = new XmlSerializer(ExportConfig.GetType());
StringBuilder StringResult = new StringBuilder();
using (var writer = XmlWriter.Create(StringResult))
{
serializer.Serialize(writer, ExportConfig);
}
//MessageBox.Show("XML : " + StringResult);
// Now Save to the Query
try
{
string UpdateSQL = "Update ZQryRpt "
+ " Set ExportConfig = " + TAGlobal.QuotedStr(StringResult.ToString())
+ " where QryId = " + TAGlobal.QuotedStr(((DataRowView)bindingSource_zQryRpt.Current).Row["QryID"].ToString())
;
ExecNonSelectSQL(UpdateSQL, uniConnection_Config);
}
catch (Exception Error)
{
MessageBox.Show("Error Setting ExportConfig: " + Error.Message);
}
Load from DB And Deserialize
byte[] binaryData = (byte[])((DataRowView)bindingSource_zQryRpt.Current).Row["ExportConfig"];
string XMLStored = System.Text.Encoding.UTF8.GetString(binaryData, 0, binaryData.Length);
if (XMLStored.Length > 0)
{
IIDExportObject ExportConfig = new IIDExportObject();
var serializer = new XmlSerializer(ExportConfig.GetType());
//StringBuilder StringResult = new StringBuilder(XMLStored);
// Load the XML from the Query into the StringBuilder
// Now we need to build a Stream from the String to use in the XMLReader
byte[] byteArray = Encoding.UTF8.GetBytes(XMLStored);
MemoryStream stream = new MemoryStream(byteArray);
using (var reader = XmlReader.Create(stream))
{
ExportConfig = (IIDExportObject)serializer.Deserialize(reader);
}
}
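John - thank you very much for the comment! It allowed me to complete the code and find a solution.
As you noted, using a StreamReader was the solution, but I could not read the first line separately because there was only one 'line' in my string. However, I could use the line
using (StreamReader sr = new StreamReader(stream, false))
which reads the stream with byte order mark detection turned off (the second argument, detectEncodingFromByteOrderMarks, is false). Because the XmlReader then receives already-decoded text from a TextReader rather than raw bytes, it apparently no longer tries to switch encodings based on the XML declaration.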
string XMLStored = MainFormRef.GetExportConfigForCurrentQuery();
if (XMLStored.Length > 0)
{
IIDExportObject ExportConfig = new IIDExportObject();
try
{
var serializer = new XmlSerializer(ExportConfig.GetType());
// Now we need to build a Stream from the String to use in the XMLReader
byte[] byteArray = Encoding.UTF8.GetBytes(XMLStored);
MemoryStream stream = new MemoryStream(byteArray);
// Now we need to use a StreamReader to get around UTF8 vs UTF16 issues
// A little cumbersome - but it works
using (StreamReader sr = new StreamReader(stream, false))
{
using (var reader = XmlReader.Create(sr))
{
ExportConfig = (IIDExportObject)serializer.Deserialize(reader);
}
}
}
catch
{
// Errors are swallowed here; ExportConfig keeps its default values.
}
}
I am not sure this is the best solution - but it works. I will be curious to see if anyone else has a better way of dealing with this.
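A possibly simpler variant of the same idea, offered as a sketch rather than a definitive fix: since the XML is already a string, it can be handed to the serializer through a StringReader, which skips the string-to-bytes-to-StreamReader round trip. As with any TextReader source, the `encoding="utf-16"` in the declaration should then not matter. `ExportSettings` and `XmlStringRoundTrip` are hypothetical names standing in for the real types.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical stand-in for IIDExportObject.
public class ExportSettings
{
    public string Name { get; set; }
}

public static class XmlStringRoundTrip
{
    public static string Serialize<T>(T value)
    {
        var serializer = new XmlSerializer(typeof(T));
        var builder = new StringBuilder();
        using (var writer = XmlWriter.Create(builder))
        {
            serializer.Serialize(writer, value);
        }
        // Writing to a StringBuilder yields a declaration with encoding="utf-16".
        return builder.ToString();
    }

    public static T Deserialize<T>(string xml)
    {
        var serializer = new XmlSerializer(typeof(T));
        // The text is already decoded, so no byte order mark is needed.
        using (var reader = new StringReader(xml))
        {
            return (T)serializer.Deserialize(reader);
        }
    }

    public static void Main()
    {
        string xml = Serialize(new ExportSettings { Name = "demo" });
        ExportSettings copy = Deserialize<ExportSettings>(xml);
        Console.WriteLine(copy.Name);
    }
}
```

This only addresses the string side. If the database column stores bytes, the encoding used to turn those bytes back into a string still has to match the one used when saving.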
Thanks to G Bradley, I took his answer and generalized it a bit to make it a bit easier to call.
public static string SerializeToXmlString<T>(T objectToSerialize)
{
XmlSerializer serializer = new XmlSerializer(typeof(T));
XmlWriterSettings settings = new XmlWriterSettings();
settings.Indent = false;
settings.Encoding = Encoding.UTF8; // Ignored when writing to a StringBuilder: the declaration still says utf-16
StringBuilder builder = new StringBuilder();
using (XmlWriter writer = XmlWriter.Create(builder, settings))
{
serializer.Serialize(writer, objectToSerialize);
}
return builder.ToString();
}
public static T DeserializeFromXmlString<T>(string xmlString)
{
if (string.IsNullOrWhiteSpace(xmlString))
return default;
var serializer = new XmlSerializer(typeof(T));
byte[] byteArray = Encoding.UTF8.GetBytes(xmlString);
MemoryStream stream = new MemoryStream(byteArray);
using (StreamReader sr = new StreamReader(stream, false))
{
using (var reader = XmlReader.Create(sr))
{
return (T)serializer.Deserialize(reader);
}
}
}
sorry for my English
I have the contents of a word document in a byte array and I want to know how many pages it has.
I already did this with a pdf file using this code:
public void MssGetNumberOfPages(byte[] ssFileBinaryData, out int ssNumberOfPages) {
int pageCount;
MemoryStream stream = new MemoryStream(ssFileBinaryData);
using (var r = new StreamReader(stream))
{
string pdfText = r.ReadToEnd();
System.Text.RegularExpressions.Regex regx = new Regex(@"/Type\s*/Page[^s]");
System.Text.RegularExpressions.MatchCollection matches = regx.Matches(pdfText);
pageCount = matches.Count;
ssNumberOfPages = pageCount;
}
// TODO: Write implementation for action
}
How do I do something similar, with a word document?
In the pdf I simply have to search through the regex the text that matches this:
Regex(@"/Type\s*/Page[^s]")
What do I have to put in the regex to match the pages of the word document?
Well, I solved this myself by converting the word document into pdf with Aspose.dll
public void MssGet_Word_NumberOfPages(byte[] ssFileBinaryData, out int ssNumberOfPages) {
// Load Word Document from this byte array
Document loadedFromBytes = new Document(new MemoryStream(ssFileBinaryData));
// Save Word to PDF byte array
MemoryStream pdfStream = new MemoryStream();
loadedFromBytes.Save(pdfStream, SaveFormat.Pdf);
byte[] pdfBytes = pdfStream.ToArray();
int pageCount;
MemoryStream stream = new MemoryStream(pdfBytes);
using (var r = new StreamReader(stream))
{
string pdfText = r.ReadToEnd();
System.Text.RegularExpressions.Regex regx = new Regex(@"/Type\s*/Page[^s]");
System.Text.RegularExpressions.MatchCollection matches = regx.Matches(pdfText);
pageCount = matches.Count;
ssNumberOfPages = pageCount;
}
}
Can you perhaps elaborate on the tool(s) you used to convert the word doc to PDF?
I am creating an XLSX file with custom code, without the Open XML SDK. It works fine for 50,000 records with 200 columns, using at most 13 GB of RAM.
But when I tried 100,000 rows and 200 columns, it used up to 16 GB of RAM and never produced the XLSX file; memory and CPU usage just kept going up and down.
I write the 100,000 rows and 200 columns into a stream and then copy that stream to the package part stream in one go, without splitting the XML file. The XML file is 3 GB.
Can you please suggest a solution that does not use the Open XML SDK?
When I tried the Open XML SDK, it worked with 100,000 records and 200 columns for a single user, but when two users create 100,000 records with 200 columns at the same time, the server hangs.
My custom code takes more RAM but does not hang.
In the code below, the "CreateOpenXMLComWorkSheet_XMLWriter" method is the one using the most RAM.
The code I am using is below for reference. Please let me know if any changes are required.
//Package method
Package package = null;
using (package = ZipPackage.Open(path, FileMode.Create))
{
packgPart = package.CreatePart(new Uri(relativePaths[relIndex], UriKind.Relative), contentTypes[6], CompressionOption.Maximum);
XmlWriter xmlWriter;
Stream stream = CreateOpenXMLComWorkSheet_XMLWriter(data, "", out xmlWriter);
CopyStream(stream, packgPart.GetStream());
xmlWriter.Flush();
xmlWriter.Close();
xmlWriter = null;
package.Flush();
packgPart = null;
stream.Close();
stream.Dispose();
stream = null;
relIndex++;
GC.Collect();
package.Close();
}
// CreateOpenXMLComWorkSheet method
// Define other methods and classes here
private static Stream CreateOpenXMLComWorkSheet_XMLWriter(List<StringBuilder> rows, string sheet,out XmlWriter xmlWriter)
{
string[] cols;
XmlWriterSettings xmlWriterSettings = new XmlWriterSettings();
xmlWriterSettings.NewLineHandling = NewLineHandling.None;
xmlWriterSettings.Indent = false;
xmlWriter = null;
MemoryStream stream = new MemoryStream();
string nameSpace = "http://schemas.openxmlformats.org/spreadsheetml/2006/main";
xmlWriter = XmlWriter.Create(stream,xmlWriterSettings);
xmlWriter.WriteStartElement("x","worksheet",nameSpace);
xmlWriter.WriteStartElement("x","sheetData",nameSpace);
for (m = 0; m < rows.Count; m++)
{
xmlWriter.WriteStartElement("x","row",nameSpace);
cols = rows[m].ToString().Split(new string[] { univDelimiter }, StringSplitOptions.None);
for (int i = 1; i <= cols.Length; i++)
{
cellValue = cols[i - 1];
if (double.TryParse(cellValue,out dVal))
{
dataType = "n";
}
else
{
dataType = "str";
}
xmlWriter.WriteStartElement("x","c",nameSpace);
xmlWriter.WriteAttributeString("s", "13");
xmlWriter.WriteAttributeString("t", dataType);
xmlWriter.WriteStartElement("x", "v",nameSpace);
xmlWriter.WriteValue(cellValue);
xmlWriter.WriteEndElement();
xmlWriter.WriteEndElement();
}
xmlWriter.WriteEndElement();
rows[m] = null;
}
xmlWriter.WriteEndElement();
xmlWriter.WriteEndElement();
xmlWriter.Flush();
stream.Position = 0;
return stream;
}
//CopyStream method
private static void CopyStream(Stream source, Stream target)
{
const int bufSize = 0x1000;
byte[] buf = new byte[bufSize];
int bytesRead = 0;
while ((bytesRead = source.Read(buf, 0, bufSize)) > 0)
target.Write(buf, 0, bytesRead);
}
It seems you are taking the wrong approach to writing the file; the Open XML SDK is a good enough tool for creating Excel files with large amounts of data.
I think you need a SAX-like approach, which uses a combination of an XML reader and writer to avoid running out of memory.
Have a look at this blog post, which fits your specific requirements:
https://blogs.msdn.microsoft.com/brian_jones/2010/06/22/writing-large-excel-files-with-the-open-xml-sdk/
To reduce memory pressure, consider not using a MemoryStream with your XmlWriter. A disk-based stream would reduce memory pressure dramatically.
Use the stream you get from packgPart.GetStream() as the backing store for your XmlWriter.
I also suspect that you don't need to load the entire CSV into memory.
Here is a version that uses streams only.
void Main()
{
string inputFile = "D:\\_bigfile.csv";
string path = "D:\\pack.zip";
Package package = null;
using (package = ZipPackage.Open(path, FileMode.Create))
{
var packgPart = package.CreatePart(new Uri("/test.xml", UriKind.Relative), System.Net.Mime.MediaTypeNames.Text.Xml, CompressionOption.Maximum);
using (var inputStream = File.OpenRead(inputFile))
{
CreateOpenXMLComWorkSheet_XMLWriter(inputStream, "", packgPart.GetStream());
}
}
}
private const string univDelimiter = "|";
private static void CreateOpenXMLComWorkSheet_XMLWriter(Stream inputStream, string sheet, Stream packagePartStream)
{
string cellValue = "";
string dataType = "";
double dVal = 0;
string[] cols;
XmlWriterSettings xmlWriterSettings = new XmlWriterSettings();
xmlWriterSettings.NewLineHandling = NewLineHandling.None;
xmlWriterSettings.Indent = false;
string nameSpace = "http://schemas.openxmlformats.org/spreadsheetml/2006/main";
using (var xmlWriter = XmlWriter.Create(packagePartStream, xmlWriterSettings))
{
xmlWriter.WriteStartElement("x","worksheet",nameSpace);
xmlWriter.WriteStartElement("x","sheetData",nameSpace);
using (var sr = new StreamReader(inputStream))
{
string line = null;
while ((line = sr.ReadLine()) != null)
{
xmlWriter.WriteStartElement("x","row",nameSpace);
cols = line.Split(new string[] { univDelimiter }, StringSplitOptions.None);
for (int i = 1; i <= cols.Length; i++)
{
cellValue = cols[i - 1];
if (double.TryParse(cellValue,out dVal))
{
dataType = "n";
}
else
{
dataType = "str";
}
xmlWriter.WriteStartElement("x","c",nameSpace);
xmlWriter.WriteAttributeString("s", "13");
xmlWriter.WriteAttributeString("t", dataType);
xmlWriter.WriteStartElement("x", "v",nameSpace);
xmlWriter.WriteValue(cellValue);
xmlWriter.WriteEndElement();
xmlWriter.WriteEndElement();
}
xmlWriter.WriteEndElement();
}
}
xmlWriter.WriteEndElement();
xmlWriter.WriteEndElement();
}
}
I have a text file with the following text inside:
[username][0]
I have opened the file using StreamWriter and I want to change the 0 to a 1 using the StreamWriter.Write Method. How can I do this?
If you know the exact byte position of the character(s) you want to overwrite then you can do something like this:
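// Open the existing file without truncating it; new StreamWriter(filePath) would overwrite it.
using (var file = new FileStream(filePath, FileMode.Open, FileAccess.Write))
using (var writer = new StreamWriter(file))
{
file.Seek(bytePos, SeekOrigin.Begin);
writer.Write('1');
}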
If you don't know the exact byte position then you could do something like this:
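using (var file = new FileStream(filePath, FileMode.Open, FileAccess.ReadWrite))
{
var openBracketCount = 0;
int b;
// Read byte by byte (this assumes a single-byte encoding such as ASCII) until the
// second open bracket is found. Reading straight from the FileStream, rather than
// through a buffering StreamReader, leaves the position just after that bracket.
while (openBracketCount < 2 && (b = file.ReadByte()) != -1)
{
if (b == '[')
{
openBracketCount++;
}
}
// Only overwrite if the second bracket was actually found before end of file.
if (openBracketCount == 2)
{
file.WriteByte((byte)'1');
}
}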
I'm creating a simple self-extracting archive, using a magic number to mark the beginning of the content.
For now the content is a text file:
MAGICNUMBER .... content of the text file
Next, the text file is appended to the end of the executable:
copy programm.exe/b+textfile.txt/b sfx.exe
I'm trying to find the second occurrence of the magic number (the first one would be a hardcoded constant obviously) using the following code:
string my_filename = System.Diagnostics.Process.GetCurrentProcess().MainModule.FileName;
StreamReader file = new StreamReader(my_filename);
const int block_size = 1024;
const string magic = "MAGICNUMBER";
char[] buffer = new Char[block_size];
Int64 count = 0;
Int64 glob_pos = 0;
bool flag = false;
while (file.ReadBlock(buffer, 0, block_size) > 0)
{
var rel_pos = buffer.ToString().IndexOf(magic);
if ((rel_pos > -1) & (!flag))
{
flag = true;
continue;
}
if ((rel_pos > -1) & (flag == true))
{
glob_pos = block_size * count + rel_pos;
break;
}
count++;
}
using (FileStream fs = new FileStream(my_filename, FileMode.Open, FileAccess.Read))
{
byte[] b = new byte[fs.Length - glob_pos];
fs.Seek(glob_pos, SeekOrigin.Begin);
fs.Read(b, 0, (int)(fs.Length - glob_pos));
File.WriteAllBytes("c:/output.txt", b);
}
But for some reason I'm copying almost the entire file, not the last few kilobytes. Is it because of a compiler optimization, inlining the magic constant in the while loop, or something similar?
How should I do a self-extracting archive properly?
I guessed I should read the file backwards to avoid problems with the compiler inlining the magic constant multiple times.
So I've modified my code in the following way:
string my_filename = System.Diagnostics.Process.GetCurrentProcess().MainModule.FileName;
StreamReader file = new StreamReader(my_filename);
const int block_size = 1024;
const string magic = "MAGIC";
char[] buffer = new Char[block_size];
Int64 count = 0;
Int64 glob_pos = 0;
while (file.ReadBlock(buffer, 0, block_size) > 0)
{
var rel_pos = buffer.ToString().IndexOf(magic);
if (rel_pos > -1)
{
glob_pos = block_size * count + rel_pos;
}
count++;
}
using (FileStream fs = new FileStream(my_filename, FileMode.Open, FileAccess.Read))
{
byte[] b = new byte[fs.Length - glob_pos];
fs.Seek(glob_pos, SeekOrigin.Begin);
fs.Read(b, 0, (int)(fs.Length - glob_pos));
File.WriteAllBytes("c:/output.txt", b);
}
So I've scanned the whole file once, found what I thought would be the last occurrence of the magic number, and copied from there to the end of the file. While the file created by this procedure seems smaller than in the previous attempt, it is in no way the same file I attached to my "self-extracting" archive. Why?
My guess is that the calculated position of the beginning of the attached file is wrong because of the conversion from binary to string. If so, how should I modify my position calculation to make it correct?
Also, how should I choose the magic number when working with real files, PDFs for example? I won't be able to modify PDFs easily to include a predefined magic number.
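One thing worth checking in both versions of the code above is the call `buffer.ToString()`. On a `char[]`, ToString returns the type name rather than the characters, so the search never looks at the buffer contents. The small sketch below shows the difference; `CharBufferDemo` and `FindInBuffer` are hypothetical names used only for illustration. It does not address two separate concerns: a StreamReader decodes bytes into characters, so character offsets need not equal byte offsets in a binary file, and a marker can straddle two blocks.

```csharp
using System;

public static class CharBufferDemo
{
    // Hypothetical helper: searches the first `count` characters of the buffer.
    public static int FindInBuffer(char[] buffer, int count, string needle)
    {
        return new string(buffer, 0, count).IndexOf(needle, StringComparison.Ordinal);
    }

    public static void Main()
    {
        char[] buffer = "xxMAGICyy".ToCharArray();

        // ToString on an array yields the type name, not the contents.
        Console.WriteLine(buffer.ToString());                                        // System.Char[]
        Console.WriteLine(buffer.ToString().IndexOf("MAGIC", StringComparison.Ordinal)); // -1

        // Building a string from the characters gives the expected result.
        Console.WriteLine(FindInBuffer(buffer, buffer.Length, "MAGIC"));             // 2
    }
}
```

For binary data, comparing bytes directly avoids the decoding question altogether.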
Try this out. Some C# Stream IO 101:
public static void Main()
{
String path = @"c:\here is your path";
// Method A: Read all information into a byte array (or an array of lines)
Byte[] data = System.IO.File.ReadAllBytes(path);
String[] lines = System.IO.File.ReadAllLines(path);
// Method B: Use a stream to do essentially the same thing. (More powerful)
// Using block essentially means 'close when we're done'. See 'using block' or 'IDisposable'.
using (FileStream stream = File.OpenRead(path))
using (StreamReader reader = new StreamReader(stream))
{
// This will read all the data as a single string
String allData = reader.ReadToEnd();
}
String outputPath = @"C:\where I'm writing to";
// Copy from one file-stream to another
using (FileStream inputStream = File.OpenRead(path))
using (FileStream outputStream = File.Create(outputPath))
{
inputStream.CopyTo(outputStream);
// Again, this will close both streams when done.
}
// Copy to an in-memory stream
using (FileStream inputStream = File.OpenRead(path))
using (MemoryStream outputStream = new MemoryStream())
{
inputStream.CopyTo(outputStream);
// Again, this will close both streams when done.
// If you want to hold the data in memory, just don't wrap your
// memory stream in a using block.
}
// Use serialization to store data.
var serializer = new System.Runtime.Serialization.Formatters.Binary.BinaryFormatter();
// We'll serialize a person to the memory stream.
MemoryStream memoryStream = new MemoryStream();
serializer.Serialize(memoryStream, new Person() { Name = "Sam", Age = 20 });
// Now the person is stored in the memory stream (it would be just as easy to write to disk
// using a file stream).
// Now lets reset the stream to the beginning:
memoryStream.Seek(0, SeekOrigin.Begin);
// And deserialize the person
Person deserializedPerson = (Person)serializer.Deserialize(memoryStream);
Console.WriteLine(deserializedPerson.Name); // Should print Sam
}
// Mark Serializable stuff as serializable.
// This means that C# will automatically format this to be put in a stream
[Serializable]
class Person
{
public String Name { get; set; }
public Int32 Age { get; set; }
}
The easiest solution is to replace
const string magic = "MAGICNUMBER";
with
static string magic = "magicnumber".ToUpper();
But there are more problems with the whole magic string approach. What if the file contains the magic string? I think that the best solution is to put the file size after the file. The extraction is much easier that way: read the length from the last bytes, then read the required number of bytes from the end of the file.
Update: This should work unless your files are very big. (In that case you'd need a revolving pair of buffers to read the file in small blocks.):
string inputFilename = System.Diagnostics.Process.GetCurrentProcess().MainModule.FileName;
string outputFilename = inputFilename + ".secret";
string magic = "magic".ToUpper();
byte[] data = File.ReadAllBytes(inputFilename);
byte[] magicData = Encoding.ASCII.GetBytes(magic);
for (int idx = magicData.Length - 1; idx < data.Length; idx++) {
bool found = true;
for (int magicIdx = 0; magicIdx < magicData.Length; magicIdx++) {
if (data[idx - magicData.Length + 1 + magicIdx] != magicData[magicIdx]) {
found = false;
break;
}
}
if (found) {
using (FileStream output = new FileStream(outputFilename, FileMode.Create)) {
output.Write(data, idx + 1, data.Length - idx - 1);
}
}
}
Update2: This should be much faster, use little memory and work on files of all sizes, but your program must be a proper executable (with its size being a multiple of 512 bytes):
string inputFilename = System.Diagnostics.Process.GetCurrentProcess().MainModule.FileName;
string outputFilename = inputFilename + ".secret";
string marker = "magic".ToUpper();
byte[] markerData = Encoding.ASCII.GetBytes(marker);
int markerLength = markerData.Length;
const int blockSize = 512; //important!
using(FileStream input = File.OpenRead(inputFilename)) {
long lastPosition = 0;
byte[] buffer = new byte[blockSize];
while (input.Read(buffer, 0, blockSize) >= markerLength) {
bool found = true;
for (int idx = 0; idx < markerLength; idx++) {
if (buffer[idx] != markerData[idx]) {
found = false;
break;
}
}
if (found) {
input.Position = lastPosition + markerLength;
using (FileStream output = File.OpenWrite(outputFilename)) {
input.CopyTo(output);
}
}
lastPosition = input.Position;
}
}
Read about some approaches here: http://www.strchr.com/creating_self-extracting_executables
You can add the compressed file as resource to the project itself:
Project > Properties
Set the property of this resource to Binary.
You can then retrieve the resource with
byte[] resource = Properties.Resources.NameOfYourResource;
Search backwards rather than forwards (assuming your file won't contain said magic number).
Or append your (text) file and then lastly its length (or the length of the original exe), so you only need read the last DWORD / few bytes to see how long the file is - then no magic number is required.
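The length-footer idea can be sketched roughly as follows. This is an illustration under assumptions, not a finished packer: the class and method names are hypothetical, the footer is an 8-byte length rather than a DWORD, and a temporary file stands in for the executable.

```csharp
using System;
using System.IO;

public static class LengthFooterDemo
{
    // Appends the payload to the container, followed by its length as an 8-byte footer.
    public static void AppendPayload(string containerPath, byte[] payload)
    {
        using (var fs = new FileStream(containerPath, FileMode.Append, FileAccess.Write))
        using (var writer = new BinaryWriter(fs))
        {
            writer.Write(payload);
            writer.Write((long)payload.Length);
        }
    }

    // Reads the footer first, then seeks back to the start of the payload.
    public static byte[] ExtractPayload(string containerPath)
    {
        using (var fs = File.OpenRead(containerPath))
        using (var reader = new BinaryReader(fs))
        {
            fs.Seek(-sizeof(long), SeekOrigin.End);
            long length = reader.ReadInt64();
            fs.Seek(-sizeof(long) - length, SeekOrigin.End);
            return reader.ReadBytes((int)length);
        }
    }

    public static void Main()
    {
        string path = Path.GetTempFileName();
        File.WriteAllBytes(path, new byte[] { 1, 2, 3, 4 });   // stand-in for the executable
        AppendPayload(path, new byte[] { 10, 20, 30 });
        Console.WriteLine(BitConverter.ToString(ExtractPayload(path)));   // 0A-14-1E
        File.Delete(path);
    }
}
```

Because the payload is located from the end of the file, its contents never need to be searched, so it does not matter whether it happens to contain any particular byte sequence.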
More robustly, store the file as an additional data section within the executable file. This is more fiddly without external tools as it requires knowledge of the PE file format used for NT executables, q.v. http://msdn.microsoft.com/en-us/library/ms809762.aspx