In an Excel Power Query file the data connection can be from a SQL server. We have a large number of files that specify a SQL server by name and this server is going to be decommissioned. We need to update the connection to replace the older server name with the new server name. This is possible by opening the Excel file, browsing to the query and editing the server name manually. Due to the large number of files it is desired to do this using C#. The image below shows the input fields (with the names removed) where you would update this manually.
First, by unzipping the Excel file and browsing the contents under the folder xl > connections.xml, I would have expected the connection to be specified there, but it only says $Workbook$:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<connections xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
<connection id="1" keepAlive="1" name="Query" description="Connection to the query in the workbook." type="5" refreshedVersion="6" background="1" saveData="1">
<dbPr connection="Provider=Microsoft.Mashup.OleDb.1;Data Source=$Workbook$;Location=&quot;table&quot;" command="SELECT * FROM [table]"/>
</connection>
</connections>
On the MSDN forums there is a reference to this topic, and the answer provided by Will Gregg says:
External data source connection information is stored in the XLSX package in a custom part. You can locate the custom part under the customXml folder of the package. For example: customXml\item1.xml.
Contained in item1.xml is a DataMashup element. The definition for the element can be found in the [MS-QDEFF]: Query Definition File Format document (https://msdn.microsoft.com/en-us/library/mt577220(v=office.12).aspx).
In order to work with the data of the element you will need to decode the contents as described in the [MS-QDEFF]: Query Definition File Format document.
Once the data is decoded, you will need to examine the contents of the PackagePart. Within that package you will find the external data connection information in the Formulas\Section1.m part.
This is helpful in pointing me to the item1.xml file in the customXml folder, but it gives no details on how to decode the information in the DataMashup object. The answer did mention that the [MS-QDEFF]: Query Definition File Format document is available at this link from the main article about the query definition format. The information in this document can seem dense and complex at first glance.
On Stack Overflow there are 6 questions that mention DataMashup, and 4 of them relate to Power BI, which, while similar to this issue, is not the same. The links to those questions are listed below:
how to decode/ get encoding of file (Power BI desktop file)
How to edit Power BI Desktop document parameters or data sources programmatically with C#?
Is there documentation/an API for the PBix file format?
How to update clients' Power BI files without ruining their reports?
The other 2 questions are more relevant, as they ask about Excel rather than Power BI, and I discuss them below:
This question asks how to remove Power Query queries' custom XML data using VBA. I do not want to delete the query but rather update the connection string, and I would like to do this in C#, not VBA. The question shows the result of using the Macro recorder, and I do not want to open each Excel file to run a VBA macro.
This question asks how to find the query information and comes across the same $Workbook$ that I did. In a comment, Axel Richter says: In *.xlsx/customXml/ you will find an item1.xml which contains a DataMashup element which contains a base64Binary which is the binary query definition file. I have no clue how to work with that. That's why only a comment and not an answer. Over a year later an answer was added by Tom Jebo pointing to the same Open Specifications details I found, but it does not offer a solution for manipulating the DataMashup object. I am adding this as a new question since that question addresses a slightly different problem than mine and is also looking for a solution in JavaScript.
What is the best way to decode the DataMashup object, change the server name, and then save the updated connection back to the Excel file?
In this blog post from July 1, 2011, Jeff Atwood encourages asking and answering your own questions. In addition, this page from the Stack Overflow help center addresses the same issue. I decided to post a full working solution in C# for others to modify and use, hopefully saving them the time of slogging through all the working out I did.
As mentioned in the question, the most helpful document is [MS-QDEFF]: Query Definition File Format. I will include the most relevant parts of that document here, but refer to the original if needed. Below is the example XML with the DataMashup element provided by Microsoft. It is for a short query, but expect something similar if you open the customXml > item1.xml file.
<DataMashup sqmid="7690c5d6-5698-463c-a560-a0093d4f6332"
xmlns="http://schemas.microsoft.com/DataMashup">
AAAAAEUDAABQSwMEFAACAAgAta0pR62KRJynAAAA+QAAABIAHABDb25maWcvUGFja2FnZS54bWwgohgA
KKAUAAAAAAAAAAAAAAAAAAAAAAAAAAAhY9NDoIwGESvQrqnP4jGkI+ycCuJCdG4bUqFRiiGFsvdXHgkr
yCJYti5nMmb5M3r8YRsbJvgrnqrO5MihikKlJFdqU2VosFdwi3KOByEvIpKBRNsbDJanaLauVtCiPce+
xXu+opElDJyzveFrFUrQm2sE0Yq9FuV/1eIw+kjwyMcxTimmzVmMWVA5h5ybRbMpIwpkEUJu6FxQ6+4M
uGxADJHIN8b/A1QSwMEFAACAAgAta0pRw/K6aukAAAA6QAAABMAHABbQ29udGVudF9UeXBlc10ueG1sI
KIYACigFAAAAAAAAAAAAAAAAAAAAAAAAAAAAG2OSw7CMAxErxJ5n7qwQAg1ZQHcgAtEwf2I5qPGReFsL
DgSVyBtd4ilZ+Z55vN6V8dkB/GgMfbeKdgUJQhyxt961yqYuJF7ONbV9Rkoihx1UUHHHA6I0XRkdSx8I
Jedxo9Wcz7HFoM2d90Sbstyh8Y7JseS5x9QV2dq9DSwuKQsr7UZB3Fac3OVAqbEuMj4l7A/eR3C0BvN2
cQkbZR2IXEZXn8BUEsDBBQAAgAIALWtKUdi3rmEPAAAAEsAAAATABwARm9ybXVsYXMvU2VjdGlvbjEub
SCiGAAooBQAAAAAAAAAAAAAAAAAAAAAAAAAAAArTk0uyczPUwiG0IbWvFy8XMUZiUWpKQqBpalFlYYKt
go5qSW8XApAEJxfWpScChQx1Dbk5crMQxa1BgBQSwECLQAUAAIACAC1rSlHrYpEnKcAAAD5AAAAEgAAA
AAAAAAAAAAAAAAAAAAAQ29uZmlnL1BhY2thZ2UueG1sUEsBAi0AFAACAAgAta0pRw/K6aukAAAA6QAAA
BMAAAAAAAAAAAAAAAAA8wAAAFtDb250ZW50X1R5cGVzXS54bWxQSwECLQAUAAIACAC1rSlHYt65hDwAA
ABLAAAAEwAAAAAAAAAAAAAAAADkAQAARm9ybXVsYXMvU2VjdGlvbjEubVBLBQYAAAAAAwADAMIAAABtA
gAAAAA0AQAA77u/PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0idXRmLTgiPz48UGVybWlzc2lvb
kxpc3QgeG1sbnM6eHNpPSJodHRwOi8vd3d3LnczLm9yZy8yMDAxL1hNTFNjaGVtYS1pbnN0YW5jZSIge
G1sbnM6eHNkPSJodHRwOi8vd3d3LnczLm9yZy8yMDAxL1hNTFNjaGVtYSI+PENhbkV2YWx1YXRlRnV0d
XJlUGFja2FnZXM+ZmFsc2U8L0NhbkV2YWx1YXRlRnV0dXJlUGFja2FnZXM+PEZpcmV3YWxsRW5hYmxlZ
D50cnVlPC9GaXJld2FsbEVuYWJsZWQ+PFdvcmtib29rR3JvdXBUeXBlIHhzaTpuaWw9InRydWUiIC8+P
C9QZXJtaXNzaW9uTGlzdD7LBwAAAAAAAKkHAADvu788P3htbCB2ZXJzaW9uPSIxLjAiIGVuY29kaW5nP
SJ1dGYtOCI/PjxMb2NhbFBhY2thZ2VNZXRhZGF0YUZpbGUgeG1sbnM6eHNpPSJodHRwOi8vd3d3LnczL
m9yZy8yMDAxL1hNTFNjaGVtYS1pbnN0YW5jZSIgeG1sbnM6eHNkPSJodHRwOi8vd3d3LnczLm9yZy8yM
DAxL1hNTFNjaGVtYSI+PEl0ZW1zPjxJdGVtPjxJdGVtTG9jYXRpb24+PEl0ZW1UeXBlPkFsbEZvcm11b
GFzPC9JdGVtVHlwZT48SXRlbVBhdGggLz48L0l0ZW1Mb2NhdGlvbj48U3RhYmxlRW50cmllcyAvPjwvS
XRlbT48SXRlbT48SXRlbUxvY2F0aW9uPjxJdGVtVHlwZT5Gb3JtdWxhPC9JdGVtVHlwZT48SXRlbVBhd
Gg+U2VjdGlvbjEvUXVlcnkxPC9JdGVtUGF0aD48L0l0ZW1Mb2NhdGlvbj48U3RhYmxlRW50cmllcz48R
W50cnkgVHlwZT0iSXNQcml2YXRlIiBWYWx1ZT0ibDAiIC8+PEVudHJ5IFR5cGU9IlJlc3VsdFR5cGUiI
FZhbHVlPSJzTnVtYmVyIiAvPjxFbnRyeSBUeXBlPSJGaWxsRW5hYmxlZCIgVmFsdWU9ImwxIiAvPjxFb
nRyeSBUeXBlPSJGaWxsVG9EYXRhTW9kZWxFbmFibGVkIiBWYWx1ZT0ibDAiIC8+PEVudHJ5IFR5cGU9I
kZpbGxDb3VudCIgVmFsdWU9ImwxIiAvPjxFbnRyeSBUeXBlPSJGaWxsRXJyb3JDb3VudCIgVmFsdWU9I
mwwIiAvPjxFbnRyeSBUeXBlPSJGaWxsQ29sdW1uVHlwZXMiIFZhbHVlPSJzQlE9PSIgLz48RW50cnkgV
HlwZT0iRmlsbENvbHVtbk5hbWVzIiBWYWx1ZT0ic1smcXVvdDtRdWVyeTEmcXVvdDtdIiAvPjxFbnRye
SBUeXBlPSJGaWxsRXJyb3JDb2RlIiBWYWx1ZT0ic1Vua25vd24iIC8+PEVudHJ5IFR5cGU9IkZpbGxMY
XN0VXBkYXRlZCIgVmFsdWU9ImQyMDE1LTA5LTEwVDA0OjQ1OjQxLjkyNzU5MDBaIiAvPjxFbnRyeSBUe
XBlPSJSZWxhdGlvbnNoaXBJbmZvQ29udGFpbmVyIiBWYWx1ZT0ic3smcXVvdDtjb2x1bW5Db3VudCZxd
W90OzoxLCZxdW90O2tleUNvbHVtbk5hbWVzJnF1b3Q7OltdLCZxdW90O3F1ZXJ5UmVsYXRpb25zaGlwc
yZxdW90OzpbXSwmcXVvdDtjb2x1bW5JZGVudGl0aWVzJnF1b3Q7OlsmcXVvdDtTZWN0aW9uMS9RdWVye
TEvQXV0b1JlbW92ZWRDb2x1bW5zMS57UXVlcnkxLDB9JnF1b3Q7XSwmcXVvdDtDb2x1bW5Db3VudCZxd
W90OzoxLCZxdW90O0tleUNvbHVtbk5hbWVzJnF1b3Q7OltdLCZxdW90O0NvbHVtbklkZW50aXRpZXMmc
XVvdDs6WyZxdW90O1NlY3Rpb24xL1F1ZXJ5MS9BdXRvUmVtb3ZlZENvbHVtbnMxLntRdWVyeTEsMH0mc
XVvdDtdLCZxdW90O1JlbGF0aW9uc2hpcEluZm8mcXVvdDs6W119IiAvPjxFbnRyeSBUeXBlPSJGaWxsZ
WRDb21wbGV0ZVJlc3VsdFRvV29ya3NoZWV0IiBWYWx1ZT0ibDEiIC8+PEVudHJ5IFR5cGU9IkFkZGVkV
G9EYXRhTW9kZWwiIFZhbHVlPSJsMCIgLz48RW50cnkgVHlwZT0iUmVjb3ZlcnlUYXJnZXRTaGVldCIgV
mFsdWU9InNTaGVldDIiIC8+PEVudHJ5IFR5cGU9IlJlY292ZXJ5VGFyZ2V0Q29sdW1uIiBWYWx1ZT0ib
DEiIC8+PEVudHJ5IFR5cGU9IlJlY292ZXJ5VGFyZ2V0Um93IiBWYWx1ZT0ibDEiIC8+PEVudHJ5IFR5c
GU9Ik5hbWVVcGRhdGVkQWZ0ZXJGaWxsIiBWYWx1ZT0ibDAiIC8+PEVudHJ5IFR5cGU9IkZpbGxUYXJnZ
XQiIFZhbHVlPSJzUXVlcnkxIiAvPjxFbnRyeSBUeXBlPSJCdWZmZXJOZXh0UmVmcmVzaCIgVmFsdWU9I
mwxIiAvPjxFbnRyeSBUeXBlPSJGaWxsU3RhdHVzIiBWYWx1ZT0ic0NvbXBsZXRlIiAvPjxFbnRyeSBUe
XBlPSJRdWVyeUlEIiBWYWx1ZT0iczdlMDQzNjJlLTkyZjUtNGQ4Mi04YjA3LTI3NjFlYWY2OGFlNSIgL
z48L1N0YWJsZUVudHJpZXM+PC9JdGVtPjxJdGVtPjxJdGVtTG9jYXRpb24+PEl0ZW1UeXBlPkZvcm11b
GE8L0l0ZW1UeXBlPjxJdGVtUGF0aD5TZWN0aW9uMS9RdWVyeTEvU291cmNlPC9JdGVtUGF0aD48L0l0Z
W1Mb2NhdGlvbj48U3RhYmxlRW50cmllcyAvPjwvSXRlbT48L0l0ZW1zPjwvTG9jYWxQYWNrYWdlTWV0Y
WRhdGFGaWxlPhYAAABQSwUGAAAAAAAAAAAAAAAAAAAAAAAA2gAAAAEAAADQjJ3fARXREYx6AMBPwpfrA
QAAACLWGAG5O6FHjkAGtB+m5EQAAAAAAgAAAAAAA2YAAMAAAAAQAAAAaH8KNe2ciHwfVosIvSCr6gAAA
AAEgAAAoAAAABAAAAA40fOKWe6kmTAWJSBXs4cYUAAAAPNy7uF6Dtr9PvADu+eZdeV7JutpIQTh41qqT
3QnFoWPwE0Xyrur5N6Q2s2TEzjlBDfkEmNaGtr3htemOjWZYXKQHP+R5u/90zHWiwOwjjowFAAAAF2UC
6Jm8C98hVmJBo638e4Qk65V
</DataMashup>
The value of this element is encoded as a Base64 string. If you are not familiar with Base64, this Wikipedia article is a good place to start. The first step in the solution is to open the XML document and convert this value into its byte representation. This can be done as follows:
string file = @"\customXml\item1.xml"; // or wherever your xml file is
XDocument doc = XDocument.Load(file);
byte[] dataMashup = Convert.FromBase64String(doc.Root.Value);
NOTE: In the full example provided at the bottom of this answer, all the manipulation is done in memory.
From the Microsoft definition document:
Version (4 bytes): Unsigned integer that MUST be set to 0.
Package Parts Length (4 bytes): Unsigned integer that specifies the length of the Package Parts field.
Package Parts (variable): Variable-length binary stream (section 2.3).
Permissions Length (4 bytes): Unsigned integer that specifies the length of the Permissions field.
Permissions (variable): Variable-length binary stream (section 2.4).
Metadata Length (4 bytes): Unsigned integer that specifies the length of the Metadata field.
Metadata (variable): Variable-length binary stream (section 2.5).
Permission Bindings Length (4 bytes): Unsigned integer that specifies the length of the Permission Bindings field.
Permission Bindings (variable): Variable-length binary stream (section 2.6).
Since each field that defines the length of its content is 4 bytes, I defined a constant:
private const int FIELDS_LENGTH = 4;
Then each of the values defined in this section (quoted from Microsoft) can be found as shown below:
int version = BitConverter.ToInt32(dataMashup.Take(FIELDS_LENGTH).ToArray(), 0);
int packagePartsLength = BitConverter.ToInt32(dataMashup.Skip(FIELDS_LENGTH).Take(FIELDS_LENGTH).ToArray(), 0);
byte[] packageParts = dataMashup.Skip(FIELDS_LENGTH * 2).Take(packagePartsLength).ToArray();
int permissionsLength = BitConverter.ToInt32(dataMashup.Skip(FIELDS_LENGTH * 2 + packagePartsLength).Take(FIELDS_LENGTH).ToArray(), 0);
byte[] permissions = dataMashup.Skip(FIELDS_LENGTH * 3 + packagePartsLength).Take(permissionsLength).ToArray();
int metadataLength = BitConverter.ToInt32(dataMashup.Skip(FIELDS_LENGTH * 3 + packagePartsLength + permissionsLength).Take(FIELDS_LENGTH).ToArray(), 0);
byte[] metadata = dataMashup.Skip(FIELDS_LENGTH * 4 + packagePartsLength + permissionsLength).Take(metadataLength).ToArray();
int permissionsBindingLength = BitConverter.ToInt32(dataMashup.Skip(FIELDS_LENGTH * 4 + packagePartsLength + permissionsLength + metadataLength).Take(FIELDS_LENGTH).ToArray(), 0);
byte[] permissionsBinding = dataMashup.Skip(FIELDS_LENGTH * 5 + packagePartsLength + permissionsLength + metadataLength).Take(permissionsBindingLength).ToArray();
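As a side note, each Skip/Take pair rescans the byte array from the start. The same fields can be read sequentially with a BinaryReader, which matches the little-endian layout of the format; a minimal equivalent sketch (not the code used in the full example below):
using (BinaryReader br = new BinaryReader(new MemoryStream(dataMashup))) {
    // the length prefix is read first (the ReadInt32 argument), then that many payload bytes
    int version = br.ReadInt32();                              // Version
    byte[] packageParts = br.ReadBytes(br.ReadInt32());       // Package Parts
    byte[] permissions = br.ReadBytes(br.ReadInt32());        // Permissions
    byte[] metadata = br.ReadBytes(br.ReadInt32());           // Metadata
    byte[] permissionsBinding = br.ReadBytes(br.ReadInt32()); // Permission Bindings
}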
The package parts byte[] represents a Package object from the System.IO.Packaging namespace, which can be opened directly from a MemoryStream.
using (MemoryStream ms = new MemoryStream()) {
// copy into an expandable stream; a MemoryStream constructed over a byte[] cannot
// grow if the re-saved package is larger (see the note in the full example below)
ms.Write(packageParts, 0, packageParts.Length);
using (Package package = Package.Open(ms, FileMode.Open, FileAccess.ReadWrite)) {
PackagePart section = package.GetParts().Where(x => x.Uri.OriginalString == "/Formulas/Section1.m").FirstOrDefault();
string query;
using (StreamReader reader = new StreamReader(section.GetStream())) {
query = reader.ReadToEnd();
// do other replacing, removing of query here
}
// FileMode.Create truncates the part so a shorter query leaves no stale bytes behind
using (BinaryWriter writer = new BinaryWriter(section.GetStream(FileMode.Create))) {
// write updated query back to package part
writer.Write(Encoding.ASCII.GetBytes(query));
}
}
packageParts = ms.ToArray();
}
Finally I need to update the original byte[] with the new information from the updated package.
IEnumerable<byte> bytes = BitConverter.GetBytes(version)
.Concat(BitConverter.GetBytes(packageParts.Length))
.Concat(packageParts)
.Concat(BitConverter.GetBytes(permissionsLength))
.Concat(permissions)
.Concat(BitConverter.GetBytes(metadataLength))
.Concat(metadata)
.Concat(BitConverter.GetBytes(permissionsBindingLength))
.Concat(permissionsBinding);
doc.Root.Value = Convert.ToBase64String(bytes.ToArray());
doc.Save(file); // in the full example below, the document is saved back to the zip entry stream instead
Below is the full example for completeness. It is a console application that takes the directory of files to update as a command line argument and then replaces the old server name with the new server name.
using System;
using System.Collections.Generic;
using System.Linq;
using System.IO;
using System.IO.Compression;
using System.Xml.Linq;
using System.IO.Packaging;
using System.Text;
namespace MyApp {
class Program {
private const int FIELDS_LENGTH = 4;
static void Main(string[] args) {
if (args.Length != 1) {
Console.WriteLine("specify one directory to update");
return;
}
if (!Directory.Exists(args[0])) {
Console.WriteLine("directory does not exist");
return;
}
IEnumerable<FileInfo> files = Directory.GetFiles(args[0]).Where(x => Path.GetExtension(x) == ".xlsx").Select(x => new FileInfo(x));
foreach (FileInfo file in files) {
using (FileStream fileStream = File.Open(file.FullName, FileMode.OpenOrCreate)) {
using (ZipArchive archive = new ZipArchive(fileStream, ZipArchiveMode.Update)) {
ZipArchiveEntry entry = archive.GetEntry("customXml/item1.xml");
IEnumerable<byte> bytes;
using (Stream entryStream = entry.Open()) {
XDocument doc = XDocument.Load(entryStream);
byte[] dataMashup = Convert.FromBase64String(doc.Root.Value);
int version = BitConverter.ToInt32(dataMashup.Take(FIELDS_LENGTH).ToArray(), 0);
int packagePartsLength = BitConverter.ToInt32(dataMashup.Skip(FIELDS_LENGTH).Take(FIELDS_LENGTH).ToArray(), 0);
byte[] packageParts = dataMashup.Skip(FIELDS_LENGTH * 2).Take(packagePartsLength).ToArray();
int permissionsLength = BitConverter.ToInt32(dataMashup.Skip(FIELDS_LENGTH * 2 + packagePartsLength).Take(FIELDS_LENGTH).ToArray(), 0);
byte[] permissions = dataMashup.Skip(FIELDS_LENGTH * 3 + packagePartsLength).Take(permissionsLength).ToArray();
int metadataLength = BitConverter.ToInt32(dataMashup.Skip(FIELDS_LENGTH * 3 + packagePartsLength + permissionsLength).Take(FIELDS_LENGTH).ToArray(), 0);
byte[] metadata = dataMashup.Skip(FIELDS_LENGTH * 4 + packagePartsLength + permissionsLength).Take(metadataLength).ToArray();
int permissionsBindingLength = BitConverter.ToInt32(dataMashup.Skip(FIELDS_LENGTH * 4 + packagePartsLength + permissionsLength + metadataLength).Take(FIELDS_LENGTH).ToArray(), 0);
byte[] permissionsBinding = dataMashup.Skip(FIELDS_LENGTH * 5 + packagePartsLength + permissionsLength + metadataLength).Take(permissionsBindingLength).ToArray();
// copy into a second, expandable memory stream: a MemoryStream constructed over
// a byte[] cannot change size when the data mashup package is re-saved
using (MemoryStream packagePartsStream = new MemoryStream(packageParts)) {
using (MemoryStream ms = new MemoryStream()) {
packagePartsStream.CopyTo(ms);
using (Package package = Package.Open(ms, FileMode.Open, FileAccess.ReadWrite)) {
PackagePart section = package.GetParts().Where(x => x.Uri.OriginalString == "/Formulas/Section1.m").FirstOrDefault();
string query;
using (StreamReader reader = new StreamReader(section.GetStream())) {
query = reader.ReadToEnd();
// do other replacing, removing of query here
query = query.Replace("old-server", "new-server");
}
// FileMode.Create truncates the part in case the new query is shorter than the old one
using (BinaryWriter writer = new BinaryWriter(section.GetStream(FileMode.Create))) {
writer.Write(Encoding.ASCII.GetBytes(query));
}
}
packageParts = ms.ToArray();
}
bytes = BitConverter.GetBytes(version)
.Concat(BitConverter.GetBytes(packageParts.Length))
.Concat(packageParts)
.Concat(BitConverter.GetBytes(permissionsLength))
.Concat(permissions)
.Concat(BitConverter.GetBytes(metadataLength))
.Concat(metadata)
.Concat(BitConverter.GetBytes(permissionsBindingLength))
.Concat(permissionsBinding);
doc.Root.Value = Convert.ToBase64String(bytes.ToArray());
entryStream.SetLength(0);
doc.Save(entryStream);
}
}
}
}
}
}
}
}
NOTE: As I only needed to update the Package Parts field, I can confirm this decoding / encoding works, but I did not test the decoding / encoding of the Permissions, Metadata, or Permission Bindings fields. If you need those, this should at least get you started.
NOTE: This code does not catch errors or handle every case. It is meant to be a working example of how to update the connections in a Power Query file. Feel free to adapt this as you need.
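For reference, running the finished console application over a folder of workbooks looks something like this (assuming the assembly is named MyApp.exe; the path is illustrative):
MyApp.exe "C:\reports\to-update"
Every .xlsx file in that directory then has old-server replaced with new-server inside its Power Query definition.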
Related
I'm trying to parse a crg-file in C#. The file mixes plain text and binary data: the first section of the file contains plain text while the rest is binary (lots of floats). Here's an example:
$
$ROAD_CRG
reference_line_start_u = 100
reference_line_end_u = 120
$
$KD_DEFINITION
#:KRBI
U:reference line u,m,730.000,0.010
D:reference line phi,rad
D:long section 1,m
D:long section 2,m
D:long section 3,m
...
$
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
�#z����RA����\�l
...
I know I can read bytes starting at a specific offset but how do I find out which byte to start from? The last row before the binary section will always contain at least four dollar signs "$$$$". Here's what I've got so far:
using var fs = new FileStream(@"crg_sample.crg", FileMode.Open, FileAccess.Read);
var startByte = ??; // How to find out where to start?
using (BinaryReader reader = new BinaryReader(fs))
{
reader.BaseStream.Seek(startByte, SeekOrigin.Begin);
var f = reader.ReadSingle();
Debug.WriteLine(f);
}
When you have a mixture of text data and binary data, you need to treat everything as binary. This means you should be using raw Stream access, or something similar, and using binary APIs to look through the text data (often looking for cr/lf/crlf bytes as sentinels, although it sounds like in your case you could just look for the $$$$ using binary APIs, then decode the entire block before it, and scan forwards). When you think you have an entire line, you can use Encoding to parse each line - the most convenient API being encoding.GetString().
When you've finished looking through the text data as binary, you can continue parsing the binary data, again using the binary API. I would usually recommend against BinaryReader here too, because frankly it doesn't gain you much over the more direct API. The other problem you might want to think about is CPU endianness, but assuming that isn't a problem: BitConverter.ToSingle() may be your friend.
If the data is modest in size, you may find it easiest to use byte[] for the data; either via File.ReadAllBytes, or by renting an oversized byte[] from the array-pool, and loading it from a FileStream. The Stream API is awkward for this kind of scenario, because once you've looked at data: it has gone - so you need to maintain your own back-buffers. The pipelines API is ideal for this, when dealing with large data, but is an advanced topic.
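As a concrete sketch of that byte-level scan, assuming the file fits comfortably in memory (error handling omitted; file name as in the question):
byte[] data = File.ReadAllBytes(@"crg_sample.crg");
byte[] marker = Encoding.ASCII.GetBytes("$$$$");

// Look for the marker bytes without decoding the file as text.
int markerStart = -1;
for (int i = 0; i <= data.Length - marker.Length && markerStart < 0; i++)
{
    bool match = true;
    for (int j = 0; j < marker.Length && match; j++)
        match = data[i + j] == marker[j];
    if (match) markerStart = i;
}

// The binary section starts after the line break that ends the marker line.
int binaryStart = Array.IndexOf(data, (byte)'\n', markerStart) + 1;

// Decode the text block only once its extent is known...
string header = Encoding.ASCII.GetString(data, 0, binaryStart);

// ...and read the floats directly from the binary block.
float first = BitConverter.ToSingle(data, binaryStart);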
UPDATE: This code may not work as expected. Please review the valuable information in the comments.
using (var fs = new FileStream(@"crg_sample.crg", FileMode.Open, FileAccess.Read))
{
using (StreamReader sr = new StreamReader(fs, Encoding.ASCII, true, 1, true))
{
var line = sr.ReadLine();
while (!string.IsNullOrWhiteSpace(line) && !line.Contains("$$$$"))
{
line = sr.ReadLine();
}
}
using (BinaryReader reader = new BinaryReader(fs))
{
// TODO: Start reading the binary data
}
}
Solution
I know this is far from the most optimized solution, but in my case it did the trick, and since the plain-text section of the file was known to be fairly small, it didn't cause any noticeable performance issues. Here's the code:
using var fileStream = new FileStream(@"crg_sample.crg", FileMode.Open, FileAccess.Read);
using var reader = new BinaryReader(fileStream);
var newLine = '\n';
var markerString = "$$$$";
var currentString = "";
var foundMarker = false;
var foundNewLine = false;
while (!foundNewLine)
{
var c = reader.ReadChar();
if (!foundMarker)
{
currentString += c;
if (currentString.Length > markerString.Length)
currentString = currentString.Substring(1);
if (currentString == markerString)
foundMarker = true;
}
else
{
if (c == newLine)
foundNewLine = true;
}
}
if (foundNewLine)
{
// Read binary
}
Note: If you're dealing with larger or more complex files, you should probably take a look at Marc Gravell's answer and the comment sections.
I know the title is long, but it describes the problem exactly. I didn't know how else to explain it because this is totally out there.
I have a utility written in C# targeting .NET Core 2.1 that downloads and decrypts (AES encryption) files originally uploaded by our clients from our encrypted store, so they can be reprocessed through some of our services in the case that they fail. This utility is run via CLI using database IDs for the files as arguments, for example download.bat 101 102 103 would download 3 files with the corresponding IDs. I'm receiving byte data through a message queue (really not much more than a TCP socket) which describes a .TIF image.
I have good reason to believe that the byte data is never corrupted on the server: when I run the utility with only one ID parameter, such as download.bat 101, it works just fine. Furthermore, when I run it with multiple IDs, the last file downloaded by the utility is always intact, but the rest are always corrupted.
This odd behavior has persisted across two different implementations for writing the byte data to a file. Those implementations are below.
File.ReadAllBytes implementation:
private static void WriteMessageContents(FileServiceResponseEnvelope envelope, string destination, byte[] encryptionKey, byte[] macInitialVector)
{
using (var inputStream = new MemoryStream(envelope.Payload))
using (var outputStream = new MemoryStream(envelope.Payload.Length))
{
var sha512 = YellowAesEncryptor.DecryptStream(inputStream, outputStream, encryptionKey, macInitialVector, 0);
File.WriteAllBytes(destination, outputStream.ToArray());
_logger.LogStatement($"Finished writing [{envelope.Payload.Length} bytes] to [{destination}].", LogLevel.Debug);
}
}
FileStream implementation:
private static void WriteMessageContents(FileServiceResponseEnvelope envelope, string destination, byte[] encryptionKey, byte[] macInitialVector)
{
using (var inputStream = new MemoryStream(envelope.Payload))
using (var outputStream = new MemoryStream(envelope.Payload.Length))
{
var sha512 = YellowAesEncryptor.DecryptStream(inputStream, outputStream, encryptionKey, macInitialVector, 0);
using (FileStream fs = new FileStream(destination, FileMode.Create))
{
var bytes = outputStream.ToArray();
fs.Write(bytes, 0, envelope.Payload.Length);
_logger.LogStatement($"File byte content: [{string.Join(", ", bytes.Take(16))}]", LogLevel.Trace);
fs.Flush();
}
_logger.LogStatement($"Finished writing [{envelope.Payload.Length} bytes] to [{destination}].", LogLevel.Debug);
}
}
This method is called from a loop which first receives the messages I described earlier and then feeds their payloads to the above method:
using (var requestSocket = new RequestSocket(fileServiceEndpoint))
{
// Envelopes is constructed beforehand
foreach (var envelope in envelopes)
{
var timer = Stopwatch.StartNew();
requestSocket.SendMoreFrame(messageTypeBytes);
requestSocket.SendMoreFrame(SerializationHelper.SerializeObjectToBuffer(envelope));
if (!requestSocket.TrySendFrame(_timeout, signedPayloadBytes, signedPayloadBytes.Length))
{
var message = $"Timeout exceeded while processing [{envelope.ActionType}] request.";
_logger.LogStatement(message, LogLevel.Error);
throw new Exception(message);
}
var responseReceived = requestSocket.TryReceiveFrameBytes(_timeout, out byte[] responseBytes);
...
var responseEnvelope = SerializationHelper.DeserializeObject<FileServiceResponseEnvelope>(responseBytes);
...
_logger.LogStatement($"Received response with payload of [{responseEnvelope.Payload.Length} bytes].", LogLevel.Info);
var destDir = downloadDetails.GetDestinationPath(responseEnvelope.FileId);
if (!Directory.Exists(destDir))
Directory.CreateDirectory(destDir);
var dest = Path.Combine(destDir, idsToFileNames[responseEnvelope.FileId]);
WriteMessageContents(responseEnvelope, dest, encryptionKey, macInitialVector);
}
}
I also know that TIFs have a very specific header, which looks something like this in raw bytes:
[73, 73, 42, 0, 8, 0, 0, 0, 20, 0...
It always begins with "II" (73, 73) or "MM" (77, 77) followed by 42 (probably a Hitchhiker's reference). I analyzed the bytes written by the utility. The last file always has a header that resembles this one. The rest are always random bytes; seemingly jumbled or mis-ordered image binary data. Any insight on this would be greatly appreciated because I can't wrap my mind around what I would even need to do to diagnose this.
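For anyone reproducing the analysis, a rough sanity check of that signature over the downloaded bytes might look like this sketch (it only tests the four header bytes described above, not full TIF validity):
static bool LooksLikeTif(byte[] bytes)
{
    if (bytes == null || bytes.Length < 4) return false;
    bool littleEndian = bytes[0] == 0x49 && bytes[1] == 0x49; // "II"
    bool bigEndian = bytes[0] == 0x4D && bytes[1] == 0x4D;    // "MM"
    if (!littleEndian && !bigEndian) return false;
    // 42 is stored in the byte order announced by the first two bytes
    int magic = littleEndian ? bytes[2] | (bytes[3] << 8)
                             : (bytes[2] << 8) | bytes[3];
    return magic == 42;
}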
UPDATE
I was able to figure out this problem with the help of elgonzo in the comments. Sometimes it isn't a direct answer that helps, but someone picking your brain until you look in the right place.
All right, as I suspected, this was a dumb mistake (I had severe doubts that the File API was simply this flawed). I just needed help thinking through it. There was an additional bit of code, which I didn't post, that was biting me: the part where I retrieve the metadata for each file so that I can then request the file from our storage box.
byte[] encryptionKey = null;
byte[] macInitialVector = null;
...
using (var conn = new SqlConnection(ConnectionString))
using (var cmd = new SqlCommand(uploadedFileQuery, conn))
{
conn.Open();
var reader = cmd.ExecuteReader();
while (reader.Read())
{
FileServiceMessageEnvelope readAllEnvelope = null;
var originalFileName = reader["UploadedFileClientName"].ToString();
var fileId = Convert.ToInt64(reader["UploadedFileId"].ToString());
//var originalFileExtension = originalFileName.Substring(originalFileName.IndexOf('.'));
//_logger.LogStatement($"Scooped extension: {originalFileExtension}", LogLevel.Trace);
envelopes.Add(readAllEnvelope = new FileServiceMessageEnvelope
{
ActionType = FileServiceActionTypeEnum.ReadAll,
FileType = FileTypeEnum.UploadedFile,
FileName = reader["UploadedFileServerName"].ToString(),
FileId = fileId,
WorkerAuthorization = null,
BinaryTimestamp = DateTime.Now.ToBinary(),
Position = 0,
Count = Convert.ToInt32(reader["UploadedFileSize"]),
SignerFqdn = _messengerConfig.FullyQualifiedDomainName
});
readAllEnvelope.SignMessage(_messengerConfig.PrivateKeyBytes, _messengerConfig.PrivateKeyPassword);
signedPayload = new SecureMessage { Payload = new byte[0] };
signedPayload.SignMessage(_messengerConfig.PrivateKeyBytes, _messengerConfig.PrivateKeyPassword);
signedPayloadBytes = SerializationHelper.SerializeObjectToBuffer(signedPayload);
encryptionKey = (byte[])reader["UploadedFileEncryptionKey"];
macInitialVector = (byte[])reader["UploadedFileEncryptionMacInitialVector"];
}
conn.Close();
}
Eagle-eyed observers might realize that I have not properly coupled the encryptionKey and macInitialVector to the correct record, since each file has a unique key and vector. This means I was using the key for one of the files to decrypt all of them which is why they were all corrupt except for one file -- they were not properly decrypted. I solved this issue by coupling them together with the ID in a simple POCO and retrieving the appropriate key and vector for each file upon decryption.
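In sketch form, the fix couples each file's key material to its ID; the class and dictionary below are illustrative, not the actual production names:
// Illustrative POCO; real names differ.
class FileCryptoInfo
{
    public long FileId { get; set; }
    public byte[] EncryptionKey { get; set; }
    public byte[] MacInitialVector { get; set; }
}

var cryptoByFileId = new Dictionary<long, FileCryptoInfo>();

// While reading each metadata record:
cryptoByFileId[fileId] = new FileCryptoInfo
{
    FileId = fileId,
    EncryptionKey = (byte[])reader["UploadedFileEncryptionKey"],
    MacInitialVector = (byte[])reader["UploadedFileEncryptionMacInitialVector"]
};

// And when writing each downloaded file:
var crypto = cryptoByFileId[responseEnvelope.FileId];
WriteMessageContents(responseEnvelope, dest, crypto.EncryptionKey, crypto.MacInitialVector);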
I'm trying to convert thousands of .lbl files to text to extract just the data I need to populate a database, for a much cheaper and easier method of printing labels. I've tried many different approaches to convert the .lbl file to a text file:
see this question
and this question
but I've had zero luck. Here's the code:
using (BinaryReader b = new BinaryReader(File.Open(@"stupid.lbl", FileMode.Open)))
{
int length = (int)b.BaseStream.Length;
byte[] allData = b.ReadBytes(length);
using (BinaryWriter w = new BinaryWriter(File.Open(@"test.txt", FileMode.Create)))
{
Encoding encEncoder = Encoding.ASCII;
string str = encEncoder.GetString(allData);
w.Write(str);
}
}
I do have the proper file paths and whatnot to actually read/write, and everything opens, but it's all gibberish. I just need to get the real, useful, human-readable text out of it to push into an Access DB. Any ideas out there?
I have a stream of bytes which, if put together right, will form a valid Word file. I need to convert this stream into a Word file without writing it to disk. I take the original stream from a SQL Server database table:
ID Name FileData
----------------------------------------
1 Word1 292jf2jf2ofm29fj29fj29fj29f2jf29efj29fj2f9 (actual file data)
the FileData field carries the data.
Microsoft.Office.Interop.Word.Application word = new Microsoft.Office.Interop.Word.Application();
Microsoft.Office.Interop.Word.Document doc = new Microsoft.Office.Interop.Word.Document();
doc = word.Documents.Open(@"C:\SampleText.doc");
doc.Activate();
The above code opens and fills a Word document from the file system. I don't want that; I want to define a new Microsoft.Office.Interop.Word.Document but fill its content manually from the byte stream.
After getting the in-memory Word document, I want to do some parsing of keywords.
Any ideas?
Create an in-memory file system; there are drivers for that.
Give Word a path to an FTP server (or something else) which you then use to push the data.
One important thing to note: storing files in a database is generally not good design.
You could look at how SharePoint solves this. They have created a web interface for documents stored in their database.
It's not that hard to create or embed a web server in your application that can serve pages to Word. You don't even have to use the standard ports.
There probably isn't any straightforward way of doing this. I found a couple of solutions searching for it:
Use the OpenOffice SDK to manipulate the document instead of Word Interop
Write the data to the clipboard, and then from the clipboard to Word
I don't know if this does it for you, but apparently the API doesn't provide what you're after (unfortunately).
There are really only two ways to open a Word document programmatically - as a physical file or as a stream. There's also a "package", but that's not really applicable here.
The stream method is covered here: https://learn.microsoft.com/en-us/office/open-xml/how-to-open-a-word-processing-document-from-a-stream
But even it relies on there being a physical file in order to form the stream:
string strDoc = @"C:\Users\Public\Public Documents\Word13.docx";
Stream stream = File.Open(strDoc, FileMode.Open);
The best solution I can offer would be to write the file out to a temp location where the service account for the application has permission to write:
string newDocument = @"C:\temp\test.docx";
WriteFile(byteArray, newDocument);
If it didn't have permissions on the "temp" folder in my example, you would simply add the service account of your application (the application pool identity, if it's a website) with Full Control of the folder.
You'd use this WriteFile() function:
/// <summary>
/// Write a byte[] to a new file at the location where you choose
/// </summary>
/// <param name="byteArray">byte[] that consists of file data</param>
/// <param name="newDocument">Path to where the new document will be written</param>
public static void WriteFile(byte[] byteArray, string newDocument)
{
using (MemoryStream stream = new MemoryStream())
{
stream.Write(byteArray, 0, (int)byteArray.Length);
// Save the file with the new name
File.WriteAllBytes(newDocument, stream.ToArray());
}
}
From there, you can open it with OpenXML and edit the file. There's no way to open a Word document in byte[] form directly into an instance of Word - Interop, OpenXML, or otherwise - because you need a documentPath, or the stream method mentioned earlier that relies on there being a physical file. You can edit the bytes by reading them into a string (and into XML afterwards), or just edit the string directly:
string docText = null;
byte[] byteArray = null;
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(documentPath, true))
{
using (StreamReader sr = new StreamReader(wordDoc.MainDocumentPart.GetStream()))
{
docText = sr.ReadToEnd(); // <-- converts byte[] stream to string
}
// Play with the XML
XmlDocument xml = new XmlDocument();
xml.LoadXml(docText); // the string contains the XML of the Word document
XmlNodeList nodes = xml.GetElementsByTagName("w:body");
XmlNode chiefBodyNode = nodes[0];
// add paragraphs with AppendChild...
// remove a node by getting a ChildNode and removing it, like this...
XmlNode firstParagraph = chiefBodyNode.ChildNodes[2];
chiefBodyNode.RemoveChild(firstParagraph);
// Or play with the string form
docText = docText.Replace("John","Joe");
// If you manipulated the XML, write it back to the string
//docText = xml.OuterXml; // comment out the line above if XML edits are all you want to do, and uncomment out this line
// Save the file - yes, back to the file system - required
using (StreamWriter sw = new StreamWriter(wordDoc.MainDocumentPart.GetStream(FileMode.Create)))
{
sw.Write(docText);
}
}
// Read it back in as bytes
byteArray = File.ReadAllBytes(documentPath); // new bytes, ready for DB saving
Reference:
https://learn.microsoft.com/en-us/office/open-xml/how-to-search-and-replace-text-in-a-document-part
I know it's not ideal, but I have searched and not found a way to edit the byte[] directly without a conversion that involves writing out the file, opening it in Word for the edits, then essentially re-uploading it to recover the new bytes. Doing byte[] byteArray = Encoding.UTF8.GetBytes(docText); before re-reading the file will corrupt the bytes, as will any other Encoding I tried (UTF7, Default, Unicode, ASCII); I found this when I tried to write them back out using my WriteFile() function above, in that last line. When the bytes were not re-encoded and simply collected using File.ReadAllBytes(), then written back out using WriteFile(), it worked fine.
Update:
It might be possible to manipulate the bytes like this:
//byte[] byteArray = File.ReadAllBytes("Test.docx"); // you might be able to assign your bytes here, instead of from a file?
byte[] byteArray = GetByteArrayFromDatabase(fileId); // function you have for getting the document from the database
using (MemoryStream mem = new MemoryStream())
{
mem.Write(byteArray, 0, (int)byteArray.Length);
using (WordprocessingDocument wordDoc =
WordprocessingDocument.Open(mem, true))
{
// do your updates -- see string or XML edits, above
// Once done, you may need to save the changes....
//wordDoc.MainDocumentPart.Document.Save();
}
// But you will still need to save it to the file system here....
// You would update "documentPath" to a new name first...
string documentPath = @"C:\temp\newDoc.docx";
using (FileStream fileStream = new FileStream(documentPath,
System.IO.FileMode.CreateNew))
{
mem.WriteTo(fileStream);
}
}
// And then read the bytes back in, to save it to the database
byteArray = File.ReadAllBytes(documentPath); // new bytes, ready for DB saving
Reference:
https://learn.microsoft.com/en-us/previous-versions/office/office-12//ee945362(v=office.12)
But note that even this method will require saving the document, then reading it back in, in order to save it to bytes for the database. It will also fail if the document is in .doc format instead of .docx on that line where the document is being opened.
Instead of that last section for saving the file to the file system, you could just take the memory stream and save that back into bytes once you are outside of the WordprocessingDocument.Open() block, but still inside the using (MemoryStream mem = new MemoryStream() { ... } statement:
// Convert
byteArray = mem.ToArray();
This will have your Word document byte[].
I'm using C# in ASP.NET version 2. I'm trying to open an image file, read (and change) the XMP header, and close it back up again. I can't upgrade ASP, so WIC is out, and I just can't figure out how to get this working.
Here's what I have so far:
Bitmap bmp = new Bitmap(Server.MapPath(imageFile));
MemoryStream ms = new MemoryStream();
StreamReader sr = new StreamReader(Server.MapPath(imageFile));
*[stuff with find and replace here]*
byte[] data = ToByteArray(sr.ReadToEnd());
ms = new MemoryStream(data);
originalImage = System.Drawing.Image.FromStream(ms);
Any suggestions?
How about this kinda thing?
byte[] data = File.ReadAllBytes(path);
... find & replace bit here ...
File.WriteAllBytes(path, data);
Also, I really recommend against using System.Drawing.Bitmap in an ASP.NET process, as it leaks memory and will crash/randomly fail every now and again (even MS admits this).
Here's the bit from MS about why System.Drawing.Bitmap isn't stable:
http://msdn.microsoft.com/en-us/library/system.drawing.aspx
"Caution:
Classes within the System.Drawing namespace are not supported for use within a Windows or ASP.NET service. Attempting to use these classes from within one of these application types may produce unexpected problems, such as diminished service performance and run-time exceptions."
Part 1 of the XMP spec 2012, page 10 specifically talks about how to edit a file in place without needing to understand the surrounding format (although they do suggest this as a last resort). The embedded XMP packet looks like this:
<?xpacket begin="■" id="W5M0MpCehiHzreSzNTczkc9d"?>
... the serialized XMP as described above: ...
<x:xmpmeta xmlns:x="adobe:ns:meta/">
<rdf:RDF xmlns:rdf= ...>
...
</rdf:RDF>
</x:xmpmeta>
... XML whitespace as padding ...
<?xpacket end="w"?>
In this example, ‘■’ represents the Unicode “zero width non-breaking space character” (U+FEFF) used as a byte-order marker.
The XMP Spec (2010, Part 3, page 12) also gives specific byte patterns (UTF-8, UTF-16, big/little endian) to look for when scanning the bytes. This complements Chris's answer about reading the file in as a giant byte stream.
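For the UTF-8 case, that scan boils down to searching the raw bytes for the packet header; a minimal sketch (the UTF-16/32 patterns from the spec would each need their own byte sequences):
// Returns the offset of a UTF-8 encoded XMP packet header, or -1 if not found.
static int FindXmpPacketStart(byte[] data)
{
    byte[] pattern = Encoding.ASCII.GetBytes("<?xpacket begin=");
    for (int i = 0; i <= data.Length - pattern.Length; i++)
    {
        bool match = true;
        for (int j = 0; j < pattern.Length && match; j++)
            match = data[i + j] == pattern[j];
        if (match) return i;
    }
    return -1;
}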
You can use the following functions to read/write the binary data:
public byte[] GetBinaryData(string path, int bufferSize)
{
MemoryStream ms = new MemoryStream();
using (FileStream fs = File.Open(path, FileMode.Open, FileAccess.Read))
{
int bytesRead;
byte[] buffer = new byte[bufferSize];
while((bytesRead = fs.Read(buffer,0,bufferSize))>0)
{
ms.Write(buffer,0,bytesRead);
}
}
return(ms.ToArray());
}
public void SaveBinaryData(string path, byte[] data, int bufferSize)
{
using (FileStream fs = File.Open(path, FileMode.Create, FileAccess.Write))
{
int totalBytesSaved = 0;
while (totalBytesSaved<data.Length)
{
int remainingBytes = Math.Min(bufferSize, data.Length - totalBytesSaved);
fs.Write(data, totalBytesSaved, remainingBytes);
totalBytesSaved += remainingBytes;
}
}
}
However, loading entire images into memory would use quite a bit of RAM. I don't know much about XMP headers, but if possible you should:
Load only the headers in memory
Manipulate the headers in memory
Write the headers to a new file
Copy the remaining data from the original file
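A minimal sketch of the last two steps, assuming the packet's offset has already been located (for example with a scan like FindXmpPacketStart above) and the modified packet has been padded to exactly the original packet's length, per the spec's in-place editing rules; the method name and paths are illustrative:
public static void ReplaceXmpPacket(string sourcePath, string destPath,
                                    long packetOffset, byte[] newPacket)
{
    using (FileStream input = File.OpenRead(sourcePath))
    using (FileStream output = File.Create(destPath))
    {
        // Copy everything before the packet unchanged.
        byte[] buffer = new byte[81920];
        long remaining = packetOffset;
        while (remaining > 0)
        {
            int read = input.Read(buffer, 0, (int)Math.Min(buffer.Length, remaining));
            output.Write(buffer, 0, read);
            remaining -= read;
        }

        // Write the modified packet and skip past the original one
        // (same length, per the in-place padding rule).
        output.Write(newPacket, 0, newPacket.Length);
        input.Seek(newPacket.Length, SeekOrigin.Current);

        // Copy the remaining data from the original file.
        int n;
        while ((n = input.Read(buffer, 0, buffer.Length)) > 0)
            output.Write(buffer, 0, n);
    }
}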