Xml Parsing - Unique Error Situation - c#

I was trying to parse an XML file created using Visual Studio, using a tool which uses the Xerces parser, and I got a "content not allowed in prolog" error.
Now when I create an XML file using some other editor like Notepad++, with exactly the same content as the one created above, I don't get this error.
What do you think might be the problem? You might understand that this is not a repeat question.
EDIT
So I found the problem. It's because the tool I use could not handle the BOM at the beginning of the file.

The file starts with a UTF-8 byte-order mark. The XML specifications say that documents may start with a BOM, so it should be fine. Is it possible that the tool uses an old version of Xerces which didn't cope with a BOM? Other than that, the file looks fine to me.
Is this a tool you have the source code to? Are you able to create a short but complete program which demonstrates the problem, failing to parse it? Can you try a later version of Xerces?

Check the encoding of the file created using Visual Studio and compare it with the encoding of the Notepad++ file; that must be the issue.
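If the tool cannot be upgraded, one possible workaround is to remove the BOM before handing the file over. The following is a minimal sketch, not part of the original answers: the class name BomStripper and its methods are hypothetical, and it assumes the file is UTF-8 and that the only problem is the three leading bytes EF BB BF.

```csharp
using System;
using System.IO;

static class BomStripper
{
    // Returns the input without a leading UTF-8 BOM (EF BB BF),
    // or the same array unchanged when no BOM is present.
    public static byte[] StripUtf8Bom(byte[] data)
    {
        bool hasBom = data.Length >= 3
            && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF;
        if (!hasBom)
        {
            return data;
        }

        var result = new byte[data.Length - 3];
        Array.Copy(data, 3, result, 0, result.Length);
        return result;
    }

    // Rewrites the file in place, but only when a BOM was actually found.
    public static void StripUtf8BomFromFile(string path)
    {
        byte[] original = File.ReadAllBytes(path);
        byte[] stripped = StripUtf8Bom(original);
        if (stripped.Length != original.Length)
        {
            File.WriteAllBytes(path, stripped);
        }
    }
}
```

Calling BomStripper.StripUtf8BomFromFile on the Visual Studio file before passing it to the tool would be one way to check whether the BOM really is the only difference.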

Related

OpenXmlSDK can't read manually created xlsx file: 'The specified package is invalid. The main part is missing.'

I have a third-party library that creates an xlsx file. It doesn't use OpenXmlSDK; it assembles the file from fragments of XML markup. The ZipArchive class is used for zipping.
But when I try to open it with OpenXmlSDK
var document = SpreadsheetDocument.Open(fileStream, false);
it fails with error:
DocumentFormat.OpenXml.Packaging.OpenXmlPackageException: 'The specified package is invalid. The main part is missing.'
MS Excel opens this file normally. Resaving from Excel helps.
Also, if I unzip the files, zip them again (without any changes), and call the above code again, it works.
Where is the problem? How do I zip an xlsx file so that it is ready for OpenXmlSDK?
SOLUTION
The problem was in how the third-party library saved the file. Entries included in the zip had names with \ instead of /. The library's code was edited to fix that, and all is OK now.
After some research I found people complaining about this exception in two scenarios:
the document uses or references a font that is not installed (as described here: https://github.com/OfficeDev/Open-XML-SDK/issues/561)
the file has an invalid file name extension (other than xlsx, as described here: https://social.msdn.microsoft.com/Forums/office/en-US/6e7e27d4-cd97-46ae-9eca-bfd618dde301/openxml-sdk20-the-specified-package-is-invalid-the-main-part-is-missing?forum=oxmlsdk)
Since you open the file from a stream, the second cause probably does not apply in this case.
If font usage is not the cause, try to manually compare the file versions before and after saving with Excel in the Open XML Productivity Tool (https://www.microsoft.com/en-us/download/details.aspx?id=30425).
If there are no differences in the documents' contents, try comparing the archive compression settings.
UPDATE
It seems I've found some more information about the issue that can help to find the solution.
I was able to reproduce the "The main part is missing." error by creating an archive with: ZipFile.CreateFromDirectory(@"C:\DirToCompress", destFilePath, CompressionLevel.Fastest, false);.
Then I checked that opening the file with Package.Open(destFilePath, FileMode.Open, FileAccess.Read) actually listed 0 parts found in the file.
After verifying some differences, I noticed that in the correct xlsx file, entries nested within folders in the archive have FullName paths presented using / character, for example: _rels/.rels. In the corrupted file, the names were written with \ character, for example: _rels\.rels.
You can investigate it by opening a file using ZipArchive class (for example: new ZipArchive(archiveStream, ZipArchiveMode.Read, false, UTF8Encoding.UTF8);) and inspecting the Entries collection.
The important thing to note is that there are naming rules for parts described in the Office Open XML specification: https://www.ecma-international.org/news/TC45_current_work/Office%20Open%20XML%20Part%202%20-%20Open%20Packaging%20Conventions.pdf
As a test, I wrote some code that opens the corrupted xlsx file using the ZipArchive class and rewrites each entry by copying its contents and replacing \ with / in the name of the recreated entry. After this operation, the resulting file seems to be opened correctly by the SpreadsheetDocument.Open(...) method.
Please note that the name-fixing method I used was very simple and may not be sufficient or work correctly in some scenarios. However, these notes may help in finding a proper solution for the issue.
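The answer does not include its test code, so the following is only a sketch of what such a rewrite might look like. The class name ZipEntryNameFixer and the method NormalizeEntryNames are hypothetical; the sketch assumes that replacing every \ with / in the entry names is all that is needed, which, as noted above, may not hold in every scenario.

```csharp
using System.IO;
using System.IO.Compression;

static class ZipEntryNameFixer
{
    // Copies every entry from the source archive into a new archive written
    // to the destination stream, replacing '\' with '/' in each entry name.
    public static void NormalizeEntryNames(Stream source, Stream destination)
    {
        using (var input = new ZipArchive(source, ZipArchiveMode.Read, leaveOpen: true))
        using (var output = new ZipArchive(destination, ZipArchiveMode.Create, leaveOpen: true))
        {
            foreach (ZipArchiveEntry entry in input.Entries)
            {
                string fixedName = entry.FullName.Replace('\\', '/');
                ZipArchiveEntry copy = output.CreateEntry(fixedName, CompressionLevel.Optimal);

                // Only one entry may be open for writing at a time in Create mode,
                // so both streams are disposed before the next iteration.
                using (Stream from = entry.Open())
                using (Stream to = copy.Open())
                {
                    from.CopyTo(to);
                }
            }
        }
    }
}
```

With two file streams (one opened for reading the corrupted file, one created for the repaired copy), the output could then be passed to SpreadsheetDocument.Open(...) to see whether the parts are found.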

Excel opening CSV with wrong encoding

This is partly a question for the Microsoft forums too, but I think there might be some coding involved.
We have a system built in C# .NET that generates CSV files. However, we have problems with the special characters "æÆøØåÅ". The thing is, when I open the file in Notepad, everything is correct. But when I open the file in Excel, these characters are wrong. If I open it in Notepad and save without actually making any changes, it works in Excel. But I don't understand why. Is there some hidden information added to the file that we can adjust in our C# code to make it correct in the first place?
There are other questions like this, but all the answers I could find are workarounds for when you already have a wrong CSV file. In our case, we create this file, and the people we send the files to are usually not computer people capable of changing the encoding, etc.
Edit:
Here is the code we tried to use at the end, after generating our result CSV-string:
string result = "some;æøå;string";
byte[] bytes = System.Text.Encoding.GetEncoding(65001).GetBytes(result.ToString());
return System.Text.Encoding.GetEncoding(65001).GetString(bytes);

open or convert webarchive File in c#

I am trying to find a way to open or convert a webarchive file to any other format in C#. The goal is an automated import system with as few restrictions on file type as possible. I cannot seem to find any way of converting the file other than using Safari to open it.
Unfortunately what you are looking for cannot really be done. A webarchive is a proprietary file type made by Apple to display offline webpages in Safari. It is a combination of XML, HTML, and binary data, but there are examples in Objective-C that convert a webarchive to a zip archive containing the HTML and the embedded images/media that were originally displayed on the website saved into the webarchive file.
Here is an Objective-C example from GitHub - WebArchiveExtractor
As for converting to PDF...not sure that can be done, you would be better off printing the webpage to PDF in the first place and then uploading that to your document management system.
Apparently, though, the webarchive file type contains XML with binary-encoded images/media, similar to an MHTML file, so you may be able to figure out the format by viewing the files in a text editor and then writing a conversion utility. There is very limited information on the web regarding the internal schema of the webarchive file format, so this may be a daunting task. However, since WebKit is open source, you can look at their code for creating an archive and try to reverse it to build your converter. Here's the source code (in C++) for the archiving features in Safari, which actually looks like they are using MHTML, but I haven't explored deeply enough to tell if it's exactly the same format: http://trac.webkit.org/browser/trunk/Source/WebCore/loader/archive
Good Luck!

XML file generated in windows not loading in linux environment

We are generating an XML file in C# using XmlSerializer and UTF-8 encoding. We check the output, and the XML is well formed and passes XSD validation.
We send this XML to a customer who loads it in a UNIX environment. They keep telling us that the XML is not valid and has invalid characters. We don't have a UNIX environment to test with.
The question being, is there any difference when loading xml files in UNIX?
What can we ask the customer to provide to better understand this situation?
You might have a UTF-8 BOM as the first three bytes of your file:
<?xml version="1.0" encoding="utf-8"?>
It is not part of the XML document, so a file reader should not pass it on to be interpreted by the XML parser. If you have it, you could try to remove it and see if your users have the same complaint. Most editors will not show it to you, so you might have to use a hex editor. (Hex: EF BB BF).
If the problem remains, you'd need to know at what byte offset the purported invalid characters are and which section of the XML specification they violate. Which program and version they use and what feedback it gives might be helpful, too.
You might also consider that the file is getting damaged in delivery. A round trip transmission might help detect that.
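To test the BOM theory from the generating side, the file could be produced without a BOM in the first place. This is a minimal sketch under assumptions: the Order type and the BomFreeXml class are hypothetical stand-ins for whatever is actually serialized, and nothing in the question confirms that the BOM is the cause. It relies on the UTF8Encoding constructor argument that turns off the byte-order mark.

```csharp
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical payload type, used only for illustration.
public class Order
{
    public string Id { get; set; }
}

static class BomFreeXml
{
    // Serializes the value as UTF-8 XML whose first byte is '<', not a BOM.
    public static byte[] Serialize<T>(T value)
    {
        var settings = new XmlWriterSettings
        {
            // false = do not emit the EF BB BF byte-order mark.
            Encoding = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false),
            Indent = true
        };

        using (var buffer = new MemoryStream())
        {
            using (XmlWriter writer = XmlWriter.Create(buffer, settings))
            {
                new XmlSerializer(typeof(T)).Serialize(writer, value);
            }
            return buffer.ToArray();
        }
    }
}
```

The returned bytes can be written with File.WriteAllBytes and sent to the customer; if the complaint goes away, the BOM was the likely culprit.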

How to print a file to the XpsDocumentWriter in C# or better yet, via VS.NET Automation?

What I am trying to accomplish (the manual way)
In VS 2010, I have project items that are sequence diagrams; they are really just .xml files with a suffix of .sequencediagrams. So I open the diagram in VS and go to File->Print. I do not select a physical printer in the dropdown; I change it to Microsoft XPS Document Writer, because I want an output .XPS file.
How to do it in code?
I am trying to do this in C# code, specifically in a VS add-in (automation). What I have is a handle to a projectinfo, which gives me the full path of the file, but I am kind of lost on the printing part. I thought I could use http://msdn.microsoft.com/en-us/library/system.windows.xps.xpsdocumentwriter.aspx , but the method signatures don't seem to take a simple document path.
Anyone have experience in this? suggestions? Thanks.
It was suggested to look at these links:
http://blogs.msdn.com/b/camerons/archive/2010/03/08/save-a-diagram-to-image-file.aspx
http://weblogs.asp.net/gunnarpeipman/archive/2010/09/03/visual-studio-extension-save-uml-diagram-as-image.aspx
by Ryan Molden (MSFT)
Thanks to Ryan, bypassing the whole diagram -> XPS step is great.
