Read annotation data with Umlaut using iText

I am using iText to read the author and subject from stamp annotations.
If the annotation author includes non-ASCII characters (e.g. "äüö"), they are read as follows:
Anton M�ller
My code:
using System;
using System.IO;
using iText.Kernel.Pdf;
namespace iText7Test
{
class Program
{
static void Main(string[] args)
{
Stream inputStream = File.OpenRead(@"Stamp_Anton_Mueller.pdf");
PdfDocument annoPdf = new PdfDocument(new PdfReader(inputStream));
for (int iPage = 1; iPage <= annoPdf.GetNumberOfPages(); iPage++)
{
PdfPage annoPage = annoPdf.GetPage(iPage);
var annotations = annoPage.GetAnnotations();
foreach (var annot in annotations)
{
PdfDictionary annoDict = annot.GetPdfObject();
if ("/Stamp" != annoDict.Get(PdfName.IT, true)?.ToString())
continue;
var subject = annoDict.Get(PdfName.Subj, true);
var author = annoDict.Get(PdfName.T); // this reads "Anton M�ller"
var creationDate = annoDict.Get(PdfName.CreationDate, false);
Console.WriteLine("\nAuthor of Stamp_Anton_Mueller.pdf: {0}", author); // this writes: "Author of Stamp_Anton_Mueller.pdf: Anton M?ller"
}
}
}
}
}
If I simply load the inputStream into a string, the resulting string has the same � issues.
string myPdfString = new StreamReader(inputStream).ReadToEnd();
However, if I set the Encoding parameter of the StreamReader, the Umlaut is shown correctly.
string encodedPdfString = new StreamReader(inputStream, Encoding.Default).ReadToEnd();
I did not find any option to choose an Encoding for the PdfReader.
Sample PDF: https://drive.google.com/file/d/1_bs47kSkITX1SdDYllVBQPhRP4D3xUAw/view
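The difference between the two StreamReader calls above can be reproduced without a PDF. The following is a minimal sketch, assuming the author name is stored with "ü" as the single byte 0xFC (an assumption about the sample file, not something verified here); the class and method names are illustrative only. Decoded as UTF-8, the byte 0xFC is invalid and becomes the replacement character U+FFFD, while a single-byte encoding such as ISO-8859-1 maps it to "ü". On .NET Framework, Encoding.Default is the system ANSI code page, which would explain why it decodes the name correctly there.

```csharp
using System;
using System.Text;

public static class UmlautDecodingSketch
{
    // "Anton Müller" with "ü" stored as the single byte 0xFC (assumed, see above).
    public static readonly byte[] AuthorBytes =
    {
        0x41, 0x6E, 0x74, 0x6F, 0x6E, 0x20, 0x4D, 0xFC, 0x6C, 0x6C, 0x65, 0x72
    };

    // UTF-8 cannot interpret the lone byte 0xFC and substitutes U+FFFD for it.
    public static string DecodeAsUtf8(byte[] bytes) =>
        Encoding.UTF8.GetString(bytes);

    // ISO-8859-1 maps every byte to the code point with the same value, so 0xFC becomes "ü".
    public static string DecodeAsLatin1(byte[] bytes) =>
        Encoding.GetEncoding("ISO-8859-1").GetString(bytes);

    public static void Main()
    {
        Console.WriteLine(DecodeAsUtf8(AuthorBytes));
        Console.WriteLine(DecodeAsLatin1(AuthorBytes));
    }
}
```

Whether the console shows "ü" or "?" additionally depends on the console's output encoding, which matches the two different outputs noted in the comments of the question's code.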

Related

PdfLayer.GetTitle() always returning null

C# itext 7.1.4 (NuGet release) doesn't seem to parse OCG/layer titles correctly.
The C# code below should read a pdf, print all layer titles, turn off the layer visibility and save it to the dest file.
Example pdf file: https://docdro.id/qI479di
using iText.Kernel.Pdf;
using System;
namespace PDFSetOCGVisibility
{
class Program
{
static void Main(string[] args)
{
var src = @"layer-example.pdf";
var dest = @"layer-example-out.pdf";
PdfDocument pdf = new PdfDocument(new PdfReader(src), new PdfWriter(dest));
var Catalog = pdf.GetCatalog();
var ocProps = Catalog.GetOCProperties(false);
var layers = ocProps.GetLayers();
foreach(var layer in layers)
{
var title = layer.GetTitle();
Console.WriteLine($"title: {title ?? "null"}");
layer.SetOn(false);
}
pdf.Close();
}
}
}
Expected output is:
title: Layer 1
title: Layer 2
Actual output is:
title: null
title: null
Writing the file with disabled layers works fine but the layer titles are always null.

Just tested the itext5 version:
using iTextSharp.text.pdf;
using System;
using System.IO;
namespace PDFSetOCGVisibility5
{
class Program
{
static void Main(string[] args)
{
var src = @"layer-example.pdf";
var dest = @"layer-example-out.pdf";
var reader = new PdfReader(src);
PdfStamper pdf = new PdfStamper(reader, new FileStream(dest, FileMode.Create));
var layers = pdf.GetPdfLayers();
foreach (var layer in layers)
{
var title = layer.Key;
Console.WriteLine($"title: {title ?? "null"}");
layer.Value.On = false;
}
pdf.Close();
reader.Close();
}
}
}
It works as expected, so this seems to be a regression in iText 7.

I don't know what the purpose of title/GetTitle() is, but to get the name (as displayed in the layers panel), the following code works:
var title = layer.GetPdfObject().GetAsString(PdfName.Name).ToUnicodeString();

Create PDF from existing pdf with azure storage

I made a bot application with the Microsoft Botbuilder. Now I want to create a pdf-file from the user input. The file should be stored in my azure storage.
I have a "pdf-template" which should be copied and modified (this file is in the azure storage already). It has some textboxes which should be filled with the user input. I already wrote the code for that with iTextSharp.
But I need a filestream for this code. Does anybody know how to get the filestream from the file in my azure storage? Or is there maybe another way to finish my task?
Edit:
Here is the code where I need the filestream
string fileNameExisting = Path.Combine(Directory.GetCurrentDirectory(), "Some.pdf");
string fileNameNew = @"Path/Some2.pdf";
var inv = new Invention
{
Inventor = new Inventor { Firstname = "TEST!", Lastname= "TEST!" },
Date = DateTime.Now,
Title = "TEST",
Slogan = "TEST!",
Description = "TEST!",
Advantages = "TEST!s",
TaskPosition = "TEST!",
TaskSolution = "TEST!"
};
using (var existingFileStream = new FileStream(fileNameExisting, FileMode.Open))
using (var newFileStream = new FileStream(fileNameNew, FileMode.Create))
{
// Open existing PDF
var pdfReader = new PdfReader(existingFileStream);
// PdfStamper, which will create the new PDF
var stamper = new PdfStamper(pdfReader, newFileStream);
var form = stamper.AcroFields;
var fieldKeys = form.Fields.Keys;
foreach (string fieldKey in fieldKeys)
{
var props = fieldKey.Split('.');
string t = GetProp(props, inv);
form.SetField(fieldKey, t);
}
stamper.Close();
pdfReader.Close();
}
}
public static string GetProp(string[] classes, object oldObj)
{
var obj = oldObj.GetType().GetProperty(classes[0]).GetValue(oldObj, null);
if(classes.Length>1)
{
classes = classes.Skip(1).ToArray();
return GetProp(classes, obj);
}
Console.WriteLine(obj.ToString());
return obj.ToString();
}

The PdfReader constructor also takes a byte array. You should be able to create the object using something like:
var pdfTemplateBytes = await new WebClient().DownloadDataTaskAsync("https://myaccount.blob.core.windows.net/templates/mytemplate.pdf");
var pdfReader = new PdfReader(pdfTemplateBytes);
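If the rest of the code still expects streams, as in the question, the downloaded bytes can be wrapped in a MemoryStream, and the result can be collected in a second MemoryStream before uploading it. The following is a rough sketch using only the base class library; the class and method names are hypothetical, and the copy step is merely a stand-in for the PdfReader/PdfStamper code from the question.

```csharp
using System;
using System.IO;
using System.Text;

public static class InMemoryPdfSketch
{
    // Wraps already-downloaded bytes in a read-only stream positioned at the start.
    public static Stream OpenTemplate(byte[] templateBytes) =>
        new MemoryStream(templateBytes, writable: false);

    // Stand-in for the form-filling step: copies the template unchanged
    // into an in-memory output and returns the resulting bytes.
    public static byte[] ProcessInMemory(byte[] templateBytes)
    {
        using (var input = OpenTemplate(templateBytes))
        using (var output = new MemoryStream())
        {
            input.CopyTo(output); // the PdfReader/PdfStamper code would go here instead
            return output.ToArray();
        }
    }

    public static void Main()
    {
        // Placeholder bytes, not a real PDF.
        byte[] fakeTemplate = Encoding.ASCII.GetBytes("%PDF-1.4 placeholder");
        byte[] result = ProcessInMemory(fakeTemplate);
        Console.WriteLine(result.Length);
    }
}
```

The returned byte array can then be handed to whatever upload call the storage client offers, so no file on the local disk is needed.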

Deserialize xml using Bond throws System.IO.InvalidDataException : Unexpected node type

I played a little with Bond using this code:
using System;
using System.IO;
using System.Text;
using System.Xml;
using Bond;
using Bond.Protocols;
using NUnit.Framework;
public class Sandbox
{
[Test]
public void RoundtripWithSchema()
{
var sb = new StringBuilder();
var source = new WithSchema { Value = 1 };
using (XmlWriter xmlWriter = XmlWriter.Create(sb))
{
var writer = new SimpleXmlWriter(xmlWriter);
Serialize.To(writer, source);
}
var xml = sb.ToString();
Console.Write(xml);
Console.WriteLine();
using (var xmlReader = XmlReader.Create(new StringReader(xml)))
{
var reader = new SimpleXmlReader(xmlReader);
var roundtripped = Deserialize<WithSchema>.From(reader); // System.IO.InvalidDataException : Unexpected node type
Assert.AreEqual(source.Value, roundtripped.Value);
}
}
[Test]
public void RoundtripUsingSerializerWithSchema()
{
var sb = new StringBuilder();
var source = new WithSchema { Value = 1 };
using (XmlWriter xmlWriter = XmlWriter.Create(sb))
{
var writer = new SimpleXmlWriter(xmlWriter);
var serializer = new Serializer<SimpleXmlWriter>(typeof(WithSchema));
serializer.Serialize(source, writer);
}
var xml = sb.ToString();
Console.Write(xml);
Console.WriteLine();
using (var xmlReader = XmlReader.Create(new StringReader(xml)))
{
var reader = new SimpleXmlReader(xmlReader);
var serializer = new Deserializer<SimpleXmlReader>(typeof(WithSchema));
var roundtripped = serializer.Deserialize<WithSchema>(reader); // System.IO.InvalidDataException : Unexpected node type
Assert.AreEqual(source.Value, roundtripped.Value);
}
}
}
[Schema]
public class WithSchema
{
[Id(0)]
public int Value { get; set; }
}
Both samples output the expected xml:
<?xml version="1.0" encoding="utf-16"?>
<WithSchema>
<Value>1</Value>
</WithSchema>
Both fail when deserializing, throwing System.IO.InvalidDataException : Unexpected node type.
I don't really know where to look for the bug. Any suggestions?

The Bond SimpleXmlReader is having trouble with the <?xml version="1.0" encoding="utf-16"?> line. If you leave this out when serializing, you can deserialize without a problem.
Try something like this:
using (XmlWriter xmlWriter = XmlWriter.Create(sb, new XmlWriterSettings { OmitXmlDeclaration = true }))
{
var writer = new SimpleXmlWriter(xmlWriter);
Serialize.To(writer, source);
}
My guess is that XmlNodeType.XmlDeclaration probably needs to be added to the IgnoredTokens set in Bond's SimpleXmlParser.
Bond versions later than v4.0.1 will have this issue (Bond issue #112 on GitHub) fixed.
Inspiration for this answer came from the Bond simple_xml sample.
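The effect of OmitXmlDeclaration can be checked without Bond. The following is a small sketch that writes the same element structure as in the question by hand (the XmlDeclarationSketch class and its Write method are illustrative names only) and returns the resulting string with and without the declaration.

```csharp
using System;
using System.Text;
using System.Xml;

public static class XmlDeclarationSketch
{
    // Writes <WithSchema><Value>1</Value></WithSchema> into a StringBuilder,
    // optionally suppressing the leading <?xml ...?> declaration.
    public static string Write(bool omitDeclaration)
    {
        var sb = new StringBuilder();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = omitDeclaration };
        using (XmlWriter xmlWriter = XmlWriter.Create(sb, settings))
        {
            xmlWriter.WriteStartElement("WithSchema");
            xmlWriter.WriteElementString("Value", "1");
            xmlWriter.WriteEndElement();
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Write(omitDeclaration: false)); // starts with the <?xml ...?> declaration
        Console.WriteLine(Write(omitDeclaration: true));  // starts directly with <WithSchema>
    }
}
```

Because the target is a StringBuilder, the declaration names utf-16, which is the line shown in the question's output.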

copying openXML image from one document to another

We have conditional Footers that INCLUDETEXT based on the client:
IF $CLIENT = "CLIENT1" "{INCLUDETEXT "CLIENT1HEADER.DOCX"}" ""
Depending on our document, there could be a varying amount of IF/ELSE, and these all work correctly for merging the correct files in the correct place.
However, some of these documents may have client specific images/branding, which also need to be copied across from the INCLUDETEXT file.
Below is the method that is called to replace any Picture elements that exist in the IEnumerable<Run> that is copied from the Source document to the Target document.
The image is copied fine; however, it doesn't appear to update the relationship ID in my Picture or add a record to the .xml.rels files. (I even tried adding a ForEach to add it to all the headers and footers, to see if this made any difference.)
private void InsertImagesFromOldDocToNewDoc(WordprocessingDocument source, WordprocessingDocument target, IEnumerable<Picture> pics)
{
IEnumerable<Picture> imageElements = source.MainDocumentPart.Document.Descendants<Run>().Where(x => x.Descendants<Picture>().FirstOrDefault() != null).Select(x => x.Descendants<Picture>().FirstOrDefault());
foreach (Picture pic in pics) //the new pics
{
Picture oldPic = imageElements.Where(x => x.Equals(pic)).FirstOrDefault();
if (oldPic != null)
{
string imageId = "";
ImageData shape = oldPic.Descendants<ImageData>().FirstOrDefault();
ImagePart p = source.MainDocumentPart.GetPartById(shape.RelationshipId) as ImagePart;
ImagePart newPart = target.MainDocumentPart.AddPart<ImagePart>(p);
newPart.FeedData(p.GetStream());
shape.RelId = target.MainDocumentPart.GetIdOfPart(newPart);
string relPart = target.MainDocumentPart.CreateRelationshipToPart(newPart);
}
}
}
Has anyone come across this issue before?
It appears the OpenXML SDK documentation is a 'little' sparse...

A late reaction, but this thread helped me a lot to get it working. Here is my solution for copying a document with images:
private static void CopyDocumentWithImages(string path)
{
if (!Path.GetFileName(path).StartsWith("~$"))
{
using (var source = WordprocessingDocument.Open(path, false))
{
using (var newDoc = source.CreateNew(path.Replace(".docx", "-images.docx")))
{
foreach (var e in source.MainDocumentPart.Document.Body.Elements())
{
var clonedElement = e.CloneNode(true);
clonedElement.Descendants<DocumentFormat.OpenXml.Drawing.Blip>()
.ToList().ForEach(blip =>
{
var newRelation = newDoc.CopyImage(blip.Embed, source);
blip.Embed = newRelation;
});
clonedElement.Descendants<DocumentFormat.OpenXml.Vml.ImageData>().ToList().ForEach(imageData =>
{
var newRelation = newDoc.CopyImage(imageData.RelationshipId, source);
imageData.RelationshipId = newRelation;
});
newDoc.MainDocumentPart.Document.Body.AppendChild(clonedElement);
}
newDoc.Save();
}
}
}
}
CopyImage:
public static string CopyImage(this WordprocessingDocument newDoc, string relId, WordprocessingDocument org)
{
var p = org.MainDocumentPart.GetPartById(relId) as ImagePart;
var newPart = newDoc.MainDocumentPart.AddPart(p);
newPart.FeedData(p.GetStream());
return newDoc.MainDocumentPart.GetIdOfPart(newPart);
}
CreateNew:
public static WordprocessingDocument CreateNew(this WordprocessingDocument org, string name)
{
var doc = WordprocessingDocument.Create(name, WordprocessingDocumentType.Document);
doc.AddMainDocumentPart();
doc.MainDocumentPart.Document = new Document(new Body());
using (var streamReader = new StreamReader(org.MainDocumentPart.ThemePart.GetStream()))
using (var streamWriter = new StreamWriter(doc.MainDocumentPart.AddNewPart<ThemePart>().GetStream(FileMode.Create)))
{
streamWriter.Write(streamReader.ReadToEnd());
}
using (var streamReader = new StreamReader(org.MainDocumentPart.StyleDefinitionsPart.GetStream()))
using (var streamWriter = new StreamWriter(doc.MainDocumentPart.AddNewPart<StyleDefinitionsPart>().GetStream(FileMode.Create)))
{
streamWriter.Write(streamReader.ReadToEnd());
}
return doc;
}

Stuart,
I had faced the same problem when I was trying to copy the numbering styles from one document to the other.
I think what Word does internally is, whenever an object is copied from one document to the other the ID for that object is not copied over to the new document and instead what happens is a new ID is assigned to it.
You'll have to get the ID after the image has been copied and then replace it everywhere your image has been used.
I hope this helps; this is what I use to copy numbering styles.
Cheers

Error while deserializing to json an unzipped string

I create a zip file and copy into it a file that contains a serialized list of objects. The file is encoded in UTF-8. Then I unzip the file and try to deserialize it, but I get this error:
Unexpected character encountered while parsing value: . Path '', line 0, position 0
The problem does not occur if I use ASCII encoding instead of UTF-8, but I need to use UTF-8. So I am wondering whether the DotNetZip library lacks full support for UTF-8, or whether I am missing something else.
In order to reproduce the error:
Json library is the one at: http://json.codeplex.com/
The Zip library is the one at: http://dotnetzip.codeplex.com/
Create a simple class "Dog":
public class Dog
{
public string FirstName { get; set; }
public string LastName { get; set; }
}
Then use this code (the last line will cause the error):
var list = new List<Dog>();
list.Add(new Dog { FirstName = "Arasd", LastName = "1234123" });
list.Add(new Dog { FirstName = "fghfgh", LastName = "vbnvbn" });
var serialized = JsonConvert.SerializeObject(list, Formatting.Indented);
var zipFile = new ZipFile(@"C:\Users\daviko\Desktop\test.zip");
using (zipFile)
{
zipFile.CompressionLevel = Ionic.Zlib.CompressionLevel.BestCompression;
zipFile.UpdateEntry("dogs.txt", serialized, UTF8Encoding.UTF8);
zipFile.Save();
}
var readFromZipFile = string.Empty;
using (var input = new MemoryStream())
{
using (zipFile)
{
var entry = zipFile["dogs.txt"];
entry.Extract(input);
}
using (var output = new MemoryStream())
{
input.CopyTo(output);
readFromZipFile = new UTF8Encoding().GetString( input.ToArray());
}
}
var deserialized = JsonConvert.DeserializeObject<List<Dog>>(readFromZipFile);

The following code:
using (zipFile)
{
zipFile.CompressionLevel = Ionic.Zlib.CompressionLevel.BestCompression;
zipFile.UpdateEntry("dogs.txt", serialized, UTF8Encoding.UTF8);
zipFile.Save();
}
will dispose the zipFile when it finishes executing. So you must create the zipFile again before you try to read from it.
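Separately from the disposal issue, an "unexpected character" at position 0 can also come from a UTF-8 byte order mark at the start of the extracted bytes. Whether DotNetZip writes one here is an assumption, not something confirmed in this thread. The following sketch (illustrative class and method names, base class library only) shows that UTF8Encoding.GetString keeps the mark as an invisible U+FEFF character, while a StreamReader skips it.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomDecodingSketch
{
    // Builds the bytes of "[]" preceded by the UTF-8 byte order mark EF BB BF.
    public static byte[] SampleWithBom()
    {
        byte[] bom = { 0xEF, 0xBB, 0xBF };
        byte[] json = Encoding.UTF8.GetBytes("[]");
        byte[] all = new byte[bom.Length + json.Length];
        Buffer.BlockCopy(bom, 0, all, 0, bom.Length);
        Buffer.BlockCopy(json, 0, all, bom.Length, json.Length);
        return all;
    }

    // Decodes the raw bytes; a leading byte order mark survives as U+FEFF.
    public static string DecodeKeepingBom(byte[] bytes) =>
        new UTF8Encoding().GetString(bytes);

    // Decodes through a StreamReader, which skips a leading byte order mark.
    public static string DecodeSkippingBom(byte[] bytes)
    {
        using (var reader = new StreamReader(new MemoryStream(bytes), Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        byte[] bytes = SampleWithBom();
        Console.WriteLine(DecodeKeepingBom(bytes).Length);  // 3: U+FEFF plus "[]"
        Console.WriteLine(DecodeSkippingBom(bytes).Length); // 2: just "[]"
    }
}
```

If the extracted data turns out to start with such a mark, reading the MemoryStream through a StreamReader (after resetting its Position to 0) would be one way to get a string the JSON parser accepts.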
