count word pages from byte array - c#

sorry for my English
I have the contents of a word document in a byte array and I want to know how many pages it has.
I already did this with a pdf file using this code:
public void MssGetNumberOfPages(byte[] ssFileBinaryData, out int ssNumberOfPages) {
int pageCount;
MemoryStream stream = new MemoryStream(ssFileBinaryData);
using (var r = new StreamReader(stream))
{
string pdfText = r.ReadToEnd();
System.Text.RegularExpressions.Regex regx = new Regex(#"/Type\s*/Page[^s]");
System.Text.RegularExpressions.MatchCollection matches = regx.Matches(pdfText);
pageCount = matches.Count;
ssNumberOfPages = pageCount;
}
// TODO: Write implementation for action
}
How do I do something similar, with a word document?
In the pdf I simply have to search through the regex the text that matches this:
Regex(#"/Type\s*/Page[^s]")
What do I have to put in the regex to match the pages of the word document?

Well, I solved this myself by converting the word document into pdf with Aspose.dll
public void MssGet_Word_NumberOfPages(byte[] ssFileBinaryData, out int ssNumberOfPages) {
// Load Word Document from this byte array
Document loadedFromBytes = new Document(new MemoryStream(ssFileBinaryData));
// Save Word to PDF byte array
MemoryStream pdfStream = new MemoryStream();
loadedFromBytes.Save(pdfStream, SaveFormat.Pdf);
byte[] pdfBytes = pdfStream.ToArray();
int pageCount;
MemoryStream stream = new MemoryStream(pdfBytes);
using (var r = new StreamReader(stream))
{
string pdfText = r.ReadToEnd();
System.Text.RegularExpressions.Regex regx = new Regex(#"/Type\s*/Page[^s]");
System.Text.RegularExpressions.MatchCollection matches = regx.Matches(pdfText);
pageCount = matches.Count;
ssNumberOfPages = pageCount;
}
}

Can you perhaps elaborate on the tool(s) you used to convert the word doc to PDF?

Related

iTextSharp extraction cyrillic characters

In my project I need to read a PDF document. This pdf contains ukrainian & russian characters. the PDFReader read all characters in this pdf but the cirillic characters missing in output. I'm try to use encoding but it not helped. What can I do with this chars?
public static string GetText(string filePath)
{
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
StringBuilder text = new StringBuilder();
if (File.Exists(filePath)){
PdfReader pdfReader = new PdfReader(filePath);
for (int i = 1; i < pdfReader.NumberOfPages; i++)
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string thePage = PdfTextExtractor.GetTextFromPage(pdfReader, i, strategy);
text.Append(System.Environment.NewLine);
thePage = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(thePage)));
text.Append(thePage);
} pdfReader.Close();
} return text.ToString();
}
iTextSharp is an outdated product that is no longer supported, probably there are problems with text extraction. Here is a simple example of how the extraction text works in ITEXT 7 (the code is in java, but everything is the same for c#).
String filePath = "test.pdf";
StringBuilder text = new StringBuilder();
PdfReader pdfReader = new PdfReader(filePath);
PdfDocument pdfDocument = new PdfDocument(pdfReader);
for (int i = 1; i <= pdfDocument.getNumberOfPages(); i++) {
PdfPage page = pdfDocument.getPage(i);
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
String thePage = PdfTextExtractor.getTextFromPage(page, strategy);
text.append(thePage);
}
pdfReader.close();
System.out.print(text);
The code is about the same as in your example, but the text extracts

Maori Macrons aren't recognized when converting the letter to PDF using aspose word

I'm using below code to convert html to pdf using aspose word plugin but the Maori macrons aren't recognized and causing issue.
MemoryStream stream = new MemoryStream(data);
// load HTML file
Aspose.Words.LoadOptions loadOptions = new Aspose.Words.LoadOptions();
loadOptions.LoadFormat = LoadFormat.Html;
Aspose.Words.Document doc = new Aspose.Words.Document(stream, loadOptions);
foreach (Aspose.Words.Section section in doc)
{
section.PageSetup.PaperSize = PaperSize.Letter;
section.PageSetup.RightMargin = 20;
section.PageSetup.LeftMargin = 10;
section.PageSetup.TopMargin = 10;
section.PageSetup.BottomMargin = 10;
}
doc.Save(strPath, SaveFormat.Pdf);
Sample text before converting:
What we get in Pdf after conversion.

PDF text replace not working

I'm trying to replace text in PDF using iTextSharp dll but its not working in all cases. PDF document doesn't have acro fields.
If the text which I need to replace is bigger than original text its not printing all characters. Finding some special characters is also not working.
I have tried this code
using (PdfReader reader = new PdfReader(sourceFileName))
{
for (int i = 1; i <= reader.NumberOfPages; i++)
{
byte[] contentBytes = reader.GetPageContent(i);
string contentString = PdfEncodings.ConvertToString(contentBytes, PdfObject.TEXT_PDFDOCENCODING);
contentString = contentString.Replace("SOMETEXT", "NEWBIGGERTEXT");
reader.SetPageContent(i, PdfEncodings.ConvertToBytes(contentString, PdfObject.TEXT_PDFDOCENCODING));
}
new PdfStamper(reader, new FileStream(newFileName, FileMode.Create, FileAccess.Write)).Close();
}
Please let me know how this can be achieved.

Get substring from MemoryStream without converting entire stream to string

I would like to be able to efficiently get a substring from a MemoryStream (that originally comes from a xml file in a zip). Currently, I read the entire MemoryStream to a string and then search for the start and end tags of the xml node I desire. This works fine but the text file may be very large so I would like to avoid converting the entire MemoryStream into a string and instead just extract the desired section of xml text directly from the stream.
What is the best way to go about this?
string xmlText;
using (var zip = ZipFile.Read(zipFileName))
{
var ze = zip[zipPath];
using (var ms = new MemoryStream())
{
ze.Extract(ms);
ms.Position = 0;
using(var sr = new StreamReader(ms))
{
xmlText = sr.ReadToEnd();
}
}
}
string startTag = "<someTag>";
string endTag = "</someTag>";
int startIndex = xmlText.IndexOf(startTag, StringComparison.Ordinal);
int endIndex = xmlText.IndexOf(endTag, startIndex, StringComparison.Ordinal) + endTag.Length - 1;
xmlText = xmlText.Substring(startIndex, endIndex - startIndex + 1);
If your file is a valid xml file then you should be able to use a XmlReader to avoid loading the entire file into memory
string xmlText;
using (var zip = ZipFile.Read(zipFileName))
{
var ze = zip[zipPath];
using (var ms = new MemoryStream())
{
ze.Extract(ms);
ms.Position = 0;
using (var xml = XmlReader.Create(ms))
{
if(xml.ReadToFollowing("someTag"))
{
xmlText = xml.ReadInnerXml();
}
else
{
// <someTag> not found
}
}
}
}
You'll likely want to catch potential exceptions if the file is not valid xml.
Assuming that since it is xml it will have line breaks, it would probably be best to use StreamReader ReadLine and search for your tags in each line. (Also note put your StreamReader in a using as well.)
Something like
using (var ms = new MemoryStream())
{
ze.Extract(ms);
ms.Position = 0;
using (var sr = new StreamReader(ms))
{
bool adding = false;
string startTag = "<someTag>";
string endTag = "</someTag>";
StringBuilder text = new StringBuilder();
while (sr.Peek() >= 0)
{
string tmp = sr.ReadLine();
if (!adding && tmp.Contains(startTag))
{
adding = true;
}
if (adding)
{
text.Append(tmp);
}
if (tmp.Contains(endTag))
break;
}
xmlText = text.ToString();
}
}
This assumes that the start and end tags are on a line by themselves. If not, you could clean up the resulting text string by getting the index of start and end again like you originally did.

How to add multiple fpages to fixed document in xps?

My requirement is to create xps document which has 10 pages (say). I am using the following code to create a xps document. Please take a look.
// Create the new document
XpsDocument xd = new XpsDocument("D:\\9780545325653.xps", FileAccess.ReadWrite);
IXpsFixedDocumentSequenceWriter xdSW = xd.AddFixedDocumentSequence();
IXpsFixedDocumentWriter xdW = xdSW.AddFixedDocument();
IXpsFixedPageWriter xpW = xdW.AddFixedPage();
fontURI = AddFontResourceToFixedPage(xpW, #"D:\arial.ttf");
image = AddJpegImageResourceToFixedPage(xpW, #"D:\Single content\20_1.jpg");
StringBuilder pageContents = new StringBuilder();
pageContents.Append(ReadFile(#"D:\Single content\20.fpage\20.fpage", i));
xmlWriter = xpW.XmlWriter;
xmlWriter.WriteRaw(pageContents.ToString());
}
xmlWriter.Close();
xpW.Commit();
// Commit the fixed document
xdW.Commit();
// Commite the fixed document sequence writer
xdSW.Commit();
// Commit the XPS document itself
xd.Close();
}
private static string AddFontResourceToFixedPage(IXpsFixedPageWriter pageWriter, String fontFileName)
{
string fontUri = "";
using (XpsFont font = pageWriter.AddFont(false))
{
using (Stream dstFontStream = font.GetStream())
using (Stream srcFontStream = File.OpenRead(fontFileName))
{
CopyStream(srcFontStream, dstFontStream);
// commit font resource to the package file
font.Commit();
}
fontUri = font.Uri.ToString();
}
return fontUri;
}
private static Int32 CopyStream(Stream srcStream, Stream dstStream)
{
const int size = 64 * 1024; // copy using 64K buffers
byte[] localBuffer = new byte[size];
int bytesRead;
Int32 bytesMoved = 0;
// reset stream pointers
srcStream.Seek(0, SeekOrigin.Begin);
dstStream.Seek(0, SeekOrigin.Begin);
// stream position is advanced automatically by stream object
while ((bytesRead = srcStream.Read(localBuffer, 0, size)) > 0)
{
dstStream.Write(localBuffer, 0, bytesRead);
bytesMoved += bytesRead;
}
return bytesMoved;
}
private static string ReadFile(string filePath,int i)
{
FileStream fs = new FileStream(filePath, FileMode.Open, FileAccess.ReadWrite);
StringBuilder sb = new StringBuilder();
using (StreamReader sr = new StreamReader(fs))
{
String line;
// Read and display lines from the file until the end of
// the file is reached.
while ((line = sr.ReadLine()) != null)
{
sb.AppendLine(line);
}
}
string allines = sb.ToString();
//allines = allines.Replace("FontUri=\"/Resources/f7728e4c-2606-4fcb-b963-d2d3f52b013b.odttf\"", "FontUri=\"" + fontURI + "\" ");
//XmlReader xmlReader = XmlReader.Create(fs, new XmlReaderSettings() { IgnoreComments = true });
XMLSerializer serializer = new XMLSerializer();
FixedPage fp = (FixedPage)serializer.DeSerialize(allines, typeof(FixedPage));
foreach (Glyphs glyph in fp.lstGlyphs)
{
glyph.FontUri = fontURI;
}
fp.Path.PathFill.ImageBrush.ImageSource = image;
fs.Close();
string fpageString = serializer.Serialize(fp);
return fpageString;
}
private static string AddJpegImageResourceToFixedPage(IXpsFixedPageWriter pageWriter, String imgFileName)
{
XpsImage image = pageWriter.AddImage("image/jpeg");
using (Stream dstImageStream = image.GetStream())
using (Stream srcImageStream = File.OpenRead(imgFileName))
{
CopyStream(srcImageStream, dstImageStream); // commit image resource to the package file
//image.Commit();
}
return image.Uri.ToString();
}
If you see it, i would have passed single image and single fpage to create a xps document. I want to pass multiple fpages list and image list to create a xps document which has multiple pages..?
You are doing this in the most excruciatingly difficult manner possible. I'd suggest taking the lazy man's route.
Realize that an XpsDocument is just a wrapper on a FixedDocumentSequence, which contains zero or more FixedDocuments, which contains zero or more FixedPages. All these types can be created, manipulated and combined without writing XML.
All you really need to do is create a FixedPage with whatever content on it you need. Here's an example:
static FixedPage CreateFixedPage(Uri imageSource)
{
FixedPage fp = new FixedPage();
fp.Width = 320;
fp.Height = 240;
Grid g = new Grid();
g.HorizontalAlignment = System.Windows.HorizontalAlignment.Center;
g.VerticalAlignment = System.Windows.VerticalAlignment.Center;
fp.Children.Add(g);
Image img = new Image
{
UriSource = imageSource,
};
g.Children.Add(image);
return fp;
}
This is all WPF. I'm creating a FixedPage that has as its root a Grid, which contains an Image that is loaded from the given Uri. The image will be stretched to fill the available space of the Grid. Or, you could do whatever you want. Create a template as a UserControl, send it text to place within itself, whatever.
Next, you just need to add a bunch of fixed pages to an XpsDocument. It's incredibly hard, so read carefully:
public void WriteAllPages(XpsDocument document, IEnumerable<FixedPage> pages)
{
var writer = XpsDocument.CreateXpsDocumentWriter(document);
foreach(var page in pages)
writer.Write(page);
}
And that's all you need to do. Create your pages, add them to your document. Done.

Categories