Powerpoint to Text C# - Microsoft.Interop - c#

I have been trying to read .ppt files for the last three days. I searched the internet and found various code snippets, but none of them worked properly. The code below never prints "Check4": an exception is thrown inside the foreach loops. Please advise.
public static void ppt2txt (String source)
{
    string fileName = System.IO.Path.GetFileNameWithoutExtension(source);
    string filePath = System.IO.Path.GetDirectoryName(source);
    Console.Write("Check1");
    Application pa = new Microsoft.Office.Interop.PowerPoint.ApplicationClass ();
    Microsoft.Office.Interop.PowerPoint.Presentation pp = pa.Presentations.Open (source,
        Microsoft.Office.Core.MsoTriState.msoTrue,
        Microsoft.Office.Core.MsoTriState.msoFalse,
        Microsoft.Office.Core.MsoTriState.msoFalse);
    Console.Write("Check2");
    String pps = "";
    Console.Write("Check3");
    foreach (Microsoft.Office.Interop.PowerPoint.Slide slide in pp.Slides)
    {
        foreach (Microsoft.Office.Interop.PowerPoint.Shape shape in slide.Shapes)
            pps += shape.TextFrame.TextRange.Text.ToString ();
    }
    Console.Write("Check4");
    Console.WriteLine(pps);
}
The exception thrown is:
System.ArgumentException: The specified value is out of range.
at Microsoft.Office.Interop.PowerPoint.TextFrame.get_TextRange()
at KareneParser.Program.ppt2txt(String source) in c:\Users\Shahmeer\Desktop\New folder (2)\KareneParser\Program.cs:line 323
at KareneParser.Program.Main(String[] args) in c:\Users\Shahmeer\Desktop\New folder (2)\KareneParser\Program.cs:line 150
Line 323, where the exception is thrown:
pps += shape.TextFrame.TextRange.Text.ToString ();
Thanks in advance.

It looks like you need to check your shape objects to see if they have a TextFrame and Text present.
In your nested foreach loop try this:
foreach (Microsoft.Office.Interop.PowerPoint.Slide slide in pp.Slides)
{
    foreach (Microsoft.Office.Interop.PowerPoint.Shape shape in slide.Shapes)
    {
        if(shape.HasTextFrame == Microsoft.Office.Core.MsoTriState.msoTrue)
        {
            var textFrame = shape.TextFrame;
            if(textFrame.HasText == Microsoft.Office.Core.MsoTriState.msoTrue)
            {
                var textRange = textFrame.TextRange;
                pps += textRange.Text.ToString ();
            }
        }
    }
}
This is untested on my part, but it looks as though your foreach loops try to read text from shapes in the PowerPoint document that have none, hence the out-of-range exception. The added checks make sure text is appended to your pps string only when the shape has a text frame and that frame contains text.

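Not all shapes have text. Lines, for example, are also shapes.
Check HasTextFrame and HasText first. Both return an MsoTriState rather than a bool, so compare against msoTrue:
foreach (Microsoft.Office.Interop.PowerPoint.Shape shape in slide.Shapes)
{
    if (shape.HasTextFrame == Microsoft.Office.Core.MsoTriState.msoTrue
        && shape.TextFrame.HasText == Microsoft.Office.Core.MsoTriState.msoTrue)
    {
        pps += shape.TextFrame.TextRange.Text;
    }
}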
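For reference, the checks from the answers above can be folded into one helper that also closes the presentation and quits PowerPoint when it is done. This is only a sketch and is untested here: the class name PptText and the method name Extract are made up for illustration, it assumes references to the Microsoft.Office.Interop.PowerPoint and office interop assemblies, and it puts each shape's text on its own line rather than concatenating directly as the question does.

```csharp
using System.Text;
using Microsoft.Office.Core;
using PowerPoint = Microsoft.Office.Interop.PowerPoint;

public static class PptText
{
    // Hypothetical helper: collects the text of every shape that actually has text.
    public static string Extract(string source)
    {
        var app = new PowerPoint.Application();
        PowerPoint.Presentation pres = null;
        var sb = new StringBuilder();
        try
        {
            // Same arguments as in the question: read-only, not untitled, no window.
            pres = app.Presentations.Open(source,
                MsoTriState.msoTrue, MsoTriState.msoFalse, MsoTriState.msoFalse);

            foreach (PowerPoint.Slide slide in pres.Slides)
            {
                foreach (PowerPoint.Shape shape in slide.Shapes)
                {
                    // Lines, pictures etc. have no text frame; skip them.
                    if (shape.HasTextFrame != MsoTriState.msoTrue) continue;
                    // A text frame can still be empty; skip those too.
                    if (shape.TextFrame.HasText != MsoTriState.msoTrue) continue;
                    sb.AppendLine(shape.TextFrame.TextRange.Text);
                }
            }
        }
        finally
        {
            // Close the file and shut PowerPoint down even if reading failed.
            if (pres != null) pres.Close();
            app.Quit();
        }
        return sb.ToString();
    }
}
```

Calling Console.WriteLine(PptText.Extract(source)) would then stand in for the body of ppt2txt. A StringBuilder is used because repeated += on a string copies the whole text on every shape.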

Related

Aspose PDF - get text from page that has a matching string

I'm working with an existing library - the goal of the library is to pull text out of PDFs to verify against expected values to quality check recorded data vs data in pdf.
I'm looking for a way to succinctly pull a specific page worth of text given a string that should only fall on that specific page.
var pdfDocument = new Document(file.PdfFilePath);
var textAbsorber = new TextAbsorber{
ExtractionOptions = {
FormattingMode = TextExtractionOptions.TextFormattingMode.Pure
}
};
pdfDocument.Pages.Accept(textAbsorber);
foreach (var page in pdfDocument.Pages)
{
}
I'm stuck inside the foreach(var page in pdfDocument.Pages) portion... or is that the right area to be looking?
Answer: recreate the TextAbsorber for each page, inside the foreach loop.
If the absorber isn't recreated, it keeps the text from previous pages.
public List<string> ProcessPage(MyInfoClass file, string find)
{
    var pdfDocument = new Document(file.PdfFilePath);
    foreach (Page page in pdfDocument.Pages)
    {
        var textAbsorber = new TextAbsorber {
            ExtractionOptions = {
                FormattingMode = TextExtractionOptions.TextFormattingMode.Pure
            }
        };
        page.Accept(textAbsorber);
        var ext = textAbsorber.Text;
        var exts = ext.Replace("\n", "").Split('\r').ToList();
        if (ext.Contains(find))
            return exts;
    }
    return null;
}

OpenXML : replacing text in multiples runs

I have a word document with fields that I need to change (see below) but for a reason that I don't understand, my modification is not saved during the process.
I'm using the OpenXML .NET SDK in C#.
Code :
using (WordprocessingDocument myDoc = WordprocessingDocument.Open(destinationFile, true))
{
var body = myDoc.MainDocumentPart.Document.Body;
foreach (var headerParts in myDoc.MainDocumentPart.HeaderParts)
{
foreach (var Para in headerParts.Header.Descendants<DocumentFormat.OpenXml.Wordprocessing.Paragraph>())
{
foreach (var run in Para.Descendants<DocumentFormat.OpenXml.Wordprocessing.Run>())
{
foreach (var text in run.Descendants<DocumentFormat.OpenXml.Wordprocessing.Text>())
{
text.Text = text.Text.Replace("Nom", cv.firstName);
text.Text = text.Text.Replace("Prenom", cv.secondName);
text.Text = text.Text.Replace("NbAnnee", cv.nbAnneeExp.ToString());
text.Text = text.Text.Replace("Objet", cv.objet);
}
}
}
}
myDoc.MainDocumentPart.Document.Save();
}
I don't know where I went wrong; I followed many of the examples available on SO.
Does anyone have an idea?
I found the answer myself: my document contains content controls, and because they are not editable, the changes could not be saved during the process.
Moral: this method works... only if you don't use content controls :D

C# Checking if a certain shape exists in Powerpoint

I am trying to find out, if a certain Shape exists in a Powerpoint presentation. I am new to C# and not sure how to cycle through all the shapes. I tried through a foreach loop but got nowhere. Here is what I got:
using pptNS = Microsoft.Office.Interop.PowerPoint;
...
pptNS.Slide pptSlide = null;
bool shapeCheck = false;
pptNS.Presentation pptPresentation = null;
try
{
// Create an instance of PowerPoint.
powerpointApplication = new pptNS.ApplicationClass();
pptPresentation = powerpointApplication.Presentations.Open([pptAddress]);
foreach (pptNS.Shapes sh in pptSlide.Shapes)
{
if (sh.Title.Equals("SlideID"))
{
shapeCheck = true;
}
}
}
catch (Exception ex)
But obviously this throws a System.InvalidCastException. Does somebody know what I should use instead of pptSlide.Shapes in the foreach loop? Or another method to check if a certain shape exists?
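I think you should change this:
foreach (pptNS.Shapes sh in pptSlide.Shapes)
to this:
foreach (pptNS.Shape sh in pptSlide.Shapes)
Shapes is the type of the collection; each element in it is a Shape, which is why the cast fails. Naming the element type explicitly (rather than using var) also keeps sh.Title available, because the collection is non-generic and var would infer object.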
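If the check has to cover the whole presentation, it could be sketched as a small helper like the one below. This is untested here, and the names ShapeFinder and ShapeExists are made up for illustration. It matches on Shape.Name as an assumption; swap in sh.Title if the identifier is stored in the shape's title, as in the question.

```csharp
using System.Linq;
using pptNS = Microsoft.Office.Interop.PowerPoint;

public static class ShapeFinder
{
    // Hypothetical helper: true if any shape on any slide has the given Name.
    public static bool ShapeExists(pptNS.Presentation presentation, string shapeName)
    {
        foreach (pptNS.Slide slide in presentation.Slides)
        {
            // Shapes is a non-generic collection, so Cast<T>() is needed before LINQ.
            if (slide.Shapes.Cast<pptNS.Shape>().Any(sh => sh.Name == shapeName))
            {
                return true;
            }
        }
        return false;
    }
}
```

With the variables from the question this would be called as shapeCheck = ShapeFinder.ShapeExists(pptPresentation, "SlideID");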

Getting bibliographic data from text in a PDF and exporting to a window form

I use iText5 for .NET to extract text from a PDF, using the code below.
private void button1_Click(object sender, EventArgs e)
{
PdfReader reader2 = new PdfReader("Scharfetter1969.pdf");
int pagen = reader2.NumberOfPages;
reader2.Close();
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy();
for (int i = 1; i < 2; i++)
{
textBox1.Text = "";
PdfReader reader = new PdfReader("Scharfetter1969.pdf");
String s = PdfTextExtractor.GetTextFromPage(reader, i, its);
s = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
textBox1.Text = s;
reader.Close();
}
}
But I want to get bibliographic data from a research paper PDF.
Here is an example of the data extracted from this PDF (in EndNote format). Here's a link!
%0 Journal Article
%T Repeated temperature modulation epitaxy for p-type doping and light-emitting diode based on ZnO
%A Tsukazaki, A.
%A Ohtomo, A.
%A Onuma, T.
%A Ohtani, M.
%A Makino, T.
%A Sumiya, M.
%A Ohtani, K.
%A Chichibu, S.F.
%A Fuke, S.
%A Segawa, Y.
%J Nature Materials
%V 4
%N 1
%P 42-46
%# 1476-1122
%D 2004
%I Nature Publishing Group
Note that this bibliographic information is not available in the metadata of the PDF. I want to access the article type (%0), title (%T), authors (%A), date (%D) and publisher (%I) and show each one in its own text box on a Windows form.
I am using C#. If anyone has code for this, or can guide me on how to do it, please share.
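As an aside on the display half of the problem: once a record is available as text in the tagged format shown above, splitting it into fields is straightforward. The sketch below only illustrates that step and does not extract anything from a PDF. The class name EndNoteParser and the one-character-tag assumption are mine, not part of any library.

```csharp
using System;
using System.Collections.Generic;

public static class EndNoteParser
{
    // Maps each tag ("%T", "%A", ...) to the list of values found for it.
    // A list is used because tags such as %A can repeat.
    public static Dictionary<string, List<string>> Parse(string record)
    {
        var fields = new Dictionary<string, List<string>>();
        var lines = record.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);

        foreach (var raw in lines)
        {
            var line = raw.Trim();
            // Assumed line shape: '%', one tag character, a space, then the value.
            if (line.Length < 3 || line[0] != '%' || line[2] != ' ') continue;

            var tag = line.Substring(0, 2);
            var value = line.Substring(3).Trim();

            List<string> values;
            if (!fields.TryGetValue(tag, out values))
            {
                values = new List<string>();
                fields[tag] = values;
            }
            values.Add(value);
        }
        return fields;
    }
}
```

On the form, the result could then be assigned with something like titleBox.Text = string.Join(" ", fields["%T"]) and authorsBox.Text = string.Join("; ", fields["%A"]), where titleBox and authorsBox are placeholder control names.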
PDF is a one-way format. You put data in so that it renders consistently on all devices (monitors, printers, etc) but the format was never intended to pull data back out. Any and all attempts to do that will be pure guess work. iText's PdfTextExtractor works but you are going to have to piece things together based on your own arbitrary set of rules, and these rules will probably change from PDF to PDF. The supplied PDF was created by InDesign which does such a great job of making text look good that it actually makes it even harder to parse the data back out.
That said, if your PDFs are all visually consistent, you could try to pull the data out while retaining formatting and use the formatting rules to guess what is what. That post will get you some HTML formatting that you could guess at. (If this actually works I'd recommend returning something more specific than HTML but I'll leave that up to you.)
Running it against your supplied PDF shows that the title is using the font HelveticaNeue-LightExt at about 17pts so you could write a rule to look for all lines that use that font at that size and combine them together. Authors are done in HelveticaNeue-Condensed at about 10pts so that's another rule.
The code below is a modified version of the one linked to above. It's a fully working C# 2010 WinForms app targeting iTextSharp 5.1.1.0. It pulls out the title and authors for the supplied PDF, but you'll need to tweak it for other PDFs and metadata. See the comments in the code for specific implementation details.
using System;
using System.Collections.Generic;
using System.Text;
using System.Windows.Forms;
using iTextSharp.text.pdf.parser;
using iTextSharp.text.pdf;
namespace WindowsFormsApplication1
{
public partial class Form1 : Form
{
public Form1()
{
InitializeComponent();
}
private void Form1_Load(object sender, EventArgs e)
{
PdfReader reader = new PdfReader(System.IO.Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "nmat4-42.pdf"));
TextWithFontExtractionStategy S = new TextWithFontExtractionStategy();
string F = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, 1, S);
//Buffers to hold various parts from the PDF
List<string> titles = new List<string>();
List<string> authors = new List<string>();
//Array of lines of text
string[] lines = F.Split(new string[] { Environment.NewLine }, StringSplitOptions.None);
//Temporary string
string t;
//Loop through each line in the array
foreach (string line in lines)
{
//See if the line looks like a "title"
if (line.Contains("HelveticaNeue-LightExt") && line.Contains("font-size:17.28003"))
{
//Remove the HTML tags
titles.Add(System.Text.RegularExpressions.Regex.Replace(line, "</?span.*?>", "").Trim());
}
//See if the line looks like an "author"
else if (line.Contains("HelveticaNeue-Condensed") && line.Contains("font-size:9.995972"))
{
//Remove the HTML tags and trim extra characters
t = System.Text.RegularExpressions.Regex.Replace(line, "</?span.*?>", "").Trim(new char[] { ' ', ',', '*' });
//Make sure we have a valid name, probably need some more exceptions here, too
if (!string.IsNullOrWhiteSpace(t) && t != "AND")
{
authors.Add(t);
}
}
}
//Write out the title to the console
Console.WriteLine("Title : {0}", string.Join(" ", titles.ToArray()));
//Write out each author
foreach (string author in authors)
{
Console.WriteLine("Author : {0}", author);
}
Console.WriteLine(F);
this.Close();
}
public class TextWithFontExtractionStategy : iTextSharp.text.pdf.parser.ITextExtractionStrategy
{
//HTML buffer
private StringBuilder result = new StringBuilder();
//Store last used properties
private Vector lastBaseLine;
private string lastFont;
private float lastFontSize;
//http://api.itextpdf.com/itext/com/itextpdf/text/pdf/parser/TextRenderInfo.html
private enum TextRenderMode
{
FillText = 0,
StrokeText = 1,
FillThenStrokeText = 2,
Invisible = 3,
FillTextAndAddToPathForClipping = 4,
StrokeTextAndAddToPathForClipping = 5,
FillThenStrokeTextAndAddToPathForClipping = 6,
AddTextToPathForClipping = 7
}
public void RenderText(iTextSharp.text.pdf.parser.TextRenderInfo renderInfo)
{
string curFont = renderInfo.GetFont().PostscriptFontName;
//Check if faux bold is used
if ((renderInfo.GetTextRenderMode() == (int)TextRenderMode.FillThenStrokeText))
{
curFont += "-Bold";
}
//This code assumes that if the baseline changes then we're on a newline
Vector curBaseline = renderInfo.GetBaseline().GetStartPoint();
Vector topRight = renderInfo.GetAscentLine().GetEndPoint();
iTextSharp.text.Rectangle rect = new iTextSharp.text.Rectangle(curBaseline[Vector.I1], curBaseline[Vector.I2], topRight[Vector.I1], topRight[Vector.I2]);
Single curFontSize = rect.Height;
//See if something has changed, either the baseline, the font or the font size
if ((this.lastBaseLine == null) || (curBaseline[Vector.I2] != lastBaseLine[Vector.I2]) || (curFontSize != lastFontSize) || (curFont != lastFont))
{
//if we've put down at least one span tag close it
if ((this.lastBaseLine != null))
{
this.result.AppendLine("</span>");
}
//If the baseline has changed then insert a line break
if ((this.lastBaseLine != null) && curBaseline[Vector.I2] != lastBaseLine[Vector.I2])
{
this.result.AppendLine("<br />");
}
//Create an HTML tag with appropriate styles
this.result.AppendFormat("<span style=\"font-family:{0};font-size:{1}\">", curFont, curFontSize);
}
//Append the current text
this.result.Append(renderInfo.GetText());
//Set currently used properties
this.lastBaseLine = curBaseline;
this.lastFontSize = curFontSize;
this.lastFont = curFont;
}
public string GetResultantText()
{
//If we wrote anything then we'll always have a missing closing tag so close it here
if (result.Length > 0)
{
result.Append("</span>");
}
return result.ToString();
}
//Not needed
public void BeginTextBlock() { }
public void EndTextBlock() { }
public void RenderImage(ImageRenderInfo renderInfo) { }
}
}
}

Microsoft Word Document Controls not accepting carriage returns

So, I have a Microsoft Word 2007 Document with several Plain Text Format (I have tried Rich Text Format as well) controls which accept input via XML.
For carriage returns, I had the string being passed through XML containing "\r\n" when I wanted a carriage return, but the word document ignored that and just kept wrapping things on the same line. I also tried replacing the \r\n with System.Environment.NewLine in my C# mapper, but that just put in \r\n anyway, which still didn't work.
Note also that on the control itself I have set it to "Allow Carriage Returns (Multiple Paragraphs)" in the control properties.
This is the XML for the listMapper
<Field id="32" name="32" fieldType="SimpleText">
<DataSelector path="/Data/DB/DebtProduct">
<InputField fieldType=""
path="/Data/DB/Client/strClientFirm"
link="" type=""/>
<InputField fieldType=""
path="strClientRefDebt"
link="" type=""/>
</DataSelector>
<DataMapper formatString="{0} Account Number: {1}"
name="SimpleListMapper" type="">
<MapperData></MapperData>
</DataMapper>
</Field>
Note that this is the listMapper C# where I actually map the list (notice where I try and append the system.environment.newline)
namespace DocEngine.Core.DataMappers
{
public class CSimpleListMapper:CBaseDataMapper
{
public override void Fill(DocEngine.Core.Interfaces.Document.IControl control, CDataSelector dataSelector)
{
if (control != null && dataSelector != null)
{
ISimpleTextControl textControl = (ISimpleTextControl)control;
IContent content = textControl.CreateContent();
CInputFieldCollection fileds = dataSelector.Read(Context);
StringBuilder builder = new StringBuilder();
if (fileds != null)
{
foreach (List<string> lst in fileds)
{
if (CanMap(lst) == false) continue;
if (builder.Length > 0 && lst[0].Length > 0)
builder.Append(Environment.NewLine);
if (string.IsNullOrEmpty(FormatString))
builder.Append(lst[0]);
else
builder.Append(string.Format(FormatString, lst.ToArray()));
}
content.Value = builder.ToString();
textControl.Content = content;
applyRules(control, null);
}
}
}
}
}
Does anybody have any clue at all how I can get MS Word 2007 (docx) to quit ignoring my newline characters??
Use a function like this:
private static Run InsertFormatRun(Run run, string[] formatText)
{
    foreach (string text in formatText)
    {
        run.AppendChild(new Text(text));
        // A Break is a child of the Run itself, not of its RunProperties.
        run.AppendChild(new Break());
    }
    return run;
}
None of the above answers helped me.
However, I found that the InsertAfter method swaps the \n in the original XML string for \v, and when this is passed into the content control it renders correctly.
contentControl.MultiLine = true;
contentControl.Range.InsertAfter(yourString); // yourString: the text containing \n
I had the same problem, but in a table cell.
I had a single string containing carriage returns (multiple lines) in a Text object that was appended to a paragraph, which was in turn appended to a table cell.
=> The carriage returns were ignored by Word.
The solution was simple:
Create one paragraph per line and add all of these paragraphs to the table cell.
I think this works:
using (WordprocessingDocument _docx = WordprocessingDocument.Create("c:\\Test.docx", WordprocessingDocumentType.Document))
{
    // A newly created package has no main document part yet, so add one.
    MainDocumentPart _part = _docx.AddMainDocumentPart();
    Body _body = new Body();
    _part.Document = new Document(_body);
    string _str = "abc\ndef\ngeh";
    string[] _strArr = _str.Split('\n');
    // Add one paragraph per line to the single document body.
    foreach (string _line in _strArr)
    {
        _body.Append(NewText(_line));
    }
    _part.Document.Save();
}
// Wraps a single line of text in a Run inside a Paragraph.
static Paragraph NewText(string _text)
{
    Paragraph _head = new Paragraph();
    Run _run = new Run();
    Text _line = new Text(_text);
    _run.Append(_line);
    _head.Append(_run);
    return _head;
}
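One detail with the one-paragraph-per-line approach: the string in the question is built with Environment.NewLine, so it contains "\r\n" rather than a bare "\n", and splitting on '\n' alone would leave a stray '\r' at the end of each line. A small helper along these lines (the names LineSplitter and SplitLines are made up for illustration) handles all three conventions:

```csharp
using System;

public static class LineSplitter
{
    // Splits on "\r\n", "\n" or "\r". "\r\n" is listed first so that a
    // Windows line ending counts as one break instead of two.
    public static string[] SplitLines(string text)
    {
        return text.Split(new[] { "\r\n", "\n", "\r" }, StringSplitOptions.None);
    }
}
```

Each element of the returned array can then be passed to NewText. Empty entries are kept on purpose so that blank lines in the input become empty paragraphs.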
