3rd party Pdf library significantly slower when running NUnit - c#

I am evaluating Winnovative's PdfToText library and have run into something that concerns me.
Everything runs fine and I am able to extract the text content from a small 20k or less pdf immediately if I am running a console application. However, if I call the same code from the NUnit gui running it takes 15-25 seconds (I've verified it's PdfToText by putting a breakpoint on the line that extracts the text and hitting F10 to see how long it takes to advance to the next line).
This concerns me because I'm not sure where to lay blame since I don't know the cause. Is there a problem with NUnit or PdfToText? All I want to do is extract the text from a pdf, but 20 seconds is completely unreasonable if I'm going to see this behavior under certain conditions. If it's just when running NUnit, that's acceptable, but otherwise I'll have to look elsewhere.
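A more repeatable way to take that measurement than stepping with F10 is to wrap the call in a Stopwatch and compare the console and NUnit runs. The following is only a minimal sketch: `Timing`, `Measure`, and the placeholder lambda are hypothetical names, not part of PdfToText or NUnit, and the real extraction call would replace the lambda.

```csharp
using System;
using System.Diagnostics;

public static class Timing
{
    // Runs the given function once and returns how long it took.
    public static TimeSpan Measure(Func<string> extract, out string result)
    {
        Stopwatch watch = Stopwatch.StartNew();
        result = extract();
        watch.Stop();
        return watch.Elapsed;
    }

    public static void Main()
    {
        string text;
        // Placeholder workload (assumption); substitute the real extraction call here.
        TimeSpan elapsed = Measure(() => "sample text", out text);
        Console.WriteLine("Extracted {0} chars in {1} ms", text.Length, elapsed.TotalMilliseconds);
    }
}
```

Logging the elapsed time from both hosts gives numbers that do not depend on debugger overhead.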
It's easier to demonstrate the problem using a complete VS Solution (2010), so here's a link that makes it easy to set up and run (no need to download NUnit or PdfToText or even a sample pdf):
http://dl.dropbox.com/u/273037/PdfToTextProblem.zip (You may have to change the reference to PdfToText to use the x86 dll if you're running on a 32-bit machine).
Just hit F5 and the NUnit Gui runner will load.
I'm not tied to this library, so I'm open to suggestions. I've tried iTextSharp (way too expensive for 2 lines of code) and looked at Aspose (I didn't try it, but the SaaS license is $11k). They either lack the required functionality or are way too expensive.

(comment turned into answer)
How complex are your PDFs? The 4.1.6 version of iText allows for a closed-source solution. Although 4.1.6 doesn't directly have a text extractor, it isn't terribly hard to write one using the PdfReader and GetPageContent().

Below is the code I used to extract the text from the PDF using iTextSharp v4.1.6. If it seems overly verbose, it's related to how I'm using it and the flexibility required.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using iTextSharp.text.pdf;
namespace ClassLibrary1
{
public class PdfToken
{
private PdfToken(int type, string value)
{
Type = type;
Value = value;
}
public static PdfToken Create(PRTokeniser tokenizer)
{
return new PdfToken(tokenizer.TokenType, tokenizer.StringValue);
}
public int Type { get; private set; }
public string Value { get; private set; }
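// True for content-stream operators such as TJ or Tm, which are tokenized as TK_OTHER.
// Despite the property name, these are operators; the tokens before them are their operands.
public bool IsOperand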
{
get
{
return Type == PRTokeniser.TK_OTHER;
}
}
}
public class PdfOperation
{
public PdfOperation(PdfToken operationToken, IEnumerable<PdfToken> arguments)
{
Name = operationToken.Value;
Arguments = arguments;
}
public string Name { get; private set; }
public IEnumerable<PdfToken> Arguments { get; private set; }
}
public interface IPdfParsingStrategy
{
void Execute(PdfOperation op);
}
public class PlainTextParsingStrategy : IPdfParsingStrategy
{
StringBuilder text = new StringBuilder();
public PlainTextParsingStrategy()
{
}
public String GetText()
{
return text.ToString();
}
#region IPdfParsingStrategy Members
public void Execute(PdfOperation op)
{
// see Adobe PDF specs for additional operations
switch (op.Name)
{
case "TJ":
PrintText(op);
break;
case "Tm":
SetMatrix(op);
break;
case "Tf":
SetFont(op);
break;
case "S":
PrintSection(op);
break;
case "G":
case "g":
case "rg":
SetColor(op);
break;
}
}
#endregion
bool newSection = false;
private void PrintSection(PdfOperation op)
{
text.AppendLine("------------------------------------------------------------");
newSection = true;
}
private void PrintNewline(PdfOperation op)
{
text.AppendLine();
}
private void PrintText(PdfOperation op)
{
if (newSection)
{
newSection = false;
StringBuilder header = new StringBuilder();
PrintText(op, header);
}
PrintText(op, text);
}
private static void PrintText(PdfOperation op, StringBuilder text)
{
foreach (PdfToken t in op.Arguments)
{
switch (t.Type)
{
case PRTokeniser.TK_STRING:
text.Append(t.Value);
break;
case PRTokeniser.TK_NUMBER:
text.Append(" ");
break;
}
}
}
String lastFont = String.Empty;
String lastFontSize = String.Empty;
private void SetFont(PdfOperation op)
{
var args = op.Arguments.ToList();
string font = args[0].Value;
string size = args[1].Value;
//if (font != lastFont || size != lastFontSize)
// text.AppendLine();
lastFont = font;
lastFontSize = size;
}
String lastX = String.Empty;
String lastY = String.Empty;
private void SetMatrix(PdfOperation op)
{
var args = op.Arguments.ToList();
string x = args[4].Value;
string y = args[5].Value;
if (lastY != y)
text.AppendLine();
else if (lastX != x)
text.Append(" ");
lastX = x;
lastY = y;
}
String lastColor = String.Empty;
private void SetColor(PdfOperation op)
{
lastColor = PrintCommand(op).Replace(" ", "_");
}
private static string PrintCommand(PdfOperation op)
{
StringBuilder text = new StringBuilder();
foreach (PdfToken t in op.Arguments)
text.AppendFormat("{0} ", t.Value);
text.Append(op.Name);
return text.ToString();
}
}
}
And here's how I call it:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using iTextSharp.text.pdf;
namespace ClassLibrary1
{
public class PdfExtractor
{
public static string GetText(byte[] pdfBuffer)
{
PlainTextParsingStrategy strategy = new PlainTextParsingStrategy();
ParsePdf(pdfBuffer, strategy);
return strategy.GetText();
}
private static void ParsePdf(byte[] pdf, IPdfParsingStrategy strategy)
{
PdfReader reader = new PdfReader(pdf);
for (int i = 1; i <= reader.NumberOfPages; i++)
{
byte[] page = reader.GetPageContent(i);
if (page != null)
{
PRTokeniser tokenizer = new PRTokeniser(page);
List<PdfToken> parameters = new List<PdfToken>();
while (tokenizer.NextToken())
{
var token = PdfToken.Create(tokenizer);
if (token.IsOperand)
{
strategy.Execute(new PdfOperation(token, parameters));
parameters.Clear();
}
else
{
parameters.Add(token);
}
}
}
}
}
}
}

Related

Roslyn - get grouped single line comments

I am writing a program in C# for extracting comments from code. I am using Roslyn compiler to do that. It's great, because I am just visiting the whole abstract syntax tree and fetching SingleLineComment trivia, MultiLineComment trivia and DocumentationComment trivia syntax from the file in solution. But there is a problem because programmers often write comments like that:
// General Information about an assembly is controlled through the following
// set of attributes. Change these attribute values to modify the information
// associated with an assembly.
You can see that these are three single-line comments, but I want them to be fetched from the code as one comment. Can I achieve that with Roslyn, or is there another way? This is a frequent situation, because programmers often write multi-line comments using single-line comment syntax.
My code for extracting comments looks like this:
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;
using System.Collections.Generic;
namespace RoslynPlay
{
public class CommentStore
{
public List<Comment> Comments { get; } = new List<Comment>();
public void AddCommentTrivia(SyntaxTrivia trivia,
LocationStore commentLocationstore, string fileName)
{
if (trivia.Kind() == SyntaxKind.SingleLineCommentTrivia)
{
Comments.Add(new SingleLineComment(trivia.ToString(),
trivia.GetLocation().GetLineSpan().EndLinePosition.Line + 1, commentLocationstore)
{
FileName = fileName,
});
}
else if (trivia.Kind() == SyntaxKind.MultiLineCommentTrivia)
{
Comments.Add(new MultiLineComment(trivia.ToString(),
trivia.GetLocation().GetLineSpan().StartLinePosition.Line + 1,
trivia.GetLocation().GetLineSpan().EndLinePosition.Line + 1, commentLocationstore)
{
FileName = fileName,
});
}
}
public void AddCommentNode(DocumentationCommentTriviaSyntax node,
LocationStore commentLocationstore, string fileName)
{
Comments.Add(new DocComment(node.ToString(),
node.GetLocation().GetLineSpan().StartLinePosition.Line + 1,
node.GetLocation().GetLineSpan().EndLinePosition.Line,
commentLocationstore)
{
FileName = fileName,
});
}
}
}
In the main file (Program.cs) I launch the comment extraction like this:
string fileContent;
SyntaxTree tree;
SyntaxNode root;
CommentsWalker commentWalker;
MethodsAndClassesWalker methodWalker;
string[] files = Directory.GetFiles(projectPath, $"*.cs", SearchOption.AllDirectories);
var commentStore = new CommentStore();
Console.WriteLine("Reading files...");
ProgressBar progressBar = new ProgressBar(files.Length);
foreach (var file in files)
{
fileContent = File.ReadAllText(file);
string filePath = new Regex($@"{projectPath}\\(.*)$").Match(file).Groups[1].ToString();
tree = CSharpSyntaxTree.ParseText(fileContent);
root = tree.GetRoot();
commentWalker = new CommentsWalker(filePath, commentStore);
commentWalker.Visit(root);
progressBar.UpdateAndDisplay();
}
and here is also the comment walker:
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;
namespace RoslynPlay
{
public class CommentsWalker : CSharpSyntaxWalker
{
private string _fileName;
private CommentStore _commentStore;
public CommentsWalker(string fileName,
CommentStore commentStore)
: base(SyntaxWalkerDepth.StructuredTrivia)
{
_fileName = fileName;
_commentStore = commentStore;
}
public override void VisitTrivia(SyntaxTrivia trivia)
{
if (trivia.Kind() == SyntaxKind.SingleLineCommentTrivia
|| trivia.Kind() == SyntaxKind.MultiLineCommentTrivia)
{
_commentStore.AddCommentTrivia(trivia, _commentLocationStore, _fileName);
}
base.VisitTrivia(trivia);
}
public override void VisitDocumentationCommentTrivia(DocumentationCommentTriviaSyntax node)
{
_commentStore.AddCommentNode(node, _commentLocationStore, _fileName);
base.VisitDocumentationCommentTrivia(node);
}
}
}
The problem is that trivia.Kind() == SyntaxKind.SingleLineCommentTrivia matches only one comment line at a time, but I want to extract a block of consecutive single-line comments as one comment.
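One possible approach, independent of Roslyn itself, is to collect the single-line comments with their line numbers first and then merge the ones that sit on consecutive lines. The sketch below only illustrates that merging step. `LineComment` and `CommentGrouper` are hypothetical names, and it assumes the comments arrive ordered by line, as a walker visiting the tree top to bottom would deliver them.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical record of one single-line comment: its 1-based line and its text.
public class LineComment
{
    public int Line;
    public string Text;
    public LineComment(int line, string text) { Line = line; Text = text; }
}

public static class CommentGrouper
{
    // Merges comments on consecutive lines into one block per run.
    // Assumes the input is ordered by line number.
    public static List<string> Group(IEnumerable<LineComment> comments)
    {
        var blocks = new List<string>();
        var current = new List<string>();
        int previousLine = -1;
        foreach (LineComment c in comments)
        {
            // A gap in line numbers ends the current block.
            if (current.Count > 0 && c.Line != previousLine + 1)
            {
                blocks.Add(string.Join("\n", current));
                current.Clear();
            }
            current.Add(c.Text);
            previousLine = c.Line;
        }
        if (current.Count > 0) blocks.Add(string.Join("\n", current));
        return blocks;
    }
}
```

Line adjacency alone would also merge a trailing comment after code with a comment on the next line, so a real implementation would probably also want to check that each comment is the only thing on its line.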

Managed-code action not changing property

I am attempting to use a managed code action in an InstallShield suite project.
I followed their example in hopes it wouldn't be a big deal.
http://helpnet.flexerasoftware.com/installshield21helplib/helplibrary/SteActionManagedCd.htm
Ideally, I want my .dll method to change an InstallShield property when the action is executed. When I test the installer to see whether the property changed, I still get the default value I set.
Maybe I am doing it wrong; any suggestions would be greatly appreciated.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.InteropServices;
using Microsoft.Web.Administration;
namespace IISHelp
{
[Guid("BAFAEAED-08C6-4679-B94E-487A4D89DE63")]
[TypeLibType(4288)]
public interface ISuiteExtension
{
[DispId(1)]
string get_Attribute(string bstrName);
[DispId(2)]
void LogInfo(string bstr);
[DispId(3)]
string get_Property(string bstrName);
[DispId(3)]
void set_Property(string bstrName, string bstrValue);
[DispId(4)]
string FormatProperty(string bstrValue);
[DispId(5)]
string ResolveString(string bstrStringId);
}
public class Help
{
const UInt32 ERROR_INSTALL_FAILURE = 1603;
const UInt32 ERROR_SUCCESS = 0;
ISuiteExtension GetExtensionFromDispatch(object pDispatch)
{
if (Marshal.IsComObject(pDispatch) == false)
throw new ContextMarshalException("Invalid dispatch object passed to CLR method");
return (ISuiteExtension)pDispatch;
}
public UInt32 IsUniqueName(object pDispatch)
{
try
{
List<string> currentVirtualDirectories = GetApplicationNames();
ISuiteExtension suiteExtension = GetExtensionFromDispatch(pDispatch);
var name = suiteExtension.get_Property("APPLICATION_NAME");
suiteExtension.set_Property("IS_UNIQUE_APPLICATION_NAME", (!currentVirtualDirectories.Contains(name)).ToString());
}
catch (System.ContextMarshalException)
{
return ERROR_INSTALL_FAILURE;
}
return ERROR_SUCCESS;
}
public List<string> GetApplicationNames()
{
List<string> currentVirtualDirectories = new List<string>();
ServerManager mgr = new ServerManager();
foreach (Site s in mgr.Sites)
{
foreach (Application app in s.Applications)
{
currentVirtualDirectories.Add(app.Path.Replace('/', ' ').Trim());
}
}
return currentVirtualDirectories.Where(s => !string.IsNullOrWhiteSpace(s)).Distinct().ToList();
}
}
}

How to sort an arraylist on date?

Code:
while ((linevalue = filereader.ReadLine()) != null)
{
items.Add(linevalue);
}
filereader.Close();
items.Sort();
//To display the content of array (sorted)
IEnumerator myEnumerator = items.GetEnumerator();
while (myEnumerator.MoveNext())
{
Console.WriteLine(myEnumerator.Current);
}
The program above displays all the values. How do I extract only the dates and sort them in ascending order?
I am not allowed to use LINQ, exceptions, threading, or anything similar. I have to stick with the file stream: read the data from the text file, then sort and store it so that I can retrieve, view, and edit it, and search for a particular date to see the date-of-joining records for that date. I can't figure it out and am struggling.
Basically, don't try and work with the file as lines of text; separate that away, so that you have one piece of code which parses that text into typed records, and then process those upstream when you only need to deal with typed data.
For example (and here I'm assuming that the file is tab-delimited, but you could change it to be column indexed instead easily enough); look at how little work my Main method needs to do to work with the data:
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;
static class Program
{
static void Main()
{
foreach (var item in ReadFile("my.txt").OrderBy(x => x.Joined))
{
Console.WriteLine(item.Names);
}
}
static readonly char[] tab = { '\t' };
class Foo
{
public string Names { get; set; }
public int Age { get; set; }
public string Designation { get; set; }
public DateTime Joined { get; set; }
}
static IEnumerable<Foo> ReadFile(string path)
{
using (var reader = File.OpenText(path))
{
// skip the first line (headers), or exit
if (reader.ReadLine() == null) yield break;
// read each line
string line;
var culture = CultureInfo.InvariantCulture;
while ((line = reader.ReadLine()) != null)
{
var parts = line.Split(tab);
yield return new Foo
{
Names = parts[0],
Age = int.Parse(parts[1], culture),
Designation = parts[2],
Joined = DateTime.Parse(parts[3], culture)
};
}
}
}
}
And here's a version (not quite as elegant, but working) that works on .NET 2.0 (and probably on .NET 1.1) using only ISO-1 language features; personally I think it would be silly to use .NET 1.1, and if you are using .NET 2.0, then List<T> would be vastly preferable to ArrayList. But this is "worst case":
using System;
using System.Collections;
using System.Globalization;
using System.IO;
class Program
{
static void Main()
{
ArrayList items = ReadFile("my.txt");
items.Sort(FooByDateComparer.Default);
foreach (Foo item in items)
{
Console.WriteLine(item.Names);
}
}
class FooByDateComparer : IComparer
{
public static readonly FooByDateComparer Default
= new FooByDateComparer();
private FooByDateComparer() { }
public int Compare(object x, object y)
{
return ((Foo)x).Joined.CompareTo(((Foo)y).Joined);
}
}
static readonly char[] tab = { '\t' };
class Foo
{
private string names, designation;
private int age;
private DateTime joined;
public string Names { get { return names; } set { names = value; } }
public int Age { get { return age; } set { age = value; } }
public string Designation { get { return designation; } set { designation = value; } }
public DateTime Joined { get { return joined; } set { joined = value; } }
}
static ArrayList ReadFile(string path)
{
ArrayList items = new ArrayList();
using (StreamReader reader = File.OpenText(path))
{
// skip the first line (headers), or exit
if (reader.ReadLine() == null) return items;
// read each line
string line;
CultureInfo culture = CultureInfo.InvariantCulture;
while ((line = reader.ReadLine()) != null)
{
string[] parts = line.Split(tab);
Foo foo = new Foo();
foo.Names = parts[0];
foo.Age = int.Parse(parts[1], culture);
foo.Designation = parts[2];
foo.Joined = DateTime.Parse(parts[3], culture);
items.Add(foo);
}
}
return items;
}
}
I'm not sure why you'd want to retrieve just the dates. You'd probably be better off reading your data into Tuples first. Something like
List<Tuple<string, int, string, DateTime>> items;
Then you can sort them by each tuple's Item4, which will be the date.
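As a rough sketch of that idea (it needs .NET 4 for Tuple; `TupleSortDemo` and `SortByDate` are hypothetical names, and the two sample records are made up purely for illustration), the list can be sorted in place with a comparison on Item4, without LINQ:

```csharp
using System;
using System.Collections.Generic;

public static class TupleSortDemo
{
    // Sorts the records in place by their fourth element (the joining date).
    public static void SortByDate(List<Tuple<string, int, string, DateTime>> items)
    {
        items.Sort((a, b) => a.Item4.CompareTo(b.Item4));
    }

    public static void Main()
    {
        // Made-up sample data: name, age, designation, date of joining.
        var items = new List<Tuple<string, int, string, DateTime>>
        {
            Tuple.Create("Ann", 30, "Engineer", new DateTime(2012, 5, 1)),
            Tuple.Create("Bob", 41, "Manager", new DateTime(2009, 1, 15)),
        };
        SortByDate(items);
        foreach (var item in items)
        {
            Console.WriteLine("{0} joined {1:yyyy-MM-dd}", item.Item1, item.Item4);
        }
    }
}
```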
You can use LINQ and split the line according to tabs to only retrieve the date and order them through a conversion to date.
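// items is assumed to be a List<string>; Last() and OrderBy need "using System.Linq;"
while ((linevalue = filereader.ReadLine()) != null)
{
    items.Add(linevalue.Split('\t').Last());
}
filereader.Close();
// OrderBy returns a new sorted sequence; it does not sort items in place
foreach (var item in items.OrderBy(i => DateTime.Parse(i)))
{
    Console.WriteLine(item);
}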
Get the desired date values from the file into the list, then sort it with a comparer:
public class DateComparer : IComparer {
    // The non-generic IComparer requires Compare(object, object)
    public int Compare(object x, object y) {
        return ((DateTime)x).Date.CompareTo(((DateTime)y).Date);
    }
}
list.Sort(new DateComparer());

A Simple Helper Class doesn't work

Sorry for asking such a simple question, but I have lost a really long time trying to solve this. In the end, I decided to ask you.
Let's start with the code base :
using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Web.Mvc;
namespace Navigation.Helpers
{
public static class NavigationBarSE
{
public static MvcHtmlString RenderNavigationBarSE(this HtmlHelper helper, String[] includes)
{
return new MvcHtmlString("Y U no Work??");
//NavTypeSE res = new NavTypeSE(includes);
//String ress = res.toString();
//return new MvcHtmlString(ress);
}
}
}
In its original form, this helper needs to return a String produced by the NavTypeSE class. To narrow the problem down, I reduced it to returning a fixed String, but even that doesn't show up.
Before you ask, I can confirm that
<add namespace="Navigation.Helpers"/>
exists in the Web.config file in my Views folder.
For more detail, my NavTypeSE class is below:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace Navigation.Helpers
{
//Creates a Navigation Menu Type which includes Previous, Next and Validate Buttons
public class NavTypeSE
{
Boolean pr, nt, vld;
Boolean Previous { get; set; }
Boolean Next { get; set; }
Boolean Validate { get; set; }
public NavTypeSE(Boolean Previous, Boolean Next, Boolean Validate)
{
this.pr = Previous;
this.nt = Next;
this.vld = Validate;
}
public NavTypeSE() { }
public NavTypeSE(String[] inc)
{
for(int i=0; i<inc.Length; i++)//foreach (String s in inc)
{
String s = inc[i]; // Don't need for foreach method.
if (s.Equals("previous")||s.Equals("Previous"))
{
this.pr = true;
}
else if (s.Equals("next") || s.Equals("Next"))
{
this.nt = true;
}
else if (s.Equals("validate") || s.Equals("Validate"))
{
this.vld = true;
}
else
{
this.pr = false; this.nt = false; this.vld = false;
}
}
}
public String toString()
{
return "Previous: " + this.pr + ", Next: " + this.nt + ", Validate: " + this.vld;
}
}
}
Also, in my View, I call this Helper like below :
@{
String[] str = new String[] { "Previous", "next", "Validate" };
Html.RenderNavigationBarSE(str);
}
This is just a base for a project, and I'm at a beginner level in both C# and the ASP.NET MVC platform. Sorry for taking your time.
Your RenderNavigationBarSE writes nothing into the Response; it just returns an MvcHtmlString.
So you need to put an @ before the method call to tell the Razor engine that you want to write the returned MvcHtmlString into the response (otherwise, inside a code block, it just executes your method and throws away the returned value):
@{
    String[] str = new String[] { "Previous", "next", "Validate" };
}
@Html.RenderNavigationBarSE(str)
You can read more about the Razor syntax:
Introduction to ASP.NET Web Programming Using the Razor Syntax (C#)
There is also a C# Razor Syntax Quick Reference

XML Serialization of List growing too large

I have a C# Windows forms application that runs a Trivia game on an IRC channel, and keeps the questions it asks, and the Leaderboard (scores) in Classes that I serialize to XML to save between sessions. The issue I have been having is best described with the flow, so here it is:
User X Gets entry in Leaderboard class with a score of 1. Class is saved to XML, XML contains one entry for user X.
User Y gets entry in Leaderboard class with a score of 1. Class is saved to XML, XML contains duplicate entries for User X, and one entry for User Y.
After running it for a week with under 20 users, I hoped to be able to write a web backend in PHP to help me use the scores. XML file is 2 megabytes.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Runtime.Serialization;
namespace IRCTriviaBot
{
[Serializable()]
public class LeaderBoard
{
[Serializable()]
public class Pair
{
public string user;
public int score;
public Pair(string usr, int scr)
{
user = usr;
score = scr;
}
public Pair() { }
}
private static List<Pair> pairs = null;
public List<Pair> Pairs
{
get
{
if (pairs==null)
{
pairs = new List<Pair>();
}
return pairs;
}
}
public LeaderBoard()
{
}
public void newScore(string usr)
{
bool found = false;
for (int i = 0; i < Pairs.Count && !found; ++i)
{
if (Pairs[i].user==usr)
{
found = true;
Pairs[i].score++;
}
}
if (!found)
{
Pairs.Add(new Pair(usr, 1));
}
}
public int getScore(string usr)
{
bool found = false;
for (int i = 0; i < Pairs.Count && !found; ++i)
{
if (Pairs[i].user == usr)
{
return Pairs[i].score;
}
}
if (!found)
{
return 0;
}
return 0;
}
}
}
Here's where the serialization and deserialization happens.
void parseMessage(string message, string user = "")
{
if (message == "-startgame-")
{
if (!gameStarted)
{
gameStarted = true;
openScores();
startGame();
}
}
else if (message == "-hint-")
{
if (!hintGiven && gameStarted)
{
sendMessage("Here's a better hint: " + Form2.qa.Answers[curQ].Trim());
hintGiven = true;
}
}
else if (message == "-myscore-")
{
sendMessage(user + ", your score is: " + leaderB.getScore(user));
}
else if (message.ToLower() == Form2.qa.Answers[curQ].ToLower())
{
if (gameStarted)
{
sendMessage(user + " got it right! Virtual pat on the back!");
leaderB.newScore(user);
saveScores();
System.Threading.Thread.Sleep(2000);
startGame();
}
}
else if (message == "-quit-")
{
if (gameStarted)
{
sendMessage("Sorry to see you go! Have fun without me :'(");
gameStarted = false;
}
else
{
sendMessage("A game is not running.");
}
}
else
{
if (gameStarted)
{
//sendMessage("Wrong.");
}
}
}
void saveScores()
{
//Opens the existing file and serializes the object into it as XML.
Stream stream = System.IO.File.Open("scores.xml", FileMode.Open);
XmlSerializer xmlserializer = new XmlSerializer(typeof(LeaderBoard));
//BinaryFormatter formatter = new BinaryFormatter();
xmlserializer.Serialize(stream, leaderB);
stream.Close();
}
void openScores()
{
Stream stream = System.IO.File.OpenRead("scores.xml");
XmlSerializer xmlserializer = new XmlSerializer(typeof(LeaderBoard));
//BinaryFormatter formatter = new BinaryFormatter();
leaderB = (LeaderBoard)xmlserializer.Deserialize(stream);
stream.Close();
}
I think this has to do with pairs being marked static. I don't believe the XmlSerializer will clear a list before adding elements to it, so every time you call openScores() you will create duplicate entries rather than overwrite existing ones.
In general, I've observed that serialization and global variables don't play well together. For this purpose, "global variables" includes private statics, singletons, monostate classes like this, and thread-local variables.
It also looks like there's some waffling here between XML and binary serialization. They are completely different beasts. XML serialization looks only at a class's public fields and properties, while binary serialization looks at a class's instance fields, including private ones. Also, XML serialization ignores the Serializable attribute.
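To illustrate the first point, here is a minimal sketch of the same shape of class with the list held in an instance field instead of a static one. `LeaderBoard` and `Pair` mirror the names in the question (with `Pair` moved to the top level for brevity); `RoundTripDemo` and `RoundTrip` are hypothetical helpers, and a MemoryStream stands in for scores.xml. Because every deserialized LeaderBoard starts with its own empty list, repeated save/load cycles are expected to leave the entry count unchanged.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

public class Pair
{
    public string user;
    public int score;
}

public class LeaderBoard
{
    // Instance (not static) list: every deserialized LeaderBoard gets a fresh, empty one.
    private readonly List<Pair> pairs = new List<Pair>();
    public List<Pair> Pairs { get { return pairs; } }
}

public static class RoundTripDemo
{
    // Hypothetical helper: serializes and deserializes the board the given number of times.
    public static LeaderBoard RoundTrip(LeaderBoard board, int times)
    {
        var serializer = new XmlSerializer(typeof(LeaderBoard));
        for (int i = 0; i < times; i++)
        {
            using (var buffer = new MemoryStream())
            {
                serializer.Serialize(buffer, board);
                buffer.Position = 0;
                board = (LeaderBoard)serializer.Deserialize(buffer);
            }
        }
        return board;
    }

    public static void Main()
    {
        var board = new LeaderBoard();
        board.Pairs.Add(new Pair { user = "X", score = 1 });
        board = RoundTrip(board, 3);
        Console.WriteLine(board.Pairs.Count);
    }
}
```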
