Problems to extract text from PDF for certain pdfs only C#

Problems to extract text from PDF for certain pdfs only C# - c#

I need to extract some data from a PDF file.
I'm using the iTextSharp to do that.
I'm using this code which I founded on the net:
using System;
using System.IO;
using iTextSharp.text.pdf;
namespace PdfToText
{
/// <summary>
/// Parses a PDF file and extracts the text from it.
/// </summary>
public class PDFParser
{
/// BT = Beginning of a text object operator
/// ET = End of a text object operator
/// Td move to the start of next line
/// 5 Ts = superscript
/// -5 Ts = subscript
#region Fields
#region _numberOfCharsToKeep
/// <summary>
/// The number of characters to keep, when extracting text.
/// </summary>
private static int _numberOfCharsToKeep = 15;
#endregion
#endregion
#region ExtractText
/// <summary>
/// Extracts a text from a PDF file.
/// </summary>
/// <param name="inFileName">the full path to the pdf file.</param>
/// <param name="outFileName">the output file name.</param>
/// <returns>the extracted text</returns>
public bool ExtractText(string inFileName, string outFileName)
{
StreamWriter outFile = null;
try
{
outFileName = String.Empty;
outFileName = Path.GetDirectoryName(System.AppDomain.CurrentDomain.BaseDirectory);
//string currentDirectory = Directory.GetCurrentDirectory();
//string filePath = System.IO.Path.Combine(currentDirectory, "Data", "myfile.txt");
// extract the text
//string test = "";
outFileName += #"\test.txt";
// Create a reader for the given PDF file
PdfReader reader = new PdfReader(inFileName);
//outFile = File.CreateText(outFileName);
outFile = new StreamWriter(outFileName, true, System.Text.Encoding.UTF8);
Console.Write("Processing: ");
int totalLen = 68;
float charUnit = ((float)totalLen) / (float)reader.NumberOfPages;
int totalWritten = 0;
float curUnit = 0;
for (int page = 1; page <= reader.NumberOfPages; page++)
{
outFile.Write(ExtractTextFromPDFBytes(reader.GetPageContent(page)) + " ");
// Write the progress.
if (charUnit >= 1.0f)
{
for (int i = 0; i < (int)charUnit; i++)
{
Console.Write("#");
totalWritten++;
}
}
else
{
curUnit += charUnit;
if (curUnit >= 1.0f)
{
for (int i = 0; i < (int)curUnit; i++)
{
Console.Write("#");
totalWritten++;
}
curUnit = 0;
}
}
}
if (totalWritten < totalLen)
{
for (int i = 0; i < (totalLen - totalWritten); i++)
{
Console.Write("#");
}
}
return true;
}
catch(Exception ex)
{
return false;
}
finally
{
if (outFile != null) outFile.Close();
}
}
#endregion
#region ExtractTextFromPDFBytes
/// <summary>
/// This method processes an uncompressed Adobe (text) object
/// and extracts text.
/// </summary>
/// <param name="input">uncompressed</param>
/// <returns></returns>
private string ExtractTextFromPDFBytes(byte[] input)
{
if (input == null || input.Length == 0) return "";
try
{
string resultString = "";
// Flag showing if we are we currently inside a text object
bool inTextObject = false;
// Flag showing if the next character is literal
// e.g. '\\' to get a '\' character or '\(' to get '('
bool nextLiteral = false;
// () Bracket nesting level. Text appears inside ()
int bracketDepth = 0;
// Keep previous chars to get extract numbers etc.:
char[] previousCharacters = new char[_numberOfCharsToKeep];
for (int j = 0; j < _numberOfCharsToKeep; j++) previousCharacters[j] = ' ';
for (int i = 0; i < input.Length; i++)
{
char c = (char)input[i];
if (inTextObject)
{
// Position the text
if (bracketDepth == 0)
{
if (CheckToken(new string[] { "TD", "Td" }, previousCharacters))
{
resultString += "\n\r";
}
else
{
if (CheckToken(new string[] { "'", "T*", "\"" }, previousCharacters))
{
resultString += "\n";
}
else
{
if (CheckToken(new string[] { "Tj" }, previousCharacters))
{
resultString += " ";
}
}
}
}
// End of a text object, also go to a new line.
if (bracketDepth == 0 &&
CheckToken(new string[] { "ET" }, previousCharacters))
{
inTextObject = false;
resultString += " ";
}
else
{
// Start outputting text
if ((c == '(') && (bracketDepth == 0) && (!nextLiteral))
{
bracketDepth = 1;
}
else
{
// Stop outputting text
if ((c == ')') && (bracketDepth == 1) && (!nextLiteral))
{
bracketDepth = 0;
}
else
{
// Just a normal text character:
if (bracketDepth == 1)
{
// Only print out next character no matter what.
// Do not interpret.
if (c == '\\' && !nextLiteral)
{
nextLiteral = true;
}
else
{
if (((c >= ' ') && (c <= '~')) ||
((c >= 128) && (c < 255)))
{
resultString += c.ToString();
}
nextLiteral = false;
}
}
}
}
}
}
// Store the recent characters for
// when we have to go back for a checking
for (int j = 0; j < _numberOfCharsToKeep - 1; j++)
{
previousCharacters[j] = previousCharacters[j + 1];
}
previousCharacters[_numberOfCharsToKeep - 1] = c;
// Start of a text object
if (!inTextObject && CheckToken(new string[] { "BT" }, previousCharacters))
{
inTextObject = true;
}
}
return resultString;
}
catch
{
return "";
}
}
#endregion
#region CheckToken
/// <summary>
/// Check if a certain 2 character token just came along (e.g. BT)
/// </summary>
/// <param name="search">the searched token</param>
/// <param name="recent">the recent character array</param>
/// <returns></returns>
private bool CheckToken(string[] tokens, char[] recent)
{
foreach (string token in tokens)
{
if ((recent[_numberOfCharsToKeep - 3] == token[0]) &&
(recent[_numberOfCharsToKeep - 2] == token[1]) &&
((recent[_numberOfCharsToKeep - 1] == ' ') ||
(recent[_numberOfCharsToKeep - 1] == 0x0d) ||
(recent[_numberOfCharsToKeep - 1] == 0x0a)) &&
((recent[_numberOfCharsToKeep - 4] == ' ') ||
(recent[_numberOfCharsToKeep - 4] == 0x0d) ||
(recent[_numberOfCharsToKeep - 4] == 0x0a))
)
{
return true;
}
}
return false;
}
#endregion
}
}
I'm using this way:
PDFParser pdfParser = new PDFParser();
pdfParser.ExtractText(pdfFile,Path.GetFileNameWithoutExtension(pdfFile) + ".txt");
So the pdf content is written in a txt file.
It works good for certain pdf-s, but for a pdf file that I really need to use, the txt file remains always empty. I didn't get errors, but for some reason it's not writing anything, although as you can see in this screenshot it recognize the pdf,that it has 2 pages...
This is the pdf that I need but the txt always remains empty.(the black lines are added by me, so there are not present when I want to write in the txt)
And this is another pdf. For this the program works ok, and it is written is a txt file. It is much bigger than the other pdf, and still for this I can extract the texts and for the other I can't.
Do you have any idea what can be the problem?

Too long for comment and maybe an answer that you do not like to get:
In PDFs the "Text your see" aka how does a font look and "What the glyphs mean" aka what glyph is mapped to which utf8-letter are separate things.
They are stored in different parts of the pdf - it is utterly possible that a pdf looks totally fine, but if you try to extract text it will give you nothing from it because it only contains the shape of your textglyphs but not theire "meaning".
Try to open the pdf and Select + Copy the text you are after, if you paste that into an editor and noting is there, your pdf lacks the information "what utf8-letter is displayed by this glyph".
OR:
It also might be that your pdf only containts the image of a text - a photo so to say. You can read it, iTextSharp sees only a "picture" - no text.
Those are possible 'why's that would answer your question. As to how to fix it:
There are several questions about corrupt PDFs on SO:
How to repair a PDF file and embed missing fonts
Embedded fonts in PDF: copy and paste problems (this answer)
Copy and Paste relates to text parsing, so the might help you out on how to fix it.
Your edit shows details about your parsing, why don't you leverage iTextSharp for that?
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
public static string ExtractTextFromPdf(string path)
{
using (PdfReader reader = new PdfReader(path))
{
StringBuilder text = new StringBuilder();
for (int i = 1; i <= reader.NumberOfPages; i++)
{
text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
}
return text.ToString();
}
from: http://www.squarepdf.net/parsing-pdf-files-using-itextsharp
or like here: parse-pdf-with-itextsharp-and-then-extract-specific-text-to-the-screen ?

Related

C# Removing longest words in each line [duplicate]

This question already has answers here:
Finding longest word in string
(4 answers)
Closed 2 years ago.
I have .txt file.
I have to remove all the longest words for each line.
The main method where I'm looking for longest word is :
/// <summary>
/// Finds longest word in line
/// </summary>
/// <param name="eil">Line</param>
/// <param name="skyr">Punctuation</param>
/// <returns>Returns longest word for line</returns>
static string[] RastiIlgZodiEil(string eil, char[] skyr)
{
string[] zodIlg = new string[100];
for (int k = 0; k < 100; k++)
{
zodIlg[k] = " ";
}
int kiek = 0;
string[] parts = eil.Split(skyr,
StringSplitOptions.RemoveEmptyEntries);
int i = 0;
foreach (string zodis in parts)
{
if (zodis.Length > zodIlg[i].Length)
{
zodIlg[kiek] = zodis;
kiek++;
i++;
}
else
{
i++;
}
}
return zodIlg;
}
EDIT : Method that reads the .txt file and uses the previous method to replace line with a new line that is configured (by replacing the word that has to be deleted with an empty string).
/// <summary>
/// Finds longest words for each line and then replaces them with
/// emptry string
/// </summary>
/// <param name="fv">File name</param>
/// <param name="skyr">Punctuation</param>
static void RastiIlgZodiFaile(string fv, string fvr, char[] skyr)
{
using (var fr = new StreamWriter(fvr, true,
System.Text.Encoding.GetEncoding(1257)))
{
using (StreamReader reader = new StreamReader(fv,
Encoding.GetEncoding(1257)))
{
int n = 0;
string line;
while (((line = reader.ReadLine()) != null))
{
n++;
if (line.Length > 0)
{
string[] temp = RastiIlgZodiEil(line, skyr);
foreach (string t in temp)
{
line = line.Replace(t, "");
}
fr.WriteLine(line);
}
}
}
}
}

You could remove the longest word(s) from each line with:
static string RemoveLongestWord(string eil, char[] skyr)
{
string[] parts = eil.Split(skyr, StringSplitOptions.RemoveEmptyEntries);
int longestLength = parts.OrderByDescending(s => s.Length).First().Length;
var longestWords = parts.Where(s => s.Length == longestLength);
foreach(string word in longestWords)
{
eil = eil.Replace(word, "");
}
return eil;
}
Just pass each line to the function and you'll get that line back with the longest word removed.
Here's an approach that more closely resembles what you were doing before:
static string[] RastiIlgZodiEil(string eil, char[] skyr)
{
List<string> zodIlg = new List<string>();
string[] parts = eil.Split(skyr, StringSplitOptions.RemoveEmptyEntries);
int maxLength = -1;
foreach (string zodis in parts)
{
if (zodis.Length > maxLength)
{
maxLength = zodis.Length;
}
}
foreach (string zodis in parts)
{
if (zodis.Length == maxLength)
{
zodIlg.Add(zodis);
}
}
return zodIlg.Distinct().ToArray();
}
The first pass finds the longest length. The second pass adds all word that match that length to a List<string>. Finally, we call Distinct() to remove duplicates from the list and return an array version of it.

Here is another solution. Read all lines from a file and split each line using a loop. Aggregate function is used to get the highest length to filter the data further.
static void Main(string[] args)
{
var data = File.ReadAllLines(Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "test.txt"));
for (int i = 0; i < data.Length; i++)
{
Console.WriteLine(data[i]);
var split = data[i].Split(' ');
int length = split.Aggregate((a, b) => a.Length >= b.Length ? a : b).Length;
data[i] = string.Join(' ', split.Where(w => w.Length < length));
Console.WriteLine(data[i]);
}
Console.Read();
}

extracting Arabic text in c# by using itextsharp

I have this code and I'm using it to take the text of a PDF. It's great for a PDF in English but when I'm trying to extract the text in Arabic it shows me something like this.
") + n 9 n <+, + )+ $ # $ +$ F% 9& .< $ : ;"
using (PdfReader reader = new PdfReader(path))
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
String text = "";
for (int i = 1; i <= reader.NumberOfPages; i++)
{
text = PdfTextExtractor.GetTextFromPage(reader, i,strategy);
}
}

I had to change the strategy like this
var t = PdfTextExtractor.GetTextFromPage(reader, i, new LocationTextExtractionStrategy());
var te = Convert(t);
and this function to reverse the Arabic words and keep the English
private string Convert(string source)
{
string arabicWord = string.Empty;
StringBuilder sbDestination = new StringBuilder();
foreach (var ch in source)
{
if (IsArabic(ch))
arabicWord += ch;
else
{
if (arabicWord != string.Empty)
sbDestination.Append(Reverse(arabicWord));
sbDestination.Append(ch);
arabicWord = string.Empty;
}
}
// if the last word was arabic
if (arabicWord != string.Empty)
sbDestination.Append(Reverse(arabicWord));
return sbDestination.ToString();
}
private bool IsArabic(char character)
{
if (character >= 0x600 && character <= 0x6ff)
return true;
if (character >= 0x750 && character <= 0x77f)
return true;
if (character >= 0xfb50 && character <= 0xfc3f)
return true;
if (character >= 0xfe70 && character <= 0xfefc)
return true;
return false;
}
// Reverse the characters of string
string Reverse(string source)
{
return new string(source.ToCharArray().Reverse().ToArray());
}

Unable to find csv file on ASP.NET

Anyone who knows how to solve this error please help me. Below is my code for reading the csv file. When I tried to upload it showed me *Server Error in '/' Application. Could not find file 'C:/...csv * Im a beginner in c#.
ReadCSV
string filename = FileUpload1.PostedFile.FileName;
using (CsvFileReader reader = new CsvFileReader(filename))
{
CsvRow row = new CsvRow();
while (reader.ReadRow(row))
{
foreach (string s in row)
{
Console.Write(s);
Console.Write(" ");
TextBox1.Text += s;
}
Console.WriteLine();
}
}
CSVClass
public class CsvFileReader : StreamReader
{
public CsvFileReader(Stream stream)
: base(stream)
{
}
public CsvFileReader(string filename): base(filename)
{
}
/// <summary>
/// Reads a row of data from a CSV file
/// </summary>
/// <param name="row"></param>
/// <returns></returns>
public bool ReadRow(CsvRow row)
{
row.LineText = ReadLine();
if (String.IsNullOrEmpty(row.LineText))
return false;
int pos = 0;
int rows = 0;
while (pos < row.LineText.Length)
{
string value;
// Special handling for quoted field
if (row.LineText[pos] == '"')
{
// Skip initial quote
pos++;
// Parse quoted value
int start = pos;
while (pos < row.LineText.Length)
{
// Test for quote character
if (row.LineText[pos] == '"')
{
// Found one
pos++;
// If two quotes together, keep one
// Otherwise, indicates end of value
if (pos >= row.LineText.Length || row.LineText[pos] != '"')
{
pos--;
break;
}
}
pos++;
}
value = row.LineText.Substring(start, pos - start);
value = value.Replace("\"\"", "\"");
}
else
{
// Parse unquoted value
int start = pos;
while (pos < row.LineText.Length && row.LineText[pos] != ',')
pos++;
value = row.LineText.Substring(start, pos - start);
}
// Add field to list
if (rows < row.Count)
row[rows] = value;
else
row.Add(value);
rows++;
// Eat up to and including next comma
while (pos < row.LineText.Length && row.LineText[pos] != ',')
pos++;
if (pos < row.LineText.Length)
pos++;
}
// Delete any unused items
while (row.Count > rows)
row.RemoveAt(rows);
// Return true if any columns read
return (row.Count > 0);
}
}

FileUpload1.PostedFile.FileName is the filename from your client/browser - it does not contain a path...
You either use FileUpload1.PostedFile.InputStream to access it
using (CsvFileReader reader = new CsvFileReader(FileUpload1.PostedFile.InputStream))
OR you first save it to disk (anywhere you have needed permissions) via FileUpload1.PostedFile.SaveAs and then access that file.

You need to save the file that is being uploaded to disk first. Something like this:
string fileSavePath= Sever.MapPath("/files/" + FileUpload1.PostedFile.FileName);
FileUpload1.SaveAs(fileSavePath);
using (CsvFileReader reader = new CsvFileReader(fileSavePath))
....
Haven't tested this code, but should give you a starting point.

Display first 2000 words from a long string (Stripping HTML) C# ASP.NET

I have a field in my database that holds input from an html input. So I have in my db column data. What I need is to be able to extract this and display a short version of it as an intro. Maybe even the first paragraph if possible.

Something like this maybe?
public string Get(string text, int maxWordCount)
{
int wordCounter = 0;
int stringIndex = 0;
char[] delimiters = new[] { '\n', ' ', ',', '.' };
while (wordCounter < maxWordCount)
{
stringIndex = text.IndexOfAny(delimiters, stringIndex + 1);
if (stringIndex == -1)
return text;
++wordCounter;
}
return text.Substring(0, stringIndex);
}
It's quite simplified and doesnt handle if multiple delimiters comes after each other (for instance ", "). you might just want to use space as a delimiter.
If you want to get just the first paragraph, simply search after "\r\n\r\n" <-- two line breaks:
public string GetFirstParagraph(string text)
{
int pos = text.IndexOf("\r\n\r\n");
return pos == -1 ? text : text.Substring(0, pos);
}
Edit:
A very simplistic way to strip HTML:
return Regex.Replace(text, #”<(.|\n)*?>”, string.Empty);

The Html Agility Pack is usually the recommended way to strip out the HTML. After that it would just be a matter of doing a String.Substring to get the bit of it that you want.
If you need to get out the 2000 first words I suppose you could either use IndexOf to find a whitespace 2000 times and loop through it until then to get the index to use in the call to Substring.
Edit: Add sample method
public int GetIndex(string str, int numberWanted)
{
int count = 0;
int index = 1;
for (; index < str.Length; index++)
{
if (char.IsWhiteSpace(str[index - 1]) == true)
{
if (char.IsLetterOrDigit(str[index]) == true ||
char.IsPunctuation(str[index]))
{
count++;
if (count >= numberWanted)
break;
}
}
}
return index;
}
And call it like:
string wordList = "This is a list of a lot of words";
int i = GetIndex(wordList, 5);
string result = wordList.Substring(0, i);

Once you have your string you would have to count your words. I assume space is a delimiter for words, so the following code should find the first 2000 words in a string (or break out if there are fewer words).
string myString = "la la la";
int lastPosition = 0;
for (int i = 0; i < 2000; i++)
{
int position = myString.IndexOf(' ', lastPosition + 1);
if (position == -1) break;
lastPosition = position;
}
string firstThousandWords = myString.Substring(0, lastPosition);
You can change indexOf to indexOfAny to support more characters as delimiters.

I had the same problem and combined a few Stack Overflow answers into this class. It uses the HtmlAgilityPack which is a better tool for the job. Call:
Words(string html, int n)
To get n words
using HtmlAgilityPack;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace UmbracoUtilities
{
public class Text
{
/// <summary>
/// Return the first n words in the html
/// </summary>
/// <param name="html"></param>
/// <param name="n"></param>
/// <returns></returns>
public static string Words(string html, int n)
{
string words = html, n_words;
words = StripHtml(html);
n_words = GetNWords(words, n);
return n_words;
}
/// <summary>
/// Returns the first n words in text
/// Assumes text is not a html string
/// http://stackoverflow.com/questions/13368345/get-first-250-words-of-a-string
/// </summary>
/// <param name="text"></param>
/// <param name="n"></param>
/// <returns></returns>
public static string GetNWords(string text, int n)
{
StringBuilder builder = new StringBuilder();
//remove multiple spaces
//http://stackoverflow.com/questions/1279859/how-to-replace-multiple-white-spaces-with-one-white-space
string cleanedString = System.Text.RegularExpressions.Regex.Replace(text, #"\s+", " ");
IEnumerable<string> words = cleanedString.Split().Take(n + 1);
foreach (string word in words)
builder.Append(" " + word);
return builder.ToString();
}
/// <summary>
/// Returns a string of html with tags removed
/// </summary>
/// <param name="html"></param>
/// <returns></returns>
public static string StripHtml(string html)
{
HtmlDocument document = new HtmlDocument();
document.LoadHtml(html);
var root = document.DocumentNode;
var stringBuilder = new StringBuilder();
foreach (var node in root.DescendantsAndSelf())
{
if (!node.HasChildNodes)
{
string text = node.InnerText;
if (!string.IsNullOrEmpty(text))
stringBuilder.Append(" " + text.Trim());
}
}
return stringBuilder.ToString();
}
}
}
Merry Christmas!

Reading PDF documents in .Net [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
Is there an open source library that will help me with reading/parsing PDF documents in .NET/C#?

Since this question was last answered in 2008, iTextSharp has improved their api dramatically. If you download the latest version of their api from http://sourceforge.net/projects/itextsharp/, you can use the following snippet of code to extract all text from a pdf into a string.
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
namespace PdfParser
{
public static class PdfTextExtractor
{
public static string pdfText(string path)
{
PdfReader reader = new PdfReader(path);
string text = string.Empty;
for(int page = 1; page <= reader.NumberOfPages; page++)
{
text += PdfTextExtractor.GetTextFromPage(reader,page);
}
reader.Close();
return text;
}
}
}

iTextSharp is the best bet. Used it to make a spider for lucene.Net so that it could crawl PDF.
using System;
using System.IO;
using iTextSharp.text.pdf;
using System.Text.RegularExpressions;
namespace Spider.Utils
{
/// <summary>
/// Parses a PDF file and extracts the text from it.
/// </summary>
public class PDFParser
{
/// BT = Beginning of a text object operator
/// ET = End of a text object operator
/// Td move to the start of next line
/// 5 Ts = superscript
/// -5 Ts = subscript
#region Fields
#region _numberOfCharsToKeep
/// <summary>
/// The number of characters to keep, when extracting text.
/// </summary>
private static int _numberOfCharsToKeep = 15;
#endregion
#endregion
#region ExtractText
/// <summary>
/// Extracts a text from a PDF file.
/// </summary>
/// <param name="inFileName">the full path to the pdf file.</param>
/// <param name="outFileName">the output file name.</param>
/// <returns>the extracted text</returns>
public bool ExtractText(string inFileName, string outFileName)
{
StreamWriter outFile = null;
try
{
// Create a reader for the given PDF file
PdfReader reader = new PdfReader(inFileName);
//outFile = File.CreateText(outFileName);
outFile = new StreamWriter(outFileName, false, System.Text.Encoding.UTF8);
Console.Write("Processing: ");
int totalLen = 68;
float charUnit = ((float)totalLen) / (float)reader.NumberOfPages;
int totalWritten = 0;
float curUnit = 0;
for (int page = 1; page <= reader.NumberOfPages; page++)
{
outFile.Write(ExtractTextFromPDFBytes(reader.GetPageContent(page)) + " ");
// Write the progress.
if (charUnit >= 1.0f)
{
for (int i = 0; i < (int)charUnit; i++)
{
Console.Write("#");
totalWritten++;
}
}
else
{
curUnit += charUnit;
if (curUnit >= 1.0f)
{
for (int i = 0; i < (int)curUnit; i++)
{
Console.Write("#");
totalWritten++;
}
curUnit = 0;
}
}
}
if (totalWritten < totalLen)
{
for (int i = 0; i < (totalLen - totalWritten); i++)
{
Console.Write("#");
}
}
return true;
}
catch
{
return false;
}
finally
{
if (outFile != null) outFile.Close();
}
}
#endregion
#region ExtractTextFromPDFBytes
/// <summary>
/// This method processes an uncompressed Adobe (text) object
/// and extracts text.
/// </summary>
/// <param name="input">uncompressed</param>
/// <returns></returns>
public string ExtractTextFromPDFBytes(byte[] input)
{
if (input == null || input.Length == 0) return "";
try
{
string resultString = "";
// Flag showing if we are we currently inside a text object
bool inTextObject = false;
// Flag showing if the next character is literal
// e.g. '\\' to get a '\' character or '\(' to get '('
bool nextLiteral = false;
// () Bracket nesting level. Text appears inside ()
int bracketDepth = 0;
// Keep previous chars to get extract numbers etc.:
char[] previousCharacters = new char[_numberOfCharsToKeep];
for (int j = 0; j < _numberOfCharsToKeep; j++) previousCharacters[j] = ' ';
for (int i = 0; i < input.Length; i++)
{
char c = (char)input[i];
if (input[i] == 213)
c = "'".ToCharArray()[0];
if (inTextObject)
{
// Position the text
if (bracketDepth == 0)
{
if (CheckToken(new string[] { "TD", "Td" }, previousCharacters))
{
resultString += "\n\r";
}
else
{
if (CheckToken(new string[] { "'", "T*", "\"" }, previousCharacters))
{
resultString += "\n";
}
else
{
if (CheckToken(new string[] { "Tj" }, previousCharacters))
{
resultString += " ";
}
}
}
}
// End of a text object, also go to a new line.
if (bracketDepth == 0 &&
CheckToken(new string[] { "ET" }, previousCharacters))
{
inTextObject = false;
resultString += " ";
}
else
{
// Start outputting text
if ((c == '(') && (bracketDepth == 0) && (!nextLiteral))
{
bracketDepth = 1;
}
else
{
// Stop outputting text
if ((c == ')') && (bracketDepth == 1) && (!nextLiteral))
{
bracketDepth = 0;
}
else
{
// Just a normal text character:
if (bracketDepth == 1)
{
// Only print out next character no matter what.
// Do not interpret.
if (c == '\\' && !nextLiteral)
{
resultString += c.ToString();
nextLiteral = true;
}
else
{
if (((c >= ' ') && (c <= '~')) ||
((c >= 128) && (c < 255)))
{
resultString += c.ToString();
}
nextLiteral = false;
}
}
}
}
}
}
// Store the recent characters for
// when we have to go back for a checking
for (int j = 0; j < _numberOfCharsToKeep - 1; j++)
{
previousCharacters[j] = previousCharacters[j + 1];
}
previousCharacters[_numberOfCharsToKeep - 1] = c;
// Start of a text object
if (!inTextObject && CheckToken(new string[] { "BT" }, previousCharacters))
{
inTextObject = true;
}
}
return CleanupContent(resultString);
}
catch
{
return "";
}
}
private string CleanupContent(string text)
{
string[] patterns = { #"\\\(", #"\\\)", #"\\226", #"\\222", #"\\223", #"\\224", #"\\340", #"\\342", #"\\344", #"\\300", #"\\302", #"\\304", #"\\351", #"\\350", #"\\352", #"\\353", #"\\311", #"\\310", #"\\312", #"\\313", #"\\362", #"\\364", #"\\366", #"\\322", #"\\324", #"\\326", #"\\354", #"\\356", #"\\357", #"\\314", #"\\316", #"\\317", #"\\347", #"\\307", #"\\371", #"\\373", #"\\374", #"\\331", #"\\333", #"\\334", #"\\256", #"\\231", #"\\253", #"\\273", #"\\251", #"\\221"};
string[] replace = { "(", ")", "-", "'", "\"", "\"", "à", "â", "ä", "À", "Â", "Ä", "é", "è", "ê", "ë", "É", "È", "Ê", "Ë", "ò", "ô", "ö", "Ò", "Ô", "Ö", "ì", "î", "ï", "Ì", "Î", "Ï", "ç", "Ç", "ù", "û", "ü", "Ù", "Û", "Ü", "®", "™", "«", "»", "©", "'" };
for (int i = 0; i < patterns.Length; i++)
{
string regExPattern = patterns[i];
Regex regex = new Regex(regExPattern, RegexOptions.IgnoreCase);
text = regex.Replace(text, replace[i]);
}
return text;
}
#endregion
#region CheckToken
/// <summary>
/// Check if a certain 2 character token just came along (e.g. BT)
/// </summary>
/// <param name="tokens">the searched token</param>
/// <param name="recent">the recent character array</param>
/// <returns></returns>
private bool CheckToken(string[] tokens, char[] recent)
{
foreach (string token in tokens)
{
if ((recent[_numberOfCharsToKeep - 3] == token[0]) &&
(recent[_numberOfCharsToKeep - 2] == token[1]) &&
((recent[_numberOfCharsToKeep - 1] == ' ') ||
(recent[_numberOfCharsToKeep - 1] == 0x0d) ||
(recent[_numberOfCharsToKeep - 1] == 0x0a)) &&
((recent[_numberOfCharsToKeep - 4] == ' ') ||
(recent[_numberOfCharsToKeep - 4] == 0x0d) ||
(recent[_numberOfCharsToKeep - 4] == 0x0a))
)
{
return true;
}
}
return false;
}
#endregion
}
}

PDFClown might help, but I would not recommend it for a big or heavy use application.

public string ReadPdfFile(object Filename, DataTable ReadLibray)
{
PdfReader reader2 = new PdfReader((string)Filename);
string strText = string.Empty;
for (int page = 1; page <= reader2.NumberOfPages; page++)
{
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy();
PdfReader reader = new PdfReader((string)Filename);
String s = PdfTextExtractor.GetTextFromPage(reader, page, its);
s = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
strText = strText + s;
reader.Close();
}
return strText;
}

iText is the best library I know. Originally written in Java, there is a .NET port as well.
See http://www.ujihara.jp/iTextdotNET/en/

itext?
http://www.itextpdf.com/terms-of-use/index.php
Guide
http://www.vogella.com/articles/JavaPDF/article.html

You could look into this:
http://www.codeproject.com/KB/showcase/pdfrasterizer.aspx
It's not completely free, but it looks very nice.
Alex

http://www.c-sharpcorner.com/UploadFile/psingh/PDFFileGenerator12062005235236PM/PDFFileGenerator.aspx is open source and may be a good starting point for you.

aspose pdf works pretty well. then again, you have to pay for it

Have a look at Docotic.Pdf library. It does not require you to make source code of your application open (like iTextSharp with viral AGPL 3 license, for example).
Docotic.Pdf can be used to read PDF files and extract text with or without formatting. Please have a look at the article that shows how to extract text from PDFs.
Disclaimer: I work for Bit Miracle, vendor of the library.

There is also LibHaru
http://libharu.org/wiki/Main_Page

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Problems to extract text from PDF for certain pdfs only C# - c#

Related

C# Removing longest words in each line [duplicate]

extracting Arabic text in c# by using itextsharp

Unable to find csv file on ASP.NET

Display first 2000 words from a long string (Stripping HTML) C# ASP.NET

Reading PDF documents in .Net [closed]

Categories

Resources