PdfTextExtractor.GetTextFromPage() returns empty string

PdfTextExtractor.GetTextFromPage() returns empty string - c#

I'm trying to extract the text from the following PDF with the following code (using iText7 7.2.2) :
var source = (string)GetHttpResult("https://www.bcr.ro/content/dam/ro/bcr/www_bcr_ro/Aur/Cotatii_Aur.pdf", new CookieContainer());
var bytes = Encoding.UTF8.GetBytes(source);
var stream = new MemoryStream(bytes);
var reader = new PdfReader(stream);
var doc = new PdfDocument(reader);
var pages = doc.GetNumberOfPages();
var text = PdfTextExtractor.GetTextFromPage(doc.GetPage(1));
Loading the PDF in my browser (Edge 100.0) works fine.
GetHttpResult() is a simple HttpClient defining a custom CookieContainer, a custom UserAgent, and calling ReadAsStringAsync(). Nothing fancy.
source has the correct PDF content, starting with "%PDF-1.7".
pages has the correct number of pages, which is 2.
But, whatever I try, text is always empty.
Defining an explicit TextExtractionStrategy, trying some Encodings, extracting from all pages in a loop, ..., nothing matters, text is always empty, with no Exception thrown anywhere.
I think I don't read this PDF how it's "meant" to be read, but what is the correct way then (correct content in source, correct number of pages, no Exception anywhere) ?
Thanks.

That's it ! Thanks to mkl and KJ !
I first downloaded the PDF as a byte array so I'm sure it's not modified in any way.
Then, as pdftotext is able to extract the text from this PDF, I searched for a NuGet package able to do the same. I tested almost ten of them, and FreeSpire.PDF finally did it !
Update : Actually, FreeSpire.PDF missed some words, so I finally found PdfPig, able to extract every single word.
Code using PdfPig :
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;
byte[] bytes;
using (HttpClient client = new())
{
bytes = client.GetByteArrayAsync("https://www.bcr.ro/content/dam/ro/bcr/www_bcr_ro/Aur/Cotatii_Aur.pdf").GetAwaiter().GetResult();
}
List<string> words = new();
using (PdfDocument document = PdfDocument.Open(bytes))
{
foreach (Page page in document.GetPages())
{
foreach (Word word in page.GetWords())
{
words.Add(word.Text);
}
}
}
string text = string.Join(" ", words);
Code using FreeSpire.PDF :
using Spire.Pdf;
using Spire.Pdf.Exporting.Text;
byte[] bytes;
using (HttpClient client = new())
{
bytes = client.GetByteArrayAsync("https://www.bcr.ro/content/dam/ro/bcr/www_bcr_ro/Aur/Cotatii_Aur.pdf").GetAwaiter().GetResult();
}
string text = string.Empty;
SimpleTextExtractionStrategy strategy = new();
using (PdfDocument doc = new())
{
doc.LoadFromBytes(bytes);
foreach (PdfPageBase page in doc.Pages)
{
text += page.ExtractText(strategy);
}
}

Related

JsReport save action result in PDF

I would like to generate the PDF based on the View, but I don't want to display it after generating it. Just save to disk.
As I am in Azure, I had to use the version with Docker, but it did not print the footer (page count)
With that I will use iText7 to add the footer (page count) to delete the original PDF and display the new output.
Huge job, but the only way I found it, since Rotativa and other components that work with wkhtmltopdf.org did not print the CSS correctly.
So my problem is:
How to save the PDF without displaying it?
With the example of the site:
https://jsreport.net/learn/dotnet-aspnetcore#save-to-file
Itneeds the return View() which makes me unable to display the modified
PDF and not the original
.
[MiddlewareFilter(typeof(JsReportPipeline))]
public async Task<IActionResult> InvoiceWithHeader()
{
HttpContext.JsReportFeature().Recipe(Recipe.ChromePdf);
HttpContext.JsReportFeature().OnAfterRender((r) => {
using (var file = System.IO.File.Open("report.pdf", FileMode.Create))
{
r.Content.CopyTo(file);
}
r.Content.Seek(0, SeekOrigin.Begin);
});
return View(InvoiceModel.Example());
}
OnAfterRender does not answer my problem, is there how to do the step as I said? or is there another better solution?
Generate PDF of Action by JsReport
Save PDF from JsReport
Add page count by iText7
Delete original PDF from JsReport
View the pdf modified by iText7 in the view
NOTE: using new jsreport.Local.LocalReporting() works perfectly the problem was when going up to Azure.
Update:
I'm tried, but it didn't work
var htmlContent = await JsReportMVCService.RenderViewToStringAsync(HttpContext, RouteData, "/Views/OcorrenciaTalaos/GerarPdf.cshtml", retorno);
(var contentType, var generatedFile) = await GeneratePDFAsync(htmlContent);
using (var fileStream = new FileStream("tempJsReport.pdf", FileMode.Create))
{
await generatedFile.CopyToAsync(fileStream);
}
public async Task<(string ContentType, MemoryStream GeneratedFileStream)> GeneratePDFAsync(string htmlContent)
{
IJsReportFeature feature = new JsReportFeature(HttpContext);
feature.Recipe(Recipe.ChromePdf);
if (!feature.Enabled) return (null, null);
feature.RenderRequest.Template.Content = htmlContent;
// var htmlContent = await JsReportMVCService.RenderViewToStringAsync(HttpCSexontext, RouteData, "GerarPdf", retorno);
var report = await JsReportMVCService.RenderAsync(feature.RenderRequest);
var contentType = report.Meta.ContentType;
MemoryStream ms = new MemoryStream();
report.Content.CopyTo(ms);
return (contentType, ms);
}

You can overwrite the final response inside the OnAfterRender this way:
[MiddlewareFilter(typeof(JsReportPipeline))]
public IActionResult InvoiceDownload()
{
HttpContext.JsReportFeature().Recipe(Recipe.ChromePdf)
.OnAfterRender((r) =>
{
// write current report to file
using(var fileStream = System.IO.File.Create("c://temp/out.pdf"))
{
r.Content.CopyTo(fileStream);
}
// do modifications
// ...
// overwrite response with a new pdf
r.Content = System.IO.File.OpenRead("c://temp/final.pdf");
});
return View("Invoice", InvoiceModel.Example());
}
However, the page numbers should work and you shouldn't need to do it this complicated way. No matter you are in docker or not. Here is the answer in another question

How do I copy multiple streams into 1 for the client to download?

I'm using c# and asp core 3 and have this right now.
string templatePath = Path.Combine(_webHostEnvironment.WebRootPath, #"templates\pdf\test.pdf");
Stream finalStream = new MemoryStream();
foreach (Info p in list)
{
Stream pdfInputStream = new FileStream(path: templatePath, mode: FileMode.Open);
Stream outStream = PdfService.FillForm(pdfInputStream, p);
outStream.Position = 0;
outStream.CopyTo(finalStream);
outStream.Dispose();
pdfInputStream.Dispose();
}
finalStream.Position = 0;
return File(finalStream, "application/pdf", "test.pdf"));
Right now I just get the first PDF when there should be 3. How to combine all the streams (PDF) created in the loop into 1 PDF? I'm using iTextSharp and using this as a guide to produce the FillForm code.
https://medium.com/#taithienbo/fill-out-a-pdf-form-using-itextsharp-for-net-core-4b323cb58459

You can't just combine PDF by adding them into a single stream :-)
You can add each PDF stream to an array and request ITextSharp to combine them and after that returning the newly created stream.
List<Stream> pdfStreams = new List<Stream>();
foreach(var item in list)
{
// Open PDF + fill form
pdfStreams.Add(outstream);
}
var newStream = Merge(pdfStreams);
return File(newStream)
I don't know ITextSharp but it seems you can merge PDFs : https://weblogs.sqlteam.com/mladenp/2014/01/10/simple-merging-of-pdf-documents-with-itextsharp-5-4-5/
Edit
By the way, you could use "using" statement for stream (you wouldn't have to call dispose yourself) and I don't know how heavy are your PDFs but you should maybe consider to use the ".CopyToAsync".

C# Read csv from url and save to database

I'm trying to get data from a csv-file from a Webservice.
If i paste the url in my browser, the csv will be downloaded and look like the following example:
"ID","ProductName","Company"
"1","Apples","Alfreds futterkiste"
"2","Oranges","Alfreds futterkiste"
"3","Bananas","Alfreds futterkiste"
"4","Salad","Alfreds futterkiste"
...next 96 rows
However I don't want to download the csv-file first and then extract data from it afterwards.
The webservice uses pagination and returns 100 rows (determined by the &num-parameter with a max of 100). After the first request i can use the &next-parameter to fetch the next 100 rows based on ID. For instance the url
http://testWebservice123.com/Example.csv?auth=abc&number=100&next=100
will get me rows from ID 101 to 200. So if there are a lot of rows i would end up downloading a lot of csv-files and saving them to the harddrive. So instead of downloading the csv-files first and saving them hdd to I want to get data directly from the webservice to be able to write directly to a database without saving the csv-files.
After a bit of search I came up with the following solution
static void Main(string[] args)
{
string startUrl = "http://testWebservice123.com/Example.csv?auth=abc&number=100";
string url = "";
string deltaRequestParameter = "";
string lastLine;
int numberOfLines = 0;
do
{
url = startUrl + deltaRequestParameter;
WebClient myWebClient = new WebClient();
using (Stream myStream = myWebClient.OpenRead(url))
{
using (StreamReader sr = new StreamReader(myStream))
{
numberOfLines = 0;
while (!sr.EndOfStream)
{
var row = sr.ReadLine();
var values = row.Split(',');
//do whatever with the rows by now - i.e. write to console
Console.WriteLine(values[0] + " " + values[1]);
lastLine = values[0].Replace("\"", ""); //last line in the loop - get the last ID.
numberOfLines++;
deltaRequestParameter = "&next=" + lastLine;
}
}
}
} while (numberOfLines == 101); //since the header is returned each time the number of rows will be 101 until we get to the last request
}
but im not sure if this is an "up to date" way of doing this, or if there is a better way (easier/simpler)? In other words i'm insecure about whether using WebClient and StreamReader is the right way to go?
In this thread: how to read a csv file from a url?
WebClient.DownloadString is mentioned as well as WebRequest. But if I want to write to a database without saving csv to hdd which is the best option?
Furhtermore - will the approach I have taken save data to a temporary disk storage behind the scenes or will all data be read into memmory and then disposed when the loop completes?
I have read the following documentation but can't seem to find out what it does behind the scenes:
StreamReader: https://learn.microsoft.com/en-us/dotnet/api/system.io.streamreader?view=netframework-4.7.2
Stream: https://learn.microsoft.com/en-us/dotnet/api/system.io.stream?view=netframework-4.7.2
Edit:
I guess I could also be using the following "TextFieldParser"...but my questions is really still the same:
(using the Assembly Microsoft.VisualBasic)
using (Stream myStream = myWebClient.OpenRead(url))
{
using (TextFieldParser parser = new TextFieldParser(myStream))
{
numberOfLines = 0;
parser.TrimWhiteSpace = true; // if you want
parser.Delimiters = new[] { "," };
parser.HasFieldsEnclosedInQuotes = true;
while (!parser.EndOfData)
{
string[] line = parser.ReadFields();
Console.WriteLine(line[0].ToString() + " " + line[1].ToString());
numberOfLines++;
deltaRequestParameter = "&next=" + line[0].ToString();
}
}
}

The HttpClient class on System.Web.Http is available as of .Net 4.5. You have to work with async code, but it's not a bad idea to get into it if you're dealing with the web.
As sample data, I'll use jsonplaceholder's "todo" list. It provides json data, not csv data, but it gives a simple enough structure that can serve our purpose in the example below.
This is the core function, which fetches from jsonplaceholder in a similar way to your "testWebService123" site, although I'm just getting the first 3 todo's, as opposed to testing for when I've hit the last page (you would probably keep your do-while) logic on that one.
async void DownloadPagesAsync() {
for (var i = 1; i < 3; i++) {
var pageToGet = $"https://jsonplaceholder.typicode.com/todos/{i}";
using (var client = new HttpClient())
using (HttpResponseMessage response = await client.GetAsync(pageToGet))
using (HttpContent content = response.Content)
using (var stream = (MemoryStream) await content.ReadAsStreamAsync())
using (var sr = new StreamReader(stream))
while (!sr.EndOfStream) {
var row =
sr.ReadLine()
.Replace(#"""", "")
.Replace(",", "");
if (row.IndexOf(":") == -1)
continue;
var values = row.Split(':');
Console.WriteLine($"{values[0]}, {values[1]}");
}
}
}
This is how you would call the function, such as you would in a Main() method:
Task t = new Task(DownloadPagesAsync);
t.Start();
The new task, here is taking in an "action", or or in other words a function that returns void, as a parameter. Then you start the task. Be careful, it is asynchronous, so any code you have after t.Start() may very well run before your task completes.
As to your question as to whether the stream reads "in memory" or not, running GetType() on "stream" in the code resulted in a "MemoryStream" type, though it seems to only be recognized as a "Stream" object at compile time. A MemoryStream is definately in-memory. I'm not really sure if any of the other kinds of stream objects save temporary files behind the scenes, but I'm leaning towards not.
But looking into the inner workings of a class, though commendable, is not usually required for your anxiety about disposing. For any class, just see if it implements IDisposable. If it does, then put in in a "using" statement, as you have done in your code. When the program terminates, as expected or via error, the program will implement the proper disposures after control has passed out of the "using" block.
HttpClient is in fact the newer approach. From what I understand, it does not replace all of the functionality for WebClient, but is stronger in many respects. See this SO site for more details comparing the two classes.
Also, something to know about WebClient is that it can be simple, but limiting. If you run into issues, you will need to look into the HttpWebRequest class, which is a "lower level" class that gives you greater access to the nuts and bolts of things (such as working with cookies).

Unable to cast object of type 'iTextSharp.text.pdf.PdfArray' to type 'iTextSharp.text.pdf.PRStream'

I have to replace number "14-1" into "10-2". I am using following iText code but getting following type cast error. Can any one help me by modifying the program and remove the casting issue:
I have many PDF's where i have to replace the numbers at same location. I also need to understand it logically to how to do this:
using System;
using System.IO;
using System.Text;
using iTextSharp.text.io;
using iTextSharp.text.pdf;
using System.Windows.Forms;
namespace iText5
{
public partial class Form1 : Form
{
public Form1()
{
InitializeComponent();
}
public const string src = #"D:\test1\A.pdf";
public const string dest = #"D:\test1\ENV1.pdf";
private void button1_Click(object sender, EventArgs e)
{
FileInfo file = new FileInfo(dest);
file.Directory.Create();
manipulatePdf(src, dest);
}
public void manipulatePdf(String src, String dest)
{
PdfReader reader = new PdfReader(src);
PdfDictionary dict = reader.GetPageN(1);
PdfObject obj = dict.GetDirectObject(PdfName.CONTENTS);
PRStream stream = (PRStream)obj;
byte[] data = PdfReader.GetStreamBytes(stream);
string xyz = Encoding.UTF8.GetString(data);
byte[] newBytes = Encoding.UTF8.GetBytes(xyz.Replace("14-1", "10-2"));
stream.SetData(newBytes);
PdfStamper stamper = new PdfStamper(reader, new FileStream(dest, FileMode.Create));
stamper.Close();
reader.Close();
}
}
}

This is a problem:
PdfDictionary dict = reader.GetPageN(1);
PdfObject obj = dict.GetDirectObject(PdfName.CONTENTS);
PRStream stream = (PRStream)obj;
First you get a page dictionary. That page dictionary has a /Contents entry. If you read the PDF standard (ISO 32000), then you see that the value of the /Contents entry can be either a stream, or an array. You assume that it's always a stream. In some cases, your code will work, but in cases where the value of the /Contents entry is an array of references to a series of streams, you will get a class cast error (for the obvious reason that an array of streams is not the same as a stream).
I think that you want to do something like this:
byte[] data = reader.GetPageContent(i);
string xyz = PdfEncodings.ConvertToString(data, PdfObject.TEXT_PDFDOCENCODING);
string abc = xyz.Replace("14-1", "10-2");
reader.SetPageContent(i, PdfEncodings.ConvertToBytes(abc, PdfObject.TEXT_PDFDOCENCODING));
However, that's a very bad idea, because of the reasons explained in the answers to these questions:
Replace the text in pdf document using itextSharp
PDF text replace not working
Is there any API in C# or .net to edit pdf documents?
Using ContentByteUtils for raw PDF manipulation
...
You are making the assumption that you will find a literal string with value "14-1" in the content. That might be true for simple PDF documents, but in many cases the appearance of "14-1" on a page (that you can read with your eyes) doesn't mean the string "14-1" is present as such in the content (that you extract with GetPageContent). That string could be part of an XObject, or the syntax to render "14-1" could be constructed in such a way that xyz.Replace("14-1", "10-2") won't change xyz in any way.
Bottom line: PDF is not a format for editing. A page in a PDF file consists of content that is added at absolute positions. The content on a page doesn't reflow if you change it (e.g. the existing content won't move to the next line or to the next page if you add extra content). Instead of editing a PDF document, you should edit the source that was used to create the document, and then create a new PDF from that source.
Important: you are using an old version of iText. We abandoned the name iTextSharp more than two years ago in favor of iText for .NET. The current version of iText is iText 7.1.2; see Nuget: https://www.nuget.org/packages/itext7/
Many people think that iText 5.5.13 is the latest version. That assumption is wrong. iText 5 has been discontinued and is no longer supported. The recent 5.5.x versions are maintenance releases for paying customers who can't migrate to iText 7 right away.

why wont XML read from string? (but will from .txt of same data)

this is the code in question:
using (var file = MemoryMappedFile.OpenExisting("AIDA64_SensorValues"))
{
using (var readerz = file.CreateViewAccessor(0, 0))
{
var bytes = new byte[567];
var encoding = Encoding.ASCII;
readerz.ReadArray<byte>(0, bytes, 0, bytes.Length);
File.WriteAllText("C:\\myFile.txt", encoding.GetString(bytes));
var readerSettings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
using (var reader = XmlReader.Create("C:\\myFile.txt", readerSettings))
{
This is what myfile.txt looks like:
<sys><id>SCPUCLK</id><label>CPU Clock</label><value>1598</value></sys><sys><id>SCPUFSB</id><label>CPU FSB</label><value>266</value></sys><sys><id>SMEMSPEED</id><label>Memory Speed</label><value>DDR2-667</value></sys><sys><id>SFREEMEM</id><label>Free Memory</label><value>415</value></sys><sys><id>SGPU1CLK</id><label>GPU Clock</label><value>562</value></sys><sys><id>SFREELVMEM</id><label>Free Local Video Memory</label><value>229</value></sys><temp><id>TCPU</id><label>CPU</label><value>42</value></temp><temp><id>TGPU1</id><label>GPU</label><value>58</value></temp>
if i write the data to a txt file on the hard drive with:
File.WriteAllText("C:\\myFile.txt", encoding.GetString(bytes));
then read that same text file with the fragment XmlReader:
XmlReader.Create("C:\\myFile.txt");
it reads it just fine, the program runs and completes like it supposed to, but then if i directly read with the fragment XmlReader like:
XmlReader.Create(encoding.GetString(bytes));
I get exception when run " illegal characters in path" on the XmlReader.Create line.
ive tried writing it to a separate string first and reading that with xmlreader, and it wouldn't help to try to print it to CMD to see what it looks like because CMD wouldnt show the invalid characters im dealing with right?
but oh well i did Console.WriteLine(encoding.GetString(bytes)); and it precisely matched the txt file.
so somehow writing it to the text file is removing some "illegal characters"? what do you guys think?

XmlReader.Create(encoding.GetString(bytes));
XmlReader.Create() interprets your string as the URI where it should read a file from. Instead encapsulate your bytes in a StringReader:
StringReader sr = new StringReader(encoding.GetString(bytes));
XmlReader.Create(sr);

Here:
XmlReader.Create(encoding.GetString(bytes));
you are simply invoking the following method which takes a string representing a filename. However you are passing the actual XML string to it which obviously is an invalid filename.
If you want to load the reader from a buffer you could use a stream:
byte[] bytes = ... represents the XML bytes
using (var stream = new MemoryStream(bytes))
using (var reader = XmlReader.Create(stream))
{
...
}

The method XmlReader.Create() with a single string as argument needs a URI passed and not the XML document as string, please refer to the MSDN. It tries to open a file named "<..." which is an invalid URI. You can pass a Stream instead.

You are passing the xml content in the place where it is expecting a path, as evidenced by the error - illegal characters in path
Use an appropriate overload, and pass a stream - http://msdn.microsoft.com/en-us/library/system.xml.xmlreader.create.aspx

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

PdfTextExtractor.GetTextFromPage() returns empty string - c#

Related

JsReport save action result in PDF

How do I copy multiple streams into 1 for the client to download?

C# Read csv from url and save to database

Unable to cast object of type 'iTextSharp.text.pdf.PdfArray' to type 'iTextSharp.text.pdf.PRStream'

why wont XML read from string? (but will from .txt of same data)

Categories

Resources