How to extract all pages and attachments from PDF to PNG - c#

I am trying to create a process in .NET to convert a PDF and all its pages and attachments to PNGs. I am evaluating libraries and came across PDFiumSharp, but it is not working for me. Here is my code:
string Inputfile = "input.pdf";
string OutputFolder = "Output";
string fileName = Path.GetFileNameWithoutExtension(Inputfile);
using (PdfDocument doc = new PdfDocument(Inputfile))
{
for (int i = 0; i < doc.Pages.Count; i++)
{
var page = doc.Pages[i];
using (var bitmap = new PDFiumBitmap((int)page.Width, (int)page.Height, false))
{
page.Render(bitmap);
var targetFile = Path.Combine(OutputFolder, fileName + "_" + i + ".png");
bitmap.Save(targetFile);
}
}
}
When I run this code, I get this exception:
[screenshot of exception, image not available]
Does anyone know how to fix this? Also does PDFiumSharp support extracting PDF attachments? If not, does anyone have any other ideas on how to achieve my goal?

PDFium does not appear to support extracting PDF attachments. To achieve your goal, you could look at another library that supports both extracting PDF attachments and converting PDFs to PNGs.
I am an employee of LEADTOOLS, whose PDF SDK you can try out via these two NuGet packages:
https://www.nuget.org/packages/Leadtools.Pdf/
https://www.nuget.org/packages/Leadtools.Document.Sdk/
Here is some code that will convert a PDF + all attachments in the PDF to separate PNGs in an output directory:
SetLicense();
cache = new FileCache { CacheDirectory = "cache" };
List<LEADDocument> documents = new List<LEADDocument>();
if (!Directory.Exists(OutputDir))
Directory.CreateDirectory(OutputDir);
using var document = DocumentFactory.LoadFromFile("attachments.pdf", new LoadDocumentOptions { Cache = cache, LoadAttachmentsMode = DocumentLoadAttachmentsMode.AsAttachments });
if (document.Pages.Count > 0)
documents.Add(document);
foreach (var attachment in document.Attachments)
documents.Add(document.LoadDocumentAttachment(new LoadAttachmentOptions { AttachmentNumber = attachment.AttachmentNumber }));
ConvertDocuments(documents, RasterImageFormat.Png);
And the ConvertDocuments method:
static void ConvertDocuments(IEnumerable<LEADDocument> documents, RasterImageFormat imageFormat)
{
using var converter = new DocumentConverter();
using var ocrEngine = OcrEngineManager.CreateEngine(OcrEngineType.LEAD);
ocrEngine.Startup(null, null, null, null);
converter.SetOcrEngineInstance(ocrEngine, false);
converter.SetDocumentWriterInstance(new DocumentWriter());
foreach (var document in documents)
{
var name = string.IsNullOrEmpty(document.Name) ? "Attachment" : document.Name;
string outputFile = Path.Combine(OutputDir, $"{name}.{RasterCodecs.GetExtension(imageFormat)}");
int count = 1;
while (File.Exists(outputFile))
outputFile = Path.Combine(OutputDir, $"{name}({count++}).{RasterCodecs.GetExtension(imageFormat)}");
var jobData = new DocumentConverterJobData
{
Document = document,
Cache = cache,
DocumentFormat = DocumentFormat.User,
RasterImageFormat = imageFormat,
RasterImageBitsPerPixel = 0,
OutputDocumentFileName = outputFile,
};
var job = converter.Jobs.CreateJob(jobData);
converter.Jobs.RunJob(job);
}
}

Related

.Net Core: Reading data from CSV & Excel files

Using .NET Core and C# here.
I have a UI from which users can upload Excel or CSV files. Once uploaded, the file goes to my Web API, which reads the data from it and returns JSON.
My API code is:
[HttpPost("upload")]
public async Task<IActionResult> FileUpload(IFormFile file)
{
JArray data = new JArray();
using (ExcelPackage package = new ExcelPackage(file.OpenReadStream()))
{
ExcelWorksheet worksheet = package.Workbook.Worksheets[1];
//Process, read from excel here and populate jarray
}
return Ok(data );
}
In the code above I am using EPPlus to read the Excel file. This works fine for Excel files, but EPPlus cannot read CSV files, which is a limitation of the library.
I searched and found another library, CsvHelper: https://joshclose.github.io/CsvHelper/ The issue is that it has the opposite limitation: it can read CSV but not Excel.
Is there any library available that supports reading both?
Or would it be possible to use only EPPlus, converting the uploaded CSV to Excel on the fly and then reading it? (Please note I am not storing the Excel file anywhere, so I can't use "save as" to save it as Excel.)
Any input, please?
--Updated - Added code for reading data from excel---
int rowCount = worksheet.Dimension.End.Row;
int colCount = worksheet.Dimension.End.Column;
for (int row = 1; row <= rowCount; row++)
{
for (int col = 1; col <= colCount; col++)
{
var rowValue = worksheet.Cells[row, col].Value;
}
}
//With the code suggested in the answer rowcount is always 1
You can use EPPlus to load a CSV file into an ExcelPackage in memory, without writing to a file. Below is an example. You may have to change some of the parameters based on your CSV file specs.
[HttpPost("upload")]
public async Task<IActionResult> FileUpload(IFormFile file)
{
var result = string.Empty;
string worksheetsName = "data";
bool firstRowIsHeader = false;
var format = new ExcelTextFormat();
format.Delimiter = ',';
format.TextQualifier = '"';
using (var reader = new System.IO.StreamReader(file.OpenReadStream()))
using (ExcelPackage package = new ExcelPackage())
{
result = reader.ReadToEnd();
ExcelWorksheet worksheet =
package.Workbook.Worksheets.Add(worksheetsName);
worksheet.Cells["A1"].LoadFromText(result, format, OfficeOpenXml.Table.TableStyles.Medium27, firstRowIsHeader);
}
}
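Regarding the comment in the question that the row count is always 1: one possible cause (an assumption, since the uploaded file is not shown) is a line-ending mismatch. As far as I know, ExcelTextFormat has an EOL property that defaults to "\r\n", so a CSV that uses bare "\n" line endings could be loaded as a single row. Below is a minimal sketch, in plain .NET, of detecting the terminator before loading. DetectEol is a hypothetical helper and not part of EPPlus, and it does not account for line breaks inside quoted fields.

```csharp
using System;

// Hypothetical helper: returns the first line terminator found in the text
// ("\r\n", "\n" or "\r"). Falls back to "\r\n" when there is no line break.
static string DetectEol(string text)
{
    int i = text.IndexOfAny(new[] { '\r', '\n' });
    if (i < 0) return "\r\n";
    if (text[i] == '\n') return "\n";
    // text[i] is '\r': check whether it starts a "\r\n" pair.
    return (i + 1 < text.Length && text[i + 1] == '\n') ? "\r\n" : "\r";
}

string csv = "id,name\n1,Ann\n2,Bob\n";   // sample data, LF line endings
string eol = DetectEol(csv);
Console.WriteLine(eol == "\n" ? "LF" : eol == "\r\n" ? "CRLF" : "CR"); // prints "LF"

// In the answer's code one would then set (assumption, see above):
//   format.EOL = eol;
```

If the detected value differs from "\r\n", assigning it to format.EOL before calling LoadFromText is worth trying.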
Here is a version using Aspose, which unfortunately is not free but works very well. My API uses streaming with Content-Type: multipart/form-data rather than the IFormFile implementation:
[HttpPut]
[DisableFormValueModelBinding]
public async Task<IActionResult> UploadSpreadsheet()
{
if (!MultipartRequestHelper.IsMultipartContentType(Request.ContentType))
{
return BadRequest($"Expected a multipart request, but got {Request.ContentType}");
}
var boundary = MultipartRequestHelper.GetBoundary(MediaTypeHeaderValue.Parse(Request.ContentType), _defaultFormOptions.MultipartBoundaryLengthLimit);
var reader = new MultipartReader(boundary, HttpContext.Request.Body);
var section = (await reader.ReadNextSectionAsync()).AsFileSection();
//If you're doing CSV, you add this line:
LoadOptions loadOptions = new LoadOptions(LoadFormat.CSV);
var workbook = new Workbook(section.FileStream, loadOptions);
Cells cells = workbook.Worksheets[0].Cells;
var rows = cells.Rows.Cast<Row>().Where(x => !x.IsBlank);
//Do whatever else you want here
Please try the code below:
private string uploadCSV(FileUpload fl)
{
string fileName = "";
serverLocation = Request.PhysicalApplicationPath + "ExcelFiles\\";
fileName = fl.PostedFile.FileName;
int FileSize = fl.PostedFile.ContentLength;
string contentType = fl.PostedFile.ContentType;
fl.PostedFile.SaveAs(serverLocation + fileName);
string rpath = string.Empty, dir = string.Empty;
HttpContext context = HttpContext.Current;
string baseUrl = context.Request.Url.Scheme + "://" + context.Request.Url.Authority + context.Request.ApplicationPath.TrimEnd('/') + '/';
try
{
rpath = serverLocation + fileName;//Server.MapPath(dir + fileName);
using (Stream InputStream = fl.PostedFile.InputStream)
{
Object o = new object();
lock (o)
{
byte[] buffer = new byte[InputStream.Length];
InputStream.Read(buffer, 0, (int)InputStream.Length);
lock (o)
{
File.WriteAllBytes(rpath, buffer);
buffer = null;
}
InputStream.Close();
}
}
}
catch (Exception ex)
{
lblSOTargetVal.Text = ex.Message.ToString();
}
return rpath;
}
Use the Open XML SDK package and build a working solution on top of it.

Using PDFsharp and MigraDoc to write to and then read from a PDF

I'm trying to write verification code for our PDF generating routines, and I'm having difficulty getting PDFsharp to extract text from files created with MigraDoc. The ExtractText code works with other PDFs, but not with the PDFs that I generate with MigraDoc (see code below.)
Any tips on what I'm doing wrong?
//Create the Doc
var doc = new MigraDoc.DocumentObjectModel.Document();
doc.Info.Title = "VerifyReadWrite";
var section = doc.AddSection();
section.AddParagraph("ABCDEF abcdef");
//Render the PDF
var renderer = new PdfDocumentRenderer(true);
var pdf = new PdfDocument();
renderer.PdfDocument = pdf;
renderer.Document = doc;
renderer.RenderDocument();
var msOut = new MemoryStream();
pdf.Save(msOut, true);
var pdfBytes = msOut.ToArray();
//Read the PDF into PdfSharp
var ms = new MemoryStream(pdfBytes);
var pdfRead = PdfSharp.Pdf.IO.PdfReader.Open(ms, PdfDocumentOpenMode.ReadOnly);
var segments = pdfRead.Pages[0].ExtractText().ToList();
Results in the following:
segments[0] = "\0$\0%\0&\0'\0(\0)"
segments[1] = "\0D\0E\0F\0G\0H\0I"
I'd expect to see:
segments[0] = "ABCDEF"
segments[1] = "abcdef"
I'm using the ExtractText code from here:
C# Extract text from PDF using PdfSharp
and it works very well for all but PDFs generated with MigraDoc.
public static IEnumerable<string> ExtractText(this PdfPage page)
{
var content = ContentReader.ReadContent(page);
var text = content.ExtractText();
return text.Select(x => x.Trim());
}
public static IEnumerable<string> ExtractText(this CObject cObject)
{
if (cObject is COperator)
{
var cOperator = (COperator) cObject;
if (cOperator.OpCode.Name == OpCodeName.Tj.ToString() ||
cOperator.OpCode.Name == OpCodeName.TJ.ToString())
{
foreach (var cOperand in cOperator.Operands)
foreach (var txt in ExtractText(cOperand))
yield return txt;
}
}
else
{
var sequence = cObject as CSequence;
if (sequence != null)
{
var cSequence = sequence;
foreach (var element in cSequence)
foreach (var txt in ExtractText(element))
yield return txt;
}
else if (cObject is CString)
{
var cString = (CString) cObject;
yield return cString.Value;
}
}
}
It seems the code used to extract text does not support all cases.
Try new PdfDocumentRenderer(false) (instead of 'true'). AFAIK this will lead to a different encoding and the text extraction might work.
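To illustrate why the extracted strings look wrong: the output in the question is consistent with two-byte glyph indices rather than character codes. The sketch below, plain .NET with no PDFsharp calls, splits the first segment into 16-bit values. ToGlyphIds is a hypothetical helper. The offset of 29 is only an observation from this particular output ('$' is 36 and 'A' is 65, 'D' is 68 and 'a' is 97) and should not be relied on in general; a proper decoder would need the font's ToUnicode mapping.

```csharp
using System;
using System.Linq;

// Hypothetical helper: interprets each pair of chars as one big-endian
// 16-bit glyph index.
static int[] ToGlyphIds(string raw)
{
    var ids = new int[raw.Length / 2];
    for (int i = 0; i < ids.Length; i++)
        ids[i] = (raw[2 * i] << 8) | raw[2 * i + 1];
    return ids;
}

// segments[0] exactly as reported in the question.
string segment0 = "\0$\0%\0&\0'\0(\0)";
int[] glyphs = ToGlyphIds(segment0);
Console.WriteLine(string.Join(",", glyphs)); // prints 36,37,38,39,40,41

// For this sample only: adding 29 happens to give back the original text.
string guess = new string(glyphs.Select(g => (char)(g + 29)).ToArray());
Console.WriteLine(guess); // prints ABCDEF
```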

C# Download the sound of a youtube video

I can download a video from youtube but I want the sound only. How can I do that?
Code I have for downloading the video (Using VideoLibrary):
YouTube youtube = YouTube.Default;
Video vid = youtube.GetVideo(txt_youtubeurl.Text);
System.IO.File.WriteAllBytes(source + vid.FullName, vid.GetBytes());
Install the NuGet packages: MediaToolkit and VideoLibrary, it will allow you to do the conversion by file extension.
var source = @"<your destination folder>";
var youtube = YouTube.Default;
var vid = youtube.GetVideo("<video url>");
File.WriteAllBytes(source + vid.FullName, vid.GetBytes());
var inputFile = new MediaFile { Filename = source + vid.FullName };
var outputFile = new MediaFile { Filename = $"{source + vid.FullName}.mp3" };
using (var engine = new Engine())
{
engine.GetMetadata(inputFile);
engine.Convert(inputFile, outputFile);
}
The code above works great; you don't need to download the video separately first. I wrapped it in a method so that it is easier to use for beginners like me.
You need the NuGet packages MediaToolkit and VideoLibrary.
Example URL: https://www.youtube.com/watch?v=lzm5llVmR2E
SaveToFolder just needs to be a path to save the file to.
MP3Name is the name of the MP3 file to save.
I have tested this code. Hope this helps someone.
private void SaveMP3(string SaveToFolder, string VideoURL, string MP3Name)
{
var source = @SaveToFolder;
var youtube = YouTube.Default;
var vid = youtube.GetVideo(VideoURL);
File.WriteAllBytes(source + vid.FullName, vid.GetBytes());
var inputFile = new MediaFile { Filename = source + vid.FullName };
var outputFile = new MediaFile { Filename = $"{MP3Name}.mp3" };
using (var engine = new Engine())
{
engine.GetMetadata(inputFile);
engine.Convert(inputFile, outputFile);
}
}
Based on this topic, I have developed a simple program to download a YouTube playlist. Hope this helps someone. It's just a Main.cs file: Youtube Playlist Downloader - Mp4 & Mp3
OK, I found a better way, since the code above didn't normalize the audio. Posting it for others.
First Add Nuget package: https://www.nuget.org/packages/NReco.VideoConverter/
To Convert MP4 to MP3
// Client
var client = new YoutubeClient();
var videoId = NormalizeVideoId(txtFileURL.Text);
var video = await client.GetVideoAsync(videoId);
var streamInfoSet = await client.GetVideoMediaStreamInfosAsync(videoId);
// Get the best muxed stream
var streamInfo = streamInfoSet.Muxed.WithHighestVideoQuality();
// Compose file name, based on metadata
var fileExtension = streamInfo.Container.GetFileExtension();
var fileName = $"{video.Title}.{fileExtension}";
// Replace illegal characters in file name
fileName = RemoveIllegalFileNameChars(fileName);
tmrVideo.Enabled = true;
// Download video
txtMessages.Text = "Downloading Video please wait ... ";
//using (var progress = new ProgressBar())
await client.DownloadMediaStreamAsync(streamInfo, fileName);
// Add Nuget package: https://www.nuget.org/packages/NReco.VideoConverter/ To Convert MP4 to MP3
if (ckbAudioOnly.Checked)
{
var Convert = new NReco.VideoConverter.FFMpegConverter();
String SaveMP3File = MP3FolderPath + fileName.Replace(".mp4", ".mp3");
Convert.ConvertMedia(fileName, SaveMP3File, "mp3");
//Delete the MP4 file after conversion
File.Delete(fileName);
LoadMP3Files();
txtMessages.Text = "File Converted to MP3";
tmrVideo.Enabled = false;
txtMessages.BackColor = Color.White;
if (ckbAutoPlay.Checked) { PlayFile(SaveMP3File); }
return;
}
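The snippet above calls RemoveIllegalFileNameChars without showing it. A minimal sketch of what such a helper could look like follows. This is an assumption about its behavior, not the original author's code, and the replacement character is my choice. Note that the set of invalid characters returned by Path.GetInvalidFileNameChars depends on the operating system.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical implementation: replaces every character that is not allowed
// in a file name on the current OS with the given replacement character.
static string RemoveIllegalFileNameChars(string fileName, char replacement = '_')
{
    char[] invalid = Path.GetInvalidFileNameChars();
    return new string(fileName
        .Select(c => Array.IndexOf(invalid, c) >= 0 ? replacement : c)
        .ToArray());
}

// A video title containing a path separator, which is invalid everywhere.
Console.WriteLine(RemoveIllegalFileNameChars("AC/DC Live.mp4"));
```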
I like the idea of using a method. I tried SaveMP3() but it had some problems.
This worked for me:
private void SaveMP3(string SaveToFolder, string VideoURL, string MP3Name)
{
string source = SaveToFolder;
var youtube = YouTube.Default;
var vid = youtube.GetVideo(VideoURL);
string videopath = Path.Combine(source, vid.FullName);
File.WriteAllBytes(videopath, vid.GetBytes());
var inputFile = new MediaFile { Filename = Path.Combine(source, vid.FullName) };
var outputFile = new MediaFile { Filename = Path.Combine(source , $"{MP3Name}.mp3") };
using (var engine = new Engine())
{
engine.GetMetadata(inputFile);
engine.Convert(inputFile, outputFile);
}
File.Delete(Path.Combine(source, vid.FullName));
}

Read more than one file

I am writing a PDF to Word converter, which works perfectly fine for a single file. But I want to be able to convert more than one file.
What happens now is that it reads only the first file and converts it.
public static void PdfToImage()
{
try
{
Application application = null;
application = new Application();
var doc = application.Documents.Add();
string path = @"C:\Users\Test\Desktop\pdfToWord\";
foreach (string file in Directory.EnumerateFiles(path, "*.pdf"))
{
using (var document = PdfiumViewer.PdfDocument.Load(file))
{
int pagecount = document.PageCount;
for (int index = 0; index < pagecount; index++)
{
var image = document.Render(index, 200, 200, true);
image.Save(@"C:\Users\chnikos\Desktop\pdfToWord\output" + index.ToString("000") + ".png", ImageFormat.Png);
application.Selection.InlineShapes.AddPicture(@"C:\Users\chnikos\Desktop\pdfToWord\output" + index.ToString("000") + ".png");
}
string getFileName = file.Substring(file.LastIndexOf("\\"));
string getFileWithoutExtras = Regex.Replace(getFileName, @"\\", "");
string getFileWihtoutExtension = Regex.Replace(getFileWithoutExtras, @".pdf", "");
string fileName = @"C:\Users\Test\Desktop\pdfToWord\" + getFileWihtoutExtension;
doc.PageSetup.PaperSize = WdPaperSize.wdPaperA4;
foreach (Microsoft.Office.Interop.Word.InlineShape inline in doc.InlineShapes)
{
if (inline.Height > inline.Width)
{
inline.ScaleWidth = 250;
inline.ScaleHeight = 250;
}
}
doc.PageSetup.TopMargin = 28.29f;
doc.PageSetup.LeftMargin = 28.29f;
doc.PageSetup.RightMargin = 30.29f;
doc.PageSetup.BottomMargin = 28.29f;
application.ActiveDocument.SaveAs(fileName, WdSaveFormat.wdFormatDocument);
doc.Close();
}
}
I thought that with my foreach this problem should not occur. And yes, there is more than one PDF in this folder.
The line
var doc = application.Documents.Add();
is outside the foreach loop. So you only create a single word document for all your *.pdf files.
Move the above line inside the foreach loop to add a new word document for each *.pdf file.
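A related point: the question's code saves the page images as output000.png, output001.png and so on for every PDF, so the images of one file overwrite those of the previous one. A minimal sketch of deriving per-file names with System.IO.Path instead of Substring and Regex follows. PagePngPath and WordDocPath are hypothetical helper names, and the naming scheme is my assumption, not something from the question.

```csharp
using System;
using System.IO;

// Hypothetical helper: PNG path for one page, placed next to the PDF and
// prefixed with the PDF's own name so different PDFs do not collide.
static string PagePngPath(string pdfFile, int pageIndex)
{
    string folder = Path.GetDirectoryName(pdfFile) ?? ".";
    string baseName = Path.GetFileNameWithoutExtension(pdfFile);
    return Path.Combine(folder, $"{baseName}_{pageIndex:000}.png");
}

// Hypothetical helper: target path for the Word document (no extension,
// as in the question, where SaveAs is given a path without one).
static string WordDocPath(string pdfFile)
{
    string folder = Path.GetDirectoryName(pdfFile) ?? ".";
    return Path.Combine(folder, Path.GetFileNameWithoutExtension(pdfFile));
}

string sample = Path.Combine("pdfToWord", "report.pdf"); // hypothetical input
Console.WriteLine(PagePngPath(sample, 3));
Console.WriteLine(WordDocPath(sample));
```

Inside the foreach, these would replace the hard-coded output paths and the Substring/Regex lines.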

Telerik RenderReport

I have some problems with Telerik reports.
It feels like I have missed something...
I want to create a list of reports and then write them to ONE file.
But when I write it out, I only get one page.
The writer seems to write over page 1 on every iteration of the foreach, so only one page ends up in the file.
But I want several pages, in this case 10.
I have tried writing with FileStream, File and more...
Does anyone have a good idea?
public void WriteToFile()
{
string path = @"C:\";
string test = "test";
var report = new Report2();
var procceser = new ReportProcessor();
var list = new List<RenderingResult>();
for (int i = 0; i < 10; i++)
{
var res = procceser.RenderReport("PDF", report, null);
list.Add(res);
}
string filePath = Path.Combine(path, test);
var Writer = new BinaryWriter(File.Create(filePath));
foreach (var renderingResult in list)
{
Writer.Write(renderingResult.DocumentBytes);
}
Writer.Flush();
Writer.Close();
}
