How to extract the text of a PDF file using Azure Functions? - c#

I want to create an Azure Function that gets triggered anytime a file is uploaded to blob storage and extracts the text of a PDF file. I don't know what would be the best library to use either.
I found this post that shows how to use PdfSharp to extract the text of a PDF file, but I can't seem to get it working since it's my first time using Azure Functions.

This question is overly broad and will probably be closed as such. But here are some pointers.
Start by installing the Azure Storage Emulator so that you can create blobs locally for testing.
Create an Azure Functions v2 project and set up a Blob Storage trigger so that whenever something is written to your local storage, the trigger will fire. The blob trigger is described in the Azure Functions documentation.
Once you can hit a breakpoint in your Azure Function when a blob is added to your local emulator, you'll need to get the bytes and extract the text using a PDF library of your choice. There are many; some are free and some are paid. Suggesting one and walking through its code could run several thousand words, so which one you pick and use is up to you.
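That said, the trigger wiring itself is small. Here is a minimal sketch of a v2 blob-triggered function, assuming a container named "pdfs" and using PdfPig as one example of a free extraction library (both choices are mine, not the question's):

    using System.IO;
    using System.Text;
    using Microsoft.Azure.WebJobs;
    using Microsoft.Extensions.Logging;
    using UglyToad.PdfPig;

    public static class ExtractPdfText
    {
        [FunctionName("ExtractPdfText")]
        public static void Run(
            [BlobTrigger("pdfs/{name}")] Stream pdfBlob,
            string name,
            ILogger log)
        {
            // Buffer the blob so the PDF parser gets a seekable source.
            using (var buffer = new MemoryStream())
            {
                pdfBlob.CopyTo(buffer);

                var text = new StringBuilder();
                using (var document = PdfDocument.Open(buffer.ToArray()))
                {
                    foreach (var page in document.GetPages())
                        text.AppendLine(page.Text);
                }
                log.LogInformation($"Extracted {text.Length} characters from {name}.");
            }
        }
    }

With the emulator running and AzureWebJobsStorage set to UseDevelopmentStorage=true, dropping a PDF into the "pdfs" container should hit this function.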

Related

Azure Media Services - Programmatically Transcode Blob into Asset

I am able to programmatically upload a local mp4 file into blob storage that is associated with my Azure Media Services.
How can I programmatically transcode the blob into an asset, preferably in C#?
I have a working ps1 script that I need to convert to a C# application. Following along with the script, my next step is to programmatically create a transcoding job to create the asset. I've been going down too many rabbit holes; I just need to find a good example.
Fantastic! Have you tried out one of our many samples that show how to do that already?
Each of the samples in our .NET repo here - https://github.com/Azure-Samples/media-services-v3-dotnet/tree/main/VideoEncoding -
shows how to upload and encode an asset into a new output asset using a Transform and a Job entity. After that, the sample simply downloads the contents of the output asset back to your local system.
Hopefully it was not too difficult to locate our samples from our documentation; otherwise, I might have some work for our doc team to make them easier to find. Watch out for Google searches that lead you to the older, deprecated v2 API - most of the time the page mentions whether it is deprecated. And don't ask ChatGPT - it is very wrong here, as it draws on 2019-era information about the v2 API.
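For orientation, the core of those samples is a Transform plus a Job. Here is a rough sketch using the Microsoft.Azure.Management.Media v3 SDK; it assumes an authenticated IAzureMediaServicesClient named client, an input asset that already holds the uploaded mp4, and placeholder resource and asset names:

    // Define (or reuse) a Transform that says how to encode.
    Transform transform = await client.Transforms.CreateOrUpdateAsync(
        resourceGroup, accountName, "MyEncodingTransform",
        new[]
        {
            new TransformOutput(
                new BuiltInStandardEncoderPreset(EncoderNamedPreset.AdaptiveStreaming))
        });

    // Create the output asset the job will write into.
    await client.Assets.CreateOrUpdateAsync(
        resourceGroup, accountName, "myOutputAsset", new Asset());

    // Submit the job: input asset in, encoded output asset out.
    Job job = await client.Jobs.CreateAsync(
        resourceGroup, accountName, "MyEncodingTransform", "myJob",
        new Job
        {
            Input = new JobInputAsset(assetName: "myInputAsset"),
            Outputs = new JobOutput[] { new JobOutputAsset(assetName: "myOutputAsset") }
        });

The samples then poll or use Event Grid to wait for the job to finish before downloading the output.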

Editing a file in Azure storage

I am using the Azure Storage File Shares client library for .NET in order to save files in the cloud, read them, and so on. I have a file saved in the storage that is supposed to be updated every time a specific action happens in my code.
The way I'm doing it now is by downloading the file from the storage using
ShareFileDownloadInfo download = file.Download();
and then editing the file locally and uploading it back to the storage.
The problem is that the file can be updated frequently, which means lots of downloads and uploads of a file that keeps growing in size.
Is there a better way of editing a file on Azure storage? Maybe some way to edit the file directly in the storage without the need to download it before editing?
Downloading and uploading the file is the correct way to make edits given how you are currently handling the data. If you find yourself doing this often, there are some strategies you could use to reduce traffic:
If you are the only one editing the file, you could cache a copy of it locally, apply your edits to that copy, and upload it, instead of downloading the file each time.
Cache pending updates and only write the file out at regular intervals instead of on each change (a sketch of this follows below).
Break the single file up into multiple time-boxed files, say one per hour. This wouldn't help with frequency, but it can help with size.
FYI, when pushing logs to storage, many Azure services use a combination of #2 and #3 to minimize traffic.
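Here is a minimal sketch of strategy #2 for the log-style case mentioned above, using Azure.Storage.Files.Shares. The class name and flush interval are my own, and for simplicity each flush rewrites the file with just the latest batch rather than merging it into the existing content:

    using System;
    using System.IO;
    using System.Text;
    using System.Threading;
    using Azure;
    using Azure.Storage.Files.Shares;

    class BufferedFileWriter
    {
        private readonly ShareFileClient _file;
        private readonly StringBuilder _pending = new StringBuilder();
        private readonly object _gate = new object();
        private readonly Timer _timer;

        public BufferedFileWriter(ShareFileClient file, TimeSpan interval)
        {
            _file = file;
            _timer = new Timer(_ => Flush(), null, interval, interval);
        }

        // Callers record changes here; nothing touches the share yet.
        public void Append(string line)
        {
            lock (_gate) _pending.AppendLine(line);
        }

        // One upload per interval instead of one per change.
        private void Flush()
        {
            byte[] payload;
            lock (_gate)
            {
                if (_pending.Length == 0) return;
                payload = Encoding.UTF8.GetBytes(_pending.ToString());
                _pending.Clear();
            }
            _file.Create(payload.Length);  // Azure Files requires the size up front
            using (var stream = new MemoryStream(payload))
            {
                _file.UploadRange(new HttpRange(0, payload.Length), stream);
            }
        }
    }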

Lock Azure blob during write

I have to update a large file stored as Azure blob. This update will take a few seconds and I need to ensure that no other client ever gets the partially updated file.
As described in https://learn.microsoft.com/en-us/azure/storage/common/storage-concurrency, it should be easy to lock the file for writing, but as far as I understand, other clients will still be able to read the file. I could use read locks, but that would mean only one client can read the file, and that's not what I want.
According to "Preventing azure blob from being accessed by other service while it's being created", it seems that at least new files are "committed" at the end of an upload, but I could not find information about what happens when I update an existing file.
So, the question is: What will other clients read during an update (replace) operation?
Will they read the old file until the new data is committed or
will they read the partially updated file content?
I did a test for this scenario (I didn't find any official documentation about it): I updated a 400 MB blob with a 600 MB file, and during the update (about 10 seconds after it started) I used code to read the blob that was being updated.
The result was that only the old content could be read while the update was in progress.
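For reference, a rough version of that test with Azure.Storage.Blobs looks like this (connection string, container, and file names are placeholders):

    using System;
    using System.IO;
    using System.Threading.Tasks;
    using Azure.Storage.Blobs;
    using Azure.Storage.Blobs.Models;

    var blob = new BlobClient(connectionString, "test-container", "bigfile.bin");

    // Start overwriting the existing 400 MB blob with a 600 MB file.
    var upload = Task.Run(() =>
    {
        using var fs = File.OpenRead("new-600mb-file.bin");
        blob.Upload(fs, overwrite: true);
    });

    await Task.Delay(TimeSpan.FromSeconds(10));  // let the upload get underway

    // This read returns the old 400 MB content: new blocks only become
    // visible once the block list is committed at the end of the upload.
    BlobDownloadInfo download = await blob.DownloadAsync();
    Console.WriteLine($"Read {download.ContentLength} bytes mid-update.");

    await upload;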

Read Excel sheet in Azure uploaded as a blob

I am using the ASP.NET FileUpload control to upload an Excel file with some data. I can't save it to a folder, but I can get the stream of the Excel file, or a BlobStream after uploading the Excel file as a blob. Now I want to convert the first sheet of that Excel file to a DataTable; how can I do that? I am using C# .NET. I don't want to use the Interop library, but I can use external libraries. An OleDb connection fails because I don't have a physical path to the Excel file to use as a data source. I tried the following links:
1) http://www.codeproject.com/Articles/14639/Fast-Excel-file-reader-with-basic-functionality
2) http://exceldatareader.codeplex.com/
Please help.
Depending on the type of Excel file you can use the examples you posted or go for the OpenXML alternative (for xlsx files): http://openexcel.codeplex.com/
Now, the problem with the physical path is easy to solve. Saving the file to blob storage is great, but if you want, you can also save it to a local resource to have it available locally. This will allow you to process the file using a simple OleDb connection. Once you're done with the file, you can just delete it from the local resource (it will still be available in blob storage since you also uploaded it there).
Don't forget to have some kind of clean up mechanism in case your processing fails. You wouldn't want to end up with a disk filled with temporary files (even though it could take a while before this happens).
Read more on local resources here: http://msdn.microsoft.com/en-us/library/windowsazure/ee758708.aspx
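Here is a sketch of that flow. It assumes a local storage resource named "ExcelScratch" in your service definition, the ACE OleDb provider on the role, a first sheet named "Sheet1", and a FileUpload control called fileUpload; all of these are placeholders:

    using System.Data;
    using System.Data.OleDb;
    using System.IO;
    using Microsoft.WindowsAzure.ServiceRuntime;

    // Copy the uploaded bytes into the role's local storage resource.
    LocalResource scratch = RoleEnvironment.GetLocalResource("ExcelScratch");
    string path = Path.Combine(scratch.RootPath, "upload.xlsx");
    File.WriteAllBytes(path, fileUpload.FileBytes);

    // Now there is a physical path, so a plain OleDb connection works.
    string connStr = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=" + path +
                     ";Extended Properties='Excel 12.0 Xml;HDR=YES'";

    var table = new DataTable();
    using (var conn = new OleDbConnection(connStr))
    using (var adapter = new OleDbDataAdapter("SELECT * FROM [Sheet1$]", conn))
    {
        adapter.Fill(table);  // first sheet into a DataTable
    }

    File.Delete(path);  // clean up the scratch copy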
You should use the OpenXML SDK, which is an officially suggested way of working with MS Office documents - http://www.microsoft.com/download/en/details.aspx?id=5124
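To cover the DataTable part of the question, here is a rough sketch of reading the first sheet with the OpenXML SDK (xlsx only). The cell handling is simplified: empty cells are skipped, only shared strings are resolved, and a workbook without shared strings would need a null check:

    using System.Collections.Generic;
    using System.Data;
    using System.Linq;
    using DocumentFormat.OpenXml.Packaging;
    using DocumentFormat.OpenXml.Spreadsheet;

    // "stream" is the uploaded xlsx stream from the FileUpload control.
    using (var doc = SpreadsheetDocument.Open(stream, false))
    {
        WorkbookPart workbook = doc.WorkbookPart;
        Sheet firstSheet = workbook.Workbook.Descendants<Sheet>().First();
        var sheetPart = (WorksheetPart)workbook.GetPartById(firstSheet.Id);
        SharedStringTable strings = workbook.SharedStringTablePart.SharedStringTable;

        var table = new DataTable();
        foreach (Row row in sheetPart.Worksheet.Descendants<Row>())
        {
            var values = new List<object>();
            foreach (Cell cell in row.Elements<Cell>())
            {
                string text = cell.CellValue?.Text ?? "";
                // Shared-string cells store an index into the shared string table.
                if (cell.DataType != null && cell.DataType.Value == CellValues.SharedString)
                    text = strings.ElementAt(int.Parse(text)).InnerText;
                values.Add(text);
            }
            while (table.Columns.Count < values.Count)
                table.Columns.Add();
            table.Rows.Add(values.ToArray());
        }
    }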
I first created local storage as per the link:
http://msdn.microsoft.com/en-us/library/windowsazure/ee758708.aspx
suggested by Sandrino above. Thanks, Sandrino, for this. Then I used an OleDb connection and it gave me the error "Microsoft.Jet.Oledb.4.0 dll is not registered". So I logged on to the Azure server and, in IIS, changed the app pool configuration to 32-bit. To change the app pool to 32-bit, refer to the following link:
http://blog.nkadesign.com/2008/windows-2008-the-microsoftjetoledb40-provider-is-not-registered-on-the-local-machine/
The approach you followed is not the correct one. As you said, you logged on to Azure and made the change manually, but the VM running on Azure is not permanent for you: for any update you are going to get a new VM, and your manual change will be lost. You will have to find a workaround instead of modifying it manually. You can make use of startup tasks in your Azure app. See the link below; it may help you.
http://msdn.microsoft.com/en-us/library/gg456327.aspx

Is there a c#/asp.net mvc3 file image picker/uploader control for Azure

Before I reinvent any wheels, has anyone seen a file uploader/image browser control (you know, the kind of thing typical in a CMS) that works with Azure Blob storage? The stuff I've used to date relies on physical directory structures.
As far as I know, there is no built-in image uploader control for ASP.NET MVC.
You should use the <input type="file"> tag for uploading images.
The excellent Aurigma Image Uploader worked for us in an MVC3 project. Take the uploaded file bytes and shove them in Azure File Storage.
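Since the question is about blob storage, here is a minimal sketch of that last step in an MVC3 controller using the WindowsAzure.Storage SDK of that era; the container name and connection string key are placeholders:

    using System.Configuration;
    using System.Web;
    using System.Web.Mvc;
    using Microsoft.WindowsAzure.Storage;

    public class ImageController : Controller
    {
        [HttpPost]
        public ActionResult Upload(HttpPostedFileBase file)
        {
            var account = CloudStorageAccount.Parse(
                ConfigurationManager.AppSettings["StorageConnectionString"]);
            var container = account.CreateCloudBlobClient()
                                   .GetContainerReference("images");
            container.CreateIfNotExists();

            // Stream the posted file straight into a block blob.
            var blob = container.GetBlockBlobReference(file.FileName);
            blob.UploadFromStream(file.InputStream);

            return RedirectToAction("Index");
        }
    }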
We use Azure File Storage Explorer to manage the files in Azure File Storage.
In my project, we use PlUpload, and it works pretty well with the blobs.
