PDF/TIFF Document Text Detection gcsDestinationBucketName - c#

I'm working on PDF to text file conversion using the Google Cloud Vision API.
I got initial code help from their side; image to text conversion is working fine with the JSON key I got through registration and activation.
Here is the code I got for PDF to text conversion:
private static object DetectDocument(string gcsSourceUri,
    string gcsDestinationBucketName, string gcsDestinationPrefixName)
{
    var client = ImageAnnotatorClient.Create();
    var asyncRequest = new AsyncAnnotateFileRequest
    {
        InputConfig = new InputConfig
        {
            GcsSource = new GcsSource
            {
                Uri = gcsSourceUri
            },
            // Supported mime_types are: 'application/pdf' and 'image/tiff'
            MimeType = "application/pdf"
        },
        OutputConfig = new OutputConfig
        {
            // How many pages should be grouped into each json output file.
            BatchSize = 2,
            GcsDestination = new GcsDestination
            {
                Uri = $"gs://{gcsDestinationBucketName}/{gcsDestinationPrefixName}"
            }
        }
    };

    asyncRequest.Features.Add(new Feature
    {
        Type = Feature.Types.Type.DocumentTextDetection
    });

    List<AsyncAnnotateFileRequest> requests =
        new List<AsyncAnnotateFileRequest>();
    requests.Add(asyncRequest);

    var operation = client.AsyncBatchAnnotateFiles(requests);
    Console.WriteLine("Waiting for the operation to finish");
    operation.PollUntilCompleted();

    // Once the request has completed and the output has been
    // written to GCS, we can list all the output files.
    var storageClient = StorageClient.Create();

    // List objects with the given prefix.
    var blobList = storageClient.ListObjects(gcsDestinationBucketName,
        gcsDestinationPrefixName);
    Console.WriteLine("Output files:");
    foreach (var blob in blobList)
    {
        Console.WriteLine(blob.Name);
    }

    // Process the first output file from GCS.
    // Select the first JSON file from the objects in the list.
    var output = blobList.Where(x => x.Name.Contains(".json")).First();

    var jsonString = "";
    using (var stream = new MemoryStream())
    {
        storageClient.DownloadObject(output, stream);
        jsonString = System.Text.Encoding.UTF8.GetString(stream.ToArray());
    }

    var response = JsonParser.Default
        .Parse<AnnotateFileResponse>(jsonString);

    // The actual response for the first page of the input file.
    var firstPageResponses = response.Responses[0];
    var annotation = firstPageResponses.FullTextAnnotation;

    // Here we print the full text from the first page.
    // The response contains more information:
    // annotation/pages/blocks/paragraphs/words/symbols
    // including confidence scores and bounding boxes
    Console.WriteLine($"Full text: \n {annotation.Text}");

    return 0;
}
This function requires 3 parameters:
string gcsSourceUri,
string gcsDestinationBucketName,
string gcsDestinationPrefixName
I don't understand which values I should set for those 3 params. I have never worked with a third-party API before, so it's a little bit confusing for me.

Suppose you own a GCS bucket named 'giri_bucket' and you have put a PDF at the root of the bucket, 'test.pdf'. If you wanted to write the results of the operation to the same bucket, you could set the arguments to be
gcsSourceUri: 'gs://giri_bucket/test.pdf'
gcsDestinationBucketName: 'giri_bucket'
gcsDestinationPrefixName: 'async_test'
When the operation completes, there will be 1 or more output files in your GCS bucket at giri_bucket/async_test.
If you want, you could even write your output to a different bucket. You just need to make sure your gcsDestinationBucketName + gcsDestinationPrefixName is unique.
You can read more about the request format in the docs: AsyncAnnotateFileRequest
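Putting it together, a call using those example values would look like this:
DetectDocument("gs://giri_bucket/test.pdf", "giri_bucket", "async_test");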

Related

GetFile AWS S3 .NET 6 C#

I have these two methods to get a file from S3.
//method 1
var client = new AmazonS3Client();
var request = new GetPreSignedUrlRequest()
{
    BucketName = BucketName,
    Key = fileName,
    Expires = DateTime.UtcNow.AddSeconds(300),
};
var presignedUrlResponse = client.GetPreSignedURL(request);
return presignedUrlResponse;

//method 2
var client = new AmazonS3Client();
var request = new GetPreSignedUrlRequest()
{
    BucketName = BucketName,
    Key = fileName,
};
var file = await client.GetObjectAsync(BucketName, fileName);
return File(file.ResponseStream, file.Headers.ContentType);
In method 1, the URL produced by GetPreSignedURL includes the region name in the host, and I don't want that, because then I can't open the photo in the browser.
Example: https://service-manager-estagio.s3.sa-east-1.amazonaws.com/urlphoto
I want the URL without this sa-east-1 part.
In method 2, on the return File line, I can't use File; the compiler says I can't use it as a method.
I need help either making method 1 work without the region name, or making method 2 work with File. If anyone knows another way to do this GET, that's valid too.
Thanks!!
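For what it's worth, the File(...) call in method 2 is a helper defined on ControllerBase, so that return statement only compiles inside an MVC/API controller action. A minimal sketch, assuming the code is moved into a controller (the route, bucket name, and injected IAmazonS3 client here are illustrative, not from the question):

using Amazon.S3;
using Microsoft.AspNetCore.Mvc;

[ApiController]
[Route("api/[controller]")]
public class FilesController : ControllerBase
{
    private readonly IAmazonS3 _s3;

    public FilesController(IAmazonS3 s3) => _s3 = s3;

    [HttpGet("{fileName}")]
    public async Task<IActionResult> Get(string fileName)
    {
        // GetObjectAsync streams the object; File() wraps the stream in a FileStreamResult.
        var obj = await _s3.GetObjectAsync("my-bucket", fileName);
        return File(obj.ResponseStream, obj.Headers.ContentType);
    }
}

If the method lives outside a controller, returning new FileStreamResult(obj.ResponseStream, obj.Headers.ContentType) achieves the same thing.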

Read one line from a 200 GB text file stored on Azure Blob Storage using C#

I have a 200 GB text file on Azure Blob Storage. I want to search the text and download only the matching line instead of the whole 200 GB file.
I have written code in C# that downloads the complete file and then searches and selects, but it takes too much time and then fails with a timeout error.
var contents = ""; // downloading the whole text from Azure Blob Storage
var searchedLines1 = contents
    .Split(new string[] { Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries)
    .Select((text, index) => new { text, lineNumber = index + 1 })
    .Where(x => x.text.Contains("TYLER15727@YAHOO.COM") || x.lineNumber == 1);
You will need to stream the file and set the timeout. I have wrapped the stream implementation in an IAsyncEnumerable, which is completely unnecessary... but why not.
Given
public static async IAsyncEnumerable<string> Read(StreamReader stream)
{
    while (!stream.EndOfStream)
        yield return await stream.ReadLineAsync();
}
Usage
var blobClient = new BlobClient( ... , new BlobClientOptions()
{
    Transport = new HttpClientTransport(new HttpClient { Timeout = Timeout.InfiniteTimeSpan }),
    Retry = { NetworkTimeout = Timeout.InfiniteTimeSpan }
});

await using var stream = await blobClient.OpenReadAsync();
using var reader = new StreamReader(stream);

await foreach (var line in Read(reader))
    if (line.Contains("bob"))
    {
        Console.WriteLine("Yehaa");
        // exit or whatever
    }
Disclaimer : Completely untested
Note: If you are on an older C# version (async streams require C# 8.0), you will need to remove all the awaits and async methods and just use a plain loop with stream.ReadLine().
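For reference, a sketch of that synchronous variant (the same untested caveat applies):

using (var stream = blobClient.OpenRead())
using (var reader = new StreamReader(stream))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        if (line.Contains("bob"))
        {
            Console.WriteLine("Yehaa");
            break; // exit or whatever
        }
    }
}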

How to consume the latest text file data sent to Kafka and write it back to another text file using C#?

I have sent one text file of data to a Kafka producer after reading that file into a string. Now I want to consume the same data back into a text file. How do I consume it?
var fileName = @"D:\kafka_examples\new2.txt";
var options = new KafkaOptions(new Uri("http://localhost:9092"),
    new Uri("http://localhost:9092"));
var router = new BrokerRouter(options);
var consumer = new KafkaNet.Consumer(new ConsumerOptions("Hello-Kafka",
    new BrokerRouter(options)));
var text = "";

// Consume returns a blocking IEnumerable (ie: never ending stream)
if (File.Exists(fileName))
{
    File.Delete(fileName);
}
foreach (var message in consumer.Consume())
{
    Console.WriteLine("Response: P{0},O{1} : {2}",
        message.Meta.PartitionId, message.Meta.Offset,
        text = Encoding.UTF8.GetString(message.Value));
    using (StreamWriter sw = File.CreateText(fileName))
    {
        sw.WriteLine(text);
    }
}
I tried this, but the data is not written to the given text file. All the messages are coming through; I want only the last message.
There is no concept of the "latest" message in a stream; streams are unbounded.
But what you could do is look up the current latest offset when your code starts, subtract one (or the number of lines in the file), seek the consumer group there, and break out of the loop after reading just that many messages.
i.e.
var filename = ...
var lines = linesInFile(filename)
var consumer = ... // (with a consumer group id)
var latestOffset = seekToEnd(consumer, -1 * lines) // second param is the delta offset from the end
var i = lines;
foreach (var message in consumer.Consume()) {
    ... // process the message
    i--;
    if (i <= 0) break;
}
Also, Kafka is not an HTTP service. Remove the http:// prefix and the duplicate localhost address from your code.
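For a concrete version of that idea, here is a sketch using the Confluent.Kafka client (a different package from the kafka-net library in the question). The topic name, file path, and broker address come from the question; the group id and the single-partition assumption are made up for illustration:

using System;
using System.IO;
using Confluent.Kafka;

var config = new ConsumerConfig
{
    BootstrapServers = "localhost:9092",
    GroupId = "file-writer" // hypothetical group id
};

using var consumer = new ConsumerBuilder<Ignore, string>(config).Build();
var tp = new TopicPartition("Hello-Kafka", 0); // assumes a single partition

// Find the end of the partition and start one message before it.
var watermarks = consumer.QueryWatermarkOffsets(tp, TimeSpan.FromSeconds(5));
var lastOffset = Math.Max(watermarks.Low.Value, watermarks.High.Value - 1);
consumer.Assign(new TopicPartitionOffset(tp, lastOffset));

// Read just that one (latest) message and write it to the file.
var result = consumer.Consume(TimeSpan.FromSeconds(5));
if (result != null)
    File.WriteAllText(@"D:\kafka_examples\new2.txt", result.Message.Value);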

Don't know how to transcribe a wav file from Google Cloud Storage with LongRunningRecognize for conversion to text in C#

I'm able to convert audio files to text as long as they are under a minute. I need to transcribe longer files. Apparently, you have to have the file in Cloud Storage, but I can't figure out whether there is one command that does that or if I have to do it separately. What I'm using now is:
var credential = GoogleCredential.FromFile(GoogleCredentials);
var channel = new Grpc.Core.Channel(SpeechClient.DefaultEndpoint.ToString(),
    credential.ToChannelCredentials());
var speech = SpeechClient.Create(channel);
var response = speech.LongRunningRecognize(
    new RecognitionConfig()
    {
        Encoding = RecognitionConfig.Types.AudioEncoding.Linear16,
        LanguageCode = "en",
    },
    RecognitionAudio.FromFile(waveFile));
response = response.PollUntilCompleted();
I know I need to specify a file in Cloud Storage like:
RecognitionAudio.FromStorageUri("gs://ldn-speech/" + waveFile);
But I don't know how to get the file into the gs bucket. Do I have to do that in a separate step, or as part of one of the Speech APIs? I'm looking for someone to show me an example.
EDIT: I found that I needed to upload the file separately and could use the credential file I had already been using in the speech recognition process. So all I needed was:
var credential = GoogleCredential.FromFile(GoogleCredentials);
var storage = StorageClient.Create(credential);
using (var f = File.OpenRead(fullFileName))
{
    fileName = Path.GetFileName(fullFileName);
    storage.UploadObject(bucketName, fileName, null, f);
}
There is also another way of going about this.
As stated in your edit you indeed needed to upload the file separately to your Cloud Storage bucket.
If you are planning on transcribing long audio files (longer than 1 minute) to text you may consider using Asynchronous Speech recognition:
https://cloud.google.com/speech-to-text/docs/async-recognize#speech-async-recognize-gcs-csharp
The code sample uses Cloud Storage bucket to store raw audio input for long-running transcription processes. It also requires that you have created and activated a service account.
Here’s an example:
static object AsyncRecognizeGcs(string storageUri)
{
    var speech = SpeechClient.Create();
    var longOperation = speech.LongRunningRecognize(new RecognitionConfig()
    {
        Encoding = RecognitionConfig.Types.AudioEncoding.Linear16,
        SampleRateHertz = 16000,
        LanguageCode = "en",
    }, RecognitionAudio.FromStorageUri(storageUri));
    longOperation = longOperation.PollUntilCompleted();
    var response = longOperation.Result;
    foreach (var result in response.Results)
    {
        foreach (var alternative in result.Alternatives)
        {
            Console.WriteLine($"Transcript: {alternative.Transcript}");
        }
    }
    return 0;
}
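With the bucket from the question, the call would then just be:
AsyncRecognizeGcs("gs://ldn-speech/" + waveFile);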
(1) I found that I did indeed need to upload the file separately to Cloud Storage. (2) I could use the credential file I had already been using in the speech recognition process. So all I needed was:
var credential = GoogleCredential.FromFile(GoogleCredentials);
var storage = StorageClient.Create(credential);
using (var f = File.OpenRead(fullFileName))
{
    fileName = Path.GetFileName(fullFileName);
    storage.UploadObject(bucketName, fileName, null, f);
}
Once in Cloud Storage, I could transcribe it as I originally thought, then delete the file after the process was complete with:
var credential = GoogleCredential.FromFile(GoogleCredentials);
var storage = StorageClient.Create(credential);
fileName = Path.GetFileName(fullFileName);
storage.DeleteObject(bucketName, fileName);

Office/Word Add-in Uploading File to MVC Application

I am building an add-in for Word, with the goal of being able to save the open Word document to our MVC web application. I have followed this guide and am sending the slices like this:
function sendSlice(slice, state) {
    var data = slice.data;
    if (data) {
        var fileData = myEncodeBase64(data);
        var request = new XMLHttpRequest();
        request.onreadystatechange = function () {
            if (request.readyState == 4) {
                updateStatus("Sent " + slice.size + " bytes.");
                state.counter++;
                if (state.counter < state.sliceCount) {
                    getSlice(state);
                }
                else {
                    closeFile(state);
                }
            }
        }
        request.open("POST", "http://localhost:44379/api/officeupload/1");
        request.setRequestHeader("Slice-Number", slice.index);
        request.setRequestHeader("Total-Slices", state.sliceCount);
        request.setRequestHeader("FileId", "abc29572-8eca-473d-80de-8b87d64e06a0");
        request.setRequestHeader("FileName", "file.docx");
        request.send(fileData);
    }
}
And then receiving the slices like this:
public void Post()
{
    if (Files == null) Files = new Dictionary<Guid, Dictionary<int, byte[]>>();

    var slice = int.Parse(Request.Headers.GetValues("Slice-Number").First());
    var numSlices = int.Parse(Request.Headers.GetValues("Total-Slices").First());
    var filename = Request.Headers.GetValues("FileName").First();
    var fileId = Guid.Parse(Request.Headers.GetValues("FileId").First());
    var content = Request.Content.ReadAsStringAsync().Result;

    if (!Files.ContainsKey(fileId)) Files[fileId] = new Dictionary<int, byte[]>();
    Files[fileId][slice] = Convert.FromBase64String(content);

    if (Files[fileId].Keys.Count == numSlices)
    {
        byte[] array = Combine(Files[fileId].OrderBy(x => x.Key).Select(x => x.Value).ToArray());
        System.IO.FileStream writeFileStream = new System.IO.FileStream("c:\\temp\\test.docx",
            System.IO.FileMode.Create, System.IO.FileAccess.Write);
        writeFileStream.Write(array, 0, array.Length);
        writeFileStream.Close();
        Files.Remove(fileId);
    }
}
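The Combine helper isn't shown in the question; presumably it just concatenates the slice arrays in order, something like this sketch (using System.Linq for Sum):

static byte[] Combine(byte[][] arrays)
{
    // Allocate one buffer large enough for all slices, then copy each in order.
    var result = new byte[arrays.Sum(a => a.Length)];
    var offset = 0;
    foreach (var a in arrays)
    {
        Buffer.BlockCopy(a, 0, result, offset, a.Length);
        offset += a.Length;
    }
    return result;
}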
The problem is that the file produced by the controller is unreadable in Word. I tested with a Word document whose entire contents were "Test123": saved through Word the file is 13 KB, but when sent to the web app and saved from there it is 41 KB.
My assumption is that I am missing something in the encoding or decoding, since I am only sending a single slice, so there shouldn't be an issue with recombining them.
There's an Excel snippet in Script Lab that produces the base64-encoded file, which you can paste into an online decoder like www.base64decode.org. The APIs are the same as in Word, so this can help you isolate the encoding code. After you install Script Lab, open the Samples tab and scroll to the Document section; it's the "Get file (using slicing)" snippet.
