Issues with webscraping in C#: Downloading and parsing zipped text files - c#

I am writing a web scraper to download content from a website.
Navigating to the website's URL triggers the creation of a temporary URL. This new URL points to a zipped text file, which has to be downloaded and parsed.
I have written the scraper in C# using WebClient and its DownloadFileAsync() method. The zipped file is read from the designated location when the DownloadFileCompleted event fires.
My issue is that the Windows Open/Save dialog is triggered. This requires user input, which disrupts the automation.
Can you suggest a way to bypass the issue? I am cool with rewriting the code using any alternative libraries. :)
Thanks for reading.

You can use 'HttpWebRequest' to perform the request and save the streamed bytes to disk.
var request = WebRequest.Create(@"your url here");
request.Method = WebRequestMethods.Http.Get;
using (var response = request.GetResponse())
using (var readStream = response.GetResponseStream())
using (var writeStream = new FileStream(@"path", FileMode.Create))
{
    var buffer = new byte[1024];
    var readCount = readStream.Read(buffer, 0, buffer.Length);
    while (readCount > 0)
    {
        // Write only the bytes actually read; the last chunk is usually shorter than the buffer.
        writeStream.Write(buffer, 0, readCount);
        readCount = readStream.Read(buffer, 0, buffer.Length);
    }
}
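Once the zipped file has been saved, the parsing step from the question could look roughly like the sketch below. It assumes the archive contains plain text entries and uses ZipArchive from System.IO.Compression (.NET 4.5 or later); the class and method names are made up for illustration and are not part of the original answer.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

// Hypothetical helper, not from the original answer.
public static class ZippedTextReader
{
    // Reads every line of every entry in a zip archive supplied as a stream.
    public static List<string> ReadAllLines(Stream zipStream)
    {
        var lines = new List<string>();
        // leaveOpen: true so the caller stays responsible for disposing zipStream.
        using (var archive = new ZipArchive(zipStream, ZipArchiveMode.Read, true))
        {
            foreach (ZipArchiveEntry entry in archive.Entries)
            {
                using (var reader = new StreamReader(entry.Open()))
                {
                    string line;
                    while ((line = reader.ReadLine()) != null)
                    {
                        lines.Add(line);
                    }
                }
            }
        }
        return lines;
    }
}
```

A caller would open the downloaded file with File.OpenRead and pass the resulting stream to ZippedTextReader.ReadAllLines.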

Related

Convert docx stream to pdf directly

I need to convert any file coming from a web response into .pdf format. I'm currently getting it in Word .docx format from the URL and saving it into a memory stream so I can later insert it into its designated library.
The problem I'm facing is that I'm saving my .docx files directly as .pdf by appending the extension, which obviously means the file won't open later. So I'm trying to convert my memory stream into PDF directly.
Here is the code I tried for converting the stream to .pdf, but it looks like the file isn't getting converted correctly.
private Stream DownloadFromUrl(string url)
{
    var webRequest = WebRequest.Create(url);
    webRequest.Credentials = CredentialCache.DefaultNetworkCredentials;
    webRequest.PreAuthenticate = true;
    webRequest.UseDefaultCredentials = true;
    //EventLogUtility.LogInformationMessage(DocumentURL);
    string message = string.Empty;
    using (Stream outputStream = new MemoryStream())
    {
        using (var response = webRequest.GetResponse())
        {
            using (var content = response.GetResponseStream())
            {
                var memory = new MemoryStream();
                content.CopyTo(memory);
                Document doc = new Document(memory);
                doc.Save(memory, SaveFormat.Pdf);
                return memory;
            }
        }
    }
}
If the content in the stream is actually in the Microsoft Word file format (and not just plain text), then it has to be converted to the PDF file format; changing the extension is not enough. I know there is a 'Print to PDF' function available in Word; you could try looking into that.
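One detail in the question's code is worth checking regardless of which conversion library is used: after content.CopyTo(memory), the position of the MemoryStream is at the end, so anything that reads from it afterwards sees no data unless the stream is rewound. Saving the output into the same stream that holds the input is also likely to mix the two. The sketch below shows the buffering pattern with base class library streams only; the convert delegate stands in for the actual converter and, like the class and method names, is hypothetical.

```csharp
using System;
using System.IO;

// Hypothetical helper illustrating the rewind-and-separate-output pattern.
public static class StreamConversion
{
    // Buffers source in memory, rewinds it, runs convert(input, output),
    // and returns the output stream rewound to its start.
    public static MemoryStream ConvertViaBuffer(Stream source, Action<Stream, Stream> convert)
    {
        using (var input = new MemoryStream())
        {
            source.CopyTo(input);
            input.Position = 0; // CopyTo leaves the position at the end of the stream

            var output = new MemoryStream(); // separate stream for the converted bytes
            convert(input, output);
            output.Position = 0; // let the caller read from the beginning
            return output;
        }
    }
}
```

With the Document type from the question, a call might look like ConvertViaBuffer(content, (i, o) => new Document(i).Save(o, SaveFormat.Pdf)), assuming that library reads from and saves to streams the way the question's code suggests.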

Incomplete result from Azure speech to text API

I have a small WAV sound file whose text I want to extract, so I used the Azure Speech to Text API to test it.
First, I converted the audio file to PCM, mono, 16 kHz sample rate, as recommended in their documentation.
Then I used this C# code from the documentation example to upload the file and get the result.
HttpWebRequest request = null;
request = (HttpWebRequest)HttpWebRequest.Create("https://speech.platform.bing.com/speech/recognition/interactive/cognitiveservices/v1?language=en-US&format=detailed");
request.SendChunked = true;
request.Accept = @"application/json;text/xml";
request.Method = "POST";
request.ProtocolVersion = HttpVersion.Version11;
request.ContentType = @"audio/wav; codec=audio/pcm; samplerate=16000";
request.Headers["Ocp-Apim-Subscription-Key"] = "my key";
// Send an audio file by 1024 byte chunks
using (FileStream fs = new FileStream("D:/b.wav", FileMode.Open, FileAccess.Read))
{
    /*
     * Open a request stream and write 1024 byte chunks in the stream one at a time.
     */
    byte[] buffer = null;
    int bytesRead = 0;
    using (Stream requestStream = request.GetRequestStream())
    {
        /*
         * Read 1024 raw bytes from the input audio file.
         */
        buffer = new Byte[checked((uint)Math.Min(1024, (int)fs.Length))];
        while ((bytesRead = fs.Read(buffer, 0, buffer.Length)) != 0)
        {
            requestStream.Write(buffer, 0, bytesRead);
        }
        // Flush
        requestStream.Flush();
    }
}
string responseString;
Console.WriteLine("Response:");
using (WebResponse response = request.GetResponse())
{
    Console.WriteLine(((HttpWebResponse)response).StatusCode);
    using (StreamReader sr = new StreamReader(response.GetResponseStream()))
    {
        responseString = sr.ReadToEnd();
    }
    Console.WriteLine(responseString);
    Console.ReadLine();
}
I also tried the cURL tool and rewrote the upload in Java, because I thought the problem might be with the programming language I use and that I was not uploading the file correctly.
The sound file I want to convert to text is linked here.
So now I need help figuring out whether the problem comes from the format of the sound file, from my code not uploading it correctly, or from the API itself not being accurate enough.
I tried IBM Speech to Text and it got all the text with no problem.
I am currently using the free trial of the Azure Speech to Text API, and I want to find out where the problem comes from so I can decide whether to work with this API or not. Any experience with this would help.
Update
To be clear, I am not getting any error; I just get an incomplete result for the sound file I upload. For example, in the sound file I uploaded, the speaker says "What is up with that" at the end, but the result I got from Azure is only the first sentence, "I say that like it's a bad thing." I also uploaded another sound file which contains only "What is up with that" (check it here), and it just gives me an empty result like this:
{"RecognitionStatus":"NoMatch","Offset":17300000,"Duration":0}
So all I want to know is whether this is normal for the Azure Speech to Text API, or whether the problem is with my code or with the sound file.
When I tested another API on those files (IBM, for example), it worked.
Thanks in advance.

downloading with web request

I want to get a link from a TextBox and download a file from that link.
But before downloading the file, I want to know its size in advance and create an empty file of that size, but I can't.
Another question: I want to show the download progress as a percentage. How can I know how much data has been downloaded, so that I can update the percentage?
WebRequest request = WebRequest.Create(URL);
WebResponse response = request.GetResponse();
totalSize = request.ContentLength;//always is -1
using (FileStream f = new FileStream(savePath, FileMode.Create))
{
    f.SetLength(totalSize);
}
System.IO.StreamReader reader = new System.IO.StreamReader(response.GetResponseStream());
WebClient client = new WebClient();
client.DownloadFile(URL, savePath);
The best way would be to use WebClient. Its DownloadFileAsync method raises events such as DownloadFileCompleted and DownloadProgressChanged.
Getting the size of the file in advance would be a step harder, though.
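If you would rather keep the WebRequest approach from the question, note that request.ContentLength describes the body of the request being sent, which is why it stays at -1; the size of the download is reported by response.ContentLength, when the server sends a Content-Length header. With that value, a manual copy loop can report progress as it goes. The following is only a sketch: the class name, method name, and callback are hypothetical, and it works on any pair of streams.

```csharp
using System;
using System.IO;

// Hypothetical helper for copying a stream while reporting progress.
public static class ProgressCopy
{
    // Copies source to destination in chunks and calls onPercent with a
    // whole-number percentage each time that percentage changes.
    // totalSize is the expected byte count (for example response.ContentLength);
    // if it is not positive, no percentages are reported.
    // Returns the number of bytes copied.
    public static long Copy(Stream source, Stream destination, long totalSize, Action<int> onPercent)
    {
        var buffer = new byte[8192];
        long copied = 0;
        int lastPercent = -1;
        int read;
        while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
        {
            destination.Write(buffer, 0, read);
            copied += read;
            if (totalSize > 0)
            {
                int percent = (int)(copied * 100 / totalSize);
                if (percent != lastPercent)
                {
                    lastPercent = percent;
                    onPercent(percent);
                }
            }
        }
        return copied;
    }
}
```

In a WinForms application the callback would need to marshal to the UI thread (for example with Invoke) before updating a progress bar.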

Creating a Download Accelerator

I am referring to this article to understand file downloads using C#.
The code uses the traditional method of reading a Stream, like
while ((bytesSize = strResponse.Read(downBuffer, 0, downBuffer.Length)) > 0)
How can I divide a file to be downloaded into multiple segments, so that I can download separate segments in parallel and merge them?
using (WebClient wcDownload = new WebClient())
{
    try
    {
        // Create a request to the file we are downloading
        webRequest = (HttpWebRequest)WebRequest.Create(txtUrl.Text);
        // Set default authentication for retrieving the file
        webRequest.Credentials = CredentialCache.DefaultCredentials;
        // Retrieve the response from the server
        webResponse = (HttpWebResponse)webRequest.GetResponse();
        // Ask the server for the file size and store it
        Int64 fileSize = webResponse.ContentLength;
        // Open the URL for download
        strResponse = wcDownload.OpenRead(txtUrl.Text);
        // Create a new file stream where we will be saving the data (local drive)
        strLocal = new FileStream(txtPath.Text, FileMode.Create, FileAccess.Write, FileShare.None);
        // It will store the current number of bytes we retrieved from the server
        int bytesSize = 0;
        // A buffer for storing and writing the data retrieved from the server
        byte[] downBuffer = new byte[2048];
        // Loop through the buffer until the buffer is empty
        while ((bytesSize = strResponse.Read(downBuffer, 0, downBuffer.Length)) > 0)
        {
            // Write the data from the buffer to the local hard drive
            strLocal.Write(downBuffer, 0, bytesSize);
            // Invoke the method that updates the form's label and progress bar
            this.Invoke(new UpdateProgessCallback(this.UpdateProgress), new object[] { strLocal.Length, fileSize });
        }
    }
You need several threads to accomplish that.
First, you start the first download thread, which creates a WebClient and gets the file size. Then you can start several new threads, each of which adds a download range header.
You need logic that keeps track of the downloaded parts and creates new download parts when one finishes.
http://msdn.microsoft.com/de-de/library/system.net.httpwebrequest.addrange.aspx
I have noticed that the WebClient implementation sometimes behaves strangely, so I still recommend implementing your own HTTP client if you really want to write a "big" download program.
PS: thanks to user svick
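The splitting step described above can be sketched as follows. This only computes the inclusive byte ranges for a known file size; each range would then be passed to HttpWebRequest.AddRange(from, to) on its own request, which only works if the server supports range requests. The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for dividing a download into byte ranges.
public static class RangeSplitter
{
    // Splits the bytes 0 .. fileSize - 1 into at most `segments` contiguous,
    // inclusive (from, to) ranges of roughly equal size.
    public static List<Tuple<long, long>> Split(long fileSize, int segments)
    {
        if (fileSize <= 0) throw new ArgumentOutOfRangeException("fileSize");
        if (segments <= 0) throw new ArgumentOutOfRangeException("segments");

        var ranges = new List<Tuple<long, long>>();
        long chunk = (fileSize + segments - 1) / segments; // ceiling division
        for (long start = 0; start < fileSize; start += chunk)
        {
            // The last range may be shorter than the others.
            long end = Math.Min(start + chunk, fileSize) - 1;
            ranges.Add(Tuple.Create(start, end));
        }
        return ranges;
    }
}
```

Each segment can be written to its own temporary file, or to the matching offset of a single pre-sized file, and the parts merged once all of them have finished.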

FileStream with querystring in filename?

I have a need to be able to open a file on disk but pass in parameters to that file via a querystring. It's a .SWF file, so I'm passing in the parameter necessary to get it to load correctly.
The code I'm using to do so is:
FileStream fs = new FileStream(@"C:\test\file.swf?key=value", FileMode.Open, FileAccess.Read);
I'm getting an error opening the file: "Invalid characters in path" because of the "?" in the filename. Is there any way to load a file from disk into a FileStream object using a querystring in the filename?
I think you can't do what you're trying to do. When you load a file from disk the querystring does not exist as a concept. It will only return the bytes contained in the SWF file.
The querystring matters at the execution level.
So I solved this problem by putting my two SWF files on a web server and using the following code. Not exactly production ready code, but it illustrates the concept.
private static Stream GetFileStream()
{
    string url = @"http://www.someurl.com/shell.swf?Filename=actualfile.swf";
    HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(url);
    WebResponse response = request.GetResponse();
    byte[] result = null;
    int byteCount = Convert.ToInt32(response.ContentLength);
    using (BinaryReader reader = new BinaryReader(response.GetResponseStream()))
        result = reader.ReadBytes(byteCount);
    // FileStream has no constructor that takes a byte array, so wrap the bytes in a MemoryStream.
    return new MemoryStream(result);
}
