I'm reading the content from a page using DownloadString from the WebClient class and then writing that content to a static HTML file using the StreamWriter class. On the page that I'm reading in, there's an inline JavaScript method that just sets an anchor element's onclick attribute to set the window.location = history.go(-1); When I view the static HTML page, an odd-looking character shows up that isn't present on the dynamic web page.
WebClient & StreamWriter Code
using (var client = new WebClient())
{
    var html = client.DownloadString(url);
    // This constructor prepares a StreamWriter (UTF-8) to write to the specified file, or creates it if it doesn't already exist
    using (var stream = new StreamWriter(file, false, Encoding.UTF8))
    {
        stream.Write(html);
        stream.Close();
    }
}
The dynamic page's HTML snippet in question
<span>Sorry, but something went wrong on our end. Click here to go back to the previous page.</span>
The static page's HTML snippet
<span>Sorry, but something went wrong on our end. Â Click here to go back to the previous page.</span>
I was thinking that adding the Encoding.UTF8 parameter would solve this issue but it didn't seem to help. Is there some sort of extra encoding or decoding that I need to do? Or did I completely miss something else that's needed for this type of operation?
I set the WebClient's Encoding property to UTF-8, so that the downloaded bytes are decoded as UTF-8 when DownloadString converts the resource into a string. That seems to have taken care of the issue.
using (var client = new WebClient())
{
    client.Encoding = System.Text.Encoding.UTF8;
    var html = client.DownloadString(url);
    // This constructor prepares a StreamWriter (UTF-8) to write to the specified file, or creates it if it doesn't already exist
    using (var stream = new StreamWriter(file, false, Encoding.UTF8))
    {
        stream.Write(html);
        stream.Close();
    }
}
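For anyone wondering where the Â comes from: a likely explanation (an assumption, since the response headers of the page aren't shown) is that the page contains a non-breaking space encoded as UTF-8 (bytes C2 A0), and the download was decoded with a single-byte encoding such as ISO-8859-1. The sketch below reproduces that effect with hard-coded bytes; the class and method names are made up for illustration.

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Decodes the UTF-8 bytes of a non-breaking space (C2 A0) with the given encoding.
    public static string DecodeNbsp(Encoding decoder)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes("\u00A0");
        return decoder.GetString(utf8Bytes);
    }
}
```

Calling MojibakeDemo.DecodeNbsp(Encoding.GetEncoding("ISO-8859-1")) gives two characters, Â followed by a non-breaking space, while MojibakeDemo.DecodeNbsp(Encoding.UTF8) gives back the single non-breaking space. Setting client.Encoding controls this decoding step; the Encoding.UTF8 passed to StreamWriter only affects how the already-decoded string is written out.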
Related
I'm trying to extract the text from the following PDF with the following code (using iText7 7.2.2) :
var source = (string)GetHttpResult("https://www.bcr.ro/content/dam/ro/bcr/www_bcr_ro/Aur/Cotatii_Aur.pdf", new CookieContainer());
var bytes = Encoding.UTF8.GetBytes(source);
var stream = new MemoryStream(bytes);
var reader = new PdfReader(stream);
var doc = new PdfDocument(reader);
var pages = doc.GetNumberOfPages();
var text = PdfTextExtractor.GetTextFromPage(doc.GetPage(1));
Loading the PDF in my browser (Edge 100.0) works fine.
GetHttpResult() is a simple HttpClient defining a custom CookieContainer, a custom UserAgent, and calling ReadAsStringAsync(). Nothing fancy.
source has the correct PDF content, starting with "%PDF-1.7".
pages has the correct number of pages, which is 2.
But, whatever I try, text is always empty.
Defining an explicit TextExtractionStrategy, trying some Encodings, extracting from all pages in a loop, ..., nothing matters, text is always empty, with no Exception thrown anywhere.
I think I don't read this PDF how it's "meant" to be read, but what is the correct way then (correct content in source, correct number of pages, no Exception anywhere) ?
Thanks.
That's it ! Thanks to mkl and KJ !
I first downloaded the PDF as a byte array so I'm sure it's not modified in any way.
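To illustrate why the byte array matters, here is a small sketch (hypothetical class and method names, hard-coded sample bytes) showing that binary data generally does not survive a bytes -> string -> bytes round trip through UTF-8, which is roughly what ReadAsStringAsync() followed by Encoding.UTF8.GetBytes() does.

```csharp
using System;
using System.Linq;
using System.Text;

static class BinaryRoundTripDemo
{
    // Returns true if the bytes are unchanged after being decoded to a
    // string and re-encoded, both with UTF-8.
    public static bool SurvivesUtf8RoundTrip(byte[] original)
    {
        string asText = Encoding.UTF8.GetString(original);
        byte[] roundTripped = Encoding.UTF8.GetBytes(asText);
        return original.SequenceEqual(roundTripped);
    }
}
```

The ASCII header "%PDF-1.7" survives the round trip, which is why source looked correct, but bytes such as 0xE2 0xE3 0xCF 0xD3 (high-bit bytes of the kind found in PDF binary content) are not valid UTF-8 and get replaced during decoding, so SurvivesUtf8RoundTrip returns false for them.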
Then, as pdftotext is able to extract the text from this PDF, I searched for a NuGet package able to do the same. I tested almost ten of them, and FreeSpire.PDF finally did it !
Update : Actually, FreeSpire.PDF missed some words, so I finally found PdfPig, able to extract every single word.
Code using PdfPig :
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;
byte[] bytes;
using (HttpClient client = new())
{
bytes = client.GetByteArrayAsync("https://www.bcr.ro/content/dam/ro/bcr/www_bcr_ro/Aur/Cotatii_Aur.pdf").GetAwaiter().GetResult();
}
List<string> words = new();
using (PdfDocument document = PdfDocument.Open(bytes))
{
foreach (Page page in document.GetPages())
{
foreach (Word word in page.GetWords())
{
words.Add(word.Text);
}
}
}
string text = string.Join(" ", words);
Code using FreeSpire.PDF :
using Spire.Pdf;
using Spire.Pdf.Exporting.Text;
byte[] bytes;
using (HttpClient client = new())
{
bytes = client.GetByteArrayAsync("https://www.bcr.ro/content/dam/ro/bcr/www_bcr_ro/Aur/Cotatii_Aur.pdf").GetAwaiter().GetResult();
}
string text = string.Empty;
SimpleTextExtractionStrategy strategy = new();
using (PdfDocument doc = new())
{
doc.LoadFromBytes(bytes);
foreach (PdfPageBase page in doc.Pages)
{
text += page.ExtractText(strategy);
}
}
I have a CSV file in memory that I want to upload to a Web API.
If I save the CSV file to disk and upload it, it gets accepted.
However, I want to avoid the extra work and also make the code cleaner by simply uploading the text I have as a MemoryStream Object (I think that's the correct format?).
The following code works for uploading the file:
string webServiceUrl = "XXX";
string filePath = @"C:\test.csv";
string cred = "YYY";
using (var client = new WebClient())
{
    client.Headers.Add("Authorization", "Basic " + cred);
    byte[] rawResponse = client.UploadFile(webServiceUrl, "POST", filePath);
    Console.WriteLine(System.Text.Encoding.ASCII.GetString(rawResponse));
}
How would I do this if I had a string with all the contents and wanted to upload it in the same way, without having to save it to a file first?
WebClient.UploadData or WebClient.UploadString perhaps?
Thank you
EDIT:
I tried what you said but by using a local file (in case there was something wrong with the string), but I get the same error.
Here is what I suppose the code would be using your solution
string webServiceUrl = "XXX";
string file = @"C:\test.csv";
string cred = "YYY";
FileStream fs = new FileStream(file, FileMode.Open, FileAccess.Read);
BinaryReader r = new BinaryReader(fs);
byte[] postArray = r.ReadBytes((int)fs.Length);
using (var client = new WebClient())
{
    client.Headers.Add("Authorization", "Basic " + cred);
    using (var postStream = client.OpenWrite(webServiceUrl, "POST"))
    {
        postStream.Write(postArray, 0, postArray.Length);
    }
}
Any thoughts?
Use OpenWrite() from the WebClient.
// memStream is the MemoryStream that holds the CSV content
byte[] content = memStream.ToArray();
using (var postStream = client.OpenWrite(endpointUrl))
{
    postStream.Write(content, 0, content.Length);
}
As the documentation says:
The OpenWrite method returns a writable stream that is used to send data to a resource.
Update
Try to set the position of the MemoryStream to 0 before uploading.
memoryStream.Position = 0;
When you copy the file into the MemoryStream, the position is moved to the end of the stream, so when you then try to read from it, the read starts at the end and returns no data instead of your stream contents.
MSDN - CopyTo()
Copying begins at the current position in the current stream, and does not reset the position of the destination stream after the copy operation is complete.
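A minimal sketch of that behaviour, using only in-memory streams (the class and method names are made up for illustration): it reads from the destination stream once right after CopyTo() and once after rewinding.

```csharp
using System.IO;
using System.Text;

static class StreamPositionDemo
{
    // Copies the text into a MemoryStream and reports how many bytes a read
    // returns before and after setting Position back to 0.
    public static (int beforeRewind, int afterRewind) ReadCounts(string text)
    {
        byte[] data = Encoding.UTF8.GetBytes(text);
        using (var source = new MemoryStream(data))
        using (var destination = new MemoryStream())
        {
            source.CopyTo(destination);   // destination.Position is now at the end
            var buffer = new byte[data.Length];
            int before = destination.Read(buffer, 0, buffer.Length);
            destination.Position = 0;     // rewind before reading or uploading
            int after = destination.Read(buffer, 0, buffer.Length);
            return (before, after);
        }
    }
}
```

For a non-empty input, the first count is 0 and the second is the full byte length. If you only need the bytes, MemoryStream.ToArray() returns the whole buffer regardless of the current position.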
I finally managed to solve it.
First I made a request using CURL that worked.
I analyzed the packet data and made an exact copy of the packet.
I made a lot of changes, but the decisive one was this: the various functions I found online never closed the packet with a "Last-Boundary", while CURL did.
So after modifying the function to make sure it properly wrote a Last-Boundary, it finally worked.
Also, another crucial thing was to set PreAuthenticate to true, the examples online didn't do that.
So, all in all:
1. Make sure that the packet is properly constructed.
2. Make sure you pre authenticate if you need to authenticate.
webrequest.PreAuthenticate = true;
webrequest.Headers[HttpRequestHeader.Authorization] = string.Format("Basic {0}", cred);
Don't forget to enable SSL/TLS if you are using HTTPS (which you probably are if you authenticate):
ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls | SecurityProtocolType.Ssl3;
Hope this helps someone.
And thanks for the help earlier!
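For readers who hit the same problem, here is a rough sketch of what a properly closed multipart/form-data body looks like. It is not the code from the answer above; the class name, the field name, the file name and the text/csv content type are all assumptions for illustration. The important part is the final line, where the boundary is followed by two extra dashes.

```csharp
using System.Text;

static class MultipartBodyBuilder
{
    // Builds a multipart/form-data body containing a single file part.
    public static byte[] Build(string boundary, string fieldName, string fileName, string fileContent)
    {
        var sb = new StringBuilder();
        // Opening boundary of the (only) part
        sb.Append("--").Append(boundary).Append("\r\n");
        sb.Append("Content-Disposition: form-data; name=\"").Append(fieldName)
          .Append("\"; filename=\"").Append(fileName).Append("\"\r\n");
        sb.Append("Content-Type: text/csv\r\n\r\n");
        sb.Append(fileContent).Append("\r\n");
        // Closing boundary: same boundary with a trailing "--"
        sb.Append("--").Append(boundary).Append("--\r\n");
        return Encoding.UTF8.GetBytes(sb.ToString());
    }
}
```

The resulting bytes would be sent with a Content-Type header of "multipart/form-data; boundary=" followed by the same boundary string, for example through WebClient.UploadData or an HttpWebRequest request stream.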
I am using a C# WinForms app to scrape some data from a webpage that uses charset ISO-8859-1. It works well for many special characters, but not all.
(* Below I use colons instead of semicolons so that you will see the code that I see, and not the value of it)
I looked at the Page Source and noticed that for the characters that won't display correctly, the actual code (e.g. &#363:) is in the Page Source instead of the value. For example, in the Page Source I see Ry&#363: Murakami, but I expect to see Ryū Murakami. There are many other characters that appear as codes, such as &#350: &#333: &#353: &#269: &#259: &#537: and many more.
I have tried using WebClient.DownloadString and WebClient.DownloadData.
Try #1 Code:
using (WebClient wc = new WebClient())
{
wc.Encoding = Encoding.GetEncoding("ISO-8859-1");
string WebPageText = wc.DownloadString("http://www.[removed].htm");
// Scrape WebPageText here
}
Try #2 Code:
Encoding iso = Encoding.GetEncoding("ISO-8859-1");
Encoding utf8 = Encoding.UTF8;
using (WebClient wc = new WebClient())
{
wc.Encoding = iso;
byte[] AllData = wc.DownloadData("http://www.[removed].htm");
byte[] utfBytes = Encoding.Convert(iso, utf8, AllData);
string WebPageText = utf8.GetString(utfBytes);
// Scrape WebPageText here
}
I want to keep the special characters, so please don't suggest any RemoveDiacritics examples. Am I missing something?
Consider Decoding your HTML input.
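A minimal sketch of that suggestion, assuming the codes in the page source are HTML character references (for example &#363; for ū). It uses System.Net.WebUtility.HtmlDecode; the wrapper class and method names are made up for illustration.

```csharp
using System.Net;

static class EntityDecodingDemo
{
    // Turns HTML character references in scraped text back into the
    // characters they stand for.
    public static string Decode(string scrapedText)
    {
        return WebUtility.HtmlDecode(scrapedText);
    }
}
```

With this, EntityDecodingDemo.Decode("Ry&#363; Murakami") returns "Ryū Murakami". The decoding would be applied to the string returned by DownloadString, after the ISO-8859-1 decoding has already happened.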
We've noticed that UTF8 characters don't come out correctly when using UIDevice.CurrentDevice.Name in MonoTouch.
It comes out as "iPad 2 ??", if you use some of the special characters like holding down the apostrophe key on the iPad keyboard. (Sorry don't know the equivalent to show these characters in windows)
Is there a recommended workaround to get the correct text? We don't mind to convert to UTF8 ourselves. I also tried simulating this from a UITextField and it worked fine--no UTF8 problems.
The reason this is causing problems is we are sending this text off to a web service, and it's causing XML parsing issues.
Here is a snippet of the XmlWriter code (_parser.WriteRequest):
using (XmlWriter xmlWriter = XmlWriter.Create(textWriter, new XmlWriterSettings
{
#if DEBUG
Indent = true,
#else
Indent = false, NewLineHandling = NewLineHandling.None,
#endif
OmitXmlDeclaration = true
}))
{
xmlWriter.WriteStartDocument();
xmlWriter.WriteStartElement("REQUEST");
xmlWriter.WriteAttributeString("TYPE", "EXAMPLE");
xmlWriter.WriteEndElement();
xmlWriter.WriteEndDocument();
}
The TextWriter is passed in from:
public Response MakeRequest(Request request)
{
var httpRequest = CreateRequest(request);
WriteRequest(httpRequest.GetRequestStream(), request);
using (var httpResponse = httpRequest.GetResponse() as HttpWebResponse)
{
using (var responseStream = httpResponse.GetResponseStream())
{
var response = new Response();
ReadResponse(response, responseStream);
return response;
}
}
}
private void WriteRequest(Stream requestStream, Request request)
{
if (request.Type == null)
{
throw new InvalidOperationException("Request Type was null!");
}
if (_logger.Enabled)
{
var builder = new StringBuilder();
using (var writer = new StringWriter(builder, CultureInfo.InvariantCulture))
{
_parser.WriteRequest(writer, request);
}
_logger.Log("REQUEST: " + builder.ToString());
using (requestStream)
{
using (StreamWriter writer = new StreamWriter(requestStream))
{
writer.Write(builder.ToString());
}
}
}
else
{
using (requestStream)
{
using (StreamWriter writer = new StreamWriter(requestStream))
{
_parser.WriteRequest(writer, request);
}
}
}
}
_logger writes to Console.WriteLine, it is enabled in #if DEBUG mode. Request is just a storage class with properties, sorry easy to confuse with HttpWebRequest.
I'm seeing ?? in both XCode's console and MonoDevelop's console. I'm also assuming the server is receiving them strangely as well, as I get an error. Using UITextField.Text with the same strange characters instead of the device description works fine with no issues. It makes me think the device description is the culprit.
EDIT: this fixed it -
Encoding.UTF8.GetString (Encoding.ASCII.GetBytes(UIDevice.CurrentDevice.Name));
Okay, I think I know the problem. You're creating a StringWriter, which always reports its encoding as UTF-16 (unless you override the Encoding property). You're then taking the string from that StringWriter (which will start with <?xml version="1.0" encoding="UTF-16" ?>) and writing it to a StreamWriter which will default to UTF-8. That mixture of encodings is causing the problem.
The simplest approach would be to change your code to pass a Stream directly to the XmlWriter - a MemoryStream if you really want, or just requestStream. That way the XmlWriter can declare that it's using the exact encoding that it's actually writing the binary data in - you haven't got an intermediate step to mess things up.
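As a rough sketch of the stream-based approach (not the asker's actual code; the class name is hypothetical and the XML declaration is left enabled here so the declared encoding is visible), the XmlWriter can be created directly over a Stream with an explicit encoding:

```csharp
using System.IO;
using System.Text;
using System.Xml;

static class XmlStreamExample
{
    // Writes a small request document straight to a stream as UTF-8.
    public static byte[] WriteRequest()
    {
        // UTF8Encoding(false) avoids emitting a byte order mark
        var settings = new XmlWriterSettings { Encoding = new UTF8Encoding(false) };
        using (var stream = new MemoryStream())
        {
            using (var xmlWriter = XmlWriter.Create(stream, settings))
            {
                xmlWriter.WriteStartDocument();
                xmlWriter.WriteStartElement("REQUEST");
                xmlWriter.WriteAttributeString("TYPE", "EXAMPLE");
                xmlWriter.WriteEndElement();
                xmlWriter.WriteEndDocument();
            }
            return stream.ToArray();
        }
    }
}
```

Here the declared encoding and the bytes actually written agree, because the XmlWriter does the encoding itself. In the real code the MemoryStream would be replaced by requestStream.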
Alternatively, you could create a subclass of StringWriter which allows you to specify the encoding. See this answer for some sample code.
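A sketch of such a subclass might look like the following. The class names are made up for illustration, and the XML declaration is left enabled so the effect is visible.

```csharp
using System.IO;
using System.Text;
using System.Xml;

// A StringWriter that reports UTF-8 instead of the default UTF-16.
sealed class Utf8StringWriter : StringWriter
{
    public override Encoding Encoding
    {
        get { return Encoding.UTF8; }
    }
}

static class Utf8StringWriterExample
{
    // Writes a small document through the subclass and returns the XML text.
    public static string WriteRequest()
    {
        using (var textWriter = new Utf8StringWriter())
        {
            using (var xmlWriter = XmlWriter.Create(textWriter))
            {
                xmlWriter.WriteStartDocument();
                xmlWriter.WriteStartElement("REQUEST");
                xmlWriter.WriteAttributeString("TYPE", "EXAMPLE");
                xmlWriter.WriteEndElement();
                xmlWriter.WriteEndDocument();
            }
            return textWriter.ToString();
        }
    }
}
```

The XmlWriter takes the encoding name for the declaration from the TextWriter it is given, so the resulting string declares utf-8, which then matches a StreamWriter that writes it out as UTF-8.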
MonoTouch simply calls NSString.FromHandle on the value it receives from the call to UIDevice.CurrentDevice.Name. That is how most strings are created from NSString inside all the bindings.
That should give you a string that displays correctly in MonoDevelop (no ?), so I can't rule out a bug.
Can you tell us exactly how the device is named? If so, please open a bug report and we'll check this possibility.
This is the code in question:
using (var file = MemoryMappedFile.OpenExisting("AIDA64_SensorValues"))
{
using (var readerz = file.CreateViewAccessor(0, 0))
{
var bytes = new byte[567];
var encoding = Encoding.ASCII;
readerz.ReadArray<byte>(0, bytes, 0, bytes.Length);
File.WriteAllText("C:\\myFile.txt", encoding.GetString(bytes));
var readerSettings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
using (var reader = XmlReader.Create("C:\\myFile.txt", readerSettings))
{
This is what myfile.txt looks like:
<sys><id>SCPUCLK</id><label>CPU Clock</label><value>1598</value></sys><sys><id>SCPUFSB</id><label>CPU FSB</label><value>266</value></sys><sys><id>SMEMSPEED</id><label>Memory Speed</label><value>DDR2-667</value></sys><sys><id>SFREEMEM</id><label>Free Memory</label><value>415</value></sys><sys><id>SGPU1CLK</id><label>GPU Clock</label><value>562</value></sys><sys><id>SFREELVMEM</id><label>Free Local Video Memory</label><value>229</value></sys><temp><id>TCPU</id><label>CPU</label><value>42</value></temp><temp><id>TGPU1</id><label>GPU</label><value>58</value></temp>
If I write the data to a txt file on the hard drive with:
File.WriteAllText("C:\\myFile.txt", encoding.GetString(bytes));
then read that same text file with the fragment XmlReader:
XmlReader.Create("C:\\myFile.txt");
it reads it just fine, and the program runs and completes like it's supposed to. But if I read directly with the fragment XmlReader like this:
XmlReader.Create(encoding.GetString(bytes));
I get an "illegal characters in path" exception on the XmlReader.Create line.
I've tried writing it to a separate string first and reading that with XmlReader. I figured it wouldn't help to print it to CMD to see what it looks like, because CMD wouldn't show the invalid characters I'm dealing with, right?
But I did Console.WriteLine(encoding.GetString(bytes)); anyway, and it precisely matched the txt file.
So is writing it to the text file somehow removing some "illegal characters"? What do you guys think?
XmlReader.Create(encoding.GetString(bytes));
XmlReader.Create() interprets your string as the URI where it should read a file from. Instead encapsulate your bytes in a StringReader:
StringReader sr = new StringReader(encoding.GetString(bytes));
XmlReader.Create(sr);
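Since the data in the question has several top-level elements, the fragment settings from the original code are probably still needed alongside the StringReader. A sketch under that assumption (the class and method names are hypothetical), which collects the text of every id element:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

static class FragmentReaderExample
{
    // Reads an XML fragment from a string and returns the text of each <id> element.
    public static List<string> ReadIds(string xmlFragment)
    {
        var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
        var ids = new List<string>();
        // The StringReader makes XmlReader treat the string as content, not as a URI
        using (var reader = XmlReader.Create(new StringReader(xmlFragment), settings))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "id")
                {
                    ids.Add(reader.ReadElementContentAsString());
                }
            }
        }
        return ids;
    }
}
```

Passing it encoding.GetString(bytes) from the question would parse the sensor values without going through a temporary file.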
Here:
XmlReader.Create(encoding.GetString(bytes));
you are simply invoking the overload that takes a string representing a filename (URI). However, you are passing the actual XML string to it, which is obviously an invalid filename.
If you want to load the reader from a buffer you could use a stream:
byte[] bytes = ...; // represents the XML bytes
using (var stream = new MemoryStream(bytes))
using (var reader = XmlReader.Create(stream))
{
    ...
}
The method XmlReader.Create() with a single string as argument needs a URI passed and not the XML document as string, please refer to the MSDN. It tries to open a file named "<..." which is an invalid URI. You can pass a Stream instead.
You are passing the xml content in the place where it is expecting a path, as evidenced by the error - illegal characters in path
Use an appropriate overload, and pass a stream - http://msdn.microsoft.com/en-us/library/system.xml.xmlreader.create.aspx