Getting the largest image from HTML page source in C#

Basically, what I'm trying to do is get an image to represent a page (for quick browsing in a XAML GridView).
I have the page's URL (and its HTML content), but now I'm not completely sure how to proceed. I could just use the favicon, but I don't think that would scale well up to the 200x200 box I'm using to display it. The other option (as far as I can think of) is to look through the HTML source and pick out the largest image.
Is there an easier/simpler way to do that in C# other than just using regexes to find the height/width of all the image tags and then comparing them?
Thanks!

There is no way to know for sure from the HTML source what size the images are. An img tag doesn't require the height and width attributes. If they're not specified, then the image is displayed at its actual size. If all the img tags on the page have their height and width specified, you could pick the one that has the largest values. But those are the display sizes. The actual sizes might be quite different.
The only way to be 100% sure is to download each image and get its size.
By the way, if you're parsing HTML, you probably shouldn't be doing it with regular expressions. I know it seems simple enough, but you're almost certain to get things wrong and not handle some common cases. You'll save yourself a lot of time and frustration by using something like the Html Agility Pack.
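The "pick the one with the largest declared values" step could be sketched roughly like this. The sketch is in JavaScript purely for illustration, and it assumes an HTML parser has already extracted each img tag's src, width and height attributes into plain objects; the function name pickLargestDeclared is hypothetical. As noted above, it compares declared display sizes only, so images without both attributes are skipped rather than guessed at.

```javascript
// Hypothetical helper: choose the image with the largest declared area.
// Input: array of { src, width, height } taken from <img> attributes.
// Only plain integer pixel values count; missing or "50%" values are ignored.
function pickLargestDeclared(images) {
  const toPixels = (value) =>
    /^\d+$/.test(String(value).trim()) ? Number(value) : NaN;

  let best = null;
  let bestArea = 0;
  for (const img of images) {
    const w = toPixels(img.width);
    const h = toPixels(img.height);
    if (!(w > 0 && h > 0)) continue; // no usable declared size
    if (w * h > bestArea) {
      best = img;
      bestArea = w * h;
    }
  }
  return best; // null when no image declares both dimensions
}
```

If this returns null, the only fallback is the one already mentioned: download the images and read their real dimensions.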

You can try the imageObject.ActualWidth and imageObject.ActualHeight properties.

Related

c#, check area of pdf

I need to add text to an existing PDF (at the top or bottom of the page) in C#.
I need to make sure that I don't overwrite any visible text or image.
Is there any way I could check whether an area in a PDF contains text, an image, a control, etc.? I understand it will not be 100% accurate.

You're going to need a full PDF consumer at the very least, because the only way to find out where the marks are on the page is to parse (and possibly render) the PDF.
There are complications which you haven't covered (possibly they have not occurred to you). What do you consider to be the area of the PDF file? The MediaBox? The CropBox, TrimBox, ArtBox or BleedBox? What if the PDF file contains, for example, a rectangular fill with white which covers the page? What about a /Separation space called /White? Is that white (it generally gets rendered that way on the output) or not? And yes, this is a widely used ink in the T-shirt printing industry, amongst others.
To me the simplest solution would seem to be to use a tool which will give you the BoundingBox of marks on the page. I know the Ghostscript bbox device can do this, and I imagine there are other tools which can do so. But note (for Ghostscript at least): if there are any marks in white (whatever the colour space), these are considered as marking the page and will be counted in the bbox.
The same tool ought to be able to give the size of the various Boxes in the PDF file (you'd need the pdf_info.ps program for Ghostscript to get this, currently). You can then quickly calculate which areas are unmarked.
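That last calculation might look something like the sketch below. It is written in JavaScript for illustration only, and it assumes both the page box and the bounding box of the marks have already been obtained from the tool and are given as [llx, lly, urx, ury] arrays in PDF points, with the origin at the bottom-left. The function name freeMargins is hypothetical.

```javascript
// Hypothetical helper: size of the unmarked strips around the marked area.
// pageBox and markBox are [llx, lly, urx, ury] in PDF points
// (origin bottom-left, y increasing upwards).
function freeMargins(pageBox, markBox) {
  const [pLeft, pBottom, pRight, pTop] = pageBox;
  const [mLeft, mBottom, mRight, mTop] = markBox;
  return {
    left: Math.max(0, mLeft - pLeft),
    bottom: Math.max(0, mBottom - pBottom),
    right: Math.max(0, pRight - mRight),
    top: Math.max(0, pTop - mTop),
  };
}

// Example: a US Letter page whose marks stop 92 points below the top edge.
const margins = freeMargins([0, 0, 612, 792], [72, 100, 540, 700]);
console.log(margins.top, margins.bottom);
```

A header line fits at the top only if its height, plus whatever padding you want, is smaller than the top value.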
But 'unmarked' isn't the same thing as 'white'. If you want to not count areas which are painted in 'white' then the problem becomes greater. You really need to render the content and then look at each image sample in the output to see if it's white or not, recording the maxima and minima of the x and y co-ordinates to determine the 'non-white' area of the page.
This is because there are complications like transfer functions, transparency blending, colour management, and image masking, any or all of which might cause an area which is marked with a non-white colour to be rendered white (a transparency SMask, for example) or an area marked with white to be rendered non-white (e.g. a transfer function).
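The sample-by-sample scan described above could be sketched as follows. This is JavaScript for illustration, and it assumes the page has already been rendered to an 8-bit greyscale raster stored row by row (top row first) in a flat array, where 255 means white. The function name nonWhiteBounds and the raster layout are assumptions, not the output format of any particular renderer.

```javascript
// Hypothetical helper: bounding box of all non-white samples in a raster.
// pixels: flat array of 8-bit grey values, row-major, top row first.
// Returns pixel coordinates (inclusive), or null if the page is blank.
function nonWhiteBounds(pixels, width, height, white = 255) {
  let minX = width, minY = height, maxX = -1, maxY = -1;
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      if (pixels[y * width + x] < white) {
        if (x < minX) minX = x;
        if (x > maxX) maxX = x;
        if (y < minY) minY = y;
        if (y > maxY) maxY = y;
      }
    }
  }
  return maxX < 0 ? null : { minX, minY, maxX, maxY };
}
```

Because raster rows usually run top-down while PDF user space runs bottom-up, the y values need flipping (and scaling by the render resolution) before they are compared with the page boxes.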
Your question is unclear because you haven't defined whether any of these issues are important to you, and how you want to treat them.

Get rendered text letters dimensions

I've got a really simple setup - I have a string, a font and a font size at the ready. I want to render this to a Silverlight WriteableBitmap.
There's one catch - I want to be able to tell apart the letters in the rendered text. Ideally, I'd like to have a System.Windows.Rect for every rendered letter.
The problem is Silverlight's API, which is missing all of the useful stuff like Graphics.MeasureString, which I could have used to measure the letters separately.
What adequate options do I have to get the letters' measures in codebehind?

I sort of figured this out on my own. The solution is slow and far from perfect, but hey, it works!
The idea is to render the text many times, adding one letter at a time, and finding the difference between the current and previous TextBlock widths.
So, for example, if the text is "ab", we first render "a" and get its length. Then we render "ab" and find the difference, which should be the size of the rendered "b".
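The prefix-difference idea can be sketched like this. It is JavaScript for illustration rather than Silverlight code: measureWidth is a hypothetical callback standing in for "render this string in a TextBlock and read its width", and letterRects is likewise an invented name. Any kerning between two letters ends up attributed to the later one, which is one reason the result is only approximate.

```javascript
// Hypothetical helper: one rectangle per letter, from prefix width differences.
// measureWidth(str) must return the rendered width of str (an assumption;
// in Silverlight it would wrap a TextBlock measurement).
function letterRects(text, measureWidth, lineHeight) {
  const rects = [];
  let previousWidth = 0;
  for (let i = 1; i <= text.length; i++) {
    const currentWidth = measureWidth(text.slice(0, i)); // "a", then "ab", ...
    rects.push({
      char: text[i - 1],
      x: previousWidth,
      y: 0,
      width: currentWidth - previousWidth,
      height: lineHeight,
    });
    previousWidth = currentWidth;
  }
  return rects;
}
```

This needs one measurement per letter, which matches the "slow but it works" description: the cost grows with the length of the text.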

Add Fixed Width file to PDF

I have a client that is asking me to add a fixed width (510 character) header record to a PDF file. They have asked that I create a new page (not a problem) on which I write this fixed width header record.
I can do this, and see the header record as page 1, followed by the original PDF. The problem is white space. The 510 character fixed width header is about 60% white space and all the ways I've tried generating the PDF cause this to be truncated. There are also line breaks where the text wraps. The client wants to be able to use some OCR software they have purchased in order to read this header record from page 1.
I know very little about the PDF file format. I've tried using ABCpdf, PDFsharp, and also created an RDLC and bound it to this header string and then generated a PDF from that. All 3 resulted in the same outcome.
Let me say I know how crazy this sounds, but it's what a client is requesting. I proposed several other ways in which we could solve their problem, but this (right now) is the only one they are comfortable with. They are not comfortable with me just appending the 510 characters onto the byte array, and having them separate it out programmatically.

Are you looking to have a page displaying the long header? You can create a PDF page of any size (for example, Print to PDF with a custom page size of 20" wide by 6" tall; weird but possible).
Once that page is created, it can be inserted into another document of regular letter size pages.
Are you looking for consecutive pages displaying chunks of the header?

Using OCR to read content that you put in yourself is overkill. Instead of rendering the 510-character header as text, render it as single-character form fields. This way it will be easy to access those form fields by name and retrieve the values using the same PDF library with which you created the PDFs.

PDF doesn't wrap text lines automatically & respect line position

I'm trying to generate a PDF via code because not all current .NET PDF libraries support the new Windows Runtime for Windows/Windows Phone 8.1.
The PDF is saved correctly, apart from a bug in the stream position count that I can fix easily, but, as you can see, the text doesn't wrap if a line is too long.
I tried the PDF newline character (\n), but C# automatically converts it in the input string.
Also, I can't work out how to position lines or objects in the document. I followed an online guide that talks about a reversed axis layout (x for height and y for width), but I don't seem to have caught the right method. In my code I use a constant left position of 40 and a decreasing top value that starts at 600; I don't yet handle multiple pages when the value drops below 0.
This is the code of the generated PDF:
http://pastebin.com/ZkZmbJdM
(Sorry for using Pastebin, but with this editor's Code function the code comes out unformatted because of the special characters in it.)
What am I doing wrong?

PDF is a graphical format trying to make you think it's a document format. But nope, it's just like drawing with GDI+ for instance. This is the reason why it can achieve the same rendered output across many platforms/programs/etc as opposed to text flow formats like doc/docx. And also, this is why it can really render anything.
So, as opposed to document formats, it is the responsibility of the program that generates the PDF to get the layout right. Think of it just as if you'd draw with GDI+.
In documents like docx or html, it's the rendering program that has to do the layout work. With such a document, you just write text and the viewer will take care of laying it out.
Your PDF library certainly has the necessary code to measure the text length. It may even have some code to provide layout capabilities. You'll have to use these functions to do the layout.
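Doing the layout yourself could look roughly like the sketch below: a greedy word wrap driven by a text-measuring function, followed by assigning each line a position. It is JavaScript for illustration only. measure is a hypothetical callback standing in for whatever text-width function your PDF library or font metrics provide, and wrapText and layoutLines are invented names. In default PDF user space the origin is at the bottom-left and y grows upwards, so each following line gets a smaller y.

```javascript
// Hypothetical helper: greedy word wrap using a caller-supplied width function.
// A single word wider than maxWidth is kept whole on its own line.
function wrapText(text, maxWidth, measure) {
  const lines = [];
  let current = '';
  for (const word of text.split(/\s+/).filter(Boolean)) {
    const candidate = current ? current + ' ' + word : word;
    if (current && measure(candidate) > maxWidth) {
      lines.push(current); // current line is full, start a new one
      current = word;
    } else {
      current = candidate;
    }
  }
  if (current) lines.push(current);
  return lines;
}

// Assign a position to each line: fixed left edge, y decreasing by the leading.
function layoutLines(lines, left, top, leading) {
  return lines.map((text, i) => ({ text, x: left, y: top - i * leading }));
}
```

With the question's values (left at 40, first line at 600), any line whose y falls below the bottom margin is the cue to start a new page and reset y.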

Calculate area of HTML element on website?

I'm trying to figure out if it's possible to calculate the area of an HTML element on a website, in pixels, as a percentage or whatever.
My first thought was to assume that the element is 100% in width and height, and then try to retrieve the size through a mapping between the HTML and CSS.
So if there's a width/height attribute in the referenced CSS file, I could possibly say that the body element is covered by a column that takes 25% of the area (everything is based on your screen resolution of course, and I'm still trying to figure out how I'd be able to do this programmatically).
Or should I render the website and do my calculations based on an image at the most common screen resolution at the time?
Are there any more possible solutions?
(Currently I'm trying to solve this in Perl, but I suppose any language that's got a library for this purpose would be appreciated to know about!)
EDIT: I need to retrieve the visual area for every single element on a page.
For example, if there are elements on top of the <body> element that cover it visually, I want to exclude that area from the <body>'s, and so on. A simple raytracing to find the visible area for every element on a page.
EDIT: Let's say we exclude JavaScript - any other approaches possible?

Personally, I would use jQuery - even if you don't use a library, your best bet will be a JavaScript solution.
var $elt = $('#some-element-id'),
height = $elt.height(),
width = $elt.width(),
area = height * width; // contains the area in px^2
http://api.jquery.com/height and http://api.jquery.com/width

This is such a simple problem that I don't think jQuery is necessary if you are not already using it.
Try running this:
<div id="myParent">
What's up?
<div id="myDiv">Hello there!</div>
</div>
With
var myDiv = document.getElementById("myDiv");
alert(myDiv.offsetHeight);
alert(myDiv.offsetWidth);
var myParent = myDiv.parentNode;
alert(myParent.offsetHeight);
alert(myParent.offsetWidth);
Divide the resulting widths (or the width times height areas) to get the percentage of space the element takes in its parent, or simply use the absolute pixel values.
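The division could be wrapped up in a small helper like this. It is a sketch only: areaShare is a hypothetical name, and it relies on nothing more than the offsetWidth and offsetHeight properties used above, so it measures layout boxes and ignores anything that visually overlaps the element.

```javascript
// Hypothetical helper: percentage of the parent's box area taken by an element.
// Works on any objects exposing offsetWidth / offsetHeight (DOM elements do).
function areaShare(element, parent) {
  const area = element.offsetWidth * element.offsetHeight;
  const parentArea = parent.offsetWidth * parent.offsetHeight;
  return parentArea === 0 ? 0 : (area / parentArea) * 100;
}

// In a browser, with the markup above:
// areaShare(document.getElementById("myDiv"),
//           document.getElementById("myParent"));
```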

I would recommend using jQuery to do it if possible.
alert('Size: ' + $('li').size());
http://api.jquery.com/size/

Would it be feasible to use JavaScript? If so, you can get the width/height with something like this:
document.getElementById(YourElementsId).style.height;
document.getElementById(YourElementsId).style.width;
However, this depends on how the elements are styled in the first place: the style property only reflects values set inline on the element. Another option would be
document.getElementById(YourElementsId).clientHeight;
document.getElementById(YourElementsId).clientWidth;
