Get and Download pictures with AngleSharp

I started using AngleSharp for a project, and I need to get and download not only the HTML but also the images in the document.
I know that the Document object has a property called Images, but apparently it doesn't get all of them. I ran a test on a YouTube page and got only one image (repeated several times).
For example, I'd like to get the thumbnail of the current video, which seems to be inside a <meta> tag.
To be more precise, the images are stored inside tags like this:
<meta content="https://i.ytimg.com/vi/hW-kDv1WcQM/hqdefault.jpg" property="og:image">
So I wonder if there is a way to select all the nodes/URLs of any image inside a page, no matter which tag is used.
I don't think that QuerySelectorAll works in this case, as it selects only one type of node.
You can try the sample code from GitHub to verify that (I just changed the URL to the YouTube one, and the selector too :D):
// Setup the configuration to support document loading
var config = Configuration.Default.WithDefaultLoader();
// Address of the page to load (here, a YouTube video page)
var address = "https://www.youtube.com/watch?v=hW-kDv1WcQM&feature=youtu.be";
// Asynchronously get the document in a new context using the configuration
var document = await BrowsingContext.New(config).OpenAsync(address);
// This CSS selector gets the desired content
var cellSelector = "img";
// Perform the query to get all cells with the content
var cells = document.QuerySelectorAll(cellSelector);
// We are only interested in the text - select it with LINQ
var titles = cells.Select(m => m.TextContent);
Oh, sure, you can also add this to check that the Images property doesn't return the video thumbnails:
var Images = document.Images.Select(sl=> sl.Source).Distinct().ToList();
Any other method to select nodes based on the URL content? (like all of the urls ending with ".jpg", or ".png", etc.)

You can use LINQ to get every attribute in the page whose value is an image URL, like so:
.....
var document = await BrowsingContext.New(config).OpenAsync(address);
// List all image file extensions here:
var fileExtensions = new string[] { ".jpg", ".png" };
// Find every attribute of every element
// whose value ends with one of the listed file extensions (ignoring case)
var result = from element in document.All
from attribute in element.Attributes
where fileExtensions.Any(e => attribute.Value.EndsWith(e, StringComparison.OrdinalIgnoreCase))
select attribute;
foreach (var item in result)
{
Console.WriteLine(item.Value);
}
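As a complement to the attribute scan above: a CSS selector list (selectors separated by commas) lets a single QuerySelectorAll call match several element types, so the og:image meta tag from the question can be picked up together with ordinary img elements. The following is only a minimal sketch; the class name ImageUrlFinder and the inline sample markup are made up for illustration, and the document is parsed from a string instead of being loaded from YouTube.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using AngleSharp;
using AngleSharp.Dom;

public static class ImageUrlFinder
{
    // Collects candidate image URLs from <img src="..."> and
    // <meta property="og:image" content="..."> elements.
    public static List<string> FindImageUrls(IDocument document)
    {
        // A comma-separated selector list matches several element types in one query.
        var elements = document.QuerySelectorAll("img[src], meta[property='og:image']");

        return elements
            // <meta> keeps the URL in "content", <img> keeps it in "src".
            .Select(e => e.LocalName == "meta" ? e.GetAttribute("content") : e.GetAttribute("src"))
            .Where(url => !string.IsNullOrWhiteSpace(url))
            .Distinct()
            .ToList();
    }

    public static async Task Main()
    {
        // Hypothetical markup standing in for a downloaded page.
        const string html =
            "<html><head>" +
            "<meta property=\"og:image\" content=\"https://i.ytimg.com/vi/hW-kDv1WcQM/hqdefault.jpg\">" +
            "</head><body><img src=\"logo.png\"><img src=\"logo.png\"></body></html>";

        var context = BrowsingContext.New(Configuration.Default);
        var document = await context.OpenAsync(req => req.Content(html));

        foreach (var url in FindImageUrls(document))
        {
            Console.WriteLine(url);
        }
    }
}
```

Each URL found this way could then be downloaded, for example with HttpClient.GetByteArrayAsync. Relative values such as logo.png would first need to be resolved against the page address.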

Related

Merge PDF files with TOC element

I'm merging PDF files using GemBox.Pdf as shown here. This works great and I can easily add outlines.
I've previously done a similar thing and merged Word files with GemBox.Document as shown here.
But now my problem is that there is no TOC element in GemBox.Pdf. I want to automatically get a table of contents while merging multiple PDF files into one.
Am I missing something or is there really no such element for PDF?
Do I need to recreate it? If so, how would I do that?
I can add a bookmark, but I don't know how to add a link to it.

There is no such element in PDF files, so we need to create this content ourselves.
Now one way would be to create text elements, outlines, and link annotations, position them appropriately, and set the link destinations to outlines.
However, this could be quite some work so perhaps it would be easier to just create the desired TOC element with GemBox.Document, save it as a PDF file, and then import it into the resulting PDF.
// Source data for creating TOC entries with specified text and associated PDF files.
var pdfEntries = new[]
{
new { Title = "First Document Title", Pdf = PdfDocument.Load("input1.pdf") },
new { Title = "Second Document Title", Pdf = PdfDocument.Load("input2.pdf") },
new { Title = "Third Document Title", Pdf = PdfDocument.Load("input3.pdf") },
};
/***************************************************************/
/* Create new document with TOC element using GemBox.Document. */
/***************************************************************/
// Create new document.
var tocDocument = new DocumentModel();
var section = new Section(tocDocument);
tocDocument.Sections.Add(section);
// Create and add TOC element.
var toc = new TableOfEntries(tocDocument, FieldType.TOC);
section.Blocks.Add(toc);
section.Blocks.Add(new Paragraph(tocDocument, new SpecialCharacter(tocDocument, SpecialCharacterType.PageBreak)));
// Create heading style.
// By default, when updating TOC element a TOC entry is created for each paragraph that has heading style.
var heading1Style = (ParagraphStyle)tocDocument.Styles.GetOrAdd(StyleTemplateType.Heading1);
// Add heading and empty (placeholder) pages.
// The number of added placeholder pages depends on the number of pages in the actual PDF file, so that TOC entries have correct page numbers.
int totalPageCount = 0;
foreach (var pdfEntry in pdfEntries)
{
section.Blocks.Add(new Paragraph(tocDocument, pdfEntry.Title) { ParagraphFormat = { Style = heading1Style } });
section.Blocks.Add(new Paragraph(tocDocument, new SpecialCharacter(tocDocument, SpecialCharacterType.PageBreak)));
int currentPageCount = pdfEntry.Pdf.Pages.Count;
totalPageCount += currentPageCount;
while (--currentPageCount > 0)
section.Blocks.Add(new Paragraph(tocDocument, new SpecialCharacter(tocDocument, SpecialCharacterType.PageBreak)));
}
// Remove last extra-added empty page.
section.Blocks.RemoveAt(section.Blocks.Count - 1);
// Update TOC element and save the document as PDF stream.
toc.Update();
var pdfStream = new MemoryStream();
tocDocument.Save(pdfStream, new GemBox.Document.PdfSaveOptions());
/***************************************************************/
/* Merge PDF files into PDF with TOC element using GemBox.Pdf. */
/***************************************************************/
// Load a PDF stream using GemBox.Pdf.
var pdfDocument = PdfDocument.Load(pdfStream);
var rootDictionary = (PdfDictionary)((PdfIndirectObject)pdfDocument.GetDictionary()[PdfName.Create("Root")]).Value;
var pagesDictionary = (PdfDictionary)((PdfIndirectObject)rootDictionary[PdfName.Create("Pages")]).Value;
var kidsArray = (PdfArray)pagesDictionary[PdfName.Create("Kids")];
var pageIds = kidsArray.Cast<PdfIndirectObject>().Select(obj => obj.Id).ToArray();
// Remove empty (placeholder) pages.
while (totalPageCount-- > 0)
pdfDocument.Pages.RemoveAt(pdfDocument.Pages.Count - 1);
// Add pages from PDF files.
foreach (var pdfEntry in pdfEntries)
foreach (var page in pdfEntry.Pdf.Pages)
pdfDocument.Pages.AddClone(page);
/*****************************************************************************/
/* Update TOC links from placeholder pages to actual pages using GemBox.Pdf. */
/*****************************************************************************/
// Create a mapping from the ID of an empty (placeholder) page indirect object to an actual page indirect object.
var pageCloneMap = new Dictionary<PdfIndirectObjectIdentifier, PdfIndirectObject>();
for (int i = 0; i < kidsArray.Count; ++i)
pageCloneMap.Add(pageIds[i], (PdfIndirectObject)kidsArray[i]);
foreach (var entry in pageCloneMap)
{
// If page was updated, it means that we passed TOC pages, so break from the loop.
if (entry.Key != entry.Value.Id)
break;
// For each TOC page, get its 'Annots' entry.
// For each link annotation from the 'Annots' get the 'Dest' entry.
// Update the first item in the 'Dest' array so that it no longer points to a removed page.
if (((PdfDictionary)entry.Value.Value).TryGetValue(PdfName.Create("Annots"), out PdfBasicObject annotsObj))
foreach (PdfIndirectObject annotObj in (PdfArray)annotsObj)
if (((PdfDictionary)annotObj.Value).TryGetValue(PdfName.Create("Dest"), out PdfBasicObject destObj))
{
var destArray = (PdfArray)destObj;
destArray[0] = pageCloneMap[((PdfIndirectObject)destArray[0]).Id];
}
}
// Save resulting PDF file.
pdfDocument.Save("Result.pdf");
pdfDocument.Close();
This way you can easily customize the TOC element by using the TOC switches and styles. For more info, see the Table Of Content example from GemBox.Document.

Copying OLE Objects from one slide to another corrupts the resulting PowerPoint

I have code that copies the content of one PowerPoint slide into another. Below is an example of how images are processed.
foreach (OpenXmlElement element in sourceSlide.CommonSlideData.ShapeTree.ChildElements.ToList())
{
string elementType = element.GetType().ToString();
if (elementType.EndsWith(".Picture"))
{
// Deep clone the element.
elementClone = element.CloneNode(true);
var picture = (Picture)elementClone;
// Get the picture's original rId
var blip = picture.BlipFill.Blip;
string rId = blip.Embed.Value;
// Retrieve the ImagePart from the original slide by rId
ImagePart sourceImagePart = (ImagePart)sourceSlide.SlidePart.GetPartById(rId);
// Add the image part to the new slide, letting OpenXml generate the new rId
ImagePart targetImagePart = targetSlidePart.AddImagePart(sourceImagePart.ContentType);
// And copy the image data.
targetImagePart.FeedData(sourceImagePart.GetStream());
// Retrieve the new ID from the target image part,
string id = targetSlidePart.GetIdOfPart(targetImagePart);
// and assign it to the picture.
blip.Embed.Value = id;
// Get the shape tree that we're adding the clone to and append to it.
ShapeTree shapeTree = targetSlide.CommonSlideData.ShapeTree;
shapeTree.Append(elementClone);
}
This code works fine. For other scenarios like Graphic Frames, it looks a bit different, because each graphic frame can contain multiple picture objects.
// Go through all the Picture objects in this GraphicFrame.
foreach (var sourcePicture in element.Descendants<Picture>())
{
string rId = sourcePicture.BlipFill.Blip.Embed.Value;
ImagePart sourceImagePart = (ImagePart)sourceSlide.SlidePart.GetPartById(rId);
var contentType = sourceImagePart.ContentType;
var targetPicture = elementClone.Descendants<Picture>().First(x => x.BlipFill.Blip.Embed.Value == rId);
var targetBlip = targetPicture.BlipFill.Blip;
ImagePart targetImagePart = targetSlidePart.AddImagePart(contentType);
targetImagePart.FeedData(sourceImagePart.GetStream());
string id = targetSlidePart.GetIdOfPart(targetImagePart);
targetBlip.Embed.Value = id;
}
Now I need to do the same thing with OLE objects.
// Go through all the embedded objects in this GraphicFrame.
foreach (var oleObject in element.Descendants<OleObject>())
{
// Get the rId of the embedded OLE object.
string rId = oleObject.Id;
// Get the EmbeddedPart from the source slide.
var embeddedOleObj = sourceSlide.SlidePart.GetPartById(rId);
// Get the content type.
var contentType = embeddedOleObj.ContentType;
// Create the Target Part. Let OpenXML assign an rId.
var targetObjectPart = targetSlide.SlidePart.AddNewPart<EmbeddedObjectPart>(contentType, null);
// Get the embedded OLE object data from the original object.
var objectStream = embeddedOleObj.GetStream();
// And give it to the ObjectPart.
targetObjectPart.FeedData(objectStream);
// Get the new rId and assign it to the OLE Object.
string id = targetSlidePart.GetIdOfPart(targetObjectPart);
oleObject.Id = id;
}
But it didn't work. The resulting PowerPoint is corrupted.
What am I doing wrong?
NOTE: All of the code works except for the rId handling in the OLE Object. I know it works because if I simply pass the original rId from the source object to the target Object Part, like this:
var targetObjectPart = targetSlide.SlidePart
.AddNewPart<EmbeddedObjectPart>(contentType, rId);
it will function properly, so long as that rId doesn't already exist in the target slide, which will obviously not work every time like I need it to.
The source slide and target slide are coming from different PPTX files. We're using OpenXML, not Office Interop.

Since you did not provide the full code, it is difficult to tell what's wrong.
My guess would be that you are not modifying the correct object.
In your code example for Pictures, you are creating and modifying elementClone.
In your code example for OLE objects, you are working with and modifying oleObject (which is a descendant of element), and it is not exactly clear from the context whether it is part of the source document or of the target document.
You can try this minimal example:
use a new pptx with one embedded ole object for c:\testdata\input.pptx
use a new pptx (a blank one) for c:\testdata\output.pptx
After running the code, I was able to open the embedded ole object in the output document.
using DocumentFormat.OpenXml.Presentation;
using DocumentFormat.OpenXml.Packaging;
using System.Linq;
namespace ooxml
{
class Program
{
static void Main(string[] args)
{
CopyOle("c:\\testdata\\input.pptx", "c:\\testdata\\output.pptx");
}
private static void CopyOle(string inputFile, string outputFile)
{
using (PresentationDocument sourceDocument = PresentationDocument.Open(inputFile, true))
{
using (PresentationDocument targetDocument = PresentationDocument.Open(outputFile, true))
{
var sourceSlidePart = sourceDocument.PresentationPart.SlideParts.First();
var targetSlidePart = targetDocument.PresentationPart.SlideParts.First();
foreach (var element in sourceSlidePart.Slide.CommonSlideData.ShapeTree.ChildElements)
{
//clones an element, does not copy the actual relationship target (e.g. ppt\embeddings\oleObject1.bin)
var elementClone = element.CloneNode(true);
//for each cloned OleObject, fix its relationship
foreach(var clonedOleObject in elementClone.Descendants<OleObject>())
{
//find the original EmbeddedObjectPart in the source document
//(we can use the id from the clonedOleObject to do that, since it contained the same id
// as the source ole object)
var sourceObjectPart = sourceSlidePart.GetPartById(clonedOleObject.Id);
//create a new EmbeddedObjectPart in the target document and copy the data from the original EmbeddedObjectPart
var targetObjectPart = targetSlidePart.AddEmbeddedObjectPart(sourceObjectPart.ContentType);
targetObjectPart.FeedData(sourceObjectPart.GetStream());
//update the relationship target on the clonedOleObject to point to the newly created EmbeddedObjectPart
clonedOleObject.Id = targetSlidePart.GetIdOfPart(targetObjectPart);
}
//add cloned element to the document
targetSlidePart.Slide.CommonSlideData.ShapeTree.Append(elementClone);
}
targetDocument.PresentationPart.Presentation.Save();
}
}
}
}
}
As for troubleshooting, the OOXML Tools Chrome extension was helpful.
It lets you compare the structure of two documents, which makes it much easier to analyze what went wrong.
Examples:
if you were to only clone all elements, you could see that /ppt/embeddings/* and /ppt/media/* would be missing
or you can check whether the relationships are correct (e.g. input document uses "rId1" to reference the embedded data and the output document uses "R3a2fa0c37eaa42b5")

Processing word document using OpenXML and C#

So I'm trying to populate the content controls in a word document by matching the Tag and populating the text within that content control.
The following displays in a MessageBox all of the tags I have in my document.
//Create a copy of the template file and open the document
File.Delete(hhscDocument);
File.Copy(hhscTemplate, hhscDocument, true);
//Open the word document specified by location
using (var document = WordprocessingDocument.Open(hhscDocument, true))
{
//Change the document type from template to document
var mainDocument = document.MainDocumentPart.Document;
if (mainDocument.Body.Descendants<Tag>().Any())
{
//MessageBox.Show(mainDocument.Body.Descendants<Table>().Count().ToString());
var tags = mainDocument.Body.Descendants<Tag>().ToList();
var aString = string.Empty;
foreach(var tag in tags)
{
aString += string.Format("{0}{1}", tag.Val, Environment.NewLine);
}
MessageBox.Show(aString);
}
}
However when I try the following it doesn't work.
//Create a copy of the template file and open the document
File.Delete(hhscDocument);
File.Copy(hhscTemplate, hhscDocument, true);
//Open the word document specified by location
using (var document = WordprocessingDocument.Open(hhscDocument, true))
{
//Change the document type from template to document
var mainDocument = document.MainDocumentPart.Document;
if (mainDocument.Body.Descendants<Tag>().Any())
{
//MessageBox.Show(mainDocument.Body.Descendants<Table>().Count().ToString());
var tags = mainDocument.Body.Descendants<Tag>().ToList();
var bString = string.Empty;
foreach(var tag in tags)
{
bString += string.Format("{0}{1}", tag.Parent.GetFirstChild<Text>().Text, Environment.NewLine);
}
MessageBox.Show(bString);
}
}
My objective in the end is if I match the appropriate tag I want to populate/change the text in the content control that tag belongs to.

So I basically used FirstChild and InnerXml to pick apart the document's XML contents. From there I developed the following, which does what I need.
//Open the word document specified by location
using (var document = WordprocessingDocument.Open(hhscDocument, true))
{
var mainDocument = document.MainDocumentPart.Document;
if (mainDocument.Body.Descendants<Tag>().Any())
{
//Find all elements(descendants) of type tag
var tags = mainDocument.Body.Descendants<Tag>().ToList();
//Foreach of these tags
foreach (var tag in tags)
{
//Jump up two levels (.Parent.Parent) in the XML element and then jump down to the run level
var run = tag.Parent.Parent.Descendants<Run>().ToList();
//I access the 1st element because there is only one element in run
run[0].GetFirstChild<Text>().Text = "<new_text_value>";
}
}
mainDocument.Save();
}
This finds all the tags inside your document and stores the elements in a list:
var tags = mainDocument.Body.Descendants<Tag>().ToList();
This part of the code starts at the tag element in the XML. From there I call Parent twice to jump up two levels, which gives me access to the run level through Descendants.
var run = tag.Parent.Parent.Descendants<Run>().ToList();
Last but not least, the following code stores a new value in the text part of the plain-text content control.
run[0].GetFirstChild<Text>().Text = "<new_text_value>";
One thing I noticed is that the XML hierarchy is a funky thing. I find it easier to access these elements from the bottom up, which is why I started with the tags and moved up.
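For context on why the attempt in the question fails: a Tag element sits inside the content control's SdtProperties element, which has no Text child, so tag.Parent.GetFirstChild<Text>() returns null. Below is a minimal sketch of a top-down variant that matches a specific tag value and only then changes the text. It assumes plain-text content controls; the class name ContentControlFiller, the tag value "CustomerName", and the in-memory Body used for the demo are all made up for illustration.

```csharp
using System;
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Wordprocessing;

public static class ContentControlFiller
{
    // Sets the text of every content control under 'root' whose tag equals 'tagValue'.
    // Returns the number of content controls that were updated.
    public static int SetText(OpenXmlElement root, string tagValue, string newText)
    {
        int updated = 0;

        // SdtElement is the common base class of block, run, and cell content controls.
        foreach (var sdt in root.Descendants<SdtElement>().ToList())
        {
            // The tag lives in the content control's properties element.
            var tag = sdt.GetFirstChild<SdtProperties>()?.GetFirstChild<Tag>();
            if (tag?.Val?.Value != tagValue)
            {
                continue;
            }

            // Assumption: a plain-text control, so the first Text element holds the value.
            var text = sdt.Descendants<Text>().FirstOrDefault();
            if (text == null)
            {
                continue;
            }

            text.Text = newText;
            updated++;
        }

        return updated;
    }

    public static void Main()
    {
        // Small in-memory body with one tagged content control (no file needed).
        var body = new Body(
            new Paragraph(
                new SdtRun(
                    new SdtProperties(new Tag { Val = "CustomerName" }),
                    new SdtContentRun(new Run(new Text("placeholder"))))));

        int count = SetText(body, "CustomerName", "Jane Doe");
        Console.WriteLine(count + " content control(s) updated");
        Console.WriteLine(body.Descendants<Text>().First().Text);
    }
}
```

With a real file, mainDocument.Body from the code above could be passed as root, followed by mainDocument.Save().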

iTextSharp Input string was not in a correct format css error

I have been trying to get my MVC application to create PDF files based on MVC views. I got this working with plain HTML, but I would also like to include the CSS files that I use for the browser. Some of them work, but with one I get the following error:
An exception of type 'System.FormatException' occurred in mscorlib.dll but was not handled in user code
Additional information: Input string was not in a correct format.
I am using the following code:
var data = GetHtml(new IndexModel(Context), "~\\Views\\Home\\Index.cshtml", "");
using (var document = new iTextSharp.text.Document())
{
//define output control HTML
var memStream = new MemoryStream();
TextReader xmlString = new StringReader(data);
PdfWriter writer = PdfWriter.GetInstance(document, new FileStream("c:\\tmp\\my.pdf", FileMode.OpenOrCreate));
//open doc
document.Open();
// register all fonts in current computer
FontFactory.RegisterDirectories();
// Set factories
var htmlContext = new HtmlPipelineContext(null);
htmlContext.SetTagFactory(Tags.GetHtmlTagProcessorFactory());
// Set css
ICSSResolver cssResolver = XMLWorkerHelper.GetInstance().GetDefaultCssResolver(false);
cssResolver.AddCssFile(HttpContext.Server.MapPath("~/Content/elements.css"), true);
cssResolver.AddCssFile(HttpContext.Server.MapPath("~/Content/style.css"), true);
cssResolver.AddCssFile(HttpContext.Server.MapPath("~/Content/jquery-ui.css"), true);
// Export
IPipeline pipeline = new CssResolverPipeline(cssResolver, new HtmlPipeline(htmlContext, new PdfWriterPipeline(document, writer)));
var worker = new XMLWorker(pipeline, true);
var xmlParse = new XMLParser(true, worker);
xmlParse.Parse(xmlString);
xmlParse.Flush();
document.Close();
}
The string "data" is correct and has no issues; the problem lies with AddCssFile().
If I create the PDF without any CSS files everything works, but including the CSS files triggers the error.
Help would be very much appreciated.

I don't know the exact answer, but judging by the error you are getting, I would try two different approaches.
First approach: change
cssResolver.AddCssFile(HttpContext.Server.MapPath("~/Content/elements.css"), true);
to something like
var cssPath = HttpContext.Server.MapPath("~/Content/elements.css");
cssResolver.AddCssFile(cssPath, true);
Then set a breakpoint and look at the value returned for cssPath. Make sure it is accurate and does not contain any odd characters.
Second approach: if all else fails, try giving an absolute URL to the CSS resource, such as http://yourdomain.com/cssPath, instead of a file system path.
If either of those two approaches helps, you can use it to determine the actual problem and then refactor to your heart's content.
UPDATE ------------------------------------------------------------------>
According to the documentation, you need an absolute URL for the file, so Server.MapPath won't work.
addCssFile
void addCssFile(String href,
boolean isPersistent)
throws CssResolverException
Add a
Parameters:
href - the link to the css file ( an absolute uri )
isPersistent - true if the added css should not be deleted on a call to clear
Throws:
CssResolverException - thrown if something goes wrong
In that case, I would try using something like :
public string AbsoluteContent(string contentPath)
{
var path = Url.Content(contentPath);
var url = new Uri(HttpContext.Current.Request.Url, path);
return url.AbsoluteUri;
}
and use it like this:
var cssPath = AbsoluteContent("~/Content/embeddedCss/yourcssfile.css");

How to send XML file to client in ASP.NET MVC

In an ASP.NET MVC application I have a database table. I want to have a button on a view page; when a user clicks that button, my application will generate an XML file containing all rows in the table. The file should then be sent to the client so that the user sees a download pop-up window.
Similarly, I want to allow the user to upload an XML file whose content will be added to the database.
What's the simplest way to let the user upload and download a file?
Thanks for all the answers
EDIT:
This is my approach:
public FileContentResult Download() {
if(model.Series.Count() < 1) {
byte[] content = new byte[0];
return new FileContentResult(content, "Series");
}
XmlSerializer serializer = new XmlSerializer(model.Series.FirstOrDefault().GetType());
MemoryStream xmlStream = new MemoryStream();
foreach (Series s in model.Series) {
serializer.Serialize(xmlStream, s);
}
byte[] content2 = new byte[xmlStream.Length];
xmlStream.Position = 0;
xmlStream.Read(content2, 0, (int) xmlStream.Length);
return File(content2, "Series");
}
Where model is a DataContext. However, this does not work. When I try to download the data I get this error:
XML Parsing Error: junk after document element
Location: http://localhost:1399/Xml/Download
Line Number 7, Column 10:</Series><?xml version="1.0"?>
---------^

For the download part, you could use FileStreamResult.
This page has examples for upload and download; check it out.

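An XML document can only have one top-level element. After the end of that element, you cannot have anything else. It looks like after the closing </Series> tag you have <?xml version="1.0"?>, which is invalid. This happens because the loop calls serializer.Serialize once per Series, and each call writes a complete XML document, declaration included, into the same stream.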
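One way to end up with a single root element is to serialize the whole collection in one call instead of one item at a time. The sketch below is only an illustration under assumptions: the Series class is a hypothetical stand-in for the entity in the question (its real properties are not shown there), and SeriesXml is a made-up helper name. XmlSerializer wraps a List<Series> in one ArrayOfSeries root element.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml.Serialization;

// Hypothetical stand-in for the Series entity from the question.
public class Series
{
    public int Id { get; set; }
    public string Title { get; set; }
}

public static class SeriesXml
{
    // Serializes the whole list in one call, producing one XML document with one root.
    public static byte[] Serialize(List<Series> items)
    {
        var serializer = new XmlSerializer(typeof(List<Series>));
        using (var stream = new MemoryStream())
        {
            serializer.Serialize(stream, items);
            return stream.ToArray();
        }
    }

    // Reads the list back, e.g. from the bytes of an uploaded file.
    public static List<Series> Deserialize(byte[] xml)
    {
        var serializer = new XmlSerializer(typeof(List<Series>));
        using (var stream = new MemoryStream(xml))
        {
            return (List<Series>)serializer.Deserialize(stream);
        }
    }

    public static void Main()
    {
        var items = new List<Series>
        {
            new Series { Id = 1, Title = "First" },
            new Series { Id = 2, Title = "Second" },
        };

        byte[] bytes = Serialize(items);
        Console.WriteLine(Encoding.UTF8.GetString(bytes));
        Console.WriteLine(Deserialize(bytes).Count + " item(s) read back");
    }
}
```

In the controller action, the resulting bytes could then be returned with something like File(bytes, "application/xml", "series.xml"), so the browser offers a download; the file name here is just an example.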
