How to get wrapped text from Excel cell? - c#

I want to know how to get wrapped text from Excel cell using ClosedXML.
(ClosedXML version is 0.95.4, and use C#[Visualstudio 2017])
For example:
excel cell data (text) is following
abc
xyz
I want
abc
xyz
(=wrapped text), but I get abcxyz (=no wrapped text).
I tried several methods (as follows) and the results were the same.
tgtcell.RichText
tgtcell.GetString()
tgtcell.Value.ToString()
etc.
Can anyone tell me how to get it?

Now the cause is known. In the cell where the wrapped text was entered, the unwrapped one was mixed in. (Wrapped only by specifying the style format.)
Unfortunately, it was the cell that I tested, so it looked unreadable.

(this is more of an extended comment than a definitive answer)
I created an Excel file in which I typed ABC<Alt>+<Enter>xyz in cell A1. That puts ABC on one line and xyz under it (all in the same cell). Then I saved it and closed Excel.
Then I opened the file in the Microsoft "OpenXML Productivity Tool" (which unfortunately has been taken down from "Microsoft.com" but is still available on GitHub).
If I look in the Shared Strings section of the file, I can see this as XML:
<x:sst count="1" uniqueCount="1" xmlns:x="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
<x:si>
<x:t>ABCxyz</x:t>
</x:si>
</x:sst>
But, if I look at the reflected code (reflected into C#), I see:
public class GeneratedClass
{
// Creates an SharedStringTable instance and adds its children.
public SharedStringTable GenerateSharedStringTable()
{
SharedStringTable sharedStringTable1 = new SharedStringTable(){ Count = (UInt32Value)1U, UniqueCount = (UInt32Value)1U };
SharedStringItem sharedStringItem1 = new SharedStringItem();
Text text1 = new Text();
text1.Text = "ABC\nxyz";
sharedStringItem1.Append(text1);
sharedStringTable1.Append(sharedStringItem1);
return sharedStringTable1;
}
}
So it looks like Excel stores a Unix newline character between lines in a single cell. I don't know why that doesn't show up in the XML.
I'm far from an OpenXML expert, but I have used it several times successfully (mostly to generate files, not to parse them). I've never used ClosedXML. I don't have a clue what Wrapped Text means (nor does your explanation help). But, I can see the two lines of text in the file.
I don't know if this will help you - I hope so.

Related

Tickmark using epplus

My problem is, when I am trying to generate my Excel file
I have some cells contains the tickmark symbol.
using (ExcelRange Rng = wsSheet.Cells[col])
{
Rng.Value = Beneficiary.IDentityOrFamilyBookCheck;// this value is ü
Rng.Style.Font.SetFromFont("Wingdings", 16);
Rng.Style.Font.Charset = 178;
}
in fact with the specified font and value it must appear as a check
but it is not
This is how it actually appears
This is how it must appear
Your Charset is wrong. Try Rng.Style.Font.Charset = 2; It is always good to do some reverse engineering. In this case create an Excel file with the Font and symbol you want and read the Style within your program.
I was able to discover the solution in my own way. Please note that Behnam's way is excellent also and can clarify the issue, but I did it in a different way.
I created an excel file containing the required symbol.
I changed the .xsls extension to .zip.
Iow I became able to show my excel file XML content and discover the actual Rng.Style.Font.Charset.

How can I export a piece of a DOCX file and keep the same paragraph numbering?

TL;DR:
How can I capture the paragraph numbering as a 'part' of the text and export it to a DOCX?
Problem
I have a document that's split into sections and sub-sections that reads similarly to a set of state statutes (Statute 208, with subsections Statute 208.1, Statute 208.2, etc.). We created this by modifying the numbering.xml file within the .docx zip.
I want to export a 'sub-section' (208.5) and its text to a separate .docx file. My VSTO add-in exports the text well enough, but the numbering resets to 208.1. This does make some sense as it's now the first paragraph with that <ilvl> in the document.
PDF works okay
Funnily enough, I'm able to call Word.Range's ExportAsFixedFormat function and export this selection to PDF just fine - even retaining the numbering. This led me down a path of trying to 'render' the selection, possibly as it would be printed, in order to throw it into a new .docx file, but I haven't figured that out, either.
What I've tried:
Range.ExportFragment() using both wdFormatStrictOpenXMLDocument and wdFormatDocumentDefaultas the wdSaveType values.
These export but also reset the numbering.
Document.PrintOut() using PrintToFile = true and a valid filename. I realize now that this, quite literally, generates 'printout instructions' and won't inject a new file at path filename with any valid file structure.
Plainly doesn't work. :)
Application.Selection.XML to a variable content and calling Document.Content.InsertXML(content) on a newly added Document object.
Still resets the numbering.
Code Section for Context
using Word = Microsoft.Office.Interop.Word;
Word.Range range = Application.ActiveDocument.Range(startPosition, endPosition);
range.Select();
//export to DOCX?
Application.Selection.Range.ExportFragment(
filename, Word.WdSaveFormat.wdFormatDocumentDefault);
You could use ConvertNumbersToText(wdNumberAllNumbers) before exporting, then _Document.Undo() or close without saving after the export.
There is some good information at this (dated) link that still should work with current Word APIs:
https://forums.windowssecrets.com/showthread.php/27711-Determining-which-ListTemplates-item-is-in-use-(VBA-Word-2000)
Information at that link suggests that you can create a name/handle for your ListTemplate so that you can reference it in code--as long as your statute-style bullets are associated with a named style for the document. The idea is to first name the ListTemplate that's associated with the statute bullet style for the active document and then reference that name when accessing the ListLevels collection.
For instance, you could have code that looks something like this:
ActiveDocument.Styles("StatutesBulletStyle").ListTemplate.Name = "StatuteBulletListTemplate";
After the above assignment, you can refer to the template by name:
ActiveDocument.ListTemplates("StatuteBulletListTemplate").ListLevels(1).StartAt = 5;
Using the above technique no longer requires that you try to figure out what the active template is...
Does that help?

Word merge field in header loses value in print preview

I have an ASP WebForms app where I use a word template that contains merge fields, to replace them with data extracted from the database. The app works great, the word document is exported, but when trying to print the document, one of the merge fields, which exists in the header, loses it's value and restores to the initial merge field name. Is this something that has to be fixed from the application's code or is this a word settings issue.
Any help is greatly appreciated.
Thank you!
I have managed to solve this problem using OpenXML Productivity Tool. It turns out that you can't add a merge field in the header, so what I did was to put it inside of a textbox. I forgot to mention that part in the initial description. Thus, the text element was buried deep in open xml. When I managed to find it and log what was inside, I found out that I inserted the MERGEFIELD <> MERGEFORMAT. Every time I tried to insert the value that I wanted in this <>, it got reset when I hit print preview. So what I did, based on a suggestion from someone who had a similar problem, was to delete this textbox and create a clean one where I only entered "Test". It needs to have a string inside so that open xml created the element Text (instrTxt).
In C# I did this:
foreach (var hPart in firstDoc.MainDocumentPart.HeaderParts)
{
foreach (var txt in hPart.Header.Descendants<Text>())
{
if(txt.Text == "Test")
{
txt.Text = "My custom text";
}
}
}
So for each header part (because I can't tell for sure in which one it is..it could even be in multiple ones), get all descendants of type text.
I got a few more than I wanted. I also got a few that contained the page number (since I have it in the header as well). So I added an if to check if it's the text element that I wanted. Once I found it, I added the text I wanted.
So long story short, instead of using a merge field in the header, I just used a text. Perhaps it's not the most efficient way of doing this. Maybe the question still remains, (if I could have inserted a merge field in the header and actually made it work without having word reset the value upon print preview? idk), but this worked for me.

Creating Word file from ObservableCollection with C#

I have an observable collection with a class that has 2 string properties: Word and Translation. I want to create a word file in format:
word = translation word = translation
word = translation word = translation...
The word document needs to be in 2 Columns (PageLayout) and the Word should be in bold.
I have first tried Microsoft.Office.Interop.Word.
PageSetup.TextColumns.SetCount(2) sets the PageLayout. As for the text itself I used a foreach loop and in each iteration I did this:
paragraph.Range.Text = Word + " = " + Translation;
object boldStart = paragraph.Range.Start;
object boldEnd = paragraph.Range.Start + Word.Length;
Word.Range boldPart = document.Range(boldStart, boldEnd);
boldPart.Bold = 1;
paragraph.Range.InsertParagraphAfter();
This does exactly what I want, but if there are 1000 items in the collection it takes about 10sec, much much more if the number is 10k+. I then used a StringBuilder and just set document.Content.Text = sb.ToString(); and that takes less than a sec, but I can't set the word to be bold that way.
Then I switched to using Open XML SDK 2.5, but even after reading the msdn documentation I still have no idea how to make just a part of the text bold, and I don't know if it's even possible to set PageLayout Columns count. The only thing I could do was to make it look the same as with Interop.Word, but with just 1 column and <1sec creation time.
Should I be using Interop.Word or Open XML (or maybe combined) for this? And can someone pls show me how to write this properly, so it doesn't take forever if the collection is relatively large? Any help is appreciated. :)
OOXML can be intimidating at first. http://officeopenxml.com/anatomyofOOXML.php has some good examples. Whenever you get confused unzip the docx and browse the contents to see how it's done.
The basic idea is you'd open Word, create a template with the styling you want and a code word to find the paragraph, then multiply the paragraph, replacing the text in that template with each word.
Your Word template would look like this:
Here's some pseudo code to get you started, assuming you have the SDK installed
var templateRegex = new Regex("\\[templateForWords\\]");
var wordPlacementRegex = new Regex("\\[word\\]");
var translationPlacementRegex = new Regex("\\[translation]\\]");
using (var document = WordprocessingDocument.Open(stream, true))
{
MainDocumentPart mainPart = document.MainDocumentPart;
// do your work here...
var paragraphTemplate = mainPart.Document.Body
.Descendants<Paragraph>()
.Where(p=>templateRegex.IsMatch(p.InnerText)); //pseudo
//... or whatever gives you the text of the Para, I don't have the SDK right now
foreach (string word in YourDictionary){
var paraClone = paragraphTemplate.Clone(); // pseudo
// you may need to do something like
// paraClone.Descendents<Text>().Where(t=>regex.IsMatch(t.Value))
// to find the exact element containing template text
paraClone.Text = templateRegex.Replace(paraClone.Text,"");// pseudo
paraClone.Text = wordPlacementRegex.Replace(paraClone.Text,word);
paraClone.Text = translationPlacementRegex.Replace(paraClone.Text,YourDictionary[word]);
paragraphTemplate.Parent.InsertAfter(paraClone,ParagraphTemplate); // pseudo
}
paragraphTemplate.Remove();
// document should auto-save
document.Package.Flush();
}
OpenXML is absolutely better, because it is faster, has less bugs, more reliable and flexible in runtime (especially in server environment). And it's not really difficult to find out how to make one or another element using OpenXML. As docx file is just a zip file with xml files inside, I open it and read the xml to get the idea, how word itself makes it. First of all, I create a document, then format it (in your case, you can create some file with two columns and bold words inside), save it, rename it to .zip file. Then open it, open "word" directory inside and the file "document.xml" inside the directory. This document contains essential part of xml, looking at this it's not difficult to figure out how to recreate it in OpenXML
Open XML is a much better option than Office COM. But the problem is that it is a low-level file format library that unlike Office COM doesn’t work on a high abstraction level. You might want to go that route but I recommend you to first consider looking into a commercial library that will give you the benefits of a high-level DOM without the need to have MS Word installed on the production machine. Our company recently purchased this toolkit which allows you to use template based approach and also DOM/programmatic approach to generate/modify/create documents.

How to search for Particular Line Contents in PDF and Make that Line Marked In Color using Itext in c#

I have a pdf file which i need to read and validate for its Correctness and if any wrong data Comes it should mark that Line with Red Color.Till Now i am able to read and Validate the Contents of the Pdf file by taking that into string but i am not getting how to make that line Colored,suppose Mark Red color in case any wrong data line comes.So my question is this that "How to search for Particular Line Contents in PDF and Make that Line Marked In Color".
Here is my code in c#..
ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
if (currentText.Contains("1 . 1 To Airtel Mobile") && currentText.Contains("Total"))
{
int startPosition = currentText.IndexOf("1 . 1 To Airtel Mobile");
int endPosition = currentText.IndexOf("Total");
string result = currentText.Substring(startPosition, endPosition - startPosition);
// result will contain everything from and up to the Total line
using (StringReader reader = new StringReader(result))
{
// Loop over the lines in the string.
string[] split = line.Split(new Char[] { ' ' });
}
}
If the line Contents gets Validated Correct its Ok else Mark that Line with Red Color in PDF file
Please read the documentation before posting semi-duplicate questions, such as:
Edit an existing PDF file using iTextSharp
How to Read and Mark(Highlight) a pdf file using C#
You have received some very good feedback, such as the answer from Nenotlep that was initially deleted (I asked the moderators to have it restored). Especially the comment by mkl should have been very useful to you. It refers to Retrieve the respective coordinates of all words on the page with itextsharp and that's exactly what you're asking now, making your question a duplicate (a possible reason to have it removed from StackOverflow).
In his answer, mkl explains that you're taking your assignment too lightly. Instead of extracting pure text, you should extract TextRenderInfo objects. These objects contain information about the content (the actual text) as well as the position on the page. See for instance the ParsingHelloWorld example from chapter 15 of my book.
The method you're using returns the content of the PDF as a string. Similar to result1.txt which is the output of said example:
Hello World
In the same example, we parse a different PDF that has the exact same content when looked at by the human eye. However, when you parse the document, the content looks like this (see result2.txt):
ld
Wor
llo
He
The reason for this difference is inherent to the nature of PDF: the concept of lines doesn't really exist: you can add characters to a page in any which order you want. You don't even need to add complete words!
When you use the GetTextFromPage() method, you tell iText you don't want to get any info about the position of the text. Mlk has tried explaining this to you, but I'll try explaining it once more. In the example from my book, I have extended the RenderListener in a class named MyTextRenderListener. Now the output looks like this (see result3.txt).
<>
<<ld><Wor><llo><He>>
<<Hello People>>
This is the output of the same PDF we parsed when getting result2.txt. As you can see, we missed the words Hello People in the previous attempt.
The example is really simple: it just shows you have to text snippets are stored in the PDF. We get all the TextRenderInfo objects and we use the GetText() method to get the text. The order in which we get the text is the order that is used in the PDF's content stream.
When using a specific strategy, such as the LocationTextExtractionStrategy, iText retrieves all these objects and it used the GetBaseline() method to sort all the text snippets.
<<ld><Wor><llo><He>>
results in:
<<He><llo><Wor><ld>>
Then iText looks at the distance between the different snippets. In this case, iText adds a space between the <llo> and <Wor> snippet.
You are now looking to do the same thing: you are going to write a system that is going to retrieve all the text snippets, that is going to order them, examine them, and based on the composed content, you are going to add a background at those locations.

Categories