Perform find & replace in excel doc in C#

Perform find & replace in excel doc in C# - c#

I am trying to create a postage label that is populated dynamically in C#.
What I thought I could do was create the label in Excel so I could get the layout perfect, then use OpenXML to replace the fields. i.e. I have fields called things like XXAddressLine1XX, where I want to edit the XML and replace that text with the actual Address Line 1 from the database.
Has anyone actually done something similar to this before and could post some code up that I could try? I've used OpenXML to do this with word documents before, but I can't seem to find the XML data for the Excel document when using OpenXML in c# so am struggling to make progress.
Either that, or are there any better methods I could try to accomplish this?

Every now and then, there's a programming problem that is best solved with a non-programming solution. Businesses have had the need for printing mailing labels en masse for a long time, and in recent decades programs like Microsoft Word make this really simple via the "mail merge" feature. See http://office.microsoft.com/en-us/word-help/demo-use-mail-merge-to-format-and-print-mailing-labels-HA001190394.aspx for an example.
Word's mail merge will allow you to connect to a variety of data sources. The given example uses an Excel spreadsheet, but you can also use Access or SQL Databases, etc.

The purely academic answer is: it's possible. I wrote a set of template parsing classes that search through a PowerPoint presentation replacing tags from a mark-up language I invented with charts and dynamic text objects fetched from a database. The trickiest part of the string replacement bit was handling tags that occurred across Run elements inside a Paragraph element. This occurs usually if you use special characters such as '{' or spaces in your tags. I was able to solve it by storing the text of the entire TextBody element in a gigantic character array (in your case it would be the contents of a Cell element), storing a list of extents that enumerated where in the character array each Run element began and ended, and then walking the character array while paying attention to the Run boundaries appropriately. Mind that if your tag spans across multiple Run elements you'd need to remove any extras and snip content across boundaries before you inserted the replacement Run. Unfortunately I cannot post any code because the work was done for a company, but that's the general idea of how to achieve it. I was not able to handle any newline cases (i.e. a tag occurs with a newline in it) because that would require writing a cross Paragraph indexer, which was beyond the scope of what I wanted to achieve. That could be done as well, but it would be significantly more difficult I think.

The Powershell script extension PSExcel have an search and replace feature:
PSExcel: https://github.com/RamblingCookieMonster/PSExcel
Import-Module d:\psexcel\trunk\PSExcel\PSExcel
# Get commands in the module
Get-Command -Module PSExcel
# Get help for a command
Get-Help Import-XLSX -Full
# Modify the path of the Excel file to your local svn ceck-out
Get-ChildItem "C:\tmp\Source" -Filter *.xlsx | Foreach-Object{
$fileName = $_.FullName
# Open an existing XLSX to search and set cells within
$Excel = New-Excel -Path $_.FullName
$Workbook = $Excel | Get-Workbook
#$Excel | Get-Worksheet
$Worksheet = $Workbook | Get-Worksheet -Name "Sheet1"
#Search for any cells like 'Chris'. Set them all to Chris2
$Worksheet | Search-CellValue {$_ -like 'Chris'} -As Passthru | Set-CellValue -Value "Chris2"
#Save your changes and close the ExcelPackage
$Excel | Save-Excel -Close
}

Related

Field and text delimiters within cells in csv files

This is likely a very basic question that I could not, despite trying, find a satsifying answer to. Feel free to skip to the question at the end if you aren't interested in the background.
The task:
I wish to create an easy localisation solution for my unity projects. After some initial research I concluded it would be best to use a .csv file read by a streamreader, so that translators would only ever have to interact with the csv table, where information is neatly organized.
The main problem:
Due to the nature of the text, I need to account for linebreaks and special characters in the actual fields. As such I could not use the normal readLine() method.
This I worked with by using Read() and checking if a linebreak is within a text delimiter bracket. But as I check for the text delimiter, I am afraid it might run into an un-escaped delimiter part of the normal in-cell text (since the normal text delimiter is quotation marks).
So I switched the delimiter to §. But now every time I open the file I have to re-enter § as a text delimiter in OpenOfficeCalc, probably due to encoding differences. Which is annoying but not the end of the world.
My question:
How does OpenOffice (or similar software) usually tell in-cell commas/quotation marks apart from the ones used as delimiters? If I knew that, I could probably incorporate a similar approach in my reading of the file.
I've tried to look at the files with NotePad++, revealing a difference in linebreaks (/r instead of /r/n) and obviously it's within a text delimiter bracket, but when it comes to how it seperates its delimiters from ones just entered in the text/field, I am drawing a blank.
Translation file in OpenOffice Calc:
Translation file in NotePad++, showing all characters:
I'd appreciate any insight or links on the topic.

From https://en.wikipedia.org/wiki/Comma-separated_values:
The CSV file format is not fully standardized. The basic idea of separating fields with a comma is clear, but that idea gets complicated when the field data may also contain commas or even embedded line breaks.
LibreOffice Calc has a reasonable way to handle these things.
Use LF for line breaks and CR at the end of each record. It seems your code already handles this.
Use quotes to delimit strings when needed. If the string contains one or more quotes, then duplicate the quote to make it literal.
From the example in your question, it looks like you told Calc not to use any quotes as string delimiters. Why did you do this? When I tried it, LibreOffice (or Apache OpenOffice) showed the fields in different columns after opening the file saved that way.
The following example CSV file has fields that contain commas, quotes and line breaks.
When viewed in Calc:
A B
--------- --
1 | 1,",2", 3
--------- --
2 | a c
| b
Calc correctly reads and saves the file as shown below. Settings when saving are Field delimiter , and String delimiter " which are the defaults.
"1,"",2"",",3[CR]
"a
b",c[CR]

.net Interop.Word insert from docx without screwing up formatting

I have an issue I've stuck with for over a year now. I made a Forms application in VB.net which allows the user to type in some information and select items which represent docx-files with tables with special formatting, pictures and other formatting quirks in them.
At the end the software creates a Word document via Office.Interop, using the information the user provided in text fields in the Forms and the items they selected (e.g. it creates a table in Word, listing the user's selections with some extra info) and then appends the content from multiple docx-files depending on the user's selection to the document created via Interop.
The problem is: To achieve this I had to use a pretty dirty method:
I open the respective docx-files, select all content (Range.Wholestory()) and copy it (Range.Copy()). Then I insert this content from the clipboard into my newly created document with the following option:
Selection.PasteAndFormat (wdFormatOriginalFormatting)
This produces a satisfactory result but it feels super dirty since it uses the user's clipboard (which I save at the beginning of the runtime and restore at the end).
I originally tried to use the Selection.InsertFile-Method and tried this again today but it completely screws the formatting.
When the content of the docx is inserted this way it neither has the formatting of the original docx nor the one of the file I created with the program. E.g. the SpaceBefore and SpaceAfter values are wrong, even if I explicitly define them in my created file. Changing the formatting afterwards is no option since the source files contain a lot of special formatting and can change all the time.
Another factor which makes it hard: I cannot save the file before it is presented to the user, using temp folder is not an option in the environment this application is deployed into, so basically everything happens in RAM.
Summary:
Basically what I want is to create the same outcome as with my "Copy and Paste" method utilizing the OriginalFormatting WITHOUT using the clipboard. The problem is, the InsertFile-Method doesn't provide an option for the formatting.
Any idea or help would be greatly appreciated.
Edit:
The FormattedText option as suggested by Rich Michaels produces the same result as the InsertFile-Method. Here is the relevant part of what I did (word is the Microsoft.Office.Interop.Word.Application):
#Opening the source file
Dim doctemp As Microsoft.Office.Interop.Word.Document
doctemp = word.Documents.Open(doctempfilepath)
#Selecting whole document; this is what I did for the "Copy/Paste"-Method, too
doctemp.Range.WholeStory()
Dim insert_range As wordoptions.Range
doc_destination.Activate()
#Jumping to the end and selecting the range
word.Selection.EndKey(Unit:=Microsoft.Office.Interop.Word.WdUnits.wdStory)
insert_range = word.Selection.Range
#Inserting the text
insert_range.FormattedText = doctemp.Range.FormattedText
doctemp.Close(False)
This is the problem:

Use the Range.FormattedText property. It doesn't touch the clipboard and it maintains the source formatting. The process is ...
Set the range in the Source document you want "copied" and set the insertion point in the Destination document and then,
DestinationRange.FormattedText = SourceRange.FormattedText

Is there any way to assign Id's to paragraphs in Open XML SDK 2.5?

I'm working on an application which has to create word documents with the use of Office Open XML SDK 2.5. The idea that I'm having now is that I will start from a template with an empty body (so I have all the namespaces etc. defined already), and add Paragraphsto it. If I need images I will add the ImageParts and try to give the ImagePart the Id present in the predefined paragraphpart which will contain the image. I will store the paragraphs as xml in a database, fetch the ones I need, fill in/modify some values if needed and insert them into my word document. But this is the tricky part, how can I easily insert them in a way so I don't have to query on their content to later on find one of the paragraphs? In other words, I need Id's. I have some options in mind:
For each possible paragraph I have, manually create a SdtBlock. This SdtBlock will have an Id which matches the Id of each paragraph in the database. This seems like a lot of manual work though, and I'd rather be able to create future word documents easier...
I chose this approach but I insert Building Blocks which can be stored in templates with a specific tagname.
Create the paragraphs, copy the xml from the developer tool, and manually add a ParagraphId. This seems even more of a nightmare though, because for every future new paragraphs I will have to create new Id's etc. Also it would be impossible to insert tables as there is no way (afaik) to give those an Id.
Work with bookmarks to know where to insert the data. I don't really like this either as bookmarks are visible for everyone. I know I can replace them, but then I don't have any way to identify individual paragraphs later on.
**** my database and just add everything in the template :D Remove the paragraphs I don't need by deleting the bookmarks with their content. This idea seems the worst of all though as I don't want to depend on having a templatefile with all possible content per word-file I need to generate.
Anyone with experience in OpenXml who knows which approach would be the best? Maybe there is another approach which is better and I have completely overlooked? The ideal solution would be that I can add Ids in Office Word but that's a no-go as I haven't found anything to do that yet.
Thanks in advance!

Content Controls (std) were designed for this, although I'm not sure the designers ever contemplated "targeting" each and every paragraph in the document...
Back in the 2003/2007 days it was possible to add custom XML mark-up to a Word document, which would have been exactly what you're looking for. But Microsoft lost a patent court case around 2009 and had to pull the functionality. So content controls are really your only good choice.
Your approach could, possibly, be combined with the BuildingBlocks concept. BuildingBlocks are Word content stored in a Word template file as valid Word Open XML. They can be assigned to "galleries" and categorized. There is a Content Control of type BuildingBlock that can be associated with a specific Gallery and Category which might help you to a certain extent and would be an alternative to storing content in a database.

Ok, I did a small research, you can do it in strict OpenXML, but only before you open your file in Word. Word will remove everything it cannot read.
using (WordprocessingDocument document = WordprocessingDocument.Open(path, true)) {
document.MainDocumentPart.Document.Body.Ancestors().First()
.SetAttribute(new OpenXmlAttribute() {
LocalName = "someIdName",
Value = "111" });
}
Here, for example, I set attribute "someIdName", which doesn't exits in OpenXML, to some random element. You can set it anywhere and use it as id

Use OpenXML to replace text in DOCX file - strange content

I'm trying to use the OpenXML SDK and the samples on Microsoft's pages to replace placeholders with real content in Word documents.
It used to work as described here, but after editing the template file in Word adding headers and footers it stopped working. I wondered why and some debugging showed me this:
Which is the content of texts in this piece of code:
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(DocumentFile, true))
{
var texts = wordDoc.MainDocumentPart.Document.Body.Descendants<Text>().ToList();
}
So what I see here is that the body of the document is "fragmented", even though in Word the content looks like this:
Can somebody tell me how I can get around this?
I have been asked what I'm trying to achieve. Basically I want to replace user defined "placeholders" with real content. I want to treat the Word document like a template. The placeholders can be anything. In my above example they look like {var:Template1}, but that's just something I'm playing with. It could basically be any word.
So for example if the document contains the following paragraph:
Do not use the name USER_NAME
The user should be able to replace the USER_NAME placeholder with the word admin for example, keeping the formatting intact. The result should be
Do not use the name admin
The problem I see with working on paragraph level, concatenating the content and then replacing the content of the paragraph, I fear I'm losing the formatting that should be kept as in
Do not use the name admin

Various things can fragment text runs. Most frequently proofing markup (as apparently is the case here, where there are "squigglies") or rsid (used to compare documents and track who edited what, when), as well as the "Go back" bookmark Word sets in the background. These become readily apparent if you view the underlying WordOpenXML (using the Open XML SDK Productivity Tool, for example) in the document.xml "part".
It usually helps to go an element level "higher". In this case, get the list of Paragraph descendants and from there get all the Text descendants and concatenate their InnerText.

OpenXML is indeed fragmenting your text:
I created a library that does exactly this : render a word template with the values from a JSON.
From the documenation of docxtemplater :
Why you should use a library for this
Docx is a zipped format that contains some xml. If you want to build a simple replace {tag} by value system, it can already become complicated, because the {tag} is internally separated into <w:t>{</w:t><w:t>tag</w:t><w:t>}</w:t>. If you want to embed loops to iterate over an array, it becomes a real hassle.
The library basically will do the following to keep formatting :
If the text is :
<w:t>Hello</w:t>
<w:t>{name</w:t>
<w:t>} !</w:t>
<w:t>How are you ?</w:t>
The result would be :
<w:t>Hello</w:t>
<w:t>John !</w:t>
<w:t>How are you ?</w:t>
You also have to replace the tag by <w:t xml:space=\"preserve\"> to ensure that the space is not stripped out if they is any in your variables.

Search for files that have a common naming convention

I have a folder, full of 38,000+ .pdf files. I was not the genius to put them all into one folder, but I now have the task of separating them. The files that are of value to us, all have the same basic naming convention, for example:
123456_20130604_NEST_IV
456789_20120209_VERT_IT
What I'm trying to do, if possible, is search the folder for only those files with that particular naming convention. As in, search only for files that have 6 digits, an underscore, and then 8 digits followed by another underscore. Kind of like *****_********. I've searched online but I haven't had much luck. Any help would be great!

var regex = new Regex(#"^\d{6}_\d{8}_", RegexOptions.Compiled);
string[] files = Directory.GetFiles(folderPath)
.Where(path => regex.Match(Path.GetFileName(path)).Success)
.ToArray();
files would contain paths to a files, that match criteria.
For my example C:\Temp\123456_20130604_NEST_IV 456789_20120209_VERT_IT.pdf, which I've added beforehand.
As a bonus, here is PowerShell script to do this (assuming you are in the correct folder, otherwise use gc "C:\temp" instead of dir):
dir | Where-Object {$_ -match "^\d{6}_\d{8}_"}

? - single character
* - multiple characters
So, I would say use ?????? _ ???????? _ ???? _ ??.* to get all your files
You can use move or copy command from a command prompt to do that.
If you want to do advanced searches such as pattern matching, use windows grep: http://www.wingrep.com/

Are you familiar with regular expressions? If not, they are a generalized way to search for strings of a special format. I see you tagged your question with C# so assuming you are writing a C# script you might try the .NET regular expression module.
http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.aspx
If you are a beginner, you may want to start here.
http://www.codeproject.com/Articles/9099/The-30-Minute-Regex-Tutorial

There are numerous ways to handle this. What I like to do is to divide work into different steps with clear output/data in each step. Hence I would tackle this in the following way (since this seems easier for me instead of writing a master program in c# that does everything):
Open windows command prompt (start/run/cmd), navigate to correct
folder and then "dir *.pdf > pdf_files.txt". This would give you a
file containing all pdf-files inside the specific folder.
open up the txt-file (pdf_files.txt) in Notepad++ and then press "ctrl + f
(find)" activate radio button "regular expressions"
type: [0-9]{6}_[0-9]{8}_.*\.pdf and press "Find all in current document"
Copy results and save to new .txt-file
Now you have a text file containing all pdf-files that you can do what you want with (create a c# program that parses the files and move them depending on their name or whatever)

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.