Search and replace text in a MemoryStream in C# .NET

I have loaded a MemoryStream with a Word document and I want to be able to alter specific text within the MemoryStream and save it back to the Word document, like search and replace functionality. Can anyone help me with this? I don't want to use the Word Interop libraries. I already have the code to load and save the document; please see below. The problem is that if I convert the MemoryStream to a string and use the string Replace method, all the formatting within the Word document is lost when I save the string, and when I open the document all it shows is black boxes all over the place.
private void ReplaceInFile(string filePath, string searchText, string replaceText)
{
    // Load the whole document into memory.
    byte[] inputFile = File.ReadAllBytes(filePath);
    using (MemoryStream memory = new MemoryStream(inputFile))
    {
        // No replacement happens yet: the bytes are copied out unchanged.
        byte[] data = memory.ToArray();
        string pathStr = Request.PhysicalApplicationPath + "\\Docs\\OutputDocument.doc";
        FileInfo wordFile = new FileInfo(pathStr);
        // FileMode.Create truncates any existing output file before writing.
        using (FileStream fileStream = wordFile.Open(FileMode.Create, FileAccess.Write, FileShare.None))
        {
            fileStream.Write(data, 0, data.Length);
        }
    }
}
I copied the code from sample code on the internet, which is why a MemoryStream is used; I had no idea how else to do it. My issue is that the company I work for doesn't want to use the Word interop, because they have found that Word can occasionally display popup dialog boxes that prevent the coded functionality from executing. This is why I want to look at ways of achieving mail merge functionality programmatically. I did something very similar many years ago, but in Delphi rather than C#, and, typically, I have lost the code. So if anyone can shed any light on this I would be grateful.

You will have to use the Word interop libraries - or at least something similar. It's not like Word documents are just plain text documents - they're binary files. Converting the bytes into a string and doing a replace that way is going to break the document completely.
With the new open formats you may be able to write your own code to parse them, but it's going to be significantly harder than using a library.

Your best bet is to convert the file to OOXML (.docx). A .docx file is a zip package whose main part, word/document.xml, is XML that you can update programmatically using a string find / replace, System.Xml or LINQ.
(See http://blogs.msdn.com/b/ericwhite/archive/2008/09/19/bulk-convert-doc-to-docx.aspx for more info on the server side conversion process.)
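Once the document is a .docx, the find / replace idea can be sketched with nothing but the framework's zip support. This is a minimal sketch under assumptions, not a complete mail merge: the class name DocxTextReplacer is made up for illustration, only word/document.xml is touched (headers and footers live in other parts), and it will miss a search text that Word has split across several runs.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Security;
using System.Text;

// Hypothetical helper, named for illustration only.
public static class DocxTextReplacer
{
    // Replaces searchText with replaceText inside word/document.xml of a .docx
    // held in memory and returns the modified package as a new byte array.
    public static byte[] Replace(byte[] docxBytes, string searchText, string replaceText)
    {
        using (var memory = new MemoryStream())
        {
            // Copy into an expandable stream; new MemoryStream(byte[]) cannot grow.
            memory.Write(docxBytes, 0, docxBytes.Length);

            using (var zip = new ZipArchive(memory, ZipArchiveMode.Update, leaveOpen: true))
            {
                ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
                if (entry == null)
                    throw new InvalidDataException("word/document.xml is missing; not a .docx package?");

                string xml;
                using (var reader = new StreamReader(entry.Open(), Encoding.UTF8))
                    xml = reader.ReadToEnd();

                // Escape the replacement so characters such as & or < stay valid XML.
                xml = xml.Replace(searchText, SecurityElement.Escape(replaceText));

                // Swap the old entry for one holding the updated XML.
                entry.Delete();
                ZipArchiveEntry updated = zip.CreateEntry("word/document.xml");
                using (var writer = new StreamWriter(updated.Open(), new UTF8Encoding(false)))
                    writer.Write(xml);
            }

            // The archive is written back to the stream when the ZipArchive is disposed.
            return memory.ToArray();
        }
    }
}
```

A call such as File.WriteAllBytes(outputPath, DocxTextReplacer.Replace(File.ReadAllBytes(filePath), searchText, replaceText)) would slot into the ReplaceInFile method from the question, where outputPath stands for the pathStr built there. For anything beyond simple placeholders, a library that understands runs is likely the safer route.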

Related

C# Parse Memory Stream text from RichTextBox with special characters

I need your help to find the best/fastest way to parse (regular expression) text in a RichTextbox.
I have already tried several methods, and the fastest one so far seems to be saving the text into a MemoryStream and reading it line by line while performing the validation.
I have no problem doing that and it actually seems to work pretty well... except when I have special chars, Latin chars to be more specific. Let's say, for example, that I have the name "João" (John in English, BTW): the text coming from the StreamReader appears as "Jo\'e3o", resulting in a failure to find the text.
Not sure if this is because of encoding, I have tried to set the Encoding to UTF8 when creating the StreamReader, but it doesn't work, I always see the text with those codes.
I am starting to think that my only option is to parse the text or lines from the RichTextbox obj, but it is sooooo much slower...
UPDATE
Adding some example code on how I'm reading the RichTextBox text.
(This seems to be the fastest way to read large amounts of text.)
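var rtb = new RichTextBox();
var rtbMemStream = new MemoryStream();
// RichTextBoxStreamType.RichText writes RTF markup, not plain text.
rtb.SaveFile(rtbMemStream, RichTextBoxStreamType.RichText);
// Rewind before reading in case SaveFile left the position at the end of the stream.
rtbMemStream.Position = 0;
using (StreamReader sr = new StreamReader(rtbMemStream, Encoding.UTF8))
{
    while (!sr.EndOfStream)
    {
        var streamLine = sr.ReadLine();
        ParseLine(streamLine);
    }
}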
Any help or suggestions are appreciated.
Thank you in advance.
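The "Jo\'e3o" form is what RTF markup looks like: a character outside ASCII is written as \'hh, a two-digit hex byte in the document's ANSI code page, so changing the StreamReader encoding does not help. Below is a rough sketch of decoding those escapes before matching. The class name RtfEscapes is made up for illustration, and it assumes the bytes can be read as Latin-1 (which matches Windows-1252 for accented Latin letters such as ã); it does not handle \uN escapes or any other RTF control words.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper, named for illustration only.
public static class RtfEscapes
{
    // Matches an RTF hex escape such as \'e3 and captures the two hex digits.
    private static readonly Regex HexEscape = new Regex(@"\\'([0-9a-fA-F]{2})");

    // Replaces every \'hh escape with the character that has that Latin-1 code.
    public static string DecodeHexEscapes(string rtfLine)
    {
        return HexEscape.Replace(rtfLine, match =>
        {
            int code = int.Parse(match.Groups[1].Value, NumberStyles.HexNumber);
            // Latin-1 byte values map directly onto the first 256 Unicode code points.
            return ((char)code).ToString();
        });
    }
}
```

With this, RtfEscapes.DecodeHexEscapes(@"Jo\'e3o") gives "João". Another option that may avoid the markup altogether is saving with RichTextBoxStreamType.UnicodePlainText and reading the stream with Encoding.Unicode, at the cost of losing the formatting information.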

Creating Word file from ObservableCollection with C#

I have an observable collection with a class that has 2 string properties: Word and Translation. I want to create a word file in format:
word = translation word = translation
word = translation word = translation...
The word document needs to be in 2 Columns (PageLayout) and the Word should be in bold.
I have first tried Microsoft.Office.Interop.Word.
PageSetup.TextColumns.SetCount(2) sets the PageLayout. As for the text itself I used a foreach loop and in each iteration I did this:
paragraph.Range.Text = Word + " = " + Translation;
object boldStart = paragraph.Range.Start;
object boldEnd = paragraph.Range.Start + Word.Length;
Word.Range boldPart = document.Range(boldStart, boldEnd);
boldPart.Bold = 1;
paragraph.Range.InsertParagraphAfter();
This does exactly what I want, but with 1000 items in the collection it takes about 10 seconds, and far longer with 10k+ items. I then used a StringBuilder and just set document.Content.Text = sb.ToString(); that takes less than a second, but I can't make the word bold that way.
Then I switched to Open XML SDK 2.5, but even after reading the MSDN documentation I still have no idea how to make just part of the text bold, and I don't know whether it's even possible to set the PageLayout column count. The only thing I managed was to make it look the same as with Interop.Word, but with just 1 column and under 1 second of creation time.
Should I be using Interop.Word or Open XML (or maybe both combined) for this? And can someone please show me how to write this properly, so it doesn't take forever when the collection is relatively large? Any help is appreciated. :)
OOXML can be intimidating at first. http://officeopenxml.com/anatomyofOOXML.php has some good examples. Whenever you get confused unzip the docx and browse the contents to see how it's done.
The basic idea is you'd open Word, create a template with the styling you want and a code word to find the paragraph, then multiply the paragraph, replacing the text in that template with each word.
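Your Word template would contain one paragraph holding the marker and placeholders that the code searches for, something like: [templateForWords][word] = [translation]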
Here's a rough sketch to get you started, assuming you have the SDK installed:
// Requires the Open XML SDK (DocumentFormat.OpenXml). stream and YourDictionary
// (a Dictionary<string, string> of word -> translation) come from the caller.
using System.Linq;
using System.Text.RegularExpressions;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

var templateRegex = new Regex(@"\[templateForWords\]");
var wordPlacementRegex = new Regex(@"\[word\]");
var translationPlacementRegex = new Regex(@"\[translation\]");
using (var document = WordprocessingDocument.Open(stream, true))
{
    MainDocumentPart mainPart = document.MainDocumentPart;
    // Find the paragraph that carries the template marker.
    Paragraph paragraphTemplate = mainPart.Document.Body
        .Descendants<Paragraph>()
        .First(p => templateRegex.IsMatch(p.InnerText));
    foreach (var entry in YourDictionary)
    {
        // Deep-clone so the runs keep their formatting (e.g. the bold word).
        var paraClone = (Paragraph)paragraphTemplate.CloneNode(true);
        // Replace per Text element; this assumes Word kept each placeholder
        // inside a single run, which is not guaranteed.
        foreach (Text t in paraClone.Descendants<Text>())
        {
            t.Text = templateRegex.Replace(t.Text, "");
            t.Text = wordPlacementRegex.Replace(t.Text, entry.Key);
            t.Text = translationPlacementRegex.Replace(t.Text, entry.Value);
        }
        // Insert before the template so the entries keep their order.
        paragraphTemplate.InsertBeforeSelf(paraClone);
    }
    paragraphTemplate.Remove();
    // Write the modified main part back to the package.
    mainPart.Document.Save();
}
OpenXML is absolutely better, because it is faster, has fewer bugs, and is more reliable and flexible at runtime (especially in a server environment). And it's not really difficult to find out how to produce a given element using OpenXML. As a docx file is just a zip file with XML files inside, I open it and read the XML to get an idea of how Word itself does it. First I create a document and format it (in your case, a file with two columns and bold words inside), save it, and rename it to a .zip file. Then I open it, go into the "word" directory and open the file "document.xml". That file contains the essential part of the XML; looking at it, it's not difficult to figure out how to recreate it in OpenXML.
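To give an idea of what you will find in document.xml for this case, here is a sketch that builds the relevant markup with LINQ to XML instead of the SDK. The class name WordMlSketch is made up for illustration, and the output is only the main document part, not a complete .docx package. Bold is a w:b element inside the run properties (w:rPr) of the run holding the word, and the column count is the w:num attribute of w:cols in the section properties (w:sectPr) at the end of the body.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

// Hypothetical helper, named for illustration only.
public static class WordMlSketch
{
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // One paragraph "<word> = <translation>" where only the word sits in a bold run.
    public static XElement BuildParagraph(string word, string translation)
    {
        return new XElement(W + "p",
            new XElement(W + "r",
                new XElement(W + "rPr", new XElement(W + "b")),
                new XElement(W + "t", word)),
            new XElement(W + "r",
                new XElement(W + "t",
                    // Keep the leading space of " = ..." from being trimmed.
                    new XAttribute(XNamespace.Xml + "space", "preserve"),
                    " = " + translation)));
    }

    // Builds the XML of the main document part with one paragraph per entry.
    public static XDocument BuildDocument(IEnumerable<KeyValuePair<string, string>> entries)
    {
        var body = new XElement(W + "body");
        foreach (var entry in entries)
            body.Add(BuildParagraph(entry.Key, entry.Value));

        // Section properties come last in the body; w:cols w:num="2" asks for two columns.
        body.Add(new XElement(W + "sectPr",
            new XElement(W + "cols", new XAttribute(W + "num", "2"))));

        return new XDocument(
            new XElement(W + "document",
                new XAttribute(XNamespace.Xmlns + "w", W.NamespaceName),
                body));
    }
}
```

The same structure carries over to the Open XML SDK, where these elements appear as the Run, RunProperties, Bold, Text, SectionProperties and Columns classes. Building all paragraphs in memory and saving once is also what keeps this fast compared with one Interop call per item.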
Open XML is a much better option than Office COM. But the problem is that it is a low-level file format library that, unlike Office COM, doesn’t work at a high level of abstraction. You might want to go that route, but I recommend that you first consider a commercial library that gives you the benefits of a high-level DOM without needing MS Word installed on the production machine. Our company recently purchased this toolkit which allows you to use a template-based approach as well as a DOM/programmatic approach to generate, modify and create documents.

Is there a less convoluted, preferably automatable, way to convert PDFs to HTML?

I need to convert PDF files to HTML.
I can do this manually via several steps, using this (Rube) Goldberg variation:
0) Save PDF as text
1) Copy-and-paste text into MS Word
2) Save MS Word doc as HTML
I feel like I'm walking on my hands doing that, though.
Is there a programmatic way to accomplish the same? So that I could do something like:
string htmlFile = ConvertPDFToHTML("FrumiousBandersnatch.PDF");

Export FlowDocument with UIElement to rtf

I am trying to export a FlowDocument which contains a grid to rtf. I used the following code
using (FileStream fs = new FileStream(@"C:\demo.rtf", FileMode.OpenOrCreate, FileAccess.Write))
{
    TextRange textRange = new TextRange(doc.ContentStart, doc.ContentEnd);
    textRange.Save(fs, DataFormats.Rtf);
}
However I am getting a blank document. How can this be solved?
I had a similar issue recently and the culprit turned out to be the
FileMode.OpenOrCreate
It should have been
FileMode.Create
instead.
When you use OpenOrCreate and the file already exists and holds more content than you are writing, you end up with the tail of the old file after the end of the new content. Word, WordPad or whatever you open it in may not be able to interpret that correctly but will attempt to show what it can, which in your case may be a blank page.
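The leftover tail is easy to reproduce with a plain text file. A small sketch follows; the class and method names are made up for illustration, and ASCII text stands in for the RTF content:

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, named for illustration only.
public static class FileModeDemo
{
    // Writes text to path using the given FileMode, then returns the file's full content.
    public static string WriteAndReadBack(string path, string text, FileMode mode)
    {
        byte[] bytes = Encoding.ASCII.GetBytes(text);
        using (var fs = new FileStream(path, mode, FileAccess.Write))
        {
            // With OpenOrCreate the write starts at offset 0 and the old length is kept.
            fs.Write(bytes, 0, bytes.Length);
        }
        return File.ReadAllText(path, Encoding.ASCII);
    }
}
```

If the file already contains "OLD CONTENT THAT IS LONG", writing "new" with FileMode.OpenOrCreate leaves "new CONTENT THAT IS LONG" on disk, while FileMode.Create leaves just "new".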
The second issue that may be part of the problem is that the viewer you use to open the file and the FlowDocument you use to write it may not be on the same wavelength, to put it mildly.
You may notice that WordPad for example displays the same rtf file differently than Word.
They also produce very different files when you save them.
Same goes for the FlowDocument - it may be saving something that for example WordPad or even Word (though this is less likely) is not able to display correctly (or at all).

Rtf to WordML Convert in C#

I have a Windows application that generates reports.
It has templates in RTF such as "{\\rtf1\\ansi\\ansicpg1252\\deff0\\deflang2057{\\fonttbl{\\f0\\fnil\\fcharset0 Arial;}}\r\n\\viewkind4\\uc1\\pard\\fs20\\tab\\tab\\tab\\tab af\\par\r\n}\r\n", which are written to a Word .doc file. The document is then saved as XML and closed. Then, tags like (say) are extracted and some new
The problem here is Word, which is used as the converter in the process and consumes valuable time in a loop where it opens a Word instance, saves, closes and deletes.
Please correct me if I have made any mistakes, and help me with an alternative way to convert to WordML.
Use Aspose.Words
// your RTF string
string rtfStrx = "{\\rtf1\\ansi\\ansicpg1252\\deff0\\deflang2057{\\fonttbl{\\f0\\fnil\\fcharset0 Arial;}}\r\n\\viewkind4\\uc1\\pard\\fs20\\tab\\tab\\tab\\tab af\\par\r\n}\r\n";
// convert the string to bytes for the memory stream
byte[] rtfBytex = Encoding.UTF8.GetBytes(rtfStrx);
MemoryStream rtfStreamx = new MemoryStream(rtfBytex);
Document rtfDocx = new Document(rtfStreamx);
rtfDocx.Save(@"C:\Temp.xml", SaveFormat.WordML);
This saves your RTF text in a new document as WordML. I cannot say how long it will take in a loop, but it will surely take much less time than MS Word being physically opened and closed.
Unless I am missing something, I assume that you are trying to create an Office XML file from an RTF template? I think you can use the Open XML SDK to create the XML file. Specifically, the DocumentReflector that comes with that SDK seems to be a good fit for that. See this example. Also, there is an article at http://www.codeguru.com/cpp/controls/richedit/conversions/article.php/c5377/ which shows how to convert from RTF to HTML and might guide you.
Use the WPF RichTextBox: RTF => XAML. Since XAML is XML, use XSLT or LINQ to convert it to your desired XML structure.
