C# - Implementing Markdown to Word (OpenXML)

I'm trying to implement my own version of Markdown for creating Word documents in a C# application. For bold, italic, and underline I am going to use **, `, and _ respectively. I have created something that parses pairs of ** to output bold text by extracting a match and building a run like this:
RunProperties rPr2 = new RunProperties();
rPr2.Append(new Bold() { Val = new OnOffValue(true) });
Run run2 = new Run();
run2.Append(rPr2);
run2.Append(new Text(extractedString));
p.Append(run2);
My issue comes when combining the three formats: I think I would have to work out every formatting combination and split the text into separate runs (bold runs, bold italic runs, underline runs, bold underline runs, and so on). I want my program to be able to handle something like this:
**_Lorem ipsum_** (creates bold & underlined run)
`Lorem ipsum` dolor sit amet, **consectetur _adipiscing_ elit**.
_Praesent `feugiat` velit_ sed tellus convallis, **non `rhoncus** tortor` auctor.
Basically, I want it to handle any mix of the styles you could throw at it. However, since I am generating these runs programmatically, I need to work out the formatting of every stretch of text before putting it into runs. Should I keep an array of character indexes for each style and merge them into one big list of styles? (I'm not sure exactly how I would do this.)
The final question is does something like this already exist? If it does I have been unable to find it (markdown to word).

I think you'll have to split your text into parts by the formatting they have and add each part with the correct formatting to the document. Like here http://msdn.microsoft.com/en-us/library/office/gg278312.aspx.
So
**non `rhoncus** tortor` will become: "non "{bold}, "rhoncus"{bold, italic}, " tortor"{italic}
I think that will be easier than making several passes over the text. You don't even have to parse the entire document first. Just parse as you go, and after each change in the formatting write the text collected so far to the docx.
Another thought - If all you're creating is simple text and that's all you need, it might be even simpler to generate the openXML itself. Your data is very structured, should be easy enough to create an XML out of it.
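As a rough illustration of generating the XML directly, here is a minimal sketch that builds a single WordprocessingML run with System.Xml.Linq. The class and method names (RunXml, BuildRun) are made up for this example; the element names (w:r, w:rPr, w:b, w:i, w:u, w:t) and the namespace URI are the standard WordprocessingML ones. It only produces the run element and assumes you add the surrounding paragraph, body, and package yourself.

```csharp
using System;
using System.Xml.Linq;

public static class RunXml
{
    // WordprocessingML main namespace, conventionally bound to the "w" prefix.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical helper: builds one <w:r> element for a piece of text.
    public static XElement BuildRun(string text, bool bold, bool italic, bool underline)
    {
        var props = new XElement(W + "rPr");
        if (bold) props.Add(new XElement(W + "b"));
        if (italic) props.Add(new XElement(W + "i"));
        if (underline) props.Add(new XElement(W + "u", new XAttribute(W + "val", "single")));

        var run = new XElement(W + "r");
        // Only emit <w:rPr> when at least one property is set.
        if (props.HasElements) run.Add(props);

        // xml:space="preserve" keeps leading and trailing spaces in the text.
        run.Add(new XElement(W + "t",
            new XAttribute(XNamespace.Xml + "space", "preserve"),
            text));
        return run;
    }
}
```

Calling RunXml.BuildRun("non ", true, false, false) gives a run whose properties contain only the bold element; passing true for all three flags adds italic and a single underline as well.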
Here's a simple algorithm to do what I propose...
using System;
using System.Collections.Generic;
using System.Text;

// These are the different formattings you have
public enum Formatings
{
    Bold, Italic, Underline, Undefined
}

// This will store the current format
private Dictionary<Formatings, bool> m_CurrentFormat;
// This will store which string translates into which format
private Dictionary<string, Formatings> m_FormatingEncoding;

public void Init()
{
    m_CurrentFormat = new Dictionary<Formatings, bool>();
    foreach (Formatings format in Enum.GetValues(typeof(Formatings)))
    {
        m_CurrentFormat.Add(format, false);
    }
    // Delimiters as described in the question: ** bold, ` italic, _ underline
    m_FormatingEncoding = new Dictionary<string, Formatings>
        {{"**", Formatings.Bold}, {"`", Formatings.Italic}, {"_", Formatings.Underline}};
}

public void ParseFormattedText(string p_text)
{
    StringBuilder currentWordBuilder = new StringBuilder();
    int currentIndex = 0;
    while (currentIndex < p_text.Length)
    {
        Formatings currentFormatSymbol;
        int shift;
        if (IsFormatSymbol(p_text, currentIndex, out currentFormatSymbol, out shift))
        {
            // This is the current word you need to insert
            string currentWord = currentWordBuilder.ToString();
            // This is the current formatting status --> m_CurrentFormat
            // This is where you can insert your code and add the word you want to the .docx
            currentWordBuilder = new StringBuilder();
            currentIndex += shift;
            m_CurrentFormat[currentFormatSymbol] = !m_CurrentFormat[currentFormatSymbol];
            // Re-check from the new position: the next character may start another
            // delimiter, or the text may have ended
            continue;
        }
        currentWordBuilder.Append(p_text[currentIndex]);
        currentIndex++;
    }
    // Text after the last delimiter is still in the builder; write it out here as well
    string lastWord = currentWordBuilder.ToString();
}

// Checks if the current position is the beginning of a format symbol
// if true - p_currentFormatSymbol will be the discovered format delimiter
// and p_shift will denote its length
private bool IsFormatSymbol(string p_text, int p_currentIndex, out Formatings p_currentFormatSymbol, out int p_shift)
{
    // This is a trivial solution, you can do better if you need
    // Take at most 2 characters so the last character of the text does not throw
    string substring = p_text.Substring(p_currentIndex, Math.Min(2, p_text.Length - p_currentIndex));
    foreach (var formatString in m_FormatingEncoding.Keys)
    {
        if (substring.StartsWith(formatString, StringComparison.Ordinal))
        {
            p_shift = formatString.Length;
            p_currentFormatSymbol = m_FormatingEncoding[formatString];
            return true;
        }
    }
    p_shift = -1;
    p_currentFormatSymbol = Formatings.Undefined;
    return false;
}
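For a version that can be run on its own, here is a compact sketch of the same toggle-as-you-go idea that returns the pieces instead of writing them out. It is not the code above, only a variant under a few assumptions: the delimiters are the ones from the question (**, `, _), the Style flags enum and MarkupSplitter class are names invented for this example, and there is no escaping or validation of unbalanced delimiters.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

[Flags]
public enum Style { None = 0, Bold = 1, Italic = 2, Underline = 4 }

public static class MarkupSplitter
{
    // Splits the input into (text, style) pieces, toggling a style at each delimiter.
    public static List<(string Text, Style Style)> Split(string input)
    {
        var result = new List<(string Text, Style Style)>();
        var buffer = new StringBuilder();
        var current = Style.None;
        int i = 0;
        while (i < input.Length)
        {
            Style toggled = Style.None;
            int length = 0;
            if (string.CompareOrdinal(input, i, "**", 0, 2) == 0) { toggled = Style.Bold; length = 2; }
            else if (input[i] == '`') { toggled = Style.Italic; length = 1; }
            else if (input[i] == '_') { toggled = Style.Underline; length = 1; }

            if (length == 0)
            {
                // Ordinary character: keep collecting it into the current piece.
                buffer.Append(input[i]);
                i++;
                continue;
            }

            // Delimiter: close the current piece (if any), then flip the style bit.
            if (buffer.Length > 0)
            {
                result.Add((buffer.ToString(), current));
                buffer.Clear();
            }
            current ^= toggled;
            i += length;
        }
        // Whatever follows the last delimiter is the final piece.
        if (buffer.Length > 0) result.Add((buffer.ToString(), current));
        return result;
    }
}
```

For the input **non `rhoncus** tortor` auctor. this yields four pieces: "non " (Bold), "rhoncus" (Bold and Italic), " tortor" (Italic), and " auctor." (None). Each piece can then become one run with the matching run properties.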

Related

Finding list of objects that contain full or just part of searched string

I have a list of paragraphs. Each paragraph can contain text. I'm trying to search for a string that may sit entirely within a single paragraph or be spread across multiple paragraphs; in the worst case each letter is in a different paragraph.
public List<WordParagraph> FindText(string text) {
List<WordParagraph> list = new List<WordParagraph>();
var found = false;
Paragraph currentParagraph = null;
foreach (var paragraph in this.Paragraphs) {
//if (currentParagraph == null) {
// currentParagraph = paragraph._paragraph;
//} else {
// if (currentParagraph != paragraph._paragraph) {
// found = false;
// }
//}
// paragraph.Text
// logic missing to find text that can start within some paragraph.Text, but
// can span across multiple paragraphs
// for example searching for text "This Is MyTest" within 4 paragraphs that
// may be written like
// paragraph.Text = "Thi"
// paragraph.Text = "s Is"
// paragraph.Text = " MyTes"
// paragraph.Text = "t"
}
return list;
}
I've tried some logic around foreach char in text, and nested loop over text from the paragraph.text but the logic was failing me.
To give you a bit of background. Consider a Word Document that has a single sentence - one long sentence but each word, or even letter is formatted differently - different font size, bold, underline or whatever. It looks like this:
Now what Word actually saved in the file is a single paragraph, but each paragraph has multiple "runs". The run contains a Text element. Each text element contains the text that you see in Word, but due to formatting of possibly even each word it can be split into many many small Text properties.
In my example I've simplified things so that each "run" is a paragraph with a text. So the list of WordParagraphs is the list of runs within the sentence described above.
Now I need to find the string "I have that" in the whole sentence you see in Word. That means I need to go through all paragraphs, find the first letter that matches, and then check whether the next letter matches as well; if not, I need to start again.
I'm having a hard time expressing this logic in code.
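One way to avoid the letter-by-letter bookkeeping is to concatenate the pieces, search the combined string once, and map the match back to the pieces it overlaps. The sketch below shows that idea on plain strings; SpanSearch and FindPieces are hypothetical names, and it assumes each WordParagraph exposes its text as a string and that only the first match is wanted.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class SpanSearch
{
    // Returns the indexes of the pieces touched by the first match of 'needle',
    // or an empty list when there is no match.
    public static List<int> FindPieces(IReadOnlyList<string> pieces, string needle)
    {
        var hits = new List<int>();
        if (string.IsNullOrEmpty(needle)) return hits;

        // Concatenate all pieces and remember where each one starts.
        var all = new StringBuilder();
        var starts = new int[pieces.Count];
        for (int i = 0; i < pieces.Count; i++)
        {
            starts[i] = all.Length;
            all.Append(pieces[i]);
        }

        int matchStart = all.ToString().IndexOf(needle, StringComparison.Ordinal);
        if (matchStart < 0) return hits;
        int matchEnd = matchStart + needle.Length; // exclusive

        // A piece is part of the match when its character range overlaps the match range.
        for (int i = 0; i < pieces.Count; i++)
        {
            int pieceEnd = starts[i] + pieces[i].Length; // exclusive
            if (pieces[i].Length > 0 && starts[i] < matchEnd && pieceEnd > matchStart)
            {
                hits.Add(i);
            }
        }
        return hits;
    }
}
```

With the pieces "Thi", "s Is", " MyTes", "t" and the search string "This Is MyTest", the result is the indexes 0, 1, 2, 3. The returned indexes can be used to pick the corresponding WordParagraph objects from the original list.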

How to highlight text in a sentence using OpenXML?

I am using below code to search and highlight text in a MS Word document, it works fine for point 1 but not point 2:
1. John Alter
I search for Alter or John, it highlights John/Alter - works.
2. I am going to school
I search for going, it highlights going but it changes its order as I am to school going - does not work.
How to fix point 2? Below is my code.
private void HighLightText(Paragraph paragraph, string text)
{
string textOfRun = string.Empty;
var runCollection = paragraph.Descendants<DocumentFormat.OpenXml.Wordprocessing.Run>();
DocumentFormat.OpenXml.Wordprocessing.Run runAfter = null;
//find the run part which contains the characters
foreach (DocumentFormat.OpenXml.Wordprocessing.Run run in runCollection)
{
if (!string.IsNullOrWhiteSpace(paragraph.InnerText) && paragraph.InnerText != "\\s")
textOfRun = run.GetFirstChild<DocumentFormat.OpenXml.Wordprocessing.Text>().Text;
if (textOfRun.IndexOf(text, StringComparison.OrdinalIgnoreCase) >= 0)
{
//remove the character from this run part
run.GetFirstChild<DocumentFormat.OpenXml.Wordprocessing.Text>().Text = Regex.Replace(textOfRun, text, string.Empty, RegexOptions.IgnoreCase);//textOfRun.Replace(text, string.Empty);
runAfter = run;
break;
}
}
//create a new run with your customization font and the character as its text
DocumentFormat.OpenXml.Wordprocessing.Run HighLightRun = new DocumentFormat.OpenXml.Wordprocessing.Run();
DocumentFormat.OpenXml.Wordprocessing.RunProperties runPro = new DocumentFormat.OpenXml.Wordprocessing.RunProperties();
Highlight highlight = new Highlight() { Val = HighlightColorValues.Yellow };
DocumentFormat.OpenXml.Wordprocessing.Text runText = new DocumentFormat.OpenXml.Wordprocessing.Text() { Text = text };
runPro.Append(highlight);
HighLightRun.Append(runPro);
HighLightRun.Append(runText);
//insert the new created run part
paragraph.InsertAfter(HighLightRun, runAfter);
}

You need to split-up your Run if you want to highlight some text in the middle of that Run. So replacing the search text with an empty string won't work.
Your original text structure looks like:
<Run>
<Text>
I am going to school
</Text>
</Run>
If you want to highlight the going word, you need to make a more complex structure out of it:
<Run>
<Text>
I am
</Text>
</Run>
<Run>
<Text>
going
</Text>
</Run>
<Run>
<Text>
to school
</Text>
</Run>
Then, the Run in the middle can be set-up for highlighting.
Here is a working code sample. Please note that there is no error handling in this code! It should give you some idea of how to solve your task. Implement proper exception handling for production use!
Also note that this sample only searches for the first occurrence, as it is in your code. If you need to highlight multiple search matches, you will have to improve this code.
void HighLightText(Paragraph paragraph, string text)
{
// Search for a first occurrence of the text in the text runs
var found = paragraph
.Descendants<Run>()
.Where(r => !string.IsNullOrEmpty(r.InnerText) && r.InnerText != "\\s")
.Select(r =>
{
var runText = r.GetFirstChild<Text>();
int index = runText.Text.IndexOf(text, StringComparison.OrdinalIgnoreCase);
// 'Run' is a reference to the text run we found,
// 'TextNode' is a reference to the run's Text object,
// 'TokenIndex' is the index of the search string in the run's text
return new { Run = r, TextNode = runText, TokenIndex = index };
})
.FirstOrDefault(o => o.TokenIndex >= 0);
// Nothing found -- escape
if (found == null)
{
return;
}
// Create a node for highlighted text as a clone (to preserve formatting etc)
var highlightRun = found.Run.CloneNode(true);
// Add the highlight node after the found text run and set up the highlighting
paragraph.InsertAfter(highlightRun, found.Run);
highlightRun.GetFirstChild<Text>().Text = text;
RunProperties runPro = new RunProperties();
Highlight highlight = new Highlight { Val = HighlightColorValues.Yellow };
runPro.AppendChild(highlight);
highlightRun.InsertAt(runPro, 0);
// Check if there's some text in the text run *after* the found text
int remainderLength = found.TextNode.Text.Length - found.TokenIndex - text.Length;
if (remainderLength > 0)
{
// There is some text after the highlighted section --
// insert it in a separate text run after the highlighted text run
var remainderRun = found.Run.CloneNode(true);
paragraph.InsertAfter(remainderRun, highlightRun);
var textNode = remainderRun.GetFirstChild<Text>();
textNode.Text = found.TextNode.Text.Substring(found.TokenIndex + text.Length);
// We need to set up this to preserve the spaces between text runs
textNode.Space = new EnumValue<SpaceProcessingModeValues>(SpaceProcessingModeValues.Preserve);
}
// Check if there's some text *before* the found text
if (found.TokenIndex > 0)
{
// Something is left before the highlighted text,
// so make the original text run contain only that portion
found.TextNode.Text = found.TextNode.Text.Remove(found.TokenIndex);
// We need to set up this to preserve the spaces between text runs
found.TextNode.Space = new EnumValue<SpaceProcessingModeValues>(SpaceProcessingModeValues.Preserve);
}
else
{
// There's nothing before the highlighted text -- remove the unneeded text run
paragraph.RemoveChild(found.Run);
}
}
This code works for highlighting the I, going, or school words in the I am going to school sentence.

How to highlight text using string indexes in WPF RichTextBox?

I'm working on a custom RichTextBox which highlights certain words typed in it.
(More precisely, it highlights certain strings, because I intend to highlight strings that are not separated by spaces.)
I search for strings by loading the text into memory, looking for each string in a list one by one, and then applying formatting to the matches.
The issue is that the index I get from the plain-text representation doesn't necessarily point to the same position in the RichTextBox's content once formatting is applied.
(The first formatting is perfect. Any subsequent formatting starts to slip to the left. I assume this is because formatting adds elements to the document, which makes my indexes incorrect.)
Sample pseudo code for this is as follows.
// get the current text
var text = new TextRange(Document.ContentStart, Document.ContentEnd).Text;
// loop through and highlight
foreach (string suggestion in WhatToHighlightCollection)
{
var currentText = text;
var nextOccurance = currentText.IndexOf(suggestion); //This index is Unreliable !!!
while (nextOccurance != -1)
{
// Get the offset from start. (There appears to be 2 characters in the
// beginning. I assume this is document and paragraph start tags ??
// So add 2 to it.)
int offsetFromStart = (text.Length) - (currentText.Length) + 2;
var startPointer = Document.ContentStart.
GetPositionAtOffset(offsetFromStart + nextOccurance, LogicalDirection.Forward);
var endPointer = startPointer.GetPositionAtOffset(suggestion.Length, LogicalDirection.Forward);
var textRange = new TextRange(startPointer, endPointer);
textRange.ApplyPropertyValue(TextElement.BackgroundProperty, new SolidColorBrush(Colors.Yellow));
textRange.ApplyPropertyValue(TextElement.FontWeightProperty, FontWeights.Bold);
textRange.ApplyPropertyValue(TextElement.FontFamilyProperty, new FontFamily("Segoe UI"));
// Go to the next occurance.
currentText = currentText.Substring(nextOccurance + suggestion.Length);
nextOccurance = currentText.IndexOf(suggestion);
}
}
How do I map string indexes to rich text box content ?
NOTE: I'm not worried about the performance of this at the moment, although any suggestions are always welcome, as currently I run this on every TextChanged event to highlight 'as the user type' and it's getting a bit sluggish.

How to format and read CSV file?

Here is just an example of the data I need to format.
The first column is simple; the problem is the second column.
What would be the best approach to format multiple data fields in one column?
How should I parse this data?
Important: the second column needs to contain multiple values, as in the example below.
Name  Details
Alex  Age:25
      Height:6
      Hair:Brown
      Eyes:Hazel

A csv should probably look like this:
Name,Age,Height,Hair,Eyes
Alex,25,6,Brown,Hazel
Each cell should be separated by exactly one comma from its neighbor.
You can reformat it as such by using a simple regex which replaces certain newline and non-newline whitespace with commas (you can easily find each block because it has values in both columns).

A CSV file normally uses commas as field separators and a line break (CRLF or LF) as the row separator. You are using line breaks within your second column, which will cause problems. You'll need to reformat your second column to use some other separator between multiple values. A common alternative separator is the | (pipe) character.
Your format would then look like:
Alex,Age:25|Height:6|Hair:Brown|Eyes:Hazel
In your parsing, you would first parse the comma separated fields (which would return two values), and then parse the second field as pipe separated.
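The two-step parse described above could look roughly like this. It is a minimal sketch: PipeRow and Parse are names made up for the example, and it assumes the name itself contains no comma and that each detail has the form Key:Value.

```csharp
using System;
using System.Collections.Generic;

public static class PipeRow
{
    // Parses a line such as "Alex,Age:25|Height:6|Hair:Brown|Eyes:Hazel".
    public static (string Name, Dictionary<string, string> Details) Parse(string line)
    {
        // Step 1: split on the first comma only, giving the name and the details field.
        string[] fields = line.Split(new[] { ',' }, 2);
        var details = new Dictionary<string, string>();

        if (fields.Length == 2)
        {
            // Step 2: split the details field on pipes, then each item on its first colon.
            foreach (string pair in fields[1].Split(new[] { '|' }, StringSplitOptions.RemoveEmptyEntries))
            {
                string[] keyValue = pair.Split(new[] { ':' }, 2);
                details[keyValue[0].Trim()] = keyValue.Length == 2 ? keyValue[1].Trim() : string.Empty;
            }
        }
        return (fields[0].Trim(), details);
    }
}
```

For the sample line, Parse returns the name "Alex" and a dictionary with the four entries Age=25, Height=6, Hair=Brown, and Eyes=Hazel.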

This is an interesting one. Files in a one-off format can be quite difficult to parse, which is why people often write dedicated classes to deal with them. More conventional formats such as CSV or other delimited formats are easier to read because their structure is regular.
A problem like the above can be addressed in the following way:
1) What should the output look like?
In your instance, and this is just a guess, but I believe you are aiming for the following:
Name, Age, Height, Hair, Eyes
Alex, 25, 6, Brown, Hazel
In which case, you have to parse out this information based on the structure above. If it's repeated blocks of text like the above then we can say the following:
a. Every person is in a block starting with Name Details
b. The name value is the first text after Details, with the other columns being delimited in the format Column:Value
However, you might also have sections with additional attributes, or attributes that are missing if the original input was optional, so tracking the column and its ordinal would be useful too.
So one approach might look like the following:
public void ParseFile(){
String currentLine;
bool newSection = false;
//Store the column names and ordinal position here.
List<String> nameOrdinals = new List<String>();
nameOrdinals.Add("Name"); //IndexOf == 0
Dictionary<Int32, List<String>> nameValues = new Dictionary<Int32 ,List<string>>(); //Use this to store each person's details
Int32 rowNumber = 0;
using (TextReader reader = File.OpenText("D:\\temp\\test.txt"))
{
while ((currentLine = reader.ReadLine()) != null) //This will read the file one row at a time until there are no more rows to read
{
string[] lineSegments = currentLine.Split(new[] { " " }, StringSplitOptions.RemoveEmptyEntries);
if (lineSegments.Length == 2 && String.Compare(lineSegments[0], "Name", StringComparison.InvariantCultureIgnoreCase) == 0
&& String.Compare(lineSegments[1], "Details", StringComparison.InvariantCultureIgnoreCase) == 0) //Looking for a Name Details Line - Start of a new section
{
rowNumber++;
newSection = true;
continue;
}
if (newSection && lineSegments.Length > 1) //We can start adding a new person's details - we know that
{
nameValues.Add(rowNumber, new List<String>());
nameValues[rowNumber].Insert(nameOrdinals.IndexOf("Name"), lineSegments[0]);
//Get the first column:value item
ParseColonSeparatedItem(lineSegments[1], nameOrdinals, nameValues, rowNumber);
newSection = false;
continue;
}
if (lineSegments.Length > 0 && lineSegments[0] != String.Empty) //Ignore empty lines
{
ParseColonSeparatedItem(lineSegments[0], nameOrdinals, nameValues, rowNumber);
}
}
}
//At this point we should have collected a big list of items. We can then write out the CSV. We can use a StringBuilder for now, although your requirements will
//be dependent upon how big the source files are.
//Write out the columns
StringBuilder builder = new StringBuilder();
for (int i = 0; i < nameOrdinals.Count; i++)
{
if(i == nameOrdinals.Count - 1)
{
builder.Append(nameOrdinals[i]);
}
else
{
builder.AppendFormat("{0},", nameOrdinals[i]);
}
}
builder.Append(Environment.NewLine);
foreach (int key in nameValues.Keys)
{
List<String> values = nameValues[key];
for (int i = 0; i < values.Count; i++)
{
if (i == values.Count - 1)
{
builder.Append(values[i]);
}
else
{
builder.AppendFormat("{0},", values[i]);
}
}
builder.Append(Environment.NewLine);
}
//At this point you now have a StringBuilder containing the CSV data you can write to a file or similar
}
private void ParseColonSeparatedItem(string textToSeparate, List<String> columns, Dictionary<Int32, List<String>> outputStorage, int outputKey)
{
if (String.IsNullOrWhiteSpace(textToSeparate)) { return; }
string[] colVals = textToSeparate.Split(new[] { ":" }, StringSplitOptions.RemoveEmptyEntries);
List<String> outputValues = outputStorage[outputKey];
if (!columns.Contains(colVals[0]))
{
//Add the column to the list of expected columns. The index of the column determines it's index in the output
columns.Add(colVals[0]);
}
if (outputValues.Count < columns.Count)
{
outputValues.Add(colVals[1]);
}
else
{
outputStorage[outputKey].Insert(columns.IndexOf(colVals[0]), colVals[1]); //We append the value to the list at the place where the column index expects it to be. That way we can miss values in certain sections yet still have the expected output
}
}
After running this against your file, the string builder contains:
"Name,Age,Height,Hair,Eyes\r\nAlex,25,6,Brown,Hazel\r\n"
Which matches the above (\r\n is effectively the Windows new line marker)
This approach demonstrates how a custom parser might work. It is deliberately verbose, there is plenty of refactoring that could take place, and it is only an example.
Improvements would include:
1) This function assumes there are no spaces in the actual text items themselves. This is a pretty big assumption and, if wrong, would require a different approach to parsing out the line segments. However, this only needs to change in one place - as you read a line at a time, you could apply a reg ex, or just read in characters and assume that everything after the first "column:" section is a value, for example.
2) No exception handling
3) Text output is not quoted. You could test each value to see if it's a date or number - if not, wrap it in quotes as then other programs (like Excel) will attempt to preserve the underlying datatypes more effectively.
4) Assumes no column names are repeated. If they are, then you have to check whether a column item has already been added, and then create a ColName2 column in the parsing section.
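On improvement 3, a small helper for quoting output values might look like the sketch below. Note that it swaps in a slightly different rule from the one suggested there: instead of quoting every non-numeric value, it quotes only values that contain a comma, a double quote, or a line break, and doubles any embedded quotes, which is the usual CSV convention. CsvField and Escape are hypothetical names.

```csharp
using System;

public static class CsvField
{
    // Returns the value ready to be written as one CSV cell.
    public static string Escape(string value)
    {
        if (value == null) return string.Empty;

        // Quoting is needed only when the value contains a separator,
        // a quote character, or a line break.
        bool needsQuotes = value.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        if (!needsQuotes) return value;

        // Embedded quotes are doubled, then the whole value is wrapped in quotes.
        return "\"" + value.Replace("\"", "\"\"") + "\"";
    }
}
```

In the parser above, this would be applied to each value just before it is appended to the StringBuilder, for example builder.Append(CsvField.Escape(values[i])).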

different format into one single line Interop.word

I've been trying to figure out how to insert 2 different formats into the same paragraph using interop.word in c# like this:
hello planet earth here's what I want to do (with only the word "earth" in bold)

Assuming you have your document defined as oDoc, the following code should get you the desired result:
Word.Paragraph oPara = oDoc.Content.Paragraphs.Add(ref oMissing);
oPara.Range.Text = "hello planet earth here's what I want to do";
object oStart = oPara.Range.Start + 13;
object oEnd = oPara.Range.Start + 18;
Word.Range rBold = oDoc.Range(ref oStart, ref oEnd);
rBold.Bold = 1;
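The hard-coded 13 and 18 are the character offsets of the word "earth" within the paragraph text. If the text varies, the offsets can be computed instead; the sketch below shows only that calculation on a plain string. RangeOffsets and OffsetsOf are hypothetical names, and the results would still have to be added to oPara.Range.Start as in the code above.

```csharp
using System;

public static class RangeOffsets
{
    // Returns the start (inclusive) and end (exclusive) offsets of the first
    // occurrence of 'word' in 'text', or (-1, -1) when it is not found.
    public static (int Start, int End) OffsetsOf(string text, string word)
    {
        int index = text.IndexOf(word, StringComparison.Ordinal);
        return index < 0 ? (-1, -1) : (index, index + word.Length);
    }
}
```

For the sentence used here, OffsetsOf(text, "earth") returns (13, 18), matching the values in the answer.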

I had to modify Dennis' answer a little to get it to work for me.
What I'm doing is totally automated, so I have to work only with variables.
private void InsertMultiFormatParagraph(string text, int size, int spaceAfter = 10) {
var para = docWord.Content.Paragraphs.Add(ref objMissing);
para.Range.Text = text;
// Explicitly set this to "not bold"
para.Range.Font.Bold = 0;
para.Range.Font.Size = size;
para.Format.SpaceAfter = spaceAfter;
object objStart = para.Range.Start;
object objEnd = para.Range.Start + text.IndexOf(":");
var rngBold = docWord.Range(ref objStart, ref objEnd);
rngBold.Bold = 1;
para.Range.InsertParagraphAfter();
}
The main difference that made me want to make this post was that the Paragraph should be inserted AFTER the font is changed. My initial thought was to insert it after setting the SpaceAfter property, but then the objStart and objEnd values were tossing "OutOfRange" Exceptions. It was a little counter-intuitive, so I wanted to make sure everyone knew.

The following code seemed to work the best for me when formatting a particular selection within a paragraph. Using Word's built in "find" function to make a selection, then formatting only the selected text. This approach would only work well if the text to select is a unique string within the selection. But for most situations I have run across, this seems to work.
oWord.Selection.Find.Text = Variable_Containing_Text_to_Select; // sets the variable for find and select
oWord.Selection.Find.Execute(); // Executes find and select
oWord.Selection.Font.Bold = 1; // Modifies selection
oWord.Selection.Collapse(); // Clears selection
Hope this helps someone!

I know this post is old, but it came up in almost all my searches. The answer below is in case someone, like me, wants to do this for more than one word in a sentence. In this case, I loop through an array of string variables and change each matching text to bold, modifying joshman1019's answer:
string[] makeBold = new string[4] {a, b, c, d};
foreach (string s in makeBold)
{
wApp.Selection.Find.Text = s; //changes with each iteration
wApp.Selection.Find.Execute();
wApp.Selection.Font.Bold = 1;
wApp.Selection.Collapse(); //used to 'clear' the selection
wApp.Selection.Find.ClearFormatting();
}
So, each string represented by the variable will be bold. So if a = "hello world", then Hello World is made bold in the Word doc. Hope it saves someone some time.

I know this is an old thread, but I thought I'd post here anyway for those that come across it via Google (like I did). I got most of the way to a solution with krillgar's approach, but I had trouble because some of my text contains newlines. Accordingly, this modification worked best for me:
private void WriteText(string text)
{
var para = doc.Content.Paragraphs.Add();
var start = para.Range.Start;
var end = para.Range.Start + text.IndexOf(":");
para.Range.Text = text;
para.Range.Font.Bold = 0;
para.Range.InsertParagraphAfter();
if(text.Contains(":")){
var rngBold = doc.Range(start, end);
rngBold.Bold = 1;
}
}
The key difference is that I calculate start and end earlier in the function. I can't quite put my finger on it, but I think if your new text has newlines in it, the later calculation of start/end messes something up.
And obviously my solution is intended for text with the format:
Label: Data
where Label is to be bolded.

Consider using Range.Collapse, possibly with Microsoft.Office.Interop.Word.WdCollapseDirection.wdCollapseEnd as the parameter.
That allows the next text to have different formatting from the previous text, and the formatting of the next text will not affect the formatting of the previous text.
