How to highlight text in a sentence using OpenXML? - c#

I am using the code below to search for and highlight text in an MS Word document. It works for case 1 but not for case 2:
1. John Alter
Searching for Alter or John highlights John/Alter. This works.
2. I am going to school
Searching for going highlights going, but the word order changes to "I am to school going". This does not work.
How can I fix case 2? My code is below.
private void HighLightText(Paragraph paragraph, string text)
{
string textOfRun = string.Empty;
var runCollection = paragraph.Descendants<DocumentFormat.OpenXml.Wordprocessing.Run>();
DocumentFormat.OpenXml.Wordprocessing.Run runAfter = null;
//find the run part which contains the characters
foreach (DocumentFormat.OpenXml.Wordprocessing.Run run in runCollection)
{
if (!string.IsNullOrWhiteSpace(paragraph.InnerText) && paragraph.InnerText != "\\s")
textOfRun = run.GetFirstChild<DocumentFormat.OpenXml.Wordprocessing.Text>().Text;
if (textOfRun.IndexOf(text, StringComparison.OrdinalIgnoreCase) >= 0)
{
//remove the character from this run part
run.GetFirstChild<DocumentFormat.OpenXml.Wordprocessing.Text>().Text = Regex.Replace(textOfRun, text, string.Empty, RegexOptions.IgnoreCase);//textOfRun.Replace(text, string.Empty);
runAfter = run;
break;
}
}
//create a new run with your customization font and the character as its text
DocumentFormat.OpenXml.Wordprocessing.Run HighLightRun = new DocumentFormat.OpenXml.Wordprocessing.Run();
DocumentFormat.OpenXml.Wordprocessing.RunProperties runPro = new DocumentFormat.OpenXml.Wordprocessing.RunProperties();
Highlight highlight = new Highlight() { Val = HighlightColorValues.Yellow };
DocumentFormat.OpenXml.Wordprocessing.Text runText = new DocumentFormat.OpenXml.Wordprocessing.Text() { Text = text };
runPro.Append(highlight);
HighLightRun.Append(runPro);
HighLightRun.Append(runText);
//insert the new created run part
paragraph.InsertAfter(HighLightRun, runAfter);
}

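You need to split up your Run if you want to highlight some text in the middle of that Run. Replacing the search text with an empty string and inserting the highlighted run after the original one won't work: the highlighted word always lands after the rest of that run's text.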
Your original text structure looks like:
<Run>
<Text>
I am going to school
</Text>
</Run>
If you want to highlight the going word, you need to make a more complex structure out of it:
<Run>
<Text>
I am
</Text>
</Run>
<Run>
<Text>
going
</Text>
</Run>
<Run>
<Text>
to school
</Text>
</Run>
Then, the Run in the middle can be set up for highlighting.
Here is a working code sample. Please note that there's no error handling in this code! It should give you some idea of how to solve your task. Implement proper exception handling for production use!
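The three-run structure above comes down to cutting the run's text into a "before", "match" and "after" part. A minimal sketch of just that string step, with no OpenXML involved (`RunTextSplitter` and `SplitAroundFirstMatch` are hypothetical names, not part of the SDK):

```csharp
using System;

public static class RunTextSplitter
{
    // Hypothetical helper: splits 'text' around the first case-insensitive
    // occurrence of 'token'. Returns { before, match, after },
    // or null when the token is not found.
    public static string[] SplitAroundFirstMatch(string text, string token)
    {
        if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(token))
            return null;

        int index = text.IndexOf(token, StringComparison.OrdinalIgnoreCase);
        if (index < 0)
            return null;

        string before = text.Substring(0, index);
        // Take the match from the source text so its original casing is kept
        string match = text.Substring(index, token.Length);
        string after = text.Substring(index + token.Length);
        return new[] { before, match, after };
    }
}
```

Each non-empty part would then go into its own Run, with the highlight applied only to the middle one. Note that the "before" and "after" parts keep their leading or trailing spaces, which is why the text nodes need space preservation turned on.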
Also note that this sample only searches for the first occurrence, as it is in your code. If you need to highlight multiple search matches, you will have to improve this code.
void HighLightText(Paragraph paragraph, string text)
{
// Search for a first occurrence of the text in the text runs
var found = paragraph
.Descendants<Run>()
.Where(r => !string.IsNullOrEmpty(r.InnerText) && r.InnerText != "\\s")
.Select(r =>
{
var runText = r.GetFirstChild<Text>();
int index = runText.Text.IndexOf(text, StringComparison.OrdinalIgnoreCase);
// 'Run' is a reference to the text run we found,
// 'TextNode' is a reference to the run's Text object,
// 'TokenIndex' is the index of the search string in the run's text
return new { Run = r, TextNode = runText, TokenIndex = index };
})
.FirstOrDefault(o => o.TokenIndex >= 0);
// Nothing found -- escape
if (found == null)
{
return;
}
// Create a node for highlighted text as a clone (to preserve formatting etc)
var highlightRun = found.Run.CloneNode(true);
// Add the highlight node after the found text run and set up the highlighting
paragraph.InsertAfter(highlightRun, found.Run);
highlightRun.GetFirstChild<Text>().Text = text;
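// Reuse the cloned run's properties if it has any: a run may hold only one RunProperties element
RunProperties runPro = highlightRun.GetFirstChild<RunProperties>();
if (runPro == null)
{
runPro = new RunProperties();
highlightRun.InsertAt(runPro, 0);
}
runPro.Highlight = new Highlight { Val = HighlightColorValues.Yellow };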
// Check if there's some text in the text run *after* the found text
int remainderLength = found.TextNode.Text.Length - found.TokenIndex - text.Length;
if (remainderLength > 0)
{
// There is some text after the highlighted section --
// insert it in a separate text run after the highlighted text run
var remainderRun = found.Run.CloneNode(true);
paragraph.InsertAfter(remainderRun, highlightRun);
var textNode = remainderRun.GetFirstChild<Text>();
textNode.Text = found.TextNode.Text.Substring(found.TokenIndex + text.Length);
// We need to set up this to preserve the spaces between text runs
textNode.Space = new EnumValue<SpaceProcessingModeValues>(SpaceProcessingModeValues.Preserve);
}
// Check if there's some text *before* the found text
if (found.TokenIndex > 0)
{
// Something is left before the highlighted text,
// so make the original text run contain only that portion
found.TextNode.Text = found.TextNode.Text.Remove(found.TokenIndex);
// We need to set up this to preserve the spaces between text runs
found.TextNode.Space = new EnumValue<SpaceProcessingModeValues>(SpaceProcessingModeValues.Preserve);
}
else
{
// There's nothing before the highlighted text -- remove the unneeded text run
paragraph.RemoveChild(found.Run);
}
}
This code works for highlighting the words I, going, or school in the sentence I am going to school.
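If every occurrence needs highlighting rather than only the first, one possible approach is to cut the run's text into an ordered list of pieces first and create one Run per piece afterwards. The sketch below covers only the string step and is an assumption about how that could look (`MatchSegmenter` is a hypothetical name):

```csharp
using System;
using System.Collections.Generic;

public static class MatchSegmenter
{
    // Hypothetical helper: returns the pieces of 'text' in order.
    // The bool is true for pieces that matched 'token' (case-insensitive).
    public static List<KeyValuePair<string, bool>> Split(string text, string token)
    {
        var segments = new List<KeyValuePair<string, bool>>();
        if (string.IsNullOrEmpty(text))
            return segments;
        if (string.IsNullOrEmpty(token))
        {
            // Nothing to search for: the whole text is one plain piece
            segments.Add(new KeyValuePair<string, bool>(text, false));
            return segments;
        }

        int position = 0;
        while (position < text.Length)
        {
            int index = text.IndexOf(token, position, StringComparison.OrdinalIgnoreCase);
            if (index < 0)
                break;
            if (index > position)
            {
                // Plain text between the previous match and this one
                segments.Add(new KeyValuePair<string, bool>(
                    text.Substring(position, index - position), false));
            }
            segments.Add(new KeyValuePair<string, bool>(
                text.Substring(index, token.Length), true));
            position = index + token.Length;
        }

        if (position < text.Length)
        {
            // Plain text after the last match
            segments.Add(new KeyValuePair<string, bool>(text.Substring(position), false));
        }
        return segments;
    }
}
```

For "to be or not to be" and the token "be" this yields four pieces, two of them flagged. Each flagged piece would become a highlighted clone of the original run and each unflagged piece a plain clone, in the same order.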

Related

c# rtb: Bolden inline overrides another bolden inline

My project is a Windows 10 universal app!
For almost two months I have had the problem that when I bold two words in one rtb, the second bold inline overrides the first.
Example:
I want to bold:
Hello; Bye
Text from rtb:
Hello and Bye
Now I search with regex whether there is "Hello"/"Bye" in the rtb.
For each "Hello" in the rtb, I insert a bold inline with the text "Hello" at the same position where "Hello" stood before.
After that I do the same with "Bye".
My Code:
string text = run.Text; // "Hello and Bye"
MatchCollection mc = Regex.Matches(text, "Hello", RegexOptions.Multiline);
int i = 0;
var bold = new Bold();
int iIn = 0;
int iLe = 0;
p.Inlines.Clear(); // p = Paragraph from rtb
foreach (Match match in mc)
{
p.Inlines.Add(new Run { Text = text.Substring(i, match.Index - i) });
bold.Inlines.Add(new Run { Text = text.Substring(match.Index, match.Length) });
p.Inlines.Add(bold);
i = match.Index + match.Length;
if (i < text.Length)
{
p.Inlines.Add(new Run { Text = text.Substring(i) });
}
}
This is followed by the same code for "Bye".
Now the problem is that I clear the first bold inline (Hello) while inserting the second bold inline (Bye).
Does anyone know an alternative way to bold a specific word in an rtb, or have a suggestion to improve the code? I have tried almost everything, but nothing really worked...
Use the following in order to select the relevant characters to make bold:
public void Select(
TextPointer start,
TextPointer end
)
In order to get TextPointer, try this (not checked):
TextPointer pointer = document.ContentStart;
TextPointer start = pointer.GetPositionAtOffset(0);
TextPointer end = start.GetPositionAtOffset(5);
rtb.Select (start,end); // for example to select Hello
// Then change the font to bold
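Another option, offered here only as an untested idea for the rtb itself, is to avoid the second pass altogether: match all words with one regex, so the paragraph is cleared and rebuilt only once. The sketch below shows just the text-splitting part (`BoldSegmenter` is a hypothetical name, and the word list is an assumption):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class BoldSegmenter
{
    // Hypothetical helper: splits 'text' into ordered pieces.
    // The bool is true for pieces that equal one of 'words'.
    public static List<KeyValuePair<string, bool>> Split(string text, params string[] words)
    {
        var segments = new List<KeyValuePair<string, bool>>();
        string[] usable = words.Where(w => !string.IsNullOrEmpty(w)).ToArray();
        if (usable.Length == 0)
        {
            if (text.Length > 0)
                segments.Add(new KeyValuePair<string, bool>(text, false));
            return segments;
        }

        // One alternation pattern such as "Hello|Bye"; Escape keeps
        // special characters in the words from being read as regex syntax
        string pattern = string.Join("|", usable.Select(Regex.Escape));
        int position = 0;
        foreach (Match match in Regex.Matches(text, pattern))
        {
            if (match.Index > position)
            {
                segments.Add(new KeyValuePair<string, bool>(
                    text.Substring(position, match.Index - position), false));
            }
            segments.Add(new KeyValuePair<string, bool>(match.Value, true));
            position = match.Index + match.Length;
        }

        if (position < text.Length)
            segments.Add(new KeyValuePair<string, bool>(text.Substring(position), false));
        return segments;
    }
}
```

After a single `p.Inlines.Clear()`, each unflagged piece would be added as a plain Run and each flagged piece as a new Bold wrapping a Run. Creating a fresh Bold per match matters, since the code in the question builds one Bold outside the loop and reuses it.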

Place every sentence from a text file into an array but detect headers/titles

I need to get each sentence from a text document/string into an array.
The issue is how to handle headers, titles, etc.: sections of text which are not part of a sentence and don't end in a full stop ". " that I could detect.
If I can't detect these, they will be stuck onto the front of the following sentence (if I use ". " to distinguish sentences), which I can't have happen.
Initially I was going to use:
contentRefined = content.Replace(" \n", ". ");
I thought this would remove all of the empty lines and newlines and place full stops at the ends of headers, so that they are detected and treated as sentences. It would produce ". . " in places, but I could Replace those with nothing.
But it didn't work: it left the empty lines in place and just put a ". " at the start of each empty line, as well as a ". " at the start of every paragraph.
I have now tried:
contentRefined = Regex.Replace(content, @"^\s+$[\r\n]*", "", RegexOptions.Multiline);
Which fully removes the full empty lines, but doesn't get me closer to adding a full stop to the ends of the headers.
I need to place the sentences and headers/titles in an array. I'm not sure whether there is a way to do this without having to split the string on something such as ". "
Edit: Full current code showing how I get the text from the file
public void sentenceSplit()
{
content = File.ReadAllText(@"I:\Project\TLDR\Test Text.txt");
contentRefined = Regex.Replace(content, @"^\s+$[\r\n]*", "", RegexOptions.Multiline);
//contentRefined = content.Replace("\n", ". ");
}
I'm making an assumption that 'Header' and 'Title' are on their own line and do not end in a period.
If that's the case, then this may work for you:
var filePath = @"C:\Temp\temp.txt";
var sentences = new List<string>();
using (TextReader reader = new StreamReader(filePath))
{
while (reader.Peek() >= 0)
{
var line = reader.ReadLine();
if (line.Trim().EndsWith("."))
{
line.Split(new[] {'.'}, StringSplitOptions.RemoveEmptyEntries)
.ToList()
.ForEach(l => sentences.Add(l.Trim() + "."));
}
}
}
// Output sentences to console
sentences.ForEach(Console.WriteLine);
UPDATE
Another approach using the File.ReadAllLines() method, and displaying the sentences in a RichTextBox:
private void Form1_Load(object sender, EventArgs e)
{
var filePath = @"C:\Temp\temp.txt";
var sentences = File.ReadAllLines(filePath)
// Only select lines that end in a period
.Where(l => l.Trim().EndsWith("."))
// Split each line into sentences (one line may have many sentences)
.SelectMany(s => s.Split(new[] {'.'}, StringSplitOptions.RemoveEmptyEntries))
// Trim any whitespace off the ends of the sentence and add a period to the end
.Select(s => s.Trim() + ".")
// And finally cast it to a List (or you could do 'ToArray()')
.ToList();
// To show each sentence in the list on its own line in the rtb:
richTextBox1.Text = string.Join("\n", sentences);
// Or to show them all, one after another:
richTextBox1.Text = string.Join(" ", sentences);
}
UPDATE
Now that I think I understand what you're asking, here's what I would do. First, I would create some classes to manage all this stuff. If you break the document down into parts, you get something like:
HEADER
Paragraph sentence one. Paragraph sentence two. Paragraph
sentence three with a number, like in this quote: "$5.00 doesn't go as
far as it used to".
Header Over an Empty Section
Header over multiple paragraphs
Paragraph sentence one. Paragraph
sentence two. Paragraph sentence three with a number, like in this
quote: "$5.00 doesn't go as far as it used to".
Paragraph sentence one. Paragraph sentence two. Paragraph sentence
three with a number, like in this quote: "$5.00 doesn't go as far as
it used to".
Paragraph sentence one. Paragraph sentence two. Paragraph sentence
three with a number, like in this quote: "$5.00 doesn't go as far as
it used to".
So I would create the following classes. First, one to represent a 'Section'. This is defined by a Header and zero to many paragraphs:
private class Section
{
public string Header { get; set; }
public List<Paragraph> Paragraphs { get; set; }
public Section()
{
Paragraphs = new List<Paragraph>();
}
}
Then I would define a Paragraph, which contains one or more sentences:
private class Paragraph
{
public List<string> Sentences { get; set; }
public Paragraph()
{
Sentences = new List<string>();
}
}
Now I can populate a List of Sections to represent the document:
var filePath = @"C:\Temp\temp.txt";
var sections = new List<Section>();
var currentSection = new Section();
var currentParagraph = new Paragraph();
using (TextReader reader = new StreamReader(filePath))
{
while (reader.Peek() >= 0)
{
var line = reader.ReadLine().Trim();
// Ignore blank lines
if (string.IsNullOrWhiteSpace(line)) continue;
if (line.EndsWith("."))
{
// This line is a paragraph, so add all the sentences
// it contains to the current paragraph
line.Split(new[] {". "}, StringSplitOptions.RemoveEmptyEntries)
.Select(l => l.Trim().EndsWith(".") ? l.Trim() : l.Trim() + ".")
.ToList()
.ForEach(l => currentParagraph.Sentences.Add(l));
// Now add this paragraph to the current section
currentSection.Paragraphs.Add(currentParagraph);
// And set it to a new paragraph for the next loop
currentParagraph = new Paragraph();
}
else if (line.Length > 0)
{
// This line is a header, so we're starting a new section.
// Add the current section to our list and create
// a new one, setting this line as the header.
sections.Add(currentSection);
currentSection = new Section {Header = line};
}
}
// Finally, if the current section contains any data, add it to the list
if (!string.IsNullOrEmpty(currentSection.Header) || currentSection.Paragraphs.Any())
{
sections.Add(currentSection);
}
}
Now we have the whole document in a list of sections, and we know the order, the headers, the paragraphs, and the sentences they contain. As an example of how you can analyze it, here's a way to write it back out to a RichTextBox:
// We can build the document section by section
var documentText = new StringBuilder();
foreach (var section in sections)
{
// Here we can display headers and paragraphs in a custom way.
// For example, we can separate all sections with a blank line:
documentText.AppendLine();
// If there is a header, we can underline it
if (!string.IsNullOrWhiteSpace(section.Header))
{
documentText.AppendLine(section.Header);
documentText.AppendLine(new string('-', section.Header.Length));
}
// We can mark each paragraph with an arrow (--> )
foreach (var paragraph in section.Paragraphs)
{
documentText.Append("--> ");
// And write out each sentence, separated by a space
documentText.AppendLine(string.Join(" ", paragraph.Sentences));
}
}
// To make the underline approach above look
// half-way decent, we need a fixed-width font
richTextBox1.Font = new Font(FontFamily.GenericMonospace, 9);
// Now set the RichTextBox Text equal to the StringBuilder Text
richTextBox1.Text = documentText.ToString();
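If all that is needed is the flat array of headers and sentences that the question asks for, the same rules can be applied without the Section and Paragraph classes. This is a rough sketch under the same assumptions as above (each paragraph sits on one line, and a line that does not end in a period is a header); `SentenceSplitter` is a hypothetical name:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SentenceSplitter
{
    // Hypothetical helper: returns headers and sentences in document order.
    public static List<string> Split(string content)
    {
        var items = new List<string>();
        string[] lines = content.Split(
            new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);

        foreach (string rawLine in lines)
        {
            string line = rawLine.Trim();
            if (line.Length == 0)
                continue; // whitespace-only line

            if (!line.EndsWith(".", StringComparison.Ordinal))
            {
                // Assumed to be a header or title: keep it as one item
                items.Add(line);
                continue;
            }

            // Split after a period only when whitespace follows it,
            // so a number such as $5.00 stays inside its sentence
            foreach (string sentence in Regex.Split(line, @"(?<=\.)\s+"))
                items.Add(sentence);
        }
        return items;
    }
}
```

Calling `SentenceSplitter.Split(File.ReadAllText(filePath)).ToArray()` would give the array. Abbreviations such as "e.g. " would still be split, so this is only as reliable as the ". " rule it is built on.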

C# - Implementing Markdown to Word (OpenXML)

I'm trying to implement my own version of markdown for creating Word documents in a C# application. For bold/italic/underline I am going to use **, ` and _ respectively. I have created something that parses combinations of **'s to output bold text by extracting a match and using something like this:
RunProperties rPr2 = new RunProperties();
rPr2.Append(new Bold() { Val = new OnOffValue(true) });
Run run2 = new Run();
run2.Append(rPr2);
run2.Append(new Text(extractedString));
p.Append(run2);
My issue is when I come to combining the three different formats, as I'm thinking I would have to weigh up all the different formatting combinations and split them into separate runs. Bold runs, bold italic runs, underline runs, bold underline runs etc etc. I want my program to be able to handle something like this:
**_Lorem ipsum_** (creates bold & underlined run)
`Lorem ipsum` dolor sit amet, **consectetur _adipiscing_ elit**.
_Praesent `feugiat` velit_ sed tellus convallis, **non `rhoncus** tortor` auctor.
Basically any mix of the styles you could throw at it I want it to handle. However if I am programmatically generating these runs, I need to weigh everything up before setting the text into runs, should I handle this with an array of character indexes for each style and merge them into a big list of styles (not sure how exactly I would do this)?
The final question is does something like this already exist? If it does I have been unable to find it (markdown to word).
I think you'll have to split your text into parts by the formatting they have and add each part with the correct formatting to the document. Like here http://msdn.microsoft.com/en-us/library/office/gg278312.aspx.
So
**non `rhoncus** tortor` will become: "non "{bold}, "rhoncus"{bold,italic}, " tortor"{italic}
I think it'll be easier than performing several runs. You don't even have to parse the entire document. Just parse as you go and after each "change" in the formatting write to the docx.
Another thought - If all you're creating is simple text and that's all you need, it might be even simpler to generate the openXML itself. Your data is very structured, should be easy enough to create an XML out of it.
Here's a simple algorithm to do what I propose...
// These are the different formattings you have
public enum Formatings
{
Bold, Italic, Underline, Undefined
}
// This will store the current format
private Dictionary<Formatings, bool> m_CurrentFormat;
// This will store which string translates into which format
private Dictionary<string, Formatings> m_FormatingEncoding;
public void Init()
{
m_CurrentFormat = new Dictionary<Formatings, bool>();
foreach (Formatings format in Enum.GetValues(typeof(Formatings)))
{
m_CurrentFormat.Add(format, false);
}
m_FormatingEncoding = new Dictionary<string, Formatings>
{{"**", Formatings.Bold}, {"`", Formatings.Italic}, {"_", Formatings.Underline}};
}
public void ParseFormattedText(string p_text)
{
StringBuilder currentWordBuilder = new StringBuilder();
int currentIndex = 0;
while (currentIndex < p_text.Length)
{
Formatings currentFormatSymbol;
int shift;
if (IsFormatSymbol(p_text, currentIndex, out currentFormatSymbol, out shift))
{
// This is the current word you need to insert
string currentWord = currentWordBuilder.ToString();
// This is the current formatting status --> m_CurrentFormat
// This is where you can insert your code and add the word you want to the .docx
currentWordBuilder = new StringBuilder();
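currentIndex += shift;
m_CurrentFormat[currentFormatSymbol] = !m_CurrentFormat[currentFormatSymbol];
// Re-check the new position: it may hold another symbol or be the end of the text
continue;
}
currentWordBuilder.Append(p_text[currentIndex]);
currentIndex++;
}
// Text still in currentWordBuilder here is the last part; add it to the .docx too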
}
// Checks if the current position is the beginning of a format symbol
// if true - p_currentFormatSymbol will be the discovered format delimiter
// and p_shift will denote its length
private bool IsFormatSymbol(string p_text, int p_currentIndex, out Formatings p_currentFormatSymbol, out int p_shift)
{
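// This is a trivial solution, you can do better if you need
// (the length is clamped so the last character of the text doesn't throw)
string substring = p_text.Substring(p_currentIndex, Math.Min(2, p_text.Length - p_currentIndex));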
foreach (var formatString in m_FormatingEncoding.Keys)
{
if (substring.StartsWith(formatString))
{
p_shift = formatString.Length;
p_currentFormatSymbol = m_FormatingEncoding[formatString];
return true;
}
}
p_shift = -1;
p_currentFormatSymbol = Formatings.Undefined;
return false;
}
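For comparison, the same toggle idea can be written as a self-contained sketch that returns the finished parts together with their formatting, using the delimiters from the question (** bold, ` italic, _ underline). `TextStyle` and `StyleParser` are hypothetical names, and escaping or unbalanced delimiters are not handled:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

[Flags]
public enum TextStyle { None = 0, Bold = 1, Italic = 2, Underline = 4 }

public static class StyleParser
{
    // Delimiters as described in the question
    private static readonly KeyValuePair<string, TextStyle>[] Delimiters =
    {
        new KeyValuePair<string, TextStyle>("**", TextStyle.Bold),
        new KeyValuePair<string, TextStyle>("`", TextStyle.Italic),
        new KeyValuePair<string, TextStyle>("_", TextStyle.Underline)
    };

    // Returns each stretch of text with the styles active while it was read
    public static List<KeyValuePair<string, TextStyle>> Parse(string text)
    {
        var segments = new List<KeyValuePair<string, TextStyle>>();
        var current = new StringBuilder();
        TextStyle active = TextStyle.None;
        int i = 0;

        while (i < text.Length)
        {
            bool matched = false;
            foreach (var delimiter in Delimiters)
            {
                string symbol = delimiter.Key;
                if (string.CompareOrdinal(text, i, symbol, 0, symbol.Length) == 0)
                {
                    // Formatting changes here: emit what was collected so far
                    Flush(segments, current, active);
                    active ^= delimiter.Value; // toggle this style on or off
                    i += symbol.Length;
                    matched = true;
                    break;
                }
            }
            if (!matched)
            {
                current.Append(text[i]);
                i++;
            }
        }

        Flush(segments, current, active); // text after the last delimiter
        return segments;
    }

    private static void Flush(List<KeyValuePair<string, TextStyle>> segments,
                              StringBuilder current, TextStyle active)
    {
        if (current.Length == 0)
            return;
        segments.Add(new KeyValuePair<string, TextStyle>(current.ToString(), active));
        current.Clear();
    }
}
```

Each returned pair maps onto one Run: the text goes into the Text element and the flags decide which of Bold, Italic and Underline are appended to its RunProperties. Overlapping input such as **non `rhoncus** tortor` needs no special handling because every style is toggled independently.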

How to highlight text using string indexes in WPF RichTextBox?

I'm working on a custom RichTextBox which highlights certain words typed in it.
(More precisely, it highlights certain strings, because I intend to highlight strings that are not separated by spaces.)
I search for strings by loading the text into memory and looking for each string in a list one by one, then applying formatting to them.
The issue is that the index I get from the plain text representation doesn't necessarily point to the same position in the RichTextBox's content once formatting is applied.
(The first formatting is perfect. Any subsequent formatting starts to slip to the left. I assume this is because formatting adds certain elements to the document, which makes my indexes incorrect.)
Sample pseudo code for this is as follows.
// get the current text
var text = new TextRange(Document.ContentStart, Document.ContentEnd).Text;
// loop through and highlight
foreach (string suggestion in WhatToHighlightCollection)
{
var currentText = text;
var nextOccurance = currentText.IndexOf(suggestion); //This index is Unreliable !!!
while (nextOccurance != -1)
{
// Get the offset from start. (There appears to be 2 characters in the
// beginning. I assume this is document and paragraph start tags ??
// So add 2 to it.)
int offsetFromStart = (text.Length) - (currentText.Length) + 2;
var startPointer = Document.ContentStart.
GetPositionAtOffset(offsetFromStart + nextOccurance, LogicalDirection.Forward);
var endPointer = startPointer.GetPositionAtOffset(suggestion.Length, LogicalDirection.Forward);
var textRange = new TextRange(startPointer, endPointer);
textRange.ApplyPropertyValue(TextElement.BackgroundProperty, new SolidColorBrush(Colors.Yellow));
textRange.ApplyPropertyValue(TextElement.FontWeightProperty, FontWeights.Bold);
textRange.ApplyPropertyValue(TextElement.FontFamilyProperty, new FontFamily("Segoe UI"));
// Go to the next occurrence.
currentText = currentText.Substring(nextOccurance + suggestion.Length);
nextOccurance = currentText.IndexOf(suggestion);
}
}
How do I map string indexes to RichTextBox content?
NOTE: I'm not worried about the performance of this at the moment, although any suggestions are always welcome, as currently I run this on every TextChanged event to highlight 'as the user type' and it's getting a bit sluggish.

C# get text from file between two hashes

In my C# program (at this point) I have two fields in my form. One is a word list using a listbox; the other is a textbox. I have been able to successfully load a large word list into the listbox from a text file. I can also display the selected item in the listbox into the textbox this way:
private void wordList_SelectedIndexChanged(object sender, EventArgs e)
{
string word = wordList.Text;
concordanceDisplay.Text = word;
}
I have another local file I need to get at to display some of its contents in the textbox. In this file each headword (as in a dictionary) is preceded by a #. So, I would like to take the variable 'word' and search in this local file to put the entries into the textbox, like so:
#headword1
entry is here...
...
...
#headword2
entry is here...
...
...
#headword3
entry is here...
...
...
You get the format of the text file. I just need to search for the correct headword with # before that word, and copy all info from there until the next hash in the file, and place it in the text box.
Obviously, I am a newbie, so be gentle. Thanks much.
P.S. I used StreamReader to get at the word list and display it in the listbox like so:
StreamReader sr = new StreamReader("C:\\...\\list-final.txt");
string line;
while ((line = sr.ReadLine()) != null)
{
MyList.Add(line);
}
wordList.DataSource = MyList;
var sectionLines = File.ReadAllLines(fileName) // shortcut to read all lines from file
.SkipWhile(l => l != "#headword2") // skip everything before the heading you want
.Skip(1) // skip the heading itself
.TakeWhile(l => !l.StartsWith("#")) // grab stuff until the next heading or the end
.ToList(); // optional convert to list
string getSection(string sectionName)
{
StreamReader sr = new StreamReader(@"C:\Path\To\file.txt");
string line;
var MyList = new List<string>();
bool inCorrectSection = false;
while ((line = sr.ReadLine()) != null)
{
if (line.StartsWith("#"))
{
if (inCorrectSection)
break;
else
inCorrectSection = Regex.IsMatch(line, @"^#" + Regex.Escape(sectionName) + @"($| -)");
}
else if (inCorrectSection)
MyList.Add(line);
}
return string.Join(Environment.NewLine, MyList);
}
// in another method
textBox.Text = getSection("headword1");
Here are a few alternate ways to check if the section matches, in rough order of how accurate they are in detecting the right section name:
// if the separator after the section name is always " -", this is the best way I've thought of, since it will work regardless of what's in the sectionName
inCorrectSection = Regex.IsMatch(line, @"^#" + Regex.Escape(sectionName) + @"($| -)");
// as long as the section name can't contain # or spaces, this will work
inCorrectSection = line.Split('#', ' ')[1] == sectionName;
// as long as only alphanumeric characters can ever make up the section name, this is good
inCorrectSection = Regex.IsMatch(line, @"^#" + sectionName + @"\b");
// the problem with this is that if you are searching for "head", it will find "headOther" and think it's a match
inCorrectSection = line.StartsWith("#" + sectionName);
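The LINQ pipeline and the "exact name, or name followed by " -"" rule can also be combined without a regex, which sidesteps any special characters in the headword. A small sketch that takes the lines as a parameter so it can be tried without a file (`SectionReader` is a hypothetical name; the " -" separator is the same assumption as above):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SectionReader
{
    // Hypothetical helper: returns the lines between "#sectionName" and the
    // next line starting with '#', or an empty string if there is no such section.
    public static string GetSection(IEnumerable<string> lines, string sectionName)
    {
        string heading = "#" + sectionName;
        IEnumerable<string> body = lines
            .SkipWhile(l => !IsHeading(l, heading)) // skip everything before the heading
            .Skip(1)                                // skip the heading itself
            .TakeWhile(l => !l.StartsWith("#", StringComparison.Ordinal));
        return string.Join(Environment.NewLine, body);
    }

    // The heading matches when the line is exactly "#name"
    // or continues with " -" right after the name
    private static bool IsHeading(string line, string heading)
    {
        return line == heading
            || line.StartsWith(heading + " -", StringComparison.Ordinal);
    }
}
```

In the form from the question this could be called as `concordanceDisplay.Text = SectionReader.GetSection(File.ReadLines(fileName), word);`. Searching for "head" does not pick up "#headOther", because a partial prefix is not accepted as a match.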
