How to format a RichTextBox with existing contents - c#

I'm getting data from a web API that returns text, and formatting information. The formatting data only includes the type of formatting (bold, italic, etc.) and the range of that formatting. The main problem with this is, that two ranges can "collide" (for example the first 3 characters of a word are bold and italic but the last 3 characters are only italic). Example response
{
"text" : "This is an example text",
"inlineStyles" : [
{
"offsetFromStart" : 5,
"length" : 10,
"type" : "bold"
},
{
"offsetFromStart" : 10,
"length" : 10,
"type" : "italic"
}
]
}
I already tried doing this with a simple TextBlock and failed. I also tried it with a RichTextBox, but when I added a Span I couldn't insert it into its original position. I also thought about formatting each character with its own Span or Run, but that would be very ugly and in general just a bad solution. (My main concern is speed.)
var tb = new RichTextBox();
var para = new Paragraph();
para.Inlines.Add("This is an example text"); // Text parsed from the response
var startingPointer1 = para.ContentStart.GetPositionAtOffset(5);
var sp1 = new Span(startingPointer1, startingPointer1.GetPositionAtOffset(10));
sp1.FontWeight = FontWeights.Bold;
var startingPointer2 = para.ContentStart.GetPositionAtOffset(10);
var sp2 = new Span(startingPointer2, startingPointer2.GetPositionAtOffset(10));
sp2.FontStyle = FontStyles.Italic;
para.Inlines.Add(sp1);
para.Inlines.Add(sp2);
tb.Document.Blocks.Add(para);
This code appends it to the end, and when combining multiple inline elements like in my example it doesn't work at all (because of the first problem).

I don't think you can overlap Runs/Spans like this, you'll have to find all the breaking points in your text and format each text range separately. It's similar to HTML, where
<bold>some<italic> bold italic</bold> and other </italic> text.
is not valid. In your case, you'd have bold only for characters 5-9, bold italic for 10-14, italic only for 15-19, and so on.
It's probably useful to find some kind of Range class with methods to combine ranges, split, find overlaps, etc. A while ago I started with this.
EDIT: I don't exactly have an idea how to implement all this (last time I did something similar was almost 10 years ago), but you can try something like this:
Create a List<Range<int>>. Initially it contains a single Range(0, length of text).
Load the first style, create a new Range with start/end offset. Overlap (or whatever method is appropriate) this range with the range in the list. This should give you 3 ranges, something like (0, start of style), (start of style, end of style), (end of style, end of text). Remove old range from the list and add new ones.
Load the next, find overlaps, with the ranges in the list, delete the ones that are overlapped and add new ranges.
This should give you a list of nonoverlapping ranges.
Now, for the styles. You can create a kind of stylesheet class. This class can use FontWeights, FontStyles and the other style types defined in System.Windows. Modify the list so that it contains, for example, List<Tuple<Range<int>, Stylesheet>>. To calculate overlaps, just use the first item in the Tuple.
Before you remove old ranges from the list, combine the styles.
This should give you a list of non-overlapping regions with the appropriate styles. Create TextRanges and apply the styles to them.
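The splitting step above can be sketched as follows. This is only one possible variant, not the exact incremental procedure described: instead of splitting the list range by range, it collects every start and end offset as a breaking point and then unions the styles that cover each resulting segment. The names TextStyle and StyleSplitter are hypothetical and not part of WPF.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical flags type for illustration; not a WPF type.
[Flags]
public enum TextStyle { None = 0, Bold = 1, Italic = 2 }

public static class StyleSplitter
{
    // Turns possibly overlapping (offset, length, style) ranges into
    // non-overlapping segments [Start, End) that carry the combined style.
    public static List<(int Start, int End, TextStyle Style)> Split(
        int textLength,
        IEnumerable<(int Offset, int Length, TextStyle Style)> ranges)
    {
        var list = ranges.ToList();

        // Every range start and end is a potential breaking point.
        var points = new SortedSet<int> { 0, textLength };
        foreach (var r in list)
        {
            points.Add(Math.Max(0, Math.Min(textLength, r.Offset)));
            points.Add(Math.Max(0, Math.Min(textLength, r.Offset + r.Length)));
        }

        var sorted = points.ToList();
        var segments = new List<(int Start, int End, TextStyle Style)>();
        for (int i = 0; i + 1 < sorted.Count; i++)
        {
            int start = sorted[i];
            int end = sorted[i + 1];

            // Union of all styles whose range covers this segment.
            var style = TextStyle.None;
            foreach (var r in list)
            {
                if (r.Offset <= start && start < r.Offset + r.Length)
                {
                    style |= r.Style;
                }
            }
            segments.Add((start, end, style));
        }
        return segments;
    }
}
```

For the example response (text length 23, bold at 5 for 10, italic at 10 for 10) this yields five segments: plain, bold, bold italic, italic, plain. Each segment could then become a single Run added to para.Inlines, with FontWeight and FontStyle set from the flags, which avoids one inline per character.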
Other idea that might work:
Again, create a stylesheet. Initially it should be normal weight, normal style, default font size etc.
Find the next offset from the input (the first one that is larger than the current), create a TextRange and apply a style.
Find the next offset from the input, modify current (and only) style and apply.
If I remember correctly, inserting style definition in the text also counts as characters, so you might need to adjust offsets when you insert style tags in the final text. Also, I believe it is doable just using TextBlock.
As I said, I don't know if this works as described, but it might give you an idea.

My current solution is that I go through every character one by one and scan through the ranges detecting if the current character is in any of them and then assigning a span to the character. This is not ideal at all, but it gets the job done. I'll try to implement an actual algorithm for this later. Until then, if you have any information that could help, please comment.
If anyone needs sample code of my current implementation I'd happily share it with you. (Even though it's not really efficient at all)

Related

Edit an existing PDF file using iTextSharp

I have a pdf file which I am processing by converting it into text using the following coding..
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
During processing if I am seeing any type of ambiguity in the content means error in the data of the PDF file, I have to mark the entire line of the pdf(Color that line with Red) file but I am not able to analyze how to achieve that. Please help me.
As already mentioned in comments: What you essentially need is a SimpleTextExtractionStrategy replacement which not only returns text but instead text with positions. The LocationTextExtractionStrategy would be a good starting point for that as it collects the text with positions (to put it in the right order).
If you look into the source of LocationTextExtractionStrategy you'll see that it keeps its text pieces in a member List<TextChunk> locationalResult. A TextChunk (inner class in LocationTextExtractionStrategy) represents a text piece (originally drawn by a single text drawing operation) with location information. In GetResultantText this list is sorted (top-to-bottom, left-to-right, all relative to the text base line) and reduced to a string.
What you need, is something like this LocationTextExtractionStrategy with the difference that you retrieve the (sorted) text pieces including their positions.
Unfortunately the locationalResult member is private. If it was at least protected, you could simply have derived your new strategy from LocationTextExtractionStrategy. Instead you now have to copy its source to add to it (or do some introspection/reflection magic).
Your addition would be a new method similar to GetResultantText. This method might recognize all the text on the same line (just like GetResultantText does) and either
do the analysis / search for ambiguities itself and return a list of the locations (start and end) of any found ambiguities; or
put the text found for the current line into a single TextChunk instance together with the effective start and end locations of that line and eventually return a List<TextChunk> each of which represents a text line; if you do this, the calling code would do the analysis to find ambiguities, and if it finds one, it has the start and end location of the line the ambiguity is on. Beware, TextChunk in the original strategy is protected but you need to make it public for this approach to work.
Either way, you eventually have the start and end location of the ambiguities or at least of the lines the ambiguities are on. Now you have to highlight the line in question (as you say, you have to mark the entire line of the pdf(Color that line with Red)).
To manipulate a given PDF you use a PdfStamper. You can mark a line on a page by either
getting the UnderContent for that page from the PdfStamper and filling a rectangle in red there using your position data; the disadvantage of this approach is that if the original PDF already has underlaid the line with filled areas, your mark will be hidden underneath; or by
getting the OverContent for that page from the PdfStamper and filling a somewhat transparent rectangle in red; or by
adding a highlight annotation to the page.
To make things even smoother, you might want to extend your copy of TextChunk (inner class in your copy of LocationTextExtractionStrategy) to keep not only the base line coordinates but also the maximal ascent and descent of the glyphs used. Obviously you'd have to fill in that information in RenderText...
Doing so you know exactly the height required for your marking rectangle.
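As a rough illustration of that last point, here is a minimal sketch of the rectangle arithmetic only. It assumes a horizontal base line, and it assumes the ascent and descent were stored as signed offsets from the base line (descent negative, as in font metrics). LineMarker and MarkRectangle are hypothetical names, not iTextSharp API.

```csharp
using System;

public static class LineMarker
{
    // Computes the marking rectangle (lower-left corner, width, height)
    // for a text line, given the base line's start/end x coordinates,
    // its y coordinate, and the maximal ascent/descent on that line.
    // Assumption: maxDescent is a signed offset, usually negative.
    public static (float X, float Y, float Width, float Height) MarkRectangle(
        float startX, float endX, float baselineY,
        float maxAscent, float maxDescent)
    {
        float left = Math.Min(startX, endX);
        float width = Math.Abs(endX - startX);
        float bottom = baselineY + maxDescent;  // below the base line
        float height = maxAscent - maxDescent;  // full glyph height
        return (left, bottom, width, height);
    }
}
```

The four values are in the same order that a rectangle is usually described in PDF content (x, y, width, height), so they could be handed to whichever of the three marking options above you choose.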
Too long to be a comment; added as answer.
My good fellow and peer Adi, It depends a lot on your PDF contents. It's kind of hard to do a generic solution to something like this. What does currentText contain? Can you give an example of it? Also, if you have a lot of these PDFs to check, you need to get currentText of a few of them, just to make sure that your current PDF to string conversion produces the same result every time. If it is same every time from different PDFs; then you can start to automate.
The automation also depends a lot on your content. For example, if currentText is something like this: Value: 10\nValue: 11\nValue: 9Value\n15 then what I recommend is going through every line, extracting the value and checking it against what you need it to be. This is untested semi-pseudo code that gives you an idea of what I mean:
var lines = new List<string>(currentText.Split('\n'));
var newlines = new List<string>();
foreach (var line in lines) {
if (line != "Value: 10") {
newlines.Add(line); // This line is correct, no marking needed
} else {
newlines.Add("THIS IS WRONG: " + line); // Mark as incorrect; use whatever you need here
}
}
// Next, return newlines to the user showing them which lines are bad so they can edit the PDF
If you need to automatically edit the existing PDF, this will be very, very, very hard. I think it's beyond the scope of my answer - I was answering how to identify the wrong lines and not how to mark them - sorry! Someone else please add that answer.
By the way; PDF is NOT a good format for doing something like this. If you have access to any other source of information, most likely the other one will be better.

Insert Text to Word Range

I use interop.Word to create a Word document programmatically.
In the document I have a particular range which I would like to insert text to.
When I google it I see that the way to do this is :
range.Text=" Whatever...";
but I have no "Text" property for the range object.
Any ideas?
For the original question - this is just an IntelliSense bug; there is such a property in the Range class.
For the problem from comments that
Range range=wordApp.ActiveDocument.TablesOfFigures[i].Range;
range.Text=" Whatever...";
replaces the ToF instead of prepending it with text. If you just want to set a header of the table, you can use Caption:
wordApp.ActiveDocument.TablesOfFigures[i].Caption = "Header text";
If, however, you need some text preceding the ToF - check out this thread, which discusses a similar case, but for a list instead of a Table of Figures.
Another way to set caption is to select range you need and call InsertCaption:
wordApp.ActiveDocument.TablesOfFigures[i].Range.Select();
wordApp.Selection.InsertCaption("Whatever");
Note that InsertCaption accepts several arguments of various types, so you may need to experiment with them.
If you want to insert text at a range position, you can use Range.InsertBefore.
Range range=wordApp.ActiveDocument.TablesOfFigures[i].Range;
range.InsertBefore("My Text here. ");

regex that can handle horribly misspelled words

Is there a way to create a regex that will ensure that five out of eight characters are present in order in a given character range (like 20 chars for example)?
I am dealing with horrible OCR/scanning, and I can stand the false positives.
Is there a way to do this?
Update: I want to match, for example, "mshpeln" as misspelling. I do not want to do OCR. The OCR job has been done, but it has been done poorly (i.e. it originally said misspelling, but the OCR'd copy reads "mshpeln"). I do not know what the text that I will have to match against will be (i.e. I do not know that it is "mshpeln"; it could be "mispel" or any number of other combinations).
I am not trying to use this as a spell checker, but merely find the end of a capture group. As an aside, I am currently having trouble getting the all.css file, so commenting is impossible temporarily.
I think what you need is not a regex, but a database with all valid words and creative usage of functions like soundex() and/or levenshtein().
You can do this: create a table with all valid words (a dictionary), populate it with columns like word and snd (computed as soundex(word)), and create indexes for both the word and snd columns.
For example, for the word mispeling you would fill snd as M214. If you use SQLite, it provides soundex() when built with the SQLITE_SOUNDEX compile-time option.
Now, when you get new bad word, compute soundex() for it and look it up in your indexed table. For example, for word mshpeln it would be soundex('mshpeln') = M214. There you go, this way you can get back correct word.
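If no database function is at hand, the code can also be computed in application code. The following is a minimal sketch of the classic American Soundex rules (first letter kept, three digits, h and w transparent); the class and method names are made up for illustration, and a database's built-in soundex() may differ in edge cases.

```csharp
using System;
using System.Linq;
using System.Text;

public static class Phonetic
{
    // Soundex digit for each letter a..z ('0' means "not coded").
    private const string Codes = "01230120022455012623010202";

    public static string Soundex(string word)
    {
        // Keep ASCII letters only, lower-cased.
        char[] letters = word
            .Where(c => (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))
            .Select(c => char.ToLowerInvariant(c))
            .ToArray();
        if (letters.Length == 0) return string.Empty;

        var sb = new StringBuilder();
        sb.Append(char.ToUpperInvariant(letters[0]));
        char last = Codes[letters[0] - 'a'];

        for (int i = 1; i < letters.Length && sb.Length < 4; i++)
        {
            char c = letters[i];
            char code = Codes[c - 'a'];

            // Append a digit only if it differs from the previous code.
            if (code != '0' && code != last) sb.Append(code);

            // h and w do not separate equal codes; vowels do.
            if (c != 'h' && c != 'w') last = code;
        }
        return sb.ToString().PadRight(4, '0');
    }
}
```

With this sketch, both "mispeling" and "mshpeln" map to M214, so a lookup on the snd column would find the dictionary entry for the garbled word.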
But this would not look anything like regex - sorry.
To be honest, I think that a project like this would be better for an actual human to do, not a computer. If the project is too large for 1 or 2 people to do easily, you might want to look into something like Amazon's Mechanical Turk, where you can outsource the work for pennies per solution.
This can't be done with a regex, but it can be done with a custom algorithm.
For example, to find words that are like 'misspelling' in your body of text:
1) Preprocess. Create a Set (in the mathematical sense, collection of guaranteed to be unique elements) with all of the unique letters that are in misspelling - {e, i, g, l, m, n, p, s}
2) Split the body of text into words.
3) For each word, create a Set with all of its unique letters. Then, perform the operation of set intersection on this set and the set of the word you are matching against - this will get you letters that are contained by both sets. If this set has 5 or more characters left in it, you have a possible match here.
If the OCR can add in erroneous spaces, then consider two words at a time instead of single words. And etc based on what your requirements are.
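The three steps above might look like this in C#. It is a sketch only: the names are hypothetical, words are split on whitespace, and, like the description, it ignores letter order, so expect false positives.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FuzzyLetters
{
    // Number of distinct letters that occur in both words.
    public static int SharedLetterCount(string target, string candidate)
    {
        var shared = new HashSet<char>(
            target.ToLowerInvariant().Where(c => char.IsLetter(c)));
        shared.IntersectWith(
            candidate.ToLowerInvariant().Where(c => char.IsLetter(c)));
        return shared.Count;
    }

    // Words in the text that share at least minShared distinct letters
    // with the target word.
    public static List<string> FindCandidates(
        string target, string text, int minShared = 5)
    {
        return text
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            .Where(w => SharedLetterCount(target, w) >= minShared)
            .ToList();
    }
}
```

For "misspelling" and "mshpeln" the shared set is {e, l, m, n, p, s}, six letters, so the word is reported as a possible match.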
I have no solution for this problem, in fact, here's exactly the opposite.
Correcting OCR errors is not programmatically possible for two reasons:
You cannot quantify the error that was made by the OCR algorithm, as it can range between 0 and 100%
To apply a correction, you need to know what the maximum error could be in order to set an acceptable level.
Let nello world be the first guess of "hello world", which is quite similar. Then, with another font that is written in "painful" yellow or something, a second guess is noiio verio for the same expression. How should a computer know that this word would have been similar if it was better recognized?
Otherwise, given a predetermined error, mvp's solution seems to be the best in my opinion.
UPDATE:
After digging a little, I found a reference that may be relevant: String similarity measures

Get text between 2 html tags c#

I am trying to get the data between the html (span) provided (in this case 31)
Here is the original code (from inspect elements in chrome)
<span id="point_total" class="tooltip" oldtitle="Note: If the number is black, your points are actually a little bit negative. Don't worry, this just means you need to start subbing again." aria-describedby="ui-tooltip-0">31</span>
I have a rich textbox which contains the source of the page, here is the same code but in line 51 of the rich textbox:
<DIV id=point_display>You have<BR><SPAN id=point_total class=tooltip jQuery16207621750175125325="23" oldtitle="Note: If the number is black, your points are actually a little bit negative. Don't worry, this just means you need to start subbing again.">17</SPAN><BR>Points </DIV><IMG style="FLOAT: right" title="Gain subscribers" border=0 alt="When people subscribe to you, you lose a point" src="http://static.subxcess.com/images/page/decoration/remove-1-point.png"> </DIV>
How would I go about doing this? I have tried several methods and none of them seem to work for me.
I am trying to retrieve the point value from this page: http://www.subxcess.com/sub4sub.php
The number changes depending on who subs you.
You could be incredibly specific about it:
var regex = new Regex(@"<span id=""point_total"" class=""tooltip"" oldtitle="".*?"" aria-describedby=""ui-tooltip-0"">(.*?)</span>");
var match = regex.Match(@"<span id=""point_total"" class=""tooltip"" oldtitle=""Note: If the number is black, your points are actually a little bit negative. Don't worry, this just means you need to start subbing again."" aria-describedby=""ui-tooltip-0"">31</span>");
var result = match.Groups[1].Value;
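The pattern above depends on the exact attribute list, which already differs between the Chrome source and the RichTextBox source shown in the question (upper-case tags, unquoted attributes, an extra jQuery attribute). A looser sketch that keys only on the id could look like this; SpanText and TryGetById are made-up names, and this is still a regex over HTML, so it remains fragile.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SpanText
{
    // Finds the first <span> whose id attribute equals the given id
    // (quoted or unquoted, any letter case) and returns its inner text.
    public static bool TryGetById(string html, string id, out string value)
    {
        string pattern =
            @"<span\b[^>]*\bid\s*=\s*[""']?" + Regex.Escape(id) +
            @"[""']?(?=[\s>])[^>]*>(.*?)</span>";

        Match m = Regex.Match(
            html, pattern,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        value = m.Success ? m.Groups[1].Value.Trim() : string.Empty;
        return m.Success;
    }
}
```

Called with the id point_total, it should return 31 for the Chrome markup and 17 for the markup taken from the RichTextBox.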
You'll want to use HtmlAgilityPack to do this, it's pretty simple:
HtmlDocument doc = new HtmlDocument();
doc.Load("filepath");
HtmlNode node = doc.DocumentNode.SelectSingleNode("//span"); // Here, you can also use an attribute filter such as "//span[@id='point_total']" to select a specific span
string value = node.InnerText; //this string will contain the value of span, i.e. <span>***value***</span>
Regex, while a viable option, is something you generally would want to avoid if at all possible for parsing html (see Here)
In terms of sustainability, you'll want to make sure that you understand the page source (i.e., refresh it a few times and see if your target span is nested within the same parents after every refresh, make sure the page is in the same general format, etc..., then navigate to the span using the above principle).
There are multiple possibilities.
Regex
Let HTML be parsed as XML and get the value via XPath
Iterate through all elements. If you get on a span tag, skip all characters until you find the closing '>'. Then the value you need is everything before the next opening '<'
Also look at System.Windows.Forms.HtmlDocument

Replacing part of text in richtextbox

I need to compare a value in a string to what user typed in a richtextbox.
For example: if a richtextbox holds string rtbText = "aaaka" and I compare this to another variable string comparable = "ka"(I want it to compare backwards). I want the last 2 letters from rtbText (comparable has only 2 letters) to be replaced with something that was predetermined(doesn't really matter what).
So rtbText should look like this:
rtbText = "aaa(something)"
This doesn't really have to be compared it can just count letters in comparable and based on that it can remove 2 letters from rtbText and replace them with something else.
UPDATE:
Here is what I have:
int coLen = comparable.Length;
comparable = null;
TextPointer caretBack = rtb.CaretPosition.GetPositionAtOffset(coLen, LogicalDirection.Backward);
TextRange rtbText = new TextRange(rtb.CaretPosition, caretBack);
string text = rtbText.Text;
rtbText returns an empty string or I get an error for everything longer than 3 characters. What am I doing wrong?
Let me elaborate it a little bit further. I have a listbox that holds replacements for values that user types in rtb. The values(replacements) are coming from there, meaning that I don't really need to go through the whole text to check values. I just need to check the values right before caret. I am comparing these values to what I have stored in another variable (comparable).
Please let me know if you don't understand something.
I did my best to explain what needs to be done.
Thank you
You could use Regex.Replace.
// this replaces all occurrences of "ka" with "Replacement"
Regex replace = new Regex("ka");
string result = replace.Replace("aaaka", "Replacement");
gumenimeda, I had similar problems a few weeks ago. I found myself doing the following (I assume you will have more than one occurrence in the RichTextBox that you will need to change). Note that I did it for Windows Forms, where I have direct access to the Rtf text of the control; I'm not quite sure it will work well in your scenario:
I find all the occurrences of the string (using IndexOf, for example) and store them in a List.
Sort the list in descending order (the highest index first, the one before it second, etc.)
Start replacing the occurrences directly in the RichTextBox, by removing the characters I don't need and inserting the characters I need.
The sorting in step 2 is necessary because we always want to start from the last occurrence and work back to the first. Starting from the first occurrence (or any other) and going forward leads to an unpleasant surprise: if the chunk you remove and the chunk you insert differ in length, the string changes and all later positions become invalid (for example, if the second occurrence was at position 12 and your new string is 2 characters longer than the original, it moves to position 14). This is not an issue if we go from the last to the first occurrence, because the change does not affect the positions of the occurrences that are still to be processed.
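On a plain string, the same idea can be sketched like this. The helper name is hypothetical, and a StringBuilder stands in for the control's text; in a Windows Forms RichTextBox the remove-and-insert step would instead select the range and assign SelectedText.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class BackwardReplace
{
    public static string ReplaceAll(string text, string find, string replacement)
    {
        if (string.IsNullOrEmpty(find)) return text;

        // Step 1: collect the start index of every occurrence.
        var positions = new List<int>();
        for (int i = text.IndexOf(find, StringComparison.Ordinal);
             i >= 0;
             i = text.IndexOf(find, i + find.Length, StringComparison.Ordinal))
        {
            positions.Add(i);
        }

        // Step 2: process the highest index first (the list is ascending).
        positions.Reverse();

        // Step 3: replace from the end, so earlier indexes stay valid
        // even when the replacement has a different length.
        var sb = new StringBuilder(text);
        foreach (int pos in positions)
        {
            sb.Remove(pos, find.Length);
            sb.Insert(pos, replacement);
        }
        return sb.ToString();
    }
}
```

For example, replacing "ka" with "(something)" in "aaaka ka" touches index 6 first and index 3 second, so neither stored position is shifted by the other edit.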
Of course, I cannot be sure that this is the fastest way to achieve the desired result. It's just what I came up with and what worked for me.
Good luck!
