C# Unknown Text Found

I'm creating a program to transfer text from a Word document to a database. During testing I came across some unexpected text inside a textbox after setting its text to a table cell range as follows:
textBox1.Text = oDoc.Tables[1].Cell(1, 3).Range.Text;
What appeared in the form was:
What I didn't expect was the dot at the end of the text, and I have no idea what it is supposed to represent. The dot can be highlighted, but if you try to copy and paste it, nothing appears. You can delete the dot manually. Can anyone help me identify what this is?

The identification bit shouldn't be too hard:
string text = oDoc.Tables[1].Cell(1, 3).Range.Text;
textBox1.Text = ((int) text[4]).ToString("x4");
That will give you the UTF-16 code unit, in hex, for the character at index 4 (the fifth character, i.e. the one right after a four-letter word such as "HOLD")... you can then find out what it is on the Unicode web site. (I usually look at the Charts page or the directory of PDFs and guess which chart it will be in based on the numbering. It's not ideal, and there are probably better ways, but it's always worked well enough for me...)
Of course when you've identified it you'll still need to work out what the heck it's doing there... does the original Word document just have "HOLD"?
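To see every character at once rather than only index 4, a small helper along these lines can dump the whole string. This is only a sketch: CodeUnitDump and ToHex are names made up here, and the sample string assumes the cell text is "HOLD" followed by a carriage return (U+000D) and a bell character (U+0007), the pair Word normally uses as its end-of-cell marker. Whether that is what this particular document contains is not confirmed.

```csharp
using System;
using System.Linq;

public static class CodeUnitDump
{
    // Formats each UTF-16 code unit of the input as four lowercase hex digits.
    public static string ToHex(string text)
    {
        return string.Join(" ", text.Select(c => ((int)c).ToString("x4")));
    }

    public static void Main()
    {
        // Assumed sample: "HOLD" plus a trailing CR (U+000D) and BEL (U+0007).
        string sample = "HOLD\r\a";
        Console.WriteLine(ToHex(sample));

        // Trimming the two assumed marker characters leaves just the visible text.
        Console.WriteLine(sample.TrimEnd('\r', '\a'));
    }
}
```

If the dump does end in 000d 0007, trimming those two characters before assigning to textBox1.Text should get rid of the dot.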

Related

Word merge field in header loses value in print preview

I have an ASP WebForms app where I use a Word template that contains merge fields, which I replace with data extracted from the database. The app works great and the Word document is exported, but when trying to print the document, one of the merge fields, which is in the header, loses its value and reverts to the initial merge field name. Is this something that has to be fixed in the application's code, or is it a Word settings issue?
Any help is greatly appreciated.
Thank you!

I have managed to solve this problem using the OpenXML Productivity Tool. It turns out that you can't add a merge field in the header, so what I had done was put it inside a textbox. I forgot to mention that part in the initial description. As a result, the text element was buried deep in the Open XML. When I managed to find it and log what was inside, I found out that I had inserted the MERGEFIELD <> MERGEFORMAT. Every time I tried to insert the value that I wanted in this <>, it got reset when I hit print preview. So what I did, based on a suggestion from someone who had a similar problem, was to delete this textbox and create a clean one where I only entered "Test". It needs to have a string inside so that Open XML creates the element Text (instrTxt).
In C# I did this:
foreach (var hPart in firstDoc.MainDocumentPart.HeaderParts)
{
    foreach (var txt in hPart.Header.Descendants<Text>())
    {
        if (txt.Text == "Test")
        {
            txt.Text = "My custom text";
        }
    }
}
So for each header part (because I can't tell for sure which one it is in; it could even be in several), get all descendants of type Text.
I got a few more than I wanted. I also got a few that contained the page number (since I have it in the header as well). So I added an if to check if it's the text element that I wanted. Once I found it, I added the text I wanted.
So long story short, instead of using a merge field in the header, I just used plain text. Perhaps it's not the most efficient way of doing this. Maybe the question still remains (could I have inserted a merge field in the header and actually made it work without having Word reset the value upon print preview? idk), but this worked for me.

Aspose text replacement in PDF doesn't rearrange content

I'm using Aspose to replace a set of words in an existing PDF file via a WPF application.
I took a look at this https://docs.aspose.com/display/pdfnet/Replace+Text+in+a+PDF+Document
But the section 'Text Replacement should automatically re-arrange Page Contents' didn't help solve my problem which is exactly what is described in the section.
"However recently some customers encountered issues during text
replace when particular TextFragment is replaced with smaller contents
and some extra spaces are displayed in resultant PDF or in case the
TextFragment is replaced with some longer string, then words overlap
existing page contents."
The problem is if I replace a word by a shorter or longer word, there's either a blank gap or overlap.
I tried some options such as
textFragmentAbsorber.TextReplaceOptions.ReplaceAdjustmentAction =
TextReplaceOptions.ReplaceAdjustment.WholeWordsHyphenation;
or AdjustSpaceWidth
but it has no effect at all.
My code is the same as the one I linked above, except I only replace the text and leave the rest untouched.
Also, my TextFragmentAbsorber looks like this:
var textFragmentAbsorber = new TextFragmentAbsorber("(?i)("+ text.OriginalText +")", new TextSearchOptions(true));
Here text.OriginalText is the text I want to replace, matched with a case-insensitive regex.

Copy numeric codes to clipboard and paste to Excel without having them formatted as numbers

I have a .NET Windows Forms application and I need to copy a list of 8-digit numeric codes to the clipboard to be pasted into an Excel sheet.
string tabbedText = string.Join("\n", codesArray);
Clipboard.SetText(tabbedText);
The problem is that when a code begins with one or more zeros (e.g. "00001234") it's pasted as a number with the zeros trimmed.
Is there a way to set the clipboard text so that Excel accepts it as text?

I would handle this problem inside Excel (and not programmatically in your application). Format your cells as text, and then paste from the clipboard. This way the leading zeros are always preserved.

EDIT: This doesn't work in Excel, in that the apostrophe gets pasted in and shows up too. I'm leaving the answer here as an explicit statement that this approach won't help for Excel.
It does work for OpenOffice Calc though.
The standard way to 'tell' Excel to treat a string as a string is to prefix it with an apostrophe. Have you tried something like:
string tabbedText = "'" + string.Join("\n'", codesArray);
(note the extra apostrophe in there... it's a bit hard to see).
Of course, this may cause you issues if you're planning to use this value thereafter in Excel calculations but there are ways to handle that too.

Correct Hebrew character sequence in C# and searchable PDFs

I'm testing an SDK that extracts text from a searchable PDF. One of the SDK's dependencies was recently updated, and it's causing an existing test on Hebrew text to fail. I don't know Hebrew nor enough about how the involved technologies represent right-to-left languages.
The NUnit test asserts that the extracted text matches the C# string "מנבוצץז ".
string hebrewText = reader.ReadToEnd();
Assert.AreEqual("מנבוצץז ", hebrewText);
The rasterized PDF has what I believe are the same characters, but in the opposite order.
The unit test fails with this message:
Expected: "מנבוצץז "
But was: " זץצובנמ"
Although the actual result more closely matches what I see in the rasterized PDF, I'm not completely sure the original test is wrong.
1) Are Hebrew characters in a C# string supposed to be read right-to-left like printed Hebrew text?
2) Does any part of the .NET stack tamper with the direction of Hebrew strings?
3) What about NUnit?
4) Are Hebrew characters embedded in a searchable PDF normally supposed to go in the same direction as the rasterized text?
5) Anything else I should know before deciding whether to "fix" this unit test?

There are various ways to encode RTL languages. The most common way (and the Windows default) is to use logical ordering, which means the first letter is encoded as the first character in a string (or file). So whether the first letter visually appears on the left or right side of the screen doesn't affect the order in which the letters are stored.
Now as for the text appearing in Visual Studio, it depends on the version. As far as I remember, prior to Visual Studio 2010 the code editor displayed Hebrew backwards, which became apparent when you tried to select Hebrew text: it reversed in an odd way (which was visually confusing). It appears this issue no longer exists in Visual Studio 2010 (at least with SP1, which I just tested).
Let's take a Hebrew word for which the direction is more clear to non-Hebrew speakers than the string specified in your text:
יון
The word happens to be the Hebrew word for an ion, and on your screen, it should appear as three letters where the tallest letter is on the left and the shortest is on the right. In a .NET string, the expression "יון".Substring(0, 1) will produce the short letter, since it's the first letter in the string. The string can also be written as "\u05D9\u05D5\u05DF" where the leftmost Unicode character \u05D9 represents the short letter displayed on the right, which clearly demonstrates the order in which the letters are stored.
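A quick way to check this for yourself is something like the following. It is a sketch only: LogicalOrderDemo and ReverseCodeUnits are names invented for this example, and the simple reversal assumes plain letters with no combining marks or surrogate pairs.

```csharp
using System;

public static class LogicalOrderDemo
{
    // Reverses the UTF-16 code units of a string.
    public static string ReverseCodeUnits(string s)
    {
        char[] chars = s.ToCharArray();
        Array.Reverse(chars);
        return new string(chars);
    }

    public static void Main()
    {
        string ion = "\u05D9\u05D5\u05DF"; // stored in logical (reading) order

        // Index 0 is the first letter read, even though it is drawn rightmost.
        Console.WriteLine(((int)ion[0]).ToString("X4")); // 05D9

        // A visually ordered (reversed) copy is a different string.
        string reversed = ReverseCodeUnits(ion);
        Console.WriteLine(reversed == ion);                   // False
        Console.WriteLine(ReverseCodeUnits(reversed) == ion); // True
    }
}
```

If the updated dependency started handing back text in visual order instead of logical order, a comparison like this would account for a fully reversed string in the test output.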
Since the string in your test case is nonsensical, I can't tell you whether it was a wrong test all along or if it is a correct test that should pass. If the image you uploaded has been rendered correctly, then it appears the actual result of your test is correct and the expected value is incorrect, and so you should fix the test.

1) I believe that all strings in C# will be stored internally as LTR; RTL strings will have a non-printable character (or something) denoting that they are indeed RTL.
2) More than likely. RTL GUIs and rendered text for example need certain properties (specifically RightToLeft and RightToLeftLayout) to be set in order to display correctly.
3) NUnit shouldn't. Nor should it care. IMHO a reversed string != the original string.
4) I couldn't comment. I'd assume that they should be whatever the test is expecting though, assuming it was passing at first.
5) Don't do half measures with RTL, it really doesn't like it. Either have full RTL support, or nothing. It can be pretty nasty, I wish you the best of luck!

Best way to extracting only the bold text from a PDF

iTextSharp is a great tool, I can use
PdfTextExtractor.GetTextFromPage(reader, iPage) + " ";
and it works great, but is there a way to extract only the bold text (e.g. the headlines) from the PDF, and not everything?
Any solution is useful, regardless of the programming language. Thank you.

From within iText, you need to use the classes from the com.itextpdf.text.pdf.parser package.
Specifically, you'll need to use a PdfTextExtractor with a custom TextExtractionStrategy that checks the font name. Bold fonts USUALLY have the word "bold" in their name.
Potential Issues:
1) Not everything that looks like text is rendered with fonts and letters. It can be paths or a bitmap. The only way to extract such text is with OCR, and there's no way to get font info.
2) Font Encoding. The bytes that map to the glyphs you're seeing in the PDF may not have a map from those bytes to actual character information.
3) Not all bold-looking text is made with a bold font. Some bold text is made by stroking the text outline with a fairly thin line as well as the usual filling. In this case, the text render mode will be set to "stroke & fill" instead of the usual "fill". This is pretty rare, but it does happen from time to time.
An easy way to test for problems 1 and 2 is to attempt to copy and paste the text within Reader/Acrobat. If you can't select it, it's almost certainly paths or an image. If you can select it but the characters come out as random junk when pasted, then iText will come up with the same junk.
Problem 3 isn't that hard to test for programmatically, though you have to handle it on a case by case basis. You need to call TextRenderInfo.getTextRenderMode(). 0 is fill (the standard way of doing things), and 2 is "stroke and fill".
So your TextExtractionStrategy can stub out beginTextBlock, endTextBlock, renderImage, and getResultantText. In your renderText implementation, you'll have to check the font name (for "bold", case insensitive) and the text render mode. If either check matches (the name contains "bold", or the render mode is "stroke and fill"), it's part of one of your headings.
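The decision inside renderText boils down to something like the following. It's a sketch of just the check, kept free of iText types so it runs on its own: BoldHeuristic and LooksBold are hypothetical names, and you would pass in the font name and render mode that you read from TextRenderInfo in your own strategy.

```csharp
using System;

public static class BoldHeuristic
{
    // PDF text render mode 2: fill the glyphs, then stroke their outline.
    public const int FillThenStroke = 2;

    // Returns true when the font name contains "bold" (case insensitive)
    // or the text is drawn with the "stroke and fill" render mode.
    public static bool LooksBold(string fontName, int textRenderMode)
    {
        bool nameSaysBold = fontName != null &&
            fontName.IndexOf("bold", StringComparison.OrdinalIgnoreCase) >= 0;
        return nameSaysBold || textRenderMode == FillThenStroke;
    }

    public static void Main()
    {
        // Hypothetical font names, chosen only to exercise both branches.
        Console.WriteLine(LooksBold("ABCDEF+Arial-BoldMT", 0)); // True
        Console.WriteLine(LooksBold("Helvetica", 2));           // True
        Console.WriteLine(LooksBold("Helvetica", 0));           // False
    }
}
```

In the real strategy you would append the chunk's text to your result only when this check passes, and ignore everything else.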
All this is supposing that you are dealing with arbitrary PDF files. If all your PDFs come from the same source, you can start cutting corners. I'll leave that as an Exercise For The Reader.

One of your best bets for this job surely is TET by pdflib.com with its ability to extract to the TETML format. Available for Windows, Mac OS X, Linux, Solaris, AIX, HP-UX...
I'm not sure if it does indeed recognize "headlines" as such (because PDF does not know much of structural markups, only visual ones) -- but it surely can tell you exact position and font used by each string of characters.
