PDF extract to DOCX with Unicode characters in C#

I'm using the iText7 library to convert a PDF file to DOCX. The file includes Vietnamese text (Unicode/UTF-8). Some parts of it were converted correctly but some were not. Example: "DÇu th¶o méc (L¹c, võng, c¸m...)" stands for "Dầu thảo mộc (lạc, vừng, cám...)". How can I handle this problem? Do I need to include some fonts?

Related

What encoding should be used to create an MS-DOS txt file using C# (UTF8Encoding vs Encoding)?

I am trying to create a flat file for a legacy system, and they mandate that the data be presented in the text encoding of an MS-DOS .txt file (Text Document - MS-DOS Format, CP_OEM). I am a bit confused about files generated by using the UTF8Encoding class in C# (.NET 4.0 framework); I think it produces a file in the default txt format (Encoding: CP_ACP).
I think the encoding names CP_ACP, Windows and ANSI refer to the same thing, that the Windows default is ANSI, and that it will omit any Unicode character information.
If I use the UTF8Encoding class in the C# library to create a text file (as below), is it going to be in the MS-DOS txt file format?
byte[] title = new UTF8Encoding(true).GetBytes("New Text File");
As per the answer supplied, it is evident that UTF-8 is NOT equivalent to the MS-DOS txt format, and that I should use the Encoding.GetEncoding(850) method to get the right encoding.
I read the following posts to check my information, but found nothing conclusive yet.
https://blogs.msdn.microsoft.com/oldnewthing/20120220-00?p=8273
https://blog.mh-nexus.de/2015/01/character-encoding-confusion
https://blogs.msdn.microsoft.com/oldnewthing/20090115-00?p=19483
Finally, the conclusion is to go with Encoding.GetEncoding(850) when creating a byte array to be converted back to the actual file (note: I am using a byte array as I can leverage existing middleware).
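A minimal sketch of that byte-array approach, assuming code page 850 is the right one for the target system (the path is hypothetical; on .NET Core/.NET 5+ the code-pages provider from the System.Text.Encoding.CodePages package must be registered first):
// Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); // needed on .NET Core/5+ only
byte[] data = Encoding.GetEncoding(850).GetBytes("New Text File");
File.WriteAllBytes(@"C:\temp\legacy.txt", data);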
You can use the File.ReadXY(String, Encoding) and File.WriteXY(String, String[], Encoding) methods, where XY is AllLines, Lines or AllText, working with string[], IEnumerable<string> and string respectively.
MS-DOS uses different code pages. Probably code page 850 "Western European / Latin-1" or code page 437 "OEM-US / OEM / PC-8 / DOS Latin US" (as @HansPassant suggests) will be okay. If you are not sure which code page you need, create example files containing letters like ä, ö, ü, é, è, ê, ç, à or Greek letters with the legacy system and see whether they work. If you don't use such letters or other special characters, then the code page is not very critical.
File.WriteAllText(path, "Hello World", Encoding.GetEncoding(850));
The character codes from 0 to 127 (7-bit) are the same for all MS-DOS code pages, for ANSI and UTF-8. UTF files are sometimes introduced with a BOM (byte order mark).
MS-DOS knows only 8-bit characters. The codes 128 to 255 differ for the different national code pages.
See: File Class, Encoding Class and Wikipedia: Code Page.
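If you need to determine which code page the legacy system actually expects, a quick test following the suggestion above (the file names are hypothetical):
File.WriteAllText(@"C:\temp\test850.txt", "ä ö ü é è ê ç à", Encoding.GetEncoding(850));
File.WriteAllText(@"C:\temp\test437.txt", "ä ö ü é è ê ç à", Encoding.GetEncoding(437));
// Open both files on the legacy system; the one whose accents display correctly is the code page to use.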

iTextSharp for Arabic support

I have to read the content of a .pdf file, and I am using iTextSharp for .NET.
I have three problems:
1- The Arabic terms are extracted in reverse order (e.g. احمد is extracted as دمحا, i.e. reversed; in English: Ahmad is extracted as damha). If my file contains both Arabic and English, how do I extract each language with its correct direction?
2- Sometimes the glyphs are not defined as characters, so they appear as symbols. How can I add my own definitions for glyphs?
3- Can I extract the text with its formatting, to convert it to HTML and display the file in a web page as-is?

Extracting text from PDF using iTextSharp changes digits

I have a PDF file from which I have a problem extracting text, using the iTextSharp API.
Some of the numbers are replaced by other numbers or by slashes: "//".
The PDF file originally came from MS Word and was exported to PDF using "Save as PDF", and I have to work with the PDF file and not the DOC.
You can see the problem very clearly when you try to copy and paste some numbers from the file.
For example, if you try to copy and paste the 6-digit number at the bottom, you can see that it changes from 201333 to 333222.
You can also see the problem with the date string: 11/4/2016 turns into // // 11110.
When I print the PDF file using the Adobe PDF converter printer on my computer, it gets fixed, but I need to fix it automatically, using C# for example.
Thanks
The file is shared here :
https://www.dropbox.com/s/j6w9350oyit0od8/OnePageGili.pdf?dl=0
In a nutshell
iTextSharp text extraction results exactly reflect what the PDF claims the characters in question mean. Thus, text extraction as recommended by the PDF specification (which relies on this information) will always return this result.
The embedded fonts contain different information. Thus, text extraction methods that disbelieve this information may return more satisfying results.
In more detail
First of all, you say
I have a PDF file from which I have a problem extracting text, using the iTextSharp API.
and so make it sound like an iTextSharp-specific issue. Later, though, you state
You can see the problem very clearly when you try to copy and paste some numbers from the file.
If you can also see the issue with copy&paste, it is not an iTextSharp-specific issue: it is either an issue of multiple PDF processors, including the viewer you copied and pasted with, or simply an issue of the PDF you have.
As it turns out, it is the latter: you have a PDF that lies about its contents.
For example, let's look at the text you pointed out:
For example, if you try to copy and paste the 6-digit number at the bottom, you can see that it changes from 201333 to 333222.
Inspecting the PDF page content stream, you'll find those six digits generated by these instructions:
/F3 11.04 Tf
...
[<00150013>-4<0014>8<00160016>-4<0016>] TJ
I.e. the font F3 is selected (which uses Identity-H encoding, so each glyph is represented by two bytes) and the glyphs drawn are, from left to right:
0015
0013
0014
0016
0016
0016
The ToUnicode mapping of the font F3 in your PDF now claims:
1 beginbfrange
<0013> <0016> [<0033> <0033> <0033> <0032>]
endbfrange
I.e. it says
glyph 0013 represents Unicode codepoint 0033, the digit 3
glyph 0014 represents Unicode codepoint 0033, the digit 3
glyph 0015 represents Unicode codepoint 0033, the digit 3
glyph 0016 represents Unicode codepoint 0032, the digit 2
So the string of glyphs drawn using the instructions above represent 333222 according to the ToUnicode map.
The PDF specification presents the ToUnicode mapping as the highest priority method to map a character code to a Unicode value. Thus, a text extractor working according to the specification will return 333222 here.
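You can verify this with a spec-conformant extractor. A minimal sketch, assuming iTextSharp 5.x and the linked file saved locally (the path is hypothetical); because it honors the ToUnicode map, it will return 333222 for that number:
using System;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Spec-conformant extraction: follows the (lying) ToUnicode CMap
PdfReader reader = new PdfReader(@"C:\temp\OnePageGili.pdf");
string text = PdfTextExtractor.GetTextFromPage(reader, 1);
Console.WriteLine(text);
reader.Close();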

Is it possible to read French characters into a C# string from an .eml file?

I have a project where I need to generate a .pdf file based on the content of an .eml file. When dealing with just English characters, I'm fine: the PDF is created flawlessly and everything works (after I strip all the needless HTML junk).
However, an issue arises when I try to read in an .eml file that is filled with French characters. In particular, the French characters are stored as number codes like =E9, =E8, &#339, and so on.
So my issue is this. I read the .eml file in with:
string content = File.ReadAllText(filePath, Encoding.UTF8);
However, it comes in as plain text and I don't know how to make the system interpret the =E9, =E8, etc. codes as French characters. I can always use Regex.Replace on everything, but I'm hoping for a more elegant solution. Is there any way to take in that long string of plain text and properly interpret the codes embedded within it, so that the French characters appear instead of their respective codes, without using some 30 Regex.Replace expressions?
Do note that I can't use any built-in iTextSharp functionality, since I also need to be able to incorporate the French characters (pulled from that .eml file) into the file name of the PDF.
Thanks
You can use regexes, but two regexes should be enough:
text = Regex.Replace(text, @"=([0-9A-Fa-f]{2})", match => ((char)uint.Parse(match.Groups[1].Value, NumberStyles.HexNumber)).ToString());
text = Regex.Replace(text, @"&#(\d+);", match => ((char)uint.Parse(match.Groups[1].Value)).ToString());
A different way would be to find a MIME parsing library which exposes methods for parsing parts of MIME messages; that way you'd decode the =E9 codes (quoted-printable encoding). Then you'd need to call WebUtility.HtmlDecode to decode the HTML entities.
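Putting both steps together as a self-contained sketch (it assumes the =XX escapes stand for single-byte Latin-1 style codes, as in the question; multi-byte UTF-8 quoted-printable would require collecting the bytes before decoding):
using System.Globalization;
using System.Net;
using System.Text.RegularExpressions;

static string DecodeEmlText(string text)
{
    // Drop quoted-printable soft line breaks ("=" at the end of a line)
    text = text.Replace("=\r\n", "").Replace("=\n", "");
    // Decode =XX escapes, e.g. =E9 -> é (valid for single-byte code pages)
    text = Regex.Replace(text, @"=([0-9A-Fa-f]{2})",
        m => ((char)uint.Parse(m.Groups[1].Value, NumberStyles.HexNumber)).ToString());
    // Decode HTML entities such as &#339; -> œ
    return WebUtility.HtmlDecode(text);
}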

How to read a special character from rtf file using C#?

I am using C# to convert RTF to XML.
When reading the rich text, there are special characters that are represented by special RTF tags. E.g. for the double quote symbol ” the RTF tag is \'d3.
Whenever I read this symbol I need to write "&rdquo;" in the XML.
Is there any documentation that lists the RTF tags for all the special characters, or is it possible to do some encoding conversion?
Check out the RTF Specification and read up on the Document Area.
The article has all the code values for the special characters (RTF tags) listed, e.g.:
\rdblquote 0xD3
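One straightforward approach is a small lookup table built from that specification; a partial, hypothetical sketch (only the control words your documents actually use would need entries):
using System.Collections.Generic;

// Partial map from RTF escapes/control words to XML entities,
// filled in from the values listed in the RTF specification
static readonly Dictionary<string, string> RtfToXml = new Dictionary<string, string>
{
    { @"\rdblquote", "&rdquo;" },
    { @"\ldblquote", "&ldquo;" },
    { @"\'d3", "&rdquo;" },  // the hex-escape form seen in the question
};

static string ReplaceRtfSpecials(string rtf)
{
    foreach (KeyValuePair<string, string> pair in RtfToXml)
        rtf = rtf.Replace(pair.Key, pair.Value);
    return rtf;
}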
