How can I read a Lync conversation file containing HTML?

How can I read a Lync conversation file containing HTML? - c#

I'm having trouble reading a local file, into a string, in c#.
Here's what I came up with till now:
string file = #"C:\script_test\{5461EC8C-89E6-40D1-8525-774340083829}.html";
using (StreamReader reader = new StreamReader(file))
{
string line = "";
while ((line = reader.ReadLine()) != null)
{
textBox1.Text += line.ToString();
}
}
And it's the only solution that seems to work.
I've tried some other suggested methods for reading a file, such as:
string file = #"C:\script_test\{5461EC8C-89E6-40D1-8525-774340083829}.html";
string html = File.ReadAllText(file).ToString();
textBox1.Text += html;
Yet it does not work as expected.
Here are the first few lines of the file i'm trying to read:
as you can see, it has some funky characters, honestly I don't know if that's the cause of this weird behavior.
But in the first case, the code seems to skip those lines, printing only "Document generated by Office Communicator..."

Your task would be easier if you could use an API or the SDK or even would have a description of the format you try to read. However the binary format looks not to be that complicated and with an hexviewer installed I got this far to get the html out of the example you provided.
To parse non-text files you fall-back to the BinaryReader and then use one of the Read methods to read the correct type from the bytestream. I used ReadByte and ReadInt32. Notice how in the description of the method is explained how many bytes are read. That becomes handy when you try to decipher your file.
private string ParseHist(string file)
{
using (var f = File.Open(file, FileMode.Open))
{
using (var br = new BinaryReader(f))
{
// read 4 bytes as an int
var first = br.ReadInt32();
// read integer / zero ended byte arrays as string
var lead = br.ReadInt32();
// until we have 4 zero bytes
while (lead != 0)
{
var user = ParseString(br);
Trace.Write(lead);
Trace.Write(":");
Trace.Write(user.Length);
Trace.Write(":");
Trace.WriteLine(user);
lead = br.ReadInt32();
// weird special case
if (lead == 2)
{
lead = br.ReadInt32();
}
}
// at the start of the html block
var htmllen = br.ReadInt32();
Trace.WriteLine(htmllen);
// parse the html
var html = ParseString(br);
Trace.Write(len);
Trace.Write(":");
Trace.Write(html.Length);
Trace.Write(":");
Trace.WriteLine(html);
// other structures follow, left unparsed
return html.ToString();
}
}
}
// a string seems to be ascii encoded and ends with a zero byte.
private static string ParseString(BinaryReader br)
{
var ch = br.ReadByte();
var sb = new StringBuilder();
while (ch != 0)
{
sb.Append((char)ch);
ch = br.ReadByte();
}
return sb.ToString();
}
You could use the simple parsing logic in a winform application as follows:
private void button1_Click(object sender, EventArgs e)
{
webBrowser1.DocumentText = ParseHist(#"5461EC8C-89E6-40D1-8525-774340083829-Copia.html");
}
Keep in mind that this is not bullet proof or the recommended way but it should get you started. For files that don't parse well you'll need to go back to the hexviewer and work-out what other byte structures are new or different from what you already had. That is not something I intend to help you with, that is left as an exercise for you to figure out.

I don't know if it's the right way to answer this, but here's what I've managed to do so far:
string file = #"C:\script_test\{1C0365BC-54C6-4D31-A1C1-586C4575F9EA}.hist";
string outText = "";
//Encoding iso = Encoding.GetEncoding("ISO-8859-1");
Encoding utf8 = Encoding.UTF8;
StreamReader reader = new StreamReader(file, utf8);
char[] text = reader.ReadToEnd().ToCharArray();
//skip first n chars
/*
for (int i = 250; i < text.Length; i++)
{
outText += text[i];
}
*/
for (int i = 0; i < text.Length; i++)
{
//skips non printable characters
if (!Char.IsControl(text[i]))
{
outText += text[i];
}
}
string source = "";
source = WebUtility.HtmlDecode(outText);
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.LoadHtml(source);
string html = "<html><style>";
foreach (HtmlNode node in htmlDoc.DocumentNode.SelectNodes("//style"))
{
html += node.InnerHtml+ Environment.NewLine;
}
html += "</style><body>";
foreach (HtmlNode node in htmlDoc.DocumentNode.SelectNodes("//body"))
{
html += node.InnerHtml + Environment.NewLine;
}
html += "</body></html>";
richTextBox1.Text += html+Environment.NewLine;
webBrowser1.DocumentText = html;
The conversation displays correctly, both style and encoding.
So it's a start for me.
Thank you all for the support!
EDIT
Char.IsControl(char)
skips non printable characters :)

Related

How to convert the text direction to normal

I am reading an arabic localized pdf document in memory using c# and after reading the text i am getting it like this
٩٠/٤٠/٧٣٤١ ٩١/١٠/٦١٠٢
but the correct direction of this text in pdf is ٢٠١٦/٠١/١٩ ١٤٣٧/٠٤/٠٩
Can somebody please guide how can change this text direction to proper direction as it is appearing in pdf.
Edit
This is the function i am using. I am using Devexpress Document server, I am skipping upto line 36 as I do not need the data before line 36.
private void button1_Click(object sender, EventArgs e)
{
using (var documentStream = new FileStream(#"D:\Data\Projects\DotNet\ElectricBillReader\electricbill.pdf", FileMode.Open, FileAccess.Read))
{
using (PdfDocumentProcessor documentProcessor = new PdfDocumentProcessor())
{
documentProcessor.LoadDocument(documentStream);
using (var sr = new StringReader(documentProcessor.Text))
{
var counter = 0;
string line = string.Empty;
do
{
line = sr.ReadLine();
if (counter > 36)
{
if (line != null)
{
}
}
counter++;
} while (line!=null);
}
}
}
}

You're gonna need a library that implements the Unicode bidirectional algorithm, i'm not aware of any libraries that does this for .NET but there's an effort to port ICU to .NET here
Also, check this out: https://sourceforge.net/projects/nbidi/

Did you think about just reversing the string?
public static string Reverse( string s )
{
char[] charArray = s.ToCharArray();
Array.Reverse( charArray );
return new string( charArray );
}
Source

How can I specify region of reading from file?

I have the following txt file
//test.txt
information needed[12334,56565]important numbers
I want to read from [ until ]
string print= File.ReadAllText(#"C:/Users/kokos/Desktop/test.txt");
Console.WriteLine(print);
The above is reading the whole file, but i want to print only
[12334,56565]

string pattern = #"\[(.*?)\]";
string print = File.ReadAllText(#"C:/Users/kokos/Desktop/test.txt");
var result = Regex.Matches(print, pattern);
foreach (Match r in result)
{
Console.WriteLine(r.Groups[1]);
}
As mentioned by Matthew, here is a solution using regex. At the top of your .cs. Add the line: using System.Text.RegularExpressions;
Note
This answer assumes the OP desires to load in the entire file to memory.

You can do this with LINQ.
var text = File.ReadAllText(#"C:/Users/kokos/Desktop/test.txt");
var print = new string(text.SkipWhile(c => c != '[')
.TakeWhile(c => c != ']')
.ToArray())+"]";
// print = "[12334,56565]"
... if you don't want the leading [ then do this...
var print = new string(text.SkipWhile(c => c != '[').Skip(1)
.TakeWhile(c => c != ']')
.ToArray());
// print = "12334,56565"
Here are a few more options if you just want to mess around with the string. (these are more error prone.)
var print = text.Substring(text.IndexOf('['), text.IndexOf(']') - text.IndexOf('[') + 1);
... or ...
var print = "[" + text.Split('[')[1].Split(']')[0] + "]";
... regex would probably look nicer.

var data = Encoding.UTF8.GetBytes("here is a simulated file [here's the data I'm after]");
var read = new StringBuilder();
var inScope = false;
using(var ms = new MemoryStream(data)) {
using(var sr = new StreamReader(ms)) {
while(!sr.EndOfStream) {
var by = sr.Read();
if (((char)by) == '[') {
inScope = true;
continue;
}
else if (((char)by) == ']') {
inScope = false;
break;
}
if (inScope) {
read.Append((char)by);
}
}
}
}
read.ToString().Dump();
The above code is a LINQPad snippet, that shows how you can read a stream byte by byte and pull out the data you're after without loading the whole thing into memory.
Instead of using a memory stream, just use a file stream for the file you want to read.
This is less than optimal with all the casting (just do it once), but it should be enough to demonstrate the basic idea.
The output of this is: "here's the data I'm after"
WARNING be sure to use the encoding object for whichever encoding your file is using!

Finding ® in a string of text

Let me rephrase my question:
I am reading in text where one of the characters is the registered symbol, ®, from a text file that has no problem displaying the symbol. When I try to print the string after reading it from the file, the symbol is an unprintable character. When I read in the string and split the string to characters and convert the character to an Int16 and print out the hex, I get 0xFFFD. I specify Encoding.UTF8 when I open the StreamReader.
Here is what I have
using (System.IO.StreamReader sr = new System.IO.StreamReader(HttpContext.Current.Server.MapPath("~/App_Code/Hormel") + "/nutrition_data.txt", System.Text.Encoding.UTF8))
{
string line;
while((line = sr.ReadLine()) != null)
{
//after spliting the file on '~'
items[i] = scrubData(utf8.GetString(utf8.GetBytes(items[i].ToCharArray())));
//items[i] = scrubData(items[i]); //original
}
}
Here is the scrubData function
private String scrubData(string data)
{
string newStr = String.Empty;
try
{
if (data.Contains("HORMEL"))
{
string[] s = data.Split(' ');
foreach(string str in s)
{
if (str.Contains("HORMEL"))
{
char[] ch = str.ToCharArray();
for(int i=0; i<ch.Length; i++)
{
EventLogProvider.LogInformation("LoadNutritionInfoTask", "Test", ch[i] + " = " + String.Format("{0:X}", Convert.ToInt16(ch[i])));
}
}
}
}
return String.Empty;
}
catch (Exception ex)
{
EventLogProvider.LogInformation("LoadNutritionInfoTask", "ScrubData", ex.Message);
return data;
}
}
I'm not concerned with what is being returned right now, I am printing out the characters and the hex codes that correspond to them.

First, you need to make sure you're reading the text with the correct encoding. It appears to me that you are using UTF-8, since you say ® (Unicode code point U+00AE) is 0xC2AE, which is the same as UTF-8. You can use that like:
Encoding.UTF8.GetString(new byte[] { 0xc2, 0xae }) // "®", the registered symbol
// or
using (var streamReader = new StreamReader(file, Encoding.UTF8))
Once you've got it as a string in C#, you should use HttpUtility.HtmlEncode to encode it as HTML. E.g.
HttpUtility.HtmlEncode("SomeStuff®") // result is "SomeStuff®"

Check encoding you are decoding bytes with.

Try this:
string txt = "textwithsymbol";
string html = "<html></html>";
txt = txt.Replace("\u00ae", html);
Obviously you would replace the txt variable with the text you have read in and "\u00ae" is the symbol you are looking for.

Extract text by line from PDF using iTextSharp c#

I need to run some analysis my extracting data from a PDF document.
Using iTextSharp, I used the PdfTextExtractor.GetTextFromPage method to extract contents from a PDF document and it returned me in a single long line.
Is there a way to get the text by line so that i can store them in an array? So that i can analyze the data by line which will be more flexible.
Below is the code I used:
string urlFileName1 = "pdf_link";
PdfReader reader = new PdfReader(urlFileName1);
string text = string.Empty;
for (int page = 1; page <= reader.NumberOfPages; page++)
{
text += PdfTextExtractor.GetTextFromPage(reader, page);
}
reader.Close();
candidate3.Text = text.ToString();

public void ExtractTextFromPdf(string path)
{
using (PdfReader reader = new PdfReader(path))
{
StringBuilder text = new StringBuilder();
ITextExtractionStrategy Strategy = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
for (int i = 1; i <= reader.NumberOfPages; i++)
{
string page = "";
page = PdfTextExtractor.GetTextFromPage(reader, i,Strategy);
string[] lines = page.Split('\n');
foreach (string line in lines)
{
MessageBox.Show(line);
}
}
}
}

I know this is posting on an older post, but I spent a lot of time trying to figure this out so I'm going to share this for the future people trying to google this:
using System;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
namespace PDFApp2
{
class Program
{
static void Main(string[] args)
{
string filePath = #"Your said path\the file name.pdf";
string outPath = #"the output said path\the text file name.txt";
int pagesToScan = 2;
string strText = string.Empty;
try
{
PdfReader reader = new PdfReader(filePath);
for (int page = 1; page <= pagesToScan; page ++) //(int page = 1; page <= reader.NumberOfPages; page++) <- for scanning all the pages in A PDF
{
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
strText = PdfTextExtractor.GetTextFromPage(reader, page, its);
strText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(strText)));
//creating the string array and storing the PDF line by line
string[] lines = strText.Split('\n');
foreach (string line in lines)
{
//Creating and appending to a text file
using (System.IO.StreamWriter file = new System.IO.StreamWriter(outPath, true))
{
file.WriteLine(line);
}
}
}
reader.Close();
}
catch (Exception ex)
{
Console.Write(ex);
}
}
}
}
I had the program read in a PDF, from a set path, and just output to a text file, but you can manipulate that to anything. This was building off of Snziv Gupta's response.

All the other code samples here didn't work for me, probably due to changes to the itext7 API.
This minimal example here works ok:
var pdfReader = new iText.Kernel.Pdf.PdfReader(fileName);
var pdfDocument = new iText.Kernel.Pdf.PdfDocument(pdfReader);
var contents = iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(pdfDocument.GetFirstPage());

LocationTextExtractionStrategy will automatically insert '\n' in the output text. However, sometimes it will insert '\n' where it shouldn't.
In that case you need to build a custom TextExtractionStrategy or RenderListener. Bascially the code that detects newline is the method
public virtual bool SameLine(ITextChunkLocation other) {
return OrientationMagnitude == other.OrientationMagnitude &&
DistPerpendicular == other.DistPerpendicular;
}
In some cases '\n' shouldn't be inserted if there is only small difference between DistPerpendicular and other.DistPerpendicular, so you need to change it to something like Math.Abs(DistPerpendicular - other.DistPerpendicular) < 10
Or you can put that piece of code in the RenderText method of your custom TextExtractionStrategy/RenderListener class

Use LocationTextExtractionStrategy in lieu of SimpleTextExtractionStrategy. LocationTextExtractionStrategy extracted text contains the new line character at the end of line.
ITextExtractionStrategy Strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), renderFilter);
string pdftext = PdfTextExtractor.GetTextFromPage(reader,pageno, Strategy);
string[] words = pdftext.Split('\n');
return words;

Try
String page = PdfTextExtractor.getTextFromPage(reader, 2);
String s1[]=page.split("\n");

Delete specific line from a text file?

I need to delete an exact line from a text file but I cannot for the life of me workout how to go about doing this.
Any suggestions or examples would be greatly appreciated?
Related Questions
Efficient way to delete a line from a text file (C#)

If the line you want to delete is based on the content of the line:
string line = null;
string line_to_delete = "the line i want to delete";
using (StreamReader reader = new StreamReader("C:\\input")) {
using (StreamWriter writer = new StreamWriter("C:\\output")) {
while ((line = reader.ReadLine()) != null) {
if (String.Compare(line, line_to_delete) == 0)
continue;
writer.WriteLine(line);
}
}
}
Or if it is based on line number:
string line = null;
int line_number = 0;
int line_to_delete = 12;
using (StreamReader reader = new StreamReader("C:\\input")) {
using (StreamWriter writer = new StreamWriter("C:\\output")) {
while ((line = reader.ReadLine()) != null) {
line_number++;
if (line_number == line_to_delete)
continue;
writer.WriteLine(line);
}
}
}

The best way to do this is to open the file in text mode, read each line with ReadLine(), and then write it to a new file with WriteLine(), skipping the one line you want to delete.
There is no generic delete-a-line-from-file function, as far as I know.

One way to do it if the file is not very big is to load all the lines into an array:
string[] lines = File.ReadAllLines("filename.txt");
string[] newLines = RemoveUnnecessaryLine(lines);
File.WriteAllLines("filename.txt", newLines);

Hope this simple and short code will help.
List linesList = File.ReadAllLines("myFile.txt").ToList();
linesList.RemoveAt(0);
File.WriteAllLines("myFile.txt"), linesList.ToArray());
OR use this
public void DeleteLinesFromFile(string strLineToDelete)
{
string strFilePath = "Provide the path of the text file";
string strSearchText = strLineToDelete;
string strOldText;
string n = "";
StreamReader sr = File.OpenText(strFilePath);
while ((strOldText = sr.ReadLine()) != null)
{
if (!strOldText.Contains(strSearchText))
{
n += strOldText + Environment.NewLine;
}
}
sr.Close();
File.WriteAllText(strFilePath, n);
}

You can actually use C# generics for this to make it real easy:
var file = new List<string>(System.IO.File.ReadAllLines("C:\\path"));
file.RemoveAt(12);
File.WriteAllLines("C:\\path", file.ToArray());

This can be done in three steps:
// 1. Read the content of the file
string[] readText = File.ReadAllLines(path);
// 2. Empty the file
File.WriteAllText(path, String.Empty);
// 3. Fill up again, but without the deleted line
using (StreamWriter writer = new StreamWriter(path))
{
foreach (string s in readText)
{
if (!s.Equals(lineToBeRemoved))
{
writer.WriteLine(s);
}
}
}

Read and remember each line
Identify the one you want to get rid
of
Forget that one
Write the rest back over the top of
the file

I cared about the file's original end line characters ("\n" or "\r\n") and wanted to maintain them in the output file (not overwrite them with what ever the current environment's char(s) are like the other answers appear to do). So I wrote my own method to read a line without removing the end line chars then used it in my DeleteLines method (I wanted the option to delete multiple lines, hence the use of a collection of line numbers to delete).
DeleteLines was implemented as a FileInfo extension and ReadLineKeepNewLineChars a StreamReader extension (but obviously you don't have to keep it that way).
public static class FileInfoExtensions
{
public static FileInfo DeleteLines(this FileInfo source, ICollection<int> lineNumbers, string targetFilePath)
{
var lineCount = 1;
using (var streamReader = new StreamReader(source.FullName))
{
using (var streamWriter = new StreamWriter(targetFilePath))
{
string line;
while ((line = streamReader.ReadLineKeepNewLineChars()) != null)
{
if (!lineNumbers.Contains(lineCount))
{
streamWriter.Write(line);
}
lineCount++;
}
}
}
return new FileInfo(targetFilePath);
}
}
public static class StreamReaderExtensions
{
private const char EndOfFile = '\uffff';
/// <summary>
/// Reads a line, similar to ReadLine method, but keeps any
/// new line characters (e.g. "\r\n" or "\n").
/// </summary>
public static string ReadLineKeepNewLineChars(this StreamReader source)
{
if (source == null)
throw new ArgumentNullException(nameof(source));
char ch = (char)source.Read();
if (ch == EndOfFile)
return null;
var sb = new StringBuilder();
while (ch != EndOfFile)
{
sb.Append(ch);
if (ch == '\n')
break;
ch = (char)source.Read();
}
return sb.ToString();
}
}

Are you on a Unix operating system?
You can do this with the "sed" stream editor. Read the man page for "sed"

What?
Use file open, seek position then stream erase line using null.
Gotch it? Simple,stream,no array that eat memory,fast.
This work on vb.. Example search line culture=id where culture are namevalue and id are value and we want to change it to culture=en
Fileopen(1, "text.ini")
dim line as string
dim currentpos as long
while true
line = lineinput(1)
dim namevalue() as string = split(line, "=")
if namevalue(0) = "line name value that i want to edit" then
currentpos = seek(1)
fileclose()
dim fs as filestream("test.ini", filemode.open)
dim sw as streamwriter(fs)
fs.seek(currentpos, seekorigin.begin)
sw.write(null)
sw.write(namevalue + "=" + newvalue)
sw.close()
fs.close()
exit while
end if
msgbox("org ternate jua bisa, no line found")
end while
that's all..use #d

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

How can I read a Lync conversation file containing HTML? - c#

Related

How to convert the text direction to normal

How can I specify region of reading from file?

Finding ® in a string of text

Extract text by line from PDF using iTextSharp c#

Delete specific line from a text file?

Categories

Resources