How can I read any file into a string - c#

I want to be able to read any file into a string, for instance the way Notepad might open a Word file. Using the following code:
StreamReader sr = new StreamReader(filePath);
text += sr.ReadToEnd();
sr.Close();
works fine on a basic text file, but on a Word file, say, I just get a few odd characters, whereas opening the same file in Notepad shows me the entire file: text, special characters and so on. I'm using this as part of a file drop into a textbox. Basically, I'm looking to get the same output you would get when you open any file in Notepad. What should I be using instead?

Your code from the original question does read the entire stream (you can confirm this by inspecting the string in the debugger). The problem is that most binary files contain null characters ('\0'), which cause most viewers to stop displaying the content at that point.
If you remove or replace the '\0' characters, you'll see the entire stream, just like in Notepad.
For example:
string filePath = @"c:\windows\system32\calc.exe";
StreamReader sr = new StreamReader(filePath);
string text = sr.ReadToEnd();
sr.Close();
textBox1.Text = text.Replace('\0', ' ');
Add a textbox1 to a form and see for yourself... You'll see the entire stream...

This should give you the functionality that you want. First read the file in as a byte[] using
byte[] data = File.ReadAllBytes(fileName);
then just decode it with ASCII, or whichever encoding suits you.
string s = Encoding.ASCII.GetString(data);
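One caveat with ASCII: Encoding.ASCII turns every byte above 0x7F into '?', so part of the file is lost in the string. A minimal sketch, assuming the goal is a Notepad-like dump where every byte shows up as some character, is to decode with ISO-8859-1 (which maps each byte to exactly one character) and replace the nulls for display. The RawText class and its method names are made up for illustration, and this is not necessarily the code page Notepad itself would pick.

```csharp
using System;
using System.IO;
using System.Text;

public static class RawText
{
    // ISO-8859-1 maps bytes 0x00-0xFF one-to-one onto U+0000-U+00FF,
    // so no byte is dropped or merged while decoding.
    private static readonly Encoding Latin1 = Encoding.GetEncoding("iso-8859-1");

    // Decode raw bytes without losing any of them.
    public static string Decode(byte[] data)
    {
        return Latin1.GetString(data);
    }

    // Same as Decode, but with '\0' replaced by a space so that
    // controls such as a TextBox do not stop at the first null.
    public static string ForDisplay(byte[] data)
    {
        return Decode(data).Replace('\0', ' ');
    }

    // Convenience wrapper for a file on disk.
    public static string ReadForDisplay(string path)
    {
        return ForDisplay(File.ReadAllBytes(path));
    }
}
```

Usage would be something like textBox1.Text = RawText.ReadForDisplay(filePath). The string is only good for viewing; writing it back out will not reproduce the original file once the nulls have been replaced.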

I'm assuming you're referring to WordPad, which is also included with Windows, rather than Notepad. WordPad, in addition to showing basic text files, also knows how to parse and edit Word files (.DOCX, but oddly enough not the older .DOC files), Rich Text Format files (.RTF), and OpenOffice documents (.ODT). This doesn't come for free just by opening the Word file and displaying its content: there is a lot of code inside WordPad to parse this binary data and display it properly, not to mention the code to edit and save it again.
If you need to retrieve the data from Word files, there are several programmatic options, starting with automating the Word application itself using the Word APIs. However, this solution is problematic for running on a server, or if you need to open them where there is no Word installed.
In this case you also have several options. For post-2007 documents with the .DOCX extension, you can use the System.IO.Packaging namespace to open the DOCX and extract its relevant parts, but it's up to you to understand the syntax of the XML files within. Alternatively, you can purchase a third-party library that does it for you, such as Aspose, which I've worked with and found to be fine. There are others out there too.
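To give an idea of what "extract its relevant parts" can look like, here is a rough sketch. Note that it swaps in System.IO.Compression.ZipArchive instead of System.IO.Packaging (a .DOCX is a ZIP package, so either can open it), and it assumes the main part lives at word/document.xml, which is the usual location but is strictly speaking defined by the package relationships. The DocxText class is a made-up name; tables, headers, footnotes and so on are ignored.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

public static class DocxText
{
    // WordprocessingML main namespace used by w:p (paragraph) and w:t (text run).
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the body text, one line per paragraph.
    public static string Extract(Stream docx)
    {
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true))
        {
            // Assumption: the main document part is at the conventional path.
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found.");

            using (Stream part = entry.Open())
            {
                XDocument xml = XDocument.Load(part);
                // Concatenate the text runs of each paragraph.
                var paragraphs = xml.Descendants(W + "p")
                    .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)));
                return string.Join(Environment.NewLine, paragraphs);
            }
        }
    }
}
```

This only works for .DOCX; the older binary .DOC format is not a ZIP package and needs a different parser.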

Related

read any file to buffer

I am not new to C# but quite new to file handling. My current idea is to read files (of any kind, for example jpg, txt, pdf etc) to a buffer to be able to do something with it later, for example just write an exact copy to the same folder (for testing) or send it to another pc via network. I know that there is a specific method for sending files via network, but I'd like to be able to handle the file itself and understand how to open files the correct way and write them the correct way to have a working copy.
If I just open a file and use for example a StreamReader like this:
using (StreamReader sr = new StreamReader(sourcePath, GetEncoding(sourcePath)))
{
// Read the stream to a string, and write the string to the console.
String line = sr.ReadToEnd();
Console.WriteLine(line);
WriteFile(outputFile, GetEncoding(sourcePath), line);
}
it will create a bigger file (for example, from a jpg) which does not work in the end. I think it has something to do with the encoding, but since I have too little knowledge about files themselves, maybe someone can give me some helpful tips.
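For context on why the copy comes out bigger: StreamReader decodes bytes into characters, and byte sequences that are not valid in the chosen encoding get replaced, so the round trip through a string is lossy for anything that is not text. A minimal sketch of a byte-for-byte copy, assuming nothing more than a source and a destination stream, looks like this (BinaryCopy and its members are names made up for illustration; the framework's own Stream.CopyTo and File.Copy do the same job):

```csharp
using System;
using System.IO;

public static class BinaryCopy
{
    // Copies raw bytes in chunks; no text decoding is involved,
    // so the destination is an exact copy of the source.
    public static void CopyStream(Stream source, Stream destination, int bufferSize = 81920)
    {
        byte[] buffer = new byte[bufferSize];
        int read;
        // Read returns 0 only at the end of the stream.
        while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
        {
            // Write only the bytes that were actually read.
            destination.Write(buffer, 0, read);
        }
    }

    public static void CopyFile(string sourcePath, string outputPath)
    {
        using (FileStream input = File.OpenRead(sourcePath))
        using (FileStream output = File.Create(outputPath))
        {
            CopyStream(input, output);
        }
    }
}
```

The same buffer loop works when the destination is a NetworkStream, which covers the "send it to another pc" case.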

Extract Contents code from PDF / Word File

I have two big files, one MS Word and one PDF, which contain images, text fields and tables.
I need to insert text into these files dynamically at specific locations. I've tried Bookmarks method in Word but I can't use that method now. I've extracted data into byte array and tried to write in pdf but file gets corrupted. Here is the code:
byte[] bytes = System.IO.File.ReadAllBytes("CDC.doc");
FileStream fs = new FileStream("CDC.pdf", FileMode.OpenOrCreate);
fs.Write(bytes, 0, bytes.Length);
fs.Close();
Is there any way I can convert these PDF/Word files to get the underlying PDF code, so that I can then append data at specific locations in that code? Please advise. Thanks!
If I understand you correctly, you would like to write code that replaces all placeholders in a Word document acting as a template with your application data. For placeholders you can use Bookmarks, but a better choice would be Content Controls. You can use the Open XML SDK to parse such a template Word document and replace the Content Controls with data. This approach uses a free MS library but is tedious.
A much easier approach would be using a ready-made library which can work with templates, which contain placeholders that will get replaced with your real app data at runtime. In your C# application you can prepare the data (as C# data objects or XML) and merge this data with the template. Output can be in docx, pdf or xps format. You can check out some of the examples here.

Universal Data Link - File cannot be opened. Ensure it is a valid Data Link file

I am attempting to create a UDL file programmatically in C#. In my program, I want to show the user the Data Link properties window but with my own default values for the connection string. I initially thought to do the following:
string[] lines = new string[]
{
"[oledb]",
"; Everything after this line is an OLE DB initstring",
"Provider=SQLOLEDB.1;Persist Security Info=False"
};
File.WriteAllLines("Test.udl", lines);
Process p = Process.Start("Test.udl");
p.WaitForExit();
However, I get this error when trying to open the file:
File cannot be opened. Ensure it is a valid Data Link file.
This is strange because I created an empty file, named it something.udl, opened it, clicked OK, and then opened the contents of the file which was:
[oledb]
; Everything after this line is an OLE DB initstring
Provider=SQLOLEDB.1;Persist Security Info=False
But there was a newline character at the end of the connection string. I used KDiff to compare this file and the file I created in my program, and it said "Files are equal text but they are not binary equal" or something to that effect.
I believe it has to do with how the File.WriteAllLines method writes the strings. So I attempted to use different encodings with the method but with no success. Any ideas on where I am going wrong?
I am using this MSDN link as a reference about UDL files. It's also interesting to note that if I open a new text file and paste in all of the lines from my lines array, I arrive at the same error.
All you need to do is use the Unicode encoding:
File.WriteAllLines("Test.udl", lines, Encoding.Unicode);
When creating the file in a plain-text editor, use the UTF-16 Little Endian encoding and include a Byte Order Mark (since Microsoft started on the Intel platform they consider that the "default" when they talk about UTF-16).
When writing the file from a program, make sure to use that particular encoding as well. Programming languages might still default to a legacy code page or to UTF-8, in which case opening the UDL file will trigger the error shown in the question.
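A small sketch of both points, for what it's worth: writing the file with Encoding.Unicode (UTF-16 little endian, which emits the FF FE byte order mark when writing a new file) and then checking the first two bytes to confirm the BOM is really there. UdlWriter and its method names are made up for illustration; the three lines are the ones from the question.

```csharp
using System;
using System.IO;
using System.Text;

public static class UdlWriter
{
    // Writes a UDL file as UTF-16 LE with a byte order mark.
    public static void Write(string path, string connectionString)
    {
        string[] lines =
        {
            "[oledb]",
            "; Everything after this line is an OLE DB initstring",
            connectionString
        };
        // Encoding.Unicode is UTF-16 little endian; File.WriteAllLines
        // puts its preamble (FF FE) at the start of the file.
        File.WriteAllLines(path, lines, Encoding.Unicode);
    }

    // Diagnostic helper: true if the file begins with the UTF-16 LE BOM.
    public static bool StartsWithUtf16LeBom(string path)
    {
        byte[] bytes = File.ReadAllBytes(path);
        return bytes.Length >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE;
    }
}
```

StartsWithUtf16LeBom is also handy for checking a file saved from a text editor when the Data Link dialog refuses to open it.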

Reading a part of PDF file in c#

I have many large PDF files, and I only need to read part of each one. I want to start reading the PDF file and write its content to another file, such as a txt file or any other type of file.
However, I want to limit the size of the file that I am writing to. When the txt file reaches about 15 MB, I should stop reading the PDF document and keep the txt file created so far for my purpose.
Can anyone help me do this in C#?
Thanks for your help in advance.
Here is the code that I use for reading the whole file; (image content is not important for me)
using (StreamReader sr = new StreamReader(@"F:\1.pdf"))
{
using (StreamWriter sw = new StreamWriter(@"F:\test.txt"))
{
while (!sr.EndOfStream)
{
string line = sr.ReadLine();
sw.WriteLine(line);
sw.Flush();
}
}
}
You have to use a PDF library to do this. There are a lot of free and paid PDF libraries out there that can be used for your task. Recently I used the EO.Pdf library to read PDF pages and extract page content. The best part is that it has a NuGet package and is continuously developed. The downside is that you have to pay for commercial use.
A PDF can't be read as text directly using .NET. You should first convert the PDF to text (or XML, or HTML).
There are a lot of PDF libraries capable of converting PDF to text, such as iTextSharp (the most popular, and open-source), and a lot of other tools.
To control the size of the output text files you should:
1. get the number of pages from the PDF
2. run the PDF-to-text conversion page by page, checking the output text file size as you go
3. once the file size is over 15 MB, stop the conversion and move on to another file
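The steps above could be sketched roughly like this. The PDF library is deliberately left out: extractPageText is a hypothetical callback that you would implement with whichever library you choose (it takes a 1-based page number and returns that page's text), and pageCount would come from that library too. CappedTextExport is likewise a made-up name. The output stream is assumed to be seekable (a FileStream, for example) so that its Length can be checked.

```csharp
using System;
using System.IO;
using System.Text;

public static class CappedTextExport
{
    // Writes page text until the output reaches maxBytes; returns the
    // number of pages written. The page that crosses the limit is kept whole.
    public static int Export(int pageCount, Func<int, string> extractPageText,
                             Stream output, long maxBytes)
    {
        int pagesWritten = 0;
        // UTF-8 without a BOM; leaveOpen so the caller keeps ownership of the stream.
        using (var writer = new StreamWriter(output, new UTF8Encoding(false), 4096, leaveOpen: true))
        {
            for (int page = 1; page <= pageCount; page++)
            {
                writer.WriteLine(extractPageText(page));
                // Flush so that output.Length reflects what was written so far.
                writer.Flush();
                pagesWritten++;
                if (output.Length >= maxBytes)
                    break;
            }
        }
        return pagesWritten;
    }
}
```

For the 15 MB limit from the question, maxBytes would be 15L * 1024 * 1024, with output being a FileStream from File.Create.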

How to extract text from Word files using C#?

I am trying to convert a large number (100,000) of Word DOC files, and these are quite old: from around the 1995 to 2000 versions of Word, I suppose. I keep going around in circles with what I see here on Stack Overflow and in the MS documentation.
What I want to do is simply read the file, stick the text into a string, parse the string, and take out the structured data (the file is actually a structured report, with lines like Patient: Jon Doe). From that point on, I know what I am doing. I can parse the string data, stick it into useful variables, then stick this data into a database. But I do not know how to actually get the text into a string. Any help?
PPS I found this reference, which supposedly converts a DOC file into a text file. It's a start, but I'd rather avoid doing a bunch of file manipulations.
If you try to use the Word object model, you must always instantiate a particular version of Word on the client (since running Word on a server is not recommended). Unfortunately, you'll then depend on Word's restrictions concerning older files; e.g. Word 2010 can open files from Office 95 only in sandbox mode (i.e. you're not able to access the file content programmatically). Additionally, you'll have to deal with unknown template content (documents with macros attached, for example).
In your case I'd rather look for a third-party component that gives access to the content.
I know that document management systems like OpenText eDocs and Autonomy iManage use other tools to full-text index documents of all types and can present the content in a viewer application. So if you look in this direction, maybe you'll find something useful.
A word file is just a normal file as far as your code goes.
Try this:
using System.IO;
StreamReader streamReader = new StreamReader(filePath);
string text = streamReader.ReadToEnd();
streamReader.Close();
