I have an RTF file with content like this:
{\object\objemb{\*\objclass Excel.Sheet.12}\objw8415\objh3015{\*\objdata
01050000
02000000
0f000000...}}}
(may be Excel or Word)
What I need is to extract the \objdata part into an external file to be able to edit it. After that, the file shall be converted back to an embedded object in an RTF file.
I already searched around, and it seems that this is not a trivial problem. From this post and with a small modification, I tried to get access to the objdata and to save it to file, but this does not lead to a valid Excel file:
if (RtfReader.MoveToNextControlWord(enumerator, "objdata"))
{
byte[] data = RtfReader.GetNextTextAsByteArray(enumerator);
using (MemoryStream packageData = new MemoryStream())
{
RtfReader.ExtractObjectData(new MemoryStream(data), packageData);
File.WriteAllBytes(@"c:\temp\some-excel.xls", ReadToEnd(packageData));
}
}
Are there any ideas out there how to achieve the mentioned goals?
Thanks a lot in advance for any help!
In this case, the content of the objdata is a Compound File. You can spot the famous 'd0cf11e0' header (looks like "docfile"). More on this here: Developing a tool to recognise MS Office file types ( .doc, .xls, .mdb, .ppt ).
I have written a small example that you can use to extract the data. You can use it like this:
string ole = "2090_Object_Text_0.ole"; // your file
string text = File.ReadAllText(ole);
DocFile.Save(text, "mydoc.doc"); // you should adapt this depending on the object class (Word.Document.8 is a .doc).
And the DocFile helper code:
public static class DocFile
{
// magic Doc File header
// check this for more: http://social.msdn.microsoft.com/Forums/en-US/343d09e3-5fdf-4b4a-9fa6-8ccb37a35930/developing-a-tool-to-recognise-ms-office-file-types-doc-xls-mdb-ppt-
private const string Header = "d0cf11e0";
public static void Save(string text, string filePath)
{
if (text == null)
throw new ArgumentNullException("text");
if (filePath == null)
throw new ArgumentNullException("filePath");
int start = text.IndexOf(Header);
if (start < 0)
throw new ArgumentException("Text does not contain a doc file.", "text");
int end = text.IndexOf('}', start);
if (end < 0)
{
end = text.Length;
}
using (MemoryStream bytes = new MemoryStream())
{
bool highByte = true;
byte b = 0;
for (int i = start; i < end; i++)
{
char c = text[i];
if (char.IsWhiteSpace(c))
continue;
if (highByte)
{
b = (byte)(16 * GetHexValue(c));
}
else
{
b |= GetHexValue(c);
bytes.WriteByte(b);
}
highByte = !highByte;
}
File.WriteAllBytes(filePath, bytes.ToArray());
}
}
private static byte GetHexValue(char c)
{
if (c >= '0' && c <= '9')
return (byte)(c - '0');
if (c >= 'a' && c <= 'f')
return (byte)(10 + (c - 'a'));
if (c >= 'A' && c <= 'F')
return (byte)(10 + (c - 'A'));
throw new ArgumentException(null, "c");
}
}
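For the way back into the RTF, the decoded bytes have to become hex text again. The following is only a minimal sketch of that reverse step; `ObjDataHex` and `ToHex` are hypothetical names, and the line width of 32 bytes is an arbitrary choice. Note that `DocFile.Save` skips whatever hex text precedes `d0cf11e0` in `\objdata`; this sketch does not rebuild that leading part, so it would have to be kept from the original file (and adjusted if it records the payload size, which is an assumption to verify against your data).

```csharp
using System;
using System.Text;

public static class ObjDataHex
{
    // Hypothetical helper: encodes bytes as lowercase hex text and inserts
    // a line break after every bytesPerLine bytes.
    public static string ToHex(byte[] data, int bytesPerLine = 32)
    {
        if (data == null)
            throw new ArgumentNullException("data");
        if (bytesPerLine <= 0)
            throw new ArgumentOutOfRangeException("bytesPerLine");

        var sb = new StringBuilder(data.Length * 2);
        for (int i = 0; i < data.Length; i++)
        {
            // Start a new line before every full group except the first.
            if (i > 0 && i % bytesPerLine == 0)
                sb.Append('\n');
            sb.Append(data[i].ToString("x2"));
        }
        return sb.ToString();
    }

    public static void Main()
    {
        byte[] signature = { 0xD0, 0xCF, 0x11, 0xE0 };
        Console.WriteLine(ToHex(signature)); // prints "d0cf11e0"
    }
}
```

Since `DocFile.Save` ignores whitespace while decoding, text produced this way can be decoded by it again.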
While looking at memory-mapped files in C#, there was some difficulty in identifying how to search a file quickly forward and in reverse. My goal is to rewrite the following function in the language, but nothing could be found like the find and rfind methods used below. Is there a way in C# to quickly search a memory-mapped file using a particular substring?
#! /usr/bin/env python3
import mmap
import pathlib
# noinspection PyUnboundLocalVariable
def drop_last_line(path):
with path.open('r+b') as file:
with mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ) as search:
for next_line in b'\r\n', b'\r', b'\n':
if search.find(next_line) >= 0:
break
else:
raise ValueError('cannot find any line delimiters')
end_1st = search.rfind(next_line)
end_2nd = search.rfind(next_line, 0, end_1st - 1)
file.truncate(0 if end_2nd < 0 else end_2nd + len(next_line))
Is there a way in C# to quickly search a memory-mapped file using a particular substring?
Do you know of any way to memory-map an entire file in C# and then treat it as a byte array?
Yes, it's quite easy to map an entire file into a view then to read it into a single byte array as the following code shows:
static void Main(string[] args)
{
var sourceFile = new FileInfo(@"C:\Users\Micky\Downloads\20180112.zip");
int length = (int) sourceFile.Length; // length of target file
// Create the memory-mapped file.
using (var mmf = MemoryMappedFile.CreateFromFile(sourceFile.FullName,
FileMode.Open,
"ImgA"))
{
var buffer = new byte[length]; // allocate a buffer with the same size as the file
using (var accessor = mmf.CreateViewAccessor())
{
var read=accessor.ReadArray(0, buffer, 0, length); // read the whole thing
}
// let's try searching for a known byte sequence. Change this to suit your file
var target = new byte[] {71, 213, 62, 204,231};
var foundAt = IndexOf(buffer, target);
}
}
I couldn't seem to find any byte searching method in Marshal or Array but you can use this search algorithm courtesy of Social MSDN as a start:
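private static int IndexOf2(byte[] input, byte[] pattern)
{
byte firstByte = pattern[0];
int index = Array.IndexOf(input, firstByte);
// Try every occurrence of the first byte, not only the first one
while (index >= 0 && index <= input.Length - pattern.Length)
{
bool match = true;
for (int i = 1; i < pattern.Length; i++)
{
if (pattern[i] != input[index + i]) { match = false; break; }
}
if (match) return index;
index = Array.IndexOf(input, firstByte, index + 1);
}
return -1;
}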
...or even this more verbose example (also courtesy Social MSDN, same link)
public static int IndexOf(byte[] arrayToSearchThrough, byte[] patternToFind)
{
if (patternToFind.Length > arrayToSearchThrough.Length)
return -1;
for (int i = 0; i <= arrayToSearchThrough.Length - patternToFind.Length; i++) // <= so a match at the very end is found
{
bool found = true;
for (int j = 0; j < patternToFind.Length; j++)
{
if (arrayToSearchThrough[i + j] != patternToFind[j])
{
found = false;
break;
}
}
if (found)
{
return i;
}
}
return -1;
}
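The question also asks for a reverse search like Python's `rfind`. Below is a minimal sketch of one way to do that on the byte array read from the view; `ByteSearch` and `LastIndexOf` are hypothetical names, not framework methods, and the `end` parameter is meant to behave like the exclusive end bound of `rfind(pattern, 0, end)`.

```csharp
using System;
using System.Text;

public static class ByteSearch
{
    // Hypothetical helper: returns the highest index i such that the pattern
    // matches at i and i + pattern.Length <= end, or -1 if there is none.
    public static int LastIndexOf(byte[] data, byte[] pattern, int end)
    {
        if (data == null) throw new ArgumentNullException("data");
        if (pattern == null) throw new ArgumentNullException("pattern");
        if (end > data.Length) end = data.Length;

        // Walk backwards over every possible start position.
        for (int i = end - pattern.Length; i >= 0; i--)
        {
            int j = 0;
            while (j < pattern.Length && data[i + j] == pattern[j])
                j++;
            if (j == pattern.Length)
                return i;
        }
        return -1;
    }

    public static void Main()
    {
        byte[] data = Encoding.ASCII.GetBytes("one\ntwo\nthree\n");
        byte[] newline = { (byte)'\n' };
        int last = LastIndexOf(data, newline, data.Length);
        int previous = LastIndexOf(data, newline, last);
        Console.WriteLine(last + " " + previous); // prints "13 7"
    }
}
```

This is a plain quadratic scan; for short patterns such as line delimiters that is usually adequate.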
I am using iTextSharp product to change the PDF properties as follows.
I am unable to change the "PDF Producer" property at all. Please suggest where I am going wrong.
The code line
info["Producer"] = "My producer";
does not work as it should.
string sourcePath = tbPath.Text;
IList<string> dirs = null;
string pdfName = string.Empty;
string OutputPath = string.Empty;
DirectoryInfo di = new DirectoryInfo(sourcePath);
DirectoryInfo dInfo = Directory.CreateDirectory(sourcePath + "\\" + "TempDir");
OutputPath = Path.Combine(sourcePath,"TempDir");
dirs = Directory.GetFiles(di.FullName, "*.pdf").ToList();
for (int i = 0; i <= dirs.Count - 1; i++)
{
try
{
PdfReader pdfReader = new PdfReader(dirs[i]);
using (FileStream fileStream = new FileStream(Path.Combine(OutputPath, Path.GetFileName(dirs[i])),
FileMode.Create,
FileAccess.Write))
{
PdfStamper pdfStamper = new PdfStamper(pdfReader, fileStream);
Dictionary<string, string> info = pdfReader.Info;
info["Title"] = "";
info["Author"] = "";
info["Producer"] = "My producer"; ////THIS IS NOT WORKING..
pdfStamper.MoreInfo = info;
pdfStamper.Close();
pdfReader.Close();
}
You can only change the producer line if you have a license key. A license key needs to be purchased from iText Software. Instructions on how to apply the license key are sent along with that key.
If you want to use iText for free, you can't change the producer line. See the license header of every file in the open source version of iText:
* In accordance with Section 7(b) of the GNU Affero General Public License,
* a covered work must retain the producer line in every PDF that is created
* or manipulated using iText.
For your info: iText Group has successfully sued a German company that changed the producer line without purchasing a license. You can find some documents related to this case here: IANAL: What developers should know about IP and Legal (slide 57-62)
By the way, I won a JavaOne Rockstar award with this talk: https://twitter.com/itext/status/704278659012681728
Summarized: if you don't have a commercial license for iText, you can not legally change the producer line in iText. If you have a commercial license, you need to apply the license key.
If you know the producer string, you can replace its bytes directly in the PDF file.
The existing producer string must be at least as long as your company name (or whatever replacement text you use).
In this example I assume that the producer string has at least 20 characters. You have to determine the actual length by opening the PDF file in a text editor.
Before using this, check the licence of the program that was used to create the PDF.
Here is an example in C#.
// Find the producer bytes ("producer...") in the array and replace
// them with "COMPANY", then fill the rest with spaces (code: 32).
// The method name below is illustrative.
byte[] ReplaceProducer(byte[] pdf)
{
var textForReplacement = "producer";
var bytesForReplacement = System.Text.Encoding.UTF8.GetBytes(textForReplacement);
var newText = "COMPANY";
var newBytes = System.Text.Encoding.UTF8.GetBytes(newText);
var result = this.Search(pdf, bytesForReplacement);
if (result > -1)
{
var j = 0;
for (var i = result; i < result + 20; i++)
{
// if we have new bytes, then replace them
if (i < result + newBytes.Length)
{
pdf[i] = newBytes[j];
j++;
}
// if not, fill spaces (32)
else
{
pdf[i] = 32;
}
}
}
return pdf;
}
int Search(byte[] src, byte[] pattern)
{
int c = src.Length - pattern.Length + 1;
int j;
for (int i = 0; i < c; i++)
{
if (src[i] != pattern[0]) continue;
for (j = pattern.Length - 1; j >= 1 && src[i + j] == pattern[j]; j--) ;
if (j == 0) return i;
}
return -1;
}
What is the easiest way to read a file character by character in C#?
Currently, I am reading line by line by calling System.io.file.ReadLine(). I see that there is a Read() function but it doesn't return a character...
I would also like to know how to detect the end of a line using such an approach...The input file in question is a CSV file....
Open a TextReader (e.g. by File.OpenText - note that File is a static class, so you can't create an instance of it) and repeatedly call Read. That returns int rather than char so it can also indicate end of file:
int readResult = reader.Read();
if (readResult != -1)
{
char nextChar = (char) readResult;
// ...
}
Or to loop:
int readResult;
while ((readResult = reader.Read()) != -1)
{
char nextChar = (char) readResult;
// ...
}
Or for more funky goodness:
public static IEnumerable<char> ReadCharacters(string filename)
{
using (var reader = File.OpenText(filename))
{
int readResult;
while ((readResult = reader.Read()) != -1)
{
yield return (char) readResult;
}
}
}
...
foreach (char c in ReadCharacters("foo.txt"))
{
...
}
Note that by default, File.OpenText will use an encoding of UTF-8. Specify an encoding explicitly if that isn't what you want.
EDIT: To find the end of a line, you'd check whether the character is \n... you'd potentially want to handle \r specially too, if this is a Windows text file.
But if you want each line, why not just call ReadLine? You can always iterate over the characters in the line afterwards...
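To make the end-of-line check concrete, here is a minimal sketch that reads character by character and treats `\r\n`, `\n` and a lone `\r` as line ends. `CharReaderDemo` and `ReadLinesByChar` are hypothetical names, and a `StringReader` stands in for the file. It does not handle quoted CSV fields that contain line breaks.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class CharReaderDemo
{
    // Hypothetical helper: yields each line without its terminator.
    public static IEnumerable<string> ReadLinesByChar(TextReader reader)
    {
        var line = new StringBuilder();
        bool pending = false; // true while the current line has unreturned characters
        int readResult;
        while ((readResult = reader.Read()) != -1)
        {
            char c = (char)readResult;
            if (c == '\r' || c == '\n')
            {
                // Swallow the '\n' of a Windows-style "\r\n" pair.
                if (c == '\r' && reader.Peek() == '\n')
                    reader.Read();
                yield return line.ToString();
                line.Clear();
                pending = false;
            }
            else
            {
                line.Append(c);
                pending = true;
            }
        }
        // Return a final line that has no terminator.
        if (pending)
            yield return line.ToString();
    }

    public static void Main()
    {
        using (var reader = new StringReader("a,b\r\nc,d\ne"))
        {
            Console.WriteLine(string.Join("|", ReadLinesByChar(reader))); // prints "a,b|c,d|e"
        }
    }
}
```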
Here is a snippet from MSDN:
using (StreamReader sr = new StreamReader(path))
{
char[] c = null;
while (sr.Peek() >= 0)
{
c = new char[1];
sr.Read(c, 0, c.Length);
// do something with c[0]
}
}
If anyone knows how to solve this error, please help me. Below is my code for reading the CSV file. When I tried to upload, it showed me: Server Error in '/' Application. Could not find file 'C:/...csv'. I'm a beginner in C#.
ReadCSV
string filename = FileUpload1.PostedFile.FileName;
using (CsvFileReader reader = new CsvFileReader(filename))
{
CsvRow row = new CsvRow();
while (reader.ReadRow(row))
{
foreach (string s in row)
{
Console.Write(s);
Console.Write(" ");
TextBox1.Text += s;
}
Console.WriteLine();
}
}
CSVClass
public class CsvFileReader : StreamReader
{
public CsvFileReader(Stream stream)
: base(stream)
{
}
public CsvFileReader(string filename): base(filename)
{
}
/// <summary>
/// Reads a row of data from a CSV file
/// </summary>
/// <param name="row"></param>
/// <returns></returns>
public bool ReadRow(CsvRow row)
{
row.LineText = ReadLine();
if (String.IsNullOrEmpty(row.LineText))
return false;
int pos = 0;
int rows = 0;
while (pos < row.LineText.Length)
{
string value;
// Special handling for quoted field
if (row.LineText[pos] == '"')
{
// Skip initial quote
pos++;
// Parse quoted value
int start = pos;
while (pos < row.LineText.Length)
{
// Test for quote character
if (row.LineText[pos] == '"')
{
// Found one
pos++;
// If two quotes together, keep one
// Otherwise, indicates end of value
if (pos >= row.LineText.Length || row.LineText[pos] != '"')
{
pos--;
break;
}
}
pos++;
}
value = row.LineText.Substring(start, pos - start);
value = value.Replace("\"\"", "\"");
}
else
{
// Parse unquoted value
int start = pos;
while (pos < row.LineText.Length && row.LineText[pos] != ',')
pos++;
value = row.LineText.Substring(start, pos - start);
}
// Add field to list
if (rows < row.Count)
row[rows] = value;
else
row.Add(value);
rows++;
// Eat up to and including next comma
while (pos < row.LineText.Length && row.LineText[pos] != ',')
pos++;
if (pos < row.LineText.Length)
pos++;
}
// Delete any unused items
while (row.Count > rows)
row.RemoveAt(rows);
// Return true if any columns read
return (row.Count > 0);
}
}
FileUpload1.PostedFile.FileName is the filename from your client/browser - it does not contain a path...
You either use FileUpload1.PostedFile.InputStream to access it
using (CsvFileReader reader = new CsvFileReader(FileUpload1.PostedFile.InputStream))
OR you first save it to disk (anywhere you have needed permissions) via FileUpload1.PostedFile.SaveAs and then access that file.
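The first option works because a `StreamReader` can wrap any readable stream, not only a file path. The sketch below illustrates that idea outside ASP.NET; `UploadStreamDemo` and `ReadLines` are hypothetical names, and a `MemoryStream` stands in for `FileUpload1.PostedFile.InputStream`.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class UploadStreamDemo
{
    // Hypothetical helper: reads all non-empty lines from any readable stream.
    public static List<string> ReadLines(Stream input)
    {
        var lines = new List<string>();
        using (var reader = new StreamReader(input, Encoding.UTF8))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (line.Length > 0)
                    lines.Add(line);
            }
        }
        return lines;
    }

    public static void Main()
    {
        // Stand-in for the uploaded file's stream; no path on disk is involved.
        byte[] fake = Encoding.UTF8.GetBytes("id,name\r\n1,Ann\r\n");
        using (var stream = new MemoryStream(fake))
        {
            foreach (string line in ReadLines(stream))
                Console.WriteLine(line);
        }
    }
}
```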
You need to save the file that is being uploaded to disk first. Something like this:
string fileSavePath = Server.MapPath("/files/" + FileUpload1.PostedFile.FileName);
FileUpload1.SaveAs(fileSavePath);
using (CsvFileReader reader = new CsvFileReader(fileSavePath))
....
Haven't tested this code, but should give you a starting point.
_documentContent contains the whole document as html view source.
patternToFind contains text to be searched in _documentContent.
Code snippet below works fine if language is English.
The same code, however, does not work at all when it encounters a language like Korean.
Sample Document
Present Tense
The present tense is just as you have learned. You take the dictionary form of a verb, drop the 다, add the appropriate ending.
먹다 - 먹 + 어요 = 먹어요
마시다 - 마시 + 어요 - 마시어요 - 마셔요.
This tense is used to represent what happens in the present. I eat. I drink. It is a general term for the present.
When I try to find 먹, the code below fails.
Can someone please suggest a solution?
using System;
using System.Collections.Generic;
using System.Text;
namespace MultiByteStringHandling
{
class Program
{
static void Main(string[] args)
{
string _documentContent = @"먹다 - 먹 + 어요 = 먹어요";
byte[] patternToFind = Encoding.UTF8.GetBytes("먹");
byte[] DocumentBytes = Encoding.UTF8.GetBytes(_documentContent);
int intByteOffset = indexOf(DocumentBytes, patternToFind);
Console.WriteLine(intByteOffset.ToString());
}
// static so it can be called from the static Main method
public static int indexOf(byte[] data, byte[] pattern)
{
int[] failure = computeFailure(pattern);
int j = 0;
if (data.Length == 0) return -1; // nothing to search: report "not found", not offset 0
for (int i = 0; i < data.Length; i++)
{
while (j > 0 && pattern[j] != data[i])
{
j = failure[j - 1];
}
if (pattern[j] == data[i])
{
j++;
}
if (j == pattern.Length)
{
return i - pattern.Length + 1;
}
}
return -1;
}
/**
* Computes the failure function using a boot-strapping process,
* where the pattern is matched against itself.
*/
private static int[] computeFailure(byte[] pattern)
{
int[] failure = new int[pattern.Length];
int j = 0;
for (int i = 1; i < pattern.Length; i++)
{
while (j > 0 && pattern[j] != pattern[i])
{
j = failure[j - 1];
}
if (pattern[j] == pattern[i])
{
j++;
}
failure[i] = j;
}
return failure;
}
}
}
Seriously, why not just do the following?
var indexFound = documentContent.IndexOf("data");
Converting strings into byte arrays and then searching those doesn't make much sense to me when your original data is text. You can always find the byte position afterwards if you wish.
UTF-8 is a variable multi-byte format. Searching for English text in Korean data will never match on a direct pattern match. If you are scanning text you would be much better off using .IndexOf(pattern) [as Noldorin pointed out] or .Contains(pattern).
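If the byte position is still needed after searching the text, one possible approach is to search with `string.IndexOf` and then count the UTF-8 bytes that precede the match. This is only a sketch; `Utf8Offsets` and `ByteOffsetOf` are hypothetical names, and an ordinal (culture-insensitive) comparison is assumed.

```csharp
using System;
using System.Text;

public static class Utf8Offsets
{
    // Hypothetical helper: returns the UTF-8 byte offset of the first
    // occurrence of pattern at or after the character index startIndex,
    // or -1 if the pattern does not occur.
    public static int ByteOffsetOf(string text, string pattern, int startIndex = 0)
    {
        int charIndex = text.IndexOf(pattern, startIndex, StringComparison.Ordinal);
        if (charIndex < 0)
            return -1;
        // Count the bytes taken by everything before the match.
        return Encoding.UTF8.GetByteCount(text.Substring(0, charIndex));
    }

    public static void Main()
    {
        string documentContent = "먹다 - 먹 + 어요 = 먹어요";
        Console.WriteLine(ByteOffsetOf(documentContent, "먹"));    // prints 0
        Console.WriteLine(ByteOffsetOf(documentContent, "먹", 1)); // prints 9
    }
}
```

The first match is at offset 0 because the sample line starts with 먹, so a result of 0 means "found at the beginning", not "not found".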