How to convert many files from doc to docx with multithreading - c#

I have millions of doc files which need to be converted to docx. I am currently using the below method to convert each file in the specified directory. How can I effectively multithread this process?
static void ConvertDocToDocx(string path)
{
    Application word = new Application();
    var sourceFile = new FileInfo(path);
    var document = word.Documents.Open(sourceFile.FullName);
    string newFileName = sourceFile.FullName.Replace(".doc", ".docx");
    document.SaveAs2(newFileName, WdSaveFormat.wdFormatXMLDocument,
        CompatibilityMode: WdCompatibilityMode.wdWord2010);
    word.ActiveDocument.Close();
    word.Quit();
    //File.Delete(path);
}
My current approach is to use Directory.GetFiles to build a list of the files in my path, then use Parallel.ForEach to convert them. Here's my code:
string[] filesList = Directory.GetFiles(path);
Parallel.ForEach(filesList, new ParallelOptions { MaxDegreeOfParallelism = 20 }, file =>
{
    if (file.Contains(".doc"))
    {
        ConvertDocToDocx(file);
    }
});
However, this doesn't seem to increase performance. Am I misunderstanding the use of Parallel.ForEach?

You are using Word via automation, which is the equivalent of opening the files manually one by one and saving them. This method has one performance-increasing possibility: there is no need to create a new Word instance for each file; just reuse the first instance.
...
var wordInstance = new Application();
try
{
    var fileNameList = Directory.GetFiles(path);
    foreach (var fileName in fileNameList)
    {
        // EndsWith instead of Contains, so already-converted .docx files are skipped
        if (fileName.EndsWith(".doc", StringComparison.OrdinalIgnoreCase))
        {
            ConvertDocToDocx(wordInstance, fileName);
        }
    }
}
finally
{
    wordInstance.Quit();
}
...
static void ConvertDocToDocx(Application wordInstance, string path)
{
    var sourceFile = new FileInfo(path);
    var newFileName = sourceFile.FullName.Replace(".doc", ".docx");
    var document = wordInstance.Documents.Open(sourceFile.FullName);
    document.SaveAs2(
        newFileName,
        WdSaveFormat.wdFormatXMLDocument,
        CompatibilityMode: WdCompatibilityMode.wdWord2010);
    // Close the document we opened, rather than whatever happens to be active
    document.Close();
    //File.Delete(path);
}
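If you still want to try some parallelism on top of instance reuse, here is a sketch (untested; Microsoft does not support Office automation from multiple threads or services, so keep the degree of parallelism small) that gives each Parallel.ForEach worker its own Word instance through the localInit/localFinally overload, reusing the two-argument ConvertDocToDocx above:
var files = Directory.GetFiles(path)
    .Where(f => f.EndsWith(".doc", StringComparison.OrdinalIgnoreCase));

Parallel.ForEach(
    files,
    new ParallelOptions { MaxDegreeOfParallelism = 4 }, // each worker owns a full Word process
    () => new Application(),                            // localInit: one Word instance per worker
    (file, state, word) =>
    {
        ConvertDocToDocx(word, file);                   // the two-argument helper from above
        return word;                                    // hand the same instance to this worker's next file
    },
    word => word.Quit());                               // localFinally: shut each instance down once
Each Application is a separate winword.exe process, so the practical ceiling here is memory and Word's startup cost, not CPU.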
But, as others have already mentioned, that is the limit of this approach.
You should have a look at solutions based on knowledge of the file format, e.g. NPOI. It is a C# port of the popular Apache POI package, so if you search for "POI convert doc to docx" and find Java code, don't be put off: almost the same code will compile under C# with the NPOI package too, in most cases with only minor syntax changes.

Related

Display files in a folder to a CSV using C#

Is there a simple way to list the files in a folder to a CSV that shows whether each file in that folder is a JPG, PDF, or other, using C#?
There is no built-in way to do this in C#, but you can use the System.IO.Directory and System.IO.Path classes to get the files in a directory and then check their extensions to see whether they are JPG, PDF, or other.
using System;
using System.IO;

class Program
{
    static void Main(string[] args)
    {
        string directory = "C:\\";
        string[] files = Directory.GetFiles(directory);

        using (StreamWriter writer = new StreamWriter("files.csv"))
        {
            foreach (string file in files)
            {
                string extension = Path.GetExtension(file);

                // Classify the extension as JPG, PDF, or other
                string kind = extension.Equals(".jpg", StringComparison.OrdinalIgnoreCase) ? "JPG"
                            : extension.Equals(".pdf", StringComparison.OrdinalIgnoreCase) ? "PDF"
                            : "Other";

                writer.WriteLine("{0},{1}", file, kind);
            }
        }
    }
}
You can use Matcher (from the Microsoft.Extensions.FileSystemGlobbing package) to traverse a folder structure with wildcards, picking up specific file types and applying exclusions too.
Example method
public static async Task Demo(string parentFolder, string[] patterns, string[] excludePatterns, string fileName)
{
    StringBuilder builder = new StringBuilder();

    // Matcher comes from Microsoft.Extensions.FileSystemGlobbing
    Matcher matcher = new();
    matcher.AddIncludePatterns(patterns);
    matcher.AddExcludePatterns(excludePatterns);

    await Task.Run(() =>
    {
        foreach (string file in matcher.GetResultsInFullPath(parentFolder))
        {
            builder.AppendLine(file);
        }
    });

    // Write the matched paths to the caller-supplied file name
    await File.WriteAllTextAsync(fileName, builder.ToString());
}
Example call: find all png and pdf files, excluding those matched by the exclude array, which is explained in the docs.
string[] include = { "**/*.png", "**/*.pdf" };
string[] exclude = { "**/*in*.png", "**/*or*.png" };
await GlobbingOperations.Demo(
"Parent folder",
include,
exclude,
"Dump.txt");
When traversing files, add any other logic to pull out specifics about a file and append it to the builder variable for writing to a file.
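For example, a variant of the loop body (an illustrative assumption, not part of the original answer) that records each file's size next to its path:
foreach (string file in matcher.GetResultsInFullPath(parentFolder))
{
    var info = new FileInfo(file);               // System.IO
    builder.AppendLine($"{file},{info.Length}"); // path plus size in bytes
}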

XML file from ZIP Archive is incomplete in C#

I work with large XML files (~1,000,000 lines, 34 MB) that are stored in a ZIP archive. The XML file is used at runtime to store and load app settings and measurements. The file gets loaded with this function:
public static void LoadFile(string path, string name)
{
    using (var file = File.OpenRead(path))
    {
        using (var zip = new ZipArchive(file, ZipArchiveMode.Read))
        {
            var foundConfigurationFile = zip.Entries.First(x => x.FullName == ConfigurationFileName);
            using (var stream = new StreamReader(foundConfigurationFile.Open()))
            {
                var xmlSerializer = new XmlSerializer(typeof(ProjectConfiguration));
                var newObject = xmlSerializer.Deserialize(stream);
                CurrentConfiguration = null;
                CurrentConfiguration = newObject as ProjectConfiguration;
                AddRecentFiles(name, path);
            }
        }
    }
}
This works most of the time.
However, some files don't get read to the end, and I get an error that the file contains invalid XML. I used
foundConfigurationFile.ExtractToFile();
and found that the extracted file stops at around line 800,000. But this only happens inside this code; when I open the file in an editor, everything is there.
It looks like the ZIP doesn't get loaded correctly, or for that matter, completely.
Am I running into some limitation? Or is there an error in my code that I can't find?
The file is saved via:
using (var file = File.OpenWrite(Path.Combine(dirInfo.ToString(), fileName.ToString()) + ".pwe"))
{
    var zip = new ZipArchive(file, ZipArchiveMode.Create);
    var configurationEntry = zip.CreateEntry(ConfigurationFileName, CompressionLevel.Optimal);
    var stream = configurationEntry.Open();
    var xmlSerializer = new XmlSerializer(typeof(ProjectConfiguration));
    xmlSerializer.Serialize(stream, CurrentConfiguration);
    stream.Close();
    zip.Dispose();
}
Update:
The problem was the File.OpenWrite() method.
If you overwrite a file with this method and the new file is shorter than the old one, the result is a mix of the new file and the tail of the old one: as noted in the docs, File.OpenWrite() does not truncate the existing file first.
To do it correctly it was necessary to use the File.Create() method instead, because that method truncates the old file first.
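For reference, a minimal sketch of the corrected save code (same names as above; untested), using File.Create so that stale bytes from a previous, longer file can never survive at the end of the new archive:
using (var file = File.Create(Path.Combine(dirInfo.ToString(), fileName.ToString()) + ".pwe"))
using (var zip = new ZipArchive(file, ZipArchiveMode.Create))
{
    var configurationEntry = zip.CreateEntry(ConfigurationFileName, CompressionLevel.Optimal);
    using (var stream = configurationEntry.Open())
    {
        var xmlSerializer = new XmlSerializer(typeof(ProjectConfiguration));
        xmlSerializer.Serialize(stream, CurrentConfiguration);
    }
    // Disposing the ZipArchive (end of the using block) writes the central directory
}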

Trying to compress files from multiple folders

I have files stored in multiple different folders and I need to make a ZIP archive of all the files in those folders. I have created a simple function using System.IO.Compression that takes the data from just one folder and makes a ZIP archive, but I can't figure out how to do it for multiple folders. No folder structure is needed in the ZIP, just the files.
If it can't be done in this library, I can use a different one like DotNetZip or similar.
string folder1 = @"c:\ex\ZipFolder1";
string zipPath = @"c:\ex\AllFiles.zip";

ZipFile.CreateFromDirectory(folder1, zipPath);
It looks like you're using DotNetZip. Just create a ZipFile and add files with the AddFiles method:
using (var file = new ZipFile())
{
    // fileNames is an array containing the paths of the files from different folders
    file.AddFiles(fileNames);
    file.Save(zipFile);
}
Have you tried the AddFile() method of the ZipFile class? You can replace 'PathOfFiles' dynamically as per your requirement.
var fileNames = new string[] { "a.txt", "b.xlsx", "c.png" };
using (var zip = new ZipFile())
{
    foreach (var file in fileNames)
        zip.AddFile(@"PathOfFiles\" + file, "");

    zip.Name = "ZipFile";
    var pushStreamContent = new PushStreamContent((stream, content, context) =>
    {
        zip.Save(stream);
        stream.Close();
    }, "application/zip");
}
For anyone encountering this today, here's an updated, short and simple answer based on Microsoft's docs:
https://learn.microsoft.com/en-us/dotnet/api/system.io.compression.zipfileextensions.createentryfromfile?view=net-6.0
static void SaveFilesToZip(string zipTargetPath, string[] filePaths)
{
    using var newZip = ZipFile.Open(zipTargetPath, ZipArchiveMode.Create);
    foreach (var filePath in filePaths)
    {
        newZip.CreateEntryFromFile(filePath, Path.GetFileName(filePath));
    }
}
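To cover the multiple-folders part of the question, here is a possible call site (the folder paths are placeholders; requires System.Linq) that flattens files from several folders into one array before zipping:
string[] folders = { @"c:\ex\ZipFolder1", @"c:\ex\ZipFolder2" };
string[] allFiles = folders
    .SelectMany(folder => Directory.GetFiles(folder)) // no folder structure kept, just the files
    .ToArray();

SaveFilesToZip(@"c:\ex\AllFiles.zip", allFiles);
Note that two files with the same name in different folders will produce duplicate entry names in the archive, so deduplicate first if that can happen.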

Storing and getting back files from MongoDB

I am working in C# .NET 4.5.
I have to upload some files to MongoDB, and in another module I have to get them back based on metadata.
For that I am doing the following:
static void uploadFileToMongoDB(GridFSBucket gridFsBucket)
{
    if (Directory.Exists(_sourceFilePath))
    {
        if (!Directory.Exists(_uploadedFilePath))
            Directory.CreateDirectory(_uploadedFilePath);

        FileInfo[] sourceFileInfo = new DirectoryInfo(_sourceFilePath).GetFiles();
        foreach (FileInfo fileInfo in sourceFileInfo)
        {
            string filePath = fileInfo.FullName;
            string remoteFileName = fileInfo.Name;
            string extension = Path.GetExtension(filePath);
            double fileCreationDate = fileInfo.CreationTime.ToOADate();

            GridFSUploadOptions gridUploadOption = new GridFSUploadOptions
            {
                Metadata = new BsonDocument
                {
                    { "creationDate", fileCreationDate },
                    { "extension", extension }
                }
            };

            using (Stream fileStream = File.OpenRead(filePath))
                gridFsBucket.UploadFromStream(remoteFileName, fileStream, gridUploadOption);
        }
    }
}
and downloading,
static void getFileInfoFromMongoDB(GridFSBucket bucket, DateTime startDate, DateTime endDate)
{
    double startDateDouble = startDate.ToOADate();
    double endDateDouble = endDate.ToOADate();

    var filter = Builders<GridFSFileInfo>.Filter.And(
        Builders<GridFSFileInfo>.Filter.Gt(x => x.Metadata["creationDate"], startDateDouble),
        Builders<GridFSFileInfo>.Filter.Lt(x => x.Metadata["creationDate"], endDateDouble));

    IAsyncCursor<GridFSFileInfo> fileInfoList = bucket.Find(filter); //****

    if (!Directory.Exists(_destFilePath))
        Directory.CreateDirectory(_destFilePath);

    foreach (GridFSFileInfo fileInfo in fileInfoList.ToList())
    {
        string destFile = _destFilePath + "\\" + fileInfo.Filename;
        var fileContent = bucket.DownloadAsBytes(fileInfo.Id); //****
        File.WriteAllBytes(destFile, fileContent);
    }
}
In this code (which works) I have two problems that I am not sure how to fix.

1. If I have uploaded a file and I upload it again, it actually gets uploaded again. How can I prevent that? Of course the two uploads get different ObjectIds, but at upload time I have no way of knowing which files have already been uploaded. So I want a mechanism that throws an exception if I upload an already-uploaded file. Is that possible? (I can use a combination of file name, creation date, etc.)

2. As you may have noticed in the code, I am making two requests to the database server to get one file written to disk. How can I do it in one shot? Note the lines I have marked with the "//****" comment. First I query the database for the file info (GridFSFileInfo). I expected to be able to get the actual content of the file from that object, but I did not find any related property or method on it, so I had to call var fileContent = bucket.DownloadAsBytes(fileInfo.Id); to get the content. Am I missing something basic here?
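For the duplicate-upload part (question 1), one possible sketch against the same GridFSBucket and "creationDate" metadata used above; the helper name and the choice of file name plus creation date as the identity are assumptions, not a MongoDB feature:
static void ThrowIfAlreadyUploaded(GridFSBucket bucket, string remoteFileName, double fileCreationDate)
{
    // Look for an existing GridFS file with the same name and creation date
    // (GridFS itself does not enforce unique file names)
    var dupFilter = Builders<GridFSFileInfo>.Filter.And(
        Builders<GridFSFileInfo>.Filter.Eq(x => x.Filename, remoteFileName),
        Builders<GridFSFileInfo>.Filter.Eq(x => x.Metadata["creationDate"], fileCreationDate));

    if (bucket.Find(dupFilter).ToList().Count > 0)
        throw new InvalidOperationException(remoteFileName + " has already been uploaded.");
}
Calling this before UploadFromStream in the upload loop would give the exception-on-duplicate behaviour asked for, at the cost of one extra query per file.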

How to use SharpSVN to modify file xml and commit modified file

I am a SharpSVN newbie.
We are currently busy with rebranding, and this entails updating all our reports with new colours etc. There are too many reports to do manually, so I am trying to find a way to find and replace the colours/fonts etc in one go.
Our reports are serialized and stored in a database, which is easy to replace, but we also want to apply the changes in the .rdl reports in our source control, which is Subversion.
My question is the following:
I know you can write files to a stream with SharpSVN, which I have done; now I would like to push the updated XML back into Subversion as the latest version.
Is this at all possible? And if so, how would I go about doing it? I have googled a lot, but haven't been able to find any definitive answer to this.
My code so far (keep in mind this is a once-off thing, so I'm not too concerned about clean code etc.):
private void ReplaceFiles()
{
    SvnCommitArgs args = new SvnCommitArgs();
    SvnCommitResult result;
    args.LogMessage = "Rebranding - Replace fonts, colours, and font sizes";

    using (SvnClient client = new SvnClient())
    {
        client.Authentication.DefaultCredentials = new NetworkCredential("mkassa", "Welcome1");
        client.CheckOut(SvnUriTarget.FromString(txtSubversionDirectory.Text), txtCheckoutDirectory.Text);
        client.Update(txtCheckoutDirectory.Text);

        SvnUpdateResult upResult;
        client.Update(txtCheckoutDirectory.Text, out upResult);

        ProcessDirectory(txtCheckoutDirectory.Text, args, client);
    }
    MessageBox.Show("Done");
}
// Process all files in the directory passed in, recurse on any directories
// that are found, and process the files they contain.
public void ProcessDirectory(string targetDirectory, SvnCommitArgs args, SvnClient client)
{
    var ext = new List<string> { ".rdl" };

    // Process the list of files found in the directory.
    IEnumerable<string> fileEntries = Directory.EnumerateFiles(targetDirectory, "*.*", SearchOption.AllDirectories)
        .Where(s => ext.Any(e => s.EndsWith(e)));

    foreach (string fileName in fileEntries)
        ProcessFile(fileName, args, client);
}
private void ProcessFile(string fileName, SvnCommitArgs args, SvnClient client)
{
    using (MemoryStream stream = new MemoryStream())
    {
        SvnCommitResult result;
        if (client.Write(SvnTarget.FromString(fileName), stream))
        {
            stream.Position = 0;
            using (var reader = new StreamReader(stream))
            {
                string contents = reader.ReadToEnd();
                DoReplacement(contents);
                client.Commit(txtCheckoutDirectory.Text, args, out result);
                //if (result != null)
                //    MessageBox.Show(result.PostCommitError);
            }
        }
    }
}
Thank you to anyone who can provide some insight on this!
You don't want to perform a merge on the file; you would only use that to merge changes from one location into another location.
If you can't just check out your entire tree and replace+commit on that, you might be able to use something based on:
string tmpDir = @"C:\tmp\mytmp";

using (SvnClient svn = new SvnClient())
{
    List<Uri> toProcess = new List<Uri>();
    svn.List(new Uri("http://my-repos/trunk"),
        new SvnListArgs { Depth = SvnDepth.Infinity },
        delegate(object sender, SvnListEventArgs e)
        {
            if (e.Path.EndsWith(".rdl", StringComparison.OrdinalIgnoreCase))
                toProcess.Add(e.Uri);
        });

    foreach (Uri i in toProcess)
    {
        Console.WriteLine("Processing {0}", i);
        Directory.Delete(tmpDir, true);

        // Create a sparse checkout with just one file (see svnbook.org)
        string name = SvnTools.GetFileName(i);
        string fileName = Path.Combine(tmpDir, name);
        svn.CheckOut(new SvnUriTarget(new Uri(i, "./")), tmpDir, new SvnCheckOutArgs { Depth = SvnDepth.Empty });
        svn.Update(fileName);

        ProcessFile(fileName); // Read file and save in same location

        // Note that the following commit is a no-op if the file wasn't
        // changed, so you don't have to check for that yourself
        svn.Commit(fileName, new SvnCommitArgs { LogMessage = "Processed" });
    }
}
Once you have updated trunk, I would recommend merging that change to your maintenance branches, and only fixing them afterwards if necessary. Otherwise further merges will be harder to perform than necessary.
I managed to get this done. Posting the answer for future reference.
Basically, all I had to do was create a new .rdl file with the modified XML and replace the checked-out file with the new one before committing.
string contents = reader.ReadToEnd();
contents = DoReplacement(contents);

// Create an XML document from the replaced contents
XmlDocument doc = new XmlDocument();
string xmlData = contents;
doc.Load(new StringReader(xmlData));

// Save the XML document over the checked-out file
doc.Save(fileName);

client.Commit(txtCheckoutDirectory.Text, args, out result);
Hopefully this will help anyone needing to do the same.
