Storing and retrieving files from MongoDB - C#

I am working in C# on .NET 4.5.
I have to upload some files to MongoDB, and in another module I have to get them back based on metadata.
For that I am doing the following:
static void uploadFileToMongoDB(GridFSBucket gridFsBucket)
{
    if (Directory.Exists(_sourceFilePath))
    {
        if (!Directory.Exists(_uploadedFilePath))
            Directory.CreateDirectory(_uploadedFilePath);
        FileInfo[] sourceFileInfo = new DirectoryInfo(_sourceFilePath).GetFiles();
        foreach (FileInfo fileInfo in sourceFileInfo)
        {
            string filePath = fileInfo.FullName;
            string remoteFileName = fileInfo.Name;
            string extension = Path.GetExtension(filePath);
            double fileCreationDate = fileInfo.CreationTime.ToOADate();
            GridFSUploadOptions gridUploadOption = new GridFSUploadOptions
            {
                Metadata = new BsonDocument
                {
                    { "creationDate", fileCreationDate },
                    { "extension", extension }
                }
            };
            using (Stream fileStream = File.OpenRead(filePath))
                gridFsBucket.UploadFromStream(remoteFileName, fileStream, gridUploadOption);
        }
    }
}
And for downloading:
static void getFileInfoFromMongoDB(GridFSBucket bucket, DateTime startDate, DateTime endDate)
{
    double startDateDouble = startDate.ToOADate();
    double endDateDouble = endDate.ToOADate();
    var filter = Builders<GridFSFileInfo>.Filter.And(
        Builders<GridFSFileInfo>.Filter.Gt(x => x.Metadata["creationDate"], startDateDouble),
        Builders<GridFSFileInfo>.Filter.Lt(x => x.Metadata["creationDate"], endDateDouble));
    IAsyncCursor<GridFSFileInfo> fileInfoList = bucket.Find(filter); //****
    if (!Directory.Exists(_destFilePath))
        Directory.CreateDirectory(_destFilePath);
    foreach (GridFSFileInfo fileInfo in fileInfoList.ToList())
    {
        string destFile = Path.Combine(_destFilePath, fileInfo.Filename);
        var fileContent = bucket.DownloadAsBytes(fileInfo.Id); //****
        File.WriteAllBytes(destFile, fileContent);
    }
}
This code works, but I have two problems that I am not sure how to fix.

1. If I have already uploaded a file and I upload it again, it actually gets uploaded a second time. How do I prevent that?

Of course both uploaded files have different ObjectIds, but at upload time I will not know which files have already been uploaded. So I want a mechanism that throws an exception if I try to upload an already-uploaded file. Is that possible? (I could use a combination of file name, creation date, etc.)
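(One possible approach, sketched below: query the bucket for a file with the same name and creation date before uploading, and throw if one is found. This reuses the same GridFSBucket and metadata layout as the upload code above; the helper name is mine.)

static void uploadIfNotExists(GridFSBucket bucket, string remoteFileName, string filePath, double fileCreationDate)
{
    // Look for an already-uploaded file with the same name and creation date.
    var duplicateFilter = Builders<GridFSFileInfo>.Filter.And(
        Builders<GridFSFileInfo>.Filter.Eq(x => x.Filename, remoteFileName),
        Builders<GridFSFileInfo>.Filter.Eq(x => x.Metadata["creationDate"], fileCreationDate));
    if (bucket.Find(duplicateFilter).FirstOrDefault() != null)
        throw new InvalidOperationException("File '" + remoteFileName + "' has already been uploaded.");

    // Not found: proceed with the normal upload.
    var options = new GridFSUploadOptions
    {
        Metadata = new BsonDocument { { "creationDate", fileCreationDate } }
    };
    using (Stream fileStream = File.OpenRead(filePath))
        bucket.UploadFromStream(remoteFileName, fileStream, options);
}

Note that this check-then-upload is not atomic: two concurrent uploaders could both pass the check. For a hard guarantee you would need something like a unique index on the fs.files collection.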
2. As you may have noticed in the code, I am actually making two requests to the database server to get a single file written to disk. How can I do it in one round trip?

Note the lines I have marked with the "//****" comment. First I query the database for the file info (GridFSFileInfo). I expected to be able to get the actual file content from that object, but I did not find any related property or method on it, so I had to call var fileContent = bucket.DownloadAsBytes(fileInfo.Id); to get the content. Am I missing something basic here?
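(As far as I can tell, GridFSFileInfo is only a projection of the fs.files document, i.e. id, name, length, metadata; the content lives in separate chunk documents, so a second call is unavoidable with this API. What can be avoided is buffering the whole file in memory: DownloadToStream writes the chunks straight to disk. A sketch of the loop rewritten that way:)

foreach (GridFSFileInfo fileInfo in fileInfoList.ToList())
{
    string destFile = Path.Combine(_destFilePath, fileInfo.Filename);
    // Stream the chunks directly into the target file instead of materializing a byte[].
    using (Stream target = File.Create(destFile))
        bucket.DownloadToStream(fileInfo.Id, target);
}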

Related

XML file from ZIP Archive is incomplete in C#

I work with large XML files (~1,000,000 lines, 34 MB) that are stored in a ZIP archive. The XML file is used at runtime to store and load app settings and measurements. The file gets loaded with this function:
public static void LoadFile(string path, string name)
{
    using (var file = File.OpenRead(path))
    {
        using (var zip = new ZipArchive(file, ZipArchiveMode.Read))
        {
            var foundConfigurationFile = zip.Entries.First(x => x.FullName == ConfigurationFileName);
            using (var stream = new StreamReader(foundConfigurationFile.Open()))
            {
                var xmlSerializer = new XmlSerializer(typeof(ProjectConfiguration));
                var newObject = xmlSerializer.Deserialize(stream);
                CurrentConfiguration = null;
                CurrentConfiguration = newObject as ProjectConfiguration;
                AddRecentFiles(name, path);
            }
        }
    }
}
This works most of the time.
However, some files don't get read to the end, and I get an error that the file contains invalid XML. I used
foundConfigurationFile.ExtractToFile();
and found that the extracted file stops at around line 800,000. But this only happens inside this code. When I open the file in an editor, everything is there.
It looks like the ZIP doesn't get loaded correctly, or for that matter, completely.
Am I running into some limitation? Or is there an error in my code that I can't find?
The file is saved via:
using (var file = File.OpenWrite(Path.Combine(dirInfo.ToString(), fileName.ToString()) + ".pwe"))
{
    var zip = new ZipArchive(file, ZipArchiveMode.Create);
    var configurationEntry = zip.CreateEntry(ConfigurationFileName, CompressionLevel.Optimal);
    var stream = configurationEntry.Open();
    var xmlSerializer = new XmlSerializer(typeof(ProjectConfiguration));
    xmlSerializer.Serialize(stream, CurrentConfiguration);
    stream.Close();
    zip.Dispose();
}
Update:
The problem was the File.OpenWrite() method.
If you try to overwrite a file with this method, the result is a mix of the old file and the new file whenever the new file is shorter than the old one: File.OpenWrite() doesn't truncate the old file first, as stated in the docs.
To do it correctly it was necessary to use the File.Create() method, because that method truncates the old file first.
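For reference, a minimal corrected version of the save path: the same structure as above, just File.Create instead of File.OpenWrite, plus using blocks so the archive and entry stream are always disposed:

// File.Create truncates an existing file, so leftover bytes from a longer
// old archive can no longer survive at the end of the new one.
using (var file = File.Create(Path.Combine(dirInfo.ToString(), fileName.ToString()) + ".pwe"))
using (var zip = new ZipArchive(file, ZipArchiveMode.Create))
{
    var configurationEntry = zip.CreateEntry(ConfigurationFileName, CompressionLevel.Optimal);
    using (var stream = configurationEntry.Open())
    {
        var xmlSerializer = new XmlSerializer(typeof(ProjectConfiguration));
        xmlSerializer.Serialize(stream, CurrentConfiguration);
    }
}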

How to convert many files from doc to docx with multithreading

I have millions of doc files which need to be converted to docx. I am currently using the below method to convert each file in the specified directory. How can I effectively multithread this process?
static void ConvertDocToDocx(string path)
{
    Application word = new Application();
    var sourceFile = new FileInfo(path);
    var document = word.Documents.Open(sourceFile.FullName);
    string newFileName = sourceFile.FullName.Replace(".doc", ".docx");
    document.SaveAs2(newFileName, WdSaveFormat.wdFormatXMLDocument,
        CompatibilityMode: WdCompatibilityMode.wdWord2010);
    word.ActiveDocument.Close();
    word.Quit();
    //File.Delete(path);
}
My current approach is to use Directory.GetFiles to create a list of files which are in my path, then use Parallel.ForEach to convert the files. Here's my code:
string[] filesList = Directory.GetFiles(path);
Parallel.ForEach(filesList, new ParallelOptions { MaxDegreeOfParallelism = 20 }, file =>
{
    if (file.Contains(".doc"))
    {
        ConvertDocToDocx(file);
    }
});
However, this doesn't seem to increase performance. Am I misunderstanding the use of Parallel.ForEach?
You are using Word via automation, which is the equivalent of opening the files manually one by one and saving them. This approach leaves one performance opportunity: there is no need to create a new Word instance for each file; reuse a single instance instead.
...
var wordInstance = new Application();
try
{
    var fileNameList = Directory.GetFiles(path);
    foreach (var fileName in fileNameList)
    {
        if (fileName.Contains(".doc"))
        {
            ConvertDocToDocx(wordInstance, fileName);
        }
    }
}
finally
{
    wordInstance.Quit();
}
...
static void ConvertDocToDocx(Application wordInstance, string path)
{
    var sourceFile = new FileInfo(path);
    var newFileName = sourceFile.FullName.Replace(".doc", ".docx");
    var document = wordInstance.Documents.Open(sourceFile.FullName);
    document.SaveAs2(
        newFileName,
        WdSaveFormat.wdFormatXMLDocument,
        CompatibilityMode: WdCompatibilityMode.wdWord2010);
    wordInstance.ActiveDocument.Close();
    //File.Delete(path);
}
But as others have already mentioned, that is the limit of this approach.
You should have a look at solutions based on knowledge of the file format, e.g. NPOI. It is a C# port of the popular Apache POI package, so if you search for "POI convert doc to docx" and find Java code, don't be put off: almost the same code will compile under C# with the NPOI package, in most cases with only minor syntax changes.

Saving multiple images using File.WriteAllBytes saves only the last one

I'm trying to save multiple images with File.WriteAllBytes(). Even after I tried to separate the saves with Thread.Sleep(), it's not working.
My code:
byte[] signatureBytes = Convert.FromBase64String(model.Signature);
byte[] idBytes = Convert.FromBase64String(model.IdCapture);
//Saving the images as PNG extension.
FileManager.SaveFile(signatureBytes, dirName, directoryPath, signatureFileName);
FileManager.SaveFile(idBytes, dirName, directoryPath, captureFileName);
SaveFile Function:
public static void SaveFile(byte[] imageBytes, string dirName, string path, string fileName, string fileExt = "jpg")
{
    if (!string.IsNullOrEmpty(dirName)
        && !string.IsNullOrEmpty(path)
        && !string.IsNullOrEmpty(fileName)
        && imageBytes.Length > 0)
    {
        var dirPath = Path.Combine(path, dirName);
        var di = new DirectoryInfo(dirPath);
        if (!di.Exists)
            di.Create();
        if (di.Exists)
        {
            File.WriteAllBytes(dirPath + $@"\{fileName}.{fileExt}", imageBytes);
        }
    }
    else
        throw new Exception("File cannot be created, one of the parameters is null or empty.");
}
File.WriteAllBytes():
"Creates a new file, writes the specified byte array to the file, and then closes the file. If the target file already exists, it is overwritten"
As specified in:
https://msdn.microsoft.com/en-ca/library/system.io.file.writeallbytes(v=vs.110).aspx
So if you can only see the last one, you are overwriting the file.
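If both calls really do end up with the same fileName value, a minimal (hypothetical) fix is to make the names distinct before saving, for example:

// Illustrative only: distinct names guarantee the second write cannot
// overwrite the first.
FileManager.SaveFile(signatureBytes, dirName, directoryPath, "signature_" + signatureFileName);
FileManager.SaveFile(idBytes, dirName, directoryPath, "idCapture_" + captureFileName);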
Apart from the possibility (as mentioned by @Daniel) that you're overwriting the same file, I'm not sure about this code:
var di = new DirectoryInfo(dirPath);
if (!di.Exists)
    di.Create();
if (di.Exists)
{
    ...
}
I'd be surprised if, having called di.Create(), the Exists property is updated. In fact, it is not updated - I checked.
So, if the directory did not exist, then you won't enter the conditional part even after creating the directory. Could that explain your issue?
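A sketch of two ways around the stale value: Refresh() re-reads the directory's state, or Directory.CreateDirectory sidesteps the check entirely because it is a no-op when the directory already exists:

var di = new DirectoryInfo(dirPath);
if (!di.Exists)
{
    di.Create();
    di.Refresh(); // re-queries the file system so Exists reflects reality
}

// Or simply:
Directory.CreateDirectory(dirPath); // creates if missing, does nothing otherwise
File.WriteAllBytes(Path.Combine(dirPath, fileName + "." + fileExt), imageBytes);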

How to tell if Azure Container has Virtual folders/directories?

I am working on a project that reads data from an Azure Blob and saves that data into an object. I am currently running into a problem. The way my code is set up now, it will read all the .txt data within a container if there are no virtual folders present.
However, if there is a virtual folder structure present within an Azure container, my code will error out with a NullReferenceException. My idea was to do an if-check to see if there are virtual folders present within an Azure container and, if so, execute //some code. Is there a way to tell if a virtual folder is present?
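(One way to detect virtual folders, assuming the classic WindowsAzure.Storage client used in the code below: list the container hierarchically, i.e. without flat listing, and check for CloudBlobDirectory items. A sketch; requires System.Linq:)

// Hierarchical listing surfaces each virtual folder as a CloudBlobDirectory,
// so their presence can be tested directly.
private bool ContainerHasVirtualFolders(CloudBlobContainer container)
{
    return container.ListBlobs(useFlatBlobListing: false)
                    .OfType<CloudBlobDirectory>()
                    .Any();
}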
ReturnBlobObject()
private List<Blob> ReturnBlobObject(O365 o365)
{
    List<Blob> listResult = new List<Blob>();
    string textToFindPattern = "(\\/)";
    string fileName = null;
    string content = null;
    //Loop through all blobs and split the container from the file name.
    foreach (var blobItem in o365.Container.ListBlobs(useFlatBlobListing: true))
    {
        string containerAndFileName = blobItem.Parent.Uri.MakeRelativeUri(blobItem.Uri).ToString();
        string[] subString = Regex.Split(containerAndFileName, textToFindPattern);
        //subString[2] is the name of the file.
        fileName = subString[2];
        content = ReadFromBlobStream(o365.Container.GetBlobReference(subString[2]));
        Blob blobObject = new Blob(fileName, content);
        listResult.Add(blobObject);
    }
    return listResult;
}
ReadFromBlobStream
private string ReadFromBlobStream(CloudBlob blob)
{
    Stream stream = blob.OpenRead();
    using (StreamReader reader = new StreamReader(stream))
    {
        return reader.ReadToEnd();
    }
}
I was able to solve this by refactoring my code. Instead of using Regex, which was returning some very odd behavior, I decided to take a step back and rethink the problem. Below is the solution I came up with.
ReturnBlobObject()
private List<Blob> ReturnBlobObject(O365 o365)
{
    List<Blob> listResult = new List<Blob>();
    //Loop through all blobs and strip the container name from the file name.
    foreach (var blobItem in o365.Container.ListBlobs(useFlatBlobListing: true))
    {
        string fileName = blobItem.Uri.LocalPath.Replace(string.Format("/{0}/", o365.Container.Name), "");
        string content = ReadFromBlobStream(o365.Container.GetBlobReference(fileName));
        Blob blobObject = new Blob(fileName, content);
        listResult.Add(blobObject);
    }
    return listResult;
}

Don't modify a file's edit date when it is added to an archive

I'm creating a ZIP archive using the SharpCompress library. I successfully create the archive, but the library automatically updates each file's edit date to the current datetime. I don't want this behaviour. What I would like is for the edit date to be unchanged (i.e. the edit date of a file in the archive is the same as that of the file before archiving).
How can I avoid this behaviour? This is my code:
private String CreaPacchettoZip(String idProcesso, String pdfBasePath)
{
    List<String> listaPdfDiProcesso = FileHelper.EstraiListaPdfDaDirecotry(pdfBasePath);
    String zipFile = Path.Combine(pdfBasePath, idProcesso + ".zip");
    using (var archive = ZipArchive.Create())
    {
        foreach (String file in listaPdfDiProcesso)
        {
            archive.AddEntry(file, new FileInfo(Path.Combine(pdfBasePath, file)));
        }
        using (Stream newStream = File.Create(zipFile))
        {
            archive.SaveTo(newStream, SharpCompress.Common.CompressionType.None);
        }
    }
    return zipFile;
}
Somewhere in here, read the file's original timestamp from the FileInfo and pass it along explicitly. A sketch, assuming your SharpCompress version exposes the stream-based AddEntry overload with a modified parameter:

foreach (String file in listaPdfDiProcesso)
{
    var fileInfo = new FileInfo(Path.Combine(pdfBasePath, file));
    // Capture the original last-write time before adding the entry.
    var originalEditDate = fileInfo.LastWriteTime;
    // The stream overload lets us supply the timestamp ourselves instead of
    // letting the library pick one.
    archive.AddEntry(file, fileInfo.OpenRead(), closeStream: true,
        size: fileInfo.Length, modified: originalEditDate);
}
