Identify & Remove "Duplicate" PPT files that are not 100% the same [closed] - c#

As part of the Discovery process for an upcoming project, I am trying to find a way of taking a representative sample of the PPT files on our network. So far, I have collected and organized all of the PPT files that we have; however, I've realized the volume of documents is overwhelming, so I need a way to reduce it. To this end, I was thinking it would be helpful to delete all "duplicate" files.
Our company does not have any sort of version control system for files on our network. As such, users often create copies of files in order to make small changes, which has led to a high volume of "duplicate" files with no real naming convention. Ideally, I'd be able to make a best guess as to which files are "duplicates" and keep the most recent version. Since I just need a representative sample, I do not need to be 100% accurate in the save/delete decision, and it's OK if I lose a chunk of the files along the way (there are currently 135K files, and I expect to end up with 3-5K). I am not sure how to go about this, as tools like http://www.easyduplicatefinder.com/ seem to look for truly identical documents, as opposed to applying a more nuanced notion of similarity.
Here are a couple of additional details:
File names do not follow any standard convention
I think it's fair to assume that many of the PPT properties would remain unchanged across versions
Versions of files are always located in the same folder, however other PPT files may also exist in the same folder
I'm open to addressing this problem in any of the following languages/technologies: C#, VB, Ruby, Python, IronPython, PowerShell

I would approach it like this (rough sketch below):
extract all visible text strings from each .ppt file
dump the strings into text files, one per .ppt
run diff across all pairs of text files (in the same directory?) to get min edit distance
run the resulting distance matrix through a clustering algorithm
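Here is a minimal sketch of that pipeline in C#, assuming the files are in the Open XML (.pptx) format so the DocumentFormat.OpenXml NuGet package can read them; legacy binary .ppt files would need conversion or Office interop first. The share path and the 0.9 similarity threshold are placeholders to tune against your data.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using A = DocumentFormat.OpenXml.Drawing;

// Extract all visible text runs from every slide of a .pptx file.
static string ExtractSlideText(string path)
{
    using var doc = PresentationDocument.Open(path, false);
    var parts = doc.PresentationPart?.SlideParts ?? Enumerable.Empty<SlidePart>();
    return string.Join("\n", parts.Select(sp =>
        string.Join(" ", sp.Slide.Descendants<A.Text>().Select(t => t.Text))));
}

// Plain Levenshtein edit distance; fine for a few KB of text per file.
static int Levenshtein(string a, string b)
{
    var d = new int[a.Length + 1, b.Length + 1];
    for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
    for (int j = 0; j <= b.Length; j++) d[0, j] = j;
    for (int i = 1; i <= a.Length; i++)
        for (int j = 1; j <= b.Length; j++)
            d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                               d[i - 1, j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1));
    return d[a.Length, b.Length];
}

// Compare files folder by folder, since versions live in the same directory.
foreach (var group in Directory.EnumerateFiles(@"\\server\ppt-share", "*.pptx",
             SearchOption.AllDirectories).GroupBy(Path.GetDirectoryName))
{
    var texts = group.ToDictionary(f => f, ExtractSlideText);
    foreach (var (f1, f2) in texts.Keys.SelectMany(
                 (x, i) => texts.Keys.Skip(i + 1).Select(y => (x, y))))
    {
        double dist = Levenshtein(texts[f1], texts[f2]);
        double similarity = 1.0 - dist / Math.Max(1, Math.Max(texts[f1].Length, texts[f2].Length));
        if (similarity > 0.9)   // threshold is a guess; tune per data set
            Console.WriteLine($"Probable duplicates: {f1} <-> {f2}");
    }
}
```

With the pairwise similarities in hand, you could keep only the newest file from each cluster of probable duplicates, which is enough for a representative sample.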

Related

How to track content changes of a specific folder [closed]

Here is my scenario:
A customer reports that a specific type of file appears in a specific folder when saving some files with our system. Our system works together with another system, not developed by us. I want to find out which system is creating such files.
This is what makes it more difficult:
I suppose that those files appear only temporarily, so in my small development scenario it is nearly impossible to catch them. But when working with many thousands of files, I suppose the number of temporary files will increase, and, due to the limits of the customer's hardware, they will exist much longer.
So what I am looking for is a tool which traces all changes of content in a specific folder. Ideally, I could filter for a specific type of file. It should work on Win10.
My questions:
Does anybody know such a tool, or could you give me a suitable keyword to search for?
Or is this too specific, so I have to make my own tool?
In the second case I usually prefer C#/.NET. Is there anything suitable available that I can extend, change, or use directly (e.g. a tool, framework, or NuGet package, or extending a tool such as Everything)?
The namespace System.IO has a class that allows file and folder monitoring: FileSystemWatcher.
From the documentation:
Listens to the file system change notifications and raises events when
a directory, or file in a directory, changes.
See https://learn.microsoft.com/en-us/dotnet/api/system.io.filesystemwatcher?view=net-6.0.
The documentation above gives a good code example, as well as many explanations about how changes are tracked and what the limitations are.
You can use this class to log each change in a target folder, and then use this log to understand what happens.
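As a starting point, here is a minimal logging sketch using that class; the folder path and the *.tmp filter are placeholders for the customer's folder and the suspicious file type.

```csharp
using System;
using System.IO;

// Watch one folder for a specific file pattern and append every event to a log.
var watcher = new FileSystemWatcher(@"C:\path\to\watched\folder", "*.tmp")
{
    IncludeSubdirectories = false,
    NotifyFilter = NotifyFilters.FileName | NotifyFilters.LastWrite | NotifyFilters.Size
};

void Log(string text) =>
    File.AppendAllText("watch.log", $"{DateTime.Now:O} {text}{Environment.NewLine}");

watcher.Created += (s, e) => Log($"Created: {e.FullPath}");
watcher.Changed += (s, e) => Log($"Changed: {e.FullPath}");
watcher.Deleted += (s, e) => Log($"Deleted: {e.FullPath}");
watcher.Renamed += (s, e) => Log($"Renamed: {e.OldFullPath} -> {e.FullPath}");
watcher.Error   += (s, e) => Log($"Error: {e.GetException()}");

watcher.EnableRaisingEvents = true;
Console.WriteLine("Watching... press Enter to stop.");
Console.ReadLine();
```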
If needed, you could then narrow down the issue using a tool like Sysinternals' Process Monitor. Assuming you have gathered enough information to reproduce the problem, or to predict roughly when it could happen again, you could use Process Monitor to record system events.
Process Monitor records system events via its Capture button. You can filter the events with the provided Filter mechanism; for instance (this is a simple case), you can filter to see only events from a specific PID (process ID). You can find the PID of your target process by looking at the Details page of the Task Manager. This way you will likely find which process created which file.

Save millions of files on windows [closed]

My system will save ~20-40 million image files
Each file is 150-300KB
My application will run on Windows Server 2012 R2 and the files will be saved on storage (I don't know which kind yet)
My application is written in C#
My requirements are:
- The system will constantly delete old files and save new files (around 100K files per day)
- The most recent images will be automatically displayed to users on web and wpf applications
- I need fast access to recent files (last week) for report purposes
What is the best practice for storing / organizing this amount of files?
Broad questions much? If you're asking about how to organize them for efficient access that's a bit harder to answer without knowing the reason you're storing that many files.
Let me explain:
Let's say you're storing a ton of log files. Odds are your users are going to be most interested in the logs from the last week or so. So storing your data on disk in a way that lets you easily access the files by day (e.g. yyyy-mm-dd.log) will speed up getting to a specific day's log.
Now instead think of it like a phone book where you're looking up people's names. Storing entries by the time you inserted each name into the phone book really isn't going to help you get to the result you want quickly. Better to come up with a better sort key.
Essentially, look at how your data will be accessed and try to sort it in a logical manner, so that you can run a binary search (or something better) against it.
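For example, here is a rough sketch of the date-based layout described above, which suits your "fast access to the last week" requirement: each image goes into a yyyy\MM\dd subfolder, so last week's files map to at most seven directories instead of a scan over millions of files. The root path and .jpg naming are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

const string Root = @"D:\ImageStore";   // placeholder for the storage root

// Folder for a given timestamp, e.g. D:\ImageStore\2024\05\17
string FolderFor(DateTime timestamp) =>
    Path.Combine(Root, timestamp.ToString("yyyy"), timestamp.ToString("MM"), timestamp.ToString("dd"));

// Save one image under its date folder and return the full path.
string Save(byte[] imageBytes, DateTime timestamp)
{
    string folder = FolderFor(timestamp);
    Directory.CreateDirectory(folder);                       // no-op if it already exists
    string path = Path.Combine(folder, $"{Guid.NewGuid():N}.jpg");
    File.WriteAllBytes(path, imageBytes);
    return path;
}

// Enumerate only the folders for the last N days instead of the whole store.
IEnumerable<string> FilesForLastDays(int days)
{
    for (int i = 0; i < days; i++)
    {
        string folder = FolderFor(DateTime.Today.AddDays(-i));
        if (Directory.Exists(folder))
            foreach (var file in Directory.EnumerateFiles(folder))
                yield return file;
    }
}

Console.WriteLine($"{FilesForLastDays(7).Count()} files saved in the last week");
```

Deleting old files then becomes a matter of removing whole day folders, which is much cheaper than deleting 100K individual files scattered through one directory.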
I'd highly recommend rewording your question so it is clearer though.

How to improve Project structure in c# web application [closed]

I am working on C# web applications, and I have built various applications with a simple project structure like the image given below.
Basically, I want to know the best project structure for a large web application in C#. Or is the above project structure (in the image) the right way to build a web application?
Now I want to improve my project structure, so where should I start in order to improve and build a large web application?
There's no "best way", but there are some "worst ways". I see that you are using some of the "worst ways". Here are a few of them:
Do not use web site "projects". They are not projects. They're just whatever happens to be in a particular directory tree. I notice that you have ~1.pdf in your project, or did you just happen to have it in the same folder? Use Web Application Projects instead - use File->New Project.
Do not use App_Code. App_Code is for a small number of code files which are compiled at runtime. No other kind of "project" has code compiled at runtime. This is a bizarre corruption of computer programming.
At the very least, create separate folders for the different parts of your code which are not in code-behind. You can start with a single folder (maybe called "My_Code", not App_Code), and when you start to get "too many" files in that one folder, start separating the files based on their function. Keep one class per file, and it will become obvious quickly that different sets of classes have different sets of functions. For instance, you'll find you have a set of classes related to database access. Put those into a "DataAccess" folder.
Better, once you find that you have many such folders, move each of the folders out into their own class library project. This way, you don't have to jump right into a complex project structure - the structure can evolve over time.
And if you wind up in ten years with only three classes in My_Code, then you have saved yourself the waste of creating tiny class libraries.
You may have too much code in your code-behind. I can't see it, but projects which have the first two symptoms usually have the third. Your code-behind should only contain code that directly involves the user interface - for instance, it can call a method to fetch data to bind to a data-bound control, but the code that fetches the data should be in a different class, not in the code-behind (see the sketch after this answer). This different class should not be in App_Code, as I said above.
Are you using source control? I don't see any source control icons on your project icons. Use source control. Visual Studio Online is free for up to five users, and allows you to use either TFS or Git for source control, as well as providing work item tracking, automated builds and quite a bit more. For free.
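To illustrate the code-behind point, here is a rough Web Forms sketch; CustomerRepository, the "Main" connection string, and the CustomersGrid control are hypothetical names. The code-behind only fetches and binds, while the query logic lives in a separate class that can later move into its own class library project.

```csharp
// DataAccess/CustomerRepository.cs - query logic kept out of the page.
using System.Collections.Generic;
using System.Data.SqlClient;

public class Customer
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public class CustomerRepository
{
    private readonly string _connectionString;
    public CustomerRepository(string connectionString) => _connectionString = connectionString;

    public List<Customer> GetRecentCustomers()
    {
        var result = new List<Customer>();
        using (var conn = new SqlConnection(_connectionString))
        using (var cmd = new SqlCommand(
            "SELECT TOP 50 Id, Name FROM Customers ORDER BY CreatedOn DESC", conn))
        {
            conn.Open();
            using (var reader = cmd.ExecuteReader())
                while (reader.Read())
                    result.Add(new Customer { Id = reader.GetInt32(0), Name = reader.GetString(1) });
        }
        return result;
    }
}

// Default.aspx.cs - the code-behind stays thin: fetch, then bind.
public partial class _Default : System.Web.UI.Page
{
    protected void Page_Load(object sender, System.EventArgs e)
    {
        if (IsPostBack) return;
        var repository = new CustomerRepository(
            System.Configuration.ConfigurationManager.ConnectionStrings["Main"].ConnectionString);
        CustomersGrid.DataSource = repository.GetRecentCustomers();   // GridView declared in the .aspx
        CustomersGrid.DataBind();
    }
}
```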
I don't think there is really one correct way to structure a project. It depends on what the goals of the project are and your plans for implementation. Each project is different and will have different requirements.
I would go with the most logical approach for your particular project. Whether that's grouping similar components together or grouping whole sub-systems together.
You should carefully analyze your requirements up front before you even begin development and plan your structure accordingly. Having a good plan before you begin coding will save you a bunch of hassle down-stream if things aren't working out and you need to restructure.
I would suggest reading a book like Code Complete, it'll give you excellent tips on how to plan and structure your projects. You can find it here: Code Complete

Is there a way to improve copy of folder structure to a network location [closed]

If you are copying many folders with files inside, it is usually better to just create a ZIP/RAR of the folder and files, copy it to the network path, and unzip it there. This usually works much faster than copy-paste.
Is there a way to do this programmatically and embed it into Windows, so that it can try to detect which way is faster (the normal way or "compress on the fly") and use that one to improve speed?
"compression on the fly" is a waste unless there is something on the other end that can perform the decompress OR if the compressed state is acceptable. That said:
Yes, you can write an app that zips/rars files.
Yes, you can have that app copy the zip/rar to a network directory.
Yes, you can have an app on the other end wait for the file and unzip it locally...
Can you have it detect "which way is faster"? Although possible, it is unlikely to be of benefit for anything other than large files... at which point you should always zip/rar and transfer, which would make the entire exercise rather pointless. Of course, you should probably evaluate the data that is likely to be transferred using your app to see if it is even a candidate for compression. Video, for example, might not be...
More to the point here, each end would have to have an application that is aware of each other (or at least the protocols involved). One app (we'll call it the client) would zip and post the file to another app (we'll call that one the server). When the server receives the file it would unzip it and store it on the file system.
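Here is a minimal sketch of that zip/copy/unzip round trip using the built-in System.IO.Compression types (on older .NET Framework versions you may need a reference to System.IO.Compression.FileSystem); all paths are placeholders, and the extract step would normally run on the receiving side.

```csharp
using System.IO;
using System.IO.Compression;

string sourceFolder = @"C:\data\to-send";
string stagedZip    = Path.Combine(Path.GetTempPath(), "to-send.zip");
string networkZip   = @"\\fileserver\share\to-send.zip";
string remoteTarget = @"\\fileserver\share\to-send";

ZipFile.CreateFromDirectory(sourceFolder, stagedZip);   // compress locally first
File.Copy(stagedZip, networkZip, overwrite: true);      // one big sequential copy over the network

// On the receiving side (or from the client, if nothing runs there):
ZipFile.ExtractToDirectory(networkZip, remoteTarget);
File.Delete(networkZip);
File.Delete(stagedZip);
```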
Update
I thought of another situation for zipping: transferring lots of little files at one time. Normal network file copy routines go much faster for a single large file than for lots of little files. So, if users are selecting a few hundred files to go at once, you might be better off always zipping. Which, incidentally, doesn't change the requirement of having something on the other side able to decompress it.
Have you tried using robocopy? It's built in on Windows, robust, multi-threaded, and has a lot of options, including mirroring and retries in case of failure. I use it for all copies to network locations. Give it a try.
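Since the question asks for a programmatic solution, here is a rough sketch of driving robocopy from C#; the paths, thread count, and retry settings are placeholders. /MIR mirrors the tree, /MT copies multi-threaded, and /R and /W control retries.

```csharp
using System.Diagnostics;

var psi = new ProcessStartInfo
{
    FileName = "robocopy.exe",
    Arguments = @"""C:\data\to-send"" ""\\fileserver\share\to-send"" /MIR /MT:16 /R:2 /W:5",
    UseShellExecute = false
};

using var process = Process.Start(psi);
process.WaitForExit();

// Robocopy exit codes below 8 mean success (possibly with skipped files).
bool succeeded = process.ExitCode < 8;
```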

Simple Virus Remover [closed]

I am trying to create a simple virus remover. The algorithm I developed is meant to:
inspect the original file and the infected file
separate the virus from the infected file
use same algorithm for repairing other files infected with the virus
I know this is possible, since this is the same way patches are created, but I am a little bit lost on how to go about it.
Any help around??
You'll have to apply more intelligence than simply doing some pattern matching and removing the isolated virus code.
The viruses you are aiming at are file infectors, which are rarely seen these days.
Most of the time their replication process is as follows:
They copy themselves to the beginning or the end of the PE file
Locate the entry point of the PE file
Put a jump instruction at that location pointing at their code
Disinfecting a file is the most difficult part for any anti-virus. It relies on the quality of the virus code: if it's buggy, the host file will just be unrecoverable.
In any case, you are entering a world of machine instructions where disassemblers (IDA, PE Explorer ...), and debuggers will be your dearest friends.
Do a diff of the two files: the basic idea would be to compare the original and infected files byte by byte and save the discrepancies to some data structure. Then, in the future, you could look for the "virus" - which would hypothetically be that collection of differences - in other files and remove it.
The only problem with this is that there will probably be discrepancies between the two files which have nothing to do with the "virus", e.g. places where the infected file was legitimately modified after the original was copied.
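A minimal sketch of that byte-by-byte comparison (the file names are placeholders, and a real disinfector would need the PE-format awareness described in the other answer):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Record every run of bytes where the infected file differs from, or extends
// beyond, the original file.
static List<(long Offset, long Length)> FindDifferences(string originalPath, string infectedPath)
{
    byte[] original = File.ReadAllBytes(originalPath);
    byte[] infected = File.ReadAllBytes(infectedPath);
    var runs = new List<(long, long)>();

    long runStart = -1;
    long shared = Math.Min(original.Length, infected.Length);
    for (long i = 0; i < shared; i++)
    {
        bool differs = original[i] != infected[i];
        if (differs && runStart < 0) runStart = i;                               // a run of differences begins
        if (!differs && runStart >= 0) { runs.Add((runStart, i - runStart)); runStart = -1; }
    }
    if (runStart >= 0) runs.Add((runStart, shared - runStart));

    // Anything appended past the original's length is also suspect.
    if (infected.Length > original.Length)
        runs.Add((original.Length, infected.Length - original.Length));

    return runs;
}

foreach (var (offset, length) in FindDifferences("clean.exe", "infected.exe"))
    Console.WriteLine($"Difference at offset 0x{offset:X}, length {length} bytes");
```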
EDIT:
Checking other files for the virus would not be too hard, but I am running under the assumption that you are dealing with some plain-text form of file; for binary proprietary files, I do not think you would be able to remove the "virus".
