I'm trying to extract a .tbz file using .NET.
Anyone have any suggestions?
The file will be very large (3 GB), if that makes any difference.
First you'll need to decompress the BZip2 layer, either by implementing it yourself or by using a supporting library. Then you'll need to do much the same, in a different fashion, to un-archive the contents of the resulting TAR file.
You can use SharpZipLib:
http://www.icsharpcode.net/opensource/sharpziplib/
However, it is licensed under the GPL, with an exception that I'm not familiar with, so you'll have to read carefully to see if it suits your needs.
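As an illustration of the two steps chained together, here is a minimal sketch using SharpZipLib's BZip2 and Tar classes (paths are placeholders; check the names against the version you download). Because everything is streamed, the 3 GB file is never loaded into memory at once:

    using System.IO;
    using ICSharpCode.SharpZipLib.BZip2;
    using ICSharpCode.SharpZipLib.Tar;

    class ExtractTbz
    {
        static void Main()
        {
            // Step 1: decompress the BZip2 layer as a stream.
            using (FileStream file = File.OpenRead(@"C:\archive.tbz"))
            using (BZip2InputStream bz2 = new BZip2InputStream(file))
            {
                // Step 2: un-archive the TAR contents from the decompressed stream.
                TarArchive tar = TarArchive.CreateInputTarArchive(bz2);
                tar.ExtractContents(@"C:\output");
                tar.Close();
            }
        }
    }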
I've looked at FileHelpers v2.0, but there is a serious problem with it: I cannot define a class that maps to the records in the source/destination file.
The reason is that I don't know what file I'm going to get. A big part of my program is mapping the file's fields to the database's fields... I don't know how many fields there will be, nor which will need to be imported.
I have no intention of rolling my own library, especially since I have no control over the files that are going to be fed to my program.
Any solutions to this?
Dennis
Check out the Fast CSV Reader on CodeProject. It helped me with my project a while ago. It's really easy to use, and is quite good.
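For illustration, a minimal sketch of reading a file of unknown shape with it (class and namespace names as I recall them from the CodeProject article, so treat them as assumptions):

    using System;
    using System.IO;
    using LumenWorks.Framework.IO.Csv; // namespace from the CodeProject download

    class ReadUnknownCsv
    {
        static void Main()
        {
            using (CsvReader csv = new CsvReader(new StreamReader("input.csv"), true))
            {
                int fieldCount = csv.FieldCount;          // discovered at runtime
                string[] headers = csv.GetFieldHeaders(); // no compile-time mapping class needed
                while (csv.ReadNextRecord())
                {
                    for (int i = 0; i < fieldCount; i++)
                        Console.Write(headers[i] + "=" + csv[i] + "; ");
                    Console.WriteLine();
                }
            }
        }
    }

Since the field count and headers are only known once the file is opened, this fits the case where the mapping to database fields has to be decided at runtime.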
You can use ADO.NET to directly read the .CSV file into a DataTable. If you don't know how many fields will exist in advance, this can be a useful means of working with the data. This also has the advantage of not requiring any external libraries.
For details, please see Deborah Kurata's article on the subject.
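A minimal sketch of that approach, assuming the Jet text driver is available (the folder and file names are placeholders):

    using System;
    using System.Data;
    using System.Data.OleDb;

    class CsvToDataTable
    {
        static void Main()
        {
            // Data Source is the *folder*; the file name goes into the SELECT.
            string connStr = @"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\data;"
                + "Extended Properties=\"text;HDR=Yes;FMT=Delimited\"";

            using (OleDbDataAdapter adapter =
                new OleDbDataAdapter("SELECT * FROM [input.csv]", connStr))
            {
                DataTable table = new DataTable();
                adapter.Fill(table);

                // The schema is discovered at runtime, so an unknown field count is fine.
                foreach (DataColumn col in table.Columns)
                    Console.WriteLine(col.ColumnName);
            }
        }
    }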
StreamReader has been fast enough for me for pretty much every text file, though you are pretty screwed if you can't even guarantee value ordering.
I'm trying to create an application which will basically be a catalogue of my PDF collection. We are talking about 15-20GB containing tens of thousands of PDFs. I am also planning to include a full-text search mechanism. I will be using Lucene.NET for search (actually, NHibernate.Search), and a library for PDF->text conversion. Which would be the best choice? I was considering these:
PDFBox
pdftotext (from xpdf) via c# wrapper
iTextSharp
Edit: Another good option seems to be using IFilters. How well (speed/quality) would they perform (Foxit/Adobe) in comparison to these libraries?
Commercial libraries are probably out of the question, as it is my private project and I don't really have a budget for commercial solutions - although PDFTextStream looks really nice.
From what I've read, pdftotext is a lot faster than PDFBox. How well does iTextSharp perform in comparison to pdftotext? Or maybe someone can recommend other good solutions?
If it is for a private project, is this going to be an ongoing conversion process? E.g. after you've converted the 15-20GB, are you going to keep converting?
The reason I ask is because I'm trying to work out whether speed is your primary issue. If it were me, for example, converting a library of books, my primary concern would be the quality of the conversion, not the speed. I could always leave the conversion running overnight or over the weekend if necessary!
The desktop version of Foxit's PDF IFilter is free:
http://www.foxitsoftware.com/pdf/ifilter/
It will automatically do the indexing and searching, but perhaps their index is available for you to use as well. If you are planning to use it in an application you sell or distribute, then I guess it won't be a good choice, but if it's just for yourself, then it might work.
The Foxit code is at the core of my company's PDF Reader/Text Extraction library, which wouldn't be appropriate for your project, but I can vouch for the speed and quality of the results of the underlying Foxit engine.
I guess using any of these libraries is fine, but do you really want to scan all 20GB of files at search time?
For full-text search, the best approach is to create a database (something like SQLite or any local database on the client machine), read each PDF and convert it to plain text, and store that in the database when the PDF is first added.
Your database table could simply be as follows:
Table: PDFFiles
PDFFileID
PDFFilePath
PDFTitle
PDFAuthor
PDFKeywords
PDFFullText
You can then search this table whenever you need to. This way your search will be extremely fast regardless of the type of PDF, and the PDF-to-database conversion is needed only when a PDF is added to your collection or modified.
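A minimal sketch of that design, assuming the System.Data.SQLite ADO.NET provider and the table layout above:

    using System;
    using System.Data.SQLite; // assumes the System.Data.SQLite provider

    class PdfCatalogue
    {
        static void Main()
        {
            using (SQLiteConnection conn =
                new SQLiteConnection("Data Source=catalogue.db"))
            {
                conn.Open();

                // Created once; populated whenever a PDF is added or modified.
                using (SQLiteCommand cmd = conn.CreateCommand())
                {
                    cmd.CommandText =
                        "CREATE TABLE IF NOT EXISTS PDFFiles (" +
                        "  PDFFileID INTEGER PRIMARY KEY," +
                        "  PDFFilePath TEXT, PDFTitle TEXT, PDFAuthor TEXT," +
                        "  PDFKeywords TEXT, PDFFullText TEXT)";
                    cmd.ExecuteNonQuery();
                }

                // Searches hit the stored text, never the PDFs themselves.
                using (SQLiteCommand cmd = conn.CreateCommand())
                {
                    cmd.CommandText =
                        "SELECT PDFFilePath FROM PDFFiles WHERE PDFFullText LIKE @q";
                    cmd.Parameters.AddWithValue("@q", "%search term%");
                    using (SQLiteDataReader reader = cmd.ExecuteReader())
                        while (reader.Read())
                            Console.WriteLine(reader.GetString(0));
                }
            }
        }
    }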
My application has historically stored per-user settings in an INI file on the same file server as the data it consumes, so that the settings roam if the user logs on from multiple computers. To do this we had a file that looked like:
[domain\username1]
value1=foo
value2=bar
[domain\username2]
value1=foo
value2=baz
For this release we're trying to migrate away from using INI files, due to limitations in the Win32 INI read/write functions, without having to write a custom INI file parser.
I've looked at app.config and user settings files and neither appear to be suitable. The former needs to be in the same folder as the executable, and the latter doesn't provide any means to create new values at runtime.
Is there a built in option I'm missing, or is my best path to write a preferences class of my own and use the framework's XML serialization to write it out?
I have found that the fastest way here is to just create an XML file that does what you want, then use XSD.exe to create a class and serialize the data. It is fast, takes only a few lines of code, and works quite well.
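As a minimal sketch of the serialization side (the class names are hypothetical; the open-ended name/value list is what covers "create new values at runtime"):

    using System.Collections.Generic;
    using System.IO;
    using System.Xml.Serialization;

    public class Setting
    {
        [XmlAttribute] public string Name;
        [XmlAttribute] public string Value;
    }

    public class UserPreferences
    {
        // An open list rather than fixed properties, so new
        // settings can be added at runtime.
        public List<Setting> Settings = new List<Setting>();
    }

    class Demo
    {
        static void Main()
        {
            // Stored per user on the file server, so it roams like the old INI did.
            string path = @"\\server\share\settings\username.xml";

            UserPreferences prefs = new UserPreferences();
            prefs.Settings.Add(new Setting { Name = "value1", Value = "foo" });

            XmlSerializer serializer = new XmlSerializer(typeof(UserPreferences));
            using (FileStream fs = File.Create(path))
                serializer.Serialize(fs, prefs);

            using (FileStream fs = File.OpenRead(path))
                prefs = (UserPreferences)serializer.Deserialize(fs);
        }
    }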
Have you checked out, or heard of, Nini? It's a third-party INI handler. I found it quite easy to use and simple for reading from and writing to INI files.
For you it would mean very few changes, and it's easier to use.
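A minimal sketch of reading and writing the per-user sections above with Nini (API names as I remember them, so double-check against the Nini documentation):

    using Nini.Config;

    class RoamingSettings
    {
        static void Main()
        {
            IConfigSource source = new IniConfigSource(@"\\server\share\settings.ini");
            IConfig user = source.Configs[@"domain\username1"];

            string value1 = user.Get("value1"); // reads "foo"
            user.Set("value2", "baz");          // new/changed values at runtime are fine
            source.Save();                      // writes the INI back out
        }
    }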
The conversion from INI to another format needs to be weighed up: the code impact, and the ease of programming (as a nitpicky aside, changing the code to use XML may be easy, but app.config is limiting in that you cannot write to it at runtime). What the benefit would be in ripping out the INI code and replacing it with XML is a question you have to decide.
There may well be a knock-on effect, such as having to change and adapt the code... but for the foreseeable future, sure, INI is a bit outdated and old, but it is still in use. I cannot see Microsoft dropping the INI API support, as it is very much alive and in use behind the scenes for driver installation: think of the INF files used to specify where drivers go and how they are installed. INI is here to stay; the manufacturers of drivers have adopted it, and it is the de facto standard way of driver distribution.
Hope this helps,
Best regards,
Tom.
I'm building a network application that needs to be able to switch from normal network traffic to a zlib-compressed stream, mid-stream. My thoughts on the matter involve a boolean switch that, when on, will cause the network code to pass all the data through a class that I can feed IEnumerable<byte> into, and then pull out the decompressed stream, passing that on to the already existing protocol parsing code.
Things I've looked at:
ZLib.NET - It seems a little... eclectic, and not quite what I want. It would still make a decent start to build off, though. (Jon Skeet's comments here hardly inspire me either.)
SharpZipLib - This doesn't seem to support zlib at all? Can anyone confirm or deny this?
I would very much prefer an all-managed solution, but let's have at it... are there any other implementations of this library out there in .NET that might be better suited to what I want to do, or should I take ZLib.NET and build off it as a start?
PS:
Jon's asked for more detail, so here it is.
I'm trying to implement MCCP 2. This involves a signal being sent in the network stream, and everything after this signal is a zlib-compressed data stream. There are links to exactly what they mean by that in the above link. Anyway, to be clear, I'm on the receiving end of this (client, not server), and I have a bunch of data read out of the network stream already, and the toggle will be in the middle of it (in all likelihood, at least), so any solution needs to be able to have some extra data fed into it before it takes over the NetworkStream (or I manually feed in the rest of the data).
SharpZipLib does support ZLib. Look in the FAQ.
Additionally, have you checked whether the System.IO.Compression namespace supports what you need?
I wouldn't use an IEnumerable<byte> though - streams are designed to be chained together.
EDIT: Okay... it sounds like you need a stream which supports buffering, but with more control than BufferedStream provides. You'd need to "rewind" the stream if you saw the decompression toggle, and then create a GZipStream on top of it. Your buffer would need to be at least as big as your biggest call to Read() so that you could always have enough buffer to rewind.
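To make that concrete, one way to replay the already-read bytes in front of the live stream is a small read-only wrapper; this is an untested sketch, not a drop-in class:

    using System;
    using System.IO;

    // Serves the leftover bytes first, then reads from the live stream.
    // Hypothetical usage: new GZipStream(new PrependedStream(leftover, network),
    //                                    CompressionMode.Decompress)
    class PrependedStream : Stream
    {
        private readonly byte[] prefix;
        private readonly Stream inner;
        private int pos;

        public PrependedStream(byte[] prefix, Stream inner)
        {
            this.prefix = prefix;
            this.inner = inner;
        }

        public override int Read(byte[] buffer, int offset, int count)
        {
            if (pos < prefix.Length)
            {
                // Drain the replayed bytes before touching the network.
                int n = Math.Min(count, prefix.Length - pos);
                Array.Copy(prefix, pos, buffer, offset, n);
                pos += n;
                return n;
            }
            return inner.Read(buffer, offset, count);
        }

        public override bool CanRead { get { return true; } }
        public override bool CanSeek { get { return false; } }
        public override bool CanWrite { get { return false; } }
        public override void Flush() { }
        public override long Length { get { throw new NotSupportedException(); } }
        public override long Position
        {
            get { throw new NotSupportedException(); }
            set { throw new NotSupportedException(); }
        }
        public override long Seek(long o, SeekOrigin s) { throw new NotSupportedException(); }
        public override void SetLength(long v) { throw new NotSupportedException(); }
        public override void Write(byte[] b, int o, int c) { throw new NotSupportedException(); }
    }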
Included in DotNetZip there is a ZlibStream, for compressing or decompressing zlib streams of data. You didn't ask, but there is also a GZipStream and a DeflateStream. As well as a ZlibCodec class, if that is your thing. (just inflates or deflates buffers, as opposed to streams).
DotNetZip is a fully-managed library with a liberal license. You don't need to use any of the .zip capability to get at the Zlib stuff. And the zlib stuff is packaged as a separate (smaller) DLL just for this purpose.
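Usage, once the MCCP toggle has been seen, is essentially a one-liner (a sketch; the Ionic.Zlib names are from the DotNetZip documentation):

    using System.IO;
    using Ionic.Zlib;

    static class Mccp2
    {
        // Everything after the compression signal goes through the ZlibStream.
        public static Stream Decompress(Stream network)
        {
            return new ZlibStream(network, CompressionMode.Decompress);
        }
    }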
I can recommend Gerry Shaw's zlib wrapper for .NET:
http://www.organicbit.com/zip/
As far as I know, the zlib (gzip) library doesn't support listing the files in the header. Assuming that matters to you, it seems a big shortcoming. This was the case when I used SharpZipLib a while ago, so I'm willing to delete this :)
Old question, but System.IO.Compression.DeflateStream is actually the right answer if you need proper zlib support:
Starting with the .NET Framework 4.5, the DeflateStream class uses the zlib library. As a result, it provides a better compression algorithm and, in most cases, a smaller compressed file than it provides in earlier versions of the .NET Framework.
Doesn't get better than that.
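One caveat worth knowing: DeflateStream consumes raw deflate data, while a zlib stream wraps that data in a two-byte header and an Adler-32 trailer. A minimal sketch for reading zlib-framed data with it (hypothetical helper; note the trailer simply goes unverified):

    using System.IO;
    using System.IO.Compression;

    static class Zlib
    {
        public static Stream OpenRead(Stream input)
        {
            input.ReadByte(); // skip the CMF byte of the zlib header
            input.ReadByte(); // skip the FLG byte of the zlib header
            // What remains is raw deflate data (plus an Adler-32
            // trailer that DeflateStream will not verify).
            return new DeflateStream(input, CompressionMode.Decompress);
        }
    }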
I'm creating a PDA app and I need to upload/download a lot of small files; my idea is to gather them in an uncompressed ZIP file.
The question is: is it a good idea to read those files from the ZIP without separating them? How can I do so? Or is it better to unzip them? Since the files are not compressed, my simple mind suggests that reading them from the ZIP is more or less as efficient as reading them directly from the file system...
Thanks for your time!
Since there are two different open-source libraries (SharpZipLib and the DotNetZip Library) to handle writing and extracting files from a ZIP file, why worry about doing it yourself?
ewww - don't use J#.
The DotNetZip library, as of v1.7, runs on the .NET Compact Framework 2.0 and above. It can handle reading or writing compressed or uncompressed files within a ZIP archive. The source distribution includes a CF example app. It's really simple.
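A minimal sketch of reading a file straight out of the archive with DotNetZip (file names are placeholders):

    using System.IO;
    using Ionic.Zip;

    class ReadFromZip
    {
        static void Main()
        {
            using (ZipFile zip = ZipFile.Read("bundle.zip"))
            {
                foreach (ZipEntry entry in zip)
                {
                    // Stream each stored file directly out of the archive;
                    // no need to extract it to disk first.
                    using (Stream s = entry.OpenReader())
                    using (StreamReader reader = new StreamReader(s))
                    {
                        string contents = reader.ReadToEnd();
                    }
                }
            }
        }
    }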
Sounds as if you want to use the archive to group your files.
From a reading-the-files point of view, it makes very little difference whether the files are handled one way or the other. You would need to implement the ability to read ZIP files, though. Even if you use a library, as James Curran suggested, it means additional work, which can mean additional sources of error.
From an uploading-the-files point of view, it makes more sense: the uploader could gather all the files needed and would have to take care of only one single upload. This reduces overhead as well as error handling (if one upload fails, do you have to delete all the files of this group already uploaded?).
As for the efficiency of reading them from the archive vs. reading them directly from disk: the difference should be minimal. You (or your ZIP library) need to parse the ZIP directory structure once, which is pretty straightforward. The rest is reading part of a file into memory vs. reading a whole file into memory.