RyanJ Posted April 25, 2010
Hey there guys! I have a question for anyone. Maybe some code already exists for this, or a good algorithm is already written down somewhere. Basically, I'm trying to build an identifier for unknown file types. I've been looking for a .NET implementation of a file-type matching algorithm, or a good description of an implementation in another language, but I can't seem to find one. If there are none, can anyone point me in the general direction of writing one? I'm also looking for help with the project, so if anyone else is interested, let me know. Cheers!
khaled Posted April 27, 2010
I don't know, but file types have an extension (.xxx), and their content starts with a unique code [ CODE ... ]
Icefire Posted April 27, 2010
Probably one that takes the characters after the last period in the name (there can be more than one period, but only the last one matters, apart from exceptions such as .tar.gz), then goes to a site and enters that extension as a search, then reformats the resulting page by removing excess info (not the best idea if the website constantly redesigns itself...)
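The extension step described above can be sketched in a few lines. This is only an illustration in Python (not the C# the project uses); the function name `get_extension` and the `COMPOUND_EXTENSIONS` set are made up for this example, and the set is deliberately incomplete.

```python
from pathlib import PurePath

# Hypothetical, incomplete list of multi-part extensions that should be
# treated as a single unit instead of splitting at the last period.
COMPOUND_EXTENSIONS = (".tar.gz", ".tar.bz2", ".tar.xz")


def get_extension(filename: str) -> str:
    """Return the lowercased extension, or "" if the name has none."""
    name = PurePath(filename).name.lower()
    for compound in COMPOUND_EXTENSIONS:
        # Require at least one character before the compound extension.
        if name.endswith(compound) and len(name) > len(compound):
            return compound
    # Default case: everything from the last period onward.
    return PurePath(name).suffix


print(get_extension("backup.tar.gz"))
print(get_extension("report.final.pdf"))
```

The extension only says what a file claims to be, so this would be a first guess at best.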
D H Posted April 27, 2010
I think Ryan is talking about identifying the file type by looking at the file's contents rather than at the file's name. The Unix file tool does a semi-decent job of doing just that.
RyanJ (Author) Posted April 27, 2010
Quote (D H): "I think Ryan is talking about identifying the file type by looking at the file's contents rather than at the file's name. The Unix file tool does a semi-decent job of doing just that."
That's the idea. Though that tool relies on a trick called magic numbers, which misses most file types these days.
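For readers unfamiliar with the magic-number trick, here is a minimal sketch of the idea, in Python rather than C#. The table holds a handful of well-known signatures and is nowhere near complete; the names `SIGNATURES`, `identify` and `identify_file` are invented for this example and are not how the Unix file tool is actually implemented.

```python
from typing import Optional

# (offset, signature bytes, label). A tiny illustrative table only.
SIGNATURES = [
    (0, b"\x89PNG\r\n\x1a\n", "PNG image"),
    (0, b"%PDF-", "PDF document"),
    (0, b"PK\x03\x04", "ZIP archive"),
    (0, b"\x1f\x8b", "gzip data"),
    (0, b"GIF87a", "GIF image"),
    (0, b"GIF89a", "GIF image"),
    (257, b"ustar", "tar archive"),  # signature is not at the start
]


def identify(data: bytes) -> Optional[str]:
    """Return the label of the first matching signature, else None."""
    for offset, magic, label in SIGNATURES:
        if data[offset:offset + len(magic)] == magic:
            return label
    return None


def identify_file(path: str, read_size: int = 512) -> Optional[str]:
    """Read only the first read_size bytes of the file and identify them."""
    with open(path, "rb") as handle:
        return identify(handle.read(read_size))


print(identify(b"%PDF-1.4\n..."))
print(identify(b"no signature here"))
```

A file whose format has no fixed signature simply falls through to `None`, which is the weakness mentioned above.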
StringJunky Posted April 27, 2010
This program does the kind of job you are looking to do, I think: http://www.freedownloadsplace.com/Products/38290/TrID-File-Identifier
RyanJ (Author) Posted April 27, 2010
I know. It uses a pattern-matching approach. However, it also fails with files that have no structure, such as ISO files.
jryan Posted April 29, 2010
You mean something like this? http://filext.com/
I think there may be some trouble writing a program that can determine the exact application a file is connected to, since file extension naming is not strictly policed. As a result, you will find file types with the same extension that are actually completely different formats. To really determine the root application, you will need to read the files themselves and match them against a structure database, in the same way a virus scanner sniffs out infected files.
RyanJ (Author) Posted April 29, 2010
I know that. Actually, that's what I'm working on at the moment. I'm thinking of using pattern matching, and possibly techniques such as entropy and compressibility in cases where the patterns are not clear-cut. The program is being written in C# and WPF, so if anyone wants to take a look and work on this with me, feel free to let me know.
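As a rough illustration of the entropy and compressibility measures mentioned above, here is a small Python sketch (the project itself is in C#). It assumes byte-level Shannon entropy and a zlib compression ratio as the two measures; the function names are made up for this example, and any thresholds for turning the numbers into a file-type guess would have to be found by experiment.

```python
import math
import zlib
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Byte-level Shannon entropy in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # Sum of p * log2(1/p) over every byte value that occurs.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size, using zlib at level 9."""
    if not data:
        return 1.0
    return len(zlib.compress(data, 9)) / len(data)


sample = b"plain text tends to repeat itself, " * 200
print(shannon_entropy(sample), compression_ratio(sample))
```

Low entropy and a small ratio suggest text or other redundant data, while values near 8 bits per byte and a ratio near 1 suggest data that is already compressed or encrypted. That narrows the candidates rather than naming a format.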