Jo27 Posted August 27, 2015

Hey guys,

For my research project, I need Python code that will let me extract specific lines from a text file. The text file has the following format:

    fixedStep chrom=chr3 start=56424 step=1
    0.000
    0.001
    0.001
    0.002
    0.003
    0.004
    0.005
    0.006
    0.007
    0.007
    fixedStep chrom=chr3 start=56425 step=1
    Etc....

In fact, I would like to obtain another text file without the numerical lines, in the following format:

    fixedStep chrom=chr3 start=56424 step=1
    fixedStep chrom=chr3 start=56425 step=1
    Etc....

Looking forward to hearing from you soon,

Best,
JEFF O.
Sensei Posted August 27, 2015

Open the file and read it row by row, counting lines. When the line counter reaches the line you want, return that line. Then close the file handle.
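A minimal sketch of the steps Sensei lists (open, count lines while reading, return the requested one, close), assuming Python 3. The function name `line_at` and the 1-based numbering are illustrative choices, not something from the thread.

```python
def line_at(path, wanted):
    """Return line number `wanted` (1-based) from the file at `path`,
    or None if the file has fewer lines. Hypothetical helper."""
    # The with-block closes the file handle, even on the early return.
    with open(path, "r") as handle:
        # enumerate() acts as the line counter; one line is read at a time.
        for number, line in enumerate(handle, start=1):
            if number == wanted:
                return line.rstrip("\n")
    return None
```

For the sample in the first post, `line_at("test.txt", 12)` would return the second `fixedStep` line, assuming the file is laid out exactly as shown there.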
fiveworlds Posted August 27, 2015

Like this?

    #!/pubserver/python27/python
    print "Content-type: text/html"
    print
    print "<html><head>"
    print "</head><body>"
    print "This was written in Python."
    # Read the whole file into a list of lines
    txt = open("test.txt", "r")
    s = txt.readlines()
    # Print the first and twelfth lines of the file
    print s[0], s[11]
    txt.close()
    print "</body></html>"
Strange Posted August 27, 2015 (edited)

Use awk instead:

    awk "/^[^0-9]/" textfile

Both Sensei and fiveworlds seem to have missed the point of the question. In fiveworlds' example, the "print" line needs to be replaced with something like:

    if (<regular expression to find lines not starting with a number>):
        print

(I am not familiar enough with Python to write that off the top of my head.)

Also, I have no idea why he is printing out (invalid) HTML.

Edited August 27, 2015 by Strange
Sensei Posted August 27, 2015 (edited)

Quote: Both sensei and fiveworlds seem to have missed the point of the question.

I didn't want to give the final code, as it sounds quite like homework.

Quote: In fiveworld's example the "print" line needs to be replaced with something like: if (<regular expression to find lines not starting with number):

Why do you want to use regular expressions for such a silly task as checking whether the first character is a digit or not?

The equivalent of this C/C++ code:

    if( !( ( row[ 0 ] >= '0' ) && ( row[ 0 ] <= '9' ) ) )

or

    if( !isdigit( row[ 0 ] ) )

will be sufficient (and faster; regular expressions are pretty slow), assuming a null-terminated row is in the buffer.

http://www.tutorialspoint.com/python/string_isdigit.htm

Edited August 27, 2015 by Sensei
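A rough Python 3 counterpart of the C range check above, offered as a sketch rather than anyone's final answer. The helper names are made up for illustration, and the rule "a numeric line starts with 0-9" is an assumption taken from the sample data in the first post.

```python
def is_numeric_line(line):
    """True if the first character of `line` is one of 0-9 (assumed rule)."""
    # Slicing with [:1] gives "" for an empty line instead of raising IndexError.
    first = line[:1]
    # Same range test as the C version; "" fails the comparison.
    return "0" <= first <= "9"


def drop_numeric_lines(lines):
    """Return only the lines whose first character is not a digit."""
    return [line for line in lines if not is_numeric_line(line)]


sample = [
    "fixedStep chrom=chr3 start=56424 step=1",
    "0.000",
    "0.001",
    "fixedStep chrom=chr3 start=56425 step=1",
]
print(drop_numeric_lines(sample))
```

Note that a value written as "-0.5" or ".5" would not be treated as numeric by this test; the sample only shows non-negative values with a leading digit. The comparison is used instead of `str.isdigit()` because `isdigit()` also accepts some non-ASCII digit characters.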
Strange Posted August 27, 2015

This seems to work:

    #!/usr/bin/python
    import re
    import sys

    if len(sys.argv) != 2:
        print("Missing file name argument")
        sys.exit(1)

    filename = sys.argv[1]

    # Pattern to match lines that don't begin with a number
    pat = re.compile('^[^0-9]')

    for line in open(filename):
        m = pat.match(line)
        if (m):
            print(line),

For Python 3, the print line will need to change to:

    print(line, end="")

Quote: Why do you want to use regular expressions for such silly task as checking whether 1st character is digit or not.. ?

Because that is how awk does it?

Actually, from the short example given, it might just be enough to check if the first character is 'f' ...
timo Posted August 27, 2015

... or even use the much simpler "grep". However, that assumes Jo27 is on a Linux (or Mac) system. And not only is that not certain. In fact, if someone asks for a Python script to do trivial text processing I would assume a Windows user.
fiveworlds Posted August 27, 2015 (edited)

Quote: Also, I have no idea why he is printing out (invalid) HTML

I haven't installed Python because I have a few versions.

Quote: awk "/^[^0-9]/" textfile Both sensei and fiveworlds seem to have missed the point of the question. In fiveworld's example the "print" line needs to be replaced with something like: if (<regular expression to find lines not starting with number): print (I am not familiar enough with python to write that off the top of my head.)

From what I can see from the OP's example, he/she wants lines that are separated by a fixed number of lines, not the lines without numbers. But I can see where you are going: just regex out the numbers. This does depend on the lines he/she wants, because they can't contain numbers like that.

    fixedStep chrom=chr3 start=56424 step=1
    0.000
    0.001
    0.001
    0.002
    0.003
    0.004
    0.005
    0.006
    0.007
    0.007
    fixedStep chrom=chr3 start=56425 step=1

I.e. each wanted line comes after a fixed number of lines. My code retrieves the first line,

    s[0] = fixedStep chrom=chr3 start=56424 step=1

and the last line,

    s[11] = fixedStep chrom=chr3 start=56425 step=1

Quote: I didn't want to give final code, as it sounds quite like homework..

I didn't want to either; he/she still needs to output the data to a new file.

Edited August 27, 2015 by fiveworlds
Strange Posted August 27, 2015

Quote: From what I can see from the ops example he/she wants lines that are separated by a number of lines not the lines without numbers.

He says: "I would like to obtain an other textfile without the numerical lines". Seemed pretty clear to me. But let's see.

Quote: ... or even use the much simpler "grep".

That would do it.

Quote: However, that assumes Jo27 is on a linux (or mac) system. And not only is that not certain. In fact, if someone asks for a Python script to do trivial text processing I would assume a Windows user.

The first thing I do on a new Windows machine is install Cygwin!
pzkpfw Posted August 28, 2015

Erm, instead of looking for '0' to '9' as the first character of the line (via regex or not), why not just look for the 'f' of "fixedStep" (or the whole word)? Seems safe enough, going by the specification (which does say "...without the numerical lines..." but equally shows the only non-numeric lines to begin "fixedStep ...").

(Minor point, but, well ...)

@fiveworlds: the spec says "Etc....". A solution that only shows lines 1 and 12 of the file (elements 0 and 11 of the list) misses all lines in the potential "Etc.".

Also, a solution that loads the entire file at once into a list (with the use of "readlines") will get to be a drag on the system if the input file gets large.
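One way the "just look for fixedStep" idea could look in Python 3, written as a sketch. It also writes the result to a new file, which is what the original question asked for. The function name, the `prefix` parameter and the output file name are illustrative assumptions.

```python
def extract_headers(src_path, dst_path, prefix="fixedStep"):
    """Copy every line that starts with `prefix` from src_path to dst_path.
    Returns the number of lines written. Hypothetical helper."""
    written = 0
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        # Iterating over the file object reads one line at a time,
        # so the whole input never has to sit in a list.
        for line in src:
            if line.startswith(prefix):
                dst.write(line)  # the line still carries its own newline
                written += 1
    return written
```

A call such as `extract_headers("test.txt", "headers.txt")` makes no assumption about how many numeric lines sit between two headers, so it also covers the "Etc...." part of the sample.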
fiveworlds Posted August 28, 2015

Quote: Also, a solution that loads (with the use of "readlines") the entire file at once into a list, will get to be a drag on the system if the input file gets large.

I would use an assembly code executable for that, probably written in C#, but that was not what the OP asked.

Quote: @fiveworlds: the spec says "Etc....". A solution that only shows lines 1 and 12 of the file (elements 0 and 11 of the list) misses all lines in the potential "Etc.".

    #!/pubserver/python27/python
    print "Content-type: text/html"
    print "Accept-Language: en-US"
    print "Cache-Control: no-cache"
    print ""
    print "<html><head>"
    print "</head><body>"
    print "This was written in Python."
    # Read the whole file into a list of lines
    fp = open("test.txt", "r")
    txt = fp.readlines()
    print len(txt)
    print "</br>"
    # Step through the list 12 lines at a time
    i = 0
    while i < len(txt):
        print "</br>"
        print txt[i]
        txt[i] = "hello\n"
        print "</br>"
        print txt[i+11]
        txt[i+11] = "hello\n"
        i = i + 12
    fp.close()
    print "</body></html>"
pzkpfw Posted August 28, 2015

Quote: I would use an assembly code executable for that probably written in c# ...

That doesn't really make sense (given how C# works; even if you're talking about .NET Native under .NET Framework 4.6 and 4.5). Having said that, C# is what I'd use too - I (more or less) currently make my living as a C# programmer.

Quote: ... but that was not what the op asked

He didn't ask for mangled HTML either!

Quote: ... python code ...

Still has the issue of reading the file all at once. I don't know Python specifically, but reading line by line seems possible (e.g. http://stackoverflow.com/questions/8009882/how-to-read-large-file-line-by-line-in-python ).

Reading line by line also means you'd not be tied to the assumption that there are always 10 lines between the lines wanted.

And your code assumes the input ends on a wanted line; but going by the sample (with the "Etc ...") I'd suggest that's not guaranteed. And if that "last line" is not there (or there are fewer than 10 unwanted lines), the last iteration of your while loop may well have an i that's less than len(txt) ... but adding 11 (i.e. the "print txt[i+11]") would push you past the end of the list.

That is, your code will only work if the input file is exactly like:

    wanted line
    10 x unwanted lines
    wanted line
    wanted line
    10 x unwanted lines
    wanted line
    wanted line
    10 x unwanted lines
    wanted line
    ... etc.

What's the point of the "hello" replacements?
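To make the out-of-range concern concrete, here is a small Python 3 imitation of the fixed-stride loop. It is a sketch for illustration only: `stride_pick` and `make_block` are invented names, and the shortened header text is a stand-in for the real lines.

```python
def stride_pick(lines, gap=10):
    """Imitate the fixed-stride loop: take lines[i] and lines[i + gap + 1],
    then jump ahead by gap + 2 lines."""
    picked = []
    i = 0
    while i < len(lines):
        picked.append(lines[i])
        picked.append(lines[i + gap + 1])  # IndexError if the block is short
        i += gap + 2
    return picked


def make_block(start):
    """One header, ten numeric lines, one header: the layout the loop expects."""
    first = "fixedStep start=%d" % start
    last = "fixedStep start=%d" % (start + 1)
    return [first] + ["0.000"] * 10 + [last]


complete = make_block(56424)
print(stride_pick(complete))

# A single extra header with nothing after it breaks the stride.
try:
    stride_pick(complete + ["fixedStep start=56426"])
except IndexError as err:
    print("IndexError:", err)
```

With exactly 12 lines the loop works; with 13 lines the second pass starts at index 12, which is valid, and then asks for index 23, which is not. Indexing a Python list past its end raises `IndexError`.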
fiveworlds Posted August 28, 2015 (edited)

Quote: Still has the issue of reading the file all at once.

Not really; Python only handles small files.

Quote: the last iteration of your while loop may well have an i that's less than len(txt)

An empty string.

Quote: What's the point of the "hello" replacements?

"hello\n": it was to point out that if \n isn't included in the replacements, a line is lost.

Edited August 28, 2015 by fiveworlds
Strange Posted August 28, 2015

If you read one line at a time, it doesn't matter how big the file is.

If you read the entire file, it is going to be limited by available (virtual) memory - I don't think Python has any hard limits.
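A sketch of why one-line-at-a-time processing does not depend on file size, assuming Python 3. A generator stands in for a very large file here (`fake_file` is invented for the demonstration, as are the other names), and a counter records how many input lines were actually pulled.

```python
from itertools import islice


def header_lines(lines, prefix="fixedStep"):
    """Lazily yield the lines that start with `prefix` (hypothetical helper)."""
    for line in lines:
        if line.startswith(prefix):
            yield line


def counting(lines, counter):
    """Pass lines through unchanged while counting how many were consumed."""
    for line in lines:
        counter[0] += 1
        yield line


def fake_file(blocks):
    """Stand-in for a huge file: one header plus ten numeric lines per block."""
    for n in range(blocks):
        yield "fixedStep chrom=chr3 start=%d step=1\n" % (56424 + n)
        for _ in range(10):
            yield "0.000\n"


consumed = [0]
source = counting(fake_file(1000000), consumed)
first_two = list(islice(header_lines(source), 2))
print(first_two)
print(consumed[0])
```

Although the stand-in describes eleven million lines, fetching the first two headers pulls only 12 of them, and no more than one line is held at a time. A real file object iterated with `for line in handle` behaves the same way from the caller's point of view, apart from internal read buffering.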
fiveworlds Posted August 28, 2015

Quote: If you read the entire file, it is going to be limited by available (virtual) memory - I don't think Python has any hard limits.

It does.
fiveworlds Posted August 28, 2015 (edited)

Quote: Citation needed.

Try opening any text file larger than 1 GB. I think it is 2 or 3 GB for Notepad.

Edited August 28, 2015 by fiveworlds
Strange Posted August 28, 2015

Quote: Try opening any text file larger than 1GB. I think it is 2 or 3 gb for notepad.

That is not a limitation of Python.
fiveworlds Posted August 28, 2015

Quote: That is not a limitation of Python.

Of your computer.
Strange Posted August 28, 2015

As I said: if you read the entire file, it is going to be limited by available (virtual) memory - Python does not have any hard limits.
fiveworlds Posted August 29, 2015

Only it has nothing to do with the available (virtual) memory of my computer, which is 8 GB; that file was 3.5 GB. I can open the file, just not with Notepad or Python.
Strange Posted August 29, 2015

Whatever the reason, it is not a limitation in Python. (And it is irrelevant, as the file needs to be processed one line at a time, anyway.)
Sensei Posted August 29, 2015 (edited)

Quote: Only it has nothing to do with the available (virtual) memory of my computer which is 8 GB that file was 3.5GB. I can open the file just not with notepad or python.

A limitation of some app (like Notepad), or of a language, is not the same as a limitation of the system.

I made a C/C++ project for you, compiled for both 32 bit and 64 bit. Run it from the command line as follows:

    OpenFile "file name"

OpenFile.zip

In this project I used ftell() to learn the file size. It is declared in the includes as follows:

    long __cdecl ftell(_Inout_ FILE * _File);

On Windows a long is 32 bits, so it returns a 32 bit integer. There is another function for 64 bit:

    __int64 _ftelli64( FILE *stream );

I used it (and _fseeki64()) in this project, so you can compare results:

OpenFile64.zip

The first project is written without using Windows-specific functions (ANSI C, portable code that can be compiled without changes on Linux/MacOS), while the second project uses functions available only on Windows, added by Microsoft to support 64 bit. To support 64 bit file handling on Linux, there are other functions, available on Linux but not on Windows (and vice versa): http://stackoverflow.com/questions/9026896/get-large-file-size-in-c

Programs written in the past use 32 bit file handling functions by default. A program written now has to be written by a professional programmer who is aware of how to deal with very large files. Otherwise he/she will use the wrong functions and create the limitation himself/herself.

I know plenty of programmers who use obsolete, ancient computers with (32 bit) Windows XP and don't want to upgrade (and it's not a matter of money). My neighbor was using a Pentium III laptop (OMG). Their private code and private projects will most likely be affected by this issue.

Edited August 29, 2015 by Sensei
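For comparison with the fseek/ftell approach described above, a small Python 3 sketch of the same "seek to the end, ask for the position" idea. The function name is invented. Python integers have arbitrary precision, so the returned value is not squeezed into a 32 bit type by the language itself; whether the underlying system calls handle very large files is assumed here to depend on the platform and the Python build.

```python
import os


def size_by_seek(path):
    """Return the file size in bytes by seeking to the end of the file."""
    with open(path, "rb") as handle:
        handle.seek(0, os.SEEK_END)  # move to the end of the file
        return handle.tell()         # offset from the start, as a Python int
```

`os.path.getsize(path)` reports the same number without opening the file, which makes it a convenient cross-check.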
fiveworlds Posted August 29, 2015 (edited)

Here is the Python code for it, but it doesn't get over the limitations of Python's open(), which just reads the file as one line in a massive string. PS: I am going out for the day.

Basically, I load the massive string into an array in memory, because that is all I can do. Then I write the lines out as separate files and delete the massive string from memory. Then I perform a regex on each individual file, and any files which pass the test are placed into the output directory. Then the files in the output directory are compiled and the two temporary directories are deleted. I thought I should probably use the current time for the directory names, but I didn't get around to it.

    import os
    import re
    import shutil

    # Start from clean temporary directories
    while os.path.exists("temp"):
        shutil.rmtree('temp')
    while os.path.exists("output"):
        shutil.rmtree('output')

    fp = open("test.txt", "r")
    txt = fp.readlines()
    print len(txt)
    print "</br>"

    # Write every input line to its own file in temp
    handler = 0
    while handler < len(txt):
        if not os.path.exists("temp"):
            os.makedirs("temp")
        path = "temp\\" + str(handler) + ".txt"
        fz = open(path, "w")
        fz.write(txt[handler])
        file.close(fz)
        handler = handler + 1
    file.close(fp)

    # Copy the files whose line contains a lowercase letter to output
    txt = 0
    while txt < handler:
        if not os.path.exists("output"):
            os.makedirs("output")
        path = "temp\\" + str(txt) + ".txt"
        fz = open(path, "r")
        out = fz.readline()
        file.close(fz)
        pattern = '([a-z]+)'
        result = re.compile(pattern).search(out)
        if result:
            path = "output\\" + str(txt) + ".txt"
            fz = open(path, "w")
            fz.write(out)
            file.close(fz)
        else:
            print ""
        txt = txt + 1

    # Join the surviving files into output.txt
    txt = 0
    out = ""
    fp = open("output.txt", "w")
    for i in os.listdir("output"):
        if i.endswith(".txt"):
            fz = open("output\\" + str(i), "r")
            out = str(out) + str(fz.readline())
            file.close(fz)
            continue
        else:
            continue
    fp.write(out)
    file.close(fp)
    del(out)
    del(pattern)
    del(result)

    # Remove the temporary directories
    while os.path.exists("temp"):
        shutil.rmtree('temp')
    while os.path.exists("output"):
        shutil.rmtree('output')

Edited August 29, 2015 by fiveworlds
Strange Posted August 29, 2015

Quote: Basically I load the massive string into an array memory because that is all i can do

What do you mean, it is all you can do? Of course it isn't. You can read the file one line at a time, which would be more appropriate. You could even read it one byte at a time, if you wanted to.

You do talk nonsense sometimes.
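As an illustration of the "one byte at a time" remark, a Python 3 sketch that counts newline bytes while reading the file in fixed-size pieces. The function name and the `chunk_size` parameter are made up for the example; a size of 1 is the literal byte-by-byte case, and a larger value would normally be faster.

```python
def count_newlines(path, chunk_size=1):
    """Count newline bytes in the file, reading `chunk_size` bytes at a time."""
    total = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:  # an empty bytes object signals end of file
                break
            total += chunk.count(b"\n")
    return total
```

At no point does this hold more than `chunk_size` bytes of the file's contents in the `chunk` variable, regardless of how large the file is.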