Python code to extract specific lines in a textfile

Jo27 · August 27, 2015

Hey Guys:

For my research project, I need Python code that will let me extract specific lines from a text file.

The textfile has the following format:

fixedStep chrom=chr3 start=56424 step=1
0.000
0.001
0.002
0.003
0.004
0.005
0.006
0.007
fixedStep chrom=chr3 start=56425 step=1
Etc....

In fact, I would like to obtain another text file without the numerical lines, in the following format:

fixedStep chrom=chr3 start=56424 step=1
fixedStep chrom=chr3 start=56425 step=1
Etc....

Looking forward to hearing from you soon,

Best, JEFF O.

Sensei · August 27, 2015

Open the file and read it row by row, counting lines. If the line counter is at the line you want, return it.

Then close the file handle.
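The recipe above can be sketched in Python roughly as follows. This is only an illustration: the function name `get_line` is hypothetical (not from the thread), and it assumes line numbers are counted from 1.

```python
def get_line(path, wanted):
    """Return line number `wanted` (1-based) from the file at `path`.

    Returns None if the file has fewer than `wanted` lines.
    """
    # The `with` block closes the file handle automatically.
    with open(path) as handle:
        # enumerate() does the line counting; start=1 makes it 1-based.
        for number, line in enumerate(handle, start=1):
            if number == wanted:
                return line.rstrip("\n")
    return None
```

Because the loop stops as soon as the wanted line is reached, the rest of the file is never read.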

fiveworlds · August 27, 2015

like this?

#!/pubserver/python27/python

print "Content-type: text/html"
print
print "<html><head>"
print "</head><body>"
print "This was written in Python."
txt = open("test.txt", "r")
s=txt.readlines()
print s[0],s[11];
txt.close()
print "</body></html>"

**Strange** · August 27, 2015

Use awk instead.

awk "/^[^0-9]/" textfile

Both Sensei and fiveworlds seem to have missed the point of the question. In fiveworlds' example, the "print" line needs to be replaced with something like:

if (<regular expression to find lines not starting with a number>):

    print

(I am not familiar enough with python to write that off the top of my head.)

Also, I have no idea why he is printing out (invalid) HTML :confused:

Edited August 27, 2015 by Strange

Sensei · August 27, 2015

Both sensei and fiveworlds seem to have missed the point of the question.

I didn't want to give final code, as it sounds quite like homework..

In fiveworld's example the "print" line needs to be replaced with something like:

if (<regular expression to find lines not starting with number):

Why do you want to use regular expressions for such a silly task as checking whether the first character is a digit or not?

The Python equivalent of this C/C++ code:

if( !( ( row[ 0 ] >= '0' ) && ( row[ 0 ] <= '9' ) ) )

or

if( !isdigit( row[ 0 ] ) )

will be sufficient (and faster; regular expressions are pretty slow).

(This assumes the null-terminated row is in a buffer.)

http://www.tutorialspoint.com/python/string_isdigit.htm
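A minimal Python sketch of that first-character test, using `str.isdigit()` as in the linked page. The function name `drop_numeric_lines` and the in-memory sample are illustrative assumptions, not code from the thread. It also assumes the numeric lines never start with a sign such as "-", as in the sample data.

```python
def drop_numeric_lines(lines):
    """Return the lines whose first character is not a digit."""
    kept = []
    for line in lines:
        # line[:1] is "" for an empty line, and "".isdigit() is False,
        # so empty lines are kept instead of raising IndexError.
        if not line[:1].isdigit():
            kept.append(line)
    return kept


sample = [
    "fixedStep chrom=chr3 start=56424 step=1",
    "0.000",
    "0.001",
    "fixedStep chrom=chr3 start=56425 step=1",
]
for line in drop_numeric_lines(sample):
    print(line)
```

Note that `str.isdigit()` also returns True for some non-ASCII digit characters, which should not matter for data like this.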

Edited August 27, 2015 by Sensei

**Strange** · August 27, 2015

This seems to work

#!/usr/bin/python

import re
import sys

if len(sys.argv) != 2:
	print("Missing file name argument")
	sys.exit(1)

filename = sys.argv[1]

# Pattern to match lines that don't begin with a number
pat = re.compile('^[^0-9]')

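# Read line by line; the with block closes the file afterwards
with open(filename) as infile:
	for line in infile:
		if pat.match(line):
			# The trailing comma (Python 2) suppresses the extra newline
			print(line),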

For Python 3, the print line will need to change to: print(line, end="")
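One way to avoid the Python 2 versus Python 3 difference is to write with `sys.stdout.write()`, which never adds a newline of its own. The sketch below keeps the same regular expression; the function name `header_lines` and the in-memory sample are assumptions made for illustration.

```python
import re
import sys

# Same pattern as in the script: matches lines that do not begin
# with a digit.
NON_NUMERIC = re.compile(r"^[^0-9]")


def header_lines(lines):
    """Yield only the lines that do not start with a digit."""
    for line in lines:
        if NON_NUMERIC.match(line):
            yield line


sample = ["fixedStep a\n", "0.000\n", "0.001\n", "fixedStep b\n"]
for line in header_lines(sample):
    # Each line already ends with "\n", so write() is enough.
    sys.stdout.write(line)
```

Passing an open file object instead of `sample` would work the same way, since a file object also yields one line at a time.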

Why do you want to use regular expressions for such silly task as checking whether 1st character is digit or not.. ?

Because that is how awk does it?

Actually, from the short example given, it might just be enough to check if the first character is 'f' ...

timo · August 27, 2015

... or even use the much simpler "grep". However, that assumes Jo27 is on a Linux (or Mac) system, and that is not certain. In fact, if someone asks for a Python script to do trivial text processing, I would assume a Windows user.

fiveworlds · August 27, 2015

Also, I have no idea why he is printing out (invalid) HTML

I haven't installed python because I have a few versions.

awk "/^[^0-9]/" textfile

Both sensei and fiveworlds seem to have missed the point of the question. In fiveworld's example the "print" line needs to be replaced with something like:

if (<regular expression to find lines not starting with number):

print

(I am not familiar enough with python to write that off the top of my head.)

From what I can see from the OP's example, he/she wants lines that are separated by a fixed number of lines, not the lines without numbers. But I can see where you are going: just regex out the numbers. This does depend on the lines he/she wants, because they can't contain numbers like that.

fixedStep chrom=chr3 start=56424 step=1
0.000
0.001
0.001
0.002
0.003
0.004
0.005
0.006
0.007
0.007
fixedStep chrom=chr3 start=56425 step=1

I.e., each wanted line comes after a fixed number of lines. My code retrieves

the first line

s[0]=fixedStep chrom=chr3 start=56424 step=1

the last line

s[11]=fixedStep chrom=chr3 start=56425 step=1

I didn't want to give final code, as it sounds quite like homework..

I didn't want to either; he/she still needs to output the data to a new file.

Edited August 27, 2015 by fiveworlds

**Strange** · August 27, 2015

From what I can see from the ops example he/she wants lines that are separated by a number of lines not the lines without numbers.

He says: "I would like to obtain another text file without the numerical lines"

Seemed pretty clear to me. But let's see.

... or even use the much simpler "grep".

That would do it.

However, that assumes Jo27 is on a linux (or mac) system. And not only is that not certain. In fact, if someone asks for a Python script to do trivial text processing I would assume a Windows user.

The first thing I do on a new Windows machine is install cygwin!

pzkpfw · August 28, 2015

Erm, instead of looking for '0' to '9' as the first character of the line (via regex or not) - why not just look for the 'f' of "fixedStep"? (or the whole word).

Seems safe enough, going by the specification (which does say "...without the numerical lines..." but equally shows the only non numeric lines to begin "fixedStep ...").

(Minor point, but, well ...)
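To make the difference between the two tests concrete, here is a small sketch comparing "starts with fixedStep" against "does not start with a digit". The function names are illustrative, and the negative value and the comment line in the sample are hypothetical additions that do not appear in the original post.

```python
def starts_with_keyword(line, keyword="fixedStep"):
    """True if the line begins with the declaration keyword."""
    return line.startswith(keyword)


def not_numeric(line):
    """True if the first character of the line is not a digit."""
    return not line[:1].isdigit()


sample = [
    "fixedStep chrom=chr3 start=56424 step=1",
    "0.000",
    "-0.001",     # hypothetical: a negative value
    "# comment",  # hypothetical: some other non-numeric line
    "fixedStep chrom=chr3 start=56425 step=1",
]

by_keyword = [line for line in sample if starts_with_keyword(line)]
by_digit = [line for line in sample if not_numeric(line)]

print(len(by_keyword))  # only the two fixedStep lines
print(len(by_digit))    # also keeps "-0.001" and "# comment"
```

On the sample exactly as posted, both tests give the same result; they only diverge if the file contains lines like the hypothetical ones.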

@fiveworlds: the spec says "Etc....". A solution that only shows lines 1 and 12 of the file (elements 0 and 11 of the list) misses all lines in the potential "Etc.".

Also, a solution that loads (with the use of "readlines") the entire file at once into a list, will get to be a drag on the system if the input file gets large.

fiveworlds · August 28, 2015

Also, a solution that loads (with the use of "readlines") the entire file at once into a list, will get to be a drag on the system if the input file gets large.

I would use an assembly code executable for that, probably written in C#, but that was not what the OP asked.

@fiveworlds: the spec says "Etc....". A solution that only shows lines 1 and 12 of the file (elements 0 and 11 of the list) misses all lines in the potential "Etc.".

#!/pubserver/python27/python

print "Content-type: text/html"
print "Accept-Language: en-US"
print "Cache-Control: no-cache"
print ""
print "<html><head>"
print "</head><body>"
print "This was written in Python."
fp = open("test.txt", "r")
txt =fp.readlines()
print len(txt)
print "</br>"
i=0
while i<len(txt):

    print "</br>"
    print txt[i]
    txt[i]="hello\n"
    print "</br>"
    print txt[i+11]
    txt[i+11]="hello\n"
    i=i+12

fp.close()

print "</body></html>"

pzkpfw · August 28, 2015

I would use an assembly code executable for that probably written in c# ...

That doesn't really make sense (given how C# works; even if you're talking about .Net Native under .NET Framework 4.6 and 4.5). Having said that, C# is what I'd use too - I (more or less) currently make my living as a C# programmer.

... but that was not what the op asked

He didn't ask for mangled HTML either!

... python code ...

Still has the issue of reading the file all at once. I don't know Python specifically, but reading line by line seems possible (e.g. http://stackoverflow.com/questions/8009882/how-to-read-large-file-line-by-line-in-python )

Reading line by line also means you'd not be tied to the assumption that there are always 10 lines between the wanted lines. Your code also assumes the input ends on a wanted line, but going by the sample (with the "Etc ...") I'd suggest that's not guaranteed. If that "last line" is not there (or there are fewer than 10 unwanted lines), the last iteration of your while loop may well have an i that's less than len(txt), but adding 11 (i.e. the "print txt[i+11]") would push you past the end of the list.

That is, your code will only work if the input file is exactly like:

wanted line

10 x unwanted lines

wanted line

10 x unwanted lines

wanted line

10 x unwanted lines

wanted line

... etc.
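The fixed-offset assumption can be illustrated with a short sketch. The function names are hypothetical, and the blocks are shortened to 3 values instead of 10 to keep the example small.

```python
def every_nth(lines, stride):
    """Fixed-offset approach: keep lines 0, stride, 2*stride, ..."""
    # Slicing never raises IndexError, unlike indexing with txt[i+11].
    return lines[::stride]


def by_content(lines):
    """Content-based approach: keep lines that start with fixedStep."""
    return [line for line in lines if line.startswith("fixedStep")]


# Every block has exactly 3 values: both approaches agree.
regular = ["fixedStep A"] + ["0.0"] * 3 + ["fixedStep B"] + ["0.0"] * 3

# The second block has only 2 values: the fixed offset drifts.
irregular = (["fixedStep A"] + ["0.0"] * 3 +
             ["fixedStep B"] + ["0.0"] * 2 +
             ["fixedStep C", "0.0"])

print(every_nth(regular, 4) == by_content(regular))  # True
print(every_nth(irregular, 4))   # picks up a "0.0" and misses "fixedStep C"
print(by_content(irregular))     # finds all three headers
```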

What's the point of the "hello" replacements?

fiveworlds · August 28, 2015

Still has the issue of reading the file all at once.

Not really; Python only handles small files.

the last iteration of your while loop may well have an i that's less than len(txt)

An empty string

What's the point of the "hello" replacements?

It was to point out that if \n isn't included in the replacement ("hello\n"), a line is lost.

Edited August 28, 2015 by fiveworlds

**Strange** · August 28, 2015

If you read one line at a time, it doesn't matter how big the file is.

If you read the entire file, it is going to be limited by available (virtual) memory - I don't think Python has any hard limits.
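A line-at-a-time filter that writes the result to a new file might look like the sketch below. The function name `filter_file` and the file names in the usage note are assumptions for illustration; the "fixedStep" test is the one suggested earlier in the thread.

```python
def filter_file(src_path, dst_path, keyword="fixedStep"):
    """Copy lines starting with `keyword` from src_path to dst_path.

    Returns the number of lines written.
    """
    written = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        # Iterating over the file object yields one line at a time,
        # so the whole file is never loaded into a list.
        for line in src:
            if line.startswith(keyword):
                dst.write(line)
                written += 1
    return written
```

A call such as `filter_file("input.txt", "headers.txt")` (hypothetical file names) would leave only the fixedStep lines in the second file.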

fiveworlds · August 28, 2015

If you read the entire file, it is going to be limited by available (virtual) memory - I don't think Python has any hard limits.

It does.

**Strange** · August 28, 2015

It does.

Citation needed.

fiveworlds · August 28, 2015

Citation needed.

Try opening any text file larger than 1 GB. I think it is 2 or 3 GB for Notepad.

Edited August 28, 2015 by fiveworlds

**Strange** · August 28, 2015

Try opening any text file larger than 1GB. I think it is 2 or 3 gb for notepad.

That is not a limitation of Python.

fiveworlds · August 28, 2015

That is not a limitation of Python.

Of your computer.


**Strange** · August 28, 2015

As I said: if you read the entire file, it is going to be limited by available (virtual) memory. Python does not have any hard limits.

fiveworlds · August 29, 2015

Only it has nothing to do with the available (virtual) memory of my computer, which is 8 GB; that file was 3.5 GB. I can open the file, just not with Notepad or Python.

**Strange** · August 29, 2015

Whatever the reason, it is not a limitation in Python. (And it is irrelevant, as the file needs to be processed one line at a time, anyway.)

Sensei · August 29, 2015

Only it has nothing to do with the available (virtual) memory of my computer which is 8 GB that file was 3.5GB. I can open the file just not with notepad or python.

A limitation of some app (like Notepad), or of a language, is not the same as a limitation of the system.

I made a C/C++ project for you, compiled for both 32-bit and 64-bit.

Run it from the command line as follows:

OpenFile "file name"

OpenFile.zip

In this project I used ftell() to learn the file size. It is declared in the headers as follows:

long __cdecl ftell(_Inout_ FILE * _File);

It returns a 32-bit integer (long is 32 bits wide on Windows).

There is another function that returns a 64-bit value:

__int64 _ftelli64(FILE *stream);

I used it (and _fseeki64()) in this project:

OpenFile64.zip

so you can compare results.
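For comparison, the same seek-to-end-and-tell idea can be sketched in Python. Python integers are not fixed-width, so the returned value itself cannot overflow the way a 32-bit long does; whether a particular Python build and platform handles files over 4 GB correctly is not something this sketch verifies. The function names are illustrative.

```python
import os


def size_via_seek(path):
    """Seek to the end and ask for the position, like fseek() + ftell()."""
    with open(path, "rb") as handle:
        handle.seek(0, os.SEEK_END)
        return handle.tell()


def size_via_stat(path):
    """Ask the operating system for the size without opening the file."""
    return os.path.getsize(path)
```

Both functions should report the same number of bytes for the same file.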

The first project is written without Windows-specific functions (ANSI C: portable code that can be compiled without changes on Linux/MacOS), while the second project uses functions available only on Windows, added by Microsoft to support 64-bit file sizes.

For 64-bit file handling on Linux there are other functions, which are not available on Windows (and vice versa):

http://stackoverflow.com/questions/9026896/get-large-file-size-in-c

Programs written in the past use 32-bit file-handling functions by default.

Programs written now have to be written by a professional programmer who is aware of how to deal with very large files.

Otherwise he/she will use the wrong functions and create the limitation himself/herself.

I know plenty of programmers who use obsolete, ancient computers with (32-bit) Windows XP and don't want to upgrade (and it's not a matter of money).

My neighbor was using a Pentium III laptop (OMG).

Their private code and private projects will most likely be affected by this issue.

Edited August 29, 2015 by Sensei

fiveworlds · August 29, 2015

Here is the Python code for it, but it doesn't get over the limitations of Python's open(), which just reads the file as one line in a massive string. PS: I am going out for the day. Basically, I load the massive string into an array in memory because that is all I can do; then I write the lines out as separate files and delete the massive string from memory. Then I perform a regex on each individual file, and any files that pass the test are placed into the output directory. Then the files in the output directory are combined and the two temporary directories are deleted. I thought I should probably use the current time for the directory names, but I didn't get around to it.

import os
import re
import shutil

while os.path.exists("temp"):
    shutil.rmtree('temp')
while os.path.exists("output"):
    shutil.rmtree('output')

fp = open("test.txt", "r")
txt =fp.readlines()
print len(txt)
print "</br>"

handler=0
while handler<len(txt):

    if not os.path.exists("temp"):
        os.makedirs("temp")

    path="temp\\"+str(handler)+".txt"
    fz=open(path,"w")
    fz.write(txt[handler])
    file.close(fz)
    handler=handler+1

file.close(fp)
txt=0

while txt<handler:

    if not os.path.exists("output"):
        os.makedirs("output")

    path="temp\\"+str(txt)+".txt"
    fz = open(path, "r")
    out =fz.readline()
    file.close(fz)
    pattern = '([a-z]+)'
    result = re.compile(pattern).search(out)
    if result:
        path="output\\"+str(txt)+".txt"
        fz=open(path,"w")
        fz.write(out)
        file.close(fz)
    else:
        print ""

    txt=txt+1

txt=0
out=""
fp=open("output.txt","w")

for i in os.listdir("output"):
    if i.endswith(".txt"):
        fz = open("output\\"+str(i), "r")
        out =str(out)+str(fz.readline())
        file.close(fz)
        continue
    else:
        continue

fp.write(out)
file.close(fp)
del(out)
del(pattern)
del(result)
while os.path.exists("temp"):
    shutil.rmtree('temp')

while os.path.exists("output"):
    shutil.rmtree('output')

Edited August 29, 2015 by fiveworlds

**Strange** · August 29, 2015

Basically I load the massive string into an array memory because that is all i can do

What do you mean, it is all you can do? Of course it isn't. You can read the file one line at a time, which would be more appropriate. You could even read it one byte at a time, if you wanted to.
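As a sketch of reading in small pieces rather than all at once, the function below counts lines while holding only one chunk in memory. The function name and the default chunk size are arbitrary choices made for illustration; with `chunk_size=1` it really does read one byte at a time.

```python
from functools import partial


def count_lines_chunked(path, chunk_size=64 * 1024):
    """Count newline characters, reading `chunk_size` bytes at a time."""
    total = 0
    with open(path, "rb") as handle:
        # iter(callable, sentinel) calls handle.read(chunk_size)
        # repeatedly until it returns b"" at end of file.
        for chunk in iter(partial(handle.read, chunk_size), b""):
            total += chunk.count(b"\n")
    return total
```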

You do talk nonsense sometimes.
