by Tony Chang
tony@ponderer.org

All opinions on this site are my own and do not represent those of my employer.

Creative Commons Attribution License

python log reader

Mar 06, 2006, 12:37am EST

 

 

I often want to do some quick analysis of my Apache log files and find myself reaching for grep or awk. Each time, I have to carefully construct the right regular expression to extract the data I want. I’ve done this often enough that I decided it was time to write a Python module to do it for me.

log_reader - a fast apache log reader in python (download)

Example:

>>> import log_reader
>>> reader = log_reader.ApacheReader(file('access.log'))
>>> reader.next()
{'username': '-', 'status': 200, 'ident': '-', 'tz':
'-0500', 'protocol': 'HTTP/1.0', 'user-agent': 'Mozilla/4.0
(compatible; MSIE 6.0; Windows 98; iTreeSurf 3.6.1 (Build
056))', 'ips': ['123.123.123.123'], 'referer': 'Field blocked
by Outpost (http://www.agnitum.com)', 'time':
datetime.datetime(2005, 3, 3, 21, 37, 58), 'path':
'/webnote/webnote', 'method': 'GET', 'size': 46472}
>>> status = [f['status'] for f in reader]
>>> status.count(200) # request ok
14047
>>> status.count(404) # file not found
159
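When several status codes are of interest, a single pass with collections.Counter avoids rescanning the list for each one. This is a small sketch in current Python; the records list is hypothetical stand-in data shaped like the dicts shown above, not output from log_reader.

```python
from collections import Counter

# Hypothetical records shaped like the dicts the reader yields.
records = [
    {'status': 200, 'path': '/webnote/webnote'},
    {'status': 404, 'path': '/missing'},
    {'status': 200, 'path': '/'},
]

# Tally every status code in one pass over the records.
status_counts = Counter(r['status'] for r in records)
```

Looking up a code that never occurred, such as status_counts[500], returns 0 instead of raising an error.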

It’s implemented as a CPython extension module, so it’s substantially faster than reading and parsing the strings in Python itself. A simple script reading a 17,089-line log file takes about 1.45 seconds on my 1.2 GHz laptop.
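For comparison, here is roughly what the pure-Python approach looks like. This is only a sketch for the combined log format, written for current Python; it is not how log_reader is implemented, and the names COMBINED_RE and parse_line are made up for illustration. It assumes well-formed request lines and no escaped quotes inside quoted fields.

```python
import re
from datetime import datetime

# Apache combined format:
# %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"
COMBINED_RE = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<username>\S+) '
    r'\[(?P<time>[^\]]+?) (?P<tz>[+-]\d{4})\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = COMBINED_RE.match(line)
    if m is None:
        return None
    fields = m.groupdict()
    fields['status'] = int(fields['status'])
    # Apache logs "-" when no body bytes were sent.
    fields['size'] = 0 if fields['size'] == '-' else int(fields['size'])
    # Month names are parsed with the current locale (English assumed here).
    fields['time'] = datetime.strptime(fields['time'], '%d/%b/%Y:%H:%M:%S')
    return fields
```

Every line goes through the regular expression engine plus a strptime call, which is the per-line overhead a compiled parser avoids.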

The constructor takes either a filename or any iterator as a parameter.[1] Optionally, one can pass in an Apache format string (the default is the Apache combined format).

[1] Oddly, sequences fail PyIter_Check and need to be wrapped by iter. That is, log_reader.ApacheReader(['..']) fails, but log_reader.ApacheReader(iter(['..'])) works. I’m not sure what the distinction is because lists are iterable and have __iter__ defined.

Wade Leftwich at May 10, 2006, 10:14am EDT

Excellent utility, thanks!

As for why log_reader.ApacheReader(['..']) fails: sequences are iterable, but they are not iterators, so they don’t have a next() method:

In [1]: L = [1,2,3]

In [2]: L.next()

exceptions.AttributeError Traceback (most recent call last)

AttributeError: 'list' object has no attribute 'next'

In [3]: iter(L).next()

Out[3]: 1
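The same distinction can be checked directly. The sketch below uses current Python, where the method is spelled __next__ and is called through the built-in next(), whereas the session above is Python 2. As far as I know, PyIter_Check tests for the C-level slot behind that method, which a list does not fill in. The helper as_iterator is hypothetical and only shows how a wrapper could accept both kinds of argument.

```python
from collections.abc import Iterable, Iterator

L = [1, 2, 3]

# A list is iterable (it defines __iter__) ...
list_is_iterable = isinstance(L, Iterable)
# ... but it is not an iterator (it has no __next__).
list_is_iterator = isinstance(L, Iterator)

# iter() returns a separate iterator object that does define __next__.
it = iter(L)
it_is_iterator = isinstance(it, Iterator)
first = next(it)

def as_iterator(source):
    """Return source unchanged if it is an iterator, else wrap it with iter()."""
    return source if isinstance(source, Iterator) else iter(source)
```

Calling iter() on something that is already an iterator returns the same object, so wrapping unconditionally would work just as well.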


Craig Ringer at Nov 01, 2009, 10:30pm EST

It looks like there’s a memory handling bug in log_reader. If you build with:

python setup.py build --debug
python setup.py install

and write a simple test (“test.py”) like:

import log_reader
for x in log_reader.ApacheReader(open('access.log', 'r')): pass

then invoke it with:

MALLOC_CHECK_=2 gdb --args python ./test.py

you’ll find after “run” that it’s crashing in the log reader in ApacheReader_dealloc line 67:

Program received signal SIGABRT, Aborted.
[Switching to Thread 0xb7d096c0 (LWP 15358)]
0xb7ee5424 in __kernel_vsyscall ()
(gdb) bt
#0 0xb7ee5424 in __kernel_vsyscall ()
#1 0xb7d366d0 in *__GI_raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#2 0xb7d38098 in *__GI_abort () at abort.c:88
#3 0xb7d7a633 in malloc_printerr (action=2, str=0xb7e4c4a1 "free(): invalid pointer", ptr=0xb7cc96f8) at malloc.c:5999
#4 0xb7d7c555 in *__GI___libc_free (mem=0xb7cc96f8) at malloc.c:3589
#5 0xb7c3d0f7 in ApacheReader_dealloc (self=0xb7cc96f8) at log_reader.cpp:67
#6 0x0808cb61 in ?? ()
#7 0x0808ee69 in PyDict_SetItem ()
#8 0x080909c1 in _PyModule_Clear ()
#9 0x080f17a2 in PyImport_Cleanup ()
#10 0x080fd90d in Py_Finalize ()
#11 0x0805c129 in Py_Main ()
#12 0x0805b972 in main ()

at:

PyMem_DEL(self);

(by the way, your weblog interface really needs to add the <pre> tag to its allowed list - it’s a real nightmare to paste code, backtraces, etc)


tony at Nov 01, 2009, 11:26pm EST

You’re right! This seems to be caused by a change in Python 2.5. I’ve updated the code in CVS to use the matching dealloc, so it should no longer crash in Python 2.5+.

I’ve added the <pre> tag to the allowed list, although you could have used the <code> tag.


Craig Ringer at Nov 01, 2009, 11:50pm EST

Thanks!

I appreciate your publishing the code, by the way. It’s the sort of thing that gets written and re-written so many times it’s painful, so it’s good to have a quality implementation out there.

Here’s a less-good-quality implementation of a log loader for slurping the logs into postgresql:

http://www.postnewspapers.com.au/~craig/weblinks/logloader.py

(BTW, I didn’t use <code> because it still does “helpful” line re-wrapping)


Javier Frias at Feb 22, 2010, 09:07pm EST

Hi, thanks for the module, it’s exactly what I was looking for. One issue, though: it doesn’t handle parsing of custom headers, i.e. %{Foobar}i. Is there a way around this? Thanks.

BTW, I tried posting the stack trace and the board wouldn’t let me, so there’s a parsing error there too, heheh :-D

ValueError: Unknown named format type
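One possible workaround outside the module is to translate the format string into a regular expression in plain Python. The sketch below is not part of log_reader and does not reflect how it parses format strings. The names SIMPLE, DIRECTIVE and format_to_regex are made up for illustration, only a handful of directives are mapped, and it assumes header fields such as %{Foobar}i appear inside double quotes in the format.

```python
import re

# Patterns for a few common directives; unknown ones fall back to \S+.
SIMPLE = {
    'h': r'(?P<host>\S+)',
    'l': r'(?P<ident>\S+)',
    'u': r'(?P<username>\S+)',
    't': r'\[(?P<time>[^\]]+)\]',
    'r': r'(?P<request>[^"]*)',
    's': r'(?P<status>\d{3})',
    'b': r'(?P<size>\d+|-)',
}

# Matches %{Name}x as well as plain directives such as %h or %>s.
DIRECTIVE = re.compile(
    r'%(?:\{(?P<name>[^}]+)\}(?P<kind>[a-zA-Z])|[<>]?(?P<simple>[a-zA-Z]))'
)

def format_to_regex(fmt):
    """Compile an Apache log format string into a regex with named groups."""
    parts = []
    pos = 0
    for m in DIRECTIVE.finditer(fmt):
        # Literal text between directives must match exactly.
        parts.append(re.escape(fmt[pos:m.start()]))
        if m.group('name'):
            # %{Foobar}i becomes the named group "i_foobar".
            safe = re.sub(r'\W', '_', m.group('name')).lower()
            parts.append(r'(?P<%s_%s>[^"]*)' % (m.group('kind'), safe))
        else:
            parts.append(SIMPLE.get(m.group('simple'), r'\S+'))
        pos = m.end()
    parts.append(re.escape(fmt[pos:]))
    return re.compile(''.join(parts))
```

With the format '%h %l %u %t "%r" %>s %b "%{Referer}i" "%{Foobar}i"', the custom header ends up in the group i_foobar. A format that repeats the same directive would produce duplicate group names and fail to compile, and status-code conditions on directives are not handled.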