python log reader
Mar 06, 2006, 12:37am EST
I often want to do some quick analysis of my Apache log files and find myself reaching for grep or awk, carefully constructing the right regular expression each time to get the data I want. I’ve done this frequently enough that I decided it was time to write a Python module to do it for me.
log_reader - a fast apache log reader in python (download)
Example:
>>> import log_reader
>>> reader = log_reader.ApacheReader(file('access.log'))
>>> reader.next()
{'username': '-', 'status': 200, 'ident': '-', 'tz': '-0500', 'protocol': 'HTTP/1.0', 'user-agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; iTreeSurf 3.6.1 (Build 056))', 'ips': ['123.123.123.123'], 'referer': 'Field blocked by Outpost (http://www.agnitum.com)', 'time': datetime.datetime(2005, 3, 3, 21, 37, 58), 'path': '/webnote/webnote', 'method': 'GET', 'size': 46472}
>>> status = [f['status'] for f in reader]
>>> status.count(200) # request ok
14047
>>> status.count(404) # file not found
159
It’s implemented as a CPython extension module, so it’s substantially faster than reading and parsing the strings in Python itself. A simple script reading a 17,089-line log file takes about 1.45 seconds on my 1.2 GHz laptop.
The constructor takes either a filename or any iterable as a parameter.[1] Optionally, one can pass in an Apache format string (it defaults to the Apache combined format).
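To make the format-string idea concrete, here is a minimal pure-Python sketch (Python 3) of how an Apache format string can be turned into a regular expression. It is not how log_reader itself is implemented. The names compile_format, parse_line, SIMPLE and TOKEN are hypothetical, the sample line is reconstructed from the record shown in the example above, and only a handful of directives are covered. It also assumes that %{Name}i header fields are wrapped in double quotes in the format string.

```python
import re
from datetime import datetime

# Apache "combined" LogFormat string.
COMBINED = '%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"'

# Directive letter -> (field name, regex fragment). The field names loosely
# mirror the dict in the post; this table is an assumption, not the module's.
SIMPLE = {
    "h": ("ip", r"\S+"),
    "l": ("ident", r"\S+"),
    "u": ("username", r"\S+"),
    "t": ("time", r"\[[^\]]+\]"),
    "r": ("request", r'[^"]*'),
    "s": ("status", r"\d{3}"),
    "b": ("size", r"\d+|-"),
}

# Matches %h, %>s, ... and request-header directives such as %{Referer}i.
TOKEN = re.compile(r"%[<>]?(?:\{([^}]+)\}i|([a-zA-Z]))")


def compile_format(fmt):
    """Translate a format string into (compiled regex, list of field names)."""
    pattern, names, pos = "", [], 0
    for m in TOKEN.finditer(fmt):
        pattern += re.escape(fmt[pos:m.start()])  # literal text between directives
        header, letter = m.groups()
        if header is not None:
            names.append(header.lower())
            pattern += r'([^"]*)'  # assumes the header is quoted in the format
        elif letter in SIMPLE:
            name, fragment = SIMPLE[letter]
            names.append(name)
            pattern += "(" + fragment + ")"
        else:
            raise ValueError("unsupported directive: %" + letter)
        pos = m.end()
    pattern += re.escape(fmt[pos:])
    return re.compile(pattern + "$"), names


def parse_line(line, regex, names):
    """Return a dict for one log line, or None if the line does not match."""
    m = regex.match(line.rstrip("\n"))
    if m is None:
        return None
    rec = dict(zip(names, m.groups()))
    if "status" in rec:
        rec["status"] = int(rec["status"])
    if "size" in rec:
        rec["size"] = 0 if rec["size"] == "-" else int(rec["size"])
    if "time" in rec:
        stamp, rec["tz"] = rec["time"][1:-1].split(" ", 1)
        rec["time"] = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S")
    if "request" in rec:
        parts = rec.pop("request").split(" ")
        if len(parts) == 3:
            rec["method"], rec["path"], rec["protocol"] = parts
    return rec


# Sample line reconstructed from the record printed in the example session.
SAMPLE = (
    '123.123.123.123 - - [03/Mar/2005:21:37:58 -0500] '
    '"GET /webnote/webnote HTTP/1.0" 200 46472 '
    '"Field blocked by Outpost (http://www.agnitum.com)" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; iTreeSurf 3.6.1 (Build 056))"'
)

regex, names = compile_format(COMBINED)
record = parse_line(SAMPLE, regex, names)
print(record["status"], record["path"])
```

Because header directives are handled generically, appending something like ' "%{Foobar}i"' to the format string yields an extra 'foobar' key in each record. A regex-based parser like this will be far slower than the compiled module; it is only meant to show the mapping from format string to fields.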
[1] Oddly, sequences fail PyIter_Check and need to be wrapped by iter. That is, log_reader.ApacheReader(['..']) fails, but log_reader.ApacheReader(iter(['..'])) works. I’m not sure what the distinction is because lists are iterable and have __iter__ defined.
Wade Leftwich at May 10, 2006, 10:14am EDT
Excellent utility, thanks!
Regarding why log_reader.ApacheReader(['..']) fails, it’s because sequences don’t have a next() method:
In [1]: L = [1,2,3]
In [2]: L.next()
exceptions.AttributeError Traceback (most recent call last)
AttributeError: 'list' object has no attribute 'next'
In [3]: iter(L).next()
Out[3]: 1
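The same distinction can be spelled out in Python 3 terms, where the method is called __next__ rather than next(). At the C level, PyIter_Check asks whether an object is itself an iterator, not whether it is merely iterable, which is why a bare list is rejected. The sketch below is only an illustration; as_iterator is a hypothetical helper, not part of log_reader.

```python
from collections.abc import Iterable, Iterator

lines = ["first line", "second line"]

# A list is iterable: it defines __iter__, so for-loops and iter() accept it.
assert isinstance(lines, Iterable)

# It is not an iterator: it has no __next__ (spelled next() in Python 2),
# and that is the part of the protocol PyIter_Check tests for.
assert not isinstance(lines, Iterator)
assert not hasattr(lines, "__next__")

# iter() returns a separate list-iterator object that does have __next__.
it = iter(lines)
assert isinstance(it, Iterator)
assert next(it) == "first line"


def as_iterator(source):
    """Hypothetical helper: accept any iterable and hand back an iterator."""
    return iter(source)  # calling iter() on an iterator returns it unchanged


assert as_iterator(it) is it
assert next(as_iterator(["x"])) == "x"
```

A C extension that wants to accept any iterable would typically call PyObject_GetIter on its argument, which is the C-level equivalent of the iter() call in as_iterator.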
Craig Ringer at Nov 01, 2009, 10:30pm EST
It looks like there’s a memory handling bug in log_reader. If you build with:
python setup.py build --debug
python setup.py install
and write a simple test (“test.py”) like:
import log_reader
for x in log_reader.ApacheReader(open('access.log', 'r')):
    pass
then invoke it with:
MALLOC_CHECK_=2 gdb --args python ./test.py
you’ll find after “run” that it’s crashing in the log reader in ApacheReader_dealloc line 67:
Program received signal SIGABRT, Aborted.
[Switching to Thread 0xb7d096c0 (LWP 15358)]
0xb7ee5424 in __kernel_vsyscall ()
(gdb) bt
#0 0xb7ee5424 in __kernel_vsyscall ()
#1 0xb7d366d0 in *__GI_raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#2 0xb7d38098 in *__GI_abort () at abort.c:88
#3 0xb7d7a633 in malloc_printerr (action=2, str=0xb7e4c4a1 "free(): invalid pointer", ptr=0xb7cc96f8) at malloc.c:5999
#4 0xb7d7c555 in *__GI___libc_free (mem=0xb7cc96f8) at malloc.c:3589
#5 0xb7c3d0f7 in ApacheReader_dealloc (self=0xb7cc96f8) at log_reader.cpp:67
#6 0x0808cb61 in ?? ()
#7 0x0808ee69 in PyDict_SetItem ()
#8 0x080909c1 in _PyModule_Clear ()
#9 0x080f17a2 in PyImport_Cleanup ()
#10 0x080fd90d in Py_Finalize ()
#11 0x0805c129 in Py_Main ()
#12 0x0805b972 in main ()
at:
PyMem_DEL(self);
(by the way, your weblog interface really needs to add the <pre> tag to its allowed list - it’s a real nightmare to paste code, backtraces, etc)
tony at Nov 01, 2009, 11:26pm EST
You’re right! This seems to be a python2.5 change. I’ve updated the code in CVS to use the matching dealloc so it should no longer crash in python2.5+.
I’ve added the <pre> tag to the allowed list, although you could have used the <code> tag.
Craig Ringer at Nov 01, 2009, 11:50pm EST
Thanks!
I appreciate your publishing the code, by the way. It’s the sort of thing that gets written and re-written so many times it’s painful, so it’s good to have a quality implementation out there.
Here’s a less-good-quality implementation of a log loader for slurping the logs into postgresql:
http://www.postnewspapers.com.au/~craig/weblinks/logloader.py
(BTW, I didn’t use <code> because it still does “helpful” line re-wrapping)
Javier Frias at Feb 22, 2010, 09:07pm EST
Hi, thanks for the module, it’s exactly what I was looking for. One issue, though: it doesn’t handle parsing of custom headers, i.e. %{Foobar}i. Is there a way to work around this? Thanks.
BTW, I tried posting the stack trace and the board wouldn’t let me, so there’s a parsing error there too, heheh :-D