02 4 / 2013
ElementTree and xmlns
Working with XML in Python is quite nice, but when you render a document back to XML after manipulating it, you can end up with some weird prefixes for namespaces in the result. Take the following example:
from xml.etree import ElementTree
xml = ElementTree.fromstring("""
<document xmlns="http://www.lol.com/"
xmlns:foo="http://www.foo.com/"
xmlns:bar="http://www.bar.com/">
<something>
<foo:thing>Hello,</foo:thing>
<bar:thing>world!</bar:thing>
</something>
</document>
""")
print ElementTree.tostring(xml)
Which yields:
<ns0:document xmlns:ns0="http://www.lol.com/" xmlns:ns1="http://www.foo.com/" xmlns:ns2="http://www.bar.com/">
<ns0:something>
<ns1:thing>Hello,</ns1:thing>
<ns2:thing>world!</ns2:thing>
</ns0:something>
</ns0:document>
The XML parser doesn’t keep the namespace prefixes, so you get generated ns%d prefixes in your output XML. It’s still perfectly valid XML, just a bit unsightly. If you know the namespaces in advance, a simple fix is to register them with ElementTree before rendering:
from xml.etree import ElementTree
xml = ElementTree.fromstring("""
<document xmlns="http://www.lol.com/"
xmlns:foo="http://www.foo.com/"
xmlns:bar="http://www.bar.com/">
<something>
<foo:thing>Hello,</foo:thing>
<bar:thing>world!</bar:thing>
</something>
</document>
""")
namespaces = {
'': 'http://www.lol.com/',
'foo': 'http://www.foo.com/',
'bar': 'http://www.bar.com/',
}
for prefix, uri in namespaces.iteritems():
    ElementTree.register_namespace(prefix, uri)
print ElementTree.tostring(xml)
Which yields:
<document xmlns="http://www.lol.com/" xmlns:bar="http://www.bar.com/" xmlns:foo="http://www.foo.com/">
<something>
<foo:thing>Hello,</foo:thing>
<bar:thing>world!</bar:thing>
</something>
</document>
If you don’t know the namespaces in advance, a good solution seems to be to use ElementTree.iterparse to pull them out of the document like so:
from cStringIO import StringIO
from xml.etree import ElementTree
xmlin = """
<document xmlns="http://www.lol.com/"
xmlns:foo="http://www.foo.com/"
xmlns:bar="http://www.bar.com/">
<something>
<foo:thing>Hello,</foo:thing>
<bar:thing>world!</bar:thing>
</something>
</document>
"""
xml = None
namespaces = {}
for event, elem in ElementTree.iterparse(StringIO(xmlin), ('start', 'start-ns')):
    if event == 'start-ns':
        # For 'start-ns' events, elem is a (prefix, uri) tuple.
        if elem[0] in namespaces and namespaces[elem[0]] != elem[1]:
            # NOTE: It is perfectly valid to have the same prefix refer
            # to different URI namespaces in different parts of the
            # document. This exception serves as a reminder that this
            # solution is not robust. Use at your own peril.
            raise KeyError("Duplicate prefix with different URI found.")
        namespaces[str(elem[0])] = elem[1]
    elif event == 'start':
        if xml is None:
            # The first 'start' event is the root element. Keep iterating
            # (no break) so that the whole document gets parsed and
            # namespaces declared on nested elements are seen too.
            xml = elem
for prefix, uri in namespaces.iteritems():
    ElementTree.register_namespace(prefix, uri)
print ElementTree.tostring(xml)
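For anyone on Python 3 (where print is a function and dict.iteritems and cStringIO are gone), here is a rough sketch of the same idea. The function name parse_and_register_namespaces is made up for illustration, and raising ValueError instead of KeyError is my own choice. It shares the caveat above: a prefix that is bound to two different URIs in one document is rejected rather than handled.

```python
import io
from xml.etree import ElementTree


def parse_and_register_namespaces(xml_text):
    """Parse xml_text, register every prefix it declares, return (root, namespaces).

    Hypothetical helper, not part of ElementTree.
    """
    root = None
    namespaces = {}
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, payload in ElementTree.iterparse(source, events=("start", "start-ns")):
        if event == "start-ns":
            # For 'start-ns' events the payload is a (prefix, uri) tuple.
            prefix, uri = payload
            if namespaces.get(prefix, uri) != uri:
                raise ValueError("Prefix %r is bound to more than one URI." % prefix)
            namespaces[prefix] = uri
        elif root is None:
            # The first 'start' event carries the root element.
            root = payload
    # register_namespace changes module-level state, so the prefixes stay
    # registered for every later call to ElementTree.tostring.
    for prefix, uri in namespaces.items():
        ElementTree.register_namespace(prefix, uri)
    return root, namespaces
```

Calling ElementTree.tostring(root, encoding="unicode") on the returned root should then use the document's own prefixes instead of generated ones.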
28 3 / 2013
Rewording commit messages in git
Sometimes it’s useful to be able to edit the message of a previous commit (my spelling can be awful at times). If you’ve not pushed your changes, it’s relatively simple to reword it. First, find the parent of the earliest commit you want to reword; you can find this using git log. Then run git rebase --interactive <parent_hash>. Git will fire up $EDITOR with a list of the commits you are going to replay. Each is formatted like:
pick <hash> First line of commit msg.
To reword the commit message, change the line for the commit you want to reword to:
reword <hash> First line of commit msg.
Then save and close the text file. Git will fire up $EDITOR again for each commit you asked to reword, and once they are all saved and closed, you’ll be back at HEAD again with the commits replayed and messages updated. #winning
26 3 / 2013
Delegate methods in Python
One thing that’s quite useful when you’re programming is knowing when a method was called. A naive approach would be to add glue code to the method itself, along with some mechanism for tracking who wants to be told when it is called.
However, most of the time what someone wants to know about a particular method falls into a few categories:
- Before it runs: what arguments was it called with?
- After it ran: what was the return value?
- Was there an unhandled exception in the method?
The delegate pattern is a neat answer to these sorts of questions. It is heavily used in Objective-C, and now in Python, using mixins and decorators, we’ve got much the same (if not better)!
My solution has two parts:
- The DelegateProviderMixin class, which provides methods for adding/removing delegates for all delegatable (is that a word!?) methods, or for specific methods on an instance.
- For any method you want to enable delegates on, simply add the @notify_delegates decorator to allow the delegates system to notify delegates before and after any calls to that particular method.
The source is available at https://gist.github.com/tomhennigan/5245713, and here’s an example to help get you started using it:
from delegates import DelegateProviderMixin, notify_delegates

class Something(DelegateProviderMixin):
    @notify_delegates
    def do_something(self):
        print 'do_something'

    def do_something_else(self):
        print 'do_something_else'

    @notify_delegates
    def throw_an_exception(self):
        raise Exception('oh hai')

class SomethingDelegate:
    def on_before_do_something(self, *args, **kwargs):
        print '- on_before_do_something'

    def on_after_do_something(self, ret_value, *args, **kwargs):
        print '- on_after_do_something -args=%r -kwargs=%r -ret_value=%r' % (args, kwargs, ret_value)

    def on_exception_in_throw_an_exception(self, exception, *args, **kwargs):
        print ' - on_exception_in_throw_an_exception -args=%r -kwargs=%r -exception=%r' % (args, kwargs, exception)

if __name__ == '__main__':
    thing = Something()
    delegate = SomethingDelegate()
    thing.add_delegate(delegate)
    thing.do_something()
    thing.do_something_else()
    thing.throw_an_exception()
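The gist isn’t reproduced here, so the following is only a rough Python 3 sketch of how a mixin and decorator like these could be put together. It is an assumption about the approach, not the code from the gist. The hook names (on_before_*, on_after_*, on_exception_in_*) follow the example above; the _notify helper, the Adder and Recorder classes, and the decision to re-raise the exception after notifying are all made up for illustration. It also only supports delegates for all decorated methods, not per-method registration.

```python
import functools


class DelegateProviderMixin:
    """Keeps a list of delegates and calls matching hooks on them (sketch)."""

    def add_delegate(self, delegate):
        self.__dict__.setdefault("_delegates", []).append(delegate)

    def remove_delegate(self, delegate):
        self.__dict__.get("_delegates", []).remove(delegate)

    def _notify(self, hook_name, *args, **kwargs):
        # Hypothetical helper: call hook_name on every delegate that defines it.
        for delegate in list(self.__dict__.get("_delegates", [])):
            hook = getattr(delegate, hook_name, None)
            if callable(hook):
                hook(*args, **kwargs)


def notify_delegates(method):
    """Wrap a method so delegates hear about calls, results and exceptions."""
    name = method.__name__

    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        self._notify("on_before_" + name, *args, **kwargs)
        try:
            ret_value = method(self, *args, **kwargs)
        except Exception as exception:
            self._notify("on_exception_in_" + name, exception, *args, **kwargs)
            raise  # Assumption: delegates observe the error, they don't swallow it.
        self._notify("on_after_" + name, ret_value, *args, **kwargs)
        return ret_value

    return wrapper


class Adder(DelegateProviderMixin):
    @notify_delegates
    def add(self, a, b):
        return a + b


class Recorder:
    """Example delegate that just records what it was told."""

    def __init__(self):
        self.events = []

    def on_before_add(self, *args, **kwargs):
        self.events.append(("before", args))

    def on_after_add(self, ret_value, *args, **kwargs):
        self.events.append(("after", ret_value))
```

Looking hooks up by name with getattr means a delegate only has to implement the callbacks it cares about.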
26 3 / 2013
Secondary sorting flags for Hadoop 0.20.2 streaming
Yesterday I got trolled fairly hard by Hadoop’s sorting options, so I figured I would write up my findings to save anyone with the same issue from being trolled like I was.
It turns out that if you only supply a single field number to a -k flag in mapred.text.key.comparator.options then it has the same behaviour as GNU sort, and assumes you want to sort on all fields from that position to the end of the line. This means that when you try to secondary sort using -k1 -k2n you’re in fact just sorting on fields 1 -> stream.num.map.output.key.fields (the -k2n flag is effectively ignored because field 2 has already been sorted on as part of -k1).
Anyway, the simple fix is to explicitly set the ranges using -k{from},{to}{opts} for each field. The correct configuration on Hadoop 0.20.2 for a secondary sort on a numeric field is as follows:
Partitioner:
org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
Job configuration:
stream.num.map.output.key.fields = 2
mapred.text.key.partitioner.options = '-k1,1'
mapred.output.key.comparator.class = 'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator'
mapred.text.key.comparator.options = '-k1,1 -k2,2nr'
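To see why the two sets of flags behave differently, here is a small pure-Python emulation of the two orderings. It doesn’t touch Hadoop at all; the function names are made up, and it assumes two-field keys that would be joined by a tab. The first function mimics -k1 on its own (plain text comparison from field 1 to the end of the key), the second mimics -k1,1 -k2,2nr (field 1 as text, then field 2 as a number, descending).

```python
def sort_whole_key_as_text(records):
    # Like "-k1": compare everything from field 1 to the end as plain text.
    return sorted(records, key=lambda fields: "\t".join(fields))


def sort_with_numeric_secondary(records):
    # Like "-k1,1 -k2,2nr": field 1 as text, then field 2 numeric, reversed.
    return sorted(records, key=lambda fields: (fields[0], -float(fields[1])))


records = [("a", "9"), ("b", "1"), ("a", "10"), ("a", "2")]

# Text comparison puts "10" before "2" before "9" within key "a".
assert sort_whole_key_as_text(records) == [
    ("a", "10"), ("a", "2"), ("a", "9"), ("b", "1"),
]

# The explicit ranges give the numeric (descending) order within key "a".
assert sort_with_numeric_secondary(records) == [
    ("a", "10"), ("a", "9"), ("a", "2"), ("b", "1"),
]
```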
13 3 / 2013
Number range overlap algorithm
Here’s a quick algorithm to work out whether two number ranges overlap. Two ranges overlap if they contain at least one number in common, e.g. the ranges 1-3 and 2-4 both contain 2 and 3, so they overlap.
Each range is defined by a pair of numbers (start, end), with both ends inclusive. Building on an excellent answer from stackoverflow.com, I knocked up the following Python to fit the bill.
def overlap((start_a, end_a), (start_b, end_b)):
    if start_b < start_a:
        # Guarantee a starts before b.
        start_a, end_a, start_b, end_b = start_b, end_b, start_a, end_a
    return (start_a <= end_b) and (end_a >= start_b)

def all_overlap(*ranges):
    # Sort the ranges, then check each one against the one before it.
    ranges = sorted(ranges)
    prev_start, prev_end = None, None
    for start, end in ranges:
        if prev_start is not None:
            if not ((prev_start <= end) and (prev_end >= start)):
                return False
        prev_start, prev_end = start, end
    return True

# A few tests.
assert(overlap((1, 2), (2, 3)))
assert(overlap((2, 3), (1, 2)))
assert(not overlap((1, 2), (3, 5)))
assert(not overlap((3, 5), (1, 2)))
assert(all_overlap((1, 2), (2, 3), (3, 4)))
assert(all_overlap((1, 2), (3, 4), (2, 3)))
assert(not all_overlap((1, 2), (3, 4), (3, 4)))
assert(not all_overlap((3, 4), (5, 10), (3, 4)))
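Tuple parameters like the ones in overlap above were removed in Python 3, so here is a sketch of how the same checks might look there. One thing worth knowing about all_overlap is that it only compares each range with its sorted neighbour, so it returns False for (1, 10), (2, 3), (4, 5) even though the first range covers the other two. The variant below, with the made-up name ranges_form_chain, tracks the furthest end seen so far instead. That is my reading of what "all overlap" is meant to check (no gaps between the ranges), not something stated in the original.

```python
def overlap(range_a, range_b):
    """True if two inclusive (start, end) ranges share at least one number."""
    (start_a, end_a), (start_b, end_b) = range_a, range_b
    # The test is symmetric, so no swapping of the arguments is needed.
    return start_a <= end_b and start_b <= end_a


def ranges_form_chain(*ranges):
    """True if the ranges cover one unbroken span (hypothetical variant)."""
    ordered = sorted(ranges)
    if not ordered:
        return True
    furthest_end = ordered[0][1]
    for start, end in ordered[1:]:
        if start > furthest_end:
            # Gap between everything seen so far and this range.
            return False
        furthest_end = max(furthest_end, end)
    return True
```

For example, ranges_form_chain((1, 10), (2, 3), (4, 5)) is True, while the cases tested above give the same answers as all_overlap.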
12 11 / 2012
List difference in bash
First initialise two lists:
$ a="a b c d"
$ b="a b c e"
Now find the values that appear in only one of the lists:
$ diff=$(echo "$a" "$b" | xargs -n1 echo | sort | uniq -u | xargs echo)
$ echo $diff
d e
We can now work out which of those values came from each individual list:
$ in_a=$(echo $diff | tr ' ' '|' | xargs -I{} sh -c "echo \"$a\" | xargs -n1 echo | grep -E \"{}\" | xargs echo")
$ echo $in_a
d
$ in_b=$(echo $diff | tr ' ' '|' | xargs -I{} sh -c "echo \"$b\" | xargs -n1 echo | grep -E \"{}\" | xargs echo")
$ echo $in_b
e
04 5 / 2012
openssl-0.9.8c - Illegal Instruction
As part of a recent coursework we had to compile an older version of OpenSSL in order to exploit a Debian-specific weakness in the RNG. After a successful ./config && make && sudo make install, some of the OpenSSL commands (specifically those for generating RSA keys) were dying with SIGILL. After some digging around I found that the problem came from some function pointer weirdness in the OpenSSL code that newer versions of gcc (>= 4.2) treat differently.
The fix was relatively simple, and just required me to apply some patches from the upstream version of OpenSSL (r16526 and 16528).
05 2 / 2012
Error opening terminal: xterm-color.
So I kept seeing this error when running htop (Error opening terminal: xterm-color.). There is a lot of information online about how to resolve it, but none of the workarounds worked for me. I think the issue had something to do with installing MacPorts. I’m not aware of a proper fix, but running this works:
sudo ln -s /usr/share/terminfo /opt/local/share/terminfo
21 1 / 2012
mysqldump removing conditional comments
When researching MySQL backup solutions, one of the obvious choices was mysqldump. It’s simple to use, and run through gzip/bzip2 it gives a very compact copy of the database, ideal for storage. (NB. it is a pretty crappy format if you actually want to inspect the backup, and it takes some time to load the data back into the tables if your database is large or you have lots of indices.)
One thing that annoyed me slightly about the output was that it includes conditional comments for very old versions of MySQL, and there is no flag to disable comments below a given MySQL version. This could be an interesting feature to add seeing as MySQL is open source, but for now my time is limited and I had to hack around the issue. The solution I came up with is a bit brutal: it strips the comments using sed or grep and a hunky regex. I also posted this answer on stackoverflow. NB. The only difference between the sed and grep versions below is that grep drops the entire line, whereas sed replaces the comment with the empty string and leaves a blank line behind. Use whichever version you prefer.
To drop all conditional comments use one of the following:
mysqldump … | grep -v '^\/\*![0-9]\{5\}.*\/;$'
mysqldump … | sed -e 's/^\/\*![0-9]\{5\}.*\/;$//g'
Or to drop comments for MySQL v4 or below:
mysqldump … | grep -v '^\/\*![0-4][0-9]\{4\}.*\/;$'
mysqldump … | sed -e 's/^\/\*![0-4][0-9]\{4\}.*\/;$//g'
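If you would rather do the filtering in Python, here is a sketch of the same idea. The function name and the below_version parameter are made up for illustration; the pattern is the regex from the grep command translated to Python’s syntax, with the five-digit version number captured so it can be compared against a threshold such as 50000.

```python
import re

# Same pattern as the grep/sed versions: a line that is nothing but a
# conditional comment such as "/*!40101 SET NAMES utf8 */;".
CONDITIONAL_COMMENT = re.compile(r"^/\*!([0-9]{5}).*/;$")


def strip_conditional_comments(lines, below_version=None):
    """Drop conditional-comment lines (behaves like the grep variant).

    With below_version=50000, only comments tagged for versions older
    than 5.0.0 are dropped; with None, all of them are.
    """
    kept = []
    for line in lines:
        match = CONDITIONAL_COMMENT.match(line.rstrip("\n"))
        if match and (below_version is None or int(match.group(1)) < below_version):
            continue
        kept.append(line)
    return kept
```

Like the grep version, this drops whole lines rather than leaving blank ones behind.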
18 1 / 2012
Grouped emails in latex with href and formatting.
In many LaTeX documents you will see several authors’ emails combined together in the author block (e.g. {user1,user2}@domain.com). This tip shows how to achieve this with each email being a link to the correct address and the @domain.com portion having the same format as the emails.
% This package is required for href and the nolinkurl commands.
\usepackage{hyperref}
% This is a shortcut to add the url style and href.
\newcommand{\mailtodomain}[1]{\href{mailto:#1@domain.com}{\nolinkurl{#1}}}
\begin{document}
% ...
% Use this in the body to insert the emails.
\texttt{\{\mailtodomain{mail1}, \mailtodomain{mail2}\}@domain.com}
% ...
\end{document}
