memcached keys can’t have spaces in them!

Posted May 22, 2009 by pos.thum.us
Categories: Python, Uncategorized, Work

I am shouting this, in case some other hapless soul goes Googling.

While adding a bunch of items to a memcached server using the python-memcached library, I noticed that things did not really seem to be any faster. This was weird, of course. Finally I found that some objects just did not seem to be in the cache at all, and that the client library seemed to be getting ‘confused’. The problem? Some of my keys had spaces in them.

Now of course the gut-feel definition of a key is that it should not have a space, but the docs just say that your keys have to be strings. So you assume any kind of string will do. What they should actually say is: “Strings without spaces”. I confirmed this by checking out the memcached protocol, and indeed it is a text-based protocol that uses spaces in the protocol itself to separate commands and keys sent to the server.
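To see why a space hurts, here is a rough sketch of what the first line of a storage command looks like on the wire. The `set <key> <flags> <exptime> <bytes>` layout comes from the memcached text protocol; the helper names are my own, and the naive split on spaces is only a stand-in for what a server does, not memcached's actual parser.

```python
def set_command_line(key, value, flags=0, exptime=0):
    """Build the first line of a text-protocol 'set' command."""
    return "set {} {} {} {}\r\n".format(key, flags, exptime, len(value))


def tokens(command_line):
    """Naively split a command line on spaces (illustration only)."""
    return command_line.rstrip("\r\n").split(" ")


good = tokens(set_command_line("user_42", b"hello"))
bad = tokens(set_command_line("user 42", b"hello"))

print(good)  # five tokens: command, key, flags, exptime, byte count
print(bad)   # six tokens: the key has been split in two
```

With the space in the key, whatever reads that line sees one field too many, and "42" ends up where the flags should be.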

Would be good if the python client library had a check for this and threw an error. Turns out it does have a check, but it does:

if ord(char) < 32 or ord(char) == 127:

while it should be:

if ord(char) < 33 or ord(char) == 127:
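Until a fix lands in the library, a small guard of your own can catch bad keys before they reach the client. This is only a sketch and not python-memcached code: `find_bad_char` and `safe_key` are hypothetical names, and replacing a bad key with a hash is just one possible workaround.

```python
import hashlib


def find_bad_char(key):
    """Return the index of the first space, control character or DEL, else None."""
    for index, char in enumerate(key):
        if ord(char) < 33 or ord(char) == 127:
            return index
    return None


def safe_key(key):
    """Return a clean key unchanged, otherwise a SHA-256 hex digest of it."""
    if find_bad_char(key) is None:
        return key
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


print(find_bad_char("user 42"))  # 4, the position of the space
print(find_bad_char("user_42"))  # None
```

The hex digest contains only the characters 0-9 and a-f, so it always passes the check, at the cost of keys that are no longer readable.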

What is even stranger is that there is a test for this case in the code, but in the version currently on PyPI (1.44) the test fails.

Will mail the owner with this small patch, and see if it gets into the project. In the meantime, here’s hoping you don’t bang your head and lose a few hours tracking down such an innocuous bug like I just did.

There is no metadata just data

Posted March 25, 2009 by pos.thum.us
Categories: TUDelft, Uncategorized, Work

Why is there a rift between the library catalogue and the repository? The separation between the two has been bugging me.

In our catalogue we record PhD dissertations, because they are books, innit? But there is a whole section for Dissertations in the repository because you can’t get your degree without submitting it. And many dissertations are born digital nowadays so they aren’t ‘books’ anymore. But then, the catalogue is filled with ebooks, these are books aren’t they?

And at the end of the day, what matters is some metadata describing the object and how you can get the actual object. Where ‘get’ can be a shelf number, PDF, SFX etc. and, who knows, a cortical shunt if we can get the details worked out.

For the past few months I have been building our institutional repository infrastructure, my colleagues have been building a new Discovery system, work continues apace on our catalogue, and I just can’t shake the feeling that we are all barking on different sides of the same tree. There are probably some very good organisational, political and historical reasons that I am just not aware of, being a newbie in library-land. Would love to learn about it though.

This thought was spewed by me on Twitter too, and Peter van Boheemen pointed me to his relevant blog post on what they are doing in this direction. Via Twitter we also received a pointer to The OLE Project which is exploring the same theme.

Authorisation (aka Access Management) using Bloomfilters

Posted December 10, 2008 by pos.thum.us
Categories: Uncategorized

While zooming along the highway on the way home the other day, I had an idea for making an authorisation service using Bloomfilters.

Let us assume you have a company website content management service, called ‘CMS’. Users can log in to the CMS and make changes to webpages or upload new content. The actual login (aka authentication) is handled by a separate system called A-Select which we do not have to concern ourselves with in this discussion.

The Access Management details need to be maintained in a central place. So we have a service called A11n that runs on one machine, and the CMS service has to query the A11n service to find out if a user can perform some action.

The core of the idea is that the A11n service maintains the userids and the roles which a userid is allowed to fulfil for various applications, and then makes the complete set of these available as an openly downloadable (via HTTP) Bloomfilter.

The CMS service (or any other application) periodically downloads the Bloomfilter file. When a user logs into the CMS, the CMS service does not need to contact the A11n server, it just has to check the userid+application+role entry in the Bloomfilter. If you get a definitive NO answer, you know that the user is not allowed to have that role. A YES answer, on the other hand, may occasionally be a false positive, which is where the choice of filter size comes in. If there are multiple possible roles for a userid, the application server can check the existence of the userid+application+role entry for each role, and so the end result is a list of allowed roles.
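To make the lookup concrete, here is a minimal sketch of a Bloomfilter and the role check in Python. It is not what we run: the class, the filter size, the number of hashes, the use of SHA-256 and the `+` separator are all assumptions picked for illustration.

```python
import hashlib


class BloomFilter:
    """A tiny Bloom filter backed by a bytearray (illustrative sizes only)."""

    def __init__(self, size_bits=8192, num_hashes=4):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256("{}:{}".format(seed, item).encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # False means "definitely not present"; True means "probably present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def entry(userid, application, role):
    """Build the userid+application+role string stored in the filter."""
    return "+".join((userid, application, role))


def allowed_roles(bloom, userid, application, candidate_roles):
    """Return the candidate roles whose entry is (probably) in the filter."""
    return [role for role in candidate_roles if entry(userid, application, role) in bloom]


# A11n side: build the filter. Application side: check roles at login.
bloom = BloomFilter()
bloom.add(entry("alice", "CMS", "editor"))
print(allowed_roles(bloom, "alice", "CMS", ["editor", "admin"]))
```

The file the applications download would then just be `bytes(bloom.bits)` plus the two parameters. Note that a userid containing a `+` could collide with another entry, so a real version would need a safer way of joining the fields.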

What I like about this idea is that there is minimal crypto involved, no SSL necessary and the application servers can operate in a ‘disconnected’ mode from the centralised A11n server.

We have kicked the idea around a bit here internally at the TU Delft Library and have discussed the various tradeoffs with regard to choosing filter size and whether or not to include a shared secret between each application server and the A11n server, to make it just slightly more difficult to guess the existence of roles for a given userid, for privacy reasons.
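One way the shared-secret variant could look is to store a keyed hash of each entry instead of the plain string. This is a sketch of the idea only: the function name, the `+` separator and the choice of HMAC-SHA256 are my assumptions, not something we have settled on.

```python
import hashlib
import hmac


def keyed_entry(shared_secret, userid, application, role):
    """Return a hex HMAC-SHA256 of the entry, keyed with the application's secret."""
    message = "+".join((userid, application, role)).encode("utf-8")
    return hmac.new(shared_secret, message, hashlib.sha256).hexdigest()


# Both the A11n server and the CMS compute the same string from the same secret.
secret = b"hypothetical-secret-shared-with-the-CMS"
print(keyed_entry(secret, "alice", "CMS", "editor"))
```

The A11n server would add this string to the Bloomfilter and the application would look up the same string, so somebody who downloads the filter without the secret cannot simply probe it with guessed userids and roles.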

A cursory Google on this idea did not reveal any immediate hits for others doing the same thing. I would appreciate hearing from others who are either doing the same thing, or can point out the glaring holes that might make this a terribly bad idea.

Making lots and lots of directories

Posted November 21, 2008 by pos.thum.us
Categories: Python, Uncategorized

I was testing how much space creating lots and lots of directories would take, and used this Python snippet:

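import os

# Create 256 top-level directories named 00..FF, each with 256 subdirectories.
for x in range(256):
    a = '%02X' % x  # two-digit uppercase hex
    os.mkdir(a)
    for xx in range(256):
        b = '%02X' % xx
        os.mkdir(os.path.join(a, b))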

This creates 256 x 256 = 65536 subfolders inside 256 top-level folders: a top level of 00 to FF (in hex) and then in each top-level folder another 00 to FF.

Last time I made and deleted a truckload of folders and files my disk went wonky after a while, so for these I created a separate 10MB disk image using Disk Utility (yes, I use a Mac). If the disk image uses DOS FAT as the filesystem, it runs out of space at 26/60, while the Mac OS filesystem runs out of space at E5/77.

Not sure what to make of these numbers at this point in time. If I Google long enough I could probably do the research to find exactly how much space each directory entry takes. It does make it abundantly clear that making lots of folders certainly doesn’t come for ‘free’. Must remember that they also use space.
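A rough way to turn those two paths into numbers is to count how many directories existed when the disk filled up. The sketch below assumes that the path quoted is the last directory that was created successfully and that the whole 10MB image was available for directories. Neither is verified, so treat the result as an upper bound on the space per directory.

```python
IMAGE_BYTES = 10 * 1024 * 1024  # assumes "10MB" means 10 MiB


def directories_created(last_path):
    """Count directories made so far, given the last one created, e.g. '26/60'."""
    top, sub = (int(part, 16) for part in last_path.split("/"))
    # Top-level dirs 00..top, 256 children in every earlier top-level dir,
    # and children 00..sub in the current one.
    return (top + 1) + top * 256 + (sub + 1)


for name, last_path in (("DOS FAT", "26/60"), ("Mac OS", "E5/77")):
    count = directories_created(last_path)
    print(name, count, "directories,", round(IMAGE_BYTES / count), "bytes each at most")
```

Under those assumptions FAT managed 9864 directories and the Mac OS filesystem 58974, which works out to about 1 KB per directory on FAT and under 200 bytes on the Mac OS filesystem.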

No point to make in doing this, just jotting down some notes for future reference.

Data-on-a-stick. Deepfried for extra goodness.

Posted October 17, 2008 by pos.thum.us
Categories: TUDelft, Work

For the 3TU Datacentrum we are busy setting up an infrastructure to store and preserve scientific datasets used in the research at the 3 technical universities of the Netherlands. One of the heated discussions concerns the way in which you would like to ‘encode’ or represent the data. Nowadays the knee-jerk response to data preservation is pouring it into some flavour of XML, so of course that is also our plan. But even then, the way in which you decide to do the ‘markup’ of your data is wide open. Elements vs attributes vs nesting vs granularity vs namespaces vs standards vs CDATA vs Infoset purity vs Validation vs Schemas vs running for the hills gibbering like a demented biddy.

Even something seemingly innocent like: “How do we record the date?” leads to fisticuffs in the hallways. Well almost.

One of my alternative fevered dreams considered: “What if we were to ruthlessly make RDF out of absolutely everything?” Every datapoint, every measurement, every sensor, every experiment becomes a triple. All poured into an industrial-strength triple-store like Allegrograph, or the back-end that drives Freebase.

And then you can start asking interesting questions. Soaring over datasets linked by affinity, whether that may be a geographical, problem-domain, sensor-type, or unit of measure used. Cross-domain, giving those who are able to ask the truly interesting questions the tools to do so, and maybe one day even making the tools that ask the interesting questions.
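As a toy illustration of what ‘everything is a triple’ could mean, here are a few measurements as plain Python tuples with a wildcard match over them. All identifiers and values are made up, and a real triple-store would of course use proper URIs and a query language rather than a list comprehension.

```python
# Every identifier and value below is invented for illustration.
triples = [
    ("ex:measurement/1", "ex:sensorType", "thermometer"),
    ("ex:measurement/1", "ex:unit", "celsius"),
    ("ex:measurement/1", "ex:value", 21.5),
    ("ex:measurement/2", "ex:sensorType", "anemometer"),
    ("ex:measurement/2", "ex:unit", "m/s"),
    ("ex:measurement/2", "ex:value", 7.2),
    ("ex:measurement/3", "ex:sensorType", "thermometer"),
    ("ex:measurement/3", "ex:unit", "celsius"),
    ("ex:measurement/3", "ex:value", 19.0),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


# "Which measurements were taken in degrees Celsius?"
celsius_subjects = [s for (s, _, _) in match(triples, predicate="ex:unit", obj="celsius")]
print(celsius_subjects)
```

Linking datasets by unit of measure or sensor type then comes down to matching on a shared predicate and object, whichever dataset the subject came from.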

All fine and well, but first I have to look at that zip file which was sent containing a crap-load of CSV files with missing years in the dates, timestamps that wrap across files and unit-less figures based on incorrect assumptions. Joy.

Python-powered spreadsheet

Posted October 13, 2008 by pos.thum.us
Categories: Python

I have sort-of known about ‘Resolver One’ from http://www.resolversystems.com/ for a while, but it has never muscled its way to the front of my attention queue. Today Michael Foord Twittered a new screencast from them giving a good overview of exactly what it does:

A spreadsheet on steroids. Using Python as the scripting glue, to make your spreadsheet do more stuff.

In a previous life before I had ever heard of Python I used to do a lot of data wrangling using Excel. In those days I was also a big MS-fanboy, wishing the whole world would just make the switch to MS products and stop using their own stuff. Never could I guess how phantasmagorically my wishes would come true. The data wrangling I was doing was setting up the South African operations for http://www.moneymate.ie/. Basically getting the historical data and current details for all mutual funds of the country. An interesting exercise that I would approach very differently given the experience I have now. But that is a different story. From that job came my love of Guinness, BTW.

Nowadays I tend to view all nails with a Python-shaped hammer. And seeing a spreadsheet with Python goodness makes me very excited. Maybe I could get my own financial administration in order for a change… ;-)

I will need to run it under a Windows VM, but that is the price you have to pay. Having it run under IronPython and having access to all the third-party .NET tools infrastructure was a smart move for them.

See here for the screencast: http://twurl.nl/upjysb

How to Remove the Twitter ‘Election’ info

Posted October 1, 2008 by pos.thum.us
Categories: Uncategorized

While I love using Twitter, the US elections don’t interest me all that much because I live in Amsterdam (The Netherlands). The joy of Firefox and Greasemonkey is that you can customise your own browsing experience by writing little scripts made especially for certain sites.

So I have made a Greasemonkey script to remove the election banner from view on Twitter. If you:

a) Use Firefox

b) Use Greasemonkey, or know what it is

You can try and install this script, and the election banner should be gone.

UPDATE:

I am such a typical geek in some ways. My wife asked at the breakfast table: “What election thing on Twitter? Oh that one. Why don’t you just click on the little X to make it go away? Why bother with this complicated thing you’re talking about?”

She has a point. :-D But at least very, very briefly, I felt like I was actually making my own minuscule little corner of the world more to my own liking.

There is a comedy sketch waiting to be made from this interaction though.

Pesterfish

Posted August 28, 2008 by pos.thum.us
Categories: Javascript, Python, Uncategorized, Web, Work

In my research into which data serialisation models to use to represent our repository data, I was hunting down some XML to JSON conversions. For the past few months I have been hell-bent on using JSON as the representation, because it is far more useful to me than chunks of XML, and closer to the nature of the data being represented. As Steve Yegge says:
“XML is better if you have more text and fewer tags. And JSON is better if you have more tags and less text”.

The problem, however, is that almost all the tools out there in Fedora-Commons land are biased towards XML, and have (almost) never heard of JSON. So if I only have some basic Dublin Core and a gob of JSON in my objects, that is not going to help very much. The other extreme, and undoubtedly “The Right Way To Do It”, is to have a lovely shiny Content Model and define everything in an ontology and RDF and then my brain melts. And we want to migrate our repository sooner rather than later and don’t have the time to remain in la-la land designing around our navels forever.

So now I am swinging back the other way and have decided to store the data in the objects as simple stupid XML with a smattering of tags, and at least have some XSLT disseminators to provide more buzzword compliant FOAF, Bibliontology, OAI-ORE etc. And JSON. Via XSLT. Which brings me to http://code.google.com/p/xml2json-xslt/ which pointed at http://badgerfish.ning.com/ and finally to pesterfish:

Jacob Smullyan wrote a related Python module, pesterfish, which he describes as: “a quick Python module which is uses the same xml object model as the dominant xml module in the Python world, elementtree; Some elementtree implementation (there are several) and simplejson are required. BTW, elementtree stores namespaces in Clark notation: {http://www.w3.org/1999/xhtml}br and so does this.” pesterfish also lets you round trip XML through it without any data loss.
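For anyone who has not bumped into Clark notation before, this is what it looks like using the ElementTree module that ships with Python. The tiny XHTML document is made up for the example.

```python
import xml.etree.ElementTree as ET

# The default namespace is folded into every tag name as {namespace}localname.
root = ET.fromstring('<html xmlns="http://www.w3.org/1999/xhtml"><br/></html>')

print(root.tag)     # {http://www.w3.org/1999/xhtml}html
print(root[0].tag)  # {http://www.w3.org/1999/xhtml}br
```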

I like the ElementTree API. Lots. It is very elegant and a joy to work with, especially if you have ever had to use any other XML APIs like DOM or SAX that make you want to gnaw your own arm to the bone out of frustration. So the chance of having an ElementTree-like way of consuming XML in Javascript especially appeals to me. (Well, until E4X, http://en.wikipedia.org/wiki/E4X, is widely supported.)

And so, with this long-winded preamble I would like to present pesterfish on AppEngine:

http://epoz.appspot.com/pesterfish/

You can POST an XML file to it, and it will return the pesterfish application/json.

Or you can call it with a URL for the XML file specified in the ‘in=’ parameter, like this:

http://epoz.appspot.com/pesterfish/?in=http://www.w3schools.com/XML/note.xml

It turns this XML:

<note>
	<to>Tove</to>
	<from>Jani</from>
	<heading>Reminder</heading>
	<body>Don't forget me this weekend!</body>
</note>

Into this JSON:

{"text": "\n\t", "tag": "note", "children":
 [{"text": "Tove", "tail": "\n\t", "tag": "to"},
 {"text": "Jani", "tail": "\n\t", "tag": "from"},
 {"text": "Reminder", "tail": "\n\t", "tag": "heading"},
 {"text": "Don't forget me this weekend!", "tail": "\n", "tag": "body"}]}
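The mapping is simple enough to sketch in a few lines. This is my own approximation and not Jacob's code: the keys "tag", "text", "tail" and "children" are taken from the output above, while the "attributes" key and the function name are guesses on my part.

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(element):
    """Recursively turn an ElementTree element into a JSON-friendly dict."""
    node = {"tag": element.tag}
    if element.text:
        node["text"] = element.text
    if element.tail:
        node["tail"] = element.tail
    if element.attrib:
        node["attributes"] = dict(element.attrib)  # key name is an assumption
    children = [element_to_dict(child) for child in element]
    if children:
        node["children"] = children
    return node


SAMPLE = "<note>\n\t<to>Tove</to>\n\t<from>Jani</from>\n</note>"
converted = element_to_dict(ET.fromstring(SAMPLE))
print(json.dumps(converted))
```

Because the text and tail whitespace are kept, nothing from the element tree is thrown away, which is what makes the round trip back to XML possible.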

And it supports JSONP:

http://epoz.appspot.com/pesterfish/?in=http://www.w3schools.com/XML/note.xml&callback=some_callback
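JSONP itself is nothing more than the JSON wrapped in a call to the function named in the ‘callback=’ parameter. A sketch of the wrapping follows. The function name and the check on the callback are my own, and this is not the code running on AppEngine.

```python
import json
import re


def wrap_jsonp(payload, callback):
    """Wrap a JSON-serialisable payload in a call to the named callback."""
    # Only allow a plain identifier, so the parameter cannot inject script.
    if not re.fullmatch(r"[A-Za-z_$][A-Za-z0-9_$]*", callback):
        raise ValueError("unsafe callback name: %r" % callback)
    return "{}({});".format(callback, json.dumps(payload))


print(wrap_jsonp({"tag": "note"}, "some_callback"))
# some_callback({"tag": "note"});
```

A page on another domain can then load the URL in a script tag and have `some_callback` called with the data.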

Thanks Jacob!

Quotes from Jeremy Clarkson, writer and presenter of Top Gear

Posted August 20, 2008 by pos.thum.us
Categories: Uncategorized

A friend mailed me these quotes this morning, and I twittered about one of them. It was a completely silly one about librarians, which jumped out at me because I happen to also work in a library. The quote was too long and in splitting it to fit into tweets, I inadvertently reversed the order of the statements.

Ah, the danger of languages and meaning.

On another level it made me smile. I have been to events where librarians were involved that would make Mr Clarkson’s eyebrows raise. The public image of librarians is very much removed from the reality.

Here is the complete list mailed to me, for reference purposes:

‘I’m sorry, but having a DB9 on the drive and not driving it is a bit like having Keira Knightley in your bed and sleeping on the couch.’

‘… the last time someone was as wrong as you, was when a politician stepped off an aeroplane in 1939 waving a piece of paper in the air saying there will be no war with Germany’

Illustrating the lack of power of a Boxster: ‘It couldn’t pull a greased stick out of a pig’s bottom’

On the Vauxhall Vectra VXR: ‘there is a word to describe this car: it begins with ‘s’ and ends with ‘t’ and it isn’t soot’

‘The Suzuki Wagon R should be avoided like unprotected sex with an Ethiopian transvestite’

‘The air conditioning in a Lambos used to be an asthmatic sitting in the dashboard blowing at you through a straw.’

‘Koenigsegg are saying that the CCX is more comfortable. More comfortable than what… BEING STABBED?’

‘This is the Renault Espace, probably the best of the people carriers. Not that that’s much to shout about. That’s like saying ‘Ooh good I’ve got syphilis, the BEST of the sexually transmitted diseases.’’

‘I don’t understand bus lanes. Why do poor people have to get to places quicker than I do?’

Clarkson’s highway code on cyclists: ‘Trespassers in the motorcar’s domain, they do not pay road tax and therefore have no right to be on the road, some of them even believe they are going fast enough to not be an obstruction. Run them down to prove them wrong.’

‘Britain’s nuclear submarines have been deemed unsafe…probably because they don’t have wheel-chair access.’

‘Now we get quite a lot of complaints that we don’t feature enough affordable cars on the show…….so we’ll kick off tonight with the cheapest Ferrari of them all!’

On the Lotus Elise: ‘This car is more fun than the entire French air force crashing into a firework factory.’

‘Sure it’s quiet, for a diesel. But that’s like being well-behaved….for a murderer.’

‘I don’t often agree with the RSPCA as I believe it is an animal’s duty to be on my plate at supper time.’

‘There are footballers’ wives that would be happy with this quality of stitching… on their face.’

‘Much more of a hoot to drive than you might imagine. Think of it if you like, as a librarian with a G-string under her tweed pants. I do, and it helps.’

‘You cannot have this car with a diesel. It’s like saying, I won’t go to Stringfellows tonight, I’ll get my mum to give me a lapdance, she’s a woman!’

‘Tonight, the new Viper, which is the American equivalent of a sportscar… in the same way, I guess, that George Bush is the equivalent of a President.’

On the Porsche Cayenne: ‘Honestly, I have seen more attractive gangrenous wounds than this. It has the sex appeal of a camel with gingivitis.’

Statelessness, Buddha-nature, purely descriptive markup

Posted August 18, 2008 by pos.thum.us
Categories: Uncategorized

A great quote by Tim Bray:

And I think there’s a lesson here: that statelessness, like many other good things (Buddha-nature, purely descriptive markup) is an Aristotelian virtue; unattainable in an absolute sense, but rewarded to the extent you can practice it.

Posting it here because it struck me as a great quote which I would love to remember, but it goes way past the 140 chars you get to Twitter it. The most recent flare-up in the REST debate which he refers to in that post has been very enjoyable to see, now that many more people have started to think that the WS-* stuff is bunk. Good to not jump on the next bandwagon without asking questions.

