Friday, May 15, 2009
RSA Public Key Cryptography in Java

Public key cryptography is a well-known concept, but for some reason the JCE (Java Cryptography Extension) documentation doesn't make it at all clear how to interoperate with common public key formats such as those produced by openssl. If you search the web for how to make RSA public key cryptography work in Java, you quickly find a lot of people asking questions and not a lot of people answering them. In this post, I'm going to try to lay out very clearly how I got this working.

Just to set expectations, this is not a tutorial about how to use the cryptography APIs themselves in javax.crypto (look at the JCE tutorials from Sun for this); nor is this a primer about how public key cryptography works. This article is really about how to manage the keys with off-the-shelf utilities available to your friendly, neighborhood sysadmin and still make use of them from Java programs. Really, this boils down to "how do I get these darn keys loaded into a Java program where they can be used?" This is the article I wish I had when I started trying to muck around with this stuff....

Managing the keys



Openssl. This is the de-facto tool sysadmins use for managing public/private keys, X.509 certificates, etc. This is what we want to create/manage our keys with, so that they can be stored in formats that are common across most Un*x systems and utilities (like, say, C programs using the openssl library...). Java has this notion of its own keystore, and Sun will give you the keytool command with Java, but that doesn't do you much good outside of Java world.

Creating the keypair. We are going to create a keypair, saving it in openssl's preferred PEM format. PEM formats are ASCII and hence easy to email around as needed. However, we will need to save the keys in the binary DER format so Java can read them. Without further ado, here is the magical incantation for creating the keys we'll use:


# generate a 2048-bit RSA private key
$ openssl genrsa -out private_key.pem 2048

# convert private key to PKCS#8 format (so Java can read it)
$ openssl pkcs8 -topk8 -inform PEM -outform DER -in private_key.pem \
-out private_key.der -nocrypt

# output public key portion in DER format (so Java can read it)
$ openssl rsa -in private_key.pem -pubout -outform DER -out public_key.der


You keep private_key.pem around for reference, but you hand the DER versions to your Java programs.


Loading the keys into Java



Really, this boils down to knowing what type of KeySpec to use when reading in the keys. To read in the private key:


import java.io.*;
import java.security.*;
import java.security.spec.*;

public class PrivateKeyReader {

    public static PrivateKey get(String filename)
            throws Exception {

        // Read the whole DER file into memory
        File f = new File(filename);
        FileInputStream fis = new FileInputStream(f);
        DataInputStream dis = new DataInputStream(fis);
        byte[] keyBytes = new byte[(int) f.length()];
        dis.readFully(keyBytes);
        dis.close();

        // An unencrypted PKCS#8 private key maps to PKCS8EncodedKeySpec
        PKCS8EncodedKeySpec spec =
            new PKCS8EncodedKeySpec(keyBytes);
        KeyFactory kf = KeyFactory.getInstance("RSA");
        return kf.generatePrivate(spec);
    }
}


And now, to read in the public key:


import java.io.*;
import java.security.*;
import java.security.spec.*;

public class PublicKeyReader {

    public static PublicKey get(String filename)
            throws Exception {

        // Read the whole DER file into memory
        File f = new File(filename);
        FileInputStream fis = new FileInputStream(f);
        DataInputStream dis = new DataInputStream(fis);
        byte[] keyBytes = new byte[(int) f.length()];
        dis.readFully(keyBytes);
        dis.close();

        // The DER public key from "openssl rsa -pubout" maps to X509EncodedKeySpec
        X509EncodedKeySpec spec =
            new X509EncodedKeySpec(keyBytes);
        KeyFactory kf = KeyFactory.getInstance("RSA");
        return kf.generatePublic(spec);
    }
}


That's about it. The hard part was figuring out a compatible set of:

  1. openssl DER output options (particularly the PKCS#8 encoding)

  2. which type of KeySpec Java needed to use (strangely enough, the public key needs the "X509" keyspec, even though you would normally handle X.509 certificates with the openssl x509 command, not the openssl rsa command. Real intuitive.)
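For what it's worth, the pairing of KeySpecs above matches what Java itself reports: for RSA keys from the default provider, getFormat() returns "PKCS#8" for a private key and "X.509" for a public key. Below is a minimal sketch, assuming Java 7 or later, that loads both DER files with Files.readAllBytes instead of the stream handling shown earlier. The class name DerKeyLoader is a placeholder chosen for illustration, not part of any library, and the file names in main are the ones produced by the openssl commands above.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.KeyFactory;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.spec.PKCS8EncodedKeySpec;
import java.security.spec.X509EncodedKeySpec;

// Hypothetical helper (the name is illustrative): loads unencrypted DER keys.
final class DerKeyLoader {

    // Expects the PKCS#8 file written by "openssl pkcs8 -topk8 ... -nocrypt".
    static PrivateKey loadPrivate(Path derFile) throws Exception {
        byte[] keyBytes = Files.readAllBytes(derFile);
        return KeyFactory.getInstance("RSA")
                .generatePrivate(new PKCS8EncodedKeySpec(keyBytes));
    }

    // Expects the file written by "openssl rsa ... -pubout -outform DER".
    static PublicKey loadPublic(Path derFile) throws Exception {
        byte[] keyBytes = Files.readAllBytes(derFile);
        return KeyFactory.getInstance("RSA")
                .generatePublic(new X509EncodedKeySpec(keyBytes));
    }

    public static void main(String[] args) throws Exception {
        PrivateKey priv = loadPrivate(Paths.get("private_key.der"));
        PublicKey pub = loadPublic(Paths.get("public_key.der"));
        // Print the encoding name Java associates with each key
        System.out.println(priv.getFormat());
        System.out.println(pub.getFormat());
    }
}
```

Reading the file in one call avoids having to size the buffer and close the stream by hand; the KeySpec choice is unchanged.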



From here, signing and verifying work as described in the JCE documentation; the only other thing you need to know is that you can use the "SHA1withRSA" algorithm when you get your java.security.Signature instance for signing/verifying, and that you want the "RSA" algorithm when you get your javax.crypto.Cipher instance for encrypting/decrypting.
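To make that concrete, here is a minimal sketch of both operations using the algorithm names mentioned above. The class and method names (RsaUsageSketch, sign, verify, encrypt, decrypt) are made up for illustration, and main generates a throwaway key pair in memory as a stand-in for keys loaded from the DER files. Note that Cipher.getInstance("RSA") leaves the padding to the provider's default, and that RSA on its own only handles payloads smaller than the key size.

```java
import java.nio.charset.StandardCharsets;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;
import javax.crypto.Cipher;

// Hypothetical wrapper class; all names here are illustrative.
final class RsaUsageSketch {

    // Sign with the private key
    static byte[] sign(PrivateKey key, byte[] message) throws Exception {
        Signature signer = Signature.getInstance("SHA1withRSA");
        signer.initSign(key);
        signer.update(message);
        return signer.sign();
    }

    // Verify with the matching public key
    static boolean verify(PublicKey key, byte[] message, byte[] signature)
            throws Exception {
        Signature verifier = Signature.getInstance("SHA1withRSA");
        verifier.initVerify(key);
        verifier.update(message);
        return verifier.verify(signature);
    }

    // Encrypt a short payload with the public key (provider-default padding)
    static byte[] encrypt(PublicKey key, byte[] plaintext) throws Exception {
        Cipher cipher = Cipher.getInstance("RSA");
        cipher.init(Cipher.ENCRYPT_MODE, key);
        return cipher.doFinal(plaintext);
    }

    // Decrypt with the private key
    static byte[] decrypt(PrivateKey key, byte[] ciphertext) throws Exception {
        Cipher cipher = Cipher.getInstance("RSA");
        cipher.init(Cipher.DECRYPT_MODE, key);
        return cipher.doFinal(ciphertext);
    }

    public static void main(String[] args) throws Exception {
        // Stand-in key pair; keys loaded from DER files are used the same way
        KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
        gen.initialize(2048);
        KeyPair pair = gen.generateKeyPair();

        byte[] message = "hello".getBytes(StandardCharsets.UTF_8);
        byte[] signature = sign(pair.getPrivate(), message);
        System.out.println(verify(pair.getPublic(), message, signature));

        byte[] roundTrip = decrypt(pair.getPrivate(), encrypt(pair.getPublic(), message));
        System.out.println(new String(roundTrip, StandardCharsets.UTF_8));
    }
}
```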

Many happy security returns to you.

Tuesday, February 10, 2009
Downloading your Blogger archives

A friend was looking for a way to grab an archive of his Blogger posts into a CSV file he could do text mining on (and presumably, for a low-fi backup mechanism).

I wrote this Python script for him; enjoy.



#!/usr/bin/env python
#
# Copyright (C) 2009 by Jon Moore
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.

import csv
import urllib2
import unicodedata
import xml.etree.ElementTree as etree

blog_feed = 'http://codeartisan.blogspot.com/feeds/posts/default'
output = 'posts.csv'

ATOM_NS = 'http://www.w3.org/2005/Atom'

def norm(s):
    # Drop non-ASCII characters so the value is safe to write with csv
    if not s: return None
    return s.encode('ascii','ignore')

def main():
    f = open(output, 'wb')
    csv_wr = csv.writer(f)
    url = blog_feed + '?max-results=100'
    csv_wr.writerow(['id','published','updated','permalink','title','content'])
    # Follow the feed's rel="next" links until there are no more pages
    while url:
        print "fetching", url
        feed = etree.fromstring(urllib2.urlopen(url).read())
        for entry in feed.findall("{%s}entry" % ATOM_NS):
            id = entry.find("{%s}id" % ATOM_NS).text
            published = entry.find("{%s}published" % ATOM_NS).text
            updated = entry.find("{%s}updated" % ATOM_NS).text
            title = norm(entry.find("{%s}title" % ATOM_NS).text)
            content = norm(entry.find("{%s}content" % ATOM_NS).text)
            # The permalink is the entry's HTML "alternate" link
            perm_url = ''
            for link in entry.findall("{%s}link" % ATOM_NS):
                if (link.get('rel') == 'alternate'
                    and link.get('type') == 'text/html'):
                    perm_url = link.get('href')
                    break
            csv_wr.writerow([id,published,updated,perm_url,title,content])
            print "wrote",id
        url = None
        for link in feed.findall("{%s}link" % ATOM_NS):
            if link.get('rel') == 'next':
                url = link.get('href')
                break
    f.close()

if __name__ == "__main__":
    main()

Saturday, January 24, 2009
Business Cases and Cloud Computing

I just read a very interesting article by Gregory Ness on seekingalpha.com that talks about some of the technology trends behind cloud computing. One key quote:


Automation and control has been both a key driver and a barrier for the adoption of new technology as well as an enterprise’s ability to monetize past investments. Increasingly complex networks are requiring escalating rates of manual intervention. This dynamic will have more impact on IT spending over the next five years than the global recession, because automation is often the best answer to the productivity and expense challenge.


One other cited link is to an IDC study that includes the following graph:

Graph showing that 60% of the total cost of ownership (TCO) for a server over a 3 year lifetime comes from staffing costs.

Note that staffing accounts for 60% of the cost of maintaining a server over its lifetime. Cloud infrastructure services like Amazon EC2 would really only save an enterprise data center on hardware setup / software install costs, which probably amount to only a small share of the staff time spent on a given server. Actually administering the server once it is running is really the bulk of the cost, and that won't go away on EC2: you'll still need operations staff to provision/image cloud infrastructure. EC2 makes sense if the economies of scale of AWS are such that they can achieve a lower operational cost for that other 40% than you can, or if there is a business / time-to-market value proposition in being able to provision hardware on EC2 more rapidly than you can acquire and install hardware yourself.

Given the huge economy of scale that the large cloud providers have (tens of thousands of servers), it is going to be hard to get your costs for that 40% lower than what they can achieve with their existing infrastructure automation and ability to purchase hardware in bulk, especially for a startup company whose hardware needs are initially modest. Let's guess that there's a 33% markup on cost for EC2, so when you are getting charged $0.10 per CPU hour, it's really only costing them $0.075. Let's assume a 75% experience curve on infrastructure (meaning that each time you double the number of servers you have deployed, the last server costs only 75% of what the server at the halfway point cost).

By one estimate, Amazon has 30,000 servers. Now let's work backward (1/0.75 = 1.33): at 15,000 servers, their marginal cost was $0.075 * 1.33 = $0.09975, or about $0.10. At 7,500 servers, their marginal cost was $0.09975 * 1.33 = $0.13. In other words, you'd have to be planning to deploy 15,000 servers in order to have a hope of getting your marginal cost under what they'll charge you retail.
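The back-of-the-envelope calculation above can be sketched in a few lines of code. This is only an illustration under the same guesses as the text (a $0.075 cost per CPU hour at 30,000 servers and a 75% experience curve); the class and method names are invented for the example. It uses the exact factor 1/0.75 rather than the rounded 1.33, so it comes out at about $0.10 for 15,000 servers and about $0.13 for 7,500.

```java
// Hypothetical helper for the experience-curve estimate; names are illustrative.
final class ExperienceCurve {

    // Marginal cost at `servers`, given a known marginal cost at `knownServers`
    // and a curve where each doubling of volume multiplies cost by `rate`.
    static double marginalCost(double knownCost, double knownServers,
                               double servers, double rate) {
        // Number of doublings between the two volumes (negative when scaling down)
        double doublings = Math.log(servers / knownServers) / Math.log(2.0);
        return knownCost * Math.pow(rate, doublings);
    }

    public static void main(String[] args) {
        double guessedCost = 0.075; // guessed cost per CPU hour at 30,000 servers
        for (int servers : new int[] {30000, 15000, 7500}) {
            System.out.printf("%d servers: $%.4f per CPU hour%n",
                    servers, marginalCost(guessedCost, 30000, servers, 0.75));
        }
    }
}
```

Changing the last argument shows how sensitive the break-even point is to the assumed curve: a steeper curve (a smaller rate) pushes the cost at smaller deployments up faster.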

(I think this is actually a conservative estimate: the experience/learning curve for infrastructure deployment is probably steeper than 75% due to existing hierarchical deployment patterns and a product (provisioned servers) that lends itself well to automation. Also, due to the high barrier to entry for cloud computing in terms of number of servers you need to be competitive, they can probably get away with charging an even higher markup).

One corollary of this is that if you are currently running a data center with far fewer servers (i.e. the hardware is a sunk cost), you might actually be better off turning your data center off and leasing from Amazon. Now of course, there are some things (customer credit card data, extremely sensitive business information) that you just wouldn't be willing to host somewhere outside your own data center. But that's probably a very specific set of data--host that stuff and lease the rest in the cloud, particularly if you can get adequate SLAs from your cloud vendor.

So that deals with the 40% of the TCO for a server that isn't staffing. How do you cut costs on the other 60%?

You won't really be able to make a dent in that 60% until you get not just to fully automated infrastructure provisioning but also to fully automated software deployment and provisioning. This is not possible until you get to standardized computing platforms with specific functionality that are scale-on-demand, like Akamai NetStorage, Amazon S3/EBS/SQS/SimpleDB, and Google App Engine. These are known as "Platform-as-a-Service" (PaaS) offerings.

There's a similar experience curve argument here: you could spend internal development time to set up some kind of application deployment framework, but you'd essentially have to be willing to build and deploy a number of different apps within an order of magnitude of what the Google App Engine team supports in order to get your costs under what Google will charge you. Unless you are in the business of directly competing with them in the PaaS market, you might as well buy from them and focus your energy on providing your unique business value, not software or hardware infrastructure. [Editor's note: this was something Matt Stevens said to me a while ago, and it wasn't until I went through the mental exercise of writing this article that I actually got it].

Yesterday I implemented (not prototyped) a service in Google App Engine in about 6 hours that would cost around $400 per month (according to their recent pricing announcements) if projected usage were more than double what it is now. I estimate this would require at least 10 database servers just to host the data in a scalable, performant fashion, never mind the REST data interface (webnodes) sitting in front of it. On Amazon EC2, that'd be $720 per month on small instances (10 instances at $0.10 per hour for a 720-hour month, assuming those were even beefy enough), and per the experience curve argument above, it's probably way more than that in our data center. And that's not counting any of the reliability/load balancing infrastructure.

So my open question is: how, as a software developer, can you justify not building your app in one of these cloud frameworks?