Dublin Tech: NoSQL

Recently, I had a requirement to store unstructured JSON data coming back from a web service. The web service returned various soccer teams from around the world. Most of the teams included a list of the soccer players who were part of the team. Some teams had 12 players, some had 20, and some had even more. The players had their own attributes; some were easy to predict, some impossible. For the entire data structure, the only attribute that I knew would definitely come back was the team's teamname. After that, it depended on each team.

{
    "teams": [
        {
            "teamname": "Kung fu pirates",
            "founded": 1962,
            "players": [
                {"name": "Robbie Fowler", "age": 56},
                {"name": "Larry David", "age": 55}
                ...
            ]
        },
        {
            "teamname": "Climate change observers",
            "founded": 1942,
            "players": [
                {"name": "Jim Carrey", "age": 26},
                {"name": "Carl Craig", "age": 35}
                ...
            ]
        },
        ...
    ]
}
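
To make the "only teamname is guaranteed" point concrete, here is a minimal sketch in plain Groovy. The payload is a hypothetical, cut-down version of the response above (no elisions, and a second team that carries nothing but its name), and the variable names are my own. It parses the JSON and works out which attributes every team has in common.

```groovy
import groovy.json.JsonSlurper

// Hypothetical sample payload, shaped like the response above.
def json = '''
{
  "teams": [
    {"teamname": "Kung fu pirates", "founded": 1962,
     "players": [{"name": "Robbie Fowler", "age": 56}]},
    {"teamname": "Climate change observers"}
  ]
}
'''

def response = new JsonSlurper().parseText(json)

// Attribute names each team actually carries, keyed by team name.
def keysPerTeam = response.teams.collectEntries { team ->
    [(team.teamname): team.keySet().sort()]
}

// Attributes present on every team: only these are safe to declare up front.
def commonKeys = response.teams
        .collect { it.keySet() as List }
        .inject { acc, keys -> acc.intersect(keys) }

println keysPerTeam
println commonKeys
```

Anything outside commonKeys has to be treated as optional, which is what pushes the design towards a schemaless store.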

There are several different ways to store this data. I decided to go for MongoDB. Main reasons:

I wanted to store the data in a format as close as possible to the JSON responses I was getting back from the web service. This would mean less code, fewer bugs, less hassle.
I wanted something that had a gentle learning curve, good documentation and good industry support (Stack Overflow threads, blog posts, etc.).
I wanted something that had a Grails plugin that was documented, had footfall and looked like it was maintained.
Features such as text stemming were nice-to-haves. Some support would have been nice, but it didn't need to be cutting edge.
It had to support good JSON search facilities, indexing, etc.

MongoDB ticked all the boxes. So this is how I got it all working. After I installed MongoDB as per Mongo's instructions, along with the MongoDB Grails plugin, it was time to write some code. Now here's the neat part: there was hardly any code. I created a domain object for the Team.

class Team implements Serializable {

    static mapWith = "mongo"

    static constraints = {
    }

    static mapping = {
        teamname index: true
    }

    String teamname

    List players
    static embedded = ['players']
}

Regarding the Team domain object:

The first point to make about the Team domain object is that I didn't even need to create it. I used this approach so that I could use GORM-style APIs such as Team.find() if I wanted to.
Players are just a List of objects. I didn't bother creating a Player class. I like the idea of ensuring the players for a team are always in a List data structure, but I didn't see the need to type anything further.
The players are marked as embedded. This means the team and its players are stored in a single denormalised data structure. This allows, amongst other things, the team data to be retrieved and manipulated in a single database operation.
I marked the teamname as an index.
I marked the domain object as
```
static mapWith = "mongo"
```
This means that if I were also using another persistence solution with GORM (PostgreSQL, MySQL, etc.), I am telling GORM that this Team domain class is only for Mongo: keep your relational hands off it. Note: this is a good reminder that GORM is a higher level of abstraction than Hibernate. It is possible to have a GORM object that goes to a NoSQL store and never goes near Hibernate.

You'll note that in the JSON there are team attributes, such as founded, that haven't been explicitly declared in the Team class. This is where Groovy and NoSQL play really well with each other. We can use some of the metaprogramming features of Groovy to dynamically add attributes to the Team domain object.

private List importTeams(int page) {
    def rs = restClient.get("teams") // invoke the web service
    List teams = rs.responseData.teams.collect { teamResponse ->
        Team team = new Team(teamname: teamResponse.teamname)
        team.save() // the first save is needed before attributes can be added dynamically
        teamResponse.each { key, value ->
            team["$key"] = value // copy every attribute from the JSON, declared or not
        }
        team.save() // the second save ensures the dynamically added attributes get saved
        return team
    }
    log.info("importTeams(), teams=${teams}")
    teams
}

Ok, so the main points in our importTeams() method:

After getting our JSON response we run a collect function on the teams array. This will create the Team domain objects.
We use some metaprogramming to dynamically add any attribute that comes back in the JSON team structure to the Team object. Note: we have to invoke save() first to be able to dynamically add attributes that are not declared in the Team domain class. We then have to invoke save() again to ensure that those dynamically added attributes are saved. This may change in future versions of the MongoDB plugin, but it is what I had to do to get it working (I was using MongoDB plugin version 3.0.1).
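
The attribute-copying loop can be tried out without Mongo at all. The following is a minimal sketch in plain Groovy, assuming a stand-in class (TeamSketch, a hypothetical name) that uses propertyMissing to catch undeclared attributes. It is not how the MongoDB plugin implements dynamic attributes; it only shows the same subscript-style assignment working for declared and undeclared properties alike.

```groovy
// Stand-in for the Team domain class: one declared property plus a map that
// catches everything else. Plain Groovy, not the MongoDB plugin.
class TeamSketch {
    String teamname
    Map dynamicAttributes = [:]

    // Called by Groovy when an undeclared property is written...
    def propertyMissing(String name, value) { dynamicAttributes[name] = value }

    // ...or read.
    def propertyMissing(String name) { dynamicAttributes[name] }
}

// Hypothetical team entry, shaped like one element of the "teams" array.
def teamResponse = [teamname: 'Kung fu pirates', founded: 1962,
                    players: [[name: 'Robbie Fowler', age: 56]]]

def team = new TeamSketch(teamname: teamResponse.teamname)
teamResponse.each { key, value ->
    team["$key"] = value   // same subscript syntax as in importTeams()
}

println team.teamname
println team.founded
println team.dynamicAttributes.keySet()
```

The declared teamname lands in its real property, while founded and players end up in the catch-all map.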

So what's next? Write some queries. Ok so two choices here. First, you can use the dynamic finders and criteria queries with the GORM thanks to the MongoDB plugin. But, I didn't do this. Why? I wanted to write the queries as close as possible to how they are supposed to be written in Mongo. There were a number of reasons for this:

A leaky abstraction is inevitable here. Sooner or later you are going to have to write a query that GORM won't do very well. Better to approach this head on.
I wanted to be able to run the queries in the Mongo console first, check explain plans if I needed to, and then use the same query in my code. This is easier to do if I write the query directly, without having to worry about what GORM is going to do.

The general format of queries is:

teams = Team.collection.find(queryMap) // where queryMap is a map of fields and the various values you are searching for.

Ok, some examples of queries...

Team.collection.find(["teamname": "hicks"]) // Find the team named hicks
Team.collection.find(["teamname": "hicks", "players.name": "Robbie Fowler"]) // As above, but the team must also have a Robbie Fowler
Team.collection.find(["players.name": "Robbie Fowler"]) // Any team that has a Robbie Fowler
Team.collection.find(["teamname": "hicks", "players.name": "Robbie Fowler"], ['players.$': 1]) // Projection passed as a second map: returns the matching player only
Team.collection.find(["teamname": ['$regex': "ick"]]) // Regular expression match, i.e. any team whose name contains the text ick
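
To get a feel for how a dotted field such as players.name matches inside the embedded list, here is a toy in-memory matcher in plain Groovy. It is only a sketch of the equality-matching rules used in the queries above, with hypothetical helper names (valuesAt, matches) and made-up data. It is not MongoDB code and it ignores operators, projections and regular expressions.

```groovy
// Resolve a dotted path such as "players.name" to the list of values it
// reaches, fanning out over any lists met along the way.
List valuesAt(Object node, List<String> path) {
    if (node instanceof List) {
        return node.collectMany { valuesAt(it, path) }
    }
    if (path.isEmpty()) {
        return [node]
    }
    if (node instanceof Map && node.containsKey(path.head())) {
        return valuesAt(node[path.head()], path.tail())
    }
    return []
}

// A document matches when every field in the query map reaches the expected value.
boolean matches(Map doc, Map query) {
    query.every { field, expected ->
        valuesAt(doc, field.toString().tokenize('.')).contains(expected)
    }
}

// Hypothetical documents, shaped like the stored teams.
def teams = [
    [teamname: 'hicks', players: [[name: 'Robbie Fowler', age: 56], [name: 'Larry David', age: 55]]],
    [teamname: 'Kung fu pirates', players: [[name: 'Robbie Fowler', age: 56]]],
    [teamname: 'Climate change observers', players: [[name: 'Jim Carrey', age: 26]]]
]

def byName = teams.findAll { matches(it, [teamname: 'hicks']) }
def byPlayer = teams.findAll { matches(it, ['players.name': 'Robbie Fowler']) }
def byBoth = teams.findAll { matches(it, [teamname: 'hicks', 'players.name': 'Robbie Fowler']) }

println byName*.teamname
println byPlayer*.teamname
println byBoth*.teamname
```

A matcher like this is also handy in unit tests, where you want to check the shape of a query map without a running Mongo instance.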

Anything else? Yeah sure. I wanted to connect to a Mongo instance on my own machine when in development but to a Mongo machine on a dedicated server in other environments (CI, stage, production). To do this, I updated my DataSource.groovy as:

environments {
    development {
        grails {
            mongo {
                host = "localhost"
                port = 27017
                username = "test"
                password = "test"
                databaseName = "mydb"
            }
        }
        dataSource {
            dbCreate = "create-drop" // one of 'create', 'create-drop', 'update', 'validate', ''
            url = "jdbc:h2:mem:devDb;MVCC=TRUE;LOCK_TIMEOUT=10000"
        }
    }

    
    ci {
        println("In bamboo environment")
        grails {
            mongo {
                host = "10.157.192.99"
                port = 27017
                username = "shop"
                password = "shop"
                databaseName = "tony"
            }
        }
        dataSource {
            dbCreate = "create-drop" // one of 'create', 'create-drop', 'update', 'validate', ''
            url = "jdbc:h2:mem:devDb;MVCC=TRUE;LOCK_TIMEOUT=10000"
        }
    }
}

You'll see I have configured multiple datasources (MongoDB and a relational one, here an in-memory H2 database). I am not advocating using both MongoDB and a relational database, just pointing out it is possible. The other point is that the MongoDB configuration is always under: grails { mongo {
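
If you want to check which mongo block a given environment ends up with, Groovy's own ConfigSlurper understands the environments { ... } layout. The sketch below is an illustration under stated assumptions: it parses a trimmed copy of the configuration as a string (credentials and the dataSource blocks left out), and mongoFor is a hypothetical helper name. It runs outside Grails, so it says nothing about how Grails itself wires up the connection.

```groovy
// Trimmed copy of the configuration above.
def source = '''
environments {
    development {
        grails {
            mongo {
                host = "localhost"
                port = 27017
                databaseName = "mydb"
            }
        }
    }
    ci {
        grails {
            mongo {
                host = "10.157.192.99"
                port = 27017
                databaseName = "tony"
            }
        }
    }
}
'''

// Parse the same source for a named environment and return its mongo settings.
def mongoFor = { String env ->
    new ConfigSlurper(env).parse(source).grails.mongo
}

println mongoFor('development').host
println mongoFor('ci').host
```

Only the block for the requested environment is evaluated, so each environment sees exactly one grails { mongo { ... } } section.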

Ok this is a simple introductory post, I will try to post up something more sophisticated soon. Until the next time, take care of yourselves.

So what's the deal with NoSQL?

Is NoSQL just a controversial buzzword? Could you imagine if the term 'Object Oriented' didn't exist and instead architectures based on concepts such as encapsulation, polymorphism and inheritance were referred to as 'NoProcedural'? Could you imagine if .NET was called 'NoJava'? Leinster was called 'NoMunster'?

Well, controversial name aside, a good way to appreciate the hype about NoSQL is to consider scalability, the classical non-functional architectural concern. In a classical OLTP architecture, when load increases and your JVM is under pressure, you need to scale. You have two choices:

vertical scaling - adding more CPU power to your JVM
horizontal scaling - adding more JVMs (usually on more boxes)

It's rarely a problem to scale the business tier horizontally. Follow the J2EE / JEE specs and, unless you've done something crazy, your business tier will scale. Just add more JVMs and load balance between them. However, while the business tier may be straightforward, the persistence tier ain't so easy. Let's say you are using a classical relational database (such as MySQL, SQL Server, DB2 or Oracle) for your persistence: you can't just add database machines like you can add JVMs. Why not? Compare doing SQL joins when the tables are on the same machine with doing them when the tables are on different machines! Imagine trying to maintain ACID characteristics for your transactions when your database is split across various CPUs. Now think about trying to do all that on 5 machines, 50, 500, 5000 machines. The more machines, the harder it gets.

The leading relational databases will scale horizontally. But only by so much. To get around this an architect usually will consider:

Scaling vertically - putting the database on the best hardware that can be afforded
Partitioning out legacy data and thus reducing things like the size of index tables. This will boost performance and put less pressure on the need to scale
Reducing the pressure on the database by caching more in the business tier
Paying a DBA a lot of money!

But what if you just run out of all possible database optimisation options and you have to scale horizontally? Not just to a few machines but to a few hundred, if not a few thousand. This is where NoSQL architectures become relevant.

With a NoSQL database there is no strict schema. Everything is effectively collapsed into one very fat table, a bit like an old skool flat file, but where each row stores a huge amount of data. So, instead of having a table for Users and a table for Activities (representing a User's activities), you put all the User information together in one fat row. This means there are no joins across tables. It also means there is a lot of data redundancy, which means more storage space is required. In addition, more computational power will be needed for writes. But because data that is used together is located in the very same place, within the same row, there are no complex joins and hence it is easier to scale. The computational requirement for reads is also lower, so reads can go faster.
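
The Users and Activities example can be sketched with plain Groovy collections. This is only an in-memory illustration with made-up data and names, not database code: the first half models two relational "tables" that have to be joined, the second half models the fat-row shape where one lookup returns everything.

```groovy
// Relational shape: two "tables", related through userId.
def users = [[id: 1, name: 'Ann'], [id: 2, name: 'Bob']]
def activities = [
    [userId: 1, action: 'login'],
    [userId: 1, action: 'upload'],
    [userId: 2, action: 'login']
]

// Reading a user's activities needs a join across both collections.
def joined = users.collect { user ->
    [name: user.name, activities: activities.findAll { it.userId == user.id }*.action]
}

// Fat-row shape: each row already carries everything about the user,
// so a read is a single lookup and the whole row can live on one machine.
def userDocs = [
    [id: 1, name: 'Ann', activities: ['login', 'upload']],
    [id: 2, name: 'Bob', activities: ['login']]
]
def ann = userDocs.find { it.id == 1 }

println joined
println ann
```

The join in the first half is cheap here because both lists sit in the same process; it is exactly the step that gets expensive once the two tables live on different machines.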

Another advantage of NoSQL databases is derived from the freedom that comes with not being tied to a strict schema. You know that headache where a change to a data model can cause big problems? Well, since there is no strict schema with NoSQL, a new attribute can simply be added to new rows without migrating the old ones. This makes the architecture more flexible and more extensible.

Right now, it's fair to say NoSQL is only relevant in the minority of architectures. But could this be another case of technical innovation driving business innovation as we have seen with smart phones? There wasn't a need for smart phones but the technical innovation provided business opportunities. I think the same could happen with NoSQL Architectures.

Take a step back from Computer Science and just think Science. Science used to be hypothesis centric; now it is becoming more and more data centric. CERN, genome sequencing, climate change analysis: all involve tonnes and tonnes of data. Surely NoSQL architectures allied with data processing technologies such as MapReduce / Hadoop will open up new ways to do Science?

So any disadvantages with NoSQL architectures? Well, it's still an immature technology. Indexing and security models are just not as sophisticated as they are with classical relational databases. And because most of it is coming from the open source community, the support is not as good as it is for relational databases. So don't throw out your SQL just yet!

PS Well done Dublin on winning the All Ireland!


Tuesday, June 17, 2014

MongoDB and Grails

Thursday, September 22, 2011

SQL or NOSQL that is the question?
