Implementing and Testing the Bulk Insert/Save Pattern in MongoDB

Here at Vena, we use MongoDB pretty heavily in our back end stack. We also are working on making automated testing a larger, more prevalent part of our culture. We're not perfect at this, but, like everything, we have a drive for constant improvement.

Today I'm going to walk you through implementing a useful pattern we've developed here for doing bulk saves with objects that don't conform to MongoDB's ideas about unique identifiers. I'll also go over some of the challenges that crop up when unit testing code that interacts with an external system (in this case MongoDB) and how we overcame those challenges.

Let's Start With an Example

In a part of our codebase, we're doing a bulk insert into MongoDB with a list of Java objects. Let's pretend they are Person objects, with a unique combination of name and birthday. Let's say person objects also have a bank account balance value. (In reality this is a terrible database design but, hey, it's a contrived example. Work with me here.)

We have a PersonDAO that's defined as follows (DAO stands for Data Access Object):

public class PersonDAO extends BasicDAO<Person, Long> {  
    public PersonDAO(MongoClient mongoClient, Morphia morphia, String dbName) {
        super(mongoClient, morphia, dbName);
        createIndexes();
    }

    // Create a composite index called "primary_name_bday" on the "name" and "bday" fields
    public void createIndexes() {
        getDatastore().ensureIndex(getEntityClass(), "primary_name_bday", "name, bday", /*unique*/true, /*dropDupsOnCreate*/false);
    }

    public Person getPerson(Person dupe) {
        Query<Person> q = createQuery();
        q.field("name").equal(dupe.getName());
        q.field("bday").equal(dupe.getBday());
        return q.get();
    }

    public void bulkInsertPerson(List<Person> personsToInsert) {
        // TODO
    }
}

Most of this is fairly routine, but the bulkInsertPerson method needs some extra logic to handle a clash between MongoDB's philosophy and our own internal use of these objects. The spec for this method states that if an object in the list already exists in MongoDB, we should update that object.

MongoDB requires that every stored object have an ID. We, however, uniquely identify Person objects by their name and birthday fields, and we have a composite index on those fields.

In database terms: The name-birthday combination is the natural key of the data. We want to use that combination as the primary key, but MongoDB only allows for a single field to be the primary key. This means we have to generate a surrogate key (the numeric ID) to convince MongoDB to hold our data.
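To make the natural key versus surrogate key distinction concrete, here is a minimal plain-Java sketch. It is not MongoDB or Morphia code: NaturalKey and UniqueIndexSketch are hypothetical names invented for this illustration, and the map only imitates what a unique index on (name, bday) does.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical value object for the natural key: name plus birthday. */
final class NaturalKey {
    private final String name;
    private final String bday;

    NaturalKey(String name, String bday) {
        this.name = name;
        this.bday = bday;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof NaturalKey)) {
            return false;
        }
        NaturalKey other = (NaturalKey) o;
        return name.equals(other.name) && bday.equals(other.bday);
    }

    @Override
    public int hashCode() {
        return Objects.hash(name, bday);
    }
}

/** Toy stand-in for a collection with a unique index on (name, bday). */
final class UniqueIndexSketch {
    // natural key -> surrogate ID of the stored document
    private final Map<NaturalKey, Long> index = new HashMap<>();

    /** Returns true if stored, false if the natural key is already taken. */
    boolean insert(long surrogateId, String name, String bday) {
        return index.putIfAbsent(new NaturalKey(name, bday), surrogateId) == null;
    }

    /** Looks up the surrogate ID by natural key, or returns null if absent. */
    Long idFor(String name, String bday) {
        return index.get(new NaturalKey(name, bday));
    }
}
```

In this sketch, inserting John Smith (born 1985-02-25) under ID 1 succeeds, a second insert of the same pair under ID 2 is rejected, and looking the pair up still returns ID 1. That rejected second insert is the situation the rest of this post deals with.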

Because the ID is only a surrogate, our bulk insert code has to account for Person objects that already exist in the database under a different ID but with a matching name-birthday combination. In that case, what we're really attempting is an update of an existing person's account balance.

That makes our bulk insert code look like this:

public void bulkInsertPerson(List<Person> personsToInsert) {  
    BulkWriteOperation builder = getCollection().initializeUnorderedBulkOperation();
    for (Person person : personsToInsert) {
        DBObject insObj = entityToDBObj(person);
        builder.insert(insObj);
    }

    try {
        builder.execute();
    } catch (BulkWriteException bwe) {
        for (BulkWriteError e : bwe.getWriteErrors()) {
            if (e.getCode() == 11000) { //duplicate key exception
                Person duplicatePerson = personsToInsert.get(e.getIndex());
                Person fromMongo = getPerson(duplicatePerson);  // look it up in the collection

                duplicatePerson.setId(fromMongo.getId());
                save(duplicatePerson);
            } else {
                //any other kind of error is not ours to handle: rethrow the whole exception
                throw bwe;
            }
        }
    }
}

I just threw a wall of code at you, so let's dissect it chunk by chunk.

The summary is: we're defining a method called bulkInsertPerson that takes a List<Person> and inserts all of the Person objects in the list.

To do this, we first create an unordered BulkWriteOperation and, for each Person, convert it to a DBObject and queue it for insertion. Because the operation is unordered, MongoDB attempts every queued insert even if some of them fail, rather than stopping at the first error. Then, with the line builder.execute(), we ask the BulkWriteOperation to perform all of the inserts we queued, and we wait until they have completed.

The hairy bit comes inside the catch block. We know that MongoDB will throw a BulkWriteException if one or more of the inserts fail; the BulkWriteException encapsulates a list of the errors that occurred during the bulk write. Each error has an error code, describing the kind of error it is, and an index, which corresponds to the order in which we queued the objects for insertion.

We catch the BulkWriteException and iterate through all of the write errors it contains. If an error occurred because the object already exists in the collection under a different ID, it will be a duplicate key error, and its error code will be 11000. If we hit an error that is not a duplicate key error, we rethrow the whole BulkWriteException and let the caller handle it. Otherwise, we do this:

Person duplicatePerson = personsToInsert.get(e.getIndex());  
Person fromMongo = getPerson(duplicatePerson);  // look it up in the collection

duplicatePerson.setId(fromMongo.getId());  
save(duplicatePerson);

This looks up the Person Java object that failed, finds the corresponding object in the MongoDB collection (getPerson looks the object up by its natural name-birthday key, not its ID), and then takes the ID that is stored in the MongoDB collection and writes that onto the Java object. Finally, we save the "fixed" Person to the MongoDB collection, confident that it will not fail with a duplicate key error.
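If you'd like to poke at this flow without a database, here is a rough in-memory imitation of it in plain Java. Everything in it is an assumption made for illustration: BulkFixupSketch, its nested Person and WriteError classes, and the two maps are hypothetical stand-ins, not the MongoDB driver's types. It only enforces the natural-key index and assumes incoming surrogate IDs don't collide with stored ones.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory imitation of the insert-then-fix-duplicates flow; not the MongoDB driver. */
final class BulkFixupSketch {
    static final int DUPLICATE_KEY = 11000;

    /** Minimal stand-in for the article's Person. */
    static final class Person {
        long id;
        final String name;
        final String bday;
        final double balance;

        Person(long id, String name, String bday, double balance) {
            this.id = id;
            this.name = name;
            this.bday = bday;
            this.balance = balance;
        }

        String naturalKey() {
            return name + "|" + bday;
        }
    }

    /** Carries the two values the real code reads from each write error. */
    static final class WriteError {
        final int index;
        final int code;

        WriteError(int index, int code) {
            this.index = index;
            this.code = code;
        }
    }

    private final Map<Long, Person> byId = new HashMap<>();
    private final Map<String, Long> idByNaturalKey = new HashMap<>();

    /** Tries every insert (like an unordered bulk write) and reports failures by position. */
    List<WriteError> bulkInsert(List<Person> people) {
        List<WriteError> errors = new ArrayList<>();
        for (int i = 0; i < people.size(); i++) {
            Person p = people.get(i);
            if (idByNaturalKey.containsKey(p.naturalKey())) {
                errors.add(new WriteError(i, DUPLICATE_KEY));
            } else {
                idByNaturalKey.put(p.naturalKey(), p.id);
                byId.put(p.id, p);
            }
        }
        return errors;
    }

    /** Inserts the batch, then replaces the stored copy of each duplicate. */
    void bulkInsertOrUpdate(List<Person> people) {
        for (WriteError e : bulkInsert(people)) {
            if (e.code != DUPLICATE_KEY) {
                throw new IllegalStateException("unexpected write error " + e.code);
            }
            Person dupe = people.get(e.index);               // position in the list we passed in
            dupe.id = idByNaturalKey.get(dupe.naturalKey()); // adopt the stored surrogate ID
            byId.put(dupe.id, dupe);                         // "save": replace the stored copy by ID
        }
    }

    /** Looks a person up by natural key, or returns null if absent. */
    Person find(String name, String bday) {
        Long id = idByNaturalKey.get(name + "|" + bday);
        return id == null ? null : byId.get(id);
    }
}
```

The important detail is the same as in the real code: the error's index is only useful because the objects were queued in the same order as the list, so it can be used to find the Java object that failed.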

Ok, so that's the logic to bulk update a list of Person objects when you don't actually care about the ID field MongoDB requires on each object.

So You Think You Can Test

I only stumbled onto this piece of code when we had a bug report related to this functionality. The code above is the working, post-fix code. After fixing the issue, though, we wanted to write unit tests to prevent regressions.

Seems reasonable, right?

Well, at first, sure. "Write tests to prevent regressions" makes sense. So let's think through this: what would the unit test look like here?

Maybe it would look a little like the following JUnit test:

@Test
public void saveDuplicates() {  
    Person jSmith = new Person("John Smith", "1985-02-25", 1234.56);
    personDAO.save(jSmith);

    List<Person> personList = new ArrayList<>();
    Person alice = new Person("Alice", "1990-04-18", 400.00);
    personList.add(alice);
    Person jSmithDupe = new Person("John Smith", "1985-02-25", 1000.00);
    personList.add(jSmithDupe);
    personDAO.bulkInsertPerson(personList);

    assertEquals(1000.00, personDAO.getPerson(jSmith).getAccountBalance(), 0); // compare 2 doubles and verify that they are identical
}

But wait! getPerson(), which we're using in our test to validate the result, as well as the entire bulkInsertPerson API and the save() method used within the bulkInsertPerson() method, all talk to MongoDB!

So now we've introduced a dependency on MongoDB in our unit tests, and we need to have a MongoDB server running just so our unit tests can run? That feels weird, and it slows our unit tests down. We want these tests to be fast to run, so we can run them on every commit and every build without paying a huge cost.

It turns out it's not just me who thinks this is weird: this is actually a common anti-pattern in testing code. The recommended approach is to mock out the external dependency in the unit test, and that way the test only exercises your own code. In effect, we assume that the external library/dependency is set in stone, and write our unit test under that assumption.

We could introduce a mock MongoDB object with something like Mockito, which lets us mock out arbitrary objects and control their responses to functions we want to call. However, we would then need to know exactly which methods to mock, and how they behave.
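To show what "knowing exactly which methods to mock" costs, here is a small hand-written stub in plain Java. It is not Mockito and not our actual code: the PersonStore interface, ScriptedStore, and StubbedBulkInsert are all hypothetical, and people are reduced to a single natural-key string to keep the sketch short. A mocking library would generate something equivalent to ScriptedStore for you, but you would still have to supply the same scripted answers.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical seam: the three storage calls the bulk-insert logic relies on. */
interface PersonStore {
    /** Inserts all entries; returns the list positions rejected as duplicates. */
    List<Integer> insertAll(List<String> naturalKeys);

    /** Returns the stored ID for a natural key. */
    long findId(String naturalKey);

    /** Saves the entry under the given ID. */
    void save(long id, String naturalKey);
}

/** Hand-written stub: each answer is scripted for exactly one scenario. */
final class ScriptedStore implements PersonStore {
    final List<String> saved = new ArrayList<>();

    @Override
    public List<Integer> insertAll(List<String> naturalKeys) {
        return Collections.singletonList(1); // scripted: the second entry is the duplicate
    }

    @Override
    public long findId(String naturalKey) {
        return 42L; // scripted: the ID "already stored" for the duplicate
    }

    @Override
    public void save(long id, String naturalKey) {
        saved.add(id + ":" + naturalKey); // record the call so a test can inspect it
    }
}

final class StubbedBulkInsert {
    /** Same shape as bulkInsertPerson, written against the hypothetical seam. */
    static void bulkInsert(PersonStore store, List<String> naturalKeys) {
        for (int index : store.insertAll(naturalKeys)) {
            String dupe = naturalKeys.get(index);
            store.save(store.findId(dupe), dupe);
        }
    }
}
```

A test built on this stub can confirm that the duplicate is saved under ID 42, but only because the stub was told which entry is the duplicate. Nothing checks that a real MongoDB would agree, and a different scenario needs a different script.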

We could do this for this one case, and I encourage you to download the sample code from our repo and try it out. However, this wouldn't scale. We use MongoDB a lot, and if we wanted to use this approach to test other methods, we wouldn't be able to reuse any code. Instead, we'd have to re-mock the right methods, in different ways each time.

If we wanted to create re-usable mocks for these MongoDB methods, we'd eventually just re-implement the MongoDB API!

It turns out that there's an open-source project called Fongo that already does just this. From their README on GitHub:

Fongo is an in-memory java implementation of MongoDB. It intercepts calls to the standard mongo-java-driver for finds, updates, inserts, removes and other methods. The primary use is for lightweight unit testing where you don't want to spin up a mongod process.

Sounds like exactly what we need!

Faking Reality

So how do we get our test set up with Fongo? Well, the test case itself is relatively straightforward: it's exactly the test case I posted above! However, left unaltered, that code will indeed try to talk to MongoDB. Instead, we use JUnit's @BeforeClass annotation to set up Fongo once for the entire test class, as follows:

@BeforeClass
public static void setup() {  
    Fongo fongo = new Fongo("fongo mock server");
    MongoClient mongoClient = fongo.getMongo();
    Morphia morphia = new Morphia();

    personDAO = new PersonDAO(mongoClient, morphia, "mydb");
}

This creates a new PersonDAO that talks to the mydb database in a fake MongoDB environment. This fake MongoDB environment is entirely in memory, meaning that once the test stops, the entire thing will get garbage collected by the JVM. We don't even have to do any clean up!

The beauty of this approach is that our unit tests use the exact same code that any other consuming code will use. We don't have to call any special methods or pass any special flags; we can simply use our methods the same way they will be used in production code.

Additionally, if we're doing some debugging and actually want the database to stick around after our test completes, we can simply replace the MongoClient declaration with

MongoClient mongoClient = new MongoClient();  

and then our unit test will simply go back to talking to a real instance of MongoDB (running at the default location, localhost:27017).

And the true beauty is that we're done! With that 4-line method set up to run once, before any of our tests run, we're able to write real, useful tests that provide value and verify that our code works.

The One Drawback

We're assuming that the behaviour of MongoDB is set in stone; it won't ever mysteriously change (unless we switch versions), and so we can fake it out and our PersonDAO doesn't notice the difference. To fake it, we used Fongo, an in-memory implementation of the MongoDB API.

It actually doesn't matter to us if MongoDB has a bug; as long as we can mimic its behaviour without actually spinning up a MongoDB instance, we can run our tests quickly and without relying on external dependencies.

But what happens when Fongo has a bug in it? Well, we actually had this happen to us before, and I'll tell you all about it...next time.
