When is a Set not a Set?

I come into work one morning, and I hear about a new customer bug. A data import failed, with this technical-looking error message:

Document size of 16990947 is larger than maximum of 16777216.

We've seen this error message before; it's what MongoDB prints when you try to save a document that's too big. A document in this database holds only the metadata for one import job, so 16 MB is enormous, and it's unusual to hit that limit unless something has gone wrong. We look at the DB, and we see a whole lot of duplicate data in an array field like this:

[
    {
        "11" : NumberLong("448990584577851392"),
        "8" : NumberLong("448991147965284352"),
        "9" : NumberLong("449036765240754176"),
        "10" : NumberLong("486623036733128716")
    },
    {
        "11" : NumberLong("448990584577851392"),
        "8" : NumberLong("448991147965284352"),
        "9" : NumberLong("449036765240754176"),
        "10" : NumberLong("486623036733128716")
    },
    {
        "11" : NumberLong("448990584577851392"),
        "8" : NumberLong("448991147965284352"),
        "9" : NumberLong("449036765240754176"),
        "10" : NumberLong("486623036733128716")
    },
    ...
]

There were something like 288,000 entries with the same 4 key-value pairs in each! This field is supposed to track unique combinations of data, so we didn't expect to see any duplicates, let alone that many in one document.

Hint: It happens in Java

This MongoDB collection is actually managed by a Java server, which accesses the database using Morphia, an object-document-mapper. In a nutshell, we write Java code with objects, and Morphia translates that to DB reads and writes for us. In this case, the object's class had a field like this:

private Set<Map<Integer, Id>> uniqueCombos;  

That meant the duplicates were occurring inside the Java Set! How was that possible?

In Java, a Set is supposed to guarantee the uniqueness of its elements, as long as those elements implement equals() (and hashCode(), if it's a HashSet). We double-checked that Map, Integer, and the Id class (a simple wrapper around a long, Java's 64-bit integer) all did this correctly.
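That contract is easy to verify in isolation. In this quick sketch, two separately constructed maps with identical contents collapse into a single set element, because HashMap defines equals() and hashCode() by contents:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SetContractDemo {
    static int distinctCount() {
        Set<Map<Integer, Long>> set = new HashSet<>();
        // Two distinct objects with identical contents: HashMap implements
        // equals() and hashCode() by contents, so the set keeps only one.
        set.add(new HashMap<>(Map.of(8, 1L, 9, 2L)));
        set.add(new HashMap<>(Map.of(8, 1L, 9, 2L)));
        return set.size();
    }

    public static void main(String[] args) {
        System.out.println(SetContractDemo.distinctCount());  // prints 1
    }
}
```

So with all the equals() and hashCode() implementations in place and correct, the uniqueness guarantee should have held.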

Answer: When your data is mutable

It turns out the culprit was this snippet of code that ran for each row of the import:

Map<Integer, Id> map = new HashMap<>();  
for (Integer key : keys) {  
    map.put(key, row.get(key));
    uniqueCombos.add(map);
}

The line uniqueCombos.add(map); was accidentally placed inside the loop when it should have been outside. That meant the map kept being modified after it had already been added to the set, and a Java set only checks uniqueness at the time of add!
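With the add moved outside the loop, each row contributes its map exactly once, fully built. A runnable sketch of the corrected shape (the nested Id record and processRow method here are minimal stand-ins for illustration, not the real classes):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ImportFix {
    // Minimal stand-in for the real Id wrapper class (hypothetical);
    // a record gets value-based equals() and hashCode() for free.
    record Id(long value) {}

    static final Set<Map<Integer, Id>> uniqueCombos = new HashSet<>();

    static void processRow(List<Integer> keys, Map<Integer, Id> row) {
        Map<Integer, Id> map = new HashMap<>();
        for (Integer key : keys) {
            map.put(key, row.get(key));
        }
        uniqueCombos.add(map);  // moved outside the loop: added once, fully built
    }

    public static void main(String[] args) {
        List<Integer> keys = List.of(8, 9, 10, 11);
        Map<Integer, Id> row = Map.of(8, new Id(1L), 9, new Id(2L),
                                      10, new Id(3L), 11, new Id(4L));
        processRow(keys, row);
        processRow(keys, row);  // a duplicate row is now correctly deduplicated
        System.out.println(uniqueCombos.size());  // prints 1
    }
}
```

An extra belt-and-suspenders option would be to add an immutable copy (e.g. Map.copyOf(map)) so that no later mutation can ever corrupt the set.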

Since there were 4 keys, we'd try to add the same map to the set 4 times per row, and each add would see a different state of the map (with 1, 2, 3, or 4 entries). If the set only compared objects with equals(), that would be fine, but since we used a HashSet, the map's hashCode() was different on each add, so each entry would (with high probability) be assigned a different bucket. After one run of this code, there'd be 4 entries all referencing the same 4-element map. Similarly, subsequent rows that should have been duplicates would succeed in adding "new" entries to the set 3 out of 4 times -- only the 4th add, with all 4 entries in place, would be correctly identified as a duplicate by the equals() method. This explains why the size of the final set was roughly 3 times the number of processed rows.
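The whole failure mode fits in a few lines. This sketch uses small stand-in keys and values in place of the real IDs; one "row" of four keys produces four set entries, every one of them a reference to the same, now four-element map:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class MutableElementDemo {
    static int entriesForOneRow() {
        Set<Map<Integer, Long>> set = new HashSet<>();
        Map<Integer, Long> map = new HashMap<>();
        int[] keys = {8, 9, 10, 11};
        long[] values = {1L, 2L, 3L, 4L};
        for (int i = 0; i < keys.length; i++) {
            map.put(keys[i], values[i]);
            set.add(map);  // the bug: same map, different hashCode() each time
        }
        return set.size();
    }

    public static void main(String[] args) {
        // Each add hashes the map in its current state, landing in a
        // different bucket, so the set never detects the duplicate.
        System.out.println(MutableElementDemo.entriesForOneRow());  // prints 4
    }
}
```

This is exactly the duplicated shape we saw in the oversized document.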

Interestingly, our automated tests didn't catch it because we only had end-to-end functional tests, which passed; this intermediate data structure didn't have a test of its own. While the duplicates caused no functional problem (aside from exceeding the document limit), they did slow down the steps downstream.

A few hours later, the bug was patched, 288,000 items became 36 items, and the customer was happy. Then we vowed to do a better job of testing and code reviewing, all while cursing the mutability of data structures. :)

Discuss on Hacker News