"Don't Use MongoDB" --> Hmmm

You'd have to have been hiding under a rock not to have seen the (anonymous) Don't Use MongoDB post that has been making the rounds.  I've replicated it below, just in case it vanishes.
The author is (as of this writing) unknown, but he/she brings up a lot of points.  For what it's worth, I completely agree with the final recommendations, viz.

1. Don't lose data, be very deterministic with data
2. Employ practices to stay available
3. Multi-node scalability
4. Minimize latency at 99% and 95%
5. Raw req/s per resource

Obligatory caveat here - I'm a big fan of BigCouch, and have been using it for a while now.
That said, to defend Mongo (just a little bit. and yes, it hurts), when we started going down the CouchDB path, I knew what I was doing.  I'd read everything, I'd tested the bejeebers out of Couch (especially in the 0.9.0 days), and already had a pretty thorough grounding in NoSQL tech for a variety of unrelated reasons which I won't get into.
And yet.
And yet...
And yet, we got caught flatfooted.
The problem is, no matter how well you think you know new tech, you are still the victim of your own pre-knowledge and pre-judices.  There are things you will see which you absolutely understand, and utterly ignore, because, well, you know that it isn't an issue.  In CouchWorld, you know that the write doesn't return till it completes (ignore the fsync flag please), but you never realized that if, god forbid, you *had* to restart your Never Ever Goes Down Massively Distributed Erlang System On A Ton Of Nodes, and every one of the Gazillion processes tried to save their state at exactly the same time, and your CouchDB server suddenly found itself so insanely overloaded it crapped out, and you lost everyone's State, and people are yelling at you because it takes the entire weekend to spin things up from scratch and
Well then.
Well then.
Well then, we learnt something, didn't we?
So yeah, this isn't all that much of a defense of MongoDB - or CouchDB for that matter :-) - but the point remains that there is a huge amount of caveat emptor-ing that goes into any kind of system design, and when dealing with new stuff, you need to, basically, throw all the salt you have over your shoulder.
(By the way, we did re-architect - under stress - our Erlang-world environment, and we moved from CouchDB to BigCouch.  And we found a bunch of different issues that hit us with BigCouch.  But that's a different story, and it all worked out in the end.)

All that said, MongoDB does have a bunch of craziness associated with it.  Fixable?  Hell yes.  But it's stuff that you need to understand, because, if you don't, and any of this matters to you, well then, you're up a tree without a paddle :-)


  • Keeping your working set in memory.  Which may not matter to you at all, but if it does, well, you have some serious planning to do.  I mean, you can't easily clobber errant processes.  In fact, you don't even know what your headroom is... (obligatory CouchDB plug here - just kill -9 the errant process.  You don't lose anything.  Ah, MVCC rocks)
  • Read performance is brilliant.  Brilliant.  But, if you're doing a bunch-a writes, it *is* going to suck all the air out of your reads.  Really.  Trust me on this.
  • Oh Indexing.  Oh Wow.  Talk about performance hogs.
  • This one is more relevant for those of us who are huge Map/Reduce people (hey, CouchDB bias showing again), but Map/Reduce performance on Mongo is, well, whatever.  It'll probably get better (but then again, it can't get much worse)
  • You don't get master/master replication.  Which may well not be an issue for you at all (high read, low write?  Then most of the above isn't a problem either!).  But if you're both read *and* write heavy, and you need master/master, well, not here please, really not here.
  • Did I mention MVCC rocks?  Part of the point there is that built into the Mongo framework is the likelihood of data corruption/loss.  Which is *bad*.  It was horrifically bad back in the pre-1.8 days (oh, all of a few months ago!), and it's supposedly fixed now, but hey, it shouldn't have been there in the first place.  I mean, this is supposed to be a database.  You're supposed to be worrying about CAP - Consistency, Availability, Partition Tolerance (Eric Brewer) - not ICAP (Is-My-Data-Safe).  And yes, for an object lesson in how to deal with Database badness, see what CouchDB did in their moment of infamy.  Stuff happens - it's how you deal with it that's important.
Lastly, a minor rant against defenders - not just of MongoDB, but of any NoSQL database - the clause "But MySQL clusters sucked just as much when they first came out" is really not much of an argument.  The obligatory response is: And you should have learnt from it, and not let it happen to you....

Anyhow, go read the original post, and then see the responses from HackerNews.  Ignore the obligatory ones from 10gen and other assorted "We are company X, use MongoDB, and have no problems at all" posts.  There is a much more detailed and nuanced set of pro and con arguments made in the rest of the thread.  If I had to summarize it, I'd say it comes down to

  • You need to know what you are doing (my point above)
  • There are (or at least, were) issues
  • You really need to know what you are getting into


Original Post Below....


Don't use MongoDB
=================

I've kept quiet for awhile for various political reasons, but I now
feel a kind of social responsibility to deter people from banking
their business on MongoDB.

Our team ran MongoDB under serious load for a large (10s of millions
of users, high profile company) userbase, expecting, from early good
experiences, that the long-term scalability benefits touted by 10gen
would pan out.  We were wrong, and this rant serves to deter you
from believing those benefits and making the same mistake
we did.  If one person avoids the trap, it will have been
worth writing.  Hopefully, many more do.

Note that, in our experiences with 10gen, they were nearly always
helpful and cordial, and often extremely so.  But at the same
time, that cannot be reason alone to suppress information about
the failings of their product.

Why this matters
----------------

Databases must be right, or as-right-as-possible, b/c database
mistakes are so much more severe than almost every other variation
of mistake.  Not only does it have the largest impact on uptime,
performance, expense, and value (the inherent value of the data),
but data has *inertia*.  Migrating TBs of data on-the-fly is
a massive undertaking compared to changing drcses or fixing the
average logic error in your code.  Recovering TBs of data while
down, limited by what spindles can do for you, is a helpless
feeling.

Databases are also complex systems that are effectively black
boxes to the end developer.  By adopting a database system,
you place absolute trust in their ability to do the right thing
with your data to keep it consistent and available.

Why is MongoDB popular?
-----------------------

To be fair, it must be acknowledged that MongoDB is popular,
and that there are valid reasons for its popularity.

 * It is remarkably easy to get running
 * Schema-free models that map to JSON-like structures
   have great appeal to developers (they fit our brains),
   and a developer is almost always the individual who
   makes the platform decisions when a project is in
   its infancy
 * Maturity and robustness, track record, tested real-world
   use cases, etc, are typically more important to sysadmin
   types or operations specialists, who often inherit the
   platform long after the initial decisions are made
 * Its single-system, low concurrency read performance benchmarks
   are impressive, and for the inexperienced evaluator, this
   is often The Most Important Thing

Now, if you're writing a toy site, or a prototype, something
where developer productivity trumps all other considerations,
it basically doesn't matter *what* you use.  Use whatever
gets the job done.

But if you're intending to really run a large scale system
on Mongo, one that a business might depend on, simply put:

Don't.

Why not?
--------

**1. MongoDB issues writes in unsafe ways *by default* in order to
win benchmarks**

If you don't issue getLastError(), MongoDB doesn't wait for any
confirmation from the database that the command was processed.
This introduces at least two classes of problems:

 * In a concurrent environment (connection pools, etc), you may
   have a subsequent read fail after a write has "finished";
   there is no barrier condition to know at what point the
   database will recognize a write commitment
 * Any unknown number of save operations can be dropped on the floor
   due to queueing in various places, things outstanding in the TCP
   buffer, etc, when your connection drops, or if the db were to be
   KILL'd or segfault, hardware crash, you name it
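To make that fire-and-forget failure concrete, here is a toy simulation -- not pymongo's actual API, all class names are made up -- of a driver that buffers writes and never waits for an acknowledgment, versus one that imposes a getLastError-style barrier:

```python
from collections import deque

class FireAndForgetClient:
    """Simulates a driver that queues writes and never waits for an ack."""
    def __init__(self):
        self.buffer = deque()   # writes sitting in the TCP buffer / queue
        self.server = []        # what actually reached the database

    def insert(self, doc):
        # insert() returns immediately: no round trip, no error possible
        self.buffer.append(doc)

    def flush(self):
        # the network drains the buffer when it gets a chance
        while self.buffer:
            self.server.append(self.buffer.popleft())

    def crash(self):
        # connection drops / mongod is KILL'd: queued writes vanish silently
        self.buffer.clear()

class AcknowledgedClient(FireAndForgetClient):
    def insert(self, doc):
        # getLastError-style barrier: don't return until the write landed
        super().insert(doc)
        self.flush()

unsafe = FireAndForgetClient()
unsafe.insert({"_id": 1})
unsafe.insert({"_id": 2})
unsafe.crash()
assert unsafe.server == []          # both "finished" writes are simply gone

safe = AcknowledgedClient()
safe.insert({"_id": 1})
safe.crash()
assert safe.server == [{"_id": 1}]  # the ack barrier got it to disk first
```

The unsafe client is also why the default mode wins benchmarks: it measures how fast you can fill a buffer, not how fast the database can commit.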

**2. MongoDB can lose data in many startling ways**

Here is a list of ways we personally experienced records go missing:

 1. They just disappeared sometimes.  Cause unknown.
 2. Recovery on corrupt database was not successful,
    pre transaction log.
 3. Replication between master and slave had *gaps* in the oplogs,
    causing slaves to be missing records the master had.  Yes,
    there is no checksum, and yes, the replication status showed
    the slaves as current
 4. Replication just stops sometimes, without error.  Monitor
    your replication status!
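The checksum-free gap failure in #3 is easy to model: if the slave only tracks the latest optime it has applied, a dropped oplog entry leaves no trace in the replication status.  A toy sketch (hypothetical names, not Mongo's oplog format):

```python
# Each oplog entry: (optime, operation).  No checksum, no sequence check.
master_oplog = [(1, "insert a"), (2, "insert b"), (3, "insert c")]

def replicate(oplog, drop_index=None):
    """Apply oplog entries to a slave, optionally losing one in transit."""
    slave_ops, last_optime = [], None
    for i, (optime, op) in enumerate(oplog):
        if i == drop_index:
            continue            # entry lost in transit -- nothing verifies it
        slave_ops.append(op)
        last_optime = optime    # status is just "highest optime seen"
    return slave_ops, last_optime

slave_ops, optime = replicate(master_oplog, drop_index=1)
assert optime == 3                           # status: "slave is current"
assert slave_ops == ["insert a", "insert c"] # ...but "insert b" is gone
```

Which is exactly why "monitor your replication status" isn't even enough: the status can say current while records are missing.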

**3. MongoDB requires a global write lock to issue any write**

Under a write-heavy load, this will kill you.  If you run a blog,
you maybe don't care b/c your R:W ratio is so high.
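Why a write-heavy load kills you: with one process-wide lock, every read queues behind every in-flight write.  A deliberately crude sketch with a single exclusive lock (Mongo's lock at the time was reader-writer, so this overstates read-vs-read contention, but the write-blocks-read dynamic is the point):

```python
import threading
import time

global_lock = threading.Lock()   # one lock for the whole "server"

def write(duration):
    with global_lock:            # every write takes the global lock...
        time.sleep(duration)     # ...for however long the write takes

def read():
    with global_lock:            # ...and reads must wait their turn behind it
        return "doc"

start = time.time()
w = threading.Thread(target=write, args=(0.2,))
w.start()
time.sleep(0.05)                 # let the write grab the lock first
doc = read()                     # this read stalls until the write finishes
elapsed = time.time() - start
w.join()

assert doc == "doc"
assert elapsed >= 0.15           # read latency inflated by the write
```

Scale `duration` up to disk-bound writes under load and your 99th percentile read latency is whatever your slowest write is.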

**4. MongoDB's sharding doesn't work that well under load**

Adding a shard under heavy load is a nightmare.
Mongo either moves chunks between shards so quickly it DOSes
the production traffic, or refuses to move chunks altogether.

This pretty much makes it a non-starter for high-traffic
sites with heavy write volume.

**5. mongos is unreliable**

The mongod/config server/mongos architecture is actually pretty
reasonable and clever.  Unfortunately, mongos is complete
garbage.  Under load, it crashed anywhere from every few hours
to every few days.  Restart supervision didn't always help b/c
sometimes it would throw some assertion that would bail out a
critical thread, but the process would stay running.  Double
fail.

It got so bad the only usable way we found to run mongos was
to run haproxy in front of dozens of mongos instances, and
to have a job that slowly rotated through them and killed them
to keep fresh/live ones in the pool.  No joke.

**6. MongoDB actually once deleted the entire dataset**

MongoDB, 1.6, in replica set configuration, would sometimes
determine the wrong node (often an empty node) was the freshest
copy of the data available.  It would then DELETE ALL THE DATA
ON THE REPLICA (which may have been the 700GB of good data)
AND REPLICATE THE EMPTY SET.  The database should never never
never do this.  Faced with a situation like that, the database
should throw an error and make the admin disambiguate by
wiping/resetting data, or forcing the correct configuration.
NEVER DELETE ALL THE DATA.  (This was a bad day.)

They fixed this in 1.8, thank god.

**7. Things were shipped that should have never been shipped**

Things with known, embarrassing bugs that could cause data
problems were in "stable" releases--and often we weren't told
about these issues until after they bit us, and then only b/c
we had a super duper crazy platinum support contract with 10gen.

The response was to send us a hot patch that they were
calling an RC internally, and have us run that on our data.

**8. Replication was lackluster on busy servers**

Replication would often, again, either DOS the master, or
replicate so slowly that it would take far too long and
the oplog would be exhausted (even with a 50G oplog).

We had a busy, large dataset that we simply could
not replicate b/c of this dynamic.  It was a harrowing month
or two of finger crossing before we got it onto a different
database system.
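The oplog-exhaustion math is worth spelling out: the oplog is a fixed-size ring buffer, so your replication window is just its size divided by the rate you write into it.  A back-of-envelope calculation (the 20 MiB/s write rate is an assumed figure, not from the post):

```python
# Replication window = oplog size / oplog write rate.
oplog_bytes = 50 * 1024**3           # the 50G oplog mentioned above
write_rate = 20 * 1024**2            # assume 20 MiB/s of oplog traffic
window_hours = oplog_bytes / write_rate / 3600
print(round(window_hours, 1))        # ~0.7 hours
```

In other words, under a heavy write load even a huge oplog gives a slave well under an hour to catch up before it falls off the end and has to start over.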

**But, the real problem:**

You might object: my information is out of date; they've
fixed these problems or intend to fix them in the next version;
problem X can be mitigated by optional practice Y.

Unfortunately, it doesn't matter.

The real problem is that so many of these problems existed
in the first place.

Database developers must be held to a higher standard than
your average developer.  Namely, your priority list should
typically be something like:

 1. Don't lose data, be very deterministic with data
 2. Employ practices to stay available
 3. Multi-node scalability
 4. Minimize latency at 99% and 95%
 5. Raw req/s per resource

10gen's order seems to be, #5, then everything else in some
order.  #1 ain't in the top 3.

These failings, and the implied priorities of the company,
indicate a basic cultural problem, irrespective of whatever
problems exist in any single release:  a lack of the requisite
discipline to design database systems businesses should bet on.

Please take this warning seriously.


  
