In this article I try to describe the challenges and problems one will eventually face with MongoDB in production environments. I am pretty certain that a lot of people (MongoDB fanboys) will argue against many points - please do. The discussion is open, leave a comment.
But in my role as Systems Engineer I have been responsible for the data integrity and security of the systems, including MongoDB servers, since June 2013. I am not a developer who stores or reads data from MongoDB, so I am more of a "classical DBA" in this scenario.
Note that this article was written back in 2015. MongoDB has since made important steps and improved many things, such as replication (using replica sets). Not all of the points mentioned are still relevant today (2023), but they still serve as a good list of gotchas and things to watch out for when using MongoDB.
First let's begin with the strengths of MongoDB.
Before I became a Linux Systems Engineer I was a web application developer, so I know what it's like to write code and use databases in the backend. Hence I understand that from a developer's point of view MongoDB is great. It removes the complexity of SQL syntax (although it is not that complex), it allows you to store data without having to think about NOT NULL fields or the like, and there are no complex permissions or privileges to maintain. Just write into and read from MongoDB, no problem.
There are also a lot of helpful and great tools out there to manage and visualize your MongoDB data. Another plus is that no matter what language you're programming in, there is a MongoDB connector or library. Be it PHP (the PECL mongo package), Python (pymongo), Ruby (the mongo gem) and so on; the connection to a MongoDB database is easily and quickly established.
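To give an idea of how little ceremony is involved, here is a minimal pymongo sketch (host, database and collection names are just placeholders, and it assumes a reasonably recent pymongo):

```python
from pymongo import MongoClient

# Connect to a local mongod; no user, no schema, no privileges to set up first.
client = MongoClient("mongodb://localhost:27017")
db = client.testdb

# Any document shape is accepted as-is - no NOT NULL, no column definitions.
db.people.insert_one({"name": "Claudio", "role": "Systems Engineer"})
print(db.people.find_one({"name": "Claudio"}))
```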
If every company or technology had a marketing department like MongoDB, Inc.'s (yes, that's a company), we IT professionals would be bombarded with newsletters and invitations to a workshop here and a conference there.
MongoDB, Inc. organizes multiple events per month; actually it seems more like at least one event per week. But that's how a technology (or in this case rather a company) should present itself. Go out there, take the market.
Have you ever seen so many PostgreSQL conferences or workshops? No. Does PostgreSQL offer NoSQL capabilities? Yes. Did you know about it? No. That is because MongoDB, Inc. knows what they're doing, and they're doing it well.
Most people I have talked to who wanted me to install a MongoDB server, mainly developers, did not even know that there are projects similar to MongoDB or, as mentioned, that PostgreSQL offers similar database capabilities. It's all about making your brand stick in people's minds when they're talking about a technology. MongoDB is NoSQL. NoSQL is MongoDB. Well done!
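For the curious, this is roughly what such "NoSQL in PostgreSQL" usage looks like with a JSONB column (available since PostgreSQL 9.4; connection parameters, table and values are just placeholders):

```python
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=test user=postgres")
cur = conn.cursor()

# A single JSONB column gives schemaless, document-style storage.
cur.execute("CREATE TABLE IF NOT EXISTS docs (id serial PRIMARY KEY, doc jsonb)")
cur.execute("INSERT INTO docs (doc) VALUES (%s)",
            [Json({"user": "max", "tags": ["dba", "nosql"]})])

# Query inside the document, roughly comparable to a MongoDB find() filter.
cur.execute("SELECT doc FROM docs WHERE doc->>'user' = %s", ["max"])
print(cur.fetchone())
conn.commit()
```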
But now let's turn the page and look at MongoDB through the eyes of a Systems or Database Engineer/Administrator.
There is a lot of criticism out there concerning MongoDB's data integrity. I highly recommend you read Emin Gün Sirer's article "Broken by Design: MongoDB Fault Tolerance", which describes in detail how MongoDB does not handle fault tolerance correctly and can lose data. Yes - lose. Out, into nirvana. And you wouldn't even know it through a log entry or similar. This article from 2013 caused a big reaction from 10gen (now MongoDB, Inc.), which responded to most of the criticism. Those responses were in turn picked up by Mr. Sirer, who reacted to them in his follow-up "When PR says No but Engineering says Yes".
There are more articles out there that describe how MongoDB has lost data, but the article from Emin Gün Sirer is clearly the most interesting one, at least for a technical person.
On the other hand, I personally have never witnessed data loss on my MongoDB servers, so I keep this criticism in mind with a "neutral skepticism" (I'm Swiss, remember?). Keeping such information in mind helps to build a better architecture and to help developers so the application itself already checks for data integrity and correct write operations.
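One practical way to do that on the application side is to request stricter write acknowledgement. A hedged sketch with pymongo (host name and collection are placeholders; the defaults at the time were considerably more relaxed):

```python
from pymongo import MongoClient
from pymongo.errors import PyMongoError

# w="majority" waits until a majority of replica set members have acknowledged
# the write, j=True additionally waits for the journal commit.
client = MongoClient("mongodb://db1.example.com:27017", w="majority", j=True)

try:
    result = client.shop.orders.insert_one({"order_id": 4711, "total": 99.90})
    print("acknowledged:", result.acknowledged)
except PyMongoError as exc:
    # With relaxed defaults such failures can go completely unnoticed.
    print("write failed:", exc)
```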
When I first started with MongoDB in 2013 I was literally shocked when I installed the first server. There is no authentication activated by default! Any user can simply connect to MongoDB and do whatever the hell they want. A lot of MongoDB servers were "hacked" or are hackable because they're running with the default settings and therefore without user authentication. A recent article from February 2015 revealed that almost 40'000 MongoDB databases were found on the Internet, ready to be hacked or manipulated. Simply because authentication is disabled by default. MongoDB, Inc. responded to the issue (see the bottom of the article) with the advice to limit network access to localhost. Yeah... sure, that solves it all.
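In case you wonder what fixing a default installation looks like: it boils down to two settings in mongod.conf (shown here in the YAML format used since MongoDB 2.6; a minimal sketch, not a complete configuration):

```yaml
# /etc/mongod.conf
net:
  port: 27017
  bindIp: 127.0.0.1        # only listen on localhost instead of all interfaces
security:
  authorization: enabled   # require users to authenticate
```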
Authentication needs to be enabled manually, and for classical DBA tasks a "real admin user" must be created. I described this in my article "First steps with MongoDB: Create a real admin user". When I then created the MongoDB users and assigned them the roles they actually require for their databases, I raised an eyebrow at the "missing source" for a user. I have managed MySQL and PostgreSQL databases for several years and I'm used to having a kind of "three layer" authentication:
1) Username must exist,
2) Password for this username must match and
3) The IP address/source of this request must match the username and its password.
In MongoDB there are only 1) the username and 2) the password for this username. No additional IP source can be defined, which would add extra security.
Granted, IP addresses can be spoofed, but that already requires additional knowledge of the internal systems and therefore makes it more difficult to attempt a hack or fake a login.
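For illustration, this is roughly what creating such a user looks like via pymongo (user, password and database are placeholders); note that there is nowhere to define an allowed source address:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")

# Create an application user with read/write access on a single database.
# Only a username, a password and roles can be defined - there is no field
# to restrict the allowed source/client IP address.
client.appdb.command(
    "createUser",
    "appuser",
    pwd="s3cret",
    roles=[{"role": "readWrite", "db": "appdb"}],
)
```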
Now it's 2015 and even with the latest MongoDB 3.x version, authentication is still disabled by default. This just leaves me shaking my head when I think of security.
With the release of MongoDB 3.x, the default authentication mechanism changed from the previous MONGODB-CR to the new and more secure SCRAM-SHA-1. Better security: so far, so good. But it seems that something important was forgotten: telling the third-party developers. A lot of tools were, or still are, not able to connect to a new MongoDB 3.x installation because the authentication scheme cannot be changed. Unfortunately the new 3.x version ditched MONGODB-CR completely and didn't leave it as a secondary authentication mechanism in the background. This breaks backward compatibility with developer tools, so it needs to be debugged and fixed, which costs time and therefore money. As long as tools and libraries are not updated to be able to use SCRAM-SHA-1, a workaround is to manually set the default authentication mechanism back to the older MONGODB-CR. See my article "Authentication on MongoDB 3.x fails with mechanism MONGODB-CR" for more details.
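On the client side, libraries that already understand both mechanisms can request one explicitly. A hedged pymongo sketch (credentials and host are placeholders; the server-side default still has to be adjusted as described in the linked article):

```python
from pymongo import MongoClient

# Explicitly request the old challenge/response mechanism for a client that
# talks to a server still configured with MONGODB-CR credentials.
client = MongoClient(
    "mongodb://appuser:s3cret@localhost:27017/appdb",
    authMechanism="MONGODB-CR",
)
print(client.appdb.collection_names())
```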
Every database system, be it a relational database like MySQL or PostgreSQL or a non-relational database like MongoDB, handles disk/block space a bit differently. But they have something in common: when data is deleted from the database, the disk space is not automatically given back to the system/OS. On MySQL you run a command like OPTIMIZE TABLE to regain the disk space freed by deleted data. On PostgreSQL you have the VACUUM command (which you can even configure to run periodically with autovacuum) for the same task. Both MySQL and PostgreSQL lock the relevant table(s) while freeing disk space, but the commands work out of the box. Not so in MongoDB. According to the MongoDB documentation, only repairDatabase allows you to regain freed disk space:
"[...] repairDatabase is the appropriate and the only way to reclaim disk space."
Sounds easy and comparable to the MySQL and PostgreSQL way. But you are not made aware of the following not-so-small requirement for actually running repairDatabase:
"repairDatabase requires free disk space equal to the size of your current data set plus 2 gigabytes. If the volume that holds dbpath lacks sufficient space, you can mount a separate volume and use that for the repair. When mounting a separate volume for repairDatabase you must run repairDatabase from the command line and use the --repairpath switch to specify the folder in which to store temporary repair files."
So let's say you have a MongoDB installation which uses 500GB. Of course you are running a monitoring system which periodically checks disk usage and warns you when the disk fills up. So you contact the developers and they delete data from the database/collections, which should in theory free 100GB of disk space. Now the task is back with you: you need to run repairDatabase (or rather manually start mongod with the --repair parameter). In order to do so, you now need to attach 502GB of additional free disk space to the MongoDB server. Umm, what?! Yes, you have read that correctly. That's the "free disk space equal to the size of your current data set plus 2 gigabytes" outlined in the documentation. Now go to your superior and ask for an additional 502GB, just for a few moments, so you are able to regain disk space. This way of freeing disk space defeats itself: if you need to free disk space, you are short on disk space in the first place (or on allocation of new disk space if we're talking about central storage). Doubling the server's disk space in order to regain disk space is really... stupid. The other database systems show that it's possible to do better.
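For reference, the command itself is trivial; it is the disk space requirement and the global write lock that hurt. A minimal sketch (database name is a placeholder; the --repairpath variant from the quote above has to be run via mongod on the command line instead):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")

# Show the current on-disk allocation before the repair (values in bytes).
print(client.appdb.command("dbStats"))

# Rewrites and compacts all data files - needs free disk space equal to the
# current data set plus 2 GB and holds a global write lock while running.
client.appdb.command("repairDatabase")
```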
But wait! If you have several MongoDB servers in a replica set, then you could run this command on a replica server and it would then replicate the reclaimed disk allocation back to the other members of the replica set, right? Wrong! See the documentation again:
"mongod --repair will fail if your database is not a master or primary."
So not only do you lose the ability to write to this database because of a "global write lock" (ergo downtime), you also have the hassle of organizing and configuring additional disk space on the server running MongoDB.
Believe it or not, backups are important. At least to me. I want to have a complete database dump (with a guarantee of data integrity) for when a worst-case scenario happens. So on my MongoDB 2.x servers there is a daily script running which dumps all databases/collections with the mongodump command. For a total database size of 553MB it takes the script/mongodump 6 seconds to create the BSON binary dumps. That's very good and I got used to it. But something changed drastically in MongoDB 3.x: mongodump seems to be way slower than in the previous version. For a total size of ~100GB, mongodump took more than 24 hours to complete. For daily backups this simply doesn't work. I am not alone in concluding that mongodump became much slower: "Mongodump 3.0.1 very slow compared to mongodump 2.6.9". Strangely, the official documentation ("MongoDB Backup Methods") seems to focus rather on copying the data files while mongod is stopped, or on using a snapshot (for example with LVM) and then copying the files. For the mongodump method it suggests running the dump on a secondary member of the replica set - or turning off MongoDB completely and working with the data files directly:
"To mitigate the impact of mongodump on the performance of the replica set, use mongodump to capture backups from a secondary member of a replica set. Alternatively, you can shut down a secondary and use mongodump with the data files directly. If you shut down a secondary to capture data with mongodump ensure that the operation can complete before its oplog becomes too stale to continue replicating."
And yes, mongodump indeed has a big impact on performance. It may be that this is all memory related, but I haven't found a hint in the (official) documentation that more memory would decrease the runtime of mongodump.
While I continue to run mongodump daily on the MongoDB 2.x servers, on the MongoDB 3.x servers I decided to use LVM snapshots and copy the MongoDB data files to a remote backup server.
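For what it's worth, the daily dump job on the 2.x servers is nothing more than a dated mongodump call. A minimal sketch of such a script (paths and host are placeholders):

```python
#!/usr/bin/env python
import datetime
import subprocess

BACKUP_ROOT = "/backup/mongodb"      # placeholder backup location
MONGO_HOST = "localhost:27017"       # placeholder MongoDB host

# mongodump writes one BSON dump per collection below the --out directory.
target = "%s/%s" % (BACKUP_ROOT, datetime.date.today().isoformat())
subprocess.check_call(["mongodump", "--host", MONGO_HOST, "--out", target])
```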
After now two years of working with and managing MongoDB servers, my personal conclusion is that MongoDB certainly has its right to exist, but whenever possible I'd advise using another technology. Most applications can be developed to store their data in traditional SQL databases like MySQL. It makes sense to use NoSQL-like databases for smaller amounts of data, for example application configuration switches or parameters. But would you store that in MongoDB? Why not use a much faster technology like Redis for such a scenario? And for big data, you ask? Well, there are (well-known) alternatives out there; Apache Cassandra is just one of them. And Cassandra is not only radically faster than MongoDB, it also solves the killer problem of regaining disk space (no additional disks/disk space required). But I have to admit that it's not always easy to "go against" the mindset of certain decision makers, even with technical proof - especially when they think NoSQL equals MongoDB.