When I left off in the last part, we were discussing MongoDB’s availability features. We will pick up next with:
Scalability – We’ve gone over replica sets under availability. For those who come from more of an infrastructure/hardware background, this is very close to something like RAID 1: you create multiple copies of a set of data and place them on separate physical nodes (instead of drives). This does allow a bit of scalability but is not as efficient as, say, RAID 6. So what we will get into next is sharding. Sharding is a lot closer to what we hardware people would think of as RAID 6: you break the overall data set into pieces and spread those pieces across physical nodes. As a refresher, let’s throw in a graphic. I’ve modified it slightly so that it matches the comparison above.
Now if we add in shards, this is what it looks like.
I have just 3 nodes there, but you can scale much bigger. Sharding is automatic and built in, with no need for third-party add-ins, and rebalancing is done automatically as you add or remove nodes. A bit of planning is needed as to how you want to distribute the data. Sharding is applied based on a shard key: a field (or set of fields) chosen by the data modeler whose range of values is used to partition the data into chunks. Much like the key on a map, it tells you how to find where things are. A sharded cluster has three components:
Query Router (mongos) – This provides an interface between client apps and the sharded cluster.
Config Server – Stores the metadata for the sharded cluster. This metadata includes the location of all the chunks and the ranges that define them (each shard’s data is broken into chunks). The query routers cache this data and use it to direct read and write operations to the right shard. Config servers also store authentication information and manage distributed locks. Each sharded cluster must have its own config servers, which store their information in a “config” database. Because the cluster depends on them staying up, config servers are deployed as a replica set (this is required as of MongoDB 3.4).
Shard – Holds a subset of the data. Each shard can be deployed as a replica set.
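To make the division of labor between these three components a little more concrete, here is a minimal Python sketch of a router that looks up a shard key in cached chunk metadata. It is a toy model only, not how mongos is implemented: the `CHUNKS` table, the shard names, and the `ToyRouter` class are all made up for illustration.

```python
from bisect import bisect_left, bisect_right

# Hypothetical chunk metadata, standing in for what a config server holds.
# Each entry is (inclusive lower bound of the shard-key range, owning shard).
# Entries are sorted by lower bound; a chunk ends where the next one begins.
CHUNKS = [
    (float("-inf"), "shardA"),
    (1000, "shardB"),
    (2000, "shardC"),
    (3000, "shardA"),
]


class ToyRouter:
    """Stand-in for a query router that caches the chunk metadata."""

    def __init__(self, chunks):
        self._bounds = [lower for lower, _ in chunks]
        self._owners = [shard for _, shard in chunks]

    def shard_for(self, key):
        """Return the shard owning the chunk that contains this key."""
        # Find the last chunk whose lower bound is <= key.
        idx = bisect_right(self._bounds, key) - 1
        return self._owners[idx]

    def shards_for_range(self, low, high):
        """Return the shards to contact for keys in [low, high)."""
        if low >= high:
            return []
        first = bisect_right(self._bounds, low) - 1
        # Last chunk whose lower bound is strictly below `high`.
        last = bisect_left(self._bounds, high) - 1
        return sorted(set(self._owners[first:last + 1]))


router = ToyRouter(CHUNKS)
print(router.shard_for(1500))               # shardB
print(router.shards_for_range(1500, 2500))  # ['shardB', 'shardC']
```

The point of the sketch is that a query that includes the shard key can be sent to only the shards that own the matching chunks, while the metadata lookup itself stays cheap because the router works from its cached copy.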
There are three sharding strategies: Ranged, Hashed, and Zoned.
Ranged Sharding – Documents are partitioned by ranges of shard key values, so all documents whose keys fall within the same range are grouped on the same shard. This approach is great for co-locating data, such as all customers within a specific region.
Hashed Sharding – Documents are distributed more evenly across shards using an MD5 hash of the shard key. This optimizes write performance and is well suited to ingesting streams of time-series or event data.
Zoned Sharding – Developers can define specific criteria for a zone to partition data as needed by the business. This allows much more precise control over where the data is physically placed. If a customer is concerned about data locality, this is a great way to enforce it, for example to comply with regulations such as GDPR.
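The difference between the three strategies is easier to see with a small experiment. The Python sketch below is a simplified model, not MongoDB’s actual placement logic: the shard names, range boundaries, and zone table are invented, and the hashed variant simply takes an MD5 digest modulo the number of shards, whereas a real cluster assigns ranges of hash values to chunks.

```python
import hashlib
from collections import Counter

SHARDS = ["shardA", "shardB", "shardC"]


def ranged_shard(key, upper_bounds=(1000, 2000)):
    """Ranged placement: contiguous key ranges live on the same shard."""
    for shard, upper in zip(SHARDS, upper_bounds):
        if key < upper:
            return shard
    return SHARDS[-1]  # everything at or above the last boundary


def _bucket(key, shards):
    """Pick a shard from `shards` using an MD5 hash of the key."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]


def hashed_shard(key):
    """Hashed placement: neighbouring keys scatter across all shards."""
    return _bucket(key, SHARDS)


# Hypothetical zones: each region may only use the shards listed for it.
ZONES = {"EU": ["shardA"], "US": ["shardB", "shardC"]}


def zoned_shard(region, key):
    """Zoned placement: restrict a document to the shards of its zone."""
    return _bucket(key, ZONES[region])


# Simulate ingesting 1000 documents with steadily increasing keys.
new_keys = range(2000, 3000)
ranged_spread = Counter(ranged_shard(k) for k in new_keys)
hashed_spread = Counter(hashed_shard(k) for k in new_keys)

print(ranged_spread)  # all 1000 keys land on shardC
print(hashed_spread)  # keys are spread over all three shards
print(zoned_shard("EU", 42))  # shardA, the only shard in the EU zone
```

With increasing keys, the ranged strategy sends every new write to one shard, which is exactly what keeps related data together and also what can create a hot spot. The hashed strategy gives up that locality in exchange for an even spread.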
You can learn more by watching the MongoDB webinar on sharding.
Security – MongoDB has a number of security features that help keep data safe, which matters more and more given the ever-increasing amount of personal information being stored. The main features are:
Authentication. MongoDB integrates with the main external authentication methods: LDAP, Active Directory, Kerberos, and x.509 certificates. You can take this a step further and implement IP whitelisting as well.
RBAC. Role-Based Access Control allows granular permissions to be assigned to either a user or an application. Developers can also create specific views that expose only the pertinent data.
Auditing. This needs to be configured, but auditing is available and can be output to the console or to a JSON or BSON file. The types of operations that can be audited are schema changes, replica set and sharded cluster events, authentication and authorization, and CRUD operations (create, read, update, delete).
Encryption. MongoDB supports both at-rest encryption and transport encryption. Transport encryption is handled through TLS/SSL, using ciphers with a minimum key length of 128 bits. As of 4.0, TLS 1.0 is disabled wherever TLS 1.1 or later is available. Either self-signed or CA-issued certificates can be used. Identity verification is supported for both clients and server node members. FIPS mode is supported, but only in the Enterprise edition. Encryption at rest is handled by the encrypted storage engine introduced in 3.2; the default cipher is AES256-CBC.
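As a rough illustration of the role-based idea from the RBAC item above, here is a small Python sketch of a permission check. It is a toy model, not MongoDB’s authorization code: the role names borrow from the built-in `read` and `readWrite` roles, but the action sets are trimmed down, and the users and the `is_authorized` helper are hypothetical.

```python
# Toy role definitions: role name -> set of permitted actions.
# Simplified on purpose; the real built-in roles grant more actions.
ROLES = {
    "read": {"find"},
    "readWrite": {"find", "insert", "update", "remove"},
}

# Hypothetical grants: user or application -> {database: role}.
GRANTS = {
    "reporting_app": {"sales": "read"},
    "orders_app": {"sales": "readWrite"},
}


def is_authorized(user, database, action):
    """Return True if the user's role on the database permits the action."""
    role = GRANTS.get(user, {}).get(database)
    return role is not None and action in ROLES[role]


print(is_authorized("reporting_app", "sales", "find"))    # True
print(is_authorized("reporting_app", "sales", "insert"))  # False
print(is_authorized("orders_app", "hr", "find"))          # False
```

Permissions hang off the role rather than the individual user, so tightening or loosening access for a whole class of applications means changing one role definition.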
Next up we will go over a bit of what the hardware should look like.