What will we learn today?
- SQL vs NoSQL
- Why MongoDB?
- Mongo Shell Basic Commands
- MongoDB administration
- Aggregation
- Replication
- Sharding
- What's next?
Download and install MongoDB from https://www.mongodb.com/download-center#community. Follow the instructions for your platform (Windows, Linux or OS X).
Open two terminal instances.
Run mongod in the first instance; this will start MongoDB.
In the second instance, first download the sample data script:
curl https://gist.githubusercontent.com/agiamas/35b2b954cc942f95709273d3cb9d2cf3/raw/b1bb399942dab287832d41b3b75f6b54c6f00bb1/mongodb_data.js > mongo_data.js
then import the data we just downloaded using
mongo < mongo_data.js
Finally, run mongo in the second terminal instance.
SQL is a well-known paradigm in data storage and retrieval that has served us for decades. Our data used to live in mainframes, stored and processed in isolation. The nature of Web and mobile has created a paradigm shift in the past decade. Data has exploded in volume, variety and velocity. We often have semi-structured data of varying quality, arriving in unpredictable volumes.
This has led to a new breed of databases, the NoSQL ones.
- Document databases, e.g. MongoDB
- Graph stores, e.g. Neo4J
- Key-value stores, e.g. Redis
- Wide-column stores, e.g. Cassandra
In this class we will examine document databases and the most popular of them, MongoDB.
First of all, how do MongoDB queries compare with the traditional SQL queries that we have learned in the past?
https://s3.amazonaws.com/info-mongodb-com/sql_to_mongo.pdf
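MongoDB is a leader in the NoSQL database space.
- de facto leader in the NoSQL database space
- uses JSON-like documents: you are already familiar with JSON from front-end and Node development, and you can use it in the database layer too!
- one language (JavaScript) to rule them all; MongoDB is the M in the MEAN stack
- MongoDB data types go beyond JSON: documents are stored as BSON, which adds types such as dates and ObjectId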
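To see why the comparison with JSON data types matters, note that plain JSON has no date type, so a JavaScript Date does not survive a JSON round trip. The sketch below is plain JavaScript with no MongoDB calls; the student document and its field names are made up for illustration. It only shows the kind of type information that JSON text loses and that a richer format such as BSON can keep.

```javascript
// Hypothetical document; the field names are for illustration only.
const student = {
  name: "alex",
  enrolled: new Date("2018-03-01T00:00:00Z"),
};

// Round-trip through JSON text, as a browser or Node application would.
const roundTripped = JSON.parse(JSON.stringify(student));

console.log(student.enrolled instanceof Date); // true
console.log(typeof roundTripped.enrolled);     // "string"
console.log(roundTripped.enrolled);            // "2018-03-01T00:00:00.000Z"
```

After the round trip the date is just a string, and the application has to know that it should be parsed back into a Date.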
Open the terminal in which you ran mongo.
First type show dbs
You should see something similar to:
admin 0.000GB
local 0.000GB
Then type use cyf
You should now get:
switched to db cyf
That's it! You have created a new database.
Now type show dbs again:
admin 0.000GB
local 0.000GB
Where is our new database?
- Lazy initialisation: MongoDB only creates the database once we store some data in it.
Collections are like tables in SQL databases. Collections are also initialised lazily, when we insert the first document:
> db.Student.insert({name: 'alex'})
Now if we run show dbs again, after creating a collection with a document, we will see that the database appears in the list.
This is because inserting the document created the collection, which in turn created the database.
Insert a new student with name: Mary
Insert a new student with name: Madeline and id=2 (integer)
Insert a new student with name: Steve, midterm score of 80 and final score of 100
Scores should be embedded in a sub-document like this:
scores:
{
midterm: 0,
final: 0
}
Finding a document by a single attribute:
> db.Student.find({name: 'alex'})
Querying embedded attributes:
> db.Student.find({"scores.midterm": {$gte: 40}} )
AND / OR queries:
> db.Student.find({"$or": [{"scores.midterm": {$gt: 60}}, {"scores.final": {$lte: 75}}]} )
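To make the matching rules in these three queries concrete, here is a small in-memory sketch in plain JavaScript. It is not MongoDB's implementation: the sample documents (nina, omar) and the helper names getPath, matches and find are assumptions made for this example, and only equality, $gt, $gte, $lt, $lte and $or are modelled.

```javascript
// In-memory sketch of query matching; NOT MongoDB's implementation.
const students = [
  { name: "alex" },
  { name: "nina", scores: { midterm: 70, final: 90 } }, // hypothetical sample data
  { name: "omar", scores: { midterm: 35, final: 60 } },
];

// Resolve a dotted path such as "scores.midterm" against a document.
function getPath(doc, path) {
  return path
    .split(".")
    .reduce((value, key) => (value == null ? undefined : value[key]), doc);
}

// The only comparison operators modelled in this sketch.
const comparators = {
  $gt: (a, b) => a > b,
  $gte: (a, b) => a >= b,
  $lt: (a, b) => a < b,
  $lte: (a, b) => a <= b,
};

function matches(doc, query) {
  // Every top-level key must match: this is the implicit AND.
  return Object.entries(query).every(([key, condition]) => {
    if (key === "$or") return condition.some((sub) => matches(doc, sub));
    const value = getPath(doc, key);
    if (condition !== null && typeof condition === "object") {
      // Operator object such as { $gte: 40 }; all listed operators must hold.
      return Object.entries(condition).every(
        ([op, operand]) => value !== undefined && comparators[op](value, operand)
      );
    }
    return value === condition; // plain equality, e.g. { name: "alex" }
  });
}

const find = (query) => students.filter((doc) => matches(doc, query));
const names = (docs) => docs.map((doc) => doc.name);

console.log(names(find({ name: "alex" })));                      // [ 'alex' ]
console.log(names(find({ "scores.midterm": { $gte: 40 } })));    // [ 'nina' ]
console.log(
  names(find({ $or: [{ "scores.midterm": { $gt: 60 } }, { "scores.final": { $lte: 75 } }] }))
);                                                               // [ 'nina', 'omar' ]
```

Note that alex, who has no scores sub-document, never matches a comparison on "scores.midterm". Putting several keys in one query object gives an AND; $or takes an array of alternatives.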
Find the user Mary that you inserted in exercise 1
Search for students that have scored between [50,80) in midterm AND [80,100] in final exam
In the two examples above we can see the operators $gte (greater than or equal, >=), $gt (greater than, >) and $lte (less than or equal, <=) in action.
All operators in MongoDB: https://docs.mongodb.com/manual/reference/operator/query/
Updating documents has 2 parts. The first one is finding the document(s) we want to update and the second one is modifying their values.
(!)
db.Student.update({name: 'alex'}, {name: 'alexander'})
What will this command do?
It will REPLACE the first document matching {name: 'alex'} with the new document {name: 'alexander'}, so every other field (except _id) is lost!
Definitely not what we want. To change only the name, use $set:
> db.Student.update({name: 'alex'}, {$set: {name: 'alexander'}})
We can also use other operators like $inc, $mul, $min, $max
https://docs.mongodb.com/manual/reference/operator/update/#fields
(!) By default, update will modify only the first document it matches. To modify every matching document, pass the option multi: true.
(!) We can get update to create the document when nothing matches by passing the option upsert: true.
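The difference between replacing a document and using $set, and the effect of the multi and upsert options, can be sketched in memory with plain JavaScript. This is not MongoDB's implementation: the update function below is a hypothetical helper that only understands equality filters and $set, the sample documents are made up, and the inserted document in the upsert case gets no generated _id.

```javascript
// In-memory sketch of update semantics; NOT MongoDB's implementation.
function update(collection, filter, change, options = {}) {
  const isMatch = (doc) => Object.entries(filter).every(([k, v]) => doc[k] === v);
  const usesOperators = Object.keys(change).some((k) => k.startsWith("$"));
  let modified = 0;
  for (let i = 0; i < collection.length; i++) {
    if (!isMatch(collection[i])) continue;
    if (usesOperators) {
      Object.assign(collection[i], change.$set); // only $set is modelled here
    } else {
      // No operators: the whole document is replaced, only _id survives.
      collection[i] = { _id: collection[i]._id, ...change };
    }
    modified++;
    if (!options.multi) break; // default: stop after the first match
  }
  if (modified === 0 && options.upsert) {
    // Nothing matched: create a document from the filter plus the change.
    collection.push({ ...filter, ...(usesOperators ? change.$set : change) });
  }
  return modified;
}

// Hypothetical sample data, rebuilt for each experiment.
const makeStudents = () => [
  { _id: 1, name: "alex", scores: { midterm: 60, final: 70 } },
  { _id: 2, name: "alex", scores: { midterm: 80, final: 90 } },
];

const replaced = makeStudents();
update(replaced, { name: "alex" }, { name: "alexander" });
console.log(replaced[0]); // { _id: 1, name: 'alexander' }, the scores are gone

const patched = makeStudents();
update(patched, { name: "alex" }, { $set: { name: "alexander" } });
console.log(patched[0]); // name changed, scores kept; patched[1] is untouched

const all = makeStudents();
update(all, { name: "alex" }, { $set: { name: "alexander" } }, { multi: true });
console.log(all.map((doc) => doc.name)); // [ 'alexander', 'alexander' ]

const upserted = makeStudents();
update(upserted, { name: "mary" }, { $set: { year: 1 } }, { upsert: true });
console.log(upserted.length); // 3
```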
Update the student Madeline that you created back in exercise 2 to have a midterm score of 50 and a final score of 100.
Update the grades of all students to be 90
(!) How could we boost the grades of all students by 10%? (!)
Deleting one or more documents is as simple as:
> db.Student.remove({"_id":ObjectId("5a99e1209056c9e237d071d9")})
WriteResult({ "nRemoved" : 1 })
Deleting a whole collection:
db.Student.drop()
Deleting a database:
db.dropDatabase() to delete the current database
Delete the student Steve that you created back in exercise 3
Delete all students with a midterm score of less than 80
There are several ways to write scripts for MongoDB shell:
- use mongo < mongo_script.sh
- use mongo mongo_script.js
What's the difference?
The first one will pipe in our script and execute its lines one by one, as if we were typing them in the Mongo shell.
The second one will evaluate our file as a whole JavaScript program and run it with the Mongo shell.
Always test scripts locally in MongoDB (and in general).
Always test in staging before production.
Always keep backups in production. And test regularly that you can restore from these backups.
All of these are important for every database system but even more important in MongoDB as...
THERE IS NO ROLLBACK! Multi-document transactions only arrived in MongoDB 4.0, and even they cannot undo a delete that has already been committed. Once you delete something, it's gone.
Using MongoDB shell, create a script to output in a new collection named BoostedStudents one document for every document in Student collection with their final grade being boosted by 10%.
Why aggregation framework?
The aggregation framework in MongoDB is modelled after the familiar concept of data processing pipelines. Documents enter the pipeline as they are stored in the collection and exit the other end as transformed documents with calculated fields. The stages of a pipeline are executed sequentially, in the order in which they appear in the array [].
SQL | Aggregation framework |
---|---|
WHERE / HAVING | $match |
GROUP BY | $group |
SELECT | $project |
ORDER BY | $sort |
LIMIT | $limit |
SUM() / COUNT() | $sum |
AVG() | $avg |
JOIN | $lookup |
db.Student.aggregate([
  { $match: { name: {$regex: /^[aA]/} } },
  { $group: { _id: "$name", average: { $avg: "$scores.final" } } },
  { $sort: { average: -1 } },
  { $project: { name: "$_id", average: 1 } }
])
What does the pipeline above do?
- matches all documents with a name starting with a or A
- groups them by name and calculates the average final score of each group
- sorts the groups by average score, highest first
- projects (selects) name and average score in the output; name is copied from the group's _id, because $group only outputs _id and the calculated fields
What's the output like?
{_id: .., name: .., average: ..}
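To follow what each stage of such a pipeline does to the data, here is the same sequence of steps written with plain JavaScript arrays. It is an in-memory sketch, not the aggregation framework itself: the sample documents are made up, the match step assumes the intent is "names starting with a or A", and the projection step assumes that name should be copied from the group key.

```javascript
// In-memory equivalent of match, group, sort and project; NOT MongoDB itself.
const students = [
  { name: "alex", scores: { final: 80 } }, // hypothetical sample data
  { name: "alex", scores: { final: 100 } },
  { name: "Anna", scores: { final: 70 } },
  { name: "mary", scores: { final: 95 } },
];

// $match: keep names starting with "a" or "A" (assumed intent).
const matched = students.filter((s) => /^[aA]/.test(s.name));

// $group: one bucket per name, collecting the final scores to average.
const buckets = new Map();
for (const s of matched) {
  if (!buckets.has(s.name)) buckets.set(s.name, []);
  buckets.get(s.name).push(s.scores.final);
}
const grouped = [...buckets].map(([name, finals]) => ({
  _id: name,
  average: finals.reduce((sum, x) => sum + x, 0) / finals.length,
}));

// $sort: average descending, like { average: -1 }.
const sorted = [...grouped].sort((a, b) => b.average - a.average);

// $project: expose the group key as name, next to the average.
const result = sorted.map((g) => ({ _id: g._id, name: g._id, average: g.average }));

console.log(result);
// [ { _id: 'alex', name: 'alex', average: 90 },
//   { _id: 'Anna', name: 'Anna', average: 70 } ]
```

Each step consumes the output of the previous one, which is exactly why the order of the stages in the array matters.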
More information: https://docs.mongodb.com/manual/aggregation/
Now let's redo Exercise 9 using the aggregation framework.
Replication in MongoDB is used to increase redundancy and data availability. In essence, it is a way for 3 or more servers (or even 2, with some caveats) to keep the same copy of the data.
Writes always go to the primary and get propagated asynchronously to the secondaries.
Reads can go to the primary or any of the secondaries.
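One consequence of asynchronous propagation is that a read from a secondary can briefly return older data than the primary holds. The toy model below illustrates this in plain JavaScript. It is a deliberately simplified sketch: the Member class, the oplog array and the write and replicate functions are names invented for this example, and a real replica set replicates over the network with far more machinery.

```javascript
// Toy model of asynchronous replication; NOT how MongoDB is implemented.
class Member {
  constructor() {
    this.data = new Map(); // key -> value held by this server
    this.applied = 0;      // how many log entries this server has applied
  }
}

const primary = new Member();
const secondary = new Member();
const oplog = []; // ordered log of writes accepted by the primary

// Writes always go to the primary, which also records them in the log.
function write(key, value) {
  primary.data.set(key, value);
  oplog.push({ key, value });
  primary.applied = oplog.length;
}

// Later, a secondary catches up by applying the entries it has not seen yet.
function replicate(member) {
  while (member.applied < oplog.length) {
    const { key, value } = oplog[member.applied];
    member.data.set(key, value);
    member.applied++;
  }
}

write("alex", 80);
const primaryRead = primary.data.get("alex");  // 80
const staleRead = secondary.data.get("alex");  // undefined: not replicated yet
replicate(secondary);
const freshRead = secondary.data.get("alex");  // 80

console.log({ primaryRead, staleRead, freshRead });
```

If an application must always see its own latest writes, reading from the primary is the simple choice; reads from secondaries trade freshness for spreading the load.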
Election process:
Replica sets implement automatic failover by default. If a primary server fails, the remaining secondaries will elect a new primary. By default this will be the secondary that is most "up to date" with the primary, but we can affect (rig) the election process by assigning different priorities (and votes) to each server.
More information: https://docs.mongodb.com/manual/core/replica-set-elections/
Using replication we can perform a few interesting tasks:
- delayed replica for backup. Delay replication to this replica by an hour, enabling us to recover from dropping a database in production
- hidden replicas for reporting. These replicas will never become primaries, so we can safely apply read load to them for reporting purposes
- replicas in a different location for disaster recovery
- replicas in a different location to be closer to our users
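As a rough sketch of what the first two tasks look like in a replica set configuration, the object below describes five members, one of them hidden for reporting and one hidden and delayed by an hour. The set name and host names are hypothetical. The field names priority, hidden and slaveDelay follow the configuration document of the MongoDB 3.x era; newer releases rename the delay field, so check the documentation for your version. In the Mongo shell, an object like this is what you would pass to rs.initiate() or rs.reconfig().

```javascript
// Sketch of a replica set configuration; set name and hosts are hypothetical.
const replicaSetConfig = {
  _id: "rs0",
  members: [
    { _id: 0, host: "db1.example.net:27017" },
    { _id: 1, host: "db2.example.net:27017" },
    { _id: 2, host: "db3.example.net:27017" },
    // Hidden member for reporting: priority 0 means it can never become primary.
    { _id: 3, host: "reporting.example.net:27017", priority: 0, hidden: true },
    // Delayed member: applies writes one hour (3600 seconds) late.
    // slaveDelay is the 3.x-era field name (assumption about your version).
    { _id: 4, host: "delayed.example.net:27017", priority: 0, hidden: true, slaveDelay: 3600 },
  ],
};

// Members that are allowed to become primary.
const electable = replicaSetConfig.members.filter((m) => m.priority !== 0);
console.log(electable.map((m) => m.host));
```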
Sharding is the method MongoDB uses for horizontal scaling. It partitions data by shard key across different servers, thereby distributing the read and write load.
What is horizontal scaling?
When our data exceeds the disk space, I/O capacity and/or memory available in a single server we have two options:
- Buy/Rent a bigger server. This is vertical scaling. It's of course the easiest way to scale but does not scale linearly in terms of cost and capacity.
- Distribute our data across different servers of the same initial capacity. This is called horizontal scaling. It is more difficult to achieve, but if we achieve linear scaling then our system can theoretically be scaled without limit.
With MongoDB it's important to understand that we will start with the first option for as long as it makes financial sense. If we are on AWS it will probably be easier to move our replica set from S to M to L sized servers rather than implement sharding. At some point though, we should start planning for sharding, definitely sooner rather than later.
Sharding, as we can see from the diagram above, is not a trivial task. We need router(s), config servers and shards.
Router:
The router is essentially our query server. Queries no longer go to individual servers but must go to the router which will decide which server(s) hold our data.
Config Servers:
Config servers hold configuration information for the whole cluster. They must be deployed as a replica set in order to achieve high availability.
Shards:
Each shard is essentially a replica set. Each shard holds a cut of our data and all the shards together hold the total of our data.
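How a router decides which shard(s) hold our data can be sketched with a toy example in plain JavaScript. Everything here is an assumption made for illustration: the shard key is name, the three key ranges and shard names are invented, and the functions shardFor, insert and find are hypothetical helpers, not the router's real interface.

```javascript
// Toy router with range-based partitioning; NOT how MongoDB's router works internally.
const chunks = [
  { min: "", max: "h", shard: "shardA" },      // names before "h"
  { min: "h", max: "q", shard: "shardB" },     // names from "h" up to "q"
  { min: "q", max: "\uffff", shard: "shardC" }, // everything from "q" onwards
];
const shards = { shardA: [], shardB: [], shardC: [] };

// Pick the shard whose key range contains the given shard key value.
function shardFor(key) {
  return chunks.find((c) => key >= c.min && key < c.max).shard;
}

function insert(doc) {
  shards[shardFor(doc.name)].push(doc);
}

// A query that includes the shard key is sent to one shard only;
// any other query has to be sent to every shard.
function find(filter) {
  const targets = "name" in filter ? [shardFor(filter.name)] : Object.keys(shards);
  const isMatch = (doc) => Object.entries(filter).every(([k, v]) => doc[k] === v);
  const docs = targets.flatMap((t) => shards[t].filter(isMatch));
  return { targets, docs };
}

["alex", "mary", "steve", "madeline"].forEach((name, i) =>
  insert({ name, year: 1 + (i % 2) })
);

console.log(find({ name: "mary" }).targets); // [ 'shardB' ]
console.log(find({ year: 1 }).targets);      // [ 'shardA', 'shardB', 'shardC' ]
```

This is why the choice of shard key matters: queries that include it touch a single shard, while queries that do not must ask all of them.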
More information: https://docs.mongodb.com/manual/sharding/
If you have the time and interest, please register for this class or any other class at MongoDB University.
All classes are free and on average require 6-10 hours of time per week.
https://university.mongodb.com/courses/M001/about
I am also the author of the Mastering MongoDB 3.X book by Packt publishing, available here.