How to Write Denormalized Data in Firebase

Great question. I know of three approaches to this, which I'll list below.

I'll take a slightly different example for this, mostly because it allows me to use more concrete terms in the explanation.

Say we have a chat application, where we store two entities: messages and users. In the screen where we show the messages, we also show the name of the user. So to minimize the number of reads, we store the name of the user with each chat message too.

users
  so:209103
    name: "Frank van Puffelen"
    location: "San Francisco, CA"
    questionCount: 12
  so:3648524
    name: "legolandbridge"
    location: "London, Prague, Barcelona"
    questionCount: 4
messages
  -Jabhsay3487
    message: "How to write denormalized data in Firebase"
    user: so:3648524
    username: "legolandbridge"
  -Jabhsay3591
    message: "Great question."
    user: so:209103
    username: "Frank van Puffelen"
  -Jabhsay3595
    message: "I know of three approaches, which I'll list below."
    user: so:209103
    username: "Frank van Puffelen"

So we store the primary copy of the user's profile in the users node. In the message we store the uid (so:209103 and so:3648524) so that we can look up the user. But we also store the user's name in the messages, so that we don't have to look this up for each user when we want to display a list of messages.
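As a rough sketch of the write side, the helper below copies the name from the profile into a new message. It works on plain objects shaped like the tree above; `buildMessage` and its parameters are hypothetical names for illustration, not part of the Firebase API.

```javascript
// Hypothetical helper: build a chat message that carries a copy of the
// author's name, so that showing the message needs no second read.
function buildMessage(users, uid, text) {
  var profile = users[uid];
  if (!profile) {
    throw new Error('Unknown user: ' + uid);
  }
  return {
    message: text,
    user: uid,              // key of the primary copy under users
    username: profile.name  // denormalized copy of the name
  };
}

var users = {
  'so:209103': { name: 'Frank van Puffelen', location: 'San Francisco, CA' }
};
var newMessage = buildMessage(users, 'so:209103', 'Great question.');
console.log(newMessage.username);
```

With the 2.x JavaScript SDK used elsewhere in this answer, the resulting object could then be stored with something like ref.child('messages').push(newMessage).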

So what happens now when I go to the Profile page on the chat service and change my name from "Frank van Puffelen" to just "puf"?

Transactional update

A transactional update is probably the approach that first comes to mind for most developers. We always want the username in messages to match the name in the corresponding profile.

Using multipath writes (added on 20150925)

Since Firebase 2.3 (for JavaScript) and 2.4 (for Android and iOS), you can achieve atomic updates quite easily by using a single multi-path update:

function renameUser(ref, uid, name) {
  var updates = {}; // all paths to be updated and their new values
  updates['users/'+uid+'/name'] = name;
  var query = ref.child('messages').orderByChild('user').equalTo(uid);
  query.once('value', function(snapshot) {
    snapshot.forEach(function(messageSnapshot) {
      updates['messages/'+messageSnapshot.key()+'/username'] = name;
    });
    ref.update(updates);
  });
}

This will send a single update command to Firebase that updates the user's name in their profile and in each message.
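To make the shape of that single update easier to see, here is a minimal sketch that builds the same `updates` object from plain data, without touching the network. The function name `buildRenameUpdates` and the sample data are assumptions for illustration.

```javascript
// Build the multi-path update object from a plain object shaped like the
// messages node. Each key is a path, each value is the new value at that path.
function buildRenameUpdates(messages, uid, name) {
  var updates = {};
  updates['users/' + uid + '/name'] = name;
  Object.keys(messages).forEach(function (key) {
    if (messages[key].user === uid) {
      updates['messages/' + key + '/username'] = name;
    }
  });
  return updates;
}

var sampleMessages = {
  '-Jabhsay3487': { user: 'so:3648524', username: 'legolandbridge' },
  '-Jabhsay3591': { user: 'so:209103', username: 'Frank van Puffelen' },
  '-Jabhsay3595': { user: 'so:209103', username: 'Frank van Puffelen' }
};
var renameUpdates = buildRenameUpdates(sampleMessages, 'so:209103', 'puf');
console.log(Object.keys(renameUpdates));
```

The profile path and the two messages written by so:209103 end up in one object, which is what gets passed to ref.update() in the function above.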

Previous atomic approach

So when the user changes the name in their profile:

var ref = new Firebase('https://mychat.firebaseio.com/');
var uid = "so:209103";
var nameInProfileRef = ref.child('users').child(uid).child('name');
nameInProfileRef.transaction(function(currentName) {
  return "puf";
}, function(error, committed, snapshot) {
  if (error) {
    console.log('Transaction failed abnormally!', error);
  } else if (!committed) {
    console.log('Transaction aborted by our code.');
  } else {
    console.log('Name updated in profile, now update it in the messages');
    var query = ref.child('messages').orderByChild('user').equalTo(uid);
    query.on('child_added', function(messageSnapshot) {
      messageSnapshot.ref().update({ username: "puf" });
    });
  }
  console.log("Name in profile: ", snapshot.val());
}, false /* don't apply the change locally */);

Pretty involved, and the astute reader will notice that I cheat in the handling of the messages. The first cheat is that I never call off for the listener; the second is that I don't use a transaction for the messages.

If we want to securely do this type of operation from the client, we'd need:

  1. security rules that ensure the names in both places match. But the rules need to allow enough flexibility for them to temporarily be different while we're changing the name. So this turns into a pretty painful two-phase commit scheme.

    1. change all username fields for messages by so:209103 to null (some magic value)
    2. change the name of user so:209103 to 'puf'
    3. change the username in every message by so:209103 that is null to puf.
    4. that query requires an and of two conditions, which Firebase queries don't support. So we'll end up with an extra property uid_plus_name (with value so:209103_puf) that we can query on.
  2. client-side code that handles all these transitions transactionally.
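The extra property from step 4 could be maintained along the lines of the sketch below. The property name `uid_plus_name` comes from the list above; the helper names are hypothetical, and the in-memory filter only stands in for the real query.

```javascript
// Combine two fields into one value, so that a single equality query can
// match on both the user and the username.
function uidPlusName(uid, username) {
  return uid + '_' + username;
}

// Return a copy of the message with the combined property added.
function withCompositeKey(message) {
  var copy = {};
  Object.keys(message).forEach(function (k) { copy[k] = message[k]; });
  copy.uid_plus_name = uidPlusName(message.user, message.username);
  return copy;
}

// In-memory stand-in for a query on the combined property.
function findByUidAndName(messages, uid, username) {
  var wanted = uidPlusName(uid, username);
  return Object.keys(messages).filter(function (key) {
    return messages[key].uid_plus_name === wanted;
  });
}

var indexed = {
  m1: withCompositeKey({ user: 'so:209103', username: 'puf' }),
  m2: withCompositeKey({ user: 'so:209103', username: null })
};
console.log(findByUidAndName(indexed, 'so:209103', null));
```

Against the database, the equivalent lookup would presumably be an orderByChild('uid_plus_name') query combined with equalTo(), using the same calls as the queries shown earlier.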

This type of approach makes my head hurt. And usually that means that I'm doing something wrong. But even if it's the right approach, with a head that hurts I'm way more likely to make coding mistakes. So I prefer to look for a simpler solution.

Eventual consistency

Update (20150925): Firebase released a feature that allows atomic writes to multiple paths. This works similarly to the approach below, but with a single command. See the updated section above to read how this works.

The second approach depends on splitting the user action ("I want to change my name to 'puf'") from the implications of that action ("We need to update the name in profile so:209103 and in every message that has user = so:209103").

I'd handle the rename in a script that we run on a server. The main method would be something like this:

function renameUser(ref, uid, name) {
  ref.child('users').child(uid).update({ name: name });
  var query = ref.child('messages').orderByChild('user').equalTo(uid);
  query.once('value', function(snapshot) {
    snapshot.forEach(function(messageSnapshot) {
      // a snapshot has no update(); write through its reference instead
      messageSnapshot.ref().update({ username: name });
    });
  });
}

Once again I take a few shortcuts here, such as using once('value' (which is in general a bad idea for optimal performance with Firebase). But overall the approach is simpler, at the cost of not having all data completely updated at the same time. But eventually the messages will all be updated to match the new value.
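Since the copies only catch up eventually, a server-side script might also want to check which messages are still out of date. The sketch below does that on plain objects shaped like the example data; `findStaleMessages` is a hypothetical helper, not something Firebase provides.

```javascript
// Return the keys of all messages whose copied username no longer matches
// the name in the author's profile. Messages whose author is missing from
// users are skipped rather than reported.
function findStaleMessages(users, messages) {
  return Object.keys(messages).filter(function (key) {
    var message = messages[key];
    var profile = users[message.user];
    return profile !== undefined && message.username !== profile.name;
  });
}

var profiles = { 'so:209103': { name: 'puf' } };
var chat = {
  '-Jabhsay3591': { user: 'so:209103', username: 'Frank van Puffelen' },
  '-Jabhsay3595': { user: 'so:209103', username: 'puf' }
};
console.log(findStaleMessages(profiles, chat));
```

Running a check like this after the rename (or on a schedule) gives a list of messages that still need the update.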

Not caring

The third approach is the simplest of all: in many cases you don't really have to update the duplicated data at all. In the example we've used here, you could say that each message recorded the name as I used it at that time. I didn't change my name until just now, so it makes sense that older messages show the name I used at that time. This applies in many cases where the secondary data is transactional in nature. It doesn't apply everywhere of course, but where it applies "not caring" is the simplest approach of all.

Summary

While the above are just broad descriptions of how you could solve this problem and they are definitely not complete, I find that each time I need to fan out duplicate data it comes back to one of these basic approaches.

What is denormalization in Firebase Cloud Firestore?

Denormalization is not specific to Cloud Firestore; it is a technique generally used in NoSQL databases.

What is really this denormalization?

Denormalization is the process of optimizing the performance of NoSQL databases by adding redundant data in other places in the database. By adding redundant data, as @FrankvanPuffelen already mentioned in his comment, I mean that we copy the exact same data that already exists in one place to another place, to suit queries that may not even be possible otherwise. So denormalization helps cover up the inefficiencies inherent in relational databases.

How does this denormalization really help?

Yes, it does. It's also quite a common practice when it comes to Firebase, because data duplication is the key to faster reads. I see you're new to NoSQL databases, so for a better understanding, I recommend you watch this video, Denormalization is normal with the Firebase Database. It's for the Firebase Realtime Database, but the same principles apply to Cloud Firestore.

Is it always necessary?

We don't use denormalization just for the sake of using it. We use it only when it is definitely needed.

Is database flatten and denormalization the same thing?

Let's take an example of that. Let's assume we have a database schema for a quiz app that looks like this:

Firestore-root
   |
   --- questions (collections)
          |
          --- questionId (document)
                 |
                 --- questionId: "LongQuestionIdOne"
                 |
                 --- title: "Question Title"
                 |
                 --- tags (collections)
                      |
                      --- tagIdOne (document)
                      |     |
                      |     --- tagId: "yR8iLzdBdylFkSzg1k4K"
                      |     |
                      |     --- tagName: "History"
                      |     |
                      |     --- //Other tag properties
                      |
                      --- tagIdTwo (document)
                            |
                            --- tagId: "tUjKPoq2dylFkSzg9cFg"
                            |
                            --- tagName: "Geography"
                            |
                            --- //Other tag properties

We can flatten the database by simply moving the tags collection into a separate top-level collection like this:

Firestore-root
   |
   --- questions (collections)
   |      |
   |      --- questionId (document)
   |             |
   |             --- questionId: "LongQuestionIdOne"
   |             |
   |             --- title: "Question Title"
   |
   --- tags (collections)
          |
          --- tagIdOne (document)
          |     |
          |     --- tagId: "yR8iLzdBdylFkSzg1k4K"
          |     |
          |     --- tagName: "History"
          |     |
          |     --- questionId: "LongQuestionIdOne"
          |     |
          |     --- //Other tag properties
          |
          --- tagIdTwo (document)
                |
                --- tagId: "tUjKPoq2dylFkSzg9cFg"
                |
                --- tagName: "Geography"
                |
                --- questionId: "LongQuestionIdTwo"
                |
                --- //Other tag properties

Now, to get all the tags that correspond to a specific question, you need to simply query the tags collection where the questionId property holds the desired question id.
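In Firestore that lookup would be a query on the tags collection with a where() filter on questionId. The sketch below only mirrors that filter on an in-memory array so it can run without a project; `tagsForQuestion` and the sample array are assumptions for illustration.

```javascript
// In-memory stand-in for "all documents in tags where questionId == id".
function tagsForQuestion(tags, questionId) {
  return tags.filter(function (tag) {
    return tag.questionId === questionId;
  });
}

var flatTags = [
  { tagId: 'yR8iLzdBdylFkSzg1k4K', tagName: 'History', questionId: 'LongQuestionIdOne' },
  { tagId: 'tUjKPoq2dylFkSzg9cFg', tagName: 'Geography', questionId: 'LongQuestionIdTwo' }
];
var matches = tagsForQuestion(flatTags, 'LongQuestionIdOne');
console.log(matches.map(function (tag) { return tag.tagName; }));
```

The price of the flat layout is that every tag document has to carry the questionId it belongs to, as in the schema above.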

Or you can flatten and denormalize the database at the same time, as you can see in the following schema:

Firestore-root
   |
   --- questions (collections)
   |      |
   |      --- questionId (document)
   |             |
   |             --- questionId: "LongQuestionIdOne"
   |             |
   |             --- title: "Question Title"
   |             |
   |             --- tags (collections)
   |                  |
   |                  --- tagIdOne (document) //<----------- Same tag id
   |                  |     |
   |                  |     --- tagId: "yR8iLzdBdylFkSzg1k4K"
   |                  |     |
   |                  |     --- tagName: "History"
   |                  |     |
   |                  |     --- //Other tag properties
   |                  |
   |                  --- tagIdTwo (document) //<----------- Same tag id
   |                        |
   |                        --- tagId: "tUjKPoq2dylFkSzg9cFg"
   |                        |
   |                        --- tagName: "Geography"
   |                        |
   |                        --- //Other tag properties
   |
   --- tags (collections)
          |
          --- tagIdOne (document) //<----------- Same tag id
          |     |
          |     --- tagId: "yR8iLzdBdylFkSzg1k4K"
          |     |
          |     --- tagName: "History"
          |     |
          |     --- questionId: "LongQuestionIdOne"
          |     |
          |     --- //Other tag properties
          |
          --- tagIdTwo (document) //<----------- Same tag id
                |
                --- tagId: "tUjKPoq2dylFkSzg9cFg"
                |
                --- tagName: "Geography"
                |
                --- questionId: "LongQuestionIdTwo"
                |
                --- //Other tag properties

See, the tag objects are the same in questions -> questionId -> tags -> tagId as in tags -> tagId. So we flatten data to regroup data that already exists.
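Keeping both copies of a tag means every tag write has two targets. As a minimal sketch, the helper below lists the document paths and data for one tag in the flattened and denormalized schema; `buildTagWrites` is a hypothetical name, and the paths simply follow the collection names used above.

```javascript
// List the two writes needed to store one tag in both places: once under
// its question, and once in the top-level tags collection.
function buildTagWrites(questionId, tagId, tagName) {
  return [
    {
      path: 'questions/' + questionId + '/tags/' + tagId,
      data: { tagId: tagId, tagName: tagName }
    },
    {
      path: 'tags/' + tagId,
      data: { tagId: tagId, tagName: tagName, questionId: questionId }
    }
  ];
}

var writes = buildTagWrites('LongQuestionIdOne', 'yR8iLzdBdylFkSzg1k4K', 'History');
writes.forEach(function (write) {
  console.log(write.path);
});
```

Applying both writes together, for example in a Firestore batched write, would keep one copy from being stored without the other.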

For more information, you can also take a look at:

  • What is the correct way to structure this kind of data in Firestore?

Because you say you have a SQL background, try to think of a normalized design, which will often store different but related pieces of data in separate logical tables, which are called relations. If these relations are stored physically as separate disk files, completing a query that draws information from several relations (join operations) can be slow. If many relations are joined, it may be prohibitively slow. Because NoSQL databases do not have "JOIN" clauses, we have to create different workarounds to get the same behavior.

How to handle data denormalization in Firestore using Firebase Cloud functions when offline

Question 1

When the user is offline I wouldn't attempt to recalculate the balance because you're opening yourself up to:

  1. Angry users who see a different balance in their offline app than when online and will accuse you of all sorts of things.

  2. Potential exploits caused by users performing actions offline and then reconnecting.

My recommendation based on your description would be to display the balance as grayed out with an "(Out of Sync)" message if the last retrieved balance is older than the latest transaction. If the user needs to perform actions that would subtract from that balance, I would use the last retrieved value as the source of truth.
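A minimal sketch of that display rule, assuming the balance document carries an `updatedAt` timestamp and the latest ledger entry a `createdAt` timestamp (both field names, like `describeBalance` itself, are hypothetical):

```javascript
// Decide how to show a balance: it counts as out of sync when the latest
// known transaction is newer than the last retrieved balance.
function describeBalance(balance, latestTx) {
  var stale = latestTx != null && balance.updatedAt < latestTx.createdAt;
  return {
    amount: balance.amount, // last retrieved value stays the source of truth
    stale: stale,           // the UI can gray the value out when this is true
    label: stale ? balance.amount + ' (Out of Sync)' : String(balance.amount)
  };
}

var shown = describeBalance(
  { amount: 120, updatedAt: 1000 },
  { createdAt: 2000 }
);
console.log(shown.label);
```

Note that the amount itself is never recalculated on the client; only the way it is presented changes.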

Question 2

Regarding database structure what you have is fine, but I personally prefer a flatter structure where you'd have:

/accounts
    /{accountId}

/ledger
    /{txId}

/balances
    /{accountId}

The reasoning is primarily easier maintenance: it's easy to delete a parent collection and forget to delete its subcollections and documents, which will continue to live on invisibly, racking up costs. The deeper your nesting, the worse it gets.

How to denormalize/normalize data structure for firebase realtime database?

I'm with Jay here: you pretty much got all of it in your question already. Great summary of the practices we recommend when using Firebase Database.

Your question boils down to: should I duplicate my user profile information into each story? Unfortunately there's no single answer for that.

Most developers I see will keep the profile information separate and just keep the user UID in the post as an unmanaged foreign key. This has the advantage of needing to update the user profile in only one place when it changes. The performance to read a single story is not too bad: the two reads are relatively fast, since they go over the same connection. When you're showing a list of stories, it is unexpectedly fast since Firebase pipelines the requests over its single connection.

But one of the first bigger implementations I helped with actually duplicated the user data over the stories. As you said: reading a story or list of stories is as fast as it can be in that case. When asked how they dealt with keeping the user information up to date in the stories (see strategies here), they admitted they didn't. In fact: they argued many good reasons why they needed the historical user information for each story.

In the end, it all depends on your use-case. You'll need to answer questions such as:

  • Do you need the historical information for each user?
  • Is it crucial that you show the up-to-date information for a user in older posts?
  • Can you come up with a good caching strategy for the user profiles in your client-side code?
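For the last question, one possible starting point is a small in-memory cache that makes sure each profile is requested at most once. This is only a sketch: `createProfileCache` is a hypothetical helper, and `fetchProfile` stands for whatever function loads a profile by UID and returns a promise.

```javascript
// Cache profile lookups by UID. The promise itself is stored, so several
// stories by the same author share a single request.
function createProfileCache(fetchProfile) {
  var pending = new Map(); // uid -> promise for that user's profile
  return {
    get: function (uid) {
      if (!pending.has(uid)) {
        pending.set(uid, fetchProfile(uid));
      }
      return pending.get(uid);
    },
    // Drop a cached profile, for example after the user edits it.
    invalidate: function (uid) {
      pending.delete(uid);
    }
  };
}

var requests = 0;
var profileCache = createProfileCache(function (uid) {
  requests += 1; // stands in for a real database read
  return Promise.resolve({ uid: uid, name: 'john' });
});
profileCache.get('user1');
profileCache.get('user1');
console.log(requests); // 1
```

How long entries may stay cached, and when to call invalidate, depends on how up to date the profile data in older posts needs to be.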

Firebase database structure - denormalized data?

I've done this by storing the user's data in one place and setting just the userID as a post attribute.

posts:
  userID:
    postID:
      userID: 'user1',
      attachedImageURL: 'http:..',
      message: 'hey',
      reblogID: 'post4',
      type: 'audio|poll|quote'
users:
  user1:
    name: 'john',
    profileImage: 'http..'

It requires one more query to Firebase to retrieve user's profile data but it's a good way to solve this. It really depends on how you want to use those data.

React + Firebase - Handle denormalized data

But not sure if this is a good and "clean" approach.

This is highly dependent on your functional requirements, on the access rights, on the frequency of update and on the volume of the data that is not "included in denormalization".

If:

  1. All the authorized readers of UserList can read the corresponding Profiles;
  2. Between the moment a user opens the UserList and the moment he/she opens one of the Profiles (by clicking on a line of UserList, if I understand correctly), the data that is not "included in denormalization" is not going to change;
  3. This data that is not "included in denormalization" is not really heavy (I imagine that totalFollowers and totalFollowing are numbers, status may be a code and premium a boolean);

Then you should probably include "in the denormalization" the data that is not included: you will save some reads, since the data to be displayed when opening a Profile will have already been fetched.

On the other hand, if one of the above conditions is not true, then you should probably leave it as it is and fetch the data that is not "included in denormalization" when the user opens a Profile.
