Efficiency of searching using whereArrayContains

As the number of documents in the collection grows and the number of items in the array grows will this search become very inefficient?

The problem isn't that the search will become very inefficient; the problem is that documents have limits on how much data you can put into them. According to the official documentation regarding usage and limits:

Maximum size for a document: 1 MiB (1,048,576 bytes)

As you can see, you are limited to 1 MiB of data in total in a single document. For text, that is quite a lot, but as your array gets bigger, keep this limitation in mind.
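
To get a feeling for when an array might hit that limit, here is a rough estimate in Python. It is only a sketch: it assumes the storage-size rules as I understand them from the Firestore documentation (a string costs its UTF-8 bytes plus 1, an array costs the sum of its elements, and a document adds its name size plus 32 bytes), and the path, field name, and member IDs are made up for illustration.

```python
# Rough estimate of the size of a document holding one array of strings.
# Assumed rules (approximation, not an official calculator):
#   string value  = UTF-8 byte length + 1
#   array value   = sum of the sizes of its elements
#   document name = size of each path segment as a string + 16 bytes
#   document      = name size + field names + field values + 32 bytes

MAX_DOC_BYTES = 1_048_576  # 1 MiB, the limit quoted above


def string_size(value: str) -> int:
    return len(value.encode("utf-8")) + 1


def estimate_array_doc_size(path_segments, field_name, items):
    name_size = sum(string_size(segment) for segment in path_segments) + 16
    field_size = string_size(field_name) + sum(string_size(item) for item in items)
    return name_size + field_size + 32


# Hypothetical document groups/group_1 with a "members" array of user IDs.
members = [f"user_{n:06d}" for n in range(50_000)]
size = estimate_array_doc_size(["groups", "group_1"], "members", members)
print(size, "bytes, fits:", size <= MAX_DOC_BYTES)
```

With IDs of this length, the estimate crosses the limit somewhere between 50,000 and 100,000 members, so the array itself, not the query, is what stops scaling.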

If you are storing a large amount of data in arrays and those arrays are updated by lots of users, there is another limitation that you need to take care of: a single document can sustain only about 1 write per second. So if you have a situation in which a lot of users are all trying to write or update data in the same document at once, you might start to see some of these writes fail. So, be careful about this limitation too.

As you have probably noticed, queries in Cloud Firestore are very fast, and this is because Firestore automatically creates an index for every field you have in your documents.

If you think that you'll be querying for a parent document based on whether it contains a specific member, then use maps and not arrays.

There are many posts out there that say that arrays don't work well in Cloud Firestore. When you have data that can be altered by multiple clients, it's very easy to get confused, because you cannot know what is happening and at which position. If I'm using a map and users want to edit several different fields, or even the exact same field, we generally know what is happening. With arrays, things are different. Think about what might happen if one user wants to edit the value at index 0 while another user wants to delete the value at index 0: you'll end up with very different results and, why not, array-out-of-bounds exceptions. So Firestore handles arrays a little differently: you cannot perform actions like insert, update, or delete at a specific index. But if you don't care about the exact order in which you store elements in an array, then you should use arrays. Firestore has added operations to add or remove specific elements by value, which work only if you don't care about the elements' exact position. See the official documentation.
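
The by-value behaviour can be modelled in a few lines of Python. This is only an in-memory sketch of the semantics, not SDK code: the Firestore SDKs expose these operations as arrayUnion and arrayRemove, while array_union and array_remove below are hypothetical local helpers.

```python
# In-memory model of "add or remove by value" on an array field.
# No index is ever involved, which is why position must not matter.

def array_union(current, *elements):
    """Return a copy of current with each element appended if not already present."""
    result = list(current)
    for element in elements:
        if element not in result:
            result.append(element)
    return result


def array_remove(current, *elements):
    """Return a copy of current without any occurrence of the given elements."""
    return [item for item in current if item not in elements]


tags = ["red", "green"]

# Two clients touch different values; the outcome is the same in either order.
one_order = array_remove(array_union(tags, "blue"), "red")
other_order = array_union(array_remove(tags, "red"), "blue")
print(one_order, other_order, one_order == other_order)
```

Compare this with "delete index 0" and "edit index 0" arriving in different orders, where the final array depends on who got there first.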

In conclusion, put data in the same document only if you need to display it together. Also, don't make documents so big that you end up downloading more data than you actually need. Put data in a collection when you want to search for individual fields of that data or if you want your data to have room to grow. Leave your data as a map field if you want to search your parent object based on that data. And if you have items that you generally use as flags, go ahead with arrays.

Also, don't worry about slow queries in Firestore.

What is the correct way to structure this kind of data in Firestore?

You need to know that there is no "perfect", "the best" or "the correct" solution for structuring a Cloud Firestore database. The best and correct solution is the solution that fits your needs and makes your job easier. Bear also in mind that there is also no single "correct data structure" in the world of NoSQL databases. All data is modeled to allow the use-cases that your app requires. This means that what works for one app, may be insufficient for another app. So there is not a correct solution for everyone. An effective structure for a NoSQL type database is entirely dependent on how you intend to query it.

The way you are structuring your data looks good to me. In general, there are two ways in which you can achieve the same thing. The first is to keep a reference to the provider in the product object (as you already do); the second is to copy the entire provider object into the product document. This second technique is called denormalization and is quite a common practice when it comes to Firebase. We often duplicate data in NoSQL databases to suit queries that may not be possible otherwise. For a better understanding, I recommend you see this video, Denormalization is normal with the Firebase Database. It's for Firebase Realtime Database but the same principles apply to Cloud Firestore.

Also, when you are duplicating data, there is one thing to keep in mind: in the same way you add data, you need to maintain it. In other words, if you want to update or delete a provider object, you need to do it in every place where it exists.

You might wonder now, which technique is best. In a very general sense, the best way in which you can store references or duplicate data in a NoSQL database is completely dependent on your project's requirements.

So you should ask yourself some questions about the data you want to duplicate or simply keep as references:

  1. Is the data static or will it change over time?
  2. If it does, do you need to update every duplicated instance of the data so they all stay in sync? This is what I have also mentioned earlier.
  3. When it comes to Firestore, are you optimizing for performance or cost?

If your duplicated data needs to change and stay in sync, then you might have a hard time in the future keeping all those duplicates up to date. It might also mean you spend a lot of money keeping all those documents fresh, as it will require a read and a write for each document for each change. In this case, holding only references will be the winning variant.
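
To make that cost concrete, here is a small in-memory sketch in Python of what a single provider rename turns into when the provider is copied into every product. The collections are plain dictionaries and all names (providers, products, rename_provider) are hypothetical; it only counts the writes a real update would have to issue.

```python
# Fan-out update: one logical change becomes one write per duplicate.

providers = {"p1": {"name": "Acme"}, "p2": {"name": "Globex"}}

# Each product carries a copy of its provider (denormalized).
products = {
    "a": {"title": "Bolt", "provider": {"id": "p1", "name": "Acme"}},
    "b": {"title": "Nut", "provider": {"id": "p1", "name": "Acme"}},
    "c": {"title": "Gear", "provider": {"id": "p2", "name": "Globex"}},
}


def rename_provider(provider_id, new_name):
    """Rename a provider everywhere it is stored; return the number of writes."""
    providers[provider_id]["name"] = new_name
    writes = 1  # the provider document itself
    for product in products.values():
        if product["provider"]["id"] == provider_id:
            product["provider"]["name"] = new_name
            writes += 1  # one extra write per duplicated copy
    return writes


writes = rename_provider("p1", "Acme Corp")
print("writes needed:", writes)
```

With only a provider ID stored in each product, the same rename would be a single write.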

In this kind of approach, you write very little duplicated data (pretty much just the Provider ID). So that means that your code for writing this data is going to be quite simple and quite fast. But when reading the data, you will need to load the data from both collections, which means an extra database call. This typically isn't a big performance issue for reasonable numbers of documents, but definitely does require more code and more API calls.

If you need your queries to be very fast, you may prefer to duplicate more data so that the client only has to read one document per item queried, rather than multiple documents. But you may also be able to rely on local client caches to make this cheaper, depending on the data the client has to read.

In this approach, you duplicate all data for a provider in each product document. This means that the code to write this data is more complex, and you're definitely storing more data: one more provider object for each product document. You'll also need to figure out if and how to keep it up to date in each document. But on the other hand, reading a product document now gives you all information about the provider in one read.
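
The read side of the two approaches can be compared with a tiny in-memory sketch in Python. CountingStore is a hypothetical stand-in that just counts document reads, and the paths products_ref/a and products_dup/a are invented to hold the same product in both shapes.

```python
# Compare document reads: provider stored by reference vs. duplicated.

class CountingStore:
    """Dictionary-backed stand-in for a database that counts reads."""

    def __init__(self, docs):
        self.docs = docs
        self.reads = 0

    def get(self, path):
        self.reads += 1
        return self.docs[path]


store = CountingStore({
    "providers/p1": {"name": "Acme", "city": "Berlin"},
    "products_ref/a": {"title": "Bolt", "provider_id": "p1"},
    "products_dup/a": {
        "title": "Bolt",
        "provider": {"id": "p1", "name": "Acme", "city": "Berlin"},
    },
})


def load_with_reference(store, product_path):
    product = store.get(product_path)
    provider = store.get(f"providers/{product['provider_id']}")  # extra call
    return product["title"], provider["name"]


def load_with_duplicate(store, product_path):
    product = store.get(product_path)
    return product["title"], product["provider"]["name"]


ref_result = load_with_reference(store, "products_ref/a")
reads_ref = store.reads

store.reads = 0
dup_result = load_with_duplicate(store, "products_dup/a")
reads_dup = store.reads

print(ref_result, reads_ref, dup_result, reads_dup)
```

Both functions return the same data; the difference is one read per product versus two.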

This is a common consideration in NoSQL databases: you'll often have to consider write performance and disk storage vs. reading performance and scalability.

For your choice of whether or not to duplicate some data, it is highly dependent on your data and its characteristics. You will have to think that through on a case-by-case basis.

So in the end, remember that both are valid approaches, and neither of them is inherently better than the other. It all depends on what your use-cases are and how comfortable you are with this new technique of duplicating data. Data duplication is the key to faster reads, not just in Cloud Firestore or Firebase Realtime Database but in general. Any time you add the same data to a different location, you're duplicating data in favor of faster read performance. Unfortunately, in return you have more complex updates and higher storage/memory usage. But note that extra calls in the Firebase Realtime Database are not expensive, while in Firestore they are. How much data duplication versus extra database calls is optimal for you depends on your needs and your willingness to let go of the "Single Point of Definition mindset", which can be called very subjective.

After finishing a few Firebase projects, I find that my reading code gets drastically simpler if I duplicate data. But of course, the writing code gets more complex at the same time. It's a trade-off between these two and your needs that determines the optimal solution for your app. Furthermore, to be even more precise you can also measure what is happening in your app using the existing tools and decide accordingly. I know that is not a concrete recommendation but that's software development. Everything is about measuring things.

Remember also that some database structures are easier to protect with security rules than others. So try to find a schema that can be easily secured using Cloud Firestore Security Rules.

Please also take a look at my answer from this post where I have explained more about collections, maps and arrays in Firestore.

Firestore: How to keep data consistent between user and documents that have user information?

How could I model my database in Firebase to keep, for example, reviews on a specific page updated with the user's info? That is, if a user changes their avatar or name, the reviews should also display the updated data of the user.

Without knowing the queries you intend to perform, it's hard to provide a viable schema. We are usually structuring a Firestore database according to the queries that we want to perform.

In Mongo I would just store a ref to the user in the review, and populate the field to retrieve the data I wanted from the document. Is there something like this in Firebase, and is it even a good or acceptable practice?

Yes, there is. According to the official documentation regarding Firestore supported data types, a DocumentReference is one of them, meaning that you can store only a path to a document and not the entire document. In the NoSQL world, it's quite common to duplicate data, that is, to have the same data in more than one place. Again, without knowing the use-case of your app it's hard to say whether using denormalization is better than holding only a reference. For a better understanding, I recommend you read my answer from the following post:

  • What is denormalization in Firebase Cloud Firestore?

And to answer your questions:

  1. Is there something like ".populate()" in Firebase?

If you only store a DocumentReference, it doesn't mean that the data of the document that the reference is pointing to will be auto-populated. No, you first need to get the reference from the document, and right after that, based on that reference, you have to perform another database call, to actually get the data from the referenced document.
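
As a sketch of what that manual "populate" amounts to, here is an in-memory version in Python. The store, the paths, and populate_reviews are all hypothetical; references are modelled as plain path strings. The one design choice shown is resolving each distinct reference only once, so repeated authors don't cost repeated reads.

```python
# Hand-rolled "populate": read each review, then follow its user reference.

class InMemoryStore:
    """Dictionary-backed stand-in for a database that counts reads."""

    def __init__(self, docs):
        self.docs = docs
        self.reads = 0

    def get(self, path):
        self.reads += 1
        return self.docs[path]


store = InMemoryStore({
    "users/alice": {"name": "Alice", "avatar": "a.png"},
    "users/bob": {"name": "Bob", "avatar": "b.png"},
    "reviews/r1": {"text": "Great", "user": "users/alice"},
    "reviews/r2": {"text": "Okay", "user": "users/bob"},
    "reviews/r3": {"text": "Again", "user": "users/alice"},
})


def populate_reviews(store, review_paths):
    cache = {}  # reference path -> user data already fetched
    populated = []
    for path in review_paths:
        review = dict(store.get(path))  # copy so the stored doc is untouched
        reference = review["user"]
        if reference not in cache:
            cache[reference] = store.get(reference)  # the second database call
        review["user"] = cache[reference]
        populated.append(review)
    return populated


reviews = populate_reviews(store, ["reviews/r1", "reviews/r2", "reviews/r3"])
print(store.reads, [review["user"]["name"] for review in reviews])
```

Because the user data is read at display time, a changed avatar or name shows up in every review without any extra writes.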


  2. Should I model the documents as much as possible to have the data that will be used in the view, and avoid "joins"?

Yes, you should only store the data that you actually need to be displayed in your views. Regarding a JOIN clause, there isn't something like this supported in Firestore. A query can only get documents in a single collection at a time. If you want to get, for example, data from two collections, you'll have at least two queries to perform.

Another solution would be to add a third collection with data already merged from both collections so you can perform a single query. This is already explained in the link above.

Some other information that might be useful is explained in my answer from the following post:

  • Efficiency of searching using whereArrayContains

There you can find best practices for saving data in a document, collection, or subcollection.

Structuring a firestore database to filter by what is not in the array?

Firestore is not very well suited for queries that need to look for things that don't exist. The problem is that the indexes it uses are only meant to tell you if things exist. The universe of strings that don't exist would be impossible to efficiently quantify for indexing.

The only way to make this happen is to know the names of all the processes ahead of time, and create values for them in the index. You would do this with a map type object, not an array:

- token: "1234"
- history: {
    "process-001": false,
    "process-002": false,
    "process-003": false
  }

This document can be queried to find out if "history.process-001" has a value of false, then updated to true when the process uses it. But again, without all the process names known ahead of time and populated in each document, the query is not possible.
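
Here is a small in-memory sketch in Python of that idea. It is not Firestore query code: the documents are plain dictionaries and tokens_not_used_by and mark_used are hypothetical helpers. The third document deliberately has no entries in its map, to show why it can never match.

```python
# Model of querying "history.<process> == false" on a map field.

documents = [
    {"token": "1234", "history": {"process-001": False, "process-002": False}},
    {"token": "5678", "history": {"process-001": True, "process-002": False}},
    {"token": "9999", "history": {}},  # process names were never populated
]


def tokens_not_used_by(process_name, documents):
    """Match only documents that explicitly store False for the process."""
    return [
        doc["token"]
        for doc in documents
        if doc["history"].get(process_name) is False
    ]


def mark_used(doc, process_name):
    doc["history"][process_name] = True


available = tokens_not_used_by("process-001", documents)
print("before:", available)

mark_used(documents[0], "process-001")
available_after = tokens_not_used_by("process-001", documents)
print("after:", available_after)
```

Token "9999" is never returned even though no process has used it, because there is no stored value for an equality check to find.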

See also:

  • Firestore get documents where value not in array?
  • How to query Cloud Firestore for non-existing keys of documents

Saving to firestore objects more than 1MB

Can you suggest or guide to the right direction of what would someone do in order to store such "big" data?

In order to store such "big" data, you should change the way you are holding that data, from a single document to a collection. A collection has no such limitation: you can add as many documents as you want. According to the official documentation regarding the Cloud Firestore data model:

Cloud Firestore is optimized for storing large collections of small documents.

So you should take advantage of this feature.
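
As a minimal sketch of that restructuring, the Python snippet below turns one big array field into one small document per element. It only builds a path-to-data mapping in memory; the function name, the users/alice parent path, and the item_000000 ID scheme are assumptions made for illustration.

```python
# Turn a large array stored in one document into many small documents.

def array_to_collection(parent_path, subcollection, items):
    """Map each array element to its own document path under the parent."""
    return {
        f"{parent_path}/{subcollection}/item_{index:06d}": item
        for index, item in enumerate(items)
    }


# Hypothetical oversized document with everything in a single array field.
big_document = {"events": [{"title": f"Event {n}"} for n in range(3)]}

docs = array_to_collection("users/alice", "events", big_document["events"])
for path, data in docs.items():
    print(path, data)
```

Each resulting document stays far below the size limit on its own, and the collection can keep growing without any single document growing with it.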

For details, I recommend you see my answer from this post where I have explained some practices regarding storing data in arrays (documents), maps or collections.

Array or Subcollection for storing events user uploaded

There is no simple right or wrong answer when you need to choose between these two options. Data duplication is the key to faster reads, not just in Firebase Realtime Database or Cloud Firestore, but in general. Any time you add the same data to a different location, you're duplicating data in favor of faster read performance. Unfortunately, in return you have more complex updates and higher storage/memory usage. But note that extra calls in the Firebase Realtime Database are not expensive, while in Firestore they are. How much data duplication versus extra database calls is optimal for you depends on your needs and your willingness to let go of the "Single Point of Definition mindset", which can also be called very subjective.

After finishing a few Firebase projects, I find that my reading code gets drastically simpler if I duplicate data. But of course the writing code gets more complex at the same time. It's a trade-off between these two and your needs that determines the optimal solution for your app.

Please also take a look at my answer from this post where I have explained more about collections, maps and arrays in Firestore.


