Search Architecture

Instagram is in the fortunate position to be a small company within the infrastructure of a much larger one. When it makes sense, we leverage resources to leapfrog into experiences that have taken Facebook ten years to build. Facebook’s search infrastructure, Unicorn, is a social-graph-aware search engine that has scaled to indexes containing trillions of documents. In early 2015, Instagram migrated all search infrastructure from Elasticsearch into Unicorn. In the same period, we saw a 65% increase in search traffic as a result of both user growth and a 12% jump in the number of people who are using search every time they use Instagram.

These gains have come in part from leveraging Unicorn’s ability to rank queries using social features and second-order connections. By indexing every part of the Instagram graph, we powered the ability to search for anything you want - people, places, hashtags, media - faster and more easily as part of the new Search and Explore experience in our 7.0 update.

What Is Search?

Instagram’s search infrastructure consists of a denormalized store of all entities of interest: hashtags, locations, users and media. In typical search literature these are called documents. Documents are grouped together into sets which can be queried using extremely efficient set operations such as AND, OR and NOT. The results of these operations are efficiently ranked and trimmed to only the most relevant documents for a given query. When an Instagram user enters a search query, our backend encodes it into set operations and then computes a ranked set of the best results.
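To make the set-based view concrete, here is a minimal sketch of AND, OR and NOT over document sets. The term names and document ids are hypothetical, and plain Python sets stand in for whatever structures the real engine uses.

```python
# Hypothetical index: each term maps to the set of document ids that carry it.
index = {
    "hashtag:sunset": {1, 2, 3, 5},
    "location:nyc": {2, 3, 4},
    "owner:private": {3},
}

def query_and(*terms):
    """Documents present in every term's set."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def query_or(*terms):
    """Documents present in at least one term's set."""
    return set().union(*(index.get(t, set()) for t in terms))

def query_not(candidates, term):
    """Candidates that do not appear in the term's set."""
    return candidates - index.get(term, set())

# "Sunset media in NYC, excluding media from private owners."
matches = query_not(query_and("hashtag:sunset", "location:nyc"), "owner:private")
```

A real engine would then rank and trim `matches` rather than return the whole set.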

Getting Data In

Instagram serves millions of requests per second. Many of these, such as signups, likes, and uploads, modify existing records and append new rows to our master PostgreSQL databases. To maintain the correct set of searchable documents, our search infrastructure needs to be notified of these changes. Furthermore, search typically needs more information than a single row in PostgreSQL — for example, the author’s account vintage is used as a search feature after a photo is uploaded.

To solve the problem of denormalization, we introduced a system called Slipstream where events on Instagram are encoded into a large Thrift structure containing more information than typical consumers would use. These events are binary-serialized and sent over an asynchronous pub/sub channel we call the Firehose. Consumers, such as search, subscribe to the Firehose, filter out irrelevant events and react to remaining events. The Firehose is implemented on top of Facebook's Scribe which makes the messaging process asynchronous. The figure below shows the architecture:

Since Thrift is schematized, we re-use objects across requests and have consumers consume messages without the need for custom deserializers. A subset of our Slipstream schema, corresponding to a photo like, is shown below:

struct User {
  1: required i64 id;
  2: string username;
  3: string fullname;
  4: bool is_private;
  ...
}

struct Media {
  1: required i64 id;
  2: required i64 owner_id;
  3: required MediaContentType content_type;
  ...
}

struct LikeEvent {
  1: required i64 liker_id;
  2: required i64 media_id;
  3: required i64 media_owner_id;
  4: Media media;
  5: User liker;
  6: User media_owner;
  ...
  8: bool is_following_media_owner;
}

union InstagramEvent {
  ...
  2: LikeEvent like;
  ...
}

struct FirehoseEvent {
  1: required i64 server_time_millis;
  2: required InstagramEvent event;
}
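As an illustration of how a consumer might filter the Firehose, the sketch below mirrors a small part of the schema with Python dataclasses. It is not generated Thrift code: the class layout, the `like` field standing in for the `InstagramEvent` union, and the `search_consumer` function are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

@dataclass
class LikeEvent:
    liker_id: int
    media_id: int
    media_owner_id: int
    is_following_media_owner: bool = False

@dataclass
class FirehoseEvent:
    server_time_millis: int
    # Stand-in for the InstagramEvent union: None means "some other event type".
    like: Optional[LikeEvent] = None

def search_consumer(events: Iterable[FirehoseEvent],
                    on_like: Callable[[LikeEvent], None]) -> int:
    """Skip events this consumer does not care about; react to likes."""
    handled = 0
    for event in events:
        if event.like is None:
            continue  # irrelevant to this consumer
        on_like(event.like)
        handled += 1
    return handled

# Hypothetical stream: one like and one unrelated event.
seen = []
stream = [FirehoseEvent(1, like=LikeEvent(7, 42, 9)), FirehoseEvent(2)]
handled_count = search_consumer(stream, seen.append)
```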

Firehose messages are treated as best-effort and a small percentage of data loss is expected in messaging. We establish eventual consistency in search through a process of reconciliation, or base build. Each night, we scrape a snapshot of all Instagram PostgreSQL databases to Hive for data archiving. Periodically, we query these Hive tables and construct all appropriate documents for each search vertical. The base build is merged against data derived from Slipstream to allow our systems to be eventually consistent even in the event of data loss.
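One way to picture the merge is sketched below. It assumes every document carries a last-updated timestamp (the `updated_ms` field is hypothetical) and that the newer version wins; the actual merge rules are not described here.

```python
def merge_base_build(base, realtime):
    """Merge snapshot-derived documents with event-derived ones.

    Both arguments map document id -> document dict. The newer version of a
    document wins, so events dropped by the best-effort channel are repaired
    by the base build, while fresh events still override the older snapshot.
    """
    merged = dict(base)
    for doc_id, doc in realtime.items():
        current = merged.get(doc_id)
        if current is None or doc["updated_ms"] >= current["updated_ms"]:
            merged[doc_id] = doc
    return merged

# Hypothetical data: the event for document 2 was lost in messaging.
base = {
    1: {"updated_ms": 100, "username": "old_name"},
    2: {"updated_ms": 100, "username": "from_snapshot"},
}
realtime = {
    1: {"updated_ms": 150, "username": "new_name"},
    3: {"updated_ms": 160, "username": "signed_up_after_snapshot"},
}
merged = merge_base_build(base, realtime)
```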

Getting Data Out

Processing Queries

Assuming that we have ingested our data correctly, our search infrastructure enables an efficient path to extracting relevant documents given a constraint. We call this constraint a query, which is typically a derived form of user-supplied text (e.g. “Justin” with the intent of searching for Justin Bieber). Behind the scenes, queries to Unicorn are rewritten into S-Expressions that express clear intent, for example:

(and user:maxime (apply followed_by: followed_by:me) )

which translates to “people named maxime followed by people I follow”. Our search infrastructure proceeds in two (intermixed) steps:

Candidate generation: finding a set of documents that match a given query. Our backend dives into a structure called a reverse index, which finds sets of document ids indexed by a term. For example, we may find the set of users with the name “justin” in the “name:justin” term.

Ranking: choosing the best documents from all the candidates. After getting candidate documents, we look up features which encode metadata about a document. For example, one feature for the user justinbieber would be his number of followers (32.3MM). These features are used to compute a “goodness” score, which is used to order the candidates. The “goodness” score can be either machine-learned or hand-tuned. In the machine-learned case, we may engineer features that help predict clicks on or follows of a given candidate.

The result of the two steps is an ordered list of the best documents for a given query.
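The two steps can be sketched in a few lines. The index contents, follower counts and the log-based score below are made up for illustration; the real scoring function may be machine-learned or hand-tuned.

```python
import math

# Hypothetical reverse index: term -> document ids.
reverse_index = {"name:justin": [101, 102, 103]}

# Hypothetical per-document features.
features = {
    101: {"followers": 5_000_000},
    102: {"followers": 120},
    103: {"followers": 40_000},
}

def generate_candidates(term):
    """Candidate generation: look up the ids indexed under a term."""
    return reverse_index.get(term, [])

def goodness(doc_id):
    """Toy hand-tuned score: log-scaled follower count."""
    return math.log1p(features[doc_id]["followers"])

def search(term, limit=2):
    """Ranking: order candidates by score and keep the best `limit`."""
    ranked = sorted(generate_candidates(term), key=goodness, reverse=True)
    return ranked[:limit]

top = search("name:justin")
```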

Graph-Aware Searches

As part of our search improvements, Instagram now takes into account who you follow and who they follow in order to provide a more personalized set of results. This means that it is easier for you to find someone based on the people you follow.

Using Unicorn allowed us to index all the accounts, media, hashtags and places on Instagram and the various relationships between these entities. For example, by indexing a user’s followers, Unicorn can provide answers to questions such as:

“Which accounts does user X follow that are also followed by user Y?”

Equally, by indexing the locations tagged in media, Unicorn can provide responses for questions such as:

“Media taken in New York City from accounts I follow”

Improving Account Search

While utilizing the Instagram graph alone may provide signals that improve the search experience, it may not be sufficient to find the account you are looking for. The search ranking infrastructure of Unicorn had to be adapted to work well on Instagram.

One way we did this was to model existing connections within Instagram. On Facebook, the basic relationship between accounts is non-directional (friending is always reciprocal). On Instagram, people can follow each other without having to follow back. Our team had to adapt the search ranking algorithms used to store and retrieve accounts to Instagram’s follow graph. For Instagram, accounts are retrieved from Unicorn by going through different mixes of:

“people followed by people you follow”

and

“people followed by people who follow you”

In addition, on Instagram, people can follow each other for various reasons. It doesn’t necessarily mean that a user has the same amount of interest in all the accounts they follow. Our team built a model to rank the accounts followed by each user. This allows us to prioritize showing people followed by people that are more important to the searcher.
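The sketch below illustrates both retrieval paths on a small directed follow graph, plus a toy version of importance-weighted ranking. The graph, the `importance` weights and the additive scoring are assumptions for illustration only, not the model described above.

```python
from collections import defaultdict

# follows[a] is the set of accounts that a follows (edges are directed).
follows = {
    "me": {"alice", "bob"},
    "alice": {"carol", "dave"},
    "bob": {"carol"},
    "erin": {"me", "frank"},
}

def followers_of(user):
    """Accounts with an edge pointing at `user`."""
    return {a for a, outs in follows.items() if user in outs}

def followed_by_people_i_follow(me):
    result = set()
    for friend in follows.get(me, set()):
        result |= follows.get(friend, set())
    return result - {me}

def followed_by_people_who_follow_me(me):
    result = set()
    for fan in followers_of(me):
        result |= follows.get(fan, set())
    return result - {me}

# Hypothetical importance of each account the searcher follows.
importance = {("me", "alice"): 0.9, ("me", "bob"): 0.2}

def rank_second_order(me):
    """Score a candidate by the summed importance of the followees who follow it."""
    scores = defaultdict(float)
    for friend in follows.get(me, set()):
        weight = importance.get((me, friend), 0.0)
        for candidate in follows.get(friend, set()):
            if candidate != me:
                scores[candidate] += weight
    return sorted(scores, key=lambda c: (-scores[c], c))

ranked = rank_second_order("me")
```

Because the edges are directed, the two retrieval functions return different sets for the same searcher. That difference does not arise on a reciprocal friend graph.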

A Unified Search Box

Sometimes, the best answer for a search query can be a hashtag or a place. In the previous search experience, Instagram users had to explicitly choose between searching for accounts or hashtags. We made it easier to search for hashtags and places by removing the necessity to select between the different types of results. Instead, we built a ranking framework that allows us to predict which type of results we think the user is looking for. We found in tests that blending hashtags with accounts was such a better experience that clicks on hashtags went up by more than 20%! This increase fortunately didn’t come at the cost of significantly impacting account search.

Our classifiers are both personalized and machine-learned on the logs of searches that users are doing on Instagram. The query logs are aggregated per country to determine if a given search term such as “#tbt” would most likely result in a hashtag search or an account search. Those signals are combined with other signals, such as past searches by a given user and the quality of the results available to show, in order to produce a final blended list of results.
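As a rough illustration of the per-country aggregation, the sketch below turns a handful of made-up log rows into a prior over result types and uses it to order a mixed result list. The log format, the `quality` values and the multiplicative blend are all assumptions; the production classifiers combine more signals than this.

```python
from collections import Counter, defaultdict

# Hypothetical log rows: (country, search term, type of result clicked).
logs = [
    ("US", "tbt", "hashtag"),
    ("US", "tbt", "hashtag"),
    ("US", "tbt", "account"),
    ("BR", "tbt", "account"),
]

def type_priors(log_rows):
    """Fraction of clicks per result type, keyed by (country, term)."""
    counts = defaultdict(Counter)
    for country, term, clicked_type in log_rows:
        counts[(country, term)][clicked_type] += 1
    return {
        key: {t: n / sum(c.values()) for t, n in c.items()}
        for key, c in counts.items()
    }

priors = type_priors(logs)

def blend(country, term, results):
    """Order (result_type, name, quality) tuples by prior * quality."""
    prior = priors.get((country, term), {})
    return sorted(results, key=lambda r: prior.get(r[0], 0.0) * r[2], reverse=True)

blended = blend("US", "tbt", [("account", "tbt_official", 0.9), ("hashtag", "#tbt", 0.8)])
```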

Media Search

Instagram’s search infrastructure is used to power discovery features far away from user-input search. Our largest search vertical, media, contains the billions of posts on Instagram indexed by the trillions of likes. Unlike our other tiers, media search is purely infrastructure — users never enter any explicit media search queries in the app. Instead, we use it to power features that display media: explore, hashtags, locations and our newly launched editorial clusters.

Candidate Generation

Lacking an explicit query, we get creative with our media reverse index terms to enable slicing along different axes. The table below shows a list of some term types currently supported in our media index:

Within each posting list, our media is ordered (“statically ranked”) reverse-chronologically to encourage a strong recency bias for results. For example, we can serve Instagram’s profile page for @thomas with a single query: (term owner:181861901). Extending to hashtags, we can serve recent media from #hyperlapse through (term hashtag:#hyperlapse). Composing Unicorn’s operators enables us to find @thomas’ Hyperlapses, by issuing (and hashtag:#hyperlapse owner:181861901).
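A minimal sketch of how an AND over two statically ranked posting lists can preserve recency order follows. It assumes that a larger media id means a newer post, which is a simplification made for this example, and the ids themselves are made up.

```python
def intersect_desc(a, b):
    """Intersect two posting lists sorted newest-first (descending ids).

    A two-pointer walk keeps the output in the same newest-first order
    without re-sorting.
    """
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] > b[j]:
            i += 1  # a[i] is newer than anything left in b
        else:
            j += 1  # b[j] is newer than anything left in a
    return out

# Hypothetical posting lists, newest media first.
postings = {
    "hashtag:#hyperlapse": [90, 75, 60, 42, 10],
    "owner:181861901": [88, 75, 42, 7],
}

hyperlapses_by_owner = intersect_desc(
    postings["hashtag:#hyperlapse"], postings["owner:181861901"]
)
```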

Many of these terms exist to encourage diversity in our search results. For example, we may be interested in making sure that some #hyperlapse candidates are posted by verified accounts. Through the use of Unicorn’s WEAK AND operator we can guarantee that at least 30% of candidates come from verified accounts:

(wand (term hashtag:#hyperlapse) (term verified:1 :optional-weight 0.3) )

We exploit diversity to serve better content in the “top” sections of hashtags and locations.
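The sketch below shows one possible reading of the 30% guarantee, not Unicorn’s actual implementation. It walks candidates in recency order, always accepts documents that match the optional term, and accepts non-matching documents only until they fill their share of the result. The function name, its parameters and the data are hypothetical.

```python
import math

def weak_and(candidates, matches_optional, optional_weight, limit):
    """Return up to `limit` candidates, at least `optional_weight` of the
    limit matching the optional term.

    If too few candidates match the optional term, fewer than `limit`
    documents are returned rather than breaking the guarantee.
    """
    max_missing = limit - math.ceil(optional_weight * limit)
    out, missing = [], 0
    for doc in candidates:
        if len(out) == limit:
            break
        if matches_optional(doc):
            out.append(doc)
        elif missing < max_missing:
            out.append(doc)
            missing += 1
    return out

# Hypothetical #hyperlapse candidates in recency order; a few are verified.
candidates = list(range(1, 21))
verified = {3, 12, 15, 18}
result = weak_and(candidates, lambda d: d in verified, optional_weight=0.3, limit=10)
```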

Features

Although posting lists are ordered reverse-chronologically, we often want to surface the top media for a given query (hashtag, location, etc.). After candidate generation, we go through a process of ranking which chooses the best media by assigning a score to each document. The scoring function consumes a list of features and outputs a score representing the “goodness” of a given document for our query.

Features in our index can be divided broadly into three categories:

Visual: features that look at the visual content of the image itself. Concretely, we run each of Instagram’s photos through a deep neural net (DNN) image classifier in an attempt to categorize the content of the photo. Afterwards, we perform face detection in order to determine the number and size of the faces in the photo.
Post metadata: features that look at non-visual content of a given post. Many Instagram posts contain captions, location tags, hashtags and/or mentions which aid in determining search relevancy. For example, FEATURE_IG_MEDIA_IS_LOCATION_TAGGED is an indicator feature determining whether a post contains a location tag.
Author: features that look at the person who made a given post. Some of the richest information about a post is determined by the person that made it. For example, FEATURE_IG_MEDIA_AUTHOR_VERIFIED is an indicator feature determining whether the author of a post is verified.

Depending on the use case, we tune feature weights differently. On the “top” section of location pages we may wish to differentiate between photos of a location and photos in a location and down-rank photos containing large faces. Instagram uses a per-query-type ranking model that allows for modeling choices appropriate to a particular app view.
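A per-query-type scoring function might look like the sketch below. The two indicator feature names come from the text above; the `largest_face_fraction` feature, the query-type names, the weights and the linear form are assumptions for illustration.

```python
# Hypothetical hand-tuned weights, one set per query type.
WEIGHTS = {
    "hashtag_top": {
        "FEATURE_IG_MEDIA_AUTHOR_VERIFIED": 1.0,
        "FEATURE_IG_MEDIA_IS_LOCATION_TAGGED": 0.2,
        "largest_face_fraction": 0.0,
    },
    "location_top": {
        "FEATURE_IG_MEDIA_AUTHOR_VERIFIED": 0.5,
        "FEATURE_IG_MEDIA_IS_LOCATION_TAGGED": 1.0,
        "largest_face_fraction": -2.0,  # down-rank photos dominated by a face
    },
}

def score(features, query_type):
    """Weighted sum of feature values using the weights for this query type."""
    weights = WEIGHTS[query_type]
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

# Two hypothetical posts tagged at the same location.
selfie = {
    "FEATURE_IG_MEDIA_AUTHOR_VERIFIED": 1,
    "FEATURE_IG_MEDIA_IS_LOCATION_TAGGED": 1,
    "largest_face_fraction": 0.6,
}
landscape = {
    "FEATURE_IG_MEDIA_AUTHOR_VERIFIED": 0,
    "FEATURE_IG_MEDIA_IS_LOCATION_TAGGED": 1,
    "largest_face_fraction": 0.0,
}
```

With these weights the landscape outranks the selfie on a location page, while the verified author's selfie wins on a hashtag page.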

Case study: Explore

Our media search infrastructure also extends itself into discovery, where we serve interesting content that users aren’t explicitly looking for. Instagram’s Explore Posts feature showcases interesting content from people near to you in the Instagram graph. Concretely, one source of Explore candidates is “photos liked by people whose photos you have liked”. We can encode this into a single Unicorn query with:

(apply liker:(extract owner: liker:))

This proceeds inwards-outwards by:

liker:: posts that you’ve liked
(extract owner:...): the owner of those posts
(apply liker:...): media liked by those owners
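The three steps above can be imitated with plain dictionaries. The data and helper names below are hypothetical, and removing posts the searcher has already liked is an extra step added for this sketch, not part of the query.

```python
# Hypothetical data: who liked which media, and who owns each media item.
likes = {
    "me": {"m1", "m2"},
    "ann": {"m3", "m4"},
    "ben": {"m4", "m5"},
    "cat": {"m6"},
}
owner = {"m1": "ann", "m2": "ben", "m3": "cat", "m4": "dan", "m5": "dan", "m6": "ann"}

def liked_by(user):
    """The liker: term for one user, i.e. media that user has liked."""
    return likes.get(user, set())

def extract_owner(media_ids):
    """The extract owner: step, i.e. owners of the given media."""
    return {owner[m] for m in media_ids}

def apply_liker(users):
    """The apply liker: step, i.e. union of media liked by each user."""
    result = set()
    for user in users:
        result |= liked_by(user)
    return result

def explore_candidates(me):
    seed = liked_by(me)              # posts that you've liked
    owners = extract_owner(seed)     # the owners of those posts
    return apply_liker(owners) - seed  # media liked by those owners, minus the seed

candidates = explore_candidates("me")
```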

After this query generates candidates, we are able to leverage our existing ranking infrastructure to determine the top posts for you. Unlike top posts on hashtag and location pages, the scoring function for Explore is machine-learned instead of hand-tuned.

Acknowledgements

By Maxime Boucher and Thomas Dimson

This project wouldn’t be possible without the contributions of Tom Jackson, Peter DeVries, Weiyi Liu, Lucas Ou-Yang, Felipe Sodre da Silva and Manoli Liodakis.