Roee Eilat and Udi Weinsberg are Research Scientist Managers within Facebook Core Data Science (CDS), a research and development team working on improving Facebook’s processes, infrastructure, and products that enable more than 1.5 billion people to communicate with one another every day. This work was done during Adam N. Breuer’s internship with CDS.
What we did
Online social networks are frequently targeted by malicious actors who create fake accounts for the purpose of carrying out abuse. Broadly, abuse is conducted in three phases. First, malicious actors create accounts. Then, these accounts need to establish connections with real users (e.g., by sending friend requests on Facebook). Finally, once they establish a sufficient number of connections, fake accounts can expose their social networks to a variety of malicious activities and abuse.
In our paper, “Friend or Faux: graph-based early detection of fake accounts on social networks,” we focus on the detection of new fake accounts that manage to evade registration-time classifiers but have not yet made sufficient connections to perpetrate abuse. This means that we target accounts that manage to circumvent defenses that target new accounts immediately when they are created, e.g., using a suspicious email or IP address. In this work, we aim to identify such accounts that are at the early stages of their activity on the platform, potentially before abuse has been manifested. By detecting fake accounts early, we mitigate their ability to launch abuse against other users.
We designed and built an algorithm, named SybilEdge, that processes the way new users add friends (or “edges”) to their social networks. By looking at the selection of friend request targets and the corresponding request acceptances and rejections, SybilEdge accurately detects fake accounts with as few as 20 friend requests.
How we did it
In order to launch abuse on a social network, abusers need to connect to targets. On Facebook, this means that the abusers first need to find targets, then send them a friend request, and finally have the request accepted. We observed these actions over time for two sets of accounts: benign people and those that were deemed fake based on actual abuse. We found that these two sets of users differed significantly in both their selection of potential friends (targets), and those targets’ responses to their friend requests.
In Figure 1, we show the distribution of the rate of rejections (number of rejected requests out of total friend requests). As the figure shows, fake accounts’ rejection rate is skewed to the right, meaning that their requests are rejected more often than real users’ requests.
This is a rather intuitive finding; people are alert, and they are more likely to reject requests coming from fake accounts (that they probably don’t know) than from real people (whom they are more likely to know). However, as the figure also shows, requests from real people are also rejected sometimes, and fake users aren’t rejected all the time.
In order to overcome this, we took a closer look at individual users and the difference in their rate of acceptance of requests coming from real and fake accounts. Figure 2 shows the distribution of these rates: 1 indicates people that are just as likely to accept fake accounts as they are to accept real people; greater than 1 indicates acceptance of more requests from real people than from fake ones; and less than 1 indicates those that accept more requests from fake accounts than from real ones.
As Figure 2 shows, while there is a large mass of users that are as likely to accept requests from real accounts as they are to accept requests from fake ones, there are many users that behave in a way that provides a signal about the requester: If the receiver is more likely to accept a request from a real account, and they accepted an incoming request, it serves as a signal that the requester is a real user. Similarly, if they rejected the incoming request, it increases the probability that the requester is fake. Moreover, even the people who accept more requests from fake accounts than real accounts can be leveraged, as they provide a similar signal but in the other direction: If they accept a request, the probability that the requester is fake increases.
Finally, another intuitive behavior that we observed in data is that fake accounts oftentimes are careful when picking their friend request targets, most likely to maximize the probability of their requests being accepted. Figure 3 shows the distribution of the rates of incoming requests from real users and from fake users:
Like the plot in Figure 2, this plot shows that we can leverage the targets of friend requests as a signal to the likelihood of the sender being fake. People on the >1 side of the plot are more likely to receive requests from real people, while people on the <1 side are more likely to receive requests from fakes. As such, we can increase the probability of the requester being fake or real based on the target of the requester.
These observations land nicely into a probabilistic framework, where we determine the probability of a user to be fake, by observing the set of targets (people that friend requests were sent to) and their corresponding responses (accepting or rejecting).
SybilEdge works in two stages. First, we train the model by observing data over some period of time, and leveraging outputs from other classifiers. Specifically, on Facebook, we use behavioral and content classifiers that flag an account as abusive based on actual abuse. This training phase provides us with all the necessary model parameters. Then, we run the model in real time for each friend request and response (accept or reject), and update the probability for the requester being fake.
While there are a few Bayesian equations to solve, it turns out to be quite simple and efficient to compute. Those interested in the exact details can refer to our research paper. SybilEdge also produces great results compared with the state of the art. Figure 4 shows a comparison between SybilEdge and related models, using the area under curve (AUC) of the precision/recall curve (AUC=1 is a perfect classifier, while AUC=0.5 is a random class-prior classifier) for increasing the number of friend requests.
As Figure 4 shows, SybilEdge significantly outperforms existing models, reaching a very high AUC. Furthermore, it converges quickly, with as few as 15 friend requests on average. Interestingly, the performance of the other models actually degrades with more friend requests. This is likely the result of more friend requests being accepted overall as more friend requests are sent, which possibly throws these models off.
SybilEdge has some additional desirable properties:
- Rapid classification of new users: SybilEdge converges quickly, especially if the user sends requests to highly discriminating users. For example, it takes only a few rejects by savvy real users for an eager abusive account to be flagged as abusive.
- Adversarial robustness: SybilEdge has two main components: one is the selection of the target, and the other is the responses of the targets. While the former can be controlled by attackers, the latter is completely out of their control, thus making it difficult to circumvent our detection model. In the paper, we show a version of SybilEdge that includes only the responses, and while it has a lower performance than the full SybilEdge model, it still outperformed the state-of-the-art models.
- Low complexity: SybilEdge’s complexity is linear in the number of requests, which, due to the sparsity of the social network, is proportional to the number of people — i.e., assuming V to be the set of people in the social network, SybilEdge’s complexity is O(|V|).
- Interpretable: A favorable property of the Bayesian setting is that we can explain the reason for a given output, e.g., the user is likely to be fake because they were rejected by people who tend to reject fakes. This is unlike more blackbox models, such as neural nets, which are extremely difficult to reason about.
- Robust training: Even though SybilEdge uses behavioral and content models to learn its priors, it is quite robust to inaccuracies of these models. In addition, it also performs well in a wide range of settings, including varying levels of fakes in the network and different social network topology (attachment models). We refer the avid reader to the paper, where we study these robustness properties extensively.
A key challenge in fighting abuse is detecting abusive users quickly and confidently, and then removing them from the platform before they can launch their attacks. SybilEdge helps us identify abusers quickly and in a way that can be explained and analyzed. In the near future, we plan to look at additional ways that can further speed up the detection of abusive accounts and help make confident decisions even faster than SybilEdge. We plan to accomplish this by mixing feature-based and behavior-based models.