Farcaster Social Graph Project Update #5
Hi folks! We’re excited to share that we’ve finished all our initial milestones and are now actively working on development and integration. Our ensemble ML model is showing promising results, with a 0.98 AUC score on the test set, and we’ve created an initial API endpoint. The next steps are to improve the model and integrate the API with the previously developed frame.
Milestones Tracker
| Milestone | Type | Status | ETA |
| --- | --- | --- | --- |
| Requirements Definition and Data Extraction | Critical | Finished | Finished |
| Algorithm Design and Architecture | Critical | Finished | Finished |
| Algorithm Development and Evaluation | Critical | In progress | 2 weeks |
| Algorithm Integration and Monitoring | Critical | In progress | 2 weeks |
Key progress
List of human users
We attempted several strategies to compile a high-confidence list of human accounts, including:
- Filtering for accounts with validated ENS names
- Selecting users with high Gitcoin scores (above 40)
- Identifying accounts with validated ownership of an X account
However, even after applying these conditions, the resulting lists still consisted predominantly of bots (over 80%). A random sample of users with casts contained an even higher proportion of bots (greater than 90%). Given these challenges, we determined that the most reliable approach was to manually inspect filtered samples, focusing in particular on the accounts each user followed. This manual verification produced a validated list of 416 human accounts, which we used as labels together with the ~4k bot labels from Bot or Not (which we found to be about 90% accurate on inspection).
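The pre-filtering step can be sketched roughly as follows. This is only an illustration: the `Account` fields, the `min_signals` parameter, and the way the three conditions are combined are assumptions made for the example, not the project's actual code. Only the Gitcoin threshold of 40 comes from the text above.

```python
from dataclasses import dataclass

# Threshold mentioned in the update; everything else here is hypothetical.
GITCOIN_THRESHOLD = 40


@dataclass
class Account:
    fid: int
    has_verified_ens: bool   # validated ENS name
    gitcoin_score: float     # Gitcoin score
    has_verified_x: bool     # validated ownership of an X account


def is_candidate(account: Account, min_signals: int = 1) -> bool:
    """Return True if the account shows at least `min_signals` of the three signals."""
    signals = [
        account.has_verified_ens,
        account.gitcoin_score > GITCOIN_THRESHOLD,
        account.has_verified_x,
    ]
    return sum(signals) >= min_signals


def select_candidates(accounts, min_signals: int = 1):
    """Keep accounts worth sending to manual inspection."""
    return [a for a in accounts if is_candidate(a, min_signals)]


accounts = [
    Account(fid=1, has_verified_ens=True, gitcoin_score=55.0, has_verified_x=True),
    Account(fid=2, has_verified_ens=False, gitcoin_score=12.0, has_verified_x=False),
    Account(fid=3, has_verified_ens=False, gitcoin_score=41.0, has_verified_x=False),
]
candidate_fids = [a.fid for a in select_candidates(accounts)]
strict_fids = [a.fid for a in select_candidates(accounts, min_signals=3)]
```

As noted above, passing these filters did not reliably indicate a human account, so a filter like this only narrows the pool for manual review.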
Development of initial machine learning model
Our experimental ensemble model combines XGBoost, Random Forest, and LightGBM classifiers, leveraging over 70 features extracted from the Neynar dataset. The ensemble achieved a strong AUC score of 0.98 on the test set. However, it’s important to note that our training data represents only approximately 0.5% of total Farcaster users, so these results should be interpreted with caution until we can validate the model’s performance across a broader user base.
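For readers less familiar with the metric, here is a minimal, dependency-free sketch of how an ensemble score and its AUC can be computed. It assumes the ensemble averages the per-model probabilities (soft voting), which is one common choice and not necessarily how our model combines its classifiers. The probabilities below are made up for illustration.

```python
def soft_vote(prob_lists):
    """Average the per-model bot probabilities for each sample."""
    n_models = len(prob_lists)
    return [sum(column) / n_models for column in zip(*prob_lists)]


def roc_auc(labels, scores):
    """AUC as the chance that a random positive outranks a random negative.

    Ties count as half a win. Labels are 1 (bot) or 0 (human).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up outputs standing in for three classifiers on six labeled accounts.
labels = [1, 1, 0, 0, 1, 0]
model_a = [0.90, 0.60, 0.40, 0.20, 0.70, 0.65]
model_b = [0.80, 0.70, 0.30, 0.10, 0.40, 0.50]
model_c = [0.95, 0.50, 0.20, 0.30, 0.90, 0.35]

ensemble = soft_vote([model_a, model_b, model_c])
auc_a = roc_auc(labels, model_a)
auc_ensemble = roc_auc(labels, ensemble)
```

In this toy data, `model_a` ranks one human above one bot, while the averaged scores separate the two classes, which is the kind of gain an ensemble is meant to provide.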
The current feature set encompasses the five main categories listed below. The full list of features in each category is available in the project’s repo.
- User identity metrics, which capture account-specific characteristics and verification status
- Network analysis metrics that examine user interactions and connection patterns
- Temporal behavior metrics, which track posting frequency and activity patterns over time
- Content engagement metrics that analyze statistics about replies, reactions and participation on the social network in general
- Reputation meta metrics, which capture verifications and score values
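To make one of these categories concrete, here is a small sketch of temporal behavior features computed from a user's cast timestamps. The feature names and the input format (Unix timestamps in seconds) are assumptions chosen for the example; the actual features are defined in the repo.

```python
from statistics import mean, pstdev

SECONDS_PER_DAY = 86400


def temporal_features(cast_timestamps):
    """Compute simple posting-frequency features from Unix timestamps (seconds)."""
    ts = sorted(cast_timestamps)
    features = {
        "cast_count": len(ts),
        "casts_per_day": 0.0,
        "mean_gap_s": 0.0,
        "gap_std_s": 0.0,
    }
    if len(ts) < 2:
        return features  # not enough casts to measure gaps

    # Gaps between consecutive casts; very regular gaps can hint at automation.
    gaps = [later - earlier for earlier, later in zip(ts, ts[1:])]
    span_days = (ts[-1] - ts[0]) / SECONDS_PER_DAY

    features["mean_gap_s"] = mean(gaps)
    features["gap_std_s"] = pstdev(gaps)
    if span_days > 0:
        features["casts_per_day"] = len(ts) / span_days
    return features


# Four casts exactly one hour apart.
hourly = temporal_features([0, 3600, 7200, 10800])
single = temporal_features([1700000000])
```

A standard deviation of zero across gaps, as in the hourly example, is the sort of regularity that a feature like this is meant to surface.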
In the coming weeks, we plan to expand this research by evaluating additional models, refining our feature selection to identify the most impactful variables, and incorporating new features focused on content engagement metrics.
Integration of ML model and SybilSCAR
We developed a Python library for detecting sybil accounts using machine learning, transforming our initial notebook experiments into production code. This library has been successfully integrated with our existing social graph API, completing the design and architecture phase of the project.
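As a rough illustration of how classifier output can feed a graph-based step, the sketch below uses per-user sybil probabilities as priors and spreads them over follow edges with a simplified residual update loosely modeled on SybilSCAR. The function name, the undirected treatment of edges, the fixed edge weight, and the clamping are all assumptions made for this example; it is not the library's API, nor a faithful reimplementation of the published algorithm.

```python
def propagate_scores(prior, edges, weight=0.6, iterations=10):
    """Spread sybil scores over a graph using residuals around 0.5.

    prior:  dict mapping node -> prior sybil probability in [0, 1]
            (for example, the ML model's output)
    edges:  iterable of (u, v) pairs, treated as undirected here
    weight: assumed homophily strength; values above 0.5 pull neighbors
            toward the same label
    """
    neighbors = {node: [] for node in prior}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    residual_prior = {node: p - 0.5 for node, p in prior.items()}
    residual_weight = weight - 0.5
    scores = dict(residual_prior)

    for _ in range(iterations):
        updated = {}
        for node in prior:
            influence = sum(scores[v] * residual_weight for v in neighbors[node])
            value = residual_prior[node] + 2 * influence
            # Clamp so the result stays a valid probability (a simplification).
            updated[node] = max(-0.5, min(0.5, value))
        scores = updated

    return {node: s + 0.5 for node, s in scores.items()}


# "a" looks like a sybil to the classifier; "b" and "c" are unknown (0.5).
prior = {"a": 0.9, "b": 0.5, "c": 0.5}
posterior = propagate_scores(prior, edges=[("a", "b")])
```

Here `b` is pulled above 0.5 because it is connected to `a`, while the isolated node `c` keeps its prior.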
Deployment of basic API
We have deployed an initial version of the API to enable early testing and feedback. It exposes an endpoint that takes a user’s fid and returns that user’s sybil probability.
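A client for an endpoint like this might look roughly like the sketch below. The base URL, the `fid` query parameter, and the `sybil_probability` response field are placeholders invented for the example, since the public endpoint and its schema have not been published yet.

```python
import json
from urllib.parse import urlencode

# Placeholder only: the real endpoint URL is not public yet.
BASE_URL = "https://example.invalid/sybil"


def build_url(fid: int, base_url: str = BASE_URL) -> str:
    """Build the request URL for a given Farcaster fid."""
    if not isinstance(fid, int) or fid <= 0:
        raise ValueError("fid must be a positive integer")
    return f"{base_url}?{urlencode({'fid': fid})}"


def parse_sybil_probability(body: str) -> float:
    """Extract the probability from a JSON response body.

    Assumes a payload shaped like {"fid": 3, "sybil_probability": 0.12}.
    """
    payload = json.loads(body)
    probability = float(payload["sybil_probability"])
    if not 0.0 <= probability <= 1.0:
        raise ValueError("sybil_probability must be between 0 and 1")
    return probability


url = build_url(3)
probability = parse_sybil_probability('{"fid": 3, "sybil_probability": 0.12}')
```

Fetching `url` with any HTTP client and passing the response text to `parse_sybil_probability` would complete the round trip once the real endpoint and response format are known.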
Upcoming tasks
- Provide a public endpoint for the API
- Integration of the API with the developed Farcaster frame
- Improvements to the current machine learning model:
  - Feature selection to decrease computational costs
  - Engineering of new features
  - Experiments to confirm that the current results apply to all users