A deterministic upper bound for the network load of the fully decentralized Freenet spam filter

Goal: Improve the decentralized spam filter in Freenet [1] (WoT [2]) to have deterministic network load, bounded to a low, constant number of subscriptions and fetches.

This article provides calculations which show that decentralized spam filtering with privacy through pseudonyms can scale to communication systems that connect all of humanity. It is also applicable to other systems than Freenet, see use in other systems [3].

Inhaltsverzeichnis

Originally written as a comment to bug 3816 [4]. The bug report said "someone SHOULD do the math". I then did the math. Here I’m sharing the results.

Useful prior reading is Optimizing a distributed spam filter for Freenet [5].

This proposal has two parts:

Ensuring an upper bound on the network cost, and
Limiting the cost due to checking stale IDs.

Slang

ID, "identity" or "pseudonym" is a user account. You can have multiple.
OwnID is one of your own identities, a pseudonym you control.
Trust is a link from one ID (a) to another ID (b). It has a numerical value.
- Positive values mean that (a) considers (b) to be a constructive contributor.
- Negative values mean that (a) thinks that (b) is trying to disrupt communication.
Key is an identifier you can use as part of a link to download data. Every ID has one key.
Editions are the versions of keys. They are increased by one every time a key is updated.
Fetch means to download some data from some key for some edition.
Subscription is a lightweight method to get informed if a key was updated to a new edition.
Edition hints are part of an ID. They show for each trusted ID (b) which edition of it was last seen by the trusting ID (a).
The rank of an ID describes the number of steps needed to get from your OwnID to that ID when following trust paths.

Variables

N the number of identities the OwnID gave positive trust. Can be assumed to be bounded to 150 active IDs (as by Dunbar’s number).⁰
M a small constant for additional subscriptions, i.e. 10.
F a small constant for additional fetches per update, i.e. 10.

⁰: https://en.wikipedia.org/wiki/Dunbar's_number [6] - comment by bertm: that assumes all statements of "OwnID trusts ID to not be a spammer" to be equivalent to "OwnID has a stable social relationship with ID". I'm not quite sure of that equivalence. That said, for purposes of analysis, we can well assume it to be bounded by O(1).

Limit network load with a constant upper bound

Process

Subscribe to all rank 1 IDs (which have direct trust from your OwnID). These are the primary subscriptions. There are N primary subscriptions.

All the other IDs are split into two lists: rank2 (secondary IDs) and rank3+ (three or more steps to reach them). Only a subset of those get subscriptions, and the subset is regularly changed:

Subscribe to the M rank2 IDs which were most recently updated. These have the highest probability of being updated again. The respective list must be updated whenever a rank2 ID is fetched successfully (the ordering might change).
Subscribe to the M rank3+ IDs which were most recently updated. The respective list must be updated whenever a rank3+ ID is fetched successfully (the ordering might change).
Subscribe to M rank2 IDs chosen at random (secondary subscriptions). When a secondary or random subscription yields an update, replace it with another ID of rank2, chosen at random.
Subscribe to M IDs of rank 3 or higher chosen at random (random subscriptions). When a random subscription yields an update, replace it with another rank3+ ID, chosen at random.

Also replace one of the randomly chosen rank2 and rank3+ subscription every hour. This ensures that WoT will always eventually see every update.

If any subscription yields an update, download its key and process all edition hints. Queue these as fetches in separate queues for rank1 (primary), rank2 (secondary), and rank3+ (random), and process them independently.

At every update of a subscription (rank1, rank2, or rank3+), choose F fetches from the respective edition hint fetch queue at random and process them. This bounds the network load to ((N × F) + (4M × F)) × update frequency.

These fetches and subscriptions must be deduplicated: If we already have a subscription, there’s no use in starting a fetch, since the update will already have been seen.

Calculating the upper bound of the cost

To estimate an upper bound for the fetch frequency, we can use the twitter frequency, which is about 5 tweets per day on average and 10 to 50 for people with many followers¹ (those are more likely to be rank1 IDs of others).

There are two possible extremes: Very hierarchic trust structure and egalitarian trust structure. Reality is likely a power-law structure.

In a hierarchic trust structure, we can assume that rank1 or rank2 IDs (trustee subscriptions) are all people with many followers, so we use 22 updates per day (as by ¹).
In an egalitarian trust structure we can assume 5 updates per day (as by ¹).

For high frequency subscriptions (most recently updated) we can assume 4 updates per hour for 16 hours per day, so 64 updates per day.⁰ For random subscriptions we can assume 5 updates per day (as by ¹).

¹: http://blog.hubspot.com/blog/tabid/6307/bid/4594/Is-22-Tweets-Per-Day-the-Optimum.aspx [7] ← on the first google page, not robust, but should be good enough for this usecase.

((N × F) + (M × F)) × trustee update frequency + 2M × F × high update frequency + 2M × F × random update frequency.

For a very hierarchic WoT (primaries are very active) this gives the upper bound:

= (150 × 10 × 22) + (10 × 10 × 22) + (10 × 10 × 64) + (2 × 10 × 10 × 5) + (10 × 10 × 64)
= (1500 × 22) + (100 × 22) + (100 × 64) + (100 × 5) + (100 × 64)
= 33000 + 2200 + 6400 + 500 + 6400 # primary triggered + random rank2 + active rank2 + random rank3+ + active rank3+
= 48500 fetches per day
~ 34 fetches per minute.

For an egalitarian trust structure (primaries have average activity) this gives the upper bound:

= (150 × 10 × 5) + (10 × 10 × 5) + (10 × 10 × 64) + (10 × 10 × 5) + (10 × 10 × 64)
= (1500 × 5) + (100 × 5) + (100 × 64) + (100 × 5) + (100 × 64)
= 7500 + 500 + 6400 + 500 + 6400 # primary triggered + random rank2 + active rank2 + random rank3+ + active rank3+
= 21300 fetches per day
~ 15 fetches per minute.

This gives a plausible upper bound of the network load per day from this scheme, assuming a very centralized WoT. The upper bound for a very hierarchic trust structure is dominated by the primary subscriptions. The upper bound for an egalitarian trust structure is dominated by the primary subscriptions and the high frequency subscriptions.

The rank2 subscriptions and the random subscriptions together make up about 5% of the network load. They are needed to guarantee that the WoT always eventually converges to a globally consistent view.

One fetch for an ID transfers about 1KiB data. For a hierarchic WoT (one fetch per two seconds) this results in a maximum bandwidth consumption on a given node of 1KiB/s × hops. This is about 5KiB/s for the average of 5 hops — slightly higher than our minimum bandwidth. For an egalitarian WoT this results in a maximum bandwidth consumption on a given node of 0.5KiB/s × hops. This is about 2.5KiB/s for the average of 5 hops — 60% of our minimum bandwidth. The real bandwidth requirement should be lower, because IDs are cached very well.

The average total number of subscriptions to active IDs should be bounded to 190.

⁰: The cost of active IDs might be overestimated here, because WoT has an upper bound of one update per hour. In this case the cost of this algorithm would be reduced by about 30% for the egalitarian structure and by about 10% for the hierarchic structure.

prune subscriptions to stale IDs to improve the rank2+ update detection delay to (less than) O(N), with N the known active IDs

The process to check IDs with rank >= 2 can be improved from essentially checking them at random (with the real risk of missing IDs — there is no guarantee to ever check them all, not even networkwide), to having each active ID check all IDs in O(N) (with N the number of of IDs).

Process

When removing a random subscription to an ID with rank2 or higher, with 50% probability add the ID+current_version to a blocklist which avoids processing this same ID with this or a lower version again and prune it from the WoT.¹

When receiving a version hint from another ID with a higher version than the one which is blocked, the ID is removed from the blocklist.

The total cost in memory is on the order of the number of old IDs already checked, bounded to O(N), the number of Identities.

¹: Pruning the ID from WoT is not strictly necessary on the short term. However on the long term (a decade and millions of users), we must remove information.

Expected effect

Assume that 9k of the 10k IDs in WoT are stale (a reasonable assumption, because only about 300 IDs are inserted from an up to date version of WoT right now).

When replacing one random rank2 and one random rank3+ subscription per hour, that yields about 16k subscription replacements per year, or (in a form which simplifies the math) about two replacements per ID in the WoT.

Looking at only a single ID:

For the first replacement there is a 90% probability that the ID in question is stale, and a 50% probability that it will be put on the blocklist if it is stale, which yields a combined 45% probability that the number of stale IDs decreases by one. In other words, it takes on average 2.2 steps to remove the first stale ID from the IDs to check.

As a rough estimate, for 10 IDs it would take 15 steps to prune out 5 of the 9 stale IDs. Scaling this up should give an estimation of the time required for 9k IDs. So after about 15k steps (one year) half the stale IDs should be on the blocklist.

Looking at the whole network

For a given stale ID, after one year there is roughly a 50% chance that it is on the blocklist of a given active ID. But the probability that it is on the blocklist of every active ID is just about 0.5^k, with k the number of active IDs. So when there is an update to this previously stale ID, it is almost certain that some ID will see it and remove it from the blocklists of most other IDs within O(N) steps by providing an edition hint (this will accelerate as more stale IDs are blocked).

Rediscovering inactive IDs when they return

I am sure that there is a beautiful formula to calculate exactly the proportion of subscriptions to stale IDs we’ll have with this algorithm when it entered a steady state, and the average discovery time for a previously stale ID to be seen networkwide again when it starts updating again. To show that this algorithm should work, we only need a much simpler answer, though:

How long will it take an ID which was inactive for 10 years to be seen networkwide again (if its direct trusters are all inactive, else the primary subscriptions would detect and spread its update within minutes)?

After 10 years, the ID will be on the blocklist of 99.9% of the IDs. In a network with 10k active IDs, that means that only about 10 IDs did not block it yet¹. Every year there is a 50% probability for each of the IDs that the update will be seen.

Therefore detection of the update to an ID which was inactive for 10 years and whose direct trusters are all inactive will take about 10 weeks. Then the update should spread rapidly via edition hints.

¹: There is a 7% probability that 15 or more IDs could still see it and a 1.2% probability that less than 5 IDs still see it. The probability that only a single ID did not block it yet is just 0.005%. In other words: If 99% of IDs would become inactive and then active again after 10 years, approximately one will need about two years to be seen and most will be detected again within 10 weeks. Therefore this scheme is robust against long-term inactivity.

Summary

This algorithm can give the distributed spam filter in Freenet a constant upper bound in cost without limiting interaction.