Privacy by Design: Frequency Capping

by Ed Felten

One of the principles of Privacy by Design, as advocated in the FTC Privacy Report, is that when you design a business process, it’s a best practice to think carefully about how to minimize the information you collect, retain, and use in that process.  Often, you can implement the feature you want, with a smaller privacy footprint, if you think carefully about your design alternatives.

As an example, let’s look at frequency capping in the online ad market.  Advertisers want to limit the number of times a particular user sees ads from a particular ad campaign.  This is called a “frequency cap” for the campaign.  The more times a user sees an ad, the less likely that one more viewing of the ad will get them to buy; and the more likely that they’ll find the repeated ad annoying.

One way to implement frequency caps is to use third-party tracking.  The ad network assigns each user a unique userID (a pseudonym), stored in a cookie on the user’s computer, and the ad network records which userIDs saw which ads on which sites.  The ad network uses these records to keep a count of how many times each userID has seen each ad, and to avoid repeating ads too many times.  This approach works, but it gathers a lot of data: full tracking of user activities across all sites served by the ad network.

There are at least two ways to do frequency capping without gathering so much data.

The first way is to move information storage to the client (i.e., the user’s computer).  The idea is to keep a count of how many ads the user has seen from each campaign, and store those counts on the client’s computer rather than on the ad network’s computers.   A blog post by Jonathan Mayer and Arvind Narayanan gives more details.  The main advantage of this approach is that, because the information is stored on the user’s own computer, the user can always delete the information if they’re concerned about the privacy implications.   The main drawback is that the ad network would have to re-engineer how they choose which ads to place, because ad placement decisions are normally made on the ad network’s servers but the frequency information will now be stored elsewhere.

The second way to do frequency capping with less information collection is to store information on the ad network’s server, but to think carefully about how to minimize what is stored and how to reduce its linkability back to the user.  In this approach, the user still gets a unique pseudonym, stored in a cookie, but the ad network does not store a complete record of what the user did online.  Instead, the ad network just keeps a count of how many times each pseudonym has seen ads from each campaign.
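In code, this counts-only approach amounts to something like the following minimal sketch (all names here are illustrative, not any ad network's actual schema):

```python
from collections import defaultdict

class FrequencyStore:
    """Per-pseudonym counters: stores only how many times each pseudonym
    has seen each campaign, never which sites the ads appeared on."""

    def __init__(self):
        # counts[pseudonym][campaign_id] -> impressions seen so far
        self.counts = defaultdict(lambda: defaultdict(int))

    def record_impression(self, pseudonym, campaign_id):
        self.counts[pseudonym][campaign_id] += 1

    def under_cap(self, pseudonym, campaign_id, cap):
        return self.counts[pseudonym][campaign_id] < cap

    def end_campaign(self, campaign_id):
        # Once a campaign is over, its data can be deleted outright.
        for per_user in self.counts.values():
            per_user.pop(campaign_id, None)
```

Note that nothing in the store records a URL or site; the only retained fact is "this pseudonym has seen this campaign N times."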

For example, if you see an ad for the new Monster Mega Pizza, the ad network will remember that you (i.e., your pseudonym) have seen that ad, but it won’t remember which site you were reading when you saw that ad.  And for ads that aren’t frequency-capped, it won’t store anything at all.  Of course, the data about you seeing the Monster Mega Pizza ad campaign can be deleted once that campaign is over.

In practice, an ad network might want to collect and retain more information, in order to make other uses of that information later.   But users will probably want the ad network to be straightforward about what it is doing, and to admit that it is collecting more information than it needs for frequency capping, because it wants to make other uses of the data.

[Bonus content for geeks: The ad network can use crypto to store information with even better privacy properties. Rather than using the pseudonymous userID as a key for storing and retrieving the frequency counts, the ad network can hash the userID together with the advertiser's campaignID and use the resulting value as the storage key.  Then (assuming the userID is neither recorded nor guessable) the ad network won't be able to determine whether the person who saw the Monster Mega Pizza ad also saw some other ad from a different campaign.   This is easy to do and provides some extra protection for the user's privacy, while still allowing frequency capping.]
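The hashing trick above can be sketched in a few lines (the function and variable names are illustrative):

```python
import hashlib

def frequency_key(user_id: str, campaign_id: str) -> str:
    # Key storage by hash(userID, campaignID) rather than by the raw
    # userID, so that -- assuming the userID is neither recorded nor
    # guessable -- the stored keys for one user's different campaigns
    # cannot be linked to each other.
    return hashlib.sha256(f"{user_id}|{campaign_id}".encode()).hexdigest()

counts = {}  # storage key -> impressions seen

def record_impression(user_id, campaign_id):
    key = frequency_key(user_id, campaign_id)
    counts[key] = counts.get(key, 0) + 1
```

A deployment worried about guessable userIDs would likely use a keyed hash (an HMAC with a server-side secret) rather than the plain hash shown here; the plain hash mirrors the description in the paragraph above.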

[Thanks to the participants in the W3C Tracking Protection Working Group for suggesting the second approach, including the hashing trick.]

[Extra-credit homework for serious geeks: How can you use Bloom Filters to store this information more efficiently?  Assume it's acceptable to refuse to show an ad to a user even though that user hasn't yet hit the cap for that ad, as long as the probability that this happens is small.]
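One hedged sketch of an answer: a counting Bloom filter never undercounts, so reading the minimum of a key's slots gives an estimate that is at least the true count. That is exactly the one-sided error the problem allows: collisions can only cause an ad to be refused early, never shown beyond the cap. All names and sizes below are illustrative:

```python
import hashlib

class CountingBloomFilter:
    """Approximate per-key counters in fixed space. Collisions can
    inflate a key's estimate (refusing an ad early), but the estimate
    is never below the true count, so a cap is never exceeded."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.slots = [0] * size

    def _indexes(self, key):
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}|{key}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def increment(self, key):
        for idx in self._indexes(key):
            self.slots[idx] += 1

    def estimate(self, key):
        # The true count for `key` is at most the smallest of its slots.
        return min(self.slots[idx] for idx in self._indexes(key))
```

A key here could be the hashed userID/campaignID value from the bonus section, combining both privacy tricks.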

17 Responses to “Privacy by Design: Frequency Capping”

  1. Hey Ed,

    Great article. I agree that ad networks storing an extensive database of information without the user’s knowledge or consent is poor practice from a privacy standpoint. A few thoughts on your points:

    – It is actually challenging from a business standpoint for ad networks in an RTB environment to move userdata storage from server side to client side. The reason is that, when an ad network is bidding on an auction that is run by a 3rd party (such as an SSP or an exchange), the ad network’s systems do not have access to the client cookie. As a result, they must maintain userdata server-side if they want to take advantage of it in 3rd party-run RTB auctions.

    – I don’t think moving data from server-side to client-side necessarily gives users additional control over their data. Let’s say an ad network is following standard practice – they store userdata server-side, keyed by a unique user ID, and they store that user ID in the client cookie. If the user clears their cookies (and the ad network does not employ any shady methods to “revive” the cookie), then all information on that user is effectively lost. Although the ad network will still retain the data server-side, it will effectively be useless because it cannot be tied to the behavior of any browser.

    My general thought is that consumers by and large do not understand the concept of cookies or how cookies are used to track their activity. I think there should be a more transparent and easily controlled mechanism by which a consumer allows or disallows ad networks or other trackers to track their behavior.

    Jon

    • Jon,

      Your first point (“It is actually challenging …”) is one example of what I was referring to in the post as re-engineering. Proposals to move storage to the client side need either to find ways to re-engineer functionality of that sort, or to argue that the privacy benefits of client-side storage are worth the cost of that re-engineering.

      Your second point, regarding whether client-side storage actually increases user control, is an interesting one. I think there are two reasons why users might prefer client-side storage. First, if there is only client-side storage, then the user knows that deleted data can’t be “re-animated”, whereas if there is a client-side ID cookie coupled to server-side storage, then the user will worry that deleting the cookie does not prevent the remaining server-side data from being relinked to the user’s device. Second, client-side storage can more easily offer transparency to the user about what is stored. The data might be stored in ways that frustrate the user’s attempt to understand it, but straightforward data structures can be analyzed pretty easily by the user (with the help of tools), and services can choose to take steps to make their data more easily inspectable. (This is possible with server-side storage too, but it’s more complicated to provide it.)

  2. One of the great parts about using a Bloom filter for a privacy application like this is that it then becomes impossible to extract IDs out of the data structure. So, if you use a counting Bloom filter, and are prepared to lose a small percentage of accuracy because of collisions, then you can increase efficiency and privacy at the same time. It would be a challenge from the storage standpoint, though, because in the naive case you’d have to start out with a Bloom filter as large as you’ll ever need. But I believe there are some Bloom filter variants that are resizable. In any case, great to know that somebody thinks about the same technologies that we spend our days thinking about.

    • Jimmy,

      Bloom Filters do create some opportunities to protect privacy here, but the details are non-trivial. Given a Counting Bloom Filter that has been populated with data, and bearing in mind that different campaigns might have different frequency caps, how much might an analyst be able to recover about the ad-viewing history of the user, compared to a scenario where you used a simple counter for each campaign?

      Issues like this are the reason I made it an extra-credit problem….

  3. Prof. Felten,

    Very glad you’re covering this topic – it’s critical for the advertising industry to be able to frequency cap in a privacy-friendly way. Let me start by making sure we’re on the same page about the frequency capping functionality that advertisers actually need and use:

    1. Serve a campaign no more than once per day per user (this is the simplest case)
    2. Serve a campaign no more than X times per day per user (still relatively simple, but means user might see the ad X times in a row in the course of a few pages then never again that day)
    3. Serve a campaign no more than once per X hours (you could specify this with the previous variant, like, no more than 5 times a day with at least an hour between them)
    4. Serve a campaign no more than once per session (where a session ends after 20 minutes of inactivity)
    5. Serve a campaign no more than X times ever (for the lifetime of the cookie)
    – Serve a creative (a particular ad) no more than [1-5 above]
    – Serve an ad for the entire advertiser (ie Coke) no more than [1-5 above] regardless of the campaign or creative
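These rules could be checked along the following lines (a sketch; the cap names, field names, and 20-minute session window are illustrative, not AppNexus's actual logic):

```python
import time

SESSION_GAP_S = 20 * 60  # rule 4: a session ends after 20 minutes idle

def may_serve(counts, caps, now=None):
    """counts: {"session": n, "daily": n, "lifetime": n, "last_seen": ts}
    caps: any subset of {"per_day", "min_gap_s", "per_session", "lifetime"}.
    (A real system would also reset the session/daily counters; this
    sketch only evaluates the rules.)"""
    now = time.time() if now is None else now
    elapsed = now - counts["last_seen"]
    if "per_day" in caps and counts["daily"] >= caps["per_day"]:
        return False  # rules 1 and 2: at most X per day
    if "min_gap_s" in caps and elapsed < caps["min_gap_s"]:
        return False  # rule 3: minimum gap between impressions
    if ("per_session" in caps and elapsed < SESSION_GAP_S
            and counts["session"] >= caps["per_session"]):
        return False  # rule 4: cap within the current session
    if "lifetime" in caps and counts["lifetime"] >= caps["lifetime"]:
        return False  # rule 5: at most X ever (cookie lifetime)
    return True
```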

    At AppNexus, we use the second method you mention in your article: we store counts. Here’s a real-life example from my AppNexus profile:

    “frequency”:[["a",1291,0,0,1,1341210991],["c",12441,0,0,3,1341350729],["a",33475,0,0,9,1341220399]]

    The first array element says I saw an ad from advertiser 1291 0 times this session, 0 times today, and 1 time ever. The last time I saw an ad from advertiser 1291 was at timestamp 1341210991. Note that we don’t store any information about what site I saw the ad on.

    The information we do store lets us perform all five frequency capping functions. The first field lets us specify whether the frequency is for the advertiser, campaign, or creative. The second field tells us the id of this object. The session frequency count lets us determine #4, in combination with the timestamp so we know when we should reset the counter. The daily and lifetime frequency count give us #1, #2, and #5. The timestamp itself gives us #3 since we can check how long it’s been since we last served this creative, campaign, or advertiser. Note that this means we have to update three records every time we serve an ad – the creative, campaign, and the advertiser.
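For illustration, that record format can be parsed along these lines (a sketch; the field meanings are taken from the description above, and the dictionary keys are invented for readability):

```python
import json

# Brian's example record, as quoted above.
raw = ('[["a",1291,0,0,1,1341210991],'
       '["c",12441,0,0,3,1341350729],'
       '["a",33475,0,0,9,1341220399]]')

def parse_frequency(raw):
    records = {}
    for obj_type, obj_id, session, daily, lifetime, last_ts in json.loads(raw):
        # obj_type: "a" = advertiser, "c" = campaign (per the comment,
        # creatives get their own records too); three counters plus the
        # timestamp of the most recent impression.
        records[(obj_type, obj_id)] = {
            "session": session,
            "daily": daily,
            "lifetime": lifetime,
            "last_seen": last_ts,
        }
    return records
```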

    That explains the data we store. Now, how do we use it? Say we have 600,000 creatives live in our system and 100,000 campaigns (those are probably quite low, but you get the idea). When we go to choose an ad to serve, we have to check some basic rules for each campaign (does the advertiser want to serve on this site? does the advertiser want to serve to this geography?) and then see if the campaign’s frequency capping rules allow it to be served.

    To check the frequency cap, we look up the campaign ID in our frequency data array and evaluate rules 1-5 above. Then we do the same for the advertiser and the creative. This process could happen tens of thousands of times per ad served (depending on how many campaigns pass targeting for a particular ad call). With the full array in memory, we can do this quite fast (binary search, say, on an array averaging 200 elements).

    Now, let’s consider the campaign-user hash solution you suggest in your article. A simple approach would be to have a three-key index (user id, object type, object id) into a structure like {session frequency, daily frequency, lifetime frequency, timestamp}. On the face of it, this supports all of the use-cases above.

    However, there’s a hitch. User data is too big (~10TB) to be resident in memory on the targeting servers, so we need to store it on a separate server cluster. In the array model, we make a single request to the user data cluster for a user ID and get back the full frequency array. The RTT is around 10 milliseconds. In the hash model, I don’t have locality for a particular user, so I have to query for each campaign. Now that 10 milliseconds becomes 100 seconds or more! You could reduce this by batching requests and parallelizing the clusters, but you would need a 1000x performance improvement to make this feasible, and even so you’d scale with the number of campaigns that pass targeting instead of with the number of frequency records for the user, as in the array model. I don’t think the campaign-user hash is practical in a production environment at scale.

    On the extra credit, I think it works well for use case #1. You use a Bloom filter to record the existence of each user-object key. It can’t give you a false negative, which is critical, and a false positive is a small likelihood, as you mention. I don’t think it works for the more complex cases.

    Question for you: Given that AppNexus is following your recommendations on privacy-sensitive storage of frequency data, what can we do to help the FTC make sure that frequency caps are not thrown out with the privacy bathwater?

    Brian O’Kelley

    • Brian,

      Thanks for your comment. This is exactly the kind of conversation I was hoping to start with the original post. The FTC privacy report emphasized privacy by design, which is really about thinking carefully about how to reduce collection, retention, and use of consumers’ information, without compromising necessary functionality. How to do that will always depend on the kinds of factors you discuss in your comment.

      The technical issues around frequency capping have been under discussion in the W3C Do Not Track standards process. I expect that people in that group will have seen your posting and will take it into account as the discussions proceed.

  4. I recall that either WhenU or Claria/Gator kept their profiles on the client side, back in the days of adware. And some of the “deep packet” advertising folks had some interesting data minimization efforts. Of course, they were all driven by the motivation to have a privacy story, given other significant privacy concerns about their business models. But that being said, some useful work was done there to show how some of these efforts may be implementable. Also useful are some of the reports of the Europrise seal, where German companies like Wunderloop and Nugg.ad were approved to do behavioral ads subject to data minimization requirements around real cookie expirations, no maintaining of IP address logs, etc.

  5. Interesting. I think the client side approach will fail. Not because of the value prop, but because of the implementation. You’d be asking all the browser OEMs to figure out the UI and that’s going to lead to inconsistencies. The reason everyone loves server side solutions is that a server side solution is far easier to re-engineer than a client side one. Of course the real problem is the lack of context provided to 3rd parties by a DNT:1 setting. There’s no way that Brian’s frequency capping solution (which I think is very elegant) survives with DNT:1 – basically it’s off the table “unless” you figure out an exception policy.

    So if you really think about it – everything that we’re discussing is the result of a do not track policy that lacks “context”. I wonder when people will realize that privacy is not binary and requires a more adaptive solution.

    So here’s a bonus question for geeks… how would you change the DNT approach so you would not have to change anything on the server i.e. no issues with exceptions, no issues with frequency capping etc? Two hints – context and permission.

  6. Last week I shared my thoughts about your post with my peers here at the IAB, and was today urged to share with this larger audience.

    Your initial thoughts on moving information storage to the client cookie jar are incorrect. Client-side frequency capping doesn’t fail “because ad placement decisions are normally made on the ad network’s servers but the frequency information will now be stored elsewhere,” it fails because client-side storage of the frequency state of multiple campaigns has the potential of overloading the cookie header, and because updating client state can be blocked in 3rd party scenarios. Having the frequency state on the client (and therefore on the inbound HTTP request) can actually make frequency capping easier on the server side, because you don’t have to propagate this state across all the physical ad servers.

    With regards to the “second way”, tech may have changed, but in my experience profitable ad servers record the minimum required information in order to bill – recording something like the HTTP REFERER (i.e., what page the ad was delivered onto) can increase log record size several times over, significantly increasing hardware costs and data processing latency. That said, recording the publisher ID or campaign ID (i.e., what site/network is supposed to be delivering the ad) is standard practice, since you need to know who to pay. As Brian O’Kelley has indicated, minimizing the set of data stored is still sensible design for performance.

    Implied in Brian’s comment is that there’s a “frequency” record for every userID – the data structures for targeting in current systems use user pseudonyms as primary keys for targeting data, including frequency capping. Moving to a hash of userID/CampaignID for storing frequency capping multiplies the number of records in the data structure storing information by the number of campaigns (and possibly the number of advertisers), thereby incurring additional cost in storage and look up time, in addition to the (minor) hashing cost of inbound userID/CampaignID.

    I think the suggestion of using bloom filters for storing frequency capping assumes that there is an absolute maximum number of times a specific userID can be shown an ad – a strictly additive situation. However, as Brian pointed out, the capping happens at intervals less than the campaign duration – and the intervals are user specific. You could implement by recording whether UserID interacted with CampaignID during arbitrary intervals, but doing so would generate either a significant increase of items to store, or the additional complexity of maintaining time sequenced Bloom filters, and either implementation loses out on some granularity of timing.

    • Brendan,

      Thanks for your response.

      To be clear, my discussion of client-side storage was meant to refer to client-side storage mechanisms generally, not just cookies. So limitations on the number or size of cookies won’t rule out that option, at least for modern browsers.

      You say that client-side storage might be blocked. I’m not sure why that would be a problem. If a user wants to block information collection, they already have ways to do so, but it would be a shame if the only recourse for a user who wanted to control collection was to block ads entirely. It seems to me that it’s better to offer a more polite system that gives users control over collection without blocking ads entirely.

      I was a bit surprised by your assertion that companies “record the minimum required information in order to bill”. If this is true, industry reps should be able to make significant concessions in the Do Not Track discussions, as some group members have been insisting on a need to collect information for non-billing purposes.

      With regard to the technical issues that Brian raised, there is some interesting technical discussion to be had, and I’m glad Brian jumped in and started it. The key question is whether there is a win-win approach, in which data collection is reduced without impairing the ability to do frequency capping. That was the spirit I was trying to convey in my original post, including the extra-credit homework assignment on Bloom Filters.

  7. Good afternoon! I know of the following widely available, general purpose client-side storage options:

    HTTP Cookies
    LSOs (aka “Flash Cookies”, which aren’t accessible outside of Flash without JavaScript)
    HTTP Cache (creating user-specific JavaScript files to store user-specific information there)

    Storing user-specific data in the HTTP cache is less reliable than using HTTP cookies, is an “off-label” use, and the data is only available to the server immediately on cache-revalidate requests. LSOs are only accessible once the Flash object has loaded, and then only via JavaScript, a design which ensures that the data won’t be available until the JavaScript executes. Neither system guarantees that the user-specific data will be sent to the server on the 1st request like HTTP cookies do, thereby requiring multiple transactions to transmit the same amount of data, which is bad for performance across the board. Is there another type of widely adopted client-side storage that I missed, or are you proposing the creation of one specific for frequency capping?

    My comment with regards to blocking was specifically “updating client state can be blocked in 3rd party scenarios” – i.e., that if I were to set a cookie in a first party context initially, I may not be able to update that cookie later when my content is embedded in an iFrame on a different web site. If the cookie contains an identity, the frequency data can be updated server side. If the cookie contains frequency data, it can’t. I don’t know how you got from this technical assertion to “the only recourse … was to block ads entirely” – my goal was simply to indicate that a workable implementation of frequency capping should be able to update frequency counts every time, wherever they’re stored.

    Your surprise is noted – and I probably should have extended that to be “bill and retain advertisers”. Frequency capping is an expected feature – without it, advertisers may move to systems that support it. If the W3C Tracking Protection standard requires a complex implementation to support the subset of opt-out users, the advertising systems that choose not to implement the standard are rewarded with lower complexity implementations and a larger potential feature-set to attract advertisers.

  8. Brendan,

    My reference to client-side storage was meant to refer to a set of technologies, including HTML5 Local Storage. (For some technical background, see, e.g., Dive Into HTML5.)

  9. HTML5 Local Storage has the same limitations as Flash LSOs – the data therein isn’t accessible until the page has loaded, and it’s not available on the first HTTP request. Given Same-Origin rules around scripts, this would necessitate a second HTTP transaction from within an iFrame rendered to the ad server’s domain for feature parity, or would require just a second HTTP transaction and the storage of frequency capping data in the publisher’s domain HTML5 Local Storage for publisher-specific caps.

    In other words, using HTML5 Local Storage would definitely introduce latency into access to frequency capping data (best case would probably be in the range of 80-150ms), and would either require the use of iFrames (which we would like to see less of – it’s a separate topic), or would limit the ability of ad networks to frequency cap across multiple domains.
