Making Sure Things Work

Over the weekend, I pulled together some recommendations on how to protect privacy while working from home and possibly sharing a computer with one or more people. After writing the recommendations, I was curious whether, and how, the differences could be quantified. To get a sense of this, I tested four different scenarios:

  • De-Tuned Firefox
  • Tuned Firefox, no uBlock Origin
  • Tuned Firefox with uBlock Origin
  • Chrome, set to defaults

To run the test, I needed four sites that are crawling with trackers. Unfortunately, the web has no shortage of sites that are overrun with trackers.

For this test, I chose:

  • Weather dot Com
  • WebMD
  • HuffingtonPost
  • Breitbart

This test was pretty simple -- I visited the home page of each site, and scrolled to the footer. Then, I went to the next site, and repeated until I had visited all four sites. The order was the same for each test.

Web traffic was intercepted and observed using an intercepting proxy.

The results showed some clear differences.

  • De-Tuned Firefox - 157 calls to different domains
  • Tuned Firefox, no uBlock Origin - 42 calls to different domains
  • Tuned Firefox with uBlock Origin - 22 calls to different domains
  • Chrome, set to defaults - 192 calls to different domains
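For anyone who wants to reproduce counts like these, here is a minimal sketch of the tally, assuming the proxy's request log has been reduced to a plain list of URLs. The sample URLs are made up, and counting by hostname is an assumption on my part; counting by registered domain would give somewhat lower numbers.

```python
from urllib.parse import urlsplit


def distinct_hosts(urls):
    """Return the set of distinct hostnames contacted in a list of request URLs."""
    hosts = set()
    for url in urls:
        host = urlsplit(url).hostname
        if host:
            hosts.add(host.lower())
    return hosts


# Hypothetical sample; a real run would use the URLs logged by the proxy.
sample = [
    "https://news.example/",
    "https://news.example/style.css",
    "https://cdn.tracker-one.example/pixel.gif?id=1",
    "https://ads.tracker-two.example/sync",
]

print(len(distinct_hosts(sample)))
```

Running the same function over the log from each browser configuration gives one number per scenario.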

Not all of the third party domains called were explicitly about ad tracking, but it's worth noting that the sites were just as functional using Firefox with uBlock Origin -- which communicated with 22 different domains -- as they were when using default Chrome, which sent information to 192 different domains.

My takeaway from this: these limited, simple tests suggest that Chrome's defaults do little to nothing to minimize the number of calls to third party web sites, and protect users from tracking. Even a detuned version of Firefox -- where the defaults were adjusted to allow more trackers through -- was more effective than default Chrome. The steps outlined in my earlier post on browser hygiene -- and in particular, using uBlock Origin -- offer good protection from tracking.
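To put the differences in relative terms, the short calculation below expresses each Firefox result as a percentage reduction against the default Chrome count. The counts are the ones reported above; treating Chrome as the baseline is simply a convenient choice for this sketch.

```python
CHROME_DEFAULT = 192  # calls to different domains with default Chrome

firefox_results = {
    "De-Tuned Firefox": 157,
    "Tuned Firefox, no uBlock Origin": 42,
    "Tuned Firefox with uBlock Origin": 22,
}


def reduction_vs_baseline(count, baseline=CHROME_DEFAULT):
    """Percentage reduction in contacted domains relative to the baseline."""
    return round(100 * (baseline - count) / baseline, 1)


for name, count in firefox_results.items():
    print(f"{name}: {reduction_vs_baseline(count)}% fewer domains than default Chrome")
```

By this measure, the tuned profile with uBlock Origin contacted roughly 88.5% fewer domains than default Chrome, and even the de-tuned profile contacted about 18.2% fewer.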

The dataset generated from the tests is available on GitHub.

Browser Hygiene for Better Privacy - Think of it Like Washing Your Hands Online!

This post covers some of the basics of keeping the online components of your work life (or your school life) separate from your personal life. This split was good practice before Covid19, but now that we are all spending more time online -- for school, work, social interactions, shopping, news, entertainment, etc. -- keeping a split between our personal lives and our school/work lives is an important element of protecting our privacy.

The steps in this post won't block all tracking, but they will minimize risk and minimize exposure. The advice and steps laid out in this post are all available free of charge. In one place, I recommend a password manager that has a subscription fee, but I also include a free option.

This post does not cover choosing a VPN. VPNs are a key component of both privacy and security. My one main piece of advice with regard to VPNs is to NEVER use a free VPN, because they make money by exploiting their users. My second piece of advice regarding VPNs is to point people to https://thatoneprivacysite.net. The information there on choosing a VPN provides context about things to consider when using one.

The instructions in the post are split into three sections:

  • Setting up the Profile in Firefox
  • Configuring the Browser Settings
  • Additional Protections

Setting up the Profile in Firefox

Open Firefox. If you don't have Firefox installed on your computer, get it here.

1. Enter "about:profiles" in the address bar. Click "Create New Profile".

Start to create a new profile

2. Click "Next" to navigate past the informational dialog text.

Skip the chitchat

3. Give your new profile a distinctive name, and click "Finish".

Name your new profile

4. Find the profile in the list, and click "Launch profile in new browser"

Launch profile in new browser

Congratulations! You now have a clean and fresh profile! The next section covers how to set it up for increased privacy protection.
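As an alternative to clicking through the about:profiles page, Firefox also accepts command-line options for creating and launching profiles. The sketch below only builds and prints the commands; the profile name "work-school" is a placeholder, and the name or path of the Firefox binary is an assumption that varies by operating system.

```python
import shlex


def profile_commands(profile_name, firefox_binary="firefox"):
    """Build the command lines that create a profile and then launch it.

    firefox_binary is an assumption: on macOS or Windows the full path
    to the Firefox executable is usually needed instead.
    """
    create = [firefox_binary, "-CreateProfile", profile_name]
    launch = [firefox_binary, "-P", profile_name, "-no-remote"]
    return create, launch


create, launch = profile_commands("work-school")
print(shlex.join(create))
print(shlex.join(launch))
```

The printed commands can be pasted into a terminal, or passed to subprocess.run once the binary path has been confirmed.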

Configuring the Browser Settings

1. In the Address Bar, enter "about:preferences#privacy" - this will allow you to adjust some privacy settings. If a setting isn't mentioned, you can leave it at the default setting.

General privacy settings

2. At "Cookies and Site Data" - select "Delete cookies and site data when Firefox is closed". This will wipe out tracking cookies when you close the browser, but it will also wipe out your logins so you will need to login each time. This is less convenient, but on a shared computer it prevents someone who isn't you from accessing sites where you have logged in.

Cookies and site data

3. For "Logins and Passwords" - de-select "Ask to save logins and passwords for websites" and "Show alerts about passwords for breached websites".

Logins and Passwords

Later in this post, we'll cover getting a good password manager.

The alerts feature for breached websites is powered by a great website, Have I Been Pwned. I strongly recommend signing up for an account on this site at https://haveibeenpwned.com/

4. In the "History" settings, selecting "Never Remember History" can bring additional privacy benefits, but a lot of people like the benefit of having the browser remember their history. If you choose to have the browser remember your history, you should clear your browsing history weekly using the "Clear history" button.

Don't remember browser history. World history? Remember that.

5. In "Address Bar" - de-select all options. While these suggestions could all be processed locally, I recommend erring on the side of caution. In the future, I might test this option by monitoring network traffic but I haven't done that yet.

Suggestions

6. In "Permissions" - for "Location" click the "Settings" button, and then select the "Block all requests for location" checkbox. For the other options here, you can block or leave open at your discretion. Firefox generally does a decent job of alerting you when an app or site asks to access your camera or microphone (for example, when you want to join a web based videoconference, you will need to provide access to your camera and microphone).

Location - hard no.

7. Under "Firefox Data Collection and Use" - de-select all options.

Just say no to data collection. FFS. No.

8. In the address bar, enter "about:preferences#search". Choose "DuckDuckGo" as your default search engine.

default search engine

9. In "Search Suggestions" - deselect all options. This prevents keystrokes being sent to any search engine when you enter your search terms in the address bar.

Search suggestions

10. In the address bar, enter "about:preferences#home". For "Homepage and new windows" and "New tabs" select "Blank page" as an option.

Homepage and new windows

Your profile is now set with some additional privacy protection.
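Several of the settings above can also be written to a user.js file in the profile folder, which is handy when setting up more than one computer. The sketch below renders such a file. The preference names are my best understanding of how some of the settings pages map to about:config entries, and they are assumptions that can change between Firefox versions, so check each one in about:config before relying on it. Settings I could not map with confidence (such as the cookie deletion option) are left out.

```python
import json

# Assumed mapping from the steps above to about:config preference names.
# Verify each name in about:config; names can change between Firefox versions.
PREFS = {
    "signon.rememberSignons": False,                    # step 3: don't offer to save logins
    "browser.urlbar.suggest.history": False,            # step 5: address bar suggestions
    "browser.urlbar.suggest.bookmark": False,
    "browser.urlbar.suggest.openpage": False,
    "permissions.default.geo": 2,                       # step 6: block location requests
    "datareporting.healthreport.uploadEnabled": False,  # step 7: data collection
    "browser.search.suggest.enabled": False,            # step 9: search suggestions
    "browser.startup.homepage": "about:blank",          # step 10: blank home page
    "browser.newtabpage.enabled": False,                # step 10: blank new tabs
}


def render_user_js(prefs):
    """Render preferences as user.js lines, one user_pref(...) call per line."""
    lines = []
    for name, value in prefs.items():
        # json.dumps produces JavaScript-compatible literals: false, 2, "text"
        lines.append(f'user_pref("{name}", {json.dumps(value)});')
    return "\n".join(lines) + "\n"


print(render_user_js(PREFS))
```

Saving the output as user.js in the new profile's folder applies the values each time Firefox starts with that profile.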

Additional Protections

1. Now that the browser is set up to run cleanly, we want to add an extra layer of protection against tracking. While a range of options exists, uBlock Origin provides a good balance between protection and usability. You can get the Firefox Extension here: https://addons.mozilla.org/en-US/firefox/addon/ublock-origin/

If you have chosen the option to "Never Remember History" you will need to select the "Allow to run in Private Windows" option to complete the install.

Allow to run in Private Windows

You will know that the install is successful when you see the uBlock Origin icon in the top right of your browser window.

uBlock Origin logo - installed

2. The final step we will include here is getting a password manager. If you want a web-based password manager you can use across your computer and phone, 1Password is a good option. They offer individual and family plans, and their subscription rate is reasonable. https://1password.com

If you only need a password manager that works on your computer, KeePassXC is a great option. It's open source, free, mature, and stable. Get it here: https://keepassxc.org/

Conclusion

With these steps in place -- a distinct browser profile for work and school, some tuned settings in the browser to increase protection, and some ad blocking paired with a password manager -- you have made some real improvements in safeguarding your privacy. The first few days you use this setup, it might feel awkward. That's okay - it's a new way of working, and change generally feels awkward.

Stick with it. As the steps become familiar, this way of working will become second nature -- and that's a skill you will need after the pandemic is over. It's not like adtech and the other companies that track us are going away anytime soon.

Maybe It Isn't a Great Idea to Outsource Public Education to Private Companies

As the rapid switch to online learning has made abundantly clear, K12 schools in the United States need learning management systems, student information systems, and videoconferencing to function. It was pretty obvious before, but school during Covid19 has brought even more focus on the infrastructure that makes school possible. Learning management systems are the mediator between students and the work they are assigned; video conferencing (used well) allows students to connect with one another, and with their teachers, when a concept is better explained as a group, or when some community bonding is needed to maintain cohesion within the course.

How many public schools in the US rely on proprietary (closed source) software supplied by private companies to run their student information system and learning management system?

How many schools use videoconferencing solutions provided by a for profit vendor?

Let's be clear about this: the glue holding the required infrastructure of our public education system together is owned by private companies. The learning management systems, the student information systems, the videoconferencing tools -- the most widely used systems are owned by private companies, and these private companies are paid with public dollars.

The privacy issues that plague K12 education exist for many reasons, but the central role played by for profit companies collecting data from K12 students as these students engage in their legally required public education is near the top of the list.

The observations in this post aren't new, but against the backdrop of a pandemic we should take careful note: the public education infrastructure is largely run by private companies with an obligation to shareholders first, school customers second, and students somewhere after that. It doesn't need to be this way; education shouldn't need to go hat in hand to private companies to have basic needs met, but here we are. Once we are through the worst of Covid19 (realistically, when we have a working vaccine that is widely accessible) we should re-evaluate a lot of assumptions that have shaped our educational system. Hopefully, our habit of outsourcing public education to private companies will be among the many items that get improved.

What Shows Up In Facebook's Ad Library Anyways?

The Facebook Ad Library is part of Facebook's effort at increasing their transparency around political ads.

This post is going to ignore the myriad usability issues with the Ad Library, and focus on a more fundamental, but less visible question: what exactly can we see in the Ad Library anyways?

To start, we'll look at this overview page about the Ad Library. The second paragraph of this descriptive page contains this fairly specific description of what is covered in the Facebook Ads Archive:

The Ad Library contains all active ads running across our products. Transparency is a priority for us to help prevent interference in elections, so the Ad Library offers additional information about ads about social issues, elections or politics, including spend, reach and funding entities. These ads are visible whether they're active or inactive and will be stored in the Ad Library for seven years.

This description makes it clear that all active ads are in the Ad Library, and that "additional information" is available for ads "about social issues, elections or politics". The language in this description -- "These ads are visible whether they're active or inactive" -- is less than clear, primarily because of the unclear reference of "these."

The Facebook page describing the Ads Archive also makes it clear that keyword search only works on ads that have been categorized as about social issues, elections, or politics.

Ads that aren't about social issues, elections or politics will only be discoverable through visiting a Page in the Ad Library and will not surface in keyword searches.

We will return to the subject of keyword searches later in this post.

Over the weekend, Rob Leathern -- a Director of Product at Facebook -- responded to questions from two journalists, Brandy Zadrozny and Shoshana Wodinsky. The conversation was originally about the overlaps between boosted posts and ads, and in the ensuing conversation, Leathern added some details about how the Ad Library works, and about some things that the Ad Library omits.

In response to several questions, Leathern provided a clarification that should be added to the About the Ad Library page. In this Twitter conversation, Leathern appears to be very clear that, while all active ads are present in the Ad Library, only ads that are explicitly tagged as about social issues, elections, or politics will be stored in the archive after the ads are no longer active.

Ad Library

This clarification, while informative, raises the possibility of some clear and obvious loopholes, which prompted me to ask for some additional clarification -- because based on Leathern's description, it seems incredibly simple to avoid the additional review that is directed at political ads.

clarification

At this point in the post, I want to take a step back and highlight that I am sincerely appreciative of Rob Leathern's willingness to engage at all. This conversation took place on a weekend, and he is under no obligation (that I know of) to engage with anyone on Twitter about anything, including political ads. I see his willingness to answer questions as an act of good faith, and I appreciate his time and openness.

With that said, the current functionality of the Ad Library ensures that bad actors can operate with relative freedom. Leathern describes this as a "'tree falls in the woods' variety: if nobody knows it is a political ad, obviously it can’t be labeled and archived right?"

However, anything can be labelled and archived. Bad actors engaging in disinformation are not looking to work within the system, and they won't be kind enough to willingly label their posts accurately. This is where even a basic feature like keyword search across all active ads would be helpful - but as noted above, keyword search only works on ads that are labelled as about social issues, elections or politics.

Because unlabelled ads disappear from the archive when they stop running, this means that political posts from bad actors disappear from public view almost immediately. Additionally, because unlabelled posts are invisible to keyword search, the process of finding them in real time is essentially blind luck: either a person is served an ad when they are logged in, or they happen to stumble over a page promoting political posts.
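The visibility rules described so far can be written down as a small toy model. This is a sketch of the behavior as described in this post, not Facebook's implementation: the Ad class, the function names, the sample ads, and the approximation of seven years as 7 x 365 days are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

ARCHIVE_WINDOW = timedelta(days=7 * 365)  # rough stand-in for "seven years"


@dataclass
class Ad:
    text: str
    labelled_political: bool
    start: date
    end: Optional[date] = None  # None means the ad is still running


def is_active(ad, today):
    return ad.start <= today and (ad.end is None or today <= ad.end)


def visible_in_library(ad, today):
    """Active ads are always visible; inactive ads only if labelled political."""
    if is_active(ad, today):
        return True
    if today < ad.start:
        return False  # not yet running
    return ad.labelled_political and today - ad.end <= ARCHIVE_WINDOW


def keyword_search(ads, keyword, today):
    """Keyword search only covers ads labelled as political."""
    needle = keyword.lower()
    return [
        ad for ad in ads
        if ad.labelled_political
        and visible_in_library(ad, today)
        and needle in ad.text.lower()
    ]


# Two hypothetical ads that ran for the same three days.
unlabelled = Ad("Your polling place has moved", False, date(2020, 10, 28), date(2020, 10, 30))
labelled = Ad("Your polling place has moved", True, date(2020, 10, 28), date(2020, 10, 30))

print(visible_in_library(unlabelled, date(2020, 11, 5)))  # False: gone once it stops running
print(visible_in_library(labelled, date(2020, 11, 5)))    # True: archived
print(len(keyword_search([unlabelled], "polling", date(2020, 10, 29))))  # 0: active but unsearchable
```

In this model the only thing separating the two ads is the label, which the advertiser controls.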

At this point, it's not clear (to me, anyways) what percentage of past ads are available in the Ad Library. However, based on these descriptions, it's highly likely that many successful misinformation or disinformation campaigns are completely hidden from public view because Facebook is making an intentional choice to drop ads from view immediately after they stop running. A bad actor could minimize scrutiny simply by running ads for short durations. For operations focused on vote suppression, small numbers of tightly focused ads (content, demographic makeup, and geographic region) running for brief periods could possibly be both devastatingly effective, and largely invisible in the Ad Library. It's not like the dates of the US Elections are secret; a nation state actor or a political operative would have no problems creating dummy pages years in advance to use when needed.

Facebook has internal teams dedicated to fighting misinformation, and these teams also do some work with outside experts, and what I am describing is almost certainly not news to anyone doing misinformation or disinformation work inside or outside Facebook. However, this work is largely invisible to the vast majority of people outside Facebook. Facebook could increase transparency, and improve the usefulness of the Ad Library by taking the following steps:

0. Continue to archive all political ads for 7 years.
1. Expand the archive to include all ads in the Ad Library for somewhere between 6-12 months after they have stopped running.
2. Extend keyword search to all ads in the Ad Library.
3. Allow retroactive tagging of ads (i.e., an ad can be flagged as a political ad even after it has run).
4. Publish a rough percentage of the number of political ads, social issue ads, and election ads relative to the overall number of ads.

There are myriad other usability issues with the Ad Library, but steps 0-4 listed above would at least provide consistent and comprehensible results for external researchers looking to understand misinformation and disinformation within Facebook, Instagram, and Messenger.

Update on "Personal Email, School-Required Software, and Ad Tracking"

I just re-ran the scan that, earlier this week, found what appeared to be advertising-related tracking in Canvas when a student logged in to Canvas after logging in to a personal GMail account.

The latest round of tests showed very different behavior: the tracking that was observed in the earlier tests is not present in the more recent tests. This change appears to have happened since I put out my original blog post approximately 36 hours ago. The technical details are in my original writeup (linked above), but the short version:

  • In the original scan, after logging into Canvas, there were two subdomains connected via redirects: "google.com/ads" and "stats.g.doubleclick.net". Calls to these subdomains appeared to map cookie IDs set for advertising to Canvas's Google Analytics ID.
  • In the original scan, after logging into Canvas, these subdomains were called multiple times (at least three times each over approximately 90 seconds of browsing).
  • In the most recent scan, after logging into Canvas, using an identical script to the original scan, these subdomains and the related cookie IDs are not called at all.
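The comparison between the two scans boils down to counting calls to the two destinations named above in each scan's request log. A minimal sketch of that check follows, assuming each scan is available as a list of request URLs. The sample URLs, including their paths, are illustrative rather than copied from the scan data.

```python
from urllib.parse import urlsplit

WATCHED = ("google.com/ads", "stats.g.doubleclick.net")


def count_watched_calls(urls, watched=WATCHED):
    """Count requests whose host (plus path) falls under each watched destination."""
    counts = {w: 0 for w in watched}
    for url in urls:
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        target = host + parts.path
        for w in watched:
            if target == w or target.startswith(w + "/"):
                counts[w] += 1
    return counts


# Illustrative stand-ins for the original scan and the re-run.
original_scan = [
    "https://stats.g.doubleclick.net/r/collect?v=1",
    "https://www.google.com/ads/ga-audiences?v=1",
] * 3
rerun_scan = ["https://school.example/courses/1/assignments"]

print(count_watched_calls(original_scan))
print(count_watched_calls(rerun_scan))
```

A count of zero for both destinations in the re-run, using the same browsing script, is the change described in the bullets above.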

Fixed?

Viewed through a privacy lens, the removal of the cookie mapping is a good thing. It's an interesting shift, and raises a few questions and possibilities. I will attempt to include as many of these as possible, even options that are fairly unlikely.

  1. the fix for the issue I flagged in my post was already in the development pipeline and was deployed yesterday right on schedule;
  2. the ID mapping was part of a larger strategic plan and was removed intentionally;
  3. the ID mapping was in place as a result of human error, and this was addressed;
  4. the issue was related to how Google deploys Analytics, and Google made a change on their end completely unrelated to anything I observed;
  5. my original tests reported a bug or some other aberration that was subsequently fixed;
  6. ???

In my opinion -- based both on past experience with issues like this, and just a gut feeling (which for all obvious reasons, doesn't mean much) -- the third option (human error) feels most likely.

Regardless of the reason, I would strongly advise Instructure to provide a clear, transparent, and complete breakdown of what exactly happened here. There are a range of plausible and reasonable explanations -- but students and families that have their information entrusted to Instructure deserve a clear, transparent, and complete explanation.

Taking a step back, this is an issue that goes beyond Instructure. While Instructure had the bad luck to be the vendor included in this scan, we need to look long and hard at the reliance the edtech industry places on Google Analytics.

Analytics data are tracking data, and can easily be repurposed to support profiling and advertising. Google Analytics is increasingly transparent about this, but we shouldn't pretend that analytics from other services can't be used in similar ways. Google describes the relationship very clearly:

When you link your Google Analytics account to your Google Ads account, you can:

  • view Google Ads click and cost data alongside your site engagement data in Google Analytics;
  • create remarketing lists in Analytics to use in Google Ads campaigns;
  • import Analytics goals and transactions into Google Ads as conversions; and
  • view Analytics site engagement data in Google Ads.

The distinctions made between educational data/student data and consumer data are often contrived, and the protections offered over "educational" data are fragile. Instead of thinking about "student data," we would be better off thinking about data that are collected in an educational setting -- and we would be even better off with real privacy protections that protected the rights of individuals regardless of where the data were collected.

Personal Email, School-Required Software, and Ad Tracking

UPDATE December 21, 2019: After I put this post out, I re-ran the scan as part of routine follow up. The cookie mapping that was observed in the original scan and documented in this post is no longer present. It's not clear how or why this shift occurred, but at some point between the original scan, publishing this writeup, and a new scan completed after this writeup was published, the tracking behavior observed within Canvas has changed. More details are available here. END UPDATE.

Recently, a friend reached out to me with some questions about ad tracking, and the potential for ad tracking that may or may not occur when a learner is using a Learning Management System (or LMS) provided by a school. LMSs are often required by schools, colleges, and universities. LMSs hold a unique spot in student learning, effectively positioned between students, faculty, and the work both need to do to succeed and progress.

With the central placement of LMSs in mind, we wanted to look at a common use case for students required to use an LMS as part of their daily school experience. In particular, we wanted to look at the potential for third party tracking when students do a range of pretty normal things: check their personal email, search and find information, watch a video, and check an assignment for school. The tasks in this scan take a person about five to seven minutes to complete.

The account used for testing is from a real student above the age of 13 in a K12 setting in the United States. The LMS accessed in the test is Canvas from Instructure, and the LMS is required for use in the school setting. The full testing scenario, additional details on the testing process, and screenshots documenting the results, are all available below.

Summary and Overview

The scan described in this post focuses on one question: if a high school student has a personal GMail account and is required to use a school provided LMS with a school provided email, what ad tracking could they be exposed to via regular web browsing?

In this scan, we observed tracking cookies set on a person's browser almost immediately after logging into their consumer GMail account. These tracking cookies were used to track the person as they searched on Google and YouTube, and as they browsed a popular site focused on providing medical information. Because the GMail account used for the scan is a consumer GMail account, the observed tracking is not unexpected.

However, when the student logged into Canvas, the LMS provided by their school, using their school-provided email address (which is not a GSuite account), we also observed the same ad tracking cookies getting synced to the LMS's Google Analytics tracking ID. This synchronization clearly occurred when the student was logged into the LMS.

This tracking activity raises several questions, but in this summary we will limit the scope to three:

  1. Why is a Google Analytics ID being mapped to tracking cookies that are tied to an individual identity and set in an ad tracking context?
  2. Why is the LMS -- in this example, Canvas -- using Analytics that potentially exposes learners to ad tracking?

These two questions lead into the third question, which will be the subject of follow up scans: given the large number of educational sites that also use Google Analytics, can similar mapping of Google Analytics IDs to adtech cookie IDs be observed on other educational sites?

The analysis of the scan is broken into multiple sections, and each section has a "Breakpoint" that summarizes the report.

  • Testing Scenario: The steps used in this scan to allow anyone to replicate this work.
  • Testing Process: The process used to set up for the scan.
  • Results: The full results of the scan.
  • Breakpoint 1: A summary of the process that sets the tracking cookies after a person logs in to a consumer GMail account.
  • Breakpoint 2: Search activity on Google.
  • Breakpoint 3: Ad tracking on the Mayo Clinic site.
  • Breakpoint 4: Search activity on YouTube.
  • Breakpoint 5: Mapping of Instructure's Google Analytics IDs to ad tracking IDs.
  • Additional Scans: Follow up work indicated by this scan.
  • Conclusions: Takeaways and observations from the scan.

Testing Scenario

The scan was run using a real GMail account, and a real school email account provisioned by a public K12 school district in the United States. The owner of both accounts is over the age of 13. The school email account was not a GSuite EDU account. The LMS used to run this test was Canvas from Instructure. The testing scan used these steps:

A. Consumer Google Account

  1. Log in at google.com
  2. Go to email
  3. Read an email
  4. Return to google.com
  5. Search for "runny nose"

B. Medical Information

  1. View the top hit from Mayo Clinic or WebMD

C. YouTube

  1. Go to YouTube.com
  2. Search for "runny nose"
  3. View the top hit for 90 seconds
  4. Watch one of the top recommended videos for 90 seconds.

D. School-supplied LMS in K12

  1. Go to Canvas login page and log in using a school-provided email address
  2. Navigate course materials (approximately 10 clicks to access assignments and notes)
  3. Return to student dashboard
  4. Log out of Canvas

Testing Process

The testing used a clean browser with all cookies, cache, browsing history, and offline content deleted prior to beginning the scan. The default settings on the GMail account used had not been modified or altered.

Web traffic was observed using OWASP ZAP, an intercepting proxy.
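For readers replicating this work, one convenient way to analyze captured traffic is to export it in HAR (HTTP Archive) format and pull out the request URLs. The sketch below assumes a HAR export is available; the HAR text shown is a tiny hand-written stand-in, not output from the actual scan.

```python
import json


def request_urls_from_har(har_text):
    """Return the request URLs, in order, from a HAR document given as text."""
    har = json.loads(har_text)
    return [entry["request"]["url"] for entry in har["log"]["entries"]]


# Hand-written stand-in for an exported HAR file.
sample_har = json.dumps({
    "log": {
        "version": "1.2",
        "entries": [
            {"request": {"method": "GET", "url": "https://mail.example/inbox"}},
            {"request": {"method": "GET", "url": "https://ads.example/sync?id=1"}},
        ],
    }
})

for url in request_urls_from_har(sample_har):
    print(url)
```

The resulting list can then be filtered by hostname to focus on a single company's traffic.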

Results

In summarizing the results, we will focus on tracking that happens related to Google, and while logged in to Canvas. This analysis does not get into the tracking that Canvas does, or the tracking and data access permitted by Canvas via Canvas's APIs. For a good analysis of the tracking and access that Canvas allows via their APIs, read Kin Lane's breakdown of the data elements supported by Canvas's public APIs.

This post looks at one specific question: if a person is both browsing the web and using their school-provided LMS, what could tracking look like? The results described here provide a high level summary of the full scan; for reasons of focus and brevity, we only cover observed tracking from Google. Other entities that appear in this scan also get data, but Google gets data throughout the testing script.

In the scan, multiple services set multiple IDs. The analysis in this post highlights two IDs set by Google; these two IDs merit a higher level of attention because they are called across multiple sites, are mapped to one another, and are mapped to a separate Google Analytics ID connected to Canvas. In the scan, mapping Google Analytics IDs to IDs that appear to be connected to ad tech happens on both sites that use Google Analytics - the Mayo Clinic site, and the Canvas site.

To protect the privacy of the account used to run this scan, we obscure the IDs when we show the screenshots. The first ID will be marked by this screen:

Screen for Tracker 1

The second ID is marked by this screen:

Screen for Tracker 2

For privacy reasons, I also obscure the referrer URL and the user-agent string. The referrer URL shows the domain that was scanned, which in turn would expose the specific Canvas instance, which would compromise the privacy of the account used to run the scan. The user-agent string provides technical information about the computer running the scan, including details about the web browser, version, and operating system. This information is the foundation of a device fingerprint, which can be used to identify an individual.

Step A. Consumer Google Account

Our scan begins with a person logging in to a personal GMail account.

Almost immediately after logging into GMail, the two tracking cookies are set. These cookies are set sequentially, and are mapped to one another immediately.

A call to "adservice.google.com" sets the first cookie. This initial request both sets a cookie (indicated by the value screened by "Tracker 1") and redirects to a second subdomain (googleads.g.doubleclick.net) controlled by Google:

Initial GET request

Screenshot 1

And this is the response that sets the cookie:

Response and set cookie

Screenshot 2

In the response shown above, three things can be observed/noted:

  1. the initial request returns a 302 redirect that calls a new URL;
  2. the location of the URL is specified in the "Location" line, highlighted in yellow;
  3. the tracker value screened by "Tracker 1" is set via the "Set-Cookie" directive.
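The three observations above can be read straight out of the raw response headers. Here is a minimal sketch that does so for a made-up response with the same shape as the one in the screenshot. The redirect path and the cookie value are placeholders; only the host names and the cookie name come from the scan as described in this post.

```python
# Made-up response with the same shape as Screenshot 2.
RAW_RESPONSE = (
    "HTTP/1.1 302 Found\r\n"
    "Location: https://googleads.g.doubleclick.net/placeholder-path\r\n"
    "Set-Cookie: ANID=TRACKER_1_PLACEHOLDER; Domain=.google.com; Path=/\r\n"
    "\r\n"
)


def summarize_response(raw):
    """Return (status code, redirect target, {cookie name: value}) from raw headers."""
    head = raw.split("\r\n\r\n", 1)[0]
    lines = head.split("\r\n")
    status = int(lines[0].split()[1])
    location = None
    cookies = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        name = name.strip().lower()
        value = value.strip()
        if name == "location":
            location = value
        elif name == "set-cookie":
            # The first "name=value" pair is the cookie; the rest are attributes.
            pair = value.split(";", 1)[0]
            key, _, val = pair.partition("=")
            cookies[key.strip()] = val.strip()
    return status, location, cookies


status, location, cookies = summarize_response(RAW_RESPONSE)
print(status, location, cookies)
```

A single response both sets the first identifier and tells the browser where to go next, which is what makes the sequential cookie setting possible.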

The next event tracked in the scan is the GET request to the URL (in the googleads.g.doubleclick.net subdomain) indicated in Screenshot 2.

Get request for Doubleclick

Screenshot 3

The screenshot below shows the response, including the directive to set the second tracking cookie (marked at "Set-Cookie").

Set Doubleclick cookie

Screenshot 4

At this point in the scan, the two cookies (marked by the "Tracker 1" and "Tracker 2" screens) that will be called repeatedly across all sites visited have been set. As shown in the screenshots, these cookies are mapped to one another from the outset. These two cookies are set after a person logs into a GMail account, so they can be tied to a person's actual identity.

As we will observe in this scan, these cookies are accessed repeatedly across multiple web sites, and connected to a range of different activities and behaviors.


Breakpoint 1: Two tracking cookies have been set. The specific responses that set the cookies are shown in Screenshots 2 and 4. The cookies are named "ANID" and "IDE", and it's important to note that they are almost certainly synchronized with one another via the 302 redirect used to set both values sequentially. When the first cookie value is set, the response header specifies the exact call that sets the second cookie value. In practical terms, this means that Google and Doubleclick both "know" that Tracker 1 and Tracker 2 correspond to the same person. Moreover, because these cookies are set after a person logs into their personal GMail account, these values are directly tied to a person's identity.

Google provides some partial documentation on the cookies they set and access:

We also use one or more cookies for advertising we serve across the web. One of the main advertising cookies on non-Google sites is named ‘IDE‘ and is stored in browsers under the domain doubleclick.net. Another is stored in google.com and is called ANID

As shown above in Screenshot 2, the ANID value (marked by Tracker 1) is accessible from within .google.com. As shown above in Screenshot 4, the IDE value (marked by Tracker 2) is accessible from within .doubleclick.net.


Search on Google

After reading the email, we returned to google.com to do a search for "runny nose." After all, it is the season for colds.

One thing to note for any search functionality that returns suggestions while you type: this functionality doubles as a key logger. For example, when searching for "runny nose" we can observe every keystroke being sent to Google in real time.

Search autocomplete

Screenshot 5

As shown in the above screenshot, every keystroke entered while searching is tied to the first tracking cookie documented in our scan. The text entered in the search box is highlighted in yellow, and we can observe each new keystroke being sent to Google, with the get request mapped to the cookie ID set in Screenshot 2.


Breakpoint 2: Search activity on google.com is (obviously) managed by Google. The full search activity, including individual keystrokes, is tracked and tied to Tracker 1.
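
A quick way to spot this per-keystroke behavior in a capture is to pull the search text out of each request and see whether it grows (or shrinks) one character at a time. The sketch below is only an illustration: the URLs use a placeholder host, and the query parameter name `q` is an assumption that would need to be adjusted to match whatever the captured requests actually use.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical autocomplete requests, in the order they were captured.
captured_urls = [
    "https://suggest.example/complete?q=r",
    "https://suggest.example/complete?q=ru",
    "https://suggest.example/complete?q=run",
    "https://suggest.example/complete?q=runn",
    "https://suggest.example/complete?q=runny",
    "https://suggest.example/complete?q=runny+",
    "https://suggest.example/complete?q=runny+n",
]


def typed_prefixes(urls, param="q"):
    """Return the value of the search parameter from each request, in order."""
    prefixes = []
    for url in urls:
        query = urlsplit(url).query
        values = parse_qs(query, keep_blank_values=True).get(param)
        if values:
            prefixes.append(values[0])
    return prefixes


def looks_like_keystroke_stream(prefixes):
    """True when each request differs from the previous one by a single character."""
    if len(prefixes) < 2:
        return False
    for previous, current in zip(prefixes, prefixes[1:]):
        one_char_apart = abs(len(current) - len(previous)) == 1
        same_stem = current.startswith(previous) or previous.startswith(current)
        if not (one_char_apart and same_stem):
            return False
    return True


prefixes = typed_prefixes(captured_urls)
print(prefixes)
print(looks_like_keystroke_stream(prefixes))
```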


Step B. Medical Information

The search for information about a runny nose leads to a page on the Mayo Clinic web site. Visiting this page kicks off some additional tracking and advertising-related behavior.

First, we see the Google Analytics ID for the Mayo Clinic site mapped to the second tracking cookie ID. The Google Analytics ID for the Mayo Clinic site, along with the referrer URL, are both highlighted in yellow.

Mayo Clinic Analytics mapping

Screenshot 6

Then, we can observe what appears to be additional adtech and tracking-related behavior connected to this same tracking cookie ID.

Mayo Clinic ad tracking behavior

Screenshot 7

As we can see in the above screenshot, the referrer URL is from the specific page on the Mayo Clinic web site. As noted above, the cookie IDs are mapped to a specific identity known to Google. Thus, Google knows when the account used for this scan searched for a specific piece of medical information, and accessed a web site about it. Because these tracking cookies were set when a person logged into GMail, this activity can be directly tied to a specific person.


Breakpoint 3: when a person moves off a Google property, the tracking switches to Tracker 2, which can be read by Doubleclick. Screenshot 6 shows Tracker 2 being mapped to the Google Analytics ID of Mayo Clinic. Screenshot 7 shows additional ad related behavior connected to Tracker 2. In this section, we can observe two additional subdomains; stats.g.doubleclick.net (often connected to Analytics) and ad.doubleclick.net (generally connected to ads). It is not clear why the Tracker 2 value, which was clearly set in an advertising/tracking context, needs to be mapped to a Google Analytics ID.


Step C. YouTube

After visiting the Mayo Clinic web site, the scan continued on YouTube. Here, we searched for a video about a "runny nose" and watched the video.

As noted above when searching using Google, YouTube search also functions as a key logger, and ties the results to a cookie ID that is directly connected to a person's real identity.

Mapping cookies in YouTube

Screenshot 9

Screenshot 9 shows the "ru" of the eventual search query "runny nose". As shown in Screenshot 5 related to searching on Google, a request is sent for every keystroke, including spaces and deletions.


Breakpoint 4: Search activity within YouTube is managed by Google. As with search on google.com, the full search activity, including individual keystrokes, is tracked and tied to Tracker 1.


Step D. School-supplied LMS

After searching for and watching a video about a runny nose, the scan proceeded to log in to a K12 instance of Canvas.

For this scan, the person logged into the LMS with a school-provided email account. The school-provided email account was not provisioned from a GSuite for EDU domain. The email address was from a domain belonging to a K12 school district, and was connected to a student account.

After the person logs into Canvas, both cookie IDs are mapped to Instructure's Google Analytics ID. The mapping occurs via 302 redirects, with the Analytics ID contained in URL calls that include the Cookie IDs in the request headers. The process is documented in the screenshots below, and is similar to the mapping that occurred while browsing the Mayo Clinic web site.

The referring URL is clearly a course within the LMS. The Google Analytics ID (UA-9138420) that belongs to Canvas/Instructure is highlighted in yellow.

The first call is to stats.g.doubleclick.net. As you can see in the screenshot below, the request includes the Google Analytics ID and the tracking cookie in the request header. The response returns a redirect that also includes the Google Analytics ID.

First call to map trackers in Canvas

Screenshot 10

As shown in Screenshot 10, the URL specified by the redirect points to google.com/ads. The redirect also contains the Google Analytics ID for Instructure.

Mapping trackers in Canvas

Screenshot 11

As described and shown in Screenshots 10 and 11, these two calls map both cookie IDs to Instructure's Google Analytics ID. To emphasize, both of the cookie IDs mapped to Instructure's Google Analytics ID are also directly connected to a personal GMail account that is tied to a person's identity.


Breakpoint 5: While logged into a school-provided (and required) LMS, both Tracker 1 and Tracker 2 are mapped to the Google Analytics ID of the LMS. This means that the same advertising IDs that are tied to a specific student's identity, tied to browsing history on a site with medical information, and tied to search history on Google and YouTube, are also tied to the Google Analytics ID of an EdTech vendor. In practical terms, this means that Google could theoretically incorporate general LMS usage data (time on site, time on page, pages visited, etc) into their profiles of learners and/or educators.
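
The mapping described in this step can be pulled out of a capture with a short script. This is a sketch under stated assumptions: the request records, paths, the `tid` parameter name, and cookie values are invented for illustration, and the Analytics ID is written exactly as it appears in this writeup. The idea is simply to find every request whose URL carries a Google Analytics ID and record which cookies were sent along with it.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Matches classic Google Analytics property IDs such as UA-1234567 or UA-1234567-1.
UA_ID = re.compile(r"UA-\d+(?:-\d+)?")

# Hypothetical captured requests; only the hostnames, cookie names, and the
# Analytics ID come from the scan. Everything else is illustrative.
captured_requests = [
    {
        "url": "https://stats.g.doubleclick.net/example-collect?tid=UA-9138420&x=1",
        "cookie": "IDE=bbb222",
    },
    {
        "url": "https://www.google.com/ads/example-collect?tid=UA-9138420&x=1",
        "cookie": "ANID=aaa111; OTHER=zzz",
    },
    {
        "url": "https://lms.example/courses/1",
        "cookie": "session=abc",
    },
]


def map_analytics_ids_to_cookies(requests):
    """Return {analytics_id: {(host, cookie_name), ...}} for requests carrying an ID."""
    mapping = defaultdict(set)
    for request in requests:
        host = urlsplit(request["url"]).hostname
        cookie_names = {
            part.split("=", 1)[0].strip()
            for part in request.get("cookie", "").split(";")
            if "=" in part
        }
        for analytics_id in UA_ID.findall(request["url"]):
            for name in cookie_names:
                mapping[analytics_id].add((host, name))
    return dict(mapping)


for analytics_id, pairs in map_analytics_ids_to_cookies(captured_requests).items():
    print(analytics_id, sorted(pairs))
```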


Visiting Subdomains

Visiting the subdomains called when the cookies were mapped to Instructure's Analytics ID returns web sites that appear to serve advertisers.

Attempting to visit google.com/ads redirects to a page that clearly appears to be connected to advertising:

Google Ads web page

Screenshot 12

Attempting to visit stats.g.doubleclick.net redirects to a page that offers services for analytics related to Google Marketing Platform.

Google Marketing

Screenshot 13

A look at the features overview page shows that there is a "native data onboarding integration" with Google Ads and Adsense, and "native remarketing integrations" with Google Ads.

Google Analytics integration

Screenshot 14

Additional Areas for Examination

This initial scan was limited in scope to test one specific -- yet common -- use case: what does ad tracking look like when a person has a consumer GMail account, and uses the same browser to access that personal account as their school-provided LMS? With this initial scan in place, several follow up tests would help create a more complete picture.

  • Use a school-provided Gmail account.
  • Visit other sites with ads and observe other ad-related interactions that are mapped to either of these cookies.
  • Test other LMSs that use Google Analytics to see if there is comparable mapping of Google Analytics IDs to cookie IDs.
  • Test other educational sites that use Google Analytics to see if there is comparable mapping of Google Analytics IDs to cookie IDs.

These scans would each provide additional information that would help create a more complete picture, and would build on and provide additional context to what was observed in this initial scan. If the mapping observed in this scan is replicated across the web on other educational sites that use Google Analytics within K12 or higher ed, then -- theoretically -- students could be profiled based on their interactions with sites they are required to use for school. The types of redlining, targeting, or "predictions" that would be possible from this type of profiling are clearly not in the best interests of learners.

Conclusions

This scan covers a pretty common use case: a person who checks their personal email and searches for other information, and then does some schoolwork. As documented in this writeup, this behavior results in a range of tracking behavior that includes:

  • a. tracking cookies are set shortly after a person logs into a Google account, and these cookies are directly tied to a person's specific identity;
  • b. via these cookies, Google gets specific information about searches on YouTube and Google, including keylogging of the search process;
  • c. via these cookies, Google gets specific information about the sites a person visits, and when they visit them;
  • d. on both sites in this scan that used Google Analytics, the domain's Google Analytics ID was synched with tracking cookies;
  • e. while logged in to an LMS as a high school student, the Google Analytics ID of the required LMS for a public high school student is mapped to cookie IDs that appear to be used for ad targeting, and are tied to a student's real identity.

It is not clear why Instructure's Google Analytics ID needs to be mapped to cookie IDs that are set in a consumer context and appear to be related to ad tracking.

To be very clear: the tracking cookies mapped to a person's actual identity occurred within the context of consumer use. When a person uses Gmail, or searches via Google, or browses a site for medical information, they are tracked, and they are tracked in ways that can be connected back to their real identity. This is how adtech works, and -- based on current privacy law in the United States -- this is completely legal.

As observed in this scan, the tracking cookies set in a consumer context are also accessed when a student is logged into their LMS, in a strictly educational context. In practical terms, the only way for a high school student to completely avoid the type of tracking documented in this scan would be to practice abnormally strong browser hygiene -- for example, they could set up a separate profile in Firefox that they only used while accessing the LMS. But realistically, the chances of that happening are slim to none, and "solutions" like this put the onus in the wrong place: a high school student should not be required to fix the excesses of the adtech industry, especially when they are accessing the required software that comes as a part of their legally required public education.

Dark Patterns and Passive Aggressive Disclaimers - It's CCPA Season!

4 min read

In today's notes on CCPA compliance, Dashlane gets the award for passive aggressive whinging paired with a dark pattern designed to obscure consent. I have managed to get my hands on secret video of Dashlane's team while they were planning how to structure their opt out page. This completely legitimate video is included below.

Hidden camera video of Dashlane team
Hidden camera video of the design process for Dashlane's opt out page.

In case you've never heard of Dashlane, they are a password manager. Three alternatives that are all less whingy are 1Password, LastPass, and KeePassXC -- and KeePassXC is an open source option.

Dashlane appears to be preparing for California's privacy law, CCPA, which is set to go into effect in 2020. 

The screenshot below is from Dashlane's splash page where, under CCPA, they are required to allow California residents to opt out of having their data sold. CCPA has a reasonably broad definition of what selling data means, and, predictably, some companies are upset at having any limits placed on their ability to use the data they have collected or accumulated.

Full page screenshot

Dashlane's disclaimer and opt out page provides a good example of how a company can comply, yet exhibit bad faith in the process.

First, let's look at their description of sales as defined by CCPA:

However, the California Consumer Privacy Act (“CCPA”), defines “sale” very broadly, and it likely includes transfers of information related to advertising cookies.

Two thoughts come to mind nearly simultaneously: this is cute, and stop whining. Companies have used a range of jargon to define commercial transfers of data for years - for example, "sharing" with "affiliates", or custom definitions of what constitutes PII, or shell games with cookies that are mapped between vendors and/or mapped to a browser or device profile. It's also worth noting that Dashlane is theoretically a company that helps people maintain better privacy and security practice via centralized password management. It's hard to imagine a better example of a company that should look to exceed the basic ground level requirements of privacy laws. Instead, Dashlane appears to be whinging about it.

However, Dashlane does more than just whine about CCPA. They take the extra step of burying their opt out in a multilayered dark pattern, complete with unclear "help" text and labels.

Dark pattern

As shown in the above screenshot, Dashlane's text instructs people to make a selection in "the box below". However, two obvious problems immediately become clear. First, there is no box, below or otherwise - the splash page contains a toggle and a submit button.

Second, assuming that the toggle is what they mean by "box", we have two options: "active" or "inactive." It's not clear what option turns cookies "off" - does the "active" setting mean that we have activated enhanced privacy protections, or does the "active" setting mean that ad tracking is activated? This is a pretty clear example of a dark pattern, or a design pattern that intentionally misleads or confuses end users.

Based on additional language on the splash page, it looks like the confusion that Dashlane has created is pretty meaningless because anything we set on this page appears pretty easy to wipe out, either intentionally or accidentally. So, even if the user makes the wrong choice because the language is intentionally confusing, this vague choice can get erased pretty easily.

Brittle settings

Based on this description, the ad tracking opt out sounds like it's cookie based, and therefore brittle to the point of being meaningless.

While it remains to be seen how other companies will address their obligations under CCPA, I'd like to congratulate Dashlane on taking an early lead in the "toothless compliance" and "aggressive whinging" categories.

The Data Aren't Worth Anything But We'll Keep Them Forever Anyways. You're Welcome.

4 min read

Earlier this week, Instructure announced that they were being acquired by a private equity firm for nearly 2 billion dollars. 

Because Instructure offers a range of services, including a learning management system, this triggered the inevitable conversation: how much of the 2 billion price tag represented the value of the data?

The drone is private equity.

There are multiple good threads on Twitter that cover some of these details, so I won't rehash these conversations - the timelines of Laura Gibbs, Ian Linkletter, Matt Crosslin, and Kate Bowles all have some interesting commentary on the acquisition and its implications. I recommend reading their perspectives.

My one addition to the conversation is relevant both to Instructure and educational data in general. Invariably, when people raise valid privacy concerns, defenders of what currently passes as acceptable data use say that people raising privacy concerns are placing too much emphasis on the value of the data, because the data aren't worth very much.

Before we go much further, we also need to understand what we mean when we say data in this context: data are the learning experiences of students and educators; the artifacts that they have created through their effort that track and document a range of interactions and intellectual growth. "Data" in this context are personal, emotional, and intellectual effort -- and for everyone who had to use an Instructure product, their personal, emotional, and intellectual effort have become an asset that is about to be acquired by a private equity firm.

But, to return to the claims that the data have no real value: these claims about the lack of value of the underlying data are often accompanied by long descriptions of how companies function, and even longer descriptions about where the "real" value resides (hint: in these versions, it's never the data).

Here is precisely where these arguments fall apart: if the data aren't worth anything, why do companies refuse to delete them?

We can get a clear sense of the worth of the data that companies hold by looking at the lengths they go to both obfuscate their use of this data, and the lengths that they go to hold on to it. We can see a clear example of what obfuscation looks like from this post on the Instructure blog from July of 2019. The post includes this lengthy non-answer about why Canvas doesn't support basic user agency in the form of an opt out:

What can I say to people at my institution who are asking for an "opt-out" for use of their data?

When it comes to user-generated Canvas data, we talk about the fact that there are multiple data stewards who are accountable to their mission, their role, and those they serve. Students and faculty have a trust relationship with their educational institutions, and institutions rely on data in order to deliver on the promise of higher education. Similarly, Instructure is committed to being a good partner in the advancement of education, which means ensuring our client institutions are empowered to use data appropriately. Institutions who have access to data about individuals are responsible to not misuse, sell, or lose the data. As an agent of the institution, we hold ourselves to that same standard.

Related to this conversation, when we hear companies talking about developing artificial intelligence (AI) or machine learning (ML) to develop or improve their product, they are describing a process that requires significant amounts of data to start the process, and significant amounts of new/additional data to continue to develop the product.

But for all the companies, and the paid and unpaid defenders of these companies: you claim that the data have no value while simultaneously refusing to delete the data -- or to even allow a level of visibility into or learner control over how their data are used.

If -- as you claim -- the data have no value, then delete them.

Misinformation: Let's Study the Olds, Shall We?

4 min read

The Stanford History Education Group (SHEG) released a study recently about Students' Civic Online Reasoning. The link shared here contains a link to a download page for the full study, and it's worth reading.

The Executive Summary somberly opens with a Serious Question (tm):

The next presidential election is in our sights. Many high school students will be eligible to vote in 2020. Are these first-time voters better prepared to go online and discern fact from fiction?

The conclusions shared in the Executive Summary are equally somber:

Nearly all students floundered. Ninety percent received no credit on four of six tasks.

I was able to capture secret video of Serious People reacting to this study. You're welcome.

FREAK OUT!!

All kidding aside, the "kids are bad at misinformation" cliche needs to be retired. It was never particularly good to begin with, and it hasn't gotten much better with age.

To be clear: if the adults in the building lack basic information literacy, it will be increasingly difficult for students to master these skills. Last I checked, high school sophomores didn't vote in the 2016 or 2018 elections. But their teachers, and their school administrators? They sure did.

Also, to be clear: if I had a nickel for every time I discovered a romance scam account only to see some of our educational and edtech "leaders" following the account, I could retire, right now. To date, I have refrained from naming names, but hoo boy my patience wears thin on some days.

But when we are studying misinformation, we need to stop doing it in a vacuum, and in a limited way. The recent SHEG study pulls in demographic information including race, gender, and maternal education levels, and this is a good start, but it's still incomplete.

Additional data points that are readily and publicly available, and that could be assembled once and reused indefinitely, include:

  • Voter turnout percentages for 2012, 2014, 2016, 2018, and -- eventually -- 2020 elections.
  • Voter results (Federal House and Senate, and Presidential) for 2012, 2014, 2016, 2018, and -- eventually -- 2020 elections.

These data can be obtained via postal code or FIPS code, and would provide an additional point of reference to results. Given that many studies of misinformation and youth are contextualized within the frame of civic participation, we should probably have some measure of actual civic participation that holds true across the entire country.

While this addition would provide some useful context, it still doesn't get any information about the adults in the system, and their skill levels. Toward that end, surveys should include as many adults within evaluated systems as possible: administrative and district staff; school board members; superintendents and assistant superintendents; curriculum staff and technical staff; building level principals and assistant principals; school librarians (ha, yeah, I know); and classroom teachers. Data should also note levels of participation across staff.

By including adults in the study, the relative skill level of the adults could be cross referenced against the students for whom they are responsible, and the overall levels of participation in national elections. Rate of participation from adults would also be an interesting data point.

This is a very different study than what SHEG put out. Getting adult participation would make recruiting participant districts even more time consuming -- but if we are going to move past where we are now, we need to do better than we're currently doing. All of us need to get better at addressing misinformation, and we're not going to get there by pointing fingers at youth or by taking too narrow a view of the problem. But we can't shy away from the reality that adults have played an outsized role in creating and perpetuating the success of misinformation. To fix the problems caused by misinformation, we need to study ourselves as well.

Adtech, Tracking, and Misinformation: It's Still Messy

15 min read

Introduction

Over the last several months, I have wasted countless hours reading through and collecting online posts related to several conversational spikes that were triggered by current events. These conversational spikes contained multiple examples of outright misinformation and artificial amplification of this misinformation.

I published three writeups describing this analysis: one on a series of four spikes related to Ilhan Omar, a second related to the suicide of Jeffrey Epstein, and a third related to trolls and sockpuppets active in the conversation related to Tulsi Gabbard. For these analyses, I looked at approximately 2.7 million tweets, including the domains and YouTube videos shared.

Throughout each of these spikes, right leaning and far right web sites that specialize in false or misleading information were shared far more extensively than mainstream news sources. As shown in the writeups, there was nothing remotely close to balance in the sites shared. Rightwing sites and sources dominated the conversation, both in number of shares, and in number of domains shared.

This imbalance led me to return to a question I looked at back in 2017: is there a business model or a sustainability model for publishing misinformation and/or hate? This is a question multiple other people have asked; as one example, Buzzfeed has been on this beat for years now.

To begin to answer this question, I scanned a subset of the sites used when spreading or amplifying misinformation, along with several mainstream media sites. This scan had two immediate goals:

  • get accurate information about the types of tracking and advertising technology used on each individual site; and 
  • observe overlaps in tracking technologies used across multiple sites.

Both mainstream news sites and misinformation sites rely on advertising to generate revenue.

The companies that sell ads collect information about people, the devices they use, and their geographic location (at minimum, inferred from IP addresses, but also captured via tracking scripts), as part of how they sell and deliver ads.

This scan will help us answer several questions:

  1. what companies help these web sites generate revenue?
  2. what do these adtech companies know about us?
  3. given what these companies know about us, how does that impact their potential complicity in spreading, supporting, or profiting from misinformation?

Methodology

25 sites were scanned -- each site is listed below, followed by the number of third parties that were called on each site. The sites selected for scanning meet one or more of the following criteria: were used to amplify false or misleading narratives on social media; have a track record of posting false or misleading content; are recognized as a mainstream news site; are recognized as a partisan but legitimate web site.

Every site scan began by visiting the home page. From the home page, I followed a linked article. From the linked article, I followed a link to another article within the site, for a total of three pages in each site.

On each pageload, I allowed any banner ads to load, and then scrolled to the bottom of the page. A small number of the sites used "infinite scroll" - on these sites, I would scroll down the equivalent of approximately 3-4 screens before moving on to a new page in the site.

While visiting each site, I used OWASP ZAP (an intercepting proxy) to capture the web traffic and any third party calls. For each scan, I used a fresh browser with the browsing history, cookies, and offline files wiped clean.

Summary Results

The sites scanned are listed below, sorted by the number of observed trackers, from high to low.

The sites at the top of the list shared information about site visitors with more third party domains. In general, each individual domain is a different company, although in some cases (like Google and Facebook) a single company can control multiple domains. This count is at the domain level, so if a site sent user information to subdomain1.foo.com and subdomain2.foo.com, the two distinct subdomains count as a single domain.

  • dailycaller (dot) com -- 189
  • thegatewaypundit (dot) com -- 160
  • thedailybeast (dot) com -- 154
  • mediaite (dot) com -- 153
  • dailymail.co.uk -- 151
  • zerohedge (dot) com -- 145
  • cnn (dot) com -- 143
  • westernjournal (dot) com -- 140
  • freebeacon (dot) com -- 137
  • huffpost (dot) com -- 131
  • breitbart (dot) com -- 107
  • foxnews (dot) com -- 101
  • twitchy (dot) com -- 92
  • thefederalist (dot) com -- 88
  • townhall (dot) com -- 83
  • washingtonpost (dot) com -- 82
  • dailywire (dot) com -- 71
  • pjmedia (dot) com -- 61
  • lauraloomer.us -- 52
  • nytimes (dot) com -- 42
  • infowars (dot) com -- 40
  • vdare (dot) com -- 21
  • prageru (dot) com -- 19
  • reddit (dot) com -- 18
  • actblue (dot) com -- 13
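
For anyone who wants to reproduce the domain-level counts above from their own proxy capture, the counting rule can be sketched roughly as follows. This is not the exact script used for the scan. It uses a deliberately naive rule for finding the registrable domain (the last two labels of the hostname, plus a small hand-picked set of two-part suffixes like co.uk); a more careful version would use the Public Suffix List. The site name and URLs are placeholders.

```python
from urllib.parse import urlsplit

# Assumption: a tiny, incomplete set of two-part public suffixes. A thorough
# version would consult the Public Suffix List instead.
TWO_LABEL_SUFFIXES = {"co.uk", "com.au", "co.jp"}


def registrable_domain(hostname):
    """Collapse a hostname to its domain, e.g. subdomain1.foo.com -> foo.com."""
    labels = hostname.lower().rstrip(".").split(".")
    if len(labels) >= 3 and ".".join(labels[-2:]) in TWO_LABEL_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])


def third_party_domains(site_host, request_urls):
    """Return the set of distinct third party domains contacted during a visit."""
    first_party = registrable_domain(site_host)
    domains = set()
    for url in request_urls:
        host = urlsplit(url).hostname
        if host is None:
            continue
        domain = registrable_domain(host)
        if domain != first_party:
            domains.add(domain)
    return domains


# Hypothetical capture for one site visit.
requests_seen = [
    "https://www.news.example/article",
    "https://subdomain1.foo.com/pixel",
    "https://subdomain2.foo.com/sync",
    "https://cdn.tracker.co.uk/a.js",
]

domains = third_party_domains("www.news.example", requests_seen)
print(len(domains), sorted(domains))
```

Here the two foo.com subdomains collapse into one entry, which matches the counting rule described above.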

The list below highlights the most commonly used third party domains. The list breaks out the domain, the number of sites in the scan that called it, and the company that owns the domain. As shown below, the top 24 third parties were all called by 18 or more sites.

The top 24 third party sites getting data include some well known names in the general tech world, such as Google, Facebook, Amazon, Adobe, Twitter, and Oracle.

However, lesser known companies are also broadly used, and get access to user information as well. These less known companies collecting information about people's browsing habits include AppNexus, MediaMath, The Trade Desk, OpenX, Quantcast, RapLeaf, Rubicon Project, comScore, and Smart Ad Server.

Top third party domains called:

  • doubleclick.net - 25 - Google
  • googleapis.com - 24 - Google
  • facebook.com - 23 - Facebook
  • google.com - 23 - Google
  • google-analytics.com - 22 - Google
  • googletagservices.com - 22 - Google
  • gstatic.com - 22 - Google
  • adnxs.com - 21 - AppNexus
  • googlesyndication.com - 21 - Google
  • adsrvr.org - 20 - The Trade Desk
  • mathtag.com - 20 - MediaMath
  • twitter.com - 20 - Twitter
  • yahoo.com - 20 - Yahoo
  • amazon-adsystem.com - 19 - Amazon
  • bluekai.com - 19 - Oracle
  • facebook.net - 19 - Facebook
  • openx.net - 19 - OpenX
  • quantserve.com - 19 - Quantcast
  • rlcdn.com - 19 - RapLeaf
  • rubiconproject.com - 19 - Rubicon Project
  • scorecardresearch.com - 19 - comScore
  • ampproject.org - 18 - Google
  • everesttech.net - 18 - Adobe
  • smartadserver.com - 18 - Smart Ad Server (partners with Google and the Trade Desk)

The full list of domains, and the paired third party calls, are available on Github.
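
Once each site has its own set of third party domains, the overlap list above is just a tally of how many sites called each domain. A minimal sketch, using made-up placeholder sites and a handful of the domains named in this post (the pairings here are illustrative, not the real scan data):

```python
from collections import Counter


def sites_per_third_party(calls_by_site):
    """Count how many sites called each third party domain at least once."""
    counts = Counter()
    for site, domains in calls_by_site.items():
        # set() ensures a domain called many times on one site counts once.
        counts.update(set(domains))
    return counts


# Hypothetical input: site -> third party domains observed on that site.
calls_by_site = {
    "site-a.example": ["doubleclick.net", "adnxs.com"],
    "site-b.example": ["doubleclick.net", "doubleclick.net", "rlcdn.com"],
    "site-c.example": ["doubleclick.net", "adnxs.com"],
}

for domain, site_count in sites_per_third_party(calls_by_site).most_common():
    print(f"{domain} - {site_count}")
```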

As noted above, Doubleclick -- an adtech and analytics service owned by Google -- is used on every single site in this scan. We'll take a look at what that means in practical terms later in this post. But other domains are also used heavily across multiple sites.

amazon-adsystem.com -- controlled by Amazon -- was called on 19 sites in the scan, including Mediaite, CNN, Reddit, Huffington Post, the Washington Post, the NY Times, Western Journal, PJ Media, ZeroHedge, the Federalist, Breitbart, and the Daily Caller.

adsrvr.org -- a domain that appears to be owned by The Trade Desk -- was called on 20 sites in the scan, including Breitbart, PJMedia, ZeroHedge, The Federalist, CNN, Mediaite, Huffington Post, and the Washington Post.

Stripe -- a popular payment platform -- was called on sites ranging from right wing sites to outright hate sites. While I did not confirm that each payment gateway is active and functional, the chances are good that Stripe is used to process payments on some or all of the sites where it appears. Sites where calls to Stripe came up in the scan include VDare (a white nationalist site), Laura Loomer, Breitbart, and Gateway Pundit.

Stripe is primarily a payment processor, and is included here to show an additional business model -- selling merchandise -- used to generate revenue. However, multiple adtech and analytics providers are used indiscriminately on sites across the political spectrum. While some people might point to the ubiquity and reuse of adtech across the political spectrum -- and across the spectrum of news sites, from mainstream to highly partisan sites, to hate sites and misinformation sites -- as a sign of "neutrality", it is better understood as an amoral stance.

Adtech helps all of these sites generate revenue, and helps all of these sites understand what content "works" best to generate interaction and page views. When mainstream news sites use the same adtech as sites that peddle misinformation, the readers of mainstream sites have their reading and browsing habits stored and analyzed alongside the browsing habits of people who live on an information diet of misinformation. In this way, when mainstream news sites choose to have reader data exposed to third parties that also cater to misinformation sites, it potentially exposes these readers to advertising designed for misinformation platforms. In the targeted ad economy, one way to avoid being targeted is to be less visible in the data pool, and when mainstream news sites use the same adtech as misinformation sites, they sell us out and increase our visibility to targeted advertisers.

Note: Ad blockers are great. Scriptsafe, uBlock Origin, and/or Privacy Badger are all good options.

Looking at this from the perspective of an adtech or analytics vendor, they have the most to gain financially from selling to as many customers as possible, regardless of the quality or accuracy of the site. The more data they collect and retain, the more accurate (theoretically) their targeting will become. The ubiquity of adtech used across sites allows adtech vendors to skim profit off the top as they sell ads on web properties working in direct opposition to one another.

In short, while our information ecosystem slowly collapses under the weight of targeted misinformation, adtech profits from all sides, and collects more data from people being misled, thus allowing more accurate targeting of people most susceptible to misleading content over time. Understood this way, adtech has a front row seat to the steady erosion of our information ecosystem, with a couple of notable caveats: first, with the dataset adtech vendors have collected and continue to grow, they could identify the most problematic players. Second, adtech profits from lies just as much as truth, so vendors have a financial incentive to not care.

But don't take my word for it. In January 2017, Randall Rothenberg, the head of the Interactive Advertising Bureau (IAB, the leading trade organization for online adtech), described this issue:

We have discovered that the same paths the curious can trek to satisfy their hunger for knowledge can also be littered deliberately with ripe falsehoods, ready to be plucked by – and to poison – the guileless.

In his 2017 speech, Rothenberg correctly observes that advertising has what he describes as a "civic responsibility":

Our objective isn’t to preserve marketing and advertising. When all information becomes suspect – when it’s not just an ad impression that may be fraudulent, but the data, news, and science that undergird society itself – then we must take civic responsibility for our effect on the world.

In the same speech in 2017, Rothenberg highlights the interdependence of adtech and the people affected by it, and the responsibilities that requires from adtech companies.

First, let me dispense with the fantasy that your obligation to your company stops at the door of your company. For any enterprise that has both customers and suppliers – which is to say, every enterprise – is a part of a supply chain. And in any supply chain, especially one as complex as ours in the digital media industry, everything is interdependent – everything touches something else, which touches someone else, which eventually touches everyone else. No matter how technical your company, no matter how abstruse your particular position and the skill it takes to occupy it, you cannot divorce what you do from its effects on the human beings who lie, inevitably, at the end of this industry’s supply chain.

Based on what is clearly observable in this scan of 25 sites that featured heavily in misinformation campaigns, nearly three years after the head of the IAB called for improvements, actual improvements appear to be in very short supply.

Tracking Across the Web

To illustrate how tracking looks in practice, I did a sample scan across six web sites:

  • Gateway Pundit
  • Breitbart
  • PJ Media
  • Mediaite
  • The Daily Beast
  • The Federalist

While all of these sites use dozens of trackers, for reasons of time we will limit our review to two: Facebook and Google. Also, to be very clear: the proxy logs for this scan of six sites contain an enormous amount of information about what is collected, how it's shared, and the means by which data are collected and synced between companies. The discussion in this post barely scratches the surface, and this is an intentional choice. Going into more detail would have required a deeper dive into the technical implementation of tracking, and while this deeper dive would be fun, it's outside the scope of this post.

In the screenshots below, the URLs sent in the headers of the request, the User Agent information, and the full cookie ID are partially obfuscated for privacy reasons.

Facebook:

Facebook sets a cookie on the first site: Gateway Pundit. This cookie has a unique ID, which gets reused across multiple sites. The initial request sent to Facebook includes a timestamp, and basic information about the system used to access the site (details like operating system, browser, browser version, and screen height and width). The request also includes the time of day, and the referring URL.

Gateway Pundit and Facebook tracking ID

At this point, Facebook doesn't need much more to flesh out a device fingerprint and map this ID to a specific device. A superficial scan of multiple scripts loaded by domains affiliated with Facebook suggests that Facebook collects adequate data to generate a device fingerprint, which would allow them to then tie that more specific identifier to different cookie IDs over time.
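To make the idea of a device fingerprint more concrete, here is a minimal sketch in Python. It is not Facebook's method, which is not visible from the proxy logs; it only shows how the kinds of attributes sent in the initial request (operating system, browser, browser version, screen height and width) could be combined into a stable identifier. The function name, attribute keys, and sample values are all hypothetical.

```python
import hashlib


def device_fingerprint(attrs: dict[str, str]) -> str:
    """Hash a set of device attributes into a short, repeatable identifier.

    Illustrative only: real fingerprinting scripts typically use many more
    signals than the handful shown here.
    """
    # Sort the keys so the same attributes always produce the same string,
    # regardless of the order in which they were collected.
    canonical = "|".join(f"{key}={attrs[key]}" for key in sorted(attrs))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Hypothetical attribute values, loosely modeled on the request details
# described above.
first_visit = {
    "os": "macOS 10.15",
    "browser": "Firefox",
    "browser_version": "72.0",
    "screen": "1440x900",
}

# The same device seen later, after its cookies were cleared: the cookie ID
# would change, but these attributes would not.
later_visit = dict(reversed(list(first_visit.items())))

print(device_fingerprint(first_visit) == device_fingerprint(later_visit))
```

Because the identifier is derived from the device rather than stored on it, clearing cookies does not change it, which is what would allow a new cookie ID to be tied back to an old one.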

The cookie ID is consistently included in headers across multiple web sites. In the screenshot below, the cookie ID is included in a request on Breitbart:

Breitbart and Facebook tracking ID

And PJ Media:

PJ Media and Facebook tracking ID

And Mediaite:

Mediaite and Facebook tracking ID

And the Daily Beast:

Daily Beast and Facebook tracking ID

And the Federalist:

Federalist and Facebook tracking ID

Google:

Google (or more specifically, Doubleclick, which is owned by Google) works in much the same way as Facebook.

The initial Doubleclick cookie, with a unique value, gets set on the first site, Gateway Pundit. As with Facebook, this cookie is repeatedly included in request headers on every site in this scan.

Gateway Pundit and Google tracking ID

Here, we see the same ID getting included in the header on PJ Media:

PJ Media and Google tracking ID

And on Breitbart:

Breitbart and Google cookie ID

As with Facebook, Google repeatedly gets browsing information, and information about the device doing the browsing. This information is tied to a common identifier across web sites, and this common identifier can be tied to a device fingerprint, which can be used to precisely identify individuals over time. The data collected by Facebook and Google in this scan includes specific URLs accessed, and patterns of activity across the different sites. Collectively, over time, this information provides a reasonably clear picture of a person's habits and interests. If this information is combined with other data sets -- like search history from Google, or group and interaction history from Facebook -- we can begin to see how browsing patterns provide an additional facet that can be immensely revealing as part of a larger profile.
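For readers who want to look for this pattern in their own proxy logs, the check can be sketched in a few lines of Python. This is a rough illustration, not the process used for this scan: it assumes the log has already been exported to a list of records with "url", "referer", and "cookie" fields (a hypothetical format), and every hostname, cookie name, and cookie value below is made up.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def cross_site_ids(entries, min_sites=2):
    """Return cookies that one third-party host received from several sites.

    Keys are (tracker host, cookie name, cookie value); values are the set
    of sites (taken from the Referer header) where that cookie was sent.
    """
    seen = defaultdict(set)
    for entry in entries:
        tracker = urlsplit(entry.get("url", "")).hostname
        site = urlsplit(entry.get("referer", "")).hostname
        if not tracker or not site:
            continue  # skip records without a usable URL or referrer
        # A Cookie header looks like "name1=value1; name2=value2".
        for pair in entry.get("cookie", "").split(";"):
            name, sep, value = pair.strip().partition("=")
            if sep and value:
                seen[(tracker, name, value)].add(site)
    return {key: sites for key, sites in seen.items() if len(sites) >= min_sites}


# Hypothetical records: one tracker reuses the same ID on two sites,
# the other sends a different value on each site.
SAMPLE_ENTRIES = [
    {"url": "https://tracker.example/pixel", "referer": "https://news-a.example/", "cookie": "uid=abc123"},
    {"url": "https://tracker.example/pixel", "referer": "https://news-b.example/story", "cookie": "uid=abc123"},
    {"url": "https://other.example/tag.js", "referer": "https://news-a.example/", "cookie": "sid=111"},
    {"url": "https://other.example/tag.js", "referer": "https://news-b.example/story", "cookie": "sid=222"},
]

for (tracker, name, value), sites in cross_site_ids(SAMPLE_ENTRIES).items():
    print(tracker, name, value, sorted(sites))
```

Any identifier this reports is one that a single third party saw on more than one site, which is the same behavior shown in the screenshots above.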

Conclusion, or Thoughts on Why this Matters

Political campaigns are becoming increasingly aggressive in how they track people and target them for outreach.

As has been demonstrated, it's not difficult to identify the location of specific individuals using even rudimentary adtech tools.

Given the opacity of the adtech industry, it can be difficult to detect and punish fraudulent behavior -- such as what happened with comScore, an adtech service used in 19 of the 25 sites scanned.

As social media platforms -- who are also adtech vendors and data brokers -- flail and fail to figure out their role, the ability to both amplify questionable content and to target people using existing adtech services provides powerful opportunities to influence people who might be prone to a nudge. This is the promise of advertising, both political and consumer, and the tools for one are readily adaptable for the other.

Adtech both profits from and extends information asymmetry. The companies that act as data brokers and adtech vendors know far more about us than we do about them. Web sites pushing misinformation -- and the people behind these sites -- can potentially use this stacked deck to underwrite and potentially profit from misinformation.

Adtech in its current form should be understood as a parasite on the news industry. When mainstream news sites throw money and data into the hands of adtech companies that also support their clear enemies, mainstream sites are actively undermining their long term interests.

Conversely, though, the adtech companies that currently profit from the spread of misinformation, and the targeting of those who are most susceptible to it, are sitting on the dataset that could help counter misinformation. The same patterns that are used to target ads and analyze individuals susceptible to those ads could be put to use to better understand -- and dismantle -- the misinformation ecosystem. And the crazy thing, and a thing that could provide hope: all it would take is one reasonably sized company to take this on.

If one company decided that, finally, enough is enough, they could theoretically work with researchers to develop an ethical framework that would allow for a comprehensive analysis of the sites that are central to spreading specific types of misinformation. While companies like Google, Facebook, Amazon, AppNexus, MediaMath, the Trade Desk, comScore, or Twitter have shown no inclination to tackle this systematically, countless smaller companies would potentially have datasets that are more than complete enough to support detailed insights.

Misinformation campaigns are happening now, across multiple platforms, across multiple countries. The reasons driving these campaigns vary, but the tradecraft used in these campaigns has overlaps. While adtech currently supports people spreading misinformation, it doesn't need to be this way. The same data that are used to target individuals could be used to counter misinformation, and make it more difficult to profit from spreading lies.