request / response

a blog about the web, Go, and building things

(by Matt Silverlock)


Connecting to a Coral TPU Dev Board with Windows

•••

The Coral Dev Board is a TPU-enabled development board for testing machine learning models that require near-real-time inference: for instance, image classification or object detection on video feeds, where a CPU would struggle to keep up.

However, the dev board’s setup instructions only document a native Linux flashing process. It’s entirely possible to flash the boot image from native Windows (without WSL): here’s how!

Pre-requisites

You’ll need to install a few things: this is mostly a process of clicking “next” a few times (the USB drivers) and unzipping a folder (the Android Platform Tools).

You should also be moderately familiar with serial consoles & have read through the Coral’s setup instructions to familiarize yourself with the process.

Note: It’s important to make sure you’re using a data-capable USB-C cable when connecting to the USB-C data port. Like many things USB-C / USB 3.x, this can be non-obvious at first. You’ll know the cable is right when a “⚠ USB Download Gadget” appears in the Device Manager. If you use a power-only cable, nothing will show up, and it’ll seem as if the OS isn’t seeing the device.

Connecting to the Serial Console

Mostly identical to the Coral setup instructions:

  1. Connect to the dev board’s micro-USB port, and identify the COM port the device is attached to in the Device Manager by looking under “Ports (COM & LPT)” for the “CP2105 USB to UART (Standard)” device. In my case, it was COM3.
  2. Power on the board by connecting the USB-C power cable to the power port (furthest from the HDMI port).
  3. Open PuTTY, select “Serial” as the connection option, set the COM port to the one you identified above, and the data rate to 115200bps. For confirmation, the serial comms settings should be at 8 data bits, no parity bits, 1 stop bit and XON/XOFF flow control.

The serial port on the dev board accepts other settings, but I’m documenting an explicit list for those who don’t have a background in serial comms.

You should now be at the dev board’s U-Boot prompt, and ready to flash the bootloader & disk image. If not, check that the board is powered on, that the COM port is correct, and that the Device Manager lists the device.

Flashing the Board

Connect the USB-C data cable to the dev board, and the other end to your PC.

In the Device Manager, you’ll see a “USB Download Gadget” appear with a warning symbol. Right click, choose “Update Driver”, select “Browse my computer for driver software” and then “Let me pick from a list of available drivers from my computer”. In the driver browser, choose “WinUsb Device” from the left side, and “ADB Device” (Android Debugger) from the right. Click “Next” and accept the warning. The Device Manager will refresh, and show the device under “Universal Serial Bus devices”.

To confirm it’s configured correctly and visible to the OS, head back to your command prompt and enter:

λ fastboot devices
122041d6ef944da7        fastboot

If you don’t see anything, confirm the device is still showing in the Device Manager, and that you have the latest version of fastboot from the Android Platform Tools (linked above).

From here, you’ll need to download and unzip the bootloader image and the disk image (identical to the official instructions), and confirm you see the contents below:

λ curl -O https://dl.google.com/aiyprojects/mendel/enterprise/mendel-enterprise-beaker-22.zip
λ unzip mendel-enterprise-beaker-22.zip
λ cd mendel-enterprise-beaker-22
λ ls
    boot_arm64.img  partition-table-16gb.img  partition-table-8gb.img  rootfs_arm64.img
    flash.sh*       partition-table-64gb.img  recovery.img             u-boot.imx

Unfortunately, flash.sh is a Bash script, which won’t run natively on Windows, but we can easily replicate what it does:

λ tail -n 15 flash.sh
fi

# Flash bootloader
${FASTBOOT_CMD} flash bootloader0 ${PRODUCT_OUT}/u-boot.imx
${FASTBOOT_CMD} reboot-bootloader

# Flash partition table
${FASTBOOT_CMD} flash gpt ${PRODUCT_OUT}/${PART_IMAGE}
${FASTBOOT_CMD} reboot-bootloader

# Flash filesystems
${FASTBOOT_CMD} erase misc
${FASTBOOT_CMD} flash boot ${PRODUCT_OUT}/boot_${USERSPACE_ARCH}.img
${FASTBOOT_CMD} flash rootfs ${PRODUCT_OUT}/rootfs_${USERSPACE_ARCH}.img
${FASTBOOT_CMD} reboot

Where we see ${FASTBOOT_CMD} we simply run fastboot, and where we see ${USERSPACE_ARCH} we only have one choice for the dev board: arm64. We can work with this.

In the serial console (e.g. in PuTTY), put the dev board into fastboot mode:

fastboot 0

Then, in the command prompt and from within the mendel-enterprise-beaker-22 directory, invoke the following commands. You should leave the serial console connected: you’ll see the progress of each step.

fastboot flash bootloader0 u-boot.imx
fastboot reboot-bootloader
 
fastboot flash gpt partition-table-8gb.img
fastboot reboot-bootloader

fastboot erase misc
fastboot flash boot boot_arm64.img
fastboot flash rootfs rootfs_arm64.img
fastboot reboot
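
If you expect to re-flash more than once and have a Bash environment handy (Git Bash, or WSL itself), the same sequence can be wrapped in a small function. This is my own sketch rather than official tooling: the flash_coral name is mine, and the FASTBOOT variable is overridable so you can dry-run it before touching the board.

```shell
# Mirrors the tail of flash.sh for the dev board (arm64 userspace, 8GB image).
# Run from within the mendel-enterprise-beaker-22 directory.
flash_coral() {
  local fb="${FASTBOOT:-fastboot}"   # override with FASTBOOT=echo for a dry run
  local arch="arm64"
  local part_image="partition-table-8gb.img"

  $fb flash bootloader0 u-boot.imx
  $fb reboot-bootloader

  $fb flash gpt "$part_image"
  $fb reboot-bootloader

  $fb erase misc
  $fb flash boot "boot_${arch}.img"
  $fb flash rootfs "rootfs_${arch}.img"
  $fb reboot
}
```

Running FASTBOOT=echo flash_coral prints the eight fastboot invocations without executing them; calling flash_coral with no overrides performs the actual flash.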

When the device reboots, you’ll get a more familiar Linux login prompt in the serial console! Enter mendel (username) and mendel (password) to log in, and then follow the steps within the official documentation to set up network connectivity! You’ll then be able to log into the board remotely via SSH, and will only need to connect it to power unless you want to flash it again.

Beyond that: enjoy experimenting & building things on your Coral Dev Board! And if you run into issues, or find something unclear in these instructions, you can reach me on Twitter at @elithrar.


Updating Kubernetes Deployments on a ConfigMap Change

•••

One initially non-obvious thing to me about Kubernetes was that changing a ConfigMap (a set of configuration values) is not detected as a change to Deployments (how a Pod, or set of Pods, should be deployed onto the cluster) or Pods that reference that configuration. That expectation can result in unintentionally stale configuration persisting until a change to the Pod spec. This could include freshly created Pods due to an autoscaling event, or even restarts after a crash, resulting in misconfiguration and unexpected behaviour across the cluster.

Note: This doesn’t impact ConfigMaps mounted as volumes, which are periodically synced by the kubelet running on each node.

Updating the ConfigMap and running kubectl apply -f deployment.yaml results in a no-op, which makes sense if you consider the impacts of an unintended config change and rollout in a larger deployment.

But, there are certainly cases where we want to:

  • Update a ConfigMap
  • Have our Deployment reference that specific ConfigMap version (in a version-control & CI friendly way)
  • Roll out a new revision of our Deployment

So how can we accomplish that? It turns out to be fairly straightforward, but let’s step through an example.

Example

Our ConfigMap, applied to our Kubernetes cluster:

➜  less demo-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: demo-config
  namespace: default
data:
  READ_TIMEOUT_SECONDS: "15"
  WRITE_TIMEOUT_SECONDS: "15"
  NAME: "elithrar"
➜  kubectl apply -f demo-config.yaml
configmap/demo-config created

And here’s our Deployment before we’ve referenced this version of our ConfigMap - notice the spec.template.metadata.annotations.configHash key we’ve added. It’s important to note that modifying a top-level Deployment’s metadata.annotations value is not sufficient: a Deployment will only re-create our Pods when the underlying template.spec (Pod spec) changes.

This is how we’ll couple the Deployment with our ConfigMap, triggering a change in our Deployment only when our ConfigMap actually changes.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-deployment
  labels:
    app: config-demo-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: config-demo-app
  template:
    metadata:
      labels:
        app: config-demo-app
      annotations:
        # The field we'll use to couple our ConfigMap and Deployment
        configHash: ""
    spec:
      containers:
      - name: config-demo-app
        image: gcr.io/optimum-rock-145719/config-demo-app
        ports:
        - containerPort: 80
        envFrom:
        # The ConfigMap we want to use
        - configMapRef:
            name: demo-config
        # Extra-curricular: We can make the hash of our ConfigMap available at a
        # (e.g.) debug endpoint via a fieldRef
        env:
          - name: CONFIG_HASH
            valueFrom:
              fieldRef:
                fieldPath: metadata.annotations['configHash']

With these two pieces in mind, let’s create a SHA-256 hash of our ConfigMap. Because this hash is deterministic (the same input == same output), the hash only changes when we change our configuration: making this a step we can unconditionally run as part of our deployment (CI/CD) pipeline into our Kubernetes cluster.
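
The determinism is easy to verify in a shell (an illustrative snippet; any serialized form of the ConfigMap works as input):

```shell
# Hash the same config twice, and a modified config once.
config='READ_TIMEOUT_SECONDS: "15"'
h1=$(printf '%s' "$config" | sha256sum | awk '{print $1}')
h2=$(printf '%s' "$config" | sha256sum | awk '{print $1}')
h3=$(printf '%s' 'READ_TIMEOUT_SECONDS: "30"' | sha256sum | awk '{print $1}')

# Same input, same digest; different input, different digest.
[ "$h1" = "$h2" ] && [ "$h1" != "$h3" ] && echo "deterministic"
```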

Note that I’m using yq (a CLI tool for YAML docs, like jq is to JSON) to modify our Deployment YAML at a specific path.

➜  yq w demo-deployment.yaml spec.template.metadata.annotations.configHash \
>  $(kubectl get cm/demo-config -oyaml | sha256sum | awk '{print $1}')
...
spec:
  ...
  template:
    metadata:
      ...
      annotations:
        configHash: 4431f6d28fdf60c8140d28c42cde331a76269ac7a0e6af01d0de0fa8392c1145

We can now re-deploy our Deployment, and because our spec.template changed, Kubernetes will detect it as a change and re-create our Pods.

As a bonus, if we want to make a shortcut for this during development/local iteration, we can wrap this flow in a useful shell function:

# Invoke as hash-deploy-config deployment.yaml configHash myConfigMap
hash-deploy-config() {
  yq w "$1" "spec.template.metadata.annotations.$2" \
    "$(kubectl get cm/"$3" -oyaml | sha256sum | awk '{print $1}')"
}

Accidentally From macOS to Windows and WSL

•••

It’s been ~5 months since I’ve used macOS proper, after 13+ years of personal use and a handful of work-use. This began when I started using my Windows “gaming” desktop & WSL (Windows Subsystem for Linux) for maintaining OSS projects & other dev-work—in-between dungeons or rounds of Overwatch—purely out of the convenience of being on the same machine.

It came to a head when I realized my 12” MacBook was collecting dust, that I wasn’t using it at work (ChromeOS + Crostini), and when I saw the Surface Pro 6 on sale. I decided to see if I could live with WSL closer to full-time, and critically, go without macOS. And so I put it up on Craigslist, sold it that weekend, and unpacked the Surface Pro a week later.

I did it partially as an experiment: Windows has seen some significant improvements as an OSS development platform over the last couple of years. Could I use it for writing Go, [an increasing amount of] data science / SQL / ML explorations, and testing new cloud infrastructure? Could it really compete with the macOS developer experience, which although not perfect, is pretty darned good? I figured it wouldn’t hurt to try it out, seeing as I was most of the way there already, and that it’d be a worthwhile process to document for other developers curious about WSL.

If you’re considering the switch, or are just curious as to what it’s like—including how WSL integrates with Windows, what tool choices you have, and importantly, what you’re going to miss from macOS—then read on.

Side-note: I wrote a short guide around my original WSL-based setup a while ago. Some of this article revises the tool choices I made at the time; the rest of it talks around the general Windows + WSL-experience and how it compares to macOS.

“The Shim”

In short: you effectively have “1.5” computers to deal with, and it feels like it at times.

Linux & Windows co-exist via the WSL layer, and although it’s generally pretty great (if not technically impressive), there are parts where the facade peels back to reveal some less-than-great interactions.

Jessie Frazelle wrote a great post on how WSL internals work (Windows <-> Linux syscall translation), and touches on some of the challenges I speak to below.

The first, and most obvious, is the way the filesystems interact. You can write to Windows from WSL - e.g. /mnt/c/Users/Matt/Dropbox/ writes to my Dropbox and works as expected, but you can’t read/write files from Windows -> WSL. Thus, accessing Windows from WSL is the “happy” path: anything you download via Chrome, in your Dropbox, on an external drive, etc - is accessible via /mnt/<driveletter>. It’s when you’ve cloned a git repo, use wget/curl -O to pull something down, or are iterating on a $language package in WSL and want to use a Windows-native tool that you’re destined to shuffle things around. I’ve symlinked my core working folders back into the Windows filesystem to make this part a little more livable - e.g. ln -s $USERPROFILE/repos $HOME/repos.
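
The symlink setup looks like this from inside WSL (on a real system the target would be something like /mnt/c/Users/Matt/repos, since /mnt/c is how WSL exposes the C: drive; scratch directories stand in for it here so the snippet is copy-paste safe):

```shell
# Stand-ins for the Windows-side folder and the WSL home directory.
winrepos=$(mktemp -d)/repos && mkdir -p "$winrepos"
linkhome=$(mktemp -d)

# Link the Windows-side folder into the (stand-in) home directory.
ln -s "$winrepos" "$linkhome/repos"
readlink "$linkhome/repos"
```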

You notice this filesystem gap the most when dealing with Windows-native editors but WSL-based toolchains: in my case, that’s VS Code on Windows and the Go toolchain inside WSL. VS Code doesn’t know how to look for your toolchain & packages inside WSL, and so you either need to live inside of Windows (losing your Linux tooling), or install VS Code inside of WSL (losing the ability to open files outside of WSL, and the native Windows integration). The ‘partial’ solution is to use a shared $GOPATH within the Windows filesystem, which at least means your packages only need to be fetched once, but you’ll need to be wary of potential differences should a package change implementation across OS’ (inc. the standard lib!). This is far less of a solution for systems programmers. There’s an open issue for this as it relates to vscode-go, but it still speaks to the “1.5 computers” problem I mentioned earlier.

Overall? It’s usable, you learn to live with it, but it adds friction to my day-to-day.

Terminal Emulators

I’ve bounced between a few terminal emulators here. None are perfect, and all of them make me yearn for iTerm2 on macOS. I wish it was better.

The situation is improving, though: the ConPTY API in the October 2018 Windows 10 build (1809) makes it much easier to integrate existing terminal emulators, so things can only get better.

What I’ve tried so far:

  • Cmder (ConEmu): fast & configurable, but poor Unicode support, tmux glitches & some emulation/escaping issues. Some improvements coming via ConPTY.
  • Hyper.js: Cross-platform due to Electron underpinnings, lots of third-party plugins. Same underlying emulator as VS Code (xterm.js), but tends to be very slow to launch and spawn new shells, and doesn’t keep up with lots of terminal output. I used Hyper for most of this year because, despite the perf issues, it was the least buggy.
  • wsltty (Mintty): Fast. Moderately configurable, but config DSL is a pain & docs are lacking. Not a bad option for most, and is the only one with mouse support for tmux out-of-the-box.
  • Terminus: Similar to Hyper.js in that it’s Electron-based, but faster and easier to configure. Good font rendering, doesn’t break under tmux, and has a solid tab UI. It’s still limited by its Electron roots in that it can be slow to launch, but it handles high-velocity output much better than Hyper.
  • Alacritty: A (very) fast, minimalist cross-OS emulator with a well-documented configuration. Windows support currently relies on winpty-agent, and font rendering (esp. Unicode fallback) is far from perfect. Support for the aforementioned ConPTY API is in the works, along with font rendering improvements.

I’m using Terminus for now, but I’m hopeful about Alacritty becoming my default terminal by end of year. Terminus is “good enough despite the bugs”, which has been a good way to sum up how most tools work under WSL.

Automation & Package Management

There were (are) myriad ways to bootstrap a new Mac: usually some combination of Homebrew, a shell script calling defaults write to set preferences, and installation of your dotfiles. Certainly, there are ways to do this on Windows—but something lightweight that doesn’t involve directly hacking at registry keys via PowerShell and has a solid community to crib from has been historically lacking.

Thankfully, there are ways to do this on Windows now: both the OS-level configuration and desktop package management (via Chocolatey). The answer is Boxstarter, a wrapper around Chocolatey itself, plus a number of convenience functions for modifying Windows Explorer settings, enabling WSL, and removing the (honestly pretty horrible amount of) bundled applications that Windows comes with. Why does my first-party Microsoft hardware come with a FitBit app and Candy Crush? (rhetorical; it’s $$$).

Here’s a snippet of what my Boxstarter script looks like:

# Pre
Disable-UAC

# Set PC name
$computername = "junior"
if ($env:computername -ne $computername) {
    Rename-Computer -NewName $computername
}

# Set DNS upstreams
Set-DNSClientServerAddress -InterfaceIndex $(Get-NetAdapter | Where-object {$_.Name -like "*Wi-Fi*" } | Select-Object -ExpandProperty InterfaceIndex) -ServerAddresses "8.8.8.8", "1.1.1.1", "2001:4860:4860::8888", "2001:4860:4860::8844"

# Set environment variables
setx GOPATH "$env:USERPROFILE\go"
setx WSLENV "$env:WSLENV`:GOPATH/p:USERPROFILE/p"

# Install applications
choco install -y sysinternals
choco install -y vscode
choco install -y googlechrome.dev
choco install -y 1password
choco install -y docker-for-windows
choco install -y cmdermini
choco install -y discord
choco install -y spotify
choco install -y dropbox
choco install -y adobereader
choco install -y 7zip.install
choco install -y firacode

# WSL
choco install -y Microsoft-Hyper-V-All -source windowsFeatures
choco install -y Microsoft-Windows-Subsystem-Linux -source windowsfeatures
Invoke-WebRequest -Uri https://aka.ms/wsl-ubuntu-1804 -OutFile ~/Ubuntu.appx -UseBasicParsing
Add-AppxPackage -Path ~/Ubuntu.appx

RefreshEnv
Ubuntu1804 install --root
Ubuntu1804 run apt update
Ubuntu1804 run apt upgrade

# System-level configuration
Disable-BingSearch
Disable-GameBarTips

Set-WindowsExplorerOptions -EnableShowHiddenFilesFoldersDrives -EnableShowProtectedOSFiles -EnableShowFileExtensions
Set-TaskbarOptions -Size Small -Dock Bottom -Combine Full -Lock
Set-TaskbarOptions -Size Small -Dock Bottom -Combine Full -AlwaysShowIconsOn

You’re still going to need to write some PowerShell for more advanced things (e.g. setting DNS servers), but you might also consider that a blessing, given its power.

Within WSL I’m using Linuxbrew, a fork of Homebrew (on track to merge back into it), in cases where I need more cutting-edge packages beyond the Ubuntu repositories. Using the same brew install workflow I’m used to on macOS is pretty nice, and makes for a friendlier development environment without having to add package-specific repositories or build from source.

Docker

Not much has changed from last time: it works, with a few minor problems.

The docker CLI inside WSL can talk to Docker for Windows (the daemon), so you get Hyper-V benefits there. The catch is that the CLI doesn’t know how to validate the certificates used by the daemon, and thus you either need to disable TLS for connections over localhost (bad), or do a cert-generation dance and edit the Docker for Window config file by hand to use these new certs. It’d be great if the Docker daemon did this for you, so you could just set DOCKER_CERT_PATH=/mnt/c/ProgramData/Docker/pki and have things work securely.
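
Assuming you’ve done the cert-generation dance, the WSL side then only needs a few environment variables to point at the Windows daemon. A sketch of my setup (2376 is Docker’s conventional TLS port, and the cert path mirrors the one suggested above):

```shell
# Point the docker CLI inside WSL at the Docker for Windows daemon over TLS.
export DOCKER_HOST=tcp://localhost:2376
export DOCKER_TLS_VERIFY=1
export DOCKER_CERT_PATH=/mnt/c/ProgramData/Docker/pki

# Confirm what the docker CLI will see.
env | grep ^DOCKER_
```

Dropping these lines into your shell profile (~/.bashrc or similar) makes the configuration stick across sessions.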

As a reminder, you don’t get Hyper-V support without Windows Pro, which impacts both Linux Containers on Windows and Windows Containers on Windows (unless you want to use VirtualBox).

What I Miss

I miss FileVault and Apple’s push towards securing the device, especially with their recent Secure Enclave-based improvements: a benefit of verticalizing, really. Windows’ BitLocker continues to be untrustworthy, and I’d be far more worried about a lost Windows machine vs. a lost macOS machine. BitLocker is also awkwardly positioned as a Windows 10 Pro only feature, which in 2018 is very much the wrong thing to nickel-and-dime users over. It’s frustrating to buy a Surface Pro and then have to dole out $99 for the Windows Pro upgrade.

macOS’ community of power-user tooling is also unsurpassed: the aforementioned Alfred App as a powerful search tool, great screen-capture tools, Preview.app (the Windows PDF editor landscape is not good), Quick Look, some fantastic design tools, Automator (still good!), easy keyboard shortcut customization (no RegEdit or third-party tools), consistent keyboard shortcuts, upper quartile battery life due to tight software-hardware integration, and a single filesystem no matter whether you’re in a Cocoa app on macOS or a cross-compiled GNU tool inside iTerm2. There’s room for improvement here in both Windows-itself & WSL-land, but much of it is around developer community, and that’s a hard win.

I also want to say that I don’t share the “macOS” is dead sentiment that others do, and that hasn’t been the driver for the change. It’s just that some alternatives have finally started to close the gap, both in terms of software experience & hardware quality/support, and I was in the position to experiment with them.

Why Not All-In on Linux?

I’ll keep this short: I still depend on Lightroom, writing tools (Notion, Evernote prior), a solid default desktop environment, first-party hardware support (be it a MacBook or Surface) & battery life, and most of all, my time. I respect those who’ve invested the time into maintaining & automating a full Linux environment they can use daily, but I just don’t have the time for that investment nor am I ready to make the trade-offs required for it. To each their own.

So, Are You Going to Stick with WSL?

Before I answer: I’d love to see a few things improve, and although I think they will, some improvements will be challenging given that the WSL and Windows environments are distinct. Specifically:

  • Better interaction between filesystems; if I could access my WSL root partition via a (default, NFS) mount in Windows, then I’d have access both ways. Something like //wsl/ or //linux would be fantastic. For contrast, the Linux container environment within ChromeOS (“Crostini”) exposes your files into the native ChromeOS environment, and thus makes working on data across both OS’ a less disruptive process.
  • Improved VS Code interactions with WSL-based tools: pointing at language servers and file paths within the WSL environment would be key to this
  • A continued march towards a solid terminal emulator or two; I’m hopeful here thanks to the ConPTY changes. Microsoft contributing resources here would likely benefit the viability of WSL.

So, am I going to continue to use WSL as a dev environment?

The answer is a (reserved) yes, because most of the dev-work I do in it is OSS, exploratory or web-based, with tools that I mostly control. If I’d been dealing with the heavily Dockerized environment at my old job, and writing/debugging lots of Lua, the answer might be closer to “no”.

WSL needs another six months of tools development (ConPTY being core to that), and although I’d thought that 6+ months ago, and had hoped the experience would be a little more polished now, at least Microsoft has continued to invest resources into it. I’m not quite convinced that a Linux toolchain makes my life easier than the Darwin-based one in macOS, but here I am.

Still, try asking me again in another 6 months?


Diving Into FiveThirtyEight's "Russian Troll Tweets" Dataset with BigQuery

•••

FiveThirtyEight recently released a dataset of what is believed to be ~3 million tweets associated with “Russian trolls”. These tweets are designed to spread misinformation (let’s not mince words: lies), and ultimately influence voters. If you haven’t read the linked article, I highly suggest you do that before continuing on.

Exploring a ~700MB+ CSV file isn’t hugely practical (it’s since been sharded into < 100MB chunks), and so I’ve made the tweets available as a public dataset via Google’s BigQuery analytics engine. BigQuery has a sizeable free tier of 1TB per month, which should allow a fair bit of exploration, even if you’re a student or if paid services present a challenge for you.

Note: This isn’t a BigQuery & SQL tutorial: for that, take a look at the documentation.

If you’re already familiar with BigQuery & accessing public datasets, then you can simply run the below to start exploring the data:

#standardSQL
SELECT
  author,
  COUNT(*) AS tweets,
  followers
FROM
  `silverlock-bigquery.public_datasets.fivethirtyeight_troll_tweets`
GROUP BY
  author,
  followers
ORDER BY
  tweets DESC,
  followers DESC

For everyone else: read on.

Accessing the Dataset

We’re going to use the BigQuery web UI, so navigate to the BigQuery interface and select the project you want to access it from. You’ll see the fivethirtyeight_russian_troll_tweets dataset appear on the left-hand side, in the Resources tab. From there, you can inspect the russian_troll_tweets table, look at the schema (also pasted below), and see a preview of the data.

name                type       mode
external_author_id  FLOAT      NULLABLE
author              STRING     NULLABLE
content             STRING     NULLABLE
region              STRING     NULLABLE
language            STRING     NULLABLE
publish_date        TIMESTAMP  NULLABLE
harvested_date      TIMESTAMP  NULLABLE
following           INTEGER    NULLABLE
followers           INTEGER    NULLABLE
updates             INTEGER    NULLABLE
post_type           STRING     NULLABLE
account_type        STRING     NULLABLE
new_june_2018       INTEGER    NULLABLE
retweet             INTEGER    NULLABLE
account_category    STRING     NULLABLE

So with the data above, what can we do? We can look at how these tweets were amplified (updates), what language the tweet was posted in (what audience was it for?), and the direct audience of the account (followers). We don’t get details on the followers themselves however, which makes it hard to know how impactful the reach was: is it trolls/bots followed by other trolls, or members of the general Twitter populace?

Analyzing It

OK, let’s take a quick look at the data to get you thinking about it. We’ll answer:

  • Was there a specific account with a non-negligible fraction of tweets?
  • Which months saw the most activity?
  • Which tweets were the most amplified in each language?

-- Was there a specific account with a non-negligible fraction of tweets?
SELECT
  author,
  COUNT(*) AS count,
  FORMAT("%.2f", COUNT(*) / (
    SELECT
      COUNT(*)
    FROM
      `silverlock-bigquery.public_datasets.fivethirtyeight_troll_tweets`) * 100) AS percent
FROM
  `silverlock-bigquery.public_datasets.fivethirtyeight_troll_tweets`
GROUP BY
  author
ORDER BY
  percent DESC
LIMIT
  10

The EXQUOTE account was definitely a sizeable contributor, although there’s not an order-of-magnitude difference across the top 10.

author           count  percent
EXQUOTE          59652  2.01
SCREAMYMONKEY    44041  1.48
WORLDNEWSPOLI    36974  1.24
AMELIEBALDWIN    35371  1.19
TODAYPITTSBURGH  33602  1.13
SPECIALAFFAIR    32588  1.10
SEATTLE_POST     30800  1.04
FINDDIET         29038  0.98
KANSASDAILYNEWS  28890  0.97
ROOMOFRUMOR      28360  0.95
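
As a quick sanity check, EXQUOTE’s share lines up with the corpus size mentioned earlier: 59,652 tweets at 2.01% of the total implies roughly three million tweets overall.

```shell
# 59652 tweets at 2.01% of all tweets implies the corpus size.
total=$(awk 'BEGIN { printf "%.0f", 59652 / 0.0201 }')
echo "$total"   # ~2.97 million
```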

-- Which months saw the most activity?
SELECT
  FORMAT("%d-%d", EXTRACT(month
    FROM
      publish_date), EXTRACT(year
    FROM
      publish_date) ) AS date,
  COUNT(*) AS count
FROM
  `silverlock-bigquery.public_datasets.fivethirtyeight_troll_tweets`
GROUP BY
  date
ORDER BY
  count DESC
LIMIT
  10

Unsurprisingly here, we see October 2016 (just prior to the election on Nov 8th) feature prominently, as well as August 2017, in which the North Korean conversation escalated immensely.

date     count
8-2017   191528
12-2016  155560
10-2016  152115
7-2015   145504
4-2017   136013
1-2017   135811
11-2015  132306
3-2017   128483
11-2016  123374
8-2015   119454

-- Which tweets were the most amplified (likes, retweets) by language?
SELECT
  language,
  content,
  updates
FROM (
  SELECT
    language,
    content,
    updates,
    RANK() OVER (PARTITION BY language ORDER BY updates DESC) AS tweet_rank
  FROM
    `silverlock-bigquery.public_datasets.fivethirtyeight_troll_tweets`
  GROUP BY
    language,
    updates,
    content ) troll_tweets
WHERE
  tweet_rank = 1
GROUP BY
  language,
  content,
  updates
ORDER BY
  updates DESC
LIMIT
  10

I’ll leave analyzing these tweets as an exercise for the reader, but they certainly appear to prey on hot-button issues in a few places. Note that I’ve truncated the output here for brevity, and be mindful of any links you follow: I have not vetted them.

language         truncated_content                                   updates
English          ‘@JustinTrudeau Mr. Trudeau, Canadian citizens dem  166113
Turkish          KARMA, KARMA, KARMA!!! https://t.co/Eh5XUyILeJ      165833
Catalan          ‘@HCDotNet Excellent! 🇺🇸👠🠻😆’                      165751
Farsi (Persian)  Shameful https://t.co/rll2JrUzRI                    165468
Dutch            Trump’s tweets. #ThingsITrustMoreThanCNN https:/    165407
Norwegian        #2018PredictionsIn5Words Pro-Trump landslide        165371
Vietnamese       So sad. @TitosVodka rocks!! https://t.co/sWtLlZxL5  164288
Lithuanian       Stump for Trump @Stump4TrumpPac https://t.co/S0NS9  164082
Estonian         #QAnon @Q #FOLLOWTHEWHITERABBIT 🠇 #FLYSIDFLY#      163448
Croatian         ‘@FoxNews @rayann2320 @POTUS Bravo Mr President!!’  163126

Wrap

There’s a lot of data to explore here, but it’s also worth keeping in mind that three (3) million tweets is only a small fraction of tweets associated with this kind of content, and this kind of bounded data collection may have some subjectivity to it.

If you have any questions about the dataset itself, you should open an issue on FiveThirtyEight’s GitHub repository. As for questions about exploring it via BigQuery: feel free to tweet @elithrar with your questions or explorations!


From Firestore to BigQuery with Firebase Functions

•••

In building my sentiment analysis service, I needed a way to get data into BigQuery + Data Studio so I could analyze trends against pricing data. My service (on App Engine) uses Firestore as its primary data store, treating it as an append-only log of all analysis runs to date.

The flexible schema (especially during development), solid Go client library & performance story were major draws, but one of the clear attractions was being able to trigger an external Firebase Function (Cloud Function) on Firestore events. Specifically, I wanted to get the results of each analysis run into BigQuery so I could run queries & set up Data Studio visualizations as-needed.

I wrote a quick function that:

  • Triggers on each onCreate event to Firestore
  • Pulls out the relevant fields I wanted to analyze in BigQuery: counts, aggregates and the search query used
  • Inserts them into the configured BigQuery dataset & table.

With that data in BigQuery, I’m able to pull it into Data Studio, generate charts & analyze trends over time.

Creating the Function

If you haven’t created a Firebase Function before, there’s a great Getting Started guide that steps you through installing the SDK, logging in, and creating the scaffolding for your Function.

Note: Firebase Functions initially need to be created & deployed via the Firebase CLI, although it sounds like Google will support the Firebase-specific event types within Cloud Functions & the gcloud SDK (CLI) in the not-too-distant future.

Within index.js, we’ll require the necessary libraries, and export our sentimentsToBQ function. This function has a Firestore trigger: specifically, it triggers when any document that matches /sentiment/{sentimentID} is created (onCreate). The {sentimentID} part is effectively a wildcard: it means “any document under this path”.

const functions = require("firebase-functions")
const BigQuery = require("@google-cloud/bigquery")

exports.sentimentsToBQ = functions.firestore
  .document("/sentiments/{sentimentID}")
  .onCreate(event => {
    console.log(`new create event for document ID: ${event.data.id}`)

    // Set via: firebase functions:config:set centiment.{dataset,table}
    let config = functions.config()
    let datasetName = config.centiment.dataset || "centiment"
    let tableName = config.centiment.table || "sentiments"
    let bigquery = new BigQuery()

We can use the Firebase CLI to override the config variables that define our dataset & table names as needed via firebase functions:config:set centiment.dataset "centiment" - useful if we want to change the destination table during a migration/copy.

let dataset = bigquery.dataset(datasetName)
dataset.exists().catch(err => {
  console.error(
    `dataset.exists: dataset ${datasetName} does not exist: ${JSON.stringify(
      err
    )}`
  )
  return err
})

let table = dataset.table(tableName)
table.exists().catch(err => {
  console.error(
    `table.exists: table ${tableName} does not exist: ${JSON.stringify(err)}`
  )
  return err
})

We check that the destination dataset & table exist - if they don’t, we return an error. In some cases you may want to create them on-the-fly, but here we expect that they exist with a specific schema.
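If you did want to create the dataset or table on the fly, the pattern looks something like the sketch below. The `ensureExists` helper is my own illustration, not code from the original function: its argument is anything with the `exists()`/`create()` shape of the Node.js BigQuery client’s Dataset and Table objects.

```javascript
// Sketch: create the resource if it doesn't exist, rather than failing.
// `resource` stands in for a Dataset or Table from @google-cloud/bigquery,
// whose exists() resolves with a [boolean] and whose create() provisions it.
function ensureExists(resource) {
  return resource.exists().then(([exists]) => {
    if (!exists) {
      // In production you'd pass a schema when creating a table.
      return resource.create()
    }
  })
}
```

With this in place, the function could call `ensureExists(dataset)` and then `ensureExists(table)` before inserting, at the cost of an extra round-trip on every invocation.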

let document = event.data.data()
document.id = event.data.id

let row = {
  insertId: event.data.id,
  json: {
    id: event.data.id,
    count: document.count,
    fetchedAt: document.fetchedAt,
    lastSeenID: document.lastSeenID,
    score: document.score,
    variance: document.variance,
    stdDev: document.stdDev,
    searchTerm: document.searchTerm,
    query: document.query,
    topic: document.topic,
  },
}

The event.data.data() method returns the current state of the Firestore document, which is what we want to insert. The previous state of the document can also be accessed via event.data.previous.data(), which could be useful if we were logging specific deltas (say, a field changes by >= 10%) or otherwise tracking per-field changes within a document.

Note that we define an insertId to prevent duplicate rows in the event the function fails to stream the data and has to retry. The insertId is simply the auto-generated document ID that Firestore provides; since our application treats Firestore as an append-only log, it’s exactly what we want to de-duplicate on should a record be inserted twice. If we expected multiple writes to a record every minute, and wanted to stream each of those to BigQuery as distinct rows, we would need a different approach.

Beyond that, we compose an object with explicit columnName <=> fieldName mappings, based on our BigQuery schema. We don’t need every possible field from Firestore - only the ones we want to run analyses on. Further, since Firestore has a flexible schema, new fields added to our Firestore documents may not exist in our BigQuery schema.
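That mapping can also be factored into a small pure function. The `toRow` helper and `BQ_FIELDS` whitelist below are illustrative (my names, not from the original post): fields absent from the whitelist are simply dropped, so new Firestore fields can’t break the insert.

```javascript
// Only the columns our BigQuery schema knows about; anything else
// in the Firestore document is ignored.
const BQ_FIELDS = [
  "count", "fetchedAt", "lastSeenID", "score",
  "variance", "stdDev", "searchTerm", "query", "topic",
]

// Build a raw insert payload from a Firestore document ID + its data.
function toRow(id, document) {
  let json = { id: id }
  for (let field of BQ_FIELDS) {
    if (document[field] !== undefined) {
      json[field] = document[field]
    }
  }
  return { insertId: id, json: json }
}
```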

The last part of our function is responsible for actually inserting the row into BigQuery: we call table.insert and set raw: true in the options, since we’re passing a row directly:

return table.insert(row, { raw: true }).catch(err => {
  console.error(`table.insert: ${JSON.stringify(err)}`)
  return err
})

As table.insert is a Promise, we should return the Promise itself, which will either resolve (success) or reject (failure). Because we don’t need to do any post-processing in the success case, we only explicitly handle the rejection, logging the error and returning it to signal completion. Not returning the Promise would cause the function to return early, and potentially prevent execution or error handling of our table.insert. Not good!
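As a toy illustration of the difference (nothing here is Cloud Functions API; the names are mine): a handler that returns its Promise gives the runtime something to await, while one that doesn’t returns `undefined` immediately.

```javascript
// Returns the insert's Promise: the caller (the Functions runtime)
// waits for it to settle, and our .catch still runs.
function goodHandler(insert) {
  return insert().catch(err => {
    console.error(`insert failed: ${err.message}`)
    return err // resolve with the error to signal completion
  })
}

// Fires the insert but returns undefined: the runtime has nothing
// to wait on and may tear the instance down mid-flight.
function badHandler(insert) {
  insert().catch(err => console.error(`insert failed: ${err.message}`))
}
```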

Deploying

Deploying our function is straightforward:

# Deploys our function by name
$ firebase deploy --only functions:sentimentsToBQ

=== Deploying to 'project-name'...
i  deploying functions
i  functions: ensuring necessary APIs are enabled...
✔  functions: all necessary APIs are enabled
i  functions: preparing _functions directory for uploading...
i  functions: packaged _functions (41.74 KB) for uploading
✔  functions: _functions folder uploaded successfully
i  functions: current functions in project: sentimentsToBQ
i  functions: uploading functions in project: sentimentsToBQ
i  functions: updating function sentimentsToBQ...
✔  functions[sentimentsToBQ]: Successful update operation.

Deployment takes about 10 - 15 seconds, but I’d recommend using the local emulator to ensure the function behaves as expected.

Querying in BigQuery

So how do we query our data? We can use either the BigQuery console or the bq CLI. We’ll use the command-line tool here; the query is the same either way:

bq query --nouse_legacy_sql 'SELECT * FROM `centiment.sentiments` ORDER BY fetchedAt LIMIT 5;'
Waiting on bqjob_r1af4578a67b94241_000001618c40385c_1 ... (1s)
Current status: DONE

+----------------------+---------+---------------------+-------+
|          id          |  topic  |        score        | count |
+----------------------+---------+---------------------+-------+
| PSux4gwOsHyUGqqdsdEI | bitcoin | 0.10515464281605692 |    97 |
| ug8Zm5sSZ2dtJXPIQWKj | bitcoin |  0.0653061231180113 |    98 |
| 63Qo2gRgsG7Cz2zywKOO | bitcoin | 0.09264705932753926 |    68 |
| Y5sraBzPrhBzsmOyHcm3 | bitcoin | 0.06601942062956613 |   103 |
| r3XApKXJ6feglUcyG1db | bitcoin | 0.13238095435358221 |   105 |
+----------------------+---------+---------------------+-------+
# Note that I've reduced the number of columns returned so it fits in the blog post

We can now see the results that we originally wrote to Firestore, and run aggregations, analyses and/or export them to other formats as needed.

(Screenshot: sentiment analysis results charted in Data Studio)

The Code

For the record, here’s the full function as it is in production at the time of writing:

const functions = require("firebase-functions")
const BigQuery = require("@google-cloud/bigquery")

exports.sentimentsToBQ = functions.firestore
  .document("/sentiments/{sentimentID}")
  .onCreate(event => {
    console.log(`new create event for document ID: ${event.data.id}`)

    // Set via: firebase functions:config:set centiment.{dataset,table}
    let config = functions.config()
    let datasetName = config.centiment.dataset || "centiment"
    let tableName = config.centiment.table || "sentiments"
    let bigquery = new BigQuery()

    let dataset = bigquery.dataset(datasetName)
    dataset.exists().catch(err => {
      console.error(
        `dataset.exists: dataset ${datasetName} does not exist: ${JSON.stringify(
          err
        )}`
      )
      return err
    })

    let table = dataset.table(tableName)
    table.exists().catch(err => {
      console.error(
        `table.exists: table ${tableName} does not exist: ${JSON.stringify(
          err
        )}`
      )
      return err
    })

    let document = event.data.data()
    document.id = event.data.id

    let row = {
      insertId: event.data.id,
      json: {
        id: event.data.id,
        count: document.count,
        fetchedAt: document.fetchedAt,
        lastSeenID: document.lastSeenID,
        score: document.score,
        variance: document.variance,
        stdDev: document.stdDev,
        searchTerm: document.searchTerm,
        query: document.query,
        topic: document.topic,
      },
    }

    return table.insert(row, { raw: true }).catch(err => {
      console.error(`table.insert: ${JSON.stringify(err)}`)
      return err
    })
  })


© 2019 Matt Silverlock | His photo journal | Code snippets are MIT licensed | Built with Jekyll