Quick & dirty social media scraping

In my work to examine social media use, particularly social photography, I have been faced with the challenge of collecting and managing data from social networks. Very often, accessing a social network through its official web, desktop or mobile apps fails to provide the features necessary for research. This is somewhat obvious, since social networks are designed to support use, not research into their use. In addition, many social networks, such as Twitter, limit the availability of data through searches to fairly recent time frames. After exploring a number of different approaches using official and third-party interfaces to networks such as Flickr and Instagram, I realized that the best way forward was to capture as much of the data I wanted as I could from a network and then be free to experiment with different ways of managing and presenting it. By adapting existing social media scraping tools and combining them with my own simple inventions, I have created several home-brew ways of collecting and managing data from social networks.

Basic approach

My goal with scraping social media is to extract as much relevant data as possible to a format that restricts later processing and analysis as little as possible. To that end, I decided that a simple spreadsheet was probably the best way to store data. Since many social networks restrict the window of data that is accessible, I also decided that a solution which updated automatically, without my needing to trigger a process, would be best. Based on these criteria, it became evident that a so-called cloud-based process would meet my needs. I decided to use the spreadsheet app that is part of Google Drive since it exists in the cloud and effectively runs all the time. For social media scraping, Google Spreadsheets has the major advantage that it can populate a spreadsheet from a feed in such common formats as XML, RSS and JSON, and a script can be written that checks the feed automatically at an interval defined by the user. With this key solution that allows data to be constantly fed into a spreadsheet where it can be filtered, organized and stored, the next challenge is to get a specific social network to provide data in the form of a usable stream.
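To illustrate what this feed-to-spreadsheet step amounts to, here is a minimal Python sketch. It is not part of my actual workflow, which runs entirely inside Google Spreadsheet, but it shows the same idea: flatten the items of an RSS feed into rows and write them out as CSV, a format any spreadsheet app can open. The sample feed is hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

def rss_to_rows(rss_text):
    """Flatten the <item> entries of an RSS feed into spreadsheet-style rows."""
    root = ET.fromstring(rss_text)
    rows = [["title", "link", "pubDate"]]  # header row
    for item in root.iter("item"):
        rows.append([
            (item.findtext("title") or "").strip(),
            (item.findtext("link") or "").strip(),
            (item.findtext("pubDate") or "").strip(),
        ])
    return rows

# A tiny hypothetical feed, standing in for a real social media search feed
SAMPLE = """<rss version="2.0"><channel>
  <item><title>Photo one</title><link>http://example.com/1</link>
        <pubDate>Mon, 01 Apr 2013 10:00:00 GMT</pubDate></item>
  <item><title>Photo two</title><link>http://example.com/2</link>
        <pubDate>Tue, 02 Apr 2013 11:30:00 GMT</pubDate></item>
</channel></rss>"""

rows = rss_to_rows(SAMPLE)
buf = io.StringIO()
csv.writer(buf).writerows(rows)  # CSV opens directly in any spreadsheet app
```

A scheduled script (in Google Spreadsheet, an Apps Script time-driven trigger) would simply repeat this fetch-and-append step at the chosen interval.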

image

Automatically generated spreadsheet of Flickr images tagged with ‘göteborg naturhistoriska’

Twitter

Of all the networks I have tried to scrape, Twitter is the one with the most tools available that might be of use to researchers. Several sites such as http://snapbird.org/ archive tweets, allowing searches that go back far longer than Twitter’s own 5 day window. I tried these sites but found them cumbersome to use and limited. Instead, I have been using Martin Hawksey’s Twitter Archiving Google Spreadsheet (TAGS). This set of extensions and scripts for Google Spreadsheet conducts periodic searches of Twitter, archives the results and provides a variety of visualization methods. Using Twitter’s advanced search operators, it is possible to create a self-updating archive for quite complex searches, e.g. tweets mentioning ‘obama’, sent within a 5 km radius of Washington DC, that contain hyperlinks: https://twitter.com/search/realtime?q=obama+near%3A+washington+within%3A5km+filter%3Alinks&src=typd
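For readers unfamiliar with the operator syntax, a search like the one above is just a keyword plus a few `operator:value` pairs joined with spaces and URL-encoded. A hypothetical helper that assembles such a search URL might look like this (the function and parameter names are my own invention):

```python
from urllib.parse import urlencode

def build_twitter_search(keyword, near=None, within_km=None, links_only=False):
    """Assemble a Twitter advanced-search URL from a keyword and operators."""
    parts = [keyword]
    if near:
        parts.append("near:" + near)          # geographic centre of the search
    if within_km:
        parts.append("within:%dkm" % within_km)  # radius around that centre
    if links_only:
        parts.append("filter:links")          # only tweets containing hyperlinks
    query = " ".join(parts)
    return "https://twitter.com/search/realtime?" + urlencode({"q": query})

url = build_twitter_search("obama", near="washington", within_km=5, links_only=True)
```

The resulting URL is the same kind of query string that can be pasted into TAGS as the archive’s search term.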

image

Archived tweets mentioning ‘universeum’ visualized through Martin Hawksey’s TAGS Google Spreadsheet extensions

Instagram

For Twitter I was able to find a relatively complete solution for my needs. Instagram, on the other hand, is much newer and required me to combine a variety of web-based services to build a functional solution.

To scrape instagrams based on my own search criteria and automatically archive them I use the following process:

  • Create an IFTTT (If This Then That) recipe that searches Instagram for particular keywords at a particular interval, then adds the results as new lines in a Google Spreadsheet. Here is an example recipe that searches Instagram for ‘universeum’ and filters the information provided into columns in a Google Spreadsheet.

To scrape instagrams taken at a specific location, I use this process:

  • Define a geographic search using spots.io. This tool allows you to specify a location such as a park or museum within a particular city and receive all the instagrams geotagged with the corresponding coordinates.
  • Spots.io provides RSS feeds for searches. This means that it is possible to take the RSS feed for a geo-search and feed it into an IFTTT recipe in the same way as a straight Instagram search. Simply create a recipe that is triggered every time an Instagram is added to the spots.io RSS feed that writes the results to a Google Spreadsheet.
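Spots.io does the geographic filtering for you, but the underlying idea is straightforward: keep only items whose geotag falls within some radius of a chosen point. As a sketch of that test, here is the standard haversine great-circle distance check in Python (the museum coordinates below are approximate and only for illustration):

```python
import math

def within_radius(lat, lon, center_lat, center_lon, radius_km):
    """Haversine test: is (lat, lon) within radius_km of the centre point?"""
    r = 6371.0  # mean Earth radius in km
    phi1 = math.radians(center_lat)
    phi2 = math.radians(lat)
    dphi = math.radians(lat - center_lat)
    dlmb = math.radians(lon - center_lon)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a)) <= radius_km

# Gothenburg Natural History Museum sits at roughly 57.69 N, 11.95 E (approximate)
keep = within_radius(57.691, 11.951, 57.69, 11.95, 1.0)  # ~0.1 km away: inside
drop = within_radius(57.70, 12.10, 57.69, 11.95, 1.0)    # ~9 km away: outside
```

A service like spots.io applies essentially this test to each geotagged item before placing it in the RSS feed.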

image

Results for Instagrams geotagged at Gothenburg Natural History Museum

Flickr

To extract information from Flickr I use a similar process to the one I use with Instagram:

  • Define a search with Steven DeGraeve’s Flickr RSS Feed Generator.
  • Use the resulting RSS feed to create an IFTTT recipe that filters the data and creates useful columns in a Google Spreadsheet.
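For reference, Flickr also serves public photo feeds directly, and a generator like DeGraeve’s essentially builds URLs of that kind for you. A sketch of constructing such a feed URL follows; the endpoint and parameter names are taken from Flickr’s public-feed documentation as I understand it, but may change over time:

```python
from urllib.parse import urlencode

def flickr_public_feed(tags, fmt="rss2"):
    """Build a URL for Flickr's public photo feed, filtered by tags."""
    base = "https://api.flickr.com/services/feeds/photos_public.gne"
    # tags are comma-separated; format selects RSS 2.0, Atom, JSON etc.
    return base + "?" + urlencode({"tags": ",".join(tags), "format": fmt})

url = flickr_public_feed(["göteborg", "naturhistoriska"])
```

The resulting RSS URL can be dropped straight into an IFTTT feed trigger, just like the spots.io feed above.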

image

Steven DeGraeve’s Flickr RSS Generator

Other social media

The basic process I outline here is a simple one. It takes advantage of the cloud-based nature of Google Spreadsheets to automatically assemble data from social networks. In principle, any information that can be rendered as XML, RSS or JSON can be fed into a spreadsheet. The trick is finding or creating a tool that can produce such a feed. In many cases, feeds can be found that were created for completely different purposes, e.g. DeGraeve’s Flickr feed generator, which was created to let users make live-updating slideshows from Flickr images. Beyond regular Google searches for ways to turn social media into a usable feed, tools for scraping specific resources can be found at https://scraperwiki.com/ and https://wiki.digitalmethods.net/Dmi/ToolDatabase.

Geocaching in kindergarten

Here is a really detailed site developed by Gothenburg kindergarten teacher Ann-Charlotte Keiller, who is geocaching with her students. It’s fantastic to see 4-6 year olds working together using GPS and solving clues to find caches.

Link in Swedish - English available through Google Translate button at bottom right corner of screen.

Automatically synchronized audio and notes

Here’s an idea for interviews. AudioNote is an iOS app designed for people who want to take notes during lectures. It records audio and provides a note pad. Every time you make an entry on the note pad, it tags the point in the audio recording at which the entry was made. The result is written notes that are synchronized with the audio. Used during interviews instead, AudioNote offers a useful way to make notes without needing to manually record times or describe where in the interview something happened. Managing an interview is already a difficult task, and trying to take notes makes it even harder. Anything that can make the process easier is probably worth a try.

iOS video-screen-capture part 2

Apple seems to have a problem with people capturing screen-video directly from an iOS device so they are not approving apps that allow it. This is a pain for anyone doing research that could benefit from screen-videos (like me).

Looking for a workaround, I came across Matt Galligan’s really well put together instructions for using AirPlay to share an iOS device’s screen to a Mac. By setting a Mac up as an AirPlay receiver, it is possible to share an iOS screen and then do the capturing on the Mac side. It turns out to be a fairly robust way to record and has the added benefit of being wireless (within a given Wi-Fi network).