# Lab 1 - Data Collection - Querying Web APIs

---

**Date:**

**Group:**
 - *Student Name 1*
 - *Student Name 2*
---

In this first lab, we will learn how to connect to different Web APIs, retrieve data from them, and then perform basic processing and analysis of the results.

In [None]:
# General dependencies
# !! run this cell first before any other ones
import sys
import os
import time
import operator
from IPython.display import HTML, IFrame, display
from collections import defaultdict
from pprint import pprint
%matplotlib inline

## Interacting with APIs in Python

### Introduction to `requests` 

In order to programmatically query HTTP APIs, we'll be using the Python `requests` module (http://docs.python-requests.org/en/master/).

You can consult the `requests` [quickstart](http://docs.python-requests.org/en/master/user/quickstart/#quickstart) documentation for more details.

To start with, we need to run the following Python statement to import the module into our notebook:

In [None]:
import requests

Once the module is imported, we can start querying URLs. As a first test, let's look at the Wikipedia page about [*Acoustic Fingerprint*](https://en.wikipedia.org/wiki/Acoustic_fingerprint).

In [None]:
r = requests.get('https://en.wikipedia.org/wiki/Acoustic_fingerprint')

Some more details about this request:
* We are performing an HTTP `GET` request (`.get(...)`), to retrieve data from the server at the given URL. There are several types of HTTP methods available to interact with an HTTP service. You can consult https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol#Request_methods and https://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html for more details.
* You can get details about the `requests.get` method as follows:

In [None]:
help(requests.get)

Let's have a look at the reply from the server:

In [None]:
r

The response returned is an object of type `requests.models.Response`.

In [None]:
type(r)

The documentation is available at http://docs.python-requests.org/en/latest/api/#requests.Response or inline in this notebook:

In [None]:
help(r)

### HTTP Status & Errors
As you can see, one of the attributes is `status_code`, which can be used to test whether the query was successful:

In [None]:
print('Status code:', r.status_code)

You can consult the list of HTTP status codes at http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html or https://en.wikipedia.org/wiki/List_of_HTTP_status_codes for more details.

Status codes in the *2xx* family indicate success, *4xx* codes indicate an error that the caller may be able to fix by modifying the request or its parameters (for instance by providing a query parameter or adding authentication), and *5xx* codes indicate a server error.

The status code should be checked to make sure that the request completed correctly. Additional information about the error may be passed by the server in the content of the reply.
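As a minimal sketch of the status code families described above, the helper below maps a numeric code to its family. The name `status_family` is ours and is not part of `requests`; the example only uses the standard library, so it runs without network access.

```python
from http import HTTPStatus


def status_family(code):
    """Return a short label for the family an HTTP status code belongs to."""
    if 100 <= code < 200:
        return 'informational'
    if 200 <= code < 300:
        return 'success'
    if 300 <= code < 400:
        return 'redirection'
    if 400 <= code < 500:
        return 'client error'
    if 500 <= code < 600:
        return 'server error'
    return 'other'


for code in (200, 404, 503):
    # HTTPStatus(code).phrase is the standard reason phrase for the code
    print(code, HTTPStatus(code).phrase, '->', status_family(code))
```

With a real response, the same helper could be called as `status_family(r.status_code)`.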

Alternatively, a Python exception can be raised if the status code is not a successful one:

In [None]:
r.raise_for_status() 

This doesn't raise a Python exception since the server returned a 200 status code. However, if we request a non-existent page, for instance:

In [None]:
r_error = requests.get('https://en.wikipedia.org/wiki/Acoustic_fingerprint3')
r_error.raise_for_status()

An `HTTPError` exception from the `requests` module is raised. It corresponds to the standard 404 (Not Found) status code, meaning there is no resource at the given URL.

In [None]:
help(requests.exceptions.HTTPError)
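In practice, the exception is usually caught rather than left to stop the notebook. The sketch below shows one possible way to do this; the helper name `describe_failure` is hypothetical, and the response is built by hand (an assumption made only so the example runs offline) instead of coming from a real server.

```python
import requests


def describe_failure(response):
    """Return None if the response is a success, else a short error description."""
    try:
        response.raise_for_status()
    except requests.exceptions.HTTPError as err:
        # The exception keeps a reference to the response that triggered it
        return 'HTTP %d: %s' % (err.response.status_code, err)
    return None


# Hand-built response standing in for a real 404 reply (offline demo only)
fake = requests.Response()
fake.status_code = 404
fake.reason = 'Not Found'
fake.url = 'https://example.org/missing'

print(describe_failure(fake))
```

The same helper could be applied to `r_error` from the cell above.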

In addition to the status code, the server may also provide some information in the reply:

In [None]:
r_error.content

In this case, the server replies with an HTML page containing instructions for the end user. See https://en.wikipedia.org/wiki/Acoustic_fingerprint3 for instance:

In [None]:
HTML(r_error.text)

### HTTP Headers
An interesting set of properties of the reply is the [HTTP Headers](https://en.wikipedia.org/wiki/List_of_HTTP_header_fields). They contain different kinds of information from the server: some are standard for any HTTP request, others may be specific to the API, the server or the reply.

In [None]:
r.headers

Among all this information, we can note in particular the headers that indicate:
 - The type of content returned by the server: `Content-Type`
 - The language of the content: `Content-Language`
 - The last modified date of the content: `Last-Modified`

In [None]:
print('Content-Type:', r.headers['Content-Type'])
print('Content-Language:', r.headers['Content-Language'])
print('Last-Modified:', r.headers['Last-Modified'])
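The `headers` attribute of a response is a case-insensitive dictionary, so the capitalisation of a header name does not matter when reading it. The sketch below illustrates this with the `CaseInsensitiveDict` class from `requests`, filled with sample values invented for the example rather than taken from a real reply.

```python
from requests.structures import CaseInsensitiveDict

# Sample headers, invented for illustration; a real reply carries many more
headers = CaseInsensitiveDict({
    'Content-Type': 'text/html; charset=UTF-8',
    'Content-Language': 'en',
})

# Lookups ignore the case of the header name
print(headers['content-type'])
print(headers['CONTENT-LANGUAGE'])

# .get() avoids a KeyError when the server did not send a header
print(headers.get('Last-Modified', 'not provided'))
```

Using `.get()` with a default is a reasonable habit, since not every server sends every header.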

Headers can also be set when requesting data from a server and the server may use them to modify the reply being returned.

One of them is the `Accept` header, which tells the server what types of data the client supports (HTML, XML, JSON, ...).

An `Accept: application/json` header set by the client tells the server that the result should be returned in JSON format. The server may not honor it if it does not support this behaviour.

Another common header is `User-Agent`, generally filled in by web browsers (Firefox, Chrome, Internet Explorer, Safari, ...), which identifies the client requesting the URL.

To set headers when requesting data, pass a `headers` dictionary to the `.get()` method:

In [None]:
r = requests.get('https://en.wikipedia.org/wiki/Acoustic_fingerprint',
                 headers = {
                     'User-Agent': 'Python Requests - Scimus'
                 })
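To check which headers a request would carry, it can be prepared without being sent. The sketch below, which is only one way of doing this, uses `requests.Request` and its `prepare()` method to set an `Accept` header and a client-identifying `User-Agent` header, then prints them; no network access is needed.

```python
import requests

# Build the request without sending it, so the example works offline
request = requests.Request(
    'GET',
    'https://en.wikipedia.org/wiki/Acoustic_fingerprint',
    headers={
        'Accept': 'application/json',
        'User-Agent': 'Python Requests - Scimus',
    },
)
prepared = request.prepare()

# List the headers that would be sent with this request
for name, value in prepared.headers.items():
    print('%s: %s' % (name, value))
```

Whether the server honors the `Accept` header can then be verified on a real reply by reading its `Content-Type` header.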

### Retrieve content
Now, let's have a look at the actual content returned by the server.

It is available through the `content` property as a Python `bytes` object (its representation is prefixed by a `b`):

In [None]:
print(type(r.content))
r.content

Since the `Content-Type` HTTP header is marked as text (`text/html; charset=UTF-8`), we can also directly retrieve the content as a string:

In [None]:
r.text

Which we can also display in HTML in this notebook:

In [None]:
HTML(r.text)
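To make the link between `content`, the `Content-Type` header and `text` more concrete, the sketch below decodes some sample bytes with the charset declared in a sample header value. Both values are invented stand-ins for `r.content` and `r.headers['Content-Type']`, and the parsing relies on the standard library rather than on `requests`; this is roughly what `r.text` does for you.

```python
from email.message import Message

# Sample values standing in for r.headers['Content-Type'] and r.content
content_type = 'text/html; charset=UTF-8'
raw_content = '<p>Café</p>'.encode('utf-8')

# The standard library can split a Content-Type value into its parts
msg = Message()
msg['Content-Type'] = content_type
media_type = msg.get_content_type()
charset = msg.get_content_charset()

# Decoding the bytes with the declared charset yields the text
text = raw_content.decode(charset)
print(media_type, charset, text)
```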

### Using an API
In the previous section, we directly queried the https://en.wikipedia.org/wiki/Acoustic_fingerprint URL, which is the one you can consult in a web browser.
The data is returned as HTML, for end-user consumption, and is not necessarily appropriate for programmatic querying. You would have to parse it to get the content, retrieve the comments or issue new requests to access previous versions of the page...

We can use the [MediaWiki API](https://www.mediawiki.org/wiki/API:Main_page) to perform the same actions in a more effective way:
    

In [None]:
r = requests.get('https://en.wikipedia.org/w/api.php',
                 params = {
                     'action': 'query',
                     'titles': 'Acoustic fingerprint',
                     'prop': 'revisions',
                     'rvprop': 'content|user',
                     'format': 'json'
                 })
r.raise_for_status()

* `https://en.wikipedia.org/w/api.php` is the MediaWiki API endpoint
* `params` is a Python `dict` containing the HTTP parameters to submit to the endpoint. `requests` will build a URL equivalent to the following one and send the `GET` request to the server (in the URL actually sent, the space is encoded as `+` and the `|` character as `%7C`): https://en.wikipedia.org/w/api.php?action=query&titles=Acoustic%20fingerprint&prop=revisions&rvprop=content|user&format=json
* the different parameters and their roles are documented in the MediaWiki API. In a nutshell, we *query* the API for pages with the given *titles*, looking for their *revisions*, and ask for the *content* and the *user* of the latest revision of each page. The result is returned as a *JSON* object.
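The way `params` is turned into a query string can be checked without contacting the server. The sketch below prepares the same request with `requests.Request(...).prepare()` and decodes the resulting URL with the standard library; it is meant as an illustration, not as the only way to inspect a URL.

```python
import requests
from urllib.parse import urlsplit, parse_qs

request = requests.Request(
    'GET',
    'https://en.wikipedia.org/w/api.php',
    params={
        'action': 'query',
        'titles': 'Acoustic fingerprint',
        'prop': 'revisions',
        'rvprop': 'content|user',
        'format': 'json',
    },
)
# prepare() builds the final URL but does not contact the server
url = request.prepare().url
print(url)

# Decoding the query string gives back the original parameter values
decoded = parse_qs(urlsplit(url).query)
print(decoded['titles'], decoded['rvprop'])
```

After a real call, the URL that was requested is also available as `r.url`.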

The `Content-Type` header is generally set to `application/json` for JSON APIs:

In [None]:
r.headers['Content-Type']

We can then use the `json()` method to extract the content of the reply as a Python dictionary:

In [None]:
help(r.json)

In [None]:
json_data = r.json()
print(type(json_data))
json_data

We can display it in a nicer way with `pprint`:

In [None]:
pprint(json_data)

It is then easier to extract information from the reply:

In [None]:
for page, data in json_data['query']['pages'].items():
    print('Page Id:', page)
    print('Title:', data['title'])
    print('User:', data['revisions'][0]['user'])
    print('Content length:', len(data['revisions'][0]['*']))

The format and structure of the data returned by an API are specific to each API and to each of its endpoints. This is generally referred to as the *data structure*, *data model* or *schema*.
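Because the schema is specific to each API, code that reads a reply often has to cope with keys that are absent. The sketch below works on a hand-written sample shaped like the reply used above (all values are invented) and uses `dict.get()` so that a page entry without revisions does not raise a `KeyError`. The function name `summarize_pages` is hypothetical.

```python
# Hand-written sample shaped like the reply used above (values are invented)
sample = {
    'query': {
        'pages': {
            '123': {
                'title': 'Acoustic fingerprint',
                'revisions': [{'user': 'ExampleUser', '*': 'Some wiki text'}],
            },
            '-1': {'title': 'Missing page'},  # entry without a 'revisions' key
        }
    }
}


def summarize_pages(json_data):
    """Return one summary dict per page, tolerating missing keys."""
    summaries = []
    for page_id, data in json_data.get('query', {}).get('pages', {}).items():
        revisions = data.get('revisions', [])
        latest = revisions[0] if revisions else {}
        summaries.append({
            'page_id': page_id,
            'title': data.get('title'),
            'user': latest.get('user'),
            'content_length': len(latest.get('*', '')),
        })
    return summaries


for summary in summarize_pages(sample):
    print(summary)
```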

## Exercise 1

The [MusicBrainz](https://musicbrainz.org/) project offers a collection of data and APIs to access data about artists, their releases and various other metadata associated with their work.

Amongst their services, they built an [XML](https://musicbrainz.org/doc/Development/XML_Web_Service/Version_2) and [JSON](https://musicbrainz.org/doc/Development/JSON_Web_Service) API to interact with their database and expose metadata about the artists.

We want to programmatically query their API to build an equivalent of their search and display pages:

In [None]:
IFrame("https://musicbrainz.org/search?query=deep+purple&type=artist&method=indexed", width=800, height=200)

In [None]:
IFrame("https://musicbrainz.org/artist/79491354-3d83-40e3-9d8e-7592d58d790a", width=800, height=200)

### Introduction to the MusicBrainz API

We first need to use the MusicBrainz API to search for artists and identify their unique MBID (MusicBrainz ID). The MBID uniquely identifies an artist and allows us to fetch other associated data (release, release group, work, ...).

**Q:** Reading the [API documentation](https://musicbrainz.org/doc/Development/XML_Web_Service/Version_2#Introduction) for the MusicBrainz WebService, identify the root URL of their API:

In [None]:
# your code here
MBZ_API_ROOT = ...

**Q:** Make a `GET` request to the endpoint to see where it leads, and use the `HTML` display feature of Jupyter to show the page.

In [None]:
# your code here

The `artist` endpoint of the API allows us both to look up artists and to get their details:

In [None]:
MBZ_ARTIST = '%s/artist' % MBZ_API_ROOT.strip('/')
print(MBZ_ARTIST)
r = requests.get(MBZ_ARTIST)
r.raise_for_status()

As you can see, we get a 400 error which means we haven't submitted a valid request to the server. 

**Q:** Display the error returned by the server by printing the content of the reply.

In [None]:
# your code here

**Q:** What is the content type of the reply?

In [None]:
# your answer here

**Q:** Confirm it by displaying the content-type from the reply headers.

In [None]:
# your code here

By following the instructions at https://musicbrainz.org/doc/Development/JSON_Web_Service for the JSON service:

**Q:** Modify the request so that the reply is returned in JSON format and query the same artist endpoint.

In [None]:
# your code here
r = requests.get(MBZ_ARTIST, params = { 
    # set the appropriate parameters
})

**Q:** Display the error message.

In [None]:
# your code here

**Q:** Confirm that the content is in JSON format by inspecting the headers of the response.

In [None]:
# your code here

As described in the documentation, for each type of entity (artist, release, ...), the following syntax can be used to interact with the API:

    lookup:   /<ENTITY>/<MBID>?inc=<INC>
    browse:   /<ENTITY>?<ENTITY>=<MBID>&limit=<LIMIT>&offset=<OFFSET>&inc=<INC>
    search:   /<ENTITY>?query=<QUERY>&limit=<LIMIT>&offset=<OFFSET>
        
The `limit` and `offset` parameters are used in many APIs to page the response, i.e. to limit the number of records returned by each call while still allowing all the records to be traversed:
* `limit` controls the maximum number of elements to be returned in an API call 
* `offset` indicates from where the request should start listing records

This is similar to the SQL `LIMIT` and `OFFSET` keywords.
By incrementing `offset` and issuing a new request to the server, it is then possible to retrieve all the results.
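The paging logic can be sketched independently of any particular API. In the example below, `fetch_page` is a hypothetical in-memory stand-in for one API call (a real API would be queried over HTTP, and its reply would have its own field names), and `fetch_all` advances `offset` by `limit` until every record has been collected.

```python
# In-memory stand-in for a paged API (a real API would be called over HTTP)
RECORDS = ['record-%d' % i for i in range(45)]


def fetch_page(offset, limit):
    """Simulate one API call: return the total count and one slice of records."""
    return {'count': len(RECORDS), 'items': RECORDS[offset:offset + limit]}


def fetch_all(limit=20):
    """Collect every record by advancing the offset until nothing is left."""
    items = []
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        items.extend(page['items'])
        offset += limit
        # Stop once the offset has moved past the total reported by the "server"
        if offset >= page['count'] or not page['items']:
            break
    return items


all_items = fetch_all(limit=20)
print(len(all_items))
```

With 45 records and a `limit` of 20, this makes three calls, at offsets 0, 20 and 40.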

### Searching for artists

The search capability of the API is documented at https://musicbrainz.org/doc/Development/XML_Web_Service/Version_2/Search.

It follows the model previously described:

    search:   /<ENTITY>?query=<QUERY>&limit=<LIMIT>&offset=<OFFSET>
    
**Q:** Write a request that looks for artists named *Deep Purple* and returns the matches in JSON format. Use the `+` character to encode the space.

In [None]:
# your code here

**Q:** Inspect the json object returned and list its keys.

In [None]:
# Your code here

**Q:** Print the first artist returned.

In [None]:
# Your code here

**Q:** Print the list of properties (keys) of the first artist.

In [None]:
# Your code here

**Q:** In the reply, identify how many total matches are being found by the server and compare with the search provided by the MusicBrainz website (https://musicbrainz.org/search?query=deep%2Bpurple&type=artist&method=indexed).

In [None]:
# your code here

**Q:** How many artists are being returned in the reply?

In [None]:
# your code here

Now, we are going to retrieve all the artists using the `offset` parameter to fetch all results in a Python list.

**MusicBrainz is rate limiting API calls, so make sure you add a `time.sleep(1)` before fetching the next batch of results.**

**Q:** Complete/modify the below Python code to retrieve all artists matching the previous search for the terms *Deep Purple*.

In [None]:
# Your code here - we give you the following starting point

# this variable will contain all the artists at the end of the execution
artists = []
# this flag must be set to False when there is no more result to fetch from the API
has_next = True
# This is the initial offset at which we start
offset = 0
# This is the number of records we fetch from the API each time
limit = 20

# more variables if you need them

# Fetch all the list
while has_next:
    # fetch the next batch of data
    r = requests.get(...)
    
    # extract the artists from the reply
    new_artists = ...
    
    # append the new artists to the list
    artists.extend(new_artists)
    
    # jump to the next offset
    offset += limit
    
    # wait to not overwhelm the MusicBrainz API
    time.sleep(1)

**Q:** Using the previous code, write a Python function that returns all the artists matching some given search terms. Run it for a couple of artists and display the number of results.

In [None]:
# Your code here
def search_artists(search_terms):
    artists = []
    # ...
    return artists

In [None]:
artists_1 = search_artists(...)
artists_2 = search_artists(...)
...
print(...)

You may have noticed the `score` field associated with each artist in the result.

**Q:** For the same search for the artist *Deep Purple*, display the score of each of the first 10 artists in the list along with their MBID. Compare with the online search results.

In [None]:
# Your code here

We can also present the retrieved data in HTML format. For this, we can build an HTML page and display it within this notebook with the `HTML()` function.

**Q:** Write a function that builds an HTML page for some given search terms. Similarly to the MusicBrainz search results page, display the result as an HTML table with each artist's name, MBID and search score. Limit the display to the first 25 matches of the input list. Use the previous `search_artists` function to fetch the matching artists.

In [None]:
# Complete / adapt the following code if needed
search_template = """
<html>
<body>
  <h1>Search Results for "%(search_terms)s":</h1>
  <table>
    <thead>
      <tr>
        <th>MBID</th>
        <th>Artist</th>
        <th>Search Score</th>
      </tr>
    </thead>
    <tbody>
      %(search_results)s
    </tbody>
  </table>
</body>
</html>
"""
artist_template = """
<tr>
 <td>%(mbid)s</td>
 <td>%(name)s</td>
 <td>%(score)s</td>
</tr> 
"""

In [None]:
# Example to produce a string given the templates
HTML(artist_template % {'mbid': '1234', 'name': 'The Name', 'score': 40})

In [None]:
# Complete / adapt the following code if needed
def gen_search_results(search_terms):
    artists = ...
    table_content = ""
    for artist in ...:
        table_content += ...
    data =  {
        'search_terms': search_terms,
        'search_results': table_content
    }
    return search_template % data

**Q:** How would you modify the `gen_search_results` and `search_artists` functions so that only a given number of artists are retrieved from the API and displayed, *i.e.* so that `search_artists` fetches only the first `x` matches instead of all of them? Create a new set of functions implementing this feature and test it.

In [None]:
# Your code/answer here

**Q:** Write a function that takes an artist name as input and returns its MBID by using the search API and picking the match with the highest score.

In [None]:
# your code here
def get_mbid_for_artist(artist_name):
    # ...
    return mbid

You can check your code against the following tests:

In [None]:
assert get_mbid_for_artist('deep purple') == '79491354-3d83-40e3-9d8e-7592d58d790a'
assert get_mbid_for_artist('pink floyd') == '83d91898-7763-47d7-b03b-b92132375c47'

### Getting Artist details

Once we have identified the correct MBID, we can use the appropriate endpoint to retrieve information about the artist:

    lookup:   /<ENTITY>/<MBID>?inc=<INC>
    
**Q:** Using the artist end point, retrieve the artist's details.

In [None]:
# your code here

Additional information can be retrieved from the endpoint by using the `inc` parameter.

**Q:** Using the `inc` parameter (https://musicbrainz.org/doc/Development/XML_Web_Service/Version_2#inc.3D_arguments_which_affect_subqueries), retrieve the ratings and tags associated with the band.

In [None]:
# your code here

**Q:** Using your previous answers, write a function that given a MusicBrainz ID of an artist, returns a Python dictionary containing the following information
 - The MBID
 - The artist's name
 - The most popular tag associated with the artist
 - The type of artist
 - Its ratings
 - The country


In [None]:
# Your code here
def get_artist_data(mbid):
    data = {}
    # ...
    return data

### Getting list of releases for an artist

As we did for the search, we now want to retrieve all the releases associated with the artist.

**Q:** Identify the different API endpoints that may be used to retrieve the releases associated with the artist. For each of them, write a small query to validate and test it.

In [None]:
# Your code here

**Q:** Write a function that uses the release API endpoint to fetch all releases for a given artist identified by its MBID. As for the search, don't forget to call `time.sleep(1)` before each new call to the MusicBrainz API.

Validate your code against the *Deep Purple* release page: https://musicbrainz.org/artist/79491354-3d83-40e3-9d8e-7592d58d790a/releases

In [None]:
# your code here
def get_artist_releases(mbid):
    releases = []
    # ...
    return releases

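You can validate your code against the following artists, for instance. The expected counts reflect the MusicBrainz database at the time of writing and may have changed since: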

In [None]:
deep_purple_releases = get_artist_releases('79491354-3d83-40e3-9d8e-7592d58d790a')
assert len(deep_purple_releases) == 649
pink_floyd_releases = get_artist_releases('83d91898-7763-47d7-b03b-b92132375c47')
assert len(pink_floyd_releases) == 1358

Let's generate some statistics on the data retrieved.

**Q:** Given a list of releases, write a function that returns the number of releases made by the artist per year, and test it against a couple of different artists.

In [None]:
# your code here
def count_of_releases_by_year(releases):
    count_by_year = {}
    # ...
    return count_by_year

## Exercise 2

iTunes also has an API, documented at https://affiliate.itunes.apple.com/resources/documentation/itunes-store-web-service-search-api, which allows you to search for and retrieve different kinds of information.


**Q:** Use the iTunes Search API to find out the `artistId` of *Deep Purple*.

In [None]:
# your code here

**Q:** Write a function that returns the data for a given `artistId`.

In [None]:
# Your code / answers here
def get_itunes_artist_details(artist_id):
    data = {}
    # ...
    return data

**Q:** Write a function that returns the list of albums for the given artist.

In [None]:
# your code here
def get_itunes_album_for_artist(artist_id):
    albums = []
    # ...
    return albums

## Exercise 3

Using code similar to the search page, write a small function that generates an HTML page giving artist data retrieved from the different APIs:
 - Artist's details from MusicBrainz
 - Artist's details from iTunes Search
 - List of releases found in MusicBrainz
 - List of releases found in iTunes Search
 - A section containing the content of the Wikipedia page associated with the artist.


In [None]:
# Your code here
template_page = """

"""
def generate_page(...):
    page = ...
    return page