Population

Population in Zürich

Zürich Statistical Office collects data on the city and its residents. This data is published as Linked Data.

In this tutorial, we will show how to work with Linked Data. Mainly, we will see how to work with population dataset.
We will look into how to query, process, and visualize it.

SPARQL endpoint

Population data is published as Linked Data thatcan be accessed with SPARQL queries.
You can send queries using HTTP requests. The API endpoint is https://ld.stadt-zuerich.ch/query.

Let's use SparqlClient from graphly to communicate with the database. Graphly will allow us to:

  • send SPARQL queries
  • automatically add prefixes to all queries
  • format response to pandas or geopandas
In [1]:
# Uncomment to install dependencies in Colab environment
#!pip install git+https://github.com/zazuko/graphly.git
In [2]:
import re

import pandas as pd
import numpy as np

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from graphly.api_client import SparqlClient
In [3]:
def natural_keys(txt: str) -> list[int]:
    """Extracts the digits from string
    Args:
        txt:             string with digits

    Returns:
        list[int]        digits in string
    """
    
    return [int(s) for s in txt.split() if s.isdigit()]
In [4]:
sparql = SparqlClient("https://ld.stadt-zuerich.ch/query")
sparql.add_prefixes({
    "schema": "<http://schema.org/>",
    "cube": "<https://cube.link/>",
    "property": "<https://ld.stadt-zuerich.ch/statistics/property/>",
    "measure": "<https://ld.stadt-zuerich.ch/statistics/measure/>",
    "collection": "<https://ld.stadt-zuerich.ch/statistics/collection/>",
    "skos": "<http://www.w3.org/2004/02/skos/core#>",
    "ssz": "<https://ld.stadt-zuerich.ch/statistics/>"
})

SPARQL queries can become very long. To improve the readibility, we will work wih prefixes.

Using the add_prefixes method, we define persistent prefixes. Every time you send a query, graphly will automatically add the prefixes for you.

Population in city districts

Let's find the number of inhabitants in different parts of the city. The population data is available in the BEW data cube.

The query for the number of inhabitants in different city districts, over time looks as follows:

In [5]:
query = """
SELECT ?time ?place ?count
FROM <https://lindas.admin.ch/stadtzuerich/stat>
WHERE {
  ssz:BEW a cube:Cube;
             cube:observationSet/cube:observation ?observation.   
  
  ?observation property:RAUM ?place_uri ;
                       property:TIME ?time ;
                       measure:BEW ?count .
  ?place_uri skos:inScheme <https://ld.stadt-zuerich.ch/statistics/scheme/Kreis> ;
         schema:name ?place .
  FILTER regex(str(?place),"ab|Stadtgebiet vor")
}
ORDER BY ?time
"""

df = sparql.send_query(query)
df.head()
Out[5]:
time place count
0 1408-12-31 Kreis 1 (Stadtgebiet vor 1893) 5675.0
1 1467-12-31 Kreis 1 (Stadtgebiet vor 1893) 4750.0
2 1529-12-31 Kreis 1 (Stadtgebiet vor 1893) 5080.0
3 1637-12-31 Kreis 1 (Stadtgebiet vor 1893) 8621.0
4 1671-12-31 Kreis 1 (Stadtgebiet vor 1893) 9590.0

Let's visualize the number of inhabitants per district. To do this, we will aggregate the numbers per place.
The cleaned dataframe becomes:

In [6]:
df.place = df.place.apply(lambda x: re.findall('Kreis \d+', x)[0])

df = pd.pivot_table(df, index="time", columns="place", values="count")
df.dropna(inplace=True)

df = df[df.columns[np.argsort(-df.iloc[0,])]]
df = df.reset_index().rename_axis(None, axis=1)

df.head()
Out[6]:
time Kreis 11 Kreis 3 Kreis 9 Kreis 7 Kreis 6 Kreis 10 Kreis 12 Kreis 2 Kreis 4 Kreis 8 Kreis 5 Kreis 1
0 1971-12-31 56863.0 52707.0 47257.0 39599.0 37837.0 36160.0 33664.0 32708.0 32231.0 20899.0 12833.0 9411.0
1 1972-12-31 56864.0 51674.0 47223.0 39118.0 37763.0 35760.0 33079.0 32561.0 31765.0 20371.0 12462.0 9007.0
2 1973-12-31 56464.0 50879.0 47215.0 38695.0 37059.0 35576.0 32201.0 31925.0 30906.0 19897.0 12235.0 8525.0
3 1974-12-31 56224.0 50175.0 47142.0 38045.0 36305.0 35449.0 31374.0 31706.0 30048.0 19552.0 12165.0 8076.0
4 1975-12-31 55627.0 49326.0 46491.0 37379.0 35294.0 35518.0 30943.0 31179.0 29061.0 19246.0 11798.0 7751.0

And now we can graph it using a line plot or a histogram.

In [7]:
sorted_df = df.reindex(sorted(df.columns, key=natural_keys), axis=1)
fig = px.line(sorted_df, x="time", y = sorted_df.columns)
fig.update_layout(
    title='Population in Zürich Districts', 
    title_x=0.5,
    yaxis_title="inhabitants",
    xaxis_title="Years",
    legend_title="District"
)
fig.show("notebook")
In [8]:
fig = px.histogram(df, x="time", y=df.columns, barnorm="percent")
fig.update_layout(
    title='Population in Zürich Districts', 
    title_x=0.5,
    yaxis_title="% of inhabitants",
    xaxis_title="Years",
    legend_title="District"
)
fig['layout']['yaxis']['range']= [0,100]
fig.show()

Population origin

Let's find the number of foreign and swiss inhabitants. The share of swiss/non-swiss population is available in the ANT-GGH-HEL data cube. The population count is available in BEW data cube.

The query for number of inhabitants and foreigners share over time looks as follows:

In [9]:
query = """
SELECT ?time (SUM(?pop_count) AS ?pop) (SUM(?foreigners_count)/SUM(?pop_count) AS ?foreigners) 
FROM <https://lindas.admin.ch/stadtzuerich/stat>
WHERE {
  ssz:BEW a cube:Cube;
             cube:observationSet/cube:observation ?obs_bew.   
  ?obs_bew property:TIME ?time ;
           property:RAUM ?place_uri;
           measure:BEW ?pop_count .
  
  ssz:ANT-GGH-HEL a cube:Cube;
             cube:observationSet/cube:observation ?obs_ant.   
  
  ?obs_ant property:TIME ?time ;
           property:RAUM ?place_uri;
           measure:ANT ?ratio .
  
  ?place_uri skos:inScheme <https://ld.stadt-zuerich.ch/statistics/scheme/Kreis> ;
             schema:name ?place .
  
  BIND((?pop_count * ?ratio/100) AS ?foreigners_count)
}
GROUP BY ?time
ORDER BY ?time
"""

df = sparql.send_query(query)
df.head()
Out[9]:
time pop foreigners
0 1934-12-31 317367.0 0.116519
1 1935-12-31 318981.0 0.110071
2 1936-12-31 319849.0 0.102763
3 1937-12-31 321380.0 0.097606
4 1938-12-31 329780.0 0.100623

And now lets visualize the data using absolute numbers as well as percentages.

In [10]:
fig = make_subplots(rows=2, cols=1)

fig.append_trace(go.Scatter(x=df["time"],y=df["pop"],
        name="Total population", 
        marker_color=px.colors.qualitative.Vivid[7],
        showlegend=False,
    ), row=1, col=1)

fig.append_trace(go.Bar(x=df["time"],y=(1-df["foreigners"])*100,
        name="swiss", 
        marker_color=px.colors.qualitative.Vivid[3]
    ), row=2, col=1)


fig.append_trace(go.Bar(x=df["time"],y=df["foreigners"]*100,
        name="foreign",
        marker_color=px.colors.qualitative.Vivid[9]
    ), row=2, col=1)

fig['layout']['yaxis']['title']='inhabitants'
fig['layout']['yaxis2']['title']='Population share in %'
fig['layout']['yaxis2']['range']= [0,100]
fig.update_layout(height=800, title={"text": "Population in Zürich", "x": 0.5}, barmode = "stack",
                  legend = {"x": 1, "y": 0.37})
fig.show()

Population distribution: age and time

Let's find the number of inhabitants in different age groups. The population count per age group is available in the BEW-ALT-HEL-SEX data cube.

The query for the number of inhabitants in various age buckets over time looks as follows:

In [11]:
query = """
    SELECT ?time ?age (SUM(?measure) AS ?count) 
    FROM <https://lindas.admin.ch/stadtzuerich/stat>
    WHERE {
      ssz:BEW-ALT-HEL-SEX a cube:Cube;
                 cube:observationSet/cube:observation ?observation.   
      ?observation property:TIME ?time ;
                   property:ALT ?age_uri ;
                   measure:BEW ?measure .
      collection:1-Jahres-Altersklasse skos:member ?age_uri.
      ?age_uri schema:name ?age .
    }
    GROUP BY ?time ?age
    ORDER BY asc(?time)
"""

df = sparql.send_query(query)
df.head()
Out[11]:
time age count
0 2002-12-31 4 Jahre alt 2847.0
1 2002-12-31 7 Jahre alt 2765.0
2 2002-12-31 0 Jahre alt 3373.0
3 2002-12-31 42 Jahre alt 5455.0
4 2002-12-31 71 Jahre alt 3032.0

Let's calculate the population share for each age group. The dataframe becomes:

In [12]:
df["year"] = df.time.apply(getattr, args=("year", ))
df["count"] = df.groupby(["year"]).transform(lambda x: (x/x.sum())*100)
df['age'] = df['age'].apply(lambda x: int(str(x.split(" ")[0])))

df = df.sort_values(by=["year", "age"]).reset_index(drop=True)
df.head()
/tmp/ipykernel_1970/589411296.py:2: FutureWarning:

Dropping invalid columns in DataFrameGroupBy.transform is deprecated. In a future version, a TypeError will be raised. Before calling .transform, select only columns which should be valid for the function.

Out[12]:
time age count year
0 2002-12-31 0 0.925525 2002
1 2002-12-31 1 0.874762 2002
2 2002-12-31 2 0.875036 2002
3 2002-12-31 3 0.826469 2002
4 2002-12-31 4 0.781194 2002

And lets visualize it using an interactive plot.

In [13]:
fig = px.bar(df, x="age", y="count", animation_frame="year", range_y=[0, 3], range_x=[0, df.age.max()])
fig.update_layout(
    title='Population Distribution', 
    title_x=0.5,
    yaxis_title="Population share in %",
    xaxis_title="Age",
    legend_title="District"
)
fig.show()

Population distribution: age and origin

Let's take a look at age distribution among swiss and foreign inhabitants. We can find this data in the BEW-ALT-HEL-SEX data cube.

The query for number of inhabitants in various age buckets, with their origin, over time looks as follows:

In [14]:
query = """
    SELECT ?age ?origin (SUM(?measure) AS ?count) 
    FROM <https://lindas.admin.ch/stadtzuerich/stat>
    WHERE {
      ssz:BEW-ALT-HEL-SEX a cube:Cube;
        cube:observationSet/cube:observation ?observation.   
        ?observation property:TIME ?time ;
        property:ALT/schema:name ?age;
        measure:BEW ?measure ;
        property:HEL/schema:name ?origin .

        collection:1-Jahres-Altersklasse skos:member ?age_uri.
        ?age_uri schema:name ?age .
        
        FILTER (?time = "2017-12-31"^^xsd:date)
    }
    GROUP BY ?age ?origin
    ORDER BY asc(?age)
"""

df = sparql.send_query(query)
df.head()
Out[14]:
age origin count
0 0 Jahre alt Ausland 1659.0
1 0 Jahre alt Schweiz 3316.0
2 1 Jahr alt Schweiz 3289.0
3 1 Jahr alt Ausland 1613.0
4 10 Jahre alt Schweiz 2423.0

Let's calculate the population share for each origin and age group. The dataframe becomes:

In [15]:
df["age"] = df["age"].apply(lambda x: int(str(x.split(" ")[0])))
df["count"] = df[["origin", "count"]].groupby(["origin"]).transform(lambda x: x/x.sum()*100)
df = df.sort_values(by=["age"]).reset_index(drop=True)
df.loc[df["origin"]=="Ausland", "origin"] = "foreign"
df.loc[df["origin"]=="Schweiz", "origin"] = "swiss"
df.head()
Out[15]:
age origin count
0 0 foreign 1.210463
1 0 swiss 1.158408
2 1 swiss 1.148976
3 1 foreign 1.176900
4 2 swiss 1.072820
In [16]:
fig = px.bar(df, x="age", y="count", 
             barmode="overlay", range_y = [0,4], color="origin")

fig.update_layout(
    title='Population Distribution', 
    title_x=0.5,
    yaxis_title="Population share in %",
    xaxis_title="Age",
    legend_title="Origin"
)
fig.show()

Population distribution: age and sex

Let's take a look at the age distribution for female and male inhabitants. We can find this data in the BEW-ALT-HEL-SEX data cube.

The query for number of inhabitants in various age buckets, with their sex, over time looks as follows:

In [17]:
query = """
SELECT ?time ?sex ?age (SUM(?measure) AS ?count) 
FROM <https://lindas.admin.ch/stadtzuerich/stat>
WHERE {
  ssz:BEW-ALT-HEL-SEX a cube:Cube;
    cube:observationSet/cube:observation ?observation.   
    ?observation property:TIME ?time ;
    measure:BEW ?measure ;
    property:SEX/schema:name ?sex ;
    property:ALT ?age_uri .
    
    collection:1-Jahres-Altersklasse skos:member ?age_uri.
    ?age_uri schema:name ?age .
    
}
GROUP BY ?time ?sex ?age
ORDER BY asc(?time)
"""

df = sparql.send_query(query)
df.head()
Out[17]:
time sex age count
0 2002-12-31 weiblich 15 Jahre alt 1298.0
1 2002-12-31 weiblich 83 Jahre alt 1108.0
2 2002-12-31 männlich 22 Jahre alt 2590.0
3 2002-12-31 weiblich 52 Jahre alt 2160.0
4 2002-12-31 männlich 71 Jahre alt 1244.0

Let's create a dataframe where one row represents one observation. It will allow us to use violin plots for our dataframe.
The dataframe becomes:

In [18]:
df.loc[df["sex"]=="weiblich", "sex"] = "female"
df.loc[df["sex"]=="männlich", "sex"] = "male"

df['age'] = df['age'].apply(lambda x: str(x.split(" ")[0])).astype(int)
df["year"] = df.time.apply(getattr, args=("year", )).astype(str)
df = df.sort_values(by=["year", "age"]).reset_index(drop=True)

df = df[(df.year == df.year.max()) | ((df.year == df.year.min()))]
df = df[["sex", "age", "year"]].loc[df.index.repeat(df["count"])].reset_index(drop=True)
df.head()
Out[18]:
sex age year
0 female 0 2002
1 female 0 2002
2 female 0 2002
3 female 0 2002
4 female 0 2002
In [19]:
fig = px.violin(df, y="age", x="year", color="sex", violinmode="overlay")
fig.data[0].update(span = [0, 105], spanmode='manual')
fig.data[1].update(span = [0, 105], spanmode='manual')
fig.update_layout(title={"text": "Population distrubution", "x": 0.5})
fig.show()
In [20]:
fig = go.Figure()

fig.add_trace(go.Violin(x=df['sex'][df['year'] == "2002"],
                        y=df['age'][df['year'] == "2002"],
                        legendgroup='2002', scalegroup='2002', name='2002',
                        side='negative',
                        line_color='blue', 
                        span = [0, 105], 
                        spanmode='manual'))

fig.add_trace(go.Violin(x=df['sex'][df['year'] == "2017"],
                        y=df['age'][df['year'] == "2017"],
                        legendgroup='2017', scalegroup='2017', name='2017',
                        side='positive',
                        line_color='orange',
                        span = [0, 105], 
                        spanmode='manual'))

fig.update_traces(meanline_visible=True)
fig.update_layout(title={"text": "Population distrubution", "x": 0.5}, yaxis_title="age")
fig.show()

Population and real estate prices

Let's compare real estate prices and number of inhabitants over time. We will need to work with population and real estate data sets. The population data is available in the BEW data cube. The real estate prices are in the QMP-EIG-HAA-OBJ-ZIM data cube.

The query for the number of inhabitants and the housing prices over time looks as follows:

In [21]:
query="""
SELECT * 
FROM <https://lindas.admin.ch/stadtzuerich/stat>
WHERE{ 
    {
      SELECT ?time (SUM(?pop_count) AS ?pop)
      WHERE {
        ssz:BEW a cube:Cube;
                   cube:observationSet/cube:observation ?obs_bew.   
        ?obs_bew property:TIME ?time ;
                 property:RAUM ?place_uri_pop;
                 measure:BEW ?pop_count .

        ?place_uri_pop skos:inScheme <https://ld.stadt-zuerich.ch/statistics/scheme/Kreis> ;
                   schema:name ?place_pop .

        FILTER regex(str(?place_pop),"ab|Stadtgebiet vor")
      }
      GROUP BY ?time
    }
    {
      SELECT ?time (AVG(?quote) AS ?price)
      WHERE {
        ssz:QMP-EIG-HAA-OBJ-ZIM a cube:Cube;
                   cube:observationSet/cube:observation ?obs_apt.   
        ?obs_apt property:TIME ?time ;
                             property:RAUM ?place_uri_apt;
                             measure:QMP ?quote .  

        ?place_uri_apt skos:inScheme <https://ld.stadt-zuerich.ch/statistics/scheme/Kreis> ;
                   schema:name ?place_apt .

        FILTER (?quote > 0)
        FILTER regex(str(?place_apt),"ab|Stadtgebiet vor")
      }
      GROUP BY ?time
      ORDER BY ?time
    }
}
"""

df = sparql.send_query(query)
df.head()
Out[21]:
time pop price
0 2009-12-31 382906.0 8702.231884
1 2010-12-31 385468.0 9206.983871
2 2011-12-31 390082.0 10061.363636
3 2012-12-31 394012.0 11442.333333
4 2013-12-31 398575.0 11096.770492
In [22]:
# Create figure with secondary y-axis
fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add traces
fig.add_trace(
    go.Scatter(x=df["time"], y=df["pop"], name="inhabitants"),
    secondary_y=False,
)

fig.add_trace(
    go.Scatter(x=df["time"], y=df["price"], name="price per m<sup>2</sup>"),
    secondary_y=True,
)

# Layout
fig.update_layout(title={"text": "Population and real estate prices", "x": 0.5})
fig.update_yaxes(title_text="population", secondary_y=False)
fig.update_yaxes(title_text="price per m<sup>2</sup>", secondary_y=True)
fig.show()

Causes of death

The Statistical Office reports the number of deaths and the cause. Let's try to understand what are the main causes of death in Zurich. This data is available in the GES-SEX-TOU data cube.

The query for death cause and its broader category for the year 2015 looks as follows:

In [23]:
query = """
SELECT ?tou ?tou_broader (SUM(?ges) AS ?deaths)
FROM <https://lindas.admin.ch/stadtzuerich/stat>
WHERE {
  ssz:GES-SEX-TOU a cube:Cube;
                  cube:observationSet/cube:observation ?obs.   
  ?obs property:TIME ?time ;                           
       property:TOU ?tou_uri;
                   measure:GES ?ges .
  
  ?tou_uri schema:name ?tou ;
     skos:broader/schema:name ?tou_broader .

  MINUS {?three_level_tou skos:broader ?tou_uri .}
  FILTER (?time = "2015-12-31"^^xsd:date)
}
GROUP BY ?tou ?tou_broader
HAVING (?deaths > 0)
ORDER BY ?tou_broader
"""

df = sparql.send_query(query)
df.head()
Out[23]:
tou tou_broader deaths
0 Grippe Atmungsorgane 3.0
1 Asthma Atmungsorgane 1.0
2 Pneumonie / Lungenentzündung Atmungsorgane 39.0
3 andere Atmungsorgane Atmungsorgane 19.0
4 Chronische Bronchitis Atmungsorgane 53.0

Let's aggregate those results under more meaningful group names.

In [24]:
df.loc[(df.tou == "andere infektiöse Krankheiten"), "tou_broader"] = ""
df.loc[(df.tou == "andere infektiöse Krankheiten"), "tou"] = "Infektiöse Krankheiten"
df.loc[(df.tou == "Alkoholische Leberzirrhose"), "tou_broader"] = ""
df.loc[(df.tou == "Unbekannt"), "deaths"] = df.loc[(df.tou_broader == "Übrige"), "deaths"].sum()
df.loc[(df.tou == "Unbekannt"), "tou_broader"] = ""
df.loc[(df.tou_broader == "Krebskrankheiten/Bösartige Neubildungen"), "tou_broader"] = "Krebskrankheiten"

df = df.drop(df[df.tou == "Übrige (ohne unbekannte Todesursachen)"].index)
In [25]:
fig = px.treemap(df, path=['tou_broader', "tou"], values='deaths')
fig.update_layout(title={"text": "Causes of Death in 2015", "x": 0.5})
fig.show()