Real estate

Real estate in Zürich

the Zürich Statistical Office collects data on the city and its residents. This data is published as Linked Data.

In this tutorial, we will show how to work with Linked Data. Mainly, we will see how to work with the real estate dataset.
We will look into how to query, process, and visualize it.

SPARQL endpoint

Data on the real estate market is published as Linked Data. It can be accessed with SPARQL queries.
You can send queries using HTTP requests. The API endpoint is https://ld.stadt-zuerich.ch/query.

Let's use SparqlClient from graphly to communicate with the database. Graphly will allow us to:

  • send SPARQL queries
  • automatically add prefixes to all queries
  • format response to pandas or geopandas
In [1]:
# Uncomment to install dependencies in Colab environment
#!pip install git+https://github.com/zazuko/graphly.git
In [2]:
import datetime
import re

import pandas as pd
import numpy as np

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from graphly.api_client import SparqlClient
In [3]:
sparql = SparqlClient("https://ld.stadt-zuerich.ch/query")
sparql.add_prefixes({
    "schema": "<http://schema.org/>",
    "cube": "<https://cube.link/>",
    "property": "<https://ld.stadt-zuerich.ch/statistics/property/>",
    "measure": "<https://ld.stadt-zuerich.ch/statistics/measure/>",
    "skos": "<http://www.w3.org/2004/02/skos/core#>",
    "ssz": "<https://ld.stadt-zuerich.ch/statistics/>"
})

SPARQL queries can become very long. To improve the readibility, we will work wih prefixes.

Using the add_prefixes method, we define persistent prefixes. Every time you send a query, graphly will automatically add the prefixes for you.

Housing prices

Let's find the average price per m2 for an apartment in Zurich. This data is available in the QMP-EIG-HAA-OBJ-ZIM data cube. It will allow us to find the price per city district and apartment type. The data is also available for different points in time.

The query for housing prices in city of Zürich for different districts and apartment types over time looks as follows:

In [4]:
query = """
SELECT ?time ?place ?rooms ?price
FROM <https://lindas.admin.ch/stadtzuerich/stat>
WHERE {
  ssz:QMP-EIG-HAA-OBJ-ZIM a cube:Cube;
             cube:observationSet/cube:observation ?observation.   
  ?observation property:TIME ?time ;
                       property:RAUM ?place_uri;
                       property:ZIM/schema:name ?rooms;
                       measure:QMP ?price .
  ?place_uri skos:inScheme <https://ld.stadt-zuerich.ch/statistics/scheme/Kreis> ;
         schema:name ?place .
  FILTER regex(str(?place),"ab|Stadtgebiet vor")
  FILTER (?price > 0)
}
ORDER BY ?time
"""

df = sparql.send_query(query)
df.head()
Out[4]:
time place rooms price
0 2009-12-31 Kreis 10 (ab 1934) 4- und 4,5-Zimmer Wohnung 9306.0
1 2009-12-31 Kreis 2 (ab 1893) 1- und 1,5-Zimmer Wohnung 12176.0
2 2009-12-31 Kreis 3 (ab 1915) 1- und 1,5-Zimmer Wohnung 7894.0
3 2009-12-31 Kreis 10 (ab 1934) 6- und 6,5-Zimmer Wohnung 9463.0
4 2009-12-31 Kreis 9 (ab 1934) 2- und 2,5-Zimmer Wohnung 7980.0

Prices per apartment type

Let's visualize the housing prices per apartment type. To do this, we will aggregate the prices per rooms.
The cleaned dataframe becomes:

In [5]:
df.place = df.place.apply(lambda x: re.findall('Kreis \d+', x)[0])
df.rooms = df.rooms.apply(lambda x: int(re.findall('\d+', x)[0]))
plot_df = df[["rooms", "price"]][df.time == df.time.max()].groupby(["rooms"]).mean().astype(int).sort_values(by="rooms").reset_index()
plot_df
Out[5]:
rooms price
0 1 12706
1 2 12841
2 3 12618
3 4 12138
4 5 13792
5 6 13361
6 7 12146
7 8 20804
In [6]:
fig = px.bar(plot_df, x="rooms", y="price")
fig.update_layout(
    title='Housing prices in Zürich', 
    title_x=0.5,
    yaxis_title="CHF per m<sup>2</sup>",
    xaxis_title="rooms"
)
fig.show()

Prices per district

Let's visualize the housing prices per district. To do this, we will aggregate the prices per place.
The cleaned dataframe becomes:

In [7]:
plot_df = df[["place", "price"]][df.time == df.time.max()].groupby(["place"]).mean().astype(int).sort_values(by="price").reset_index()
In [8]:
fig = px.bar(plot_df, x="place", y="price")
fig.update_layout(
    title='Housing prices in Zürich', 
    title_x=0.5,
    yaxis_title="CHF per m<sup>2</sup>",
    xaxis_title="district",
)
fig.show()

Prices per district and apartment type

Now, let's combine both views. We will visualize housing prices across districts and apartment types.
The dataframe for plotting becomes:

In [9]:
plot_df = df[["place", "rooms", "price"]][df.time == df.time.max()]
plot_df = plot_df[plot_df.rooms <= 4]
plot_df["avg_price"] = plot_df[["price", "place"]].groupby(["place"]).transform(lambda x: sum(x)/len(x))
plot_df = plot_df.sort_values(by=["avg_price", "rooms"])

plot_df.head()
Out[9]:
place rooms price avg_price
539 Kreis 12 1 7826.0 8748.50
543 Kreis 12 2 8404.0 8748.50
528 Kreis 12 3 8987.0 8748.50
578 Kreis 12 4 9777.0 8748.50
547 Kreis 4 1 9955.0 10620.75
In [10]:
fig = make_subplots(rows=3, cols=4, subplot_titles=plot_df["place"].unique(), shared_yaxes=True, y_title='CHF per m<sup>2</sup>', x_title='rooms', vertical_spacing=0.1)

for i, district in enumerate(plot_df["place"].unique()):
    
    row = i//4 + 1
    col = i%4 + 1
    subset = plot_df[plot_df["place"] == district]
    fig.append_trace(go.Bar(
        x=subset["rooms"],
        y=subset["price"],
        name=district, 
        marker_color=px.colors.qualitative.Dark24[0]
    ), row=row, col=col)

fig.update_layout(height=800, width=1000, title={"text": "Housing prices in Zürich", "x": 0.5}, showlegend=False)
fig.update_yaxes(range=[0,20000])
fig.show()

City districts

drawing

Prices over time

Let's take a look at averge housing prices over time. We will aggregate the results by time.
Here is the dataframe and the graph:

In [11]:
plot_df = df[["time", "price"]].groupby(["time"]).mean().reset_index()
plot_df.head()
Out[11]:
time price
0 2009-12-31 8702.231884
1 2010-12-31 9206.983871
2 2011-12-31 10061.363636
3 2012-12-31 11442.333333
4 2013-12-31 11096.770492
In [12]:
fig = px.line(plot_df, x='time', y="price")
fig.update_layout(
    title='Housing prices in Zürich', 
    title_x=0.5,
    yaxis_title="CHF per m<sup>2</sup>",
    xaxis_title="Year",
)
fig.show()

Prices over time for various apartment types

Now that we have seen the trend, let's dig a bit deeper. We will visualize housing prices over time for 2, 4 and 6-room apartments. To achieve this, we need to reshape our dataframe to show prices per apartment type over time.

The reshaped dataframe becomes:

In [13]:
plot_df = pd.pivot_table(df, index="time", columns="rooms", values="price", aggfunc=np.mean).reset_index()
plot_df.head()
Out[13]:
rooms time 1 2 3 4 5 6 7 8
0 2009-12-31 8262.818182 8375.750000 8570.636364 8296.636364 9629.4 9994.666667 7241.25 9571.75
1 2010-12-31 9363.800000 8776.272727 9397.666667 8553.818182 8574.4 9704.200000 10356.00 18815.00
2 2011-12-31 8950.727273 12086.166667 8647.166667 9840.833333 9371.1 10297.000000 12704.00 14049.00
3 2012-12-31 10207.000000 10508.916667 10583.000000 11544.909091 11655.4 13159.857143 18831.00 13346.00
4 2013-12-31 10805.571429 10163.181818 11118.000000 11710.750000 11023.0 12801.800000 11085.00 8092.00
In [14]:
fig = make_subplots(rows=3, cols=1, y_title='CHF per m<sup>2</sup>')

for i, j in enumerate([2,4,6]):
    
    fig.append_trace(go.Scatter(
        x=plot_df["time"],
        y=plot_df[j],
        name="Rooms: {}".format(j), 
        marker_color=px.colors.qualitative.Dark24[i]
    ), row=i+1, col=1)

fig.update_layout(title={"text": "Housing prices in Zürich", "x": 0.5}, showlegend=True)
fig.update_yaxes(range=[8000,14000])
fig.show()

Apartments and population growth

Real estate prices are influenced by:

  • Number of available apartments
  • Number of people willing to buy a an apartment

Let's take a look at how those numbers evolved over time. Mainly, we want find the apartments and population count in city of Zurich.

The query for the number of inhabitants and apartments over time looks as follows:

In [15]:
query = """
SELECT * 
FROM <https://lindas.admin.ch/stadtzuerich/stat>
WHERE{ 
    {
      SELECT ?time (SUM(?pop_count) AS ?pop)
      WHERE {
        ssz:BEW a cube:Cube;
                   cube:observationSet/cube:observation ?obs_bew.   
        ?obs_bew property:TIME ?time ;
                 property:RAUM ?place_uri_pop;
                 measure:BEW ?pop_count .

        ?place_uri_pop skos:inScheme <https://ld.stadt-zuerich.ch/statistics/scheme/Kreis> ;
                   schema:name ?place_pop .

        FILTER regex(str(?place_pop),"ab|Stadtgebiet vor")
      }
      GROUP BY ?time
    }
    {
      SELECT ?time (SUM(?apt_count) AS ?apts)
      WHERE {
        ssz:WHG a cube:Cube;
                   cube:observationSet/cube:observation ?obs_apt.   
        ?obs_apt property:TIME ?time ;
                 property:RAUM ?place_uri_apt;
                 measure:WHG ?apt_count .

        ?place_uri_apt skos:inScheme <https://ld.stadt-zuerich.ch/statistics/scheme/Kreis> ;
                   schema:name ?place .

        FILTER regex(str(?place),"ab|Stadtgebiet vor")
      }
      GROUP BY ?time
    }
}
ORDER BY ?time
"""

df = sparql.send_query(query)
df.head()
Out[15]:
time pop apts
0 1896-12-31 39142.0 22334.0
1 1897-12-31 39553.0 23776.0
2 1898-12-31 39728.0 24811.0
3 1899-12-31 39623.0 25565.0
4 1900-12-31 39388.0 26138.0
In [16]:
fig = make_subplots(specs=[[{"secondary_y": True}]])

fig.add_trace(
    go.Scatter(x=df["time"], y=df["pop"], name="Population"),
    secondary_y=False,
)

fig.add_trace(
    go.Scatter(x=df["time"], y=df["apts"], name="Apartments"),
    secondary_y=True,
)

fig.update_layout(title={"text": "Population and Apartments in Zürich", "x": 0.5})
fig.update_yaxes(title_text="population", secondary_y=False)
fig.update_yaxes(title_text="apartments", secondary_y=True)
fig.show()

Apartment types over time

Let's take a look at what apartments were, and are available in Zurich. How did the share of various apparmnent types evolve over time? Is there a trend to build more smaller (or bigger) apartments? These insights are available in the WHG-ZIM data cube.

The query for the number of different apartment types over time looks as follows:

In [17]:
query = """
SELECT ?time ?rooms (SUM(?count) AS ?apts)
FROM <https://lindas.admin.ch/stadtzuerich/stat>
WHERE {
  ssz:WHG-ZIM a cube:Cube;
             cube:observationSet/cube:observation ?obs.   
  ?obs property:TIME ?time ;
           property:RAUM ?place_uri;
           property:ZIM/schema:name ?rooms ;
           measure:WHG ?count .

  ?place_uri skos:inScheme <https://ld.stadt-zuerich.ch/statistics/scheme/Kreis> ;
             schema:name ?place .
             
  FILTER regex(str(?place),"ab|Stadtgebiet vor")
  FILTER (?time >= "1977-01-01"^^xsd:time)
}
GROUP BY ?time ?rooms
ORDER BY ?time ?rooms
"""

df = sparql.send_query(query)
df.head()
Out[17]:
time rooms apts
0 1977-12-31 1- und 1,5-Zimmer Wohnung 26132.0
1 1977-12-31 2- und 2,5-Zimmer Wohnung 34686.0
2 1977-12-31 3- und 3,5-Zimmer Wohnung 68026.0
3 1977-12-31 4- und 4,5-Zimmer Wohnung 33927.0
4 1977-12-31 5- und 5,5-Zimmer Wohnung 7593.0

Let's reshape our dataframe to show the apartment count per apartment type, over time:

In [18]:
def rename_rooms(x: str) -> str:
    
    elements = x.split("- und ")
    if elements[1] == "mehr-Zimmer Wohnung":
        return elements[0] + "+"
    else:
        return elements [0]


df.rooms = df.rooms.apply(rename_rooms)
df = pd.pivot_table(df, index="time", columns="rooms", values="apts")

# Data cleaning
df["5+"][df["5+"].isna()] = df["5"][df["5+"].isna()] + df["6+"][df["5+"].isna()]
df = df[["1", "2", "3", "4", "5+"]].reset_index().rename_axis(None, axis=1)
df = df.fillna(method="ffill")
df.head()
Out[18]:
time 1 2 3 4 5+
0 1977-12-31 26132.0 34686.0 68026.0 33927.0 12728.0
1 1978-12-31 26280.0 34963.0 68080.0 34066.0 12706.0
2 1979-12-31 27536.0 35324.0 68139.0 34147.0 12696.0
3 1980-12-31 28039.0 35717.0 68465.0 34328.0 12706.0
4 1981-12-31 27398.0 36103.0 68874.0 34851.0 12790.0
In [19]:
cols = ["1", "2", "3", "4", "5+"]
start = df[cols].iloc[0] / sum(df[cols].iloc[0])
end = df[cols].iloc[-1] / sum(df[cols].iloc[-1])
diff = abs(end-start)
cols = [x for _, x in sorted(zip(diff, cols))]

fig = px.histogram(df, x="time", y=cols, barnorm="percent")
fig.update_layout(
    title='Apartments in Zürich', 
    title_x=0.5,
    yaxis_title="Share of apartments in %",
    xaxis_title="Year",
    legend={"title": "Rooms"}
)
fig['layout']['yaxis']['range'] = [0,100]
fig.show()