Mannheim Web Panel (MWP)

Site content by Julian Oliver Dörr and Sebastian Schmidt

This site introduces the Mannheim Web Panel (MWP) – a panel dataset of website contents extracted from corporate websites for a large sample of German companies.
The MWP was developed at the ZEW – Leibniz Centre for European Economic Research as part of the project Business and Economic Research Data Center (BERD@BW).

Why corporate websites?

Company websites are an important source of economic data. Firms use them to spread product and service information (related to establishing a public image), to conduct transactions (e-business processes) and to ease opinion sharing (electronic word-of-mouth) (Blazquez & Domenech, 2018). Recent economic studies have used corporate website data to:

predict firm innovativeness (Gök et al., 2015; Kinne & Lenz, 2021; Axenbeck & Breithaupt, 2021)
examine market entry strategies (Arora et al., 2013)
examine enterprise growth (Li et al., 2016)
monitor firm export orientation (Blazquez & Domenech, 2017)
track crisis impacts on the corporate sector (Dörr et al., 2021)
…

Why panel data?

Firm characteristics, diffusion processes such as technological advances and technology adoption, and business relations are not static but evolve over time. Capturing this information requires continuous monitoring of corporate websites. For this reason, the ZEW – Leibniz Centre for European Economic Research has been scraping corporate website contents since 2018 and has organised them as a panel that is updated every three to six months.

Is the data available for researchers?

The MWP data is stored in ZEW’s cloud infrastructure (Seafile) and can be accessed by external researchers for research purposes via ZEW’s Research Data Centre (ZEW-FDZ). Upon signing a licence agreement, interested users get full access to the MWP in a secured environment provided by ZEW. For more details please contact:

Dr. Sandra Gottschalk

+49 621 1235-267

sandra.gottschalk@zew.de

How is the data structured?

The MWP includes a large amount of web data from corporate websites of German companies. The general scraping framework used to establish the MWP is available on GitHub. The scraping parameters for the MWP are standardized. For each website, the first 50 subpages are downloaded, with shorter URLs within the corporate web domain being more likely to be scraped. Webpages in German are preferred in the scraping process, so the majority of text content in the MWP is in German.

In the following, we describe the data structure and access to the panel in more detail using Python. Note that the data can be accessed with any other programming language as well. The Seafile client that gives you access to the MWP from your local machine is assumed to be mapped to the “Q:\” drive in the following. Set the relevant drive for your machine in the cell below:

drive = "Q:\\"

In total, the MWP includes approximately 740 GB of data (as of April 2021) and information on more than 2.7 million firms. However, only a subset of these firms is included in each wave. The MWP therefore comes with a file showing, for each firm, whether it was scraped successfully in the respective wave (in the table below, 1 indicates a successful scraping attempt of the respective corporate website). Web data is available for the following dates:

December 2018
April 2019
August 2019
December 2019
March 2020
May 2020
August 2020
October 2020
January 2021

import pandas as pd
overview = pd.read_csv(drive + r"Für mich freigegeben\Mannheimer Webpanel (MWP)\Übersicht.csv", delimiter='\t')
overview.head(5)

	ID	url	2018_12	2019_04	2019_08	2019_12	2020_03	2020_05	2020_08	2020_10	2021_01
0	2010000001	www.wiener-conditorei.de	1	1	1	1	1	1	1	1	1
1	2010000057	www.hoernicke.de	1	1	1	1	1	1	1	1	1
2	2010000074	www.psi.de	1	1	1	1	1	1	1	0	0
3	2010000125	www.scherdel.de	1	1	1	1	1	1	1	0	1
4	2010000144	www.kleinschmidt.de	1	1	1	1	1	1	1	0	0

5 rows × 13 columns
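
The coverage file lends itself to a few quick checks, for instance how many firms were scraped per wave and which firms appear in every wave. The following is a minimal sketch only: it rebuilds the five rows shown above in memory instead of reading Übersicht.csv, and it assumes that all wave columns follow the YYYY_MM naming seen in the table. The names `overview_demo`, `firms_per_wave`, `balanced` and `waves_per_firm` are illustrative.

```python
import pandas as pd

# Toy copy of the five rows shown above; in practice use the `overview` DataFrame.
wave_cols = ["2018_12", "2019_04", "2019_08", "2019_12", "2020_03",
             "2020_05", "2020_08", "2020_10", "2021_01"]
rows = [
    [2010000001, "www.wiener-conditorei.de", 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [2010000057, "www.hoernicke.de",         1, 1, 1, 1, 1, 1, 1, 1, 1],
    [2010000074, "www.psi.de",               1, 1, 1, 1, 1, 1, 1, 0, 0],
    [2010000125, "www.scherdel.de",          1, 1, 1, 1, 1, 1, 1, 0, 1],
    [2010000144, "www.kleinschmidt.de",      1, 1, 1, 1, 1, 1, 1, 0, 0],
]
overview_demo = pd.DataFrame(rows, columns=["ID", "url"] + wave_cols)

# Number of successfully scraped firms per wave
firms_per_wave = overview_demo[wave_cols].sum()

# Firms observed in every wave (a balanced sub-panel)
in_all_waves = overview_demo[wave_cols].eq(1).all(axis=1)
balanced = overview_demo.loc[in_all_waves, "ID"].tolist()

# Number of waves in which each firm was scraped
waves_per_firm = overview_demo.set_index("ID")[wave_cols].sum(axis=1)

print(firms_per_wave)
print(balanced)
```

Restricting an analysis to the balanced sub-panel avoids composition effects across waves, at the cost of dropping firms with gaps.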

Each wave is stored in a corresponding directory in Seafile and is split into multiple chunks (generally about 100-150 MB each). These chunks are csv-files containing the scraped website information, including:

ID: ID of the company
dl_rank: chronological order in which the webpage was downloaded (main page = 0)
dl_slot: domain name of the website
alias: domain of website if there was an initial redirect (e.g. from www.example.de to www.example.com)
error: indicates whether there was an error when requesting the website’s main page (e.g. HTTP error, timeout)
redirect: equals “True” if there was a redirect to another domain when requesting the first webpage from a website
start_page: first webpage that was scraped
title: title of the website as indicated in the website’s meta data
keywords: keywords as indicated in the website’s meta data
description: short description of the website as indicated in the website’s meta data
language: language of the website as indicated in the website’s meta data
text: text that was downloaded from the webpage, including the respective HTML tags
links: domains of websites (within-sample and out-of-sample) found on the focal website
timestamp: exact time when the webpage was downloaded
url: URL of the requested webpage

For some of the older waves, information on title, keywords, description and language is not included.
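
Because the meta data columns may be missing in older waves, code that runs across waves should not assume a fixed set of columns. The sketch below shows one possible way to handle this: it peeks at the header, reads only the wanted columns that exist, and adds the missing ones as empty columns. The helper `read_available_columns` and the toy input are hypothetical and not part of the MWP tooling; the tab separator matches the examples on this page.

```python
import io
import pandas as pd

# Columns we would like to have; the meta data fields may be absent in older waves.
wanted = ["ID", "dl_rank", "url", "title", "language", "text"]

def read_available_columns(source, wanted_columns):
    """Read a tab-separated chunk, keeping only those wanted columns that exist."""
    header = pd.read_csv(source, sep="\t", nrows=0).columns
    if hasattr(source, "seek"):
        source.seek(0)  # rewind in-memory buffers after peeking at the header
    present = [c for c in wanted_columns if c in header]
    data = pd.read_csv(source, sep="\t", usecols=present)
    # Add missing columns as empty ones so downstream code sees a single schema.
    return data.reindex(columns=wanted_columns)

# Toy stand-in for a chunk from an older wave without title and language
old_wave = io.StringIO(
    "ID\tdl_rank\turl\ttext\n"
    "1\t0\thttps://www.example.de/\tWillkommen\n"
)
demo = read_available_columns(old_wave, wanted)
print(demo.columns.tolist())
```

Reading only the required columns also reduces memory use, since the text column is by far the largest.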

How can the data be accessed?

The files can be loaded for further analysis, for example with pandas. The following modules are needed:

import pandas as pd
import os
import glob

How can individual files be accessed?

Individual files can be accessed by path, as in a regular folder structure. Their names always follow the same pattern (ARGUS_chunk_p*).

# Define which wave you want to access
wave = "2020_03"

singular_file = pd.read_csv(drive + r"Für mich freigegeben\Mannheimer Webpanel (MWP)\\" + wave + r"\ARGUS_chunk_p12.csv", delimiter='\t')
singular_file.head(5)

	ID	dl_rank	dl_slot	error	redirect	…	description	text	timestamp	url	links
0	2011140655	0	juwelo.tv	None	True	…	Edelsteinschmuck in allen Variationen günstig …	Ihr Experte für zertifizierten Edelsteinschmuc…	Tue Mar 17 18:18:25 2020	https://www.juwelo.de/	juwelo.fr,juwelo.nl,juwelo.es,juwelo.it,juwelo…
1	2011140655	1	juwelo.tv	None	True	…	Liebhaber eleganter Schmucktrends sollten sich…	Ihr Experte für zertifizierten Edelsteinschmuc…	Tue Mar 17 18:18:25 2020	https://www.juwelo.de/tpc/	juwelo.fr,juwelo.nl,juwelo.es,juwelo.it,juwelo…
2	2011140655	2	juwelo.tv	None	True	…	Juwelo – AGB. Schmuckpräsentation und Herstell…	Ihr Experte für zertifizierten Edelsteinschmuc…	Tue Mar 17 18:18:25 2020	https://www.juwelo.de/agb/	juwelo.fr,juwelo.nl,juwelo.es,juwelo.it,juwelo…
3	2011140655	3	juwelo.tv	None	True	…	Ringe besetzt mit wunderschönen Edelsteinen, v…	Ihr Experte für zertifizierten Edelsteinschmuc…	Tue Mar 17 18:18:25 2020	https://www.juwelo.de/ringe/	juwelo.fr,juwelo.nl,juwelo.es,juwelo.it,juwelo…
4	2011140655	4	juwelo.tv	None	True	…	Schmuck mit echten Edelsteinen von Dagen ist n…	Ihr Experte für zertifizierten Edelsteinschmuc…	Tue Mar 17 18:18:25 2020	https://www.juwelo.de/dagen/	juwelo.fr,juwelo.nl,juwelo.es,juwelo.it,juwelo…
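
Once a chunk is loaded, a few preparation steps are often useful: parsing the timestamp, restoring the download order within each firm, and splitting the comma-separated links into lists. The following is a minimal sketch on toy rows that only mimic the columns shown above. The values are illustrative and not MWP data, and the timestamp format string is inferred from the output above rather than taken from any documentation.

```python
import pandas as pd

# Toy rows mimicking the chunk layout; values are illustrative, not MWP data.
pages = pd.DataFrame({
    "ID": [1, 1, 2],
    "dl_rank": [1, 0, 0],
    "timestamp": ["Tue Mar 17 18:18:26 2020",
                  "Tue Mar 17 18:18:25 2020",
                  "Wed Mar 18 09:00:00 2020"],
    "url": ["https://www.example.de/agb/",
            "https://www.example.de/",
            "https://www.example.org/"],
    "links": ["example.com,example.org", "example.com", None],
})

# Parse the timestamp strings (format inferred from the sample output)
pages["timestamp"] = pd.to_datetime(pages["timestamp"],
                                    format="%a %b %d %H:%M:%S %Y")

# Restore the download order within each firm (main page first)
pages = pages.sort_values(["ID", "dl_rank"]).reset_index(drop=True)

# Split the comma-separated link domains; missing values become empty lists
pages["link_list"] = pages["links"].fillna("").apply(
    lambda s: [domain for domain in s.split(",") if domain]
)

# Number of downloaded pages per firm
pages_per_firm = pages.groupby("ID").size()
print(pages[["ID", "dl_rank", "timestamp", "link_list"]])
```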

How can an entire wave be accessed?

Data from an entire wave can be accessed by looping over the csv-files. Keep in mind that the files are large and loading them in full may exhaust your machine’s memory. It is therefore sensible to keep only the data you need inside the loop.

# input directory
input_dir = drive + r"Für mich freigegeben\Mannheimer Webpanel (MWP)\\" + wave
os.chdir(input_dir)

# get all files in input directory
input_files = sorted(glob.glob('ARGUS_chunk_*.csv'))

#########################################
# Exemplary calculations based on the MWP
#########################################

# Search for the following search term
search_term = "digital"

# IDs of all companies whose website text contains the search term
matching_ids = set()

# loop over chunk files; only the first two are read to keep the example short
for input_file in input_files[:2]:

    # read each file in pieces of 20,000 rows due to file size;
    # on_bad_lines='skip' needs pandas >= 1.3 (it replaces error_bad_lines=False)
    for data in pd.read_csv(input_file, chunksize=20000, sep="\t", encoding='utf-8', on_bad_lines='skip'):
        # drop webpages without text
        data = data.loc[data['text'].notnull(), :]
        # collect the IDs of all companies with a match in this piece
        matches = data['text'].str.contains(search_term, regex=False)
        matching_ids.update(data.loc[matches, 'ID'].values)

print(matching_ids)

{2011052555, 2011057167, 2011061267, 2011027478, 2011052569, 2011047456, 2011063331, 2011025957, 2011042854, 2011026988, 2011041328, 2011058738, 2011066931, 2011044404, 2011059767, 2011046456, 2011065399, 2011024957, 2011022407, 2011027528, 2011029065, 2011024458, 2011032141, 2011044944, 2011052113, 2011074132, 2011028054, 2011053145, 2011036769, 2011060324, 2011037285, 2011048551, 2011031664, 2011025524, 2011065460, 2011054717, 2011032191, 2011062922, 2011040399, 2011026066, 2011024534, 2011051159, 2011027609, 2011041434, 2011049641, 2011017900, 2011016884, 2011053238, 2011072700, 2011031742, 2011043006, 2011062467, 2011032260, 2011015882, 2011055309, 2011029199, 2011063504, 2011047633, 2011024084, 2011023572, 2011066072, 2011012323, 2011046115, 2011063526, 2011028715, 2011053291, 2011051253, 2011058950, 2011063047, 2011023116, 2011045137, 2011051289, 2011024668, 2011025693, 2011049244, 2011033888, 2011044129, 2011043618, 2011043106, 2011050273, 2011036453, 2011052846, 2011052849, 2011044673, 2011049798, 2011050827, 2011062107, 2011040097, 2011034981, 2011021157, 2011044710, 2011029352, 2011030889, 2011041646, 2011039087, 2011039599, 2011030898, 2011041141, 2011044213, 2010999159, 2011037566, 2011051910, 2011051399, 2011039626, 2011048332, 2011036047, 2011030934, 2011038616, 2011039129, 2011027866, 2011042723, 2011062695, 2011037611, 2011026860, 2011040173, 2011032494, 2011049389, 2011025844, 2011062204, 2011025346, 2011032003, 2011054536, 2011042258, 2011059155, 2011048405, 2011040219, 2011027945, 2011037677, 2011041782, 2011025916, 2011055615}

The output above lists the IDs of companies whose corporate website text contains the search term “digital”. It is meant only as a very high-level example of how to work with the MWP.
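
A set of matching IDs can be extended to a simple firm-level indicator, for example the share of a firm’s pages that mention the search term. The sketch below is one possible way to do this while keeping memory use low: it keeps only two counters per firm and accepts any iterable of DataFrames, such as the pieces yielded by pd.read_csv with chunksize. The function `term_share_per_firm` and the toy data are hypothetical. Unlike the loop above, the match here is case-insensitive, which is a design choice of this sketch.

```python
from collections import Counter
import pandas as pd

def term_share_per_firm(chunks, term):
    """Share of each firm's pages whose text contains `term` (case-insensitive)."""
    pages, hits = Counter(), Counter()
    for data in chunks:
        data = data.dropna(subset=["text"])  # ignore pages without text
        matched = data["text"].str.contains(term, case=False, regex=False)
        pages.update(data["ID"].value_counts().to_dict())
        hits.update(data.loc[matched, "ID"].value_counts().to_dict())
    shares = {firm: hits[firm] / n for firm, n in pages.items()}
    return pd.Series(shares, name="share").sort_index()

# Toy pieces standing in for the output of pd.read_csv(..., chunksize=...)
toy_chunks = [
    pd.DataFrame({"ID": [1, 1, 2],
                  "text": ["Digitale Lösungen", "Kontakt", None]}),
    pd.DataFrame({"ID": [2, 3],
                  "text": ["Impressum", "Wir sind DIGITAL"]}),
]
shares = term_share_per_firm(toy_chunks, "digital")
print(shares)
```

Because firms can be spread over several pieces of a file, the counters are accumulated across pieces and the shares are computed only at the end.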

Can the MWP be combined with other data?

Please note that the MWP can be combined with further firm-level data hosted at ZEW, such as the Mannheim Innovation Panel (MIP) and the IAB/ZEW Start-up Panel. This is how the MWP realizes its full potential. Find out more on the website of the ZEW-FDZ.

The Mannheim Web Panel was developed as part of the project BERD@BW funded by the Ministry of Science, Research and the Arts of Baden-Württemberg from 2019 to 2022.
