Here, we present an analysis of the results of the needs assessment online survey that was conducted for the German Competence Center Cloud Technologies for Data Management and Processing (de.KCD). The aim of the needs assessment is to get an overview of the cloud technologies that are deployed for data management and analysis in the different scientific fields and by the scientific community as a whole. In addition, gaps in the existing training materials of participating research data initiatives should be identified to inform experts from partner institutions where a need for additional materials exists.
The survey can be accessed by using the following link: https://ec.europa.eu/eusurvey/runner/deKCD_needs_assessment_2024
The anonymized survey results are available in the following GitHub repository: https://github.com/deKCD/needs-assessment-evaluation
The survey was carried out using the online survey platform EUSurvey. The survey results were exported in the .xlsx format and then converted to the .csv format to enable analysis with other software packages. Exploratory data analysis and plotting were conducted using the programming language R and the IDE RStudio; see the R session info section for more detailed information.
# load libraries
library(tidyverse)
library(DT)
library(webshot)
library(gridExtra)
library(ggrepel)
library(ggplot2)
library(scales)
library(flextable)
# read in survey raw data
res <- read.csv("data/Content_Export_deKCD_needs_assessment_2024_2024-11-19.CSV", sep = ";", na.strings = c("", "NA"))
# add unique contribution ID to each entry
res$ID <- seq.int(nrow(res))
# separate results by topic and keep submission IDs for each subset
# General information
res_general <- select(res, c(59, 1:5))
# Provision of cloud services
res_CS_provision <- select(res, c(59, 6:14))
# Training content
res_training <- select(res, c(59, 15:23))
# Data processing
res_data <- select(res, c(59, 24:41))
# Infrastructure
res_infra <- select(res, c(59, 42:53))
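The import and ID-assignment steps above can be illustrated with a small self-contained example; the inline data below is a hypothetical stand-in for the actual EUSurvey export, but the `read.csv()` arguments match those used for the real data:

```r
# Hypothetical two-column excerpt mimicking a semicolon-separated EUSurvey export;
# empty fields and literal "NA" strings are both treated as missing values.
csv_text <- "Field;Consortium\nLife sciences;NFDI4Microbiota\n;NA"
res_demo <- read.csv(text = csv_text, sep = ";", na.strings = c("", "NA"))
# add a unique contribution ID to each entry, as done for the real data
res_demo$ID <- seq.int(nrow(res_demo))
```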
The primary target group for the survey were people engaged in NFDI projects or other projects dedicated to providing data management and data analysis services to a scientific audience.
To understand which scientific fields are covered by the survey results, we asked participants to which field their project can be assigned. The number of participants from each field is visualized in the following plot:
# Plot: Distribution of scientific fields among participants
# Select relevant column, count occurrences, and add n=0 for missing values
plot1_data = select(res_general, 5) %>%
na.omit() %>%
rename(ScientificField = 1) %>%
count(ScientificField) %>%
complete(ScientificField = c("Engineering sciences","Humanities and social sciences","Life sciences","Natural sciences","Interdisciplinary project"), fill = list(n = 0))
# Move "Interdisciplinary project" to end for plotting
plot1_data = plot1_data %>%
bind_rows(slice(.,3)) %>%
slice(-3)
# lock factor level order
#plot1_data$ScientificField <- factor(plot1_data$ScientificField, levels = plot1_data$ScientificField)
# Make bar plot
plot_1 <- ggplot(plot1_data, aes(x=reorder(ScientificField, -as.numeric(n)), y=as.numeric(n), fill=ScientificField)) +
geom_bar(colour="black", stat="identity") +
scale_y_continuous(expand = expansion(mult = c(0, .1))) +
scale_fill_brewer(palette = "Set2") +
theme_minimal() +
theme(axis.text.x = element_blank()) +
labs(fill = "Scientific field", y = "Participants", x = "Scientific field")
plot_1
Figure 1: Distribution of scientific fields in which participants of the survey work primarily.
In total, 34 individuals took part in the survey. The majority of contributions came from the Life Science domain, with a total of 19 participants. The second largest group consisted of people engaged in interdisciplinary projects. Of the 6 people within this group, two are active in the NFDI project Base4NFDI, two are working in local data initiatives at universities, one was active in the NFDI4Earth consortium, and one was active in multiple NFDI consortia. The rest of the participants were split between projects in Engineering sciences (3), Humanities and social sciences (2), and Natural sciences (1). One participant classified themselves as a local contact person for RDM matters without specifying a scientific domain. This distribution represents a clear bias towards contributions from the Life Science domain, which will be taken into consideration in the further evaluation of the survey results.
Since most of the participants indicated that they are active in more than one NFDI consortium, we were also interested to know which consortia are not represented by at least one participant of the survey. This information can be helpful when deciding which NFDI consortia should be contacted directly if the data basis of this survey is to be improved in the future. All NFDI consortia that were not represented in the current survey results are listed in the following table.
# Table: not represented NFDI consortia
# read in table with all NFDI consortia + domains
consortia = read.csv("data/NFDI_consortia.CSV", sep = ";")
# extract NFDI consortia data from results
tab1_data = select(res_general, 3) %>%
rename(consortium = 1) %>%
# discard empty rows
na.omit() %>%
# split fields with multiple entries
separate_longer_delim(consortium, delim = "; ") %>%
# discard duplicates
distinct()
# add new column indicating presence
tab1_data$represented = TRUE
# merge with consortia table, discard represented consortia
tab1_data = left_join(consortia, tab1_data, by = join_by(consortium)) %>%
mutate(represented = replace_na(represented, FALSE)) %>%
filter(represented == FALSE)
if (knitr::is_html_output()) {
datatable(tab1_data)
} else {
flextable(tab1_data, cwidth = c(2,3,1)) %>%
set_header_labels(consortium = "NFDI Consortium", domain = "Scientific domain", represented = "Represented?") %>%
theme_zebra()
}
Table 1: NFDI consortia that are not represented by at least one participant of the online survey.
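The join-and-filter logic above can also be expressed as a simple membership test; a minimal base-R sketch, using a hypothetical selection of consortium names rather than the actual survey data:

```r
# Hypothetical full list of consortia and the subset represented in the survey
consortia_demo <- data.frame(consortium = c("NFDI4Health", "NFDI4Culture", "NFDI4Ing"))
represented_demo <- c("NFDI4Health")
# keep only consortia that do not appear among the represented ones
missing_demo <- consortia_demo[!consortia_demo$consortium %in% represented_demo, , drop = FALSE]
```

With dplyr loaded, the same result could also be obtained with `anti_join()`, which keeps rows of the first table that have no match in the second.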
Besides their involvement in NFDI projects, we also asked participants if they are active in other RDM initiatives and projects outside of the NFDI space. The following table shows an overview of the initiatives mentioned by participants.
#manually extract information from results (res_general) and generate table
Initiatives <- c("ELIXIR","de.NBI", "TRR projects", "local RDM initiatives", "EOSC", "RDA", "CLARIN ERIC", "QUADRIGA", "HMC")
n <- c(4, 4, 3, 3, 3, 1, 1, 1, 1)
tab2_data <- data.frame(Initiatives, n)
if (knitr::is_html_output()) {
datatable(tab2_data)
} else {
flextable(tab2_data, cwidth = c(3,1)) %>%
set_header_labels(Initiatives = "Initiative/Project", n = "n") %>%
theme_zebra()
}
Table 2: Non-NFDI RDM projects and initiatives represented by participants of the online survey.
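A manual tally like the one above can be partly automated by pattern matching on the free-text answers; a base-R sketch on hypothetical answer strings (illustrative only, not the actual survey responses):

```r
# Hypothetical free-text answers mentioning RDM initiatives (illustrative only)
answers_demo <- c("ELIXIR and de.NBI", "local RDM initiative at our university", "ELIXIR")
# count how many answers mention a given keyword (case-insensitive)
count_mentions <- function(keyword, answers) {
  sum(grepl(keyword, answers, ignore.case = TRUE))
}
```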
The first topic-specific question block of the survey was primarily addressing participants who already provide cloud services for their target audience or are in the process of developing such a solution. We asked them which services they provide, what their experiences with implementing those services were, and what challenges they faced. If participants indicated that they are not offering cloud services for their users, we were also interested in the reason for this decision. Lastly, participants were asked to indicate how well they think cloud services are accepted in their user base and what the possible reasons for low acceptance could be.
#summarise answers to first question
b2q1 = res_CS_provision %>%
rename(answer = 2) %>%
select(answer) %>%
na.omit() %>%
count(answer)
# collect free text answers to second question
b2q2 = res_CS_provision %>%
rename(answer = 3) %>%
select(answer) %>%
na.omit()
Out of 33 participants, 23 (70%) indicated that they already provide cloud services for their target audience, while 2 participants stated that they are currently implementing such a cloud service. The cloud services provided by participants of the survey can be roughly classified into four main types: (i) cloud storage, (ii) resources for field-specific applications, (iii) generic compute resources and VM hosting, and (iv) research metadata services.
Overall, the most frequently mentioned service that participants provide was cloud storage. The examples given ranged from commercial cloud storage solutions and storage provided through self-hosted Infrastructure-as-a-Service (IaaS) platforms to established file-sharing protocols. The only commercial cloud storage service mentioned by participants was the Simple Storage Service (S3)1, which is part of Amazon Web Services (AWS)2. The IaaS solutions hosted by some participants seem to be mostly based on the OpenStack cloud operating system3, and thus the OpenStack Object Storage (Swift) software4 was used to provide access to the storage backend in those cases.
Both S3 and Swift are object storage solutions, which are particularly suited to handling large amounts of unstructured data in large files. Notably, block storage was also mentioned by two participants, presumably for specific applications that are based on structured data and benefit from the shorter access times of block storage compared to object storage, especially for small files.
Another frequently mentioned cloud storage use case is the provision of shared file systems, either within closed academic networks or on a global scale. The provision of local shared file systems was realized using established file-sharing protocols like the UNIX network protocol Network File System (NFS)5 or the equivalent protocol used by Windows operating systems, Server Message Block (SMB)6. In one record, a participant also mentioned making such shared file systems accessible from outside the closed academic network by using the non-profit service Globus share7, which was developed at the University of Chicago and can be accessed from around the world.
Besides cloud storage, field-specific applications running in the cloud were the most frequently mentioned cloud services provided by participants of the survey. Naturally, this is also the most diverse category of services, since every specialized field has developed its own applications to solve specific problems within the field. Most of the specialized applications mentioned by participants were developed to provide execution environments for particular workflows and/or handle field-specific data types.
A good example of specialized execution environments is the provision of compute resources for automated text analysis algorithms. In the past, this use case was primarily popular in the social sciences, but the recent explosion of Natural Language Processing (NLP) applications, driven primarily by the adoption of deep neural network methods (deep learning)8, has demonstrated that many fields can benefit from this expertise9 10 11.
An example of a field-specific application focused on handling a particular data type is the provision of environments specialized for image data. One such application, which originated in the light-microscopy domain, is the Open Microscopy Environment Remote Objects (OMERO) software platform12. The SaaS platform was developed on the basis of the Open Microscopy Environment (OME) Data Model13 and the image translation library Bio-Formats14 and provides a data management platform for the analysis and processing of complex image data. It has also been adapted as the basis of scientific image repositories, and extensions are being developed to handle use cases beyond image data, such as the analysis of human genome data15.
Another type of application that might be particularly interesting for scientific disciplines with a wet-lab component is the Laboratory Information Management System (LIMS). LIMS are software solutions that allow research laboratories to manage samples, instruments, processes, and generated data within one integrated system. Electronic Lab Notebooks (ELNs) fall, depending on the feature set they provide, under the same umbrella term, but are usually more focused on documenting the research process and serve as a replacement for traditional paper lab notebooks. Both LIMS and ELNs are regularly deployed as closed institutional solutions as well as cloud services with broader access possibilities and interoperability with other services. Popular open-source examples of ELNs that are also deployed within several NFDI projects are Chemotion16, a service particularly designed for the chemical sciences, and eLabFTW17, a more generic service which can be adapted to several research disciplines.
Participants of the survey were also involved in the Galaxy project, which provides freely accessible data analysis tools and pipelines that researchers can access through a graphical user interface in a web browser. The project was originally initiated within the field of genomics and aimed at making reproducible computational workflows available to researchers without in-depth skills in bioinformatics18. Since then, the project has grown massively and now provides computational resources worldwide for a constantly growing user base that has expanded far beyond the field of genomics19. The service has been constantly updated and now encompasses a sophisticated learning platform for users and a vast collection of software packages for data analysis in a variety of research fields. Besides Galaxy, the Cloud-based Workflow Manager (CloWM) was also brought up by survey participants as an alternative platform for executing reproducible analysis pipelines in the cloud. The service allows developers to convert their workflows into a web service for publication and execution by end users on a scalable compute cluster20.
Besides the aforementioned specialized applications, the provision of generic compute resources was also frequently mentioned by participants of the survey. This is likely connected to the fact that many professionals addressed by the survey work in core facilities of universities and other academic institutions, where part of their work is the provision and maintenance of computational resources for institution members and the scientific public. As already hinted at in a previous section, OpenStack and AWS seem to be the preferred platforms for this task among participants, and the deployment of Kubernetes clusters was also mentioned frequently.
# collect answers to third question
b2q3 = res_CS_provision %>%
rename(answer = 4) %>%
select(answer) %>%
na.omit()
We asked those participants of the survey who are involved in providing cloud services for researchers which topics they found particularly challenging when implementing and operating those services. The answers to this question were split between four superordinate subject areas: (i) technical challenges, (ii) legal challenges, (iii) user interaction, and (iv) maintenance and continued development. Topics for each subject area are summarized below and can serve as a first orientation when deciding which training material addressing cloud service providers would be relevant for this group in the context of the cloud technology competence center.
Technical challenges:
Legal challenges:
User interaction:
Maintenance and continued development:
In the last question of the first question block we wanted to know how well cloud services are accepted among the potential user base in the perception of survey participants. We asked the participants to rate user acceptance on a scale between 0 and 10, where 0 meant users did not accept cloud services at all and 10 equaled full acceptance of cloud services in the user base. The results of the rating are summarized in the histogram below.
#get rating values for cloud acceptance
b2q4 = res_CS_provision %>%
rename(val = 9) %>%
select(ID, val) %>%
na.omit()
#make a histogram for visualization of distribution
plot_2 <- ggplot(b2q4, aes(x=val)) +
geom_histogram(binwidth = 1, color = "black", fill="#00A2F9") +
geom_vline(aes(xintercept = median(val)), color = "red", linetype = "dashed", linewidth = 1) +
scale_x_continuous(breaks = seq(0, 10, by = 1), limits = c(0,11)) +
geom_text(x=1, y=10, label="n = 32", size = 5) +
geom_text(x=1, y=9, label="median = 8", size = 5) +
theme_minimal() +
xlab("value") +
ggtitle("Cloud service acceptance estimation of survey participants")
plot_2
Figure 2: Cloud service acceptance as rated by survey participants (0 = no acceptance, 10 = full acceptance). The dashed red line marks the median.
Overall, the results imply that cloud services are generally well accepted, with a median acceptance value of 8. Besides the clear peak around an acceptance value of 8, however, a considerable number of participants rated the acceptance of cloud services with a value of 5 or less. We were interested to know whether low acceptance values correlated with the scientific field in which the services are used. Many of the existing cloud services supporting research originate from the Life Science domain and, in some cases, have been actively used by the community for more than a decade. Users from the Life Science domain might therefore be more accustomed to the tools and processes than users from other domains, potentially increasing acceptance. We therefore separated the acceptance ratings of survey participants from the Life Science domain from the other results and generated two histograms: one with acceptance values from the Life Science domain and the other summarizing results from all other scientific disciplines.
#get scientific field answers and merge with acceptance value
scientific_field_data = select(res_general, ID, 5) %>%
na.omit() %>%
rename(ScientificField = 2)
plot2_data <- left_join(b2q4, scientific_field_data, by = "ID")
#subset data for Life Science and non Life science
plot2_data_LS <- filter(plot2_data, ScientificField == "Life sciences")
plot2_data_noLS <- filter(plot2_data, ScientificField != "Life sciences")
#plot both subsets next to each other
plot_2a <- ggplot(plot2_data_LS, aes(x=val)) +
geom_histogram(binwidth = 1, color = "black", fill="#00A2F9") +
geom_vline(aes(xintercept = median(val)), color = "red", linetype = "dashed", linewidth = 1) +
scale_x_continuous(breaks = seq(0, 10, by = 1), limits = c(0,11)) +
ylim(0, 8) +
geom_text(x=1, y=7, label="n = 18", size = 5) +
geom_text(x=1, y=5, label="median = 8", size = 5) +
theme_minimal() +
xlab("") +
ggtitle("Cloud service acceptance estimation in Life Sciences")
plot_2b <- ggplot(plot2_data_noLS, aes(x=val)) +
geom_histogram(binwidth = 1, color = "black", fill="#00A2F9") +
geom_vline(aes(xintercept = median(val)), color = "red", linetype = "dashed", linewidth = 1) +
scale_x_continuous(breaks = seq(0, 10, by = 1), limits = c(0,11)) +
ylim(0, 8) +
geom_text(x=1, y=7, label="n = 12", size = 5) +
geom_text(x=1, y=5, label="median = 6.5", size = 5) +
theme_minimal() +
xlab("value") +
ggtitle("Cloud service acceptance estimation outside Life Sciences")
grid.arrange(plot_2a, plot_2b, ncol = 1)
#calculate some statistics
n2a = nrow(plot2_data_LS)
median2a = median(plot2_data_LS$val)
n2b = nrow(plot2_data_noLS)
median2b = median(plot2_data_noLS$val)
# plot cloud service acceptance cloud providers vs non-cloud-providers
plot3_data = res_CS_provision %>%
rename(val = 9, provider = 2) %>%
select(ID, provider, val) %>%
na.omit()
plot3_data_provider <- filter(plot3_data, provider == "Yes")
plot3_data_nonprovider <- filter(plot3_data, provider != "Yes")
plot_3a <- ggplot(plot3_data_provider, aes(x=val)) +
geom_histogram(binwidth = 1, color = "black", fill="#00A2F9") +
geom_vline(aes(xintercept = median(val)), color = "red", linetype = "dashed", linewidth = 1) +
scale_x_continuous(breaks = seq(0, 10, by = 1), limits = c(0,11)) +
ylim(0, 8) +
theme_minimal() +
xlab("") +
ggtitle("Cloud service acceptance estimation of cloud service providers")
plot_3b <- ggplot(plot3_data_nonprovider, aes(x=val)) +
geom_histogram(binwidth = 1, color = "black", fill="#00A2F9") +
geom_vline(aes(xintercept = median(val)), color = "red", linetype = "dashed", linewidth = 1) +
scale_x_continuous(breaks = seq(0, 10, by = 1), limits = c(0,11)) +
ylim(0, 8) +
theme_minimal() +
xlab("value") +
ggtitle("Cloud service acceptance estimation of participants not providing cloud services")
#grid.arrange(plot_3a, plot_3b, ncol = 1)
Figure 3: Cloud service acceptance ratings of participants from the Life Science domain (top) and from all other domains (bottom). The dashed red lines mark the medians.
The two plots show that the median acceptance of cloud services drops notably, to a value of 6.5, when considering only contributions from outside the Life Science domain, compared to a median acceptance of 8 within the Life Sciences. These results suggest that scientists from other disciplines are generally more hesitant to use cloud services in their research process.
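Given the small group sizes, whether such a difference in medians is statistically meaningful could be checked with a rank-based test. A sketch using `wilcox.test()` on hypothetical rating vectors (illustrative values, not the actual survey ratings):

```r
# Hypothetical acceptance ratings for two groups (illustrative values only)
ratings_a <- c(7, 8, 8, 9, 8, 7, 10, 8)
ratings_b <- c(5, 6, 7, 5, 8, 6, 4, 7)
# Two-sided Wilcoxon rank-sum test; exact = FALSE uses the normal
# approximation, avoiding warnings caused by tied ratings
test_res <- wilcox.test(ratings_a, ratings_b, exact = FALSE)
```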
#collect free text answer for reasoning why cloud services are not accepted
b2q5 <- res_CS_provision %>%
rename(val = 9, answers = 10) %>%
select(ID, val, answers) %>%
filter(val <= 6) %>%
na.omit()
#b2q5$answers
To better understand the potential reasons for low acceptance of cloud services within the research community, we asked participants for their opinion on this topic. The free-text answers to this question are summarized in the following bullet points:
Technical issues
Practical issues
Missing education and lack of trust in cloud technologies
The second question block of the survey included questions around the topic of training materials teaching the use and implementation of cloud services for research data management and data analysis. We wanted to know which training content already exists within the projects of survey participants, in which form it is provided, and which gaps exist that could be filled by the de.KCD project.
Out of 34 survey participants, 20 (59%) indicated that they already provide training materials within their project. When designing the training material, all participants who provided training content had PhD students and postdocs as a target group in mind (see Figure 4). Employees at research institutions (16 out of 20) and students in general (15 out of 20) were the next most frequently mentioned target groups. The least frequently mentioned target group were data stewards, who were considered by 8 out of 20 participants who provide training content within their project. One participant indicated that they are providing training content targeted at all academic levels.
## "Do you provide training content?"
# get summary of question 1
b3q1 = res_training %>%
rename(answer = 2) %>%
select(answer) %>%
na.omit() %>%
count(answer)
#if (knitr::is_html_output()) {
# datatable(b3q1)
#} else {
# flextable(b3q1, cwidth = c(1,1)) %>%
# set_header_labels(answer = "Answer", n = "n") %>%
# theme_zebra()
#}
## Training content target group
# collect answers to 2nd question and split multi-choice columns by delimiter
b3q2 = res_training %>%
rename(answer = 3) %>%
select(answer) %>%
na.omit() %>%
separate_longer_delim(answer, delim = ";")
# remove leading and trailing whitespace
b3q2 <- as.data.frame(trimws(b3q2$answer))
# count occurrences of answers
plot3_1_data = b3q2 %>%
rename(answer = 1) %>%
count(answer)
# make bar plot
plot3_1 <- ggplot(plot3_1_data, aes(x=reorder(answer, -as.numeric(n)), y=n, fill = reorder(answer, -as.numeric(n)))) +
geom_bar(color = "black", stat = "identity") +
scale_fill_brewer(palette = "Set2") +
theme_minimal() +
theme(axis.text.x = element_blank()) +
labs(fill = "Target group", y = "count", x = "Target group")
plot3_1
Figure 4: Target groups for training materials provided within the projects of survey participants.
Training materials can be provided in many different ways, so we were interested to know in which form the content is provided to the target audience. The results of this question are shown in Figure 5. Almost all participants (19 out of 20) who provide training content use online courses and workshops as a vehicle to deliver knowledge about their cloud services to their users. Besides that, slide decks (12 out of 20) and knowledge databases or wikis (11 out of 20) were popular choices as well. Only a minority provided training materials via a project homepage (8 out of 20) and educational video material was the least popular choice to deliver knowledge to the target group (5 out of 20).
## Type of training content
b3q3 = res_training %>%
rename(answer = 5) %>%
select(answer) %>%
na.omit() %>%
separate_longer_delim(answer, delim = ";")
# remove leading and trailing whitespace
b3q3 <- as.data.frame(trimws(b3q3$answer))
# count occurrences of answers
plot3_2_data = b3q3 %>%
rename(answer = 1) %>%
count(answer)
# make bar plot
plot3_2 <- ggplot(plot3_2_data, aes(x=reorder(answer, -as.numeric(n)), y=n, fill = reorder(answer, -as.numeric(n)))) +
geom_bar(color = "black", stat = "identity") +
scale_fill_brewer(palette = "Set2") +
theme_minimal() +
theme(axis.text.x = element_blank()) +
labs(fill = "Type of training content", y = "count", x = "Type of training content")
plot3_2
Figure 5: Types of training content that are used within projects of survey participants to deliver knowledge to their target audience.
Besides the pre-defined types of training content shown above, participants also used other means to transfer knowledge to their target audience. These include individual training sessions, documentation in Git repositories, Jupyter notebooks, and interactive self-learning platforms.
# list "other"
plot3_2_other = res_training %>%
rename(answer = 6) %>%
select(answer) %>%
na.omit()
#if (knitr::is_html_output()) {
# datatable(plot3_2_other)
#} else {
# flextable(plot3_2_other, cwidth = 3) %>%
# set_header_labels(answer = "Other answers") %>%
# theme_zebra()
#}
Subjects for which survey participants provide learning content:
## Training content subjects
b3q4 = res_training %>%
rename(answer = 7) %>%
select(answer) %>%
na.omit()
#b3q4$answer
HPC training
Cloud service training
Research data management training
Workflow/Analysis pipeline training
Do participants use cloud-based services for conducting training?
## cloud-based services for training
# summarize usage of cloud services to provide training content
b3q5 = res_training %>%
rename(answer = 8) %>%
select(answer) %>%
na.omit() %>%
count(answer)
if (knitr::is_html_output()) {
datatable(b3q5)
} else {
flextable(b3q5, cwidth = c(2,1)) %>%
set_header_labels(answer = "Answer", n = "n") %>%
theme_zebra()
}
Which cloud services are used to conduct training?
# collect answers
b3q6 = res_training %>%
rename(answer = 9) %>%
select(answer) %>%
na.omit()
#b3q6$answer
## Most important topics participants would need training on:
b3q7 = res_training %>%
rename(answer = 10) %>%
select(answer) %>%
na.omit()
#b3q7$answer
We asked the survey participants which, in their opinion, are the most important topics on which they or people in their project would need additional training materials. The free text answers to this question are summarized in the following bullet points.
Cloud infrastructure knowledge
Research data management for applications in the cloud
Data analysis in the cloud
Access to cloud resources
Other topics
Main types of data in the project:
b4q1 = res_data %>%
rename(answer = 2) %>%
select(answer) %>%
na.omit()
if (knitr::is_html_output()) {
datatable(b4q1)
} else {
flextable(b4q1, cwidth = 6) %>%
set_header_labels(answer = "Answers") %>%
theme_zebra()
}
Typical size of data files within the project:
x <- c("GBytes", "Hundreds of GBytes", "TBytes", "Hundreds of TBytes", "PBytes")
b4q2 = res_data %>%
rename(answer = 3) %>%
select(answer) %>%
na.omit() %>%
count(answer) %>%
mutate(answer = factor(answer, levels = x)) %>%
arrange(answer)
# make bar plot
plot4_1 <- ggplot(b4q2, aes(x=answer, y=n, fill = answer)) +
geom_bar(color = "black", stat = "identity") +
scale_fill_brewer(palette = "Set2") +
theme_minimal() +
theme(axis.text.x = element_blank()) +
labs(fill = "Magnitude of data", y = "count", x = "Magnitude of data")
plot4_1
Figure 6: Typical size of data files within the projects of survey participants.
Do participants provide cloud solutions for data deposition and sharing?
# Do participants provide cloud solutions for data deposition and sharing?
b4q3 = res_data %>%
rename(answer = 4) %>%
select(answer) %>%
na.omit() %>%
count(answer)
if (knitr::is_html_output()) {
datatable(b4q3)
} else {
flextable(b4q3, cwidth = c(2,1)) %>%
set_header_labels(answer = "Answer", n = "n") %>%
theme_zebra()
}
# Which services do they provide?
b4q4 = res_data %>%
rename(answer = 5) %>%
select(answer) %>%
na.omit()
#b4q4$answer
Used data deposition services:
Are standardized formats used for data and metadata?
# Are standardized formats used for data and metadata?
b4q5 = res_data %>%
rename(answer = 6) %>%
select(answer) %>%
na.omit() %>%
count(answer)
if (knitr::is_html_output()) {
datatable(b4q5)
} else {
flextable(b4q5, cwidth = c(1,1)) %>%
set_header_labels(answer = "Answer", n = "n") %>%
theme_zebra()
}
Which standardized data formats are used?
# Which standard formats are used?
b4q6 = res_data %>%
rename(answer = 7) %>%
select(answer) %>%
na.omit()
if (knitr::is_html_output()) {
datatable(b4q6)
} else {
flextable(b4q6, cwidth = 6) %>%
set_header_labels(answer = "Answers") %>%
theme_zebra()
}
Is sensitive data handled within the project?
# Is sensitive data handled within the project?
b4q7 = res_data %>%
rename(answer = 8) %>%
select(answer) %>%
na.omit() %>%
count(answer)
if (knitr::is_html_output()) {
datatable(b4q7)
} else {
flextable(b4q7, cwidth = c(1,1)) %>%
set_header_labels(answer = "Answer", n = "n") %>%
theme_zebra()
}
# How is sensitive data protected?
b4q8 = res_data %>%
rename(answer = 9) %>%
select(answer) %>%
na.omit()
#b4q8$answer
Measures taken to protect sensitive data:
Is the use of AI algorithms relevant for the project?
# is the use of AI algorithms relevant for the project?
b4q9 = res_data %>%
rename(answer = 10) %>%
select(answer) %>%
na.omit() %>%
count(answer)
if (knitr::is_html_output()) {
datatable(b4q9)
} else {
flextable(b4q9, cwidth = c(1,1)) %>%
set_header_labels(answer = "Answer", n = "n") %>%
theme_zebra()
}
Which algorithms are used and for which purpose?
# Which algorithms and for what purpose?
b4q10 = res_data %>%
rename(answer = 11) %>%
select(answer) %>%
na.omit()
if (knitr::is_html_output()) {
datatable(b4q10)
} else {
flextable(b4q10, cwidth = 6) %>%
set_header_labels(answer = "Answers") %>%
theme_zebra()
}
Are solutions for reproducible analysis employed?
# Are solutions for reproducible analysis employed?
b4q11 = res_data %>%
rename(answer = 12) %>%
select(answer) %>%
na.omit() %>%
count(answer)
if (knitr::is_html_output()) {
datatable(b4q11)
} else {
flextable(b4q11, cwidth = c(1,1)) %>%
set_header_labels(answer = "Answer", n = "n") %>%
theme_zebra()
}
Which services or solutions are used for reproducible data analysis?
# Which services/solutions are used?
b4q12 = res_data %>%
rename(answer = 13) %>%
select(answer) %>%
na.omit()
if (knitr::is_html_output()) {
datatable(b4q12)
} else {
flextable(b4q12, cwidth = 6) %>%
set_header_labels(answer = "Answers") %>%
theme_zebra()
}
Are services for reproducible data analysis used or provided via the cloud?
# Are cloud solutions for reproducible analysis employed?
b4q13 = res_data %>%
rename(answer = 14) %>%
select(answer) %>%
na.omit() %>%
count(answer)
if (knitr::is_html_output()) {
datatable(b4q13)
} else {
flextable(b4q13, cwidth = c(1,1)) %>%
set_header_labels(answer = "Answer", n = "n") %>%
theme_zebra()
}
Would reproducible data analysis via the cloud be of interest where it is not yet used?
# Would cloud solutions for reproducible data analysis be of interest, in case they are not used yet?
b4q14 = res_data %>%
rename(answer = 15) %>%
select(answer) %>%
na.omit() %>%
count(answer)
if (knitr::is_html_output()) {
datatable(b4q14)
} else {
flextable(b4q14, cwidth = c(1,1)) %>%
set_header_labels(answer = "Answer", n = "n") %>%
theme_zebra()
}
Is data management software used as part of the project?
# Are data management solutions employed?
b4q15 = res_data %>%
rename(answer = 16) %>%
select(answer) %>%
na.omit() %>%
count(answer)
if (knitr::is_html_output()) {
datatable(b4q15)
} else {
flextable(b4q15, cwidth = c(1,1)) %>%
set_header_labels(answer = "Answer", n = "n") %>%
theme_zebra()
}
Which data management solutions are used?
# Which data management solutions are used?
b4q16 = res_data %>%
rename(answer = 17) %>%
select(answer) %>%
na.omit()
if (knitr::is_html_output()) {
datatable(b4q16)
} else {
flextable(b4q16, cwidth = c(4)) %>%
set_header_labels(answer = "Answer") %>%
theme_zebra()
}
Are cloud services hosted on own or external infrastructure?
# Where do participants host their cloud services?
b5q1 = res_infra %>%
rename(answer = 2) %>%
select(answer) %>%
na.omit() %>%
count(answer)
# make bar plot
plot5_1 <- ggplot(b5q1, aes(x = answer, y = n, fill = answer)) +
geom_col(color = "black") +
scale_fill_brewer(palette = "Set2") +
theme_minimal() +
theme(axis.text.x = element_blank()) +
labs(fill = "Hosting of cloud services", y = "count", x = "Hosting of cloud services")
plot5_1
Figure 7: Hosting of cloud services on own or external infrastructure.
Which infrastructure is used for hosting cloud services?
# Which infrastructure is used exactly?
b5q2 = res_infra %>%
rename(answer = 3) %>%
select(answer) %>%
na.omit()
#b5q2$answer
# manual summary of answers:
Infrastructure <- c("de.NBI cloud", "academic compute centres (OpenStack)", "academic compute centres (Kubernetes)", "academic compute centres (undefined)", "commercial providers", "CloWM", "Galaxy EU", "VMware")
n <- c(13, 3, 2, 6, 2, 1, 1, 1)
tab5_1_data <- data.frame(Infrastructure, n)
if (knitr::is_html_output()) {
datatable(tab5_1_data)
} else {
flextable(tab5_1_data, cwidth = c(3, 1)) %>%
theme_zebra()
}
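The tally above was compiled by hand from the free-text answers. Case-insensitive pattern matching could support such a tally; the keywords and patterns below are illustrative assumptions, not the coding scheme actually used:

```r
# Sketch: count free-text answers mentioning selected infrastructure keywords.
# An answer naming several infrastructures is counted once per matching pattern.
answers <- c("de.NBI cloud", "OpenStack at our compute centre", "AWS")  # toy data
patterns <- c(deNBI = "de\\.?NBI",
              OpenStack = "openstack",
              commercial = "aws|azure|google")
sapply(patterns, function(p) sum(grepl(p, answers, ignore.case = TRUE)))
# each pattern matches one toy answer: 1 1 1
```

Free-text answers would still need a manual pass for spellings and providers the patterns miss, so this only pre-sorts the bulk of the responses.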
Overview of the usage of selected technologies among participants:
## summary of employed technologies
# gpu resources
b5q3 = res_infra %>%
rename(answer = 4) %>%
select(answer) %>%
count(answer) %>%
mutate(perc = round(n / sum(n) * 100, 1)) %>%
mutate(csum = rev(cumsum(rev(perc))),
pos = perc/2 + lead(csum, 1),
pos = if_else(is.na(pos), perc/2, pos))
plot5_2a <- ggplot(b5q3, aes(x = "", y = perc, fill = answer)) +
geom_col(width = 1, color = "black") +
coord_polar(theta = "y") +
scale_fill_brewer(palette = "Set2", direction = -1) +
geom_label_repel(data = b5q3,
aes(y = pos, label = paste0(perc, "%")),
size = 4.5, nudge_x = 1, show.legend = FALSE) +
guides(fill = guide_legend(title = "Answer: ")) +
theme_void() +
labs(title = "GPU cloud computing resources")
# containerization software
b5q4 = res_infra %>%
rename(answer = 5) %>%
select(answer) %>%
count(answer) %>%
mutate(perc = round(n / sum(n) * 100, 1)) %>%
mutate(csum = rev(cumsum(rev(perc))),
pos = perc/2 + lead(csum, 1),
pos = if_else(is.na(pos), perc/2, pos))
plot5_2b <- ggplot(b5q4, aes(x = "", y = perc, fill = answer)) +
geom_col(width = 1, color = "black") +
coord_polar(theta = "y") +
scale_fill_brewer(palette = "Set2", direction = -1) +
geom_label_repel(data = b5q4,
aes(y = pos, label = paste0(perc, "%")),
size = 4.5, nudge_x = 1, show.legend = FALSE) +
guides(fill = guide_legend(title = "Answer: ")) +
theme_void() +
labs(title = "Containerization software")
# container orchestration software
b5q5 = res_infra %>%
rename(answer = 8) %>%
select(answer) %>%
count(answer) %>%
mutate(perc = round(n / sum(n) * 100, 1)) %>%
mutate(csum = rev(cumsum(rev(perc))),
pos = perc/2 + lead(csum, 1),
pos = if_else(is.na(pos), perc/2, pos))
plot5_2c <- ggplot(b5q5, aes(x = "", y = perc, fill = answer)) +
geom_col(width = 1, color = "black") +
coord_polar(theta = "y") +
scale_fill_brewer(palette = "Set2", direction = -1) +
geom_label_repel(data = b5q5,
aes(y = pos, label = paste0(perc, "%")),
size = 4.5, nudge_x = 1, show.legend = FALSE) +
guides(fill = guide_legend(title = "Answer: ")) +
theme_void() +
labs(title = "Container orchestration software")
# Workflow management systems
b5q6 = res_infra %>%
rename(answer = 11) %>%
select(answer) %>%
count(answer) %>%
mutate(perc = round(n / sum(n) * 100, 1)) %>%
mutate(csum = rev(cumsum(rev(perc))),
pos = perc/2 + lead(csum, 1),
pos = if_else(is.na(pos), perc/2, pos))
plot5_2d <- ggplot(b5q6, aes(x = "", y = perc, fill = answer)) +
geom_col(width = 1, color = "black") +
coord_polar(theta = "y") +
scale_fill_brewer(palette = "Set2", direction = -1) +
geom_label_repel(data = b5q6,
aes(y = pos, label = paste0(perc, "%")),
size = 4.5, nudge_x = 1, show.legend = FALSE) +
guides(fill = guide_legend(title = "Answer: ")) +
theme_void() +
labs(title = "Workflow management systems")
# make unified plot
grid.arrange(plot5_2a, plot5_2b, plot5_2c, plot5_2d, ncol=2)
Figure 8: Usage of GPU resources, containerization software, container orchestration software, and workflow management systems among participants.
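The label positions in the pie charts above come from a reverse-cumulative-sum construction: each label sits at the midpoint of its slice along the polar y axis. A minimal numeric sketch with toy percentages shows the arithmetic:

```r
library(dplyr)

# Toy slices of 50%, 30%, 20%, stacked from top of the column downwards.
toy <- data.frame(answer = c("yes", "no", "unsure"), perc = c(50, 30, 20)) %>%
  mutate(csum = rev(cumsum(rev(perc))),   # cumulative share counted from the last slice
         pos  = perc / 2 + lead(csum, 1), # centre of each stacked slice
         pos  = if_else(is.na(pos), perc / 2, pos))

toy$pos  # 75, 35, 10: midpoints of the slices spanning 50-100, 20-50 and 0-20
```

The `if_else` branch handles the last slice, whose `lead()` value is `NA` because no slice follows it.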
Employed containerization software:
# collect answers for containerization software
b5q7 = res_infra %>%
rename(answer = 6) %>%
select(answer)
# manual summary
Containerization_software <- c("Docker", "Singularity", "Docker Compose", "Apptainer", "Kubermatic", "Charliecloud")
n <- c(24,7,3,2,1,1)
tab5_2_data <- data.frame(Containerization_software, n)
if (knitr::is_html_output()) {
datatable(tab5_2_data)
} else {
flextable(tab5_2_data, cwidth = c(2, 1)) %>%
set_header_labels(Containerization_software = "Containerization software", n = "n") %>%
theme_zebra()
}
Employed container orchestration software:
# collect answers for container orchestration software
b5q8 = res_infra %>%
rename(answer = 9) %>%
select(answer)
# manual summary
Container_orchestration_software <- c("Kubernetes", "Docker Swarm", "Kubermatic", "Rancher", "SLURM")
n <- c(8,2,1,1,1)
tab5_3_data <- data.frame(Container_orchestration_software, n)
if (knitr::is_html_output()) {
datatable(tab5_3_data)
} else {
flextable(tab5_3_data, cwidth = c(2, 1)) %>%
set_header_labels(Container_orchestration_software = "Container orchestration software", n = "n") %>%
theme_zebra()
}
Employed workflow management systems:
# collect answers for workflow management systems
b5q9 = res_infra %>%
rename(answer = 12) %>%
select(answer)
# manual summary
Workflow_management_systems <- c("Nextflow", "Galaxy", "Snakemake", "Common Workflow Language (CWL)", "UNICORE", "P-Grade", "WebLicht", "targets (R)", "Apache Airflow")
n <- c(15,6,4,2,2,2,1,1,1)
tab5_4_data <- data.frame(Workflow_management_systems, n)
if (knitr::is_html_output()) {
datatable(tab5_4_data)
} else {
flextable(tab5_4_data, cwidth = c(2, 1)) %>%
set_header_labels(Workflow_management_systems = "Workflow management system", n = "n") %>%
theme_zebra()
}
sessionInfo()
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)
Matrix products: default
locale:
[1] LC_COLLATE=German_Germany.utf8 LC_CTYPE=German_Germany.utf8
[3] LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C
[5] LC_TIME=German_Germany.utf8
time zone: Europe/Berlin
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] flextable_0.9.7 scales_1.3.0 ggrepel_0.9.6 gridExtra_2.3
[5] webshot_0.5.5 DT_0.33 lubridate_1.9.3 forcats_1.0.0
[9] stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2 readr_2.1.5
[13] tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] gtable_0.3.5 xfun_0.48 bslib_0.8.0
[4] htmlwidgets_1.6.4 tzdb_0.4.0 crosstalk_1.2.1
[7] vctrs_0.6.5 tools_4.4.1 generics_0.1.3
[10] fansi_1.0.6 highr_0.11 pkgconfig_2.0.3
[13] data.table_1.16.0 RColorBrewer_1.1-3 uuid_1.2-1
[16] lifecycle_1.0.4 compiler_4.4.1 farver_2.1.2
[19] textshaping_0.4.0 munsell_0.5.1 fontquiver_0.2.1
[22] fontLiberation_0.1.0 sass_0.4.9 htmltools_0.5.8.1
[25] yaml_2.3.10 pillar_1.9.0 jquerylib_0.1.4
[28] openssl_2.2.2 cachem_1.1.0 fontBitstreamVera_0.1.1
[31] tidyselect_1.2.1 zip_2.3.2 digest_0.6.37
[34] stringi_1.8.4 labeling_0.4.3 fastmap_1.2.0
[37] grid_4.4.1 colorspace_2.1-1 cli_3.6.3
[40] magrittr_2.0.3 utf8_1.2.4 withr_3.0.1
[43] gdtools_0.4.1 timechange_0.3.0 rmarkdown_2.28
[46] officer_0.6.7 askpass_1.2.1 ragg_1.3.3
[49] hms_1.1.3 evaluate_1.0.0 knitr_1.48
[52] rlang_1.1.4 Rcpp_1.0.13 glue_1.8.0
[55] xml2_1.3.6 jsonlite_1.8.9 rstudioapi_0.16.0
[58] R6_2.5.1 systemfonts_1.1.0
https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-smb↩︎
https://www.globus.org/data-sharing; Ananthakrishnan R., Chard K., Foster I. and Tuecke S. (2015), Globus platform-as-a-service for collaborative science applications, Concurrency Computat.: Pract. Exper., 27, pages 290–305, https://doi.org/10.1002/cpe.3262↩︎
Khurana, D., Koli, A., Khatter, K. et al. Natural language processing: state of the art, current trends and challenges. Multimed Tools Appl 82, 3713–3744 (2023). https://doi.org/10.1007/s11042-022-13428-4↩︎
Turchin A, Florez Builes LF. Using Natural Language Processing to Measure and Improve Quality of Diabetes Care: A Systematic Review. Journal of Diabetes Science and Technology. 2021;15(3):553-560. https://doi.org/10.1177/19322968211000831↩︎
Yuexiong Ding, Jie Ma, Xiaowei Luo. Applications of natural language processing in construction. Automation in Construction 136, 104169 (2022). https://doi.org/10.1016/j.autcon.2022.104169↩︎
Dan Ofer, Nadav Brandes, Michal Linial. The language of proteins: NLP, machine learning & protein sequences. Computational and Structural Biotechnology Journal 19, 1750-1758 (2021). https://doi.org/10.1016/j.csbj.2021.03.022↩︎
Allan, C., Burel, JM., Moore, J. et al. OMERO: flexible, model-driven data management for experimental biology. Nat Methods 9, 245–253 (2012). https://doi.org/10.1038/nmeth.1896↩︎
Goldberg IG, Allan C, Burel JM, Creager D, Falconi A, Hochheiser H, Johnston J, Mellen J, Sorger PK, Swedlow JR. The Open Microscopy Environment (OME) Data Model and XML file: open tools for informatics and quantitative analysis in biological imaging. Genome Biol. 2005;6(5):R47. https://doi.org/10.1186/gb-2005-6-5-r47. Epub 2005 May 3. PMID: 15892875; PMCID: PMC1175959.↩︎
Linkert M, Rueden CT, Allan C, Burel JM, Moore W, Patterson A, Loranger B, Moore J, Neves C, Macdonald D, Tarkowska A, Sticco C, Hill E, Rossner M, Eliceiri KW, Swedlow JR. Metadata matters: access to image data in the real world. J Cell Biol. 2010 May 31;189(5):777-82. https://doi.org/10.1083/jcb.201004104. PMID: 20513764; PMCID: PMC2878938.↩︎
Allan, C., Burel, JM., Moore, J. et al. OMERO: flexible, model-driven data management for experimental biology. Nat Methods 9, 245–253 (2012). https://doi.org/10.1038/nmeth.1896↩︎
Tremouilhac P, Nguyen A, Huang YC, Kotov S, Lütjohann DS, Hübsch F, Jung N, Bräse S. Chemotion ELN: an Open Source electronic lab notebook for chemists in academia. J Cheminform. 2017 Sep 25;9(1):54. https://doi.org/10.1186/s13321-017-0240-0. PMID: 29086216; PMCID: PMC5612905.↩︎
CARPi N, Minges A, Piel M. eLabFTW: An open source laboratory notebook for research labs. Journal of Open Source Software, 2(12), 146 (2017). https://doi.org/10.21105/joss.00146↩︎
Giardine B, Riemer C, Hardison RC, Burhans R, Elnitski L, Shah P, Zhang Y, Blankenberg D, Albert I, Taylor J, Miller W, Kent WJ, Nekrutenko A. Galaxy: a platform for interactive large-scale genome analysis. Genome Res. 2005 Oct;15(10):1451-5. https://doi.org/10.1101/gr.4086505. Epub 2005 Sep 16. PMID: 16169926; PMCID: PMC1240089.↩︎
Galaxy Community. The Galaxy platform for accessible, reproducible, and collaborative data analyses: 2024 update. Nucleic Acids Res. 2024 Jul 5;52(W1):W83-W94. https://doi.org/10.1093/nar/gkae410. PMID: 38769056; PMCID: PMC11223835.↩︎
https://gitlab.ub.uni-bielefeld.de/cmg/clowm/clowm-backend↩︎