Learning Pathway Gallantries Grant - Intellectual Output 1 - Introduction to data analysis and -management, statistics, and coding
Date: No date given
This Learning Pathway collects the results of Intellectual Output 1 in the Gallantries Project
Keywords: beginner, contributing, data-science, sequence-analysis, transcriptomics
Learning objectives:
- Apply common dplyr functions to manipulate data in R.
- Assess long reads FASTQ quality using Nanoplot and PycoQC
- Assess short reads FASTQ quality using FASTQE 🧬😎 and FastQC
- Be able to load and explore the shape and contents of a tabular dataset using base R functions.
- Be able to work with objects (i.e. applying mathematical and logical operators, subsetting, retrieving values, etc).
- Calculate basic statistics about datasets and columns
- Check the quality and trim the sequences with bash
- Clone a remote repository locally
- Commit a file
- Commit changes
- Compare various versions of tracked files.
- Compose an R script file containing comments, commands, objects, and functions.
- Configure git the first time it is used on a computer.
- Construct absolute and relative paths that identify specific files and directories.
- Construct command pipelines with two or more stages.
- Correctly evaluate expressions containing 'and' and 'or'.
- Create a branch
- Create a directory hierarchy that matches a given diagram.
- Create a local Git repository.
- Create a pull request
- Create a repository
- Create files in that hierarchy using an editor or by copying and renaming existing files.
- Delete, copy and move specified files and/or directories.
- Demonstrate how to see what commands have recently been executed.
- Demonstrate the use of tab completion and explain its advantages.
- Describe the purpose of the .git directory.
- Describe the types of data formats encountered during variant calling.
- Distinguish between descriptive and non-descriptive commit messages.
- Edit a file via GitHub interface
- Employ the ‘pipe’ operator to link together a sequence of functions.
- Estimate the number of reads per gene
- Explain Unix's 'small pieces, loosely joined' philosophy.
- Explain how the shell relates to the keyboard, the screen, the operating system, and users' programs.
- Explain key differences between integers and floating point numbers.
- Explain key differences between numbers and character strings.
- Explain the difference between a variable's name and its value.
- Explain the similarities and differences between a file and a directory.
- Explain what for loops are normally used for.
- Explain what a BAM file is and what it contains
- Explain what is meant by 'text' and 'binary' files, and why many common tools don't handle the latter well.
- Explain what the HEAD of a repository is and how to use it.
- Explain what usually happens if a program or pipeline isn't given any input to process.
- Explain when and why command-line interfaces should be used instead of graphical interfaces.
- Explain where information is stored at each stage of that cycle.
- Explain why programs need collections of values.
- Explain why spaces and some punctuation characters shouldn't be used in file names.
- Explore the bash dungeon and fight monsters
- Fork a repository on GitHub
- Go through the modify-add-commit cycle for one or more files.
- Identify and use Git commit numbers.
- Know advantages of analyzing data using R within Galaxy.
- Learn about check_call and check_output and when to use each of these.
- Learn about the potential pitfalls of glob
- Learn how sys.argv works
- Learn the basics to process RNA sequences
- Learn the fundamentals of programming in Python
- Make some changes
- Perform quality correction with Cutadapt (short reads)
- Process a file instead of keyboard input using redirection.
- Process single-end and paired-end data
- Push changes to a remote repository
- Re-run recently executed commands without retyping them.
- Read data from a file
- Read data with readr's read_csv
- Read data with the built-in read.csv
- Read its output.
- Recap all previous modules.
- Redirect a command's output to a file.
- Reinforce the learning of CLI basics such as how to change directories, move around, find things, and create symlinks
- Restore old versions of files.
- Run a command in a subprocess.
- Run a tool to map reads to a reference genome
- Run our software from the command line.
- Set up a Conda environment for our software project using conda.
- Set up a Python virtual environment for our software project using venv and pip.
- Summarise quality metrics with MultiQC
- Trace the execution of a simple (unnested) loop and correctly state the values of variables in each iteration.
- Trace the values taken on by a loop variable during execution of the loop.
- Translate an absolute path into a relative path and vice versa.
- Translate some known math functions (e.g. Euclidean distance, root algorithm) into Python to transfer concepts from mathematics lessons directly into Python.
- Understand factors and how they can be used to store and work with categorical data.
- Understand the basics of how automated version control systems work.
- Understand the benefits of an automated version control system.
- Understand the fundamentals of object assignment and math in Python, and be able to write simple statements and execute calculations in order to summarize the results of calculations and classify valid and invalid statements.
- Understand the meaning of the --global configuration flag.
- Understand the steps involved in variant calling.
- Understand the structure of a "function" in order to be able to construct their own functions and predict which functions will not work.
- Undo a bad change
- Update a pull request
- Use find to find files and directories whose names match simple patterns.
- Use grep to select lines from text files that match simple patterns.
- Use 'with' to ensure the file is closed properly.
- Use argparse to make it nicer.
- Use built-in functions to convert between integers, floating point numbers, and strings.
- Use command line STAR aligner to map the RNA sequences
- Use command line tools to perform variant calling.
- Use dplyr and tidyverse functions to cleanup data.
- Use exercises to ensure that all previous knowledge is sufficiently covered.
- Use a genome browser to understand your data
- Use glob to collect a list of files
- Use options and arguments to change the behaviour of a shell command.
- Use the CSV module to parse comma and tab separated datasets.
- Use the log to view the diff
- Use the output of one command as the command-line argument(s) to another command.
- Use the scientific libraries pandas and numpy to explore tabular datasets
- Write a loop that applies one or more commands separately to each file in a set of files.
- Write a simple command line program that sums some numbers
- Write a snakefile that does a simple QC and Mapping workflow
- Write conditional statements including if, elif, and else branches.
- Write for loops that use the Accumulator pattern to aggregate values.
- Write new data to a file
- Write programs that create flat lists, index them, slice them, and modify them through assignment and method calls.
- Catch an exception
- Raise your own exception
Event types:
- Workshops and courses
Sponsors: Avans Hogeschool, The Carpentries
Scientific topics: Mapping