|
|
- # Data Science Workflow
-
-
- ## Introduction ##
-
- Data science makes up much of the work a graduate student does during a master's degree. This post explores, through examples, how to use the command line efficiently and productively for data science tasks. In this sense, a data science workflow consists of five stages: (*i*) obtain, (*ii*) scrub, (*iii*) explore, (*iv*) model, and (*v*) **report your data**.
-
- For each stage we are going to explore different tools and supporting readings. The main idea, however, is to create a *highly reproducible environment*: any person (for example, another student) can follow your procedure and obtain the same results.
-
- To create a reproducible workflow (environment), let's start with the foundation of our working system: the operating system (OS).
-
- The most basic and reproducible OSs are Unix-like systems such as Linux or macOS. This kind of OS offers a text-based user interface for input and output, called the *command line*.
-
- ---
- The Unix philosophy
-
- According to [^1], the Unix philosophy is a set of established cultural norms that software developers are expected to follow whenever they create software for Unix-like systems. It emphasizes simplicity, modularity, and ease of maintenance. The most important point is that *the programs you write should implement a universal interface, such as handling text streams.*
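-
- A minimal sketch of what a text-stream interface buys us, assuming a POSIX-like shell with `printf`, `sort`, and `wc` available. The variable name `unique_count` and the sample words are illustrative only:
-
- ```bash
- # Three small tools, each doing one job, joined by pipes.
- # printf emits four lines of text, sort -u orders them and drops duplicates,
- # and wc -l counts the lines that remain.
- unique_count=$(printf 'zap\ndog\nape\ndog\n' | sort -u | wc -l)
- echo "unique words: $unique_count"
- ```
-
- None of these tools knows about the others; they cooperate only because each one reads and writes plain text.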
-
- ---
-
- ## Command-line basics ##
- Working this way requires an overall understanding of what we call the command line. According to [^2], it consists of four parts: *(a) the command-line tools, (b) the terminal, (c) the shell, and (d) the operating system*.
-
- ### Command-line tools ###
- We use command-line tools by typing their commands in the *terminal*. There are different types of command-line tools; examples include `ls`, `cat`, and `more`.
-
- ### Terminal ###
- The terminal is the application where we type our commands; see the following figure:
-
- ![Terminal](https://i.blogs.es/56c9ee/cd/450_1000.jpg "terminal-ubuntu")
-
- The dollar sign `$` shown in the figure is known as the *prompt*, and the figure shows a typical *Ubuntu terminal*, version 18.04. Other common prompts include `>`, `~`, and `->`, among others.
-
- The *terminal* is a front end for entering commands and observing their output.
-
- ### Shell ###
- The third element is the *shell*. Once we have typed the name of *a command-line tool* and pressed `<Enter>`, the terminal sends that command to the *shell*. The shell is a program that interprets the command. The figure above shows Bash (the Bourne Again Shell), but many others are available, such as the *Z shell*.
-
- ### Operating system ###
- The last element is the operating system (OS), which in our case is *GNU/Linux* running in a Docker image or virtual machine. Linux is the name of the kernel, which is the heart of the operating system. The kernel is in direct contact with the CPU, disks, and other hardware, and it also executes the *command-line tools*. GNU, which stands for GNU's Not Unix, refers to the set of basic tools.
-
-
- ## Types of command-line tools ##
-
- *Command-line tools* are programs that are invoked with text and return text, either printed to the terminal or written to files. According to [^2], each command-line tool is one of the following five types:
-
- * A binary executable
- * A shell builtin
- * An interpreted script
- * A shell function
- * An alias
-
- The first two are the most common, while the others allow us to build up a toolbox that will make us more efficient and productive.
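-
- To see which of these types a given tool is, the shell's `type` command can help. The following is a small sketch, assuming Bash or another POSIX-like shell; `count_lines` is a hypothetical helper defined here only for illustration:
-
- ```bash
- # A shell function: a named group of commands defined in the current shell.
- count_lines() {
-     wc -l < "$1"
- }
-
- type cd            # reports that cd is a shell builtin
- type count_lines   # reports that count_lines is a function
- ```
-
- Running `type ls` works the same way, but the answer depends on your system: it may be a binary executable on disk or an alias set up by your shell configuration.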
-
- ### Binary executable ###
- Binary executables are programs in the classical sense. A binary executable is created by compiling source code to machine code. This means that you cannot read the file when you open it in a text editor; most likely you will see strange characters.
-
- ### Shell builtin ###
- *Shell builtins* are command-line tools provided by the shell, which is Bash in our case. Examples include `cd` and `help`. Shell builtins may differ between shells. Like binary executables, they cannot be easily inspected or changed.
-
- ### Interpreted Script ###
- An interpreted script is a text file that is executed by a binary executable. Examples include *Python*, *R*, and *Bash scripts*. One great advantage of an interpreted script is that you can read and change it. Take, for example, a script named `fac.py`. This script is interpreted by Python not because of the file extension `.py`, but because the first line of the script (the *shebang* line) defines the binary that should execute it.
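-
- The role of that first line (often called the shebang) can be sketched with a tiny shell script. The file name `hello.sh` and its contents are made up for this example, and `/bin/sh` is assumed to exist, as it does on typical Unix-like systems:
-
- ```bash
- # Write a two-line script; the first line names the interpreter.
- cat > hello.sh <<'EOF'
- #!/bin/sh
- echo "hello from a script"
- EOF
-
- chmod +x hello.sh    # mark the text file as executable
- ./hello.sh           # the system reads the first line and starts /bin/sh
- head -n 1 hello.sh   # the script is plain text, so we can still inspect it
- ```
-
- Replacing the first line with a different interpreter path changes which program runs the file, regardless of the file extension.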
-
- ## Testing some tools ##
- We use the term *command-line tool* a lot, as an umbrella term for anything that can be executed from the command line. Let's try a few.
-
- Go to your working folder with the `cd` command:
-
- ```bash
- $ cd /my/working/folder/
- ```
-
- Then let's change the prompt:
-
- ```bash
- $ export PS1='> '
- ```
-
- Let's update our system:
-
- ```bash
- > sudo apt update
- Pass:
- > sudo apt upgrade
- ```
-
- Then install some useful packages:
-
- ```bash
- > sudo apt install pandoc
- > sudo apt install vim
- ```
-
- Test the power of Bash:
- ```bash
- > touch file.txt
- > echo "zap" >> file.txt
- > echo "dog" >> file.txt
- > echo "ape" >> file.txt
- > sort file.txt
- > sort file.txt > file-new.txt
- > sort file.txt | head -n 2
- ```
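-
- The operators used above differ in a way that is easy to miss: `>` creates or overwrites a file, `>>` appends to it, and `|` passes output straight to the next command. The sketch below shows the three side by side without the interactive prompt; the file names `demo.txt` and `demo-sorted.txt` and the variable `first_two` are illustrative only:
-
- ```bash
- printf 'zap\ndog\nape\n' > demo.txt      # '>' creates or overwrites demo.txt
- echo "cat" >> demo.txt                   # '>>' appends a fourth line
- sort demo.txt > demo-sorted.txt          # sorted output goes to a new file
- first_two=$(sort demo.txt | head -n 2)   # '|' feeds sort's output into head
- echo "$first_two"
- ```
-
- Because `>>` appends, running the `echo ... >> file.txt` commands a second time adds the same words again, whereas a command using `>` starts from an empty file each time.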
-
-
- ---
- References
-
- [^1]: Top 10 Unix Based Operating Systems, https://www.fosslinux.com/44623/top-unix-based-operating-systems.htm
-
- [^2]: Data science at the command line, 1st Ed,
|