Data science workflow repository to explore and guide you through the data science task using command line tools.
# Data Science Workflow #
This repository explores, through examples, how to use the command line efficiently and productively for data science tasks: obtaining, scrubbing, exploring, and modeling your data.
# Introduction #
In these examples you will learn how to: (*i*) run Docker containers, (*ii*) use the command line, and (*iii*) run a basic application.
## Docker ##
Let us introduce Docker, the first platform we use for data science work. Docker is a tool that allows developers, sysadmins, and data scientists to easily deploy their applications in a sandbox (**called a container**) that runs on a host *operating system, e.g. Linux, Windows, or macOS*.

The key benefit of Docker is that it allows users to package an application with all of its dependencies into a standardized unit for software and data science development. Unlike virtual machines, containers do not have high overhead and hence enable more efficient usage of the underlying system and resources.[^1]
## Installing and using a Docker image ##
In this case we are going to create a new Docker image to work with. The image is based on **Ubuntu Bionic (18.04)** and is created with `docker build` from the following `Dockerfile`:
``` dockerfile
# Base image: Ubuntu 18.04 (Bionic)
FROM ubuntu:bionic
ENV UNAME="data-science-workflow"
# MAINTAINER is deprecated; a LABEL carries the same information
LABEL maintainer="gmarxcc"
LABEL version="0.1"
# Avoid interactive prompts from apt during the build
ARG DEBIAN_FRONTEND=noninteractive
ENV LANG=C.UTF-8 LC_ALL=C.UTF-8
RUN apt-get update \
    && apt-get install -y python3 curl
```
The image is then built by calling `docker build` in the directory that contains the `Dockerfile`:
``` shell
docker build -t gmarxcc/workflow:0.1 .
```
### Docker Pull ###
``` shell
docker pull gmarxcc/workflow:0.1
```
### Docker Run ###
``` shell
docker run -it -v `pwd`/data:/home gmarxcc/workflow:0.1
```
We recommend that you create a new directory, navigate to it, and then run the following when you’re on macOS or Linux:
``` shell
$ docker run --rm -it -v `pwd`:/data datascienceworkshops/data-science-at-the-command-line
```
Or the following when you’re on Windows and using the command line:
``` shell
$ docker run --rm -it -v %cd%:/data datascienceworkshops/data-science-at-the-command-line
```
Or the following when you’re using Windows PowerShell:
``` shell
$ docker run --rm -it -v ${PWD}:/data datascienceworkshops/data-science-at-the-command-line
```
In the above commands, the option `-v` instructs Docker to map the current directory to the `/data` directory inside the container, so this is the place to get data in and out of the Docker container.
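
The three commands above differ only in how the shell expands the current directory. As a rough, platform-independent sketch, the same argument list can be assembled in Python; the helper name `docker_run_args` and its parameters are hypothetical and only illustrate the `HOST_PATH:CONTAINER_PATH` form that `-v` expects.

```python
from pathlib import Path


def docker_run_args(image, host_dir=None, container_dir="/data"):
    """Build the argument list for `docker run` with one bind mount.

    Hypothetical helper for illustration; it only assembles the command
    and does not start a container.
    """
    # Default to the current working directory, like `pwd`, %cd% or ${PWD}
    host = Path(host_dir) if host_dir is not None else Path.cwd()
    # -v takes HOST_PATH:CONTAINER_PATH; resolve() makes the host path absolute
    volume = f"{host.resolve()}:{container_dir}"
    return ["docker", "run", "--rm", "-it", "-v", volume, image]


if __name__ == "__main__":
    args = docker_run_args("datascienceworkshops/data-science-at-the-command-line")
    print(" ".join(args))
```

The resulting list could be handed to `subprocess.run` on a machine where Docker is installed.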
### Testing the container ###
#### Command-line example ####
``` shell
curl -s http://gmarx.jumpingcrab.com/examples-data/76-0.txt | tr '[:upper:]' '[:lower:]' | grep -oE '\w+' | sort | uniq -c | sort -nr | head -n 10
```
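To make each stage of the pipeline above easier to follow, here is a minimal Python sketch of the same idea: lowercase the text, split it into words, count them, and keep the most frequent ones. The function name `top_words` is an assumption for illustration, and the text is passed in as a string rather than downloaded. Ties may be ordered differently than by `sort -nr`.

```python
import re
from collections import Counter


def top_words(text, n=10):
    """Return the n most frequent words in text as (word, count) pairs."""
    # tr '[:upper:]' '[:lower:]'  -> lowercase everything
    # grep -oE '\w+'              -> extract runs of word characters
    words = re.findall(r"\w+", text.lower())
    # sort | uniq -c | sort -nr | head -n N -> count and keep the top n
    return Counter(words).most_common(n)


if __name__ == "__main__":
    sample = "The quick brown fox jumps over the lazy dog. The dog sleeps."
    for word, count in top_words(sample, 3):
        print(f"{count:7d} {word}")
```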
#### Python example ####
**Creating the factorial function in Docker**

To test the container, a factorial function is written in Python 3:
``` python
#!/usr/bin/env python3
def factorial(x):
    result = 1
    for i in range(2, x + 1):
        result *= i
    return result
```
Extending the factorial script with `sys` so that it reads its argument from the command line:
``` python
#!/usr/bin/env python3
import sys


def factorial(x):
    result = 1
    for i in range(2, x + 1):
        result *= i
    return result


if __name__ == "__main__":
    # The first command-line argument is the number to compute
    x = int(sys.argv[1])
    print(factorial(x))
```
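One possible way to check the function inside the container, sketched here on the assumption that the standard library's `math.factorial` is an acceptable reference, is to compare the two for a range of small inputs. Note that the loop-based version returns 1 for negative numbers, whereas `math.factorial` raises `ValueError`, so the comparison is limited to non-negative integers.

```python
import math


def factorial(x):
    """Loop-based factorial, as in the script above."""
    result = 1
    for i in range(2, x + 1):
        result *= i
    return result


def matches_reference(limit=10):
    """Compare factorial() with math.factorial for 0..limit."""
    return all(factorial(n) == math.factorial(n) for n in range(limit + 1))


if __name__ == "__main__":
    print("factorial agrees with math.factorial:", matches_reference())
```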
#### Nginx server example ####
``` shell
docker pull nginxdemos/hello
docker run -P -d nginxdemos/hello
docker ps
```
# Command line basics #
This new way of working requires an overall understanding of what we call the command line. According to [^2], it is mainly defined by: *(i) the command-line tools, (ii) the terminal, (iii) the shell, and (iv) the operating system*.
## The command-line tools ##
We use command-line tools by typing their corresponding commands in the *terminal*. There are different types of command-line tools; examples include the `ls`, `cat`, and `more` commands.
## Terminal ##
The terminal is the application where we type our commands; see the following figure:

![Terminal](https://i.blogs.es/56c9ee/cd/450_1000.jpg "terminal-ubuntu")

The dollar sign `$` shown in the figure is known as the *prompt*, and the figure shows the typical *Ubuntu terminal* in version 18.04; other common prompts include `>`, `~`, and `->`.

The *terminal* is a front end for observing the input and output of a command.
## Shell ##
The third element is the shell. Once we have typed *a command-line tool* and pressed `<Enter>`, the terminal sends that command to the *shell*. The shell is a program that interprets the command. The figure shows Bash (Bourne Again Shell), but there are many others available, such as the *Z shell*.
## Operating system ##
The last element is the operating system (OS), which is *GNU/Linux* in the Docker image. Linux is the name of the kernel, which is the heart of the operating system. The kernel is in direct contact with the CPU, disks, and other hardware. The kernel also executes the *command-line tools*. GNU, which stands for GNU’s not UNIX, refers to the set of basic tools. In this case the Docker image is based on Ubuntu Linux.
# Types of command-line tools #
Command-line tools are programs that are invoked by text and return text, strings, or files. Each command-line tool can be one of the following five types according to [^2]:
* A binary executable.
* A shell builtin.
* An interpreted script.
* A shell function.
* An alias.

The most common are the first two, while the others allow us to build up a toolbox that will make us more efficient and productive.
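
To see which of these types a given tool is, Bash provides the `type -t` builtin. The sketch below wraps it in a small Python helper; the name `tool_type` is hypothetical, and the code assumes `bash` is installed and on the `PATH`. Bash reports both binary executables and interpreted scripts as `file`, and aliases and functions defined in an interactive session are not visible to the non-interactive shell started here.

```python
import subprocess


def tool_type(name):
    """Ask Bash what kind of command-line tool `name` is.

    Returns a string such as "builtin", "file", or "keyword",
    or None when Bash does not know the name.
    """
    result = subprocess.run(
        # "$1" receives `name` as a positional argument, so it is never
        # interpreted as shell code; "_" fills the $0 slot
        ["bash", "-c", 'type -t "$1"', "_", name],
        capture_output=True,
        text=True,
    )
    return result.stdout.strip() or None


if __name__ == "__main__":
    for tool in ["cd", "ls", "help", "python3"]:
        print(tool, "->", tool_type(tool))
```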
## Binary executable ##
Binary executables are programs in the classical sense. A binary executable is created by compiling source code to machine code. This means that when you open the file in a text editor you cannot read it[^2].
## Shell builtin ##
Shell builtins are command-line tools provided by the shell, which is Bash in our case. Examples include `cd` and `help`. Shell builtins may differ between shells. Like binary executables, they cannot be easily inspected or changed.
## Interpreted Script ##
An interpreted script is a text file that is executed by a binary executable. Examples include *Python*, *R*, and *Bash scripts*. One great advantage of an interpreted script is that you can read and change it. For example, a script `fac.py` is interpreted by Python not because of the file extension `.py`, but because the first line of the script (the shebang, starting with `#!`) defines the binary that should execute it.
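
To illustrate the role of that first line, the following sketch reads a script's shebang and returns the interpreter command it names. The helper `read_shebang` is a hypothetical name introduced here, and the demo writes a throwaway file rather than relying on an existing `fac.py`.

```python
import os
import tempfile


def read_shebang(path):
    """Return the interpreter command from a script's first line, or None."""
    with open(path, "rb") as f:
        first_line = f.readline()
    if not first_line.startswith(b"#!"):
        return None
    # Drop the "#!" marker and surrounding whitespace
    return first_line[2:].decode("utf-8", errors="replace").strip()


if __name__ == "__main__":
    # Write a small throwaway script to inspect
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write("#!/usr/bin/env python3\nprint('hello')\n")
        script = f.name
    print(read_shebang(script))
    os.unlink(script)
```

Renaming such a script to drop the `.py` extension would not change the result, which is the point made above.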
## Testing some tools ##
We employ the term command-line tool a lot, but so far, we have not yet explained what we actually mean by it. We use it as an umbrella term for anything that can be executed from the command line. Under the ho
# Notes #
- [ ] Make a container with Ubuntu 18.04
- [ ] Packages to install: csvkit,

[^1]: Docker for beginners, https://docker-curriculum.com/.
[^2]: Data science at the command line, 1st Ed,