Big Data Programming Lab01 - Shell Basics And Bash Scripting
Last updated on September 16, 2023
In this week’s Big Data Programming Lab, the final task is to download a dataset called “facebookdata” and perform some data processing. While writing a bash script for this task, I encountered some difficulties, which I will document here.
Question
Find the 10 most popular status entries. For that, add all the values you find in columns 8-15. Your script should look something like the following:
Arithmetic expansion and evaluation in Bash is done by writing an integer expression in the form $((expression)), e.g., $(( n1+n2 )).
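As a quick illustration of the hint above, here is a minimal sketch of arithmetic expansion. The variable names and values (n1, n2) are made up for the example and are not part of the lab material.

```bash
#!/bin/bash
# Hypothetical values, chosen only for illustration.
n1=7
n2=5

# $(( ... )) evaluates an integer expression and substitutes the result.
sum=$(( n1 + n2 ))
product=$(( n1 * n2 ))

# Division inside $(( ... )) is integer division: the fractional part is dropped.
quotient=$(( n1 / n2 ))

echo "sum=$sum product=$product quotient=$quotient"
```

Inside the double parentheses, variable names can be written without the leading $, which keeps longer sums readable.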
Solution
Data Cleaning
The raw data contains some areas that need cleaning, and the professor has provided a script file for cleaning, which uses regular expressions. You can run this script directly on the CSV file.
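To see what this kind of cleaning does, here is a small sketch that applies the same grep and sed patterns as the provided script to a single made-up line instead of the whole file. My reading is that the patterns target commas inside double-quoted fields, which would otherwise confuse cut; the sample line and the variable name cleaned are assumptions for illustration.

```bash
#!/bin/bash
# Made-up sample line: the quoted field contains two commas.
cleaned='1,"Hello, world, again",photo'

# As long as the pattern still finds a comma between double quotes,
# replace one such comma with a semicolon and check again.
while printf '%s\n' "$cleaned" | grep -q '"[^"][^"]*,.*"'; do
  cleaned=$(printf '%s\n' "$cleaned" | sed 's/\("[^"][^"]*\),\(.*"\)/\1;\2/')
done

echo "$cleaned"

# With the inner commas gone, cut sees exactly three fields.
echo "$cleaned" | cut -d',' -f3
```

Each pass of sed replaces only one comma per line, which is why the substitution sits inside a loop.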
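The fix bash script (fix.sh):

```bash
#!/bin/bash
# Work on a copy so the original CSV stays untouched.
cp facebookdata.csv facebookdata-clean.csv
# While the pattern still finds a comma between double quotes,
# replace one such comma per line with a semicolon.
while grep -q '"[^"][^"]*,.*"' facebookdata-clean.csv; do
  sed -i.bak 's/\("[^"][^"]*\),\(.*"\)/\1;\2/' facebookdata-clean.csv
done
```

Run and Verify: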
```
!bash fix.sh facebookdata.csv
!cut -d ',' -f4,6 facebookdata-clean.csv
```

Bash script to solve the problem
```bash
#!/bin/bash
# Declare 10 slots for the top sums and the matching status lines (initialised with 0).
top_likes=(0 0 0 0 0 0 0 0 0 0)
top_statuses=("0" "0" "0" "0" "0" "0" "0" "0" "0" "0")

# Skip the header row, then read the remaining lines one by one.
# The braces keep the loop and the final printing in the same subshell.
tail -n +2 facebookdata-clean.csv | {
  while IFS= read -r line; do
    # Get the values of columns 8-15 (cut) into separate variables.
    num_comments=$(echo "$line" | cut -d',' -f8)
    num_shares=$(echo "$line" | cut -d',' -f9)
    num_likes=$(echo "$line" | cut -d',' -f10)
    num_loves=$(echo "$line" | cut -d',' -f11)
    num_wows=$(echo "$line" | cut -d',' -f12)
    num_hahas=$(echo "$line" | cut -d',' -f13)
    num_sads=$(echo "$line" | cut -d',' -f14)
    num_angrys=$(echo "$line" | cut -d',' -f15)

    # Pass each value through arithmetic expansion (an empty field becomes 0).
    num_comments=$((num_comments))
    num_shares=$((num_shares))
    num_likes=$((num_likes))
    num_loves=$((num_loves))
    num_wows=$((num_wows))
    num_hahas=$((num_hahas))
    num_sads=$((num_sads))
    num_angrys=$((num_angrys))

    # Add the values.
    sum=$((num_comments+num_shares+num_likes+num_loves+num_wows+num_hahas+num_sads+num_angrys))

    # Keep this sum only if it is among the top 10.
    if (( sum > top_likes[9] )); then
      top_likes[9]=$sum
      top_statuses[9]=$line
      # Move the new entry up while it beats the entry above it.
      for (( i = 8; i >= 0; i-- )); do
        if (( sum > top_likes[i] )); then
          top_likes[i+1]=${top_likes[i]}
          top_likes[i]=$sum
          top_statuses[i+1]=${top_statuses[i]}
          top_statuses[i]=$line
        fi
      done
    fi
  done

  # Print the 10 status entries, most popular first.
  for (( i = 0; i < 10; i++ )); do
    echo "${top_statuses[i]}"
  done
}
```

Run the script
```bash
bash ex12.sh facebookdata-clean.csv
```
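One way to cross-check the result is a pipeline built from awk, sort and head. This is a different technique from the array-based script above, not the approach the lab asks for, and the function name top_by_sum and the demo data are hypothetical. Rows with equal sums may come out in a different order than in the Bash loop.

```bash
#!/bin/bash
# top_by_sum FILE N: print the N rows of FILE (header skipped) whose
# columns 8-15 have the largest sum, largest first.
top_by_sum() {
  tail -n +2 "$1" |
    awk -F',' '{ s = 0; for (i = 8; i <= 15; i++) s += $i; print s "," $0 }' |
    sort -t',' -k1,1nr |
    head -n "$2" |
    cut -d',' -f2-
}

# Made-up demo file with 15 columns: a header plus three rows.
demo=$(mktemp)
printf '%s\n' \
  'id,msg,link,type,url,date,react,c8,c9,c10,c11,c12,c13,c14,c15' \
  'a,x,x,x,x,x,0,1,1,1,1,1,1,1,1' \
  'b,x,x,x,x,x,0,5,5,5,5,5,5,5,5' \
  'c,x,x,x,x,x,0,2,0,0,0,0,0,0,0' > "$demo"

# Rows b (sum 40) and a (sum 8) should be listed, in that order.
top_by_sum "$demo" 2
```

For the real data the call would be top_by_sum facebookdata-clean.csv 10, assuming the cleaned file has the column layout described in the question.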
Challenges Encountered and Solutions:
After searching, I found a similar issue on StackOverflow. I first tried converting the strings to numeric values with num_comments=$((num_comments)), but the same error persisted. While checking the output, I discovered that the error was caused by the strings in the CSV header row. To address this, I skipped the first line with tail and piped the result into the while loop, which resolved the issue and allowed me to proceed with the desired operations.
The problem has been resolved.
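The header fix can be sketched on a tiny made-up file. The column names and values below are assumptions for illustration, and only two numeric columns are summed to keep the example short.

```bash
#!/bin/bash
# Made-up CSV: one header row followed by two data rows.
csv=$(mktemp)
printf 'status_id,num_likes,num_shares\ns1,3,2\ns2,4,5\n' > "$csv"

# tail -n +2 starts printing at line 2, so the header row never reaches
# the loop and only numeric values end up inside $(( ... )).
result=$(tail -n +2 "$csv" | while IFS=, read -r status_id num_likes num_shares; do
  echo "$status_id $((num_likes + num_shares))"
done)

echo "$result"
rm -f "$csv"
```

Here the loop prints each row's sum immediately, so nothing has to survive past the end of the pipeline.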
In the final step, the script only printed the initialized array (e.g., top_statuses=("0" "0" "0" "0" "0" "0" "0" "0" "0" "0")).

Upon investigation, I learned that in Bash, a pipe (|) launches a subshell to run the commands in the pipeline. Changes made to variables inside a subshell are discarded when the subshell ends and do not affect the parent shell. This is why the top_likes and top_statuses arrays set within the while loop revert to their initial values outside the loop.

Solution: Wrap the while loop and the code that prints the results in curly braces, so that both run in the same subshell and the printing code sees the updated arrays.
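The behaviour described above can be reproduced with a counter instead of the two arrays. This is a minimal sketch assuming Bash with its default settings (the lastpipe option is off); the variable names are made up. The third variant, process substitution, is a Bash-only alternative that I did not use in the script but that avoids the pipe altogether.

```bash
#!/bin/bash
# 1) Plain pipe: the while loop runs in a subshell, so the parent's
#    count is never updated.
count=0
printf 'a\nb\nc\n' | while read -r row; do
  count=$((count + 1))
done
echo "after plain pipe: count=$count"

# 2) Braces: the loop and the code that uses its result share one subshell,
#    so the echo inside the braces sees the updated value.
printf 'a\nb\nc\n' | {
  inner=0
  while read -r row; do
    inner=$((inner + 1))
  done
  echo "inside braces: inner=$inner"
}

# 3) Process substitution: the loop runs in the current shell,
#    so the variable is still set after the loop.
total=0
while read -r row; do
  total=$((total + 1))
done < <(printf 'a\nb\nc\n')
echo "after process substitution: total=$total"
```

With braces, the values are still lost once the closing brace is reached, so everything that needs them has to stay inside the group.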