Big Data Programming Lab01 - Shell Basics and Bash Scripting

Last updated on September 16, 2023

In this week’s Big Data Programming Lab, the final task is to download a dataset called “facebookdata” and perform some data processing. While writing a bash script for this task, I encountered some difficulties, which I will document here.

Question

Find the 10 most popular status entries. For that, add all the values you find in columns 8-15. Your script should look something like the following:

#!/bin/bash

#declare 10 variables (initialise them with a 0)

#here you're reading the output of a command line by line
for line in $(command-similar-to-previous-question); do

#get the values (cut) in several variables:
num_comments=???
num_shares=???
num_likes=???
etc.

#add the values
#keep only this sum if it's among the top 10:
#think insertion sort?
done
#print the 10 status entries

Arithmetic expansion and evaluation in Bash is done by placing an integer expression using the following format $((expression)), e.g., $(( n1+n2 )).
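
As a small illustration of the arithmetic expansion described above, here is a minimal sketch. The three values are made-up stand-ins for CSV fields, not taken from the dataset.

```shell
#!/bin/bash
# Hypothetical sample values standing in for three CSV fields
num_comments=12
num_shares=3
num_likes=40

# $(( ... )) evaluates an integer expression; variables need no leading $ inside it
sum=$(( num_comments + num_shares + num_likes ))
echo "$sum"

# (( ... )) without the leading $ is the test form: its exit status reflects the comparison
if (( sum > 50 )); then
    echo "popular"
fi
```

The same `(( ... ))` test form is what the solution script uses to compare a new sum against the current top 10.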

Solution

  1. Download the data from the server provided by the school.

!wget csserver.ucd.ie/~thomas/facebookdata.csv
  2. Data cleaning

    The raw data contains some areas that need cleaning, and the professor has provided a script file for cleaning, which uses regular expressions. You can run this script directly on the CSV file.

    The cleaning script (fix.sh):

    #!/bin/bash

    cp facebookdata.csv facebookdata-clean.csv

    while grep -q '"[^"][^"]*,.*"' facebookdata-clean.csv ;do
    sed -i.bak 's/\("[^"][^"]*\),\(.*"\)/\1;\2/' facebookdata-clean.csv
    done

    Run and Verify:

    !bash fix.sh facebookdata.csv
    !cut -d ',' -f4,6 facebookdata-clean.csv
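
To see what the cleaning loop does, here is a minimal sketch that applies the same grep/sed pair to a single line held in a variable instead of the whole file. The sample row is made up for illustration; only the two regular expressions come from the script above.

```shell
#!/bin/bash
# Hypothetical row: the quoted text field contains commas that would confuse cut
line='123,"Big sale today, tomorrow, and Friday",photo,5'

# As long as a comma sits between a pair of double quotes,
# replace one such comma with a semicolon and check again
while printf '%s\n' "$line" | grep -q '"[^"][^"]*,.*"'; do
    line=$(printf '%s\n' "$line" | sed 's/\("[^"][^"]*\),\(.*"\)/\1;\2/')
done

echo "$line"

# With the embedded commas gone, cut sees the intended field boundaries
printf '%s\n' "$line" | cut -d',' -f3
```

Each pass of sed replaces only one comma, which is why the script wraps it in a while loop rather than calling it once.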
  3. Bash script to solve the problem

    #!/bin/bash

    # Declare 10 slots (initialised with 0) for the top sums and the matching rows
    top_likes=(0 0 0 0 0 0 0 0 0 0)
    top_statuses=("0" "0" "0" "0" "0" "0" "0" "0" "0" "0")

    # Skip the CSV header, then read the remaining rows line by line.
    # The braces keep the loop and the final print in the same subshell.
    tail -n +2 facebookdata-clean.csv | {
        while IFS= read -r line; do
            # Get the values (cut) in several variables:
            # echo "$line"
            num_comments=$(echo "$line" | cut -d',' -f8)
            num_shares=$(echo "$line" | cut -d',' -f9)
            num_likes=$(echo "$line" | cut -d',' -f10)
            num_loves=$(echo "$line" | cut -d',' -f11)
            num_wows=$(echo "$line" | cut -d',' -f12)
            num_hahas=$(echo "$line" | cut -d',' -f13)
            num_sads=$(echo "$line" | cut -d',' -f14)
            num_angrys=$(echo "$line" | cut -d',' -f15)

            # Normalise to integers (an empty field becomes 0)
            num_comments=$((num_comments))
            num_shares=$((num_shares))
            num_likes=$((num_likes))
            num_loves=$((num_loves))
            num_wows=$((num_wows))
            num_hahas=$((num_hahas))
            num_sads=$((num_sads))
            num_angrys=$((num_angrys))

            # Add the values (columns 8-15)
            sum=$((num_comments+num_shares+num_likes+num_loves+num_wows+num_hahas+num_sads+num_angrys))
            # echo "$sum"

            # Keep this sum only if it is among the top 10 (insertion-sort style):
            # place it in the last slot, then bubble it up past smaller entries
            if (( sum > top_likes[9] )); then
                top_likes[9]=$sum
                top_statuses[9]=$line
                for (( i=8; i>=0; i-- )); do
                    if (( sum > top_likes[i] )); then
                        top_likes[i+1]=${top_likes[i]}
                        top_likes[i]=$sum
                        top_statuses[i+1]=${top_statuses[i]}
                        top_statuses[i]=$line
                    fi
                done
            fi
            # echo "${top_likes[@]}"
            # echo "${top_statuses[@]}"
        done

        # Print the 10 status entries
        for (( i = 0; i < 10; i++ )); do
            echo "${top_statuses[i]}"
        done
    }

    Run the script

    bash ex12.sh facebookdata-clean.csv
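
One possible way to cross-check the script's output is to compute the same sums with awk and let sort pick the largest ones. This is only a sketch and not part of the lab's intended solution: the three-row sample file and its header names are assumptions, and the real input would be facebookdata-clean.csv.

```shell
#!/bin/bash
# Hypothetical 15-column sample; the header names are assumed, not taken from the dataset
sample=$(mktemp)
cat > "$sample" <<'EOF'
id,msg,link,type,url,published,reactions,num_comments,num_shares,num_likes,num_loves,num_wows,num_hahas,num_sads,num_angrys
a,x,x,x,x,x,0,1,1,1,1,1,1,1,1
b,x,x,x,x,x,0,5,5,5,5,5,5,5,5
c,x,x,x,x,x,0,2,2,2,2,2,2,2,2
EOF

# Prefix every data row with the sum of columns 8-15,
# sort numerically by that prefix in descending order, keep the first 10 rows
top=$(tail -n +2 "$sample" \
    | awk -F',' '{ s = 0; for (i = 8; i <= 15; i++) s += $i; print s "," $0 }' \
    | sort -t',' -k1,1nr \
    | head -n 10)

echo "$top"
rm -f "$sample"
```

If the rows printed here match the ones printed by ex12.sh, the insertion logic in the loop is very likely correct.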

Challenges Encountered and Solutions:

  1. num_comments: expression recursion level exceeded (error token is “num_comments”)

After searching, I found a similar issue on StackOverflow. My first attempt was to convert the strings to numbers with num_comments=$((num_comments)), but the same error persisted. Printing each line showed the real cause: the first line of the CSV is the header, so the field being read was a column name rather than a number (presumably the text num_comments itself, which Bash arithmetic then tries to resolve as a variable name, recursively). To fix this, I skipped the first line with tail and piped the rest into the while loop, which allowed the arithmetic to run as intended.

tail -n +2 facebookdata-clean.csv |
while read line; do
    # statements
done

The problem has been resolved.
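
The error can be reproduced in isolation. The sketch below assumes that the header row's eighth field is the literal text num_comments, which is what the error token suggests; it is not taken from the dataset itself.

```shell
#!/bin/bash
# Assumption: the header row's 8th field is the literal text "num_comments"
num_comments="num_comments"

# Arithmetic expansion treats a non-numeric string as a variable name and
# evaluates that variable in turn, so a variable that names itself recurses
# until bash gives up. The inner subshell lets us capture the error message.
err=$( ( echo $((num_comments)) ) 2>&1 ) || true
echo "$err"

# A numeric string evaluates normally
num_comments="42"
echo $((num_comments + 1))
```

This also explains why the num_comments=$((num_comments)) conversion could not help: the conversion itself triggers the same recursive lookup.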

  2. In the final step, the script printed only the initial array values (e.g., top_statuses=("0" "0" "0" "0" "0" "0" "0" "0" "0" "0")).

    Upon investigation, I learned that in Bash, a pipe (|) launches a subshell to run the commands in the pipeline. Therefore, variables defined in a subshell are destroyed when the subshell ends and do not affect the parent shell. This is why the top_likes and top_statuses arrays set within the while loop revert to their initial values outside the loop.

    Solution: Wrap the while loop and the final printing step together in curly braces, so that both run in the same subshell and the print step still sees the arrays filled by the loop.
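
The subshell behaviour is easy to demonstrate with a counter. The following sketch uses made-up input and variable names; the third variant, process substitution, is an alternative that the script above does not use and is shown only for comparison.

```shell
#!/bin/bash
# Variant 1: the loop runs in a pipeline subshell, so the update is lost
count=0
printf 'a\nb\nc\n' | while IFS= read -r item; do
    count=$((count + 1))
done
after_pipe=$count   # still 0 in the parent shell (with bash's default settings)

# Variant 2: braces group the loop and the code that uses the result,
# so both run inside the same subshell
grouped=$(printf 'a\nb\nc\n' | {
    n=0
    while IFS= read -r item; do
        n=$((n + 1))
    done
    echo "$n"
})

# Variant 3 (alternative): process substitution feeds the loop without a pipe,
# so the loop runs in the current shell and its variables survive
total=0
while IFS= read -r item; do
    total=$((total + 1))
done < <(printf 'a\nb\nc\n')

echo "pipe: $after_pipe, braces: $grouped, process substitution: $total"
```

With braces, the results are only visible inside the group, which is why the print loop has to sit inside the braces as well.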


http://hihiko.zxy/2023/09/16/Big-Data-Programming-Lab01-Shell-Basics-And-Bash-Scriping/
Posted on September 16, 2023