The government’s policy of deducting salaries as contributions to the Tabungan Perumahan Rakyat (Tapera) has sparked both support and opposition among the public. The policy is stipulated in Government Regulation No. 21 of 2024, amending Government Regulation No. 25 of 2020, which mandates wage deductions for all workers in both the private and public sectors.
The Tapera wage deduction policy has drawn mixed reactions from the community. The DPR RI, particularly Commission V, stresses the importance of transparency and fairness in the policy’s implementation. Meanwhile, various worker and employer organizations oppose it, arguing that it adds to an already heavy financial burden.
The data was obtained by scraping X (Twitter) using tweet-harvest. Tweets were collected from May 29, 2024 to May 30, 2024, resulting in a total of 4,112 tweets.
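tweet-harvest writes its results to a CSV file, so the scraped tweets first need to be loaded into a DataFrame. A minimal loading sketch (the file path here is an assumption; adjust it to your own output):

import pandas as pd

# Path is an assumption -- tweet-harvest saves its output CSV locally.
tweet = pd.read_csv("tweets-data/tapera.csv")
print(len(tweet))          # expect roughly 4,112 rows
print(tweet['full_text'].head())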
The text data is preprocessed using libraries such as Regular Expressions (re), the Natural Language Toolkit (nltk), and BeautifulSoup to clean the data.
import re, string, unicodedata
import nltk
import contractions
import inflect
from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize, RegexpTokenizer

def strip_html(text):
    """Remove HTML tags, keeping only the visible text."""
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

def remove_between_square_brackets(text):
    return re.sub(r'\[[^]]*\]', '', text)

def replace_contractions(text):
    """Replace contractions in a string of text."""
    return contractions.fix(text)

def denoise_text(text):
    text = strip_html(text)
    text = remove_between_square_brackets(text)
    text = replace_contractions(text)
    return text

def remove_nonascii(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text

def remove_url(text):
    text = re.sub(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))', '', text)
    return text

def remove_digit(text):
    text = re.sub(r"\b\d+\b", " ", text)
    return text

def remove_punctuations(text):
    text = re.sub(r'[^\w]|_', ' ', text)
    return text

def remove_byte_str(text):
    # Strip leftover b'...' / b"..." prefixes from byte-string representations.
    text = text.replace("b'", '')
    text = text.replace('b"', '')
    return text

def remove_additional_white_spaces(text):
    text = re.sub(r'[\s]+', ' ', text)
    return text

def remove_retweet_mention(text):
    text = re.sub(r'RT @[\w_]+: ', '', text)
    return text

def remove_username(text):
    text = re.sub(r'@[^\s]+', '', text)
    return text

def remove_single_letter(text):
    text = re.sub(r'(\b[A-Za-z] \b|\b [A-Za-z]\b)', '', text)
    return text

def do_preprocessing(text):
    text = denoise_text(text)
    text = remove_byte_str(text)
    text = remove_retweet_mention(text)
    text = remove_username(text)
    text = remove_nonascii(text)
    text = remove_url(text)
    text = remove_digit(text)
    text = remove_punctuations(text)
    text = remove_single_letter(text)
    text = remove_additional_white_spaces(text)
    text = text.lower()
    return text

tweet['text'] = tweet['full_text'].apply(do_preprocessing)
print(tweet['text'])
Next, perform sentiment labeling using IndoBERT. The labeling uses a sentiment text classification model available on Hugging Face. After labeling with IndoBERT, the results are quickly checked manually.
!pip install transformers

from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

pretrained_model = "mdhugol/indonesia-bert-sentiment-classification"
model = AutoModelForSequenceClassification.from_pretrained(pretrained_model)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model)
sentiment_analysis = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

def label_teks(teks):
    results = sentiment_analysis(teks)
    # Map the model's raw output labels to readable sentiment names.
    label_index = {'LABEL_0': 'positive', 'LABEL_1': 'neutral', 'LABEL_2': 'negative'}
    labels = []
    for result in results:
        labels.append(label_index[result['label']])
    return labels

tweet['label'] = None
tweet['label'] = tweet['text'].apply(label_teks)
tweet['label'] = tweet['label'].astype(str)
# Strip the brackets and quotes left over from the stringified list.
tweet['label'] = tweet['label'].str.replace(r"[\[\]']", '', regex=True)
tweet
Next, I count each sentiment label to see the number of negative, positive, and neutral tweets, and display a pie chart to show the proportion of each label.
tweet['label'].value_counts()

import matplotlib.pyplot as plt

label_counts = tweet['label'].value_counts()
colors = ['#ff6b6b', '#feca57', '#48dbfb', '#1dd1a1', '#5f27cd']

plt.style.use('seaborn-poster')
plt.figure(figsize=(10, 8))
plt.pie(label_counts,
        labels=label_counts.index,
        autopct='%1.1f%%',
        colors=colors,
        startangle=90,
        shadow=True,
        explode=[0.1 if i == label_counts.idxmax() else 0 for i in label_counts.index])
plt.title('Sentiment Analysis of Tapera Policy', fontsize=16)
plt.show()
The sentiment classification model for Tapera is built with transfer learning, specifically by fine-tuning IndoBERTweet. There are several compelling reasons to use IndoBERTweet as the pre-trained model:
- The BERT architecture is effective across a wide range of NLP tasks and can capture complex context.
- IndoBERTweet is designed specifically for the Indonesian language.
- IndoBERTweet was pre-trained on data from social media platforms like X (Twitter), making it well suited to analyzing social media text.
Before modeling with IndoBERTweet, I performed data augmentation on the positive class because of class imbalance. The augmentation uses synonym replacement with IndoBERT to add variation to the positive class.
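The augmentation script itself is not shown here, but synonym replacement with a masked language model generally works by masking a word and letting the model suggest a contextual substitute. The sketch below illustrates the idea; the checkpoint name and helper function are illustrative assumptions, not the exact code used:

import random
import pandas as pd
from transformers import pipeline

# Masked-LM pipeline; "indolem/indobert-base-uncased" is one public IndoBERT
# checkpoint -- the exact model used may differ.
fill_mask = pipeline("fill-mask", model="indolem/indobert-base-uncased")

def augment_text(text, n_replacements=1):
    """Replace random words with IndoBERT's top masked-LM suggestion."""
    words = text.split()
    if len(words) < 2:
        return text
    for _ in range(n_replacements):
        idx = random.randrange(len(words))
        original = words[idx]
        words[idx] = fill_mask.tokenizer.mask_token
        candidates = fill_mask(" ".join(words))
        # Take the best candidate that differs from the original word.
        words[idx] = next(
            (c["token_str"] for c in candidates if c["token_str"] != original),
            original,
        )
    return " ".join(words)

# Augment only the minority (positive) class and append it to the dataset.
positive = tweet[tweet['label'] == 'positive'].copy()
positive['text'] = positive['text'].apply(augment_text)
tweet = pd.concat([tweet, positive], ignore_index=True)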
After augmentation, the next step is fine-tuning IndoBERTweet using the code on GitHub. Click here to view the GitHub code; a condensed sketch of the setup is shown below, and the evaluation results follow after it.
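For orientation, this is roughly what the fine-tuning setup looks like with the Hugging Face Trainer API. The hyperparameters and label mapping here are illustrative assumptions, not necessarily the repo's exact values:

from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# IndoBERTweet checkpoint on the Hugging Face Hub.
checkpoint = "indolem/indobertweet-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

# Assumed label-to-id mapping; the repo's actual mapping may differ.
label2id = {'positive': 0, 'neutral': 1, 'negative': 2}

ds = Dataset.from_pandas(tweet[['text', 'label']])
ds = ds.map(lambda x: {'labels': label2id[x['label']]})
ds = ds.map(lambda x: tokenizer(x['text'], truncation=True, max_length=128))
ds = ds.train_test_split(test_size=0.2, seed=42)

# Illustrative hyperparameters; padding is handled by the default collator.
args = TrainingArguments(output_dir='tapera-indobertweet',
                         num_train_epochs=3,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                  train_dataset=ds['train'], eval_dataset=ds['test'])
trainer.train()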
The evaluation shows an accuracy of 89.73%, precision of 90.31%, recall of 89.73%, and an F1-score of 89.97%. These results indicate that the model fine-tuned with IndoBERTweet is effective at classifying public sentiment toward the Tapera policy.
Of the 261 samples labeled positive, the model predicted 241 correctly, 4 as neutral, and 16 as negative. For the neutral label, with 369 samples in total, the model predicted 275 correctly, 13 as positive, and 81 as negative. For the negative label, with 508 samples in total, the model predicted 490 correctly, 18 as positive, and 30 as neutral.
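This per-class breakdown is what a confusion matrix reports. A sketch of how such numbers can be reproduced with scikit-learn, reusing the trainer and data split from the fine-tuning sketch above:

from sklearn.metrics import classification_report, confusion_matrix

# Predict on the held-out split from the fine-tuning sketch.
pred = trainer.predict(ds['test'])
y_pred = pred.predictions.argmax(axis=-1)
y_true = pred.label_ids

# target_names follow the assumed label2id order: 0=positive, 1=neutral, 2=negative.
print(classification_report(y_true, y_pred,
                            target_names=['positive', 'neutral', 'negative']))
print(confusion_matrix(y_true, y_pred))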
The model has been deployed on Hugging Face for public use. Click here to try the model.
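Once a model is on the Hugging Face Hub, anyone can load it through the pipeline API. A usage sketch with a placeholder repo id (the real id sits behind the link above):

from transformers import pipeline

# The repo id below is a placeholder -- substitute the actual model id.
classifier = pipeline("text-classification", model="<username>/tapera-sentiment")
print(classifier("Potongan gaji untuk Tapera memberatkan pekerja"))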
The sentiment analysis of the Tapera policy, using a dataset scraped from social media X, indicates that public perception of Tapera tends to be negative. The pre-trained IndoBERTweet model proved effective on this dataset, reaching an accuracy of 89.73% and an F1-score of 89.97%.
Elkurniawan, M. A. (2024, May 29). Pro Kontra Kebijakan Tapera, dari DPR RI hingga Federasi Serikat Pekerja. Narasi. https://narasi.tv/read/narasi-daily/pro-kontra-kebijakan-tapera
Thanks to Bagas Wahyu Herdiansyah for providing the dataset for this sentiment analysis.