This project involves creating a utility that performs statistical analysis on CSV files and generates various plots using Python, Pandas, Matplotlib, and a large language model (LLM). The application also provides comprehensive and informative answers to questions about the data.
import pandas as pd
import matplotlib.pyplot as plt
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch
# Function to read and parse CSV files
def read_csv(file_path):
    # Load the file at the given path instead of a hardcoded one
    return pd.read_csv(file_path)
# Function to calculate basic statistics
def calculate_statistics(data):
    # Select only numeric columns for calculations
    numeric_data = data.select_dtypes(include=['number'])
    statistics = {
        'mean': numeric_data.mean(),
        'median': numeric_data.median(),
        'mode': numeric_data.mode().iloc[0],  # first mode when there are ties
        'std': numeric_data.std(),
        'correlation': numeric_data.corr()
    }
    return statistics
# Function to generate plots
def plot_data(data):
    # Histograms for numeric columns
    numeric_data = data.select_dtypes(include=['number'])
    numeric_data.hist(bins=15, figsize=(15, 10))
    plt.show()
    # Scatter matrix for numeric columns
    pd.plotting.scatter_matrix(numeric_data, figsize=(15, 10))
    plt.show()
# Function to answer questions using the LLM
def answer_question(prompt, model, tokenizer):
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(inputs["input_ids"], max_length=100)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
# Main function to integrate everything
def main():
    file_path = '/content/data_by_artist.csv'  # Replace with your actual file path
    data = read_csv(file_path)
    print("Data Head:")
    print(data.head())
    stats = calculate_statistics(data)
    print("Statistics:")
    print(stats)
    plot_data(data)
    model_name = "google/flan-t5-small"  # Replace with a publicly available model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Use AutoModelForSeq2SeqLM for sequence-to-sequence models like T5
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    question = "What is the median of the data?"
    response = answer_question(f"Question: {question}\nAnswer:", model, tokenizer)
    print("Response from LLM:")
    print(response)

if __name__ == "__main__":
    main()
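To see what the statistics step returns without needing the CSV file, the sketch below runs the same calculations on a small in-memory DataFrame. The sample values and the column names `artist`, `tempo`, and `energy` are made up for illustration and are not taken from the actual dataset.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only
sample = pd.DataFrame({
    "artist": ["A", "B", "C", "D"],        # non-numeric, will be skipped
    "tempo": [100.0, 120.0, 120.0, 140.0],
    "energy": [0.2, 0.4, 0.4, 0.6],
})

def calculate_statistics(data):
    # Keep only numeric columns so text columns do not break the math
    numeric_data = data.select_dtypes(include=["number"])
    return {
        "mean": numeric_data.mean(),
        "median": numeric_data.median(),
        "mode": numeric_data.mode().iloc[0],  # first mode when there are ties
        "std": numeric_data.std(),
        "correlation": numeric_data.corr(),
    }

stats = calculate_statistics(sample)

# Each entry is a pandas Series indexed by column name,
# except "correlation", which is a square DataFrame
print(stats["median"])
print(stats["correlation"])
```

Each value in the returned dictionary can be looked up by column name, for example `stats["median"]["tempo"]`, which makes it straightforward to reuse individual numbers later in the script.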
Main Function:
- Orchestrates the execution of the entire script:
  - Reads CSV data from file_path.
  - Displays the head of the data.
  - Calculates and prints statistics of the numeric data.
  - Generates and displays plots using the plot_data function.
  - Initializes a language model (model) and tokenizer (tokenizer) from Hugging Face's Transformers library.
  - Asks a predefined question about the data to the language model and prints the response.
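Note that the prompt built in the script contains only the question, so the model never sees the data it is being asked about. One possible way to address this, sketched below, is to prepend a short summary of the computed statistics to the prompt. The helper name `build_prompt`, the `Context:` layout, and the sample values are assumptions made for illustration; they are not part of the original script.

```python
import pandas as pd

def build_prompt(question, data):
    """Hypothetical helper: put per-column medians and means before the question."""
    numeric_data = data.select_dtypes(include=["number"])
    medians = numeric_data.median()
    means = numeric_data.mean()
    # One short line per numeric column keeps the context compact
    lines = [
        f"{col}: median={medians[col]:.2f}, mean={means[col]:.2f}"
        for col in numeric_data.columns
    ]
    context = "\n".join(lines)
    return f"Context:\n{context}\nQuestion: {question}\nAnswer:"

# Made-up sample data, for illustration only
sample = pd.DataFrame({
    "tempo": [100.0, 120.0, 140.0],
    "energy": [0.2, 0.4, 0.9],
})

prompt = build_prompt("What is the median tempo?", sample)
print(prompt)
```

The resulting string could be passed to answer_question in place of the question-only prompt. With many columns the context may need trimming, since small models accept only a limited input length.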
Summary
This script demonstrates how to read CSV data, perform basic statistical analysis, visualize the data, and interact with a language model to answer questions. It leverages Python libraries such as pandas, matplotlib, and transformers for efficient data handling, visualization, and natural language processing tasks.