I'm quite new to Python and I'm trying to solve a problem. I have a JSON file and I need to do multilabel classification on it. I decided to use TF-IDF.
My JSON file looks like this: training_data
My techniques file looks like this: Techniques
Can someone give me some tips on how to preprocess the data?
The data for testing looks like this: Testing Data
import json
import sys

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Read the technique names, skipping blank lines
with open(propaganda_techniques_file, 'r') as f:
    techniques = [line.rstrip() for line in f.readlines() if len(line) > 2]

# Read data from training_set_task1
try:
    with open(training_file, "r", encoding='utf-8') as f:
        json_obj = json.load(f)
except:
    sys.exit("ERROR: cannot load json file")

# Read the test data
try:
    with open(test_file, 'r', encoding='utf-8') as f:
        json_test = json.load(f)
except:
    sys.exit("Error")

# Split the training examples into labels and text
tech_list = []
text_list = []
i = 0
for example in json_obj:
    while i < len(json_obj):
        tech_list.append(json_obj[i]['labels'])
        text_list.append(json_obj[i]['text'])
        i += 1

# Collect the text of the test examples
j = 0
test_text = []
for ex in json_test:
    while j < len(json_test):
        test_text.append(json_test[j]['text'])
        j += 1

vec_train = TfidfVectorizer()
X_train = vec_train.fit_transform(text_list)
y_train = tech_list

vec_test = TfidfVectorizer()
X_test = vec_test.fit_transform(test_text)

clf = LogisticRegression(penalty='l2', multi_class='multinomial', solver='newton-cg')
y_pred = clf.predict(X_test)
This is my code so far, but I get an error message:
This LogisticRegression instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
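As far as I can tell, the message means that `predict` was called on a classifier that never had `fit` called on it. Here is a minimal sketch of the fit-then-predict order, using toy texts and single labels that I made up for illustration (they are not my real data). It also assumes the test texts should go through the same fitted vectorizer, so that the feature columns line up with the ones the classifier was trained on:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data, invented for illustration only
train_texts = [
    "buy now or regret forever",
    "the sky is blue today",
    "act now before it is too late",
    "the grass is green today",
]
train_labels = ["urgency", "neutral", "urgency", "neutral"]
test_texts = ["act now", "the sky is green"]

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_texts)  # learn the vocabulary on training text only
X_test = vec.transform(test_texts)        # reuse the same vocabulary for the test text

clf = LogisticRegression()
clf.fit(X_train, train_labels)            # fit must come before predict
y_pred = clf.predict(X_test)
```

With a second, separately fitted vectorizer the test matrix would generally have a different number of columns than the training matrix, which is why this sketch keeps a single one.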
I tried to split my testing data into two parts, labels and text, so that the text is the input and the label is the output (I don't know if that's a good approach).
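One way the text-in, labels-out idea might look for the multilabel case is sketched below. It assumes each record has a `text` string and a `labels` list (the keys used in my code), and it swaps in `MultiLabelBinarizer` plus `OneVsRestClassifier`, which are not in my original attempt. The records and technique names here are hypothetical stand-ins, not the real files:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical records shaped like the training JSON (invented for illustration)
train = [
    {"text": "buy now or regret forever", "labels": ["Appeal to fear", "Slogans"]},
    {"text": "the sky is blue today", "labels": []},
    {"text": "act now before it is too late", "labels": ["Appeal to fear"]},
    {"text": "make it great again", "labels": ["Slogans"]},
]
techniques = ["Appeal to fear", "Slogans"]  # stand-in for the techniques file

texts = [ex["text"] for ex in train]
label_lists = [ex["labels"] for ex in train]

# Turn each list of labels into a 0/1 row with one column per technique
mlb = MultiLabelBinarizer(classes=techniques)
Y = mlb.fit_transform(label_lists)

# One binary logistic regression per technique, on top of TF-IDF features
model = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LogisticRegression()))
model.fit(texts, Y)

Y_pred = model.predict(["act now or regret it"])
predicted_labels = mlb.inverse_transform(Y_pred)  # list of tuples of technique names
```

Putting the vectorizer inside the pipeline means the test text is transformed with the vocabulary learned during `fit`, and `inverse_transform` maps the 0/1 rows back to technique names.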