diff --git a/README.md b/README.md
index 5cdf926..b961e4c 100644
--- a/README.md
+++ b/README.md
@@ -2,11 +2,11 @@
 
 With the popularity of large language models (LLMs), undesirable societal problems like misinformation production and academic misconduct have been more severe, making LLM-generated text detection now of unprecedented importance. Though existing methods have made remarkable progress, they mostly consider publicly known LLMs when testing the performance and a new challenge brought by text from privately-tuned LLMs is largely underexplored.
 
-Due to the rapid development of open-source models like LLaMA and Qwen series and efficient LLM training methods, even ordinary users can now easily possess private LLMs by fine-tuning an open-source one with private corpora. This could lead to a significant performance drop of existing detectors in practice, due to their poor capability of capturing the essential LLM traits robust to fine-tuning operations.
+Due to the rapid development of open-source models like the Qwen and LLaMA series, even ordinary users can now easily possess private LLMs by further tuning an open-source one with private corpora. This could lead to a significant performance drop of existing detectors, by up to 41% in our preliminary study, due to their poor capability of capturing the essential LLM traits that are robust to fine-tuning operations.
 
-Our preliminary examination reveals that fine-tuning an LLM with 11M tokens could make a detector's accuracy jump from 100\% to only 59\% at most. To address this issue, we propose **PhantomHunter**, an LLM-generated text detector specialized for detecting text from unseen privately-tuned LLMs, whose family-aware learning framework captures family-level traits shared across the base models and their derivatives, instead of memorizing individual characteristics.
+To address this issue, we propose **PhantomHunter**, an LLM-generated text detector specialized for detecting text from unseen privately-tuned LLMs. Instead of memorizing individual characteristics, PhantomHunter's family-aware learning framework captures family-level traits shared across the base models and their derivatives.
 
-Specifically, PhantomHunter first extracts base model features and enhances the family-shared information using a contrastive family-aware learning module. The enhanced features are then fed into a mixture-of-experts module containing multiple experts for corresponding families for final predictions. Experiments on data from four widely-adopted LLM families (LLaMA, Gemma, Mistral, and Qwen) show PhantomHunter's superiority over 8 baselines and 11 industrial services.
+Specifically, PhantomHunter first extracts base model features and distills them into a lightweight proxy module for deployment efficiency, followed by a contrastive family-aware learning module that enhances the family-shared information. The enhanced features are then fed into a mixture-of-experts module containing multiple experts for corresponding families for final predictions. Experiments on data from four widely-adopted LLM families (LLaMA, Gemma, Mistral, and Qwen) show PhantomHunter's superiority over 9 baselines and 11 industrial services.
 
 ---
 Here is the official implementation of "PhantomHunter: A Multi-Task Framework with Mixture of Experts for Generalized Generated Text Detection".
@@ -26,7 +26,7 @@ PhantomHunter is a unified framework for detecting AI-generated text that levera
 
 ![PhantomHunter Architecture](pic/method.png)
 
-**PhantomHunter** and the training process. Given a text sample $\mathbf{x}$, it **1)** extracts the probability feature from $M$ base models and encode them with CNN and transformer blocks; **2)** predicts the family of $\mathbf{x}$ to determine the family gating weights; and **3)** feeds the representation $\mathbf{R}_{F}$ to a mixture-of-experts network controlled by the gating weights from Step 2 for final prediction of $\mathbf{x}$ being LLM-generated. During training, contrastive learning is applied in each mini-batch to better model family relationships. The red terms are loss functions.
+**PhantomHunter** and the training process. Given a text sample $\mathbf{x}$, it **1)** extracts base probability lists from $M$ base LLMs and distills them into a lightweight RoBERTa-MLP proxy module, so the base LLMs can be dropped during inference; **2)** encodes the proxy probability features with CNN and Transformer blocks and applies contrastive family-aware learning to enhance family-shared information; and **3)** feeds the enhanced representation $\mathbf{R}_{F}$ to a mixture-of-experts network controlled by family gating weights for the final LLM-generated text prediction. The red terms are loss functions.
 
 ## Data
 We simulate two common LLM usage scenarios: **writing** (69,297 arXiv paper abstracts) and **question-answering** (3,062 Q&A pairs from ELI5, finance, and medicine domains). We select four open-source models (LLaMA-2-7B-Chat, Gemma-7B-it, Mistral-7B-Instruct-v0.1, Qwen2.5-7B-Instruct) and fine-tune each with full-parameter and LoRA methods on domain-specific corpora, resulting in 48 derivative models for evaluation.
@@ -73,6 +73,31 @@ python main.py \
     --train
 ```
 
+### Train with Proxy MSE
+
+The Proxy module learns to approximate the white-box probability features from
+the encoder hidden states. During training, `--proxy-prob` controls how often the
+model consumes proxy-generated features, while `--mse-weight` supervises the
+proxy features against the original white-box probability features.
+
+```bash
+python main.py \
+    --cuda \
+    --seed 2024 \
+    --exp-name moe+logits+cl+proxy_mse_arxiv-lora_5e-4 \
+    --train-path /feature/arxiv_new/lora/train.jsonl \
+    --val-path /feature/arxiv_new/lora/val.jsonl \
+    --test-path /feature/arxiv_new/lora/test_ood.jsonl \
+    --batch-size 64 \
+    --lr 5e-4 \
+    --is-cl \
+    --proxy-prob 0.5 \
+    --mse-weight 1.0 \
+    --use-curriculum \
+    --proxy-warmup-epochs 10 \
+    --train
+```
+
 ### Evaluation
 
 ```bash
@@ -88,6 +113,22 @@ python main.py \
     --test
 ```
 
+To evaluate with proxy-generated features instead of white-box probability
+features, add `--use-proxy`:
+
+```bash
+python main.py \
+    --cuda \
+    --seed 2024 \
+    --exp-name moe+logits+cl+proxy_mse_arxiv-lora_5e-4 \
+    --test-path /feature/arxiv_new/lora/test_ood.jsonl \
+    --batch-size 64 \
+    --lr 5e-4 \
+    --is-binary \
+    --use-proxy \
+    --test
+```
+
 
 ## License
 
diff --git a/main.py b/main.py
index 24774d4..3c0a611 100644
--- a/main.py
+++ b/main.py
@@ -30,6 +30,11 @@ def parse_args():
     parser.add_argument('--train', action='store_true')
     parser.add_argument('--is-binary', action='store_true', help='True indicate binary classification,False indicate multi-classification')
     parser.add_argument('--is-cl', action='store_true', help='if use contrastive learning')
+    parser.add_argument('--use-proxy', action='store_true', help='use the proxy module to replace white-box probability features at inference time')
+    parser.add_argument('--proxy-prob', type=float, default=0.0, help='probability of using proxy-generated probability features during training')
+    parser.add_argument('--proxy-warmup-epochs', type=int, default=0, help='number of epochs used to warm up proxy probability and MSE weight')
+    parser.add_argument('--use-curriculum', action='store_true', help='linearly increase proxy probability and MSE weight during warmup')
+    parser.add_argument('--mse-weight', type=float, default=0.0, help='weight for MSE loss between proxy features and white-box probability features')
     return parser.parse_args()
 
 def set_seed(seed):
@@ -62,7 +67,25 @@ def main(args):
     train_dataloader = get_dataloader(args.train_path, args.pretrain_model, args.batch_size, args.max_len, label2id, shuffle=True) if not args.test else None
     val_dataloader = get_dataloader(args.val_path, args.pretrain_model, args.batch_size, args.max_len, label2id, shuffle=False) if not args.test else None
     test_dataloader = get_dataloader(args.test_path, args.pretrain_model, args.batch_size, args.max_len, label2id, shuffle=False)
-    trainer = Trainer(device, args.pretrain_model, train_dataloader, val_dataloader, test_dataloader, args.epoch, args.lr, args.early_stop, model_save_path, args.n_family, args.is_cl, args.is_binary)
+    trainer = Trainer(
+        device,
+        args.pretrain_model,
+        train_dataloader,
+        val_dataloader,
+        test_dataloader,
+        args.epoch,
+        args.lr,
+        args.early_stop,
+        model_save_path,
+        args.n_family,
+        args.is_cl,
+        args.is_binary,
+        proxy_prob=args.proxy_prob,
+        proxy_warmup_epochs=args.proxy_warmup_epochs,
+        mse_weight=args.mse_weight,
+        use_proxy_inference=args.use_proxy,
+        use_curriculum=args.use_curriculum,
+    )
 
     if not args.test:
         trainer.train()
diff --git a/model.py b/model.py
index 90ea14c..35621aa 100644
--- a/model.py
+++ b/model.py
@@ -8,6 +8,19 @@ from scl_loss import SupConLoss
 from typing import List, Tuple
 
 
+class ProxyModule(nn.Module):
+    def __init__(self, emb_dim, n_feat):
+        super(ProxyModule, self).__init__()
+        self.net = nn.Sequential(
+            nn.Linear(emb_dim, 256),
+            nn.ReLU(),
+            nn.Linear(256, n_feat)
+        )
+
+    def forward(self, feature):
+        out = self.net(feature)
+        return out.transpose(1, 2)
+
 class MLP(nn.Module):
     def __init__(self, input_dim, hidden_dims, output_dim, dropout):
         super(MLP, self).__init__()
@@ -87,7 +100,8 @@ def __init__(self, n_family=4, emb_dim=768, hidden_dims=[256], dropout=0.2, feat
         super(Model, self).__init__()
 
         self.n_family = n_family
-        self.n_feat = 3
+        self.n_feat = n_family - 1
+        self.proxy_module = ProxyModule(emb_dim, self.n_feat)
         feature_enc_layers = [(64, 5, 1)] + [(128, 3, 1)] * 3 + [(64, 3, 1)]
         self.conv = ConvFeatureExtractionModel(
             conv_layers=feature_enc_layers,
@@ -128,9 +142,17 @@ def conv_feat_extract(self, x):
         out = out.transpose(1, 2)
         return out
 
-    def forward(self, prob_feature, feature, mask):
+    def forward(self, prob_feature, feature, mask, use_proxy_prob=0.0):
+        pred_prob_feature = self.proxy_module(feature)
+
+        if self.training and use_proxy_prob > 0.0:
+            if torch.rand((), device=feature.device).item() < use_proxy_prob:
+                prob_feature = pred_prob_feature
+        elif not self.training and use_proxy_prob >= 1.0:
+            prob_feature = pred_prob_feature
+
         prob_feature = torch.cat([self.conv_feat_extract(prob_feature[:, i:i+1, :]) for i in range(self.n_feat)], dim=2) # (batch_size, seq_len, embedding_size)
-        prob_feature = prob_feature + self.position_encoding.cuda()
+        prob_feature = prob_feature + self.position_encoding.to(prob_feature.device)
         prob_feature = self.norm(prob_feature)
         prob_feature = self.encoder(prob_feature)
         prob_feature = self.dropout(prob_feature) # (bs, seq_len, embedding_size)
@@ -142,12 +164,12 @@ def forward(self, prob_feature, feature, mask):
         shared_feature = sum([self.expert[i](prob_feature) * gate[:, i].unsqueeze(1) for i in range(self.n_family)])
 
         pred_binary = self.binary_classifier(shared_feature)
-        pred_binary = torch.sigmoid(pred_binary).squeeze()
+        pred_binary = torch.sigmoid(pred_binary).squeeze(-1)
 
-        return pred_binary, pred_family, family_feature
+        return pred_binary, pred_family, family_feature, pred_prob_feature
 
 class Trainer:
-    def __init__(self, device, pretrain_model, train_dataloader, val_dataloader, test_dataloader, epoch, lr, early_stop, model_save_path, n_family, is_cl, is_binary):
+    def __init__(self, device, pretrain_model, train_dataloader, val_dataloader, test_dataloader, epoch, lr, early_stop, model_save_path, n_family, is_cl, is_binary, proxy_prob=0.0, proxy_warmup_epochs=0, mse_weight=0.0, use_proxy_inference=False, use_curriculum=False):
         self.device = device
         self.epoch = epoch
         self.train_dataloader = train_dataloader
@@ -155,6 +177,13 @@ def __init__(self, device, pretrain_model, train_dataloader, val_dataloader, tes
         self.test_dataloader = test_dataloader
         self.early_stop = early_stop
         self.n_family = n_family
+        self.proxy_prob = proxy_prob
+        self.proxy_warmup_epochs = proxy_warmup_epochs
+        self.mse_weight = mse_weight
+        self.use_proxy_inference = use_proxy_inference
+        self.use_curriculum = use_curriculum
+        self.current_proxy_prob = 0.0
+        self.current_mse_weight = 0.0
         self.pretrain = RobertaModel.from_pretrained(pretrain_model).to(device)
         self.model_save_path = model_save_path
         self.model = Model(n_family=n_family).to(device)
@@ -170,7 +199,7 @@ def get_loss(self, batch):
         label_family = batch['label_family'].to(self.device)
         label_binary = batch['label_binary'].to(self.device)
 
-        pred_binary, pred_family, family_feature = self.model(ll_tokens_list, feature, attention_mask)
+        pred_binary, pred_family, family_feature, pred_prob_feature = self.model(ll_tokens_list, feature, attention_mask, use_proxy_prob=self.current_proxy_prob)
         if self.is_clLoss:
             loss = nn.BCELoss()(pred_binary, label_binary.float()) \
                 + nn.CrossEntropyLoss()(pred_family, label_family) \
@@ -178,6 +207,10 @@ def get_loss(self, batch):
         else:
             loss = nn.BCELoss()(pred_binary, label_binary.float()) \
                 + nn.CrossEntropyLoss()(pred_family, label_family)
+        if self.current_mse_weight > 0.0:
+            real_targets = ll_tokens_list[:, :self.model.n_feat, :]
+            loss_mse = nn.MSELoss()(pred_prob_feature, real_targets)
+            loss = loss + self.current_mse_weight * loss_mse
         return loss
 
     def get_output(self, batch):
@@ -187,7 +220,7 @@ def get_output(self, batch):
             attention_mask = batch['attention_mask'].to(self.device)
             feature = self.pretrain(input_ids, attention_mask).last_hidden_state.detach()
             with torch.no_grad():
-                output, _, _ = self.model(ll_tokens_list, feature, attention_mask)
+                output, _, _, _ = self.model(ll_tokens_list, feature, attention_mask, use_proxy_prob=1.0 if self.use_proxy_inference else 0.0)
             return output
         else:
             ll_tokens_list = batch['ll_tokens_list'].to(self.device)
@@ -195,7 +228,7 @@ def get_output(self, batch):
             attention_mask = batch['attention_mask'].to(self.device)
             feature = self.pretrain(input_ids, attention_mask).last_hidden_state.detach()
             with torch.no_grad():
-                output, pred_family, _ = self.model(ll_tokens_list, feature, attention_mask)
+                output, pred_family, _, _ = self.model(ll_tokens_list, feature, attention_mask, use_proxy_prob=1.0 if self.use_proxy_inference else 0.0)
             return output, pred_family
 
 
@@ -203,7 +236,18 @@ def get_output(self, batch):
     def train(self):
         recorder = Recorder(self.early_stop)
         for epoch in range(self.epoch):
-            print('----epoch %d----' % (epoch+1))
+            if self.use_curriculum and self.proxy_warmup_epochs > 0:
+                warmup_ratio = min(1.0, (epoch + 1) / self.proxy_warmup_epochs)
+                self.current_proxy_prob = self.proxy_prob * warmup_ratio
+                self.current_mse_weight = self.mse_weight * warmup_ratio
+            else:
+                self.current_proxy_prob = self.proxy_prob
+                self.current_mse_weight = self.mse_weight
+
+            if self.proxy_prob > 0.0 or self.mse_weight > 0.0:
+                print('----epoch %d (proxy_prob: %.2f, mse_weight: %.2f)----' % (epoch+1, self.current_proxy_prob, self.current_mse_weight))
+            else:
+                print('----epoch %d----' % (epoch+1))
             self.model.train()
             avg_loss = Averager()
             for i, batch in enumerate(tqdm(self.train_dataloader)):
diff --git a/pic/method.png b/pic/method.png
index 19e2e8b..b327d1e 100644
Binary files a/pic/method.png and b/pic/method.png differ
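
Review note: the curriculum warmup added to `Trainer.train` can be checked in isolation. The following is a minimal sketch, not code from the patch: the helper name `warmup_schedule` and its argument order are hypothetical, and it only mirrors the linear ramp of `current_proxy_prob` and `current_mse_weight` as the diff appears to compute it.

```python
def warmup_schedule(epoch, proxy_prob, mse_weight, warmup_epochs, use_curriculum):
    """Return (current_proxy_prob, current_mse_weight) for a 0-indexed epoch.

    Hypothetical standalone helper; it reproduces the per-epoch branch at the
    top of the training loop in the patch, outside of the Trainer class.
    """
    if use_curriculum and warmup_epochs > 0:
        # Linear ramp: 1/warmup_epochs after the first epoch, capped at 1.0.
        ratio = min(1.0, (epoch + 1) / warmup_epochs)
        return proxy_prob * ratio, mse_weight * ratio
    # Without the curriculum flag the target values apply from the first epoch.
    return proxy_prob, mse_weight


if __name__ == "__main__":
    # Values taken from the README training command in this patch:
    # --proxy-prob 0.5 --mse-weight 1.0 --use-curriculum --proxy-warmup-epochs 10
    for epoch in (0, 4, 9, 15):
        p, w = warmup_schedule(epoch, 0.5, 1.0, 10, True)
        print("epoch %d: proxy_prob=%.2f, mse_weight=%.2f" % (epoch + 1, p, w))
```

With the README values, both quantities reach their targets at epoch 10 and stay there. Passing `--proxy-warmup-epochs` without `--use-curriculum` has no effect under this logic, which may be worth stating in the flag's help text.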