diff --git a/README.md b/README.md
index 5cdf926..b961e4c 100644
--- a/README.md
+++ b/README.md
@@ -2,11 +2,11 @@
 
 With the popularity of large language models (LLMs), undesirable societal problems like misinformation production and academic misconduct have been more severe, making LLM-generated text detection now of unprecedented importance. Though existing methods have made remarkable progress, they mostly consider publicly known LLMs when testing the performance and a new challenge brought by text from privately-tuned LLMs is largely underexplored.
 
-Due to the rapid development of open-source models like LLaMA and Qwen series and efficient LLM training methods, even ordinary users can now easily possess private LLMs by fine-tuning an open-source one with private corpora. This could lead to a significant performance drop of existing detectors in practice, due to their poor capability of capturing the essential LLM traits robust to fine-tuning operations.
+Due to the rapid development of open-source models like the Qwen and LLaMA series, even ordinary users can now easily obtain private LLMs by further tuning an open-source model on private corpora. In our preliminary study, this degraded the performance of existing detectors by up to 41%, because they capture poorly the essential LLM traits that are robust to fine-tuning.
 
-Our preliminary examination reveals that fine-tuning an LLM with 11M tokens could make a detector's accuracy jump from 100\% to only 59\% at most. To address this issue, we propose **PhantomHunter**, an LLM-generated text detector specialized for detecting text from unseen privately-tuned LLMs, whose family-aware learning framework captures family-level traits shared across the base models and their derivatives, instead of memorizing individual characteristics.
+To address this issue, we propose **PhantomHunter**, an LLM-generated text detector specialized for detecting text from unseen privately-tuned LLMs. Instead of memorizing individual characteristics, PhantomHunter's family-aware learning framework captures family-level traits shared across the base models and their derivatives.
 
-Specifically, PhantomHunter first extracts base model features and enhances the family-shared information using a contrastive family-aware learning module. The enhanced features are then fed into a mixture-of-experts module containing multiple experts for corresponding families for final predictions. Experiments on data from four widely-adopted LLM families (LLaMA, Gemma, Mistral, and Qwen) show PhantomHunter's superiority over 8 baselines and 11 industrial services.
+Specifically, PhantomHunter first extracts base model features and distills them into a lightweight proxy module for efficient deployment, followed by a contrastive family-aware learning module that enhances the family-shared information. The enhanced features are then fed into a mixture-of-experts module, whose experts correspond to the model families, for the final prediction. Experiments on data from four widely adopted LLM families (LLaMA, Gemma, Mistral, and Qwen) show PhantomHunter's superiority over 9 baselines and 11 industrial services.
 
 ---
 Here is the official implementation of "PhantomHunter: A Multi-Task Framework with Mixture of Experts for Generalized Generated Text Detection".
@@ -26,7 +26,7 @@ PhantomHunter is a unified framework for detecting AI-generated text that levera
 
 ![PhantomHunter Architecture](pic/method.png)
 
-**PhantomHunter** and the training process. Given a text sample $\mathbf{x}$, it **1)** extracts the probability feature from $M$ base models and encode them with CNN and transformer blocks; **2)** predicts the family of $\mathbf{x}$ to determine the family gating weights; and **3)** feeds the representation $\mathbf{R}_{F}$ to a mixture-of-experts network controlled by the gating weights from Step 2 for final prediction of $\mathbf{x}$ being LLM-generated. During training, contrastive learning is applied in each mini-batch to better model family relationships. The red terms are loss functions.
+**PhantomHunter** and the training process. Given a text sample $\mathbf{x}$, it **1)** extracts base probability lists from $M$ base LLMs and distills them into a lightweight RoBERTa-MLP proxy module, so the base LLMs can be dropped during inference; **2)** encodes the proxy probability features with CNN and Transformer blocks and applies contrastive family-aware learning to enhance family-shared information; and **3)** feeds the enhanced representation $\mathbf{R}_{F}$ to a mixture-of-experts network controlled by family gating weights for the final LLM-generated text prediction. The red terms are loss functions.
 
 ## Data
 We simulate two common LLM usage scenarios: **writing** (69,297 arXiv paper abstracts) and **question-answering** (3,062 Q&A pairs from ELI5, finance, and medicine domains). We select four open-source models (LLaMA-2-7B-Chat, Gemma-7B-it, Mistral-7B-Instruct-v0.1, Qwen2.5-7B-Instruct) and fine-tune each with full-parameter and LoRA methods on domain-specific corpora, resulting in 48 derivative models for evaluation. 
@@ -73,6 +73,31 @@ python main.py \
     --train
 ```
 
+### Train with Proxy MSE
+
+The proxy module learns to approximate the white-box probability features from
+the encoder hidden states. During training, `--proxy-prob` sets the probability
+that a forward pass consumes proxy-generated features, and `--mse-weight` weights
+the MSE loss between the proxy features and the white-box probability features.
+
+```bash
+python main.py \
+    --cuda \
+    --seed 2024 \
+    --exp-name moe+logits+cl+proxy_mse_arxiv-lora_5e-4 \
+    --train-path /feature/arxiv_new/lora/train.jsonl \
+    --val-path /feature/arxiv_new/lora/val.jsonl \
+    --test-path /feature/arxiv_new/lora/test_ood.jsonl \
+    --batch-size 64 \
+    --lr 5e-4 \
+    --is-cl \
+    --proxy-prob 0.5 \
+    --mse-weight 1.0 \
+    --use-curriculum \
+    --proxy-warmup-epochs 10 \
+    --train
+```
+
 ### Evaluation
 
 ```bash
@@ -88,6 +113,22 @@ python main.py \
     --test
 ```
 
+To evaluate with proxy-generated features instead of white-box probability
+features, add `--use-proxy`:
+
+```bash
+python main.py \
+    --cuda \
+    --seed 2024 \
+    --exp-name moe+logits+cl+proxy_mse_arxiv-lora_5e-4 \
+    --test-path /feature/arxiv_new/lora/test_ood.jsonl \
+    --batch-size 64 \
+    --lr 5e-4 \
+    --is-binary \
+    --use-proxy \
+    --test
+```
+
 
 
 ## License
diff --git a/main.py b/main.py
index 24774d4..3c0a611 100644
--- a/main.py
+++ b/main.py
@@ -30,6 +30,11 @@ def parse_args():
     parser.add_argument('--train', action='store_true')
     parser.add_argument('--is-binary',  action='store_true', help='True indicate binary classification,False indicate multi-classification')
     parser.add_argument('--is-cl',  action='store_true', help='if use contrastive learning')
+    parser.add_argument('--use-proxy', action='store_true', help='use the proxy module to replace white-box probability features at inference time')
+    parser.add_argument('--proxy-prob', type=float, default=0.0, help='probability of using proxy-generated probability features during training')
+    parser.add_argument('--proxy-warmup-epochs', type=int, default=0, help='number of epochs used to warm up proxy probability and MSE weight')
+    parser.add_argument('--use-curriculum', action='store_true', help='linearly increase proxy probability and MSE weight during warmup')
+    parser.add_argument('--mse-weight', type=float, default=0.0, help='weight for MSE loss between proxy features and white-box probability features')
     return parser.parse_args()
 
 def set_seed(seed):
@@ -62,7 +67,25 @@ def main(args):
     train_dataloader = get_dataloader(args.train_path, args.pretrain_model, args.batch_size, args.max_len, label2id, shuffle=True) if not args.test else None
     val_dataloader = get_dataloader(args.val_path, args.pretrain_model, args.batch_size, args.max_len, label2id, shuffle=False) if not args.test else None
     test_dataloader = get_dataloader(args.test_path, args.pretrain_model, args.batch_size, args.max_len, label2id, shuffle=False)
-    trainer = Trainer(device, args.pretrain_model, train_dataloader, val_dataloader, test_dataloader, args.epoch, args.lr, args.early_stop, model_save_path, args.n_family, args.is_cl, args.is_binary)
+    trainer = Trainer(
+        device,
+        args.pretrain_model,
+        train_dataloader,
+        val_dataloader,
+        test_dataloader,
+        args.epoch,
+        args.lr,
+        args.early_stop,
+        model_save_path,
+        args.n_family,
+        args.is_cl,
+        args.is_binary,
+        proxy_prob=args.proxy_prob,
+        proxy_warmup_epochs=args.proxy_warmup_epochs,
+        mse_weight=args.mse_weight,
+        use_proxy_inference=args.use_proxy,
+        use_curriculum=args.use_curriculum,
+    )
 
     if not args.test:
         trainer.train()
diff --git a/model.py b/model.py
index 90ea14c..35621aa 100644
--- a/model.py
+++ b/model.py
@@ -8,6 +8,19 @@
 from scl_loss import SupConLoss
 from typing import List, Tuple
 
+class ProxyModule(nn.Module):
+    def __init__(self, emb_dim, n_feat):
+        super(ProxyModule, self).__init__()
+        self.net = nn.Sequential(
+            nn.Linear(emb_dim, 256),
+            nn.ReLU(),
+            nn.Linear(256, n_feat)
+        )
+
+    def forward(self, feature):
+        out = self.net(feature)
+        return out.transpose(1, 2)
+
 class MLP(nn.Module):
     def __init__(self, input_dim, hidden_dims, output_dim, dropout):
         super(MLP, self).__init__()
@@ -87,7 +100,8 @@ def __init__(self, n_family=4, emb_dim=768, hidden_dims=[256], dropout=0.2, feat
         super(Model, self).__init__()
         self.n_family = n_family
 
-        self.n_feat = 3
+        self.n_feat = n_family - 1
+        self.proxy_module = ProxyModule(emb_dim, self.n_feat)
         feature_enc_layers = [(64, 5, 1)] + [(128, 3, 1)] * 3 + [(64, 3, 1)]
         self.conv = ConvFeatureExtractionModel(
             conv_layers=feature_enc_layers,
@@ -128,9 +142,17 @@ def conv_feat_extract(self, x):
         out = out.transpose(1, 2)
         return out
 
-    def forward(self, prob_feature, feature, mask):
+    def forward(self, prob_feature, feature, mask, use_proxy_prob=0.0):
+        pred_prob_feature = self.proxy_module(feature)
+
+        if self.training and use_proxy_prob > 0.0:
+            if torch.rand((), device=feature.device).item() < use_proxy_prob:
+                prob_feature = pred_prob_feature
+        elif not self.training and use_proxy_prob >= 1.0:
+            prob_feature = pred_prob_feature
+
         prob_feature = torch.cat([self.conv_feat_extract(prob_feature[:, i:i+1, :]) for i in range(self.n_feat)], dim=2)  # (batch_size, seq_len, embedding_size)
-        prob_feature = prob_feature + self.position_encoding.cuda()
+        prob_feature = prob_feature + self.position_encoding.to(prob_feature.device)
         prob_feature = self.norm(prob_feature)
         prob_feature = self.encoder(prob_feature)
         prob_feature = self.dropout(prob_feature)  # (bs, seq_len, embedding_size)
@@ -142,12 +164,12 @@ def forward(self, prob_feature, feature, mask):
 
         shared_feature = sum([self.expert[i](prob_feature) * gate[:, i].unsqueeze(1) for i in range(self.n_family)])
         pred_binary = self.binary_classifier(shared_feature)
-        pred_binary = torch.sigmoid(pred_binary).squeeze()
+        pred_binary = torch.sigmoid(pred_binary).squeeze(-1)
 
-        return pred_binary, pred_family, family_feature
+        return pred_binary, pred_family, family_feature, pred_prob_feature
 
 class Trainer:
-    def __init__(self, device, pretrain_model, train_dataloader, val_dataloader, test_dataloader, epoch, lr, early_stop, model_save_path, n_family, is_cl, is_binary):
+    def __init__(self, device, pretrain_model, train_dataloader, val_dataloader, test_dataloader, epoch, lr, early_stop, model_save_path, n_family, is_cl, is_binary, proxy_prob=0.0, proxy_warmup_epochs=0, mse_weight=0.0, use_proxy_inference=False, use_curriculum=False):
         self.device = device
         self.epoch = epoch
         self.train_dataloader = train_dataloader
@@ -155,6 +177,13 @@ def __init__(self, device, pretrain_model, train_dataloader, val_dataloader, tes
         self.test_dataloader = test_dataloader
         self.early_stop = early_stop
         self.n_family = n_family
+        self.proxy_prob = proxy_prob
+        self.proxy_warmup_epochs = proxy_warmup_epochs
+        self.mse_weight = mse_weight
+        self.use_proxy_inference = use_proxy_inference
+        self.use_curriculum = use_curriculum
+        self.current_proxy_prob = 0.0
+        self.current_mse_weight = 0.0
         self.pretrain = RobertaModel.from_pretrained(pretrain_model).to(device)
         self.model_save_path = model_save_path
         self.model = Model(n_family=n_family).to(device)
@@ -170,7 +199,7 @@ def get_loss(self, batch):
         label_family = batch['label_family'].to(self.device)
         label_binary = batch['label_binary'].to(self.device)
 
-        pred_binary, pred_family, family_feature = self.model(ll_tokens_list, feature, attention_mask)
+        pred_binary, pred_family, family_feature, pred_prob_feature = self.model(ll_tokens_list, feature, attention_mask, use_proxy_prob=self.current_proxy_prob)
         if  self.is_clLoss:
             loss = nn.BCELoss()(pred_binary, label_binary.float()) \
                     + nn.CrossEntropyLoss()(pred_family, label_family) \
@@ -178,6 +207,10 @@ def get_loss(self, batch):
         else:
             loss = nn.BCELoss()(pred_binary, label_binary.float()) \
                     + nn.CrossEntropyLoss()(pred_family, label_family)
+        if self.current_mse_weight > 0.0:
+            real_targets = ll_tokens_list[:, :self.model.n_feat, :]
+            loss_mse = nn.MSELoss()(pred_prob_feature, real_targets)
+            loss = loss + self.current_mse_weight * loss_mse
         return loss
 
     def get_output(self, batch):
@@ -187,7 +220,7 @@ def get_output(self, batch):
             attention_mask = batch['attention_mask'].to(self.device)
             feature = self.pretrain(input_ids, attention_mask).last_hidden_state.detach()
             with torch.no_grad():
-                output, _, _ = self.model(ll_tokens_list, feature, attention_mask)
+                output, _, _, _ = self.model(ll_tokens_list, feature, attention_mask, use_proxy_prob=1.0 if self.use_proxy_inference else 0.0)
             return output
         else:
             ll_tokens_list = batch['ll_tokens_list'].to(self.device)
@@ -195,7 +228,7 @@ def get_output(self, batch):
             attention_mask = batch['attention_mask'].to(self.device)
             feature = self.pretrain(input_ids, attention_mask).last_hidden_state.detach()
             with torch.no_grad():
-                output, pred_family, _ = self.model(ll_tokens_list, feature, attention_mask)
+                output, pred_family, _, _ = self.model(ll_tokens_list, feature, attention_mask, use_proxy_prob=1.0 if self.use_proxy_inference else 0.0)
             return output, pred_family
 
 
@@ -203,7 +236,18 @@ def get_output(self, batch):
     def train(self):
         recorder = Recorder(self.early_stop)
         for epoch in range(self.epoch):
-            print('----epoch %d----' % (epoch+1))
+            if self.use_curriculum and self.proxy_warmup_epochs > 0:
+                warmup_ratio = min(1.0, (epoch + 1) / self.proxy_warmup_epochs)
+                self.current_proxy_prob = self.proxy_prob * warmup_ratio
+                self.current_mse_weight = self.mse_weight * warmup_ratio
+            else:
+                self.current_proxy_prob = self.proxy_prob
+                self.current_mse_weight = self.mse_weight
+
+            if self.proxy_prob > 0.0 or self.mse_weight > 0.0:
+                print('----epoch %d (proxy_prob: %.2f, mse_weight: %.2f)----' % (epoch+1, self.current_proxy_prob, self.current_mse_weight))
+            else:
+                print('----epoch %d----' % (epoch+1))
             self.model.train()
             avg_loss = Averager()
             for i, batch in enumerate(tqdm(self.train_dataloader)):
diff --git a/pic/method.png b/pic/method.png
index 19e2e8b..b327d1e 100644
Binary files a/pic/method.png and b/pic/method.png differ
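
Review note: the warmup logic added to `Trainer.train` in model.py can be checked in isolation. The sketch below restates that schedule as a standalone function; the name `proxy_schedule` and its argument order are hypothetical and not part of this change. It assumes, as in the diff, that `epoch` is 0-based and that the ramp applies only when `--use-curriculum` is set and `--proxy-warmup-epochs` is positive.

```python
def proxy_schedule(epoch, proxy_prob, mse_weight, warmup_epochs, use_curriculum):
    """Return (current_proxy_prob, current_mse_weight) for a 0-based epoch.

    Hypothetical helper mirroring the schedule in Trainer.train: both values
    ramp linearly to their targets over `warmup_epochs`, then stay constant.
    """
    if use_curriculum and warmup_epochs > 0:
        ratio = min(1.0, (epoch + 1) / warmup_epochs)
        return proxy_prob * ratio, mse_weight * ratio
    # Without curriculum (or with zero warmup) the targets apply immediately.
    return proxy_prob, mse_weight


if __name__ == "__main__":
    # Values taken from the README training example in this diff.
    for epoch in (0, 4, 9, 14):
        p, w = proxy_schedule(epoch, 0.5, 1.0, 10, True)
        print(f"epoch {epoch + 1}: proxy_prob={p:.2f}, mse_weight={w:.2f}")
```

With the README settings (`--proxy-prob 0.5 --mse-weight 1.0 --proxy-warmup-epochs 10 --use-curriculum`), both values reach their targets at epoch 10 and stay there. Passing `--proxy-warmup-epochs` without `--use-curriculum` has no effect on the schedule.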