
Did you release the dataset and the code for testing? It would be interesting to check how 3.5 Sonnet performs on this task.


The dataset is there:

https://huggingface.co/datasets/strickvl/isafpressreleases_t...

But when I looked at rows where GPT-4o was marked inaccurate, it seemed to me that either the label was wrong or, at least, it wasn't possible to infer that label from the input text. Yet the finetuned model was able to predict it.

Which makes me wonder whether the finetuned models are poisoned with eval data...

See this one:

> ISAF Joint Command Morning Operational Update, March 8, 2011 ISAF Joint Command - Afghanistan 2011-03-S-022 For Immediate Release KABUL, Afghanistan (March 8, 2011) Afghan and coalition forces targeted a Taliban district chief, killed one insurgent and detained several others during an operation in Burkah district, Baghlan province, yesterday. The Taliban district chief maintains ties to Taliban senior leadership throughout Kunduz, Baghlan, and Takhar provinces. He is involved in purchasing weapons and IEDs. Intelligence reports led the security force to the targeted compound in the city, where Afghan forces called for all occupants to exit the buildings peacefully before conducting a search. During that time, an armed individual threatened the security force and the force returned fire, killing him. Several suspected insurgents were detained after initial questioning at the scene.

It says "yesterday" in a release dated March 8, so you would assume March 7 is the correct start_date, but it's labelled March 6. The finetuned models get it "right", while GPT-4o says March 7.
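To make the arithmetic explicit, here is a minimal sketch of the date reasoning. The helper name `resolve_relative_day` and its tiny word table are made up for illustration and are not part of the dataset or the author's code; the dates come from the press release and the label quoted above.

```python
from datetime import date, timedelta

def resolve_relative_day(release_date: date, word: str) -> date:
    """Map a relative day word to a calendar date (hypothetical helper)."""
    # Only the two words needed for this example are handled.
    offsets = {"today": 0, "yesterday": 1}
    return release_date - timedelta(days=offsets[word.lower()])

release = date(2011, 3, 8)   # date printed on the press release
label = date(2011, 3, 6)     # start_date as labelled in the dataset

inferred = resolve_relative_day(release, "Yesterday")
print(inferred)                 # 2011-03-07
print((inferred - label).days)  # 1, i.e. the label is one day earlier than the text implies
```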


I was wondering if there was some info in the bizarrely formatted date, but I think 022 is just the issue number: https://www.dvidshub.net/news/66703/correction-isaf-joint-co...


Also, a lot of the cases where the dates are wrong seem to come from the text only having dates in those formats, which again makes me wonder how the finetuned models get this right unless they were finetuned on eval data...


Props to the author for releasing the data. My instinct is also to immediately suspect data leakage. It's super easy for this to happen. For example, the original dataset could contain multiple articles about the same event.
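One rough way to check for this kind of leakage is to look for near-duplicate texts across the train and test splits. The sketch below uses word 5-gram Jaccard similarity; the function names, the 0.8 threshold, and the sample strings are all assumptions for illustration, not anything from the author's pipeline. It compares every pair, so it would be slow on a large dataset.

```python
import re

def shingles(text: str, n: int = 5) -> set:
    """Return the set of word n-grams after lowercasing and collapsing whitespace."""
    words = re.sub(r"\s+", " ", text.lower()).strip().split()
    if not words:
        return set()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def find_overlaps(train_texts, test_texts, threshold=0.8):
    """Return (train_index, test_index, score) for pairs at or above the threshold."""
    train_sets = [shingles(t) for t in train_texts]
    hits = []
    for j, text in enumerate(test_texts):
        test_set = shingles(text)
        for i, train_set in enumerate(train_sets):
            score = jaccard(train_set, test_set)
            if score >= threshold:
                hits.append((i, j, score))
    return hits

# Made-up sample strings; the first test text differs from the train text only in spacing.
train = ["Afghan and coalition forces detained several insurgents in Baghlan province yesterday."]
test = [
    "Afghan and  coalition forces detained several insurgents in Baghlan province yesterday.",
    "A completely unrelated report about road construction in another region this week.",
]
hits = find_overlaps(train, test)
print(hits)
```

Exact or near-exact matches are only the easy case. Two differently worded articles about the same event would not be caught by this and would need something like matching on extracted dates and locations.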



