from markitdown import MarkItDown  # pip install markitdown

md = MarkItDown()
result = md.convert('leak_cheat.pdf')
print(result.text_content)
The result looks like this (whitespace in the head of the paper is preserved, but all whitespace in the body is removed):
67
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics
Volume 1: Long Papers, pages 67–93
March 17-22, 2024 c(cid:13)2024 Association for Computational Linguistics
Leak,Cheat,Repeat:DataContaminationandEvaluationMalpracticesinClosed-SourceLLMsSimoneBalloccuPatríciaSchmidtováMateuszLangoOndˇrejDušekCharlesUniversity,FacultyofMathematicsandPhysicsInstituteofFormalandAppliedLinguisticsPrague,CzechRepublic{balloccu,schmidtova,lango,odusek}@ufal.mff.cuni.czAbstractNaturalLanguageProcessing(NLP)researchisincreasinglyfocusingontheuseofLargeLanguageModels(LLMs),withsomeofthemostpopularonesbeingeitherfullyorpartiallyclosed-source.Thelackofaccesstomodeldetails,especiallyregardingtrainingdata,hasrepeatedlyraisedconcernsaboutdatacontam-inationamongresearchers.Severalattemptshavebeenmadetoaddressthisissue,buttheyarelimitedtoanecdotalevidenceandtrialanderror.Additionally,theyoverlooktheprob-lemofindirectdataleaking,wheremodelsareiterativelyimprovedbyusingdatacom-ingfromusers.Inthiswork,weconductthefirstsystematicanalysisofworkusingOpe-nAI’sGPT-3.5andGPT-4,themostpromi-nentlyusedLLMstoday,inthecontextofdatacontamination.Byanalysing255papersandconsideringOpenAI’sdatausagepolicy,weex-tensivelydocumenttheamountofdataleakedtothesemodelsduringthefirstyearafterthemodel’srelease.Wereportthatthesemodelshavebeengloballyexposedto∼4.7Msamplesfrom263benchmarks.Atthesametime,wedocumentanumberofevaluationmalpracticesemerginginthereviewedpapers,suchasun-fairormissingbaselinecomparisonsandrepro-ducibilityissues.Wereleaseourresultsasacol-laborativeprojectonhttps://leak-llm.github.io/,whereotherresearcherscancontributetoourefforts.1IntroductionTherecentemergenceoflargelanguagemodels(LLMs),thatshowremarkableperformanceonawiderangeoftasks,haslednotonlytoadramaticincreaseintheiruseinresearchbutalsotoagrow-ingnumberofcompaniesjoiningtheraceforthebiggestandmostpowerfulmodels.Inpursuingacompetitiveadvantage,manypopularLLMsto-dayarelockedbehindAPIaccessandtheirde-tailsareunknown(OpenAI,2023;Thoppilanetal.,2022;Touvronetal.,2023).Thisincludesmodelweights(OpenAI,2023),trainingdata(Piktusetal.,2023),orinfrastructuraldetailstoassessmodelcar-bonfootprint(Lacosteetal.,201
9).Inparticular,thelackofinformationontrainingdataraisesimportantquestionsaboutthecredibilityofLLMsperformanceevaluation.Thedatafromwhichthesemodelslearn,typicallycollectedau-tomaticallybyscrapingdocumentsfromtheweb,maycontaintraining,validation,and–mostcrit-ically–testsetscomingfromNLPbenchmarks.Becauseofthis,researchersandstakeholdersmaylaterinadvertentlyevaluateLLMsonthesamedatatheyweretrainedon.Thisphenomenon,knownasdatacontamination,maynotbeanissueinthegeneraluseofcommercialLLMs,whereadherencetoresearchprinciplesisnotmandatory,butitbe-comesaseriousproblemwhenthesemodelsarewidelyusedandevaluatedinresearch.Unfortunately,manyproprietarymodelsarelockedbehindinference-onlyAPIs,makingithardtoinspectdatacontamination.Becauseofthis,ex-istingworkonthemattermostlyfocusesondetect-ingextremeformsofoverfittingandmemorization,suchasthemodel’sabilitytogeneratebenchmarksverbatim.TheseapproachesarenotonlylimitedbutalsoneglectthatrecentproprietaryLLMsgetiterativelyimprovedfromuserinteractions.Ifsuchinteractionsinvolvebenchmarkdata(forexamplewhenresearchersevaluateLLMsagainstbaselines),themodelmay,infact,becomecontaminatedevenifitwascontamination-freeduringitsinitialtrain-ing.Werefertothisphenomenonasindirectdataleaking.Inthispaper,weaddresstheissueofindirectdatacontaminationinclosed-source1LLMsbycon-ductingasystematicliteraturereview.Wereview255papersandcarefullydetaildataleakageemerg-ingfromthem.Wefocusprimarilyonthemodels1Inthispaperweusetheterms“proprietary”and“closed-source”interchangeablytorefertothesemodels.68
domaintextenrichedbytextualinstructionsleadstoanincreaseinthemodelperformanceevenifgoldlabelsarenotshowntothemodel.ThissetupperfectlymatchesthekindofdatashowntochatLLMswhenevaluatedbyresearchers.Thismeansthatclosed-sourceLLMssuchasGPT-3.5andGPT-4canmakeuseofthesegoldstandardexamplesfromwidelyusedNLPbenchmarkstogainanunfairadvantageoverothermodels.Wealsopointoutthatrecentwork(Aiyappaetal.,2023)showedthataftermodelupdates,Chat-GPTperformanceimprovedonbenchmarkstowhichitwaspreviouslyexposed(Zhangetal.,2022).Withthesemotivations,weconductasystematicreviewtoquantifyhowmuchofsuchdatathemodelspoweringChatGPTcouldhaveobtained.4MethodologyFollowingthestandardsystematicreviewproto-colfromthemedicaldomain(Khanetal.,2003),weanalysetheexistingworkonLLMsevaluationtoinspecttheissueofindirectdatacontaminationandotherevaluationmalpractices.WefocusonOpenAI’sGPT-3.5andGPT-4models,astheyarethemostprominentlyusedinrecentNLPresearch.Weorganizeourworkintofivemacro-steps,corre-spondingtothefollowingsubsections.4.1FramingquestionsInreviewingtheexistingworkevaluatingtheper-formaceofGPT-3.5andGPT-4,weposethefol-lowingresearchquestions:(1)WhichdatasetshavebeendemonstrablyleakedtoGPT-3.5andGPT-4duringthelastyear?70
...
Other PDFs that I've tested work fine.
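Since other PDFs convert fine, one way to flag affected files programmatically is to check the ratio of spaces in the converted text. This is a minimal sketch; the helper name and the 5% threshold are my own choices, not part of markitdown:

```python
# Heuristic check for the whitespace-stripping issue: natural-language text
# should contain a reasonable fraction of spaces; converted output with
# almost none (like the body above) is likely affected by this bug.

def looks_whitespace_stripped(text: str, min_space_ratio: float = 0.05) -> bool:
    """Return True if `text` has suspiciously few spaces for prose."""
    # Count only spaces and non-whitespace characters (ignore newlines/tabs).
    visible = [c for c in text if c == " " or not c.isspace()]
    if not visible:
        return False
    spaces = sum(1 for c in visible if c == " ")
    return spaces / len(visible) < min_space_ratio
```

Running this on `result.text_content` for the PDF above returns True, while the PDFs that convert correctly come back False.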
For a certain PDF among my test files, markitdown removes all whitespace during conversion. The PDF can be found here: https://aclanthology.org/2024.eacl-long.5.pdf

I run the example code above in a Jupyter notebook (Python 3.12.8).