These fields are all optional and need only
be supplied if you would like a direct reply.
Subject
Your email address
Your real name
You must answer this!
If you don't, my spam filtering will
ensure that I never see your email.
What's 8 plus five (in digits only)?
Please make your changes here and then
Editing tips and layout rules.
File: GrepTimingAnomaly ''' <link rel="alternate" type="application/rss+xml" ''' href="/rss.xml" title="RSS Feed"> ********> width="25%" |>> ''' <a title="Subscribe to my feed" ''' rel="alternate" ''' href="https://www.solipsys.co.uk/rss.xml"> ''' <img style="border-width: 0px;" ''' src="https://www.feedburner.com/fb/images/pub/feed-icon32x32.png" ''' align="middle" ''' alt="" />Subscribe!</a> _ ''' <a href="https://twitter.com/ColinTheMathmo"> ''' <img src="https://www.solipsys.co.uk/new/images/TwitterButton.png" ''' title="By: TwitterButtons.net" ''' width="212" height="69" ''' alt="@ColinTheMathmo" ''' /></a> <<| ---- My latest posts can be found here: * ColinsBlog ---- Previous blog posts: * TheRadiusOfTheEarth * ThisWorksToCureMyHiccoughs * PerhapsWeSavedOne * ThinkingAboutMastodon * DisappearingTrainsOnVirgin * TheIndependenceGame * OneOfMyFavouritePuzzles * ThinkingAboutRecursion * MemorisingTheTube * SpikeySpheres * SurprisinglyQuick * AnUnexpectedFraction * YouHaveToAdmireTheirOptimism * RepresentativesMatter * PythagorasByIncircle * APuzzleAboutPuzzles * HowNotToDoTwitter * Calculating52FactorialByHand * SmallThingsMightNotBeSoSmall * NotIfYouHurry * FactoringViaGraphThreeColouring * AnotherProofOfTheDoodleTheorem * WhenObviousIsNotObvious * GraphThreeColouring * TheDoodleTheorem * BeCarefulWhatYouSay * TheMutilatedChessboardRevisited * AMirrorCopied * TheOtherOtherRopeAroundTheEarth * PhotocopyAMirror * ThePointOfTheBanachTarskiTheorem * SieveOfEratosthenesInPython * FastPerrinTest * RussianPeasantMultiplication * FindingPerrinPseudoPrimes_Part2 * FindingPerrinPseudoPrimes_Part1 * TheUnwiseUpdate * MilesPerGallon * TrackingAnItemOnHackerNews * HackerNewsUserAges * PokingTheDustyCorners * ThereIsNoTimeForThis * PublicallySharingLinks * LearningTimesTables * GracefulDegradation * DiagrammingMathsTopics * OnTheRack * SquareRootByLongDivision * BeyondTheBoundary * FillInTheGaps * SoftwareChecklist * NASASpaceCrews * TheBirthdayParadox * TheTrapeziumConundrum * RevisitingTheAnt * TheAntAndTheRubberBand * IrrationalsExist * MultipleChoiceProbabilityPuzzle * RandomEratosthenes * WrappingUpSquareDissection * DissectingASquarePart2 * DissectingACircle * DissectingASquare * AnOddityInTennis * DecisionTreeForTennis * DecisionTreesInGames * AMatterOfConvention * DoYouNourishOrTarnish * BinarySearchReconsidered * TwoEqualsFour * TheLostPropertyOffice * TheForgivingUserInterface * SettingUpRSS * WithdrawingFromHackerNews ---- Additionally, some earlier writings: * RandomWritings. * ColinsBlog2010 * ColinsBlog2009 * ColinsBlog2008 * ColinsBlog2007 * ColinsBlogBefore2007 ******** !! 2017/07/28 - Timing Anomaly in "grep" Some considerable time ago - long enough that probably the details are no longer relevant, or accurate - I got seriously fed up with the spam filtering on my email service. There were so many false positives that I had to wade through the spam bin every day, pulling out obviously good emails that needed attention. It got so that it cost me more time to have it running. So I disabled it, and implemented my own based on Paul Graham's article: ''' <a href="http://www.paulgraham.com/spam.html">A Plan For Spam</a> No doubt there are better schemes commercially available now, but it was phenomenally successful, and still runs perfectly. I have about two false positives a month, and it saves me wading through the scum and dross that I get, having been visible on the interwebs now for decades. There are many tweaks and enhancements, and I run a "White List" as well. People with whom I've interactions get automatically put on the white list as a matter of course, and that helps to seed the data nicely. But recently I noticed something interesting - the whitelist processing was taking a *huge* amount of time. Stupidly large. So I ran some timing tests. These are quick tests done on real data. First, let's just run the whitelist check: {{{ $ time grep -irlf whitelist good_confirmed | wc 47 47 1645 real 2m23.705s user 2m16.488s sys 0m0.240s }}} That's taking over 2 minutes to find the whitelisted messages. Well, OK, maybe it has a lot to do. But then I split the whitelist file into 10 separate files, each with a tenth of the entries: {{{ $ time ( for f in white_* ; do grep -irlf $f good_confirmed ; done ) | wc 55 55 1925 real 0m7.054s user 0m6.576s sys 0m0.072s }}} [[[>50 The different number of hits is because some files contain hits in more than one whitelist, so that's to be expected. ]]] That's a *huge* drop in processing time, from over two minutes to under 8 seconds. There's go to be something bizarre about this. Just to run a quick check: {{{ time grep -irlf <( sort white_* ) good_confirmed | wc 47 47 1645 real 2m19.213s user 2m11.036s sys 0m0.324s }}} OK, using exactly the same files it takes 15 times longer to process them as a single file than it does to run 10 separate times. That's got to be a problem somewhere. One day I'd love to chase this down, but for now I'll leave this here so I can come back to it when I get time. If you have any thoughts or comments, I'd love to hear them. ---- Adam Atkinson has been searching about to see if this is known and has found these: * http://savannah.gnu.org/bugs/?16305 * https://fam.tuwien.ac.at/~schamane/_/blog:080720_expensive_grep * https://fam.tuwien.ac.at/~schamane/_/blog:080727_grep_with_large_pattern_files * https://stackoverflow.com/questions/9066609/fastest-possible-grep * https://stackoverflow.com/questions/12629749/how-does-grep-run-so-fast * https://stackoverflow.com/questions/13913014/grepping-a-huge-file-80gb-any-way-to-speed-it-up * http://www.linuxquestions.org/questions/programming-9/grep-for-huge-files-826030/page2.html Also, a commenter found this: * https://blog.codinghorror.com/regex-performance/ Now to investigate properly ... ---- |>> | |>> <<<< Prev <<<< ---- TheRadiusOfTheEarth <<| | : | |>> >>>> Next >>>> ---- RadiusOfTheEarthPartTwo ... <<| | ---- ********> ''' <a href="https://mathstodon.xyz/@ColinTheMathmo"> ''' <img src="https://www.solipsys.co.uk/images/Mastodon_Mascot.png" ''' width="256" height="280" ''' alt="https://mathstodon.xyz/@ColinTheMathmo" ''' /></a> ******** ''' <a href="https://mathstodon.xyz/@ColinTheMathmo/">You can follow me on Mathstodon.</a> _ _ _ _ [[[> ''' <a href="https://twitter.com/ColinTheMathmo">Of course, you can also<br>follow me on twitter:</a> ''' <a href="https://twitter.com/ColinTheMathmo"> ''' <img src="https://www.solipsys.co.uk/new/images/TwitterButton.png" ''' title="By: TwitterButtons.net" ''' width="212" height="69" ''' alt="@ColinTheMathmo" ''' /></a> ''' <img src="/cgi-bin/CountHits.py?GrepTimingAnomaly" alt="" /> ]]] ********< ---- !! Send us a comment ... ''' <form action="https://www.solipsys.co.uk/cgi-bin/FormMail.pl" method=post> ''' <input type=hidden name="recipient" value="colinsblogcomment@solipsys.co.uk" > ''' <input type=hidden name="subject" value="Blog comment : GrepTimingAnomaly" > ''' <input type=hidden name="redirect" value="https://www.solipsys.co.uk/new/ThankYouForYourComment.html" > ''' <input type=hidden name="missing_fields_redirect" value="https://www.solipsys.co.uk/RequestError.html"> ''' <input type=hidden name="env_report" value="REMOTE_HOST, REMOTE_ADDR, HTTP_USER_AGENT" > ''' <input type=hidden name="print_blank_fields" value="1" > ********> width="47%" You can send us a message here. It doesn't get published, it just sends us an email, and is an easy way to ask any questions, or make any comments, without having to send a separate email. So just fill in the boxes and then ''' <font size="+4"><INPUT TYPE="submit" VALUE="CLICK HERE TO SEND"></font> ******** width="53%" ********< ''' <table cellpadding="5"> ''' <tr> ''' <td valign="top">Your name </td> <td valign="top">:</td> ''' <td> <input type=text name="realname" size="48"> </td> ''' <tr> ''' <td valign="top">Email </td> <td valign="top">:</td> ''' <td> <input type=text name="email" size="48"> </td> ''' </tr> ''' <tr> ''' <td valign="top">Message </td> <td valign="top">:</td> ''' <td> <TEXTAREA NAME="Message" ROWS=10 COLS=64></TEXTAREA> </td> ''' </tr> ''' </table> ''' <center> ''' <font size="+4"> ''' <INPUT TYPE="submit" VALUE="CLICK HERE TO SEND"> ''' </font> ''' </center> ''' </form> ********<