The Applied Games Group Blog

New stuff directly from Microsoft Research.
Dealing with Terabytes of Data in F#

In one of our current projects, our algorithms have to process close to 1 TB (terabyte) of raw (ASCII) logs. Fortunately, the only analysis we need is a single pass over the data that collects a small number of statistics per log line (for example, counting the number of log lines that pass a certain criterion).

With a dataset of this size, reading everything into memory is out of the question; the data has to be processed line by line. The central .NET/F# data structure we use is IEnumerable, a memory-efficient and lazy way of enumerating collections of any type. Here is a short piece of F# code that provides an IEnumerable over all log lines (using the new generate_using function that Don put into the standard library after my posting):

#light

open System.IO
open System.Collections.Generic

/// Creates an IEnumerable through the lines of any text file.
/// The function does not check whether the file exists!
let CreateDataStream (fileName:string) =
    IEnumerable.generate_using
        ( fun () -> new StreamReader (fileName) )
        ( fun reader -> if (reader.EndOfStream) then None else Some (reader.ReadLine()) )
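For readers on a current F# toolchain, where the generate_using function used above is no longer part of the core library, roughly the same lazy line stream can be sketched with a sequence expression. This is only an illustration under that assumption, not the code we ran; the name createDataStream is simply carried over from the function above.

```fsharp
open System.IO

/// Lazily yields the lines of a text file, one at a time.
/// The reader is opened when enumeration starts and disposed when
/// enumeration ends. Like the original, this does not check whether
/// the file exists.
let createDataStream (fileName: string) : seq<string> =
    seq {
        use reader = new StreamReader(fileName)
        while not reader.EndOfStream do
            yield reader.ReadLine()
    }
```

Only the line currently being processed is held in memory, so the size of the file does not matter.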

However, during development one often wants to run and test the code without waiting for hours until the full terabyte is processed, only to find an off-by-one error in the counting. Of course, one could write a little helper tool that takes only the first, say, 10 megabytes of the full data file and process this much smaller file during development. However, this is inelegant and leads to a lot of replication of the same data on the file system. A much better way is to use this short function, truncate:

module IEnumerable = begin

    /// Truncates a given IEnumerable after at most n elements.
    let truncate n (x: #IEnumerable<'a>) =
      IEnumerable.generate
          ( fun () -> ref 0,x.GetEnumerator() )
          ( fun (i,ie) -> if !i >= n or not (ie.MoveNext()) then None else (incr i; Some(ie.Current)) )
          ( fun (_,ie) -> ie.Dispose () )

end
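The same idea can be sketched in current F# with a sequence expression. This is a hedged equivalent rather than the library code: it keeps the counter in a ref cell, as the version above does, and stops pulling from the source once n elements have been produced. Today's core library also ships Seq.truncate, which should make a hand-written version unnecessary in practice.

```fsharp
/// Yields at most n elements of the source sequence.
/// The source enumerator is advanced only while fewer than n elements
/// have been produced, and it is disposed when enumeration ends.
let truncate (n: int) (source: seq<'a>) : seq<'a> =
    seq {
        use e = source.GetEnumerator()
        let i = ref 0
        while i.Value < n && e.MoveNext() do
            yield e.Current
            i.Value <- i.Value + 1
    }
```

For example, truncate 4 [0 .. 9] |> Seq.toList gives [0; 1; 2; 3], and a source shorter than n is passed through unchanged.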

The nice thing about this truncation is that it has practically no computational overhead (other than testing and incrementing an integer) and needs no temporary memory. Here is a short piece of test code for this function:

/// Test the truncate.
do [| 0;1;2;3;4;5;6;7;8;9 |] |> IEnumerable.truncate 4 |> IEnumerable.iter (printf "i = %d\n")
do read_line () |> ignore
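To tie the pieces together, here is a minimal sketch of the kind of single-pass statistic mentioned at the start (counting the log lines that pass a criterion), restricted to the first few lines of a file for development runs. It assumes a current .NET runtime and uses File.ReadLines and Seq.truncate from the standard libraries; the function name countMatching and its parameters are made up for illustration.

```fsharp
open System.IO

/// Counts how many of the first maxLines lines of a file satisfy the
/// predicate. Lines are read lazily, so only one line is in memory at a
/// time; a 64-bit counter is used so that very large files cannot
/// overflow the count.
let countMatching (maxLines: int) (predicate: string -> bool) (fileName: string) : int64 =
    File.ReadLines fileName
    |> Seq.truncate maxLines
    |> Seq.fold (fun count line -> if predicate line then count + 1L else count) 0L
```

Dropping the Seq.truncate step turns the development run into the full pass over the data without touching the rest of the pipeline.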

Ralf Herbrich

P.S.: Thanks to Don Syme and James Margetson for helping us with the truncate function!!!

Posted: Saturday, November 04, 2006 12:04 PM by apg

Comments

Heart of Sharpness (The MSR F# Team's blog at The Hub) said:

Cross posted from http://blogs.msdn.com/dsyme/ Ralf, Phil and Thore in the MSR Cambridge Applied...

# November 22, 2006 5:06 AM

Don Syme's WebLog on F# and Other Research Projects said:

Ralf, Phil and Thore in the MSR Cambridge Applied Games Group have been continuing their work using F#

# December 18, 2006 7:48 PM
