Reading data in Python when processing big data
Reference 1: Reading GB-scale text data in Python without a MemoryError
When we talk about "text processing", we usually mean the content being processed. Python makes it very easy to read the contents of a text file into a string variable that you can manipulate. File objects provide three "read" methods: .read(), .readline() and .readlines(). Each method accepts an optional argument that limits how much data is read per call, but they are usually called without one. .read() reads the entire file at once and is typically used to put a file's contents into a single string. However, .read() produces the most direct string representation of the file, which is unnecessary for continuous line-oriented processing and impossible when the file is larger than the available memory. Here is an example of the read() method:
f = open('/path/to/file', 'r')
data = f.read()
Calling read() reads the whole file at once; if the file is 10 GB, memory will be exhausted. To be safe, call read(size) repeatedly, which reads at most size characters per call (bytes, if the file is opened in binary mode). Alternatively, readline() reads one line at a time, and readlines() reads everything at once and returns a list of lines. Choose the call according to your needs.
If the file is small, a single read() is the most convenient; if you cannot be sure of the file size, calling read(size) repeatedly is safer; if it is a configuration file, readlines() is the most convenient:
for line in f.readlines():
    process(line)  # <do something with line>
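To make the difference between the three calls concrete, here is a minimal sketch. The sample file and its contents are made up for illustration; only read(size), readline() and readlines() come from the text above.

```python
import os
import tempfile

# Create a small throwaway file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("alpha\nbeta\ngamma\n")
    sample_path = tmp.name

with open(sample_path, "r") as f:
    first_chars = f.read(3)      # at most 3 characters
    rest_of_line = f.readline()  # the remainder of the current line
    remaining = f.readlines()    # every remaining line, as a list

os.remove(sample_path)
print(repr(first_chars), repr(rest_of_line), remaining)
```

Each call continues from where the previous one stopped, which is why read(size) can be called in a loop to walk through a file of any size.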
Read In Chunks
An obvious way to handle a large file is to split it into smaller pieces and release the memory of each piece once it has been processed. This approach uses an iterator built with yield:
def read_in_chunks(filePath, chunk_size=1024*1024):
    """
    Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1M
    You can set your own chunk size
    """
    with open(filePath) as file_object:
        while True:
            chunk_data = file_object.read(chunk_size)
            if not chunk_data:
                break
            yield chunk_data

if __name__ == "__main__":
    filePath = './path/filename'
    for chunk in read_in_chunks(filePath):
        process(chunk)  # <do something with chunk>
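The same chunked pattern can also be written with the two-argument form of the built-in iter(), which keeps calling a function until it returns a sentinel value. The sketch below is one possible illustration, not the article's own code: the function name file_sha256, the hashing task and the demo file are assumptions chosen to give the chunks something to do.

```python
import hashlib
import os
import tempfile
from functools import partial

def file_sha256(path, chunk_size=1024 * 1024):
    """Hash a file chunk by chunk so only one chunk is in memory at a time."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter(callable, sentinel) keeps calling f.read(chunk_size)
        # until it returns b"" (end of file).
        for chunk in iter(partial(f.read, chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Throwaway demo file (hypothetical data).
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"x" * 5000)
    demo_path = tmp.name

chunked = file_sha256(demo_path, chunk_size=1024)
os.remove(demo_path)
print(chunked)
```

Opening the file in binary mode makes chunk_size a byte count, which is usually what you want when the goal is to bound memory use.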
Using with open()
The with statement opens and closes the file for you, even when the inner block raises an exception. The loop for line in f treats the file object f as an iterator, which automatically uses buffered I/O and memory management, so you do not have to worry about large files.
# If the file is line based
with open(...) as f:
    for line in f:
        process(line)  # <do something with line>
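The claim that with closes the file even when the block raises can be checked with a small sketch. The process function below is a hypothetical handler that fails on purpose, and the demo file is made up for illustration.

```python
import os
import tempfile

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("ok\nbad\nok\n")
    demo_path = tmp.name

def process(line):
    # Hypothetical handler that fails on one particular line.
    if line.strip() == "bad":
        raise ValueError("cannot process line")

handle = None
try:
    with open(demo_path) as f:
        handle = f  # keep a reference so we can inspect it afterwards
        for line in f:
            process(line)
except ValueError:
    pass

closed_after_error = handle.closed
os.remove(demo_path)
print(closed_after_error)
```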
When reading large files with Python, leave the work to the system: use the simplest approach, hand the buffering over to the interpreter, and concentrate on your own processing.
Reference 2: Some experience processing 20 million records with Python
First open the target file and write the column names. Then open the original file and read it line by line. For each line, check whether it is "dirty" data; if it is not, transform it according to the requirements in the table above and write it to the target file. This keeps memory usage low, so the computer does not freeze. Finally, open the pre-processed file with pandas and use DataFrame methods to remove duplicates. With this approach, 20 million records are processed in five minutes. The source code follows:
# Note: the column indexes (row[n]) did not survive in this listing,
# so each bare "row" below stands for one field of the current row.
import csv

rows = []
with open(r'C:\Users\Hanju\Desktop\uploadPortal(5).csv', "w", newline='') as _csvfile:
    writer = csv.writer(_csvfile)
    # Write first columns_name
    i = 0
    with open(r'D:\UploadPortalData\uploadPortal (5).csv', encoding='UTF-8') as csvfile:
        readCSV = csv.reader(csvfile, delimiter=',')
        for row in readCSV:
            if len(row) != 8:
                continue  # skip rows without exactly 8 fields
            row1 = []
            i += 1
            row1.append(row.replace(':', '')[-5:])
            if row == 'auth': row1.append('1')
            elif row == 'deauth': row1.append('2')
            elif row == 'portal': row1.append('3')
            elif row == 'portalauth':
                row1.append(str(row.replace(':', '')[0:6]))
            if row == row: row1.append('2')
            else: row1.append('5')
            if 'City-WiFi-5G' in row: row1.append('2')
            elif 'City-WiFi' in row: row1.append('1')
            else: row1.append(row)
            writer.writerow(row1)
print('Done')
print(i)

import pandas as pd
df = pd.read_csv(r'C:\Users\Hanju\Desktop\uploadPortal(5).csv')
#print(df.head())
#print(df.tail())
print(df.shape)
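The row-by-row cleaning step described above can be sketched with the standard library alone. This is a simplified illustration, not the author's script: the function name clean_csv, the generated column names and the demo rows are assumptions, and the only rule carried over from the source is that rows without exactly 8 fields are skipped.

```python
import csv
import os
import shutil
import tempfile

EXPECTED_FIELDS = 8  # the source skips rows that do not have 8 fields

def clean_csv(src_path, dst_path, header):
    """Copy well-formed rows from src_path to dst_path, one row at a time."""
    kept = 0
    with open(dst_path, "w", newline="", encoding="utf-8") as dst, \
         open(src_path, newline="", encoding="utf-8") as src:
        writer = csv.writer(dst)
        writer.writerow(header)
        for row in csv.reader(src):
            if len(row) != EXPECTED_FIELDS:
                continue  # "dirty" row: drop it
            writer.writerow(row)
            kept += 1
    return kept

# Tiny demo input (hypothetical data).
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "raw.csv")
dst_path = os.path.join(workdir, "clean.csv")
with open(src_path, "w", newline="", encoding="utf-8") as f:
    f.write("a,b,c,d,e,f,g,h\n")
    f.write("broken,row\n")
    f.write("1,2,3,4,5,6,7,8\n")

header = ["c%d" % n for n in range(1, 9)]
kept_rows = clean_csv(src_path, dst_path, header)
with open(dst_path, newline="", encoding="utf-8") as f:
    cleaned = list(csv.reader(f))
shutil.rmtree(workdir)
print(kept_rows, len(cleaned))
```

Because only one row is held in memory at a time, the size of the input file does not matter for this step. The cleaned file could then be loaded with pandas.read_csv and deduplicated with DataFrame.drop_duplicates(), as the text describes.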
Reference 3:
1) Use SSCursor (a streaming cursor) so that the client does not consume a large amount of memory. (This cursor does not cache the result set on the client: instead of reading everything into memory, it fetches records from the server and returns them to you one by one.)
2) Use an iterator instead of fetchall, which saves memory and returns data quickly.
import MySQLdb.cursors

conn = MySQLdb.connect(host='ip address', user='user name',
                       passwd='password', db='database name', port=3306,
                       charset='utf8', cursorclass=MySQLdb.cursors.SSCursor)
cur = conn.cursor()
cur.execute("SELECT * FROM bigtable")
row = cur.fetchone()
while row is not None:
    # do something
    row = cur.fetchone()
cur.close()
conn.close()
Note the following:
1. Because SSCursor is an unbuffered cursor, the connection cannot execute any other SQL until the result set has been read completely; creating another cursor on it will not work either. If you need to do something else in the meantime, open a separate connection object.
2. The processing done after each read must be fast and must not take longer than 60 s, otherwise MySQL drops the connection. You can also run SET NET_WRITE_TIMEOUT = xx to lengthen this timeout.
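Point 2 of the list above (iterate instead of calling fetchall) can be illustrated with a runnable sketch. It uses sqlite3 purely as a stand-in so that it runs without a MySQL server; the table contents are made up. MySQLdb cursors can be looped over in the same way (for row in cur), but treat that as an assumption to verify against the driver version you use.

```python
import sqlite3

# sqlite3 stands in for MySQL here only so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bigtable (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO bigtable VALUES (?, ?)",
    [(n, "row%d" % n) for n in range(1000)],
)

cur = conn.cursor()
cur.execute("SELECT id, name FROM bigtable")

total = 0
for row in cur:        # rows are fetched as the loop advances
    total += row[0]    # placeholder for real per-row work

cur.close()
conn.close()
print(total)
```

Compared with rows = cur.fetchall(), the loop never builds a list of the whole result set, so memory use stays flat as the table grows.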