#### [SOLVED] Estimating the number of tanks based on a sample of serial numbers 2.0

There was some confusion with the code I had posted in the previous version of this question, and @Oscar Smith gave some good advice.

The explanation is the same; please suggest speed improvements for this new version:

``````
#!/usr/bin/python

import numpy as np
import time

chosenNum = 429
numRuns = 10000
numCapturedTanks = 7
numGuessN = []
guesses = []
percentErrors = []
STDarray = []
start_time = time.time()
STDtimes = []

def getAverageStdTime(timetaken):  # gets the average time it took to calculate standard deviations
    STDtimes.append(timetaken)
    if len(STDtimes) == numRuns:
        print("Average List of Standard Deviations Generation Time: " + str(round(np.mean(STDtimes), 2)) + " seconds")

def createListOfStandardDeviations(start, end):
    for y in range(start, int(end)):
        tankSerialNumbersSimulated = np.random.randint(1, y + 1, size=numCapturedTanks)  # from Oscar Smith
        simulatedSTD = np.std(tankSerialNumbersSimulated)
        STDarray.append(simulatedSTD)

def getAllGuesses():
    print("Your guesses are: " + str(guesses))

def getAvgPercentError():
    numCorrect = 0
    closestNumber = 0
    for x in range(len(guesses) - 1):
        percentError = '%.2f' % round((np.abs(guesses[x] - chosenNum)) / float(chosenNum) * 100, 2)
        percentErrors.append(float(percentError))
        if guesses[x] == chosenNum:
            numCorrect = numCorrect + 1
        else:
            closestNumber = min(guesses, key=lambda x: abs(x - chosenNum))
    averagePercentError = np.mean(percentErrors)
    print("The average Percent Error is: " + str(round(averagePercentError, 2)) + "%")
    getAccuracy(numCorrect, closestNumber)

def getAccuracy(amountCorrect, closestNumberToActual):
    if amountCorrect > 0:
        print("You got the correct number " + str(amountCorrect) + " out of " + str(len(guesses)) + " times.")
    else:
        print("Your closest number was: " + str(closestNumberToActual))
    getmode(guesses)

def getmode(inplist):
    dictofcounts = {}
    listofcounts = []
    for i in inplist:
        countofi = inplist.count(i)  # count occurrences of each item in the list
        listofcounts.append(countofi)  # add counts to list
        dictofcounts[i] = countofi  # add counts and item in dict to get later
    maxcount = max(listofcounts)  # get max count of items
    if maxcount == 1:
        print("There is no mode for this dataset, values occur only once")
    else:
        modelist = []  # if more than one mode, add to list to print out
        for key, item in dictofcounts.items():
            if item == maxcount:  # get item from original list with most counts
                modelist.append(str(key))
        print("Most guessed number(s):", ' and '.join(modelist))
        return modelist

def getNumGuessed(givenSTD, maxNumber):
    minStd = min(STDarray, key=lambda x: abs(x - givenSTD))  # finds closest standard deviation to the given standard deviation
    for (z, this_std) in enumerate(STDarray):
        if minStd == this_std:  # find closest number to original standard deviation
            numGuessed = z + maxNumber
    return numGuessed

def main():
    print("reached main")
    for runsRan in range(numRuns):
        tankSerialNumbers = np.random.randint(1, chosenNum + 1, size=numCapturedTanks)  # from Oscar Smith
        NumSTD = np.std(tankSerialNumbers)
        highestTankSerial = np.mean(tankSerialNumbers) + 3 * NumSTD
        maxNum = np.amax(tankSerialNumbers)
        print("Tank Serial Numbers Generated")
        print("Standard Deviation and Range Calculated")
        ListOfStandardDeviationsStartTime = time.time()
        for _ in range(100):
            del STDarray[:]
            if maxNum - highestTankSerial < 0:
                createListOfStandardDeviations(maxNum, highestTankSerial)
            else:
                createListOfStandardDeviations(highestTankSerial, maxNum)
            numGuessN.append(getNumGuessed(NumSTD, maxNum))
        print("Initial List of Standard Deviations Generated")
        print("List of Standard Deviations Generation took " + str(round(time.time() - ListOfStandardDeviationsStartTime, 2)) + " seconds")

        guess = int(np.mean(numGuessN))
        print("Guess Generated " + str(runsRan + 1))
        getAverageStdTime(float(time.time() - ListOfStandardDeviationsStartTime))
        guesses.append(guess)
    getAllGuesses()
    getAvgPercentError()

main()
print("My program took " + str(round((time.time() - start_time) / float(60), 2)) + " minutes to run")
``````

Currently, the runtime is approximately 7.26 minutes for 1,000 runs. I want to get it to run 10,000 times, and at this rate it would take too long.

If anyone is confused by what the purpose is or any part of the code, please mention specifically what is confusing and I'll explain it.

To clarify: We are given 7 serial numbers based on an unknown value n (this number is used only to generate the 7 random serial numbers and to check how good our process of determining n is).

My process first finds the standard deviation of the given serial numbers. I then compute a limit that n definitely cannot exceed (three standard deviations above the mean) and take the maximum of the given list as a value it cannot be below. For each candidate n in that range, I simulate the random serial numbers that would be generated, take the standard deviation of each simulation, find the one closest to the standard deviation of the given serial numbers, and store the corresponding guessed n. I do this x times (the more the better; I used 100) to get x guessed n's. I take the mean of those guesses as my final guess, and then find the percent error of my guess against the actual number. #### @mochi 2017-12-11 02:15:22

You could vectorize your entire process in numpy.

From your previous post it looks like, given a sample of n tanks, you want to estimate the true total number of tanks, N. The method is:

1. Generate m samples of n tanks
2. Calculate the std of these m samples
3. Find the sample with the closest std to your observed n tanks and use that as a guess to the true max, N
4. Repeat a certain number of times to improve accuracy

So first, let's set some constants:

``````
import numpy as np

true_n = 429 # Can set whatever constant here
num_tanks = 7
num_guesses = 10000
num_samples = 100
``````

Then a function for generating data:

``````
def simulate_data(lower, upper, size):
    return np.array([np.random.randint(1, x + 1, (num_tanks, size)) for x in range(lower, upper)])
``````
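
As a quick sanity check (with small hypothetical bounds of my own choosing), the returned array stacks one `(num_tanks, size)` block per candidate maximum, giving shape `(upper - lower, num_tanks, size)`:

```python
import numpy as np

num_tanks = 7  # matches the constant set above

def simulate_data(lower, upper, size):
    return np.array([np.random.randint(1, x + 1, (num_tanks, size)) for x in range(lower, upper)])

sim = simulate_data(10, 15, 100)
print(sim.shape)  # (5, 7, 100): 5 candidate maxima, 7 tanks, 100 samples each
```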

Now we can generate our observed sample of tanks and calculate the relevant statistics:

``````
obs = np.random.randint(1, true_n + 1, num_tanks)  # one observed sample of 7 serial numbers
obs_mean = np.mean(obs)
obs_std = np.std(obs)
obs_max = np.max(obs)
upper = int(obs_mean + 3 * obs_std)
``````

``````
min_guess, max_guess = np.sort([upper, obs_max])
guesses = []
for _ in range(num_guesses):
    # re-draw the observed sample every iteration so one bad initial
    # draw of rng does not skew all of the guesses
    obs = np.random.randint(1, true_n + 1, num_tanks)
    obs_mean = np.mean(obs)
    obs_std = np.std(obs)
    obs_max = np.max(obs)
    upper = int(obs_mean + 3 * obs_std)
    min_guess, max_guess = np.sort([upper, obs_max])

    sim_data = simulate_data(min_guess, max_guess, num_samples)

    sim_std = np.std(sim_data, axis=1)  # std over the 7 tanks: shape (max_guess - min_guess, num_samples)
    min_std_offset = np.argmin(np.abs(sim_std - obs_std), axis=0)  # closest candidate per simulated sample
    guess = min_guess + min_std_offset

    guesses.append(int(np.mean(guess)))

print("actual:", true_n)
print("guess:", np.mean(guesses))
``````

(1) This combines your data generation with the total number of samples you want, generating all `num_samples * num_tanks` values at once for each candidate value; in this case, 100 columns of 7-tank samples.

(2) It then calculates the standard deviation across each sample (`axis=1`) for all 100 samples.

(3) Next, it finds the value that generated the sample with the closest std; this happens to be the starting value plus whatever offset (index) the smallest difference is at. Random samples are generated from `(1, guess)`, where `guess` ranges over the interval between the observed max and the upper bound (sorted so the smaller comes first).

(4) It stores that value as a guess and repeats 10,000 times.
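
The index bookkeeping in step (3) can be checked on toy numbers (hypothetical values of my own): candidate maxima line up with rows as `min_guess + offset`, so the `argmin` offset recovers the candidate value directly:

```python
import numpy as np

min_guess = 10
cand_stds = np.array([4.0, 2.5, 1.0, 3.0])  # one simulated std per candidate max 10, 11, 12, 13
obs_std = 2.7

offset = np.argmin(np.abs(cand_stds - obs_std))
guess = min_guess + offset
print(guess)  # 11: candidate 11's std (2.5) is closest to 2.7
```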

This completes 10000 guesses for me in

25.7 s ± 1.01 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

If you had the memory for it, you could even add an additional dimension for num_guesses and skip the for loop.

You can also vectorize the way you calculate error in the same way:

``````
def calc_error(guesses):
    diff = np.abs(guesses - true_n)
    perc_error = diff / true_n * 100
    num_correct = sum(guesses == true_n)
    closest_guess = guesses[np.argmin(diff)]

    print("num correct:", num_correct)
    print("avg percent error: %.2f" % np.mean(perc_error))
    print("closest guess was:", closest_guess)
``````

Then you just need to convert your guesses list into a numpy array and call the error function on it:

``````
guesses = np.array(guesses)
calc_error(guesses)
``````

This follows your procedure now, I believe, and from what I can tell it finishes in a fraction of the time of your current solution. #### @PGODULTIMATE 2017-12-11 02:31:19

Why are you randomizing the true number? I think you misunderstood the first OP. I'll clarify it in this post. #### @mochi 2017-12-11 02:33:34

I made it random so you could get more random simulations, that's why it's only initialized once, you can set it to whatever constant N that you want. #### @PGODULTIMATE 2017-12-11 02:46:16

Ok. However, I don't get how you get one result from 10,000 guesses. I want 10,000 results from 10,000 guesses. #### @mochi 2017-12-11 02:48:36

Guesses is a list of num_guesses guesses, which in this case is a list of 10,000 guesses. Where does it look like I'm only getting 1 result? Each iteration of the loop creates 1 guess from num_samples (in this case 100) simulated samples, by finding the max value of the simulated sample with the standard deviation closest to our observed standard deviation. #### @PGODULTIMATE 2017-12-11 02:59:41

My bad with the 10,000 guesses thing. I had forgotten to write the calc_error function. #### @mochi 2017-12-11 03:05:28 #### @PGODULTIMATE 2017-12-12 16:02:36

I think that something is weird here, because the results are too accurate. I know that there should be a 5-6% error. Also, too many of the answers are correct. #### @mochi 2017-12-12 23:54:35

Looks like the samples need to be re-sampled in every iteration, which I guess makes sense, as that makes you less susceptible to a bad initial draw of rng. If samples are resampled every time, I get ~8% error, which is about the same as what I get running your code.

### Drawing the serial numbers

``````
tankSerialNumbers = np.random.randint(1, chosenNum + 1, size=numCapturedTanks)
``````

This is the wrong way to generate the serial numbers, because it is possible that the same serial number is drawn more than once (this is random sampling with replacement):

``````
>>> np.random.randint(1, chosenNum + 1, size=numCapturedTanks)
array([ 50,  10,  71, 244, 394, 375,  10])
``````

You might not notice it, because the pool of possible serial numbers (`chosenNum` = 429) is much larger than the size of the sample you draw (`numCapturedTanks` = 7), and therefore the probability of drawing the same number twice is low¹.
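
A rough way to reproduce that estimate (my own sketch, not the answer's linked source code; the seed and trial count are arbitrary): draw many samples with replacement and count how often a duplicate appears.

```python
import numpy as np

rng = np.random.default_rng(0)
chosenNum, numCapturedTanks, trials = 429, 7, 20_000

# count how many trials contain at least one duplicate serial number
dup = sum(
    len(np.unique(rng.integers(1, chosenNum + 1, numCapturedTanks))) < numCapturedTanks
    for _ in range(trials)
)
print(dup / trials)  # roughly 0.05
```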

But the serial numbers really should be unique; you need random sampling without replacement.

This is done in plain Python using `random.sample`,

``````
random.sample(range(1, chosenNum + 1), numCapturedTanks)
``````

or in NumPy using `np.random.choice` with `replace=False`:

``````
np.random.choice(range(1, chosenNum + 1), numCapturedTanks, replace=False)
``````

This could be further simplified if you chose to use zero-based serial numbers (i.e. for 100 tanks, the serial numbers would be 0, …, 99 instead of 1, …, 100):

``````
random.sample(range(chosenNum), numCapturedTanks)
``````
``````
np.random.choice(chosenNum, numCapturedTanks, replace=False)
``````

¹ Approximately 5%, according to a quick Monte Carlo simulation I've run. (Source code) A closed-form solution seems to be discussed here. #### @Oscar Smith 2017-12-13 00:01:21

That's really annoying, as it makes vectorization way more difficult. Is there a way to make my version of `createListOfStandardDeviations` vectorized, fast, and correct? #### @mkrieger1 2017-12-13 00:57:07

I don't see how my version makes vectorization any more or less difficult than the original version, or how it should affect `createListOfStandardDeviations`. In both cases you get a 1-dimensional array `tankSerialNumbers`; only in my versions are the numbers guaranteed to be unique. #### @Oscar Smith 2017-12-13 01:04:59

It is complicated because np.random.choice does not let you generate a 3d array where each element has a different maximum as I did in my version of `createListOfStandardDeviations`. This ability to vectorize led to much of the speed boost my solution provided. #### @mkrieger1 2017-12-13 01:12:23

Then I seem to not quite understand what you're proposing. Anyway, the "offending" line `tankSerialNumbers = np.random.randint(1, chosenNum + 1, size=numCapturedTanks)` is outside the `createListOfStandardDeviations` function. #### @Oscar Smith 2017-12-13 01:13:34

How can I replace these two lines to pick without replacement: `maxes = np.broadcast_to(np.arange(start+1, end+1), (numCapturedTanks, 100, end - start))` and `tankSerialNumbersSimulated = np.floor(np.random.uniform(1, maxes))`? #### @mkrieger1 2017-12-13 01:23:17

I don't yet understand how it works, but see here or here. #### @Oscar Smith 2017-12-13 01:40:22

That won't work because my code relies on generating different max values for each row. So the first row is in `1:5`, the second is in `1:6` for example. #### @Oscar Smith 2017-12-11 00:53:24

Some profiling at this point shows that almost all of the time is spent in `createListOfStandardDeviations`, so this is the next place to look for performance improvements. The way we can speed this up is by removing the for loop and making numpy do the work for us. In the old code, we calculated one `simulatedSTD` at a time, but it is possible to use a 2-D array and compute all of the simulated stds at once. Here is the updated code that does this.

``````
def createListOfStandardDeviations(start, end):
    end = int(end)
    sim_num = end - start
    maxes = np.broadcast_to(np.arange(start + 1, end + 1), (numCapturedTanks, sim_num))
    tankSerialNumbersSimulated = np.floor(np.random.uniform(1, maxes))
    simulatedSTDs = np.std(tankSerialNumbersSimulated, axis=0)
    STDarray.extend(simulatedSTDs)
``````
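
A quick check on toy sizes (my own illustration, not from the answer): the `axis=0` std over a 2-D draw matches computing each column's std one at a time in a loop:

```python
import numpy as np

rng = np.random.default_rng(1)
draws = rng.integers(1, 50, size=(7, 10))  # 7 "tanks" for each of 10 candidate maxima

vectorized = np.std(draws, axis=0)  # one std per column, all at once
looped = np.array([np.std(draws[:, j]) for j in range(10)])
print(np.allclose(vectorized, looped))  # True
```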

The main piece that is likely confusing here is the use of `np.floor(np.random.uniform(...))` instead of `np.random.randint`. These do the same thing, and the switch is necessary because `randint` takes an int as the max, while `uniform` accepts an ndarray of maxes. This change gives me a 35x speedup from before.
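
A small illustration of that trick (toy shapes of my own choosing): each column gets its own exclusive upper bound, which a single `randint` call cannot do:

```python
import numpy as np

maxes = np.broadcast_to(np.arange(6, 11), (7, 5))  # columns have exclusive maxima 6..10
draws = np.floor(np.random.uniform(1, maxes)).astype(int)

print(draws.shape)            # (7, 5)
print(draws.min() >= 1)       # True
print((draws < maxes).all())  # True: each column stays below its own bound
```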

The next speedup we get is from making `STDarray` into an `ndarray`, and while we're at it, making it no longer a global variable. To do this, we make the last line of `createListOfStandardDeviations` be `return np.std(tankSerialNumbersSimulated, axis=0)`, store it in main, and pass it into `getNumGuessed`. This change lets `getNumGuessed` become simply `return maxNumber + np.argmin(np.abs(STDarray-givenSTD), axis=1)` which is simpler and faster. This is a smaller boost than before, but it still gives about a 2x speedup.

At this point `createListOfStandardDeviations` is the slow point again, and the solution is yet more vectorization. At this point we are running this function in a loop 100 times, but it is much faster to let numpy do this. All we have to do is change the function to

``````
def createListOfStandardDeviations(start, end):
    maxes = np.broadcast_to(np.arange(start + 1, end + 1), (numCapturedTanks, 100, end - start))
    tankSerialNumbersSimulated = np.floor(np.random.uniform(1, maxes))
    return np.std(tankSerialNumbersSimulated, axis=0)
``````

and it will just work. It is important that `getNumGuessed` has the `axis=0` in the `argmin` call or else it will not work properly. This gives another 2x speedup. Here is the `main` function after removing the loop.

``````
def main():
    for runsRan in range(numRuns):
        tankSerialNumbers = np.random.randint(1, chosenNum + 1, size=numCapturedTanks)  # generates initial tank serial numbers
        NumSTD = np.std(tankSerialNumbers)
        highestTankSerial = np.mean(tankSerialNumbers) + 3 * NumSTD
        maxNum = np.amax(tankSerialNumbers)
        ListOfStandardDeviationsStartTime = time.time()
        start, end = min(maxNum, highestTankSerial), int(max(maxNum, highestTankSerial))

        STDarray = createListOfStandardDeviations(start, end)
        numGuessN.extend(getNumGuessed(STDarray, NumSTD, maxNum))

        guess = int(np.mean(numGuessN))  # store actual guess
        print("Guess Generated " + str(runsRan + 1))
        guesses.append(guess)  # add guesses to a list
    getAllGuesses()
    getAvgPercentError()
``````

The last speedup worth bothering with is making `numGuessN` an `ndarray`. This will make finding its mean much faster, and it is easy enough to do. The changes needed are changing the declaration to `numGuessN = np.array([], dtype=np.int64)` and changing the `extend` call to

``````
global numGuessN
numGuessN = np.concatenate((numGuessN, getNumGuessed(STDarray, NumSTD, maxNum)))
``````

This yields another 2x speedup for me. At this point, we have gotten well over a 200x speedup over the original, and almost all of the time is spent either calculating `std` or `mean` or generating random numbers. It is possible that another factor of 2 could be gained from here, but it would be a pain, and it would probably require more effort than it's worth. #### @PGODULTIMATE 2017-12-11 01:11:12

This was a great improvement! The program took 10.22 minutes for 10,000 guesses. #### @Oscar Smith 2017-12-11 01:13:54

There is still plenty of room for speedup. `getNumGuessed` is still quite slow for similar reasons. I'll update this answer once I have a good solution. #### @PGODULTIMATE 2017-12-11 01:33:27

Since STDarray is no longer a global variable, what do I put for `STDarray.extend` in the `createListOfStandardDeviations(start,end)`? #### @Oscar Smith 2017-12-11 01:36:58

At this point, the place to gain speed is by in-lining `createListOfStandardDeviations` and removing the loop around it through more vectorization. #### @PGODULTIMATE 2017-12-11 01:40:27

I have an error saying that `STDarray` is not defined in `createListOfStandardDeviations(start,end)`. I do not know what to put there, since we are defining STDarray from this function, and I have no idea what we would be extending in that case. #### @Oscar Smith 2017-12-11 01:52:36

Does this make it more clear? #### @PGODULTIMATE 2017-12-11 01:57:26

Yes, I made the changes as you specified but am getting an error: `TypeError: 'numpy.int64' object is not iterable`. This is in reference to the `numGuessN.extend(getNumGuessed(STDarray, NumSTD, maxNum))` in line 81 #### @PGODULTIMATE 2017-12-11 02:03:44

Fixed the problem. #### @PGODULTIMATE 2017-12-11 04:02:18

Your method takes about 11.72 minutes to run for 10,000 guesses. This is weird, because 1,000 takes under a minute. #### @Oscar Smith 2017-12-11 04:04:54

I think that nonlinearity is true of yours as well. `numGuessN` never gets reset, so it grows between iterations. #### @PGODULTIMATE 2017-12-11 04:40:26

I noticed another weird thing: the guesses start out how they should (quite a bit of variability), but they progressively get closer and closer to the actual number, indicating that each guess is being influenced by the previous guess. All guesses should be independent of one another. Another weird thing is that the accuracy is insanely good at around a sample size of 100, but gets extremely off otherwise. I would think that as sample size increases, accuracy would increase, because there would be more values to draw from. #### @Oscar Smith 2017-12-11 05:01:58

Problem fixed: `getNumGuessed` needed to use `axis=1`, not `axis=0`. #### @PGODULTIMATE 2017-12-11 05:10:16

Yes, it is working normally now (the average percent error has become consistent).